docs(runbook): netbird-client mesh-drop / DNS troubleshooting

Document the 2026-06-18 incident class: a road-warrior laptop that loses DNS on a network transition strands NetBird (it can't resolve the coordinator FQDN), leaving ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and an explanation of why ubongo is relay-only (deferred mesh-hardening, not a bug), including the break-glass rule that hardening must preserve.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-18 22:30:41 +02:00
parent a23ecd708d
commit 77a20b8d40


@@ -67,6 +67,68 @@ the SSH key provides auth.
---
## Troubleshooting — mesh drops / SSH to `ubongo` times out
Symptom: SSH to `ubongo` (or any peer) times out for minutes and recovers on its own;
`netbird status` shows **Management/Signal: Disconnected** or peers stuck **Connecting**.
verified: client DNS/relay behaviour + NRPT scope read from a 0.72.4 debug bundle;
mitigations per docs.netbird.io (`/manage/dns/troubleshooting`,
`/help/troubleshooting-client`) · 2026-06-18
**1. Triage — is it your device or the coordinator?** On the device:
```sh
netbird status -d # Management/Signal Connected? peers P2P/Relayed?
nslookup netbird.askari.wingu.me # coordinator FQDN
nslookup pkgs.netbird.io # a PUBLIC name — control test
```
If the relay/handshake errors say `lookup netbird.askari.wingu.me: no such host` **and**
a *public* name (`pkgs.netbird.io`) also fails to resolve, your **local resolver is
dead** — the coordinator and `ubongo` are almost certainly fine. NetBird only manages
`*.netbird.selfhosted` resolution (a single NRPT rule), so it is **not** the cause.
Confirm from the other side if you can: the dashboard shows peer *last-seen*; `askari`/
`ubongo` staying green ⇒ the fault is your device's network.
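The triage rule above can be written down as a small decision table. This is a minimal sketch, not part of the runbook's tooling: `classify_dns` and `triage_dns` are hypothetical helper names, and the verdicts for the two cases the text does not spell out (only the coordinator name failing, or only the public name failing) are assumptions rather than something verified against the incident. It also assumes an `nslookup` that exits non-zero on a failed lookup, as the BIND one does; check that on your platform before relying on it.

```sh
# Hypothetical helpers; names and verdict strings are illustrative only.

# classify_dns COORD_RC PUBLIC_RC
# Each argument is the exit status of a lookup (0 = resolved).
classify_dns() {
    coord_rc=$1
    public_rc=$2
    if [ "$coord_rc" -ne 0 ] && [ "$public_rc" -ne 0 ]; then
        echo "local-resolver-dead"      # the case described in the runbook
    elif [ "$coord_rc" -ne 0 ]; then
        echo "coordinator-name-only"    # assumption: suspect the coordinator's DNS record
    elif [ "$public_rc" -ne 0 ]; then
        echo "public-name-only"         # assumption: partial DNS failure, investigate further
    else
        echo "dns-ok"                   # DNS resolves; continue with netbird status -d
    fi
}

# triage_dns: run both lookups from step 1 and classify the result.
triage_dns() {
    nslookup netbird.askari.wingu.me >/dev/null 2>&1; coord_rc=$?
    nslookup pkgs.netbird.io >/dev/null 2>&1; public_rc=$?
    classify_dns "$coord_rc" "$public_rc"
}
```

Calling `triage_dns` on the affected device prints a single verdict. Keeping the classification separate from the lookups means the rule can be checked without a network.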
**Why it cascades:** NetBird re-resolves the coordinator FQDN on every reconnect. A
network transition (Wi-Fi ↔ phone hotspot, sleep/wake) that briefly kills DNS means it
can't reach management/signal/relay — and since `ubongo` is **relay-only** (below), there
is no direct path to fall back to, so SSH dies until DNS recovers.
**2. Make the device resilient:**
- **Reliable resolvers** — set the device's DNS to public resolvers (`1.1.1.1`, `8.8.8.8`)
rather than a network-handed or homelab-internal resolver that's unreachable off-LAN.
Windows: inspect with `Get-DnsClientServerAddress`.
- **Pin the coordinator** so a DNS hiccup can't strand the client — add to the hosts file
(`C:\Windows\System32\drivers\etc\hosts` as admin, or `/etc/hosts`):
```
77.42.120.136 netbird.askari.wingu.me
```
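This is `askari`'s stable WAN IP; TLS still validates against the hostname, so the pin does not weaken certificate checks. It removes the multi-minute reconnect deadlocks, but the entry must be updated by hand if that IP ever changes.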
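For Linux or macOS devices, the hosts-file pin can be applied idempotently. This is a sketch under stated assumptions: `pin_host` is a hypothetical helper, it only checks whether the name is already present on an uncommented line, and it does not detect an existing entry that points at a stale IP.

```sh
# pin_host HOSTS_FILE IP NAME  (hypothetical helper)
# Appends "IP NAME" unless the file already has an uncommented entry for NAME.
pin_host() {
    hosts_file=$1
    ip=$2
    name=$3
    # Look for NAME as a whole field on a non-comment line,
    # ignoring anything after an inline "#".
    if awk -v n="$name" '
        /^[[:space:]]*#/ { next }
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break
                if ($i == n) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$hosts_file"; then
        echo "already pinned: $name"
    else
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
        echo "pinned: $ip $name"
    fi
}
```

Run as root against the real file, for example `pin_host /etc/hosts 77.42.120.136 netbird.askari.wingu.me`. Running it twice leaves a single entry.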
**3. Break-glass — reach `ubongo` without the mesh.** When the mesh is down you still need
a way in. On the home LAN, go straight to `ubongo`'s wired address (bypasses the mesh and
coordinator DNS entirely):
```sh
ssh sjat@10.20.10.151 # ubongo eno1 (LAN) — verify this works from your device NOW
```
> ⚠️ This works **today** only because `ubongo`'s host-firewall default-deny is not yet
> applied. When the deferred mesh-hardening lands (SSH only on `wt0`), this path closes
> unless a break-glass SSH rule is added to the firewall catalog. That hardening **must**
> keep a non-mesh break-glass (catalog SSH rule from a trusted LAN/admin source) — else a
> DNS/mesh outage = full lockout. (ADR-021 break-glass.)
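Because the break-glass path can close silently once the firewall hardening lands, it is worth probing it periodically rather than discovering the lockout during an outage. The following is a minimal sketch: `probe_break_glass` is a hypothetical helper, and it assumes key-based login is already set up for `sjat` on the LAN address.

```sh
# probe_break_glass [TARGET]  (hypothetical helper)
# Succeeds only if a key-based, non-interactive SSH login to the LAN address works.
probe_break_glass() {
    target=${1:-sjat@10.20.10.151}
    # BatchMode=yes: fail instead of prompting for a password.
    # ConnectTimeout=5: give up after 5 seconds if the port is filtered.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true; then
        echo "break-glass OK: $target"
    else
        echo "break-glass FAILED: $target" >&2
        return 1
    fi
}
```

Run `probe_break_glass` from the home LAN with the mesh up, so a failure is noticed while there is still another way in.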
**Why `ubongo` is relay-only (and P2P is not the fix).** Peers connect to `ubongo` as
`Relayed`, never `P2P`: its `nftables` default-deny drops the inbound UDP that ICE
hole-punching needs (egress is open, so STUN itself succeeds). This is the **intended
current posture** — P2P / NAT-traversal is the *deferred mesh-hardening* (ADR-016/020,
STATUS.md). Enabling it needs a firewall-catalog UDP entry **plus** an `accepted-risks.md`
deviation or ADR amendment, and OPNsense NAT work — and it would **not** have prevented a
DNS-driven outage (a re-handshake still needs signal, which needs DNS). Tracked as future
hardening, not a quick fix.
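To confirm the relay-only posture from a client, the per-peer connection types can be tallied from the detailed status output. This is a sketch only: `count_conn_types` is a hypothetical helper, and it assumes each peer block in `netbird status -d` contains a line of the form `Connection type: Relayed` or `Connection type: P2P`. Adjust the patterns if your client version prints something different.

```sh
# count_conn_types  (hypothetical helper)
# Reads `netbird status -d` output on stdin and prints "relayed=N p2p=M".
# Assumption: peers are reported with "Connection type: Relayed" or
# "Connection type: P2P" lines.
count_conn_types() {
    awk '
        /Connection type:[[:space:]]*Relayed/ { relayed++ }
        /Connection type:[[:space:]]*P2P/     { p2p++ }
        END { printf "relayed=%d p2p=%d\n", relayed, p2p }
    '
}
```

Usage: `netbird status -d | count_conn_types`. Under the current posture, the connection to `ubongo` is expected to count towards `relayed`.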
---
## Notes
- **Split-tunnel:** NetBird routes only the `100.x` overlay by default — normal/work
@@ -76,7 +138,7 @@ the SSH key provides auth.
 - **Troubleshooting** *"failed while getting Management Service public key"* / won't
 register: confirm `https://netbird.askari.wingu.me` loads in a browser from the device
 (DNS + TLS + the gRPC routing through Caddy are reachable), the URL is exact, and the
-terminal is elevated. If a peer shows Disconnected, NAT traversal is falling back to the
-relay (over 443) — usually transient.
+terminal is elevated. For peers stuck Disconnected/Connecting or SSH-to-`ubongo`
+timeouts that recover on their own, see **Troubleshooting — mesh drops** above.
 - **Removing a device:** `netbird down` then uninstall; revoke its peer in the dashboard
 (and the setup key if one-off).