docs(runbook): netbird-client mesh-drop / DNS troubleshooting

Document the 2026-06-18 incident class: a road-warrior laptop losing DNS on a network transition strands NetBird (can't resolve the coordinator FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and why ubongo is relay-only (deferred mesh-hardening, not a bug), including the break-glass rule that hardening must preserve.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 22:30:41 +02:00 · 2026-06-18 22:30:41 +02:00 · 77a20b8d40
commit 77a20b8d40
parent a23ecd708d
1 changed files with 64 additions and 2 deletions
--- a/docs/runbooks/netbird-client.md
+++ b/docs/runbooks/netbird-client.md
@@ -67,6 +67,68 @@ the SSH key provides auth.
 ---
 ## Troubleshooting — mesh drops / SSH to `ubongo` times out
 Symptom: SSH to `ubongo` (or any peer) times out for minutes and recovers on its own;
 `netbird status` shows **Management/Signal: Disconnected** or peers stuck **Connecting**.
 verified: client DNS/relay behaviour + NRPT scope read from a 0.72.4 debug bundle;
 mitigations per docs.netbird.io (`/manage/dns/troubleshooting`,
 `/help/troubleshooting-client`) · 2026-06-18
 **1. Triage — is it your device or the coordinator?** On the device:
 ```sh
 netbird status -d                     # Management/Signal Connected? peers P2P/Relayed?
 nslookup netbird.askari.wingu.me      # coordinator FQDN
 nslookup pkgs.netbird.io              # a PUBLIC name — control test
 ```
 If the relay/handshake errors say `lookup netbird.askari.wingu.me: no such host` **and**
 a *public* name (`pkgs.netbird.io`) also fails to resolve, your **local resolver is
 dead** — the coordinator and `ubongo` are almost certainly fine. NetBird only manages
 `*.netbird.selfhosted` resolution (a single NRPT rule), so it is **not** the cause.
 Confirm from the other side if you can: the dashboard shows peer *last-seen*; `askari`/
 `ubongo` staying green ⇒ the fault is your device's network.
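 The decision above can be sketched as a small helper, assuming a POSIX shell and an `nslookup` that exits non-zero when a lookup fails (typical of recent BIND builds, not of Windows `nslookup`). The function names and verdict labels are hypothetical; only the both-fail case comes from this runbook, the other verdicts are inferred.
 ```sh
 # classify_dns COORD_RC PUBLIC_RC: map two lookup exit codes to a verdict.
 classify_dns() {
   coord_rc=$1
   public_rc=$2
   if [ "$coord_rc" -eq 0 ] && [ "$public_rc" -eq 0 ]; then
     echo "dns-ok"                      # DNS is fine; look at the mesh itself
   elif [ "$coord_rc" -ne 0 ] && [ "$public_rc" -ne 0 ]; then
     echo "local-resolver-dead"         # the case described above
   elif [ "$coord_rc" -ne 0 ]; then
     echo "coordinator-record-problem"  # inferred: only the coordinator name fails
   else
     echo "public-only-failure"         # inferred: unusual, inspect the resolver
   fi
 }
 # resolves NAME: exit 0 if nslookup reports success (assumption, see lead-in).
 resolves() { nslookup "$1" >/dev/null 2>&1; }
 # Usage on the device:
 #   resolves netbird.askari.wingu.me; c=$?
 #   resolves pkgs.netbird.io; p=$?
 #   classify_dns "$c" "$p"
 ```
 Keeping the classification separate from the lookups means the same logic can be fed exit codes from whichever resolver tool the device actually has.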
 **Why it cascades:** NetBird re-resolves the coordinator FQDN on every reconnect. A
 network transition (Wi-Fi ↔ phone hotspot, sleep/wake) that briefly kills DNS means it
 can't reach management/signal/relay — and since `ubongo` is **relay-only** (below), there
 is no direct path to fall back to, so SSH dies until DNS recovers.
 **2. Make the device resilient:**
 - **Reliable resolvers** — set the device's DNS to public resolvers (`1.1.1.1`, `8.8.8.8`)
  rather than a network-handed or homelab-internal resolver that's unreachable off-LAN.
  Windows: inspect with `Get-DnsClientServerAddress`.
 - **Pin the coordinator** so a DNS hiccup can't strand the client — add to the hosts file
  (`C:\Windows\System32\drivers\etc\hosts` as admin, or `/etc/hosts`):
  ```
  77.42.120.136  netbird.askari.wingu.me
  ```
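  This is `askari`'s stable WAN IP; TLS still validates against the hostname, so the pin
  does not weaken certificate checks. It removes the multi-minute reconnect deadlocks, but
  must be updated if that IP ever changes.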
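 To confirm the pin is actually in place, a check along these lines may help. It is a sketch assuming a POSIX shell with `awk`; `has_pin` is a hypothetical helper, not part of NetBird.
 ```sh
 # has_pin FILE IP HOST: exit 0 if FILE maps HOST to IP on a non-comment line.
 has_pin() {
   awk -v ip="$2" -v host="$3" '
     { sub(/#.*/, "") }   # strip comments so a commented-out pin does not count
     $1 == ip { for (i = 2; i <= NF; i++) if ($i == host) found = 1 }
     END { exit !found }
   ' "$1"
 }
 # Usage:
 #   has_pin /etc/hosts 77.42.120.136 netbird.askari.wingu.me || echo "pin missing"
 ```
 On Windows the same file lives at the path given above; the check itself needs a POSIX-style shell (for example under WSL or Git Bash).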
 **3. Break-glass — reach `ubongo` without the mesh.** When the mesh is down you still need
 a way in. On the home LAN, go straight to `ubongo`'s wired address (bypasses the mesh and
 coordinator DNS entirely):
 ```sh
 ssh sjat@10.20.10.151        # ubongo eno1 (LAN) — verify this works from your device NOW
 ```
 > ⚠️ This works **today** only because `ubongo`'s host-firewall default-deny is not yet
 > applied. When the deferred mesh-hardening lands (SSH only on `wt0`), this path closes
 > unless a break-glass SSH rule is added to the firewall catalog. That hardening **must**
 > keep a non-mesh break-glass (catalog SSH rule from a trusted LAN/admin source) — else a
 > DNS/mesh outage = full lockout. (ADR-021 break-glass.)
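 Since the break-glass path can close silently once hardening lands, it is worth probing it non-interactively from time to time. A minimal sketch, assuming OpenSSH's `BatchMode` and `ConnectTimeout` options; `check_breakglass` and the `SSH` override variable are hypothetical names used here for illustration.
 ```sh
 # check_breakglass [TARGET]: key-only SSH probe with a short timeout.
 check_breakglass() {
   target=${1:-sjat@10.20.10.151}
   # BatchMode=yes fails instead of prompting; ConnectTimeout bounds the wait.
   if ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$target" true; then
     echo "break-glass OK: $target"
   else
     echo "break-glass FAILED: $target" >&2
     return 1
   fi
 }
 # Usage (on the home LAN): check_breakglass
 ```
 A failure here while on the home LAN suggests the LAN SSH path has been closed or the key is not loaded, which is better discovered before the mesh is down.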
 **Why `ubongo` is relay-only (and P2P is not the fix).** Peers connect to `ubongo` as
 `Relayed`, never `P2P`: its `nftables` default-deny drops the inbound UDP that ICE
 hole-punching needs (egress is open, so STUN itself succeeds). This is the **intended
 current posture** — P2P / NAT-traversal is the *deferred mesh-hardening* (ADR-016/020,
 STATUS.md). Enabling it needs a firewall-catalog UDP entry **plus** an `accepted-risks.md`
 deviation or ADR amendment, and OPNsense NAT work — and it would **not** have prevented a
 DNS-driven outage (a re-handshake still needs signal, which needs DNS). Tracked as future
 hardening, not a quick fix.
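 To see the relay-only posture from a client, the peer connection types can be tallied from the status output. This sketch assumes `netbird status -d` prints a `Connection type: P2P` or `Connection type: Relayed` line per peer; that label is an assumption about the client's output format, and `count_conn_types` is a hypothetical helper.
 ```sh
 # count_conn_types: read `netbird status -d` output on stdin, tally peer types.
 count_conn_types() {
   awk -F': *' '
     /Connection type:/ { n[$2]++ }
     END { printf "P2P=%d Relayed=%d\n", n["P2P"], n["Relayed"] }
   '
 }
 # Usage: netbird status -d | count_conn_types
 ```
 Under the posture described above, the entry for `ubongo` is expected to count towards `Relayed`.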
 ---
 ## Notes
 - **Split-tunnel:** NetBird routes only the `100.x` overlay by default — normal/work
@@ -76,7 +138,7 @@ the SSH key provides auth.
 - **Troubleshooting** — *"failed while getting Management Service public key"* / won't
  register: confirm `https://netbird.askari.wingu.me` loads in a browser from the device
  (DNS + TLS + the gRPC routing through Caddy are reachable), the URL is exact, and the
-  terminal is elevated. If a peer shows Disconnected, NAT traversal is falling back to the
-  relay (over 443) — usually transient.
+  terminal is elevated. For peers stuck Disconnected/Connecting or SSH-to-`ubongo`
+  timeouts that recover on their own, see **Troubleshooting — mesh drops** above.
 - **Removing a device:** `netbird down` then uninstall; revoke its peer in the dashboard
  (and the setup key if one-off).