From 77a20b8d40dc422e118378ed164db877e2b9ccf5 Mon Sep 17 00:00:00 2001
From: sjat
Date: Thu, 18 Jun 2026 22:30:41 +0200
Subject: [PATCH] docs(runbook): netbird-client mesh-drop / DNS troubleshooting
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Document the 2026-06-18 incident class: a road-warrior laptop losing DNS
on a network transition strands NetBird (can't resolve the coordinator
FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local
DNS vs coordinator), device mitigations (reliable resolvers + hosts-file
pin), the non-mesh LAN break-glass to ubongo, and why ubongo is
relay-only (deferred mesh-hardening, not a bug) — including the
break-glass rule that hardening must preserve.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/runbooks/netbird-client.md | 66 ++++++++++++++++++++++++++++++++-
 1 file changed, 64 insertions(+), 2 deletions(-)

diff --git a/docs/runbooks/netbird-client.md b/docs/runbooks/netbird-client.md
index c7e2362..50fdba8 100644
--- a/docs/runbooks/netbird-client.md
+++ b/docs/runbooks/netbird-client.md
@@ -67,6 +67,68 @@ the SSH key provides auth.
 
 ---
 
+## Troubleshooting — mesh drops / SSH to `ubongo` times out
+
+Symptom: SSH to `ubongo` (or any peer) times out for minutes and recovers on its own;
+`netbird status` shows **Management/Signal: Disconnected** or peers stuck **Connecting**.
+
+verified: client DNS/relay behaviour + NRPT scope read from a 0.72.4 debug bundle;
+mitigations per docs.netbird.io (`/manage/dns/troubleshooting`,
+`/help/troubleshooting-client`) · 2026-06-18
+
+**1. Triage — is it your device or the coordinator?** On the device:
+```sh
+netbird status -d                 # Management/Signal Connected? peers P2P/Relayed?
+nslookup netbird.askari.wingu.me  # coordinator FQDN
+nslookup pkgs.netbird.io          # a PUBLIC name — control test
+```
+If the relay/handshake errors say `lookup netbird.askari.wingu.me: no such host` **and**
+a *public* name (`pkgs.netbird.io`) also fails to resolve, your **local resolver is
+dead** — the coordinator and `ubongo` are almost certainly fine. NetBird only manages
+`*.netbird.selfhosted` resolution (a single NRPT rule), so it is **not** the cause.
+Confirm from the other side if you can: the dashboard shows peer *last-seen*; `askari`/
+`ubongo` staying green ⇒ the fault is your device's network.
+
+**Why it cascades:** NetBird re-resolves the coordinator FQDN on every reconnect. A
+network transition (Wi-Fi ↔ phone hotspot, sleep/wake) that briefly kills DNS means it
+can't reach management/signal/relay — and since `ubongo` is **relay-only** (below), there
+is no direct path to fall back to, so SSH dies until DNS recovers.
+
+**2. Make the device resilient:**
+- **Reliable resolvers** — set the device's DNS to public resolvers (`1.1.1.1`, `8.8.8.8`)
+  rather than a network-handed or homelab-internal resolver that's unreachable off-LAN.
+  Windows: inspect with `Get-DnsClientServerAddress`.
+- **Pin the coordinator** so a DNS hiccup can't strand the client — add to the hosts file
+  (`C:\Windows\System32\drivers\etc\hosts` as admin, or `/etc/hosts`):
+  ```
+  77.42.120.136 netbird.askari.wingu.me
+  ```
+  `askari`'s stable WAN IP; TLS still validates on the hostname. Removes the multi-minute
+  reconnect deadlocks.
+
+**3. Break-glass — reach `ubongo` without the mesh.** When the mesh is down you still need
+a way in. On the home LAN, go straight to `ubongo`'s wired address (bypasses the mesh and
+coordinator DNS entirely):
+```sh
+ssh sjat@10.20.10.151   # ubongo eno1 (LAN) — verify this works from your device NOW
+```
+> ⚠️ This works **today** only because `ubongo`'s host-firewall default-deny is not yet
+> applied. When the deferred mesh-hardening lands (SSH only on `wt0`), this path closes
+> unless a break-glass SSH rule is added to the firewall catalog. That hardening **must**
+> keep a non-mesh break-glass (catalog SSH rule from a trusted LAN/admin source) — else a
+> DNS/mesh outage = full lockout. (ADR-021 break-glass.)
+
+**Why `ubongo` is relay-only (and P2P is not the fix).** Peers connect to `ubongo` as
+`Relayed`, never `P2P`: its `nftables` default-deny drops the inbound UDP that ICE
+hole-punching needs (egress is open, so STUN itself succeeds). This is the **intended
+current posture** — P2P / NAT-traversal is the *deferred mesh-hardening* (ADR-016/020,
+STATUS.md). Enabling it needs a firewall-catalog UDP entry **plus** an `accepted-risks.md`
+deviation or ADR amendment, and OPNsense NAT work — and it would **not** have prevented a
+DNS-driven outage (a re-handshake still needs signal, which needs DNS). Tracked as future
+hardening, not a quick fix.
+
+---
+
 ## Notes
 
 - **Split-tunnel:** NetBird routes only the `100.x` overlay by default — normal/work
@@ -76,7 +138,7 @@ the SSH key provides auth.
 - **Troubleshooting** — *"failed while getting Management Service public key"* / won't
   register: confirm `https://netbird.askari.wingu.me` loads in a browser from the device
   (DNS + TLS + the gRPC routing through Caddy are reachable), the URL is exact, and the
-  terminal is elevated. If a peer shows Disconnected, NAT traversal is falling back to the
-  relay (over 443) — usually transient.
+  terminal is elevated. For peers stuck Disconnected/Connecting or SSH-to-`ubongo`
+  timeouts that recover on their own, see **Troubleshooting — mesh drops** above.
 - **Removing a device:** `netbird down` then uninstall; revoke its peer in the dashboard
   (and the setup key if one-off).