docs(runbook): netbird-client mesh-drop / DNS troubleshooting

Document the 2026-06-18 incident class: a road-warrior laptop losing DNS on a network transition strands NetBird (can't resolve the coordinator FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and why ubongo is relay-only (deferred mesh-hardening, not a bug) — including the break-glass rule that hardening must preserve.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent a23ecd708d
commit 77a20b8d40
1 changed files with 64 additions and 2 deletions
@@ -67,6 +67,68 @@ the SSH key provides auth.
---

## Troubleshooting — mesh drops / SSH to `ubongo` times out

Symptom: SSH to `ubongo` (or any peer) times out for minutes and recovers on its own;
`netbird status` shows **Management/Signal: Disconnected** or peers stuck **Connecting**.

verified: client DNS/relay behaviour + NRPT scope read from a 0.72.4 debug bundle;
mitigations per docs.netbird.io (`/manage/dns/troubleshooting`,
`/help/troubleshooting-client`) · 2026-06-18

**1. Triage — is it your device or the coordinator?** On the device:
```sh
netbird status -d                  # Management/Signal Connected? peers P2P/Relayed?
nslookup netbird.askari.wingu.me   # coordinator FQDN
nslookup pkgs.netbird.io           # a PUBLIC name — control test
```
If the relay/handshake errors say `lookup netbird.askari.wingu.me: no such host` **and**
a *public* name (`pkgs.netbird.io`) also fails to resolve, your **local resolver is
dead** — the coordinator and `ubongo` are almost certainly fine. NetBird only manages
`*.netbird.selfhosted` resolution (a single NRPT rule), so it is **not** the cause.
Confirm from the other side if you can: the dashboard shows peer *last-seen*; `askari`/
`ubongo` staying green ⇒ the fault is your device's network.
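
The decision in step 1 can be written down as a small shell function. This is only a sketch: `classify_dns` is a hypothetical helper (not part of NetBird), and it assumes `nslookup` exits non-zero when a lookup fails, which may not hold for every implementation (Windows `nslookup` in particular).

```sh
# classify_dns: hypothetical helper, not a NetBird command.
# $1 = exit status of the coordinator lookup, $2 = exit status of the public lookup.
classify_dns() {
  coordinator_rc=$1
  public_rc=$2
  if [ "$coordinator_rc" -eq 0 ]; then
    echo "dns-ok"            # coordinator resolves; DNS is not the problem
  elif [ "$public_rc" -ne 0 ]; then
    echo "local-resolver"    # nothing resolves: the device's resolver is dead
  else
    echo "coordinator-name"  # public DNS works, only the coordinator FQDN fails
  fi
}

# Usage on the device (assumes a non-zero exit status on lookup failure):
#   nslookup netbird.askari.wingu.me >/dev/null 2>&1; c=$?
#   nslookup pkgs.netbird.io         >/dev/null 2>&1; p=$?
#   classify_dns "$c" "$p"
```

A `local-resolver` result corresponds to the case described above, where the coordinator and `ubongo` are most likely fine and the fix is on the device.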

**Why it cascades:** NetBird re-resolves the coordinator FQDN on every reconnect. A
network transition (Wi-Fi ↔ phone hotspot, sleep/wake) that briefly kills DNS means it
can't reach management/signal/relay — and since `ubongo` is **relay-only** (below), there
is no direct path to fall back to, so SSH dies until DNS recovers.

**2. Make the device resilient:**
- **Reliable resolvers** — set the device's DNS to public resolvers (`1.1.1.1`, `8.8.8.8`)
rather than a network-handed or homelab-internal resolver that's unreachable off-LAN.
Windows: inspect with `Get-DnsClientServerAddress`.
- **Pin the coordinator** so a DNS hiccup can't strand the client — add to the hosts file
(`C:\Windows\System32\drivers\etc\hosts` as admin, or `/etc/hosts`):
```
77.42.120.136 netbird.askari.wingu.me
```
`askari`'s stable WAN IP; TLS still validates on the hostname. Removes the multi-minute
reconnect deadlocks.
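
To confirm the pin is actually in place, something like the following could be run against `/etc/hosts`. `has_pin` is a hypothetical helper name; the sketch assumes a POSIX shell with `awk` and LF line endings, so it is not meant for the Windows hosts file as-is.

```sh
# has_pin: hypothetical helper. Succeeds if the hosts file maps NAME to IP
# on a non-comment line.
# Usage: has_pin /etc/hosts 77.42.120.136 netbird.askari.wingu.me
has_pin() {
  hosts_file=$1
  ip=$2
  name=$3
  awk -v ip="$ip" -v name="$name" '
    /^[[:space:]]*#/ { next }          # skip full-line comments
    $1 == ip {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break           # stop at an inline comment
        if ($i == name) found = 1
      }
    }
    END { exit !found }                # exit 0 only when the pin was seen
  ' "$hosts_file"
}
```

For example, `has_pin /etc/hosts 77.42.120.136 netbird.askari.wingu.me || echo "coordinator not pinned"` would flag a device that still depends on DNS to find the coordinator.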

**3. Break-glass — reach `ubongo` without the mesh.** When the mesh is down you still need
a way in. On the home LAN, go straight to `ubongo`'s wired address (bypasses the mesh and
coordinator DNS entirely):
```sh
ssh sjat@10.20.10.151   # ubongo eno1 (LAN) — verify this works from your device NOW
```
> ⚠️ This works **today** only because `ubongo`'s host-firewall default-deny is not yet
> applied. When the deferred mesh-hardening lands (SSH only on `wt0`), this path closes
> unless a break-glass SSH rule is added to the firewall catalog. That hardening **must**
> keep a non-mesh break-glass (catalog SSH rule from a trusted LAN/admin source) — else a
> DNS/mesh outage = full lockout. (ADR-021 break-glass.)
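
Since the break-glass path is only useful if it is known to work, a quick non-interactive probe is one way to check it ahead of time. `check_break_glass` is a hypothetical wrapper; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options, and the 5-second timeout is an arbitrary choice.

```sh
# check_break_glass: hypothetical helper. Succeeds only if key-based SSH to the
# LAN address works without prompting.
check_break_glass() {
  target=${1:-sjat@10.20.10.151}   # ubongo eno1 (LAN) address from this runbook
  # BatchMode=yes fails instead of asking for a password;
  # ConnectTimeout=5 bounds the connection attempt to 5 seconds.
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true
}

# Usage, from a device on the home LAN:
#   check_break_glass && echo "LAN break-glass OK" || echo "LAN break-glass FAILED"
```

A failure here while the mesh is still up is the cheap moment to find out that the LAN path has closed, for example after the firewall hardening described above.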

**Why `ubongo` is relay-only (and P2P is not the fix).** Peers connect to `ubongo` as
`Relayed`, never `P2P`: its `nftables` default-deny drops the inbound UDP that ICE
hole-punching needs (egress is open, so STUN itself succeeds). This is the **intended
current posture** — P2P / NAT-traversal is the *deferred mesh-hardening* (ADR-016/020,
STATUS.md). Enabling it needs a firewall-catalog UDP entry **plus** an `accepted-risks.md`
deviation or ADR amendment, and OPNsense NAT work — and it would **not** have prevented a
DNS-driven outage (a re-handshake still needs signal, which needs DNS). Tracked as future
hardening, not a quick fix.

---

## Notes

- **Split-tunnel:** NetBird routes only the `100.x` overlay by default — normal/work
@@ -76,7 +138,7 @@ the SSH key provides auth.
 - **Troubleshooting** — *"failed while getting Management Service public key"* / won't
 register: confirm `https://netbird.askari.wingu.me` loads in a browser from the device
 (DNS + TLS + the gRPC routing through Caddy are reachable), the URL is exact, and the
-terminal is elevated. If a peer shows Disconnected, NAT traversal is falling back to the
-relay (over 443) — usually transient.
+terminal is elevated. For peers stuck Disconnected/Connecting or SSH-to-`ubongo`
+timeouts that recover on their own, see **Troubleshooting — mesh drops** above.
 - **Removing a device:** `netbird down` then uninstall; revoke its peer in the dashboard
 (and the setup key if one-off).