boma/docs/runbooks/netbird-client.md
sjat 77a20b8d40 docs(runbook): netbird-client mesh-drop / DNS troubleshooting
Document the 2026-06-18 incident class: a road-warrior laptop losing DNS on a network transition strands NetBird (can't resolve the coordinator FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and why ubongo is relay-only (deferred mesh-hardening, not a bug) — including the break-glass rule that hardening must preserve.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 22:30:41 +02:00


Runbook — Enrolling a NetBird client (road-warrior device)

Joins a client/road-warrior device (laptop, desktop, phone) to the boma NetBird mesh so it can reach ubongo and other peers from anywhere. The self-hosted coordinator is on askari (ADR-016, M4b); enrollment lands a device on the 100.64.0.0/10 overlay.

Hosts vs clients. Managed Linux hosts join via the base role's mesh concern (base__mesh_enabled: true + the reusable key in vault.netbird.setup_key); see ADR-016 / the base README, not this runbook. This runbook is for user devices that are not managed with Ansible.

verified: NetBird client install + self-hosted --management-url flow · docs.netbird.io (/get-started/install/windows, /get-started/cli) · 2026-06-17

Prerequisites

  • The coordinator's first-boot /setup admin exists and you can log in at https://netbird.askari.wingu.me.
  • Auth, pick one:
    • SSO (recommended for a personal device) — your dashboard account; no secret to copy.
    • Setup key — dashboard → Settings → Setup Keys → a reusable key (mint a client-specific one for clean ACL grouping, or reuse the existing reusable key).
  • Local admin rights on the device (the client installs a service).
  • Coordinator facts: management URL https://netbird.askari.wingu.me; ubongo = 100.99.146.14 (ubongo.netbird.selfhosted); askari = 100.99.226.39.

Part A — Windows 11

  1. Install: download + run the MSI https://pkgs.netbird.io/windows/msi/x64 (official x64 client; installs the tray app + the netbird service).
  2. Connect from an elevated Windows Terminal / PowerShell ("Run as administrator"):
    netbird up --management-url https://netbird.askari.wingu.me
    
    A browser opens; sign in with your dashboard account. (If no browser opens, fall back to a setup key: netbird up --setup-key <KEY> --management-url https://netbird.askari.wingu.me.)
  3. Proceed to Part C (verify).

Part B — Other platforms (same management URL)

  • macOS / Linux desktop: install the client (macOS: NetBird app / Homebrew; Linux: pkgs.netbird.io per the distro — same apt/rpm flow as base's mesh concern), then netbird up --management-url https://netbird.askari.wingu.me (Linux: prefix sudo).
  • Android / iOS: install the NetBird app, then in Settings → Advanced / Server set the management server to https://netbird.askari.wingu.me before logging in; connect and complete the SSO login. (Setup keys are supported in-app too.)

Part C — Verify + use

netbird status        # expect: Management: Connected, Signal: Connected, a 100.x NetBird IP
netbird status -d     # peer detail — ubongo (100.99.146.14) + askari (100.99.226.39) listed
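
The "100.x" expectation can be made precise: 100.64.0.0/10 spans 100.64.0.0 through 100.127.255.255. A minimal sketch of that range check follows; the function name in_overlay is hypothetical and not part of NetBird, and it only matches on the address prefix rather than validating a full IPv4 address.

```shell
# in_overlay ADDR: succeed if ADDR starts with a prefix inside 100.64.0.0/10.
# A /10 fixes the first octet (100) and limits the second octet to 64..127.
# Hypothetical helper for illustration; it does not validate the last two octets.
in_overlay() {
    case $1 in
        100.6[4-9].*)     return 0 ;;  # second octet 64..69
        100.[7-9][0-9].*) return 0 ;;  # second octet 70..99
        100.1[01][0-9].*) return 0 ;;  # second octet 100..119
        100.12[0-7].*)    return 0 ;;  # second octet 120..127
        *)                return 1 ;;  # anything else is outside the overlay
    esac
}
```

For example, `in_overlay 100.99.146.14 && echo "overlay address"` prints the message, while ubongo's LAN address 10.20.10.151 is rejected.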

Reach ubongo over the mesh:

ssh sjat@100.99.146.14        # or: ssh sjat@ubongo.netbird.selfhosted

SSH auth is separate from the mesh: ubongo is key-only (passwords disabled), so the device needs an SSH key authorised for sjat@ubongo. The mesh provides the network path; the SSH key provides auth.


Troubleshooting — mesh drops / SSH to ubongo times out

Symptom: SSH to ubongo (or any peer) times out for minutes and recovers on its own; netbird status shows Management/Signal: Disconnected or peers stuck Connecting.

verified: client DNS/relay behaviour + NRPT scope read from a 0.72.4 debug bundle; mitigations per docs.netbird.io (/manage/dns/troubleshooting, /help/troubleshooting-client) · 2026-06-18

1. Triage — is it your device or the coordinator? On the device:

netbird status -d                     # Management/Signal Connected? peers P2P/Relayed?
nslookup netbird.askari.wingu.me      # coordinator FQDN
nslookup pkgs.netbird.io              # a PUBLIC name — control test

If the relay/handshake errors say lookup netbird.askari.wingu.me: no such host and a public name (pkgs.netbird.io) also fails to resolve, your local resolver is dead; the coordinator and ubongo are almost certainly fine. NetBird only manages *.netbird.selfhosted resolution (a single NRPT rule), so it is not the cause of a failed public lookup. Confirm from the other side if you can: the dashboard shows each peer's last-seen time, and askari/ubongo staying green means the fault is your device's network.
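
The two lookups form a small decision table, sketched below. The function name classify_dns and its labels are hypothetical. The "coordinator-name" branch (public names resolve, the coordinator FQDN does not) is an extrapolation that this runbook does not cover. How you obtain the two exit statuses is left open, because nslookup's exit code is not dependable on every platform.

```shell
# classify_dns COORD_RC PUBLIC_RC
# Each argument is the exit status of a name lookup (0 = resolved).
# Hypothetical helper; the labels are illustrative, not NetBird output.
classify_dns() {
    coord_rc=$1
    public_rc=$2
    if [ "$coord_rc" -ne 0 ] && [ "$public_rc" -ne 0 ]; then
        echo "local-resolver"     # nothing resolves: the device's DNS is dead
    elif [ "$coord_rc" -ne 0 ]; then
        echo "coordinator-name"   # only the coordinator FQDN fails (not covered by this runbook)
    else
        echo "dns-ok"             # coordinator FQDN resolves: look beyond DNS
    fi
}
```

On a glibc-based Linux device, one way to feed it would be `getent hosts netbird.askari.wingu.me >/dev/null; coord=$?` and the same for pkgs.netbird.io, followed by `classify_dns "$coord" "$public"`.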

Why it cascades: NetBird re-resolves the coordinator FQDN on every reconnect. A network transition (Wi-Fi ↔ phone hotspot, sleep/wake) that briefly kills DNS means it can't reach management/signal/relay — and since ubongo is relay-only (below), there is no direct path to fall back to, so SSH dies until DNS recovers.

2. Make the device resilient:

  • Reliable resolvers — set the device's DNS to public resolvers (1.1.1.1, 8.8.8.8) rather than a network-handed or homelab-internal resolver that's unreachable off-LAN. Windows: inspect with Get-DnsClientServerAddress.
  • Pin the coordinator so a DNS hiccup can't strand the client — add to the hosts file (C:\Windows\System32\drivers\etc\hosts as admin, or /etc/hosts):
    77.42.120.136  netbird.askari.wingu.me
    
    77.42.120.136 is askari's stable WAN IP; TLS still validates against the hostname. The pin removes the multi-minute reconnect deadlocks, but it must be updated by hand if that IP ever changes.
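
A quick way to confirm the pin is in place on a macOS/Linux device is sketched below. The function name has_coordinator_pin is hypothetical. It treats a line as a pin only if it is uncommented and maps 77.42.120.136 to the coordinator name.

```shell
# has_coordinator_pin [HOSTS_FILE]
# Succeeds if an uncommented hosts line maps 77.42.120.136 to
# netbird.askari.wingu.me. Defaults to /etc/hosts. Hypothetical helper.
has_coordinator_pin() {
    hosts_file=${1:-/etc/hosts}
    awk '
        {
            sub(/\r$/, "")    # tolerate CRLF line endings
            sub(/#.*/, "")    # drop comments, including fully commented lines
        }
        $1 == "77.42.120.136" {
            # A hosts line may list several names after the address.
            for (i = 2; i <= NF; i++)
                if (tolower($i) == "netbird.askari.wingu.me") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$hosts_file"
}
```

Usage: `has_coordinator_pin && echo "pin present"`. On Windows the same check needs a POSIX shell such as WSL or Git Bash, pointed at the Windows hosts file.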

3. Break-glass — reach ubongo without the mesh. When the mesh is down you still need a way in. On the home LAN, go straight to ubongo's wired address (bypasses the mesh and coordinator DNS entirely):

ssh sjat@10.20.10.151        # ubongo eno1 (LAN) — verify this works from your device NOW

⚠️ This works today only because ubongo's host-firewall default-deny is not yet applied. When the deferred mesh-hardening lands (SSH only on wt0), this path closes unless a break-glass SSH rule is added to the firewall catalog. That hardening must keep a non-mesh break-glass (catalog SSH rule from a trusted LAN/admin source) — else a DNS/mesh outage = full lockout. (ADR-021 break-glass.)
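
The "mesh first, LAN second" order can be sketched as a small helper. Both function names are hypothetical. The probe assumes key-based login already works non-interactively: an agent or unencrypted key, and a host key accepted on an earlier connection, since BatchMode disables all prompts.

```shell
# ssh_probe ADDR: succeed if a non-interactive SSH login to ADDR works
# within 5 seconds. BatchMode=yes makes ssh fail instead of prompting.
ssh_probe() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "sjat@$1" true
}

# pick_ubongo_addr [PROBE]: print the first ubongo address that answers,
# trying the mesh address first and the LAN break-glass address second.
# PROBE is the name of a command or function called with one address.
pick_ubongo_addr() {
    probe=${1:-ssh_probe}
    for addr in 100.99.146.14 10.20.10.151; do
        if "$probe" "$addr" >/dev/null 2>&1; then
            printf '%s\n' "$addr"
            return 0
        fi
    done
    return 1    # neither path answered
}
```

Usage: `addr=$(pick_ubongo_addr) && ssh "sjat@$addr"`. The LAN address only answers from the home LAN, so off-LAN a mesh failure still ends with a non-zero exit.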

Why ubongo is relay-only (and P2P is not the fix). Peers connect to ubongo as Relayed, never P2P: its nftables default-deny drops the inbound UDP that ICE hole-punching needs (egress is open, so STUN itself succeeds). This is the intended current posture — P2P / NAT-traversal is the deferred mesh-hardening (ADR-016/020, STATUS.md). Enabling it needs a firewall-catalog UDP entry plus an accepted-risks.md deviation or ADR amendment, and OPNsense NAT work — and it would not have prevented a DNS-driven outage (a re-handshake still needs signal, which needs DNS). Tracked as future hardening, not a quick fix.


Notes

  • Split-tunnel: NetBird routes only the 100.x overlay by default — normal/work networking is unaffected.
  • Persistence: the service auto-starts on boot and reconnects; the tray app has Connect/Disconnect; CLI netbird down / netbird up (no flags after first setup).
  • Troubleshooting ("failed while getting Management Service public key" / won't register): confirm https://netbird.askari.wingu.me loads in a browser from the device (DNS + TLS + the gRPC routing through Caddy are reachable), the URL is exact, and the terminal is elevated. For peers stuck Disconnected/Connecting or SSH-to-ubongo timeouts that recover on their own, see the mesh-drops troubleshooting section above.
  • Removing a device: netbird down then uninstall; revoke its peer in the dashboard (and the setup key if one-off).