Document the 2026-06-18 incident class: a road-warrior laptop losing DNS on a network transition strands NetBird (can't resolve the coordinator FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and why ubongo is relay-only (deferred mesh-hardening, not a bug) — including the break-glass rule that hardening must preserve. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
144 lines
7.4 KiB
Markdown
# Runbook — Enrolling a NetBird client (road-warrior device)

This runbook joins a **client/road-warrior device** (laptop, desktop, phone) to the boma NetBird mesh
so it can reach `ubongo` and other peers from anywhere. The self-hosted coordinator is on
`askari` (ADR-016, M4b); enrollment lands a device on the `100.64.0.0/10` overlay.

> **Hosts vs clients.** Managed **Linux hosts** join via the `base` role's `mesh` concern
> (`base__mesh_enabled: true` + the reusable key in `vault.netbird.setup_key`) — see
> ADR-016 / the `base` README, *not* this runbook. This runbook is for **user devices**
> that are not managed with Ansible.

verified: NetBird client install + self-hosted `--management-url` flow · docs.netbird.io
(`/get-started/install/windows`, `/get-started/cli`) · 2026-06-17

## Prerequisites

- The coordinator's first-boot `/setup` admin exists and you can log in at
  `https://netbird.askari.wingu.me`.
- **Auth, pick one:**
  - **SSO** (recommended for a personal device) — your dashboard account; no secret to copy.
  - **Setup key** — dashboard → **Settings → Setup Keys** → a reusable key (mint a
    client-specific one for clean ACL grouping, or reuse the existing reusable key).
- Local **admin rights** on the device (the client installs a service).
- **Coordinator facts:** management URL `https://netbird.askari.wingu.me`; `ubongo`
  = `100.99.146.14` (`ubongo.netbird.selfhosted`); `askari` = `100.99.226.39`.

---

## Part A — Windows 11

1. **Install:** download + run the MSI **https://pkgs.netbird.io/windows/msi/x64**
   (official x64 client; installs the tray app + the `netbird` service).
2. **Connect** from an **elevated** Windows Terminal / PowerShell ("Run as administrator"):
   ```powershell
   netbird up --management-url https://netbird.askari.wingu.me
   ```
   A browser opens; sign in with your dashboard account. If the browser does not open for
   SSO, use a setup key instead:
   `netbird up --setup-key <KEY> --management-url https://netbird.askari.wingu.me`.
3. Proceed to **Part C** (verify).

---

## Part B — Other platforms (same management URL)

- **macOS / Linux desktop:** install the client (macOS: NetBird app / Homebrew; Linux:
  `pkgs.netbird.io` per the distro — same apt/rpm flow as `base`'s `mesh` concern), then
  `netbird up --management-url https://netbird.askari.wingu.me` (Linux: prefix `sudo`).
- **Android / iOS:** install the **NetBird** app, then in **Settings → Advanced /
  Server** set the management server to `https://netbird.askari.wingu.me` **before**
  logging in; connect and complete the SSO login. (Setup keys are supported in-app too.)

---

## Part C — Verify + use

```sh
netbird status     # expect: Management: Connected, Signal: Connected, a 100.x NetBird IP
netbird status -d  # peer detail — ubongo (100.99.146.14) + askari (100.99.226.39) listed
```
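
The expectations in the comments above can be turned into a scriptable check. The following is a minimal sketch, not part of the NetBird CLI: `mesh_healthy` is a hypothetical helper, and the line labels it greps for (`Management:`, `Signal:`, `NetBird IP:`) are assumed from the expected output described above, so adjust the patterns if your client version prints them differently. It assumes a POSIX shell (on Windows, WSL or Git Bash).

```sh
#!/bin/sh
# mesh_healthy (hypothetical helper): reads `netbird status` text on stdin and
# returns 0 only if management and signal are connected and a 100.x overlay
# address is present. The line labels are assumptions about the status format.
mesh_healthy() {
    status_text=$(cat)
    printf '%s\n' "$status_text" | grep -q '^ *Management: Connected' || return 1
    printf '%s\n' "$status_text" | grep -q '^ *Signal: Connected' || return 1
    printf '%s\n' "$status_text" | grep -Eq 'NetBird IP: 100\.[0-9]+\.[0-9]+\.[0-9]+' || return 1
    return 0
}

# Usage on the device (not run here):
#   netbird status | mesh_healthy && echo "mesh OK" || echo "mesh NOT healthy"
```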
Reach `ubongo` over the mesh:

```sh
ssh sjat@100.99.146.14   # or: ssh sjat@ubongo.netbird.selfhosted
```

**SSH auth is separate from the mesh:** `ubongo` is key-only (passwords disabled), so the
device needs an SSH key authorised for `sjat@ubongo`. The mesh provides the network path;
the SSH key provides auth.

---

## Troubleshooting — mesh drops / SSH to `ubongo` times out

Symptom: SSH to `ubongo` (or any peer) times out for minutes and recovers on its own;
`netbird status` shows **Management/Signal: Disconnected** or peers stuck **Connecting**.

verified: client DNS/relay behaviour + NRPT scope read from a 0.72.4 debug bundle;
mitigations per docs.netbird.io (`/manage/dns/troubleshooting`,
`/help/troubleshooting-client`) · 2026-06-18

**1. Triage — is it your device or the coordinator?** On the device:

```sh
netbird status -d                 # Management/Signal Connected? peers P2P/Relayed?
nslookup netbird.askari.wingu.me  # coordinator FQDN
nslookup pkgs.netbird.io          # a PUBLIC name — control test
```

If the relay/handshake errors say `lookup netbird.askari.wingu.me: no such host` **and**
a *public* name (`pkgs.netbird.io`) also fails to resolve, your **local resolver is
dead** — the coordinator and `ubongo` are almost certainly fine. NetBird only manages
`*.netbird.selfhosted` resolution (a single NRPT rule), so it is **not** the cause.
Confirm from the other side if you can: the dashboard shows each peer's *last-seen*; if
`askari` and `ubongo` stay green, the fault is your device's network.

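
The triage rule above is a two-input decision, which can be sketched as a small lookup. This is an illustration only: `classify_dns` and `resolves` are hypothetical helpers, and the two mixed cases (only one of the names resolving) are an extrapolation that the runbook itself does not cover. `resolves` also assumes that `nslookup` exits non-zero on failure, which is not guaranteed on every platform (notably Windows).

```sh
#!/bin/sh
# classify_dns (hypothetical helper): takes two flags, 1 = resolved, 0 = failed.
#   $1: result for the coordinator FQDN   $2: result for the public control name
classify_dns() {
    case "$1$2" in
        00) echo "local-resolver-dead" ;;     # both fail: fix the device's DNS
        11) echo "dns-ok" ;;                  # both resolve: look elsewhere
        01) echo "coordinator-record" ;;      # extrapolation: only the coordinator name fails
        10) echo "inconclusive" ;;            # extrapolation: control lookup failed alone
        *)  echo "usage: classify_dns <0|1> <0|1>" >&2; return 2 ;;
    esac
}

# resolves (hypothetical helper): prints 1 if the name resolves, else 0.
# Assumption: nslookup signals failure through its exit status on your platform.
resolves() {
    if nslookup "$1" >/dev/null 2>&1; then echo 1; else echo 0; fi
}

# Usage on the device (not run here):
#   classify_dns "$(resolves netbird.askari.wingu.me)" "$(resolves pkgs.netbird.io)"
```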
**Why it cascades:** NetBird re-resolves the coordinator FQDN on every reconnect. A
network transition (Wi-Fi ↔ phone hotspot, sleep/wake) that briefly kills DNS means it
can't reach management/signal/relay — and since `ubongo` is **relay-only** (below), there
is no direct path to fall back to, so SSH dies until DNS recovers.

**2. Make the device resilient:**

- **Reliable resolvers** — set the device's DNS to public resolvers (`1.1.1.1`, `8.8.8.8`)
  rather than a network-handed or homelab-internal resolver that's unreachable off-LAN.
  Windows: inspect with `Get-DnsClientServerAddress`.
- **Pin the coordinator** so a DNS hiccup can't strand the client — add to the hosts file
  (`C:\Windows\System32\drivers\etc\hosts` as admin, or `/etc/hosts`):
  ```
  77.42.120.136 netbird.askari.wingu.me
  ```
  This is `askari`'s stable WAN IP; TLS still validates against the hostname. The pin
  removes the multi-minute reconnect deadlocks.

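
To confirm the pin is actually in place (and not commented out), a check along these lines may help. It is a sketch under assumptions: `has_pin` is a hypothetical helper, and it expects a POSIX shell with `awk`, so on Windows it would need WSL or Git Bash pointed at the hosts file path given above.

```sh
#!/bin/sh
# has_pin (hypothetical helper): succeeds if the hosts file maps IP to HOST
# on a line that is not commented out.
#   $1: hosts file path   $2: IP address   $3: hostname
has_pin() {
    awk -v ip="$2" -v host="$3" '
        /^[[:space:]]*#/ { next }          # skip whole-line comments
        {
            sub(/#.*/, "")                 # drop any trailing comment
            if ($1 != ip) next             # first field must be the pinned IP
            for (i = 2; i <= NF; i++) if ($i == host) found = 1
        }
        END { exit !found }                # exit 0 only if a matching line was seen
    ' "$1"
}

# Usage on the device (not run here):
#   has_pin /etc/hosts 77.42.120.136 netbird.askari.wingu.me || echo "pin missing"
```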
**3. Break-glass — reach `ubongo` without the mesh.** When the mesh is down you still need
a way in. On the home LAN, go straight to `ubongo`'s wired address (bypasses the mesh and
coordinator DNS entirely):

```sh
ssh sjat@10.20.10.151   # ubongo eno1 (LAN) — verify this works from your device NOW
```
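
If you would rather not remember which address to use, the choice between the mesh path and the LAN path can be sketched as "first address that answers on port 22". Both helpers are hypothetical, and `probe` assumes an `nc` that supports `-z` and `-w` (true for common netcat variants, but check yours). The LAN address only answers when the device is on the home LAN.

```sh
#!/bin/sh
# probe (hypothetical helper): succeeds if TCP port 22 on $1 answers within 3 s.
# Assumption: the installed `nc` supports the -z (scan only) and -w (timeout) flags.
probe() {
    nc -z -w 3 "$1" 22 >/dev/null 2>&1
}

# pick_target (hypothetical helper): prints the first candidate address that
# answers the probe; returns 1 if none do.
pick_target() {
    for addr in "$@"; do
        if probe "$addr"; then
            echo "$addr"
            return 0
        fi
    done
    return 1
}

# Usage (not run here): mesh address first, LAN break-glass second.
#   target=$(pick_target 100.99.146.14 10.20.10.151) && ssh "sjat@$target"
```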
> ⚠️ The LAN path works **today** only because `ubongo`'s host-firewall default-deny is not yet
> applied. When the deferred mesh-hardening lands (SSH only on `wt0`), this path closes
> unless a break-glass SSH rule is added to the firewall catalog. That hardening **must**
> keep a non-mesh break-glass (catalog SSH rule from a trusted LAN/admin source);
> otherwise a DNS/mesh outage means full lockout. (ADR-021 break-glass.)

**Why `ubongo` is relay-only (and P2P is not the fix).** Peers connect to `ubongo` as
`Relayed`, never `P2P`: its `nftables` default-deny drops the inbound UDP that ICE
hole-punching needs (egress is open, so STUN itself succeeds). This is the **intended
current posture** — P2P / NAT-traversal is the *deferred mesh-hardening* (ADR-016/020,
STATUS.md). Enabling it needs a firewall-catalog UDP entry **plus** an `accepted-risks.md`
deviation or ADR amendment, and OPNsense NAT work — and it would **not** have prevented a
DNS-driven outage (a re-handshake still needs signal, which needs DNS). Tracked as future
hardening, not a quick fix.

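
To see the relay-only posture on a device, the peer detail output can be tallied by connection type. This is a sketch only: `count_conn_types` is a hypothetical helper, and it assumes `netbird status -d` prints a `Connection type: Relayed` or `Connection type: P2P` line per peer, which may differ between client versions. With the current posture, `ubongo` should be counted under `relayed`.

```sh
#!/bin/sh
# count_conn_types (hypothetical helper): reads `netbird status -d` text on
# stdin and prints "relayed=<n> p2p=<n>".
# Assumption: one "Connection type: ..." line per peer in the detail output.
count_conn_types() {
    awk '
        /Connection type:[[:space:]]*Relayed/ { relayed++ }
        /Connection type:[[:space:]]*P2P/     { p2p++ }
        END { printf "relayed=%d p2p=%d\n", relayed, p2p }
    '
}

# Usage on the device (not run here):
#   netbird status -d | count_conn_types
```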

---

## Notes

- **Split-tunnel:** NetBird routes only the `100.x` overlay by default — normal/work
  networking is unaffected.
- **Persistence:** the service auto-starts on boot and reconnects; the tray app has
  Connect/Disconnect; CLI `netbird down` / `netbird up` (no flags after first setup).
- **Troubleshooting** — *"failed while getting Management Service public key"* / won't
  register: confirm `https://netbird.askari.wingu.me` loads in a browser from the device
  (DNS + TLS + the gRPC routing through Caddy are reachable), the URL is exact, and the
  terminal is elevated. For peers stuck Disconnected/Connecting or SSH-to-`ubongo`
  timeouts that recover on their own, see **Troubleshooting — mesh drops** above.
- **Removing a device:** `netbird down` then uninstall; revoke its peer in the dashboard
  (and the setup key if one-off).