# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`) ## Status Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery (STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when `base` exists. ## Context `ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to weigh. This ADR settles it. ## Decision A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`, **replacing** ADR-007's VLAN-99 OPNsense WireGuard. The decision in four parts: 1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired. 2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that survives a homelab outage and stays out of the cluster it administers. 3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source self-host). Tailscale would mean Headscale (third-party reimplementation, partial parity) — ruled out below. 4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (2–5 hosts) the "agent everywhere" cost is trivial and the `base` role already runs everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception (`mgmt`/gateway reached by a single advertised route or LAN-side admin). 5. **Identity — embedded local users** (Dex in the management container); external SSO (Zitadel/Keycloak) stays an optional future. ## Verified facts (ADR-014) verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05 — components management+signal+dashboard+relay/TURN(Coturn), **single container since v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional); ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host. verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for `management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no open-core feature gating. ## Architecture Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`. NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is allocated for it. - `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer. - `ubongo` — agent. - All Linux managed hosts — agent via the `base` role. - Road-warrior clients (`mamba`, phone, work PC) — agent/app. - OPNsense / `mgmt` — single non-agent exception. ## Security - **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least privilege. - **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`), consumed by `base`; prefer ephemeral/scoped keys. - **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface (primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary, mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes explicit the control-node SSH allow that the recovery model already implied; the access doctrine and the three-tier access ladder live in **ADR-021**. - **New public surface on `askari`:** management API + dashboard (80/443) + Coturn (3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical, `base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence. Recorded as accepted-risk R3. ## Recovery & operations - **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a mesh/coordinator outage never blocks on-LAN runs. - **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` → `base` enrolls the fleet. - **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage. NetBird's management datastore is backed up encrypted off `askari` (synced to `ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage. - **`askari` is Ansible-managed:** its own inventory group `offsite_hosts` — provisioned as **Terraform IaC** (`hetznercloud/hcloud`), managed independently of the Proxmox cluster (its own provider + local state). Ansible configuration: `base` role, plus a dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with `SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are version-pinned (ADR-011). boma's `dns` role stays authoritative for `boma.baobab.band`; NetBird built-in DNS scoped/off. ## What was ruled out | Option | Reason | |---|---| | Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. | | Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. | | Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. | | Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. | | Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. | | Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. | ## Consequences - A new public surface appears on `askari` — management API + dashboard (80/443) + Coturn (3478) — mitigated by TLS, embedded-IdP login, source-IP limits where practical, `base` hardening and version-pinned NetBird, and recorded as accepted-risk R3 (Security). - On-LAN SSH never depends on the mesh: `base` allows inbound SSH from `ubongo`'s LAN address as a mesh-independent secondary path, so a mesh/coordinator outage never blocks on-LAN SSH and Ansible stays off the mesh (Security; Recovery & operations). - The mesh survives a homelab outage because the coordinator is off-site on `askari`, with its management datastore backed up encrypted off `askari` and peers keeping last-known config through a brief coordinator outage (Recovery & operations). - Choosing NetBird over plain OPNsense WireGuard, Tailscale, Tailscale+Headscale, an on-cluster coordinator, a `ubongo` subnet router, and a standalone IdP gains identity/ACL policy, self-hosted sovereignty, no routing SPOF, and a light single operator footprint (What was ruled out). - Implementation is pending: the role tasks land only once the unbuilt `base` role and service-role machinery exist (Status). ## Related ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security), ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted), ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).