ADR-016 — Mesh VPN (NetBird, self-hosted on askari)
Status
Accepted (2026-06-05). Designed, not built — depends on the unbuilt base role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
base exists.
Context
ubongo (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the vpn VLAN (VLAN 99, 10.99.0.0/24: askari + road
warriors), and docs/CAPABILITIES.md flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.
Decision
A single NetBird mesh is the sole remote-access overlay, self-hosted on askari,
replacing ADR-007's VLAN-99 OPNsense WireGuard.
The decision in five parts:
- Scope: mesh replaces WireGuard. One overlay for ubongo, askari, and road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
- Control plane: self-hosted on askari. Sovereignty (boma self-hosts Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that survives a homelab outage and stays out of the cluster it administers.
- Tool: NetBird. Self-hosting selects NetBird (first-class, fully open-source self-host). Tailscale would mean Headscale (third-party reimplementation, partial parity), which is ruled out below.
- Routing: agent on every Linux host, not a subnet router. At boma's scale (2–5 hosts) the "agent everywhere" cost is trivial and the base role already runs everywhere, so enrollment is one uniform task. This avoids a routing SPOF and gives granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception (mgmt/gateway reached by a single advertised route or LAN-side admin).
- Identity: embedded local users (Dex in the management container); external SSO (Zitadel/Keycloak) stays an optional future.
Verified facts (ADR-014)
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05 — components management+signal+dashboard+relay/TURN(Coturn), single container since v0.65; built-in local users / embedded IdP since v0.62 (external OIDC optional); ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
management/, signal/, and relay/, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.
Architecture
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on askari.
NetBird manages its own overlay addressing (default 100.64.0.0/10); no boma VLAN is
allocated for it.
- askari (Hetzner, off-site, always-up): runs the NetBird stack and is a peer.
- ubongo: agent.
- All Linux managed hosts: agent via the base role.
- Road-warrior clients (mamba, phone, work PC): agent/app.
- OPNsense / mgmt: single non-agent exception.
Security
- ACLs mirror ADR-007 intent (NetBird default-deny): mesh peers → srv metrics ports only; admin peers (ubongo, mamba) → srv + mgmt; clients → least privilege.
- Enrollment via setup keys stored in vault.yml (vault.netbird.setup_key), consumed by base; prefer ephemeral/scoped keys.
- Host firewall: base nftables allows inbound SSH on NetBird's wt0 interface (primary, WireGuard-authenticated) and from ubongo's LAN address (secondary, mesh-independent; required by the LAN-IP recovery path below, so a mesh/coordinator outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes explicit the control-node SSH allow that the recovery model already implied; the access doctrine and the three-tier access ladder live in ADR-021.
- New public surface on askari: management API + dashboard (80/443) + Coturn (3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical, base hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence. Recorded as accepted-risk R3.
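The two SSH paths in the host-firewall bullet can be read as a simple predicate. The sketch below is a Python illustration of that logic only, not the base role's nftables ruleset: the function name ssh_allowed is hypothetical, and 192.0.2.10 is a documentation-range placeholder standing in for ubongo's LAN address, which this ADR does not state.

```python
from ipaddress import ip_address

MESH_INTERFACE = "wt0"  # NetBird's interface name, as given in this ADR

# Placeholder only (RFC 5737 documentation range); the real address is not in the ADR.
UBONGO_LAN_ADDR = ip_address("192.0.2.10")


def ssh_allowed(in_iface: str, src: str) -> bool:
    """Return True if inbound SSH would be accepted under the two-path rule."""
    if in_iface == MESH_INTERFACE:
        # Primary path: traffic arriving on wt0 is WireGuard-authenticated;
        # finer-grained filtering is left to NetBird's own ACLs.
        return True
    # Secondary path: mesh-independent, and limited to the control node.
    return ip_address(src) == UBONGO_LAN_ADDR


print(ssh_allowed("wt0", "100.99.0.5"))   # mesh peer
print(ssh_allowed("eth0", "192.0.2.10"))  # control node over LAN
print(ssh_allowed("eth0", "192.0.2.11"))  # any other LAN host
```

The point of the second branch is that it never consults mesh state, which is what lets on-LAN SSH survive a coordinator outage.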
Recovery & operations
- Ansible stays off the mesh: ubongo reaches the fleet by LAN IP (ADR-009); a mesh/coordinator outage never blocks on-LAN runs.
- Bootstrap order: stand up the coordinator on askari → enroll ubongo → base enrolls the fleet.
- Coordinator survival: off-site on askari ⇒ mesh survives a homelab outage. NetBird's management datastore is backed up encrypted off askari (synced to ubongo/mamba); peers keep last-known config through a brief coordinator outage.

askari is Ansible-managed: its own inventory group offsite_hosts, provisioned as Terraform IaC (hetznercloud/hcloud), managed independently of the Proxmox cluster (its own provider + local state). Ansible configuration: base role, plus a dedicated netbird_coordinator service role (one service = one role, ADR-004; with SECURITY.md). Agent install/enrollment lives in base. NetBird server + agents are version-pinned (ADR-011). boma's dns role stays authoritative for boma.baobab.band; NetBird built-in DNS scoped/off.
What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on askari. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. askari instead. |
| Subnet router via ubongo | Makes ubongo a routing SPOF; askari goes blind to srv when ubongo is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
Consequences
- A new public surface appears on askari: management API + dashboard (80/443) + Coturn (3478). It is mitigated by TLS, embedded-IdP login, source-IP limits where practical, base hardening and version-pinned NetBird, and recorded as accepted-risk R3 (Security).
- On-LAN SSH never depends on the mesh: base allows inbound SSH from ubongo's LAN address as a mesh-independent secondary path, so a mesh/coordinator outage never blocks on-LAN SSH and Ansible stays off the mesh (Security; Recovery & operations).
- The mesh survives a homelab outage because the coordinator is off-site on askari, with its management datastore backed up encrypted off askari and peers keeping last-known config through a brief coordinator outage (Recovery & operations).
- Choosing NetBird over plain OPNsense WireGuard, Tailscale, Tailscale+Headscale, an on-cluster coordinator, a ubongo subnet router, and a standalone IdP gains identity/ACL policy, self-hosted sovereignty, no routing SPOF, and a light single-operator footprint (What was ruled out).
- Implementation is pending: the role tasks land only once the unbuilt base role and service-role machinery exist (Status).
Availability — an askari outage (amendment 2026-06-20)
The coordinator is deliberately single (one off-site host). Recorded here so its
availability envelope is explicit; accepted as R8 (docs/security/accepted-risks.md).
The mesh is not a default gateway — wt0 routes only the overlay CIDR (100.99.0.0/16);
normal traffic uses the host's default route. So an askari outage has a narrow blast
radius:
| Traffic | askari down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| road-warrior → ubongo (remote, relayed) | breaks |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
Only remote mesh access to peers is lost, and only when the client is off-LAN while
askari is down. On-LAN access to ubongo never depends on the mesh (Recovery &
operations, above).
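The narrow blast radius follows from wt0 carrying only the overlay CIDR. A minimal Python sketch of that reasoning, using the 100.99.0.0/16 range stated above; the function names and the peers_share_lan flag are hypothetical, and the model deliberately reduces the table to two questions: does the destination sit in the overlay, and can the peers reach each other directly on the same LAN?

```python
from ipaddress import ip_address, ip_network

OVERLAY = ip_network("100.99.0.0/16")  # overlay CIDR routed via wt0, per this ADR


def egress_interface(dst: str) -> str:
    """Overlay destinations leave via wt0; everything else uses the default route."""
    return "wt0" if ip_address(dst) in OVERLAY else "default-route"


def breaks_when_askari_down(dst: str, peers_share_lan: bool) -> bool:
    """Only mesh traffic that cannot go direct (off-LAN, relayed) is lost."""
    return egress_interface(dst) == "wt0" and not peers_share_lan


# Road warrior reaching a peer's mesh IP from off-LAN: relayed, so it breaks.
print(breaks_when_askari_down("100.99.1.2", peers_share_lan=False))
# Same-LAN nodes over mesh IPs: direct P2P, unaffected.
print(breaks_when_askari_down("100.99.1.2", peers_share_lan=True))
# Traffic to a LAN IP never touches wt0, so the coordinator is irrelevant.
print(breaks_when_askari_down("10.99.0.5", peers_share_lan=True))
```

This models data-plane reachability only; the control-plane row (new enrolments, ACL changes, re-handshakes) pauses regardless of where the peers sit.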
Recovery: rebuild the coordinator (/setup + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the netbird_coordinator store backup is the next sub-project (its
gap is named in R8 and BACKUP.md). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN /etc/hosts pin) is in docs/runbooks/netbird-client.md; managed mesh
hosts get the same pin via base__mesh_coordinator_pin.
Not pursued (deliberately, given the narrow blast radius): direct P2P (punctures the default-deny posture; only helps established sessions), a second relay (needs another public host / reintroduces the home public surface), a second coordinator (unsupported by self-hosted NetBird; against this ADR).
Related
ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
ADR-021 (operational access; SSH ladder reconciling wt0 + ubongo's LAN address).