boma/docs/decisions/016-mesh-vpn.md

ADR-016 — Mesh VPN (NetBird, self-hosted on askari)

Status

Accepted (2026-06-05). Designed, not built — depends on the unbuilt base role and service-role machinery (STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when base exists.

Context

ubongo (ADR-015) needs remote SSH access from anywhere without exposing anything to the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to WireGuard-via-OPNsense for the vpn VLAN (VLAN 99, 10.99.0.0/24: askari + road warriors), and docs/CAPABILITIES.md flagged NetBird (mesh) as a real alternative to weigh. This ADR settles it.

Decision

A single NetBird mesh is the sole remote-access overlay, self-hosted on askari, replacing ADR-007's VLAN-99 OPNsense WireGuard.

The decision in five parts:

  1. Scope — mesh replaces WireGuard. One overlay for ubongo, askari, and road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
  2. Control plane — self-hosted on askari. Sovereignty (boma self-hosts Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that survives a homelab outage and stays out of the cluster it administers.
  3. Tool — NetBird. Self-hosting selects NetBird (first-class, fully open-source self-host). Tailscale would mean Headscale (third-party reimplementation, partial parity) — ruled out below.
  4. Routing — agent on every Linux host, not a subnet router. At boma's scale (25 hosts) the "agent everywhere" cost is trivial and the base role already runs everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception (mgmt/gateway reached by a single advertised route or LAN-side admin).
  5. Identity — embedded local users (Dex in the management container); external SSO (Zitadel/Keycloak) stays an optional future.

Verified facts (ADR-014)

verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05 — components management+signal+dashboard+relay/TURN(Coturn), single container since v0.65; built-in local users / embedded IdP since v0.62 (external OIDC optional); ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.

verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for the management/, signal/ and relay/ directories, BSD-3-Clause elsewhere; fully open source, no open-core feature gating.

Architecture

Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on askari. NetBird manages its own overlay addressing (default 100.64.0.0/10); no boma VLAN is allocated for it.

  • askari (Hetzner, off-site, always-up) — runs the NetBird stack and is a peer.
  • ubongo — agent.
  • All Linux managed hosts — agent via the base role.
  • Road-warrior clients (mamba, phone, work PC) — agent/app.
  • OPNsense / mgmt — single non-agent exception.
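
The addressing claim above (NetBird keeps its own overlay range, so no boma VLAN is needed) can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the decision: it uses only the two ranges named in this ADR, and the label strings and function names are hypothetical.

```python
import ipaddress

# NetBird's default overlay range, as stated in this ADR.
OVERLAY = ipaddress.ip_network("100.64.0.0/10")

# boma ranges named in this ADR. The label is illustrative only;
# add further VLANs from ADR-007 as needed.
BOMA_NETWORKS = {
    "vpn (VLAN 99)": ipaddress.ip_network("10.99.0.0/24"),
}


def overlapping(overlay, networks):
    """Return the labels of every network that overlaps the overlay range."""
    return [label for label, net in networks.items() if overlay.overlaps(net)]


def is_overlay_address(addr):
    """True if addr lies inside the mesh overlay rather than a boma LAN range."""
    return ipaddress.ip_address(addr) in OVERLAY


# An empty list means the overlay collides with none of the listed ranges.
print(overlapping(OVERLAY, BOMA_NETWORKS))
```

If a future ADR-007 revision ever allocated a range inside 100.64.0.0/10, `overlapping` would name it, which is the case that would force a non-default NetBird range.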

Security

  • ACLs mirror ADR-007 intent (NetBird default-deny): mesh peers → srv metrics ports only; admin peers (ubongo, mamba) → srv + mgmt; clients → least privilege.
  • Enrollment via setup keys stored in vault.yml (vault.netbird.setup_key), consumed by base; prefer ephemeral/scoped keys.
  • Host firewall: base nftables allows inbound SSH on NetBird's wt0 interface (primary, WireGuard-authenticated) and from ubongo's LAN address (secondary, mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes explicit the control-node SSH allow that the recovery model already implied; the access doctrine and the three-tier access ladder live in ADR-021.
  • New public surface on askari: management API + dashboard (80/443) + Coturn (3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical, base hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence. Recorded as accepted-risk R3.
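
The default-deny ACL intent in the first bullet above can be modelled as a small allow-list. This is a sketch under assumptions, not NetBird's policy format or API: the group names, the one-group-per-peer mapping, and the service labels (`"metrics"`, `"ssh"`) are hypothetical, and the ADR does not fix a metrics port, so a label is used instead of a number.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source_group: str  # group of the connecting peer
    dest_zone: str     # "srv" or "mgmt"
    service: str       # a service label, or "*" for any service in the zone


# Allow-list mirroring the ACL bullet. Anything not listed is denied.
RULES = [
    Rule("mesh", "srv", "metrics"),  # mesh peers: srv metrics only
    Rule("admin", "srv", "*"),       # admin peers: all of srv
    Rule("admin", "mgmt", "*"),      # admin peers: all of mgmt
    # "clients" has no rule here on purpose: least privilege means
    # each grant is added explicitly.
]

# Hypothetical peer-to-group mapping, one group per peer for simplicity.
PEER_GROUPS = {
    "ubongo": "admin",
    "mamba": "admin",
    "askari": "mesh",
    "phone": "clients",
}


def is_allowed(peer, dest_zone, service):
    """Default-deny: allow only if some rule matches the peer's group."""
    group = PEER_GROUPS.get(peer)
    if group is None:
        return False  # unknown peers are never allowed
    return any(
        rule.source_group == group
        and rule.dest_zone == dest_zone
        and rule.service in ("*", service)
        for rule in RULES
    )


print(is_allowed("askari", "srv", "metrics"))  # mesh peer, metrics on srv
print(is_allowed("askari", "mgmt", "ssh"))     # mesh peer, mgmt zone
```

Keeping the policy as data makes the least-privilege review a matter of reading `RULES`; the real policies would live in the NetBird management service.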

Recovery & operations

  • Ansible stays off the mesh: ubongo reaches the fleet by LAN IP (ADR-009); a mesh/coordinator outage never blocks on-LAN runs.
  • Bootstrap order: stand up the coordinator on askari → enroll ubongo → base enrolls the fleet.
  • Coordinator survival: off-site on askari ⇒ mesh survives a homelab outage. NetBird's management datastore is backed up encrypted off askari (synced to ubongo/mamba); peers keep last-known config through a brief coordinator outage.
  • askari is Ansible-managed: its own inventory group offsite_hosts — provisioned as Terraform IaC (hetznercloud/hcloud), managed independently of the Proxmox cluster (its own provider + local state). Ansible configuration: base role, plus a dedicated netbird_coordinator service role (one service = one role, ADR-004; with SECURITY.md). Agent install/enrollment lives in base. NetBird server + agents are version-pinned (ADR-011). boma's dns role stays authoritative for boma.baobab.band; NetBird built-in DNS scoped/off.
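
The LAN-IP recovery path above depends on the host-firewall rule from the Security section: SSH is accepted on wt0, and also from ubongo's LAN address regardless of mesh state. The sketch below models that decision only; it is not the nftables ruleset. The address `192.0.2.10` is a documentation-range placeholder, since this ADR does not state ubongo's LAN address, and the function name is hypothetical.

```python
import ipaddress

# NetBird's interface name, as given in the Security section.
MESH_INTERFACE = "wt0"

# Placeholder (RFC 5737 documentation range); ubongo's real LAN
# address is not stated in this ADR.
UBONGO_LAN_ADDRESS = ipaddress.ip_address("192.0.2.10")


def ssh_allowed(in_interface, source_addr):
    """Decide whether an inbound SSH connection passes the host firewall.

    Primary path: anything arriving on the mesh interface.
    Secondary path: ubongo's LAN address on any interface, so a mesh or
    coordinator outage never blocks on-LAN SSH.
    Everything else is denied.
    """
    if in_interface == MESH_INTERFACE:
        return True
    return ipaddress.ip_address(source_addr) == UBONGO_LAN_ADDRESS


# With the mesh down, only the secondary path remains usable.
print(ssh_allowed("eth0", "192.0.2.10"))  # ubongo over the LAN
print(ssh_allowed("eth0", "192.0.2.11"))  # any other LAN host
```

The second path never consults mesh state, which is the property the recovery model relies on: Ansible runs from ubongo keep working while the coordinator is unreachable.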

What was ruled out

| Option | Reason |
| --- | --- |
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on askari. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support; weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg problem that ADR-015 escapes, and dies with the homelab. askari instead. |
| Subnet router via ubongo | Makes ubongo a routing SPOF; askari goes blind to srv when ubongo is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |

Consequences

  • A new public surface appears on askari — management API + dashboard (80/443) + Coturn (3478) — mitigated by TLS, embedded-IdP login, source-IP limits where practical, base hardening and version-pinned NetBird, and recorded as accepted-risk R3 (Security).
  • On-LAN SSH never depends on the mesh: base allows inbound SSH from ubongo's LAN address as a mesh-independent secondary path, so a mesh/coordinator outage never blocks on-LAN SSH and Ansible stays off the mesh (Security; Recovery & operations).
  • The mesh survives a homelab outage because the coordinator is off-site on askari, with its management datastore backed up encrypted off askari and peers keeping last-known config through a brief coordinator outage (Recovery & operations).
  • Choosing NetBird over plain OPNsense WireGuard, Tailscale, Tailscale+Headscale, an on-cluster coordinator, a ubongo subnet router, and a standalone IdP gains identity/ACL policy, self-hosted sovereignty, freedom from a routing SPOF, and a light single-operator footprint (What was ruled out).
  • Implementation is pending: the role tasks land only once the unbuilt base role and service-role machinery exist (Status).

Related

ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security), ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted), ADR-021 (operational access; SSH ladder reconciling wt0 + ubongo's LAN address).