boma/docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md
sjat 99ace3eb48 Add design spec for mesh VPN (NetBird self-hosted on askari)
Resolves ADR-015 deferred item #1: the mesh VPN is NetBird, self-hosted on
askari, replacing ADR-007's VLAN-99 OPNsense WireGuard. Agent-per-host
enrollment via base, embedded local-user IdP, coordinator off-site for
outage survival. Basis for ADR-016.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 10:58:35 +02:00


Design — Mesh VPN (NetBird, self-hosted on askari)

  • Date: 2026-06-05
  • Status: Approved design — pending implementation plan
  • Resolves: ADR-015 deferred item #1 (mesh VPN choice) and the accepted-risks.md R3 "pending VPN choice" placeholder
  • Amends: ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
  • Becomes: ADR-016 (this design is the basis for that ADR)

Problem

ubongo (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone) without exposing anything to the public internet. ADR-015 left the access mechanism — the "mesh VPN" — deferred to this discussion.

Meanwhile ADR-007 already commits to WireGuard-via-OPNsense for the vpn VLAN (VLAN 99, 10.99.0.0/24): askari (the off-site Hetzner monitoring VPS) peers to OPNsense, plus road-warrior clients. And docs/CAPABILITIES.md already flags the open question: "ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh."

So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs Tailscale), with an architectural sub-question of whether a mesh replaces or coexists with the ADR-007 WireGuard.

Decisions (as settled)

  1. Scope — the mesh replaces WireGuard. A single overlay becomes the sole remote-access path for ubongo, askari, and road-warrior clients. ADR-007's VLAN-99 OPNsense WireGuard design is retired.
  2. Control plane — self-hosted, on askari. Maximum sovereignty (boma already self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site coordinator that survives a homelab outage and stays out of the cluster it administers.
  3. Tool — NetBird. Self-hosting on askari selects NetBird: it is designed to be self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path means Headscale, a separate third-party reimplementation with partial parity — ruled out below.)
  4. Routing — NetBird agent on every (Linux) host, not a subnet router. At boma's scale (25 hosts, treated as individuals) the usual "agent everywhere" downside is moot, and the base role already runs on every host, so enrollment is one uniform role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs that match ADR-007's firewall intent. One exception: OPNsense (FreeBSD) is not a first-class NetBird agent target, so mgmt/gateway reachability is handled by a single advertised route or by administering OPNsense from an on-LAN meshed peer.
  5. Identity — embedded local users (Dex, built into the management container), not a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a documented future option.

Verified facts (ADR-014)

verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05

  • Components: management + signal + dashboard + relay/TURN (Coturn). Since v0.65 the core services are merged into a single container; deploy via Docker Compose.
  • Identity: since v0.62, built-in local users with an embedded IdP (Dex); external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are optional, not required.
  • Ports (behind reverse proxy): TCP 80/443 + UDP 3478 (STUN/TURN).
  • Host: a Linux VM + Docker Compose + a domain name; lightweight.

verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05

  • Dual license: AGPLv3 for management/, signal/, relay/; BSD-3-Clause elsewhere. Fully open source, self-hostable, no open-core feature gating.

Architecture & topology

A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99 WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird on askari.

askari's dual role. askari (Hetzner, off-site, always-up, independent of the cluster per ADR-007) runs the NetBird management stack (single container: management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 + UDP 3478) and is itself a mesh peer. Off-site hosting is what makes the mesh survive a full homelab outage and keeps the coordinator out of the cluster it administers (no chicken-and-egg).

Peers:

  • askari — coordinator + peer.
  • ubongo (control/AI-worker host) — agent.
  • All Linux managed hosts (dns1/2, proxy, …) — agent via the base role.
  • Road-warrior clients — mamba, phone, work PC — agent/app.
  • OPNsense / mgmt — the single non-agent exception (advertised route or LAN-side admin from a meshed peer).

Retired: ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the 10.99.0.0/24 peer scheme. askari reaches srv/mgmt over the mesh under NetBird ACLs instead of OPNsense routing 10.99.0.0/24.


Security model, ACLs, and attack surface

ACL policy mirrors ADR-007's firewall intent (NetBird is default-deny):

  • vpn peers → srv metrics ports only (askari's monitoring scope).
  • admin peers (ubongo, mamba) → srv + mgmt for administration.
  • road-warrior clients → only what each needs; nothing by default.
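The default-deny behaviour of these three rules can be modelled in a few lines. This is a sketch only: the group names are lifted from the bullets, the port number is a placeholder (the spec does not name the metrics ports), and none of it is NetBird's policy format or API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Assumption: 9100 is only a placeholder; the spec does not name the metrics ports.
METRICS_PORTS = frozenset({9100})


@dataclass(frozen=True)
class Rule:
    source_group: str
    dest_group: str
    ports: Optional[FrozenSet[int]] = None  # None means "any port"


# Hypothetical group names taken from the bullets above.
# Road-warrior clients get no rule at all: nothing by default.
RULES = (
    Rule("vpn", "srv", METRICS_PORTS),
    Rule("admin", "srv"),
    Rule("admin", "mgmt"),
)


def is_allowed(source_group, dest_group, port, rules=RULES):
    """Default-deny: allow only when an explicit rule matches."""
    for rule in rules:
        if (rule.source_group, rule.dest_group) != (source_group, dest_group):
            continue
        if rule.ports is None or port in rule.ports:
            return True
    return False
```

Under this model `is_allowed("vpn", "srv", 22)` is false while `is_allowed("admin", "mgmt", 22)` is true, and a road-warrior client reaches nothing until a rule is added for it.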

Enrollment via setup keys. Hosts join non-interactively using NetBird setup keys, stored in vault.yml as vault.netbird.setup_key and consumed by the base role. Prefer ephemeral/scoped keys (ADR-002).
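As a rough illustration of non-interactive enrollment, the helper below assembles the client command the base role might end up running, plus a log-safe rendering of it. The `--setup-key` and `--management-url` flags are the ones the NetBird client is understood to accept for `netbird up`; confirm them against the pinned client version. The function names and the management URL are hypothetical.

```python
import shlex


def netbird_up_command(setup_key, management_url):
    """Build the argv for a non-interactive join; never pass it through a shell."""
    if not setup_key:
        raise ValueError("setup key is empty; check vault.netbird.setup_key")
    return [
        "netbird", "up",
        "--setup-key", setup_key,
        "--management-url", management_url,
    ]


def redacted(argv):
    """Render argv for logs with the setup key masked."""
    shown = []
    hide_next = False
    for arg in argv:
        shown.append("REDACTED" if hide_next else arg)
        hide_next = arg == "--setup-key"
    return shlex.join(shown)


# Hypothetical values: the real key comes from vault.yml, the URL from askari's domain.
cmd = netbird_up_command("example-key", "https://netbird.example.invalid")
print(redacted(cmd))
```

Keeping the key out of logs matters because a reusable setup key is an enrollment credential, which is also why the spec prefers ephemeral or scoped keys.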

Host firewall interaction. NetBird creates a wt0 mesh interface. The base role's nftables default-deny allows inbound admin (SSH) only on wt0, denied on the physical NIC; this is the pattern ADR-015 set for ubongo, now applied fleet-wide. Mesh + nftables are defence-in-depth.
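To make the "SSH only on wt0" rule concrete, the sketch below renders a minimal nftables input chain as text, the way a role template might. It is an assumption-laden illustration, not the base role's ruleset: the table and chain names are arbitrary, port 22 is assumed for SSH, and a real ruleset would also need rules for each host's services and for whatever NetBird itself requires on the physical NIC.

```python
def render_input_chain(mesh_iface="wt0", ssh_port=22):
    """Return an nftables ruleset where SSH is accepted only on the mesh interface."""
    lines = [
        "table inet filter {",
        "  chain input {",
        # Default-deny: anything not explicitly accepted below is dropped.
        "    type filter hook input priority 0; policy drop;",
        "    ct state established,related accept",
        '    iifname "lo" accept',
        # The only SSH rule is bound to the mesh interface, so SSH arriving
        # on the physical NIC falls through to the drop policy.
        f'    iifname "{mesh_iface}" tcp dport {ssh_port} accept',
        "  }",
        "}",
    ]
    return "\n".join(lines)


print(render_input_chain())
```

Because the accept rule names the interface, stopping the NetBird agent removes SSH reachability rather than silently widening it.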

The new attack surface: a public control plane on askari. Today askari exposes a WireGuard UDP port; with NetBird self-hosted it exposes the management API + dashboard (80/443) and Coturn (3478) publicly, and the management API is keys-to-the-kingdom for the whole mesh. Mitigations baked in:
  • Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where practical.
  • askari runs base hardening (already a public managed host) and NetBird is version-pinned (ADR-011) and patched on boma's cadence — self-hosting means owning the CVE cadence (AGPLv3 server).

Net vs ADR-002: nothing from the cluster is publicly exposed; the only public surface is on askari (a public VPS by design), shifting from "WireGuard port" to "NetBird control plane."
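One way to keep that public surface honest is to compare what askari actually listens on against the three ports listed under the verified facts. The check below is a sketch; how the observed listeners are collected (a port scan, a socket listing) is deliberately left out, and the function name is hypothetical.

```python
# Public listeners the design expects on askari: TCP 80/443 behind the
# reverse proxy, plus UDP 3478 for STUN/TURN.
EXPECTED_PUBLIC = frozenset({("tcp", 80), ("tcp", 443), ("udp", 3478)})


def audit_public_surface(observed, expected=EXPECTED_PUBLIC):
    """Compare observed (protocol, port) pairs with the expected public surface."""
    observed = frozenset(observed)
    return {
        "unexpected": sorted(observed - expected),
        "missing": sorted(expected - observed),
    }


# Hypothetical scan result with one stray listener on TCP 8080.
report = audit_public_surface([("tcp", 80), ("tcp", 443), ("udp", 3478), ("tcp", 8080)])
print(report)
```

An empty "unexpected" list means the exposure matches the design; anything else is a finding to explain or close.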


Recovery, bootstrap ordering, and operations

Ansible's control path stays off the mesh. ubongo is on the LAN and reaches the fleet by LAN IP (ADR-009). The mesh only provides external reach to ubongo/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and there is no chicken-and-egg in the critical path.

Bootstrap order (askari-first):

  1. Stand up the NetBird coordinator on askari.
  2. Enroll ubongo.
  3. base role enrolls the rest of the fleet via setup keys from vault.
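The ordering above is a dependency chain, which can be written down and checked mechanically. The sketch below uses the standard library's `graphlib`; the step labels and the dependency edges are one reading of the numbered list, not something the spec defines.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be finished before it.
# Assumption: fleet enrollment waits on ubongo, as the list order suggests.
DEPENDS_ON = {
    "coordinator on askari": set(),
    "enroll ubongo": {"coordinator on askari"},
    "enroll fleet via base role": {"enroll ubongo"},
}


def bootstrap_order(depends_on=DEPENDS_ON):
    """Return the steps in an order that respects every dependency.

    Raises graphlib.CycleError if the dependencies are circular, which is
    exactly the chicken-and-egg situation the design is meant to avoid.
    """
    return list(TopologicalSorter(depends_on).static_order())


for step in bootstrap_order():
    print(step)
```

Hosting the coordinator inside the cluster it administers would add an edge from the coordinator back to the fleet, and the sorter would reject the graph as cyclic.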

Recovery. Coordinator off-site on askari ⇒ the mesh survives a full homelab outage. Two points to plan around:

  • Back up NetBird's management datastore off askari — encrypted, synced to ubongo/mamba. If askari dies, restore the coordinator; peers re-enroll.
  • Existing peer tunnels keep running on last-known config through a brief coordinator outage; only changes/new enrollments need it live — so askari is important but not instantly fatal.

askari becomes Ansible-managed. It joins the inventory under its own group and gets the base role plus a dedicated netbird_coordinator service role (one service = one role per ADR-004, with its own SECURITY.md per the service-role standard). Agent install/enrollment lives in base.

DNS & versions. boma's dns role stays authoritative for boma.baobab.band; NetBird's built-in DNS is either narrowly scoped or switched off so the two never overlap. NetBird server (on askari) and agents (via base) are version-pinned (ADR-011).


Documentation & implementation changes

This is a substantial decision → its own ADR, with amendments linking to it.

| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + 10.99.0.0/24 scheme with the NetBird mesh; update the firewall-intent table and the askari external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on askari; update the access/recovery notes. |
| docs/security/accepted-risks.md | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on askari to harden + patch. |
| docs/CAPABILITIES.md | Resolve the VPN row (line ~29) as decided: NetBird mesh, self-hosted on askari. |
| STATUS.md | Add rows (designed, not built): NetBird coordinator on askari; NetBird agent enrollment in base. |
| base role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on wt0. |
| netbird_coordinator service role (new, when built) | Deploys the NetBird stack on askari; populated SECURITY.md; molecule scenario. |
| requirements.yml | Only if a task needs a new collection module (ADR dependencies policy). |

Scope note: like the ubongo work, most implementation here waits on the base and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the decision and the doc reconciliation; the role tasks land when base is built.


Deferred / out of scope

  1. External SSO IdP (Zitadel/Keycloak) — embedded local users now; SSO later if a second operator or service-SSO need appears.
  2. OPNsense mesh integration specifics — the exact mgmt reachability mechanism (single advertised route vs LAN-side admin) is settled during implementation when OPNsense automation is built.
  3. The base / netbird_coordinator role implementation — depends on the unbuilt base role and service-role standard.

What was ruled out

| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on askari.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support; weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. askari (off-site) instead. |
| Subnet router via ubongo | Makes ubongo a routing SPOF; askari would go blind to srv when ubongo is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |

See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff), ADR-013 (heritage — V4 used WireGuard; NetBird is translated, not transplanted).