diff --git a/docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md b/docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md
new file mode 100644
index 0000000..32471b2
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md
@@ -0,0 +1,206 @@
+# Design — Mesh VPN (NetBird, self-hosted on `askari`)
+
+- **Date:** 2026-06-05
+- **Status:** Approved design — pending implementation plan
+- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
+  R3 "pending VPN choice" placeholder
+- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
+- **Becomes:** ADR-016 (this design is the basis for that ADR)
+
+---
+
+## Problem
+
+`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
+without exposing anything to the public internet. ADR-015 left the access mechanism —
+the "mesh VPN" — deferred to this discussion.
+
+Meanwhile ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
+(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to
+OPNsense, plus road-warrior clients. And `docs/CAPABILITIES.md` already flags the open
+question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
+alternative to weigh."*
+
+So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
+Tailscale), with an architectural sub-question of whether a mesh replaces or coexists
+with the ADR-007 WireGuard.
+
+## Decisions (as settled)
+
+1. **Scope — the mesh *replaces* WireGuard.** A single overlay becomes the sole
+   remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's
+   VLAN-99 OPNsense WireGuard design is retired.
+2. **Control plane — self-hosted, on `askari`.** Maximum sovereignty (boma already
+   self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
+   coordinator that survives a homelab outage and stays out of the cluster it
+   administers.
+3. **Tool — NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
+   self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
+   means Headscale, a separate third-party reimplementation with partial parity — ruled
+   out below.)
+4. **Routing — NetBird agent on every (Linux) host**, not a subnet router. At boma's
+   scale (2–5 hosts, treated as individuals) the usual "agent everywhere" downside is
+   moot, and the `base` role already runs on every host, so enrollment is one uniform
+   role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs
+   that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is not a
+   first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a
+   single advertised route or by administering OPNsense from an on-LAN meshed peer.
+5. **Identity — embedded local users** (Dex, built into the management container), not
+   a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
+   documented future option.
+
+## Verified facts (ADR-014)
+
+> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
+> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
+>   the core services are **merged into a single container**; deploy via Docker Compose.
+> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
+>   external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
+>   required.
+> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
+> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
+>
+> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
+> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
+>   elsewhere. Fully open source, self-hostable, no open-core feature gating.
+
+---
+
+## Architecture & topology
+
+A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
+WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird
+on `askari`.
+
+**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
+cluster per ADR-007) runs the **NetBird management stack** (single container:
+management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 + UDP
+3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh survive a
+full homelab outage and keeps the coordinator out of the cluster it administers (no
+chicken-and-egg).
+
+**Peers:**
+- `askari` — coordinator + peer.
+- `ubongo` (control/AI-worker host) — agent.
+- All Linux managed hosts (`dns1/2`, `proxy`, …) — agent via the `base` role.
+- Road-warrior clients — `mamba`, phone, work PC — agent/app.
+- OPNsense / `mgmt` — the single non-agent exception (advertised route or LAN-side
+  admin from a meshed peer).
+
+**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
+`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
+ACLs instead of OPNsense routing `10.99.0.0/24`.
+
+---
+
+## Security model, ACLs, and attack surface
+
+**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
+- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
+- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
+- road-warrior clients → only what each needs; nothing by default.
+
+**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
+keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
+role. Prefer ephemeral/scoped keys (ADR-002).
+
+**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base`
+role's nftables default-deny allows inbound admin (SSH) **only on `wt0`**, denied on
+the physical NIC — the pattern ADR-015 set for `ubongo`, now applied fleet-wide. Mesh
++ nftables are defence-in-depth.
+
+**The new attack surface — a public control plane on `askari`.** Today `askari`
+exposes a WireGuard UDP port; with NetBird self-hosted it exposes the **management API
++ dashboard (80/443)** and **Coturn (3478)** publicly, and the management API is
+keys-to-the-kingdom for the whole mesh. Mitigations baked in:
+- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
+  practical.
+- `askari` runs `base` hardening (already a public managed host) and NetBird is
+  **version-pinned** (ADR-011) and patched on boma's cadence — self-hosting means
+  owning the CVE cadence (AGPLv3 server).
+
+Net vs ADR-002: nothing from the **cluster** is publicly exposed; the only public
+surface is on `askari` (a public VPS by design), shifting from "WireGuard port" to
+"NetBird control plane."
+
+---
+
+## Recovery, bootstrap ordering, and operations
+
+**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
+fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
+`ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and
+there is no chicken-and-egg in the critical path.
+
+**Bootstrap order** (askari-first):
+1. Stand up the NetBird coordinator on `askari`.
+2. Enroll `ubongo`.
+3. `base` role enrolls the rest of the fleet via setup keys from vault.
+
+**Recovery.** Coordinator off-site on `askari` ⇒ the mesh survives a full homelab
+outage. Two must-haves:
+- **Back up NetBird's management datastore** off `askari` — encrypted, synced to
+  `ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
+- Existing peer tunnels keep running on last-known config through a brief coordinator
+  outage; only changes/new enrollments need it live — so `askari` is important but not
+  instantly fatal.
+
+**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
+gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
+service = one role per ADR-004, with its own `SECURITY.md` per the service-role
+standard). Agent install/enrollment lives in `base`.
+
+**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
+NetBird's built-in DNS is scoped/off to avoid overlap. NetBird server (on `askari`)
+and agents (via `base`) are version-pinned (ADR-011).
+
+---
+
+## Documentation & implementation changes
+
+This is a substantial decision → its own ADR, with amendments linking to it.
+
+| Doc | Change |
+|---|---|
+| ADR-016 (new) | Home of record for this design. |
+| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
+| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
+| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
+| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided — NetBird mesh, self-hosted on `askari`. |
+| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
+| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
+| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
+| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |
+
+**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
+and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
+decision and the doc reconciliation; the role tasks land when `base` is built.
+
+---
+
+## Deferred / out of scope
+
+1. **External SSO IdP** (Zitadel/Keycloak) — embedded local users now; SSO later if a
+   second operator or service-SSO need appears.
+2. **OPNsense mesh integration specifics** — the exact `mgmt` reachability mechanism
+   (single advertised route vs LAN-side admin) is settled during implementation when
+   OPNsense automation is built.
+3. **The `base` / `netbird_coordinator` role implementation** — depends on the
+   unbuilt `base` role and service-role standard.
+
+---
+
+## What was ruled out
+
+| Option | Reason |
+|---|---|
+| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
+| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
+| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support — weaker than NetBird's first-class self-hosting. |
+| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
+| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
+| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |
+
+See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
+(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
+ADR-013 (heritage — V4 used WireGuard; NetBird is translated, not transplanted).
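
Note on the ACL policy in "Security model, ACLs, and attack surface": the default-deny intent can be sanity-checked with a small model before it is written as NetBird policies. The sketch below is an illustration under stated assumptions, not NetBird's policy API. The group membership is read off the spec's peer list, the metrics port `9100` is a hypothetical placeholder (the spec names no port), and `GROUPS`, `Rule`, `RULES`, and `allowed` are made-up names.

```python
from dataclasses import dataclass

# Assumed group membership, read off the spec's peer list (illustrative only).
GROUPS = {
    "vpn": {"askari"},
    "admin": {"ubongo", "mamba"},
    "roadwarrior": {"phone", "work-pc"},
    "srv": {"dns1", "dns2", "proxy"},
    "mgmt": {"opnsense"},
}

METRICS_PORT = 9100  # hypothetical metrics port; the spec does not name one


@dataclass(frozen=True)
class Rule:
    src_group: str
    dst_group: str
    ports: frozenset  # an empty set means "any port"


# Allow rules mirroring the three policy bullets; everything else is denied.
RULES = [
    Rule("vpn", "srv", frozenset({METRICS_PORT})),
    Rule("admin", "srv", frozenset()),
    Rule("admin", "mgmt", frozenset()),
    # Road-warrior clients get no rule by default.
]


def allowed(src: str, dst: str, port: int) -> bool:
    """Return True only if some rule explicitly permits src -> dst on port."""
    for rule in RULES:
        if src not in GROUPS[rule.src_group] or dst not in GROUPS[rule.dst_group]:
            continue
        if not rule.ports or port in rule.ports:
            return True
    return False  # default-deny: no matching rule means no access


print(allowed("askari", "dns1", METRICS_PORT))  # monitoring scrape
print(allowed("askari", "dns1", 22))            # SSH from the monitoring peer
print(allowed("phone", "proxy", 22))            # road-warrior with no grant
```

Under these assumptions the monitoring peer reaches `srv` only on the metrics port, admin peers reach `srv` and `mgmt` on any port, and a road-warrior client with no explicit grant is denied everywhere. Granting a client "only what each needs" would mean adding one narrowly scoped `Rule` per need.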