boma/docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md
sjat 99ace3eb48 Add design spec for mesh VPN (NetBird self-hosted on askari)
Resolves ADR-015 deferred item #1: the mesh VPN is NetBird, self-hosted on
askari, replacing ADR-007's VLAN-99 OPNsense WireGuard. Agent-per-host
enrollment via base, embedded local-user IdP, coordinator off-site for
outage survival. Basis for ADR-016.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 10:58:35 +02:00

# Design — Mesh VPN (NetBird, self-hosted on `askari`)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
R3 "pending VPN choice" placeholder
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
- **Becomes:** ADR-016 (this design is the basis for that ADR)
---
## Problem
`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
without exposing anything to the public internet. ADR-015 left the access mechanism,
the "mesh VPN", deferred to this discussion.

Meanwhile, ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to
OPNsense, plus road-warrior clients. And `docs/CAPABILITIES.md` already flags the open
question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
alternative to weigh."*

So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
Tailscale), with an architectural sub-question: does a mesh replace the ADR-007
WireGuard or coexist with it?
## Decisions (as settled)
1. **Scope — the mesh *replaces* WireGuard.** A single overlay becomes the sole
remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's
VLAN-99 OPNsense WireGuard design is retired.
2. **Control plane — self-hosted, on `askari`.** Maximum sovereignty (boma already
self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
coordinator that survives a homelab outage and stays out of the cluster it
administers.
3. **Tool — NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
means Headscale, a separate third-party reimplementation with partial parity — ruled
out below.)
4. **Routing — NetBird agent on every (Linux) host**, not a subnet router. At boma's
scale (25 hosts, treated as individuals) the usual "agent everywhere" downside is
moot, and the `base` role already runs on every host, so enrollment is one uniform
role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs
that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is not a
first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a
single advertised route or by administering OPNsense from an on-LAN meshed peer.
5. **Identity — embedded local users** (Dex, built into the management container), not
a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
documented future option.
## Verified facts (ADR-014)
> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
> the core services are **merged into a single container**; deploy via Docker Compose.
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
> external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
> required.
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
>
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
> elsewhere. Fully open source, self-hostable, no open-core feature gating.
---
## Architecture & topology
A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird
on `askari`.

**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
cluster per ADR-007) runs the **NetBird management stack** (single container:
management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 + UDP
3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh survive a
full homelab outage and keeps the coordinator out of the cluster it administers (no
chicken-and-egg).

**Peers:**

- `askari`: coordinator + peer.
- `ubongo` (control/AI-worker host): agent.
- All Linux managed hosts (`dns1/2`, `proxy`, …): agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC): agent/app.
- OPNsense / `mgmt`: the single non-agent exception (advertised route or LAN-side
admin from a meshed peer).

**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
ACLs instead of OPNsense routing `10.99.0.0/24`.

---
## Security model, ACLs, and attack surface
**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
- road-warrior clients → only what each needs; nothing by default.
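
The three rules above amount to a default-deny lookup: a connection passes only if some rule names the peer's group, the destination group, and the port. The following is a minimal Python sketch of that evaluation, not NetBird's policy engine or API. The group assignments, the `Rule` type, and the metrics port (9100, the usual node_exporter default) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    source_group: str
    dest_group: str
    ports: Optional[FrozenSet[int]]  # None means "any port"


# Assumption: 9100 stands in for "metrics ports"; the spec does not list them.
METRICS_PORTS = frozenset({9100})

# One rule per line of the ACL intent; anything not listed is denied.
RULES = (
    Rule("vpn", "srv", METRICS_PORTS),  # askari's monitoring scope
    Rule("admin", "srv", None),
    Rule("admin", "mgmt", None),
)

# Hypothetical peer-to-group mapping for illustration.
PEER_GROUPS = {
    "askari": {"vpn"},
    "ubongo": {"admin"},
    "mamba": {"admin"},
    "phone": {"road-warrior"},  # no rules yet, so nothing is reachable
}


def is_allowed(peer: str, dest_group: str, port: int) -> bool:
    """Return True only if an explicit rule permits the connection."""
    groups = PEER_GROUPS.get(peer, set())
    for rule in RULES:
        if rule.source_group in groups and rule.dest_group == dest_group:
            if rule.ports is None or port in rule.ports:
                return True
    return False  # default-deny


if __name__ == "__main__":
    print(is_allowed("askari", "srv", 9100))  # metrics scrape
    print(is_allowed("askari", "srv", 22))    # SSH from the monitoring peer
    print(is_allowed("phone", "srv", 22))     # road-warrior with no grants
```

Writing the intent down as data like this makes it easy to check that a road-warrior client with no explicit grant reaches nothing, which is the behaviour the policy asks for.
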

**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
role. Prefer ephemeral/scoped keys (ADR-002).
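
As a rough illustration of non-interactive enrollment, the sketch below builds the agent command line that a `base` role task might run. It assumes the NetBird CLI's `netbird up` with `--setup-key` and `--management-url`. The management URL is a placeholder, and the helper names are hypothetical, not part of any existing role.

```python
import shlex
from typing import List


def build_enroll_command(setup_key: str, management_url: str) -> List[str]:
    """Build the argv for a non-interactive agent enrollment."""
    if not setup_key:
        raise ValueError("setup key must not be empty")
    if not management_url.startswith("https://"):
        raise ValueError("management URL should use https://")
    return [
        "netbird", "up",
        "--setup-key", setup_key,
        "--management-url", management_url,
    ]


def redact(argv: List[str]) -> str:
    """Render argv for logs with the setup key value hidden."""
    shown = []
    hide_next = False
    for arg in argv:
        shown.append("REDACTED" if hide_next else arg)
        hide_next = arg == "--setup-key"
    return shlex.join(shown)


if __name__ == "__main__":
    # Placeholder values; the real key would come from vault.netbird.setup_key.
    argv = build_enroll_command("example-key", "https://netbird.example.invalid")
    print(redact(argv))
    # prints: netbird up --setup-key REDACTED --management-url https://netbird.example.invalid
```

Keeping the key out of logged output matters because a setup key is a credential; in Ansible the equivalent would be `no_log: true` on the task.
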

**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base`
role's nftables default-deny allows inbound admin (SSH) **only on `wt0`** and denies it
on the physical NIC. This is the pattern ADR-015 set for `ubongo`, now applied
fleet-wide. Mesh ACLs and host nftables together give defence-in-depth.
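
To make the "SSH only on `wt0`" rule concrete, here is a small Python sketch that renders an nftables input chain with a drop policy and a single SSH accept bound to the mesh interface. It is an illustration under assumptions, not the `base` role's template: the table and chain names and the function name are hypothetical. It also omits everything else a real host needs, such as service ports, ICMP, and whatever inbound traffic the NetBird agent itself requires.

```python
def render_input_chain(mesh_iface: str = "wt0", ssh_port: int = 22) -> str:
    """Render a default-deny nftables input chain allowing SSH on one interface."""
    if not 1 <= ssh_port <= 65535:
        raise ValueError("ssh_port must be a valid TCP port")
    lines = [
        "table inet filter {",
        "  chain input {",
        "    type filter hook input priority 0; policy drop;",
        '    iifname "lo" accept',
        "    ct state established,related accept",
        # iifname matches by name, so the rule loads even before the
        # mesh interface exists (for example before the agent has started).
        f'    iifname "{mesh_iface}" tcp dport {ssh_port} accept',
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_input_chain())
```

Because the chain policy is `drop` and the only SSH rule is tied to `wt0`, SSH arriving on the physical NIC falls through to the policy and is discarded.
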

**The new attack surface: a public control plane on `askari`.** Today `askari`
exposes a WireGuard UDP port; with NetBird self-hosted it exposes the **management API
+ dashboard (TCP 80/443)** and **Coturn (UDP 3478)** publicly, and the management API is
keys-to-the-kingdom for the whole mesh. Mitigations baked in:
- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
practical.
- `askari` runs `base` hardening (already a public managed host) and NetBird is
**version-pinned** (ADR-011) and patched on boma's cadence. Self-hosting means
owning the CVE cadence for the AGPLv3 server components.

Net effect against ADR-002: nothing from the **cluster** is publicly exposed; the only
public surface is on `askari` (a public VPS by design), shifting from "WireGuard port"
to "NetBird control plane."

---
## Recovery, bootstrap ordering, and operations
**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
`ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and
there is no chicken-and-egg in the critical path.

**Bootstrap order** (askari-first):
1. Stand up the NetBird coordinator on `askari`.
2. Enroll `ubongo`.
3. `base` role enrolls the rest of the fleet via setup keys from vault.
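
The ordering above can be read as a small dependency graph. The sketch below, using Python's standard `graphlib`, derives the order from the dependencies. The step names and the exact edges are assumptions read off the list, not something the spec defines.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each step maps to the steps that must finish before it can start.
# Assumption: the fleet depends on ubongo because ubongo drives the base role.
DEPENDS_ON: Dict[str, Set[str]] = {
    "coordinator on askari": set(),
    "enroll ubongo": {"coordinator on askari"},
    "enroll fleet via base": {"enroll ubongo"},
}


def bootstrap_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return the steps in an order that respects every dependency.

    Raises graphlib.CycleError if the dependencies are circular, which is
    the chicken-and-egg situation the design wants to avoid.
    """
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step in bootstrap_order(DEPENDS_ON):
        print(step)
```

A coordinator hosted inside the cluster would add an edge from the coordinator back to the fleet, and the sort would fail with a cycle error.
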

**Recovery.** Because the coordinator is off-site on `askari`, the mesh survives a full
homelab outage. Two things to keep in mind:
- **Back up NetBird's management datastore** off `askari` — encrypted, synced to
`ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
- Existing peer tunnels keep running on last-known config through a brief coordinator
outage; only changes/new enrollments need it live — so `askari` is important but not
instantly fatal.

**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
service = one role per ADR-004, with its own `SECURITY.md` per the service-role
standard). Agent install/enrollment lives in `base`.

**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
NetBird's built-in DNS is either narrowly scoped or switched off to avoid overlap.
NetBird server (on `askari`) and agents (via `base`) are version-pinned (ADR-011).

---
## Documentation & implementation changes
This is a substantial decision, so it gets its own ADR, with amendments linking to it.

| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided — NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |

**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
decision and the doc reconciliation; the role tasks land when `base` is built.

---
## Deferred / out of scope
1. **External SSO IdP** (Zitadel/Keycloak) — embedded local users now; SSO later if a
second operator or service-SSO need appears.
2. **OPNsense mesh integration specifics** — the exact `mgmt` reachability mechanism
(single advertised route vs LAN-side admin) is settled during implementation when
OPNsense automation is built.
3. **The `base` / `netbird_coordinator` role implementation** — depends on the
unbuilt `base` role and service-role standard.
---
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support — weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |

See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
ADR-013 (heritage: V4 used WireGuard; NetBird is translated, not transplanted).