# Design — Mesh VPN (NetBird, self-hosted on `askari`)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
R3 "pending VPN choice" placeholder
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
- **Becomes:** ADR-016 (this design is the basis for that ADR)
---
## Problem
`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
without exposing anything to the public internet. ADR-015 deferred the access
mechanism, the "mesh VPN", to this discussion.

Meanwhile ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers with
OPNsense, alongside road-warrior clients. And `docs/CAPABILITIES.md` already flags the
open question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
alternative to weigh."*

So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
Tailscale), with an architectural sub-question of whether a mesh replaces or coexists
with the ADR-007 WireGuard.
## Decisions (as settled)
1. **Scope — the mesh *replaces* WireGuard.** A single overlay becomes the sole
remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's
VLAN-99 OPNsense WireGuard design is retired.
2. **Control plane — self-hosted, on `askari`.** Maximum sovereignty (boma already
self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
coordinator that survives a homelab outage and stays out of the cluster it
administers.
3. **Tool — NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
means Headscale, a separate third-party reimplementation with partial parity — ruled
out below.)
4. **Routing — NetBird agent on every (Linux) host**, not a subnet router. At boma's
scale (25 hosts, treated as individuals) the usual "agent everywhere" downside is
moot, and the `base` role already runs on every host, so enrollment is one uniform
role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs
that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is not a
first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a
single advertised route or by administering OPNsense from an on-LAN meshed peer.
5. **Identity — embedded local users** (Dex, built into the management container), not
a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
documented future option.
## Verified facts (ADR-014)
> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
> the core services are **merged into a single container**; deploy via Docker Compose.
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
> external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
> required.
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
>
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
> elsewhere. Fully open source, self-hostable, no open-core feature gating.
---
## Architecture & topology
A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird
on `askari`.

**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
cluster per ADR-007) runs the **NetBird management stack** (single container:
management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 and
UDP 3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh
survive a full homelab outage and keeps the coordinator out of the cluster it
administers (no chicken-and-egg).

**Peers:**
- `askari` — coordinator + peer.
- `ubongo` (control/AI-worker host) — agent.
- All Linux managed hosts (`dns1/2`, `proxy`, …) — agent via the `base` role.
- Road-warrior clients — `mamba`, phone, work PC — agent/app.
- OPNsense / `mgmt` — the single non-agent exception (advertised route or LAN-side
admin from a meshed peer).

**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
ACLs instead of OPNsense routing `10.99.0.0/24`.

---
## Security model, ACLs, and attack surface
**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
- road-warrior clients → only what each needs; nothing by default.
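
The policy above can be modelled as a small allow-list evaluated under default-deny. The following is a minimal sketch of that intent, not NetBird's API or policy format: the `Rule` type, the group names, and the metrics port (9100) are illustrative assumptions. Road-warrior clients get no rule at all, so they match nothing.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical model of the ACL intent; none of these names come from NetBird.
ANY_PORT = None


@dataclass(frozen=True)
class Rule:
    src_group: str
    dst_group: str
    ports: Optional[FrozenSet[int]]  # None means "any port"


# Placeholder metrics port for illustration only (not taken from the design).
METRICS_PORTS = frozenset({9100})

RULES = [
    Rule("vpn", "srv", METRICS_PORTS),  # askari's monitoring scope
    Rule("admin", "srv", ANY_PORT),     # ubongo, mamba administer srv
    Rule("admin", "mgmt", ANY_PORT),    # and mgmt
    # road-warrior clients: no rule, so nothing is reachable by default
]


def is_allowed(src_group: str, dst_group: str, port: int) -> bool:
    """Default-deny: traffic passes only if some rule explicitly matches."""
    for rule in RULES:
        if rule.src_group == src_group and rule.dst_group == dst_group:
            if rule.ports is None or port in rule.ports:
                return True
    return False
```

Under this model `is_allowed("vpn", "srv", 22)` is false, because the `vpn` group is only granted the listed metrics ports, while an `admin` peer reaches SSH on both `srv` and `mgmt`.
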

**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
role. Prefer ephemeral/scoped keys (ADR-002).
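
A non-interactive join might be sketched as follows. It assumes the `netbird` CLI's `up` command with its `--setup-key` and `--management-url` flags; the helper names are hypothetical, the URL and key in any call are placeholders, and in the real role the key would come from `vault.netbird.setup_key` rather than a literal.

```python
import shlex


def enroll_command(setup_key: str, management_url: str) -> list[str]:
    """Build the argv for a non-interactive NetBird join (does not run it)."""
    if not setup_key:
        raise ValueError("setup key must not be empty")
    return [
        "netbird", "up",
        "--setup-key", setup_key,
        "--management-url", management_url,
    ]


def redacted(argv: list[str]) -> str:
    """Render the command for logs with the setup key value masked."""
    out = []
    mask_next = False
    for arg in argv:
        out.append("***" if mask_next else arg)
        mask_next = arg == "--setup-key"
    return shlex.join(out)
```

Building the argv as a list avoids shell quoting issues, and masking the key before logging keeps the secret out of run output. In an Ansible task the equivalent precaution would be `no_log: true`.
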

**Host firewall interaction.** NetBird creates a `wt0` mesh interface. Under the `base`
role's nftables default-deny policy, inbound admin traffic (SSH) is accepted **only on
`wt0`** and dropped on the physical NIC. This is the pattern ADR-015 set for `ubongo`,
now applied fleet-wide. Mesh ACLs and nftables together provide defence-in-depth.
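
The interface-scoped SSH rule can be sketched as a small renderer for an nftables `inet` table. This is an illustration under assumptions, not the `base` role's actual ruleset: the function name and the table and chain names are hypothetical, SSH is assumed to listen on port 22, and a real ruleset would carry more rules than shown.

```python
def render_input_chain(mesh_iface: str = "wt0", ssh_port: int = 22) -> str:
    """Render a default-deny input chain that accepts SSH only on the mesh interface."""
    return "\n".join([
        "table inet filter {",
        "  chain input {",
        # Anything not explicitly accepted below is dropped.
        "    type filter hook input priority 0; policy drop;",
        "    iif lo accept",
        "    ct state established,related accept",
        # SSH is matched by interface name, so the physical NIC never qualifies.
        f'    iifname "{mesh_iface}" tcp dport {ssh_port} accept',
        "  }",
        "}",
    ])
```

The sketch uses `iifname` rather than `iif` because `iifname` matches by name at packet time, so the ruleset can load before the NetBird agent has created `wt0`.
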

**The new attack surface: a public control plane on `askari`.** Today `askari`
exposes a WireGuard UDP port; with NetBird self-hosted it exposes the **management API
+ dashboard (80/443)** and **Coturn (3478)** publicly, and the management API is
keys-to-the-kingdom for the whole mesh. Mitigations baked in:
- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
practical.
- `askari` runs `base` hardening (already a public managed host), and NetBird is
**version-pinned** (ADR-011) and patched on boma's cadence. Self-hosting the AGPLv3
server means boma owns tracking and patching its CVEs.

Net effect against ADR-002: nothing from the **cluster** is publicly exposed; the only
public surface is on `askari` (a public VPS by design), shifting from "WireGuard port"
to "NetBird control plane."

---
## Recovery, bootstrap ordering, and operations
**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
`ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and
there is no chicken-and-egg in the critical path.

**Bootstrap order** (askari-first):
1. Stand up the NetBird coordinator on `askari`.
2. Enroll `ubongo`.
3. `base` role enrolls the rest of the fleet via setup keys from vault.
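
The ordering constraint above can be expressed as a dependency graph and checked mechanically. This is a minimal sketch using the standard library's `graphlib`; the step names are illustrative labels, not role or playbook names from the repo.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish first (hypothetical labels).
BOOTSTRAP_DEPS = {
    "coordinator_on_askari": set(),
    "enroll_ubongo": {"coordinator_on_askari"},
    "enroll_fleet_via_base": {"enroll_ubongo"},
}


def bootstrap_order(deps: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())
```

`TopologicalSorter` raises `CycleError` if the graph contains a cycle, which is the chicken-and-egg situation this ordering is meant to rule out.
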

**Recovery.** With the coordinator off-site on `askari`, the mesh survives a full
homelab outage. Two things to keep in mind:
- **Back up NetBird's management datastore** off `askari` — encrypted, synced to
`ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
- Existing peer tunnels keep running on last-known config through a brief coordinator
outage; only changes/new enrollments need it live — so `askari` is important but not
instantly fatal.

**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
service = one role per ADR-004, with its own `SECURITY.md` per the service-role
standard). Agent install/enrollment lives in `base`.

**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
NetBird's built-in DNS is kept narrowly scoped or disabled so the two do not overlap.
NetBird server (on `askari`) and agents (via `base`) are version-pinned (ADR-011).

---
## Documentation & implementation changes
This is a substantial decision, so it gets its own ADR, with amendments linking to it.

| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided — NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |

**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
decision and the doc reconciliation; the role tasks land when `base` is built.

---
## Deferred / out of scope
1. **External SSO IdP** (Zitadel/Keycloak) — embedded local users now; SSO later if a
second operator or service-SSO need appears.
2. **OPNsense mesh integration specifics** — the exact `mgmt` reachability mechanism
(single advertised route vs LAN-side admin) is settled during implementation when
OPNsense automation is built.
3. **The `base` / `netbird_coordinator` role implementation** — depends on the
unbuilt `base` role and service-role standard.
---
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support — weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |

See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
ADR-013 (heritage — V4 used WireGuard; NetBird is translated, not transplanted).