From ff796c64ca76f3a23d199bfe97bc59eccdd315bb Mon Sep 17 00:00:00 2001
From: sjat
Date: Fri, 5 Jun 2026 11:45:45 +0200
Subject: [PATCH] =?UTF-8?q?Add=20ADR-016=20(mesh=20VPN=20=E2=80=94=20NetBi?=
 =?UTF-8?q?rd=20self-hosted=20on=20askari)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/decisions/016-mesh-vpn.md | 105 +++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 docs/decisions/016-mesh-vpn.md

diff --git a/docs/decisions/016-mesh-vpn.md b/docs/decisions/016-mesh-vpn.md
new file mode 100644
index 0000000..a322c0b
--- /dev/null
+++ b/docs/decisions/016-mesh-vpn.md
@@ -0,0 +1,105 @@
+# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
+
+## Context
+
+`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
+the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
+WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
+warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
+weigh. This ADR settles it.
+
+## Decision
+
+A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
+**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
+
+The decision in five parts:
+
+1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
+   road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
+2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
+   Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
+   survives a homelab outage and stays out of the cluster it administers.
+3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
+   self-host). Tailscale would mean Headscale (third-party reimplementation, partial
+   parity) — ruled out below.
+4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (2–5
+   hosts) the "agent everywhere" cost is trivial and the `base` role already runs
+   everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
+   granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
+   (`mgmt`/gateway reached by a single advertised route or LAN-side admin).
+5. **Identity — embedded local users** (Dex in the management container); external SSO
+   (Zitadel/Keycloak) stays an optional future.
+
+## Verified facts (ADR-014)
+
+verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
+— components management+signal+dashboard+relay/TURN(Coturn), **single container since
+v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
+ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
+
+verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
+`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
+open-core feature gating.
+
+## Architecture
+
+Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
+NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
+allocated for it.
+
+- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
+- `ubongo` — agent.
+- All Linux managed hosts — agent via the `base` role.
+- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
+- OPNsense / `mgmt` — single non-agent exception.
+
+## Security
+
+- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
+  ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
+  privilege.
+- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
+  consumed by `base`; prefer ephemeral/scoped keys.
+- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
+  **only on `wt0`** (the ADR-015 pattern, fleet-wide).
+- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
+  (3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
+  `base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
+  Recorded as accepted-risk R3.
+
+## Recovery & operations
+
+- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
+  mesh/coordinator outage never blocks on-LAN runs.
+- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
+  `base` enrolls the fleet.
+- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
+  NetBird's management datastore is backed up encrypted off `askari` (synced to
+  `ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
+- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
+  dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
+  `SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
+  version-pinned (ADR-011). boma's `dns` role stays authoritative for
+  `boma.baobab.band`; NetBird built-in DNS scoped/off.
+
+## Status
+
+Designed, not built — depends on the unbuilt `base` role and service-role machinery
+(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
+`base` exists.
+
+## What was ruled out
+
+| Option | Reason |
+|---|---|
+| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
+| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
+| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
+| Coordinator on the cluster | Recreates the chicken-and-egg problem ADR-015 escapes and dies with the homelab. `askari` instead. |
+| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
+| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
+
+See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
+ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
+handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).
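
Note on the Security section's ACL intent: the default-deny policy described there (mesh peers reach `srv` metrics ports only, admin peers reach `srv` and `mgmt`, clients get nothing unless granted) can be sanity-checked with a small model before any NetBird policy is written. The sketch below is plain Python and does not use or imitate NetBird's API. The group names `mesh`, `admin`, and `client`, the metrics port 9100, and the choice to let admin peers use any port are assumptions for illustration; the ADR states only the intent.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a node_exporter-style metrics port. The ADR does not name ports.
METRICS_PORTS = frozenset({9100})


@dataclass(frozen=True)
class Rule:
    """One allow rule. Anything not matched by a rule is denied."""
    src_group: str                 # hypothetical peer group name
    dst_zone: str                  # "srv" or "mgmt", as named in the ADR
    ports: Optional[frozenset]     # None means any port (an assumption)


# Allow rules mirroring the ADR's stated intent. The "client" group has no
# rule, so it stays fully denied until a least-privilege rule is added.
RULES = (
    Rule("mesh", "srv", METRICS_PORTS),
    Rule("admin", "srv", None),
    Rule("admin", "mgmt", None),
)


def is_allowed(src_group: str, dst_zone: str, port: int) -> bool:
    """Default-deny check: True only if some rule explicitly matches."""
    return any(
        rule.src_group == src_group
        and rule.dst_zone == dst_zone
        and (rule.ports is None or port in rule.ports)
        for rule in RULES
    )
```

For example, `is_allowed("mesh", "srv", 22)` is `False` because no rule grants SSH to ordinary mesh peers, while `is_allowed("admin", "mgmt", 22)` is `True`. Keeping the rules as data makes it easy to diff the intended matrix against whatever policy is eventually configured on the coordinator.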