# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
## Status
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## Context
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.
## Decision
A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
The decision in five parts:
1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
survives a homelab outage and stays out of the cluster it administers.
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
parity) — ruled out below.
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (25
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity — embedded local users** (Dex in the management container); external SSO
(Zitadel/Keycloak) stays an optional future.
## Verified facts (ADR-014)
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.
## Architecture
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.
- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
- `ubongo` — agent.
- All Linux managed hosts — agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
- OPNsense / `mgmt` — single non-agent exception.
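
As a quick illustration of why no boma VLAN has to be allocated, the following minimal sketch checks that the default overlay range is disjoint from the retired `vpn` VLAN. It uses only Python's standard `ipaddress` module; the helper name `is_overlay_address` and the sample peer address `100.64.0.10` are illustrative assumptions, not values from boma's inventory.

```python
import ipaddress

# NetBird's default overlay range, as stated above.
OVERLAY = ipaddress.ip_network("100.64.0.0/10")
# The retired ADR-007 `vpn` VLAN, used here only as a comparison point.
RETIRED_VPN_VLAN = ipaddress.ip_network("10.99.0.0/24")


def is_overlay_address(addr: str) -> bool:
    """Return True if addr falls inside the NetBird overlay range."""
    return ipaddress.ip_address(addr) in OVERLAY


# The overlay lives in its own range, so it cannot collide with the VLAN plan.
print(OVERLAY.overlaps(RETIRED_VPN_VLAN))  # False
print(is_overlay_address("100.64.0.10"))   # True (hypothetical peer address)
print(is_overlay_address("10.99.0.5"))     # False
```
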
## Security
- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
(primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
explicit the control-node SSH allow that the recovery model already implied; the access
doctrine and the three-tier access ladder live in **ADR-021**.
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
Recorded as accepted-risk R3.
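
The ACL and host-firewall bullets above can be read as two small decision functions. The sketch below is a plain-Python model of that intent, not NetBird policy syntax or nftables rules: the group labels (`mesh`, `admin`, `client`), the destination labels (`srv:metrics`, `srv`, `mgmt`), the empty starting set for clients, and the placeholder address `192.0.2.10` are all assumptions made for illustration.

```python
# Default-deny policy table: a source group may reach only what is listed.
# Group and destination labels are hypothetical, mirroring the bullets above.
POLICY = {
    "mesh": {"srv:metrics"},   # mesh peers: srv metrics ports only
    "admin": {"srv", "mgmt"},  # ubongo, mamba: all of srv plus mgmt
    "client": set(),           # assumption: least privilege starts empty
}


def is_allowed(group: str, dest: str) -> bool:
    """Allow dest if it, or the zone it belongs to, is listed for group."""
    allowed = POLICY.get(group, set())  # unknown groups get nothing
    zone = dest.split(":", 1)[0]
    return dest in allowed or zone in allowed


# Placeholder from the RFC 5737 documentation range, not ubongo's real address.
UBONGO_LAN_IP = "192.0.2.10"


def ssh_allowed(in_iface: str, src_ip: str) -> bool:
    """Inbound SSH: anything arriving on wt0, or ubongo's LAN address."""
    return in_iface == "wt0" or src_ip == UBONGO_LAN_IP


print(is_allowed("mesh", "srv:metrics"))   # True
print(is_allowed("mesh", "srv:ssh"))       # False
print(is_allowed("admin", "srv:ssh"))      # True
print(ssh_allowed("eth0", UBONGO_LAN_IP))  # True  (mesh-independent path)
print(ssh_allowed("eth0", "192.0.2.99"))   # False (other LAN hosts denied)
```

Keeping the SSH rule as an either/or of interface and source address is what lets the LAN-IP recovery path work while every other LAN host stays denied.
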
## Recovery & operations
- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
`base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is backed up encrypted off `askari` (synced to
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group `offsite_hosts` — provisioned
as **Terraform IaC** (`hetznercloud/hcloud`), managed independently of the Proxmox
cluster (its own provider + local state). Ansible configuration: `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
## Consequences
- A new public surface appears on `askari` — management API + dashboard (80/443) +
Coturn (3478) — mitigated by TLS, embedded-IdP login, source-IP limits where
practical, `base` hardening and version-pinned NetBird, and recorded as accepted-risk
R3 (Security).
- On-LAN SSH never depends on the mesh: `base` allows inbound SSH from `ubongo`'s LAN
address as a mesh-independent secondary path, so a mesh/coordinator outage never
blocks on-LAN SSH and Ansible stays off the mesh (Security; Recovery & operations).
- The mesh survives a homelab outage because the coordinator is off-site on `askari`,
with its management datastore **intended** to be backed up encrypted off `askari` (not yet built — see the Availability amendment / R8) and peers keeping
last-known config through a brief coordinator outage (Recovery & operations).
- Choosing NetBird over plain OPNsense WireGuard, Tailscale, Tailscale+Headscale, an
on-cluster coordinator, a `ubongo` subnet router, and a standalone IdP gains
identity/ACL policy, self-hosted sovereignty, no routing SPOF, and a light single
operator footprint (What was ruled out).
- Implementation is pending: the role tasks land only once the unbuilt `base` role and
service-role machinery exist (Status).
## Availability — an `askari` outage (amendment 2026-06-20)
The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).
The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:
| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
Only remote mesh access to peers is lost, and only while the client is off-LAN **and**
`askari` is down at the same time. On-LAN access to `ubongo` never depends on the mesh
(Recovery & operations, above).
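
The routing claim and the data-plane rows of the table can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names `egress` and `survives_askari_outage` and the sample addresses are hypothetical, and the control-plane row (which pauses rather than breaks) is deliberately left out.

```python
import ipaddress

# Overlay CIDR routed via wt0, as given above.
OVERLAY_CIDR = ipaddress.ip_network("100.99.0.0/16")


def egress(dst: str) -> str:
    """Pick the route: wt0 for overlay destinations, the default route otherwise."""
    return "wt0" if ipaddress.ip_address(dst) in OVERLAY_CIDR else "default"


def survives_askari_outage(on_lan: bool, uses_mesh: bool) -> bool:
    """Data-plane traffic breaks only when it is both off-LAN and on the mesh."""
    return not (uses_mesh and not on_lan)


print(egress("100.99.3.7"))  # wt0 (hypothetical peer address)
print(egress("192.0.2.1"))   # default

print(survives_askari_outage(on_lan=True, uses_mesh=False))  # True: LAN -> LAN service
print(survives_askari_outage(on_lan=True, uses_mesh=True))   # True: same-LAN mesh peers
print(survives_askari_outage(on_lan=False, uses_mesh=True))  # False: remote road-warrior
```
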
**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.
**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
## Related
ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).