Add ADR-016 (mesh VPN — NetBird self-hosted on askari)
parent 4b85b14f1f
commit ff796c64ca

1 changed file with 105 additions and 0 deletions

docs/decisions/016-mesh-vpn.md (normal file)

@@ -0,0 +1,105 @@

# ADR-016: Mesh VPN (NetBird, self-hosted on `askari`)

## Context

`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.

## Decision

A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.

The decision in five parts:

1. **Scope: mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
   road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane: self-hosted on `askari`.** Sovereignty (boma self-hosts
   Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
   survives a homelab outage and stays out of the cluster it administers.
3. **Tool: NetBird.** Self-hosting selects NetBird (first-class, fully open-source
   self-host). Tailscale would mean Headscale (third-party reimplementation, partial
   parity), which is ruled out below.
4. **Routing: agent on every Linux host**, not a subnet router. At boma's scale (2–5
   hosts) the "agent everywhere" cost is trivial and the `base` role already runs
   everywhere, so enrollment is one uniform task. This avoids a routing SPOF and gives
   granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
   (`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity: embedded local users** (Dex in the management container); external SSO
   (Zitadel/Keycloak) stays an optional future.

## Verified facts (ADR-014)

verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05:
components management + signal + dashboard + relay/TURN (Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.

verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05: AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.

## Architecture

Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.

- `askari` (Hetzner, off-site, always-up): runs the NetBird stack **and** is a peer.
- `ubongo`: agent.
- All Linux managed hosts: agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC): agent/app.
- OPNsense / `mgmt`: single non-agent exception.
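
Because the overlay range is chosen by NetBird rather than allocated from boma's VLAN plan, it may be worth checking that it does not collide with any locally routed network. The following is a minimal sketch using only Python's standard `ipaddress` module. The two ranges are the ones quoted in this ADR; the `overlapping` helper is a hypothetical name, and any further boma VLAN ranges would have to be added by hand since this ADR does not list them.

```python
import ipaddress

# Ranges quoted in this ADR.
NETBIRD_OVERLAY = ipaddress.ip_network("100.64.0.0/10")  # NetBird default overlay
RETIRED_VPN_VLAN = ipaddress.ip_network("10.99.0.0/24")  # ADR-007 VLAN 99 (retired)


def overlapping(overlay, local_networks):
    """Return every local network that shares addresses with the overlay."""
    return [net for net in local_networks if overlay.overlaps(net)]


# Hypothetical usage: append the remaining boma VLAN ranges to this list.
conflicts = overlapping(NETBIRD_OVERLAY, [RETIRED_VPN_VLAN])
print(conflicts)
```

An empty result means no listed network shares addresses with the overlay; any network returned would need a different overlay range or a renumbered VLAN.
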

## Security

- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
  ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
  privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
  consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** NetBird's interface is `wt0`; `base` nftables allows inbound SSH
  **only on `wt0`** (the ADR-015 pattern, fleet-wide).
- **New public surface on `askari`:** management API + dashboard (TCP 80/443) + Coturn
  (UDP 3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
  `base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
  Recorded as accepted-risk R3.
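
The ACL intent in the first bullet can be sketched as a small default-deny model. This is plain Python for reasoning about the policy, not NetBird's API or policy format. The group names, the `Rule` type, the `is_allowed` helper, and the metrics port `9100` are all assumptions made for illustration; the ADR names neither the metrics ports nor the ports admin peers may use, so admin rules are modelled here as "all ports".

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    source_group: str
    dest_group: str
    ports: Optional[FrozenSet[int]]  # None means "any port"


# Assumption: 9100 stands in for the unspecified `srv` metrics ports.
METRICS_PORTS = frozenset({9100})

# Only traffic matching a rule is allowed; everything else is denied.
RULES = (
    Rule("mesh", "srv", METRICS_PORTS),  # mesh peers -> srv metrics only
    Rule("admin", "srv", None),          # admin peers -> srv
    Rule("admin", "mgmt", None),         # admin peers -> mgmt
)

# Hypothetical group membership; peer names come from the ADR.
PEER_GROUPS = {
    "ubongo": {"mesh", "admin"},
    "mamba": {"mesh", "admin"},
    "askari": {"mesh"},
}


def is_allowed(peer, dest_group, port):
    """Default-deny check: True only if some rule matches the request."""
    groups = PEER_GROUPS.get(peer, set())
    for rule in RULES:
        if rule.source_group in groups and rule.dest_group == dest_group:
            if rule.ports is None or port in rule.ports:
                return True
    return False
```

In this model an unknown peer belongs to no group and is therefore denied everything, which is the behaviour that default-deny is meant to guarantee.
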

## Recovery & operations

- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
  mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
  `base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
  NetBird's management datastore is backed up encrypted off `askari` (synced to
  `ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
  dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
  `SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
  version-pinned (ADR-011). boma's `dns` role stays authoritative for
  `boma.baobab.band`; NetBird's built-in DNS is scoped or turned off.
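
The "Ansible stays off the mesh" rule could be guarded by a check that no on-LAN inventory address falls inside the NetBird overlay. The sketch below assumes the inventory has already been reduced to a mapping of host name to IP address; the `hosts_on_mesh` helper, the host names, and the addresses are hypothetical and do not come from boma's inventory.

```python
import ipaddress

# NetBird's default overlay range, as stated in this ADR.
NETBIRD_OVERLAY = ipaddress.ip_network("100.64.0.0/10")


def hosts_on_mesh(inventory):
    """Return the names of hosts whose Ansible address lies inside the overlay."""
    return sorted(
        name
        for name, addr in inventory.items()
        if ipaddress.ip_address(addr) in NETBIRD_OVERLAY
    )


# Hypothetical inventory: names and addresses are illustrative only.
inventory = {"srv-a": "192.168.10.11", "srv-b": "100.64.0.7"}
print(hosts_on_mesh(inventory))
```

Any host the function returns would be reached over the mesh rather than the LAN, so a coordinator outage could block Ansible runs against it.
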

## Status

Designed, not built: depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.

## What was ruled out

| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support, which is weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg problem that ADR-015 escapes, and dies with the homelab. Use `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |

See also: ADR-007 (network, amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage: V4 ran WireGuard; NetBird is translated, not transplanted).