Resolves ADR-015 deferred item #1: the mesh VPN is NetBird, self-hosted on askari, replacing ADR-007's VLAN-99 OPNsense WireGuard. Agent-per-host enrollment via base, embedded local-user IdP, coordinator off-site for outage survival. Basis for ADR-016. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Design — Mesh VPN (NetBird, self-hosted on `askari`)

- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
  R3 "pending VPN choice" placeholder
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
- **Becomes:** ADR-016 (this design is the basis for that ADR)

---

## Problem

`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
without exposing anything to the public internet. ADR-015 left the access mechanism —
the "mesh VPN" — deferred to this discussion.

Meanwhile ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to
OPNsense, plus road-warrior clients. And `docs/CAPABILITIES.md` already flags the open
question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
alternative to weigh."*

So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
Tailscale), with an architectural sub-question of whether a mesh replaces or coexists
with the ADR-007 WireGuard.

## Decisions (as settled)

1. **Scope — the mesh *replaces* WireGuard.** A single overlay becomes the sole
   remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's
   VLAN-99 OPNsense WireGuard design is retired.
2. **Control plane — self-hosted, on `askari`.** Maximum sovereignty (boma already
   self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
   coordinator that survives a homelab outage and stays out of the cluster it
   administers.
3. **Tool — NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
   self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
   means Headscale, a separate third-party reimplementation with partial parity — ruled
   out below.)
4. **Routing — NetBird agent on every (Linux) host**, not a subnet router. At boma's
   scale (2–5 hosts, treated as individuals) the usual "agent everywhere" downside is
   moot, and the `base` role already runs on every host, so enrollment is one uniform
   role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs
   that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is not a
   first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a
   single advertised route or by administering OPNsense from an on-LAN meshed peer.
5. **Identity — embedded local users** (Dex, built into the management container), not
   a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
   documented future option.

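
The routing argument in decision 4 can be illustrated with a toy reachability model. This is a sketch for reasoning only, not NetBird behaviour: the host names come from this design, while `reachable_via_subnet_router` and `reachable_agent_per_host` are hypothetical helpers invented for the illustration.

```python
from typing import Set


def reachable_via_subnet_router(up: Set[str], router: str, targets: Set[str]) -> Set[str]:
    """Targets a remote peer can reach when one host forwards all traffic."""
    if router not in up:
        return set()  # router down: every target is unreachable, healthy or not
    return targets & up


def reachable_agent_per_host(up: Set[str], targets: Set[str]) -> Set[str]:
    """Targets a remote peer can reach when each host runs its own agent."""
    return targets & up


# Scenario from the design: ubongo is down, the services themselves are healthy.
targets = {"dns1", "dns2", "proxy"}
up = {"dns1", "dns2", "proxy"}

print(sorted(reachable_via_subnet_router(up, "ubongo", targets)))  # []
print(sorted(reachable_agent_per_host(up, targets)))  # ['dns1', 'dns2', 'proxy']
```

With a subnet router on `ubongo`, a single host failure hides every healthy service from `askari`; with an agent per host, reachability degrades one host at a time.
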
## Verified facts (ADR-014)

> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
>   the core services are **merged into a single container**; deploy via Docker Compose.
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
>   external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
>   required.
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
>
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
>   elsewhere. Fully open source, self-hostable, no open-core feature gating.

---

## Architecture & topology

A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird
on `askari`.

**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
cluster per ADR-007) runs the **NetBird management stack** (single container:
management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 and
UDP 3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh survive a
full homelab outage and keeps the coordinator out of the cluster it administers (no
chicken-and-egg).

**Peers:**
- `askari` — coordinator + peer.
- `ubongo` (control/AI-worker host) — agent.
- All Linux managed hosts (`dns1/2`, `proxy`, …) — agent via the `base` role.
- Road-warrior clients — `mamba`, phone, work PC — agent/app.
- OPNsense / `mgmt` — the single non-agent exception (advertised route or LAN-side
  admin from a meshed peer).

**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
ACLs instead of OPNsense routing `10.99.0.0/24`.

---

## Security model, ACLs, and attack surface

**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
- road-warrior clients → only what each needs; nothing by default.

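
The three policy lines above can be modelled as a small default-deny rule table. This is a sketch of the intent, not NetBird's policy API: the `Rule` type, the group names as strings, and the choice of port 9100 (node_exporter's default) to stand in for "metrics ports" are all assumptions made for illustration.

```python
from typing import FrozenSet, NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    src: str                          # source peer group
    dst: str                          # destination group
    ports: Optional[FrozenSet[int]]   # None means "any port"


# Assumption: 9100 (node_exporter's default port) stands in for "metrics ports".
METRICS_PORTS = frozenset({9100})

POLICY = (
    Rule("vpn", "srv", METRICS_PORTS),
    Rule("admin", "srv", None),
    Rule("admin", "mgmt", None),
    # No rule for "road-warrior": nothing is granted until a rule is added.
)


def is_allowed(src: str, dst: str, port: int, policy: Sequence[Rule] = POLICY) -> bool:
    """Default-deny: traffic passes only if some rule explicitly matches it."""
    return any(
        rule.src == src and rule.dst == dst and (rule.ports is None or port in rule.ports)
        for rule in policy
    )


print(is_allowed("vpn", "srv", 9100))          # True
print(is_allowed("vpn", "srv", 22))            # False
print(is_allowed("admin", "mgmt", 22))         # True
print(is_allowed("road-warrior", "srv", 443))  # False
```

Writing the policy as data makes the default-deny property easy to check: a group with no rule gets no access, with no explicit deny needed.
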
**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
role. Prefer ephemeral/scoped keys (ADR-002).

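
As a rough illustration of what non-interactive enrollment involves, the sketch below builds the client command line and a log-safe rendering of it. It assumes the NetBird client's `netbird up` subcommand with its `--setup-key` and `--management-url` flags; check both against the pinned client version before relying on them. The helper names, the placeholder key, and the `netbird.example.invalid` URL are hypothetical, and the actual `base` role task may look quite different.

```python
from typing import List


def netbird_up_argv(setup_key: str, management_url: str) -> List[str]:
    """Build the argv for a non-interactive `netbird up` enrollment."""
    if not setup_key.strip():
        raise ValueError("setup key is empty; refusing to enroll")
    if not management_url.startswith("https://"):
        raise ValueError("management URL must use https")
    return [
        "netbird", "up",
        "--setup-key", setup_key,
        "--management-url", management_url,
    ]


def redacted(argv: List[str]) -> str:
    """Render argv for logs with the setup key masked."""
    out = list(argv)
    key_index = out.index("--setup-key") + 1
    out[key_index] = "****"
    return " ".join(out)


# Placeholder values only; the real key would come from the vault.
argv = netbird_up_argv("EXAMPLE-KEY", "https://netbird.example.invalid")
print(redacted(argv))
# netbird up --setup-key **** --management-url https://netbird.example.invalid
```

Whatever form the role task takes, the setup key is a credential, so it should stay out of task output and logs, as `redacted` does here.
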
**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base`
role's nftables default-deny allows inbound admin (SSH) **only on `wt0`**, denied on
the physical NIC — the pattern ADR-015 set for `ubongo`, now applied fleet-wide.
Mesh + nftables are defence-in-depth.

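
A minimal sketch of the "SSH only on `wt0`" pattern, rendered as an nftables input chain from Python. The function name, table name, and default SSH port 22 are assumptions for illustration; a real `base` ruleset would carry more (ICMP, per-service ports, logging) and would most likely be an Ansible template, not generated code.

```python
def render_input_chain(mesh_iface: str = "wt0", ssh_port: int = 22) -> str:
    """Render a default-deny input chain that accepts SSH on the mesh interface only."""
    lines = [
        "table inet filter {",
        "  chain input {",
        "    type filter hook input priority 0; policy drop;",
        "    ct state established,related accept",
        '    iifname "lo" accept',
        # The only SSH rule is bound to the mesh interface, so SSH arriving
        # on the physical NIC falls through to the drop policy.
        f'    iifname "{mesh_iface}" tcp dport {ssh_port} accept',
        "  }",
        "}",
    ]
    return "\n".join(lines)


print(render_input_chain())
```

Because the chain's policy is `drop`, there is no explicit "deny SSH on the physical NIC" rule: the absence of an accept rule for that interface is what blocks it.
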
**The new attack surface — a public control plane on `askari`.** Today `askari`
exposes a WireGuard UDP port; with NetBird self-hosted it exposes the
**management API + dashboard (TCP 80/443)** and **Coturn (UDP 3478)** publicly, and the
management API is keys-to-the-kingdom for the whole mesh. Mitigations baked in:
- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
  practical.
- `askari` runs `base` hardening (already a public managed host) and NetBird is
  **version-pinned** (ADR-011) and patched on boma's cadence — self-hosting means
  owning the CVE cadence (AGPLv3 server).

Net vs ADR-002: nothing from the **cluster** is publicly exposed; the only public
surface is on `askari` (a public VPS by design), shifting from "WireGuard port" to
"NetBird control plane."

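
One way to keep that public surface honest is to compare what `askari` actually exposes against the three ports this design lists. The sketch below assumes the observed listeners are gathered elsewhere (for example from an external port scan) and passed in as `(protocol, port)` pairs; `audit_exposure` and `EXPECTED_PUBLIC` are hypothetical names, and any other port `askari` legitimately needs would have to be added to the expected set.

```python
from typing import Dict, Iterable, List, Tuple

Listener = Tuple[str, int]  # (protocol, port)

# The public surface this design expects on askari.
EXPECTED_PUBLIC = frozenset({("tcp", 80), ("tcp", 443), ("udp", 3478)})


def audit_exposure(observed: Iterable[Listener]) -> Dict[str, List[Listener]]:
    """Report listeners that should not be public and expected ones that are absent."""
    seen = set(observed)
    return {
        "unexpected": sorted(seen - EXPECTED_PUBLIC),
        "missing": sorted(EXPECTED_PUBLIC - seen),
    }


report = audit_exposure([("tcp", 22), ("tcp", 443), ("udp", 3478)])
print(report)
# {'unexpected': [('tcp', 22)], 'missing': [('tcp', 80)]}
```

In this example a publicly reachable SSH port is flagged as unexpected, which matches the rule that SSH is accepted only on `wt0`.
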

---

## Recovery, bootstrap ordering, and operations

**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
`ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and
there is no chicken-and-egg in the critical path.

**Bootstrap order** (askari-first):
1. Stand up the NetBird coordinator on `askari`.
2. Enroll `ubongo`.
3. `base` role enrolls the rest of the fleet via setup keys from vault.

**Recovery.** Coordinator off-site on `askari` ⇒ the mesh survives a full homelab
outage. Two must-haves:
- **Back up NetBird's management datastore** off `askari` — encrypted, synced to
  `ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
- Existing peer tunnels keep running on last-known config through a brief coordinator
  outage; only changes/new enrollments need it live — so `askari` is important but not
  instantly fatal.

**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
service = one role per ADR-004, with its own `SECURITY.md` per the service-role
standard). Agent install/enrollment lives in `base`.

**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
NetBird's built-in DNS is scoped/off to avoid overlap. NetBird server (on `askari`)
and agents (via `base`) are version-pinned (ADR-011).

---

## Documentation & implementation changes

This is a substantial decision → its own ADR, with amendments linking to it.

| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided — NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |

**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
decision and the doc reconciliation; the role tasks land when `base` is built.

---

## Deferred / out of scope

1. **External SSO IdP** (Zitadel/Keycloak) — embedded local users now; SSO later if a
   second operator or service-SSO need appears.
2. **OPNsense mesh integration specifics** — the exact `mgmt` reachability mechanism
   (single advertised route vs LAN-side admin) is settled during implementation when
   OPNsense automation is built.
3. **The `base` / `netbird_coordinator` role implementation** — depends on the
   unbuilt `base` role and service-role standard.

---

## What was ruled out

| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support — weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |

See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
ADR-013 (heritage — V4 used WireGuard; NetBird is translated, not transplanted).