# Design: Mesh VPN (NetBird, self-hosted on `askari`)

- **Date:** 2026-06-05
- **Status:** Approved design, pending implementation plan
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md` R3 "pending VPN choice" placeholder
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
- **Becomes:** ADR-016 (this design is the basis for that ADR)

---

## Problem

`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone) without exposing anything to the public internet. ADR-015 left the access mechanism (the "mesh VPN") deferred to this discussion.

Meanwhile ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to OPNsense, plus road-warrior clients. And `docs/CAPABILITIES.md` already flags the open question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh."*

So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs Tailscale), with an architectural sub-question of whether a mesh replaces or coexists with the ADR-007 WireGuard.

## Decisions (as settled)

1. **Scope: the mesh *replaces* WireGuard.** A single overlay becomes the sole remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's VLAN-99 OPNsense WireGuard design is retired.
2. **Control plane: self-hosted, on `askari`.** Maximum sovereignty (boma already self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site coordinator that survives a homelab outage and stays out of the cluster it administers.
3. **Tool: NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path means Headscale, a separate third-party reimplementation with partial parity; ruled out below.)
4. **Routing: NetBird agent on every (Linux) host**, not a subnet router. At boma's scale (2–5 hosts, treated as individuals) the usual "agent everywhere" downside is moot, and the `base` role already runs on every host, so enrollment is one uniform role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs that match ADR-007's firewall intent.

   **One exception:** OPNsense (FreeBSD) is not a first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a single advertised route or by administering OPNsense from an on-LAN meshed peer.
5. **Identity: embedded local users** (Dex, built into the management container), not a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a documented future option.

## Verified facts (ADR-014)

> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
>   the core services are **merged into a single container**; deploy via Docker Compose.
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
>   external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
>   required.
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
>
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
>   elsewhere. Fully open source, self-hostable, no open-core feature gating.

---

## Architecture & topology

A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99 WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird on `askari`.

**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the cluster per ADR-007) runs the **NetBird management stack** (single container: management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 + UDP 3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh survive a full homelab outage and keeps the coordinator out of the cluster it administers (no chicken-and-egg).

**Peers:**

- `askari`: coordinator + peer.
- `ubongo` (control/AI-worker host): agent.
- All Linux managed hosts (`dns1/2`, `proxy`, …): agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC): agent/app.
- OPNsense / `mgmt`: the single non-agent exception (advertised route or LAN-side admin from a meshed peer).

**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the `10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird ACLs instead of OPNsense routing `10.99.0.0/24`.

---

## Security model, ACLs, and attack surface

**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):

- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
- road-warrior clients → only what each needs; nothing by default.

**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base` role. Prefer ephemeral/scoped keys (ADR-002).

**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base` role's nftables default-deny allows inbound admin (SSH) **only on `wt0`**, denied on the physical NIC. This is the pattern ADR-015 set for `ubongo`, now applied fleet-wide. Mesh + nftables are defence-in-depth.
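
The default-deny ACL intent listed earlier in this section can be sketched as a small lookup, which may help when reviewing the policy before it is entered in NetBird. This is only an illustrative model under stated assumptions: the group names, the `Rule` type, the `is_allowed` helper, and metrics port 9100 are hypothetical and are not NetBird's API or configuration format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    """One explicit allow rule; anything not matched by a rule is denied."""
    source_group: str
    destination: str
    ports: Optional[FrozenSet[int]]  # None means every port on the destination


# Hypothetical encoding of the three policy bullets.
# Port 9100 is an assumed metrics port, not taken from the design.
POLICY = (
    Rule("vpn", "srv", frozenset({9100})),  # monitoring scope only
    Rule("admin", "srv", None),             # administration
    Rule("admin", "mgmt", None),            # administration
    # Road-warrior clients get no rule here: nothing by default.
)


def is_allowed(source_group: str, destination: str, port: int,
               policy=POLICY) -> bool:
    """Return True only when an explicit rule matches (default-deny)."""
    for rule in policy:
        if rule.source_group == source_group and rule.destination == destination:
            if rule.ports is None or port in rule.ports:
                return True
    return False
```

Under these assumptions, `is_allowed("vpn", "srv", 22)` is false while `is_allowed("admin", "mgmt", 22)` is true, which matches the intent that monitoring peers see metrics ports only.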

**The new attack surface: a public control plane on `askari`.** Today `askari` exposes a WireGuard UDP port; with NetBird self-hosted it exposes the **management API + dashboard (80/443)** and **Coturn (3478)** publicly, and the management API is keys-to-the-kingdom for the whole mesh. Mitigations baked in:

- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where practical.
- `askari` runs `base` hardening (already a public managed host) and NetBird is **version-pinned** (ADR-011) and patched on boma's cadence. Self-hosting means owning the CVE cadence (AGPLv3 server).

Net vs ADR-002: nothing from the **cluster** is publicly exposed; the only public surface is on `askari` (a public VPS by design), shifting from "WireGuard port" to "NetBird control plane."

---

## Recovery, bootstrap ordering, and operations

**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to `ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and there is no chicken-and-egg in the critical path.

**Bootstrap order** (askari-first):

1. Stand up the NetBird coordinator on `askari`.
2. Enroll `ubongo`.
3. `base` role enrolls the rest of the fleet via setup keys from vault.

**Recovery.** Coordinator off-site on `askari` ⇒ the mesh survives a full homelab outage. Two must-haves:

- **Back up NetBird's management datastore** off `askari`: encrypted, synced to `ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
- Existing peer tunnels keep running on last-known config through a brief coordinator outage; only changes/new enrollments need it live, so `askari` is important but not instantly fatal.
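
The askari-first bootstrap order in this section can be expressed as a dependency graph, which makes the "no chicken-and-egg" property checkable: a cyclic dependency would raise an error rather than produce an order. This is a minimal sketch; the step names are hypothetical labels for the three listed stages, and treating fleet enrollment as dependent on `ubongo` simply follows the numbered order.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must complete before it.
# Step names are illustrative labels, not role or task names from the repo.
BOOTSTRAP_DEPENDENCIES = {
    "coordinator_on_askari": set(),
    "enroll_ubongo": {"coordinator_on_askari"},
    "enroll_fleet_via_base": {"enroll_ubongo"},
}


def bootstrap_order(dependencies=BOOTSTRAP_DEPENDENCIES):
    """Return the steps in an order that respects every dependency.

    graphlib raises CycleError if the graph contains a cycle, i.e. if a
    chicken-and-egg dependency were introduced.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Calling `bootstrap_order()` on this linear chain yields the coordinator first, then `ubongo`, then the rest of the fleet.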

**`askari` becomes Ansible-managed.** It joins the inventory under its own group and gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one service = one role per ADR-004, with its own `SECURITY.md` per the service-role standard). Agent install/enrollment lives in `base`.

**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`; NetBird's built-in DNS is scoped/off to avoid overlap. NetBird server (on `askari`) and agents (via `base`) are version-pinned (ADR-011).

---

## Documentation & implementation changes

This is a substantial decision → its own ADR, with amendments linking to it.

| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29) as decided: NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |

**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base` and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the decision and the doc reconciliation; the role tasks land when `base` is built.

---

## Deferred / out of scope

1. **External SSO IdP** (Zitadel/Keycloak): embedded local users now; SSO later if a second operator or service-SSO need appears.
2. **OPNsense mesh integration specifics:** the exact `mgmt` reachability mechanism (single advertised route vs LAN-side admin) is settled during implementation when OPNsense automation is built.
3. **The `base` / `netbird_coordinator` role implementation:** depends on the unbuilt `base` role and service-role standard.

---

## What was ruled out

| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support; weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |

See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff), ADR-013 (heritage: V4 used WireGuard; NetBird is translated, not transplanted).