# Design: Mesh VPN (NetBird, self-hosted on `askari`)

- **Date:** 2026-06-05
- **Status:** Approved design, pending implementation plan
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
  R3 "pending VPN choice" placeholder
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
- **Becomes:** ADR-016 (this design is the basis for that ADR)

---
## Problem

`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
without exposing anything to the public internet. ADR-015 deferred the access
mechanism, the "mesh VPN", to this discussion.

Meanwhile, ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to
OPNsense, alongside road-warrior clients. `docs/CAPABILITIES.md` already flags the open
question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
alternative to weigh."*

So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
Tailscale), with an architectural sub-question: does a mesh replace the ADR-007
WireGuard or coexist with it?
## Decisions (as settled)

1. **Scope: the mesh *replaces* the ADR-007 WireGuard design.** A single overlay
   becomes the sole remote-access path for `ubongo`, `askari`, and road-warrior
   clients. ADR-007's VLAN-99 OPNsense WireGuard design is retired.
2. **Control plane: self-hosted, on `askari`.** Maximum sovereignty (boma already
   self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
   coordinator that survives a homelab outage and stays out of the cluster it
   administers.
3. **Tool: NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
   self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
   means Headscale, a separate third-party reimplementation with partial parity; it is
   ruled out below.)
4. **Routing: NetBird agent on every (Linux) host**, not a subnet router. At boma's
   scale (2–5 hosts, treated as individuals) the usual "agent everywhere" downside is
   moot, and the `base` role already runs on every host, so enrollment is one uniform
   role task. This avoids a routing single point of failure and gives granular per-peer
   ACLs that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is
   not a first-class NetBird agent target, so `mgmt`/gateway reachability is handled by
   a single advertised route or by administering OPNsense from an on-LAN meshed peer.
5. **Identity: embedded local users** (Dex, built into the management container), not
   a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
   documented future option.
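
The routing decision (agent per host rather than a subnet router) can be illustrated with a small reachability model. This is only a sketch under stated assumptions: the host names come from this document, while the `reachable` helper and the way the topology is encoded are hypothetical and exist purely to show why a relay host becomes a single point of failure.

```python
# Hypothetical model: host names come from this document, the topology
# encoding is an assumption made for illustration only.

def reachable(src, dst, up, via=None):
    """Return True if src can reach dst over the overlay.

    up  -- set of hosts currently running
    via -- optional subnet router that must be up to relay traffic to dst;
           None means dst runs its own agent (direct peer-to-peer tunnel).
    """
    if src not in up or dst not in up:
        return False
    return via is None or via in up


up = {"askari", "dns1", "proxy"}  # ubongo is down, services are healthy

# Subnet-router design: askari reaches srv hosts only through ubongo.
router_design = reachable("askari", "dns1", up, via="ubongo")

# Agent-per-host design: dns1 is itself a mesh peer.
agent_design = reachable("askari", "dns1", up)

print(router_design, agent_design)
```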
## Verified facts (ADR-014)

> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
>   the core services are **merged into a single container**; deploy via Docker Compose.
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
>   external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
>   required.
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
>
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
>   elsewhere. Fully open source, self-hostable, no open-core feature gating.

---
## Architecture & topology

A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
WireGuard. The data plane is peer-to-peer WireGuard; the control plane is self-hosted
NetBird on `askari`.

**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
cluster per ADR-007) runs the **NetBird management stack** (single container:
management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 and
UDP 3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh
survive a full homelab outage and keeps the coordinator out of the cluster it
administers (no chicken-and-egg).

**Peers:**
- `askari`: coordinator + peer.
- `ubongo` (control/AI-worker host): agent.
- All Linux managed hosts (`dns1/2`, `proxy`, …): agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC): agent/app.
- OPNsense / `mgmt`: the single non-agent exception (advertised route or LAN-side
  admin from a meshed peer).

**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
ACLs instead of OPNsense routing `10.99.0.0/24`.

---
## Security model, ACLs, and attack surface

**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
- road-warrior clients → only what each needs; nothing by default.
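
The default-deny intent in the bullets above can be modelled as a lookup that permits traffic only when an explicit rule exists. This is a minimal sketch, not NetBird's policy format: the `ALLOW` table and `is_allowed` helper are hypothetical, and the port numbers (9100 as a stand-in metrics port, 22 for SSH administration) are assumptions chosen for illustration.

```python
# Hypothetical policy table mirroring the three bullets above. Group names
# follow this document; the port numbers are illustrative assumptions
# (9100 as a stand-in metrics port, 22 for SSH), and road-warrior clients
# have no rule at all, so everything is denied for them by default.
ALLOW = {
    ("vpn", "srv"): {9100},
    ("admin", "srv"): {22},
    ("admin", "mgmt"): {22},
}


def is_allowed(src_group, dst_group, port):
    """Default-deny: permit only when an explicit rule lists the port."""
    return port in ALLOW.get((src_group, dst_group), set())


print(is_allowed("vpn", "srv", 9100))         # monitoring scrape
print(is_allowed("vpn", "srv", 22))           # no SSH for monitoring peers
print(is_allowed("road-warrior", "srv", 22))  # nothing by default
```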

**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
role. Prefer ephemeral/scoped keys (ADR-002).
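
As a rough illustration of non-interactive enrollment, the sketch below builds the `netbird up` command line from a vault-shaped dictionary. Only the `vault.netbird.setup_key` path comes from this document; the key value, the management URL, and the `enroll_argv` helper are placeholders, and the eventual `base` role may well do this differently.

```python
import shlex

# Hypothetical vault content; only the vault.netbird.setup_key path comes from
# this document. The key value and management URL are placeholders.
vault = {"netbird": {"setup_key": "EXAMPLE-SETUP-KEY"}}
MANAGEMENT_URL = "https://netbird.example.org"


def enroll_argv(vault, management_url):
    """Build the argv for a non-interactive `netbird up` enrollment."""
    return [
        "netbird", "up",
        "--setup-key", vault["netbird"]["setup_key"],
        "--management-url", management_url,
    ]


argv = enroll_argv(vault, MANAGEMENT_URL)
# Print a copy with the secret masked so the key never reaches logs.
masked = [a if a != vault["netbird"]["setup_key"] else "REDACTED" for a in argv]
print(shlex.join(masked))
```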

**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base`
role's nftables default-deny ruleset allows inbound admin traffic (SSH) **only on
`wt0`** and denies it on the physical NIC: the pattern ADR-015 set for `ubongo`, now
applied fleet-wide. Mesh and nftables together are defence-in-depth.
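
The "SSH only on `wt0`" rule can be sketched as a small template renderer. This is an assumption-laden illustration rather than the `base` role's actual ruleset: the table and chain names, the default SSH port, and the `render_input_chain` helper are all hypothetical, and a real ruleset would need further rules (for example, for the mesh's own transport traffic).

```python
MESH_IFACE = "wt0"  # interface name taken from this document
SSH_PORT = 22       # assumption: sshd listens on the default port


def render_input_chain(mesh_iface, ssh_port):
    """Render an nftables table whose input chain drops by default and
    accepts new SSH connections only when they arrive on the mesh interface."""
    return "\n".join([
        "table inet filter {",
        "  chain input {",
        "    type filter hook input priority 0; policy drop;",
        "    iif lo accept",
        "    ct state established,related accept",
        f'    iifname "{mesh_iface}" tcp dport {ssh_port} accept',
        "  }",
        "}",
    ])


ruleset = render_input_chain(MESH_IFACE, SSH_PORT)
print(ruleset)
```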

**The new attack surface: a public control plane on `askari`.** Today `askari`
exposes a WireGuard UDP port; with NetBird self-hosted it publicly exposes the
**management API + dashboard (80/443)** and **Coturn (3478)**, and the management API
is keys-to-the-kingdom for the whole mesh. Mitigations baked in:
- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
  practical.
- `askari` runs `base` hardening (already a public managed host) and NetBird is
  **version-pinned** (ADR-011) and patched on boma's cadence. Self-hosting means
  owning the CVE cadence (AGPLv3 server).

Net vs ADR-002: nothing from the **cluster** is publicly exposed; the only public
surface is on `askari` (a public VPS by design), shifting from "WireGuard port" to
"NetBird control plane."

---
## Recovery, bootstrap ordering, and operations

**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
`ubongo` and the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs
and there is no chicken-and-egg in the critical path.

**Bootstrap order** (askari-first):
1. Stand up the NetBird coordinator on `askari`.
2. Enroll `ubongo`.
3. `base` role enrolls the rest of the fleet via setup keys from vault.
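
The askari-first order above can be expressed as a dependency graph and resolved with a topological sort. This is a sketch only: the step names are illustrative labels for the three numbered items, not real playbook or role names, and treating each step as depending on the previous one is an assumption read from the list order.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it depends on. Step names are illustrative
# labels for the three numbered items above, not real playbook names.
DEPENDS_ON = {
    "coordinator_on_askari": set(),
    "enroll_ubongo": {"coordinator_on_askari"},
    "enroll_fleet_via_base": {"enroll_ubongo"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```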

**Recovery.** With the coordinator off-site on `askari`, the mesh survives a full
homelab outage. Two points matter:
- **Back up NetBird's management datastore** off `askari`: encrypted, synced to
  `ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
- Existing peer tunnels keep running on last-known config through a brief coordinator
  outage; only changes and new enrollments need it live. So `askari` is important but
  not instantly fatal.

**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
service = one role per ADR-004, with its own `SECURITY.md` per the service-role
standard). Agent install/enrollment lives in `base`.

**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
NetBird's built-in DNS is scoped down or switched off to avoid overlap. NetBird server
(on `askari`) and agents (via `base`) are version-pinned (ADR-011).

---
## Documentation & implementation changes

This is a substantial decision, so it gets its own ADR, with amendments linking to it.

| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided, NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |

**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
decision and the doc reconciliation; the role tasks land when `base` is built.

---
## Deferred / out of scope

1. **External SSO IdP** (Zitadel/Keycloak): embedded local users now; SSO later if a
   second operator or service-SSO need appears.
2. **OPNsense mesh integration specifics**: the exact `mgmt` reachability mechanism
   (single advertised route vs LAN-side admin) is settled during implementation when
   OPNsense automation is built.
3. **The `base` / `netbird_coordinator` role implementation**: depends on the
   unbuilt `base` role and service-role standard.

---
## What was ruled out

| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support; that is weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |

See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
ADR-013 (heritage: V4 used WireGuard; NetBird is translated, not transplanted).