Compare commits

10 commits

Author SHA1 Message Date
cd62c5e098 new-host runbook: mesh VPN resolved to NetBird (ADR-016) 2026-06-05 11:52:22 +02:00
ed9fdcc10a CLAUDE.md: link ADR-016 (mesh VPN) 2026-06-05 11:51:36 +02:00
787aa3b8e1 STATUS: record NetBird mesh (coordinator + base enrollment) 2026-06-05 11:50:53 +02:00
841f666de9 CAPABILITIES: VPN decided — NetBird self-hosted (ADR-016) 2026-06-05 11:50:04 +02:00
08165ffb68 accepted-risks: R3 now the concrete NetBird coordinator risk 2026-06-05 11:48:58 +02:00
2ae5cf4535 ADR-015: resolve mesh-VPN deferral — NetBird on askari (ADR-016) 2026-06-05 11:48:04 +02:00
5a32dd46d3 ADR-007: retire VLAN-99 WireGuard for the NetBird mesh (ADR-016) 2026-06-05 11:47:03 +02:00
ff796c64ca Add ADR-016 (mesh VPN — NetBird self-hosted on askari) 2026-06-05 11:45:45 +02:00
4b85b14f1f Add implementation plan for NetBird mesh VPN
Task-by-task docs plan: author ADR-016 and reconcile ADR-007 (retire VLAN-99
WireGuard), ADR-015 (resolve deferred #1), accepted-risks R3, CAPABILITIES,
STATUS, CLAUDE.md. Documentation-only; role/deployment waits on the base role.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:44:05 +02:00
99ace3eb48 Add design spec for mesh VPN (NetBird self-hosted on askari)
Resolves ADR-015 deferred item #1: the mesh VPN is NetBird, self-hosted on
askari, replacing ADR-007's VLAN-99 OPNsense WireGuard. Agent-per-host
enrollment via base, embedded local-user IdP, coordinator off-site for
outage survival. Basis for ADR-016.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 10:58:35 +02:00
10 changed files with 826 additions and 26 deletions

View file

@ -202,6 +202,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` | | Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
| Terraform | `docs/decisions/006-terraform.md` | | Terraform | `docs/decisions/006-terraform.md` |
| Network topology | `docs/decisions/007-network.md` | | Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
| Testing methodology | `docs/decisions/008-testing.md` | | Testing methodology | `docs/decisions/008-testing.md` |
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` | | TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` | | Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |

View file

@ -53,6 +53,8 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) | | CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built | | Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. | | `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
| NetBird mesh — coordinator on `askari` | ADR-016 | Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Decision recorded; not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
## Keeping this honest ## Keeping this honest

View file

@ -26,7 +26,7 @@ decisions this frame enables.
|---|---|---|---|---|---| |---|---|---|---|---|---|
| Reverse proxy / TLS | Traefik | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) | | Reverse proxy / TLS | Traefik | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
| Internal DNS | `dns` role → dns1/dns2 | P | core | Authoritative internal zone (ADR-007) | Ansible-rendered zone | | Internal DNS | `dns` role → dns1/dns2 | P | core | Authoritative internal zone (ADR-007) | Ansible-rendered zone |
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh | | VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
| Service portal / dashboard | Homepage | A | candidate | One landing page listing all services — a "what does what" front door | Gap surfaced by V4; fits boma's legibility goal | | Service portal / dashboard | Homepage | A | candidate | One landing page listing all services — a "what does what" front door | Gap surfaced by V4; fits boma's legibility goal |
_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_

View file

@ -47,7 +47,7 @@ ISP
| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. | | 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. | | 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. | | 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. | | 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
--- ---
@ -102,13 +102,14 @@ Assigned infrastructure addresses:
| `10.50.0.1` | OPNsense gateway | | `10.50.0.1` | OPNsense gateway |
| `10.50.0.100`–`.249` | DHCP pool | | `10.50.0.100`–`.249` | DHCP pool |
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard ### VLAN 99 — vpn — retired
| Address | Host | The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
|---|---| (ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
| `10.99.0.1` | OPNsense (WireGuard endpoint) | self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
| `10.99.0.2` | `askari` (Hetzner VPS) | NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
| `10.99.0.10`+ | Road-warrior clients | (default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
`10.99.0.0/24` is freed.
### Corosync ring (172.16.0.0/24) — not on managed switch ### Corosync ring (172.16.0.0/24) — not on managed switch
@ -132,8 +133,8 @@ Assigned infrastructure addresses:
| `iot` | internet | allow egress only | | `iot` | internet | allow egress only |
| `iot` | `srv` (HA IP only) | allow on integration ports | | `iot` | `srv` (HA IP only) | allow on integration ports |
| `guest` | internet | allow, isolated from all internal | | `guest` | internet | allow, isolated from all internal |
| `vpn` | `srv` (metrics ports) | allow (monitoring) | | mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
| `vpn` | `mgmt` | allow (administration from askari) | | mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |
**Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required **Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required
ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device discovery. ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device discovery.
@ -176,11 +177,12 @@ All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).
## External monitoring — askari ## External monitoring — askari
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`). `askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt` metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
for administration. ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
`askari` is provisioned and managed independently of the Proxmox cluster — it `askari` is provisioned and managed independently of the Proxmox cluster — it must
must be reachable even when the homelab is down (its entire purpose). be reachable even when the homelab is down (its entire purpose), which is also why
the mesh coordinator lives here: an off-site control plane survives a homelab outage.
FQDN: `askari.baobab.band`. FQDN: `askari.baobab.band`.
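
The addressing change above can be sanity-checked with a small sketch. It assumes only the two ranges named in the text (NetBird's default overlay `100.64.0.0/10` and the retired `10.99.0.0/24`); the helper name `classify_address` is hypothetical and not part of the repo.

```python
import ipaddress

# Ranges quoted from the amended ADR-007 text; everything else here is illustrative.
NETBIRD_DEFAULT_OVERLAY = ipaddress.ip_network("100.64.0.0/10")
RETIRED_VPN_SUBNET = ipaddress.ip_network("10.99.0.0/24")


def classify_address(addr: str) -> str:
    """Return 'mesh', 'retired-vpn', or 'other' for a dotted-quad address."""
    ip = ipaddress.ip_address(addr)
    if ip in NETBIRD_DEFAULT_OVERLAY:
        return "mesh"
    if ip in RETIRED_VPN_SUBNET:
        return "retired-vpn"
    return "other"
```

A check like this could help spot leftover `10.99.0.x` peer addresses in inventory or monitoring config once the WireGuard subnet is retired.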

View file

@ -63,14 +63,15 @@ Manual, on bare metal:
1. Install Debian 13 on the box (one-time, by hand). 1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock. 2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN (choice deferred — see below). 3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for 4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
baseline config via the `control` group (`base` role only). baseline config via the `control` group (`base` role only).
### Access & security ### Access & security
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the - Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
mesh; nothing is published to the public internet — this stays inside ADR-002. SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban, - `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**, auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
denied on the physical NIC. denied on the physical NIC.
@ -109,10 +110,9 @@ master password.
## Deferred (separate specs / discussions) ## Deferred (separate specs / discussions)
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery 1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a (off-site, so it survives a homelab outage and stays out of the cluster it
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet, administers). Replaces ADR-007's OPNsense WireGuard.
or it recreates the chicken-and-egg.
2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user 2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user
generation, screenshot-back-to-Claude, and the new ADR-008 level. generation, screenshot-back-to-Claude, and the new ADR-008 level.
3. **`rbw` offline-cache verification** — confirm offline decryption before relying 3. **`rbw` offline-cache verification** — confirm offline decryption before relying

View file

@ -0,0 +1,105 @@
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
## Context
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.
## Decision
A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
The decision in five parts:
1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
survives a homelab outage and stays out of the cluster it administers.
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
parity) — ruled out below.
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (25
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity — embedded local users** (Dex in the management container); external SSO
(Zitadel/Keycloak) stays an optional future.
## Verified facts (ADR-014)
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.
## Architecture
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.
- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
- `ubongo` — agent.
- All Linux managed hosts — agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
- OPNsense / `mgmt` — single non-agent exception.
## Security
- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
Recorded as accepted-risk R3.
## Recovery & operations
- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
`base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is backed up encrypted off `askari` (synced to
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
## Status
Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).
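
The ACL intent in the Security section can be modelled as a default-deny rule check. This is a minimal sketch under stated assumptions: it is not NetBird's policy format or API, the `Flow` type and `is_allowed` function are hypothetical, and "clients → least privilege" is approximated by giving non-admin peers only the `srv` metrics rule.

```python
from dataclasses import dataclass

# Admin peers as named in the ADR's Security section.
ADMIN_PEERS = {"ubongo", "mamba"}


@dataclass(frozen=True)
class Flow:
    source: str         # mesh peer name (illustrative)
    destination: str    # target zone: "srv" or "mgmt"
    metrics_port: bool  # True when the target port is a metrics endpoint


def is_allowed(flow: Flow) -> bool:
    """Default-deny: a flow passes only if one of the rules below matches."""
    # Admin peers may reach both srv and mgmt.
    if flow.source in ADMIN_PEERS and flow.destination in {"srv", "mgmt"}:
        return True
    # Any mesh peer may reach srv, but only on metrics ports.
    if flow.destination == "srv" and flow.metrics_port:
        return True
    return False
```

For example, `is_allowed(Flow("askari", "srv", True))` passes under the metrics rule, while a phone reaching `mgmt` falls through to the default deny.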

View file

@ -128,8 +128,8 @@ provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
make collections # Ansible collections make collections # Ansible collections
rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md) rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
``` ```
4. Join the mesh VPN (choice deferred — see ADR-015) so it is reachable over SSH 4. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016) — so it is
from elsewhere. reachable over SSH from elsewhere.
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group. 5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.
Because `ubongo` is not in `local.vms`, this is the only case where editing Because `ubongo` is not in `local.vms`, this is the only case where editing

View file

@ -15,7 +15,7 @@ revisit (trigger).
|---|---|---|---| |---|---|---|---|
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise | | R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers | | R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes | | R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
_Last reviewed: 2026-06-05. The prior gaps (full CIS hardening, SELinux/AppArmor, _Last reviewed: 2026-06-05. The prior gaps (full CIS hardening, SELinux/AppArmor,
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS

View file

@ -0,0 +1,484 @@
# Mesh VPN (NetBird) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Record the decision that boma's mesh VPN is NetBird (self-hosted on `askari`), by authoring ADR-016 and reconciling every doc that currently assumes OPNsense WireGuard or an undecided VPN.
**Architecture:** Documentation-only change. NetBird replaces ADR-007's VLAN-99 OPNsense WireGuard as the single remote-access overlay for `ubongo`, `askari`, and road-warrior clients; coordinator self-hosted off-site on `askari`; agent-per-host enrollment via the (unbuilt) `base` role; embedded local-user identity. The role/service implementation waits on the `base` role and service-role machinery that STATUS.md lists as not-yet-built — this plan settles the decision and the doc reconciliation only.
**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus a final cross-reference/staleness sweep. No markdown linter exists, so "tests" are hook-pass + grep checks.
---
## Pre-flight (read once before starting)
- **`rbw` must be unlocked before every commit** (the pre-commit ansible-lint hook decrypts `vault.yml`). Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order matters:** Task 1 (ADR-016) lands first — every later task links to it.
- **Spec reference:** `docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md`.
- **Branch:** start by creating `chore/mesh-vpn-netbird-docs` off `main` (the controller does this before dispatching Task 1; do not implement on `main`).
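
The commit-style rule above (subject ≤72 chars) is easy to check mechanically. A minimal sketch, assuming the subject is the first line of the message; the function name `subject_ok` is hypothetical, and imperative mood is not checked here.

```python
MAX_SUBJECT_LEN = 72  # limit stated in the pre-flight list


def subject_ok(message: str) -> bool:
    """Return True if the first line of a commit message is non-empty and within the limit."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return 0 < len(subject) <= MAX_SUBJECT_LEN
```

Only the subject line is measured, so a long wrapped body does not fail the check.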
---
## File map
| File | Action | Responsibility after change |
|---|---|---|
| `docs/decisions/016-mesh-vpn.md` | Create | Home of record for the NetBird mesh decision |
| `docs/decisions/007-network.md` | Modify | VLAN-99 WireGuard retired; askari rides the mesh + hosts the coordinator |
| `docs/decisions/015-control-host.md` | Modify | Resolve deferred item #1 (mesh = NetBird on askari) |
| `docs/security/accepted-risks.md` | Modify | Replace R3 placeholder with the concrete residual risk |
| `docs/CAPABILITIES.md` | Modify | VPN row decided: NetBird, self-hosted |
| `STATUS.md` | Modify | Two rows: NetBird coordinator + agent enrollment (designed, not built) |
| `CLAUDE.md` | Modify | ADR-016 in Further reading |
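
The final staleness sweep over the files in this map could look roughly like the sketch below. The phrase list is an assumption drawn from wording this plan replaces (for example "choice deferred" in ADR-015 and "pending VPN choice" in R3), and `find_stale` is a hypothetical helper, not an existing repo script.

```python
# Phrases that should no longer appear once the docs are reconciled (assumed list).
STALE_PHRASES = (
    "choice deferred",
    "pending VPN choice",
    "connects via WireGuard to OPNsense",
)


def find_stale(docs: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (filename, line number, phrase) for every stale phrase found."""
    hits = []
    for name, text in docs.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for phrase in STALE_PHRASES:
                if phrase in line:
                    hits.append((name, lineno, phrase))
    return hits
```

In practice `docs` would be built by reading each path in the file map, for instance with `pathlib.Path(p).read_text()`; an empty result means the sweep is clean. Mentions of `10.99.0.0/24` are deliberately not flagged, since the amended ADR-007 still names the freed subnet.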
---
### Task 1: Author ADR-016 (the home of record)
**Files:**
- Create: `docs/decisions/016-mesh-vpn.md`
- [ ] **Step 1: Create the ADR file**
Create `docs/decisions/016-mesh-vpn.md` with exactly this content (preserve em-dashes —, backticks, table pipes, and the `verified:` stamps):
```markdown
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
## Context
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.
## Decision
A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
The decision in five parts:
1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
survives a homelab outage and stays out of the cluster it administers.
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
parity) — ruled out below.
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (25
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity — embedded local users** (Dex in the management container); external SSO
(Zitadel/Keycloak) stays an optional future.
## Verified facts (ADR-014)
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.
## Architecture
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.
- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
- `ubongo` — agent.
- All Linux managed hosts — agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
- OPNsense / `mgmt` — single non-agent exception.
## Security
- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
Recorded as accepted-risk R3.
## Recovery & operations
- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
`base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is backed up encrypted off `askari` (synced to
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
## Status
Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/016-mesh-vpn.md`
Expected: Passed/Skipped (ansible-lint Skipped for non-YAML).
```bash
git add docs/decisions/016-mesh-vpn.md
git commit -m "Add ADR-016 (mesh VPN — NetBird self-hosted on askari)"
```
---
### Task 2: Amend ADR-007 (retire VLAN-99 WireGuard, askari on the mesh)
**Files:**
- Modify: `docs/decisions/007-network.md`
Read the file first, then make FOUR exact edits. Preserve em-dashes —, backticks, table pipes.
- [ ] **Step 1: Update the VLAN-99 row in the VLAN design table**
Find:
```
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
```
Replace with:
```
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
```
- [ ] **Step 2: Replace the VLAN-99 addressing subsection**
Find:
```
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
| Address | Host |
|---|---|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
| `10.99.0.2` | `askari` (Hetzner VPS) |
| `10.99.0.10`+ | Road-warrior clients |
```
Replace with:
```
### VLAN 99 — vpn — retired
The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
`10.99.0.0/24` is freed.
```
- [ ] **Step 3: Update the two `vpn` rows in the OPNsense firewall-rules table**
Find:
```
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
| `vpn` | `mgmt` | allow (administration from askari) |
```
Replace with:
```
| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |
```
- [ ] **Step 4: Rewrite the "External monitoring — askari" section**
Find:
```
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
for administration.
`askari` is provisioned and managed independently of the Proxmox cluster — it
must be reachable even when the homelab is down (its entire purpose).
FQDN: `askari.baobab.band`.
```
Replace with:
```
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
`askari` is provisioned and managed independently of the Proxmox cluster — it must
be reachable even when the homelab is down (its entire purpose), which is also why
the mesh coordinator lives here: an off-site control plane survives a homelab outage.
FQDN: `askari.baobab.band`.
```
- [ ] **Step 5: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/007-network.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/007-network.md
git commit -m "ADR-007: retire VLAN-99 WireGuard for the NetBird mesh (ADR-016)"
```
---
### Task 3: Resolve ADR-015 deferred item #1
**Files:**
- Modify: `docs/decisions/015-control-host.md`
Read the file first, then make THREE exact edits.
- [ ] **Step 1: Update provisioning step 3**
Find:
```
3. Join the mesh VPN (choice deferred — see below).
```
Replace with:
```
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
```
- [ ] **Step 2: Update the Access & security mesh line**
Find:
```
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
mesh; nothing is published to the public internet — this stays inside ADR-002.
```
Replace with:
```
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
```
- [ ] **Step 3: Resolve deferred item #1**
Find:
```
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
or it recreates the chicken-and-egg.
```
Replace with:
```
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
(off-site, so it survives a homelab outage and stays out of the cluster it
administers). Replaces ADR-007's OPNsense WireGuard.
```
- [ ] **Step 4: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/015-control-host.md
git commit -m "ADR-015: resolve mesh-VPN deferral — NetBird on askari (ADR-016)"
```
---
### Task 4: Replace accepted-risks R3 with the concrete residual risk
**Files:**
- Modify: `docs/security/accepted-risks.md`
Read the file first, then make ONE exact edit. (The row is long — match it whole.)
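Because the row has to match whole, a quick pre-check that the "Find" text occurs exactly once can catch drift before editing. The helper below is a minimal sketch, not part of the repo: `assert_single_match` is a hypothetical name, and it only handles single-line Find strings (the multi-line blocks in other tasks would need a different check).

```bash
# Sketch: confirm a single-line "Find" string occurs exactly once, as a whole
# line, before editing. `assert_single_match` is a hypothetical helper name.
assert_single_match() {
  local file="$1" needle="$2" count
  # -F: literal string, -x: whole-line match, -c: count matching lines.
  # grep exits non-zero on zero matches, so the status is ignored here.
  count="$(grep -cFx -- "$needle" "$file")" || true
  if [ "$count" != "1" ]; then
    echo "expected exactly 1 match in $file, found ${count:-0}" >&2
    return 1
  fi
}
```

Usage would look like `assert_single_match docs/security/accepted-risks.md "$row"`, with `$row` holding the full R3 line copied from the Find block.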
- [ ] **Step 1: Replace the R3 row**
Find:
```
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
```
Replace with:
```
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
```
- [ ] **Step 2: Confirm the "Last reviewed" date**
Find:
```
_Last reviewed: 2026-06-05. The prior gaps
```
This already reads `2026-06-05` (today) from the previous work, so **no change is needed** — confirm it says `2026-06-05` and move on. (If it shows an earlier date, set it to `2026-06-05`.)
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: Passed/Skipped.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: R3 now the concrete NetBird coordinator risk"
```
---
### Task 5: Update the CAPABILITIES VPN row
**Files:**
- Modify: `docs/CAPABILITIES.md`
Read the file first, then make ONE exact edit.
- [ ] **Step 1: Replace the VPN / remote access row**
Find:
```
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh |
```
Replace with:
```
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
Expected: Passed/Skipped.
```bash
git add docs/CAPABILITIES.md
git commit -m "CAPABILITIES: VPN decided — NetBird self-hosted (ADR-016)"
```
---
### Task 6: Add NetBird rows to STATUS.md
**Files:**
- Modify: `STATUS.md`
Read the file first, then make ONE exact edit (add two rows after the `ubongo` row).
- [ ] **Step 1: Add the two rows**
Find:
```
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
```
Replace with that SAME line followed by the two new rows:
```
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
| NetBird mesh — coordinator on `askari` | ADR-016 | Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Decision recorded; not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record NetBird mesh (coordinator + base enrollment)"
```
---
### Task 7: Link ADR-016 from CLAUDE.md
**Files:**
- Modify: `CLAUDE.md`
Read the file first, then make ONE exact edit.
- [ ] **Step 1: Add the Further reading row after Network topology**
Find:
```
| Network topology | `docs/decisions/007-network.md` |
```
Replace with that SAME line followed by the new row:
```
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-016 (mesh VPN)"
```
---
### Task 8: Final consistency sweep
**Files:** none modified (verification only)
- [ ] **Step 1: Confirm no doc still treats OPNsense WireGuard / `10.99` as the active remote-access path, and no "pending/deferred VPN" language remains**
Run:
```bash
grep -rniE "choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the ONLY hits are in `007-network.md` and `016-mesh-vpn.md`, where they describe the **retirement** of `10.99.0.0/24` (e.g. "`10.99.0.0/24` is freed", "no `10.99.0.0/24` routing") — those are correct and expected. There must be **no** hit that still treats OPNsense WireGuard or `10.99.0.x` as the *live* remote-access path, and **no** `choice deferred` / `pending VPN choice` anywhere. Legitimate mentions of "WireGuard" as NetBird's *data plane* are fine and won't match this pattern (it only matches `WireGuard endpoint|peers|to OPNsense`). If a canonical doc still names the WireGuard VPN as live, fix it as in the relevant task above and amend that commit.
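The expectation above can be turned into a pass/fail check. This is a sketch under assumptions: `stale_vpn_hits` and `check_stale_vpn_refs` are hypothetical helper names, and the allow-list treats every hit in `007-network.md` and `016-mesh-vpn.md` as expected, so those two files still need a manual read to confirm the hits describe the retirement.

```bash
# Sketch: list stale VPN references outside the two ADRs that are expected
# to mention the retired 10.99.0.0/24 range. Helper names are hypothetical.
stale_vpn_hits() {
  # Arguments: files/directories to sweep (e.g. docs/ CLAUDE.md STATUS.md).
  # -H forces the "path:" prefix so the allow-list filter below can key on it.
  grep -rniEH 'choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)' "$@" \
    | grep -vE 'superpowers/(plans|specs)/' \
    | grep -vE '(^|/)(007-network|016-mesh-vpn)\.md:'
}

# Returns 0 when nothing stale remains, 1 (and prints the hits) otherwise.
check_stale_vpn_refs() {
  local hits
  hits="$(stale_vpn_hits "$@")" || true
  if [ -n "$hits" ]; then
    printf 'Stale VPN references:\n%s\n' "$hits" >&2
    return 1
  fi
}
```

Run from the repository root as `check_stale_vpn_refs docs/ CLAUDE.md STATUS.md`.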
- [ ] **Step 2: Confirm ADR-016 exists and is cross-linked**
Run:
```bash
test -f docs/decisions/016-mesh-vpn.md && echo "ADR-016 present"
grep -rlE "ADR-016|016-mesh-vpn" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the file exists and the referencing docs (007, 015, accepted-risks, CAPABILITIES, STATUS, CLAUDE.md) appear.
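Reading the `grep -rl` output by eye works, but the same check can report exactly which doc is missing the link. A minimal sketch, assuming the six referencing docs are the ones named in the line above; `missing_adr016_links` is a hypothetical helper, not an existing script.

```bash
# Sketch: print each expected doc that does not yet mention ADR-016.
# The file list mirrors the docs touched by the earlier tasks; the helper
# name is hypothetical.
missing_adr016_links() {
  local root="${1:-.}" f missing=0
  for f in \
    docs/decisions/007-network.md \
    docs/decisions/015-control-host.md \
    docs/security/accepted-risks.md \
    docs/CAPABILITIES.md \
    STATUS.md \
    CLAUDE.md
  do
    # A missing file counts as a missing link.
    if ! grep -qE 'ADR-016|016-mesh-vpn' "$root/$f" 2>/dev/null; then
      echo "$f"
      missing=1
    fi
  done
  return "$missing"
}
```

Called with no argument it checks the current directory; an empty output and a zero exit status mean every doc is cross-linked.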
- [ ] **Step 3: Full hook run**
Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks Passed/Skipped. Fix anything that fails (most likely trailing whitespace / end-of-file) and amend the owning commit.
- [ ] **Step 4: Push (only if the user asks)**
Per CLAUDE.md, pushing to `origin` is the off-machine backup. If the user wants it pushed:
```bash
git push origin <branch-or-main-after-merge>
```
---
## Self-review notes (author)
- **Spec coverage:** decision/architecture/security/recovery → Task 1 (ADR-016); the spec's "Documentation & implementation changes" table → Tasks 2–7; deferrals (external SSO, OPNsense mesh specifics, role implementation) are recorded in ADR-016/STATUS, not implemented here (correct — they need the unbuilt `base`/service-role machinery). ✓
- **Not in scope (intentional):** the `netbird_coordinator` service role, the `base`-role agent task, vault `setup_key` material, and any live deployment — all wait on `base`/service-role machinery (STATUS-honest). ✓
- **No placeholders:** every edit shows exact find/replace text; the `_(retired)_` token in ADR-007 is deliberate table content. ✓
- **Name consistency:** ADR file is `016-mesh-vpn.md` everywhere; `vault.netbird.setup_key`, `netbird_coordinator`, and `wt0` are used identically across ADR-016 and the sweep. ✓
```

@ -0,0 +1,206 @@
# Design — Mesh VPN (NetBird, self-hosted on `askari`)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
R3 "pending VPN choice" placeholder
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
- **Becomes:** ADR-016 (this design is the basis for that ADR)
---
## Problem
`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
without exposing anything to the public internet. ADR-015 left the access mechanism —
the "mesh VPN" — deferred to this discussion.
Meanwhile ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to
OPNsense, plus road-warrior clients. And `docs/CAPABILITIES.md` already flags the open
question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
alternative to weigh."*
So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
Tailscale), with an architectural sub-question of whether a mesh replaces or coexists
with the ADR-007 WireGuard.
## Decisions (as settled)
1. **Scope — the mesh *replaces* WireGuard.** A single overlay becomes the sole
remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's
VLAN-99 OPNsense WireGuard design is retired.
2. **Control plane — self-hosted, on `askari`.** Maximum sovereignty (boma already
self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
coordinator that survives a homelab outage and stays out of the cluster it
administers.
3. **Tool — NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
means Headscale, a separate third-party reimplementation with partial parity — ruled
out below.)
4. **Routing — NetBird agent on every (Linux) host**, not a subnet router. At boma's
scale (25 hosts, treated as individuals) the usual "agent everywhere" downside is
moot, and the `base` role already runs on every host, so enrollment is one uniform
role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs
that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is not a
first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a
single advertised route or by administering OPNsense from an on-LAN meshed peer.
5. **Identity — embedded local users** (Dex, built into the management container), not
a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
documented future option.
## Verified facts (ADR-014)
> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
> the core services are **merged into a single container**; deploy via Docker Compose.
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
> external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
> required.
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
>
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
> elsewhere. Fully open source, self-hostable, no open-core feature gating.
---
## Architecture & topology
A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird
on `askari`.
**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
cluster per ADR-007) runs the **NetBird management stack** (single container:
management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 + UDP
3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh survive a
full homelab outage and keeps the coordinator out of the cluster it administers (no
chicken-and-egg).
**Peers:**
- `askari` — coordinator + peer.
- `ubongo` (control/AI-worker host) — agent.
- All Linux managed hosts (`dns1/2`, `proxy`, …) — agent via the `base` role.
- Road-warrior clients — `mamba`, phone, work PC — agent/app.
- OPNsense / `mgmt` — the single non-agent exception (advertised route or LAN-side
admin from a meshed peer).
**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
ACLs instead of OPNsense routing `10.99.0.0/24`.
---
## Security model, ACLs, and attack surface
**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
- monitoring peers (`askari`, formerly the `vpn` zone) → `srv` **metrics ports only**.
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
- road-warrior clients → only what each needs; nothing by default.
**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
role. Prefer ephemeral/scoped keys (ADR-002).
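As a rough illustration of what non-interactive enrollment might look like once `base` exists, the sketch below wraps `netbird up` with its `--management-url` and `--setup-key` flags. Everything else is an assumption: the function name, the idea that the role first writes the vaulted key to a root-only file, and the `NETBIRD_BIN` override (included only so the call can be dry-run).

```bash
# Sketch of non-interactive enrollment. Assumes the setup key was already
# written to a root-only file; the function name, key-file approach and
# NETBIRD_BIN override are illustrative, not the real `base` role.
enroll_netbird_peer() {
  local mgmt_url="$1" key_file="$2" key
  if [ ! -r "$key_file" ]; then
    echo "setup key file not readable: $key_file" >&2
    return 1
  fi
  key="$(cat "$key_file")"
  if [ -z "$key" ]; then
    echo "setup key file is empty: $key_file" >&2
    return 1
  fi
  # Join the mesh against the self-hosted coordinator.
  "${NETBIRD_BIN:-netbird}" up --management-url "$mgmt_url" --setup-key "$key"
}
```

Setting `NETBIRD_BIN=echo` prints the command instead of running it. Note that passing the key as an argument exposes it in the process list for the duration of the call, which is one more reason to prefer the ephemeral/scoped keys mentioned above.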
**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base`
role's nftables default-deny allows inbound admin (SSH) **only on `wt0`**, denied on
the physical NIC — the pattern ADR-015 set for `ubongo`, now applied fleet-wide. Mesh
+ nftables are defence-in-depth.
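The "SSH only on `wt0`" rule can be pictured with a small ruleset renderer. This is a sketch, not the `base` role's ruleset (which is unbuilt): the table and chain names, the function name, and the default port are assumptions, and a real host would need further rules (ICMP, service ports, and whatever NetBird itself requires on the physical NIC).

```bash
# Sketch: render an nftables input chain that drops by default and accepts
# SSH only on the mesh interface. Table/chain names are illustrative.
render_ssh_mesh_only_rules() {
  local mesh_if="${1:-wt0}" ssh_port="${2:-22}"
  cat <<EOF
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iifname "lo" accept
    iifname "${mesh_if}" tcp dport ${ssh_port} accept
  }
}
EOF
}
```

The output can be syntax-checked without loading it via `render_ssh_mesh_only_rules | nft -c -f -`. Because no rule accepts the SSH port on any other interface, the `policy drop` default is what denies it on the physical NIC.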
**The new attack surface — a public control plane on `askari`.** Today `askari`
exposes a WireGuard UDP port; with NetBird self-hosted it exposes the **management API
+ dashboard (80/443)** and **Coturn (3478)** publicly, and the management API is
keys-to-the-kingdom for the whole mesh. Mitigations baked in:
- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
practical.
- `askari` runs `base` hardening (already a public managed host) and NetBird is
**version-pinned** (ADR-011) and patched on boma's cadence — self-hosting means
owning the CVE cadence (AGPLv3 server).
Net vs ADR-002: nothing from the **cluster** is publicly exposed; the only public
surface is on `askari` (a public VPS by design), shifting from "WireGuard port" to
"NetBird control plane."
---
## Recovery, bootstrap ordering, and operations
**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
`ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and
there is no chicken-and-egg in the critical path.
**Bootstrap order** (askari-first):
1. Stand up the NetBird coordinator on `askari`.
2. Enroll `ubongo`.
3. `base` role enrolls the rest of the fleet via setup keys from vault.
**Recovery.** Coordinator off-site on `askari` ⇒ the mesh survives a full homelab
outage. Two must-haves:
- **Back up NetBird's management datastore** off `askari` — encrypted, synced to
`ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
- Existing peer tunnels keep running on last-known config through a brief coordinator
outage; only changes/new enrollments need it live — so `askari` is important but not
instantly fatal.
**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
service = one role per ADR-004, with its own `SECURITY.md` per the service-role
standard). Agent install/enrollment lives in `base`.
**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
NetBird's built-in DNS is scoped/off to avoid overlap. NetBird server (on `askari`)
and agents (via `base`) are version-pinned (ADR-011).
---
## Documentation & implementation changes
This is a substantial decision → its own ADR, with amendments linking to it.
| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided — NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |
**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
decision and the doc reconciliation; the role tasks land when `base` is built.
---
## Deferred / out of scope
1. **External SSO IdP** (Zitadel/Keycloak) — embedded local users now; SSO later if a
second operator or service-SSO need appears.
2. **OPNsense mesh integration specifics** — the exact `mgmt` reachability mechanism
(single advertised route vs LAN-side admin) is settled during implementation when
OPNsense automation is built.
3. **The `base` / `netbird_coordinator` role implementation** — depends on the
unbuilt `base` role and service-role standard.
---
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support — weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |
See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
ADR-013 (heritage — V4 used WireGuard; NetBird is translated, not transplanted).