diff --git a/docs/superpowers/plans/2026-06-05-mesh-vpn-netbird.md b/docs/superpowers/plans/2026-06-05-mesh-vpn-netbird.md new file mode 100644 index 0000000..1ad3dad --- /dev/null +++ b/docs/superpowers/plans/2026-06-05-mesh-vpn-netbird.md @@ -0,0 +1,484 @@ +# Mesh VPN (NetBird) Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Record the decision that boma's mesh VPN is NetBird (self-hosted on `askari`), by authoring ADR-016 and reconciling every doc that currently assumes OPNsense WireGuard or an undecided VPN. + +**Architecture:** Documentation-only change. NetBird replaces ADR-007's VLAN-99 OPNsense WireGuard as the single remote-access overlay for `ubongo`, `askari`, and road-warrior clients; coordinator self-hosted off-site on `askari`; agent-per-host enrollment via the (unbuilt) `base` role; embedded local-user identity. The role/service implementation waits on the `base` role and service-role machinery that STATUS.md lists as not-yet-built — this plan settles the decision and the doc reconciliation only. + +**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus a final cross-reference/staleness sweep. No markdown linter exists, so "tests" are hook-pass + grep checks. + +--- + +## Pre-flight (read once before starting) + +- **`rbw` must be unlocked before every commit** (the pre-commit ansible-lint hook decrypts `vault.yml`). Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`. +- **Commit style:** one commit per task, imperative subject ≤72 chars. +- **Order matters:** Task 1 (ADR-016) lands first — every later task links to it. +- **Spec reference:** `docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md`. 
+- **Branch:** start by creating `chore/mesh-vpn-netbird-docs` off `main` (the controller does this before dispatching Task 1; do not implement on `main`). + +--- + +## File map + +| File | Action | Responsibility after change | +|---|---|---| +| `docs/decisions/016-mesh-vpn.md` | Create | Home of record for the NetBird mesh decision | +| `docs/decisions/007-network.md` | Modify | VLAN-99 WireGuard retired; askari rides the mesh + hosts the coordinator | +| `docs/decisions/015-control-host.md` | Modify | Resolve deferred item #1 (mesh = NetBird on askari) | +| `docs/security/accepted-risks.md` | Modify | Replace R3 placeholder with the concrete residual risk | +| `docs/CAPABILITIES.md` | Modify | VPN row decided: NetBird, self-hosted | +| `STATUS.md` | Modify | Two rows: NetBird coordinator + agent enrollment (designed, not built) | +| `CLAUDE.md` | Modify | ADR-016 in Further reading | + +--- + +### Task 1: Author ADR-016 (the home of record) + +**Files:** +- Create: `docs/decisions/016-mesh-vpn.md` + +- [ ] **Step 1: Create the ADR file** + +Create `docs/decisions/016-mesh-vpn.md` with exactly this content (preserve em-dashes —, backticks, table pipes, and the `verified:` stamps): + +```markdown +# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`) + +## Context + +`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to +the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to +WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road +warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to +weigh. This ADR settles it. + +## Decision + +A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`, +**replacing** ADR-007's VLAN-99 OPNsense WireGuard. + +The decision in five parts: + +1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and + road-warrior clients. 
ADR-007's VLAN-99 WireGuard design is retired. +2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts + Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that + survives a homelab outage and stays out of the cluster it administers. +3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source + self-host). Tailscale would mean Headscale (third-party reimplementation, partial + parity) — ruled out below. +4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (2–5 + hosts) the "agent everywhere" cost is trivial and the `base` role already runs + everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives + granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception + (`mgmt`/gateway reached by a single advertised route or LAN-side admin). +5. **Identity — embedded local users** (Dex in the management container); external SSO + (Zitadel/Keycloak) stays an optional future. + +## Verified facts (ADR-014) + +verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05 +— components management+signal+dashboard+relay/TURN(Coturn), **single container since +v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional); +ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host. + +verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for +`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no +open-core feature gating. + +## Architecture + +Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`. +NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is +allocated for it. + +- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer. +- `ubongo` — agent. +- All Linux managed hosts — agent via the `base` role. 
+- Road-warrior clients (`mamba`, phone, work PC) — agent/app. +- OPNsense / `mgmt` — single non-agent exception. + +## Security + +- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics + ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least + privilege. +- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`), + consumed by `base`; prefer ephemeral/scoped keys. +- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH + **only on `wt0`** (the ADR-015 pattern, fleet-wide). +- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn + (3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical, + `base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence. + Recorded as accepted-risk R3. + +## Recovery & operations + +- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a + mesh/coordinator outage never blocks on-LAN runs. +- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` → + `base` enrolls the fleet. +- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage. + NetBird's management datastore is backed up encrypted off `askari` (synced to + `ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage. +- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a + dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with + `SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are + version-pinned (ADR-011). boma's `dns` role stays authoritative for + `boma.baobab.band`; NetBird built-in DNS scoped/off. + +## Status + +Designed, not built — depends on the unbuilt `base` role and service-role machinery +(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when +`base` exists. 
+ +## What was ruled out + +| Option | Reason | +|---|---| +| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. | +| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. | +| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. | +| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. | +| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. | +| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. | + +See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security), +ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible +handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted). +``` + +- [ ] **Step 2: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/decisions/016-mesh-vpn.md` +Expected: Passed/Skipped (ansible-lint Skipped for non-YAML). +```bash +git add docs/decisions/016-mesh-vpn.md +git commit -m "Add ADR-016 (mesh VPN — NetBird self-hosted on askari)" +``` + +--- + +### Task 2: Amend ADR-007 (retire VLAN-99 WireGuard, askari on the mesh) + +**Files:** +- Modify: `docs/decisions/007-network.md` + +Read the file first, then make FOUR exact edits. Preserve em-dashes —, backticks, table pipes. + +- [ ] **Step 1: Update the VLAN-99 row in the VLAN design table** + +Find: +``` +| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. 
| +``` +Replace with: +``` +| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. | +``` + +- [ ] **Step 2: Replace the VLAN-99 addressing subsection** + +Find: +``` +### VLAN 99 — vpn (10.99.0.0/24) — WireGuard + +| Address | Host | +|---|---| +| `10.99.0.1` | OPNsense (WireGuard endpoint) | +| `10.99.0.2` | `askari` (Hetzner VPS) | +| `10.99.0.10`+ | Road-warrior clients | +``` +Replace with: +``` +### VLAN 99 — vpn — retired + +The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh** +(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a +self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane +NetBird self-hosted on `askari`. NetBird manages its own overlay addressing +(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and +`10.99.0.0/24` is freed. +``` + +- [ ] **Step 3: Update the two `vpn` rows in the OPNsense firewall-rules table** + +Find: +``` +| `vpn` | `srv` (metrics ports) | allow (monitoring) | +| `vpn` | `mgmt` | allow (administration from askari) | +``` +Replace with: +``` +| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) | +| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) | +``` + +- [ ] **Step 4: Rewrite the "External monitoring — askari" section** + +Find: +``` +`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`). +Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN +tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt` +for administration. + +`askari` is provisioned and managed independently of the Proxmox cluster — it +must be reachable even when the homelab is down (its entire purpose). +FQDN: `askari.baobab.band`. 
+``` +Replace with: +``` +`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts +the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv` +metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird +ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing. + +`askari` is provisioned and managed independently of the Proxmox cluster — it must +be reachable even when the homelab is down (its entire purpose), which is also why +the mesh coordinator lives here: an off-site control plane survives a homelab outage. +FQDN: `askari.baobab.band`. +``` + +- [ ] **Step 5: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/decisions/007-network.md` +Expected: Passed/Skipped. +```bash +git add docs/decisions/007-network.md +git commit -m "ADR-007: retire VLAN-99 WireGuard for the NetBird mesh (ADR-016)" +``` + +--- + +### Task 3: Resolve ADR-015 deferred item #1 + +**Files:** +- Modify: `docs/decisions/015-control-host.md` + +Read the file first, then make THREE exact edits. + +- [ ] **Step 1: Update provisioning step 3** + +Find: +``` +3. Join the mesh VPN (choice deferred — see below). +``` +Replace with: +``` +3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016). +``` + +- [ ] **Step 2: Update the Access & security mesh line** + +Find: +``` +- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the + mesh; nothing is published to the public internet — this stays inside ADR-002. +``` +Replace with: +``` +- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016). + SSH to `ubongo` over the mesh; nothing is published to the public internet — this + stays inside ADR-002. +``` + +- [ ] **Step 3: Resolve deferred item #1** + +Find: +``` +1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. 
Recovery + dimension: a hosted coordinator keeps the mesh up when the cluster is down; a + self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet, + or it recreates the chicken-and-egg. +``` +Replace with: +``` +1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari` + (off-site, so it survives a homelab outage and stays out of the cluster it + administers). Replaces ADR-007's OPNsense WireGuard. +``` + +- [ ] **Step 4: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md` +Expected: Passed/Skipped. +```bash +git add docs/decisions/015-control-host.md +git commit -m "ADR-015: resolve mesh-VPN deferral — NetBird on askari (ADR-016)" +``` + +--- + +### Task 4: Replace accepted-risks R3 with the concrete residual risk + +**Files:** +- Modify: `docs/security/accepted-risks.md` + +Read the file first, then make ONE exact edit. (The row is long — match it whole.) + +- [ ] **Step 1: Replace the R3 row** + +Find: +``` +| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). 
Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes | +``` +Replace with: +``` +| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering | +``` + +- [ ] **Step 2: Bump the "Last reviewed" date** + +Find: +``` +_Last reviewed: 2026-06-05. The prior gaps +``` +This already reads `2026-06-05` (today) from the previous work, so **no change is needed** — confirm it says `2026-06-05` and move on. (If it shows an earlier date, set it to `2026-06-05`.) + +- [ ] **Step 3: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md` +Expected: Passed/Skipped. +```bash +git add docs/security/accepted-risks.md +git commit -m "accepted-risks: R3 now the concrete NetBird coordinator risk" +``` + +--- + +### Task 5: Update the CAPABILITIES VPN row + +**Files:** +- Modify: `docs/CAPABILITIES.md` + +Read the file first, then make ONE exact edit. 
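A single-line Find/Replace row like the one in this task can be guarded before editing. The following is a minimal sketch, assuming bash and a standard `grep`; `check_exact_once` is a hypothetical helper name and is not part of the repo.

```shell
#!/usr/bin/env bash
# Sketch only: `check_exact_once` is a hypothetical helper, not part of the repo.
# It confirms that a single-line Find string occurs on exactly one line of a file
# before the matching Replace text is applied.
set -euo pipefail

check_exact_once() {
  local file=$1 find=$2 hits
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi
  # -F treats the row as a literal string (pipes and backticks are not regex);
  # -c counts matching lines. grep exits 1 on zero matches, hence `|| true`.
  hits=$(grep -cF -- "$find" "$file" || true)
  if [ "$hits" -ne 1 ]; then
    echo "expected exactly 1 matching line in $file, found $hits" >&2
    return 1
  fi
}
```

For instance, `check_exact_once docs/CAPABILITIES.md '| VPN / remote access | Netbird'` should succeed before Step 1 and fail after it, because the replacement row spells the name `NetBird` and the literal match is case-sensitive. Multi-line Find blocks (Tasks 2 and 3) need a different check, since `grep` matches line by line.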
+ +- [ ] **Step 1: Replace the VPN / remote access row** + +Find: +``` +| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh | +``` +Replace with: +``` +| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard | +``` + +- [ ] **Step 2: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md` +Expected: Passed/Skipped. +```bash +git add docs/CAPABILITIES.md +git commit -m "CAPABILITIES: VPN decided — NetBird self-hosted (ADR-016)" +``` + +--- + +### Task 6: Add NetBird rows to STATUS.md + +**Files:** +- Modify: `STATUS.md` + +Read the file first, then make ONE exact edit (add two rows after the `ubongo` row). + +- [ ] **Step 1: Add the two rows** + +Find: +``` +| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. | +``` +Replace with that SAME line followed by the two new rows: +``` +| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. | +| NetBird mesh — coordinator on `askari` | ADR-016 | Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Decision recorded; not deployed (askari + service-role machinery not built). | +| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. 
| +``` + +- [ ] **Step 2: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files STATUS.md` +Expected: Passed/Skipped. +```bash +git add STATUS.md +git commit -m "STATUS: record NetBird mesh (coordinator + base enrollment)" +``` + +--- + +### Task 7: Link ADR-016 from CLAUDE.md + +**Files:** +- Modify: `CLAUDE.md` + +Read the file first, then make ONE exact edit. + +- [ ] **Step 1: Add the Further reading row after Network topology** + +Find: +``` +| Network topology | `docs/decisions/007-network.md` | +``` +Replace with that SAME line followed by the new row: +``` +| Network topology | `docs/decisions/007-network.md` | +| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` | +``` + +- [ ] **Step 2: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files CLAUDE.md` +Expected: Passed/Skipped. +```bash +git add CLAUDE.md +git commit -m "CLAUDE.md: link ADR-016 (mesh VPN)" +``` + +--- + +### Task 8: Final consistency sweep + +**Files:** none modified (verification only) + +- [ ] **Step 1: Confirm no doc still treats OPNsense WireGuard / `10.99` as the active remote-access path, and no "pending/deferred VPN" language remains** + +Run: +```bash +grep -rniE "choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/" +``` +Expected: the ONLY hits are in `007-network.md` and `016-mesh-vpn.md`, where they describe the **retirement** of `10.99.0.0/24` (e.g. "`10.99.0.0/24` is freed", "no `10.99.0.0/24` routing") — those are correct and expected. There must be **no** hit that still treats OPNsense WireGuard or `10.99.0.x` as the *live* remote-access path, and **no** `choice deferred` / `pending VPN choice` anywhere. Legitimate mentions of "WireGuard" as NetBird's *data plane* are fine and won't match this pattern (it only matches `WireGuard endpoint|peers|to OPNsense`). 
If a canonical doc still names the WireGuard VPN as live, fix it as in the relevant task above and amend that commit. + +- [ ] **Step 2: Confirm ADR-016 exists and is cross-linked** + +Run: +```bash +test -f docs/decisions/016-mesh-vpn.md && echo "ADR-016 present" +grep -rlE "ADR-016|016-mesh-vpn" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/" +``` +Expected: the file exists, and the output lists the referencing docs (007, 015, accepted-risks, CAPABILITIES, STATUS, CLAUDE.md) alongside `016-mesh-vpn.md` itself. + +- [ ] **Step 3: Full hook run** + +Run: `rbw unlocked && pre-commit run --all-files` +Expected: all hooks Passed/Skipped. Fix anything that fails (most likely trailing whitespace / end-of-file) and amend the owning commit. + +- [ ] **Step 4: Push (only if the user asks)** + +Per CLAUDE.md, push to `origin` is the off-machine backup. If the user wants it pushed (the branch is new, so set its upstream on the first push): +```bash +git push -u origin chore/mesh-vpn-netbird-docs +``` + +--- + +## Self-review notes (author) + +- **Spec coverage:** decision/architecture/security/recovery → Task 1 (ADR-016); the spec's "Documentation & implementation changes" table → Tasks 2–7; deferrals (external SSO, OPNsense mesh specifics, role implementation) are recorded in ADR-016/STATUS, not implemented here (correct — they need the unbuilt `base`/service-role machinery). ✓ +- **Not in scope (intentional):** the `netbird_coordinator` service role, the `base`-role agent task, vault `setup_key` material, and any live deployment — all wait on `base`/service-role machinery (STATUS-honest). ✓ +- **No placeholders:** every edit shows exact find/replace text; the `_(retired)_` token in ADR-007 is deliberate table content. ✓ +- **Name consistency:** ADR file is `016-mesh-vpn.md` everywhere; `vault.netbird.setup_key`, `netbird_coordinator`, and `wt0` are spelled identically wherever they appear (ADR-016, the `STATUS.md` rows, and these notes). ✓
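The Task 8 Step 1 sweep pattern can be sanity-checked offline before it is run against the repo. The sketch below feeds the same regular expression a few sample lines, lightly shortened from the Find/Replace text in this plan; `sweep_hits` and `sample_lines` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: exercises the Task 8 sweep pattern on sample lines.
# `sweep_hits` and `sample_lines` are hypothetical helpers, not part of the repo.
set -euo pipefail

# Same extended regex as Task 8 Step 1.
pattern='choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)'

sweep_hits() {
  # Reads candidate lines on stdin and prints the ones the sweep would flag,
  # prefixed with their line number. grep exits 1 on no match, hence `|| true`.
  grep -niE "$pattern" || true
}

sample_lines() {
  cat <<'EOF'
Data plane: peer-to-peer WireGuard.
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
3. Join the mesh VPN (choice deferred).
Remote access rides a self-hosted NetBird overlay.
`10.99.0.0/24` is freed.
EOF
}

sample_lines | sweep_hits
```

Sample lines 2, 3, and 5 are flagged. Line 5 is the kind of retirement mention that Step 1 expects to remain in `007-network.md`, while line 1 (WireGuard as NetBird's data plane) passes through unflagged. Lines 2 and 3 are the stale wording that Tasks 2 and 3 remove.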