boma/docs/superpowers/plans/2026-06-05-mesh-vpn-netbird.md
sjat 4b85b14f1f Add implementation plan for NetBird mesh VPN
Task-by-task docs plan: author ADR-016 and reconcile ADR-007 (retire VLAN-99
WireGuard), ADR-015 (resolve deferred #1), accepted-risks R3, CAPABILITIES,
STATUS, CLAUDE.md. Documentation-only; role/deployment waits on the base role.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:44:05 +02:00

# Mesh VPN (NetBird) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Record the decision that boma's mesh VPN is NetBird (self-hosted on `askari`), by authoring ADR-016 and reconciling every doc that currently assumes OPNsense WireGuard or an undecided VPN.
**Architecture:** Documentation-only change. NetBird replaces ADR-007's VLAN-99 OPNsense WireGuard as the single remote-access overlay for `ubongo`, `askari`, and road-warrior clients; coordinator self-hosted off-site on `askari`; agent-per-host enrollment via the (unbuilt) `base` role; embedded local-user identity. The role/service implementation waits on the `base` role and service-role machinery that STATUS.md lists as not-yet-built — this plan settles the decision and the doc reconciliation only.
**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus a final cross-reference/staleness sweep. No markdown linter exists, so "tests" are hook-pass + grep checks.
---
## Pre-flight (read once before starting)
- **`rbw` must be unlocked before every commit** (the pre-commit ansible-lint hook decrypts `vault.yml`). Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order matters:** Task 1 (ADR-016) lands first — every later task links to it.
- **Spec reference:** `docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md`.
- **Branch:** start by creating `chore/mesh-vpn-netbird-docs` off `main` (the controller does this before dispatching Task 1; do not implement on `main`).
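The commit-style and branch rules above can be checked mechanically before each commit. This is a minimal sketch, assuming bash; `subject_ok` and `branch_ok` are hypothetical helper names for illustration, not part of the boma repo.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (illustrative only, not part of the repo).

# subject_ok SUBJECT: succeed if the commit subject is at most 72 characters.
subject_ok() {
  local subject=$1
  [ "${#subject}" -le 72 ]
}

# branch_ok BRANCH: succeed unless the branch is main
# (the plan forbids implementing directly on main).
branch_ok() {
  [ "$1" != "main" ]
}

# Possible usage inside the repo (commented out so the sketch has no side effects):
#   rbw unlocked || { echo "vault locked: ask the user to run 'rbw unlock'" >&2; exit 1; }
#   branch_ok "$(git rev-parse --abbrev-ref HEAD)" || { echo "do not work on main" >&2; exit 1; }
#   subject_ok "CLAUDE.md: link ADR-016 (mesh VPN)" || echo "subject too long" >&2
```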
---
## File map
| File | Action | Responsibility after change |
|---|---|---|
| `docs/decisions/016-mesh-vpn.md` | Create | Home of record for the NetBird mesh decision |
| `docs/decisions/007-network.md` | Modify | VLAN-99 WireGuard retired; askari rides the mesh + hosts the coordinator |
| `docs/decisions/015-control-host.md` | Modify | Resolve deferred item #1 (mesh = NetBird on askari) |
| `docs/security/accepted-risks.md` | Modify | Replace R3 placeholder with the concrete residual risk |
| `docs/CAPABILITIES.md` | Modify | VPN row decided: NetBird, self-hosted |
| `STATUS.md` | Modify | Two rows: NetBird coordinator + agent enrollment (designed, not built) |
| `CLAUDE.md` | Modify | ADR-016 in Further reading |
---
### Task 1: Author ADR-016 (the home of record)
**Files:**
- Create: `docs/decisions/016-mesh-vpn.md`
- [ ] **Step 1: Create the ADR file**
Create `docs/decisions/016-mesh-vpn.md` with exactly this content (preserve em-dashes —, backticks, table pipes, and the `verified:` stamps):
```markdown
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
## Context
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.
## Decision
A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
The decision in five parts:
1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
survives a homelab outage and stays out of the cluster it administers.
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
parity) — ruled out below.
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (25
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity — embedded local users** (Dex in the management container); external SSO
(Zitadel/Keycloak) stays an optional future.
## Verified facts (ADR-014)
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.
## Architecture
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.
- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
- `ubongo` — agent.
- All Linux managed hosts — agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
- OPNsense / `mgmt` — single non-agent exception.
## Security
- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
Recorded as accepted-risk R3.
## Recovery & operations
- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
`base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is backed up encrypted off `askari` (synced to
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
## Status
Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/016-mesh-vpn.md`
Expected: Passed/Skipped (ansible-lint Skipped for non-YAML).
```bash
git add docs/decisions/016-mesh-vpn.md
git commit -m "Add ADR-016 (mesh VPN — NetBird self-hosted on askari)"
```
---
### Task 2: Amend ADR-007 (retire VLAN-99 WireGuard, askari on the mesh)
**Files:**
- Modify: `docs/decisions/007-network.md`
Read the file first, then make FOUR exact edits. Preserve em-dashes —, backticks, table pipes.
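Each edit in this task depends on the Find text matching exactly, so it may help to confirm that the text occurs exactly once before editing. This is a minimal sketch, assuming bash and grep; `count_exact` is a hypothetical helper name. It handles single-line Find text only; for a multi-line block, checking its first line is a reasonable proxy.

```bash
#!/usr/bin/env bash
# count_exact FILE TEXT: print how many lines of FILE contain TEXT verbatim.
# -F treats TEXT as a fixed string, so table pipes and backticks need no escaping.
count_exact() {
  local file=$1 text=$2
  grep -cF -- "$text" "$file"
}

# Possible usage (expects the output 1 before the edit is applied):
#   count_exact docs/decisions/007-network.md '| 99 | `vpn` | `10.99.0.0/24` |'
```

A count of 0 suggests the file has drifted from what this plan assumes; a count above 1 means the edit needs more surrounding context to be unambiguous.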
- [ ] **Step 1: Update the VLAN-99 row in the VLAN design table**
Find:
```
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
```
Replace with:
```
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
```
- [ ] **Step 2: Replace the VLAN-99 addressing subsection**
Find:
```
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
| Address | Host |
|---|---|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
| `10.99.0.2` | `askari` (Hetzner VPS) |
| `10.99.0.10`+ | Road-warrior clients |
```
Replace with:
```
### VLAN 99 — vpn — retired
The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
`10.99.0.0/24` is freed.
```
- [ ] **Step 3: Update the two `vpn` rows in the OPNsense firewall-rules table**
Find:
```
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
| `vpn` | `mgmt` | allow (administration from askari) |
```
Replace with:
```
| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |
```
- [ ] **Step 4: Rewrite the "External monitoring — askari" section**
Find:
```
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
for administration.
`askari` is provisioned and managed independently of the Proxmox cluster — it
must be reachable even when the homelab is down (its entire purpose).
FQDN: `askari.baobab.band`.
```
Replace with:
```
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
`askari` is provisioned and managed independently of the Proxmox cluster — it must
be reachable even when the homelab is down (its entire purpose), which is also why
the mesh coordinator lives here: an off-site control plane survives a homelab outage.
FQDN: `askari.baobab.band`.
```
- [ ] **Step 5: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/007-network.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/007-network.md
git commit -m "ADR-007: retire VLAN-99 WireGuard for the NetBird mesh (ADR-016)"
```
---
### Task 3: Resolve ADR-015 deferred item #1
**Files:**
- Modify: `docs/decisions/015-control-host.md`
Read the file first, then make THREE exact edits.
- [ ] **Step 1: Update provisioning step 3**
Find:
```
3. Join the mesh VPN (choice deferred — see below).
```
Replace with:
```
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
```
- [ ] **Step 2: Update the Access & security mesh line**
Find:
```
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
mesh; nothing is published to the public internet — this stays inside ADR-002.
```
Replace with:
```
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
```
- [ ] **Step 3: Resolve deferred item #1**
Find:
```
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
or it recreates the chicken-and-egg.
```
Replace with:
```
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
(off-site, so it survives a homelab outage and stays out of the cluster it
administers). Replaces ADR-007's OPNsense WireGuard.
```
- [ ] **Step 4: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/015-control-host.md
git commit -m "ADR-015: resolve mesh-VPN deferral — NetBird on askari (ADR-016)"
```
---
### Task 4: Replace accepted-risks R3 with the concrete residual risk
**Files:**
- Modify: `docs/security/accepted-risks.md`
Read the file first, then make ONE exact edit. (The row is long — match it whole.)
- [ ] **Step 1: Replace the R3 row**
Find:
```
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
```
Replace with:
```
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
```
- [ ] **Step 2: Bump the "Last reviewed" date**
Find:
```
_Last reviewed: 2026-06-05. The prior gaps
```
This already reads `2026-06-05` (today) from the previous work, so **no change is needed** — confirm it says `2026-06-05` and move on. (If it shows an earlier date, set it to `2026-06-05`.)
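The date confirmation can also be done without opening the file. This is a minimal sketch, assuming the line starts with `_Last reviewed: ` followed by an ISO date as in the Find text above; `last_reviewed` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# last_reviewed FILE: print the date from the first "_Last reviewed: YYYY-MM-DD" line.
last_reviewed() {
  sed -nE 's/^_Last reviewed: ([0-9]{4}-[0-9]{2}-[0-9]{2}).*/\1/p' "$1" | head -n 1
}

# Possible usage:
#   [ "$(last_reviewed docs/security/accepted-risks.md)" = "2026-06-05" ] || echo "date needs updating"
```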
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: Passed/Skipped.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: R3 now the concrete NetBird coordinator risk"
```
---
### Task 5: Update the CAPABILITIES VPN row
**Files:**
- Modify: `docs/CAPABILITIES.md`
Read the file first, then make ONE exact edit.
- [ ] **Step 1: Replace the VPN / remote access row**
Find:
```
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh |
```
Replace with:
```
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
Expected: Passed/Skipped.
```bash
git add docs/CAPABILITIES.md
git commit -m "CAPABILITIES: VPN decided — NetBird self-hosted (ADR-016)"
```
---
### Task 6: Add NetBird rows to STATUS.md
**Files:**
- Modify: `STATUS.md`
Read the file first, then make ONE exact edit (add two rows after the `ubongo` row).
- [ ] **Step 1: Add the two rows**
Find:
```
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
```
Replace with that SAME line followed by the two new rows:
```
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
| NetBird mesh — coordinator on `askari` | ADR-016 | Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Decision recorded; not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record NetBird mesh (coordinator + base enrollment)"
```
---
### Task 7: Link ADR-016 from CLAUDE.md
**Files:**
- Modify: `CLAUDE.md`
Read the file first, then make ONE exact edit.
- [ ] **Step 1: Add the Further reading row after Network topology**
Find:
```
| Network topology | `docs/decisions/007-network.md` |
```
Replace with that SAME line followed by the new row:
```
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-016 (mesh VPN)"
```
---
### Task 8: Final consistency sweep
**Files:** none modified (verification only)
- [ ] **Step 1: Confirm no doc still treats OPNsense WireGuard / `10.99` as the active remote-access path, and no "pending/deferred VPN" language remains**
Run:
```bash
grep -rniE "choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the ONLY hits are in `007-network.md` and `016-mesh-vpn.md`, where they describe the **retirement** of `10.99.0.0/24` (e.g. "`10.99.0.0/24` is freed", "no `10.99.0.0/24` routing") — those are correct and expected. There must be **no** hit that still treats OPNsense WireGuard or `10.99.0.x` as the *live* remote-access path, and **no** `choice deferred` / `pending VPN choice` anywhere. Legitimate mentions of "WireGuard" as NetBird's *data plane* are fine and won't match this pattern (it only matches `WireGuard endpoint|peers|to OPNsense`). If a canonical doc still names the WireGuard VPN as live, fix it as in the relevant task above and amend that commit.
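To make the expectation above easier to eyeball, the sweep can be narrowed to hits outside the two ADRs that are allowed to mention the retired subnet. This is a minimal sketch, assuming bash and grep; `unexpected_hits` is a hypothetical helper name. It is a coarse filter: it exempts every hit in `007-network.md` and `016-mesh-vpn.md`, so those hits still need a manual read to confirm they describe the retirement.

```bash
#!/usr/bin/env bash
# unexpected_hits PATH...: run the sweep pattern over the given paths and print
# only the hits outside the plans/specs and the two ADRs that describe the retirement.
unexpected_hits() {
  local pattern='choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)'
  # -H forces the file-name prefix so the path filters below always apply.
  grep -rniEH "$pattern" "$@" \
    | grep -vE 'superpowers/(plans|specs)/' \
    | grep -vE '(007-network|016-mesh-vpn)\.md:' \
    || true
}

# Possible usage (empty output means nothing unexpected remains):
#   unexpected_hits docs/ CLAUDE.md STATUS.md
```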
- [ ] **Step 2: Confirm ADR-016 exists and is cross-linked**
Run:
```bash
test -f docs/decisions/016-mesh-vpn.md && echo "ADR-016 present"
grep -rlE "ADR-016|016-mesh-vpn" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the file exists and the referencing docs (007, 015, accepted-risks, CAPABILITIES, STATUS, CLAUDE.md) appear.
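Rather than reading the list of matching files and spotting an absence, the check can be inverted to print only the docs that do not yet reference ADR-016. This is a minimal sketch, assuming the six paths from the file map; `missing_links` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# missing_links ROOT: print each expected doc under ROOT that does not
# reference ADR-016 (a missing file is reported as well).
missing_links() {
  local root=$1 f
  for f in docs/decisions/007-network.md docs/decisions/015-control-host.md \
           docs/security/accepted-risks.md docs/CAPABILITIES.md STATUS.md CLAUDE.md; do
    grep -qE 'ADR-016|016-mesh-vpn' "$root/$f" 2>/dev/null || echo "$f"
  done
}

# Possible usage from the repo root (empty output means every doc is cross-linked):
#   missing_links .
```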
- [ ] **Step 3: Full hook run**
Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks Passed/Skipped. Fix anything that fails (most likely trailing whitespace / end-of-file) and amend the owning commit.
- [ ] **Step 4: Push (only if the user asks)**
Per CLAUDE.md, push to `origin` is the off-machine backup. If the user wants it pushed:
```bash
git push origin <branch-or-main-after-merge>
```
---
## Self-review notes (author)
- **Spec coverage:** decision/architecture/security/recovery → Task 1 (ADR-016); the spec's "Documentation & implementation changes" table → Tasks 2–7; deferrals (external SSO, OPNsense mesh specifics, role implementation) are recorded in ADR-016/STATUS, not implemented here (correct — they need the unbuilt `base`/service-role machinery). ✓
- **Not in scope (intentional):** the `netbird_coordinator` service role, the `base`-role agent task, vault `setup_key` material, and any live deployment — all wait on `base`/service-role machinery (STATUS-honest). ✓
- **No placeholders:** every edit shows exact find/replace text; the `_(retired)_` token in ADR-007 is deliberate table content. ✓
- **Name consistency:** ADR file is `016-mesh-vpn.md` everywhere; `vault.netbird.setup_key`, `netbird_coordinator`, and `wt0` are used identically across ADR-016 and the sweep. ✓