Compare commits
10 commits: a53941dffe...cd62c5e098
| Author | SHA1 | Date |
|---|---|---|
| | cd62c5e098 | |
| | ed9fdcc10a | |
| | 787aa3b8e1 | |
| | 841f666de9 | |
| | 08165ffb68 | |
| | 2ae5cf4535 | |
| | 5a32dd46d3 | |
| | ff796c64ca | |
| | 4b85b14f1f | |
| | 99ace3eb48 | |
10 changed files with 826 additions and 26 deletions

@@ -202,6 +202,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
| Terraform | `docs/decisions/006-terraform.md` |
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
| Testing methodology | `docs/decisions/008-testing.md` |
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |

@@ -53,6 +53,8 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
| NetBird mesh — coordinator on `askari` | ADR-016 | Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Decision recorded; not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |

## Keeping this honest

@@ -26,7 +26,7 @@ decisions this frame enables.
|---|---|---|---|---|---|
| Reverse proxy / TLS | Traefik | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
| Internal DNS | `dns` role → dns1/dns2 | P | core | Authoritative internal zone (ADR-007) | Ansible-rendered zone |
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh |
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
| Service portal / dashboard | Homepage | A | candidate | One landing page listing all services — a "what does what" front door | Gap surfaced by V4; fits boma's legibility goal |

_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_

@@ -47,7 +47,7 @@ ISP
| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |

---

@@ -102,13 +102,14 @@ Assigned infrastructure addresses:
| `10.50.0.1` | OPNsense gateway |
| `10.50.0.100`–`.249` | DHCP pool |

### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
### VLAN 99 — vpn — retired

| Address | Host |
|---|---|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
| `10.99.0.2` | `askari` (Hetzner VPS) |
| `10.99.0.10`+ | Road-warrior clients |
The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
`10.99.0.0/24` is freed.
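
As a rough illustration of why no boma subnet has to be set aside, the sketch below checks whether an IPv4 address falls inside `100.64.0.0/10`, the default overlay range quoted above. The helper name `in_overlay_range` is hypothetical; it is not part of the repo or of NetBird. It relies only on the fact that a `/10` on `100.64.0.0` covers a first octet of 100 and a second octet of 64 through 127.

```bash
# Hypothetical helper: succeeds when the IPv4 address lies in 100.64.0.0/10.
in_overlay_range() {
  case $1 in
    100.*) ;;            # first octet must be 100
    *) return 1 ;;
  esac
  second=${1#100.}       # drop the first octet
  second=${second%%.*}   # keep only the second octet
  # A /10 fixes the top two bits of the second octet: 64 through 127.
  [ "$second" -ge 64 ] 2>/dev/null && [ "$second" -le 127 ]
}
```

For example, `in_overlay_range 10.99.0.2` fails, so the retired `10.99.0.0/24` range and the overlay default do not overlap.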

### Corosync ring (172.16.0.0/24) — not on managed switch

@@ -132,8 +133,8 @@ Assigned infrastructure addresses:
| `iot` | internet | allow egress only |
| `iot` | `srv` (HA IP only) | allow on integration ports |
| `guest` | internet | allow, isolated from all internal |
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
| `vpn` | `mgmt` | allow (administration from askari) |
| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |

**Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required
ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device discovery.
@@ -176,11 +177,12 @@ All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).

## External monitoring — askari

`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
for administration.
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.

`askari` is provisioned and managed independently of the Proxmox cluster — it
must be reachable even when the homelab is down (its entire purpose).
`askari` is provisioned and managed independently of the Proxmox cluster — it must
be reachable even when the homelab is down (its entire purpose), which is also why
the mesh coordinator lives here: an off-site control plane survives a homelab outage.
FQDN: `askari.baobab.band`.

@@ -63,14 +63,15 @@ Manual, on bare metal:

1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN (choice deferred — see below).
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
baseline config via the `control` group (`base` role only).

### Access & security

- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
mesh; nothing is published to the public internet — this stays inside ADR-002.
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
denied on the physical NIC.
@@ -109,10 +110,9 @@ master password.

## Deferred (separate specs / discussions)

1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
or it recreates the chicken-and-egg.
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
(off-site, so it survives a homelab outage and stays out of the cluster it
administers). Replaces ADR-007's OPNsense WireGuard.
2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user
generation, screenshot-back-to-Claude, and the new ADR-008 level.
3. **`rbw` offline-cache verification** — confirm offline decryption before relying

105
docs/decisions/016-mesh-vpn.md
Normal file

@@ -0,0 +1,105 @@
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)

## Context

`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.

## Decision

A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.

The decision in five parts:

1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
survives a homelab outage and stays out of the cluster it administers.
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
parity) — ruled out below.
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (2–5
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity — embedded local users** (Dex in the management container); external SSO
(Zitadel/Keycloak) stays an optional future.

## Verified facts (ADR-014)

verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.

verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.

## Architecture

Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.

- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
- `ubongo` — agent.
- All Linux managed hosts — agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
- OPNsense / `mgmt` — single non-agent exception.

## Security

- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
Recorded as accepted-risk R3.
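
The host-firewall bullet above can be pictured with a minimal sketch. The `base` role is unbuilt, so this is not its actual template: the function name `render_ssh_rules` and the table and chain names are assumptions. It only prints a default-deny nftables input chain that accepts SSH on the mesh interface and nowhere else.

```bash
# Hypothetical sketch: print an nftables ruleset that allows inbound SSH
# only on the mesh interface (wt0 by default) under a default-drop policy.
render_ssh_rules() {
  iface=${1:-wt0}
  cat <<EOF
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iifname "lo" accept
    iifname "${iface}" tcp dport 22 accept
  }
}
EOF
}
```

A possible use is `render_ssh_rules wt0 > /tmp/ssh-mesh.nft` followed by a syntax check with `nft -c -f /tmp/ssh-mesh.nft` before anything is loaded. Because the chain policy is `drop`, SSH arriving on the physical NIC is refused without needing an explicit deny rule.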

## Recovery & operations

- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
`base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is backed up encrypted off `askari` (synced to
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
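
The enrollment step in the bootstrap order above might look roughly like the dry-run sketch below. It is an illustration under stated assumptions, not the `base` role's task: the helper name `netbird_enroll_cmd` and the management URL are hypothetical, and the shell variable `NB_SETUP_KEY` stands in for the vaulted `vault.netbird.setup_key`. The helper only prints the `netbird up` command and never runs it.

```bash
# Hypothetical dry-run helper: print the agent enrollment command for a
# self-hosted coordinator. The setup key stays in $NB_SETUP_KEY so it is
# never echoed in clear text.
netbird_enroll_cmd() {
  mgmt_url=$1
  printf 'netbird up --management-url %s --setup-key "$NB_SETUP_KEY"\n' "$mgmt_url"
}

# Example (placeholder URL, not boma's real coordinator address):
#   netbird_enroll_cmd https://netbird.example.invalid
```

Printing rather than executing keeps the sketch safe to run anywhere; the real task would run once per host, after the coordinator on `askari` is up.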

## Status

Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.

## What was ruled out

| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |

See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).

@@ -128,8 +128,8 @@ provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
make collections # Ansible collections
rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
```
4. Join the mesh VPN (choice deferred — see ADR-015) so it is reachable over SSH
from elsewhere.
4. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016) — so it is
reachable over SSH from elsewhere.
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.

Because `ubongo` is not in `local.vms`, this is the only case where editing

@@ -15,7 +15,7 @@ revisit (trigger).
|---|---|---|---|
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |

_Last reviewed: 2026-06-05. The prior gaps (full CIS hardening, SELinux/AppArmor,
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS

484
docs/superpowers/plans/2026-06-05-mesh-vpn-netbird.md
Normal file

@@ -0,0 +1,484 @@
# Mesh VPN (NetBird) Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Record the decision that boma's mesh VPN is NetBird (self-hosted on `askari`), by authoring ADR-016 and reconciling every doc that currently assumes OPNsense WireGuard or an undecided VPN.

**Architecture:** Documentation-only change. NetBird replaces ADR-007's VLAN-99 OPNsense WireGuard as the single remote-access overlay for `ubongo`, `askari`, and road-warrior clients; coordinator self-hosted off-site on `askari`; agent-per-host enrollment via the (unbuilt) `base` role; embedded local-user identity. The role/service implementation waits on the `base` role and service-role machinery that STATUS.md lists as not-yet-built — this plan settles the decision and the doc reconciliation only.

**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus a final cross-reference/staleness sweep. No markdown linter exists, so "tests" are hook-pass + grep checks.
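
One way the grep half of that sweep could look is sketched below. The function names and the pattern list are assumptions rather than anything the plan prescribes; the patterns are phrases that occur only in the pre-NetBird wording (`choice deferred`, `pending VPN choice`, `WireGuard endpoint`). A bare `10.99.0` pattern is deliberately left out, since the retired-VLAN text still mentions `10.99.0.0/24` legitimately.

```bash
# Hypothetical staleness sweep: list lines that still carry pre-ADR-016 wording.
stale_refs() {
  grep -rnE 'choice deferred|pending VPN choice|WireGuard endpoint' "$@"
}

# Succeeds only when no stale wording is left under the given paths.
sweep_clean() {
  if stale_refs "$@"; then
    echo "stale VPN wording found" >&2
    return 1
  fi
}

# Example: sweep_clean docs/decisions docs/security STATUS.md
```

The sweep should be pointed at the reconciled docs rather than at this plan, because the plan quotes the old wording in its Find blocks and would always match.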

---

## Pre-flight (read once before starting)

- **`rbw` must be unlocked before every commit** (the pre-commit ansible-lint hook decrypts `vault.yml`). Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order matters:** Task 1 (ADR-016) lands first — every later task links to it.
- **Spec reference:** `docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md`.
- **Branch:** start by creating `chore/mesh-vpn-netbird-docs` off `main` (the controller does this before dispatching Task 1; do not implement on `main`).
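
The two mechanical checks in this list (vault unlocked, not on `main`) could be wrapped in a small guard along the following lines. This is a sketch; the function name `preflight` and its messages are assumptions. It takes the branch name as an argument and relies on the `rbw unlocked` exit code exactly as described above.

```bash
# Hypothetical pre-flight guard: refuse to proceed on main or with rbw locked.
preflight() {
  branch=$1
  if [ "$branch" = "main" ]; then
    echo "refusing to implement on main" >&2
    return 1
  fi
  # rbw unlocked exits 0 only when the vault is unlocked.
  if ! rbw unlocked >/dev/null 2>&1; then
    echo "rbw is locked; ask the user to run: rbw unlock" >&2
    return 1
  fi
  echo "pre-flight ok on $branch"
}

# Example: preflight "$(git rev-parse --abbrev-ref HEAD)"
```

Running it before each commit keeps the ansible-lint hook from failing on a locked vault partway through a task.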

---

## File map

| File | Action | Responsibility after change |
|---|---|---|
| `docs/decisions/016-mesh-vpn.md` | Create | Home of record for the NetBird mesh decision |
| `docs/decisions/007-network.md` | Modify | VLAN-99 WireGuard retired; askari rides the mesh + hosts the coordinator |
| `docs/decisions/015-control-host.md` | Modify | Resolve deferred item #1 (mesh = NetBird on askari) |
| `docs/security/accepted-risks.md` | Modify | Replace R3 placeholder with the concrete residual risk |
| `docs/CAPABILITIES.md` | Modify | VPN row decided: NetBird, self-hosted |
| `STATUS.md` | Modify | Two rows: NetBird coordinator + agent enrollment (designed, not built) |
| `CLAUDE.md` | Modify | ADR-016 in Further reading |

---

### Task 1: Author ADR-016 (the home of record)

**Files:**
- Create: `docs/decisions/016-mesh-vpn.md`

- [ ] **Step 1: Create the ADR file**

Create `docs/decisions/016-mesh-vpn.md` with exactly this content (preserve em-dashes —, backticks, table pipes, and the `verified:` stamps):

```markdown
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)

## Context

`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.

## Decision

A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.

The decision in five parts:

1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
survives a homelab outage and stays out of the cluster it administers.
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
parity) — ruled out below.
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (2–5
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity — embedded local users** (Dex in the management container); external SSO
(Zitadel/Keycloak) stays an optional future.

## Verified facts (ADR-014)

verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.

verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.

## Architecture

Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.

- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
- `ubongo` — agent.
- All Linux managed hosts — agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
- OPNsense / `mgmt` — single non-agent exception.

## Security

- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
Recorded as accepted-risk R3.

## Recovery & operations

- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
`base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is backed up encrypted off `askari` (synced to
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.

## Status

Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.

## What was ruled out

| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |

See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).
```

- [ ] **Step 2: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/decisions/016-mesh-vpn.md`
Expected: Passed/Skipped (ansible-lint Skipped for non-YAML).
```bash
git add docs/decisions/016-mesh-vpn.md
git commit -m "Add ADR-016 (mesh VPN — NetBird self-hosted on askari)"
```

---

### Task 2: Amend ADR-007 (retire VLAN-99 WireGuard, askari on the mesh)

**Files:**
- Modify: `docs/decisions/007-network.md`

Read the file first, then make FOUR exact edits. Preserve em-dashes —, backticks, table pipes.

- [ ] **Step 1: Update the VLAN-99 row in the VLAN design table**

Find:
```
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
```
Replace with:
```
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
```

- [ ] **Step 2: Replace the VLAN-99 addressing subsection**

Find:
```
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard

| Address | Host |
|---|---|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
| `10.99.0.2` | `askari` (Hetzner VPS) |
| `10.99.0.10`+ | Road-warrior clients |
```
Replace with:
```
### VLAN 99 — vpn — retired

The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
`10.99.0.0/24` is freed.
```

- [ ] **Step 3: Update the two `vpn` rows in the OPNsense firewall-rules table**

Find:
```
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
| `vpn` | `mgmt` | allow (administration from askari) |
```
Replace with:
```
| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |
```

- [ ] **Step 4: Rewrite the "External monitoring — askari" section**

Find:
```
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
for administration.

`askari` is provisioned and managed independently of the Proxmox cluster — it
must be reachable even when the homelab is down (its entire purpose).
FQDN: `askari.baobab.band`.
```
Replace with:
```
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.

`askari` is provisioned and managed independently of the Proxmox cluster — it must
be reachable even when the homelab is down (its entire purpose), which is also why
the mesh coordinator lives here: an off-site control plane survives a homelab outage.
FQDN: `askari.baobab.band`.
```

- [ ] **Step 5: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/decisions/007-network.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/007-network.md
git commit -m "ADR-007: retire VLAN-99 WireGuard for the NetBird mesh (ADR-016)"
```

---

### Task 3: Resolve ADR-015 deferred item #1

**Files:**
- Modify: `docs/decisions/015-control-host.md`

Read the file first, then make THREE exact edits.

- [ ] **Step 1: Update provisioning step 3**

Find:
```
3. Join the mesh VPN (choice deferred — see below).
```
Replace with:
```
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
```

- [ ] **Step 2: Update the Access & security mesh line**

Find:
```
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
mesh; nothing is published to the public internet — this stays inside ADR-002.
```
Replace with:
```
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
```

- [ ] **Step 3: Resolve deferred item #1**

Find:
```
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
or it recreates the chicken-and-egg.
```
Replace with:
```
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
(off-site, so it survives a homelab outage and stays out of the cluster it
administers). Replaces ADR-007's OPNsense WireGuard.
```

- [ ] **Step 4: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/015-control-host.md
git commit -m "ADR-015: resolve mesh-VPN deferral — NetBird on askari (ADR-016)"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Replace accepted-risks R3 with the concrete residual risk

**Files:**
- Modify: `docs/security/accepted-risks.md`

Read the file first, then make ONE exact edit. (The row is long — match it whole.)

- [ ] **Step 1: Replace the R3 row**

Find:
```
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
```
Replace with:
```
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
```

- [ ] **Step 2: Bump the "Last reviewed" date**

Find:
```
_Last reviewed: 2026-06-05. The prior gaps
```
This already reads `2026-06-05` (today) from the previous work, so **no change is needed** — confirm it says `2026-06-05` and move on. (If it shows an earlier date, set it to `2026-06-05`.)

- [ ] **Step 3: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: Passed/Skipped.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: R3 now the concrete NetBird coordinator risk"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Update the CAPABILITIES VPN row

**Files:**
- Modify: `docs/CAPABILITIES.md`

Read the file first, then make ONE exact edit.

- [ ] **Step 1: Replace the VPN / remote access row**

Find:
```
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh |
```
Replace with:
```
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
```

- [ ] **Step 2: Verify and commit**

Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
Expected: Passed/Skipped.
```bash
git add docs/CAPABILITIES.md
git commit -m "CAPABILITIES: VPN decided — NetBird self-hosted (ADR-016)"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Add NetBird rows to STATUS.md

**Files:**
- Modify: `STATUS.md`

Read the file first, then make ONE exact edit (add two rows after the `ubongo` row).

- [ ] **Step 1: Add the two rows**

Find:
```
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
```
Replace with that SAME line followed by the two new rows:
```
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
| NetBird mesh — coordinator on `askari` | ADR-016 | Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Decision recorded; not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
```

- [ ] **Step 2: Verify and commit**

Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record NetBird mesh (coordinator + base enrollment)"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Link ADR-016 from CLAUDE.md

**Files:**
- Modify: `CLAUDE.md`

Read the file first, then make ONE exact edit.

- [ ] **Step 1: Add the Further reading row after Network topology**

Find:
```
| Network topology | `docs/decisions/007-network.md` |
```
Replace with that SAME line followed by the new row:
```
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
```

- [ ] **Step 2: Verify and commit**

Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-016 (mesh VPN)"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 8: Final consistency sweep

**Files:** none modified (verification only)

- [ ] **Step 1: Confirm no doc still treats OPNsense WireGuard / `10.99` as the active remote-access path, and no "pending/deferred VPN" language remains**

Run:
```bash
grep -rniE "choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the ONLY hits are in `007-network.md` and `016-mesh-vpn.md`, where they describe the **retirement** of `10.99.0.0/24` (e.g. "`10.99.0.0/24` is freed", "no `10.99.0.0/24` routing") — those are correct and expected. There must be **no** hit that still treats OPNsense WireGuard or `10.99.0.x` as the *live* remote-access path, and **no** `choice deferred` / `pending VPN choice` anywhere. Legitimate mentions of "WireGuard" as NetBird's *data plane* are fine and won't match this pattern (it only matches `WireGuard endpoint|peers|to OPNsense`). If a canonical doc still names the WireGuard VPN as live, fix it as in the relevant task above and amend that commit.
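
The "only hits in `007-network.md` and `016-mesh-vpn.md`" expectation can also be checked mechanically rather than by eye. The following is a minimal sketch, not part of the plan itself: the function names `unexpected_hits` and `sweep` and the `ALLOWED` pattern are illustrative assumptions, and it assumes the script is run from the repository root so that grep prints paths starting with `docs/`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Same pattern as the Step 1 grep.
PATTERN='choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)'
# Assumption: only these two ADRs may still mention the retired scheme.
ALLOWED='^docs/decisions/(007-network|016-mesh-vpn)\.md:'

# Reads "path:line:text" lines on stdin and prints only the unexpected ones:
# plan/spec files are ignored, and hits in the two allowed ADRs are dropped.
unexpected_hits() {
  grep -vE 'superpowers/(plans|specs)/' | grep -vE "$ALLOWED" || true
}

# Runs the sweep from the repository root; returns 1 if anything unexpected remains.
sweep() {
  local hits
  hits=$({ grep -rniE "$PATTERN" docs/ CLAUDE.md STATUS.md 2>/dev/null || true; } | unexpected_hits)
  if [ -n "$hits" ]; then
    printf 'unexpected hits:\n%s\n' "$hits" >&2
    return 1
  fi
  echo "sweep clean"
}
```

Calling `sweep` from the repository root would exit non-zero on any hit outside the two ADRs. The hits that remain inside those ADRs still need a human read to confirm they describe the retirement and not a live path.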
|
||||
|
||||
- [ ] **Step 2: Confirm ADR-016 exists and is cross-linked**

Run:
```bash
test -f docs/decisions/016-mesh-vpn.md && echo "ADR-016 present"
grep -rl "ADR-016\|016-mesh-vpn" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the file exists and the referencing docs (007, 015, accepted-risks, CAPABILITIES, STATUS, CLAUDE.md) appear.
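
To turn the "referencing docs appear" expectation into a pass/fail check, one option is to list the six expected files explicitly and report any that lack the reference. This is a sketch under that assumption; `missing_refs` is a hypothetical helper name, not something the plan defines.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Prints each given file that does not mention ADR-016 (or does not exist).
# Returns 1 if at least one file was printed, 0 otherwise.
missing_refs() {
  local f rc=0
  for f in "$@"; do
    if ! grep -qE 'ADR-016|016-mesh-vpn' "$f" 2>/dev/null; then
      echo "$f"
      rc=1
    fi
  done
  return "$rc"
}
```

From the repository root it could be invoked as `missing_refs docs/decisions/007-network.md docs/decisions/015-control-host.md docs/security/accepted-risks.md docs/CAPABILITIES.md STATUS.md CLAUDE.md`; empty output means every expected doc links the new ADR.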
|
||||
|
||||
- [ ] **Step 3: Full hook run**

Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks Passed/Skipped. Fix anything that fails (most likely trailing whitespace / end-of-file) and amend the owning commit.

- [ ] **Step 4: Push (only if the user asks)**

Per CLAUDE.md, push to `origin` is the off-machine backup. If the user wants it pushed:
```bash
git push origin <branch-or-main-after-merge>
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review notes (author)

- **Spec coverage:** decision/architecture/security/recovery → Task 1 (ADR-016); the spec's "Documentation & implementation changes" table → Tasks 2–7; deferrals (external SSO, OPNsense mesh specifics, role implementation) are recorded in ADR-016/STATUS, not implemented here (correct — they need the unbuilt `base`/service-role machinery). ✓
- **Not in scope (intentional):** the `netbird_coordinator` service role, the `base`-role agent task, vault `setup_key` material, and any live deployment — all wait on `base`/service-role machinery (STATUS-honest). ✓
- **No placeholders:** every edit shows exact find/replace text; the `_(retired)_` token in ADR-007 is deliberate table content. ✓
- **Name consistency:** ADR file is `016-mesh-vpn.md` everywhere; `vault.netbird.setup_key`, `netbird_coordinator`, and `wt0` are used identically across ADR-016 and the sweep. ✓
|
||||
```
|
||||
206
docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md
Normal file
|
|
@ -0,0 +1,206 @@
|
|||
# Design — Mesh VPN (NetBird, self-hosted on `askari`)

- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
  R3 "pending VPN choice" placeholder
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
- **Becomes:** ADR-016 (this design is the basis for that ADR)
|
||||
|
||||
---
|
||||
|
||||
## Problem

`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
without exposing anything to the public internet. ADR-015 left the access mechanism —
the "mesh VPN" — deferred to this discussion.

Meanwhile ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to
OPNsense, plus road-warrior clients. And `docs/CAPABILITIES.md` already flags the open
question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
alternative to weigh."*

So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
Tailscale), with an architectural sub-question of whether a mesh replaces or coexists
with the ADR-007 WireGuard.
|
||||
|
||||
## Decisions (as settled)

1. **Scope — the mesh *replaces* WireGuard.** A single overlay becomes the sole
   remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's
   VLAN-99 OPNsense WireGuard design is retired.
2. **Control plane — self-hosted, on `askari`.** Maximum sovereignty (boma already
   self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
   coordinator that survives a homelab outage and stays out of the cluster it
   administers.
3. **Tool — NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
   self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
   means Headscale, a separate third-party reimplementation with partial parity — ruled
   out below.)
4. **Routing — NetBird agent on every (Linux) host**, not a subnet router. At boma's
   scale (2–5 hosts, treated as individuals) the usual "agent everywhere" downside is
   moot, and the `base` role already runs on every host, so enrollment is one uniform
   role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs
   that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is not a
   first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a
   single advertised route or by administering OPNsense from an on-LAN meshed peer.
5. **Identity — embedded local users** (Dex, built into the management container), not
   a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
   documented future option.
|
||||
|
||||
## Verified facts (ADR-014)

> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
>   the core services are **merged into a single container**; deploy via Docker Compose.
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
>   external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
>   required.
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
>
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
>   elsewhere. Fully open source, self-hostable, no open-core feature gating.
|
||||
|
||||
---
|
||||
|
||||
## Architecture & topology

A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird
on `askari`.

**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
cluster per ADR-007) runs the **NetBird management stack** (single container:
management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 + UDP
3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh survive a
full homelab outage and keeps the coordinator out of the cluster it administers (no
chicken-and-egg).

**Peers:**
- `askari` — coordinator + peer.
- `ubongo` (control/AI-worker host) — agent.
- All Linux managed hosts (`dns1/2`, `proxy`, …) — agent via the `base` role.
- Road-warrior clients — `mamba`, phone, work PC — agent/app.
- OPNsense / `mgmt` — the single non-agent exception (advertised route or LAN-side
  admin from a meshed peer).

**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
ACLs instead of OPNsense routing `10.99.0.0/24`.
|
||||
|
||||
---
|
||||
|
||||
## Security model, ACLs, and attack surface

**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
- road-warrior clients → only what each needs; nothing by default.

**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
role. Prefer ephemeral/scoped keys (ADR-002).
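
As a rough illustration of what the `base` role's enrollment task would end up running on each host, the sketch below wraps the NetBird client's `up` command. It is not the role implementation (which is unbuilt): `enroll_peer` and `DRY_RUN` are hypothetical names, the management URL is a placeholder, and the `--management-url` / `--setup-key` flags should be confirmed against the pinned NetBird client version before use.

```bash
#!/usr/bin/env bash
set -euo pipefail

# enroll_peer MANAGEMENT_URL SETUP_KEY
# Joins this host to the mesh non-interactively.
# With DRY_RUN=1 it only prints the command, with the setup key redacted,
# so the sketch can be exercised on a machine without NetBird installed.
enroll_peer() {
  local mgmt_url=$1 setup_key=$2
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "netbird up --management-url ${mgmt_url} --setup-key <redacted>"
    return 0
  fi
  netbird up --management-url "$mgmt_url" --setup-key "$setup_key"
}
```

In the real role the key would come from `vault.netbird.setup_key`, not a shell argument typed by hand; redacting it in the dry-run output keeps the secret out of logs.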
|
||||
|
||||
**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base`
role's nftables default-deny allows inbound admin (SSH) **only on `wt0`**, denied on
the physical NIC — the pattern ADR-015 set for `ubongo`, now applied fleet-wide. Mesh
+ nftables are defence-in-depth.
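
The "SSH only on `wt0`" rule could look roughly like the ruleset rendered below. This is a minimal sketch, not the `base` role's actual template: the `render_ruleset` helper and the table/chain names are assumptions, and a real ruleset would need further rules (ICMP, service ports, anything the mesh agent itself requires) that are deliberately left out here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Prints an nftables ruleset that drops inbound traffic by default and accepts
# SSH only when it arrives on the mesh interface (wt0 unless overridden).
# iifname matches by interface name, so the ruleset loads even before wt0 exists.
render_ruleset() {
  local mesh_if=${1:-wt0}
  cat <<EOF
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iifname "lo" accept
    iifname "${mesh_if}" tcp dport 22 accept
  }
}
EOF
}
```

On a host with nftables installed, `render_ruleset | nft -c -f -` should parse the output in check mode without applying it, which is a cheap way to validate the syntax before the role templates it.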
|
||||
|
||||
**The new attack surface — a public control plane on `askari`.** Today `askari`
exposes a WireGuard UDP port; with NetBird self-hosted it exposes the **management API
+ dashboard (80/443)** and **Coturn (3478)** publicly, and the management API is
keys-to-the-kingdom for the whole mesh. Mitigations baked in:
- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
  practical.
- `askari` runs `base` hardening (already a public managed host) and NetBird is
  **version-pinned** (ADR-011) and patched on boma's cadence — self-hosting means
  owning the CVE cadence (AGPLv3 server).

Net vs ADR-002: nothing from the **cluster** is publicly exposed; the only public
surface is on `askari` (a public VPS by design), shifting from "WireGuard port" to
"NetBird control plane."
|
||||
|
||||
---
|
||||
|
||||
## Recovery, bootstrap ordering, and operations

**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
`ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and
there is no chicken-and-egg in the critical path.

**Bootstrap order** (askari-first):
1. Stand up the NetBird coordinator on `askari`.
2. Enroll `ubongo`.
3. `base` role enrolls the rest of the fleet via setup keys from vault.

**Recovery.** Coordinator off-site on `askari` ⇒ the mesh survives a full homelab
outage. Two must-haves:
- **Back up NetBird's management datastore** off `askari` — encrypted, synced to
  `ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
- Existing peer tunnels keep running on last-known config through a brief coordinator
  outage; only changes/new enrollments need it live — so `askari` is important but not
  instantly fatal.

**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
service = one role per ADR-004, with its own `SECURITY.md` per the service-role
standard). Agent install/enrollment lives in `base`.

**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
NetBird's built-in DNS is scoped/off to avoid overlap. NetBird server (on `askari`)
and agents (via `base`) are version-pinned (ADR-011).
|
||||
|
||||
---
|
||||
|
||||
## Documentation & implementation changes

This is a substantial decision → its own ADR, with amendments linking to it.

| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided — NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |

**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
decision and the doc reconciliation; the role tasks land when `base` is built.
|
||||
|
||||
---
|
||||
|
||||
## Deferred / out of scope

1. **External SSO IdP** (Zitadel/Keycloak) — embedded local users now; SSO later if a
   second operator or service-SSO need appears.
2. **OPNsense mesh integration specifics** — the exact `mgmt` reachability mechanism
   (single advertised route vs LAN-side admin) is settled during implementation when
   OPNsense automation is built.
3. **The `base` / `netbird_coordinator` role implementation** — depends on the
   unbuilt `base` role and service-role standard.
|
||||
|
||||
---
|
||||
|
||||
## What was ruled out

| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support — weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |

See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
ADR-013 (heritage — V4 used WireGuard; NetBird is translated, not transplanted).
|
||||