boma/docs/superpowers/plans/2026-06-14-m4b-netbird.md

# M4b — NetBird coordinator (service role) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use `- [ ]` checkboxes.
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird_coordinator`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
**Architecture:** NetBird's own `configure.sh` generates the canonical compose + config for a pinned version; boma **captures that reference once and translates it into role templates** (ADR-004/013 — don't run their imperative script in production, render from templates). Runs in **external-reverse-proxy mode** (no bundled Traefik); Caddy adds a `netbird.askari.wingu.me` route. Secrets (datastore encryption key, TURN password, Dex secrets) are generated into vault; the setup key is stubbed `CHANGEME` for M5.
**Tech Stack:** NetBird (combined `netbird-server` container if stable for the pinned version, else the multi-container set), embedded Dex IdP, Coturn, Docker Compose, Caddy (M4a), Ansible.
**Spec:** `docs/superpowers/specs/2026-06-14-netbird-coordinator-m4-design.md` · **Prereq:** M4a (Docker + Caddy) ✓ on askari.
**Execution context:** Task 1 runs `configure.sh` in a scratch dir (capture only). Tasks 2-6 are authoring work. **Task 7 deploys live to askari** (gated). NetBird self-hosting is finicky; expect live debugging.
---
### Task 1: Capture NetBird's reference setup (pin the version)
- [ ] **Step 1:** Pick + pin the NetBird version (ADR-014 — check the latest stable release). Record it.
- [ ] **Step 2:** In a scratch dir (on ubongo, throwaway), fetch NetBird's `getting-started`/`configure.sh` for that version and run it with answers for: domain `netbird.askari.wingu.me`, **external reverse proxy** (disable bundled Traefik/Caddy), **embedded Dex** (no external SSO), Let's Encrypt off (Caddy terminates TLS).
- [ ] **Step 3:** Capture the generated files verbatim into the plan/notes: `docker-compose.yml`, `management.json` (or `config.yaml`), `turnserver.conf`, `openid-configuration.json`, dashboard env. Also capture NetBird's **Caddy external-proxy template** (their docs ship one) — it shows the exact upstreams + HTTP/2/gRPC routing the dashboard/management/signal/relay need.
- [ ] **Step 4:** No commit (reference capture; informs Tasks 2-4).
---
### Task 2: `netbird_coordinator` service role — templates
**Files:** `roles/netbird_coordinator/` (scaffold via `make new-role NAME=netbird_coordinator`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
- [ ] **Step 1:** Translate the captured compose into `templates/docker-compose.yml.j2` — containers, the shared `boma` Docker network (so Caddy reaches them by name), **no host port mappings except what Caddy/Coturn need** (Coturn 3478/udp; everything else internal, Caddy fronts it). Pin image tags (ADR-011).
- [ ] **Step 2:** Translate `management.json`/`config.yaml` into a template — fill `Datadir`, `DataStoreEncryptionKey` (`{{ vault.netbird.datastore_key }}`), `HttpConfig` (public URL `https://netbird.askari.wingu.me`), `TURNConfig` (coturn host + `{{ vault.netbird.turn_password }}`), `Signal`, `Relay`, `Store` (sqlite), and the embedded-Dex IdP block (DeviceAuthorizationFlow/PKCE, `openid-configuration.json` URL).
- [ ] **Step 3:** `turnserver.conf.j2` (realm = `netbird.askari.wingu.me`, the TURN secret), `openid-configuration.json.j2`, `dashboard.env.j2` (`NETBIRD_MGMT_API_ENDPOINT=https://netbird.askari.wingu.me`, the `AUTH_*` Dex values).
- [ ] **Step 4:** `defaults/main.yml` (`netbird__*` knobs: version, base_dir `/opt/services/netbird`, domain) + `tasks/main.yml` (ADR-004 deploy mechanics: ensure dir, render all files, `community.docker.docker_compose_v2` up; `netbird__manage` toggle for Molecule).
- [ ] **Step 5:** `make lint`; commit `feat(netbird): coordinator service role (compose + config templates)`.
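
The deploy mechanics from Step 4 could look roughly like the sketch below. It assumes the defaults are named `netbird__base_dir` and `netbird__manage` (the plan names the latter; the former is inferred from the `base_dir` knob), and the file modes are placeholders rather than repo conventions. The template list mirrors the files named under **Files**; the real list depends on what the Task 1 capture produces.

```yaml
# roles/netbird_coordinator/tasks/main.yml (sketch, not the captured reference)
- name: Ensure the NetBird base directory exists
  ansible.builtin.file:
    path: "{{ netbird__base_dir }}"   # assumed default: /opt/services/netbird
    state: directory
    mode: "0750"                      # assumption: adjust to the repo's convention

- name: Render compose and config files from templates
  ansible.builtin.template:
    src: "{{ item }}.j2"
    dest: "{{ netbird__base_dir }}/{{ item }}"
    mode: "0640"                      # assumption: containers must still be able to read these
  loop:
    - docker-compose.yml
    - management.json
    - turnserver.conf
    - openid-configuration.json
    - dashboard.env

- name: Bring the compose project up
  community.docker.docker_compose_v2:
    project_src: "{{ netbird__base_dir }}"
    state: present
  when: netbird__manage | bool        # lets Molecule render files without starting containers
```

Gating only the compose step keeps the template rendering testable in Molecule while skipping the part that needs a working Docker daemon and real secrets.
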
---
### Task 3: Secrets (CHANGEME convention + generated)
- [ ] **Step 1:** Add to vault (`make edit-vault`): `vault.netbird.datastore_key`, `vault.netbird.turn_password`, any Dex client secret — **generate** strong values (or stub `CHANGEME` + a comment if operator-supplied). Add `vault.netbird.setup_key: CHANGEME` with a comment "created in the NetBird dashboard after first boot — M5 enrolment".
- [ ] **Step 2:** `make check-vault` confirms structure + lists the `setup_key` placeholder.
- [ ] **Step 3:** Commit the vault.
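
As a rough picture of the resulting vault shape, the keys from Step 1 would nest as shown below. The nesting follows the `vault.netbird.*` paths used in the templates; the placeholder strings and the `openssl` hint are assumptions, and the exact key format NetBird expects should come from the Task 1 capture.

```yaml
# Vault content, edited via `make edit-vault` (sketch; values are placeholders, never real secrets)
vault:
  netbird:
    # Generated once by the operator, for example with `openssl rand -base64 32`
    # (assumption: check the required length and encoding against the captured config)
    datastore_key: "<generated value>"
    turn_password: "<generated value>"
    # Created in the NetBird dashboard after first boot (M5 enrolment)
    setup_key: CHANGEME
```
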
---
### Task 4: Wire Caddy + DNS
- [ ] **Step 1:** Append to `reverse_proxy__routes` (`group_vars/all/reverse_proxy.yml`): `{host: netbird.askari.wingu.me, upstream: "<netbird container:port>"}` — per the captured Caddy template (NetBird needs HTTP/2 + gRPC; add the required Caddy directives, e.g. separate handles for the management gRPC path if the template shows them).
- [ ] **Step 2:** `netbird.askari.wingu.me` already resolves via the `*.askari.wingu.me` wildcard (M4a) — no new DNS record.
- [ ] **Step 3:** Commit.
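
The route entry from Step 1 might end up looking like this sketch. It assumes `reverse_proxy__routes` is a list of mappings with `host` and `upstream` keys, as the inline form in Step 1 suggests; the upstream is left as the plan's placeholder because it has to come from the Task 1 capture, and any gRPC-specific keys are omitted for the same reason.

```yaml
# group_vars/all/reverse_proxy.yml (excerpt; existing routes stay as they are)
reverse_proxy__routes:
  - host: netbird.askari.wingu.me
    upstream: "<netbird container:port>"   # placeholder from the plan, fill in from the Task 1 capture
```
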
---
### Task 5: Service-role standard files (ADR-004, authored)
- [ ] **Step 1:** Author `roles/netbird_coordinator/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
- [ ] **Step 2:** `VERIFY.md` (copy the template; the `/verify-service` UI spec — run later when the playwright harness exists).
- [ ] **Step 3:** `ACCESS.md` (ADR-021; the dashboard/admin access + `access__*` intent).
- [ ] **Step 4:** `BACKUP.md` (ADR-022; the **datastore is stateful** → `backup__*` data; record that off-site backup is **pending `fisi`**, an accepted risk for now).
- [ ] **Step 5:** `make lint`; commit `docs(netbird): service-role standard files (SECURITY/VERIFY/ACCESS/BACKUP)`.
---
### Task 6: Add netbird to the offsite playbook
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird_coordinator` after `reverse_proxy` (role-name tag). `make lint`. Commit.
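
A minimal sketch of the playbook change follows, assuming the play lists roles with one tag per role name. The `hosts` pattern and the shape of the existing `reverse_proxy` entry are assumptions, not copied from the repo.

```yaml
# playbooks/offsite.yml (excerpt; hosts pattern and existing entries are assumptions)
- hosts: offsite_hosts
  roles:
    - role: reverse_proxy
      tags: [reverse_proxy]
    - role: netbird_coordinator
      tags: [netbird_coordinator]   # role-name tag, per Step 1
```

Task 7 passes `TAGS=netbird`, so the tag here and the tag used at deploy time need to agree on whichever name the repo settles on.
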
---
### Task 7: Deploy to askari + verify (gated, live — expect debugging)
> NetBird self-hosting is finicky; budget for iterating on the management config + Caddy routing.
- [ ] **Step 1:** `make check PLAYBOOK=offsite LIMIT=askari TAGS=netbird` — review.
- [ ] **Step 2:** `make deploy PLAYBOOK=offsite LIMIT=askari TAGS=netbird` → `make deploy ... TAGS=reverse_proxy` (Caddy reloads with the netbird route).
- [ ] **Step 3:** Verify: `docker compose ps` all healthy; `curl -sI https://netbird.askari.wingu.me` → 200 with the M4a cert; the **dashboard loads** in a browser; the management API responds. Iterate on config/routing until green.
- [ ] **Step 4:** No repo commit (host state).
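
If the `curl` check from Step 3 ever needs to be repeatable, one option is an Ansible probe like the sketch below. It is not part of the plan: the task placement, the `netbird__probe` variable name, and the retry numbers are all hypothetical, and it only covers the HTTP 200 check, not the dashboard or management API.

```yaml
# Hypothetical post-deploy probe (sketch); mirrors `curl -sI https://netbird.askari.wingu.me`
- name: Check that the NetBird endpoint answers 200 over HTTPS
  ansible.builtin.uri:
    url: "https://netbird.askari.wingu.me"
    method: HEAD
    status_code: 200
    validate_certs: true          # fails if the M4a certificate is not trusted
  register: netbird__probe        # hypothetical variable name
  until: netbird__probe is succeeded
  retries: 5                      # assumption: allow containers time to become healthy
  delay: 10
  delegate_to: localhost          # probe from the control node, through Caddy
  run_once: true
```
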
---
### Task 8: Docs
- [ ] **Step 1:** STATUS — `netbird_coordinator` built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
---
## Self-Review (completed)
- **Spec coverage:** external-proxy NetBird + embedded Dex (Decisions 3) → Tasks 1,2,4; first service role + standard files (Decision 7) → Tasks 2,5; firewall 3478 (Decision 5) → done in M4a; setup key M5 + CHANGEME (Decision 8) → Task 3; Caddy front (M4a) → Task 4. Enrolment → M5, correct.
- **Placeholder scan:** the concrete config field *values* are intentionally captured from `configure.sh` (Task 1) rather than invented — version-sensitive, and inventing them would be wrong. The plan pins the method, not guesses.
- **Risk:** NetBird's external-proxy + gRPC routing is the hard part — Task 1 captures NetBird's own Caddy template to get it right, and Task 7 budgets for live iteration.