diff --git a/docs/superpowers/plans/2026-06-20-mesh-spof-accept-resilience.md b/docs/superpowers/plans/2026-06-20-mesh-spof-accept-resilience.md
new file mode 100644
index 0000000..73b217a
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-20-mesh-spof-accept-resilience.md
@@ -0,0 +1,237 @@
+# Mesh SPOF — accept + targeted resilience — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Accept askari's single-coordinator SPOF as a documented availability trade-off, and harden the one real gap — a `base` mesh knob that pins the coordinator FQDN in `/etc/hosts` on managed mesh hosts so a local-DNS hiccup can't strand the mesh.
+
+**Architecture:** One additive, idempotent `base` `mesh`-concern task (a `/etc/hosts` line via `lineinfile`, gated on a new opt-in knob), Molecule-tested; plus documentation (accepted-risk R8 + an ADR-016 availability amendment + STATUS/ROADMAP). No new infra, no Terraform, no live-deploy gate.
+
+**Tech Stack:** Ansible (`base` role, `lineinfile`), Molecule (Debian 13), Markdown docs.
+
+**Spec:** `docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md`
+
+## Global Constraints
+
+- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
+- **No new collection** — derive the coordinator FQDN with builtin `regex_replace` (NOT `urlsplit`, which would pull in `community.general`).
+- The pin is **opt-in and additive**: gated on `base__mesh_enabled | bool` AND `base__mesh_coordinator_pin | length > 0`. Empty knob (the default) = a clean no-op. The coordinator host (`askari`/`offsite_hosts`) is **exempt** — leave its pin empty.
+- **askari's coordinator IP = `77.42.120.136`** (stable WAN; the A record for `netbird.askari.wingu.me`); ubongo is in the `control` group.
+- `make lint` clean + `rbw unlocked` before any commit (the pre-commit hook decrypts the vault).
+- **No new infra** — no P2P, no second relay/coordinator, no Terraform. The coordinator off-site backup is **out of scope** (ADR-022 kickoff).
+- Tags: the new task carries the `mesh` concern tag (it belongs to the mesh concern).
+
+---
+
+### Task 1: `base` mesh coordinator-FQDN `/etc/hosts` pin (DNS-resilience)
+
+Add an opt-in knob that pins the coordinator FQDN (derived from `base__mesh_management_url`) to a stable IP in `/etc/hosts`, so a managed mesh host survives a local-DNS failure. TDD'd through the role's Molecule scenario (which already exercises the `mesh` concern with `manage: false`).
+
+**Files:**
+- Modify: `roles/base/defaults/main.yml` (add the knob after the mesh block, ~line 53)
+- Modify: `roles/base/tasks/mesh.yml` (append the pin task)
+- Modify: `roles/base/molecule/default/converge.yml` (add a fixture pin to the vars block)
+- Modify: `roles/base/molecule/default/verify.yml` (assert the rendered `/etc/hosts` line)
+- Modify: `inventories/production/group_vars/control/vars.yml` (set the pin for ubongo)
+
+**Interfaces:**
+- Produces: role default `base__mesh_coordinator_pin` (string, default `""`); when it is set and `base__mesh_enabled` is true, an `/etc/hosts` line `<ip> <fqdn>` where `<fqdn>` is `base__mesh_management_url` minus scheme/port/path.
+
+- [ ] **Step 1: Write the failing Molecule test (fixture + assertion)**
+
+In `roles/base/molecule/default/converge.yml`, add one line to the `vars:` block (after `base__mesh_setup_key`, ~line 15):
+
+```yaml
+    base__mesh_coordinator_pin: "203.0.113.9"  # fixture coordinator IP (TEST-NET-3); pins the FQDN from base__mesh_management_url
+```
+
+In `roles/base/molecule/default/verify.yml`, append to the `tasks:` list (after the mesh no-op assertion at the end):
+
+```yaml
+    - name: Read /etc/hosts (coordinator pin)
+      ansible.builtin.slurp:
+        src: /etc/hosts
+      register: _etchosts
+    - name: Assert the coordinator FQDN is pinned to the fixture IP (DNS-resilience / R8)
+      ansible.builtin.assert:
+        that:
+          - "'203.0.113.9 netbird.askari.wingu.me' in (_etchosts.content | b64decode)"
+        fail_msg: "base__mesh_coordinator_pin did not render the /etc/hosts coordinator pin"
+        success_msg: "coordinator FQDN pinned in /etc/hosts"
+```
+
+- [ ] **Step 2: Run Molecule to verify it fails**
+
+Run: `make test ROLE=base`
+Expected: FAIL at "Assert the coordinator FQDN is pinned…" — no pin task exists yet, so `/etc/hosts` has no such line.
+
+- [ ] **Step 3: Add the default knob**
+
+In `roles/base/defaults/main.yml`, after `base__mesh_version` (~line 53), add:
+
+```yaml
+
+# DNS-resilience (ADR-016 availability / accepted-risk R8): when set to the coordinator's
+# stable IP, pin the coordinator FQDN (derived from base__mesh_management_url) in /etc/hosts
+# so a managed mesh host survives a local-DNS hiccup (the 2026-06-18 incident class). Empty
+# = no pin. The coordinator host itself (askari/offsite_hosts) is exempt — leave it empty.
+base__mesh_coordinator_pin: ""
+```
+
+- [ ] **Step 4: Add the pin task**
+
+Append to `roles/base/tasks/mesh.yml`:
+
+```yaml
+
+- name: Pin the NetBird coordinator FQDN in /etc/hosts (DNS-resilience, ADR-016 availability / R8)
+  ansible.builtin.lineinfile:
+    path: /etc/hosts
+    regexp: '\s{{ _coordinator_fqdn | regex_escape }}$'
+    line: "{{ base__mesh_coordinator_pin }} {{ _coordinator_fqdn }}"
+    state: present
+  vars:
+    _coordinator_fqdn: "{{ base__mesh_management_url | regex_replace('^https?://', '') | regex_replace('[:/].*', '') }}"
+  when:
+    - base__mesh_enabled | bool
+    - base__mesh_coordinator_pin | length > 0
+  tags: [mesh]
+```
+
+(`_coordinator_fqdn` strips the scheme then anything from the first `:`/`/` → `netbird.askari.wingu.me`. The `regexp` matches an existing `<ip> <fqdn>` at line end so a changed IP updates in place — idempotent; absent → appended.)
+
+- [ ] **Step 5: Run Molecule to verify it passes**
+
+Run: `make test ROLE=base`
+Expected: PASS — the new assertion is green and Molecule idempotence is clean (re-running the pin task reports `ok`, not `changed`). The idempotence pass is what proves the `regexp` matches the line it wrote.
+
+> Note: the empty-knob no-op (the production default for non-mesh / coordinator hosts) is guaranteed by the `when: base__mesh_coordinator_pin | length > 0` gate, not a separate Molecule case — a single converge can't hold both var-states, and boma uses one default scenario per role. The fixture exercises the meaningful path (rendering + FQDN extraction + idempotence).
+
+- [ ] **Step 6: Wire the production pin for ubongo**
+
+In `inventories/production/group_vars/control/vars.yml`, after the `base__mesh_enabled: true` block, add:
+
+```yaml
+
+# DNS-resilience (ADR-016 availability / R8): pin the coordinator FQDN to askari's stable WAN
+# IP in /etc/hosts so a local-DNS hiccup (the 2026-06-18 incident class) can't strand ubongo's
+# mesh. askari (offsite_hosts) is exempt — it reaches the coordinator locally.
+base__mesh_coordinator_pin: "77.42.120.136"
+```
+
+- [ ] **Step 7: Lint and commit**
+
+```bash
+rbw unlocked && make lint
+git add roles/base/defaults/main.yml roles/base/tasks/mesh.yml \
+  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
+  inventories/production/group_vars/control/vars.yml
+git commit -m "feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)" \
+  -m "Co-Authored-By: Claude Opus 4.8 (1M context) "
+```
+
+---
+
+### Task 2: Accept + document the SPOF (R8, ADR-016 amendment, STATUS/ROADMAP)
+
+Record the single-coordinator SPOF as a conscious, revisitable trade-off and capture the availability analysis + recovery. Pure documentation; references the pin from Task 1.
+
+**Files:**
+- Modify: `docs/security/accepted-risks.md` (add row R8; bump the review date)
+- Modify: `docs/decisions/016-mesh-vpn.md` (add the availability amendment subsection)
+- Modify: `STATUS.md` (note the SPOF accepted + the coordinator-pin knob)
+- Modify: `docs/ROADMAP.md` (mark sub-project 3 addressed; surface ADR-022 backup + ACL as next)
+
+- [ ] **Step 1: Add accepted-risk R8**
+
+In `docs/security/accepted-risks.md`, add this row to the table after R7:
+
+```markdown
+| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access** — `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
+```
+
+Then update the closing line's date: change `_Last reviewed: 2026-06-18.` to `_Last reviewed: 2026-06-20.`
+
+- [ ] **Step 2: Add the ADR-016 availability amendment**
+
+In `docs/decisions/016-mesh-vpn.md`, add this subsection immediately before the `## Related` section:
+
+```markdown
+## Availability — an `askari` outage (amendment 2026-06-20)
+
+The coordinator is deliberately **single** (one off-site host). Recorded here so its
+availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).
+
+The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
+normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
+radius**:
+
+| Traffic | `askari` down |
+|---|---|
+| LAN device → LAN service (direct / via reverse proxy) | unaffected |
+| node ↔ node over LAN IPs (cluster) | unaffected |
+| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
+| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
+| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
+
+Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
+is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
+operations, above).
+
+**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
+once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
+gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
+the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
+hosts get the same pin via `base__mesh_coordinator_pin`.
+
+**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
+default-deny posture; only helps established sessions), a second relay (needs another public
+host / reintroduces the home public surface), a second coordinator (unsupported by
+self-hosted NetBird; against this ADR).
+```
+
+- [ ] **Step 3: Update STATUS.md**
+
+In `STATUS.md`, in the `roles/base/` row, append to the end of the firewall/mesh description (before the closing ` |`) a sentence noting the pin and the accepted SPOF:
+
+```markdown
+ The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment).
+```
+
+- [ ] **Step 4: Update ROADMAP.md**
+
+In `docs/ROADMAP.md`, in the "Remaining mesh-hardening sub-projects" list, change item 3 from the SPOF-reduction "(next)" wording to **DONE**, and make the NetBird ACL the next item. Replace the current items 3–4 block with:
+
+```markdown
+3. ~~**askari relay-SPOF reduction**~~ → **DONE (2026-06-20)** — assessed + **accepted** as a
+   documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
+   narrow (LAN/intra-cluster/local traffic never touches askari), so no P2P / second relay /
+   second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
+   DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
+4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
+5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
+   BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
+```
+
+- [ ] **Step 5: Consistency check + commit**
+
+```bash
+grep -q "^| R8 " docs/security/accepted-risks.md && \
+grep -q "Availability — an .askari. outage" docs/decisions/016-mesh-vpn.md && \
+echo "docs OK"
+```
+Expected: `docs OK`.
+
+```bash
+rbw unlocked
+git add docs/security/accepted-risks.md docs/decisions/016-mesh-vpn.md STATUS.md docs/ROADMAP.md
+git commit -m "docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment" \
+  -m "Co-Authored-By: Claude Opus 4.8 (1M context) "
+```
+
+---
+
+## Notes / out of scope
+
+- **Coordinator off-site backup → ADR-022 kickoff** (next sub-project). Not built here.
+- **Direct P2P / second relay / second coordinator** — deliberately not pursued (spec §Design).
+- No live deploy is required to land this — the pin is additive/idempotent and applies to ubongo on the next routine `base` apply (`make deploy PLAYBOOK=site LIMIT=ubongo`, operator's discretion). Optional post-deploy spot-check: `getent hosts netbird.askari.wingu.me` on ubongo resolves to `77.42.120.136`.
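
For trying the FQDN derivation and the spot-check outside Ansible, a minimal shell sketch follows. It mirrors the two `regex_replace` filters from Task 1 with `sed` and checks a hosts file for the rendered line. The helper names `coordinator_fqdn` and `pin_present` are hypothetical (they are not part of the `base` role), and the example URL's scheme and port are assumptions, since the plan names only the FQDN.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of the base role.

# Mirror the pin task's two regex_replace filters:
# strip the scheme, then drop everything from the first ':' or '/' onwards.
# e.g. coordinator_fqdn 'https://netbird.askari.wingu.me:443'  (scheme and port assumed)
coordinator_fqdn() {
  printf '%s\n' "$1" | sed -E -e 's#^https?://##' -e 's#[/:].*##'
}

# Succeed if hosts file $1 contains exactly the line "<ip> <fqdn>" built from $2 and $3,
# which is the form the lineinfile task writes.
pin_present() {
  grep -qxF -- "$2 $3" "$1"
}
```

On ubongo, `pin_present /etc/hosts 77.42.120.136 "$(coordinator_fqdn "$url")"` should exit 0 once the pin has been applied, where `$url` holds the value of `base__mesh_management_url`. `getent hosts` remains the more direct check, because it exercises the resolver path instead of only the file contents.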