boma/docs/superpowers/plans/2026-06-20-mesh-spof-accept-resilience.md
sjat 0286c78f36 docs(plan): mesh-hardening SPOF — accept + DNS-resilience implementation plan
Two tasks: a base mesh coordinator-FQDN /etc/hosts pin (Molecule TDD) + the accept-and-document docs (R8, ADR-016 availability amendment, STATUS/ROADMAP). Coordinator backup deferred to ADR-022.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:49:26 +02:00

237 lines
14 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Mesh SPOF — accept + targeted resilience — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Accept askari's single-coordinator SPOF as a documented availability trade-off, and harden the one real gap — a `base` mesh knob that pins the coordinator FQDN in `/etc/hosts` on managed mesh hosts so a local-DNS hiccup can't strand the mesh.
**Architecture:** One additive, idempotent `base` `mesh`-concern task (a `/etc/hosts` line via `lineinfile`, gated on a new opt-in knob), Molecule-tested; plus documentation (accepted-risk R8 + an ADR-016 availability amendment + STATUS/ROADMAP). No new infra, no Terraform, no live-deploy gate.
**Tech Stack:** Ansible (`base` role, `lineinfile`), Molecule (Debian 13), Markdown docs.
**Spec:** `docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md`
## Global Constraints
- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
- **No new collection** — derive the coordinator FQDN with builtin `regex_replace` (NOT `urlsplit`, which would pull in `community.general`).
- The pin is **opt-in and additive**: gated on `base__mesh_enabled | bool` AND `base__mesh_coordinator_pin | length > 0`. Empty knob (the default) = a clean no-op. The coordinator host (`askari`/`offsite_hosts`) is **exempt** — leave its pin empty.
- **askari's coordinator IP = `77.42.120.136`** (stable WAN; the A record for `netbird.askari.wingu.me`); ubongo is in the `control` group.
- `make lint` clean + `rbw unlocked` before any commit (the pre-commit hook decrypts the vault).
- **No new infra** — no P2P, no second relay/coordinator, no Terraform. The coordinator off-site backup is **out of scope** (ADR-022 kickoff).
- Tags: the new task carries the `mesh` concern tag (it belongs to the mesh concern).
---
### Task 1: `base` mesh coordinator-FQDN `/etc/hosts` pin (DNS-resilience)
Add an opt-in knob that pins the coordinator FQDN (derived from `base__mesh_management_url`) to a stable IP in `/etc/hosts`, so a managed mesh host survives a local-DNS failure. TDD'd through the role's Molecule scenario (which already exercises the `mesh` concern with `manage: false`).
**Files:**
- Modify: `roles/base/defaults/main.yml` (add the knob after the mesh block, ~line 53)
- Modify: `roles/base/tasks/mesh.yml` (append the pin task)
- Modify: `roles/base/molecule/default/converge.yml` (add a fixture pin to the vars block)
- Modify: `roles/base/molecule/default/verify.yml` (assert the rendered `/etc/hosts` line)
- Modify: `inventories/production/group_vars/control/vars.yml` (set the pin for ubongo)
**Interfaces:**
- Produces: role default `base__mesh_coordinator_pin` (string, default `""`); when set + `base__mesh_enabled`, an `/etc/hosts` line `<pin-ip> <fqdn>` where `<fqdn>` is `base__mesh_management_url` minus scheme/port/path.
- [ ] **Step 1: Write the failing Molecule test (fixture + assertion)**
In `roles/base/molecule/default/converge.yml`, add one line to the `vars:` block (after `base__mesh_setup_key`, ~line 15):
```yaml
base__mesh_coordinator_pin: "203.0.113.9" # fixture coordinator IP (TEST-NET-3); pins the FQDN from base__mesh_management_url
```
In `roles/base/molecule/default/verify.yml`, append to the `tasks:` list (after the mesh no-op assertion at the end):
```yaml
- name: Read /etc/hosts (coordinator pin)
ansible.builtin.slurp:
src: /etc/hosts
register: _etchosts
- name: Assert the coordinator FQDN is pinned to the fixture IP (DNS-resilience / R8)
ansible.builtin.assert:
that:
- "'203.0.113.9 netbird.askari.wingu.me' in (_etchosts.content | b64decode)"
fail_msg: "base__mesh_coordinator_pin did not render the /etc/hosts coordinator pin"
success_msg: "coordinator FQDN pinned in /etc/hosts"
```
- [ ] **Step 2: Run Molecule to verify it fails**
Run: `make test ROLE=base`
Expected: FAIL at "Assert the coordinator FQDN is pinned…" — no pin task exists yet, so `/etc/hosts` has no such line.
- [ ] **Step 3: Add the default knob**
In `roles/base/defaults/main.yml`, after `base__mesh_version` (~line 53), add:
```yaml
# DNS-resilience (ADR-016 availability / accepted-risk R8): when set to the coordinator's
# stable IP, pin the coordinator FQDN (derived from base__mesh_management_url) in /etc/hosts
# so a managed mesh host survives a local-DNS hiccup (the 2026-06-18 incident class). Empty
# = no pin. The coordinator host itself (askari/offsite_hosts) is exempt — leave it empty.
base__mesh_coordinator_pin: ""
```
- [ ] **Step 4: Add the pin task**
Append to `roles/base/tasks/mesh.yml`:
```yaml
- name: Pin the NetBird coordinator FQDN in /etc/hosts (DNS-resilience, ADR-016 availability / R8)
ansible.builtin.lineinfile:
path: /etc/hosts
regexp: '\s{{ _coordinator_fqdn | regex_escape }}$'
line: "{{ base__mesh_coordinator_pin }} {{ _coordinator_fqdn }}"
state: present
vars:
_coordinator_fqdn: "{{ base__mesh_management_url | regex_replace('^https?://', '') | regex_replace('[:/].*', '') }}"
when:
- base__mesh_enabled | bool
- base__mesh_coordinator_pin | length > 0
tags: [mesh]
```
(`_coordinator_fqdn` strips the scheme then anything from the first `:`/`/``netbird.askari.wingu.me`. The `regexp` matches an existing ` <fqdn>` at line end so a changed IP updates in place — idempotent; absent → appended.)
- [ ] **Step 5: Run Molecule to verify it passes**
Run: `make test ROLE=base`
Expected: PASS — the new assertion is green and Molecule idempotence is clean (re-running the pin task reports `ok`, not `changed`). The idempotence pass is what proves the `regexp` matches the line it wrote.
> Note: the empty-knob no-op (the production default for non-mesh / coordinator hosts) is guaranteed by the `when: base__mesh_coordinator_pin | length > 0` gate, not a separate Molecule case — a single converge can't hold both var-states, and boma uses one default scenario per role. The fixture exercises the meaningful path (rendering + FQDN extraction + idempotence).
- [ ] **Step 6: Wire the production pin for ubongo**
In `inventories/production/group_vars/control/vars.yml`, after the `base__mesh_enabled: true` block, add:
```yaml
# DNS-resilience (ADR-016 availability / R8): pin the coordinator FQDN to askari's stable WAN
# IP in /etc/hosts so a local-DNS hiccup (the 2026-06-18 incident class) can't strand ubongo's
# mesh. askari (offsite_hosts) is exempt — it reaches the coordinator locally.
base__mesh_coordinator_pin: "77.42.120.136"
```
- [ ] **Step 7: Lint and commit**
```bash
rbw unlocked && make lint
git add roles/base/defaults/main.yml roles/base/tasks/mesh.yml \
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
inventories/production/group_vars/control/vars.yml
git commit -m "feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 2: Accept + document the SPOF (R8, ADR-016 amendment, STATUS/ROADMAP)
Record the single-coordinator SPOF as a conscious, revisitable trade-off and capture the availability analysis + recovery. Pure documentation; references the pin from Task 1.
**Files:**
- Modify: `docs/security/accepted-risks.md` (add row R8; bump the review date)
- Modify: `docs/decisions/016-mesh-vpn.md` (add the availability amendment subsection)
- Modify: `STATUS.md` (note the SPOF accepted + the coordinator-pin knob)
- Modify: `docs/ROADMAP.md` (mark sub-project 3 addressed; surface ADR-022 backup + ACL as next)
- [ ] **Step 1: Add accepted-risk R8**
In `docs/security/accepted-risks.md`, add this row to the table after R7:
```markdown
| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access**`askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
```
Then update the closing line's date: change `_Last reviewed: 2026-06-18.` to `_Last reviewed: 2026-06-20.`
- [ ] **Step 2: Add the ADR-016 availability amendment**
In `docs/decisions/016-mesh-vpn.md`, add this subsection immediately before the `## Related` section:
```markdown
## Availability — an `askari` outage (amendment 2026-06-20)
The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).
The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:
| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
operations, above).
**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.
**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
```
- [ ] **Step 3: Update STATUS.md**
In `STATUS.md`, in the `roles/base/` row, append to the end of the firewall/mesh description (before the closing ` |`): a sentence noting the pin and the accepted SPOF:
```markdown
The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment).
```
- [ ] **Step 4: Update ROADMAP.md**
In `docs/ROADMAP.md`, in the "Remaining mesh-hardening sub-projects" list, change item 3 from the SPOF-reduction "(next)" wording to **DONE**, and make the NetBird ACL the next item. Replace the current items 34 block with:
```markdown
3. ~~**askari relay-SPOF reduction**~~**DONE (2026-06-20)** — assessed + **accepted** as a
documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
```
- [ ] **Step 5: Consistency check + commit**
```bash
grep -q "^| R8 " docs/security/accepted-risks.md && \
grep -q "Availability — an .askari. outage" docs/decisions/016-mesh-vpn.md && \
echo "docs OK"
```
Expected: `docs OK`.
```bash
rbw unlocked
git add docs/security/accepted-risks.md docs/decisions/016-mesh-vpn.md STATUS.md docs/ROADMAP.md
git commit -m "docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
## Notes / out of scope
- **Coordinator off-site backup → ADR-022 kickoff** (next sub-project). Not built here.
- **Direct P2P / second relay / second coordinator** — deliberately not pursued (spec §Design).
- No live deploy is required to land this — the pin is additive/idempotent and applies to ubongo on the next routine `base` apply (`make deploy PLAYBOOK=site LIMIT=ubongo`, operator's discretion). Optional post-deploy spot-check: `getent hosts netbird.askari.wingu.me` on ubongo resolves to `77.42.120.136`.