docs(plan): mesh-hardening SPOF — accept + DNS-resilience implementation plan

Two tasks: a base mesh coordinator-FQDN /etc/hosts pin (Molecule TDD) + the accept-and-document docs (R8, ADR-016 availability amendment, STATUS/ROADMAP). Coordinator backup deferred to ADR-022.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-20 10:49:26 +02:00
parent 3ba22d199a
commit 0286c78f36

# Mesh SPOF — accept + targeted resilience — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Accept askari's single-coordinator SPOF as a documented availability trade-off, and harden the one real gap — a `base` mesh knob that pins the coordinator FQDN in `/etc/hosts` on managed mesh hosts so a local-DNS hiccup can't strand the mesh.
**Architecture:** One additive, idempotent `base` `mesh`-concern task (a `/etc/hosts` line via `lineinfile`, gated on a new opt-in knob), Molecule-tested; plus documentation (accepted-risk R8 + an ADR-016 availability amendment + STATUS/ROADMAP). No new infra, no Terraform, no live-deploy gate.
**Tech Stack:** Ansible (`base` role, `lineinfile`), Molecule (Debian 13), Markdown docs.
**Spec:** `docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md`
## Global Constraints
- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
- **No new collection** — derive the coordinator FQDN with the builtin `regex_replace` filter (NOT `urlsplit`).
- The pin is **opt-in and additive**: gated on `base__mesh_enabled | bool` AND `base__mesh_coordinator_pin | length > 0`. Empty knob (the default) = a clean no-op. The coordinator host (`askari`/`offsite_hosts`) is **exempt** — leave its pin empty.
- **askari's coordinator IP = `77.42.120.136`** (stable WAN; the A record for `netbird.askari.wingu.me`); ubongo is in the `control` group.
- `make lint` clean + `rbw unlocked` before any commit (the pre-commit hook decrypts the vault).
- **No new infra** — no P2P, no second relay/coordinator, no Terraform. The coordinator off-site backup is **out of scope** (ADR-022 kickoff).
- Tags: the new task carries the `mesh` concern tag.
---
### Task 1: `base` mesh coordinator-FQDN `/etc/hosts` pin (DNS-resilience)
Add an opt-in knob that pins the coordinator FQDN (derived from `base__mesh_management_url`) to a stable IP in `/etc/hosts`, so a managed mesh host survives a local-DNS failure. TDD'd through the role's Molecule scenario (which already exercises the `mesh` concern with `manage: false`).
**Files:**
- Modify: `roles/base/defaults/main.yml` (add the knob after the mesh block, ~line 53)
- Modify: `roles/base/tasks/mesh.yml` (append the pin task)
- Modify: `roles/base/molecule/default/converge.yml` (add a fixture pin to the vars block)
- Modify: `roles/base/molecule/default/verify.yml` (assert the rendered `/etc/hosts` line)
- Modify: `inventories/production/group_vars/control/vars.yml` (set the pin for ubongo)
**Interfaces:**
- Produces: role default `base__mesh_coordinator_pin` (string, default `""`). When it is non-empty and `base__mesh_enabled` is true, the role writes an `/etc/hosts` line `<pin-ip> <fqdn>`, where `<fqdn>` is `base__mesh_management_url` minus scheme/port/path.
- [ ] **Step 1: Write the failing Molecule test (fixture + assertion)**
In `roles/base/molecule/default/converge.yml`, add one line to the `vars:` block (after `base__mesh_setup_key`, ~line 15):
```yaml
base__mesh_coordinator_pin: "203.0.113.9" # fixture coordinator IP (TEST-NET-3); pins the FQDN from base__mesh_management_url
```
In `roles/base/molecule/default/verify.yml`, append to the `tasks:` list (after the mesh no-op assertion at the end):
```yaml
    - name: Read /etc/hosts (coordinator pin)
      ansible.builtin.slurp:
        src: /etc/hosts
      register: _etchosts

    - name: Assert the coordinator FQDN is pinned to the fixture IP (DNS-resilience / R8)
      ansible.builtin.assert:
        that:
          - "'203.0.113.9 netbird.askari.wingu.me' in (_etchosts.content | b64decode)"
        fail_msg: "base__mesh_coordinator_pin did not render the /etc/hosts coordinator pin"
        success_msg: "coordinator FQDN pinned in /etc/hosts"
```
- [ ] **Step 2: Run Molecule to verify it fails**
Run: `make test ROLE=base`
Expected: FAIL at "Assert the coordinator FQDN is pinned…" — no pin task exists yet, so `/etc/hosts` has no such line.
- [ ] **Step 3: Add the default knob**
In `roles/base/defaults/main.yml`, after `base__mesh_version` (~line 53), add:
```yaml
# DNS-resilience (ADR-016 availability / accepted-risk R8): when set to the coordinator's
# stable IP, pin the coordinator FQDN (derived from base__mesh_management_url) in /etc/hosts
# so a managed mesh host survives a local-DNS hiccup (the 2026-06-18 incident class). Empty
# = no pin. The coordinator host itself (askari/offsite_hosts) is exempt — leave it empty.
base__mesh_coordinator_pin: ""
```
- [ ] **Step 4: Add the pin task**
Append to `roles/base/tasks/mesh.yml`:
```yaml
- name: Pin the NetBird coordinator FQDN in /etc/hosts (DNS-resilience, ADR-016 availability / R8)
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '\s{{ _coordinator_fqdn | regex_escape }}$'
    line: "{{ base__mesh_coordinator_pin }} {{ _coordinator_fqdn }}"
    state: present
  vars:
    _coordinator_fqdn: "{{ base__mesh_management_url | regex_replace('^https?://', '') | regex_replace('[:/].*', '') }}"
  when:
    - base__mesh_enabled | bool
    - base__mesh_coordinator_pin | length > 0
  tags: [mesh]
```
(`_coordinator_fqdn` strips the scheme, then everything from the first `:` or `/`, leaving `netbird.askari.wingu.me`. The `regexp` matches an existing ` <fqdn>` at end of line, so a changed IP updates the line in place and a missing pin is appended; either way the task is idempotent.)
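
The FQDN extraction can be sanity-checked outside Ansible with the same two substitutions. This is a minimal sketch, not part of the role: `derive_fqdn` is a hypothetical helper name, and the sample URLs only assume that `base__mesh_management_url` has the form `https://<fqdn>[:port][/path]` (its real value is not shown in this plan).

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the two regex_replace steps of _coordinator_fqdn.
derive_fqdn() {
  # Step 1: drop a leading http:// or https://
  # Step 2: drop everything from the first ':' or '/' onwards
  printf '%s\n' "$1" | sed -E -e 's#^https?://##' -e 's#[:/].*##'
}

# Assumed example inputs (illustrative only).
derive_fqdn "https://netbird.askari.wingu.me:443"
derive_fqdn "https://netbird.askari.wingu.me/api"
```

Both calls should print the bare hostname, which is the value the `lineinfile` task writes after the pin IP.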
- [ ] **Step 5: Run Molecule to verify it passes**
Run: `make test ROLE=base`
Expected: PASS — the new assertion is green and Molecule idempotence is clean (re-running the pin task reports `ok`, not `changed`). The idempotence pass is what proves the `regexp` matches the line it wrote.
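
The idempotence property that Step 5 relies on can be illustrated with a rough shell emulation of the `regexp`/`line` pair. This is a sketch under assumptions: `pin_host` is a hypothetical function, it only escapes dots in the FQDN, and it rewrites every matching line, whereas `lineinfile` replaces only the last match.

```bash
#!/usr/bin/env bash
# Hypothetical emulation of the pin task's regexp/line behaviour.
# Usage: pin_host <hosts-file> <ip> <fqdn>
pin_host() {
  local file=$1 ip=$2 fqdn=$3 esc tmp
  esc=${fqdn//./\\.}   # escape the dots so they match literally
  if grep -Eq "[[:space:]]${esc}\$" "$file"; then
    # A line already ends in " <fqdn>": rewrite it in place (covers a changed IP).
    tmp=$(mktemp)
    sed -E "s/^.*[[:space:]]${esc}\$/${ip} ${fqdn}/" "$file" > "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    # No pin yet: append one.
    printf '%s %s\n' "$ip" "$fqdn" >> "$file"
  fi
}

hosts=$(mktemp)
printf '127.0.0.1 localhost\n' > "$hosts"
pin_host "$hosts" 203.0.113.9 netbird.askari.wingu.me    # first run appends
pin_host "$hosts" 203.0.113.9 netbird.askari.wingu.me    # second run leaves the file as-is
pin_host "$hosts" 198.51.100.7 netbird.askari.wingu.me   # changed IP updates the same line
cat "$hosts"
rm -f "$hosts"
```

After the three calls the file still holds a single pin line. Because the pattern anchors the FQDN at end of line, a pre-existing entry that lists aliases after the FQDN would not match and a second line would be appended; that is worth keeping in mind for hosts with hand-edited `/etc/hosts` files.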
> Note: the empty-knob no-op (the production default for non-mesh / coordinator hosts) is guaranteed by the `when: base__mesh_coordinator_pin | length > 0` gate, not a separate Molecule case — a single converge can't hold both var-states, and boma uses one default scenario per role. The fixture exercises the meaningful path (rendering + FQDN extraction + idempotence).
- [ ] **Step 6: Wire the production pin for ubongo**
In `inventories/production/group_vars/control/vars.yml`, after the `base__mesh_enabled: true` block, add:
```yaml
# DNS-resilience (ADR-016 availability / R8): pin the coordinator FQDN to askari's stable WAN
# IP in /etc/hosts so a local-DNS hiccup (the 2026-06-18 incident class) can't strand ubongo's
# mesh. askari (offsite_hosts) is exempt — it reaches the coordinator locally.
base__mesh_coordinator_pin: "77.42.120.136"
```
- [ ] **Step 7: Lint and commit**
```bash
rbw unlocked && make lint
git add roles/base/defaults/main.yml roles/base/tasks/mesh.yml \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
  inventories/production/group_vars/control/vars.yml
git commit -m "feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)" \
  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 2: Accept + document the SPOF (R8, ADR-016 amendment, STATUS/ROADMAP)
Record the single-coordinator SPOF as a conscious, revisitable trade-off and capture the availability analysis + recovery. Pure documentation; references the pin from Task 1.
**Files:**
- Modify: `docs/security/accepted-risks.md` (add row R8; bump the review date)
- Modify: `docs/decisions/016-mesh-vpn.md` (add the availability amendment subsection)
- Modify: `STATUS.md` (note the SPOF accepted + the coordinator-pin knob)
- Modify: `docs/ROADMAP.md` (mark sub-project 3 addressed; surface ADR-022 backup + ACL as next)
- [ ] **Step 1: Add accepted-risk R8**
In `docs/security/accepted-risks.md`, add this row to the table after R7:
```markdown
| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access**: `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
```
Then update the closing line's date: change `_Last reviewed: 2026-06-18.` to `_Last reviewed: 2026-06-20.`
- [ ] **Step 2: Add the ADR-016 availability amendment**
In `docs/decisions/016-mesh-vpn.md`, add this subsection immediately before the `## Related` section:
```markdown
## Availability — an `askari` outage (amendment 2026-06-20)
The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).
The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:
| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
operations, above).
**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.
**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
```
- [ ] **Step 3: Update STATUS.md**
In `STATUS.md`, in the `roles/base/` row, append a sentence noting the pin and the accepted SPOF to the end of the firewall/mesh description (before the closing ` |`):
```markdown
The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment).
```
- [ ] **Step 4: Update ROADMAP.md**
In `docs/ROADMAP.md`, in the "Remaining mesh-hardening sub-projects" list, change item 3 from the SPOF-reduction "(next)" wording to **DONE**, and make the NetBird ACL the next item. Replace the current items 3-4 block with:
```markdown
3. ~~**askari relay-SPOF reduction**~~ **DONE (2026-06-20)** — assessed + **accepted** as a
documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
```
- [ ] **Step 5: Consistency check + commit**
```bash
grep -q "^| R8 " docs/security/accepted-risks.md && \
grep -q "Availability — an .askari. outage" docs/decisions/016-mesh-vpn.md && \
echo "docs OK"
```
Expected: `docs OK`.
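
The check above covers two of the four files touched in Task 2. A possible extension, sketched under assumptions, also looks for the marker strings this plan adds to `STATUS.md` and `docs/ROADMAP.md`; `docs_consistent` is a hypothetical helper name and the patterns are taken from the snippets in Steps 1 to 4.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify all four Task 2 documents under a repo root.
# Usage: docs_consistent [repo-root]   (defaults to the current directory)
docs_consistent() {
  local root=${1:-.}
  # -q: quiet, -s: stay silent about missing files (they just count as a failure)
  grep -qs '^| R8 ' "$root/docs/security/accepted-risks.md" &&
    grep -qs '^## Availability .*askari.* outage' "$root/docs/decisions/016-mesh-vpn.md" &&
    grep -qs 'base__mesh_coordinator_pin' "$root/STATUS.md" &&
    grep -qs 'DONE (2026-06-20)' "$root/docs/ROADMAP.md"
}

if docs_consistent .; then
  echo "docs OK"
else
  echo "docs incomplete"
fi
```

Run from the repository root, this prints `docs OK` only when all four markers are present.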
```bash
rbw unlocked
git add docs/security/accepted-risks.md docs/decisions/016-mesh-vpn.md STATUS.md docs/ROADMAP.md
git commit -m "docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment" \
  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
## Notes / out of scope
- **Coordinator off-site backup → ADR-022 kickoff** (next sub-project). Not built here.
- **Direct P2P / second relay / second coordinator** — deliberately not pursued (spec §Design).
- No live deploy is required to land this — the pin is additive/idempotent and applies to ubongo on the next routine `base` apply (`make deploy PLAYBOOK=site LIMIT=ubongo`, operator's discretion). Optional post-deploy spot-check: `getent hosts netbird.askari.wingu.me` on ubongo resolves to `77.42.120.136`.
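
The optional spot-check can also be done directly against the hosts file, which avoids depending on the resolver order that `getent` follows. A minimal sketch, assuming a hypothetical `check_pin` helper; the demo runs against a temporary file rather than the real `/etc/hosts`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the first hosts entry naming <fqdn>
# carries <expected-ip>.
# Usage: check_pin <hosts-file> <expected-ip> <fqdn>
check_pin() {
  awk -v ip="$2" -v name="$3" '
    /^[ \t]*#/ { next }                     # skip comment lines
    {
      for (i = 2; i <= NF; i++)             # field 1 is the IP, the rest are names
        if ($i == name) { found = 1; ok = ($1 == ip); exit }
    }
    END { exit ((found && ok) ? 0 : 1) }
  ' "$1"
}

# Demo on a throwaway file; on ubongo the first argument would be /etc/hosts.
demo=$(mktemp)
printf '127.0.0.1 localhost\n77.42.120.136 netbird.askari.wingu.me\n' > "$demo"
if check_pin "$demo" 77.42.120.136 netbird.askari.wingu.me; then
  echo "pin OK"
else
  echo "pin missing or wrong"
fi
rm -f "$demo"
```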