boma/docs/superpowers/plans/2026-06-20-mesh-spof-accept-resilience.md
sjat 0286c78f36 docs(plan): mesh-hardening SPOF — accept + DNS-resilience implementation plan
Two tasks: a base mesh coordinator-FQDN /etc/hosts pin (Molecule TDD) + the accept-and-document docs (R8, ADR-016 availability amendment, STATUS/ROADMAP). Coordinator backup deferred to ADR-022.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:49:26 +02:00

Mesh SPOF — accept + targeted resilience — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Accept askari's single-coordinator SPOF as a documented availability trade-off, and harden the one real gap — a base mesh knob that pins the coordinator FQDN in /etc/hosts on managed mesh hosts so a local-DNS hiccup can't strand the mesh.

Architecture: One additive, idempotent base mesh-concern task (a /etc/hosts line via lineinfile, gated on a new opt-in knob), Molecule-tested; plus documentation (accepted-risk R8 + an ADR-016 availability amendment + STATUS/ROADMAP). No new infra, no Terraform, no live-deploy gate.

Tech Stack: Ansible (base role, lineinfile), Molecule (Debian 13), Markdown docs.

Spec: docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md

Global Constraints

  • FQCN always (ansible.builtin.*); role defaults use the rolename__var namespace.
  • No new collection — derive the coordinator FQDN with builtin regex_replace (NOT urlsplit, which would pull in community.general).
  • The pin is opt-in and additive: gated on base__mesh_enabled | bool AND base__mesh_coordinator_pin | length > 0. Empty knob (the default) = a clean no-op. The coordinator host (askari/offsite_hosts) is exempt — leave its pin empty.
  • askari's coordinator IP = 77.42.120.136 (stable WAN; the A record for netbird.askari.wingu.me); ubongo is in the control group.
  • make lint clean + rbw unlocked before any commit (the pre-commit hook decrypts the vault).
  • No new infra — no P2P, no second relay/coordinator, no Terraform. The coordinator off-site backup is out of scope (ADR-022 kickoff).
  • Tags: the new task carries the mesh concern tag (it belongs to the mesh concern).

Task 1: base mesh coordinator-FQDN /etc/hosts pin (DNS-resilience)

Add an opt-in knob that pins the coordinator FQDN (derived from base__mesh_management_url) to a stable IP in /etc/hosts, so a managed mesh host survives a local-DNS failure. TDD'd through the role's Molecule scenario (which already exercises the mesh concern with manage: false).

Files:

  • Modify: roles/base/defaults/main.yml (add the knob after the mesh block, ~line 53)
  • Modify: roles/base/tasks/mesh.yml (append the pin task)
  • Modify: roles/base/molecule/default/converge.yml (add a fixture pin to the vars block)
  • Modify: roles/base/molecule/default/verify.yml (assert the rendered /etc/hosts line)
  • Modify: inventories/production/group_vars/control/vars.yml (set the pin for ubongo)

Interfaces:

  • Produces: role default base__mesh_coordinator_pin (string, default ""); when set + base__mesh_enabled, an /etc/hosts line <pin-ip> <fqdn> where <fqdn> is base__mesh_management_url minus scheme/port/path.

  • Step 1: Write the failing Molecule test (fixture + assertion)

In roles/base/molecule/default/converge.yml, add one line to the vars: block (after base__mesh_setup_key, ~line 15):

    base__mesh_coordinator_pin: "203.0.113.9"   # fixture coordinator IP (TEST-NET-3); pins the FQDN from base__mesh_management_url

In roles/base/molecule/default/verify.yml, append to the tasks: list (after the mesh no-op assertion at the end):

    - name: Read /etc/hosts (coordinator pin)
      ansible.builtin.slurp:
        src: /etc/hosts
      register: _etchosts
    - name: Assert the coordinator FQDN is pinned to the fixture IP (DNS-resilience / R8)
      ansible.builtin.assert:
        that:
          - "'203.0.113.9 netbird.askari.wingu.me' in (_etchosts.content | b64decode)"
        fail_msg: "base__mesh_coordinator_pin did not render the /etc/hosts coordinator pin"
        success_msg: "coordinator FQDN pinned in /etc/hosts"
  • Step 2: Run Molecule to verify it fails

Run: make test ROLE=base

Expected: FAIL at "Assert the coordinator FQDN is pinned…", because no pin task exists yet and /etc/hosts has no such line.

  • Step 3: Add the default knob

In roles/base/defaults/main.yml, after base__mesh_version (~line 53), add:


# DNS-resilience (ADR-016 availability / accepted-risk R8): when set to the coordinator's
# stable IP, pin the coordinator FQDN (derived from base__mesh_management_url) in /etc/hosts
# so a managed mesh host survives a local-DNS hiccup (the 2026-06-18 incident class). Empty
# = no pin. The coordinator host itself (askari/offsite_hosts) is exempt — leave it empty.
base__mesh_coordinator_pin: ""
  • Step 4: Add the pin task

Append to roles/base/tasks/mesh.yml:


- name: Pin the NetBird coordinator FQDN in /etc/hosts (DNS-resilience, ADR-016 availability / R8)
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '\s{{ _coordinator_fqdn | regex_escape }}$'
    line: "{{ base__mesh_coordinator_pin }} {{ _coordinator_fqdn }}"
    state: present
  vars:
    _coordinator_fqdn: "{{ base__mesh_management_url | regex_replace('^https?://', '') | regex_replace('[:/].*', '') }}"
  when:
    - base__mesh_enabled | bool
    - base__mesh_coordinator_pin | length > 0
  tags: [mesh]

(_coordinator_fqdn strips the scheme, then everything from the first : or / onward, leaving netbird.askari.wingu.me. The regexp matches an existing <fqdn> at line end, so a changed IP updates in place (idempotent); if no line matches, the pin is appended.)
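The two-step derivation can be checked outside Ansible. The following is a minimal shell sketch, not the role's implementation: it uses sed in place of Jinja's regex_replace, coordinator_fqdn is a hypothetical helper name, and the example URL is an assumption (the real value of base__mesh_management_url is not shown in this plan).

```shell
#!/bin/sh
# Hypothetical helper: mirrors the two regex_replace filters with sed.
coordinator_fqdn() {
    # 1) strip a leading http:// or https://
    # 2) drop everything from the first ':' or '/' onward (port and path)
    printf '%s\n' "$1" | sed -E -e 's#^https?://##' -e 's#[:/].*##'
}

# Assumed example URL; the real base__mesh_management_url may differ.
coordinator_fqdn 'https://netbird.askari.wingu.me:443/api'
```

A URL without a scheme or port passes through the same two substitutions unchanged apart from any trailing path, so the helper is safe to run on a bare hostname as well.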

  • Step 5: Run Molecule to verify it passes

Run: make test ROLE=base

Expected: PASS. The new assertion is green and Molecule idempotence is clean (re-running the pin task reports ok, not changed). The idempotence pass is what proves the regexp matches the line it wrote.

Note: the empty-knob no-op (the production default for non-mesh / coordinator hosts) is guaranteed by the when: base__mesh_coordinator_pin | length > 0 gate, not a separate Molecule case — a single converge can't hold both var-states, and boma uses one default scenario per role. The fixture exercises the meaningful path (rendering + FQDN extraction + idempotence).
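The append, no-change, and update-in-place behaviour that Steps 4 and 5 rely on can be illustrated without Ansible. This is a rough shell emulation of the regexp/line semantics under stated assumptions, not lineinfile itself: pin_hosts_line is a hypothetical helper, it works on a scratch file rather than /etc/hosts, and it rewrites every matching line (lineinfile replaces only the last match).

```shell
#!/bin/sh
# Hypothetical helper emulating lineinfile (regexp + line, state=present)
# on a scratch file. Prints "changed" or "ok", like an Ansible task result.
pin_hosts_line() {
    file=$1 ip=$2 fqdn=$3
    tmp=$(mktemp)
    # Replace any line whose last field is the FQDN; append if none matched.
    awk -v ip="$ip" -v fqdn="$fqdn" '
        NF >= 2 && $NF == fqdn { print ip " " fqdn; seen = 1; next }
        { print }
        END { if (!seen) print ip " " fqdn }
    ' "$file" > "$tmp"
    if cmp -s "$file" "$tmp"; then
        rm -f "$tmp"; echo ok
    else
        cat "$tmp" > "$file"; rm -f "$tmp"; echo changed
    fi
}

hosts=$(mktemp)
printf '127.0.0.1 localhost\n' > "$hosts"
pin_hosts_line "$hosts" 203.0.113.9 netbird.askari.wingu.me    # prints: changed (appended)
pin_hosts_line "$hosts" 203.0.113.9 netbird.askari.wingu.me    # prints: ok (idempotent)
pin_hosts_line "$hosts" 77.42.120.136 netbird.askari.wingu.me  # prints: changed (IP updated in place)
cat "$hosts"
rm -f "$hosts"
```

The second call is the analogue of Molecule's idempotence run: the line written by the first call is matched and rewritten to identical text, so nothing changes.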

  • Step 6: Wire the production pin for ubongo

In inventories/production/group_vars/control/vars.yml, after the base__mesh_enabled: true block, add:


# DNS-resilience (ADR-016 availability / R8): pin the coordinator FQDN to askari's stable WAN
# IP in /etc/hosts so a local-DNS hiccup (the 2026-06-18 incident class) can't strand ubongo's
# mesh. askari (offsite_hosts) is exempt — it reaches the coordinator locally.
base__mesh_coordinator_pin: "77.42.120.136"
  • Step 7: Lint and commit
rbw unlocked && make lint
git add roles/base/defaults/main.yml roles/base/tasks/mesh.yml \
        roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
        inventories/production/group_vars/control/vars.yml
git commit -m "feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)" \
           -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Task 2: Accept + document the SPOF (R8, ADR-016 amendment, STATUS/ROADMAP)

Record the single-coordinator SPOF as a conscious, revisitable trade-off and capture the availability analysis + recovery. Pure documentation; references the pin from Task 1.

Files:

  • Modify: docs/security/accepted-risks.md (add row R8; bump the review date)

  • Modify: docs/decisions/016-mesh-vpn.md (add the availability amendment subsection)

  • Modify: STATUS.md (note the SPOF accepted + the coordinator-pin knob)

  • Modify: docs/ROADMAP.md (mark sub-project 3 addressed; surface ADR-022 backup + ACL as next)

  • Step 1: Add accepted-risk R8

In docs/security/accepted-risks.md, add this row to the table after R7:

| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access.** `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`), so LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |

Then update the closing line's date: change _Last reviewed: 2026-06-18. to _Last reviewed: 2026-06-20.

  • Step 2: Add the ADR-016 availability amendment

In docs/decisions/016-mesh-vpn.md, add this subsection immediately before the ## Related section:

## Availability — an `askari` outage (amendment 2026-06-20)

The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).

The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:

| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |

Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
operations, above).

**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.

**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
  • Step 3: Update STATUS.md

In STATUS.md, in the roles/base/ row, append a sentence noting the pin and the accepted SPOF to the end of the firewall/mesh description (before the closing |):

 The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment).
  • Step 4: Update ROADMAP.md

In docs/ROADMAP.md, in the "Remaining mesh-hardening sub-projects" list, change item 3 from the SPOF-reduction "(next)" wording to DONE, and make the NetBird ACL the next item. Replace the current block for items 3-4 with:

3. ~~**askari relay-SPOF reduction**~~ **DONE (2026-06-20)**: assessed + **accepted** as a
   documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
   narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
   second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
   DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
   BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
  • Step 5: Consistency check + commit
grep -q "^| R8 " docs/security/accepted-risks.md && \
grep -q "Availability — an .askari. outage" docs/decisions/016-mesh-vpn.md && \
echo "docs OK"

Expected: docs OK.

rbw unlocked
git add docs/security/accepted-risks.md docs/decisions/016-mesh-vpn.md STATUS.md docs/ROADMAP.md
git commit -m "docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment" \
           -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Notes / out of scope

  • Coordinator off-site backup → ADR-022 kickoff (next sub-project). Not built here.
  • Direct P2P / second relay / second coordinator — deliberately not pursued (spec §Design).
  • No live deploy is required to land this — the pin is additive/idempotent and applies to ubongo on the next routine base apply (make deploy PLAYBOOK=site LIMIT=ubongo, operator's discretion). Optional post-deploy spot-check: getent hosts netbird.askari.wingu.me on ubongo resolves to 77.42.120.136.
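
The optional spot-check can be wrapped so that a mismatch fails loudly. This is a minimal sketch, assuming getent's usual `<ip> <name>` output format; check_pin is a hypothetical helper, and the resolver output is passed in as an argument so the comparison can be exercised without touching real DNS.

```shell
#!/bin/sh
# Hypothetical helper: compare the first field of a `getent hosts` line
# with the expected pin IP. Returns non-zero on mismatch or empty output.
check_pin() {
    expected_ip=$1 getent_line=$2
    resolved_ip=$(printf '%s\n' "$getent_line" | awk 'NR == 1 { print $1 }')
    if [ "$resolved_ip" = "$expected_ip" ]; then
        echo "pin OK: $resolved_ip"
    else
        echo "pin MISMATCH: got '${resolved_ip}', want '${expected_ip}'" >&2
        return 1
    fi
}

# On ubongo this would be:
#   check_pin 77.42.120.136 "$(getent hosts netbird.askari.wingu.me)"
# Here a sample line stands in for the real resolver output.
check_pin 77.42.120.136 '77.42.120.136   netbird.askari.wingu.me'
```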