docs(plan): mesh-hardening SPOF — accept + DNS-resilience implementation plan

Two tasks: a base mesh coordinator-FQDN /etc/hosts pin (Molecule TDD) + the accept-and-document docs (R8, ADR-016 availability amendment, STATUS/ROADMAP). Coordinator backup deferred to ADR-022.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

parent 3ba22d199a
commit 0286c78f36
1 changed file with 237 additions and 0 deletions
docs/superpowers/plans/2026-06-20-mesh-spof-accept-resilience.md
@@ -0,0 +1,237 @@

# Mesh SPOF: accept + targeted resilience (Implementation Plan)

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Accept askari's single-coordinator SPOF as a documented availability trade-off, and harden the one real gap: a `base` mesh knob that pins the coordinator FQDN in `/etc/hosts` on managed mesh hosts so a local-DNS hiccup can't strand the mesh.

**Architecture:** One additive, idempotent `base` `mesh`-concern task (a `/etc/hosts` line via `lineinfile`, gated on a new opt-in knob), Molecule-tested; plus documentation (accepted-risk R8 + an ADR-016 availability amendment + STATUS/ROADMAP). No new infra, no Terraform, no live-deploy gate.

**Tech Stack:** Ansible (`base` role, `lineinfile`), Molecule (Debian 13), Markdown docs.

**Spec:** `docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md`

## Global Constraints

- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
- **No new collection**: derive the coordinator FQDN with builtin `regex_replace`. (`urlsplit` is also an `ansible.builtin` filter, so it would not require `community.general` either; this plan standardises on `regex_replace`.)
- The pin is **opt-in and additive**: gated on `base__mesh_enabled | bool` AND `base__mesh_coordinator_pin | length > 0`. An empty knob (the default) is a clean no-op. The coordinator host (`askari`/`offsite_hosts`) is **exempt**: leave its pin empty.
- **askari's coordinator IP = `77.42.120.136`** (stable WAN; the A record for `netbird.askari.wingu.me`); ubongo is in the `control` group.
- `make lint` clean + `rbw unlocked` before any commit (the pre-commit hook decrypts the vault).
- **No new infra**: no P2P, no second relay/coordinator, no Terraform. The coordinator off-site backup is **out of scope** (ADR-022 kickoff).
- Tags: the new task carries the `mesh` concern tag (it belongs to the mesh concern).

---

### Task 1: `base` mesh coordinator-FQDN `/etc/hosts` pin (DNS-resilience)

Add an opt-in knob that pins the coordinator FQDN (derived from `base__mesh_management_url`) to a stable IP in `/etc/hosts`, so a managed mesh host survives a local-DNS failure. TDD'd through the role's Molecule scenario (which already exercises the `mesh` concern with `manage: false`).

**Files:**
- Modify: `roles/base/defaults/main.yml` (add the knob after the mesh block, ~line 53)
- Modify: `roles/base/tasks/mesh.yml` (append the pin task)
- Modify: `roles/base/molecule/default/converge.yml` (add a fixture pin to the vars block)
- Modify: `roles/base/molecule/default/verify.yml` (assert the rendered `/etc/hosts` line)
- Modify: `inventories/production/group_vars/control/vars.yml` (set the pin for ubongo)

**Interfaces:**
- Produces: role default `base__mesh_coordinator_pin` (string, default `""`); when it is set and `base__mesh_enabled` is true, an `/etc/hosts` line `<pin-ip> <fqdn>` where `<fqdn>` is `base__mesh_management_url` minus scheme/port/path.

- [ ] **Step 1: Write the failing Molecule test (fixture + assertion)**

In `roles/base/molecule/default/converge.yml`, add one line to the `vars:` block (after `base__mesh_setup_key`, ~line 15):

```yaml
    base__mesh_coordinator_pin: "203.0.113.9" # fixture coordinator IP (TEST-NET-3); pins the FQDN from base__mesh_management_url
```

In `roles/base/molecule/default/verify.yml`, append to the `tasks:` list (after the mesh no-op assertion at the end):

```yaml
    - name: Read /etc/hosts (coordinator pin)
      ansible.builtin.slurp:
        src: /etc/hosts
      register: _etchosts
    - name: Assert the coordinator FQDN is pinned to the fixture IP (DNS-resilience / R8)
      ansible.builtin.assert:
        that:
          - "'203.0.113.9 netbird.askari.wingu.me' in (_etchosts.content | b64decode)"
        fail_msg: "base__mesh_coordinator_pin did not render the /etc/hosts coordinator pin"
        success_msg: "coordinator FQDN pinned in /etc/hosts"
```

- [ ] **Step 2: Run Molecule to verify it fails**

Run: `make test ROLE=base`
Expected: FAIL at "Assert the coordinator FQDN is pinned…" because no pin task exists yet, so `/etc/hosts` has no such line.

- [ ] **Step 3: Add the default knob**

In `roles/base/defaults/main.yml`, after `base__mesh_version` (~line 53), add:

```yaml

# DNS-resilience (ADR-016 availability / accepted-risk R8): when set to the coordinator's
# stable IP, pin the coordinator FQDN (derived from base__mesh_management_url) in /etc/hosts
# so a managed mesh host survives a local-DNS hiccup (the 2026-06-18 incident class). Empty
# = no pin. The coordinator host itself (askari/offsite_hosts) is exempt — leave it empty.
base__mesh_coordinator_pin: ""
```

- [ ] **Step 4: Add the pin task**

Append to `roles/base/tasks/mesh.yml`:

```yaml

- name: Pin the NetBird coordinator FQDN in /etc/hosts (DNS-resilience, ADR-016 availability / R8)
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '\s{{ _coordinator_fqdn | regex_escape }}$'
    line: "{{ base__mesh_coordinator_pin }} {{ _coordinator_fqdn }}"
    state: present
  vars:
    _coordinator_fqdn: "{{ base__mesh_management_url | regex_replace('^https?://', '') | regex_replace('[:/].*', '') }}"
  when:
    - base__mesh_enabled | bool
    - base__mesh_coordinator_pin | length > 0
  tags: [mesh]
```

(`_coordinator_fqdn` strips the scheme, then everything from the first `:` or `/`, leaving `netbird.askari.wingu.me`. The `regexp` matches an existing ` <fqdn>` at line end, so a changed IP updates the line in place (idempotent); if no line matches, the pin is appended.)

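
The derivation can be sanity-checked outside Ansible. A minimal sketch, assuming `base__mesh_management_url` looks like `https://netbird.askari.wingu.me:443` (the real value is set in the inventory and is not shown in this plan). The `coordinator_fqdn` helper is hypothetical; it uses `sed -E` to mirror the two `regex_replace` filters one-to-one:

```shell
#!/bin/sh
# Hypothetical helper: mirrors the two regex_replace filters from the pin task.
coordinator_fqdn() {
  # 1) strip the scheme, 2) drop everything from the first ':' or '/'.
  printf '%s\n' "$1" | sed -E -e 's#^https?://##' -e 's#[:/].*##'
}

# Assumed example URLs; the production value is defined in the inventory.
coordinator_fqdn 'https://netbird.askari.wingu.me:443'
coordinator_fqdn 'https://netbird.askari.wingu.me/api'
```

Both calls print `netbird.askari.wingu.me`. Like the task's own expression, this does not handle a URL that carries userinfo (`user@host`); the management URL is assumed not to use that form.
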
- [ ] **Step 5: Run Molecule to verify it passes**

Run: `make test ROLE=base`
Expected: PASS. The new assertion is green and Molecule idempotence is clean (re-running the pin task reports `ok`, not `changed`). The idempotence pass is what proves the `regexp` matches the line it wrote.

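
What the idempotence pass checks can be pictured as a replace-or-append on the hosts file. The sketch below is a rough shell model of that behaviour, not `lineinfile` itself: `pin_hosts_line` is a hypothetical helper, it compares the last whitespace-separated field instead of evaluating the task's `regexp`, and it works on a scratch file rather than `/etc/hosts`.

```shell
#!/bin/sh
# Hypothetical model of the pin task: replace the last line whose final field
# is the FQDN, or append the pin when no such line exists.
pin_hosts_line() {
  file=$1 ip=$2 fqdn=$3
  tmp=$(mktemp)
  awk -v ip="$ip" -v fqdn="$fqdn" '
    { lines[NR] = $0; if (NF >= 2 && $NF == fqdn) last = NR }
    END {
      for (i = 1; i <= NR; i++) print ((i == last) ? ip " " fqdn : lines[i])
      if (!last) print ip " " fqdn
    }' "$file" > "$tmp"
  cat "$tmp" > "$file"   # rewrite the contents without replacing the file itself
  rm -f "$tmp"
}

# Demo on a scratch file, using the fixture IP and then the production IP.
hosts=$(mktemp)
printf '127.0.0.1 localhost\n' > "$hosts"
pin_hosts_line "$hosts" 203.0.113.9 netbird.askari.wingu.me    # appended
pin_hosts_line "$hosts" 203.0.113.9 netbird.askari.wingu.me    # no change
pin_hosts_line "$hosts" 77.42.120.136 netbird.askari.wingu.me  # IP updated in place
cat "$hosts"
rm -f "$hosts"
```

After the three calls the scratch file holds `127.0.0.1 localhost` plus a single pin line carrying the updated IP, which is the property the second Molecule converge relies on.
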

> Note: the empty-knob no-op (the production default for non-mesh / coordinator hosts) is guaranteed by the `when: base__mesh_coordinator_pin | length > 0` gate, not a separate Molecule case: a single converge can't hold both var-states, and boma uses one default scenario per role. The fixture exercises the meaningful path (rendering + FQDN extraction + idempotence).

- [ ] **Step 6: Wire the production pin for ubongo**

In `inventories/production/group_vars/control/vars.yml`, after the `base__mesh_enabled: true` block, add:

```yaml

# DNS-resilience (ADR-016 availability / R8): pin the coordinator FQDN to askari's stable WAN
# IP in /etc/hosts so a local-DNS hiccup (the 2026-06-18 incident class) can't strand ubongo's
# mesh. askari (offsite_hosts) is exempt — it reaches the coordinator locally.
base__mesh_coordinator_pin: "77.42.120.136"
```

- [ ] **Step 7: Lint and commit**

```bash
rbw unlocked && make lint
git add roles/base/defaults/main.yml roles/base/tasks/mesh.yml \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
  inventories/production/group_vars/control/vars.yml
git commit -m "feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)" \
  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

---

### Task 2: Accept + document the SPOF (R8, ADR-016 amendment, STATUS/ROADMAP)

Record the single-coordinator SPOF as a conscious, revisitable trade-off and capture the availability analysis + recovery. Pure documentation; references the pin from Task 1.

**Files:**
- Modify: `docs/security/accepted-risks.md` (add row R8; bump the review date)
- Modify: `docs/decisions/016-mesh-vpn.md` (add the availability amendment subsection)
- Modify: `STATUS.md` (note the SPOF accepted + the coordinator-pin knob)
- Modify: `docs/ROADMAP.md` (mark sub-project 3 addressed; surface ADR-022 backup + ACL as next)

- [ ] **Step 1: Add accepted-risk R8**

In `docs/security/accepted-risks.md`, add this row to the table after R7:

```markdown
| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access** — `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
```

Then update the closing line's date: change `_Last reviewed: 2026-06-18.` to `_Last reviewed: 2026-06-20.`

- [ ] **Step 2: Add the ADR-016 availability amendment**

In `docs/decisions/016-mesh-vpn.md`, add this subsection immediately before the `## Related` section:

```markdown
## Availability — an `askari` outage (amendment 2026-06-20)

The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).

The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:

| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |

Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
operations, above).

**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.

**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
```

- [ ] **Step 3: Update STATUS.md**

In `STATUS.md`, in the `roles/base/` row, append a sentence noting the pin and the accepted SPOF to the end of the firewall/mesh description (before the closing ` |`):

```markdown
The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment).
```

- [ ] **Step 4: Update ROADMAP.md**

In `docs/ROADMAP.md`, in the "Remaining mesh-hardening sub-projects" list, change item 3 from the SPOF-reduction "(next)" wording to **DONE**, and make the NetBird ACL the next item. Replace the current items 3–4 block with:

```markdown
3. ~~**askari relay-SPOF reduction**~~ → **DONE (2026-06-20)** — assessed + **accepted** as a
   documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
   narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
   second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
   DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
   BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
```

- [ ] **Step 5: Consistency check + commit**

```bash
grep -q "^| R8 " docs/security/accepted-risks.md && \
  grep -q "Availability — an .askari. outage" docs/decisions/016-mesh-vpn.md && \
  echo "docs OK"
```

Expected: `docs OK`.

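
The check above covers two of the four files touched in this task. A slightly wider variant is sketched below, assuming the STATUS.md sentence and the ROADMAP.md item both mention `base__mesh_coordinator_pin` as written in Steps 3 and 4. `check_mesh_docs` is a hypothetical helper name, and the ADR pattern is deliberately loose (it only looks for the tail of the amendment heading).

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if all four docs carry their marker text.
check_mesh_docs() {
  root=${1:-.}
  grep -qs '^| R8 ' "$root/docs/security/accepted-risks.md" &&
    grep -qs 'askari. outage (amendment' "$root/docs/decisions/016-mesh-vpn.md" &&
    grep -qs 'base__mesh_coordinator_pin' "$root/STATUS.md" &&
    grep -qs 'base__mesh_coordinator_pin' "$root/docs/ROADMAP.md"
}

# Run from the repository root.
if check_mesh_docs .; then echo "docs OK"; else echo "docs incomplete"; fi
```

`grep -s` keeps a missing file from printing an error, so an incomplete tree simply reports `docs incomplete`.
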

```bash
rbw unlocked
git add docs/security/accepted-risks.md docs/decisions/016-mesh-vpn.md STATUS.md docs/ROADMAP.md
git commit -m "docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment" \
  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

---

## Notes / out of scope

- **Coordinator off-site backup → ADR-022 kickoff** (next sub-project). Not built here.
- **Direct P2P / second relay / second coordinator**: deliberately not pursued (spec §Design).
- No live deploy is required to land this: the pin is additive/idempotent and applies to ubongo on the next routine `base` apply (`make deploy PLAYBOOK=site LIMIT=ubongo`, operator's discretion). Optional post-deploy spot-check: `getent hosts netbird.askari.wingu.me` on ubongo resolves to `77.42.120.136`.

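
`getent hosts` answers from whatever sources `nsswitch.conf` lists, so it does not by itself prove the answer came from the pin. For a check that looks only at the hosts file, something like the following could be used. It is a sketch: `hosts_pin_matches` is a hypothetical helper, it takes the first uncommented entry naming the FQDN, and it defaults to `/etc/hosts`.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the first entry for the FQDN in the hosts file
# carries the expected IP, exit 1 otherwise (including when there is no entry).
hosts_pin_matches() {
  ip=$1 fqdn=$2 file=${3:-/etc/hosts}
  awk -v ip="$ip" -v fqdn="$fqdn" '
    /^[[:space:]]*#/ { next }
    !seen { for (i = 2; i <= NF; i++) if ($i == fqdn) { found = $1; seen = 1 } }
    END { exit ((seen && found == ip) ? 0 : 1) }' "$file"
}

# Intended to be run on ubongo after the next base apply.
if hosts_pin_matches 77.42.120.136 netbird.askari.wingu.me; then
  echo "coordinator pin present"
else
  echo "coordinator pin missing or different"
fi
```
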