# Mesh SPOF — accept + targeted resilience — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Accept askari's single-coordinator SPOF as a documented availability trade-off, and harden the one real gap — a `base` mesh knob that pins the coordinator FQDN in `/etc/hosts` on managed mesh hosts so a local-DNS hiccup can't strand the mesh.

**Architecture:** One additive, idempotent `base` `mesh`-concern task (a `/etc/hosts` line via `lineinfile`, gated on a new opt-in knob), Molecule-tested; plus documentation (accepted-risk R8 + an ADR-016 availability amendment + STATUS/ROADMAP). No new infra, no Terraform, no live-deploy gate.

**Tech Stack:** Ansible (`base` role, `lineinfile`), Molecule (Debian 13), Markdown docs.

**Spec:** `docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md`

## Global Constraints

- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
- **No new collection** — derive the coordinator FQDN with the builtin `regex_replace` filter. (`urlsplit` is also an `ansible.builtin` filter, so it would not add a collection either; the plan standardizes on `regex_replace`.)
- The pin is **opt-in and additive**: gated on `base__mesh_enabled | bool` AND `base__mesh_coordinator_pin | length > 0`. Empty knob (the default) = a clean no-op. The coordinator host (`askari`/`offsite_hosts`) is **exempt** — leave its pin empty.
- **askari's coordinator IP = `77.42.120.136`** (stable WAN; the A record for `netbird.askari.wingu.me`); ubongo is in the `control` group.
- `make lint` clean + `rbw unlocked` before any commit (the pre-commit hook decrypts the vault).
- **No new infra** — no P2P, no second relay/coordinator, no Terraform. The coordinator off-site backup is **out of scope** (ADR-022 kickoff).
- Tags: the new task carries the `mesh` concern tag (it belongs to the mesh concern).

---

### Task 1: `base` mesh coordinator-FQDN `/etc/hosts` pin (DNS-resilience)

Add an opt-in knob that pins the coordinator FQDN (derived from `base__mesh_management_url`) to a stable IP in `/etc/hosts`, so a managed mesh host survives a local-DNS failure. TDD'd through the role's Molecule scenario (which already exercises the `mesh` concern with `manage: false`).

**Files:**
- Modify: `roles/base/defaults/main.yml` (add the knob after the mesh block, ~line 53)
- Modify: `roles/base/tasks/mesh.yml` (append the pin task)
- Modify: `roles/base/molecule/default/converge.yml` (add a fixture pin to the vars block)
- Modify: `roles/base/molecule/default/verify.yml` (assert the rendered `/etc/hosts` line)
- Modify: `inventories/production/group_vars/control/vars.yml` (set the pin for ubongo)

**Interfaces:**
- Produces: role default `base__mesh_coordinator_pin` (string, default `""`); when set + `base__mesh_enabled`, an `/etc/hosts` line `<pin-ip> <fqdn>` where `<fqdn>` is `base__mesh_management_url` minus scheme/port/path.

- [ ] **Step 1: Write the failing Molecule test (fixture + assertion)**

In `roles/base/molecule/default/converge.yml`, add one line to the `vars:` block (after `base__mesh_setup_key`, ~line 15):

```yaml
    base__mesh_coordinator_pin: "203.0.113.9"  # fixture coordinator IP (TEST-NET-3); pins the FQDN from base__mesh_management_url
```

In `roles/base/molecule/default/verify.yml`, append to the `tasks:` list (after the mesh no-op assertion at the end):

```yaml
    - name: Read /etc/hosts (coordinator pin)
      ansible.builtin.slurp:
        src: /etc/hosts
      register: _etchosts
    - name: Assert the coordinator FQDN is pinned to the fixture IP (DNS-resilience / R8)
      ansible.builtin.assert:
        that:
          - "'203.0.113.9 netbird.askari.wingu.me' in (_etchosts.content | b64decode)"
        fail_msg: "base__mesh_coordinator_pin did not render the /etc/hosts coordinator pin"
        success_msg: "coordinator FQDN pinned in /etc/hosts"
```

- [ ] **Step 2: Run Molecule to verify it fails**

Run: `make test ROLE=base`
Expected: FAIL at "Assert the coordinator FQDN is pinned…" — no pin task exists yet, so `/etc/hosts` has no such line.

- [ ] **Step 3: Add the default knob**

In `roles/base/defaults/main.yml`, after `base__mesh_version` (~line 53), add:

```yaml

# DNS-resilience (ADR-016 availability / accepted-risk R8): when set to the coordinator's
# stable IP, pin the coordinator FQDN (derived from base__mesh_management_url) in /etc/hosts
# so a managed mesh host survives a local-DNS hiccup (the 2026-06-18 incident class). Empty
# = no pin. The coordinator host itself (askari/offsite_hosts) is exempt — leave it empty.
base__mesh_coordinator_pin: ""
```

- [ ] **Step 4: Add the pin task**

Append to `roles/base/tasks/mesh.yml`:

```yaml

- name: Pin the NetBird coordinator FQDN in /etc/hosts (DNS-resilience, ADR-016 availability / R8)
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '\s{{ _coordinator_fqdn | regex_escape }}$'
    line: "{{ base__mesh_coordinator_pin }} {{ _coordinator_fqdn }}"
    state: present
  vars:
    _coordinator_fqdn: "{{ base__mesh_management_url | regex_replace('^https?://', '') | regex_replace('[:/].*', '') }}"
  when:
    - base__mesh_enabled | bool
    - base__mesh_coordinator_pin | length > 0
  tags: [mesh]
```

(`_coordinator_fqdn` strips the scheme then anything from the first `:`/`/` → `netbird.askari.wingu.me`. The `regexp` matches an existing ` <fqdn>` at line end so a changed IP updates in place — idempotent; absent → appended.)

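To sanity-check the FQDN derivation outside Ansible, the two `regex_replace` steps can be approximated with `sed`. This is a rough sketch, not part of the role: the function name `coordinator_fqdn` is hypothetical, and the sample URLs (including the `:443` port and the `/api` path) are illustrative assumptions, since the actual value of `base__mesh_management_url` is not shown in this plan.

```bash
#!/usr/bin/env bash
set -euo pipefail

# coordinator_fqdn URL -> prints the bare host name.
# Mirrors the two filters in the pin task:
#   1) drop a leading http:// or https://
#   2) drop everything from the first ':' or '/' onward (port and path)
coordinator_fqdn() {
  printf '%s\n' "$1" | sed -E -e 's#^https?://##' -e 's#[:/].*##'
}

# Sample URLs are assumptions for illustration only.
coordinator_fqdn "https://netbird.askari.wingu.me:443"   # netbird.askari.wingu.me
coordinator_fqdn "https://netbird.askari.wingu.me/api"   # netbird.askari.wingu.me
```

The same two-step strip would mangle a bracketed IPv6 literal or a URL carrying userinfo (`user@host`), so it only suits a management URL that is a plain hostname, as it is here.
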
- [ ] **Step 5: Run Molecule to verify it passes**

Run: `make test ROLE=base`
Expected: PASS — the new assertion is green and Molecule idempotence is clean (re-running the pin task reports `ok`, not `changed`). The idempotence pass is what proves the `regexp` matches the line it wrote.

> Note: the empty-knob no-op (the production default for non-mesh / coordinator hosts) is guaranteed by the `when: base__mesh_coordinator_pin | length > 0` gate, not a separate Molecule case — a single converge can't hold both var-states, and boma uses one default scenario per role. The fixture exercises the meaningful path (rendering + FQDN extraction + idempotence).

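The idempotence argument can be reasoned about with a small shell model of the `regexp`/`line` pairing. It is only an approximation of `lineinfile` (replace the last matching line, otherwise append), written for bash 4.4+. `pin_line` is a hypothetical helper, a temporary file stands in for `/etc/hosts`, `[[:space:]]` stands in for `\s`, and the `[.]` substitution stands in for `regex_escape`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# pin_line FILE IP FQDN -> prints "changed" or "ok", like a task result.
pin_line() {
  local file=$1 ip=$2 fqdn=$3
  local want="$ip $fqdn"
  # Whitespace, then the literal FQDN, anchored at end of line.
  local re="[[:space:]]${fqdn//./[.]}\$"
  local -a lines=()
  local i last=-1
  mapfile -t lines < "$file"
  for i in "${!lines[@]}"; do
    if [[ ${lines[i]} =~ $re ]]; then last=$i; fi   # remember the last match
  done
  if (( last >= 0 )); then
    if [[ ${lines[last]} == "$want" ]]; then echo ok; return 0; fi
    lines[last]=$want          # existing pin with a different IP: update in place
  else
    lines+=("$want")           # no pin yet: append
  fi
  printf '%s\n' "${lines[@]}" > "$file"
  echo changed
}

hosts=$(mktemp)
printf '127.0.0.1 localhost\n' > "$hosts"
pin_line "$hosts" 203.0.113.9 netbird.askari.wingu.me    # changed (appended)
pin_line "$hosts" 203.0.113.9 netbird.askari.wingu.me    # ok (second run is a no-op)
pin_line "$hosts" 198.51.100.7 netbird.askari.wingu.me   # changed (IP updated in place)
rm -f "$hosts"
```

The second call returning `ok` is the shell analogue of Molecule's idempotence pass: the pattern matches exactly the line the first call wrote, so nothing is rewritten.
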
- [ ] **Step 6: Wire the production pin for ubongo**

In `inventories/production/group_vars/control/vars.yml`, after the `base__mesh_enabled: true` block, add:

```yaml

# DNS-resilience (ADR-016 availability / R8): pin the coordinator FQDN to askari's stable WAN
# IP in /etc/hosts so a local-DNS hiccup (the 2026-06-18 incident class) can't strand ubongo's
# mesh. askari (offsite_hosts) is exempt — it reaches the coordinator locally.
base__mesh_coordinator_pin: "77.42.120.136"
```

- [ ] **Step 7: Lint and commit**

```bash
rbw unlocked && make lint
git add roles/base/defaults/main.yml roles/base/tasks/mesh.yml \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
  inventories/production/group_vars/control/vars.yml
git commit -m "feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)" \
  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

---

### Task 2: Accept + document the SPOF (R8, ADR-016 amendment, STATUS/ROADMAP)

Record the single-coordinator SPOF as a conscious, revisitable trade-off and capture the availability analysis + recovery. Pure documentation; references the pin from Task 1.

**Files:**
- Modify: `docs/security/accepted-risks.md` (add row R8; bump the review date)
- Modify: `docs/decisions/016-mesh-vpn.md` (add the availability amendment subsection)
- Modify: `STATUS.md` (note the SPOF accepted + the coordinator-pin knob)
- Modify: `docs/ROADMAP.md` (mark sub-project 3 addressed; surface ADR-022 backup + ACL as next)

- [ ] **Step 1: Add accepted-risk R8**

In `docs/security/accepted-risks.md`, add this row to the table after R7:

```markdown
| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access** — `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
```

Then update the closing line's date: change `_Last reviewed: 2026-06-18.` to `_Last reviewed: 2026-06-20.`

- [ ] **Step 2: Add the ADR-016 availability amendment**

In `docs/decisions/016-mesh-vpn.md`, add this subsection immediately before the `## Related` section:

```markdown
## Availability — an `askari` outage (amendment 2026-06-20)

The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).

The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:

| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |

Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
operations, above).

**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.

**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
```

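The "not a gateway" claim in the amendment can be spot-checked on a mesh host by listing the routes bound to `wt0`. The following is a minimal sketch, assuming the overlay routes live in the main routing table (NetBird may also use policy routing, which this does not inspect). `non_overlay_routes` is a hypothetical helper, and the addresses in the canned input are illustrative.

```bash
#!/usr/bin/env bash
set -euo pipefail

# non_overlay_routes CIDR
# Reads `ip route show dev wt0`-style lines on stdin and prints the destination
# (first field) of every route that is not exactly CIDR.
# No output means the interface carries only the overlay route.
non_overlay_routes() {
  awk -v cidr="$1" 'NF && $1 != cidr { print $1 }'
}

# Canned input: one overlay route plus an unexpected default route.
printf '%s\n' \
  '100.99.0.0/16 proto kernel scope link src 100.99.0.5' \
  'default via 100.99.0.1' \
  | non_overlay_routes 100.99.0.0/16   # prints: default
```

On a real host the input would come from `ip route show dev wt0 | non_overlay_routes 100.99.0.0/16`; an empty result is consistent with the blast-radius table above.
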
- [ ] **Step 3: Update STATUS.md**

In `STATUS.md`, in the `roles/base/` row, append a sentence noting the pin and the accepted SPOF to the end of the firewall/mesh description (before the closing ` |`):

```markdown
The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment).
```

- [ ] **Step 4: Update ROADMAP.md**

In `docs/ROADMAP.md`, in the "Remaining mesh-hardening sub-projects" list, change item 3 from the SPOF-reduction "(next)" wording to **DONE**, and make the NetBird ACL the next item. Replace the current items 3–4 block with:

```markdown
3. ~~**askari relay-SPOF reduction**~~ → **DONE (2026-06-20)** — assessed + **accepted** as a
   documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
   narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
   second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
   DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
   BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
```

- [ ] **Step 5: Consistency check + commit**

```bash
grep -q "^| R8 " docs/security/accepted-risks.md && \
  grep -q "Availability — an .askari. outage" docs/decisions/016-mesh-vpn.md && \
  echo "docs OK"
```
Expected: `docs OK`.

```bash
rbw unlocked
git add docs/security/accepted-risks.md docs/decisions/016-mesh-vpn.md STATUS.md docs/ROADMAP.md
git commit -m "docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment" \
  -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

---

## Notes / out of scope

- **Coordinator off-site backup → ADR-022 kickoff** (next sub-project). Not built here.
- **Direct P2P / second relay / second coordinator** — deliberately not pursued (spec §Design).
- No live deploy is required to land this — the pin is additive/idempotent and applies to ubongo on the next routine `base` apply (`make deploy PLAYBOOK=site LIMIT=ubongo`, operator's discretion). Optional post-deploy spot-check: `getent hosts netbird.askari.wingu.me` on ubongo resolves to `77.42.120.136`.

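The optional spot-check could be scripted along these lines. It is a sketch under assumptions: `resolved_ip` and `pin_matches` are hypothetical helper names, and `getent hosts` only reflects the pin when `files` precedes `dns` in the host's NSS configuration, so a match does not by itself prove the answer came from `/etc/hosts`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# resolved_ip FQDN -> first address the system resolver (NSS) reports, or empty.
resolved_ip() {
  getent hosts "$1" 2>/dev/null | awk 'NR == 1 { print $1 }' || true
}

# pin_matches EXPECTED ACTUAL -> succeeds only when ACTUAL is non-empty and equal.
pin_matches() {
  [ -n "$2" ] && [ "$1" = "$2" ]
}

# Run the check only when invoked with two arguments: <fqdn> <expected-ip>
if [ "$#" -eq 2 ]; then
  actual=$(resolved_ip "$1")
  if pin_matches "$2" "$actual"; then
    echo "OK: $1 -> $actual"
  else
    echo "MISMATCH: $1 -> ${actual:-<unresolved>} (expected $2)" >&2
    exit 1
  fi
fi
```

Saved under any name (for example a hypothetical `check-pin.sh`), it would be run on ubongo as `./check-pin.sh netbird.askari.wingu.me 77.42.120.136`.
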