boma/docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md
sjat 3ba22d199a docs(spec): mesh-hardening SPOF — accept single-coordinator SPOF + DNS-resilience pin
Sub-project 3 of the mesh-hardening follow-on. Accepts the single off-site coordinator as a documented availability SPOF (R8 + ADR-016 amendment) given the narrow blast radius (LAN/intra-cluster/local traffic unaffected; only remote relayed mesh access breaks). Hardens the one real gap: a base mesh coordinator-FQDN /etc/hosts pin so managed hosts survive a local-DNS hiccup. Coordinator off-site backup explicitly deferred to an ADR-022 kickoff (no throwaway infra).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:42:19 +02:00

# Spec — Mesh-hardening (SPOF): accept the single-coordinator SPOF + targeted resilience
Status: Accepted (2026-06-20)
## Context & scope
The **mesh-hardening follow-on** decomposed into independent sub-projects (ROADMAP). Progress:
1. ~~ubongo nftables INPUT-only default-deny~~: **DONE 2026-06-19**.
2. ~~askari SSH → `wt0` redesign~~: **DONE 2026-06-20** (live reboot-validated).
3. **askari relay-SPOF reduction**: *this spec*.
4. NetBird ACL off Allow-All: not started.

`askari` runs boma's **single** self-hosted NetBird coordinator (management + signal + relay +
STUN, one combined container) **and** is a mesh peer (ADR-016). Because `ubongo`'s INPUT-only
default-deny drops the inbound UDP that ICE hole-punching needs, `ubongo`'s peers are always
**`Relayed`** through askari's own relay (intentional posture — `docs/runbooks/netbird-client.md`,
the `ubongo-relay-only` finding). So askari is a single point of failure for **relayed mesh
traffic**.
### The decisive finding — the blast radius is narrow
The mesh (`wt0`) is **not** a default gateway. Verified on ubongo (2026-06-20):
```
wt0 routes ONLY 100.99.0.0/16 · default route via 10.20.10.1 dev eno1 · Networks: - (no subnet-routes/exit-node)
```
So an askari outage affects **only** traffic addressed to a peer's `100.99.x.x` mesh IP over the
relay:
| Traffic | askari down |
|---|---|
| LAN device → LAN service (direct or via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (future cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P, local ICE candidate) |
| **road-warrior → ubongo (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |

Nothing on the LAN and no future intra-cluster traffic depends on askari. The only loss is
**remote (off-LAN) mesh access to peers**, and only when the client is off-LAN *and* askari is
down at the same time.
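
The routing claim behind this table can be re-checked at any time. The play below is a minimal sketch, not part of the repo: the play and variable names are hypothetical, the `ubongo` host pattern is assumed to match the inventory, and it assumes an iproute2 build that supports JSON output (`ip -j`). It only asserts that no default route leaves via `wt0`.

```yaml
# Hypothetical ad-hoc check; play name and register variable are placeholders.
- name: Confirm the mesh interface is not the default gateway
  hosts: ubongo
  gather_facts: false
  tasks:
    - name: Read the default route(s) as JSON
      ansible.builtin.command: ip -j route show default
      register: default_route
      changed_when: false   # read-only command, never reports a change

    - name: Assert no default route leaves via wt0
      ansible.builtin.assert:
        that:
          - "'wt0' not in (default_route.stdout | from_json | map(attribute='dev') | list)"
        fail_msg: "wt0 carries a default route; the blast-radius analysis no longer holds"
```

If this assertion ever fails (for example after enabling an exit node), the table above should be revisited before relying on it.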
### Why we are not "fixing" the SPOF with new infrastructure
- **A second coordinator** is not supported by self-hosted NetBird (single management/signal) and
contradicts ADR-016's deliberate single off-site coordinator.
- **Direct P2P** only helps already-established sessions (re-handshakes still need askari's
signal), and enabling it punctures `ubongo`'s deliberate default-deny (a firewall-catalog UDP
entry + an `accepted-risks` deviation + OPNsense NAT) — cost out of proportion to a narrow,
rare failure.
- **A second relay** needs another publicly-reachable host; a relay at home reintroduces the
public home surface ADR-016's off-site coordinator exists to avoid.

Given a reliable always-on VPS and boma's 25-host scale, the sound engineering choice is to
**accept the SPOF as a conscious, documented trade-off** and harden only the one gap a real
incident points to.
## Goal / success criteria
- The single-coordinator SPOF is **explicitly accepted and documented** (register entry + an
ADR-016 availability analysis + recovery), so the trade-off is revisitable, not forgotten.
- **Managed mesh hosts survive a local-DNS hiccup:** `ubongo` (and future managed mesh hosts)
resolve the coordinator FQDN even when their local resolver fails (for example during a network
transition), mirroring the client-side fix already in the runbook.
- **No new infrastructure** — no P2P, no second relay, no second coordinator, no Terraform.
- The coordinator **off-site backup gap** is named in the accepted risk and explicitly handed to
the next sub-project (ADR-022), not built here.
## Design
### (a) Accepted-risk `R8` — `docs/security/accepted-risks.md`
Add one row to the register (owned by ADR-002):
- **Risk:** *Single off-site mesh coordinator is an availability SPOF for remote mesh access.*
askari hosts the only management/signal/relay (ADR-016); a relayed peer (all of ubongo's) loses
remote mesh reachability while askari is down, and the control plane pauses. The
`netbird_coordinator` store has **no off-site backup yet** (BACKUP.md), so an askari loss also
loses mesh control-plane state until rebuilt.
- **Rationale:** inherent to ADR-016's deliberate single off-site coordinator (sovereignty,
survives a homelab outage); **narrow blast radius** (above table — LAN/intra-cluster/local
unaffected); askari is a reliable always-on VPS; mitigations exist (client + managed-host DNS
pin; documented rebuild).
- **Revisit trigger:** askari proves unreliable; the cluster grows to depend on the mesh for
intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role
lands (closes the state-loss half).

R8 is the **availability** complement to R3 (which covers askari as a *security* target).
### (b) ADR-016 amendment — an "Availability — an askari outage" subsection
A short subsection capturing: the blast-radius table; that the SPOF is an accepted property
(→ R8); and the **recovery procedure** — rebuild the coordinator (`/setup` + re-enrol peers, M5)
or restore from backup once ADR-022 lands; client/road-warrior break-glass already in
`docs/runbooks/netbird-client.md`; on-LAN access to ubongo never depends on the mesh (ADR-016
recovery model). Recorded as a dated amendment; ADR-016's status stays Accepted.
### (c) DNS-resilience — pin the coordinator FQDN on managed mesh hosts (`base` `mesh` concern)
The 2026-06-18 outage was a client failing to resolve `netbird.askari.wingu.me` on a network
transition; the client fix (public resolvers + an `/etc/hosts` pin to askari's stable WAN IP) is
already in the runbook. The gap: **managed** mesh hosts have no equivalent. Add to `base`'s `mesh`
concern (`roles/base/tasks/mesh.yml`):
- New default `base__mesh_coordinator_pin: ""` (empty → no pin; opt-in).
- When set (and `base__mesh_enabled`), render an `/etc/hosts` entry mapping the coordinator FQDN
— derived from `base__mesh_management_url` via the `urlsplit('hostname')` filter, **not** a
duplicated literal — to `base__mesh_coordinator_pin`, idempotently (a marker-scoped
`blockinfile`/`lineinfile`).
- Set `base__mesh_coordinator_pin` to askari's static WAN IP for managed mesh hosts that depend
on the coordinator (ubongo via the `control` group_vars; future cluster groups as they appear).
The **coordinator host itself (askari) is exempt** (it would point its own FQDN at its own WAN
IP — needs NAT hairpin and is a server with stable DNS); the plan confirms the exact group_vars
placement and the askari exemption.

The pin is safe because askari's WAN IP is static (operator-confirmed); rendering it from a single
inventory variable keeps it maintainable if it ever changes.
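
As a rough illustration of the task described in this section, the sketch below uses `blockinfile` with a dedicated marker. The task name and marker text are hypothetical; the variables are the ones named above. The FQDN is derived with the `urlsplit('hostname')` filter rather than repeated as a literal.

```yaml
# Sketch for roles/base/tasks/mesh.yml; task name and marker text are hypothetical.
- name: Pin the mesh coordinator FQDN in /etc/hosts
  ansible.builtin.blockinfile:
    path: /etc/hosts
    # The marker lines scope the managed block, so re-runs rewrite it in place
    # instead of appending duplicate entries.
    marker: "# {mark} boma mesh coordinator pin"
    # One "<ip> <fqdn>" line; the hostname is extracted from the management URL.
    block: "{{ base__mesh_coordinator_pin }} {{ base__mesh_management_url | ansible.builtin.urlsplit('hostname') }}"
  when:
    - base__mesh_enabled | bool
    - base__mesh_coordinator_pin | length > 0   # empty default means no pin
```

A `lineinfile` task keyed on the FQDN would also satisfy the idempotency requirement; `blockinfile` is shown only because its markers make the managed entry easy to spot and to remove later.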
## New & changed code/docs
- `docs/security/accepted-risks.md` — add row **R8**; bump the "Last reviewed" date.
- `docs/decisions/016-mesh-vpn.md` — add the dated "Availability — an askari outage" amendment
subsection (blast-radius table + recovery + R8 cross-ref).
- `roles/base/defaults/main.yml` — add `base__mesh_coordinator_pin: ""` with a comment.
- `roles/base/tasks/mesh.yml` — add the `/etc/hosts` coordinator-pin task (gated on
`base__mesh_enabled` + a non-empty pin; FQDN from `urlsplit`).
- `inventories/production/group_vars/control/vars.yml` — set `base__mesh_coordinator_pin` to
askari's WAN IP for ubongo.
- `roles/base/molecule/default/{converge,verify}.yml` — assert that with the pin set + a fixture
FQDN the `/etc/hosts` entry renders, and that an empty pin renders nothing (no-op).
- `STATUS.md` / `docs/ROADMAP.md` — mark sub-project 3 done; surface ADR-022 (coordinator backup)
as the next item. (Land with the implementation, not this spec.)
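
The two variable changes in the list above might look like the following. This is a sketch only: the comment wording is illustrative, and `203.0.113.10` is a documentation-range placeholder, not askari's real WAN IP.

```yaml
---
# roles/base/defaults/main.yml
# IP address the coordinator FQDN is pinned to in /etc/hosts on managed mesh hosts.
# Empty string (the default) renders no pin.
base__mesh_coordinator_pin: ""
---
# inventories/production/group_vars/control/vars.yml
# Placeholder value for illustration; replace with askari's static WAN IP.
base__mesh_coordinator_pin: "203.0.113.10"
```

Because the value lives only in the `control` group_vars, askari (which is outside that group per the exemption above) keeps the empty default and gets no pin.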
## Testing
- **Molecule** (`base` default scenario): (1) `base__mesh_coordinator_pin: ""` → no `/etc/hosts`
coordinator line (default no-op); (2) pin set + a fixture `base__mesh_management_url` → exactly
one idempotent `<ip> <fqdn>` line, FQDN correctly extracted by `urlsplit`. Existing
firewall/hardening/mesh assertions stay green.
- **No live deploy required for acceptance** — the pin is additive and idempotent; it lands on
ubongo on the next routine `base` apply. (Optional spot-check: `getent hosts
netbird.askari.wingu.me` on ubongo resolves to the pinned IP.)
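
A Molecule verify play for the pinned case could be sketched as follows. The fixture IP and FQDN are hypothetical and would have to match whatever `converge.yml` sets for `base__mesh_coordinator_pin` and `base__mesh_management_url`; the sketch also assumes the rendered line is exactly `<ip> <fqdn>` with a single space.

```yaml
---
# Sketch of roles/base/molecule/default/verify.yml for the pinned case.
# Fixture values are hypothetical and must mirror converge.yml.
- name: Verify
  hosts: all
  gather_facts: false
  vars:
    fixture_pin: "203.0.113.10"               # documentation-range placeholder IP
    fixture_fqdn: "coordinator.example.test"  # hostname part of the fixture management URL
  tasks:
    - name: Read /etc/hosts from the instance
      ansible.builtin.slurp:
        src: /etc/hosts
      register: hosts_file

    - name: Assert exactly one pin line is present
      ansible.builtin.assert:
        that:
          - hosts_lines | select('equalto', fixture_pin ~ ' ' ~ fixture_fqdn) | list | length == 1
      vars:
        # slurp returns base64; decode and split into individual lines
        hosts_lines: "{{ (hosts_file.content | b64decode).splitlines() }}"
```

For the empty-pin case the same play can instead assert that `fixture_fqdn` does not appear in the decoded file. Idempotency itself is normally exercised by Molecule's idempotence step rather than by the verify play.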
## Risks & rollback
- **Stale pin if askari's WAN IP changes** — mitigated by rendering from one inventory variable
(single edit) and askari's IP being static; the pin is removable by clearing the knob + a
re-apply.
- **Over-pinning the coordinator host** — askari is explicitly exempt (hairpin/DNS), set in
group_vars scope.
- **Accepting the SPOF** is itself the residual risk — bounded by the narrow blast radius, the
documented recovery, and R8's revisit triggers.
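
One caveat for the rollback path: if the pin task is simply skipped when the knob is empty, an entry rendered by an earlier run stays in `/etc/hosts`. A possible way to make "clear the knob and re-apply" actually remove it is a companion cleanup task, sketched below under the assumption that the pin is rendered with `blockinfile` and a dedicated marker (the marker text and task name here are hypothetical).

```yaml
# Hypothetical companion task for roles/base/tasks/mesh.yml.
- name: Remove the mesh coordinator pin when no pin is configured
  ansible.builtin.blockinfile:
    path: /etc/hosts
    # Must be identical to the marker used when the block was rendered.
    marker: "# {mark} boma mesh coordinator pin"
    state: absent
  when: base__mesh_coordinator_pin | length == 0
```

With `state: absent`, `blockinfile` deletes only the lines between the matching markers and reports no change when they are already gone, so the cleanup is as idempotent as the pin itself.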
## Out of scope / follow-ons
- **Coordinator off-site backup → ADR-022 kickoff (the next sub-project).** Named in R8 and
`BACKUP.md` as the open gap; building it means ADR-022's pull-node (`fisi`) + restic design, not
throwaway plumbing here.
- **Direct P2P / NAT-traversal** — deferred posture change (default-deny puncture + OPNsense NAT +
governance); explicitly not pursued here.
- **A second relay / second coordinator** — ruled out above (infra cost / not supported / against
ADR-016).
- **NetBird ACL off Allow-All** — separate sub-project (4).