boma/docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md
sjat 3ba22d199a docs(spec): mesh-hardening SPOF — accept single-coordinator SPOF + DNS-resilience pin
Sub-project 3 of the mesh-hardening follow-on. Accepts the single off-site coordinator as a documented availability SPOF (R8 + ADR-016 amendment) given the narrow blast radius (LAN/intra-cluster/local traffic unaffected; only remote relayed mesh access breaks). Hardens the one real gap: a base mesh coordinator-FQDN /etc/hosts pin so managed hosts survive a local-DNS hiccup. Coordinator off-site backup explicitly deferred to an ADR-022 kickoff (no throwaway infra).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:42:19 +02:00

Spec — Mesh-hardening (SPOF): accept the single-coordinator SPOF + targeted resilience

Status: Accepted (2026-06-20)

Context & scope

The mesh-hardening follow-on decomposed into independent sub-projects (ROADMAP). Progress:

  1. ubongo nftables INPUT-only default-deny: DONE 2026-06-19.
  2. askari SSH → wt0 redesign: DONE 2026-06-20 (live reboot-validated).
  3. askari relay-SPOF reduction: this spec.
  4. NetBird ACL off Allow-All: not started.

askari runs boma's single self-hosted NetBird coordinator (management + signal + relay + STUN, one combined container) and is a mesh peer (ADR-016). Because ubongo's INPUT-only default-deny drops the inbound UDP that ICE hole-punching needs, ubongo's peers are always Relayed through askari's own relay (intentional posture — docs/runbooks/netbird-client.md, the ubongo-relay-only finding). So askari is a single point of failure for relayed mesh traffic.

The decisive finding — the blast radius is narrow

The mesh (wt0) is not a default gateway. Verified on ubongo (2026-06-20):

wt0 routes ONLY 100.99.0.0/16   ·   default route via 10.20.10.1 dev eno1   ·   Networks: -  (no subnet-routes/exit-node)

So an askari outage affects only traffic addressed to a peer's 100.99.x.x mesh IP over the relay:

| Traffic | askari down |
|---|---|
| LAN device → LAN service (direct or via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (future cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P, local ICE candidate) |
| road-warrior → ubongo (remote, relayed) | breaks |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |

Nothing on the LAN and no future intra-cluster traffic depends on askari. The only loss is remote (off-LAN) mesh access to peers, and only while the user is off-LAN and askari is down at the same time.

Why we are not "fixing" the SPOF with new infrastructure

  • A second coordinator is not supported by self-hosted NetBird (single management/signal) and contradicts ADR-016's deliberate single off-site coordinator.
  • Direct P2P only helps already-established sessions (re-handshakes still need askari's signal), and enabling it punctures ubongo's deliberate default-deny (a firewall-catalog UDP entry + an accepted-risks deviation + OPNsense NAT) — cost out of proportion to a narrow, rare failure.
  • A second relay needs another publicly-reachable host; a relay at home reintroduces the public home surface ADR-016's off-site coordinator exists to avoid.

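Given a reliable always-on VPS and boma's 25-host scale, the sound engineering choice is to accept the SPOF as a conscious, documented trade-off and harden only the one gap a real incident points to: coordinator-FQDN resolution on managed hosts. The second gap, the missing coordinator off-site backup, is named in the accepted risk and deferred to ADR-022.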

Goal / success criteria

  • The single-coordinator SPOF is explicitly accepted and documented (register entry + an ADR-016 availability analysis + recovery), so the trade-off is revisitable, not forgotten.
  • Managed mesh hosts survive a local-DNS hiccup: ubongo (and future managed mesh hosts) resolve the coordinator FQDN even when their resolver dies on a transition, mirroring the client-side fix already in the runbook.
  • No new infrastructure — no P2P, no second relay, no second coordinator, no Terraform.
  • The coordinator off-site backup gap is named in the accepted risk and explicitly handed to the next sub-project (ADR-022), not built here.

Design

(a) Accepted-risk R8 in docs/security/accepted-risks.md

Add one row to the register (owned by ADR-002):

  • Risk: Single off-site mesh coordinator is an availability SPOF for remote mesh access — askari hosts the only management/signal/relay (ADR-016); a relayed peer (all of ubongo's) loses remote mesh reachability while askari is down, and the control plane pauses. The netbird_coordinator store has no off-site backup yet (BACKUP.md), so an askari loss also loses mesh control-plane state until rebuilt.
  • Rationale: inherent to ADR-016's deliberate single off-site coordinator (sovereignty, survives a homelab outage); narrow blast radius (above table — LAN/intra-cluster/local unaffected); askari is a reliable always-on VPS; mitigations exist (client + managed-host DNS pin; documented rebuild).
  • Revisit trigger: askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half).

R8 is the availability complement to R3 (which covers askari as a security target).

(b) ADR-016 amendment — an "Availability — an askari outage" subsection

A short subsection capturing the blast-radius table, the fact that the SPOF is an accepted property (→ R8), and the recovery procedure. Recovery is either a coordinator rebuild (/setup + re-enrol peers, M5) or, once ADR-022 lands, a restore from backup. Client/road-warrior break-glass is already in docs/runbooks/netbird-client.md, and on-LAN access to ubongo never depends on the mesh (ADR-016 recovery model). The subsection is recorded as a dated amendment; ADR-016's status stays Accepted.

(c) DNS-resilience — pin the coordinator FQDN on managed mesh hosts (base mesh concern)

The 2026-06-18 outage was a client failing to resolve netbird.askari.wingu.me on a network transition; the client fix (public resolvers + an /etc/hosts pin to askari's stable WAN IP) is already in the runbook. The gap: managed mesh hosts have no equivalent. Add to base's mesh concern (roles/base/tasks/mesh.yml):

  • New default base__mesh_coordinator_pin: "" (empty → no pin; opt-in).
  • When set (and base__mesh_enabled), render an /etc/hosts entry mapping the coordinator FQDN — derived from base__mesh_management_url via the urlsplit('hostname') filter, not a duplicated literal — to base__mesh_coordinator_pin, idempotently (a marker-scoped blockinfile/lineinfile).
  • Set base__mesh_coordinator_pin to askari's static WAN IP for managed mesh hosts that depend on the coordinator (ubongo via the control group_vars; future cluster groups as they appear). The coordinator host itself (askari) is exempt: a pin would point its own FQDN at its own WAN IP, which needs NAT hairpinning, and as a server it already has stable DNS. The plan confirms the exact group_vars placement and the askari exemption.

The pin is safe because askari's WAN IP is static (operator-confirmed); rendering it from a single inventory variable keeps it maintainable if it ever changes.
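
A minimal sketch of the pin task described above, assuming a marker-scoped blockinfile is the mechanism chosen (the spec leaves blockinfile vs lineinfile open). The task name and marker text are illustrative, not taken from the role; the variable names are the ones this spec defines. Note one deliberate deviation: instead of skipping the task on an empty pin, the sketch switches state to absent, which is one way to make "clear the knob + re-apply" actually remove a previously rendered entry.

```yaml
# roles/base/tasks/mesh.yml (sketch, not the final task)
# Assumes roles/base/defaults/main.yml carries: base__mesh_coordinator_pin: ""
- name: Pin the mesh coordinator FQDN in /etc/hosts
  ansible.builtin.blockinfile:
    path: /etc/hosts
    # Illustrative marker text; {mark} expands to BEGIN / END.
    marker: "# {mark} boma mesh coordinator pin"
    # The FQDN is derived from the management URL, not duplicated as a literal.
    # The folded scalar renders as a single "<ip> <fqdn>" line.
    block: >-
      {{ base__mesh_coordinator_pin }}
      {{ base__mesh_management_url | urlsplit('hostname') }}
    # An empty pin removes any block rendered by an earlier run.
    state: "{{ 'present' if (base__mesh_coordinator_pin | length > 0) else 'absent' }}"
  when: base__mesh_enabled | bool
```

With base__mesh_management_url set to a made-up fixture such as https://coordinator.example.test:443, the filter yields coordinator.example.test, so the rendered entry is the pin IP followed by that hostname, enclosed by the two marker comments.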

New & changed code/docs

  • docs/security/accepted-risks.md — add row R8; bump the "Last reviewed" date.
  • docs/decisions/016-mesh-vpn.md — add the dated "Availability — an askari outage" amendment subsection (blast-radius table + recovery + R8 cross-ref).
  • roles/base/defaults/main.yml — add base__mesh_coordinator_pin: "" with a comment.
  • roles/base/tasks/mesh.yml — add the /etc/hosts coordinator-pin task (gated on base__mesh_enabled + a non-empty pin; FQDN from urlsplit).
  • inventories/production/group_vars/control/vars.yml — set base__mesh_coordinator_pin to askari's WAN IP for ubongo.
  • roles/base/molecule/default/{converge,verify}.yml — assert that with the pin set + a fixture FQDN the /etc/hosts entry renders, and that an empty pin renders nothing (no-op).
  • STATUS.md / docs/ROADMAP.md — mark sub-project 3 done; surface ADR-022 (coordinator backup) as the next item. (Land with the implementation, not this spec.)
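
As a sketch of the inventory side, the group_vars entry could look like the following. The address is a placeholder from the 203.0.113.0/24 documentation range, not askari's real WAN IP, which this spec does not state.

```yaml
# inventories/production/group_vars/control/vars.yml (sketch)
# Placeholder address from the 203.0.113.0/24 documentation range;
# replace with askari's static WAN IP.
base__mesh_coordinator_pin: "203.0.113.10"
```

Under this layout askari's own group simply keeps the role default (an empty string), which leaves the coordinator exempt without a special case in the task.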

Testing

  • Molecule (base default scenario): (1) base__mesh_coordinator_pin: "" → no /etc/hosts coordinator line (default no-op); (2) pin set + a fixture base__mesh_management_url → exactly one idempotent <ip> <fqdn> line, FQDN correctly extracted by urlsplit. Existing firewall/hardening/mesh assertions stay green.
  • No live deploy required for acceptance — the pin is additive and idempotent; it lands on ubongo on the next routine base apply. (Optional spot-check: getent hosts netbird.askari.wingu.me on ubongo resolves to the pinned IP.)
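
The pin-set case from the Molecule bullet could be verified along these lines. This is a sketch: the fixture IP and FQDN are made-up values that would have to match what converge.yml feeds the role, and the play layout is an assumption about the existing verify.yml.

```yaml
# roles/base/molecule/default/verify.yml (sketch; fixture values are hypothetical)
- name: Verify the coordinator pin
  hosts: all
  gather_facts: false
  vars:
    # Must mirror the values converge.yml passes to the role, e.g.
    # base__mesh_management_url: "https://coordinator.example.test:443"
    fixture_pin_ip: "203.0.113.10"
    fixture_fqdn: "coordinator.example.test"
  tasks:
    - name: Read /etc/hosts from the instance
      ansible.builtin.slurp:
        src: /etc/hosts
      register: hosts_file

    - name: Assert exactly one "<ip> <fqdn>" pin line exists
      ansible.builtin.assert:
        that:
          # Counts lines that equal the expected entry exactly,
          # so a duplicate or a malformed FQDN both fail.
          - >-
            (hosts_file.content | b64decode).splitlines()
            | select('equalto', fixture_pin_ip ~ ' ' ~ fixture_fqdn)
            | list | length == 1
```

The empty-pin case needs the role converged with the default value, so it would run either against a second instance that keeps the default or in a separate scenario; which fits the existing default scenario is left to the plan. If the Molecule driver bind-mounts /etc/hosts into the instance (as Docker does), the write may also need unsafe_writes: true, which depends on the driver and is not settled here.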

Risks & rollback

  • Stale pin if askari's WAN IP changes — mitigated by rendering from one inventory variable (single edit) and askari's IP being static; the pin is removable by clearing the knob + a re-apply.
  • Over-pinning the coordinator host — askari is explicitly exempt (hairpin/DNS), set in group_vars scope.
  • Accepting the SPOF is itself the residual risk — bounded by the narrow blast radius, the documented recovery, and R8's revisit triggers.

Out of scope / follow-ons

  • Coordinator off-site backup → ADR-022 kickoff (the next sub-project). Named in R8 and BACKUP.md as the open gap; building it means ADR-022's pull-node (fisi) + restic design, not throwaway plumbing here.
  • Direct P2P / NAT-traversal — deferred posture change (default-deny puncture + OPNsense NAT + governance); explicitly not pursued here.
  • A second relay / second coordinator — ruled out above (infra cost / not supported / against ADR-016).
  • NetBird ACL off Allow-All — separate sub-project (4).