docs(spec): mesh-hardening SPOF — accept single-coordinator SPOF + DNS-resilience pin
Sub-project 3 of the mesh-hardening follow-on. Accepts the single off-site coordinator as a documented availability SPOF (R8 + ADR-016 amendment) given the narrow blast radius (LAN/intra-cluster/local traffic unaffected; only remote relayed mesh access breaks). Hardens the one real gap: a base mesh coordinator-FQDN /etc/hosts pin so managed hosts survive a local-DNS hiccup. Coordinator off-site backup explicitly deferred to an ADR-022 kickoff (no throwaway infra).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent f10fe8bb60
commit 3ba22d199a
1 changed file with 163 additions and 0 deletions

@@ -0,0 +1,163 @@
# Spec — Mesh-hardening (SPOF): accept the single-coordinator SPOF + targeted resilience

Status: Accepted (2026-06-20)

## Context & scope

The **mesh-hardening follow-on** decomposed into independent sub-projects (ROADMAP). Progress:

1. ~~ubongo nftables INPUT-only default-deny~~ — **DONE 2026-06-19**.
2. ~~askari SSH → `wt0` redesign~~ — **DONE 2026-06-20** (live reboot-validated).
3. **askari relay-SPOF reduction** ← *this spec*.
4. NetBird ACL off Allow-All — not started.

`askari` runs boma's **single** self-hosted NetBird coordinator (management + signal + relay +
STUN, one combined container) **and** is a mesh peer (ADR-016). Because `ubongo`'s INPUT-only
default-deny drops the inbound UDP that ICE hole-punching needs, `ubongo`'s peers are always
**`Relayed`** through askari's own relay (intentional posture — `docs/runbooks/netbird-client.md`,
the `ubongo-relay-only` finding). So askari is a single point of failure for **relayed mesh
traffic**.

### The decisive finding — the blast radius is narrow

The mesh (`wt0`) is **not** a default gateway. Verified on ubongo (2026-06-20):

```
wt0 routes ONLY 100.99.0.0/16 · default route via 10.20.10.1 dev eno1 · Networks: - (no subnet-routes/exit-node)
```

So an askari outage affects **only** traffic addressed to a peer's `100.99.x.x` mesh IP over the
relay:

| Traffic | askari down |
|---|---|
| LAN device → LAN service (direct or via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (future cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P, local ICE candidate) |
| **road-warrior → ubongo (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |

Nothing on the LAN and no future intra-cluster traffic depends on askari. The only loss is
**remote (off-LAN) mesh access to peers** — and only when off-LAN *and* askari is down at once.

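The routing finding above could be re-checked automatically rather than by hand. The task below is a minimal sketch, assuming it runs in a play with fact gathering enabled; it is not part of this spec's deliverables, and the failure message wording is illustrative only.

```yaml
# Hypothetical guard task (not a deliverable of this spec).
# Fails if the host's default route ever starts leaving via the mesh interface,
# which would invalidate the blast-radius table above.
- name: Assert the mesh interface is not the default route
  ansible.builtin.assert:
    that:
      # default('') keeps the check from erroring on a host with no default route.
      - (ansible_facts.default_ipv4.interface | default('')) != 'wt0'
    fail_msg: "Default route now uses wt0; re-run the SPOF blast-radius analysis."
```

A check like this only covers the default route; subnet routes or an exit node configured in NetBird would still need the manual `Networks:` inspection shown above.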
### Why we are not "fixing" the SPOF with new infrastructure

- **A second coordinator** is not supported by self-hosted NetBird (single management/signal) and
  contradicts ADR-016's deliberate single off-site coordinator.
- **Direct P2P** only helps already-established sessions (re-handshakes still need askari's
  signal), and enabling it punctures `ubongo`'s deliberate default-deny (a firewall-catalog UDP
  entry + an `accepted-risks` deviation + OPNsense NAT) — cost out of proportion to a narrow,
  rare failure.
- **A second relay** needs another publicly-reachable host; a relay at home reintroduces the
  public home surface ADR-016's off-site coordinator exists to avoid.

Given a reliable always-on VPS and boma's 2–5-host scale, the sound engineering choice is to
**accept the SPOF as a conscious, documented trade-off** and harden only the two spots real
incidents point to.

## Goal / success criteria

- The single-coordinator SPOF is **explicitly accepted and documented** (register entry + an
  ADR-016 availability analysis + recovery), so the trade-off is revisitable, not forgotten.
- **Managed mesh hosts survive a local-DNS hiccup:** `ubongo` (and future managed mesh hosts)
  resolve the coordinator FQDN even when their resolver dies on a transition, mirroring the
  client-side fix already in the runbook.
- **No new infrastructure** — no P2P, no second relay, no second coordinator, no Terraform.
- The coordinator **off-site backup gap** is named in the accepted risk and explicitly handed to
  the next sub-project (ADR-022), not built here.

## Design

### (a) Accepted-risk `R8` — `docs/security/accepted-risks.md`

Add one row to the register (owned by ADR-002):

- **Risk:** *Single off-site mesh coordinator is an availability SPOF for remote mesh access* —
  askari hosts the only management/signal/relay (ADR-016); a relayed peer (all of ubongo's) loses
  remote mesh reachability while askari is down, and the control plane pauses. The
  `netbird_coordinator` store has **no off-site backup yet** (BACKUP.md), so an askari loss also
  loses mesh control-plane state until rebuilt.
- **Rationale:** inherent to ADR-016's deliberate single off-site coordinator (sovereignty,
  survives a homelab outage); **narrow blast radius** (above table — LAN/intra-cluster/local
  unaffected); askari is a reliable always-on VPS; mitigations exist (client + managed-host DNS
  pin; documented rebuild).
- **Revisit trigger:** askari proves unreliable; the cluster grows to depend on the mesh for
  intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role
  lands (closes the state-loss half).

R8 is the **availability** complement to R3 (which covers askari as a *security* target).

### (b) ADR-016 amendment — an "Availability — an askari outage" subsection

A short subsection capturing: the blast-radius table; that the SPOF is an accepted property
(→ R8); and the **recovery procedure** — rebuild the coordinator (`/setup` + re-enrol peers, M5)
or restore from backup once ADR-022 lands; client/road-warrior break-glass already in
`docs/runbooks/netbird-client.md`; on-LAN access to ubongo never depends on the mesh (ADR-016
recovery model). Recorded as a dated amendment; ADR-016's status stays Accepted.

### (c) DNS-resilience — pin the coordinator FQDN on managed mesh hosts (`base` `mesh` concern)

The 2026-06-18 outage was a client failing to resolve `netbird.askari.wingu.me` on a network
transition; the client fix (public resolvers + an `/etc/hosts` pin to askari's stable WAN IP) is
already in the runbook. The gap: **managed** mesh hosts have no equivalent. Add to `base`'s `mesh`
concern (`roles/base/tasks/mesh.yml`):

- New default `base__mesh_coordinator_pin: ""` (empty → no pin; opt-in).
- When set (and `base__mesh_enabled`), render an `/etc/hosts` entry mapping the coordinator FQDN
  — derived from `base__mesh_management_url` via the `urlsplit('hostname')` filter, **not** a
  duplicated literal — to `base__mesh_coordinator_pin`, idempotently (a marker-scoped
  `blockinfile`/`lineinfile`).
- Set `base__mesh_coordinator_pin` to askari's static WAN IP for managed mesh hosts that depend
  on the coordinator (ubongo via the `control` group_vars; future cluster groups as they appear).
  The **coordinator host itself (askari) is exempt**: a pin would point its own FQDN at its own
  WAN IP, which needs NAT hairpin, and it is a server with stable DNS anyway. The plan confirms
  the exact group_vars placement and the askari exemption.

The pin is safe because askari's WAN IP is static (operator-confirmed); rendering it from a single
inventory variable keeps it maintainable if it ever changes.

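The pin task described above might look roughly like the following. This is a minimal sketch, not the final implementation: the variable names come from this spec, while the marker text and the choice of `blockinfile` over `lineinfile` are assumptions made for illustration.

```yaml
# roles/base/tasks/mesh.yml (sketch only; marker wording is an assumption)
- name: Manage the mesh coordinator FQDN pin in /etc/hosts
  ansible.builtin.blockinfile:
    path: /etc/hosts
    # {mark} expands to BEGIN/END, so only this managed block is ever touched.
    marker: "# {mark} ANSIBLE MANAGED: mesh coordinator pin"
    # FQDN is derived from the management URL, never duplicated as a literal.
    block: "{{ base__mesh_coordinator_pin }} {{ base__mesh_management_url | urlsplit('hostname') }}"
    # An empty pin removes a previously rendered block, so clearing the knob rolls back.
    state: "{{ 'present' if base__mesh_coordinator_pin | length > 0 else 'absent' }}"
  when: base__mesh_enabled | bool
```

Scoping the edit with a marker means a changed IP rewrites the block in place and an emptied variable removes it, without touching any hand-written `/etc/hosts` lines. The askari exemption then needs no special case in the task: the host simply never has the variable set.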
## New & changed code/docs

- `docs/security/accepted-risks.md` — add row **R8**; bump the "Last reviewed" date.
- `docs/decisions/016-mesh-vpn.md` — add the dated "Availability — an askari outage" amendment
  subsection (blast-radius table + recovery + R8 cross-ref).
- `roles/base/defaults/main.yml` — add `base__mesh_coordinator_pin: ""` with a comment.
- `roles/base/tasks/mesh.yml` — add the `/etc/hosts` coordinator-pin task (gated on
  `base__mesh_enabled` + a non-empty pin; FQDN from `urlsplit`).
- `inventories/production/group_vars/control/vars.yml` — set `base__mesh_coordinator_pin` to
  askari's WAN IP for ubongo.
- `roles/base/molecule/default/{converge,verify}.yml` — assert that with the pin set + a fixture
  FQDN the `/etc/hosts` entry renders, and that an empty pin renders nothing (no-op).
- `STATUS.md` / `docs/ROADMAP.md` — mark sub-project 3 done; surface ADR-022 (coordinator backup)
  as the next item. (Land with the implementation, not this spec.)

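For orientation, the two variable changes listed above could be sketched as below, shown as two YAML documents, one per file. The IP is a documentation placeholder from the RFC 5737 range, not askari's real WAN IP, and the comment wording is an assumption.

```yaml
# roles/base/defaults/main.yml
# Static IP to pin the mesh coordinator FQDN to in /etc/hosts.
# Empty (the default) means no pin is rendered; the feature is opt-in.
base__mesh_coordinator_pin: ""
---
# inventories/production/group_vars/control/vars.yml
# Placeholder value (RFC 5737 documentation range); replace with askari's static WAN IP.
base__mesh_coordinator_pin: "203.0.113.10"
```

Keeping the value in one group-scoped variable is what makes a WAN IP change a single-line edit followed by a routine `base` apply.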
## Testing

- **Molecule** (`base` default scenario): (1) `base__mesh_coordinator_pin: ""` → no `/etc/hosts`
  coordinator line (default no-op); (2) pin set + a fixture `base__mesh_management_url` → exactly
  one idempotent `<ip> <fqdn>` line, FQDN correctly extracted by `urlsplit`. Existing
  firewall/hardening/mesh assertions stay green.
- **No live deploy required for acceptance** — the pin is additive and idempotent; it lands on
  ubongo on the next routine `base` apply. (Optional spot-check: `getent hosts
  netbird.askari.wingu.me` on ubongo resolves to the pinned IP.)

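Case (2) of the Molecule check might be expressed along these lines. This is a hedged sketch: the fixture IP and FQDN are hypothetical and must match whatever `converge.yml` passes to the role (for example a management URL of `https://netbird.example.test:443`).

```yaml
# roles/base/molecule/default/verify.yml (sketch; fixture values are assumptions)
- name: Verify the coordinator pin
  hosts: all
  gather_facts: false
  vars:
    fixture_pin: "203.0.113.10"
    fixture_fqdn: "netbird.example.test"
  tasks:
    - name: Read /etc/hosts from the instance
      ansible.builtin.slurp:
        src: /etc/hosts
      register: hosts_file

    - name: Assert exactly one pin line for the fixture FQDN
      ansible.builtin.assert:
        that:
          - pin_lines | length == 1
          - "pin_lines[0] == (fixture_pin ~ ' ' ~ fixture_fqdn)"
      vars:
        # All /etc/hosts lines that mention the fixture FQDN.
        pin_lines: >-
          {{ (hosts_file.content | b64decode).splitlines()
             | select('search', fixture_fqdn | regex_escape)
             | list }}
```

The empty-pin case (1) would invert the first assertion to `pin_lines | length == 0`, and idempotence itself is covered by Molecule's own idempotence step rather than by `verify.yml`.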
## Risks & rollback

- **Stale pin if askari's WAN IP changes** — mitigated by rendering from one inventory variable
  (single edit) and askari's IP being static; the pin is removable by clearing the knob + a
  re-apply.
- **Over-pinning the coordinator host** — askari is explicitly exempt (hairpin/DNS); the pin is
  set only at group_vars scope, so it never reaches askari.
- **Accepting the SPOF** is itself the residual risk — bounded by the narrow blast radius, the
  documented recovery, and R8's revisit triggers.

## Out of scope / follow-ons

- **Coordinator off-site backup → ADR-022 kickoff (the next sub-project).** Named in R8 and
  `BACKUP.md` as the open gap; building it means ADR-022's pull-node (`fisi`) + restic design, not
  throwaway plumbing here.
- **Direct P2P / NAT-traversal** — deferred posture change (default-deny puncture + OPNsense NAT +
  governance); explicitly not pursued here.
- **A second relay / second coordinator** — ruled out above (infra cost / not supported / against
  ADR-016).
- **NetBird ACL off Allow-All** — separate sub-project (4).