boma/docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
sjat a178729587 docs(spec): mesh-hardening redesign — askari wt0-primary + WAN break-glass
Redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Mirrors the proven ubongo 2/3 pattern (INPUT-only default-deny, SSH scoped by iifname wt0, no sshd ListenAddress change -> no boot-race) and adds the coordinator-host exception the incident demanded: a permanent non-mesh break-glass (WAN :22 from ubongo's static WAN IP + the Hetzner console), WAN :22 deliberately left open. Folds in the netbird_coordinator geo-DB robustness fix (FRICTION #4) so a transient egress blip can't FATAL the control plane. Harness-GREEN gate before a supervised live cutover.

Operator decision (2026-06-19): do this redesign first, then a separate sub-project to reduce askari's SPOF role.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:25:26 +02:00


Spec — Mesh-hardening redesign: askari SSH wt0-primary + permanent WAN break-glass

Status: Accepted (2026-06-19)

Context & scope

The mesh-hardening follow-on (deferred from M5) was decomposed into three independent sub-projects, each with its own spec → plan → implementation cycle. Progress so far:

  1. askari SSH → wt0: attempted 2026-06-17, BACKED OUT after it took askari down on reboot (spec/plan docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*).
  2. ubongo nftables INPUT-only default-deny — DONE 2026-06-19, reboot-validated (base__firewall_input_only).
  3. NetBird ACL off Allow-All → scoped policies — not started.

This spec is the redesign of (1) and covers only that. The operator sequencing decision (2026-06-19) is: do this redesign first, then a separate sub-project to reduce askari's single-point-of-failure (SPOF) role. The SPOF reduction is the named follow-on, with its own later spec.

Why the 2026-06-17 attempt was backed out

Four hazards, drawn from the six 2026-06-17 signals recorded in docs/FRICTION.md:

  1. base's forward policy drop breaks Docker hosts on reboot — nftables loaded default-deny before Docker, so container forwarding/NAT (WAN→Caddy, Caddy→coordinator) died after reboot.
  2. ip_nonlocal_bind did NOT beat the sshd boot-race — binding sshd ListenAddress to the wt0 IP still failed at boot ("could not assign the address"); and because wt0 never came up, sshd had no listener at all.
  3. The coordinator host can't bootstrap the mesh it depends on — askari runs the NetBird coordinator and is a mesh peer; its agent needs the local coordinator container healthy to bring up wt0. After an unclean reboot the coordinator was down → wt0 never came up → with SSH wt0-only, the host was reachable only via the Hetzner console. General rule: never make a host's only management path depend on a service that host itself hosts.
  4. The coordinator FATAL-loops on the geolocation-DB download with no egress — a transient loss of container egress (here: NAT wiped by nft flush) crash-loops the whole control plane.

What changed since 2026-06-17 (enablers this redesign relies on)

  • docker_host container-forward nftables drop-in (172ae37) — reboot-safe Docker forwarding (available as a later tightening; not required by this pass).
  • base__firewall_input_only — input-only default-deny, forward chain stays policy accept (Docker-safe). Proven on ubongo and reboot-validated 2026-06-19.
  • The ADR-025 integration harness — reproduces a host's boot on a throwaway local VM, so reboot-safety is proven GREEN before the real host is touched.

Goal / success criteria

  • askari's host nftables firewall is applied at last (base__firewall_apply: true), INPUT-only default-deny — matching ubongo.
  • Normal management is over the mesh: ansible_host resolves to askari's wt0 IP (100.99.226.39); SSH-over-wt0 and ansible askari -m ping over the mesh both succeed.
  • A permanent non-mesh break-glass survives a mesh/coordinator outage, via two independent channels:
    • the Hetzner web console (out-of-band; always works, IP-independent); and
    • WAN :22 reachable only from ubongo's WAN IP (91.226.145.80), enforced at both the host nftables layer (base__firewall_admin_addrs) and the Hetzner Cloud Firewall. WAN :22 is deliberately NOT closed — the coordinator-host exception (FRICTION #3).
  • askari survives a reboot under the new firewall, unattended: Docker forwarding/NAT intact, https://test.askari.wingu.me + https://netbird.askari.wingu.me serve valid certs, STUN 3478/udp answers, the coordinator container is healthy (geo-DB no longer FATAL), wt0 returns, SSH is reachable over both wt0 and the WAN break-glass.
  • No sshd ListenAddress change (base__ssh_listen_mesh_only stays false) — this is what sidesteps the boot-race that sank the 2026-06-17 attempt.

Design — mirror ubongo 2/3, with the coordinator-host exception

The host firewall does the SSH scoping; sshd is left listening on all interfaces. This is the ubongo 2/3 pattern, which is proven and reboot-validated.

  1. base firewall, INPUT-only default-deny (base__firewall_apply: true, base__firewall_input_only: true): the input chain defaults to drop; the forward chain stays policy accept so Docker container forwarding/NAT and published-port DNAT keep working across a reboot. Allowed ingress:
    • :22/tcp via iifname "wt0" (the interface-name match that survives wt0 being absent at boot — base__firewall_mgmt_interface: wt0);
    • :22/tcp from 91.226.145.80 (ubongo's WAN — the break-glass; via base__firewall_admin_addrs);
    • the public service surface from the catalog: 80,443/tcp + 3478/udp (WAN).
  2. No sshd change. base__ssh_listen_mesh_only stays false; sshd keeps listening on all interfaces. The firewall, not sshd, restricts where :22 is reachable. There is no ListenAddress, hence no ip_nonlocal_bind, hence no boot-race.
  3. The Hetzner Cloud Firewall is unchanged — the :22-from-ubongo rule stays (the 2026-06-17 attempt removed it; this redesign keeps it as the perimeter break-glass).
  4. Coordinator geo-DB robustness — make the netbird_coordinator control plane survive a transient egress loss (the nat-flush window on apply, and the boot window before Docker re-adds its NAT), so the coordinator stays healthy and wt0 can come back. One of:
    • pre-seed the GeoLite2 DB into the persistent netbird_data:/var/lib/netbird volume so netbird-server finds it locally and never needs to download; or
    • disable / make non-fatal the geolocation requirement in config.yaml.j2. The exact v0.72.4 mechanism is verified against NetBird's pinned docs at plan time (ADR-014) — the design fixes the intent (a transient egress blip must not FATAL the control plane); the plan fixes the knob.
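The ingress policy in item 1 can be pictured as the ruleset below, wrapped in an Ansible task so that it can be syntax-checked without being loaded. This is a sketch under stated assumptions, not the output of the base role: the table name (inet filter), the scratch path, the loopback and conntrack rules, and the rule order are all illustrative. Only the wt0 match, the break-glass source address, the public ports, and the two chain policies come from this spec.

```yaml
# Sketch only: the base role renders the real ruleset. Table name, path,
# and the lo/conntrack rules are assumptions made for illustration.
- name: Stage an illustrative INPUT-only default-deny ruleset
  become: true
  ansible.builtin.copy:
    dest: /tmp/askari-input-only.nft   # hypothetical scratch path
    mode: "0600"
    validate: nft -c -f %s             # check mode: nothing is loaded
    content: |
      table inet filter {
        chain input {
          type filter hook input priority filter; policy drop;
          ct state established,related accept
          iifname "lo" accept
          # Mesh management: a name match, so the rule loads even if wt0
          # does not exist yet at boot
          iifname "wt0" tcp dport 22 accept
          # Break-glass: ubongo's static WAN IP
          ip saddr 91.226.145.80 tcp dport 22 accept
          # Public service surface from the firewall catalog
          tcp dport { 80, 443 } accept
          udp dport 3478 accept
        }
        chain forward {
          type filter hook forward priority filter; policy accept;
          # Left at accept so Docker forwarding/NAT survives a reboot
        }
      }
```

The point of the iifname match is that it compares the interface name as a string at packet time, so the ruleset does not depend on wt0 existing when nftables starts.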

Rejected alternatives (these are the 2026-06-17 failures)

  • sshd ListenAddress = wt0 IP + ip_nonlocal_bind → boot-race; did not bind. Out.
  • forward policy drop on a Docker host → broke forwarding on reboot. Out (use input_only; the docker_host container-forward drop-in is a later tightening).
  • Close WAN :22 entirely → coordinator host left console-only on a bad reboot. Out (keep WAN :22-from-ubongo as the always-available non-mesh path).

How each 2026-06-17 failure is answered

| 2026-06-17 failure | Fix in this design |
| --- | --- |
| forward drop killed Docker on reboot | base__firewall_input_only: true → forward stays accept |
| ip_nonlocal_bind sshd boot-race | no sshd ListenAddress change; firewall scopes :22 by iifname "wt0" |
| coordinator chicken-egg / lockout | permanent WAN :22-from-ubongo + Hetzner console; management never depends on a service askari hosts |
| coordinator geo-DB FATAL-loop | pre-seed / non-fatal geo so a transient egress blip can't crash the control plane |

New & changed code

Inventory:

  • inventories/production/group_vars/offsite_hosts/vars.yml
    • base__firewall_apply: true (was false);
    • base__firewall_input_only: true (new — forward stays accept, Docker-safe);
    • base__firewall_admin_addrs: ["91.226.145.80"] (new — ubongo's WAN, the break-glass; comment states what it is and why a coordinator host keeps a non-mesh path);
    • base__ssh_listen_mesh_only: false stays (explicit — no boot-race);
    • rewrite the header backout note → "redesigned 2026-06-19: wt0-primary + permanent WAN break-glass; see this spec."
  • inventories/production/host_vars/askari.yml (new) — ansible_host: 100.99.226.39 (the wt0 IP), so Ansible manages askari over the mesh. Overrides the TF-generated WAN ansible_host in offsite.yml (host_vars are not regenerated by tf_to_inventory.py). Header comment explains why.
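The inventory changes listed above could look roughly like this. The values come from this spec; the comments are illustrative, and the two files are shown as one multi-document YAML stream purely for compactness. base__firewall_mgmt_interface is included for completeness and may already be a role default (an assumption).

```yaml
# inventories/production/group_vars/offsite_hosts/vars.yml (sketch)
# Redesigned 2026-06-19: wt0-primary + permanent WAN break-glass.
base__firewall_apply: true           # was false
base__firewall_input_only: true      # forward chain stays policy accept (Docker-safe)
base__firewall_mgmt_interface: wt0   # :22 accepted via iifname "wt0"
# ubongo's static WAN IP, the non-mesh break-glass. A coordinator host keeps
# this path because its mesh depends on a service it hosts itself.
base__firewall_admin_addrs:
  - "91.226.145.80"
base__ssh_listen_mesh_only: false    # explicit: no ListenAddress, no boot-race
---
# inventories/production/host_vars/askari.yml (sketch)
# Manage askari over the mesh. Overrides the TF-generated WAN ansible_host
# in offsite.yml; host_vars are not regenerated by tf_to_inventory.py.
ansible_host: 100.99.226.39
```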

Role netbird_coordinator:

  • The geo-DB robustness change above (templates/config.yaml.j2 and/or a pre-seed task; the volume in templates/docker-compose.yml.j2 already persists /var/lib/netbird), with Molecule/verify coverage that the control plane comes up without external geo egress.

Firewall catalog (inventories/production/group_vars/all/firewall.yml):

  • No change. It already enumerates askari's public ingress (reverse_proxy 80/443, netbird_stun 3478/udp). :22 is handled by the base firewall's built-in SSH rules (mgmt_interface wt0 + admin_addrs), not the catalog.

Terraform / Hetzner Cloud Firewall:

  • No change. The WAN :22-from-ubongo rule stays (the perimeter half of the break-glass).

sshd:

  • No change.

Validation

Harness-first GREEN gate (ADR-025) — before any live change

A "be askari" integration profile (Docker host + a coordinator-like container on the shared network + base__firewall_input_only + admin_addrs), driven through make test-integration HOST=askari (reusing the existing profile/overlay/verify pattern):

  • input chain default-deny with :22 accepted via iifname "wt0" and from the break-glass admin address; forward chain policy accept;
  • published-port DNAT + NAT masquerade survive a reboot (the RED→GREEN reboot cycle);
  • the coordinator-like container comes up healthy with no external geo egress;
  • SSH path returns after reboot.
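The first harness check above could be expressed as verify tasks along these lines. This is a minimal sketch: the table and chain names (inet filter, input, forward) are assumptions about how the base role names them, and the string matches rely on the text that nft list prints.

```yaml
# Hypothetical verify tasks; table and chain names are assumptions.
- name: Read the live input chain
  become: true
  ansible.builtin.command: nft list chain inet filter input
  register: input_chain
  changed_when: false

- name: Read the live forward chain
  become: true
  ansible.builtin.command: nft list chain inet filter forward
  register: forward_chain
  changed_when: false

- name: Assert INPUT-only default-deny with both SSH paths allowed
  ansible.builtin.assert:
    that:
      - "'policy drop' in input_chain.stdout"
      - "'iifname \"wt0\"' in input_chain.stdout"
      - "'91.226.145.80' in input_chain.stdout"
      - "'policy accept' in forward_chain.stdout"
```

Running the same tasks before and after the harness reboot is one way to make the RED→GREEN cycle explicit.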

This must be GREEN before the live cutover.

Live cutover — supervised, console open, break-glass never removed

Sequencing rule (FRICTION #6): validate reboot-recovery while a fallback path is still open. Because the WAN break-glass is never removed in this design, that invariant holds by construction.

  1. Pre-check: ssh sjat@100.99.226.39 (over wt0) and ansible askari -m ping (forced over wt0) both succeed; public services + STUN healthy.
  2. Repoint Ansible: add host_vars/askari.yml (ansible_host = wt0 IP); confirm ansible askari -m ping runs over the mesh.
  3. Apply base (+ the geo-DB fix): a single make deploy PLAYBOOK=site LIMIT=askari converge applies INPUT-only default-deny with the wt0 + admin-addr SSH allow and the coordinator robustness change. The firewall concern's armed auto-rollback (base__firewall_rollback_timeout: 45) reverts a bad ruleset. A post-apply Docker restart then rebuilds NAT, because base's flush ruleset wipes Docker's nat table (FRICTION); the coordinator now survives the egress window thanks to the geo-DB fix.
  4. Verify the new steady state: public services serve valid certs; STUN answers; SSH over wt0 works; SSH over the WAN break-glass (askari's WAN :22, from ubongo at 91.226.145.80) works.
  5. Reboot resilience (the real test): reboot askari (Hetzner console available) and confirm — with no intervention — Docker forwarding/NAT, public services, the coordinator, wt0, and SSH (both paths) all return.
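Steps 4 and 5 could be checked with a small play run from ubongo, since the break-glass only admits ubongo's WAN IP. This is a hedged sketch: the play and variable names are hypothetical, it assumes test.askari.wingu.me resolves to askari's WAN address and that both URLs answer with HTTP 200, and it does not cover the STUN 3478/udp check.

```yaml
# Hypothetical verification play, run on ubongo. Names, timeouts, and the
# WAN hostname are assumptions, not part of this spec.
- name: Verify askari steady state after reboot
  hosts: localhost
  gather_facts: false
  vars:
    askari_wt0_ip: 100.99.226.39
    askari_wan_host: test.askari.wingu.me   # assumed to resolve to askari's WAN IP
    public_urls:
      - https://test.askari.wingu.me
      - https://netbird.askari.wingu.me
  tasks:
    - name: SSH answers on both the mesh path and the WAN break-glass
      ansible.builtin.wait_for:
        host: "{{ item }}"
        port: 22
        search_regex: OpenSSH
        timeout: 300
      loop:
        - "{{ askari_wt0_ip }}"
        - "{{ askari_wan_host }}"

    - name: Public services answer over TLS with a valid certificate
      ansible.builtin.uri:
        url: "{{ item }}"
        validate_certs: true   # an invalid or expired cert fails the task
      loop: "{{ public_urls }}"
```

Waiting on the SSH banner rather than only the open port confirms that sshd itself is answering on each path, not just that the firewall accepts the connection.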

Risks & rollback

  • ubongo's WAN IP anchors the break-glass. If it is dynamic and rotates, the host admin_addrs rule and the Hetzner FW rule must be updated. The Hetzner console is the IP-independent ultimate break-glass. (Confirmed static by the operator 2026-06-19; it is also already the Hetzner FW assumption today.)
  • Mid-cutover lockout: mitigated by the staged order (a path open at each step), the firewall auto-rollback timer, ansible_host = wt0 (the confirm tests the real new path), and the WAN break-glass that is never removed.
  • Reboot lockout: mitigated by iifname "wt0" scoping (no sshd boot-race), the WAN break-glass, the geo-DB fix (coordinator survives the egress window), and harness GREEN.
  • Default-deny breaks a public service: mitigated by the catalog already enumerating all live ingress and the §Validation service checks; reversible via base__firewall_apply: false.
  • Ultimate break-glass: the Hetzner web console (out-of-band).

Out of scope / follow-ons

  • SPOF reduction (the next sub-project) — reduce askari's single-point-of-failure role (currently ubongo → askari is Relayed through askari's own relay; if askari is down the mesh data plane for relayed peers is down). Its own spec, after this.
  • NetBird ACL off Allow-All — until then any enrolled peer can reach askari's wt0:22; scoping that is a separate sub-project.
  • Full forward-chain hardening — the docker_host container-forward drop-in (full forward default-deny, reboot-safe) as a later tightening over the input_only baseline.
  • Coordinator off-site backup (FRICTION #5, ADR-022) — still pending; noted, not in scope.
  • STATUS.md / ROADMAP updates land with the implementation, not this spec.