Redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Mirrors the proven ubongo 2/3 pattern (INPUT-only default-deny, SSH scoped by iifname wt0, no sshd ListenAddress change -> no boot-race) and adds the coordinator-host exception the incident demanded: a permanent non-mesh break-glass (WAN :22 from ubongo's static WAN IP + the Hetzner console), WAN :22 deliberately left open. Folds in the netbird_coordinator geo-DB robustness fix (FRICTION #4) so a transient egress blip can't FATAL the control plane. Harness-GREEN gate before a supervised live cutover. Operator decision (2026-06-19): do this redesign first, then a separate sub-project to reduce askari's SPOF role. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spec — Mesh-hardening redesign: askari SSH wt0-primary + permanent WAN break-glass
Status: Accepted (2026-06-19)
Context & scope
The mesh-hardening follow-on (deferred from M5) was decomposed into three independent sub-projects, each with its own spec → plan → implementation cycle. Progress so far:
1. askari SSH → wt0: attempted 2026-06-17, BACKED OUT after it took askari down on reboot (spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*).
2. ubongo nftables INPUT-only default-deny: DONE 2026-06-19, reboot-validated (base__firewall_input_only).
3. NetBird ACL off Allow-All → scoped policies: not started.
This spec is the redesign of (1), and covers only that. The operator sequencing decision (2026-06-19) is: do this redesign first, then a separate sub-project to reduce askari's single-point-of-failure (SPOF) role. The SPOF reduction is the named follow-on (its own later spec).
Why the 2026-06-17 attempt was backed out
Four hazards, recorded in docs/FRICTION.md (the six 2026-06-17 signals):
1. base's forward policy drop breaks Docker hosts on reboot: nftables loaded default-deny before Docker, so container forwarding/NAT (WAN→Caddy, Caddy→coordinator) died after reboot.
2. ip_nonlocal_bind did NOT beat the sshd boot-race: binding sshd ListenAddress to the wt0 IP still failed at boot ("could not assign the address"); and because wt0 never came up, sshd had no listener at all.
3. The coordinator host can't bootstrap the mesh it depends on: askari runs the NetBird coordinator and is a mesh peer; its agent needs the local coordinator container healthy to bring up wt0. After an unclean reboot the coordinator was down → wt0 never came up → with SSH wt0-only, the host was reachable only via the Hetzner console. General rule: never make a host's only management path depend on a service that host itself hosts.
4. The coordinator FATAL-loops on the geolocation-DB download with no egress: a transient loss of container egress (here: NAT wiped by nft flush) crash-loops the whole control plane.
What changed since 2026-06-17 (enablers this redesign relies on)
- docker_host container-forward nftables drop-in (172ae37): reboot-safe Docker forwarding (available as a later tightening; not required by this pass).
- base__firewall_input_only: input-only default-deny, forward chain stays policy accept (Docker-safe). Proven on ubongo and reboot-validated 2026-06-19.
- The ADR-025 integration harness: reproduces a host's boot on a throwaway local VM, so reboot-safety is proven GREEN before the real host is touched.
Goal / success criteria
- askari's host nftables firewall is applied at last (base__firewall_apply: true), INPUT-only default-deny, matching ubongo.
- Normal management is over the mesh: ansible_host resolves to askari's wt0 IP (100.99.226.39); SSH-over-wt0 and ansible askari -m ping over the mesh both succeed.
- A permanent non-mesh break-glass survives a mesh/coordinator outage, via two independent channels:
  - the Hetzner web console (out-of-band; always works, IP-independent); and
  - WAN :22 reachable only from ubongo's WAN IP (91.226.145.80), enforced at both the host nftables layer (base__firewall_admin_addrs) and the Hetzner Cloud Firewall. WAN :22 is deliberately NOT closed: this is the coordinator-host exception (FRICTION #3).
- askari survives a reboot under the new firewall, unattended: Docker forwarding/NAT intact, https://test.askari.wingu.me + https://netbird.askari.wingu.me serve valid certs, STUN 3478/udp answers, the coordinator container is healthy (geo-DB no longer FATAL), wt0 returns, SSH is reachable over both wt0 and the WAN break-glass.
- No sshd ListenAddress change (base__ssh_listen_mesh_only stays false): this is what sidesteps the boot-race that sank the 2026-06-17 attempt.
Design — mirror ubongo 2/3, with the coordinator-host exception
The host firewall does the SSH scoping; sshd is left listening on all interfaces. This is the ubongo 2/3 pattern, which is proven and reboot-validated.
- base firewall, INPUT-only default-deny (base__firewall_apply: true, base__firewall_input_only: true): the input chain defaults to drop; the forward chain stays policy accept so Docker container forwarding/NAT and published-port DNAT keep working across a reboot. Allowed ingress:
  - :22/tcp via iifname "wt0" (the interface-name match that survives wt0 being absent at boot; base__firewall_mgmt_interface: wt0);
  - :22/tcp from 91.226.145.80 (ubongo's WAN, the break-glass; via base__firewall_admin_addrs);
  - the public service surface from the catalog: 80,443/tcp + 3478/udp (WAN).
- No sshd change. base__ssh_listen_mesh_only stays false; sshd keeps listening on all interfaces. The firewall, not sshd, restricts where :22 is reachable. There is no ListenAddress, hence no ip_nonlocal_bind, hence no boot-race.
- The Hetzner Cloud Firewall is unchanged: the :22-from-ubongo rule stays (the 2026-06-17 attempt removed it; this redesign keeps it as the perimeter break-glass).
- Coordinator geo-DB robustness: make the netbird_coordinator control plane survive a transient egress loss (the nat-flush window on apply, and the boot window before Docker re-adds its NAT), so the coordinator stays healthy and wt0 can come back. One of:
  - pre-seed the GeoLite2 DB into the persistent netbird_data:/var/lib/netbird volume so netbird-server finds it locally and never needs to download; or
  - disable / make non-fatal the geolocation requirement in config.yaml.j2.

  The exact v0.72.4 mechanism is verified against NetBird's pinned docs at plan time (ADR-014): the design fixes the intent (a transient egress blip must not FATAL the control plane); the plan fixes the knob.
Rejected alternatives (these are the 2026-06-17 failures)
- sshd ListenAddress = wt0 IP + ip_nonlocal_bind → boot-race; did not bind. Out.
- forward policy drop on a Docker host → broke forwarding on reboot. Out (use input_only; the docker_host container-forward drop-in is a later tightening).
- Close WAN :22 entirely → coordinator host left console-only on a bad reboot. Out (keep WAN :22-from-ubongo as the always-available non-mesh path).
How each 2026-06-17 failure is answered
| 2026-06-17 failure | Fix in this design |
|---|---|
| forward drop killed Docker on reboot | base__firewall_input_only: true → forward stays accept |
| ip_nonlocal_bind sshd boot-race | no sshd ListenAddress change; firewall scopes :22 by iifname "wt0" |
| coordinator chicken-egg / lockout | permanent WAN :22-from-ubongo + Hetzner console; management never depends on a service askari hosts |
| coordinator geo-DB FATAL-loop | pre-seed / non-fatal geo so a transient egress blip can't crash the control plane |
New & changed code
Inventory:
- inventories/production/group_vars/offsite_hosts/vars.yml:
  - base__firewall_apply: true (was false);
  - base__firewall_input_only: true (new; forward stays accept, Docker-safe);
  - base__firewall_admin_addrs: ["91.226.145.80"] (new; ubongo's WAN, the break-glass; comment states what it is and why a coordinator host keeps a non-mesh path);
  - base__ssh_listen_mesh_only: false stays (explicit; no boot-race);
  - rewrite the header backout note → "redesigned 2026-06-19: wt0-primary + permanent WAN break-glass; see this spec."
- inventories/production/host_vars/askari.yml (new): ansible_host: 100.99.226.39 (the wt0 IP), so Ansible manages askari over the mesh. Overrides the TF-generated WAN ansible_host in offsite.yml (host_vars are not regenerated by tf_to_inventory.py). Header comment explains why.
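As a rough sketch of what the two inventory files might contain after this change: the variable names and values are taken from this spec, while the comment wording and the use of a `---` separator to show two separate files in one block are illustrative assumptions only.

```yaml
# File 1: inventories/production/group_vars/offsite_hosts/vars.yml
# redesigned 2026-06-19: wt0-primary + permanent WAN break-glass; see this spec.
base__firewall_apply: true           # was false
base__firewall_input_only: true      # input default-deny; forward stays policy accept (Docker-safe)

# ubongo's static WAN IP: the non-mesh break-glass for :22.
# askari hosts the NetBird coordinator, so its management path must never
# depend solely on the mesh it bootstraps.
base__firewall_admin_addrs:
  - "91.226.145.80"

base__ssh_listen_mesh_only: false    # explicit: no ListenAddress change, no boot-race
---
# File 2: inventories/production/host_vars/askari.yml (new)
# Manage askari over the mesh. This overrides the TF-generated WAN
# ansible_host in offsite.yml; host_vars are not regenerated by
# tf_to_inventory.py, so the override persists.
ansible_host: 100.99.226.39          # askari's wt0 IP
```

If base__firewall_mgmt_interface is not already defaulted to wt0 by the base role, it would need to be set alongside these; this spec does not say which is the case.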
Role netbird_coordinator:
- The geo-DB robustness change above (templates/config.yaml.j2 and/or a pre-seed task; the templates/docker-compose.yml.j2 volume already persists /var/lib/netbird), with Molecule/verify coverage that the control plane comes up without external geo egress.
Firewall catalog (inventories/production/group_vars/all/firewall.yml):
- No change. It already enumerates askari's public ingress (reverse_proxy 80/443, netbird_stun 3478/udp). :22 is handled by the base firewall's built-in SSH rules (mgmt_interface wt0 + admin_addrs), not the catalog.
Terraform / Hetzner Cloud Firewall:
- No change. The WAN :22-from-ubongo rule stays (the perimeter half of the break-glass).
sshd:
- No change.
Validation
Harness-first GREEN gate (ADR-025) — before any live change
A "be askari" integration profile (Docker host + a coordinator-like container on the shared network + base__firewall_input_only + admin_addrs), driven through make test-integration HOST=askari (reusing the existing profile/overlay/verify pattern):
- input chain default-deny with :22 accepted via iifname "wt0" and from the break-glass admin address; forward chain policy accept;
- published-port DNAT + NAT masquerade survive a reboot (the RED→GREEN reboot cycle);
- the coordinator-like container comes up healthy with no external geo egress;
- SSH path returns after reboot.
This must be GREEN before the live cutover.
Live cutover — supervised, console open, break-glass never removed
Sequencing rule (FRICTION #6): validate reboot-recovery while a fallback path is still open. Because the WAN break-glass is never removed in this design, that invariant holds by construction.
1. Pre-check: ssh sjat@100.99.226.39 (over wt0) and ansible askari -m ping (forced over wt0) both succeed; public services + STUN healthy.
2. Repoint Ansible: add host_vars/askari.yml (ansible_host = wt0 IP); confirm ansible askari -m ping runs over the mesh.
3. Apply base (+ the geo-DB fix): one make deploy PLAYBOOK=site LIMIT=askari converge applies INPUT-only default-deny with the wt0 + admin-addr SSH allow and the coordinator robustness change. The firewall concern's armed auto-rollback (base__firewall_rollback_timeout: 45) reverts a bad ruleset. Then a post-apply restart docker rebuilds NAT (base's flush ruleset wipes Docker's nat; see FRICTION); the coordinator now survives the egress window thanks to the geo-DB fix.
4. Verify the new steady state: public services serve valid certs; STUN answers; SSH over wt0 works; SSH over the WAN break-glass (91.226.145.80 → :22) works.
5. Reboot resilience (the real test): reboot askari (Hetzner console available) and confirm, with no intervention, that Docker forwarding/NAT, public services, the coordinator, wt0, and SSH (both paths) all return.
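Part of the steady-state and post-reboot verification could be scripted along these lines. This is a minimal sketch, not the project's verify tooling: the playbook name verify-askari.yml and the askari_wan_ip variable are hypothetical (askari's WAN address is not stated in this spec), and it assumes the play is run on ubongo, the only source address allowed to reach WAN :22.

```yaml
# verify-askari.yml (hypothetical name)
# Run from ubongo: ansible-playbook verify-askari.yml -e askari_wan_ip=<askari WAN IP>
- name: Verify askari SSH paths and public TLS endpoints
  hosts: localhost
  gather_facts: false
  vars:
    askari_mesh_ip: 100.99.226.39    # wt0 IP, from this spec
  tasks:
    - name: Require the WAN address to be supplied (not stated in this spec)
      ansible.builtin.assert:
        that:
          - askari_wan_ip is defined

    - name: SSH answers over wt0
      ansible.builtin.wait_for:
        host: "{{ askari_mesh_ip }}"
        port: 22
        search_regex: OpenSSH        # wait for the sshd banner, not just an open port
        timeout: 60

    - name: SSH answers over the WAN break-glass
      ansible.builtin.wait_for:
        host: "{{ askari_wan_ip }}"
        port: 22
        search_regex: OpenSSH
        timeout: 60

    - name: Public services complete a validated TLS handshake
      ansible.builtin.uri:
        url: "{{ item }}"
        validate_certs: true
        # Assumption: any of these statuses is fine here; the point of the
        # check is the certificate, not the page content.
        status_code: [200, 401, 403, 404]
      loop:
        - https://test.askari.wingu.me
        - https://netbird.askari.wingu.me
```

The sketch does not cover STUN on 3478/udp, coordinator container health, or Docker NAT state; those still need their own checks.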
Risks & rollback
- ubongo's WAN IP anchors the break-glass. If it is dynamic and rotates, the host admin_addrs rule and the Hetzner FW rule must be updated. The Hetzner console is the IP-independent ultimate break-glass. (Confirmed static by the operator 2026-06-19; it is also already the Hetzner FW assumption today.)
- Mid-cutover lockout: mitigated by the staged order (a path open at each step), the firewall auto-rollback timer, ansible_host = wt0 (the confirm tests the real new path), and the WAN break-glass that is never removed.
- Reboot lockout: mitigated by iifname "wt0" scoping (no sshd boot-race), the WAN break-glass, the geo-DB fix (coordinator survives the egress window), and harness GREEN.
- Default-deny breaks a public service: mitigated by the catalog already enumerating all live ingress and the §Validation service checks; reversible via base__firewall_apply: false.
- Ultimate break-glass: the Hetzner web console (out-of-band).
Out of scope / follow-ons
- SPOF reduction (the next sub-project): reduce askari's single-point-of-failure role (currently ubongo → askari is Relayed through askari's own relay; if askari is down the mesh data plane for relayed peers is down). Its own spec, after this.
- NetBird ACL off Allow-All: until then any enrolled peer can reach askari's wt0 :22; scoping that is a separate sub-project.
- Full forward-chain hardening: the docker_host container-forward drop-in (full forward default-deny, reboot-safe) as a later tightening over the input_only baseline.
- Coordinator off-site backup (FRICTION #5, ADR-022): still pending; noted, not in scope.
- STATUS.md / ROADMAP updates land with the implementation, not this spec.