docs(friction): capture 6 signals from the mesh-hardening 1/3 incident

firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race,
coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no
off-site coordinator backup, and reboot-tested-after-removing-break-glass.
For the next /kaizen + the mesh-hardening re-spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-17 22:21:19 +02:00
parent 847d9885e2
commit 958e35e3c3


@@ -22,6 +22,74 @@ earning its keep.
_(append new raw signals here; the next kaizen review consumes them)_
<!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's
nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs
the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console +
a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*. -->
- `[gotcha]` **`base`'s nftables `forward policy drop` breaks Docker hosts on reboot**
(2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`.
On a Docker host, container traffic is *forwarded* (published-port DNAT → container, and
inter-container over the bridge), so the drop kills it. It worked right after `make
deploy` (Docker's runtime rules were still loaded alongside ours), but after a reboot
nftables loaded our default-deny *before* Docker started, breaking WAN→Caddy and
Caddy→coordinator, so the public services and the mesh went down. The `docker_host`
"`nftables.d` container-forward rules"
that would make this Docker-safe are explicitly **pending** (STATUS.md). → the `base`
firewall (`base__firewall_apply`) must NOT be applied to any Docker host until
`docker_host` ships the container-forward rules; add a guard/check (a Docker host with
`firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the
firewall design (ADR-020) should state the Docker-host dependency explicitly.
- `[gotcha]` **`ip_nonlocal_bind` did NOT beat the sshd boot-race** (2026-06-17): the
mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set
`net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at
boot. In practice the console still showed sshd *"could not assign the address"* at boot,
so the protection did not work as designed. Because `wt0` never came up (the
coordinator was down), sshd had no listener at all and there was no SSH path. → the entire
"sshd listens on `wt0` only" premise is unsound without (a) a *verified* boot-race fix
and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't
help (did sshd start before the sysctl drop-in was loaded? was the sysctl applied at all
before sshd started?), or drop `ListenAddress`-on-mesh entirely and rely on the host
firewall for SSH scoping.
- `[gotcha]` **The coordinator host can't bootstrap the mesh it depends on** (2026-06-17):
`askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird
agent needs the coordinator (a local container) to be serving to bring up `wt0` — but
the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined with sshd
being `wt0`-only, the host was reachable only via the Hetzner console. → the
coordinator host must always keep a **non-mesh management path** (don't move its SSH onto
`wt0`), or the mesh-hardening plan must treat the coordinator host as a special case. General
rule: never make a host's only management path depend on a service that host itself
hosts.
- `[gotcha]` **NetBird `netbird-server` FATAL-loops on the geolocation DB download with no
egress** (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download
the GeoLite2 DB from `pkgs.netbird.io` and treats failure as **FATAL** (crash-loop) — so
any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was
flushed, not re-added by a plain `restart docker`) takes the whole control plane down.
Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could
download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data
dir (or pin a local copy), or disable the geolocation requirement, so a transient egress
blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT)
is fragile across `nft flush` + reboot.
- `[friction]` **No off-site coordinator backup turned a 2-minute restore into a long live
recovery** (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`,
encrypted SQLite) has **no off-site backup yet** (ADR-022 `backup` role pending,
flagged in STATUS.md as the coordinator's deferred backup). During the incident there was a
real fear the unclean reboots had corrupted the store, with no restore path. It turned
out to be a runtime/egress issue, not corruption — but the absence of a backup made the
whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the
`netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy
would have made "rebuild askari from scratch" a safe option.
- `[friction]` **The plan tested reboot-recovery AFTER removing the break-glass**
(2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5)
*before* the reboot-resilience test (step 7), so the one fallback path was gone exactly
when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for
lockout-risky cutovers: **validate reboot-recovery while the old access path is still
open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
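
Two of the signals above are mechanical enough to check automatically: the "Docker host with `firewall_apply: true` and no container-forward drop-in" guard, and the "don't retire the break-glass before the reboot test" sequencing rule. What follows is a minimal Python sketch of both, not an implementation from this repo. The `HostVars` fields, the host name `other`, and the step labels `close-wan-22` and `reboot-test` are all hypothetical; the real inventory variables and plan format may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostVars:
    """Hypothetical per-host facts; field names are illustrative only."""
    name: str
    is_docker_host: bool
    firewall_apply: bool
    has_container_forward_dropin: bool


def firewall_misconfigurations(hosts):
    """Return names of Docker hosts that would receive the default-deny
    firewall without container-forward rules in place."""
    return [
        h.name
        for h in hosts
        if h.is_docker_host
        and h.firewall_apply
        and not h.has_container_forward_dropin
    ]


def breakglass_retired_too_early(steps, retire="close-wan-22", proof="reboot-test"):
    """Return True if the plan retires the break-glass before reboot
    recovery is proven. `steps` is an ordered list of step labels."""
    if retire not in steps:
        return False  # the break-glass is never retired, so nothing to flag
    if proof not in steps:
        return True   # retired without any reboot test at all
    return steps.index(retire) < steps.index(proof)


if __name__ == "__main__":
    hosts = [
        HostVars("askari", True, True, False),  # the shape that caused the incident
        HostVars("other", False, True, False),  # hypothetical non-Docker host
    ]
    print(firewall_misconfigurations(hosts))
    print(breakglass_retired_too_early(["close-wan-22", "reboot-test"]))
```

A check like this could run before a deploy or as part of plan review; failing loudly on a non-empty result is the point, since both conditions were only discovered after the reboot.
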
---
## Kaizen reviews — decisions ledger