docs(friction): capture 6 signals from the mesh-hardening 1/3 incident

firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race, coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no off-site coordinator backup, and reboot-tested-after-removing-break-glass. For the next /kaizen + the mesh-hardening re-spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:21:19 +02:00
commit 958e35e3c3
parent 847d9885e2
1 changed file with 68 additions and 0 deletions
--- a/docs/FRICTION.md
+++ b/docs/FRICTION.md
@@ -22,6 +22,74 @@ earning its keep.

 _(append new raw signals here; the next kaizen review consumes them)_

+<!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's
+nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs
+the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console +
+a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*. -->
+
+- `[gotcha]` **`base`'s nftables `forward policy drop` breaks Docker hosts on reboot**
+  (2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`.
+  On a Docker host, container traffic is *forwarded* (published-port DNAT → container, and
+  inter-container over the bridge), so the drop kills it. It worked right after
+  `make deploy` (Docker's runtime rules coexisted), but after a reboot nftables loaded our
+  default-deny *before* Docker, breaking WAN→Caddy and Caddy→coordinator, so the public
+  services and the mesh went down. The `docker_host` "`nftables.d` container-forward rules"
+  that would make this Docker-safe are explicitly **pending** (STATUS.md). → the `base`
+  firewall (`base__firewall_apply`) must NOT be applied to any Docker host until
+  `docker_host` ships the container-forward rules; add a guard/check (a Docker host with
+  `firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the
+  firewall design (ADR-020) should state the Docker-host dependency explicitly.
+
+- `[gotcha]` **`ip_nonlocal_bind` did NOT beat the sshd boot-race** (2026-06-17): the
+  mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set
+  `net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at
+  boot. In practice the console still showed sshd *"could not assign the address"* at boot
+  — so the protection did not work as designed, and because `wt0` never came up (the
+  coordinator was down), sshd had no listener at all → no SSH path. → the entire
+  "sshd listens on `wt0` only" premise is unsound without (a) a *verified* boot-race fix
+  and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't
+  help (unit ordering vs the sysctl drop-in load? the sysctl not applied before sshd
+  started?), or drop `ListenAddress`-on-mesh entirely and rely on the host firewall for SSH scoping.
+
+- `[gotcha]` **The coordinator host can't bootstrap the mesh it depends on** (2026-06-17):
+  `askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird
+  agent needs the coordinator (a local container) to be serving to bring up `wt0` — but
+  the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined with sshd
+  being `wt0`-only, the host was reachable only via the Hetzner console. → the
+  coordinator host must always keep a **non-mesh management path** (don't move its SSH onto
+  `wt0`), or the mesh-hardening plan must treat the coordinator host as a special case. General
+  rule: never make a host's only management path depend on a service that host itself
+  hosts.
+
+- `[gotcha]` **NetBird `netbird-server` FATAL-loops on the geolocation DB download with no
+  egress** (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download
+  the GeoLite2 DB from `pkgs.netbird.io` and treats failure as **FATAL** (crash-loop) — so
+  any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was
+  flushed, not re-added by a plain `restart docker`) takes the whole control plane down.
+  Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could
+  download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data
+  dir (or pin a local copy), or disable the geolocation requirement, so a transient egress
+  blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT)
+  is fragile across `nft flush` + reboot.
+
+- `[friction]` **No off-site coordinator backup turned a 2-minute restore into a long live
+  recovery** (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`,
+  encrypted SQLite) has **no off-site backup yet** (ADR-022 `backup` role pending,
+  flagged in STATUS.md as the coordinator's deferred backup). During the incident there was a
+  real fear the unclean reboots had corrupted the store, with no restore path. It turned
+  out to be a runtime/egress issue, not corruption — but the absence of a backup made the
+  whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the
+  `netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy
+  would have made "rebuild askari from scratch" a safe option.
+
+- `[friction]` **The plan tested reboot-recovery AFTER removing the break-glass**
+  (2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5)
+  *before* the reboot-resilience test (step 7), so the one fallback path was gone exactly
+  when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for
+  lockout-risky cutovers: **validate reboot-recovery while the old access path is still
+  open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
+  Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
+
 ---

 ## Kaizen reviews — decisions ledger