From 958e35e3c3f98d1d51d8caf878e359cf0afdbf52 Mon Sep 17 00:00:00 2001
From: sjat
Date: Wed, 17 Jun 2026 22:21:19 +0200
Subject: [PATCH] docs(friction): capture 6 signals from the mesh-hardening 1/3 incident

firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race,
coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no
off-site coordinator backup, and reboot-tested-after-removing-break-glass.
For the next /kaizen + the mesh-hardening re-spec.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/FRICTION.md | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/docs/FRICTION.md b/docs/FRICTION.md
index d95be37..b447f9b 100644
--- a/docs/FRICTION.md
+++ b/docs/FRICTION.md
@@ -22,6 +22,74 @@ earning its keep.
 
 _(append new raw signals here; the next kaizen review consumes them)_
 
+
+
+- `[gotcha]` **`base`'s nftables `forward policy drop` breaks Docker hosts on reboot**
+  (2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`.
+  On a Docker host, container traffic is *forwarded* (published-port DNAT → container, and
+  inter-container over the bridge), so the drop kills it. It worked right after `make
+  deploy` (Docker's runtime rules coexisted) but after a reboot nftables loaded our
+  default-deny *before* Docker, breaking WAN→Caddy and Caddy→coordinator → the public
+  services and the mesh went down. The `docker_host` "`nftables.d` container-forward rules"
+  that would make this Docker-safe are explicitly **pending** (STATUS.md). → the `base`
+  firewall (`base__firewall_apply`) must NOT be applied to any Docker host until
+  `docker_host` ships the container-forward rules; add a guard/check (a Docker host with
+  `firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the
+  firewall design (ADR-020) should state the Docker-host dependency explicitly.
+
+- `[gotcha]` **`ip_nonlocal_bind` did NOT beat the sshd boot-race** (2026-06-17): the
+  mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set
+  `net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at
+  boot. In practice the console still showed sshd *"could not assign the address"* at boot
+  — so the protection did not work as designed, and because `wt0` never came up (the
+  coordinator was down), sshd had no listener at all → no SSH path. → the entire
+  "sshd listens on `wt0` only" premise is unsound without (a) a *verified* boot-race fix
+  and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't
+  help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?),
+  or drop ListenAddress-on-mesh entirely and rely on the host firewall for SSH scoping.
+
+- `[gotcha]` **The coordinator host can't bootstrap the mesh it depends on** (2026-06-17):
+  `askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird
+  agent needs the coordinator (a local container) to be serving to bring up `wt0` — but
+  the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined with sshd
+  being `wt0`-only, the host was reachable only via the Hetzner console. → the
+  coordinator host must keep a **non-mesh management path always** (don't move its SSH onto
+  `wt0`), or the mesh-hardening must treat the coordinator host as a special case. General
+  rule: never make a host's only management path depend on a service that host itself
+  hosts.
+
+- `[gotcha]` **NetBird `netbird-server` FATAL-loops on the geolocation DB download with no
+  egress** (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download
+  the GeoLite2 DB from `pkgs.netbird.io` and treats failure as **FATAL** (crash-loop) — so
+  any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was
+  flushed, not re-added by a plain `restart docker`) takes the whole control plane down.
+  Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could
+  download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data
+  dir (or pin a local copy), or disable the geolocation requirement, so a transient egress
+  blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT)
+  is fragile across `nft flush` + reboot.
+
+- `[friction]` **No off-site coordinator backup turned a 2-minute restore into a long live
+  recovery** (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`,
+  encrypted SQLite) has **no off-site backup yet** (ADR-022 `backup` role pending,
+  flagged in STATUS as the coordinator's deferred backup). During the incident there was a
+  real fear the unclean reboots had corrupted the store, with no restore path. It turned
+  out to be a runtime/egress issue, not corruption — but the absence of a backup made the
+  whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the
+  `netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy
+  would have made "rebuild askari from scratch" a safe option.
+
+- `[friction]` **The plan tested reboot-recovery AFTER removing the break-glass**
+  (2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5)
+  *before* the reboot-resilience test (step 7), so the one fallback path was gone exactly
+  when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for
+  lockout-risky cutovers: **validate reboot-recovery while the old access path is still
+  open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
+  Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
+
 ---
 
 ## Kaizen reviews — decisions ledger
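
Review note on the first `[gotcha]` entry: the proposed guard ("a Docker host with `firewall_apply: true` and no container-forward drop-in is a misconfiguration") could be sketched roughly as below. This is a minimal illustration only. The `HostVars` type, its field names, and the function name are hypothetical and not taken from the repo; a real check would more likely live in the role itself (for example as a pre-flight assertion) and read the actual inventory variables.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HostVars:
    """Hypothetical per-host summary; field names are illustrative, not the repo's."""
    name: str
    is_docker_host: bool                 # host runs Docker (assumed to be known per host)
    firewall_apply: bool                 # stands in for the `firewall_apply: true` setting
    has_container_forward_dropin: bool   # the still-pending container-forward rules exist


def firewall_misconfigurations(hosts: Iterable[HostVars]) -> List[str]:
    """Return names of Docker hosts that would get `forward policy drop`
    without container-forward rules, i.e. the failure mode from the incident."""
    return [
        h.name
        for h in hosts
        if h.is_docker_host
        and h.firewall_apply
        and not h.has_container_forward_dropin
    ]
```

A deploy wrapper could call `firewall_misconfigurations(...)` on the inventory and refuse to continue when the returned list is non-empty, so the misconfiguration is caught before the firewall is applied rather than at the next reboot.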