docs(friction): capture 6 signals from the mesh-hardening 1/3 incident
firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race, coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no off-site coordinator backup, and reboot-tested-after-removing-break-glass. For the next /kaizen + the mesh-hardening re-spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 847d9885e2
commit 958e35e3c3
1 changed file with 68 additions and 0 deletions
@@ -22,6 +22,74 @@ earning its keep.
_(append new raw signals here; the next kaizen review consumes them)_

<!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's
nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs
the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console +
a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*. -->

- `[gotcha]` **`base`'s nftables `forward policy drop` breaks Docker hosts on reboot**
  (2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`.
  On a Docker host, container traffic is *forwarded* (published-port DNAT → container, and
  inter-container over the bridge), so the drop kills it. It worked right after
  `make deploy` (Docker's runtime rules coexisted), but after a reboot nftables loaded our
  default-deny *before* Docker, breaking WAN→Caddy and Caddy→coordinator, so the public
  services and the mesh went down. The `docker_host` "`nftables.d` container-forward rules"
  that would make this Docker-safe are explicitly **pending** (STATUS.md). → the `base`
  firewall (`base__firewall_apply`) must NOT be applied to any Docker host until
  `docker_host` ships the container-forward rules; add a guard/check (a Docker host with
  `firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the
  firewall design (ADR-020) should state the Docker-host dependency explicitly.
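
The guard proposed in the signal above could look roughly like the following. This is a minimal sketch, not the role's actual check: `HostVars` and its field names are hypothetical stand-ins for however the inventory exposes "is a Docker host", `firewall_apply`, and the presence of a container-forward drop-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostVars:
    """Hypothetical, flattened view of one host's inventory variables."""
    name: str
    is_docker_host: bool
    firewall_apply: bool
    has_container_forward_dropin: bool


def firewall_misconfigurations(hosts: list[HostVars]) -> list[str]:
    """Return one message per Docker host that would receive the
    forward-drop firewall without any container-forward rules."""
    problems = []
    for host in hosts:
        unsafe = (
            host.is_docker_host
            and host.firewall_apply
            and not host.has_container_forward_dropin
        )
        if unsafe:
            problems.append(
                f"{host.name}: firewall_apply is true on a Docker host "
                "with no container-forward drop-in"
            )
    return problems
```

Run as a pre-deploy check, a non-empty result would fail the run before the firewall template is ever rendered on the host.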

- `[gotcha]` **`ip_nonlocal_bind` did NOT beat the sshd boot-race** (2026-06-17): the
  mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set
  `net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at
  boot. In practice the console still showed sshd *"could not assign the address"* at
  boot, so the protection did not work as designed; and because `wt0` never came up (the
  coordinator was down), sshd had no listener at all → no SSH path. → the entire
  "sshd listens on `wt0` only" premise is unsound without (a) a *verified* boot-race fix
  and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't
  help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?),
  or drop `ListenAddress`-on-mesh entirely and rely on the host firewall for SSH scoping.

- `[gotcha]` **The coordinator host can't bootstrap the mesh it depends on** (2026-06-17):
  `askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird
  agent needs the coordinator (a local container) to be serving before it can bring up
  `wt0`, but the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined
  with sshd being `wt0`-only, the host was reachable only via the Hetzner console. → the
  coordinator host must **always keep a non-mesh management path** (don't move its SSH onto
  `wt0`), or the mesh-hardening must treat the coordinator host as a special case. General
  rule: never make a host's only management path depend on a service that host itself
  hosts.
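
The general rule above can be checked mechanically. The sketch below assumes a deliberately simple, hypothetical model (none of these names come from the repo): each host lists its management paths, each path lists the services it needs, and each service is mapped to the host that runs it. A host is flagged when every one of its management paths needs a service that the host itself runs.

```python
def hosts_with_circular_management(
    mgmt_paths: dict[str, list[str]],
    path_needs: dict[str, set[str]],
    hosted_on: dict[str, str],
) -> list[str]:
    """Return hosts whose every management path depends on a service
    that the same host runs (so a local outage locks the host out)."""
    flagged = []
    for host, paths in mgmt_paths.items():

        def depends_on_self(path: str) -> bool:
            # True if any service this path needs is hosted on `host`.
            needed = path_needs.get(path, set())
            return any(hosted_on.get(svc) == host for svc in needed)

        # A host with no listed paths is skipped rather than flagged.
        if paths and all(depends_on_self(p) for p in paths):
            flagged.append(host)
    return sorted(flagged)
```

With `askari` modelled as having only a `wt0`-bound SSH path that needs the coordinator it runs, the function flags it; adding any path with no self-hosted dependency clears the flag.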

- `[gotcha]` **NetBird `netbird-server` FATAL-loops on the geolocation DB download with no
  egress** (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download
  the GeoLite2 DB from `pkgs.netbird.io` and treats failure as **FATAL** (crash-loop), so
  any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was
  flushed, not re-added by a plain `restart docker`) takes the whole control plane down.
  Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could
  download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data
  dir (or pin a local copy), or disable the geolocation requirement, so a transient egress
  blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT)
  is fragile across `nft flush` + reboot.

- `[friction]` **No off-site coordinator backup turned a 2-minute restore into a long live
  recovery** (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`,
  encrypted SQLite) has **no off-site backup yet** (ADR-022 `backup` role pending,
  flagged in STATUS.md as the coordinator's deferred backup). During the incident there was
  a real fear that the unclean reboots had corrupted the store, with no restore path. It
  turned out to be a runtime/egress issue, not corruption, but the absence of a backup made
  the whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the
  `netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy
  would have made "rebuild askari from scratch" a safe option.

- `[friction]` **The plan tested reboot-recovery AFTER removing the break-glass**
  (2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5)
  *before* the reboot-resilience test (step 7), so the one fallback path was gone exactly
  when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for
  lockout-risky cutovers: **validate reboot-recovery while the old access path is still
  open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
  This generalises beyond this milestone: a candidate line for the new-host / hardening
  runbooks.
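
The sequencing rule above lends itself to a simple plan lint. The following is a sketch under stated assumptions: the `Step` type and its `kind` labels are hypothetical, and a `reboot_test` step is assumed to gate the plan (that is, the plan stops if the test fails), so reaching a later step implies recovery was proven.

```python
from typing import NamedTuple


class Step(NamedTuple):
    """Hypothetical representation of one cutover plan step."""
    number: int
    description: str
    kind: str  # "retire_break_glass", "reboot_test", or "other"


def premature_retirements(plan: list[Step]) -> list[Step]:
    """Return every step that retires a break-glass path before any
    reboot-recovery test has run earlier in the plan."""
    reboot_proven = False
    premature = []
    for step in plan:
        if step.kind == "reboot_test":
            reboot_proven = True
        elif step.kind == "retire_break_glass" and not reboot_proven:
            premature.append(step)
    return premature


# The incident's ordering: WAN :22 closed at step 5, reboot tested at step 7.
incident_plan = [
    Step(5, "close WAN :22", "retire_break_glass"),
    Step(7, "reboot-resilience test", "reboot_test"),
]
for step in premature_retirements(incident_plan):
    print(f"step {step.number} ({step.description}) retires the break-glass too early")
```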

---

## Kaizen reviews — decisions ledger