docs(spec): mesh-hardening redesign — askari wt0-primary + WAN break-glass

Redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Mirrors the proven
ubongo 2/3 pattern (INPUT-only default-deny, SSH scoped by iifname wt0, no sshd
ListenAddress change -> no boot-race) and adds the coordinator-host exception the
incident demanded: a permanent non-mesh break-glass (WAN :22 from ubongo's static
WAN IP + the Hetzner console), WAN :22 deliberately left open.

Folds in the netbird_coordinator geo-DB robustness fix (FRICTION #4) so a transient
egress blip can't FATAL the control plane. Harness-GREEN gate before a supervised
live cutover.

Operator decision (2026-06-19): do this redesign first, then a separate sub-project
to reduce askari's SPOF role.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

parent ef5e049e9b
commit a178729587
1 changed files with 216 additions and 0 deletions
@ -0,0 +1,216 @@

# Spec: Mesh-hardening redesign, askari SSH `wt0`-primary + permanent WAN break-glass

Status: Accepted (2026-06-19)

## Context & scope

The **mesh-hardening follow-on** (deferred from M5) was decomposed into three independent
sub-projects, each with its own spec → plan → implementation cycle. Progress so far:

1. ~~askari SSH → `wt0`~~: **attempted 2026-06-17, BACKED OUT** after it took askari down
   on reboot (spec/plan `docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*`).
2. ubongo nftables INPUT-only default-deny: **DONE 2026-06-19**, reboot-validated
   (`base__firewall_input_only`).
3. NetBird ACL off Allow-All → scoped policies: not started.

This spec is the **redesign of (1)**. The operator sequencing decision (2026-06-19) is:
do this redesign **first**, then a separate sub-project to reduce askari's
single-point-of-failure (SPOF) role. **This spec covers only the redesign of (1).** The SPOF
reduction is the named follow-on (its own later spec).

### Why the 2026-06-17 attempt was backed out

Four hazards, drawn from the six 2026-06-17 signals recorded in `docs/FRICTION.md`:

1. **`base`'s `forward policy drop` breaks Docker hosts on reboot.** nftables loaded
   default-deny before Docker, so container forwarding/NAT (WAN→Caddy, Caddy→coordinator)
   died after reboot.
2. **`ip_nonlocal_bind` did NOT beat the sshd boot-race.** Binding sshd `ListenAddress`
   to the `wt0` IP still failed at boot ("could not assign the address"), and because
   `wt0` never came up, sshd had no listener at all.
3. **The coordinator host can't bootstrap the mesh it depends on.** askari runs the
   NetBird coordinator *and* is a mesh peer; its agent needs the local coordinator container
   healthy to bring up `wt0`. After an unclean reboot the coordinator was down → `wt0`
   never came up → with SSH `wt0`-only, the host was reachable only via the Hetzner console.
   General rule: *never make a host's only management path depend on a service that host
   itself hosts.*
4. **The coordinator FATAL-loops on the geolocation-DB download with no egress.** A
   transient loss of container egress (here: NAT wiped by `nft flush`) crash-loops the whole
   control plane.

### What changed since 2026-06-17 (enablers this redesign relies on)

- `docker_host` **container-forward nftables drop-in** (`172ae37`): reboot-safe Docker
  forwarding (available as a later tightening; not required by this pass).
- **`base__firewall_input_only`**: input-only default-deny, forward chain stays
  `policy accept` (Docker-safe). **Proven on ubongo and reboot-validated 2026-06-19.**
- The **ADR-025 integration harness**: reproduces a host's boot on a throwaway local VM,
  so reboot-safety is proven GREEN before the real host is touched.

## Goal / success criteria

- askari's host nftables firewall is **finally applied** (`base__firewall_apply: true`),
  INPUT-only default-deny, matching ubongo.
- **Normal management is over the mesh:** `ansible_host` resolves to askari's `wt0` IP
  (`100.99.226.39`); SSH-over-`wt0` and `ansible askari -m ping` over the mesh both succeed.
- **A permanent non-mesh break-glass survives a mesh/coordinator outage**, via two
  independent channels:
  - the **Hetzner web console** (out-of-band; always works, IP-independent); and
  - **WAN `:22` reachable only from ubongo's WAN IP (`91.226.145.80`)**, enforced at *both*
    the host nftables layer (`base__firewall_admin_addrs`) and the Hetzner Cloud Firewall.
    WAN `:22` is **deliberately NOT closed**: this is the coordinator-host exception
    (FRICTION #3).
- **askari survives a reboot under the new firewall, unattended:** Docker forwarding/NAT
  intact, `https://test.askari.wingu.me` + `https://netbird.askari.wingu.me` serve valid
  certs, STUN `3478/udp` answers, the coordinator container is healthy (geo-DB no longer
  FATAL), `wt0` returns, SSH is reachable over both `wt0` and the WAN break-glass.
- **No sshd `ListenAddress` change** (`base__ssh_listen_mesh_only` stays `false`). This is
  what sidesteps the boot-race that sank the 2026-06-17 attempt.

## Design: mirror ubongo 2/3, with the coordinator-host exception

The host firewall does the SSH scoping; sshd is left listening on all interfaces. This is
the ubongo 2/3 pattern (sub-project 2 of 3 above), which is proven and reboot-validated.

1. **`base` firewall, INPUT-only default-deny** (`base__firewall_apply: true`,
   `base__firewall_input_only: true`): the input chain defaults to `drop`; the forward chain
   stays `policy accept` so Docker container forwarding/NAT and published-port DNAT keep
   working across a reboot. Allowed ingress:
   - `:22/tcp` via `iifname "wt0"` (the interface-name match that survives `wt0` being
     absent at boot; `base__firewall_mgmt_interface: wt0`);
   - `:22/tcp` from `91.226.145.80` (ubongo's WAN, the break-glass; via
     `base__firewall_admin_addrs`);
   - the public service surface from the catalog: `80,443/tcp` + `3478/udp` (WAN).
2. **No sshd change.** `base__ssh_listen_mesh_only` stays `false`; sshd keeps listening on
   all interfaces. The firewall, not sshd, restricts where `:22` is reachable. There is no
   `ListenAddress`, hence no `ip_nonlocal_bind`, hence no boot-race.
3. **The Hetzner Cloud Firewall is unchanged.** The `:22`-from-ubongo rule stays (the
   2026-06-17 attempt removed it; this redesign keeps it as the perimeter break-glass).
4. **Coordinator geo-DB robustness.** Make the `netbird_coordinator` control plane survive
   a transient egress loss (the nat-flush window on apply, and the boot window before Docker
   re-adds its NAT), so the coordinator stays healthy and `wt0` can come back. One of:
   - **pre-seed** the GeoLite2 DB into the persistent `netbird_data:/var/lib/netbird` volume
     so netbird-server finds it locally and never needs to download; or
   - **disable / make non-fatal** the geolocation requirement in `config.yaml.j2`.

   The exact v0.72.4 mechanism is verified against NetBird's pinned docs at plan time
   (ADR-014): the design fixes the *intent* (a transient egress blip must not FATAL the
   control plane); the plan fixes the *knob*.

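
The ingress rules in item 1 can be read as a small decision function. The sketch below is only a model for reasoning about which new connections get in, not the ruleset the `base` role renders: the `Packet` type, the `input_verdict` name, and the `eth0` interface used in the usage note are assumptions for illustration, and loopback and established-connection handling are left out.

```python
from dataclasses import dataclass

MGMT_INTERFACE = "wt0"             # base__firewall_mgmt_interface
ADMIN_ADDRS = {"91.226.145.80"}    # base__firewall_admin_addrs (ubongo's WAN)
PUBLIC_PORTS = {("tcp", 80), ("tcp", 443), ("udp", 3478)}  # catalog surface


@dataclass(frozen=True)
class Packet:
    """A new inbound connection attempt (hypothetical model type)."""
    iifname: str  # name of the ingress interface
    saddr: str    # source IPv4 address
    proto: str    # "tcp" or "udp"
    dport: int    # destination port


def input_verdict(pkt: Packet) -> str:
    """Return "accept" or "drop" under the INPUT-only default-deny design."""
    if pkt.proto == "tcp" and pkt.dport == 22:
        # SSH is accepted by interface name (mesh) or by source address (break-glass).
        if pkt.iifname == MGMT_INTERFACE or pkt.saddr in ADMIN_ADDRS:
            return "accept"
        return "drop"
    if (pkt.proto, pkt.dport) in PUBLIC_PORTS:
        return "accept"
    return "drop"  # input chain policy: default-deny
```

For example, `input_verdict(Packet("eth0", "203.0.113.7", "tcp", 22))` is a drop, while the same packet from `91.226.145.80` or arriving on `wt0` is accepted. Matching on the interface name rather than the interface itself is what lets the rule load while `wt0` does not exist yet.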
### Rejected alternatives (these are the 2026-06-17 failures)

- sshd `ListenAddress = wt0 IP` + `ip_nonlocal_bind` → boot-race; did not bind. **Out.**
- `forward policy drop` on a Docker host → broke forwarding on reboot. **Out** (use
  `input_only`; the `docker_host` container-forward drop-in is a later tightening).
- Close WAN `:22` entirely → coordinator host left console-only on a bad reboot. **Out**
  (keep WAN `:22`-from-ubongo as the always-available non-mesh path).

### How each 2026-06-17 failure is answered

| 2026-06-17 failure | Fix in this design |
|---|---|
| `forward drop` killed Docker on reboot | `base__firewall_input_only: true` → forward stays `accept` |
| `ip_nonlocal_bind` sshd boot-race | no sshd `ListenAddress` change; firewall scopes `:22` by `iifname "wt0"` |
| coordinator chicken-egg / lockout | permanent WAN `:22`-from-ubongo + Hetzner console; management never depends on a service askari hosts |
| coordinator geo-DB FATAL-loop | pre-seed / non-fatal geo so a transient egress blip can't crash the control plane |

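
The intent behind the geo-DB row (prefer a pre-seeded local copy, and never let a failed download be fatal) can be sketched as follows. This is not NetBird's code and says nothing about the real v0.72.4 knob, which the plan still has to verify; `resolve_geo_db`, the `download` callback, and the `*.mmdb` glob are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_geo_db(data_dir: Path, download: Callable[[Path], Path]) -> Optional[Path]:
    """Return a usable geo DB path, or None if geolocation has to stay disabled."""
    # 1. A pre-seeded copy on the persistent volume wins: no egress is needed.
    local = sorted(data_dir.glob("*.mmdb"))
    if local:
        return local[-1]
    # 2. Otherwise try to download, but treat an egress failure as non-fatal.
    try:
        return download(data_dir)
    except OSError:
        return None
```

Either branch keeps the control plane up during the egress window: with a pre-seeded file the download is never attempted, and without one the caller gets `None` instead of a crash.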
## New & changed code

**Inventory:**

- `inventories/production/group_vars/offsite_hosts/vars.yml`:
  - `base__firewall_apply: true` (was `false`);
  - `base__firewall_input_only: true` (new; forward stays `accept`, Docker-safe);
  - `base__firewall_admin_addrs: ["91.226.145.80"]` (new; ubongo's WAN, the break-glass;
    the comment states what it is and why a coordinator host keeps a non-mesh path);
  - `base__ssh_listen_mesh_only: false` stays (explicit; no boot-race);
  - rewrite the header backout note → "redesigned 2026-06-19: `wt0`-primary + permanent WAN
    break-glass; see this spec."
- `inventories/production/host_vars/askari.yml` (**new**): `ansible_host: 100.99.226.39`
  (the `wt0` IP), so Ansible manages askari over the mesh. Overrides the TF-generated WAN
  `ansible_host` in `offsite.yml` (host_vars are not regenerated by `tf_to_inventory.py`).
  A header comment explains why.

**Role `netbird_coordinator`:**

- The geo-DB robustness change above (`templates/config.yaml.j2` and/or a pre-seed task;
  the `templates/docker-compose.yml.j2` volume already persists `/var/lib/netbird`), with
  Molecule/verify coverage that the control plane comes up without external geo egress.

**Firewall catalog** (`inventories/production/group_vars/all/firewall.yml`):

- **No change.** It already enumerates askari's public ingress (`reverse_proxy` 80/443,
  `netbird_stun` 3478/udp). `:22` is handled by the `base` firewall's built-in SSH rules
  (`mgmt_interface` `wt0` + `admin_addrs`), not the catalog.

**Terraform / Hetzner Cloud Firewall:**

- **No change.** The WAN `:22`-from-ubongo rule stays (the perimeter half of the break-glass).

**sshd:**

- **No change.**

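
The inventory values listed in this section lend themselves to a mechanical guard. The sketch below assumes the merged host variables for askari are available as a plain dict (for instance from `ansible-inventory --host askari`); the `spec_violations` helper is hypothetical and not part of the repository.

```python
from typing import Any, Dict, List

WT0_IP = "100.99.226.39"       # askari's mesh address (from this spec)
BREAK_GLASS = "91.226.145.80"  # ubongo's WAN address (from this spec)


def spec_violations(hostvars: Dict[str, Any]) -> List[str]:
    """List every way the merged askari vars deviate from this spec."""
    problems: List[str] = []
    if hostvars.get("base__firewall_apply") is not True:
        problems.append("base__firewall_apply must be true")
    if hostvars.get("base__firewall_input_only") is not True:
        problems.append("base__firewall_input_only must be true")
    if BREAK_GLASS not in (hostvars.get("base__firewall_admin_addrs") or []):
        problems.append("break-glass address missing from base__firewall_admin_addrs")
    if hostvars.get("base__ssh_listen_mesh_only") is not False:
        problems.append("base__ssh_listen_mesh_only must stay false")
    if hostvars.get("ansible_host") != WT0_IP:
        problems.append("ansible_host must be the wt0 IP")
    return problems
```

An empty list means the inventory matches the design; any entry names the variable that would reintroduce one of the 2026-06-17 hazards or drop the break-glass.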
## Validation

### Harness-first GREEN gate (ADR-025), before any live change

A "be askari" integration profile (Docker host + a coordinator-like container on the shared
network + `base__firewall_input_only` + `admin_addrs`), driven through
`make test-integration HOST=askari` (reusing the existing profile/overlay/verify pattern),
must show:

- input chain default-deny with `:22` accepted via `iifname "wt0"` **and** from the
  break-glass admin address; forward chain `policy accept`;
- published-port DNAT + NAT masquerade survive a **reboot** (the RED→GREEN reboot cycle);
- the coordinator-like container comes up healthy with **no external geo egress**;
- the SSH path returns after reboot.

This must be GREEN before the live cutover.

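
The first harness assertion (chain policies plus the two SSH accepts) could be checked against the text of `nft list ruleset`. The sketch below is one possible verify step, not the existing harness code: the `SAMPLE` ruleset, the table and chain names, and the exact rule wording are assumptions about how the `base` role renders its rules, and a real check would need to match whatever the role actually emits.

```python
import re
from typing import List

# Assumed shape of the rendered ruleset; the real output may differ.
SAMPLE = """
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        iifname "wt0" tcp dport 22 accept
        ip saddr 91.226.145.80 tcp dport 22 accept
        tcp dport { 80, 443 } accept
        udp dport 3478 accept
    }
    chain forward {
        type filter hook forward priority filter; policy accept;
    }
}
"""


def chain_body(ruleset: str, chain: str) -> str:
    """Return the text between a chain's opening brace and its closing brace."""
    pattern = r"chain %s \{(.*?)\n[ \t]*\}" % re.escape(chain)
    match = re.search(pattern, ruleset, re.DOTALL)
    if match is None:
        raise ValueError(f"chain {chain!r} not found")
    return match.group(1)


def harness_failures(ruleset: str, mgmt_if: str = "wt0",
                     admin_addr: str = "91.226.145.80") -> List[str]:
    """List the firewall assertions from the GREEN gate that do not hold."""
    failures: List[str] = []
    input_chain = chain_body(ruleset, "input")
    forward_chain = chain_body(ruleset, "forward")
    if "policy drop;" not in input_chain:
        failures.append("input chain is not default-deny")
    if "policy accept;" not in forward_chain:
        failures.append("forward chain is not policy accept")
    if f'iifname "{mgmt_if}" tcp dport 22 accept' not in input_chain:
        failures.append("no SSH accept on the mesh interface")
    if f"ip saddr {admin_addr} tcp dport 22 accept" not in input_chain:
        failures.append("no SSH accept from the break-glass address")
    return failures
```

Running the same check before and after the harness reboot gives the RED→GREEN signal for the firewall half of the gate; the DNAT, coordinator, and SSH checks need live probes instead of text matching.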
### Live cutover: supervised, console open, break-glass never removed

Sequencing rule (FRICTION #6): validate reboot-recovery while a fallback path is still open.
Because the WAN break-glass is *never* removed in this design, that invariant holds by
construction.

1. **Pre-check:** `ssh sjat@100.99.226.39` (over `wt0`) and `ansible askari -m ping` (forced
   over `wt0`) both succeed; public services + STUN healthy.
2. **Repoint Ansible:** add `host_vars/askari.yml` (`ansible_host` = `wt0` IP); confirm
   `ansible askari -m ping` runs over the mesh.
3. **Apply `base` (+ the geo-DB fix):** one `make deploy PLAYBOOK=site LIMIT=askari`
   converge applies INPUT-only default-deny with the `wt0` + admin-addr SSH allow and the
   coordinator robustness change. The firewall concern's armed auto-rollback
   (`base__firewall_rollback_timeout: 45`) reverts a bad ruleset. Then a post-apply
   `restart docker` rebuilds NAT (base's `flush ruleset` wipes Docker's nat; see FRICTION);
   the coordinator now survives the egress window thanks to the geo-DB fix.
4. **Verify the new steady state:** public services serve valid certs; STUN answers; SSH
   over `wt0` works; SSH over the WAN break-glass (`91.226.145.80` → `:22`) works.
5. **Reboot resilience (the real test):** reboot askari (Hetzner console available) and
   confirm that, with no intervention, Docker forwarding/NAT, public services, the
   coordinator, `wt0`, and SSH (both paths) all return.

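
The TCP parts of steps 4 and 5 can be probed from the operator's machine with a plain connect test. This is a minimal sketch: `tcp_reachable`, `failed_checks`, and the `TARGETS` table are illustrative names, a successful connect only shows the port is open (it validates neither certificates nor SSH authentication), and STUN on `3478/udp` needs a separate UDP check. The WAN break-glass probe has to be run from ubongo against askari's WAN address, which this spec does not list.

```python
import socket
from typing import Dict, List, Tuple

# Endpoints taken from this spec; extend on the host that runs the probe.
TARGETS: Dict[str, Tuple[str, int]] = {
    "ssh over wt0": ("100.99.226.39", 22),
    "https test": ("test.askari.wingu.me", 443),
    "https netbird": ("netbird.askari.wingu.me", 443),
}


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_checks(targets: Dict[str, Tuple[str, int]], timeout: float = 3.0) -> List[str]:
    """Return the names of all targets that could not be reached."""
    return [
        name
        for name, (host, port) in targets.items()
        if not tcp_reachable(host, port, timeout)
    ]
```

After the unattended reboot in step 5, polling `failed_checks(TARGETS)` until it returns an empty list gives a simple "everything came back" signal.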
## Risks & rollback

- **ubongo's WAN IP anchors the break-glass.** If it is dynamic and rotates, the host
  `admin_addrs` rule and the Hetzner FW rule must be updated. The **Hetzner console** is the
  IP-independent ultimate break-glass. (Confirmed static by the operator 2026-06-19; it is
  also already the Hetzner FW assumption today.)
- **Mid-cutover lockout:** mitigated by the staged order (a path open at each step), the
  firewall auto-rollback timer, `ansible_host` = `wt0` (the confirm tests the real new path),
  and the WAN break-glass that is never removed.
- **Reboot lockout:** mitigated by `iifname "wt0"` scoping (no sshd boot-race), the WAN
  break-glass, the geo-DB fix (coordinator survives the egress window), and harness GREEN.
- **Default-deny breaks a public service:** mitigated by the catalog already enumerating all
  live ingress and the §Validation service checks; reversible via
  `base__firewall_apply: false`.
- **Ultimate break-glass:** the Hetzner web console (out-of-band).

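
The auto-rollback mitigation follows a general "armed revert" pattern: apply the change, start a timer that undoes it, and cancel the timer only once a confirmation arrives over the new path. The spec does not describe how the `base` role implements this, so the sketch below is a generic illustration; `ArmedRollback` and its methods are hypothetical, and only the 45-second figure comes from `base__firewall_rollback_timeout`.

```python
import threading
from typing import Callable


class ArmedRollback:
    """Run `revert` after `timeout` seconds unless `confirm()` is called first."""

    def __init__(self, timeout: float, revert: Callable[[], None]) -> None:
        self._timer = threading.Timer(timeout, revert)
        self._timer.daemon = True  # never keep the process alive just for the timer

    def arm(self) -> None:
        """Start the countdown, right after the new ruleset is applied."""
        self._timer.start()

    def confirm(self) -> None:
        """Cancel the revert once the new management path is proven to work."""
        self._timer.cancel()
```

A caller would construct `ArmedRollback(45, restore_previous_ruleset)` (with `restore_previous_ruleset` standing in for whatever restores the old rules), call `arm()` after applying, and call `confirm()` only after a check over `wt0` succeeds. If the new ruleset locks the operator out, the confirmation never arrives and the revert fires on its own.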
## Out of scope / follow-ons

- **SPOF reduction (the next sub-project):** reduce askari's single-point-of-failure role
  (currently `ubongo → askari` is `Relayed` through askari's own relay; if askari is down the
  mesh data plane for relayed peers is down). Its own spec, after this.
- **NetBird ACL off Allow-All:** until then any enrolled peer can reach askari's `wt0:22`;
  scoping that is a separate sub-project.
- **Full forward-chain hardening:** the `docker_host` container-forward drop-in (full
  forward default-deny, reboot-safe) as a later tightening over the `input_only` baseline.
- **Coordinator off-site backup** (FRICTION #5, ADR-022): still pending; noted, not in scope.
- STATUS.md / ROADMAP updates land with the implementation, not this spec.