From 24a1d909c977c3d1807b6df6dfdedca35e2be390 Mon Sep 17 00:00:00 2001
From: sjat
Date: Fri, 19 Jun 2026 09:12:58 +0200
Subject: [PATCH] =?UTF-8?q?docs(spec):=20mesh-hardening=202/3=20=E2=80=94?=
 =?UTF-8?q?=20ubongo=20INPUT-only=20default-deny?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Sub-project 2 of the mesh-hardening follow-on (the post-incident roadmap
ordering puts ubongo first). Harden the control node's inbound surface
via base's nftables firewall as INPUT-only default-deny: the forward
chain stays permissive (new base__firewall_input_only knob) so Docker
egress + the libvirt-NAT integration harness keep working, and there is
no sshd ListenAddress change — sidestepping the ip_nonlocal_bind
boot-race that sank askari. SSH allowed from wt0, ssh-from-control
(Ansible self), and mamba on the LAN (new base__firewall_admin_addrs).
Harness-validated before an operator-supervised cutover; the physical
console is the permanent break-glass. Design maps to the four relevant
2026-06-17 incident lessons (FRICTION signals 1/2/3/6).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 ...sh-hardening-ubongo-default-deny-design.md | 197 ++++++++++++++++++
 1 file changed, 197 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md

diff --git a/docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md b/docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md
new file mode 100644
index 0000000..e76a29c
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md
@@ -0,0 +1,197 @@
+# Spec — Mesh-hardening (2 of 3): ubongo INPUT-only default-deny + `ssh-from-control`
+
+Status: Accepted (2026-06-19)
+
+## Context & scope
+
+The **mesh-hardening follow-on** (deferred from M5, ROADMAP) was decomposed into three
+independent sub-projects, each its own spec → plan → implementation cycle:
+
+1. askari SSH → `wt0` — spec/plan written 2026-06-17, **attempted and backed out the same day**
+   (the incident; six lessons in `FRICTION.md`). Needs a redesign — **not** this spec.
+2. **ubongo nftables default-deny + `ssh-from-control`** ← *this spec*
+3. NetBird ACL off Allow-All → scoped policies (its own later spec; open mechanism question —
+   no headless API path).
+
+ROADMAP (re-ordered after the 2026-06-17 incident) puts **ubongo first**: it is the clean,
+low-risk case — a physical box with a permanent console break-glass, and *not* the coordinator
+host that the incident proved you must not corner.
+
+This spec hardens **ubongo's inbound surface only**. It does **not** change sshd's
+`ListenAddress` (so no boot-race), does **not** apply a forward-chain default-deny (so Docker +
+the libvirt NAT keep working), and does **not** touch askari or the NetBird ACL.
+
+Current state (verified on ubongo, 2026-06-19): **no host firewall** — sshd listens on
+`0.0.0.0:22`, reachable from LAN, mesh, and anything routable; only Docker's + libvirt's own
+`iptables-nft` tables exist. Interfaces: `eno1` `10.20.10.151` (LAN, = `ansible_host`), `wt0`
+`100.99.146.14` (mesh), `docker0` (one container, no published ports), `virbr-boma`
+`192.168.150.1/24` (the libvirt NAT that `make test-integration` uses), `ip_forward=1`.
+
+## Goal / success criteria
+
+- SSH to ubongo succeeds over **`wt0`** (road-warriors, askari), from **mamba on the LAN**
+  (`10.20.10.50`), and via the **`ssh-from-control` self-path** (Ansible; source `10.20.10.151`).
+- SSH from any **other** LAN source is **dropped** (default-deny on `input`).
+- **Docker container egress and `make test-integration` (libvirt NAT) keep working** — the
+  forward chain is untouched.
+- A **reboot** does not lock SSH out (no `ListenAddress`, so no bind race).
+- Break-glass is the **on-prem physical console** (permanent, non-mesh). The live apply is
+  additionally gated by the firewall **auto-rollback** timer.
+
+## Design
+
+Apply base's nftables `firewall` concern to ubongo, with two adjustments and one deliberate
+non-change:
+
+1. **INPUT-only default-deny.** The `input` chain keeps `policy drop` with the guaranteed
+   management plane: `lo`, `established,related`, ICMP, SSH on `wt0`, and SSH from
+   `ssh-from-control` (`10.20.10.151`). We add **one operator-workstation source** (mamba,
+   `10.20.10.50`) via a new `base__firewall_admin_addrs` list. Everything else on `eno1` drops.
+2. **Forward chain left permissive.** base hardcodes `chain forward { … policy drop; }` for
+   inter-container isolation. On ubongo that would break Docker egress **and** the libvirt NAT
+   the integration harness depends on — the same class of failure that sank askari (FRICTION
+   2026-06-17, signal 1). A new `base__firewall_input_only` knob renders the forward chain
+   `policy accept` instead. Docker's and libvirt's own `iptables-nft` forward rules continue to
+   apply (separate tables); base simply does not add a default-deny on top.
+3. **No sshd `ListenAddress` change.** sshd keeps listening on `0.0.0.0:22`; nftables does all
+   inbound scoping. This deliberately avoids the `ip_nonlocal_bind` boot-race that broke askari
+   (FRICTION signal 2) — there is nothing to bind before `wt0` exists.
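As a reading aid, the intended `input` policy from item 1 can be modeled as a small pure function. This is a minimal sketch and not part of base: `Packet` and `input_verdict` are hypothetical names, the addresses are the ones quoted in this spec, and connection tracking (`established,related`, `invalid`) is deliberately left out, so it only classifies new inbound packets.

```python
from dataclasses import dataclass

# Values quoted in this spec; the names themselves are illustrative assumptions.
CONTROL_ADDR = "10.20.10.151"   # ssh-from-control (Ansible self-path)
ADMIN_ADDRS = ["10.20.10.50"]   # base__firewall_admin_addrs (mamba on the LAN)
SSH_PORT = 22


@dataclass(frozen=True)
class Packet:
    """A new inbound packet (hypothetical model type, not part of base)."""
    iface: str
    saddr: str
    proto: str = "tcp"
    dport: int = 0


def input_verdict(pkt: Packet) -> str:
    """Return 'accept' or 'drop' for a NEW inbound packet under the intended policy."""
    if pkt.iface == "lo":
        return "accept"                      # loopback is always allowed
    if pkt.proto == "icmp":
        return "accept"                      # ICMP stays open
    if pkt.proto == "tcp" and pkt.dport == SSH_PORT:
        if pkt.iface == "wt0":
            return "accept"                  # SSH from the mesh
        if pkt.saddr == CONTROL_ADDR or pkt.saddr in ADMIN_ADDRS:
            return "accept"                  # Ansible self-path or operator workstation
    return "drop"                            # default-deny for everything else


if __name__ == "__main__":
    print(input_verdict(Packet("eno1", "10.20.10.50", dport=22)))  # accept
    print(input_verdict(Packet("eno1", "10.20.10.77", dport=22)))  # drop
```

The second call uses `10.20.10.77` as an arbitrary stand-in for "any other LAN source"; it is not an address taken from the inventory.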
+
+Resulting `input` allow-list:
+
+```
+iif "lo" accept
+ct state established,related accept
+ct state invalid drop
+iifname "wt0" tcp dport 22 accept  # mesh (road-warriors, askari)
+ip saddr 10.20.10.151 tcp dport 22 accept  # ssh-from-control (Ansible self) — group_vars/all
+ip saddr 10.20.10.50 tcp dport 22 accept  # mamba on the LAN — base__firewall_admin_addrs
+ip protocol icmp accept ; ip6 nexthdr ipv6-icmp accept
+# (no catalog services on ubongo) → default drop
+chain forward: policy accept  # Docker + libvirt-NAT forwarding preserved
+```
+
+## Why ubongo is the safe case (maps to the 2026-06-17 incident)
+
+- **Signal 1** (forward-drop breaks Docker hosts): sidestepped — INPUT-only leaves forwarding alone.
+- **Signal 2** (`ip_nonlocal_bind` boot-race): sidestepped — no `ListenAddress`; sshd binds nothing new.
+- **Signal 3** (a host's only mgmt path must not depend on a service it hosts): satisfied —
+  ubongo is not the coordinator and keeps three independent paths (mesh, LAN, physical console).
+- **Signal 6** (recovery tested after the break-glass was removed): the physical console is
+  permanent (nothing to retire), and reboot-recovery is proven on a throwaway VM first.
+
+## New & changed code
+
+**Role `base`:**
+
+- `roles/base/defaults/main.yml` — add:
+  - `base__firewall_input_only: false` — when true, the forward chain is `policy accept`
+    (host-local input filtering only), for hosts that route/forward container or NAT traffic
+    (e.g. the control node's Docker + libvirt-NAT) where a forward default-deny would break them.
+  - `base__firewall_admin_addrs: []` — extra LAN source IPs allowed to SSH (besides `wt0` +
+    `ssh-from-control`); for an operator workstation reaching the host over the LAN. Key-gated.
+- `roles/base/templates/nftables.conf.j2`:
+  - the forward line (currently line 21) →
+    `chain forward { type filter hook forward priority 0; policy {{ "accept" if base__firewall_input_only | bool else "drop" }}; }`
+  - after the `ssh-from-control` block (currently lines 12-14), add a loop:
+    `{% for addr in base__firewall_admin_addrs %}` →
+    `ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept`
+- `roles/base/molecule/default/{converge,verify}.yml` — fixture sets `input_only: true` + an
+  `admin_addrs` entry; assert (a) `forward` renders `policy accept`, (b) the admin-addr accept
+  rule renders, (c) existing input default-deny + `wt0` + control-addr assertions stay green.
+
+**Inventory** (`inventories/production/group_vars/control/vars.yml`, append):
+
+```yaml
+# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
+# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
+# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
+# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0, the
+# ssh-from-control self-path (base__firewall_control_addr in group_vars/all), or mamba on the
+# LAN. Break-glass: the physical console.
+base__firewall_input_only: true
+base__firewall_admin_addrs:
+  - "10.20.10.50"  # mamba over the LAN (NetBird off). Raw DHCP lease — see note below.
+# base__firewall_apply defaults true; base__firewall_control_addr (= ubongo's own 10.20.10.151)
+# is set in group_vars/all and covers Ansible's self-connection.
+```
+
+**Integration harness** (ADR-025) — a "be ubongo" profile, mirroring "be askari":
+
+- `tests/integration/overrides/ubongo.yml` — `firewall_apply: true`, `input_only: true`,
+  `admin_addrs: ["192.168.150.99"]` (a representative LAN addr to exercise the rule),
+  `firewall_control_addr: "192.168.150.1"` (the libvirt-NAT gateway = the harness's own SSH
+  path, so the apply + reboot don't lock it out), `ssh_listen_mesh_only: false`,
+  `mesh_enabled: false`.
+- `tests/integration/profiles/ubongo.json` — mirror `profiles/askari.json` (VM resources/image).
+- `tests/integration/verify.yml` — make the assertions **profile-aware** (gated on the active
+  profile, since `verify.yml` is shared): for ubongo assert `input` policy drop, `forward`
+  policy **accept**, and the admin-addr rule present. Reachability across the reboot is the
+  harness's existing cycle. The askari assertions (Docker/forward-DNAT) must **not** run for the
+  ubongo profile, nor vice-versa.
+
+Enables `make test-integration HOST=ubongo`.
+
+## The mamba admin-addr — a deliberately interim value
+
+`base__firewall_admin_addrs: ["10.20.10.50"]` is mamba's **current raw DHCP lease**, not a
+reservation (operator decision, 2026-06-19). Caveats, accepted for now:
+
+- **Lease drift:** if DHCP reassigns `10.20.10.50`, the rule allows whatever host then holds it
+  (still SSH-key-gated, so low risk) and mamba loses its *LAN* path. **Backstop:** mamba also
+  reaches ubongo over `wt0` (mesh), so it is never cut off — only the off-mesh LAN convenience
+  lapses until the IP is corrected.
+- **Revisit trigger:** when OPNsense-as-code lands (ADR-020 perimeter layer), replace this with
+  a **DHCP reservation** (MAC → fixed IP) and allow the reserved address. Tracked here and in
+  the implementation plan's follow-ups.
+
+## Testing
+
+- **Molecule** (base `default`, render-only, `firewall_apply: false`): the new forward-accept +
+  admin-addr assertions above, with existing assertions green.
+- **Integration harness** (`make test-integration HOST=ubongo`): on a throwaway UEFI VM, apply
+  the ubongo overlay, assert the ruleset shape, and prove **SSH survives a reboot** from an
+  allowed source (the existing assert/cycle). This is the gate before touching the real control
+  node.
+- **Live** (during cutover): SSH over `wt0` ✓, from mamba LAN ✓, Ansible self-ping ✓; SSH from a
+  disallowed LAN host dropped ✓; `docker run … ` egress ✓; a fresh `make test-integration`
+  still spins a VM (libvirt NAT intact) ✓.
+
+## Staged cutover (operator-supervised — lockout-aware, FRICTION signal-6 order)
+
+ubongo is managed as `sjat` (password sudo), so the live apply needs the operator present
+anyway. The physical console is open throughout.
+
+1. **Harness GREEN:** `make test-integration HOST=ubongo` passes (incl. the reboot).
+2. **Pre-check the real paths** *before* applying: SSH over `wt0`, SSH from mamba
+   (`10.20.10.50`), `ansible ubongo -m ping`. Confirm the physical console is reachable.
+3. **Dry-run:** `make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall` — review the nftables diff
+   (input default-deny + `wt0` + `10.20.10.151` + `10.20.10.50`; forward `policy accept`).
+4. **Apply (auto-rollback armed):** `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall` — the
+   firewall concern snapshots, arms the 45 s revert, applies, `reset_connection` →
+   `wait_for_connection` over the live path (`10.20.10.151`), then cancels the timer. A bad
+   ruleset reverts itself; the console is the ultimate fallback.
+5. **Verify** every path + Docker egress + a fresh integration-VM spin (above).
+6. **Reboot ubongo; confirm SSH returns on all paths unaided** (console present). Only now is it
+   done — recovery is proven *while the break-glass is still there*.
+7. **Docs:** update `STATUS.md` (ubongo row: input-only default-deny applied) and `ROADMAP.md`
+   (mesh-hardening 2/3 done; next is sub-project 1 askari redesign or 3 NetBird ACL).
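The ordering in step 4 (arm the revert, apply, confirm reachability, cancel) can be illustrated with a short single-process sketch. This is an assumption-laden model, not how base's firewall concern is implemented: `apply_with_auto_rollback` and its three callables are hypothetical, and a real revert timer would presumably run on the host independently of the controller's connection.

```python
import threading


def apply_with_auto_rollback(apply, probe, revert, timeout_s=45.0):
    """Apply a change guarded by a revert timer (illustrative sketch only).

    apply:  callable that installs the new ruleset
    probe:  callable returning True if the host is still reachable
    revert: callable that restores the snapshot
    Returns True if the change was kept, False if it was rolled back.
    """
    reverted = threading.Event()

    def _revert():
        revert()
        reverted.set()

    timer = threading.Timer(timeout_s, _revert)
    timer.start()                    # arm the revert before touching anything
    try:
        apply()
        ok = probe()
    except Exception:
        ok = False                   # a failed apply or probe counts as "not confirmed"
    if ok:
        timer.cancel()               # reachability confirmed: keep the new ruleset
        timer.join()
        return not reverted.is_set() # False if the timer fired before we could cancel
    timer.join()                     # not confirmed: wait for the timer to revert
    return False
```

The point of the design is that the revert is armed first and needs no further action to fire, so a ruleset that cuts off the live path undoes itself; only a positive reachability check cancels it.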
+
+## Risks & rollback
+
+- **Self-referential apply** (ubongo runs Ansible against itself): mitigated by the auto-rollback
+  timer, the `wait_for_connection` over the real path, three redundant allowed sources, and the
+  permanent physical console. ubongo cannot be bricked.
+- **Raw-lease fragility:** documented above; backstopped by the mesh path; revisit with OPNsense.
+- **No new container isolation** (forward stays accept): accepted — ubongo is a single-tenant
+  control node, not a service host; Docker/libvirt keep their own forward rules. The forward
+  default-deny remains the norm for real service hosts (`base__firewall_input_only: false`).
+
+## Out of scope / follow-ons
+
+- askari SSH → `wt0` redesign (sub-project 1) — needs the boot-race + coordinator-bootstrap
+  resolved; folds in the coordinator-robustness (geo-DB FATAL-loop) + off-site backup lessons.
+- NetBird ACL off Allow-All (sub-project 3) — open mechanism question (no headless API path).
+- OPNsense DHCP reservation for mamba (and ubongo) — replaces the raw lease; with OPNsense-as-code.
+- Forward-chain container isolation on ubongo — deliberately not done here.
+- `STATUS.md` / `ROADMAP.md` edits land with the implementation, not this spec.