Sub-project 2 of the mesh-hardening follow-on (the post-incident roadmap ordering puts ubongo first). Harden the control node's inbound surface via base's nftables firewall as INPUT-only default-deny: the forward chain stays permissive (new base__firewall_input_only knob) so Docker egress + the libvirt-NAT integration harness keep working, and there is no sshd ListenAddress change — sidestepping the ip_nonlocal_bind boot-race that sank askari. SSH allowed from wt0, ssh-from-control (Ansible self), and mamba on the LAN (new base__firewall_admin_addrs). Harness-validated before an operator-supervised cutover; the physical console is the permanent break-glass. Design maps to the four relevant 2026-06-17 incident lessons (FRICTION signals 1/2/3/6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spec — Mesh-hardening (2 of 3): ubongo INPUT-only default-deny + ssh-from-control
Status: Accepted (2026-06-19)
Context & scope
The mesh-hardening follow-on (deferred from M5, ROADMAP) was decomposed into three independent sub-projects, each its own spec → plan → implementation cycle:
- askari SSH → wt0: spec/plan written 2026-06-17, attempted and backed out the same day (the incident; six lessons in FRICTION.md). Needs a redesign, and is not covered by this spec.
- ubongo nftables default-deny + ssh-from-control ← this spec
- NetBird ACL off Allow-All → scoped policies (its own later spec; open mechanism question: no headless API path).
ROADMAP (re-ordered after the 2026-06-17 incident) puts ubongo first: it is the clean, low-risk case — a physical box with a permanent console break-glass, and not the coordinator host that the incident proved you must not corner.
This spec hardens ubongo's inbound surface only. It does not change sshd's
ListenAddress (so no boot-race), does not apply a forward-chain default-deny (so Docker +
the libvirt NAT keep working), and does not touch askari or the NetBird ACL.
Current state (verified on ubongo, 2026-06-19): no host firewall — sshd listens on
0.0.0.0:22, reachable from LAN, mesh, and anything routable; only Docker's + libvirt's own
iptables-nft tables exist. Interfaces: eno1 10.20.10.151 (LAN, = ansible_host), wt0
100.99.146.14 (mesh), docker0 (one container, no published ports), virbr-boma
192.168.150.1/24 (the libvirt NAT that make test-integration uses), ip_forward=1.
Goal / success criteria
- SSH to ubongo succeeds over wt0 (road-warriors, askari), from mamba on the LAN (10.20.10.50), and via the ssh-from-control self-path (Ansible; source 10.20.10.151).
- SSH from any other LAN source is dropped (default-deny on input).
- Docker container egress and make test-integration (libvirt NAT) keep working, because the forward chain is untouched.
- A reboot does not lock SSH out (no ListenAddress, so no bind race).
- Break-glass is the on-prem physical console (permanent, non-mesh). The live apply is additionally gated by the firewall auto-rollback timer.
Design
Apply base's nftables firewall concern to ubongo, with two adjustments and one deliberate
non-change:
- INPUT-only default-deny. The input chain keeps policy drop with the guaranteed management plane: lo, established,related, ICMP, SSH on wt0, and SSH from ssh-from-control (10.20.10.151). We add one operator-workstation source (mamba, 10.20.10.50) via a new base__firewall_admin_addrs list. Everything else on eno1 drops.
- Forward chain left permissive. base hardcodes chain forward { … policy drop; } for inter-container isolation. On ubongo that would break Docker egress and the libvirt NAT the integration harness depends on, the same class of failure that sank askari (FRICTION 2026-06-17, signal 1). A new base__firewall_input_only knob renders the forward chain policy accept instead. Docker's and libvirt's own iptables-nft forward rules continue to apply (separate tables); base simply does not add a default-deny on top.
- No sshd ListenAddress change. sshd keeps listening on 0.0.0.0:22; nftables does all inbound scoping. This deliberately avoids the ip_nonlocal_bind boot-race that broke askari (FRICTION signal 2): there is nothing to bind before wt0 exists.
Resulting input allow-list:
iif "lo" accept
ct state established,related accept
ct state invalid drop
iifname "wt0" tcp dport 22 accept # mesh (road-warriors, askari)
ip saddr 10.20.10.151 tcp dport 22 accept # ssh-from-control (Ansible self) — group_vars/all
ip saddr 10.20.10.50 tcp dport 22 accept # mamba on the LAN — base__firewall_admin_addrs
ip protocol icmp accept ; ip6 nexthdr ipv6-icmp accept
# (no catalog services on ubongo) → default drop
chain forward: policy accept # Docker + libvirt-NAT forwarding preserved
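To make the effect of the allow-list concrete, the following is a minimal Python sketch that models the verdict for a new inbound TCP connection. It is not part of base; the function name new_tcp_verdict and the constants are illustrative, with values copied from the ruleset above. Established/related traffic, ICMP, and invalid-state handling are deliberately left out.

```python
from ipaddress import ip_address

# Values copied from the allow-list above; all names here are illustrative.
MESH_IFACE = "wt0"
CONTROL_ADDR = "10.20.10.151"   # ssh-from-control (Ansible self)
ADMIN_ADDRS = ["10.20.10.50"]   # base__firewall_admin_addrs (mamba)
SSH_PORT = 22


def new_tcp_verdict(iifname: str, saddr: str, dport: int) -> str:
    """Return 'accept' or 'drop' for a NEW inbound TCP connection.

    Only the rules that matter for new TCP connections are modelled.
    """
    if iifname == "lo":
        return "accept"
    if dport == SSH_PORT:
        # Any source is allowed to reach SSH over the mesh interface.
        if iifname == MESH_IFACE:
            return "accept"
        # The saddr rules are not bound to an interface in the ruleset.
        allowed = {ip_address(CONTROL_ADDR)} | {ip_address(a) for a in ADMIN_ADDRS}
        if ip_address(saddr) in allowed:
            return "accept"
    return "drop"  # chain policy: default-deny
```

For example, new_tcp_verdict("eno1", "10.20.10.50", 22) yields "accept" (mamba), while the same call with any other LAN source address falls through to the chain policy and yields "drop".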
Why ubongo is the safe case (maps to the 2026-06-17 incident)
- Signal 1 (forward-drop breaks Docker hosts): sidestepped, because INPUT-only leaves forwarding alone.
- Signal 2 (ip_nonlocal_bind boot-race): sidestepped, because there is no ListenAddress; sshd binds nothing new.
- Signal 3 (a host's only mgmt path must not depend on a service it hosts): satisfied, because ubongo is not the coordinator and keeps three independent paths (mesh, LAN, physical console).
- Signal 6 (recovery tested after the break-glass was removed): the physical console is permanent (nothing to retire), and reboot-recovery is proven on a throwaway VM first.
New & changed code
Role base:
- roles/base/defaults/main.yml, add:
  - base__firewall_input_only: false. When true, the forward chain is policy accept (host-local input filtering only), for hosts that route/forward container or NAT traffic (e.g. the control node's Docker + libvirt-NAT) where a forward default-deny would break them.
  - base__firewall_admin_addrs: []. Extra LAN source IPs allowed to SSH (besides wt0 + ssh-from-control); for an operator workstation reaching the host over the LAN. Key-gated.
- roles/base/templates/nftables.conf.j2:
  - the forward line (currently line 21) → chain forward { type filter hook forward priority 0; policy {{ "accept" if base__firewall_input_only | bool else "drop" }}; }
  - after the ssh-from-control block (currently lines 12-14), add a loop: {% for addr in base__firewall_admin_addrs %} → ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept
- roles/base/molecule/default/{converge,verify}.yml: fixture sets input_only: true + an admin_addrs entry; assert (a) forward renders policy accept, (b) the admin-addr accept rule renders, (c) existing input default-deny + wt0 + control-addr assertions stay green.
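The two template changes can be sketched in plain Python to show what the rendered lines should look like for a given pair of variables. This is only a model of the Jinja logic described above, not the template itself; render_fragment is a hypothetical helper, and the default port of 22 stands in for base__firewall_ssh_port.

```python
from typing import Sequence


def render_fragment(input_only: bool, admin_addrs: Sequence[str], ssh_port: int = 22) -> str:
    """Build the two variable-dependent pieces of the ruleset as plain text.

    Mirrors the described template logic: one accept rule per admin address,
    then a forward chain whose policy depends on the input-only knob.
    """
    lines = [f"ip saddr {addr} tcp dport {ssh_port} accept" for addr in admin_addrs]
    policy = "accept" if input_only else "drop"
    lines.append(
        f"chain forward {{ type filter hook forward priority 0; policy {policy}; }}"
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # ubongo's inventory values from this spec.
    print(render_fragment(True, ["10.20.10.50"]))
```

With the defaults (input_only false, empty admin_addrs) the sketch emits only the forward chain with policy drop, which matches the stated intent that existing hosts render unchanged.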
Inventory (inventories/production/group_vars/control/vars.yml, append):
# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0, the
# ssh-from-control self-path (base__firewall_control_addr in group_vars/all), or mamba on the
# LAN. Break-glass: the physical console.
base__firewall_input_only: true
base__firewall_admin_addrs:
- "10.20.10.50" # mamba over the LAN (NetBird off). Raw DHCP lease — see note below.
# base__firewall_apply defaults true; base__firewall_control_addr (= ubongo's own 10.20.10.151)
# is set in group_vars/all and covers Ansible's self-connection.
Integration harness (ADR-025) — a "be ubongo" profile, mirroring "be askari":
- tests/integration/overrides/ubongo.yml: firewall_apply: true, input_only: true, admin_addrs: ["192.168.150.99"] (a representative LAN addr to exercise the rule), firewall_control_addr: "192.168.150.1" (the libvirt-NAT gateway = the harness's own SSH path, so the apply + reboot don't lock it out), ssh_listen_mesh_only: false, mesh_enabled: false.
- tests/integration/profiles/ubongo.json: mirror profiles/askari.json (VM resources/image).
- tests/integration/verify.yml: make the assertions profile-aware (gated on the active profile, since verify.yml is shared): for ubongo assert input policy drop, forward policy accept, and the admin-addr rule present. Reachability across the reboot is the harness's existing cycle. The askari assertions (Docker/forward-DNAT) must not run for the ubongo profile, nor vice-versa.
Enables make test-integration HOST=ubongo.
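The profile-gated assertions can be illustrated with a small Python sketch that inspects ruleset text. It is a model of the checks, not the contents of verify.yml. It assumes the text passed in is base's own table as printed by nft (Docker's and libvirt's tables excluded); the helper names and the sample table name are hypothetical.

```python
import re
from typing import List, Optional


def chain_policy(ruleset: str, chain: str) -> Optional[str]:
    """Return the policy of the named base chain, or None if it is not found."""
    match = re.search(
        r"chain\s+" + re.escape(chain) + r"\s*\{[^}]*?policy\s+(\w+)\s*;",
        ruleset,
    )
    return match.group(1) if match else None


def ubongo_failures(ruleset: str, admin_addr: str, ssh_port: int = 22) -> List[str]:
    """Collect failed ubongo-profile assertions (an empty list means green)."""
    failures = []
    if chain_policy(ruleset, "input") != "drop":
        failures.append("input policy is not drop")
    if chain_policy(ruleset, "forward") != "accept":
        failures.append("forward policy is not accept")
    if f"ip saddr {admin_addr} tcp dport {ssh_port} accept" not in ruleset:
        failures.append("admin-addr SSH rule missing")
    return failures


def verify(profile: str, ruleset: str, admin_addr: str) -> List[str]:
    """Run only the assertions that belong to the active profile."""
    if profile == "ubongo":
        return ubongo_failures(ruleset, admin_addr)
    # askari's Docker/forward-DNAT assertions would live in their own branch.
    return []


SAMPLE = """table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ip saddr 192.168.150.99 tcp dport 22 accept
    }
    chain forward {
        type filter hook forward priority 0; policy accept;
    }
}"""
```

Calling verify("ubongo", SAMPLE, "192.168.150.99") returns an empty list; flipping the forward policy in the sample to drop produces a single failure naming the forward chain.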
The mamba admin-addr — a deliberately interim value
base__firewall_admin_addrs: ["10.20.10.50"] is mamba's current raw DHCP lease, not a
reservation (operator decision, 2026-06-19). Caveats, accepted for now:
- Lease drift: if DHCP reassigns 10.20.10.50, the rule allows whatever host then holds it (still SSH-key-gated, so low risk) and mamba loses its LAN path. Backstop: mamba also reaches ubongo over wt0 (mesh), so it is never cut off; only the off-mesh LAN convenience lapses until the IP is corrected.
- Revisit trigger: when OPNsense-as-code lands (ADR-020 perimeter layer), replace this with a DHCP reservation (MAC → fixed IP) and allow the reserved address. Tracked here and in the implementation plan's follow-ups.
Testing
- Molecule (base default, render-only, firewall_apply: false): the new forward-accept + admin-addr assertions above, with existing assertions green.
- Integration harness (make test-integration HOST=ubongo): on a throwaway UEFI VM, apply the ubongo overlay, assert the ruleset shape, and prove SSH survives a reboot from an allowed source (the existing assert/cycle). This is the gate before touching the real control node.
- Live (during cutover): SSH over wt0 ✓, from mamba LAN ✓, Ansible self-ping ✓; SSH from a disallowed LAN host dropped ✓; docker run … egress ✓; a fresh make test-integration still spins a VM (libvirt NAT intact) ✓.
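For the live reachability checks, a plain TCP probe is often enough to tell an allowed path from a dropped one. The sketch below is a hypothetical helper, not part of the harness. Because the input policy is drop rather than reject, a disallowed source should see a timeout instead of an immediate refusal, so the timeout value matters. The probe has to be run from the vantage point under test (mamba, a mesh peer, or a disallowed LAN host).

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout_s.

    Timeouts (silently dropped packets) and refusals both count as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # ubongo's LAN address from this spec; expected result depends on the source host.
    print(tcp_reachable("10.20.10.151", 22))
```

This only shows that the port answers; it does not replace an actual SSH login or the Ansible self-ping, which also exercise key authentication.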
Staged cutover (operator-supervised — lockout-aware, FRICTION signal-6 order)
ubongo is managed as sjat (password sudo), so the live apply needs the operator present
anyway. The physical console is open throughout.
- Harness GREEN: make test-integration HOST=ubongo passes (incl. the reboot).
- Pre-check the real paths before applying: SSH over wt0, SSH from mamba (10.20.10.50), ansible ubongo -m ping. Confirm the physical console is reachable.
- Dry-run: make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall, then review the nftables diff (input default-deny + wt0 + 10.20.10.151 + 10.20.10.50; forward policy accept).
- Apply (auto-rollback armed): make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall. The firewall concern snapshots, arms the 45 s revert, applies, reset_connection → wait_for_connection over the live path (10.20.10.151), then cancels the timer. A bad ruleset reverts itself; the console is the ultimate fallback.
- Verify every path + Docker egress + a fresh integration-VM spin (above).
- Reboot ubongo; confirm SSH returns on all paths unaided (console present). Only now is it done: recovery is proven while the break-glass is still there.
- Docs: update STATUS.md (ubongo row: input-only default-deny applied) and ROADMAP.md (mesh-hardening 2/3 done; next is sub-project 1 askari redesign or 3 NetBird ACL).
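The auto-rollback used in the apply step follows a general arm-then-confirm pattern: schedule the revert first, make the change, and cancel the revert only once the management path is proven to work. The Python sketch below models that pattern in-process with threading.Timer. It is not base's implementation (which runs on the host and is described above only as snapshot, 45 s revert, apply, reconnect, cancel); the function and parameter names are illustrative.

```python
import threading
from typing import Callable


def apply_with_rollback(
    apply: Callable[[], None],
    revert: Callable[[], None],
    confirm: Callable[[], bool],
    timeout_s: float = 45.0,
) -> bool:
    """Arm a revert timer, apply the change, and keep it only if confirm() succeeds.

    Returns True if the change was kept, False if it was reverted.
    """
    reverted = threading.Event()

    def _revert() -> None:
        revert()
        reverted.set()

    timer = threading.Timer(timeout_s, _revert)
    timer.start()  # armed BEFORE the risky change is made
    try:
        apply()
        ok = bool(confirm())  # e.g. reconnect over the live management path
    except Exception:
        ok = False
    if ok:
        timer.cancel()  # the path works, so disarm the revert
    # If confirm failed, this waits for the timer to fire and revert the change.
    timer.join()
    return not reverted.is_set()
```

The ordering is the point of the design: because the revert is armed before the change, losing the connection mid-apply needs no further action from the operator for the old ruleset to come back.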
Risks & rollback
- Self-referential apply (ubongo runs Ansible against itself): mitigated by the auto-rollback timer, the wait_for_connection over the real path, three redundant allowed sources, and the permanent physical console. ubongo cannot be bricked.
- Raw-lease fragility: documented above; backstopped by the mesh path; revisit with OPNsense.
- No new container isolation (forward stays accept): accepted. ubongo is a single-tenant control node, not a service host; Docker/libvirt keep their own forward rules. The forward default-deny remains the norm for real service hosts (base__firewall_input_only: false).
Out of scope / follow-ons
- askari SSH → wt0 redesign (sub-project 1): needs the boot-race + coordinator-bootstrap resolved; folds in the coordinator-robustness (geo-DB FATAL-loop) + off-site backup lessons.
- NetBird ACL off Allow-All (sub-project 3): open mechanism question (no headless API path).
- OPNsense DHCP reservation for mamba (and ubongo) — replaces the raw lease; with OPNsense-as-code.
- Forward-chain container isolation on ubongo — deliberately not done here.
- STATUS.md / ROADMAP.md edits land with the implementation, not this spec.