Compare commits


10 commits

Author SHA1 Message Date
180af46879 docs(friction): log the Molecule input_only-accept coverage gap
Final-review finding: the default Molecule scenario only renders the forward
drop (input_only off) branch; the accept branch is covered by the integration
harness only. Tracked for a kaizen decision (2nd scenario vs accept the split).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:40:29 +02:00
8d8c86fa39 docs(friction): VM-testing standard + libvirt stale-session gotcha
Two signals from running the ubongo harness gate: (1) the operator wants a
standard pre-authorising isolated VM integration tests on ubongo so the agent
doesn't ask each time; (2) a stale agent session (shell predating the
integration_test libvirt-group grant) carries stale process groups, so the
harness's qemu-img/file writes are denied -> run via 'sg libvirt -c ...';
self-heal idea noted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
468f8c3a92 fix(integration): match live nft priority filter in the ubongo verify
`nft list ruleset` prints the symbolic chain priority (`filter` = 0); the ubongo
profile asserted `priority 0` (the rendered-file format the Molecule scenario
checks), so the live-ruleset assertion failed even though the firewall was
correct. Assert `priority filter` for the input/forward policy lines. Caught by
the harness GREEN gate (`make test-integration HOST=ubongo`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
26bb7e442d fix(integration): pin system python for virt-install (venv PATH hijack)
The Makefile prepends .venv/bin to PATH (so the venv's ansible tools resolve),
but virt-install's `#!/usr/bin/env python3` shebang then resolved to the
isolated venv, which lacks system PyGObject (gi) -> ModuleNotFoundError. Strip
.venv/bin from PATH for the virt-install call so its shebang finds
/usr/bin/python3 (which has gi); ansible runs via its absolute .venv path and is
unaffected. Surfaced running `make test-integration HOST=ubongo`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
6ac5afaf67 test(integration): add the 'be ubongo' profile (input-only default-deny)
A control-group VM that applies base with INPUT-only default-deny (forward
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
integration_profile marker — the askari Docker/DNAT block is gated, and a ubongo
block asserts input drop + forward accept + the admin-addr rule. Enables
`make test-integration HOST=ubongo`. Mesh-hardening 2/3 (ADR-025).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:52:17 +02:00
b3e14decb4 feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH
Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
raw leases, backstopped by wt0). Mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:42:49 +02:00
b10a33f439 feat(base): input-only forward policy + admin-addr SSH allow
base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:37:06 +02:00
66a9a0af08 docs: ubongo admin-addrs add 10.20.10.17 + flag raw-lease follow-up
Allow a second operator workstation (10.20.10.17) onto ubongo's LAN SSH
alongside mamba (10.20.10.50). Both are raw DHCP leases; recorded a FRICTION
open signal to replace them with MAC-pinned OPNsense reservations when
OPNsense-as-code lands (ADR-020 / TODO 3.5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:26:04 +02:00
e14e347047 docs(plan): mesh-hardening 2/3 — ubongo implementation plan
Five tasks: base knobs (input-only forward policy + admin-addr SSH allow,
TDD via Molecule) → enable on the control group → a 'be ubongo' integration
profile (profile-aware verify) → the real-VM harness GREEN gate → the
operator-supervised live cutover (signal-6 order, physical-console break-glass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:26:04 +02:00
24a1d909c9 docs(spec): mesh-hardening 2/3 — ubongo INPUT-only default-deny
Sub-project 2 of the mesh-hardening follow-on (the post-incident roadmap
ordering puts ubongo first). Harden the control node's inbound surface via
base's nftables firewall as INPUT-only default-deny: the forward chain stays
permissive (new base__firewall_input_only knob) so Docker egress + the
libvirt-NAT integration harness keep working, and there is no sshd ListenAddress
change — sidestepping the ip_nonlocal_bind boot-race that sank askari. SSH
allowed from wt0, ssh-from-control (Ansible self), and mamba on the LAN (new
base__firewall_admin_addrs). Harness-validated before an operator-supervised
cutover; the physical console is the permanent break-glass.

Design maps to the four relevant 2026-06-17 incident lessons (FRICTION signals
1/2/3/6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:12:58 +02:00
13 changed files with 850 additions and 10 deletions
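Commit 26bb7e442d above describes stripping `.venv/bin` from `PATH` for the virt-install call so its `#!/usr/bin/env python3` shebang resolves to the system interpreter. A minimal sketch of that idea follows; `path_without` and `env_for_system_python` are hypothetical helper names, and the actual harness code is not shown on this page.

```python
import os


def path_without(suffix, path=None):
    """Return PATH with every entry ending in `suffix` removed (hypothetical helper)."""
    path = os.environ.get("PATH", "") if path is None else path
    kept = [p for p in path.split(os.pathsep) if not p.rstrip("/").endswith(suffix)]
    return os.pathsep.join(kept)


def env_for_system_python(env=None):
    """Copy the environment, dropping the venv's bin dir from PATH.

    Assumption: the venv lives in a directory named `.venv`, as in the Makefile
    described by the commit message.
    """
    env = dict(os.environ if env is None else env)
    env["PATH"] = path_without(".venv/bin", env.get("PATH", ""))
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run` for the virt-install call only, leaving the venv-first `PATH` in place for the Ansible tools.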


@ -146,6 +146,57 @@ harness on ubongo and shaking it down against real KVM (spec/plan in docs/superp
the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
cross-file review AND a real-hardware run; neither alone would have shipped it working.
<!-- From the 2026-06-19 mesh-hardening-2/3 design (ubongo INPUT-only default-deny). -->
- `[friction]` **Raw DHCP leases pinned in ubongo's host firewall (admin-addr SSH allows)**
(2026-06-19): mesh-hardening 2/3 lets the operator workstations reach ubongo's LAN SSH by
*raw lease* (`base__firewall_admin_addrs: ["10.20.10.50" (mamba), "10.20.10.17"]`), because
there is no DHCP reservation yet (OPNsense isn't managed as code). A lease reassignment
silently moves the allow to whatever host next holds the IP (still SSH-key-gated) and drops
the workstation's *LAN* path (mesh still works, so never a full lockout). → when
OPNsense-as-code lands (ADR-020 perimeter / TODO 3.5), replace both with **MAC-pinned DHCP
reservations** (`10.20.10.17` = MAC `bc:0f:f3:c8:4a:8a`; mamba's MAC TBD) and allow the
reserved IPs. Spec: `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`.
- `[gotcha]` **`make test-integration` on ubongo fails (`qemu-img` "Permission denied") when
the agent session predates the `libvirt` group grant** (2026-06-19): the `integration_test`
role adds `claude` to `libvirt`+`kvm` and makes the cache dir `/var/lib/boma-integration`
`root:libvirt 2775` — correct — but a `claude` session whose shell started *before* that
grant carries a stale process group set (`id` → `claude,docker` only, no `libvirt`), so
`qemu-img create` of the VM overlay into the group-owned dir is denied. `virsh`/`virt-install`
still work (they reach system libvirtd via polkit/socket, and the real KVM runs server-side
as `libvirt-qemu`), so ONLY claude's own file-writes break. Unblock without restarting the
session: **`sg libvirt -c 'make test-integration HOST=<name>'`** (claude needs only `libvirt`
for the dir; `kvm` is server-side; note `sg` adds one group, not the full set). → self-heal
in `scripts/integration-vm.py`: if the `libvirt` gid is absent from `os.getgroups()`, re-exec
under `sg libvirt` (or have the Makefile target do it), so a stale-session agent never hits
this opaque symptom. New agent sessions pick the groups up on login, so it's a stale-session
transient — but high-confusion, worth self-healing.
- `[friction]` **No standard for when the agent may run local-VM integration tests on ubongo
without asking** (2026-06-19): `make test-integration HOST=<name>` spins an ISOLATED throwaway
KVM VM (its own libvirt NAT; never touches the real host's firewall/network; guards:
one-VM-at-a-time + a 4 GiB free-RAM floor + auto-destroy on success), so it is safe and
self-contained — yet the agent paused for a go-ahead before running it (mesh-hardening 2/3,
Task 4). The operator wants a STANDARD that pre-authorises VM-testing on ubongo so the agent
just runs it. → decide + record the rule: e.g. a `.claude/settings.json` permission allow for
`make test-integration*` / `scripts/integration-vm.py` (and the `sg libvirt -c '…'` form per
the gotcha above), plus a CLAUDE.md line distinguishing the pre-authorised isolated VM tests
from the genuinely-gated live steps (`make deploy` to real hosts, host reboots, cutovers —
still need a go-ahead). Ties to the `test-risky-infra-before-live-deploy` +
`dont-reask-settled-defaults` memories + ADR-025.
- `[gotcha]` **Molecule covers only the `input_only`-OFF (forward drop) branch of the base
firewall** (2026-06-19): mesh-hardening 2/3 added `base__firewall_input_only` (forward policy
drop↔accept). The `default` Molecule scenario renders ONE fixture, set to the secure default
(drop) — so the fast `make test ROLE=base` gate locks the drop default (security-critical for
service hosts) but does NOT exercise the `=true` → forward-`accept` rendering; only `make
test-integration HOST=ubongo` does (passed GREEN). An in-converge re-render can't cheaply
cover it (role defaults aren't in scope outside the role run). → decide in kaizen: a second
Molecule scenario (`molecule/input-only/`) asserting forward `policy accept`, vs accepting the
integration-only coverage. Final-review finding; not a cutover blocker (the accept branch is a
literal, and a var-name break would fail the drop branch too → caught).
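The self-heal idea from the stale-session gotcha above (re-exec under `sg libvirt` when the `libvirt` gid is absent from `os.getgroups()`) could look roughly like this. It is a sketch under stated assumptions, not the code in `scripts/integration-vm.py`: the function names and the `BOMA_SG_REEXEC` loop-guard variable are hypothetical.

```python
import grp
import os
import shlex
import sys


def missing_group(group_name, current_gids=None, lookup=grp.getgrnam):
    """True if the group exists on this host but is absent from this process's groups."""
    if current_gids is None:
        gids = set(os.getgroups()) | {os.getegid()}
    else:
        gids = set(current_gids)
    try:
        gid = lookup(group_name).gr_gid
    except KeyError:
        return False  # group not defined here: nothing to heal
    return gid not in gids


def sg_command(group_name, argv):
    """Build the argv that re-runs `argv` as `sg <group> -c '<quoted command>'`."""
    return ["sg", group_name, "-c", shlex.join(argv)]


def reexec_under_group(group_name="libvirt", argv=None, marker="BOMA_SG_REEXEC"):
    """Re-exec once under `sg` when the session predates the group grant."""
    argv = [sys.executable, *sys.argv] if argv is None else argv
    if os.environ.get(marker) or not missing_group(group_name):
        return  # already re-exec'd, or the group is present
    os.environ[marker] = "1"  # hypothetical guard against a re-exec loop
    cmd = sg_command(group_name, argv)
    os.execvp(cmd[0], cmd)  # replaces the current process; does not return
```

Calling `reexec_under_group()` at the top of the script's entry point would make a stale-session run behave like the manual `sg libvirt -c '…'` workaround. As the gotcha notes, `sg` adds one group only, which is enough here because the cache dir needs `libvirt` alone.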
---
## Kaizen reviews — decisions ledger


@ -0,0 +1,470 @@
# Mesh-hardening 2/3 — ubongo INPUT-only default-deny — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Apply base's nftables firewall to the control node (ubongo) as an INPUT-only default-deny — hardening its inbound surface — while leaving the forward chain permissive so Docker egress and the libvirt-NAT integration harness keep working, and without any sshd `ListenAddress` change.
**Architecture:** Two new `base` knobs make the existing firewall concern fit a control node: `base__firewall_input_only` flips the forward chain to `policy accept` (host-local input filtering only), and `base__firewall_admin_addrs` adds operator-workstation LAN sources to the SSH allow-list (alongside `wt0` and `ssh-from-control`). sshd is untouched (nftables does the scoping → no `ip_nonlocal_bind` boot-race). The change is validated on a throwaway VM via the ADR-025 integration harness (a new "be ubongo" profile) before an operator-supervised live cutover whose safety net is the firewall auto-rollback timer plus the permanent on-prem physical console.
**Tech Stack:** Ansible (role `base`, FQCN), nftables, Jinja2, Molecule on Debian 13, pytest (none new), the ADR-025 integration harness (`scripts/integration-vm.py`, JSON profiles, `-e @` overlays).
**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`
**Conventions:** `make lint` and `make test ROLE=base` before each commit; `make check` before `make deploy`; never hand-edit the generated `offsite.yml`; `rbw unlocked` for any commit touching Ansible content and for the integration/live applies (the production `group_vars/all/vault.yml` is in inventory scope and gets decrypted at playbook load). Tasks 1–3 are code (subagent-driven, each lint/Molecule-verified). Task 4 is a real-VM validation gate on ubongo. Task 5 is the live, operator-supervised cutover.
---
## File Structure
| File | Create/Modify | Responsibility |
|---|---|---|
| `roles/base/defaults/main.yml` | Modify | Declare `base__firewall_input_only` + `base__firewall_admin_addrs` (defaults: off / empty). |
| `roles/base/templates/nftables.conf.j2` | Modify | Conditional forward policy; render an SSH-allow rule per admin address. |
| `roles/base/molecule/default/converge.yml` | Modify | Fixture: an admin-addr source (input-only stays at its default → forward drop). |
| `roles/base/molecule/default/verify.yml` | Modify | Assert forward-drop default + the admin-addr rule render. |
| `inventories/production/group_vars/control/vars.yml` | Modify | Turn the knobs on for ubongo (input-only; mamba's LAN IP). |
| `tests/integration/overrides/ubongo.yml` | Create | The "be ubongo" overlay (input-only firewall; harness SSH lifeline). |
| `tests/integration/profiles/ubongo.json` | Create | The "be ubongo" VM profile (group `control`, applies `site.yml:base`). |
| `tests/integration/overrides/askari.yml` | Modify | Add the `integration_profile` marker (verify is now profile-aware). |
| `tests/integration/verify.yml` | Modify | Gate the askari (Docker/DNAT) block; add the ubongo (input-only) block + a guard. |
| `STATUS.md`, `docs/ROADMAP.md` | Modify (Task 5) | Record mesh-hardening 2/3 done. |
---
### Task 1: base role — `base__firewall_input_only` (forward policy) + `base__firewall_admin_addrs` (LAN SSH allow)
**Files:**
- Modify: `roles/base/defaults/main.yml`
- Modify: `roles/base/templates/nftables.conf.j2`
- Modify: `roles/base/molecule/default/converge.yml`
- Modify: `roles/base/molecule/default/verify.yml`
> **Test strategy (note):** Molecule renders one fixture, so it locks the *secure default*
> `input_only` **off** → forward `policy drop` — plus the new admin-addr rule (red→green). The
> `input_only` **on** → forward `policy accept` path is exercised on a real VM by the
> integration "be ubongo" profile (Tasks 3–4), whose verify fails red until this template
> conditional exists. Both branches are covered, across the two test layers.
- [ ] **Step 1: Write the failing test (extend Molecule verify)**
In `roles/base/molecule/default/verify.yml`, after the `Assert the docker_host extension hook is present` block, add:
```yaml
    # Indentation assumes these tasks sit directly under the play's `tasks:` key.
    - name: Assert the forward chain defaults to policy drop (input_only off)
      ansible.builtin.assert:
        that:
          - "'hook forward priority 0; policy drop;' in nft"
        fail_msg: >-
          forward chain must default to policy drop when base__firewall_input_only is
          false (container isolation stays the norm on real service hosts)
    - name: Assert the admin-addr SSH allow rule (operator workstation on the LAN)
      ansible.builtin.assert:
        that:
          - "'ip saddr 10.30.0.77 tcp dport 22 accept' in nft"
        fail_msg: "missing admin-addr SSH allow rule from base__firewall_admin_addrs"
```
- [ ] **Step 2: Add the fixture that drives it (Molecule converge)**
In `roles/base/molecule/default/converge.yml`, add to the `vars:` block (after the `base__firewall_control_addr` line):
```yaml
    base__firewall_admin_addrs:
      - "10.30.0.77"   # fixture: an operator-workstation LAN source (admin-addr SSH allow)
```
- [ ] **Step 3: Run the test to verify it fails**
Run: `make test ROLE=base`
Expected: FAIL on `Assert the admin-addr SSH allow rule` (the template does not consume `base__firewall_admin_addrs` yet, so the `ip saddr 10.30.0.77 …` rule is absent). The forward-drop assertion passes already (the template currently hardcodes `policy drop`).
- [ ] **Step 4: Add the defaults**
In `roles/base/defaults/main.yml`, after the `base__firewall_apply: true` line (end of the firewall behaviour block, currently line 13), add:
```yaml
base__firewall_input_only: false   # true → the forward chain is `policy accept` (host-local
                                   # INPUT filtering only). For hosts that forward/route
                                   # container or NAT traffic (the control node's Docker +
                                   # libvirt-NAT) where a forward default-deny would break
                                   # them. Real service hosts keep this false (forward drop).
base__firewall_admin_addrs: []     # extra LAN source IPs allowed to SSH, besides wt0 +
                                   # ssh-from-control. For an operator workstation reaching
                                   # the host over the LAN (no mesh). Key-gated. (ADR-021)
```
- [ ] **Step 5: Make the forward policy conditional + render the admin-addr rules**
In `roles/base/templates/nftables.conf.j2`:
(a) Replace the forward-chain line (currently line 21):
```jinja
chain forward { type filter hook forward priority 0; policy {{ 'accept' if base__firewall_input_only | bool else 'drop' }}; }
```
(b) After the `ssh-from-control` `{% endif %}` (currently line 14) and before the `ip protocol icmp accept` line, add the admin-addr loop:
```jinja
{% for addr in base__firewall_admin_addrs %}
ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endfor %}
```
- [ ] **Step 6: Run the test to verify it passes**
Run: `make test ROLE=base`
Expected: PASS — converge renders the ruleset; verify confirms the forward chain is `policy drop` (input_only defaults false) and the `ip saddr 10.30.0.77 tcp dport 22 accept` rule is present; all pre-existing assertions stay green.
- [ ] **Step 7: Lint**
Run: `make lint`
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.
- [ ] **Step 8: Commit**
```bash
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): input-only forward policy + admin-addr SSH allow

base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
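The two template changes in Step 5 (the conditional forward policy and the per-address SSH allow loop) can be mirrored in plain Python to make the expected rendering concrete. This is an illustrative sketch only: `render_firewall_lines` is a hypothetical name, and the real rendering is done by Jinja2 in `nftables.conf.j2`.

```python
def render_firewall_lines(input_only=False, admin_addrs=(), ssh_port=22):
    """Mirror the Step 5 template logic: admin-addr SSH allows plus the forward chain."""
    # Same decision as the Jinja expression: 'accept' if input_only else 'drop'.
    forward_policy = "accept" if input_only else "drop"
    # One SSH allow rule per admin address, like the {% for addr in ... %} loop.
    lines = [f"ip saddr {addr} tcp dport {ssh_port} accept" for addr in admin_addrs]
    lines.append(
        f"chain forward {{ type filter hook forward priority 0; policy {forward_policy}; }}"
    )
    return lines
```

With the Molecule fixture (`admin_addrs=["10.30.0.77"]`, `input_only` left at its default) this yields the two strings the Step 1 assertions look for; with `input_only=True` the forward chain line switches to `policy accept`, which is the branch the integration profile covers.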
---
### Task 2: inventory — enable input-only default-deny + mamba on ubongo (control group)
**Files:**
- Modify: `inventories/production/group_vars/control/vars.yml`
- [ ] **Step 1: Turn the knobs on for the control group**
Append to `inventories/production/group_vars/control/vars.yml`:
```yaml
# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0 (mesh), the
# ssh-from-control self-path (base__firewall_control_addr, group_vars/all = 10.20.10.151), or
# mamba on the LAN. Break-glass: the physical console. (base__firewall_apply defaults true.)
base__firewall_input_only: true
base__firewall_admin_addrs:
  - "10.20.10.50"   # mamba over the LAN (NetBird off). Raw DHCP lease — revisit with an
                    # OPNsense reservation when OPNsense-as-code lands; backstopped by wt0.
  - "10.20.10.17"   # 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.
```
- [ ] **Step 2: Verify the vars resolve for ubongo**
Run: `.venv/bin/ansible-inventory -i inventories/production/ --host ubongo 2>/dev/null | grep -E 'firewall_input_only|firewall_admin_addrs|10.20.10.(50|17)'`
Expected: shows `"base__firewall_input_only": true` and `"base__firewall_admin_addrs": ["10.20.10.50", "10.20.10.17"]`.
- [ ] **Step 3: Lint**
Run: `make lint`
Expected: clean pass (`check-tags: OK`).
- [ ] **Step 4: Commit**
```bash
git add inventories/production/group_vars/control/vars.yml
git commit -m "feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH

Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
raw leases, backstopped by wt0). Mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
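The `grep` in Step 2 only shows that the two variables appear in the output. A stricter check could parse the JSON that `ansible-inventory --host` prints and validate the values. The sketch below assumes the host vars have already been loaded into a dict; `check_control_hostvars` is a hypothetical helper, not part of the repository.

```python
import ipaddress


def check_control_hostvars(hostvars):
    """Return a list of problems with the two firewall knobs (empty list means OK)."""
    problems = []
    if hostvars.get("base__firewall_input_only") is not True:
        problems.append("base__firewall_input_only is not true")
    addrs = hostvars.get("base__firewall_admin_addrs")
    if not isinstance(addrs, list) or not addrs:
        problems.append("base__firewall_admin_addrs is empty or not a list")
        return problems
    for addr in addrs:
        try:
            ipaddress.ip_address(addr)  # rejects hostnames and typos
        except ValueError:
            problems.append(f"not an IP address: {addr!r}")
    return problems
```

One way to feed it would be `json.loads` on the captured stdout of the `ansible-inventory` command from Step 2; an empty result list corresponds to the expected output described there.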
---
### Task 3: integration harness — "be ubongo" profile (overlay + profile + profile-aware verify)
**Files:**
- Create: `tests/integration/overrides/ubongo.yml`
- Create: `tests/integration/profiles/ubongo.json`
- Modify: `tests/integration/overrides/askari.yml`
- Modify: `tests/integration/verify.yml`
- [ ] **Step 1: Create the "be ubongo" overlay**
Create `tests/integration/overrides/ubongo.yml`:
```yaml
---
# Integration-test overlay for the "ubongo" profile (ADR-025). Passed via `-e @`.
# Exercises mesh-hardening 2/3: base's INPUT-only default-deny on the control node — input
# chain default-deny, forward chain left permissive (Docker/libvirt-NAT safe), no sshd
# ListenAddress change (so no boot-race).
integration_profile: ubongo
base__firewall_apply: true
base__firewall_input_only: true # forward chain renders `policy accept`
base__firewall_admin_addrs:
  - "192.168.150.98"   # two representative LAN sources — exercises the
  - "192.168.150.99"   # admin-addr loop with a multi-entry list (like ubongo)
# Never wt0-only; never touch the real mesh from a throwaway VM.
base__ssh_listen_mesh_only: false
base__mesh_enabled: false
# Allow SSH from the libvirt-NAT gateway (where the driver/ansible connect from) so the
# default-deny apply + the reboot don't lock out the harness. By source IP (interface-
# independent). This is the harness's lifeline; the admin-addr above is only exercised.
base__firewall_control_addr: "192.168.150.1"
```
- [ ] **Step 2: Create the "be ubongo" VM profile**
Create `tests/integration/profiles/ubongo.json`:
```json
{
  "groups": ["control"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]}
  ],
  "extra_vars_files": ["overrides/ubongo.yml"],
  "mem_mib": 2048,
  "vcpus": 2
}
```
- [ ] **Step 3: Mark the askari overlay with its profile name**
In `tests/integration/overrides/askari.yml`, after the two header comment lines (before `base__firewall_apply: true`), add:
```yaml
integration_profile: askari
```
- [ ] **Step 4: Make `verify.yml` profile-aware (the test)**
Replace the entire contents of `tests/integration/verify.yml` with:
```yaml
---
# Integration verify (ADR-025). Outcome-based, profile-aware: the active profile is named by
# `integration_profile` (set in each profile's overlay). Each profile asserts its own success
# criteria; an unknown/unset profile fails loudly (never a silent pass).
- name: Verify the rebooted host
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: A known integration_profile must be set (no silent pass)
      ansible.builtin.assert:
        that:
          - integration_profile is defined
          - integration_profile in ['askari', 'ubongo']
        fail_msg: "integration_profile must be set in the profile overlay (askari|ubongo)"
    # ── askari profile — Docker host: published-port forwarding survives the reboot ──
    # The load-bearing check probes the VM's published :80 FROM the controller (ubongo) — if
    # base's forward-drop killed DNAT, this times out (the FRICTION 2026-06-17 #1 bug).
    - name: (askari) Gather service facts
      when: integration_profile == 'askari'
      ansible.builtin.service_facts:
    - name: (askari) Docker daemon is active
      when: integration_profile == 'askari'
      ansible.builtin.assert:
        that: "ansible_facts.services['docker.service'].state == 'running'"
        fail_msg: "docker.service is not running"
    - name: (askari) Forward chain permits container traffic (drop-in loaded)
      when: integration_profile == 'askari'
      ansible.builtin.command: nft list chain inet filter forward
      register: _fwd
      changed_when: false
    - name: (askari) Assert container forwarding is allowed (not pure drop)
      when: integration_profile == 'askari'
      ansible.builtin.assert:
        that: "'accept' in _fwd.stdout"
        fail_msg: >-
          forward chain is pure drop — container forwarding will die on reboot
          (FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.
    - name: (askari) Published port answers from the controller (DNAT + forward alive)
      when: integration_profile == 'askari'
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}/"
        follow_redirects: none
        status_code: [200, 301, 308, 404, 502, 503]
        timeout: 10
      register: _probe
      retries: 5
      delay: 6
      until: _probe is succeeded
    # ── ubongo profile — control node: INPUT-only default-deny survives the reboot ──
    # SSH reachability across the reboot is proven by the harness itself (it re-SSHes and
    # checks boot_id changed before this verify runs). Here we assert the ruleset shape.
    - name: (ubongo) Read the live nftables ruleset
      when: integration_profile == 'ubongo'
      ansible.builtin.command: nft list ruleset
      register: _nft
      changed_when: false
    - name: (ubongo) INPUT default-deny, forward permissive, admin-addr allow
      when: integration_profile == 'ubongo'
      ansible.builtin.assert:
        that:
          # `nft list ruleset` prints the symbolic chain priority (`filter` = 0), so the live
          # check matches `priority filter`, not the rendered-file `priority 0` Molecule checks.
          - "'hook input priority filter; policy drop;' in _nft.stdout"
          - "'hook forward priority filter; policy accept;' in _nft.stdout"
          - "'ip saddr 192.168.150.98 tcp dport 22 accept' in _nft.stdout"
          - "'ip saddr 192.168.150.99 tcp dport 22 accept' in _nft.stdout"
        fail_msg: >-
          ubongo profile: expected input policy drop, forward policy accept (input-only),
          and both admin-addr (192.168.150.98/99) SSH allows in the live ruleset.
```
- [ ] **Step 5: Validate the JSON + lint**
Run: `.venv/bin/python -m json.tool tests/integration/profiles/ubongo.json >/dev/null && echo OK` then `make lint`
Expected: `OK`, then a clean lint pass (`check-tags: OK`).
- [ ] **Step 6: Commit**
```bash
git add tests/integration/overrides/ubongo.yml tests/integration/profiles/ubongo.json \
  tests/integration/overrides/askari.yml tests/integration/verify.yml
git commit -m "test(integration): add the 'be ubongo' profile (input-only default-deny)

A control-group VM that applies base with INPUT-only default-deny (forward
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
integration_profile marker — the askari Docker/DNAT block is gated, and a ubongo
block asserts input drop + forward accept + the admin-addr rule. Enables
\`make test-integration HOST=ubongo\`. Mesh-hardening 2/3 (ADR-025).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
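The ubongo block of the verify boils down to a few string checks on the ruleset text. The sketch below restates them in Python and tolerates both spellings of the chain priority, since commit 468f8c3a92 notes that `nft list ruleset` prints `priority filter` where the rendered file has `priority 0`. The function names are hypothetical and the sample rulesets used to try it are up to the reader; the source does not show real `nft` output.

```python
import re


def chain_policy(ruleset, hook):
    """Return the policy of the base chain on `hook`, or None if it is not found."""
    # Accept the numeric (rendered file) and symbolic (live ruleset) priority forms.
    match = re.search(rf"hook {hook} priority (?:0|filter); policy (\w+);", ruleset)
    return match.group(1) if match else None


def verify_ubongo(ruleset, admin_addrs, ssh_port=22):
    """Return a list of failed expectations for the input-only profile."""
    failures = []
    if chain_policy(ruleset, "input") != "drop":
        failures.append("input chain is not policy drop")
    if chain_policy(ruleset, "forward") != "accept":
        failures.append("forward chain is not policy accept")
    for addr in admin_addrs:
        if f"ip saddr {addr} tcp dport {ssh_port} accept" not in ruleset:
            failures.append(f"missing SSH allow for {addr}")
    return failures
```

Matching on the policy rather than on one exact priority spelling is a design choice of this sketch; the Ansible assertions in `verify.yml` compare literal substrings instead.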
---
### Task 4: Validate on the integration harness (`make test-integration HOST=ubongo`) — the GREEN gate
> Runs a throwaway UEFI VM on ubongo: boots it, applies the base role with the ubongo
> overlay (INPUT-only default-deny), **reboots it**, and asserts the ruleset + SSH-returns.
> This proves the change survives a reboot before the real control node is ever touched
> (spec §cutover step 1; FRICTION signal-6). No code change / no commit — a validation gate.
- [ ] **Step 1: Ensure the vault is unlocked**
The run loads `inventories/production/group_vars/all/vault.yml` (symlinked into the run dir), which is decrypted at playbook load.
Run: `rbw unlocked || rbw unlock`
Expected: exits 0 (unlocked). If it prompts, the operator unlocks.
- [ ] **Step 2: Run the integration cycle**
Run: `make test-integration HOST=ubongo`
Expected (the `cycle`: up → apply → reboot → assert): the VM gets a `192.168.150.x` lease; `site.yml --tags base` applies cleanly; `… rebooted (boot_id changed), SSH back at 192.168.150.x`; then `VERIFY PASSED for boma-it-ubongo-…`. The VM is destroyed on success.
- [ ] **Step 3: On failure, read the diagnostics**
If it prints `VERIFY FAILED`, diagnostics are in `~/integration-runs/boma-it-ubongo-<id>/` (`nft.txt`, `console.log`, `journal.txt`). The likely suspects: the admin-addr/forward assertion (Task 1/3 wiring) or SSH not returning post-reboot (the `base__firewall_control_addr: 192.168.150.1` lifeline in the overlay). Fix the implicated task, re-commit, and re-run Step 2. Re-run `make test-integration-clean` first if a VM was left defined.
- [ ] **Step 4: Record the result**
Capture the `VERIFY PASSED` line in the task notes (this is the gate Task 5 step 1 depends on). No commit.
---
### Task 5: Live staged cutover (operator-supervised — NOT a subagent task)
> Touches the **real ubongo** (the control node Ansible runs from) and reboots it — lockout-
> risky. Run it interactively with the operator, in order, verifying each step before the
> next. The firewall auto-rollback timer (`base__firewall_rollback_timeout`, 45 s) +
> `wait_for_connection` over the live path is the safety net; the **on-prem physical console**
> is the permanent break-glass. Do NOT hand this to an unattended agent.
- [ ] **Step 1: Pre-checks (gate: Task 4 GREEN)**
- `rbw unlocked || rbw unlock`.
- SSH to ubongo over `wt0` from a road-warrior succeeds.
- SSH to ubongo from mamba on the LAN (`10.20.10.50`) succeeds.
- `.venv/bin/ansible ubongo -i inventories/production/ -m ping` → `SUCCESS` (over `10.20.10.151`).
- The physical console is reachable. If any path fails, STOP.
- [ ] **Step 2: Dry-run the firewall apply**
Run: `make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall`
Expected: the nftables diff shows `policy drop` on input, `iifname "wt0" … accept`, `ip saddr 10.20.10.151 … accept`, `ip saddr 10.20.10.50 … accept`, `ip saddr 10.20.10.17 … accept`, and the forward chain as `policy accept`. No errors.
- [ ] **Step 3: Apply the host firewall (auto-rollback armed)**
Run: `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall`
Expected: the firewall concern snapshots `/etc/nftables.rollback`, arms the 45 s `systemd-run` revert, applies the ruleset, runs `reset_connection` → `wait_for_connection` over `10.20.10.151` (which must succeed), then cancels the timer. If connectivity is lost, the timer reverts the ruleset within 45 s and the console is the fallback.
- [ ] **Step 4: Verify every path + forwarding still works**
```bash
# from a road-warrior over wt0, and from mamba on the LAN:
ssh sjat@100.99.146.14 true && echo "wt0 OK"
ssh sjat@10.20.10.151 true && echo "mamba-LAN OK" # run from mamba (10.20.10.50)
# Ansible self-path:
.venv/bin/ansible ubongo -i inventories/production/ -m ping
# a LAN host NOT in base__firewall_admin_addrs must now be refused/time out on :22
# (10.20.10.17 is on the allow-list, so it cannot serve as the negative test)
# Docker egress (forward chain still permissive):
docker run --rm busybox wget -qO- https://cloudflare.com/cdn-cgi/trace | head -1
# libvirt-NAT forwarding intact — a fresh integration VM still reaches apt:
make test-integration HOST=ubongo # expect VERIFY PASSED (proves the NAT path survived)
```
Expected: `wt0 OK`, `mamba-LAN OK`, Ansible `SUCCESS`, the disallowed host refused, the Docker egress line returns, and the integration cycle passes.
- [ ] **Step 5: Reboot resilience — while the console is present (FRICTION signal-6)**
With the operator at the physical console, reboot ubongo (`sudo systemctl reboot`). After it returns, confirm SSH comes back on all paths **unaided**:
```bash
ssh sjat@100.99.146.14 true && echo "wt0 OK after reboot"
.venv/bin/ansible ubongo -i inventories/production/ -m ping
```
Expected: SSH returns with no manual intervention (no `ListenAddress`, so nothing to race). Only now is the cutover complete.
- [ ] **Step 6: Update STATUS + ROADMAP**
- In `STATUS.md`: in the `roles/base/` row of "Scaffolded but empty", change the firewall note: the `firewall` concern is now **applied to ubongo** as INPUT-only default-deny (it is no longer "not yet applied to any host"); note the `base__firewall_input_only` knob and that the forward default-deny still awaits the `docker_host` drop-in for real service hosts. Mark the ubongo control-node row's "Pending" item for default-deny as done.
- In `docs/ROADMAP.md`: mark **mesh-hardening sub-project 2 (ubongo default-deny) done**; the remaining follow-on is sub-project 1 (askari SSH→`wt0` *redesign*) and sub-project 3 (NetBird ACL). Update the "Next step" section accordingly.
```bash
git add STATUS.md docs/ROADMAP.md
git commit -m "docs: ubongo INPUT-only default-deny applied (mesh-hardening 2/3 done)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
- [ ] **Step 7: Push**
Run: `git push origin main`
---
## Self-review (against the spec)
- **§ Design — INPUT-only default-deny** → Task 1 (forward-policy knob) + Task 2 (enabled on ubongo). ✓
- **§ Design — admin-addrs (operator workstations on LAN)** → Task 1 (`base__firewall_admin_addrs` + template loop) + Task 2 (`10.20.10.50` mamba, `10.20.10.17`). ✓
- **§ Design — no sshd ListenAddress change** → nothing touches `ssh.yml`/`sshd_hardening.conf.j2`; only nftables. ✓ (verified: Tasks 1-3 file lists exclude them).
- **§ allow-list** (lo, established, wt0, ssh-from-control, admin-addr, icmp; forward accept) → template already renders lo/established/wt0/control/icmp; Task 1 adds admin-addr + forward-accept. ✓
- **§ Why-safe (incident signals 1/2/3/6)** → signal 1 (forward accept, Task 1); signal 2 (no ListenAddress); signal 3 (ubongo keeps LAN + console); signal 6 (Task 4 harness reboot + Task 5 step 5 reboot-while-console). ✓
- **§ New & changed code** (defaults, template, molecule, group_vars/control, integration profile) → Tasks 1-3. ✓
- **§ admin raw-leases + revisit** → Task 2 comments record both leases + the OPNsense-reservation revisit trigger; backstop (wt0) noted; flagged in `FRICTION.md`. ✓
- **§ Testing** (Molecule render asserts; `make test-integration HOST=ubongo`; live checks) → Task 1 (Molecule), Task 4 (harness), Task 5 step 4 (live). ✓ Coverage split (default in Molecule, input_only on the VM) noted in Task 1.
- **§ Staged cutover (signal-6 order)** → Task 5 steps 1-7; reboot-recovery (step 5) precedes nothing that retires a break-glass (the console is permanent). ✓
- **§ Risks/rollback** → auto-rollback (Task 5 step 3), redundant paths + physical console, raw-lease backstop. ✓
- **Type/name consistency:** `base__firewall_input_only` (bool) and `base__firewall_admin_addrs` (list) are spelled identically in defaults, template, converge, group_vars, and the overlay. `integration_profile` is spelled identically in both overlays and the three gates in `verify.yml`. ✓
- **Placeholder scan:** no TBD/TODO; every code/command step shows the actual content. ✓

@ -0,0 +1,203 @@
# Spec — Mesh-hardening (2 of 3): ubongo INPUT-only default-deny + `ssh-from-control`
Status: Accepted (2026-06-19)
## Context & scope
The **mesh-hardening follow-on** (deferred from M5, ROADMAP) was decomposed into three
independent sub-projects, each its own spec → plan → implementation cycle:
1. askari SSH → `wt0` — spec/plan written 2026-06-17, **attempted and backed out the same day**
(the incident; six lessons in `FRICTION.md`). Needs a redesign — **not** this spec.
2. **ubongo nftables default-deny + `ssh-from-control`** ← *this spec*
3. NetBird ACL off Allow-All → scoped policies (its own later spec; open mechanism question —
no headless API path).
ROADMAP (re-ordered after the 2026-06-17 incident) puts **ubongo first**: it is the clean,
low-risk case — a physical box with a permanent console break-glass, and *not* the coordinator
host that the incident proved you must not corner.
This spec hardens **ubongo's inbound surface only**. It does **not** change sshd's
`ListenAddress` (so no boot-race), does **not** apply a forward-chain default-deny (so Docker +
the libvirt NAT keep working), and does **not** touch askari or the NetBird ACL.
Current state (verified on ubongo, 2026-06-19): **no host firewall** — sshd listens on
`0.0.0.0:22`, reachable from LAN, mesh, and anything routable; only Docker's + libvirt's own
`iptables-nft` tables exist. Interfaces: `eno1` `10.20.10.151` (LAN, = `ansible_host`), `wt0`
`100.99.146.14` (mesh), `docker0` (one container, no published ports), `virbr-boma`
`192.168.150.1/24` (the libvirt NAT that `make test-integration` uses), `ip_forward=1`.
## Goal / success criteria
- SSH to ubongo succeeds over **`wt0`** (road-warriors, askari), from **mamba on the LAN**
(`10.20.10.50`), and via the **`ssh-from-control` self-path** (Ansible; source `10.20.10.151`).
- SSH from any **other** LAN source is **dropped** (default-deny on `input`).
- **Docker container egress and `make test-integration` (libvirt NAT) keep working** — the
forward chain is untouched.
- A **reboot** does not lock SSH out (no `ListenAddress`, so no bind race).
- Break-glass is the **on-prem physical console** (permanent, non-mesh). The live apply is
additionally gated by the firewall **auto-rollback** timer.
## Design
Apply base's nftables `firewall` concern to ubongo, with two adjustments and one deliberate
non-change:
1. **INPUT-only default-deny.** The `input` chain keeps `policy drop` with the guaranteed
management plane: `lo`, `established,related`, ICMP, SSH on `wt0`, and SSH from
`ssh-from-control` (`10.20.10.151`). We add **two operator-workstation sources** (mamba,
`10.20.10.50`, and a second workstation, `10.20.10.17`) via a new `base__firewall_admin_addrs` list. Everything else on `eno1` drops.
2. **Forward chain left permissive.** base hardcodes `chain forward { … policy drop; }` for
inter-container isolation. On ubongo that would break Docker egress **and** the libvirt NAT
the integration harness depends on — the same class of failure that sank askari (FRICTION
2026-06-17, signal 1). A new `base__firewall_input_only` knob renders the forward chain
`policy accept` instead. Docker's and libvirt's own `iptables-nft` forward rules continue to
apply (separate tables); base simply does not add a default-deny on top.
3. **No sshd `ListenAddress` change.** sshd keeps listening on `0.0.0.0:22`; nftables does all
inbound scoping. This deliberately avoids the `ip_nonlocal_bind` boot-race that broke askari
(FRICTION signal 2) — there is nothing to bind before `wt0` exists.
Resulting `input` allow-list:
```
iif "lo" accept
ct state established,related accept
ct state invalid drop
iifname "wt0" tcp dport 22 accept # mesh (road-warriors, askari)
ip saddr 10.20.10.151 tcp dport 22 accept # ssh-from-control (Ansible self) — group_vars/all
ip saddr 10.20.10.50 tcp dport 22 accept # mamba on the LAN — base__firewall_admin_addrs
ip saddr 10.20.10.17 tcp dport 22 accept # 2nd operator wkstn — base__firewall_admin_addrs
ip protocol icmp accept ; ip6 nexthdr ipv6-icmp accept
# (no catalog services on ubongo) → default drop
chain forward: policy accept # Docker + libvirt-NAT forwarding preserved
```
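
The allow-list above can be read as an ordered, first-match-wins decision. The following is a minimal Python sketch of that reading, not the nftables implementation: `Packet` and `input_verdict` are hypothetical names, and connection-tracking state is reduced to a plain string.

```python
from dataclasses import dataclass
from typing import Optional

# Addresses and port copied from the allow-list above.
# Packet and input_verdict are hypothetical names used only for this sketch.
CONTROL_ADDR = "10.20.10.151"
ADMIN_ADDRS = ("10.20.10.50", "10.20.10.17")
SSH_PORT = 22


@dataclass(frozen=True)
class Packet:
    iface: str
    saddr: str
    proto: str                   # "tcp", "udp", "icmp", "ipv6-icmp"
    dport: Optional[int] = None
    ct_state: str = "new"        # "new", "established", "related", "invalid"


def input_verdict(pkt: Packet) -> str:
    """Return "accept" or "drop"; the first matching rule wins, else policy drop."""
    if pkt.iface == "lo":
        return "accept"
    if pkt.ct_state in ("established", "related"):
        return "accept"
    if pkt.ct_state == "invalid":
        return "drop"
    is_ssh = pkt.proto == "tcp" and pkt.dport == SSH_PORT
    if is_ssh and pkt.iface == "wt0":
        return "accept"          # mesh (road-warriors, askari)
    if is_ssh and pkt.saddr == CONTROL_ADDR:
        return "accept"          # ssh-from-control (Ansible self-path)
    if is_ssh and pkt.saddr in ADMIN_ADDRS:
        return "accept"          # base__firewall_admin_addrs
    if pkt.proto in ("icmp", "ipv6-icmp"):
        return "accept"
    return "drop"                # no catalog services on ubongo
```

For example, `input_verdict(Packet("eno1", "10.20.10.50", "tcp", 22))` accepts, while the same SSH attempt from an unlisted LAN address falls through to the default drop.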
## Why ubongo is the safe case (maps to the 2026-06-17 incident)
- **Signal 1** (forward-drop breaks Docker hosts): sidestepped — INPUT-only leaves forwarding alone.
- **Signal 2** (`ip_nonlocal_bind` boot-race): sidestepped — no `ListenAddress`; sshd binds nothing new.
- **Signal 3** (a host's only mgmt path must not depend on a service it hosts): satisfied —
ubongo is not the coordinator and keeps three independent paths (mesh, LAN, physical console).
- **Signal 6** (recovery tested after the break-glass was removed): the physical console is
permanent (nothing to retire), and reboot-recovery is proven on a throwaway VM first.
## New & changed code
**Role `base`:**
- `roles/base/defaults/main.yml` — add:
- `base__firewall_input_only: false` — when true, the forward chain is `policy accept`
(host-local input filtering only), for hosts that route/forward container or NAT traffic
(e.g. the control node's Docker + libvirt-NAT) where a forward default-deny would break them.
- `base__firewall_admin_addrs: []` — extra LAN source IPs allowed to SSH (besides `wt0` +
`ssh-from-control`); for an operator workstation reaching the host over the LAN. Key-gated.
- `roles/base/templates/nftables.conf.j2`:
- the forward line (currently line 21) →
`chain forward { type filter hook forward priority 0; policy {{ "accept" if base__firewall_input_only | bool else "drop" }}; }`
- after the `ssh-from-control` block (currently lines 12-14), add a loop:
`{% for addr in base__firewall_admin_addrs %}`
`ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept`
`{% endfor %}`
- `roles/base/molecule/default/{converge,verify}.yml` — fixture sets `input_only: true` + an
`admin_addrs` entry; assert (a) `forward` renders `policy accept`, (b) the admin-addr accept
rule renders, (c) existing input default-deny + `wt0` + control-addr assertions stay green.
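
To make the two template changes concrete, here is a small Python sketch of what they are expected to render. It uses plain string formatting rather than Jinja, and the function names `render_forward_chain` and `render_admin_rules` are hypothetical. One difference to keep in mind: the template's `| bool` filter also coerces strings such as `"false"`, whereas this sketch assumes a real boolean is passed in.

```python
from typing import Iterable, List


def render_forward_chain(input_only: bool) -> str:
    """Forward chain line: policy accept when input_only is set, otherwise drop."""
    policy = "accept" if input_only else "drop"
    return f"chain forward {{ type filter hook forward priority 0; policy {policy}; }}"


def render_admin_rules(admin_addrs: Iterable[str], ssh_port: int = 22) -> List[str]:
    """One SSH accept rule per admin address; an empty list renders nothing."""
    return [f"ip saddr {addr} tcp dport {ssh_port} accept" for addr in admin_addrs]


if __name__ == "__main__":
    print(render_forward_chain(True))
    for rule in render_admin_rules(["10.20.10.50", "10.20.10.17"]):
        print(rule)
```

With the default `base__firewall_admin_addrs: []` no extra rule is emitted, so hosts that never set the list keep their current rendered ruleset.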
**Inventory** (`inventories/production/group_vars/control/vars.yml`, append):
```yaml
# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0, the
# ssh-from-control self-path (base__firewall_control_addr in group_vars/all), or mamba on the
# LAN. Break-glass: the physical console.
base__firewall_input_only: true
base__firewall_admin_addrs:
- "10.20.10.50" # mamba over the LAN (NetBird off). Raw DHCP lease — see note below.
- "10.20.10.17" # a 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.
# base__firewall_apply defaults true; base__firewall_control_addr (= ubongo's own 10.20.10.151)
# is set in group_vars/all and covers Ansible's self-connection.
```
**Integration harness** (ADR-025) — a "be ubongo" profile, mirroring "be askari":
- `tests/integration/overrides/ubongo.yml`: `firewall_apply: true`, `input_only: true`,
`admin_addrs: ["192.168.150.99"]` (a representative LAN addr to exercise the rule),
`firewall_control_addr: "192.168.150.1"` (the libvirt-NAT gateway = the harness's own SSH
path, so the apply + reboot don't lock it out), `ssh_listen_mesh_only: false`,
`mesh_enabled: false`.
- `tests/integration/profiles/ubongo.json` — mirror `profiles/askari.json` (VM resources/image).
- `tests/integration/verify.yml` — make the assertions **profile-aware** (gated on the active
profile, since `verify.yml` is shared): for ubongo assert `input` policy drop, `forward`
policy **accept**, and the admin-addr rule present. Reachability across the reboot is the
harness's existing cycle. The askari assertions (Docker/forward-DNAT) must **not** run for the
ubongo profile, nor vice-versa.
Enables `make test-integration HOST=ubongo`.
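
The profile gating described for `verify.yml` amounts to a dispatch that refuses to run when the profile is unknown. A rough Python sketch of that idea follows; `select_checks` and the check labels are hypothetical names chosen for illustration, not task names from the playbook.

```python
from typing import List, Optional

KNOWN_PROFILES = ("askari", "ubongo")


def select_checks(integration_profile: Optional[str]) -> List[str]:
    """Return the check group for a profile; an unknown or unset profile is an error."""
    if integration_profile not in KNOWN_PROFILES:
        # Fail loudly rather than skipping every gated check and passing silently.
        raise ValueError(
            "integration_profile must be set in the profile overlay (askari|ubongo)"
        )
    if integration_profile == "askari":
        return ["docker-active", "forward-permits-containers", "published-port-answers"]
    return ["input-policy-drop", "forward-policy-accept", "admin-addr-rule-present"]
```

The explicit error matters because a shared verify file whose checks are all conditional would otherwise report success when no profile matched at all.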
## The admin-addrs — deliberately interim values
`base__firewall_admin_addrs: ["10.20.10.50", "10.20.10.17"]` are the operator workstations'
**current raw DHCP leases** (mamba + a second box), not reservations (operator decision,
2026-06-19). Both share the operator's `sjat` SSH key. Caveats, accepted for now:
- **Lease drift:** if DHCP reassigns either IP, the rule allows whatever host then holds it
(still SSH-key-gated, so low risk) and that workstation loses its *LAN* path. **Backstop:**
the workstations also reach ubongo over `wt0` (mesh), so they are never cut off — only the
off-mesh LAN convenience lapses until the IP is corrected.
- **Revisit trigger (flagged for follow-up):** when OPNsense-as-code lands (ADR-020 perimeter /
TODO 3.5), replace both raw leases with **MAC-pinned DHCP reservations** (`10.20.10.17` =
MAC `bc:0f:f3:c8:4a:8a`) and allow the reserved addresses. Recorded as a `FRICTION.md` open
signal so the next `/kaizen` surfaces it.
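
Until the reservations exist, lease drift could be spotted by comparing the MAC currently seen at an admin address with the expected one. The sketch below assumes the observed address-to-MAC mapping is gathered by some other means (for example a neighbour table dump); `drifted_addrs` is a hypothetical helper, and only the second workstation's MAC appears in this spec, so mamba is left out rather than guessed.

```python
from typing import Dict, List

# Expected MAC per admin address. Only 10.20.10.17's MAC is given in the spec.
EXPECTED_MACS: Dict[str, str] = {"10.20.10.17": "bc:0f:f3:c8:4a:8a"}


def drifted_addrs(observed: Dict[str, str],
                  expected: Dict[str, str] = EXPECTED_MACS) -> List[str]:
    """Return admin addresses whose observed MAC is missing or differs from expected."""
    drifted = []
    for addr, mac in expected.items():
        seen = observed.get(addr)
        if seen is None or seen.lower() != mac.lower():
            drifted.append(addr)
    return drifted
```

An address that is simply absent from the observed table is also reported, since a powered-off workstation and a reassigned lease look the same from here.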
## Testing
- **Molecule** (base `default`, render-only, `firewall_apply: false`): the new forward-accept +
admin-addr assertions above, with existing assertions green.
- **Integration harness** (`make test-integration HOST=ubongo`): on a throwaway UEFI VM, apply
the ubongo overlay, assert the ruleset shape, and prove **SSH survives a reboot** from an
allowed source (the existing assert/cycle). This is the gate before touching the real control
node.
- **Live** (during cutover): SSH over `wt0` ✓, from mamba LAN ✓, Ansible self-ping ✓; SSH from a
disallowed LAN host dropped ✓; `docker run … ` egress ✓; a fresh `make test-integration`
still spins a VM (libvirt NAT intact) ✓.
## Staged cutover (operator-supervised — lockout-aware, FRICTION signal-6 order)
ubongo is managed as `sjat` (password sudo), so the live apply needs the operator present
anyway. The physical console is open throughout.
1. **Harness GREEN:** `make test-integration HOST=ubongo` passes (incl. the reboot).
2. **Pre-check the real paths** *before* applying: SSH over `wt0`, SSH from mamba
(`10.20.10.50`), `ansible ubongo -m ping`. Confirm the physical console is reachable.
3. **Dry-run:** `make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall` — review the nftables diff
(input default-deny + `wt0` + `10.20.10.151` + `10.20.10.50` + `10.20.10.17`; forward `policy accept`).
4. **Apply (auto-rollback armed):** `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall` — the
firewall concern snapshots, arms the 45 s revert, applies, runs `reset_connection` and
`wait_for_connection` over the live path (`10.20.10.151`), then cancels the timer. A bad
ruleset reverts itself; the console is the ultimate fallback.
5. **Verify** every path + Docker egress + a fresh integration-VM spin (above).
6. **Reboot ubongo; confirm SSH returns on all paths unaided** (console present). Only now is it
done — recovery is proven *while the break-glass is still there*.
7. **Docs:** update `STATUS.md` (ubongo row: input-only default-deny applied) and `ROADMAP.md`
(mesh-hardening 2/3 done; next is sub-project 1 askari redesign or 3 NetBird ACL).
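
The apply step in the list above follows an arm, apply, confirm, cancel pattern. The Python sketch below illustrates that pattern in-process with `threading.Timer`; it is an assumption-laden model, since the real revert timer runs on the managed host and `apply_with_rollback` and its callbacks are hypothetical names.

```python
import threading
from typing import Callable


def apply_with_rollback(apply: Callable[[], None],
                        revert: Callable[[], None],
                        confirm: Callable[[], bool],
                        timeout_s: float = 45.0) -> bool:
    """Arm a revert timer, apply, and cancel the timer only if confirm() succeeds.

    Returns True if the change was kept, False if the revert fired.
    """
    reverted = threading.Event()

    def _revert() -> None:
        revert()
        reverted.set()

    timer = threading.Timer(timeout_s, _revert)
    timer.start()          # armed before anything changes
    apply()
    if confirm():
        timer.cancel()     # a fresh connection works: keep the new ruleset
    timer.join()           # returns at once if cancelled, else waits for the revert
    return not reverted.is_set()
```

The key property is ordering: the revert is scheduled before the change is made, so a ruleset that cuts the confirming connection still undoes itself without operator action.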
## Risks & rollback
- **Self-referential apply** (ubongo runs Ansible against itself): mitigated by the auto-rollback
timer, the `wait_for_connection` over the real path, three redundant allowed sources, and the
permanent physical console. ubongo cannot be bricked.
- **Raw-lease fragility:** documented above; backstopped by the mesh path; revisit with OPNsense.
- **No new container isolation** (forward stays accept): accepted — ubongo is a single-tenant
control node, not a service host; Docker/libvirt keep their own forward rules. The forward
default-deny remains the norm for real service hosts (`base__firewall_input_only: false`).
## Out of scope / follow-ons
- askari SSH → `wt0` redesign (sub-project 1) — needs the boot-race + coordinator-bootstrap
resolved; folds in the coordinator-robustness (geo-DB FATAL-loop) + off-site backup lessons.
- NetBird ACL off Allow-All (sub-project 3) — open mechanism question (no headless API path).
- OPNsense DHCP reservations for the admin workstations (`10.20.10.50` mamba, `10.20.10.17`)
and ubongo: replace the raw leases with MAC-pinned reservations once OPNsense-as-code lands;
flagged in `FRICTION.md`.
- Forward-chain container isolation on ubongo — deliberately not done here.
- `STATUS.md` / `ROADMAP.md` edits land with the implementation, not this spec.

@ -19,3 +19,15 @@ base__ai_worker_user: claude
# Enrollment only; the host firewall default-deny stays deferred (the mesh-hardening
# follow-on), so this brings up wt0 without changing SSH exposure.
base__mesh_enabled: true
# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0 (mesh), the
# ssh-from-control self-path (base__firewall_control_addr, group_vars/all = 10.20.10.151), or
# mamba on the LAN. Break-glass: the physical console. (base__firewall_apply defaults true.)
base__firewall_input_only: true
base__firewall_admin_addrs:
- "10.20.10.50" # mamba over the LAN (NetBird off). Raw DHCP lease — revisit with an
# OPNsense reservation when OPNsense-as-code lands; backstopped by wt0.
- "10.20.10.17" # 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.

@ -11,6 +11,14 @@ base__firewall_rollback_timeout: 45 # seconds before the auto-revert fires on a
base__firewall_confirm_timeout: 20 # seconds to re-establish a fresh connection post-apply
base__firewall_dropin_dir: /etc/nftables.d
base__firewall_apply: true # set false to render+validate without applying (CI/Molecule)
base__firewall_input_only: false # true → the forward chain is `policy accept` (host-local
# INPUT filtering only). For hosts that forward/route
# container or NAT traffic (the control node's Docker +
# libvirt-NAT) where a forward default-deny would break
# them. Real service hosts keep this false (forward drop).
base__firewall_admin_addrs: [] # extra LAN source IPs allowed to SSH, besides wt0 +
# ssh-from-control. For an operator workstation reaching
# the host over the LAN (no mesh). Key-gated. (ADR-021)
# SSH hardening + fail2ban (ADR-002) — `hardening` concern.
base__ssh_password_authentication: "no"

@ -6,6 +6,8 @@
vars:
base__firewall_apply: false
base__firewall_control_addr: 10.10.0.99 # test control-node LAN address
base__firewall_admin_addrs:
- "10.30.0.77" # fixture: an operator-workstation LAN source (admin-addr SSH allow)
# Exercise the mesh concern's include path with the live actions gated off, so it
# runs hermetically (no coordinator/key needed) and must be a clean no-op.
base__mesh_enabled: true

@ -51,6 +51,20 @@
- "'include \"/etc/nftables.d/*.nft\"' in nft"
fail_msg: "missing drop-in include hook"
- name: Assert the forward chain defaults to policy drop (input_only off)
ansible.builtin.assert:
that:
- "'hook forward priority 0; policy drop;' in nft"
fail_msg: >-
forward chain must default to policy drop when base__firewall_input_only is
false (container isolation stays the norm on real service hosts)
- name: Assert the admin-addr SSH allow rule (operator workstation on the LAN)
ansible.builtin.assert:
that:
- "'ip saddr 10.30.0.77 tcp dport 22 accept' in nft"
fail_msg: "missing admin-addr SSH allow rule from base__firewall_admin_addrs"
- name: Syntax-check the rendered ruleset (no apply)
ansible.builtin.command: nft -c -f /etc/nftables.conf
changed_when: false

@ -12,13 +12,16 @@ table inet filter {
{% if base__firewall_control_addr %}
ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endif %}
{% for addr in base__firewall_admin_addrs %}
ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endfor %}
ip protocol icmp accept
ip6 nexthdr ipv6-icmp accept
{% for r in base__firewall_resolved %}
ip saddr { {{ r.sources | join(', ') }} } {{ r.proto }} dport {{ r.port }} accept
{% endfor %}
}
-chain forward { type filter hook forward priority 0; policy drop; }
+chain forward { type filter hook forward priority 0; policy {{ 'accept' if base__firewall_input_only | bool else 'drop' }}; }
chain output { type filter hook output priority 0; policy accept; }
}

@ -201,6 +201,13 @@ def up(host, name=None, mem_mib=DEFAULT_MEM_MIB, vcpus=DEFAULT_VCPUS):
sh(["cloud-localds", "--network-config", str(RUN_DIR / "network-config"),
str(seed), str(RUN_DIR / "user-data"), str(RUN_DIR / "meta-data")])
console = CACHE_DIR / f"{name}-console.log"
# virt-install has a `#!/usr/bin/env python3` shebang; the Makefile prepends .venv/bin to
# PATH (so the venv's ansible tools resolve), which would hijack virt-install into the
# isolated venv — it lacks system PyGObject (`gi`) and crashes. Strip the venv from PATH
# for this system tool so its shebang finds /usr/bin/python3 (which has gi). Ansible is
# invoked via its absolute .venv path elsewhere, so it is unaffected.
sys_path = ":".join(p for p in os.environ.get("PATH", "").split(":")
if "/.venv/bin" not in p)
sh(["virt-install", "--name", name, "--memory", str(mem_mib), "--vcpus", str(vcpus),
"--boot", "uefi", # genericcloud triple-faults on legacy BIOS handoff; UEFI boots
"--import",
@ -210,7 +217,8 @@ def up(host, name=None, mem_mib=DEFAULT_MEM_MIB, vcpus=DEFAULT_VCPUS):
"--osinfo", "debian13",
"--graphics", "none",
"--serial", f"file,path={console}",
-"--noautoconsole"])
+"--noautoconsole"],
env=dict(os.environ, PATH=sys_path))
ip = wait_for_ip(name)
wait_for_ssh(ip, "ansible")
# Block until cloud-init finishes (incl. apt-get update) so apply sees a ready system.
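
The PATH filtering in this hunk can be tried in isolation. The sketch below restates the same expression as a standalone function for illustration; `path_without_venv` is a hypothetical name, and it assumes a POSIX-style `:` separator, as the hunk does.

```python
def path_without_venv(path: str, marker: str = "/.venv/bin") -> str:
    """Drop every PATH entry containing `marker`, keeping the order of the rest."""
    return ":".join(p for p in path.split(":") if marker not in p)


if __name__ == "__main__":
    # The venv entry is removed, so `#!/usr/bin/env python3` resolves to a system python.
    print(path_without_venv("/repo/.venv/bin:/usr/local/bin:/usr/bin"))
```

Matching on a substring rather than an exact entry means the filter works wherever the checkout lives, at the cost of also dropping any unrelated entry that happens to contain `/.venv/bin`.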

@ -1,6 +1,7 @@
---
# Integration-test overlay for the "askari" profile (ADR-025). Passed via `-e @`.
# Reproduces the 2026-06-17 incident: apply base's nftables default-deny to a Docker host.
integration_profile: askari
base__firewall_apply: true
# Keep a break-glass: sshd stays on all interfaces (never wt0-only in a throwaway VM).
base__ssh_listen_mesh_only: false

@ -0,0 +1,18 @@
---
# Integration-test overlay for the "ubongo" profile (ADR-025). Passed via `-e @`.
# Exercises mesh-hardening 2/3: base's INPUT-only default-deny on the control node — input
# chain default-deny, forward chain left permissive (Docker/libvirt-NAT safe), no sshd
# ListenAddress change (so no boot-race).
integration_profile: ubongo
base__firewall_apply: true
base__firewall_input_only: true # forward chain renders `policy accept`
base__firewall_admin_addrs:
- "192.168.150.98" # two representative LAN sources — exercises the
- "192.168.150.99" # admin-addr loop with a multi-entry list (like ubongo)
# Never wt0-only; never touch the real mesh from a throwaway VM.
base__ssh_listen_mesh_only: false
base__mesh_enabled: false
# Allow SSH from the libvirt-NAT gateway (where the driver/ansible connect from) so the
# default-deny apply + the reboot don't lock out the harness. By source IP (interface-
# independent). This is the harness's lifeline; the admin-addr above is only exercised.
base__firewall_control_addr: "192.168.150.1"

@ -0,0 +1,9 @@
{
"groups": ["control"],
"applies": [
{"playbook": "site.yml", "tags": ["base"]}
],
"extra_vars_files": ["overrides/ubongo.yml"],
"mem_mib": 2048,
"vcpus": 2
}

@ -1,33 +1,48 @@
---
-# Integration verify (ADR-025). Outcome-based: proves Docker forwarding survives the
-# reboot. The load-bearing check probes the VM's published :80 FROM the controller
-# (ubongo) — if base's forward-drop killed DNAT, this times out (the FRICTION #1 bug).
+# Integration verify (ADR-025). Outcome-based, profile-aware: the active profile is named by
+# `integration_profile` (set in each profile's overlay). Each profile asserts its own success
+# criteria; an unknown/unset profile fails loudly (never a silent pass).
- name: Verify the rebooted host
hosts: all
become: true
gather_facts: false
tasks:
-- name: Gather service facts
+- name: A known integration_profile must be set (no silent pass)
ansible.builtin.assert:
that:
- integration_profile is defined
- integration_profile in ['askari', 'ubongo']
fail_msg: "integration_profile must be set in the profile overlay (askari|ubongo)"
# ── askari profile — Docker host: published-port forwarding survives the reboot ──
# The load-bearing check probes the VM's published :80 FROM the controller (ubongo) — if
# base's forward-drop killed DNAT, this times out (the FRICTION 2026-06-17 #1 bug).
- name: (askari) Gather service facts
when: integration_profile == 'askari'
ansible.builtin.service_facts:
-- name: Docker daemon is active
+- name: (askari) Docker daemon is active
when: integration_profile == 'askari'
ansible.builtin.assert:
that: "ansible_facts.services['docker.service'].state == 'running'"
fail_msg: "docker.service is not running"
-- name: Forward chain permits container traffic (drop-in loaded)
+- name: (askari) Forward chain permits container traffic (drop-in loaded)
when: integration_profile == 'askari'
ansible.builtin.command: nft list chain inet filter forward
register: _fwd
changed_when: false
-- name: Assert container forwarding is allowed (not pure drop)
+- name: (askari) Assert container forwarding is allowed (not pure drop)
when: integration_profile == 'askari'
ansible.builtin.assert:
that: "'accept' in _fwd.stdout"
fail_msg: >-
forward chain is pure drop — container forwarding will die on reboot
(FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.
-- name: Published port answers from the controller (DNAT + forward alive)
+- name: (askari) Published port answers from the controller (DNAT + forward alive)
when: integration_profile == 'askari'
delegate_to: localhost
become: false
ansible.builtin.uri:
@ -42,3 +57,29 @@
retries: 5
delay: 6
until: _probe is succeeded
# ── ubongo profile — control node: INPUT-only default-deny survives the reboot ──
# SSH reachability across the reboot is proven by the harness itself (it re-SSHes and
# checks boot_id changed before this verify runs). Here we assert the ruleset shape.
- name: (ubongo) Read the live nftables ruleset
when: integration_profile == 'ubongo'
ansible.builtin.command: nft list ruleset
register: _nft
changed_when: false
- name: (ubongo) INPUT default-deny, forward permissive, lifeline + admin-addr allow
when: integration_profile == 'ubongo'
ansible.builtin.assert:
that:
# live `nft list ruleset` prints the SYMBOLIC priority (`filter` = 0), unlike the
# rendered /etc/nftables.conf (`priority 0`) that the Molecule scenario asserts against.
- "'hook input priority filter; policy drop;' in _nft.stdout"
- "'hook forward priority filter; policy accept;' in _nft.stdout"
# the ssh-from-control lifeline (base__firewall_control_addr) — the reconnect path
- "'ip saddr 192.168.150.1 tcp dport 22 accept' in _nft.stdout"
- "'ip saddr 192.168.150.98 tcp dport 22 accept' in _nft.stdout"
- "'ip saddr 192.168.150.99 tcp dport 22 accept' in _nft.stdout"
fail_msg: >-
ubongo profile: expected input policy drop, forward policy accept (input-only),
the ssh-from-control lifeline (192.168.150.1), and both admin-addr
(192.168.150.98/99) SSH allows in the live ruleset.
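
The symbolic-versus-numeric priority difference noted in the assertion comments could also be handled by normalising the live output before comparing. The sketch below is one hypothetical way to do that in Python; `normalize_priority` is an illustrative name, and the mapping covers only the single equivalence this diff relies on (`filter` = 0).

```python
import re

# Only the one symbolic name this document relies on: `filter` = 0.
SYMBOLIC_PRIORITIES = {"filter": "0"}


def normalize_priority(ruleset_text: str) -> str:
    """Rewrite `priority <name>` to its numeric form where the name is known."""
    def _sub(match: "re.Match[str]") -> str:
        name = match.group(1)
        return "priority " + SYMBOLIC_PRIORITIES.get(name, name)

    return re.sub(r"priority (\w+)", _sub, ruleset_text)
```

With this, a single expected string such as `hook forward priority 0; policy accept;` could be checked against both the rendered file and the live ruleset. The playbook instead asserts the symbolic form directly, which avoids any extra processing.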