boma/docs/superpowers/plans/2026-06-19-mesh-hardening-ubongo-default-deny.md
sjat e14e347047 docs(plan): mesh-hardening 2/3 — ubongo implementation plan
Five tasks: base knobs (input-only forward policy + admin-addr SSH allow,
TDD via Molecule) → enable on the control group → a 'be ubongo' integration
profile (profile-aware verify) → the real-VM harness GREEN gate → the
operator-supervised live cutover (signal-6 order, physical-console break-glass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:26:04 +02:00


Mesh-hardening 2/3 — ubongo INPUT-only default-deny — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Apply base's nftables firewall to the control node (ubongo) as an INPUT-only default-deny — hardening its inbound surface — while leaving the forward chain permissive so Docker egress and the libvirt-NAT integration harness keep working, and without any sshd ListenAddress change.

Architecture: Two new base knobs make the existing firewall concern fit a control node: base__firewall_input_only flips the forward chain to policy accept (host-local input filtering only), and base__firewall_admin_addrs adds operator-workstation LAN sources to the SSH allow-list (alongside wt0 and ssh-from-control). sshd is untouched (nftables does the scoping → no ip_nonlocal_bind boot-race). The change is validated on a throwaway VM via the ADR-025 integration harness (a new "be ubongo" profile) before an operator-supervised live cutover whose safety net is the firewall auto-rollback timer plus the permanent on-prem physical console.

Tech Stack: Ansible (role base, FQCN), nftables, Jinja2, Molecule on Debian 13, pytest (none new), the ADR-025 integration harness (scripts/integration-vm.py, JSON profiles, -e @ overlays).

Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md

Conventions: make lint and make test ROLE=base before each commit; make check before make deploy; never hand-edit the generated offsite.yml; rbw unlocked for any commit touching Ansible content and for the integration/live applies (the production group_vars/all/vault.yml is in inventory scope and gets decrypted at playbook load). Tasks 1-3 are code (subagent-driven, each lint/Molecule-verified). Task 4 is a real-VM validation gate on ubongo. Task 5 is the live, operator-supervised cutover.


File Structure

| File | Create/Modify | Responsibility |
|---|---|---|
| roles/base/defaults/main.yml | Modify | Declare base__firewall_input_only + base__firewall_admin_addrs (defaults: off / empty). |
| roles/base/templates/nftables.conf.j2 | Modify | Conditional forward policy; render an SSH-allow rule per admin address. |
| roles/base/molecule/default/converge.yml | Modify | Fixture: an admin-addr source (input-only stays at its default → forward drop). |
| roles/base/molecule/default/verify.yml | Modify | Assert forward-drop default + the admin-addr rule render. |
| inventories/production/group_vars/control/vars.yml | Modify | Turn the knobs on for ubongo (input-only; mamba's LAN IP). |
| tests/integration/overrides/ubongo.yml | Create | The "be ubongo" overlay (input-only firewall; harness SSH lifeline). |
| tests/integration/profiles/ubongo.json | Create | The "be ubongo" VM profile (group control, applies site.yml:base). |
| tests/integration/overrides/askari.yml | Modify | Add the integration_profile marker (verify is now profile-aware). |
| tests/integration/verify.yml | Modify | Gate the askari (Docker/DNAT) block; add the ubongo (input-only) block + a guard. |
| STATUS.md, docs/ROADMAP.md | Modify | (Task 5) Record mesh-hardening 2/3 done. |

Task 1: base role — base__firewall_input_only (forward policy) + base__firewall_admin_addrs (LAN SSH allow)

Files:

  • Modify: roles/base/defaults/main.yml
  • Modify: roles/base/templates/nftables.conf.j2
  • Modify: roles/base/molecule/default/converge.yml
  • Modify: roles/base/molecule/default/verify.yml

Test strategy (note): Molecule renders one fixture, so it locks the secure default (input_only off → forward policy drop) plus the new admin-addr rule (red→green). The input_only on → forward policy accept path is exercised on a real VM by the integration "be ubongo" profile (Tasks 3-4), whose verify fails red until this template conditional exists. Both branches are covered, across the two test layers.

  • Step 1: Write the failing test (extend Molecule verify)

In roles/base/molecule/default/verify.yml, after the Assert the docker_host extension hook is present block, add:

    - name: Assert the forward chain defaults to policy drop (input_only off)
      ansible.builtin.assert:
        that:
          - "'hook forward priority 0; policy drop;' in nft"
        fail_msg: >-
          forward chain must default to policy drop when base__firewall_input_only is
          false (container isolation stays the norm on real service hosts)

    - name: Assert the admin-addr SSH allow rule (operator workstation on the LAN)
      ansible.builtin.assert:
        that:
          - "'ip saddr 10.30.0.77 tcp dport 22 accept' in nft"
        fail_msg: "missing admin-addr SSH allow rule from base__firewall_admin_addrs"
  • Step 2: Add the fixture that drives it (Molecule converge)

In roles/base/molecule/default/converge.yml, add to the vars: block (after the base__firewall_control_addr line):

    base__firewall_admin_addrs:
      - "10.30.0.77"   # fixture: an operator-workstation LAN source (admin-addr SSH allow)
  • Step 3: Run the test to verify it fails

Run: make test ROLE=base Expected: FAIL on Assert the admin-addr SSH allow rule (the template does not consume base__firewall_admin_addrs yet, so the ip saddr 10.30.0.77 … rule is absent). The forward-drop assertion passes already (the template currently hardcodes policy drop).

  • Step 4: Add the defaults

In roles/base/defaults/main.yml, after the base__firewall_apply: true line (end of the firewall behaviour block, currently line 13), add:

base__firewall_input_only: false     # true → the forward chain is `policy accept` (host-local
                                     # INPUT filtering only). For hosts that forward/route
                                     # container or NAT traffic (the control node's Docker +
                                     # libvirt-NAT) where a forward default-deny would break
                                     # them. Real service hosts keep this false (forward drop).
base__firewall_admin_addrs: []       # extra LAN source IPs allowed to SSH, besides wt0 +
                                     # ssh-from-control. For an operator workstation reaching
                                     # the host over the LAN (no mesh). Key-gated. (ADR-021)
  • Step 5: Make the forward policy conditional + render the admin-addr rules

In roles/base/templates/nftables.conf.j2:

(a) Replace the forward-chain line (currently line 21):

  chain forward { type filter hook forward priority 0; policy {{ 'accept' if base__firewall_input_only | bool else 'drop' }}; }

(b) After the ssh-from-control {% endif %} (currently line 14) and before the ip protocol icmp accept line, add the admin-addr loop:

{% for addr in base__firewall_admin_addrs %}
    ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endfor %}
  • Step 6: Run the test to verify it passes

Run: make test ROLE=base Expected: PASS — converge renders the ruleset; verify confirms the forward chain is policy drop (input_only defaults false) and the ip saddr 10.30.0.77 tcp dport 22 accept rule is present; all pre-existing assertions stay green.

  • Step 7: Lint

Run: make lint Expected: Passed: 0 failure(s) and check-tags: OK.

  • Step 8: Commit
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
        roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): input-only forward policy + admin-addr SSH allow

base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
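
The two template changes in Step 5 can be modelled outside Ansible to see both branches side by side. The sketch below is a plain-Python stand-in, not the Jinja2 template itself: render_firewall_fragment is a hypothetical helper, and it assumes input_only is already a real boolean (the template's `| bool` filter also normalises strings, which this model does not).

```python
def render_firewall_fragment(input_only, admin_addrs, ssh_port=22):
    """Return the lines the two Step 5 template changes are expected to produce.

    Hypothetical model of the template logic: one SSH-allow rule per admin
    address, then the forward chain with a policy chosen by input_only.
    """
    policy = "accept" if input_only else "drop"
    lines = [f"ip saddr {addr} tcp dport {ssh_port} accept" for addr in admin_addrs]
    lines.append(
        f"chain forward {{ type filter hook forward priority 0; policy {policy}; }}"
    )
    return lines


# Molecule fixture from Step 2: input_only stays at its default (false).
molecule = render_firewall_fragment(False, ["10.30.0.77"])

# "be ubongo" overlay from Task 3: input_only on, two admin addresses.
ubongo_vm = render_firewall_fragment(True, ["192.168.150.98", "192.168.150.99"])

for line in molecule + ubongo_vm:
    print(line)
```

The first call corresponds to what Molecule locks (forward drop plus one admin rule); the second corresponds to the branch that only the real-VM profile exercises.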

Task 2: inventory — enable input-only default-deny + mamba on ubongo (control group)

Files:

  • Modify: inventories/production/group_vars/control/vars.yml

  • Step 1: Turn the knobs on for the control group

Append to inventories/production/group_vars/control/vars.yml:


# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0 (mesh), the
# ssh-from-control self-path (base__firewall_control_addr, group_vars/all = 10.20.10.151), or
# mamba on the LAN. Break-glass: the physical console. (base__firewall_apply defaults true.)
base__firewall_input_only: true
base__firewall_admin_addrs:
  - "10.20.10.50"   # mamba over the LAN (NetBird off). Raw DHCP lease — revisit with an
                    # OPNsense reservation when OPNsense-as-code lands; backstopped by wt0.
  - "10.20.10.17"   # 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.
  • Step 2: Verify the vars resolve for ubongo

Run: .venv/bin/ansible-inventory -i inventories/production/ --host ubongo 2>/dev/null | grep -E 'firewall_input_only|firewall_admin_addrs|10\.20\.10\.(50|17)' Expected: shows "base__firewall_input_only": true, the base__firewall_admin_addrs key, and both addresses ("10.20.10.50", "10.20.10.17"); the JSON list may be printed one entry per line.

  • Step 3: Lint

Run: make lint Expected: clean pass (check-tags: OK).

  • Step 4: Commit
git add inventories/production/group_vars/control/vars.yml
git commit -m "feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH

Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
raw leases, backstopped by wt0). Mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
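
The grep in Step 2 matches substrings only. A stricter check could parse the JSON that ansible-inventory prints and compare the two variables directly. This is a minimal sketch under assumptions: check_control_vars is a hypothetical helper, and the sample below is a hand-written, trimmed stand-in for the real host output.

```python
import json

# The values Task 2 appends to group_vars/control/vars.yml.
EXPECTED_ADMIN_ADDRS = ["10.20.10.50", "10.20.10.17"]


def check_control_vars(host_vars_json):
    """Return a list of problems; an empty list means the vars resolved as planned."""
    host_vars = json.loads(host_vars_json)
    problems = []
    if host_vars.get("base__firewall_input_only") is not True:
        problems.append("base__firewall_input_only is not true")
    if host_vars.get("base__firewall_admin_addrs") != EXPECTED_ADMIN_ADDRS:
        problems.append("base__firewall_admin_addrs does not match the plan")
    return problems


# Hypothetical sample: only the two keys of interest, not real inventory output.
sample = json.dumps(
    {
        "base__firewall_input_only": True,
        "base__firewall_admin_addrs": ["10.20.10.50", "10.20.10.17"],
    }
)

print(check_control_vars(sample) or "OK")
```

In practice the function would be given the stdout of the ansible-inventory command from Step 2 instead of the sample string.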

Task 3: integration harness — "be ubongo" profile (overlay + profile + profile-aware verify)

Files:

  • Create: tests/integration/overrides/ubongo.yml

  • Create: tests/integration/profiles/ubongo.json

  • Modify: tests/integration/overrides/askari.yml

  • Modify: tests/integration/verify.yml

  • Step 1: Create the "be ubongo" overlay

Create tests/integration/overrides/ubongo.yml:

---
# Integration-test overlay for the "ubongo" profile (ADR-025). Passed via `-e @`.
# Exercises mesh-hardening 2/3: base's INPUT-only default-deny on the control node — input
# chain default-deny, forward chain left permissive (Docker/libvirt-NAT safe), no sshd
# ListenAddress change (so no boot-race).
integration_profile: ubongo
base__firewall_apply: true
base__firewall_input_only: true        # forward chain renders `policy accept`
base__firewall_admin_addrs:
  - "192.168.150.98"                   # two representative LAN sources — exercises the
  - "192.168.150.99"                   # admin-addr loop with a multi-entry list (like ubongo)
# Never wt0-only; never touch the real mesh from a throwaway VM.
base__ssh_listen_mesh_only: false
base__mesh_enabled: false
# Allow SSH from the libvirt-NAT gateway (where the driver/ansible connect from) so the
# default-deny apply + the reboot don't lock out the harness. By source IP (interface-
# independent). This is the harness's lifeline; the admin-addr above is only exercised.
base__firewall_control_addr: "192.168.150.1"
  • Step 2: Create the "be ubongo" VM profile

Create tests/integration/profiles/ubongo.json:

{
  "groups": ["control"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]}
  ],
  "extra_vars_files": ["overrides/ubongo.yml"],
  "mem_mib": 2048,
  "vcpus": 2
}
  • Step 3: Mark the askari overlay with its profile name

In tests/integration/overrides/askari.yml, after the two header comment lines (before base__firewall_apply: true), add:

integration_profile: askari
  • Step 4: Make verify.yml profile-aware (the test)

Replace the entire contents of tests/integration/verify.yml with:

---
# Integration verify (ADR-025). Outcome-based, profile-aware: the active profile is named by
# `integration_profile` (set in each profile's overlay). Each profile asserts its own success
# criteria; an unknown/unset profile fails loudly (never a silent pass).
- name: Verify the rebooted host
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: A known integration_profile must be set (no silent pass)
      ansible.builtin.assert:
        that:
          - integration_profile is defined
          - integration_profile in ['askari', 'ubongo']
        fail_msg: "integration_profile must be set in the profile overlay (askari|ubongo)"

    # ── askari profile — Docker host: published-port forwarding survives the reboot ──
    # The load-bearing check probes the VM's published :80 FROM the controller (ubongo) — if
    # base's forward-drop killed DNAT, this times out (the FRICTION 2026-06-17 #1 bug).
    - name: (askari) Gather service facts
      when: integration_profile == 'askari'
      ansible.builtin.service_facts:

    - name: (askari) Docker daemon is active
      when: integration_profile == 'askari'
      ansible.builtin.assert:
        that: "ansible_facts.services['docker.service'].state == 'running'"
        fail_msg: "docker.service is not running"

    - name: (askari) Forward chain permits container traffic (drop-in loaded)
      when: integration_profile == 'askari'
      ansible.builtin.command: nft list chain inet filter forward
      register: _fwd
      changed_when: false

    - name: (askari) Assert container forwarding is allowed (not pure drop)
      when: integration_profile == 'askari'
      ansible.builtin.assert:
        that: "'accept' in _fwd.stdout"
        fail_msg: >-
          forward chain is pure drop — container forwarding will die on reboot
          (FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.

    - name: (askari) Published port answers from the controller (DNAT + forward alive)
      when: integration_profile == 'askari'
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}/"
        follow_redirects: none
        status_code: [200, 301, 308, 404, 502, 503]
        timeout: 10
      register: _probe
      retries: 5
      delay: 6
      until: _probe is succeeded

    # ── ubongo profile — control node: INPUT-only default-deny survives the reboot ──
    # SSH reachability across the reboot is proven by the harness itself (it re-SSHes and
    # checks boot_id changed before this verify runs). Here we assert the ruleset shape.
    - name: (ubongo) Read the live nftables ruleset
      when: integration_profile == 'ubongo'
      ansible.builtin.command: nft list ruleset
      register: _nft
      changed_when: false

    - name: (ubongo) INPUT default-deny, forward permissive, admin-addr allow
      when: integration_profile == 'ubongo'
      ansible.builtin.assert:
        that:
          - "'hook input priority 0; policy drop;' in _nft.stdout"
          - "'hook forward priority 0; policy accept;' in _nft.stdout"
          - "'ip saddr 192.168.150.98 tcp dport 22 accept' in _nft.stdout"
          - "'ip saddr 192.168.150.99 tcp dport 22 accept' in _nft.stdout"
        fail_msg: >-
          ubongo profile: expected input policy drop, forward policy accept (input-only),
          and both admin-addr (192.168.150.98/99) SSH allows in the live ruleset.
  • Step 5: Validate the JSON + lint

Run: .venv/bin/python -m json.tool tests/integration/profiles/ubongo.json >/dev/null && echo OK then make lint Expected: OK, then a clean lint pass (check-tags: OK).

  • Step 6: Commit
git add tests/integration/overrides/ubongo.yml tests/integration/profiles/ubongo.json \
        tests/integration/overrides/askari.yml tests/integration/verify.yml
git commit -m "test(integration): add the 'be ubongo' profile (input-only default-deny)

A control-group VM that applies base with INPUT-only default-deny (forward
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
integration_profile marker — the askari Docker/DNAT block is gated, and a ubongo
block asserts input drop + forward accept + the admin-addr rule. Enables
\`make test-integration HOST=ubongo\`. Mesh-hardening 2/3 (ADR-025).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
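
The profile gating in verify.yml (guard first, then per-profile blocks) can be illustrated with a small model. This sketch assumes nothing about Ansible internals: missing_ubongo_fragments is a hypothetical function, and the sample ruleset is hand-written to match the strings the plan asserts on, not captured nft output.

```python
KNOWN_PROFILES = ("askari", "ubongo")

# The four substrings the (ubongo) assert block looks for.
UBONGO_EXPECTED = (
    "hook input priority 0; policy drop;",
    "hook forward priority 0; policy accept;",
    "ip saddr 192.168.150.98 tcp dport 22 accept",
    "ip saddr 192.168.150.99 tcp dport 22 accept",
)


def missing_ubongo_fragments(profile, ruleset):
    """Mirror the guard plus the ubongo block; return the fragments not found."""
    if profile not in KNOWN_PROFILES:
        # Same intent as the guard task: an unknown or unset profile never passes.
        raise ValueError("integration_profile must be set (askari|ubongo)")
    if profile != "ubongo":
        return []  # gated off, like `when: integration_profile == 'ubongo'`
    return [frag for frag in UBONGO_EXPECTED if frag not in ruleset]


# Hypothetical sample ruleset text (illustration only).
sample_ruleset = """
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ip saddr 192.168.150.98 tcp dport 22 accept
    ip saddr 192.168.150.99 tcp dport 22 accept
  }
  chain forward {
    type filter hook forward priority 0; policy accept;
  }
}
"""

print(missing_ubongo_fragments("ubongo", sample_ruleset))
```

An empty list corresponds to the assert passing; any returned fragment names the exact expectation that would turn the verify red.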

Task 4: Validate on the integration harness (make test-integration HOST=ubongo) — the GREEN gate

Runs a throwaway UEFI VM on ubongo: boots it, applies the base role with the ubongo overlay (INPUT-only default-deny), reboots it, and asserts the ruleset + SSH-returns. This proves the change survives a reboot before the real control node is ever touched (spec §cutover step 1; FRICTION signal-6). No code change / no commit — a validation gate.

  • Step 1: Ensure the vault is unlocked

The run loads inventories/production/group_vars/all/vault.yml (symlinked into the run dir), which is decrypted at playbook load.

Run: rbw unlocked || rbw unlock Expected: exits 0 (unlocked). If it prompts, the operator unlocks.

  • Step 2: Run the integration cycle

Run: make test-integration HOST=ubongo Expected (the cycle: up → apply → reboot → assert): the VM gets a 192.168.150.x lease; site.yml --tags base applies cleanly; … rebooted (boot_id changed), SSH back at 192.168.150.x; then VERIFY PASSED for boma-it-ubongo-…. The VM is destroyed on success.

  • Step 3: On failure, read the diagnostics

If it prints VERIFY FAILED, diagnostics are in ~/integration-runs/boma-it-ubongo-<id>/ (nft.txt, console.log, journal.txt). The likely suspects: the admin-addr/forward assertion (Task 1/3 wiring) or SSH not returning post-reboot (the base__firewall_control_addr: 192.168.150.1 lifeline in the overlay). Fix the implicated task, re-commit, and re-run Step 2. Re-run make test-integration-clean first if a VM was left defined.

  • Step 4: Record the result

Capture the VERIFY PASSED line in the task notes (this is the gate Task 5 step 1 depends on). No commit.
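
Because Task 5 is gated on this result, it may help to treat anything other than an explicit pass as a stop. The sketch below is an assumption-laden illustration: gate_result and task5_may_start are hypothetical helpers, and only the VERIFY PASSED / VERIFY FAILED markers come from the plan; the rest of the harness output format is not assumed.

```python
def gate_result(harness_output):
    """Classify a harness run; output without an explicit verdict is 'unknown'."""
    if "VERIFY FAILED" in harness_output:
        return "failed"
    if "VERIFY PASSED" in harness_output:
        return "passed"
    return "unknown"


def task5_may_start(harness_output):
    """Only an explicit pass opens the gate; 'unknown' counts as a stop."""
    return gate_result(harness_output) == "passed"


# Sample lines; the VM name suffix is elided in the plan and stays elided here.
passed_line = "VERIFY PASSED for boma-it-ubongo-…"
failed_line = "VERIFY FAILED for boma-it-ubongo-…"

print(gate_result(passed_line), gate_result(failed_line), gate_result(""))
```

Checking for the failure marker first means a log that contains both markers (for example from a re-run pasted together) is still treated as failed.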


Task 5: Live staged cutover (operator-supervised — NOT a subagent task)

Touches the real ubongo (the control node Ansible runs from) and reboots it, so it is lockout-risky. Run it interactively with the operator, in order, verifying each step before the next. The firewall auto-rollback timer (base__firewall_rollback_timeout, 45 s) + wait_for_connection over the live path is the safety net; the on-prem physical console is the permanent break-glass. Do NOT hand this to an unattended agent.
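
The safety net described above follows an apply, probe, then keep-or-revert pattern. The sketch below models only that decision, under stated assumptions: apply_with_rollback is a hypothetical function, a probe count stands in for the 45 s window, and the real revert runs as a timer on the host itself rather than in the controller process.

```python
def apply_with_rollback(apply, reachable, revert, attempts=3):
    """Apply a change and keep it only if a connectivity probe succeeds.

    Simplified model: `attempts` probes stand in for the rollback window.
    """
    apply()
    for _ in range(attempts):
        if reachable():
            return "kept"  # connectivity confirmed: the revert would be cancelled
    revert()  # stands in for the rollback timer firing
    return "reverted"


# Toy state in place of the live ruleset and its snapshot.
ruleset = {"active": "old"}


def apply_new():
    ruleset["active"] = "new"


def revert_old():
    ruleset["active"] = "old"


outcome_ok = apply_with_rollback(apply_new, lambda: True, revert_old)
state_ok = ruleset["active"]

outcome_lost = apply_with_rollback(apply_new, lambda: False, revert_old)
state_lost = ruleset["active"]

print(outcome_ok, state_ok, outcome_lost, state_lost)
```

The point of the design is that the revert does not depend on the connection that may have just been cut, which is why the physical console is only the second line of defence.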

  • Step 1: Pre-checks (gate: Task 4 GREEN)

  • rbw unlocked || rbw unlock.

  • SSH to ubongo over wt0 from a road-warrior succeeds.

  • SSH to ubongo from mamba on the LAN (10.20.10.50) succeeds.

  • .venv/bin/ansible ubongo -i inventories/production/ -m ping → SUCCESS (over 10.20.10.151).

  • The physical console is reachable. If any path fails, STOP.

  • Step 2: Dry-run the firewall apply

Run: make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall Expected: the nftables diff shows policy drop on input, iifname "wt0" … accept, ip saddr 10.20.10.151 … accept, ip saddr 10.20.10.50 … accept, ip saddr 10.20.10.17 … accept, and the forward chain as policy accept. No errors.

  • Step 3: Apply the host firewall (auto-rollback armed)

Run: make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall Expected: the firewall concern snapshots /etc/nftables.rollback, arms the 45 s systemd-run revert, applies the ruleset, reset_connection → wait_for_connection over 10.20.10.151 succeeds, then cancels the timer. If connectivity is lost, the timer reverts the ruleset within 45 s and the console is the fallback.

  • Step 4: Verify every path + forwarding still works
# from a road-warrior over wt0, and from mamba on the LAN:
ssh sjat@100.99.146.14 true && echo "wt0 OK"
ssh sjat@10.20.10.151 true && echo "mamba-LAN OK"   # run from mamba (10.20.10.50)
# Ansible self-path:
.venv/bin/ansible ubongo -i inventories/production/ -m ping
# a disallowed LAN host (any source other than 10.20.10.151, 10.20.10.50 or 10.20.10.17, which Task 2 allows) must now be refused/timeout on :22
# Docker egress (forward chain still permissive):
docker run --rm busybox wget -qO- https://cloudflare.com/cdn-cgi/trace | head -1
# libvirt-NAT forwarding intact — a fresh integration VM still reaches apt:
make test-integration HOST=ubongo   # expect VERIFY PASSED (proves the NAT path survived)

Expected: wt0 OK, mamba-LAN OK, Ansible SUCCESS, the disallowed host refused, the Docker egress line returns, and the integration cycle passes.

  • Step 5: Reboot resilience — while the console is present (FRICTION signal-6)

With the operator at the physical console, reboot ubongo (sudo systemctl reboot). After it returns, confirm SSH comes back on all paths unaided:

ssh sjat@100.99.146.14 true && echo "wt0 OK after reboot"
.venv/bin/ansible ubongo -i inventories/production/ -m ping

Expected: SSH returns with no manual intervention (no ListenAddress, so nothing to race). Only now is the cutover complete.

  • Step 6: Update STATUS + ROADMAP

  • In STATUS.md: in the roles/base/ row of "Scaffolded but empty", change the firewall note: the firewall concern is now applied to ubongo as INPUT-only default-deny (it is no longer "not yet applied to any host"); note the base__firewall_input_only knob and that the forward default-deny still awaits the docker_host drop-in for real service hosts. In the ubongo control-node row, move the "Pending" item for default-deny to done.

  • In docs/ROADMAP.md: mark mesh-hardening sub-project 2 (ubongo default-deny) done; the remaining follow-ons are sub-project 1 (askari SSH→wt0 redesign) and sub-project 3 (NetBird ACL). Update the "Next step" section accordingly.

git add STATUS.md docs/ROADMAP.md
git commit -m "docs: ubongo INPUT-only default-deny applied (mesh-hardening 2/3 done)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
  • Step 7: Push

Run: git push origin main
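
The path checks in Step 4 reduce to a small allow-list decision: mesh traffic on wt0, the ssh-from-control address, and the admin addresses from Task 2 may reach port 22; everything else is dropped by the input policy. The sketch below models that decision only. ssh_allowed is a hypothetical helper, and the interface name eth0, the mesh peer address, and the disallowed example address 10.20.10.99 are illustrative assumptions, not values from the inventory.

```python
import ipaddress

CONTROL_ADDR = "10.20.10.151"                  # base__firewall_control_addr
ADMIN_ADDRS = ["10.20.10.50", "10.20.10.17"]   # base__firewall_admin_addrs (Task 2)


def ssh_allowed(src, iifname, control_addr=CONTROL_ADDR, admin_addrs=ADMIN_ADDRS):
    """Model of the SSH allow-list: wt0 by interface, the rest by source address."""
    if iifname == "wt0":
        return True
    src_ip = ipaddress.ip_address(src)
    allowed = {ipaddress.ip_address(control_addr)}
    allowed.update(ipaddress.ip_address(addr) for addr in admin_addrs)
    return src_ip in allowed


checks = {
    "mesh peer on wt0": ssh_allowed("100.99.0.2", "wt0"),       # hypothetical peer
    "ansible self-path": ssh_allowed("10.20.10.151", "eth0"),
    "mamba on the LAN": ssh_allowed("10.20.10.50", "eth0"),
    "2nd workstation": ssh_allowed("10.20.10.17", "eth0"),
    "other LAN host": ssh_allowed("10.20.10.99", "eth0"),       # hypothetical host
}

for name, result in checks.items():
    print(f"{name}: {'accept' if result else 'drop'}")
```

Note that 10.20.10.17 is accepted here because Task 2 lists it as an admin address, so a negative test in Step 4 needs a source outside that list.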


Self-review (against the spec)

  • § Design — INPUT-only default-deny → Task 1 (forward-policy knob) + Task 2 (enabled on ubongo). ✓
  • § Design — admin-addrs (operator workstations on LAN) → Task 1 (base__firewall_admin_addrs + template loop) + Task 2 (10.20.10.50 mamba, 10.20.10.17). ✓
  • § Design — no sshd ListenAddress change → nothing touches ssh.yml/sshd_hardening.conf.j2; only nftables. ✓ (verified: Tasks 1-3 file lists exclude them).
  • § allow-list (lo, established, wt0, ssh-from-control, admin-addr, icmp; forward accept) → template already renders lo/established/wt0/control/icmp; Task 1 adds admin-addr + forward-accept. ✓
  • § Why-safe (incident signals 1/2/3/6) → signal 1 (forward accept, Task 1); signal 2 (no ListenAddress); signal 3 (ubongo keeps LAN + console); signal 6 (Task 4 harness reboot + Task 5 step 5 reboot-while-console). ✓
  • § New & changed code (defaults, template, molecule, group_vars/control, integration profile) → Tasks 1-3. ✓
  • § admin raw-leases + revisit → Task 2 comments record both leases + the OPNsense-reservation revisit trigger; backstop (wt0) noted; flagged in FRICTION.md. ✓
  • § Testing (Molecule render asserts; make test-integration HOST=ubongo; live checks) → Task 1 (Molecule), Task 4 (harness), Task 5 step 4 (live). ✓ Coverage split (default in Molecule, input_only on the VM) noted in Task 1.
  • § Staged cutover (signal-6 order) → Task 5 steps 1-7; reboot-recovery (step 5) precedes nothing that retires a break-glass (the console is permanent). ✓
  • § Risks/rollback → auto-rollback (Task 5 step 3), redundant paths + physical console, raw-lease backstop. ✓
  • Type/name consistency: base__firewall_input_only (bool) and base__firewall_admin_addrs (list) are spelled identically in defaults, template, converge, group_vars, and the overlay. integration_profile is spelled identically in both overlays and the three gates in verify.yml. ✓
  • Placeholder scan: no TBD/TODO; every code/command step shows the actual content. ✓