Compare commits
13 commits: 77a20b8d40...215060bac1

| SHA1 |
|---|
| 215060bac1 |
| fa2c4c6368 |
| a881185c73 |
| 180af46879 |
| 8d8c86fa39 |
| 468f8c3a92 |
| 26bb7e442d |
| 6ac5afaf67 |
| b3e14decb4 |
| b10a33f439 |
| 66a9a0af08 |
| e14e347047 |
| 24a1d909c9 |
15 changed files with 884 additions and 24 deletions
|
|
@@ -5,7 +5,7 @@ This repo is partly aspirational: the ADRs in `docs/decisions/` describe the
|
|||
truth. **Before relying on a role, provider, or pipeline existing, check here.**
|
||||
If something is listed as "designed, not built", do not assume it works.
|
||||
|
||||
_Last reviewed: 2026-06-18._
|
||||
_Last reviewed: 2026-06-19._
|
||||
|
||||
## Real and working today
|
||||
|
||||
|
|
@@ -30,7 +30,7 @@ _Last reviewed: 2026-06-18._
|
|||
| `roles/dev_env/` — interactive developer environment | **Built + applied.** zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. **Applied to `ubongo`** for users `sjat` + `claude` (verified: zsh login shells, stow-symlinked `.zshrc`/`.tmux.conf` + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via `playbooks/workstation.yml` against the `control` group (no dedicated `workstations` group yet). |
|
||||
| `make check` / `make deploy PLAYBOOK=<name>` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). |
|
||||
| `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
|
||||
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker + libvirt groups, **`NOPASSWD:ALL` sudo** — ADR-015 amended 2026-06-18; operator `sjat` uses password-required sudo via `sudo` group; the former `sjat-ansible` NOPASSWD drop-in removed 2026-06-18). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **NetBird mesh-enrolled (M5, 2026-06-17):** `wt0` up at `100.99.146.14` via the `base` `mesh` concern. **Pending:** full `base` hardening (only `firewall` exists, NOT applied here — default-deny is the deferred mesh-hardening step now that `wt0` exists); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (now relevant — the offsite tfstate exists). |
|
||||
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker + libvirt groups, **`NOPASSWD:ALL` sudo** — ADR-015 amended 2026-06-18; operator `sjat` uses password-required sudo via `sudo` group; the former `sjat-ansible` NOPASSWD drop-in removed 2026-06-18). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **NetBird mesh-enrolled (M5, 2026-06-17):** `wt0` up at `100.99.146.14` via the `base` `mesh` concern. **`base` firewall applied (mesh-hardening 2/3, 2026-06-19):** INPUT-only default-deny — input locked to `wt0` + ssh-from-control (`10.20.10.151`) + workstations (`10.20.10.50` mamba, `10.20.10.17`); forward `accept` (Docker/libvirt-NAT safe). Live-verified (SSH self-path + Docker egress, after a post-apply `restart docker` — base's flush wipes Docker nat, FRICTION); **real-host reboot validation pending** (low-risk — lockout-safe via the permanent console). `claude` now self-SSHes (ad-hoc `authorized_keys` grant so the agent can run SSH-based deploys with the auto-rollback safety; fold into the control-node bootstrap). **Pending:** full `base` hardening (auditd/CIS); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservations (10.20.10.151 MAC `88:a4:c2:e0:ee:da` + the `.50`/`.17` workstation leases); Terraform state backup (now relevant — the offsite tfstate exists). |
|
||||
| `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **NetBird coordinator (M4b):** `netbird_coordinator` deployed — dashboard live at `https://netbird.askari.wingu.me` (valid LE cert), management API behind embedded Dex (401 unauth), STUN on 3478/udp. **NetBird peer (M5, 2026-06-17):** also enrolled as a mesh agent (`base` `mesh` concern) — `wt0` at `100.99.226.39`, Management+Signal Connected; the agent coexists with the coordinator. **Pending:** host firewall + moving askari's SSH onto `wt0` (deferred mesh-hardening; the Hetzner Cloud Firewall is its perimeter until then), offsite tfstate backup (ADR-022). |
|
||||
| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). The `reverse_proxy` Caddyfile is bind-mounted as a **directory** (`./caddy` → `/etc/caddy`) so atomic re-renders are visible in-container and `caddy reload` actually applies new routes (a single-file mount pinned the stale inode). |
|
||||
| `roles/netbird_coordinator/` — NetBird control plane (ADR-016, M4b) | **Built + applied (askari, 2026-06-16). boma's FIRST real service role.** Self-hosted NetBird **v0.72.4**: a single combined `netbird-server` container (management + signal + relay + STUN + **embedded Dex IdP** at `/oauth2`) + `dashboard:v2.39.0`, on the shared `boma` network behind the M4a Caddy via gRPC-h2c + WebSocket + path routing (`reverse_proxy__routes` gained a raw-`caddy` route type). Secrets `vault.netbird.{auth_secret,datastore_key}` (self-generated). Carries the full service-role file set (SECURITY/VERIFY/ACCESS/BACKUP) — **first stateful role** (`backup__state: true`; encrypted SQLite at `/var/lib/netbird`, off-site backup pending `fisi`/ADR-022). **Verified live:** dashboard 200 + valid LE cert, `/api` 401 (auth-gated, routes OK), STUN up. **Not yet configured:** first-boot `/setup` admin + peer enrolment = M5. |
|
||||
|
|
@@ -39,7 +39,7 @@ _Last reviewed: 2026-06-18._
|
|||
|
||||
| Thing | State |
|
||||
|---|---|
|
||||
| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is built but **not yet applied** to any host (mesh-gated to avoid lockout — M5). Not built: auditd, packages, users (Phase 2 / TODO 15). |
|
||||
| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is **applied to ubongo** (mesh-hardening 2/3, 2026-06-19): INPUT-only default-deny via the new `base__firewall_input_only` knob (input default-deny + `wt0`/ssh-from-control/`base__firewall_admin_addrs` allow-list; forward left `accept` so Docker/libvirt-NAT survive). **Caveat:** base's `flush ruleset` wipes a Docker host's nat, so applying to a Docker host needs a follow-up `restart docker` (FRICTION) — hence still **not** applied to askari pending `docker_host`'s nftables integration. Not built: auditd, packages, users (Phase 2 / TODO 15). |
|
||||
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
|
||||
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
|
||||
|
||||
|
|
@@ -50,7 +50,7 @@ daemon hardening + `nftables.d` container rules, ADR-004/ADR-020 — is still pe
|
|||
A `make deploy PLAYBOOK=site` run now applies real content — `base` (its `firewall` +
|
||||
`hardening` concerns) plus a functional `docker_host` (Docker engine) on docker hosts —
|
||||
but in practice it is still limited: the production cluster has no docker hosts yet, and
|
||||
`base`'s `firewall` concern is mesh-gated until M5, so a full cluster `site` run does not
|
||||
`base`'s `firewall` concern is now applied to `ubongo` (control) but not yet to cluster docker hosts (none exist), so a full cluster `site` run does not
|
||||
yet exist. (The `make check`/`deploy` machinery itself works — first proven by applying
|
||||
`dev_env` via `playbooks/workstation.yml`, then `base`/`docker_host`/`reverse_proxy` on
|
||||
askari.)
|
||||
|
|
|
|||
|
|
@@ -146,6 +146,74 @@ harness on ubongo and shaking it down against real KVM (spec/plan in docs/superp
|
|||
the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
|
||||
cross-file review AND a real-hardware run; neither alone would have shipped it working.
|
||||
|
||||
<!-- From the 2026-06-19 mesh-hardening-2/3 design (ubongo INPUT-only default-deny). -->

- `[friction]` **Raw DHCP leases pinned in ubongo's host firewall (admin-addr SSH allows)**
  (2026-06-19): mesh-hardening 2/3 lets the operator workstations reach ubongo's LAN SSH by
  *raw lease* — `base__firewall_admin_addrs: ["10.20.10.50" (mamba), "10.20.10.17"]` — because
  there is no DHCP reservation yet (OPNsense isn't managed as code). A lease reassignment
  silently moves the allow to whatever host next holds the IP (still SSH-key-gated) and drops
  the workstation's *LAN* path (mesh still works, so never a full lockout). → when
  OPNsense-as-code lands (ADR-020 perimeter / TODO 3.5), replace both with **MAC-pinned DHCP
  reservations** (`10.20.10.17` = MAC `bc:0f:f3:c8:4a:8a`; mamba's MAC TBD) and allow the
  reserved IPs. Spec: `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`.

- `[gotcha]` **`make test-integration` on ubongo fails (`qemu-img` "Permission denied") when
  the agent session predates the `libvirt` group grant** (2026-06-19): the `integration_test`
  role adds `claude` to `libvirt`+`kvm` and makes the cache dir `/var/lib/boma-integration`
  `root:libvirt 2775` — correct — but a `claude` session whose shell started *before* that
  grant carries a stale process group set (`id` → `claude,docker` only, no `libvirt`), so
  `qemu-img create` of the VM overlay into the group-owned dir is denied. `virsh`/`virt-install`
  still work (they reach system libvirtd via polkit/socket, and the real KVM runs server-side
  as `libvirt-qemu`), so ONLY claude's own file-writes break. Unblock without restarting the
  session: **`sg libvirt -c 'make test-integration HOST=<name>'`** (claude needs only `libvirt`
  for the dir; `kvm` is server-side; note `sg` adds one group, not the full set). → self-heal
  in `scripts/integration-vm.py`: if the `libvirt` gid is absent from `os.getgroups()`, re-exec
  under `sg libvirt` (or have the Makefile target do it), so a stale-session agent never hits
  this opaque symptom. New agent sessions pick the groups up on login, so it's a stale-session
  transient — but high-confusion, worth self-healing.

- `[friction]` **No standard for when the agent may run local-VM integration tests on ubongo
  without asking** (2026-06-19): `make test-integration HOST=<name>` spins an ISOLATED throwaway
  KVM VM (its own libvirt NAT; never touches the real host's firewall/network; guards:
  one-VM-at-a-time + a 4 GiB free-RAM floor + auto-destroy on success), so it is safe and
  self-contained — yet the agent paused for a go-ahead before running it (mesh-hardening 2/3,
  Task 4). The operator wants a STANDARD that pre-authorises VM-testing on ubongo so the agent
  just runs it. → decide + record the rule: e.g. a `.claude/settings.json` permission allow for
  `make test-integration*` / `scripts/integration-vm.py` (and the `sg libvirt -c '…'` form per
  the gotcha above), plus a CLAUDE.md line distinguishing the pre-authorised isolated VM tests
  from the genuinely-gated live steps (`make deploy` to real hosts, host reboots, cutovers —
  still need a go-ahead). Ties to the `test-risky-infra-before-live-deploy` +
  `dont-reask-settled-defaults` memories + ADR-025.

- `[gotcha]` **Molecule covers only the `input_only`-OFF (forward drop) branch of the base
  firewall** (2026-06-19): mesh-hardening 2/3 added `base__firewall_input_only` (forward policy
  drop↔accept). The `default` Molecule scenario renders ONE fixture, set to the secure default
  (drop) — so the fast `make test ROLE=base` gate locks the drop default (security-critical for
  service hosts) but does NOT exercise the `=true` → forward-`accept` rendering; only `make
  test-integration HOST=ubongo` does (passed GREEN). An in-converge re-render can't cheaply
  cover it (role defaults aren't in scope outside the role run). → decide in kaizen: a second
  Molecule scenario (`molecule/input-only/`) asserting forward `policy accept`, vs accepting the
  integration-only coverage. Final-review finding; not a cutover blocker (the accept branch is a
  literal, and a var-name break would fail the drop branch too → caught).

- `[gotcha]` **Applying base's firewall to a Docker host flushes Docker's nat → container
  egress dies until `restart docker`** (2026-06-19, mesh-hardening 2/3 live cutover): base's
  `nftables.conf.j2` starts with `flush ruleset`, which wipes ALL tables incl. Docker's
  `ip nat`/`ip filter` (+ libvirt's). On ubongo I chose INPUT-only so `forward` stays `accept`
  — yet the apply STILL broke CONTAINER egress: `docker pull` worked (dockerd uses HOST egress)
  but a container `ping` FAILED — the masquerade (SNAT) was gone, so replies couldn't return.
  `forward accept` permits forwarding but can't replace the missing nat. The spec's "input-only
  keeps Docker egress working" was therefore **incomplete**, and the local-VM harness couldn't
  catch it (the test VM runs no Docker). Fix on the live host: `systemctl restart docker`
  re-adds its `ip nat`/`ip filter` (egress restored; coexists fine with base's `inet filter`).
  On REBOOT it self-heals (dockerd re-adds nat on boot; `forward accept` doesn't block — unlike
  the 2026-06-17 `forward drop` incident). → (1) any cutover/runbook applying base firewall to a
  Docker host MUST `restart docker` + check container egress after the apply; (2) the pending
  `docker_host` nftables integration should own re-adding/persisting Docker's rules so base's
  `flush` is safe; (3) the firewall final-review checklist should include "does the host run
  Docker/libvirt? the flush wipes their nat."
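
The stale-session self-heal proposed in the `qemu-img` gotcha above could look roughly like the following. This is a minimal sketch, not the actual contents of `scripts/integration-vm.py`: the function names are hypothetical, and it assumes the `sg` utility is on `PATH` and that the user really is a member of the group in `/etc/group` (otherwise `sg` will refuse or prompt).

```python
from __future__ import annotations

import grp
import os
import shlex
import sys


def missing_group(group_gid: int, current_gids: list[int]) -> bool:
    """True when the process group set lacks the gid (the stale-session case)."""
    return group_gid not in current_gids


def sg_command(group: str, argv: list[str]) -> list[str]:
    """Build the `sg <group> -c '<command>'` argv that re-runs argv under the group."""
    return ["sg", group, "-c", shlex.join(argv)]


def reexec_under_group_if_stale(group: str = "libvirt") -> None:
    """Hypothetical helper: re-exec this script under `sg <group>` if the gid is absent."""
    try:
        gid = grp.getgrnam(group).gr_gid
    except KeyError:
        return  # the group does not exist on this host, so there is nothing to heal
    # Add the effective gid explicitly: os.getgroups() is not guaranteed to include it,
    # and `sg` makes the target group the effective group, which ends the re-exec loop.
    current = os.getgroups() + [os.getegid()]
    if missing_group(gid, current):
        cmd = sg_command(group, [sys.executable, *sys.argv])
        os.execvp(cmd[0], cmd)  # replaces the current process on success
```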
|
||||
|
||||
---
|
||||
|
||||
## Kaizen reviews — decisions ledger
|
||||
|
|
|
|||
|
|
@@ -13,7 +13,7 @@ as ordering changes, or as new milestones appear. Each milestone gets its own
|
|||
spec → plan → implementation cycle (`docs/superpowers/specs/` then `…/plans/`) when it
|
||||
comes up; this file stays high-level.
|
||||
|
||||
_Last updated: 2026-06-17._
|
||||
_Last updated: 2026-06-19._
|
||||
|
||||
---
|
||||
|
||||
|
|
@@ -206,14 +206,17 @@ Canonical dependency order:
|
|||
|
||||
## Next step
|
||||
|
||||
**Phase 1 is complete (M1–M5).** The next build is the **mesh-hardening follow-on**
|
||||
(deferred from M5, now safe because the `wt0` mesh path exists):
|
||||
**Phase 1 complete (M1–M5); mesh-hardening 2/3 (ubongo default-deny) DONE (2026-06-19)** —
|
||||
INPUT-only nftables default-deny applied + live-verified on `ubongo` (`base__firewall_input_only`;
|
||||
spec/plan `docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-ubongo-default-deny*`;
|
||||
real-host reboot validation pending, low-risk — lockout-safe via the permanent console).
|
||||
Remaining mesh-hardening sub-projects, each its own spec → plan → implementation cycle:
|
||||
|
||||
1. apply `base`'s nftables **default-deny** to `ubongo` + set `base__firewall_control_addr`
|
||||
(ADR-021 `ssh-from-control`, built/dormant) — lockout-risky on the control node itself,
|
||||
so it relies on the firewall's auto-rollback;
|
||||
2. tighten the NetBird ACL **off Allow-All** to scoped policies;
|
||||
3. move `askari`'s SSH onto `wt0`, retiring the Hetzner-firewall WAN allow.
|
||||
1. ~~`ubongo` nftables default-deny + `ssh-from-control`~~ → **DONE (2026-06-19).**
|
||||
2. tighten the NetBird ACL **off Allow-All** to scoped policies (open mechanism question —
|
||||
no headless API path).
|
||||
3. **redesign** `askari`'s SSH → `wt0` (the 2026-06-17 attempt was backed out; the redesign
|
||||
must resolve the boot-race, the coordinator-bootstrap chicken-egg, and the Docker-nat-flush
|
||||
that the `flush ruleset` causes on a Docker host).
|
||||
|
||||
Needs its own spec → plan → implementation cycle. **Then** the Procurement gate
|
||||
(`/capacity-review` → buy Proxmox hardware) opens Phase 2.
|
||||
**Then** the Procurement gate (`/capacity-review` → buy Proxmox hardware) opens Phase 2.
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,470 @@
# Mesh-hardening 2/3 — ubongo INPUT-only default-deny — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Apply base's nftables firewall to the control node (ubongo) as an INPUT-only default-deny — hardening its inbound surface — while leaving the forward chain permissive so Docker egress and the libvirt-NAT integration harness keep working, and without any sshd `ListenAddress` change.

**Architecture:** Two new `base` knobs make the existing firewall concern fit a control node: `base__firewall_input_only` flips the forward chain to `policy accept` (host-local input filtering only), and `base__firewall_admin_addrs` adds operator-workstation LAN sources to the SSH allow-list (alongside `wt0` and `ssh-from-control`). sshd is untouched (nftables does the scoping → no `ip_nonlocal_bind` boot-race). The change is validated on a throwaway VM via the ADR-025 integration harness (a new "be ubongo" profile) before an operator-supervised live cutover whose safety net is the firewall auto-rollback timer plus the permanent on-prem physical console.

**Tech Stack:** Ansible (role `base`, FQCN), nftables, Jinja2, Molecule on Debian 13, pytest (none new), the ADR-025 integration harness (`scripts/integration-vm.py`, JSON profiles, `-e @` overlays).

**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`

**Conventions:** `make lint` and `make test ROLE=base` before each commit; `make check` before `make deploy`; never hand-edit the generated `offsite.yml`; `rbw unlocked` for any commit touching Ansible content and for the integration/live applies (the production `group_vars/all/vault.yml` is in inventory scope and gets decrypted at playbook load). Tasks 1–3 are code (subagent-driven, each lint/Molecule-verified). Task 4 is a real-VM validation gate on ubongo. Task 5 is the live, operator-supervised cutover.

---

## File Structure

| File | Create/Modify | Responsibility |
|---|---|---|
| `roles/base/defaults/main.yml` | Modify | Declare `base__firewall_input_only` + `base__firewall_admin_addrs` (defaults: off / empty). |
| `roles/base/templates/nftables.conf.j2` | Modify | Conditional forward policy; render an SSH-allow rule per admin address. |
| `roles/base/molecule/default/converge.yml` | Modify | Fixture: an admin-addr source (input-only stays at its default → forward drop). |
| `roles/base/molecule/default/verify.yml` | Modify | Assert forward-drop default + the admin-addr rule render. |
| `inventories/production/group_vars/control/vars.yml` | Modify | Turn the knobs on for ubongo (input-only; mamba's LAN IP). |
| `tests/integration/overrides/ubongo.yml` | Create | The "be ubongo" overlay (input-only firewall; harness SSH lifeline). |
| `tests/integration/profiles/ubongo.json` | Create | The "be ubongo" VM profile (group `control`, applies `site.yml:base`). |
| `tests/integration/overrides/askari.yml` | Modify | Add the `integration_profile` marker (verify is now profile-aware). |
| `tests/integration/verify.yml` | Modify | Gate the askari (Docker/DNAT) block; add the ubongo (input-only) block + a guard. |
| `STATUS.md`, `docs/ROADMAP.md` | Modify (Task 5) | Record mesh-hardening 2/3 done. |

---
### Task 1: base role — `base__firewall_input_only` (forward policy) + `base__firewall_admin_addrs` (LAN SSH allow)

**Files:**
- Modify: `roles/base/defaults/main.yml`
- Modify: `roles/base/templates/nftables.conf.j2`
- Modify: `roles/base/molecule/default/converge.yml`
- Modify: `roles/base/molecule/default/verify.yml`

> **Test strategy (note):** Molecule renders one fixture, so it locks the *secure default* —
> `input_only` **off** → forward `policy drop` — plus the new admin-addr rule (red→green). The
> `input_only` **on** → forward `policy accept` path is exercised on a real VM by the
> integration "be ubongo" profile (Tasks 3–4), whose verify fails red until this template
> conditional exists. Both branches are covered, across the two test layers.

- [ ] **Step 1: Write the failing test (extend Molecule verify)**

In `roles/base/molecule/default/verify.yml`, after the `Assert the docker_host extension hook is present` block, add:

```yaml
- name: Assert the forward chain defaults to policy drop (input_only off)
  ansible.builtin.assert:
    that:
      - "'hook forward priority 0; policy drop;' in nft"
    fail_msg: >-
      forward chain must default to policy drop when base__firewall_input_only is
      false (container isolation stays the norm on real service hosts)

- name: Assert the admin-addr SSH allow rule (operator workstation on the LAN)
  ansible.builtin.assert:
    that:
      - "'ip saddr 10.30.0.77 tcp dport 22 accept' in nft"
    fail_msg: "missing admin-addr SSH allow rule from base__firewall_admin_addrs"
```

- [ ] **Step 2: Add the fixture that drives it (Molecule converge)**

In `roles/base/molecule/default/converge.yml`, add to the `vars:` block (after the `base__firewall_control_addr` line):

```yaml
base__firewall_admin_addrs:
  - "10.30.0.77"  # fixture: an operator-workstation LAN source (admin-addr SSH allow)
```

- [ ] **Step 3: Run the test to verify it fails**

Run: `make test ROLE=base`
Expected: FAIL on `Assert the admin-addr SSH allow rule` (the template does not consume `base__firewall_admin_addrs` yet, so the `ip saddr 10.30.0.77 …` rule is absent). The forward-drop assertion passes already (the template currently hardcodes `policy drop`).

- [ ] **Step 4: Add the defaults**

In `roles/base/defaults/main.yml`, after the `base__firewall_apply: true` line (end of the firewall behaviour block, currently line 13), add:

```yaml
base__firewall_input_only: false  # true → the forward chain is `policy accept` (host-local
                                  # INPUT filtering only). For hosts that forward/route
                                  # container or NAT traffic (the control node's Docker +
                                  # libvirt-NAT) where a forward default-deny would break
                                  # them. Real service hosts keep this false (forward drop).
base__firewall_admin_addrs: []    # extra LAN source IPs allowed to SSH, besides wt0 +
                                  # ssh-from-control. For an operator workstation reaching
                                  # the host over the LAN (no mesh). Key-gated. (ADR-021)
```

- [ ] **Step 5: Make the forward policy conditional + render the admin-addr rules**

In `roles/base/templates/nftables.conf.j2`:

(a) Replace the forward-chain line (currently line 21):

```jinja
chain forward { type filter hook forward priority 0; policy {{ 'accept' if base__firewall_input_only | bool else 'drop' }}; }
```

(b) After the `ssh-from-control` `{% endif %}` (currently line 14) and before the `ip protocol icmp accept` line, add the admin-addr loop:

```jinja
{% for addr in base__firewall_admin_addrs %}
ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endfor %}
```

- [ ] **Step 6: Run the test to verify it passes**

Run: `make test ROLE=base`
Expected: PASS — converge renders the ruleset; verify confirms the forward chain is `policy drop` (input_only defaults false) and the `ip saddr 10.30.0.77 tcp dport 22 accept` rule is present; all pre-existing assertions stay green.

- [ ] **Step 7: Lint**

Run: `make lint`
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.

- [ ] **Step 8: Commit**

```bash
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): input-only forward policy + admin-addr SSH allow

base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
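
To make the two branches of the Step 5 template change concrete, the rendering can be modelled in plain Python. This is only an illustrative sketch: it does not invoke Jinja2 or the real `nftables.conf.j2`, the function names are hypothetical, and the default SSH port of 22 is an assumption inferred from the `tcp dport 22` string the Molecule assertion expects.

```python
from __future__ import annotations


def forward_chain(input_only: bool) -> str:
    """Model of the forward-chain line: policy accept when input-only, else drop."""
    policy = "accept" if input_only else "drop"
    return f"chain forward {{ type filter hook forward priority 0; policy {policy}; }}"


def admin_ssh_rules(admin_addrs: list[str], ssh_port: int = 22) -> list[str]:
    """Model of the admin-addr loop: one SSH accept rule per source address."""
    return [f"ip saddr {addr} tcp dport {ssh_port} accept" for addr in admin_addrs]


if __name__ == "__main__":
    # Secure default (what the Molecule fixture locks): forward drop + one admin rule.
    print(forward_chain(False))
    for rule in admin_ssh_rules(["10.30.0.77"]):
        print(rule)
    # Control-node setting (what the integration profile exercises): forward accept.
    print(forward_chain(True))
```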
---

### Task 2: inventory — enable input-only default-deny + mamba on ubongo (control group)

**Files:**
- Modify: `inventories/production/group_vars/control/vars.yml`

- [ ] **Step 1: Turn the knobs on for the control group**

Append to `inventories/production/group_vars/control/vars.yml`:

```yaml

# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0 (mesh), the
# ssh-from-control self-path (base__firewall_control_addr, group_vars/all = 10.20.10.151), or
# mamba on the LAN. Break-glass: the physical console. (base__firewall_apply defaults true.)
base__firewall_input_only: true
base__firewall_admin_addrs:
  - "10.20.10.50"  # mamba over the LAN (NetBird off). Raw DHCP lease — revisit with an
                   # OPNsense reservation when OPNsense-as-code lands; backstopped by wt0.
  - "10.20.10.17"  # 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.
```

- [ ] **Step 2: Verify the vars resolve for ubongo**

Run: `.venv/bin/ansible-inventory -i inventories/production/ --host ubongo 2>/dev/null | grep -E 'firewall_input_only|firewall_admin_addrs|10.20.10.(50|17)'`
Expected: shows `"base__firewall_input_only": true` and `"base__firewall_admin_addrs": ["10.20.10.50", "10.20.10.17"]`.

- [ ] **Step 3: Lint**

Run: `make lint`
Expected: clean pass (`check-tags: OK`).

- [ ] **Step 4: Commit**

```bash
git add inventories/production/group_vars/control/vars.yml
git commit -m "feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH

Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
raw leases, backstopped by wt0). Mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: integration harness — "be ubongo" profile (overlay + profile + profile-aware verify)
|
||||
|
||||
**Files:**
|
||||
- Create: `tests/integration/overrides/ubongo.yml`
|
||||
- Create: `tests/integration/profiles/ubongo.json`
|
||||
- Modify: `tests/integration/overrides/askari.yml`
|
||||
- Modify: `tests/integration/verify.yml`
|
||||
|
||||
- [ ] **Step 1: Create the "be ubongo" overlay**
|
||||
|
||||
Create `tests/integration/overrides/ubongo.yml`:
|
||||
|
||||
```yaml
|
||||
---
|
||||
# Integration-test overlay for the "ubongo" profile (ADR-025). Passed via `-e @`.
|
||||
# Exercises mesh-hardening 2/3: base's INPUT-only default-deny on the control node — input
|
||||
# chain default-deny, forward chain left permissive (Docker/libvirt-NAT safe), no sshd
|
||||
# ListenAddress change (so no boot-race).
|
||||
integration_profile: ubongo
|
||||
base__firewall_apply: true
|
||||
base__firewall_input_only: true # forward chain renders `policy accept`
|
||||
base__firewall_admin_addrs:
|
||||
- "192.168.150.98" # two representative LAN sources — exercises the
|
||||
- "192.168.150.99" # admin-addr loop with a multi-entry list (like ubongo)
|
||||
# Never wt0-only; never touch the real mesh from a throwaway VM.
|
||||
base__ssh_listen_mesh_only: false
|
||||
base__mesh_enabled: false
|
||||
# Allow SSH from the libvirt-NAT gateway (where the driver/ansible connect from) so the
|
||||
# default-deny apply + the reboot don't lock out the harness. By source IP (interface-
|
||||
# independent). This is the harness's lifeline; the admin-addr above is only exercised.
|
||||
base__firewall_control_addr: "192.168.150.1"
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Create the "be ubongo" VM profile**
|
||||
|
||||
Create `tests/integration/profiles/ubongo.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"groups": ["control"],
|
||||
"applies": [
|
||||
{"playbook": "site.yml", "tags": ["base"]}
|
||||
],
|
||||
"extra_vars_files": ["overrides/ubongo.yml"],
|
||||
"mem_mib": 2048,
|
||||
"vcpus": 2
|
||||
}
|
||||
```
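
Beyond the `json.tool` syntax check in Step 5, the profile's shape could be checked too. A minimal sketch: the key set is simply what the ubongo profile above contains, and treating every key as required is an assumption about the harness, as is the helper name `profile_errors`:

```python
import json

# Keys present in the ubongo profile above; treating all of them as
# required is an assumption, not something the harness is known to enforce.
ASSUMED_KEYS = {"groups", "applies", "extra_vars_files", "mem_mib", "vcpus"}


def profile_errors(text: str) -> list[str]:
    """Return shape problems found in a profile JSON document."""
    profile = json.loads(text)
    errors = [f"missing key: {key}" for key in sorted(ASSUMED_KEYS - set(profile))]
    for index, step in enumerate(profile.get("applies", [])):
        if "playbook" not in step:
            errors.append(f"applies[{index}] has no playbook")
    return errors


UBONGO_PROFILE = """
{
  "groups": ["control"],
  "applies": [{"playbook": "site.yml", "tags": ["base"]}],
  "extra_vars_files": ["overrides/ubongo.yml"],
  "mem_mib": 2048,
  "vcpus": 2
}
"""
print(profile_errors(UBONGO_PROFILE) or "OK")
```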
|
||||
|
||||
- [ ] **Step 3: Mark the askari overlay with its profile name**
|
||||
|
||||
In `tests/integration/overrides/askari.yml`, after the two header comment lines (before `base__firewall_apply: true`), add:
|
||||
|
||||
```yaml
|
||||
integration_profile: askari
|
||||
```
|
||||
|
||||
- [ ] **Step 4: Make `verify.yml` profile-aware (the test)**
|
||||
|
||||
Replace the entire contents of `tests/integration/verify.yml` with:
|
||||
|
||||
```yaml
|
||||
---
|
||||
# Integration verify (ADR-025). Outcome-based, profile-aware: the active profile is named by
|
||||
# `integration_profile` (set in each profile's overlay). Each profile asserts its own success
|
||||
# criteria; an unknown/unset profile fails loudly (never a silent pass).
|
||||
- name: Verify the rebooted host
|
||||
hosts: all
|
||||
become: true
|
||||
gather_facts: false
|
||||
tasks:
|
||||
- name: A known integration_profile must be set (no silent pass)
|
||||
ansible.builtin.assert:
|
||||
that:
|
||||
- integration_profile is defined
|
||||
- integration_profile in ['askari', 'ubongo']
|
||||
fail_msg: "integration_profile must be set in the profile overlay (askari|ubongo)"
|
||||
|
||||
# ── askari profile — Docker host: published-port forwarding survives the reboot ──
|
||||
# The load-bearing check probes the VM's published :80 FROM the controller (ubongo) — if
|
||||
# base's forward-drop killed DNAT, this times out (the FRICTION 2026-06-17 #1 bug).
|
||||
- name: (askari) Gather service facts
|
||||
when: integration_profile == 'askari'
|
||||
ansible.builtin.service_facts:
|
||||
|
||||
- name: (askari) Docker daemon is active
|
||||
when: integration_profile == 'askari'
|
||||
ansible.builtin.assert:
|
||||
that: "ansible_facts.services['docker.service'].state == 'running'"
|
||||
fail_msg: "docker.service is not running"
|
||||
|
||||
- name: (askari) Forward chain permits container traffic (drop-in loaded)
|
||||
when: integration_profile == 'askari'
|
||||
ansible.builtin.command: nft list chain inet filter forward
|
||||
register: _fwd
|
||||
changed_when: false
|
||||
|
||||
- name: (askari) Assert container forwarding is allowed (not pure drop)
|
||||
when: integration_profile == 'askari'
|
||||
ansible.builtin.assert:
|
||||
that: "'accept' in _fwd.stdout"
|
||||
fail_msg: >-
|
||||
forward chain is pure drop — container forwarding will die on reboot
|
||||
(FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.
|
||||
|
||||
- name: (askari) Published port answers from the controller (DNAT + forward alive)
|
||||
when: integration_profile == 'askari'
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
ansible.builtin.uri:
|
||||
url: "http://{{ ansible_host }}/"
|
||||
follow_redirects: none
|
||||
status_code: [200, 301, 308, 404, 502, 503]
|
||||
timeout: 10
|
||||
register: _probe
|
||||
retries: 5
|
||||
delay: 6
|
||||
until: _probe is succeeded
|
||||
|
||||
# ── ubongo profile — control node: INPUT-only default-deny survives the reboot ──
|
||||
# SSH reachability across the reboot is proven by the harness itself (it re-SSHes and
|
||||
# checks boot_id changed before this verify runs). Here we assert the ruleset shape.
|
||||
- name: (ubongo) Read the live nftables ruleset
|
||||
when: integration_profile == 'ubongo'
|
||||
ansible.builtin.command: nft list ruleset
|
||||
register: _nft
|
||||
changed_when: false
|
||||
|
||||
- name: (ubongo) INPUT default-deny, forward permissive, admin-addr allow
|
||||
when: integration_profile == 'ubongo'
|
||||
ansible.builtin.assert:
|
||||
that:
|
||||
- "_nft.stdout is search('hook input priority (0|filter); policy drop;')"
|
||||
- "_nft.stdout is search('hook forward priority (0|filter); policy accept;')"
|
||||
- "'ip saddr 192.168.150.98 tcp dport 22 accept' in _nft.stdout"
|
||||
- "'ip saddr 192.168.150.99 tcp dport 22 accept' in _nft.stdout"
|
||||
fail_msg: >-
|
||||
ubongo profile: expected input policy drop, forward policy accept (input-only),
|
||||
and both admin-addr (192.168.150.98/99) SSH allows in the live ruleset.
|
||||
```
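
The ubongo assertions can be mirrored outside Ansible, for example to re-check a saved `nft.txt` from a failed run. A minimal sketch, assuming the file holds `nft list ruleset` output; the function name and the sample ruleset are hypothetical. It accepts the hook priority printed either as `0` or as `filter`, since some nft versions print the name rather than the number:

```python
import re


def ruleset_problems(ruleset: str, admin_addrs: list[str], ssh_port: int = 22) -> list[str]:
    """Check the three properties the ubongo block asserts; return what is missing."""
    problems = []
    # The hook priority may be printed as a number ("0") or a name ("filter").
    if not re.search(r"hook input priority (0|filter); policy drop;", ruleset):
        problems.append("input chain is not policy drop")
    if not re.search(r"hook forward priority (0|filter); policy accept;", ruleset):
        problems.append("forward chain is not policy accept")
    for addr in admin_addrs:
        if f"ip saddr {addr} tcp dport {ssh_port} accept" not in ruleset:
            problems.append(f"no SSH allow rule for {addr}")
    return problems


# Hand-written sample, not captured from a real host.
SAMPLE = """
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        ip saddr 192.168.150.98 tcp dport 22 accept
        ip saddr 192.168.150.99 tcp dport 22 accept
    }
    chain forward {
        type filter hook forward priority filter; policy accept;
    }
}
"""
print(ruleset_problems(SAMPLE, ["192.168.150.98", "192.168.150.99"]) or "OK")
```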
|
||||
|
||||
- [ ] **Step 5: Validate the JSON + lint**
|
||||
|
||||
Run: `.venv/bin/python -m json.tool tests/integration/profiles/ubongo.json >/dev/null && echo OK` then `make lint`
|
||||
Expected: `OK`, then a clean lint pass (`check-tags: OK`).
|
||||
|
||||
- [ ] **Step 6: Commit**
|
||||
|
||||
```bash
|
||||
git add tests/integration/overrides/ubongo.yml tests/integration/profiles/ubongo.json \
|
||||
tests/integration/overrides/askari.yml tests/integration/verify.yml
|
||||
git commit -m "test(integration): add the 'be ubongo' profile (input-only default-deny)
|
||||
|
||||
A control-group VM that applies base with INPUT-only default-deny (forward
|
||||
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
|
||||
integration_profile marker — the askari Docker/DNAT block is gated, and a ubongo
|
||||
block asserts input drop + forward accept + the admin-addr rule. Enables
|
||||
\`make test-integration HOST=ubongo\`. Mesh-hardening 2/3 (ADR-025).
|
||||
|
||||
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Validate on the integration harness (`make test-integration HOST=ubongo`) — the GREEN gate
|
||||
|
||||
> Runs a throwaway UEFI VM on ubongo: boots it, applies the base role with the ubongo
|
||||
> overlay (INPUT-only default-deny), **reboots it**, and asserts the ruleset + SSH-returns.
|
||||
> This proves the change survives a reboot before the real control node is ever touched
|
||||
> (spec §cutover step 1; FRICTION signal-6). No code change / no commit — a validation gate.
|
||||
|
||||
- [ ] **Step 1: Ensure the vault is unlocked**
|
||||
|
||||
The run loads `inventories/production/group_vars/all/vault.yml` (symlinked into the run dir), which is decrypted at playbook load.
|
||||
|
||||
Run: `rbw unlocked || rbw unlock`
|
||||
Expected: exits 0 (unlocked). If it prompts, the operator unlocks.
|
||||
|
||||
- [ ] **Step 2: Run the integration cycle**
|
||||
|
||||
Run: `make test-integration HOST=ubongo`
|
||||
Expected (the `cycle`: up → apply → reboot → assert): the VM gets a `192.168.150.x` lease; `site.yml --tags base` applies cleanly; `… rebooted (boot_id changed), SSH back at 192.168.150.x`; then `VERIFY PASSED for boma-it-ubongo-…`. The VM is destroyed on success.
|
||||
|
||||
- [ ] **Step 3: On failure, read the diagnostics**
|
||||
|
||||
If it prints `VERIFY FAILED`, diagnostics are in `~/integration-runs/boma-it-ubongo-<id>/` (`nft.txt`, `console.log`, `journal.txt`). The likely suspects are the admin-addr/forward assertion (Task 1/3 wiring) or SSH not returning post-reboot (the `base__firewall_control_addr: 192.168.150.1` lifeline in the overlay). Fix the implicated task, re-commit, and re-run Step 2. If a VM was left defined, run `make test-integration-clean` first.
|
||||
|
||||
- [ ] **Step 4: Record the result**
|
||||
|
||||
Capture the `VERIFY PASSED` line in the task notes (this is the gate Task 5 step 1 depends on). No commit.
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Live staged cutover (operator-supervised — NOT a subagent task)
|
||||
|
||||
> Touches the **real ubongo** (the control node Ansible runs from) and reboots it — lockout-
|
||||
> risky. Run it interactively with the operator, in order, verifying each step before the
|
||||
> next. The firewall auto-rollback timer (`base__firewall_rollback_timeout`, 45 s) +
|
||||
> `wait_for_connection` over the live path is the safety net; the **on-prem physical console**
|
||||
> is the permanent break-glass. Do NOT hand this to an unattended agent.
|
||||
|
||||
- [ ] **Step 1: Pre-checks (gate: Task 4 GREEN)**
|
||||
|
||||
- `rbw unlocked || rbw unlock`.
|
||||
- SSH to ubongo over `wt0` from a road-warrior succeeds.
|
||||
- SSH to ubongo from mamba on the LAN (`10.20.10.50`) succeeds.
|
||||
- `.venv/bin/ansible ubongo -i inventories/production/ -m ping` → `SUCCESS` (over `10.20.10.151`).
|
||||
- The physical console is reachable. If any path fails, STOP.
|
||||
|
||||
- [ ] **Step 2: Dry-run the firewall apply**
|
||||
|
||||
Run: `make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall`
|
||||
Expected: the nftables diff shows `policy drop` on input, `iifname "wt0" … accept`, `ip saddr 10.20.10.151 … accept`, `ip saddr 10.20.10.50 … accept`, `ip saddr 10.20.10.17 … accept`, and the forward chain as `policy accept`. No errors.
|
||||
|
||||
- [ ] **Step 3: Apply the host firewall (auto-rollback armed)**
|
||||
|
||||
Run: `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall`
|
||||
Expected: the firewall concern snapshots `/etc/nftables.rollback`, arms the 45 s `systemd-run` revert, applies the ruleset, `reset_connection` → `wait_for_connection` over `10.20.10.151` succeeds, then cancels the timer. If connectivity is lost, the timer reverts the ruleset within 45 s and the console is the fallback.
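
The confirm-or-revert ordering described in Step 3 can be sketched as plain control flow. This models the sequence only: the three callables are hypothetical stand-ins, and in the real role the revert is a timer on the host, so it fires even when the controller can no longer reach it.

```python
def apply_with_auto_rollback(apply_ruleset, connection_ok, revert):
    """Run the confirm-or-revert sequence and return the steps taken.

    The snapshot and timer are represented only as step names; the revert
    is an ordinary call here, whereas on the host it is a timer that needs
    no working connection to fire.
    """
    steps = ["snapshot", "arm-timer"]
    apply_ruleset()
    steps.append("apply")
    if connection_ok():
        steps.append("cancel-timer")
    else:
        revert()
        steps.append("timer-reverts")
    return steps


# Simulate a bad ruleset: the fresh connection never comes back.
state = {"rules": "old"}
steps = apply_with_auto_rollback(
    apply_ruleset=lambda: state.update(rules="new"),
    connection_ok=lambda: False,
    revert=lambda: state.update(rules="old"),
)
print(steps, state)
```

The point of the ordering is that the timer is armed before the ruleset changes, so a lockout cannot prevent the revert.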
|
||||
|
||||
- [ ] **Step 4: Verify every path + forwarding still works**
|
||||
|
||||
```bash
|
||||
# from a road-warrior over wt0, and from mamba on the LAN:
|
||||
ssh sjat@100.99.146.14 true && echo "wt0 OK"
|
||||
ssh sjat@10.20.10.151 true && echo "mamba-LAN OK" # run from mamba (10.20.10.50)
|
||||
# Ansible self-path:
|
||||
.venv/bin/ansible ubongo -i inventories/production/ -m ping
|
||||
# any other LAN host (not 10.20.10.50, 10.20.10.17, or ubongo itself) must now time out on :22 (policy drop)
|
||||
# Docker egress (forward chain still permissive):
|
||||
docker run --rm busybox wget -qO- https://cloudflare.com/cdn-cgi/trace | head -1
|
||||
# libvirt-NAT forwarding intact — a fresh integration VM still reaches apt:
|
||||
make test-integration HOST=ubongo # expect VERIFY PASSED (proves the NAT path survived)
|
||||
```
|
||||
Expected: `wt0 OK`, `mamba-LAN OK`, Ansible `SUCCESS`, the disallowed host timing out, the Docker egress line returns, and the integration cycle passes.
|
||||
|
||||
- [ ] **Step 5: Reboot resilience — while the console is present (FRICTION signal-6)**
|
||||
|
||||
With the operator at the physical console, reboot ubongo (`sudo systemctl reboot`). After it returns, confirm SSH comes back on all paths **unaided**:
|
||||
|
||||
```bash
|
||||
ssh sjat@100.99.146.14 true && echo "wt0 OK after reboot"
|
||||
.venv/bin/ansible ubongo -i inventories/production/ -m ping
|
||||
```
|
||||
Expected: SSH returns with no manual intervention (no `ListenAddress`, so nothing to race). Only now is the cutover complete.
|
||||
|
||||
- [ ] **Step 6: Update STATUS + ROADMAP**
|
||||
|
||||
- In `STATUS.md`: in the `roles/base/` row of "Scaffolded but empty", change the firewall note: the `firewall` concern is now **applied to ubongo** as INPUT-only default-deny (it is no longer "not yet applied to any host"); note the `base__firewall_input_only` knob and that the forward default-deny still awaits the `docker_host` drop-in for real service hosts. In the ubongo control-node row, mark the "Pending" default-deny item as done.
|
||||
- In `docs/ROADMAP.md`: mark **mesh-hardening sub-project 2 (ubongo default-deny) done**; the remaining follow-on is sub-project 1 (askari SSH→`wt0` *redesign*) and sub-project 3 (NetBird ACL). Update the "Next step" section accordingly.
|
||||
|
||||
```bash
|
||||
git add STATUS.md docs/ROADMAP.md
|
||||
git commit -m "docs: ubongo INPUT-only default-deny applied (mesh-hardening 2/3 done)
|
||||
|
||||
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
- [ ] **Step 7: Push**
|
||||
|
||||
Run: `git push origin main`
|
||||
|
||||
---
|
||||
|
||||
## Self-review (against the spec)
|
||||
|
||||
- **§ Design — INPUT-only default-deny** → Task 1 (forward-policy knob) + Task 2 (enabled on ubongo). ✓
|
||||
- **§ Design — admin-addrs (operator workstations on LAN)** → Task 1 (`base__firewall_admin_addrs` + template loop) + Task 2 (`10.20.10.50` mamba, `10.20.10.17`). ✓
|
||||
- **§ Design — no sshd ListenAddress change** → nothing touches `ssh.yml`/`sshd_hardening.conf.j2`; only nftables. ✓ (verified: Tasks 1–3 file lists exclude them).
|
||||
- **§ allow-list** (lo, established, wt0, ssh-from-control, admin-addr, icmp; forward accept) → template already renders lo/established/wt0/control/icmp; Task 1 adds admin-addr + forward-accept. ✓
|
||||
- **§ Why-safe (incident signals 1/2/3/6)** → signal 1 (forward accept, Task 1); signal 2 (no ListenAddress); signal 3 (ubongo keeps LAN + console); signal 6 (Task 4 harness reboot + Task 5 step 5 reboot-while-console). ✓
|
||||
- **§ New & changed code** (defaults, template, molecule, group_vars/control, integration profile) → Tasks 1–3. ✓
|
||||
- **§ admin raw-leases + revisit** → Task 2 comments record both leases + the OPNsense-reservation revisit trigger; backstop (wt0) noted; flagged in `FRICTION.md`. ✓
|
||||
- **§ Testing** (Molecule render asserts; `make test-integration HOST=ubongo`; live checks) → Task 1 (Molecule), Task 4 (harness), Task 5 step 4 (live). ✓ Coverage split (default in Molecule, input_only on the VM) noted in Task 1.
|
||||
- **§ Staged cutover (signal-6 order)** → Task 5 steps 1–7; reboot-recovery (step 5) is proven with the break-glass still in place, and no step retires it (the console is permanent). ✓
|
||||
- **§ Risks/rollback** → auto-rollback (Task 5 step 3), redundant paths + physical console, raw-lease backstop. ✓
|
||||
- **Type/name consistency:** `base__firewall_input_only` (bool) and `base__firewall_admin_addrs` (list) are spelled identically in defaults, template, converge, group_vars, and the overlay. `integration_profile` is spelled identically in both overlays and the three gates in `verify.yml`. ✓
|
||||
- **Placeholder scan:** no TBD/TODO; every code/command step shows the actual content. ✓
|
||||
|
|
@ -0,0 +1,203 @@
|
|||
# Spec — Mesh-hardening (2 of 3): ubongo INPUT-only default-deny + `ssh-from-control`
|
||||
|
||||
Status: Accepted (2026-06-19)
|
||||
|
||||
## Context & scope
|
||||
|
||||
The **mesh-hardening follow-on** (deferred from M5, ROADMAP) was decomposed into three
|
||||
independent sub-projects, each its own spec → plan → implementation cycle:
|
||||
|
||||
1. askari SSH → `wt0` — spec/plan written 2026-06-17, **attempted and backed out the same day**
|
||||
(the incident; six lessons in `FRICTION.md`). Needs a redesign — **not** this spec.
|
||||
2. **ubongo nftables default-deny + `ssh-from-control`** ← *this spec*
|
||||
3. NetBird ACL off Allow-All → scoped policies (its own later spec; open mechanism question —
|
||||
no headless API path).
|
||||
|
||||
ROADMAP (re-ordered after the 2026-06-17 incident) puts **ubongo first**: it is the clean,
|
||||
low-risk case — a physical box with a permanent console break-glass, and *not* the coordinator
|
||||
host that the incident proved you must not corner.
|
||||
|
||||
This spec hardens **ubongo's inbound surface only**. It does **not** change sshd's
|
||||
`ListenAddress` (so no boot-race), does **not** apply a forward-chain default-deny (so Docker +
|
||||
the libvirt NAT keep working), and does **not** touch askari or the NetBird ACL.
|
||||
|
||||
Current state (verified on ubongo, 2026-06-19): **no host firewall** — sshd listens on
|
||||
`0.0.0.0:22`, reachable from LAN, mesh, and anything routable; only Docker's + libvirt's own
|
||||
`iptables-nft` tables exist. Interfaces: `eno1` `10.20.10.151` (LAN, = `ansible_host`), `wt0`
|
||||
`100.99.146.14` (mesh), `docker0` (one container, no published ports), `virbr-boma`
|
||||
`192.168.150.1/24` (the libvirt NAT that `make test-integration` uses), `ip_forward=1`.
|
||||
|
||||
## Goal / success criteria
|
||||
|
||||
- SSH to ubongo succeeds over **`wt0`** (road-warriors, askari), from **mamba on the LAN**
|
||||
(`10.20.10.50`), and via the **`ssh-from-control` self-path** (Ansible; source `10.20.10.151`).
|
||||
- SSH from any **other** LAN source is **dropped** (default-deny on `input`).
|
||||
- **Docker container egress and `make test-integration` (libvirt NAT) keep working** — the
|
||||
forward chain is untouched.
|
||||
- A **reboot** does not lock SSH out (no `ListenAddress`, so no bind race).
|
||||
- Break-glass is the **on-prem physical console** (permanent, non-mesh). The live apply is
|
||||
additionally gated by the firewall **auto-rollback** timer.
|
||||
|
||||
## Design
|
||||
|
||||
Apply base's nftables `firewall` concern to ubongo, with two adjustments and one deliberate
|
||||
non-change:
|
||||
|
||||
1. **INPUT-only default-deny.** The `input` chain keeps `policy drop` with the guaranteed
|
||||
management plane: `lo`, `established,related`, ICMP, SSH on `wt0`, and SSH from
|
||||
`ssh-from-control` (`10.20.10.151`). We add **two operator-workstation sources** (mamba,
|
||||
`10.20.10.50`, and a second workstation, `10.20.10.17`) via a new `base__firewall_admin_addrs` list. Everything else on `eno1` drops.
|
||||
2. **Forward chain left permissive.** base hardcodes `chain forward { … policy drop; }` for
|
||||
inter-container isolation. On ubongo that would break Docker egress **and** the libvirt NAT
|
||||
the integration harness depends on — the same class of failure that sank askari (FRICTION
|
||||
2026-06-17, signal 1). A new `base__firewall_input_only` knob renders the forward chain
|
||||
`policy accept` instead. Docker's and libvirt's own `iptables-nft` forward rules continue to
|
||||
apply (separate tables); base simply does not add a default-deny on top.
|
||||
3. **No sshd `ListenAddress` change.** sshd keeps listening on `0.0.0.0:22`; nftables does all
|
||||
inbound scoping. This deliberately avoids the `ip_nonlocal_bind` boot-race that broke askari
|
||||
(FRICTION signal 2) — there is nothing to bind before `wt0` exists.
|
||||
|
||||
Resulting `input` allow-list:
|
||||
|
||||
```
|
||||
iif "lo" accept
|
||||
ct state established,related accept
|
||||
ct state invalid drop
|
||||
iifname "wt0" tcp dport 22 accept # mesh (road-warriors, askari)
|
||||
ip saddr 10.20.10.151 tcp dport 22 accept # ssh-from-control (Ansible self) — group_vars/all
|
||||
ip saddr 10.20.10.50 tcp dport 22 accept # mamba on the LAN — base__firewall_admin_addrs
|
||||
ip saddr 10.20.10.17 tcp dport 22 accept # 2nd operator wkstn — base__firewall_admin_addrs
|
||||
ip protocol icmp accept ; ip6 nexthdr ipv6-icmp accept
|
||||
# (no catalog services on ubongo) → default drop
|
||||
chain forward: policy accept # Docker + libvirt-NAT forwarding preserved
|
||||
```
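
For new SSH connections, the allow-list above reduces to a small decision function. A minimal sketch covering TCP/22 only (loopback, established traffic, and ICMP are left out); the function name and the `10.20.10.99` test address are hypothetical:

```python
# Source addresses allowed to SSH regardless of interface, per the allow-list.
UBONGO_SSH_SOURCES = {"10.20.10.151", "10.20.10.50", "10.20.10.17"}


def ssh_verdict(iifname: str, saddr: str) -> str:
    """Verdict for a new TCP/22 connection arriving on `iifname` from `saddr`."""
    if iifname == "wt0":  # anything arriving over the mesh interface
        return "accept"
    if saddr in UBONGO_SSH_SOURCES:  # control self-path + admin addrs
        return "accept"
    return "drop"  # input chain policy


for iface, src in [("wt0", "100.99.146.14"), ("eno1", "10.20.10.50"), ("eno1", "10.20.10.99")]:
    print(iface, src, ssh_verdict(iface, src))
```

Because the unlisted source falls through to the chain policy, its packets are dropped silently, so the client sees a timeout rather than a refusal.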
|
||||
|
||||
## Why ubongo is the safe case (maps to the 2026-06-17 incident)
|
||||
|
||||
- **Signal 1** (forward-drop breaks Docker hosts): sidestepped — INPUT-only leaves forwarding alone.
|
||||
- **Signal 2** (`ip_nonlocal_bind` boot-race): sidestepped — no `ListenAddress`; sshd binds nothing new.
|
||||
- **Signal 3** (a host's only mgmt path must not depend on a service it hosts): satisfied —
|
||||
ubongo is not the coordinator and keeps three independent paths (mesh, LAN, physical console).
|
||||
- **Signal 6** (recovery tested after the break-glass was removed): the physical console is
|
||||
permanent (nothing to retire), and reboot-recovery is proven on a throwaway VM first.
|
||||
|
||||
## New & changed code
|
||||
|
||||
**Role `base`:**
|
||||
|
||||
- `roles/base/defaults/main.yml` — add:
|
||||
- `base__firewall_input_only: false` — when true, the forward chain is `policy accept`
|
||||
(host-local input filtering only), for hosts that route/forward container or NAT traffic
|
||||
(e.g. the control node's Docker + libvirt-NAT) where a forward default-deny would break them.
|
||||
- `base__firewall_admin_addrs: []` — extra LAN source IPs allowed to SSH (besides `wt0` +
|
||||
`ssh-from-control`); for an operator workstation reaching the host over the LAN. Key-gated.
|
||||
- `roles/base/templates/nftables.conf.j2`:
|
||||
- the forward line (currently line 21) →
|
||||
`chain forward { type filter hook forward priority 0; policy {{ "accept" if base__firewall_input_only | bool else "drop" }}; }`
|
||||
- after the `ssh-from-control` block (currently lines 12-14), add a loop:
|
||||
`{% for addr in base__firewall_admin_addrs %}` →
|
||||
`ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept`
|
||||
- `roles/base/molecule/default/{converge,verify}.yml`: fixture adds an `admin_addrs` entry and
|
||||
leaves `input_only` at its default (the accept case is covered on the integration VM); assert (a) `forward` renders `policy drop`, (b) the admin-addr accept
|
||||
rule renders, (c) existing input default-deny + `wt0` + control-addr assertions stay green.
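
How the two new variables shape the rendered ruleset can be sketched in plain Python. This is an illustration of the template logic described above, not the role's Jinja2 template; both function names are hypothetical:

```python
def render_ssh_allows(control_addr, admin_addrs, ssh_port=22):
    """Source-based SSH accepts: control address first, then each admin addr."""
    lines = []
    if control_addr:  # mirrors the existing ssh-from-control block
        lines.append(f"ip saddr {control_addr} tcp dport {ssh_port} accept")
    for addr in admin_addrs:  # mirrors the new admin-addr loop
        lines.append(f"ip saddr {addr} tcp dport {ssh_port} accept")
    return lines


def render_forward_chain(input_only):
    """Forward chain line: policy accept only when input_only is set."""
    policy = "accept" if input_only else "drop"
    return f"chain forward {{ type filter hook forward priority 0; policy {policy}; }}"


# ubongo's values from this spec.
for line in render_ssh_allows("10.20.10.151", ["10.20.10.50", "10.20.10.17"]):
    print(line)
print(render_forward_chain(input_only=True))
```

With the defaults (`[]` and `false`) the loop adds nothing and the forward chain stays `policy drop`, so hosts that do not set the new variables render the same ruleset as before.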
|
||||
|
||||
**Inventory** (`inventories/production/group_vars/control/vars.yml`, append):
|
||||
|
||||
```yaml
|
||||
# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
|
||||
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
|
||||
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
|
||||
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0, the
|
||||
# ssh-from-control self-path (base__firewall_control_addr in group_vars/all), or mamba on the
|
||||
# LAN. Break-glass: the physical console.
|
||||
base__firewall_input_only: true
|
||||
base__firewall_admin_addrs:
|
||||
- "10.20.10.50" # mamba over the LAN (NetBird off). Raw DHCP lease — see note below.
|
||||
- "10.20.10.17" # a 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.
|
||||
# base__firewall_apply defaults true; base__firewall_control_addr (= ubongo's own 10.20.10.151)
|
||||
# is set in group_vars/all and covers Ansible's self-connection.
|
||||
```
|
||||
|
||||
**Integration harness** (ADR-025) — a "be ubongo" profile, mirroring "be askari":
|
||||
|
||||
- `tests/integration/overrides/ubongo.yml` — `firewall_apply: true`, `input_only: true`,
|
||||
`admin_addrs: ["192.168.150.98", "192.168.150.99"]` (two representative LAN addrs to exercise the rule),
|
||||
`firewall_control_addr: "192.168.150.1"` (the libvirt-NAT gateway = the harness's own SSH
|
||||
path, so the apply + reboot don't lock it out), `ssh_listen_mesh_only: false`,
|
||||
`mesh_enabled: false`.
|
||||
- `tests/integration/profiles/ubongo.json` — mirror `profiles/askari.json` (VM resources/image).
|
||||
- `tests/integration/verify.yml` — make the assertions **profile-aware** (gated on the active
|
||||
profile, since `verify.yml` is shared): for ubongo assert `input` policy drop, `forward`
|
||||
policy **accept**, and the admin-addr rule present. Reachability across the reboot is the
|
||||
harness's existing cycle. The askari assertions (Docker/forward-DNAT) must **not** run for the
|
||||
ubongo profile, nor vice-versa.
|
||||
|
||||
Enables `make test-integration HOST=ubongo`.
|
||||
|
||||
## The admin-addrs — deliberately interim values
|
||||
|
||||
`base__firewall_admin_addrs: ["10.20.10.50", "10.20.10.17"]` are the operator workstations'
|
||||
**current raw DHCP leases** (mamba + a second box), not reservations (operator decision,
|
||||
2026-06-19). Both share the operator's `sjat` SSH key. Caveats, accepted for now:
|
||||
|
||||
- **Lease drift:** if DHCP reassigns either IP, the rule allows whatever host then holds it
|
||||
(still SSH-key-gated, so low risk) and that workstation loses its *LAN* path. **Backstop:**
|
||||
the workstations also reach ubongo over `wt0` (mesh), so they are never cut off — only the
|
||||
off-mesh LAN convenience lapses until the IP is corrected.
|
||||
- **Revisit trigger (flagged for follow-up):** when OPNsense-as-code lands (ADR-020 perimeter /
|
||||
TODO 3.5), replace both raw leases with **MAC-pinned DHCP reservations** (`10.20.10.17` =
|
||||
MAC `bc:0f:f3:c8:4a:8a`) and allow the reserved addresses. Recorded as a `FRICTION.md` open
|
||||
signal so the next `/kaizen` surfaces it.
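
Because the list holds hand-typed raw leases, a small pre-flight check could catch a typo before it reaches the template. A minimal sketch: the `10.20.10.0/24` LAN prefix is an assumption (this spec gives only host addresses), and the function name is hypothetical:

```python
import ipaddress


def invalid_admin_addrs(addrs, lan="10.20.10.0/24"):
    """Return the entries that are not valid IP addresses inside `lan`."""
    network = ipaddress.ip_network(lan)
    bad = []
    for raw in addrs:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            bad.append(raw)  # not an IP address at all
            continue
        if ip not in network:
            bad.append(raw)  # valid address, but outside the assumed LAN
    return bad


print(invalid_admin_addrs(["10.20.10.50", "10.20.10.17"]) or "OK")
```

This does not detect lease drift itself, since a reassigned address is still a valid one; it only guards against malformed or off-subnet entries.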
|
||||
|
||||
## Testing
|
||||
|
||||
- **Molecule** (base `default`, render-only, `firewall_apply: false`): the new forward-policy +
|
||||
admin-addr assertions above, with existing assertions green.
|
||||
- **Integration harness** (`make test-integration HOST=ubongo`): on a throwaway UEFI VM, apply
|
||||
the ubongo overlay, assert the ruleset shape, and prove **SSH survives a reboot** from an
|
||||
allowed source (the existing assert/cycle). This is the gate before touching the real control
|
||||
node.
|
||||
- **Live** (during cutover): SSH over `wt0` ✓, from mamba LAN ✓, Ansible self-ping ✓; SSH from a
|
||||
disallowed LAN host dropped ✓; `docker run … ` egress ✓; a fresh `make test-integration`
|
||||
still spins a VM (libvirt NAT intact) ✓.
|
||||
|
||||
## Staged cutover (operator-supervised — lockout-aware, FRICTION signal-6 order)
|
||||
|
||||
ubongo is managed as `sjat` (password sudo), so the live apply needs the operator present
|
||||
anyway. The physical console is open throughout.
|
||||
|
||||
1. **Harness GREEN:** `make test-integration HOST=ubongo` passes (incl. the reboot).
|
||||
2. **Pre-check the real paths** *before* applying: SSH over `wt0`, SSH from mamba
|
||||
(`10.20.10.50`), `ansible ubongo -m ping`. Confirm the physical console is reachable.
|
||||
3. **Dry-run:** `make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall` — review the nftables diff
|
||||
(input default-deny + `wt0` + `10.20.10.151` + `10.20.10.50` + `10.20.10.17`; forward `policy accept`).
|
||||
4. **Apply (auto-rollback armed):** `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall` — the
|
||||
firewall concern snapshots, arms the 45 s revert, applies, `reset_connection` →
|
||||
`wait_for_connection` over the live path (`10.20.10.151`), then cancels the timer. A bad
|
||||
ruleset reverts itself; the console is the ultimate fallback.
|
||||
5. **Verify** every path + Docker egress + a fresh integration-VM spin (above).
|
||||
6. **Reboot ubongo; confirm SSH returns on all paths unaided** (console present). Only now is it
|
||||
done — recovery is proven *while the break-glass is still there*.
|
||||
7. **Docs:** update `STATUS.md` (ubongo row: input-only default-deny applied) and `ROADMAP.md`
|
||||
(mesh-hardening 2/3 done; next is sub-project 1 askari redesign or 3 NetBird ACL).
|
||||
|
||||
## Risks & rollback
|
||||
|
||||
- **Self-referential apply** (ubongo runs Ansible against itself): mitigated by the auto-rollback
|
||||
timer, the `wait_for_connection` over the real path, three redundant allowed sources, and the
|
||||
permanent physical console, so any lockout stays recoverable on-site.
|
||||
- **Raw-lease fragility:** documented above; backstopped by the mesh path; revisit with OPNsense.
|
||||
- **No new container isolation** (forward stays accept): accepted — ubongo is a single-tenant
|
||||
control node, not a service host; Docker/libvirt keep their own forward rules. The forward
|
||||
default-deny remains the norm for real service hosts (`base__firewall_input_only: false`).
|
||||
|
||||
## Out of scope / follow-ons
|
||||
|
||||
- askari SSH → `wt0` redesign (sub-project 1) — needs the boot-race + coordinator-bootstrap
|
||||
resolved; folds in the coordinator-robustness (geo-DB FATAL-loop) + off-site backup lessons.
|
||||
- NetBird ACL off Allow-All (sub-project 3) — open mechanism question (no headless API path).
|
||||
- OPNsense DHCP reservations for the admin workstations (`10.20.10.50` mamba, `10.20.10.17`)
|
||||
and ubongo: replace the raw leases with MAC-pinned reservations once OPNsense-as-code lands;
|
||||
flagged in `FRICTION.md`.
|
||||
- Forward-chain container isolation on ubongo — deliberately not done here.
|
||||
- `STATUS.md` / `ROADMAP.md` edits land with the implementation, not this spec.
|
||||
|
|
@ -19,3 +19,15 @@ base__ai_worker_user: claude
|
|||
# Enrollment only; the host firewall default-deny stays deferred (the mesh-hardening
|
||||
# follow-on), so this brings up wt0 without changing SSH exposure.
|
||||
base__mesh_enabled: true
|
||||
|
||||
# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
|
||||
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
|
||||
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
|
||||
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0 (mesh), the
|
||||
# ssh-from-control self-path (base__firewall_control_addr, group_vars/all = 10.20.10.151), or
|
||||
# mamba on the LAN. Break-glass: the physical console. (base__firewall_apply defaults true.)
|
||||
base__firewall_input_only: true
|
||||
base__firewall_admin_addrs:
|
||||
- "10.20.10.50" # mamba over the LAN (NetBird off). Raw DHCP lease — revisit with an
|
||||
# OPNsense reservation when OPNsense-as-code lands; backstopped by wt0.
|
||||
- "10.20.10.17" # 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.
|
||||
|
|
|
|||
|
|
@ -11,6 +11,14 @@ base__firewall_rollback_timeout: 45 # seconds before the auto-revert fires on a
 base__firewall_confirm_timeout: 20  # seconds to re-establish a fresh connection post-apply
 base__firewall_dropin_dir: /etc/nftables.d
 base__firewall_apply: true  # set false to render+validate without applying (CI/Molecule)
+base__firewall_input_only: false  # true → the forward chain is `policy accept` (host-local
+                                  # INPUT filtering only). For hosts that forward/route
+                                  # container or NAT traffic (the control node's Docker +
+                                  # libvirt-NAT) where a forward default-deny would break
+                                  # them. Real service hosts keep this false (forward drop).
+base__firewall_admin_addrs: []  # extra LAN source IPs allowed to SSH, besides wt0 +
+                                # ssh-from-control. For an operator workstation reaching
+                                # the host over the LAN (no mesh). Key-gated. (ADR-021)

 # SSH hardening + fail2ban (ADR-002) — `hardening` concern.
 base__ssh_password_authentication: "no"
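Each entry of `base__firewall_admin_addrs` ends up in an IPv4 match (`ip saddr <addr> tcp dport <port> accept`), so a hostname or an IPv6 address in that list would not fit the rule. A minimal pre-flight sketch, assuming a hypothetical `check_admin_addrs` helper that is not part of the role (whether the role validates this itself is not shown here):

```python
import ipaddress


def check_admin_addrs(addrs):
    """Return the entries that would not fit an IPv4 `ip saddr` match.

    Hypothetical helper for illustration; not part of the role.
    """
    bad = []
    for addr in addrs:
        try:
            # strict=False also accepts a prefix with host bits set, e.g. 10.20.10.50/24.
            ipaddress.IPv4Network(addr, strict=False)
        except ValueError:
            # Raised for hostnames, IPv6 literals, and malformed addresses.
            bad.append(addr)
    return bad


if __name__ == "__main__":
    print(check_admin_addrs(["10.20.10.50", "10.20.10.17"]))
```

Returning the offending entries, rather than raising on the first one, lets a caller report every bad address in a single failure message.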
@ -6,6 +6,8 @@
   vars:
     base__firewall_apply: false
     base__firewall_control_addr: 10.10.0.99  # test control-node LAN address
+    base__firewall_admin_addrs:
+      - "10.30.0.77"  # fixture: an operator-workstation LAN source (admin-addr SSH allow)
     # Exercise the mesh concern's include path with the live actions gated off, so it
     # runs hermetically (no coordinator/key needed) and must be a clean no-op.
     base__mesh_enabled: true
@ -51,6 +51,20 @@
           - "'include \"/etc/nftables.d/*.nft\"' in nft"
         fail_msg: "missing drop-in include hook"

+    - name: Assert the forward chain defaults to policy drop (input_only off)
+      ansible.builtin.assert:
+        that:
+          - "'hook forward priority 0; policy drop;' in nft"
+        fail_msg: >-
+          forward chain must default to policy drop when base__firewall_input_only is
+          false (container isolation stays the norm on real service hosts)
+
+    - name: Assert the admin-addr SSH allow rule (operator workstation on the LAN)
+      ansible.builtin.assert:
+        that:
+          - "'ip saddr 10.30.0.77 tcp dport 22 accept' in nft"
+        fail_msg: "missing admin-addr SSH allow rule from base__firewall_admin_addrs"
+
     - name: Syntax-check the rendered ruleset (no apply)
       ansible.builtin.command: nft -c -f /etc/nftables.conf
       changed_when: false
@ -12,13 +12,16 @@ table inet filter {
 {% if base__firewall_control_addr %}
         ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
 {% endif %}
+{% for addr in base__firewall_admin_addrs %}
+        ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept
+{% endfor %}
         ip protocol icmp accept
         ip6 nexthdr ipv6-icmp accept
 {% for r in base__firewall_resolved %}
         ip saddr { {{ r.sources | join(', ') }} } {{ r.proto }} dport {{ r.port }} accept
 {% endfor %}
     }
-    chain forward { type filter hook forward priority 0; policy drop; }
+    chain forward { type filter hook forward priority 0; policy {{ 'accept' if base__firewall_input_only | bool else 'drop' }}; }
     chain output { type filter hook output priority 0; policy accept; }
 }

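The template change above makes two small decisions: it emits one SSH accept rule per admin address, and it picks the forward policy from `base__firewall_input_only`. A minimal Python sketch of the same logic, assuming a hypothetical `render_rules` helper that stands in for the Jinja2 template (the role itself renders these lines with Jinja2, not with this function):

```python
def render_rules(admin_addrs, input_only, ssh_port=22):
    """Mimic the template's admin-addr loop and forward-policy switch.

    Hypothetical helper for illustration only; not part of the role.
    """
    # One SSH accept rule per admin source address, in list order.
    lines = [f"ip saddr {addr} tcp dport {ssh_port} accept" for addr in admin_addrs]
    # input_only=True leaves forwarding permissive; otherwise forward stays default-deny.
    policy = "accept" if input_only else "drop"
    lines.append(
        f"chain forward {{ type filter hook forward priority 0; policy {policy}; }}"
    )
    return lines


if __name__ == "__main__":
    for line in render_rules(["192.168.150.98", "192.168.150.99"], input_only=True):
        print(line)
```

An empty address list yields no admin rules at all, which matches the `[]` default: hosts that do not set the variable render the same ruleset as before.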
@ -201,6 +201,13 @@ def up(host, name=None, mem_mib=DEFAULT_MEM_MIB, vcpus=DEFAULT_VCPUS):
     sh(["cloud-localds", "--network-config", str(RUN_DIR / "network-config"),
         str(seed), str(RUN_DIR / "user-data"), str(RUN_DIR / "meta-data")])
     console = CACHE_DIR / f"{name}-console.log"
+    # virt-install has a `#!/usr/bin/env python3` shebang; the Makefile prepends .venv/bin to
+    # PATH (so the venv's ansible tools resolve), which would hijack virt-install into the
+    # isolated venv — it lacks system PyGObject (`gi`) and crashes. Strip the venv from PATH
+    # for this system tool so its shebang finds /usr/bin/python3 (which has gi). Ansible is
+    # invoked via its absolute .venv path elsewhere, so it is unaffected.
+    sys_path = ":".join(p for p in os.environ.get("PATH", "").split(":")
+                        if "/.venv/bin" not in p)
     sh(["virt-install", "--name", name, "--memory", str(mem_mib), "--vcpus", str(vcpus),
         "--boot", "uefi",  # genericcloud triple-faults on legacy BIOS handoff; UEFI boots
         "--import",
@ -210,7 +217,8 @@ def up(host, name=None, mem_mib=DEFAULT_MEM_MIB, vcpus=DEFAULT_VCPUS):
         "--osinfo", "debian13",
         "--graphics", "none",
         "--serial", f"file,path={console}",
-        "--noautoconsole"])
+        "--noautoconsole"],
+       env=dict(os.environ, PATH=sys_path))
     ip = wait_for_ip(name)
     wait_for_ssh(ip, "ansible")
     # Block until cloud-init finishes (incl. apt-get update) so apply sees a ready system.
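The PATH fix in this hunk can be tried in isolation. The sketch below repeats the same filter as two small functions; `strip_venv_from_path` and `system_tool_env` are hypothetical names introduced here, not functions in the harness, and the `/.venv/bin` marker is taken from the diff:

```python
import os


def strip_venv_from_path(path, marker="/.venv/bin", sep=":"):
    """Return PATH with every entry containing `marker` removed, order preserved."""
    return sep.join(p for p in path.split(sep) if marker not in p)


def system_tool_env(environ=None):
    """Copy of the environment whose PATH no longer points into the venv.

    Hypothetical helper for illustration; the harness builds this dict inline.
    """
    environ = os.environ if environ is None else environ
    return dict(environ, PATH=strip_venv_from_path(environ.get("PATH", "")))


if __name__ == "__main__":
    print(strip_venv_from_path("/repo/.venv/bin:/usr/bin:/bin"))
```

Passing the result as `env=` to a single subprocess call keeps the change local: the parent process and every other tool it launches still see the venv first on PATH.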
@ -1,6 +1,7 @@
 ---
 # Integration-test overlay for the "askari" profile (ADR-025). Passed via `-e @`.
 # Reproduces the 2026-06-17 incident: apply base's nftables default-deny to a Docker host.
+integration_profile: askari
 base__firewall_apply: true
 # Keep a break-glass: sshd stays on all interfaces (never wt0-only in a throwaway VM).
 base__ssh_listen_mesh_only: false
18
tests/integration/overrides/ubongo.yml
Normal file
@ -0,0 +1,18 @@
+---
+# Integration-test overlay for the "ubongo" profile (ADR-025). Passed via `-e @`.
+# Exercises mesh-hardening 2/3: base's INPUT-only default-deny on the control node — input
+# chain default-deny, forward chain left permissive (Docker/libvirt-NAT safe), no sshd
+# ListenAddress change (so no boot-race).
+integration_profile: ubongo
+base__firewall_apply: true
+base__firewall_input_only: true  # forward chain renders `policy accept`
+base__firewall_admin_addrs:
+  - "192.168.150.98"  # two representative LAN sources — exercises the
+  - "192.168.150.99"  # admin-addr loop with a multi-entry list (like ubongo)
+# Never wt0-only; never touch the real mesh from a throwaway VM.
+base__ssh_listen_mesh_only: false
+base__mesh_enabled: false
+# Allow SSH from the libvirt-NAT gateway (where the driver/ansible connect from) so the
+# default-deny apply + the reboot don't lock out the harness. By source IP (interface-
+# independent). This is the harness's lifeline; the admin-addr above is only exercised.
+base__firewall_control_addr: "192.168.150.1"
9
tests/integration/profiles/ubongo.json
Normal file
@ -0,0 +1,9 @@
+{
+  "groups": ["control"],
+  "applies": [
+    {"playbook": "site.yml", "tags": ["base"]}
+  ],
+  "extra_vars_files": ["overrides/ubongo.yml"],
+  "mem_mib": 2048,
+  "vcpus": 2
+}
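The profile names a playbook, a tag list, and an overlay file that the overlay comments say is passed via `-e @`. As a rough sketch of how such a profile could map onto command lines, assuming a hypothetical `playbook_commands` helper (the real driver's inventory handling and path resolution are not shown in this diff and may differ):

```python
import json


def playbook_commands(profile):
    """Build one ansible-playbook argv per entry in the profile's `applies` list.

    Hypothetical helper for illustration; not the harness's actual driver code.
    """
    commands = []
    for step in profile["applies"]:
        argv = ["ansible-playbook", step["playbook"]]
        if step.get("tags"):
            argv += ["--tags", ",".join(step["tags"])]
        # Each overlay file becomes an extra-vars file argument (`-e @file`).
        for path in profile.get("extra_vars_files", []):
            argv += ["-e", f"@{path}"]
        commands.append(argv)
    return commands


PROFILE = json.loads("""
{
  "groups": ["control"],
  "applies": [{"playbook": "site.yml", "tags": ["base"]}],
  "extra_vars_files": ["overrides/ubongo.yml"],
  "mem_mib": 2048,
  "vcpus": 2
}
""")

if __name__ == "__main__":
    for argv in playbook_commands(PROFILE):
        print(" ".join(argv))
```

Building an argv list instead of a shell string avoids quoting problems if a path or tag ever contains a space.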
@ -1,33 +1,48 @@
 ---
-# Integration verify (ADR-025). Outcome-based: proves Docker forwarding survives the
-# reboot. The load-bearing check probes the VM's published :80 FROM the controller
-# (ubongo) — if base's forward-drop killed DNAT, this times out (the FRICTION #1 bug).
+# Integration verify (ADR-025). Outcome-based, profile-aware: the active profile is named by
+# `integration_profile` (set in each profile's overlay). Each profile asserts its own success
+# criteria; an unknown/unset profile fails loudly (never a silent pass).
 - name: Verify the rebooted host
   hosts: all
   become: true
   gather_facts: false
   tasks:
-    - name: Gather service facts
+    - name: A known integration_profile must be set (no silent pass)
+      ansible.builtin.assert:
+        that:
+          - integration_profile is defined
+          - integration_profile in ['askari', 'ubongo']
+        fail_msg: "integration_profile must be set in the profile overlay (askari|ubongo)"
+
+    # ── askari profile — Docker host: published-port forwarding survives the reboot ──
+    # The load-bearing check probes the VM's published :80 FROM the controller (ubongo) — if
+    # base's forward-drop killed DNAT, this times out (the FRICTION 2026-06-17 #1 bug).
+    - name: (askari) Gather service facts
+      when: integration_profile == 'askari'
       ansible.builtin.service_facts:

-    - name: Docker daemon is active
+    - name: (askari) Docker daemon is active
+      when: integration_profile == 'askari'
       ansible.builtin.assert:
         that: "ansible_facts.services['docker.service'].state == 'running'"
         fail_msg: "docker.service is not running"

-    - name: Forward chain permits container traffic (drop-in loaded)
+    - name: (askari) Forward chain permits container traffic (drop-in loaded)
+      when: integration_profile == 'askari'
       ansible.builtin.command: nft list chain inet filter forward
       register: _fwd
       changed_when: false

-    - name: Assert container forwarding is allowed (not pure drop)
+    - name: (askari) Assert container forwarding is allowed (not pure drop)
+      when: integration_profile == 'askari'
       ansible.builtin.assert:
         that: "'accept' in _fwd.stdout"
         fail_msg: >-
           forward chain is pure drop — container forwarding will die on reboot
           (FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.

-    - name: Published port answers from the controller (DNAT + forward alive)
+    - name: (askari) Published port answers from the controller (DNAT + forward alive)
+      when: integration_profile == 'askari'
       delegate_to: localhost
       become: false
       ansible.builtin.uri:
@ -42,3 +57,29 @@
       retries: 5
       delay: 6
       until: _probe is succeeded
+
+    # ── ubongo profile — control node: INPUT-only default-deny survives the reboot ──
+    # SSH reachability across the reboot is proven by the harness itself (it re-SSHes and
+    # checks boot_id changed before this verify runs). Here we assert the ruleset shape.
+    - name: (ubongo) Read the live nftables ruleset
+      when: integration_profile == 'ubongo'
+      ansible.builtin.command: nft list ruleset
+      register: _nft
+      changed_when: false
+
+    - name: (ubongo) INPUT default-deny, forward permissive, lifeline + admin-addr allow
+      when: integration_profile == 'ubongo'
+      ansible.builtin.assert:
+        that:
+          # live `nft list ruleset` prints the SYMBOLIC priority (`filter` = 0), unlike the
+          # rendered /etc/nftables.conf (`priority 0`) that the Molecule scenario asserts against.
+          - "'hook input priority filter; policy drop;' in _nft.stdout"
+          - "'hook forward priority filter; policy accept;' in _nft.stdout"
+          # the ssh-from-control lifeline (base__firewall_control_addr) — the reconnect path
+          - "'ip saddr 192.168.150.1 tcp dport 22 accept' in _nft.stdout"
+          - "'ip saddr 192.168.150.98 tcp dport 22 accept' in _nft.stdout"
+          - "'ip saddr 192.168.150.99 tcp dport 22 accept' in _nft.stdout"
+        fail_msg: >-
+          ubongo profile: expected input policy drop, forward policy accept (input-only),
+          the ssh-from-control lifeline (192.168.150.1), and both admin-addr
+          (192.168.150.98/99) SSH allows in the live ruleset.
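The comment in the ubongo assertions notes that the live ruleset prints `priority filter` while the rendered file says `priority 0`, which is why the two test layers match different strings. One way to compare both spellings with a single check is to normalize the symbolic name first. This is a sketch only: `normalize_priority` and `has_chain_policy` are hypothetical helpers, and only the `filter` = 0 alias stated in the diff is mapped:

```python
import re

# `nft list ruleset` prints well-known priorities by name; `filter` is 0.
# Only the one alias this diff relies on is mapped here.
_SYMBOLIC_PRIORITIES = {"filter": "0"}


def normalize_priority(ruleset_text):
    """Rewrite `priority <name>` to its numeric form so both spellings compare equal."""
    def _swap(match):
        name = match.group(1)
        return "priority " + _SYMBOLIC_PRIORITIES.get(name, name)
    return re.sub(r"priority (\w+)", _swap, ruleset_text)


def has_chain_policy(ruleset_text, hook, policy):
    """True if a chain on `hook` at priority 0 has the given default policy."""
    needle = f"hook {hook} priority 0; policy {policy};"
    return needle in normalize_priority(ruleset_text)


if __name__ == "__main__":
    live = "type filter hook forward priority filter; policy accept;"
    print(has_chain_policy(live, "forward", "accept"))
```

The playbook's plain substring assertions are simpler and sufficient where the output format is known; a normalizing check like this only pays off if the same assertion has to run against both the rendered file and the live ruleset.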