The local-VM integration harness RED→GREEN acceptance passed on real hardware
(2026-06-18): a KVM VM on ubongo reproduced the 2026-06-17 nftables/Docker reboot
breakage (RED) and survived with the docker_host container-forward drop-in (GREEN).
ADR-025: Status updated to PASSED; shakedown learnings section added (UEFI boot
required, claude sudo load-bearing); ADR-021 added to Related.
STATUS.md: integration-harness section updated from PENDING to PASSED; ubongo
entry updated to reflect claude NOPASSWD sudo + sjat-ansible NOPASSWD removal;
last-reviewed date updated.
docs/TODO.md: item 2.4 collapsed to one-line pointer per the file's convention.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The integration-testing shakedown reversed ADR-015's "no local sudo" sub-decision:
the claude AI-worker now has NOPASSWD:ALL sudo on ubongo — without it, virsh,
nft, and journalctl all block during VM diagnosis. Compensating controls:
password-locked account, auditd/Loki attribution, repo-managed revocable drop-in.
ADR-015: dated amendment note in Status + expanded AI-worker identity section.
ADR-021: new §Sudo model (amendment 2026-06-18) — claude=NOPASSWD, sjat=password
required; former sjat NOPASSWD drop-in removed 2026-06-18 (least-privilege cleanup).
accepted-risks.md: R7 added (claude NOPASSWD:ALL on ubongo); last-reviewed updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add base__ai_worker_user var (default empty), a new operational_access.yml
task file that drops a validated sudoers file for the named user, and wire it
into base/tasks/main.yml after the hardening includes under the `users` tag.
Set base__ai_worker_user: claude in group_vars/control so that applying base
to ubongo is idempotent with the manual /etc/sudoers.d/claude-ai-worker drop-in
already in place. Password remains locked; NOPASSWD is the only sudo path;
actions are attributed via auditd (ADR-021).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Accept caddy's 308 on :80 as proof the DNAT+forward path is alive; don't follow into https (tls internal has no cert for a bare-IP request). This load-bearing end-to-end check is what caught the br-+/br-* nftables-wildcard bug that the string-presence assert missed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
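The :80 probe described in the commit above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the helper names `forward_path_alive` and `probe_http` are hypothetical. It relies on the fact that `http.client` never follows redirects, so the 308 is observed directly and no https request is made.

```python
import http.client


def forward_path_alive(status: int) -> bool:
    """Treat Caddy's permanent redirect on :80 as proof the path is alive."""
    return status == 308


def probe_http(host: str, port: int = 80, timeout: float = 5.0) -> int:
    """Return the raw status of GET / without following any redirect."""
    # http.client does not follow redirects, so the TLS side is never touched.
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    finally:
        conn.close()
```

A caller would assert `forward_path_alive(probe_http(vm_ip))`, where `vm_ip` is whatever address the driver holds for the test VM.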
base's inet-filter forward chain is policy-drop; on a Docker host that kills published-port DNAT + inter-container forwarding ON REBOOT (nftables loads default-deny before dockerd). This drop-in (loaded via base's /etc/nftables.d/*.nft include at boot) appends the container-bridge accepts so a rebooted Docker host keeps forwarding. Resolves FRICTION 2026-06-17 #1 and the GREEN half of ADR-025's acceptance test. NB nftables wildcard is br-*, not the iptables br-+.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
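The br-+ versus br-* mix-up noted above is the kind of mistake a small check on the rendered ruleset can catch. The sketch below is an assumption about how such a check might look, not something taken from the repo: the function name and the regex are hypothetical, and it only flags interface names ending in `+` on `iifname`/`oifname` matches.

```python
import re

# An interface name ending in "+" is the iptables wildcard form; per the
# commit message above, nftables expects a trailing "*" instead.
_IPTABLES_WILDCARD = re.compile(r'\b[io]ifname\s+"?([A-Za-z0-9_.-]+\+)"?')


def iptables_style_wildcards(ruleset: str) -> list:
    """Return interface patterns that use the iptables '+' wildcard."""
    return _IPTABLES_WILDCARD.findall(ruleset)
```

For example, `iptables_style_wildcards('iifname "br-+" accept')` returns `['br-+']`, while the same rule written with `"br-*"` returns an empty list.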
Running the harness leaves tests/integration/.run/ (gitignored, generated); exclude it from yamllint + ansible-lint so a post-run 'make lint' passes. Also emit a --- doc-start in the generated inventory.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base's default-deny firewall would drop the driver's post-reboot SSH from the libvirt NAT gateway; set base__firewall_control_addr to the gateway (by source IP, interface-independent).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Set cloud-init package_update: true and block on 'cloud-init status --wait' in up() so apply sees populated apt lists (fresh genericcloud images ship empty lists). dump_diagnostics()/console() now read the root:0600 serial log via sudo instead of shutil.copy, which raised PermissionError mid-diagnostics.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
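Reading a root:0600 file through sudo, as the commit above describes, might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and `sudo -n` presumes the NOPASSWD setup described elsewhere in this log. The `run` parameter exists only so the function can be exercised without sudo.

```python
import subprocess


def sudo_cat_argv(path: str) -> list:
    # -n makes sudo fail instead of prompting, so diagnostics never hang.
    return ["sudo", "-n", "cat", "--", path]


def read_root_owned(path: str, run=subprocess.run) -> str:
    """Return the file's text, or a placeholder if it cannot be read."""
    result = run(
        sudo_cat_argv(path),
        capture_output=True,
        text=True,
        errors="replace",  # serial logs may contain non-UTF-8 bytes
        check=False,
    )
    if result.returncode != 0:
        # Diagnostics should degrade gracefully rather than raise mid-dump.
        return f"<could not read {path}: {result.stderr.strip()}>"
    return result.stdout
```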
The tls-internal/acme_ca knobs used {%- -%} trims validated only against raw jinja2; ansible (trim_blocks=True) double-stripped newlines and collapsed the Caddyfile onto single lines, crash-looping caddy. Match the role's existing plain {% %} style.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Debian 13 genericcloud image triple-faults at the legacy real-mode kernel
handoff under SeaBIOS/q35 (boot-loops at GRUB, no 'Decompressing Linux', no DHCP
lease). Booting via UEFI (OVMF -> efistub) bypasses the legacy entry and boots
cleanly: cloud-init runs, DHCP lease obtained, SSH reachable. Verified end-to-end.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Don't rely on the genericcloud image's network fallback; the seed now carries a
network-config forcing dhcp4 on en* interfaces. This is a prerequisite for the VM
to get networking once cloud-init processes the seed. (Note: a separate no-DHCP-lease
issue on first real boot is still under investigation: the guest isn't networking
and, under the no-sudo claude model, the VM console/logs aren't introspectable
without libguestfs; see next steps.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under qemu:///system the hypervisor runs as libvirt-qemu, which cannot traverse
/home/claude — so the overlay/seed/console must live in /var/lib/boma-integration
(group libvirt, world-traversable, created by the integration_test role), not the
repo/home RUN_DIR. The inventory (hosts.yml + group_vars symlink, read by ansible
as claude) stays in RUN_DIR. Verified: virt-install now creates the domain.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bare virsh/virt-install default to qemu:///session for a non-root caller, but
the substrate, /dev/kvm, and the boma-it NAT network live on the SYSTEM libvirtd.
Pin the URI so the driver targets system regardless of who runs it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
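Pinning the URI, as described above, amounts to never invoking virsh or virt-install without an explicit `--connect`. A minimal sketch, with hypothetical helper names (the driver's real structure is not shown in this log):

```python
LIBVIRT_URI = "qemu:///system"


def virsh_argv(*args: str) -> list:
    """Build a virsh command line that always targets the system libvirtd."""
    return ["virsh", "--connect", LIBVIRT_URI, *args]


def virt_install_argv(*args: str) -> list:
    """Build a virt-install command line pinned to the same URI."""
    return ["virt-install", "--connect", LIBVIRT_URI, *args]
```

Routing every call through helpers like these keeps the target independent of the caller's user and of any `LIBVIRT_DEFAULT_URI` in the environment.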
The driver passed -i <RUN_DIR>/ (a directory); ansible's directory-inventory
loader then parsed sibling files (notably 'current', which holds the real host
string 'askari') as INI inventory, creating phantom hosts incl. the real askari
with its full hostvars — violating the single-host safety invariant (and a hard
error in ansible 2.18 on the binary qcow2/seed files). Point -i at the single
hosts.yml file; ansible still loads the adjacent group_vars symlink. (review C1)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
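The single-file inventory fix above can be guarded in the driver so a directory is never passed to `-i` again. The following is a sketch only: `inventory_args` is a hypothetical helper, and the `hosts.yml` name is taken from the commit message.

```python
from pathlib import Path


def inventory_args(run_dir: Path) -> list:
    """Return the ansible -i arguments for the harness run directory."""
    hosts = run_dir / "hosts.yml"
    if not hosts.is_file():
        # Never fall back to the directory: ansible's directory loader would
        # parse sibling files such as 'current' as inventory sources.
        raise FileNotFoundError(f"single-file inventory missing: {hosts}")
    return ["-i", str(hosts)]
```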
- New docs/runbooks/integration-testing.md: when to use
(firewall/sshd/boot/Docker changes); make test-integration commands;
lower-level driver sub-commands; cert tier guidance; diagnostics dir;
VM inspection (virsh console / SSH); safety invariants; resource
constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying
lockout-risky changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- ADR-008: add reboot-survivability gap row + ADR-025 pointer to the
"not tested in Molecule" table
- ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs
(ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing
- accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records)
- CLAUDE.md: add make test-integration[/-clean] to key-commands;
add ADR-025 + runbook rows to further-reading
- hardware/reference.md: note one ephemeral KVM test VM on ubongo
- STATUS.md: add integration harness entry (built, lint+pytest clean;
RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Accepted decision to implement ADR-008 Level 2/3 on ubongo via
libvirt/KVM directly: throwaway VM overlays, stdlib-only driver,
tiered cert fidelity, three safety invariants. Addresses the
2026-06-17 mesh-hardening incident's reboot-survivability gap.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add ADR-025 integration-test harness targets to Makefile:
- test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1]
- test-integration-clean (prune stale VM snapshots)
Also add tests/integration/.run/ to .gitignore.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host (real Docker, real reboot, real role apply) to catch the reboot/firewall/boot-order class of bugs that Molecule cannot, as seen in the 2026-06-17 mesh-hardening incident. First profile: be askari; tiered certs (internal + le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008 Level 2/3; to be recorded as ADR-025.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the 2026-06-17 mesh-hardening incident: Molecule can't catch
reboot/firewall-x-Docker/boot-order bugs — build local-VM pre-deploy testing
on ubongo (ADR-008 Level 2/3). And a smooth screenshot hand-off for the agent
during incidents.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race,
coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no
off-site coordinator backup, and reboot-tested-after-removing-break-glass.
For the next /kaizen + the mesh-hardening re-spec.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Incident 2026-06-17: applying base's nftables default-deny (forward policy drop)
to askari — a Docker host — broke container forwarding/NAT on reboot, and the
wt0-only sshd ListenAddress left no break-glass (ip_nonlocal_bind did NOT beat
the boot race). Recovery: disable nftables + restart docker (restore the wiped
NAT masquerade) + force-recreate the coordinator (it FATAL-looped unable to
download its GeoLite2 DB with no egress) -> mesh re-formed.
Back out the enablement so a future deploy can't re-break askari:
- offsite_hosts: base__ssh_listen_mesh_only=false, base__firewall_apply=false
- remove host_vars/askari.yml (manage over the WAN again, not wt0)
- tf/offsite: re-open WAN :22 to ubongo only (break-glass; already applied)
askari now: sshd on all interfaces (Ansible-managed), nftables disabled, WAN :22
open -> stable + reboot-survivable. The base feature code (sshd ListenAddress
option, firewall public zone) stays; it's just not enabled on Docker hosts.
Mesh-hardening 1/3 to be re-spec'd before any retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).
Environmental checkpoint applied: the molecule-debian13 container image lacks
procps (no sysctl binary). Added molecule/default/prepare.yml to install procps,
and added sysctls: {net.ipv4.ip_nonlocal_bind: "0"} to the molecule.yml platform so
the ansible.posix.sysctl task can write and read back the value hermetically.
The sysctl file format is net.ipv4.ip_nonlocal_bind=1 (no spaces); the verify.yml
grep pattern was updated to match ansible.posix.sysctl's actual output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
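The fail-closed idea behind base__ssh_listen_mesh_only (an unresolved address must never degrade into listening on all interfaces) can be illustrated outside Ansible. This is a hypothetical sketch of the logic only; the role itself implements it as an assert in its tasks, and the function name is invented here.

```python
import ipaddress
from typing import Optional


def listen_address_line(addr: Optional[str]) -> str:
    """Render an sshd ListenAddress line, refusing anything unresolved."""
    if addr is None or not addr.strip():
        raise ValueError("mesh address unresolved; refusing to render sshd config")
    ip = ipaddress.ip_address(addr.strip())  # raises ValueError on garbage
    if ip.is_unspecified:
        # 0.0.0.0 or :: would silently mean "all interfaces".
        raise ValueError("wildcard address is not a mesh address")
    return f"ListenAddress {ip}"
```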
5 tasks: base sshd ListenAddress+ip_nonlocal_bind (Molecule), firewall public
zone + askari catalog, inventory wt0 override, TF retire WAN :22, then the live
operator-supervised staged cutover.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Decomposes the M5 mesh-hardening follow-on into 3 independent sub-specs; this
is sub-project 1. Three-layer SSH-on-wt0 (sshd ListenAddress=mesh + nftables
iifname wt0 + retire the Hetzner WAN :22), ip_nonlocal_bind to beat the
post-boot wt0 bind race (fail-closed), live wt0 fact for the listen addr,
staged cutover with the firewall auto-rollback as the safety gate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Migrate the single-file-bind-mount/stale-config gotcha (reload-in-place needs a
directory mount; restart-based roles don't) to docs/testing/gotchas.md, and move
all 7 open signals out of FRICTION.md's Open-signals section into the new
2026-06-17 decisions-ledger block: all consumed, 1 PARK (the ubongo
self-management gap, tracked in STATUS), 0 REMOVE. Relax test_load_signals to
accept an empty Open-signals section (the goal state after a kaizen pass).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/registry-login.sh reads vault.forgejo.registry_token and pipes it to
docker login --password-stdin (never echoed, never on argv); 'make registry-login'
wires it with the venv binaries. Adds the operator-minted CHANGEME vault stub
(fill via make edit-vault) and a per-machine prereq note in the claude-code-setup
runbook, so 'make caddy-image-push'/'molecule-image-push' become agent-completable
non-interactively. Consumes the 2026-06-15 signal in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
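The "never echoed, never on argv" property of the login flow above comes from feeding the token on stdin. The real implementation is the shell script scripts/registry-login.sh; the Python below is only an illustrative equivalent with hypothetical function names, and how the token is read from the vault is left out.

```python
import subprocess


def docker_login_argv(registry: str, username: str) -> list:
    # The token is deliberately absent: it travels on stdin only.
    return ["docker", "login", "--username", username, "--password-stdin", registry]


def registry_login(registry: str, username: str, token: str, run=subprocess.run) -> int:
    """Run docker login with the token on stdin; return the exit code."""
    result = run(
        docker_login_argv(registry, username),
        input=token,
        text=True,
        capture_output=True,
        check=False,
    )
    return result.returncode
```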
guard-vault-preflight: block a locked 'git commit' only when the staged set
(git diff --cached, plus -a/--all) contains ansible content matching the
pre-commit ansible-lint hook's files: scope. Docs-/config-only commits never
trigger that hook, so they no longer need the vault — fixing the false block on
docs-only commits. Fails safe to block when unsure.
guard-execution-mode-menu: widen the execution-mode arm to also catch free-form
prose re-asks of the subagent-vs-inline choice ('which execution approach?',
'subagent vs inline', ...), which the literal-menu matcher missed; the push
re-ask is intentionally left to the dont-reask-settled-defaults memory.
Consumes two 2026-06-17 signals in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
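The guard-vault-preflight decision above reduces to one question: does any staged path fall inside the ansible-lint hook's `files:` scope? A rough sketch follows. The regex is a hypothetical stand-in for that scope (the real pattern lives in the pre-commit config and is not shown here), and the function name is invented.

```python
import re
from typing import Iterable, Optional

# Hypothetical stand-in for the pre-commit ansible-lint hook's files: pattern.
ANSIBLE_SCOPE = re.compile(r"^(roles|playbooks|inventory)/.*\.ya?ml$")


def commit_needs_vault(staged: Optional[Iterable], scope=ANSIBLE_SCOPE) -> bool:
    """Return True when a commit should be blocked while the vault is locked."""
    if staged is None:
        # Fail safe: if the staged set could not be determined, block.
        return True
    return any(scope.search(path) for path in staged)
```

With this shape, a docs-only staged set such as `["docs/TODO.md"]` passes, while anything under the assumed Ansible paths still requires the vault.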
Add check_mode: false to the state:directory base_dir tasks so that 'make check'
on a brand-new compose service role creates the scaffold during --check and the
rest of the dry-run (templates + docker_compose_v2 up) can be evaluated instead
of failing on a missing project_src. The directive is inert under a normal
converge (incl. Molecule + its tagged second converge), so role tests are
unchanged. Consumes the 2026-06-16 signal in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>