The local-VM integration harness RED→GREEN acceptance passed on real hardware (2026-06-18): a KVM VM on ubongo reproduced the 2026-06-17 nftables/Docker reboot breakage (RED) and survived with the docker_host container-forward drop-in (GREEN). ADR-025: Status updated to PASSED; shakedown learnings section added (UEFI boot required, claude sudo load-bearing); ADR-021 added to Related. STATUS.md: integration-harness section updated from PENDING to PASSED; ubongo entry updated to reflect claude NOPASSWD sudo + sjat-ansible NOPASSWD removal; last-reviewed date updated. docs/TODO.md: item 2.4 collapsed to one-line pointer per the file's convention. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
10 KiB
ADR-025 — Local VM integration testing on ubongo
Status
Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now
viable on ubongo). RED→GREEN acceptance PASSED on real hardware (2026-06-18): a
throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base's nftables forward
default-deny kills Docker forwarding on reboot) — RED — and survived the reboot once
the docker_host container-forward drop-in was applied — GREEN. Two shakedown
learnings added below.
Context
Molecule (ADR-008 Level 1) tests each role in a single Docker container: one
converge, no real kernel netfilter, no real Docker daemon in the loop, and no
reboot. That structurally cannot catch an entire class of bug — reboot-survivability,
host-firewall × Docker interaction, and boot-ordering — which is exactly the class
that caused the 2026-06-17 mesh-hardening incident.
During that incident, base's nftables forward { policy drop; } killed the askari
Docker host on reboot: nftables loaded its default-deny before Docker, breaking
published-port DNAT and inter-container forwarding. Public services and the mesh went
down. It had worked right after make deploy, when Docker's runtime rules still
coexisted. ip_nonlocal_bind also failed to beat the sshd boot-race, leaving the mesh
listener absent at boot. Recovery required the Hetzner console and a WAN-SSH
break-glass. Molecule had passed.
ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo breaks that deferral:
verified: ubongo KVM capability · Bash (2026-06-18 session) ·
/dev/kvmpresent + accessible (kvm group), Intel VT-x (vmx) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant not yet installed · 2026-06-18.
Decision
1. Virtualisation approach: libvirt/KVM directly (Approach A)
A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an
ephemeral qcow2 overlay backed by it (the golden image is never mutated), seeded
via cloud-init NoCloud, driven by a stdlib-only Python driver (scripts/ integration-vm.py) over virsh / virt-install / cloud-localds. No libvirt- python dependency — the driver stays portable and the role stays lean.
2. Fidelity envelope
The bugs are post-boot, not in the provisioning path. A lightweight local hypervisor is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port DNAT, a real reboot, and the coordinator running inside the VM (so the VM forms its own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning chrome is not mirrored.
3. Scope: one throwaway VM at a time, instantiated from real inventory
The first profile is "be askari" — a single box running Docker host + NetBird coordinator + mesh peer, mirroring the host whose incident motivates this work. The mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies are a deferred extension.
4. Acceptance: self-validating against the real failure
The harness is accepted when it can, on a local VM:
- Apply
base(firewall on, nodocker_hostcontainer-forward drop-in) to a Docker host, reboot, and observe the 2026-06-17 breakage (Docker forwarding dead, services down). If step 1 passes, the harness is not faithful. - Apply the
docker_hostcontainer-forward fix, re-run, and survive the reboot.
5. Tiered cert fidelity via a --certs knob
DNS-01 is what makes real certs possible without public inbound (validation is out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which the isolated NAT network provides):
| Tier | Description | Default? |
|---|---|---|
internal |
Caddy tls internal — zero deps, instant. For incident repro and runs where certs are not under test. |
Yes |
le-staging |
Real DNS-01 ACME against Let's Encrypt staging — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. | Built in v1; use when testing the ACME/cert path. |
le-prod-wildcard |
A real trusted *.test.wingu.me wildcard, issued once, persisted on ubongo, reused across runs. |
On-demand only. Accepted risk recorded as R6 in docs/security/accepted-risks.md. |
A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4 —
netbird-server FATAL-loops on GeoLite2 download when egress is lost) forces
internal, since ACME requires egress.
6. The toolchain is Ansible-managed
A new non-service role (integration_test, control group) installs and enables
libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on
first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The
repo owns ubongo's state.
7. Stubs live in an overlay file, never in the real inventory
Transient inventory entries for the test VM are generated at runtime as a single-host
file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in
tests/integration/overrides/<host>.yml — an explicit, reviewable overlay. The real
inventory is never touched, so make tf-inventory and "don't edit inventory directly"
stay intact.
Consequences
- Reconciles ADR-015: ubongo runs ephemeral KVM test VMs as part of its
local-test-runner role — it is still not a production hypervisor. A default VM
(~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the
driver enforces one integration VM at a time (resource guard, name-prefix
boma-it-*) and refuses to start below a free-RAM threshold. - Operationalises the standing rule: "firewall/sshd/boot changes must be tested on
a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6)
becomes a concrete, runnable step documented in
docs/runbooks/integration-testing.md. - Accepted risk R6:
le-prod-wildcardruns pass the production Gandi PAT (vault.gandi.pat) to an ephemeral local VM and write transient_acme-challengeTXT records into the realwingu.mezone. Scope: on-demand only;le-stagingis the default. Compensating controls: ephemeral VM, isolated NAT network, TXT records auto-removed by Caddy after validation. - Three safety invariants make the test tool itself safe:
- The transient inventory contains only the test VM — no real host is ever in scope.
- "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node mesh; it never enrols in the real mesh.
- Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls
only, not reachable to the LAN (
10.20.x) or the real mesh.
- Diagnostics on failure (catching a bug is the point): failure keeps the VM and
dumps
nft list ruleset,docker ps,ss -tlnp,journalctl -b,systemd-analyze critical-chain.make test-integration-cleanreaps allboma-it-*orphans. Diagnostics land in gitignored~/integration-runs/<ts>-<host>/. - Future pinch: concurrency with the Level-4 Chromium/Playwright stack (ADR-017)
competes for ubongo RAM. The resource guard is the v1 answer — one integration VM at a
time; don't run alongside a heavy Level-4 session. Revisit at
/capacity-review.
Scope
In scope: reboot-survivability, host-firewall × Docker interaction, boot-ordering, cert/ACME paths, mesh bootstrap on one box.
Out of scope (v1): multi-VM mini-cluster (inter-host mesh dataplane); CI gate (this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker layer, not provisioning).
What was ruled out
| Option | Reason |
|---|---|
| Proxmox VE nested on ubongo | Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning. |
| Vagrant + vagrant-libvirt | Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic. |
| terraform-provider-libvirt | Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns production VM existence on Proxmox" boundary. |
Verified facts (ADR-014)
- verified: ubongo KVM capability · Bash ·
/dev/kvmpresent + accessible (kvm group), Intel VT-x (vmx) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB disk free · 2026-06-18.
Shakedown learnings (2026-06-18 live run)
Two findings from the RED→GREEN acceptance run that affect anyone operating the harness:
-
Boot firmware: UEFI required. The Debian 13 genericcloud image triple-faults under legacy BIOS/SeaBIOS and does not reach the kernel. Boot the VM with UEFI (
virt-install --boot uefi;ovmfpackage). The driver does this by default; note it here so the requirement is findable. -
claudesudo is load-bearing. VM management (virsh,virt-install,cloud-localds) and offline diagnostics (nft list ruleset,journalctl -b,systemd-analyze critical-chain) all require root. The harness assumes the AI-worker hasNOPASSWD:ALLsudo onubongo— settled as the ADR-015 amendment (2026-06-18) and registered as R7 indocs/security/accepted-risks.md. Aclaudeaccount without sudo will block the harness at the firstvirshcall.
The nine full shakedown findings (including the UEFI boot-loop) are in
docs/FRICTION.md.
Related
- ADR-006 — Terraform owns production VM existence (boundary this ADR respects).
- ADR-008 — Testing methodology (Levels 1–4); this ADR is the concrete build of Level 2/3.
- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs; amended 2026-06-18 for claude sudo.
- ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role.
- ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests.
- ADR-021 — Operational access; sudo model for
claudeandsjatonubongo. - ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.