boma/docs/superpowers/specs/2026-06-18-local-vm-integration-testing-design.md
sjat 02e1eb7449 docs(spec): design local VM integration testing on ubongo (2.4)
Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host (real Docker, real reboot, real role apply) to catch the reboot/firewall/boot-order class Molecule cannot - the 2026-06-17 mesh-hardening incident. First profile: be askari; tiered certs (internal + le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008 Level 2/3; to be recorded as ADR-025.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 11:35:51 +02:00


# Local VM integration testing on ubongo (design)
**Status:** Designed, not built. Resolves `docs/TODO.md` item 2.4 (Local VM integration
testing on ubongo, pre-deploy).
**Date:** 2026-06-18.
**Implements:** the concrete build of ADR-008 Level 2/3 (staging/integration), deferred
for lack of hosts but hostable on ubongo. To be recorded as **ADR-025**.
## Context
Molecule (ADR-008 Level 1) tests each role in a single Docker container: one `converge`,
no real kernel netfilter, no real Docker daemon in the loop, and **no reboot**. That
structurally cannot catch an entire class of bug — reboot-survivability, host-firewall ×
Docker interaction, and boot-ordering — which is exactly the class that caused the
**2026-06-17 mesh-hardening incident**:
- `base`'s nftables `forward { policy drop; }` killed the askari Docker host **on reboot**
(nftables loaded its default-deny *before* Docker, breaking published-port DNAT and
inter-container forwarding → public services + the mesh went down). It had worked right
after `make deploy`, when Docker's runtime rules still coexisted. (FRICTION 2026-06-17 #1.)
- `ip_nonlocal_bind` did **not** beat the sshd boot-race; sshd bound to the `wt0` mesh IP
had no listener at boot. (FRICTION #2.)
- The coordinator host could not bootstrap the mesh it itself hosts. (FRICTION #3.)
- NetBird `netbird-server` FATAL-loops on the GeoLite2 download when egress is lost — and
egress was lost when `nft flush` wiped Docker's NAT masquerade. (FRICTION #4.)

Recovery needed the Hetzner console + a WAN-SSH break-glass. The lesson, already crystallised
as a standing rule: *firewall/sshd/boot changes must be tested on a real VM with a real
reboot before they touch a live host, and a non-mesh break-glass must be kept.*

This spec defines a way for the agent to spin up **throwaway KVM VMs locally on ubongo**
that mirror a target host (real Docker, a real reboot, the real role apply) and validate
risky infra changes **before** a live deploy. ubongo can host this today:
> verified: ubongo KVM capability · Bash (this session) · `/dev/kvm` present + accessible
> (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198
> GiB disk free; libvirt/QEMU/Vagrant **not yet installed** · 2026-06-18.
## Goals
- Reproduce the 2026-06-17 bug class locally: real OS boot, real Docker, real netfilter,
the real role apply, a **real reboot**, then outcome assertions.
- Let the agent drive the full loop autonomously: provision → apply → reboot → assert →
teardown, with diagnostics captured on failure.
- Mirror a *real* host from inventory (first profile: "be askari"), so the apply is
faithful, not synthetic.
- Be the concrete tool that operationalises the standing "test risky infra before live
deploy" rule.
## Non-goals (v1)
- Not a production hypervisor on ubongo (reconciles ADR-015 — see Governance).
- Not nested Proxmox; the provisioning *chrome* (template clone / Terraform) is **not**
mirrored — every incident bug lives in the boot/kernel/Docker layer, not provisioning.
- Not a multi-VM mini-cluster; one VM at a time. (All six 2026-06-17 signals occurred on a
single host that was Docker host + coordinator + mesh peer.) Multi-VM is a later extension.
- Not a CI gate; this is an interactive, agent-driven pre-deploy check on ubongo (CI stays
lint + Molecule per ADR-008/010).
## Decisions (from the 2026-06-18 brainstorm)
1. **Virtualisation approach: libvirt/KVM directly (Approach A).** A golden Debian-13
genericcloud qcow2 cached locally; each run boots an ephemeral qcow2 overlay backed by
it, seeded via cloud-init NoCloud, driven by a **stdlib-only** Python script over
`virsh` (no `libvirt-python` dependency). Chosen over Vagrant+vagrant-libvirt (Ruby/plugin
footprint, box drift from the real cloud image) and terraform-provider-libvirt (poor at
the imperative apply→reboot→re-apply sequence, throwaway state, blurs ADR-006's prod-VM
boundary). Lightest footprint on a 15 GiB control node; full control of the reboot step;
the same Debian cloud image real hosts boot.
2. **Fidelity envelope: real OS/Docker/netfilter/reboot, not the Proxmox provisioning
path.** A lightweight local hypervisor is enough because the bugs are post-boot.
3. **Scope: one throwaway VM at a time, instantiated from a real host's inventory.** First
profile: **"be askari"** (Docker host + NetBird coordinator + mesh peer on one box). The
mechanism is generic — later "be" any host by swapping which inventory host it mirrors.
4. **Acceptance is self-validating against the real failure.** Done = the harness, on a
local VM, applies `base` (firewall on) to a Docker host, reboots, and **observes the
2026-06-17 breakage** (Docker forwarding dead / services down); then, with the
`docker_host` container-forward drop-in in place, the same run **survives the reboot**.
If the first run (no drop-in) passes verification, the harness is not faithful.
5. **Tiered cert fidelity via a `--certs` knob** (DNS-01 is what makes real certs possible
with no public inbound — validation is out-of-band via a Gandi TXT record; the VM needs
only outbound to ACME + Gandi, which the NAT net provides):
- `internal` (default) — Caddy `tls internal`, zero deps, instant; for the incident repro
and runs where certs aren't under test.
- `le-staging` — real DNS-01 ACME against Let's Encrypt **staging**: real caddy-gandi
path, real cert files/renewal, untrusted root, effectively no rate limits. **Built in v1.**
- `le-prod-wildcard` — a real trusted `*.test.wingu.me` wildcard, **issued once,
persisted on ubongo, reused** across runs. Wired in v1 but **on-demand only**; its
accepted risk is recorded when used (prod Gandi credential reaching an ephemeral VM;
transient TXT in the real `wingu.me` zone). A deliberate "no-egress" failure scenario
(to reproduce FRICTION #4) forces `internal`, since ACME needs egress.
6. **The toolchain is Ansible-managed**, not hand-installed: a new non-service role
(`integration_test`, `control` group) installs/enables libvirt+QEMU reproducibly. The
repo owns ubongo's state. The driver manages *images* lazily on first run (keeps the role
lean; avoids fiddly download/refresh logic in Ansible).
7. **Stubs live in an overlay file, never in the real inventory** — so `make tf-inventory`
and "don't edit inventory directly" stay intact, and every stub is explicit and reviewable.
8. **A new ADR-025** records this decision (approach + alternatives + cert tiers); ADR-008
gains a pointer and redirects its "what Molecule does NOT test" gaps here.
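
The `--certs` knob from Decision 5 could translate into overlay variables roughly as below. This is a minimal sketch, not the driver's actual code: the function name and the variable names (`caddy_tls_mode`, `acme_ca`, `reuse_persisted_cert`) are hypothetical and would need to match whatever the `reverse_proxy` role really consumes. The two ACME directory URLs are Let's Encrypt's published staging and production endpoints.

```python
LE_STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"
LE_PROD = "https://acme-v02.api.letsencrypt.org/directory"

# Hypothetical overlay variable names; the real reverse_proxy role may differ.
CERT_TIERS = {
    "internal": {
        "caddy_tls_mode": "internal", "acme_ca": None, "reuse_persisted_cert": False,
    },
    "le-staging": {
        "caddy_tls_mode": "dns01", "acme_ca": LE_STAGING, "reuse_persisted_cert": False,
    },
    "le-prod-wildcard": {
        "caddy_tls_mode": "dns01", "acme_ca": LE_PROD, "reuse_persisted_cert": True,
    },
}


def certs_overlay(tier: str = "internal", no_egress: bool = False) -> dict:
    """Map a --certs tier to overlay vars; a no-egress scenario forces `internal`."""
    if tier not in CERT_TIERS:
        raise ValueError(f"unknown cert tier {tier!r}; expected one of {sorted(CERT_TIERS)}")
    if no_egress:
        # ACME needs outbound access, so the no-egress scenario cannot use a real CA.
        tier = "internal"
    return dict(CERT_TIERS[tier])
```

Keeping the tiers in one table makes the "no-egress forces `internal`" rule a single branch rather than something each tier has to know about.
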
## Architecture — five isolated components
| # | Component | Purpose | Location |
|---|-----------|---------|----------|
| 1 | **`integration_test` role** (non-service, `control` group) | Install/enable libvirt+QEMU+virtinst, add `sjat`/`claude` to `libvirt` group, create the image-cache dir, drop the driver. Idempotent, Molecule-tested. | `roles/integration_test/` |
| 2 | **`integration-vm.py` driver** | Stdlib-only lifecycle over `virsh`: `up / apply / reboot / assert / cycle / reset / down / prune / console`. Lazily ensures the golden image (download + checksum). | `scripts/integration-vm.py` |
| 3 | **Profiles + var overlays** | Make a VM "become" a host: pull that host's real group_vars/host_vars + layer a small explicit overlay (cert tier, in-VM coordinator endpoint, VM connection). | `tests/integration/overrides/<host>.yml` |
| 4 | **Verify playbook** | Outcome-based post-reboot assertions (Docker up, published-port DNAT alive, `nft` sane, service responds, `wt0` up), reusing Molecule's `verify.yml` philosophy. | `tests/integration/verify.yml` |
| 5 | **Makefile target** | `make test-integration HOST=<name> [CERTS=...] [KEEP=1]` → `cycle`; `make test-integration-clean` → `prune`. Documented in CLAUDE.md's command table. | `Makefile` |
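
As an illustration of component 3, the sketch below layers an explicit overlay on top of a host's real vars and emits a one-host inventory. It is an assumption-laden sketch: the function name, the variable keys in the usage example, and the shallow (top-level only) merge are all hypothetical choices, and the real driver may need a deep merge to match Ansible's variable semantics.

```python
import json


def build_transient_inventory(vm_name: str, lease_ip: str,
                              host_vars: dict, overlay: dict) -> dict:
    """Return a one-host inventory: real vars, then overlay, then the VM address.

    Later sources win, so an overlay key always replaces the real value, and
    ansible_host always points at the VM's DHCP lease, never at the real host.
    """
    merged = {**host_vars, **overlay, "ansible_host": lease_ip}
    return {"all": {"hosts": {vm_name: merged}}}


if __name__ == "__main__":
    inventory = build_transient_inventory(
        vm_name="boma-it-askari",
        lease_ip="192.168.122.45",  # illustrative address
        host_vars={"coordinator_endpoint": "askari.wingu.me", "firewall_enabled": True},
        overlay={"coordinator_endpoint": "localhost"},  # hypothetical stub key
    )
    print(json.dumps(inventory, indent=2))
```

Because the overlay is an ordinary mapping applied last, every stub stays visible in one reviewable file and the real inventory is never edited.
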
## Lifecycle / data flow
`make test-integration HOST=askari` drives:
```
 1. ensure golden image       Debian-13 genericcloud qcow2, cached + checksum-verified
 2. ephemeral overlay         qcow2 backed by golden (throwaway; never mutate golden)
 3. cloud-init NoCloud seed   hostname + ansible user + ubongo's SSH key + NIC
 4. virt-install --import     boot on an isolated libvirt NAT net (DHCP IP + outbound NAT)
 5. wait for SSH              IP via `virsh domifaddr --source lease` (guest-agent optional)
 6. transient inventory       askari's real vars + ansible_host=<lease IP> + stub overlay
 7. ansible-playbook site     THE REAL APPLY (base + docker_host + reverse_proxy + coordinator)
 8. [snapshot post-apply]     optional reset point for fast re-runs
 9. virsh reboot ──────────┐  ← the step Molecule structurally cannot do
10. wait for SSH           ┘
11. ansible-playbook verify   outcome assertions; THIS is where the incident surfaces
12. report + teardown         pass/fail; on fail keep VM + dump diagnostics; else destroy overlay
```
Steps 1-7 build a real Docker daemon with real published-port DNAT to break; step 9 is a
real kernel reboot, so nftables loads default-deny before Docker exactly as on askari.
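
Two of the lifecycle steps reduce to small pure functions that are easy to unit-test: creating the throwaway overlay (step 2) and extracting the VM's IP from the DHCP lease (step 5). The sketch below is illustrative only. The function names are hypothetical, and the sample `virsh domifaddr` output used to exercise the parser is an assumed example of the tabular format, which should be confirmed on the Debian-13 libvirt stack.

```python
import re
from typing import List, Optional


def overlay_cmd(golden: str, overlay: str) -> List[str]:
    """qemu-img command for a qcow2 overlay backed by the (read-only) golden image."""
    return ["qemu-img", "create", "-f", "qcow2", "-b", golden, "-F", "qcow2", overlay]


def domifaddr_cmd(domain: str) -> List[str]:
    """virsh command that reads the address from the network's DHCP leases."""
    return ["virsh", "domifaddr", domain, "--source", "lease"]


# Matches a row such as: " vnet0  52:54:00:aa:bb:cc  ipv4  192.168.122.45/24"
_IPV4_ROW = re.compile(r"\bipv4\s+(\d{1,3}(?:\.\d{1,3}){3})/\d+")


def parse_lease_ip(output: str) -> Optional[str]:
    """Return the first IPv4 address in domifaddr output, or None if no lease yet."""
    match = _IPV4_ROW.search(output)
    return match.group(1) if match else None


# Assumed sample output, used only to exercise the parser.
SAMPLE = """\
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      52:54:00:aa:bb:cc    ipv4         192.168.122.45/24
"""
```

Returning `None` while no lease exists lets the driver poll in a loop with a timeout instead of treating an early empty table as an error.
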
## Fidelity boundary & cert tiers
**Faithful where the bug lives:** real kernel, real netfilter, real Docker with
published-port DNAT, the real role apply, a real reboot, and the coordinator running *inside
the VM* so the VM is its own mesh peer — reproducing the circular mesh-bootstrap (FRICTION #3)
on one box.
**Stubbed where it needs the public internet** (explicit, in the overlay): LE certs via the
`--certs` knob (Decision 5); public DNS (`askari.wingu.me`) → local resolution; NetBird
geo-DB → pre-seeded or requirement disabled (which is *also* the FRICTION #4 fix, so the
harness can test both the FATAL-loop and its remedy).
## Acceptance test (self-validating)
1. Run the cycle on **today's** `base` (firewall on, no `docker_host` container-forward
drop-in) → **step 11 must FAIL after reboot** (Docker forwarding dead, services down).
2. Implement the `docker_host` container-forward rules (the pending fix STATUS.md names) →
re-run → **step 11 must PASS across the reboot.**

**Scope boundary:** the *harness* is this plan's deliverable. The `docker_host`
container-forward fix is a separate work item (FRICTION 2026-06-17 #1). v1's acceptance
deliberately spans both, because a credible harness must demonstrate **both** a true positive
(the assert goes red on the broken state) and a true negative (it goes green on the fixed
state); otherwise we have only ever watched the assert go red. The plan decides sequencing:
build the small `docker_host` drop-in as the green half of acceptance, or consume it if built
separately first. Minimum credible v1 is the red half (faithful reproduction); full acceptance
is red→green.

This one round-trip proves the harness reproduces the incident, the fix works, and the loop
can be trusted for the next risky change before it touches a live host.
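
The red→green logic can be written down as a small decision function. This is a sketch of the reasoning only; the function name and the verdict strings are hypothetical, not part of the planned driver.

```python
from typing import Optional


def acceptance_verdict(broken_run_passed: bool,
                       fixed_run_passed: Optional[bool] = None) -> str:
    """Classify the harness from the two acceptance runs.

    broken_run_passed: did verify pass on today's `base` without the drop-in?
    fixed_run_passed:  did verify pass with the drop-in? None = not run yet.
    """
    if broken_run_passed:
        # The incident did not reproduce, so the VM is not a faithful mirror.
        return "harness-unfaithful"
    if fixed_run_passed is None:
        # Red half only: the minimum credible v1.
        return "red-only"
    if fixed_run_passed:
        return "accepted"
    # Still red with the fix applied: the fix or the assertions need attention.
    return "fix-not-effective"
```

The first branch is the important one: a green result on the known-broken state is treated as a failure of the harness, not a success of the roles.
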
## Robustness, isolation & teardown
**Failure leaves evidence** (catching a bug is the point):
| Step fails | Behaviour |
|---|---|
| Golden image (1) | Fail fast, clear message; image cached (one-time cost) |
| Boot / first SSH (4-5) | **Capture serial console to a log file**, fail with its tail; this is the automated equivalent of the Hetzner console (ties to TODO 10.8) |
| Apply (7) | Keep VM, surface Ansible output, dump diagnostics |
| **No SSH after reboot (9-10)** | The classic incident signature; FAIL, keep VM, capture console (this is the harness *succeeding*) |
| Assert (11) | FAIL, keep VM, dump post-mortem: `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`, `systemd-analyze critical-chain`; exit non-zero |
Diagnostics land in gitignored `~/integration-runs/<ts>-<host>/` (same pattern as ADR-017's
screenshot dir; the agent reads them directly).
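
A possible shape for the post-mortem step is sketched below: a per-run directory under `~/integration-runs/` and a table of the diagnostic commands listed for a failed assert. The timestamp format, output file names, and function name are assumptions for illustration; only the commands themselves and the `<ts>-<host>` pattern come from this spec.

```python
from datetime import datetime
from pathlib import Path

# Output file name (hypothetical) -> command to run inside the VM.
POST_MORTEM = {
    "nft-ruleset.txt": ["nft", "list", "ruleset"],
    "docker-ps.txt": ["docker", "ps"],
    "listeners.txt": ["ss", "-tlnp"],
    "journal.txt": ["journalctl", "-b"],
    "critical-chain.txt": ["systemd-analyze", "critical-chain"],
}


def run_dir(host: str, now: datetime,
            root: Path = Path.home() / "integration-runs") -> Path:
    """Directory for one run's diagnostics, named <timestamp>-<host>."""
    # Compact, sortable timestamp; the exact format is an assumption.
    return root / f"{now:%Y%m%dT%H%M%S}-{host}"
```

A sortable timestamp prefix means a plain directory listing shows runs in chronological order, which suits an agent reading the most recent failure first.
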
**Three safety invariants** (these make the testing tool itself safe):
1. **The transient inventory contains only the test VM** — no real host is ever in scope;
the apply is `--limit`ed to the VM.
2. **"Be askari" points NetBird at the in-VM coordinator (localhost)** — the VM forms its
own one-node mesh; it never enrolls in the real mesh.
3. **Test VMs sit on an isolated libvirt NAT net** — outbound NAT for ACME/image pulls, but
unable to reach the LAN (`10.20.x`) or the real mesh.
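
The first invariant lends itself to a hard check before any apply. The sketch below assumes the transient inventory is available as a nested mapping of the form `{"all": {"hosts": {...}}}` and that the playbook is called `site.yml`; both are assumptions, as are the function names. The `boma-it-` prefix is the one named in this spec.

```python
from typing import List

VM_PREFIX = "boma-it-"


def check_inventory_scope(inventory: dict, vm_name: str) -> None:
    """Raise unless the inventory contains exactly the one test VM."""
    if not vm_name.startswith(VM_PREFIX):
        raise ValueError(f"{vm_name!r} does not carry the {VM_PREFIX!r} test prefix")
    hosts = set(inventory.get("all", {}).get("hosts", {}))
    if hosts != {vm_name}:
        raise ValueError(f"inventory must contain only {vm_name!r}, found {sorted(hosts)}")


def apply_cmd(inventory_path: str, vm_name: str, playbook: str = "site.yml") -> List[str]:
    """ansible-playbook invocation limited to the test VM (belt and braces)."""
    return ["ansible-playbook", "-i", inventory_path, "--limit", vm_name, playbook]
```

Checking the inventory and also passing `--limit` is deliberately redundant: either guard alone would keep real hosts out of scope, and together they tolerate a bug in one of them.
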
**Resource guard** (ubongo's 15 GiB ceiling, ADR-015/012): default VM ≈ 2 vCPU / 3 GiB / 20
GiB thin overlay; the driver refuses to start below a free-RAM threshold and enforces **one
integration VM at a time** (name-prefix `boma-it-*`). **Teardown:** success destroys domain +
overlay; failure keeps them and prints how to inspect; `make test-integration-clean` reaps
all `boma-it-*` orphans. An optional post-apply **snapshot** lets `reset` re-run
reboot+assert without re-applying (fast iteration on a fix).
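
The resource guard could be as small as the sketch below. It takes the text of `/proc/meminfo` and a list of existing libvirt domain names (for example from `virsh list --all --name`) as plain inputs so the logic stays testable without a hypervisor. The 3 GiB VM size comes from this spec; the 2 GiB headroom and the function names are assumptions.

```python
from typing import Iterable

VM_PREFIX = "boma-it-"


def mem_available_gib(meminfo_text: str) -> float:
    """Return MemAvailable from /proc/meminfo text, converted from kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found in meminfo text")


def check_can_start(meminfo_text: str, domain_names: Iterable[str],
                    vm_gib: float = 3.0, headroom_gib: float = 2.0) -> None:
    """Raise RuntimeError if a test VM already exists or free RAM is too low."""
    existing = sorted(n for n in domain_names if n.startswith(VM_PREFIX))
    if existing:
        raise RuntimeError(f"one integration VM at a time; found {existing}")
    free = mem_available_gib(meminfo_text)
    needed = vm_gib + headroom_gib
    if free < needed:
        raise RuntimeError(f"only {free:.1f} GiB available, need {needed:.1f} GiB")
```

Using `MemAvailable` rather than `MemFree` counts reclaimable page cache as usable, which is closer to what the VM can actually be given.
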
## Testing the tester
- **pytest** on the driver's pure logic: transient-inventory generation, var/overlay merge,
`--certs`→overlay mapping, DHCP-lease parsing, resource-guard math (mock `virsh`). Joins
boma's existing pytest suite.
- **Molecule** (Docker) on the `integration_test` role: asserts libvirt/qemu/virtinst
installed, `libvirtd` enabled, users in `libvirt` group, driver present. (Cannot run
KVM-in-Docker — the documented Molecule limitation.)
- **End-to-end self-test = the acceptance test above**, run manually on first build and
recorded in the runbook.
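
A driver test with `virsh` mocked might look like the sketch below, using only the standard library's `unittest.mock`, which pytest runs unchanged. The helper `list_test_domains` is a hypothetical stand-in for whatever the driver actually exposes.

```python
import subprocess
from typing import List
from unittest import mock

VM_PREFIX = "boma-it-"


def list_test_domains() -> List[str]:
    """Names of all libvirt domains (running or not) that carry the test prefix."""
    result = subprocess.run(
        ["virsh", "list", "--all", "--name"],
        capture_output=True, text=True, check=True,
    )
    return [name for name in result.stdout.split() if name.startswith(VM_PREFIX)]


def test_list_test_domains_filters_prefix():
    # Fake virsh output: one test VM, one unrelated VM, trailing blank line.
    fake = subprocess.CompletedProcess(
        args=[], returncode=0, stdout="boma-it-askari\nother-vm\n\n", stderr="",
    )
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert list_test_domains() == ["boma-it-askari"]
    run.assert_called_once()
```

Patching `subprocess.run` keeps the test hermetic, so it can join the existing suite on machines that have no libvirt installed.
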
## Governance & documentation touch-points
- **ADR-025 "Local VM integration testing"** — decision, approach A, rejected alternatives
(Proxmox-nested / Vagrant / TF-libvirt), cert tiers.
- **ADR-008** — pointer to ADR-025; redirect its "what Molecule does NOT test" gaps
(nftables loading, mesh dataplane) to this level.
- **ADR-015** — one-line reconciliation: "not a hypervisor" → runs *ephemeral KVM test VMs*
as part of its local-test-runner role (still not a production hypervisor); note the
test-VM RAM load.
- **`docs/security/accepted-risks.md`** — the `le-prod-wildcard` risk (prod Gandi credential
→ ephemeral VM; transient TXT in real `wingu.me`).
- **CLAUDE.md** command table + **`docs/runbooks/integration-testing.md`** (run a cycle,
cert knobs, where diagnostics land, inspecting a kept failed VM, pruning) + **STATUS.md**
entry. The runbook's pre-flight line operationalises FRICTION #6 (*validate
reboot-recovery before retiring the break-glass*).
## Capacity
One VM (~3 GiB) against ~13 GiB free is comfortable. The only future pinch is concurrency
with the Level-4 Chromium/Playwright stack (ADR-017) — handled by the resource guard +
"one at a time." Add a note to `docs/hardware/reference.md`; revisit at `/capacity-review`.
## Alternatives considered
- **Proxmox VE nested on ubongo** — highest fidelity incl. the provisioning step, but heavy
(nested virt, RAM), in tension with ADR-015, and the incident bugs don't live in
provisioning. Rejected.
- **Vagrant + vagrant-libvirt** — mature lifecycle/snapshots, but adds the Ruby/Vagrant
ecosystem + a fragile plugin, boxes drift from the real Debian cloud image, and the
reboot→assert sequence still needs custom logic. Rejected.
- **terraform-provider-libvirt** — declarative and reuses TF, but poor at the imperative
apply→reboot→re-apply test sequence, adds throwaway state, and blurs ADR-006's
"TF owns *production* VM existence on Proxmox" boundary. Rejected.
## Open questions / deferred
- **Multi-VM mini-staging** (inter-host mesh/dataplane) — design the driver + NAT net so a
topology is an additive extension; out of scope for v1.
- **Interplay with the Level-4 browser stack** — both want ubongo RAM; the resource guard is
the v1 answer, revisit when Level 4 is built.
- **Snapshot strategy depth** — v1 ships clone-and-destroy + an optional post-apply snapshot;
richer snapshot trees deferred.
## Knowledge to verify at plan stage (ADR-014)
These are from memory / unverified and must be confirmed against version-matched docs before
the plan asserts them:
- Exact `virt-install --import` flags and the cloud-init **NoCloud** seed format on the
Debian-13 libvirt stack.
- Whether the Debian-13 genericcloud image ships `qemu-guest-agent` (IP can come from the
DHCP lease regardless — guest-agent is an optimisation, not a requirement).
- Let's Encrypt **rate limits** (prod vs staging) — to confirm "issue the wildcard once,
reuse" stays within limits.
- The `caddy-dns/gandi` DNS-01 configuration and pinned version already used by
`reverse_proxy`, and whether the Gandi LiveDNS API key can be scoped to `test.wingu.me`.
- libvirt default vs a dedicated isolated NAT network on Debian-13 (`virsh net-*`).