From 02e1eb7449e97103549cd360b093d3053f26d83d Mon Sep 17 00:00:00 2001
From: sjat
Date: Thu, 18 Jun 2026 11:35:51 +0200
Subject: [PATCH] docs(spec): design local VM integration testing on ubongo (2.4)

Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host
(real Docker, real reboot, real role apply) to catch the
reboot/firewall/boot-order class Molecule cannot - the 2026-06-17
mesh-hardening incident. First profile: be askari; tiered certs (internal +
le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008
Level 2/3; to be recorded as ADR-025.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 ...-18-local-vm-integration-testing-design.md | 267 ++++++++++++++++++
 1 file changed, 267 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-18-local-vm-integration-testing-design.md

diff --git a/docs/superpowers/specs/2026-06-18-local-vm-integration-testing-design.md b/docs/superpowers/specs/2026-06-18-local-vm-integration-testing-design.md
new file mode 100644
index 0000000..d1edc09
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-18-local-vm-integration-testing-design.md
@@ -0,0 +1,267 @@
+# Local VM integration testing on ubongo (design)
+
+**Status:** Designed, not built. Resolves `docs/TODO.md` item 2.4 (Local VM integration
+testing on ubongo, pre-deploy).
+**Date:** 2026-06-18.
+**Implements:** the concrete build of ADR-008 Level 2/3 (staging/integration), deferred
+for lack of hosts but hostable on ubongo. To be recorded as **ADR-025**.
+
+## Context
+
+Molecule (ADR-008 Level 1) tests each role in a single Docker container: one `converge`,
+no real kernel netfilter, no real Docker daemon in the loop, and **no reboot**. That
+structurally cannot catch an entire class of bug — reboot-survivability, host-firewall ×
+Docker interaction, and boot-ordering — which is exactly the class that caused the
+**2026-06-17 mesh-hardening incident**:
+
+- `base`'s nftables `forward { policy drop; }` killed the askari Docker host **on reboot**
+  (nftables loaded its default-deny *before* Docker, breaking published-port DNAT and
+  inter-container forwarding → public services + the mesh went down). It had worked right
+  after `make deploy`, when Docker's runtime rules still coexisted. (FRICTION 2026-06-17 #1.)
+- `ip_nonlocal_bind` did **not** beat the sshd boot-race; sshd bound to the `wt0` mesh IP
+  had no listener at boot. (FRICTION #2.)
+- The coordinator host could not bootstrap the mesh it itself hosts. (FRICTION #3.)
+- NetBird `netbird-server` FATAL-loops on the GeoLite2 download when egress is lost — and
+  egress was lost when `nft flush` wiped Docker's NAT masquerade. (FRICTION #4.)
+
+Recovery needed the Hetzner console + a WAN-SSH break-glass. The lesson, already crystallised
+as a standing rule: *firewall/sshd/boot changes must be tested on a real VM with a real
+reboot before they touch a live host, and a non-mesh break-glass must be kept.*
+
+This spec defines a way for the agent to spin up **throwaway KVM VMs locally on ubongo**
+that mirror a target host (real Docker, a real reboot, the real role apply) and validate
+risky infra changes **before** a live deploy. ubongo can host this today:
+
+> verified: ubongo KVM capability · Bash (this session) · `/dev/kvm` present + accessible
+> (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198
+> GiB disk free; libvirt/QEMU/Vagrant **not yet installed** · 2026-06-18.
+
+## Goals
+
+- Reproduce the 2026-06-17 bug class locally: real OS boot, real Docker, real netfilter,
+  the real role apply, a **real reboot**, then outcome assertions.
+- Let the agent drive the full loop autonomously: provision → apply → reboot → assert →
+  teardown, with diagnostics captured on failure.
+- Mirror a *real* host from inventory (first profile: "be askari"), so the apply is
+  faithful, not synthetic.
+- Be the concrete tool that operationalises the standing "test risky infra before live
+  deploy" rule.
+
+## Non-goals (v1)
+
+- Not a production hypervisor on ubongo (reconciles ADR-015 — see Governance).
+- Not nested Proxmox; the provisioning *chrome* (template clone / Terraform) is **not**
+  mirrored — every incident bug lives in the boot/kernel/Docker layer, not provisioning.
+- Not a multi-VM mini-cluster; one VM at a time. (All six 2026-06-17 signals occurred on a
+  single host that was Docker host + coordinator + mesh peer.) Multi-VM is a later extension.
+- Not a CI gate; this is an interactive, agent-driven pre-deploy check on ubongo (CI stays
+  lint + Molecule per ADR-008/010).
+
+## Decisions (from the 2026-06-18 brainstorm)
+
+1. **Virtualisation approach: libvirt/KVM directly (Approach A).** A golden Debian-13
+   genericcloud qcow2 cached locally; each run boots an ephemeral qcow2 overlay backed by
+   it, seeded via cloud-init NoCloud, driven by a **stdlib-only** Python script over
+   `virsh` (no `libvirt-python` dependency). Chosen over Vagrant+vagrant-libvirt (Ruby/plugin
+   footprint, box drift from the real cloud image) and terraform-provider-libvirt (poor at
+   the imperative apply→reboot→re-apply sequence, throwaway state, blurs ADR-006's prod-VM
+   boundary). Lightest footprint on a 15 GiB control node; full control of the reboot step;
+   the same Debian cloud image real hosts boot.
+
+2. **Fidelity envelope: real OS/Docker/netfilter/reboot, not the Proxmox provisioning
+   path.** A lightweight local hypervisor is enough because the bugs are post-boot.
+
+3. **Scope: one throwaway VM at a time, instantiated from a real host's inventory.** First
+   profile: **"be askari"** (Docker host + NetBird coordinator + mesh peer on one box). The
+   mechanism is generic — later "be" any host by swapping which inventory host it mirrors.
+
+4. **Acceptance is self-validating against the real failure.** Done = the harness, on a
+   local VM, applies `base` (firewall on) to a Docker host, reboots, and **observes the
+   2026-06-17 breakage** (Docker forwarding dead / services down); then, with the
+   `docker_host` container-forward drop-in in place, the same run **survives the reboot**.
+   If step 1 passes, the harness is not faithful.
+
+5. **Tiered cert fidelity via a `--certs` knob** (DNS-01 is what makes real certs possible
+   with no public inbound — validation is out-of-band via a Gandi TXT record; the VM needs
+   only outbound to ACME + Gandi, which the NAT net provides):
+   - `internal` (default) — Caddy `tls internal`, zero deps, instant; for the incident repro
+     and runs where certs aren't under test.
+   - `le-staging` — real DNS-01 ACME against Let's Encrypt **staging**: real caddy-gandi
+     path, real cert files/renewal, untrusted root, effectively no rate limits. **Built in v1.**
+   - `le-prod-wildcard` — a real trusted `*.test.wingu.me` wildcard, **issued once,
+     persisted on ubongo, reused** across runs. Wired in v1 but **on-demand only**; its
+     accepted risk is recorded when used (prod Gandi credential reaching an ephemeral VM;
+     transient TXT in the real `wingu.me` zone). A deliberate "no-egress" failure scenario
+     (to reproduce FRICTION #4) forces `internal`, since ACME needs egress.
+
+6. **The toolchain is Ansible-managed**, not hand-installed: a new non-service role
+   (`integration_test`, `control` group) installs/enables libvirt+QEMU reproducibly. The
+   repo owns ubongo's state. The driver manages *images* lazily on first run (keeps the role
+   lean; avoids fiddly download/refresh logic in Ansible).
+
+7. **Stubs live in an overlay file, never in the real inventory** — so `make tf-inventory`
+   and "don't edit inventory directly" stay intact, and every stub is explicit and reviewable.
+
+8. **A new ADR-025** records this decision (approach + alternatives + cert tiers); ADR-008
+   gains a pointer and redirects its "what Molecule does NOT test" gaps here.
+
+## Architecture — five isolated components
+
+| # | Component | Purpose | Location |
+|---|-----------|---------|----------|
+| 1 | **`integration_test` role** (non-service, `control` group) | Install/enable libvirt+QEMU+virtinst, add `sjat`/`claude` to `libvirt` group, create the image-cache dir, drop the driver. Idempotent, Molecule-tested. | `roles/integration_test/` |
+| 2 | **`integration-vm.py` driver** | Stdlib-only lifecycle over `virsh`: `up / apply / reboot / assert / cycle / reset / down / prune / console`. Lazily ensures the golden image (download + checksum). | `scripts/integration-vm.py` |
+| 3 | **Profiles + var overlays** | Make a VM "become" a host: pull that host's real group_vars/host_vars + layer a small explicit overlay (cert tier, in-VM coordinator endpoint, VM connection). | `tests/integration/overrides/.yml` |
+| 4 | **Verify playbook** | Outcome-based post-reboot assertions (Docker up, published-port DNAT alive, `nft` sane, service responds, `wt0` up), reusing Molecule's `verify.yml` philosophy. | `tests/integration/verify.yml` |
+| 5 | **Makefile target** | `make test-integration HOST= [CERTS=...] [KEEP=1]` → `cycle`; `make test-integration-clean` → `prune`. Documented in CLAUDE.md's command table. | `Makefile` |
+
+## Lifecycle / data flow
+
+`make test-integration HOST=askari` drives:
+
+```
+ 1. ensure golden image       Debian-13 genericcloud qcow2, cached + checksum-verified
+ 2. ephemeral overlay         qcow2 backed by golden (throwaway; never mutate golden)
+ 3. cloud-init NoCloud seed   hostname + ansible user + ubongo's SSH key + NIC
+ 4. virt-install --import     boot on an isolated libvirt NAT net (DHCP IP + outbound NAT)
+ 5. wait for SSH IP           via `virsh domifaddr --source lease` (guest-agent optional)
+ 6. transient inventory       askari's real vars + ansible_host= + stub overlay
+ 7. ansible-playbook site     THE REAL APPLY (base + docker_host + reverse_proxy + coordinator)
+ 8. [snapshot post-apply]     optional reset point for fast re-runs
+ 9. virsh reboot ──────────┐  ← the step Molecule structurally cannot do
+10. wait for SSH           ┘
+11. ansible-playbook verify   outcome assertions; THIS is where the incident surfaces
+12. report + teardown         pass/fail; on fail keep VM + dump diagnostics; else destroy overlay
+```
+
+Steps 1–7 build a real Docker daemon with real published-port DNAT to break; step 9 is a
+real kernel reboot, so nftables loads default-deny before Docker exactly as on askari.
+
+## Fidelity boundary & cert tiers
+
+**Faithful where the bug lives:** real kernel, real netfilter, real Docker with
+published-port DNAT, the real role apply, a real reboot, and the coordinator running *inside
+the VM* so the VM is its own mesh peer — reproducing the circular mesh-bootstrap (FRICTION #3)
+on one box.
+
+**Stubbed where it needs the public internet** (explicit, in the overlay): LE certs via the
+`--certs` knob (Decision 5); public DNS (`askari.wingu.me`) → local resolution; NetBird
+geo-DB → pre-seeded or requirement disabled (which is *also* the FRICTION #4 fix, so the
+harness can test both the FATAL-loop and its remedy).
+
+## Acceptance test (self-validating)
+
+1. Run the cycle on **today's** `base` (firewall on, no `docker_host` container-forward
+   drop-in) → **step 11 must FAIL after reboot** (Docker forwarding dead, services down).
+2. Implement the `docker_host` container-forward rules (the pending fix STATUS.md names) →
+   re-run → **step 11 must PASS across the reboot.**
+
+**Scope boundary:** the *harness* is this plan's deliverable. The `docker_host`
+container-forward fix is a separate work item (FRICTION 2026-06-17 #1). v1's acceptance
+deliberately spans both, because a credible harness must demonstrate **both** a true-negative
+(red on the broken state) and a true-positive (green on the fixed state) — otherwise we have
+only ever watched the assert go red. The plan decides sequencing: build the small
+`docker_host` drop-in as the green-half of acceptance, or consume it if built separately
+first. Minimum credible v1 is the red half (faithful reproduction); full acceptance is red→green.
+
+This one round-trip proves the harness reproduces the incident, the fix works, and the loop
+can be trusted for the next risky change before it touches a live host.
+
+## Robustness, isolation & teardown
+
+**Failure leaves evidence** (catching a bug is the point):
+
+| Step fails | Behaviour |
+|---|---|
+| Golden image (1) | Fail fast, clear message; image cached (one-time cost) |
+| Boot / first SSH (4–5) | **Capture serial console to a log file**, fail with its tail — the automated equivalent of the Hetzner console (ties to TODO 10.8) |
+| Apply (7) | Keep VM, surface Ansible output, dump diagnostics |
+| **No SSH after reboot (9–10)** | The classic incident signature; FAIL, keep VM, capture console — the harness *succeeding* |
+| Assert (11) | FAIL, keep VM, dump post-mortem: `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`, `systemd-analyze critical-chain`; exit non-zero |
+
+Diagnostics land in gitignored `~/integration-runs/-/` (same pattern as ADR-017's
+screenshot dir; the agent reads them directly).
+
+**Three safety invariants** (these make the testing tool itself safe):
+1. **The transient inventory contains only the test VM** — no real host is ever in scope;
+   the apply is `--limit`ed to the VM.
+2. **"Be askari" points NetBird at the in-VM coordinator (localhost)** — the VM forms its
+   own one-node mesh; it never enrolls in the real mesh.
+3. **Test VMs sit on an isolated libvirt NAT net** — outbound NAT for ACME/image pulls, but
+   not reachable to the LAN (`10.20.x`) or the real mesh.
+
+**Resource guard** (ubongo's 15 GiB ceiling, ADR-015/012): default VM ≈ 2 vCPU / 3 GiB / 20
+GiB thin overlay; the driver refuses to start below a free-RAM threshold and enforces **one
+integration VM at a time** (name-prefix `boma-it-*`). **Teardown:** success destroys domain +
+overlay; failure keeps them and prints how to inspect; `make test-integration-clean` reaps
+all `boma-it-*` orphans. An optional post-apply **snapshot** lets `reset` re-run
+reboot+assert without re-applying (fast iteration on a fix).
+
+## Testing the tester
+
+- **pytest** on the driver's pure logic: transient-inventory generation, var/overlay merge,
+  `--certs`→overlay mapping, DHCP-lease parsing, resource-guard math (mock `virsh`). Joins
+  boma's existing pytest suite.
+- **Molecule** (Docker) on the `integration_test` role: asserts libvirt/qemu/virtinst
+  installed, `libvirtd` enabled, users in `libvirt` group, driver present. (Cannot run
+  KVM-in-Docker — the documented Molecule limitation.)
+- **End-to-end self-test = the acceptance test above**, run manually on first build and
+  recorded in the runbook.
+
+## Governance & documentation touch-points
+
+- **ADR-025 "Local VM integration testing"** — decision, approach A, rejected alternatives
+  (Proxmox-nested / Vagrant / TF-libvirt), cert tiers.
+- **ADR-008** — pointer to ADR-025; redirect its "what Molecule does NOT test" gaps
+  (nftables loading, mesh dataplane) to this level.
+- **ADR-015** — one-line reconciliation: "not a hypervisor" → runs *ephemeral KVM test VMs*
+  as part of its local-test-runner role (still not a production hypervisor); note the
+  test-VM RAM load.
+- **`docs/security/accepted-risks.md`** — the `le-prod-wildcard` risk (prod Gandi credential
+  → ephemeral VM; transient TXT in real `wingu.me`).
+- **CLAUDE.md** command table + **`docs/runbooks/integration-testing.md`** (run a cycle,
+  cert knobs, where diagnostics land, inspecting a kept failed VM, pruning) + **STATUS.md**
+  entry. The runbook's pre-flight line operationalises FRICTION #6 (*validate
+  reboot-recovery before retiring the break-glass*).
+
+## Capacity
+
+One VM (~3 GiB) against ~13 GiB free is comfortable. The only future pinch is concurrency
+with the Level-4 Chromium/Playwright stack (ADR-017) — handled by the resource guard +
+"one at a time." Add a note to `docs/hardware/reference.md`; revisit at `/capacity-review`.
+
+## Alternatives considered
+
+- **Proxmox VE nested on ubongo** — highest fidelity incl. the provisioning step, but heavy
+  (nested virt, RAM), in tension with ADR-015, and the incident bugs don't live in
+  provisioning. Rejected.
+- **Vagrant + vagrant-libvirt** — mature lifecycle/snapshots, but adds the Ruby/Vagrant
+  ecosystem + a fragile plugin, boxes drift from the real Debian cloud image, and the
+  reboot→assert sequence still needs custom logic. Rejected.
+- **terraform-provider-libvirt** — declarative and reuses TF, but poor at the imperative
+  apply→reboot→re-apply test sequence, adds throwaway state, and blurs ADR-006's
+  "TF owns *production* VM existence on Proxmox" boundary. Rejected.
+
+## Open questions / deferred
+
+- **Multi-VM mini-staging** (inter-host mesh/dataplane) — design the driver + NAT net so a
+  topology is an additive extension; out of scope for v1.
+- **Interplay with the Level-4 browser stack** — both want ubongo RAM; the resource guard is
+  the v1 answer, revisit when Level 4 is built.
+- **Snapshot strategy depth** — v1 ships clone-and-destroy + an optional post-apply snapshot;
+  richer snapshot trees deferred.
+
+## Knowledge to verify at plan stage (ADR-014)
+
+These are from memory / unverified and must be confirmed against version-matched docs before
+the plan asserts them:
+
+- Exact `virt-install --import` flags and the cloud-init **NoCloud** seed format on the
+  Debian-13 libvirt stack.
+- Whether the Debian-13 genericcloud image ships `qemu-guest-agent` (IP can come from the
+  DHCP lease regardless — guest-agent is an optimisation, not a requirement).
+- Let's Encrypt **rate limits** (prod vs staging) — to confirm "issue the wildcard once,
+  reuse" stays within limits.
+- The `caddy-dns/gandi` DNS-01 configuration and pinned version already used by
+  `reverse_proxy`, and whether the Gandi LiveDNS API key can be scoped to `test.wingu.me`.
+- libvirt default vs a dedicated isolated NAT network on Debian-13 (`virsh net-*`).