docs(spec): design local VM integration testing on ubongo (2.4)

Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host (real Docker, real reboot, real role apply) to catch the reboot/firewall/boot-order class Molecule cannot - the 2026-06-17 mesh-hardening incident. First profile: be askari; tiered certs (internal + le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008 Level 2/3; to be recorded as ADR-025.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 69faaf5e43
commit 02e1eb7449
1 changed files with 267 additions and 0 deletions
@@ -0,0 +1,267 @@
# Local VM integration testing on ubongo (design)

**Status:** Designed, not built. Resolves `docs/TODO.md` item 2.4 (Local VM integration
testing on ubongo, pre-deploy).
**Date:** 2026-06-18.
**Implements:** the concrete build of ADR-008 Level 2/3 (staging/integration), deferred
for lack of hosts but hostable on ubongo. To be recorded as **ADR-025**.

## Context

Molecule (ADR-008 Level 1) tests each role in a single Docker container: one `converge`,
no real kernel netfilter, no real Docker daemon in the loop, and **no reboot**. That
structurally cannot catch an entire class of bug (reboot-survivability, host-firewall ×
Docker interaction, and boot-ordering), which is exactly the class that caused the
**2026-06-17 mesh-hardening incident**:

- `base`'s nftables `forward { policy drop; }` killed the askari Docker host **on reboot**
  (nftables loaded its default-deny *before* Docker, breaking published-port DNAT and
  inter-container forwarding → public services + the mesh went down). It had worked right
  after `make deploy`, when Docker's runtime rules still coexisted. (FRICTION 2026-06-17 #1.)
- `ip_nonlocal_bind` did **not** beat the sshd boot-race; sshd bound to the `wt0` mesh IP
  had no listener at boot. (FRICTION #2.)
- The coordinator host could not bootstrap the mesh it itself hosts. (FRICTION #3.)
- NetBird `netbird-server` FATAL-loops on the GeoLite2 download when egress is lost, and
  egress was lost when `nft flush` wiped Docker's NAT masquerade. (FRICTION #4.)

Recovery needed the Hetzner console + a WAN-SSH break-glass. The lesson, already crystallised
as a standing rule: *firewall/sshd/boot changes must be tested on a real VM with a real
reboot before they touch a live host, and a non-mesh break-glass must be kept.*

This spec defines a way for the agent to spin up **throwaway KVM VMs locally on ubongo**
that mirror a target host (real Docker, a real reboot, the real role apply) and validate
risky infra changes **before** a live deploy. ubongo can host this today:

> verified: ubongo KVM capability · Bash (this session) · `/dev/kvm` present + accessible
> (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198
> GiB disk free; libvirt/QEMU/Vagrant **not yet installed** · 2026-06-18.

## Goals

- Reproduce the 2026-06-17 bug class locally: real OS boot, real Docker, real netfilter,
  the real role apply, a **real reboot**, then outcome assertions.
- Let the agent drive the full loop autonomously: provision → apply → reboot → assert →
  teardown, with diagnostics captured on failure.
- Mirror a *real* host from inventory (first profile: "be askari"), so the apply is
  faithful, not synthetic.
- Be the concrete tool that operationalises the standing "test risky infra before live
  deploy" rule.

## Non-goals (v1)

- Not a production hypervisor on ubongo (reconciles ADR-015; see Governance).
- Not nested Proxmox; the provisioning *chrome* (template clone / Terraform) is **not**
  mirrored, because every incident bug lives in the boot/kernel/Docker layer, not in
  provisioning.
- Not a multi-VM mini-cluster; one VM at a time. (All six 2026-06-17 signals occurred on a
  single host that was Docker host + coordinator + mesh peer.) Multi-VM is a later extension.
- Not a CI gate; this is an interactive, agent-driven pre-deploy check on ubongo (CI stays
  lint + Molecule per ADR-008/010).

## Decisions (from the 2026-06-18 brainstorm)

1. **Virtualisation approach: libvirt/KVM directly (Approach A).** A golden Debian-13
   genericcloud qcow2 is cached locally; each run boots an ephemeral qcow2 overlay backed by
   it, seeded via cloud-init NoCloud, driven by a **stdlib-only** Python script over
   `virsh` (no `libvirt-python` dependency). Chosen over Vagrant+vagrant-libvirt (Ruby/plugin
   footprint, box drift from the real cloud image) and terraform-provider-libvirt (poor at
   the imperative apply→reboot→re-apply sequence, throwaway state, blurs ADR-006's prod-VM
   boundary). This gives the lightest footprint on a 15 GiB control node, full control of
   the reboot step, and the same Debian cloud image that real hosts boot.

2. **Fidelity envelope: real OS/Docker/netfilter/reboot, not the Proxmox provisioning
   path.** A lightweight local hypervisor is enough because the bugs are post-boot.

3. **Scope: one throwaway VM at a time, instantiated from a real host's inventory.** First
   profile: **"be askari"** (Docker host + NetBird coordinator + mesh peer on one box). The
   mechanism is generic: later runs can "be" any host by swapping which inventory host the
   VM mirrors.

4. **Acceptance is self-validating against the real failure.** Done = the harness, on a
   local VM, applies `base` (firewall on) to a Docker host, reboots, and **observes the
   2026-06-17 breakage** (Docker forwarding dead / services down); then, with the
   `docker_host` container-forward drop-in in place, the same run **survives the reboot**.
   If the first run (without the drop-in) passes, the harness is not faithful.

5. **Tiered cert fidelity via a `--certs` knob** (DNS-01 is what makes real certs possible
   with no public inbound: validation is out-of-band via a Gandi TXT record, so the VM needs
   only outbound to ACME + Gandi, which the NAT net provides):
   - `internal` (default): Caddy `tls internal`, zero deps, instant; for the incident repro
     and runs where certs aren't under test.
   - `le-staging`: real DNS-01 ACME against Let's Encrypt **staging**, with the real
     caddy-gandi path, real cert files/renewal, an untrusted root, and far looser rate
     limits than production. **Built in v1.**
   - `le-prod-wildcard`: a real trusted `*.test.wingu.me` wildcard, **issued once,
     persisted on ubongo, reused** across runs. Wired in v1 but **on-demand only**; its
     accepted risk is recorded when used (prod Gandi credential reaching an ephemeral VM;
     transient TXT in the real `wingu.me` zone). A deliberate "no-egress" failure scenario
     (to reproduce FRICTION #4) forces `internal`, since ACME needs egress.

6. **The toolchain is Ansible-managed**, not hand-installed: a new non-service role
   (`integration_test`, `control` group) installs/enables libvirt+QEMU reproducibly. The
   repo owns ubongo's state. The driver manages *images* lazily on first run (keeps the role
   lean; avoids fiddly download/refresh logic in Ansible).

7. **Stubs live in an overlay file, never in the real inventory**, so `make tf-inventory`
   and "don't edit inventory directly" stay intact, and every stub is explicit and reviewable.

8. **A new ADR-025** records this decision (approach + alternatives + cert tiers); ADR-008
   gains a pointer and redirects its "what Molecule does NOT test" gaps here.
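
The ephemeral overlay from Decision 1 could be created with a single `qemu-img` call. The sketch below only builds the argument list, so it can be unit-tested without touching disk; the function name, the paths in the usage note, and the 20 GiB default (taken from the resource guard's default VM size) are illustrative assumptions, not the driver's actual interface.

```python
from pathlib import Path
from typing import List


def overlay_create_argv(golden: Path, overlay: Path, size_gib: int = 20) -> List[str]:
    """Build a qemu-img command for a copy-on-write overlay backed by the golden image.

    The golden image is only ever read; all guest writes land in the overlay.
    """
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return [
        "qemu-img", "create",
        "-f", "qcow2",        # format of the new overlay file
        "-F", "qcow2",        # format of the backing (golden) image
        "-b", str(golden),    # backing file
        str(overlay),
        f"{size_gib}G",       # virtual size of the overlay (thin-provisioned)
    ]
```

A driver would pass the result to `subprocess.run(argv, check=True)`, for example with a hypothetical cache path such as `overlay_create_argv(Path("/var/lib/boma-it/golden.qcow2"), Path("/var/lib/boma-it/boma-it-askari.qcow2"))`.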

## Architecture: five isolated components

| # | Component | Purpose | Location |
|---|-----------|---------|----------|
| 1 | **`integration_test` role** (non-service, `control` group) | Install/enable libvirt+QEMU+virtinst, add `sjat`/`claude` to `libvirt` group, create the image-cache dir, drop the driver. Idempotent, Molecule-tested. | `roles/integration_test/` |
| 2 | **`integration-vm.py` driver** | Stdlib-only lifecycle over `virsh`: `up / apply / reboot / assert / cycle / reset / down / prune / console`. Lazily ensures the golden image (download + checksum). | `scripts/integration-vm.py` |
| 3 | **Profiles + var overlays** | Make a VM "become" a host: pull that host's real group_vars/host_vars + layer a small explicit overlay (cert tier, in-VM coordinator endpoint, VM connection). | `tests/integration/overrides/<host>.yml` |
| 4 | **Verify playbook** | Outcome-based post-reboot assertions (Docker up, published-port DNAT alive, `nft` sane, service responds, `wt0` up), reusing Molecule's `verify.yml` philosophy. | `tests/integration/verify.yml` |
| 5 | **Makefile target** | `make test-integration HOST=<name> [CERTS=...] [KEEP=1]` → `cycle`; `make test-integration-clean` → `prune`. Documented in CLAUDE.md's command table. | `Makefile` |

## Lifecycle / data flow

`make test-integration HOST=askari` drives:

```
 1. ensure golden image      Debian-13 genericcloud qcow2, cached + checksum-verified
 2. ephemeral overlay qcow2  backed by golden (throwaway; never mutate golden)
 3. cloud-init NoCloud seed  hostname + ansible user + ubongo's SSH key + NIC
 4. virt-install --import    boot on an isolated libvirt NAT net (DHCP IP + outbound NAT)
 5. wait for SSH             IP via `virsh domifaddr --source lease` (guest-agent optional)
 6. transient inventory      askari's real vars + ansible_host=<lease IP> + stub overlay
 7. ansible-playbook site    THE REAL APPLY (base + docker_host + reverse_proxy + coordinator)
 8. [snapshot post-apply]    optional reset point for fast re-runs
 9. virsh reboot ──────────┐ ← the step Molecule structurally cannot do
10. wait for SSH           ┘
11. ansible-playbook verify  outcome assertions; THIS is where the incident surfaces
12. report + teardown        pass/fail; on fail keep VM + dump diagnostics; else destroy overlay
```

Steps 1–7 build a real Docker daemon with real published-port DNAT to break; step 9 is a
real kernel reboot, so nftables loads default-deny before Docker exactly as on askari.
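
Step 5 depends on turning the output of `virsh domifaddr --source lease` into an IP address. The sketch below shows one way the DHCP-lease parsing could look as pure, testable logic. It assumes the tabular layout that `virsh` commonly prints (name, MAC, protocol, address/prefix); that layout, like the function name, is an assumption to confirm against the installed libvirt version.

```python
import ipaddress
import re
from typing import Optional

# One data row: interface name, MAC address, protocol, address/prefix.
LEASE_ROW = re.compile(
    r"^\s*(\S+)\s+([0-9a-fA-F:]{17})\s+(ipv4|ipv6)\s+(\S+)/(\d+)\s*$"
)


def first_ipv4(domifaddr_output: str) -> Optional[str]:
    """Return the first IPv4 lease address, or None while no lease exists yet."""
    for line in domifaddr_output.splitlines():
        match = LEASE_ROW.match(line)
        if match and match.group(3) == "ipv4":
            # ip_address() validates the text and raises ValueError on junk.
            return str(ipaddress.ip_address(match.group(4)))
    return None
```

Returning `None` instead of raising lets a caller poll in a loop with a timeout, both after first boot (step 5) and after the reboot (step 10).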

## Fidelity boundary & cert tiers

**Faithful where the bug lives:** real kernel, real netfilter, real Docker with
published-port DNAT, the real role apply, a real reboot, and the coordinator running *inside
the VM* so the VM is its own mesh peer, which reproduces the circular mesh-bootstrap
(FRICTION #3) on one box.

**Stubbed where it needs the public internet** (explicit, in the overlay): LE certs via the
`--certs` knob (Decision 5); public DNS (`askari.wingu.me`) → local resolution; NetBird
geo-DB → pre-seeded or requirement disabled (which is *also* the FRICTION #4 fix, so the
harness can test both the FATAL-loop and its remedy).
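
The `--certs` → overlay mapping could be a small lookup table. In the sketch below the overlay variable names (`it_tls_mode`, `it_acme_ca`, `it_wildcard_domain`) are hypothetical placeholders, not the `reverse_proxy` role's real variables; the Let's Encrypt staging directory URL is the public one, and the forced downgrade to `internal` follows the no-egress rule in Decision 5.

```python
from typing import Dict

LE_STAGING_DIRECTORY = "https://acme-staging-v02.api.letsencrypt.org/directory"

# Hypothetical overlay variables for each tier named in Decision 5.
CERT_TIERS: Dict[str, Dict[str, str]] = {
    "internal": {"it_tls_mode": "internal"},
    "le-staging": {"it_tls_mode": "acme-dns01", "it_acme_ca": LE_STAGING_DIRECTORY},
    "le-prod-wildcard": {
        "it_tls_mode": "reuse-wildcard",
        "it_wildcard_domain": "*.test.wingu.me",
    },
}


def cert_overlay(certs: str = "internal", no_egress: bool = False) -> Dict[str, str]:
    """Map the --certs knob to overlay vars; no-egress scenarios force `internal`."""
    if certs not in CERT_TIERS:
        raise ValueError(f"unknown cert tier: {certs!r}")
    if no_egress:
        certs = "internal"  # ACME needs egress, so the FRICTION #4 repro cannot use it
    return dict(CERT_TIERS[certs])  # copy, so callers cannot mutate the table
```

Keeping the table as data makes each tier explicit and reviewable, in the same spirit as keeping stubs in an overlay file.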

## Acceptance test (self-validating)

1. Run the cycle on **today's** `base` (firewall on, no `docker_host` container-forward
   drop-in) → **step 11 must FAIL after reboot** (Docker forwarding dead, services down).
2. Implement the `docker_host` container-forward rules (the pending fix STATUS.md names) →
   re-run → **step 11 must PASS across the reboot.**

**Scope boundary:** the *harness* is this plan's deliverable. The `docker_host`
container-forward fix is a separate work item (FRICTION 2026-06-17 #1). v1's acceptance
deliberately spans both, because a credible harness must demonstrate **both** a true positive
(red on the broken state) and a true negative (green on the fixed state); otherwise we have
only ever watched the assert go red. The plan decides sequencing: build the small
`docker_host` drop-in as the green half of acceptance, or consume it if built separately
first. Minimum credible v1 is the red half (faithful reproduction); full acceptance is red→green.

This one round-trip proves the harness reproduces the incident, the fix works, and the loop
can be trusted for the next risky change before it touches a live host.
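
The red→green rule can be written down as a tiny decision function. This is a sketch of the acceptance logic only; the function name and the verdict labels are hypothetical.

```python
from typing import Optional


def acceptance_verdict(broken_run_passed: bool, fixed_run_passed: Optional[bool]) -> str:
    """Classify the two acceptance runs.

    broken_run_passed: result of step 11 on today's `base` without the drop-in.
    fixed_run_passed:  result of step 11 with the drop-in, or None if not yet run.
    """
    if broken_run_passed:
        # The harness failed to reproduce the incident, so it is not faithful.
        return "unfaithful"
    if fixed_run_passed is None:
        # Red half only: the minimum credible v1.
        return "minimum-v1"
    return "accepted" if fixed_run_passed else "fix-incomplete"
```

Only the combination of a failing first run and a passing second run yields `accepted`; a green first run is treated as a harness defect, not as good news.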

## Robustness, isolation & teardown

**Failure leaves evidence** (catching a bug is the point):

| Step fails | Behaviour |
|---|---|
| Golden image (1) | Fail fast, clear message; image cached (one-time cost) |
| Boot / first SSH (4–5) | **Capture serial console to a log file**, fail with its tail; this is the automated equivalent of the Hetzner console (ties to TODO 10.8) |
| Apply (7) | Keep VM, surface Ansible output, dump diagnostics |
| **No SSH after reboot (9–10)** | The classic incident signature; FAIL, keep VM, capture console. This is the harness *succeeding* |
| Assert (11) | FAIL, keep VM, dump post-mortem: `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`, `systemd-analyze critical-chain`; exit non-zero |

Diagnostics land in gitignored `~/integration-runs/<ts>-<host>/` (same pattern as ADR-017's
screenshot dir; the agent reads them directly).
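
A minimal sketch of how the run directory and the post-mortem command list might be modelled. The command list comes from the table above; the timestamp format behind `<ts>`, the helper names, and the output file naming are assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import List

# Post-mortem commands to run inside the kept VM when step 11 fails.
POST_MORTEM: List[List[str]] = [
    ["nft", "list", "ruleset"],
    ["docker", "ps"],
    ["ss", "-tlnp"],
    ["journalctl", "-b"],
    ["systemd-analyze", "critical-chain"],
]


def run_dir(host: str, now: datetime, root: Path) -> Path:
    """Return <root>/<ts>-<host>; the compact UTC timestamp format is an assumption."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return root / f"{stamp}-{host}"


def output_name(argv: List[str]) -> str:
    """File name for one command's captured output, e.g. nft_list_ruleset.txt."""
    return "_".join(argv).replace("/", "_") + ".txt"
```

A driver would call `run_dir(host, datetime.now(timezone.utc), Path.home() / "integration-runs")`, create the directory, and write one file per command so the agent can read each dump directly.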

**Three safety invariants** (these make the testing tool itself safe):
1. **The transient inventory contains only the test VM**: no real host is ever in scope;
   the apply is `--limit`ed to the VM.
2. **"Be askari" points NetBird at the in-VM coordinator (localhost)**: the VM forms its
   own one-node mesh; it never enrolls in the real mesh.
3. **Test VMs sit on an isolated libvirt NAT net**: outbound NAT for ACME/image pulls, but
   not reachable to the LAN (`10.20.x`) or the real mesh.
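
Invariant 1 can be enforced structurally by generating the inventory from a single host entry. The sketch below emits an Ansible INI inventory and refuses any address that is public or on the LAN. Reading `10.20.x` as `10.20.0.0/16`, the `ansible` user name, and the group names in the usage note are assumptions.

```python
import ipaddress
from typing import List

LAN = ipaddress.ip_network("10.20.0.0/16")  # assumption: the spec's "10.20.x" LAN


def transient_inventory(host: str, lease_ip: str, groups: List[str],
                        user: str = "ansible") -> str:
    """Render an INI inventory that contains exactly one host: the test VM."""
    addr = ipaddress.ip_address(lease_ip)  # raises ValueError on malformed input
    if not addr.is_private or addr in LAN:
        raise ValueError(f"{lease_ip} is not an isolated NAT-net address")
    # Ungrouped host line carries the connection vars; groups list the name only.
    lines = [f"{host} ansible_host={addr} ansible_user={user}", ""]
    for group in groups:
        lines += [f"[{group}]", host, ""]
    return "\n".join(lines)
```

For example, `transient_inventory("askari", "192.168.150.23", ["docker_hosts", "coordinator"])` yields a file in which the inventory name `askari` resolves only to the VM's lease IP, so the real host cannot be reached even if `--limit` were omitted.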

**Resource guard** (ubongo's 15 GiB ceiling, ADR-015/012): default VM ≈ 2 vCPU / 3 GiB / 20
GiB thin overlay; the driver refuses to start below a free-RAM threshold and enforces **one
integration VM at a time** (name-prefix `boma-it-*`). **Teardown:** success destroys domain +
overlay; failure keeps them and prints how to inspect; `make test-integration-clean` reaps
all `boma-it-*` orphans. An optional post-apply **snapshot** lets `reset` re-run
reboot+assert without re-applying (fast iteration on a fix).
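
The resource-guard math could look like the sketch below. It reads `MemAvailable` from `/proc/meminfo` text and takes the list of existing domain names as input, so both checks stay pure and mockable. The 2 GiB headroom is a hypothetical threshold; the spec does not fix a value.

```python
import re
from typing import Iterable


def mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo content and convert kB to GiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if not match:
        raise ValueError("MemAvailable not found")
    return int(match.group(1)) / (1024 * 1024)


def guard(meminfo_text: str, domain_names: Iterable[str], vm_gib: float = 3.0,
          headroom_gib: float = 2.0, prefix: str = "boma-it-") -> None:
    """Raise RuntimeError if another test VM exists or free RAM is too low."""
    existing = [name for name in domain_names if name.startswith(prefix)]
    if existing:
        raise RuntimeError(f"integration VM already present: {existing}")
    free = mem_available_gib(meminfo_text)
    if free < vm_gib + headroom_gib:  # headroom_gib is an assumed safety margin
        raise RuntimeError(f"only {free:.1f} GiB available")
```

The domain names would typically come from `virsh list --all --name`, which also covers kept (shut-off) VMs from an earlier failed run.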

## Testing the tester

- **pytest** on the driver's pure logic: transient-inventory generation, var/overlay merge,
  `--certs`→overlay mapping, DHCP-lease parsing, resource-guard math (mock `virsh`). Joins
  boma's existing pytest suite.
- **Molecule** (Docker) on the `integration_test` role: asserts libvirt/qemu/virtinst
  installed, `libvirtd` enabled, users in `libvirt` group, driver present. (Cannot run
  KVM-in-Docker, which is the documented Molecule limitation.)
- **End-to-end self-test = the acceptance test above**, run manually on first build and
  recorded in the runbook.
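
Mocking `virsh` in the pytest layer can be done with the standard library alone. The sketch below patches `subprocess.run` so the test never needs libvirt installed; `integration_domains` is a hypothetical helper, not the driver's real function.

```python
import subprocess
from typing import List
from unittest import mock


def integration_domains(prefix: str = "boma-it-") -> List[str]:
    """List all libvirt domains (running or not) whose name starts with prefix."""
    result = subprocess.run(
        ["virsh", "list", "--all", "--name"],
        check=True, capture_output=True, text=True,
    )
    return [name for name in result.stdout.splitlines() if name.startswith(prefix)]


def test_integration_domains_filters_by_prefix() -> None:
    # Canned output: one test VM, one unrelated VM, and virsh's trailing blank line.
    fake = subprocess.CompletedProcess(
        args=[], returncode=0, stdout="boma-it-askari\nother-vm\n\n", stderr=""
    )
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert integration_domains() == ["boma-it-askari"]
    run.assert_called_once()
```

Because the helper looks up `subprocess.run` at call time, patching the module attribute is enough; no dependency injection is needed for this style of test.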

## Governance & documentation touch-points

- **ADR-025 "Local VM integration testing"**: decision, approach A, rejected alternatives
  (Proxmox-nested / Vagrant / TF-libvirt), cert tiers.
- **ADR-008**: pointer to ADR-025; redirect its "what Molecule does NOT test" gaps
  (nftables loading, mesh dataplane) to this level.
- **ADR-015**: one-line reconciliation: "not a hypervisor" → runs *ephemeral KVM test VMs*
  as part of its local-test-runner role (still not a production hypervisor); note the
  test-VM RAM load.
- **`docs/security/accepted-risks.md`**: the `le-prod-wildcard` risk (prod Gandi credential
  → ephemeral VM; transient TXT in real `wingu.me`).
- **CLAUDE.md** command table + **`docs/runbooks/integration-testing.md`** (run a cycle,
  cert knobs, where diagnostics land, inspecting a kept failed VM, pruning) + **STATUS.md**
  entry. The runbook's pre-flight line operationalises FRICTION #6 (*validate
  reboot-recovery before retiring the break-glass*).

## Capacity

One VM (~3 GiB) against ~13 GiB free is comfortable. The only future pinch is concurrency
with the Level-4 Chromium/Playwright stack (ADR-017), which is handled by the resource
guard + "one at a time." Add a note to `docs/hardware/reference.md`; revisit at `/capacity-review`.

## Alternatives considered

- **Proxmox VE nested on ubongo**: highest fidelity incl. the provisioning step, but heavy
  (nested virt, RAM), in tension with ADR-015, and the incident bugs don't live in
  provisioning. Rejected.
- **Vagrant + vagrant-libvirt**: mature lifecycle/snapshots, but adds the Ruby/Vagrant
  ecosystem + a fragile plugin, boxes drift from the real Debian cloud image, and the
  reboot→assert sequence still needs custom logic. Rejected.
- **terraform-provider-libvirt**: declarative and reuses TF, but poor at the imperative
  apply→reboot→re-apply test sequence, adds throwaway state, and blurs ADR-006's
  "TF owns *production* VM existence on Proxmox" boundary. Rejected.

## Open questions / deferred

- **Multi-VM mini-staging** (inter-host mesh/dataplane): design the driver + NAT net so a
  topology is an additive extension; out of scope for v1.
- **Interplay with the Level-4 browser stack**: both want ubongo RAM; the resource guard is
  the v1 answer, revisit when Level 4 is built.
- **Snapshot strategy depth**: v1 ships clone-and-destroy + an optional post-apply snapshot;
  richer snapshot trees deferred.

## Knowledge to verify at plan stage (ADR-014)

These are from memory / unverified and must be confirmed against version-matched docs before
the plan asserts them:

- Exact `virt-install --import` flags and the cloud-init **NoCloud** seed format on the
  Debian-13 libvirt stack.
- Whether the Debian-13 genericcloud image ships `qemu-guest-agent` (IP can come from the
  DHCP lease regardless; guest-agent is an optimisation, not a requirement).
- Let's Encrypt **rate limits** (prod vs staging), to confirm "issue the wildcard once,
  reuse" stays within limits.
- The `caddy-dns/gandi` DNS-01 configuration and pinned version already used by
  `reverse_proxy`, and whether the Gandi LiveDNS API key can be scoped to `test.wingu.me`.
- libvirt default vs a dedicated isolated NAT network on Debian-13 (`virsh net-*`).