boma/docs/decisions/025-local-vm-integration-testing.md
sjat bc8592616b fix: address final whole-branch review findings
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:52:28 +02:00

180 lines
10 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-025 — Local VM integration testing on ubongo
## Status
Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now
viable on ubongo). **RED→GREEN acceptance PASSED on real hardware (2026-06-18):** a
throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base's nftables forward
default-deny kills Docker forwarding on reboot) — RED — and survived the reboot once
the `docker_host` container-forward drop-in was applied — GREEN. Two shakedown
learnings added below.
## Context
Molecule (ADR-008 Level 1) tests each role in a single Docker container: one
`converge`, no real kernel netfilter, no real Docker daemon in the loop, and **no
reboot**. That structurally cannot catch an entire class of bug — reboot-survivability,
host-firewall × Docker interaction, and boot-ordering — which is exactly the class
that caused the **2026-06-17 mesh-hardening incident**.
During that incident, `base`'s nftables `forward { policy drop; }` killed the askari
Docker host **on reboot**: nftables loaded its default-deny before Docker, breaking
published-port DNAT and inter-container forwarding. Public services and the mesh went
down. It had worked right after `make deploy`, when Docker's runtime rules still
coexisted. `ip_nonlocal_bind` also failed to beat the sshd boot-race, leaving the mesh
listener absent at boot. Recovery required the Hetzner console and a WAN-SSH
break-glass. Molecule had passed.
ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo breaks that deferral:
> verified: ubongo KVM capability · Bash (2026-06-18 session) · `/dev/kvm` present +
> accessible (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM
> free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant **not yet installed** ·
> 2026-06-18.
## Decision
### 1. Virtualisation approach: libvirt/KVM directly (Approach A)
A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an
ephemeral qcow2 **overlay** backed by it (the golden image is never mutated), seeded
via cloud-init NoCloud, driven by a **stdlib-only** Python driver (`scripts/
integration-vm.py`) over `virsh` / `virt-install` / `cloud-localds`. No `libvirt-
python` dependency — the driver stays portable and the role stays lean.
### 2. Fidelity envelope
The bugs are **post-boot**, not in the provisioning path. A lightweight local hypervisor
is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port
DNAT, a **real reboot**, and the coordinator running inside the VM (so the VM forms its
own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning chrome
is not mirrored.
### 3. Scope: one throwaway VM at a time, instantiated from real inventory
The first profile is **"be askari"** — a single box running Docker host + NetBird
coordinator + mesh peer, mirroring the host whose incident motivates this work. The
mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies
are a deferred extension.
### 4. Acceptance: self-validating against the real failure
The harness is accepted when it can, on a local VM:
1. Apply `base` (firewall on, no `docker_host` container-forward drop-in) to a Docker
host, reboot, and observe the **2026-06-17 breakage** (Docker forwarding dead,
services down). If step 1 passes, the harness is not faithful.
2. Apply the `docker_host` container-forward fix, re-run, and **survive the reboot**.
### 5. Tiered cert fidelity via a `--certs` knob
DNS-01 is what makes real certs possible without public inbound (validation is
out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which
the isolated NAT network provides):
| Tier | Description | Default? |
|---|---|---|
| `internal` | Caddy `tls internal` — zero deps, instant. For incident repro and runs where certs are not under test. | Yes |
| `le-staging` | Real DNS-01 ACME against Let's Encrypt **staging** — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. | Built in v1; use when testing the ACME/cert path. |
| `le-prod-wildcard` | A real trusted `*.test.wingu.me` wildcard, **issued once, persisted on ubongo, reused** across runs. | On-demand only. Accepted risk recorded as R6 in `docs/security/accepted-risks.md`. |
A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4
`netbird-server` FATAL-loops on GeoLite2 download when egress is lost) forces
`internal`, since ACME requires egress.
### 6. The toolchain is Ansible-managed
A new non-service role (`integration_test`, `control` group) installs and enables
libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on
first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The
repo owns ubongo's state.
### 7. Stubs live in an overlay file, never in the real inventory
Transient inventory entries for the test VM are generated at runtime as a single-host
file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in
`tests/integration/overrides/<host>.yml` — an explicit, reviewable overlay. The real
inventory is never touched, so `make tf-inventory` and "don't edit inventory directly"
stay intact.
## Consequences
- **Reconciles ADR-015:** ubongo runs ephemeral KVM test VMs as part of its
local-test-runner role — it is still not a production hypervisor. A default VM
(~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the
driver enforces **one integration VM at a time** (resource guard, name-prefix
`boma-it-*`) and refuses to start below a free-RAM threshold.
- **Operationalises the standing rule:** "firewall/sshd/boot changes must be tested on
a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6)
becomes a concrete, runnable step documented in `docs/runbooks/integration-testing.md`.
- **Accepted risk R6:** `le-prod-wildcard` runs pass the production Gandi PAT
(`vault.gandi.pat`) to an ephemeral local VM and write transient `_acme-challenge`
TXT records into the real `wingu.me` zone. Scope: on-demand only; `le-staging` is the
default. Compensating controls: ephemeral VM, isolated NAT network, TXT records
auto-removed by Caddy after validation.
- **Three safety invariants** make the test tool itself safe:
1. The transient inventory contains only the test VM — no real host is ever in scope.
2. "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node
mesh; it never enrols in the real mesh.
3. Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls
only, not reachable to the LAN (`10.20.x`) or the real mesh.
- **Diagnostics on failure** (catching a bug is the point): failure keeps the VM and
dumps `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`,
`systemd-analyze critical-chain`. `make test-integration-clean` reaps all `boma-it-*`
orphans. Diagnostics land in gitignored `~/integration-runs/<ts>-<host>/`.
- **Future pinch:** concurrency with the Level-4 Chromium/Playwright stack (ADR-017)
competes for ubongo RAM. The resource guard is the v1 answer — one integration VM at a
time; don't run alongside a heavy Level-4 session. Revisit at `/capacity-review`.
## Scope
**In scope:** reboot-survivability, host-firewall × Docker interaction, boot-ordering,
cert/ACME paths, mesh bootstrap on one box.
**Out of scope (v1):** multi-VM mini-cluster (inter-host mesh dataplane); CI gate
(this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per
ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker
layer, not provisioning).
## What was ruled out
| Option | Reason |
|---|---|
| **Proxmox VE nested on ubongo** | Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning. |
| **Vagrant + vagrant-libvirt** | Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic. |
| **terraform-provider-libvirt** | Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns *production* VM existence on Proxmox" boundary. |
## Verified facts (ADR-014)
- verified: ubongo KVM capability · Bash · `/dev/kvm` present + accessible (kvm group),
Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB
disk free · 2026-06-18.
## Shakedown learnings (2026-06-18 live run)
Two findings from the RED→GREEN acceptance run that affect anyone operating the harness:
1. **Boot firmware: UEFI required.** The Debian 13 genericcloud image triple-faults
under legacy BIOS/SeaBIOS and does not reach the kernel. Boot the VM with UEFI
(`virt-install --boot uefi`; `ovmf` package). The driver does this by default; note
it here so the requirement is findable.
2. **`claude` sudo is load-bearing.** VM management (`virsh`, `virt-install`,
`cloud-localds`) and offline diagnostics (`nft list ruleset`, `journalctl -b`,
`systemd-analyze critical-chain`) all require root. The harness assumes the
AI-worker has `NOPASSWD:ALL` sudo on `ubongo` — settled as the ADR-015 amendment
(2026-06-18) and registered as R7 in `docs/security/accepted-risks.md`. A `claude`
account without sudo will block the harness at the first `virsh` call.
The nine full shakedown findings (including the UEFI boot-loop) are in
`docs/FRICTION.md`.
## Related
- ADR-006 — Terraform owns production VM existence (boundary this ADR respects).
- ADR-008 — Testing methodology (Levels 14); this ADR is the concrete build of Level 2/3.
- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs. **Supersedes** ADR-015's "no local sudo" sub-decision for the AI-worker — the shakedown necessitated `claude` NOPASSWD sudo (ADR-023 §4; access model in ADR-021, risk R7).
- ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role.
- ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests.
- ADR-021 — Operational access; sudo model for `claude` and `sjat` on `ubongo`.
- ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.