docs(adr): ADR-025 local VM integration testing
Accepted decision to implement ADR-008 Level 2/3 on ubongo via libvirt/KVM directly: throwaway VM overlays, stdlib-only driver, tiered cert fidelity, three safety invariants. Addresses the 2026-06-17 mesh-hardening incident's reboot-survivability gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
d68734267b
commit
edcc347a95
1 changed files with 157 additions and 0 deletions
157
docs/decisions/025-local-vm-integration-testing.md
Normal file
157
docs/decisions/025-local-vm-integration-testing.md
Normal file
|
|
@ -0,0 +1,157 @@
|
|||
# ADR-025 — Local VM integration testing on ubongo
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now
|
||||
viable on ubongo). The harness code is built and lint+pytest-clean; RED/GREEN
|
||||
acceptance is pending the first live run on ubongo.
|
||||
|
||||
## Context
|
||||
|
||||
Molecule (ADR-008 Level 1) tests each role in a single Docker container: one
|
||||
`converge`, no real kernel netfilter, no real Docker daemon in the loop, and **no
|
||||
reboot**. That structurally cannot catch an entire class of bug — reboot-survivability,
|
||||
host-firewall × Docker interaction, and boot-ordering — which is exactly the class
|
||||
that caused the **2026-06-17 mesh-hardening incident**.
|
||||
|
||||
During that incident, `base`'s nftables `forward { policy drop; }` killed the askari
|
||||
Docker host **on reboot**: nftables loaded its default-deny before Docker, breaking
|
||||
published-port DNAT and inter-container forwarding. Public services and the mesh went
|
||||
down. It had worked right after `make deploy`, when Docker's runtime rules still
|
||||
coexisted. `ip_nonlocal_bind` also failed to beat the sshd boot-race, leaving the mesh
|
||||
listener absent at boot. Recovery required the Hetzner console and a WAN-SSH
|
||||
break-glass. Molecule had passed.
|
||||
|
||||
ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo breaks that deferral:
|
||||
|
||||
> verified: ubongo KVM capability · Bash (2026-06-18 session) · `/dev/kvm` present +
|
||||
> accessible (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM
|
||||
> free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant **not yet installed** ·
|
||||
> 2026-06-18.
|
||||
|
||||
## Decision
|
||||
|
||||
### 1. Virtualisation approach: libvirt/KVM directly (Approach A)
|
||||
|
||||
A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an
|
||||
ephemeral qcow2 **overlay** backed by it (the golden image is never mutated), seeded
|
||||
via cloud-init NoCloud, driven by a **stdlib-only** Python driver (`scripts/
|
||||
integration-vm.py`) over `virsh` / `virt-install` / `cloud-localds`. No `libvirt-
|
||||
python` dependency — the driver stays portable and the role stays lean.
|
||||
|
||||
### 2. Fidelity envelope
|
||||
|
||||
The bugs are **post-boot**, not in the provisioning path. A lightweight local hypervisor
|
||||
is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port
|
||||
DNAT, a **real reboot**, and the coordinator running inside the VM (so the VM forms its
|
||||
own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning chrome
|
||||
is not mirrored.
|
||||
|
||||
### 3. Scope: one throwaway VM at a time, instantiated from real inventory
|
||||
|
||||
The first profile is **"be askari"** — a single box running Docker host + NetBird
|
||||
coordinator + mesh peer, mirroring the host whose incident motivates this work. The
|
||||
mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies
|
||||
are a deferred extension.
|
||||
|
||||
### 4. Acceptance: self-validating against the real failure
|
||||
|
||||
The harness is accepted when it can, on a local VM:
|
||||
|
||||
1. Apply `base` (firewall on, no `docker_host` container-forward drop-in) to a Docker
|
||||
host, reboot, and observe the **2026-06-17 breakage** (Docker forwarding dead,
|
||||
services down). If step 1 passes, the harness is not faithful.
|
||||
2. Apply the `docker_host` container-forward fix, re-run, and **survive the reboot**.
|
||||
|
||||
### 5. Tiered cert fidelity via a `--certs` knob
|
||||
|
||||
DNS-01 is what makes real certs possible without public inbound (validation is
|
||||
out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which
|
||||
the isolated NAT network provides):
|
||||
|
||||
| Tier | Description | Default? |
|
||||
|---|---|---|
|
||||
| `internal` | Caddy `tls internal` — zero deps, instant. For incident repro and runs where certs are not under test. | Yes |
|
||||
| `le-staging` | Real DNS-01 ACME against Let's Encrypt **staging** — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. | Built in v1; use when testing the ACME/cert path. |
|
||||
| `le-prod-wildcard` | A real trusted `*.test.wingu.me` wildcard, **issued once, persisted on ubongo, reused** across runs. | On-demand only. Accepted risk recorded as R6 in `docs/security/accepted-risks.md`. |
|
||||
|
||||
A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4 —
|
||||
`netbird-server` FATAL-loops on GeoLite2 download when egress is lost) forces
|
||||
`internal`, since ACME requires egress.
|
||||
|
||||
### 6. The toolchain is Ansible-managed
|
||||
|
||||
A new non-service role (`integration_test`, `control` group) installs and enables
|
||||
libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on
|
||||
first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The
|
||||
repo owns ubongo's state.
|
||||
|
||||
### 7. Stubs live in an overlay file, never in the real inventory
|
||||
|
||||
Transient inventory entries for the test VM are generated at runtime as a single-host
|
||||
file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in
|
||||
`tests/integration/overrides/<host>.yml` — an explicit, reviewable overlay. The real
|
||||
inventory is never touched, so `make tf-inventory` and "don't edit inventory directly"
|
||||
stay intact.
|
||||
|
||||
## Consequences
|
||||
|
||||
- **Reconciles ADR-015:** ubongo runs ephemeral KVM test VMs as part of its
|
||||
local-test-runner role — it is still not a production hypervisor. A default VM
|
||||
(~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the
|
||||
driver enforces **one integration VM at a time** (resource guard, name-prefix
|
||||
`boma-it-*`) and refuses to start below a free-RAM threshold.
|
||||
- **Operationalises the standing rule:** "firewall/sshd/boot changes must be tested on
|
||||
a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6)
|
||||
becomes a concrete, runnable step documented in `docs/runbooks/integration-testing.md`.
|
||||
- **Accepted risk R6:** `le-prod-wildcard` runs pass the production Gandi PAT
|
||||
(`vault.gandi.pat`) to an ephemeral local VM and write transient `_acme-challenge`
|
||||
TXT records into the real `wingu.me` zone. Scope: on-demand only; `le-staging` is the
|
||||
default. Compensating controls: ephemeral VM, isolated NAT network, TXT records
|
||||
auto-removed by Caddy after validation.
|
||||
- **Three safety invariants** make the test tool itself safe:
|
||||
1. The transient inventory contains only the test VM — no real host is ever in scope.
|
||||
2. "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node
|
||||
mesh; it never enrols in the real mesh.
|
||||
3. Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls
|
||||
only, not reachable to the LAN (`10.20.x`) or the real mesh.
|
||||
- **Diagnostics on failure** (catching a bug is the point): failure keeps the VM and
|
||||
dumps `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`,
|
||||
`systemd-analyze critical-chain`. `make test-integration-clean` reaps all `boma-it-*`
|
||||
orphans. Diagnostics land in gitignored `~/integration-runs/<ts>-<host>/`.
|
||||
- **Future pinch:** concurrency with the Level-4 Chromium/Playwright stack (ADR-017)
|
||||
competes for ubongo RAM. The resource guard is the v1 answer — one integration VM at a
|
||||
time; don't run alongside a heavy Level-4 session. Revisit at `/capacity-review`.
|
||||
|
||||
## Scope
|
||||
|
||||
**In scope:** reboot-survivability, host-firewall × Docker interaction, boot-ordering,
|
||||
cert/ACME paths, mesh bootstrap on one box.
|
||||
|
||||
**Out of scope (v1):** multi-VM mini-cluster (inter-host mesh dataplane); CI gate
|
||||
(this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per
|
||||
ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker
|
||||
layer, not provisioning).
|
||||
|
||||
## What was ruled out
|
||||
|
||||
| Option | Reason |
|
||||
|---|---|
|
||||
| **Proxmox VE nested on ubongo** | Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning. |
|
||||
| **Vagrant + vagrant-libvirt** | Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic. |
|
||||
| **terraform-provider-libvirt** | Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns *production* VM existence on Proxmox" boundary. |
|
||||
|
||||
## Verified facts (ADR-014)
|
||||
|
||||
- verified: ubongo KVM capability · Bash · `/dev/kvm` present + accessible (kvm group),
|
||||
Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB
|
||||
disk free · 2026-06-18.
|
||||
|
||||
## Related
|
||||
|
||||
- ADR-006 — Terraform owns production VM existence (boundary this ADR respects).
|
||||
- ADR-008 — Testing methodology (Levels 1–4); this ADR is the concrete build of Level 2/3.
|
||||
- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs.
|
||||
- ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role.
|
||||
- ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests.
|
||||
- ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.
|
||||
Loading…
Add table
Reference in a new issue