docs(adr): ADR-025 local VM integration testing

Accepted decision to implement ADR-008 Level 2/3 on ubongo via libvirt/KVM directly: throwaway VM overlays, stdlib-only driver, tiered cert fidelity, three safety invariants. Addresses the 2026-06-17 mesh-hardening incident's reboot-survivability gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent d68734267b
commit edcc347a95
1 changed file with 157 additions and 0 deletions
docs/decisions/025-local-vm-integration-testing.md
@@ -0,0 +1,157 @@
# ADR-025 — Local VM integration testing on ubongo

## Status

Accepted (2026-06-18). Implements ADR-008 Level 2/3 (previously deferred for lack of
hosts; now viable on ubongo). The harness code is built and passes lint and pytest;
RED/GREEN acceptance (Decision §4) is pending the first live run on ubongo.

## Context

Molecule (ADR-008 Level 1) tests each role in a single Docker container: one
`converge`, no real kernel netfilter, no real Docker daemon in the loop, and **no
reboot**. That structurally cannot catch an entire class of bug — reboot-survivability,
host-firewall × Docker interaction, and boot-ordering — which is exactly the class
that caused the **2026-06-17 mesh-hardening incident**.

During that incident, `base`'s nftables `forward { policy drop; }` killed the askari
Docker host **on reboot**: nftables loaded its default-deny before Docker, breaking
published-port DNAT and inter-container forwarding. Public services and the mesh went
down. The same configuration had worked right after `make deploy`, while Docker's
runtime rules still coexisted with it. `ip_nonlocal_bind` also failed to beat the sshd
boot-race, leaving the mesh listener absent at boot. Recovery required the Hetzner
console and a WAN-SSH break-glass. Molecule had passed.
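
The failure mode above can be spotted in a dump of `nft list ruleset` taken after reboot. The following is a minimal heuristic sketch, not the harness's actual check: it flags any forward-hook chain whose policy is `drop` and which contains no `accept` rule at all. The function names, the sample ruleset text, and the `iifname "docker0" accept` rule standing in for the fix are all illustrative assumptions. The check also ignores `jump`/`goto` targets, so it would need extending for rulesets that delegate to other chains.

```python
import re


def chain_blocks(ruleset: str):
    """Yield (name, body) for every `chain NAME { ... }` block, tracking brace depth
    so that anonymous sets such as `{ 80, 443 }` inside a rule do not end the block."""
    for match in re.finditer(r"\bchain\s+(\S+)\s*\{", ruleset):
        depth, pos = 1, match.end()
        while pos < len(ruleset) and depth:
            depth += {"{": 1, "}": -1}.get(ruleset[pos], 0)
            pos += 1
        yield match.group(1), ruleset[match.end():pos - 1]


def forward_chains_that_drop_everything(ruleset: str) -> list[str]:
    """Names of forward-hook chains with `policy drop` and no `accept` rule at all."""
    suspects = []
    for name, body in chain_blocks(ruleset):
        hooks_forward = re.search(r"\bhook\s+forward\b", body)
        policy_drop = re.search(r"\bpolicy\s+drop\s*;", body)
        has_accept = re.search(r"\baccept\b", body)
        if hooks_forward and policy_drop and not has_accept:
            suspects.append(name)
    return suspects


# Illustrative ruleset text, not captured from askari.
BROKEN = """
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        tcp dport { 22, 443 } accept
    }
    chain forward {
        type filter hook forward priority filter; policy drop;
    }
}
"""

# Hypothetical accept rule standing in for the docker_host drop-in.
FIXED = BROKEN.replace(
    "hook forward priority filter; policy drop;",
    'hook forward priority filter; policy drop;\n        iifname "docker0" accept',
)

print(forward_chains_that_drop_everything(BROKEN))  # ['forward']
print(forward_chains_that_drop_everything(FIXED))   # []
```

A static check like this only complements a real connectivity probe (for example, fetching a published port after reboot); it cannot prove that forwarding works.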

ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo lifts that deferral:

> verified: ubongo KVM capability · Bash (2026-06-18 session) · `/dev/kvm` present +
> accessible (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM
> free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant **not yet installed** ·
> 2026-06-18.

## Decision

### 1. Virtualisation approach: libvirt/KVM directly (Approach A)

A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an
ephemeral qcow2 **overlay** backed by it (the golden image is never mutated), seeded
via cloud-init NoCloud, and driven by a **stdlib-only** Python driver
(`scripts/integration-vm.py`) over `virsh` / `virt-install` / `cloud-localds`. There is
no `libvirt-python` dependency, so the driver stays portable and the role stays lean.
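
As a rough illustration of what "stdlib-only over the CLI tools" can look like, the sketch below only builds argument lists that a driver could hand to `subprocess.run`. It is not the contents of `scripts/integration-vm.py`. Using `qemu-img` to create the overlay is an assumption (the ADR names only `virsh`, `virt-install` and `cloud-localds`), as are the function names, the network name, the file paths, and `--os-variant generic`. The 2 vCPU / 3 GiB / 20 GiB defaults are taken from the Consequences section.

```python
from pathlib import Path


def overlay_cmd(golden: Path, overlay: Path, size: str = "20G") -> list[str]:
    """Create a thin qcow2 overlay whose backing file is the golden image.
    Assumption: qemu-img is available; the golden image is only ever read."""
    return ["qemu-img", "create", "-f", "qcow2", "-F", "qcow2",
            "-b", str(golden), str(overlay), size]


def seed_cmd(seed_iso: Path, user_data: Path, meta_data: Path) -> list[str]:
    """Build the NoCloud seed image that cloud-init reads on first boot."""
    return ["cloud-localds", str(seed_iso), str(user_data), str(meta_data)]


def install_cmd(name: str, overlay: Path, seed_iso: Path, network: str,
                vcpus: int = 2, memory_mib: int = 3072) -> list[str]:
    """Define and start the VM from the existing overlay disk (no installer run)."""
    return [
        "virt-install",
        "--name", name,
        "--memory", str(memory_mib),
        "--vcpus", str(vcpus),
        "--disk", f"path={overlay},format=qcow2",
        "--disk", f"path={seed_iso},device=cdrom",
        "--network", f"network={network}",
        "--os-variant", "generic",  # assumption: avoids depending on osinfo-db contents
        "--import",
        "--graphics", "none",
        "--noautoconsole",
    ]


if __name__ == "__main__":
    # Hypothetical paths and names, for illustration only.
    run_dir = Path("/var/tmp/boma-it-askari")
    for cmd in (
        overlay_cmd(Path("/var/cache/golden.qcow2"), run_dir / "disk.qcow2"),
        seed_cmd(run_dir / "seed.iso", run_dir / "user-data", run_dir / "meta-data"),
        install_cmd("boma-it-askari", run_dir / "disk.qcow2", run_dir / "seed.iso", "boma-it"),
    ):
        print(" ".join(cmd))
```

Keeping command construction separate from execution makes the driver easy to unit-test without libvirt installed.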

### 2. Fidelity envelope

The bugs are **post-boot**, not in the provisioning path. A lightweight local hypervisor
is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port
DNAT, a **real reboot**, and the coordinator running inside the VM (so the VM forms its
own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning layer
is not mirrored.
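
One way to be sure a "real reboot" actually happened, rather than the harness reconnecting to the same boot, is to compare the kernel's boot ID before and after. The sketch below is an assumption about how a driver might do this, not a description of the existing one: `read_boot_id` is a hypothetical callable that returns the contents of `/proc/sys/kernel/random/boot_id` from the VM (for example over SSH), or `None` while the VM is unreachable.

```python
import time
from typing import Callable, Optional


def wait_for_new_boot(
    read_boot_id: Callable[[], Optional[str]],
    old_id: str,
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the VM reports a boot ID different from old_id.

    Returns the new boot ID, or raises TimeoutError if the VM does not come
    back with a fresh boot within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        current = read_boot_id()
        if current and current != old_id:
            return current
        sleep(poll_s)
    raise TimeoutError(f"no new boot observed within {timeout_s} s")


if __name__ == "__main__":
    # Simulated probe: same boot, unreachable twice, then a fresh boot.
    answers = iter(["boot-a", None, None, "boot-b"])
    print(wait_for_new_boot(lambda: next(answers), "boot-a", sleep=lambda _s: None))
```

Injecting `sleep` and `clock` keeps the loop testable without waiting on a real VM.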

### 3. Scope: one throwaway VM at a time, instantiated from real inventory

The first profile is **"be askari"** — a single box running Docker host + NetBird
coordinator + mesh peer, mirroring the host whose incident motivates this work. The
mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies
are a deferred extension.

### 4. Acceptance: self-validating against the real failure

The harness is accepted when it can, on a local VM:

1. Apply `base` (firewall on, no `docker_host` container-forward drop-in) to a Docker
   host, reboot, and observe the **2026-06-17 breakage** (Docker forwarding dead,
   services down). If the host comes back healthy in this step, the harness is not
   faithful.
2. Apply the `docker_host` container-forward fix, re-run, and **survive the reboot**.
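
The two steps above amount to a RED/GREEN decision: the run without the fix must fail, and the run with the fix must pass. A minimal sketch of that decision logic follows; the `RunResult` type, its field names, and the messages are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    fix_applied: bool               # was the docker_host container-forward fix applied?
    services_up_after_reboot: bool  # did forwarding and services survive the reboot?


def accept(red: RunResult, green: RunResult) -> tuple[bool, str]:
    """Decide harness acceptance from a RED run (no fix) and a GREEN run (fix)."""
    if red.fix_applied or not green.fix_applied:
        raise ValueError("RED must run without the fix and GREEN with it")
    if red.services_up_after_reboot:
        return False, "RED run survived the reboot: the harness does not reproduce the incident"
    if not green.services_up_after_reboot:
        return False, "GREEN run broke after reboot: the fix is ineffective in the harness"
    return True, "incident reproduced without the fix and survived with it"


if __name__ == "__main__":
    print(accept(RunResult(False, False), RunResult(True, True)))
```

The first failing branch is the important one: a RED run that passes means the VM is not faithful enough, regardless of what the GREEN run does.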

### 5. Tiered cert fidelity via a `--certs` knob

DNS-01 is what makes real certs possible without public inbound access (validation is
out-of-band via a Gandi TXT record; the VM needs only outbound access to ACME and
Gandi, which the isolated NAT network provides):

| Tier | Description | Default? |
|---|---|---|
| `internal` | Caddy `tls internal` — zero deps, instant. For incident repro and runs where certs are not under test. | Yes |
| `le-staging` | Real DNS-01 ACME against Let's Encrypt **staging** — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. | No. Built in v1; use when testing the ACME/cert path. |
| `le-prod-wildcard` | A real trusted `*.test.wingu.me` wildcard, **issued once, persisted on ubongo, reused** across runs. | No. On-demand only. Accepted risk recorded as R6 in `docs/security/accepted-risks.md`. |

A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4 —
`netbird-server` FATAL-loops on GeoLite2 download when egress is lost) forces
`internal`, since ACME requires egress.
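
The tier rules can be expressed in a few lines of `argparse`. This is a sketch under assumptions: the `--certs` option and its three values come from the table above, but the `--no-egress` flag name and the `effective_cert_tier` helper are hypothetical stand-ins for however the driver selects the no-egress scenario.

```python
import argparse

CERT_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="integration-vm.py")
    parser.add_argument("--certs", choices=CERT_TIERS, default="internal",
                        help="certificate fidelity tier (default: internal)")
    # Hypothetical flag: the ADR describes a no-egress scenario, not its CLI spelling.
    parser.add_argument("--no-egress", action="store_true",
                        help="run the deliberate no-egress failure scenario")
    return parser


def effective_cert_tier(requested: str, no_egress: bool) -> str:
    """ACME needs outbound access, so the no-egress scenario always uses `internal`."""
    return "internal" if no_egress else requested


if __name__ == "__main__":
    args = build_parser().parse_args(["--certs", "le-staging", "--no-egress"])
    print(effective_cert_tier(args.certs, args.no_egress))  # internal
```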

### 6. The toolchain is Ansible-managed

A new non-service role (`integration_test`, `control` group) installs and enables
libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on
first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The
repo owns ubongo's state.

### 7. Stubs live in an overlay file, never in the real inventory

Transient inventory entries for the test VM are generated at runtime as a single-host
file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in
`tests/integration/overrides/<host>.yml` — an explicit, reviewable overlay. The real
inventory is never touched, so `make tf-inventory` and "don't edit inventory directly"
stay intact.
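
A sketch of how a transient single-host inventory and the overlay file might be combined is shown below. It is illustrative only: the INI inventory format and `ansible-playbook -i ... -e @file` are standard Ansible, but passing the overlay as extra vars, the group names, the `debian` login user, the VM address, and the playbook name are assumptions rather than facts about this repo.

```python
from pathlib import Path


def render_inventory(host: str, address: str, user: str = "debian",
                     groups: tuple[str, ...] = ("integration",)) -> str:
    """Render a one-host INI inventory; the same host is listed under every group."""
    line = f"{host} ansible_host={address} ansible_user={user}"
    return "".join(f"[{group}]\n{line}\n\n" for group in groups)


def inventory_hosts(text: str) -> set[str]:
    """Collect host names from INI inventory text, skipping headers and comments."""
    hosts = set()
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith(("[", "#", ";")):
            hosts.add(line.split()[0])
    return hosts


def playbook_cmd(inventory: Path, host: str, playbook: Path,
                 overrides_dir: Path = Path("tests/integration/overrides")) -> list[str]:
    """Build an ansible-playbook call that layers the per-host overlay on top."""
    overrides = overrides_dir / f"{host}.yml"
    return ["ansible-playbook", "-i", str(inventory), "-e", f"@{overrides}", str(playbook)]


if __name__ == "__main__":
    text = render_inventory("askari", "192.168.150.10", groups=("docker_hosts", "mesh"))
    # Guard: the transient inventory must contain the test VM and nothing else.
    assert inventory_hosts(text) == {"askari"}
    print(text)
    print(" ".join(playbook_cmd(Path("/tmp/boma-it.ini"), "askari", Path("site.yml"))))
```

Checking that the rendered inventory contains exactly one host before invoking Ansible is a cheap guard that no real host can come into scope.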

## Consequences

- **Reconciles ADR-015:** ubongo runs ephemeral KVM test VMs as part of its
  local-test-runner role — it is still not a production hypervisor. A default VM
  (~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the
  driver enforces **one integration VM at a time** (resource guard, name-prefix
  `boma-it-*`) and refuses to start below a free-RAM threshold.
- **Operationalises the standing rule:** "firewall/sshd/boot changes must be tested on
  a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6)
  becomes a concrete, runnable step documented in `docs/runbooks/integration-testing.md`.
- **Accepted risk R6:** `le-prod-wildcard` runs pass the production Gandi PAT
  (`vault.gandi.pat`) to an ephemeral local VM and write transient `_acme-challenge`
  TXT records into the real `wingu.me` zone. Scope: on-demand only; the default tier is
  `internal`, and `le-staging` is the tier to use when testing the ACME/cert path.
  Compensating controls: ephemeral VM, isolated NAT network, TXT records
  auto-removed by Caddy after validation.
- **Three safety invariants** make the test tool itself safe:
  1. The transient inventory contains only the test VM — no real host is ever in scope.
  2. "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node
     mesh; it never enrols in the real mesh.
  3. Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls
     only, with no access to the LAN (`10.20.x`) or the real mesh.
- **Diagnostics on failure** (catching a bug is the point): failure keeps the VM and
  dumps `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`,
  `systemd-analyze critical-chain`. `make test-integration-clean` reaps all `boma-it-*`
  orphans. Diagnostics land in gitignored `~/integration-runs/<ts>-<host>/`.
- **Future pinch:** concurrency with the Level-4 Chromium/Playwright stack (ADR-017)
  competes for ubongo RAM. The resource guard is the v1 answer: one integration VM at a
  time; don't run alongside a heavy Level-4 session. Revisit at `/capacity-review`.
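
The resource guard described in the list above could be implemented with nothing more than `/proc/meminfo` and the output of `virsh list --all --name`. The sketch below is an assumption about its shape, not the driver's code: the function names and the 4096 MiB threshold are hypothetical, and using `MemAvailable` as the measure of "free RAM" is a design choice of this example.

```python
import re
from typing import Optional

VM_PREFIX = "boma-it-"


def mem_available_mib(meminfo: str) -> int:
    """Extract MemAvailable (reported in kB) from /proc/meminfo text, in MiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    return int(match.group(1)) // 1024


def existing_test_vms(virsh_names: str) -> list[str]:
    """Filter `virsh list --all --name` output down to integration-test domains."""
    names = (line.strip() for line in virsh_names.splitlines())
    return [name for name in names if name.startswith(VM_PREFIX)]


def refusal_reason(meminfo: str, virsh_names: str,
                   min_free_mib: int = 4096) -> Optional[str]:
    """Return why a new VM must not start, or None if it may."""
    existing = existing_test_vms(virsh_names)
    if existing:
        return f"integration VM already present: {', '.join(existing)}"
    free = mem_available_mib(meminfo)
    if free < min_free_mib:
        return f"only {free} MiB available, need at least {min_free_mib} MiB"
    return None


if __name__ == "__main__":
    sample = "MemTotal:       16384000 kB\nMemAvailable:   13631488 kB\n"
    print(refusal_reason(sample, "\n"))                 # None
    print(refusal_reason(sample, "boma-it-askari\n"))
```

In a real driver the two text inputs would come from reading `/proc/meminfo` and running `virsh` via `subprocess`; taking them as strings keeps the guard testable on any machine.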

## Scope

**In scope:** reboot-survivability, host-firewall × Docker interaction, boot-ordering,
cert/ACME paths, mesh bootstrap on one box.

**Out of scope (v1):** multi-VM mini-cluster (inter-host mesh dataplane); CI gate
(this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per
ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker
layer, not provisioning).

## What was ruled out

| Option | Reason |
|---|---|
| **Proxmox VE nested on ubongo** | Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning. |
| **Vagrant + vagrant-libvirt** | Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic. |
| **terraform-provider-libvirt** | Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns *production* VM existence on Proxmox" boundary. |

## Verified facts (ADR-014)

- verified: ubongo KVM capability · Bash · `/dev/kvm` present + accessible (kvm group),
  Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB
  disk free · 2026-06-18.

## Related

- ADR-006 — Terraform owns production VM existence (boundary this ADR respects).
- ADR-008 — Testing methodology (Levels 1–4); this ADR is the concrete build of Level 2/3.
- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs.
- ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role.
- ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests.
- ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.