From 4732730515bba29dd98600af0ae8e1e20207b8cb Mon Sep 17 00:00:00 2001
From: sjat
Date: Thu, 18 Jun 2026 12:51:22 +0200
Subject: [PATCH] docs: wire ADR-025 into testing/control-host/risks/status/capacity

- ADR-008: add reboot-survivability gap row + ADR-025 pointer to the "not tested in Molecule" table
- ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs (ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing
- accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records)
- CLAUDE.md: add make test-integration[/-clean] to key-commands; add ADR-025 + runbook rows to further-reading
- hardware/reference.md: note one ephemeral KVM test VM on ubongo
- STATUS.md: add integration harness entry (built, lint+pytest clean; RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 CLAUDE.md                          |  4 ++++
 STATUS.md                          | 12 ++++++++++++
 docs/decisions/008-testing.md      |  6 ++++++
 docs/decisions/015-control-host.md |  8 ++++++--
 docs/hardware/reference.md         |  2 +-
 docs/security/accepted-risks.md    |  1 +
 6 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 2e662c9..01334a7 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -43,6 +43,8 @@ Full design rationale: `docs/decisions/`
 | Terraform plan | `make tf-plan [TF_ENV=staging]` |
 | Terraform apply | `make tf-apply [TF_ENV=staging]` |
 | Regenerate Ansible inventory | `make tf-inventory TF_ENV=` |
+| Integration-test a host on a local VM | `make test-integration HOST= [CERTS=…]` |
+| Clean up integration test VMs | `make test-integration-clean` |

 **Always `tf-plan` before `tf-apply`. Always `check` before `deploy`.
 Never skip lint.**
@@ -256,6 +258,8 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Backup & disaster recovery | `docs/decisions/022-backup.md` |
 | ADR structure & lifecycle | `docs/decisions/023-adr-structure.md` |
 | Reverse proxy (Caddy) | `docs/decisions/024-reverse-proxy.md` |
+| Local VM integration testing (ADR-025) | `docs/decisions/025-local-vm-integration-testing.md` |
+| Integration testing runbook | `docs/runbooks/integration-testing.md` |
 | Adding a new role | `docs/runbooks/new-role.md` |
 | Adding a new host | `docs/runbooks/new-host.md` |
 | Enrolling a NetBird client (laptop/phone) | `docs/runbooks/netbird-client.md` |
diff --git a/STATUS.md b/STATUS.md
index 9d51c8d..ab72425 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -81,6 +81,18 @@ askari.)
 | Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
 | Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |

+## Integration test harness (branch feat/integration-testing)
+
+| Thing | State |
+|---|---|
+| `roles/integration_test/` | **Built** — installs/enables libvirt+QEMU+virtinst on `control` group hosts; adds `sjat`/`claude` to `libvirt` group; creates image-cache dir; drops the driver. Molecule + pytest clean. |
+| `scripts/integration-vm.py` | **Built** — stdlib-only lifecycle driver over `virsh`/`virt-install`/`cloud-localds`: `up / apply / reboot / assert / cycle / reset / down / prune / console`. Lazily ensures the golden Debian-13 genericcloud image. pytest clean (transient-inventory generation, var/overlay merge, `--certs` mapping, DHCP-lease parsing, resource-guard math). |
+| `tests/integration/` (profile, verify, overrides) | **Built** — "be askari" profile + var overlay + `verify.yml` outcome assertions (Docker up, published-port DNAT, nft sane, `wt0` up). pytest clean. |
+| `make test-integration` / `make test-integration-clean` | **Built** — wired into `Makefile`. |
+| ADR-025 | **Accepted (2026-06-18)** — decision recorded, approach A, cert tiers, safety invariants documented. |
+| **RED/GREEN acceptance (ubongo live pass)** | **PENDING** — the harness has not yet been run on a real VM. RED (reproduce 2026-06-17 breakage after reboot) and GREEN (survive reboot with `docker_host` container-forward fix) are the acceptance gate. `docs/TODO.md` item 2.4 remains open until this passes. |
+| `le-staging` cert validation | **PENDING** — wired in v1 but not yet exercised on a real VM. |
+
 ## Keeping this honest

 Update this file whenever you build, stub, or remove something. It is the first
diff --git a/docs/decisions/008-testing.md b/docs/decisions/008-testing.md
index c2c5d22..647b3ba 100644
--- a/docs/decisions/008-testing.md
+++ b/docs/decisions/008-testing.md
@@ -154,6 +154,7 @@ Level 2 (staging) or Level 3 (external). This is a conscious, documented decisio
 | Capability | Reason not testable in Molecule |
 |---|---|
 | `nftables` rule loading | Requires `nf_tables` kernel module; not available in Docker |
+| **Reboot-survivability / host-firewall × Docker interaction / boot-ordering** | **Requires a real kernel reboot — the class that caused the 2026-06-17 mesh-hardening incident. Now covered by local VM integration testing (ADR-025).** |
 | NetBird mesh data plane (`wt0` WireGuard interface) | Requires the `wireguard` kernel module; Molecule checks only that the agent is installed/configured (ADR-016) |
 | `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment |
 | DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container |
@@ -165,6 +166,11 @@ For the above, Molecule tests only what it can: that the relevant packages are
 installed, that configuration files render correctly, and that services are enabled.
 Behavioural correctness is confirmed on staging.

+**ADR-025 is the concrete build of Level 2/3** — local VM integration testing on
+ubongo (libvirt/KVM, throwaway overlay VMs, stdlib-only driver). It specifically
+targets the reboot-survivability / host-firewall × Docker / boot-ordering class that
+Molecule structurally cannot reach. See `docs/decisions/025-local-vm-integration-testing.md`.
+
 ---

 ### CI pipeline
diff --git a/docs/decisions/015-control-host.md b/docs/decisions/015-control-host.md
index 13c1b5f..a5eff28 100644
--- a/docs/decisions/015-control-host.md
+++ b/docs/decisions/015-control-host.md
@@ -43,8 +43,12 @@ points at this physical box. This *strengthens* the ADR-009 control-node excepti
 it is genuinely outside Terraform's world, not a VM pretending to be the exception.
 Every other host stays a Terraform-managed VM exactly as designed.

-`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor
-and runs no `docker_host` services.
+`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a production
+hypervisor and runs no `docker_host` services. It does run **ephemeral KVM test VMs**
+as part of its local-test-runner role (ADR-025 — local VM integration testing): one
+throwaway VM at a time (~3 GiB RAM), against ~13 GiB free of the 16 GiB sized here.
+This is not a production workload — it is the concrete implementation of ADR-008 Level
+2/3, and the resource guard enforces one-at-a-time to stay within the RAM ceiling.
 ### Hardware target

diff --git a/docs/hardware/reference.md b/docs/hardware/reference.md
index 7252050..92c0255 100644
--- a/docs/hardware/reference.md
+++ b/docs/hardware/reference.md
@@ -25,7 +25,7 @@
 - **Storage:** 256 GB SanDisk X600 SATA 2.5" SSD (model SD9TB8W256G1001; TCG Opal-capable, Opal unused — no disk encryption)
 - **NICs:** wired GbE, interface eno1, MAC 88:a4:c2:e0:ee:da
 - **BIOS:** Lenovo M2WKT5AA (2023-06-20)
-- **Notes:** always-on; control plane + AI-worker (dedicated `claude` user) + local test runner (Molecule/Docker) per ADR-015; not a Proxmox guest; remote access currently LAN SSH only (mesh deferred)
+- **Notes:** always-on; control plane + AI-worker (dedicated `claude` user) + local test runner (Molecule/Docker) per ADR-015; not a Proxmox guest; remote access currently LAN SSH only (mesh deferred). Also runs **one ephemeral KVM integration test VM** (~3 GiB RAM) at a time per ADR-025 — the resource guard enforces one-at-a-time; do not run a test-integration cycle alongside a heavy Level-4 browser session (Chromium/Playwright).

 ### fisi (backup node — outside the cluster; provisional)
 - **Model / form factor:** HP Elite 600 G9 (tower)
diff --git a/docs/security/accepted-risks.md b/docs/security/accepted-risks.md
index 82e3e4e..069c2c4 100644
--- a/docs/security/accepted-risks.md
+++ b/docs/security/accepted-risks.md
@@ -18,6 +18,7 @@ revisit (trigger).
 | R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and STUN (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh (NetBird v0.72.4 embeds STUN in the combined server — no separate Coturn) | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
 | R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
 | R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |
+| R6 | **`le-prod-wildcard` integration runs** — when `CERTS=le-prod-wildcard` is passed to `make test-integration`, the production Gandi PAT (`vault.gandi.pat`) is passed to an ephemeral local test VM via the var overlay, and transient `_acme-challenge` TXT records are written into the real `wingu.me` DNS zone to satisfy the Let's Encrypt DNS-01 challenge. A compromised or long-lived test VM could exfiltrate the PAT; the real zone is briefly (seconds) modified | Scope is **on-demand only** — `le-staging` is the default cert tier (`CERTS=internal` for incident repro); `le-prod-wildcard` is an explicit opt-in. Compensating controls: the VM is ephemeral and destroyed on success; it sits on an isolated libvirt NAT network (no LAN/mesh access); TXT records are auto-removed by Caddy immediately after validation; the PAT is not persisted inside the VM after the run. ADR-025 documents the cert-tier design and the three isolation invariants | The PAT is exfiltrated from a test VM; the `wingu.me` zone shows unexpected records; a `CERTS=le-prod-wildcard` run must be audited or the tier must be revoked |

 _Last reviewed: 2026-06-11. The prior gaps (full CIS hardening, SELinux/AppArmor, IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS