docs: wire ADR-025 into testing/control-host/risks/status/capacity

- ADR-008: add reboot-survivability gap row + ADR-025 pointer to the "not tested in Molecule" table
- ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs (ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing
- accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records)
- CLAUDE.md: add make test-integration[/-clean] to key-commands; add ADR-025 + runbook rows to further-reading
- hardware/reference.md: note one ephemeral KVM test VM on ubongo
- STATUS.md: add integration harness entry (built, lint+pytest clean; RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent edcc347a95
commit 4732730515
6 changed files with 30 additions and 3 deletions

@@ -43,6 +43,8 @@ Full design rationale: `docs/decisions/`
 | Terraform plan | `make tf-plan [TF_ENV=staging]` |
 | Terraform apply | `make tf-apply [TF_ENV=staging]` |
 | Regenerate Ansible inventory | `make tf-inventory TF_ENV=<staging\|production>` |
+| Integration-test a host on a local VM | `make test-integration HOST=<name> [CERTS=…]` |
+| Clean up integration test VMs | `make test-integration-clean` |
 
 **Always `tf-plan` before `tf-apply`. Always `check` before `deploy`. Never skip lint.**
 
@@ -256,6 +258,8 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Backup & disaster recovery | `docs/decisions/022-backup.md` |
 | ADR structure & lifecycle | `docs/decisions/023-adr-structure.md` |
 | Reverse proxy (Caddy) | `docs/decisions/024-reverse-proxy.md` |
+| Local VM integration testing (ADR-025) | `docs/decisions/025-local-vm-integration-testing.md` |
+| Integration testing runbook | `docs/runbooks/integration-testing.md` |
 | Adding a new role | `docs/runbooks/new-role.md` |
 | Adding a new host | `docs/runbooks/new-host.md` |
 | Enrolling a NetBird client (laptop/phone) | `docs/runbooks/netbird-client.md` |

12 STATUS.md
@@ -81,6 +81,18 @@ askari.)
 | Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
 | Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
+
+## Integration test harness (branch feat/integration-testing)
+
+| Thing | State |
+|---|---|
+| `roles/integration_test/` | **Built** — installs/enables libvirt+QEMU+virtinst on `control` group hosts; adds `sjat`/`claude` to `libvirt` group; creates image-cache dir; drops the driver. Molecule + pytest clean. |
+| `scripts/integration-vm.py` | **Built** — stdlib-only lifecycle driver over `virsh`/`virt-install`/`cloud-localds`: `up / apply / reboot / assert / cycle / reset / down / prune / console`. Lazily ensures the golden Debian-13 genericcloud image. pytest clean (transient-inventory generation, var/overlay merge, `--certs` mapping, DHCP-lease parsing, resource-guard math). |
+| `tests/integration/` (profile, verify, overrides) | **Built** — "be askari" profile + var overlay + `verify.yml` outcome assertions (Docker up, published-port DNAT, nft sane, `wt0` up). pytest clean. |
+| `make test-integration` / `make test-integration-clean` | **Built** — wired into `Makefile`. |
+| ADR-025 | **Accepted (2026-06-18)** — decision recorded, approach A, cert tiers, safety invariants documented. |
+| **RED/GREEN acceptance (ubongo live pass)** | **PENDING** — the harness has not yet been run on a real VM. RED (reproduce 2026-06-17 breakage after reboot) and GREEN (survive reboot with `docker_host` container-forward fix) are the acceptance gate. `docs/TODO.md` item 2.4 remains open until this passes. |
+| `le-staging` cert validation | **PENDING** — wired in v1 but not yet exercised on a real VM. |
 
 ## Keeping this honest
 
 Update this file whenever you build, stub, or remove something. It is the first

@@ -154,6 +154,7 @@ Level 2 (staging) or Level 3 (external). This is a conscious, documented decisio
 | Capability | Reason not testable in Molecule |
 |---|---|
 | `nftables` rule loading | Requires `nf_tables` kernel module; not available in Docker |
+| **Reboot-survivability / host-firewall × Docker interaction / boot-ordering** | **Requires a real kernel reboot — the class that caused the 2026-06-17 mesh-hardening incident. Now covered by local VM integration testing (ADR-025).** |
 | NetBird mesh data plane (`wt0` WireGuard interface) | Requires the `wireguard` kernel module; Molecule checks only that the agent is installed/configured (ADR-016) |
 | `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment |
 | DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container |
@@ -165,6 +166,11 @@ For the above, Molecule tests only what it can: that the relevant packages are
 installed, that configuration files render correctly, and that services are enabled.
 Behavioural correctness is confirmed on staging.
 
+**ADR-025 is the concrete build of Level 2/3** — local VM integration testing on
+ubongo (libvirt/KVM, throwaway overlay VMs, stdlib-only driver). It specifically
+targets the reboot-survivability / host-firewall × Docker / boot-ordering class that
+Molecule structurally cannot reach. See `docs/decisions/025-local-vm-integration-testing.md`.
+
 ---
 
 ### CI pipeline

@@ -43,8 +43,12 @@ points at this physical box. This *strengthens* the ADR-009 control-node excepti
 it is genuinely outside Terraform's world, not a VM pretending to be the exception.
 Every other host stays a Terraform-managed VM exactly as designed.
 
-`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor
-and runs no `docker_host` services.
+`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a production
+hypervisor and runs no `docker_host` services. It does run **ephemeral KVM test VMs**
+as part of its local-test-runner role (ADR-025 — local VM integration testing): one
+throwaway VM at a time (~3 GiB RAM), against ~13 GiB free of the 16 GiB sized here.
+This is not a production workload — it is the concrete implementation of ADR-008 Level
+2/3, and the resource guard enforces one-at-a-time to stay within the RAM ceiling.
 
 ### Hardware target
 

@@ -25,7 +25,7 @@
 - **Storage:** 256 GB SanDisk X600 SATA 2.5" SSD (model SD9TB8W256G1001; TCG Opal-capable, Opal unused — no disk encryption)
 - **NICs:** wired GbE, interface eno1, MAC 88:a4:c2:e0:ee:da
 - **BIOS:** Lenovo M2WKT5AA (2023-06-20)
-- **Notes:** always-on; control plane + AI-worker (dedicated `claude` user) + local test runner (Molecule/Docker) per ADR-015; not a Proxmox guest; remote access currently LAN SSH only (mesh deferred)
+- **Notes:** always-on; control plane + AI-worker (dedicated `claude` user) + local test runner (Molecule/Docker) per ADR-015; not a Proxmox guest; remote access currently LAN SSH only (mesh deferred). Also runs **one ephemeral KVM integration test VM** (~3 GiB RAM) at a time per ADR-025 — the resource guard enforces one-at-a-time; do not run a test-integration cycle alongside a heavy Level-4 browser session (Chromium/Playwright).
 
 ### fisi (backup node — outside the cluster; provisional)
 - **Model / form factor:** HP Elite 600 G9 (tower)

@@ -18,6 +18,7 @@ revisit (trigger).
 | R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and STUN (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh (NetBird v0.72.4 embeds STUN in the combined server — no separate Coturn) | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
 | R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
 | R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |
+| R6 | **`le-prod-wildcard` integration runs** — when `CERTS=le-prod-wildcard` is passed to `make test-integration`, the production Gandi PAT (`vault.gandi.pat`) is passed to an ephemeral local test VM via the var overlay, and transient `_acme-challenge` TXT records are written into the real `wingu.me` DNS zone to satisfy the Let's Encrypt DNS-01 challenge. A compromised or long-lived test VM could exfiltrate the PAT; the real zone is briefly (seconds) modified | Scope is **on-demand only** — `le-staging` is the default cert tier (`CERTS=internal` for incident repro); `le-prod-wildcard` is an explicit opt-in. Compensating controls: the VM is ephemeral and destroyed on success; it sits on an isolated libvirt NAT network (no LAN/mesh access); TXT records are auto-removed by Caddy immediately after validation; the PAT is not persisted inside the VM after the run. ADR-025 documents the cert-tier design and the three isolation invariants | The PAT is exfiltrated from a test VM; the `wingu.me` zone shows unexpected records; a `CERTS=le-prod-wildcard` run must be audited or the tier must be revoked |
 
 _Last reviewed: 2026-06-11. The prior gaps (full CIS hardening, SELinux/AppArmor,
 IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS