diff --git a/STATUS.md b/STATUS.md
index ab72425..6e35ecc 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -5,7 +5,7 @@ This repo is partly aspirational: the ADRs in `docs/decisions/` describe the
 truth. **Before relying on a role, provider, or pipeline existing, check here.** If something is listed as "designed, not built", do not assume it works.

-_Last reviewed: 2026-06-14._
+_Last reviewed: 2026-06-18._

 ## Real and working today

@@ -30,7 +30,7 @@ _Last reviewed: 2026-06-14._
 | `roles/dev_env/` — interactive developer environment | **Built + applied.** zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. **Applied to `ubongo`** for users `sjat` + `claude` (verified: zsh login shells, stow-symlinked `.zshrc`/`.tmux.conf` + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via `playbooks/workstation.yml` against the `control` group (no dedicated `workstations` group yet). |
 | `make check` / `make deploy PLAYBOOK=` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). |
 | `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`).
**Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
-| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **NetBird mesh-enrolled (M5, 2026-06-17):** `wt0` up at `100.99.146.14` via the `base` `mesh` concern; agent management now works because `claude`'s SSH key was added to `sjat`'s `authorized_keys` and `sjat` was granted `NOPASSWD` sudo (`/etc/sudoers.d/sjat-ansible`) — the interim until the proper `ansible`-user bootstrap. **Pending:** full `base` hardening (only `firewall` exists, NOT applied here — default-deny is the deferred mesh-hardening step now that `wt0` exists); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (now relevant — the offsite tfstate exists). |
+| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk).
Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker + libvirt groups, **`NOPASSWD:ALL` sudo** — ADR-015 amended 2026-06-18; operator `sjat` uses password-required sudo via `sudo` group; the former `sjat-ansible` NOPASSWD drop-in removed 2026-06-18). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **NetBird mesh-enrolled (M5, 2026-06-17):** `wt0` up at `100.99.146.14` via the `base` `mesh` concern. **Pending:** full `base` hardening (only `firewall` exists, NOT applied here — default-deny is the deferred mesh-hardening step now that `wt0` exists); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (now relevant — the offsite tfstate exists). |
 | `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`.
**SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **NetBird coordinator (M4b):** `netbird_coordinator` deployed — dashboard live at `https://netbird.askari.wingu.me` (valid LE cert), management API behind embedded Dex (401 unauth), STUN on 3478/udp. **NetBird peer (M5, 2026-06-17):** also enrolled as a mesh agent (`base` `mesh` concern) — `wt0` at `100.99.226.39`, Management+Signal Connected; the agent coexists with the coordinator. **Pending:** host firewall + moving askari's SSH onto `wt0` (deferred mesh-hardening; the Hetzner Cloud Firewall is its perimeter until then), offsite tfstate backup (ADR-022). |
 | `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). The `reverse_proxy` Caddyfile is bind-mounted as a **directory** (`./caddy` → `/etc/caddy`) so atomic re-renders are visible in-container and `caddy reload` actually applies new routes (a single-file mount pinned the stale inode). |
 | `roles/netbird_coordinator/` — NetBird control plane (ADR-016, M4b) | **Built + applied (askari, 2026-06-16).
boma's FIRST real service role.** Self-hosted NetBird **v0.72.4**: a single combined `netbird-server` container (management + signal + relay + STUN + **embedded Dex IdP** at `/oauth2`) + `dashboard:v2.39.0`, on the shared `boma` network behind the M4a Caddy via gRPC-h2c + WebSocket + path routing (`reverse_proxy__routes` gained a raw-`caddy` route type). Secrets `vault.netbird.{auth_secret,datastore_key}` (self-generated). Carries the full service-role file set (SECURITY/VERIFY/ACCESS/BACKUP) — **first stateful role** (`backup__state: true`; encrypted SQLite at `/var/lib/netbird`, off-site backup pending `fisi`/ADR-022). **Verified live:** dashboard 200 + valid LE cert, `/api` 401 (auth-gated, routes OK), STUN up. **Not yet configured:** first-boot `/setup` admin + peer enrolment = M5. |
@@ -81,7 +81,7 @@ askari.)
 | Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
 | Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |

-## Integration test harness (branch feat/integration-testing)
+## Integration test harness (ADR-025)

 | Thing | State |
 |---|---|
@@ -89,9 +89,9 @@ askari.)
 | `scripts/integration-vm.py` | **Built** — stdlib-only lifecycle driver over `virsh`/`virt-install`/`cloud-localds`: `up / apply / reboot / assert / cycle / reset / down / prune / console`. Lazily ensures the golden Debian-13 genericcloud image. pytest clean (transient-inventory generation, var/overlay merge, `--certs` mapping, DHCP-lease parsing, resource-guard math). |
 | `tests/integration/` (profile, verify, overrides) | **Built** — "be askari" profile + var overlay + `verify.yml` outcome assertions (Docker up, published-port DNAT, nft sane, `wt0` up). pytest clean. |
 | `make test-integration` / `make test-integration-clean` | **Built** — wired into `Makefile`.
|
-| ADR-025 | **Accepted (2026-06-18)** — decision recorded, approach A, cert tiers, safety invariants documented. |
-| **RED/GREEN acceptance (ubongo live pass)** | **PENDING** — the harness has not yet been run on a real VM. RED (reproduce 2026-06-17 breakage after reboot) and GREEN (survive reboot with `docker_host` container-forward fix) are the acceptance gate. `docs/TODO.md` item 2.4 remains open until this passes. |
-| `le-staging` cert validation | **PENDING** — wired in v1 but not yet exercised on a real VM. |
+| ADR-025 | **Accepted (2026-06-18)** — decision recorded, approach A, cert tiers, safety invariants, UEFI boot requirement, and claude-sudo dependency documented. |
+| **RED/GREEN acceptance (ubongo live pass)** | **PASSED (2026-06-18).** A throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base nftables forward default-deny kills Docker forwarding on reboot) = RED. Applying the `docker_host` container-forward drop-in and rebooting survived = GREEN. Nine shakedown findings captured in `docs/FRICTION.md`; key learnings (UEFI boot, claude sudo) recorded in ADR-025. `docs/TODO.md` item 2.4 closed. |
+| `le-staging` cert validation | **Pending** — wired in v1 but not yet exercised on a real VM (separate from the RED/GREEN acceptance gate). |

 ## Keeping this honest

diff --git a/docs/TODO.md b/docs/TODO.md
index 4f0456c..0bcfaec 100644
--- a/docs/TODO.md
+++ b/docs/TODO.md
@@ -17,19 +17,7 @@ calls, curl pulls of web products, log reviews.
 Headless browsing → ADR-017 (`/verify-service`); the API/curl/log-review siblings remain open.
 3. ~~Standard for test users + manual-test instructions.~~ → ADR-017.
- 4. 
**Local VM integration testing on ubongo (pre-deploy).** Molecule (containers,
- one converge, no reboot, no real Docker/firewall interaction) structurally
- **cannot** catch reboot-survivability, host-firewall × Docker, or boot-order bugs —
- exactly the class that caused the 2026-06-17 mesh-hardening incident (`base`'s
- nftables `forward policy drop` broke the askari Docker host on reboot;
- `ip_nonlocal_bind` didn't beat the sshd boot-race). Build a way for the agent to
- spin up throwaway VMs **locally on ubongo** (libvirt/QEMU? Proxmox-on-ubongo?) that
- mirror a target host (real Docker, a real reboot, the real role apply) and validate
- risky infra changes there **before** deploying to a live host. This is the concrete
- build of ADR-008's Level 2/3 (staging/integration) testing — deferred for lack of
- hosts, but ubongo can host it. Decide the virtualisation approach + how the agent
- drives it (provision → snapshot/reset → run the playbook → reboot → assert). Ties to
- 3.10 (testing approach as it matures) and the 2026-06-17 FRICTION signals.
+ 4. ~~Local VM integration testing on ubongo.~~ → ADR-025 / `make test-integration` (built + RED→GREEN validated 2026-06-18).
 3. **Building services**
 1. ~~Decide how to manage logs.~~ → ADR-018.
diff --git a/docs/decisions/025-local-vm-integration-testing.md b/docs/decisions/025-local-vm-integration-testing.md
index dc3fe21..ae42f99 100644
--- a/docs/decisions/025-local-vm-integration-testing.md
+++ b/docs/decisions/025-local-vm-integration-testing.md
@@ -3,8 +3,11 @@
 ## Status

 Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now
-viable on ubongo). The harness code is built and lint+pytest-clean; RED/GREEN
-acceptance is pending the first live run on ubongo.
+viable on ubongo). 
**RED→GREEN acceptance PASSED on real hardware (2026-06-18):** a
+throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base's nftables forward
+default-deny kills Docker forwarding on reboot) — RED — and survived the reboot once
+the `docker_host` container-forward drop-in was applied — GREEN. Two shakedown
+learnings added below.

 ## Context

@@ -147,11 +150,31 @@ layer, not provisioning).
 Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB disk free · 2026-06-18.

+## Shakedown learnings (2026-06-18 live run)
+
+Two findings from the RED→GREEN acceptance run that affect anyone operating the harness:
+
+1. **Boot firmware: UEFI required.** The Debian 13 genericcloud image triple-faults
+ under legacy BIOS/SeaBIOS and does not reach the kernel. Boot the VM with UEFI
+ (`virt-install --boot uefi`; `ovmf` package). The driver does this by default; note
+ it here so the requirement is findable.
+
+2. **`claude` sudo is load-bearing.** VM management (`virsh`, `virt-install`,
+ `cloud-localds`) and offline diagnostics (`nft list ruleset`, `journalctl -b`,
+ `systemd-analyze critical-chain`) all require root. The harness assumes the
+ AI-worker has `NOPASSWD:ALL` sudo on `ubongo` — settled as the ADR-015 amendment
+ (2026-06-18) and registered as R7 in `docs/security/accepted-risks.md`. A `claude`
+ account without sudo will block the harness at the first `virsh` call.
+
+The nine full shakedown findings (including the UEFI boot-loop) are in
+`docs/FRICTION.md`.
+
 ## Related

 - ADR-006 — Terraform owns production VM existence (boundary this ADR respects).
 - ADR-008 — Testing methodology (Levels 1–4); this ADR is the concrete build of Level 2/3.
-- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs.
+- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs; amended 2026-06-18 for claude sudo.
 - ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role.
 - ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests.
+- ADR-021 — Operational access; sudo model for `claude` and `sjat` on `ubongo`.
 - ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.