docs(friction): VM-testing standard + libvirt stale-session gotcha

Two signals from running the ubongo harness gate: (1) the operator wants a
standard pre-authorising isolated VM integration tests on ubongo so the agent
doesn't ask each time; (2) a stale agent session (shell predating the
integration_test libvirt-group grant) carries stale process groups, so the
harness's qemu-img/file writes are denied -> run via 'sg libvirt -c ...';
self-heal idea noted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-19 10:32:09 +02:00
parent 468f8c3a92
commit 8d8c86fa39


@@ -158,6 +158,34 @@ harness on ubongo and shaking it down against real KVM (spec/plan in docs/superp
reservations** (`10.20.10.17` = MAC `bc:0f:f3:c8:4a:8a`; mamba's MAC TBD) and allow the
reserved IPs. Spec: `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`.
- `[gotcha]` **`make test-integration` on ubongo fails (`qemu-img` "Permission denied") when
the agent session predates the `libvirt` group grant** (2026-06-19): the `integration_test`
role adds `claude` to `libvirt`+`kvm` and makes the cache dir `/var/lib/boma-integration`
`root:libvirt 2775` — correct — but a `claude` session whose shell started *before* that
grant carries a stale process group set (`id` shows `claude,docker` only, no `libvirt`), so
`qemu-img create` of the VM overlay into the group-owned dir is denied. `virsh`/`virt-install`
still work (they reach system libvirtd via polkit/socket, and the real KVM runs server-side
as `libvirt-qemu`), so ONLY claude's own file-writes break. Unblock without restarting the
session: **`sg libvirt -c 'make test-integration HOST=<name>'`** (claude needs only `libvirt`
for the dir; `kvm` is server-side; note `sg` adds one group, not the full set). → self-heal
in `scripts/integration-vm.py`: if the `libvirt` gid is absent from `os.getgroups()`, re-exec
under `sg libvirt` (or have the Makefile target do it), so a stale-session agent never hits
this opaque symptom. New agent sessions pick the groups up on login, so it's a stale-session
transient — but high-confusion, worth self-healing.
- `[friction]` **No standard for when the agent may run local-VM integration tests on ubongo
without asking** (2026-06-19): `make test-integration HOST=<name>` spins an ISOLATED throwaway
KVM VM (its own libvirt NAT; never touches the real host's firewall/network; guards:
one-VM-at-a-time + a 4 GiB free-RAM floor + auto-destroy on success), so it is safe and
self-contained — yet the agent paused for a go-ahead before running it (mesh-hardening 2/3,
Task 4). The operator wants a STANDARD that pre-authorises VM-testing on ubongo so the agent
just runs it. → decide + record the rule: e.g. a `.claude/settings.json` permission allow for
`make test-integration*` / `scripts/integration-vm.py` (and the `sg libvirt -c '…'` form per
the gotcha above), plus a CLAUDE.md line distinguishing the pre-authorised isolated VM tests
from the genuinely-gated live steps (`make deploy` to real hosts, host reboots, cutovers —
still need a go-ahead). Ties to the `test-risky-infra-before-live-deploy` +
`dont-reask-settled-defaults` memories + ADR-025.
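
The self-heal proposed in the `[gotcha]` entry above could look roughly like the sketch below. It is a minimal illustration, not the contents of `scripts/integration-vm.py`: the helper names (`needs_group`, `sg_command`, `ensure_group`) and the `INTEGRATION_VM_REEXEC` guard variable are hypothetical. It also assumes that `sg` passes the environment through to the command it runs and that the account is already a member of `libvirt` in the group database, so `sg` does not prompt for a password. The idea is to call `ensure_group("libvirt")` once at the top of the script's entry point, before any `qemu-img` work.

```python
import grp
import os
import shlex
import sys

# Hypothetical guard variable: set before re-exec so the script never loops.
REEXEC_GUARD = "INTEGRATION_VM_REEXEC"


def needs_group(group_name, current_gids, lookup=grp.getgrnam):
    """Return True if the group exists on this host but is not in current_gids."""
    try:
        gid = lookup(group_name).gr_gid
    except KeyError:
        return False  # no such group here, so there is nothing to heal
    return gid not in current_gids


def sg_command(group_name, argv):
    """Build the argv that re-runs `argv` as: sg <group> -c '<quoted command>'."""
    return ["sg", group_name, "-c", shlex.join(argv)]


def ensure_group(group_name="libvirt"):
    """Re-exec the current script under `sg <group>` when the session is stale."""
    if os.environ.get(REEXEC_GUARD) == "1":
        return  # already re-executed once; do not try again
    # os.getgroups() may omit the effective gid, so include it explicitly.
    gids = set(os.getgroups()) | {os.getegid()}
    if needs_group(group_name, gids):
        os.environ[REEXEC_GUARD] = "1"
        cmd = sg_command(group_name, [sys.executable, *sys.argv])
        os.execvp(cmd[0], cmd)  # replaces this process; does not return


# Intended use (left as a comment so importing this sketch has no side effects):
#     ensure_group("libvirt")
```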
---
## Kaizen reviews — decisions ledger