docs(friction): capture 9 signals from the ADR-025 harness shakedown

UEFI-vs-BIOS boot loop, no-sudo diagnosis gap (-> claude sudo decision), qemu
session-vs-system URI, system-qemu home-traversal, directory-inventory phantom
hosts, jinja trim_blocks render trap, empty apt lists on fresh cloud images,
NAT-gateway firewall allow, and the review-vs-hardware coverage lesson.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
sjat 2026-06-18 16:30:13 +02:00
parent f27514860e
commit 941141e270

View file

@ -90,6 +90,62 @@ a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh
open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
<!-- The below are from the 2026-06-18 ADR-025 build: standing up the local-VM integration
harness on ubongo and shaking it down against real KVM (spec/plan in docs/superpowers/). -->
- `[gotcha]` **Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS** (2026-06-18):
`virt-install --import` of the genericcloud qcow2 with the default (SeaBIOS) firmware
triple-faults at the real-mode kernel handoff — GRUB loops, no "Decompressing Linux", no
DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test
VMs via **UEFI** (`virt-install --boot uefi`; OVMF→efistub).
- `[friction]` **The no-sudo `claude` model blocked diagnosing a failed VM** (2026-06-18):
under ADR-015 `claude` had no sudo, so when the VM wouldn't network there was no way to
introspect it (serial logs are `root:0600`, libguestfs not installed, mounting needs
root). Diagnosis was fully blocked until the operator granted `claude` sudo. → DECISION:
`claude` gets `NOPASSWD:ALL` (reverses ADR-015's "no local sudo"); compensating control
is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks;
codify the sudoers drop-in in Ansible.
- `[gotcha]` **Non-root `virsh`/`virt-install` default to `qemu:///session`** (2026-06-18):
the substrate (NAT net, /dev/kvm) lives on `qemu:///system`. → pin
`LIBVIRT_DEFAULT_URI=qemu:///system` in the driver.
- `[gotcha]` **`qemu:///system` (libvirt-qemu) can't traverse `/home`** (2026-06-18): VM
disk/seed/console under the repo/home failed "Permission denied (search permissions for
/home/claude)". → put per-VM artifacts in a system-readable dir (`/var/lib/boma-integration`,
group libvirt); the inventory (read by ansible as the user) can stay in the repo.
- `[gotcha]` **`ansible-playbook -i <dir>/` parses sibling non-inventory files as INI**
(2026-06-18): pointing `-i` at a run-dir holding a state file + qcow2s made the directory
inventory loader parse the state file as INI → phantom hosts INCLUDING the real `askari`
(with its real vars), breaking the single-host isolation invariant. → point `-i` at the
single `hosts.yml`. Caught by the holistic cross-file review BEFORE any hardware run.
- `[gotcha]` **Jinja `{%- -%}` + ansible `trim_blocks=True` double-strip newlines**
(2026-06-18): a template edit used `{%- -%}`, reviewed by rendering with RAW jinja2
(trim_blocks=False) which looked fine; ansible (trim_blocks=True) then collapsed the
rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify
templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain
`{% %}` at column 0 (the repo's existing style).
- `[gotcha]` **Fresh cloud images have empty apt lists** (2026-06-18): `apt install
nftables` failed "No package matching 'nftables' is available" on a fresh genericcloud
VM whose cloud-init had `package_update: false`. → `package_update: true` AND block on
`cloud-init status --wait` before applying.
- `[gotcha]` **base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is
allowed** (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway
(192.168.150.1). `ct established,related accept` saves the in-flight apply connection,
but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets
`base__firewall_control_addr` to the NAT gateway.
- `[recurring]` **Real-hardware shakedown and static review each caught what the other
couldn't** (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render
bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in
the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
cross-file review AND a real-hardware run; neither alone would have shipped it working.
---
## Kaizen reviews — decisions ledger