From 941141e2701b1dd1ae1eb9568293d03a4d400704 Mon Sep 17 00:00:00 2001
From: sjat
Date: Thu, 18 Jun 2026 16:30:13 +0200
Subject: [PATCH] docs(friction): capture 9 signals from the ADR-025 harness shakedown

UEFI-vs-BIOS boot loop, no-sudo diagnosis gap (-> claude sudo decision),
qemu session-vs-system URI, system-qemu home-traversal, directory-inventory
phantom hosts, jinja trim_blocks render trap, empty apt lists on fresh cloud
images, NAT-gateway firewall allow, and the review-vs-hardware coverage lesson.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/FRICTION.md | 56 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/docs/FRICTION.md b/docs/FRICTION.md
index b447f9b..888ef73 100644
--- a/docs/FRICTION.md
+++ b/docs/FRICTION.md
@@ -90,6 +90,62 @@ a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh
   open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
   Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
+
+
+- `[gotcha]` **Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS** (2026-06-18):
+  `virt-install --import` of the genericcloud qcow2 with the default (SeaBIOS) firmware
+  triple-faults at the real-mode kernel handoff — GRUB loops, no "Decompressing Linux", no
+  DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test
+  VMs via **UEFI** (`virt-install --boot uefi`; OVMF→efistub).
+
+- `[friction]` **The no-sudo `claude` model blocked diagnosing a failed VM** (2026-06-18):
+  under ADR-015 `claude` had no sudo, so when the VM wouldn't network there was no way to
+  introspect it (serial logs are `root:0600`, libguestfs not installed, mounting needs
+  root). Diagnosis was fully blocked until the operator granted `claude` sudo. → DECISION:
+  `claude` gets `NOPASSWD:ALL` (reverses ADR-015's "no local sudo"); compensating control
+  is auditd/Loki attribution (already in ADR-015).
+  Amend ADR-015/ADR-021 + accepted-risks; + codify the sudoers drop-in in Ansible.
+
+- `[gotcha]` **Non-root `virsh`/`virt-install` default to `qemu:///session`** (2026-06-18):
+  the substrate (NAT net, /dev/kvm) lives on `qemu:///system`. → pin
+  `LIBVIRT_DEFAULT_URI=qemu:///system` in the driver.
+
+- `[gotcha]` **`qemu:///system` (libvirt-qemu) can't traverse `/home`** (2026-06-18): VM
+  disk/seed/console under the repo/home failed "Permission denied (search permissions for
+  /home/claude)". → put per-VM artifacts in a system-readable dir (`/var/lib/boma-integration`,
+  group libvirt); the inventory (read by ansible as the user) can stay in the repo.
+
+- `[gotcha]` **`ansible-playbook -i /` parses sibling non-inventory files as INI**
+  (2026-06-18): pointing `-i` at a run-dir holding a state file + qcow2s made the directory
+  inventory loader parse the state file as INI → phantom hosts INCLUDING the real `askari`
+  (with its real vars), breaking the single-host isolation invariant. → point `-i` at the
+  single `hosts.yml`. Caught by the holistic cross-file review BEFORE any hardware run.
+
+- `[gotcha]` **Jinja `{%- -%}` + ansible `trim_blocks=True` double-strip newlines**
+  (2026-06-18): a template edit used `{%- -%}`, reviewed by rendering with RAW jinja2
+  (trim_blocks=False) which looked fine; ansible (trim_blocks=True) then collapsed the
+  rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify
+  templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain
+  `{% %}` at column 0 (the repo's existing style).
+
+- `[gotcha]` **Fresh cloud images have empty apt lists** (2026-06-18):
+  `apt install nftables` failed "No package matching 'nftables' is available" on a fresh
+  genericcloud VM whose cloud-init had `package_update: false`. → `package_update: true`
+  AND block on `cloud-init status --wait` before applying.
+
+- `[gotcha]` **base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is
+  allowed** (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway
+  (192.168.150.1). `ct established,related accept` saves the in-flight apply connection,
+  but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets
+  `base__firewall_control_addr` to the NAT gateway.
+
+- `[recurring]` **Real-hardware shakedown and static review each caught what the other
+  couldn't** (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render
+  bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in
+  the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
+  cross-file review AND a real-hardware run; neither alone would have shipped it working.
+
 ---

 ## Kaizen reviews — decisions ledger