docs(friction): capture 9 signals from the ADR-025 harness shakedown

UEFI-vs-BIOS boot loop, no-sudo diagnosis gap (-> claude sudo decision), qemu session-vs-system URI, system-qemu home-traversal, directory-inventory phantom hosts, jinja trim_blocks render trap, empty apt lists on fresh cloud images, NAT-gateway firewall allow, and the review-vs-hardware coverage lesson.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent f27514860e
commit 941141e270
1 changed file with 56 additions and 0 deletions
@@ -90,6 +90,62 @@ a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh
open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.

<!-- The below are from the 2026-06-18 ADR-025 build: standing up the local-VM integration
harness on ubongo and shaking it down against real KVM (spec/plan in docs/superpowers/). -->

- `[gotcha]` **Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS** (2026-06-18):
`virt-install --import` of the genericcloud qcow2 with the default (SeaBIOS) firmware
triple-faults at the real-mode kernel handoff: GRUB loops, no "Decompressing Linux", no
DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test
VMs via **UEFI** (`virt-install --boot uefi`; OVMF→efistub).

- `[friction]` **The no-sudo `claude` model blocked diagnosing a failed VM** (2026-06-18):
under ADR-015 `claude` had no sudo, so when the VM wouldn't network there was no way to
introspect it (serial logs are `root:0600`, libguestfs not installed, mounting needs
root). Diagnosis was fully blocked until the operator granted `claude` sudo. → DECISION:
`claude` gets `NOPASSWD:ALL` (reverses ADR-015's "no local sudo"); compensating control
is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks;
codify the sudoers drop-in in Ansible.

- `[gotcha]` **Non-root `virsh`/`virt-install` default to `qemu:///session`** (2026-06-18):
the substrate (NAT net, /dev/kvm) lives on `qemu:///system`. → pin
`LIBVIRT_DEFAULT_URI=qemu:///system` in the driver.

- `[gotcha]` **`qemu:///system` (libvirt-qemu) can't traverse `/home`** (2026-06-18): VM
disk/seed/console under the repo/home failed "Permission denied (search permissions for
/home/claude)". → put per-VM artifacts in a system-readable dir (`/var/lib/boma-integration`,
group libvirt); the inventory (read by ansible as the user) can stay in the repo.

- `[gotcha]` **`ansible-playbook -i <dir>/` parses sibling non-inventory files as INI**
(2026-06-18): pointing `-i` at a run-dir holding a state file + qcow2s made the directory
inventory loader parse the state file as INI → phantom hosts INCLUDING the real `askari`
(with its real vars), breaking the single-host isolation invariant. → point `-i` at the
single `hosts.yml`. Caught by the holistic cross-file review BEFORE any hardware run.

- `[gotcha]` **Jinja `{%- -%}` + ansible `trim_blocks=True` double-strip newlines**
(2026-06-18): a template edit used `{%- -%}`, reviewed by rendering with RAW jinja2
(trim_blocks=False) which looked fine; ansible (trim_blocks=True) then collapsed the
rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify
templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain
`{% %}` at column 0 (the repo's existing style).

- `[gotcha]` **Fresh cloud images have empty apt lists** (2026-06-18): `apt install
nftables` failed "No package matching 'nftables' is available" on a fresh genericcloud
VM whose cloud-init had `package_update: false`. → `package_update: true` AND block on
`cloud-init status --wait` before applying.

- `[gotcha]` **base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is
allowed** (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway
(192.168.150.1). `ct established,related accept` saves the in-flight apply connection,
but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets
`base__firewall_control_addr` to the NAT gateway.

- `[recurring]` **Real-hardware shakedown and static review each caught what the other
couldn't** (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render
bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in
the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
cross-file review AND a real-hardware run; neither alone would have shipped it working.

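Three of the gotchas above (the libvirt URI, the `/home` traversal failure, and the directory inventory) could be caught by a pre-flight check before any VM is created. The following is a minimal sketch under stated assumptions, not the harness's actual driver: the function names `libvirt_env`, `untraversable_ancestors`, and `inventory_arg` are hypothetical, and the traversal check only looks at the world-execute bit, so it ignores group membership and ACLs.

```python
import os
import stat
from pathlib import Path

# From the log: the NAT network and /dev/kvm live on the system connection.
SYSTEM_URI = "qemu:///system"


def libvirt_env(base=None):
    """Return a copy of the environment with LIBVIRT_DEFAULT_URI pinned.

    Hypothetical helper: pass the result as env= to subprocess calls that
    run virsh or virt-install as a non-root user.
    """
    env = dict(os.environ if base is None else base)
    env["LIBVIRT_DEFAULT_URI"] = SYSTEM_URI
    return env


def untraversable_ancestors(path):
    """List ancestor directories of `path` that lack the world-execute bit.

    Assumption: a directory without o+x is treated as unreachable for a
    separate service account. Group permissions and ACLs are not checked.
    """
    blocked = []
    for parent in Path(path).resolve().parents:
        if not parent.stat().st_mode & stat.S_IXOTH:
            blocked.append(parent)
    return blocked


def inventory_arg(run_dir, name="hosts.yml"):
    """Return the single inventory file to hand to `ansible-playbook -i`.

    Refuses to fall back to the directory itself, so sibling state files
    are never offered to the directory inventory loader.
    """
    inventory = Path(run_dir) / name
    if not inventory.is_file():
        raise FileNotFoundError(f"expected a single inventory file at {inventory}")
    return str(inventory)
```

A driver might call `untraversable_ancestors()` on each planned disk, seed, and console path and abort with the offending directory if the list is non-empty, which surfaces the storage-path problem before `virt-install` reports a permission error.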
---
## Kaizen reviews — decisions ledger