docs(friction): capture 9 signals from the ADR-025 harness shakedown

UEFI-vs-BIOS boot loop, no-sudo diagnosis gap (-> claude sudo decision), qemu session-vs-system URI, system-qemu home-traversal, directory-inventory phantom hosts, jinja trim_blocks render trap, empty apt lists on fresh cloud images, NAT-gateway firewall allow, and the review-vs-hardware coverage lesson.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent f27514860e
commit 941141e270
1 changed file with 56 additions and 0 deletions

@@ -90,6 +90,62 @@ a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh
open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
<!-- The entries below are from the 2026-06-18 ADR-025 build: standing up the local-VM integration
harness on ubongo and shaking it down against real KVM (spec/plan in docs/superpowers/). -->
- `[gotcha]` **Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS** (2026-06-18):
  `virt-install --import` of the genericcloud qcow2 with the default (SeaBIOS) firmware
  triple-faults at the real-mode kernel handoff: GRUB loops, no "Decompressing Linux", no
  DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test
  VMs via **UEFI** (`virt-install --boot uefi`; OVMF→efistub).
- `[friction]` **The no-sudo `claude` model blocked diagnosing a failed VM** (2026-06-18):
  under ADR-015 `claude` had no sudo, so when the VM would not come up on the network there
  was no way to introspect it (serial logs are `root:0600`, libguestfs not installed, mounting
  needs root). Diagnosis was fully blocked until the operator granted `claude` sudo. → DECISION:
  `claude` gets `NOPASSWD:ALL` (reverses ADR-015's "no local sudo"); compensating control
  is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks;
  codify the sudoers drop-in in Ansible.
- `[gotcha]` **Non-root `virsh`/`virt-install` default to `qemu:///session`** (2026-06-18):
  the substrate (NAT net, /dev/kvm) lives on `qemu:///system`. → pin
  `LIBVIRT_DEFAULT_URI=qemu:///system` in the driver.
- `[gotcha]` **`qemu:///system` (libvirt-qemu) can't traverse `/home`** (2026-06-18): VM
  disk/seed/console under the repo/home failed "Permission denied (search permissions for
  /home/claude)". → put per-VM artifacts in a system-readable dir (`/var/lib/boma-integration`,
  group libvirt); the inventory (read by ansible as the user) can stay in the repo.
- `[gotcha]` **`ansible-playbook -i <dir>/` parses sibling non-inventory files as INI**
  (2026-06-18): pointing `-i` at a run-dir holding a state file + qcow2s made the directory
  inventory loader parse the state file as INI → phantom hosts INCLUDING the real `askari`
  (with its real vars), breaking the single-host isolation invariant. → point `-i` at the
  single `hosts.yml`. Caught by the holistic cross-file review BEFORE any hardware run.
- `[gotcha]` **Jinja `{%- -%}` + ansible `trim_blocks=True` double-strip newlines**
  (2026-06-18): a template edit used `{%- -%}` and was reviewed by rendering with RAW jinja2
  (trim_blocks=False), which looked fine; ansible (trim_blocks=True) then collapsed the
  rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify
  templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain
  `{% %}` at column 0 (the repo's existing style).
- `[gotcha]` **Fresh cloud images have empty apt lists** (2026-06-18):
  `apt install nftables` failed "No package matching 'nftables' is available" on a fresh
  genericcloud VM whose cloud-init had `package_update: false`. → `package_update: true`
  AND block on `cloud-init status --wait` before applying.
- `[gotcha]` **base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is
  allowed** (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway
  (192.168.150.1). `ct state established,related accept` saves the in-flight apply connection,
  but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets
  `base__firewall_control_addr` to the NAT gateway.
- `[recurring]` **Real-hardware shakedown and static review each caught what the other
  couldn't** (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render
  bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in
  the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
  cross-file review AND a real-hardware run; neither alone would have shipped it working.
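
The UEFI, libvirt-URI, and `/home`-traversal entries above could be folded into one driver helper, roughly as sketched here. This is not the harness's actual code: the function names, the `integration-nat` network name, and the memory/vCPU sizes are hypothetical, and only `--boot uefi`, `qemu:///system`, and `/var/lib/boma-integration` come from the notes. OS-info flags are left out and may be required by newer `virt-install` versions. Passing the returned list and environment to `subprocess.run(argv, env=env, check=True)` would actually create the VM.

```python
import os
from pathlib import PurePosixPath

SYSTEM_URI = "qemu:///system"                               # from the notes
ARTIFACT_ROOT = PurePosixPath("/var/lib/boma-integration")  # from the notes


def libvirt_env(base=None):
    """Return a copy of the environment with the libvirt URI pinned."""
    env = dict(os.environ if base is None else base)
    env["LIBVIRT_DEFAULT_URI"] = SYSTEM_URI
    return env


def virt_install_argv(name, disk, network="integration-nat"):
    """Build a virt-install command that imports a cloud image under UEFI.

    `network` is a hypothetical libvirt network name; the memory and vCPU
    values are illustrative defaults, not values taken from the harness.
    """
    disk = PurePosixPath(disk)
    # The system QEMU user cannot search /home, so refuse disks kept there.
    if PurePosixPath("/home") in disk.parents:
        raise ValueError(f"{disk} is under /home; use {ARTIFACT_ROOT} instead")
    return [
        "virt-install",
        "--connect", SYSTEM_URI,
        "--name", name,
        "--memory", "1024",
        "--vcpus", "1",
        "--import",
        "--disk", f"path={disk},format=qcow2",
        "--network", f"network={network}",
        "--boot", "uefi",  # the default SeaBIOS firmware boot-looped
        "--noautoconsole",
    ]
```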
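
The inventory and apt-list entries above suggest two small guards in the driver: pass `-i` a single file rather than the run directory, and wait for cloud-init before applying. The sketch below assumes the inventory file is named `hosts.yml` as in the notes; the function names are hypothetical, and `--limit` is an extra belt-and-braces measure that the notes do not mention.

```python
from pathlib import Path


def inventory_file(run_dir, name="hosts.yml"):
    """Return the single inventory file inside run_dir, never the directory."""
    path = Path(run_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"expected an inventory file at {path}")
    return path


def cloud_init_wait_argv(host):
    """ssh command that blocks until cloud-init has finished on the VM."""
    return ["ssh", host, "cloud-init", "status", "--wait"]


def ansible_argv(run_dir, playbook, limit):
    """ansible-playbook command pinned to one inventory file and one host."""
    return [
        "ansible-playbook",
        "-i", str(inventory_file(run_dir)),  # a file, so siblings are not parsed
        "--limit", limit,                    # assumption: extra single-host guard
        playbook,
    ]
```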
---
## Kaizen reviews — decisions ledger