Commit graph

332 commits

Author SHA1 Message Date
b1aa0f49d9 fix(integration): verify probes :80 without following redirects
Accept caddy's 308 on :80 as proof the DNAT+forward path is alive; don't follow into https (tls internal has no cert for a bare-IP request). This load-bearing end-to-end check is what caught the br-+/br-* nftables-wildcard bug that the string-presence assert missed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:57:47 +02:00
172ae37953 feat(docker_host): container-forward nftables drop-in (reboot-safe Docker forwarding)
base's inet-filter forward chain is policy-drop; on a Docker host that kills published-port DNAT + inter-container forwarding ON REBOOT (nftables loads default-deny before dockerd). This drop-in (loaded via base's /etc/nftables.d/*.nft include at boot) appends the container-bridge accepts so a rebooted Docker host keeps forwarding. Resolves FRICTION 2026-06-17 #1 and the GREEN half of ADR-025's acceptance test. NB nftables wildcard is br-*, not the iptables br-+.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:57:47 +02:00
051c040343 fix(integration): exclude transient .run/ from linters; --- in generated inventory
Running the harness leaves tests/integration/.run/ (gitignored, generated); exclude it from yamllint + ansible-lint so a post-run 'make lint' passes. Also emit a --- doc-start in the generated inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:44:12 +02:00
c7194ca147 feat(integration): allow SSH from the NAT gateway in the askari overlay
base's default-deny firewall would drop the driver's post-reboot SSH from the libvirt NAT gateway; set base__firewall_control_addr to the gateway (by source IP, interface-independent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
35446538df fix(integration-vm): apt-ready VMs + sudo-read serial console diagnostics
cloud-init package_update:true + block on 'cloud-init status --wait' in up() so apply sees populated apt lists (fresh genericcloud images ship empty lists); dump_diagnostics()/console() read the root:0600 serial log via sudo instead of shutil.copy, which raised PermissionError mid-diagnostics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
83983d739c fix(reverse_proxy): plain {% %} tags so the Caddyfile renders under ansible trim_blocks
The tls-internal/acme_ca knobs used {%- -%} trims validated only against raw jinja2; ansible (trim_blocks=True) double-stripped newlines and collapsed the Caddyfile onto single lines, crash-looping caddy. Match the role's existing plain {% %} style.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
941141e270 docs(friction): capture 9 signals from the ADR-025 harness shakedown
UEFI-vs-BIOS boot loop, no-sudo diagnosis gap (-> claude sudo decision), qemu
session-vs-system URI, system-qemu home-traversal, directory-inventory phantom
hosts, jinja trim_blocks render trap, empty apt lists on fresh cloud images,
NAT-gateway firewall allow, and the review-vs-hardware coverage lesson.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:30:13 +02:00
f27514860e fix(integration-vm): boot test VMs via UEFI
The Debian 13 genericcloud image triple-faults at the legacy real-mode kernel
handoff under SeaBIOS/q35 (boot-loops at GRUB, no 'Decompressing Linux', no DHCP
lease). Booting via UEFI (OVMF -> efistub) bypasses the legacy entry and boots
cleanly: cloud-init runs, DHCP lease obtained, SSH reachable. Verified end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:13:35 +02:00
65bacb25fa feat(integration-vm): force DHCP via explicit cloud-init network-config
Don't rely on the genericcloud image's network fallback; the seed now carries a
network-config forcing dhcp4 on en* interfaces. A correct prerequisite for the VM
to network once cloud-init processes the seed. (Note: a separate no-DHCP-lease
issue on first real boot is still under investigation — the guest isn't networking
and, under the no-sudo claude model, the VM console/logs aren't introspectable
without libguestfs; see next steps.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 15:05:49 +02:00
e5256696d6 fix(integration-vm): place VM disk/seed/console in CACHE_DIR for system-qemu
Under qemu:///system the hypervisor runs as libvirt-qemu, which cannot traverse
/home/claude — so the overlay/seed/console must live in /var/lib/boma-integration
(group libvirt, world-traversable, created by the integration_test role), not the
repo/home RUN_DIR. The inventory (hosts.yml + group_vars symlink, read by ansible
as claude) stays in RUN_DIR. Verified: virt-install now creates the domain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 14:56:35 +02:00
147eb874ea fix(integration-vm): pin LIBVIRT_DEFAULT_URI=qemu:///system
Bare virsh/virt-install default to qemu:///session for a non-root caller, but
the substrate, /dev/kvm, and the boma-it NAT network live on the SYSTEM libvirtd.
Pin the URI so the driver targets system regardless of who runs it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 14:41:31 +02:00
ed1187d1c3 fix(integration-vm): point ansible -i at hosts.yml, not the run dir
The driver passed -i <RUN_DIR>/ (a directory); ansible's directory-inventory
loader then parsed sibling files (notably 'current', which holds the real host
string 'askari') as INI inventory, creating phantom hosts incl. the real askari
with its full hostvars — violating the single-host safety invariant (and a hard
error in ansible 2.18 on the binary qcow2/seed files). Point -i at the single
hosts.yml file; ansible still loads the adjacent group_vars symlink. (review C1)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 13:04:54 +02:00
f51ae1a13d docs(runbook): integration-testing runbook + pre-flight cross-links
- New docs/runbooks/integration-testing.md: when to use (firewall/
  sshd/boot/Docker changes); make test-integration commands; lower-
  level driver sub-commands; cert tier guidance; diagnostics dir;
  VM inspection (virsh console / SSH); safety invariants; resource
  constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying
  lockout-risky changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 12:59:06 +02:00
4732730515 docs: wire ADR-025 into testing/control-host/risks/status/capacity
- ADR-008: add reboot-survivability gap row + ADR-025 pointer to the
  "not tested in Molecule" table
- ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs
  (ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing
- accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records)
- CLAUDE.md: add make test-integration[/-clean] to key-commands;
  add ADR-025 + runbook rows to further-reading
- hardware/reference.md: note one ephemeral KVM test VM on ubongo
- STATUS.md: add integration harness entry (built, lint+pytest clean;
  RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 12:51:22 +02:00
edcc347a95 docs(adr): ADR-025 local VM integration testing
Accepted decision to implement ADR-008 Level 2/3 on ubongo via
libvirt/KVM directly: throwaway VM overlays, stdlib-only driver,
tiered cert fidelity, three safety invariants. Addresses the
2026-06-17 mesh-hardening incident's reboot-survivability gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 12:49:52 +02:00
d68734267b feat(make): test-integration / test-integration-clean targets
Add ADR-025 integration-test harness targets to Makefile:
- test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1]
- test-integration-clean (prune stale VM snapshots)

Also add tests/integration/.run/ to .gitignore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:45:38 +02:00
3769c9ebb9 feat(integration): outcome-based verify playbook (DNAT-survives-reboot)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:38:22 +02:00
10121e72d3 feat(integration): askari profile, stub overlay, cert-tier files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:37:32 +02:00
0989f047eb feat(reverse_proxy): tls-internal + acme_ca knobs for integration/staging (ADR-025)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:30:49 +02:00
4fb4cf99c3 fix(integration-vm): boot-id-verified reboot + actionable timeouts + inventory guard (review)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:28:06 +02:00
68abd67ce6 feat(integration-vm): teardown, prune, console, full cycle + dispatch 2026-06-18 12:21:06 +02:00
8ea9966d88 feat(integration-vm): reboot, verify run, failure diagnostics 2026-06-18 12:20:52 +02:00
d1c91930ac feat(integration-vm): transient inventory + real-playbook apply 2026-06-18 12:20:37 +02:00
fdd4df34b1 feat(integration-vm): network + VM boot (overlay, cloud-init seed, virt-install import) 2026-06-18 12:20:25 +02:00
af76763c16 feat(integration-vm): golden image fetch + SHA512 verification 2026-06-18 12:19:58 +02:00
a8dc3c787a feat(integration-vm): cert-tier + profile + transient inventory rendering 2026-06-18 12:17:37 +02:00
6f53d00b71 feat(integration-vm): cloud-init user-data/meta-data rendering 2026-06-18 12:12:08 +02:00
b5d5dffeaf feat(integration-vm): vm naming, RAM guard, lease IP parsing 2026-06-18 12:11:56 +02:00
64767ac187 feat(integration-vm): driver skeleton + CLI dispatch 2026-06-18 12:11:41 +02:00
ac6a01296a feat(integration_test): KVM/libvirt substrate role on the control node 2026-06-18 12:09:35 +02:00
65533be4d9 docs(plan): implementation plan for local VM integration testing (2.4)
20-task TDD plan: integration_test substrate role, stdlib virsh driver, askari profile, tiered certs, RED->GREEN acceptance, docker_host container-forward fix, ADR-025 + docs. Follows the 2026-06-18 design spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 11:56:04 +02:00
02e1eb7449 docs(spec): design local VM integration testing on ubongo (2.4)
Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host (real Docker, real reboot, real role apply) to catch the reboot/firewall/boot-order class Molecule cannot - the 2026-06-17 mesh-hardening incident. First profile: be askari; tiered certs (internal + le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008 Level 2/3; to be recorded as ADR-025.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 11:35:51 +02:00
69faaf5e43 docs(todo): local VM integration testing (2.4) + screenshot hand-off (10.8)
From the 2026-06-17 mesh-hardening incident: Molecule can't catch
reboot/firewall-x-Docker/boot-order bugs — build local-VM pre-deploy testing
on ubongo (ADR-008 Level 2/3). And a smooth screenshot hand-off for the agent
during incidents.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:27:26 +02:00
958e35e3c3 docs(friction): capture 6 signals from the mesh-hardening 1/3 incident
firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race,
coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no
off-site coordinator backup, and reboot-tested-after-removing-break-glass.
For the next /kaizen + the mesh-hardening re-spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:21:19 +02:00
847d9885e2 revert: back out mesh-hardening 1/3 on askari after it broke the Docker host
Incident 2026-06-17: applying base's nftables default-deny (forward policy drop)
to askari — a Docker host — broke container forwarding/NAT on reboot, and the
wt0-only sshd ListenAddress left no break-glass (ip_nonlocal_bind did NOT beat
the boot race). Recovery: disable nftables + restart docker (restore the wiped
NAT masquerade) + force-recreate the coordinator (it FATAL-looped unable to
download its GeoLite2 DB with no egress) -> mesh re-formed.

Back out the enablement so a future deploy can't re-break askari:
- offsite_hosts: base__ssh_listen_mesh_only=false, base__firewall_apply=false
- remove host_vars/askari.yml (manage over the WAN again, not wt0)
- tf/offsite: re-open WAN :22 to ubongo only (break-glass; already applied)

askari now: sshd on all interfaces (Ansible-managed), nftables disabled, WAN :22
open -> stable + reboot-survivable. The base feature code (sshd ListenAddress
option, firewall public zone) stays; it's just not enabled on Docker hosts.
Mesh-hardening 1/3 to be re-spec'd before any retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:16:17 +02:00
b0511179cb feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)
The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:51:24 +02:00
cc21344ab1 feat(inventory): manage askari over wt0 + enable mesh-only SSH
host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:49:15 +02:00
3b30e70ba5 feat(firewall): public zone + askari's public services in the catalog
Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:46:03 +02:00
39d2ad38ca feat(base): opt-in sshd ListenAddress on the mesh IP (fail-closed)
base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).

Environmental checkpoint applied: the molecule-debian13 container image lacks
procps (no sysctl binary). Added molecule/default/prepare.yml to install procps
and sysctls: {net.ipv4.ip_nonlocal_bind: "0"} to molecule.yml platform so the
ansible.posix.sysctl task can write and read back the value hermetically.
Sysctl file format is net.ipv4.ip_nonlocal_bind=1 (no spaces); verify.yml
grep pattern updated to match ansible.posix.sysctl's actual output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:43:08 +02:00
dfa363cecd docs(plan): mesh-hardening 1/3 — askari SSH onto wt0 implementation plan
5 tasks: base sshd ListenAddress+ip_nonlocal_bind (Molecule), firewall public
zone + askari catalog, inventory wt0 override, TF retire WAN :22, then the live
operator-supervised staged cutover.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:25:59 +02:00
292c204752 docs(spec): mesh-hardening 1/3 — move askari SSH onto wt0
Decomposes the M5 mesh-hardening follow-on into 3 independent sub-specs; this
is sub-project 1. Three-layer SSH-on-wt0 (sshd ListenAddress=mesh + nftables
iifname wt0 + retire the Hetzner WAN :22), ip_nonlocal_bind to beat the
post-boot wt0 bind race (fail-closed), live wt0 fact for the listen addr,
staged cutover with the firewall auto-rollback as the safety gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:15:12 +02:00
e5a8e5d3b9 docs(roadmap): Phase 1 complete — point Next step at mesh-hardening follow-on
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:39:08 +02:00
5947ba8756 chore(vault): Forgejo registry_token supplied (operator-minted, encrypted)
registry-login verified end-to-end (docker login -> Login Succeeded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:37:11 +02:00
a0762c563e docs(kaizen): bind-mount gotcha + consume 7 signals into the ledger (2026-06-17)
Migrate the single-file-bind-mount/stale-config gotcha (reload-in-place needs a
directory mount; restart-based roles don't) to docs/testing/gotchas.md, and move
all 7 open signals out of FRICTION.md's Open-signals section into the new
2026-06-17 decisions-ledger block: all consumed, 1 PARK (the ubongo
self-management gap, tracked in STATUS), 0 REMOVE. Relax test_load_signals to
accept an empty Open-signals section (the goal state after a kaizen pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:50:17 +02:00
c1323a3f29 feat(make): registry-login via vaulted Forgejo token (kaizen)
scripts/registry-login.sh reads vault.forgejo.registry_token and pipes it to
docker login --password-stdin (never echoed, never on argv); 'make registry-login'
wires it with the venv binaries. Adds the operator-minted CHANGEME vault stub
(fill via make edit-vault) and a per-machine prereq note in the claude-code-setup
runbook, so 'make caddy-image-push'/'molecule-image-push' become agent-completable
non-interactively. Consumes the 2026-06-15 signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:50:07 +02:00
39904a778a fix(hooks): scope vault-preflight to staged ansible; catch prose exec re-asks
guard-vault-preflight: block a locked 'git commit' only when the staged set
(git diff --cached, plus -a/--all) contains ansible content matching the
pre-commit ansible-lint hook's files: scope. Docs-/config-only commits never
trigger that hook, so they no longer need the vault — fixing the false block on
docs-only commits. Fails safe to block when unsure.

guard-execution-mode-menu: widen the execution-mode arm to also catch free-form
prose re-asks of the subagent-vs-inline choice ('which execution approach?',
'subagent vs inline', ...), which the literal-menu matcher missed; the push
re-ask is intentionally left to the dont-reask-settled-defaults memory.

Consumes two 2026-06-17 signals in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:49:55 +02:00
8f1c7d47ec fix(reverse_proxy,netbird_coordinator): create scaffold dirs in check mode
Add check_mode: false to the state:directory base_dir tasks so that 'make check'
on a brand-new compose service role creates the scaffold during --check and the
rest of the dry-run (templates + docker_compose_v2 up) can be evaluated instead
of failing on a missing project_src. The directive is inert under a normal
converge (incl. Molecule + its tagged second converge), so role tests are
unchanged. Consumes the 2026-06-16 signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:49:47 +02:00
b0c0150db2 feat(scan): repo-scan rename-incomplete check (kaizen)
When a numbered ADR announces a rename Old->New, flag design-doc lines where
Old still appears in present tense — skipping the announcing ADR, lines that
also name New, and historical/negation cues, and rejecting ADR-NNN tokens as
terms. Structural cousin of stale-deferred; run by /review-repo. Zero findings
on the current tree (the Traefik->Caddy ripple edits have landed). Consumes the
2026-06-14 KEEP-OPEN signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:49:41 +02:00
959f9b30b5 feat(statusline): show context-window usage % in the status line
Adds .claude/statusline.sh (reads context_window.used_percentage +
context_window_size straight from the statusLine JSON; green<70/yellow/red
bar) and wires it via .claude/settings.json statusLine. Committed in-repo so
it follows boma to any clone, matching how .claude/ already tracks hooks +
plugins.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:35:47 +02:00
5d14efc864 docs: Phase 1 complete — clients enrolled + NetBird client runbook
mamba + work laptop enrolled in the mesh → ubongo reachable from anywhere; the
mobile-access goal is met and Phase 1 (remote access) is complete. Adds
docs/runbooks/netbird-client.md (reusable client-enrollment runbook) + STATUS/
ROADMAP flips + CLAUDE.md reading-table entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:11:32 +02:00