50 commits

Author SHA1 Message Date
dc5cc8933f fix(harness): fall back to --source arp for VM IP discovery (no leaseshelper)
wait_for_ip now tries --source lease first then --source arp; both produce
identical output handled by parse_lease_ip. Removes the suid leaseshelper
dependency introduced and backed out in Task 3. New unit test confirms
parse_lease_ip works on --source arp output format.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:29:35 +02:00
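A minimal sketch of the lookup order described in dc5cc8933f, assuming the harness shells out to `virsh domifaddr`. The body of `parse_lease_ip` is not shown in the log, so this is an illustrative re-implementation; `discover_ip` and the injected `run` callable are hypothetical names, not the harness's own.

```python
import re
from typing import Callable, Optional

# Matches an IPv4 address with its prefix length, e.g. "192.168.150.23/24".
_ADDR_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})/\d{1,2}\b")


def parse_lease_ip(output: str) -> Optional[str]:
    """Return the first IPv4 address in `virsh domifaddr` output, or None."""
    for line in output.splitlines():
        if "ipv4" not in line:
            continue
        match = _ADDR_RE.search(line)
        if match:
            return match.group(1)
    return None


def discover_ip(domain: str, run: Callable[[list], str]) -> Optional[str]:
    """Try the DHCP lease table first, then fall back to the host ARP table."""
    for source in ("lease", "arp"):
        ip = parse_lease_ip(run(["virsh", "domifaddr", domain, "--source", source]))
        if ip:
            return ip
    return None
```

Passing `run` in as a parameter keeps the fallback order testable without a running libvirtd; a real caller would presumably wrap this in a retry loop with a timeout.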
8ca42c389c fix(integration): fix VM boot: hostname, netplan, known_hosts handling
Three fixes found during askari_inputonly integration-test development:

1. Hostname sanitization: cloud-init rejects underscores in local-hostname
   (silently skips network-config → VM never gets DHCP). Sanitize with
   name.replace("_", "-") for the meta-data hostname; paths/domain names
   keep the original (underscore is valid there).

2. Netplan explicit interface: match.name: en* with a named key produces a
   .network file that networkd never DHCPs. Use explicit enp1s0 (all virtio
   NICs in these KVM VMs) + renderer: networkd to bypass the bug.

3. ansible_ssh_common_args in the generated hosts.yml: integration VMs
   reuse IPs (different VMs at same 192.168.150.x lease). The
   StrictHostKeyChecking=accept-new setting from ansible.cfg rejects changed
   keys. Add StrictHostKeyChecking=no
   + UserKnownHostsFile=/dev/null per-host to the generated inventory so
   stale known_hosts entries never block the apply step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:15:07 +02:00
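Fixes 1 and 3 of 8ca42c389c can be sketched as below. The function names and the shape of the returned host entry are assumptions made for illustration; only the `replace("_", "-")` call and the two SSH options come from the commit message.

```python
def cloud_init_hostname(name: str) -> str:
    """Replace underscores, which cloud-init rejects in local-hostname."""
    return name.replace("_", "-")


def host_entry(name: str, ip: str) -> dict:
    """Inventory vars for one throwaway VM (hypothetical layout)."""
    return {
        name: {
            "ansible_host": ip,
            # Integration VMs reuse IPs, so never consult or record host keys.
            "ansible_ssh_common_args": (
                "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
            ),
        }
    }
```

Only the meta-data hostname is sanitized here; the inventory key keeps the original name, mirroring the commit's note that paths and domain names are left alone.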
26bb7e442d fix(integration): pin system python for virt-install (venv PATH hijack)
The Makefile prepends .venv/bin to PATH (so the venv's ansible tools resolve),
but virt-install's `#!/usr/bin/env python3` shebang then resolved to the
isolated venv, which lacks system PyGObject (gi) -> ModuleNotFoundError. Strip
.venv/bin from PATH for the virt-install call so its shebang finds
/usr/bin/python3 (which has gi); ansible runs via its absolute .venv path and is
unaffected. Surfaced running `make test-integration HOST=ubongo`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
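One way to strip the venv from PATH for a single subprocess, as 26bb7e442d describes. This is a sketch, not the driver's code: both function names are hypothetical, and matching on a trailing `.venv/bin` component is an assumption about how the Makefile spells the entry.

```python
import os


def path_without_venv(path: str, venv_bin: str = ".venv/bin") -> str:
    """Drop every PATH entry whose trailing components are the venv bin dir."""
    suffix = os.path.normpath(venv_bin)
    kept = []
    for entry in path.split(os.pathsep):
        if not entry:
            continue
        normalized = os.path.normpath(entry)
        if normalized == suffix or normalized.endswith(os.sep + suffix):
            continue  # this is the venv; leave it out
        kept.append(entry)
    return os.pathsep.join(kept)


def virt_install_env(environ: dict) -> dict:
    """Copy of the environment with the venv removed from PATH."""
    env = dict(environ)
    env["PATH"] = path_without_venv(env.get("PATH", ""))
    return env
```

The result would be passed as `env=` to the `subprocess` call that launches virt-install, leaving the parent process's PATH untouched.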
bc8592616b fix: address final whole-branch review findings
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:52:28 +02:00
051c040343 fix(integration): exclude transient .run/ from linters; --- in generated inventory
Running the harness leaves tests/integration/.run/ (gitignored, generated); exclude it from yamllint + ansible-lint so a post-run 'make lint' passes. Also emit a --- doc-start in the generated inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:44:12 +02:00
35446538df fix(integration-vm): apt-ready VMs + sudo-read serial console diagnostics
cloud-init package_update:true + block on 'cloud-init status --wait' in up() so apply sees populated apt lists (fresh genericcloud images ship empty lists); dump_diagnostics()/console() read the root:0600 serial log via sudo instead of shutil.copy, which raised PermissionError mid-diagnostics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
f27514860e fix(integration-vm): boot test VMs via UEFI
The Debian 13 genericcloud image triple-faults at the legacy real-mode kernel
handoff under SeaBIOS/q35 (boot-loops at GRUB, no 'Decompressing Linux', no DHCP
lease). Booting via UEFI (OVMF -> efistub) bypasses the legacy entry and boots
cleanly: cloud-init runs, DHCP lease obtained, SSH reachable. Verified end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:13:35 +02:00
65bacb25fa feat(integration-vm): force DHCP via explicit cloud-init network-config
Don't rely on the genericcloud image's network fallback; the seed now carries a
network-config forcing dhcp4 on en* interfaces. A correct prerequisite for the VM
to network once cloud-init processes the seed. (Note: a separate no-DHCP-lease
issue on first real boot is still under investigation — the guest isn't networking
and, under the no-sudo claude model, the VM console/logs aren't introspectable
without libguestfs; see next steps.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 15:05:49 +02:00
e5256696d6 fix(integration-vm): place VM disk/seed/console in CACHE_DIR for system-qemu
Under qemu:///system the hypervisor runs as libvirt-qemu, which cannot traverse
/home/claude — so the overlay/seed/console must live in /var/lib/boma-integration
(group libvirt, world-traversable, created by the integration_test role), not the
repo/home RUN_DIR. The inventory (hosts.yml + group_vars symlink, read by ansible
as claude) stays in RUN_DIR. Verified: virt-install now creates the domain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 14:56:35 +02:00
147eb874ea fix(integration-vm): pin LIBVIRT_DEFAULT_URI=qemu:///system
Bare virsh/virt-install default to qemu:///session for a non-root caller, but
the substrate, /dev/kvm, and the boma-it NAT network live on the SYSTEM libvirtd.
Pin the URI so the driver targets system regardless of who runs it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 14:41:31 +02:00
ed1187d1c3 fix(integration-vm): point ansible -i at hosts.yml, not the run dir
The driver passed -i <RUN_DIR>/ (a directory); ansible's directory-inventory
loader then parsed sibling files (notably 'current', which holds the real host
string 'askari') as INI inventory, creating phantom hosts incl. the real askari
with its full hostvars — violating the single-host safety invariant (and a hard
error in ansible 2.18 on the binary qcow2/seed files). Point -i at the single
hosts.yml file; ansible still loads the adjacent group_vars symlink. (review C1)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 13:04:54 +02:00
4fb4cf99c3 fix(integration-vm): boot-id-verified reboot + actionable timeouts + inventory guard (review)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:28:06 +02:00
68abd67ce6 feat(integration-vm): teardown, prune, console, full cycle + dispatch
2026-06-18 12:21:06 +02:00
8ea9966d88 feat(integration-vm): reboot, verify run, failure diagnostics
2026-06-18 12:20:52 +02:00
d1c91930ac feat(integration-vm): transient inventory + real-playbook apply
2026-06-18 12:20:37 +02:00
fdd4df34b1 feat(integration-vm): network + VM boot (overlay, cloud-init seed, virt-install import)
2026-06-18 12:20:25 +02:00
af76763c16 feat(integration-vm): golden image fetch + SHA512 verification
2026-06-18 12:19:58 +02:00
a8dc3c787a feat(integration-vm): cert-tier + profile + transient inventory rendering
2026-06-18 12:17:37 +02:00
64767ac187 feat(integration-vm): driver skeleton + CLI dispatch
2026-06-18 12:11:41 +02:00
c1323a3f29 feat(make): registry-login via vaulted Forgejo token (kaizen)
scripts/registry-login.sh reads vault.forgejo.registry_token and pipes it to
docker login --password-stdin (never echoed, never on argv); 'make registry-login'
wires it with the venv binaries. Adds the operator-minted CHANGEME vault stub
(fill via make edit-vault) and a per-machine prereq note in the claude-code-setup
runbook, so 'make caddy-image-push'/'molecule-image-push' become agent-completable
non-interactively. Consumes the 2026-06-15 signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:50:07 +02:00
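The repository does this in a shell script (scripts/registry-login.sh); the Python rendering below only illustrates the "stdin, never argv" idea from c1323a3f29. The function names are hypothetical, and how the token is read from the vault is deliberately left out.

```python
import subprocess
from typing import Callable


def login_argv(registry: str, username: str) -> list:
    """docker login command line; the token is deliberately absent."""
    return ["docker", "login", "--username", username, "--password-stdin", registry]


def registry_login(registry: str, username: str, token: str,
                   run: Callable = subprocess.run):
    """Feed the token on stdin so it never appears on argv or in output."""
    return run(
        login_argv(registry, username),
        input=token,          # delivered through a pipe, not the process table
        text=True,
        check=True,
        capture_output=True,  # keep docker's own output off the terminal
    )
```

Because the token only travels through `input=`, it does not show up in `ps` output or shell history.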
b0c0150db2 feat(scan): repo-scan rename-incomplete check (kaizen)
When a numbered ADR announces a rename Old->New, flag design-doc lines where
Old still appears in present tense — skipping the announcing ADR, lines that
also name New, and historical/negation cues, and rejecting ADR-NNN tokens as
terms. Structural cousin of stale-deferred; run by /review-repo. Zero findings
on the current tree (the Traefik->Caddy ripple edits have landed). Consumes the
2026-06-14 KEEP-OPEN signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:49:41 +02:00
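The filtering rules listed in b0c0150db2 might look roughly like this. It is a simplified sketch: the cue list is invented for illustration, and the ADR-NNN term rejection and the skip for the announcing ADR are omitted.

```python
import re

# Hypothetical cue list; the real check's cues are not shown in the log.
HISTORICAL_CUES = ("formerly", "previously", "renamed", "replaced", "no longer")


def rename_incomplete(lines, old: str, new: str):
    """Yield (line_number, text) where `old` still reads as current."""
    old_re = re.compile(rf"\b{re.escape(old)}\b")
    for number, text in enumerate(lines, start=1):
        if not old_re.search(text):
            continue
        if new in text:
            continue  # the line already names the replacement
        lowered = text.lower()
        if any(cue in lowered for cue in HISTORICAL_CUES):
            continue  # historical or negated mention
        yield number, text
```

Applied to the Traefik->Caddy rename, a line such as "Traefik terminates TLS." would be flagged, while "We previously ran Traefik." would not.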
fd1e83a378 fix(kaizen): scope still_exists to repo paths; test age nudge; tidy --today
- Add REPO_DIRS constant; still_exists now only checks tokens that start
  with a known repo top-level dir, ignoring plugin names (caddy-dns/gandi),
  make command fragments (tf-init/plan), and role-relative paths.
- Add test_still_exists_ignores_non_repo_tokens (was failing before fix).
- Add test_nudge_line_overdue_on_age to close coverage gap on age threshold.
- Add load_signals docstring.
- Replace manual --today date parsing with datetime.date.fromisoformat type
  converter so malformed dates give a clean argparse error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:25:03 +02:00
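Two of the changes in fd1e83a378, sketched under assumptions: the contents of `REPO_DIRS` are placeholders (the commit names the constant but not its values), and `is_repo_path` and `build_parser` are hypothetical names.

```python
import argparse
import datetime

# Placeholder values; the real constant's contents are not shown in the log.
REPO_DIRS = ("docs", "roles", "playbooks", "scripts", "tests")


def is_repo_path(token: str) -> bool:
    """True only for tokens rooted at a known top-level repo directory."""
    head, sep, _ = token.partition("/")
    return bool(sep) and head in REPO_DIRS


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="friction-scan")
    # argparse turns a ValueError raised by the type callable into a usage
    # error, so a malformed date exits cleanly instead of with a traceback.
    parser.add_argument(
        "--today",
        type=datetime.date.fromisoformat,
        default=None,
        help="override today's date (YYYY-MM-DD)",
    )
    return parser
```

With this filter, `caddy-dns/gandi` and `tf-init/plan` are ignored because their first component is not a repo directory.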
b185ac4765 feat(kaizen): friction-scan CLI (--json default, --nudge)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:18:16 +02:00
c6f66ee634 feat(kaizen): recurrence count + referenced-path existence
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:17:39 +02:00
72b9262f34 feat(kaizen): parse tag/first_seen/age per signal
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:17:03 +02:00
859732b04d feat(kaizen): friction-scan section extraction + signal split
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:16:36 +02:00
9e0c264658 docs: reconcile lower-severity review findings (O9-O24)
- ADR-007: document ubongo on the legacy V4 net at 10.20.10.151 (transitional,
  outside the planned srv /24 until the LAN is re-cut) (O10); single authoritative
  boma.baobab.band -> boma.wingu.me transition note already added earlier
- terraform tfvars.example + variables.tf (both envs): pve01 -> pve0 and
  <host>.boma.baobab.band per ADR-007 naming (O11)
- ADR-012/013/015/016/017/018: convert "See also:" prose to `## Related` sections
  placed after Consequences, matching ADR-014/019-023 (O13)
- docs/README + inventories/README: list the missing subdirs / offsite_hosts +
  offsite.yml merge behaviour (O14, O29 note)
- ADR-009: drop the retired `nyumbani` example; use vaultwarden.wingu.me split-horizon (O19)
- ROADMAP M2: askari shipped as cx23/x86 (CAX11/ARM out of stock) (O20)
- ADR-020: 80/443/3478 opened in M4a (past tense); coordinator role is M4b (O21)
- netbird -> netbird_coordinator across ROADMAP M4b, the M4b plan, ADR-024 (O23)
- ADR-024: align the M1 DNS-01 wildcard scope wording with ROADMAP (O24)
- capacity-scan.py: read the inventory directory so offsite.yml (askari) is seen (O28)
- tf_to_inventory.py: generated header now warns it overwrites the manual control node (O9)
- tests/tags.yml: proxy concern comment Traefik -> Caddy (missed in the O3 sweep)

O9's existing stub hosts.yml header stays as-is (generator-owned, hook-protected);
the fix lives in the generator for the next regeneration. make lint + pytest (57) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:31:40 +02:00
64f1e821d8 docs(review): 2026-06-14 repo audit — M4a doc drift + Traefik→Caddy lag
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:37:54 +02:00
9d4a49d49d feat(vault): CHANGEME placeholder convention + check-vault flags them
Streamline the recurring secret-entry friction: the agent stubs a needed secret as
vault.<service>.<key>=CHANGEME with a what/how-to-obtain comment, wires the code,
and commits; the operator fills it via make edit-vault (real value never hits chat).
check-vault now lists outstanding CHANGEME placeholders so none are forgotten.
Convention documented in CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:40:37 +02:00
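Listing outstanding placeholders, as 9d4a49d49d describes, could be a short recursive walk over the decrypted vault map. This is a sketch assuming the nested vault.<service>.<key> layout; `changeme_paths` is a hypothetical name, not necessarily what check-vault.py uses.

```python
def changeme_paths(node, prefix="vault"):
    """Return dotted paths of every leaf still set to the CHANGEME placeholder."""
    if isinstance(node, dict):
        found = []
        for key, value in node.items():
            found.extend(changeme_paths(value, f"{prefix}.{key}"))
        return found
    return [prefix] if node == "CHANGEME" else []
```

The function reports paths only, so secret values are never part of its output.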
79f2315eee feat(make): add edit-vault + check-vault targets
`make edit-vault` runs `ansible-vault edit` (decrypt → nvim → re-encrypt on :wq,
abort on :cq) so editing the vault is one step with no plaintext left in the work
tree, then validates structure. `make check-vault` runs scripts/check-vault.py:
decrypts in-memory, asserts valid YAML with secrets under the nested `vault:` map
and no empty leaves, and prints a values-masked structure view (comments visible,
secrets never printed). Both default to the production all-vault; override VAULT=.

Update the vault header comment, CLAUDE.md (command table + Secrets section), and
scripts/README to point at edit-vault (note check-vault.py is the one venv-
dependent helper, by design).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:36:15 +02:00
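The two structural checks named in 79f2315eee (no empty leaves, values-masked view) can be illustrated on an already-decrypted, already-parsed map. Decryption and YAML loading are left out, and both function names are hypothetical.

```python
def empty_leaves(node, prefix="vault"):
    """Dotted paths of leaves that are None or an empty string."""
    if isinstance(node, dict):
        return [
            path
            for key, value in node.items()
            for path in empty_leaves(value, f"{prefix}.{key}")
        ]
    return [prefix] if node is None or node == "" else []


def masked(node):
    """Same shape as the input, with every leaf value replaced by a mask."""
    if isinstance(node, dict):
        return {key: masked(value) for key, value in node.items()}
    return "********"
```

Printing `masked(vault)` shows which keys exist without revealing any secret.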
d0a3307822 docs(adr): fix 007/008 heading nesting; require date in Superseded status
Final-review polish: demote the sub-headings under the demoted 'IP addressing'
(007) and 'Three testing levels'/'What Molecule tests' (008) to #### so they
nest correctly instead of flattening to siblings. Tighten the adr-structure
Superseded pattern to require '(YYYY-MM-DD)' per ADR-023.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 15:00:58 +02:00
6d7d27b03b docs(adr): add Proposed lifecycle state; mark ADR-011 Proposed
Revisits the lifecycle decision on the evidence of ADR-011 (a real draft
with open questions). Adds a fourth state, Proposed (YYYY-MM-DD), to ADR-023,
the template, the adr-structure check (+test), spec and plan. Sets ADR-011's
Status to Proposed and removes its now-redundant inline 'Proposed' line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:48:55 +02:00
a3ea0f7d80 feat(review): add adr-structure check to repo-scan
Flags numbered ADRs missing a mandatory section (Status/Context/Decision/
Consequences) or with an unparseable Status line. Presence only, not order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 13:57:42 +02:00
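A presence-only section check in the spirit of a3ea0f7d80, assuming the mandatory sections appear as Markdown headings. The Status-line parsing is not covered, and `missing_sections` is a hypothetical name.

```python
import re

MANDATORY = ("Status", "Context", "Decision", "Consequences")
_HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown: str) -> list:
    """Mandatory section titles with no matching heading; order is ignored."""
    present = {title.strip().lower() for title in _HEADING_RE.findall(markdown)}
    return [name for name in MANDATORY if name.lower() not in present]
```

Because the headings are collected into a set before comparison, an ADR that lists Decision before Context still passes, matching the "presence only, not order" rule.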
fac438cc92 fix(tags): recognize name: role key; only check roles: in plays
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 15:20:09 +02:00
5aeeb094eb feat(tags): enforce role imports carry their role-name tag
Adds role_tag_problems() to check-tags.py: every role imported in a
play's roles: block must carry its own role name as a tag (extra tags
allowed; templated role names skipped). Wires the check into main() so
make lint catches violations. 6 new unit tests (29 total, all passing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 15:12:48 +02:00
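The rule from 5aeeb094eb can be sketched over plays that have already been loaded from YAML. This is an illustrative re-implementation, not the code in check-tags.py: the input shape and message wording are assumptions, and it also accepts the `name:` role key that the follow-up fix fac438cc92 mentions.

```python
def role_tag_problems(plays):
    """Return messages for roles: entries that lack their own name as a tag."""
    problems = []
    for play in plays:
        for entry in play.get("roles") or []:
            if isinstance(entry, str):
                name, tags = entry, []  # bare role name, no tags at all
            else:
                name = entry.get("role") or entry.get("name")
                tags = entry.get("tags") or []
            if isinstance(tags, str):
                tags = [tags]  # YAML allows a single tag as a scalar
            if not name or "{{" in name:
                continue  # templated role names cannot be checked statically
            if name not in tags:
                problems.append(f"role '{name}' is missing tag '{name}'")
    return problems
```

Extra tags are allowed: an entry tagged `[base, bootstrap]` for role `base` produces no finding.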
2e5a1e1e23 fix(tags): exclude molecule scenarios from tag scan; clarify ADR enforcement
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:50:14 +02:00
a3ea2aceb2 feat(tags): scan roles/+playbooks/ and fail on unknown tags
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:33:12 +02:00
b45118dac3 feat(tags): checker helpers — tag collection & allowed-set
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:28:03 +02:00
568729e7bd repo-scan: cut broken-path-ref + marker false positives
- broken-path-ref: skip template/generated-report paths — a placeholder
  (<service>) immediately following the match, a YYYY-MM-DD date token, or a
  path under a generated-report reviews/ dir (14 -> 0 on the current tree).
- marker: skip numbered-backlog references (TODO 8.2, TODO-3.1, TODO (2.2,
  TODO item 16) which point at the backlog, not code markers (35 -> 2; the
  remaining two are literal "TODO:" strings in a plan doc). Real code markers
  (TODO:, FIXME, etc.) still caught — verified with a synthetic fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 20:37:40 +02:00
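The marker exclusion in 568729e7bd can be approximated with two regular expressions. The patterns below are a guess built from the four examples in the commit message (TODO 8.2, TODO-3.1, TODO (2.2, TODO item 16), not the scanner's actual expressions, and `is_code_marker` is a hypothetical name.

```python
import re

# A marker followed by a backlog item number, e.g. "TODO 8.2" or "TODO item 16".
_BACKLOG_REF = re.compile(r"\bTODO(?:\s+item\s+|[\s-]*\(?)\d+(?:\.\d+)?")
_MARKER = re.compile(r"\b(?:TODO|FIXME)\b")


def is_code_marker(line: str) -> bool:
    """True when the line holds a marker that is not a backlog reference."""
    remainder = _BACKLOG_REF.sub("", line)  # drop backlog references first
    return bool(_MARKER.search(remainder))
```

Removing backlog references before searching means a line containing both "TODO 8.2" and a literal "TODO:" is still reported.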
d8afa94c4b Name and propagate the offsite_hosts inventory group (askari)
Review O4: ADR-016 said askari gets "its own inventory group" but never named it.
Settled as offsite_hosts (off-site, distinct from on-site-but-off-cluster ubongo).
Added to VALID_GROUPS (tf_to_inventory.py), ADR-009 valid groups, ADR-001/ADR-016
host-group enumerations, and CLAUDE.md. Generated hosts.yml picks up the section on
the next make tf-inventory (a manual-exception group like control).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:54:54 +02:00
f566fd17eb review-repo: add stale-deferred check for ADR Deferred entries
repo-scan.py now enumerates open ADR "Deferred/Open" items and flags any that
another file describes as resolved but which isn't marked resolved in place
(the recurring miss in docs/FRICTION.md). review-repo.md's Phase 2 reviewer
confirms each open item against later ADRs/STATUS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:13:49 +02:00
4c535c908e Record ADR-012 + STATUS/CLAUDE/scripts docs for capacity tooling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:34:38 +02:00
05694f6ea4 Complete capacity-scan.py: usage stub, subprocess glue, main()
Adds gather_usage() (stubbed, returns available:false), known_hostnames()
with graceful degradation when terraform/ansible-inventory are absent,
_run_json() helper, and main() that parses reference.md and emits JSON.
Three new TDD tests (12 total, all passing). Script exits 0 with valid
JSON even when no cluster is provisioned.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:30:45 +02:00
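Graceful degradation of the kind 05694f6ea4 describes usually comes down to treating "tool missing", "tool failed" and "output is not JSON" the same way. A sketch follows; `run_json` is a stand-in for the commit's `_run_json()` helper, whose real signature is not shown.

```python
import json
import subprocess


def run_json(argv, timeout=30):
    """Run a command and parse stdout as JSON; return None on any failure."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, check=True, timeout=timeout
        )
        return json.loads(result.stdout)
    except (OSError, subprocess.SubprocessError, json.JSONDecodeError):
        # OSError covers a missing binary; SubprocessError covers a non-zero
        # exit status and a timeout.
        return None
```

A caller can then fall back to an empty host set when the result is None, which lets the script still exit 0 with valid JSON on a machine without terraform or ansible-inventory.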
8ed00c9206 Add hostname parsers + find_drift() to capacity-scan.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:24:11 +02:00
b240fa8bfe Add compute_rollup() to capacity-scan.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:21:22 +02:00
07ecbb2789 Add capacity-scan.py with parse_table()
Implements the parse_table() function and pytest test harness for the
capacity-scan script. Tests cover header matching and graceful empty
return when the required header is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:20:10 +02:00
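Header matching with a graceful empty return, as described in 07ecbb2789, might be implemented along these lines for a Markdown pipe table. The signature is an assumption; the real `parse_table()` may take different arguments.

```python
def parse_table(markdown: str, required_header: str) -> list:
    """Parse the first pipe table whose header row contains `required_header`."""
    rows, headers = [], None
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            if headers is not None:
                break  # the matched table has ended
            continue
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        if headers is None:
            if required_header in cells:
                headers = cells  # found the header row we were asked for
            continue
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as |---|---|
        rows.append(dict(zip(headers, cells)))
    return rows
```

When no header row contains the required column, the function returns an empty list rather than raising, so callers can treat an absent table as "no data".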
703f1716e5 review-repo: harden scanner, apply safe fixes, record first review
First /review-repo run on boma. Hardened repo-scan.py (no TODO.md/prose false
positives). Applied 7 safe fixes (DNS staleness x2, STATUS factual correction,
hosts.yml path generalisation, trunk-based wording x2, scripts/README). Recorded
the run and 17 open findings in docs/reviews/2026-05-30-*.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 19:10:58 +02:00
b33130eea9 Add /review-repo command with deterministic pre-scan and reviews store
New on-demand repo audit: scripts/repo-scan.py does the cheap deterministic
checks (markers, broken refs, unencrypted vaults) and inventory; the command
fans out judgement reviewers across four dimensions, applies only safe/obvious
fixes, and writes a tracked report to docs/reviews/. Cron + email deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 18:56:01 +02:00
4ee1b66e23 Source vault password from Vaultwarden via rbw; nest vault structure
Master vault password is fetched from Vaultwarden via the rbw agent
(scripts/vault-pass-client.sh, wired as vault_password_file) instead of a
plaintext .vault_pass. Vault secrets use a nested vault.<service>.<key> map.
Encrypted vault.yml files are excluded from lint. Includes the host rename in
Makefile and STATUS.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 18:16:35 +02:00
3f1d7eb128 Add core Ansible scaffold, tooling, and pre-commit guards
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:10:01 +02:00