sjat/boma

sjat 4732730515 docs: wire ADR-025 into testing/control-host/risks/status/capacity

- ADR-008: add reboot-survivability gap row + ADR-025 pointer to the
  "not tested in Molecule" table
- ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs
  (ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing
- accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records)
- CLAUDE.md: add make test-integration[/-clean] to key-commands;
  add ADR-025 + runbook rows to further-reading
- hardware/reference.md: note one ephemeral KVM test VM on ubongo
- STATUS.md: add integration harness entry (built, lint+pytest clean;
  RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-18 12:51:22 +02:00

CLAUDE.md — Ansible homelab monorepo

This file is read by Claude Code at the start of every session. Keep it dense and command-focused. Verbose detail lives in docs/.

Before assuming a role, provider, or pipeline exists, check STATUS.md. Much of the design in docs/decisions/ is intended, not yet built (e.g. the base/docker_host roles are currently empty; Terraform has not been initialised).

Project in one paragraph

Homelab infrastructure automation for a Proxmox cluster running 2–5 Debian 13 VMs. All hosts share a hardened base configuration. Each host runs a defined set of Docker services deployed via Compose files rendered from Ansible templates. Ansible runs from a dedicated physical control node (ubongo) outside the cluster. CI runs on Forgejo Actions (self-hosted).

Full design rationale: docs/decisions/

Key commands

Action	Command
Lint everything	`make lint`
Test a single role	`make test ROLE=<name>`
Test all roles	`make test-all`
Check mode (dry run)	`make check PLAYBOOK=<name>`
Deploy a playbook	`make deploy PLAYBOOK=<name>`
Scaffold a new role	`make new-role NAME=<name>`
Review repo for drift/cruft	`/review-repo` (Claude command)
Review hardware capacity	`/capacity-review` (Claude command)
Edit the vault (nvim, auto re-encrypt)	`make edit-vault [VAULT=<path>]`
Validate vault structure	`make check-vault [VAULT=<path>]`
Encrypt a vault file	`make encrypt FILE=<path>`
Decrypt a vault file	`make decrypt FILE=<path>`
Install Python deps	`make setup`
Install Ansible collections	`make collections`
Initialise Terraform	`make tf-init [TF_ENV=staging]`
Terraform plan	`make tf-plan [TF_ENV=staging]`
Terraform apply	`make tf-apply [TF_ENV=staging]`
Regenerate Ansible inventory	`make tf-inventory TF_ENV=<staging|production>`
Integration-test a host on a local VM	`make test-integration HOST=<name> [CERTS=…]`
Clean up integration test VMs	`make test-integration-clean`

Always tf-plan before tf-apply. Always check before deploy. Never skip lint.

TF_ENV defaults to staging. Always specify TF_ENV=production explicitly for production.

Ansible conventions

FQCN always: ansible.builtin.template, never template
Tags (ADR-019): import each role with its role-name tag once at the play level (Ansible inherits it to every task). Tag a task/block with a concern tag from the approved list (tests/tags.yml) only where it genuinely belongs to that concern — don't invent tags or tag for tagging's sake. Target one axis at a time (role/service or concern; tags are union/OR, never intersected). make lint enforces the vocabulary and that each role import carries its role-name tag.
Handlers: use listen: topic strings, not direct name references
Variables: rolename__varname double-underscore namespace for role defaults
No inline vars in playbooks: use group_vars/ or host_vars/ only
Loops: prefer loop: over with_items:
Loop var keys: index with item['key'], never item.key — a key named values/keys/items/get/… resolves to the dict method (silently corrupt + non-idempotent), not the value
Conditionals: prefer true/false over yes/no
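
A minimal sketch of how the conventions above combine. The role name `reverse_proxy` is borrowed from the naming examples; the playbook path, the `reverse_proxy__sites` variable, the site entries, the destination path, and the handler topic are illustrative assumptions, not taken from the repo.

```yaml
---
# playbooks/site.yml (hypothetical): one role import, tagged with its role name.
# Ansible inherits the tag to every task in the role.
- name: Configure Docker hosts
  hosts: docker_hosts
  roles:
    - role: reverse_proxy
      tags: [reverse_proxy]

---
# roles/reverse_proxy/defaults/main.yml: rolename__varname namespace.
reverse_proxy__sites:
  - name: grafana
    values: [grafana.example.internal]   # key name shadows the dict method
    enabled: true

---
# roles/reverse_proxy/tasks/main.yml: FQCN, loop:, item['key'] indexing.
- name: Render one site config per entry
  ansible.builtin.template:
    src: site.conf.j2
    dest: "/opt/reverse_proxy/sites/{{ item['name'] }}.conf"
    mode: "0644"
  loop: "{{ reverse_proxy__sites }}"
  when: item['enabled']
  notify: reload reverse proxy   # a listen topic, not a handler name

- name: Show the hostnames for each site
  ansible.builtin.debug:
    # item.values would resolve to the dict method, not this list.
    msg: "{{ item['name'] }}: {{ item['values'] }}"
  loop: "{{ reverse_proxy__sites }}"

---
# roles/reverse_proxy/handlers/main.yml: handlers subscribe via listen:.
- name: Reload the reverse proxy service
  ansible.builtin.service:
    name: reverse-proxy   # hypothetical unit name
    state: reloaded
  listen: reload reverse proxy
```

Each `---` document stands for a separate file. With the role-name tag set at the import, a run limited with `--tags reverse_proxy` selects every task in the role without tagging tasks one by one.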

Secrets

Encrypted files are always named vault.yml, sitting alongside vars.yml
Never put plaintext secrets in any file not named vault.yml
Structure secrets as a nested map vault.<service>.<key> (e.g. vault.grafana.admin_password); reference as {{ vault.grafana.admin_password }}
Vault password comes from Vaultwarden via rbw (scripts/vault-pass-client.sh, wired as vault_password_file). Unlock once per session: rbw unlock
Before any vault-dependent task (make deploy/check/encrypt/decrypt, or any git commit, because the pre-commit ansible-lint hook decrypts vault.yml), run rbw unlocked; if it exits non-zero, ask the user to rbw unlock and wait rather than starting and failing partway. The rbw agent stays unlocked for 5h.
To edit the vault: make edit-vault decrypts → opens nvim → re-encrypts on :wq (abort with :cq), then check-vault validates structure. No plaintext lands in the work tree. Override the file with VAULT=<path>. (The lower-level make decrypt/encrypt FILE=<path> still exist for scripted/non-interactive edits.)
make check-vault validates the vault decrypts, is valid YAML, keeps secrets under the nested vault: map, and has no empty leaves — printing a structure view with values masked. Needs rbw unlocked. It also flags any leaf still set to CHANGEME (see next bullet).
Stubbing a secret the operator must supply (don't ping-pong over chat): when a new secret is needed, the agent itself adds the vault entry with the sentinel value CHANGEME plus a comment stating what it is and how to obtain it, wires the code to {{ vault.<service>.<key> }}, and commits that. Then prompt the operator to run make edit-vault, replace the CHANGEME(s) with the real value(s) — which never touch the conversation — and re-encrypt. make check-vault lists any outstanding CHANGEME placeholders so nothing is forgotten. The agent never handles the real secret.
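
As a rough illustration of the nested map and the CHANGEME stub, a decrypted view of a vault file might look like the sketch below. Only `vault.grafana.admin_password` comes from this document; the `netbird.setup_key` entry, its comment, the placeholder value, and the `grafana__admin_password` variable are hypothetical.

```yaml
---
# inventories/production/group_vars/all/vault.yml (decrypted view only;
# the file on disk stays ansible-vault encrypted)
vault:
  grafana:
    admin_password: "placeholder-not-a-real-secret"
  netbird:
    # Hypothetical stub added by the agent.
    # What: setup key used to enrol new NetBird peers.
    # How to obtain: create one in the NetBird dashboard, then paste it
    # in via `make edit-vault`.
    setup_key: CHANGEME

---
# inventories/production/group_vars/all/vars.yml: plaintext file,
# references only, never the secret itself.
grafana__admin_password: "{{ vault.grafana.admin_password }}"
```

In this state `make check-vault` would be expected to report `setup_key` as an outstanding CHANGEME until the operator replaces it.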

Role conventions

Every role must have a molecule/default/ scenario targeting Debian 13
Every role must have a populated README.md
Every role must have meta/main.yml filled in
Every service role must have a populated SECURITY.md (ADR-002/004) — copy docs/security/service-security-template.md
Every service role must have a populated VERIFY.md (ADR-008/017) — copy docs/testing/service-verify-template.md
Every service role must have a populated ACCESS.md (ADR-021) — copy docs/access/service-access-template.md; rendered from the role's access__* data
Every service role that holds state must have a populated BACKUP.md (ADR-022) — copy docs/backup/service-backup-template.md; rendered from the role's backup__* data. A stateless service records backup__state: false with a reason.
One service = one self-contained role; no shared multi-service roles (ADR-004)
Role names: snake_case, descriptive nouns (base, docker_host, reverse_proxy)
Use make new-role NAME=<name> to scaffold — never create role structure by hand
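
A hedged sketch of what a filled-in meta/main.yml and a stateless service's backup declaration could look like. The role name `whoami`, the description, the license, the minimum Ansible version, and the `backup__state_reason` key are assumptions for illustration; only `backup__state: false` and the requirement to give a reason come from this document.

```yaml
---
# roles/whoami/meta/main.yml (hypothetical role)
galaxy_info:
  role_name: whoami
  author: sjat
  description: Stateless HTTP echo service for reverse-proxy smoke tests
  license: MIT                 # assumption; use the repo's actual license
  min_ansible_version: "2.16"  # assumption; match the pinned toolchain
  platforms:
    - name: Debian
      versions:
        # from memory, unverified: whether the pinned ansible-lint schema
        # accepts "trixie" (Debian 13); "all" is the conservative fallback.
        - trixie
dependencies: []

---
# roles/whoami/defaults/main.yml
backup__state: false
# Hypothetical key name for the required reason.
backup__state_reason: "Holds no persistent data; fully rebuilt on deploy"
```

In practice `make new-role NAME=<name>` generates the skeleton; the values above are the kind of content that then has to be filled in by hand.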

Inventory structure

inventories/
  production/         # live hosts — edit with care
    hosts.yml
    group_vars/
      all/            # applies to every host
        vars.yml
        vault.yml
      docker_hosts/   # hosts running Docker services
      proxmox_hosts/  # Proxmox nodes themselves
      offsite_hosts/  # off-site hosts (askari) — NetBird coordinator + watchdog
    host_vars/        # per-host overrides
  staging/            # safe to run freely

Host groups: all, control, docker_hosts, proxmox_hosts, offsite_hosts

(control holds ubongo, the one manually-provisioned physical control node outside the cluster; offsite_hosts holds askari, the off-site Hetzner host that runs the NetBird coordinator + watchdog — also added manually. See ADR-009, ADR-015, ADR-016.)
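
For orientation, the host groups map onto a YAML inventory roughly as sketched below. This shows the shape only: the real hosts.yml is regenerated by `make tf-inventory`, and the names `docker01` and `pve01` are hypothetical. `ubongo` and `askari` are the manually added hosts named above.

```yaml
---
# inventories/<env>/hosts.yml (illustrative shape, not the generated file)
all:
  children:
    control:
      hosts:
        ubongo:        # physical control node, added manually
    docker_hosts:
      hosts:
        docker01:      # hypothetical host name
    proxmox_hosts:
      hosts:
        pve01:         # hypothetical host name
    offsite_hosts:
      hosts:
        askari:        # off-site Hetzner host, added manually
```

Variables for each group then live under the matching `group_vars/<group>/` directory, with per-host overrides in `host_vars/`.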

Git conventions

Single-contributor, trunk-based (no merge requests / approval gates):

main is the trunk and must always work — small, safe changes commit straight to it
Branch for sweeping or AI-driven changes you want to review as one diff or be able to abandon: role/<name>, fix/<description>, feat/<description>, chore/<description>; merge to main when reviewed, then delete the branch
Run make lint (and make test for touched roles) before committing
Commit in logical units; imperative subject ≤72 chars
AI agents commit their own work in logical units with a Co-Authored-By trailer
Push to the Forgejo origin often — it is the off-machine backup
Never commit secrets; a vault.yml must be $ANSIBLE_VAULT-encrypted (pre-commit enforces this, plus gitleaks secret scanning)

Dependencies policy

No Galaxy roles — all roles are local; never add a Galaxy role to requirements.yml
Collections on demand — only add a collection when a task in a committed role uses a module from it; add a comment in requirements.yml naming the module(s) used
Full rationale: docs/decisions/003-toolchain.md (Collections and roles policy)
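
A requirements.yml that follows this policy might look like the sketch below. The collection, the module named in the comment, the role that uses it, and the version constraint are assumptions chosen for illustration; the repo's actual file may differ.

```yaml
---
# requirements.yml
# No `roles:` key: Galaxy roles are not allowed, all roles are local.
collections:
  # Module used: community.docker.docker_compose_v2
  # (hypothetical use in roles/docker_host/tasks/main.yml)
  # from memory, unverified: docker_compose_v2 first shipped in 3.6.0.
  - name: community.docker
    version: ">=3.6.0"
```

The comment names the module that justifies the entry, so an unused collection stands out during review.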

Terraform conventions

Terraform owns VM existence only — nothing inside a VM, and no DNS records
Every TF-managed VM carries three Proxmox tags — <env>, its inventory group, and managed-by=terraform — as metadata only (ADR-019). They do not feed inventory or run-targeting; tf_to_inventory.py still groups by the group output field.
Internal DNS is entirely Ansible (the dns role renders the zone from inventory)
OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider
Environments are separate directories (staging/, production/), not workspaces
Secrets via TF_VAR_* env vars only — never in .tfvars files
terraform.tfvars.example is tracked; terraform.tfvars is gitignored
.terraform.lock.hcl is tracked (pins provider versions)
Every module declares its own required_providers (in versions.tf) for any non-hashicorp provider — otherwise TF infers hashicorp/<name> and init fails (caught only by a live tf-init, not by static review)
Full rationale: docs/decisions/006-terraform.md

What Claude must not do without explicit instruction

Run make deploy — always run make check first and show output
Run make tf-apply — always run make tf-plan first and show output
Modify inventories/<env>/hosts.yml directly — regenerate via make tf-inventory
Edit vault-encrypted files directly — decrypt first, re-encrypt after
Force-push or rewrite already-pushed history on main
Add a collection to requirements.yml without a specific module need in existing role tasks
Open a firewall port anywhere but the group_vars service catalog — never ad-hoc on a host. If it's not in the catalog, it doesn't exist (ADR-002, ADR-020)
Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
Deploy a service that hasn't cleared docs/security/service-checklist.md (record any deviation in docs/security/accepted-risks.md)
Justify a decision by AnsibleBaobabV4 precedent, or import its structure/requirements/values — consult V4 only per ADR-013 (gotchas/configs, announced, re-derived on boma's terms)

Sourcing technical knowledge (ADR-014)

Facts vs judgments. Version-specific facts (syntax, options, defaults) have one authoritative answer — consult version-matched official docs and cite. Best practices are evidence to translate through boma's principles (ADR-013), never authority ("because the docs / a blog say so" is banned).
When to verify (risk-based): required when security-relevant, the tool is unfamiliar / fast-moving or newer than your training, or you'd assert a specific flag/option/default you can't quote with confidence. Otherwise memory is fine — but mark any from-memory version-specific claim "from memory, unverified."
Sources: context7 for library docs · upstream docs via WebFetch · Claude Code/SDK/API → claude-code-guide agent · broad questions → deep-research. These are plugins and may be absent on a fresh checkout — fall back to WebFetch/WebSearch (core tools). Match the pinned version, not "latest."
Stamp verified facts next to them: verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>.

Further reading

Topic	File
Architecture overview	`docs/decisions/001-architecture.md`
Build order / roadmap	`docs/ROADMAP.md`
Capabilities overview (what boma does)	`docs/CAPABILITIES.md`
Security baseline & strategy	`docs/decisions/002-security.md`
Accepted security risks	`docs/security/accepted-risks.md`
Per-service security checklist	`docs/security/service-checklist.md`
Per-service security record (template)	`docs/security/service-security-template.md`
Per-service verification spec (template)	`docs/testing/service-verify-template.md`
Heritage / V4 policy	`docs/decisions/013-heritage-v4.md`
Sourcing tech knowledge	`docs/decisions/014-knowledge-sourcing.md`
Toolchain choices	`docs/decisions/003-toolchain.md`
Docker & Compose model	`docs/decisions/004-docker-model.md`
Bootstrapping hosts	`docs/decisions/005-bootstrapping.md`
Control / AI-worker host (`ubongo`)	`docs/decisions/015-control-host.md`
Terraform	`docs/decisions/006-terraform.md`
Network topology	`docs/decisions/007-network.md`
Mesh VPN (NetBird, self-hosted)	`docs/decisions/016-mesh-vpn.md`
Testing methodology	`docs/decisions/008-testing.md`
Service-UI verification (Level 4)	`docs/decisions/017-service-ui-verification.md`
TF ↔ Ansible handoff	`docs/decisions/009-provisioning-handoff.md`
Forgejo & CI	`docs/decisions/010-forgejo-ci.md`
Update management	`docs/decisions/011-update-management.md`
Hardware & capacity	`docs/decisions/012-hardware-capacity.md`
Logging & log integrity	`docs/decisions/018-logging.md`
Tagging & run-targeting	`docs/decisions/019-tagging.md`
Firewall strategy	`docs/decisions/020-firewall.md`
Operational access	`docs/decisions/021-operational-access.md`
Backup & disaster recovery	`docs/decisions/022-backup.md`
ADR structure & lifecycle	`docs/decisions/023-adr-structure.md`
Reverse proxy (Caddy)	`docs/decisions/024-reverse-proxy.md`
Local VM integration testing (ADR-025)	`docs/decisions/025-local-vm-integration-testing.md`
Integration testing runbook	`docs/runbooks/integration-testing.md`
Adding a new role	`docs/runbooks/new-role.md`
Adding a new host	`docs/runbooks/new-host.md`
Enrolling a NetBird client (laptop/phone)	`docs/runbooks/netbird-client.md`
Rotating vault secrets	`docs/runbooks/rotate-secrets.md`
Claude Code setup (per machine)	`docs/runbooks/claude-code-setup.md`
