CLAUDE.md — Ansible homelab monorepo
This file is read by Claude Code at the start of every session.
Keep it dense and command-focused. Verbose detail lives in docs/.
Before assuming a role, provider, or pipeline exists, check
STATUS.md. Much of the design in docs/decisions/ is intended, not yet built (e.g. the base/docker_host roles are currently empty; Terraform is not inited).
Project in one paragraph
Homelab infrastructure automation for a Proxmox cluster running 2–5 Debian 13 VMs.
All hosts share a hardened base configuration. Each host runs a defined set of Docker
services deployed via Compose files rendered from Ansible templates. Ansible runs from
a dedicated physical control node (ubongo) outside the cluster. CI runs on Forgejo
Actions (self-hosted).
Full design rationale: docs/decisions/
Key commands
| Action | Command |
|---|---|
| Lint everything | make lint |
| Test a single role | make test ROLE=<name> |
| Test all roles | make test-all |
| Check mode (dry run) | make check PLAYBOOK=<name> |
| Deploy a playbook | make deploy PLAYBOOK=<name> |
| Scaffold a new role | make new-role NAME=<name> |
| Review repo for drift/cruft | /review-repo (Claude command) |
| Review hardware capacity | /capacity-review (Claude command) |
| Edit the vault (nvim, auto re-encrypt) | make edit-vault [VAULT=<path>] |
| Validate vault structure | make check-vault [VAULT=<path>] |
| Encrypt a vault file | make encrypt FILE=<path> |
| Decrypt a vault file | make decrypt FILE=<path> |
| Install Python deps | make setup |
| Install Ansible collections | make collections |
| Initialise Terraform | make tf-init [TF_ENV=staging] |
| Terraform plan | make tf-plan [TF_ENV=staging] |
| Terraform apply | make tf-apply [TF_ENV=staging] |
| Regenerate Ansible inventory | make tf-inventory TF_ENV=<staging\|production> |
| Integration-test a host on a local VM | make test-integration HOST=<name> [CERTS=…] |
| Clean up integration test VMs | make test-integration-clean |
Always tf-plan before tf-apply. Always check before deploy. Never skip lint.
TF_ENV defaults to staging. Always specify TF_ENV=production explicitly for production.
Ansible conventions
- FQCN always: ansible.builtin.template, never template
- Tags (ADR-019): import each role with its role-name tag once at the play level (Ansible inherits it to every task). Tag a task/block with a concern tag from the approved list (tests/tags.yml) only where it genuinely belongs to that concern; don't invent tags or tag for tagging's sake. Target one axis at a time (role/service or concern; tags are union/OR, never intersected). make lint enforces the vocabulary and that each role import carries its role-name tag.
- Handlers: use listen: topic strings, not direct name references
- Variables: rolename__varname double-underscore namespace for role defaults
- No inline vars in playbooks: use group_vars/ or host_vars/ only
- Loops: prefer loop: over with_items:
- Loop var keys: index with item['key'], never item.key. A key named values/keys/items/get/… resolves to the dict method (silently corrupt + non-idempotent), not the value
- Conditionals: prefer true/false over yes/no
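The loop-var-key rule can be illustrated without Ansible. The sketch below emulates, in plain Python, the lookup order Jinja2 is understood to use for dotted access (attribute first, then key) versus subscript access (key first, then attribute). The helper names jinja_attr and jinja_item are hypothetical and this is not Ansible's or Jinja2's actual code.

```python
def jinja_attr(obj, name):
    """Emulate dotted lookup (item.name): attribute first, then key."""
    try:
        return getattr(obj, name)
    except AttributeError:
        try:
            return obj[name]
        except (TypeError, LookupError):
            return None  # the real engine would yield an undefined value


def jinja_item(obj, key):
    """Emulate subscript lookup (item['key']): key first, then attribute."""
    try:
        return obj[key]
    except (TypeError, LookupError):
        if isinstance(key, str):
            try:
                return getattr(obj, key)
            except AttributeError:
                pass
        return None


# Hypothetical loop item whose key collides with a dict method name.
item = {"name": "data", "values": [1, 2, 3]}

dotted = jinja_attr(item, "values")       # the bound method dict.values
subscripted = jinja_item(item, "values")  # the list stored under the key

print(callable(dotted))           # True
print(subscripted)                # [1, 2, 3]
print(jinja_attr(item, "name"))   # data
```

Dotted access only misbehaves when the key collides with a dict method, which is why the failure is silent; subscript access is safe for every key.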
Secrets
- Encrypted files are always named vault.yml, sitting alongside vars.yml
- Never put plaintext secrets in any file not named vault.yml
- Structure secrets as a nested map vault.<service>.<key> (e.g. vault.grafana.admin_password); reference as {{ vault.grafana.admin_password }}
- Vault password comes from Vaultwarden via rbw (scripts/vault-pass-client.sh, wired as vault_password_file). Unlock once per session: rbw unlock
- Before any vault-dependent task (make deploy/check/encrypt/decrypt, or any git commit, since the pre-commit ansible-lint hook decrypts vault.yml), run rbw unlocked; if it exits non-zero, ask the user to rbw unlock and wait rather than starting and failing partway. The agent stays unlocked 5h.
- To edit the vault: make edit-vault decrypts → opens nvim → re-encrypts on :wq (abort with :cq), then check-vault validates structure. No plaintext lands in the work tree. Override the file with VAULT=<path>. (The lower-level make decrypt/encrypt FILE=<path> still exist for scripted/non-interactive edits.)
- make check-vault validates that the vault decrypts, is valid YAML, keeps secrets under the nested vault: map, and has no empty leaves, and prints a structure view with values masked. Needs rbw unlocked. It also flags any leaf still set to CHANGEME (see next bullet).
- Stubbing a secret the operator must supply (don't ping-pong over chat): when a new secret is needed, the agent itself adds the vault entry with the sentinel value CHANGEME plus a comment stating what it is and how to obtain it, wires the code to {{ vault.<service>.<key> }}, and commits that. Then prompt the operator to run make edit-vault, replace the CHANGEME(s) with the real value(s) (which never touch the conversation), and re-encrypt. make check-vault lists any outstanding CHANGEME placeholders so nothing is forgotten. The agent never handles the real secret.
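The structural checks described above (nested vault: map, no empty leaves, CHANGEME detection, masked view) can be sketched as a walk over an already-decrypted, already-parsed mapping. This is an illustration only, not the repo's check-vault implementation; the function name audit_vault, the "****" mask, and the forgejo keys in the sample are assumptions.

```python
SENTINEL = "CHANGEME"


def audit_vault(data):
    """Return (masked_view, problems) for a decrypted, parsed vault mapping."""
    problems = []
    # Assumption: every secret lives under one top-level 'vault' map.
    if set(data) != {"vault"} or not isinstance(data.get("vault"), dict):
        problems.append("top level must be a single nested 'vault' map")
        return {}, problems

    def walk(node, path):
        if isinstance(node, dict) and node:
            return {k: walk(v, f"{path}.{k}") for k, v in node.items()}
        if node is None or node == "" or node == {}:
            problems.append(f"{path}: empty leaf")
        elif node == SENTINEL:
            problems.append(f"{path}: still {SENTINEL}")
        return "****"  # never echo a real value

    return {"vault": walk(data["vault"], "vault")}, problems


# Hypothetical vault content, for illustration only.
sample = {
    "vault": {
        "grafana": {"admin_password": "s3cret"},
        "forgejo": {"runner_token": "CHANGEME", "db_password": ""},
    }
}
view, problems = audit_vault(sample)
print(view)
for problem in problems:
    print(problem)
```

Masking at the leaf keeps the structure reviewable in a terminal or chat without exposing any value, which is what the CHANGEME workflow relies on.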
Role conventions
- Every role must have a molecule/default/ scenario targeting Debian 13
- Every role must have a populated README.md
- Every role must have meta/main.yml filled in
- Every service role must have a populated SECURITY.md (ADR-002/004): copy docs/security/service-security-template.md
- Every service role must have a populated VERIFY.md (ADR-008/017): copy docs/testing/service-verify-template.md
- Every service role must have a populated ACCESS.md (ADR-021): copy docs/access/service-access-template.md; rendered from the role's access__* data
- Every service role that holds state must have a populated BACKUP.md (ADR-022): copy docs/backup/service-backup-template.md; rendered from the role's backup__* data. A stateless service records backup__state: false with a reason.
- One service = one self-contained role; no shared multi-service roles (ADR-004)
- Role names: snake_case, descriptive nouns (base, docker_host, reverse_proxy)
- Use make new-role NAME=<name> to scaffold; never create role structure by hand
Inventory structure
    inventories/
      production/            # live hosts, edit with care
        hosts.yml
        group_vars/
          all/               # applies to every host
            vars.yml
            vault.yml
          docker_hosts/      # hosts running Docker services
          proxmox_hosts/     # Proxmox nodes themselves
          offsite_hosts/     # off-site hosts (askari): NetBird coordinator + watchdog
        host_vars/           # per-host overrides
      staging/               # safe to run freely
Host groups: all, control, docker_hosts, proxmox_hosts, offsite_hosts
(control holds ubongo, the one manually-provisioned physical control node
outside the cluster; offsite_hosts holds askari, the off-site Hetzner host that
runs the NetBird coordinator + watchdog — also added manually. See ADR-009, ADR-015,
ADR-016.)
Git conventions
Single-contributor, trunk-based (no merge requests / approval gates):
- main is the trunk and must always work; small, safe changes commit straight to it
- Branch for sweeping or AI-driven changes you want to review as one diff or be able to abandon: role/<name>, fix/<description>, feat/<description>, chore/<description>; merge to main when reviewed, then delete the branch
- Run make lint (and make test for touched roles) before committing
- Commit in logical units; imperative subject ≤72 chars
- AI agents commit their own work in logical units with a Co-Authored-By trailer
- Push to the Forgejo origin often; it is the off-machine backup
- Never commit secrets; a vault.yml must be $ANSIBLE_VAULT-encrypted (pre-commit enforces this, plus gitleaks secret scanning)
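The "vault.yml must be $ANSIBLE_VAULT-encrypted" rule comes down to a header check, since files written by ansible-vault begin with a line such as $ANSIBLE_VAULT;1.1;AES256. The following is a minimal sketch of that check, not the repo's actual pre-commit hook; the function names are hypothetical.

```python
from pathlib import Path

VAULT_HEADER = "$ANSIBLE_VAULT;"


def is_vault_encrypted(text):
    """True when the content starts with the Ansible Vault header line."""
    return text.startswith(VAULT_HEADER)


def unencrypted_vaults(root):
    """List vault.yml files under root whose content lacks the vault header."""
    return sorted(
        str(path)
        for path in Path(root).rglob("vault.yml")
        if not is_vault_encrypted(path.read_text(encoding="utf-8", errors="replace"))
    )


print(is_vault_encrypted("$ANSIBLE_VAULT;1.1;AES256\n6231...\n"))      # True
print(is_vault_encrypted("vault:\n  grafana:\n    admin_password: x\n"))  # False
```

A hook built this way would fail the commit whenever unencrypted_vaults(".") returns a non-empty list.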
Dependencies policy
- No Galaxy roles: all roles are local; never add a Galaxy role to requirements.yml
- Collections on demand: only add a collection when a task in a committed role uses a module from it; add a comment in requirements.yml naming the module(s) used
- Full rationale: docs/decisions/003-toolchain.md (Collections and roles policy)
Terraform conventions
- Terraform owns VM existence only: nothing inside a VM, and no DNS records
- Every TF-managed VM carries three Proxmox tags (<env>, its inventory group, and managed-by=terraform) as metadata only (ADR-019). They do not feed inventory or run-targeting; tf_to_inventory.py still groups by the group output field.
- Internal DNS is entirely Ansible (the dns role renders the zone from inventory)
- OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider
- Environments are separate directories (staging/, production/), not workspaces
- Secrets via TF_VAR_* env vars only, never in .tfvars files
- terraform.tfvars.example is tracked; terraform.tfvars is gitignored; .terraform.lock.hcl is tracked (pins provider versions)
- Every module declares its own required_providers (in versions.tf) for any non-hashicorp provider; otherwise TF infers hashicorp/<name> and init fails (caught only by a live tf-init, not by static review)
- Full rationale: docs/decisions/006-terraform.md
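The "tags are metadata only" point can be made concrete with a small grouping sketch. It assumes a hypothetical input shape (a list of VM records with name, group, and tags) and is not the real tf_to_inventory.py; it only shows grouping on the group field while ignoring tags.

```python
from collections import defaultdict


def group_hosts(vms):
    """Build an Ansible-style inventory dict keyed on each VM's 'group' field.

    Proxmox tags are present in the input but deliberately never read.
    """
    children = defaultdict(lambda: {"hosts": {}})
    for vm in vms:
        children[vm["group"]]["hosts"][vm["name"]] = {}
    return {"all": {"children": dict(children)}}


# Hypothetical Terraform output records; names and shape are assumptions.
vms = [
    {"name": "vm-a", "group": "docker_hosts",
     "tags": ["staging", "docker_hosts", "managed-by=terraform"]},
    {"name": "vm-b", "group": "docker_hosts",
     "tags": ["staging", "some-other-tag", "managed-by=terraform"]},
]
inventory = group_hosts(vms)
print(inventory)
```

Both hosts land in docker_hosts even though the second record's tags disagree, so a mistagged VM cannot change what Ansible targets.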
What Claude must not do without explicit instruction
- Run make deploy: always run make check first and show output
- Run make tf-apply: always run make tf-plan first and show output
- Modify inventories/<env>/hosts.yml directly; regenerate via make tf-inventory
- Edit vault-encrypted files directly; decrypt first, re-encrypt after
- Force-push or rewrite already-pushed history on main
- Add a collection to requirements.yml without a specific module need in existing role tasks
- Open a firewall port anywhere but the group_vars service catalog; never ad-hoc on a host. If it's not in the catalog, it doesn't exist (ADR-002, ADR-020)
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
- Deploy a service that hasn't cleared docs/security/service-checklist.md (record any deviation in docs/security/accepted-risks.md)
- Justify a decision by AnsibleBaobabV4 precedent, or import its structure/requirements/values; consult V4 only per ADR-013 (gotchas/configs, announced, re-derived on boma's terms)
Sourcing technical knowledge (ADR-014)
- Facts vs judgments. Version-specific facts (syntax, options, defaults) have one authoritative answer — consult version-matched official docs and cite. Best practices are evidence to translate through boma's principles (ADR-013), never authority ("because the docs / a blog say so" is banned).
- When to verify (risk-based): required when security-relevant, the tool is unfamiliar / fast-moving or newer than your training, or you'd assert a specific flag/option/default you can't quote with confidence. Otherwise memory is fine — but mark any from-memory version-specific claim "from memory, unverified."
- Sources: context7 for library docs · upstream docs via WebFetch · Claude Code/SDK/API → claude-code-guide agent · broad questions → deep-research. These are plugins and may be absent on a fresh checkout; fall back to WebFetch/WebSearch (core tools). Match the pinned version, not "latest."
- Stamp verified facts next to them: verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>.
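A stamp in the format above is easy to produce consistently with a small helper. This is a sketch; the function name verified_stamp and every argument value in the example are placeholders, not real verification records.

```python
from datetime import date


def verified_stamp(subject, tool, version, source, when):
    """Render 'verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>'."""
    return f"verified: {subject} · {tool} {version} · {source} · {when.isoformat()}"


# Placeholder values only.
stamp = verified_stamp("<subject>", "<tool>", "<version>", "<source>", date(2000, 1, 31))
print(stamp)  # verified: <subject> · <tool> <version> · <source> · 2000-01-31
```

Using date.isoformat() keeps the date in YYYY-MM-DD form regardless of locale.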
Further reading
| Topic | File |
|---|---|
| Architecture overview | docs/decisions/001-architecture.md |
| Build order / roadmap | docs/ROADMAP.md |
| Capabilities overview (what boma does) | docs/CAPABILITIES.md |
| Security baseline & strategy | docs/decisions/002-security.md |
| Accepted security risks | docs/security/accepted-risks.md |
| Per-service security checklist | docs/security/service-checklist.md |
| Per-service security record (template) | docs/security/service-security-template.md |
| Per-service verification spec (template) | docs/testing/service-verify-template.md |
| Heritage / V4 policy | docs/decisions/013-heritage-v4.md |
| Sourcing tech knowledge | docs/decisions/014-knowledge-sourcing.md |
| Toolchain choices | docs/decisions/003-toolchain.md |
| Docker & Compose model | docs/decisions/004-docker-model.md |
| Bootstrapping hosts | docs/decisions/005-bootstrapping.md |
| Control / AI-worker host (ubongo) | docs/decisions/015-control-host.md |
| Terraform | docs/decisions/006-terraform.md |
| Network topology | docs/decisions/007-network.md |
| Mesh VPN (NetBird, self-hosted) | docs/decisions/016-mesh-vpn.md |
| Testing methodology | docs/decisions/008-testing.md |
| Service-UI verification (Level 4) | docs/decisions/017-service-ui-verification.md |
| TF ↔ Ansible handoff | docs/decisions/009-provisioning-handoff.md |
| Forgejo & CI | docs/decisions/010-forgejo-ci.md |
| Update management | docs/decisions/011-update-management.md |
| Hardware & capacity | docs/decisions/012-hardware-capacity.md |
| Logging & log integrity | docs/decisions/018-logging.md |
| Tagging & run-targeting | docs/decisions/019-tagging.md |
| Firewall strategy | docs/decisions/020-firewall.md |
| Operational access | docs/decisions/021-operational-access.md |
| Backup & disaster recovery | docs/decisions/022-backup.md |
| ADR structure & lifecycle | docs/decisions/023-adr-structure.md |
| Reverse proxy (Caddy) | docs/decisions/024-reverse-proxy.md |
| Local VM integration testing (ADR-025) | docs/decisions/025-local-vm-integration-testing.md |
| Integration testing runbook | docs/runbooks/integration-testing.md |
| Adding a new role | docs/runbooks/new-role.md |
| Adding a new host | docs/runbooks/new-host.md |
| Enrolling a NetBird client (laptop/phone) | docs/runbooks/netbird-client.md |
| Rotating vault secrets | docs/runbooks/rotate-secrets.md |
| Claude Code setup (per machine) | docs/runbooks/claude-code-setup.md |