# CLAUDE.md: Ansible homelab monorepo

This file is read by Claude Code at the start of every session.
Keep it dense and command-focused. Verbose detail lives in `docs/`.

> **Before assuming a role, provider, or pipeline exists, check `STATUS.md`.**
> Much of the design in `docs/decisions/` is intended, not yet built (e.g. the
> `base`/`docker_host` roles are currently empty; Terraform is not `init`ed).

---

## Project in one paragraph

Homelab infrastructure automation for a Proxmox cluster running 2–5 Debian 13 VMs.
All hosts share a hardened base configuration. Each host runs a defined set of Docker
services deployed via Compose files rendered from Ansible templates. Ansible runs from
a dedicated physical control node (`ubongo`) outside the cluster. CI runs on Forgejo
Actions (self-hosted).

Full design rationale: `docs/decisions/`

---

## Key commands

| Action | Command |
|---|---|
| Lint everything | `make lint` |
| Test a single role | `make test ROLE=` |
| Test all roles | `make test-all` |
| Check mode (dry run) | `make check PLAYBOOK=` |
| Deploy a playbook | `make deploy PLAYBOOK=` |
| Scaffold a new role | `make new-role NAME=` |
| Review repo for drift/cruft | `/review-repo` (Claude command) |
| Review hardware capacity | `/capacity-review` (Claude command) |
| Edit the vault (nvim, auto re-encrypt) | `make edit-vault [VAULT=]` |
| Validate vault structure | `make check-vault [VAULT=]` |
| Encrypt a vault file | `make encrypt FILE=` |
| Decrypt a vault file | `make decrypt FILE=` |
| Install Python deps | `make setup` |
| Install Ansible collections | `make collections` |
| Initialise Terraform | `make tf-init [TF_ENV=staging]` |
| Terraform plan | `make tf-plan [TF_ENV=staging]` |
| Terraform apply | `make tf-apply [TF_ENV=staging]` |
| Regenerate Ansible inventory | `make tf-inventory TF_ENV=` |
| Integration-test a host on a local VM | `make test-integration HOST= [CERTS=…]` |
| Clean up integration test VMs | `make test-integration-clean` |

**Always `tf-plan` before `tf-apply`. Always `check` before `deploy`. Never skip lint.**

`TF_ENV` defaults to `staging`. Always specify `TF_ENV=production` explicitly for production.

---

## Ansible conventions

- **FQCN always**: `ansible.builtin.template`, never `template`
- **Tags** (ADR-019): import each role with its role-name tag once at the play level
  (Ansible propagates it to every task in the role). Tag a task/block with a concern tag
  from the approved list (`tests/tags.yml`) only where it genuinely belongs to that
  concern; don't invent tags or tag for tagging's sake. Target one axis at a time
  (role/service *or* concern; tags are union/OR, never intersected). `make lint` enforces
  the vocabulary and that each role import carries its role-name tag.
- **Handlers**: use `listen:` topic strings, not direct name references
- **Variables**: `rolename__varname` double-underscore namespace for role defaults
- **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
- **Loops**: prefer `loop:` over `with_items:`
- **Loop var keys**: index with `item['key']`, never `item.key`. A key named
  `values`/`keys`/`items`/`get`/… resolves to the dict *method*, not the value
  (silently corrupt and non-idempotent)
- **Conditionals**: prefer `true`/`false` over `yes`/`no`

---

## Secrets

- Encrypted files are always named `vault.yml`, sitting alongside `vars.yml`
- Never put plaintext secrets in any file not named `vault.yml`
- Structure secrets as a nested map `vault..` (e.g. `vault.grafana.admin_password`);
  reference as `{{ vault.grafana.admin_password }}`
- Vault password comes from Vaultwarden via `rbw` (`scripts/vault-pass-client.sh`,
  wired as `vault_password_file`).
  Unlock once per session: `rbw unlock`
- **Before any vault-dependent task** (`make deploy/check/encrypt/decrypt`, or
  **any git commit**, because the pre-commit ansible-lint hook decrypts `vault.yml`),
  run `rbw unlocked`; if it exits non-zero, ask the user to `rbw unlock` and wait rather
  than starting and failing partway. The `rbw` agent stays unlocked for 5h.
- To edit the vault: `make edit-vault` decrypts → opens nvim → re-encrypts on `:wq`
  (abort with `:cq`), then `check-vault` validates structure. No plaintext lands in the
  work tree. Override the file with `VAULT=`. (The lower-level `make decrypt`/`encrypt FILE=`
  still exist for scripted/non-interactive edits.)
- `make check-vault` validates that the vault decrypts, is valid YAML, keeps secrets
  under the nested `vault:` map, and has no empty leaves, printing a structure view with
  values masked. Needs `rbw` unlocked. It also **flags any leaf still set to `CHANGEME`**
  (see next bullet).
- **Stubbing a secret the operator must supply** (don't ping-pong over chat): when a new
  secret is needed, the agent itself adds the vault entry with the sentinel value
  **`CHANGEME`** plus a comment stating *what it is and how to obtain it*, wires the code
  to `{{ vault.. }}`, and commits that. Then prompt the operator to run `make edit-vault`,
  replace the `CHANGEME`(s) with the real value(s), which never touch the conversation,
  and re-encrypt. `make check-vault` lists any outstanding `CHANGEME` placeholders so
  nothing is forgotten. The agent never handles the real secret.

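
As an illustration of the stubbing convention above, a stubbed vault entry and the variable that consumes it might look like the sketch below. The `grafana` service and `admin_password` key come from the example in this section; the comment wording and the `grafana__admin_password` variable name are assumptions made for illustration, not existing repo content.

```yaml
# vault.yml, shown decrypted here (it is always ansible-vault encrypted when committed)
vault:
  grafana:
    # What: password for the Grafana admin login.
    # How to obtain: generate a new random password and paste it here.
    admin_password: CHANGEME
---
# vars.yml alongside it (plaintext): wire a variable to the vault entry.
# `grafana__admin_password` is a hypothetical name that follows the
# rolename__varname convention.
grafana__admin_password: "{{ vault.grafana.admin_password }}"
```

Once the operator has replaced the placeholder through `make edit-vault`, `make check-vault` should no longer list it as outstanding.
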

---

## Role conventions

- Every role must have a `molecule/default/` scenario targeting Debian 13
- Every role must have a populated `README.md`
- Every role must have `meta/main.yml` filled in
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004): copy
  `docs/security/service-security-template.md`
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017): copy
  `docs/testing/service-verify-template.md`
- Every **service** role must have a populated `ACCESS.md` (ADR-021): copy
  `docs/access/service-access-template.md`; rendered from the role's `access__*` data
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022):
  copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
  data. A stateless service records `backup__state: false` with a reason.
- One service = one self-contained role; no shared multi-service roles (ADR-004)
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
- Use `make new-role NAME=` to scaffold; never create role structure by hand

---

## Inventory structure

```
inventories/
  production/            # live hosts; edit with care
    hosts.yml
    group_vars/
      all/               # applies to every host
        vars.yml
        vault.yml
      docker_hosts/      # hosts running Docker services
      proxmox_hosts/     # Proxmox nodes themselves
      offsite_hosts/     # off-site hosts (askari): NetBird coordinator + watchdog
    host_vars/           # per-host overrides
  staging/               # safe to run freely
```

Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`
(`control` holds `ubongo`, the one manually-provisioned **physical** control node
outside the cluster; `offsite_hosts` holds `askari`, the off-site Hetzner host that
runs the NetBird coordinator + watchdog, also added manually. See ADR-009, ADR-015,
ADR-016.)

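
For orientation, the host groups listed above could map onto an Ansible YAML inventory roughly as in the sketch below. `ubongo` and `askari` are named in this file; `docker01` and `pve01` are hypothetical placeholder names, and the real `hosts.yml` may be laid out differently.

```yaml
# Illustrative sketch only, not a copy of the real inventory.
all:
  children:
    control:
      hosts:
        ubongo:      # physical control node, outside the cluster
    docker_hosts:
      hosts:
        docker01:    # hypothetical VM name
    proxmox_hosts:
      hosts:
        pve01:       # hypothetical Proxmox node name
    offsite_hosts:
      hosts:
        askari:      # off-site host: NetBird coordinator + watchdog
```

Variables for each group would then live under the matching `group_vars/` directory shown in the tree.
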

---

## Git conventions

Single-contributor, trunk-based (no merge requests / approval gates):

- `main` is the trunk and must always work; small, safe changes commit straight to it
- Branch for sweeping or AI-driven changes you want to review as one diff or be able to
  abandon: `role/`, `fix/`, `feat/`, `chore/`; merge to `main` when reviewed, then
  delete the branch
- Run `make lint` (and `make test` for touched roles) before committing
- Commit in logical units; imperative subject ≤72 chars
- AI agents commit their own work in logical units with a `Co-Authored-By` trailer
- Push to the Forgejo `origin` often; it is the off-machine backup
- Never commit secrets; a `vault.yml` must be `$ANSIBLE_VAULT`-encrypted (pre-commit
  enforces this, plus gitleaks secret scanning)

---

## Dependencies policy

- **No Galaxy roles**: all roles are local; never add a Galaxy role to `requirements.yml`
- **Collections on demand**: only add a collection when a task in a committed role uses
  a module from it; add a comment in `requirements.yml` naming the module(s) used
- Full rationale: `docs/decisions/003-toolchain.md` (Collections and roles policy)

---

## Terraform conventions

- Terraform owns VM existence only: nothing inside a VM, and no DNS records
- Every TF-managed VM carries three Proxmox tags (``, its inventory `group`, and
  `managed-by=terraform`) as **metadata only** (ADR-019). They do not feed inventory or
  run-targeting; `tf_to_inventory.py` still groups by the `group` output field.
- Internal DNS is entirely Ansible (the `dns` role renders the zone from inventory)
- OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider
- Environments are separate directories (`staging/`, `production/`), not workspaces
- Secrets via `TF_VAR_*` env vars only, never in `.tfvars` files
- `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
- `.terraform.lock.hcl` is tracked (pins provider versions)
- Every module declares its own `required_providers` (in `versions.tf`) for any
  non-hashicorp provider; otherwise TF infers `hashicorp/` and `init` fails (caught only
  by a live `tf-init`, not by static review)
- Full rationale: `docs/decisions/006-terraform.md`

---

## What Claude must not do without explicit instruction

- Run `make deploy`: always run `make check` first and show output
- Run `make tf-apply`: always run `make tf-plan` first and show output
- Modify `inventories//hosts.yml` directly: regenerate via `make tf-inventory`
- Edit vault-encrypted files directly: decrypt first, re-encrypt after
- Force-push or rewrite already-pushed history on `main`
- Add a collection to `requirements.yml` without a specific module need in existing
  role tasks
- Open a firewall port anywhere but the `group_vars` service catalog, never ad-hoc on a
  host. If it's not in the catalog, it doesn't exist (ADR-002, ADR-020)
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables
  default-deny, fail2ban, auditd)
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with
  authentication (ADR-002)
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any
  deviation in `docs/security/accepted-risks.md`)
- Justify a decision by AnsibleBaobabV4 precedent, or import its
  structure/requirements/values; consult V4 only per ADR-013 (gotchas/configs,
  announced, re-derived on boma's terms)

---

## Sourcing technical knowledge (ADR-014)

- **Facts vs judgments.** Version-specific facts (syntax, options, defaults) have one
  authoritative answer: consult **version-matched** official docs and cite. Best
  practices are *evidence to translate through boma's principles* (ADR-013), never
  authority ("because the docs / a blog say so" is banned).
- **When to verify (risk-based):** required when security-relevant, the tool is
  unfamiliar / fast-moving or newer than your training, or you'd assert a specific
  flag/option/default you can't quote with confidence. Otherwise memory is fine, but
  mark any from-memory version-specific claim **"from memory, unverified."**
- **Sources:** `context7` for library docs · upstream docs via `WebFetch` ·
  Claude Code/SDK/API → `claude-code-guide` agent · broad questions → `deep-research`.
  These are plugins and may be absent on a fresh checkout; fall back to
  `WebFetch`/`WebSearch` (core tools). Match the **pinned** version, not "latest."
- **Stamp verified facts** next to them: `verified: · · · `.


---

## Further reading

| Topic | File |
|---|---|
| Architecture overview | `docs/decisions/001-architecture.md` |
| Build order / roadmap | `docs/ROADMAP.md` |
| Capabilities overview (what boma does) | `docs/CAPABILITIES.md` |
| Security baseline & strategy | `docs/decisions/002-security.md` |
| Accepted security risks | `docs/security/accepted-risks.md` |
| Per-service security checklist | `docs/security/service-checklist.md` |
| Per-service security record (template) | `docs/security/service-security-template.md` |
| Per-service verification spec (template) | `docs/testing/service-verify-template.md` |
| Heritage / V4 policy | `docs/decisions/013-heritage-v4.md` |
| Sourcing tech knowledge | `docs/decisions/014-knowledge-sourcing.md` |
| Toolchain choices | `docs/decisions/003-toolchain.md` |
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
| Terraform | `docs/decisions/006-terraform.md` |
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
| Testing methodology | `docs/decisions/008-testing.md` |
| Service-UI verification (Level 4) | `docs/decisions/017-service-ui-verification.md` |
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
| Update management | `docs/decisions/011-update-management.md` |
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Logging & log integrity | `docs/decisions/018-logging.md` |
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
| Firewall strategy | `docs/decisions/020-firewall.md` |
| Operational access | `docs/decisions/021-operational-access.md` |
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
| ADR structure & lifecycle | `docs/decisions/023-adr-structure.md` |
| Reverse proxy (Caddy) | `docs/decisions/024-reverse-proxy.md` |
| Local VM integration testing (ADR-025) | `docs/decisions/025-local-vm-integration-testing.md` |
| Integration testing runbook | `docs/runbooks/integration-testing.md` |
| Adding a new role | `docs/runbooks/new-role.md` |
| Adding a new host | `docs/runbooks/new-host.md` |
| Enrolling a NetBird client (laptop/phone) | `docs/runbooks/netbird-client.md` |
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
| Claude Code setup (per machine) | `docs/runbooks/claude-code-setup.md` |