boma/CLAUDE.md
sjat 4732730515 docs: wire ADR-025 into testing/control-host/risks/status/capacity
- ADR-008: add reboot-survivability gap row + ADR-025 pointer to the
  "not tested in Molecule" table
- ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs
  (ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing
- accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records)
- CLAUDE.md: add make test-integration[/-clean] to key-commands;
  add ADR-025 + runbook rows to further-reading
- hardware/reference.md: note one ephemeral KVM test VM on ubongo
- STATUS.md: add integration harness entry (built, lint+pytest clean;
  RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 12:51:22 +02:00

# CLAUDE.md — Ansible homelab monorepo
This file is read by Claude Code at the start of every session.
Keep it dense and command-focused. Verbose detail lives in `docs/`.
> **Before assuming a role, provider, or pipeline exists, check `STATUS.md`.**
> Much of the design in `docs/decisions/` is intended, not yet built (e.g. the
> `base`/`docker_host` roles are currently empty; Terraform is not `init`ed).
---
## Project in one paragraph
Homelab infrastructure automation for a Proxmox cluster running 25 Debian 13 VMs.
All hosts share a hardened base configuration. Each host runs a defined set of Docker
services deployed via Compose files rendered from Ansible templates. Ansible runs from
a dedicated physical control node (`ubongo`) outside the cluster. CI runs on Forgejo
Actions (self-hosted).
Full design rationale: `docs/decisions/`

---
## Key commands
| Action | Command |
|-------------------------------|--------------------------------------------------|
| Lint everything | `make lint` |
| Test a single role | `make test ROLE=<name>` |
| Test all roles | `make test-all` |
| Check mode (dry run) | `make check PLAYBOOK=<name>` |
| Deploy a playbook | `make deploy PLAYBOOK=<name>` |
| Scaffold a new role | `make new-role NAME=<name>` |
| Review repo for drift/cruft | `/review-repo` (Claude command) |
| Review hardware capacity | `/capacity-review` (Claude command) |
| Edit the vault (nvim, auto re-encrypt) | `make edit-vault [VAULT=<path>]` |
| Validate vault structure | `make check-vault [VAULT=<path>]` |
| Encrypt a vault file | `make encrypt FILE=<path>` |
| Decrypt a vault file | `make decrypt FILE=<path>` |
| Install Python deps | `make setup` |
| Install Ansible collections | `make collections` |
| Initialise Terraform | `make tf-init [TF_ENV=staging]` |
| Terraform plan | `make tf-plan [TF_ENV=staging]` |
| Terraform apply | `make tf-apply [TF_ENV=staging]` |
| Regenerate Ansible inventory | `make tf-inventory TF_ENV=<staging\|production>` |
| Integration-test a host on a local VM | `make test-integration HOST=<name> [CERTS=…]` |
| Clean up integration test VMs | `make test-integration-clean` |

**Always `tf-plan` before `tf-apply`. Always `check` before `deploy`. Never skip lint.**

`TF_ENV` defaults to `staging`. Always specify `TF_ENV=production` explicitly for production.

---
## Ansible conventions
- **FQCN always**: `ansible.builtin.template`, never `template`
- **Tags** (ADR-019): import each role with its role-name tag once at the play level
(Ansible inherits it to every task). Tag a task/block with a concern tag from the
approved list (`tests/tags.yml`) only where it genuinely belongs to that concern —
don't invent tags or tag for tagging's sake. Target one axis at a time (role/service
*or* concern; tags are union/OR, never intersected). `make lint` enforces the vocabulary and that each role import carries its role-name tag.
- **Handlers**: use `listen:` topic strings, not direct name references
- **Variables**: `rolename__varname` double-underscore namespace for role defaults
- **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
- **Loops**: prefer `loop:` over `with_items:`
- **Loop var keys**: index with `item['key']`, never `item.key` — a key named
`values`/`keys`/`items`/`get`/… resolves to the dict *method*, not the value, so the
result is silently corrupt and non-idempotent
- **Conditionals**: prefer `true`/`false` over `yes`/`no`
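Taken together, the conventions above might look like the following minimal sketch. It assumes a hypothetical `reverse_proxy` role; the file paths, the `reverse_proxy__sites` and `reverse_proxy__compose_dir` variables, the template name, and the handler topic are illustrative and not taken from the repo.

```yaml
# playbooks/reverse_proxy.yml (hypothetical): role imported once, tagged with its role name
- name: Configure the reverse proxy
  hosts: docker_hosts
  become: true                      # true/false, not yes/no
  tasks:
    - name: Import the reverse_proxy role
      ansible.builtin.import_role:  # FQCN, never the short module name
        name: reverse_proxy
      tags: [reverse_proxy]         # inherited by every task in the role
---
# roles/reverse_proxy/tasks/main.yml (hypothetical)
- name: Render one config file per site
  ansible.builtin.template:
    src: site.conf.j2
    # item['name'], not item.name: bracket lookup can never hit a dict method
    dest: "{{ reverse_proxy__compose_dir }}/sites/{{ item['name'] }}.conf"
    mode: "0644"
  loop: "{{ reverse_proxy__sites }}"    # loop:, not with_items:
  notify: reverse_proxy config changed  # a listen topic, not a handler name
---
# roles/reverse_proxy/handlers/main.yml (hypothetical)
- name: Restart the reverse proxy container
  ansible.builtin.command:
    cmd: docker compose restart
    chdir: "{{ reverse_proxy__compose_dir }}"
  changed_when: true
  listen: reverse_proxy config changed
```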
---
## Secrets
- Encrypted files are always named `vault.yml`, sitting alongside `vars.yml`
- Never put plaintext secrets in any file not named `vault.yml`
- Structure secrets as a nested map `vault.<service>.<key>` (e.g.
`vault.grafana.admin_password`); reference as `{{ vault.grafana.admin_password }}`
- Vault password comes from Vaultwarden via `rbw` (`scripts/vault-pass-client.sh`,
wired as `vault_password_file`). Unlock once per session: `rbw unlock`
- **Before any vault-dependent task** (`make deploy`/`check`/`encrypt`/`decrypt`, or **any
git commit** — the pre-commit ansible-lint hook decrypts `vault.yml`), run
`rbw unlocked`; if it exits non-zero, ask the user to run `rbw unlock` and wait, rather
than starting and failing partway. The `rbw` agent stays unlocked for 5 h.
- To edit the vault: `make edit-vault` — decrypts → opens nvim → re-encrypts on `:wq`
(abort with `:cq`), then `check-vault` validates structure. No plaintext lands in the
work tree. Override the file with `VAULT=<path>`. (The lower-level `make decrypt`/
`encrypt FILE=<path>` still exist for scripted/non-interactive edits.)
- `make check-vault` validates the vault decrypts, is valid YAML, keeps secrets under the
nested `vault:` map, and has no empty leaves — printing a structure view with values
masked. Needs `rbw` unlocked. It also **flags any leaf still set to `CHANGEME`** (see
next bullet).
- **Stubbing a secret the operator must supply** (don't ping-pong over chat): when a new
secret is needed, the agent itself adds the vault entry with the sentinel value
**`CHANGEME`** plus a comment stating *what it is and how to obtain it*, wires the code
to `{{ vault.<service>.<key> }}`, and commits that. Then prompt the operator to run
`make edit-vault`, replace the `CHANGEME`(s) with the real value(s) — which never touch
the conversation — and re-encrypt. `make check-vault` lists any outstanding `CHANGEME`
placeholders so nothing is forgotten. The agent never handles the real secret.
---
## Role conventions
- Every role must have a `molecule/default/` scenario targeting Debian 13
- Every role must have a populated `README.md`
- Every role must have `meta/main.yml` filled in
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
data. A stateless service records `backup__state: false` with a reason.
- One service = one self-contained role; no shared multi-service roles (ADR-004)
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
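For the stateless case mentioned above, a role's defaults might carry something like the fragment below. Only `backup__state: false` comes from this section; the file path is hypothetical, and the reason is shown as a plain comment because the actual field for it is defined by ADR-022 and may differ.

```yaml
# roles/<service>/defaults/main.yml (hypothetical excerpt)
# Reason: the service keeps no data that must survive a rebuild
# (illustrative wording; see ADR-022 for how the reason is really recorded).
backup__state: false
```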
---
## Inventory structure
```
inventories/
  production/          # live hosts — edit with care
    hosts.yml
    group_vars/
      all/             # applies to every host
        vars.yml
        vault.yml
      docker_hosts/    # hosts running Docker services
      proxmox_hosts/   # Proxmox nodes themselves
      offsite_hosts/   # off-site hosts (askari) — NetBird coordinator + watchdog
    host_vars/         # per-host overrides
  staging/             # safe to run freely
```
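For orientation, the groups this section names could appear in a standard Ansible YAML inventory roughly as follows. This is a sketch of the shape only: `hosts.yml` is regenerated by `make tf-inventory`, the actual output of `tf_to_inventory.py` may differ, and the `*-example-01` host names are placeholders.

```yaml
# inventories/production/hosts.yml (shape only; the real file is generated)
all:
  children:
    control:
      hosts:
        ubongo: {}              # physical control node, added manually
    offsite_hosts:
      hosts:
        askari: {}              # off-site Hetzner host, added manually
    proxmox_hosts:
      hosts:
        pve-example-01: {}      # hypothetical name
    docker_hosts:
      hosts:
        docker-example-01: {}   # hypothetical name
```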
Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`
(`control` holds `ubongo`, the one manually-provisioned **physical** control node
outside the cluster; `offsite_hosts` holds `askari`, the off-site Hetzner host that
runs the NetBird coordinator + watchdog — also added manually. See ADR-009, ADR-015,
ADR-016.)

---
## Git conventions
Single-contributor, trunk-based (no merge requests / approval gates):
- `main` is the trunk and must always work — small, safe changes commit straight to it
- Branch for sweeping or AI-driven changes you want to review as one diff or be able
to abandon: `role/<name>`, `fix/<description>`, `feat/<description>`,
`chore/<description>`; merge to `main` when reviewed, then delete the branch
- Run `make lint` (and `make test` for touched roles) before committing
- Commit in logical units; imperative subject ≤72 chars
- AI agents commit their own work in logical units with a `Co-Authored-By` trailer
- Push to the Forgejo `origin` often — it is the off-machine backup
- Never commit secrets; a `vault.yml` must be `$ANSIBLE_VAULT`-encrypted (pre-commit
enforces this, plus gitleaks secret scanning)
---
## Dependencies policy
- **No Galaxy roles** — all roles are local; never add a Galaxy role to `requirements.yml`
- **Collections on demand** — only add a collection when a task in a committed role
uses a module from it; add a comment in `requirements.yml` naming the module(s) used
- Full rationale: `docs/decisions/003-toolchain.md` (Collections and roles policy)
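A `requirements.yml` entry following this policy might look like the sketch below. The `community.docker` collection and its `docker_compose_v2` module are real, but whether boma uses them is an assumption made for illustration.

```yaml
# requirements.yml: collections only, no `roles:` section
collections:
  # Needed for: community.docker.docker_compose_v2
  # (assumed to be used by a committed role's tasks; hypothetical here)
  - name: community.docker
```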
---
## Terraform conventions
- Terraform owns VM existence only — nothing inside a VM, and no DNS records
- Every TF-managed VM carries three Proxmox tags — `<env>`, its inventory `group`, and
`managed-by=terraform` — as **metadata only** (ADR-019). They do not feed inventory
or run-targeting; `tf_to_inventory.py` still groups by the `group` output field.
- Internal DNS is entirely Ansible (the `dns` role renders the zone from inventory)
- OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider
- Environments are separate directories (`staging/`, `production/`), not workspaces
- Secrets via `TF_VAR_*` env vars only — never in `.tfvars` files
- `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
- `.terraform.lock.hcl` is tracked (pins provider versions)
- Every module declares its own `required_providers` (in `versions.tf`) for any
non-hashicorp provider — otherwise TF infers `hashicorp/<name>` and `init` fails
(caught only by a live `tf-init`, not by static review)
- Full rationale: `docs/decisions/006-terraform.md`
---
## What Claude must not do without explicit instruction
- Run `make deploy` — always run `make check` first and show output
- Run `make tf-apply` — always run `make tf-plan` first and show output
- Modify `inventories/<env>/hosts.yml` directly — regenerate via `make tf-inventory`
- Edit vault-encrypted files directly — use `make edit-vault`, or decrypt first and re-encrypt after
- Force-push or rewrite already-pushed history on `main`
- Add a collection to `requirements.yml` without a specific module need in existing role tasks
- Open a firewall port anywhere but the `group_vars` service catalog — never ad-hoc on a host. If it's not in the catalog, it doesn't exist (ADR-002, ADR-020)
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)
- Justify a decision by AnsibleBaobabV4 precedent, or import its structure/requirements/values — consult V4 only per ADR-013 (gotchas/configs, announced, re-derived on boma's terms)
---
## Sourcing technical knowledge (ADR-014)
- **Facts vs judgments.** Version-specific facts (syntax, options, defaults) have one
authoritative answer — consult **version-matched** official docs and cite. Best
practices are *evidence to translate through boma's principles* (ADR-013), never
authority ("because the docs / a blog say so" is banned).
- **When to verify (risk-based):** required when security-relevant, the tool is
unfamiliar / fast-moving or newer than your training, or you'd assert a specific
flag/option/default you can't quote with confidence. Otherwise memory is fine — but
mark any from-memory version-specific claim **"from memory, unverified."**
- **Sources:** `context7` for library docs · upstream docs via `WebFetch` · Claude
Code/SDK/API → `claude-code-guide` agent · broad questions → `deep-research`. These
are plugins and may be absent on a fresh checkout — fall back to `WebFetch`/`WebSearch`
(core tools). Match the **pinned** version, not "latest."
- **Stamp verified facts** next to them: `verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>`.
---
## Further reading
| Topic | File |
|------------------------|---------------------------------------|
| Architecture overview | `docs/decisions/001-architecture.md` |
| Build order / roadmap | `docs/ROADMAP.md` |
| Capabilities overview (what boma does) | `docs/CAPABILITIES.md` |
| Security baseline & strategy | `docs/decisions/002-security.md` |
| Accepted security risks | `docs/security/accepted-risks.md` |
| Per-service security checklist | `docs/security/service-checklist.md` |
| Per-service security record (template) | `docs/security/service-security-template.md` |
| Per-service verification spec (template) | `docs/testing/service-verify-template.md` |
| Heritage / V4 policy | `docs/decisions/013-heritage-v4.md` |
| Sourcing tech knowledge | `docs/decisions/014-knowledge-sourcing.md` |
| Toolchain choices | `docs/decisions/003-toolchain.md` |
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
| Terraform | `docs/decisions/006-terraform.md` |
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
| Testing methodology | `docs/decisions/008-testing.md` |
| Service-UI verification (Level 4) | `docs/decisions/017-service-ui-verification.md` |
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
| Update management | `docs/decisions/011-update-management.md` |
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Logging & log integrity | `docs/decisions/018-logging.md` |
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
| Firewall strategy | `docs/decisions/020-firewall.md` |
| Operational access | `docs/decisions/021-operational-access.md` |
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
| ADR structure & lifecycle | `docs/decisions/023-adr-structure.md` |
| Reverse proxy (Caddy) | `docs/decisions/024-reverse-proxy.md` |
| Local VM integration testing (ADR-025) | `docs/decisions/025-local-vm-integration-testing.md` |
| Integration testing runbook | `docs/runbooks/integration-testing.md` |
| Adding a new role | `docs/runbooks/new-role.md` |
| Adding a new host | `docs/runbooks/new-host.md` |
| Enrolling a NetBird client (laptop/phone) | `docs/runbooks/netbird-client.md` |
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
| Claude Code setup (per machine) | `docs/runbooks/claude-code-setup.md` |