2026-05-30 14:10:01 +02:00
# CLAUDE.md — Ansible homelab monorepo
This file is read by Claude Code at the start of every session.
Keep it dense and command-focused. Verbose detail lives in `docs/`.
> **Before assuming a role, provider, or pipeline exists, check `STATUS.md`.**
> Much of the design in `docs/decisions/` is intended, not yet built (e.g. the
> `base`/`docker_host` roles are currently empty; Terraform is not `init`ed).
---
## Project in one paragraph
Homelab infrastructure automation for a Proxmox cluster running 2–5 Debian 13 VMs.
All hosts share a hardened base configuration. Each host runs a defined set of Docker
services deployed via Compose files rendered from Ansible templates. Ansible runs from
a dedicated physical control node (`ubongo`) outside the cluster. CI runs on Forgejo
Actions (self-hosted).
Full design rationale: `docs/decisions/`
---
## Key commands
| Action | Command |
|-------------------------------|--------------------------------------------------|
| Lint everything | `make lint` |
| Test a single role | `make test ROLE=<name>` |
| Test all roles | `make test-all` |
| Check mode (dry run) | `make check PLAYBOOK=<name>` |
| Deploy a playbook | `make deploy PLAYBOOK=<name>` |
| Scaffold a new role | `make new-role NAME=<name>` |
| Review repo for drift/cruft | `/review-repo` (Claude command) |
| Review hardware capacity | `/capacity-review` (Claude command) |
| Encrypt a vault file | `make encrypt FILE=<path>` |
| Decrypt a vault file | `make decrypt FILE=<path>` |
| Install Python deps | `make setup` |
| Install Ansible collections | `make collections` |
| Initialise Terraform | `make tf-init [TF_ENV=staging]` |
| Terraform plan | `make tf-plan [TF_ENV=staging]` |
| Terraform apply | `make tf-apply [TF_ENV=staging]` |
| Regenerate Ansible inventory | `make tf-inventory TF_ENV=<staging\|production>` |
**Always `tf-plan` before `tf-apply`. Always `check` before `deploy`. Never skip lint.**
`TF_ENV` defaults to `staging`. Always specify `TF_ENV=production` explicitly for production.
---
## Ansible conventions
- **FQCN always**: `ansible.builtin.template`, never `template`
- **Tags** (ADR-019): import each role with its role-name tag once at the play level
(Ansible applies it to every task in the role). Tag a task/block with a concern tag
from the approved list (`tests/tags.yml`) only where it genuinely belongs to that
concern; don't invent tags or tag for tagging's sake. Target one axis at a time
(role/service *or* concern; tags are union/OR, never intersected). `make lint`
enforces the vocabulary and that each role import carries its role-name tag.
- **Handlers**: use `listen:` topic strings, not direct name references
- **Variables**: `rolename__varname` double-underscore namespace for role defaults
- **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
- **Loops**: prefer `loop:` over `with_items:`
- **Conditionals**: prefer `true`/`false` over `yes`/`no`
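
The conventions above can be sketched in one place. This is a minimal illustration, not the repository's actual code: the play name, the `docker_host__*` variables, the template name, and the handler are assumptions made up for the example. Only the `docker_host` role name and the `docker_hosts` group come from this file, and `STATUS.md` notes that the role itself is currently empty.

```yaml
---
# Play (hypothetical): the role is imported once and carries its role-name tag.
- name: Configure Docker hosts
  hosts: docker_hosts
  become: true                      # true/false, not yes/no
  tasks:
    - name: Import the docker_host role
      ansible.builtin.import_role:
        name: docker_host
      tags: [docker_host]

---
# roles/docker_host/defaults/main.yml (variable names are assumptions)
docker_host__config_dirs:
  - /etc/docker
  - /opt/compose
docker_host__daemon_config_path: /etc/docker/daemon.json

---
# roles/docker_host/tasks/main.yml
- name: Create configuration directories
  ansible.builtin.file:             # FQCN, never the short module name
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
    mode: "0755"
  loop: "{{ docker_host__config_dirs }}"

- name: Render the Docker daemon configuration
  ansible.builtin.template:
    src: daemon.json.j2             # hypothetical template
    dest: "{{ docker_host__daemon_config_path }}"
    owner: root
    group: root
    mode: "0644"
  notify: Restart docker            # a listen topic, not a handler name

---
# roles/docker_host/handlers/main.yml
- name: Restart the Docker service
  ansible.builtin.service:
    name: docker
    state: restarted
  listen: Restart docker
```

The play carries no inline `vars:`; anything host- or group-specific would live under `group_vars/` or `host_vars/`.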
---
## Secrets
- Encrypted files are always named `vault.yml`, sitting alongside `vars.yml`
- Never put plaintext secrets in any file not named `vault.yml`
- Structure secrets as a nested map `vault.<service>.<key>` (e.g.
`vault.grafana.admin_password`); reference as `{{ vault.grafana.admin_password }}`
- Vault password comes from Vaultwarden via `rbw` (`scripts/vault-pass-client.sh`,
wired as `vault_password_file`). Unlock once per session: `rbw unlock`
- **Before any vault-dependent task** (`make deploy`/`check`/`encrypt`/`decrypt`, or **any
git commit**, because the pre-commit ansible-lint hook decrypts `vault.yml`), run
`rbw unlocked`; if it exits non-zero, ask the user to run `rbw unlock` and wait rather
than starting and failing partway. The `rbw` agent stays unlocked for 5 hours.
- To edit a vault file: `make decrypt FILE=<path>`, edit, `make encrypt FILE=<path>`
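
As a sketch of the nested-map layout described above, the pair of files might look like this. The value is a placeholder, and the `grafana__admin_password` variable name is an assumption chosen to match the role-variable convention; only the `vault.grafana.admin_password` path comes from this file.

```yaml
---
# group_vars/all/vault.yml, shown decrypted for illustration only.
# On disk and in git this file must be Ansible Vault-encrypted.
vault:
  grafana:
    admin_password: "CHANGE_ME"     # placeholder, not a real secret

---
# group_vars/all/vars.yml, plaintext: references the secret, never contains it.
grafana__admin_password: "{{ vault.grafana.admin_password }}"   # hypothetical variable
```

Keeping the reference in `vars.yml` means a plain-text search for a variable still finds where it is defined, while the value itself stays in the encrypted file.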
---
## Role conventions
- Every role must have `molecule/default/` scenario targeting Debian 13
- Every role must have a populated `README.md`
- Every role must have `meta/main.yml` filled in
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
- One service = one self-contained role; no shared multi-service roles (ADR-004)
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
---
## Inventory structure
```
inventories/
  production/            # live hosts; edit with care
    hosts.yml
    group_vars/
      all/               # applies to every host
        vars.yml
        vault.yml
      docker_hosts/      # hosts running Docker services
      proxmox_hosts/     # Proxmox nodes themselves
      offsite_hosts/     # off-site hosts (askari): NetBird coordinator + watchdog
    host_vars/           # per-host overrides
  staging/               # safe to run freely
```
Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`
(`control` holds `ubongo`, the one manually provisioned **physical** control node
outside the cluster; `offsite_hosts` holds `askari`, the off-site Hetzner host that
runs the NetBird coordinator + watchdog, also added manually. See ADR-009, ADR-015,
ADR-016.)
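
For orientation, the group layout could map onto a YAML inventory roughly as follows. This is a sketch of the shape only: `hosts.yml` is regenerated by `make tf-inventory`, so the real file may differ, and every host name other than `ubongo` and `askari` is a hypothetical placeholder.

```yaml
---
# inventories/<env>/hosts.yml (illustrative shape, do not edit the real file by hand)
all:
  children:
    control:
      hosts:
        ubongo:            # physical control node, provisioned manually
    docker_hosts:
      hosts:
        docker-example-01: # hypothetical VM name
    proxmox_hosts:
      hosts:
        pve-example-01:    # hypothetical Proxmox node name
    offsite_hosts:
      hosts:
        askari:            # off-site Hetzner host, added manually
```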
---
## Git conventions
Single-contributor, trunk-based (no merge requests / approval gates):
- `main` is the trunk and must always work — small, safe changes commit straight to it
- Branch for sweeping or AI-driven changes you want to review as one diff or be able
to abandon: `role/<name>`, `fix/<description>`, `feat/<description>`,
`chore/<description>`; merge to `main` when reviewed, then delete the branch
- Run `make lint` (and `make test` for touched roles) before committing
- Commit in logical units; imperative subject ≤72 chars
- AI agents commit their own work in logical units with a `Co-Authored-By` trailer
- Push to the Forgejo `origin` often — it is the off-machine backup
- Never commit secrets; a `vault.yml` must be `$ANSIBLE_VAULT`-encrypted (pre-commit
enforces this, plus gitleaks secret scanning)
---
## Dependencies policy
- **No Galaxy roles** — all roles are local; never add a Galaxy role to `requirements.yml`
- **Collections on demand** — only add a collection when a task in a committed role
uses a module from it; add a comment in `requirements.yml` naming the module(s) used
- Full rationale: `docs/decisions/003-toolchain.md` (Collections and roles policy)
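
A `requirements.yml` entry that follows this policy might look like the sketch below. The collection and module are real, but their use here is an assumption for illustration; the actual file should list only collections that a committed role already uses.

```yaml
---
# requirements.yml (illustrative entry)
collections:
  # Modules used: community.docker.docker_compose_v2
  # (hypothetical usage in a service role; replace with the real module list)
  - name: community.docker
```

There is deliberately no `roles:` section, since all roles are local.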
---
## Terraform conventions
- Terraform owns VM existence only — nothing inside a VM, and no DNS records
- Every TF-managed VM carries three Proxmox tags (`<env>`, its inventory `group`, and
`managed-by=terraform`) as **metadata only** (ADR-019). They do not feed inventory
or run-targeting; `tf_to_inventory.py` still groups by the `group` output field.
- Internal DNS is entirely Ansible (the `dns` role renders the zone from inventory)
- OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider
- Environments are separate directories (`staging/`, `production/`), not workspaces
- Secrets via `TF_VAR_*` env vars only — never in `.tfvars` files
- `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
- `.terraform.lock.hcl` is tracked (pins provider versions)
- Full rationale: `docs/decisions/006-terraform.md`
---
## What Claude must not do without explicit instruction
- Run `make deploy` — always run `make check` first and show output
- Run `make tf-apply` — always run `make tf-plan` first and show output
- Modify `inventories/<env>/hosts.yml` directly — regenerate via `make tf-inventory`
- Edit vault-encrypted files directly — decrypt first, re-encrypt after
- Force-push or rewrite already-pushed history on `main`
- Add a collection to `requirements.yml` without a specific module need in existing role tasks
- Open a firewall port anywhere but the `group_vars` firewall definitions — never ad-hoc on a host (ADR-002)
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)
- Justify a decision by AnsibleBaobabV4 precedent, or import its structure/requirements/values — consult V4 only per ADR-013 (gotchas/configs, announced, re-derived on boma's terms)
---
## Sourcing technical knowledge (ADR-014)
- **Facts vs judgments.** Version-specific facts (syntax, options, defaults) have one
authoritative answer — consult **version-matched** official docs and cite. Best
practices are *evidence to translate through boma's principles* (ADR-013), never
authority ("because the docs / a blog say so" is banned).
- **When to verify (risk-based):** required when security-relevant, the tool is
unfamiliar / fast-moving or newer than your training, or you'd assert a specific
flag/option/default you can't quote with confidence. Otherwise memory is fine, but
mark any from-memory version-specific claim **"from memory, unverified."**
- **Sources:** `context7` for library docs · upstream docs via `WebFetch` · Claude
Code/SDK/API → `claude-code-guide` agent · broad questions → `deep-research`. These
are plugins and may be absent on a fresh checkout; fall back to `WebFetch`/`WebSearch`
(core tools). Match the **pinned** version, not "latest."
- **Stamp verified facts** next to them: `verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>`.
---
## Further reading
| Topic | File |
|-------|------|
| Architecture overview | `docs/decisions/001-architecture.md` |
| Capabilities overview (what boma does) | `docs/CAPABILITIES.md` |
| Security baseline & strategy | `docs/decisions/002-security.md` |
| Accepted security risks | `docs/security/accepted-risks.md` |
| Per-service security checklist | `docs/security/service-checklist.md` |
| Per-service security record (template) | `docs/security/service-security-template.md` |
| Per-service verification spec (template) | `docs/testing/service-verify-template.md` |
| Heritage / V4 policy | `docs/decisions/013-heritage-v4.md` |
| Sourcing tech knowledge | `docs/decisions/014-knowledge-sourcing.md` |
| Toolchain choices | `docs/decisions/003-toolchain.md` |
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
| Terraform | `docs/decisions/006-terraform.md` |
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
| Testing methodology | `docs/decisions/008-testing.md` |
| Service-UI verification (Level 4) | `docs/decisions/017-service-ui-verification.md` |
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
| Update management | `docs/decisions/011-update-management.md` |
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Logging & log integrity | `docs/decisions/018-logging.md` |
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
| Adding a new role | `docs/runbooks/new-role.md` |
| Adding a new host | `docs/runbooks/new-host.md` |
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
| Claude Code setup (per machine) | `docs/runbooks/claude-code-setup.md` |