Add architecture decision records and runbooks

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:10:01 +02:00 · 2026-05-30 14:10:01 +02:00 · fe4228fb38
commit fe4228fb38
parent 3f1d7eb128
13 changed files with 1340 additions and 0 deletions
--- a/docs/README.md
+++ b/docs/README.md
@ -0,0 +1,11 @@
+# docs/
+
+Project documentation.
+
+- `decisions/` — Architecture Decision Records (ADRs): the "why" behind the design.
+  Numbered from 001; each records context, the decision, and what was ruled out.
+- `runbooks/` — step-by-step operational procedures (add a host, add a role, rotate
+  secrets).
+
+For what is actually **built vs only designed**, see `STATUS.md` at the repo root —
+the ADRs describe intent, not necessarily current reality.
--- a/docs/decisions/001-architecture.md
+++ b/docs/decisions/001-architecture.md
@ -0,0 +1,62 @@
+# ADR-001 — Architecture overview
+
+## Context
+
+This document describes the overall architecture of the homelab infrastructure
+and the boundaries of what this Ansible monorepo manages.
+
+## Infrastructure
+
+- **Hypervisor**: Proxmox cluster (2+ nodes)
+- **Guest OS**: Debian 13 (all managed hosts)
+- **Scale**: 2–5 VMs, small fleet — treated as individuals, not cattle
+- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
+  The control node is the one host that cannot fully bootstrap itself from scratch
+  and requires manual initial setup (see `docs/runbooks/new-host.md`).
+
+## What this repo manages
+
+| Layer              | Managed by         | Notes                                      |
+|--------------------|--------------------|--------------------------------------------|
+| VM existence       | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
+| Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) |
+| OS baseline        | Ansible `base` role | Users, SSH, firewall, updates, audit       |
+| Docker runtime     | Ansible `docker_host` role | Engine, daemon config, log driver  |
+| Service deployment | Ansible per-service roles | Compose rendered from templates      |
+| Secrets            | Ansible Vault      | Encrypted `vault.yml` files in repo        |
+
+The Terraform↔Ansible boundary and handoff are defined in ADR-009.
+
+## Host groups
+
+```
+all
+├── control           # the control node itself — baseline config only, runs no services
+├── docker_hosts      # VMs running Docker services (most hosts)
+└── proxmox_hosts     # Proxmox nodes themselves (limited management scope)
+```
+
+The `control` group holds the single manually-provisioned control node; it is
+managed for baseline config (SSH, firewall, updates) but never runs the
+`docker_host` role. Proxmox nodes are managed only for basic baseline tasks (SSH,
+monitoring agent). Proxmox configuration itself (storage, clustering, networking)
+is out of scope.
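+
+As a rough illustration, the group tree above could map onto a YAML inventory like
+the sketch below. The host name `control1` and its address are placeholders invented
+for this example, and placing `proxy` and `homeassistant` in `docker_hosts` is an
+assumption; the remaining names and addresses come from ADR-007. The real
+`hosts.yml` is generated, so its exact shape may differ.
+
+```yaml
+# Illustrative only: not the generated inventory.
+all:
+  children:
+    control:
+      hosts:
+        control1:                   # hypothetical name for the control node
+          ansible_host: 10.20.0.20  # placeholder address
+    docker_hosts:
+      hosts:
+        proxy:
+          ansible_host: 10.20.0.12
+        homeassistant:
+          ansible_host: 10.20.0.13
+    proxmox_hosts:
+      hosts:
+        pve0:
+          ansible_host: 10.10.0.200
+```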
+
+## Service interaction model
+
+Services run as Docker containers on one or more `docker_hosts`. Where services
+need to interact, they do so via:
+
+- Docker networks (same host)
+- Internal DNS / hostname resolution (cross-host)
+- Explicitly defined published ports (external access)
+
+All Compose files are rendered by Ansible from Jinja2 templates. No hand-edited
+Compose files exist on hosts — they are always regenerated on deploy.
+
+## Decision
+
+This architecture prioritises:
+- **Simplicity**: few moving parts, no orchestration layer (no Kubernetes, no Swarm)
+- **Reproducibility**: any managed host can be rebuilt from scratch via Terraform and Ansible (the control node needs manual initial setup)
+- **Legibility**: a human reading the repo can understand what runs where
--- a/docs/decisions/002-security.md
+++ b/docs/decisions/002-security.md
@ -0,0 +1,73 @@
+# ADR-002 — Security baseline
+
+## Context
+
+Every managed host must reach a defined security baseline before any services
+are deployed. This baseline is applied by the `base` role and is non-negotiable —
+it runs first, on every host, every time.
+
+The goal is a principled, maintainable baseline appropriate for a homelab with
+some public-facing services — not a compliance exercise.
+
+## Baseline components
+
+### Access & authentication
+
+- SSH key authentication only — password auth disabled
+- Root login disabled — `PermitRootLogin no`
+- Dedicated `ansible` user with locked-down sudo (NOPASSWD for automation)
+- No shared user accounts — per-person SSH keys in `group_vars/all/vars.yml`
+
+### Firewall
+
+- `nftables` (native on Debian 13, replaces iptables)
+- Default policy: deny inbound, allow established/related, allow loopback
+- Rules managed entirely by Ansible — never edited manually on hosts
+- Port definitions live in `group_vars/` so rules stay in sync with deployed services
+- Docker's own iptables rules are disabled — nftables manages all filtering
+
+> **Note on Docker + nftables**: By default Docker inserts its own iptables rules, so
+> published ports can bypass the host firewall. This is addressed by setting
+> `"iptables": false` in the Docker daemon config and managing all rules explicitly via
+> nftables, including the NAT/masquerade rules Docker would otherwise create.
+> See `docs/decisions/004-docker-model.md`.
+
+### Intrusion deterrence
+
+- `fail2ban` monitoring SSH (and optionally reverse proxy logs)
+- Configured to ban after 5 failed attempts, 1-hour ban
+
+### Updates
+
+- `unattended-upgrades` enabled for **security patches only**
+- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
+- No automatic reboots — reboots are a conscious operational decision
+
+### Minimal attack surface
+
+- No unnecessary packages installed
+- Docker daemon TCP socket disabled — Unix socket only
+- No open ports beyond those explicitly defined in firewall rules
+
+### Audit trail
+
+- `auditd` installed and running with a baseline ruleset
+- Logs shipped to a central location if a log aggregation service is available
+
+## Secrets management
+
+- Ansible Vault for all secrets (API keys, passwords, certificates)
+- Vault password stored outside the repo (`.vault_pass` gitignored)
+- New collaborators receive vault password via a separate secure channel
+- See `docs/runbooks/rotate-secrets.md` for rotation procedure
+
+## What this baseline does not include
+
+- Full CIS benchmark hardening — adds complexity for marginal gain at this scale
+- SELinux / AppArmor — not applied by default, revisit if threat model changes
+- Intrusion detection (IDS) — out of scope for now
+
+## Decision
+
+This baseline was chosen to be:
+- **Effective** against the realistic threat model (exposed services, shared repo)
+- **Maintainable** by a small team without security expertise overhead
+- **Automated** — no manual steps should be needed to reach baseline state
--- a/docs/decisions/003-toolchain.md
+++ b/docs/decisions/003-toolchain.md
@ -0,0 +1,135 @@
+# ADR-003 — Toolchain decisions
+
+## Execution engine
+
+**Choice**: `ansible-core` (pip-installed, pinned version) + explicit `requirements.yml`
+
+**Not chosen**: `ansible` full package (bundles ~85 collections at a frozen version)
+
+**Rationale**: Explicit collection pinning allows independent upgrades, smaller installs,
+and fully reproducible environments. The full package trades these away for convenience
+that isn't needed in a maintained monorepo.
+
+---
+
+## Python environment
+
+**Choice**: `python3-venv` (system Python on Debian 13) + pinned `requirements.txt`
+
+**Not chosen**: `pyenv` (solves multi-version problems on developer laptops, not needed
+on a dedicated Debian control node with a controlled Python version)
+
+**Rationale**: The control node runs one Python version. A plain venv is sufficient,
+reproducible, and has no extra dependencies.
+
+---
+
+## Secrets
+
+**Choice**: Ansible Vault (file-based, built-in)
+
+**Not chosen**:
+- SOPS + age: better git-diff ergonomics, but adds external tooling and key management
+- HashiCorp Vault: powerful, but significant operational overhead for this scale
+
+**Rationale**: Vault is built-in, requires no extra services, and works well at this
+scale. The main limitation (whole-file encryption makes diffs unreadable) is mitigated
+by keeping `vault.yml` files small and purposeful — only actual secrets, no structure.
+
+---
+
+## Testing
+
+**Choice**: Molecule with Docker driver (`molecule-plugins[docker]`)
+
+**Not chosen**:
+- Molecule + Podman: rootless is appealing, but Docker is simpler on a Debian control node
+- Molecule + Vagrant: full VMs are slower and require a hypervisor on the control node
+- No testing: unacceptable for a shared, maintained project
+
+**Test image**: a self-built, project-owned Debian 13 image with systemd support
+(`.docker/molecule-debian13/`), hosted in the Forgejo registry. ADR-008 is canonical
+for the image and the rationale for not using an external image such as
+`geerlingguy/docker-debian13-ansible`.
+
+**Verifier**: Built-in Ansible verifier. Testinfra added later if deeper assertions
+are needed.
+
+---
+
+## Linting
+
+**Choice**: `ansible-lint` + `yamllint` + `pre-commit`
+
+- `yamllint`: catches formatting issues before Ansible sees the file
+- `ansible-lint`: enforces correctness and idiomatic style
+- `pre-commit`: runs both locally on every commit, preventing CI failures
+
+Config files: `.ansible-lint`, `.yamllint` in repo root.
+
+---
+
+## CI/CD
+
+**Choice**: Forgejo Actions (self-hosted at git.baobab.band) + `act_runner`
+
+**Not chosen**: GitHub Actions (external), Jenkins (heavy)
+
+**Pipeline**:
+1. Push to any branch → lint + Molecule tests
+2. Merge to `main` → lint + Molecule tests + manual approval gate
+3. After approval → deploy to staging, then production
+
+`act_runner` runs as a Docker container on the control node or a dedicated runner VM.
+
+---
+
+## Developer ergonomics
+
+**Choice**: `Makefile` as the single interface for all operations
+
+**Rationale**: All `ansible-playbook`, `molecule`, and `ansible-lint` invocations go
+through Make targets. This means:
+- Claude Code always calls `make <target>` — never constructs raw commands
+- Collaborators don't need to know the underlying flags
+- CI uses the same targets as local development (no drift)
+
+**direnv**: Not used — the control node is a dedicated host, not a shared workstation.
+The venv is activated in the user's shell profile.
+
+---
+
+## Collections and roles policy
+
+**No Galaxy roles.** All roles are written and maintained locally in `roles/`.
+Galaxy roles introduce external state, versioning surprises, and implicit
+conventions that conflict with this repo's style.
+
+**Collections on demand.** A collection is added to `requirements.yml` only when
+a task in a committed role actively uses a module from it. Pre-emptive inclusions
+are removed. Each entry in `requirements.yml` must justify its presence.
+
+**Starting collection set** (rationale for each):
+
+| Collection     | Kept / dropped | Reason                                                       |
+|----------------|----------------|--------------------------------------------------------------|
+| `ansible.posix`| Kept           | Ansible-team maintained; fills real `ansible.builtin` gaps (`authorized_key`, `sysctl`, `acl`) |
+| `community.docker` | Dropped    | ADR-004 uses `ansible.builtin.command` + `docker compose` — no Docker API modules needed |
+| `community.proxmox`| Dropped    | Proxmox configuration is out of scope (ADR-001)              |
+| `community.crypto` | Deferred   | Add when a role needs cert automation; use `openssl` CLI until then |
+| `community.general`| Deferred   | 1,500+ modules; add the collection only when a specific module is needed, with a comment naming it |
+
+---
+
+## What was explicitly ruled out
+
+| Tool             | Reason not adopted                                          |
+|------------------|-------------------------------------------------------------|
+| AWX / AAP        | Significant operational overhead, not needed at this scale  |
+| Semaphore        | Revisit if non-SSH operators need to trigger runs           |
+| ansible-runner   | Only needed when AWX/Semaphore orchestrates runs            |
+| ansible-builder  | Only needed when packaging Execution Environments for AWX   |
+| Kubernetes/Swarm | Out of scope — Docker Compose is the right complexity level |
+| NixOS targets    | Poor Ansible fit; all hosts standardised on Debian 13       |
+
+Terraform is **adopted** for VM provisioning only (internal DNS stays with the Ansible `dns` role); see `docs/decisions/006-terraform.md`.
--- a/docs/decisions/004-docker-model.md
+++ b/docs/decisions/004-docker-model.md
@ -0,0 +1,77 @@
+# ADR-004 — Docker and Compose service model
+
+## Context
+
+All services run as Docker containers managed via Docker Compose. This document
+defines how services are structured, deployed, and maintained.
+
+## Core principles
+
+- **No hand-edited files on hosts**: all Compose files are rendered by Ansible
+  from Jinja2 templates. If a file exists on a host, it was put there by Ansible.
+- **Compose per service**: each service (or tightly coupled service group) gets
+  its own Compose file and directory under a standard path.
+- **Variables drive differences**: the same template renders differently per host
+  via `group_vars` and `host_vars`. No host-specific templates.
+
+## Directory layout on hosts
+
+```
+/opt/services/
+├── servicename/
+│   ├── docker-compose.yml    # rendered by Ansible, never edited manually
+│   ├── .env                  # rendered by Ansible from vault variables
+│   └── data/                 # persistent volumes (bind mounts)
+│       └── ...
+```
+
+All services live under `/opt/services/`. The path is defined in
+`group_vars/all/vars.yml` as `services__base_dir`.
+
+## Compose file delivery
+
+Each service has a corresponding Ansible role (or is managed by a shared role
+with per-service variables). The role:
+
+1. Creates `/opt/services/servicename/` directory
+2. Renders `docker-compose.yml` from `templates/docker-compose.yml.j2`
+3. Renders `.env` from `templates/env.j2` (pulling secrets from vault variables)
+4. Optionally runs `docker compose pull` first (controlled by a variable)
+5. Runs `docker compose up -d --remove-orphans` via `ansible.builtin.command`, as a
+   handler notified by template changes (see ADR-008)
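+
+The steps above might look roughly like the following in a service role. This is a
+sketch, not the project's actual role: `servicename`, the file modes, and the handler
+name are placeholders, and running `up` as a handler is assumed from the idempotency
+rules in ADR-008.
+
+```yaml
+# tasks/main.yml (sketch)
+- name: Create the service directory
+  ansible.builtin.file:
+    path: "{{ services__base_dir }}/servicename"
+    state: directory
+    mode: "0750"
+
+- name: Render docker-compose.yml
+  ansible.builtin.template:
+    src: docker-compose.yml.j2
+    dest: "{{ services__base_dir }}/servicename/docker-compose.yml"
+    mode: "0640"
+  notify: Apply compose project
+
+- name: Render .env from vault-backed variables
+  ansible.builtin.template:
+    src: env.j2
+    dest: "{{ services__base_dir }}/servicename/.env"
+    mode: "0600"
+  no_log: true  # keep secret values out of task output
+  notify: Apply compose project
+---
+# handlers/main.yml (sketch): runs only when a template above changed
+- name: Apply compose project
+  ansible.builtin.command:
+    cmd: docker compose up -d --remove-orphans
+    chdir: "{{ services__base_dir }}/servicename"
+  changed_when: true
+```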
+
+## Docker daemon configuration
+
+Managed by the `docker_host` role. Key settings:
+
+- `"log-driver": "json-file"` with size limits (prevents disk exhaustion)
+- `"iptables": false` — firewall managed entirely by nftables (see ADR-002)
+- TCP socket disabled — Unix socket only (`/var/run/docker.sock`)
+- User namespace remapping: evaluated per use case, not enabled by default
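+
+One possible way to express these settings is a role default rendered to
+`/etc/docker/daemon.json`. The variable name `docker_host__daemon_config`, the log
+size limits, and the `Restart docker` handler are assumptions made for this sketch.
+Leaving out a `hosts` key keeps Docker on its default Unix socket.
+
+```yaml
+# defaults/main.yml (sketch; names and limits are assumptions)
+docker_host__daemon_config:
+  log-driver: json-file
+  log-opts:
+    max-size: "10m"
+    max-file: "3"
+  iptables: false
+---
+# tasks/main.yml (sketch)
+- name: Render /etc/docker/daemon.json
+  ansible.builtin.copy:
+    content: "{{ docker_host__daemon_config | to_nice_json }}\n"
+    dest: /etc/docker/daemon.json
+    mode: "0644"
+  notify: Restart docker  # hypothetical handler defined elsewhere in the role
+```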
+
+## Networking
+
+- Each service Compose file defines its own named network(s)
+- Services that need to communicate are placed on a shared named network
+  defined in a dedicated `docker-compose.networks.yml` (if cross-service
+  networking is needed on a host)
+- External port publishing is explicit and matches nftables rules
+
+## Image management
+
+- Images are always pinned to a specific digest or tag in templates
+- `latest` is never used in production Compose files
+- Image updates are a deliberate operation: update the tag variable, run deploy
+
+## Persistent data
+
+- Bind mounts preferred over named volumes for data that must be backed up
+- All bind mount paths are under `/opt/services/<name>/data/`
+- Backup strategy is defined separately (not in scope of this repo)
+
+## Decision
+
+Docker Compose was chosen over Kubernetes/Swarm because:
+- Appropriate complexity level for 2–5 hosts with independent service sets
+- Compose files are human-readable and easily auditable
+- No distributed state to manage
+- Straightforward to back up and restore
--- a/docs/decisions/005-bootstrapping.md
+++ b/docs/decisions/005-bootstrapping.md
@ -0,0 +1,79 @@
+# ADR-005 — Host bootstrapping
+
+## Context
+
+This document defines the **cloud-init template** that managed VMs are cloned
+from, and the **control-node** bootstrapping special case. The per-host
+provisioning pipeline — how a VM is created from this template and handed off to
+Ansible — is owned by ADR-009. Terraform clones the template defined here; the
+template is the base image both for Terraform-managed hosts and for the manually
+provisioned control node.
+
+## Approach: Proxmox cloud-init template
+
+Managed VMs are cloned from a Proxmox VM template based on the official Debian 13
+cloud image. Cloud-init handles first-boot configuration. Ansible takes over
+from there.
+
+The cloud-init image was chosen over:
+- **Manual Debian installer**: slow, error-prone, not reproducible
+- **Preseed/netboot**: powerful but complex to maintain
+
+## Template creation (one-time, manual)
+
+This is a manual procedure performed once per Proxmox cluster. Documented in
+`docs/runbooks/new-host.md`.
+
+High-level steps:
+1. Download official Debian 13 genericcloud image
+2. Import disk to Proxmox, create VM template
+3. Install `qemu-guest-agent` in the template image
+4. Convert VM to template — never boot the template directly
+
+## VM provisioning (per new host)
+
+Per-host VMs are created by **Terraform**, which clones this template and sets the
+cloud-init values (hostname, SSH public key, IP/gateway). Terraform writes no DNS
+records; the host's A record is rendered later by the Ansible `dns` role from the
+inventory. Cloud-init runs at first boot (~30–60 seconds), leaving the VM
+reachable via SSH with the ansible user's key.
+
+The full create → inventory → configure pipeline, and the Terraform↔Ansible data
+contract, are defined in **ADR-009 (provisioning handoff)**. There is no manual
+`qm clone` path for managed hosts — the sole exception is the control node below.
+
+## Ansible handoff
+
+Once Terraform has created the VM and `make tf-inventory` has regenerated the
+inventory, the `bootstrap` playbook handles first-run specifics (Python may not be
+present, user may differ) and `site` applies the full standard state. See ADR-009
+for the end-to-end commands and `docs/runbooks/new-host.md` for the full procedure.
+
+## Control node bootstrapping
+
+The control node is a special case — it runs Terraform and Ansible, so it cannot
+be created by the Terraform it hosts (chicken-and-egg). It is the one documented
+exception to Terraform-owned VM existence (see ADR-009). The control node requires:
+
+1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
+   `qm clone`), since Terraform is not yet available to do it
+2. Manual setup of the Ansible environment:
+   ```bash
+   git clone <repo> ~/ansible
+   cd ~/ansible
+   make setup        # creates venv, installs deps
+   make collections  # installs Ansible collections
+   cp /secure/location/.vault_pass ~/ansible/.vault_pass
+   ```
+3. After that, the control node can manage all other hosts normally
+
+The control node itself is listed in `inventories/production/hosts.yml` under
+a `control` group and can be managed for baseline config (SSH, firewall, updates)
+but not for the `docker_host` role (it does not run services).
+
+## Decision
+
+Cloud-init with Proxmox templates provides:
+- Reproducible VM creation in under 2 minutes
+- No manual installer interaction
+- A clean handoff point to Ansible
+- Easy rebuilds: destroy the VM, let Terraform re-clone the template, run Ansible
--- a/docs/decisions/006-terraform.md
+++ b/docs/decisions/006-terraform.md
@ -0,0 +1,111 @@
+# ADR-006 — Terraform for infrastructure provisioning
+
+## Context
+
+Ansible manages host configuration well but has no state model for infrastructure
+existence. Adding Terraform handles the "what exists" layer — creating and destroying
+VMs on Proxmox — while Ansible continues to own everything that runs inside them,
+including all internal DNS records.
+
+This complements rather than replaces Ansible. The two tools do not overlap. The
+exact boundary, handoff pipeline, and data contract between them live in **ADR-009
+(provisioning handoff)** — this ADR covers Terraform's own internals only.
+
+---
+
+## Responsibility split
+
+The canonical responsibility-split table lives in **ADR-009**. In short: Terraform
+owns VM existence only; Ansible owns everything inside a VM, including all internal
+DNS records.
+
+**OPNsense is entirely Ansible.** The available Terraform providers for OPNsense
+are community-maintained with real risk of provider rot across OPNsense releases.
+OPNsense firewall rules also change on a service cadence, not an infrastructure
+cadence, making them a poor fit for Terraform state.
+
+---
+
+## Providers
+
+**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
+full Proxmox 8 API support, and better cloud-init integration. This is the only
+provider.
+
+Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
+to write A records, but that created a bootstrap cycle — the first DNS server cannot
+register itself — and split DNS ownership across two tools. Ansible's `dns` role now
+owns the entire internal zone, rendered from inventory. See ADR-009.
+
+No Galaxy roles. Terraform manages its own provider dependencies via
+`required_providers` and `.terraform.lock.hcl` (tracked in git once `terraform init`
+has been run).
+
+---
+
+## State backend
+
+**Choice**: Forgejo HTTP backend (self-hosted at git.baobab.band)
+
+Keeps all state on the same self-hosted stack without additional services.
+Authentication uses a Forgejo personal access token via `TF_HTTP_USERNAME` and
+`TF_HTTP_PASSWORD` environment variables.
+
+**Note**: The backend URL in `backend.tf` is a placeholder — confirm the exact
+endpoint path against your running Forgejo instance's API documentation before
+running `terraform init`. If Forgejo's HTTP state is unavailable, remove the
+`backend` block from `backend.tf` to fall back to local state on the control node.
+
+---
+
+## Structure
+
+```
+terraform/
+  modules/
+    proxmox_vm/          # reusable VM module — Proxmox only, no DNS
+  environments/
+    staging/             # staging VMs, separate state file
+    production/          # production VMs, separate state file
+```
+
+Separate environment directories (not Terraform workspaces) for the clearest
+isolation — no risk of accidentally applying the wrong state.
+
+Each environment directory contains:
+- `providers.tf` — provider version pins and configuration
+- `backend.tf` — Forgejo state backend (environment-specific path)
+- `variables.tf` — input declarations
+- `terraform.tfvars.example` — tracked template; copy to `terraform.tfvars` for actual values
+- `main.tf` — `local.vms` map and module calls (no DNS resources)
+- `outputs.tf` — VM map consumed by `make tf-inventory`
+
+---
+
+## Secrets handling
+
+The only secret input (the Proxmox API token) is passed via a `TF_VAR_*`
+environment variable and declared `sensitive = true` in `variables.tf`. It never
+appears in `.tfvars` files. Non-secret configuration lives in tracked
+`terraform.tfvars.example`; the real `terraform.tfvars` is gitignored.
+
+---
+
+## Ansible integration
+
+After `terraform apply`, run `make tf-inventory TF_ENV=<env>` to regenerate
+`inventories/<env>/hosts.yml` from the `vms` output. The full handoff pipeline,
+the `vms` output → inventory data contract, and the generator script
+(`scripts/tf_to_inventory.py`) are documented in **ADR-009 (provisioning
+handoff)**.
+
+---
+
+## What was ruled out
+
+| Option | Reason |
+|---|---|
+| `telmate/proxmox` provider | Less actively maintained; weaker cloud-init and Proxmox 8 support |
+| OPNsense Terraform provider | Community-maintained; provider rot risk across OPNsense releases |
+| Terraform workspaces | One configuration and backend shared by every environment, with state selected only by the active workspace; accidental cross-env apply possible |
+| Separate Terraform repo | Cross-referencing between infra and config adds friction; monorepo keeps the full picture together |
--- a/docs/decisions/007-network.md
+++ b/docs/decisions/007-network.md
@ -0,0 +1,186 @@
+# ADR-007 — Network topology and addressing
+
+## Context
+
+The boma homelab is a Proxmox cluster on a dedicated private network behind an
+OPNsense firewall. This document records the agreed physical topology, VLAN
+design, IP addressing conventions, naming scheme, and DNS zone structure.
+Everything here feeds directly into Terraform variables, Ansible inventory,
+and OPNsense configuration.
+
+---
+
+## Physical topology
+
+```
+ISP
+ └── OPNsense (dedicated hardware)
+      ├── WAN — ISP uplink
+      └── LAN — 802.1q trunk to managed switch
+                         │
+          ┌──────────────┼──────────────────────────┐
+          │              │              │            │
+        pve0           pve1           pve2        AP1 / AP2
+     (eno1 trunk)   (eno1 trunk)  (eno1 trunk)   (trunk)
+     (eno2 corosync)(eno2 corosync)(eno2 corosync)
+          └──────────────┴──────────────┘
+               172.16.0.0/24  (corosync ring — not on managed switch)
+```
+
+**Dual NICs per Proxmox node:**
+- `eno1` — VLAN-aware trunk. Carries all VLANs via a single VLAN-aware bridge
+  (`vmbr0`). VMs get their VLAN tag assigned in Proxmox.
+- `eno2` — Dedicated corosync ring (`vmbr1`). Direct link or tiny unmanaged
+  switch between the three nodes only. Never touches the main switch fabric.
+
+**Access points** broadcast multiple SSIDs, each tagged to its corresponding VLAN
+(trusted WiFi → VLAN 30, IoT → VLAN 40, guest → VLAN 50).
+
+---
+
+## VLAN design
+
+| VLAN | Name | Subnet | Purpose |
+|---|---|---|---|
+| 10 | `mgmt` | `10.10.0.0/24` | Proxmox hosts, OPNsense, managed switch. No internet except update repos. |
+| 20 | `srv` | `10.20.0.0/24` | All Debian VMs and Docker services. 100% static. Terraform provisions here. |
+| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
+| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
+| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
+| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
+
+---
+
+## IP addressing
+
+### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP
+
+| Address | Host |
+|---|---|
+| `10.10.0.1` | OPNsense LAN (mgmt) |
+| `10.10.0.2` | Managed switch |
+| `10.10.0.200` | `pve0` |
+| `10.10.0.201` | `pve1` |
+| `10.10.0.202` | `pve2` |
+
+### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static
+
+| Range | Purpose |
+|---|---|
+| `10.20.0.1` | OPNsense gateway |
+| `10.20.0.10`–`.19` | Core infrastructure VMs (DNS, proxy) |
+| `10.20.0.20`–`.49` | Additional static infrastructure |
+| `10.20.0.50`–`.249` | Terraform-provisioned VMs |
+
+Assigned infrastructure addresses:
+
+| Address | Host | Role |
+|---|---|---|
+| `10.20.0.10` | `dns1` | Primary DNS server |
+| `10.20.0.11` | `dns2` | Secondary DNS server |
+| `10.20.0.12` | `proxy` | Reverse proxy |
+| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
+
+### VLAN 30 — lan (10.30.0.0/24)
+
+| Range | Purpose |
+|---|---|
+| `10.30.0.1` | OPNsense gateway |
+| `10.30.0.100`–`.249` | DHCP pool |
+
+### VLAN 40 — iot (10.40.0.0/24)
+
+| Range | Purpose |
+|---|---|
+| `10.40.0.1` | OPNsense gateway |
+| `10.40.0.100`–`.249` | DHCP pool |
+
+### VLAN 50 — guest (10.50.0.0/24)
+
+| Range | Purpose |
+|---|---|
+| `10.50.0.1` | OPNsense gateway |
+| `10.50.0.100`–`.249` | DHCP pool |
+
+### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
+
+| Address | Host |
+|---|---|
+| `10.99.0.1` | OPNsense (WireGuard endpoint) |
+| `10.99.0.2` | `askari` (Hetzner VPS) |
+| `10.99.0.10`+ | Road-warrior clients |
+
+### Corosync ring (172.16.0.0/24) — not on managed switch
+
+| Address | Host |
+|---|---|
+| `172.16.0.200` | `pve0` |
+| `172.16.0.201` | `pve1` |
+| `172.16.0.202` | `pve2` |
+
+---
+
+## OPNsense firewall rules (intent)
+
+| Source | Destination | Policy |
+|---|---|---|
+| `mgmt` | anywhere | allow (administrator access) |
+| `srv` | `srv` | allow (inter-service communication) |
+| `srv` | internet | allow (updates, image pulls) |
+| `lan` | `srv` (allow-list) | allow specific published ports only |
+| `lan` | internet | allow |
+| `iot` | internet | allow egress only |
+| `iot` | `srv` (HA IP only) | allow on integration ports |
+| `guest` | internet | allow, isolated from all internal |
+| `vpn` | `srv` (metrics ports) | allow (monitoring) |
+| `vpn` | `mgmt` | allow (administration from askari) |
+
+**Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required
+ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device discovery.
+IoT devices cannot initiate connections to `srv`, except to the HA IP on the
+integration ports allowed in the table above.
+
+---
+
+## Naming scheme
+
+| Layer | Convention | Examples |
+|---|---|---|
+| Homelab name | `boma` | — |
+| Proxmox nodes | `pve<n>` | `pve0`, `pve1`, `pve2` |
+| Infrastructure VMs | `<role>[<n>]` (number omitted for single instances) | `dns1`, `dns2`, `proxy` |
+| Hetzner VPS | `askari` | Swahili for guard/sentinel |
+| Internal FQDN | `<host>.boma.baobab.band` | `dns1.boma.baobab.band` |
+| Public service FQDN | `<service>.baobab.band` | `git.baobab.band` |
+
+---
+
+## DNS zones and split-horizon
+
+**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
+The zone is rendered by the Ansible `dns` role: host A records come from the
+inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
+and service/alias/split-horizon records are explicit zone data in `group_vars`.
+Terraform itself writes no DNS records — see ADR-009.
+
+**Public zone**: `baobab.band` — served by external DNS (Cloudflare or equivalent).
+Public-facing services resolve to the public IP or Cloudflare proxy.
+
+**Split-horizon**: `dns1`/`dns2` serve internal answers for any hostname that has
+both a public and private face. Example: `git.baobab.band` resolves to
+`10.20.0.12` (proxy) internally and to the public IP externally.
+
+OPNsense DNS resolver forwards `boma.baobab.band` queries to `dns1`/`dns2`.
+All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).
+
+---
+
+## External monitoring — askari
+
+`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
+Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
+tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
+for administration.
+
+`askari` is provisioned and managed independently of the Proxmox cluster — it
+must be reachable even when the homelab is down (its entire purpose).
+FQDN: `askari.baobab.band`.
--- a/docs/decisions/008-testing.md
+++ b/docs/decisions/008-testing.md
@ -0,0 +1,160 @@
+# ADR-008 — Testing methodology
+
+## Context
+
+Ansible roles must be idempotent and correct before they touch production hosts.
+This document records the testing strategy, what each level covers, and — critically
+— what is explicitly out of scope for automated testing and why.
+
+---
+
+## Three testing levels
+
+### Level 1 — Molecule (per role, always required)
+
+Runs in Docker on the control node or in CI. Fast (~5 min per role).
+
+**What happens during `molecule test`:**
+1. `create` — start the test container
+2. `converge` — apply the role via `converge.yml`
+3. **`idempotence`** — run `converge.yml` again; fail if any task reports `changed`
+4. `verify` — assert expected state via `verify.yml`
+5. `destroy` — remove the container
+
+The idempotency step is non-negotiable. Every role must pass it cleanly.
+
+**`verify.yml` must assert outcomes, not task success:**
+
+```yaml
+# Wrong — only proves the task ran
+- assert:
+    that: result is success
+
+# Right — proves the outcome exists
+- ansible.builtin.command: systemctl is-active fail2ban
+  changed_when: false
+  register: svc
+- ansible.builtin.assert:
+    that: svc.stdout == "active"
+```
+
+### Level 2 — Staging playbook (full stack, real VMs)
+
+`make check PLAYBOOK=site` followed by `make deploy PLAYBOOK=site` on
+Terraform-provisioned staging VMs. Catches inter-role dependencies and ordering
+issues that Molecule cannot see (e.g., `docker_host` role requires `base` to
+have already run and configured the firewall).
+
+Run before every merge to `main`.
+
+### Level 3 — External smoke test from askari
+
+Once `askari` is operational: scripted checks from outside the network confirming
+that public-facing services respond correctly. Catches firewall and reverse proxy
+configuration issues invisible to Ansible check mode.
+
+---
+
+## Molecule test image
+
+**No external images.** The project builds and hosts its own test image.
+
+- **Source**: `.docker/molecule-debian13/Dockerfile`
+- **Base**: `debian:trixie-slim` (official Debian 13, Docker Hub; the only external
+  dependency permitted here, as the base OS image is not substitutable)
+- **Registry**: `git.baobab.band/<owner>/<repo-name>/molecule-debian13:latest`
+
+Build and push with:
+```bash
+make molecule-image        # build locally
+make molecule-image-push   # push to Forgejo registry (requires docker login)
+```
+
+The scaffold `molecule.yml` references this image with `pre_build_image: true`,
+meaning Molecule uses the image as-is and does not attempt to build it.
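+
+A minimal sketch of what such a `molecule.yml` could look like follows. It is not
+the project's actual scaffold: the platform name `instance` and the
+`privileged: true` setting are assumptions (the latter inferred from the systemd
+and Docker daemon rows in the tables further down).
+
+```yaml
+# molecule/default/molecule.yml (illustrative sketch only)
+driver:
+  name: docker
+platforms:
+  - name: instance              # hypothetical container name
+    image: git.baobab.band/<owner>/<repo-name>/molecule-debian13:latest
+    pre_build_image: true       # use the image as-is; do not build it
+    privileged: true            # assumption: needed for systemd and Docker
+provisioner:
+  name: ansible
+verifier:
+  name: ansible
+```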
+
+**Why not geerlingguy/docker-debian13-ansible?** It is a Docker Hub image outside
+project control. It is not a Galaxy role, but it is an external dependency that
+can drift, disappear, or introduce unexpected changes. The custom image is
+functionally equivalent and fully owned.
+
+---
+
+## Idempotency requirements
+
+Every role task must satisfy one of these:
+
+| Task type | Requirement |
+|---|---|
+| `apt`, `template`, `copy`, `file`, `user`, `group`, `service` | Naturally idempotent — no action needed |
+| `command` / `shell` (read-only) | `changed_when: false` |
+| `command` / `shell` (detectable change) | `changed_when: result.stdout \| length > 0` or equivalent |
+| `command` / `shell` (creates a file) | `creates: /path/to/artifact` |
+| Service restart after config change | Move to a handler; handler fires only when notified |
+| `docker compose up -d` | Handler only — notified by template change, never runs unconditionally |
+
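+ansible-lint flags several of these at lint time (for example, its `no-changed-when`
+rule covers `command`/`shell` tasks without `changed_when`, and `no-handler` covers
+tasks that should be handlers). The Molecule idempotency step catches anything lint
+misses.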
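+
+The three `command` / `shell` rows of the table can be illustrated with the
+following sketch. The role prefix `example__` and the `example-sync` tool are
+hypothetical, and the tool is assumed to print output only when it changed
+something.
+
+```yaml
+# Read-only command: must never report "changed".
+- name: Read the installed Debian version
+  ansible.builtin.command: cat /etc/debian_version
+  register: example__debian_version
+  changed_when: false
+
+# Command that creates a file: skipped once the artifact exists.
+- name: Generate Diffie-Hellman parameters once
+  ansible.builtin.command: openssl dhparam -out /etc/ssl/dhparam.pem 2048
+  args:
+    creates: /etc/ssl/dhparam.pem
+
+# Command with a detectable change: "changed" only when it printed something.
+- name: Apply pending changes with a hypothetical sync tool
+  ansible.builtin.command: /usr/local/bin/example-sync --apply
+  register: example__sync
+  changed_when: example__sync.stdout | length > 0
+```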
+
+---
+
+## What Molecule tests — and what it does not
+
+### Tested in Molecule
+
+| Capability | Notes |
+|---|---|
+| Package installation | `apt` works in the container |
+| File and directory creation, permissions, ownership | Full support |
+| Template rendering and content | Full support |
+| User and group management | Full support |
+| Service installation and `systemctl enable` | Requires the systemd-capable image |
+| Service start/stop | Works for most services in the container |
+| SSH configuration file content | File-level only |
+| fail2ban installation and configuration | Install and config file; not live banning |
+| Docker daemon installation | Works in privileged container |
+| auditd installation and configuration | Install and config file |
+| Idempotency of all of the above | Enforced by Molecule's idempotency step |
+
+### Not tested in Molecule — explicit exceptions
+
+The following require a real kernel, real hardware, or a real network or apt
+environment, and are validated only at Level 2 (staging) or Level 3 (external).
+This is a conscious, documented decision, not a gap.
+
+| Capability | Reason not testable in Molecule |
+|---|---|
+| `nftables` rule loading | Requires `nf_tables` kernel module; not available in Docker |
+| WireGuard tunnel establishment | Requires `wireguard` kernel module |
+| `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment |
+| DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container |
+| mDNS reflector (Avahi cross-VLAN) | Requires real network interfaces and VLANs |
+| Hardware passthrough (NIC, USB) | Not applicable in containers |
+| Corosync cluster formation | Requires multiple real nodes |
+
+For the above, Molecule tests only what it can: that the relevant packages are
+installed, that configuration files render correctly, and that services are enabled.
+Behavioural correctness is confirmed on staging.
+
+---
+
+## CI pipeline
+
+```
+push to any branch
+  ├── yamllint + ansible-lint          (fast gate, ~1 min)
+  └── molecule test (changed roles)   (parallel, ~5 min per role)
+
+pull request to main
+  ├── yamllint + ansible-lint
+  ├── molecule test (all roles)        (parallel)
+  └── [manual gate] review tf-plan and make check on staging
+
+merge to main
+  ├── yamllint + ansible-lint + molecule test (final gate)
+  ├── [manual approval] make deploy PLAYBOOK=site on staging
+  └── [manual approval] make deploy PLAYBOOK=site on production
+```
+
+Manual gates are intentional. Automated tests prove correctness in isolation;
+a human confirms the change is safe to promote.
--- a/docs/decisions/009-provisioning-handoff.md
+++ b/docs/decisions/009-provisioning-handoff.md
@ -0,0 +1,149 @@
+# ADR-009 — Terraform ↔ Ansible provisioning handoff
+
+## Context
+
+Two tools touch every managed host. Terraform owns **what exists** — VMs on
+Proxmox. Ansible owns **what is configured inside** — users, packages, firewall,
+Docker services, and all internal DNS. This ADR is the single source of truth for
+the seam between them: the exact handoff, the data contract, and the one documented
+exception. The two tools must never overlap; this document defines the line they
+meet at.
+
+ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers
+the cloud-init template that VMs are cloned from. This ADR covers how they connect.
+
+---
+
+## The boundary
+
+| Layer | Tool | Notes |
+|---|---|---|
+| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
+| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record |
+| OS configuration | Ansible | Users, SSH, firewall, packages |
+| Service deployment | Ansible | Docker, Compose files, secrets |
+| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
+| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 |
+
+This table is canonical here. ADR-006 links to it rather than restating it.
+Terraform owns VM **existence** only — it writes no DNS records (see "Internal DNS"
+below).
+
+---
+
+## The handoff pipeline
+
+There is one path by which a managed host comes into existence and reaches its
+configured state:
+
+```bash
+make tf-plan TF_ENV=production       # review infrastructure changes
+make tf-apply TF_ENV=production      # clone template → VM (no DNS records written)
+make tf-inventory TF_ENV=production  # regenerate Ansible inventory from outputs
+make check PLAYBOOK=site             # dry-run Ansible against the new host(s)
+make deploy PLAYBOOK=bootstrap       # first-run specifics (see ADR-005)
+make deploy PLAYBOOK=site            # full standard state — `dns` role writes the zone
+```
+
+`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005).
+`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From
+`make check` onward the host is Ansible's — including its DNS record, which the
+`dns` role writes into the internal zone during `make deploy`.
+
+A host is added by editing `local.vms` in the environment's `main.tf` and running
+this pipeline, **never** by hand-editing the inventory. The control node is the one
+exception (see "The control-node exception" below).
+
+---
+
+## The data contract
+
+The seam's interface is a single Terraform output consumed by a single script.
+
+**Producer** — `terraform/environments/<env>/outputs.tf` emits a `vms` map:
+
+```json
+{
+  "vms": {
+    "value": {
+      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
+    }
+  }
+}
+```
+
+**Consumer** — `scripts/tf_to_inventory.py` (Python standard library only) reads
+`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
+group against the allowed set and fails loudly on an unknown group.
+
+**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`.
+
+The generated `hosts.yml` carries a "do not edit manually" header and is owned by
+the generator. Treat it as a build artifact: the source of truth is `local.vms` in
+Terraform, and the inventory is regenerated, never edited.
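+
+For the `host-a` example above, the generated inventory might look roughly like
+this. The layout, the `ansible_host` key, and the header wording are assumptions
+based on Ansible's standard YAML inventory format; the actual output of
+`tf_to_inventory.py` may differ.
+
+```yaml
+# DO NOT EDIT MANUALLY (header wording assumed; generated from Terraform outputs)
+all:
+  children:
+    docker_hosts:               # the "group" field of the vms map
+      hosts:
+        host-a:                 # the key of the vms map
+          ansible_host: 192.168.1.10   # the "ip" field of the vms map
+```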
+
+---
+
+## Cloud-init's role
+
+Cloud-init is the thin first-boot layer between Terraform and Ansible:
+
+- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values
+  (hostname, SSH public key, IP/gateway).
+- **Cloud-init** does just enough at first boot to make the VM reachable over SSH
+  with the ansible user's key — nothing more.
+- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles
+  first-run specifics, then `site` applies the full standard state.
+
+The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.
+
+---
+
+## Internal DNS — owned by Ansible, no chicken-and-egg
+
+Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
+rendered entirely by the Ansible `dns` role:
+
+- **Host A records** derive from the inventory — the same `hostname → ip` data that
+  originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform
+  remains the ultimate source of truth for which hosts exist; the data simply flows
+  through the inventory instead of through a direct Terraform→DNS write.
+- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
+  gateway, `git.baobab.band` → proxy) are explicit zone data in `group_vars`.
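+
+As a rough illustration of the second bullet, explicit zone data in `group_vars`
+could be shaped like the sketch below. Every variable name and the record schema
+are hypothetical, and the values are placeholders; the real structure belongs to
+the `dns` role and ADR-007.
+
+```yaml
+# group_vars sketch: all names here are hypothetical
+dns__zone: boma.baobab.band
+dns__extra_records:
+  - name: opnsense              # non-VM record (the gateway)
+    type: A
+    value: "<gateway-ip>"
+  - name: "<service-name>"      # alias pointing at the proxy
+    type: CNAME
+    value: "<proxy-hostname>"
+```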
+
+This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
+Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
+require a DNS server that does not yet exist — `dns1` cannot register its own A
+record before it is running and configured. Because Ansible renders the zone from
+inventory (using IP addresses, never name resolution, to connect), `dns1`/`dns2`
+are ordinary Terraform-created VMs whose records are written by the same role that
+configures the DNS service. There is no special case and no ordering trap.
+
+ADR-007 holds the zone structure, split-horizon, and addressing conventions. The
+IP-range split there (`.10–.19` core infra vs `.50–.249` fleet) is now an addressing
+convention only — it no longer implies any difference in how records are written.
+
+---
+
+## The control-node exception
+
+The control node, the host that runs Terraform and Ansible, is the one VM
+Terraform does **not** create. Terraform runs on the control node, so it cannot
+create the VM it runs from (chicken-and-egg). The control node is therefore the
+single documented exception to "Terraform owns VM existence":
+
+- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
+- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
+  Ansible for baseline config only (no `docker_host` role).
+
+Every other host is Terraform-managed.
+
+---
+
+## What was ruled out
+
+| Option | Reason |
+|---|---|
+| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
+| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
+| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
+| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |
--- a/docs/runbooks/new-host.md
+++ b/docs/runbooks/new-host.md
@ -0,0 +1,145 @@
+# Runbook — Adding a new managed host
+
+## Prerequisites
+
+- Proxmox VM template exists (Debian 13 cloud-init image; see Part A if not)
+- You have the vault password (`.vault_pass`)
+- The host's intended hostname and IP are decided
+
+---
+
+## Part A — Create the Proxmox template (one-time)
+
+Run on a Proxmox node. Only needed once per cluster.
+
+```bash
+# Download the Debian 13 genericcloud image
+wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2
+
+# Create a VM (adjust ID, storage name as needed)
+qm create 9000 --name debian13-template --memory 2048 --cores 2 \
+  --net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0
+
+# Import the disk
+qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm
+
+# Attach disk and set boot order
+qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
+qm set 9000 --boot order=scsi0
+
+# Add cloud-init drive
+qm set 9000 --ide2 local-lvm:cloudinit
+
+# Enable QEMU guest agent
+qm set 9000 --agent enabled=1
+
+# Convert to template (cannot be undone)
+qm template 9000
+```
+
+---
+
+## Part B — Define the VM in Terraform
+
+Managed hosts are created by Terraform, never by hand. Add an entry to `local.vms`
+in the environment's `main.tf` (`terraform/environments/<env>/main.tf`):
+
+```hcl
+locals {
+  vms = {
+    <hostname> = {
+      ip        = "<IP>/24"        # static; from docs/decisions/007-network.md
+      group     = "docker_hosts"   # control | docker_hosts | proxmox_hosts
+      cores     = 2
+      memory_mb = 2048
+    }
+  }
+}
+```
+
+Terraform clones the cloud-init template from Part A and sets the cloud-init values
+(hostname, SSH key, IP/gateway). It writes no DNS records: the host's A record is
+written later by the Ansible `dns` role during `make deploy PLAYBOOK=site`. See
+ADR-009 for the full handoff and the `vms` output → inventory data contract.
+
+---
+
+## Part C — Provision and regenerate the inventory
+
+```bash
+make tf-plan TF_ENV=production       # review — confirm only the new VM is added
+make tf-apply TF_ENV=production      # create the VM (no DNS records written)
+make tf-inventory TF_ENV=production  # regenerate inventories/production/hosts.yml
+```
+
+`make tf-inventory` rewrites `hosts.yml` from Terraform outputs — **do not edit
+that file by hand**; it carries a "do not edit manually" header and your changes
+would be overwritten. The source of truth is `local.vms`.
+
+Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access:
+
+```bash
+ssh ansible@<IP> echo ok
+```
+
+Add a `host_vars/<hostname>/` directory if the host needs specific overrides
+(this is config, not inventory membership, so it is not generated):
+
+```bash
+mkdir -p inventories/production/host_vars/<hostname>
+touch inventories/production/host_vars/<hostname>/vars.yml
+```
+
+---
+
+## Part D — Bootstrap and configure
+
+```bash
+# First-run bootstrap (handles Python installation, initial user setup)
+make deploy PLAYBOOK=bootstrap
+
+# Apply full standard state
+make deploy PLAYBOOK=site
+```
+
+Verify the host reaches baseline:
+
+```bash
+make check PLAYBOOK=site
+# Should report no changes
+```
+
+---
+
+## Part E — Control node (manual exception)
+
+The control node runs Terraform and Ansible, so it cannot be created by the
+Terraform it hosts (chicken-and-egg). It is the **one** host provisioned manually —
+see ADR-009 and the control-node section of ADR-005. Use the template from Part A:
+
+```bash
+# Clone the template by hand (Proxmox UI or qm clone)
+qm clone 9000 <VMID> --name <hostname> --full
+qm set <VMID> --memory 2048 --cores 2 \
+  --ciuser ansible \
+  --sshkeys /path/to/ansible_ed25519.pub \
+  --ipconfig0 ip=<IP>/24,gw=<GATEWAY>
+qm start <VMID>
+```
+
+Then set up the Ansible environment on it (`make setup`, `make collections`, place
+`.vault_pass`) per ADR-005, and add it to `inventories/<env>/hosts.yml` under the
+`control` group. Because the control node is not in `local.vms`, this is the only
+case where editing `hosts.yml` by hand is expected — every other host comes from
+`make tf-inventory`.
+
+---
+
+## Troubleshooting
+
+**SSH connection refused**: cloud-init may still be running. Wait and retry.
+
+**Python not found**: the bootstrap playbook handles this via the `raw` module.
+If bootstrap fails, SSH to the host manually and run `sudo apt install -y python3`.
+
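+**Firewall locked out**: if nftables rules are misconfigured, connect via the
+Proxmox console (not SSH) and run `nft flush ruleset`. This clears the running
+ruleset only; the on-disk configuration is untouched and is loaded again on the
+next nftables restart or reboot, so fix the role before redeploying.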
--- a/docs/runbooks/new-role.md
+++ b/docs/runbooks/new-role.md
@ -0,0 +1,81 @@
+# Runbook — Adding a new Ansible role
+
+## When to create a new role
+
+Create a new role when you need to manage a distinct, reusable unit of
+configuration — a service, a system component, or a behaviour applied to
+a group of hosts.
+
+Do not create a role for a single task that logically belongs in an existing role.
+
+## Procedure
+
+### 1. Scaffold the role
+
+```bash
+make new-role NAME=<rolename>
+```
+
+This creates the full directory structure and placeholder files under `roles/<rolename>/`.
+
+### 2. Fill in meta/main.yml
+
+```yaml
+galaxy_info:
+  role_name: <rolename>
+  author: <your name>
+  description: <one sentence>
+  min_ansible_version: "2.15"
+  platforms:
+    - name: Debian
+      versions:
+        - trixie  # Debian 13
+```
+
+### 3. Define defaults
+
+Add all tuneable variables to `defaults/main.yml` with inline comments explaining
+each variable. Use the `rolename__varname` namespace convention.
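+
+A short sketch of the convention, for a hypothetical role named `example` (the
+variable names and values are illustrative only):
+
+```yaml
+# roles/example/defaults/main.yml
+# Whether the service is enabled at boot.
+example__enabled: true
+# TCP port the service listens on.
+example__listen_port: 8080
+# Extra packages installed alongside the service.
+example__extra_packages: []
+```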
+
+### 4. Write tasks
+
+- Use FQCN for all modules
+- Every task must have a `name:` that reads as a sentence
+- Every task must have at least one `tags:` entry
+- Notify handlers by `listen:` topic string, not handler name
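+
+The four rules above can be seen together in this sketch. The role `example`, its
+file paths, and the topic string `example config changed` are hypothetical.
+
+```yaml
+# roles/example/tasks/main.yml
+- name: Render the example service configuration
+  ansible.builtin.template:          # FQCN, not bare "template"
+    src: example.conf.j2
+    dest: /etc/example/example.conf
+    owner: root
+    group: root
+    mode: "0644"
+  notify: example config changed     # a listen topic, not a handler name
+  tags:
+    - example
+
+# roles/example/handlers/main.yml
+- name: Restart the example service
+  ansible.builtin.service:
+    name: example
+    state: restarted
+  listen: example config changed     # any task notifying this topic triggers it
+```
+
+Notifying a topic rather than a handler name lets a handler be renamed, or several
+handlers subscribe to the same topic, without touching the tasks.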
+
+### 5. Configure Molecule
+
+Edit `molecule/default/molecule.yml` to use the Debian 13 test image.
+Write a `converge.yml` that applies the role. Write a `verify.yml` that
+asserts the expected state.
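+
+A `converge.yml` can be as small as the following sketch. The role name `example`
+is a placeholder, `become: true` is an assumption, and the sketch assumes Molecule
+can resolve the role by its directory name.
+
+```yaml
+# molecule/default/converge.yml (illustrative sketch)
+- name: Converge
+  hosts: all
+  become: true
+  roles:
+    - role: example
+```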
+
+### 6. Write the README
+
+Document:
+- Purpose of the role (one paragraph)
+- All variables from `defaults/main.yml` with types, defaults, and descriptions
+- Example playbook usage
+- Any dependencies or prerequisites
+
+### 7. Test locally
+
+```bash
+make test ROLE=<rolename>
+```
+
+Fix any lint or test failures before committing.
+
+### 8. Add to a playbook
+
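+Add the role to the appropriate playbook in `playbooks/` and confirm that the host
+group it targets is present in `inventories/staging/hosts.yml` for integration
+testing. That file is generated (see ADR-009), so change group membership in
+`local.vms`, not in the file itself.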
+
+### 9. Commit
+
+```bash
+git checkout -b role/<rolename>
+git add roles/<rolename>
+git commit -m "Add <rolename> role"
+# open PR / merge request in Forgejo
+```
--- a/docs/runbooks/rotate-secrets.md
+++ b/docs/runbooks/rotate-secrets.md
@ -0,0 +1,71 @@
+# Runbook — Rotating vault secrets
+
+## Rotating a single secret value
+
+1. Decrypt the relevant vault file:
+   ```bash
+   make decrypt FILE=inventories/production/group_vars/all/vault.yml
+   ```
+
+2. Edit the file and update the secret value.
+
+3. Re-encrypt:
+   ```bash
+   make encrypt FILE=inventories/production/group_vars/all/vault.yml
+   ```
+
+4. Commit the updated vault file:
+   ```bash
+   git add inventories/production/group_vars/all/vault.yml
+   git commit -m "Rotate <secret name>"
+   ```
+
+5. Deploy to apply the new secret to hosts:
+   ```bash
+   make check PLAYBOOK=site   # verify what will change
+   make deploy PLAYBOOK=site
+   ```
+
+---
+
+## Rotating the vault password
+
+This affects all encrypted files in the repo. Do this only when:
+- A person with vault access leaves the project
+- The password is suspected to be compromised
+
+Steps:
+
+1. Ensure you have the current vault password in `.vault_pass`.
+
+2. Re-key all vault files:
+   ```bash
+   find . -name "vault.yml" -print0 | xargs -0 ansible-vault rekey \
+     --vault-password-file .vault_pass \
+     --new-vault-password-file /path/to/new_password_file
+   ```
+
+3. Replace `.vault_pass` with the new password file.
+
+4. Distribute the new password to all collaborators via a secure channel.
+
+5. Commit all rekeyed vault files:
+   ```bash
+   git add -u   # stage tracked files only, so no password file is added by accident
+   git commit -m "Rekey all vault files"
+   ```
+
+---
+
+## Adding a new collaborator
+
+1. Share the vault password via a secure channel (password manager, etc.)
+2. The collaborator creates `.vault_pass` locally (gitignored)
+3. They can now decrypt/encrypt vault files normally
+
+## Removing a collaborator's access
+
+Rotate the vault password as described above. There is no per-user access
+control in Ansible Vault — access is binary (has the password or not).
+
+If per-user access control becomes necessary, evaluate SOPS + age at that point.