diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..c1d8f7f --- /dev/null +++ b/docs/README.md @@ -0,0 +1,11 @@ +# docs/ + +Project documentation. + +- `decisions/` — Architecture Decision Records (ADRs): the "why" behind the design. + Numbered from 001; each records context, the decision, and what was ruled out. +- `runbooks/` — step-by-step operational procedures (add a host, add a role, rotate + secrets). + +For what is actually **built vs only designed**, see `STATUS.md` at the repo root — +the ADRs describe intent, not necessarily current reality. diff --git a/docs/decisions/001-architecture.md b/docs/decisions/001-architecture.md new file mode 100644 index 0000000..e670441 --- /dev/null +++ b/docs/decisions/001-architecture.md @@ -0,0 +1,62 @@ +# ADR-001 — Architecture overview + +## Context + +This document describes the overall architecture of the homelab infrastructure +and the boundaries of what this Ansible monorepo manages. + +## Infrastructure + +- **Hypervisor**: Proxmox cluster (2+ nodes) +- **Guest OS**: Debian 13 (all managed hosts) +- **Scale**: 2–5 VMs, small fleet — treated as individuals, not cattle +- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here. + The control node is the one host that cannot fully bootstrap itself from scratch + and requires manual initial setup (see `docs/runbooks/new-host.md`). 
+ +## What this repo manages + +| Layer | Managed by | Notes | +|--------------------|--------------------|--------------------------------------------| +| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) | +| Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) | +| OS baseline | Ansible `base` role | Users, SSH, firewall, updates, audit | +| Docker runtime | Ansible `docker_host` role | Engine, daemon config, log driver | +| Service deployment | Ansible per-service roles | Compose rendered from templates | +| Secrets | Ansible Vault | Encrypted `vault.yml` files in repo | + +The Terraform↔Ansible boundary and handoff are defined in ADR-009. + +## Host groups + +``` +all +├── control # the control node itself — baseline config only, runs no services +├── docker_hosts # VMs running Docker services (most hosts) +└── proxmox_hosts # Proxmox nodes themselves (limited management scope) +``` + +The `control` group holds the single manually-provisioned control node; it is +managed for baseline config (SSH, firewall, updates) but never runs the +`docker_host` role. Proxmox nodes are managed only for basic baseline tasks (SSH, +monitoring agent). Proxmox configuration itself (storage, clustering, networking) +is out of scope. + +## Service interaction model + +Services run as Docker containers on one or more `docker_hosts`. Where services +need to interact, they do so via: + +- Docker networks (same host) +- Internal DNS / hostname resolution (cross-host) +- Explicitly defined published ports (external access) + +All Compose files are rendered by Ansible from Jinja2 templates. No hand-edited +Compose files exist on hosts — they are always regenerated on deploy. 
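The host-group tree above maps onto a YAML inventory roughly as follows. This is a minimal sketch, not the generated file: `control1` and its address are hypothetical, the remaining host names and addresses are borrowed from ADR-007, and placing `proxy` in `docker_hosts` is an assumption. The real `hosts.yml` is produced by `make tf-inventory` (ADR-009), so its exact shape may differ.

```yaml
# inventories/production/hosts.yml: illustrative sketch only
all:
  children:
    control:
      hosts:
        control1:                    # hypothetical name for the control node
          ansible_host: 10.20.0.20   # hypothetical address in the srv static range
    docker_hosts:
      hosts:
        proxy:                       # assumed to be a Docker host
          ansible_host: 10.20.0.12   # from the ADR-007 address table
    proxmox_hosts:
      hosts:
        pve0:
          ansible_host: 10.10.0.200
        pve1:
          ansible_host: 10.10.0.201
```

Running `ansible-inventory -i inventories/production/hosts.yml --graph` against a file like this prints the same tree, which is a quick check that group membership matches the intent.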
+ +## Decision + +This architecture prioritises: +- **Simplicity**: few moving parts, no orchestration layer (no Kubernetes, no Swarm) +- **Reproducibility**: any host can be rebuilt from scratch via Ansible +- **Legibility**: a human reading the repo can understand what runs where diff --git a/docs/decisions/002-security.md b/docs/decisions/002-security.md new file mode 100644 index 0000000..84bfaf4 --- /dev/null +++ b/docs/decisions/002-security.md @@ -0,0 +1,73 @@ +# ADR-002 — Security baseline + +## Context + +Every managed host must reach a defined security baseline before any services +are deployed. This baseline is applied by the `base` role and is non-negotiable — +it runs first, on every host, every time. + +The goal is a principled, maintainable baseline appropriate for a homelab with +some public-facing services — not a compliance exercise. + +## Baseline components + +### Access & authentication + +- SSH key authentication only — password auth disabled +- Root login disabled — `PermitRootLogin no` +- Dedicated `ansible` user with locked-down sudo (NOPASSWD for automation) +- No shared user accounts — per-person SSH keys in `group_vars/all/vars.yml` + +### Firewall + +- `nftables` (native on Debian 13, replaces iptables) +- Default policy: deny inbound, allow established/related, allow loopback +- Rules managed entirely by Ansible — never edited manually on hosts +- Port definitions live in `group_vars/` so rules stay in sync with deployed services +- Docker's own iptables rules are disabled — nftables manages all filtering + +> **Note on Docker + nftables**: Docker historically bypassed iptables-based firewalls. +> This is addressed by setting `"iptables": false` in Docker daemon config and managing +> all rules via nftables explicitly. See `docs/decisions/004-docker-model.md`. 
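One way the `"iptables": false` setting could be delivered is sketched below. The variable name `docker_host__daemon_config`, the handler name, and the log size limits are assumptions for illustration; only the `iptables` and `json-file` settings come from the ADRs.

```yaml
- name: Configure the Docker daemon (sketch)
  hosts: docker_hosts
  become: true
  vars:
    # Hypothetical variable; rendered verbatim to /etc/docker/daemon.json.
    docker_host__daemon_config:
      iptables: false            # Docker must not manage filter rules; nftables does
      log-driver: json-file
      log-opts:
        max-size: "10m"          # assumed limits, tune per host
        max-file: "3"
  tasks:
    - name: Render /etc/docker/daemon.json
      ansible.builtin.copy:
        content: "{{ docker_host__daemon_config | to_nice_json }}\n"
        dest: /etc/docker/daemon.json
        owner: root
        group: root
        mode: "0644"
      notify: Restart docker
  handlers:
    - name: Restart docker
      ansible.builtin.service:
        name: docker
        state: restarted
```

With `iptables` set to `false`, Docker also stops creating its NAT and masquerade rules, so container egress and published ports only work if the nftables ruleset provides the equivalent rules explicitly.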
+ +### Intrusion deterrence + +- `fail2ban` monitoring SSH (and optionally reverse proxy logs) +- Configured to ban after 5 failed attempts, 1-hour ban + +### Updates + +- `unattended-upgrades` enabled for **security patches only** +- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`) +- No automatic reboots — reboots are a conscious operational decision + +### Minimal attack surface + +- No unnecessary packages installed +- Docker daemon TCP socket disabled — Unix socket only +- No open ports beyond those explicitly defined in firewall rules + +### Audit trail + +- `auditd` installed and running with a baseline ruleset +- Logs shipped to a central location if a log aggregation service is available + +## Secrets management + +- Ansible Vault for all secrets (API keys, passwords, certificates) +- Vault password stored outside the repo (`.vault_pass` gitignored) +- New collaborators receive vault password via a separate secure channel +- See `docs/runbooks/rotate-secrets.md` for rotation procedure + +## What this baseline does not include + +- Full CIS benchmark hardening — adds complexity for marginal gain at this scale +- SELinux / AppArmor — not applied by default, revisit if threat model changes +- Intrusion detection (IDS) — out of scope for now + +## Decision + +This baseline was chosen to be: +- **Effective** against the realistic threat model (exposed services, shared repo) +- **Maintainable** by a small team without security expertise overhead +- **Automated** — no manual steps should be needed to reach baseline state diff --git a/docs/decisions/003-toolchain.md b/docs/decisions/003-toolchain.md new file mode 100644 index 0000000..ef260c3 --- /dev/null +++ b/docs/decisions/003-toolchain.md @@ -0,0 +1,135 @@ +# ADR-003 — Toolchain decisions + +## Execution engine + +**Choice**: `ansible-core` (pip-installed, pinned version) + explicit `requirements.yml` + +**Not chosen**: `ansible` full package (bundles ~85 collections at 
a frozen version) + +**Rationale**: Explicit collection pinning allows independent upgrades, smaller installs, +and fully reproducible environments. The full package trades these away for convenience +that isn't needed in a maintained monorepo. + +--- + +## Python environment + +**Choice**: `python3-venv` (system Python on Debian 13) + pinned `requirements.txt` + +**Not chosen**: `pyenv` (solves multi-version problems on developer laptops, not needed +on a dedicated Debian control node with a controlled Python version) + +**Rationale**: The control node runs one Python version. A plain venv is sufficient, +reproducible, and has no extra dependencies. + +--- + +## Secrets + +**Choice**: Ansible Vault (file-based, built-in) + +**Not chosen**: +- SOPS + age: better git-diff ergonomics, but adds external tooling and key management +- HashiCorp Vault: powerful, but significant operational overhead for this scale + +**Rationale**: Vault is built-in, requires no extra services, and works well at this +scale. The main limitation (whole-file encryption makes diffs unreadable) is mitigated +by keeping `vault.yml` files small and purposeful — only actual secrets, no structure. + +--- + +## Testing + +**Choice**: Molecule with Docker driver (`molecule-plugins[docker]`) + +**Not chosen**: +- Molecule + Podman: rootless is appealing, but Docker is simpler on a Debian control node +- Molecule + Vagrant: full VMs are slower and require a hypervisor on the control node +- No testing: unacceptable for a shared, maintained project + +**Test image**: a self-built, project-owned Debian 13 image with systemd support +(`.docker/molecule-debian13/`), hosted in the Forgejo registry. ADR-008 is canonical +for the image and the rationale for not using an external image such as +`geerlingguy/docker-debian13-ansible`. + +**Verifier**: Built-in Ansible verifier. Testinfra added later if deeper assertions +are needed. 
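A scenario file matching the choices above might look like the following. It is a sketch under assumptions: `MOLECULE_IMAGE` is a hypothetical environment variable standing in for the registry path of the project-owned image, and `privileged: true` is assumed to be what the systemd-capable image needs.

```yaml
# molecule/default/molecule.yml inside a role directory (sketch)
driver:
  name: docker
platforms:
  - name: debian13
    # MOLECULE_IMAGE is a hypothetical environment variable holding the
    # full registry path of the project-owned test image.
    image: ${MOLECULE_IMAGE}
    pre_build_image: true    # use the image as-is, do not build it
    privileged: true         # assumption: needed for systemd inside the container
provisioner:
  name: ansible
verifier:
  name: ansible              # built-in Ansible verifier, runs verify.yml
```

Keeping the image reference in an environment variable lets CI and local runs point at the same registry path without editing every role's scenario.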
+ +--- + +## Linting + +**Choice**: `ansible-lint` + `yamllint` + `pre-commit` + +- `yamllint`: catches formatting issues before Ansible sees the file +- `ansible-lint`: enforces correctness and idiomatic style +- `pre-commit`: runs both locally on every commit, preventing CI failures + +Config files: `.ansible-lint`, `.yamllint` in repo root. + +--- + +## CI/CD + +**Choice**: Forgejo Actions (self-hosted at git.baobab.band) + `act_runner` + +**Not chosen**: GitHub Actions (external), Jenkins (heavy) + +**Pipeline**: +1. Push to any branch → lint + Molecule tests +2. Merge to `main` → lint + Molecule tests + manual approval gate +3. After approval → deploy to staging, then production + +`act_runner` runs as a Docker container on the control node or a dedicated runner VM. + +--- + +## Developer ergonomics + +**Choice**: `Makefile` as the single interface for all operations + +**Rationale**: All `ansible-playbook`, `molecule`, and `ansible-lint` invocations go +through Make targets. This means: +- Claude Code always calls `make ` — never constructs raw commands +- Collaborators don't need to know the underlying flags +- CI uses the same targets as local development (no drift) + +**direnv**: Not used — the control node is a dedicated host, not a shared workstation. +The venv is activated in the user's shell profile. + +--- + +## Collections and roles policy + +**No Galaxy roles.** All roles are written and maintained locally in `roles/`. +Galaxy roles introduce external state, versioning surprises, and implicit +conventions that conflict with this repo's style. + +**Collections on demand.** A collection is added to `requirements.yml` only when +a task in a committed role actively uses a module from it. Pre-emptive inclusions +are removed. Each entry in `requirements.yml` must justify its presence. 
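Under this policy a `requirements.yml` stays very short. The sketch below assumes `ansible.posix` is the only collection in use; the version shown is an example pin, and the role path in the comment is an assumption.

```yaml
# requirements.yml: every entry carries its justification
collections:
  # Used by roles/base for SSH key management (authorized_key module);
  # the role path is an assumption for illustration.
  - name: ansible.posix
    version: "1.5.4"   # example pin only; use the release actually tested
```

`make collections` would then install exactly this set, presumably by wrapping `ansible-galaxy collection install -r requirements.yml`.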
+ +**Starting collection set** (rationale for each): + +| Collection | Kept / dropped | Reason | +|----------------|----------------|--------------------------------------------------------------| +| `ansible.posix`| Kept | Ansible-team maintained; fills real `ansible.builtin` gaps (`authorized_key`, `sysctl`, `acl`) | +| `community.docker` | Dropped | ADR-004 uses `ansible.builtin.command` + `docker compose` — no Docker API modules needed | +| `community.proxmox`| Dropped | Proxmox configuration is out of scope (ADR-001) | +| `community.crypto` | Deferred | Add when a role needs cert automation; use `openssl` CLI until then | +| `community.general`| Deferred | 1,500+ modules; add only the specific sub-module needed, with a comment | + +--- + +## What was explicitly ruled out + +| Tool | Reason not adopted | +|------------------|-------------------------------------------------------------| +| AWX / AAP | Significant operational overhead, not needed at this scale | +| Semaphore | Revisit if non-SSH operators need to trigger runs | +| ansible-runner | Only needed when AWX/Semaphore orchestrates runs | +| ansible-builder | Only needed when packaging Execution Environments for AWX | +| Kubernetes/Swarm | Out of scope — Docker Compose is the right complexity level | +| NixOS targets | Poor Ansible fit; all hosts standardised on Debian 13 | + +Terraform is **adopted** for VM provisioning only; internal DNS stays with Ansible's `dns` role (see `docs/decisions/006-terraform.md`). diff --git a/docs/decisions/004-docker-model.md b/docs/decisions/004-docker-model.md new file mode 100644 index 0000000..52d1051 --- /dev/null +++ b/docs/decisions/004-docker-model.md @@ -0,0 +1,77 @@ +# ADR-004 — Docker and Compose service model + +## Context + +All services run as Docker containers managed via Docker Compose. This document +defines how services are structured, deployed, and maintained. 
+ +## Core principles + +- **No hand-edited files on hosts**: all Compose files are rendered by Ansible + from Jinja2 templates. If a file exists on a host, it was put there by Ansible. +- **Compose per service**: each service (or tightly coupled service group) gets + its own Compose file and directory under a standard path. +- **Variables drive differences**: the same template renders differently per host + via `group_vars` and `host_vars`. No host-specific templates. + +## Directory layout on hosts + +``` +/opt/services/ +├── servicename/ +│ ├── docker-compose.yml # rendered by Ansible, never edited manually +│ ├── .env # rendered by Ansible from vault variables +│ └── data/ # persistent volumes (bind mounts) +│ └── ... +``` + +All services live under `/opt/services/`. The path is defined in +`group_vars/all/vars.yml` as `services__base_dir`. + +## Compose file delivery + +Each service has a corresponding Ansible role (or is managed by a shared role +with per-service variables). The role: + +1. Creates `/opt/services/servicename/` directory +2. Renders `docker-compose.yml` from `templates/docker-compose.yml.j2` +3. Renders `.env` from `templates/env.j2` (pulling secrets from vault variables) +4. Runs `docker compose up -d --remove-orphans` via `ansible.builtin.command` +5. Optionally runs `docker compose pull` before up (controlled by variable) + +## Docker daemon configuration + +Managed by the `docker_host` role. 
Key settings: + +- `"log-driver": "json-file"` with size limits (prevents disk exhaustion) +- `"iptables": false` — firewall managed entirely by nftables (see ADR-002) +- TCP socket disabled — Unix socket only (`/var/run/docker.sock`) +- User namespace remapping: evaluated per use case, not enabled by default + +## Networking + +- Each service Compose file defines its own named network(s) +- Services that need to communicate are placed on a shared named network + defined in a dedicated `docker-compose.networks.yml` (if cross-service + networking is needed on a host) +- External port publishing is explicit and matches nftables rules + +## Image management + +- Images are always pinned to a specific digest or tag in templates +- `latest` is never used in production Compose files +- Image updates are a deliberate operation: update the tag variable, run deploy + +## Persistent data + +- Bind mounts preferred over named volumes for data that must be backed up +- All bind mount paths are under `/opt/services//data/` +- Backup strategy is defined separately (not in scope of this repo) + +## Decision + +Docker Compose was chosen over Kubernetes/Swarm because: +- Appropriate complexity level for 2–5 hosts with independent service sets +- Compose files are human-readable and easily auditable +- No distributed state to manage +- Straightforward to back up and restore diff --git a/docs/decisions/005-bootstrapping.md b/docs/decisions/005-bootstrapping.md new file mode 100644 index 0000000..06e1eff --- /dev/null +++ b/docs/decisions/005-bootstrapping.md @@ -0,0 +1,79 @@ +# ADR-005 — Host bootstrapping + +## Context + +This document defines the **cloud-init template** that managed VMs are cloned +from, and the **control-node** bootstrapping special case. The per-host +provisioning pipeline — how a VM is created from this template and handed off to +Ansible — is owned by ADR-009. 
Terraform clones the template defined here; the +template is the base image both for Terraform-managed hosts and for the manually +provisioned control node. + +## Approach: Proxmox cloud-init template + +Managed VMs are cloned from a Proxmox VM template based on the official Debian 13 +cloud image. Cloud-init handles first-boot configuration. Ansible takes over +from there. + +The cloud-init image was chosen over: +- **Manual Debian installer**: slow, error-prone, not reproducible +- **Preseed/netboot**: powerful but complex to maintain + +## Template creation (one-time, manual) + +This is a manual procedure performed once per Proxmox cluster. Documented in +`docs/runbooks/new-host.md`. + +High-level steps: +1. Download official Debian 13 genericcloud image +2. Import disk to Proxmox, create VM template +3. Install `qemu-guest-agent` in the template image +4. Convert VM to template — never boot the template directly + +## VM provisioning (per new host) + +Per-host VMs are created by **Terraform**, which clones this template and sets the +cloud-init values (hostname, SSH public key, IP/gateway). Terraform writes no DNS records; +the Ansible `dns` role renders the host's A record from inventory (see ADR-007/009). Cloud-init runs at +first boot (~30–60 seconds), leaving the VM reachable via SSH with the ansible user's key. + +The full create → inventory → configure pipeline, and the Terraform↔Ansible data +contract, are defined in **ADR-009 (provisioning handoff)**. There is no manual +`qm clone` path for managed hosts — the sole exception is the control node below. + +## Ansible handoff + +Once Terraform has created the VM and `make tf-inventory` has regenerated the +inventory, the `bootstrap` playbook handles first-run specifics (Python may not be +present, user may differ) and `site` applies the full standard state. See ADR-009 +for the end-to-end commands and `docs/runbooks/new-host.md` for the full procedure. 
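The "Python may not be present" case is usually handled by skipping fact gathering and using the `raw` module, which needs no Python on the target. The playbook below is a minimal sketch of that idea; the file path, the `INSTALLED` marker, and the variable name `bootstrap_python` are assumptions, not the repo's actual `bootstrap` playbook.

```yaml
# playbooks/bootstrap.yml: hypothetical path and contents
- name: Bootstrap a freshly provisioned host (sketch)
  hosts: all
  gather_facts: false          # fact gathering needs Python on the target
  become: true
  tasks:
    - name: Install Python if it is missing
      ansible.builtin.raw: >-
        command -v python3 >/dev/null 2>&1
        || (apt-get update && apt-get install -y python3 && echo INSTALLED)
      register: bootstrap_python
      # Report a change only when the install branch actually ran.
      changed_when: "'INSTALLED' in bootstrap_python.stdout"

    - name: Gather facts now that Python is available
      ansible.builtin.setup:
```

Because the `raw` task reports `changed` only when the marker is printed, a second run against the same host stays clean, in line with the idempotency rules in ADR-008.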
+ +## Control node bootstrapping + +The control node is a special case — it runs Terraform and Ansible, so it cannot +be created by the Terraform it hosts (chicken-and-egg). It is the one documented +exception to Terraform-owned VM existence (see ADR-009). The control node requires: + +1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or + `qm clone`), since Terraform is not yet available to do it +2. Manual setup of the Ansible environment: + ```bash + git clone ~/ansible + cd ~/ansible + make setup # creates venv, installs deps + make collections # installs Ansible collections + cp /secure/location/.vault_pass ~/ansible/.vault_pass + ``` +3. After that, the control node can manage all other hosts normally + +The control node itself is listed in `inventories/production/hosts.yml` under +a `control` group and can be managed for baseline config (SSH, firewall, updates) +but not for the `docker_host` role (it does not run services). + +## Decision + +Cloud-init with Proxmox templates provides: +- Reproducible VM creation in under 2 minutes +- No manual installer interaction +- A clean handoff point to Ansible +- Easy rebuilds — destroy VM, clone template, run Ansible diff --git a/docs/decisions/006-terraform.md b/docs/decisions/006-terraform.md new file mode 100644 index 0000000..70218a8 --- /dev/null +++ b/docs/decisions/006-terraform.md @@ -0,0 +1,111 @@ +# ADR-006 — Terraform for infrastructure provisioning + +## Context + +Ansible manages host configuration well but has no state model for infrastructure +existence. Adding Terraform handles the "what exists" layer — creating and destroying +VMs on Proxmox — while Ansible continues to own everything that runs inside them, +including all internal DNS records. + +This complements rather than replaces Ansible. The two tools do not overlap. 
The +exact boundary, handoff pipeline, and data contract between them live in **ADR-009 +(provisioning handoff)** — this ADR covers Terraform's own internals only. + +--- + +## Responsibility split + +The canonical responsibility-split table lives in **ADR-009**. In short: Terraform +owns VM existence only; Ansible owns everything inside a VM, including all internal +DNS records. + +**OPNsense is entirely Ansible.** The available Terraform providers for OPNsense +are community-maintained with real risk of provider rot across OPNsense releases. +OPNsense firewall rules also change on a service cadence, not an infrastructure +cadence, making them a poor fit for Terraform state. + +--- + +## Providers + +**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance, +full Proxmox 8 API support, and better cloud-init integration. This is the only +provider. + +Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136) +to write A records, but that created a bootstrap cycle — the first DNS server cannot +register itself — and split DNS ownership across two tools. Ansible's `dns` role now +owns the entire internal zone, rendered from inventory. See ADR-009. + +No Galaxy roles. Terraform manages its own provider dependencies via +`required_providers` and `.terraform.lock.hcl` (tracked in git once `terraform init` +has been run). + +--- + +## State backend + +**Choice**: Forgejo HTTP backend (self-hosted at git.baobab.band) + +Keeps all state on the same self-hosted stack without additional services. +Authentication uses a Forgejo personal access token via `TF_HTTP_USERNAME` and +`TF_HTTP_PASSWORD` environment variables. + +**Note**: The backend URL in `backend.tf` is a placeholder — confirm the exact +endpoint path against your running Forgejo instance's API documentation before +running `terraform init`. 
If Forgejo's HTTP state is unavailable, remove the +`backend` block from `backend.tf` to fall back to local state on the control node. + +--- + +## Structure + +``` +terraform/ + modules/ + proxmox_vm/ # reusable VM module — Proxmox only, no DNS + environments/ + staging/ # staging VMs, separate state file + production/ # production VMs, separate state file +``` + +Separate environment directories (not Terraform workspaces) for the clearest +isolation — no risk of accidentally applying the wrong state. + +Each environment directory contains: +- `providers.tf` — provider version pins and configuration +- `backend.tf` — Forgejo state backend (environment-specific path) +- `variables.tf` — input declarations +- `terraform.tfvars.example` — tracked template; copy to `terraform.tfvars` for actual values +- `main.tf` — `local.vms` map and module calls (no DNS resources) +- `outputs.tf` — VM map consumed by `make tf-inventory` + +--- + +## Secrets handling + +The only secret input (the Proxmox API token) is passed via a `TF_VAR_*` +environment variable and declared `sensitive = true` in `variables.tf`. It never +appears in `.tfvars` files. Non-secret configuration lives in tracked +`terraform.tfvars.example`; the real `terraform.tfvars` is gitignored. + +--- + +## Ansible integration + +After `terraform apply`, run `make tf-inventory TF_ENV=` to regenerate +`inventories//hosts.yml` from the `vms` output. The full handoff pipeline, +the `vms` output → inventory data contract, and the generator script +(`scripts/tf_to_inventory.py`) are documented in **ADR-009 (provisioning +handoff)**. 
+ +--- + +## What was ruled out + +| Option | Reason | +|---|---| +| `telmate/proxmox` provider | Less actively maintained; weaker cloud-init and Proxmox 8 support | +| OPNsense Terraform provider | Community-maintained; provider rot risk across OPNsense releases | +| Terraform workspaces | Single state file with workspace prefix; accidental cross-env apply possible | +| Separate Terraform repo | Cross-referencing between infra and config adds friction; monorepo keeps the full picture together | diff --git a/docs/decisions/007-network.md b/docs/decisions/007-network.md new file mode 100644 index 0000000..5914133 --- /dev/null +++ b/docs/decisions/007-network.md @@ -0,0 +1,186 @@ +# ADR-007 — Network topology and addressing + +## Context + +The boma homelab is a Proxmox cluster on a dedicated private network behind an +OPNsense firewall. This document records the agreed physical topology, VLAN +design, IP addressing conventions, naming scheme, and DNS zone structure. +Everything here feeds directly into Terraform variables, Ansible inventory, +and OPNsense configuration. + +--- + +## Physical topology + +``` +ISP + └── OPNsense (dedicated hardware) + ├── WAN — ISP uplink + └── LAN — 802.1q trunk to managed switch + │ + ┌──────────────┼──────────────────────────┐ + │ │ │ │ + pve0 pve1 pve2 AP1 / AP2 + (eno1 trunk) (eno1 trunk) (eno1 trunk) (trunk) + (eno2 corosync)(eno2 corosync)(eno2 corosync) + └──────────────┴──────────────┘ + 172.16.0.0/24 (corosync ring — not on managed switch) +``` + +**Dual NICs per Proxmox node:** +- `eno1` — VLAN-aware trunk. Carries all VLANs via a single VLAN-aware bridge + (`vmbr0`). VMs get their VLAN tag assigned in Proxmox. +- `eno2` — Dedicated corosync ring (`vmbr1`). Direct link or tiny unmanaged + switch between the three nodes only. Never touches the main switch fabric. + +**Access points** broadcast multiple SSIDs, each tagged to its corresponding VLAN +(trusted WiFi → VLAN 30, IoT → VLAN 40, guest → VLAN 50). 
+ +--- + +## VLAN design + +| VLAN | Name | Subnet | Purpose | +|---|---|---|---| +| 10 | `mgmt` | `10.10.0.0/24` | Proxmox hosts, OPNsense, managed switch. No internet except update repos. | +| 20 | `srv` | `10.20.0.0/24` | All Debian VMs and Docker services. 100% static. Terraform provisions here. | +| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. | +| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. | +| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. | +| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. | + +--- + +## IP addressing + +### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP + +| Address | Host | +|---|---| +| `10.10.0.1` | OPNsense LAN (mgmt) | +| `10.10.0.2` | Managed switch | +| `10.10.0.200` | `pve0` | +| `10.10.0.201` | `pve1` | +| `10.10.0.202` | `pve2` | + +### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static + +| Range | Purpose | +|---|---| +| `10.20.0.1` | OPNsense gateway | +| `10.20.0.10`–`.19` | Core infrastructure VMs (DNS, proxy) | +| `10.20.0.20`–`.49` | Additional static infrastructure | +| `10.20.0.50`–`.249` | Terraform-provisioned VMs | + +Assigned infrastructure addresses: + +| Address | Host | Role | +|---|---|---| +| `10.20.0.10` | `dns1` | Primary DNS server | +| `10.20.0.11` | `dns2` | Secondary DNS server | +| `10.20.0.12` | `proxy` | Reverse proxy | +| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) | + +### VLAN 30 — lan (10.30.0.0/24) + +| Range | Purpose | +|---|---| +| `10.30.0.1` | OPNsense gateway | +| `10.30.0.100`–`.249` | DHCP pool | + +### VLAN 40 — iot (10.40.0.0/24) + +| Range | Purpose | +|---|---| +| `10.40.0.1` | OPNsense gateway | +| `10.40.0.100`–`.249` | DHCP pool | + +### VLAN 50 — guest (10.50.0.0/24) + +| Range | Purpose | +|---|---| +| `10.50.0.1` | OPNsense gateway | +| `10.50.0.100`–`.249` | DHCP pool | 
+ +### VLAN 99 — vpn (10.99.0.0/24) — WireGuard + +| Address | Host | +|---|---| +| `10.99.0.1` | OPNsense (WireGuard endpoint) | +| `10.99.0.2` | `askari` (Hetzner VPS) | +| `10.99.0.10`+ | Road-warrior clients | + +### Corosync ring (172.16.0.0/24) — not on managed switch + +| Address | Host | +|---|---| +| `172.16.0.200` | `pve0` | +| `172.16.0.201` | `pve1` | +| `172.16.0.202` | `pve2` | + +--- + +## OPNsense firewall rules (intent) + +| Source | Destination | Policy | +|---|---|---| +| `mgmt` | anywhere | allow (administrator access) | +| `srv` | `srv` | allow (inter-service communication) | +| `srv` | internet | allow (updates, image pulls) | +| `lan` | `srv` (allow-list) | allow specific published ports only | +| `lan` | internet | allow | +| `iot` | internet | allow egress only | +| `iot` | `srv` (HA IP only) | allow on integration ports | +| `guest` | internet | allow, isolated from all internal | +| `vpn` | `srv` (metrics ports) | allow (monitoring) | +| `vpn` | `mgmt` | allow (administration from askari) | + +**Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required +ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device discovery. +IoT devices cannot initiate connections to `srv`. + +--- + +## Naming scheme + +| Layer | Convention | Examples | +|---|---|---| +| Homelab name | `boma` | — | +| Proxmox nodes | `pve` | `pve0`, `pve1`, `pve2` | +| Infrastructure VMs | `` | `dns1`, `dns2`, `proxy` | +| Hetzner VPS | `askari` | Swahili for guard/sentinel | +| Internal FQDN | `.boma.baobab.band` | `dns1.boma.baobab.band` | +| Public service FQDN | `.baobab.band` | `git.baobab.band` | + +--- + +## DNS zones and split-horizon + +**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`. 
+The zone is rendered by the Ansible `dns` role: host A records come from the +inventory (which derives from Terraform's `local.vms` via `make tf-inventory`), +and service/alias/split-horizon records are explicit zone data in `group_vars`. +Terraform itself writes no DNS records — see ADR-009. + +**Public zone**: `baobab.band` — served by external DNS (Cloudflare or equivalent). +Public-facing services resolve to the public IP or Cloudflare proxy. + +**Split-horizon**: `dns1`/`dns2` serve internal answers for any hostname that has +both a public and private face. Example: `git.baobab.band` resolves to +`10.20.0.12` (proxy) internally and to the public IP externally. + +OPNsense DNS resolver forwards `boma.baobab.band` queries to `dns1`/`dns2`. +All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`). + +--- + +## External monitoring — askari + +`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`). +Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN +tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt` +for administration. + +`askari` is provisioned and managed independently of the Proxmox cluster — it +must be reachable even when the homelab is down (its entire purpose). +FQDN: `askari.baobab.band`. diff --git a/docs/decisions/008-testing.md b/docs/decisions/008-testing.md new file mode 100644 index 0000000..d338a3c --- /dev/null +++ b/docs/decisions/008-testing.md @@ -0,0 +1,160 @@ +# ADR-008 — Testing methodology + +## Context + +Ansible roles must be idempotent and correct before they touch production hosts. +This document records the testing strategy, what each level covers, and — critically +— what is explicitly out of scope for automated testing and why. + +--- + +## Three testing levels + +### Level 1 — Molecule (per role, always required) + +Runs in Docker on the control node or in CI. Fast (~5 min per role). + +**What happens during `molecule test`:** +1. 
`create` — start the test container +2. `converge` — apply the role via `converge.yml` +3. **`idempotency`** — run `converge.yml` again; fail if any task reports `changed` +4. `verify` — assert expected state via `verify.yml` +5. `destroy` — remove the container + +The idempotency step is non-negotiable. Every role must pass it cleanly. + +**`verify.yml` must assert outcomes, not task success:** + +```yaml +# Wrong — only proves the task ran +- assert: + that: result is success + +# Right — proves the outcome exists +- ansible.builtin.command: systemctl is-active fail2ban + changed_when: false + register: svc +- ansible.builtin.assert: + that: svc.stdout == "active" +``` + +### Level 2 — Staging playbook (full stack, real VMs) + +`make check PLAYBOOK=site` followed by `make deploy PLAYBOOK=site` on +Terraform-provisioned staging VMs. Catches inter-role dependencies and ordering +issues that Molecule cannot see (e.g., `docker_host` role requires `base` to +have already run and configured the firewall). + +Run before every merge to `main`. + +### Level 3 — External smoke test from askari + +Once `askari` is operational: scripted checks from outside the network confirming +that public-facing services respond correctly. Catches firewall and reverse proxy +configuration issues invisible to Ansible check mode. + +--- + +## Molecule test image + +**No external images.** The project builds and hosts its own test image. 
+ +**Source**: `.docker/molecule-debian13/Dockerfile` +**Base**: `debian:trixie-slim` (official Debian 13, Docker Hub — only external +dependency permitted here, as the base OS image is not substitutable) +**Registry**: `git.baobab.band///molecule-debian13:latest` + +Build and push with: +```bash +make molecule-image # build locally +make molecule-image-push # push to Forgejo registry (requires docker login) +``` + +The scaffold `molecule.yml` references this image with `pre_build_image: true`, +meaning Molecule uses the image as-is and does not attempt to build it. + +**Why not geerlingguy/docker-debian13-ansible?** It is a Docker Hub image outside +project control. It is not a Galaxy role, but it is an external dependency that +can drift, disappear, or introduce unexpected changes. The custom image is +functionally equivalent and fully owned. + +--- + +## Idempotency requirements + +Every role task must satisfy one of these: + +| Task type | Requirement | +|---|---| +| `apt`, `template`, `copy`, `file`, `user`, `group`, `service` | Naturally idempotent — no action needed | +| `command` / `shell` (read-only) | `changed_when: false` | +| `command` / `shell` (detectable change) | `changed_when: result.stdout \| length > 0` or equivalent | +| `command` / `shell` (creates a file) | `creates: /path/to/artifact` | +| Service restart after config change | Move to a handler; handler fires only when notified | +| `docker compose up -d` | Handler only — notified by template change, never runs unconditionally | + +ansible-lint enforces most of these at lint time. The Molecule idempotency step +catches anything lint misses. 
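
As a minimal sketch of how the rows in the idempotency table might look inside a role, the tasks below cover a read-only command, a command that creates an artifact, and a restart routed through a handler. The `example-init` binary, every path, the template name, and the `Restart example` topic are hypothetical and chosen only for illustration; they are not taken from this repository.

```yaml
# tasks/main.yml (illustrative only; names and paths are hypothetical)
- name: Read the effective sshd configuration
  ansible.builtin.command:
    cmd: sshd -T
  register: sshd_effective
  changed_when: false            # read-only command, so it never reports a change
  tags: [example]

- name: Initialise the example data directory once
  ansible.builtin.command:
    cmd: /usr/local/bin/example-init
    creates: /var/lib/example/.initialised   # task is skipped once this file exists
  tags: [example]

- name: Render the example service configuration
  ansible.builtin.template:
    src: example.conf.j2
    dest: /etc/example/example.conf
    mode: "0644"
  notify: Restart example        # the restart runs in a handler, only on change
  tags: [example]

# handlers/main.yml (illustrative only)
- name: Restart the example service
  ansible.builtin.systemd:
    name: example
    state: restarted
  listen: Restart example
```

On a second converge none of these tasks should report `changed`, which is the condition the Molecule idempotency step checks.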
+ +--- + +## What Molecule tests — and what it does not + +### Tested in Molecule + +| Capability | Notes | +|---|---| +| Package installation | `apt` works in the container | +| File and directory creation, permissions, ownership | Full support | +| Template rendering and content | Full support | +| User and group management | Full support | +| Service installation and `systemd enable` | Requires the systemd-capable image | +| Service start/stop | Works for most services in the container | +| SSH configuration file content | File-level only | +| fail2ban installation and configuration | Install and config file; not live banning | +| Docker daemon installation | Works in privileged container | +| auditd installation and configuration | Install and config file | +| Idempotency of all of the above | Enforced by Molecule's idempotency step | + +### Not tested in Molecule — explicit exceptions + +The following require a real kernel or real hardware and are validated only at +Level 2 (staging) or Level 3 (external). This is a conscious, documented decision +— not a gap. + +| Capability | Reason not testable in Molecule | +|---|---| +| `nftables` rule loading | Requires `nf_tables` kernel module; not available in Docker | +| WireGuard tunnel establishment | Requires `wireguard` kernel module | +| `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment | +| DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container | +| mDNS reflector (Avahi cross-VLAN) | Requires real network interfaces and VLANs | +| Hardware passthrough (NIC, USB) | Not applicable in containers | +| Corosync cluster formation | Requires multiple real nodes | + +For the above, Molecule tests only what it can: that the relevant packages are +installed, that configuration files render correctly, and that services are enabled. +Behavioural correctness is confirmed on staging. 
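
For the exceptions above, a `verify.yml` might look roughly like the sketch below, which checks installation, the configuration file, and enablement for nftables without loading any rules. It assumes the Debian defaults for the package name, `/etc/nftables.conf`, and the `nftables` unit; the role's real paths and names may differ.

```yaml
# molecule/default/verify.yml (hypothetical fragment; package, path and unit
# names are assumptions based on Debian defaults)
- name: Verify nftables as far as a container allows
  hosts: all
  gather_facts: false
  tasks:
    - name: Gather installed package facts
      ansible.builtin.package_facts:

    - name: Stat the ruleset file
      ansible.builtin.stat:
        path: /etc/nftables.conf
      register: ruleset_file

    - name: Query whether the service is enabled
      ansible.builtin.command:
        cmd: systemctl is-enabled nftables
      register: unit_state
      changed_when: false        # read-only query
      failed_when: false         # let the assert below report the failure

    - name: Assert package, file and enablement
      ansible.builtin.assert:
        that:
          - "'nftables' in ansible_facts.packages"
          - ruleset_file.stat.exists
          - unit_state.stdout == "enabled"
```

Nothing here loads the ruleset, so the check stays within what a container can prove; whether the rules actually filter traffic is still left to Level 2.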
+ +--- + +## CI pipeline + +``` +push to any branch + ├── yamllint + ansible-lint (fast gate, ~1 min) + └── molecule test (changed roles) (parallel, ~5 min per role) + +pull request to main + ├── yamllint + ansible-lint + ├── molecule test (all roles) (parallel) + └── [manual gate] review tf-plan and make check on staging + +merge to main + ├── yamllint + ansible-lint + molecule test (final gate) + ├── [manual approval] make deploy PLAYBOOK=site on staging + └── [manual approval] make deploy PLAYBOOK=site on production +``` + +Manual gates are intentional. Automated tests prove correctness in isolation; +a human confirms the change is safe to promote. diff --git a/docs/decisions/009-provisioning-handoff.md b/docs/decisions/009-provisioning-handoff.md new file mode 100644 index 0000000..7c1e0a0 --- /dev/null +++ b/docs/decisions/009-provisioning-handoff.md @@ -0,0 +1,149 @@ +# ADR-009 — Terraform ↔ Ansible provisioning handoff + +## Context + +Two tools touch every managed host. Terraform owns **what exists** — VMs on +Proxmox. Ansible owns **what is configured inside** — users, packages, firewall, +Docker services, and all internal DNS. This ADR is the single source of truth for +the seam between them: the exact handoff, the data contract, and the one documented +exception. The two tools must never overlap; this document defines the line they +meet at. + +ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers +the cloud-init template that VMs are cloned from. This ADR covers how they connect. 
+ +--- + +## The boundary + +| Layer | Tool | Notes | +|---|---|---| +| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs | +| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record | +| OS configuration | Ansible | Users, SSH, firewall, packages | +| Service deployment | Ansible | Docker, Compose files, secrets | +| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs | +| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 | + +This table is canonical here. ADR-006 links to it rather than restating it. +Terraform owns VM **existence** only — it writes no DNS records (see "Internal DNS" +below). + +--- + +## The handoff pipeline + +There is one path by which a managed host comes into existence and reaches its +configured state: + +``` +make tf-plan TF_ENV=production # review infrastructure changes +make tf-apply TF_ENV=production # clone template → VM (no DNS records written) +make tf-inventory TF_ENV=production # regenerate Ansible inventory from outputs +make check PLAYBOOK=site # dry-run Ansible against the new host(s) +make deploy PLAYBOOK=bootstrap # first-run specifics (see ADR-005) +make deploy PLAYBOOK=site # full standard state — `dns` role writes the zone +``` + +`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005). +`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From +`make check` onward the host is Ansible's — including its DNS record, which the +`dns` role writes into the internal zone during `make deploy`. + +Adding a host means editing `local.vms` in the environment's `main.tf` and running +this pipeline — **never** by hand-editing the inventory. + +--- + +## The data contract + +The seam's interface is a single Terraform output consumed by a single script. 
+ +**Producer** — `terraform/environments//outputs.tf` emits a `vms` map: + +```json +{ + "vms": { + "value": { + "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" } + } + } +} +``` + +**Consumer** — `scripts/tf_to_inventory.py` (Python standard library only) reads +`terraform output -json` and writes `inventories//hosts.yml`. It validates the +group against the allowed set and fails loudly on an unknown group. + +**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`. + +The generated `hosts.yml` carries a "do not edit manually" header and is owned by +the generator. Treat it as a build artifact: the source of truth is `local.vms` in +Terraform, and the inventory is regenerated, never edited. + +--- + +## Cloud-init's role + +Cloud-init is the thin first-boot layer between Terraform and Ansible: + +- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values + (hostname, SSH public key, IP/gateway). +- **Cloud-init** does just enough at first boot to make the VM reachable over SSH + with the ansible user's key — nothing more. +- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles + first-run specifics, then `site` applies the full standard state. + +The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*. + +--- + +## Internal DNS — owned by Ansible, no chicken-and-egg + +Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is +rendered entirely by the Ansible `dns` role: + +- **Host A records** derive from the inventory — the same `hostname → ip` data that + originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform + remains the ultimate source of truth for which hosts exist; the data simply flows + through the inventory instead of through a direct Terraform→DNS write. +- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense + gateway, `git.baobab.band` → proxy) are explicit zone data in `group_vars`. 
+ +This dissolves the bootstrap cycle that a Terraform-managed zone would create. If +Terraform wrote records via RFC 2136, provisioning the **first** DNS server would +require a DNS server that does not yet exist — `dns1` cannot register its own A +record before it is running and configured. Because Ansible renders the zone from +inventory (using IP addresses, never name resolution, to connect), `dns1`/`dns2` +are ordinary Terraform-created VMs whose records are written by the same role that +configures the DNS service. There is no special case and no ordering trap. + +ADR-007 holds the zone structure, split-horizon, and addressing conventions. The +IP-range split there (`.10–.19` core infra vs `.50–.249` fleet) is now an addressing +convention only — it no longer implies any difference in how records are written. + +--- + +## The control-node exception + +The control node — the host that runs Terraform and Ansible — is the one VM +Terraform does **not** create. It cannot provision the infrastructure that would +provision itself (chicken-and-egg). It is therefore the single documented exception +to "Terraform owns VM existence": + +- Provisioned and bootstrapped manually, per the control-node section of ADR-005. +- Listed in `inventories//hosts.yml` under the `control` group, and managed by + Ansible for baseline config only (no `docker_host` role). + +Every other host is Terraform-managed. + +--- + +## What was ruled out + +| Option | Reason | +|---|---| +| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. | +| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. | +| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. 
Those ADRs link here. | +| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. | diff --git a/docs/runbooks/new-host.md b/docs/runbooks/new-host.md new file mode 100644 index 0000000..411c8d5 --- /dev/null +++ b/docs/runbooks/new-host.md @@ -0,0 +1,145 @@ +# Runbook — Adding a new managed host + +## Prerequisites + +- Proxmox VM template exists (Debian 13 cloud-init image — see below if not) +- You have the vault password (`.vault_pass`) +- The host's intended hostname and IP are decided + +--- + +## Part A — Create the Proxmox template (one-time) + +Run on a Proxmox node. Only needed once per cluster. + +```bash +# Download the Debian 13 genericcloud image +wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2 + +# Create a VM (adjust ID, storage name as needed) +qm create 9000 --name debian13-template --memory 2048 --cores 2 \ + --net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0 + +# Import the disk +qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm + +# Attach disk and set boot order +qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0 +qm set 9000 --boot c --bootdisk scsi0 + +# Add cloud-init drive +qm set 9000 --ide2 local-lvm:cloudinit + +# Enable QEMU guest agent +qm set 9000 --agent enabled=1 + +# Convert to template (cannot be undone) +qm template 9000 +``` + +--- + +## Part B — Define the VM in Terraform + +Managed hosts are created by Terraform, never by hand. 
Add an entry to `local.vms` +in the environment's `main.tf` (`terraform/environments//main.tf`): + +```hcl +locals { + vms = { + = { + ip = "/24" # static; from docs/decisions/007-network.md + group = "docker_hosts" # control | docker_hosts | proxmox_hosts + cores = 2 + memory_mb = 2048 + } + } +} +``` + +Terraform clones the cloud-init template from Part A and sets the cloud-init values +(hostname, SSH key, IP/gateway). It writes no DNS records; the host's A record is +written later by the Ansible `dns` role during `make deploy PLAYBOOK=site`. See ADR-009 +for the full handoff and the `vms` output → inventory data contract. + +--- + +## Part C — Provision and regenerate the inventory + +```bash +make tf-plan TF_ENV=production # review: confirm only the new VM is added +make tf-apply TF_ENV=production # create the VM (no DNS records written) +make tf-inventory TF_ENV=production # regenerate inventories/production/hosts.yml +``` + +`make tf-inventory` rewrites `hosts.yml` from Terraform outputs. **Do not edit +that file by hand**; it carries a "do not edit manually" header and your changes +would be overwritten. The source of truth is `local.vms`. + +Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access: + +```bash +ssh ansible@ echo ok +``` + +Add a `host_vars//` directory if the host needs specific overrides +(this is config, not inventory membership, so it is not generated): + +```bash +mkdir -p inventories/production/host_vars/ +touch inventories/production/host_vars//vars.yml +``` + +--- + +## Part D — Bootstrap and configure + +```bash +# First-run bootstrap (handles Python installation, initial user setup) +make deploy PLAYBOOK=bootstrap + +# Apply full standard state +make deploy PLAYBOOK=site +``` + +Verify the host reaches baseline: + +```bash +make check PLAYBOOK=site +# Should report no changes +``` + +--- + +## Part E — Control node (manual exception) + +The control node runs Terraform and Ansible, so it cannot be created by the +Terraform it hosts (chicken-and-egg). 
It is the **one** host provisioned manually — +see ADR-009 and the control-node section of ADR-005. Use the template from Part A: + +```bash +# Clone the template by hand (Proxmox UI or qm clone) +qm clone 9000 --name --full +qm set --memory 2048 --cores 2 \ + --ciuser ansible \ + --sshkeys /path/to/ansible_ed25519.pub \ + --ipconfig0 ip=/24,gw= +qm start +``` + +Then set up the Ansible environment on it (`make setup`, `make collections`, place +`.vault_pass`) per ADR-005, and add it to `inventories//hosts.yml` under the +`control` group. Because the control node is not in `local.vms`, this is the only +case where editing `hosts.yml` by hand is expected — every other host comes from +`make tf-inventory`. + +--- + +## Troubleshooting + +**SSH connection refused**: cloud-init may still be running. Wait and retry. + +**Python not found**: the bootstrap playbook handles this via `raw` module. +If bootstrap fails, SSH to the host manually and run `apt install -y python3`. + +**Firewall locked out**: if nftables rules are misconfigured, connect via +Proxmox console (not SSH) and run `nft flush ruleset` to clear all rules temporarily. diff --git a/docs/runbooks/new-role.md b/docs/runbooks/new-role.md new file mode 100644 index 0000000..b466454 --- /dev/null +++ b/docs/runbooks/new-role.md @@ -0,0 +1,81 @@ +# Runbook — Adding a new Ansible role + +## When to create a new role + +Create a new role when you need to manage a distinct, reusable unit of +configuration — a service, a system component, or a behaviour applied to +a group of hosts. + +Do not create a role for a single task that logically belongs in an existing role. + +## Procedure + +### 1. Scaffold the role + +```bash +make new-role NAME= +``` + +This creates the full directory structure and placeholder files under `roles//`. + +### 2. 
Fill in meta/main.yml + +```yaml +galaxy_info: + role_name: + author: + description: + min_ansible_version: "2.15" + platforms: + - name: Debian + versions: + - trixie # Debian 13 +``` + +### 3. Define defaults + +Add all tuneable variables to `defaults/main.yml` with inline comments explaining +each variable. Use the `rolename__varname` namespace convention. + +### 4. Write tasks + +- Use FQCN for all modules +- Every task must have a `name:` that reads as a sentence +- Every task must have at least one `tags:` entry +- Notify handlers by `listen:` topic string, not handler name + +### 5. Configure Molecule + +Edit `molecule/default/molecule.yml` to use the Debian 13 test image. +Write a `converge.yml` that applies the role. Write a `verify.yml` that +asserts the expected state. + +### 6. Write the README + +Document: +- Purpose of the role (one paragraph) +- All variables from `defaults/main.yml` with types, defaults, and descriptions +- Example playbook usage +- Any dependencies or prerequisites + +### 7. Test locally + +```bash +make test ROLE= +``` + +Fix any lint or test failures before committing. + +### 8. Add to a playbook + +Add the role to the appropriate playbook in `playbooks/` and add the host group +to `inventories/staging/hosts.yml` for integration testing. + +### 9. Commit + +```bash +git checkout -b role/ +git add roles/ +git commit -m "Add role" +# open PR / merge request in Forgejo +``` diff --git a/docs/runbooks/rotate-secrets.md b/docs/runbooks/rotate-secrets.md new file mode 100644 index 0000000..26cc096 --- /dev/null +++ b/docs/runbooks/rotate-secrets.md @@ -0,0 +1,71 @@ +# Runbook — Rotating vault secrets + +## Rotating a single secret value + +1. Decrypt the relevant vault file: + ```bash + make decrypt FILE=inventories/production/group_vars/all/vault.yml + ``` + +2. Edit the file and update the secret value. + +3. Re-encrypt: + ```bash + make encrypt FILE=inventories/production/group_vars/all/vault.yml + ``` + +4. 
Commit the updated vault file: + ```bash + git add inventories/production/group_vars/all/vault.yml + git commit -m "Rotate " + ``` + +5. Deploy to apply the new secret to hosts: + ```bash + make check PLAYBOOK=site # verify what will change + make deploy PLAYBOOK=site + ``` + +--- + +## Rotating the vault password + +This affects all encrypted files in the repo. Do this only when: +- A person with vault access leaves the project +- The password is suspected to be compromised + +Steps: + +1. Ensure you have the current vault password in `.vault_pass`. + +2. Re-key all vault files: + ```bash + find . -name "vault.yml" | xargs ansible-vault rekey \ + --vault-password-file .vault_pass \ + --new-vault-password-file /path/to/new_password_file + ``` + +3. Replace `.vault_pass` with the new password file. + +4. Distribute the new password to all collaborators via a secure channel. + +5. Commit all rekeyed vault files: + ```bash + git add -A + git commit -m "Rekey all vault files" + ``` + +--- + +## Adding a new collaborator + +1. Share the vault password via a secure channel (password manager, etc.) +2. The collaborator creates `.vault_pass` locally (gitignored) +3. They can now decrypt/encrypt vault files normally + +## Removing a collaborator's access + +Rotate the vault password as described above. There is no per-user access +control in Ansible Vault — access is binary (has the password or not). + +If per-user access control becomes necessary, evaluate SOPS + age at that point.