Add architecture decision records and runbooks

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-05-30 14:10:01 +02:00
parent 3f1d7eb128
commit fe4228fb38
13 changed files with 1340 additions and 0 deletions

docs/README.md

@ -0,0 +1,11 @@
# docs/
Project documentation.
- `decisions/` — Architecture Decision Records (ADRs): the "why" behind the design.
Numbered from 001; each records context, the decision, and what was ruled out.
- `runbooks/` — step-by-step operational procedures (add a host, add a role, rotate
secrets).
For what is actually **built vs only designed**, see `STATUS.md` at the repo root —
the ADRs describe intent, not necessarily current reality.

@ -0,0 +1,62 @@
# ADR-001 — Architecture overview
## Context
This document describes the overall architecture of the homelab infrastructure
and the boundaries of what this Ansible monorepo manages.
## Infrastructure
- **Hypervisor**: Proxmox cluster (2+ nodes)
- **Guest OS**: Debian 13 (all managed hosts)
- **Scale**: 25 VMs, small fleet — treated as individuals, not cattle
- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
The control node is the one host that cannot fully bootstrap itself from scratch
and requires manual initial setup (see `docs/runbooks/new-host.md`).
## What this repo manages
| Layer | Managed by | Notes |
|--------------------|--------------------|--------------------------------------------|
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
| Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) |
| OS baseline | Ansible `base` role | Users, SSH, firewall, updates, audit |
| Docker runtime | Ansible `docker_host` role | Engine, daemon config, log driver |
| Service deployment | Ansible per-service roles | Compose rendered from templates |
| Secrets | Ansible Vault | Encrypted `vault.yml` files in repo |
The Terraform↔Ansible boundary and handoff are defined in ADR-009.
## Host groups
```
all
├── control # the control node itself — baseline config only, runs no services
├── docker_hosts # VMs running Docker services (most hosts)
└── proxmox_hosts # Proxmox nodes themselves (limited management scope)
```
The `control` group holds the single manually-provisioned control node; it is
managed for baseline config (SSH, firewall, updates) but never runs the
`docker_host` role. Proxmox nodes are managed only for basic baseline tasks (SSH,
monitoring agent). Proxmox configuration itself (storage, clustering, networking)
is out of scope.
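As a rough picture of the group layout above, an inventory in Ansible's YAML format could look like the sketch below. The `control1` hostname and the placement of `proxy` and `homeassistant` under `docker_hosts` are assumptions made for illustration; the addresses are the ones assigned in ADR-007. The real `inventories/<env>/hosts.yml` is regenerated by `make tf-inventory` (ADR-006), so this shows the shape rather than a file to edit by hand.

```yaml
# Sketch of inventories/production/hosts.yml (shape only, not the generated file)
all:
  children:
    control:
      hosts:
        control1: {}                    # hypothetical hostname for the control node
    docker_hosts:
      hosts:
        proxy:
          ansible_host: 10.20.0.12      # address from ADR-007
        homeassistant:
          ansible_host: 10.20.0.13      # address from ADR-007
    proxmox_hosts:
      hosts:
        pve0:
          ansible_host: 10.10.0.200
        pve1:
          ansible_host: 10.10.0.201
        pve2:
          ansible_host: 10.10.0.202
```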
## Service interaction model
Services run as Docker containers on one or more `docker_hosts`. Where services
need to interact, they do so via:
- Docker networks (same host)
- Internal DNS / hostname resolution (cross-host)
- Explicitly defined published ports (external access)
All Compose files are rendered by Ansible from Jinja2 templates. No hand-edited
Compose files exist on hosts — they are always regenerated on deploy.
## Decision
This architecture prioritises:
- **Simplicity**: few moving parts, no orchestration layer (no Kubernetes, no Swarm)
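- **Reproducibility**: any managed host can be rebuilt from scratch via Terraform and Ansible; the control node is the one exception and needs the manual bootstrap in ADR-005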
- **Legibility**: a human reading the repo can understand what runs where

@ -0,0 +1,73 @@
# ADR-002 — Security baseline
## Context
Every managed host must reach a defined security baseline before any services
are deployed. This baseline is applied by the `base` role and is non-negotiable —
it runs first, on every host, every time.
The goal is a principled, maintainable baseline appropriate for a homelab with
some public-facing services — not a compliance exercise.
## Baseline components
### Access & authentication
- SSH key authentication only — password auth disabled
- Root login disabled — `PermitRootLogin no`
- Dedicated `ansible` user with locked-down sudo (NOPASSWD for automation)
- No shared user accounts — per-person SSH keys in `group_vars/all/vars.yml`
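One possible way for the `base` role to enforce the two SSH settings above is a drop-in file under `/etc/ssh/sshd_config.d/`. This is a minimal sketch, not the role's actual tasks: the file names and the handler name are hypothetical, and it assumes the stock Debian `sshd_config` still includes `sshd_config.d/*.conf`.

```yaml
# roles/base/tasks/ssh.yml (hypothetical file name)
- name: Enforce key-only SSH and no root login
  ansible.builtin.copy:
    dest: /etc/ssh/sshd_config.d/10-baseline.conf   # hypothetical drop-in name
    content: |
      PasswordAuthentication no
      PermitRootLogin no
    owner: root
    group: root
    mode: "0644"
  notify: Restart ssh

# roles/base/handlers/main.yml
- name: Restart ssh
  ansible.builtin.service:
    name: ssh            # Debian's unit name for the OpenSSH server
    state: restarted
```

Using a handler keeps the restart tied to an actual config change, which matches the idempotency rules in ADR-008.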
### Firewall
- `nftables` (native on Debian 13, replaces iptables)
- Default policy: deny inbound, allow established/related, allow loopback
- Rules managed entirely by Ansible — never edited manually on hosts
- Port definitions live in `group_vars/` so rules stay in sync with deployed services
- Docker's own iptables rules are disabled — nftables manages all filtering
> **Note on Docker + nftables**: Docker historically bypassed iptables-based firewalls.
> This is addressed by setting `"iptables": false` in Docker daemon config and managing
> all rules via nftables explicitly. See `docs/decisions/004-docker-model.md`.
### Intrusion deterrence
- `fail2ban` monitoring SSH (and optionally reverse proxy logs)
- Configured to ban after 5 failed attempts, 1-hour ban
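The ban policy above maps onto a small jail override. The sketch below assumes the jail is delivered as a file under `jail.d/`; the file name and handler name are hypothetical, and `backend = systemd` is an assumption for hosts where sshd logs only to the journal.

```yaml
- name: Configure the fail2ban SSH jail
  ansible.builtin.copy:
    dest: /etc/fail2ban/jail.d/sshd.local   # hypothetical file name
    content: |
      [sshd]
      enabled  = true
      maxretry = 5
      bantime  = 1h
      # Assumption: sshd logs to the journal only, so read from systemd
      backend  = systemd
    owner: root
    group: root
    mode: "0644"
  notify: Restart fail2ban                  # hypothetical handler
```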
### Updates
- `unattended-upgrades` enabled for **security patches only**
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
- No automatic reboots — reboots are a conscious operational decision
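A sketch of how the update policy could be expressed as APT configuration files managed by Ansible. The second file name is hypothetical. Restricting the allowed origins to security updates only is not shown here and would need its own `Unattended-Upgrade::Origins-Pattern` setting.

```yaml
- name: Enable periodic unattended upgrades
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/20auto-upgrades
    content: |
      APT::Periodic::Update-Package-Lists "1";
      APT::Periodic::Unattended-Upgrade "1";
    owner: root
    group: root
    mode: "0644"

- name: Never reboot automatically after an upgrade
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/52unattended-upgrades-local   # hypothetical drop-in name
    content: |
      Unattended-Upgrade::Automatic-Reboot "false";
    owner: root
    group: root
    mode: "0644"
```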
### Minimal attack surface
- No unnecessary packages installed
- Docker daemon TCP socket disabled — Unix socket only
- No open ports beyond those explicitly defined in firewall rules
### Audit trail
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location if a log aggregation service is available
## Secrets management
- Ansible Vault for all secrets (API keys, passwords, certificates)
- Vault password stored outside the repo (`.vault_pass` gitignored)
- New collaborators receive vault password via a separate secure channel
- See `docs/runbooks/rotate-secrets.md` for rotation procedure
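A common Ansible pattern that fits this setup is to keep the encrypted file down to bare values and reference them from a readable vars file. The variable names and the value below are purely illustrative.

```yaml
# group_vars/all/vars.yml (plain text, readable in diffs)
proxy__admin_password: "{{ vault_proxy__admin_password }}"   # hypothetical variable

# group_vars/all/vault.yml (encrypted with ansible-vault; shown decrypted here)
vault_proxy__admin_password: "example-only-not-a-real-secret"
```

With this split, roles only ever reference the unprefixed name, and a grep for `vault_` shows every secret the repo depends on.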
## What this baseline does not include
- Full CIS benchmark hardening — adds complexity for marginal gain at this scale
- SELinux / AppArmor — no custom policies or profiles are managed by this baseline (Debian's default AppArmor setup is left as shipped); revisit if threat model changes
- Intrusion detection (IDS) — out of scope for now
## Decision
This baseline was chosen to be:
- **Effective** against the realistic threat model (exposed services, shared repo)
- **Maintainable** by a small team without security expertise overhead
- **Automated** — no manual steps should be needed to reach baseline state

@ -0,0 +1,135 @@
# ADR-003 — Toolchain decisions
## Execution engine
**Choice**: `ansible-core` (pip-installed, pinned version) + explicit `requirements.yml`
**Not chosen**: `ansible` full package (bundles ~85 collections at a frozen version)
**Rationale**: Explicit collection pinning allows independent upgrades, smaller installs,
and fully reproducible environments. The full package trades these away for convenience
that isn't needed in a maintained monorepo.
---
## Python environment
**Choice**: `python3-venv` (system Python on Debian 13) + pinned `requirements.txt`
**Not chosen**: `pyenv` (solves multi-version problems on developer laptops, not needed
on a dedicated Debian control node with a controlled Python version)
**Rationale**: The control node runs one Python version. A plain venv is sufficient,
reproducible, and has no extra dependencies.
---
## Secrets
**Choice**: Ansible Vault (file-based, built-in)
**Not chosen**:
- SOPS + age: better git-diff ergonomics, but adds external tooling and key management
- HashiCorp Vault: powerful, but significant operational overhead for this scale
**Rationale**: Vault is built-in, requires no extra services, and works well at this
scale. The main limitation (whole-file encryption makes diffs unreadable) is mitigated
by keeping `vault.yml` files small and purposeful — only actual secrets, no structure.
---
## Testing
**Choice**: Molecule with Docker driver (`molecule-plugins[docker]`)
**Not chosen**:
- Molecule + Podman: rootless is appealing, but Docker is simpler on a Debian control node
- Molecule + Vagrant: full VMs are slower and require a hypervisor on the control node
- No testing: unacceptable for a shared, maintained project
**Test image**: a self-built, project-owned Debian 13 image with systemd support
(`.docker/molecule-debian13/`), hosted in the Forgejo registry. ADR-008 is canonical
for the image and the rationale for not using an external image such as
`geerlingguy/docker-debian13-ansible`.
**Verifier**: Built-in Ansible verifier. Testinfra added later if deeper assertions
are needed.
---
## Linting
**Choice**: `ansible-lint` + `yamllint` + `pre-commit`
- `yamllint`: catches formatting issues before Ansible sees the file
- `ansible-lint`: enforces correctness and idiomatic style
- `pre-commit`: runs both locally on every commit, preventing CI failures
Config files: `.ansible-lint`, `.yamllint` in repo root.
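A minimal `.pre-commit-config.yaml` wiring the two linters together might look like this. The `rev` values are illustrative pins, not the versions this repo uses.

```yaml
# .pre-commit-config.yaml (sketch; rev values are illustrative)
repos:
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1
    hooks:
      - id: yamllint
  - repo: https://github.com/ansible/ansible-lint
    rev: v24.2.0
    hooks:
      - id: ansible-lint
```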
---
## CI/CD
**Choice**: Forgejo Actions (self-hosted at git.baobab.band) + `act_runner`
**Not chosen**: GitHub Actions (external), Jenkins (heavy)
**Pipeline**:
1. Push to any branch → lint + Molecule tests
2. Merge to `main` → lint + Molecule tests + manual approval gate
3. After approval → deploy to staging, then production
`act_runner` runs as a Docker container on the control node or a dedicated runner VM.
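The first pipeline stage could be sketched as a workflow file like the one below. Several details are assumptions: the file path, the `docker` runner label, and the `make lint` and `make test` targets (only `make setup` and `make collections` appear elsewhere in these docs). Giving the Molecule job access to a Docker daemon is not shown.

```yaml
# .forgejo/workflows/ci.yml (sketch)
name: ci
on: [push]
jobs:
  lint:
    runs-on: docker            # assumption: label registered by the runner
    steps:
      - uses: actions/checkout@v4
      - run: make setup        # creates the venv and installs pinned deps
      - run: make lint         # hypothetical target wrapping yamllint + ansible-lint
  molecule:
    runs-on: docker
    steps:
      - uses: actions/checkout@v4
      - run: make setup
      - run: make collections
      - run: make test         # hypothetical target wrapping `molecule test`
```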
---
## Developer ergonomics
**Choice**: `Makefile` as the single interface for all operations
**Rationale**: All `ansible-playbook`, `molecule`, and `ansible-lint` invocations go
through Make targets. This means:
- Claude Code always calls `make <target>` — never constructs raw commands
- Collaborators don't need to know the underlying flags
- CI uses the same targets as local development (no drift)
**direnv**: Not used — the control node is a dedicated host, not a shared workstation.
The venv is activated in the user's shell profile.
---
## Collections and roles policy
**No Galaxy roles.** All roles are written and maintained locally in `roles/`.
Galaxy roles introduce external state, versioning surprises, and implicit
conventions that conflict with this repo's style.
**Collections on demand.** A collection is added to `requirements.yml` only when
a task in a committed role actively uses a module from it. Pre-emptive inclusions
are removed. Each entry in `requirements.yml` must justify its presence.
**Starting collection set** (rationale for each):
| Collection | Kept / dropped | Reason |
|----------------|----------------|--------------------------------------------------------------|
| `ansible.posix`| Kept | Ansible-team maintained; fills real `ansible.builtin` gaps (`authorized_key`, `sysctl`, `acl`) |
| `community.docker` | Dropped | ADR-004 uses `ansible.builtin.command` + `docker compose` — no Docker API modules needed |
| `community.proxmox`| Dropped | Proxmox configuration is out of scope (ADR-001) |
| `community.crypto` | Deferred | Add when a role needs cert automation; use `openssl` CLI until then |
| `community.general`| Deferred | 1,500+ modules; add only the specific sub-module needed, with a comment |
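
Under this policy the starting `requirements.yml` stays very short. A sketch, with an illustrative version pin and an assumed justification comment:

```yaml
# requirements.yml (sketch; the version is an illustrative pin)
collections:
  # Justification (assumed): used by the base role for authorized_key and sysctl tasks
  - name: ansible.posix
    version: "1.5.4"
```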
---
## What was explicitly ruled out
| Tool | Reason not adopted |
|------------------|-------------------------------------------------------------|
| AWX / AAP | Significant operational overhead, not needed at this scale |
| Semaphore | Revisit if non-SSH operators need to trigger runs |
| ansible-runner | Only needed when AWX/Semaphore orchestrates runs |
| ansible-builder | Only needed when packaging Execution Environments for AWX |
| Kubernetes/Swarm | Out of scope — Docker Compose is the right complexity level |
| NixOS targets | Poor Ansible fit; all hosts standardised on Debian 13 |
Terraform is **adopted** for VM provisioning only; internal DNS is owned by the Ansible `dns` role. See `docs/decisions/006-terraform.md`.

@ -0,0 +1,77 @@
# ADR-004 — Docker and Compose service model
## Context
All services run as Docker containers managed via Docker Compose. This document
defines how services are structured, deployed, and maintained.
## Core principles
- **No hand-edited files on hosts**: all Compose files are rendered by Ansible
from Jinja2 templates. If a file exists on a host, it was put there by Ansible.
- **Compose per service**: each service (or tightly coupled service group) gets
its own Compose file and directory under a standard path.
- **Variables drive differences**: the same template renders differently per host
via `group_vars` and `host_vars`. No host-specific templates.
## Directory layout on hosts
```
/opt/services/
├── servicename/
│ ├── docker-compose.yml # rendered by Ansible, never edited manually
│ ├── .env # rendered by Ansible from vault variables
│ └── data/ # persistent volumes (bind mounts)
│ └── ...
```
All services live under `/opt/services/`. The path is defined in
`group_vars/all/vars.yml` as `services__base_dir`.
## Compose file delivery
Each service has a corresponding Ansible role (or is managed by a shared role
with per-service variables). The role:
1. Creates `/opt/services/servicename/` directory
2. Renders `docker-compose.yml` from `templates/docker-compose.yml.j2`
3. Renders `.env` from `templates/env.j2` (pulling secrets from vault variables)
4. Runs `docker compose up -d --remove-orphans` via `ansible.builtin.command`
5. Optionally runs `docker compose pull` before up (controlled by variable)
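Steps 1 to 4 could be sketched as the following tasks and handler. The `service_name` variable, file modes, and handler name are assumptions. Routing `docker compose up` through a handler is one way to satisfy the idempotency rule in ADR-008. The optional pull in step 5 is left out.

```yaml
# roles/<service>/tasks/main.yml (sketch)
- name: Create service directory
  ansible.builtin.file:
    path: "{{ services__base_dir }}/{{ service_name }}"   # service_name is hypothetical
    state: directory
    owner: root
    group: root
    mode: "0750"

- name: Render docker-compose.yml
  ansible.builtin.template:
    src: docker-compose.yml.j2
    dest: "{{ services__base_dir }}/{{ service_name }}/docker-compose.yml"
    mode: "0640"
  notify: Compose up

- name: Render .env from vault-backed variables
  ansible.builtin.template:
    src: env.j2
    dest: "{{ services__base_dir }}/{{ service_name }}/.env"
    mode: "0600"
  no_log: true                 # keep secret values out of task output
  notify: Compose up

# roles/<service>/handlers/main.yml (sketch)
- name: Compose up
  ansible.builtin.command:
    cmd: docker compose up -d --remove-orphans
    chdir: "{{ services__base_dir }}/{{ service_name }}"
  changed_when: true           # only runs when notified, so a change did happen
```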
## Docker daemon configuration
Managed by the `docker_host` role. Key settings:
- `"log-driver": "json-file"` with size limits (prevents disk exhaustion)
- `"iptables": false` — firewall managed entirely by nftables (see ADR-002)
- TCP socket disabled — Unix socket only (`/var/run/docker.sock`)
- User namespace remapping: evaluated per use case, not enabled by default
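One way to keep these settings in variables and render `/etc/docker/daemon.json` from them is sketched below. The variable name, the size limits, and the handler name are assumptions. The TCP socket needs no setting here, since Docker listens only on the Unix socket unless a TCP host is configured.

```yaml
# roles/docker_host/defaults/main.yml (sketch)
docker_host__daemon_config:      # hypothetical variable name
  log-driver: json-file
  log-opts:
    max-size: "10m"              # illustrative limits
    max-file: "3"
  iptables: false
---
# roles/docker_host/tasks/main.yml (sketch)
- name: Render Docker daemon configuration
  ansible.builtin.copy:
    dest: /etc/docker/daemon.json
    content: "{{ docker_host__daemon_config | to_nice_json }}\n"
    owner: root
    group: root
    mode: "0644"
  notify: Restart docker         # hypothetical handler
```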
## Networking
- Each service Compose file defines its own named network(s)
- Services that need to communicate are placed on a shared named network
defined in a dedicated `docker-compose.networks.yml` (if cross-service
networking is needed on a host)
- External port publishing is explicit and matches nftables rules
## Image management
- Images are always pinned to a specific digest or tag in templates
- `latest` is never used in production Compose files
- Image updates are a deliberate operation: update the tag variable, run deploy
## Persistent data
- Bind mounts preferred over named volumes for data that must be backed up
- All bind mount paths are under `/opt/services/<name>/data/`
- Backup strategy is defined separately (not in scope of this repo)
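Taken together, the networking, image, and data rules suggest a template shaped roughly like this. All variable names, the container port, and the network name are hypothetical.

```yaml
# templates/docker-compose.yml.j2 (sketch; Jinja2 variables are hypothetical)
services:
  app:
    image: "{{ service_image }}:{{ service_image_tag }}"   # pinned tag or digest, never latest
    restart: unless-stopped
    env_file: .env
    volumes:
      - ./data:/data                      # bind mount under /opt/services/<name>/data/
    networks:
      - app_net
    ports:
      - "{{ service_published_port }}:8080"   # must match an nftables rule
networks:
  app_net:
    name: app_net
```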
## Decision
Docker Compose was chosen over Kubernetes/Swarm because:
- Appropriate complexity level for 25 hosts with independent service sets
- Compose files are human-readable and easily auditable
- No distributed state to manage
- Straightforward to back up and restore

@ -0,0 +1,79 @@
# ADR-005 — Host bootstrapping
## Context
This document defines the **cloud-init template** that managed VMs are cloned
from, and the **control-node** bootstrapping special case. The per-host
provisioning pipeline — how a VM is created from this template and handed off to
Ansible — is owned by ADR-009. Terraform clones the template defined here; the
template is the base image both for Terraform-managed hosts and for the manually
provisioned control node.
## Approach: Proxmox cloud-init template
Managed VMs are cloned from a Proxmox VM template based on the official Debian 13
cloud image. Cloud-init handles first-boot configuration. Ansible takes over
from there.
The cloud-init image was chosen over:
- **Manual Debian installer**: slow, error-prone, not reproducible
- **Preseed/netboot**: powerful but complex to maintain
## Template creation (one-time, manual)
This is a manual procedure performed once per Proxmox cluster. Documented in
`docs/runbooks/new-host.md`.
High-level steps:
1. Download official Debian 13 genericcloud image
2. Import disk to Proxmox, create VM template
3. Install `qemu-guest-agent` in the template image
4. Convert VM to template — never boot the template directly
## VM provisioning (per new host)
Per-host VMs are created by **Terraform**, which clones this template and sets the
cloud-init values (hostname, SSH public key, IP/gateway). The host's DNS A record is
not written by Terraform; the Ansible `dns` role renders it from the inventory (see
ADR-006 and ADR-009). Cloud-init runs at first boot (~30–60 seconds), leaving the VM
reachable via SSH with the ansible user's key.
The full create → inventory → configure pipeline, and the Terraform↔Ansible data
contract, are defined in **ADR-009 (provisioning handoff)**. There is no manual
`qm clone` path for managed hosts — the sole exception is the control node below.
## Ansible handoff
Once Terraform has created the VM and `make tf-inventory` has regenerated the
inventory, the `bootstrap` playbook handles first-run specifics (Python may not be
present, user may differ) and `site` applies the full standard state. See ADR-009
for the end-to-end commands and `docs/runbooks/new-host.md` for the full procedure.
## Control node bootstrapping
The control node is a special case — it runs Terraform and Ansible, so it cannot
be created by the Terraform it hosts (chicken-and-egg). It is the one documented
exception to Terraform-owned VM existence (see ADR-009). The control node requires:
1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
`qm clone`), since Terraform is not yet available to do it
2. Manual setup of the Ansible environment:
```bash
git clone <repo> ~/ansible
cd ~/ansible
make setup # creates venv, installs deps
make collections # installs Ansible collections
cp /secure/location/.vault_pass ~/ansible/.vault_pass
```
3. After that, the control node can manage all other hosts normally
The control node itself is listed in `inventories/production/hosts.yml` under
a `control` group and can be managed for baseline config (SSH, firewall, updates)
but not for the `docker_host` role (it does not run services).
## Decision
Cloud-init with Proxmox templates provides:
- Reproducible VM creation in under 2 minutes
- No manual installer interaction
- A clean handoff point to Ansible
- Easy rebuilds — destroy VM, clone template, run Ansible

@ -0,0 +1,111 @@
# ADR-006 — Terraform for infrastructure provisioning
## Context
Ansible manages host configuration well but has no state model for infrastructure
existence. Adding Terraform handles the "what exists" layer — creating and destroying
VMs on Proxmox — while Ansible continues to own everything that runs inside them,
including all internal DNS records.
This complements rather than replaces Ansible. The two tools do not overlap. The
exact boundary, handoff pipeline, and data contract between them live in **ADR-009
(provisioning handoff)** — this ADR covers Terraform's own internals only.
---
## Responsibility split
The canonical responsibility-split table lives in **ADR-009**. In short: Terraform
owns VM existence only; Ansible owns everything inside a VM, including all internal
DNS records.
**OPNsense is entirely Ansible.** The available Terraform providers for OPNsense
are community-maintained with real risk of provider rot across OPNsense releases.
OPNsense firewall rules also change on a service cadence, not an infrastructure
cadence, making them a poor fit for Terraform state.
---
## Providers
**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
full Proxmox 8 API support, and better cloud-init integration. This is the only
provider.
Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
to write A records, but that created a bootstrap cycle — the first DNS server cannot
register itself — and split DNS ownership across two tools. Ansible's `dns` role now
owns the entire internal zone, rendered from inventory. See ADR-009.
No Galaxy roles. Terraform manages its own provider dependencies via
`required_providers` and `.terraform.lock.hcl` (tracked in git once `terraform init`
has been run).
---
## State backend
**Choice**: Forgejo HTTP backend (self-hosted at git.baobab.band)
Keeps all state on the same self-hosted stack without additional services.
Authentication uses a Forgejo personal access token via `TF_HTTP_USERNAME` and
`TF_HTTP_PASSWORD` environment variables.
**Note**: The backend URL in `backend.tf` is a placeholder — confirm the exact
endpoint path against your running Forgejo instance's API documentation before
running `terraform init`. If Forgejo's HTTP state is unavailable, remove the
`backend` block from `backend.tf` to fall back to local state on the control node.
---
## Structure
```
terraform/
  modules/
    proxmox_vm/       # reusable VM module — Proxmox only, no DNS
  environments/
    staging/          # staging VMs, separate state file
    production/       # production VMs, separate state file
```
Separate environment directories (not Terraform workspaces) for the clearest
isolation — no risk of accidentally applying the wrong state.
Each environment directory contains:
- `providers.tf` — provider version pins and configuration
- `backend.tf` — Forgejo state backend (environment-specific path)
- `variables.tf` — input declarations
- `terraform.tfvars.example` — tracked template; copy to `terraform.tfvars` for actual values
- `main.tf` — `local.vms` map and module calls (no DNS resources)
- `outputs.tf` — VM map consumed by `make tf-inventory`
---
## Secrets handling
The only secret input (the Proxmox API token) is passed via a `TF_VAR_*`
environment variable and declared `sensitive = true` in `variables.tf`. It never
appears in `.tfvars` files. Non-secret configuration lives in tracked
`terraform.tfvars.example`; the real `terraform.tfvars` is gitignored.
---
## Ansible integration
After `terraform apply`, run `make tf-inventory TF_ENV=<env>` to regenerate
`inventories/<env>/hosts.yml` from the `vms` output. The full handoff pipeline,
the `vms` output → inventory data contract, and the generator script
(`scripts/tf_to_inventory.py`) are documented in **ADR-009 (provisioning
handoff)**.
---
## What was ruled out
| Option | Reason |
|---|---|
| `telmate/proxmox` provider | Less actively maintained; weaker cloud-init and Proxmox 8 support |
| OPNsense Terraform provider | Community-maintained; provider rot risk across OPNsense releases |
| Terraform workspaces | Environments share one configuration directory and backend, separated only by the selected workspace; accidental cross-env apply possible |
| Separate Terraform repo | Cross-referencing between infra and config adds friction; monorepo keeps the full picture together |

@ -0,0 +1,186 @@
# ADR-007 — Network topology and addressing
## Context
The boma homelab is a Proxmox cluster on a dedicated private network behind an
OPNsense firewall. This document records the agreed physical topology, VLAN
design, IP addressing conventions, naming scheme, and DNS zone structure.
Everything here feeds directly into Terraform variables, Ansible inventory,
and OPNsense configuration.
---
## Physical topology
```
ISP
└── OPNsense (dedicated hardware)
    ├── WAN — ISP uplink
    └── LAN — 802.1q trunk to managed switch
        ┌──────────────┼──────────────────────────┐
        │              │              │           │
      pve0           pve1           pve2      AP1 / AP2
  (eno1 trunk)   (eno1 trunk)   (eno1 trunk)   (trunk)
 (eno2 corosync)(eno2 corosync)(eno2 corosync)
        └──────────────┴──────────────┘
        172.16.0.0/24 (corosync ring — not on managed switch)
```
**Dual NICs per Proxmox node:**
- `eno1` — VLAN-aware trunk. Carries all VLANs via a single VLAN-aware bridge
(`vmbr0`). VMs get their VLAN tag assigned in Proxmox.
- `eno2` — Dedicated corosync ring (`vmbr1`). Direct link or tiny unmanaged
switch between the three nodes only. Never touches the main switch fabric.
**Access points** broadcast multiple SSIDs, each tagged to its corresponding VLAN
(trusted WiFi → VLAN 30, IoT → VLAN 40, guest → VLAN 50).
---
## VLAN design
| VLAN | Name | Subnet | Purpose |
|---|---|---|---|
| 10 | `mgmt` | `10.10.0.0/24` | Proxmox hosts, OPNsense, managed switch. No internet except update repos. |
| 20 | `srv` | `10.20.0.0/24` | All Debian VMs and Docker services. 100% static. Terraform provisions here. |
| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
---
## IP addressing
### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP
| Address | Host |
|---|---|
| `10.10.0.1` | OPNsense LAN (mgmt) |
| `10.10.0.2` | Managed switch |
| `10.10.0.200` | `pve0` |
| `10.10.0.201` | `pve1` |
| `10.10.0.202` | `pve2` |
### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static
| Range | Purpose |
|---|---|
| `10.20.0.1` | OPNsense gateway |
| `10.20.0.10`–`.19` | Core infrastructure VMs (DNS, proxy) |
| `10.20.0.20`–`.49` | Additional static infrastructure |
| `10.20.0.50`–`.249` | Terraform-provisioned VMs |
Assigned infrastructure addresses:
| Address | Host | Role |
|---|---|---|
| `10.20.0.10` | `dns1` | Primary DNS server |
| `10.20.0.11` | `dns2` | Secondary DNS server |
| `10.20.0.12` | `proxy` | Reverse proxy |
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
### VLAN 30 — lan (10.30.0.0/24)
| Range | Purpose |
|---|---|
| `10.30.0.1` | OPNsense gateway |
| `10.30.0.100`–`.249` | DHCP pool |
### VLAN 40 — iot (10.40.0.0/24)
| Range | Purpose |
|---|---|
| `10.40.0.1` | OPNsense gateway |
| `10.40.0.100`–`.249` | DHCP pool |
### VLAN 50 — guest (10.50.0.0/24)
| Range | Purpose |
|---|---|
| `10.50.0.1` | OPNsense gateway |
| `10.50.0.100`–`.249` | DHCP pool |
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
| Address | Host |
|---|---|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
| `10.99.0.2` | `askari` (Hetzner VPS) |
| `10.99.0.10`+ | Road-warrior clients |
### Corosync ring (172.16.0.0/24) — not on managed switch
| Address | Host |
|---|---|
| `172.16.0.200` | `pve0` |
| `172.16.0.201` | `pve1` |
| `172.16.0.202` | `pve2` |
---
## OPNsense firewall rules (intent)
| Source | Destination | Policy |
|---|---|---|
| `mgmt` | anywhere | allow (administrator access) |
| `srv` | `srv` | allow (inter-service communication) |
| `srv` | internet | allow (updates, image pulls) |
| `lan` | `srv` (allow-list) | allow specific published ports only |
| `lan` | internet | allow |
| `iot` | internet | allow egress only |
| `iot` | `srv` (HA IP only) | allow on integration ports |
| `guest` | internet | allow, isolated from all internal |
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
| `vpn` | `mgmt` | allow (administration from askari) |
**Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required
ports. OPNsense Avahi (mDNS reflector) bridges `srv` and `iot` for device discovery.
IoT devices cannot initiate connections to `srv`, apart from the HA address on the
integration ports allowed in the table above.
---
## Naming scheme
| Layer | Convention | Examples |
|---|---|---|
| Homelab name | `boma` | — |
| Proxmox nodes | `pve<n>` | `pve0`, `pve1`, `pve2` |
| Infrastructure VMs | `<role><n>` | `dns1`, `dns2`, `proxy` |
| Hetzner VPS | `askari` | Swahili for guard/sentinel |
| Internal FQDN | `<host>.boma.baobab.band` | `dns1.boma.baobab.band` |
| Public service FQDN | `<service>.baobab.band` | `git.baobab.band` |
---
## DNS zones and split-horizon
**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
The zone is rendered by the Ansible `dns` role: host A records come from the
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
and service/alias/split-horizon records are explicit zone data in `group_vars`.
Terraform itself writes no DNS records — see ADR-009.
**Public zone**: `baobab.band` — served by external DNS (Cloudflare or equivalent).
Public-facing services resolve to the public IP or Cloudflare proxy.
**Split-horizon**: `dns1`/`dns2` serve internal answers for any hostname that has
both a public and private face. Example: `git.baobab.band` resolves to
`10.20.0.12` (proxy) internally and to the public IP externally.
OPNsense DNS resolver forwards `boma.baobab.band` queries to `dns1`/`dns2`.
All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).
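The explicit zone data kept in `group_vars` could take a shape like the sketch below. The variable names and record schema are hypothetical; the record is the split-horizon example given above. Host A records are absent because they come from the inventory.

```yaml
# group_vars sketch for the dns role (variable names are hypothetical)
dns__zone: boma.baobab.band
dns__split_horizon_records:
  # Internal answer for a name that also exists in the public zone
  - name: git.baobab.band
    type: A
    value: 10.20.0.12          # proxy, per ADR-007
```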
---
## External monitoring — askari
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
for administration.
`askari` is provisioned and managed independently of the Proxmox cluster — it
must be reachable even when the homelab is down (its entire purpose).
FQDN: `askari.baobab.band`.

@ -0,0 +1,160 @@
# ADR-008 — Testing methodology
## Context
Ansible roles must be idempotent and correct before they touch production hosts.
This document records the testing strategy, what each level covers, and — critically
— what is explicitly out of scope for automated testing and why.
---
## Three testing levels
### Level 1 — Molecule (per role, always required)
Runs in Docker on the control node or in CI. Fast (~5 min per role).
**What happens during `molecule test`:**
1. `create` — start the test container
2. `converge` — apply the role via `converge.yml`
3. **`idempotence`** — run `converge.yml` again; fail if any task reports `changed`
4. `verify` — assert expected state via `verify.yml`
5. `destroy` — remove the container
The idempotency step is non-negotiable. Every role must pass it cleanly.
**`verify.yml` must assert outcomes, not task success:**
```yaml
# Wrong — only proves the task ran
- assert:
    that: result is success
# Right — proves the outcome exists
- ansible.builtin.command: systemctl is-active fail2ban
  changed_when: false
  register: svc
- ansible.builtin.assert:
    that: svc.stdout == "active"
```
### Level 2 — Staging playbook (full stack, real VMs)
`make check PLAYBOOK=site` followed by `make deploy PLAYBOOK=site` on
Terraform-provisioned staging VMs. Catches inter-role dependencies and ordering
issues that Molecule cannot see (e.g., `docker_host` role requires `base` to
have already run and configured the firewall).
Run before every merge to `main`.
### Level 3 — External smoke test from askari
Once `askari` is operational: scripted checks from outside the network confirming
that public-facing services respond correctly. Catches firewall and reverse proxy
configuration issues invisible to Ansible check mode.
---
## Molecule test image
**No external images.** The project builds and hosts its own test image.
**Source**: `.docker/molecule-debian13/Dockerfile`
**Base**: `debian:trixie-slim` (official Debian 13, Docker Hub — only external
dependency permitted here, as the base OS image is not substitutable)
**Registry**: `git.baobab.band/<owner>/<repo-name>/molecule-debian13:latest`
Build and push with:
```bash
make molecule-image # build locally
make molecule-image-push # push to Forgejo registry (requires docker login)
```
The scaffold `molecule.yml` references this image with `pre_build_image: true`,
meaning Molecule uses the image as-is and does not attempt to build it.
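A scaffold `molecule.yml` along those lines might look like this sketch. The file path and the platform name are assumptions, and `privileged: true` is assumed from the note elsewhere in this ADR that Docker daemon installation is tested in a privileged container. The `<owner>/<repo-name>` placeholders are kept as written above. Any systemd-specific settings (command, cgroup mounts) depend on how the image is built and are not shown.

```yaml
# roles/<role>/molecule/default/molecule.yml (sketch)
driver:
  name: docker
platforms:
  - name: instance               # hypothetical platform name
    image: git.baobab.band/<owner>/<repo-name>/molecule-debian13:latest
    pre_build_image: true        # use the image as-is, do not build it
    privileged: true             # assumption: needed for systemd and Docker tests
provisioner:
  name: ansible
verifier:
  name: ansible
```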
**Why not geerlingguy/docker-debian13-ansible?** It is a Docker Hub image outside
project control. It is not a Galaxy role, but it is an external dependency that
can drift, disappear, or introduce unexpected changes. The custom image is
functionally equivalent and fully owned.
---
## Idempotency requirements
Every role task must satisfy one of these:
| Task type | Requirement |
|---|---|
| `apt`, `template`, `copy`, `file`, `user`, `group`, `service` | Naturally idempotent — no action needed |
| `command` / `shell` (read-only) | `changed_when: false` |
| `command` / `shell` (detectable change) | `changed_when: result.stdout \| length > 0` or equivalent |
| `command` / `shell` (creates a file) | `creates: /path/to/artifact` |
| Service restart after config change | Move to a handler; handler fires only when notified |
| `docker compose up -d` | Handler only — notified by template change, never runs unconditionally |
ansible-lint enforces most of these at lint time. The Molecule idempotency step
catches anything lint misses.
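
Three rows of the table, sketched as tasks. The commands, paths, template name, and handler name are illustrative rather than taken from an existing role.

```yaml
# Read-only command: never reports a change
- name: Check whether fail2ban is active
  ansible.builtin.command: systemctl is-active fail2ban
  register: f2b_state
  changed_when: false
  failed_when: false            # inactive is a valid answer here, not an error

# Command that creates a file: skipped once the artifact exists
- name: Generate Diffie-Hellman parameters once
  ansible.builtin.command:
    cmd: openssl dhparam -out /etc/ssl/dhparam.pem 2048   # illustrative command and path
    creates: /etc/ssl/dhparam.pem

# Config change: restart only via a notified handler
- name: Render fail2ban jail configuration
  ansible.builtin.template:
    src: jail.local.j2          # hypothetical template name
    dest: /etc/fail2ban/jail.local
    mode: "0644"
  notify: Restart fail2ban      # hypothetical handler
```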
---
## What Molecule tests — and what it does not
### Tested in Molecule
| Capability | Notes |
|---|---|
| Package installation | `apt` works in the container |
| File and directory creation, permissions, ownership | Full support |
| Template rendering and content | Full support |
| User and group management | Full support |
| Service installation and `systemctl enable` | Requires the systemd-capable image |
| Service start/stop | Works for most services in the container |
| SSH configuration file content | File-level only |
| fail2ban installation and configuration | Install and config file; not live banning |
| Docker daemon installation | Works in privileged container |
| auditd installation and configuration | Install and config file |
| Idempotency of all of the above | Enforced by Molecule's idempotency step |
### Not tested in Molecule — explicit exceptions
The following require a real kernel or real hardware and are validated only at
Level 2 (staging) or Level 3 (external). This is a conscious, documented decision
— not a gap.
| Capability | Reason not testable in Molecule |
|---|---|
| `nftables` rule loading | Requires `nf_tables` kernel module; not available in Docker |
| WireGuard tunnel establishment | Requires `wireguard` kernel module |
| `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment |
| DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container |
| mDNS reflector (Avahi cross-VLAN) | Requires real network interfaces and VLANs |
| Hardware passthrough (NIC, USB) | Not applicable in containers |
| Corosync cluster formation | Requires multiple real nodes |
For the above, Molecule tests only what it can: that the relevant packages are
installed, that configuration files render correctly, and that services are enabled.
Behavioural correctness is confirmed on staging.
---
## CI pipeline
```
push to any branch
├── yamllint + ansible-lint (fast gate, ~1 min)
└── molecule test (changed roles) (parallel, ~5 min per role)
pull request to main
├── yamllint + ansible-lint
├── molecule test (all roles) (parallel)
└── [manual gate] review tf-plan and make check on staging
merge to main
├── yamllint + ansible-lint + molecule test (final gate)
├── [manual approval] make deploy PLAYBOOK=site on staging
└── [manual approval] make deploy PLAYBOOK=site on production
```
Manual gates are intentional. Automated tests prove correctness in isolation;
a human confirms the change is safe to promote.
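The pipeline above runs Molecule only for changed roles on a branch push, but how that set is computed is not specified here. One plausible approach, sketched below, maps a list of changed file paths (for example the output of `git diff --name-only`) to role names. The function name `changed_roles` is hypothetical, and the sketch assumes roles live under `roles/<rolename>/`.

```python
from pathlib import PurePosixPath


def changed_roles(changed_paths: list[str]) -> list[str]:
    """Return the sorted, de-duplicated role names touched by the given paths."""
    roles = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Only paths of the form roles/<rolename>/<something> count as a role change.
        if len(parts) >= 3 and parts[0] == "roles":
            roles.add(parts[1])
    return sorted(roles)
```

Each returned name could then be handed to `make test ROLE=<rolename>`, one job per role.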


@ -0,0 +1,149 @@
# ADR-009 — Terraform ↔ Ansible provisioning handoff
## Context
Two tools touch every managed host. Terraform owns **what exists** — VMs on
Proxmox. Ansible owns **what is configured inside** — users, packages, firewall,
Docker services, and all internal DNS. This ADR is the single source of truth for
the seam between them: the exact handoff, the data contract, and the one documented
exception. The two tools must never overlap; this document defines the line they
meet at.
ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers
the cloud-init template that VMs are cloned from. This ADR covers how they connect.
---
## The boundary
| Layer | Tool | Notes |
|---|---|---|
| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record |
| OS configuration | Ansible | Users, SSH, firewall, packages |
| Service deployment | Ansible | Docker, Compose files, secrets |
| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 |
This table is canonical here. ADR-006 links to it rather than restating it.
Terraform owns VM **existence** only — it writes no DNS records (see "Internal DNS"
below).
---
## The handoff pipeline
There is one path by which a managed host comes into existence and reaches its
configured state:
```
make tf-plan TF_ENV=production # review infrastructure changes
make tf-apply TF_ENV=production # clone template → VM (no DNS records written)
make tf-inventory TF_ENV=production # regenerate Ansible inventory from outputs
make check PLAYBOOK=site # dry-run Ansible against the new host(s)
make deploy PLAYBOOK=bootstrap # first-run specifics (see ADR-005)
make deploy PLAYBOOK=site # full standard state — `dns` role writes the zone
```
`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005).
`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From
`make check` onward the host is Ansible's — including its DNS record, which the
`dns` role writes into the internal zone during `make deploy`.
To add a host, edit `local.vms` in the environment's `main.tf` and run this
pipeline. **Never** add a host by hand-editing the inventory.
---
## The data contract
The seam's interface is a single Terraform output consumed by a single script.
**Producer** — `terraform/environments/<env>/outputs.tf` emits a `vms` map:
```json
{
  "vms": {
    "value": {
      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
    }
  }
}
```
**Consumer** — `scripts/tf_to_inventory.py` (Python standard library only) reads
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
group against the allowed set and fails loudly on an unknown group.
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`.
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
Terraform, and the inventory is regenerated, never edited.
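The contract can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's `scripts/tf_to_inventory.py`: the function name `build_inventory` and the inventory layout (`all.children.<group>.hosts.<name>.ansible_host`) are assumptions, and the sketch returns a dict instead of writing `hosts.yml`.

```python
import json

# Allowed groups, as listed in this ADR.
VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts"}


def build_inventory(tf_output_json: str) -> dict:
    """Turn `terraform output -json` text into an inventory-shaped dict.

    The nested layout returned here is an assumed shape for illustration.
    """
    vms = json.loads(tf_output_json)["vms"]["value"]
    children: dict = {}
    for name, attrs in sorted(vms.items()):
        group = attrs["group"]
        if group not in VALID_GROUPS:
            # Fail loudly instead of silently creating an unknown group.
            raise ValueError(f"{name}: unknown group {group!r}")
        hosts = children.setdefault(group, {"hosts": {}})["hosts"]
        hosts[name] = {"ansible_host": attrs["ip"]}
    return {"all": {"children": children}}


if __name__ == "__main__":
    sample = json.dumps(
        {"vms": {"value": {"host-a": {"ip": "192.168.1.10", "group": "docker_hosts"}}}}
    )
    print(json.dumps(build_inventory(sample), indent=2))
```

Raising on an unknown group, instead of skipping the host, keeps a typo in `local.vms` from producing an inventory that quietly omits a machine.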
---
## Cloud-init's role
Cloud-init is the thin first-boot layer between Terraform and Ansible:
- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values
(hostname, SSH public key, IP/gateway).
- **Cloud-init** does just enough at first boot to make the VM reachable over SSH
with the ansible user's key — nothing more.
- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles
first-run specifics, then `site` applies the full standard state.
The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.
---
## Internal DNS — owned by Ansible, no chicken-and-egg
Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
rendered entirely by the Ansible `dns` role:
- **Host A records** derive from the inventory — the same `hostname → ip` data that
originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform
remains the ultimate source of truth for which hosts exist; the data simply flows
through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
gateway, `git.baobab.band` → proxy) are explicit zone data in `group_vars`.
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
require a DNS server that does not yet exist — `dns1` cannot register its own A
record before it is running and configured. Because Ansible renders the zone from
inventory (using IP addresses, never name resolution, to connect), `dns1`/`dns2`
are ordinary Terraform-created VMs whose records are written by the same role that
configures the DNS service. There is no special case and no ordering trap.
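Deriving host A records from inventory data can be sketched as follows. This is an illustration under stated assumptions, not the `dns` role itself: the function name `render_a_records`, the flat `hostname -> ip` mapping, and the BIND-style record format are all hypothetical, since the actual template and DNS server are not described here.

```python
import ipaddress


def render_a_records(hosts: dict[str, str], zone: str) -> list[str]:
    """Return one zone-file-style A record line per host, sorted by hostname."""
    lines = []
    for name, ip in sorted(hosts.items()):
        # Reject malformed addresses early; IPv4Address raises ValueError.
        ipaddress.IPv4Address(ip)
        lines.append(f"{name}.{zone}. IN A {ip}")
    return lines


if __name__ == "__main__":
    # Example addresses are placeholders, not real allocations.
    for line in render_a_records(
        {"host-a": "192.168.1.10", "dns1": "192.168.1.11"}, "boma.baobab.band"
    ):
        print(line)
```

Nothing in the sketch resolves a name: the input is already `hostname -> ip`, which is why the DNS servers themselves need no special handling.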
ADR-007 holds the zone structure, split-horizon, and addressing conventions. The
IP-range split there (`.10–.19` core infra vs `.50–.249` fleet) is now an addressing
convention only — it no longer implies any difference in how records are written.
---
## The control-node exception
The control node, the host that runs Terraform and Ansible, is the one VM
Terraform does **not** create. Terraform runs from this host, so it cannot create
the host it runs on (chicken-and-egg). The control node is therefore the single
documented exception to "Terraform owns VM existence":
- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
Ansible for baseline config only (no `docker_host` role).
Every other host is Terraform-managed.
---
## What was ruled out
| Option | Reason |
|---|---|
| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |

145
docs/runbooks/new-host.md Normal file

@ -0,0 +1,145 @@
# Runbook — Adding a new managed host
## Prerequisites
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
- You have the vault password (`.vault_pass`)
- The host's intended hostname and IP are decided
---
## Part A — Create the Proxmox template (one-time)
Run on a Proxmox node. Only needed once per cluster.
```bash
# Download the Debian 13 genericcloud image
wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2
# Create a VM (adjust ID, storage name as needed)
qm create 9000 --name debian13-template --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0
# Import the disk
qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm
# Attach disk and set boot order
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --boot order=scsi0
# Add cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit
# Enable QEMU guest agent
qm set 9000 --agent enabled=1
# Convert to template (cannot be undone)
qm template 9000
```
---
## Part B — Define the VM in Terraform
Managed hosts are created by Terraform, never by hand. Add an entry to `local.vms`
in the environment's `main.tf` (`terraform/environments/<env>/main.tf`):
```hcl
locals {
  vms = {
    <hostname> = {
      ip        = "<IP>/24"       # static; from docs/decisions/007-network.md
      group     = "docker_hosts"  # control | docker_hosts | proxmox_hosts
      cores     = 2
      memory_mb = 2048
    }
  }
}
```
Terraform clones the cloud-init template from Part A and sets the cloud-init values
(hostname, SSH key, IP/gateway). It writes no DNS records: the host's A record is
written later by the Ansible `dns` role during `make deploy PLAYBOOK=site`. See
ADR-009 for the full handoff and the `vms` output → inventory data contract.
---
## Part C — Provision and regenerate the inventory
```bash
make tf-plan TF_ENV=production # review — confirm only the new VM is added
make tf-apply TF_ENV=production # create the VM (no DNS records written)
make tf-inventory TF_ENV=production # regenerate inventories/production/hosts.yml
```
`make tf-inventory` rewrites `hosts.yml` from Terraform outputs — **do not edit
that file by hand**; it carries a "do not edit manually" header and your changes
would be overwritten. The source of truth is `local.vms`.
Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access:
```bash
ssh ansible@<IP> echo ok
```
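Instead of a fixed wait, the SSH port can be polled until it accepts connections. The sketch below is an optional helper, not part of the repo's tooling: the function name `wait_for_port` and its default timings are assumptions. An open port only shows that `sshd` is listening; it does not prove cloud-init has finished or that the key is installed, so the `ssh` check above is still the real test.

```python
import socket
import time


def wait_for_port(
    host: str, port: int = 22, timeout: float = 120.0, interval: float = 5.0
) -> bool:
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("<IP>")` returns `True` as soon as the port opens, or `False` once the timeout has elapsed.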
Add a `host_vars/<hostname>/` directory if the host needs specific overrides
(this is config, not inventory membership, so it is not generated):
```bash
mkdir -p inventories/production/host_vars/<hostname>
touch inventories/production/host_vars/<hostname>/vars.yml
```
---
## Part D — Bootstrap and configure
```bash
# First-run bootstrap (handles Python installation, initial user setup)
make deploy PLAYBOOK=bootstrap
# Apply full standard state
make deploy PLAYBOOK=site
```
Verify the host reaches baseline:
```bash
make check PLAYBOOK=site
# Should report no changes
```
---
## Part E — Control node (manual exception)
The control node runs Terraform and Ansible, so it cannot be created by the
Terraform it hosts (chicken-and-egg). It is the **one** host provisioned manually —
see ADR-009 and the control-node section of ADR-005. Use the template from Part A:
```bash
# Clone the template by hand (Proxmox UI or qm clone)
qm clone 9000 <VMID> --name <hostname> --full
qm set <VMID> --memory 2048 --cores 2 \
--ciuser ansible \
--sshkeys /path/to/ansible_ed25519.pub \
--ipconfig0 ip=<IP>/24,gw=<GATEWAY>
qm start <VMID>
```
Then set up the Ansible environment on it (`make setup`, `make collections`, place
`.vault_pass`) per ADR-005, and add it to `inventories/<env>/hosts.yml` under the
`control` group. Because the control node is not in `local.vms`, this is the only
case where editing `hosts.yml` by hand is expected — every other host comes from
`make tf-inventory`.
---
## Troubleshooting
**SSH connection refused**: cloud-init may still be running. Wait and retry.
**Python not found**: the bootstrap playbook handles this via the `raw` module.
If bootstrap fails, SSH to the host manually and run `apt install -y python3`.
**Firewall locked out**: if nftables rules are misconfigured, connect via
Proxmox console (not SSH) and run `nft flush ruleset` to clear all rules temporarily.

81
docs/runbooks/new-role.md Normal file

@ -0,0 +1,81 @@
# Runbook — Adding a new Ansible role
## When to create a new role
Create a new role when you need to manage a distinct, reusable unit of
configuration — a service, a system component, or a behaviour applied to
a group of hosts.
Do not create a role for a single task that logically belongs in an existing role.
## Procedure
### 1. Scaffold the role
```bash
make new-role NAME=<rolename>
```
This creates the full directory structure and placeholder files under `roles/<rolename>/`.
### 2. Fill in meta/main.yml
```yaml
galaxy_info:
  role_name: <rolename>
  author: <your name>
  description: <one sentence>
  min_ansible_version: "2.15"
  platforms:
    - name: Debian
      versions:
        - trixie  # Debian 13
```
### 3. Define defaults
Add all tuneable variables to `defaults/main.yml` with inline comments explaining
each variable. Use the `rolename__varname` namespace convention.
### 4. Write tasks
- Use FQCN for all modules
- Every task must have a `name:` that reads as a sentence
- Every task must have at least one `tags:` entry
- Notify handlers by `listen:` topic string, not handler name
### 5. Configure Molecule
Edit `molecule/default/molecule.yml` to use the Debian 13 test image.
Write a `converge.yml` that applies the role. Write a `verify.yml` that
asserts the expected state.
### 6. Write the README
Document:
- Purpose of the role (one paragraph)
- All variables from `defaults/main.yml` with types, defaults, and descriptions
- Example playbook usage
- Any dependencies or prerequisites
### 7. Test locally
```bash
make test ROLE=<rolename>
```
Fix any lint or test failures before committing.
### 8. Add to a playbook
Add the role to the appropriate playbook in `playbooks/` and add the host group
to `inventories/staging/hosts.yml` for integration testing.
### 9. Commit
```bash
git checkout -b role/<rolename>
git add roles/<rolename>
git commit -m "Add <rolename> role"
# open PR / merge request in Forgejo
```


@ -0,0 +1,71 @@
# Runbook — Rotating vault secrets
## Rotating a single secret value
1. Decrypt the relevant vault file:
```bash
make decrypt FILE=inventories/production/group_vars/all/vault.yml
```
2. Edit the file and update the secret value.
3. Re-encrypt:
```bash
make encrypt FILE=inventories/production/group_vars/all/vault.yml
```
4. Commit the updated vault file:
```bash
git add inventories/production/group_vars/all/vault.yml
git commit -m "Rotate <secret name>"
```
5. Deploy to apply the new secret to hosts:
```bash
make check PLAYBOOK=site # verify what will change
make deploy PLAYBOOK=site
```
---
## Rotating the vault password
This affects all encrypted files in the repo. Do this only when:
- A person with vault access leaves the project
- The password is suspected to be compromised
Steps:
1. Ensure you have the current vault password in `.vault_pass`.
2. Re-key all vault files:
```bash
find . -name "vault.yml" -print0 | xargs -0 ansible-vault rekey \
--vault-password-file .vault_pass \
--new-vault-password-file /path/to/new_password_file
```
3. Replace `.vault_pass` with the new password file.
4. Distribute the new password to all collaborators via a secure channel.
5. Commit all rekeyed vault files:
```bash
git add -u   # stage only already-tracked files, i.e. the rekeyed vault files
git commit -m "Rekey all vault files"
```
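Before committing, it can be worth confirming that every `vault.yml` is actually encrypted. Encrypted Ansible Vault files begin with a `$ANSIBLE_VAULT;` header line, so a small check can flag any file left in plaintext. This is a sketch only: the function name `unencrypted_vault_files` is hypothetical, and it is not one of the repo's `make` targets.

```python
from pathlib import Path

# Every file encrypted by ansible-vault starts with this marker.
VAULT_HEADER = "$ANSIBLE_VAULT;"


def unencrypted_vault_files(root: str) -> list[Path]:
    """Return vault.yml files under `root` whose first line lacks the vault header."""
    plaintext = []
    for path in sorted(Path(root).rglob("vault.yml")):
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            first_line = handle.readline()
        if not first_line.startswith(VAULT_HEADER):
            plaintext.append(path)
    return plaintext
```

An empty result means every `vault.yml` found carries the header; any path returned should be re-encrypted before `git commit`.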
---
## Adding a new collaborator
1. Share the vault password via a secure channel (password manager, etc.)
2. The collaborator creates `.vault_pass` locally (gitignored)
3. They can now decrypt/encrypt vault files normally
## Removing a collaborator's access
Rotate the vault password as described above. There is no per-user access
control in Ansible Vault — access is binary (has the password or not).
If per-user access control becomes necessary, evaluate SOPS + age at that point.