Add architecture decision records and runbooks
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 3f1d7eb128
commit fe4228fb38
13 changed files with 1340 additions and 0 deletions
11
docs/README.md
Normal file
@ -0,0 +1,11 @@
|
|||
# docs/

Project documentation.

- `decisions/`: Architecture Decision Records (ADRs), the "why" behind the design.
  Numbered from 001; each records context, the decision, and what was ruled out.
- `runbooks/`: step-by-step operational procedures (add a host, add a role, rotate
  secrets).

For what is actually **built vs only designed**, see `STATUS.md` at the repo root;
the ADRs describe intent, not necessarily current reality.
|
||||
62
docs/decisions/001-architecture.md
Normal file
@ -0,0 +1,62 @@
|
|||
# ADR-001 — Architecture overview
|
||||
|
||||
## Context
|
||||
|
||||
This document describes the overall architecture of the homelab infrastructure
|
||||
and the boundaries of what this Ansible monorepo manages.
|
||||
|
||||
## Infrastructure
|
||||
|
||||
- **Hypervisor**: Proxmox cluster (2+ nodes)
|
||||
- **Guest OS**: Debian 13 (all managed hosts)
|
||||
- **Scale**: 2–5 VMs, small fleet — treated as individuals, not cattle
|
||||
- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
|
||||
The control node is the one host that cannot fully bootstrap itself from scratch
|
||||
and requires manual initial setup (see `docs/runbooks/new-host.md`).
|
||||
|
||||
## What this repo manages
|
||||
|
||||
| Layer | Managed by | Notes |
|---|---|---|
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
| Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) |
| OS baseline | Ansible `base` role | Users, SSH, firewall, updates, audit |
| Docker runtime | Ansible `docker_host` role | Engine, daemon config, log driver |
| Service deployment | Ansible per-service roles | Compose rendered from templates |
| Secrets | Ansible Vault | Encrypted `vault.yml` files in repo |
|
||||
|
||||
The Terraform↔Ansible boundary and handoff are defined in ADR-009.
|
||||
|
||||
## Host groups
|
||||
|
||||
```
all
├── control         # the control node itself: baseline config only, runs no services
├── docker_hosts    # VMs running Docker services (most hosts)
└── proxmox_hosts   # Proxmox nodes themselves (limited management scope)
```
|
||||
|
||||
The `control` group holds the single manually-provisioned control node; it is
|
||||
managed for baseline config (SSH, firewall, updates) but never runs the
|
||||
`docker_host` role. Proxmox nodes are managed only for basic baseline tasks (SSH,
|
||||
monitoring agent). Proxmox configuration itself (storage, clustering, networking)
|
||||
is out of scope.
|
||||
|
||||
## Service interaction model
|
||||
|
||||
Services run as Docker containers on one or more `docker_hosts`. Where services
|
||||
need to interact, they do so via:
|
||||
|
||||
- Docker networks (same host)
|
||||
- Internal DNS / hostname resolution (cross-host)
|
||||
- Explicitly defined published ports (external access)
|
||||
|
||||
All Compose files are rendered by Ansible from Jinja2 templates. No hand-edited
|
||||
Compose files exist on hosts — they are always regenerated on deploy.
|
||||
|
||||
## Decision
|
||||
|
||||
This architecture prioritises:
|
||||
- **Simplicity**: few moving parts, no orchestration layer (no Kubernetes, no Swarm)
|
||||
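- **Reproducibility**: any managed host can be rebuilt from scratch via Terraform and Ansible; the control node is the one exception and needs the manual bootstrap described in ADR-005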
|
||||
- **Legibility**: a human reading the repo can understand what runs where
|
||||
73
docs/decisions/002-security.md
Normal file
@ -0,0 +1,73 @@
|
|||
# ADR-002 — Security baseline
|
||||
|
||||
## Context
|
||||
|
||||
Every managed host must reach a defined security baseline before any services
|
||||
are deployed. This baseline is applied by the `base` role and is non-negotiable —
|
||||
it runs first, on every host, every time.
|
||||
|
||||
The goal is a principled, maintainable baseline appropriate for a homelab with
|
||||
some public-facing services — not a compliance exercise.
|
||||
|
||||
## Baseline components
|
||||
|
||||
### Access & authentication
|
||||
|
||||
- SSH key authentication only — password auth disabled
|
||||
- Root login disabled — `PermitRootLogin no`
|
||||
- Dedicated `ansible` user with locked-down sudo (NOPASSWD for automation)
|
||||
- No shared user accounts — per-person SSH keys in `group_vars/all/vars.yml`
|
||||
|
||||
### Firewall
|
||||
|
||||
- `nftables` (native on Debian 13, replaces iptables)
|
||||
- Default policy: deny inbound, allow established/related, allow loopback
|
||||
- Rules managed entirely by Ansible — never edited manually on hosts
|
||||
- Port definitions live in `group_vars/` so rules stay in sync with deployed services
|
||||
- Docker's own iptables rules are disabled — nftables manages all filtering
|
||||
|
||||
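> **Note on Docker + nftables**: By default Docker inserts its own iptables rules, so
> published container ports can bypass a host firewall. This is addressed by setting
> `"iptables": false` in the Docker daemon config and managing all rules via nftables
> explicitly. With that setting Docker no longer creates NAT or forwarding rules
> either, so the nftables ruleset must provide any that containers need.
> See `docs/decisions/004-docker-model.md`.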
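The default policy above can be sketched as the ruleset text a template might render. This is a minimal illustration under stated assumptions: in the repo the rules come from Ansible templates, and `render_nftables` and its port list are hypothetical names, not part of the `base` role.

```python
def render_nftables(tcp_ports: list[int]) -> str:
    """Render a default-deny inbound ruleset allowing the given TCP ports."""
    lines = [
        "table inet filter {",
        "  chain input {",
        # Default policy: drop anything not explicitly accepted below.
        "    type filter hook input priority 0; policy drop;",
        '    iif "lo" accept',
        "    ct state established,related accept",
    ]
    if tcp_ports:
        # De-duplicate and sort so the rendered file is stable between runs.
        ports = ", ".join(str(p) for p in sorted(set(tcp_ports)))
        lines.append(f"    tcp dport {{ {ports} }} accept")
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"


# Hypothetical port list, as it might be defined in group_vars.
print(render_nftables([443, 22]))
```

Keeping the port list as data is what lets the firewall stay in sync with deployed services: adding a service means adding a port to a variable, not editing a rule by hand.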
|
||||
|
||||
### Intrusion deterrence
|
||||
|
||||
- `fail2ban` monitoring SSH (and optionally reverse proxy logs)
|
||||
- Configured to ban after 5 failed attempts, 1-hour ban
|
||||
|
||||
### Updates
|
||||
|
||||
- `unattended-upgrades` enabled for **security patches only**
|
||||
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
|
||||
- No automatic reboots — reboots are a conscious operational decision
|
||||
|
||||
### Minimal attack surface
|
||||
|
||||
- No unnecessary packages installed
|
||||
- Docker daemon TCP socket disabled — Unix socket only
|
||||
- No open ports beyond those explicitly defined in firewall rules
|
||||
|
||||
### Audit trail
|
||||
|
||||
- `auditd` installed and running with a baseline ruleset
|
||||
- Logs shipped to a central location if a log aggregation service is available
|
||||
|
||||
## Secrets management
|
||||
|
||||
- Ansible Vault for all secrets (API keys, passwords, certificates)
|
||||
- Vault password stored outside the repo (`.vault_pass` gitignored)
|
||||
- New collaborators receive vault password via a separate secure channel
|
||||
- See `docs/runbooks/rotate-secrets.md` for rotation procedure
|
||||
|
||||
## What this baseline does not include
|
||||
|
||||
- Full CIS benchmark hardening — adds complexity for marginal gain at this scale
|
||||
- SELinux / AppArmor — not applied by default, revisit if threat model changes
|
||||
- Intrusion detection (IDS) — out of scope for now
|
||||
|
||||
## Decision
|
||||
|
||||
This baseline was chosen to be:
|
||||
- **Effective** against the realistic threat model (exposed services, shared repo)
|
||||
- **Maintainable** by a small team without security expertise overhead
|
||||
- **Automated** — no manual steps should be needed to reach baseline state
|
||||
135
docs/decisions/003-toolchain.md
Normal file
@ -0,0 +1,135 @@
|
|||
# ADR-003 — Toolchain decisions
|
||||
|
||||
## Execution engine
|
||||
|
||||
**Choice**: `ansible-core` (pip-installed, pinned version) + explicit `requirements.yml`
|
||||
|
||||
**Not chosen**: `ansible` full package (bundles ~85 collections at a frozen version)
|
||||
|
||||
**Rationale**: Explicit collection pinning allows independent upgrades, smaller installs,
|
||||
and fully reproducible environments. The full package trades these away for convenience
|
||||
that isn't needed in a maintained monorepo.
|
||||
|
||||
---
|
||||
|
||||
## Python environment
|
||||
|
||||
**Choice**: `python3-venv` (system Python on Debian 13) + pinned `requirements.txt`
|
||||
|
||||
**Not chosen**: `pyenv` (solves multi-version problems on developer laptops, not needed
|
||||
on a dedicated Debian control node with a controlled Python version)
|
||||
|
||||
**Rationale**: The control node runs one Python version. A plain venv is sufficient,
|
||||
reproducible, and has no extra dependencies.
|
||||
|
||||
---
|
||||
|
||||
## Secrets
|
||||
|
||||
**Choice**: Ansible Vault (file-based, built-in)
|
||||
|
||||
**Not chosen**:
|
||||
- SOPS + age: better git-diff ergonomics, but adds external tooling and key management
|
||||
- HashiCorp Vault: powerful, but significant operational overhead for this scale
|
||||
|
||||
**Rationale**: Vault is built-in, requires no extra services, and works well at this
|
||||
scale. The main limitation (whole-file encryption makes diffs unreadable) is mitigated
|
||||
by keeping `vault.yml` files small and purposeful — only actual secrets, no structure.
|
||||
|
||||
---
|
||||
|
||||
## Testing
|
||||
|
||||
**Choice**: Molecule with Docker driver (`molecule-plugins[docker]`)
|
||||
|
||||
**Not chosen**:
|
||||
- Molecule + Podman: rootless is appealing, but Docker is simpler on a Debian control node
|
||||
- Molecule + Vagrant: full VMs are slower and require a hypervisor on the control node
|
||||
- No testing: unacceptable for a shared, maintained project
|
||||
|
||||
**Test image**: a self-built, project-owned Debian 13 image with systemd support
|
||||
(`.docker/molecule-debian13/`), hosted in the Forgejo registry. ADR-008 is canonical
|
||||
for the image and the rationale for not using an external image such as
|
||||
`geerlingguy/docker-debian13-ansible`.
|
||||
|
||||
**Verifier**: Built-in Ansible verifier. Testinfra added later if deeper assertions
|
||||
are needed.
|
||||
|
||||
---
|
||||
|
||||
## Linting
|
||||
|
||||
**Choice**: `ansible-lint` + `yamllint` + `pre-commit`
|
||||
|
||||
- `yamllint`: catches formatting issues before Ansible sees the file
|
||||
- `ansible-lint`: enforces correctness and idiomatic style
|
||||
- `pre-commit`: runs both locally on every commit, preventing CI failures
|
||||
|
||||
Config files: `.ansible-lint`, `.yamllint` in repo root.
|
||||
|
||||
---
|
||||
|
||||
## CI/CD
|
||||
|
||||
**Choice**: Forgejo Actions (self-hosted at git.baobab.band) + `act_runner`
|
||||
|
||||
**Not chosen**: GitHub Actions (external), Jenkins (heavy)
|
||||
|
||||
**Pipeline**:
|
||||
1. Push to any branch → lint + Molecule tests
|
||||
2. Merge to `main` → lint + Molecule tests + manual approval gate
|
||||
3. After approval → deploy to staging, then production
|
||||
|
||||
`act_runner` runs as a Docker container on the control node or a dedicated runner VM.
|
||||
|
||||
---
|
||||
|
||||
## Developer ergonomics
|
||||
|
||||
**Choice**: `Makefile` as the single interface for all operations
|
||||
|
||||
**Rationale**: All `ansible-playbook`, `molecule`, and `ansible-lint` invocations go
|
||||
through Make targets. This means:
|
||||
- Claude Code always calls `make <target>` — never constructs raw commands
|
||||
- Collaborators don't need to know the underlying flags
|
||||
- CI uses the same targets as local development (no drift)
|
||||
|
||||
**direnv**: Not used — the control node is a dedicated host, not a shared workstation.
|
||||
The venv is activated in the user's shell profile.
|
||||
|
||||
---
|
||||
|
||||
## Collections and roles policy
|
||||
|
||||
**No Galaxy roles.** All roles are written and maintained locally in `roles/`.
|
||||
Galaxy roles introduce external state, versioning surprises, and implicit
|
||||
conventions that conflict with this repo's style.
|
||||
|
||||
**Collections on demand.** A collection is added to `requirements.yml` only when
|
||||
a task in a committed role actively uses a module from it. Pre-emptive inclusions
|
||||
are removed. Each entry in `requirements.yml` must justify its presence.
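The "collections on demand" rule lends itself to a mechanical check. The sketch below is a heuristic only: `unused_collections` is a hypothetical helper, and it detects usage purely by fully qualified module names in task text, so it would miss plugins referenced in other ways.

```python
import re

# Matches "namespace.collection.module:" and captures "namespace.collection".
FQCN = re.compile(r"\b([a-z0-9_]+\.[a-z0-9_]+)\.[a-z0-9_]+\s*:")


def unused_collections(required: set[str], task_text: str) -> set[str]:
    """Collections listed as required but never referenced by a module FQCN."""
    used = {match.group(1) for match in FQCN.finditer(task_text)}
    return required - used


# Hypothetical task file content.
tasks = """
- name: Authorise key
  ansible.posix.authorized_key:
    user: ansible
- name: Run compose
  ansible.builtin.command: docker compose up -d
"""
print(unused_collections({"ansible.posix", "community.crypto"}, tasks))
```

A check like this could run in CI or `pre-commit` to flag entries in `requirements.yml` that no committed role uses.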
|
||||
|
||||
**Starting collection set** (rationale for each):
|
||||
|
||||
| Collection | Status | Reason |
|---|---|---|
| `ansible.posix` | Kept | Ansible-team maintained; fills real `ansible.builtin` gaps (`authorized_key`, `sysctl`, `acl`) |
| `community.docker` | Dropped | ADR-004 uses `ansible.builtin.command` + `docker compose`; no Docker API modules needed |
| `community.proxmox` | Dropped | Proxmox configuration is out of scope (ADR-001) |
| `community.crypto` | Deferred | Add when a role needs cert automation; use `openssl` CLI until then |
| `community.general` | Deferred | 1,500+ modules; add only the specific sub-module needed, with a comment |
|
||||
|
||||
---
|
||||
|
||||
## What was explicitly ruled out
|
||||
|
||||
| Tool | Reason not adopted |
|---|---|
| AWX / AAP | Significant operational overhead, not needed at this scale |
| Semaphore | Revisit if non-SSH operators need to trigger runs |
| ansible-runner | Only needed when AWX/Semaphore orchestrates runs |
| ansible-builder | Only needed when packaging Execution Environments for AWX |
| Kubernetes/Swarm | Out of scope; Docker Compose is the right complexity level |
| NixOS targets | Poor Ansible fit; all hosts standardised on Debian 13 |
|
||||
|
||||
Terraform is **adopted** for VM provisioning only; internal DNS is owned by the Ansible `dns` role. See `docs/decisions/006-terraform.md`.
|
||||
77
docs/decisions/004-docker-model.md
Normal file
@ -0,0 +1,77 @@
|
|||
# ADR-004 — Docker and Compose service model
|
||||
|
||||
## Context
|
||||
|
||||
All services run as Docker containers managed via Docker Compose. This document
|
||||
defines how services are structured, deployed, and maintained.
|
||||
|
||||
## Core principles
|
||||
|
||||
- **No hand-edited files on hosts**: all Compose files are rendered by Ansible
|
||||
from Jinja2 templates. If a file exists on a host, it was put there by Ansible.
|
||||
- **Compose per service**: each service (or tightly coupled service group) gets
|
||||
its own Compose file and directory under a standard path.
|
||||
- **Variables drive differences**: the same template renders differently per host
|
||||
via `group_vars` and `host_vars`. No host-specific templates.
|
||||
|
||||
## Directory layout on hosts
|
||||
|
||||
```
/opt/services/
├── servicename/
│   ├── docker-compose.yml    # rendered by Ansible, never edited manually
│   ├── .env                  # rendered by Ansible from vault variables
│   └── data/                 # persistent volumes (bind mounts)
│       └── ...
```
|
||||
|
||||
All services live under `/opt/services/`. The path is defined in
|
||||
`group_vars/all/vars.yml` as `services__base_dir`.
|
||||
|
||||
## Compose file delivery
|
||||
|
||||
Each service has a corresponding Ansible role (or is managed by a shared role
|
||||
with per-service variables). The role:
|
||||
|
||||
1. Creates the `/opt/services/servicename/` directory
2. Renders `docker-compose.yml` from `templates/docker-compose.yml.j2`
3. Renders `.env` from `templates/env.j2` (pulling secrets from vault variables)
4. Optionally runs `docker compose pull` (controlled by a variable)
5. Runs `docker compose up -d --remove-orphans` via `ansible.builtin.command`
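The delivery steps above end in a short command sequence, sketched here for illustration. The function name `compose_commands` and the `pull_first` flag are hypothetical, and the commands are only built, not executed; in the repo they would run through `ansible.builtin.command`.

```python
from pathlib import PurePosixPath

# Mirrors services__base_dir from group_vars/all/vars.yml.
SERVICES_BASE_DIR = PurePosixPath("/opt/services")


def compose_commands(service: str, pull_first: bool = False) -> list[list[str]]:
    """Return the docker compose invocations for one service, in run order."""
    compose_file = SERVICES_BASE_DIR / service / "docker-compose.yml"
    base = ["docker", "compose", "-f", str(compose_file)]
    commands = []
    if pull_first:
        # Optional step: refresh images before containers are (re)created.
        commands.append(base + ["pull"])
    commands.append(base + ["up", "-d", "--remove-orphans"])
    return commands


for argv in compose_commands("servicename", pull_first=True):
    print(" ".join(argv))
```

Building argument lists rather than shell strings avoids quoting problems if a service name ever contains unusual characters.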
|
||||
|
||||
## Docker daemon configuration
|
||||
|
||||
Managed by the `docker_host` role. Key settings:
|
||||
|
||||
- `"log-driver": "json-file"` with size limits (prevents disk exhaustion)
|
||||
- `"iptables": false` — firewall managed entirely by nftables (see ADR-002)
|
||||
- TCP socket disabled — Unix socket only (`/var/run/docker.sock`)
|
||||
- User namespace remapping: evaluated per use case, not enabled by default
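As an illustration of the settings above, the daemon configuration could be generated as follows. This is a sketch under assumptions: the `max-size` and `max-file` values are placeholders, since the ADR only says that size limits exist.

```python
import json

daemon_config = {
    "log-driver": "json-file",
    # Placeholder limits; the ADR does not specify the actual values.
    "log-opts": {"max-size": "10m", "max-file": "3"},
    # Docker stops managing iptables rules; nftables owns filtering (ADR-002).
    "iptables": False,
    # No "hosts" key: without it the daemon keeps its default listener,
    # which is the local Unix socket, so no TCP socket is opened.
}

print(json.dumps(daemon_config, indent=2))
```

The printed JSON corresponds to the content of `/etc/docker/daemon.json` that the `docker_host` role would be expected to manage.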
|
||||
|
||||
## Networking
|
||||
|
||||
- Each service Compose file defines its own named network(s)
|
||||
- Services that need to communicate are placed on a shared named network
|
||||
defined in a dedicated `docker-compose.networks.yml` (if cross-service
|
||||
networking is needed on a host)
|
||||
- External port publishing is explicit and matches nftables rules
|
||||
|
||||
## Image management
|
||||
|
||||
- Images are always pinned to a specific digest or tag in templates
|
||||
- `latest` is never used in production Compose files
|
||||
- Image updates are a deliberate operation: update the tag variable, run deploy
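The pinning rule can be checked mechanically before a template is rendered. The helper below is a hypothetical sketch, not part of the repo; it treats a digest or any explicit tag other than `latest` as pinned.

```python
def is_pinned(image: str) -> bool:
    """True if the reference carries a digest or an explicit non-latest tag."""
    if "@sha256:" in image:
        return True
    # Look for a tag only in the last path component, so a registry port
    # such as "registry.example:5000/app" is not mistaken for a tag.
    last = image.rsplit("/", 1)[-1]
    if ":" not in last:
        return False  # no tag means Docker falls back to "latest"
    tag = last.rsplit(":", 1)[1]
    return tag != "latest"


for ref in ("nginx:1.27.0", "nginx", "nginx:latest"):
    print(ref, is_pinned(ref))
```

A role could call a check like this in an assertion task so that an unpinned image fails the deploy instead of reaching a host.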
|
||||
|
||||
## Persistent data
|
||||
|
||||
- Bind mounts preferred over named volumes for data that must be backed up
|
||||
- All bind mount paths are under `/opt/services/<name>/data/`
|
||||
- Backup strategy is defined separately (not in scope of this repo)
|
||||
|
||||
## Decision
|
||||
|
||||
Docker Compose was chosen over Kubernetes/Swarm because:
|
||||
- Appropriate complexity level for 2–5 hosts with independent service sets
|
||||
- Compose files are human-readable and easily auditable
|
||||
- No distributed state to manage
|
||||
- Straightforward to back up and restore
|
||||
79
docs/decisions/005-bootstrapping.md
Normal file
@ -0,0 +1,79 @@
|
|||
# ADR-005 — Host bootstrapping
|
||||
|
||||
## Context
|
||||
|
||||
This document defines the **cloud-init template** that managed VMs are cloned
|
||||
from, and the **control-node** bootstrapping special case. The per-host
|
||||
provisioning pipeline — how a VM is created from this template and handed off to
|
||||
Ansible — is owned by ADR-009. Terraform clones the template defined here; the
|
||||
template is the base image both for Terraform-managed hosts and for the manually
|
||||
provisioned control node.
|
||||
|
||||
## Approach: Proxmox cloud-init template
|
||||
|
||||
Managed VMs are cloned from a Proxmox VM template based on the official Debian 13
|
||||
cloud image. Cloud-init handles first-boot configuration. Ansible takes over
|
||||
from there.
|
||||
|
||||
The cloud-init image was chosen over:
|
||||
- **Manual Debian installer**: slow, error-prone, not reproducible
|
||||
- **Preseed/netboot**: powerful but complex to maintain
|
||||
|
||||
## Template creation (one-time, manual)
|
||||
|
||||
This is a manual procedure performed once per Proxmox cluster. Documented in
|
||||
`docs/runbooks/new-host.md`.
|
||||
|
||||
High-level steps:
|
||||
1. Download official Debian 13 genericcloud image
|
||||
2. Import disk to Proxmox, create VM template
|
||||
3. Install `qemu-guest-agent` in the template image
|
||||
4. Convert VM to template — never boot the template directly
|
||||
|
||||
## VM provisioning (per new host)
|
||||
|
||||
Per-host VMs are created by **Terraform**, which clones this template and sets the
cloud-init values (hostname, SSH public key, IP/gateway). Terraform writes no DNS
records; the host's A record is rendered by the Ansible `dns` role from inventory
(see ADR-006 and ADR-009). Cloud-init runs at first boot (~30–60 seconds), leaving
the VM reachable via SSH with the ansible user's key.
|
||||
|
||||
The full create → inventory → configure pipeline, and the Terraform↔Ansible data
|
||||
contract, are defined in **ADR-009 (provisioning handoff)**. There is no manual
|
||||
`qm clone` path for managed hosts — the sole exception is the control node below.
|
||||
|
||||
## Ansible handoff
|
||||
|
||||
Once Terraform has created the VM and `make tf-inventory` has regenerated the
|
||||
inventory, the `bootstrap` playbook handles first-run specifics (Python may not be
|
||||
present, user may differ) and `site` applies the full standard state. See ADR-009
|
||||
for the end-to-end commands and `docs/runbooks/new-host.md` for the full procedure.
|
||||
|
||||
## Control node bootstrapping
|
||||
|
||||
The control node is a special case — it runs Terraform and Ansible, so it cannot
|
||||
be created by the Terraform it hosts (chicken-and-egg). It is the one documented
|
||||
exception to Terraform-owned VM existence (see ADR-009). The control node requires:
|
||||
|
||||
1. Manual VM provisioning: clone this cloud-init template by hand (Proxmox UI or
   `qm clone`), since Terraform is not yet available to do it
2. Manual setup of the Ansible environment:
   ```bash
   git clone <repo> ~/ansible
   cd ~/ansible
   make setup        # creates venv, installs deps
   make collections  # installs Ansible collections
   cp /secure/location/.vault_pass ~/ansible/.vault_pass
   ```
3. After that, the control node can manage all other hosts normally
|
||||
|
||||
The control node itself is listed in `inventories/production/hosts.yml` under
|
||||
a `control` group and can be managed for baseline config (SSH, firewall, updates)
|
||||
but not for the `docker_host` role (it does not run services).
|
||||
|
||||
## Decision
|
||||
|
||||
Cloud-init with Proxmox templates provides:
|
||||
- Reproducible VM creation in under 2 minutes
|
||||
- No manual installer interaction
|
||||
- A clean handoff point to Ansible
|
||||
- Easy rebuilds — destroy VM, clone template, run Ansible
|
||||
111
docs/decisions/006-terraform.md
Normal file
@ -0,0 +1,111 @@
|
|||
# ADR-006 — Terraform for infrastructure provisioning
|
||||
|
||||
## Context
|
||||
|
||||
Ansible manages host configuration well but has no state model for infrastructure
|
||||
existence. Adding Terraform handles the "what exists" layer — creating and destroying
|
||||
VMs on Proxmox — while Ansible continues to own everything that runs inside them,
|
||||
including all internal DNS records.
|
||||
|
||||
This complements rather than replaces Ansible. The two tools do not overlap. The
|
||||
exact boundary, handoff pipeline, and data contract between them live in **ADR-009
|
||||
(provisioning handoff)** — this ADR covers Terraform's own internals only.
|
||||
|
||||
---
|
||||
|
||||
## Responsibility split
|
||||
|
||||
The canonical responsibility-split table lives in **ADR-009**. In short: Terraform
|
||||
owns VM existence only; Ansible owns everything inside a VM, including all internal
|
||||
DNS records.
|
||||
|
||||
**OPNsense is entirely Ansible.** The available Terraform providers for OPNsense
|
||||
are community-maintained with real risk of provider rot across OPNsense releases.
|
||||
OPNsense firewall rules also change on a service cadence, not an infrastructure
|
||||
cadence, making them a poor fit for Terraform state.
|
||||
|
||||
---
|
||||
|
||||
## Providers
|
||||
|
||||
**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
|
||||
full Proxmox 8 API support, and better cloud-init integration. This is the only
|
||||
provider.
|
||||
|
||||
Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
|
||||
to write A records, but that created a bootstrap cycle — the first DNS server cannot
|
||||
register itself — and split DNS ownership across two tools. Ansible's `dns` role now
|
||||
owns the entire internal zone, rendered from inventory. See ADR-009.
|
||||
|
||||
No Galaxy roles. Terraform manages its own provider dependencies via
|
||||
`required_providers` and `.terraform.lock.hcl` (tracked in git once `terraform init`
|
||||
has been run).
|
||||
|
||||
---
|
||||
|
||||
## State backend
|
||||
|
||||
**Choice**: Forgejo HTTP backend (self-hosted at git.baobab.band)
|
||||
|
||||
Keeps all state on the same self-hosted stack without additional services.
|
||||
Authentication uses a Forgejo personal access token via `TF_HTTP_USERNAME` and
|
||||
`TF_HTTP_PASSWORD` environment variables.
|
||||
|
||||
**Note**: The backend URL in `backend.tf` is a placeholder — confirm the exact
|
||||
endpoint path against your running Forgejo instance's API documentation before
|
||||
running `terraform init`. If Forgejo's HTTP state is unavailable, remove the
|
||||
`backend` block from `backend.tf` to fall back to local state on the control node.
|
||||
|
||||
---
|
||||
|
||||
## Structure
|
||||
|
||||
```
terraform/
  modules/
    proxmox_vm/       # reusable VM module: Proxmox only, no DNS
  environments/
    staging/          # staging VMs, separate state file
    production/       # production VMs, separate state file
```
|
||||
|
||||
Separate environment directories (not Terraform workspaces) for the clearest
|
||||
isolation — no risk of accidentally applying the wrong state.
|
||||
|
||||
Each environment directory contains:
|
||||
- `providers.tf` — provider version pins and configuration
|
||||
- `backend.tf` — Forgejo state backend (environment-specific path)
|
||||
- `variables.tf` — input declarations
|
||||
- `terraform.tfvars.example` — tracked template; copy to `terraform.tfvars` for actual values
|
||||
- `main.tf` — `local.vms` map and module calls (no DNS resources)
|
||||
- `outputs.tf` — VM map consumed by `make tf-inventory`
|
||||
|
||||
---
|
||||
|
||||
## Secrets handling
|
||||
|
||||
The only secret input (the Proxmox API token) is passed via a `TF_VAR_*`
|
||||
environment variable and declared `sensitive = true` in `variables.tf`. It never
|
||||
appears in `.tfvars` files. Non-secret configuration lives in tracked
|
||||
`terraform.tfvars.example`; the real `terraform.tfvars` is gitignored.
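Because both the state backend credentials and the Proxmox token arrive through the environment, a preflight check can catch a missing variable before Terraform runs. This is a sketch only: `TF_HTTP_USERNAME` and `TF_HTTP_PASSWORD` come from the state backend section, while `TF_VAR_proxmox_api_token` is a hypothetical name standing in for whatever `variables.tf` declares.

```python
import os

REQUIRED_ENV = (
    "TF_HTTP_USERNAME",
    "TF_HTTP_PASSWORD",
    "TF_VAR_proxmox_api_token",  # hypothetical variable name
)


def missing_env(environ=os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_ENV if not environ.get(name)]


missing = missing_env()
if missing:
    print("missing:", ", ".join(missing))
else:
    print("environment looks complete")
```

The check reports names only and never prints values, so it is safe to run in CI logs.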
|
||||
|
||||
---
|
||||
|
||||
## Ansible integration
|
||||
|
||||
After `terraform apply`, run `make tf-inventory TF_ENV=<env>` to regenerate
|
||||
`inventories/<env>/hosts.yml` from the `vms` output. The full handoff pipeline,
|
||||
the `vms` output → inventory data contract, and the generator script
|
||||
(`scripts/tf_to_inventory.py`) are documented in **ADR-009 (provisioning
|
||||
handoff)**.
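To make the handoff concrete, the transformation from the `vms` output to an inventory can be sketched as below. The real data contract lives in ADR-009 and is not shown here, so the shape of `example_vms` (an `ip` and a `group` per host) is an assumption, and `vms_to_inventory` is a hypothetical stand-in for `scripts/tf_to_inventory.py`.

```python
import json


def vms_to_inventory(vms: dict) -> dict:
    """Group hosts from a vms map into an Ansible inventory structure."""
    inventory = {"all": {"children": {}}}
    for name, vm in sorted(vms.items()):
        children = inventory["all"]["children"]
        group = children.setdefault(vm["group"], {"hosts": {}})
        group["hosts"][name] = {"ansible_host": vm["ip"]}
    return inventory


# Assumed shape of the Terraform `vms` output value; group names are examples.
example_vms = {
    "dns1": {"ip": "10.20.0.10", "group": "docker_hosts"},
    "proxy": {"ip": "10.20.0.12", "group": "docker_hosts"},
}

print(json.dumps(vms_to_inventory(example_vms), indent=2))
```

The sketch prints JSON to stay within the standard library; since JSON is a subset of YAML, output of this form should be loadable as a `hosts.yml` inventory.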
|
||||
|
||||
---
|
||||
|
||||
## What was ruled out
|
||||
|
||||
| Option | Reason |
|---|---|
| `telmate/proxmox` provider | Less actively maintained; weaker cloud-init and Proxmox 8 support |
| OPNsense Terraform provider | Community-maintained; provider rot risk across OPNsense releases |
| Terraform workspaces | One configuration directory and backend shared by all environments, with state separated only by workspace name; accidental cross-env apply possible |
| Separate Terraform repo | Cross-referencing between infra and config adds friction; monorepo keeps the full picture together |
|
||||
186
docs/decisions/007-network.md
Normal file
@ -0,0 +1,186 @@
|
|||
# ADR-007 — Network topology and addressing
|
||||
|
||||
## Context
|
||||
|
||||
The boma homelab is a Proxmox cluster on a dedicated private network behind an
|
||||
OPNsense firewall. This document records the agreed physical topology, VLAN
|
||||
design, IP addressing conventions, naming scheme, and DNS zone structure.
|
||||
Everything here feeds directly into Terraform variables, Ansible inventory,
|
||||
and OPNsense configuration.
|
||||
|
||||
---
|
||||
|
||||
## Physical topology
|
||||
|
||||
```
ISP
└── OPNsense (dedicated hardware)
    ├── WAN: ISP uplink
    └── LAN: 802.1q trunk to managed switch
                     │
      ┌──────────────┼──────────────────────────┐
      │              │              │           │
     pve0           pve1           pve2     AP1 / AP2
 (eno1 trunk)   (eno1 trunk)   (eno1 trunk)  (trunk)
(eno2 corosync)(eno2 corosync)(eno2 corosync)
      └──────────────┴──────────────┘
      172.16.0.0/24 (corosync ring, not on managed switch)
```
|
||||
|
||||
**Dual NICs per Proxmox node:**
|
||||
- `eno1` — VLAN-aware trunk. Carries all VLANs via a single VLAN-aware bridge
|
||||
(`vmbr0`). VMs get their VLAN tag assigned in Proxmox.
|
||||
- `eno2` — Dedicated corosync ring (`vmbr1`). Direct link or tiny unmanaged
|
||||
switch between the three nodes only. Never touches the main switch fabric.
|
||||
|
||||
**Access points** broadcast multiple SSIDs, each tagged to its corresponding VLAN
|
||||
(trusted WiFi → VLAN 30, IoT → VLAN 40, guest → VLAN 50).
|
||||
|
||||
---
|
||||
|
||||
## VLAN design
|
||||
|
||||
| VLAN | Name | Subnet | Purpose |
|---|---|---|---|
| 10 | `mgmt` | `10.10.0.0/24` | Proxmox hosts, OPNsense, managed switch. No internet except update repos. |
| 20 | `srv` | `10.20.0.0/24` | All Debian VMs and Docker services. 100% static. Terraform provisions here. |
| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
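Every subnet in this table follows the pattern `10.<vlan>.0.0/24`, which makes the mapping easy to compute instead of hard-coding. The sketch below illustrates that convention; `subnet_for` and `vlan_of` are hypothetical helper names.

```python
import ipaddress

VLANS = {10: "mgmt", 20: "srv", 30: "lan", 40: "iot", 50: "guest", 99: "vpn"}


def subnet_for(vlan_id: int) -> ipaddress.IPv4Network:
    """Subnet for a VLAN under the 10.<vlan>.0.0/24 convention."""
    return ipaddress.ip_network(f"10.{vlan_id}.0.0/24")


def vlan_of(address: str):
    """Name of the VLAN containing the address, or None if outside all of them."""
    ip = ipaddress.ip_address(address)
    for vlan_id, name in VLANS.items():
        if ip in subnet_for(vlan_id):
            return name
    return None


print(subnet_for(20), vlan_of("10.99.0.2"))
```

The corosync ring (`172.16.0.0/24`) deliberately falls outside this scheme, so `vlan_of` returns `None` for those addresses.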
|
||||
|
||||
---
|
||||
|
||||
## IP addressing
|
||||
|
||||
### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP
|
||||
|
||||
| Address | Host |
|---|---|
| `10.10.0.1` | OPNsense LAN (mgmt) |
| `10.10.0.2` | Managed switch |
| `10.10.0.200` | `pve0` |
| `10.10.0.201` | `pve1` |
| `10.10.0.202` | `pve2` |
|
||||
|
||||
### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static
|
||||
|
||||
| Range | Purpose |
|---|---|
| `10.20.0.1` | OPNsense gateway |
| `10.20.0.10`–`.19` | Core infrastructure VMs (DNS, proxy) |
| `10.20.0.20`–`.49` | Additional static infrastructure |
| `10.20.0.50`–`.249` | Terraform-provisioned VMs |
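The range allocation for `srv` can be expressed as a small lookup, which is useful for validating a proposed address before it goes into Terraform variables. This is an illustrative sketch; `srv_purpose` is a hypothetical helper.

```python
import ipaddress

SRV_NET = ipaddress.ip_network("10.20.0.0/24")

# (first host octet, last host octet, purpose), taken from the range table.
SRV_RANGES = [
    (1, 1, "OPNsense gateway"),
    (10, 19, "Core infrastructure VMs"),
    (20, 49, "Additional static infrastructure"),
    (50, 249, "Terraform-provisioned VMs"),
]


def srv_purpose(address: str):
    """Purpose of an srv address, or None if unallocated or outside the VLAN."""
    ip = ipaddress.ip_address(address)
    if ip not in SRV_NET:
        return None
    last_octet = int(ip) & 0xFF
    for first, last, purpose in SRV_RANGES:
        if first <= last_octet <= last:
            return purpose
    return None


print(srv_purpose("10.20.0.12"))
```

Addresses in the gaps (for example `.2` to `.9` and `.250` upward) return `None`, matching the table, which leaves them unassigned.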
|
||||
|
||||
Assigned infrastructure addresses:
|
||||
|
||||
| Address | Host | Role |
|---|---|---|
| `10.20.0.10` | `dns1` | Primary DNS server |
| `10.20.0.11` | `dns2` | Secondary DNS server |
| `10.20.0.12` | `proxy` | Reverse proxy |
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
|
||||
|
||||
### VLAN 30 — lan (10.30.0.0/24)
|
||||
|
||||
| Range | Purpose |
|---|---|
| `10.30.0.1` | OPNsense gateway |
| `10.30.0.100`–`.249` | DHCP pool |
|
||||
|
||||
### VLAN 40 — iot (10.40.0.0/24)
|
||||
|
||||
| Range | Purpose |
|---|---|
| `10.40.0.1` | OPNsense gateway |
| `10.40.0.100`–`.249` | DHCP pool |
|
||||
|
||||
### VLAN 50 — guest (10.50.0.0/24)
|
||||
|
||||
| Range | Purpose |
|---|---|
| `10.50.0.1` | OPNsense gateway |
| `10.50.0.100`–`.249` | DHCP pool |
|
||||
|
||||
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
|
||||
|
||||
| Address | Host |
|---|---|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
| `10.99.0.2` | `askari` (Hetzner VPS) |
| `10.99.0.10`+ | Road-warrior clients |
|
||||
|
||||
### Corosync ring (172.16.0.0/24) — not on managed switch
|
||||
|
||||
| Address | Host |
|---|---|
| `172.16.0.200` | `pve0` |
| `172.16.0.201` | `pve1` |
| `172.16.0.202` | `pve2` |
|
||||
|
||||
---
|
||||
|
||||
## OPNsense firewall rules (intent)
|
||||
|
||||
| Source | Destination | Policy |
|---|---|---|
| `mgmt` | anywhere | allow (administrator access) |
| `srv` | `srv` | allow (inter-service communication) |
| `srv` | internet | allow (updates, image pulls) |
| `lan` | `srv` (allow-list) | allow specific published ports only |
| `lan` | internet | allow |
| `iot` | internet | allow egress only |
| `iot` | `srv` (HA IP only) | allow on integration ports |
| `guest` | internet | allow, isolated from all internal |
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
| `vpn` | `mgmt` | allow (administration from askari) |
|
||||
|
||||
**Home Assistant ↔ IoT**: The HA VM at `10.20.0.13` can reach the IoT VLAN on the
required ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device
discovery. IoT devices cannot initiate connections to `srv`, except to the HA
address on integration ports as listed in the table above.
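The intent table can be encoded as data so that a proposed flow is checked against it. This is a sketch under stated assumptions: conditions such as port allow-lists are kept as notes rather than enforced, anything not listed is treated as denied (the table does not say so explicitly), and `policy` is a hypothetical helper.

```python
# (source, destination) pairs allowed by the intent table, with their condition.
ALLOWED = {
    ("srv", "srv"): "inter-service communication",
    ("srv", "internet"): "updates, image pulls",
    ("lan", "srv"): "allow-listed published ports only",
    ("lan", "internet"): "unrestricted",
    ("iot", "internet"): "egress only",
    ("iot", "srv"): "Home Assistant address, integration ports only",
    ("guest", "internet"): "isolated from all internal networks",
    ("vpn", "srv"): "metrics ports",
    ("vpn", "mgmt"): "administration from askari",
}


def policy(source: str, destination: str) -> str:
    """Look up the intended policy for a flow between two zones."""
    if source == "mgmt":
        # mgmt may reach anywhere (administrator access).
        return "allow (administrator access)"
    note = ALLOWED.get((source, destination))
    # Assumption: flows absent from the table are denied.
    return f"allow ({note})" if note else "deny"


print(policy("guest", "srv"), "|", policy("iot", "srv"))
```

Keeping the intent in one structure gives a reference against which the actual OPNsense rules can be reviewed.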
|
||||
|
||||
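The intent table can be read as a zone-level lookup. The sketch below is an illustration only: it assumes that any flow not listed is denied (the table states intent, not the final rule set), and it records the finer conditions (published ports, HA address only, metrics ports) as notes without enforcing them. `RULES` and `lookup` are hypothetical names.

```python
# Zone-level view of the intent table above. Conditions such as
# "allow-list" or "HA IP only" are kept as notes, not enforced.
RULES = {
    ("mgmt", "*"): "administrator access",
    ("srv", "srv"): "inter-service communication",
    ("srv", "internet"): "updates, image pulls",
    ("lan", "srv"): "published ports on the allow-list only",
    ("lan", "internet"): "unrestricted",
    ("iot", "internet"): "egress only",
    ("iot", "srv"): "HA address only, integration ports",
    ("guest", "internet"): "isolated from all internal zones",
    ("vpn", "srv"): "metrics ports",
    ("vpn", "mgmt"): "administration from askari",
}


def lookup(source, destination):
    """Return the note for a permitted flow, or None (assumed default deny)."""
    if (source, destination) in RULES:
        return RULES[(source, destination)]
    # Fall back to a wildcard rule for the source zone, if one exists.
    return RULES.get((source, "*"))


if __name__ == "__main__":
    print(lookup("guest", "srv"))  # flow not listed in the table
    print(lookup("mgmt", "iot"))   # covered by the mgmt wildcard
```

Keeping the intent as data makes it straightforward to compare against the rules actually configured on OPNsense during a review.
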
---

## Naming scheme

| Layer | Convention | Examples |
|---|---|---|
| Homelab name | `boma` | — |
| Proxmox nodes | `pve<n>` | `pve0`, `pve1`, `pve2` |
| Infrastructure VMs | `<role><n>` | `dns1`, `dns2`, `proxy` |
| Hetzner VPS | `askari` | Swahili for guard/sentinel |
| Internal FQDN | `<host>.boma.baobab.band` | `dns1.boma.baobab.band` |
| Public service FQDN | `<service>.baobab.band` | `git.baobab.band` |

---

## DNS zones and split-horizon

**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
The zone is rendered by the Ansible `dns` role: host A records come from the
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
and service/alias/split-horizon records are explicit zone data in `group_vars`.
Terraform itself writes no DNS records — see ADR-009.

**Public zone**: `baobab.band` — served by external DNS (Cloudflare or equivalent).
Public-facing services resolve to the public IP or Cloudflare proxy.

**Split-horizon**: `dns1`/`dns2` serve internal answers for any hostname that has
both a public and private face. Example: `git.baobab.band` resolves to
`10.20.0.12` (proxy) internally and to the public IP externally.

OPNsense DNS resolver forwards `boma.baobab.band` queries to `dns1`/`dns2`.
All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).

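The forwarding rule in the last paragraph amounts to a suffix match on whole DNS labels. A minimal sketch, assuming nothing beyond what that paragraph states; `pick_servers` and the constant names are hypothetical, and the sketch does not model how split-horizon names outside the internal zone reach `dns1`/`dns2`, since that path is not described here.

```python
INTERNAL_ZONE = "boma.baobab.band"
INTERNAL_SERVERS = ("dns1", "dns2")         # hosts named in the text
UPSTREAM_SERVERS = ("1.1.1.1", "9.9.9.9")   # the examples given above


def pick_servers(qname):
    """Choose resolvers for a query name, matching on whole DNS labels."""
    name = qname.rstrip(".").lower()
    # Match the zone itself or any name ending in ".<zone>"; a plain
    # endswith(zone) would wrongly match "notboma.baobab.band".
    if name == INTERNAL_ZONE or name.endswith("." + INTERNAL_ZONE):
        return INTERNAL_SERVERS
    return UPSTREAM_SERVERS


if __name__ == "__main__":
    for qname in ("dns1.boma.baobab.band", "notboma.baobab.band", "example.org"):
        print(qname, "->", pick_servers(qname))
```
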
---

## External monitoring — askari

`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
for administration.

`askari` is provisioned and managed independently of the Proxmox cluster — it
must be reachable even when the homelab is down (its entire purpose).
FQDN: `askari.baobab.band`.
160
docs/decisions/008-testing.md
Normal file
@ -0,0 +1,160 @@
# ADR-008 — Testing methodology

## Context

Ansible roles must be idempotent and correct before they touch production hosts.
This document records the testing strategy, what each level covers, and — critically
— what is explicitly out of scope for automated testing and why.

---

## Three testing levels

### Level 1 — Molecule (per role, always required)

Runs in Docker on the control node or in CI. Fast (~5 min per role).

**What happens during `molecule test`:**

1. `create` — start the test container
2. `converge` — apply the role via `converge.yml`
3. **`idempotence`** — run `converge.yml` again; fail if any task reports `changed`
4. `verify` — assert expected state via `verify.yml`
5. `destroy` — remove the container

The idempotency step is non-negotiable. Every role must pass it cleanly.

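The idea behind the idempotency check is simple: after the second converge run, no host may report a non-zero `changed` count. The sketch below illustrates that idea by parsing a `PLAY RECAP` block; it is not Molecule's implementation, `hosts_with_changes` is a hypothetical helper, and it assumes plain-text output without colour codes.

```python
import re

# Matches the per-host counters in an Ansible PLAY RECAP line, e.g.
# "host-a : ok=5 changed=0 unreachable=0 failed=0".
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s.*\bchanged=(?P<changed>\d+)")


def hosts_with_changes(playbook_output):
    """Return {host: changed_count} for hosts that changed on this run."""
    offenders = {}
    for line in playbook_output.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match and int(match.group("changed")) > 0:
            offenders[match.group("host")] = int(match.group("changed"))
    return offenders


if __name__ == "__main__":
    second_run = (
        "PLAY RECAP *****\n"
        "host-a : ok=5 changed=0 unreachable=0 failed=0\n"
        "host-b : ok=5 changed=2 unreachable=0 failed=0\n"
    )
    print(hosts_with_changes(second_run))
```

An empty result means the second run was clean; anything else names the hosts whose roles are not idempotent.
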
**`verify.yml` must assert outcomes, not task success:**

```yaml
# Wrong — only proves the task ran
- assert:
    that: result is success

# Right — proves the outcome exists
- ansible.builtin.command: systemctl is-active fail2ban
  changed_when: false
  register: svc
- ansible.builtin.assert:
    that: svc.stdout == "active"
```

### Level 2 — Staging playbook (full stack, real VMs)

`make check PLAYBOOK=site` followed by `make deploy PLAYBOOK=site` on
Terraform-provisioned staging VMs. Catches inter-role dependencies and ordering
issues that Molecule cannot see (e.g., `docker_host` role requires `base` to
have already run and configured the firewall).

Run before every merge to `main`.

### Level 3 — External smoke test from askari

Once `askari` is operational: scripted checks from outside the network confirming
that public-facing services respond correctly. Catches firewall and reverse proxy
configuration issues invisible to Ansible check mode.

---

## Molecule test image

**No external images.** The project builds and hosts its own test image.

- **Source**: `.docker/molecule-debian13/Dockerfile`
- **Base**: `debian:trixie-slim` (official Debian 13, Docker Hub — only external
  dependency permitted here, as the base OS image is not substitutable)
- **Registry**: `git.baobab.band/<owner>/<repo-name>/molecule-debian13:latest`

Build and push with:

```bash
make molecule-image       # build locally
make molecule-image-push  # push to Forgejo registry (requires docker login)
```

The scaffold `molecule.yml` references this image with `pre_build_image: true`,
meaning Molecule uses the image as-is and does not attempt to build it.

**Why not geerlingguy/docker-debian13-ansible?** It is a Docker Hub image outside
project control. It is not a Galaxy role, but it is an external dependency that
can drift, disappear, or introduce unexpected changes. The custom image is
functionally equivalent and fully owned.

---

## Idempotency requirements

Every role task must satisfy one of these:

| Task type | Requirement |
|---|---|
| `apt`, `template`, `copy`, `file`, `user`, `group`, `service` | Naturally idempotent — no action needed |
| `command` / `shell` (read-only) | `changed_when: false` |
| `command` / `shell` (detectable change) | `changed_when: result.stdout \| length > 0` or equivalent |
| `command` / `shell` (creates a file) | `creates: /path/to/artifact` |
| Service restart after config change | Move to a handler; handler fires only when notified |
| `docker compose up -d` | Handler only — notified by template change, never runs unconditionally |

ansible-lint enforces most of these at lint time. The Molecule idempotency step
catches anything lint misses.

---

## What Molecule tests — and what it does not

### Tested in Molecule

| Capability | Notes |
|---|---|
| Package installation | `apt` works in the container |
| File and directory creation, permissions, ownership | Full support |
| Template rendering and content | Full support |
| User and group management | Full support |
| Service installation and `systemctl enable` | Requires the systemd-capable image |
| Service start/stop | Works for most services in the container |
| SSH configuration file content | File-level only |
| fail2ban installation and configuration | Install and config file; not live banning |
| Docker daemon installation | Works in privileged container |
| auditd installation and configuration | Install and config file |
| Idempotency of all of the above | Enforced by Molecule's idempotency step |

### Not tested in Molecule — explicit exceptions

The following require a real kernel or real hardware and are validated only at
Level 2 (staging) or Level 3 (external). This is a conscious, documented decision
— not a gap.

| Capability | Reason not testable in Molecule |
|---|---|
| `nftables` rule loading | Requires the `nf_tables` kernel module on the container host; not reliably available in Docker |
| WireGuard tunnel establishment | Requires `wireguard` kernel module |
| `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment |
| DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container |
| mDNS reflector (Avahi cross-VLAN) | Requires real network interfaces and VLANs |
| Hardware passthrough (NIC, USB) | Not applicable in containers |
| Corosync cluster formation | Requires multiple real nodes |

For the above, Molecule tests only what it can: that the relevant packages are
installed, that configuration files render correctly, and that services are enabled.
Behavioural correctness is confirmed on staging.

---

## CI pipeline

```
push to any branch
├── yamllint + ansible-lint          (fast gate, ~1 min)
└── molecule test (changed roles)    (parallel, ~5 min per role)

pull request to main
├── yamllint + ansible-lint
├── molecule test (all roles)        (parallel)
└── [manual gate] review tf-plan and make check on staging

merge to main
├── yamllint + ansible-lint + molecule test (final gate)
├── [manual approval] make deploy PLAYBOOK=site on staging
└── [manual approval] make deploy PLAYBOOK=site on production
```

Manual gates are intentional. Automated tests prove correctness in isolation;
a human confirms the change is safe to promote.
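
The "changed roles" selection on branch pushes can be derived from the list of changed paths. How the real pipeline does this is not described here; the sketch below is one possible approach, assuming roles live under `roles/<rolename>/` and that the caller supplies paths (for example from `git diff --name-only`). `changed_roles` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def changed_roles(changed_paths, roles_dir="roles"):
    """Map changed file paths to the sorted list of role names they touch."""
    roles = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Expect roles/<rolename>/<something>; ignore files directly in roles/.
        if len(parts) >= 3 and parts[0] == roles_dir:
            roles.add(parts[1])
    return sorted(roles)


if __name__ == "__main__":
    paths = [
        "roles/base/tasks/main.yml",
        "roles/dns/templates/zone.j2",
        "docs/README.md",
    ]
    print(changed_roles(paths))
```

Each returned name would then be passed to `make test ROLE=<rolename>`.
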
149
docs/decisions/009-provisioning-handoff.md
Normal file
@ -0,0 +1,149 @@
# ADR-009 — Terraform ↔ Ansible provisioning handoff

## Context

Two tools touch every managed host. Terraform owns **what exists** — VMs on
Proxmox. Ansible owns **what is configured inside** — users, packages, firewall,
Docker services, and all internal DNS. This ADR is the single source of truth for
the seam between them: the exact handoff, the data contract, and the one documented
exception. The two tools must never overlap; this document defines the line they
meet at.

ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers
the cloud-init template that VMs are cloned from. This ADR covers how they connect.

---

## The boundary

| Layer | Tool | Notes |
|---|---|---|
| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record |
| OS configuration | Ansible | Users, SSH, firewall, packages |
| Service deployment | Ansible | Docker, Compose files, secrets |
| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 |

This table is canonical here. ADR-006 links to it rather than restating it.
Terraform owns VM **existence** only — it writes no DNS records (see "Internal DNS"
below).

---

## The handoff pipeline

There is one path by which a managed host comes into existence and reaches its
configured state:

```
make tf-plan      TF_ENV=production   # review infrastructure changes
make tf-apply     TF_ENV=production   # clone template → VM (no DNS records written)
make tf-inventory TF_ENV=production   # regenerate Ansible inventory from outputs
make check  PLAYBOOK=site             # dry-run Ansible against the new host(s)
make deploy PLAYBOOK=bootstrap        # first-run specifics (see ADR-005)
make deploy PLAYBOOK=site             # full standard state — `dns` role writes the zone
```

`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005).
`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From
`make check` onward the host is Ansible's — including its DNS record, which the
`dns` role writes into the internal zone during `make deploy`.

Add a host by editing `local.vms` in the environment's `main.tf` and running
this pipeline, **never** by hand-editing the inventory.

---

## The data contract

The seam's interface is a single Terraform output consumed by a single script.

**Producer** — `terraform/environments/<env>/outputs.tf` emits a `vms` map:

```json
{
  "vms": {
    "value": {
      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
    }
  }
}
```

**Consumer** — `scripts/tf_to_inventory.py` (Python standard library only) reads
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
group against the allowed set and fails loudly on an unknown group.

**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`.

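The contract above can be sketched in a few lines of standard-library Python. This is not the repo's `scripts/tf_to_inventory.py`, whose source is not shown in this ADR; `build_inventory` and `render_hosts_yml` are hypothetical names, and the YAML layout and header wording are assumptions based on Ansible's standard YAML inventory format.

```python
import json

VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts"}


def build_inventory(terraform_output_json):
    """Turn `terraform output -json` text into {group: {host: ip}}."""
    vms = json.loads(terraform_output_json)["vms"]["value"]
    inventory = {}
    for host, attrs in sorted(vms.items()):
        group = attrs["group"]
        if group not in VALID_GROUPS:
            # Fail loudly rather than silently dropping the host.
            raise ValueError(f"{host}: unknown group {group!r}")
        inventory.setdefault(group, {})[host] = attrs["ip"]
    return inventory


def render_hosts_yml(inventory):
    """Render a minimal Ansible YAML inventory without third-party libraries."""
    lines = ["# GENERATED FILE - do not edit manually", "all:", "  children:"]
    for group in sorted(inventory):
        lines += [f"    {group}:", "      hosts:"]
        for host, ip in inventory[group].items():
            lines += [f"        {host}:", f"          ansible_host: {ip}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = '{"vms": {"value": {"host-a": {"ip": "192.168.1.10", "group": "docker_hosts"}}}}'
    print(render_hosts_yml(build_inventory(example)), end="")
```

Writing the YAML by hand keeps the sketch within the standard library, matching the "standard library only" constraint stated for the real script.
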
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
Terraform, and the inventory is regenerated, never edited.

---

## Cloud-init's role

Cloud-init is the thin first-boot layer between Terraform and Ansible:

- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values
  (hostname, SSH public key, IP/gateway).
- **Cloud-init** does just enough at first boot to make the VM reachable over SSH
  with the ansible user's key — nothing more.
- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles
  first-run specifics, then `site` applies the full standard state.

The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.

---

## Internal DNS — owned by Ansible, no chicken-and-egg

Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
rendered entirely by the Ansible `dns` role:

- **Host A records** derive from the inventory — the same `hostname → ip` data that
  originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform
  remains the ultimate source of truth for which hosts exist; the data simply flows
  through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
  gateway, `git.baobab.band` → proxy) are explicit zone data in `group_vars`.

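The first bullet describes a pure data transformation: a `hostname → ip` mapping becomes one A record per host. How the `dns` role actually renders the zone file is not shown here, so the sketch below only illustrates the data flow; `render_a_records` is a hypothetical helper and the record layout is an assumption.

```python
ZONE = "boma.baobab.band"


def render_a_records(hosts):
    """Render one A record line per host from a {hostname: ip} mapping."""
    lines = []
    for hostname, ip in sorted(hosts.items()):
        # Drop a CIDR suffix such as "/24" if the source data carries one.
        address = ip.split("/")[0]
        lines.append(f"{hostname}.{ZONE}. IN A {address}")
    return lines


if __name__ == "__main__":
    # Hypothetical host and address, for illustration only.
    for record in render_a_records({"host-a": "192.0.2.10/24"}):
        print(record)
```

Because the input is the inventory, a host added to `local.vms` gains its record on the next `make deploy PLAYBOOK=site` with no separate DNS step.
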
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
require a DNS server that does not yet exist — `dns1` cannot register its own A
record before it is running and configured. Because Ansible renders the zone from
inventory (using IP addresses, never name resolution, to connect), `dns1`/`dns2`
are ordinary Terraform-created VMs whose records are written by the same role that
configures the DNS service. There is no special case and no ordering trap.

ADR-007 holds the zone structure, split-horizon, and addressing conventions. The
IP-range split there (`.10–.19` core infra vs `.50–.249` fleet) is now an addressing
convention only — it no longer implies any difference in how records are written.

---

## The control-node exception

The control node — the host that runs Terraform and Ansible — is the one VM
Terraform does **not** create. It cannot provision the infrastructure that would
provision itself (chicken-and-egg). It is therefore the single documented exception
to "Terraform owns VM existence":

- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
  Ansible for baseline config only (no `docker_host` role).

Every other host is Terraform-managed.

---

## What was ruled out

| Option | Reason |
|---|---|
| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |
145
docs/runbooks/new-host.md
Normal file
@ -0,0 +1,145 @@
# Runbook — Adding a new managed host

## Prerequisites

- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
- You have the vault password (`.vault_pass`)
- The host's intended hostname and IP are decided

---

## Part A — Create the Proxmox template (one-time)

Run on a Proxmox node. Only needed once per cluster.

```bash
# Download the Debian 13 genericcloud image
wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2

# Create a VM (adjust ID, storage name as needed)
qm create 9000 --name debian13-template --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0

# Import the disk
qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm

# Attach disk and set boot order
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --boot c --bootdisk scsi0

# Add cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit

# Enable QEMU guest agent
qm set 9000 --agent enabled=1

# Convert to template (cannot be undone)
qm template 9000
```

---

## Part B — Define the VM in Terraform

Managed hosts are created by Terraform, never by hand. Add an entry to `local.vms`
in the environment's `main.tf` (`terraform/environments/<env>/main.tf`):

```hcl
locals {
  vms = {
    <hostname> = {
      ip        = "<IP>/24"       # static; from docs/decisions/007-network.md
      group     = "docker_hosts"  # control | docker_hosts | proxmox_hosts
      cores     = 2
      memory_mb = 2048
    }
  }
}
```

Terraform clones the cloud-init template from Part A and sets the cloud-init values
(hostname, SSH key, IP/gateway). It writes no DNS records: the host's A record is
written later by the Ansible `dns` role during `make deploy PLAYBOOK=site`. See
ADR-009 for the full handoff and the `vms` output → inventory data contract.

---

## Part C — Provision and regenerate the inventory

```bash
make tf-plan      TF_ENV=production   # review — confirm only the new VM is added
make tf-apply     TF_ENV=production   # create the VM (no DNS record is written; see ADR-009)
make tf-inventory TF_ENV=production   # regenerate inventories/production/hosts.yml
```

`make tf-inventory` rewrites `hosts.yml` from Terraform outputs — **do not edit
that file by hand**; it carries a "do not edit manually" header and your changes
would be overwritten. The source of truth is `local.vms`.

Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access:

```bash
ssh ansible@<IP> echo ok
```

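Instead of a fixed wait, the SSH port can be polled until it accepts connections. A minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper and the timeout values are arbitrary. An open port does not prove that cloud-init has finished or that key authentication works, so the `ssh ansible@<IP> echo ok` check above is still needed afterwards.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=120.0, interval=3.0):
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 2:
        sys.exit("usage: wait_for_port.py <IP>")
    sys.exit(0 if wait_for_port(sys.argv[1]) else 1)
```
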
Add a `host_vars/<hostname>/` directory if the host needs specific overrides
(this is config, not inventory membership, so it is not generated):

```bash
mkdir -p inventories/production/host_vars/<hostname>
touch inventories/production/host_vars/<hostname>/vars.yml
```

---

## Part D — Bootstrap and configure

```bash
# First-run bootstrap (handles Python installation, initial user setup)
make deploy PLAYBOOK=bootstrap

# Apply full standard state
make deploy PLAYBOOK=site
```

Verify the host reaches baseline:

```bash
make check PLAYBOOK=site
# Should report no changes
```

---

## Part E — Control node (manual exception)

The control node runs Terraform and Ansible, so it cannot be created by the
Terraform it hosts (chicken-and-egg). It is the **one** host provisioned manually —
see ADR-009 and the control-node section of ADR-005. Use the template from Part A:

```bash
# Clone the template by hand (Proxmox UI or qm clone)
qm clone 9000 <VMID> --name <hostname> --full
qm set <VMID> --memory 2048 --cores 2 \
  --ciuser ansible \
  --sshkeys /path/to/ansible_ed25519.pub \
  --ipconfig0 ip=<IP>/24,gw=<GATEWAY>
qm start <VMID>
```

Then set up the Ansible environment on it (`make setup`, `make collections`, place
`.vault_pass`) per ADR-005, and add it to `inventories/<env>/hosts.yml` under the
`control` group. Because the control node is not in `local.vms`, this is the only
case where editing `hosts.yml` by hand is expected — every other host comes from
`make tf-inventory`.

---

## Troubleshooting

**SSH connection refused**: cloud-init may still be running. Wait and retry.

**Python not found**: the bootstrap playbook handles this via the `raw` module.
If bootstrap fails, SSH to the host manually and run `apt install -y python3`.

**Firewall locked out**: if nftables rules are misconfigured, connect via the
Proxmox console (not SSH) and run `nft flush ruleset` to clear all rules temporarily.
81
docs/runbooks/new-role.md
Normal file
@ -0,0 +1,81 @@
# Runbook — Adding a new Ansible role

## When to create a new role

Create a new role when you need to manage a distinct, reusable unit of
configuration — a service, a system component, or a behaviour applied to
a group of hosts.

Do not create a role for a single task that logically belongs in an existing role.

## Procedure

### 1. Scaffold the role

```bash
make new-role NAME=<rolename>
```

This creates the full directory structure and placeholder files under `roles/<rolename>/`.

### 2. Fill in meta/main.yml

```yaml
galaxy_info:
  role_name: <rolename>
  author: <your name>
  description: <one sentence>
  min_ansible_version: "2.15"
  platforms:
    - name: Debian
      versions:
        - trixie  # Debian 13
```

### 3. Define defaults

Add all tuneable variables to `defaults/main.yml` with inline comments explaining
each variable. Use the `rolename__varname` namespace convention.

### 4. Write tasks

- Use FQCN for all modules
- Every task must have a `name:` that reads as a sentence
- Every task must have at least one `tags:` entry
- Notify handlers by `listen:` topic string, not handler name

### 5. Configure Molecule

Edit `molecule/default/molecule.yml` to use the Debian 13 test image.
Write a `converge.yml` that applies the role. Write a `verify.yml` that
asserts the expected state.

### 6. Write the README

Document:
- Purpose of the role (one paragraph)
- All variables from `defaults/main.yml` with types, defaults, and descriptions
- Example playbook usage
- Any dependencies or prerequisites

### 7. Test locally

```bash
make test ROLE=<rolename>
```

Fix any lint or test failures before committing.

### 8. Add to a playbook

Add the role to the appropriate playbook in `playbooks/` and add the host group
to `inventories/staging/hosts.yml` for integration testing.

### 9. Commit

```bash
git checkout -b role/<rolename>
git add roles/<rolename>
git commit -m "Add <rolename> role"
# open PR / merge request in Forgejo
```
71
docs/runbooks/rotate-secrets.md
Normal file
@ -0,0 +1,71 @@
# Runbook — Rotating vault secrets

## Rotating a single secret value

1. Decrypt the relevant vault file:
   ```bash
   make decrypt FILE=inventories/production/group_vars/all/vault.yml
   ```

2. Edit the file and update the secret value.

3. Re-encrypt:
   ```bash
   make encrypt FILE=inventories/production/group_vars/all/vault.yml
   ```

4. Commit the updated vault file:
   ```bash
   git add inventories/production/group_vars/all/vault.yml
   git commit -m "Rotate <secret name>"
   ```

5. Deploy to apply the new secret to hosts:
   ```bash
   make check PLAYBOOK=site   # verify what will change
   make deploy PLAYBOOK=site
   ```

---

## Rotating the vault password

This affects all encrypted files in the repo. Do this only when:
- A person with vault access leaves the project
- The password is suspected to be compromised

Steps:

1. Ensure you have the current vault password in `.vault_pass`.

2. Re-key all vault files (this pattern only matches files named `vault.yml`):
   ```bash
   find . -name "vault.yml" -print0 | xargs -0 ansible-vault rekey \
     --vault-password-file .vault_pass \
     --new-vault-password-file /path/to/new_password_file
   ```

3. Replace `.vault_pass` with the new password file.

4. Distribute the new password to all collaborators via a secure channel.

5. Commit all rekeyed vault files:
   ```bash
   git add -u   # tracked files only, so a stray password file is not staged
   git commit -m "Rekey all vault files"
   ```

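Since the rekey affects every encrypted file, it can help to list vault files by content and not by name before running it. Files encrypted with Ansible Vault begin with the text `$ANSIBLE_VAULT;`, so a short scan finds them regardless of filename. This is a sketch under assumptions: `find_vault_files` is a hypothetical helper, and the skipped directory names are guesses about the working tree.

```python
from pathlib import Path

VAULT_HEADER = b"$ANSIBLE_VAULT;"
SKIP_DIRS = {".git", ".venv"}  # assumed directory names to skip


def find_vault_files(root="."):
    """Return paths (relative to root) of files starting with the vault header."""
    base = Path(root)
    found = []
    for path in sorted(base.rglob("*")):
        relative = path.relative_to(base)
        if SKIP_DIRS.intersection(relative.parts) or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(VAULT_HEADER)) == VAULT_HEADER:
                found.append(relative)
    return found


if __name__ == "__main__":
    for vault_file in find_vault_files("."):
        print(vault_file)
```

Comparing this list with the output of the `find` command in step 2 shows whether any encrypted file would be missed by the rekey.
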
---

## Adding a new collaborator

1. Share the vault password via a secure channel (password manager, etc.)
2. The collaborator creates `.vault_pass` locally (gitignored)
3. They can now decrypt/encrypt vault files normally

## Removing a collaborator's access

Rotate the vault password as described above. There is no per-user access
control in Ansible Vault — access is binary (has the password or not).

If per-user access control becomes necessary, evaluate SOPS + age at that point.