boma/docs/decisions/009-provisioning-handoff.md
sjat fe4228fb38 Add architecture decision records and runbooks
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:10:01 +02:00

# ADR-009 — Terraform ↔ Ansible provisioning handoff
## Context
Two tools touch every managed host. Terraform owns **what exists** — VMs on
Proxmox. Ansible owns **what is configured inside** — users, packages, firewall,
Docker services, and all internal DNS. This ADR is the single source of truth for
the seam between them: the exact handoff, the data contract, and the one documented
exception. The two tools must never overlap; this document defines the line they
meet at.

ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers
the cloud-init template that VMs are cloned from. This ADR covers how they connect.

---
## The boundary
| Layer | Tool | Notes |
|---|---|---|
| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record |
| OS configuration | Ansible | Users, SSH, firewall, packages |
| Service deployment | Ansible | Docker, Compose files, secrets |
| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 |

This table is the canonical statement of the boundary. ADR-006 links to it rather
than restating it. Terraform owns VM **existence** only; it writes no DNS records
(see "Internal DNS" below).

---
## The handoff pipeline
There is one path by which a managed host comes into existence and reaches its
configured state:
```sh
make tf-plan      TF_ENV=production   # review infrastructure changes
make tf-apply     TF_ENV=production   # clone template → VM (no DNS records written)
make tf-inventory TF_ENV=production   # regenerate Ansible inventory from outputs
make check        PLAYBOOK=site       # dry-run Ansible against the new host(s)
make deploy       PLAYBOOK=bootstrap  # first-run specifics (see ADR-005)
make deploy       PLAYBOOK=site       # full standard state; `dns` role writes the zone
```
`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005).
`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From
`make check` onward the host is Ansible's — including its DNS record, which the
`dns` role writes into the internal zone during `make deploy PLAYBOOK=site`.

Add a host by editing `local.vms` in the environment's `main.tf` and running
this pipeline, **never** by hand-editing the inventory.

---
## The data contract
The seam's interface is a single Terraform output consumed by a single script.

**Producer**: `terraform/environments/<env>/outputs.tf` emits a `vms` map:
```json
{
  "vms": {
    "value": {
      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
    }
  }
}
```
**Consumer**: `scripts/tf_to_inventory.py` (Python standard library only) reads
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates each
host's `group` against the allowed set and fails loudly on an unknown group.

**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`.
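The consumer side of this contract can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `scripts/tf_to_inventory.py`: the function names, the YAML layout (`all.children.<group>.hosts.<name>.ansible_host`), and the header wording are hypothetical. Only the `vms` output shape, the three valid groups, and the fail-on-unknown-group behaviour are taken from this ADR.

```python
import json

# Allowed groups, as listed in this ADR.
VALID_GROUPS = ("control", "docker_hosts", "proxmox_hosts")

# Assumed wording: the ADR only says the generated file carries a do-not-edit notice.
HEADER = "# GENERATED FILE: do not edit manually. Edit local.vms in Terraform instead."


def group_hosts(tf_output: dict) -> dict:
    """Turn `terraform output -json` data into {group: {hostname: ip}}."""
    vms = tf_output["vms"]["value"]
    grouped = {}
    for name, attrs in sorted(vms.items()):
        group = attrs["group"]
        if group not in VALID_GROUPS:
            # Fail loudly instead of writing an inventory with a stray group.
            raise ValueError(f"host {name!r}: unknown group {group!r}")
        grouped.setdefault(group, {})[name] = attrs["ip"]
    return grouped


def render_inventory(grouped: dict) -> str:
    """Emit the inventory as YAML by hand, since the standard library has no YAML writer."""
    lines = [HEADER, "all:", "  children:"]
    for group in sorted(grouped):
        lines.append(f"    {group}:")
        lines.append("      hosts:")
        for name, ip in grouped[group].items():
            lines.append(f"        {name}:")
            lines.append(f"          ansible_host: {ip}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = json.loads(
        '{"vms": {"value": {"host-a": {"ip": "192.168.1.10", "group": "docker_hosts"}}}}'
    )
    print(render_inventory(group_hosts(sample)), end="")
```

Validating before rendering means a typo in a `group` value stops the pipeline at `make tf-inventory`, before Ansible ever sees a partial inventory.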
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
Terraform, and the inventory is regenerated, never edited.

---
## Cloud-init's role
Cloud-init is the thin first-boot layer between Terraform and Ansible:
- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values
(hostname, SSH public key, IP/gateway).
- **Cloud-init** does just enough at first boot to make the VM reachable over SSH
with the ansible user's key — nothing more.
- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles
first-run specifics, then `site` applies the full standard state.
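What "reachable over SSH" means at this handoff can be illustrated with a small probe. This is a hypothetical helper, not a step in the documented pipeline: the function name, the default port, and the timings are assumptions. An open TCP port only shows that an SSH daemon is listening; it does not prove that the ansible user's key is accepted.

```python
import socket
import time


def wait_for_port(host: str, port: int = 22, timeout: float = 120.0,
                  interval: float = 2.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Connect by IP address; no name resolution is needed at this stage.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: cloud-init may still be running on first boot.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A call such as `wait_for_port("192.168.1.10")`, using the sample address from the data contract, would return `True` once the port accepts connections, which is the earliest point at which `make check` can do anything useful.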
The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.

---
## Internal DNS — owned by Ansible, no chicken-and-egg
Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
rendered entirely by the Ansible `dns` role:
- **Host A records** derive from the inventory — the same `hostname → ip` data that
originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform
remains the ultimate source of truth for which hosts exist; the data simply flows
through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
gateway, `git.baobab.band` → proxy) are explicit zone data in `group_vars`.

This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
require a DNS server that does not yet exist: `dns1` cannot register its own A
record before it is running and configured. Because Ansible renders the zone from
inventory and connects to hosts by IP address, never by name, `dns1`/`dns2`
are ordinary Terraform-created VMs whose records are written by the same role that
configures the DNS service. There is no special case and no ordering trap.
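The way host A records fall out of inventory data can be sketched as below. It is an illustration only: the real `dns` role is an Ansible role and is not shown in this ADR. The function name, the record tuple layout, and every sample value (the addresses and the `git`/`proxy` names inside the internal zone) are hypothetical. Only the zone name comes from this document.

```python
ZONE = "boma.baobab.band"  # internal zone named in this ADR


def zone_records(hosts: dict, extra_records: list) -> list:
    """Return sorted (name, type, value) records for the internal zone.

    hosts: {hostname: ip}, the data that reaches Ansible through the inventory.
    extra_records: explicit (name, type, value) tuples, standing in for group_vars.
    """
    # Every inventory host gets an A record; dns1/dns2 are handled like any other host.
    records = [(f"{name}.{ZONE}.", "A", ip) for name, ip in hosts.items()]
    records.extend(extra_records)

    # DNS rule: a name that carries a CNAME must not carry any other record type.
    cname_names = {name for name, rtype, _ in records if rtype == "CNAME"}
    for name, rtype, _ in records:
        if rtype != "CNAME" and name in cname_names:
            raise ValueError(f"{name} has a CNAME and a {rtype} record")
    return sorted(records)


if __name__ == "__main__":
    hosts = {"dns1": "192.168.1.11", "dns2": "192.168.1.12"}  # hypothetical addresses
    extra = [(f"git.{ZONE}.", "CNAME", f"proxy.{ZONE}.")]  # hypothetical alias
    for name, rtype, value in zone_records(hosts, extra):
        print(f"{name:<28} IN {rtype:<5} {value}")
```

Nothing in the sketch performs a DNS lookup: the records are computed from data already on the control node, which is why the first DNS server needs no special handling.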
ADR-007 holds the zone structure, split-horizon, and addressing conventions. The
IP-range split there (`.10-.19` core infra vs `.50-.249` fleet) is now an addressing
convention only; it no longer implies any difference in how records are written.

---
## The control-node exception
The control node, the host that runs Terraform and Ansible, is the one VM
Terraform does **not** create. Terraform runs on the control node, so it cannot
create the machine it must already be running on (chicken-and-egg). The control
node is therefore the single documented exception to "Terraform owns VM existence":
- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
Ansible for baseline config only (no `docker_host` role).

Every other host is Terraform-managed.

---
## What was ruled out
| Option | Reason |
|---|---|
| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |