Review O4: ADR-016 said askari gets "its own inventory group" but never named it. Settled as offsite_hosts (off-site, distinct from on-site-but-off-cluster ubongo). Added to VALID_GROUPS (tf_to_inventory.py), ADR-009 valid groups, ADR-001/ADR-016 host-group enumerations, and CLAUDE.md. Generated hosts.yml picks up the section on the next make tf-inventory (a manual-exception group like control). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# ADR-009 — Terraform ↔ Ansible provisioning handoff

## Context

Two tools touch every managed host. Terraform owns **what exists** — VMs on
Proxmox. Ansible owns **what is configured inside** — users, packages, firewall,
Docker services, and all internal DNS. This ADR is the single source of truth for
the seam between them: the exact handoff, the data contract, and the one documented
exception. The two tools must never overlap; this document defines the line they
meet at.

ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers
the cloud-init template that VMs are cloned from. This ADR covers how they connect.

---

## The boundary

| Layer | Tool | Notes |
|---|---|---|
| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record |
| OS configuration | Ansible | Users, SSH, firewall, packages |
| Service deployment | Ansible | Docker, Compose files, secrets |
| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 |

This table is canonical here. ADR-006 links to it rather than restating it.
Terraform owns VM **existence** only — it writes no DNS records (see "Internal DNS"
below).

---

## The handoff pipeline

There is one path by which a managed host comes into existence and reaches its
configured state:

```
make tf-plan      TF_ENV=production   # review infrastructure changes
make tf-apply     TF_ENV=production   # clone template → VM (no DNS records written)
make tf-inventory TF_ENV=production   # regenerate Ansible inventory from outputs
make check        PLAYBOOK=site       # dry-run Ansible against the new host(s)
make deploy       PLAYBOOK=bootstrap  # first-run specifics (see ADR-005)
make deploy       PLAYBOOK=site       # full standard state — `dns` role writes the zone
```

`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005).
`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From
`make check` onward the host is Ansible's — including its DNS record, which the
`dns` role writes into the internal zone during `make deploy`.

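As a rough illustration of the ordering only, the sketch below holds the same make targets as data and runs them in sequence, stopping at the first failure because each step depends on the one before it. The helper names `pipeline_commands` and `run_pipeline` are hypothetical and are not part of the repository; only the make targets and variables come from the commands above. In practice the `tf-plan` step is a human review point, so a real run would pause there rather than continue automatically.

```python
import subprocess
from typing import Callable, List, Sequence


def pipeline_commands(tf_env: str) -> List[List[str]]:
    """Return the handoff steps in the documented order (hypothetical helper)."""
    return [
        ["make", "tf-plan", f"TF_ENV={tf_env}"],
        ["make", "tf-apply", f"TF_ENV={tf_env}"],
        ["make", "tf-inventory", f"TF_ENV={tf_env}"],
        ["make", "check", "PLAYBOOK=site"],
        ["make", "deploy", "PLAYBOOK=bootstrap"],
        ["make", "deploy", "PLAYBOOK=site"],
    ]


def run_pipeline(
    commands: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run each step in order and return how many completed successfully."""
    completed = 0
    for argv in commands:
        # A non-zero exit code stops the run: later steps assume this one worked.
        if runner(argv) != 0:
            break
        completed += 1
    return completed
```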

Adding a host means editing `local.vms` in the environment's `main.tf` and running
this pipeline, **never** hand-editing the inventory.

---

## The data contract

The seam's interface is a single Terraform output consumed by a single script.

**Producer** — `terraform/environments/<env>/outputs.tf` emits a `vms` map:

```json
{
  "vms": {
    "value": {
      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
    }
  }
}
```

**Consumer** — `scripts/tf_to_inventory.py` (Python standard library only) reads
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
group against the allowed set and fails loudly on an unknown group.

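The validation step might look roughly like the following. This is a minimal sketch, not the actual contents of `tf_to_inventory.py`: the function name `parse_vms` and the choice of `ValueError` are assumptions, and the group set simply mirrors the valid groups this ADR lists. It takes the JSON text that `terraform output -json` prints, so it can be exercised without Terraform installed.

```python
import json
from typing import Dict

# Allowed set as listed in this ADR; the real script may define it differently.
VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts", "offsite_hosts"}


def parse_vms(terraform_output_json: str) -> Dict[str, Dict[str, str]]:
    """Extract and validate the `vms` map from `terraform output -json` text."""
    vms = json.loads(terraform_output_json)["vms"]["value"]
    for name, attrs in vms.items():
        group = attrs["group"]
        if group not in VALID_GROUPS:
            # Fail loudly rather than emit an inventory with a stray group.
            raise ValueError(f"{name}: unknown group {group!r}")
    return vms
```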

**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.

`control` and `offsite_hosts` are not produced by Terraform — they hold manually
provisioned hosts (`ubongo` and `askari` respectively) added to the inventory by hand
(see the control-node exception below and ADR-015/ADR-016). They are valid groups so
the generated `hosts.yml` carries their (otherwise empty) sections.

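To show why every valid group appears in the output even when Terraform produces no hosts for it, here is a hedged sketch of the grouping step. The layout follows Ansible's standard YAML inventory shape (`all` → `children` → group → `hosts`) and uses the standard `ansible_host` variable; whether the generated `hosts.yml` uses exactly this shape is an assumption, and `build_inventory` is a hypothetical name.

```python
from typing import Dict

VALID_GROUPS = ("control", "docker_hosts", "proxmox_hosts", "offsite_hosts")


def build_inventory(vms: Dict[str, Dict[str, str]]) -> dict:
    """Group hosts by inventory group, keeping every valid group present."""
    # Start with an empty section per valid group so groups that Terraform
    # never fills, such as `control` and `offsite_hosts`, still appear.
    children = {group: {"hosts": {}} for group in VALID_GROUPS}
    for name, attrs in sorted(vms.items()):
        # Assumes groups were validated earlier; an unknown group raises KeyError.
        children[attrs["group"]]["hosts"][name] = {"ansible_host": attrs["ip"]}
    return {"all": {"children": children}}
```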

The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
Terraform, and the inventory is regenerated, never edited.

---

## Cloud-init's role

Cloud-init is the thin first-boot layer between Terraform and Ansible:

- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values
  (hostname, SSH public key, IP/gateway).
- **Cloud-init** does just enough at first boot to make the VM reachable over SSH
  with the ansible user's key — nothing more.
- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles
  first-run specifics, then `site` applies the full standard state.

The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.

---

## Internal DNS — owned by Ansible, no chicken-and-egg

Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
rendered entirely by the Ansible `dns` role:

- **Host A records** derive from the inventory — the same `hostname → ip` data that
  originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform
  remains the ultimate source of truth for which hosts exist; the data simply flows
  through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
  gateway, `forgejo.nyumbani.baobab.band` → proxy) are explicit zone data in `group_vars`.

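The two record sources above can be pictured as one merge step. The sketch below is illustrative Python, not the `dns` role itself (which is an Ansible role): `render_records` is a hypothetical name, and qualifying each inventory hostname directly under the zone is an assumption about naming.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (name, type, value)


def render_records(
    inventory_hosts: Dict[str, str],
    extra_records: List[Record],
    zone: str = "boma.baobab.band",
) -> List[Record]:
    """Combine inventory-derived A records with explicit zone data."""
    # One A record per inventory host: hostname -> ip, qualified with the zone.
    records = [
        (f"{host}.{zone}.", "A", ip)
        for host, ip in sorted(inventory_hosts.items())
    ]
    # Explicit records (services, CNAMEs, non-VM hosts) are appended unchanged.
    records.extend(extra_records)
    return records
```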

This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
require a DNS server that does not yet exist — `dns1` cannot register its own A
record before it is running and configured. Because Ansible renders the zone from
inventory (using IP addresses, never name resolution, to connect), `dns1`/`dns2`
are ordinary Terraform-created VMs whose records are written by the same role that
configures the DNS service. There is no special case and no ordering trap.

ADR-007 holds the zone structure, split-horizon, and addressing conventions. The
IP-range split there (`.10–.19` core infra vs `.50–.249` fleet) is now an addressing
convention only — it no longer implies any difference in how records are written.

---

## The control-node exception

The control node — the host that runs Terraform and Ansible — is `ubongo`, a
dedicated **physical** machine outside the cluster. It is not a VM at all, so
Terraform genuinely never touches it: it cannot provision the infrastructure that
would provision itself (chicken-and-egg). It is therefore the single documented
exception to "Terraform owns VM existence":

- Provisioned and bootstrapped manually on bare metal, per the control-node section
  of ADR-005; rationale, hardware, and recovery model are in ADR-015.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
  Ansible for baseline config only (no `docker_host` role).

Every other host is Terraform-managed, apart from the manually provisioned `askari`
in `offsite_hosts` (see "The data contract" above and ADR-016).

---

## What was ruled out

| Option | Reason |
|---|---|
| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |