# ADR-009 — Terraform ↔ Ansible provisioning handoff
## Context
Two tools touch every managed host. Terraform owns **what exists** — VMs on
Proxmox. Ansible owns **what is configured inside** — users, packages, firewall,
Docker services, and all internal DNS. This ADR is the single source of truth for
the seam between them: the exact handoff, the data contract, and the one documented
exception. The two tools must never overlap; this document defines the line they
meet at.

ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers
the cloud-init template that VMs are cloned from. This ADR covers how they connect.

---
## The boundary
| Layer | Tool | Notes |
|---|---|---|
| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record |
| OS configuration | Ansible | Users, SSH, firewall, packages |
| Service deployment | Ansible | Docker, Compose files, secrets |
| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 |

This table is canonical here. ADR-006 links to it rather than restating it.

Terraform owns VM **existence** only — it writes no DNS records (see "Internal DNS"
below).

---
## The handoff pipeline
There is one path by which a managed host comes into existence and reaches its
configured state:
```
make tf-plan      TF_ENV=production   # review infrastructure changes
make tf-apply     TF_ENV=production   # clone template → VM (no DNS records written)
make tf-inventory TF_ENV=production   # regenerate Ansible inventory from outputs
make check        PLAYBOOK=site       # dry-run Ansible against the new host(s)
make deploy       PLAYBOOK=bootstrap  # first-run specifics (see ADR-005)
make deploy       PLAYBOOK=site       # full standard state — `dns` role writes the zone
```
`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005).
`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From
`make check` onward the host is Ansible's — including its DNS record, which the
`dns` role writes into the internal zone during `make deploy`.

To add a host, edit `local.vms` in the environment's `main.tf` and run this
pipeline; **never** hand-edit the inventory.

---
## The data contract
The seam's interface is a single Terraform output consumed by a single script.

**Producer** — `terraform/environments/<env>/outputs.tf` emits a `vms` map:
```json
{
  "vms": {
    "value": {
      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
    }
  }
}
```
**Consumer** — `scripts/tf_to_inventory.py` (Python standard library only) reads
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
group against the allowed set and fails loudly on an unknown group.

**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
`control` and `offsite_hosts` are not produced by Terraform — they hold manually
provisioned hosts (`ubongo` and `askari` respectively) added to the inventory by hand
(see the control-node exception below and ADR-015/ADR-016). They are valid groups so
the generated `hosts.yml` carries their (otherwise empty) sections.

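The contract above can be sketched in a few lines of standard-library Python. This is an illustration only, not the contents of `scripts/tf_to_inventory.py`: the function names, the header wording, and the exact YAML layout are assumptions, and the hand-added `control` and `offsite_hosts` entries are left out of scope.

```python
import json

# Allowed groups, as listed in this ADR.
VALID_GROUPS = ("control", "docker_hosts", "proxmox_hosts", "offsite_hosts")

# Header wording is an assumption; the ADR only says the file carries such a header.
HEADER = "# Generated by tf_to_inventory.py. Do not edit manually."


def build_groups(tf_output: dict) -> dict:
    """Sort hosts from the `vms` output into groups; fail loudly on an unknown group."""
    groups = {name: {} for name in VALID_GROUPS}
    for host, attrs in sorted(tf_output["vms"]["value"].items()):
        group = attrs["group"]
        if group not in groups:
            raise ValueError(f"{host}: unknown group {group!r}")
        groups[group][host] = attrs["ip"]
    return groups


def render_hosts_yml(groups: dict) -> str:
    """Render a minimal YAML inventory by hand (no YAML library needed)."""
    lines = [HEADER, "all:", "  children:"]
    for group, hosts in groups.items():
        lines.append(f"    {group}:")  # empty groups still get a section
        if hosts:
            lines.append("      hosts:")
        for host, ip in hosts.items():
            lines.append(f"        {host}:")
            lines.append(f"          ansible_host: {ip}")
    return "\n".join(lines) + "\n"


# Same shape as the `terraform output -json` example above.
sample = json.loads(
    '{"vms": {"value": {"host-a": {"ip": "192.168.1.10", "group": "docker_hosts"}}}}'
)
print(render_hosts_yml(build_groups(sample)), end="")
```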
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
Terraform, and the inventory is regenerated, never edited.

---
## Cloud-init's role
Cloud-init is the thin first-boot layer between Terraform and Ansible:

- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values
  (hostname, SSH public key, IP/gateway).
- **Cloud-init** does just enough at first boot to make the VM reachable over SSH
  with the ansible user's key — nothing more.
- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles
  first-run specifics, then `site` applies the full standard state.

The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.

---
## Internal DNS — owned by Ansible, no chicken-and-egg
Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
rendered entirely by the Ansible `dns` role:
- **Host A records** derive from the inventory — the same `hostname → ip` data that
  originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform
  remains the ultimate source of truth for which hosts exist; the data simply flows
  through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
  gateway, `forgejo.nyumbani.baobab.band` → proxy) are explicit zone data in `group_vars`.

This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
require a DNS server that does not yet exist — `dns1` cannot register its own A
record before it is running and configured. Because Ansible renders the zone from
inventory (using IP addresses, never name resolution, to connect), `dns1`/`dns2`
are ordinary Terraform-created VMs whose records are written by the same role that
configures the DNS service. There is no special case and no ordering trap.

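A small Python stand-in shows why no ordering trap arises: every record is computed from static data, so the DNS servers' own records can be rendered before any resolver is running. The real `dns` role is an Ansible role and is not shown here; the function name, the record dictionary shape, the `gw` name, and all addresses below are hypothetical.

```python
ZONE = "boma.baobab.band"  # internal zone named in this ADR


def render_records(inventory: dict, extra_records: list) -> list:
    """Return zone-file lines: one A record per inventory host, then explicit records.

    Nothing here performs a DNS lookup; inputs are plain hostname -> IP data
    plus explicit records of the kind kept in group_vars.
    """
    lines = []
    for host, ip in sorted(inventory.items()):
        lines.append(f"{host}.{ZONE}. IN A {ip}")
    for rec in extra_records:
        lines.append(f"{rec['name']}.{ZONE}. IN {rec['type']} {rec['value']}")
    return lines


# Hypothetical inputs: dns1/dns2 are named in the ADR, the addresses are made up.
inventory = {"dns1": "192.168.1.11", "dns2": "192.168.1.12"}
# Hypothetical non-VM record, standing in for explicit zone data in group_vars.
extra = [{"name": "gw", "type": "A", "value": "192.168.1.1"}]

for line in render_records(inventory, extra):
    print(line)
```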
ADR-007 holds the zone structure, split-horizon, and addressing conventions. The
IP-range split there (`.10–.19` core infra vs `.50–.249` fleet) is now an addressing
convention only — it no longer implies any difference in how records are written.

---
## The control-node exception
The control node — the host that runs Terraform and Ansible — is `ubongo`, a
dedicated **physical** machine outside the cluster. It is not a VM at all, so
Terraform genuinely never touches it: it cannot provision the infrastructure that
would provision itself (chicken-and-egg). It is therefore the single documented
exception to "Terraform owns VM existence":
- Provisioned and bootstrapped manually on bare metal, per the control-node section
  of ADR-005; rationale, hardware, and recovery model in ADR-015.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
  Ansible for baseline config only (no `docker_host` role).

Every other host is Terraform-managed.

---
## What was ruled out
| Option | Reason |
|---|---|
| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |