# ADR-009 — Terraform ↔ Ansible provisioning handoff
## Status
Accepted (2026-05-30)
## Context
Two tools touch every managed host. Terraform owns **what exists** — VMs on
Proxmox. Ansible owns **what is configured inside** — users, packages, firewall,
Docker services, and all internal DNS. This ADR is the single source of truth for
the seam between them: the exact handoff, the data contract, and the one documented
exception. The two tools must never overlap; this document defines the line they
meet at.
ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers
the cloud-init template that VMs are cloned from. This ADR covers how they connect.
---
## Decision
### The boundary
| Layer | Tool | Notes |
|---|---|---|
| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record |
| OS configuration | Ansible | Users, SSH, firewall, packages |
| Service deployment | Ansible | Docker, Compose files, secrets |
| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 |
This table is canonical here. ADR-006 links to it rather than restating it.
Terraform owns VM **existence** only — it writes no DNS records (see "Internal DNS"
below).
---
### The handoff pipeline
There is one path by which a managed host comes into existence and reaches its
configured state:
```
make tf-plan TF_ENV=production # review infrastructure changes
make tf-apply TF_ENV=production # clone template → VM (no DNS records written)
make tf-inventory TF_ENV=production # regenerate Ansible inventory from outputs
make check PLAYBOOK=site # dry-run Ansible against the new host(s)
make deploy PLAYBOOK=bootstrap # first-run specifics (see ADR-005)
make deploy PLAYBOOK=site # full standard state — `dns` role writes the zone
```
`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005).
`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From
`make check` onward the host is Ansible's — including its DNS record, which the
`dns` role writes into the internal zone during `make deploy`.
To add a host, edit `local.vms` in the environment's `main.tf` and run this
pipeline; **never** hand-edit the inventory.
---
### The data contract
The seam's interface is a single Terraform output consumed by a single script.
**Producer** — `terraform/environments/<env>/outputs.tf` emits a `vms` map:
```json
{
  "vms": {
    "value": {
      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
    }
  }
}
```
**Consumer** — `scripts/tf_to_inventory.py` (Python standard library only) reads
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
group against the allowed set and fails loudly on an unknown group.
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
`control` holds `ubongo`, a physical machine not managed by Terraform (see the
control-node exception below and ADR-015). `offsite_hosts` holds `askari`, which is
Terraform-managed via the `hetznercloud/hcloud` provider in the `offsite` environment
(see the off-site handoff note below and ADR-016).
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
Terraform, and the inventory is regenerated, never edited.
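
The contract above can be sketched in a few lines of standard-library Python. This is an illustration only, not the actual `scripts/tf_to_inventory.py`: the function names `build_inventory` and `render_hosts_yml`, the exact header wording, and the `ansible_host` layout are assumptions made for the example. Only the input shape and the list of valid groups come from this ADR.

```python
import json

# Allowed groups, taken from the "Valid groups" list in this ADR.
VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts", "offsite_hosts"}


def build_inventory(tf_output: dict) -> dict:
    """Map the Terraform `vms` output to {group: {hostname: ip}} (hypothetical helper)."""
    vms = tf_output["vms"]["value"]
    inventory: dict = {}
    for hostname, attrs in sorted(vms.items()):
        group = attrs["group"]
        if group not in VALID_GROUPS:
            # Fail loudly instead of silently dropping or misfiling a host.
            raise ValueError(f"{hostname}: unknown group {group!r}")
        inventory.setdefault(group, {})[hostname] = attrs["ip"]
    return inventory


def render_hosts_yml(inventory: dict) -> str:
    """Render a minimal Ansible YAML inventory by hand, so no YAML library is needed."""
    # Header wording is an assumption; the ADR only says such a header exists.
    lines = ["# GENERATED FILE: do not edit manually", "all:", "  children:"]
    for group in sorted(inventory):
        lines += [f"    {group}:", "      hosts:"]
        for hostname, ip in inventory[group].items():
            lines += [f"        {hostname}:", f"          ansible_host: {ip}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = json.loads(
        '{"vms": {"value": {"host-a": {"ip": "192.168.1.10", "group": "docker_hosts"}}}}'
    )
    print(render_hosts_yml(build_inventory(sample)), end="")
```

Raising on an unknown group, rather than skipping the host, mirrors the "fails loudly" requirement: a typo in `local.vms` stops the pipeline before Ansible ever sees a partial inventory.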
---
### Cloud-init's role
Cloud-init is the thin first-boot layer between Terraform and Ansible:
- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values
(hostname, SSH public key, IP/gateway).
- **Cloud-init** does just enough at first boot to make the VM reachable over SSH
with the ansible user's key — nothing more.
- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles
first-run specifics, then `site` applies the full standard state.
The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.
---
### Internal DNS — owned by Ansible, no chicken-and-egg
Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
rendered entirely by the Ansible `dns` role:
- **Host A records** derive from the inventory — the same `hostname → ip` data that
originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform
remains the ultimate source of truth for which hosts exist; the data simply flows
through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
gateway, `vaultwarden.wingu.me` → proxy split-horizon) are explicit zone data in
`group_vars`.
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
require a DNS server that does not yet exist — `dns1` cannot register its own A
record before it is running and configured. Because Ansible renders the zone from
inventory (using IP addresses, never name resolution, to connect), `dns1`/`dns2`
are ordinary Terraform-created VMs whose records are written by the same role that
configures the DNS service. There is no special case and no ordering trap.
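
The host-record half of that flow can be illustrated with a short Python sketch. The real rendering is done by the Ansible `dns` role, whose templates are not shown in this ADR; the function name `render_a_records` and the zone-file line format below are assumptions for illustration only.

```python
ZONE = "boma.baobab.band"  # internal zone named in this ADR


def render_a_records(hosts: dict[str, str], zone: str = ZONE) -> list[str]:
    """Turn {hostname: ip} inventory data into zone-file A record lines, sorted by name."""
    return [f"{name}.{zone}. IN A {ip}" for name, ip in sorted(hosts.items())]


if __name__ == "__main__":
    # `host-a` is the example host from the data contract above.
    for line in render_a_records({"host-a": "192.168.1.10"}):
        print(line)
```

Because the input is only hostnames and IP addresses, nothing here needs a working resolver, which is why the DNS servers themselves can be rendered into the zone like any other host.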
ADR-007 holds the zone structure, split-horizon, and addressing conventions. The
IP-range split there (`.10–.19` core infra vs `.50–.249` fleet) is now an addressing
convention only — it no longer implies any difference in how records are written.
---
### The control-node exception
The control node — the host that runs Terraform and Ansible — is `ubongo`, a
dedicated **physical** machine outside the cluster. It is not a VM at all, so
Terraform never touches it: the machine that runs Terraform cannot be created by
that same Terraform (chicken-and-egg). It is therefore the single documented
exception to "Terraform owns VM existence":
- Provisioned and bootstrapped manually on bare metal, per the control-node section
of ADR-005; rationale, hardware, and recovery model in ADR-015.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
Ansible for baseline config only (no `docker_host` role).
Every other host is Terraform-managed.
---
### The off-site handoff (`offsite` environment → `offsite_hosts`)
`askari` (Hetzner VPS, ADR-016) follows the same handoff pipeline as Proxmox hosts but
with its own provider and environment:
- **Producer** — `terraform/environments/offsite/outputs.tf` emits a `vms` map in the
same `{ host: { ip, group } }` shape as Proxmox environments; `askari`'s group is
`offsite_hosts`.
- **Consumer** — `scripts/tf_to_inventory.py` reads `terraform output -json` from the
`offsite` environment and writes `inventories/production/offsite.yml`.
- **Makefile target** — `make tf-inventory-offsite` runs the generator for the offsite
environment.
The production inventory is a **directory** (`inventories/production/`) that Ansible
merges at runtime: `hosts.yml` (Proxmox-generated) and `offsite.yml`
(offsite-generated) together form the full production host list. Each file is a build
artifact — never hand-edited; their source of truth is `local.vms` in the respective
environment's `main.tf`.
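
As a rough mental model of the directory merge, the two generated files can be treated as `{group: {hostname: ip}}` maps that are combined into one. The sketch below is an assumption-laden illustration, not Ansible's implementation: `merge_inventories` is a hypothetical helper, and the `askari` address is a documentation-range placeholder, not its real IP.

```python
Inventory = dict[str, dict[str, str]]  # {group: {hostname: ip}}


def merge_inventories(*sources: Inventory) -> Inventory:
    """Union several inventory maps; groups and hosts from every source are kept."""
    merged: Inventory = {}
    for source in sources:
        for group, hosts in source.items():
            merged.setdefault(group, {}).update(hosts)
    return merged


# Stand-ins for the parsed contents of the two generated files.
hosts_yml = {"docker_hosts": {"host-a": "192.168.1.10"}}
offsite_yml = {"offsite_hosts": {"askari": "203.0.113.10"}}  # placeholder IP

production = merge_inventories(hosts_yml, offsite_yml)
```

Keeping one generated file per Terraform environment means each generator run rewrites only its own file, so regenerating the Proxmox inventory cannot drop `askari`, and vice versa.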
---
### What was ruled out
| Option | Reason |
|---|---|
| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |
## Consequences
Drawn from the boundary, the data contract, and the "What was ruled out" section above:
- Adding a host means editing `local.vms` and running the handoff pipeline; the
generated `hosts.yml` is a build artifact and must never be hand-edited — manual
edits are overwritten on the next `make tf-inventory` (The handoff pipeline; The
data contract; What was ruled out).
- Manual `qm clone` is rejected as a general provisioning path so the inventory and
real infrastructure cannot drift; Terraform is the single way VMs come into
existence (What was ruled out).
- Terraform writes no DNS records: the Ansible `dns` role renders the whole internal
zone from inventory plus `group_vars`, dissolving the bootstrap cycle a
Terraform-managed zone (`hashicorp/dns` + RFC 2136) would create (Internal DNS —
owned by Ansible, no chicken-and-egg; What was ruled out).
- The control node (`ubongo`) is the single documented exception to "Terraform owns
VM existence" — a physical machine provisioned manually and managed by Ansible for
baseline config only (The control-node exception).
- The `offsite` TF environment's `vms` output feeds the `offsite_hosts` group via
`tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`);
the production inventory is a directory that merges `hosts.yml` (Proxmox) and
`offsite.yml` (offsite) (The off-site handoff).
- The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link
here rather than restating it (What was ruled out).