boma/docs/decisions/009-provisioning-handoff.md

ADR-009 — Terraform ↔ Ansible provisioning handoff

Status

Accepted (2026-05-30)

Context

Two tools touch every managed host. Terraform owns what exists — VMs on Proxmox. Ansible owns what is configured inside — users, packages, firewall, Docker services, and all internal DNS. This ADR is the single source of truth for the seam between them: the exact handoff, the data contract, and the one documented exception. The two tools must never overlap; this document defines the line they meet at.

ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers the cloud-init template that VMs are cloned from. This ADR covers how they connect.


Decision

The boundary

| Layer | Tool | Notes |
| --- | --- | --- |
| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
| VM resolver (cloud-init) | Terraform | Sets which DNS servers a VM queries (not a zone record) |
| OS configuration | Ansible | Users, SSH, firewall, packages |
| Service deployment | Ansible | Docker, Compose files, secrets |
| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
| Internal DNS (all records) | Ansible (dns role) | Internal zone rendered from inventory + group_vars; see ADR-007 |

This table is canonical here. ADR-006 links to it rather than restating it. Terraform owns VM existence and the cloud-init resolver setting only; it writes no DNS records (see "Internal DNS" below).


The handoff pipeline

There is one path by which a managed host comes into existence and reaches its configured state:

make tf-plan TF_ENV=production       # review infrastructure changes
make tf-apply TF_ENV=production      # clone template → VM (no DNS records written)
make tf-inventory TF_ENV=production  # regenerate Ansible inventory from outputs
make check PLAYBOOK=site             # dry-run Ansible against the new host(s)
make deploy PLAYBOOK=bootstrap       # first-run specifics (see ADR-005)
make deploy PLAYBOOK=site            # full standard state — `dns` role writes the zone

tf-apply creates the VM by cloning the Debian 13 cloud-init template (ADR-005). tf-inventory regenerates the Ansible inventory from Terraform outputs. From make check onward the host is Ansible's — including its DNS record, which the dns role writes into the internal zone during make deploy.

Adding a host means editing local.vms in the environment's main.tf and running this pipeline. The inventory is never edited by hand.


The data contract

The seam's interface is a single Terraform output consumed by a single script.

Producer: terraform/environments/<env>/outputs.tf emits a vms map:

{
  "vms": {
    "value": {
      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
    }
  }
}

Consumer: scripts/tf_to_inventory.py (Python standard library only) reads terraform output -json and writes inventories/<env>/hosts.yml. It validates each host's group against the allowed set and fails loudly on an unknown group.

Valid groups: control, docker_hosts, proxmox_hosts, offsite_hosts.

control holds ubongo, a physical machine not managed by Terraform (see the control-node exception below and ADR-015). offsite_hosts holds askari, which is Terraform-managed via the hetznercloud/hcloud provider in the offsite environment (see the off-site handoff note below and ADR-016).

The generated hosts.yml carries a "do not edit manually" header and is owned by the generator. Treat it as a build artifact: the source of truth is local.vms in Terraform, and the inventory is regenerated, never edited.
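
The consumer side of this contract can be sketched as follows. This is a minimal illustration, not the actual scripts/tf_to_inventory.py, whose internals this ADR does not show. The function names, the header wording, and the all → children → group → hosts layout with ansible_host are assumptions. Only the vms map shape and the list of valid groups come from this document.

```python
import json

# Allowed groups, as listed in this ADR.
VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts", "offsite_hosts"}

# Hypothetical header wording; the real generator's text may differ.
HEADER = "# GENERATED FILE: do not edit manually; edit local.vms in Terraform instead."


def build_inventory(tf_output_json):
    """Turn `terraform output -json` text into a {group: {host: ip}} mapping."""
    vms = json.loads(tf_output_json)["vms"]["value"]
    inventory = {}
    for host, attrs in sorted(vms.items()):
        group = attrs["group"]
        if group not in VALID_GROUPS:
            # Fail loudly rather than silently dropping or misfiling the host.
            raise ValueError(f"{host}: unknown group {group!r}")
        inventory.setdefault(group, {})[host] = attrs["ip"]
    return inventory


def render_hosts_yml(inventory):
    """Render the mapping as YAML inventory text (the layout is an assumption)."""
    lines = [HEADER, "all:", "  children:"]
    for group in sorted(inventory):
        lines.append(f"    {group}:")
        lines.append("      hosts:")
        for host, ip in inventory[group].items():
            lines.append(f"        {host}:")
            lines.append(f"          ansible_host: {ip}")
    return "\n".join(lines) + "\n"
```

Fed the sample output shown earlier, build_inventory returns a single docker_hosts group containing host-a, and a host with a group outside the valid set raises ValueError before anything is written.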


Cloud-init's role

Cloud-init is the thin first-boot layer between Terraform and Ansible:

  • Terraform clones the cloud-init template (ADR-005) and sets cloud-init values (hostname, SSH public key, IP/gateway).
  • Cloud-init does just enough at first boot to make the VM reachable over SSH with the ansible user's key — nothing more.
  • Ansible takes over from a reachable host: the bootstrap playbook handles first-run specifics, then site applies the full standard state.

The line is sharp: cloud-init buys reachability, Ansible owns configuration.
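
"Reachable" here can be made concrete with a small check. The sketch below is one assumed way to wait for a freshly cloned VM to accept TCP connections on its SSH port before handing it to Ansible. The helper name wait_for_ssh and its defaults are hypothetical and not part of this repository.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=120.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    `host` should be an IP address, since DNS records may not exist yet.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves the port is open, not that
            # the ansible user's key is accepted.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A successful connect is a weaker guarantee than a working login. The first Ansible run remains the real test that cloud-init installed the key.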


Internal DNS — owned by Ansible, no chicken-and-egg

Terraform writes no DNS records. The internal zone (boma.baobab.band) is rendered entirely by the Ansible dns role:

  • Host A records derive from the inventory — the same hostname → ip data that originated in local.vms and reached Ansible via make tf-inventory. So Terraform remains the ultimate source of truth for which hosts exist; the data simply flows through the inventory instead of through a direct Terraform→DNS write.
  • Service, alias (CNAME), split-horizon, and non-VM records (e.g. the OPNsense gateway, vaultwarden.wingu.me → proxy split-horizon) are explicit zone data in group_vars.

This dissolves the bootstrap cycle that a Terraform-managed zone would create. If Terraform wrote records via RFC 2136, provisioning the first DNS server would require a DNS server that does not yet exist — dns1 cannot register its own A record before it is running and configured. Because Ansible renders the zone from inventory (using IP addresses, never name resolution, to connect), dns1/dns2 are ordinary Terraform-created VMs whose records are written by the same role that configures the DNS service. There is no special case and no ordering trap.
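
The flow from inventory to zone can be sketched as below. The real implementation is the Ansible dns role, which this ADR does not reproduce. The function names, the extra_records structure standing in for group_vars zone data, and the example addresses are all assumptions for illustration.

```python
def render_a_records(inventory, zone):
    """One A record per inventory host, across all groups.

    `inventory` is a {group: {host: ip}} mapping.
    """
    records = []
    for hosts in inventory.values():
        for host, ip in hosts.items():
            records.append(f"{host}.{zone}. IN A {ip}")
    return sorted(records)


def render_zone(inventory, extra_records, zone):
    """Host A records from inventory, then explicit records from group_vars.

    `extra_records` is a hypothetical list of {"name", "type", "value"} dicts.
    """
    lines = render_a_records(inventory, zone)
    for rec in extra_records:
        lines.append(f"{rec['name']}.{zone}. IN {rec['type']} {rec['value']}")
    return lines


# dns1 is rendered exactly like any other host: no special case.
inventory = {"docker_hosts": {"dns1": "192.168.1.11", "host-a": "192.168.1.10"}}
extra = [{"name": "gw", "type": "A", "value": "192.168.1.1"}]
zone_lines = render_zone(inventory, extra, "boma.baobab.band")
```

Nothing in this sketch resolves a name. Every input is an IP address already present in the inventory, which is why the first DNS server needs no pre-existing DNS.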

ADR-007 holds the zone structure, split-horizon, and addressing conventions. The IP-range split there (.10–.19 core infra vs .50–.249 fleet) is now an addressing convention only; it no longer implies any difference in how records are written.


The control-node exception

The control node, the host that runs Terraform and Ansible, is ubongo, a dedicated physical machine outside the cluster. It is not a VM at all, so Terraform never touches it: the tooling that runs on ubongo cannot provision the machine it runs on (chicken-and-egg). ubongo is therefore the single documented exception to "Terraform owns VM existence":

  • Provisioned and bootstrapped manually on bare metal, per the control-node section of ADR-005; rationale, hardware, and recovery model in ADR-015.
  • Listed in inventories/<env>/hosts.yml under the control group, and managed by Ansible for baseline config only (no docker_host role).

Every other host is Terraform-managed.


The off-site handoff (offsite environment → offsite_hosts)

askari (Hetzner VPS, ADR-016) follows the same handoff pipeline as Proxmox hosts but with its own provider and environment:

  • Producer: terraform/environments/offsite/outputs.tf emits a vms map in the same { host: { ip, group } } shape as Proxmox environments; askari's group is offsite_hosts.
  • Consumer: scripts/tf_to_inventory.py reads terraform output -json from the offsite environment and writes inventories/production/offsite.yml.
  • Makefile target: make tf-inventory-offsite runs the generator for the offsite environment.

The production inventory is a directory (inventories/production/) that Ansible merges at runtime: hosts.yml (Proxmox-generated) and offsite.yml (offsite-generated) together form the full production host list. Each file is a build artifact — never hand-edited; their source of truth is local.vms in the respective environment's main.tf.
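
The effect of the directory merge can be approximated as below. Ansible does the real merging when it loads inventories/production/. The function name and the example addresses are assumptions, and the sketch models only group and host union, not variable precedence.

```python
def merge_inventories(*sources):
    """Union several {group: {host: ip}} mappings; later sources win on conflict."""
    merged = {}
    for source in sources:
        for group, hosts in source.items():
            # Copy into fresh dicts so the inputs are never mutated.
            merged.setdefault(group, {}).update(hosts)
    return merged


# Stand-ins for the parsed contents of hosts.yml and offsite.yml.
hosts_yml = {"docker_hosts": {"host-a": "192.168.1.10"}}
offsite_yml = {"offsite_hosts": {"askari": "203.0.113.10"}}  # illustrative address
production = merge_inventories(hosts_yml, offsite_yml)
```

Because each file contributes disjoint groups in this setup, neither generator needs to know about the other's output.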


What was ruled out

| Option | Reason |
| --- | --- |
| Manual qm clone as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
| Hand-editing the generated inventory | hosts.yml is a build artifact of tf_to_inventory.py; edits are overwritten on the next make tf-inventory. Edit local.vms instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (hashicorp/dns + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead: one owner, no cycle. |

Consequences

Drawn from the boundary, the data contract, and the "What was ruled out" section above:

  • Adding a host means editing local.vms and running the handoff pipeline; the generated hosts.yml is a build artifact and must never be hand-edited — manual edits are overwritten on the next make tf-inventory (The handoff pipeline; The data contract; What was ruled out).
  • Manual qm clone is rejected as a general provisioning path so the inventory and real infrastructure cannot drift; Terraform is the single way VMs come into existence (What was ruled out).
  • Terraform writes no DNS records: the Ansible dns role renders the whole internal zone from inventory plus group_vars, dissolving the bootstrap cycle a Terraform-managed zone (hashicorp/dns + RFC 2136) would create (Internal DNS — owned by Ansible, no chicken-and-egg; What was ruled out).
  • The control node (ubongo) is the single documented exception to "Terraform owns VM existence" — a physical machine provisioned manually and managed by Ansible for baseline config only (The control-node exception).
  • The offsite TF environment's vms output feeds the offsite_hosts group via tf_to_inventory.py (make tf-inventory-offsite → inventories/production/offsite.yml); the production inventory is a directory that merges hosts.yml (Proxmox) and offsite.yml (offsite) (The off-site handoff).
  • The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link here rather than restating it (What was ruled out).