# ADR-009 — Terraform ↔ Ansible provisioning handoff
## Status
Accepted (2026-05-30)
## Context
Two tools touch every managed host. Terraform owns **what exists** — VMs on
Proxmox. Ansible owns **what is configured inside** — users, packages, firewall,
Docker services, and all internal DNS. This ADR is the single source of truth for
the seam between them: the exact handoff, the data contract, and the one documented
exception. The two tools must never overlap; this document defines the line they
meet at.
ADR-006 covers Terraform's internals (providers, state, structure). ADR-005 covers
the cloud-init template that VMs are cloned from. This ADR covers how they connect.
---
## Decision
### The boundary
| Layer | Tool | Notes |
|---|---|---|
| VM existence | Terraform | Create/destroy Proxmox VMs, assign static IPs |
| VM resolver (cloud-init) | Terraform | Sets *which* DNS servers a VM queries — not a zone record |
| OS configuration | Ansible | Users, SSH, firewall, packages |
| Service deployment | Ansible | Docker, Compose files, secrets |
| OPNsense (all) | Ansible | Firewall rules, DHCP, interfaces, VLANs |
| Internal DNS (all records) | Ansible (`dns` role) | Internal zone rendered from inventory + `group_vars`; see ADR-007 |
This table is canonical here. ADR-006 links to it rather than restating it.
Terraform owns VM **existence** only — it writes no DNS records (see "Internal DNS"
below).
---
### The handoff pipeline
There is one path by which a managed host comes into existence and reaches its
configured state:
```
make tf-plan TF_ENV=production # review infrastructure changes
make tf-apply TF_ENV=production # clone template → VM (no DNS records written)
make tf-inventory TF_ENV=production # regenerate Ansible inventory from outputs
make check PLAYBOOK=site # dry-run Ansible against the new host(s)
make deploy PLAYBOOK=bootstrap # first-run specifics (see ADR-005)
make deploy PLAYBOOK=site # full standard state — `dns` role writes the zone
```
`tf-apply` creates the VM by cloning the Debian 13 cloud-init template (ADR-005).
`tf-inventory` regenerates the Ansible inventory from Terraform outputs. From
`make check` onward the host is Ansible's — including its DNS record, which the
`dns` role writes into the internal zone during `make deploy`.
Add a host by editing `local.vms` in the environment's `main.tf` and running
this pipeline, **never** by hand-editing the inventory.
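The ordering of the pipeline can be sketched as a small wrapper. This is an illustration only, not part of the repository: the function names `pipeline_commands` and `run_pipeline` are hypothetical, and the wrapper defaults to a dry run because `tf-plan` exists so a human can review the plan before `tf-apply`.

```python
import subprocess


def pipeline_commands(tf_env: str) -> list[list[str]]:
    """Return the handoff steps in the order this ADR lists them."""
    return [
        ["make", "tf-plan", f"TF_ENV={tf_env}"],
        ["make", "tf-apply", f"TF_ENV={tf_env}"],
        ["make", "tf-inventory", f"TF_ENV={tf_env}"],
        ["make", "check", "PLAYBOOK=site"],
        ["make", "deploy", "PLAYBOOK=bootstrap"],
        ["make", "deploy", "PLAYBOOK=site"],
    ]


def run_pipeline(tf_env: str, dry_run: bool = True) -> None:
    """Print each step; execute it only when dry_run is False."""
    for cmd in pipeline_commands(tf_env):
        print("+", " ".join(cmd))
        if not dry_run:
            # check=True aborts the sequence at the first failing step,
            # so Ansible never runs against a half-applied Terraform change.
            subprocess.run(cmd, check=True)


run_pipeline("production")  # dry run: prints the six steps, runs nothing
```

The point of the sketch is the strict ordering: the inventory is regenerated only after a successful apply, and Ansible runs only against a freshly generated inventory.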
---
### The data contract
The seam's interface is a single Terraform output consumed by a single script.
**Producer**: `terraform/environments/<env>/outputs.tf` emits a `vms` map:
```json
{
  "vms": {
    "value": {
      "host-a": { "ip": "192.168.1.10", "group": "docker_hosts" }
    }
  }
}
```
**Consumer**: `scripts/tf_to_inventory.py` (Python standard library only) reads
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
group against the allowed set and fails loudly on an unknown group.
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
`control` holds `ubongo`, a physical machine not managed by Terraform (see the
control-node exception below and ADR-015). `offsite_hosts` holds `askari`, which is
Terraform-managed via the `hetznercloud/hcloud` provider in the `offsite` environment
(see the off-site handoff note below and ADR-016).
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
Terraform, and the inventory is regenerated, never edited.
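The contract above can be illustrated with a minimal sketch of the consumer side. It is not the actual `tf_to_inventory.py`: the function names, the exact header wording, and the YAML layout (`all` / `children` / `hosts` with `ansible_host`) are assumptions chosen to show the shape of the transformation using only the standard library.

```python
import json

# Allowed groups, as listed in this ADR.
VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts", "offsite_hosts"}

# Assumed wording; the real generator's header may differ.
HEADER = "# GENERATED FILE: do not edit manually; edit local.vms in Terraform."


def build_inventory(tf_output: dict) -> dict[str, dict[str, str]]:
    """Group hosts from the `vms` output into {group: {host: ip}}."""
    inventory: dict[str, dict[str, str]] = {}
    for host, attrs in tf_output["vms"]["value"].items():
        group = attrs["group"]
        if group not in VALID_GROUPS:
            # Fail loudly instead of writing an unknown group.
            raise ValueError(f"{host}: unknown group {group!r}")
        inventory.setdefault(group, {})[host] = attrs["ip"]
    return inventory


def render_hosts_yml(inventory: dict[str, dict[str, str]]) -> str:
    """Render a minimal YAML inventory by hand (no third-party YAML library)."""
    lines = [HEADER, "all:", "  children:"]
    for group in sorted(inventory):
        lines.append(f"    {group}:")
        lines.append("      hosts:")
        for host in sorted(inventory[group]):
            lines.append(f"        {host}:")
            lines.append(f"          ansible_host: {inventory[group][host]}")
    return "\n".join(lines) + "\n"


# Same sample document as the producer example above.
sample = json.loads(
    '{"vms": {"value": {"host-a": {"ip": "192.168.1.10", "group": "docker_hosts"}}}}'
)
print(render_hosts_yml(build_inventory(sample)), end="")
```

Validating the group before anything is written means a typo in `local.vms` stops the pipeline at `make tf-inventory` rather than producing a host that no playbook targets.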
---
### Cloud-init's role
Cloud-init is the thin first-boot layer between Terraform and Ansible:
- **Terraform** clones the cloud-init template (ADR-005) and sets cloud-init values
(hostname, SSH public key, IP/gateway).
- **Cloud-init** does just enough at first boot to make the VM reachable over SSH
with the ansible user's key — nothing more.
- **Ansible** takes over from a reachable host: the `bootstrap` playbook handles
first-run specifics, then `site` applies the full standard state.
The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.
---
### Internal DNS — owned by Ansible, no chicken-and-egg
Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
rendered entirely by the Ansible `dns` role:
- **Host A records** derive from the inventory — the same `hostname → ip` data that
originated in `local.vms` and reached Ansible via `make tf-inventory`. So Terraform
remains the ultimate source of truth for which hosts exist; the data simply flows
through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
gateway, `vaultwarden.wingu.me` → proxy split-horizon) are explicit zone data in
`group_vars`.
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
require a DNS server that does not yet exist — `dns1` cannot register its own A
record before it is running and configured. Because Ansible renders the zone from
inventory (using IP addresses, never name resolution, to connect), `dns1`/`dns2`
are ordinary Terraform-created VMs whose records are written by the same role that
configures the DNS service. There is no special case and no ordering trap.
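The data flow described above can be sketched in a few lines. The real `dns` role is an Ansible role, so this Python is only an illustration of its inputs and outputs; the function name, the record dictionary shape, and all addresses are hypothetical.

```python
def render_zone_records(
    hosts: dict[str, str], extra_records: list[dict[str, str]], zone: str
) -> list[str]:
    """Return zone-file style lines: inventory A records, then explicit records."""
    lines = []
    # Host A records come straight from the inventory's hostname -> ip data,
    # so no DNS lookup is needed to build them.
    for name in sorted(hosts):
        lines.append(f"{name}.{zone}. IN A {hosts[name]}")
    # Aliases, gateway, and split-horizon entries are explicit data,
    # standing in for what the ADR keeps in group_vars.
    for rec in extra_records:
        lines.append(f"{rec['name']}. IN {rec['type']} {rec['value']}")
    return lines


# Hypothetical addresses; dns1/dns2 are treated like any other inventory host.
hosts = {"dns1": "192.0.2.11", "dns2": "192.0.2.12"}
extra = [{"name": "gw.boma.baobab.band", "type": "A", "value": "192.0.2.1"}]
for line in render_zone_records(hosts, extra, "boma.baobab.band"):
    print(line)
```

Because every input is an IP address already present in the inventory, the records for `dns1` and `dns2` can be rendered before any resolver is running, which is why the bootstrap cycle disappears.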
ADR-007 holds the zone structure, split-horizon, and addressing conventions. The
IP-range split there (`.10`–`.19` core infra vs `.50`–`.249` fleet) is now an addressing
convention only; it no longer implies any difference in how records are written.
---
### The control-node exception
The control node, the host that runs Terraform and Ansible, is `ubongo`, a
dedicated **physical** machine outside the cluster. It is not a VM, so Terraform
never touches it: Terraform runs on `ubongo`, and a tool cannot provision the
machine it needs in order to run (chicken-and-egg). It is therefore the single
documented exception to "Terraform owns VM existence":
- Provisioned and bootstrapped manually on bare metal, per the control-node section
of ADR-005; rationale, hardware, and recovery model in ADR-015.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
Ansible for baseline config only (no `docker_host` role).
Every other host is Terraform-managed.
---
### The off-site handoff (`offsite` environment → `offsite_hosts`)
`askari` (Hetzner VPS, ADR-016) follows the same handoff pipeline as Proxmox hosts but
with its own provider and environment:
- **Producer** — `terraform/environments/offsite/outputs.tf` emits a `vms` map in the
same `{ host: { ip, group } }` shape as Proxmox environments; `askari`'s group is
`offsite_hosts`.
- **Consumer** — `scripts/tf_to_inventory.py` reads `terraform output -json` from the
`offsite` environment and writes `inventories/production/offsite.yml`.
- **Makefile target** — `make tf-inventory-offsite` runs the generator for the offsite
environment.
The production inventory is a **directory** (`inventories/production/`) that Ansible
merges at runtime: `hosts.yml` (Proxmox-generated) and `offsite.yml`
(offsite-generated) together form the full production host list. Each file is a build
artifact — never hand-edited; their source of truth is `local.vms` in the respective
environment's `main.tf`.
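The effect of the directory merge can be illustrated as a union of the two generated files. Ansible performs the merge itself at runtime; the sketch below only models the resulting host list, and the duplicate-host check is an added assumption (a possible pre-flight sanity check), not Ansible behaviour. The function name and `askari`'s address are hypothetical.

```python
def merge_inventories(
    *sources: dict[str, dict[str, str]]
) -> dict[str, dict[str, str]]:
    """Union several {group: {host: ip}} maps, rejecting duplicate hosts."""
    merged: dict[str, dict[str, str]] = {}
    seen: dict[str, str] = {}  # host -> group it was first seen in
    for source in sources:
        for group, hosts in source.items():
            for host, ip in hosts.items():
                if host in seen:
                    # Each host should originate from exactly one environment.
                    raise ValueError(
                        f"{host} defined twice ({seen[host]} and {group})"
                    )
                seen[host] = group
                merged.setdefault(group, {})[host] = ip
    return merged


proxmox = {"docker_hosts": {"host-a": "192.168.1.10"}}   # shape of hosts.yml
offsite = {"offsite_hosts": {"askari": "203.0.113.10"}}  # hypothetical address
print(merge_inventories(proxmox, offsite))
```

Keeping one generated file per Terraform environment means each generator run rewrites only its own file, so regenerating the offsite inventory cannot clobber the Proxmox host list.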
---
### What was ruled out
| Option | Reason |
|---|---|
| Manual `qm clone` as a general provisioning path | Terraform is the single way VMs come into existence; a parallel manual path would let the inventory and real infrastructure drift. The sole exception is the control node. |
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |
## Consequences
Drawn from the boundary, the data contract, and the "What was ruled out" section above:
- Adding a host means editing `local.vms` and running the handoff pipeline; the
generated `hosts.yml` is a build artifact and must never be hand-edited — manual
edits are overwritten on the next `make tf-inventory` (The handoff pipeline; The
data contract; What was ruled out).
- Manual `qm clone` is rejected as a general provisioning path so the inventory and
real infrastructure cannot drift; Terraform is the single way VMs come into
existence (What was ruled out).
- Terraform writes no DNS records: the Ansible `dns` role renders the whole internal
zone from inventory plus `group_vars`, dissolving the bootstrap cycle a
Terraform-managed zone (`hashicorp/dns` + RFC 2136) would create (Internal DNS —
owned by Ansible, no chicken-and-egg; What was ruled out).
- The control node (`ubongo`) is the single documented exception to "Terraform owns
VM existence" — a physical machine provisioned manually and managed by Ansible for
baseline config only (The control-node exception).
- The `offsite` TF environment's `vms` output feeds the `offsite_hosts` group via
`tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`);
the production inventory is a directory that merges `hosts.yml` (Proxmox) and
`offsite.yml` (offsite) (The off-site handoff).
- The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link
here rather than restating it (What was ruled out).