From 44dbd4628fce16c0a1a00bf43f4621f1976d1828 Mon Sep 17 00:00:00 2001
From: sjat
Date: Wed, 10 Jun 2026 14:41:24 +0200
Subject: [PATCH] docs(adr): restructure ADRs 006-009 to ADR-023 conformance

Add dated Status sections, a Decision umbrella over the existing topical
sections (demoted to ###), and Consequences assembled from each ADR's
already-stated implications. No decision substance changed.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/decisions/006-terraform.md            | 41 +++++++++++++++++----
 docs/decisions/007-network.md              | 41 +++++++++++++++++----
 docs/decisions/008-testing.md              | 40 ++++++++++++++++++---
 docs/decisions/009-provisioning-handoff.md | 42 ++++++++++++++++++----
 4 files changed, 138 insertions(+), 26 deletions(-)

diff --git a/docs/decisions/006-terraform.md b/docs/decisions/006-terraform.md
index bd8d5dd..7c77f66 100644
--- a/docs/decisions/006-terraform.md
+++ b/docs/decisions/006-terraform.md
@@ -1,5 +1,9 @@
 # ADR-006 — Terraform for infrastructure provisioning

+## Status
+
+Accepted (2026-05-30)
+
 ## Context

 Ansible manages host configuration well but has no state model for infrastructure
@@ -13,7 +17,9 @@ exact boundary, handoff pipeline, and data contract between them live in **ADR-0

 ---

-## Responsibility split
+## Decision
+
+### Responsibility split

 The canonical responsibility-split table lives in **ADR-009**. In short: Terraform
 owns VM existence only; Ansible owns everything inside a VM, including all internal
@@ -26,7 +32,7 @@ cadence, making them a poor fit for Terraform state.

 ---

-## Providers
+### Providers

 **`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
 full Proxmox 8 API support, and better cloud-init integration. This is the only
@@ -42,7 +48,7 @@ Terraform manages its own provider dependencies via `required_providers` and

 ---

-## State backend
+### State backend

 **Choice**: Local state on the control node.

@@ -59,7 +65,7 @@ integration boundary.

 ---

-## Structure
+### Structure

 ```
 terraform/
@@ -83,7 +89,7 @@ Each environment directory contains:

 ---

-## Secrets handling
+### Secrets handling

 The only secret input (the Proxmox API token) is passed via a `TF_VAR_*`
 environment variable and declared `sensitive = true` in `variables.tf`. It never
@@ -92,7 +98,7 @@ appears in `.tfvars` files. Non-secret configuration lives in tracked

 ---

-## Ansible integration
+### Ansible integration

 After `terraform apply`, run `make tf-inventory TF_ENV=` to regenerate
 `inventories//hosts.yml` from the `vms` output. The full handoff pipeline,
@@ -102,7 +108,7 @@ handoff)**.

 ---

-## What was ruled out
+### What was ruled out

 | Option | Reason |
 |---|---|
@@ -110,3 +116,24 @@ handoff)**.
 | OPNsense Terraform provider | Community-maintained; provider rot risk across OPNsense releases |
 | Terraform workspaces | Single state file with workspace prefix; accidental cross-env apply possible |
 | Separate Terraform repo | Cross-referencing between infra and config adds friction; monorepo keeps the full picture together |
+
+## Consequences
+
+Drawn from the "What was ruled out" section and the decisions stated above:
+
+- `bpg/proxmox` is the only provider; `telmate/proxmox` was ruled out for weaker
+  maintenance and Proxmox 8 / cloud-init support (Providers; What was ruled out).
+- OPNsense stays entirely in Ansible — no Terraform OPNsense provider — to avoid
+  community-provider rot across OPNsense releases (Responsibility split; What was
+  ruled out).
+- Terraform writes no DNS records; Ansible's `dns` role owns the entire internal
+  zone, avoiding the bootstrap cycle and split DNS ownership the earlier
+  `hashicorp/dns` design created (Providers).
+- State is local on the control node because Forgejo offers no usable HTTP state
+  backend; this is sufficient at solo-operator scale (no concurrent applies, no
+  remote locking), with a real backend such as MinIO/S3 to be added later if
+  warranted (State backend).
+- Separate environment directories are used instead of Terraform workspaces to
+  remove the risk of applying the wrong state (Structure; What was ruled out).
+- Terraform and Ansible internals are kept in one monorepo rather than a separate
+  Terraform repo to avoid cross-referencing friction (What was ruled out).
diff --git a/docs/decisions/007-network.md b/docs/decisions/007-network.md
index dd3b577..f9c1d0e 100644
--- a/docs/decisions/007-network.md
+++ b/docs/decisions/007-network.md
@@ -1,5 +1,9 @@
 # ADR-007 — Network topology and addressing

+## Status
+
+Accepted (2026-05-30)
+
 ## Context

 The boma homelab is a Proxmox cluster on a dedicated private network behind an
@@ -10,7 +14,9 @@ and OPNsense configuration.

 ---

-## Physical topology
+## Decision
+
+### Physical topology

 ```
 ISP
@@ -38,7 +44,7 @@ ISP

 ---

-## VLAN design
+### VLAN design

 | VLAN | Name | Subnet | Purpose |
 |---|---|---|---|
@@ -51,7 +57,7 @@ ISP

 ---

-## IP addressing
+### IP addressing

 ### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP

@@ -121,7 +127,7 @@ NetBird self-hosted on `askari`. NetBird manages its own overlay addressing

 ---

-## OPNsense firewall rules (intent)
+### OPNsense firewall rules (intent)

 | Source | Destination | Policy |
 |---|---|---|
@@ -142,7 +148,7 @@ IoT devices cannot initiate connections to `srv`.

 ---

-## Naming scheme
+### Naming scheme

 | Layer | Convention | Examples |
 |---|---|---|
@@ -155,7 +161,7 @@ IoT devices cannot initiate connections to `srv`.

 ---

-## DNS zones and split-horizon
+### DNS zones and split-horizon

 **Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
 The zone is rendered by the Ansible `dns` role: host A records come from the
@@ -175,7 +181,7 @@ All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).

 ---

-## External monitoring — askari
+### External monitoring — askari

 `askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also
 **hosts the self-hosted NetBird coordinator** (management/signal/relay).
 It reaches `srv`
@@ -186,3 +192,24 @@ ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
 be reachable even when the homelab is down (its entire purpose), which is also why
 the mesh coordinator lives here: an off-site control plane survives a homelab
 outage. FQDN: `askari.baobab.band`.
+
+---
+
+## Consequences
+
+Drawn from the implications already stated above:
+
+- VLAN 99 (`vpn`, `10.99.0.0/24`) is retired and the subnet freed; remote access is
+  carried by the self-hosted NetBird mesh instead of an OPNsense WireGuard subnet
+  (VLAN design; IP addressing — VLAN 99 retired).
+- Mesh-peer firewall allowances (to `srv` metrics ports and `mgmt`) are enforced by
+  NetBird ACLs, not OPNsense rules (OPNsense firewall rules (intent)).
+- IoT devices cannot initiate connections to `srv`; only Home Assistant at
+  `10.20.0.13` may reach the IoT VLAN, with OPNsense Avahi bridging `srv` ↔ `iot`
+  for discovery (OPNsense firewall rules (intent)).
+- Terraform writes no DNS records; the Ansible `dns` role renders the internal zone
+  from inventory plus `group_vars`, with `dns1`/`dns2` serving split-horizon answers
+  (DNS zones and split-horizon).
+- `askari` runs independently of the cluster so it survives a homelab outage, which
+  is why the off-site NetBird control plane lives there (External monitoring —
+  askari).
diff --git a/docs/decisions/008-testing.md b/docs/decisions/008-testing.md
index 5a915de..b6935e7 100644
--- a/docs/decisions/008-testing.md
+++ b/docs/decisions/008-testing.md
@@ -3,6 +3,10 @@
 > Practical point-of-use pitfalls (nft render checks, Molecule `community.docker`,
 > apply-path coverage blind spots) live in `docs/testing/gotchas.md`.

+## Status
+
+Accepted (2026-05-30)
+
 ## Context

 Ansible roles must be idempotent and correct before they touch production hosts.
@@ -11,7 +15,9 @@ This document records the testing strategy, what each level covers, and — crit

 ---

-## Three testing levels
+## Decision
+
+### Three testing levels

 ### Level 1 — Molecule (per role, always required)

@@ -78,7 +84,7 @@ deploy (STATUS.md). Full design: ADR-017.

 ---

-## Molecule test image
+### Molecule test image

 **No external images.** The project builds and hosts its own test image.

@@ -103,7 +109,7 @@ functionally equivalent and fully owned.

 ---

-## Idempotency requirements
+### Idempotency requirements

 Every role task must satisfy one of these:

@@ -121,7 +127,7 @@ catches anything lint misses.

 ---

-## What Molecule tests — and what it does not
+### What Molecule tests — and what it does not

 ### Tested in Molecule

@@ -161,7 +167,7 @@ Behavioural correctness is confirmed on staging.

 ---

-## CI pipeline
+### CI pipeline

 ```
 push to main
@@ -178,3 +184,27 @@ promote to production

 Manual gates are intentional. Automated tests prove correctness in isolation; a
 human confirms the change is safe to promote.
+
+---
+
+## Consequences
+
+Drawn from the limitations and trade-offs already stated above:
+
+- The Molecule idempotency step is non-negotiable; every role must pass it cleanly
+  (Three testing levels — Level 1).
+- A class of capabilities (nftables rule loading, NetBird mesh data plane,
+  unattended-upgrades behaviour, OPNsense DHCP, Avahi mDNS reflection, hardware
+  passthrough, corosync cluster formation) cannot be verified in Molecule and is
+  validated only at Level 2 (staging) or Level 3 (external) — a conscious,
+  documented decision, not a gap (What Molecule tests — and what it does not).
+- The project builds and hosts its own `molecule-debian13` image rather than relying
+  on an external Docker Hub image (e.g. geerlingguy), accepting the maintenance of a
+  custom image to avoid drift, disappearance, or unexpected changes outside project
+  control (Molecule test image).
+- Level 4 service-UI acceptance is authorable now but its execution is deferred,
+  pending `ubongo`, the `playwright` plugin, Authentik, and a staging deploy (Three
+  testing levels — Level 4).
+- Promotion to staging and to production stays behind intentional manual approval
+  gates; automation proves isolated correctness, a human confirms promotion safety
+  (CI pipeline).
diff --git a/docs/decisions/009-provisioning-handoff.md b/docs/decisions/009-provisioning-handoff.md
index abb0173..733edcd 100644
--- a/docs/decisions/009-provisioning-handoff.md
+++ b/docs/decisions/009-provisioning-handoff.md
@@ -1,5 +1,9 @@
 # ADR-009 — Terraform ↔ Ansible provisioning handoff

+## Status
+
+Accepted (2026-05-30)
+
 ## Context

 Two tools touch every managed host. Terraform owns **what exists** — VMs on
@@ -14,7 +18,9 @@ the cloud-init template that VMs are cloned from. This ADR covers how they conne

 ---

-## The boundary
+## Decision
+
+### The boundary

 | Layer | Tool | Notes |
 |---|---|---|
@@ -31,7 +37,7 @@ below).

 ---

-## The handoff pipeline
+### The handoff pipeline

 There is one path by which a managed host comes into existence and reaches its
 configured state:
@@ -55,7 +61,7 @@ this pipeline — **never** by hand-editing the inventory.

 ---

-## The data contract
+### The data contract

 The seam's interface is a single Terraform output consumed by a single script.

@@ -88,7 +94,7 @@ Terraform, and the inventory is regenerated, never edited.

 ---

-## Cloud-init's role
+### Cloud-init's role

 Cloud-init is the thin first-boot layer between Terraform and Ansible:

@@ -103,7 +109,7 @@ The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.

 ---

-## Internal DNS — owned by Ansible, no chicken-and-egg
+### Internal DNS — owned by Ansible, no chicken-and-egg

 Terraform writes **no** DNS records.
 The internal zone (`boma.baobab.band`) is rendered entirely by the Ansible `dns` role:
@@ -129,7 +135,7 @@ convention only — it no longer implies any difference in how records are writt

 ---

-## The control-node exception
+### The control-node exception

 The control node — the host that runs Terraform and Ansible — is `ubongo`, a
 dedicated **physical** machine outside the cluster. It is not a VM at all, so
@@ -146,7 +152,7 @@ Every other host is Terraform-managed.

 ---

-## What was ruled out
+### What was ruled out

 | Option | Reason |
 |---|---|
@@ -154,3 +160,25 @@ Every other host is Terraform-managed.
 | Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
 | Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
 | Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |
+
+## Consequences
+
+Drawn from the boundary, the data contract, and the "What was ruled out" section above:
+
+- Adding a host means editing `local.vms` and running the handoff pipeline; the
+  generated `hosts.yml` is a build artifact and must never be hand-edited — manual
+  edits are overwritten on the next `make tf-inventory` (The handoff pipeline; The
+  data contract; What was ruled out).
+- Manual `qm clone` is rejected as a general provisioning path so the inventory and
+  real infrastructure cannot drift; Terraform is the single way VMs come into
+  existence (What was ruled out).
+- Terraform writes no DNS records: the Ansible `dns` role renders the whole internal
+  zone from inventory plus `group_vars`, dissolving the bootstrap cycle a
+  Terraform-managed zone (`hashicorp/dns` + RFC 2136) would create (Internal DNS —
+  owned by Ansible, no chicken-and-egg; What was ruled out).
+- The control node (`ubongo`) is the single documented exception to "Terraform owns
+  VM existence" — a physical machine provisioned manually and managed by Ansible for
+  baseline config only; every other host is Terraform-managed (The control-node
+  exception).
+- The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link
+  here rather than restating it (What was ruled out).
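
Reviewer note, not part of the patch: ADR-009 describes the seam as a single Terraform `vms` output consumed by a single script that regenerates `hosts.yml`. The sketch below illustrates that idea only. It is not the project's `tf_to_inventory.py`, whose contents this patch does not show. The function name `vms_to_inventory`, the per-VM fields `ip` and `group`, the group name `dns`, and the sample addresses are all assumptions made for illustration.

```python
import json


def vms_to_inventory(vms: dict) -> dict:
    """Build an Ansible-style inventory mapping from a Terraform `vms` output.

    Assumed (hypothetical) input shape: {"<hostname>": {"ip": str, "group": str}}.
    The real output schema is defined in ADR-009 and is not reproduced here.
    """
    inventory = {"all": {"children": {}}}
    # Sort host names so that regenerating the inventory is deterministic.
    for name in sorted(vms):
        vm = vms[name]
        group = inventory["all"]["children"].setdefault(vm["group"], {"hosts": {}})
        group["hosts"][name] = {"ansible_host": vm["ip"]}
    return inventory


if __name__ == "__main__":
    # Hypothetical sample data; host names come from the ADRs, addresses are made up.
    sample = {
        "dns2": {"ip": "10.20.0.12", "group": "dns"},
        "dns1": {"ip": "10.20.0.11", "group": "dns"},
    }
    print(json.dumps(vms_to_inventory(sample), indent=2))
```

Because the inventory is derived purely from the output map, rerunning the conversion always overwrites any manual edit, which is the property the ADR relies on when it calls `hosts.yml` a build artifact. In a real pipeline the input would presumably come from something like `terraform output -json vms`, though how `make tf-inventory` actually invokes the script is not stated in this patch.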