docs(askari): amend ADR-006/009/020/007/016 for TF-provisioned offsite host; STATUS (apply pending)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-14 12:09:20 +02:00
parent fd86ec6848
commit 3588904528
6 changed files with 71 additions and 20 deletions


@ -5,7 +5,7 @@ This repo is partly aspirational: the ADRs in `docs/decisions/` describe the
truth. **Before relying on a role, provider, or pipeline existing, check here.** truth. **Before relying on a role, provider, or pipeline existing, check here.**
If something is listed as "designed, not built", do not assume it works. If something is listed as "designed, not built", do not assume it works.
_Last reviewed: 2026-06-11._ _Last reviewed: 2026-06-14._
## Real and working today ## Real and working today
@ -20,7 +20,7 @@ _Last reviewed: 2026-06-11._
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. | | Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. |
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. | | Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). | | `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below | | Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below. Offsite env also written — see "Designed but not built". |
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON | | `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) | | `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review | | ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
@ -50,7 +50,8 @@ applying `dev_env` via `playbooks/workstation.yml`.)
| Thing | Designed in | Notes | | Thing | Designed in | Notes |
|---|---|---| |---|---|---|
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. | | `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
| Terraform actually provisioning | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries | | Terraform actually provisioning (Proxmox) | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
| `terraform/{modules/hetzner_vm, environments/offsite}` (askari) | ADR-006 (amended) | **Written, not yet applied.** Terraform owns askari's existence (hcloud provider, CAX11/hel1/debian-13, cloud-init `ansible` user, Hetzner Cloud Firewall SSH-from-ubongo). Makefile token-injection + directory inventory + `tf-inventory-offsite` handoff wired; offsite-handoff pytest green. **Pending:** `terraform init/plan/apply` (run on ubongo — creates a billed VPS) + bootstrap. M2 of the roadmap. |
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented | | CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented |
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet | | Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
| Per-service roles | ADR-004 | Model defined; no service roles built | | Per-service roles | ADR-004 | Model defined; no service roles built |


@ -8,7 +8,7 @@ Accepted (2026-05-30)
Ansible manages host configuration well but has no state model for infrastructure Ansible manages host configuration well but has no state model for infrastructure
existence. Adding Terraform handles the "what exists" layer — creating and destroying existence. Adding Terraform handles the "what exists" layer — creating and destroying
VMs on Proxmox — while Ansible continues to own everything that runs inside them, VMs on Proxmox and Hetzner — while Ansible continues to own everything that runs inside them,
including all internal DNS records. including all internal DNS records.
This complements rather than replaces Ansible. The two tools do not overlap. The This complements rather than replaces Ansible. The two tools do not overlap. The
@ -35,8 +35,13 @@ cadence, making them a poor fit for Terraform state.
### Providers ### Providers
**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance, **`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
full Proxmox 8 API support, and better cloud-init integration. This is the only full Proxmox 8 API support, and better cloud-init integration. This is the provider
provider. for Proxmox VMs.
**`hetznercloud/hcloud` (`~> 1.65`)**: owns off-site VM existence (`askari`). ADR-006's
scope is now **Proxmox + Hetzner** — "Terraform owns VM existence" generalizes across
providers. The `offsite` environment + `hetzner_vm` module live alongside the Proxmox env
+ `proxmox_vm` module; each environment has its own local state.
Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136) Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
to write A records, but that created a bootstrap cycle — the first DNS server cannot to write A records, but that created a bootstrap cycle — the first DNS server cannot
@ -71,9 +76,11 @@ integration boundary.
terraform/ terraform/
modules/ modules/
proxmox_vm/ # reusable VM module — Proxmox only, no DNS proxmox_vm/ # reusable VM module — Proxmox only, no DNS
hetzner_vm/ # reusable VM module — Hetzner Cloud, no DNS
environments/ environments/
staging/ # staging VMs, separate state file staging/ # staging Proxmox VMs, separate state file
production/ # production VMs, separate state file production/ # production Proxmox VMs, separate state file
offsite/ # off-site Hetzner VMs (askari), separate state file
``` ```
Separate environment directories (not Terraform workspaces) for the clearest Separate environment directories (not Terraform workspaces) for the clearest
@ -121,8 +128,10 @@ handoff)**.
Drawn from the "What was ruled out" section and the decisions stated above: Drawn from the "What was ruled out" section and the decisions stated above:
- `bpg/proxmox` is the only provider; `telmate/proxmox` was ruled out for weaker - `bpg/proxmox` is the provider for Proxmox VMs; `telmate/proxmox` was ruled out for weaker
maintenance and Proxmox 8 / cloud-init support (Providers; What was ruled out). maintenance and Proxmox 8 / cloud-init support (Providers; What was ruled out).
- `hetznercloud/hcloud` is the provider for off-site VM existence (`askari`); ADR-006's
scope now covers Proxmox + Hetzner (Providers).
- OPNsense stays entirely in Ansible — no Terraform OPNsense provider — to avoid - OPNsense stays entirely in Ansible — no Terraform OPNsense provider — to avoid
community-provider rot across OPNsense releases (Responsibility split; What was community-provider rot across OPNsense releases (Responsibility split; What was
ruled out). ruled out).


@ -195,9 +195,11 @@ the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing. ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
`askari` is provisioned and managed independently of the Proxmox cluster — it must `askari` is provisioned as **Terraform IaC** (`hetznercloud/hcloud`), managed
be reachable even when the homelab is down (its entire purpose), which is also why independently of the Proxmox cluster (its own provider + local state in
the mesh coordinator lives here: an off-site control plane survives a homelab outage. `terraform/environments/offsite/`). It must be reachable even when the homelab is down
(its entire purpose), which is also why the mesh coordinator lives here: an off-site
control plane survives a homelab outage.
FQDN: `askari.wingu.me` (off-site tier; record added by `public_dns` when askari exists — M2/M4). FQDN: `askari.wingu.me` (off-site tier; record added by `public_dns` when askari exists — M2/M4).
--- ---


@ -83,10 +83,10 @@ group against the allowed set and fails loudly on an unknown group.
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`. **Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
`control` and `offsite_hosts` are not produced by Terraform — they hold manually `control` holds `ubongo`, a physical machine not managed by Terraform (see the
provisioned hosts (`ubongo` and `askari` respectively) added to the inventory by hand control-node exception below and ADR-015). `offsite_hosts` holds `askari`, which is
(see the control-node exception below and ADR-015/ADR-016). They are valid groups so Terraform-managed via the `hetznercloud/hcloud` provider in the `offsite` environment
the generated `hosts.yml` carries their (otherwise empty) sections. (see the off-site handoff note below and ADR-016).
The generated `hosts.yml` carries a "do not edit manually" header and is owned by The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in the generator. Treat it as a build artifact: the source of truth is `local.vms` in
@ -152,6 +152,27 @@ Every other host is Terraform-managed.
--- ---
### The off-site handoff (`offsite` environment → `offsite_hosts`)
`askari` (Hetzner VPS, ADR-016) follows the same handoff pipeline as Proxmox hosts but
with its own provider and environment:
- **Producer:** `terraform/environments/offsite/outputs.tf` emits a `vms` map in the
same `{ host: { ip, group } }` shape as Proxmox environments; `askari`'s group is
`offsite_hosts`.
- **Consumer:** `scripts/tf_to_inventory.py` reads `terraform output -json` from the
`offsite` environment and writes `inventories/production/offsite.yml`.
- **Makefile target:** `make tf-inventory-offsite` runs the generator for the offsite
environment.
The production inventory is a **directory** (`inventories/production/`) that Ansible
merges at runtime: `hosts.yml` (Proxmox-generated) and `offsite.yml`
(offsite-generated) together form the full production host list. Each file is a build
artifact — never hand-edited; their source of truth is `local.vms` in the respective
environment's `main.tf`.
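
As a rough illustration of the handoff contract described in this section, the sketch below turns a `terraform output -json` payload containing a `vms` map into an Ansible-style inventory dictionary and rejects unknown groups. It is not the contents of `scripts/tf_to_inventory.py`: the function name `build_inventory`, the sample IP, and the choice of `ansible_host` as the per-host variable are assumptions made for the example. Only the `{ host: { ip, group } }` shape and the list of valid groups come from the ADR text.

```python
import json

# Valid groups as listed in ADR-009. Everything else in this block is a
# hypothetical sketch, not the real generator.
VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts", "offsite_hosts"}


def build_inventory(tf_output_json: str) -> dict:
    """Turn `terraform output -json` text into an Ansible-style inventory dict."""
    outputs = json.loads(tf_output_json)
    # `terraform output -json` wraps every output in an object whose
    # "value" key holds the actual data.
    vms = outputs["vms"]["value"]

    children: dict = {}
    for host, attrs in sorted(vms.items()):
        group = attrs["group"]
        if group not in VALID_GROUPS:
            # Fail loudly on an unknown group rather than emit a bad inventory.
            raise ValueError(f"unknown group {group!r} for host {host!r}")
        group_entry = children.setdefault(group, {"hosts": {}})
        group_entry["hosts"][host] = {"ansible_host": attrs["ip"]}
    return {"all": {"children": children}}


# Sample payload; 203.0.113.10 is a documentation-range placeholder address.
sample = json.dumps(
    {
        "vms": {
            "sensitive": False,
            "value": {"askari": {"ip": "203.0.113.10", "group": "offsite_hosts"}},
        }
    }
)
inventory = build_inventory(sample)
print(json.dumps(inventory, indent=2))
```

A real generator would presumably serialize this structure to YAML and prepend the "do not edit manually" header; that step is left out here to keep the sketch free of third-party dependencies.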
---
### What was ruled out ### What was ruled out
| Option | Reason | | Option | Reason |
@ -178,7 +199,10 @@ Drawn from the boundary, the data contract, and the "What was ruled out" section
owned by Ansible, no chicken-and-egg; What was ruled out). owned by Ansible, no chicken-and-egg; What was ruled out).
- The control node (`ubongo`) is the single documented exception to "Terraform owns - The control node (`ubongo`) is the single documented exception to "Terraform owns
VM existence" — a physical machine provisioned manually and managed by Ansible for VM existence" — a physical machine provisioned manually and managed by Ansible for
baseline config only; every other host is Terraform-managed (The control-node baseline config only (The control-node exception).
exception). - The `offsite` TF environment's `vms` output feeds the `offsite_hosts` group via
`tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`);
the production inventory is a directory that merges `hosts.yml` (Proxmox) and
`offsite.yml` (offsite) (The off-site handoff).
- The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link - The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link
here rather than restating it (What was ruled out). here rather than restating it (What was ruled out).


@ -81,8 +81,9 @@ allocated for it.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage. - **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is backed up encrypted off `askari` (synced to NetBird's management datastore is backed up encrypted off `askari` (synced to
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage. `ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group `offsite_hosts` (added - **`askari` is Ansible-managed:** its own inventory group `offsite_hosts` — provisioned
manually like the control node — it is not Terraform-managed), `base` role, plus a as **Terraform IaC** (`hetznercloud/hcloud`), managed independently of the Proxmox
cluster (its own provider + local state). Ansible configuration: `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are `SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for version-pinned (ADR-011). boma's `dns` role stays authoritative for


@ -84,6 +84,20 @@ This was chosen over a single connectivity-model-generates-both (too much machin
tight coupling of two very different rule domains) and over fully independent per-layer tight coupling of two very different rule domains) and over fully independent per-layer
declarations (real drift risk). declarations (real drift risk).
### Off-cluster hosts — `askari` (Hetzner)
`askari` sits outside the Proxmox cluster and has no OPNsense. Its **perimeter** layer
is a TF-managed **Hetzner Cloud Firewall** (declared in `terraform/environments/offsite/`)
alongside the VM itself. Current rule set (M2): SSH inbound from `ubongo`'s public IP
only. NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator
role is built.
The `group_vars` service catalog remains authoritative for `askari`'s **host nftables**
layer — the same two-layer model applies, with Hetzner Cloud Firewall substituting for
OPNsense at the perimeter.
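
To make the perimeter rule set easier to reason about, here is a small model of the M2 and planned M4 rules with a helper that checks whether a given connection would be admitted. This is a sketch under assumptions, not the Terraform declaration. The `Rule` class, the `allowed` helper, and the placeholder address for `ubongo` are hypothetical, and opening the NetBird ports to any source is an assumption the ADR does not state. The ports themselves (TCP 22 for SSH, UDP 3478, TCP 80/443) follow the text above.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """One inbound allow rule: protocol, port, and permitted source CIDRs."""
    protocol: str
    port: int
    sources: tuple[str, ...]


# Placeholder for ubongo's public IP (documentation range, not the real address).
UBONGO_CIDR = "203.0.113.5/32"
# Assumption: the NetBird ports would be open to any source in M4.
ANYWHERE = ("0.0.0.0/0", "::/0")

# M2: SSH inbound from ubongo only.
M2_RULES = (Rule("tcp", 22, (UBONGO_CIDR,)),)
# M4 (planned): M2 plus the NetBird coordinator ports.
M4_RULES = M2_RULES + (
    Rule("udp", 3478, ANYWHERE),
    Rule("tcp", 80, ANYWHERE),
    Rule("tcp", 443, ANYWHERE),
)


def allowed(rules, protocol: str, port: int, source: str) -> bool:
    """Return True if any rule admits this protocol/port from this source IP."""
    addr = ip_address(source)
    for rule in rules:
        if rule.protocol != protocol or rule.port != port:
            continue
        for cidr in rule.sources:
            net = ip_network(cidr)
            # Compare only within the same IP version.
            if addr.version == net.version and addr in net:
                return True
    return False


print(allowed(M2_RULES, "tcp", 22, "203.0.113.5"))
print(allowed(M2_RULES, "tcp", 443, "198.51.100.7"))
print(allowed(M4_RULES, "tcp", 443, "198.51.100.7"))
```

Anything not matched by a rule is treated as denied, which mirrors the default-deny behaviour expected of an inbound perimeter firewall.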
---
### OPNsense automation — owned here, mechanism deferred ### OPNsense automation — owned here, mechanism deferred
OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; no Terraform OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; no Terraform