docs(askari): amend ADR-006/009/020/007/016 for TF-provisioned offsite host; STATUS (apply pending)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent fd86ec6848
commit 3588904528
6 changed files with 71 additions and 20 deletions
|
|
@ -5,7 +5,7 @@ This repo is partly aspirational: the ADRs in `docs/decisions/` describe the
|
||||||
truth. **Before relying on a role, provider, or pipeline existing, check here.**
|
truth. **Before relying on a role, provider, or pipeline existing, check here.**
|
||||||
If something is listed as "designed, not built", do not assume it works.
|
If something is listed as "designed, not built", do not assume it works.
|
||||||
|
|
||||||
_Last reviewed: 2026-06-11._
|
_Last reviewed: 2026-06-14._
|
||||||
|
|
||||||
## Real and working today
|
## Real and working today
|
||||||
|
|
||||||
|
|
@ -20,7 +20,7 @@ _Last reviewed: 2026-06-11._
|
||||||
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. |
|
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. |
|
||||||
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
|
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
|
||||||
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
|
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
|
||||||
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
|
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below. Offsite env also written — see "Designed but not built". |
|
||||||
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
|
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
|
||||||
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
|
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
|
||||||
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
|
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
|
||||||
|
|
@ -50,7 +50,8 @@ applying `dev_env` via `playbooks/workstation.yml`.)
|
||||||
| Thing | Designed in | Notes |
|
| Thing | Designed in | Notes |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
|
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
|
||||||
| Terraform actually provisioning | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
|
| Terraform actually provisioning (Proxmox) | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
|
||||||
|
| `terraform/{modules/hetzner_vm, environments/offsite}` (askari) | ADR-006 (amended) | **Written, not yet applied.** Terraform owns askari's existence (hcloud provider, CAX11/hel1/debian-13, cloud-init `ansible` user, Hetzner Cloud Firewall SSH-from-ubongo). Makefile token-injection + directory inventory + `tf-inventory-offsite` handoff wired; offsite-handoff pytest green. **Pending:** `terraform init/plan/apply` (run on ubongo — creates a billed VPS) + bootstrap. M2 of the roadmap. |
|
||||||
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented |
|
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented |
|
||||||
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
|
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
|
||||||
| Per-service roles | ADR-004 | Model defined; no service roles built |
|
| Per-service roles | ADR-004 | Model defined; no service roles built |
|
||||||
|
|
|
||||||
|
|
@ -8,7 +8,7 @@ Accepted (2026-05-30)
|
||||||
|
|
||||||
Ansible manages host configuration well but has no state model for infrastructure
|
Ansible manages host configuration well but has no state model for infrastructure
|
||||||
existence. Adding Terraform handles the "what exists" layer — creating and destroying
|
existence. Adding Terraform handles the "what exists" layer — creating and destroying
|
||||||
VMs on Proxmox — while Ansible continues to own everything that runs inside them,
|
VMs on Proxmox and Hetzner — while Ansible continues to own everything that runs inside them,
|
||||||
including all internal DNS records.
|
including all internal DNS records.
|
||||||
|
|
||||||
This complements rather than replaces Ansible. The two tools do not overlap. The
|
This complements rather than replaces Ansible. The two tools do not overlap. The
|
||||||
|
|
@ -35,8 +35,13 @@ cadence, making them a poor fit for Terraform state.
|
||||||
### Providers
|
### Providers
|
||||||
|
|
||||||
**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
|
**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
|
||||||
full Proxmox 8 API support, and better cloud-init integration. This is the only
|
full Proxmox 8 API support, and better cloud-init integration. This is the provider
|
||||||
provider.
|
for Proxmox VMs.
|
||||||
|
|
||||||
|
**`hetznercloud/hcloud` (`~> 1.65`)**: owns off-site VM existence (`askari`). ADR-006's
|
||||||
|
scope is now **Proxmox + Hetzner** — "Terraform owns VM existence" generalizes across
|
||||||
|
providers. The `offsite` environment + `hetzner_vm` module live alongside the Proxmox env
|
||||||
|
+ `proxmox_vm` module; each environment has its own local state.
|
||||||
|
|
||||||
Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
|
Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
|
||||||
to write A records, but that created a bootstrap cycle — the first DNS server cannot
|
to write A records, but that created a bootstrap cycle — the first DNS server cannot
|
||||||
|
|
@ -71,9 +76,11 @@ integration boundary.
|
||||||
terraform/
|
terraform/
|
||||||
modules/
|
modules/
|
||||||
proxmox_vm/ # reusable VM module — Proxmox only, no DNS
|
proxmox_vm/ # reusable VM module — Proxmox only, no DNS
|
||||||
|
hetzner_vm/ # reusable VM module — Hetzner Cloud, no DNS
|
||||||
environments/
|
environments/
|
||||||
staging/ # staging VMs, separate state file
|
staging/ # staging Proxmox VMs, separate state file
|
||||||
production/ # production VMs, separate state file
|
production/ # production Proxmox VMs, separate state file
|
||||||
|
offsite/ # off-site Hetzner VMs (askari), separate state file
|
||||||
```
|
```
|
||||||
|
|
||||||
Separate environment directories (not Terraform workspaces) for the clearest
|
Separate environment directories (not Terraform workspaces) for the clearest
|
||||||
|
|
@ -121,8 +128,10 @@ handoff)**.
|
||||||
|
|
||||||
Drawn from the "What was ruled out" section and the decisions stated above:
|
Drawn from the "What was ruled out" section and the decisions stated above:
|
||||||
|
|
||||||
- `bpg/proxmox` is the only provider; `telmate/proxmox` was ruled out for weaker
|
- `bpg/proxmox` is the provider for Proxmox VMs; `telmate/proxmox` was ruled out for weaker
|
||||||
maintenance and Proxmox 8 / cloud-init support (Providers; What was ruled out).
|
maintenance and Proxmox 8 / cloud-init support (Providers; What was ruled out).
|
||||||
|
- `hetznercloud/hcloud` is the provider for off-site VM existence (`askari`); ADR-006's
|
||||||
|
scope now covers Proxmox + Hetzner (Providers).
|
||||||
- OPNsense stays entirely in Ansible — no Terraform OPNsense provider — to avoid
|
- OPNsense stays entirely in Ansible — no Terraform OPNsense provider — to avoid
|
||||||
community-provider rot across OPNsense releases (Responsibility split; What was
|
community-provider rot across OPNsense releases (Responsibility split; What was
|
||||||
ruled out).
|
ruled out).
|
||||||
|
|
|
||||||
|
|
@ -195,9 +195,11 @@ the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv
|
||||||
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
|
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
|
||||||
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
|
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
|
||||||
|
|
||||||
`askari` is provisioned and managed independently of the Proxmox cluster — it must
|
`askari` is provisioned as **Terraform IaC** (`hetznercloud/hcloud`), managed
|
||||||
be reachable even when the homelab is down (its entire purpose), which is also why
|
independently of the Proxmox cluster (its own provider + local state in
|
||||||
the mesh coordinator lives here: an off-site control plane survives a homelab outage.
|
`terraform/environments/offsite/`). It must be reachable even when the homelab is down
|
||||||
|
(its entire purpose), which is also why the mesh coordinator lives here: an off-site
|
||||||
|
control plane survives a homelab outage.
|
||||||
FQDN: `askari.wingu.me` (off-site tier; record added by `public_dns` when askari exists — M2/M4).
|
FQDN: `askari.wingu.me` (off-site tier; record added by `public_dns` when askari exists — M2/M4).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
|
||||||
|
|
@ -83,10 +83,10 @@ group against the allowed set and fails loudly on an unknown group.
|
||||||
|
|
||||||
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
|
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
|
||||||
|
|
||||||
`control` and `offsite_hosts` are not produced by Terraform — they hold manually
|
`control` holds `ubongo`, a physical machine not managed by Terraform (see the
|
||||||
provisioned hosts (`ubongo` and `askari` respectively) added to the inventory by hand
|
control-node exception below and ADR-015). `offsite_hosts` holds `askari`, which is
|
||||||
(see the control-node exception below and ADR-015/ADR-016). They are valid groups so
|
Terraform-managed via the `hetznercloud/hcloud` provider in the `offsite` environment
|
||||||
the generated `hosts.yml` carries their (otherwise empty) sections.
|
(see the off-site handoff note below and ADR-016).
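The allowed-group check mentioned above (validate each host's group against the valid set and fail loudly on an unknown one) can be sketched as follows. This is a minimal illustration, not the code in `scripts/tf_to_inventory.py`: the names `VALID_GROUPS` and `validate_groups` are hypothetical, and the input is assumed to be the `{ host: { ip, group } }` map described in this ADR.

```python
# Hypothetical sketch of the group check; names are illustrative only.
VALID_GROUPS = frozenset(
    {"control", "docker_hosts", "proxmox_hosts", "offsite_hosts"}
)


def validate_groups(vms: dict[str, dict[str, str]]) -> None:
    """Raise ValueError listing every host whose group is not allowed."""
    bad = {
        host: spec.get("group")
        for host, spec in vms.items()
        if spec.get("group") not in VALID_GROUPS
    }
    if bad:
        details = ", ".join(f"{h} -> {g!r}" for h, g in sorted(bad.items()))
        raise ValueError(f"unknown inventory group(s): {details}")
```

Collecting all offending hosts before raising means a single run reports every bad entry instead of stopping at the first one.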
|
||||||
|
|
||||||
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
|
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
|
||||||
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
|
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
|
||||||
|
|
@ -152,6 +152,27 @@ Every other host is Terraform-managed.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
### The off-site handoff (`offsite` environment → `offsite_hosts`)
|
||||||
|
|
||||||
|
`askari` (Hetzner VPS, ADR-016) follows the same handoff pipeline as Proxmox hosts but
|
||||||
|
with its own provider and environment:
|
||||||
|
|
||||||
|
- **Producer** — `terraform/environments/offsite/outputs.tf` emits a `vms` map in the
|
||||||
|
same `{ host: { ip, group } }` shape as Proxmox environments; `askari`'s group is
|
||||||
|
`offsite_hosts`.
|
||||||
|
- **Consumer** — `scripts/tf_to_inventory.py` reads `terraform output -json` from the
|
||||||
|
`offsite` environment and writes `inventories/production/offsite.yml`.
|
||||||
|
- **Makefile target** — `make tf-inventory-offsite` runs the generator for the offsite
|
||||||
|
environment.
|
||||||
|
|
||||||
|
The production inventory is a **directory** (`inventories/production/`) that Ansible
|
||||||
|
merges at runtime: `hosts.yml` (Proxmox-generated) and `offsite.yml`
|
||||||
|
(offsite-generated) together form the full production host list. Each file is a build
|
||||||
|
artifact — never hand-edited; their source of truth is `local.vms` in the respective
|
||||||
|
environment's `main.tf`.
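The producer/consumer handoff above can be sketched as a small transform from `terraform output -json` to an inventory structure. This is a hedged illustration under stated assumptions, not the actual `scripts/tf_to_inventory.py`: the function name `vms_to_inventory` is hypothetical, the output is assumed to use the standard Ansible YAML inventory layout (`all` → `children` → group → `hosts`), and serializing the result to `offsite.yml` is left out.

```python
import json


def vms_to_inventory(tf_output_json: str) -> dict:
    """Turn `terraform output -json` text into an Ansible-style inventory dict.

    Assumes an output named `vms` shaped as { host: { ip, group } }.
    """
    outputs = json.loads(tf_output_json)
    # `terraform output -json` wraps each output in an object with a "value" key.
    vms = outputs["vms"]["value"]

    inventory: dict = {"all": {"children": {}}}
    for host, spec in sorted(vms.items()):
        group = inventory["all"]["children"].setdefault(
            spec["group"], {"hosts": {}}
        )
        group["hosts"][host] = {"ansible_host": spec["ip"]}
    return inventory
```

Because each environment produces its own file, Ansible can merge `hosts.yml` and `offsite.yml` from the inventory directory without either generator needing to know about the other.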
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
### What was ruled out
|
### What was ruled out
|
||||||
|
|
||||||
| Option | Reason |
|
| Option | Reason |
|
||||||
|
|
@ -178,7 +199,10 @@ Drawn from the boundary, the data contract, and the "What was ruled out" section
|
||||||
owned by Ansible, no chicken-and-egg; What was ruled out).
|
owned by Ansible, no chicken-and-egg; What was ruled out).
|
||||||
- The control node (`ubongo`) is the single documented exception to "Terraform owns
|
- The control node (`ubongo`) is the single documented exception to "Terraform owns
|
||||||
VM existence" — a physical machine provisioned manually and managed by Ansible for
|
VM existence" — a physical machine provisioned manually and managed by Ansible for
|
||||||
baseline config only; every other host is Terraform-managed (The control-node
|
baseline config only (The control-node exception).
|
||||||
exception).
|
- The `offsite` TF environment's `vms` output feeds the `offsite_hosts` group via
|
||||||
|
`tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`);
|
||||||
|
the production inventory is a directory that merges `hosts.yml` (Proxmox) and
|
||||||
|
`offsite.yml` (offsite) (The off-site handoff).
|
||||||
- The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link
|
- The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link
|
||||||
here rather than restating it (What was ruled out).
|
here rather than restating it (What was ruled out).
|
||||||
|
|
|
||||||
|
|
@ -81,8 +81,9 @@ allocated for it.
|
||||||
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
|
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
|
||||||
NetBird's management datastore is backed up encrypted off `askari` (synced to
|
NetBird's management datastore is backed up encrypted off `askari` (synced to
|
||||||
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
|
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
|
||||||
- **`askari` is Ansible-managed:** its own inventory group `offsite_hosts` (added
|
- **`askari` is Ansible-managed:** its own inventory group `offsite_hosts` — provisioned
|
||||||
manually like the control node — it is not Terraform-managed), `base` role, plus a
|
as **Terraform IaC** (`hetznercloud/hcloud`), managed independently of the Proxmox
|
||||||
|
cluster (its own provider + local state). Ansible configuration: `base` role, plus a
|
||||||
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
|
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
|
||||||
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
|
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
|
||||||
version-pinned (ADR-011). boma's `dns` role stays authoritative for
|
version-pinned (ADR-011). boma's `dns` role stays authoritative for
|
||||||
|
|
|
||||||
|
|
@ -84,6 +84,20 @@ This was chosen over a single connectivity-model-generates-both (too much machin
|
||||||
tight coupling of two very different rule domains) and over fully independent per-layer
|
tight coupling of two very different rule domains) and over fully independent per-layer
|
||||||
declarations (real drift risk).
|
declarations (real drift risk).
|
||||||
|
|
||||||
|
### Off-cluster hosts — `askari` (Hetzner)
|
||||||
|
|
||||||
|
`askari` sits outside the Proxmox cluster and has no OPNsense. Its **perimeter** layer
|
||||||
|
is a TF-managed **Hetzner Cloud Firewall** (declared in `terraform/environments/offsite/`)
|
||||||
|
alongside the VM itself. Current rule set (M2): SSH inbound from `ubongo`'s public IP
|
||||||
|
only. NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator
|
||||||
|
role is built.
|
||||||
|
|
||||||
|
The `group_vars` service catalog remains authoritative for `askari`'s **host nftables**
|
||||||
|
layer — the same two-layer model applies, with Hetzner Cloud Firewall substituting for
|
||||||
|
OPNsense at the perimeter.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
### OPNsense automation — owned here, mechanism deferred
|
### OPNsense automation — owned here, mechanism deferred
|
||||||
|
|
||||||
OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; no Terraform
|
OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; no Terraform
|
||||||
|
|
|
||||||