docs(spec): M2 — provision askari via Terraform + Hetzner Cloud

askari is provisioned as IaC: Terraform owns its existence too, generalizing
ADR-006 from "Proxmox VM existence" to Proxmox + Hetzner (new hetznercloud/hcloud
provider, hetzner_vm module, offsite stack with local state). CAX11 (ARM) in
Helsinki on Debian 13, behind a TF-managed Hetzner Cloud Firewall (SSH-from-ubongo
now; NetBird ports in M4). Token via TF_VAR_hcloud_token from vault.hetzner.token.
Handoff stays ADR-009-shaped (tf_to_inventory.py extended to emit askari into
offsite_hosts). State in the ADR-022 backup scope; DR via terraform import.

Amends ADR-006/009/020/007/016. Point ROADMAP.md M2 at the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-14 10:12:10 +02:00
parent 32d480efcf
commit 602550fdaa
2 changed files with 157 additions and 6 deletions

@@ -79,13 +79,18 @@ zero-risk and *born at Gandi*.
 ### M2 · `askari` provisioned + under Ansible
-Spin up the Hetzner VPS; bring it under Ansible in the `offsite_hosts` group; bootstrap it.
+Provision the Hetzner VPS **as IaC with Terraform** (CAX11 ARM / Helsinki / Debian 13,
+behind a TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap
+it. Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
-- **Proves:** the `offsite_hosts` pattern, bootstrap of a non-cluster host, rbw/vault
-against a brand-new host. Regenerates the inventory stubs (closes review finding O6 —
-`offsite_hosts` missing from `hosts.yml`).
-- **Maps to:** ADR-007 (`askari` role), ADR-009 (provisioning handoff), ADR-015/016,
-TODO 5 (control-node-style bootstrap, reused).
+- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
+existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
+module, `offsite` stack). Token via `TF_VAR_hcloud_token` from `vault.hetzner.token`.
+- **Proves:** the `offsite_hosts` pattern, the TF→Ansible handoff for a non-Proxmox host
+(`tf_to_inventory.py` extended), bootstrap of a non-cluster host. Closes review finding
+O6 (`offsite_hosts` missing from `hosts.yml`).
+- **Amends:** ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud
+Firewall = perimeter), ADR-007/016 (`askari` TF-provisioned, not "added manually").
 ### M3 · `base` matured to a "remote-access-sufficient" subset

@@ -0,0 +1,146 @@
# Design — Provisioning `askari` (Terraform + Hetzner Cloud)
- **Date:** 2026-06-14
- **Status:** Draft for review — design settled in brainstorming; pending user review,
then implementation plan
- **Roadmap milestone:** M2 (`docs/ROADMAP.md`)
- **Amends:** ADR-006 (Terraform scope → Proxmox **+ Hetzner**), ADR-009 (offsite
handoff), ADR-020 (Hetzner Cloud Firewall = askari's perimeter), ADR-007/016 (`askari`
is Terraform-provisioned, not "added manually")
- **Becomes:** amendments to those ADRs
---
## Problem
`askari` (the off-site Hetzner VPS — NetBird coordinator + watchdog, later the off-site
log subset) does not exist yet. ADR-007/016 designed it as "provisioned independently…
added manually." Now that there's a dedicated Hetzner account + a verified API token in
the vault, we can provision it as **IaC** instead. boma's principle (ADR-006/009) is
"**Terraform owns VM existence; Ansible owns config**" — but scoped to Proxmox. This
milestone **generalizes that principle to Hetzner** and stands `askari` up.
## Decisions (as settled)
1. **Terraform owns `askari`'s existence** (Approach 1) — generalize ADR-006 from "Proxmox
VM existence" to "VM existence on **Proxmox + Hetzner**." (Rejected: Ansible
`hetzner.hcloud` — breaks the TF/Ansible boundary; `hcloud` CLI — not stateful IaC.)
2. **Server:** **CAX11** (ARM/Ampere, 2 vCPU / 4 GB / 40 GB), **Helsinki (`hel1`)**,
**Debian 13**. Rescale up later if the off-site log subset needs it.
3. **TF-managed Hetzner Cloud Firewall** as `askari`'s perimeter (the off-site
OPNsense-analog). Starts minimal (**SSH from ubongo only**); service ports are added as
services land (NetBird ports in M4). The ADR-020 catalog stays authoritative for the
**host nftables** layer.
4. **Token via `TF_VAR_hcloud_token`**, sourced from `vault.hetzner.token` at apply time
— never in `.tfvars` (CLAUDE.md).
5. **Handoff stays ADR-009-shaped:** `tf_to_inventory.py` is extended to emit `askari`
into `offsite_hosts`, so `hosts.yml` stays fully generated.
## Verified facts (ADR-014)
> verified: Hetzner Cloud entry tiers · WebSearch · 2026-06-14 · **CAX11** (ARM/Ampere)
> 2 vCPU / 4 GB / 40 GB ≈ €3.79/mo, 20 TB traffic + 1 IPv4; ARM (CAX) is **EU-locations
> only** (incl. `hel1`). Price change for new orders from 2026-06-15.
> to verify when writing the role (ADR-014): the `hetznercloud/hcloud` provider version
> to pin; the Debian 13 image slug (expected `debian-13`); CAX11 availability in `hel1`.
## Architecture
### Terraform structure
- **Module `terraform/modules/hetzner_vm/`** (sibling to `proxmox_vm`): inputs `name`,
`server_type`, `location`, `image`, `ssh_keys`, `user_data`, `firewall_rules`,
`labels`; outputs the server's `ipv4` (+ id, name).
- **Stack `terraform/environments/offsite/`** (its own **local state** on ubongo,
gitignored): `providers.tf` pins **`hetznercloud/hcloud`**; `main.tf` calls
`hetzner_vm` for `askari` + an `hcloud_firewall` + an `hcloud_ssh_key`; `variables.tf`
(incl. `hcloud_token`, `control_ssh_pubkey`, `ssh_admin_cidr`); `outputs.tf` (askari
`ipv4`, for the handoff + DNS); `backend.tf` (local state, like the Proxmox envs).
- **`make tf-* TF_ENV=offsite`** drives it; for `offsite` the targets first export
`TF_VAR_hcloud_token` from `vault.hetzner.token` (a small vault→env step). `tf-apply`
stays gated behind a shown `tf-plan` (CLAUDE.md).
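One possible shape for the vault→env step, as a hedged sketch only: a small Python wrapper that obtains the token from a caller-supplied command and hands it to Terraform purely through the child process environment, so nothing is written to a `.tfvars` file. The fetch command is deliberately left as a parameter because the rbw/ansible-vault choice is still open; `build_tf_env` and `run_with_token` are hypothetical helper names, not existing repo code.

```python
import os
import subprocess
from typing import Mapping, Sequence


def build_tf_env(token: str, base_env: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of base_env with TF_VAR_hcloud_token set (nothing touches disk)."""
    token = token.strip()
    if not token:
        raise ValueError("empty Hetzner token; refusing to run Terraform")
    env = dict(base_env)
    env["TF_VAR_hcloud_token"] = token
    return env


def run_with_token(fetch_cmd: Sequence[str], tf_cmd: Sequence[str]) -> int:
    """Fetch the token with fetch_cmd, then run tf_cmd with it in the environment only.

    fetch_cmd is an assumption: whatever one-liner ends up reading
    vault.hetzner.token and printing it on stdout.
    """
    fetched = subprocess.run(fetch_cmd, check=True, capture_output=True, text=True)
    env = build_tf_env(fetched.stdout, os.environ)
    return subprocess.run(tf_cmd, env=env).returncode
```

A Make target could call such a wrapper for `TF_ENV=offsite` only, leaving the Proxmox environments untouched.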
### Provisioning → Ansible handoff
1. TF creates the CAX11 with a **cloud-init `user_data`** that injects **ubongo's control
SSH public key** for first login (minimal — no config beyond the key + ensuring
Python is present for Ansible).
2. TF outputs `askari`'s public IPv4. `tf_to_inventory.py` (extended for the offsite
stack) writes `askari` into the `offsite_hosts` group of `hosts.yml`.
3. `playbooks/bootstrap.yml` runs against `askari` → creates the `ansible` user + sudoers
(as for Proxmox hosts). **Where M2 ends.**
4. *(Downstream, not M2):* `base` remote-access subset (M3), NetBird coordinator (M4),
mesh enrollment + SSH-narrowed-to-`wt0` (M5).
- A convenience **`askari.wingu.me` A record** is added via the M1 `public_dns` role
(stable name for humans + future certs); the inventory may reference it once DNS exists.
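Step 2 of the handoff can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `tf_to_inventory.py` code: it assumes the offsite stack exposes an output named `askari_ipv4` (hypothetical name), that the script consumes the JSON printed by `terraform output -json`, and that hosts are keyed by `ansible_host` in the generated inventory. It stays stdlib-only, in line with the existing script.

```python
import ipaddress
import json


def offsite_group(tf_outputs: dict, output_name: str = "askari_ipv4") -> dict:
    """Turn `terraform output -json` data into an offsite_hosts inventory fragment.

    output_name is a hypothetical output name; adjust to whatever outputs.tf defines.
    """
    raw = tf_outputs[output_name]["value"]
    # Reject anything that is not a well-formed IPv4 address before it reaches hosts.yml.
    ip = ipaddress.IPv4Address(raw)
    return {"offsite_hosts": {"hosts": {"askari": {"ansible_host": str(ip)}}}}


if __name__ == "__main__":
    # Sample in the shape produced by `terraform output -json` (documentation address).
    sample = json.loads(
        '{"askari_ipv4": {"sensitive": false, "type": "string", "value": "203.0.113.10"}}'
    )
    print(json.dumps(offsite_group(sample), indent=2))
```

If the inventory later references `askari.wingu.me` instead, only the value written to `ansible_host` would need to change.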
### Cloud firewall (perimeter)
- TF `hcloud_firewall` attached to `askari`:
- **inbound SSH (22/tcp) from ubongo's address only** (`ssh_admin_cidr` var);
- everything else default-deny.
- **Grows with services:** NetBird's **UDP 3478** (Coturn) + **TCP 80/443**
(management/dashboard) are added in **M4** when the coordinator deploys — not opened to
a non-existent listener now.
- This is the off-site **perimeter** layer (OPNsense has no presence off-cluster);
ADR-020's `group_vars` catalog remains the single source for the **host nftables**
layer that `base` renders (M3).
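To make the "SSH from ubongo only" rule checkable before apply, a plan could be screened with something like the sketch below. It assumes the plan is exported with `terraform show -json` and that the `hcloud_firewall` resource carries its rules under a `rule` list with `direction`, `protocol`, `port` and `source_ips` fields; the helper names and the sample CIDR are hypothetical.

```python
def inbound_rules(plan: dict) -> list[dict]:
    """Collect planned inbound rules from every hcloud_firewall in a plan JSON."""
    rules = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "hcloud_firewall":
            continue
        # `after` is null for resources being destroyed, hence the fallbacks.
        after = (rc.get("change") or {}).get("after") or {}
        for rule in after.get("rule") or []:
            if rule.get("direction") == "in":
                rules.append(rule)
    return rules


def m2_violations(plan: dict, ssh_admin_cidr: str) -> list[str]:
    """Return human-readable problems if the plan opens more than 22/tcp from ubongo."""
    problems = []
    rules = inbound_rules(plan)
    if not rules:
        problems.append("no inbound rule found; SSH from ubongo would be blocked")
    for rule in rules:
        ok = (
            rule.get("protocol") == "tcp"
            and str(rule.get("port")) == "22"
            and rule.get("source_ips") == [ssh_admin_cidr]
        )
        if not ok:
            problems.append(
                f"unexpected inbound rule: {rule.get('protocol')}/{rule.get('port')} "
                f"from {rule.get('source_ips')}"
            )
    return problems
```

In M4 the allowed set would be widened deliberately (the NetBird ports), so a check like this doubles as a record of what each milestone is expected to expose.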
### State + disaster recovery
- The `offsite` `terraform.tfstate` lives on ubongo and is added to the **ADR-022 backup
scope** (the control-node TF state backup already flagged in STATUS).
- DR is management-only: losing the state file costs the ability to manage `askari`, not
the server itself. `askari` survives a homelab/ubongo outage by design, so a lost state
is recovered by `terraform import`-ing the still-running server, with no rebuild.
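As a rough illustration of the recovery path, the import commands could be generated from a short mapping of resource address to Hetzner resource ID. The addresses below are hypothetical (they depend on how the `offsite` stack and `hetzner_vm` module end up naming things) and the IDs are placeholders; real IDs would be looked up in the Hetzner console or API.

```python
import shlex


def import_commands(resources: dict[str, str]) -> list[str]:
    """Build one `terraform import ADDRESS ID` command line per resource."""
    return [
        shlex.join(["terraform", "import", address, resource_id])
        for address, resource_id in resources.items()
    ]


if __name__ == "__main__":
    # Hypothetical addresses and placeholder IDs, for illustration only.
    for line in import_commands({
        "module.askari.hcloud_server.this": "12345678",
        "hcloud_firewall.askari": "234567",
        "hcloud_ssh_key.control": "345678",
    }):
        print(line)
```

Printing the commands rather than executing them keeps the human in the loop, consistent with the gated-apply rule.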
## Division of labour & access
| Task | Who | How |
|---|---|---|
| Hetzner token | Done | `vault.hetzner.token` (verified live, HTTP 200). |
| `hetzner_vm` module + `offsite` stack + `tf_to_inventory` extension + make token-inject | Agent | Committed IaC + a pytest for the handoff. |
| `terraform plan` (offsite) | Agent | `make tf-plan TF_ENV=offsite`, **output shown**. |
| `terraform apply` (offsite) | Human-gated | Only after the plan is reviewed (CLAUDE.md: never apply without a shown plan). Run on ubongo. |
| Confirm the control SSH key | Human | Which ubongo key Ansible uses to reach hosts (its public key feeds `control_ssh_pubkey`). |
- **Token:** `TF_VAR_hcloud_token` from vault at apply; never written to a `.tfvars` file.
- **SSH:** cloud-init injects only the control public key; the private key stays on
ubongo. The cloud firewall limits SSH to ubongo's address until the mesh exists.
## Testing & verification
- `terraform fmt` + **`terraform validate`** + **`make tf-plan TF_ENV=offsite`** (plan
reviewed before any apply).
- **pytest** for the `tf_to_inventory.py` offsite extension (mirrors the existing
stdlib-only script tests), asserting an `askari` entry lands in `offsite_hosts`.
- Post-apply: SSH reachability from ubongo; cloud-init ran; then `bootstrap.yml`
connectivity. (`base`/NetBird get their own Molecule/verify in M3/M4.)
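The handoff test might look something like this sketch. `add_offsite_host` is a stand-in for whatever function the extended `tf_to_inventory.py` actually exposes, and the `proxmox_hosts` group and its host are invented fixture data; the point is only the two assertions: `askari` lands in `offsite_hosts`, and existing groups are left alone.

```python
import copy


def add_offsite_host(inventory: dict, name: str, ipv4: str) -> dict:
    """Stand-in for the real extension: return a new inventory with the host added."""
    merged = copy.deepcopy(inventory)
    hosts = merged.setdefault("offsite_hosts", {}).setdefault("hosts", {})
    hosts[name] = {"ansible_host": ipv4}
    return merged


def test_askari_lands_in_offsite_hosts():
    inv = add_offsite_host({}, "askari", "203.0.113.10")
    assert inv["offsite_hosts"]["hosts"]["askari"] == {"ansible_host": "203.0.113.10"}


def test_existing_groups_are_untouched():
    # Invented fixture: any pre-existing group must survive the merge unchanged.
    before = {"proxmox_hosts": {"hosts": {"pve1": {"ansible_host": "192.0.2.5"}}}}
    after = add_offsite_host(before, "askari", "203.0.113.10")
    assert after["proxmox_hosts"] == before["proxmox_hosts"]
    assert "offsite_hosts" not in before  # the input inventory is not mutated
```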
## Scope boundaries — what M2 is NOT
- **Not** the `base` hardening subset (SSH hardening, fail2ban, NetBird agent) — **M3**.
- **Not** the NetBird coordinator or the cloud-firewall NetBird ports — **M4**.
- **Not** mesh enrollment / narrowing SSH to `wt0` — **M5**.
- **Not** the off-site log subset (may need a bigger instance / a volume) — later.
## ADR work
- **ADR-006** — generalize "Terraform owns VM existence" to **Proxmox + Hetzner**; add the
`hetznercloud/hcloud` provider (no longer "the only provider is `bpg/proxmox`"); add the
`offsite` environment + `hetzner_vm` module to Structure; note the TF-managed Hetzner
Cloud Firewall.
- **ADR-009** — the offsite handoff (`tf_to_inventory.py` emits `askari` → `offsite_hosts`).
- **ADR-020** — the Hetzner Cloud Firewall is `askari`'s perimeter (OPNsense-analog);
catalog still authoritative for host nftables.
- **ADR-007 / ADR-016** — `askari` is Terraform-provisioned (hcloud), superseding "added
manually."
## Open items (resolve during the plan / implementation)
- **Pin** the `hetznercloud/hcloud` provider version; confirm the `debian-13` image slug
and CAX11/`hel1` availability (ADR-014).
- The **make tf token-inject** mechanism for `offsite` (read `vault.hetzner.token` → export
`TF_VAR_hcloud_token`) — shape it in the plan (rbw/ansible-vault one-liner vs a wrapper).
- Whether the inventory references `askari` by **IPv4 (from TF output)** or by
**`askari.wingu.me`** once the DNS record exists — decide in the plan.