# Design — Provisioning `askari` (Terraform + Hetzner Cloud)

- **Date:** 2026-06-14
- **Status:** Draft for review — design settled in brainstorming; pending user review, then implementation plan
- **Roadmap milestone:** M2 (`docs/ROADMAP.md`)
- **Amends:** ADR-006 (Terraform scope → Proxmox **+ Hetzner**), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud Firewall = askari's perimeter), ADR-007/016 (`askari` is Terraform-provisioned, not "added manually")
- **Becomes:** amendments to those ADRs

---

## Problem

`askari` (the off-site Hetzner VPS — NetBird coordinator + watchdog, later the off-site log subset) does not exist yet. ADR-007/016 designed it as "provisioned independently… added manually." Now that there's a dedicated Hetzner account + a verified API token in the vault, we can provision it as **IaC** instead. boma's principle (ADR-006/009) is "**Terraform owns VM existence; Ansible owns config**" — but scoped to Proxmox. This milestone **generalizes that principle to Hetzner** and stands `askari` up.

## Decisions (as settled)

1. **Terraform owns `askari`'s existence** (Approach 1) — generalize ADR-006 from "Proxmox VM existence" to "VM existence on **Proxmox + Hetzner**." (Rejected: Ansible `hetzner.hcloud` — breaks the TF/Ansible boundary; `hcloud` CLI — not stateful IaC.)
2. **Server:** **CAX11** (ARM/Ampere, 2 vCPU / 4 GB / 40 GB), **Helsinki (`hel1`)**, **Debian 13**. Rescale up later if the off-site log subset needs it.
3. **TF-managed Hetzner Cloud Firewall** as `askari`'s perimeter (the off-site OPNsense-analog). Starts minimal (**SSH from ubongo only**); service ports are added as services land (NetBird ports in M4). The ADR-020 catalog stays authoritative for the **host nftables** layer.
4. **Token via `TF_VAR_hcloud_token`**, sourced from `vault.hetzner.token` at apply time — never in `.tfvars` (CLAUDE.md).
5. **Handoff stays ADR-009-shaped:** `tf_to_inventory.py` is extended to emit `askari` into `offsite_hosts`, so `hosts.yml` stays fully generated.

## Verified facts (ADR-014)

> verified: Hetzner Cloud entry tiers · WebSearch · 2026-06-14 · **CAX11** (ARM/Ampere)
> 2 vCPU / 4 GB / 40 GB ≈ €3.79/mo, 20 TB traffic + 1 IPv4; ARM (CAX) is **EU-locations
> only** (incl. `hel1`). Price change for new orders from 2026-06-15.

> to verify when writing the role (ADR-014): the `hetznercloud/hcloud` provider version
> to pin; the Debian 13 image slug (expected `debian-13`); CAX11 availability in `hel1`.

## Architecture

### Terraform structure

- **Module `terraform/modules/hetzner_vm/`** (sibling to `proxmox_vm`): inputs `name`, `server_type`, `location`, `image`, `ssh_keys`, `user_data`, `firewall_rules`, `labels`; outputs the server's `ipv4` (+ id, name).
- **Stack `terraform/environments/offsite/`** (its own **local state** on ubongo, gitignored): `providers.tf` pins **`hetznercloud/hcloud`**; `main.tf` calls `hetzner_vm` for `askari` + an `hcloud_firewall` + an `hcloud_ssh_key`; `variables.tf` (incl. `hcloud_token`, `control_ssh_pubkey`, `ssh_admin_cidr`); `outputs.tf` (askari `ipv4`, for the handoff + DNS); `backend.tf` (local state, like the Proxmox envs).
- **`make tf-* TF_ENV=offsite`** drives it; for `offsite` the targets first export `TF_VAR_hcloud_token` from `vault.hetzner.token` (a small vault→env step). `tf-apply` stays gated behind a shown `tf-plan` (CLAUDE.md).

### Provisioning → Ansible handoff

1. TF creates the CAX11 with a **cloud-init `user_data`** that injects **ubongo's control SSH public key** for first login (minimal — no config beyond the key + ensuring Python is present for Ansible).
2. TF outputs `askari`'s public IPv4. `tf_to_inventory.py` (extended for the offsite stack) writes `askari` into the `offsite_hosts` group of `hosts.yml`.
3. `playbooks/bootstrap.yml` runs against `askari` → creates the `ansible` user + sudoers (as for Proxmox hosts). **Where M2 ends.**
4. *(Downstream, not M2):* `base` remote-access subset (M3), NetBird coordinator (M4), mesh enrollment + SSH-narrowed-to-`wt0` (M5).

- A convenience **`askari.wingu.me` A record** is added via the M1 `public_dns` role (stable name for humans + future certs); the inventory may reference it once DNS exists.

### Cloud firewall (perimeter)

- TF `hcloud_firewall` attached to `askari`:
  - **inbound SSH (22/tcp) from ubongo's address only** (`ssh_admin_cidr` var);
  - everything else default-deny.
- **Grows with services:** NetBird's **UDP 3478** (Coturn) + **TCP 80/443** (management/dashboard) are added in **M4** when the coordinator deploys — not opened to a non-existent listener now.
- This is the off-site **perimeter** layer (OPNsense has no presence off-cluster); ADR-020's `group_vars` catalog remains the single source for the **host nftables** layer that `base` renders (M3).

### State + disaster recovery

- The `offsite` `terraform.tfstate` lives on ubongo and is added to the **ADR-022 backup scope** (the control-node TF state backup already flagged in STATUS).
- DR is management-only: `askari` survives a homelab/ubongo outage by design, so a lost state is recovered by `terraform import`-ing the still-running server — no rebuild.

## Division of labour & access

| Task | Who | How |
|---|---|---|
| Hetzner token | Done | `vault.hetzner.token` (verified live, HTTP 200). |
| `hetzner_vm` module + `offsite` stack + `tf_to_inventory` extension + make token-inject | Agent | Committed IaC + a pytest for the handoff. |
| `terraform plan` (offsite) | Agent | `make tf-plan TF_ENV=offsite`, **output shown**. |
| `terraform apply` (offsite) | Human-gated | Only after the plan is reviewed (CLAUDE.md: never apply without a shown plan). Run on ubongo. |
| Confirm the control SSH key | Human | Which ubongo key Ansible uses to reach hosts (its public key feeds `control_ssh_pubkey`). |

- **Token:** `TF_VAR_hcloud_token` from vault at apply; never written to a `.tfvars` file.
- **SSH:** cloud-init injects only the control public key; the private key stays on ubongo. The cloud firewall limits SSH to ubongo's address until the mesh exists.

## Testing & verification

- `terraform fmt` + **`terraform validate`** + **`make tf-plan TF_ENV=offsite`** (plan reviewed before any apply).
- **pytest** for the `tf_to_inventory.py` offsite extension (mirrors the existing stdlib-only script tests), asserting an `askari` entry lands in `offsite_hosts`.
- Post-apply: SSH reachability from ubongo; cloud-init ran; then `bootstrap.yml` connectivity. (`base`/NetBird get their own Molecule/verify in M3/M4.)

## Scope boundaries — what M2 is NOT

- **Not** the `base` hardening subset (SSH hardening, fail2ban, NetBird agent) — **M3**.
- **Not** the NetBird coordinator or the cloud-firewall NetBird ports — **M4**.
- **Not** mesh enrollment / narrowing SSH to `wt0` — **M5**.
- **Not** the off-site log subset (may need a bigger instance / a volume) — later.

## ADR work

- **ADR-006** — generalize "Terraform owns VM existence" to **Proxmox + Hetzner**; add the `hetznercloud/hcloud` provider (no longer "the only provider is `bpg/proxmox`"); add the `offsite` environment + `hetzner_vm` module to Structure; note the TF-managed Hetzner Cloud Firewall.
- **ADR-009** — the offsite handoff (`tf_to_inventory.py` emits `askari` → `offsite_hosts`).
- **ADR-020** — the Hetzner Cloud Firewall is `askari`'s perimeter (OPNsense-analog); catalog still authoritative for host nftables.
- **ADR-007 / ADR-016** — `askari` is Terraform-provisioned (hcloud), superseding "added manually."

## Open items (resolve during the plan / implementation)

- **Pin** the `hetznercloud/hcloud` provider version; confirm the `debian-13` image slug and CAX11/`hel1` availability (ADR-014).
- The **make tf token-inject** mechanism for `offsite` (read `vault.hetzner.token` → export `TF_VAR_hcloud_token`) — shape it in the plan (rbw/ansible-vault one-liner vs a wrapper).
- Whether the inventory references `askari` by **IPv4 (from TF output)** or by **`askari.wingu.me`** once the DNS record exists — decide in the plan.
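
To make the last open item concrete, here is a minimal, stdlib-only sketch of how the offsite handoff might build the `offsite_hosts` group from `terraform output -json`. It is an illustration under stated assumptions, not the actual `tf_to_inventory.py` extension: the output name `askari_ipv4`, the function `offsite_group`, the `dns_name` parameter, and the exact shape of the group are all hypothetical, and the sample address comes from a documentation range.

```python
import json

# Hypothetical sample of `terraform output -json` for the offsite stack.
# Assumption: the stack exposes an output named "askari_ipv4".
# 203.0.113.10 is a documentation-range address, not a real server.
SAMPLE_TF_OUTPUT = """
{
  "askari_ipv4": {"sensitive": false, "type": "string", "value": "203.0.113.10"}
}
"""


def offsite_group(tf_output_json, dns_name=None):
    """Build an offsite_hosts inventory group from Terraform output JSON.

    If dns_name is given, askari is addressed by that name; otherwise it is
    addressed by the IPv4 taken from the Terraform output.
    """
    outputs = json.loads(tf_output_json)
    ipv4 = outputs["askari_ipv4"]["value"]
    ansible_host = dns_name if dns_name else ipv4
    return {
        "offsite_hosts": {
            "hosts": {
                "askari": {"ansible_host": ansible_host},
            }
        }
    }


if __name__ == "__main__":
    # Print the group for inspection; the real script's output format may differ.
    print(json.dumps(offsite_group(SAMPLE_TF_OUTPUT), indent=2))
```

Keeping the IPv4-versus-name choice behind a single optional argument means the decision can be deferred to the plan without changing the shape of the generated group; a pytest can assert on the `askari` entry either way.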