boma/docs/superpowers/specs/2026-06-14-askari-provisioning-design.md

Design — Provisioning askari (Terraform + Hetzner Cloud)

  • Date: 2026-06-14
  • Status: Draft for review — design settled in brainstorming; pending user review, then implementation plan
  • Roadmap milestone: M2 (docs/ROADMAP.md)
  • Amends: ADR-006 (Terraform scope → Proxmox + Hetzner), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud Firewall = askari's perimeter), ADR-007/016 (askari is Terraform-provisioned, not "added manually")
  • Becomes: amendments to those ADRs

Problem

askari (the off-site Hetzner VPS — NetBird coordinator + watchdog, later the off-site log subset) does not exist yet. ADR-007/016 designed it as "provisioned independently… added manually." Now that there's a dedicated Hetzner account + a verified API token in the vault, we can provision it as IaC instead. boma's principle (ADR-006/009) is "Terraform owns VM existence; Ansible owns config" — but scoped to Proxmox. This milestone generalizes that principle to Hetzner and stands askari up.

Decisions (as settled)

  1. Terraform owns askari's existence (Approach 1) — generalize ADR-006 from "Proxmox VM existence" to "VM existence on Proxmox + Hetzner." (Rejected: Ansible hetzner.hcloud — breaks the TF/Ansible boundary; hcloud CLI — not stateful IaC.)
  2. Server: CAX11 (ARM/Ampere, 2 vCPU / 4 GB / 40 GB), Helsinki (hel1), Debian 13. Rescale up later if the off-site log subset needs it.
  3. TF-managed Hetzner Cloud Firewall as askari's perimeter (the off-site OPNsense-analog). Starts minimal (SSH from ubongo only); service ports are added as services land (NetBird ports in M4). The ADR-020 catalog stays authoritative for the host nftables layer.
  4. Token via TF_VAR_hcloud_token, sourced from vault.hetzner.token at apply time — never in .tfvars (CLAUDE.md).
  5. Handoff stays ADR-009-shaped: tf_to_inventory.py is extended to emit askari into offsite_hosts, so hosts.yml stays fully generated.

Verified facts (ADR-014)

verified: Hetzner Cloud entry tiers · WebSearch · 2026-06-14 · CAX11 (ARM/Ampere) 2 vCPU / 4 GB / 40 GB ≈ €3.79/mo, 20 TB traffic + 1 IPv4; ARM (CAX) is EU-locations only (incl. hel1). Price change for new orders from 2026-06-15.

to verify when writing the role (ADR-014): the hetznercloud/hcloud provider version to pin; the Debian 13 image slug (expected debian-13); CAX11 availability in hel1.

Architecture

Terraform structure

  • Module terraform/modules/hetzner_vm/ (sibling to proxmox_vm): inputs name, server_type, location, image, ssh_keys, user_data, firewall_rules, labels; outputs the server's ipv4 (+ id, name).
  • Stack terraform/environments/offsite/ (its own local state on ubongo, gitignored): providers.tf pins hetznercloud/hcloud; main.tf calls hetzner_vm for askari + an hcloud_firewall + an hcloud_ssh_key; variables.tf (incl. hcloud_token, control_ssh_pubkey, ssh_admin_cidr); outputs.tf (askari ipv4, for the handoff + DNS); backend.tf (local state, like the Proxmox envs).
  • make tf-* TF_ENV=offsite drives it; for offsite the targets first export TF_VAR_hcloud_token from vault.hetzner.token (a small vault→env step). tf-apply stays gated behind a shown tf-plan (CLAUDE.md).
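
The vault→env step described above could take roughly the following shape. This is a minimal sketch, not the settled mechanism (that is still an open item): it assumes the token has already been read from vault.hetzner.token by some other means and is passed in as a string, and the helper names build_tf_env, terraform_argv and run_terraform are hypothetical.

```python
import os
import subprocess
from typing import Mapping, Optional


def build_tf_env(token: str, base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with TF_VAR_hcloud_token set.

    The caller's environment is copied, never mutated, so the token does not
    leak into the parent shell or into a .tfvars file.
    """
    if not token:
        raise ValueError("empty Hetzner token; refusing to run terraform")
    env = dict(os.environ if base_env is None else base_env)
    env["TF_VAR_hcloud_token"] = token
    return env


def terraform_argv(subcommand: str,
                   tf_dir: str = "terraform/environments/offsite") -> list:
    """Build the terraform command line for the offsite stack."""
    # -chdir is a global terraform option and must precede the subcommand.
    return ["terraform", f"-chdir={tf_dir}", subcommand]


def run_terraform(subcommand: str, token: str) -> int:
    """Run terraform with the token present only in the child's environment."""
    result = subprocess.run(
        terraform_argv(subcommand),
        env=build_tf_env(token),
        check=False,
    )
    return result.returncode
```

A make target would call something like run_terraform("plan", token) and show the output; apply stays behind the reviewed plan either way.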

Provisioning → Ansible handoff

  1. TF creates the CAX11 with a cloud-init user_data that injects ubongo's control SSH public key for first login (minimal — no config beyond the key + ensuring Python is present for Ansible).
  2. TF outputs askari's public IPv4. tf_to_inventory.py (extended for the offsite stack) writes askari into the offsite_hosts group of hosts.yml.
  3. playbooks/bootstrap.yml runs against askari and creates the ansible user + sudoers (as for Proxmox hosts). M2 ends here.
  4. (Downstream, not M2): base remote-access subset (M3), NetBird coordinator (M4), mesh enrollment + SSH-narrowed-to-wt0 (M5).
  • A convenience askari.wingu.me A record is added via the M1 public_dns role (stable name for humans + future certs); the inventory may reference it once DNS exists.
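
As an illustration of step 2, the offsite extension to tf_to_inventory.py might look something like the sketch below. It is stdlib-only, like the existing script. The output name askari_ipv4, the function name offsite_hosts_from_outputs and the sample address (from the 203.0.113.0/24 documentation range) are assumptions; only the shape of `terraform output -json` (a map of name to an object with a "value" key) is taken as given.

```python
import json


def offsite_hosts_from_outputs(tf_outputs: dict,
                               output_name: str = "askari_ipv4") -> dict:
    """Map `terraform output -json` data to an offsite_hosts inventory group."""
    try:
        ipv4 = tf_outputs[output_name]["value"]
    except KeyError as exc:
        # Fail loudly rather than write an inventory without askari.
        raise KeyError(f"terraform output {output_name!r} is missing") from exc
    return {"offsite_hosts": {"hosts": {"askari": {"ansible_host": ipv4}}}}


# Sample input in the shape `terraform output -json` prints (values made up).
sample = json.loads(
    '{"askari_ipv4": {"sensitive": false, "type": "string",'
    ' "value": "203.0.113.10"}}'
)
group = offsite_hosts_from_outputs(sample)
print(json.dumps(group, indent=2))
```

The returned dictionary is only the group fragment; how it is merged into the generated hosts.yml depends on the existing script and is not shown here.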

Cloud firewall (perimeter)

  • TF hcloud_firewall attached to askari:
    • inbound SSH (22/tcp) from ubongo's address only (ssh_admin_cidr var);
    • everything else default-deny.
  • Grows with services: NetBird's UDP 3478 (Coturn) + TCP 80/443 (management/dashboard) are added in M4 when the coordinator deploys — not opened to a non-existent listener now.
  • This is the off-site perimeter layer (OPNsense has no presence off-cluster); ADR-020's group_vars catalog remains the single source for the host nftables layer that base renders (M3).
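
The perimeter policy above (one allow rule, everything else denied, more rules added per milestone) can be modelled in a few lines. This is a sketch of the intended semantics only, not of how Hetzner evaluates rules: InboundRule and is_allowed are hypothetical names, 203.0.113.7/32 is a placeholder for ubongo's address, and opening the M4 ports to 0.0.0.0/0 is an assumption the spec does not state.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    protocol: str     # "tcp" or "udp"
    port: int
    source_cidr: str  # who may connect


def is_allowed(rules, protocol: str, port: int, source_ip: str) -> bool:
    """Default-deny: traffic passes only if at least one rule matches it."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        r.protocol == protocol
        and r.port == port
        and addr in ipaddress.ip_network(r.source_cidr)
        for r in rules
    )


SSH_ADMIN_CIDR = "203.0.113.7/32"  # placeholder for ubongo's address

# M2: SSH from ubongo only.
m2_rules = [InboundRule("tcp", 22, SSH_ADMIN_CIDR)]

# M4 (assumed sources): NetBird ports, added once the coordinator listens.
m4_rules = m2_rules + [
    InboundRule("udp", 3478, "0.0.0.0/0"),
    InboundRule("tcp", 80, "0.0.0.0/0"),
    InboundRule("tcp", 443, "0.0.0.0/0"),
]
```

Under m2_rules, is_allowed(m2_rules, "tcp", 443, "203.0.113.7") is False even from ubongo, which matches the intent of not opening ports to a non-existent listener.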

State + disaster recovery

  • The offsite terraform.tfstate lives on ubongo and is added to the ADR-022 backup scope (the control-node TF state backup already flagged in STATUS).
  • DR is management-only: askari survives a homelab/ubongo outage by design, so a lost state is recovered by terraform import-ing the still-running server — no rebuild.
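
For the lost-state case, the recovery amounts to one terraform import per surviving resource. The sketch below only builds the command lines, so they can be reviewed before anything runs. The resource addresses (module.askari.hcloud_server.this and so on) and the numeric IDs are hypothetical; the real addresses depend on how the module and stack end up being written, and the IDs would come from the Hetzner console or API.

```python
def import_commands(resource_ids: dict,
                    tf_dir: str = "terraform/environments/offsite") -> list:
    """Build one `terraform import ADDRESS ID` argv per surviving resource."""
    return [
        ["terraform", f"-chdir={tf_dir}", "import", address, str(resource_id)]
        for address, resource_id in resource_ids.items()
    ]


# Hypothetical addresses and IDs, for illustration only.
surviving = {
    "module.askari.hcloud_server.this": 12345678,
    "hcloud_firewall.askari": 234567,
    "hcloud_ssh_key.control": 345678,
}

for argv in import_commands(surviving):
    print(" ".join(argv))
```

A terraform plan after the imports should then show no changes if the configuration still matches the running server.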

Division of labour & access

| Task | Who | How |
| --- | --- | --- |
| Hetzner token | Done | vault.hetzner.token (verified live, HTTP 200). |
| hetzner_vm module + offsite stack + tf_to_inventory extension + make token-inject | Agent | Committed IaC + a pytest for the handoff. |
| terraform plan (offsite) | Agent | make tf-plan TF_ENV=offsite, output shown. |
| terraform apply (offsite) | Human-gated | Only after the plan is reviewed (CLAUDE.md: never apply without a shown plan). Run on ubongo. |
| Confirm the control SSH key | Human | Which ubongo key Ansible uses to reach hosts (its public key feeds control_ssh_pubkey). |

  • Token: TF_VAR_hcloud_token from vault at apply; never written to a .tfvars file.
  • SSH: cloud-init injects only the control public key; the private key stays on ubongo. The cloud firewall limits SSH to ubongo's address until the mesh exists.

Testing & verification

  • terraform fmt + terraform validate + make tf-plan TF_ENV=offsite (plan reviewed before any apply).
  • pytest for the tf_to_inventory.py offsite extension (mirrors the existing stdlib-only script tests), asserting an askari entry lands in offsite_hosts.
  • Post-apply: SSH reachability from ubongo; cloud-init ran; then bootstrap.yml connectivity. (base/NetBird get their own Molecule/verify in M3/M4.)
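
The first post-apply check (SSH reachability) can be scripted with the standard library alone. This is a minimal sketch under the assumption that a plain TCP connect is a sufficient first signal; tcp_reachable is a hypothetical helper, and it does not verify host keys, cloud-init status or the ansible user.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Run from ubongo against askari's IPv4, this should return True once the server is up. Run from any other address it should return False, since the cloud firewall only admits ssh_admin_cidr.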

Scope boundaries — what M2 is NOT

  • Not the base hardening subset (SSH hardening, fail2ban, NetBird agent): M3.
  • Not the NetBird coordinator or the cloud-firewall NetBird ports: M4.
  • Not mesh enrollment or narrowing SSH to wt0: M5.
  • Not the off-site log subset (may need a bigger instance or a volume): later.

ADR work

  • ADR-006: generalize "Terraform owns VM existence" to Proxmox + Hetzner; add the hetznercloud/hcloud provider (no longer "the only provider is bpg/proxmox"); add the offsite environment + hetzner_vm module to Structure; note the TF-managed Hetzner Cloud Firewall.
  • ADR-009: the offsite handoff (tf_to_inventory.py emits askari into offsite_hosts).
  • ADR-020: the Hetzner Cloud Firewall is askari's perimeter (OPNsense-analog); the catalog is still authoritative for host nftables.
  • ADR-007 / ADR-016: askari is Terraform-provisioned (hcloud), superseding "added manually."

Open items (resolve during the plan / implementation)

  • Pin the hetznercloud/hcloud provider version; confirm the debian-13 image slug and CAX11/hel1 availability (ADR-014).
  • The token-inject mechanism for make tf-* TF_ENV=offsite (read vault.hetzner.token → export TF_VAR_hcloud_token): shape it in the plan (rbw/ansible-vault one-liner vs. a wrapper).
  • Whether the inventory references askari by IPv4 (from TF output) or by askari.wingu.me once the DNS record exists — decide in the plan.