boma/docs/superpowers/plans/2026-06-14-askari-provisioning-m2.md
sjat 29921428c4 docs(plan): M2 — askari provisioning (Terraform + Hetzner Cloud)
9-task plan: verify hcloud facts; hetzner_vm module (server+firewall+ssh+cloud-init);
offsite env (CAX11/hel1/debian-13, local state); Makefile token-injection + directory
inventory + tf-inventory-offsite; offsite-handoff pytest; init/validate/plan; GATED
apply (billed VPS) + bootstrap; ADR-006/009/020/007/016 amendments. Resolves the
inventory-handoff open item via a directory inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:53:08 +02:00


askari Provisioning (M2) Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Provision askari (the off-site Hetzner VPS) as Terraform IaC — a hetzner_vm module + an offsite stack — behind a TF-managed cloud firewall, hand it into the offsite_hosts inventory, and bootstrap it.

Architecture: Generalize boma's "Terraform owns VM existence" principle (ADR-006) from Proxmox to Hetzner. A reusable hetzner_vm module wraps hcloud_server + hcloud_firewall + hcloud_ssh_key; an offsite environment (own local state) declares askari (CAX11/ARM, Helsinki, Debian 13). cloud-init creates the ansible user with ubongo's key; the firewall allows SSH from ubongo only. Handoff stays ADR-009-shaped: the offsite env outputs vms, and tf_to_inventory.py (already offsite-aware) generates an inventory file merged via a directory inventory.

Tech Stack: Terraform (hetznercloud/hcloud provider), Hetzner Cloud, cloud-init, Ansible. Token from vault.hetzner.token → TF_VAR_hcloud_token.

Spec: docs/superpowers/specs/2026-06-14-askari-provisioning-design.md

Execution context: Tasks 1-6 + 9 are authoring + terraform fmt/validate/plan (need terraform installed + the token, but no resources are created). Task 7 (terraform apply) and Task 8 (bootstrap) create a real, billed VPS: both are gated and run only on an explicit user go, with the tf-plan shown first (CLAUDE.md). If terraform is absent in the working env, Tasks 6-8 defer to ubongo.


File Structure

  • terraform/modules/hetzner_vm/{variables,main,outputs}.tf (create) — wraps server + firewall + ssh key + cloud-init.
  • terraform/environments/offsite/{providers,variables,main,outputs,backend}.tf + terraform.tfvars.example (create) — the askari stack, own local state.
  • Makefile (modify) — inject TF_VAR_hcloud_token for TF_ENV=offsite; directory inventory; tf-inventory-offsite target.
  • scripts/tf_to_inventory.py (no change — already offsite-aware) + tests/test_tf_to_inventory.py (create) — lock the offsite handoff.
  • docs/decisions/{006,009,020,007,016}-*.md, STATUS.md (modify) — ADR amendments + status.

Task 1: Verify the Hetzner provider/image facts (ADR-014)

Files: none (research; pin values used by later tasks).

  • Step 1: Verify and record

Verify (WebFetch registry.terraform.io / docs.hetzner.com, or terraform once init'd):

  • latest hetznercloud/hcloud provider version to pin (expected ~> 1.48+),
  • the Debian 13 image slug (expected debian-13),
  • that server type cax11 exists in location hel1.

Record a stamp in the offsite providers.tf comment, e.g.: # verified: hetznercloud/hcloud <ver> · debian-13 image · cax11@hel1 · <source> · <date>

  • Step 2: No commit (values land in later tasks).

Task 2: The hetzner_vm module

Files:

  • Create: terraform/modules/hetzner_vm/variables.tf, main.tf, outputs.tf

  • Step 1: variables.tf

variable "name" {
  description = "Server name (and hostname)"
  type        = string
}

variable "server_type" {
  description = "Hetzner server type, e.g. cax11 (ARM)"
  type        = string
}

variable "location" {
  description = "Hetzner location, e.g. hel1"
  type        = string
}

variable "image" {
  description = "OS image slug, e.g. debian-13"
  type        = string
}

variable "ansible_ssh_pubkey" {
  description = "Public SSH key provisioned for the ansible user via cloud-init"
  type        = string
}

variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to reach SSH (e.g. ubongo's address/32)"
  type        = list(string)
}

variable "labels" {
  description = "Hetzner resource labels (metadata only)"
  type        = map(string)
  default     = {}
}
  • Step 2: main.tf
# cloud-init: create the unprivileged `ansible` user with ubongo's key + sudo.
# (Mirrors the proxmox_vm module's user_account; Hetzner has no structured field.)
locals {
  user_data = <<-EOT
    #cloud-config
    users:
      - name: ansible
        groups: [sudo]
        sudo: "ALL=(ALL) NOPASSWD:ALL"
        shell: /bin/bash
        ssh_authorized_keys:
          - ${var.ansible_ssh_pubkey}
    package_update: true
    packages:
      - python3
  EOT
}

resource "hcloud_ssh_key" "ansible" {
  name       = "${var.name}-ansible"
  public_key = var.ansible_ssh_pubkey
}

resource "hcloud_firewall" "this" {
  name = "${var.name}-fw"

  # SSH from the control node only (NetBird ports are added in M4 when the
  # coordinator deploys — see ADR-020; the host nftables layer is catalog-driven).
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = var.ssh_admin_cidrs
  }
}

resource "hcloud_server" "this" {
  name         = var.name
  server_type  = var.server_type
  location     = var.location
  image        = var.image
  ssh_keys     = [hcloud_ssh_key.ansible.id]
  user_data    = local.user_data
  firewall_ids = [hcloud_firewall.this.id]
  labels       = var.labels

  public_net {
    ipv4_enabled = true
    ipv6_enabled = true
  }
}
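
As a rough illustration of what the heredoc above renders to, the following Python sketch mirrors the cloud-config template. `render_user_data` is a hypothetical helper, not part of the repo, and `textwrap.dedent` only stands in for the indentation stripping that Terraform's `<<-EOT` heredoc performs. cloud-init treats user data as cloud-config only when the first line is `#cloud-config`, so that is the property the sketch makes easy to check.

```python
import textwrap


def render_user_data(ansible_ssh_pubkey: str) -> str:
    """Render the cloud-config payload (illustrative mirror of the HCL heredoc)."""
    template = """\
        #cloud-config
        users:
          - name: ansible
            groups: [sudo]
            sudo: "ALL=(ALL) NOPASSWD:ALL"
            shell: /bin/bash
            ssh_authorized_keys:
              - {pubkey}
        package_update: true
        packages:
          - python3
    """
    # dedent removes the common leading indentation, like <<-EOT does in HCL.
    return textwrap.dedent(template).format(pubkey=ansible_ssh_pubkey)


rendered = render_user_data("ssh-ed25519 AAAA... ansible@ubongo")
first_line = rendered.splitlines()[0]
```

If `first_line` is anything other than `#cloud-config` (for example because of stray leading whitespace), the `ansible` user would likely not be created and the SSH step in Task 8 would fail.
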
  • Step 3: outputs.tf
output "ipv4_address" {
  description = "Server public IPv4"
  value       = hcloud_server.this.ipv4_address
}

output "name" {
  description = "Server name"
  value       = hcloud_server.this.name
}
  • Step 4: Format

Run: terraform fmt terraform/modules/hetzner_vm/ Expected: files formatted (or already formatted).

  • Step 5: Commit
git add terraform/modules/hetzner_vm
git commit -m "feat(tf): hetzner_vm module (server + firewall + ssh key + cloud-init)"

(append Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>)


Task 3: The offsite environment

Files:

  • Create: terraform/environments/offsite/{providers,variables,main,outputs,backend}.tf, terraform.tfvars.example

  • Step 1: providers.tf (pin the version from Task 1)

# verified: hetznercloud/hcloud ~> 1.48 · debian-13 · cax11@hel1 · <source> · <date>
terraform {
  required_version = ">= 1.9"

  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.48"
    }
  }
}

provider "hcloud" {
  token = var.hcloud_token
}
  • Step 2: variables.tf
variable "hcloud_token" {
  description = "Hetzner Cloud API token — set via TF_VAR_hcloud_token (from vault.hetzner.token)"
  type        = string
  sensitive   = true
}

variable "ansible_ssh_pubkey" {
  description = "ubongo's control SSH public key, provisioned for the ansible user"
  type        = string
}

variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to SSH askari (ubongo's address/32)"
  type        = list(string)
}
  • Step 3: main.tf
# offsite/main.tf — off-site Hetzner hosts. Terraform owns VM existence (ADR-006,
# generalized to Hetzner). ALWAYS `make tf-plan TF_ENV=offsite` and review before
# `make tf-apply TF_ENV=offsite`.

module "askari" {
  source = "../../modules/hetzner_vm"

  name               = "askari"
  server_type        = "cax11"  # ARM, 2 vCPU / 4 GB
  location           = "hel1"   # Helsinki
  image              = "debian-13"
  ansible_ssh_pubkey = var.ansible_ssh_pubkey
  ssh_admin_cidrs    = var.ssh_admin_cidrs
  labels = {
    env        = "offsite"
    group      = "offsite_hosts"
    managed-by = "terraform"
  }
}
  • Step 4: outputs.tf (the tf_to_inventory.py contract — vms map)
output "vms" {
  description = "Hostname → IP and Ansible group — consumed by make tf-inventory-offsite"
  value = {
    askari = {
      ip    = module.askari.ipv4_address
      group = "offsite_hosts"
    }
  }
}
  • Step 5: backend.tf
# Terraform state: LOCAL, on the control node (like the Proxmox envs; ADR-006).
# askari survives a homelab outage by design, so a lost state is recovered by
# `terraform import` of the running server — not a rebuild. Back the state up with
# the control node (ADR-022).
  • Step 6: terraform.tfvars.example
# offsite environment — non-secret values. Copy to terraform.tfvars and fill in.
#
# Secret is exported as an env var (never in this file):
#   export TF_VAR_hcloud_token="$(...from vault.hetzner.token...)"   # make handles this
#
# State is local (see backend.tf).

ansible_ssh_pubkey = "ssh-ed25519 AAAA... ansible@ubongo"
ssh_admin_cidrs    = ["10.20.10.151/32"]  # ubongo's LAN address (ADR-021)
  • Step 7: Format + commit

Run: terraform fmt terraform/environments/offsite/

git add terraform/environments/offsite
git commit -m "feat(tf): offsite environment — askari (CAX11/hel1/debian-13)"

(Co-Authored-By trailer)


Task 4: Makefile — token injection, directory inventory, offsite handoff

Files:

  • Modify: Makefile

  • Step 1: Inject the Hetzner token for TF_ENV=offsite

The tf-* targets need TF_VAR_hcloud_token for offsite, sourced from the vault. Add a guarded helper variable near the TF definition:

# For TF_ENV=offsite, export the Hetzner token from the vault (rbw unlocked).
# Reads vault.hetzner.token in-memory; never written to a tfvars file (CLAUDE.md).
ifeq ($(TF_ENV),offsite)
TF_TOKEN_ENV = TF_VAR_hcloud_token="$$($(VENV)/bin/ansible-vault view inventories/production/group_vars/all/vault.yml | $(VENV)/bin/python -c 'import sys,yaml; print(yaml.safe_load(sys.stdin)["vault"]["hetzner"]["token"])')"
else
TF_TOKEN_ENV =
endif

Then prefix the tf-init/tf-plan/tf-apply/tf-output recipes with $(TF_TOKEN_ENV), e.g.:

tf-plan:
	$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) plan

(Apply the same prefix to tf-init, tf-apply, tf-output.)
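
The guard above can be read as "add the token to the process environment only for the offsite env, and never persist it". A minimal Python sketch of the same idea follows; `terraform_env` and `load_token` are hypothetical names used for illustration, and the real mechanism remains the Makefile variable.

```python
import os
from typing import Callable, Mapping, Optional


def terraform_env(
    tf_env: str,
    load_token: Callable[[], str],
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build the environment for a terraform subprocess.

    The token loader is only called for the offsite env, so other
    environments never touch the vault.
    """
    env = dict(os.environ if base_env is None else base_env)
    if tf_env == "offsite":
        # Held in memory only; nothing is written to a tfvars file.
        env["TF_VAR_hcloud_token"] = load_token()
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run`, which keeps the secret out of the repository and out of the shell history.
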

  • Step 2: Directory inventory

Change the inventory so multiple TF envs can each generate a file:

INVENTORY   := -i inventories/production/

(Ansible reads every file in the directory as an inventory source and merges them; group_vars/ and host_vars/ remain variable dirs. Verify ansible.cfg does not also hard-set inventory=; if it does, update it to match.)

  • Step 3: tf-inventory-offsite target

Add (writes the offsite hosts into the production inventory dir, beside the Proxmox-generated hosts.yml):

tf-inventory-offsite:
	$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/offsite output -json \
	  | $(PYTHON) scripts/tf_to_inventory.py > inventories/production/offsite.yml
	@echo "Offsite inventory written to inventories/production/offsite.yml"

Add tf-inventory-offsite to .PHONY and a help line.

  • Step 4: Verify existing playbooks still resolve under the directory inventory

Run: make check PLAYBOOK=dns 2>&1 | tail -3 Expected: still resolves the control host and runs (no inventory errors). If connection:/group_vars break, fix before committing.

  • Step 5: Commit
git add Makefile
git commit -m "feat(make): offsite TF token injection + directory inventory + tf-inventory-offsite"

(Co-Authored-By trailer)


Task 5: Lock the offsite inventory handoff (TDD)

Files:

  • Test: tests/test_tf_to_inventory.py

  • Step 1: Write the test (a characterization test; expected to pass against the existing script)

import json
import pathlib
import subprocess
import sys

_SCRIPT = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "tf_to_inventory.py"


def _run(tf_output: dict) -> str:
    return subprocess.run(
        [sys.executable, str(_SCRIPT)],
        input=json.dumps(tf_output), capture_output=True, text=True, check=True,
    ).stdout


def test_offsite_host_lands_in_offsite_hosts():
    out = _run({"vms": {"value": {"askari": {"ip": "203.0.113.7", "group": "offsite_hosts"}}}})
    assert "offsite_hosts:" in out
    assert "askari:" in out
    assert "ansible_host: 203.0.113.7" in out


def test_unknown_group_rejected():
    proc = subprocess.run(
        [sys.executable, str(_SCRIPT)],
        input=json.dumps({"vms": {"value": {"x": {"ip": "1.2.3.4", "group": "nope"}}}}),
        capture_output=True, text=True,
    )
    assert proc.returncode == 1
    assert "unknown group" in proc.stderr
  • Step 2: Run it

Run: .venv/bin/python -m pytest tests/test_tf_to_inventory.py -v Expected: PASS — tf_to_inventory.py already supports offsite_hosts and rejects unknown groups (this test locks that behaviour for the M2 handoff; no code change needed). If it fails, fix scripts/tf_to_inventory.py minimally and report.
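
For orientation, the behaviour these tests lock can be sketched as below. This is not the repo's scripts/tf_to_inventory.py: `render_inventory`, `KNOWN_GROUPS`, and the exact YAML nesting are assumptions made for illustration. Only the input shape (`vms.value.<host>.{ip, group}`), the three asserted substrings, and the "unknown group" rejection come from the tests above.

```python
# Assumption: the real script knows more groups; only offsite_hosts matters here.
KNOWN_GROUPS = {"offsite_hosts"}


def render_inventory(tf_output: dict, known_groups=KNOWN_GROUPS) -> str:
    """Turn `terraform output -json` data into a YAML inventory string."""
    by_group = {}
    for host, attrs in sorted(tf_output["vms"]["value"].items()):
        group = attrs["group"]
        if group not in known_groups:
            # The real script exits 1 and reports this on stderr.
            raise ValueError(f"unknown group {group!r} for host {host!r}")
        by_group.setdefault(group, {})[host] = attrs["ip"]

    lines = ["all:", "  children:"]
    for group in sorted(by_group):
        lines += [f"    {group}:", "      hosts:"]
        for host, ip in by_group[group].items():
            lines += [f"        {host}:", f"          ansible_host: {ip}"]
    return "\n".join(lines) + "\n"


sample = {"vms": {"value": {"askari": {"ip": "203.0.113.7", "group": "offsite_hosts"}}}}
inventory_yaml = render_inventory(sample)
```

Rejecting unknown groups early means a typo in a Terraform `group` value fails the handoff instead of silently producing a host that no playbook targets.
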

  • Step 3: Commit
git add tests/test_tf_to_inventory.py
git commit -m "test(tf): lock the offsite_hosts inventory handoff"

(Co-Authored-By trailer)


Task 6: Init, validate, plan (gated — needs terraform + token)

Needs terraform installed and rbw unlocked. Creates no resources. If terraform is absent, defer Tasks 6-8 to ubongo.

  • Step 1: Set tfvars

cp terraform/environments/offsite/terraform.tfvars.example terraform/environments/offsite/terraform.tfvars and set ansible_ssh_pubkey to ubongo's real control public key and ssh_admin_cidrs to ubongo's address (10.20.10.151/32). (terraform.tfvars is gitignored.)
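
Before planning, it may be worth sanity-checking the `ssh_admin_cidrs` value. The sketch below is a hypothetical pre-flight helper (`check_admin_cidrs` is not part of the repo) built on Python's standard `ipaddress` module. It flags malformed CIDRs, ranges wider than one host, and private ranges. The last check rests on an assumption: if ubongo reaches askari over the public internet through NAT, the cloud firewall sees ubongo's egress address rather than its LAN address.

```python
import ipaddress


def check_admin_cidrs(cidrs):
    """Return a list of warnings for the given CIDR strings (empty means OK)."""
    warnings = []
    for cidr in cidrs:
        try:
            # strict=True rejects values with host bits set, e.g. 10.0.0.5/24.
            net = ipaddress.ip_network(cidr, strict=True)
        except ValueError as exc:
            warnings.append(f"{cidr}: not a valid CIDR ({exc})")
            continue
        if net.num_addresses != 1:
            warnings.append(f"{cidr}: wider than a single host")
        if net.is_private:
            warnings.append(
                f"{cidr}: private range; a public cloud firewall would "
                "normally see the NAT egress address instead"
            )
    return warnings


lan_warnings = check_admin_cidrs(["10.20.10.151/32"])
```

With the example value from terraform.tfvars.example, `lan_warnings` contains one private-range warning, which lines up with the troubleshooting hint in Task 8 Step 1 about ubongo's egress IP.
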

  • Step 2: Init (tracks the lock file)

Run: make tf-init TF_ENV=offsite Expected: providers installed; terraform/environments/offsite/.terraform.lock.hcl created. git add the lock file (tracked per CLAUDE.md).

  • Step 3: Validate + plan

Run: terraform -chdir=terraform/environments/offsite validate → Success.
Run: make tf-plan TF_ENV=offsite → review: 1 server + 1 firewall + 1 ssh key to add. Confirm CAX11/hel1/debian-13 and the SSH-from-ubongo rule.
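
The plan review can also be cross-checked mechanically. The sketch below assumes the plan was saved with `terraform plan -out=<file>` and rendered with `terraform show -json <file>`; the Makefile target as written does not save a plan file, so that step is an assumption. `summarize_plan` and `EXPECTED_CREATES` are hypothetical names.

```python
from collections import Counter

# What Task 6 Step 3 expects the plan to contain.
EXPECTED_CREATES = {
    ("hcloud_server", "create"): 1,
    ("hcloud_firewall", "create"): 1,
    ("hcloud_ssh_key", "create"): 1,
}


def summarize_plan(plan: dict) -> dict:
    """Count planned changes per (resource type, action), skipping no-ops."""
    counts = Counter()
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions == ["no-op"]:
            continue
        counts[(change["type"], "/".join(actions))] += 1
    return dict(counts)


# Usage idea: plan = json.load(sys.stdin), fed from `terraform show -json <file>`.
```

Any difference between `summarize_plan(plan)` and `EXPECTED_CREATES` (an unexpected delete, a second server) is a reason to stop before Task 7. It supplements the human review of the plan and does not replace it.
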

  • Step 4: Commit the lock file
git add terraform/environments/offsite/.terraform.lock.hcl
git commit -m "chore(tf): pin offsite provider lock (hcloud)"

(Co-Authored-By trailer)


Task 7: Apply — create askari (GATED, real billed VPS)

Explicit user go required. Run on ubongo. The plan from Task 6 must be reviewed first (CLAUDE.md: never apply without a shown plan).

  • Step 1: Apply

Run: make tf-apply TF_ENV=offsite Expected: module.askari.hcloud_ssh_key.ansible, module.askari.hcloud_firewall.this, and module.askari.hcloud_server.this created; the vms output shows askari's IPv4.

  • Step 2: Generate the offsite inventory

Run: make tf-inventory-offsite Expected: inventories/production/offsite.yml written with askari under offsite_hosts.

  • Step 3: Verify the inventory merges

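Run: .venv/bin/ansible-inventory -i inventories/production/ --host askari (or --list) Expected: askari present with its ansible_host. ($(INVENTORY) is a Make variable and is not expanded in a plain shell, so pass the directory explicitly.)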
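
The merge check can be scripted against the JSON that `ansible-inventory --list` prints, where each group is a top-level key with a `hosts` list and host variables live under `_meta.hostvars`. `host_in_group` below is a hypothetical helper, and the sample data is made up to match the expected result.

```python
def host_in_group(inventory: dict, host: str, group: str):
    """Return (is_member, ansible_host) for a host in `ansible-inventory --list` JSON."""
    is_member = host in inventory.get(group, {}).get("hosts", [])
    hostvars = inventory.get("_meta", {}).get("hostvars", {})
    return is_member, hostvars.get(host, {}).get("ansible_host")


# Illustrative data only; real output would come from:
#   ansible-inventory -i inventories/production/ --list
sample_inventory = {
    "_meta": {"hostvars": {"askari": {"ansible_host": "203.0.113.7"}}},
    "all": {"children": ["offsite_hosts", "ungrouped"]},
    "offsite_hosts": {"hosts": ["askari"]},
}
member, address = host_in_group(sample_inventory, "askari", "offsite_hosts")
```

If `member` is false or `address` is `None` on the real output, the directory inventory is not picking up offsite.yml and the commit in Step 4 should wait.
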

  • Step 4: Commit the generated inventory
git add inventories/production/offsite.yml
git commit -m "chore(inventory): askari in offsite_hosts (generated)"

(Co-Authored-By trailer)


Task 8: Bootstrap askari (GATED — needs the live host)

Run on ubongo after Task 7. rbw unlocked.

  • Step 1: Reach it

Run: ssh ansible@<askari-ip> (cloud-init created the ansible user with ubongo's key) — expect a shell. If refused, check the firewall ssh_admin_cidrs matches ubongo's egress IP.

  • Step 2: Bootstrap

Run: make check PLAYBOOK=bootstrap (review) then make deploy PLAYBOOK=bootstrap — expect the ansible user + sudoers confirmed/created on askari (idempotent).

  • Step 3: No repo commit — this configures the host, not the repo. (base subset = M3.)

Task 9: ADR amendments + STATUS

Files:

  • Modify: docs/decisions/006-terraform.md, 009-provisioning-handoff.md, 020-firewall.md, 007-network.md, 016-mesh-vpn.md, STATUS.md

For each: Read the relevant section first, then apply the change.

  • Step 1: ADR-006 — generalize the provider scope

In the Providers section, the line "bpg/proxmox … This is the only provider." → note a second provider:

**`hetznercloud/hcloud`**: owns off-site VM existence (`askari`). ADR-006's scope is
**Proxmox + Hetzner** — "Terraform owns VM existence" generalizes across providers; the
`offsite` environment + `hetzner_vm` module live alongside the Proxmox env + module.

Also adjust the Context line "creating and destroying VMs on Proxmox" → "on Proxmox and Hetzner".

  • Step 2: ADR-009 — offsite handoff

Add a note that offsite is a TF environment whose vms output feeds offsite_hosts via tf_to_inventory.py (make tf-inventory-offsite → inventories/production/offsite.yml), and that the production inventory is a directory merging the Proxmox + offsite generated files.

  • Step 3: ADR-020 — askari's perimeter

Note that off-cluster askari has no OPNsense; its perimeter is a TF-managed Hetzner Cloud Firewall (SSH-from-ubongo now; NetBird ports in M4). The group_vars catalog stays authoritative for the host nftables layer.

  • Step 4: ADR-007 / ADR-016 — askari is TF-provisioned

Replace "provisioned … independently … added manually" wording for askari with "provisioned as Terraform IaC (hcloud), managed independently of the Proxmox cluster (own provider + state)."

  • Step 5: STATUS.md

Update askari's row to reflect how far Tasks 7 and 8 got. If applied: list askari under "Real and working today" as Built + applied (CAX11/hel1/debian-13, cloud firewall SSH-from-ubongo, bootstrapped, in offsite_hosts). If only authored (apply deferred): note that the TF is written and tf-plan is clean, with apply pending on ubongo.

  • Step 6: Lint + commit

Run: make lint (must pass).

git add docs/decisions/006-terraform.md docs/decisions/009-provisioning-handoff.md \
        docs/decisions/020-firewall.md docs/decisions/007-network.md \
        docs/decisions/016-mesh-vpn.md STATUS.md
git commit -m "docs(askari): amend ADR-006/009/020/007/016 for TF-provisioned offsite host; STATUS"

(Co-Authored-By trailer)


Self-Review (completed)

  • Spec coverage: TF owns existence / generalize ADR-006 (Decision 1) → Tasks 2,3,9; CAX11/hel1/debian-13 (Decision 2) → Task 3; TF cloud firewall, SSH-from-ubongo, NetBird ports later (Decision 3) → Task 2 + Task 9 ADR-020; token via TF_VAR_hcloud_token from vault (Decision 4) → Task 4; ADR-009 handoff via tf_to_inventory (Decision 5) → Tasks 4,5,7; cloud-init ansible user + bootstrap → Tasks 2,8; state + DR (import) → Task 3 backend; ADR amendments → Task 9. All covered.
  • Placeholder scan: none — HCL, make, and test content are concrete. <askari-ip>/<source>/<date> are runtime/verification values, not unspecified logic.
  • Type/name consistency: module vars (name, server_type, location, image, ansible_ssh_pubkey, ssh_admin_cidrs, labels) match between module + env call; the vms output shape ({ip, group}) matches tf_to_inventory.py's contract; TF_VAR_hcloud_token maps to var.hcloud_token; vault.hetzner.token matches the stored key.
  • Notes for the implementer: (a) confirm Ansible merges the directory inventory's two files so askari resolves (Task 7 Step 3); (b) verify hcloud_server arg names against the pinned provider version (Task 1) and adjust public_net/firewall_ids if the provider differs; (c) Tasks 7-8 create a billed VPS and are gated on explicit go.