boma/docs/superpowers/plans/2026-06-14-askari-provisioning-m2.md
sjat 29921428c4 docs(plan): M2 — askari provisioning (Terraform + Hetzner Cloud)
9-task plan: verify hcloud facts; hetzner_vm module (server+firewall+ssh+cloud-init);
offsite env (CAX11/hel1/debian-13, local state); Makefile token-injection + directory
inventory + tf-inventory-offsite; offsite-handoff pytest; init/validate/plan; GATED
apply (billed VPS) + bootstrap; ADR-006/009/020/007/016 amendments. Resolves the
inventory-handoff open item via a directory inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:53:08 +02:00

# askari Provisioning (M2) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Provision `askari` (the off-site Hetzner VPS) as Terraform IaC — a `hetzner_vm` module + an `offsite` stack — behind a TF-managed cloud firewall, hand it into the `offsite_hosts` inventory, and bootstrap it.
**Architecture:** Generalize boma's "Terraform owns VM existence" principle (ADR-006) from Proxmox to Hetzner. A reusable `hetzner_vm` module wraps `hcloud_server` + `hcloud_firewall` + `hcloud_ssh_key`; an `offsite` environment (own local state) declares `askari` (CAX11/ARM, Helsinki, Debian 13). cloud-init creates the `ansible` user with ubongo's key; the firewall allows SSH from ubongo only. Handoff stays ADR-009-shaped: the offsite env outputs `vms`, and `tf_to_inventory.py` (already offsite-aware) generates an inventory file merged via a **directory inventory**.
**Tech Stack:** Terraform (`hetznercloud/hcloud` provider), Hetzner Cloud, cloud-init, Ansible. Token from `vault.hetzner.token` → `TF_VAR_hcloud_token`.
**Spec:** `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`
**Execution context:** Tasks 1-6 and 9 are authoring plus `terraform fmt`/`validate`/`plan` (they need `terraform` installed and the token, but create no resources). **Task 7 (`terraform apply`) and Task 8 (bootstrap) create a real, billed VPS**; both are gated and run only on explicit user go, with `tf-plan` shown first (CLAUDE.md). If `terraform` is absent in the working env, Tasks 6-8 defer to ubongo.
---
## File Structure
- `terraform/modules/hetzner_vm/{variables,main,outputs}.tf` (create) — wraps server + firewall + ssh key + cloud-init.
- `terraform/environments/offsite/{providers,variables,main,outputs,backend}.tf` + `terraform.tfvars.example` (create) — the askari stack, own local state.
- `Makefile` (modify) — inject `TF_VAR_hcloud_token` for `TF_ENV=offsite`; directory inventory; `tf-inventory-offsite` target.
- `scripts/tf_to_inventory.py` (no change — already offsite-aware) + `tests/test_tf_to_inventory.py` (create) — lock the offsite handoff.
- `docs/decisions/{006,009,020,007,016}-*.md`, `STATUS.md` (modify) — ADR amendments + status.
---
### Task 1: Verify the Hetzner provider/image facts (ADR-014)
**Files:** none (research; pin values used by later tasks).
- [ ] **Step 1: Verify and record**
Verify (WebFetch registry.terraform.io / docs.hetzner.com, or `terraform` once init'd):
- latest `hetznercloud/hcloud` provider version to pin (expected `~> 1.48`+),
- the Debian 13 image slug (expected `debian-13`),
- that server type `cax11` exists in location `hel1`.
Record a stamp in the offsite `providers.tf` comment, e.g.:
`# verified: hetznercloud/hcloud <ver> · debian-13 image · cax11@hel1 · <source> · <date>`
- [ ] **Step 2: No commit** (values land in later tasks).
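
The provider pin expected in Step 1 uses Terraform's pessimistic constraint operator (`~>`), which lets only the rightmost stated component increase. As a rough illustration of which releases `~> 1.48` would accept, the sketch below mimics that rule in Python. The function name `satisfies_pessimistic` is hypothetical, pre-release suffixes are deliberately not handled, and Terraform's own resolver remains the authority.

```python
def _parse(version: str) -> tuple[int, ...]:
    """Split a dotted version such as '1.48.0' into integer components."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Return True if `version` fits `~> constraint` (illustrative only)."""
    base = _parse(constraint)
    if len(base) < 2:
        raise ValueError("this sketch expects at least MAJOR.MINOR in the constraint")
    ver = _parse(version)
    # Pad both to the same width so tuple comparison is well defined.
    width = max(len(ver), len(base))
    ver_p = ver + (0,) * (width - len(ver))
    base_p = base + (0,) * (width - len(base))
    # Lower bound: not older than the constraint.
    # Upper bound: every component left of the last stated one must match.
    return ver_p >= base_p and ver_p[: len(base) - 1] == base[:-1]


if __name__ == "__main__":
    for candidate in ("1.47.9", "1.48.0", "1.52.3", "2.0.0"):
        print(candidate, satisfies_pessimistic(candidate, "1.48"))
```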
---
### Task 2: The `hetzner_vm` module
**Files:**
- Create: `terraform/modules/hetzner_vm/variables.tf`, `main.tf`, `outputs.tf`
- [ ] **Step 1: `variables.tf`**
```hcl
variable "name" {
  description = "Server name (and hostname)"
  type        = string
}

variable "server_type" {
  description = "Hetzner server type, e.g. cax11 (ARM)"
  type        = string
}

variable "location" {
  description = "Hetzner location, e.g. hel1"
  type        = string
}

variable "image" {
  description = "OS image slug, e.g. debian-13"
  type        = string
}

variable "ansible_ssh_pubkey" {
  description = "Public SSH key provisioned for the ansible user via cloud-init"
  type        = string
}

variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to reach SSH (e.g. ubongo's address/32)"
  type        = list(string)
}

variable "labels" {
  description = "Hetzner resource labels (metadata only)"
  type        = map(string)
  default     = {}
}
```
- [ ] **Step 2: `main.tf`**
```hcl
# cloud-init: create the unprivileged `ansible` user with ubongo's key + sudo.
# (Mirrors the proxmox_vm module's user_account; Hetzner has no structured field.)
locals {
  user_data = <<-EOT
    #cloud-config
    users:
      - name: ansible
        groups: [sudo]
        sudo: "ALL=(ALL) NOPASSWD:ALL"
        shell: /bin/bash
        ssh_authorized_keys:
          - ${var.ansible_ssh_pubkey}
    package_update: true
    packages:
      - python3
  EOT
}

resource "hcloud_ssh_key" "ansible" {
  name       = "${var.name}-ansible"
  public_key = var.ansible_ssh_pubkey
}

resource "hcloud_firewall" "this" {
  name = "${var.name}-fw"

  # SSH from the control node only (NetBird ports are added in M4 when the
  # coordinator deploys; see ADR-020. The host nftables layer is catalog-driven).
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = var.ssh_admin_cidrs
  }
}

resource "hcloud_server" "this" {
  name         = var.name
  server_type  = var.server_type
  location     = var.location
  image        = var.image
  ssh_keys     = [hcloud_ssh_key.ansible.id]
  user_data    = local.user_data
  firewall_ids = [hcloud_firewall.this.id]
  labels       = var.labels

  public_net {
    ipv4_enabled = true
    ipv6_enabled = true
  }
}
```
- [ ] **Step 3: `outputs.tf`**
```hcl
output "ipv4_address" {
  description = "Server public IPv4"
  value       = hcloud_server.this.ipv4_address
}

output "name" {
  description = "Server name"
  value       = hcloud_server.this.name
}
```
- [ ] **Step 4: Format**
Run: `terraform fmt terraform/modules/hetzner_vm/`
Expected: files formatted (or already formatted).
- [ ] **Step 5: Commit**
```bash
git add terraform/modules/hetzner_vm
git commit -m "feat(tf): hetzner_vm module (server + firewall + ssh key + cloud-init)"
```
(append `Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>`)
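
The `user_data` heredoc in Step 2 only works if the rendered text starts with `#cloud-config` in column 0 and the YAML nesting survives the indentation stripping. The sketch below reproduces the rendering in Python so the resulting document can be inspected without running Terraform. It is an approximation under stated assumptions: `textwrap.dedent` stands in for Terraform's `<<-` indentation stripping (similar, not identical), and `render_user_data` is a hypothetical helper that is not part of the module.

```python
import textwrap


def render_user_data(ansible_ssh_pubkey: str) -> str:
    """Build the cloud-config text the module would pass as user_data."""
    if "\n" in ansible_ssh_pubkey:
        # A multi-line value would break out of the YAML list item.
        raise ValueError("public key must be a single line")
    template = """\
        #cloud-config
        users:
          - name: ansible
            groups: [sudo]
            sudo: "ALL=(ALL) NOPASSWD:ALL"
            shell: /bin/bash
            ssh_authorized_keys:
              - {pubkey}
        package_update: true
        packages:
          - python3
        """
    # Strip the common indentation first, then substitute the key.
    return textwrap.dedent(template).format(pubkey=ansible_ssh_pubkey)


if __name__ == "__main__":
    print(render_user_data("ssh-ed25519 AAAA... ansible@ubongo"))
```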
---
### Task 3: The `offsite` environment
**Files:**
- Create: `terraform/environments/offsite/{providers,variables,main,outputs,backend}.tf`, `terraform.tfvars.example`
- [ ] **Step 1: `providers.tf`** (pin the version from Task 1)
```hcl
# verified: hetznercloud/hcloud ~> 1.48 · debian-13 · cax11@hel1 · <source> · <date>
terraform {
  required_version = ">= 1.9"

  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.48"
    }
  }
}

provider "hcloud" {
  token = var.hcloud_token
}
```
- [ ] **Step 2: `variables.tf`**
```hcl
variable "hcloud_token" {
  description = "Hetzner Cloud API token; set via TF_VAR_hcloud_token (from vault.hetzner.token)"
  type        = string
  sensitive   = true
}

variable "ansible_ssh_pubkey" {
  description = "ubongo's control SSH public key, provisioned for the ansible user"
  type        = string
}

variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to SSH askari (ubongo's address/32)"
  type        = list(string)
}
```
- [ ] **Step 3: `main.tf`**
```hcl
# offsite/main.tf: off-site Hetzner hosts. Terraform owns VM existence (ADR-006,
# generalized to Hetzner). ALWAYS `make tf-plan TF_ENV=offsite` and review before
# `make tf-apply TF_ENV=offsite`.
module "askari" {
  source = "../../modules/hetzner_vm"

  name               = "askari"
  server_type        = "cax11" # ARM, 2 vCPU / 4 GB
  location           = "hel1"  # Helsinki
  image              = "debian-13"
  ansible_ssh_pubkey = var.ansible_ssh_pubkey
  ssh_admin_cidrs    = var.ssh_admin_cidrs

  labels = {
    env        = "offsite"
    group      = "offsite_hosts"
    managed-by = "terraform"
  }
}
```
- [ ] **Step 4: `outputs.tf`** (the `tf_to_inventory.py` contract — `vms` map)
```hcl
output "vms" {
  description = "Hostname → IP and Ansible group; consumed by make tf-inventory-offsite"
  value = {
    askari = {
      ip    = module.askari.ipv4_address
      group = "offsite_hosts"
    }
  }
}
```
- [ ] **Step 5: `backend.tf`**
```hcl
# Terraform state: LOCAL, on the control node (like the Proxmox envs; ADR-006).
# askari survives a homelab outage by design, so a lost state is recovered by
# `terraform import` of the running server — not a rebuild. Back the state up with
# the control node (ADR-022).
```
- [ ] **Step 6: `terraform.tfvars.example`**
```hcl
# offsite environment — non-secret values. Copy to terraform.tfvars and fill in.
#
# Secret is exported as an env var (never in this file):
# export TF_VAR_hcloud_token="$(...from vault.hetzner.token...)" # make handles this
#
# State is local (see backend.tf).
ansible_ssh_pubkey = "ssh-ed25519 AAAA... ansible@ubongo"
ssh_admin_cidrs = ["10.20.10.151/32"] # ubongo's LAN address (ADR-021)
```
- [ ] **Step 7: Format + commit**
Run: `terraform fmt terraform/environments/offsite/`
```bash
git add terraform/environments/offsite
git commit -m "feat(tf): offsite environment — askari (CAX11/hel1/debian-13)"
```
(Co-Authored-By trailer)
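
The `vms` output in Step 4 is the whole handoff contract: a map of hostname to `{ip, group}`, which `terraform output -json` wraps under a `value` key. To make that shape concrete, here is a minimal sketch of a consumer. It is not the repo's `scripts/tf_to_inventory.py`, whose internals and exact YAML layout this plan does not show; `vms_to_inventory` and the `ALLOWED_GROUPS` set are hypothetical, and only `offsite_hosts` is a group name confirmed by the plan.

```python
import json

# Hypothetical allow-list; the real script's set of groups is not shown here.
ALLOWED_GROUPS = {"offsite_hosts"}


def vms_to_inventory(tf_output_json: str) -> dict:
    """Turn `terraform output -json` text into an Ansible-style inventory dict."""
    vms = json.loads(tf_output_json)["vms"]["value"]
    children: dict = {}
    for name, attrs in vms.items():
        group = attrs["group"]
        if group not in ALLOWED_GROUPS:
            raise ValueError(f"unknown group {group!r} for host {name!r}")
        # One entry per group, each holding its hosts and their connection address.
        hosts = children.setdefault(group, {"hosts": {}})["hosts"]
        hosts[name] = {"ansible_host": attrs["ip"]}
    return {"all": {"children": children}}


if __name__ == "__main__":
    sample = {"vms": {"value": {"askari": {"ip": "203.0.113.7", "group": "offsite_hosts"}}}}
    print(json.dumps(vms_to_inventory(json.dumps(sample)), indent=2))
```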
---
### Task 4: Makefile — token injection, directory inventory, offsite handoff
**Files:**
- Modify: `Makefile`
- [ ] **Step 1: Inject the Hetzner token for `TF_ENV=offsite`**
The `tf-*` targets need `TF_VAR_hcloud_token` for offsite, sourced from the vault. Add a guarded helper variable near the `TF` definition:
```makefile
# For TF_ENV=offsite, export the Hetzner token from the vault (rbw unlocked).
# Reads vault.hetzner.token in-memory; never written to a tfvars file (CLAUDE.md).
ifeq ($(TF_ENV),offsite)
TF_TOKEN_ENV = TF_VAR_hcloud_token="$$($(VENV)/bin/ansible-vault view inventories/production/group_vars/all/vault.yml | $(VENV)/bin/python -c 'import sys,yaml; print(yaml.safe_load(sys.stdin)["vault"]["hetzner"]["token"])')"
else
TF_TOKEN_ENV =
endif
```
Then prefix the `tf-init`/`tf-plan`/`tf-apply`/`tf-output` recipes with `$(TF_TOKEN_ENV)`, e.g.:
```makefile
tf-plan:
	$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) plan
```
(Apply the same prefix to `tf-init`, `tf-apply`, `tf-output`.)
- [ ] **Step 2: Directory inventory**
Change the inventory so multiple TF envs can each generate a file:
```makefile
INVENTORY := -i inventories/production/
```
(Ansible reads every file in the directory as an inventory source and merges them; `group_vars/`/`host_vars/` remain variable dirs. Verify `ansible.cfg` does not also hard-set `inventory=`; if it does, update it to match.)
- [ ] **Step 3: `tf-inventory-offsite` target**
Add (writes the offsite hosts into the production inventory dir, beside the Proxmox-generated `hosts.yml`):
```makefile
tf-inventory-offsite:
	$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/offsite output -json \
		| $(PYTHON) scripts/tf_to_inventory.py > inventories/production/offsite.yml
	@echo "Offsite inventory written to inventories/production/offsite.yml"
```
Add `tf-inventory-offsite` to `.PHONY` and a help line.
- [ ] **Step 4: Verify existing playbooks still resolve under the directory inventory**
Run: `make check PLAYBOOK=dns 2>&1 | tail -3`
Expected: still resolves the `control` host and runs (no inventory errors). If `connection:`/group_vars break, fix before committing.
- [ ] **Step 5: Commit**
```bash
git add Makefile
git commit -m "feat(make): offsite TF token injection + directory inventory + tf-inventory-offsite"
```
(Co-Authored-By trailer)
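
The inline Python in Step 1 indexes the decrypted vault document directly. If a key is missing, the failure happens inside a command substitution, so `terraform` would likely still start with an empty `TF_VAR_hcloud_token`. As a sketch of a stricter lookup that fails with a readable message, assuming the vault has already been parsed into a `dict` (for example by `yaml.safe_load`), something like the following could replace the bare indexing. The `lookup` helper is hypothetical and not part of the repo.

```python
def lookup(doc: dict, path: list[str]) -> str:
    """Walk `path` through nested dicts and return a non-empty string value."""
    node = doc
    for depth, key in enumerate(path):
        if not isinstance(node, dict) or key not in node:
            # Report the first missing segment, e.g. "vault.hetzner".
            raise KeyError("missing key: " + ".".join(path[: depth + 1]))
        node = node[key]
    if not isinstance(node, str) or not node:
        raise ValueError(".".join(path) + " is empty or not a string")
    return node


if __name__ == "__main__":
    # Stand-in for the parsed vault.yml; the real file is ansible-vault encrypted.
    vault_doc = {"vault": {"hetzner": {"token": "example-token"}}}
    print(lookup(vault_doc, ["vault", "hetzner", "token"]))
```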
---
### Task 5: Lock the offsite inventory handoff (TDD)
**Files:**
- Test: `tests/test_tf_to_inventory.py`
- [ ] **Step 1: Write the test** (a characterization test: it locks existing behaviour, so it is expected to pass as written; see Step 2)
```python
import json
import pathlib
import subprocess
import sys

_SCRIPT = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "tf_to_inventory.py"


def _run(tf_output: dict) -> str:
    return subprocess.run(
        [sys.executable, str(_SCRIPT)],
        input=json.dumps(tf_output), capture_output=True, text=True, check=True,
    ).stdout


def test_offsite_host_lands_in_offsite_hosts():
    out = _run({"vms": {"value": {"askari": {"ip": "203.0.113.7", "group": "offsite_hosts"}}}})
    assert "offsite_hosts:" in out
    assert "askari:" in out
    assert "ansible_host: 203.0.113.7" in out


def test_unknown_group_rejected():
    proc = subprocess.run(
        [sys.executable, str(_SCRIPT)],
        input=json.dumps({"vms": {"value": {"x": {"ip": "1.2.3.4", "group": "nope"}}}}),
        capture_output=True, text=True,
    )
    assert proc.returncode == 1
    assert "unknown group" in proc.stderr
```
- [ ] **Step 2: Run it**
Run: `.venv/bin/python -m pytest tests/test_tf_to_inventory.py -v`
Expected: PASS — `tf_to_inventory.py` already supports `offsite_hosts` and rejects unknown groups (this test locks that behaviour for the M2 handoff; no code change needed). If it fails, fix `scripts/tf_to_inventory.py` minimally and report.
- [ ] **Step 3: Commit**
```bash
git add tests/test_tf_to_inventory.py
git commit -m "test(tf): lock the offsite_hosts inventory handoff"
```
(Co-Authored-By trailer)
---
### Task 6: Init, validate, plan (gated — needs terraform + token)
> Needs `terraform` installed and `rbw` unlocked. Creates **no** resources. If `terraform` is absent, defer Tasks 6-8 to ubongo.
- [ ] **Step 1: Set tfvars**
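`cp terraform/environments/offsite/terraform.tfvars.example terraform/environments/offsite/terraform.tfvars` and set `ansible_ssh_pubkey` to ubongo's real control public key and `ssh_admin_cidrs` to ubongo's address (`10.20.10.151/32`). The Hetzner Cloud Firewall matches the source address as it arrives over the public internet, so if ubongo reaches askari through NAT, use ubongo's public egress address here rather than its LAN address (see Task 8 Step 1). (`terraform.tfvars` is gitignored.)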
- [ ] **Step 2: Init (tracks the lock file)**
Run: `make tf-init TF_ENV=offsite`
Expected: providers installed; `terraform/environments/offsite/.terraform.lock.hcl` created. `git add` the lock file (tracked per CLAUDE.md).
- [ ] **Step 3: Validate + plan**
Run: `terraform -chdir=terraform/environments/offsite validate` → `Success`.
Run: `make tf-plan TF_ENV=offsite` → review: **1 server + 1 firewall + 1 ssh key to add**. Confirm CAX11/hel1/debian-13 and the SSH-from-ubongo rule.
- [ ] **Step 4: Commit the lock file**
```bash
git add terraform/environments/offsite/.terraform.lock.hcl
git commit -m "chore(tf): pin offsite provider lock (hcloud)"
```
(Co-Authored-By trailer)
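
The review in Step 3 ("1 server + 1 firewall + 1 ssh key to add") can also be checked mechanically. The sketch below counts planned actions per resource type from the JSON form of a plan, as produced by `terraform show -json <planfile>`. It assumes the plan was saved with `-out`, which the `tf-plan` recipe shown in Task 4 does not do, so treat the workflow as a suggestion; `summarize_plan` and `EXPECTED` are hypothetical names.

```python
import json
from collections import Counter


def summarize_plan(plan_json: str) -> Counter:
    """Count planned changes per (resource type, action) pair."""
    plan = json.loads(plan_json)
    counts: Counter = Counter()
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions == ["no-op"]:
            continue  # unchanged resources are not part of the review
        counts[(change["type"], "+".join(actions))] += 1
    return counts


# What Step 3 expects for a first apply of the offsite stack.
EXPECTED = Counter({
    ("hcloud_server", "create"): 1,
    ("hcloud_firewall", "create"): 1,
    ("hcloud_ssh_key", "create"): 1,
})


if __name__ == "__main__":
    import sys

    summary = summarize_plan(sys.stdin.read())
    print(dict(summary))
    sys.exit(0 if summary == EXPECTED else 1)
```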
---
### Task 7: Apply — create askari (GATED, real billed VPS)
> **Explicit user go required.** Run on ubongo. The plan from Task 6 must be reviewed first (CLAUDE.md: never apply without a shown plan).
- [ ] **Step 1: Apply**
Run: `make tf-apply TF_ENV=offsite`
Expected: `module.askari.hcloud_ssh_key.ansible`, `module.askari.hcloud_firewall.this`, and `module.askari.hcloud_server.this` created; the `vms` output shows `askari`'s IPv4.
- [ ] **Step 2: Generate the offsite inventory**
Run: `make tf-inventory-offsite`
Expected: `inventories/production/offsite.yml` written with `askari` under `offsite_hosts`.
- [ ] **Step 3: Verify the inventory merges**
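Run: `.venv/bin/ansible-inventory -i inventories/production/ --host askari` (or `--list`). (`$(INVENTORY)` is a Make variable and does not expand in an interactive shell, so the path is spelled out here.)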
Expected: `askari` present with its `ansible_host`.
- [ ] **Step 4: Commit the generated inventory**
```bash
git add inventories/production/offsite.yml
git commit -m "chore(inventory): askari in offsite_hosts (generated)"
```
(Co-Authored-By trailer)
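
The merge check in Step 3 can be scripted against the JSON that `ansible-inventory --list` prints, where each group is a top-level key with a `hosts` list and per-host variables sit under `_meta.hostvars`. A minimal sketch follows; `host_address` is a hypothetical helper and the sample document is made up for illustration.

```python
import json


def host_address(inventory_json: str, host: str, group: str) -> str:
    """Return the host's ansible_host if it is a member of `group`."""
    data = json.loads(inventory_json)
    members = data.get(group, {}).get("hosts", [])
    if host not in members:
        raise LookupError(f"{host} is not in group {group}")
    return data["_meta"]["hostvars"][host]["ansible_host"]


if __name__ == "__main__":
    # Shape of `ansible-inventory -i inventories/production/ --list` output,
    # reduced to the keys this check reads.
    sample = {
        "_meta": {"hostvars": {"askari": {"ansible_host": "203.0.113.7"}}},
        "all": {"children": ["offsite_hosts", "ungrouped"]},
        "offsite_hosts": {"hosts": ["askari"]},
    }
    print(host_address(json.dumps(sample), "askari", "offsite_hosts"))
```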
---
### Task 8: Bootstrap askari (GATED — needs the live host)
> Run on ubongo after Task 7. `rbw` unlocked.
- [ ] **Step 1: Reach it**
Run: `ssh ansible@<askari-ip>` (cloud-init created the `ansible` user with ubongo's key) — expect a shell. If refused, check the firewall `ssh_admin_cidrs` matches ubongo's egress IP.
- [ ] **Step 2: Bootstrap**
Run: `make check PLAYBOOK=bootstrap` (review) then `make deploy PLAYBOOK=bootstrap` — expect the `ansible` user + sudoers confirmed/created on askari (idempotent).
- [ ] **Step 3: No repo commit** — this configures the host, not the repo. (`base` subset = M3.)
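
Step 1 points at ubongo's egress IP when SSH is refused. One likely cause is an RFC 1918 address in `ssh_admin_cidrs`: such addresses are not routed over the public internet, so a cloud firewall in front of askari would never see them as a source. The sketch below flags those entries before an apply. It is a pre-flight idea rather than part of the plan, and `private_cidrs` is a hypothetical helper.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def private_cidrs(cidrs: list[str]) -> list[str]:
    """Return the entries that fall entirely inside an RFC 1918 range."""
    flagged = []
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)  # raises ValueError on malformed input
        # subnet_of() requires matching IP versions, so skip IPv6 entries.
        if net.version == 4 and any(net.subnet_of(block) for block in RFC1918):
            flagged.append(cidr)
    return flagged


if __name__ == "__main__":
    for cidr in private_cidrs(["10.20.10.151/32", "198.51.100.7/32"]):
        print(f"warning: {cidr} is a private address; use the public egress IP")
```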
---
### Task 9: ADR amendments + STATUS
**Files:**
- Modify: `docs/decisions/006-terraform.md`, `009-provisioning-handoff.md`, `020-firewall.md`, `007-network.md`, `016-mesh-vpn.md`, `STATUS.md`
For each: **Read the relevant section first**, then apply the change.
- [ ] **Step 1: ADR-006 — generalize the provider scope**
In the **Providers** section, the line "`bpg/proxmox` … This is the only provider." → note a second provider:
```
**`hetznercloud/hcloud`**: owns off-site VM existence (`askari`). ADR-006's scope is
**Proxmox + Hetzner** — "Terraform owns VM existence" generalizes across providers; the
`offsite` environment + `hetzner_vm` module live alongside the Proxmox env + module.
```
Also adjust the Context line "creating and destroying VMs on Proxmox" → "on Proxmox and Hetzner".
- [ ] **Step 2: ADR-009 — offsite handoff**
Add a note that `offsite` is a TF environment whose `vms` output feeds `offsite_hosts` via `tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`), and that the production inventory is a **directory** merging the Proxmox + offsite generated files.
- [ ] **Step 3: ADR-020 — askari's perimeter**
Note that off-cluster `askari` has no OPNsense; its **perimeter** is a TF-managed Hetzner Cloud Firewall (SSH-from-ubongo now; NetBird ports in M4). The `group_vars` catalog stays authoritative for the host nftables layer.
- [ ] **Step 4: ADR-007 / ADR-016 — askari is TF-provisioned**
Replace "provisioned … independently … added manually" wording for askari with "provisioned as Terraform IaC (hcloud), managed independently of the Proxmox cluster (own provider + state)."
- [ ] **Step 5: STATUS.md**
Update askari's row to reflect how far Tasks 7 and 8 got. If applied: list `askari` under "Real and working today" as **Built + applied** (CAX11/hel1/debian-13, cloud firewall SSH-from-ubongo, bootstrapped, in `offsite_hosts`). If only authored (apply deferred): note that the TF is written and `tf-plan` is clean, with apply pending on ubongo.
- [ ] **Step 6: Lint + commit**
Run: `make lint` (must pass).
```bash
git add docs/decisions/006-terraform.md docs/decisions/009-provisioning-handoff.md \
docs/decisions/020-firewall.md docs/decisions/007-network.md \
docs/decisions/016-mesh-vpn.md STATUS.md
git commit -m "docs(askari): amend ADR-006/009/020/007/016 for TF-provisioned offsite host; STATUS"
```
(Co-Authored-By trailer)
---
## Self-Review (completed)
- **Spec coverage:** TF owns existence / generalize ADR-006 (Decision 1) → Tasks 2,3,9; CAX11/hel1/debian-13 (Decision 2) → Task 3; TF cloud firewall, SSH-from-ubongo, NetBird ports later (Decision 3) → Task 2 + Task 9 ADR-020; token via `TF_VAR_hcloud_token` from vault (Decision 4) → Task 4; ADR-009 handoff via `tf_to_inventory` (Decision 5) → Tasks 4,5,7; cloud-init `ansible` user + bootstrap → Tasks 2,8; state + DR (import) → Task 3 backend; ADR amendments → Task 9. All covered.
- **Placeholder scan:** none — HCL, make, and test content are concrete. `<askari-ip>`/`<source>`/`<date>` are runtime/verification values, not unspecified logic.
- **Type/name consistency:** module vars (`name`, `server_type`, `location`, `image`, `ansible_ssh_pubkey`, `ssh_admin_cidrs`, `labels`) match between module + env call; the `vms` output shape (`{ip, group}`) matches `tf_to_inventory.py`'s contract; `TF_VAR_hcloud_token` → `var.hcloud_token`; `vault.hetzner.token` matches the stored key.
- **Notes for the implementer:** (a) confirm Ansible merges the directory inventory's two files so `askari` resolves (Task 7 Step 3); (b) verify `hcloud_server` arg names against the pinned provider version (Task 1) and adjust `public_net`/`firewall_ids` if the provider differs; (c) Tasks 7-8 create a billed VPS and are gated on explicit go.