diff --git a/docs/ROADMAP.md b/docs/ROADMAP.md
index 67efb16..b879d00 100644
--- a/docs/ROADMAP.md
+++ b/docs/ROADMAP.md
@@ -79,13 +79,18 @@ zero-risk and *born at Gandi*.
 
 ### M2 · `askari` provisioned + under Ansible
 
-Spin up the Hetzner VPS; bring it under Ansible in the `offsite_hosts` group; bootstrap it.
+Provision the Hetzner VPS **as IaC with Terraform** (CAX11 ARM / Helsinki / Debian 13,
+behind a TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap
+it. Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
 
-- **Proves:** the `offsite_hosts` pattern, bootstrap of a non-cluster host, rbw/vault
-  against a brand-new host. Regenerates the inventory stubs (closes review finding O6 —
-  `offsite_hosts` missing from `hosts.yml`).
-- **Maps to:** ADR-007 (`askari` role), ADR-009 (provisioning handoff), ADR-015/016,
-  TODO 5 (control-node-style bootstrap, reused).
+- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
+  existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
+  module, `offsite` stack). Token via `TF_VAR_hcloud_token` from `vault.hetzner.token`.
+- **Proves:** the `offsite_hosts` pattern, the TF→Ansible handoff for a non-Proxmox host
+  (`tf_to_inventory.py` extended), bootstrap of a non-cluster host. Closes review finding
+  O6 (`offsite_hosts` missing from `hosts.yml`).
+- **Amends:** ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud
+  Firewall = perimeter), ADR-007/016 (`askari` TF-provisioned, not "added manually").
 
 ### M3 · `base` matured to a "remote-access-sufficient" subset
 
diff --git a/docs/superpowers/specs/2026-06-14-askari-provisioning-design.md b/docs/superpowers/specs/2026-06-14-askari-provisioning-design.md
new file mode 100644
index 0000000..31e596a
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-14-askari-provisioning-design.md
@@ -0,0 +1,146 @@
+# Design — Provisioning `askari` (Terraform + Hetzner Cloud)
+
+- **Date:** 2026-06-14
+- **Status:** Draft for review — design settled in brainstorming; pending user review,
+  then implementation plan
+- **Roadmap milestone:** M2 (`docs/ROADMAP.md`)
+- **Amends:** ADR-006 (Terraform scope → Proxmox **+ Hetzner**), ADR-009 (offsite
+  handoff), ADR-020 (Hetzner Cloud Firewall = askari's perimeter), ADR-007/016 (`askari`
+  is Terraform-provisioned, not "added manually")
+- **Becomes:** amendments to those ADRs
+
+---
+
+## Problem
+
+`askari` (the off-site Hetzner VPS — NetBird coordinator + watchdog, later the off-site
+log subset) does not exist yet. ADR-007/016 designed it as "provisioned independently…
+added manually." Now that there's a dedicated Hetzner account + a verified API token in
+the vault, we can provision it as **IaC** instead. boma's principle (ADR-006/009) is
+"**Terraform owns VM existence; Ansible owns config**" — but scoped to Proxmox. This
+milestone **generalizes that principle to Hetzner** and stands `askari` up.
+
+## Decisions (as settled)
+
+1. **Terraform owns `askari`'s existence** (Approach 1) — generalize ADR-006 from "Proxmox
+   VM existence" to "VM existence on **Proxmox + Hetzner**." (Rejected: Ansible
+   `hetzner.hcloud` — breaks the TF/Ansible boundary; `hcloud` CLI — not stateful IaC.)
+2. **Server:** **CAX11** (ARM/Ampere, 2 vCPU / 4 GB / 40 GB), **Helsinki (`hel1`)**,
+   **Debian 13**. Rescale up later if the off-site log subset needs it.
+3. **TF-managed Hetzner Cloud Firewall** as `askari`'s perimeter (the off-site
+   OPNsense-analog). Starts minimal (**SSH from ubongo only**); service ports are added as
+   services land (NetBird ports in M4). The ADR-020 catalog stays authoritative for the
+   **host nftables** layer.
+4. **Token via `TF_VAR_hcloud_token`**, sourced from `vault.hetzner.token` at apply time
+   — never in `.tfvars` (CLAUDE.md).
+5. **Handoff stays ADR-009-shaped:** `tf_to_inventory.py` is extended to emit `askari`
+   into `offsite_hosts`, so `hosts.yml` stays fully generated.
+
+## Verified facts (ADR-014)
+
+> verified: Hetzner Cloud entry tiers · WebSearch · 2026-06-14 · **CAX11** (ARM/Ampere)
+> 2 vCPU / 4 GB / 40 GB ≈ €3.79/mo, 20 TB traffic + 1 IPv4; ARM (CAX) is **EU-locations
+> only** (incl. `hel1`). Price change for new orders from 2026-06-15.
+
+> to verify when writing the role (ADR-014): the `hetznercloud/hcloud` provider version
+> to pin; the Debian 13 image slug (expected `debian-13`); CAX11 availability in `hel1`.
+
+## Architecture
+
+### Terraform structure
+
+- **Module `terraform/modules/hetzner_vm/`** (sibling to `proxmox_vm`): inputs `name`,
+  `server_type`, `location`, `image`, `ssh_keys`, `user_data`, `firewall_rules`,
+  `labels`; outputs the server's `ipv4` (+ id, name).
+- **Stack `terraform/environments/offsite/`** (its own **local state** on ubongo,
+  gitignored): `providers.tf` pins **`hetznercloud/hcloud`**; `main.tf` calls
+  `hetzner_vm` for `askari` + an `hcloud_firewall` + an `hcloud_ssh_key`; `variables.tf`
+  (incl. `hcloud_token`, `control_ssh_pubkey`, `ssh_admin_cidr`); `outputs.tf` (askari
+  `ipv4`, for the handoff + DNS); `backend.tf` (local state, like the Proxmox envs).
+- **`make tf-* TF_ENV=offsite`** drives it; for `offsite` the targets first export
+  `TF_VAR_hcloud_token` from `vault.hetzner.token` (a small vault→env step). `tf-apply`
+  stays gated behind a shown `tf-plan` (CLAUDE.md).
+
+### Provisioning → Ansible handoff
+
+1. TF creates the CAX11 with a **cloud-init `user_data`** that injects **ubongo's control
+   SSH public key** for first login (minimal — no config beyond the key + ensuring
+   Python is present for Ansible).
+2. TF outputs `askari`'s public IPv4. `tf_to_inventory.py` (extended for the offsite
+   stack) writes `askari` into the `offsite_hosts` group of `hosts.yml`.
+3. `playbooks/bootstrap.yml` runs against `askari` → creates the `ansible` user + sudoers
+   (as for Proxmox hosts). **Where M2 ends.**
+4. *(Downstream, not M2):* `base` remote-access subset (M3), NetBird coordinator (M4),
+   mesh enrollment + SSH-narrowed-to-`wt0` (M5).
+- A convenience **`askari.wingu.me` A record** is added via the M1 `public_dns` role
+  (stable name for humans + future certs); the inventory may reference it once DNS exists.
+
+### Cloud firewall (perimeter)
+
+- TF `hcloud_firewall` attached to `askari`:
+  - **inbound SSH (22/tcp) from ubongo's address only** (`ssh_admin_cidr` var);
+  - everything else default-deny.
+- **Grows with services:** NetBird's **UDP 3478** (Coturn) + **TCP 80/443**
+  (management/dashboard) are added in **M4** when the coordinator deploys — not opened to
+  a non-existent listener now.
+- This is the off-site **perimeter** layer (OPNsense has no presence off-cluster);
+  ADR-020's `group_vars` catalog remains the single source for the **host nftables**
+  layer that `base` renders (M3).
+
+### State + disaster recovery
+
+- The `offsite` `terraform.tfstate` lives on ubongo and is added to the **ADR-022 backup
+  scope** (the control-node TF state backup already flagged in STATUS).
+- DR is management-only: `askari` survives a homelab/ubongo outage by design, so a lost
+  state is recovered by `terraform import`-ing the still-running server — no rebuild.
+
+## Division of labour & access
+
+| Task | Who | How |
+|---|---|---|
+| Hetzner token | Done | `vault.hetzner.token` (verified live, HTTP 200). |
+| `hetzner_vm` module + `offsite` stack + `tf_to_inventory` extension + make token-inject | Agent | Committed IaC + a pytest for the handoff. |
+| `terraform plan` (offsite) | Agent | `make tf-plan TF_ENV=offsite`, **output shown**. |
+| `terraform apply` (offsite) | Human-gated | Only after the plan is reviewed (CLAUDE.md: never apply without a shown plan). Run on ubongo. |
+| Confirm the control SSH key | Human | Which ubongo key Ansible uses to reach hosts (its public key feeds `control_ssh_pubkey`). |
+
+- **Token:** `TF_VAR_hcloud_token` from vault at apply; never written to a `.tfvars` file.
+- **SSH:** cloud-init injects only the control public key; the private key stays on
+  ubongo. The cloud firewall limits SSH to ubongo's address until the mesh exists.
+
+## Testing & verification
+
+- `terraform fmt` + **`terraform validate`** + **`make tf-plan TF_ENV=offsite`** (plan
+  reviewed before any apply).
+- **pytest** for the `tf_to_inventory.py` offsite extension (mirrors the existing
+  stdlib-only script tests), asserting an `askari` entry lands in `offsite_hosts`.
+- Post-apply: SSH reachability from ubongo; cloud-init ran; then `bootstrap.yml`
+  connectivity. (`base`/NetBird get their own Molecule/verify in M3/M4.)
+
+## Scope boundaries — what M2 is NOT
+
+- **Not** the `base` hardening subset (SSH hardening, fail2ban, NetBird agent) — **M3**.
+- **Not** the NetBird coordinator or the cloud-firewall NetBird ports — **M4**.
+- **Not** mesh enrollment / narrowing SSH to `wt0` — **M5**.
+- **Not** the off-site log subset (may need a bigger instance / a volume) — later.
+
+## ADR work
+
+- **ADR-006** — generalize "Terraform owns VM existence" to **Proxmox + Hetzner**; add the
+  `hetznercloud/hcloud` provider (no longer "the only provider is `bpg/proxmox`"); add the
+  `offsite` environment + `hetzner_vm` module to Structure; note the TF-managed Hetzner
+  Cloud Firewall.
+- **ADR-009** — the offsite handoff (`tf_to_inventory.py` emits `askari` → `offsite_hosts`).
+- **ADR-020** — the Hetzner Cloud Firewall is `askari`'s perimeter (OPNsense-analog);
+  catalog still authoritative for host nftables.
+- **ADR-007 / ADR-016** — `askari` is Terraform-provisioned (hcloud), superseding "added
+  manually."
+
+## Open items (resolve during the plan / implementation)
+
+- **Pin** the `hetznercloud/hcloud` provider version; confirm the `debian-13` image slug
+  and CAX11/`hel1` availability (ADR-014).
+- The **make tf token-inject** mechanism for `offsite` (read `vault.hetzner.token` → export
+  `TF_VAR_hcloud_token`) — shape it in the plan (rbw/ansible-vault one-liner vs a wrapper).
+- Whether the inventory references `askari` by **IPv4 (from TF output)** or by
+  **`askari.wingu.me`** once the DNS record exists — decide in the plan.
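
Illustrative note on the handoff (not part of the patch): the design says `tf_to_inventory.py` is extended so that `askari` lands in `offsite_hosts`, with a stdlib-only pytest asserting that. The sketch below shows one way that step could look. It is not the repository's actual script. The function names `offsite_inventory` and `render_yaml`, the Terraform output name `askari_ipv4`, the use of `ansible_host` as the host variable, and the sample address (from the 203.0.113.0/24 documentation range) are all assumptions made for illustration. The only external format relied on is that `terraform output -json` wraps each output in an object with a `value` key.

```python
import ipaddress
import json


def offsite_inventory(tf_output_json: str, host: str = "askari",
                      output_name: str = "askari_ipv4") -> dict:
    """Build the `offsite_hosts` group from `terraform output -json` text.

    `output_name` is a hypothetical Terraform output name; the design only
    says the offsite stack outputs askari's `ipv4`.
    """
    outputs = json.loads(tf_output_json)
    # Each output is wrapped as {"sensitive": ..., "type": ..., "value": ...}.
    ipv4 = outputs[output_name]["value"]
    # Fail early on anything that is not a valid IPv4 address.
    ipaddress.IPv4Address(ipv4)
    return {"offsite_hosts": {"hosts": {host: {"ansible_host": ipv4}}}}


def render_yaml(node: dict, indent: int = 0) -> str:
    """Render a nested dict of plain strings as block-style YAML (stdlib only)."""
    lines = []
    pad = " " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(render_yaml(value, indent + 2))
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


# Sample shaped like `terraform output -json` for the offsite stack (assumed).
SAMPLE_TF_OUTPUT = json.dumps({
    "askari_ipv4": {"sensitive": False, "type": "string", "value": "203.0.113.10"},
})
```

A pytest for the handoff could call `offsite_inventory(SAMPLE_TF_OUTPUT)` and assert that `askari` appears under `offsite_hosts`. Whether the inventory should carry the IPv4 or `askari.wingu.me` remains the open item listed in the design, so the `ansible_host` value here is only a placeholder for that decision.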