docs(spec): M2 — provision askari via Terraform + Hetzner Cloud
askari is provisioned as IaC: Terraform owns its existence too, generalizing ADR-006 from "Proxmox VM existence" to Proxmox + Hetzner (new hetznercloud/hcloud provider, hetzner_vm module, offsite stack with local state). CAX11 (ARM) in Helsinki on Debian 13, behind a TF-managed Hetzner Cloud Firewall (SSH-from-ubongo now; NetBird ports in M4). Token via TF_VAR_hcloud_token from vault.hetzner.token. Handoff stays ADR-009-shaped (tf_to_inventory.py extended to emit askari into offsite_hosts). State in the ADR-022 backup scope; DR via terraform import. Amends ADR-006/009/020/007/016. Point ROADMAP.md M2 at the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 32d480efcf
commit 602550fdaa
2 changed files with 157 additions and 6 deletions
@@ -79,13 +79,18 @@ zero-risk and *born at Gandi*.
|
|||
|
||||
### M2 · `askari` provisioned + under Ansible
|
||||
|
||||
Spin up the Hetzner VPS; bring it under Ansible in the `offsite_hosts` group; bootstrap it.
|
||||
Provision the Hetzner VPS **as IaC with Terraform** (CAX11 ARM / Helsinki / Debian 13,
|
||||
behind a TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap
|
||||
it. Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
|
||||
|
||||
- **Proves:** the `offsite_hosts` pattern, bootstrap of a non-cluster host, rbw/vault
|
||||
against a brand-new host. Regenerates the inventory stubs (closes review finding O6 —
|
||||
`offsite_hosts` missing from `hosts.yml`).
|
||||
- **Maps to:** ADR-007 (`askari` role), ADR-009 (provisioning handoff), ADR-015/016,
|
||||
TODO 5 (control-node-style bootstrap, reused).
|
||||
- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
|
||||
existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
|
||||
module, `offsite` stack). Token via `TF_VAR_hcloud_token` from `vault.hetzner.token`.
|
||||
- **Proves:** the `offsite_hosts` pattern, the TF→Ansible handoff for a non-Proxmox host
|
||||
(`tf_to_inventory.py` extended), bootstrap of a non-cluster host. Closes review finding
|
||||
O6 (`offsite_hosts` missing from `hosts.yml`).
|
||||
- **Amends:** ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud
|
||||
Firewall = perimeter), ADR-007/016 (`askari` TF-provisioned, not "added manually").
|
||||
|
||||
### M3 · `base` matured to a "remote-access-sufficient" subset
|
||||
|
||||
146
docs/superpowers/specs/2026-06-14-askari-provisioning-design.md
Normal file
@@ -0,0 +1,146 @@
|
|||
# Design — Provisioning `askari` (Terraform + Hetzner Cloud)
|
||||
|
||||
- **Date:** 2026-06-14
|
||||
- **Status:** Draft for review — design settled in brainstorming; pending user review,
|
||||
then implementation plan
|
||||
- **Roadmap milestone:** M2 (`docs/ROADMAP.md`)
|
||||
- **Amends:** ADR-006 (Terraform scope → Proxmox **+ Hetzner**), ADR-009 (offsite
|
||||
handoff), ADR-020 (Hetzner Cloud Firewall = askari's perimeter), ADR-007/016 (`askari`
|
||||
is Terraform-provisioned, not "added manually")
|
||||
- **Becomes:** amendments to those ADRs
|
||||
|
||||
---
|
||||
|
||||
## Problem
|
||||
|
||||
`askari` (the off-site Hetzner VPS — NetBird coordinator + watchdog, later the off-site
|
||||
log subset) does not exist yet. ADR-007/016 designed it as "provisioned independently…
|
||||
added manually." Now that there's a dedicated Hetzner account + a verified API token in
|
||||
the vault, we can provision it as **IaC** instead. boma's principle (ADR-006/009) is
|
||||
"**Terraform owns VM existence; Ansible owns config**" — but scoped to Proxmox. This
|
||||
milestone **generalizes that principle to Hetzner** and stands `askari` up.
|
||||
|
||||
## Decisions (as settled)
|
||||
|
||||
1. **Terraform owns `askari`'s existence** (Approach 1) — generalize ADR-006 from "Proxmox
|
||||
VM existence" to "VM existence on **Proxmox + Hetzner**." (Rejected: Ansible
|
||||
`hetzner.hcloud` — breaks the TF/Ansible boundary; `hcloud` CLI — not stateful IaC.)
|
||||
2. **Server:** **CAX11** (ARM/Ampere, 2 vCPU / 4 GB / 40 GB), **Helsinki (`hel1`)**,
|
||||
**Debian 13**. Rescale up later if the off-site log subset needs it.
|
||||
3. **TF-managed Hetzner Cloud Firewall** as `askari`'s perimeter (the off-site
|
||||
OPNsense-analog). Starts minimal (**SSH from ubongo only**); service ports are added as
|
||||
services land (NetBird ports in M4). The ADR-020 catalog stays authoritative for the
|
||||
**host nftables** layer.
|
||||
4. **Token via `TF_VAR_hcloud_token`**, sourced from `vault.hetzner.token` at apply time
|
||||
— never in `.tfvars` (CLAUDE.md).
|
||||
5. **Handoff stays ADR-009-shaped:** `tf_to_inventory.py` is extended to emit `askari`
|
||||
into `offsite_hosts`, so `hosts.yml` stays fully generated.
|
||||
|
||||
## Verified facts (ADR-014)
|
||||
|
||||
> verified: Hetzner Cloud entry tiers · WebSearch · 2026-06-14 · **CAX11** (ARM/Ampere)
|
||||
> 2 vCPU / 4 GB / 40 GB ≈ €3.79/mo, 20 TB traffic + 1 IPv4; ARM (CAX) is **EU-locations
|
||||
> only** (incl. `hel1`). Price change for new orders from 2026-06-15.
|
||||
|
||||
> to verify when writing the role (ADR-014): the `hetznercloud/hcloud` provider version
|
||||
> to pin; the Debian 13 image slug (expected `debian-13`); CAX11 availability in `hel1`.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Terraform structure
|
||||
|
||||
- **Module `terraform/modules/hetzner_vm/`** (sibling to `proxmox_vm`): inputs `name`,
|
||||
`server_type`, `location`, `image`, `ssh_keys`, `user_data`, `firewall_rules`,
|
||||
`labels`; outputs the server's `ipv4` (+ id, name).
|
||||
- **Stack `terraform/environments/offsite/`** (its own **local state** on ubongo,
|
||||
gitignored): `providers.tf` pins **`hetznercloud/hcloud`**; `main.tf` calls
|
||||
`hetzner_vm` for `askari` + an `hcloud_firewall` + an `hcloud_ssh_key`; `variables.tf`
|
||||
(incl. `hcloud_token`, `control_ssh_pubkey`, `ssh_admin_cidr`); `outputs.tf` (askari
|
||||
`ipv4`, for the handoff + DNS); `backend.tf` (local state, like the Proxmox envs).
|
||||
- **`make tf-* TF_ENV=offsite`** drives it; for `offsite` the targets first export
|
||||
`TF_VAR_hcloud_token` from `vault.hetzner.token` (a small vault→env step). `tf-apply`
|
||||
stays gated behind a shown `tf-plan` (CLAUDE.md).
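
The vault→env step could be sketched as below. This is only an illustration under stated assumptions: the helper names `terraform_env` and `run_terraform` are hypothetical, and how the token is actually read from `vault.hetzner.token` (rbw, ansible-vault, or a wrapper) is still an open item, so the token is simply passed in as a string. The point it shows is that the secret travels only through the child process environment, never through a `.tfvars` file.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional, Sequence


def terraform_env(token: str, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with TF_VAR_hcloud_token set.

    Hypothetical helper: the caller is assumed to have fetched `token`
    from the vault already.
    """
    cleaned = token.strip()
    if not cleaned:
        raise ValueError("empty Hetzner token; refusing to run Terraform")
    env = dict(os.environ if base_env is None else base_env)
    env["TF_VAR_hcloud_token"] = cleaned
    return env


def run_terraform(args: Sequence[str], token: str, workdir: str) -> int:
    """Run `terraform <args>` in `workdir`, injecting the token via env only."""
    completed = subprocess.run(
        ["terraform", *args],
        cwd=workdir,
        env=terraform_env(token),
        check=False,  # let the caller decide how to handle a non-zero exit
    )
    return completed.returncode
```

A Makefile target for `TF_ENV=offsite` might call something like `run_terraform(["plan"], token, "terraform/environments/offsite")`; the real shape is left to the implementation plan.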
|
||||
|
||||
### Provisioning → Ansible handoff
|
||||
|
||||
1. TF creates the CAX11 with a **cloud-init `user_data`** that injects **ubongo's control
|
||||
SSH public key** for first login (minimal — no config beyond the key + ensuring
|
||||
Python is present for Ansible).
|
||||
2. TF outputs `askari`'s public IPv4. `tf_to_inventory.py` (extended for the offsite
|
||||
stack) writes `askari` into the `offsite_hosts` group of `hosts.yml`.
|
||||
3. `playbooks/bootstrap.yml` runs against `askari` → creates the `ansible` user + sudoers
(as for Proxmox hosts). **This is where M2 ends.**
|
||||
4. *(Downstream, not M2):* `base` remote-access subset (M3), NetBird coordinator (M4),
|
||||
mesh enrollment + SSH-narrowed-to-`wt0` (M5).
|
||||
- A convenience **`askari.wingu.me` A record** is added via the M1 `public_dns` role
|
||||
(stable name for humans + future certs); the inventory may reference it once DNS exists.
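
The handoff in step 2 can be pictured with a small stdlib-only sketch. It is not the real `tf_to_inventory.py` extension: the function name `offsite_group` and the Terraform output name `askari_ipv4` are assumptions made for illustration. It relies only on the documented shape of `terraform output -json`, where each output is an object carrying a `value` key, and on the standard Ansible inventory layout of `group → hosts → host vars`.

```python
import json
from typing import Any, Dict


def offsite_group(tf_output_json: str, output_name: str = "askari_ipv4") -> Dict[str, Any]:
    """Build the `offsite_hosts` inventory fragment from `terraform output -json`.

    `output_name` is a hypothetical output name; the real stack may differ.
    """
    outputs = json.loads(tf_output_json)
    if output_name not in outputs:
        raise KeyError(f"Terraform output {output_name!r} not found")
    ipv4 = outputs[output_name]["value"]
    return {
        "offsite_hosts": {
            "hosts": {
                "askari": {"ansible_host": ipv4},
            }
        }
    }


# Example input in the shape `terraform output -json` produces
# (203.0.113.10 is a documentation-only address).
sample = json.dumps(
    {"askari_ipv4": {"sensitive": False, "type": "string", "value": "203.0.113.10"}}
)
fragment = offsite_group(sample)
```

The fragment would then be merged under the inventory's `children` alongside the Proxmox-derived groups, so `hosts.yml` stays fully generated.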
|
||||
|
||||
### Cloud firewall (perimeter)
|
||||
|
||||
- TF `hcloud_firewall` attached to `askari`:
  - **inbound SSH (22/tcp) from ubongo's address only** (`ssh_admin_cidr` var);
  - everything else default-deny.
|
||||
- **Grows with services:** NetBird's **UDP 3478** (Coturn) + **TCP 80/443**
|
||||
(management/dashboard) are added in **M4** when the coordinator deploys — not opened to
|
||||
a non-existent listener now.
|
||||
- This is the off-site **perimeter** layer (OPNsense has no presence off-cluster);
|
||||
ADR-020's `group_vars` catalog remains the single source for the **host nftables**
|
||||
layer that `base` renders (M3).
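
To make the "grows with services" idea concrete, the rule set can be thought of as data that gains entries per milestone. The sketch below is hypothetical: the function `firewall_rules` and the `netbird` flag are invented for illustration, and opening the M4 ports to any source is an assumption the spec does not state. The dictionary keys mirror the `rule` block attributes of the `hcloud_firewall` resource (`direction`, `protocol`, `port`, `source_ips`).

```python
import ipaddress
from typing import Dict, List

# Assumption: the NetBird coordinator ports are public in M4.
ANYWHERE = ["0.0.0.0/0", "::/0"]


def firewall_rules(ssh_admin_cidr: str, netbird: bool = False) -> List[Dict[str, object]]:
    """Return inbound rules; anything not listed is dropped by the cloud firewall."""
    # Raises ValueError on a malformed CIDR before it reaches Terraform.
    ipaddress.ip_network(ssh_admin_cidr, strict=False)

    # M2: SSH from ubongo's address only.
    rules: List[Dict[str, object]] = [
        {"direction": "in", "protocol": "tcp", "port": "22", "source_ips": [ssh_admin_cidr]},
    ]
    if netbird:
        # M4: ports added only once the coordinator actually listens.
        rules += [
            {"direction": "in", "protocol": "udp", "port": "3478", "source_ips": list(ANYWHERE)},
            {"direction": "in", "protocol": "tcp", "port": "80", "source_ips": list(ANYWHERE)},
            {"direction": "in", "protocol": "tcp", "port": "443", "source_ips": list(ANYWHERE)},
        ]
    return rules


m2_rules = firewall_rules("198.51.100.7/32")
m4_rules = firewall_rules("198.51.100.7/32", netbird=True)
```

In the real stack this list would live in Terraform as the `firewall_rules` input of `hetzner_vm`; the Python form here only shows the intended progression from M2 to M4.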
|
||||
|
||||
### State + disaster recovery
|
||||
|
||||
- The `offsite` `terraform.tfstate` lives on ubongo and is added to the **ADR-022 backup
|
||||
scope** (the control-node TF state backup already flagged in STATUS).
|
||||
- DR is management-only: `askari` survives a homelab/ubongo outage by design, so lost
state is recovered by running `terraform import` against the still-running server; no
rebuild is needed.
|
||||
|
||||
## Division of labour & access
|
||||
|
||||
| Task | Who | How |
|---|---|---|
| Hetzner token | Done | `vault.hetzner.token` (verified live, HTTP 200). |
| `hetzner_vm` module + `offsite` stack + `tf_to_inventory` extension + make token-inject | Agent | Committed IaC + a pytest for the handoff. |
| `terraform plan` (offsite) | Agent | `make tf-plan TF_ENV=offsite`, **output shown**. |
| `terraform apply` (offsite) | Human-gated | Only after the plan is reviewed (CLAUDE.md: never apply without a shown plan). Run on ubongo. |
| Confirm the control SSH key | Human | Which ubongo key Ansible uses to reach hosts (its public key feeds `control_ssh_pubkey`). |
|
||||
|
||||
- **Token:** `TF_VAR_hcloud_token` from vault at apply; never written to a `.tfvars` file.
|
||||
- **SSH:** cloud-init injects only the control public key; the private key stays on
|
||||
ubongo. The cloud firewall limits SSH to ubongo's address until the mesh exists.
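
A minimal `user_data` along these lines could be rendered as follows. This is a sketch, not the committed implementation: `render_user_data` is a hypothetical helper, and it assumes the image's default cloud-init user is the account that should receive the key. It uses only the standard cloud-config keys `ssh_authorized_keys` and `packages`, and refuses anything that looks like private key material.

```python
import json

# Common OpenSSH public key prefixes accepted by this sketch.
_PUBKEY_PREFIXES = ("ssh-ed25519 ", "ssh-rsa ", "ecdsa-sha2-")


def render_user_data(control_ssh_pubkey: str) -> str:
    """Render a minimal #cloud-config: one public key plus Python for Ansible."""
    key = control_ssh_pubkey.strip()
    if "PRIVATE KEY" in key or not key.startswith(_PUBKEY_PREFIXES):
        raise ValueError("expected a single OpenSSH public key")
    lines = [
        "#cloud-config",
        "ssh_authorized_keys:",
        # json.dumps yields a double-quoted scalar, which is also valid YAML.
        f"  - {json.dumps(key)}",
        "packages:",
        "  - python3",
        "",
    ]
    return "\n".join(lines)


# Placeholder key body for illustration only.
user_data = render_user_data("ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAexample ansible@ubongo")
```

The rendered string would be passed to the `user_data` input of `hetzner_vm`; everything beyond the key and Python stays with `bootstrap.yml` and later roles.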
|
||||
|
||||
## Testing & verification
|
||||
|
||||
- `terraform fmt` + **`terraform validate`** + **`make tf-plan TF_ENV=offsite`** (plan
|
||||
reviewed before any apply).
|
||||
- **pytest** for the `tf_to_inventory.py` offsite extension (mirrors the existing
|
||||
stdlib-only script tests), asserting an `askari` entry lands in `offsite_hosts`.
|
||||
- Post-apply: confirm SSH reachability from ubongo and that cloud-init ran, then confirm
`bootstrap.yml` can connect. (`base`/NetBird get their own Molecule/verify in M3/M4.)
|
||||
|
||||
## Scope boundaries — what M2 is NOT
|
||||
|
||||
- **Not** the `base` hardening subset (SSH hardening, fail2ban, NetBird agent) — **M3**.
|
||||
- **Not** the NetBird coordinator or the cloud-firewall NetBird ports — **M4**.
|
||||
- **Not** mesh enrollment / narrowing SSH to `wt0` — **M5**.
|
||||
- **Not** the off-site log subset (may need a bigger instance / a volume) — later.
|
||||
|
||||
## ADR work
|
||||
|
||||
- **ADR-006** — generalize "Terraform owns VM existence" to **Proxmox + Hetzner**; add the
|
||||
`hetznercloud/hcloud` provider (no longer "the only provider is `bpg/proxmox`"); add the
|
||||
`offsite` environment + `hetzner_vm` module to Structure; note the TF-managed Hetzner
|
||||
Cloud Firewall.
|
||||
- **ADR-009** — the offsite handoff (`tf_to_inventory.py` emits `askari` → `offsite_hosts`).
|
||||
- **ADR-020** — the Hetzner Cloud Firewall is `askari`'s perimeter (OPNsense-analog);
|
||||
catalog still authoritative for host nftables.
|
||||
- **ADR-007 / ADR-016** — `askari` is Terraform-provisioned (hcloud), superseding "added
|
||||
manually."
|
||||
|
||||
## Open items (resolve during the plan / implementation)
|
||||
|
||||
- **Pin** the `hetznercloud/hcloud` provider version; confirm the `debian-13` image slug
|
||||
and CAX11/`hel1` availability (ADR-014).
|
||||
- The **make tf token-inject** mechanism for `offsite` (read `vault.hetzner.token` → export
|
||||
`TF_VAR_hcloud_token`) — shape it in the plan (rbw/ansible-vault one-liner vs a wrapper).
|
||||
- Whether the inventory references `askari` by **IPv4 (from TF output)** or by
|
||||
**`askari.wingu.me`** once the DNS record exists — decide in the plan.
|
||||