From 5aca796fa0bec1299c696d5f3877568d5c5c2ffa Mon Sep 17 00:00:00 2001 From: sjat Date: Fri, 5 Jun 2026 09:37:56 +0200 Subject: [PATCH] Add ADR-015 (control/AI-worker host ubongo) --- docs/decisions/015-control-host.md | 133 +++++++++++++++++++++++++++++ 1 file changed, 133 insertions(+) create mode 100644 docs/decisions/015-control-host.md diff --git a/docs/decisions/015-control-host.md b/docs/decisions/015-control-host.md new file mode 100644 index 0000000..8393ab6 --- /dev/null +++ b/docs/decisions/015-control-host.md @@ -0,0 +1,133 @@ +# ADR-015 — Control / development / AI-worker host (`ubongo`) + +## Context + +Earlier ADRs framed the control node — the host that runs Terraform and Ansible — +as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one +documented exception to "Terraform owns VM existence" (ADR-009). That framing treats +the control node purely as a control-plane runner. + +It fails four needs, all confirmed as drivers: + +1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until + something else creates it; the bootstrap is circular and awkward. +2. **Always-on availability** — the operator wants to SSH in from a work PC or + anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down + or being rebuilt. +3. **Recovery / disaster** — the tool used to rebuild the cluster must not live + inside the thing it rebuilds. +4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled + with production VM lifecycle. + +A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start +and recovery. A small dedicated always-on physical machine outside the cluster +satisfies all four. + +## Decision + +Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a +single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster. +It becomes *the* control node and collapses four roles into one box: + +- Terraform + Ansible runner (control plane) +- Claude Code / AI-worker host the operator SSHes into +- Local test runner (Molecule/Docker, lint, and later a browser stack) +- Persistent dev home for the repo + +There is **no longer a control VM on the cluster.** The `control` inventory group +points at this physical box. This *strengthens* the ADR-009 control-node exception: +it is genuinely outside Terraform's world, not a VM pretending to be the exception. +Every other host stays a Terraform-managed VM exactly as designed. + +`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor +and runs no `docker_host` services. + +### Hardware target + +| Spec | Target | Why | +|---|---|---| +| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 | +| RAM | 16 GB | Docker + headless Chromium + toolchain headroom | +| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache | +| Network | Wired GbE | Always-on reliability over Wi-Fi | +| Power | Low draw (≤15 W idle) | Runs 24/7 | + +Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250). +Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver +is **all testing being local** — Molecule (Docker), lint, and a future +headless-Chromium/Playwright stack. + +### Provisioning (bootstrap path) + +Manual, on bare metal: + +1. Install Debian 13 on the box (one-time, by hand). +2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock. +3. Join the mesh VPN (choice deferred — see below). +4. From then on `ubongo` manages every other host normally; Ansible manages *it* for + baseline config via the `control` group (`base` role only). + +### Access & security + +- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the + mesh; nothing is published to the public internet — this stays inside ADR-002. +- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban, + auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**, + denied on the physical NIC. + +### Recovery model + +`ubongo` is the rebuild tool, so three things must survive a full cluster loss: + +1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`, + able to drive the fleet if `ubongo` dies. +2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to + `mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`. +3. **Vault password** — `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a + local encrypted copy of the vault and decrypts it offline with the operator's + Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the + whole cluster down — provided `rbw` has synced once and the operator keeps the + Vaultwarden master password offline (memorised + paper in a safe). Mirror onto + `mamba`. + +There is always exactly one irreducible offline root secret; here it is the +Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it +would make the control node run a service (against its remit) and still need that +master password. + +> verified: rbw offline-cache decryption · TO VERIFY before relying on the recovery +> model · rbw docs · (ADR-014, security-relevant — confirm during build) + +## Consequences + +- The control node is physical compute outside the cluster, so it appears in + `docs/hardware/reference.md` even though it is not a cluster node (ADR-012). +- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008). +- A future **service-UI acceptance** testing level (Claude driving a headless browser + against a deployed service) is anticipated; `ubongo` is sized for it. The harness + is a separate spec. + +## Deferred (separate specs / discussions) + +1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery + dimension: a hosted coordinator keeps the mesh up when the cluster is down; a + self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet, + or it recreates the chicken-and-egg. +2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user + generation, screenshot-back-to-Claude, and the new ADR-008 level. +3. **`rbw` offline-cache verification** — confirm offline decryption before relying + on it (ADR-014). + +## What was ruled out + +| Option | Reason | +|---|---| +| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. | +| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. | +| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. | +| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. | +| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. | +| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. | + +See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing), +ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).