boma/docs/decisions/015-control-host.md
sjat 188882449d docs(adr): restructure ADRs 001,002,004,005,012,014,015 to ADR-023 conformance
Add dated Status sections and (where missing) Consequences sections assembled
from each ADR's already-stated implications. No decision substance changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:39:00 +02:00

139 lines
6.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-015 — Control / development / AI-worker host (`ubongo`)
## Status
Accepted (2026-06-05)
## Context
Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
the control node purely as a control-plane runner.
It fails four needs, all confirmed as drivers:
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
something else creates it; the bootstrap is circular and awkward.
2. **Always-on availability** — the operator wants to SSH in from a work PC or
anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
or being rebuilt.
3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
inside the thing it rebuilds.
4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
with production VM lifecycle.
A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
and recovery. A small dedicated always-on physical machine outside the cluster
satisfies all four.
## Decision
Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
It becomes *the* control node and collapses four roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the operator SSHes into
- Local test runner (Molecule/Docker, lint, and later a browser stack)
- Persistent dev home for the repo
There is **no longer a control VM on the cluster.** The `control` inventory group
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor
and runs no `docker_host` services.
### Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150250).
Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
is **all testing being local** — Molecule (Docker), lint, and a future
headless-Chromium/Playwright stack.
### Provisioning (bootstrap path)
Manual, on bare metal:
1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
baseline config via the `control` group (`base` role only).
### Access & security
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
denied on the physical NIC.
### Recovery model
`ubongo` is the rebuild tool, so three things must survive a full cluster loss:
1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
able to drive the fleet if `ubongo` dies.
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
`mamba`). For a 25 VM fleet it is also reconstructable via `terraform import`.
3. **Vault password**`ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
local encrypted copy of the vault and decrypts it offline with the operator's
Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
whole cluster down — provided `rbw` has synced once and the operator keeps the
Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
`mamba`.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
would make the control node run a service (against its remit) and still need that
master password.
> verified: rbw offline-cache decryption · TO VERIFY before relying on the recovery
> model · rbw docs · (ADR-014, security-relevant — confirm during build)
## Consequences
- The control node is physical compute outside the cluster, so it appears in
`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
- A future **service-UI acceptance** testing level (Claude driving a headless browser
against a deployed service) is anticipated; `ubongo` is sized for it. The harness
is a separate spec.
## Deferred (separate specs / discussions)
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
(off-site, so it survives a homelab outage and stays out of the cluster it
administers). Replaces ADR-007's OPNsense WireGuard.
2. **Browser-E2E verification harness — RESOLVED (ADR-017):** Claude-driven
exploratory service-UI verification (`/verify-service`, ADR-008 Level 4), against
staging with test users in Authentik. Design + skill + standards complete; running
deferred on the stack.
3. **`rbw` offline-cache verification** — still open: confirm offline decryption before
relying on it (ADR-014).
## What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).