boma/docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md
sjat c1b21c9b2b Add design spec for ubongo control/AI-worker host
Records the decision to replace the cluster-resident control VM with a
dedicated always-on physical mini-PC (ubongo) outside the Proxmox
cluster, collapsing control plane, AI-worker host, dev home, and local
test runner into one box. Basis for ADR-015.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 09:19:02 +02:00

205 lines
10 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Design — Control / development / AI-worker host (`ubongo`)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Supersedes (in part):** the "control node is a dedicated VM on the cluster"
assumption in ADR-001 / ADR-005 / ADR-009
- **Becomes:** ADR-015 (this design is the basis for that ADR)
---
## Problem
Today the control node — the host that runs Terraform and Ansible — is defined as a
**single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). The ADRs treat it
purely as a control-plane runner.
That framing fails four things the user actually needs, all confirmed as drivers:
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible can't exist until
something else creates it; the manual bootstrap is awkward and circular.
2. **Always-on availability** — the user wants to SSH in from a work PC (or anywhere)
to fire off Claude Code commands. A VM on the cluster is gone whenever the cluster
is down or being rebuilt.
3. **Recovery / disaster** — the tool you'd use to rebuild the cluster must not live
*inside* the thing it rebuilds.
4. **Dev ergonomics** — a comfortable, persistent home for Claude Code + the repo,
not entangled with production VM lifecycle.
A laptop-only answer fails always-on (not always carried) and recovery. A VM-only
answer fails cold-start and recovery. A small **dedicated always-on physical machine
outside the cluster** satisfies all four.
---
## Decision
Introduce **`ubongo`**: a single dedicated x86-64 mini-PC, always-on, living
**outside** the Proxmox cluster. It becomes *the* control node and collapses four
roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the user SSHes into
- Local test runner (Molecule/Docker, lint, and later the browser stack)
- Persistent dev home for the repo
There is **no longer a control VM on the cluster.** The `control` inventory group now
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
`ubongo` runs **plain Debian 13** (matches the fleet — the `base` role applies). It is
not a hypervisor.
### Name
`ubongo` (Swahili: *brain*), consistent with the fleet's Swahili theme (`boma`,
`nyumbani`, `askari`, `mamba`).
---
## Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative hardware: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC,
roughly €150250.
**Sizing rationale.** Claude Code itself is light — the model runs in Anthropic's
cloud; the box only runs the CLI, git, Ansible, Terraform, `rbw`, SSH. The real
sizing driver is **all testing being local**: Molecule (Docker — the heavy one), lint,
and later a headless-Chromium/Playwright browser stack for service-UI verification.
That combination is firmly mini-PC / small-server class, not a Raspberry Pi.
---
## Provisioning (bootstrap path)
Manual, like today's control-node exception — but on bare metal instead of a clone:
1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN (choice TBD — see Deferred).
4. From then on `ubongo` manages every other host normally. Ansible manages *it* for
baseline config via the `control` group (`base` role: SSH, firewall, updates,
auditd) — but never the `docker_host` role. It runs no services.
---
## Access & security
- **Remote access via the mesh VPN** (choice TBD). The user SSHes to `ubongo` over the
mesh from work PC, laptop, or phone. **Nothing is published to the public internet**
— SSH-over-mesh keeps the design fully inside ADR-002 (no LAN/WAN exposure without
reverse-proxy + auth).
- **Hardening.** `ubongo` runs the `base` role like every host: SSH hardening,
nftables default-deny, fail2ban, auditd, unattended-upgrades. Inbound SSH is allowed
**only on the mesh interface** — denied on the physical NIC. Even on the LAN it is
not an open SSH target.
- **Third-party dependency.** A hosted mesh coordinator is a third party. This is a
deliberate trade: a hosted control plane keeps the mesh up when the cluster is down
(helps recovery). A self-hosted coordinator on the cluster would recreate the
chicken-and-egg — so if self-hosted, it must live on `ubongo` or off-cluster, never
on the fleet. To be logged in `accepted-risks.md` once the VPN is chosen.
---
## Recovery model
`ubongo` is now the rebuild tool, so three things must survive a full cluster loss:
1. **The box / its data.** `mamba` (laptop) stays a **break-glass clone**: repo +
toolchain + mesh + `rbw`, able to drive the fleet if `ubongo` dies. Two machines
that can drive the fleet, not one.
2. **Terraform state.** Lives on `ubongo`, backed up **encrypted off-box** (synced to
`mamba`). For a 25 VM fleet it is also reconstructable via `terraform import`, so
this is belt-and-suspenders, not load-bearing.
3. **The vault password.** `ubongo` gets the vault master password from Vaultwarden via
`rbw`. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
it **offline** with the user's Vaultwarden master password — no live server needed
for already-synced entries. So provided (a) `rbw` has synced at least once and (b)
the user keeps their **Vaultwarden master password** offline (memorised + paper in a
safe), `ubongo` can decrypt the Ansible vault with the whole cluster down. Mirror
the same onto `mamba`.
**Why not mirror/replicate Vaultwarden onto `ubongo`?** It would make the control node
*run a service* (against its remit + adds attack surface), add DB-replication
complexity, and **still** require the Vaultwarden master password to read anything.
There is always exactly **one irreducible offline root secret** — make it the
Vaultwarden master password, and let `rbw`'s local cache make everything else
self-serve offline. (If full disaster access to *all* secrets — router, Proxmox UI —
is wanted, that same `rbw` cache already covers it; optionally add a scheduled
encrypted `rbw export` as extra insurance.)
> **To verify (ADR-014, security-relevant):** the "`rbw` decrypts its local cache fully
> offline" behaviour is the load-bearing assumption of the recovery model. Confirm it
> against `rbw`'s docs/version during implementation before relying on it.
---
## Testing implication
All testing runs on `ubongo`:
- **Level 1 (Molecule)** — Docker on `ubongo`.
- **Lint** — on `ubongo`.
- **Level 2 / 3** (staging deploy, external smoke) — driven from `ubongo` as before.
A future **service-UI acceptance level** (Claude driving a headless browser against a
deployed service: load the UI, create test users, exercise features, hand the user a
manual test script) is anticipated. `ubongo` is *sized* for it now (Chromium +
Playwright headroom). The harness itself is a **separate spec** (see Deferred).
---
## Documentation changes
A new **ADR-015 — Control / development / AI-worker host (`ubongo`)** is the home of
record. Other docs get small amendments that link to it:
| Doc | Change |
|---|---|
| ADR-015 (new) | Full record of this design. |
| ADR-001 (architecture) | Control node: "dedicated Debian 13 VM on the cluster" → "dedicated physical x86 machine *outside* the cluster (`ubongo`)". |
| ADR-005 (bootstrapping) | Control-node section: "clone the cloud-init template by hand" → "install Debian 13 on the physical box". |
| ADR-009 (provisioning handoff) | Strengthen the control-node exception: now genuinely physical/outside Terraform. |
| ADR-008 (testing) | "runs on the control node or in CI" → all levels run on `ubongo`; add a stub for the future service-UI acceptance level. |
| ADR-012 / `docs/hardware/reference.md` | Add `ubongo` to the node-capacity table (physical compute, though outside the cluster). |
| `docs/runbooks/new-host.md` | Update the control-node bootstrap procedure (bare-metal Debian install, not `qm clone`). |
| `docs/runbooks/rotate-secrets.md` | Add the offline vault-password break-glass requirement. |
| `docs/security/accepted-risks.md` | Reserve an entry for the mesh-VPN third-party coordinator — pending the VPN choice. |
| `STATUS.md` | Add a row: `ubongo`*designed, not built*. |
| `CLAUDE.md` | One-line touch to the inventory/`control`-group description if needed. |
---
## Explicitly deferred (separate specs / discussions)
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Carries the
recovery-dimension note above (hosted coordinator helps recovery; self-hosted must
be off-cluster). Its own discussion.
2. **Browser-E2E verification harness** — Playwright/headless-Chromium driving live
service UIs, test-user generation, screenshot-back-to-Claude, and the new ADR-008
level. `ubongo` is sized for it now; the harness is designed later.
3. **`rbw` offline-cache verification** — a to-verify task during implementation
(ADR-014), before relying on offline decryption.
---
## What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start and recovery (rebuild tool lives inside the thing it rebuilds); fails always-on (dies with the cluster). |
| Laptop-only (`mamba` for everything) | Fails always-on (not always carried). Retained instead as the break-glass backup. |
| Split roles (control VM + thin jump box) | Two places to maintain the toolchain, control plane split in two, heavy local testing back on a cluster VM — more moving parts, less benefit. |
| Mirror/replicate Vaultwarden onto `ubongo` | Makes the control node run a service (against its remit), adds DB-replication complexity, and still needs the Vaultwarden master password. `rbw`'s local cache achieves offline decryption without it. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg the whole design escapes. If self-hosted, it lives off-cluster. |
| Raspberry Pi as the box | Could just run Molecule, but chokes running Docker + Chromium + toolchain together. x86 mini-PC instead. |