boma/docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md
sjat c1b21c9b2b Add design spec for ubongo control/AI-worker host
Records the decision to replace the cluster-resident control VM with a
dedicated always-on physical mini-PC (ubongo) outside the Proxmox
cluster, collapsing control plane, AI-worker host, dev home, and local
test runner into one box. Basis for ADR-015.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 09:19:02 +02:00


Design — Control / development / AI-worker host (ubongo)

  • Date: 2026-06-05
  • Status: Approved design — pending implementation plan
  • Supersedes (in part): the "control node is a dedicated VM on the cluster" assumption in ADR-001 / ADR-005 / ADR-009
  • Becomes: ADR-015 (this design is the basis for that ADR)

Problem

Today the control node — the host that runs Terraform and Ansible — is defined as a single Debian 13 VM on the Proxmox cluster, manually provisioned as the one documented exception to "Terraform owns VM existence" (ADR-009). The ADRs treat it purely as a control-plane runner.

That framing fails to meet four needs the user actually has, all confirmed as drivers:

  1. Cold-start bootstrap — the VM that runs Terraform/Ansible can't exist until something else creates it; the manual bootstrap is awkward and circular.
  2. Always-on availability — the user wants to SSH in from a work PC (or anywhere) to fire off Claude Code commands. A VM on the cluster is gone whenever the cluster is down or being rebuilt.
  3. Recovery / disaster — the tool you'd use to rebuild the cluster must not live inside the thing it rebuilds.
  4. Dev ergonomics — a comfortable, persistent home for Claude Code + the repo, not entangled with production VM lifecycle.

A laptop-only answer fails always-on (not always carried) and recovery. A VM-only answer fails cold-start, always-on (it goes down with the cluster), and recovery. A small dedicated always-on physical machine outside the cluster satisfies all four.


Decision

Introduce ubongo: a single dedicated x86-64 mini-PC, always-on, living outside the Proxmox cluster. It becomes the control node and collapses four roles into one box:

  • Terraform + Ansible runner (control plane)
  • Claude Code / AI-worker host the user SSHes into
  • Local test runner (Molecule/Docker, lint, and later the browser stack)
  • Persistent dev home for the repo

There is no longer a control VM on the cluster. The control inventory group now points at this physical box. This strengthens the ADR-009 control-node exception: it is genuinely outside Terraform's world, not a VM pretending to be the exception. Every other host stays a Terraform-managed VM exactly as designed.

ubongo runs plain Debian 13 (matches the fleet — the base role applies). It is not a hypervisor.

Name

ubongo (Swahili: brain), consistent with the fleet's Swahili theme (boma, nyumbani, askari, mamba).


Hardware target

| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |

Indicative hardware: a refurbished Dell/Lenovo/HP micro (USFF) or an N100 mini-PC, roughly €150–250.

Sizing rationale. Claude Code itself is light: the model runs in Anthropic's cloud, and the box only runs the CLI, git, Ansible, Terraform, rbw, and SSH. The real sizing driver is that all testing runs locally: Molecule (Docker, the heavy one), lint, and later a headless-Chromium/Playwright browser stack for service-UI verification. That combination is firmly mini-PC / small-server class, not a Raspberry Pi.
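
When shortlisting hardware, the targets in the table can be checked mechanically. The following is a minimal Python sketch, not part of the design itself: `HostSpec`, `shortfalls`, `probe_local`, and the 10% `slack` tolerance are all hypothetical names and values introduced here for illustration. The tolerance exists because an operating system typically reports slightly less RAM and disk than the nominal figure on the box.

```python
from dataclasses import dataclass
import os
import shutil


@dataclass(frozen=True)
class HostSpec:
    cores: int
    ram_gb: float
    disk_gb: float


# Targets copied from the hardware table (CPU, RAM, disk rows).
TARGET = HostSpec(cores=4, ram_gb=16, disk_gb=250)


def shortfalls(actual: HostSpec, target: HostSpec = TARGET, slack: float = 0.1) -> list[str]:
    """List the targets `actual` misses; an empty list means the box qualifies.

    `slack` is an assumption (not from the spec): it tolerates the gap
    between nominal and OS-reported RAM and disk sizes.
    """
    problems = []
    if actual.cores < target.cores:
        problems.append(f"CPU: {actual.cores} cores, want {target.cores}")
    if actual.ram_gb < target.ram_gb * (1 - slack):
        problems.append(f"RAM: {actual.ram_gb:.1f} GB, want {target.ram_gb}")
    if actual.disk_gb < target.disk_gb * (1 - slack):
        problems.append(f"Disk: {actual.disk_gb:.0f} GB, want {target.disk_gb}")
    return problems


def probe_local(path: str = "/") -> HostSpec:
    """Measure the machine this runs on (Linux only: relies on os.sysconf)."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk_bytes = shutil.disk_usage(path).total
    return HostSpec(os.cpu_count() or 0, ram_bytes / 2**30, disk_bytes / 10**9)
```

Running `shortfalls(probe_local())` on a candidate machine booted from a live image gives a quick pass/fail before purchase or install. Network and power draw are not covered, since they cannot be read reliably from software.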


Provisioning (bootstrap path)

Manual, like today's control-node exception — but on bare metal instead of a clone:

  1. Install Debian 13 on the box (one-time, by hand).
  2. git clone the repo; make setup; make collections; set up rbw + unlock.
  3. Join the mesh VPN (choice TBD — see Deferred).
  4. From then on ubongo manages every other host normally. Ansible manages it for baseline config via the control group (base role: SSH, firewall, updates, auditd) — but never the docker_host role. It runs no services.
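
Because the bootstrap is done by hand, a small preflight check can confirm the toolchain is complete before ubongo is trusted to drive the fleet. This is a sketch under stated assumptions: the executable names in `REQUIRED_TOOLS` are guesses at what the repo's `make setup` leaves on the PATH, and `missing_tools` is a hypothetical helper, not something the repo is known to contain.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumed executable names for the tools this design mentions.
# Adjust to whatever the repo's setup actually installs.
REQUIRED_TOOLS = ("git", "make", "ssh", "ansible-playbook", "terraform", "rbw", "docker")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` after step 2 returns an empty list when everything is installed. The `which` parameter is injectable only so the check can be exercised without touching the real PATH.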

Access & security

  • Remote access via the mesh VPN (choice TBD). The user SSHes to ubongo over the mesh from work PC, laptop, or phone. Nothing is published to the public internet — SSH-over-mesh keeps the design fully inside ADR-002 (no LAN/WAN exposure without reverse-proxy + auth).
  • Hardening. ubongo runs the base role like every host: SSH hardening, nftables default-deny, fail2ban, auditd, unattended-upgrades. Inbound SSH is allowed only on the mesh interface — denied on the physical NIC. Even on the LAN it is not an open SSH target.
  • Third-party dependency. A hosted mesh coordinator is a third party. This is a deliberate trade: a hosted control plane keeps the mesh up when the cluster is down (helps recovery). A self-hosted coordinator on the cluster would recreate the chicken-and-egg — so if self-hosted, it must live on ubongo or off-cluster, never on the fleet. To be logged in accepted-risks.md once the VPN is chosen.
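
The "SSH only on the mesh interface" rule can be made concrete with a small generator for the relevant nftables input chain. This is an illustrative sketch only: the actual base role's template is not shown in this design, the interface name `mesh0` is a placeholder (the real name depends on the VPN choice), and a production ruleset would also need rules for ICMP and any other permitted traffic.

```python
def render_input_ruleset(mesh_iface: str, ssh_port: int = 22) -> str:
    """Render a default-deny nftables input chain that accepts SSH only
    when it arrives on `mesh_iface`. SSH on any other interface, including
    the physical NIC, falls through to the drop policy."""
    if not mesh_iface or any(c.isspace() or c == '"' for c in mesh_iface):
        raise ValueError(f"invalid interface name: {mesh_iface!r}")
    if not 0 < ssh_port < 65536:
        raise ValueError(f"invalid port: {ssh_port}")
    lines = [
        "table inet filter {",
        "    chain input {",
        "        type filter hook input priority 0; policy drop;",
        '        iifname "lo" accept',
        "        ct state established,related accept",
        f'        iifname "{mesh_iface}" tcp dport {ssh_port} accept',
        "    }",
        "}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `print(render_input_ruleset("mesh0"))` emits a ruleset with exactly one SSH accept rule, scoped to the mesh interface. Matching on `iifname` rather than on a source address range keeps the rule valid even if the mesh re-addresses its peers.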

Recovery model

ubongo is now the rebuild tool, so three things must survive a full cluster loss:

  1. The box / its data. mamba (laptop) stays a break-glass clone: repo + toolchain + mesh + rbw, able to drive the fleet if ubongo dies. Two machines that can drive the fleet, not one.
  2. Terraform state. Lives on ubongo, backed up encrypted off-box (synced to mamba). For a 25 VM fleet it is also reconstructable via terraform import, so this is belt-and-suspenders, not load-bearing.
  3. The vault password. ubongo gets the vault master password from Vaultwarden via rbw. rbw keeps a local encrypted copy of the Vaultwarden vault and decrypts it offline with the user's Vaultwarden master password — no live server needed for already-synced entries. So provided (a) rbw has synced at least once and (b) the user keeps their Vaultwarden master password offline (memorised + paper in a safe), ubongo can decrypt the Ansible vault with the whole cluster down. Mirror the same onto mamba.

Why not mirror/replicate Vaultwarden onto ubongo? It would make the control node run a service (against its remit + adds attack surface), add DB-replication complexity, and still require the Vaultwarden master password to read anything. There is always exactly one irreducible offline root secret — make it the Vaultwarden master password, and let rbw's local cache make everything else self-serve offline. (If full disaster access to all secrets — router, Proxmox UI — is wanted, that same rbw cache already covers it; optionally add a scheduled encrypted rbw export as extra insurance.)

To verify (ADR-014, security-relevant): the "rbw decrypts its local cache fully offline" behaviour is the load-bearing assumption of the recovery model. Confirm it against rbw's docs/version during implementation before relying on it.
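
One way to run that verification is sketched below. It assumes the `rbw unlocked` and `rbw get` subcommands behave as documented for the installed rbw version, and the entry name passed in is whatever the vault password is stored under (hypothetical here). The check only proves anything when run with the Vaultwarden server unreachable, for example with the network cable pulled, and after unlocking rbw while already offline. It does not simulate the outage itself.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code, discarding output so the
    secret is never printed. A missing binary is reported as 127."""
    try:
        return subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        ).returncode
    except FileNotFoundError:
        return 127


def offline_cache_works(entry: str, run: Runner = _run) -> bool:
    """True if rbw reports itself unlocked and can read `entry`.

    Meaningful only while the Vaultwarden server is unreachable;
    with the server up, a pass says nothing about offline behaviour.
    """
    if run(["rbw", "unlocked"]) != 0:
        return False
    return run(["rbw", "get", entry]) == 0
```

The same check should pass on mamba, since the recovery model depends on either machine being able to decrypt the Ansible vault with the cluster down.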


Testing implication

All testing runs on ubongo:

  • Level 1 (Molecule) — Docker on ubongo.
  • Lint — on ubongo.
  • Level 2 / 3 (staging deploy, external smoke) — driven from ubongo as before.
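
The local levels can be chained so that a cheap failure stops the run before the expensive stage starts. The sketch below is an assumption-laden illustration: running lint before Molecule is a choice made here, not something the design states, and the bare `ansible-lint` and `molecule test` commands stand in for whatever make targets the repo actually uses. Levels 2 and 3 are left out because their commands are not described in this document.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]

# Assumed commands; the repo may wrap these in make targets instead.
LOCAL_STAGES = [
    ("lint", ["ansible-lint"]),
    ("level-1 molecule", ["molecule", "test"]),
]


def _run(cmd: Sequence[str]) -> int:
    """Run a command with inherited output; a missing binary yields 127."""
    try:
        return subprocess.run(list(cmd), check=False).returncode
    except FileNotFoundError:
        return 127


def run_stages(stages=LOCAL_STAGES, run: Runner = _run) -> list[tuple[str, int]]:
    """Run stages in order and stop at the first non-zero exit code.

    Returns (stage name, exit code) for every stage that actually ran.
    """
    results = []
    for name, cmd in stages:
        code = run(cmd)
        results.append((name, code))
        if code != 0:
            break
    return results
```

A run is green when every returned exit code is zero and the result list is as long as the stage list.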

A future service-UI acceptance level (Claude driving a headless browser against a deployed service: load the UI, create test users, exercise features, hand the user a manual test script) is anticipated. ubongo is sized for it now (Chromium + Playwright headroom). The harness itself is a separate spec (see Deferred).


Documentation changes

A new ADR-015 — Control / development / AI-worker host (ubongo) is the home of record. Other docs get small amendments that link to it:

| Doc | Change |
|---|---|
| ADR-015 (new) | Full record of this design. |
| ADR-001 (architecture) | Control node: "dedicated Debian 13 VM on the cluster" → "dedicated physical x86 machine outside the cluster (ubongo)". |
| ADR-005 (bootstrapping) | Control-node section: "clone the cloud-init template by hand" → "install Debian 13 on the physical box". |
| ADR-009 (provisioning handoff) | Strengthen the control-node exception: now genuinely physical/outside Terraform. |
| ADR-008 (testing) | "runs on the control node or in CI" → all levels run on ubongo; add a stub for the future service-UI acceptance level. |
| ADR-012 / docs/hardware/reference.md | Add ubongo to the node-capacity table (physical compute, though outside the cluster). |
| docs/runbooks/new-host.md | Update the control-node bootstrap procedure (bare-metal Debian install, not qm clone). |
| docs/runbooks/rotate-secrets.md | Add the offline vault-password break-glass requirement. |
| docs/security/accepted-risks.md | Reserve an entry for the mesh-VPN third-party coordinator, pending the VPN choice. |
| STATUS.md | Add a row for ubongo: designed, not built. |
| CLAUDE.md | One-line touch to the inventory/control-group description if needed. |

Explicitly deferred (separate specs / discussions)

  1. Mesh VPN choice — Tailscale vs NetBird, hosted vs self-hosted. Carries the recovery-dimension note above (hosted coordinator helps recovery; self-hosted must be off-cluster). Its own discussion.
  2. Browser-E2E verification harness — Playwright/headless-Chromium driving live service UIs, test-user generation, screenshot-back-to-Claude, and the new ADR-008 level. ubongo is sized for it now; the harness is designed later.
  3. rbw offline-cache verification — a to-verify task during implementation (ADR-014), before relying on offline decryption.

What was ruled out

| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start and recovery (rebuild tool lives inside the thing it rebuilds); fails always-on (dies with the cluster). |
| Laptop-only (mamba for everything) | Fails always-on (not always carried). Retained instead as the break-glass backup. |
| Split roles (control VM + thin jump box) | Two places to maintain the toolchain, control plane split in two, heavy local testing back on a cluster VM: more moving parts, less benefit. |
| Mirror/replicate Vaultwarden onto ubongo | Makes the control node run a service (against its remit), adds DB-replication complexity, and still needs the Vaultwarden master password. rbw's local cache achieves offline decryption without it. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg problem the whole design escapes. If self-hosted, it lives off-cluster. |
| Raspberry Pi as the box | Could just run Molecule, but chokes running Docker + Chromium + toolchain together. x86 mini-PC instead. |