Records the decision to replace the cluster-resident control VM with a dedicated always-on physical mini-PC (ubongo) outside the Proxmox cluster, collapsing control plane, AI-worker host, dev home, and local test runner into one box. Basis for ADR-015. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Design — Control / development / AI-worker host (ubongo)
- Date: 2026-06-05
- Status: Approved design — pending implementation plan
- Supersedes (in part): the "control node is a dedicated VM on the cluster" assumption in ADR-001 / ADR-005 / ADR-009
- Becomes: ADR-015 (this design is the basis for that ADR)
Problem
Today the control node — the host that runs Terraform and Ansible — is defined as a single Debian 13 VM on the Proxmox cluster, manually provisioned as the one documented exception to "Terraform owns VM existence" (ADR-009). The ADRs treat it purely as a control-plane runner.
That framing fails four things the user actually needs, all confirmed as drivers:
- Cold-start bootstrap — the VM that runs Terraform/Ansible can't exist until something else creates it; the manual bootstrap is awkward and circular.
- Always-on availability — the user wants to SSH in from a work PC (or anywhere) to fire off Claude Code commands. A VM on the cluster is gone whenever the cluster is down or being rebuilt.
- Recovery / disaster — the tool you'd use to rebuild the cluster must not live inside the thing it rebuilds.
- Dev ergonomics — a comfortable, persistent home for Claude Code + the repo, not entangled with production VM lifecycle.
A laptop-only answer fails always-on (not always carried) and recovery. A VM-only answer fails cold-start and recovery, and also always-on, since it goes down with the cluster. A small dedicated always-on physical machine outside the cluster satisfies all four.
Decision
Introduce ubongo: a single dedicated x86-64 mini-PC, always-on, living
outside the Proxmox cluster. It becomes the control node and collapses four
roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the user SSHes into
- Local test runner (Molecule/Docker, lint, and later the browser stack)
- Persistent dev home for the repo
There is no longer a control VM on the cluster. The control inventory group now
points at this physical box. This strengthens the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
ubongo runs plain Debian 13 (matches the fleet — the base role applies). It is
not a hypervisor.
Name
ubongo (Swahili: brain), consistent with the fleet's Swahili theme (boma,
nyumbani, askari, mamba).
Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative hardware: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC, roughly €150–250.
Sizing rationale. Claude Code itself is light — the model runs in Anthropic's
cloud; the box only runs the CLI, git, Ansible, Terraform, rbw, SSH. The real
sizing driver is all testing being local: Molecule (Docker — the heavy one), lint,
and later a headless-Chromium/Playwright browser stack for service-UI verification.
That combination is firmly mini-PC / small-server class, not a Raspberry Pi.
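The hardware table can be turned into a quick check when shortlisting a box. The sketch below is illustrative only: the `Spec` type and `shortfalls` function are hypothetical names, and treating a non-x86 architecture as a failure is an assumption (the table only says the test stack prefers x86).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    arch: str
    cores: int
    ram_gb: int
    disk_gb: int


# Targets copied from the hardware table above.
TARGET = Spec(arch="x86_64", cores=4, ram_gb=16, disk_gb=250)


def shortfalls(candidate: Spec, target: Spec = TARGET) -> list[str]:
    """Return one message per spec where the candidate misses the target."""
    problems = []
    if candidate.arch != target.arch:
        problems.append(f"arch: have {candidate.arch}, want {target.arch}")
    for name in ("cores", "ram_gb", "disk_gb"):
        have, want = getattr(candidate, name), getattr(target, name)
        if have < want:
            problems.append(f"{name}: have {have}, want >= {want}")
    return problems


if __name__ == "__main__":
    # Hypothetical candidate: a 4-core ARM board with 8 GB of RAM.
    for line in shortfalls(Spec(arch="aarch64", cores=4, ram_gb=8, disk_gb=256)):
        print(line)
```

An empty list means the candidate meets every target; network and power draw are left out because they are not simple numeric comparisons.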
Provisioning (bootstrap path)
Manual, like today's control-node exception — but on bare metal instead of a clone:
- Install Debian 13 on the box (one-time, by hand).
- `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
- Join the mesh VPN (choice TBD; see Deferred).
- From then on `ubongo` manages every other host normally. Ansible manages it for baseline config via the `control` group (`base` role: SSH, firewall, updates, auditd), but never the `docker_host` role. It runs no services.
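The post-install steps above could be captured as an ordered, dry-runnable plan. This is a minimal sketch, not the project's tooling: `REPO_URL` and `CHECKOUT` are hypothetical placeholders (the design names neither), `rbw login` is assumed to be part of "set up rbw", and the mesh-VPN join is left out because the VPN is not chosen yet.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

# Hypothetical placeholders: the design does not name the repo URL or checkout path.
REPO_URL = "git@example.invalid:homelab/infra.git"
CHECKOUT = Path.home() / "infra"


def bootstrap_plan(repo_url: str, checkout: Path) -> list[tuple[Path, list[str]]]:
    """Ordered (working directory, argv) pairs mirroring the manual steps."""
    return [
        (checkout.parent, ["git", "clone", repo_url, str(checkout)]),
        (checkout, ["make", "setup"]),
        (checkout, ["make", "collections"]),
        # Assumes rbw's email and server URL were already set via `rbw config set`.
        (checkout, ["rbw", "login"]),
        (checkout, ["rbw", "unlock"]),
    ]


def run(plan: list[tuple[Path, list[str]]], dry_run: bool = True) -> None:
    """Print each step; execute it only when dry_run is False."""
    for cwd, argv in plan:
        print(f"[{cwd}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    run(bootstrap_plan(REPO_URL, CHECKOUT))  # dry run: prints, executes nothing
```

Keeping the plan as data makes the order reviewable before anything runs on the freshly installed box.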
Access & security
- Remote access via the mesh VPN (choice TBD). The user SSHes to `ubongo` over the mesh from work PC, laptop, or phone. Nothing is published to the public internet: SSH-over-mesh keeps the design fully inside ADR-002 (no LAN/WAN exposure without reverse-proxy + auth).
- Hardening. `ubongo` runs the `base` role like every host: SSH hardening, nftables default-deny, fail2ban, auditd, unattended-upgrades. Inbound SSH is allowed only on the mesh interface and denied on the physical NIC. Even on the LAN it is not an open SSH target.
- Third-party dependency. A hosted mesh coordinator is a third party. This is a deliberate trade: a hosted control plane keeps the mesh up when the cluster is down (helps recovery). A self-hosted coordinator on the cluster would recreate the chicken-and-egg, so if self-hosted, it must live on `ubongo` or off-cluster, never on the fleet. To be logged in `accepted-risks.md` once the VPN is chosen.
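The "SSH only on the mesh interface" rule can be illustrated with a small generator for an nftables input chain. This is a sketch under assumptions, not the `base` role's actual ruleset: the function name is hypothetical, the interface name depends on the VPN that is eventually chosen, and a real ruleset would also need ICMP/ICMPv6 and anything else the role permits.

```python
import re


def render_input_chain(mesh_iface: str, ssh_port: int = 22) -> str:
    """Render a default-deny input chain that accepts SSH only on mesh_iface."""
    # Linux interface names are short; reject anything that could break quoting.
    if not re.fullmatch(r"[A-Za-z0-9_.-]{1,15}", mesh_iface):
        raise ValueError(f"unexpected interface name: {mesh_iface!r}")
    if not 1 <= ssh_port <= 65535:
        raise ValueError(f"port out of range: {ssh_port}")
    lines = [
        "table inet filter {",
        "    chain input {",
        "        type filter hook input priority 0; policy drop;",
        "        ct state established,related accept",
        '        iifname "lo" accept',
        # The only rule that admits new SSH connections; the physical NIC
        # has no matching rule, so it falls through to the drop policy.
        f'        iifname "{mesh_iface}" tcp dport {ssh_port} accept',
        "    }",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "mesh0" is a placeholder interface name.
    print(render_input_chain("mesh0"), end="")
```

Because the policy is `drop` and the SSH accept is keyed on the interface name, a LAN client hitting port 22 on the physical NIC matches nothing and is dropped.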
Recovery model
ubongo is now the rebuild tool, so three things must survive a full cluster loss:
- The box / its data. `mamba` (laptop) stays a break-glass clone: repo + toolchain + mesh + `rbw`, able to drive the fleet if `ubongo` dies. Two machines that can drive the fleet, not one.
- Terraform state. Lives on `ubongo`, backed up encrypted off-box (synced to `mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`, so this is belt-and-suspenders, not load-bearing.
- The vault password. `ubongo` gets the vault master password from Vaultwarden via `rbw`. `rbw` keeps a local encrypted copy of the Vaultwarden vault and decrypts it offline with the user's Vaultwarden master password, so no live server is needed for already-synced entries. So provided (a) `rbw` has synced at least once and (b) the user keeps their Vaultwarden master password offline (memorised + paper in a safe), `ubongo` can decrypt the Ansible vault with the whole cluster down. Mirror the same onto `mamba`.
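For the Terraform-state item, one possible shape of "backed up encrypted off-box" is to encrypt first and sync only the ciphertext. The sketch below assumes GnuPG symmetric encryption and rsync over SSH; the design specifies neither tool, and the state path and destination on `mamba` are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def encrypt_cmd(state: Path, out: Path) -> list[str]:
    """argv for passphrase-based (symmetric) encryption with GnuPG."""
    return ["gpg", "--symmetric", "--cipher-algo", "AES256",
            "--output", str(out), str(state)]


def sync_cmd(encrypted: Path, remote: str) -> list[str]:
    """argv copying the encrypted file to remote (rsync's host:path form)."""
    return ["rsync", "-a", str(encrypted), remote]


def backup_commands(state: Path, remote: str) -> list[list[str]]:
    """Encrypt next to the state file, then sync only the encrypted copy."""
    encrypted = state.with_name(state.name + ".gpg")
    return [encrypt_cmd(state, encrypted), sync_cmd(encrypted, remote)]


if __name__ == "__main__":
    # Hypothetical paths: the design names neither the state location
    # nor the destination directory on mamba.
    for argv in backup_commands(Path("terraform.tfstate"), "mamba:backups/"):
        print(" ".join(argv))
```

The plaintext state never leaves the box; only the `.gpg` file is passed to rsync.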
Why not mirror/replicate Vaultwarden onto ubongo? It would make the control node
run a service (against its remit + adds attack surface), add DB-replication
complexity, and still require the Vaultwarden master password to read anything.
There is always exactly one irreducible offline root secret — make it the
Vaultwarden master password, and let rbw's local cache make everything else
self-serve offline. (If full disaster access to all secrets — router, Proxmox UI —
is wanted, that same rbw cache already covers it; optionally add a scheduled
encrypted rbw export as extra insurance.)
To verify (ADR-014, security-relevant): the "`rbw` decrypts its local cache fully offline" behaviour is the load-bearing assumption of the recovery model. Confirm it against `rbw`'s docs/version during implementation before relying on it.
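One way to structure that check is sketched below. It assumes the operator has already disconnected the network (the code cannot confirm that), and the entry name is hypothetical. It locks the `rbw` agent first so the unlock genuinely has to decrypt the local cache; whether that succeeds offline is exactly what is being tested, not something this sketch asserts.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(argv: Sequence[str]) -> int:
    """Run a command, discard its output, and return the exit code."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def offline_read_ok(entry: str, run: Runner = _run) -> bool:
    """True if rbw can unlock from a locked state and read the entry.

    Only meaningful as an offline test when the network is down.
    """
    run(["rbw", "lock"])  # result ignored: the agent may not be running yet
    if run(["rbw", "unlock"]) != 0:
        return False
    # Output is discarded so the secret is never printed or logged.
    return run(["rbw", "get", entry]) == 0
```

A call such as `offline_read_ok("ansible-vault")` (hypothetical entry name) returning `False` with the network down would mean the recovery model needs another path to the vault password.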
Testing implication
All testing runs on ubongo:
- Level 1 (Molecule): Docker on `ubongo`.
- Lint: on `ubongo`.
- Level 2 / 3 (staging deploy, external smoke): driven from `ubongo` as before.
A future service-UI acceptance level (Claude driving a headless browser against a
deployed service: load the UI, create test users, exercise features, hand the user a
manual test script) is anticipated. ubongo is sized for it now (Chromium +
Playwright headroom). The harness itself is a separate spec (see Deferred).
Documentation changes
A new ADR-015 — Control / development / AI-worker host (ubongo) is the home of
record. Other docs get small amendments that link to it:
| Doc | Change |
|---|---|
| ADR-015 (new) | Full record of this design. |
| ADR-001 (architecture) | Control node: "dedicated Debian 13 VM on the cluster" → "dedicated physical x86 machine outside the cluster (ubongo)". |
| ADR-005 (bootstrapping) | Control-node section: "clone the cloud-init template by hand" → "install Debian 13 on the physical box". |
| ADR-009 (provisioning handoff) | Strengthen the control-node exception: now genuinely physical/outside Terraform. |
| ADR-008 (testing) | "runs on the control node or in CI" → all levels run on ubongo; add a stub for the future service-UI acceptance level. |
| ADR-012 / `docs/hardware/reference.md` | Add `ubongo` to the node-capacity table (physical compute, though outside the cluster). |
| `docs/runbooks/new-host.md` | Update the control-node bootstrap procedure (bare-metal Debian install, not `qm clone`). |
| `docs/runbooks/rotate-secrets.md` | Add the offline vault-password break-glass requirement. |
| `docs/security/accepted-risks.md` | Reserve an entry for the mesh-VPN third-party coordinator, pending the VPN choice. |
| `STATUS.md` | Add a row for `ubongo`: designed, not built. |
| `CLAUDE.md` | One-line touch to the inventory/control-group description if needed. |
Explicitly deferred (separate specs / discussions)
- Mesh VPN choice — Tailscale vs NetBird, hosted vs self-hosted. Carries the recovery-dimension note above (hosted coordinator helps recovery; self-hosted must be off-cluster). Its own discussion.
- Browser-E2E verification harness: Playwright/headless-Chromium driving live service UIs, test-user generation, screenshot-back-to-Claude, and the new ADR-008 level. `ubongo` is sized for it now; the harness is designed later.
- `rbw` offline-cache verification: a to-verify task during implementation (ADR-014), before relying on offline decryption.
What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start and recovery (rebuild tool lives inside the thing it rebuilds); fails always-on (dies with the cluster). |
| Laptop-only (`mamba` for everything) | Fails always-on (not always carried). Retained instead as the break-glass backup. |
| Split roles (control VM + thin jump box) | Two places to maintain the toolchain, control plane split in two, heavy local testing back on a cluster VM — more moving parts, less benefit. |
| Mirror/replicate Vaultwarden onto `ubongo` | Makes the control node run a service (against its remit), adds DB-replication complexity, and still needs the Vaultwarden master password. `rbw`'s local cache achieves offline decryption without it. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg the whole design escapes. If self-hosted, it lives off-cluster. |
| Raspberry Pi as the box | Could run Molecule on its own, but struggles when Docker, Chromium, and the toolchain run together. An x86 mini-PC is used instead. |