Add design spec for ubongo control/AI-worker host

Records the decision to replace the cluster-resident control VM with a dedicated always-on physical mini-PC (ubongo) outside the Proxmox cluster, collapsing control plane, AI-worker host, dev home, and local test runner into one box. Basis for ADR-015.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent fc0d49f1c4
commit c1b21c9b2b
1 changed file with 205 additions and 0 deletions
docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md
# Design: Control / development / AI-worker host (`ubongo`)

- **Date:** 2026-06-05
- **Status:** Approved design, pending implementation plan
- **Supersedes (in part):** the "control node is a dedicated VM on the cluster"
  assumption in ADR-001 / ADR-005 / ADR-009
- **Becomes:** ADR-015 (this design is the basis for that ADR)

---

## Problem

Today the control node (the host that runs Terraform and Ansible) is defined as a
**single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). The ADRs treat it
purely as a control-plane runner.

That framing misses four things the user actually needs, all confirmed as drivers:

1. **Cold-start bootstrap:** the VM that runs Terraform/Ansible can't exist until
   something else creates it; the manual bootstrap is awkward and circular.
2. **Always-on availability:** the user wants to SSH in from a work PC (or anywhere)
   to fire off Claude Code commands. A VM on the cluster is gone whenever the cluster
   is down or being rebuilt.
3. **Recovery / disaster:** the tool you'd use to rebuild the cluster must not live
   *inside* the thing it rebuilds.
4. **Dev ergonomics:** a comfortable, persistent home for Claude Code + the repo,
   not entangled with production VM lifecycle.

A laptop-only answer fails always-on (not always carried) and recovery. A VM-only
answer fails cold-start, always-on (it dies with the cluster), and recovery. A small
**dedicated always-on physical machine outside the cluster** satisfies all four.
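The elimination argument above can be written down as a small requirements matrix. This is an illustrative sketch only: the option labels and the `FAILS` mapping are names introduced here, and the always-on failure for the cluster VM is taken from driver 2 and the "What was ruled out" table.

```python
# The four drivers listed in the Problem section.
DRIVERS = {"cold-start", "always-on", "recovery", "dev-ergonomics"}

# Drivers each candidate fails, as stated in this design.
# The option labels are illustrative, not names used elsewhere in the repo.
FAILS = {
    "laptop-only": {"always-on", "recovery"},
    "cluster-vm": {"cold-start", "always-on", "recovery"},
    "dedicated-physical-box": set(),
}


def satisfied(option: str) -> set[str]:
    """Return the drivers that an option meets."""
    return DRIVERS - FAILS[option]


def viable_options() -> list[str]:
    """Return the options that meet every driver."""
    return [name for name in FAILS if satisfied(name) == DRIVERS]


print(viable_options())
```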

---

## Decision

Introduce **`ubongo`**: a single dedicated x86-64 mini-PC, always-on, living
**outside** the Proxmox cluster. It becomes *the* control node and collapses four
roles into one box:

- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the user SSHes into
- Local test runner (Molecule/Docker, lint, and later the browser stack)
- Persistent dev home for the repo

There is **no longer a control VM on the cluster.** The `control` inventory group now
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.

`ubongo` runs **plain Debian 13** (matching the fleet, so the `base` role applies). It is
not a hypervisor.

### Name

`ubongo` (Swahili: *brain*), consistent with the fleet's Swahili theme (`boma`,
`nyumbani`, `askari`, `mamba`).

---

## Hardware target

| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, Molecule layers, repos, browser cache |
| Network | Wired GbE | More reliable than Wi-Fi for an always-on box |
| Power | Low draw (≤15 W idle) | Runs 24/7 |

Indicative hardware: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC,
roughly €150–250.

**Sizing rationale.** Claude Code itself is light: the model runs in Anthropic's
cloud, so the box only runs the CLI, git, Ansible, Terraform, `rbw`, and SSH. The real
sizing driver is **all testing being local**: Molecule (Docker, the heavy one), lint,
and later a headless-Chromium/Playwright browser stack for service-UI verification.
That combination is firmly mini-PC / small-server class, not a Raspberry Pi.
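To put the power row in concrete terms, the sketch below works out the yearly energy for a box that sits at the 15 W idle ceiling around the clock. It is arithmetic on the table's target, not a measurement: real draw rises under Molecule or browser load, and `price_per_kwh` is a placeholder parameter that this design does not specify.

```python
HOURS_PER_YEAR = 24 * 365  # always-on, non-leap year


def annual_energy_kwh(watts: float) -> float:
    """Energy used over one year at a constant draw, in kWh."""
    return watts * HOURS_PER_YEAR / 1000


def annual_cost(watts: float, price_per_kwh: float) -> float:
    """Yearly running cost; price_per_kwh is an assumed tariff, not from the spec."""
    return annual_energy_kwh(watts) * price_per_kwh


# 15 W is the idle ceiling from the hardware table.
print(f"{annual_energy_kwh(15):.1f} kWh/year at a constant 15 W")
```

At the idle ceiling this comes to 131.4 kWh per year; multiplying by a local tariff gives the running cost.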

---

## Provisioning (bootstrap path)

Manual, like today's control-node exception, but on bare metal instead of a clone:

1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN (choice TBD; see Deferred).
4. From then on `ubongo` manages every other host normally. Ansible manages *it* for
   baseline config via the `control` group (`base` role: SSH, firewall, updates,
   auditd) but never applies the `docker_host` role. It runs no services.
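The role rule in step 4 (the `control` group always gets `base` and never gets `docker_host`) is the kind of constraint a lint step could guard. The following is a minimal sketch under stated assumptions: the `GROUP_ROLES` mapping and the `docker_hosts` group name are hypothetical, and the repo's real playbook layout may differ.

```python
# Hypothetical group-to-role mapping. Only `control`, `base`, and `docker_host`
# come from this design; the `docker_hosts` group name is invented for contrast.
GROUP_ROLES = {
    "control": ["base"],
    "docker_hosts": ["base", "docker_host"],
}

REQUIRED = {"control": {"base"}}
FORBIDDEN = {"control": {"docker_host"}}


def policy_violations(group_roles: dict[str, list[str]]) -> list[str]:
    """Return human-readable violations of the control-group role policy."""
    problems = []
    for group, required in REQUIRED.items():
        applied = set(group_roles.get(group, []))
        for role in sorted(required - applied):
            problems.append(f"{group}: missing required role '{role}'")
    for group, forbidden in FORBIDDEN.items():
        applied = set(group_roles.get(group, []))
        for role in sorted(forbidden & applied):
            problems.append(f"{group}: forbidden role '{role}' applied")
    return problems


print(policy_violations(GROUP_ROLES))
```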

---

## Access & security

- **Remote access via the mesh VPN** (choice TBD). The user SSHes to `ubongo` over the
  mesh from work PC, laptop, or phone. **Nothing is published to the public internet.**
  SSH-over-mesh keeps the design fully inside ADR-002 (no LAN/WAN exposure without
  reverse-proxy + auth).
- **Hardening.** `ubongo` runs the `base` role like every host: SSH hardening,
  nftables default-deny, fail2ban, auditd, unattended-upgrades. Inbound SSH is allowed
  **only on the mesh interface** and denied on the physical NIC. Even on the LAN it is
  not an open SSH target.
- **Third-party dependency.** A hosted mesh coordinator is a third party. This is a
  deliberate trade: a hosted control plane keeps the mesh up when the cluster is down
  (helps recovery). A self-hosted coordinator on the cluster would recreate the
  chicken-and-egg problem, so if self-hosted, it must live on `ubongo` or elsewhere
  off-cluster, never on the fleet. To be logged in `accepted-risks.md` once the VPN is
  chosen.
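The "SSH only on the mesh interface" rule can be sketched as a rendered nftables input chain. This is an assumption-laden illustration, not the `base` role's actual template: the interface name `mesh0` is a placeholder because the VPN is not yet chosen, and a real ruleset would also need ICMP/ICMPv6 handling and whatever fail2ban adds.

```python
def render_input_chain(mesh_iface: str, ssh_port: int = 22) -> str:
    """Render a default-deny nftables input chain that accepts SSH on one interface.

    Anything arriving on another interface (including the physical NIC) falls
    through to the chain's `policy drop`.
    """
    return f"""\
table inet filter {{
    chain input {{
        type filter hook input priority 0; policy drop;
        iif "lo" accept
        ct state established,related accept
        iifname "{mesh_iface}" tcp dport {ssh_port} accept
    }}
}}
"""


# "mesh0" is a hypothetical interface name; the real one depends on the VPN chosen.
print(render_input_chain("mesh0"))
```

Because the only `tcp dport` accept is tied to `iifname`, a LAN client hitting port 22 on the physical NIC matches no rule and is dropped.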

---

## Recovery model

`ubongo` is now the rebuild tool, so three things must survive a full cluster loss:

1. **The box / its data.** `mamba` (laptop) stays a **break-glass clone**: repo +
   toolchain + mesh + `rbw`, able to drive the fleet if `ubongo` dies. Two machines
   that can drive the fleet, not one.
2. **Terraform state.** Lives on `ubongo`, backed up **encrypted off-box** (synced to
   `mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`, so
   this is belt-and-suspenders, not load-bearing.
3. **The vault password.** `ubongo` gets the vault master password from Vaultwarden via
   `rbw`. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
   it **offline** with the user's Vaultwarden master password, so no live server is
   needed for already-synced entries. Provided (a) `rbw` has synced at least once and
   (b) the user keeps their **Vaultwarden master password** offline (memorised + paper
   in a safe), `ubongo` can decrypt the Ansible vault with the whole cluster down.
   Mirror the same setup onto `mamba`.

**Why not mirror/replicate Vaultwarden onto `ubongo`?** It would make the control node
*run a service* (against its remit, and adding attack surface), add DB-replication
complexity, and **still** require the Vaultwarden master password to read anything.
There is always exactly **one irreducible offline root secret**: make it the
Vaultwarden master password, and let `rbw`'s local cache make everything else
self-serve offline. (If full disaster access to *all* secrets, such as router and
Proxmox UI credentials, is wanted, that same `rbw` cache already covers it; optionally
add a scheduled encrypted `rbw export` as extra insurance.)

> **To verify (ADR-014, security-relevant):** the "`rbw` decrypts its local cache fully
> offline" behaviour is the load-bearing assumption of the recovery model. Confirm it
> against `rbw`'s docs/version during implementation before relying on it.
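One way the "vault password via `rbw`" step could be wired up is a small helper that shells out to `rbw get`. This is a sketch, not the repo's actual mechanism: the entry name `ansible-vault` is hypothetical, and whether `rbw get` succeeds with the server unreachable is exactly the to-verify item above.

```python
import subprocess

VAULT_ENTRY = "ansible-vault"  # hypothetical Vaultwarden entry name


def vault_password(entry: str = VAULT_ENTRY, run=subprocess.run) -> str:
    """Fetch one entry's password via `rbw get`, without the trailing newline.

    `run` is injectable so the function can be exercised without rbw installed.
    Raises subprocess.CalledProcessError if rbw exits non-zero.
    """
    result = run(["rbw", "get", entry], capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")
```

Ansible executes a vault password file that is marked executable and reads the password from its stdout, so a script built on this helper that ends with `print(vault_password())` could be passed to `ansible-playbook --vault-password-file`.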

---

## Testing implication

All testing runs on `ubongo`:

- **Level 1 (Molecule):** Docker on `ubongo`.
- **Lint:** on `ubongo`.
- **Level 2 / 3** (staging deploy, external smoke): driven from `ubongo` as before.

A future **service-UI acceptance level** (Claude driving a headless browser against a
deployed service: load the UI, create test users, exercise features, hand the user a
manual test script) is anticipated. `ubongo` is *sized* for it now (Chromium +
Playwright headroom). The harness itself is a **separate spec** (see Deferred).

---

## Documentation changes

A new **ADR-015: Control / development / AI-worker host (`ubongo`)** is the home of
record. Other docs get small amendments that link to it:

| Doc | Change |
|---|---|
| ADR-015 (new) | Full record of this design. |
| ADR-001 (architecture) | Control node: "dedicated Debian 13 VM on the cluster" → "dedicated physical x86 machine *outside* the cluster (`ubongo`)". |
| ADR-005 (bootstrapping) | Control-node section: "clone the cloud-init template by hand" → "install Debian 13 on the physical box". |
| ADR-009 (provisioning handoff) | Strengthen the control-node exception: now genuinely physical/outside Terraform. |
| ADR-008 (testing) | "runs on the control node or in CI" → all levels run on `ubongo`; add a stub for the future service-UI acceptance level. |
| ADR-012 / `docs/hardware/reference.md` | Add `ubongo` to the node-capacity table (physical compute, though outside the cluster). |
| `docs/runbooks/new-host.md` | Update the control-node bootstrap procedure (bare-metal Debian install, not `qm clone`). |
| `docs/runbooks/rotate-secrets.md` | Add the offline vault-password break-glass requirement. |
| `docs/security/accepted-risks.md` | Reserve an entry for the mesh-VPN third-party coordinator, pending the VPN choice. |
| `STATUS.md` | Add a row: `ubongo`, *designed, not built*. |
| `CLAUDE.md` | One-line touch to the inventory/`control`-group description if needed. |

---

## Explicitly deferred (separate specs / discussions)

1. **Mesh VPN choice:** Tailscale vs NetBird, hosted vs self-hosted. Carries the
   recovery-dimension note above (hosted coordinator helps recovery; self-hosted must
   be off-cluster). Its own discussion.
2. **Browser-E2E verification harness:** Playwright/headless-Chromium driving live
   service UIs, test-user generation, screenshot-back-to-Claude, and the new ADR-008
   level. `ubongo` is sized for it now; the harness is designed later.
3. **`rbw` offline-cache verification:** a to-verify task during implementation
   (ADR-014), before relying on offline decryption.

---

## What was ruled out

| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start and recovery (rebuild tool lives inside the thing it rebuilds); fails always-on (dies with the cluster). |
| Laptop-only (`mamba` for everything) | Fails always-on (not always carried). Retained instead as the break-glass backup. |
| Split roles (control VM + thin jump box) | Two places to maintain the toolchain, control plane split in two, heavy local testing back on a cluster VM: more moving parts, less benefit. |
| Mirror/replicate Vaultwarden onto `ubongo` | Makes the control node run a service (against its remit), adds DB-replication complexity, and still needs the Vaultwarden master password. `rbw`'s local cache achieves offline decryption without it. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg problem the whole design escapes. If self-hosted, it lives off-cluster. |
| Raspberry Pi as the box | Could run Molecule on its own, but chokes when running Docker + Chromium + toolchain together. Use an x86 mini-PC instead. |