ADR-015 — Control / development / AI-worker host (ubongo)
Status
Accepted (2026-06-05)
Context
Earlier ADRs framed the control node — the host that runs Terraform and Ansible — as a single Debian 13 VM on the Proxmox cluster, manually provisioned as the one documented exception to "Terraform owns VM existence" (ADR-009). That framing treats the control node purely as a control-plane runner.
It fails four needs, all confirmed as drivers:
- Cold-start bootstrap — the VM that runs Terraform/Ansible cannot exist until something else creates it; the bootstrap is circular and awkward.
- Always-on availability — the operator wants to SSH in from a work PC or anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down or being rebuilt.
- Recovery / disaster — the tool used to rebuild the cluster must not live inside the thing it rebuilds.
- Dev ergonomics — a persistent home for Claude Code + the repo, not entangled with production VM lifecycle.
A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start and recovery. A small dedicated always-on physical machine outside the cluster satisfies all four.
Decision
Introduce ubongo (Swahili: brain, consistent with the fleet's theme): a
single dedicated x86-64 mini-PC, always-on, living outside the Proxmox cluster.
It becomes the control node and collapses four roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the operator SSHes into
- Local test runner (Molecule/Docker, lint, and later a browser stack)
- Persistent dev home for the repo
There is no longer a control VM on the cluster. The control inventory group
points at this physical box. This strengthens the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
ubongo runs plain Debian 13 (the base role applies). It is not a production
hypervisor and runs no docker_host services. It does run ephemeral KVM test VMs
as part of its local-test-runner role (ADR-025 — local VM integration testing): one
throwaway VM at a time (~3 GiB RAM), against ~13 GiB free of the 16 GiB sized here.
This is not a production workload — it is the concrete implementation of ADR-008 Level
2/3, and the resource guard enforces one-at-a-time to stay within the RAM ceiling.
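A one-at-a-time guard of the kind described above could be sketched as an exclusive, non-blocking file lock. This is only an illustration: the name single_vm_slot and the lock path are hypothetical, and the actual guard specified in ADR-025 may work differently.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_vm_slot(lock_path):
    """Hold an exclusive lock for the lifetime of one test VM.

    Hypothetical helper: a second caller fails fast instead of starting
    a second VM and pushing past the RAM ceiling.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # LOCK_NB makes the call fail immediately if the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError("a test VM is already running") from None
        yield
    finally:
        # Closing the descriptor releases the lock if it was acquired.
        os.close(fd)
```

A test run would wrap VM creation and teardown in `with single_vm_slot(path):`, so the slot is freed even when the run fails.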
Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250). Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver is all testing being local — Molecule (Docker), lint, and a future headless-Chromium/Playwright stack.
Provisioning (bootstrap path)
Manual, on bare metal:
- Install Debian 13 on the box (one-time, by hand).
- git clone the repo; make setup; make collections; set up rbw + unlock.
- Join the mesh VPN: NetBird, self-hosted on askari (ADR-016).
- From then on ubongo manages every other host normally; Ansible manages it for baseline config via the control group (base role only).
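The post-install commands from the steps above can be written down as an ordered list for a dry run. This is a sketch under assumptions: the repository URL and checkout directory are placeholders, the function names are hypothetical, and the one-time rbw configuration and login that "set up rbw" implies is left out because this ADR does not spell it out.

```python
import shlex


def bootstrap_commands(repo_url, checkout_dir):
    """Return ordered (cwd, argv) pairs for the manual bootstrap.

    repo_url and checkout_dir are placeholders; the ADR names neither.
    """
    return [
        (None, ["git", "clone", repo_url, checkout_dir]),
        (checkout_dir, ["make", "setup"]),
        (checkout_dir, ["make", "collections"]),
        (None, ["rbw", "unlock"]),
    ]


def render(commands):
    """Render the commands as shell lines for review before running them."""
    lines = []
    for cwd, argv in commands:
        cmd = shlex.join(argv)
        lines.append(f"(cd {shlex.quote(cwd)} && {cmd})" if cwd else cmd)
    return lines
```

Printing `render(bootstrap_commands(url, directory))` gives a checklist the operator can paste line by line, which suits a one-time step that is deliberately done by hand.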
Access & security
- Remote access is via the mesh VPN: NetBird, self-hosted on askari (ADR-016). SSH to ubongo over the mesh; nothing is published to the public internet; this stays inside ADR-002.
- ubongo runs the base role: SSH hardening, nftables default-deny, fail2ban, auditd, unattended-upgrades. Inbound SSH is allowed only on the mesh interface, denied on the physical NIC.
- Operational reality (until the mesh exists): the "SSH only on the mesh interface" target above is the end state, not yet in force. Today remote access is LAN SSH only (key-only, with password auth and root login disabled) until the NetBird mesh (ADR-016) is stood up.
- AI-worker identity: ubongo runs the AI worker under a dedicated, password-locked claude user (in the docker group for Molecule; no local sudo: boma deploys reach the fleet over SSH as the ansible user, not via local root). It is reached via sudo -iu claude or its own SSH key. The rationale is attribution + revocation, not containment: auditd/Loki (ADR-018) can separate human from agent actions, and the account/key can be revoked without touching the operator's access. (ADR-021 left the on-ubongo agent identity unspecified; this records it.)
- Disk encryption: ubongo's SSD is not encrypted at rest; the SanDisk X600 is TCG-Opal-capable but Opal is unused. This is an accepted risk recorded in docs/security/accepted-risks.md (control-node disk not encrypted at rest), compensated by physical security, a BIOS supervisor password, and disabled external/USB boot.
Recovery model
ubongo is the rebuild tool, so three things must survive a full cluster loss:
- mamba (laptop) is a break-glass clone: repo + toolchain + mesh + rbw, able to drive the fleet if ubongo dies.
- Terraform state lives on ubongo, backed up encrypted off-box (synced to mamba). For a 2–5 VM fleet it is also reconstructable via terraform import.
- Vault password: ubongo gets it from Vaultwarden via rbw. rbw keeps a local encrypted copy of the vault and decrypts it offline with the operator's Vaultwarden master password, so ubongo can decrypt the Ansible vault with the whole cluster down, provided rbw has synced once and the operator keeps the Vaultwarden master password offline (memorised + paper in a safe). Mirror onto mamba.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Mirroring Vaultwarden onto ubongo is rejected: it
would make the control node run a service (against its remit) and still need that
master password.
verified: rbw offline-cache decryption · rbw 1.15.0 on ubongo · with the Vaultwarden host blocked,
rbw sync failed but rbw get decrypted the cached vault offline · 2026-06-11
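One way to wire the recovery path above into Ansible is an executable vault password file that asks rbw for the secret. The sketch below is an assumption about how that glue might look, not a record of what the repo does: the entry name ansible-vault and the function names are hypothetical.

```python
import subprocess
import sys

# Hypothetical entry name; the real Vaultwarden item is not named in this ADR.
ENTRY = "ansible-vault"


def fetch_vault_password(entry, run=subprocess.run):
    """Return the secret stored under `entry`, as printed by `rbw get`.

    `run` is injectable so the function can be exercised without rbw.
    """
    # rbw reads from its local encrypted cache, so this works with the
    # Vaultwarden host unreachable once the vault has been synced and unlocked.
    result = run(["rbw", "get", entry], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def main():
    """Print the vault password on stdout, as Ansible expects from a script."""
    try:
        print(fetch_vault_password(ENTRY))
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        sys.exit(f"could not read vault password from rbw: {exc}")
```

Saved as an executable script that calls main(), this could be passed to Ansible with --vault-password-file; Ansible runs executable password files and reads the password from their stdout.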
Consequences
- The control node is physical compute outside the cluster, so it appears in docs/hardware/reference.md even though it is not a cluster node (ADR-012).
- All testing (Molecule, lint, staging/external) runs on ubongo (ADR-008).
- A future service-UI acceptance testing level (Claude driving a headless browser against a deployed service) is anticipated; ubongo is sized for it. The harness is a separate spec.
Deferred (separate specs / discussions)
- Mesh VPN choice: RESOLVED (ADR-016). NetBird, self-hosted on askari (off-site, so it survives a homelab outage and stays out of the cluster it administers). Replaces ADR-007's OPNsense WireGuard.
- Browser-E2E verification harness: RESOLVED (ADR-017). Claude-driven exploratory service-UI verification (/verify-service, ADR-008 Level 4), against staging with test users in Authentik. Design + skill + standards complete; running deferred on the stack.
- rbw offline-cache verification: RESOLVED (2026-06-11 build). Confirmed offline cache decryption on rbw 1.15.0: rbw sync fails with Vaultwarden unreachable while rbw get still decrypts from the local cache (ADR-014).
What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (mamba for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto ubongo | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
Related
ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing), ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).