# ADR-015 — Control / development / AI-worker host (`ubongo`)

## Status

Accepted (2026-06-05)

## Context

Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
the control node purely as a control-plane runner.

It fails four needs, all confirmed as drivers:

1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
   something else creates it; the bootstrap is circular and awkward.
2. **Always-on availability** — the operator wants to SSH in from a work PC or
   anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
   or being rebuilt.
3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
   inside the thing it rebuilds.
4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
   with production VM lifecycle.

A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start,
always-on, and recovery. A small dedicated always-on physical machine outside the
cluster satisfies all four.
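
The elimination argument above can be sketched as a small lookup. The failure sets are taken from the Context text and the "What was ruled out" table; the option and need labels are illustrative names, not identifiers from the repo.

```python
# Needs named in the Context section (labels are illustrative).
NEEDS = frozenset({"cold-start", "always-on", "recovery", "dev-ergonomics"})

# Needs each option is stated to fail. Anything not listed is simply
# "not ruled out by the ADR", which is weaker than "proven to work".
FAILS = {
    "laptop-only": {"always-on", "recovery"},
    "cluster-vm-only": {"cold-start", "always-on", "recovery"},
    "dedicated-box-outside-cluster": set(),
}


def failed_needs(option: str) -> set:
    """Return the needs that the given option is stated to fail."""
    failed = set(FAILS[option])
    unknown = failed - NEEDS
    if unknown:
        raise ValueError(f"unknown needs for {option}: {sorted(unknown)}")
    return failed


def viable_options() -> list:
    """Return the options that fail none of the four needs."""
    return [name for name in FAILS if not failed_needs(name)]
```

Calling `viable_options()` leaves only the dedicated box, which is the conclusion the Decision section builds on.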

## Decision

Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
It becomes *the* control node and collapses four roles into one box:

- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the operator SSHes into
- Local test runner (Molecule/Docker, lint, and later a browser stack)
- Persistent dev home for the repo

There is **no longer a control VM on the cluster.** The `control` inventory group
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.

`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a production
hypervisor and runs no `docker_host` services. It does run **ephemeral KVM test VMs**
as part of its local-test-runner role (ADR-025 — local VM integration testing): one
throwaway VM at a time (~3 GiB RAM), against ~13 GiB free of the 16 GiB sized here.
This is not a production workload — it is the concrete implementation of ADR-008 Level
2/3, and the resource guard enforces one-at-a-time to stay within the RAM ceiling.
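
A minimal sketch of how a one-at-a-time guard could work, assuming a lock file plus a RAM headroom check. This is not the guard specified in ADR-025: the names `single_vm_slot`, `VmSlotBusy`, and `InsufficientRam`, and the lock-file approach itself, are assumptions for illustration. Only the ~3 GiB per-VM figure comes from the text above.

```python
import fcntl
import os
from contextlib import contextmanager

VM_RAM_GIB = 3.0  # approximate footprint of one throwaway test VM (from the ADR)


class VmSlotBusy(RuntimeError):
    """Raised when another test VM already holds the single slot."""


class InsufficientRam(RuntimeError):
    """Raised when free RAM is below what one test VM needs."""


@contextmanager
def single_vm_slot(lock_path: str, free_gib: float):
    """Hold the only test-VM slot for the duration of the with-block."""
    if free_gib < VM_RAM_GIB:
        raise InsufficientRam(f"{free_gib} GiB free, need {VM_RAM_GIB} GiB")
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fails at once if the slot is taken.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise VmSlotBusy("another test VM is already running") from exc
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

A test harness would wrap VM creation and teardown in `with single_vm_slot(path, free_gib): ...`, so a crashed run releases the slot as soon as its process exits.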

### Hardware target

| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |

Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250).
Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
is **all testing being local** — Molecule (Docker), lint, and a future
headless-Chromium/Playwright stack.

### Provisioning (bootstrap path)

Manual, on bare metal:

1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
   baseline config via the `control` group (`base` role only).
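
After the manual steps, a quick preflight can confirm the toolchain is on `PATH` before the first run against the fleet. The sketch below is an assumption, not part of the repo: the tool list is a guess at what `make setup` and step 3 leave behind, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed binaries after bootstrap; adjust to what `make setup` really installs.
REQUIRED_TOOLS = ("git", "make", "rbw", "ansible-playbook", "terraform", "netbird")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list:
    """Return the tools from `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the tools.
    """
    return [tool for tool in tools if which(tool) is None]
```

An empty result means every listed binary resolved; anything else names the step that still needs attention.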

### Access & security

- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
  SSH to `ubongo` over the mesh; nothing is published to the public internet — this
  stays inside ADR-002.
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
  auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
  denied on the physical NIC.
- **Operational reality (until the mesh exists):** the "SSH only on the mesh interface"
  target above is the end state, not yet in force. Today remote access is **LAN SSH
  only** — key-only, with password auth and root login disabled — until the NetBird mesh
  (ADR-016) is stood up.
- **AI-worker identity:** `ubongo` runs the AI worker under a dedicated,
  password-locked `claude` user (in the `docker` group for Molecule; **no local sudo** —
  boma deploys reach the fleet over SSH as the `ansible` user, not via local root). It is
  reached via `sudo -iu claude` or its own SSH key. The rationale is **attribution +
  revocation, not containment**: auditd/Loki (ADR-018) can separate human from agent
  actions, and the account/key can be revoked without touching the operator's access.
  (ADR-021 left the on-`ubongo` agent identity unspecified; this records it.)
- **Disk encryption:** `ubongo`'s SSD is **not encrypted at rest** — the SanDisk X600 is
  TCG-Opal-capable but Opal is unused. This is an accepted risk recorded in
  `docs/security/accepted-risks.md` (control-node disk not encrypted at rest),
  compensated by physical security, a BIOS supervisor password, and disabled
  external/USB boot.
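
The two SSH access states described above (the mesh-only end state and today's LAN-only interim) can be modelled as a small policy object. This is an illustrative sketch, not the `base` role's nftables template: `SshPolicy` is a hypothetical name, `wt0` is assumed to be the NetBird interface, and `enp1s0` is a placeholder for the physical NIC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SshPolicy:
    mesh_iface: str = "wt0"      # assumed NetBird interface name
    lan_iface: str = "enp1s0"    # placeholder name for the physical NIC
    mesh_in_force: bool = False  # False models today's interim LAN-only state

    def allows_inbound_ssh(self, iface: str) -> bool:
        """Return True if inbound SSH is accepted on the given interface."""
        if self.mesh_in_force:
            # End state: mesh interface only, physical NIC denied.
            return iface == self.mesh_iface
        # Interim state: LAN only, because the mesh does not exist yet.
        return iface == self.lan_iface
```

Flipping `mesh_in_force` to `True` once ADR-016 is delivered moves the single allowed interface from the LAN NIC to the mesh, matching the default-deny stance in both states.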

### Recovery model

`ubongo` is the rebuild tool, so three things must survive a full cluster loss:

1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
   able to drive the fleet if `ubongo` dies.
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
   `mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`.
3. **Vault password** — `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
   local encrypted copy of the vault and decrypts it offline with the operator's
   Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
   whole cluster down — provided `rbw` has synced once and the operator keeps the
   Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
   `mamba`.

There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
would make the control node run a service (against its remit) and still need that
master password.
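
A minimal sketch of how the vault password could be read from the local `rbw` cache, assuming the secret is stored under an entry called `ansible-vault`. That entry name and the `vault_password` helper are hypothetical; `rbw get <name>` is the only CLI call relied on.

```python
import subprocess

ENTRY = "ansible-vault"  # hypothetical Vaultwarden entry name


def vault_password(entry: str = ENTRY, run=subprocess.run) -> str:
    """Return the Ansible vault password from rbw's local cache.

    `rbw get` works against the cached vault, so this needs an unlocked
    rbw agent but no reachable Vaultwarden server. `run` is injectable
    so the function can be exercised without rbw installed.
    """
    result = run(["rbw", "get", entry], capture_output=True, text=True, check=True)
    password = result.stdout.strip()
    if not password:
        raise RuntimeError(f"rbw returned an empty secret for {entry!r}")
    return password
```

A small executable wrapper that prints `vault_password()` could then be passed to Ansible as a vault password file, since Ansible runs executable password files and reads the password from their standard output.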

> verified: rbw offline-cache decryption · rbw 1.15.0 on ubongo · with the Vaultwarden
> host blocked, `rbw sync` failed but `rbw get` decrypted the cached vault offline ·
> 2026-06-11

## Consequences

- The control node is physical compute outside the cluster, so it appears in
  `docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
- A future **service-UI acceptance** testing level (Claude driving a headless browser
  against a deployed service) is anticipated; `ubongo` is sized for it. The harness
  is a separate spec.

## Deferred (separate specs / discussions)

1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
   (off-site, so it survives a homelab outage and stays out of the cluster it
   administers). Replaces ADR-007's OPNsense WireGuard.
2. **Browser-E2E verification harness — RESOLVED (ADR-017):** Claude-driven
   exploratory service-UI verification (`/verify-service`, ADR-008 Level 4), against
   staging with test users in Authentik. Design + skill + standards complete; running
   it is deferred pending the stack.
3. **`rbw` offline-cache verification — RESOLVED (2026-06-11 build):** confirmed offline
   cache decryption on rbw 1.15.0 — `rbw sync` fails with Vaultwarden unreachable while
   `rbw get` still decrypts from the local cache (ADR-014).

## What was ruled out

| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |

## Related

ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).