
ADR-015 — Control / development / AI-worker host (ubongo)

Status

Accepted (2026-06-05). Amended 2026-06-18: the claude AI-worker account now has NOPASSWD:ALL sudo on ubongo — reversing the original "no local sudo" sub-decision. The amendment is recorded in §Access & security below; rationale and accepted risk are in ADR-021 and docs/security/accepted-risks.md (R7).

Context

Earlier ADRs framed the control node — the host that runs Terraform and Ansible — as a single Debian 13 VM on the Proxmox cluster, manually provisioned as the one documented exception to "Terraform owns VM existence" (ADR-009). That framing treats the control node purely as a control-plane runner.

It fails four needs, all confirmed as drivers:

  1. Cold-start bootstrap — the VM that runs Terraform/Ansible cannot exist until something else creates it; the bootstrap is circular and awkward.
  2. Always-on availability — the operator wants to SSH in from a work PC or anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down or being rebuilt.
  3. Recovery / disaster — the tool used to rebuild the cluster must not live inside the thing it rebuilds.
  4. Dev ergonomics — a persistent home for Claude Code + the repo, not entangled with production VM lifecycle.

A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start, always-on, and recovery. A small dedicated always-on physical machine outside the cluster satisfies all four.

Decision

Introduce ubongo (Swahili: brain, consistent with the fleet's theme): a single dedicated x86-64 mini-PC, always-on, living outside the Proxmox cluster. It becomes the control node and collapses four roles into one box:

  • Terraform + Ansible runner (control plane)
  • Claude Code / AI-worker host the operator SSHes into
  • Local test runner (Molecule/Docker, lint, and later a browser stack)
  • Persistent dev home for the repo

There is no longer a control VM on the cluster. The control inventory group points at this physical box. This strengthens the ADR-009 control-node exception: it is genuinely outside Terraform's world, not a VM pretending to be the exception. Every other host stays a Terraform-managed VM exactly as designed.

ubongo runs plain Debian 13 (the base role applies). It is not a production hypervisor and runs no docker_host services. It does run ephemeral KVM test VMs as part of its local-test-runner role (ADR-025 — local VM integration testing): one throwaway VM at a time (~3 GiB RAM), against ~13 GiB free of the 16 GiB sized here. This is not a production workload — it is the concrete implementation of ADR-008 Level 2/3, and the resource guard enforces one-at-a-time to stay within the RAM ceiling.
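The one-at-a-time guard described above can be pictured roughly as follows. This is a minimal sketch, not the harness's actual implementation (that belongs to ADR-025): the lock path and all function names are assumptions made for illustration, and only the ~3 GiB per-VM figure comes from this ADR.

```python
import fcntl
import os
from contextlib import contextmanager

VM_RAM_KIB = 3 * 1024 * 1024      # ~3 GiB per throwaway VM (figure from this ADR)
LOCK_PATH = "/tmp/test-vm.lock"   # hypothetical lock location


def mem_available_kib(meminfo_text: str) -> int:
    """Return the MemAvailable value (KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def can_start_vm(meminfo_text: str, needed_kib: int = VM_RAM_KIB) -> bool:
    """True if the host reports enough available RAM for one more test VM."""
    return mem_available_kib(meminfo_text) >= needed_kib


@contextmanager
def single_vm_slot(lock_path: str = LOCK_PATH):
    """Hold an exclusive, non-blocking lock for the lifetime of one test VM.

    Raises BlockingIOError if another VM run already holds the slot.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A caller would read `/proc/meminfo`, check `can_start_vm(...)`, and then create the VM inside `with single_vm_slot():` so that a second concurrent run fails fast instead of exhausting RAM.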

Hardware target

| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |

Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250). Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver is that all testing runs locally: Molecule (Docker), lint, and a future headless-Chromium/Playwright stack.

Provisioning (bootstrap path)

Manual, on bare metal:

  1. Install Debian 13 on the box (one-time, by hand).
  2. git clone the repo; make setup; make collections; set up rbw + unlock.
  3. Join the mesh VPN — NetBird, self-hosted on askari (ADR-016).
  4. From then on ubongo manages every other host normally; Ansible manages it for baseline config via the control group (base role only).
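After the manual steps above, a quick preflight can confirm the toolchain is actually on the PATH before ubongo is trusted to drive the fleet. The sketch below is illustrative only: the tool list is an assumption inferred from the steps in this section, and the function name is hypothetical.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumed tool list, inferred from the bootstrap steps; adjust to the repo's reality.
REQUIRED_TOOLS = ("git", "make", "rbw", "ansible-playbook", "terraform", "netbird")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]
```

An empty result from `missing_tools()` means every listed binary resolved; anything else names what still needs installing.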

Access & security

  • Remote access is via the mesh VPN — NetBird, self-hosted on askari (ADR-016). SSH to ubongo over the mesh; nothing is published to the public internet — this stays inside ADR-002.

  • ubongo runs the base role: SSH hardening, nftables default-deny, fail2ban, auditd, unattended-upgrades. Inbound SSH is allowed only on the mesh interface, denied on the physical NIC.

  • Operational reality (until the mesh exists): the "SSH only on the mesh interface" target above is the end state, not yet in force. Today remote access is LAN SSH only — key-only, with password auth and root login disabled — until the NetBird mesh (ADR-016) is stood up.

  • AI-worker identity: ubongo runs the AI worker under a dedicated, password-locked claude user (in the docker and libvirt groups; NOPASSWD:ALL sudo via a repo-managed drop-in — see amendment below). It is reached via sudo -iu claude or its own SSH key. The rationale is attribution + revocation, not containment: auditd/Loki (ADR-018) can separate human from agent actions, and the account/key can be revoked without touching the operator's access. (ADR-021 left the on-ubongo agent identity unspecified; this records it.)

    Amendment (2026-06-18) — claude now has NOPASSWD:ALL sudo.

    Superseded by ADR-025 (per ADR-023 §4): the "no local sudo" sub-decision is reversed. The shakedown that necessitated it is ADR-025; the resulting two-account access model is ADR-021; the accepted risk is R7.

    During the integration-testing harness shakedown, the original "no local sudo" sub-decision was reversed. No-sudo blocked the AI-worker from diagnosing a failed VM: virsh, virt-install, cloud-localds, journalctl, nft — nearly all low-level diagnostic commands — require root. The AI-worker must autonomously spin up, inspect, and tear down test VMs without operator hand-holding; that is the harness's core value proposition. Compensating controls make the risk acceptable:

    1. claude's password is locked (no interactive login, no su claude without the operator's own credentials) — NOPASSWD sudo is the only sudo path.
    2. auditd + Loki attribution (ADR-018) separates human from agent root actions.
    3. The drop-in is repo-managed via base__ai_worker_user — revocable in one commit and one deploy.
    4. Single-operator homelab: everything in git, off-machine backups (ADR-022).

    The operator (sjat) uses password-required sudo via the sudo group; their former NOPASSWD drop-in was removed 2026-06-18 as redundant once claude had sudo (least-privilege cleanup). The accepted risk is registered as R7 in docs/security/accepted-risks.md. ADR-021 records the resulting sudo model for both accounts.

  • Disk encryption: ubongo's SSD is not encrypted at rest — the SanDisk X600 is TCG-Opal-capable but Opal is unused. This is an accepted risk recorded in docs/security/accepted-risks.md (control-node disk not encrypted at rest), compensated by physical security, a BIOS supervisor password, and disabled external/USB boot.
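To make the AI-worker sudo grant from the amendment concrete, the sketch below renders a one-line sudoers drop-in and syntax-checks it with `visudo -cf` before it would be installed. It is an assumption-laden illustration: the real drop-in is managed by base__ai_worker_user, the function names here are hypothetical, and only the `claude` account and the NOPASSWD:ALL grant come from this ADR.

```python
import re
import subprocess
import tempfile

_USER_RE = re.compile(r"^[a-z_][a-z0-9_-]*$")


def render_dropin(user: str) -> str:
    """Render a sudoers line granting passwordless sudo to one user."""
    if not _USER_RE.match(user):
        raise ValueError(f"unsafe user name: {user!r}")
    return f"{user} ALL=(ALL:ALL) NOPASSWD:ALL\n"


def check_syntax(content: str) -> bool:
    """Return True if `visudo -cf` accepts the content (requires visudo)."""
    with tempfile.NamedTemporaryFile("w", suffix="-sudoers") as tmp:
        tmp.write(content)
        tmp.flush()
        result = subprocess.run(
            ["visudo", "-cf", tmp.name], capture_output=True, text=True
        )
    return result.returncode == 0
```

Checking syntax before deployment matters because a malformed file under /etc/sudoers.d can break sudo for every account, including the operator's.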

Recovery model

ubongo is the rebuild tool, so three things must survive a full cluster loss:

  1. mamba (laptop) is a break-glass clone — repo + toolchain + mesh + rbw, able to drive the fleet if ubongo dies.
  2. Terraform state lives on ubongo, backed up encrypted off-box (synced to mamba). For a 25 VM fleet it is also reconstructable via terraform import.
  3. Vault password: ubongo gets it from Vaultwarden via rbw. rbw keeps a local encrypted copy of the vault and decrypts it offline with the operator's Vaultwarden master password, so ubongo can decrypt the Ansible vault with the whole cluster down, provided rbw has synced once and the operator keeps the Vaultwarden master password offline (memorised + paper in a safe). Mirror onto mamba.

There is always exactly one irreducible offline root secret; here it is the Vaultwarden master password. Mirroring Vaultwarden onto ubongo is rejected: it would make the control node run a service (against its remit) and still need that master password.

verified: rbw offline-cache decryption · rbw 1.15.0 on ubongo · with the Vaultwarden host blocked, rbw sync failed but rbw get decrypted the cached vault offline · 2026-06-11
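One plausible way to connect rbw to Ansible is a vault password script: when the file given to `--vault-password-file` is executable, Ansible runs it and reads the password from its stdout. The sketch below shows the core of such a script. The entry name `ansible-vault` and the function name are hypothetical, and the repo's actual wiring is not described in this ADR.

```python
import subprocess
from typing import Callable

VAULT_ENTRY = "ansible-vault"  # hypothetical rbw entry name


def fetch_vault_password(
    entry: str = VAULT_ENTRY,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Return the vault password by asking rbw for the named entry.

    `rbw get` reads from the local encrypted cache, so this works while
    Vaultwarden is unreachable, as long as rbw is unlocked and has synced once.
    Raises subprocess.CalledProcessError if rbw exits non-zero.
    """
    result = run(["rbw", "get", entry], capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

An executable wrapper would simply print `fetch_vault_password()` and exit non-zero on failure; the `run` parameter exists only so the function can be exercised without rbw installed.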

Consequences

  • The control node is physical compute outside the cluster, so it appears in docs/hardware/reference.md even though it is not a cluster node (ADR-012).
  • All testing (Molecule, lint, staging/external) runs on ubongo (ADR-008).
  • A future service-UI acceptance testing level (Claude driving a headless browser against a deployed service) is anticipated; ubongo is sized for it. The harness is a separate spec.

Deferred (separate specs / discussions)

  1. Mesh VPN choice — RESOLVED (ADR-016): NetBird, self-hosted on askari (off-site, so it survives a homelab outage and stays out of the cluster it administers). Replaces ADR-007's OPNsense WireGuard.
  2. Browser-E2E verification harness — RESOLVED (ADR-017): Claude-driven exploratory service-UI verification (/verify-service, ADR-008 Level 4), against staging with test users in Authentik. Design + skill + standards complete; running deferred on the stack.
  3. rbw offline-cache verification — RESOLVED (2026-06-11 build): confirmed offline cache decryption on rbw 1.15.0 — rbw sync fails with Vaultwarden unreachable while rbw get still decrypts from the local cache (ADR-014).

What was ruled out

| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (mamba for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto ubongo | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |

Related: ADR-001 (architecture), ADR-002 (security), ADR-005 (bootstrapping), ADR-008 (testing), ADR-009 (provisioning handoff), ADR-012 (hardware/capacity).