boma/docs/decisions/015-control-host.md
sjat bc8592616b fix: address final whole-branch review findings
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:52:28 +02:00

192 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-015 — Control / development / AI-worker host (`ubongo`)
## Status
Accepted (2026-06-05). **Amended 2026-06-18:** the `claude` AI-worker account now has
`NOPASSWD:ALL` sudo on `ubongo` — reversing the original "no local sudo" sub-decision.
The amendment is recorded in §Access & security below; rationale and accepted risk are
in ADR-021 and `docs/security/accepted-risks.md` (R7).
## Context
Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
the control node purely as a control-plane runner.
It fails four needs, all confirmed as drivers:
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
something else creates it; the bootstrap is circular and awkward.
2. **Always-on availability** — the operator wants to SSH in from a work PC or
anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
or being rebuilt.
3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
inside the thing it rebuilds.
4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
with production VM lifecycle.
A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
and recovery. A small dedicated always-on physical machine outside the cluster
satisfies all four.
## Decision
Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
It becomes *the* control node and collapses four roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the operator SSHes into
- Local test runner (Molecule/Docker, lint, and later a browser stack)
- Persistent dev home for the repo
There is **no longer a control VM on the cluster.** The `control` inventory group
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a production
hypervisor and runs no `docker_host` services. It does run **ephemeral KVM test VMs**
as part of its local-test-runner role (ADR-025 — local VM integration testing): one
throwaway VM at a time (~3 GiB RAM), against ~13 GiB free of the 16 GiB sized here.
This is not a production workload — it is the concrete implementation of ADR-008 Level
2/3, and the resource guard enforces one-at-a-time to stay within the RAM ceiling.
### Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150250).
Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
is **all testing being local** — Molecule (Docker), lint, and a future
headless-Chromium/Playwright stack.
### Provisioning (bootstrap path)
Manual, on bare metal:
1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
baseline config via the `control` group (`base` role only).
### Access & security
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
denied on the physical NIC.
- **Operational reality (until the mesh exists):** the "SSH only on the mesh interface"
target above is the end state, not yet in force. Today remote access is **LAN SSH
only** — key-only, with password auth and root login disabled — until the NetBird mesh
(ADR-016) is stood up.
- **AI-worker identity:** `ubongo` runs the AI worker under a dedicated,
password-locked `claude` user (in the `docker` and `libvirt` groups; **`NOPASSWD:ALL`
sudo** via a repo-managed drop-in — see amendment below). It is reached via `sudo -iu
claude` or its own SSH key. The rationale is **attribution + revocation, not
containment**: auditd/Loki (ADR-018) can separate human from agent actions, and the
account/key can be revoked without touching the operator's access. (ADR-021 left the
on-`ubongo` agent identity unspecified; this records it.)
**Amendment (2026-06-18) — `claude` now has `NOPASSWD:ALL` sudo.**
> **Superseded by [ADR-025](025-local-vm-integration-testing.md)** (per ADR-023 §4): the
> "no local sudo" sub-decision is reversed. The shakedown that necessitated it is ADR-025;
> the resulting two-account access model is ADR-021; the accepted risk is R7.
During the
integration-testing harness shakedown, the original "no local sudo" sub-decision was
reversed. No-sudo blocked the AI-worker from diagnosing a failed VM: `virsh`,
`virt-install`, `cloud-localds`, `journalctl`, `nft` — nearly all low-level
diagnostic commands — require root. The AI-worker must autonomously spin up,
inspect, and tear down test VMs without operator hand-holding; that is the harness's
core value proposition. Compensating controls make the risk acceptable:
1. `claude`'s password is **locked** (no interactive login, no `su claude` without the
operator's own credentials) — `NOPASSWD` sudo is the *only* sudo path.
2. `auditd` + Loki attribution (ADR-018) separates human from agent root actions.
3. The drop-in is **repo-managed** via `base__ai_worker_user` — revocable in one commit
and one deploy.
4. Single-operator homelab: everything in git, off-machine backups (ADR-022).
The operator (`sjat`) uses **password-required sudo** via the `sudo` group; their
former `NOPASSWD` drop-in was removed 2026-06-18 as redundant once `claude` had sudo
(least-privilege cleanup). The accepted risk is registered as R7 in
`docs/security/accepted-risks.md`. ADR-021 records the resulting sudo model for both
accounts.
- **Disk encryption:** `ubongo`'s SSD is **not encrypted at rest** — the SanDisk X600 is
TCG-Opal-capable but Opal is unused. This is an accepted risk recorded in
`docs/security/accepted-risks.md` (control-node disk not encrypted at rest),
compensated by physical security, a BIOS supervisor password, and disabled
external/USB boot.
### Recovery model
`ubongo` is the rebuild tool, so three things must survive a full cluster loss:
1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
able to drive the fleet if `ubongo` dies.
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
`mamba`). For a 25 VM fleet it is also reconstructable via `terraform import`.
3. **Vault password**`ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
local encrypted copy of the vault and decrypts it offline with the operator's
Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
whole cluster down — provided `rbw` has synced once and the operator keeps the
Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
`mamba`.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
would make the control node run a service (against its remit) and still need that
master password.
> verified: rbw offline-cache decryption · rbw 1.15.0 on ubongo · with the Vaultwarden
> host blocked, `rbw sync` failed but `rbw get` decrypted the cached vault offline ·
> 2026-06-11
## Consequences
- The control node is physical compute outside the cluster, so it appears in
`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
- A future **service-UI acceptance** testing level (Claude driving a headless browser
against a deployed service) is anticipated; `ubongo` is sized for it. The harness
is a separate spec.
## Deferred (separate specs / discussions)
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
(off-site, so it survives a homelab outage and stays out of the cluster it
administers). Replaces ADR-007's OPNsense WireGuard.
2. **Browser-E2E verification harness — RESOLVED (ADR-017):** Claude-driven
exploratory service-UI verification (`/verify-service`, ADR-008 Level 4), against
staging with test users in Authentik. Design + skill + standards complete; running
deferred on the stack.
3. **`rbw` offline-cache verification — RESOLVED (2026-06-11 build):** confirmed offline
cache decryption on rbw 1.15.0 — `rbw sync` fails with Vaultwarden unreachable while
`rbw get` still decrypts from the local cache (ADR-014).
## What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
## Related
ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).