Compare commits

...

17 commits

Author SHA1 Message Date
a53941dffe CLAUDE.md: fix capabilities doc link after rename to CAPABILITIES.md 2026-06-05 09:50:28 +02:00
7a48a60f14 CLAUDE.md: fix project summary — control node is physical ubongo 2026-06-05 09:49:23 +02:00
a30c1af3f0 CLAUDE.md: link ADR-015; note ubongo as physical control node 2026-06-05 09:48:09 +02:00
9653a34241 STATUS: record ubongo control host as designed, not built 2026-06-05 09:47:24 +02:00
55a3666d16 accepted-risks: reserve R3 mesh-VPN coordinator (pending choice) 2026-06-05 09:46:40 +02:00
a2db8058e7 rotate-secrets: document offline vault break-glass for ubongo 2026-06-05 09:45:27 +02:00
b89ca8835a new-host runbook: control node ubongo is bare-metal
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 09:44:31 +02:00
3fb780c286 ADR-012/hardware: add ubongo as physical control node
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 09:43:09 +02:00
66064be7b2 ADR-008: tests run on ubongo; stub Level 4 service-UI acceptance 2026-06-05 09:42:01 +02:00
07bc1c83f0 ADR-009: control-node exception is a physical box, not a VM 2026-06-05 09:41:03 +02:00
1064716d49 ADR-005: control node bootstrap is bare-metal Debian on ubongo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 09:40:15 +02:00
15779be086 ADR-001: control node is physical ubongo outside cluster 2026-06-05 09:39:18 +02:00
5aca796fa0 Add ADR-015 (control/AI-worker host ubongo) 2026-06-05 09:37:56 +02:00
4cf4aaa12e Renamed capabilities doc to capital letters to comform with other. 2026-06-05 09:36:55 +02:00
d96cf9f846 FRICTION: default to subagent-driven execution, don't ask
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 09:35:13 +02:00
0e9f179bfc Add implementation plan for ubongo control host
Task-by-task docs plan: author ADR-015 and reconcile ADR-001/005/008/009/012,
the new-host and rotate-secrets runbooks, accepted-risks, STATUS, and CLAUDE.md.
Documentation-only; the physical box stays "designed, not built".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 09:29:10 +02:00
c1b21c9b2b Add design spec for ubongo control/AI-worker host
Records the decision to replace the cluster-resident control VM with a
dedicated always-on physical mini-PC (ubongo) outside the Proxmox
cluster, collapsing control plane, AI-worker host, dev home, and local
test runner into one box. Basis for ADR-015.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 09:19:02 +02:00
16 changed files with 1187 additions and 40 deletions

@@ -14,7 +14,8 @@ Keep it dense and command-focused. Verbose detail lives in `docs/`.
Homelab infrastructure automation for a Proxmox cluster running 25 Debian 13 VMs.
All hosts share a hardened base configuration. Each host runs a defined set of Docker
services deployed via Compose files rendered from Ansible templates. Ansible runs from
a dedicated control VM. CI runs on Forgejo Actions (self-hosted).
a dedicated physical control node (`ubongo`) outside the cluster. CI runs on Forgejo
Actions (self-hosted).
Full design rationale: `docs/decisions/`
@@ -105,7 +106,8 @@ inventories/
Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`
(`control` holds the one manually-provisioned control node — see ADR-009.)
(`control` holds `ubongo`, the one manually-provisioned **physical** control node
outside the cluster — see ADR-009 and ADR-015.)
---
@@ -187,7 +189,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Topic | File |
|------------------------|---------------------------------------|
| Architecture overview | `docs/decisions/001-architecture.md` |
| Capabilities overview (what boma does) | `docs/capabilities.md` |
| Capabilities overview (what boma does) | `docs/CAPABILITIES.md` |
| Security baseline & strategy | `docs/decisions/002-security.md` |
| Accepted security risks | `docs/security/accepted-risks.md` |
| Per-service security checklist | `docs/security/service-checklist.md` |
@@ -197,6 +199,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Toolchain choices | `docs/decisions/003-toolchain.md` |
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
| Terraform | `docs/decisions/006-terraform.md` |
| Network topology | `docs/decisions/007-network.md` |
| Testing methodology | `docs/decisions/008-testing.md` |

@@ -52,6 +52,7 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
## Keeping this honest

@@ -53,3 +53,10 @@ earning its keep.
apply — the real path is local fast-forward merge to `main`, then push. → Skills and
conventions that assume a GitHub-style PR workflow need a homelab-aware variant;
encode that here "finishing a branch" means merge-locally-then-push, not open-a-PR.
## 2026-06-05
- `[recurring]` The `writing-plans` skill ends by asking "subagent-driven vs inline
execution?" — always answer subagent-driven here. Don't ask; default straight to
subagent-driven (fresh subagent per task + review between tasks). → Standing
preference; skip the execution-mode prompt.

@@ -10,15 +10,16 @@ and the boundaries of what this Ansible monorepo manages.
- **Hypervisor**: Proxmox cluster (2+ nodes)
- **Guest OS**: Debian 13 (all managed hosts)
- **Scale**: 25 VMs, small fleet — treated as individuals, not cattle
- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
The control node is the one host that cannot fully bootstrap itself from scratch
and requires manual initial setup (see `docs/runbooks/new-host.md`).
- **Control node**: `ubongo` — a dedicated always-on **physical** x86-64 machine
**outside** the cluster. Ansible runs from here. It cannot be created by the
Terraform it hosts, so it is provisioned manually (see ADR-015 and
`docs/runbooks/new-host.md`).
## What this repo manages
| Layer | Managed by | Notes |
|--------------------|--------------------|--------------------------------------------|
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; `ubongo` (control node) is a physical box outside the cluster, the one manual exception (see ADR-009/ADR-015) |
| Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) |
| OS baseline | Ansible `base` role | Users, SSH, firewall, updates, audit |
| Docker runtime | Ansible `docker_host` role | Engine, daemon config, log driver |
@@ -32,7 +33,7 @@ describes the *intended* design — see STATUS.md for what is actually built.
```
all
├── control # the control node itself — baseline config only, runs no services
├── control # ubongo — physical control node outside the cluster; baseline config only, runs no services
├── docker_hosts # VMs running Docker services (most hosts)
└── proxmox_hosts # Proxmox nodes themselves (limited management scope)
```

@@ -51,11 +51,12 @@ for the end-to-end commands and `docs/runbooks/new-host.md` for the full procedu
## Control node bootstrapping
The control node is a special case — it runs Terraform and Ansible, so it cannot
be created by the Terraform it hosts (chicken-and-egg). It is the one documented
exception to Terraform-owned VM existence (see ADR-009). The control node requires:
be created by the Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated
**physical** machine outside the cluster, and the one documented exception to
Terraform-owned VM existence (see ADR-009 and ADR-015). The control node requires:
1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
`qm clone`), since Terraform is not yet available to do it
1. Manual OS provisioning — install Debian 13 on the physical box by hand (it is not
a Proxmox guest, so there is no template to clone)
2. Manual setup of the Ansible environment:
```bash
git clone <repo> ~/ansible
@@ -68,9 +69,10 @@ exception to Terraform-owned VM existence (see ADR-009). The control node requir
```
3. After that, the control node can manage all other hosts normally
The control node itself is listed in `inventories/production/hosts.yml` under
a `control` group and can be managed for baseline config (SSH, firewall, updates)
but not for the `docker_host` role (it does not run services).
`ubongo` is listed in `inventories/production/hosts.yml` under the `control` group
and can be managed for baseline config (SSH, firewall, updates) but not for the
`docker_host` role (it does not run services). Hardware target and recovery model
are in ADR-015.
## Decision

@@ -12,7 +12,7 @@ This document records the testing strategy, what each level covers, and — crit
### Level 1 — Molecule (per role, always required)
Runs in Docker on the control node or in CI. Fast (~5 min per role).
Runs in Docker on the control node (`ubongo`) or in CI. Fast (~5 min per role).
**What happens during `molecule test`:**
1. `create` — start the test container
@@ -53,6 +53,15 @@ Once `askari` is operational: scripted checks from outside the network confirmin
that public-facing services respond correctly. Catches firewall and reverse proxy
configuration issues invisible to Ansible check mode.
### Level 4 — Service-UI acceptance (planned, not built)
Claude drives a headless browser from `ubongo` against a *deployed* service: loads
the rendered UI, creates test users, exercises features, and hands the operator a
manual test script for the rest. Catches application-level regressions that no lower
level sees. The harness (Playwright/headless-Chromium, screenshot-back-to-Claude) is
a **separate spec**; `ubongo` is sized for it (ADR-015). Status: designed, not built
(STATUS.md).
---
## Molecule test image

@@ -126,12 +126,14 @@ convention only — it no longer implies any difference in how records are writt
## The control-node exception
The control node — the host that runs Terraform and Ansible — is the one VM
Terraform does **not** create. It cannot provision the infrastructure that would
provision itself (chicken-and-egg). It is therefore the single documented exception
to "Terraform owns VM existence":
The control node — the host that runs Terraform and Ansible — is `ubongo`, a
dedicated **physical** machine outside the cluster. It is not a VM at all, so
Terraform genuinely never touches it: it cannot provision the infrastructure that
would provision itself (chicken-and-egg). It is therefore the single documented
exception to "Terraform owns VM existence":
- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
- Provisioned and bootstrapped manually on bare metal, per the control-node section
of ADR-005; rationale, hardware, and recovery model in ADR-015.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
Ansible for baseline config only (no `docker_host` role).

@@ -13,6 +13,8 @@ workload that should move, or a node due an upgrade.
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
physical compute + network gear and workload placement intent. Two
machine-readable tables (node capacity, workload placement) carry the numbers.
This includes `ubongo`, the physical control node (ADR-015), even though it sits
outside the Proxmox cluster.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
parses those tables, computes per-node allocated-vs-physical rollups, and
cross-checks workload hostnames against `terraform output -json` /

@@ -0,0 +1,133 @@
# ADR-015 — Control / development / AI-worker host (`ubongo`)
## Context
Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
the control node purely as a control-plane runner.
It fails four needs, all confirmed as drivers:
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
something else creates it; the bootstrap is circular and awkward.
2. **Always-on availability** — the operator wants to SSH in from a work PC or
anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
or being rebuilt.
3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
inside the thing it rebuilds.
4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
with production VM lifecycle.
A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
and recovery. A small dedicated always-on physical machine outside the cluster
satisfies all four.
## Decision
Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
It becomes *the* control node and collapses four roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the operator SSHes into
- Local test runner (Molecule/Docker, lint, and later a browser stack)
- Persistent dev home for the repo
There is **no longer a control VM on the cluster.** The `control` inventory group
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor
and runs no `docker_host` services.
### Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250).
Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
is **all testing being local** — Molecule (Docker), lint, and a future
headless-Chromium/Playwright stack.
### Provisioning (bootstrap path)
Manual, on bare metal:
1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN (choice deferred — see below).
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
baseline config via the `control` group (`base` role only).
### Access & security
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
mesh; nothing is published to the public internet — this stays inside ADR-002.
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
denied on the physical NIC.
### Recovery model
`ubongo` is the rebuild tool, so three things must survive a full cluster loss:
1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
able to drive the fleet if `ubongo` dies.
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
`mamba`). For a 25 VM fleet it is also reconstructable via `terraform import`.
3. **Vault password**: `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
local encrypted copy of the vault and decrypts it offline with the operator's
Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
whole cluster down — provided `rbw` has synced once and the operator keeps the
Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
`mamba`.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
would make the control node run a service (against its remit) and still need that
master password.
> verified: rbw offline-cache decryption · TO VERIFY before relying on the recovery
> model · rbw docs · (ADR-014, security-relevant — confirm during build)
## Consequences
- The control node is physical compute outside the cluster, so it appears in
`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
- A future **service-UI acceptance** testing level (Claude driving a headless browser
against a deployed service) is anticipated; `ubongo` is sized for it. The harness
is a separate spec.
## Deferred (separate specs / discussions)
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
or it recreates the chicken-and-egg.
2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user
generation, screenshot-back-to-Claude, and the new ADR-008 level.
3. **`rbw` offline-cache verification** — confirm offline decryption before relying
on it (ADR-014).
## What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).

@@ -18,6 +18,14 @@
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_
### ubongo (control node — outside the cluster)
- **Model / form factor:** _TBD (x86-64 mini-PC / USFF, e.g. N100 or refurb micro)_
- **CPU:** _TBD (target 4 cores, x86-64)_
- **RAM:** _TBD (target 16 GB)_
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
- **NICs:** _wired GbE_
- **Notes:** _always-on; control plane + AI-worker + local test runner (ADR-015); not a Proxmox guest_
_(repeat for pve1, pve2, askari)_
## 2. Network gear
@@ -46,6 +54,7 @@ Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals.
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 250 |
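
As a rough illustration of how a stdlib-only script could read the node-capacity rows above, here is a minimal Python sketch. It is not the actual `scripts/capacity-scan.py`; the column names `node` and `cores`, the `NodeCapacity` type, and the `parse_capacity_rows` helper are assumptions made for this example (only `ram_gb` and `disk_gb` are named in the text).

```python
from typing import NamedTuple


class NodeCapacity(NamedTuple):
    """One row of the node-capacity table (field names are assumed)."""
    node: str
    cores: int
    ram_gb: float
    disk_gb: float


def parse_capacity_rows(markdown: str) -> dict[str, NodeCapacity]:
    """Parse four-column table rows, skipping header and separator rows."""
    nodes: dict[str, NodeCapacity] = {}
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue  # not a capacity row
        name, cores, ram, disk = cells
        try:
            nodes[name] = NodeCapacity(name, int(cores), float(ram), float(disk))
        except ValueError:
            continue  # header row or |---| separator: values are not numeric
    return nodes


# Sample input mirroring the rows shown in this diff (header names assumed).
TABLE = """\
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 250 |
"""

capacity = parse_capacity_rows(TABLE)
# ubongo sits outside the Proxmox cluster, so a cluster rollup may want to skip it.
cluster_cores = sum(n.cores for name, n in capacity.items() if name != "ubongo")
```

Whether the real scanner excludes `ubongo` from cluster rollups is not stated here; the filter on the last line only shows one way that distinction could be expressed.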
## 5. Capacity notes

@@ -2,7 +2,8 @@
## Prerequisites
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not).
Not needed for the control node `ubongo`, which is bare-metal (Part E).
- `rbw` is installed and unlocked (`rbw unlock`) so the vault password resolves from Vaultwarden
- The host's intended hostname and IP are decided
@@ -110,27 +111,32 @@ make check PLAYBOOK=site
---
## Part E — Control node (manual exception)
## Part E — Control node (`ubongo`, manual exception)
The control node runs Terraform and Ansible, so it cannot be created by the
Terraform it hosts (chicken-and-egg). It is the **one** host provisioned manually —
see ADR-009 and the control-node section of ADR-005. Use the template from Part A:
Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
machine outside the cluster — not a Proxmox guest. It is the **one** host
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
```bash
# Clone the template by hand (Proxmox UI or qm clone)
qm clone 9000 <VMID> --name <hostname> --full
qm set <VMID> --memory 2048 --cores 2 \
--ciuser ansible \
--sshkeys /path/to/ansible_ed25519.pub \
--ipconfig0 ip=<IP>/24,gw=<GATEWAY>
qm start <VMID>
```
1. Install Debian 13 on the physical box by hand (no template to clone).
2. Create the `ansible` user and install its SSH public key.
3. Set up the Ansible environment on it:
```bash
git clone <repo> ~/ansible
cd ~/ansible
make setup # venv + Python deps
make collections # Ansible collections
rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
```
4. Join the mesh VPN (choice deferred — see ADR-015) so it is reachable over SSH
from elsewhere.
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.
Then set up the Ansible environment on it (`make setup`, `make collections`, set up
`rbw` and `rbw unlock`) per ADR-005, and add it to `inventories/<env>/hosts.yml` under the
`control` group. Because the control node is not in `local.vms`, this is the only
case where editing `hosts.yml` by hand is expected — every other host comes from
`make tf-inventory`.
Because `ubongo` is not in `local.vms`, this is the only case where editing
`hosts.yml` by hand is expected. **Known limitation:** `make tf-inventory`
regenerates `hosts.yml` from Terraform outputs and will overwrite a hand-added
`control` entry — re-add `ubongo` after running it (preserving the control entry in
the generator is tracked separately, not yet built).
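
Until the generator preserves the control entry, the re-add step could be scripted. The sketch below is a hypothetical helper, not part of `tf_to_inventory.py`: it assumes the inventory has already been loaded into a plain dict (for example with a YAML library, which is not shown) and that it follows the usual `all` → `children` → `<group>` → `hosts` layout. The function name and the placeholder `ansible_host` value are illustrative only.

```python
import copy


def ensure_control_host(inventory, hostname="ubongo", hostvars=None):
    """Return a copy of `inventory` with `hostname` present in the control group.

    Existing entries are left untouched, so calling this twice is harmless.
    """
    merged = copy.deepcopy(inventory)
    children = merged.setdefault("all", {}).setdefault("children", {})
    # A YAML group written as `control:` with no body loads as None.
    control = children.get("control") or {}
    hosts = control.get("hosts") or {}
    hosts.setdefault(hostname, dict(hostvars or {}))
    control["hosts"] = hosts
    children["control"] = control
    return merged


# Example: an inventory as it might look right after regeneration (assumed shape).
generated = {"all": {"children": {"docker_hosts": {"hosts": {"vm1": {}}}}}}
# 192.0.2.10 is a documentation-range placeholder, not ubongo's real address.
fixed = ensure_control_host(generated, hostvars={"ansible_host": "192.0.2.10"})
```

Returning a copy rather than mutating in place keeps the generated data available for comparison, which is useful when checking what the regeneration actually dropped.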
---

@@ -30,6 +30,27 @@ clear "run: rbw unlock" error rather than a hang.
---
## Break-glass — vault access during a full cluster outage
The control node `ubongo` (ADR-015) is the tool used to rebuild the cluster, so it
must be able to decrypt the vault even when Vaultwarden (if hosted on the cluster)
is down. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
it **offline** with your Vaultwarden master password — no live server needed for
entries it has already synced. The recovery design therefore requires:
- `rbw` on `ubongo` (and on `mamba`, the break-glass laptop) has **synced at least
once** while Vaultwarden was reachable (`rbw sync`).
- Your **Vaultwarden master password** is kept **offline** — in a password manager on
`mamba` and on paper in a safe — independent of any cluster-hosted Vaultwarden.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Keep it recoverable without the cluster.
> **To verify (ADR-014, security-relevant):** confirm `rbw` actually decrypts its
> local cache fully offline on your pinned `rbw` version before relying on this.
---
## Rotating a single secret value
1. Ensure the agent is unlocked: `rbw unlock`

@@ -15,8 +15,9 @@ revisit (trigger).
|---|---|---|---|
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
_Last reviewed: 2026-06-04. The prior gaps (full CIS hardening, SELinux/AppArmor,
_Last reviewed: 2026-06-05. The prior gaps (full CIS hardening, SELinux/AppArmor,
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build

@@ -0,0 +1,745 @@
# Ubongo Control / AI-Worker Host — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Record the decision to replace the cluster-resident control VM with a dedicated always-on physical host (`ubongo`) outside the Proxmox cluster, by authoring ADR-015 and reconciling every doc that currently assumes the control node is a cluster VM.
**Architecture:** This is a **documentation-only** change. No code, no roles, no inventory data. `ubongo` is recorded as *designed, not built* (per STATUS.md discipline) — the physical box, its OS install, and its inventory wiring are a future manual build, not part of this plan. The work is: one new ADR (the home of record) plus targeted amendments to the ADRs/runbooks/registers that contradict it, each cross-linking ADR-015.
**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus manual internal-consistency checks. There is no markdown linter in the toolchain, so "tests" are hook-pass + cross-reference-resolves greps.
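
The "cross-reference-resolves" check could also be done with a few lines of stdlib Python instead of ad-hoc greps. This is a minimal sketch under stated assumptions: it only looks at backticked paths that start with `docs/` and end in `.md`, and the `missing_doc_refs` name is hypothetical, not an existing repo script.

```python
import re
from pathlib import Path

# Backticked repo-relative Markdown paths such as `docs/decisions/015-control-host.md`.
DOC_REF = re.compile(r"`(docs/[\w./-]+\.md)`")


def missing_doc_refs(markdown: str, repo_root: str) -> list[str]:
    """Return the referenced docs/ paths that do not exist under repo_root."""
    root = Path(repo_root)
    referenced = sorted(set(DOC_REF.findall(markdown)))
    return [path for path in referenced if not (root / path).is_file()]


# Usage (from the repo root), e.g. for one amended file:
#   text = Path("CLAUDE.md").read_text(encoding="utf-8")
#   print(missing_doc_refs(text, "."))
```

An empty list means every `docs/...md` path mentioned in the file resolves. References written as bare `ADR-015` are not covered by this pattern and would still need a manual check.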
---
## Pre-flight (read once before starting)
- **`rbw` must be unlocked before every commit.** The pre-commit ansible-lint hook decrypts `vault.yml`. Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`. Do not start a task you cannot commit.
- **Commit style:** one commit per task, imperative subject ≤72 chars, with the trailer:
```
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
```
- **Order matters:** Task 1 (ADR-015) must land first — every later task links to it.
- **Spec reference:** `docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md`.
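
The `rbw unlocked` pre-flight can be wrapped in a small guard so a task stops before any edits are made. A minimal sketch, assuming only what the bullet above states (exit code 0 means unlocked); the `rbw_is_unlocked` helper and its injectable `run` parameter are illustrative, not an existing repo script.

```python
import subprocess


def rbw_is_unlocked(run=subprocess.run) -> bool:
    """Return True when `rbw unlocked` exits 0, False otherwise.

    `run` is injectable so the check can be exercised without rbw installed.
    """
    try:
        result = run(["rbw", "unlocked"], capture_output=True, check=False)
    except FileNotFoundError:
        return False  # rbw is not on PATH, so it cannot be unlocked
    return result.returncode == 0


# Usage before starting a task:
#   if not rbw_is_unlocked():
#       raise SystemExit("rbw is locked: ask the user to run `rbw unlock`")
```

Treating a missing `rbw` binary the same as a locked agent is a design choice for this sketch: in both cases the commit hook could not decrypt `vault.yml`, so the task should not start.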
---
## File map
| File | Action | Responsibility after change |
|---|---|---|
| `docs/decisions/015-control-host.md` | Create | Home of record for the `ubongo` decision |
| `docs/decisions/001-architecture.md` | Modify | Control node = physical box outside cluster |
| `docs/decisions/005-bootstrapping.md` | Modify | Control-node bootstrap = bare-metal Debian install |
| `docs/decisions/009-provisioning-handoff.md` | Modify | Control-node exception is genuinely physical |
| `docs/decisions/008-testing.md` | Modify | All test levels run on `ubongo`; stub future UI level |
| `docs/decisions/012-hardware-capacity.md` | Modify | `ubongo` is in-scope physical compute |
| `docs/hardware/reference.md` | Modify | `ubongo` row in node-capacity + physical-compute section |
| `docs/runbooks/new-host.md` | Modify | Part E: control node is bare-metal, not `qm clone` |
| `docs/runbooks/rotate-secrets.md` | Modify | Offline break-glass vault-password requirement |
| `docs/security/accepted-risks.md` | Modify | Reserve mesh-VPN coordinator risk (pending VPN choice) |
| `STATUS.md` | Modify | Row: `ubongo` — designed, not built |
| `CLAUDE.md` | Modify | ADR-015 in Further reading; control-group note |
---
### Task 1: Author ADR-015 (the home of record)
**Files:**
- Create: `docs/decisions/015-control-host.md`
- [ ] **Step 1: Create the ADR file**
Create `docs/decisions/015-control-host.md` with exactly this content:
```markdown
# ADR-015 — Control / development / AI-worker host (`ubongo`)
## Context
Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
the control node purely as a control-plane runner.
It fails four needs, all confirmed as drivers:
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
something else creates it; the bootstrap is circular and awkward.
2. **Always-on availability** — the operator wants to SSH in from a work PC or
anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
or being rebuilt.
3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
inside the thing it rebuilds.
4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
with production VM lifecycle.
A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
and recovery. A small dedicated always-on physical machine outside the cluster
satisfies all four.
## Decision
Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
It becomes *the* control node and collapses four roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the operator SSHes into
- Local test runner (Molecule/Docker, lint, and later a browser stack)
- Persistent dev home for the repo
There is **no longer a control VM on the cluster.** The `control` inventory group
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor
and runs no `docker_host` services.
### Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250).
Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
is **all testing being local** — Molecule (Docker), lint, and a future
headless-Chromium/Playwright stack.
### Provisioning (bootstrap path)
Manual, on bare metal:
1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN (choice deferred — see below).
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
baseline config via the `control` group (`base` role only).
### Access & security
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
mesh; nothing is published to the public internet — this stays inside ADR-002.
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
denied on the physical NIC.
### Recovery model
`ubongo` is the rebuild tool, so three things must survive a full cluster loss:
1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
able to drive the fleet if `ubongo` dies.
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
`mamba`). For a 25 VM fleet it is also reconstructable via `terraform import`.
3. **Vault password**: `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
local encrypted copy of the vault and decrypts it offline with the operator's
Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
whole cluster down — provided `rbw` has synced once and the operator keeps the
Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
`mamba`.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
would make the control node run a service (against its remit) and still need that
master password.
> verified: rbw offline-cache decryption · TO VERIFY before relying on the recovery
> model · rbw docs · (ADR-014, security-relevant — confirm during build)
## Consequences
- The control node is physical compute outside the cluster, so it appears in
`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
- A future **service-UI acceptance** testing level (Claude driving a headless browser
against a deployed service) is anticipated; `ubongo` is sized for it. The harness
is a separate spec.
## Deferred (separate specs / discussions)
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
or it recreates the chicken-and-egg.
2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user
generation, screenshot-back-to-Claude, and the new ADR-008 level.
3. **`rbw` offline-cache verification** — confirm offline decryption before relying
on it (ADR-014).
## What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).
```
- [ ] **Step 2: Confirm `rbw` is unlocked, then verify hooks pass**
Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md`
Expected: `rbw` exits 0; hooks report `Passed`/`Skipped` (ansible-lint skips non-YAML; trailing-whitespace + end-of-file Passed).
- [ ] **Step 3: Commit**
```bash
git add docs/decisions/015-control-host.md
git commit -m "Add ADR-015 (control/AI-worker host ubongo)"
```
---
### Task 2: Amend ADR-001 (architecture)
**Files:**
- Modify: `docs/decisions/001-architecture.md`
- [ ] **Step 1: Update the control-node bullet**
Find (lines ~13–15):
```markdown
- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
The control node is the one host that cannot fully bootstrap itself from scratch
and requires manual initial setup (see `docs/runbooks/new-host.md`).
```
Replace with:
```markdown
- **Control node**: `ubongo` — a dedicated always-on **physical** x86-64 machine
**outside** the cluster. Ansible runs from here. It cannot be created by the
Terraform it hosts, so it is provisioned manually (see ADR-015 and
`docs/runbooks/new-host.md`).
```
- [ ] **Step 2: Update the VM-existence table row**
Find:
```markdown
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
```
Replace with:
```markdown
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; `ubongo` (control node) is a physical box outside the cluster, the one manual exception (see ADR-009/ADR-015) |
```
- [ ] **Step 3: Update the `control` host-group comment**
Find:
```markdown
├── control # the control node itself — baseline config only, runs no services
```
Replace with:
```markdown
├── control # ubongo — physical control node outside the cluster; baseline config only, runs no services
```
- [ ] **Step 4: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/001-architecture.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add docs/decisions/001-architecture.md
git commit -m "ADR-001: control node is physical ubongo outside cluster"
```
---
### Task 3: Amend ADR-005 (bootstrapping)
**Files:**
- Modify: `docs/decisions/005-bootstrapping.md`
- [ ] **Step 1: Replace the "Control node bootstrapping" section body**
Find (the numbered list under `## Control node bootstrapping`, lines ~52–69):
```markdown
The control node is a special case — it runs Terraform and Ansible, so it cannot
be created by the Terraform it hosts (chicken-and-egg). It is the one documented
exception to Terraform-owned VM existence (see ADR-009). The control node requires:
1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
`qm clone`), since Terraform is not yet available to do it
2. Manual setup of the Ansible environment:
```
Replace with:
```markdown
The control node is a special case — it runs Terraform and Ansible, so it cannot
be created by the Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated
**physical** machine outside the cluster, and the one documented exception to
Terraform-owned VM existence (see ADR-009 and ADR-015). The control node requires:
1. Manual OS provisioning — install Debian 13 on the physical box by hand (it is not
a Proxmox guest, so there is no template to clone)
2. Manual setup of the Ansible environment:
```
- [ ] **Step 2: Update the trailing reference to the control node listing**
Find:
```markdown
The control node itself is listed in `inventories/production/hosts.yml` under
a `control` group and can be managed for baseline config (SSH, firewall, updates)
but not for the `docker_host` role (it does not run services).
```
Replace with:
```markdown
`ubongo` is listed in `inventories/production/hosts.yml` under the `control` group
and can be managed for baseline config (SSH, firewall, updates) but not for the
`docker_host` role (it does not run services). Hardware target and recovery model
are in ADR-015.
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/005-bootstrapping.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add docs/decisions/005-bootstrapping.md
git commit -m "ADR-005: control node bootstrap is bare-metal Debian on ubongo"
```
---
### Task 4: Amend ADR-009 (provisioning handoff)
**Files:**
- Modify: `docs/decisions/009-provisioning-handoff.md`
- [ ] **Step 1: Strengthen the control-node exception section**
Find (under `## The control-node exception`, lines ~129–138):
```markdown
The control node — the host that runs Terraform and Ansible — is the one VM
Terraform does **not** create. It cannot provision the infrastructure that would
provision itself (chicken-and-egg). It is therefore the single documented exception
to "Terraform owns VM existence":
- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
Ansible for baseline config only (no `docker_host` role).
```
Replace with:
```markdown
The control node — the host that runs Terraform and Ansible — is `ubongo`, a
dedicated **physical** machine outside the cluster. It is not a VM at all, so
Terraform genuinely never touches it: it cannot provision the infrastructure that
would provision itself (chicken-and-egg). It is therefore the single documented
exception to "Terraform owns VM existence":
- Provisioned and bootstrapped manually on bare metal, per the control-node section
of ADR-005; rationale, hardware, and recovery model in ADR-015.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
Ansible for baseline config only (no `docker_host` role).
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/009-provisioning-handoff.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add docs/decisions/009-provisioning-handoff.md
git commit -m "ADR-009: control-node exception is a physical box, not a VM"
```
---
### Task 5: Amend ADR-008 (testing)
**Files:**
- Modify: `docs/decisions/008-testing.md`
- [ ] **Step 1: Make Level 1 say it runs on `ubongo`**
Find:
```markdown
Runs in Docker on the control node or in CI. Fast (~5 min per role).
```
Replace with:
```markdown
Runs in Docker on the control node (`ubongo`) or in CI. Fast (~5 min per role).
```
- [ ] **Step 2: Add a future service-UI acceptance level stub**
Find (the end of `### Level 3 — External smoke test from askari`, lines ~51–55):
```markdown
### Level 3 — External smoke test from askari
Once `askari` is operational: scripted checks from outside the network confirming
that public-facing services respond correctly. Catches firewall and reverse proxy
configuration issues invisible to Ansible check mode.
```
Replace with:
```markdown
### Level 3 — External smoke test from askari
Once `askari` is operational: scripted checks from outside the network confirming
that public-facing services respond correctly. Catches firewall and reverse proxy
configuration issues invisible to Ansible check mode.
### Level 4 — Service-UI acceptance (planned, not built)
Claude drives a headless browser from `ubongo` against a *deployed* service: loads
the rendered UI, creates test users, exercises features, and hands the operator a
manual test script for the rest. Catches application-level regressions that no lower
level sees. The harness (Playwright/headless-Chromium, screenshot-back-to-Claude) is
a **separate spec**; `ubongo` is sized for it (ADR-015). Status: designed, not built
(STATUS.md).
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/008-testing.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add docs/decisions/008-testing.md
git commit -m "ADR-008: tests run on ubongo; stub Level 4 service-UI acceptance"
```
---
### Task 6: Amend ADR-012 and the hardware reference
**Files:**
- Modify: `docs/decisions/012-hardware-capacity.md`
- Modify: `docs/hardware/reference.md`
- [ ] **Step 1: Note `ubongo` as in-scope physical compute in ADR-012**
In `docs/decisions/012-hardware-capacity.md`, find the first bullet under `## Decision`:
```markdown
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
physical compute + network gear and workload placement intent. Two
machine-readable tables (node capacity, workload placement) carry the numbers.
```
Replace with:
```markdown
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
physical compute + network gear and workload placement intent. Two
machine-readable tables (node capacity, workload placement) carry the numbers.
This includes `ubongo`, the physical control node (ADR-015), even though it sits
outside the Proxmox cluster.
```
- [ ] **Step 2: Add `ubongo` to the physical-compute section of the reference**
In `docs/hardware/reference.md`, find:
```markdown
_(repeat for pve1, pve2, askari)_
```
Replace with:
```markdown
### ubongo (control node — outside the cluster)
- **Model / form factor:** _TBD (x86-64 mini-PC / USFF, e.g. N100 or refurb micro)_
- **CPU:** _TBD (target 4 cores, x86-64)_
- **RAM:** _TBD (target 16 GB)_
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
- **NICs:** _wired GbE_
- **Notes:** _always-on; control plane + AI-worker + local test runner (ADR-015); not a Proxmox guest_
_(repeat for pve1, pve2, askari)_
```
- [ ] **Step 3: Add `ubongo` to the machine-readable node-capacity table**
In `docs/hardware/reference.md`, find the node-capacity table:
```markdown
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
```
Replace with:
```markdown
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 250 |
```
Note: the header row (`node | cores | ram_gb | disk_gb`) is a parser contract for
`scripts/capacity-scan.py` — only a data row is added, the header is untouched.
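As a quick guard on that contract, a check along the following lines could be run before the capacity scan. This is a minimal sketch, not part of the repo: the function name `check_capacity_table` is hypothetical, and it assumes the header row appears in `docs/hardware/reference.md` exactly as shown above, with no trailing whitespace.

```bash
# Hypothetical helper (not in the repo): confirms the parser-contract header row
# is still present verbatim and that a data row exists for the given node.
EXPECTED_HEADER='| node | cores | ram_gb | disk_gb |'

check_capacity_table() {
  local file="$1" node="$2"
  # -x: whole-line match, -F: fixed string, so any edit to the header fails the check.
  grep -qxF -- "$EXPECTED_HEADER" "$file" \
    || { echo "header row changed or missing in $file" >&2; return 1; }
  # A data row is a line whose first table cell is the node name.
  grep -qE "^\| *${node} *\|" "$file" \
    || { echo "no data row for ${node} in $file" >&2; return 1; }
}
```

Usage would be `check_capacity_table docs/hardware/reference.md ubongo`; a non-zero exit means the header was altered or the row is missing.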
- [ ] **Step 4: Verify the capacity scan still parses, hooks pass, then commit**
Run: `python3 scripts/capacity-scan.py 2>&1 | head -c 400`
Expected: it runs without a parse error and the output reflects the new `ubongo` row (no traceback). If the script needs an argument or env, consult its `--help`; a clean exit with JSON is success.
Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md docs/hardware/reference.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add docs/decisions/012-hardware-capacity.md docs/hardware/reference.md
git commit -m "ADR-012/hardware: add ubongo as physical control node"
```
---
### Task 7: Update the new-host runbook (Part E)
**Files:**
- Modify: `docs/runbooks/new-host.md`
- [ ] **Step 1: Replace Part E with the bare-metal control-node procedure**
Find the whole `## Part E — Control node (manual exception)` section (lines ~113–133), from the heading through the paragraph ending "every other host comes from `make tf-inventory`." Replace it with:
```markdown
## Part E — Control node (`ubongo`, manual exception)
The control node runs Terraform and Ansible, so it cannot be created by the
Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
machine outside the cluster — not a Proxmox guest. It is the **one** host
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
1. Install Debian 13 on the physical box by hand (no template to clone).
2. Create the `ansible` user and install its SSH public key.
3. Set up the Ansible environment on it:
```bash
git clone <repo> ~/ansible
cd ~/ansible
make setup # venv + Python deps
make collections # Ansible collections
rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
```
4. Join the mesh VPN (choice deferred — see ADR-015) so it is reachable over SSH
from elsewhere.
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.
Because `ubongo` is not in `local.vms`, this is the only case where editing
`hosts.yml` by hand is expected. **Known limitation:** `make tf-inventory`
regenerates `hosts.yml` from Terraform outputs and will overwrite a hand-added
`control` entry — re-add `ubongo` after running it (preserving the control entry in
the generator is tracked separately, not yet built).
```
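Until the generator preserves the control entry, the known limitation in item 5 could be caught with a small post-check like the one below. It is only a sketch: the function name `control_entry_present` is hypothetical, and it assumes hosts appear in `hosts.yml` as YAML mapping keys of the form `ubongo:`, which is an assumption about the inventory layout rather than something this plan specifies.

```bash
# Hypothetical helper (not in the repo): succeeds if the inventory file still
# contains a host key for the control node after regeneration.
# Assumes hosts are written as YAML mapping keys, e.g. "        ubongo:".
control_entry_present() {
  local file="$1" host="${2:-ubongo}"
  grep -qE "^[[:space:]]*${host}:" "$file"
}

# Example use after regenerating the inventory (left commented out):
# make tf-inventory
# if ! control_entry_present inventories/production/hosts.yml; then
#   echo "ubongo missing from hosts.yml: re-add it under the control group" >&2
# fi
```

The check only detects the overwrite; re-adding the entry stays a manual step, as the runbook text says.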
- [ ] **Step 2: Update the Prerequisites note that assumes a template**
Find:
```markdown
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
```
Replace with:
```markdown
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not).
Not needed for the control node `ubongo`, which is bare-metal (Part E).
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/runbooks/new-host.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add docs/runbooks/new-host.md
git commit -m "new-host runbook: control node ubongo is bare-metal"
```
---
### Task 8: Update the rotate-secrets runbook (offline break-glass)
**Files:**
- Modify: `docs/runbooks/rotate-secrets.md`
- [ ] **Step 1: Add a break-glass section after the `rbw` setup section**
Find the end of the `## One-time — \`rbw\` setup on a new machine` section:
```markdown
Once unlocked, `make encrypt/decrypt/check/deploy` and the pre-commit ansible-lint
hook all obtain the password automatically. If the agent is locked you'll see a
clear "run: rbw unlock" error rather than a hang.
```
Replace with:
```markdown
Once unlocked, `make encrypt/decrypt/check/deploy` and the pre-commit ansible-lint
hook all obtain the password automatically. If the agent is locked you'll see a
clear "run: rbw unlock" error rather than a hang.
---
## Break-glass — vault access during a full cluster outage
The control node `ubongo` (ADR-015) is the tool used to rebuild the cluster, so it
must be able to decrypt the vault even when Vaultwarden (if hosted on the cluster)
is down. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
it **offline** with your Vaultwarden master password — no live server needed for
entries it has already synced. The recovery design therefore requires:
- `rbw` on `ubongo` (and on `mamba`, the break-glass laptop) has **synced at least
once** while Vaultwarden was reachable (`rbw sync`).
- Your **Vaultwarden master password** is kept **offline** — in a password manager on
`mamba` and on paper in a safe — independent of any cluster-hosted Vaultwarden.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Keep it recoverable without the cluster.
> **To verify (ADR-014, security-relevant):** confirm `rbw` actually decrypts its
> local cache fully offline on your pinned `rbw` version before relying on this.
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/runbooks/rotate-secrets.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add docs/runbooks/rotate-secrets.md
git commit -m "rotate-secrets: document offline vault break-glass for ubongo"
```
---
### Task 9: Reserve the mesh-VPN accepted-risk entry
**Files:**
- Modify: `docs/security/accepted-risks.md`
- [ ] **Step 1: Add R3 to the risk table**
Find the table row for R2:
```markdown
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
```
Add immediately **after** it:
```markdown
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
```
- [ ] **Step 2: Update the "Last reviewed" footer date**
Find:
```markdown
_Last reviewed: 2026-06-04. The prior gaps
```
Replace `2026-06-04` with `2026-06-05` (only the date changes; leave the rest of the sentence intact):
```markdown
_Last reviewed: 2026-06-05. The prior gaps
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: reserve R3 mesh-VPN coordinator (pending choice)"
```
---
### Task 10: Add the `ubongo` row to STATUS.md
**Files:**
- Modify: `STATUS.md`
- [ ] **Step 1: Add a row to the "Designed but not built" table**
Find the last row of the `## Designed but not built` table:
```markdown
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
```
Add immediately **after** it:
```markdown
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add STATUS.md
git commit -m "STATUS: record ubongo control host as designed, not built"
```
---
### Task 11: Update CLAUDE.md (index + control-group note)
**Files:**
- Modify: `CLAUDE.md`
- [ ] **Step 1: Add ADR-015 to the Further reading table**
Find:
```markdown
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
```
Replace with:
```markdown
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
```
- [ ] **Step 2: Update the control-group parenthetical in the Inventory structure section**
Find:
```markdown
(`control` holds the one manually-provisioned control node — see ADR-009.)
```
Replace with:
```markdown
(`control` holds `ubongo`, the one manually-provisioned **physical** control node
outside the cluster — see ADR-009 and ADR-015.)
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: hooks `Passed`/`Skipped`.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-015; note ubongo as physical control node"
```
---
### Task 12: Final consistency sweep
**Files:** none modified (verification only)
- [ ] **Step 1: Confirm no doc still calls the control node a VM**
Run:
```bash
grep -rniE "control node.*(VM|virtual)|dedicated Debian 13 VM" docs/ CLAUDE.md STATUS.md
```
Expected: no hit that *asserts* the control node is a VM. (Hits inside ADR-015's "What was ruled out" table that describe the rejected option are fine.) If any other doc still frames the control node as a VM, fix it the same way as the relevant task above and amend that task's commit.
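To make the expected-noise case easier to read, the same search can be wrapped so that hits from the ADR-015 file itself are dropped. This is a hedged sketch: `vm_framing_hits` is a hypothetical name, and filtering by filename is an assumption about where the acceptable hits live (the "What was ruled out" table); any remaining hit still needs a human read.

```bash
# Hypothetical helper (not in the repo): runs the Step 1 search over the given
# paths and hides matches that come from the ADR-015 file, where the rejected
# "control node as a cluster VM" option is described on purpose.
vm_framing_hits() {
  grep -rniE 'control node.*(VM|virtual)|dedicated Debian 13 VM' "$@" 2>/dev/null \
    | grep -vE '^[^:]*015-control-host\.md:' \
    || true   # no remaining hits is the success case, so do not fail on it
}
```

Running `vm_framing_hits docs/ CLAUDE.md STATUS.md` and getting no output corresponds to the expected result above.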
- [ ] **Step 2: Confirm every ADR-015 cross-link resolves**
Run:
```bash
grep -rl "ADR-015\|015-control-host" docs/ CLAUDE.md STATUS.md
test -f docs/decisions/015-control-host.md && echo "ADR-015 present"
```
Expected: the file exists and the referencing docs (001, 005, 008, 009, 012, runbooks, accepted-risks, STATUS, CLAUDE.md) appear.
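Comparing the `grep -rl` output against the expected list by eye is error-prone, so the comparison could be scripted roughly as follows. This is a sketch under assumptions: the array and function names are hypothetical, and the file list is simply the set of docs this plan amends.

```bash
# Hypothetical helper (not in the repo): the docs this plan expects to link ADR-015.
ADR015_REFERRERS=(
  docs/decisions/001-architecture.md
  docs/decisions/005-bootstrapping.md
  docs/decisions/008-testing.md
  docs/decisions/009-provisioning-handoff.md
  docs/decisions/012-hardware-capacity.md
  docs/runbooks/new-host.md
  docs/runbooks/rotate-secrets.md
  docs/security/accepted-risks.md
  STATUS.md
  CLAUDE.md
)

# Prints every expected doc that does not mention ADR-015 (or is absent);
# returns 1 if at least one is missing, 0 otherwise.
missing_adr015_refs() {
  local root="${1:-.}" missing=0 f
  for f in "${ADR015_REFERRERS[@]}"; do
    if ! grep -qE 'ADR-015|015-control-host' "$root/$f" 2>/dev/null; then
      echo "$f"
      missing=1
    fi
  done
  return "$missing"
}
```

Called as `missing_adr015_refs` from the repo root, empty output means every expected doc carries the cross-link.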
- [ ] **Step 3: Full hook run**
Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks `Passed`/`Skipped`. Fix anything that fails (most likely trailing whitespace or end-of-file) and amend the owning commit.
- [ ] **Step 4: Push (only if the user asks)**
Per CLAUDE.md, push to `origin` is the off-machine backup. If the user wants it pushed:
```bash
git push origin main
```
---
## Self-review notes (author)
- **Spec coverage:** every spec section maps to a task — host decision/hardware/bootstrap/access/recovery → Task 1 (ADR-015); the doc-changes table → Tasks 2–11; testing implication → Task 5; deferrals are recorded in ADR-015 and not implemented here (correct — they are separate specs). ✓
- **Not in scope (intentional):** acquiring/installing the box, mesh-VPN selection, the browser harness, adding `ubongo` to live inventory, and modifying `tf_to_inventory.py` to preserve the control entry (logged as a known limitation in Task 7). ✓
- **No placeholders:** every edit shows exact find/replace text; the only `_TBD_` strings are deliberate hardware-reference skeleton fields matching that file's existing style. ✓
```


@ -0,0 +1,205 @@
# Design — Control / development / AI-worker host (`ubongo`)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Supersedes (in part):** the "control node is a dedicated VM on the cluster"
assumption in ADR-001 / ADR-005 / ADR-009
- **Becomes:** ADR-015 (this design is the basis for that ADR)
---
## Problem
Today the control node — the host that runs Terraform and Ansible — is defined as a
**single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). The ADRs treat it
purely as a control-plane runner.
That framing fails four things the user actually needs, all confirmed as drivers:
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible can't exist until
something else creates it; the manual bootstrap is awkward and circular.
2. **Always-on availability** — the user wants to SSH in from a work PC (or anywhere)
to fire off Claude Code commands. A VM on the cluster is gone whenever the cluster
is down or being rebuilt.
3. **Recovery / disaster** — the tool you'd use to rebuild the cluster must not live
*inside* the thing it rebuilds.
4. **Dev ergonomics** — a comfortable, persistent home for Claude Code + the repo,
not entangled with production VM lifecycle.
A laptop-only answer fails always-on (not always carried) and recovery. A VM-only
answer fails cold-start and recovery. A small **dedicated always-on physical machine
outside the cluster** satisfies all four.
---
## Decision
Introduce **`ubongo`**: a single dedicated x86-64 mini-PC, always-on, living
**outside** the Proxmox cluster. It becomes *the* control node and collapses four
roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the user SSHes into
- Local test runner (Molecule/Docker, lint, and later the browser stack)
- Persistent dev home for the repo
There is **no longer a control VM on the cluster.** The `control` inventory group now
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
`ubongo` runs **plain Debian 13** (matches the fleet — the `base` role applies). It is
not a hypervisor.
### Name
`ubongo` (Swahili: *brain*), consistent with the fleet's Swahili theme (`boma`,
`nyumbani`, `askari`, `mamba`).
---
## Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative hardware: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC,
roughly €150–250.
**Sizing rationale.** Claude Code itself is light — the model runs in Anthropic's
cloud; the box only runs the CLI, git, Ansible, Terraform, `rbw`, SSH. The real
sizing driver is **all testing being local**: Molecule (Docker — the heavy one), lint,
and later a headless-Chromium/Playwright browser stack for service-UI verification.
That combination is firmly mini-PC / small-server class, not a Raspberry Pi.
---
## Provisioning (bootstrap path)
Manual, like today's control-node exception — but on bare metal instead of a clone:
1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN (choice TBD — see Deferred).
4. From then on `ubongo` manages every other host normally. Ansible manages *it* for
baseline config via the `control` group (`base` role: SSH, firewall, updates,
auditd) — but never the `docker_host` role. It runs no services.
---
## Access & security
- **Remote access via the mesh VPN** (choice TBD). The user SSHes to `ubongo` over the
mesh from work PC, laptop, or phone. **Nothing is published to the public internet**
— SSH-over-mesh keeps the design fully inside ADR-002 (no LAN/WAN exposure without
reverse-proxy + auth).
- **Hardening.** `ubongo` runs the `base` role like every host: SSH hardening,
nftables default-deny, fail2ban, auditd, unattended-upgrades. Inbound SSH is allowed
**only on the mesh interface** — denied on the physical NIC. Even on the LAN it is
not an open SSH target.
- **Third-party dependency.** A hosted mesh coordinator is a third party. This is a
deliberate trade: a hosted control plane keeps the mesh up when the cluster is down
(helps recovery). A self-hosted coordinator on the cluster would recreate the
chicken-and-egg — so if self-hosted, it must live on `ubongo` or off-cluster, never
on the fleet. To be logged in `accepted-risks.md` once the VPN is chosen.
---
## Recovery model
`ubongo` is now the rebuild tool, so three things must survive a full cluster loss:
1. **The box / its data.** `mamba` (laptop) stays a **break-glass clone**: repo +
toolchain + mesh + `rbw`, able to drive the fleet if `ubongo` dies. Two machines
that can drive the fleet, not one.
2. **Terraform state.** Lives on `ubongo`, backed up **encrypted off-box** (synced to
`mamba`). For a 25 VM fleet it is also reconstructable via `terraform import`, so
this is belt-and-suspenders, not load-bearing.
3. **The vault password.** `ubongo` gets the vault master password from Vaultwarden via
`rbw`. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
it **offline** with the user's Vaultwarden master password — no live server needed
for already-synced entries. So provided (a) `rbw` has synced at least once and (b)
the user keeps their **Vaultwarden master password** offline (memorised + paper in a
safe), `ubongo` can decrypt the Ansible vault with the whole cluster down. Mirror
the same onto `mamba`.
**Why not mirror/replicate Vaultwarden onto `ubongo`?** It would make the control node
*run a service* (against its remit + adds attack surface), add DB-replication
complexity, and **still** require the Vaultwarden master password to read anything.
There is always exactly **one irreducible offline root secret** — make it the
Vaultwarden master password, and let `rbw`'s local cache make everything else
self-serve offline. (If full disaster access to *all* secrets — router, Proxmox UI —
is wanted, that same `rbw` cache already covers it; optionally add a scheduled
encrypted `rbw export` as extra insurance.)
> **To verify (ADR-014, security-relevant):** the "`rbw` decrypts its local cache fully
> offline" behaviour is the load-bearing assumption of the recovery model. Confirm it
> against `rbw`'s docs/version during implementation before relying on it.
---
## Testing implication
All testing runs on `ubongo`:
- **Level 1 (Molecule)** — Docker on `ubongo`.
- **Lint** — on `ubongo`.
- **Level 2 / 3** (staging deploy, external smoke) — driven from `ubongo` as before.
A future **service-UI acceptance level** (Claude driving a headless browser against a
deployed service: load the UI, create test users, exercise features, hand the user a
manual test script) is anticipated. `ubongo` is *sized* for it now (Chromium +
Playwright headroom). The harness itself is a **separate spec** (see Deferred).
---
## Documentation changes
A new **ADR-015 — Control / development / AI-worker host (`ubongo`)** is the home of
record. Other docs get small amendments that link to it:
| Doc | Change |
|---|---|
| ADR-015 (new) | Full record of this design. |
| ADR-001 (architecture) | Control node: "dedicated Debian 13 VM on the cluster" → "dedicated physical x86 machine *outside* the cluster (`ubongo`)". |
| ADR-005 (bootstrapping) | Control-node section: "clone the cloud-init template by hand" → "install Debian 13 on the physical box". |
| ADR-009 (provisioning handoff) | Strengthen the control-node exception: now genuinely physical/outside Terraform. |
| ADR-008 (testing) | "runs on the control node or in CI" → all levels run on `ubongo`; add a stub for the future service-UI acceptance level. |
| ADR-012 / `docs/hardware/reference.md` | Add `ubongo` to the node-capacity table (physical compute, though outside the cluster). |
| `docs/runbooks/new-host.md` | Update the control-node bootstrap procedure (bare-metal Debian install, not `qm clone`). |
| `docs/runbooks/rotate-secrets.md` | Add the offline vault-password break-glass requirement. |
| `docs/security/accepted-risks.md` | Reserve an entry for the mesh-VPN third-party coordinator — pending the VPN choice. |
| `STATUS.md` | Add a row for `ubongo`: *designed, not built*. |
| `CLAUDE.md` | One-line touch to the inventory/`control`-group description if needed. |
---
## Explicitly deferred (separate specs / discussions)
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Carries the
recovery-dimension note above (hosted coordinator helps recovery; self-hosted must
be off-cluster). Its own discussion.
2. **Browser-E2E verification harness** — Playwright/headless-Chromium driving live
service UIs, test-user generation, screenshot-back-to-Claude, and the new ADR-008
level. `ubongo` is sized for it now; the harness is designed later.
3. **`rbw` offline-cache verification** — a to-verify task during implementation
(ADR-014), before relying on offline decryption.
---
## What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start and recovery (rebuild tool lives inside the thing it rebuilds); fails always-on (dies with the cluster). |
| Laptop-only (`mamba` for everything) | Fails always-on (not always carried). Retained instead as the break-glass backup. |
| Split roles (control VM + thin jump box) | Two places to maintain the toolchain, control plane split in two, heavy local testing back on a cluster VM — more moving parts, less benefit. |
| Mirror/replicate Vaultwarden onto `ubongo` | Makes the control node run a service (against its remit), adds DB-replication complexity, and still needs the Vaultwarden master password. `rbw`'s local cache achieves offline decryption without the mirror. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg the whole design escapes. If self-hosted, it lives off-cluster. |
| Raspberry Pi as the box | Could just about run Molecule on its own, but chokes when running Docker + Chromium + the toolchain together. An x86 mini-PC is used instead. |
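
The chicken-and-egg argument behind the first and fifth rows can be made concrete with a deliberately simplified model: write down what each component needs in order to be rebuilt, then look for a cycle. This is an illustrative sketch only; the node names and the `find_cycle` helper are hypothetical, and the real dependency picture has more edges than shown here.

```python
from typing import Dict, List, Mapping, Optional, Set


def find_cycle(needs: Mapping[str, Set[str]]) -> Optional[List[str]]:
    """Return one rebuild-dependency cycle as a list of names, or None if acyclic.

    `needs[x]` is the set of things that must already work before x can be rebuilt.
    """
    done: Set[str] = set()  # nodes fully explored with no cycle found through them

    def visit(node: str, path: List[str]) -> Optional[List[str]]:
        if node in path:
            # We came back to a node on the current path: that loop is the cycle.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        for dep in sorted(needs.get(node, ())):
            cycle = visit(dep, path + [node])
            if cycle:
                return cycle
        done.add(node)
        return None

    for start in sorted(needs):
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


# Control node as a cluster VM: each side needs the other to come back first.
as_vm: Dict[str, Set[str]] = {"cluster": {"control-node"}, "control-node": {"cluster"}}
# Control node as a physical box outside the cluster: no such loop.
as_physical: Dict[str, Set[str]] = {"cluster": {"ubongo"}, "ubongo": set()}

print(find_cycle(as_vm))        # ['cluster', 'control-node', 'cluster']
print(find_cycle(as_physical))  # None
```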