Compare commits
17 commits: fc0d49f1c4...a53941dffe
| Author | SHA1 | Date |
|---|---|---|
| | a53941dffe | |
| | 7a48a60f14 | |
| | a30c1af3f0 | |
| | 9653a34241 | |
| | 55a3666d16 | |
| | a2db8058e7 | |
| | b89ca8835a | |
| | 3fb780c286 | |
| | 66064be7b2 | |
| | 07bc1c83f0 | |
| | 1064716d49 | |
| | 15779be086 | |
| | 5aca796fa0 | |
| | 4cf4aaa12e | |
| | d96cf9f846 | |
| | 0e9f179bfc | |
| | c1b21c9b2b | |
16 changed files with 1187 additions and 40 deletions
@@ -14,7 +14,8 @@ Keep it dense and command-focused. Verbose detail lives in `docs/`.
 Homelab infrastructure automation for a Proxmox cluster running 2–5 Debian 13 VMs.
 All hosts share a hardened base configuration. Each host runs a defined set of Docker
 services deployed via Compose files rendered from Ansible templates. Ansible runs from
-a dedicated control VM. CI runs on Forgejo Actions (self-hosted).
+a dedicated physical control node (`ubongo`) outside the cluster. CI runs on Forgejo
+Actions (self-hosted).

 Full design rationale: `docs/decisions/`

@@ -105,7 +106,8 @@ inventories/

 Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`

-(`control` holds the one manually-provisioned control node — see ADR-009.)
+(`control` holds `ubongo`, the one manually-provisioned **physical** control node
+outside the cluster — see ADR-009 and ADR-015.)

 ---

@@ -187,7 +189,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Topic | File |
 |------------------------|---------------------------------------|
 | Architecture overview | `docs/decisions/001-architecture.md` |
-| Capabilities overview (what boma does) | `docs/capabilities.md` |
+| Capabilities overview (what boma does) | `docs/CAPABILITIES.md` |
 | Security baseline & strategy | `docs/decisions/002-security.md` |
 | Accepted security risks | `docs/security/accepted-risks.md` |
 | Per-service security checklist | `docs/security/service-checklist.md` |
@@ -197,6 +199,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Toolchain choices | `docs/decisions/003-toolchain.md` |
 | Docker & Compose model | `docs/decisions/004-docker-model.md` |
 | Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
+| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
 | Terraform | `docs/decisions/006-terraform.md` |
 | Network topology | `docs/decisions/007-network.md` |
 | Testing methodology | `docs/decisions/008-testing.md` |
@@ -52,6 +52,7 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
 | `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
 | CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
 | Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
+| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |

 ## Keeping this honest

@@ -53,3 +53,10 @@ earning its keep.
 apply — the real path is local fast-forward merge to `main`, then push. → Skills and
 conventions that assume a GitHub-style PR workflow need a homelab-aware variant;
 encode that here "finishing a branch" means merge-locally-then-push, not open-a-PR.
+
+## 2026-06-05
+
+- `[recurring]` The `writing-plans` skill ends by asking "subagent-driven vs inline
+execution?" — always answer subagent-driven here. Don't ask; default straight to
+subagent-driven (fresh subagent per task + review between tasks). → Standing
+preference; skip the execution-mode prompt.
@@ -10,15 +10,16 @@ and the boundaries of what this Ansible monorepo manages.
 - **Hypervisor**: Proxmox cluster (2+ nodes)
 - **Guest OS**: Debian 13 (all managed hosts)
 - **Scale**: 2–5 VMs, small fleet — treated as individuals, not cattle
-- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
-The control node is the one host that cannot fully bootstrap itself from scratch
-and requires manual initial setup (see `docs/runbooks/new-host.md`).
+- **Control node**: `ubongo` — a dedicated always-on **physical** x86-64 machine
+**outside** the cluster. Ansible runs from here. It cannot be created by the
+Terraform it hosts, so it is provisioned manually (see ADR-015 and
+`docs/runbooks/new-host.md`).

 ## What this repo manages

 | Layer | Managed by | Notes |
 |--------------------|--------------------|--------------------------------------------|
-| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
+| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; `ubongo` (control node) is a physical box outside the cluster, the one manual exception (see ADR-009/ADR-015) |
 | Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) |
 | OS baseline | Ansible `base` role | Users, SSH, firewall, updates, audit |
 | Docker runtime | Ansible `docker_host` role | Engine, daemon config, log driver |
@@ -32,7 +33,7 @@ describes the *intended* design — see STATUS.md for what is actually built.

 ```
 all
-├── control # the control node itself — baseline config only, runs no services
+├── control # ubongo — physical control node outside the cluster; baseline config only, runs no services
 ├── docker_hosts # VMs running Docker services (most hosts)
 └── proxmox_hosts # Proxmox nodes themselves (limited management scope)
 ```
@@ -51,11 +51,12 @@ for the end-to-end commands and `docs/runbooks/new-host.md` for the full procedu
 ## Control node bootstrapping

 The control node is a special case — it runs Terraform and Ansible, so it cannot
-be created by the Terraform it hosts (chicken-and-egg). It is the one documented
-exception to Terraform-owned VM existence (see ADR-009). The control node requires:
+be created by the Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated
+**physical** machine outside the cluster, and the one documented exception to
+Terraform-owned VM existence (see ADR-009 and ADR-015). The control node requires:

-1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
-`qm clone`), since Terraform is not yet available to do it
+1. Manual OS provisioning — install Debian 13 on the physical box by hand (it is not
+a Proxmox guest, so there is no template to clone)
 2. Manual setup of the Ansible environment:
 ```bash
 git clone <repo> ~/ansible
@@ -68,9 +69,10 @@ exception to Terraform-owned VM existence (see ADR-009). The control node requir
 ```
 3. After that, the control node can manage all other hosts normally

-The control node itself is listed in `inventories/production/hosts.yml` under
-a `control` group and can be managed for baseline config (SSH, firewall, updates)
-but not for the `docker_host` role (it does not run services).
+`ubongo` is listed in `inventories/production/hosts.yml` under the `control` group
+and can be managed for baseline config (SSH, firewall, updates) but not for the
+`docker_host` role (it does not run services). Hardware target and recovery model
+are in ADR-015.

 ## Decision

@@ -12,7 +12,7 @@ This document records the testing strategy, what each level covers, and — crit

 ### Level 1 — Molecule (per role, always required)

-Runs in Docker on the control node or in CI. Fast (~5 min per role).
+Runs in Docker on the control node (`ubongo`) or in CI. Fast (~5 min per role).

 **What happens during `molecule test`:**
 1. `create` — start the test container
@@ -53,6 +53,15 @@ Once `askari` is operational: scripted checks from outside the network confirmin
 that public-facing services respond correctly. Catches firewall and reverse proxy
 configuration issues invisible to Ansible check mode.

+### Level 4 — Service-UI acceptance (planned, not built)
+
+Claude drives a headless browser from `ubongo` against a *deployed* service: loads
+the rendered UI, creates test users, exercises features, and hands the operator a
+manual test script for the rest. Catches application-level regressions that no lower
+level sees. The harness (Playwright/headless-Chromium, screenshot-back-to-Claude) is
+a **separate spec**; `ubongo` is sized for it (ADR-015). Status: designed, not built
+(STATUS.md).
+
 ---

 ## Molecule test image
@@ -126,12 +126,14 @@ convention only — it no longer implies any difference in how records are writt

 ## The control-node exception

-The control node — the host that runs Terraform and Ansible — is the one VM
-Terraform does **not** create. It cannot provision the infrastructure that would
-provision itself (chicken-and-egg). It is therefore the single documented exception
-to "Terraform owns VM existence":
+The control node — the host that runs Terraform and Ansible — is `ubongo`, a
+dedicated **physical** machine outside the cluster. It is not a VM at all, so
+Terraform genuinely never touches it: it cannot provision the infrastructure that
+would provision itself (chicken-and-egg). It is therefore the single documented
+exception to "Terraform owns VM existence":

-- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
+- Provisioned and bootstrapped manually on bare metal, per the control-node section
+of ADR-005; rationale, hardware, and recovery model in ADR-015.
 - Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
 Ansible for baseline config only (no `docker_host` role).

@@ -13,6 +13,8 @@ workload that should move, or a node due an upgrade.
 - `docs/hardware/reference.md` is the single, hand-maintained source of truth for
 physical compute + network gear and workload placement intent. Two
 machine-readable tables (node capacity, workload placement) carry the numbers.
+This includes `ubongo`, the physical control node (ADR-015), even though it sits
+outside the Proxmox cluster.
 - `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
 parses those tables, computes per-node allocated-vs-physical rollups, and
 cross-checks workload hostnames against `terraform output -json` /
133
docs/decisions/015-control-host.md
Normal file
@@ -0,0 +1,133 @@
+# ADR-015 — Control / development / AI-worker host (`ubongo`)
+
+## Context
+
+Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
+as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
+documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
+the control node purely as a control-plane runner.
+
+It fails four needs, all confirmed as drivers:
+
+1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
+something else creates it; the bootstrap is circular and awkward.
+2. **Always-on availability** — the operator wants to SSH in from a work PC or
+anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
+or being rebuilt.
+3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
+inside the thing it rebuilds.
+4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
+with production VM lifecycle.
+
+A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
+and recovery. A small dedicated always-on physical machine outside the cluster
+satisfies all four.
+
+## Decision
+
+Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
+single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
+It becomes *the* control node and collapses four roles into one box:
+
+- Terraform + Ansible runner (control plane)
+- Claude Code / AI-worker host the operator SSHes into
+- Local test runner (Molecule/Docker, lint, and later a browser stack)
+- Persistent dev home for the repo
+
+There is **no longer a control VM on the cluster.** The `control` inventory group
+points at this physical box. This *strengthens* the ADR-009 control-node exception:
+it is genuinely outside Terraform's world, not a VM pretending to be the exception.
+Every other host stays a Terraform-managed VM exactly as designed.
+
+`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor
+and runs no `docker_host` services.
+
+### Hardware target
+
+| Spec | Target | Why |
+|---|---|---|
+| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
+| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
+| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
+| Network | Wired GbE | Always-on reliability over Wi-Fi |
+| Power | Low draw (≤15 W idle) | Runs 24/7 |
+
+Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250).
+Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
+is **all testing being local** — Molecule (Docker), lint, and a future
+headless-Chromium/Playwright stack.
+
+### Provisioning (bootstrap path)
+
+Manual, on bare metal:
+
+1. Install Debian 13 on the box (one-time, by hand).
+2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
+3. Join the mesh VPN (choice deferred — see below).
+4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
+baseline config via the `control` group (`base` role only).
+
+### Access & security
+
+- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
+mesh; nothing is published to the public internet — this stays inside ADR-002.
+- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
+auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
+denied on the physical NIC.
+
+### Recovery model
+
+`ubongo` is the rebuild tool, so three things must survive a full cluster loss:
+
+1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
+able to drive the fleet if `ubongo` dies.
+2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
+`mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`.
+3. **Vault password** — `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
+local encrypted copy of the vault and decrypts it offline with the operator's
+Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
+whole cluster down — provided `rbw` has synced once and the operator keeps the
+Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
+`mamba`.
+
+There is always exactly one irreducible offline root secret; here it is the
+Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
+would make the control node run a service (against its remit) and still need that
+master password.
+
+> verified: rbw offline-cache decryption · TO VERIFY before relying on the recovery
+> model · rbw docs · (ADR-014, security-relevant — confirm during build)
+
+## Consequences
+
+- The control node is physical compute outside the cluster, so it appears in
+`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
+- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
+- A future **service-UI acceptance** testing level (Claude driving a headless browser
+against a deployed service) is anticipated; `ubongo` is sized for it. The harness
+is a separate spec.
+
+## Deferred (separate specs / discussions)
+
+1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
+dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
+self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
+or it recreates the chicken-and-egg.
+2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user
+generation, screenshot-back-to-Claude, and the new ADR-008 level.
+3. **`rbw` offline-cache verification** — confirm offline decryption before relying
+on it (ADR-014).
+
+## What was ruled out
+
+| Option | Reason |
+|---|---|
+| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
+| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
+| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
+| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
+| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
+| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
+
+See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
+ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).
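ADR-015 cites a number of sibling ADRs (ADR-002, ADR-009, ADR-014 and others), so a cross-reference check may be useful before merging. The following is a minimal sketch, not the repo's own tooling, assuming ADR files follow the `decisions/NNN-*.md` naming visible in the paths of this diff. The function name `unresolved_adr_refs` and the two demo files are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List

# Matches references such as "ADR-015"; the capture group is the 3-digit number.
ADR_REF = re.compile(r"\bADR-(\d{3})\b")


def unresolved_adr_refs(docs_root: Path) -> Dict[str, List[str]]:
    """Map each markdown file under docs_root to the ADR numbers it cites
    that have no matching decisions/NNN-*.md file."""
    decisions = docs_root / "decisions"
    known = {p.name[:3] for p in decisions.glob("[0-9][0-9][0-9]-*.md")}
    missing: Dict[str, List[str]] = {}
    for md in sorted(docs_root.rglob("*.md")):
        cited = set(ADR_REF.findall(md.read_text(encoding="utf-8")))
        gaps = sorted(cited - known)
        if gaps:
            missing[md.relative_to(docs_root).as_posix()] = gaps
    return missing


# Demo on a throwaway tree (file contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "decisions").mkdir()
    (root / "decisions" / "009-provisioning-handoff.md").write_text(
        "See ADR-015.\n", encoding="utf-8"
    )
    (root / "decisions" / "015-control-host.md").write_text(
        "See ADR-009 and ADR-014.\n", encoding="utf-8"
    )
    demo = unresolved_adr_refs(root)
```

In the demo, only ADR-014 is reported, because no `014-*.md` file exists in the throwaway tree. Shorthand such as `ADR-007/009` would only match the first number, so a real check might need a looser pattern.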
@@ -18,6 +18,14 @@
 - **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
 - **Notes:** _warranty, quirks_

+### ubongo (control node — outside the cluster)
+- **Model / form factor:** _TBD (x86-64 mini-PC / USFF, e.g. N100 or refurb micro)_
+- **CPU:** _TBD (target 4 cores, x86-64)_
+- **RAM:** _TBD (target 16 GB)_
+- **Storage:** _TBD (target 250 GB SSD/NVMe)_
+- **NICs:** _wired GbE_
+- **Notes:** _always-on; control plane + AI-worker + local test runner (ADR-015); not a Proxmox guest_
+
 _(repeat for pve1, pve2, askari)_

 ## 2. Network gear
@@ -46,6 +54,7 @@ Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals.
 |------|-------|--------|---------|
 | pve0 | 20 | 64 | 4000 |
 | pve1 | 20 | 64 | 4000 |
+| ubongo | 4 | 16 | 250 |

 ## 5. Capacity notes

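The node-capacity table touched by this hunk is described elsewhere in the diff as machine-readable. As a rough illustration of what reading it with the standard library could look like, here is a sketch that is not the repo's `capacity-scan.py`. The header names `node` and `cores` are assumptions (only `ram_gb` and `disk_gb` are named in the hunk context); the data rows are the ones shown above.

```python
from typing import Dict, List


def parse_markdown_table(lines: List[str]) -> List[Dict[str, str]]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in lines
        if line.strip().startswith("|")
    ]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


def node_capacity(lines: List[str]) -> Dict[str, Dict[str, float]]:
    """Map node name -> numeric capacity columns (floats, since decimals are allowed)."""
    result: Dict[str, Dict[str, float]] = {}
    for row in parse_markdown_table(lines):
        name = row.pop("node")
        result[name] = {key: float(value) for key, value in row.items()}
    return result


# Header row is assumed; the three data rows come from the hunk.
TABLE = """\
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 250 |
""".splitlines()

capacity = node_capacity(TABLE)
# ubongo sits outside the cluster, so a cluster rollup would skip it.
cluster_ram = sum(v["ram_gb"] for n, v in capacity.items() if n.startswith("pve"))
```

Filtering on the `pve` prefix is only one way to keep `ubongo` out of cluster totals; an explicit column marking cluster membership may be more robust.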
@@ -2,7 +2,8 @@

 ## Prerequisites

-- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
+- Proxmox VM template exists (Debian 13 cloud-init image — see below if not).
+Not needed for the control node `ubongo`, which is bare-metal (Part E).
 - `rbw` is installed and unlocked (`rbw unlock`) so the vault password resolves from Vaultwarden
 - The host's intended hostname and IP are decided

@@ -110,27 +111,32 @@ make check PLAYBOOK=site

 ---

-## Part E — Control node (manual exception)
+## Part E — Control node (`ubongo`, manual exception)

 The control node runs Terraform and Ansible, so it cannot be created by the
-Terraform it hosts (chicken-and-egg). It is the **one** host provisioned manually —
-see ADR-009 and the control-node section of ADR-005. Use the template from Part A:
+Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
+machine outside the cluster — not a Proxmox guest. It is the **one** host
+provisioned manually. Rationale, hardware target, and recovery model: ADR-015.

-```bash
-# Clone the template by hand (Proxmox UI or qm clone)
-qm clone 9000 <VMID> --name <hostname> --full
-qm set <VMID> --memory 2048 --cores 2 \
---ciuser ansible \
---sshkeys /path/to/ansible_ed25519.pub \
---ipconfig0 ip=<IP>/24,gw=<GATEWAY>
-qm start <VMID>
-```
+1. Install Debian 13 on the physical box by hand (no template to clone).
+2. Create the `ansible` user and install its SSH public key.
+3. Set up the Ansible environment on it:
+```bash
+git clone <repo> ~/ansible
+cd ~/ansible
+make setup # venv + Python deps
+make collections # Ansible collections
+rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
+```
+4. Join the mesh VPN (choice deferred — see ADR-015) so it is reachable over SSH
+from elsewhere.
+5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.

-Then set up the Ansible environment on it (`make setup`, `make collections`, set up
-`rbw` and `rbw unlock`) per ADR-005, and add it to `inventories/<env>/hosts.yml` under the
-`control` group. Because the control node is not in `local.vms`, this is the only
-case where editing `hosts.yml` by hand is expected — every other host comes from
-`make tf-inventory`.
+Because `ubongo` is not in `local.vms`, this is the only case where editing
+`hosts.yml` by hand is expected. **Known limitation:** `make tf-inventory`
+regenerates `hosts.yml` from Terraform outputs and will overwrite a hand-added
+`control` entry — re-add `ubongo` after running it (preserving the control entry in
+the generator is tracked separately, not yet built).

 ---

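The known limitation in the hunk above (a regenerated `hosts.yml` dropping the hand-added `control` entry) is described as tracked but not yet built. One possible shape for the fix is sketched here, assuming the inventory uses the common Ansible YAML layout `all -> children -> <group> -> hosts`, which the diff does not confirm. The function name, host names other than `ubongo`, and the documentation-range IP are hypothetical, and reading or writing the YAML itself is left out.

```python
import copy
from typing import Any, Dict

Inventory = Dict[str, Any]


def preserve_control_group(generated: Inventory, previous: Inventory) -> Inventory:
    """Return a copy of `generated` that keeps the hand-maintained `control`
    group from `previous` when the generator did not emit one."""
    merged = copy.deepcopy(generated)
    old_children = previous.get("all", {}).get("children", {})
    if "control" not in old_children:
        return merged  # nothing hand-added to preserve
    children = merged.setdefault("all", {}).setdefault("children", {})
    # Only fill the gap; never override a control group the generator emitted.
    children.setdefault("control", copy.deepcopy(old_children["control"]))
    return merged


# Hypothetical data: the old file has the hand-added entry, the new one lacks it.
previous = {
    "all": {
        "children": {
            "control": {"hosts": {"ubongo": {"ansible_host": "192.0.2.10"}}},
            "docker_hosts": {"hosts": {"vm-old": {}}},
        }
    }
}
generated = {"all": {"children": {"docker_hosts": {"hosts": {"vm-new": {}}}}}}

merged = preserve_control_group(generated, previous)
```

The merge deliberately takes every other group from the generated data, so Terraform outputs stay the source of truth for all hosts except the control node.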
@@ -30,6 +30,27 @@ clear "run: rbw unlock" error rather than a hang.

 ---

+## Break-glass — vault access during a full cluster outage
+
+The control node `ubongo` (ADR-015) is the tool used to rebuild the cluster, so it
+must be able to decrypt the vault even when Vaultwarden (if hosted on the cluster)
+is down. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
+it **offline** with your Vaultwarden master password — no live server needed for
+entries it has already synced. The recovery design therefore requires:
+
+- `rbw` on `ubongo` (and on `mamba`, the break-glass laptop) has **synced at least
+once** while Vaultwarden was reachable (`rbw sync`).
+- Your **Vaultwarden master password** is kept **offline** — in a password manager on
+`mamba` and on paper in a safe — independent of any cluster-hosted Vaultwarden.
+
+There is always exactly one irreducible offline root secret; here it is the
+Vaultwarden master password. Keep it recoverable without the cluster.
+
+> **To verify (ADR-014, security-relevant):** confirm `rbw` actually decrypts its
+> local cache fully offline on your pinned `rbw` version before relying on this.
+
+---
+
 ## Rotating a single secret value

 1. Ensure the agent is unlocked: `rbw unlock`
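The "to verify" note in the hunk above asks for a check that `rbw` can read its local cache offline. A small sketch of how that check might be scripted follows, using only the `rbw unlocked` and `rbw get` subcommands. Whether `rbw get` really stays offline is exactly what the note says must be confirmed, so the script only means something when run with Vaultwarden unreachable. The function names and the entry name passed in are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]


def run_quiet(cmd: Sequence[str]) -> int:
    """Run a command, discard its output (it may be a secret), return the exit code."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def offline_vault_check(entry: str, run: Runner = run_quiet) -> List[str]:
    """Return a list of problems; an empty list means the check passed.

    Intended to be run while the Vaultwarden server is unreachable, so a
    pass suggests the entry was served from the local cache.
    """
    problems: List[str] = []
    if run(["rbw", "unlocked"]) != 0:
        problems.append("rbw agent is locked; run: rbw unlock")
        return problems
    if run(["rbw", "get", entry]) != 0:
        problems.append(f"could not read {entry!r} from the local rbw cache")
    return problems
```

The runner is injectable so the logic can be exercised without `rbw` installed; with the default runner, a missing `rbw` binary would raise `FileNotFoundError` rather than return a problem list.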
@@ -15,8 +15,9 @@ revisit (trigger).
 |---|---|---|---|
 | R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
 | R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
+| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |

-_Last reviewed: 2026-06-04. The prior gaps (full CIS hardening, SELinux/AppArmor,
+_Last reviewed: 2026-06-05. The prior gaps (full CIS hardening, SELinux/AppArmor,
 IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
 Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
 part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build
745
docs/superpowers/plans/2026-06-05-ubongo-control-host.md
Normal file
@@ -0,0 +1,745 @@
+# Ubongo Control / AI-Worker Host — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Record the decision to replace the cluster-resident control VM with a dedicated always-on physical host (`ubongo`) outside the Proxmox cluster, by authoring ADR-015 and reconciling every doc that currently assumes the control node is a cluster VM.
+
+**Architecture:** This is a **documentation-only** change. No code, no roles, no inventory data. `ubongo` is recorded as *designed, not built* (per STATUS.md discipline) — the physical box, its OS install, and its inventory wiring are a future manual build, not part of this plan. The work is: one new ADR (the home of record) plus targeted amendments to the ADRs/runbooks/registers that contradict it, each cross-linking ADR-015.
+
+**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus manual internal-consistency checks. There is no markdown linter in the toolchain, so "tests" are hook-pass + cross-reference-resolves greps.
+
+---
+
+## Pre-flight (read once before starting)
+
+- **`rbw` must be unlocked before every commit.** The pre-commit ansible-lint hook decrypts `vault.yml`. Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`. Do not start a task you cannot commit.
+- **Commit style:** one commit per task, imperative subject ≤72 chars, with the trailer:
+```
+Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+```
+- **Order matters:** Task 1 (ADR-015) must land first — every later task links to it.
+- **Spec reference:** `docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md`.
+
+---
+
+## File map
+
+| File | Action | Responsibility after change |
+|---|---|---|
+| `docs/decisions/015-control-host.md` | Create | Home of record for the `ubongo` decision |
+| `docs/decisions/001-architecture.md` | Modify | Control node = physical box outside cluster |
+| `docs/decisions/005-bootstrapping.md` | Modify | Control-node bootstrap = bare-metal Debian install |
+| `docs/decisions/009-provisioning-handoff.md` | Modify | Control-node exception is genuinely physical |
+| `docs/decisions/008-testing.md` | Modify | All test levels run on `ubongo`; stub future UI level |
+| `docs/decisions/012-hardware-capacity.md` | Modify | `ubongo` is in-scope physical compute |
+| `docs/hardware/reference.md` | Modify | `ubongo` row in node-capacity + physical-compute section |
+| `docs/runbooks/new-host.md` | Modify | Part E: control node is bare-metal, not `qm clone` |
+| `docs/runbooks/rotate-secrets.md` | Modify | Offline break-glass vault-password requirement |
+| `docs/security/accepted-risks.md` | Modify | Reserve mesh-VPN coordinator risk (pending VPN choice) |
+| `STATUS.md` | Modify | Row: `ubongo` — designed, not built |
+| `CLAUDE.md` | Modify | ADR-015 in Further reading; control-group note |
+
+---
+
+### Task 1: Author ADR-015 (the home of record)
+
+**Files:**
+- Create: `docs/decisions/015-control-host.md`
+
+- [ ] **Step 1: Create the ADR file**
+
+Create `docs/decisions/015-control-host.md` with exactly this content:
+
```markdown
+# ADR-015 — Control / development / AI-worker host (`ubongo`)
+
+## Context
+
+Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
+as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
+documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
+the control node purely as a control-plane runner.
+
+It fails four needs, all confirmed as drivers:
+
+1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
+something else creates it; the bootstrap is circular and awkward.
+2. **Always-on availability** — the operator wants to SSH in from a work PC or
+anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
+or being rebuilt.
+3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
+inside the thing it rebuilds.
+4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
+with production VM lifecycle.
+
+A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
+and recovery. A small dedicated always-on physical machine outside the cluster
+satisfies all four.
+
+## Decision
+
+Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
+single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
+It becomes *the* control node and collapses four roles into one box:
+
+- Terraform + Ansible runner (control plane)
+- Claude Code / AI-worker host the operator SSHes into
+- Local test runner (Molecule/Docker, lint, and later a browser stack)
+- Persistent dev home for the repo
+
+There is **no longer a control VM on the cluster.** The `control` inventory group
+points at this physical box. This *strengthens* the ADR-009 control-node exception:
+it is genuinely outside Terraform's world, not a VM pretending to be the exception.
+Every other host stays a Terraform-managed VM exactly as designed.
+
+`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor
+and runs no `docker_host` services.
+
+### Hardware target
+
+| Spec | Target | Why |
+|---|---|---|
+| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
+| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
+| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
+| Network | Wired GbE | Always-on reliability over Wi-Fi |
+| Power | Low draw (≤15 W idle) | Runs 24/7 |
+
+Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250).
+Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
+is **all testing being local** — Molecule (Docker), lint, and a future
+headless-Chromium/Playwright stack.
+
+### Provisioning (bootstrap path)
+
+Manual, on bare metal:
+
+1. Install Debian 13 on the box (one-time, by hand).
+2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
+3. Join the mesh VPN (choice deferred — see below).
+4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
+baseline config via the `control` group (`base` role only).
+
+### Access & security
|
||||
|
||||
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
|
||||
mesh; nothing is published to the public internet — this stays inside ADR-002.
|
||||
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
|
||||
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
|
||||
denied on the physical NIC.
|
||||
|
||||
### Recovery model
|
||||
|
||||
`ubongo` is the rebuild tool, so three things must survive a full cluster loss (or the loss of `ubongo` itself):
|
||||
|
||||
1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
|
||||
able to drive the fleet if `ubongo` dies.
|
||||
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
|
||||
`mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`.
|
||||
3. **Vault password** — `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
|
||||
local encrypted copy of the vault and decrypts it offline with the operator's
|
||||
Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
|
||||
whole cluster down — provided `rbw` has synced once and the operator keeps the
|
||||
Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
|
||||
`mamba`.
|
||||
|
||||
There is always exactly one irreducible offline root secret; here it is the
|
||||
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
|
||||
would make the control node run a service (against its remit) and still need that
|
||||
master password.
|
||||
|
||||
> verified: rbw offline-cache decryption · TO VERIFY before relying on the recovery
|
||||
> model · rbw docs · (ADR-014, security-relevant — confirm during build)
|
||||
|
||||
## Consequences
|
||||
|
||||
- The control node is physical compute outside the cluster, so it appears in
|
||||
`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
|
||||
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
|
||||
- A future **service-UI acceptance** testing level (Claude driving a headless browser
|
||||
against a deployed service) is anticipated; `ubongo` is sized for it. The harness
|
||||
is a separate spec.
|
||||
|
||||
## Deferred (separate specs / discussions)
|
||||
|
||||
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
|
||||
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
|
||||
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
|
||||
or it recreates the chicken-and-egg.
|
||||
2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user
|
||||
generation, screenshot-back-to-Claude, and the new ADR-008 level.
|
||||
3. **`rbw` offline-cache verification** — confirm offline decryption before relying
|
||||
on it (ADR-014).
|
||||
|
||||
## What was ruled out
|
||||
|
||||
| Option | Reason |
|
||||
|---|---|
|
||||
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
|
||||
| Laptop-only (`mamba` for everything) | Fails always-on and recovery. Retained as break-glass backup. |
|
||||
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
|
||||
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
|
||||
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
|
||||
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
|
||||
|
||||
See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
|
||||
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Confirm `rbw` is unlocked, then verify hooks pass**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md`
|
||||
Expected: `rbw` exits 0; hooks report `Passed`/`Skipped` (ansible-lint skips non-YAML; trailing-whitespace + end-of-file Passed).
|
||||
|
||||
- [ ] **Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/decisions/015-control-host.md
|
||||
git commit -m "Add ADR-015 (control/AI-worker host ubongo)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Amend ADR-001 (architecture)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/001-architecture.md`
|
||||
|
||||
- [ ] **Step 1: Update the control-node bullet**
|
||||
|
||||
Find (lines ~13–15):
|
||||
```markdown
|
||||
- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
|
||||
The control node is the one host that cannot fully bootstrap itself from scratch
|
||||
and requires manual initial setup (see `docs/runbooks/new-host.md`).
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
- **Control node**: `ubongo` — a dedicated always-on **physical** x86-64 machine
|
||||
**outside** the cluster. Ansible runs from here. It cannot be created by the
|
||||
Terraform it hosts, so it is provisioned manually (see ADR-015 and
|
||||
`docs/runbooks/new-host.md`).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the VM-existence table row**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; `ubongo` (control node) is a physical box outside the cluster, the one manual exception (see ADR-009/ADR-015) |
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Update the `control` host-group comment**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
├── control # the control node itself — baseline config only, runs no services
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
├── control # ubongo — physical control node outside the cluster; baseline config only, runs no services
|
||||
```
|
||||
|
||||
- [ ] **Step 4: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/001-architecture.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/001-architecture.md
|
||||
git commit -m "ADR-001: control node is physical ubongo outside cluster"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Amend ADR-005 (bootstrapping)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/005-bootstrapping.md`
|
||||
|
||||
- [ ] **Step 1: Replace the "Control node bootstrapping" section body**
|
||||
|
||||
Find (the numbered list under `## Control node bootstrapping`, lines ~52–69):
|
||||
```markdown
|
||||
The control node is a special case — it runs Terraform and Ansible, so it cannot
|
||||
be created by the Terraform it hosts (chicken-and-egg). It is the one documented
|
||||
exception to Terraform-owned VM existence (see ADR-009). The control node requires:
|
||||
|
||||
1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
|
||||
`qm clone`), since Terraform is not yet available to do it
|
||||
2. Manual setup of the Ansible environment:
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
The control node is a special case — it runs Terraform and Ansible, so it cannot
|
||||
be created by the Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated
|
||||
**physical** machine outside the cluster, and the one documented exception to
|
||||
Terraform-owned VM existence (see ADR-009 and ADR-015). The control node requires:
|
||||
|
||||
1. Manual OS provisioning — install Debian 13 on the physical box by hand (it is not
|
||||
a Proxmox guest, so there is no template to clone)
|
||||
2. Manual setup of the Ansible environment:
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the trailing reference to the control node listing**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
The control node itself is listed in `inventories/production/hosts.yml` under
|
||||
a `control` group and can be managed for baseline config (SSH, firewall, updates)
|
||||
but not for the `docker_host` role (it does not run services).
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
`ubongo` is listed in `inventories/production/hosts.yml` under the `control` group
|
||||
and can be managed for baseline config (SSH, firewall, updates) but not for the
|
||||
`docker_host` role (it does not run services). Hardware target and recovery model
|
||||
are in ADR-015.
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/005-bootstrapping.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/005-bootstrapping.md
|
||||
git commit -m "ADR-005: control node bootstrap is bare-metal Debian on ubongo"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Amend ADR-009 (provisioning handoff)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/009-provisioning-handoff.md`
|
||||
|
||||
- [ ] **Step 1: Strengthen the control-node exception section**
|
||||
|
||||
Find (under `## The control-node exception`, lines ~129–138):
|
||||
```markdown
|
||||
The control node — the host that runs Terraform and Ansible — is the one VM
|
||||
Terraform does **not** create. It cannot provision the infrastructure that would
|
||||
provision itself (chicken-and-egg). It is therefore the single documented exception
|
||||
to "Terraform owns VM existence":
|
||||
|
||||
- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
|
||||
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
|
||||
Ansible for baseline config only (no `docker_host` role).
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
The control node — the host that runs Terraform and Ansible — is `ubongo`, a
|
||||
dedicated **physical** machine outside the cluster. It is not a VM at all, so
|
||||
Terraform genuinely never touches it: it cannot provision the infrastructure that
|
||||
would provision itself (chicken-and-egg). It is therefore the single documented
|
||||
exception to "Terraform owns VM existence":
|
||||
|
||||
- Provisioned and bootstrapped manually on bare metal, per the control-node section
|
||||
of ADR-005; rationale, hardware, and recovery model in ADR-015.
|
||||
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
|
||||
Ansible for baseline config only (no `docker_host` role).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/009-provisioning-handoff.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/009-provisioning-handoff.md
|
||||
git commit -m "ADR-009: control-node exception is a physical box, not a VM"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Amend ADR-008 (testing)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/008-testing.md`
|
||||
|
||||
- [ ] **Step 1: Make Level 1 say it runs on `ubongo`**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
Runs in Docker on the control node or in CI. Fast (~5 min per role).
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
Runs in Docker on the control node (`ubongo`) or in CI. Fast (~5 min per role).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Add a future service-UI acceptance level stub**
|
||||
|
||||
Find (the end of `### Level 3 — External smoke test from askari`, lines ~51–55):
|
||||
```markdown
|
||||
### Level 3 — External smoke test from askari
|
||||
|
||||
Once `askari` is operational: scripted checks from outside the network confirming
|
||||
that public-facing services respond correctly. Catches firewall and reverse proxy
|
||||
configuration issues invisible to Ansible check mode.
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
### Level 3 — External smoke test from askari
|
||||
|
||||
Once `askari` is operational: scripted checks from outside the network confirming
|
||||
that public-facing services respond correctly. Catches firewall and reverse proxy
|
||||
configuration issues invisible to Ansible check mode.
|
||||
|
||||
### Level 4 — Service-UI acceptance (planned, not built)
|
||||
|
||||
Claude drives a headless browser from `ubongo` against a *deployed* service: loads
|
||||
the rendered UI, creates test users, exercises features, and hands the operator a
|
||||
manual test script for the rest. Catches application-level regressions that no lower
|
||||
level sees. The harness (Playwright/headless-Chromium, screenshot-back-to-Claude) is
|
||||
a **separate spec**; `ubongo` is sized for it (ADR-015). Status: designed, not built
|
||||
(STATUS.md).
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/008-testing.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/008-testing.md
|
||||
git commit -m "ADR-008: tests run on ubongo; stub Level 4 service-UI acceptance"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Amend ADR-012 and the hardware reference
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/012-hardware-capacity.md`
|
||||
- Modify: `docs/hardware/reference.md`
|
||||
|
||||
- [ ] **Step 1: Note `ubongo` as in-scope physical compute in ADR-012**
|
||||
|
||||
In `docs/decisions/012-hardware-capacity.md`, find the first bullet under `## Decision`:
|
||||
```markdown
|
||||
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
|
||||
physical compute + network gear and workload placement intent. Two
|
||||
machine-readable tables (node capacity, workload placement) carry the numbers.
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
|
||||
physical compute + network gear and workload placement intent. Two
|
||||
machine-readable tables (node capacity, workload placement) carry the numbers.
|
||||
This includes `ubongo`, the physical control node (ADR-015), even though it sits
|
||||
outside the Proxmox cluster.
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Add `ubongo` to the physical-compute section of the reference**
|
||||
|
||||
In `docs/hardware/reference.md`, find:
|
||||
```markdown
|
||||
_(repeat for pve1, pve2, askari)_
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
### ubongo (control node — outside the cluster)
|
||||
- **Model / form factor:** _TBD (x86-64 mini-PC / USFF, e.g. N100 or refurb micro)_
|
||||
- **CPU:** _TBD (target 4 cores, x86-64)_
|
||||
- **RAM:** _TBD (target 16 GB)_
|
||||
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
|
||||
- **NICs:** _wired GbE_
|
||||
- **Notes:** _always-on; control plane + AI-worker + local test runner (ADR-015); not a Proxmox guest_
|
||||
|
||||
_(repeat for pve1, pve2, askari)_
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Add `ubongo` to the machine-readable node-capacity table**
|
||||
|
||||
In `docs/hardware/reference.md`, find the node-capacity table:
|
||||
```markdown
|
||||
| node | cores | ram_gb | disk_gb |
|
||||
|------|-------|--------|---------|
|
||||
| pve0 | 20 | 64 | 4000 |
|
||||
| pve1 | 20 | 64 | 4000 |
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
| node | cores | ram_gb | disk_gb |
|
||||
|------|-------|--------|---------|
|
||||
| pve0 | 20 | 64 | 4000 |
|
||||
| pve1 | 20 | 64 | 4000 |
|
||||
| ubongo | 4 | 16 | 250 |
|
||||
```
|
||||
|
||||
Note: the header row (`node | cores | ram_gb | disk_gb`) is a parser contract for
|
||||
`scripts/capacity-scan.py` — only a data row is added, the header is untouched.
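
To illustrate why the header row matters, here is a minimal Python sketch of a header-keyed table parser. It is not the real `scripts/capacity-scan.py`, whose implementation is not shown in this plan; `parse_capacity_table`, `EXPECTED_HEADER`, and the strict matching behaviour are assumptions made for the example. It shows that an added data row parses cleanly, while a renamed header column would not be found.

```python
from typing import Dict, List

# Assumed parser contract: the exact header cells named in the note above.
EXPECTED_HEADER = ["node", "cores", "ram_gb", "disk_gb"]


def _cells(line: str) -> List[str]:
    """Split one markdown table row into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_capacity_table(markdown: str) -> Dict[str, Dict[str, int]]:
    """Parse the first table whose header matches EXPECTED_HEADER exactly."""
    lines = markdown.splitlines()
    for index, line in enumerate(lines):
        if line.strip().startswith("|") and _cells(line) == EXPECTED_HEADER:
            start = index + 2  # skip the header and the |---| separator row
            break
    else:
        raise ValueError("node-capacity header row not found")

    nodes: Dict[str, Dict[str, int]] = {}
    for row in lines[start:]:
        if not row.strip().startswith("|"):
            break  # the first non-table line ends the table
        name, cores, ram_gb, disk_gb = _cells(row)
        nodes[name] = {
            "cores": int(cores),
            "ram_gb": int(ram_gb),
            "disk_gb": int(disk_gb),
        }
    return nodes


# Sample input: the table from Step 3, including the new ubongo row.
TABLE = """
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 250 |
"""

capacity = parse_capacity_table(TABLE)
print(capacity["ubongo"])
```

In this sketch the extra `ubongo` row is just one more dictionary entry. Changing a header cell (for example `node` to `host`) raises `ValueError`, which is the kind of breakage the note warns about.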
|
||||
|
||||
- [ ] **Step 4: Verify the capacity scan still parses, hooks pass, then commit**
|
||||
|
||||
Run: `python3 scripts/capacity-scan.py 2>&1 | head -c 400`
|
||||
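Expected: no traceback or parse error, and the output reflects the new `ubongo` row. The pipeline's exit status is that of `head`, not the script, and only the first 400 bytes are shown, so rerun without `| head -c 400` to check the script's own exit code or to see the full output. If the script needs an argument or env, consult its `--help`; a clean exit with JSON is success.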
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md docs/hardware/reference.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/012-hardware-capacity.md docs/hardware/reference.md
|
||||
git commit -m "ADR-012/hardware: add ubongo as physical control node"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Update the new-host runbook (Part E)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/runbooks/new-host.md`
|
||||
|
||||
- [ ] **Step 1: Replace Part E with the bare-metal control-node procedure**
|
||||
|
||||
Find the whole `## Part E — Control node (manual exception)` section (lines ~113–133), from the heading through the paragraph ending "every other host comes from `make tf-inventory`." Replace it with:
|
||||
````markdown
|
||||
## Part E — Control node (`ubongo`, manual exception)
|
||||
|
||||
The control node runs Terraform and Ansible, so it cannot be created by the
|
||||
Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
|
||||
machine outside the cluster — not a Proxmox guest. It is the **one** host
|
||||
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
|
||||
|
||||
1. Install Debian 13 on the physical box by hand (no template to clone).
|
||||
2. Create the `ansible` user and install its SSH public key.
|
||||
3. Set up the Ansible environment on it:
|
||||
```bash
|
||||
git clone <repo> ~/ansible
|
||||
cd ~/ansible
|
||||
make setup # venv + Python deps
|
||||
make collections # Ansible collections
|
||||
rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
|
||||
```
|
||||
4. Join the mesh VPN (choice deferred — see ADR-015) so it is reachable over SSH
|
||||
from elsewhere.
|
||||
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.
|
||||
|
||||
Because `ubongo` is not in `local.vms`, this is the only case where editing
|
||||
`hosts.yml` by hand is expected. **Known limitation:** `make tf-inventory`
|
||||
regenerates `hosts.yml` from Terraform outputs and will overwrite a hand-added
|
||||
`control` entry — re-add `ubongo` after running it (preserving the control entry in
|
||||
the generator is tracked separately, not yet built).
|
||||
````
|
||||
|
||||
- [ ] **Step 2: Update the Prerequisites note that assumes a template**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not).
|
||||
Not needed for the control node `ubongo`, which is bare-metal (Part E).
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/runbooks/new-host.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/runbooks/new-host.md
|
||||
git commit -m "new-host runbook: control node ubongo is bare-metal"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 8: Update the rotate-secrets runbook (offline break-glass)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/runbooks/rotate-secrets.md`
|
||||
|
||||
- [ ] **Step 1: Add a break-glass section after the `rbw` setup section**
|
||||
|
||||
Find the end of the `## One-time — \`rbw\` setup on a new machine` section:
|
||||
```markdown
|
||||
Once unlocked, `make encrypt/decrypt/check/deploy` and the pre-commit ansible-lint
|
||||
hook all obtain the password automatically. If the agent is locked you'll see a
|
||||
clear "run: rbw unlock" error rather than a hang.
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
Once unlocked, `make encrypt/decrypt/check/deploy` and the pre-commit ansible-lint
|
||||
hook all obtain the password automatically. If the agent is locked you'll see a
|
||||
clear "run: rbw unlock" error rather than a hang.
|
||||
|
||||
---
|
||||
|
||||
## Break-glass — vault access during a full cluster outage
|
||||
|
||||
The control node `ubongo` (ADR-015) is the tool used to rebuild the cluster, so it
|
||||
must be able to decrypt the vault even when Vaultwarden (if hosted on the cluster)
|
||||
is down. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
|
||||
it **offline** with your Vaultwarden master password — no live server needed for
|
||||
entries it has already synced. The recovery design therefore requires:
|
||||
|
||||
- `rbw` on `ubongo` (and on `mamba`, the break-glass laptop) has **synced at least
|
||||
once** while Vaultwarden was reachable (`rbw sync`).
|
||||
- Your **Vaultwarden master password** is kept **offline** (memorised and on paper
|
||||
in a safe), independent of any cluster-hosted Vaultwarden.
|
||||
|
||||
There is always exactly one irreducible offline root secret; here it is the
|
||||
Vaultwarden master password. Keep it recoverable without the cluster.
|
||||
|
||||
> **To verify (ADR-014, security-relevant):** confirm `rbw` actually decrypts its
|
||||
> local cache fully offline on your pinned `rbw` version before relying on this.
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/runbooks/rotate-secrets.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/runbooks/rotate-secrets.md
|
||||
git commit -m "rotate-secrets: document offline vault break-glass for ubongo"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 9: Reserve the mesh-VPN accepted-risk entry
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/security/accepted-risks.md`
|
||||
|
||||
- [ ] **Step 1: Add R3 to the risk table**
|
||||
|
||||
Find the table row for R2:
|
||||
```markdown
|
||||
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
|
||||
```
|
||||
Add immediately **after** it:
|
||||
```markdown
|
||||
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the "Last reviewed" footer date**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
_Last reviewed: 2026-06-04. The prior gaps
|
||||
```
|
||||
Replace `2026-06-04` with `2026-06-05` (only the date changes; leave the rest of the sentence intact):
|
||||
```markdown
|
||||
_Last reviewed: 2026-06-05. The prior gaps
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/security/accepted-risks.md
|
||||
git commit -m "accepted-risks: reserve R3 mesh-VPN coordinator (pending choice)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 10: Add the `ubongo` row to STATUS.md
|
||||
|
||||
**Files:**
|
||||
- Modify: `STATUS.md`
|
||||
|
||||
- [ ] **Step 1: Add a row to the "Designed but not built" table**
|
||||
|
||||
Find the last row of the `## Designed but not built` table:
|
||||
```markdown
|
||||
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
|
||||
```
|
||||
Add immediately **after** it:
|
||||
```markdown
|
||||
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files STATUS.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add STATUS.md
|
||||
git commit -m "STATUS: record ubongo control host as designed, not built"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 11: Update CLAUDE.md (index + control-group note)
|
||||
|
||||
**Files:**
|
||||
- Modify: `CLAUDE.md`
|
||||
|
||||
- [ ] **Step 1: Add ADR-015 to the Further reading table**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
|
||||
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the control-group parenthetical in the Inventory structure section**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
(`control` holds the one manually-provisioned control node — see ADR-009.)
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
(`control` holds `ubongo`, the one manually-provisioned **physical** control node
|
||||
outside the cluster — see ADR-009 and ADR-015.)
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add CLAUDE.md
|
||||
git commit -m "CLAUDE.md: link ADR-015; note ubongo as physical control node"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 12: Final consistency sweep
|
||||
|
||||
**Files:** none modified (verification only)
|
||||
|
||||
- [ ] **Step 1: Confirm no doc still calls the control node a VM**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
grep -rniE "control node.*(VM|virtual)|dedicated Debian 13 VM" docs/ CLAUDE.md STATUS.md
|
||||
```
|
||||
Expected: no hit that *asserts* the control node is a VM. (Hits inside ADR-015's "What was ruled out" table that describe the rejected option are fine.) If any other doc still frames the control node as a VM, fix it the same way as the relevant task above and amend that task's commit.
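
To make the expected hits concrete, the sketch below applies the same alternation with Python's `re` module to three sample lines taken from this plan. It is only an illustration of what the pattern flags; the `SAMPLES` labels and the `flagged` helper are hypothetical names, and `re.IGNORECASE` stands in for grep's `-i`.

```python
import re

# Same alternation as the grep -E pattern above; IGNORECASE mirrors grep's -i.
VM_CLAIM = re.compile(
    r"control node.*(VM|virtual)|dedicated Debian 13 VM", re.IGNORECASE
)

SAMPLES = {
    # Old ADR-001 wording (Task 2, Step 1 "Find"): should be flagged.
    "old_adr001": "- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.",
    # New ADR-008 wording (Task 5, Step 1 "Replace with"): should not be flagged.
    "new_adr008": "Runs in Docker on the control node (`ubongo`) or in CI. Fast (~5 min per role).",
    # Rejected option in ADR-015's ruled-out table: flagged, but an acceptable hit.
    "ruled_out": "| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |",
}


def flagged(samples):
    """Return the labels of sample lines the pattern matches, sorted by label."""
    return sorted(name for name, line in samples.items() if VM_CLAIM.search(line))


print(flagged(SAMPLES))
```

Like grep, this checks one line at a time, so a claim that wraps across two lines (with "control node" on one and "VM" on the next) would not be caught by either. A quick read of the touched sections remains worthwhile.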
|
||||
|
||||
- [ ] **Step 2: Confirm every ADR-015 cross-link resolves**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
grep -rl "ADR-015\|015-control-host" docs/ CLAUDE.md STATUS.md
|
||||
test -f docs/decisions/015-control-host.md && echo "ADR-015 present"
|
||||
```
|
||||
Expected: the file exists and the referencing docs (001, 005, 008, 009, 012, runbooks, accepted-risks, STATUS, CLAUDE.md) appear.
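
The `grep -rl` output still has to be compared against the expected list by eye. As a hedged alternative, this sketch reports which expected files do not mention ADR-015. The `EXPECTED` list is assembled from the files Tasks 2 to 11 modify; `missing_references` and `load_texts` are hypothetical helpers, not part of the repo.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List, Mapping

# Same two spellings the grep above looks for.
ADR_REF = re.compile(r"ADR-015|015-control-host")

# Files that Tasks 2-11 are expected to leave referencing ADR-015.
EXPECTED = [
    "docs/decisions/001-architecture.md",
    "docs/decisions/005-bootstrapping.md",
    "docs/decisions/008-testing.md",
    "docs/decisions/009-provisioning-handoff.md",
    "docs/decisions/012-hardware-capacity.md",
    "docs/hardware/reference.md",
    "docs/runbooks/new-host.md",
    "docs/runbooks/rotate-secrets.md",
    "docs/security/accepted-risks.md",
    "STATUS.md",
    "CLAUDE.md",
]


def missing_references(texts: Mapping[str, str], expected: Iterable[str]) -> List[str]:
    """Return expected paths that are absent from texts or never mention ADR-015."""
    return [path for path in expected if not ADR_REF.search(texts.get(path, ""))]


def load_texts(root: Path, paths: Iterable[str]) -> Dict[str, str]:
    """Read the files that exist under root; missing files are simply omitted."""
    return {
        p: (root / p).read_text(encoding="utf-8")
        for p in paths
        if (root / p).is_file()
    }


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_references(load_texts(Path("."), EXPECTED), EXPECTED)
    print("missing ADR-015 reference:", missing or "none")
```

Run from the repository root, an empty result means every expected file carries the cross-link. Any path listed points back to the task whose edit was missed.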
|
||||
|
||||
- [ ] **Step 3: Full hook run**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --all-files`
|
||||
Expected: all hooks `Passed`/`Skipped`. Fix anything that fails (most likely trailing whitespace or end-of-file) and amend the owning commit.
|
||||
|
||||
- [ ] **Step 4: Push (only if the user asks)**
|
||||
|
||||
Per CLAUDE.md, push to `origin` is the off-machine backup. If the user wants it pushed:
|
||||
```bash
|
||||
git push origin main
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review notes (author)
|
||||
|
||||
- **Spec coverage:** every spec section maps to a task — host decision/hardware/bootstrap/access/recovery → Task 1 (ADR-015); the doc-changes table → Tasks 2–11; testing implication → Task 5; deferrals are recorded in ADR-015 and not implemented here (correct — they are separate specs). ✓
|
||||
- **Not in scope (intentional):** acquiring/installing the box, mesh-VPN selection, the browser harness, adding `ubongo` to live inventory, and modifying `tf_to_inventory.py` to preserve the control entry (logged as a known limitation in Task 7). ✓
|
||||
- **No placeholders:** every edit shows exact find/replace text; the only `_TBD_` strings are deliberate hardware-reference skeleton fields matching that file's existing style. ✓
|
||||
```
|
||||
205
docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md
Normal file
|
|
@ -0,0 +1,205 @@
|
|||
# Design — Control / development / AI-worker host (`ubongo`)
|
||||
|
||||
- **Date:** 2026-06-05
|
||||
- **Status:** Approved design — pending implementation plan
|
||||
- **Supersedes (in part):** the "control node is a dedicated VM on the cluster"
|
||||
assumption in ADR-001 / ADR-005 / ADR-009
|
||||
- **Becomes:** ADR-015 (this design is the basis for that ADR)

---

## Problem

Today the control node (the host that runs Terraform and Ansible) is defined as a
**single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). The ADRs treat it
purely as a control-plane runner.

That framing fails four things the user actually needs, all confirmed as drivers:

1. **Cold-start bootstrap:** the VM that runs Terraform/Ansible can't exist until
   something else creates it; the manual bootstrap is awkward and circular.
2. **Always-on availability:** the user wants to SSH in from a work PC (or anywhere)
   to fire off Claude Code commands. A VM on the cluster is gone whenever the cluster
   is down or being rebuilt.
3. **Recovery / disaster:** the tool you'd use to rebuild the cluster must not live
   *inside* the thing it rebuilds.
4. **Dev ergonomics:** a comfortable, persistent home for Claude Code + the repo,
   not entangled with production VM lifecycle.

A laptop-only answer fails always-on (not always carried) and recovery. A VM-only
answer fails cold-start, always-on (it dies with the cluster), and recovery. A small
**dedicated always-on physical machine outside the cluster** satisfies all four.

---

## Decision

Introduce **`ubongo`**: a single dedicated x86-64 mini-PC, always-on, living
**outside** the Proxmox cluster. It becomes *the* control node and collapses four
roles into one box:

- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the user SSHes into
- Local test runner (Molecule/Docker, lint, and later the browser stack)
- Persistent dev home for the repo

There is **no longer a control VM on the cluster.** The `control` inventory group now
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.

`ubongo` runs **plain Debian 13** (matches the fleet, so the `base` role applies). It is
not a hypervisor.

### Name

`ubongo` (Swahili: *brain*), consistent with the fleet's Swahili theme (`boma`,
`nyumbani`, `askari`, `mamba`).

---

## Hardware target

| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, Molecule layers, repos, browser cache |
| Network | Wired GbE | More reliable than Wi-Fi for an always-on box |
| Power | Low draw (≤15 W idle) | Runs 24/7 |

Indicative hardware: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC,
roughly €150–250.

**Sizing rationale.** Claude Code itself is light: the model runs in Anthropic's
cloud, and the box only runs the CLI, git, Ansible, Terraform, `rbw`, and SSH. The real
sizing driver is **all testing being local**: Molecule (Docker, the heavy one), lint,
and later a headless-Chromium/Playwright browser stack for service-UI verification.
That combination is firmly mini-PC / small-server class, not a Raspberry Pi.

---

## Provisioning (bootstrap path)

Manual, like today's control-node exception, but on bare metal instead of a clone:

1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN (choice TBD; see Deferred).
4. From then on `ubongo` manages every other host normally. Ansible manages *it* for
   baseline config via the `control` group (`base` role: SSH, firewall, updates,
   auditd), but never the `docker_host` role. It runs no services.
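
Step 2 of the list above can be sketched as a small shell function. This is a minimal sketch, not the repo's actual bootstrap script: the function name `bootstrap_ubongo`, the repo URL, and the checkout path are hypothetical, and `rbw` is assumed to be installed and already pointed at the Vaultwarden instance via `rbw config set`. Step 3 (mesh VPN) is left out because the VPN is still undecided.

```shell
#!/bin/sh
# Sketch only: bootstrap_ubongo, its arguments and the checkout path are
# hypothetical names, not part of the repo.
set -eu

bootstrap_ubongo() {
  repo_url="$1"   # assumption: URL of the repo on the Forgejo instance
  dest="$2"       # assumption: directory the checkout should live in

  # Step 2: clone the repo and install the toolchain via the Makefile targets.
  git clone "$repo_url" "$dest"
  make -C "$dest" setup
  make -C "$dest" collections

  # Step 2 (continued): log in to Vaultwarden, unlock, and sync once so that
  # a local encrypted copy of the vault exists on this box.
  rbw login
  rbw unlock
  rbw sync

  # Step 3 (join the mesh VPN) is intentionally absent: the choice is TBD.
}
```

After sourcing the file, the function would be called with the real repo URL and a target directory, for example `bootstrap_ubongo "$REPO_URL" "$HOME/boma"` (both values are placeholders).
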

---

## Access & security

- **Remote access via the mesh VPN** (choice TBD). The user SSHes to `ubongo` over the
  mesh from work PC, laptop, or phone. **Nothing is published to the public internet:**
  SSH-over-mesh keeps the design fully inside ADR-002 (no LAN/WAN exposure without
  reverse-proxy + auth).
- **Hardening.** `ubongo` runs the `base` role like every host: SSH hardening,
  nftables default-deny, fail2ban, auditd, unattended-upgrades. Inbound SSH is allowed
  **only on the mesh interface** and denied on the physical NIC. Even on the LAN it is
  not an open SSH target.
- **Third-party dependency.** A hosted mesh coordinator is a third party. This is a
  deliberate trade: a hosted control plane keeps the mesh up when the cluster is down
  (helps recovery). A self-hosted coordinator on the cluster would recreate the
  chicken-and-egg problem, so if self-hosted, it must live on `ubongo` or off-cluster,
  never on the fleet. To be logged in `accepted-risks.md` once the VPN is chosen.
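
The "SSH only on the mesh interface" rule can be illustrated with a small function that renders a standalone nftables input chain. This is a sketch under assumptions, not the `base` role's real template: the function name `render_ssh_mesh_only` and the table name `ubongo_sketch` are made up, and the interface name depends on the VPN that is eventually chosen (typically `tailscale0` for Tailscale or `wt0` for NetBird, to be confirmed against the chosen client).

```shell
#!/bin/sh
# Sketch only: render_ssh_mesh_only and the table name are hypothetical;
# the real rules are managed by the Ansible `base` role.
set -eu

# Print a minimal default-deny input chain that accepts SSH solely on the
# given mesh interface. Nothing is applied; the ruleset goes to stdout.
render_ssh_mesh_only() {
  mesh_if="$1"   # assumption: name of the mesh VPN interface
  cat <<EOF
table inet ubongo_sketch {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iifname "lo" accept
    iifname "${mesh_if}" tcp dport 22 accept
  }
}
EOF
}
```

Piping the output through `nft -c -f -` (as root) asks nftables to check the rendered rules without loading them. Because the only `tcp dport 22` rule is bound to the mesh interface and the chain policy is `drop`, new SSH connections arriving on the physical NIC match nothing and are dropped.
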

---

## Recovery model

`ubongo` is now the rebuild tool, so three things must survive a full cluster loss:

1. **The box / its data.** `mamba` (laptop) stays a **break-glass clone**: repo +
   toolchain + mesh + `rbw`, able to drive the fleet if `ubongo` dies. Two machines
   that can drive the fleet, not one.
2. **Terraform state.** Lives on `ubongo`, backed up **encrypted off-box** (synced to
   `mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`, so
   this is belt-and-suspenders, not load-bearing.
3. **The vault password.** `ubongo` gets the Ansible vault password from Vaultwarden via
   `rbw`. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
   it **offline** with the user's Vaultwarden master password; no live server is needed
   for already-synced entries. So provided (a) `rbw` has synced at least once and (b)
   the user keeps their **Vaultwarden master password** offline (memorised + paper in a
   safe), `ubongo` can decrypt the Ansible vault with the whole cluster down. Mirror
   the same onto `mamba`.

**Why not mirror/replicate Vaultwarden onto `ubongo`?** It would make the control node
*run a service* (against its remit, and it adds attack surface), add DB-replication
complexity, and **still** require the Vaultwarden master password to read anything.
There is always exactly **one irreducible offline root secret**: make it the
Vaultwarden master password, and let `rbw`'s local cache make everything else
self-serve offline. (If full disaster access to *all* secrets, such as the router or
Proxmox UI credentials, is wanted, that same `rbw` cache already covers it; optionally
add a scheduled encrypted `rbw export` as extra insurance, after verifying that the
installed `rbw` version actually provides an `export` subcommand.)

> **To verify (ADR-014, security-relevant):** the "`rbw` decrypts its local cache fully
> offline" behaviour is the load-bearing assumption of the recovery model. Confirm it
> against `rbw`'s docs/version during implementation before relying on it.
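
One way to carry out that verification is the manual procedure sketched below: sync while online, lock the agent, cut the network, then confirm that unlock and a lookup still work. The function name `verify_rbw_offline` and the entry name passed to it are hypothetical, and the sketch assumes nothing about the outcome; the result is exactly what needs checking against the installed `rbw` version.

```shell
#!/bin/sh
# Sketch only: verify_rbw_offline and the entry name are hypothetical.
set -eu

verify_rbw_offline() {
  entry="$1"   # assumption: Vaultwarden entry holding the Ansible vault password

  # While still online: unlock and refresh the local encrypted cache.
  rbw unlock
  rbw sync

  # Lock again so the next unlock has to start from the master password.
  rbw lock

  echo "Disconnect the network now, then press Enter." >&2
  read -r reply || true

  # Offline: both commands must succeed for the recovery model to hold.
  rbw unlock
  if rbw get "$entry" >/dev/null; then
    echo "offline lookup OK: $entry"
  else
    echo "offline lookup FAILED: $entry" >&2
    return 1
  fi
}
```

The secret itself is sent to `/dev/null`, so only the pass/fail line is printed. The same check would need repeating on `mamba`, since it is meant to mirror the setup.
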

---

## Testing implication

All testing runs on `ubongo`:

- **Level 1 (Molecule):** Docker on `ubongo`.
- **Lint:** on `ubongo`.
- **Level 2 / 3** (staging deploy, external smoke): driven from `ubongo` as before.

A future **service-UI acceptance level** (Claude driving a headless browser against a
deployed service: load the UI, create test users, exercise features, hand the user a
manual test script) is anticipated. `ubongo` is *sized* for it now (Chromium +
Playwright headroom). The harness itself is a **separate spec** (see Deferred).

---

## Documentation changes

A new **ADR-015: Control / development / AI-worker host (`ubongo`)** is the home of
record. Other docs get small amendments that link to it:

| Doc | Change |
|---|---|
| ADR-015 (new) | Full record of this design. |
| ADR-001 (architecture) | Control node: "dedicated Debian 13 VM on the cluster" → "dedicated physical x86 machine *outside* the cluster (`ubongo`)". |
| ADR-005 (bootstrapping) | Control-node section: "clone the cloud-init template by hand" → "install Debian 13 on the physical box". |
| ADR-009 (provisioning handoff) | Strengthen the control-node exception: now genuinely physical/outside Terraform. |
| ADR-008 (testing) | "runs on the control node or in CI" → all levels run on `ubongo`; add a stub for the future service-UI acceptance level. |
| ADR-012 / `docs/hardware/reference.md` | Add `ubongo` to the node-capacity table (physical compute, though outside the cluster). |
| `docs/runbooks/new-host.md` | Update the control-node bootstrap procedure (bare-metal Debian install, not `qm clone`). |
| `docs/runbooks/rotate-secrets.md` | Add the offline vault-password break-glass requirement. |
| `docs/security/accepted-risks.md` | Reserve an entry for the mesh-VPN third-party coordinator, pending the VPN choice. |
| `STATUS.md` | Add a row: `ubongo`, *designed, not built*. |
| `CLAUDE.md` | One-line touch to the inventory/`control`-group description if needed. |

---

## Explicitly deferred (separate specs / discussions)

1. **Mesh VPN choice:** Tailscale vs NetBird, hosted vs self-hosted. Carries the
   recovery-dimension note above (hosted coordinator helps recovery; self-hosted must
   be off-cluster). Its own discussion.
2. **Browser-E2E verification harness:** Playwright/headless-Chromium driving live
   service UIs, test-user generation, screenshot-back-to-Claude, and the new ADR-008
   level. `ubongo` is sized for it now; the harness is designed later.
3. **`rbw` offline-cache verification:** a to-verify task during implementation
   (ADR-014), before relying on offline decryption.

---

## What was ruled out

| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start and recovery (rebuild tool lives inside the thing it rebuilds); fails always-on (dies with the cluster). |
| Laptop-only (`mamba` for everything) | Fails always-on (not always carried). Retained instead as the break-glass backup. |
| Split roles (control VM + thin jump box) | Two places to maintain the toolchain, control plane split in two, heavy local testing back on a cluster VM: more moving parts, less benefit. |
| Mirror/replicate Vaultwarden onto `ubongo` | Makes the control node run a service (against its remit), adds DB-replication complexity, and still needs the Vaultwarden master password. `rbw`'s local cache achieves offline decryption without it. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg problem the whole design escapes. If self-hosted, it lives off-cluster. |
| Raspberry Pi as the box | Could run Molecule on its own, but chokes running Docker + Chromium + toolchain together. x86 mini-PC instead. |