Add implementation plan for ubongo control host
Task-by-task docs plan: author ADR-015 and reconcile ADR-001/005/008/009/012, the new-host and rotate-secrets runbooks, accepted-risks, STATUS, and CLAUDE.md. Documentation-only; the physical box stays "designed, not built".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent c1b21c9b2b
commit 0e9f179bfc
1 changed file with 745 additions and 0 deletions
745
docs/superpowers/plans/2026-06-05-ubongo-control-host.md
Normal file
|
|
@@ -0,0 +1,745 @@
|
|||
# Ubongo Control / AI-Worker Host — Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Record the decision to replace the cluster-resident control VM with a dedicated always-on physical host (`ubongo`) outside the Proxmox cluster, by authoring ADR-015 and reconciling every doc that currently assumes the control node is a cluster VM.
|
||||
|
||||
**Architecture:** This is a **documentation-only** change. No code, no roles, no inventory data. `ubongo` is recorded as *designed, not built* (per STATUS.md discipline) — the physical box, its OS install, and its inventory wiring are a future manual build, not part of this plan. The work is: one new ADR (the home of record) plus targeted amendments to the ADRs/runbooks/registers that contradict it, each cross-linking ADR-015.
|
||||
|
||||
**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus manual internal-consistency checks. The toolchain has no markdown linter, so the "tests" are that the hooks pass and that greps confirm every cross-reference resolves.
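
The cross-reference check can also be scripted instead of grepped by hand. The following is a minimal sketch, not something that exists in the repo: the function name `unresolved_adr_refs` is hypothetical, and it assumes decision files are named `NNN-<slug>.md`, as the paths in this plan suggest.

```python
import re

# Matches citations such as "ADR-015" and captures the three-digit number.
ADR_REF = re.compile(r"\bADR-(\d{3})\b")


def unresolved_adr_refs(text: str, decision_files: list[str]) -> list[str]:
    """Return the ADR ids cited in `text` that have no matching decision file."""
    # Assumption: each decision file name starts with its three-digit ADR number.
    known = {name.split("/")[-1][:3] for name in decision_files}
    cited = sorted(set(ADR_REF.findall(text)))
    return [f"ADR-{number}" for number in cited if number not in known]


files = [
    "docs/decisions/001-architecture.md",
    "docs/decisions/015-control-host.md",
]
sample = "See ADR-001 and ADR-015; remote access stays inside ADR-002."
print(unresolved_adr_refs(sample, files))  # prints ['ADR-002']
```

An empty list means every cited ADR has a file. In practice `decision_files` would come from listing `docs/decisions/`, and `text` from each amended document.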
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight (read once before starting)
|
||||
|
||||
- **`rbw` must be unlocked before every commit.** The pre-commit ansible-lint hook decrypts `vault.yml`. Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`. Do not start a task you cannot commit.
|
||||
- **Commit style:** one commit per task, imperative subject ≤72 chars, with the trailer:
|
||||
```
|
||||
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||||
```
|
||||
- **Order matters:** Task 1 (ADR-015) must land first — every later task links to it.
|
||||
- **Spec reference:** `docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md`.
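
The commit-style rule above can be checked mechanically before committing. This is an illustrative sketch under stated assumptions: `commit_message_problems` is a hypothetical helper, not an existing hook, and it checks only the subject length and the trailer (imperative mood still needs a human eye).

```python
# The trailer text is copied from the pre-flight note above.
TRAILER = "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"


def commit_message_problems(message: str) -> list[str]:
    """Return a list of style problems; an empty list means the message passes."""
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    problems = []
    if not subject:
        problems.append("empty subject")
    if len(subject) > 72:
        problems.append(f"subject is {len(subject)} chars (limit 72)")
    # The trailer must appear on its own line somewhere after the subject.
    if TRAILER not in (line.strip() for line in lines[1:]):
        problems.append("missing Co-Authored-By trailer")
    return problems


message = "Add ADR-015 (control/AI-worker host ubongo)\n\n" + TRAILER
print(commit_message_problems(message))  # prints []
```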
|
||||
|
||||
---
|
||||
|
||||
## File map
|
||||
|
||||
| File | Action | Responsibility after change |
|
||||
|---|---|---|
|
||||
| `docs/decisions/015-control-host.md` | Create | Home of record for the `ubongo` decision |
|
||||
| `docs/decisions/001-architecture.md` | Modify | Control node = physical box outside cluster |
|
||||
| `docs/decisions/005-bootstrapping.md` | Modify | Control-node bootstrap = bare-metal Debian install |
|
||||
| `docs/decisions/009-provisioning-handoff.md` | Modify | Control-node exception is genuinely physical |
|
||||
| `docs/decisions/008-testing.md` | Modify | All test levels run on `ubongo`; stub future UI level |
|
||||
| `docs/decisions/012-hardware-capacity.md` | Modify | `ubongo` is in-scope physical compute |
|
||||
| `docs/hardware/reference.md` | Modify | `ubongo` row in node-capacity + physical-compute section |
|
||||
| `docs/runbooks/new-host.md` | Modify | Part E: control node is bare-metal, not `qm clone` |
|
||||
| `docs/runbooks/rotate-secrets.md` | Modify | Offline break-glass vault-password requirement |
|
||||
| `docs/security/accepted-risks.md` | Modify | Reserve mesh-VPN coordinator risk (pending VPN choice) |
|
||||
| `STATUS.md` | Modify | Row: `ubongo` — designed, not built |
|
||||
| `CLAUDE.md` | Modify | ADR-015 in Further reading; control-group note |
|
||||
|
||||
---
|
||||
|
||||
### Task 1: Author ADR-015 (the home of record)
|
||||
|
||||
**Files:**
|
||||
- Create: `docs/decisions/015-control-host.md`
|
||||
|
||||
- [ ] **Step 1: Create the ADR file**
|
||||
|
||||
Create `docs/decisions/015-control-host.md` with exactly this content:
|
||||
|
||||
```markdown
|
||||
# ADR-015 — Control / development / AI-worker host (`ubongo`)
|
||||
|
||||
## Context
|
||||
|
||||
Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
|
||||
as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
|
||||
documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
|
||||
the control node purely as a control-plane runner.
|
||||
|
||||
It fails four needs, all confirmed as drivers:
|
||||
|
||||
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
|
||||
something else creates it; the bootstrap is circular and awkward.
|
||||
2. **Always-on availability** — the operator wants to SSH in from a work PC or
|
||||
anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
|
||||
or being rebuilt.
|
||||
3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
|
||||
inside the thing it rebuilds.
|
||||
4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
|
||||
with production VM lifecycle.
|
||||
|
||||
A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
|
||||
and recovery. A small dedicated always-on physical machine outside the cluster
|
||||
satisfies all four.
|
||||
|
||||
## Decision
|
||||
|
||||
Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
|
||||
single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
|
||||
It becomes *the* control node and collapses four roles into one box:
|
||||
|
||||
- Terraform + Ansible runner (control plane)
|
||||
- Claude Code / AI-worker host the operator SSHes into
|
||||
- Local test runner (Molecule/Docker, lint, and later a browser stack)
|
||||
- Persistent dev home for the repo
|
||||
|
||||
There is **no longer a control VM on the cluster.** The `control` inventory group
|
||||
points at this physical box. This *strengthens* the ADR-009 control-node exception:
|
||||
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
|
||||
Every other host stays a Terraform-managed VM exactly as designed.
|
||||
|
||||
`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a hypervisor
|
||||
and runs no `docker_host` services.
|
||||
|
||||
### Hardware target
|
||||
|
||||
| Spec | Target | Why |
|
||||
|---|---|---|
|
||||
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
|
||||
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
|
||||
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
|
||||
| Network | Wired GbE | Always-on reliability over Wi-Fi |
|
||||
| Power | Low draw (≤15 W idle) | Runs 24/7 |
|
||||
|
||||
Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250).
|
||||
Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
|
||||
is **all testing being local** — Molecule (Docker), lint, and a future
|
||||
headless-Chromium/Playwright stack.
|
||||
|
||||
### Provisioning (bootstrap path)
|
||||
|
||||
Manual, on bare metal:
|
||||
|
||||
1. Install Debian 13 on the box (one-time, by hand).
|
||||
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
|
||||
3. Join the mesh VPN (choice deferred — see below).
|
||||
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
|
||||
baseline config via the `control` group (`base` role only).
|
||||
|
||||
### Access & security
|
||||
|
||||
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
|
||||
mesh; nothing is published to the public internet — this stays inside ADR-002.
|
||||
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
|
||||
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
|
||||
denied on the physical NIC.
|
||||
|
||||
### Recovery model
|
||||
|
||||
`ubongo` is the rebuild tool, so three things must survive a full cluster loss:
|
||||
|
||||
1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
|
||||
able to drive the fleet if `ubongo` dies.
|
||||
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
|
||||
`mamba`). For a 2–5 VM fleet it is also reconstructable via `terraform import`.
|
||||
3. **Vault password** — `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
|
||||
local encrypted copy of the vault and decrypts it offline with the operator's
|
||||
Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
|
||||
whole cluster down — provided `rbw` has synced once and the operator keeps the
|
||||
Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
|
||||
`mamba`.
|
||||
|
||||
There is always exactly one irreducible offline root secret; here it is the
|
||||
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
|
||||
would make the control node run a service (against its remit) and still need that
|
||||
master password.
|
||||
|
||||
> verified: rbw offline-cache decryption · TO VERIFY before relying on the recovery
|
||||
> model · rbw docs · (ADR-014, security-relevant — confirm during build)
|
||||
|
||||
## Consequences
|
||||
|
||||
- The control node is physical compute outside the cluster, so it appears in
|
||||
`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
|
||||
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
|
||||
- A future **service-UI acceptance** testing level (Claude driving a headless browser
|
||||
against a deployed service) is anticipated; `ubongo` is sized for it. The harness
|
||||
is a separate spec.
|
||||
|
||||
## Deferred (separate specs / discussions)
|
||||
|
||||
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
|
||||
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
|
||||
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
|
||||
or it recreates the chicken-and-egg.
|
||||
2. **Browser-E2E verification harness** — Playwright/headless-Chromium, test-user
|
||||
generation, screenshot-back-to-Claude, and the new ADR-008 level.
|
||||
3. **`rbw` offline-cache verification** — confirm offline decryption before relying
|
||||
on it (ADR-014).
|
||||
|
||||
## What was ruled out
|
||||
|
||||
| Option | Reason |
|
||||
|---|---|
|
||||
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
|
||||
| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
|
||||
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
|
||||
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
|
||||
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
|
||||
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
|
||||
|
||||
See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
|
||||
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Confirm `rbw` is unlocked, then verify hooks pass**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md`
|
||||
Expected: `rbw` exits 0; hooks report `Passed`/`Skipped` (ansible-lint skips non-YAML; trailing-whitespace + end-of-file Passed).
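
Every task repeats the same `rbw unlocked && pre-commit run --files …` gate, so it can be wrapped once. The sketch below is a hypothetical helper (`verify_files` and `run_command` are illustrative names, not part of the repo); it assumes both `rbw` and `pre-commit` are on `PATH` and relies only on their exit statuses.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the command's exit status.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def verify_files(paths: Sequence[str], run: Runner = run_command) -> bool:
    """Mirror `rbw unlocked && pre-commit run --files <paths>`."""
    if run(["rbw", "unlocked"]) != 0:
        # Same rule as the pre-flight note: stop and ask for `rbw unlock`.
        print("rbw is locked: ask the operator to run `rbw unlock`")
        return False
    return run(["pre-commit", "run", "--files", *paths]) == 0
```

Calling `verify_files(["docs/decisions/015-control-host.md"])` returns `True` only when `rbw` is unlocked and the hooks exit 0. The `run` parameter exists so the gate can be exercised with a fake runner, without touching the real vault.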
|
||||
|
||||
- [ ] **Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/decisions/015-control-host.md
|
||||
git commit -m "Add ADR-015 (control/AI-worker host ubongo)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Amend ADR-001 (architecture)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/001-architecture.md`
|
||||
|
||||
- [ ] **Step 1: Update the control-node bullet**
|
||||
|
||||
Find (lines ~13–15):
|
||||
```markdown
|
||||
- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
|
||||
The control node is the one host that cannot fully bootstrap itself from scratch
|
||||
and requires manual initial setup (see `docs/runbooks/new-host.md`).
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
- **Control node**: `ubongo` — a dedicated always-on **physical** x86-64 machine
|
||||
**outside** the cluster. Ansible runs from here. It cannot be created by the
|
||||
Terraform it hosts, so it is provisioned manually (see ADR-015 and
|
||||
`docs/runbooks/new-host.md`).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the VM-existence table row**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; `ubongo` (control node) is a physical box outside the cluster, the one manual exception (see ADR-009/ADR-015) |
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Update the `control` host-group comment**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
├── control # the control node itself — baseline config only, runs no services
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
├── control # ubongo — physical control node outside the cluster; baseline config only, runs no services
|
||||
```
|
||||
|
||||
- [ ] **Step 4: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/001-architecture.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/001-architecture.md
|
||||
git commit -m "ADR-001: control node is physical ubongo outside cluster"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Amend ADR-005 (bootstrapping)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/005-bootstrapping.md`
|
||||
|
||||
- [ ] **Step 1: Replace the "Control node bootstrapping" section body**
|
||||
|
||||
Find the start of `## Control node bootstrapping` (lines ~52–69): the intro paragraph and the first two items of its numbered list, exactly as shown here:
|
||||
```markdown
|
||||
The control node is a special case — it runs Terraform and Ansible, so it cannot
|
||||
be created by the Terraform it hosts (chicken-and-egg). It is the one documented
|
||||
exception to Terraform-owned VM existence (see ADR-009). The control node requires:
|
||||
|
||||
1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
|
||||
`qm clone`), since Terraform is not yet available to do it
|
||||
2. Manual setup of the Ansible environment:
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
The control node is a special case — it runs Terraform and Ansible, so it cannot
|
||||
be created by the Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated
|
||||
**physical** machine outside the cluster, and the one documented exception to
|
||||
Terraform-owned VM existence (see ADR-009 and ADR-015). The control node requires:
|
||||
|
||||
1. Manual OS provisioning — install Debian 13 on the physical box by hand (it is not
|
||||
a Proxmox guest, so there is no template to clone)
|
||||
2. Manual setup of the Ansible environment:
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the trailing reference to the control node listing**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
The control node itself is listed in `inventories/production/hosts.yml` under
|
||||
a `control` group and can be managed for baseline config (SSH, firewall, updates)
|
||||
but not for the `docker_host` role (it does not run services).
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
`ubongo` is listed in `inventories/production/hosts.yml` under the `control` group
|
||||
and can be managed for baseline config (SSH, firewall, updates) but not for the
|
||||
`docker_host` role (it does not run services). Hardware target and recovery model
|
||||
are in ADR-015.
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/005-bootstrapping.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/005-bootstrapping.md
|
||||
git commit -m "ADR-005: control node bootstrap is bare-metal Debian on ubongo"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Amend ADR-009 (provisioning handoff)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/009-provisioning-handoff.md`
|
||||
|
||||
- [ ] **Step 1: Strengthen the control-node exception section**
|
||||
|
||||
Find (under `## The control-node exception`, lines ~129–138):
|
||||
```markdown
|
||||
The control node — the host that runs Terraform and Ansible — is the one VM
|
||||
Terraform does **not** create. It cannot provision the infrastructure that would
|
||||
provision itself (chicken-and-egg). It is therefore the single documented exception
|
||||
to "Terraform owns VM existence":
|
||||
|
||||
- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
|
||||
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
|
||||
Ansible for baseline config only (no `docker_host` role).
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
The control node — the host that runs Terraform and Ansible — is `ubongo`, a
|
||||
dedicated **physical** machine outside the cluster. It is not a VM at all, so
|
||||
Terraform genuinely never touches it: it cannot provision the infrastructure that
|
||||
would provision itself (chicken-and-egg). It is therefore the single documented
|
||||
exception to "Terraform owns VM existence":
|
||||
|
||||
- Provisioned and bootstrapped manually on bare metal, per the control-node section
|
||||
of ADR-005; rationale, hardware, and recovery model in ADR-015.
|
||||
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
|
||||
Ansible for baseline config only (no `docker_host` role).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/009-provisioning-handoff.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/009-provisioning-handoff.md
|
||||
git commit -m "ADR-009: control-node exception is a physical box, not a VM"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Amend ADR-008 (testing)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/008-testing.md`
|
||||
|
||||
- [ ] **Step 1: Make Level 1 say it runs on `ubongo`**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
Runs in Docker on the control node or in CI. Fast (~5 min per role).
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
Runs in Docker on the control node (`ubongo`) or in CI. Fast (~5 min per role).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Add a future service-UI acceptance level stub**
|
||||
|
||||
Find (the end of `### Level 3 — External smoke test from askari`, lines ~51–55):
|
||||
```markdown
|
||||
### Level 3 — External smoke test from askari
|
||||
|
||||
Once `askari` is operational: scripted checks from outside the network confirming
|
||||
that public-facing services respond correctly. Catches firewall and reverse proxy
|
||||
configuration issues invisible to Ansible check mode.
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
### Level 3 — External smoke test from askari
|
||||
|
||||
Once `askari` is operational: scripted checks from outside the network confirming
|
||||
that public-facing services respond correctly. Catches firewall and reverse proxy
|
||||
configuration issues invisible to Ansible check mode.
|
||||
|
||||
### Level 4 — Service-UI acceptance (planned, not built)
|
||||
|
||||
Claude drives a headless browser from `ubongo` against a *deployed* service: loads
|
||||
the rendered UI, creates test users, exercises features, and hands the operator a
|
||||
manual test script for the rest. Catches application-level regressions that no lower
|
||||
level sees. The harness (Playwright/headless-Chromium, screenshot-back-to-Claude) is
|
||||
a **separate spec**; `ubongo` is sized for it (ADR-015). Status: designed, not built
|
||||
(STATUS.md).
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/008-testing.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/008-testing.md
|
||||
git commit -m "ADR-008: tests run on ubongo; stub Level 4 service-UI acceptance"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Amend ADR-012 and the hardware reference
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/012-hardware-capacity.md`
|
||||
- Modify: `docs/hardware/reference.md`
|
||||
|
||||
- [ ] **Step 1: Note `ubongo` as in-scope physical compute in ADR-012**
|
||||
|
||||
In `docs/decisions/012-hardware-capacity.md`, find the first bullet under `## Decision`:
|
||||
```markdown
|
||||
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
|
||||
physical compute + network gear and workload placement intent. Two
|
||||
machine-readable tables (node capacity, workload placement) carry the numbers.
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
|
||||
physical compute + network gear and workload placement intent. Two
|
||||
machine-readable tables (node capacity, workload placement) carry the numbers.
|
||||
This includes `ubongo`, the physical control node (ADR-015), even though it sits
|
||||
outside the Proxmox cluster.
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Add `ubongo` to the physical-compute section of the reference**
|
||||
|
||||
In `docs/hardware/reference.md`, find:
|
||||
```markdown
|
||||
_(repeat for pve1, pve2, askari)_
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
### ubongo (control node — outside the cluster)
|
||||
- **Model / form factor:** _TBD (x86-64 mini-PC / USFF, e.g. N100 or refurb micro)_
|
||||
- **CPU:** _TBD (target 4 cores, x86-64)_
|
||||
- **RAM:** _TBD (target 16 GB)_
|
||||
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
|
||||
- **NICs:** _wired GbE_
|
||||
- **Notes:** _always-on; control plane + AI-worker + local test runner (ADR-015); not a Proxmox guest_
|
||||
|
||||
_(repeat for pve1, pve2, askari)_
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Add `ubongo` to the machine-readable node-capacity table**
|
||||
|
||||
In `docs/hardware/reference.md`, find the node-capacity table:
|
||||
```markdown
|
||||
| node | cores | ram_gb | disk_gb |
|
||||
|------|-------|--------|---------|
|
||||
| pve0 | 20 | 64 | 4000 |
|
||||
| pve1 | 20 | 64 | 4000 |
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
| node | cores | ram_gb | disk_gb |
|
||||
|------|-------|--------|---------|
|
||||
| pve0 | 20 | 64 | 4000 |
|
||||
| pve1 | 20 | 64 | 4000 |
|
||||
| ubongo | 4 | 16 | 250 |
|
||||
```
|
||||
|
||||
Note: the header row (`node | cores | ram_gb | disk_gb`) is a parser contract for
|
||||
`scripts/capacity-scan.py` — only a data row is added, the header is untouched.
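
To make the "parser contract" concrete, here is a minimal sketch of a header-checked table parser. It is an assumption-laden illustration and **not** the actual `scripts/capacity-scan.py`: the function name `parse_capacity_table` and the returned shape are hypothetical. It shows why adding a data row is safe while renaming a header column is not.

```python
EXPECTED_HEADER = ["node", "cores", "ram_gb", "disk_gb"]


def parse_capacity_table(markdown: str) -> dict[str, dict[str, int]]:
    """Parse a node-capacity table, failing loudly if the header changed."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.strip().startswith("|")
    ]
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {rows[0] if rows else None}")
    table = {}
    for cells in rows[2:]:  # rows[1] is the |---| separator row
        node, *numbers = cells
        table[node] = dict(zip(EXPECTED_HEADER[1:], map(int, numbers)))
    return table


TABLE = """
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 250 |
"""
print(parse_capacity_table(TABLE)["ubongo"])
# prints {'cores': 4, 'ram_gb': 16, 'disk_gb': 250}
```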
|
||||
|
||||
- [ ] **Step 4: Verify the capacity scan still parses, hooks pass, then commit**
|
||||
|
||||
Run: `python3 scripts/capacity-scan.py 2>&1 | head -c 400`
|
||||
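Expected: the output shows no traceback or parse error and reflects the new `ubongo` row. Two caveats: `head -c 400` truncates the output, so the row may fall outside the first 400 bytes, and the pipeline reports `head`'s exit status rather than the script's. If in doubt, rerun without `| head -c 400`; a clean exit with JSON is success. If the script needs an argument or env, consult its `--help`.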
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md docs/hardware/reference.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/decisions/012-hardware-capacity.md docs/hardware/reference.md
|
||||
git commit -m "ADR-012/hardware: add ubongo as physical control node"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Update the new-host runbook (Part E)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/runbooks/new-host.md`
|
||||
|
||||
- [ ] **Step 1: Replace Part E with the bare-metal control-node procedure**
|
||||
|
||||
Find the whole `## Part E — Control node (manual exception)` section (lines ~113–133), from the heading through the paragraph ending "every other host comes from `make tf-inventory`." Replace it with:
|
||||
```markdown
|
||||
## Part E — Control node (`ubongo`, manual exception)
|
||||
|
||||
The control node runs Terraform and Ansible, so it cannot be created by the
|
||||
Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
|
||||
machine outside the cluster — not a Proxmox guest. It is the **one** host
|
||||
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
|
||||
|
||||
1. Install Debian 13 on the physical box by hand (no template to clone).
|
||||
2. Create the `ansible` user and install its SSH public key.
|
||||
3. Set up the Ansible environment on it:
|
||||
```bash
|
||||
git clone <repo> ~/ansible
|
||||
cd ~/ansible
|
||||
make setup # venv + Python deps
|
||||
make collections # Ansible collections
|
||||
rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
|
||||
```
|
||||
4. Join the mesh VPN (choice deferred — see ADR-015) so it is reachable over SSH
|
||||
from elsewhere.
|
||||
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.
|
||||
|
||||
Because `ubongo` is not in `local.vms`, this is the only case where editing
|
||||
`hosts.yml` by hand is expected. **Known limitation:** `make tf-inventory`
|
||||
regenerates `hosts.yml` from Terraform outputs and will overwrite a hand-added
|
||||
`control` entry — re-add `ubongo` after running it (preserving the control entry in
|
||||
the generator is tracked separately, not yet built).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the Prerequisites note that assumes a template**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not).
|
||||
Not needed for the control node `ubongo`, which is bare-metal (Part E).
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/runbooks/new-host.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/runbooks/new-host.md
|
||||
git commit -m "new-host runbook: control node ubongo is bare-metal"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 8: Update the rotate-secrets runbook (offline break-glass)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/runbooks/rotate-secrets.md`
|
||||
|
||||
- [ ] **Step 1: Add a break-glass section after the `rbw` setup section**
|
||||
|
||||
Find the end of the ``## One-time — `rbw` setup on a new machine`` section:
|
||||
```markdown
|
||||
Once unlocked, `make encrypt/decrypt/check/deploy` and the pre-commit ansible-lint
|
||||
hook all obtain the password automatically. If the agent is locked you'll see a
|
||||
clear "run: rbw unlock" error rather than a hang.
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
Once unlocked, `make encrypt/decrypt/check/deploy` and the pre-commit ansible-lint
|
||||
hook all obtain the password automatically. If the agent is locked you'll see a
|
||||
clear "run: rbw unlock" error rather than a hang.
|
||||
|
||||
---
|
||||
|
||||
## Break-glass — vault access during a full cluster outage
|
||||
|
||||
The control node `ubongo` (ADR-015) is the tool used to rebuild the cluster, so it
|
||||
must be able to decrypt the vault even when Vaultwarden (if hosted on the cluster)
|
||||
is down. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
|
||||
it **offline** with your Vaultwarden master password — no live server needed for
|
||||
entries it has already synced. The recovery design therefore requires:
|
||||
|
||||
- `rbw` on `ubongo` (and on `mamba`, the break-glass laptop) has **synced at least
|
||||
once** while Vaultwarden was reachable (`rbw sync`).
|
||||
- Your **Vaultwarden master password** is kept **offline** (memorised and on paper in a
safe, as ADR-015 requires), independent of any cluster-hosted Vaultwarden.
|
||||
|
||||
There is always exactly one irreducible offline root secret; here it is the
|
||||
Vaultwarden master password. Keep it recoverable without the cluster.
|
||||
|
||||
> **To verify (ADR-014, security-relevant):** confirm `rbw` actually decrypts its
|
||||
> local cache fully offline on your pinned `rbw` version before relying on this.
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/runbooks/rotate-secrets.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/runbooks/rotate-secrets.md
|
||||
git commit -m "rotate-secrets: document offline vault break-glass for ubongo"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 9: Reserve the mesh-VPN accepted-risk entry
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/security/accepted-risks.md`
|
||||
|
||||
- [ ] **Step 1: Add R3 to the risk table**
|
||||
|
||||
Find the table row for R2:
|
||||
```markdown
|
||||
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
|
||||
```
|
||||
Add immediately **after** it:
|
||||
```markdown
|
||||
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the "Last reviewed" footer date**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
_Last reviewed: 2026-06-04. The prior gaps
|
||||
```
|
||||
Replace `2026-06-04` with `2026-06-05` (only the date changes; leave the rest of the sentence intact):
|
||||
```markdown
|
||||
_Last reviewed: 2026-06-05. The prior gaps
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add docs/security/accepted-risks.md
|
||||
git commit -m "accepted-risks: reserve R3 mesh-VPN coordinator (pending choice)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 10: Add the `ubongo` row to STATUS.md
|
||||
|
||||
**Files:**
|
||||
- Modify: `STATUS.md`
|
||||
|
||||
- [ ] **Step 1: Add a row to the "Designed but not built" table**
|
||||
|
||||
Find the last row of the `## Designed but not built` table:
|
||||
```markdown
|
||||
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
|
||||
```
|
||||
Add immediately **after** it:
|
||||
```markdown
|
||||
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files STATUS.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add STATUS.md
|
||||
git commit -m "STATUS: record ubongo control host as designed, not built"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 11: Update CLAUDE.md (index + control-group note)
|
||||
|
||||
**Files:**
|
||||
- Modify: `CLAUDE.md`
|
||||
|
||||
- [ ] **Step 1: Add ADR-015 to the Further reading table**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
|
||||
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Update the control-group parenthetical in the Inventory structure section**
|
||||
|
||||
Find:
|
||||
```markdown
|
||||
(`control` holds the one manually-provisioned control node — see ADR-009.)
|
||||
```
|
||||
Replace with:
|
||||
```markdown
|
||||
(`control` holds `ubongo`, the one manually-provisioned **physical** control node
|
||||
outside the cluster — see ADR-009 and ADR-015.)
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
|
||||
Expected: hooks `Passed`/`Skipped`.
|
||||
```bash
|
||||
git add CLAUDE.md
|
||||
git commit -m "CLAUDE.md: link ADR-015; note ubongo as physical control node"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 12: Final consistency sweep
|
||||
|
||||
**Files:** none modified (verification only)
|
||||
|
||||
- [ ] **Step 1: Confirm no doc still calls the control node a VM**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
grep -rniE "control node.*(VM|virtual)|dedicated Debian 13 VM" docs/ CLAUDE.md STATUS.md
|
||||
```
|
||||
Expected: no hit that *asserts* the control node is a VM. (Hits inside ADR-015's "What was ruled out" table that describe the rejected option are fine.) If any other doc still frames the control node as a VM, fix it the same way as the relevant task above and amend that task's commit.
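To cut the noise from the expected ADR-015 hits, the same search can be wrapped so that file is filtered out. A minimal sketch; `vm_mentions` is a hypothetical helper name, and the remaining hits still need a human read to decide whether they *assert* the VM framing:

```bash
# Hypothetical helper: list lines that still describe the control node as a VM,
# skipping ADR-015 itself (its "What was ruled out" table names the rejected option).
# Usage: vm_mentions PATH...
vm_mentions() {
  # -H forces the "file:" prefix even for a single file operand,
  # so the filename filter below always has something to match on.
  grep -rniHE "control node.*(VM|virtual)|dedicated Debian 13 VM" "$@" \
    | grep -v "015-control-host\.md:" || true
}
```

Run it as `vm_mentions docs/ CLAUDE.md STATUS.md`. Empty output means no doc outside ADR-015 matches the pattern.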
|
||||
|
||||
- [ ] **Step 2: Confirm every ADR-015 cross-link resolves**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
grep -rlE "ADR-015|015-control-host" docs/ CLAUDE.md STATUS.md
|
||||
test -f docs/decisions/015-control-host.md && echo "ADR-015 present"
|
||||
```
|
||||
Expected: `ADR-015 present` is printed, and the `grep` output lists every referencing doc (001, 005, 008, 009, 012, runbooks, accepted-risks, STATUS, CLAUDE.md).
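Reading the `grep -rl` output by eye makes it easy to miss an absent file. A minimal sketch that inverts the check and prints only the files lacking a reference; `missing_adr015_refs` is a hypothetical helper name, and the caller supplies the file list:

```bash
# Hypothetical helper: print every given file that does NOT mention ADR-015.
# Returns non-zero if at least one file lacks the reference (or cannot be read).
# Usage: missing_adr015_refs FILE...
missing_adr015_refs() {
  local f rc=0
  for f in "$@"; do
    if ! grep -qE "ADR-015|015-control-host" "$f"; then
      echo "$f"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `missing_adr015_refs STATUS.md CLAUDE.md docs/security/accepted-risks.md` covers the three paths named in Tasks 9 to 11. Add the amended ADR and runbook paths to the argument list for full coverage.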
|
||||
|
||||
- [ ] **Step 3: Full hook run**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --all-files`
|
||||
Expected: all hooks `Passed`/`Skipped`. Fix anything that fails (most likely trailing whitespace or end-of-file) and amend the owning commit.
|
||||
|
||||
- [ ] **Step 4: Push (only if the user asks)**
|
||||
|
||||
Per CLAUDE.md, pushing to `origin` is the off-machine backup. If the user wants it pushed:
|
||||
```bash
|
||||
git push origin main
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review notes (author)
|
||||
|
||||
- **Spec coverage:** every spec section maps to a task — host decision/hardware/bootstrap/access/recovery → Task 1 (ADR-015); the doc-changes table → Tasks 2–11; testing implication → Task 5; deferrals are recorded in ADR-015 and not implemented here (correct — they are separate specs). ✓
|
||||
- **Not in scope (intentional):** acquiring/installing the box, mesh-VPN selection, the browser harness, adding `ubongo` to live inventory, and modifying `tf_to_inventory.py` to preserve the control entry (logged as a known limitation in Task 7). ✓
|
||||
- **No placeholders:** every edit shows exact find/replace text; the only `_TBD_` strings are deliberate hardware-reference skeleton fields matching that file's existing style. ✓
|
||||
```
|
||||