docs: record ubongo physical build (2026-06-11)

Move ubongo to 'Built (partial)' in STATUS; fill real M70q hardware specs
(i3-10100T, 16 GB, 256 GB SanDisk X600 SATA, no disk encryption). Record in
ADR-015 the dedicated claude AI-worker identity, LAN-SSH-only operational
reality, and the no-encryption decision; close the rbw offline-cache
recovery-verification item (ADR-015 + rotate-secrets). Add accepted-risk R5
(control-node disk unencrypted at rest) with its compensating controls.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
sjat 2026-06-11 10:32:26 +02:00
parent 7b5fd17e55
commit 349d10d65c
5 changed files with 37 additions and 16 deletions

View file

@ -5,7 +5,7 @@ This repo is partly aspirational: the ADRs in `docs/decisions/` describe the
truth. **Before relying on a role, provider, or pipeline existing, check here.**
If something is listed as "designed, not built", do not assume it works.
_Last reviewed: 2026-06-06._
_Last reviewed: 2026-06-11._
## Real and working today
@ -26,6 +26,7 @@ _Last reviewed: 2026-06-06._
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
| Service-role standard + per-service `SECURITY.md` convention | Defined (ADR-004 + `docs/security/service-security-template.md`); not yet applied — no service roles exist |
| Tag standard + enforcement (ADR-019) | Works — `tests/tags.yml` (closed vocabulary) + `scripts/check-tags.py` (run by `make lint`, unit-tested): enforces the tag vocabulary and that each role import in a play's `roles:` block carries its role-name tag. Governs mostly-unbuilt roles, but the linter is live now. Proxmox VM tag convention (`<env>`, group, `managed-by=terraform`) is in the Terraform HCL but unprovisioned. |
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **Pending:** NetBird mesh enrollment (so SSH is LAN-only); full `base` hardening (only the `firewall` concern exists, and it is NOT applied here — applying default-deny with no mesh would lock out inbound SSH on the physical NIC); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (no TF state yet). |
## Scaffolded but empty — NOT implemented
@ -53,7 +54,6 @@ So `make deploy PLAYBOOK=site` is still incomplete — `base` is only partially
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
| `ubongo` — physical control / AI-worker host | ADR-015 | **Design RESOLVED** (ADR-015 + spec + plan). Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. **Build pending:** box not yet acquired/installed, not in inventory. |
| NetBird mesh — coordinator on `askari` | ADR-016 | **Design RESOLVED** (ADR-016 + spec + plan); resolves ADR-015 deferred #1. Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. **Build pending:** not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in `base` | ADR-016 | **Design RESOLVED** (ADR-016). Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. **Build pending:** base role not built. |
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |

View file

@ -79,6 +79,22 @@ Manual, on bare metal:
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
denied on the physical NIC.
- **Operational reality (until the mesh exists):** the "SSH only on the mesh interface"
target above is the end state, not yet in force. Today remote access is **LAN SSH
only** — key-only, with password auth and root login disabled — until the NetBird mesh
(ADR-016) is stood up.
- **AI-worker identity:** `ubongo` runs the AI worker under a dedicated,
password-locked `claude` user (in the `docker` group for Molecule; **no local sudo**
boma deploys reach the fleet over SSH as the `ansible` user, not via local root). It is
reached via `sudo -iu claude` or its own SSH key. The rationale is **attribution +
revocation, not containment**: auditd/Loki (ADR-018) can separate human from agent
actions, and the account/key can be revoked without touching the operator's access.
(ADR-021 left the on-`ubongo` agent identity unspecified; this records it.)
- **Disk encryption:** `ubongo`'s SSD is **not encrypted at rest** — the SanDisk X600 is
TCG-Opal-capable but Opal is unused. This is an accepted risk recorded in
`docs/security/accepted-risks.md` (control-node disk not encrypted at rest),
compensated by physical security, a BIOS supervisor password, and disabled
external/USB boot.
### Recovery model
@ -100,8 +116,9 @@ Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
would make the control node run a service (against its remit) and still need that
master password.
> verified: rbw offline-cache decryption · TO VERIFY before relying on the recovery
> model · rbw docs · (ADR-014, security-relevant — confirm during build)
> verified: rbw offline-cache decryption · rbw 1.15.0 on ubongo · with the Vaultwarden
> host blocked, `rbw sync` failed but `rbw get` decrypted the cached vault offline ·
> 2026-06-11
## Consequences
@ -121,8 +138,9 @@ master password.
exploratory service-UI verification (`/verify-service`, ADR-008 Level 4), against
staging with test users in Authentik. Design + skill + standards complete; running
deferred on the stack.
3. **`rbw` offline-cache verification** — still open: confirm offline decryption before
relying on it (ADR-014).
3. **`rbw` offline-cache verification — RESOLVED (2026-06-11 build):** confirmed offline
cache decryption on rbw 1.15.0 — `rbw sync` fails with Vaultwarden unreachable while
`rbw get` still decrypts from the local cache (ADR-014).
## What was ruled out

View file

@ -19,12 +19,13 @@
- **Notes:** _warranty, quirks_
### ubongo (control node — outside the cluster)
- **Model / form factor:** _TBD (x86-64 mini-PC / USFF, e.g. N100 or refurb micro)_
- **CPU:** _TBD (target 4 cores, x86-64)_
- **RAM:** _TBD (target 16 GB)_
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
- **NICs:** _wired GbE_
- **Notes:** _always-on; control plane + AI-worker + local test runner (ADR-015); not a Proxmox guest_
- **Model / form factor:** Lenovo ThinkCentre M70q Tiny (machine type 11DUS7XP00); 1-litre tiny/USFF
- **CPU:** Intel Core i3-10100T — 4 cores / 8 threads, 35 W TDP
- **RAM:** 16 GB DDR4-3200 (2×8 GB SODIMM)
- **Storage:** 256 GB SanDisk X600 SATA 2.5" SSD (model SD9TB8W256G1001; TCG Opal-capable, Opal unused — no disk encryption)
- **NICs:** wired GbE, interface eno1, MAC 88:a4:c2:e0:ee:da
- **BIOS:** Lenovo M2WKT5AA (2023-06-20)
- **Notes:** always-on; control plane + AI-worker (dedicated `claude` user) + local test runner (Molecule/Docker) per ADR-015; not a Proxmox guest; remote access currently LAN SSH only (mesh deferred)
### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
@ -64,7 +65,7 @@ Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals.
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 1000 |
| ubongo | 4 | 16 | 250 |
| fisi | 4 | 16 | 8000 |
## 5. Capacity notes

View file

@ -46,8 +46,9 @@ entries it has already synced. The recovery design therefore requires:
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Keep it recoverable without the cluster.
> **To verify (ADR-014, security-relevant):** confirm `rbw` actually decrypts its
> local cache fully offline on your pinned `rbw` version before relying on this.
> **Verified (2026-06-11, ADR-014):** confirmed on `ubongo` with rbw 1.15.0 — with
> the Vaultwarden host unreachable, `rbw sync` fails but `rbw get boma-ansible-vault`
> still decrypts from the local cache. Re-verify after an `rbw` major-version bump.
---

View file

@ -17,8 +17,9 @@ revisit (trigger).
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
| R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |
_Last reviewed: 2026-06-06. The prior gaps (full CIS hardening, SELinux/AppArmor,
_Last reviewed: 2026-06-11. The prior gaps (full CIS hardening, SELinux/AppArmor,
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build