diff --git a/docs/superpowers/plans/2026-06-09-operational-access.md b/docs/superpowers/plans/2026-06-09-operational-access.md
new file mode 100644
index 0000000..0ae433b
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-09-operational-access.md
@@ -0,0 +1,544 @@
+# Operational Access (ADR-021) Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Establish operational access as a deployment deliverable — a documented, verifiable set of mesh-reachable troubleshooting paths for every host and service — by writing ADR-021, reconciling the latent ADR-016/020 SSH contradiction, adding the control-node SSH source to the host firewall, and wiring the `ACCESS.md` record + `/check-access` verifier into boma's governance.
+
+**Architecture:** Source of truth is the committed design spec `docs/superpowers/specs/2026-06-09-operational-access-design.md`. Structured access facts live as declarative `access__*` data that renders `ACCESS.md` and drives `/check-access` (the access analogue of `VERIFY.md` + `/verify-service`). Work is split into **Tranche A — land now** (doctrine docs, the one firewall code change, the dormant `/check-access` command, governance wiring) and **Tranche B — build-pending on infra** (per-service `access__*` population, rendered `ACCESS.md` files, and `/check-access` *running*), which arrive with service roles and live hosts and require no action in this plan.
+
+**Tech Stack:** Markdown ADRs/docs; Ansible role `base` (Jinja2 nftables template + `defaults/main.yml`); Molecule (Debian 13, render + `nft -c`, no apply) for the firewall test; Claude Code command file for `/check-access`.
+
+---
+
+## File structure
+
+| File | Tranche | Responsibility |
+|---|---|---|
+| `docs/decisions/021-operational-access.md` | A | NEW — the doctrine (two layers, three-tier ladder, break-glass, `access__*` model, `/check-access`) |
+| `docs/decisions/016-mesh-vpn.md` | A | MODIFY — reconcile: SSH on `wt0` **and** from `ubongo`'s LAN address |
+| `docs/decisions/020-firewall.md` | A | MODIFY — guaranteed management plane gains the control-node SSH source |
+| `docs/access/service-access-template.md` | A | NEW — the `ACCESS.md` record shape (rendered-from-data + prose tail) |
+| `roles/base/defaults/main.yml` | A | MODIFY — add `base__firewall_control_addr` knob (default empty → no-op) |
+| `roles/base/templates/nftables.conf.j2` | A | MODIFY — conditional management-plane SSH rule for the control address |
+| `roles/base/molecule/default/converge.yml` | A | MODIFY — set the knob for the test |
+| `roles/base/molecule/default/verify.yml` | A | MODIFY — assert the rendered rule |
+| `.claude/commands/check-access.md` | A | NEW — the `/check-access` verifier command (dormant until infra exists) |
+| `docs/security/service-checklist.md` | A | MODIFY — one new gate item |
+| `docs/runbooks/new-role.md` | A | MODIFY — new step: write `ACCESS.md` (mirrors SECURITY/VERIFY steps) |
+| `CLAUDE.md` | A | MODIFY — `ACCESS.md` in Role conventions; ADR-021 in Further reading |
+| `STATUS.md` | A | MODIFY — new rows for the doctrine, the firewall source, `/check-access` |
+| `docs/TODO.md` | A | MODIFY — mark 3.2 + 7.2 DECIDED → ADR-021 |
+
+**Tranche B (no tasks here — captured for the record):** per-service `access__*` blocks + rendered `roles//ACCESS.md` land when each service role is built (governed by the Tranche-A checklist + runbook); `/check-access` *running* lands when `ubongo` + staging + vault exist. Both are designed-now, build-pending — exactly like `/verify-service` under ADR-017.
+
+---
+
+## Tranche A — Land now
+
+### Task 1: Write ADR-021
+
+**Files:**
+- Create: `docs/decisions/021-operational-access.md`
+
+The ADR is the durable decision record derived from the committed spec
+`docs/superpowers/specs/2026-06-09-operational-access-design.md`. Match the prose style and
+heading shape of an existing ADR (read `docs/decisions/020-firewall.md` first). The ADR
+**must** state these specifics — they are the parts easy to get wrong:
+
+- **Doctrine sentence (verbatim):** *"Every host and every service guarantees at least one
+  documented, verifiable way in for operational troubleshooting — and the deploy that
+  creates it also records and proves it."*
+- **Two layers:** host baseline (resolves TODO 7.2) + per-service record (resolves TODO 3.2).
+- **Three-tier access ladder:** (1) `wt0` mesh SSH — primary, WireGuard-authenticated;
+  (2) LAN SSH from `ubongo` only — secondary, mesh-independent, source-IP-gated **plus**
+  keys-only + fail2ban; all other LAN hosts stay default-denied; (3) console — break-glass
+  per host class: cluster VMs → Proxmox serial/VNC console, `askari` → Hetzner
+  rescue/console, `ubongo` → local console; reachability-checked, never exercised.
+- **Reconciliation, not weakening (state this explicitly):** ADR-016 already requires
+  Ansible to reach the fleet by LAN IP ("a mesh/coordinator outage never blocks on-LAN
+  runs"), which *requires* LAN SSH from `ubongo`; yet ADR-016 also said "SSH only on `wt0`"
+  and ADR-020's guaranteed management plane listed only `wt0`. ADR-021 resolves that latent
+  contradiction by making the control-node SSH allow explicit and adding it to the
+  guaranteed management plane. It does **not** weaken default-deny: exactly one extra
+  trusted source on the LAN.
+- **Declarative `access__*` data model:** service-role defaults carry `access__service`,
+  `access__compose_project`, `access__compose_path`, `access__containers`,
+  `access__log.loki_labels`, and `access__api` (`enabled`, `base_url`, `firewall_ref`,
+  `auth.vault_ref`, `health_path`; or `enabled: false` + `reason`). **Invariant:**
+  `access__api` never opens a port — it `firewall_ref`s the `group_vars` firewall catalog;
+  ADR-020 stays the sole owner of exposure.
+- **Rendered record:** `ACCESS.md` is rendered from that data + a prose tail (operational
+  notes / gotchas). First-class sibling of `SECURITY.md`/`VERIFY.md`.
+- **`/check-access`:** the verifier that probes each declared path and reports which are
+  live; break-glass reachability-only; designed now, build-pending on infra.
+- **Status / consequences:** what lands now vs build-pending (mirror this plan's split).
+
+- [ ] **Step 1: Author the ADR**
+
+Write `docs/decisions/021-operational-access.md` covering every bullet above, in the
+house style of `docs/decisions/020-firewall.md` (problem → decision → layers/ladder →
+data model → verifier → consequences). Open with a one-line title heading
+`# ADR-021 — Operational access: documented, verifiable ways in`.
+
+- [ ] **Step 2: Sanity-check internal links**
+
+Run: `grep -n "ADR-01[67]\|ADR-020\|access__\|check-access\|ACCESS.md" docs/decisions/021-operational-access.md`
+Expected: references to ADR-016, ADR-020, the `access__*` keys, `/check-access`, and
+`ACCESS.md` all present.
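+
+For orientation while authoring the data-model section, the `access__*` keys listed above
+might be laid out as in the sketch below. Only the key names come from this plan; the
+service name, paths, label, catalog entry, and vault reference are hypothetical
+placeholders, and the exact nesting is an assumption until the ADR fixes it.
+
+```yaml
+# Sketch of service-role defaults (hypothetical values throughout).
+access__service: photoprism
+access__compose_project: photoprism
+access__compose_path: /opt/photoprism/compose.yaml   # assumed to be the compose file path
+access__containers:
+  - photoprism
+access__log:
+  loki_labels:
+    service: photoprism                  # hypothetical label set
+access__api:
+  enabled: true
+  base_url: "https://photoprism.example.internal"   # placeholder URL
+  firewall_ref: photoprism-admin         # hypothetical group_vars catalog entry; opens no port itself
+  auth:
+    vault_ref: photoprism/admin-token    # hypothetical vault item
+  health_path: /healthz                  # placeholder path
+```
+
+Keeping `firewall_ref` as a pointer rather than a port definition is what preserves the
+invariant that ADR-020 stays the sole owner of exposure.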
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/decisions/021-operational-access.md
+git commit -m "docs(access): add ADR-021 operational-access doctrine"
+```
+
+---
+
+### Task 2: Reconcile ADR-016 and ADR-020
+
+**Files:**
+- Modify: `docs/decisions/016-mesh-vpn.md` (the "Host firewall" bullet, ~line 64-65)
+- Modify: `docs/decisions/020-firewall.md` (the "Guaranteed management plane" bullet, ~line 42-45)
+
+- [ ] **Step 1: Amend ADR-016's Host-firewall bullet**
+
+Replace the existing bullet:
+
+```markdown
+- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
+  **only on `wt0`** (the ADR-015 pattern, fleet-wide).
+```
+
+with:
+
+```markdown
+- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
+  (primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
+  mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
+  outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
+  explicit the control-node SSH allow that the recovery model already implied; the access
+  doctrine and the three-tier access ladder live in **ADR-021**.
+```
+
+- [ ] **Step 2: Amend ADR-020's guaranteed-management-plane bullet**
+
+Replace:
+
+```markdown
+- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
+  ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
+  atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
+  is allowed only on `wt0`.)
+```
+
+with:
+
+```markdown
+- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
+  ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
+  the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
+  catalog, applied atomically — a malformed or empty catalog can never lock out
+  management. The control-node source is part of the guaranteed plane, not the service
+  catalog (it is management, not a service); see ADR-021 for the access doctrine.
+```
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/decisions/016-mesh-vpn.md docs/decisions/020-firewall.md
+git commit -m "docs(access): reconcile ADR-016/020 with control-node SSH source (ADR-021)"
+```
+
+---
+
+### Task 3: The `ACCESS.md` record template
+
+**Files:**
+- Create: `docs/access/service-access-template.md`
+
+Match the preamble convention of `docs/security/service-security-template.md` and
+`docs/testing/service-verify-template.md` (a "copy this to `roles//ACCESS.md`"
+preamble, then a `---`, then the record).
+
+- [ ] **Step 1: Write the template**
+
+Create `docs/access/service-access-template.md`:
+
+```markdown
+# Per-service operational-access record — template
+
+Copy this file to `roles//ACCESS.md` when building a service role (ADR-021).
+It is the per-service **operational-access record**: every documented, verifiable way in
+for troubleshooting. The structured parts are **rendered from the role's `access__*`
+data** (the single source of truth that also drives `/check-access`) — keep the data
+authoritative and regenerate this file rather than hand-editing the tables. The prose
+"Operational notes" tail is hand-written.
+
+Delete this preamble in the copy and start from the heading below.
+
+---
+
+# Access —
+
+## Access paths
+
+The mesh-reachable ways in, by tier (rendered from `access__*`):
+
+| Tier | Path | Invocation |
+|---|---|---|
+| primary | `wt0` mesh SSH | `ssh ` (over the NetBird mesh) |
+| secondary | LAN SSH from `ubongo` | `ssh ` (from the control node, LAN address) |
+| — | container exec + compose | `docker compose -p -f ps` / `exec` |
+| — | logs | Loki query for labels `` (Grafana; ADR-018) |
+| — | admin API | `curl -H 'Authorization: …(vault_ref)' ` — or `n/a` |
+
+## Break-glass
+
+Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
+
+-
+
+## Operational notes
+
+Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.
+
+-
+```
+
+- [ ] **Step 2: Commit**
+
+```bash
+git add docs/access/service-access-template.md
+git commit -m "docs(access): add ACCESS.md service record template"
+```
+
+---
+
+### Task 4: Add the control-node SSH source to the host firewall (TDD)
+
+**Files:**
+- Modify: `roles/base/defaults/main.yml`
+- Modify: `roles/base/templates/nftables.conf.j2`
+- Modify: `roles/base/molecule/default/converge.yml`
+- Modify: `roles/base/molecule/default/verify.yml`
+
+This is the only code in Tranche A. It adds an **optional** guaranteed-management-plane
+allow for SSH from the control node's LAN address. Default empty ⇒ no rule rendered ⇒
+no behaviour change until a real `ubongo` address is set in `group_vars` (build-pending).
+Test path is the established one for this role: Molecule render + `nft -c` (no apply).
+
+- [ ] **Step 1: Write the failing test — converge sets the knob, verify asserts the rule**
+
+In `roles/base/molecule/default/converge.yml`, add the knob under `vars:` (alongside
+`base__firewall_apply: false`):
+
+```yaml
+    base__firewall_control_addr: 10.10.0.99  # test control-node LAN address
+```
+
+In `roles/base/molecule/default/verify.yml`, extend the "management plane" assert block's
+`that:` list (the task asserting default-deny + `wt0` SSH) with:
+
+```yaml
+          - "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"
+```
+
+- [ ] **Step 2: Run the test to verify it fails**
+
+Run: `make test ROLE=base`
+Expected: FAIL — the verify assert "input chain is missing default-deny or the management
+plane" fires, because the template does not yet render the control-address rule.
+
+- [ ] **Step 3: Add the default knob**
+
+In `roles/base/defaults/main.yml`, after the `base__firewall_mgmt_interface` line, add:
+
+```yaml
+base__firewall_control_addr: ""   # control-node LAN address (ubongo); SSH allowed from it
+                                  # as the guaranteed-management-plane `ssh-from-control`
+                                  # source (ADR-021). Empty = no rule. Set in group_vars
+                                  # once ubongo exists.
+```
+
+- [ ] **Step 4: Render the rule in the template**
+
+In `roles/base/templates/nftables.conf.j2`, immediately after the `wt0` SSH line (the
+`iifname "{{ base__firewall_mgmt_interface }}" ...` line), add:
+
+```jinja
+{% if base__firewall_control_addr %}
+    ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
+{% endif %}
+```
+
+- [ ] **Step 5: Run the test to verify it passes**
+
+Run: `make test ROLE=base`
+Expected: PASS — the rule `ip saddr 10.10.0.99 tcp dport 22 accept` renders, `nft -c`
+syntax-check succeeds, and all prior assertions (default-deny, `wt0` SSH, zone rules,
+drop-in hook) still pass.
+
+- [ ] **Step 6: Lint**
+
+Run: `make lint`
+Expected: PASS (no tag/FQCN/yaml regressions).
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
+        roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
+git commit -m "feat(base): add ssh-from-control management-plane source (ADR-021)"
+```
+
+---
+
+### Task 5: Author the `/check-access` command (dormant until infra)
+
+**Files:**
+- Create: `.claude/commands/check-access.md`
+
+Mirror the structure of `.claude/commands/verify-service.md` (a forward-looking command
+with a hard Prerequisites gate). It does not run until `ubongo` + live/staging hosts +
+vault exist; if a prerequisite is missing it must say so and stop.
+
+- [ ] **Step 1: Write the command**
+
+Create `.claude/commands/check-access.md`:
+
+```markdown
+Operational-access verification (ADR-021)
+
+Probe every documented way in to a service or host from `ubongo` and report which paths
+are live. Reads the target's `access__*` data (and host baseline), so the verifier and
+`ACCESS.md` can never disagree. Argument: a service/role name or a host
+(e.g. `/check-access photoprism`, `/check-access docker01`).
+
+## Prerequisites (forward-looking — ADR-021 dependencies)
+
+This skill cannot run until these exist; if any is missing, say so and stop — do not
+improvise around it:
+
+- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
+- The target host/service is deployed (staging or production inventory).
+- `roles//` carries `access__*` data (services) / the host baseline applies.
+- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
+
+## Process
+
+### Phase 0 — resolve the target
+
+Resolve the argument to a host or a service role + its host. Load the `access__*` data
+(service) or the host-baseline + break-glass record (host). State what you will probe.
+
+### Phase 1 — probe each declared path
+
+| Path | Probe | Green = |
+|---|---|---|
+| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
+| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
+| exec + compose | `docker compose -p ps`; exec `true` in each `access__containers` entry | stack up, exec works |
+| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
+| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
+| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
+
+Break-glass is **never exercised** — firing a serial console is invasive; confirm the
+fallback exists, do not drive it.
+
+### Phase 2 — report
+
+Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
+in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
+`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".
+
+## Notes
+
+- Read-only and non-destructive — probes confirm reachability, they do not change state.
+- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
+  control node + hosts exist.
+```
+
+- [ ] **Step 2: Commit**
+
+```bash
+git add .claude/commands/check-access.md
+git commit -m "feat(access): add /check-access verifier command (ADR-021, dormant)"
+```
+
+---
+
+### Task 6: Governance wiring — checklist + runbook
+
+**Files:**
+- Modify: `docs/security/service-checklist.md` (the "Operability (security-adjacent)" section)
+- Modify: `docs/runbooks/new-role.md` (after step 10, the VERIFY.md step)
+
+ACCESS.md mirrors how SECURITY.md/VERIFY.md are enforced: a manual runbook step + a
+checklist gate (the scaffold does not auto-drop SECURITY/VERIFY today either, so ACCESS
+follows the same manual-copy pattern — no Makefile change).
+
+- [ ] **Step 1: Add the checklist gate item**
+
+In `docs/security/service-checklist.md`, under `## Operability (security-adjacent)`, add a
+bullet after the `/verify-service` item:
+
+```markdown
+- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
+      data, `roles//ACCESS.md` is rendered, and `/check-access` reports the
+      documented paths green — or a deviation is recorded in
+      `docs/security/accepted-risks.md`
+```
+
+- [ ] **Step 2: Add the runbook step**
+
+In `docs/runbooks/new-role.md`, insert a new step between step 10 (VERIFY.md) and the
+final commit step, and renumber the commit step to 12:
+
+```markdown
+### 11. Write the per-service operational-access record (services)
+
+For a **service** role, copy `docs/access/service-access-template.md` to
+`roles//ACCESS.md` and populate the role's `access__*` data
+(`access__service`, `access__compose_project`/`_path`, `access__containers`,
+`access__log.loki_labels`, and `access__api` — `enabled` + endpoint + `firewall_ref` +
+`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
+rendered from that data; the admin-API path must `firewall_ref` an entry in the
+`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
+`/check-access ` proves the documented paths are live — part of the
+service-clearance gate (`docs/security/service-checklist.md`).
+```
+
+- [ ] **Step 3: Verify renumbering**
+
+Run: `grep -n "^### 1[12]\." docs/runbooks/new-role.md`
+Expected: `### 11. Write the per-service operational-access record` and `### 12. Commit`.
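+
+For a service with no admin API, the `enabled: false` branch named in the runbook step
+might look like the following sketch. The key names are the plan's; the service name and
+the reason text are hypothetical.
+
+```yaml
+# Sketch: a service role that exposes no admin API (hypothetical values).
+access__service: exampled
+access__api:
+  enabled: false
+  reason: "no admin API; operated through compose and logs only"
+```
+
+Recording a reason should keep the `n/a` admin-API row in the rendered `ACCESS.md`
+explainable rather than silent.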
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add docs/security/service-checklist.md docs/runbooks/new-role.md
+git commit -m "docs(access): gate ACCESS.md in checklist + new-role runbook (ADR-021)"
+```
+
+---
+
+### Task 7: Index wiring — CLAUDE.md, STATUS.md, TODO.md
+
+**Files:**
+- Modify: `CLAUDE.md` (Role conventions list + Further reading table)
+- Modify: `STATUS.md` (Designed-but-not-built table)
+- Modify: `docs/TODO.md` (items 3.2 and 7.2)
+
+- [ ] **Step 1: CLAUDE.md — Role conventions**
+
+In the `## Role conventions` list, after the `VERIFY.md` bullet
+("Every **service** role must have a populated `VERIFY.md` ..."), add:
+
+```markdown
+- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
+  `docs/access/service-access-template.md`; rendered from the role's `access__*` data
+```
+
+- [ ] **Step 2: CLAUDE.md — Further reading**
+
+In the Further reading table, after the Firewall strategy row, add:
+
+```markdown
+| Operational access | `docs/decisions/021-operational-access.md` |
+```
+
+- [ ] **Step 3: STATUS.md — new rows**
+
+In the `## Designed but not built` table, add:
+
+```markdown
+| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
+| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
+| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
+| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
+```
+
+- [ ] **Step 4: docs/TODO.md — mark 3.2 and 7.2 DECIDED**
+
+In `docs/TODO.md`, change item **3.2** from:
+
+```markdown
+   2. Decide how to manage APIs / API access.
+```
+
+to:
+
+```markdown
+   2. ~~Decide how to manage APIs / API access.~~ DECIDED (ADR-021): per-service `access__*`
+      data declares the admin API (endpoint + `firewall_ref` to the catalog + vault token
+      ref + health path); rendered into `ACCESS.md` and probed by `/check-access`. Part of
+      the two-layer operational-access doctrine.
+```
+
+And change item **7.2** from:
+
+```markdown
+   2. Decide what to set up on the hosts, given that direct access will be rare.
+```
+
+to:
+
+```markdown
+   2. ~~Decide what to set up on the hosts, given that direct access will be rare.~~
+      DECIDED (ADR-021): the host-layer access baseline — SSH on `wt0` + from `ubongo`,
+      Docker/Compose tooling, Alloy log shipping, and a recorded break-glass console per
+      host class.
+```
+
+- [ ] **Step 5: Verify and commit**
+
+Run: `grep -n "021-operational-access\|ACCESS.md\|ssh-from-control" CLAUDE.md STATUS.md`
+Expected: the new Role-conventions bullet, the Further-reading row, and the STATUS rows
+are present.
+
+```bash
+git add CLAUDE.md STATUS.md docs/TODO.md
+git commit -m "docs(access): wire ADR-021 into CLAUDE.md, STATUS, TODO"
+```
+
+---
+
+## Tranche B — Build-pending on infra (no tasks now)
+
+Recorded so the boundary is explicit; nothing here is actioned by this plan.
+
+- **Per-service `access__*` + rendered `ACCESS.md`** — authored when each service role is
+  built, governed by the Task 6 checklist item + runbook step. The first real service role
+  is where this first runs.
+- **`/check-access` running** — needs `ubongo` + a live/staging host + vault. The command
+  (Task 5) already gates on these and stops cleanly until then.
+- **Real `base__firewall_control_addr` value** — set in `group_vars/all` to `ubongo`'s LAN
+  address once `ubongo` is in inventory; the machinery + test landed in Task 4.
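+
+Once `ubongo` is in inventory, the last item could amount to a single line such as the
+sketch below. The file layout under `group_vars/all` and the address are assumptions;
+`192.0.2.10` is a documentation-range placeholder, not a real host.
+
+```yaml
+# group_vars/all (sketch): turn on the ssh-from-control rule for every host.
+base__firewall_control_addr: "192.0.2.10"   # placeholder; replace with ubongo's LAN address
+```
+
+With that value set, the Task 4 template would be expected to render
+`ip saddr 192.0.2.10 tcp dport 22 accept`, assuming `base__firewall_ssh_port` is 22.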
+
+---
+
+## Self-review
+
+**Spec coverage:** doctrine + two layers → Task 1; three-tier ladder + ADR-016/020
+reconciliation → Tasks 1–2, 4; `access__*` model + invariant → Tasks 1, 3, 6; rendered
+`ACCESS.md` → Task 3; `/check-access` → Task 5; governance (checklist/runbook) → Task 6;
+repo wiring (CLAUDE/STATUS/TODO) → Task 7; build-now vs build-pending split → Tranches
+A/B. All spec sections map to a task.
+
+**Deviations from the spec (deliberate, flagged for the user):**
+1. The spec called `ssh-from-control` a *catalog* source; the plan places it in the
+   *guaranteed management plane* (`base__firewall_control_addr`) instead — ADR-020 already
+   houses SSH/Ansible management allows there, independent of the catalog, and the spec's
+   own invariant says the catalog owns *service* exposure only. Same intent, correct home.
+2. The spec said `make new-role` would *scaffold* an `ACCESS.md` stub; the plan instead adds
+   a manual runbook step (Task 6) mirroring how `SECURITY.md`/`VERIFY.md` are handled today
+   (also manual copies, not scaffolded). Avoids unilaterally restructuring the scaffold;
+   the "can't be forgotten" intent is met by the checklist gate + runbook step.
+
+**Type/name consistency:** `base__firewall_control_addr` (knob), `access__service` /
+`access__compose_project` / `access__compose_path` / `access__containers` /
+`access__log.loki_labels` / `access__api.{enabled,base_url,firewall_ref,auth.vault_ref,health_path}`
+are used identically across Tasks 1, 3, 5, 6. The rendered nftables rule string
+`ip saddr tcp dport 22 accept` matches between Task 4's template (Step 4) and its
+assertion (Step 1).