boma/docs/superpowers/plans/2026-06-09-operational-access.md
sjat cdbd66410a docs(access): implementation plan for ADR-021 operational access
Splits the work into Tranche A (land now: ADR-021, ADR-016/020
reconciliation, ssh-from-control firewall source, ACCESS.md template,
/check-access command, governance + index wiring) and Tranche B
(build-pending on infra: per-service access__* + rendered ACCESS.md,
/check-access running).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:16:49 +02:00


Operational Access (ADR-021) Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Establish operational access as a deployment deliverable — a documented, verifiable set of mesh-reachable troubleshooting paths for every host and service — by writing ADR-021, reconciling the latent ADR-016/020 SSH contradiction, adding the control-node SSH source to the host firewall, and wiring the ACCESS.md record + /check-access verifier into boma's governance.

Architecture: Source of truth is the committed design spec docs/superpowers/specs/2026-06-09-operational-access-design.md. Structured access facts live as declarative access__* data that renders ACCESS.md and drives /check-access (the access analogue of VERIFY.md + /verify-service). Work is split into Tranche A — land now (doctrine docs, the one firewall code change, the dormant /check-access command, governance wiring) and Tranche B — build-pending on infra (per-service access__* population, rendered ACCESS.md files, and /check-access running), which arrive with service roles and live hosts and require no action in this plan.

Tech Stack: Markdown ADRs/docs; Ansible role base (Jinja2 nftables template + defaults/main.yml); Molecule (Debian 13, render + nft -c, no apply) for the firewall test; Claude Code command file for /check-access.


File structure

| File | Tranche | Responsibility |
|---|---|---|
| docs/decisions/021-operational-access.md | A | NEW: the doctrine (two layers, three-tier ladder, break-glass, access__* model, /check-access) |
| docs/decisions/016-mesh-vpn.md | A | MODIFY: reconcile (SSH on wt0 and from ubongo's LAN address) |
| docs/decisions/020-firewall.md | A | MODIFY: guaranteed management plane gains the control-node SSH source |
| docs/access/service-access-template.md | A | NEW: the ACCESS.md record shape (rendered-from-data + prose tail) |
| roles/base/defaults/main.yml | A | MODIFY: add base__firewall_control_addr knob (default empty → no-op) |
| roles/base/templates/nftables.conf.j2 | A | MODIFY: conditional management-plane SSH rule for the control address |
| roles/base/molecule/default/converge.yml | A | MODIFY: set the knob for the test |
| roles/base/molecule/default/verify.yml | A | MODIFY: assert the rendered rule |
| .claude/commands/check-access.md | A | NEW: the /check-access verifier command (dormant until infra exists) |
| docs/security/service-checklist.md | A | MODIFY: one new gate item |
| docs/runbooks/new-role.md | A | MODIFY: new step, write ACCESS.md (mirrors SECURITY/VERIFY steps) |
| CLAUDE.md | A | MODIFY: ACCESS.md in Role conventions; ADR-021 in Further reading |
| STATUS.md | A | MODIFY: new rows for the doctrine, the firewall source, /check-access |
| docs/TODO.md | A | MODIFY: mark 3.2 + 7.2 DECIDED → ADR-021 |

Tranche B (no tasks here — captured for the record): per-service access__* blocks + rendered roles/<svc>/ACCESS.md land when each service role is built (governed by the Tranche-A checklist + runbook); /check-access running lands when ubongo + staging + vault exist. Both are designed-now, build-pending — exactly like /verify-service under ADR-017.


Tranche A — Land now

Task 1: Write ADR-021

Files:

  • Create: docs/decisions/021-operational-access.md

The ADR is the durable decision record derived from the committed spec docs/superpowers/specs/2026-06-09-operational-access-design.md. Match the prose style and heading shape of an existing ADR (read docs/decisions/020-firewall.md first). The ADR must state these specifics — they are the parts easy to get wrong:

  • Doctrine sentence (verbatim): "Every host and every service guarantees at least one documented, verifiable way in for operational troubleshooting — and the deploy that creates it also records and proves it."

  • Two layers: host baseline (resolves TODO 7.2) + per-service record (resolves TODO 3.2).

  • Three-tier access ladder: (1) wt0 mesh SSH — primary, WireGuard-authenticated; (2) LAN SSH from ubongo only — secondary, mesh-independent, source-IP-gated plus keys-only + fail2ban; all other LAN hosts stay default-denied; (3) console — break-glass per host class: cluster VMs → Proxmox serial/VNC console, askari → Hetzner rescue/console, ubongo → local console; reachability-checked, never exercised.

  • Reconciliation, not weakening (state this explicitly): ADR-016 already requires Ansible to reach the fleet by LAN IP ("a mesh/coordinator outage never blocks on-LAN runs"), which requires LAN SSH from ubongo; yet ADR-016 also said "SSH only on wt0" and ADR-020's guaranteed management plane listed only wt0. ADR-021 resolves that latent contradiction by making the control-node SSH allow explicit and adding it to the guaranteed management plane. It does not weaken default-deny: exactly one extra trusted source on the LAN.

  • Declarative access__* data model: service-role defaults carry access__service, access__compose_project, access__compose_path, access__containers, access__log.loki_labels, and access__api (enabled, base_url, firewall_ref, auth.vault_ref, health_path; or enabled: false + reason). Invariant: access__api never opens a port — it firewall_refs the group_vars firewall catalog; ADR-020 stays the sole owner of exposure.

  • Rendered record: ACCESS.md is rendered from that data + a prose tail (operational notes / gotchas). First-class sibling of SECURITY.md/VERIFY.md.

  • /check-access: the verifier that probes each declared path and reports which are live; break-glass reachability-only; designed now, build-pending on infra.

  • Status / consequences: what lands now vs build-pending (mirror this plan's split).
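To make the declarative access__* data model bullet above concrete, a service role's defaults might carry something like the following. This is only a sketch: the key names come from this plan, but every value (paths, container names, label set, URL, catalog entry name, vault item, health endpoint) is a hypothetical placeholder, and the exact shape of loki_labels (mapping vs. list) is an assumption the ADR should settle.

```yaml
# roles/photoprism/defaults/main.yml (illustrative; all values are placeholders)
access__service: photoprism
access__compose_project: photoprism
access__compose_path: /opt/photoprism/compose.yml    # assumed path
access__containers:
  - photoprism
  - photoprism-db                                    # assumed container name
access__log:
  loki_labels:
    service: photoprism                              # assumed label set
access__api:
  enabled: true
  base_url: "https://photoprism.example.internal"    # placeholder URL
  firewall_ref: photoprism-admin                     # names a catalog entry; opens no port itself
  auth:
    vault_ref: photoprism/admin-token                # assumed vault item name
  health_path: /api/v1/status                        # assumed endpoint

# A service without an admin API would declare this instead:
# access__api:
#   enabled: false
#   reason: "no admin API; operated via compose exec only"
```

Note that firewall_ref is only a pointer into the group_vars firewall catalog, which keeps ADR-020 as the sole owner of exposure, as the invariant requires.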

  • Step 1: Author the ADR

Write docs/decisions/021-operational-access.md covering every bullet above, in the house style of docs/decisions/020-firewall.md (problem → decision → layers/ladder → data model → verifier → consequences). Open with a one-line title heading # ADR-021 — Operational access: documented, verifiable ways in.

  • Step 2: Sanity-check internal links

Run: grep -n "ADR-01[67]\|ADR-020\|access__\|check-access\|ACCESS.md" docs/decisions/021-operational-access.md

Expected: references to ADR-016, ADR-020, the access__* keys, /check-access, and ACCESS.md are all present.

  • Step 3: Commit
git add docs/decisions/021-operational-access.md
git commit -m "docs(access): add ADR-021 operational-access doctrine"

Task 2: Reconcile ADR-016 and ADR-020

Files:

  • Modify: docs/decisions/016-mesh-vpn.md (the "Host firewall" bullet, ~line 64-65)

  • Modify: docs/decisions/020-firewall.md (the "Guaranteed management plane" bullet, ~line 42-45)

  • Step 1: Amend ADR-016's Host-firewall bullet

Replace the existing bullet:

- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
  **only on `wt0`** (the ADR-015 pattern, fleet-wide).

with:

- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
  (primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
  mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
  outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
  explicit the control-node SSH allow that the recovery model already implied; the access
  doctrine and the three-tier access ladder live in **ADR-021**.
  • Step 2: Amend ADR-020's guaranteed-management-plane bullet

Replace:

- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
  ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
  atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
  is allowed only on `wt0`.)

with:

- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
  ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
  the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
  catalog, applied atomically — a malformed or empty catalog can never lock out
  management. The control-node source is part of the guaranteed plane, not the service
  catalog (it is management, not a service); see ADR-021 for the access doctrine.
  • Step 3: Commit
git add docs/decisions/016-mesh-vpn.md docs/decisions/020-firewall.md
git commit -m "docs(access): reconcile ADR-016/020 with control-node SSH source (ADR-021)"

Task 3: The ACCESS.md record template

Files:

  • Create: docs/access/service-access-template.md

Match the preamble convention of docs/security/service-security-template.md and docs/testing/service-verify-template.md (a "copy this to roles/<service>/ACCESS.md" preamble, then a ---, then the record).

  • Step 1: Write the template

Create docs/access/service-access-template.md:

# Per-service operational-access record — template

Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
It is the per-service **operational-access record**: every documented, verifiable way in
for troubleshooting. The structured parts are **rendered from the role's `access__*`
data** (the single source of truth that also drives `/check-access`) — keep the data
authoritative and regenerate this file rather than hand-editing the tables. The prose
"Operational notes" tail is hand-written.

Delete this preamble in the copy and start from the heading below.

---

# Access — <service>

## Access paths

The mesh-reachable ways in, by tier (rendered from `access__*`):

| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |

## Break-glass

Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):

- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>

## Operational notes

Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.

- <none yet>
  • Step 2: Commit
git add docs/access/service-access-template.md
git commit -m "docs(access): add ACCESS.md service record template"

Task 4: Add the control-node SSH source to the host firewall (TDD)

Files:

  • Modify: roles/base/defaults/main.yml
  • Modify: roles/base/templates/nftables.conf.j2
  • Modify: roles/base/molecule/default/converge.yml
  • Modify: roles/base/molecule/default/verify.yml

This is the only code in Tranche A. It adds an optional guaranteed-management-plane allow for SSH from the control node's LAN address. Default empty ⇒ no rule rendered ⇒ no behaviour change until a real ubongo address is set in group_vars (build-pending). Test path is the established one for this role: Molecule render + nft -c (no apply).

  • Step 1: Write the failing test — converge sets the knob, verify asserts the rule

In roles/base/molecule/default/converge.yml, add the knob under vars: (alongside base__firewall_apply: false):

    base__firewall_control_addr: 10.10.0.99   # test control-node LAN address

In roles/base/molecule/default/verify.yml, extend the "management plane" assert block's that: list (the task asserting default-deny + wt0 SSH) with:

          - "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"
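For orientation, the extended assert task might end up looking roughly like the sketch below. It assumes that nft already holds the rendered ruleset as a string (which the assertion line above implies); the task name and the two pre-existing checks are guesses at what the current block contains, not copies of it. Only the last that: entry and the failure message come from this plan.

```yaml
# Sketch of the "management plane" assert in roles/base/molecule/default/verify.yml.
# Assumption: `nft` is a string variable containing the rendered nftables ruleset.
- name: Assert default-deny and the management plane   # hypothetical task name
  ansible.builtin.assert:
    that:
      - "'policy drop' in nft"                              # assumed existing check
      - "'iifname \"wt0\" tcp dport 22 accept' in nft"      # assumed existing check
      - "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"  # new line from this step
    fail_msg: "input chain is missing default-deny or the management plane"
```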
  • Step 2: Run the test to verify it fails

Run: make test ROLE=base

Expected: FAIL. The verify assert "input chain is missing default-deny or the management plane" fires because the template does not yet render the control-address rule.

  • Step 3: Add the default knob

In roles/base/defaults/main.yml, after the base__firewall_mgmt_interface line, add:

base__firewall_control_addr: ""      # control-node LAN address (ubongo); SSH allowed from it
                                     # as the guaranteed-management-plane `ssh-from-control`
                                     # source (ADR-021). Empty = no rule. Set in group_vars
                                     # once ubongo exists.
  • Step 4: Render the rule in the template

In roles/base/templates/nftables.conf.j2, immediately after the wt0 SSH line (the iifname "{{ base__firewall_mgmt_interface }}" ... line), add:

{% if base__firewall_control_addr %}
    ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endif %}
  • Step 5: Run the test to verify it passes

Run: make test ROLE=base

Expected: PASS. The rule ip saddr 10.10.0.99 tcp dport 22 accept renders, the nft -c syntax check succeeds, and all prior assertions (default-deny, wt0 SSH, zone rules, drop-in hook) still pass.

  • Step 6: Lint

Run: make lint

Expected: PASS (no tag/FQCN/yaml regressions).

  • Step 7: Commit
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
        roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): add ssh-from-control management-plane source (ADR-021)"

Task 5: Author the /check-access command (dormant until infra)

Files:

  • Create: .claude/commands/check-access.md

Mirror the structure of .claude/commands/verify-service.md (a forward-looking command with a hard Prerequisites gate). It does not run until ubongo + live/staging hosts + vault exist; if a prerequisite is missing it must say so and stop.

  • Step 1: Write the command

Create .claude/commands/check-access.md:

Operational-access verification (ADR-021)

Probe every documented way in to a service or host from `ubongo` and report which paths
are live. Reads the target's `access__*` data (and host baseline), so the verifier and
`ACCESS.md` can never disagree. Argument: a service/role name or a host
(e.g. `/check-access photoprism`, `/check-access docker01`).

## Prerequisites (forward-looking — ADR-021 dependencies)

This skill cannot run until these exist; if any is missing, say so and stop — do not
improvise around it:

- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
- The target host/service is deployed (staging or production inventory).
- `roles/<name>/` carries `access__*` data (services) / the host baseline applies.
- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.

## Process

### Phase 0 — resolve the target

Resolve the argument to a host or a service role + its host. Load the `access__*` data
(service) or the host-baseline + break-glass record (host). State what you will probe.

### Phase 1 — probe each declared path

| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |

Break-glass is **never exercised** — firing a serial console is invasive; confirm the
fallback exists, do not drive it.

### Phase 2 — report

Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".

## Notes

- Read-only and non-destructive — probes confirm reachability, they do not change state.
- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
  control node + hosts exist.
  • Step 2: Commit
git add .claude/commands/check-access.md
git commit -m "feat(access): add /check-access verifier command (ADR-021, dormant)"

Task 6: Governance wiring — checklist + runbook

Files:

  • Modify: docs/security/service-checklist.md (the "Operability (security-adjacent)" section)
  • Modify: docs/runbooks/new-role.md (after step 10, the VERIFY.md step)

ACCESS.md mirrors how SECURITY.md/VERIFY.md are enforced: a manual runbook step + a checklist gate (the scaffold does not auto-drop SECURITY/VERIFY today either, so ACCESS follows the same manual-copy pattern — no Makefile change).

  • Step 1: Add the checklist gate item

In docs/security/service-checklist.md, under ## Operability (security-adjacent), add a bullet after the /verify-service item:

- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
      data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
      documented paths green — or a deviation is recorded in
      `docs/security/accepted-risks.md`
  • Step 2: Add the runbook step

In docs/runbooks/new-role.md, insert a new step between step 10 (VERIFY.md) and the final commit step, and renumber the commit step to 12:

### 11. Write the per-service operational-access record (services)

For a **service** role, copy `docs/access/service-access-template.md` to
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
`access__log.loki_labels`, and `access__api`: `enabled` + endpoint + `firewall_ref` +
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
rendered from that data; the admin-API path must `firewall_ref` an entry in the
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
`/check-access <rolename>` proves the documented paths are live — part of the
service-clearance gate (`docs/security/service-checklist.md`).
  • Step 3: Verify renumbering

Run: grep -n "^### 1[12]\." docs/runbooks/new-role.md

Expected: ### 11. Write the per-service operational-access record and ### 12. Commit.

  • Step 4: Commit
git add docs/security/service-checklist.md docs/runbooks/new-role.md
git commit -m "docs(access): gate ACCESS.md in checklist + new-role runbook (ADR-021)"

Task 7: Index wiring — CLAUDE.md, STATUS.md, TODO.md

Files:

  • Modify: CLAUDE.md (Role conventions list + Further reading table)

  • Modify: STATUS.md (Designed-but-not-built table)

  • Modify: docs/TODO.md (items 3.2 and 7.2)

  • Step 1: CLAUDE.md — Role conventions

In the ## Role conventions list, after the VERIFY.md bullet ("Every service role must have a populated VERIFY.md ..."), add:

- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
  `docs/access/service-access-template.md`; rendered from the role's `access__*` data
  • Step 2: CLAUDE.md — Further reading

In the Further reading table, after the Firewall strategy row, add:

| Operational access     | `docs/decisions/021-operational-access.md` |
  • Step 3: STATUS.md — new rows

In the ## Designed but not built table, add:

| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
  • Step 4: docs/TODO.md — mark 3.2 and 7.2 DECIDED

In docs/TODO.md, change item 3.2 from:

   2. Decide how to manage APIs / API access.

to:

   2. ~~Decide how to manage APIs / API access.~~ DECIDED (ADR-021): per-service `access__*`
      data declares the admin API (endpoint + `firewall_ref` to the catalog + vault token
      ref + health path); rendered into `ACCESS.md` and probed by `/check-access`. Part of
      the two-layer operational-access doctrine.

And change item 7.2 from:

   2. Decide what to set up on the hosts, given that direct access will be rare.

to:

   2. ~~Decide what to set up on the hosts, given that direct access will be rare.~~
      DECIDED (ADR-021): the host-layer access baseline — SSH on `wt0` + from `ubongo`,
      Docker/Compose tooling, Alloy log shipping, and a recorded break-glass console per
      host class.
  • Step 5: Verify and commit

Run: grep -n "021-operational-access\|ACCESS.md\|ssh-from-control" CLAUDE.md STATUS.md

Expected: the new Role-conventions bullet, the Further-reading row, and the STATUS rows are present.

git add CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(access): wire ADR-021 into CLAUDE.md, STATUS, TODO"

Tranche B — Build-pending on infra (no tasks now)

Recorded so the boundary is explicit; nothing here is actioned by this plan.

  • Per-service access__* + rendered ACCESS.md — authored when each service role is built, governed by the Task 6 checklist item + runbook step. The first real service role is where this first runs.
  • /check-access running — needs ubongo + a live/staging host + vault. The command (Task 5) already gates on these and stops cleanly until then.
  • Real base__firewall_control_addr value — set in group_vars/all to ubongo's LAN address once ubongo is in inventory; the machinery + test landed in Task 4.
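When the last item above is eventually actioned, the change should amount to a single variable. A minimal sketch, assuming the value lives wherever group_vars/all keeps its variables; the address shown is a documentation-range placeholder (RFC 5737), not ubongo's real LAN address:

```yaml
# group_vars/all (sketch). 192.0.2.10 is a placeholder, not a real address.
base__firewall_control_addr: "192.0.2.10"
```

With that set, the Task 4 template would be expected to render one extra management-plane rule of the form ip saddr 192.0.2.10 tcp dport <ssh port> accept on every host running the base role; leaving the variable empty keeps today's behaviour.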

Self-review

Spec coverage: doctrine + two layers → Task 1; three-tier ladder + ADR-016/020 reconciliation → Tasks 1, 2, 4; access__* model + invariant → Tasks 1, 3, 6; rendered ACCESS.md → Task 3; /check-access → Task 5; governance (checklist/runbook) → Task 6; repo wiring (CLAUDE/STATUS/TODO) → Task 7; build-now vs build-pending split → Tranches A/B. All spec sections map to a task.

Deviations from the spec (deliberate, flagged for the user):

  1. The spec called ssh-from-control a catalog source; the plan places it in the guaranteed management plane (base__firewall_control_addr) instead — ADR-020 already houses SSH/Ansible management allows there, independent of the catalog, and the spec's own invariant says the catalog owns service exposure only. Same intent, correct home.
  2. The spec said make new-role would scaffold an ACCESS.md stub; the plan instead adds a manual runbook step (Task 6) mirroring how SECURITY.md/VERIFY.md are handled today (also manual copies, not scaffolded). Avoids unilaterally restructuring the scaffold; the "can't be forgotten" intent is met by the checklist gate + runbook step.

Type/name consistency: base__firewall_control_addr (knob), access__service / access__compose_project / access__compose_path / access__containers / access__log.loki_labels / access__api.{enabled,base_url,firewall_ref,auth.vault_ref,health_path} are used identically across Tasks 1, 3, 5, 6. The rendered nftables rule string ip saddr <addr> tcp dport 22 accept matches between Task 4's template (Step 4) and its assertion (Step 1), given that base__firewall_ssh_port resolves to 22 in the Molecule scenario.