Compare commits

...

11 commits

Author SHA1 Message Date
032adf1525 docs(friction): log execution-mode recurrence; fix list de-indents
Complete the 2026-06-09 entry (third recurrence of presenting the
execution-mode menu despite the standing subagent-driven preference) and
restore two continuation-line indents a markdown formatter had stripped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 08:54:37 +02:00
f151e99d04 docs(access): correct ADR-021 governance (runbook+gate, not scaffold)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:52:24 +02:00
13f0d482bd docs(access): wire ADR-021 into CLAUDE.md, STATUS, TODO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:48:31 +02:00
649925b303 docs(access): gate ACCESS.md in checklist + new-role runbook (ADR-021)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:46:51 +02:00
384b94e34b feat(access): add /check-access verifier command (ADR-021, dormant)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:45:24 +02:00
0c507bbace feat(base): add ssh-from-control management-plane source (ADR-021)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:43:55 +02:00
46d091e82e docs(access): add ACCESS.md service record template
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:36:28 +02:00
f8098c2e15 docs(access): reconcile ADR-016/020 with control-node SSH source (ADR-021)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:34:57 +02:00
0fe9e45f57 docs(access): add ADR-021 operational-access doctrine 2026-06-09 17:33:46 +02:00
cdbd66410a docs(access): implementation plan for ADR-021 operational access
Splits the work into Tranche A (land now: ADR-021, ADR-016/020
reconciliation, ssh-from-control firewall source, ACCESS.md template,
/check-access command, governance + index wiring) and Tranche B
(build-pending on infra: per-service access__* + rendered ACCESS.md,
/check-access running).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:16:49 +02:00
fd4bbbc977 docs(access): design operational-access doctrine (ADR-021)
Brainstorming spec for ADR-021: operational access as a deployment
deliverable. Two layers (host baseline + per-service), a three-tier
access ladder (mesh SSH -> LAN SSH from ubongo -> console break-glass),
declarative access__* data rendering ACCESS.md and driving a
/check-access verifier. Resolves TODO 3.2 (API access) and 7.2 (host
access); amends ADR-016 (SSH also from ubongo) and ADR-020
(ssh-from-control source).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:10:54 +02:00
17 changed files with 1124 additions and 15 deletions

View file

@ -0,0 +1,49 @@
Operational-access verification (ADR-021)
Probe every documented way in to a service or host from `ubongo` and report which paths
are live. Reads the target's `access__*` data (and host baseline), so the verifier and
`ACCESS.md` can never disagree. Argument: a service/role name or a host
(e.g. `/check-access photoprism`, `/check-access docker01`).
## Prerequisites (this is forward-looking — ADR-021 dependencies)
This skill cannot run until these exist; if any is missing, say so and stop — do not
improvise around it:
- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
- The target host/service is deployed (staging or production inventory).
- `roles/<name>/` carries `access__*` data (services) / the host baseline applies.
- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
## Process
### Phase 0 — resolve the target
Resolve the argument to a host or a service role + its host. Load the `access__*` data
(service) or the host-baseline + break-glass record (host). State what you will probe.
### Phase 1 — probe each declared path
| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
Break-glass is **never exercised** — firing a serial console is invasive; confirm the
fallback exists, do not drive it.
### Phase 2 — report
Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".
## Notes
- Read-only and non-destructive — probes confirm reachability, they do not change state.
- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
control node + hosts exist.

View file

@ -87,6 +87,8 @@ Full design rationale: `docs/decisions/`
- Every role must have `meta/main.yml` filled in - Every role must have `meta/main.yml` filled in
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md` - Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md` - Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
- One service = one self-contained role; no shared multi-service roles (ADR-004) - One service = one self-contained role; no shared multi-service roles (ADR-004)
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`) - Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand - Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
@ -224,6 +226,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Logging & log integrity | `docs/decisions/018-logging.md` | | Logging & log integrity | `docs/decisions/018-logging.md` |
| Tagging & run-targeting | `docs/decisions/019-tagging.md` | | Tagging & run-targeting | `docs/decisions/019-tagging.md` |
| Firewall strategy | `docs/decisions/020-firewall.md` | | Firewall strategy | `docs/decisions/020-firewall.md` |
| Operational access | `docs/decisions/021-operational-access.md` |
| Adding a new role | `docs/runbooks/new-role.md` | | Adding a new role | `docs/runbooks/new-role.md` |
| Adding a new host | `docs/runbooks/new-host.md` | | Adding a new host | `docs/runbooks/new-host.md` |
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` | | Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |

View file

@ -59,6 +59,10 @@ So `make deploy PLAYBOOK=site` is still incomplete — `base` is only partially
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. | | Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. | | Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). | | Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
## Keeping this honest ## Keeping this honest

View file

@ -63,14 +63,14 @@ earning its keep.
- `[recurring]` When a **deferred** decision later resolves, docs that referenced the - `[recurring]` When a **deferred** decision later resolves, docs that referenced the
deferral go stale and a plan's file-map can miss them (e.g. resolving the mesh-VPN deferral go stale and a plan's file-map can miss them (e.g. resolving the mesh-VPN
choice left `new-host.md` still saying "mesh VPN (choice deferred)"; the ubongo work choice left `new-host.md` still saying "mesh VPN (choice deferred)"; the ubongo work
similarly left a contradiction in CLAUDE.md). A *broadened* final grep sweep caught similarly left a contradiction in CLAUDE.md). A _broadened_ final grep sweep caught
both. → On resolving a deferred decision, grep all canonical docs for the deferral both. → On resolving a deferred decision, grep all canonical docs for the deferral
language ("choice deferred", "pending", "TBD", the placeholder's name) and reconcile language ("choice deferred", "pending", "TBD", the placeholder's name) and reconcile
every hit — don't rely on the plan's file-map alone. Worth a `/review-repo` check for every hit — don't rely on the plan's file-map alone. Worth a `/review-repo` check for
lingering "deferred/pending/TBD" references whose ADR has since resolved. lingering "deferred/pending/TBD" references whose ADR has since resolved.
- **Recurred a 3rd time (same day):** ADR-017 resolved the browser-E2E harness but - **Recurred a 3rd time (same day):** ADR-017 resolved the browser-E2E harness but
left ADR-015's own "Deferred" list item #2 still reading as open — not caught by the left ADR-015's own "Deferred" list item #2 still reading as open — not caught by the
ADR-017 plan's sweep (which only checked for *its own* placeholder language), only ADR-017 plan's sweep (which only checked for _its own_ placeholder language), only
by a later STATUS pass. Lesson sharpened: the stale reference often lives in the by a later STATUS pass. Lesson sharpened: the stale reference often lives in the
**originating ADR's Deferred section**, which the resolving ADR's plan won't think **originating ADR's Deferred section**, which the resolving ADR's plan won't think
to grep. → When an ADR resolves another ADR's deferred item, edit that **source to grep. → When an ADR resolves another ADR's deferred item, edit that **source
@ -82,7 +82,7 @@ earning its keep.
- `[recurring]` **Asked the execution-mode question AGAIN** ("subagent-driven vs inline — - `[recurring]` **Asked the execution-mode question AGAIN** ("subagent-driven vs inline —
which approach?") at the end of `writing-plans`, despite the 2026-06-05 standing which approach?") at the end of `writing-plans`, despite the 2026-06-05 standing
preference *and* the `always-subagent-driven-execution` memory both saying don't ask. preference _and_ the `always-subagent-driven-execution` memory both saying don't ask.
Root cause: the `writing-plans` skill's "Execution Handoff" step scripts the menu, and Root cause: the `writing-plans` skill's "Execution Handoff" step scripts the menu, and
I followed the skill text over the user's standing override. Second occurrence → I followed the skill text over the user's standing override. Second occurrence →
escalate from "skip the prompt" to a **hard rule**: never present the execution-mode escalate from "skip the prompt" to a **hard rule**: never present the execution-mode
@ -98,12 +98,12 @@ earning its keep.
### Host nftables firewall build (`base` role) ### Host nftables firewall build (`base` role)
- `[gotcha]` **`nft -c` rejects `iif "<name>"` when the interface is absent** (it resolves - `[gotcha]` **`nft -c` rejects `iif "<name>"` when the interface is absent** (it resolves
to an interface *index* at load time). The render+syntax-check Molecule step caught to an interface _index_ at load time). The render+syntax-check Molecule step caught
`iif "wt0"` failing in the container — and it would fail identically on any real host `iif "wt0"` failing in the container — and it would fail identically on any real host
before NetBird brings up `wt0`. Use **`iifname "<name>"`** (string match, no existence before NetBird brings up `wt0`. Use **`iifname "<name>"`** (string match, no existence
requirement, survives the interface coming/going) for any interface that may be absent. requirement, survives the interface coming/going) for any interface that may be absent.
- `[gotcha]` **Molecule's `community.docker` connection uses `ansible_host` as the - `[gotcha]` **Molecule's `community.docker` connection uses `ansible_host` as the
container name** (`remote_addr`). Setting `ansible_host` as *data* in a scenario's container name** (`remote_addr`). Setting `ansible_host` as _data_ in a scenario's
`host_vars` (e.g. to give a resolver a fake IP) breaks the connection → `UNREACHABLE`, `host_vars` (e.g. to give a resolver a fake IP) breaks the connection → `UNREACHABLE`,
"Failed to create temporary directory". Don't override `ansible_host` in molecule; feed "Failed to create temporary directory". Don't override `ansible_host` in molecule; feed
fixture IPs another way (or keep fixtures to zone sources and unit-test IP resolution). fixture IPs another way (or keep fixtures to zone sources and unit-test IP resolution).
@ -124,3 +124,15 @@ earning its keep.
- `[note]` The render-and-`nft -c` (no-apply) Molecule approach **earned its keep** - `[note]` The render-and-`nft -c` (no-apply) Molecule approach **earned its keep**
caught the `iif`/`iifname` bug deterministically without touching the host kernel. Good caught the `iif`/`iifname` bug deterministically without touching the host kernel. Good
pattern to reuse for other config-rendering roles. pattern to reuse for other config-rendering roles.
## 2026-06-09
- `[recurring]` **Asked the execution-mode question AGAIN** — presented the
"subagent-driven vs inline" menu at the `writing-plans` → execution handoff, even
though the standing 2026-06-05 preference and the `always-subagent-driven-execution`
memory both say to default to subagent-driven without asking. Third occurrence; the
earlier "hard rule" escalation didn't hold because both `writing-plans` and
`subagent-driven-development` script the menu and I followed the skill text over the
user's standing override. → The standing preference outranks skill scripts: when a
skill's handoff offers the execution-mode menu, skip it and proceed subagent-driven;
only ask if the user signals otherwise this session.

View file

@ -18,7 +18,10 @@
1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via 1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
Grafana Alloy (in `base`); a security subset also ships write-only off-site to Grafana Alloy (in `base`); a security subset also ships write-only off-site to
`askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4). `askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
2. Decide how to manage APIs / API access. 2. ~~Decide how to manage APIs / API access.~~ DECIDED (ADR-021): per-service `access__*`
data declares the admin API (endpoint + `firewall_ref` to the catalog + vault token
ref + health path); rendered into `ACCESS.md` and probed by `/check-access`. Part of
the two-layer operational-access doctrine.
3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013): 3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
translate-don't-transplant — V4 is a source only of gotchas + working config translate-don't-transplant — V4 is a source only of gotchas + working config
snippets, re-derived on boma's terms; never structure/requirements/values. snippets, re-derived on boma's terms; never structure/requirements/values.
@ -53,7 +56,10 @@
7. **Shell setup** 7. **Shell setup**
1. Decide what shell setup matters for the AI's work on the control node. 1. Decide what shell setup matters for the AI's work on the control node.
2. Decide what to set up on the hosts, given that direct access will be rare. 2. ~~Decide what to set up on the hosts, given that direct access will be rare.~~
DECIDED (ADR-021): the host-layer access baseline — SSH on `wt0` + from `ubongo`,
Docker/Compose tooling, Alloy log shipping, and a recorded break-glass console per
host class.
8. **Scheduled work** 8. **Scheduled work**
1. Run `/review-repo` as `claude -p` via cron every two weeks? 1. Run `/review-repo` as `claude -p` via cron every two weeks?

View file

@ -0,0 +1,38 @@
# Per-service operational-access record — template
Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
It is the per-service **operational-access record**: every documented, verifiable way in
for troubleshooting. The structured parts are **rendered from the role's `access__*`
data** (the single source of truth that also drives `/check-access`) — keep the data
authoritative and regenerate this file rather than hand-editing the tables. The prose
"Operational notes" tail is hand-written.
Delete this preamble in the copy and start from the heading below.
---
# Access — <service>
## Access paths
The documented ways in, by tier (rendered from `access__*`):
| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |
## Break-glass
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>
## Operational notes
Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.
- <none yet>

View file

@ -61,8 +61,12 @@ allocated for it.
privilege. privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`), - **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys. consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH - **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
**only on `wt0`** (the ADR-015 pattern, fleet-wide). (primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
explicit the control-node SSH allow that the recovery model already implied; the access
doctrine and the three-tier access ladder live in **ADR-021**.
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn - **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical, (3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence. `base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.

View file

@ -39,10 +39,12 @@ subnet (VLAN 20), which never reaches the gateway.
added benefit once the VLAN already bounds where a host can go. added benefit once the VLAN already bounds where a host can go.
- **Docker**: daemon runs with `"iptables": false`; nftables owns all filtering, - **Docker**: daemon runs with `"iptables": false`; nftables owns all filtering,
including container traffic (ADR-004). including container traffic (ADR-004).
- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird, - **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
is allowed only on `wt0`.) catalog, applied atomically — a malformed or empty catalog can never lock out
management. The control-node source is part of the guaranteed plane, not the service
catalog (it is management, not a service); see ADR-021 for the access doctrine.
So "per-host vs central" is answered: **both**, with clear ownership. So "per-host vs central" is answered: **both**, with clear ownership.

View file

@ -0,0 +1,206 @@
# ADR-021 — Operational access: documented, verifiable ways in
## Status
Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access
will be rare) and TODO 3.2 (the service admin-API access question).
**Doctrine ADR.** It pins the operational-access doctrine, the declarative `access__*`
data model, the rendered `ACCESS.md` record, and the `/check-access` verifier. It does
**not** build any of them — `base`'s non-firewall concerns, service roles, and live
hosts do not exist yet. Designed now, built when there is something to access (see
*Scope*). Reconciles a latent contradiction between ADR-016 and ADR-020 (see
*Reconciliation*).
## Context
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
ports (ADR-002/ADR-020). That posture is correct — but it leaves one operational
question unanswered: **when a host or service breaks, how does the operator (and the AI
working from `ubongo`) actually get in to troubleshoot it?**
Troubleshooting is far more effective with *several* documented ways in — SSH, container
exec, logs, an admin API — so a single broken path does not mean blind. Today boma has no
standard guaranteeing those paths exist, are documented, or still work. The risk is the
classic one: the access you assumed you had is stale exactly when you need it (key
rotated, API disabled, token expired).
boma already has the right *shape*. Service roles carry record docs — `SECURITY.md`
(security answers) and `VERIFY.md` (acceptance spec). What is missing is the third
sibling — an operational-access record — and the doctrine behind it.
Two constraints shape the decision:
1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
paths over *trusted* interfaces, never new exposed ports.
2. **A documented path that is never tested drifts** — it fails exactly when needed. So
the access facts must be *data* that both renders the doc and drives an active
verifier; the two can then never disagree.
## Decision
### The doctrine
> **Every host and every service guarantees at least one documented, verifiable way in
> for operational troubleshooting — and the deploy that creates it also records and
> proves it.**
Access is a deployment deliverable, not something rediscovered under pressure. The deploy
that creates a host/service also records its access paths and (by design) proves them.
### Two layers
- **Host layer** (resolves TODO 7.2). Every host, via the `base` role, guarantees a fixed
access baseline: SSH over `wt0` and from `ubongo` (the ladder below), Docker/Compose
tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a
known, uniform set of paths exists over trusted interfaces. The break-glass console per
host class is recorded once at this layer. This is boma's answer to "what every host
runs for access."
- **Service layer** (resolves TODO 3.2). Every service role guarantees and records its
own paths: container exec + compose management, its Loki log labels, and its admin API
where one exists (enabled, token in vault, endpoint + health probe documented) — or an
explicit "no API."
### The three-tier access ladder
1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
before SSH sees it. The preferred path (ADR-016's original rationale).
2. **LAN SSH from `ubongo` only — secondary, mesh-independent.** All hardware but
`askari` shares a LAN. SSH from `ubongo`'s LAN address is allowed, giving a fallback
that survives a NetBird/`wt0` outage. It is gated by *source IP* (spoofable on a LAN)
**plus** the standing keys-only + fail2ban SSH hardening (ADR-002), so the marginal
cost is "SSH daemon reachable from one trusted LAN host" — modest and deliberate. All
*other* LAN hosts stay default-denied.
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, never
exercised for routine work:
- **Cluster VMs** → Proxmox serial/VNC console — independent of the guest network,
`wt0`, and even a broken guest nftables ruleset.
- **`askari`** (bare-metal Hetzner) → provider rescue/console.
- **`ubongo`** (physical) → local console.
A total mesh outage therefore still leaves exactly one documented way in to each box.
### Reconciliation, not weakening
ADR-016 already requires Ansible to reach the fleet by LAN IP — "a mesh/coordinator
outage never blocks on-LAN runs" — which **requires** LAN SSH from `ubongo`. Yet ADR-016
also stated "SSH only on `wt0`," and ADR-020's guaranteed management plane listed only
`wt0`. That was a latent contradiction. ADR-021 resolves it by making the control-node
SSH allow **explicit** and adding it to the guaranteed management plane. This does **not**
weaken default-deny: it admits exactly one extra trusted source on the LAN (`ubongo`),
keys-only + fail2ban-gated; every other LAN host stays denied. ADR-016 and ADR-020 are
amended to cross-reference this ladder.
### The declarative `access__*` data model
Structured access facts live as **data** — the single source of truth that both renders
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge
(the firewall-catalog philosophy of ADR-020, applied to access).
Each service role's defaults carry:
```yaml
access__service: photoprism
access__compose_project: photoprism # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db] # exec targets
access__log:
loki_labels: { service: photoprism } # how to query logs (ADR-018)
access__api:
enabled: true
base_url: "http://photoprism.srv:2342" # reachable over the mesh
firewall_ref: photoprism-api # the catalog entry that opens it (ADR-020)
auth: { vault_ref: "vault.photoprism.api_token" }
health_path: "/api/v1/status" # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
```
**Invariant — `access__api` never opens a port.** It `firewall_ref`s an entry in the
`group_vars` firewall catalog; ADR-020 stays the **sole owner of exposure**. The access
data adds only *how to use* the path (endpoint, token ref, health probe) — no duplication,
no ad-hoc ports (CLAUDE.md: ports only in the catalog).
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
uniform, so it is asserted by `base` and recorded once at the host/group level, not
re-stated per service.
### The rendered record — `ACCESS.md`
`ACCESS.md` is a first-class sibling of `SECURITY.md`/`VERIFY.md`, **rendered** from the
`access__*` data with a prose tail for the narrative parts:
- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
invocation.
- **Break-glass (generated from host class)** — the Proxmox/provider/local console line.
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
part a template cannot know.
A `docs/access/service-access-template.md` defines the shape, alongside the existing
security/verify templates.
### The verifier — `/check-access`
`/check-access <service|host>` runs from `ubongo` and turns the `access__*` data into
live probes, reporting which declared paths are green right now — the access analogue of
`/verify-service` (ADR-017). It probes mesh SSH, LAN SSH, exec + compose, Loki logs, and
the admin API health path; on any red it names the path and the likely cause. **Break-glass
is checked for reachability only, never exercised** — firing a serial console is invasive,
so the verifier confirms the fallback *exists* without disrupting anything. Designed now,
**build-pending on infra** (needs live hosts + staging + vault), exactly like
`/verify-service` under ADR-017.
### Governance
Three light touches, mirroring how `SECURITY.md`/`VERIFY.md` are enforced: the service
checklist (`docs/security/service-checklist.md`) gains an access item; the `new-role`
runbook gains a fill/render/`check-access` step (step 11: copy
`docs/access/service-access-template.md` into `roles/<service>/ACCESS.md` and populate the
`access__*` data); and a service-checklist gate item blocks clearance until the record
exists and `/check-access` is green (or a deviation is recorded in `accepted-risks.md`).
No scaffold change — same manual-copy-plus-review pattern the sibling records
(`SECURITY.md`/`VERIFY.md`) use.
## Consequences
- Every host and service has at least one documented, verifiable way in — and a verifier
that proves it, so stale access is caught before an outage, not during one.
- Doc and verifier share one source of truth (`access__*`), so they cannot drift apart.
- The management plane gains exactly one extra trusted LAN source (`ubongo`); attack
surface grows by one keys-only + fail2ban-gated SSH path, no new exposed ports.
- Cost: per-service `access__*` declarations and a rendered `ACCESS.md` to maintain
(mitigated by the uniform host baseline + the new-role runbook step + checklist gate), plus `/check-access` to build.
## Scope
Delivered by ADR-021's implementation plan
(`docs/superpowers/plans/2026-06-09-operational-access.md`), task by task, and tracked in
`STATUS.md` as it lands — not all of it exists at the moment this ADR is written. The split
below is near-term tranche vs longer build-pending, not instant-existence vs not.
**Near-term tranche (this plan):** the doctrine; this ADR; the `ACCESS.md` template; the
`ssh-from-control` firewall management-plane source — added to ADR-020's *guaranteed
management plane* (the always-allowed block that already holds the `wt0` SSH/Ansible allow
and is explicitly independent of the service catalog), not added to the catalog itself (the
catalog owns service ingress only) — via the `base__firewall_control_addr` knob and its
nftables rule, both of which do **not** exist in `roles/base` yet and land with the
`firewall` concern of `base`; and the governance wiring (checklist item, new-role runbook step). ADR-016 and ADR-020 are amended to reference the ladder.
**Build-pending on infra:** per-service `access__*` data and rendered `ACCESS.md` files
(wait on service roles), `/check-access` *running* (waits on live hosts + staging + vault),
and the real `ubongo` LAN address value behind `base__firewall_control_addr`. Designed now,
built when there is something to verify.
**Out of scope:** broader LAN SSH (a management VLAN) — explicitly rejected, `ubongo`-only;
exercising (vs reachability-probing) the break-glass console; any access path that is not
over the mesh or the one `ubongo` LAN source.
## Related
ADR-002 (security baseline: SSH hardening, default-deny, fail2ban), ADR-004 (Docker
model, Compose), ADR-016 (NetBird mesh; amended — SSH on `wt0` **and** from `ubongo`'s
LAN address), ADR-017 (`/verify-service` Level-4 verification), ADR-018 (logging:
Alloy → Loki/Grafana), ADR-020 (firewall: service catalog + guaranteed management plane;
amended — adds the `ssh-from-control` management-plane source), ADR-019 (`firewall` tag).

View file

@ -91,7 +91,19 @@ For a **service** role, copy `docs/testing/service-verify-template.md` to
Level 4 `/verify-service` check (ADR-008 / ADR-017) and is part of the pre-production Level 4 `/verify-service` check (ADR-008 / ADR-017) and is part of the pre-production
service-clearance gate (`docs/security/service-checklist.md`). service-clearance gate (`docs/security/service-checklist.md`).
### 11. Commit ### 11. Write the per-service operational-access record (services)
For a **service** role, copy `docs/access/service-access-template.md` to
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
`access__log.loki_labels`, and `access__api``enabled` + endpoint + `firewall_ref` +
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
rendered from that data; the admin-API path must `firewall_ref` an entry in the
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
`/check-access <rolename>` proves the documented paths are live — part of the
service-clearance gate (`docs/security/service-checklist.md`).
### 12. Commit
```bash ```bash
git checkout -b role/<rolename> git checkout -b role/<rolename>

View file

@ -51,6 +51,10 @@ This checklist is the generic **bar**. Each service answers it in its own
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the - [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the
service has a populated `roles/<service>/VERIFY.md` and its critical journeys service has a populated `roles/<service>/VERIFY.md` and its critical journeys
verified (ADR-008 Level 4 / ADR-017) verified (ADR-008 Level 4 / ADR-017)
- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
documented paths green — or a deviation is recorded in
`docs/security/accepted-risks.md`
> Deviations are allowed but must be **conscious**: record them in > Deviations are allowed but must be **conscious**: record them in
> `docs/security/accepted-risks.md`, don't leave them implicit. > `docs/security/accepted-risks.md`, don't leave them implicit.

View file

@ -0,0 +1,544 @@
# Operational Access (ADR-021) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Establish operational access as a deployment deliverable — a documented, verifiable set of mesh-reachable troubleshooting paths for every host and service — by writing ADR-021, reconciling the latent ADR-016/020 SSH contradiction, adding the control-node SSH source to the host firewall, and wiring the `ACCESS.md` record + `/check-access` verifier into boma's governance.
**Architecture:** Source of truth is the committed design spec `docs/superpowers/specs/2026-06-09-operational-access-design.md`. Structured access facts live as declarative `access__*` data that renders `ACCESS.md` and drives `/check-access` (the access analogue of `VERIFY.md` + `/verify-service`). Work is split into **Tranche A — land now** (doctrine docs, the one firewall code change, the dormant `/check-access` command, governance wiring) and **Tranche B — build-pending on infra** (per-service `access__*` population, rendered `ACCESS.md` files, and `/check-access` *running*), which arrive with service roles and live hosts and require no action in this plan.
**Tech Stack:** Markdown ADRs/docs; Ansible role `base` (Jinja2 nftables template + `defaults/main.yml`); Molecule (Debian 13, render + `nft -c`, no apply) for the firewall test; Claude Code command file for `/check-access`.
---
## File structure
| File | Tranche | Responsibility |
|---|---|---|
| `docs/decisions/021-operational-access.md` | A | NEW — the doctrine (two layers, three-tier ladder, break-glass, `access__*` model, `/check-access`) |
| `docs/decisions/016-mesh-vpn.md` | A | MODIFY — reconcile: SSH on `wt0` **and** from `ubongo`'s LAN address |
| `docs/decisions/020-firewall.md` | A | MODIFY — guaranteed management plane gains the control-node SSH source |
| `docs/access/service-access-template.md` | A | NEW — the `ACCESS.md` record shape (rendered-from-data + prose tail) |
| `roles/base/defaults/main.yml` | A | MODIFY — add `base__firewall_control_addr` knob (default empty → no-op) |
| `roles/base/templates/nftables.conf.j2` | A | MODIFY — conditional management-plane SSH rule for the control address |
| `roles/base/molecule/default/converge.yml` | A | MODIFY — set the knob for the test |
| `roles/base/molecule/default/verify.yml` | A | MODIFY — assert the rendered rule |
| `.claude/commands/check-access.md` | A | NEW — the `/check-access` verifier command (dormant until infra exists) |
| `docs/security/service-checklist.md` | A | MODIFY — one new gate item |
| `docs/runbooks/new-role.md` | A | MODIFY — new step: write `ACCESS.md` (mirrors SECURITY/VERIFY steps) |
| `CLAUDE.md` | A | MODIFY — `ACCESS.md` in Role conventions; ADR-021 in Further reading |
| `STATUS.md` | A | MODIFY — new rows for the doctrine, the firewall source, `/check-access` |
| `docs/TODO.md` | A | MODIFY — mark 3.2 + 7.2 DECIDED → ADR-021 |
**Tranche B (no tasks here — captured for the record):** per-service `access__*` blocks + rendered `roles/<svc>/ACCESS.md` land when each service role is built (governed by the Tranche-A checklist + runbook); `/check-access` *running* lands when `ubongo` + staging + vault exist. Both are designed-now, build-pending — exactly like `/verify-service` under ADR-017.
---
## Tranche A — Land now
### Task 1: Write ADR-021
**Files:**
- Create: `docs/decisions/021-operational-access.md`
The ADR is the durable decision record derived from the committed spec
`docs/superpowers/specs/2026-06-09-operational-access-design.md`. Match the prose style and
heading shape of an existing ADR (read `docs/decisions/020-firewall.md` first). The ADR
**must** state these specifics — they are the parts easy to get wrong:
- **Doctrine sentence (verbatim):** *"Every host and every service guarantees at least one
documented, verifiable way in for operational troubleshooting — and the deploy that
creates it also records and proves it."*
- **Two layers:** host baseline (resolves TODO 7.2) + per-service record (resolves TODO 3.2).
- **Three-tier access ladder:** (1) `wt0` mesh SSH — primary, WireGuard-authenticated;
(2) LAN SSH from `ubongo` only — secondary, mesh-independent, source-IP-gated **plus**
keys-only + fail2ban; all other LAN hosts stay default-denied; (3) console — break-glass
per host class: cluster VMs → Proxmox serial/VNC console, `askari` → Hetzner
rescue/console, `ubongo` → local console; reachability-checked, never exercised.
- **Reconciliation, not weakening (state this explicitly):** ADR-016 already requires
Ansible to reach the fleet by LAN IP ("a mesh/coordinator outage never blocks on-LAN
runs"), which *requires* LAN SSH from `ubongo`; yet ADR-016 also said "SSH only on `wt0`"
and ADR-020's guaranteed management plane listed only `wt0`. ADR-021 resolves that latent
contradiction by making the control-node SSH allow explicit and adding it to the
guaranteed management plane. It does **not** weaken default-deny: exactly one extra
trusted source on the LAN.
- **Declarative `access__*` data model:** service-role defaults carry `access__service`,
`access__compose_project`, `access__compose_path`, `access__containers`,
`access__log.loki_labels`, and `access__api` (`enabled`, `base_url`, `firewall_ref`,
`auth.vault_ref`, `health_path`; or `enabled: false` + `reason`). **Invariant:**
`access__api` never opens a port — it `firewall_ref`s the `group_vars` firewall catalog;
ADR-020 stays the sole owner of exposure.
- **Rendered record:** `ACCESS.md` is rendered from that data + a prose tail (operational
notes / gotchas). First-class sibling of `SECURITY.md`/`VERIFY.md`.
- **`/check-access`:** the verifier that probes each declared path and reports which are
live; break-glass reachability-only; designed now, build-pending on infra.
- **Status / consequences:** what lands now vs build-pending (mirror this plan's split).
- [ ] **Step 1: Author the ADR**
Write `docs/decisions/021-operational-access.md` covering every bullet above, in the
house style of `docs/decisions/020-firewall.md` (problem → decision → layers/ladder →
data model → verifier → consequences). Open with a one-line title heading
`# ADR-021 — Operational access: documented, verifiable ways in`.
- [ ] **Step 2: Sanity-check internal links**
Run: `grep -n "ADR-01[67]\|ADR-020\|access__\|check-access\|ACCESS.md" docs/decisions/021-operational-access.md`
Expected: references to ADR-016, ADR-020, the `access__*` keys, `/check-access`, and
`ACCESS.md` all present.
- [ ] **Step 3: Commit**
```bash
git add docs/decisions/021-operational-access.md
git commit -m "docs(access): add ADR-021 operational-access doctrine"
```
---
### Task 2: Reconcile ADR-016 and ADR-020
**Files:**
- Modify: `docs/decisions/016-mesh-vpn.md` (the "Host firewall" bullet, ~line 64-65)
- Modify: `docs/decisions/020-firewall.md` (the "Guaranteed management plane" bullet, ~line 42-45)
- [ ] **Step 1: Amend ADR-016's Host-firewall bullet**
Replace the existing bullet:
```markdown
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
```
with:
```markdown
- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
(primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
explicit the control-node SSH allow that the recovery model already implied; the access
doctrine and the three-tier access ladder live in **ADR-021**.
```
- [ ] **Step 2: Amend ADR-020's guaranteed-management-plane bullet**
Replace:
```markdown
- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
is allowed only on `wt0`.)
```
with:
```markdown
- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
catalog, applied atomically — a malformed or empty catalog can never lock out
management. The control-node source is part of the guaranteed plane, not the service
catalog (it is management, not a service); see ADR-021 for the access doctrine.
```
- [ ] **Step 3: Commit**
```bash
git add docs/decisions/016-mesh-vpn.md docs/decisions/020-firewall.md
git commit -m "docs(access): reconcile ADR-016/020 with control-node SSH source (ADR-021)"
```
---
### Task 3: The `ACCESS.md` record template
**Files:**
- Create: `docs/access/service-access-template.md`
Match the preamble convention of `docs/security/service-security-template.md` and
`docs/testing/service-verify-template.md` (a "copy this to `roles/<service>/ACCESS.md`"
preamble, then a `---`, then the record).
- [ ] **Step 1: Write the template**
Create `docs/access/service-access-template.md`:
```markdown
# Per-service operational-access record — template
Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
It is the per-service **operational-access record**: every documented, verifiable way in
for troubleshooting. The structured parts are **rendered from the role's `access__*`
data** (the single source of truth that also drives `/check-access`) — keep the data
authoritative and regenerate this file rather than hand-editing the tables. The prose
"Operational notes" tail is hand-written.
Delete this preamble in the copy and start from the heading below.
---
# Access — <service>
## Access paths
The mesh-reachable ways in, by tier (rendered from `access__*`):
| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |
## Break-glass
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>
## Operational notes
Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.
- <none yet>
```
- [ ] **Step 2: Commit**
```bash
git add docs/access/service-access-template.md
git commit -m "docs(access): add ACCESS.md service record template"
```
---
### Task 4: Add the control-node SSH source to the host firewall (TDD)
**Files:**
- Modify: `roles/base/defaults/main.yml`
- Modify: `roles/base/templates/nftables.conf.j2`
- Modify: `roles/base/molecule/default/converge.yml`
- Modify: `roles/base/molecule/default/verify.yml`
This is the only code in Tranche A. It adds an **optional** guaranteed-management-plane
allow for SSH from the control node's LAN address. Default empty ⇒ no rule rendered ⇒
no behaviour change until a real `ubongo` address is set in `group_vars` (build-pending).
Test path is the established one for this role: Molecule render + `nft -c` (no apply).
- [ ] **Step 1: Write the failing test — converge sets the knob, verify asserts the rule**
In `roles/base/molecule/default/converge.yml`, add the knob under `vars:` (alongside
`base__firewall_apply: false`):
```yaml
base__firewall_control_addr: 10.10.0.99 # test control-node LAN address
```
In `roles/base/molecule/default/verify.yml`, extend the "management plane" assert block's
`that:` list (the task asserting default-deny + `wt0` SSH) with:
```yaml
- "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"
```
- [ ] **Step 2: Run the test to verify it fails**
Run: `make test ROLE=base`
Expected: FAIL — the verify assert "input chain is missing default-deny or the management
plane" fires, because the template does not yet render the control-address rule.
- [ ] **Step 3: Add the default knob**
In `roles/base/defaults/main.yml`, after the `base__firewall_mgmt_interface` line, add:
```yaml
base__firewall_control_addr: "" # control-node LAN address (ubongo); SSH allowed from it
# as the guaranteed-management-plane `ssh-from-control`
# source (ADR-021). Empty = no rule. Set in group_vars
# once ubongo exists.
```
- [ ] **Step 4: Render the rule in the template**
In `roles/base/templates/nftables.conf.j2`, immediately after the `wt0` SSH line (the
`iifname "{{ base__firewall_mgmt_interface }}" ...` line), add:
```jinja
{% if base__firewall_control_addr %}
ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endif %}
```
- [ ] **Step 5: Run the test to verify it passes**
Run: `make test ROLE=base`
Expected: PASS — the rule `ip saddr 10.10.0.99 tcp dport 22 accept` renders, `nft -c`
syntax-check succeeds, and all prior assertions (default-deny, `wt0` SSH, zone rules,
drop-in hook) still pass.
- [ ] **Step 6: Lint**
Run: `make lint`
Expected: PASS (no tag/FQCN/yaml regressions).
- [ ] **Step 7: Commit**
```bash
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): add ssh-from-control management-plane source (ADR-021)"
```
---
### Task 5: Author the `/check-access` command (dormant until infra)
**Files:**
- Create: `.claude/commands/check-access.md`
Mirror the structure of `.claude/commands/verify-service.md` (a forward-looking command
with a hard Prerequisites gate). It does not run until `ubongo` + live/staging hosts +
vault exist; if a prerequisite is missing it must say so and stop.
- [ ] **Step 1: Write the command**
Create `.claude/commands/check-access.md`:
```markdown
Operational-access verification (ADR-021)
Probe every documented way in to a service or host from `ubongo` and report which paths
are live. Reads the target's `access__*` data (and host baseline), so the verifier and
`ACCESS.md` can never disagree. Argument: a service/role name or a host
(e.g. `/check-access photoprism`, `/check-access docker01`).
## Prerequisites (forward-looking — ADR-021 dependencies)
This skill cannot run until these exist; if any is missing, say so and stop — do not
improvise around it:
- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
- The target host/service is deployed (staging or production inventory).
- `roles/<name>/` carries `access__*` data (services) / the host baseline applies.
- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
## Process
### Phase 0 — resolve the target
Resolve the argument to a host or a service role + its host. Load the `access__*` data
(service) or the host-baseline + break-glass record (host). State what you will probe.
### Phase 1 — probe each declared path
| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
Break-glass is **never exercised** — firing a serial console is invasive; confirm the
fallback exists, do not drive it.
### Phase 2 — report
Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".
## Notes
- Read-only and non-destructive — probes confirm reachability, they do not change state.
- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
control node + hosts exist.
```
- [ ] **Step 2: Commit**
```bash
git add .claude/commands/check-access.md
git commit -m "feat(access): add /check-access verifier command (ADR-021, dormant)"
```
---
### Task 6: Governance wiring — checklist + runbook
**Files:**
- Modify: `docs/security/service-checklist.md` (the "Operability (security-adjacent)" section)
- Modify: `docs/runbooks/new-role.md` (after step 10, the VERIFY.md step)
ACCESS.md mirrors how SECURITY.md/VERIFY.md are enforced: a manual runbook step + a
checklist gate (the scaffold does not auto-drop SECURITY/VERIFY today either, so ACCESS
follows the same manual-copy pattern — no Makefile change).
- [ ] **Step 1: Add the checklist gate item**
In `docs/security/service-checklist.md`, under `## Operability (security-adjacent)`, add a
bullet after the `/verify-service` item:
```markdown
- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
documented paths green — or a deviation is recorded in
`docs/security/accepted-risks.md`
```
- [ ] **Step 2: Add the runbook step**
In `docs/runbooks/new-role.md`, insert a new step between step 10 (VERIFY.md) and the
final commit step, and renumber the commit step to 12:
```markdown
### 11. Write the per-service operational-access record (services)
For a **service** role, copy `docs/access/service-access-template.md` to
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
`access__log.loki_labels`, and `access__api``enabled` + endpoint + `firewall_ref` +
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
rendered from that data; the admin-API path must `firewall_ref` an entry in the
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
`/check-access <rolename>` proves the documented paths are live — part of the
service-clearance gate (`docs/security/service-checklist.md`).
```
- [ ] **Step 3: Verify renumbering**
Run: `grep -n "^### 1[12]\." docs/runbooks/new-role.md`
Expected: `### 11. Write the per-service operational-access record` and `### 12. Commit`.
- [ ] **Step 4: Commit**
```bash
git add docs/security/service-checklist.md docs/runbooks/new-role.md
git commit -m "docs(access): gate ACCESS.md in checklist + new-role runbook (ADR-021)"
```
---
### Task 7: Index wiring — CLAUDE.md, STATUS.md, TODO.md
**Files:**
- Modify: `CLAUDE.md` (Role conventions list + Further reading table)
- Modify: `STATUS.md` (Designed-but-not-built table)
- Modify: `docs/TODO.md` (items 3.2 and 7.2)
- [ ] **Step 1: CLAUDE.md — Role conventions**
In the `## Role conventions` list, after the `VERIFY.md` bullet
("Every **service** role must have a populated `VERIFY.md` ..."), add:
```markdown
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
```
- [ ] **Step 2: CLAUDE.md — Further reading**
In the Further reading table, after the Firewall strategy row, add:
```markdown
| Operational access | `docs/decisions/021-operational-access.md` |
```
- [ ] **Step 3: STATUS.md — new rows**
In the `## Designed but not built` table, add:
```markdown
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
```
- [ ] **Step 4: docs/TODO.md — mark 3.2 and 7.2 DECIDED**
In `docs/TODO.md`, change item **3.2** from:
```markdown
2. Decide how to manage APIs / API access.
```
to:
```markdown
2. ~~Decide how to manage APIs / API access.~~ DECIDED (ADR-021): per-service `access__*`
data declares the admin API (endpoint + `firewall_ref` to the catalog + vault token
ref + health path); rendered into `ACCESS.md` and probed by `/check-access`. Part of
the two-layer operational-access doctrine.
```
And change item **7.2** from:
```markdown
2. Decide what to set up on the hosts, given that direct access will be rare.
```
to:
```markdown
2. ~~Decide what to set up on the hosts, given that direct access will be rare.~~
DECIDED (ADR-021): the host-layer access baseline — SSH on `wt0` + from `ubongo`,
Docker/Compose tooling, Alloy log shipping, and a recorded break-glass console per
host class.
```
- [ ] **Step 5: Verify and commit**
Run: `grep -n "021-operational-access\|ACCESS.md\|ssh-from-control" CLAUDE.md STATUS.md`
Expected: the new Role-conventions bullet, the Further-reading row, and the STATUS rows
are present.
```bash
git add CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(access): wire ADR-021 into CLAUDE.md, STATUS, TODO"
```
---
## Tranche B — Build-pending on infra (no tasks now)
Recorded so the boundary is explicit; nothing here is actioned by this plan.
- **Per-service `access__*` + rendered `ACCESS.md`** — authored when each service role is
built, governed by the Task 6 checklist item + runbook step. The first real service role
is where this first runs.
- **`/check-access` running** — needs `ubongo` + a live/staging host + vault. The command
(Task 5) already gates on these and stops cleanly until then.
- **Real `base__firewall_control_addr` value** — set in `group_vars/all` to `ubongo`'s LAN
address once `ubongo` is in inventory; the machinery + test landed in Task 4.
---
## Self-review
**Spec coverage:** doctrine + two layers → Task 1; three-tier ladder + ADR-016/020
reconciliation → Tasks 12, 4; `access__*` model + invariant → Tasks 1, 3, 6; rendered
`ACCESS.md` → Task 3; `/check-access` → Task 5; governance (checklist/runbook) → Task 6;
repo wiring (CLAUDE/STATUS/TODO) → Task 7; build-now vs build-pending split → Tranches
A/B. All spec sections map to a task.
**Deviations from the spec (deliberate, flagged for the user):**
1. The spec called `ssh-from-control` a *catalog* source; the plan places it in the
*guaranteed management plane* (`base__firewall_control_addr`) instead — ADR-020 already
houses SSH/Ansible management allows there, independent of the catalog, and the spec's
own invariant says the catalog owns *service* exposure only. Same intent, correct home.
2. The spec said `make new-role` would *scaffold* an `ACCESS.md` stub; the plan instead adds
a manual runbook step (Task 6) mirroring how `SECURITY.md`/`VERIFY.md` are handled today
(also manual copies, not scaffolded). Avoids unilaterally restructuring the scaffold;
the "can't be forgotten" intent is met by the checklist gate + runbook step.
**Type/name consistency:** `base__firewall_control_addr` (knob), `access__service` /
`access__compose_project` / `access__compose_path` / `access__containers` /
`access__log.loki_labels` / `access__api.{enabled,base_url,firewall_ref,auth.vault_ref,health_path}`
are used identically across Tasks 1, 3, 5, 6. The rendered nftables rule string
`ip saddr <addr> tcp dport 22 accept` matches between Task 4's template (Step 4) and its
assertion (Step 1).

View file

@ -0,0 +1,214 @@
# Design — Operational access (ADR-021)
- **Date:** 2026-06-09
- **Status:** Approved design — pending implementation plan
- **Implements:** New ADR-021. Resolves TODO 3.2 (API / API access) and TODO 7.2
(what to set up on hosts, given direct access will be rare).
- **Amends:** ADR-016 (SSH was mesh-only; now also from `ubongo`'s LAN address) and
ADR-020 (adds an `ssh-from-control` symbolic catalog source).
- **Scope:** The operational-access *doctrine* + the declarative `access__*` data model,
the rendered `ACCESS.md` record, and the `/check-access` verifier design. It does **not**
build any of it — `base`/service roles and live hosts don't exist yet. Designed now,
built when there is something to access.
---
## Problem
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
ports (ADR-002/020). That posture is correct — but it leaves an unanswered operational
question: **when a service or host breaks, how does the operator (and the AI working on
boma's behalf from `ubongo`) actually get in to troubleshoot it?**
Experience on similar projects shows troubleshooting is far more effective with *several*
documented ways in — SSH, container exec, logs, an admin API — so a single broken path
doesn't mean blind. Today boma has no standard guaranteeing those paths exist, are
documented, or still work. The risk is the classic one: the access you assumed you had is
stale exactly when you need it (key rotated, API disabled, token expired).
boma already has the right *shape* for the fix. Service roles carry record docs —
`SECURITY.md` (security answers) and `VERIFY.md` (acceptance spec) — gated by the service
checklist and the `new-role` runbook. What's missing is the third sibling: an
**operational access record**, plus the doctrine behind it.
Two constraints shape the design:
1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
paths over the *trusted* interface, never new exposed ports. Resolution: all routine
access runs over the mesh from `ubongo`.
2. **A documented path that is never tested drifts.** It fails exactly when needed. So
the structured access facts must be *data* that both renders the doc and drives an
active verifier — the two can then never disagree.
## Decisions settled in brainstorming
- **Access is a deployment deliverable.** The deploy that creates a host/service also
records and (by design) proves its access paths. Not rediscovered under pressure.
- **All routine access over the mesh** (`wt0`, from `ubongo`). No new LAN/WAN exposure.
- **Two layers:** a host-level access baseline (resolves TODO 7.2) and a per-service
access record (resolves TODO 3.2).
- **Baseline paths, every service:** host SSH, container exec + compose, logs
(Loki/Grafana, ADR-018), and the service admin API where one exists (`n/a` otherwise).
- **A new first-class sibling record** `ACCESS.md` (next to `SECURITY.md`/`VERIFY.md`),
**rendered from declarative data** — not hand-written prose (the firewall-catalog
philosophy of ADR-020 applied to access).
- **Active verification designed in:** a `/check-access` skill probes the declared paths
and reports which are live — the access analogue of `/verify-service` (ADR-017).
- **Direct LAN SSH from `ubongo` only** is added as a second, mesh-independent path
(amends ADR-016); all other LAN hosts stay blocked by default-deny.
## The doctrine
> **Every host and every service guarantees at least one documented, verifiable way in
> for operational troubleshooting — and the deploy that creates it also records and
> proves it.**
### Two layers
- **Host layer** (TODO 7.2). Every host, via the `base` role, guarantees a fixed access
baseline: SSH over `wt0` and from `ubongo` (below), Docker/Compose tooling present, and
log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a known, uniform set of
paths exists over the mesh. This is boma's answer to "what every host runs for access."
- **Service layer** (TODO 3.2). Every service role guarantees and records its paths:
container exec + compose management, its Loki log labels, and its admin API where one
exists (enabled, token in vault, endpoint + health probe documented) or explicit `n/a`.
### The three-tier access ladder
1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
before SSH sees it. The preferred path (ADR-016's original rationale).
2. **LAN SSH from `ubongo` — secondary, mesh-independent.** Most hardware (all but
`askari`) shares a LAN. SSH from `ubongo`'s LAN address is allowed via a new catalog
source, giving a fallback that survives a NetBird/`wt0` outage. It is gated by *source
IP* (spoofable on a LAN) **plus** the standing keys-only + fail2ban SSH hardening, so
the marginal cost is "SSH daemon reachable from the LAN broadcast domain from one
trusted host" — modest and deliberate. All *other* LAN hosts remain default-denied.
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, not
used for routine work:
- **Cluster VMs** → Proxmox serial/VNC console (`qm terminal` / console via the
Proxmox host) — independent of the guest network, `wt0`, and even a broken guest
nftables ruleset.
- **`askari`** (bare-metal Hetzner) → provider rescue/console.
- **`ubongo`** (physical) → local console.
A total mesh outage therefore still leaves exactly one documented way in to each box.
## The declarative access data model (Approach B)
Structured access facts live as **data** — the single source of truth that both renders
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge.
### Service-layer — `access__*` in each service role's defaults
```yaml
access__service: photoprism
access__compose_project: photoprism # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db] # exec targets
access__log:
loki_labels: { service: photoprism } # how to query logs (ADR-018)
access__api:
enabled: true
base_url: "https://photoprism.host:2342" # reachable over the mesh
firewall_ref: photoprism-api # the catalog entry that opens it (ADR-020)
auth: { type: token, vault_ref: "vault.photoprism.api_token" }
health_path: "/api/v1/status" # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
```
**Single-source-of-truth rule:** `access__api` **never opens a port**. It `firewall_ref`s
the entry in the `group_vars` firewall catalog — ADR-020 stays the sole owner of
*exposure*. The access data adds only *how to use* the path (endpoint, token ref, health
probe). No duplication, no ad-hoc ports (CLAUDE.md: ports only in the catalog).
### Host-layer — a fixed baseline, stated once
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
uniform, so it is asserted by `base` and recorded once at the host/group level — not
re-stated per service. The break-glass console per host class is recorded with it.
## The rendered record — `ACCESS.md`
`ACCESS.md` is **rendered** from the `access__*` data, with a prose tail for the genuinely
narrative parts:
- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
invocation (`ssh host`, `docker compose -p <project> …`, the Loki query, the `curl`
against the API health path).
- **Break-glass (generated from host class)** — the Proxmox/provider console line.
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
part a template cannot know.
A `docs/access/service-access-template.md` defines the shape, alongside the existing
security/verify templates.
## The verifier — `/check-access` (designed now, build-pending on infra)
Runs from `ubongo`; turns the `access__*` data into live probes. Invoked
`/check-access <service>` (or `<host>` for the host baseline). The access analogue of
`/verify-service` (ADR-017).
| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via LAN addr, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each container | stack up, exec works |
| logs | query Loki for `loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` the `health_path` with the vault token | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint only | console host reachable |
- **Break-glass is checked for reachability, not exercised** — firing a serial console is
invasive; the verifier confirms the fallback *exists* without disrupting anything.
- **Output:** a pass/fail table; on any red, it names the path and the likely cause
("API token in vault stale", "Alloy not shipping", "`ssh-from-control` catalog source
missing"). The payoff: not "the doc *says* you can get in" but "verified — three of four
paths green right now, here's the broken one."
- **Status:** designed now, build-pending on infra (needs live hosts + staging + vault),
exactly like `/verify-service` under ADR-017.
## Governance — so it can't be forgotten
Three light touches mirror how `SECURITY.md`/`VERIFY.md` are enforced:
1. **Service checklist** (`docs/security/service-checklist.md`) gains one item: *"Access
paths declared (`access__*`), `ACCESS.md` rendered, `/check-access` green — or
deviation recorded in `accepted-risks.md`."*
2. **`new-role` runbook** (`docs/runbooks/new-role.md`) gains a step: fill `access__*`,
render `ACCESS.md`, run `/check-access`.
3. **`make new-role` scaffold** drops a stub `access__*` block + the `ACCESS.md` template
into the role — the same way roles already get `SECURITY.md`/`VERIFY.md` stubs, so it
is structurally impossible to ship a service role with no access record.
## Repo wiring
- **`docs/decisions/021-operational-access.md`** — the new ADR (doctrine, both layers,
the three-tier ladder, break-glass, the `access__*` model, `/check-access`).
- **`docs/decisions/016-mesh-vpn.md`** — amend: SSH on `wt0` **and** from `ubongo`'s LAN
address (was mesh-only). Cross-link ADR-021.
- **`docs/decisions/020-firewall.md`** — note the new `ssh-from-control` symbolic source.
- **`docs/access/service-access-template.md`** — the rendered `ACCESS.md` shape.
- **`docs/security/service-checklist.md`** — the one new gate item.
- **`docs/runbooks/new-role.md`** — the fill/render/`check-access` step.
- **`CLAUDE.md`** — `ACCESS.md` under "Role conventions"; ADR-021 in Further reading.
- **`STATUS.md`** — rows: ADR-021 doctrine *(designed)*; `ssh-from-control` catalog source
*(designed, builds with `base` firewall)*; `/check-access` *(designed, build-pending)*.
- **`docs/TODO.md`** — mark 3.2 and 7.2 DECIDED → ADR-021.
## What is buildable now vs later
- **Now:** the doctrine, ADR-021, the `ACCESS.md` template, the checklist/runbook/scaffold
wiring, and the `ssh-from-control` catalog source (the `firewall` concern of `base`
already exists, so the source can land with it).
- **Later (build-pending on infra):** `/check-access` *running*, and per-service
`ACCESS.md` *files* — both wait on service roles + live hosts. Designed now, built when
there is something to verify.
## Out of scope
- Building `base`'s non-firewall concerns, any service role, or live hosts.
- Broader LAN SSH (a management VLAN) — explicitly rejected; `ubongo`-only.
- Exercising (vs reachability-probing) the break-glass console.
- Any access path that is not over the mesh or the one `ubongo` LAN source.

View file

@ -2,6 +2,10 @@
# Host firewall (nftables) behaviour knobs. Shared topology (firewall_catalog/ # Host firewall (nftables) behaviour knobs. Shared topology (firewall_catalog/
# firewall_zones) lives in group_vars/all, not here. See docs/decisions/020-firewall.md. # firewall_zones) lives in group_vars/all, not here. See docs/decisions/020-firewall.md.
base__firewall_mgmt_interface: wt0 # SSH accepted only on this iface (NetBird, ADR-016) base__firewall_mgmt_interface: wt0 # SSH accepted only on this iface (NetBird, ADR-016)
base__firewall_control_addr: "" # control-node LAN address (ubongo); SSH allowed from it
# as the guaranteed-management-plane `ssh-from-control`
# source (ADR-021). Empty = no rule. Set in group_vars
# once ubongo exists.
base__firewall_ssh_port: 22 base__firewall_ssh_port: 22
base__firewall_rollback_timeout: 45 # seconds before the auto-revert fires on a bad apply base__firewall_rollback_timeout: 45 # seconds before the auto-revert fires on a bad apply
base__firewall_confirm_timeout: 20 # seconds to re-establish a fresh connection post-apply base__firewall_confirm_timeout: 20 # seconds to re-establish a fresh connection post-apply

View file

@ -5,6 +5,7 @@
gather_facts: true gather_facts: true
vars: vars:
base__firewall_apply: false base__firewall_apply: false
base__firewall_control_addr: 10.10.0.99 # test control-node LAN address
firewall_zones: firewall_zones:
lan: 10.30.0.0/24 lan: 10.30.0.0/24
srv: 10.20.0.0/24 srv: 10.20.0.0/24

View file

@ -19,7 +19,10 @@
- "'type filter hook input priority 0; policy drop;' in nft" - "'type filter hook input priority 0; policy drop;' in nft"
- "'ct state established,related accept' in nft" - "'ct state established,related accept' in nft"
- "'iifname \"wt0\" tcp dport 22 accept' in nft" - "'iifname \"wt0\" tcp dport 22 accept' in nft"
fail_msg: "input chain is missing default-deny or the management plane" - "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"
fail_msg: >-
input chain is missing default-deny, the wt0 SSH allow,
or the ssh-from-control management-plane rule
- name: Assert the lan->reverse_proxy:443 ingress rule (zone source) - name: Assert the lan->reverse_proxy:443 ingress rule (zone source)
ansible.builtin.assert: ansible.builtin.assert:

View file

@ -9,6 +9,9 @@ table inet filter {
ct state established,related accept ct state established,related accept
ct state invalid drop ct state invalid drop
iifname "{{ base__firewall_mgmt_interface }}" tcp dport {{ base__firewall_ssh_port }} accept iifname "{{ base__firewall_mgmt_interface }}" tcp dport {{ base__firewall_ssh_port }} accept
{% if base__firewall_control_addr %}
ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endif %}
ip protocol icmp accept ip protocol icmp accept
ip6 nexthdr ipv6-icmp accept ip6 nexthdr ipv6-icmp accept
{% for r in base__firewall_resolved %} {% for r in base__firewall_resolved %}