docs(access): design operational-access doctrine (ADR-021)
Brainstorming spec for ADR-021: operational access as a deployment deliverable. Two layers (host baseline + per-service), a three-tier access ladder (mesh SSH -> LAN SSH from ubongo -> console break-glass), declarative access__* data rendering ACCESS.md and driving a /check-access verifier. Resolves TODO 3.2 (API access) and 7.2 (host access); amends ADR-016 (SSH also from ubongo) and ADR-020 (ssh-from-control source). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent fcfb056591
commit fd4bbbc977
1 changed files with 214 additions and 0 deletions
214
docs/superpowers/specs/2026-06-09-operational-access-design.md
Normal file
@@ -0,0 +1,214 @@
# Design — Operational access (ADR-021)

- **Date:** 2026-06-09
- **Status:** Approved design — pending implementation plan
- **Implements:** New ADR-021. Resolves TODO 3.2 (API access) and TODO 7.2
  (what to set up on hosts, given direct access will be rare).
- **Amends:** ADR-016 (SSH was mesh-only; now also from `ubongo`'s LAN address) and
  ADR-020 (adds an `ssh-from-control` symbolic catalog source).
- **Scope:** The operational-access *doctrine* + the declarative `access__*` data model,
  the rendered `ACCESS.md` record, and the `/check-access` verifier design. It does **not**
  build any of it — `base`/service roles and live hosts don't exist yet. Designed now,
  built when there is something to access.

---

## Problem

boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
ports (ADR-002/020). That posture is correct — but it leaves an unanswered operational
question: **when a service or host breaks, how does the operator (and the AI working on
boma's behalf from `ubongo`) actually get in to troubleshoot it?**

Experience on similar projects shows troubleshooting is far more effective with *several*
documented ways in — SSH, container exec, logs, an admin API — so a single broken path
doesn't leave the operator blind. Today boma has no standard guaranteeing those paths exist, are
documented, or still work. The risk is the classic one: the access you assumed you had is
stale exactly when you need it (key rotated, API disabled, token expired).

boma already has the right *shape* for the fix. Service roles carry record docs —
`SECURITY.md` (security answers) and `VERIFY.md` (acceptance spec) — gated by the service
checklist and the `new-role` runbook. What's missing is the third sibling: an
**operational access record**, plus the doctrine behind it.

Two constraints shape the design:

1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
   paths over the *trusted* interface, never new exposed ports. Resolution: all routine
   access runs over the mesh from `ubongo`.
2. **A documented path that is never tested drifts.** It fails exactly when needed. So
   the structured access facts must be *data* that both renders the doc and drives an
   active verifier — the two can then never disagree.

## Decisions settled in brainstorming

- **Access is a deployment deliverable.** The deploy that creates a host/service also
  records and (by design) proves its access paths. Not rediscovered under pressure.
- **All routine access over the mesh** (`wt0`, from `ubongo`). No new LAN/WAN exposure.
- **Two layers:** a host-level access baseline (resolves TODO 7.2) and a per-service
  access record (resolves TODO 3.2).
- **Baseline paths, every service:** host SSH, container exec + compose, logs
  (Loki/Grafana, ADR-018), and the service admin API where one exists (`n/a` otherwise).
- **A new first-class sibling record** `ACCESS.md` (next to `SECURITY.md`/`VERIFY.md`),
  **rendered from declarative data** — not hand-written prose (the firewall-catalog
  philosophy of ADR-020 applied to access).
- **Active verification designed in:** a `/check-access` skill probes the declared paths
  and reports which are live — the access analogue of `/verify-service` (ADR-017).
- **Direct LAN SSH from `ubongo` only** is added as a second, mesh-independent path
  (amends ADR-016); all other LAN hosts stay blocked by default-deny.

## The doctrine

> **Every host and every service guarantees at least one documented, verifiable way in
> for operational troubleshooting — and the deploy that creates it also records and
> proves it.**

### Two layers

- **Host layer** (TODO 7.2). Every host, via the `base` role, guarantees a fixed access
  baseline: SSH over `wt0` and from `ubongo` (below), Docker/Compose tooling present, and
  log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a known, uniform set of
  paths exists over the mesh. This is boma's answer to "what every host runs for access."
- **Service layer** (TODO 3.2). Every service role guarantees and records its paths:
  container exec + compose management, its Loki log labels, and its admin API where one
  exists (enabled, token in vault, endpoint + health probe documented) or explicit `n/a`.

### The three-tier access ladder

1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
   before SSH sees it. The preferred path (ADR-016's original rationale).
2. **LAN SSH from `ubongo` — secondary, mesh-independent.** Most hardware (all but
   `askari`) shares a LAN. SSH from `ubongo`'s LAN address is allowed via a new catalog
   source, giving a fallback that survives a NetBird/`wt0` outage. It is gated by *source
   IP* (spoofable on a LAN) **plus** the standing keys-only + fail2ban SSH hardening, so
   the marginal cost is "SSH daemon reachable from the LAN broadcast domain from one
   trusted host" — modest and deliberate. All *other* LAN hosts remain default-denied.
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, not
   used for routine work:
   - **Cluster VMs** → Proxmox serial/VNC console (`qm terminal` / console via the
     Proxmox host) — independent of the guest network, `wt0`, and even a broken guest
     nftables ruleset.
   - **`askari`** (bare-metal Hetzner) → provider rescue/console.
   - **`ubongo`** (physical) → local console.

A total mesh outage therefore still leaves at least one documented way in to each box:
LAN SSH plus console for hosts on the shared LAN, and the provider console for `askari`.
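
The ladder can be read as an ordered fallback. The following is a minimal sketch of that reading, not part of the design: the tier identifiers, `first_live_tier`, `tiers_for_host`, and the injected `probe` callable are all hypothetical names chosen for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical tier identifiers, ordered by preference as in the ladder above.
LADDER: Sequence[str] = ("mesh-ssh", "lan-ssh-from-ubongo", "console")


def tiers_for_host(shares_lan_with_ubongo: bool) -> Sequence[str]:
    """Return the tiers that apply to a host.

    A host that does not share a LAN with ubongo (askari in the text)
    has no tier 2, so only mesh SSH and the console remain.
    """
    if shares_lan_with_ubongo:
        return LADDER
    return tuple(t for t in LADDER if t != "lan-ssh-from-ubongo")


def first_live_tier(
    probe: Callable[[str], bool],
    ladder: Sequence[str] = LADDER,
) -> Optional[str]:
    """Return the most-preferred tier whose probe succeeds, or None.

    `probe` is a stand-in for whatever reachability check is used;
    it is injected so the ordering logic stays testable offline.
    """
    for tier in ladder:
        if probe(tier):
            return tier
    return None
```

With a probe that simulates a mesh outage, a LAN host falls back to `lan-ssh-from-ubongo`, while a host without a shared LAN falls straight through to `console`.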

## The declarative access data model (Approach B)

Structured access facts live as **data** — the single source of truth that both renders
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge.

### Service-layer — `access__*` in each service role's defaults

```yaml
access__service: photoprism
access__compose_project: photoprism        # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db]   # exec targets
access__log:
  loki_labels: { service: photoprism }     # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "https://photoprism.host:2342" # reachable over the mesh
  firewall_ref: photoprism-api             # the catalog entry that opens it (ADR-020)
  auth: { type: token, vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status"            # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
```

**Single-source-of-truth rule:** `access__api` **never opens a port**. It `firewall_ref`s
the entry in the `group_vars` firewall catalog — ADR-020 stays the sole owner of
*exposure*. The access data adds only *how to use* the path (endpoint, token ref, health
probe). No duplication, no ad-hoc ports (CLAUDE.md: ports only in the catalog).
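
One way to keep this rule honest is a small lint over the data. The sketch below is illustrative only: it assumes the firewall catalog can be loaded as a mapping keyed by entry name, and the set of forbidden keys is a hypothetical guess at what "opening a port" would look like in `access__api`.

```python
from typing import Any, Dict, List

# Hypothetical keys that would mean access data is trying to own exposure.
FORBIDDEN_API_KEYS = {"port", "ports", "listen", "open_port"}


def check_api_access(access_api: Dict[str, Any], catalog: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations for one `access__api` block.

    `catalog` is assumed to be the firewall catalog as a dict keyed by
    entry name; the real catalog shape may differ.
    """
    errors: List[str] = []

    # A disabled API is fine as long as it says why (the explicit n/a case).
    if not access_api.get("enabled", False):
        if not access_api.get("reason"):
            errors.append("disabled API needs a reason")
        return errors

    # An enabled API must point at an existing catalog entry.
    ref = access_api.get("firewall_ref")
    if not ref:
        errors.append("enabled API has no firewall_ref")
    elif ref not in catalog:
        errors.append(f"firewall_ref '{ref}' is not in the firewall catalog")

    # Exposure details belong to the catalog, never to the access data.
    for key in sorted(FORBIDDEN_API_KEYS & set(access_api)):
        errors.append(f"'{key}' belongs in the firewall catalog, not in access__api")

    return errors
```

An empty list means the block only references exposure and does not define it.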

### Host-layer — a fixed baseline, stated once

The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
uniform, so it is asserted by `base` and recorded once at the host/group level — not
re-stated per service. The break-glass console per host class is recorded with it.

## The rendered record — `ACCESS.md`

`ACCESS.md` is **rendered** from the `access__*` data, with a prose tail for the genuinely
narrative parts:

- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
  exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
  invocation (`ssh host`, `docker compose -p <project> …`, the Loki query, the `curl`
  against the API health path).
- **Break-glass (generated from host class)** — the Proxmox/provider console line.
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
  part a template cannot know.

A `docs/access/service-access-template.md` defines the shape, alongside the existing
security/verify templates.
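
To make "rendered from data" concrete, here is a rough sketch of generating the access-paths table. It is an assumption-laden illustration, not the template: `render_access_paths` is a hypothetical name, the tier column and the LAN SSH row are left out because the text does not say how they are derived, and the log invocation is written as a plain LogQL stream selector.

```python
from typing import Any, Dict, List, Tuple


def render_access_paths(host: str, access: Dict[str, Any]) -> str:
    """Render a Markdown table of access paths from `access__*` data.

    Simplified on purpose: only Path and Invocation columns, and only
    the paths whose invocation follows directly from the data.
    """
    project = access["access__compose_project"]
    labels = access["access__log"]["loki_labels"]
    selector = "{" + ",".join(f'{k}="{v}"' for k, v in labels.items()) + "}"

    rows: List[Tuple[str, str]] = [
        ("mesh SSH", f"ssh {host}"),
        ("exec/compose", f"docker compose -p {project} ps"),
        ("logs", selector),
    ]

    api = access.get("access__api", {})
    if api.get("enabled"):
        rows.append(("API", f"curl {api['base_url']}{api['health_path']}"))
    else:
        # Explicit n/a, carrying the recorded reason where one exists.
        rows.append(("API", f"n/a: {api.get('reason', 'no API declared')}"))

    lines = ["| Path | Invocation |", "|---|---|"]
    lines += [f"| {path} | `{cmd}` |" for path, cmd in rows]
    return "\n".join(lines)
```

Because the same dictionary would also feed the verifier, a change to `health_path` or `loki_labels` shows up in the doc and in the probes together.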

## The verifier — `/check-access` (designed now, build-pending on infra)

Runs from `ubongo`; turns the `access__*` data into live probes. Invoked as
`/check-access <service>` (or `<host>` for the host baseline). The access analogue of
`/verify-service` (ADR-017).

| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via LAN addr, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each container | stack up, exec works |
| logs | query Loki for `loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` the `health_path` with the vault token | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint only | console host reachable |

- **Break-glass is checked for reachability, not exercised** — firing a serial console is
  invasive; the verifier confirms the fallback *exists* without disrupting anything.
- **Output:** a pass/fail table; on any red, it names the path and the likely cause
  ("API token in vault stale", "Alloy not shipping", "`ssh-from-control` catalog source
  missing"). The payoff: not "the doc *says* you can get in" but "verified — three of four
  paths green right now, here's the broken one."
- **Status:** designed now, build-pending on infra (needs live hosts + staging + vault),
  exactly like `/verify-service` under ADR-017.
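
As a rough illustration of "data into probes", the sketch below builds a probe plan and a one-line summary without running anything. Everything here is hypothetical: the function names, running container probes through SSH to the host, and the choice of `curl` flags. The vault token lookup, the LAN SSH probe, the Loki query, and the break-glass check are deliberately left out.

```python
from typing import Any, Dict, List, Tuple

Probe = Tuple[str, List[str]]  # (path name, argv to run from ubongo)


def build_probes(host: str, access: Dict[str, Any]) -> List[Probe]:
    """Turn `access__*` data into a list of named probe commands."""
    project = access["access__compose_project"]
    probes: List[Probe] = [
        # SSH probe: connect and run `true`, as in the table above.
        ("mesh SSH", ["ssh", host, "true"]),
        ("exec + compose", ["ssh", host, "docker", "compose", "-p", project, "ps"]),
    ]

    # One exec probe per declared container.
    for container in access["access__containers"]:
        probes.append(
            (f"exec {container}", ["ssh", host, "docker", "exec", container, "true"])
        )

    api = access.get("access__api", {})
    if api.get("enabled"):
        url = api["base_url"] + api["health_path"]
        # --fail makes curl exit non-zero on HTTP errors, so 2xx maps to green.
        probes.append(("admin API", ["curl", "--fail", "--silent", url]))
    return probes


def summarize(results: Dict[str, bool]) -> str:
    """Condense per-path results into the kind of line the text describes."""
    green = sum(results.values())
    broken = [name for name, ok in results.items() if not ok]
    line = f"{green} of {len(results)} paths green"
    return line if not broken else f"{line}; broken: {', '.join(broken)}"
```

Keeping plan construction separate from execution means the plan can be reviewed or unit-tested before live hosts exist.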

## Governance — so it can't be forgotten

Three light touches mirror how `SECURITY.md`/`VERIFY.md` are enforced:

1. **Service checklist** (`docs/security/service-checklist.md`) gains one item: *"Access
   paths declared (`access__*`), `ACCESS.md` rendered, `/check-access` green — or
   deviation recorded in `accepted-risks.md`."*
2. **`new-role` runbook** (`docs/runbooks/new-role.md`) gains a step: fill `access__*`,
   render `ACCESS.md`, run `/check-access`.
3. **`make new-role` scaffold** drops a stub `access__*` block + the `ACCESS.md` template
   into the role — the same way roles already get `SECURITY.md`/`VERIFY.md` stubs, so it
   is structurally impossible to ship a service role with no access record.
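
A gate like this could also be backed by a trivial file check. The sketch below is a hypothetical helper, not an existing tool in the repo; it assumes the three record docs sit at the top level of a role directory.

```python
from pathlib import Path
from typing import List

# The three sibling records named in the text.
REQUIRED_RECORDS = ("SECURITY.md", "VERIFY.md", "ACCESS.md")


def missing_records(role_dir: Path) -> List[str]:
    """Return the record docs a role directory is missing, in a fixed order."""
    return [name for name in REQUIRED_RECORDS if not (role_dir / name).is_file()]
```

A non-empty result would be the signal to block the checklist item until the stub has been filled in.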

## Repo wiring

- **`docs/decisions/021-operational-access.md`** — the new ADR (doctrine, both layers,
  the three-tier ladder, break-glass, the `access__*` model, `/check-access`).
- **`docs/decisions/016-mesh-vpn.md`** — amend: SSH on `wt0` **and** from `ubongo`'s LAN
  address (was mesh-only). Cross-link ADR-021.
- **`docs/decisions/020-firewall.md`** — note the new `ssh-from-control` symbolic source.
- **`docs/access/service-access-template.md`** — the rendered `ACCESS.md` shape.
- **`docs/security/service-checklist.md`** — the one new gate item.
- **`docs/runbooks/new-role.md`** — the fill/render/`check-access` step.
- **`CLAUDE.md`** — `ACCESS.md` under "Role conventions"; ADR-021 in Further reading.
- **`STATUS.md`** — rows: ADR-021 doctrine *(designed)*; `ssh-from-control` catalog source
  *(designed, builds with `base` firewall)*; `/check-access` *(designed, build-pending)*.
- **`docs/TODO.md`** — mark 3.2 and 7.2 DECIDED → ADR-021.

## What is buildable now vs later

- **Now:** the doctrine, ADR-021, the `ACCESS.md` template, the checklist/runbook/scaffold
  wiring, and the `ssh-from-control` catalog source (the `firewall` concern of `base`
  already exists, so the source can land with it).
- **Later (build-pending on infra):** `/check-access` *running*, and per-service
  `ACCESS.md` *files* — both wait on service roles + live hosts. Designed now, built when
  there is something to verify.

## Out of scope

- Building `base`'s non-firewall concerns, any service role, or live hosts.
- Broader LAN SSH (a management VLAN) — explicitly rejected; `ubongo`-only.
- Exercising (vs reachability-probing) the break-glass console.
- Any access path that is not over the mesh or the one `ubongo` LAN source.