sjat/boma

sjat fd4bbbc977 docs(access): design operational-access doctrine (ADR-021)

Brainstorming spec for ADR-021: operational access as a deployment
deliverable. Two layers (host baseline + per-service), a three-tier
access ladder (mesh SSH -> LAN SSH from ubongo -> console break-glass),
declarative access__* data rendering ACCESS.md and driving a
/check-access verifier. Resolves TODO 3.2 (API access) and 7.2 (host
access); amends ADR-016 (SSH also from ubongo) and ADR-020
(ssh-from-control source).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-09 17:10:54 +02:00


Design — Operational access (ADR-021)

Date: 2026-06-09
Status: Approved design — pending implementation plan
Implements: New ADR-021. Resolves TODO 3.2 (API access) and TODO 7.2 (what to set up on hosts, given direct access will be rare).
Amends: ADR-016 (SSH was mesh-only; now also from ubongo's LAN address) and ADR-020 (adds an ssh-from-control symbolic catalog source).
Scope: The operational-access doctrine + the declarative access__* data model, the rendered ACCESS.md record, and the /check-access verifier design. It does not build any of it — base/service roles and live hosts don't exist yet. Designed now, built when there is something to access.

Problem

boma is built security-first: nftables default-deny, SSH reachable only on the NetBird wt0 mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc ports (ADR-002/020). That posture is correct — but it leaves an unanswered operational question: when a service or host breaks, how does the operator (and the AI working on boma's behalf from ubongo) actually get in to troubleshoot it?

Experience on similar projects shows troubleshooting is far more effective with several documented ways in (SSH, container exec, logs, an admin API), so that a single broken path does not leave the operator blind. Today boma has no standard guaranteeing that those paths exist, are documented, or still work. The risk is the classic one: the access you assumed you had turns out to be stale exactly when you need it (key rotated, API disabled, token expired).

boma already has the right shape for the fix. Service roles carry record docs — SECURITY.md (security answers) and VERIFY.md (acceptance spec) — gated by the service checklist and the new-role runbook. What's missing is the third sibling: an operational access record, plus the doctrine behind it.

Two constraints shape the design:

Minimal attack surface is non-negotiable. "Multiple ways in" must mean multiple paths over the trusted interface, never new exposed ports. Resolution: all routine access runs over the mesh from ubongo.
A documented path that is never tested drifts. It fails exactly when needed. So the structured access facts must be data that both renders the doc and drives an active verifier — the two can then never disagree.

Decisions settled in brainstorming

Access is a deployment deliverable. The deploy that creates a host/service also records and (by design) proves its access paths. Not rediscovered under pressure.
All routine access over the mesh (wt0, from ubongo). No new WAN exposure, and no new LAN exposure beyond the single ubongo SSH source decided below.
Two layers: a host-level access baseline (resolves TODO 7.2) and a per-service access record (resolves TODO 3.2).
Baseline paths, every service: host SSH, container exec + compose, logs (Loki/Grafana, ADR-018), and the service admin API where one exists (n/a otherwise).
A new first-class sibling record ACCESS.md (next to SECURITY.md/VERIFY.md), rendered from declarative data — not hand-written prose (the firewall-catalog philosophy of ADR-020 applied to access).
Active verification designed in: a /check-access skill probes the declared paths and reports which are live — the access analogue of /verify-service (ADR-017).
Direct LAN SSH, from ubongo only, is added as a second, mesh-independent path (amends ADR-016); all other LAN hosts stay blocked by default-deny.

The doctrine

Every host and every service guarantees at least one documented, verifiable way in for operational troubleshooting — and the deploy that creates it also records and proves it.

Two layers

Host layer (TODO 7.2). Every host, via the base role, guarantees a fixed access baseline: SSH over wt0 and from ubongo (below), Docker/Compose tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is exposed; a known, uniform set of paths exists over the mesh. This is boma's answer to "what every host runs for access."
Service layer (TODO 3.2). Every service role guarantees and records its paths: container exec + compose management, its Loki log labels, and its admin API where one exists (enabled, token in vault, endpoint + health probe documented) or explicit n/a.

The three-tier access ladder

wt0 mesh SSH: primary. WireGuard cryptographically authenticates the peer before SSH sees it. The preferred path (ADR-016's original rationale).
LAN SSH from ubongo: secondary, mesh-independent. Most hardware (all but askari) shares a LAN. SSH from ubongo's LAN address is allowed via a new catalog source, giving a fallback that survives a NetBird/wt0 outage. It is gated by source IP (which is spoofable on a LAN) plus the standing keys-only and fail2ban SSH hardening, so the marginal cost is that the SSH daemon becomes reachable on the LAN, but only from one trusted source address: modest and deliberate. All other LAN hosts remain default-denied.
Console: break-glass. Mesh-and-LAN-independent, recorded per host class, not used for routine work:
- Cluster VMs → Proxmox serial/VNC console (qm terminal / console via the Proxmox host), independent of the guest network, wt0, and even a broken guest nftables ruleset.
- askari (bare-metal Hetzner) → provider rescue/console.
- ubongo (physical) → local console.
A total mesh outage therefore still leaves at least one documented way in to each box: LAN SSH from ubongo where the host shares the LAN, and the console everywhere.
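
The fall-through order of the ladder can be sketched as a small selection function. This is only an illustration under stated assumptions: the `Tier` type, the tier names, `first_live_tier`, and the injected `probe` callable are hypothetical, not part of boma.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str      # hypothetical label, e.g. "mesh-ssh"
    role: str      # "primary", "secondary" or "break-glass"
    routine: bool  # False for the console: recorded, not used for routine work


# Ladder order as described above; names are illustrative only.
LADDER: Sequence[Tier] = (
    Tier("mesh-ssh", "primary", routine=True),
    Tier("lan-ssh-from-ubongo", "secondary", routine=True),
    Tier("console", "break-glass", routine=False),
)


def first_live_tier(probe: Callable[[Tier], bool],
                    ladder: Sequence[Tier] = LADDER) -> Optional[Tier]:
    """Return the most-preferred tier whose probe succeeds, else None."""
    for tier in ladder:
        if probe(tier):
            return tier
    return None


# Simulated NetBird outage: the mesh probe fails, the LAN path still answers.
chosen = first_live_tier(lambda tier: tier.name != "mesh-ssh")
print(chosen.name if chosen else "no path")  # → lan-ssh-from-ubongo
```

A host that does not share the LAN (askari) would be given a two-entry ladder through the `ladder` parameter, so a mesh outage falls straight through to the console.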

The declarative access data model (Approach B)

Structured access facts live as data — the single source of truth that both renders ACCESS.md and tells /check-access what to probe, so doc and verifier cannot diverge.

Service-layer — `access__*` in each service role's defaults

access__service: photoprism
access__compose_project: photoprism              # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db]  # exec targets
access__log:
  loki_labels: { service: photoprism }           # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "https://photoprism.host:2342"       # reachable over the mesh
  firewall_ref: photoprism-api                   # the catalog entry that opens it (ADR-020)
  auth: { type: token, vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status"                  # what /check-access pings
  # where the service has no API:
  # access__api: { enabled: false, reason: "<none upstream>" }

Single-source-of-truth rule: access__api never opens a port. It points, via firewall_ref, at the entry in the group_vars firewall catalog, so ADR-020 stays the sole owner of exposure. The access data adds only how to use the path (endpoint, token ref, health probe). No duplication, no ad-hoc ports (CLAUDE.md: ports only in the catalog).
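
One way the single-source-of-truth rule could be checked mechanically is sketched below. It assumes the role defaults have already been loaded into a dict and the firewall catalog is available as a set of entry names; `check_access_api`, `FORBIDDEN_KEYS`, and the specific forbidden key names are hypothetical.

```python
from typing import Any, Dict, List, Set

# Keys that would smuggle exposure into the access data (hypothetical names).
FORBIDDEN_KEYS = ("port", "ports", "listen")
REQUIRED_KEYS = ("base_url", "firewall_ref", "auth", "health_path")


def check_access_api(role_vars: Dict[str, Any], catalog: Set[str]) -> List[str]:
    """Return the problems found in a role's access__api block (empty = OK)."""
    api = role_vars.get("access__api")
    if not isinstance(api, dict):
        return ["access__api missing: declare it, or disable it with a reason"]
    problems = [f"access__api.{k} not allowed: ports live in the catalog"
                for k in FORBIDDEN_KEYS if k in api]
    if not api.get("enabled", False):
        # A service without an API must say so explicitly.
        if not api.get("reason"):
            problems.append("access__api disabled without a reason")
        return problems
    problems += [f"access__api.{k} missing" for k in REQUIRED_KEYS if k not in api]
    ref = api.get("firewall_ref")
    if ref is not None and ref not in catalog:
        problems.append(f"firewall_ref '{ref}' has no entry in the firewall catalog")
    return problems


photoprism = {
    "access__api": {
        "enabled": True,
        "base_url": "https://photoprism.host:2342",
        "firewall_ref": "photoprism-api",
        "auth": {"type": "token", "vault_ref": "vault.photoprism.api_token"},
        "health_path": "/api/v1/status",
    }
}
print(check_access_api(photoprism, catalog={"photoprism-api"}))  # → []
```

With an empty catalog the same call reports the dangling `firewall_ref`, which is the failure the rule is meant to prevent: an access record describing a path the firewall never opened.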

Host-layer — a fixed baseline, stated once

The host baseline (SSH on wt0 + from ubongo, Docker/Compose present, Alloy live) is uniform, so it is asserted by base and recorded once at the host/group level — not re-stated per service. The break-glass console per host class is recorded with it.
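
As a rough sketch of what "recorded once at the host/group level" might look like in code, the snippet below derives a host's access paths from its class. The host-class keys, the `HostAccess` type, and the `on_lan` flag are assumptions made for illustration; the design only states that all hardware except askari shares a LAN.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Break-glass console per host class, as listed in the access ladder.
# The dictionary keys are hypothetical labels.
BREAK_GLASS: Dict[str, str] = {
    "cluster_vm": "Proxmox serial/VNC console via the Proxmox host",
    "bare_metal_hetzner": "provider rescue/console",
    "physical": "local console",
}


@dataclass(frozen=True)
class HostAccess:
    host: str
    host_class: str
    on_lan: bool  # assumed False for askari, which does not share the LAN

    def paths(self) -> Tuple[str, ...]:
        """Return the baseline paths followed by the break-glass line."""
        routine = ["SSH over wt0"]
        if self.on_lan:
            routine.append("SSH from ubongo's LAN address")
        routine += ["Docker/Compose tooling", "log shipping (Alloy -> Loki)"]
        return tuple(routine) + (f"break-glass: {BREAK_GLASS[self.host_class]}",)


askari = HostAccess("askari", "bare_metal_hetzner", on_lan=False)
for line in askari.paths():
    print(line)
```

Keeping the console mapping keyed by host class, not by host, matches the intent that break-glass is recorded per class and never repeated per service.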

The rendered record — `ACCESS.md`

ACCESS.md is rendered from the access__* data, with a prose tail for the genuinely narrative parts:

Access paths (generated) — a table: each path (mesh SSH, LAN-SSH-from-ubongo, exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact invocation (ssh host, docker compose -p <project> …, the Loki query, the curl against the API health path).
Break-glass (generated from host class) — the Proxmox/provider console line.
Operational notes (prose) — service quirks, gotchas, "if X is wedged, do Y." The part a template cannot know.

A docs/access/service-access-template.md defines the shape, alongside the existing security/verify templates.
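
The generated "Access paths" table could be rendered along the following lines. This is a minimal sketch, not the template itself: `render_access_table`, `logql_selector`, the host name, and the LAN address are hypothetical, the tier column is filled only for the SSH paths because the design assigns tiers only to those, and the API token header is left out because the header format is service-specific.

```python
from typing import Any, Dict, List, Tuple


def logql_selector(labels: Dict[str, str]) -> str:
    """Build a LogQL stream selector, e.g. {service="photoprism"}."""
    inner = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + inner + "}"


def render_access_table(role_vars: Dict[str, Any], mesh_host: str, lan_addr: str) -> str:
    """Render the generated 'Access paths' table as markdown text."""
    project = role_vars["access__compose_project"]
    api = role_vars["access__api"]
    if api.get("enabled"):
        # Token header omitted on purpose; see the note above.
        api_cell = f"curl -fsS {api['base_url']}{api['health_path']}"
    else:
        api_cell = f"n/a ({api.get('reason', 'no reason recorded')})"
    rows: List[Tuple[str, str, str]] = [
        ("mesh SSH", "primary", f"ssh {mesh_host}"),
        ("LAN SSH from ubongo", "secondary", f"ssh {lan_addr}"),
        ("exec/compose", "-", f"docker compose -p {project} ps"),
        ("logs", "-", logql_selector(role_vars["access__log"]["loki_labels"])),
        ("API", "-", api_cell),
    ]
    lines = ["| Path | Tier | Invocation |", "| --- | --- | --- |"]
    lines += [f"| {path} | {tier} | `{cmd}` |" for path, tier, cmd in rows]
    return "\n".join(lines)


photoprism_vars = {
    "access__compose_project": "photoprism",
    "access__log": {"loki_labels": {"service": "photoprism"}},
    "access__api": {
        "enabled": True,
        "base_url": "https://photoprism.host:2342",
        "health_path": "/api/v1/status",
    },
}
# Placeholder host name and documentation-range address, not real boma values.
table = render_access_table(photoprism_vars, "photoprism.mesh.example", "192.0.2.10")
print(table)
```

Because the verifier would read the same `access__*` dict, a change to the data changes both the rendered invocation and the probe, which is the property the design relies on.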

The verifier — `/check-access` (designed now, build-pending on infra)

Runs from ubongo; turns the access__* data into live probes. Invoked as /check-access <service> (or /check-access <host> for the host baseline). The access analogue of /verify-service (ADR-017).

| Path | Probe | Green = |
| --- | --- | --- |
| `wt0` mesh SSH | connect over mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via LAN addr, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each container | stack up, exec works |
| logs | query Loki for `loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` the `health_path` with the vault token | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint only | console host reachable |

Break-glass is checked for reachability, not exercised — firing a serial console is invasive; the verifier confirms the fallback exists without disrupting anything.
Output: a pass/fail table; on any red, it names the path and the likely cause ("API token in vault stale", "Alloy not shipping", "ssh-from-control catalog source missing"). The payoff: not "the doc says you can get in" but "verified — three of four paths green right now, here's the broken one."
Status: designed now, build-pending on infra (needs live hosts + staging + vault), exactly like /verify-service under ADR-017.
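
The reporting half of the verifier can be sketched without any live infrastructure by injecting the probes as callables. Everything named here (`Probe`, `check_access`, the stand-in lambdas) is hypothetical; the real probes would run SSH, docker compose, a Loki query, and curl as listed in the table above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Probe:
    path: str                # row label from the probe table
    run: Callable[[], bool]  # returns True when the path is green
    likely_cause: str        # hint shown when the probe is red


def check_access(probes: List[Probe]) -> Tuple[str, bool]:
    """Run every probe; return (report text, all_green)."""
    lines = []
    green = 0
    for probe in probes:
        try:
            ok = bool(probe.run())
        except Exception:  # a probe that raises counts as red
            ok = False
        green += ok
        status = "PASS" if ok else f"FAIL (likely: {probe.likely_cause})"
        lines.append(f"{probe.path:<16} {status}")
    lines.append(f"{green} of {len(probes)} paths green")
    return "\n".join(lines), green == len(probes)


# Stand-in probes simulating one stale API token.
probes = [
    Probe("wt0 mesh SSH", lambda: True, "ssh-from-control catalog source missing"),
    Probe("exec + compose", lambda: True, "stack down"),
    Probe("logs", lambda: True, "Alloy not shipping"),
    Probe("admin API", lambda: False, "API token in vault stale"),
]
report, all_green = check_access(probes)
print(report)
```

Treating an exception as red keeps one broken probe from hiding the state of the others, so the report always covers every declared path.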

Governance — so it can't be forgotten

Three light touches mirror how SECURITY.md/VERIFY.md are enforced:

Service checklist (docs/security/service-checklist.md) gains one item: "Access paths declared (access__*), ACCESS.md rendered, /check-access green — or deviation recorded in accepted-risks.md."
new-role runbook (docs/runbooks/new-role.md) gains a step: fill access__*, render ACCESS.md, run /check-access.
make new-role scaffold drops a stub access__* block + the ACCESS.md template into the role, the same way roles already get SECURITY.md/VERIFY.md stubs, so every newly scaffolded service role starts with an access record and the checklist gate catches one that was later removed or left empty.
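
The checklist gate could be backed by a small lint such as the sketch below. It assumes an Ansible-style layout with `ACCESS.md` at the role root and defaults in `defaults/main.yml`; that layout, and the name `missing_access_records`, are assumptions, and the substring test is deliberately crude (it does not parse the YAML).

```python
from pathlib import Path
from typing import List


def missing_access_records(roles_dir: Path) -> List[str]:
    """List role names that lack ACCESS.md or any access__* default."""
    missing = []
    for role in sorted(p for p in roles_dir.iterdir() if p.is_dir()):
        defaults = role / "defaults" / "main.yml"
        has_doc = (role / "ACCESS.md").is_file()
        # Crude check: any access__ key in the defaults file counts.
        has_data = defaults.is_file() and "access__" in defaults.read_text()
        if not (has_doc and has_data):
            missing.append(role.name)
    return missing
```

Run against the roles directory, an empty result means every role carries both halves of the record; any name returned is a role that would fail the new checklist item.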

Repo wiring

docs/decisions/021-operational-access.md — the new ADR (doctrine, both layers, the three-tier ladder, break-glass, the access__* model, /check-access).
docs/decisions/016-mesh-vpn.md — amend: SSH on wt0 and from ubongo's LAN address (was mesh-only). Cross-link ADR-021.
docs/decisions/020-firewall.md — note the new ssh-from-control symbolic source.
docs/access/service-access-template.md — the rendered ACCESS.md shape.
docs/security/service-checklist.md — the one new gate item.
docs/runbooks/new-role.md — the fill/render/check-access step.
CLAUDE.md — ACCESS.md under "Role conventions"; ADR-021 in Further reading.
STATUS.md — rows: ADR-021 doctrine (designed); ssh-from-control catalog source (designed, builds with base firewall); /check-access (designed, build-pending).
docs/TODO.md — mark 3.2 and 7.2 DECIDED → ADR-021.

What is buildable now vs later

Now: the doctrine, ADR-021, the ACCESS.md template, the checklist/runbook/scaffold wiring, and the ssh-from-control catalog source (the firewall concern of base already exists, so the source can land with it).
Later (build-pending on infra): /check-access running, and per-service ACCESS.md files — both wait on service roles + live hosts. Designed now, built when there is something to verify.

Out of scope

Building base's non-firewall concerns, any service role, or live hosts.
Broader LAN SSH (a management VLAN) — explicitly rejected; ubongo-only.
Exercising (vs reachability-probing) the break-glass console.
Any access path that is not over the mesh or the one ubongo LAN source.
