boma/docs/decisions/021-operational-access.md
sjat f151e99d04 docs(access): correct ADR-021 governance (runbook+gate, not scaffold)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:52:24 +02:00

ADR-021 — Operational access: documented, verifiable ways in

Status

Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access will be rare) and TODO 3.2 (the service admin-API access question).

Doctrine ADR. It pins the operational-access doctrine, the declarative access__* data model, the rendered ACCESS.md record, and the /check-access verifier. It does not build any of them — base's non-firewall concerns, service roles, and live hosts do not exist yet. Designed now, built when there is something to access (see Scope). Reconciles a latent contradiction between ADR-016 and ADR-020 (see Reconciliation).

Context

boma is built security-first: nftables default-deny, SSH reachable only on the NetBird wt0 mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc ports (ADR-002/ADR-020). That posture is correct — but it leaves one operational question unanswered: when a host or service breaks, how does the operator (and the AI working from ubongo) actually get in to troubleshoot it?

Troubleshooting is far more effective with several documented ways in (SSH, container exec, logs, an admin API), so that a single broken path does not leave the operator blind. Today boma has no standard guaranteeing that those paths exist, are documented, or still work. The risk is the classic one: the access you assumed you had turns out to be stale exactly when you need it (key rotated, API disabled, token expired).

boma already has the right shape. Service roles carry record docs — SECURITY.md (security answers) and VERIFY.md (acceptance spec). What is missing is the third sibling — an operational-access record — and the doctrine behind it.

Two constraints shape the decision:

  1. Minimal attack surface is non-negotiable. "Multiple ways in" must mean multiple paths over trusted interfaces, never new exposed ports.
  2. A documented path that is never tested drifts — it fails exactly when needed. So the access facts must be data that both renders the doc and drives an active verifier; the two can then never disagree.

Decision

The doctrine

Every host and every service guarantees at least one documented, verifiable way in for operational troubleshooting — and the deploy that creates it also records and proves it.

Access is a deployment deliverable, not something rediscovered under pressure: recording and proving the access paths is, by design, part of the same deploy that creates the host or service.

Two layers

  • Host layer (resolves TODO 7.2). Every host, via the base role, guarantees a fixed access baseline: SSH over wt0 and from ubongo (the ladder below), Docker/Compose tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is exposed; a known, uniform set of paths exists over trusted interfaces. The break-glass console per host class is recorded once at this layer. This is boma's answer to "what every host runs for access."
  • Service layer (resolves TODO 3.2). Every service role guarantees and records its own paths: container exec + compose management, its Loki log labels, and its admin API where one exists (enabled, token in vault, endpoint + health probe documented) — or an explicit "no API."

The three-tier access ladder

  1. wt0 mesh SSH — primary. WireGuard cryptographically authenticates the peer before SSH sees it. The preferred path (ADR-016's original rationale).

  2. LAN SSH from ubongo only — secondary, mesh-independent. All hardware but askari shares a LAN. SSH from ubongo's LAN address is allowed, giving a fallback that survives a NetBird/wt0 outage. It is gated by source IP (spoofable on a LAN) plus the standing keys-only + fail2ban SSH hardening (ADR-002), so the marginal cost is "SSH daemon reachable from one trusted LAN host" — modest and deliberate. All other LAN hosts stay default-denied.

  3. Console — break-glass. Mesh-and-LAN-independent, recorded per host class, never exercised for routine work:

    • Cluster VMs → Proxmox serial/VNC console — independent of the guest network, wt0, and even a broken guest nftables ruleset.
    • askari (bare-metal Hetzner) → provider rescue/console.
    • ubongo (physical) → local console.

    A total mesh outage therefore still leaves at least one documented way in to each box: LAN SSH from ubongo where the host shares the LAN, and the console everywhere.
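The ladder can be read as an ordered fallback: try each tier in turn and use the first one still standing. The following is a minimal Python sketch of that reading, not boma code. The names `Host`, `usable_tiers`, and `first_available_tier` are hypothetical, and the model is deliberately simplified to two switches (mesh up or down, LAN up or down).

```python
from dataclasses import dataclass

# Ladder tiers in preference order (names follow this ADR).
LADDER = ("mesh-ssh", "lan-ssh-from-ubongo", "console")


@dataclass(frozen=True)
class Host:
    name: str
    on_shared_lan: bool  # assumption: False for askari, True for cluster VMs


def usable_tiers(host: Host, mesh_up: bool, lan_up: bool) -> list:
    """Return the tiers that are still usable, most preferred first."""
    available = {
        "mesh-ssh": mesh_up,
        # LAN SSH needs both a working LAN and a host that sits on it.
        "lan-ssh-from-ubongo": lan_up and host.on_shared_lan,
        # The console does not depend on the mesh or the LAN.
        "console": True,
    }
    return [tier for tier in LADDER if available[tier]]


def first_available_tier(host: Host, mesh_up: bool, lan_up: bool) -> str:
    """Pick the preferred remaining path; the console is the last resort."""
    return usable_tiers(host, mesh_up, lan_up)[0]


vm = Host("cluster-vm", on_shared_lan=True)
askari = Host("askari", on_shared_lan=False)

print(first_available_tier(vm, mesh_up=False, lan_up=True))      # lan-ssh-from-ubongo
print(first_available_tier(askari, mesh_up=False, lan_up=True))  # console
```

During a mesh outage the cluster VM falls back to LAN SSH, while askari, which is not on the shared LAN, goes straight to its provider console.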

Reconciliation, not weakening

ADR-016 already requires Ansible to reach the fleet by LAN IP — "a mesh/coordinator outage never blocks on-LAN runs" — which requires LAN SSH from ubongo. Yet ADR-016 also stated "SSH only on wt0," and ADR-020's guaranteed management plane listed only wt0. That was a latent contradiction. ADR-021 resolves it by making the control-node SSH allow explicit and adding it to the guaranteed management plane. This does not weaken default-deny: it admits exactly one extra trusted source on the LAN (ubongo), keys-only + fail2ban-gated; every other LAN host stays denied. ADR-016 and ADR-020 are amended to cross-reference this ladder.
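To illustrate how narrow the added allow is, the sketch below renders the two SSH allows of the management plane as nftables rule strings. It is an illustration under assumptions: the function name, the default port 22, and the address 192.0.2.10 (from the reserved documentation range) are placeholders, and the surrounding table and chain are omitted. The real rule lands with base's firewall concern and may look different.

```python
import ipaddress


def management_plane_ssh_rules(control_addr: str, ssh_port: int = 22) -> list:
    """Render the SSH allows: the mesh interface plus one LAN source."""
    # Accept only a single IPv4 host address, so a typo such as
    # "192.0.2.0/24" can never widen the allow to a whole subnet.
    addr = ipaddress.IPv4Address(control_addr)
    return [
        f'iifname "wt0" tcp dport {ssh_port} accept',
        f"ip saddr {addr} tcp dport {ssh_port} accept",
    ]


for rule in management_plane_ssh_rules("192.0.2.10"):
    print(rule)
```

Everything that matches neither line still falls through to the default-deny policy, which is why the other LAN hosts stay denied.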

The declarative access__* data model

Structured access facts live as data — the single source of truth that both renders ACCESS.md and tells /check-access what to probe, so doc and verifier cannot diverge (the firewall-catalog philosophy of ADR-020, applied to access).

Each service role's defaults carry:

access__service: photoprism
access__compose_project: photoprism              # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db]  # exec targets
access__log:
  loki_labels: { service: photoprism }           # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "http://photoprism.srv:2342"         # reachable over the mesh
  firewall_ref: photoprism-api                   # the catalog entry that opens it (ADR-020)
  auth: { vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status"                  # what /check-access pings
  # where the service has no API:
  # access__api: { enabled: false, reason: "<none upstream>" }

Invariant: access__api never opens a port. Its firewall_ref points at an entry in the group_vars firewall catalog; ADR-020 stays the sole owner of exposure. The access data adds only how to use the path (endpoint, token ref, health probe): no duplication, no ad-hoc ports (CLAUDE.md: ports only in the catalog).
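The invariant lends itself to a mechanical check. The sketch below is one possible shape for it in Python, under stated assumptions: the firewall catalog is modelled as a plain dict keyed by entry name, the function name `validate_access_api` is hypothetical, and the list of forbidden keys is invented for illustration. The real catalog layout belongs to ADR-020.

```python
def validate_access_api(role_vars: dict, firewall_catalog: dict) -> list:
    """Return a list of problems; an empty list means the invariant holds."""
    api = role_vars.get("access__api", {})
    problems = []
    if not api.get("enabled", False):
        # A disabled API must say why, mirroring the explicit "no API" form.
        if not api.get("reason"):
            problems.append("access__api is disabled without a reason")
        return problems
    ref = api.get("firewall_ref")
    if ref is None:
        problems.append("access__api is enabled but has no firewall_ref")
    elif ref not in firewall_catalog:
        problems.append(f"firewall_ref '{ref}' is not in the firewall catalog")
    # Exposure belongs to the catalog only. These key names are hypothetical.
    for forbidden in ("port", "open_port", "listen"):
        if forbidden in api:
            problems.append(f"access__api must not declare '{forbidden}'")
    return problems


photoprism = {
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "firewall_ref": "photoprism-api",
        "health_path": "/api/v1/status",
    }
}
catalog = {"photoprism-api": {}}  # assumed shape: entry name -> rule details

print(validate_access_api(photoprism, catalog))  # []
print(validate_access_api(photoprism, {}))
```

The second call reports the dangling reference, which is the failure the invariant is meant to rule out: an access record describing a path the catalog never opened.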

The host baseline (SSH on wt0 + from ubongo, Docker/Compose present, Alloy live) is uniform, so it is asserted by base and recorded once at the host/group level, not re-stated per service.

The rendered record — ACCESS.md

ACCESS.md is a first-class sibling of SECURITY.md/VERIFY.md, rendered from the access__* data with a prose tail for the narrative parts:

  • Access paths (generated) — a table: each path (mesh SSH, LAN-SSH-from-ubongo, exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact invocation.
  • Break-glass (generated from host class) — the Proxmox/provider/local console line.
  • Operational notes (prose) — service quirks, gotchas, "if X is wedged, do Y." The part a template cannot know.

A docs/access/service-access-template.md defines the shape, alongside the existing security/verify templates.
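As a rough picture of the generated part, the sketch below derives the service-layer rows of the access-paths table from one role's access__* data. It is a sketch, not the template: the function names are hypothetical, the invocations (`docker compose ... ps`, `docker exec -it ... sh`, a LogQL stream selector, a GET on the health path) are plausible guesses rather than the decided wording, and the tier column is left out because this sketch does not assume how service paths are tiered.

```python
def service_path_rows(v: dict) -> list:
    """Derive (path, invocation) pairs from one role's access__* data."""
    project, compose_file = v["access__compose_project"], v["access__compose_path"]
    rows = [("compose", f"docker compose -p {project} -f {compose_file} ps")]
    for container in v["access__containers"]:
        rows.append((f"exec ({container})", f"docker exec -it {container} sh"))
    labels = ",".join(
        f'{key}="{value}"' for key, value in v["access__log"]["loki_labels"].items()
    )
    rows.append(("logs", "{" + labels + "}"))
    api = v["access__api"]
    if api["enabled"]:
        rows.append(("API", f"GET {api['base_url']}{api['health_path']}"))
    else:
        rows.append(("API", f"none ({api['reason']})"))
    return rows


def render_table(rows: list) -> str:
    """Render the rows as a Markdown table."""
    lines = ["| Path | Invocation |", "| --- | --- |"]
    lines += [f"| {path} | `{invocation}` |" for path, invocation in rows]
    return "\n".join(lines)


photoprism = {
    "access__compose_project": "photoprism",
    "access__compose_path": "/opt/photoprism/compose.yml",
    "access__containers": ["photoprism", "photoprism-db"],
    "access__log": {"loki_labels": {"service": "photoprism"}},
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "health_path": "/api/v1/status",
    },
}

print(render_table(service_path_rows(photoprism)))
```

Because the rows come from the same data the verifier reads, editing the data is the only way to change the table.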

The verifier — /check-access

/check-access <service|host> runs from ubongo and turns the access__* data into live probes, reporting which declared paths are green right now — the access analogue of /verify-service (ADR-017). It probes mesh SSH, LAN SSH, exec + compose, Loki logs, and the admin API health path; on any red it names the path and the likely cause. Break-glass is checked for reachability only, never exercised — firing a serial console is invasive, so the verifier confirms the fallback exists without disrupting anything. Designed now, build-pending on infra (needs live hosts + staging + vault), exactly like /verify-service under ADR-017.
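The core of such a verifier can be sketched as two steps: plan the probes from the data, then run them and collect the reds. The Python below is a minimal sketch under assumptions. `Probe`, `plan_probes`, and `check_access` are hypothetical names, the actual probing is injected as a callable so the sketch stays free of network calls, and the "likely cause" reporting is not modelled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Probe:
    path: str                 # the name reported to the operator
    mode: str = "exercise"    # or "reachability-only" for break-glass


def plan_probes(v: dict) -> list:
    """Turn one role's access__* data into the list of probes to run."""
    probes = [Probe("mesh SSH"), Probe("LAN SSH from ubongo")]
    probes.append(Probe(f"compose {v['access__compose_project']}"))
    probes += [Probe(f"exec {name}") for name in v["access__containers"]]
    probes.append(Probe("Loki logs"))
    api = v["access__api"]
    if api["enabled"]:
        probes.append(Probe(f"API {api['base_url']}{api['health_path']}"))
    # Break-glass is confirmed to exist, never fired.
    probes.append(Probe("break-glass console", mode="reachability-only"))
    return probes


def check_access(v: dict, run_probe: Callable[[Probe], bool]) -> list:
    """Run every planned probe and return the paths that came back red."""
    return [probe.path for probe in plan_probes(v) if not run_probe(probe)]


photoprism = {
    "access__compose_project": "photoprism",
    "access__containers": ["photoprism", "photoprism-db"],
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "health_path": "/api/v1/status",
    },
}

# Stub runner for illustration: everything is green except the admin API.
reds = check_access(photoprism, lambda probe: not probe.path.startswith("API"))
print(reds)  # ['API http://photoprism.srv:2342/api/v1/status']
```

Injecting `run_probe` keeps the planning logic testable before live hosts, staging, or vault exist; the real runner would replace the stub once they do.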

Governance

Three light touches, mirroring how SECURITY.md/VERIFY.md are enforced: the service checklist (docs/security/service-checklist.md) gains an access item; the new-role runbook gains a fill/render/check-access step (step 11: copy docs/access/service-access-template.md into roles/<service>/ACCESS.md and populate the access__* data); and a service-checklist gate item blocks clearance until the record exists and /check-access is green (or a deviation is recorded in accepted-risks.md). No scaffold change — same manual-copy-plus-review pattern the sibling records (SECURITY.md/VERIFY.md) use.

Consequences

  • Every host and service has at least one documented, verifiable way in — and a verifier that proves it, so stale access is caught before an outage, not during one.
  • Doc and verifier share one source of truth (access__*), so they cannot drift apart.
  • The management plane gains exactly one extra trusted LAN source (ubongo); attack surface grows by one keys-only + fail2ban-gated SSH path, no new exposed ports.
  • Cost: per-service access__* declarations and a rendered ACCESS.md to maintain (mitigated by the uniform host baseline + the new-role runbook step + checklist gate), plus /check-access to build.

Scope

Delivered by ADR-021's implementation plan (docs/superpowers/plans/2026-06-09-operational-access.md), task by task, and tracked in STATUS.md as it lands; not all of it exists at the moment this ADR is written. The split below separates the near-term tranche from the longer build-pending work. It does not separate what exists today from what does not.

Near-term tranche (this plan): the doctrine; this ADR; the ACCESS.md template; the ssh-from-control firewall management-plane source; and the governance wiring (checklist item, new-role runbook step). The ssh-from-control source is added to ADR-020's guaranteed management plane (the always-allowed block that already holds the wt0 SSH/Ansible allow and is explicitly independent of the service catalog), not to the catalog itself, which owns service ingress only. It is delivered via the base__firewall_control_addr knob and its nftables rule; neither exists in roles/base yet, and both land with the firewall concern of base. ADR-016 and ADR-020 are amended to reference the ladder.

Build-pending on infra: per-service access__* data and rendered ACCESS.md files (wait on service roles), /check-access running (waits on live hosts + staging + vault), and the real ubongo LAN address value behind base__firewall_control_addr. Designed now, built when there is something to verify.

Out of scope: broader LAN SSH (a management VLAN) — explicitly rejected, ubongo-only; exercising (vs reachability-probing) the break-glass console; any access path that is not over the mesh or the one ubongo LAN source.

References

ADR-002 (security baseline: SSH hardening, default-deny, fail2ban), ADR-004 (Docker model, Compose), ADR-016 (NetBird mesh; amended: SSH on wt0 and from ubongo's LAN address), ADR-017 (/verify-service Level-4 verification), ADR-018 (logging: Alloy → Loki/Grafana), ADR-019 (firewall tag), ADR-020 (firewall: service catalog + guaranteed management plane; amended: adds the ssh-from-control management-plane source).