ADR-021 — Operational access: documented, verifiable ways in
Status
Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access
will be rare) and TODO 3.2 (the service admin-API access question). Amended
2026-06-18: the on-ubongo sudo model for the two local accounts is now settled
(see §Sudo model on ubongo below).
Doctrine ADR. It pins the operational-access doctrine, the declarative access__*
data model, the rendered ACCESS.md record, and the /check-access verifier. It does
not build any of them — base's non-firewall concerns, service roles, and live
hosts do not exist yet. Designed now, built when there is something to access (see
Scope). Reconciles a latent contradiction between ADR-016 and ADR-020 (see
Reconciliation).
Context
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
wt0 mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
ports (ADR-002/ADR-020). That posture is correct — but it leaves one operational
question unanswered: when a host or service breaks, how does the operator (and the AI
working from ubongo) actually get in to troubleshoot it?
Troubleshooting is far more effective with several documented ways in (SSH, container exec, logs, an admin API), so that a single broken path does not leave the operator blind. Today boma has no standard guaranteeing that those paths exist, are documented, or still work. The risk is the classic one: the access you assumed you had turns out to be stale exactly when you need it (key rotated, API disabled, token expired).
boma already has the right shape. Service roles carry record docs — SECURITY.md
(security answers) and VERIFY.md (acceptance spec). What is missing is the third
sibling — an operational-access record — and the doctrine behind it.
Two constraints shape the decision:
- Minimal attack surface is non-negotiable. "Multiple ways in" must mean multiple paths over trusted interfaces, never new exposed ports.
- A documented path that is never tested drifts — it fails exactly when needed. So the access facts must be data that both renders the doc and drives an active verifier; the two can then never disagree.
Decision
The doctrine
Every host and every service guarantees at least one documented, verifiable way in for operational troubleshooting — and the deploy that creates it also records and proves it.
Access is a deployment deliverable, not something rediscovered under pressure. The deploy that creates a host/service also records its access paths and (by design) proves them.
Two layers
- Host layer (resolves TODO 7.2). Every host, via the base role, guarantees a fixed access baseline: SSH over wt0 and from ubongo (the ladder below), Docker/Compose tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is exposed; a known, uniform set of paths exists over trusted interfaces. The break-glass console per host class is recorded once at this layer. This is boma's answer to "what every host runs for access."
- Service layer (resolves TODO 3.2). Every service role guarantees and records its own paths: container exec + compose management, its Loki log labels, and its admin API where one exists (enabled, token in vault, endpoint + health probe documented), or an explicit "no API."
The three-tier access ladder
- wt0 mesh SSH (primary). WireGuard cryptographically authenticates the peer before SSH sees it. The preferred path (ADR-016's original rationale).
- LAN SSH from ubongo only (secondary, mesh-independent). All hardware but askari shares a LAN. SSH from ubongo's LAN address is allowed, giving a fallback that survives a NetBird/wt0 outage. It is gated by source IP (spoofable on a LAN) plus the standing keys-only + fail2ban SSH hardening (ADR-002), so the marginal cost is "SSH daemon reachable from one trusted LAN host": modest and deliberate. All other LAN hosts stay default-denied.
- Console (break-glass). Mesh-and-LAN-independent, recorded per host class, never exercised for routine work:
  - Cluster VMs → Proxmox serial/VNC console, independent of the guest network, wt0, and even a broken guest nftables ruleset.
  - askari (bare-metal Hetzner) → provider rescue/console.
  - ubongo (physical) → local console.
A total mesh outage therefore still leaves at least one documented way in to each box.
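The ladder can be read as a small decision table. What follows is a minimal sketch, not part of boma: the `Host` type, the `on_lan` flag, the function name, and the path labels are all hypothetical. Only the facts it encodes come from this ADR (askari is off the shared LAN; the console depends on neither the mesh nor the LAN).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    on_lan: bool   # per the ADR, askari is the only host off the shared LAN
    console: str   # break-glass console recorded for this host class


def surviving_paths(host: Host, mesh_up: bool, lan_up: bool) -> list[str]:
    """Return the ladder tiers still usable, in order of preference."""
    paths = []
    if mesh_up:
        paths.append("wt0 mesh SSH (primary)")
    if lan_up and host.on_lan:
        paths.append("LAN SSH from ubongo (secondary)")
    # The console never depends on the mesh or the LAN.
    paths.append(f"{host.console} (break-glass)")
    return paths


vm = Host("cluster-vm", on_lan=True, console="Proxmox serial/VNC console")
askari = Host("askari", on_lan=False, console="provider rescue/console")

for h in (vm, askari):
    print(h.name, surviving_paths(h, mesh_up=False, lan_up=True))
```

With the mesh down, the cluster VM keeps two paths and askari keeps only its console, which is why the break-glass tier is recorded for every host class.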
Reconciliation, not weakening
ADR-016 already requires Ansible to reach the fleet by LAN IP — "a mesh/coordinator
outage never blocks on-LAN runs" — which requires LAN SSH from ubongo. Yet ADR-016
also stated "SSH only on wt0," and ADR-020's guaranteed management plane listed only
wt0. That was a latent contradiction. ADR-021 resolves it by making the control-node
SSH allow explicit and adding it to the guaranteed management plane. This does not
weaken default-deny: it admits exactly one extra trusted source on the LAN (ubongo),
keys-only + fail2ban-gated; every other LAN host stays denied. ADR-016 and ADR-020 are
amended to cross-reference this ladder.
The declarative access__* data model
Structured access facts live as data — the single source of truth that both renders
ACCESS.md and tells /check-access what to probe, so doc and verifier cannot diverge
(the firewall-catalog philosophy of ADR-020, applied to access).
Each service role's defaults carry:
access__service: photoprism
access__compose_project: photoprism # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db] # exec targets
access__log:
  loki_labels: { service: photoprism } # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "http://photoprism.srv:2342" # reachable over the mesh
  firewall_ref: photoprism-api # the catalog entry that opens it (ADR-020)
  auth: { vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status" # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
Invariant — access__api never opens a port. It firewall_refs an entry in the
group_vars firewall catalog; ADR-020 stays the sole owner of exposure. The access
data adds only how to use the path (endpoint, token ref, health probe) — no duplication,
no ad-hoc ports (CLAUDE.md: ports only in the catalog).
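The invariant lends itself to a mechanical check. The sketch below is illustrative only. The function name is hypothetical, and the catalog is assumed to be a mapping keyed by entry name, because ADR-020's real catalog shape is not shown here. Requiring a `reason` when the API is disabled is likewise an assumption drawn from the commented example above.

```python
def check_api_invariant(access: dict, firewall_catalog: dict) -> list[str]:
    """Return a list of violations; an empty list means the invariant holds."""
    api = access.get("access__api", {})
    problems = []
    if not api.get("enabled", False):
        # Assumption: an explicit "no API" must carry a reason.
        if not api.get("reason"):
            problems.append("API absent or disabled without a recorded reason")
        return problems
    ref = api.get("firewall_ref")
    if ref not in firewall_catalog:
        problems.append(f"firewall_ref {ref!r} is not in the firewall catalog")
    # The access data describes how to use a path; it must never open one.
    for forbidden in ("port", "ports"):
        if forbidden in api:
            problems.append(f"access__api must not declare {forbidden!r}")
    return problems


# Hypothetical catalog shape, for illustration only.
catalog = {"photoprism-api": {"proto": "tcp", "port": 2342}}
access = {
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "firewall_ref": "photoprism-api",
        "health_path": "/api/v1/status",
    }
}
print(check_api_invariant(access, catalog))
```

A check of this kind could run at lint time, so that a role whose `firewall_ref` points at nothing is rejected before any deploy.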
The host baseline (SSH on wt0 + from ubongo, Docker/Compose present, Alloy live) is
uniform, so it is asserted by base and recorded once at the host/group level, not
re-stated per service.
The rendered record — ACCESS.md
ACCESS.md is a first-class sibling of SECURITY.md/VERIFY.md, rendered from the
access__* data with a prose tail for the narrative parts:
- Access paths (generated): a table listing each path (mesh SSH, LAN-SSH-from-ubongo, exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact invocation.
- Break-glass (generated from host class): the Proxmox/provider/local console line.
- Operational notes (prose): service quirks, gotchas, "if X is wedged, do Y." The part a template cannot know.
A docs/access/service-access-template.md defines the shape, alongside the existing
security/verify templates.
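As a rough illustration of how the generated "Access paths" table could fall out of the access__* data, here is a minimal sketch. It is not the project's renderer. The function names, the table columns, the placeholder LAN address (192.0.2.10, a documentation-range address), and the use of photoprism.srv as the SSH target are assumptions. So is the tier given to service-level paths: the ADR fixes tiers only for the SSH ladder.

```python
PHOTOPRISM = {
    "access__service": "photoprism",
    "access__compose_project": "photoprism",
    "access__compose_path": "/opt/photoprism/compose.yml",
    "access__containers": ["photoprism", "photoprism-db"],
    "access__log": {"loki_labels": {"service": "photoprism"}},
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "health_path": "/api/v1/status",
    },
}


def access_rows(access: dict, host: str, lan_addr: str) -> list[tuple[str, str, str]]:
    """Build (path, tier, invocation) rows from the declarative data."""
    rows = [
        ("mesh SSH", "primary", f"ssh {host}"),
        ("LAN SSH from ubongo", "secondary", f"ssh {lan_addr}"),
    ]
    # Assumption: service-level paths are listed as "primary".
    project = access["access__compose_project"]
    path = access["access__compose_path"]
    rows.append(("compose", "primary", f"docker compose -p {project} -f {path} ps"))
    for name in access["access__containers"]:
        rows.append((f"exec {name}", "primary", f"docker exec -it {name} sh"))
    labels = access["access__log"]["loki_labels"]
    selector = "{" + ",".join(f'{k}="{v}"' for k, v in labels.items()) + "}"
    rows.append(("logs", "primary", selector))
    api = access["access__api"]
    if api.get("enabled"):
        rows.append(("API", "primary", f"curl {api['base_url']}{api['health_path']}"))
    else:
        rows.append(("API", "n/a", f"no API: {api.get('reason', 'unspecified')}"))
    return rows


def render_table(rows: list[tuple[str, str, str]]) -> str:
    lines = ["| Path | Tier | Invocation |", "|---|---|---|"]
    lines += [f"| {p} | {t} | `{i}` |" for p, t, i in rows]
    return "\n".join(lines)


print(render_table(access_rows(PHOTOPRISM, "photoprism.srv", "192.0.2.10")))
```

Because the rows are derived from the same data the verifier reads, editing the data is the only way to change the table.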
The verifier — /check-access
/check-access <service|host> runs from ubongo and turns the access__* data into
live probes, reporting which declared paths are green right now — the access analogue of
/verify-service (ADR-017). It probes mesh SSH, LAN SSH, exec + compose, Loki logs, and
the admin API health path; on any red it names the path and the likely cause. Break-glass
is checked for reachability only, never exercised — firing a serial console is invasive,
so the verifier confirms the fallback exists without disrupting anything. Designed now,
build-pending on infra (needs live hosts + staging + vault), exactly like
/verify-service under ADR-017.
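The planning half of such a verifier might look like the sketch below, offered under stated assumptions rather than as the /check-access design. The `Probe` type, the function names, the placeholder host and LAN address, and the result format are hypothetical. The sketch only decides what to probe and marks break-glass as reachability-only; it runs no probes itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    path: str       # name reported when the probe is red
    target: str     # host, container, log selector, or URL to probe
    exercise: bool  # False: confirm reachability only (break-glass)


def plan_probes(access: dict, host: str, lan_addr: str, console: str) -> list[Probe]:
    """Turn access__* data into an ordered probe plan."""
    probes = [
        Probe("mesh SSH", host, True),
        Probe("LAN SSH", lan_addr, True),
        Probe("compose", access["access__compose_project"], True),
    ]
    probes += [Probe(f"exec {c}", c, True) for c in access["access__containers"]]
    labels = access["access__log"]["loki_labels"]
    selector = "{" + ",".join(f'{k}="{v}"' for k, v in labels.items()) + "}"
    probes.append(Probe("Loki logs", selector, True))
    api = access["access__api"]
    if api.get("enabled"):
        probes.append(Probe("admin API", api["base_url"] + api["health_path"], True))
    # The console is never fired, only checked for existence.
    probes.append(Probe("break-glass", console, False))
    return probes


def red_paths(results: dict[str, bool]) -> list[str]:
    """Name every declared path whose probe failed."""
    return [path for path, ok in results.items() if not ok]


ACCESS = {
    "access__compose_project": "photoprism",
    "access__containers": ["photoprism", "photoprism-db"],
    "access__log": {"loki_labels": {"service": "photoprism"}},
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "health_path": "/api/v1/status",
    },
}
plan = plan_probes(ACCESS, "photoprism.srv", "192.0.2.10", "Proxmox serial/VNC console")
for probe in plan:
    print(probe)
```

Keeping the plan separate from execution means the plan can be reviewed and tested before any live hosts exist, which fits the build-pending status above.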
Governance
Three light touches, mirroring how SECURITY.md/VERIFY.md are enforced: the service
checklist (docs/security/service-checklist.md) gains an access item; the new-role
runbook gains a fill/render/check-access step (step 11: copy
docs/access/service-access-template.md into roles/<service>/ACCESS.md and populate the
access__* data); and a service-checklist gate item blocks clearance until the record
exists and /check-access is green (or a deviation is recorded in accepted-risks.md).
No scaffold change — same manual-copy-plus-review pattern the sibling records
(SECURITY.md/VERIFY.md) use.
Sudo model on ubongo (amendment 2026-06-18)
The original ADR left on-ubongo local sudo unspecified. The integration-testing
harness shakedown settled it:
| Account | Role | Sudo |
|---|---|---|
| claude | Automated AI-worker | NOPASSWD:ALL via repo-managed drop-in (base__ai_worker_user) |
| sjat | Human operator | Password-required sudo via the sudo group |
Rationale for claude NOPASSWD. No-sudo blocked the AI-worker from diagnosing a
failed test VM: virsh, virt-install, cloud-localds, nft, journalctl —
almost every low-level diagnostic tool — require root. The harness's core value is
autonomous spin-up → apply → reboot → assert → diagnose; that loop collapses without
local root access.
Compensating controls (R7 in docs/security/accepted-risks.md):
- claude's password is locked: NOPASSWD is the account's only sudo path; no interactive login is possible.
- auditd + Loki attribution (ADR-018) separates human from agent root actions in the audit trail.
- The drop-in is repo-managed and revocable in one commit + one deploy.
- Single-operator homelab; everything in git; off-machine backups (ADR-022).
sjat NOPASSWD removed. The operator's former NOPASSWD drop-in
(/etc/sudoers.d/sjat-ansible, added as an interim measure during M5 NetBird
enrolment) was removed 2026-06-18. It was redundant once claude held sudo, and its
removal restores least-privilege for the human operator. sjat retains full sudo
capability via the sudo group (password required).
Consequences
- Every host and service has at least one documented, verifiable way in — and a verifier that proves it, so stale access is caught before an outage, not during one.
- Doc and verifier share one source of truth (access__*), so they cannot drift apart.
- The management plane gains exactly one extra trusted LAN source (ubongo); attack surface grows by one keys-only + fail2ban-gated SSH path, no new exposed ports.
- Cost: per-service access__* declarations and a rendered ACCESS.md to maintain (mitigated by the uniform host baseline + the new-role runbook step + checklist gate), plus /check-access to build.
Scope
Delivered by ADR-021's implementation plan
(docs/superpowers/plans/2026-06-09-operational-access.md), task by task, and tracked in
STATUS.md as it lands — not all of it exists at the moment this ADR is written. The split
below is near-term tranche vs longer build-pending, not instant-existence vs not.
Near-term tranche (this plan): the doctrine; this ADR; the ACCESS.md template; the
ssh-from-control firewall management-plane source; and the governance wiring (checklist
item, new-role runbook step). The ssh-from-control source is added to ADR-020's guaranteed
management plane (the always-allowed block that already holds the wt0 SSH/Ansible allow
and is explicitly independent of the service catalog), not to the catalog itself, which
owns service ingress only. It is delivered via the base__firewall_control_addr knob and its
nftables rule, neither of which exists in roles/base yet; both land with the firewall
concern of base. ADR-016 and ADR-020 are amended to reference the ladder.
Build-pending on infra: per-service access__* data and rendered ACCESS.md files
(wait on service roles), /check-access running (waits on live hosts + staging + vault),
and the real ubongo LAN address value behind base__firewall_control_addr. Designed now,
built when there is something to verify.
Out of scope: broader LAN SSH (a management VLAN) — explicitly rejected, ubongo-only;
exercising (vs reachability-probing) the break-glass console; any access path that is not
over the mesh or the one ubongo LAN source.
Related
ADR-002 (security baseline: SSH hardening, default-deny, fail2ban), ADR-004 (Docker
model, Compose), ADR-016 (NetBird mesh; amended — SSH on wt0 and from ubongo's
LAN address), ADR-017 (/verify-service Level-4 verification), ADR-018 (logging:
Alloy → Loki/Grafana), ADR-020 (firewall: service catalog + guaranteed management plane;
amended — adds the ssh-from-control management-plane source), ADR-019 (firewall tag).