sjat/boma

sjat cc772ff845 docs(adr/security): record claude NOPASSWD sudo model (ADR-015 amend + R7)

The integration-testing shakedown reversed ADR-015's "no local sudo" sub-decision:
the claude AI-worker now has NOPASSWD:ALL sudo on ubongo — without it, virsh,
nft, and journalctl all block during VM diagnosis. Compensating controls:
password-locked account, auditd/Loki attribution, repo-managed revocable drop-in.

ADR-015: dated amendment note in Status + expanded AI-worker identity section.
ADR-021: new §Sudo model (amendment 2026-06-18) — claude=NOPASSWD, sjat=password
required; former sjat NOPASSWD drop-in removed 2026-06-18 (least-privilege cleanup).
accepted-risks.md: R7 added (claude NOPASSWD:ALL on ubongo); last-reviewed updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-18 21:39:20 +02:00

13 KiB


ADR-021 — Operational access: documented, verifiable ways in

Status

Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access will be rare) and TODO 3.2 (the service admin-API access question). Amended 2026-06-18: the on-ubongo sudo model for the two local accounts is now settled (see §Sudo model on ubongo below).

Doctrine ADR. It pins the operational-access doctrine, the declarative access__* data model, the rendered ACCESS.md record, and the /check-access verifier. It does not build any of them — base's non-firewall concerns, service roles, and live hosts do not exist yet. Designed now, built when there is something to access (see Scope). Reconciles a latent contradiction between ADR-016 and ADR-020 (see Reconciliation).

Context

boma is built security-first: nftables default-deny, SSH reachable only on the NetBird wt0 mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc ports (ADR-002/ADR-020). That posture is correct — but it leaves one operational question unanswered: when a host or service breaks, how does the operator (and the AI working from ubongo) actually get in to troubleshoot it?

Troubleshooting is far more effective with several documented ways in (SSH, container exec, logs, an admin API), so that a single broken path does not leave the operator blind. Today boma has no standard guaranteeing that those paths exist, are documented, or still work. The risk is the classic one: the access you assumed you had turns out to be stale exactly when you need it (key rotated, API disabled, token expired).

boma already has the right shape. Service roles carry record docs — SECURITY.md (security answers) and VERIFY.md (acceptance spec). What is missing is the third sibling — an operational-access record — and the doctrine behind it.

Two constraints shape the decision:

- Minimal attack surface is non-negotiable. "Multiple ways in" must mean multiple paths over trusted interfaces, never new exposed ports.
- A documented path that is never tested drifts, and it fails exactly when needed. So the access facts must be data that both renders the doc and drives an active verifier; the two can then never disagree.

Decision

The doctrine

Every host and every service guarantees at least one documented, verifiable way in for operational troubleshooting — and the deploy that creates it also records and proves it.

Access is a deployment deliverable, not something rediscovered under pressure. The deploy that creates a host/service also records its access paths and (by design) proves them.

Two layers

- Host layer (resolves TODO 7.2). Every host, via the base role, guarantees a fixed access baseline: SSH over wt0 and from ubongo (the ladder below), Docker/Compose tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is exposed; a known, uniform set of paths exists over trusted interfaces. The break-glass console per host class is recorded once at this layer. This is boma's answer to "what every host runs for access."
- Service layer (resolves TODO 3.2). Every service role guarantees and records its own paths: container exec + compose management, its Loki log labels, and its admin API where one exists (enabled, token in vault, endpoint + health probe documented), or an explicit "no API."

The three-tier access ladder

1. wt0 mesh SSH (primary). WireGuard cryptographically authenticates the peer before SSH sees it. The preferred path (ADR-016's original rationale).
2. LAN SSH from ubongo only (secondary, mesh-independent). All hardware but askari shares a LAN. SSH from ubongo's LAN address is allowed, giving a fallback that survives a NetBird/wt0 outage. It is gated by source IP (spoofable on a LAN) plus the standing keys-only + fail2ban SSH hardening (ADR-002), so the marginal cost is "SSH daemon reachable from one trusted LAN host": modest and deliberate. All other LAN hosts stay default-denied.
3. Console (break-glass). Mesh-and-LAN-independent, recorded per host class, never exercised for routine work:
   - Cluster VMs → Proxmox serial/VNC console, independent of the guest network, wt0, and even a broken guest nftables ruleset.
   - askari (bare-metal Hetzner) → provider rescue/console.
   - ubongo (physical) → local console.

A total mesh outage therefore still leaves at least one documented way in to each box: the console for every host, plus LAN SSH from ubongo for the hosts that share its LAN.
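The fallback behaviour of the ladder can be illustrated with a small sketch. It assumes a hypothetical `Host` record and hypothetical host-class keys (`cluster_vm`, `askari`, `ubongo`); none of these names come from the boma repository, and only the tiers and console types are taken from the text above.

```python
from dataclasses import dataclass

# Break-glass console per host class, as listed in the ladder.
# The dictionary keys are illustrative names, not repository identifiers.
CONSOLE_BY_CLASS = {
    "cluster_vm": "Proxmox serial/VNC console",
    "askari": "provider rescue/console",
    "ubongo": "local console",
}


@dataclass(frozen=True)
class Host:
    name: str
    host_class: str  # key into CONSOLE_BY_CLASS
    on_lan: bool     # False for askari, which does not share the LAN


def available_paths(host: Host, mesh_up: bool, lan_up: bool) -> list[str]:
    """Return the usable access paths for a host, highest tier first."""
    paths = []
    if mesh_up:
        paths.append("primary: SSH over wt0")
    # ubongo is the only allowed LAN source, so it has no LAN path to itself.
    if lan_up and host.on_lan and host.name != "ubongo":
        paths.append("secondary: LAN SSH from ubongo")
    # The console does not depend on the mesh or the LAN.
    paths.append(f"break-glass: {CONSOLE_BY_CLASS[host.host_class]}")
    return paths
```

With the mesh down, a cluster VM keeps the LAN and console paths, while askari is left with the provider console only, so the returned list is never empty.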

Reconciliation, not weakening

ADR-016 already requires Ansible to reach the fleet by LAN IP — "a mesh/coordinator outage never blocks on-LAN runs" — which requires LAN SSH from ubongo. Yet ADR-016 also stated "SSH only on wt0," and ADR-020's guaranteed management plane listed only wt0. That was a latent contradiction. ADR-021 resolves it by making the control-node SSH allow explicit and adding it to the guaranteed management plane. This does not weaken default-deny: it admits exactly one extra trusted source on the LAN (ubongo), keys-only + fail2ban-gated; every other LAN host stays denied. ADR-016 and ADR-020 are amended to cross-reference this ladder.

The declarative `access__*` data model

Structured access facts live as data — the single source of truth that both renders ACCESS.md and tells /check-access what to probe, so doc and verifier cannot diverge (the firewall-catalog philosophy of ADR-020, applied to access).

Each service role's defaults carry:

access__service: photoprism
access__compose_project: photoprism              # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db]  # exec targets
access__log:
  loki_labels: { service: photoprism }           # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "http://photoprism.srv:2342"         # reachable over the mesh
  firewall_ref: photoprism-api                   # the catalog entry that opens it (ADR-020)
  auth: { vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status"                  # what /check-access pings
  # where the service has no API:
  # access__api: { enabled: false, reason: "<none upstream>" }

Invariant — access__api never opens a port. It firewall_refs an entry in the group_vars firewall catalog; ADR-020 stays the sole owner of exposure. The access data adds only how to use the path (endpoint, token ref, health probe) — no duplication, no ad-hoc ports (CLAUDE.md: ports only in the catalog).
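A lint-style check is one way to keep this invariant honest. The sketch below assumes the firewall catalog can be loaded as a mapping keyed by entry name, which is a guess at its shape; the rejected key names (`port`, `ports`, `open_port`) are hypothetical examples of exposure-like fields, not keys from the actual schema.

```python
def check_api_invariant(access_api: dict, firewall_catalog: dict) -> list[str]:
    """Return a list of violations; an empty list means the invariant holds."""
    problems = []

    # A service without an API must say so explicitly, with a reason.
    if not access_api.get("enabled", False):
        if not access_api.get("reason"):
            problems.append("disabled API must state a reason")
        return problems

    # An enabled API must point at an existing catalog entry.
    ref = access_api.get("firewall_ref")
    if ref is None:
        problems.append("enabled API has no firewall_ref")
    elif ref not in firewall_catalog:
        problems.append(f"firewall_ref {ref!r} is not in the catalog")

    # Access data describes how to use a path; it must not define exposure.
    for key in ("port", "ports", "open_port"):  # hypothetical key names
        if key in access_api:
            problems.append(f"{key!r} belongs in the firewall catalog, not access__api")

    return problems
```

The port inside `base_url` is deliberately not flagged: it describes where to connect, while the opening itself stays with the catalog entry that `firewall_ref` names.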

The host baseline (SSH on wt0 + from ubongo, Docker/Compose present, Alloy live) is uniform, so it is asserted by base and recorded once at the host/group level, not re-stated per service.

The rendered record — `ACCESS.md`

ACCESS.md is a first-class sibling of SECURITY.md/VERIFY.md, rendered from the access__* data with a prose tail for the narrative parts:

- Access paths (generated): a table listing each path (mesh SSH, LAN-SSH-from-ubongo, exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact invocation.
- Break-glass (generated from host class): the Proxmox/provider/local console line.
- Operational notes (prose): service quirks, gotchas, "if X is wedged, do Y." The part a template cannot know.

A docs/access/service-access-template.md defines the shape, alongside the existing security/verify templates.
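As a rough illustration of the generated part, the sketch below derives the service-layer rows from the `access__*` data and renders them as a Markdown table. The function names and the exact invocations are assumptions, since the real template and renderer are not defined here. The tier column is left out because the ADR does not state which tier the service-layer paths belong to.

```python
def service_invocations(access: dict) -> list[tuple[str, str]]:
    """Derive (path, invocation) rows for a service from its access__* data."""
    project = access["access__compose_project"]
    compose = access["access__compose_path"]
    rows = [("compose", f"docker compose -p {project} -f {compose} ps")]

    # One exec row per declared container.
    for name in access["access__containers"]:
        rows.append((f"exec ({name})", f"docker exec -it {name} sh"))

    # Turn the Loki labels into a LogQL stream selector.
    labels = access["access__log"]["loki_labels"]
    selector = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    rows.append(("logs", "{" + selector + "}"))

    api = access["access__api"]
    if api.get("enabled"):
        rows.append(("API health", f"GET {api['base_url']}{api['health_path']}"))
    else:
        rows.append(("API", f"no API: {api.get('reason', 'unspecified')}"))
    return rows


def render_table(rows: list[tuple[str, str]]) -> str:
    """Render the rows as a Markdown table."""
    lines = ["| Path | Invocation |", "| --- | --- |"]
    lines += [f"| {path} | `{cmd}` |" for path, cmd in rows]
    return "\n".join(lines)
```

Because the rows are computed from the same data the verifier reads, a change to `access__containers` or `access__api` shows up in the rendered record without hand-editing.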

The verifier — `/check-access`

/check-access <service|host> runs from ubongo and turns the access__* data into live probes, reporting which declared paths are green right now — the access analogue of /verify-service (ADR-017). It probes mesh SSH, LAN SSH, exec + compose, Loki logs, and the admin API health path; on any red it names the path and the likely cause. Break-glass is checked for reachability only, never exercised — firing a serial console is invasive, so the verifier confirms the fallback exists without disrupting anything. Designed now, build-pending on infra (needs live hosts + staging + vault), exactly like /verify-service under ADR-017.
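Since /check-access is not built yet, the following is only a sketch of its reporting core, assuming each declared path can be expressed as a callable that returns True when the path works. `tcp_probe` and `run_probes` are hypothetical names. A TCP connect shows only that the port is reachable, not that authentication succeeds.

```python
import socket
from typing import Callable, Dict


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_probes(probes: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run each declared probe and report green or red per path."""
    report = {}
    for path, probe in probes.items():
        try:
            report[path] = "green" if probe() else "red"
        except Exception as exc:
            # A probe that raises counts as red; keep the message as the likely cause.
            report[path] = f"red ({exc})"
    return report
```

A caller would build the `probes` mapping from the `access__*` data, for example `{"mesh SSH": lambda: tcp_probe(mesh_addr, 22)}` with `mesh_addr` supplied from inventory. The break-glass entry would use a reachability-only probe of this kind, in keeping with the rule that the console is never exercised.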

Governance

Three light touches, mirroring how SECURITY.md/VERIFY.md are enforced: the service checklist (docs/security/service-checklist.md) gains an access item; the new-role runbook gains a fill/render/check-access step (step 11: copy docs/access/service-access-template.md into roles/<service>/ACCESS.md and populate the access__* data); and a service-checklist gate item blocks clearance until the record exists and /check-access is green (or a deviation is recorded in accepted-risks.md). No scaffold change — same manual-copy-plus-review pattern the sibling records (SECURITY.md/VERIFY.md) use.

Sudo model on `ubongo` (amendment 2026-06-18)

The original ADR left on-ubongo local sudo unspecified. The integration-testing harness shakedown settled it:

| Account | Role | Sudo |
| --- | --- | --- |
| `claude` | Automated AI-worker | `NOPASSWD:ALL` via repo-managed drop-in (`base__ai_worker_user`) |
| `sjat` | Human operator | Password-required sudo via the `sudo` group |

Rationale for claude NOPASSWD. No-sudo blocked the AI-worker from diagnosing a failed test VM: virsh, virt-install, cloud-localds, nft, journalctl — almost every low-level diagnostic tool — require root. The harness's core value is autonomous spin-up → apply → reboot → assert → diagnose; that loop collapses without local root access.

Compensating controls (R7 in docs/security/accepted-risks.md):

- claude's password is locked: NOPASSWD is the account's only sudo path; no interactive login is possible.
- auditd + Loki attribution (ADR-018) separates human from agent root actions in the audit trail.
- The drop-in is repo-managed and revocable in one commit + one deploy.
- Single-operator homelab; everything in git; off-machine backups (ADR-022).

sjat NOPASSWD removed. The operator's former NOPASSWD drop-in (/etc/sudoers.d/sjat-ansible, added as an interim measure during M5 NetBird enrolment) was removed 2026-06-18. It was redundant once claude held sudo, and its removal restores least-privilege for the human operator. sjat retains full sudo capability via the sudo group (password required).

Consequences

- Every host and service has at least one documented, verifiable way in, and a verifier that proves it, so stale access is caught before an outage, not during one.
- Doc and verifier share one source of truth (access__*), so they cannot drift apart.
- The management plane gains exactly one extra trusted LAN source (ubongo); attack surface grows by one keys-only + fail2ban-gated SSH path, no new exposed ports.
- Cost: per-service access__* declarations and a rendered ACCESS.md to maintain (mitigated by the uniform host baseline + the new-role runbook step + checklist gate), plus /check-access to build.

Scope

Delivered by ADR-021's implementation plan (docs/superpowers/plans/2026-06-09-operational-access.md), task by task, and tracked in STATUS.md as it lands. Not all of it exists at the moment this ADR is written. The split below separates the near-term tranche from the longer-term build-pending work; it does not separate what already exists from what does not.

Near-term tranche (this plan): the doctrine; this ADR; the ACCESS.md template; the ssh-from-control firewall management-plane source; and the governance wiring (checklist item, new-role runbook step). The ssh-from-control source is added to ADR-020's guaranteed management plane (the always-allowed block that already holds the wt0 SSH/Ansible allow and is explicitly independent of the service catalog), not to the catalog itself, which owns service ingress only. It is implemented via the base__firewall_control_addr knob and its nftables rule; neither exists in roles/base yet, and both land with the firewall concern of base. ADR-016 and ADR-020 are amended to reference the ladder.

Build-pending on infra: per-service access__* data and rendered ACCESS.md files (wait on service roles), /check-access running (waits on live hosts + staging + vault), and the real ubongo LAN address value behind base__firewall_control_addr. Designed now, built when there is something to verify.

Out of scope: broader LAN SSH (a management VLAN) — explicitly rejected, ubongo-only; exercising (vs reachability-probing) the break-glass console; any access path that is not over the mesh or the one ubongo LAN source.

ADR-002 (security baseline: SSH hardening, default-deny, fail2ban), ADR-004 (Docker model, Compose), ADR-016 (NetBird mesh; amended — SSH on wt0 and from ubongo's LAN address), ADR-017 (/verify-service Level-4 verification), ADR-018 (logging: Alloy → Loki/Grafana), ADR-020 (firewall: service catalog + guaranteed management plane; amended — adds the ssh-from-control management-plane source), ADR-019 (firewall tag).
