docs(access): add ADR-021 operational-access doctrine
Parent: cdbd66410a
Commit: 0fe9e45f57
1 changed file with 205 additions and 0 deletions: docs/decisions/021-operational-access.md

# ADR-021 — Operational access: documented, verifiable ways in

## Status

Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access
will be rare) and TODO 3.2 (the service admin-API access question).

**Doctrine ADR.** It pins the operational-access doctrine, the declarative `access__*`
data model, the rendered `ACCESS.md` record, and the `/check-access` verifier. It does
**not** build any of them — `base`'s non-firewall concerns, service roles, and live
hosts do not exist yet. Designed now, built when there is something to access (see
*Scope*). Reconciles a latent contradiction between ADR-016 and ADR-020 (see
*Reconciliation*).

## Context

boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
ports (ADR-002/ADR-020). That posture is correct — but it leaves one operational
question unanswered: **when a host or service breaks, how does the operator (and the AI
working from `ubongo`) actually get in to troubleshoot it?**

Troubleshooting is far more effective with *several* documented ways in — SSH, container
exec, logs, an admin API — so a single broken path does not leave the operator blind.
Today boma has no standard guaranteeing those paths exist, are documented, or still work.
The risk is the classic one: the access you assumed you had is stale exactly when you
need it (key rotated, API disabled, token expired).

boma already has the right *shape*. Service roles carry record docs — `SECURITY.md`
(security answers) and `VERIFY.md` (acceptance spec). What is missing is the third
sibling — an operational-access record — and the doctrine behind it.

Two constraints shape the decision:

1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
   paths over *trusted* interfaces, never new exposed ports.
2. **A documented path that is never tested drifts** — it fails exactly when needed. So
   the access facts must be *data* that both renders the doc and drives an active
   verifier; the two can then never disagree.

## Decision

### The doctrine

> **Every host and every service guarantees at least one documented, verifiable way in
> for operational troubleshooting — and the deploy that creates it also records and
> proves it.**

Access is a deployment deliverable, not something rediscovered under pressure. The deploy
that creates a host/service also records its access paths and (by design) proves them.

### Two layers

- **Host layer** (resolves TODO 7.2). Every host, via the `base` role, guarantees a fixed
  access baseline: SSH over `wt0` and from `ubongo` (the ladder below), Docker/Compose
  tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a
  known, uniform set of paths exists over trusted interfaces. The break-glass console per
  host class is recorded once at this layer. This is boma's answer to "what every host
  runs for access."
- **Service layer** (resolves TODO 3.2). Every service role guarantees and records its
  own paths: container exec + compose management, its Loki log labels, and its admin API
  where one exists (enabled, token in vault, endpoint + health probe documented) — or an
  explicit "no API."

### The three-tier access ladder

1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
   before SSH sees it. The preferred path (ADR-016's original rationale).
2. **LAN SSH from `ubongo` only — secondary, mesh-independent.** All hardware but
   `askari` shares a LAN. SSH from `ubongo`'s LAN address is allowed, giving a fallback
   that survives a NetBird/`wt0` outage. It is gated by *source IP* (spoofable on a LAN)
   **plus** the standing keys-only + fail2ban SSH hardening (ADR-002), so the marginal
   cost is "SSH daemon reachable from one trusted LAN host" — modest and deliberate. All
   *other* LAN hosts stay default-denied.
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, never
   exercised for routine work:
   - **Cluster VMs** → Proxmox serial/VNC console — independent of the guest network,
     `wt0`, and even a broken guest nftables ruleset.
   - **`askari`** (bare-metal Hetzner) → provider rescue/console.
   - **`ubongo`** (physical) → local console.

A total mesh outage therefore still leaves at least one documented way in to each box:
LAN SSH from `ubongo` where the host shares its LAN, and the console everywhere.

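The first two rungs can be probed without logging in. The playbook below is a minimal sketch, not part of this ADR: the `ladder_targets` list, the host name, and both addresses are hypothetical placeholders, and only `ansible.builtin.wait_for` with its documented parameters is assumed to exist as written.

```yaml
# Sketch only: intended to run on ubongo. Names and addresses are placeholders.
- name: Probe tiers 1 and 2 of the access ladder
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    ladder_targets:
      - { name: example-vm, tier: primary, addr: "100.64.0.10" }    # wt0 mesh address (placeholder)
      - { name: example-vm, tier: secondary, addr: "192.0.2.10" }   # LAN address (placeholder)
  tasks:
    - name: Confirm an SSH daemon answers on each tier
      ansible.builtin.wait_for:
        host: "{{ item.addr }}"
        port: 22
        search_regex: OpenSSH   # match the SSH banner, so an unrelated open port does not pass
        timeout: 5
      loop: "{{ ladder_targets }}"
      loop_control:
        label: "{{ item.name }} ({{ item.tier }})"
```

Tier 3 is deliberately absent from the sketch: the console is recorded per host class, not exercised.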
### Reconciliation, not weakening

ADR-016 already requires Ansible to reach the fleet by LAN IP — "a mesh/coordinator
outage never blocks on-LAN runs" — which **requires** LAN SSH from `ubongo`. Yet ADR-016
also stated "SSH only on `wt0`," and ADR-020's guaranteed management plane listed only
`wt0`. That was a latent contradiction. ADR-021 resolves it by making the control-node
SSH allow **explicit** and adding it to the guaranteed management plane. This does **not**
weaken default-deny: it admits exactly one extra trusted source on the LAN (`ubongo`),
keys-only + fail2ban-gated; every other LAN host stays denied. ADR-016 and ADR-020 are
amended to cross-reference this ladder.

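As a rough illustration of what the explicit allow might look like once the `firewall` concern of `base` exists: `base__firewall_control_addr` is the knob named under *Scope*, while the `base__firewall_mgmt_rules` variable, the placeholder address, and the exact rule text are assumptions made for this sketch, not the eventual implementation.

```yaml
# group_vars sketch (hypothetical layout)
base__firewall_control_addr: "192.0.2.10"   # placeholder for ubongo's LAN address

# Assumed variable: nftables rule lines for the always-allowed management plane.
base__firewall_mgmt_rules:
  - 'iifname "wt0" tcp dport 22 accept'                               # tier 1: mesh SSH
  - 'ip saddr {{ base__firewall_control_addr }} tcp dport 22 accept'  # tier 2: LAN SSH from ubongo only
```

Under a default-deny input policy, any other LAN source matches neither line and is dropped, which is the property the reconciliation relies on.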
### The declarative `access__*` data model

Structured access facts live as **data** — the single source of truth that both renders
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge
(the firewall-catalog philosophy of ADR-020, applied to access).

Each service role's defaults carry:

```yaml
access__service: photoprism
access__compose_project: photoprism            # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db]   # exec targets
access__log:
  loki_labels: { service: photoprism }         # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "http://photoprism.srv:2342"       # reachable over the mesh
  firewall_ref: photoprism-api                 # the catalog entry that opens it (ADR-020)
  auth: { vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status"                # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
```

**Invariant — `access__api` never opens a port.** It `firewall_ref`s an entry in the
`group_vars` firewall catalog; ADR-020 stays the **sole owner of exposure**. The access
data adds only *how to use* the path (endpoint, token ref, health probe) — no duplication,
no ad-hoc ports (CLAUDE.md: ports only in the catalog).

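One way the invariant could be enforced at deploy time is a plain assertion. This is a sketch under assumptions: `firewall__catalog` (a list of entries each carrying a `name` key) is a hypothetical variable name, since ADR-020 owns the real catalog shape.

```yaml
# Sketch: `firewall__catalog` is an assumed name and shape, not ADR-020's definition.
- name: Check that the admin API is opened only through the firewall catalog
  ansible.builtin.assert:
    that:
      - access__api.firewall_ref is defined
      - access__api.firewall_ref in (firewall__catalog | map(attribute='name') | list)
    fail_msg: >-
      access__api.firewall_ref must name an existing firewall catalog entry;
      access data never opens a port itself.
  when: access__api.enabled | default(false) | bool
```

The check is skipped for services that declare `enabled: false`, so an explicit "no API" never needs a catalog entry.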
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
uniform, so it is asserted by `base` and recorded once at the host/group level, not
re-stated per service.

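For illustration, a host-class record might look like the following. The ADR does not define host-level key names, so every `access__host_*` key here is hypothetical; only the facts they carry (the two SSH tiers, the Proxmox console, Alloy to Loki) come from the text above.

```yaml
# group_vars sketch for cluster VMs. All key names below are hypothetical.
access__host_class: cluster_vm
access__host_ssh:
  - { tier: primary, via: wt0 }                     # mesh SSH
  - { tier: secondary, via: lan, source: ubongo }   # LAN SSH from the control node
access__host_break_glass: "Proxmox serial/VNC console"   # recorded, not used for routine work
access__host_logs: { shipper: alloy, sink: loki }        # ADR-018
```

Recording this once per group keeps service roles limited to their own paths.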
### The rendered record — `ACCESS.md`

`ACCESS.md` is a first-class sibling of `SECURITY.md`/`VERIFY.md`, **rendered** from the
`access__*` data with a prose tail for the narrative parts:

- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
  exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
  invocation.
- **Break-glass (generated from host class)** — the Proxmox/provider/local console line.
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
  part a template cannot know.

A `docs/access/service-access-template.md` defines the shape, alongside the existing
security/verify templates.

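A minimal sketch of the generated part, assuming rendering happens in an Ansible task inside the service role. The destination path and the table layout are assumptions (the tier column and the SSH rows are left out for brevity); the real shape belongs to `docs/access/service-access-template.md`.

```yaml
# Sketch: destination and layout are assumptions, not the eventual template.
- name: Render the generated part of ACCESS.md from access__* data
  ansible.builtin.copy:
    dest: "{{ role_path }}/ACCESS.md"
    mode: "0644"
    content: |
      # Access: {{ access__service }}

      | Path | Invocation |
      |------|------------|
      | exec | `docker exec -it {{ access__containers | first }} sh` |
      | compose | `docker compose -p {{ access__compose_project }} -f {{ access__compose_path }} ps` |
      | logs | Loki labels: `{{ access__log.loki_labels | to_json }}` |
      | API | {{ (access__api.base_url ~ access__api.health_path) if access__api.enabled else 'no API' }} |
  delegate_to: localhost
```

Because every cell is computed from the same variables the verifier reads, a change to the data changes the document on the next render.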
### The verifier — `/check-access`

`/check-access <service|host>` runs from `ubongo` and turns the `access__*` data into
live probes, reporting which declared paths are green right now — the access analogue of
`/verify-service` (ADR-017). It probes mesh SSH, LAN SSH, exec + compose, Loki logs, and
the admin API health path; on any red it names the path and the likely cause. **Break-glass
is checked for reachability only, never exercised** — firing a serial console is invasive,
so the verifier confirms the fallback *exists* without disrupting anything. Designed now,
**build-pending on infra** (needs live hosts + staging + vault), exactly like
`/verify-service` under ADR-017.

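The service-level probes could be expressed as ordinary Ansible tasks. The sketch below is not the design of `/check-access`: the `photoprism-host` inventory name, the `access__api_token` variable (standing in for a lookup of the vault reference), and the bearer-token header scheme are all assumptions, and it presumes the role's `access__*` defaults are already in scope.

```yaml
# Sketch of service-level probes; host name, token variable, and auth scheme are assumed.
- name: Probe the declared service access paths
  hosts: photoprism-host
  gather_facts: false
  tasks:
    - name: Compose project is manageable
      ansible.builtin.command:
        argv: [docker, compose, -p, "{{ access__compose_project }}", -f, "{{ access__compose_path }}", ps]
      changed_when: false

    - name: Each declared container accepts exec
      ansible.builtin.command:
        argv: [docker, exec, "{{ item }}", "true"]   # assumes the image ships a `true` binary
      loop: "{{ access__containers }}"
      changed_when: false

    - name: Admin API health path answers
      ansible.builtin.uri:
        url: "{{ access__api.base_url }}{{ access__api.health_path }}"
        headers:
          Authorization: "Bearer {{ access__api_token }}"
        status_code: 200
        timeout: 5
      delegate_to: localhost   # probe from the control node, as the verifier would
      when: access__api.enabled | bool
```

Each task maps to one row of the access table, so a failing task names the red path directly.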
### Governance

Three light touches, mirroring how `SECURITY.md`/`VERIFY.md` are enforced: the service
checklist (`docs/security/service-checklist.md`) gains an access item; the `new-role`
runbook gains a fill/render/`check-access` step; and the `make new-role` scaffold drops a
stub `access__*` block + the `ACCESS.md` template into every service role — so it is
structurally impossible to ship one with no access record (deviations go in
`accepted-risks.md`).

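The scaffold stub might look like this. It is a sketch only: the `CHANGEME` placeholder convention and the default of `enabled: false` are assumptions, chosen so that a freshly scaffolded role fails review visibly rather than silently claiming an API.

```yaml
# roles/<new-role>/defaults/main.yml (stub; fill in before first deploy)
access__service: CHANGEME
access__compose_project: CHANGEME
access__compose_path: /opt/CHANGEME/compose.yml
access__containers: []
access__log:
  loki_labels: { service: CHANGEME }
access__api:
  enabled: false
  reason: "CHANGEME: state 'none upstream', or enable and fill base_url, firewall_ref, auth, health_path"
```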
## Consequences

- Every host and service has at least one documented, verifiable way in — and a verifier
  that proves it, so stale access is caught before an outage, not during one.
- Doc and verifier share one source of truth (`access__*`), so they cannot drift apart.
- The management plane gains exactly one extra trusted LAN source (`ubongo`); attack
  surface grows by one keys-only + fail2ban-gated SSH path, no new exposed ports.
- Cost: per-service `access__*` declarations and a rendered `ACCESS.md` to maintain
  (mitigated by the uniform host baseline + scaffold), plus `/check-access` to build.

## Scope

Delivered by ADR-021's implementation plan
(`docs/superpowers/plans/2026-06-09-operational-access.md`), task by task, and tracked in
`STATUS.md` as it lands. Not all of it exists at the moment this ADR is written: the split
below separates the near-term tranche from longer-term build-pending work, and does not
imply that the near-term items already exist.

**Near-term tranche (this plan):**

- the doctrine and this ADR;
- the `ACCESS.md` template;
- the `ssh-from-control` firewall management-plane source, implemented via the
  `base__firewall_control_addr` knob and its nftables rule. Neither exists in `roles/base`
  yet; both land with the `firewall` concern of `base`. The source is added to ADR-020's
  *guaranteed management plane* (the always-allowed block that already holds the `wt0`
  SSH/Ansible allow and is explicitly independent of the service catalog), not to the
  catalog itself, which owns service ingress only;
- the governance wiring (checklist item, runbook step, scaffold stub).

ADR-016 and ADR-020 are amended to reference the ladder.

**Build-pending on infra:** per-service `access__*` data and rendered `ACCESS.md` files
(wait on service roles), `/check-access` *running* (waits on live hosts + staging + vault),
and the real `ubongo` LAN address value behind `base__firewall_control_addr`. Designed now,
built when there is something to verify.

**Out of scope:** broader LAN SSH (a management VLAN), which is explicitly rejected in
favour of `ubongo`-only access; exercising (vs reachability-probing) the break-glass
console; any network access path that is not over the mesh or the one `ubongo` LAN source.

## Related

ADR-002 (security baseline: SSH hardening, default-deny, fail2ban), ADR-004 (Docker
model, Compose), ADR-016 (NetBird mesh; amended — SSH on `wt0` **and** from `ubongo`'s
LAN address), ADR-017 (`/verify-service` Level-4 verification), ADR-018 (logging:
Alloy → Loki/Grafana), ADR-020 (firewall: service catalog + guaranteed management plane;
amended — adds the `ssh-from-control` management-plane source), ADR-019 (`firewall` tag).