Compare commits
No commits in common. "032adf152555d371814bd016c1b1e7355eff2b15" and "fcfb056591475312df093757b6e8a1eb93198f17" have entirely different histories.
032adf1525
...
fcfb056591
17 changed files with 15 additions and 1124 deletions
|
|
@ -1,49 +0,0 @@
|
||||||
Operational-access verification (ADR-021)
|
|
||||||
|
|
||||||
Probe every documented way in to a service or host from `ubongo` and report which paths
|
|
||||||
are live. Reads the target's `access__*` data (and host baseline), so the verifier and
|
|
||||||
`ACCESS.md` can never disagree. Argument: a service/role name or a host
|
|
||||||
(e.g. `/check-access photoprism`, `/check-access docker01`).
|
|
||||||
|
|
||||||
## Prerequisites (this is forward-looking — ADR-021 dependencies)
|
|
||||||
|
|
||||||
This skill cannot run until these exist; if any is missing, say so and stop — do not
|
|
||||||
improvise around it:
|
|
||||||
|
|
||||||
- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
|
|
||||||
- The target host/service is deployed (staging or production inventory).
|
|
||||||
- `roles/<name>/` carries `access__*` data (for a service), or the host baseline applies (for a host).
|
|
||||||
- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
|
|
||||||
|
|
||||||
## Process
|
|
||||||
|
|
||||||
### Phase 0 — resolve the target
|
|
||||||
|
|
||||||
Resolve the argument to a host or a service role + its host. Load the `access__*` data
|
|
||||||
(service) or the host-baseline + break-glass record (host). State what you will probe.
|
|
||||||
|
|
||||||
### Phase 1 — probe each declared path
|
|
||||||
|
|
||||||
| Path | Probe | Green = |
|
|
||||||
|---|---|---|
|
|
||||||
| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
|
|
||||||
| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
|
|
||||||
| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
|
|
||||||
| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
|
|
||||||
| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
|
|
||||||
| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
|
|
||||||
|
|
||||||
Break-glass is **never exercised** — firing a serial console is invasive; confirm the
|
|
||||||
fallback exists, do not drive it.
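
As a rough illustration of the Phase 1 table, the sketch below maps a loaded `access__*` dict to the argument lists a prober could run. It is a sketch under assumptions, not the skill's implementation: `build_probes`, `mesh_addr`, `lan_addr`, and the example addresses are hypothetical, the `docker` commands are assumed to run on the target host, and token handling, the Loki query, and the break-glass reachability check are left out.

```python
# Hypothetical sketch: function names and addresses are illustrative assumptions.
EXAMPLE_ACCESS = {
    "access__compose_project": "photoprism",
    "access__containers": ["photoprism", "photoprism-db"],
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "health_path": "/api/v1/status",
    },
}


def build_probes(access, mesh_addr, lan_addr):
    """Return {path name: [argv, ...]} for the SSH, exec/compose and API probes."""
    ssh = ["ssh", "-o", "BatchMode=yes"]  # fail instead of prompting for a password
    probes = {
        "mesh SSH": [ssh + [mesh_addr, "true"]],
        "LAN SSH": [ssh + [lan_addr, "true"]],
        # Assumed to be executed on the target host, e.g. wrapped in an SSH path.
        "exec + compose": [
            ["docker", "compose", "-p", access["access__compose_project"], "ps"]
        ]
        + [["docker", "exec", name, "true"] for name in access["access__containers"]],
    }
    api = access.get("access__api", {})
    if api.get("enabled"):
        # Token lookup via access__api.auth.vault_ref is deliberately omitted here.
        probes["admin API"] = [["curl", "-fsS", api["base_url"] + api["health_path"]]]
    return probes


probes = build_probes(EXAMPLE_ACCESS, "docker01.mesh.example", "192.0.2.10")
for path, commands in probes.items():
    for argv in commands:
        print(f"{path}: {' '.join(argv)}")
```

Keeping the probe list as plain data would make it easy to show "what you will probe" before anything is executed, which is what Phase 0 asks for.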
|
|
||||||
|
|
||||||
### Phase 2 — report
|
|
||||||
|
|
||||||
Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
|
|
||||||
in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
|
|
||||||
`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".
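
The verdict line can be derived mechanically from the per-path results. A minimal sketch, assuming the results are held in a dict of path name to boolean; the `verdict` helper is a hypothetical name, not part of the skill:

```python
def verdict(results):
    """Summarise probe results ({path name: True for green, False for red})."""
    green = sum(1 for ok in results.values() if ok)
    red = [name for name, ok in results.items() if not ok]
    line = f"{green}/{len(results)} paths green"
    if red:
        line += "; " + ", ".join(red) + " red"
    return line


example = {"mesh SSH": True, "LAN SSH": True, "logs": True, "admin API": False}
print(verdict(example))  # prints: 3/4 paths green; admin API red
```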
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- Read-only and non-destructive — probes confirm reachability, they do not change state.
|
|
||||||
- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
|
|
||||||
control node + hosts exist.
|
|
||||||
|
|
@ -87,8 +87,6 @@ Full design rationale: `docs/decisions/`
|
||||||
- Every role must have `meta/main.yml` filled in
|
- Every role must have `meta/main.yml` filled in
|
||||||
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
|
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
|
||||||
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
|
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
|
||||||
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
|
|
||||||
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
|
|
||||||
- One service = one self-contained role; no shared multi-service roles (ADR-004)
|
- One service = one self-contained role; no shared multi-service roles (ADR-004)
|
||||||
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
|
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
|
||||||
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
|
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
|
||||||
|
|
@ -226,7 +224,6 @@ Single-contributor, trunk-based (no merge requests / approval gates):
|
||||||
| Logging & log integrity | `docs/decisions/018-logging.md` |
|
| Logging & log integrity | `docs/decisions/018-logging.md` |
|
||||||
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
|
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
|
||||||
| Firewall strategy | `docs/decisions/020-firewall.md` |
|
| Firewall strategy | `docs/decisions/020-firewall.md` |
|
||||||
| Operational access | `docs/decisions/021-operational-access.md` |
|
|
||||||
| Adding a new role | `docs/runbooks/new-role.md` |
|
| Adding a new role | `docs/runbooks/new-role.md` |
|
||||||
| Adding a new host | `docs/runbooks/new-host.md` |
|
| Adding a new host | `docs/runbooks/new-host.md` |
|
||||||
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
|
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
|
||||||
|
|
|
||||||
|
|
@ -59,10 +59,6 @@ So `make deploy PLAYBOOK=site` is still incomplete — `base` is only partially
|
||||||
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
|
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
|
||||||
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
|
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
|
||||||
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
|
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
|
||||||
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
|
|
||||||
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
|
|
||||||
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
|
|
||||||
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
|
|
||||||
|
|
||||||
## Keeping this honest
|
## Keeping this honest
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -63,14 +63,14 @@ earning its keep.
|
||||||
- `[recurring]` When a **deferred** decision later resolves, docs that referenced the
|
- `[recurring]` When a **deferred** decision later resolves, docs that referenced the
|
||||||
deferral go stale and a plan's file-map can miss them (e.g. resolving the mesh-VPN
|
deferral go stale and a plan's file-map can miss them (e.g. resolving the mesh-VPN
|
||||||
choice left `new-host.md` still saying "mesh VPN (choice deferred)"; the ubongo work
|
choice left `new-host.md` still saying "mesh VPN (choice deferred)"; the ubongo work
|
||||||
similarly left a contradiction in CLAUDE.md). A _broadened_ final grep sweep caught
|
similarly left a contradiction in CLAUDE.md). A *broadened* final grep sweep caught
|
||||||
both. → On resolving a deferred decision, grep all canonical docs for the deferral
|
both. → On resolving a deferred decision, grep all canonical docs for the deferral
|
||||||
language ("choice deferred", "pending", "TBD", the placeholder's name) and reconcile
|
language ("choice deferred", "pending", "TBD", the placeholder's name) and reconcile
|
||||||
every hit — don't rely on the plan's file-map alone. Worth a `/review-repo` check for
|
every hit — don't rely on the plan's file-map alone. Worth a `/review-repo` check for
|
||||||
lingering "deferred/pending/TBD" references whose ADR has since resolved.
|
lingering "deferred/pending/TBD" references whose ADR has since resolved.
|
||||||
- **Recurred a 3rd time (same day):** ADR-017 resolved the browser-E2E harness but
|
- **Recurred a 3rd time (same day):** ADR-017 resolved the browser-E2E harness but
|
||||||
left ADR-015's own "Deferred" list item #2 still reading as open — not caught by the
|
left ADR-015's own "Deferred" list item #2 still reading as open — not caught by the
|
||||||
ADR-017 plan's sweep (which only checked for _its own_ placeholder language), only
|
ADR-017 plan's sweep (which only checked for *its own* placeholder language), only
|
||||||
by a later STATUS pass. Lesson sharpened: the stale reference often lives in the
|
by a later STATUS pass. Lesson sharpened: the stale reference often lives in the
|
||||||
**originating ADR's Deferred section**, which the resolving ADR's plan won't think
|
**originating ADR's Deferred section**, which the resolving ADR's plan won't think
|
||||||
to grep. → When an ADR resolves another ADR's deferred item, edit that **source
|
to grep. → When an ADR resolves another ADR's deferred item, edit that **source
|
||||||
|
|
@ -82,7 +82,7 @@ earning its keep.
|
||||||
|
|
||||||
- `[recurring]` **Asked the execution-mode question AGAIN** ("subagent-driven vs inline —
|
- `[recurring]` **Asked the execution-mode question AGAIN** ("subagent-driven vs inline —
|
||||||
which approach?") at the end of `writing-plans`, despite the 2026-06-05 standing
|
which approach?") at the end of `writing-plans`, despite the 2026-06-05 standing
|
||||||
preference _and_ the `always-subagent-driven-execution` memory both saying don't ask.
|
preference *and* the `always-subagent-driven-execution` memory both saying don't ask.
|
||||||
Root cause: the `writing-plans` skill's "Execution Handoff" step scripts the menu, and
|
Root cause: the `writing-plans` skill's "Execution Handoff" step scripts the menu, and
|
||||||
I followed the skill text over the user's standing override. Second occurrence →
|
I followed the skill text over the user's standing override. Second occurrence →
|
||||||
escalate from "skip the prompt" to a **hard rule**: never present the execution-mode
|
escalate from "skip the prompt" to a **hard rule**: never present the execution-mode
|
||||||
|
|
@ -98,12 +98,12 @@ earning its keep.
|
||||||
### Host nftables firewall build (`base` role)
|
### Host nftables firewall build (`base` role)
|
||||||
|
|
||||||
- `[gotcha]` **`nft -c` rejects `iif "<name>"` when the interface is absent** (it resolves
|
- `[gotcha]` **`nft -c` rejects `iif "<name>"` when the interface is absent** (it resolves
|
||||||
to an interface _index_ at load time). The render+syntax-check Molecule step caught
|
to an interface *index* at load time). The render+syntax-check Molecule step caught
|
||||||
`iif "wt0"` failing in the container — and it would fail identically on any real host
|
`iif "wt0"` failing in the container — and it would fail identically on any real host
|
||||||
before NetBird brings up `wt0`. Use **`iifname "<name>"`** (string match, no existence
|
before NetBird brings up `wt0`. Use **`iifname "<name>"`** (string match, no existence
|
||||||
requirement, survives the interface coming/going) for any interface that may be absent.
|
requirement, survives the interface coming/going) for any interface that may be absent.
|
||||||
- `[gotcha]` **Molecule's `community.docker` connection uses `ansible_host` as the
|
- `[gotcha]` **Molecule's `community.docker` connection uses `ansible_host` as the
|
||||||
container name** (`remote_addr`). Setting `ansible_host` as _data_ in a scenario's
|
container name** (`remote_addr`). Setting `ansible_host` as *data* in a scenario's
|
||||||
`host_vars` (e.g. to give a resolver a fake IP) breaks the connection → `UNREACHABLE`,
|
`host_vars` (e.g. to give a resolver a fake IP) breaks the connection → `UNREACHABLE`,
|
||||||
"Failed to create temporary directory". Don't override `ansible_host` in molecule; feed
|
"Failed to create temporary directory". Don't override `ansible_host` in molecule; feed
|
||||||
fixture IPs another way (or keep fixtures to zone sources and unit-test IP resolution).
|
fixture IPs another way (or keep fixtures to zone sources and unit-test IP resolution).
|
||||||
|
|
@ -124,15 +124,3 @@ earning its keep.
|
||||||
- `[note]` The render-and-`nft -c` (no-apply) Molecule approach **earned its keep** —
|
- `[note]` The render-and-`nft -c` (no-apply) Molecule approach **earned its keep** —
|
||||||
caught the `iif`/`iifname` bug deterministically without touching the host kernel. Good
|
caught the `iif`/`iifname` bug deterministically without touching the host kernel. Good
|
||||||
pattern to reuse for other config-rendering roles.
|
pattern to reuse for other config-rendering roles.
|
||||||
|
|
||||||
## 2026-06-09
|
|
||||||
|
|
||||||
- `[recurring]` **Asked the execution-mode question AGAIN** — presented the
|
|
||||||
"subagent-driven vs inline" menu at the `writing-plans` → execution handoff, even
|
|
||||||
though the standing 2026-06-05 preference and the `always-subagent-driven-execution`
|
|
||||||
memory both say to default to subagent-driven without asking. Third occurrence; the
|
|
||||||
earlier "hard rule" escalation didn't hold because both `writing-plans` and
|
|
||||||
`subagent-driven-development` script the menu and I followed the skill text over the
|
|
||||||
user's standing override. → The standing preference outranks skill scripts: when a
|
|
||||||
skill's handoff offers the execution-mode menu, skip it and proceed subagent-driven;
|
|
||||||
only ask if the user signals otherwise this session.
|
|
||||||
|
|
|
||||||
10
docs/TODO.md
|
|
@ -18,10 +18,7 @@
|
||||||
1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
|
1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
|
||||||
Grafana Alloy (in `base`); a security subset also ships write-only off-site to
|
Grafana Alloy (in `base`); a security subset also ships write-only off-site to
|
||||||
`askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
|
`askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
|
||||||
2. ~~Decide how to manage APIs / API access.~~ DECIDED (ADR-021): per-service `access__*`
|
2. Decide how to manage APIs / API access.
|
||||||
data declares the admin API (endpoint + `firewall_ref` to the catalog + vault token
|
|
||||||
ref + health path); rendered into `ACCESS.md` and probed by `/check-access`. Part of
|
|
||||||
the two-layer operational-access doctrine.
|
|
||||||
3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
|
3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
|
||||||
translate-don't-transplant — V4 is a source only of gotchas + working config
|
translate-don't-transplant — V4 is a source only of gotchas + working config
|
||||||
snippets, re-derived on boma's terms; never structure/requirements/values.
|
snippets, re-derived on boma's terms; never structure/requirements/values.
|
||||||
|
|
@ -56,10 +53,7 @@
|
||||||
|
|
||||||
7. **Shell setup**
|
7. **Shell setup**
|
||||||
1. Decide what shell setup matters for the AI's work on the control node.
|
1. Decide what shell setup matters for the AI's work on the control node.
|
||||||
2. ~~Decide what to set up on the hosts, given that direct access will be rare.~~
|
2. Decide what to set up on the hosts, given that direct access will be rare.
|
||||||
DECIDED (ADR-021): the host-layer access baseline — SSH on `wt0` + from `ubongo`,
|
|
||||||
Docker/Compose tooling, Alloy log shipping, and a recorded break-glass console per
|
|
||||||
host class.
|
|
||||||
|
|
||||||
8. **Scheduled work**
|
8. **Scheduled work**
|
||||||
1. Run `/review-repo` as `claude -p` via cron every two weeks?
|
1. Run `/review-repo` as `claude -p` via cron every two weeks?
|
||||||
|
|
|
||||||
|
|
@ -1,38 +0,0 @@
|
||||||
# Per-service operational-access record — template
|
|
||||||
|
|
||||||
Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
|
|
||||||
It is the per-service **operational-access record**: every documented, verifiable way in
|
|
||||||
for troubleshooting. The structured parts are **rendered from the role's `access__*`
|
|
||||||
data** (the single source of truth that also drives `/check-access`) — keep the data
|
|
||||||
authoritative and regenerate this file rather than hand-editing the tables. The prose
|
|
||||||
"Operational notes" tail is hand-written.
|
|
||||||
|
|
||||||
Delete this preamble in the copy and start from the heading below.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
# Access — <service>
|
|
||||||
|
|
||||||
## Access paths
|
|
||||||
|
|
||||||
The documented ways in, by tier (rendered from `access__*`):
|
|
||||||
|
|
||||||
| Tier | Path | Invocation |
|
|
||||||
|---|---|---|
|
|
||||||
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
|
|
||||||
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
|
|
||||||
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
|
|
||||||
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
|
|
||||||
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |
|
|
||||||
|
|
||||||
## Break-glass
|
|
||||||
|
|
||||||
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
|
|
||||||
|
|
||||||
- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>
|
|
||||||
|
|
||||||
## Operational notes
|
|
||||||
|
|
||||||
Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.
|
|
||||||
|
|
||||||
- <none yet>
|
|
||||||
|
|
@ -61,12 +61,8 @@ allocated for it.
|
||||||
privilege.
|
privilege.
|
||||||
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
|
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
|
||||||
consumed by `base`; prefer ephemeral/scoped keys.
|
consumed by `base`; prefer ephemeral/scoped keys.
|
||||||
- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
|
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
|
||||||
(primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
|
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
|
||||||
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
|
|
||||||
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
|
|
||||||
explicit the control-node SSH allow that the recovery model already implied; the access
|
|
||||||
doctrine and the three-tier access ladder live in **ADR-021**.
|
|
||||||
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
|
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
|
||||||
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
|
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
|
||||||
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
|
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
|
||||||
|
|
|
||||||
|
|
@ -39,12 +39,10 @@ subnet (VLAN 20), which never reaches the gateway.
|
||||||
added benefit once the VLAN already bounds where a host can go.
|
added benefit once the VLAN already bounds where a host can go.
|
||||||
- **Docker**: daemon runs with `"iptables": false`; nftables owns all filtering,
|
- **Docker**: daemon runs with `"iptables": false`; nftables owns all filtering,
|
||||||
including container traffic (ADR-004).
|
including container traffic (ADR-004).
|
||||||
- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
|
- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
|
||||||
ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
|
ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
|
||||||
the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
|
atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
|
||||||
catalog, applied atomically — a malformed or empty catalog can never lock out
|
is allowed only on `wt0`.)
|
||||||
management. The control-node source is part of the guaranteed plane, not the service
|
|
||||||
catalog (it is management, not a service); see ADR-021 for the access doctrine.
|
|
||||||
|
|
||||||
So "per-host vs central" is answered: **both**, with clear ownership.
|
So "per-host vs central" is answered: **both**, with clear ownership.
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -1,206 +0,0 @@
|
||||||
# ADR-021 — Operational access: documented, verifiable ways in
|
|
||||||
|
|
||||||
## Status
|
|
||||||
|
|
||||||
Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access
|
|
||||||
will be rare) and TODO 3.2 (the service admin-API access question).
|
|
||||||
|
|
||||||
**Doctrine ADR.** It pins the operational-access doctrine, the declarative `access__*`
|
|
||||||
data model, the rendered `ACCESS.md` record, and the `/check-access` verifier. It does
|
|
||||||
**not** build any of them — `base`'s non-firewall concerns, service roles, and live
|
|
||||||
hosts do not exist yet. Designed now, built when there is something to access (see
|
|
||||||
*Scope*). Reconciles a latent contradiction between ADR-016 and ADR-020 (see
|
|
||||||
*Reconciliation*).
|
|
||||||
|
|
||||||
## Context
|
|
||||||
|
|
||||||
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
|
|
||||||
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
|
|
||||||
ports (ADR-002/ADR-020). That posture is correct — but it leaves one operational
|
|
||||||
question unanswered: **when a host or service breaks, how does the operator (and the AI
|
|
||||||
working from `ubongo`) actually get in to troubleshoot it?**
|
|
||||||
|
|
||||||
Troubleshooting is far more effective with *several* documented ways in (SSH, container
|
|
||||||
exec, logs, an admin API), so a single broken path does not leave the operator blind. Today boma has no
|
|
||||||
standard guaranteeing those paths exist, are documented, or still work. The risk is the
|
|
||||||
classic one: the access you assumed you had is stale exactly when you need it (key
|
|
||||||
rotated, API disabled, token expired).
|
|
||||||
|
|
||||||
boma already has the right *shape*. Service roles carry record docs — `SECURITY.md`
|
|
||||||
(security answers) and `VERIFY.md` (acceptance spec). What is missing is the third
|
|
||||||
sibling — an operational-access record — and the doctrine behind it.
|
|
||||||
|
|
||||||
Two constraints shape the decision:
|
|
||||||
|
|
||||||
1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
|
|
||||||
paths over *trusted* interfaces, never new exposed ports.
|
|
||||||
2. **A documented path that is never tested drifts** — it fails exactly when needed. So
|
|
||||||
the access facts must be *data* that both renders the doc and drives an active
|
|
||||||
verifier; the two can then never disagree.
|
|
||||||
|
|
||||||
## Decision
|
|
||||||
|
|
||||||
### The doctrine
|
|
||||||
|
|
||||||
> **Every host and every service guarantees at least one documented, verifiable way in
|
|
||||||
> for operational troubleshooting — and the deploy that creates it also records and
|
|
||||||
> proves it.**
|
|
||||||
|
|
||||||
Access is a deployment deliverable, not something rediscovered under pressure. The deploy
|
|
||||||
that creates a host/service also records its access paths and (by design) proves them.
|
|
||||||
|
|
||||||
### Two layers
|
|
||||||
|
|
||||||
- **Host layer** (resolves TODO 7.2). Every host, via the `base` role, guarantees a fixed
|
|
||||||
access baseline: SSH over `wt0` and from `ubongo` (the ladder below), Docker/Compose
|
|
||||||
tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a
|
|
||||||
known, uniform set of paths exists over trusted interfaces. The break-glass console per
|
|
||||||
host class is recorded once at this layer. This is boma's answer to "what every host
|
|
||||||
runs for access."
|
|
||||||
- **Service layer** (resolves TODO 3.2). Every service role guarantees and records its
|
|
||||||
own paths: container exec + compose management, its Loki log labels, and its admin API
|
|
||||||
where one exists (enabled, token in vault, endpoint + health probe documented) — or an
|
|
||||||
explicit "no API."
|
|
||||||
|
|
||||||
### The three-tier access ladder
|
|
||||||
|
|
||||||
1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
|
|
||||||
before SSH sees it. The preferred path (ADR-016's original rationale).
|
|
||||||
2. **LAN SSH from `ubongo` only — secondary, mesh-independent.** All hardware but
|
|
||||||
`askari` shares a LAN. SSH from `ubongo`'s LAN address is allowed, giving a fallback
|
|
||||||
that survives a NetBird/`wt0` outage. It is gated by *source IP* (spoofable on a LAN)
|
|
||||||
**plus** the standing keys-only + fail2ban SSH hardening (ADR-002), so the marginal
|
|
||||||
cost is "SSH daemon reachable from one trusted LAN host" — modest and deliberate. All
|
|
||||||
*other* LAN hosts stay default-denied.
|
|
||||||
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, never
|
|
||||||
exercised for routine work:
|
|
||||||
- **Cluster VMs** → Proxmox serial/VNC console — independent of the guest network,
|
|
||||||
`wt0`, and even a broken guest nftables ruleset.
|
|
||||||
- **`askari`** (bare-metal Hetzner) → provider rescue/console.
|
|
||||||
- **`ubongo`** (physical) → local console.
|
|
||||||
|
|
||||||
A total mesh outage therefore still leaves at least one documented way in to each box (console alone for `askari`, which is off the shared LAN).
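
The ladder can be read as an ordered preference list. A minimal sketch, assuming each path's availability is already known; `LADDER`, `pick_path`, and the path labels are hypothetical names used only for illustration:

```python
# Ordered from most to least preferred, following the three tiers above.
LADDER = ["mesh_ssh", "lan_ssh", "console"]


def pick_path(available):
    """Return the most preferred tier present in `available`, or None."""
    for tier in LADDER:
        if tier in available:
            return tier
    return None


# Mesh outage on a LAN host: fall back to LAN SSH from the control node.
print(pick_path({"lan_ssh", "console"}))
# Mesh outage on a host with no LAN path: only the break-glass console is left.
print(pick_path({"console"}))
```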
|
|
||||||
|
|
||||||
### Reconciliation, not weakening
|
|
||||||
|
|
||||||
ADR-016 already requires Ansible to reach the fleet by LAN IP — "a mesh/coordinator
|
|
||||||
outage never blocks on-LAN runs" — which **requires** LAN SSH from `ubongo`. Yet ADR-016
|
|
||||||
also stated "SSH only on `wt0`," and ADR-020's guaranteed management plane listed only
|
|
||||||
`wt0`. That was a latent contradiction. ADR-021 resolves it by making the control-node
|
|
||||||
SSH allow **explicit** and adding it to the guaranteed management plane. This does **not**
|
|
||||||
weaken default-deny: it admits exactly one extra trusted source on the LAN (`ubongo`),
|
|
||||||
keys-only + fail2ban-gated; every other LAN host stays denied. ADR-016 and ADR-020 are
|
|
||||||
amended to cross-reference this ladder.
|
|
||||||
|
|
||||||
### The declarative `access__*` data model
|
|
||||||
|
|
||||||
Structured access facts live as **data** — the single source of truth that both renders
|
|
||||||
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge
|
|
||||||
(the firewall-catalog philosophy of ADR-020, applied to access).
|
|
||||||
|
|
||||||
Each service role's defaults carry:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
access__service: photoprism
|
|
||||||
access__compose_project: photoprism # docker compose -p <this>
|
|
||||||
access__compose_path: /opt/photoprism/compose.yml
|
|
||||||
access__containers: [photoprism, photoprism-db] # exec targets
|
|
||||||
access__log:
|
|
||||||
loki_labels: { service: photoprism } # how to query logs (ADR-018)
|
|
||||||
access__api:
|
|
||||||
enabled: true
|
|
||||||
base_url: "http://photoprism.srv:2342" # reachable over the mesh
|
|
||||||
firewall_ref: photoprism-api # the catalog entry that opens it (ADR-020)
|
|
||||||
auth: { vault_ref: "vault.photoprism.api_token" }
|
|
||||||
health_path: "/api/v1/status" # what /check-access pings
|
|
||||||
# where the service has no API:
|
|
||||||
# access__api: { enabled: false, reason: "<none upstream>" }
|
|
||||||
```
|
|
||||||
|
|
||||||
**Invariant — `access__api` never opens a port.** It `firewall_ref`s an entry in the
|
|
||||||
`group_vars` firewall catalog; ADR-020 stays the **sole owner of exposure**. The access
|
|
||||||
data adds only *how to use* the path (endpoint, token ref, health probe) — no duplication,
|
|
||||||
no ad-hoc ports (CLAUDE.md: ports only in the catalog).
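
The invariant lends itself to a simple lint. The sketch below assumes the firewall catalog is available as a mapping keyed by entry name; that shape, the `check_firewall_ref` helper, and the sample catalog contents are assumptions for illustration, since the catalog format is defined by ADR-020 and not shown here.

```python
def check_firewall_ref(access, catalog):
    """Return a list of problems with access__api.firewall_ref (empty if none)."""
    api = access.get("access__api", {})
    if not api.get("enabled"):
        return []  # an explicit "no API" needs no catalog entry
    ref = api.get("firewall_ref")
    if not ref:
        return ["access__api is enabled but declares no firewall_ref"]
    if ref not in catalog:
        return [f"firewall_ref '{ref}' has no entry in the firewall catalog"]
    return []


# Hypothetical catalog shape: entry name -> rule details.
catalog = {"photoprism-api": {"port": 2342}}
good = {"access__api": {"enabled": True, "firewall_ref": "photoprism-api"}}
bad = {"access__api": {"enabled": True, "firewall_ref": "photoprism-admin"}}

print(check_firewall_ref(good, catalog))
print(check_firewall_ref(bad, catalog))
```

A check of this kind only reads the catalog; it never adds entries, which keeps exposure decisions with ADR-020.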
|
|
||||||
|
|
||||||
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
|
|
||||||
uniform, so it is asserted by `base` and recorded once at the host/group level, not
|
|
||||||
re-stated per service.
|
|
||||||
|
|
||||||
### The rendered record — `ACCESS.md`
|
|
||||||
|
|
||||||
`ACCESS.md` is a first-class sibling of `SECURITY.md`/`VERIFY.md`, **rendered** from the
|
|
||||||
`access__*` data with a prose tail for the narrative parts:
|
|
||||||
|
|
||||||
- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
|
|
||||||
exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
|
|
||||||
invocation.
|
|
||||||
- **Break-glass (generated from host class)** — the Proxmox/provider/local console line.
|
|
||||||
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
|
|
||||||
part a template cannot know.
|
|
||||||
|
|
||||||
A `docs/access/service-access-template.md` defines the shape, alongside the existing
|
|
||||||
security/verify templates.
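
To make "rendered from the `access__*` data" concrete, here is a minimal sketch of generating the access-paths table. The `render_access_paths` function is hypothetical, the row wording loosely follows the template, a plain `-` stands in for the empty tier cell, and the token header is described rather than spelled out; the real renderer may differ.

```python
EXAMPLE_ACCESS = {
    "access__compose_project": "photoprism",
    "access__compose_path": "/opt/photoprism/compose.yml",
    "access__log": {"loki_labels": {"service": "photoprism"}},
    "access__api": {
        "enabled": True,
        "base_url": "http://photoprism.srv:2342",
        "auth": {"vault_ref": "vault.photoprism.api_token"},
        "health_path": "/api/v1/status",
    },
}


def render_access_paths(access):
    """Render the 'Access paths' table as Markdown from access__* data."""
    labels = ", ".join(
        f'{key}="{value}"' for key, value in access["access__log"]["loki_labels"].items()
    )
    api = access.get("access__api", {})
    if api.get("enabled"):
        api_cell = (
            f"`curl {api['base_url']}{api['health_path']}` "
            f"(token from `{api['auth']['vault_ref']}`)"
        )
    else:
        api_cell = "n/a"
    compose = (
        f"`docker compose -p {access['access__compose_project']} "
        f"-f {access['access__compose_path']} ps` / `exec`"
    )
    rows = [
        ("primary", "`wt0` mesh SSH", "`ssh <host>` (over the NetBird mesh)"),
        ("secondary", "LAN SSH from `ubongo`", "`ssh <host>` (from the control node, LAN address)"),
        ("-", "container exec + compose", compose),
        ("-", "logs", f"Loki query for labels `{labels}`"),
        ("-", "admin API", api_cell),
    ]
    lines = ["| Tier | Path | Invocation |", "|---|---|---|"]
    lines += [f"| {tier} | {path} | {invocation} |" for tier, path, invocation in rows]
    return "\n".join(lines)


table = render_access_paths(EXAMPLE_ACCESS)
print(table)
```

Because the same dict could also feed the verifier, regenerating the table instead of hand-editing it is what keeps the record and the probes aligned.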
|
|
||||||
|
|
||||||
### The verifier — `/check-access`
|
|
||||||
|
|
||||||
`/check-access <service|host>` runs from `ubongo` and turns the `access__*` data into
|
|
||||||
live probes, reporting which declared paths are green right now — the access analogue of
|
|
||||||
`/verify-service` (ADR-017). It probes mesh SSH, LAN SSH, exec + compose, Loki logs, and
|
|
||||||
the admin API health path; on any red it names the path and the likely cause. **Break-glass
|
|
||||||
is checked for reachability only, never exercised** — firing a serial console is invasive,
|
|
||||||
so the verifier confirms the fallback *exists* without disrupting anything. Designed now,
|
|
||||||
**build-pending on infra** (needs live hosts + staging + vault), exactly like
|
|
||||||
`/verify-service` under ADR-017.
|
|
||||||
|
|
||||||
### Governance
|
|
||||||
|
|
||||||
Three light touches, mirroring how `SECURITY.md`/`VERIFY.md` are enforced: the service
|
|
||||||
checklist (`docs/security/service-checklist.md`) gains an access item; the `new-role`
|
|
||||||
runbook gains a fill/render/`check-access` step (step 11: copy
|
|
||||||
`docs/access/service-access-template.md` into `roles/<service>/ACCESS.md` and populate the
|
|
||||||
`access__*` data); and a service-checklist gate item blocks clearance until the record
|
|
||||||
exists and `/check-access` is green (or a deviation is recorded in `accepted-risks.md`).
|
|
||||||
No scaffold change — same manual-copy-plus-review pattern the sibling records
|
|
||||||
(`SECURITY.md`/`VERIFY.md`) use.
|
|
||||||
|
|
||||||
## Consequences
|
|
||||||
|
|
||||||
- Every host and service has at least one documented, verifiable way in — and a verifier
|
|
||||||
that proves it, so stale access is caught before an outage, not during one.
|
|
||||||
- Doc and verifier share one source of truth (`access__*`), so they cannot drift apart.
|
|
||||||
- The management plane gains exactly one extra trusted LAN source (`ubongo`); attack
|
|
||||||
surface grows by one keys-only + fail2ban-gated SSH path, no new exposed ports.
|
|
||||||
- Cost: per-service `access__*` declarations and a rendered `ACCESS.md` to maintain
|
|
||||||
(mitigated by the uniform host baseline + the new-role runbook step + checklist gate), plus `/check-access` to build.
|
|
||||||
|
|
||||||
## Scope
|
|
||||||
|
|
||||||
Delivered by ADR-021's implementation plan
|
|
||||||
(`docs/superpowers/plans/2026-06-09-operational-access.md`), task by task, and tracked in
|
|
||||||
`STATUS.md` as it lands — not all of it exists at the moment this ADR is written. The split
|
|
||||||
below separates the near-term tranche from longer-term build-pending work; it does not separate what already exists from what does not.
|
|
||||||
|
|
||||||
**Near-term tranche (this plan):** the doctrine; this ADR; the `ACCESS.md` template; the
|
|
||||||
`ssh-from-control` firewall management-plane source — added to ADR-020's *guaranteed
|
|
||||||
management plane* (the always-allowed block that already holds the `wt0` SSH/Ansible allow
|
|
||||||
and is explicitly independent of the service catalog), not added to the catalog itself (the
|
|
||||||
catalog owns service ingress only) — via the `base__firewall_control_addr` knob and its
|
|
||||||
nftables rule, both of which do **not** exist in `roles/base` yet and land with the
|
|
||||||
`firewall` concern of `base`; and the governance wiring (checklist item, new-role runbook step). ADR-016 and ADR-020 are amended to reference the ladder.
|
|
||||||
|
|
||||||
**Build-pending on infra:** per-service `access__*` data and rendered `ACCESS.md` files
|
|
||||||
(wait on service roles), `/check-access` *running* (waits on live hosts + staging + vault),
|
|
||||||
and the real `ubongo` LAN address value behind `base__firewall_control_addr`. Designed now,
|
|
||||||
built when there is something to verify.
|
|
||||||
|
|
||||||
**Out of scope:** broader LAN SSH (a management VLAN) — explicitly rejected, `ubongo`-only;
|
|
||||||
exercising (vs reachability-probing) the break-glass console; any access path that is not
|
|
||||||
over the mesh or the one `ubongo` LAN source.
|
|
||||||
|
|
||||||
## Related
|
|
||||||
|
|
||||||
ADR-002 (security baseline: SSH hardening, default-deny, fail2ban), ADR-004 (Docker
|
|
||||||
model, Compose), ADR-016 (NetBird mesh; amended — SSH on `wt0` **and** from `ubongo`'s
|
|
||||||
LAN address), ADR-017 (`/verify-service` Level-4 verification), ADR-018 (logging:
|
|
||||||
Alloy → Loki/Grafana), ADR-020 (firewall: service catalog + guaranteed management plane;
|
|
||||||
amended — adds the `ssh-from-control` management-plane source), ADR-019 (`firewall` tag).
|
|
||||||
|
|
@@ -91,19 +91,7 @@ For a **service** role, copy `docs/testing/service-verify-template.md` to
|
||||||
Level 4 `/verify-service` check (ADR-008 / ADR-017) and is part of the pre-production
|
Level 4 `/verify-service` check (ADR-008 / ADR-017) and is part of the pre-production
|
||||||
service-clearance gate (`docs/security/service-checklist.md`).
|
service-clearance gate (`docs/security/service-checklist.md`).
|
||||||
|
|
||||||
### 11. Write the per-service operational-access record (services)
|
### 11. Commit
|
||||||
|
|
||||||
For a **service** role, copy `docs/access/service-access-template.md` to
|
|
||||||
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
|
|
||||||
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
|
|
||||||
`access__log.loki_labels`, and `access__api` — `enabled` + endpoint + `firewall_ref` +
|
|
||||||
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
|
|
||||||
rendered from that data; the admin-API path must `firewall_ref` an entry in the
|
|
||||||
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
|
|
||||||
`/check-access <rolename>` proves the documented paths are live — part of the
|
|
||||||
service-clearance gate (`docs/security/service-checklist.md`).
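
For a service with no admin API, the `access__*` data might reduce to the sketch below. The key names are the ones listed in this step and in ADR-021 (`reason` is the key ADR-021's plan pairs with `enabled: false`). The service name, project, path, container, label, and reason text are hypothetical placeholders, and modelling `loki_labels` as a label-to-value mapping is an assumption.

```yaml
# Sketch only: role defaults for a service without an admin API.
# All values are hypothetical placeholders.
access__service: example-service
access__compose_project: example-service
access__compose_path: /opt/example-service/compose.yml   # hypothetical path
access__containers:
  - example-service
access__log:
  loki_labels:
    service: example-service          # assumed label name
access__api:
  enabled: false
  reason: "No admin API; the service is managed through its web UI only."
```

Declaring the API path as disabled with a reason keeps the record explicit: a missing `access__api` block and a deliberate "there is none" are then distinguishable in review.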
|
|
||||||
|
|
||||||
### 12. Commit
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git checkout -b role/<rolename>
|
git checkout -b role/<rolename>
|
||||||
|
|
|
||||||
|
|
@@ -51,10 +51,6 @@ This checklist is the generic **bar**. Each service answers it in its own
|
||||||
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the
|
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the
|
||||||
service has a populated `roles/<service>/VERIFY.md` and its critical journeys
|
service has a populated `roles/<service>/VERIFY.md` and its critical journeys
|
||||||
verified (ADR-008 Level 4 / ADR-017)
|
verified (ADR-008 Level 4 / ADR-017)
|
||||||
- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
|
|
||||||
data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
|
|
||||||
documented paths green — or a deviation is recorded in
|
|
||||||
`docs/security/accepted-risks.md`
|
|
||||||
|
|
||||||
> Deviations are allowed but must be **conscious**: record them in
|
> Deviations are allowed but must be **conscious**: record them in
|
||||||
> `docs/security/accepted-risks.md`, don't leave them implicit.
|
> `docs/security/accepted-risks.md`, don't leave them implicit.
|
||||||
|
|
|
||||||
|
|
@@ -1,544 +0,0 @@
|
||||||
# Operational Access (ADR-021) Implementation Plan
|
|
||||||
|
|
||||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
|
||||||
|
|
||||||
**Goal:** Establish operational access as a deployment deliverable — a documented, verifiable set of mesh-reachable troubleshooting paths for every host and service — by writing ADR-021, reconciling the latent ADR-016/020 SSH contradiction, adding the control-node SSH source to the host firewall, and wiring the `ACCESS.md` record + `/check-access` verifier into boma's governance.
|
|
||||||
|
|
||||||
**Architecture:** Source of truth is the committed design spec `docs/superpowers/specs/2026-06-09-operational-access-design.md`. Structured access facts live as declarative `access__*` data that renders `ACCESS.md` and drives `/check-access` (the access analogue of `VERIFY.md` + `/verify-service`). Work is split into **Tranche A — land now** (doctrine docs, the one firewall code change, the dormant `/check-access` command, governance wiring) and **Tranche B — build-pending on infra** (per-service `access__*` population, rendered `ACCESS.md` files, and `/check-access` *running*), which arrive with service roles and live hosts and require no action in this plan.
|
|
||||||
|
|
||||||
**Tech Stack:** Markdown ADRs/docs; Ansible role `base` (Jinja2 nftables template + `defaults/main.yml`); Molecule (Debian 13, render + `nft -c`, no apply) for the firewall test; Claude Code command file for `/check-access`.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## File structure
|
|
||||||
|
|
||||||
| File | Tranche | Responsibility |
|
|
||||||
|---|---|---|
|
|
||||||
| `docs/decisions/021-operational-access.md` | A | NEW — the doctrine (two layers, three-tier ladder, break-glass, `access__*` model, `/check-access`) |
|
|
||||||
| `docs/decisions/016-mesh-vpn.md` | A | MODIFY — reconcile: SSH on `wt0` **and** from `ubongo`'s LAN address |
|
|
||||||
| `docs/decisions/020-firewall.md` | A | MODIFY — guaranteed management plane gains the control-node SSH source |
|
|
||||||
| `docs/access/service-access-template.md` | A | NEW — the `ACCESS.md` record shape (rendered-from-data + prose tail) |
|
|
||||||
| `roles/base/defaults/main.yml` | A | MODIFY — add `base__firewall_control_addr` knob (default empty → no-op) |
|
|
||||||
| `roles/base/templates/nftables.conf.j2` | A | MODIFY — conditional management-plane SSH rule for the control address |
|
|
||||||
| `roles/base/molecule/default/converge.yml` | A | MODIFY — set the knob for the test |
|
|
||||||
| `roles/base/molecule/default/verify.yml` | A | MODIFY — assert the rendered rule |
|
|
||||||
| `.claude/commands/check-access.md` | A | NEW — the `/check-access` verifier command (dormant until infra exists) |
|
|
||||||
| `docs/security/service-checklist.md` | A | MODIFY — one new gate item |
|
|
||||||
| `docs/runbooks/new-role.md` | A | MODIFY — new step: write `ACCESS.md` (mirrors SECURITY/VERIFY steps) |
|
|
||||||
| `CLAUDE.md` | A | MODIFY — `ACCESS.md` in Role conventions; ADR-021 in Further reading |
|
|
||||||
| `STATUS.md` | A | MODIFY — new rows for the doctrine, the firewall source, `/check-access` |
|
|
||||||
| `docs/TODO.md` | A | MODIFY — mark 3.2 + 7.2 DECIDED → ADR-021 |
|
|
||||||
|
|
||||||
**Tranche B (no tasks here — captured for the record):** per-service `access__*` blocks + rendered `roles/<svc>/ACCESS.md` land when each service role is built (governed by the Tranche-A checklist + runbook); `/check-access` *running* lands when `ubongo` + staging + vault exist. Both are designed-now, build-pending — exactly like `/verify-service` under ADR-017.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Tranche A — Land now
|
|
||||||
|
|
||||||
### Task 1: Write ADR-021
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Create: `docs/decisions/021-operational-access.md`
|
|
||||||
|
|
||||||
The ADR is the durable decision record derived from the committed spec
|
|
||||||
`docs/superpowers/specs/2026-06-09-operational-access-design.md`. Match the prose style and
|
|
||||||
heading shape of an existing ADR (read `docs/decisions/020-firewall.md` first). The ADR
|
|
||||||
**must** state these specifics, because they are the parts that are easy to get wrong:
|
|
||||||
|
|
||||||
- **Doctrine sentence (verbatim):** *"Every host and every service guarantees at least one
|
|
||||||
documented, verifiable way in for operational troubleshooting — and the deploy that
|
|
||||||
creates it also records and proves it."*
|
|
||||||
- **Two layers:** host baseline (resolves TODO 7.2) + per-service record (resolves TODO 3.2).
|
|
||||||
- **Three-tier access ladder:** (1) `wt0` mesh SSH — primary, WireGuard-authenticated;
|
|
||||||
(2) LAN SSH from `ubongo` only — secondary, mesh-independent, source-IP-gated **plus**
|
|
||||||
keys-only + fail2ban; all other LAN hosts stay default-denied; (3) console — break-glass
|
|
||||||
per host class: cluster VMs → Proxmox serial/VNC console, `askari` → Hetzner
|
|
||||||
rescue/console, `ubongo` → local console; reachability-checked, never exercised.
|
|
||||||
- **Reconciliation, not weakening (state this explicitly):** ADR-016 already requires
|
|
||||||
Ansible to reach the fleet by LAN IP ("a mesh/coordinator outage never blocks on-LAN
|
|
||||||
runs"), which *requires* LAN SSH from `ubongo`; yet ADR-016 also said "SSH only on `wt0`"
|
|
||||||
and ADR-020's guaranteed management plane listed only `wt0`. ADR-021 resolves that latent
|
|
||||||
contradiction by making the control-node SSH allow explicit and adding it to the
|
|
||||||
guaranteed management plane. It does **not** weaken default-deny: exactly one extra
|
|
||||||
trusted source on the LAN.
|
|
||||||
- **Declarative `access__*` data model:** service-role defaults carry `access__service`,
|
|
||||||
`access__compose_project`, `access__compose_path`, `access__containers`,
|
|
||||||
`access__log.loki_labels`, and `access__api` (`enabled`, `base_url`, `firewall_ref`,
|
|
||||||
`auth.vault_ref`, `health_path`; or `enabled: false` + `reason`). **Invariant:**
|
|
||||||
`access__api` never opens a port — it `firewall_ref`s the `group_vars` firewall catalog;
|
|
||||||
ADR-020 stays the sole owner of exposure.
|
|
||||||
- **Rendered record:** `ACCESS.md` is rendered from that data + a prose tail (operational
|
|
||||||
notes / gotchas). First-class sibling of `SECURITY.md`/`VERIFY.md`.
|
|
||||||
- **`/check-access`:** the verifier that probes each declared path and reports which are
|
|
||||||
live; break-glass reachability-only; designed now, build-pending on infra.
|
|
||||||
- **Status / consequences:** what lands now vs build-pending (mirror this plan's split).
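
The data-model bullet above can be pictured as role defaults along these lines. This is a sketch, not the committed schema: the key names come from the bullet, while every value (project name, paths, container names, label, catalog entry name, vault reference, URL, health path) is a hypothetical placeholder, and the nesting of `access__log` and `access__api.auth` is inferred from the dotted key names.

```yaml
# Sketch only: service-role defaults carrying the access__* data.
# All values are hypothetical placeholders.
access__service: photoprism
access__compose_project: photoprism
access__compose_path: /opt/photoprism/compose.yml          # hypothetical path
access__containers:
  - photoprism
  - photoprism-db
access__log:
  loki_labels:
    service: photoprism                # assumed label name
access__api:
  enabled: true
  base_url: "https://photoprism.example.internal"          # hypothetical endpoint
  firewall_ref: photoprism-admin       # names a catalog entry; opens no port itself
  auth:
    vault_ref: "photoprism/admin-api-token"                # hypothetical vault reference
  health_path: /api/v1/status          # hypothetical health path
```

The point of `firewall_ref` being a name rather than a port number is the invariant stated above: exposure stays in the `group_vars` firewall catalog, and the access data only describes how to use a path that the catalog already allows.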
|
|
||||||
|
|
||||||
- [ ] **Step 1: Author the ADR**
|
|
||||||
|
|
||||||
Write `docs/decisions/021-operational-access.md` covering every bullet above, in the
|
|
||||||
house style of `docs/decisions/020-firewall.md` (problem → decision → layers/ladder →
|
|
||||||
data model → verifier → consequences). Open with a one-line title heading
|
|
||||||
`# ADR-021 — Operational access: documented, verifiable ways in`.
|
|
||||||
|
|
||||||
- [ ] **Step 2: Sanity-check internal links**
|
|
||||||
|
|
||||||
Run: `grep -n "ADR-01[67]\|ADR-020\|access__\|check-access\|ACCESS.md" docs/decisions/021-operational-access.md`
|
|
||||||
Expected: references to ADR-016, ADR-020, the `access__*` keys, `/check-access`, and
|
|
||||||
`ACCESS.md` all present.
|
|
||||||
|
|
||||||
- [ ] **Step 3: Commit**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git add docs/decisions/021-operational-access.md
|
|
||||||
git commit -m "docs(access): add ADR-021 operational-access doctrine"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Task 2: Reconcile ADR-016 and ADR-020
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Modify: `docs/decisions/016-mesh-vpn.md` (the "Host firewall" bullet, ~lines 64-65)
|
|
||||||
- Modify: `docs/decisions/020-firewall.md` (the "Guaranteed management plane" bullet, ~lines 42-45)
|
|
||||||
|
|
||||||
- [ ] **Step 1: Amend ADR-016's Host-firewall bullet**
|
|
||||||
|
|
||||||
Replace the existing bullet:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
|
|
||||||
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
|
|
||||||
```
|
|
||||||
|
|
||||||
with:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
|
|
||||||
(primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
|
|
||||||
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
|
|
||||||
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
|
|
||||||
explicit the control-node SSH allow that the recovery model already implied; the access
|
|
||||||
doctrine and the three-tier access ladder live in **ADR-021**.
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 2: Amend ADR-020's guaranteed-management-plane bullet**
|
|
||||||
|
|
||||||
Replace:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
|
|
||||||
ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
|
|
||||||
atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
|
|
||||||
is allowed only on `wt0`.)
|
|
||||||
```
|
|
||||||
|
|
||||||
with:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
|
|
||||||
ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
|
|
||||||
the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
|
|
||||||
catalog, applied atomically — a malformed or empty catalog can never lock out
|
|
||||||
management. The control-node source is part of the guaranteed plane, not the service
|
|
||||||
catalog (it is management, not a service); see ADR-021 for the access doctrine.
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 3: Commit**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git add docs/decisions/016-mesh-vpn.md docs/decisions/020-firewall.md
|
|
||||||
git commit -m "docs(access): reconcile ADR-016/020 with control-node SSH source (ADR-021)"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Task 3: The `ACCESS.md` record template
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Create: `docs/access/service-access-template.md`
|
|
||||||
|
|
||||||
Match the preamble convention of `docs/security/service-security-template.md` and
|
|
||||||
`docs/testing/service-verify-template.md` (a "copy this to `roles/<service>/ACCESS.md`"
|
|
||||||
preamble, then a `---`, then the record).
|
|
||||||
|
|
||||||
- [ ] **Step 1: Write the template**
|
|
||||||
|
|
||||||
Create `docs/access/service-access-template.md`:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
# Per-service operational-access record — template
|
|
||||||
|
|
||||||
Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
|
|
||||||
It is the per-service **operational-access record**: every documented, verifiable way in
|
|
||||||
for troubleshooting. The structured parts are **rendered from the role's `access__*`
|
|
||||||
data** (the single source of truth that also drives `/check-access`) — keep the data
|
|
||||||
authoritative and regenerate this file rather than hand-editing the tables. The prose
|
|
||||||
"Operational notes" tail is hand-written.
|
|
||||||
|
|
||||||
Delete this preamble in the copy and start from the heading below.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
# Access — <service>
|
|
||||||
|
|
||||||
## Access paths
|
|
||||||
|
|
||||||
The mesh-reachable ways in, by tier (rendered from `access__*`):
|
|
||||||
|
|
||||||
| Tier | Path | Invocation |
|
|
||||||
|---|---|---|
|
|
||||||
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
|
|
||||||
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
|
|
||||||
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
|
|
||||||
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
|
|
||||||
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |
|
|
||||||
|
|
||||||
## Break-glass
|
|
||||||
|
|
||||||
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
|
|
||||||
|
|
||||||
- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>
|
|
||||||
|
|
||||||
## Operational notes
|
|
||||||
|
|
||||||
Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.
|
|
||||||
|
|
||||||
- <none yet>
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 2: Commit**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git add docs/access/service-access-template.md
|
|
||||||
git commit -m "docs(access): add ACCESS.md service record template"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Task 4: Add the control-node SSH source to the host firewall (TDD)
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Modify: `roles/base/defaults/main.yml`
|
|
||||||
- Modify: `roles/base/templates/nftables.conf.j2`
|
|
||||||
- Modify: `roles/base/molecule/default/converge.yml`
|
|
||||||
- Modify: `roles/base/molecule/default/verify.yml`
|
|
||||||
|
|
||||||
This is the only code in Tranche A. It adds an **optional** guaranteed-management-plane
|
|
||||||
allow for SSH from the control node's LAN address. Default empty ⇒ no rule rendered ⇒
|
|
||||||
no behaviour change until a real `ubongo` address is set in `group_vars` (build-pending).
|
|
||||||
Test path is the established one for this role: Molecule render + `nft -c` (no apply).
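
Once `ubongo` has a real LAN address, enabling the rule should come down to a single override in `group_vars`. A sketch, assuming a placeholder address from the documentation range 192.0.2.0/24; the real value is build-pending, and which `group_vars` file holds it is not decided here:

```yaml
# group_vars override (sketch). 192.0.2.10 is a placeholder,
# not ubongo's real LAN address.
base__firewall_control_addr: "192.0.2.10"
```

With the template change in Step 4 below, a non-empty value would render one extra `ip saddr 192.0.2.10 tcp dport <base__firewall_ssh_port> accept` rule next to the `wt0` SSH allow; leaving the variable empty renders nothing.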
|
|
||||||
|
|
||||||
- [ ] **Step 1: Write the failing test — converge sets the knob, verify asserts the rule**
|
|
||||||
|
|
||||||
In `roles/base/molecule/default/converge.yml`, add the knob under `vars:` (alongside
|
|
||||||
`base__firewall_apply: false`):
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
base__firewall_control_addr: 10.10.0.99 # test control-node LAN address
|
|
||||||
```
|
|
||||||
|
|
||||||
In `roles/base/molecule/default/verify.yml`, extend the "management plane" assert block's
|
|
||||||
`that:` list (the task asserting default-deny + `wt0` SSH) with:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
- "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 2: Run the test to verify it fails**
|
|
||||||
|
|
||||||
Run: `make test ROLE=base`
|
|
||||||
Expected: FAIL — the verify assert "input chain is missing default-deny or the management
|
|
||||||
plane" fires, because the template does not yet render the control-address rule.
|
|
||||||
|
|
||||||
- [ ] **Step 3: Add the default knob**
|
|
||||||
|
|
||||||
In `roles/base/defaults/main.yml`, after the `base__firewall_mgmt_interface` line, add:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
base__firewall_control_addr: "" # control-node LAN address (ubongo); SSH allowed from it
|
|
||||||
# as the guaranteed-management-plane `ssh-from-control`
|
|
||||||
# source (ADR-021). Empty = no rule. Set in group_vars
|
|
||||||
# once ubongo exists.
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 4: Render the rule in the template**
|
|
||||||
|
|
||||||
In `roles/base/templates/nftables.conf.j2`, immediately after the `wt0` SSH line (the
|
|
||||||
`iifname "{{ base__firewall_mgmt_interface }}" ...` line), add:
|
|
||||||
|
|
||||||
```jinja
|
|
||||||
{% if base__firewall_control_addr %}
|
|
||||||
ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
|
|
||||||
{% endif %}
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 5: Run the test to verify it passes**
|
|
||||||
|
|
||||||
Run: `make test ROLE=base`
|
|
||||||
Expected: PASS — the rule `ip saddr 10.10.0.99 tcp dport 22 accept` renders, `nft -c`
|
|
||||||
syntax-check succeeds, and all prior assertions (default-deny, `wt0` SSH, zone rules,
|
|
||||||
drop-in hook) still pass.
|
|
||||||
|
|
||||||
- [ ] **Step 6: Lint**
|
|
||||||
|
|
||||||
Run: `make lint`
|
|
||||||
Expected: PASS (no tag/FQCN/yaml regressions).
|
|
||||||
|
|
||||||
- [ ] **Step 7: Commit**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
|
|
||||||
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
|
|
||||||
git commit -m "feat(base): add ssh-from-control management-plane source (ADR-021)"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Task 5: Author the `/check-access` command (dormant until infra)
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Create: `.claude/commands/check-access.md`
|
|
||||||
|
|
||||||
Mirror the structure of `.claude/commands/verify-service.md` (a forward-looking command
|
|
||||||
with a hard Prerequisites gate). It does not run until `ubongo` + live/staging hosts +
|
|
||||||
vault exist; if a prerequisite is missing it must say so and stop.
|
|
||||||
|
|
||||||
- [ ] **Step 1: Write the command**
|
|
||||||
|
|
||||||
Create `.claude/commands/check-access.md`:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
Operational-access verification (ADR-021)
|
|
||||||
|
|
||||||
Probe every documented way in to a service or host from `ubongo` and report which paths
|
|
||||||
are live. Reads the target's `access__*` data (and host baseline), so the verifier and
|
|
||||||
`ACCESS.md` can never disagree. Argument: a service/role name or a host
|
|
||||||
(e.g. `/check-access photoprism`, `/check-access docker01`).
|
|
||||||
|
|
||||||
## Prerequisites (forward-looking — ADR-021 dependencies)
|
|
||||||
|
|
||||||
This command cannot run until these exist. If any is missing, say so and stop; do not
|
|
||||||
improvise around it:
|
|
||||||
|
|
||||||
- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
|
|
||||||
- The target host/service is deployed (staging or production inventory).
|
|
||||||
- `roles/<name>/` carries `access__*` data (services) / the host baseline applies.
|
|
||||||
- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
|
|
||||||
|
|
||||||
## Process
|
|
||||||
|
|
||||||
### Phase 0 — resolve the target
|
|
||||||
|
|
||||||
Resolve the argument to a host or a service role + its host. Load the `access__*` data
|
|
||||||
(service) or the host-baseline + break-glass record (host). State what you will probe.
|
|
||||||
|
|
||||||
### Phase 1 — probe each declared path
|
|
||||||
|
|
||||||
| Path | Probe | Green = |
|
|
||||||
|---|---|---|
|
|
||||||
| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
|
|
||||||
| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
|
|
||||||
| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
|
|
||||||
| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
|
|
||||||
| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
|
|
||||||
| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
|
|
||||||
|
|
||||||
Break-glass is **never exercised** — firing a serial console is invasive; confirm the
|
|
||||||
fallback exists, do not drive it.
|
|
||||||
|
|
||||||
### Phase 2 — report
|
|
||||||
|
|
||||||
Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
|
|
||||||
in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
|
|
||||||
`ssh-from-control` rule"). Verdict line: e.g. "5/6 paths green; admin API red".
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- Read-only and non-destructive — probes confirm reachability, they do not change state.
|
|
||||||
- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
|
|
||||||
control node + hosts exist.
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 2: Commit**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git add .claude/commands/check-access.md
|
|
||||||
git commit -m "feat(access): add /check-access verifier command (ADR-021, dormant)"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Task 6: Governance wiring — checklist + runbook
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Modify: `docs/security/service-checklist.md` (the "Operability (security-adjacent)" section)
|
|
||||||
- Modify: `docs/runbooks/new-role.md` (after step 10, the VERIFY.md step)
|
|
||||||
|
|
||||||
ACCESS.md mirrors how SECURITY.md/VERIFY.md are enforced: a manual runbook step + a
|
|
||||||
checklist gate (the scaffold does not auto-drop SECURITY/VERIFY today either, so ACCESS
|
|
||||||
follows the same manual-copy pattern — no Makefile change).
|
|
||||||
|
|
||||||
- [ ] **Step 1: Add the checklist gate item**
|
|
||||||
|
|
||||||
In `docs/security/service-checklist.md`, under `## Operability (security-adjacent)`, add a
|
|
||||||
bullet after the `/verify-service` item:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
|
|
||||||
data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
|
|
||||||
documented paths green — or a deviation is recorded in
|
|
||||||
`docs/security/accepted-risks.md`
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 2: Add the runbook step**
|
|
||||||
|
|
||||||
In `docs/runbooks/new-role.md`, insert a new step between step 10 (VERIFY.md) and the
|
|
||||||
final commit step, and renumber the commit step to 12:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
### 11. Write the per-service operational-access record (services)
|
|
||||||
|
|
||||||
For a **service** role, copy `docs/access/service-access-template.md` to
|
|
||||||
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
|
|
||||||
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
|
|
||||||
`access__log.loki_labels`, and `access__api` — `enabled` + endpoint + `firewall_ref` +
|
|
||||||
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
|
|
||||||
rendered from that data; the admin-API path must `firewall_ref` an entry in the
|
|
||||||
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
|
|
||||||
`/check-access <rolename>` proves the documented paths are live — part of the
|
|
||||||
service-clearance gate (`docs/security/service-checklist.md`).
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 3: Verify renumbering**
|
|
||||||
|
|
||||||
Run: `grep -n "^### 1[12]\." docs/runbooks/new-role.md`
|
|
||||||
Expected: `### 11. Write the per-service operational-access record` and `### 12. Commit`.
|
|
||||||
|
|
||||||
- [ ] **Step 4: Commit**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git add docs/security/service-checklist.md docs/runbooks/new-role.md
|
|
||||||
git commit -m "docs(access): gate ACCESS.md in checklist + new-role runbook (ADR-021)"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Task 7: Index wiring — CLAUDE.md, STATUS.md, TODO.md
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Modify: `CLAUDE.md` (Role conventions list + Further reading table)
|
|
||||||
- Modify: `STATUS.md` (Designed-but-not-built table)
|
|
||||||
- Modify: `docs/TODO.md` (items 3.2 and 7.2)
|
|
||||||
|
|
||||||
- [ ] **Step 1: CLAUDE.md — Role conventions**
|
|
||||||
|
|
||||||
In the `## Role conventions` list, after the `VERIFY.md` bullet
|
|
||||||
("Every **service** role must have a populated `VERIFY.md` ..."), add:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
|
|
||||||
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 2: CLAUDE.md — Further reading**
|
|
||||||
|
|
||||||
In the Further reading table, after the Firewall strategy row, add:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
| Operational access | `docs/decisions/021-operational-access.md` |
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 3: STATUS.md — new rows**
|
|
||||||
|
|
||||||
In the `## Designed but not built` table, add:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
|
|
||||||
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
|
|
||||||
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
|
|
||||||
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 4: docs/TODO.md — mark 3.2 and 7.2 DECIDED**
|
|
||||||
|
|
||||||
In `docs/TODO.md`, change item **3.2** from:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
2. Decide how to manage APIs / API access.
|
|
||||||
```
|
|
||||||
|
|
||||||
to:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
2. ~~Decide how to manage APIs / API access.~~ DECIDED (ADR-021): per-service `access__*`
|
|
||||||
data declares the admin API (endpoint + `firewall_ref` to the catalog + vault token
|
|
||||||
ref + health path); rendered into `ACCESS.md` and probed by `/check-access`. Part of
|
|
||||||
the two-layer operational-access doctrine.
|
|
||||||
```
|
|
||||||
|
|
||||||
And change item **7.2** from:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
2. Decide what to set up on the hosts, given that direct access will be rare.
|
|
||||||
```
|
|
||||||
|
|
||||||
to:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
2. ~~Decide what to set up on the hosts, given that direct access will be rare.~~
|
|
||||||
DECIDED (ADR-021): the host-layer access baseline — SSH on `wt0` + from `ubongo`,
|
|
||||||
Docker/Compose tooling, Alloy log shipping, and a recorded break-glass console per
|
|
||||||
host class.
|
|
||||||
```
|
|
||||||
|
|
||||||
- [ ] **Step 5: Verify and commit**
|
|
||||||
|
|
||||||
Run: `grep -n "021-operational-access\|ACCESS.md\|ssh-from-control" CLAUDE.md STATUS.md`
|
|
||||||
Expected: the new Role-conventions bullet, the Further-reading row, and the STATUS rows
|
|
||||||
are present.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git add CLAUDE.md STATUS.md docs/TODO.md
|
|
||||||
git commit -m "docs(access): wire ADR-021 into CLAUDE.md, STATUS, TODO"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Tranche B — Build-pending on infra (no tasks now)
|
|
||||||
|
|
||||||
Recorded so the boundary is explicit; nothing here is actioned by this plan.
|
|
||||||
|
|
||||||
- **Per-service `access__*` + rendered `ACCESS.md`** — authored when each service role is
|
|
||||||
built, governed by the Task 6 checklist item + runbook step. The first real service role
|
|
||||||
is where this first runs.
|
|
||||||
- **`/check-access` running** — needs `ubongo` + a live/staging host + vault. The command
|
|
||||||
(Task 5) already gates on these and stops cleanly until then.
|
|
||||||
- **Real `base__firewall_control_addr` value** — set in `group_vars/all` to `ubongo`'s LAN
|
|
||||||
address once `ubongo` is in inventory; the machinery + test landed in Task 4.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Self-review
|
|
||||||
|
|
||||||
**Spec coverage:** doctrine + two layers → Task 1; three-tier ladder + ADR-016/020
|
|
||||||
reconciliation → Tasks 1–2, 4; `access__*` model + invariant → Tasks 1, 3, 6; rendered
|
|
||||||
`ACCESS.md` → Task 3; `/check-access` → Task 5; governance (checklist/runbook) → Task 6;
|
|
||||||
repo wiring (CLAUDE/STATUS/TODO) → Task 7; build-now vs build-pending split → Tranches
|
|
||||||
A/B. All spec sections map to a task.
|
|
||||||
|
|
||||||
**Deviations from the spec (deliberate, flagged for the user):**
|
|
||||||
1. The spec called `ssh-from-control` a *catalog* source; the plan places it in the
|
|
||||||
*guaranteed management plane* (`base__firewall_control_addr`) instead — ADR-020 already
|
|
||||||
houses SSH/Ansible management allows there, independent of the catalog, and the spec's
|
|
||||||
own invariant says the catalog owns *service* exposure only. Same intent, correct home.
|
|
||||||
2. The spec said `make new-role` would *scaffold* an `ACCESS.md` stub; the plan instead adds
|
|
||||||
a manual runbook step (Task 6) mirroring how `SECURITY.md`/`VERIFY.md` are handled today
|
|
||||||
(also manual copies, not scaffolded). Avoids unilaterally restructuring the scaffold;
|
|
||||||
the "can't be forgotten" intent is met by the checklist gate + runbook step.
|
|
||||||
|
|
||||||
**Type/name consistency:** `base__firewall_control_addr` (knob), `access__service` /
|
|
||||||
`access__compose_project` / `access__compose_path` / `access__containers` /
|
|
||||||
`access__log.loki_labels` / `access__api.{enabled,base_url,firewall_ref,auth.vault_ref,health_path}`
|
|
||||||
are used identically across Tasks 1, 3, 5, 6. The rendered nftables rule string
|
|
||||||
`ip saddr <addr> tcp dport 22 accept` matches between Task 4's template (Step 4) and its
|
|
||||||
assertion (Step 1).
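
As a rough illustration of the rule string named above, the sketch below mirrors the template's "empty address means no rule" behaviour in plain shell. `render_control_rule` is a hypothetical helper written for this example only; the real rendering is done by the Jinja template in `base`.

```bash
#!/usr/bin/env bash
# Sketch only: reproduces the shape of the ssh-from-control rule.
# render_control_rule is hypothetical, not part of the role.
render_control_rule() {
  local addr="$1" port="${2:-22}"
  # An empty address means the knob is unset, so no rule is emitted.
  if [ -z "$addr" ]; then
    return 0
  fi
  printf 'ip saddr %s tcp dport %s accept\n' "$addr" "$port"
}

# Example: the test control-node address used in the Molecule scenario.
render_control_rule "10.10.0.99"
```

Keeping the rule string in one place like this is what lets the template and the assertion be compared character for character.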
|
|
||||||
|
|
@ -1,214 +0,0 @@
|
||||||
# Design — Operational access (ADR-021)
|
|
||||||
|
|
||||||
- **Date:** 2026-06-09
|
|
||||||
- **Status:** Approved design — pending implementation plan
|
|
||||||
- **Implements:** New ADR-021. Resolves TODO 3.2 (API / API access) and TODO 7.2
|
|
||||||
(what to set up on hosts, given direct access will be rare).
|
|
||||||
- **Amends:** ADR-016 (SSH was mesh-only; now also from `ubongo`'s LAN address) and
|
|
||||||
ADR-020 (adds an `ssh-from-control` symbolic catalog source).
|
|
||||||
- **Scope:** The operational-access *doctrine* + the declarative `access__*` data model,
|
|
||||||
the rendered `ACCESS.md` record, and the `/check-access` verifier design. It does **not**
|
|
||||||
build any of it — `base`/service roles and live hosts don't exist yet. Designed now,
|
|
||||||
built when there is something to access.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
|
|
||||||
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
|
|
||||||
ports (ADR-002/020). That posture is correct — but it leaves an unanswered operational
|
|
||||||
question: **when a service or host breaks, how does the operator (and the AI working on
|
|
||||||
boma's behalf from `ubongo`) actually get in to troubleshoot it?**
|
|
||||||
|
|
||||||
Experience on similar projects shows troubleshooting is far more effective with *several*
|
|
||||||
documented ways in — SSH, container exec, logs, an admin API — so a single broken path
|
|
||||||
doesn't leave the operator blind. Today boma has no standard guaranteeing those paths exist, are
|
|
||||||
documented, or still work. The risk is the classic one: the access you assumed you had is
|
|
||||||
stale exactly when you need it (key rotated, API disabled, token expired).
|
|
||||||
|
|
||||||
boma already has the right *shape* for the fix. Service roles carry record docs —
|
|
||||||
`SECURITY.md` (security answers) and `VERIFY.md` (acceptance spec) — gated by the service
|
|
||||||
checklist and the `new-role` runbook. What's missing is the third sibling: an
|
|
||||||
**operational access record**, plus the doctrine behind it.
|
|
||||||
|
|
||||||
Two constraints shape the design:
|
|
||||||
|
|
||||||
1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
|
|
||||||
paths over the *trusted* interface, never new exposed ports. Resolution: all routine
|
|
||||||
access runs over the mesh from `ubongo`.
|
|
||||||
2. **A documented path that is never tested drifts.** It fails exactly when needed. So
|
|
||||||
the structured access facts must be *data* that both renders the doc and drives an
|
|
||||||
active verifier — the two can then never disagree.
|
|
||||||
|
|
||||||
## Decisions settled in brainstorming
|
|
||||||
|
|
||||||
- **Access is a deployment deliverable.** The deploy that creates a host/service also
|
|
||||||
records and (by design) proves its access paths. Not rediscovered under pressure.
|
|
||||||
- **All routine access over the mesh** (`wt0`, from `ubongo`). No new LAN/WAN exposure.
|
|
||||||
- **Two layers:** a host-level access baseline (resolves TODO 7.2) and a per-service
|
|
||||||
access record (resolves TODO 3.2).
|
|
||||||
- **Baseline paths, every service:** host SSH, container exec + compose, logs
|
|
||||||
(Loki/Grafana, ADR-018), and the service admin API where one exists (`n/a` otherwise).
|
|
||||||
- **A new first-class sibling record** `ACCESS.md` (next to `SECURITY.md`/`VERIFY.md`),
|
|
||||||
**rendered from declarative data** — not hand-written prose (the firewall-catalog
|
|
||||||
philosophy of ADR-020 applied to access).
|
|
||||||
- **Active verification designed in:** a `/check-access` skill probes the declared paths
|
|
||||||
and reports which are live — the access analogue of `/verify-service` (ADR-017).
|
|
||||||
- **Direct LAN SSH from `ubongo` only** is added as a second, mesh-independent path
|
|
||||||
(amends ADR-016); all other LAN hosts stay blocked by default-deny.
|
|
||||||
|
|
||||||
## The doctrine
|
|
||||||
|
|
||||||
> **Every host and every service guarantees at least one documented, verifiable way in
|
|
||||||
> for operational troubleshooting — and the deploy that creates it also records and
|
|
||||||
> proves it.**
|
|
||||||
|
|
||||||
### Two layers
|
|
||||||
|
|
||||||
- **Host layer** (TODO 7.2). Every host, via the `base` role, guarantees a fixed access
|
|
||||||
baseline: SSH over `wt0` and from `ubongo` (below), Docker/Compose tooling present, and
|
|
||||||
log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a known, uniform set of
|
|
||||||
paths exists over the mesh. This is boma's answer to "what every host runs for access."
|
|
||||||
- **Service layer** (TODO 3.2). Every service role guarantees and records its paths:
|
|
||||||
container exec + compose management, its Loki log labels, and its admin API where one
|
|
||||||
exists (enabled, token in vault, endpoint + health probe documented) or explicit `n/a`.
|
|
||||||
|
|
||||||
### The three-tier access ladder
|
|
||||||
|
|
||||||
1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
|
|
||||||
before SSH sees it. The preferred path (ADR-016's original rationale).
|
|
||||||
2. **LAN SSH from `ubongo` — secondary, mesh-independent.** Most hardware (all but
|
|
||||||
`askari`) shares a LAN. SSH from `ubongo`'s LAN address is allowed via a new catalog
|
|
||||||
source, giving a fallback that survives a NetBird/`wt0` outage. It is gated by *source
|
|
||||||
IP* (spoofable on a LAN) **plus** the standing keys-only + fail2ban SSH hardening, so
|
|
||||||
the marginal cost is "SSH daemon reachable from the LAN broadcast domain from one
|
|
||||||
trusted host" — modest and deliberate. All *other* LAN hosts remain default-denied.
|
|
||||||
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, not
|
|
||||||
used for routine work:
|
|
||||||
- **Cluster VMs** → Proxmox serial/VNC console (`qm terminal` / console via the
|
|
||||||
Proxmox host) — independent of the guest network, `wt0`, and even a broken guest
|
|
||||||
nftables ruleset.
|
|
||||||
- **`askari`** (bare-metal Hetzner) → provider rescue/console.
|
|
||||||
- **`ubongo`** (physical) → local console.
|
|
||||||
|
|
||||||
A total mesh outage therefore still leaves at least one documented way in to each box.
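
The ladder can be read as an ordered fallback. A minimal sketch, assuming each tier is represented by a probe command that exits 0 when the path answers; `first_live_tier` and the host names in the commented usage are hypothetical, not part of the design.

```bash
#!/usr/bin/env bash
# Sketch only: walk the tiers in order and print the first one whose probe succeeds.
# Arguments come in pairs: <tier-name> <probe-command>.
first_live_tier() {
  local tier probe
  while [ "$#" -ge 2 ]; do
    tier="$1"
    probe="$2"
    shift 2
    # The probe string is evaluated as a shell command; its output is discarded.
    if eval "$probe" >/dev/null 2>&1; then
      printf '%s\n' "$tier"
      return 0
    fi
  done
  return 1   # no tier answered
}

# Hypothetical usage (host names are placeholders):
# first_live_tier \
#   mesh-ssh "ssh -o BatchMode=yes -o ConnectTimeout=5 docker01.mesh true" \
#   lan-ssh  "ssh -o BatchMode=yes -o ConnectTimeout=5 docker01.lan true"
```

The console tier is deliberately left out of the usage example, since the design only checks it for reachability and never exercises it.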
|
|
||||||
|
|
||||||
## The declarative access data model (Approach B)
|
|
||||||
|
|
||||||
Structured access facts live as **data** — the single source of truth that both renders
|
|
||||||
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge.
|
|
||||||
|
|
||||||
### Service-layer — `access__*` in each service role's defaults
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
access__service: photoprism
|
|
||||||
access__compose_project: photoprism # docker compose -p <this>
|
|
||||||
access__compose_path: /opt/photoprism/compose.yml
|
|
||||||
access__containers: [photoprism, photoprism-db] # exec targets
|
|
||||||
access__log:
|
|
||||||
loki_labels: { service: photoprism } # how to query logs (ADR-018)
|
|
||||||
access__api:
|
|
||||||
enabled: true
|
|
||||||
base_url: "https://photoprism.host:2342" # reachable over the mesh
|
|
||||||
firewall_ref: photoprism-api # the catalog entry that opens it (ADR-020)
|
|
||||||
auth: { type: token, vault_ref: "vault.photoprism.api_token" }
|
|
||||||
health_path: "/api/v1/status" # what /check-access pings
|
|
||||||
# where the service has no API:
|
|
||||||
# access__api: { enabled: false, reason: "<none upstream>" }
|
|
||||||
```
|
|
||||||
|
|
||||||
**Single-source-of-truth rule:** `access__api` **never opens a port**. It `firewall_ref`s
|
|
||||||
the entry in the `group_vars` firewall catalog — ADR-020 stays the sole owner of
|
|
||||||
*exposure*. The access data adds only *how to use* the path (endpoint, token ref, health
|
|
||||||
probe). No duplication, no ad-hoc ports (CLAUDE.md: ports only in the catalog).
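
To make the "how to use the path" part concrete, here is a small sketch of how a probe might join `base_url` and `health_path` into the URL to request. `api_health_url` is a hypothetical helper, and the `Authorization: Bearer` header in the commented `curl` line is an assumption; the design only says the token comes from the vault.

```bash
#!/usr/bin/env bash
# Sketch only: build the health-check URL from the two access__api fields.
# api_health_url is hypothetical, not part of the design.
api_health_url() {
  local base_url="${1%/}" health_path="${2#/}"
  # Trailing and leading slashes are trimmed so exactly one slash joins the parts.
  printf '%s/%s\n' "$base_url" "$health_path"
}

# Hypothetical usage, with TOKEN already read from the vault:
# curl --fail --silent --show-error --max-time 5 \
#   -H "Authorization: Bearer $TOKEN" \
#   "$(api_health_url "https://photoprism.host:2342" "/api/v1/status")"
```

Nothing here opens a port: the request can only succeed if the catalog entry named by `firewall_ref` already allows it.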
|
|
||||||
|
|
||||||
### Host-layer — a fixed baseline, stated once
|
|
||||||
|
|
||||||
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
|
|
||||||
uniform, so it is asserted by `base` and recorded once at the host/group level — not
|
|
||||||
re-stated per service. The break-glass console per host class is recorded with it.
|
|
||||||
|
|
||||||
## The rendered record — `ACCESS.md`
|
|
||||||
|
|
||||||
`ACCESS.md` is **rendered** from the `access__*` data, with a prose tail for the genuinely
|
|
||||||
narrative parts:
|
|
||||||
|
|
||||||
- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
|
|
||||||
exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
|
|
||||||
invocation (`ssh host`, `docker compose -p <project> …`, the Loki query, the `curl`
|
|
||||||
against the API health path).
|
|
||||||
- **Break-glass (generated from host class)** — the Proxmox/provider console line.
|
|
||||||
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
|
|
||||||
part a template cannot know.
|
|
||||||
|
|
||||||
A `docs/access/service-access-template.md` defines the shape, alongside the existing
|
|
||||||
security/verify templates.
|
|
||||||
|
|
||||||
## The verifier — `/check-access` (designed now, build-pending on infra)
|
|
||||||
|
|
||||||
Runs from `ubongo`; turns the `access__*` data into live probes. Invoked
|
|
||||||
`/check-access <service>` (or `<host>` for the host baseline). The access analogue of
|
|
||||||
`/verify-service` (ADR-017).
|
|
||||||
|
|
||||||
| Path | Probe | Green = |
|
|
||||||
|---|---|---|
|
|
||||||
| `wt0` mesh SSH | connect over mesh, run `true` | reachable + key works |
|
|
||||||
| LAN SSH from `ubongo` | connect via LAN addr, run `true` | reachable + key works |
|
|
||||||
| exec + compose | `docker compose -p <project> ps`; exec `true` in each container | stack up, exec works |
|
|
||||||
| logs | query Loki for `loki_labels`, expect recent lines | logs flowing |
|
|
||||||
| admin API | `curl` the `health_path` with the vault token | 2xx |
|
|
||||||
| break-glass | reachability of the Proxmox/provider console endpoint only | console host reachable |
|
|
||||||
|
|
||||||
- **Break-glass is checked for reachability, not exercised** — firing a serial console is
|
|
||||||
invasive; the verifier confirms the fallback *exists* without disrupting anything.
|
|
||||||
- **Output:** a pass/fail table; on any red, it names the path and the likely cause
|
|
||||||
("API token in vault stale", "Alloy not shipping", "`ssh-from-control` catalog source
|
|
||||||
missing"). The payoff: not "the doc *says* you can get in" but "verified — three of four
|
|
||||||
paths green right now, here's the broken one."
|
|
||||||
- **Status:** designed now, build-pending on infra (needs live hosts + staging + vault),
|
|
||||||
exactly like `/verify-service` under ADR-017.
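
The pass/fail output described above could be produced by a wrapper along these lines. This is a sketch under stated assumptions, not the skill's implementation: `run_probe` is a hypothetical helper, and the row format (path name, a tab, then `PASS` or `FAIL`) is invented for the example.

```bash
#!/usr/bin/env bash
# Sketch only: run one probe command and print a single result row.
# Usage: run_probe <path-name> <command> [args...]
run_probe() {
  local name="$1"
  shift
  # The probe's own output is discarded; only its exit status matters.
  if "$@" >/dev/null 2>&1; then
    printf '%s\t%s\n' "$name" "PASS"
  else
    printf '%s\t%s\n' "$name" "FAIL"
    return 1
  fi
}

# Hypothetical usage (the compose project name is a placeholder):
# run_probe "exec + compose" docker compose -p photoprism ps
```

Returning non-zero on a failed probe lets a caller count red paths while still printing every row.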
|
|
||||||
|
|
||||||
## Governance — so it can't be forgotten
|
|
||||||
|
|
||||||
Three light touches mirror how `SECURITY.md`/`VERIFY.md` are enforced:
|
|
||||||
|
|
||||||
1. **Service checklist** (`docs/security/service-checklist.md`) gains one item: *"Access
|
|
||||||
paths declared (`access__*`), `ACCESS.md` rendered, `/check-access` green — or
|
|
||||||
deviation recorded in `accepted-risks.md`."*
|
|
||||||
2. **`new-role` runbook** (`docs/runbooks/new-role.md`) gains a step: fill `access__*`,
|
|
||||||
render `ACCESS.md`, run `/check-access`.
|
|
||||||
3. **`make new-role` scaffold** drops a stub `access__*` block + the `ACCESS.md` template
|
|
||||||
into the role — the same way roles already get `SECURITY.md`/`VERIFY.md` stubs, so it
|
|
||||||
is structurally impossible to ship a service role with no access record.
|
|
||||||
|
|
||||||
## Repo wiring
|
|
||||||
|
|
||||||
- **`docs/decisions/021-operational-access.md`** — the new ADR (doctrine, both layers,
|
|
||||||
the three-tier ladder, break-glass, the `access__*` model, `/check-access`).
|
|
||||||
- **`docs/decisions/016-mesh-vpn.md`** — amend: SSH on `wt0` **and** from `ubongo`'s LAN
|
|
||||||
address (was mesh-only). Cross-link ADR-021.
|
|
||||||
- **`docs/decisions/020-firewall.md`** — note the new `ssh-from-control` symbolic source.
|
|
||||||
- **`docs/access/service-access-template.md`** — the rendered `ACCESS.md` shape.
|
|
||||||
- **`docs/security/service-checklist.md`** — the one new gate item.
|
|
||||||
- **`docs/runbooks/new-role.md`** — the fill/render/`check-access` step.
|
|
||||||
- **`CLAUDE.md`** — `ACCESS.md` under "Role conventions"; ADR-021 in Further reading.
|
|
||||||
- **`STATUS.md`** — rows: ADR-021 doctrine *(designed)*; `ssh-from-control` catalog source
|
|
||||||
*(designed, builds with `base` firewall)*; `/check-access` *(designed, build-pending)*.
|
|
||||||
- **`docs/TODO.md`** — mark 3.2 and 7.2 DECIDED → ADR-021.
|
|
||||||
|
|
||||||
## What is buildable now vs later
|
|
||||||
|
|
||||||
- **Now:** the doctrine, ADR-021, the `ACCESS.md` template, the checklist/runbook/scaffold
|
|
||||||
wiring, and the `ssh-from-control` catalog source (the `firewall` concern of `base`
|
|
||||||
already exists, so the source can land with it).
|
|
||||||
- **Later (build-pending on infra):** `/check-access` *running*, and per-service
|
|
||||||
`ACCESS.md` *files* — both wait on service roles + live hosts. Designed now, built when
|
|
||||||
there is something to verify.
|
|
||||||
|
|
||||||
## Out of scope
|
|
||||||
|
|
||||||
- Building `base`'s non-firewall concerns, any service role, or live hosts.
|
|
||||||
- Broader LAN SSH (a management VLAN) — explicitly rejected; `ubongo`-only.
|
|
||||||
- Exercising (vs reachability-probing) the break-glass console.
|
|
||||||
- Any access path that is not over the mesh or the one `ubongo` LAN source.
|
|
||||||
|
|
@ -2,10 +2,6 @@
|
||||||
# Host firewall (nftables) behaviour knobs. Shared topology (firewall_catalog/
|
# Host firewall (nftables) behaviour knobs. Shared topology (firewall_catalog/
|
||||||
# firewall_zones) lives in group_vars/all, not here. See docs/decisions/020-firewall.md.
|
# firewall_zones) lives in group_vars/all, not here. See docs/decisions/020-firewall.md.
|
||||||
base__firewall_mgmt_interface: wt0 # SSH accepted only on this iface (NetBird, ADR-016)
|
base__firewall_mgmt_interface: wt0 # SSH accepted only on this iface (NetBird, ADR-016)
|
||||||
base__firewall_control_addr: "" # control-node LAN address (ubongo); SSH allowed from it
|
|
||||||
# as the guaranteed-management-plane `ssh-from-control`
|
|
||||||
# source (ADR-021). Empty = no rule. Set in group_vars
|
|
||||||
# once ubongo exists.
|
|
||||||
base__firewall_ssh_port: 22
|
base__firewall_ssh_port: 22
|
||||||
base__firewall_rollback_timeout: 45 # seconds before the auto-revert fires on a bad apply
|
base__firewall_rollback_timeout: 45 # seconds before the auto-revert fires on a bad apply
|
||||||
base__firewall_confirm_timeout: 20 # seconds to re-establish a fresh connection post-apply
|
base__firewall_confirm_timeout: 20 # seconds to re-establish a fresh connection post-apply
|
||||||
|
|
|
||||||
|
|
@ -5,7 +5,6 @@
|
||||||
gather_facts: true
|
gather_facts: true
|
||||||
vars:
|
vars:
|
||||||
base__firewall_apply: false
|
base__firewall_apply: false
|
||||||
base__firewall_control_addr: 10.10.0.99 # test control-node LAN address
|
|
||||||
firewall_zones:
|
firewall_zones:
|
||||||
lan: 10.30.0.0/24
|
lan: 10.30.0.0/24
|
||||||
srv: 10.20.0.0/24
|
srv: 10.20.0.0/24
|
||||||
|
|
|
||||||
|
|
@ -19,10 +19,7 @@
|
||||||
- "'type filter hook input priority 0; policy drop;' in nft"
|
- "'type filter hook input priority 0; policy drop;' in nft"
|
||||||
- "'ct state established,related accept' in nft"
|
- "'ct state established,related accept' in nft"
|
||||||
- "'iifname \"wt0\" tcp dport 22 accept' in nft"
|
- "'iifname \"wt0\" tcp dport 22 accept' in nft"
|
||||||
- "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"
|
fail_msg: "input chain is missing default-deny or the management plane"
|
||||||
fail_msg: >-
|
|
||||||
input chain is missing default-deny, the wt0 SSH allow,
|
|
||||||
or the ssh-from-control management-plane rule
|
|
||||||
|
|
||||||
- name: Assert the lan->reverse_proxy:443 ingress rule (zone source)
|
- name: Assert the lan->reverse_proxy:443 ingress rule (zone source)
|
||||||
ansible.builtin.assert:
|
ansible.builtin.assert:
|
||||||
|
|
|
||||||
|
|
@ -9,9 +9,6 @@ table inet filter {
|
||||||
ct state established,related accept
|
ct state established,related accept
|
||||||
ct state invalid drop
|
ct state invalid drop
|
||||||
iifname "{{ base__firewall_mgmt_interface }}" tcp dport {{ base__firewall_ssh_port }} accept
|
iifname "{{ base__firewall_mgmt_interface }}" tcp dport {{ base__firewall_ssh_port }} accept
|
||||||
{% if base__firewall_control_addr %}
|
|
||||||
ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
|
|
||||||
{% endif %}
|
|
||||||
ip protocol icmp accept
|
ip protocol icmp accept
|
||||||
ip6 nexthdr ipv6-icmp accept
|
ip6 nexthdr ipv6-icmp accept
|
||||||
{% for r in base__firewall_resolved %}
|
{% for r in base__firewall_resolved %}
|
||||||
|
|
|
||||||