diff --git a/docs/decisions/016-mesh-vpn.md b/docs/decisions/016-mesh-vpn.md
index 1893cec..a0d6b4a 100644
--- a/docs/decisions/016-mesh-vpn.md
+++ b/docs/decisions/016-mesh-vpn.md
@@ -90,7 +90,7 @@ allocated for it.
 
 ## Status
 
-Designed, not built — depends on the unbuilt `base` role and service-role machinery
+Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
 (STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
 `base` exists.
 
@@ -108,3 +108,22 @@ Designed, not built — depends on the unbuilt `base` role and service-role mach
 See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
 ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
 ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).
+
+## Consequences
+
+- A new public surface appears on `askari` — management API + dashboard (80/443) +
+  Coturn (3478) — mitigated by TLS, embedded-IdP login, source-IP limits where
+  practical, `base` hardening and version-pinned NetBird, and recorded as accepted-risk
+  R3 (Security).
+- On-LAN SSH never depends on the mesh: `base` allows inbound SSH from `ubongo`'s LAN
+  address as a mesh-independent secondary path, so a mesh/coordinator outage never
+  blocks on-LAN SSH and Ansible stays off the mesh (Security; Recovery & operations).
+- The mesh survives a homelab outage because the coordinator is off-site on `askari`,
+  with its management datastore backed up encrypted off `askari` and peers keeping
+  last-known config through a brief coordinator outage (Recovery & operations).
+- Choosing NetBird over plain OPNsense WireGuard, Tailscale, Tailscale+Headscale, an
+  on-cluster coordinator, a `ubongo` subnet router, and a standalone IdP gains
+  identity/ACL policy, self-hosted sovereignty, no routing SPOF, and a light single
+  operator footprint (What was ruled out).
+- Implementation is pending: the role tasks land only once the unbuilt `base` role and
+  service-role machinery exist (Status).
diff --git a/docs/decisions/017-service-ui-verification.md b/docs/decisions/017-service-ui-verification.md
index 62fdb5a..ae39ba4 100644
--- a/docs/decisions/017-service-ui-verification.md
+++ b/docs/decisions/017-service-ui-verification.md
@@ -65,7 +65,7 @@ them.
 
 ## Status
 
-Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
+Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
 template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
 `.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
 
@@ -90,3 +90,21 @@ template, the `/verify-service` skill, the convention/checklist/Further-reading
 See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
 ADR-004 (`VERIFY.md` parallels `SECURITY.md`),
 ADR-013/014 (heritage / knowledge sourcing).
+
+## Consequences
+
+- The harness is confined to staging by a hard stop: it refuses to run against
+  production because exploratory clicking is destructive, the blast radius is bounded to
+  the target service, and test users live only in the staging `test` group (Safety).
+- No secrets leak: the git-ignored screenshot dir is the safety boundary and credential
+  screens are avoided (Safety; Reporting & manual handoff).
+- Test identities are ephemeral per-run credentials in the staging Authentik only —
+  never production, none persisted in `vault.yml` — created reuse-or-create and torn
+  down via staging rebuild or `test`-group cleanup (Test-user standard).
+- Anything Claude cannot exercise (physical device, paid/external flow, subjective
+  judgment) is handed off via a structured manual-test checklist in the run report
+  (Reporting & manual handoff).
+- Authoring is possible now (this ADR, the `VERIFY.md` template, the `/verify-service`
+  skill, conventions/checklist edits), but running is deferred on its dependencies:
+  `ubongo`, the `playwright` plugin, Authentik, a staging deploy, and `make new-role`
+  scaffolding `VERIFY.md` (Status; Dependencies).
diff --git a/docs/decisions/018-logging.md b/docs/decisions/018-logging.md
index 15c432d..c044e8a 100644
--- a/docs/decisions/018-logging.md
+++ b/docs/decisions/018-logging.md
@@ -72,7 +72,7 @@ tracked allocation in `docs/hardware/reference.md` (ADR-012).
 
 ## Status
 
-Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
+Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
 accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
 the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
 and the live pipeline.
@@ -97,3 +97,26 @@ the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence a
 See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
 ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity),
 ADR-004 (service-role standard), ADR-011 (health checks — distinct from this).
+
+## Consequences
+
+- Opportunistic track-covering and host-pivot-to-store are defeated because logs leave
+  the host in near-real-time and the off-cluster security trail is append-only, so it
+  survives full-cluster compromise (Security, integrity & residual risks).
+- Conscious residuals remain: append-only is not cryptographic WORM (root-on-`askari`
+  could edit chunks — R4); there is a few-seconds un-shipped window; agent compromise
+  can stop future shipping but not alter shipped history; a stolen push credential
+  appends noise but cannot delete; and an `askari` outage buffers then flushes on
+  reconnect (Security, integrity & residual risks).
+- A host going silent is itself an alert (Security, integrity & residual risks).
+- Only a bounded security subset ships off-site — `auditd`, `authpriv`, `fail2ban`,
+  AIDE, Suricata and key container security events tagged `security="true"` — while the
+  cluster Loki holds everything, keeping off-site volume small (Data flow & the security
+  subset).
+- Disk-wear is a managed parameter: log storage on NVMe/SSD or HDD never SD/USB flash,
+  bounded verbosity at source, tuned Loki retention/compaction, and monitored SSD
+  wearout/TBW with an alert; log storage is a tracked allocation in
+  `docs/hardware/reference.md` (Retention & disk-wear).
+- The decision is authorable now but the live pipeline is deferred on the stack:
+  Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog config, and the
+  push-only credential (Status; Dependencies).
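
Review note, not part of the patch: the bounded security subset described in the ADR-018 consequences can be read as a simple predicate over log records. The sketch below is only an illustration under assumed names. `LogRecord`, `ships_offsite`, `route`, the `source` and `labels` fields, and both destination names are hypothetical. The real selection would presumably be expressed in the Alloy configuration, which this diff does not show.

```python
from dataclasses import dataclass, field

# Sources the ADR names as the off-site security subset. The lowercase
# spellings are an assumption made for this sketch.
SECURITY_SOURCES = frozenset({"auditd", "authpriv", "fail2ban", "aide", "suricata"})


@dataclass
class LogRecord:
    """Hypothetical, simplified log record; not a real Alloy or Loki type."""
    source: str
    message: str
    labels: dict[str, str] = field(default_factory=dict)


def ships_offsite(record: LogRecord) -> bool:
    """Return True if the record belongs to the bounded security subset."""
    if record.source.lower() in SECURITY_SOURCES:
        return True
    # Container events qualify only when explicitly tagged security="true".
    return record.labels.get("security") == "true"


def route(record: LogRecord) -> list[str]:
    """The cluster store receives everything; the off-site trail only the subset."""
    destinations = ["cluster-loki"]  # hypothetical destination name
    if ships_offsite(record):
        destinations.append("offsite-security-trail")  # hypothetical name
    return destinations
```

Keeping the cluster destination unconditional mirrors the stated split: the cluster Loki holds everything, and only the tagged or named security sources add the second, off-site destination.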