From 96f8f20c05a33d8c186739a25880c3a787985c8c Mon Sep 17 00:00:00 2001 From: sjat Date: Sat, 6 Jun 2026 06:59:58 +0200 Subject: [PATCH] Add implementation plan for logging + log integrity (ADR-018) Task-by-task docs plan: author ADR-018 and reconcile ADR-002, accepted-risks (R4), CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md. Roles/pipeline deferred on the base + service-role machinery. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../plans/2026-06-06-logging-log-integrity.md | 480 ++++++++++++++++++ 1 file changed, 480 insertions(+) create mode 100644 docs/superpowers/plans/2026-06-06-logging-log-integrity.md diff --git a/docs/superpowers/plans/2026-06-06-logging-log-integrity.md b/docs/superpowers/plans/2026-06-06-logging-log-integrity.md new file mode 100644 index 0000000..5b8b52c --- /dev/null +++ b/docs/superpowers/plans/2026-06-06-logging-log-integrity.md @@ -0,0 +1,480 @@ +# Logging & Log Integrity Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Record the logging architecture (all logs → on-cluster Loki; a security subset also write-only off-site to `askari`) by authoring ADR-018 and reconciling every doc that touches logging/observability. + +**Architecture:** Documentation-only. The runtime pieces — Alloy in the `base` role, the `loki`/`grafana` service roles, OPNsense syslog forwarding — wait on the `base` + service-role machinery STATUS.md lists as not-yet-built. This plan settles the decision and the doc reconciliation. + +**Tech Stack:** Markdown. Verification is the repo's pre-commit hooks + a final cross-reference sweep. No markdown linter, so "tests" are hook-pass + grep checks. + +--- + +## Pre-flight (read once) + +- **`rbw` must be unlocked before every commit** (pre-commit ansible-lint decrypts `vault.yml`). 
`rbw unlocked`; if non-zero, stop and ask the user to `rbw unlock`. +- **Commit style:** one commit per task, imperative subject ≤72 chars. +- **Order:** Task 1 (ADR-018) first — later tasks link to it. +- **Spec:** `docs/superpowers/specs/2026-06-05-logging-log-integrity-design.md`. +- **Branch:** controller creates `chore/logging-log-integrity-docs` off `main` before Task 1; do not implement on `main`. + +--- + +## File map + +| File | Action | Responsibility | +|---|---|---| +| `docs/decisions/018-logging.md` | Create | Home of record for the logging architecture | +| `docs/decisions/002-security.md` | Modify | Make the "logs to central" + "active alerting" bullets concrete (→ ADR-018) | +| `docs/security/accepted-risks.md` | Modify | Add R4 — no cryptographic WORM for logs | +| `docs/CAPABILITIES.md` | Modify | Loki row → decided; add Alloy agent row; note security alerting | +| `docs/decisions/012-hardware-capacity.md` | Modify | Log-storage allocation + SSD-wearout tracked metric | +| `STATUS.md` | Modify | Rows: logging pipeline (designed, not built) | +| `docs/TODO.md` | Modify | Mark 3.1 decided; reconcile 3.6's "on askari" phrasing | +| `CLAUDE.md` | Modify | ADR-018 in Further reading | + +**Deferred (not in this plan):** the Alloy task in `base`, the `loki`/`grafana` service roles, OPNsense Suricata syslog forwarding, the push-only `vault.loki.*` credential, and the live pipeline — all recorded in ADR-018/STATUS, built when the stack exists. + +--- + +### Task 1: Author ADR-018 (the home of record) + +**Files:** +- Create: `docs/decisions/018-logging.md` + +- [ ] **Step 1: Create the ADR** + +Create `docs/decisions/018-logging.md` with exactly this content (preserve em-dashes —, backticks, table pipes, `≠`, `~`): + +```markdown +# ADR-018 — Logging and log integrity + +## Context + +boma wants all logs in one queryable store for troubleshooting, spotting issues over +time, and detecting intrusions / malicious activity. 
ADR-002 commits in principle +("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/ +Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the +off-site watchdog). Undecided: the architecture and the **integrity** question — an +attacker who roots a host will try to clear logs to cover their tracks. + +The framing insight: the biggest anti-tampering win is that logs **leave the host in +near-real-time** — once a line is in a store the attacker doesn't control, wiping the +local copy is futile. How far to harden the central store is set by the threat model. + +## Decision + +1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not + forensic-grade. +2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting + + trends. Near-real-time shipping already defeats per-host track-covering. +3. **A security-relevant subset ALSO ships off-site to `askari`, write-only** — + tamper-resistant against full-cluster compromise, at bounded volume. +4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the + proportionate control. +5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned + retention + wearout monitoring. + +## Architecture + +- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald + + container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE). +- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic + single-binary mode; NVMe; bounded retention. +- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset + only, write-only, long retention, tiny volume. +- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards + + the alerting ADR-002 calls for. 
+ +## Data flow & the security subset + +Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage +tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`, +`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host — +it syslog-forwards its alerts to the ingest point), and key container security events. + +**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is +mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to +hosts. The push API has no edit/delete verb, so a compromised host can append but not +read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers +(WAL) + retries across a brief outage. + +## Security, integrity & residual risks + +Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store +(append-only, off-cluster). The security trail survives full-cluster compromise. +Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit +chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future* +shipping but not alter shipped history; **a host going silent is itself an alert**; a +stolen push credential appends noise but can't delete; an `askari` outage buffers + +flushes on reconnect. + +## Retention & disk-wear + +Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki: +bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr). +Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded +verbosity at source (sane levels, selective access logging, a targeted `auditd` +ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored +metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a +tracked allocation in `docs/hardware/reference.md` (ADR-012). + +## Status + +Designed. 
**Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/ +accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`, +the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential, +and the live pipeline. + +## Dependencies + +`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster + +`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007); +the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting +(sibling effort, TODO 3.6). + +## What was ruled out + +| Option | Reason | +|---|---| +| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. | +| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). | +| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. | +| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. | +| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). | + +See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`), +ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role +standard), ADR-011 (health checks — distinct from this). +``` + +- [ ] **Step 2: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/decisions/018-logging.md` +Expected: Passed/Skipped. 
+```bash +git add docs/decisions/018-logging.md +git commit -m "Add ADR-018 (logging and log integrity)" +``` + +--- + +### Task 2: Make ADR-002's logging bullets concrete + +**Files:** +- Modify: `docs/decisions/002-security.md` + +Read the file first, then two exact edits. + +- [ ] **Step 1: The audit-trail bullet** + +Find: +``` +- `auditd` installed and running with a baseline ruleset +- Logs shipped to a central location if a log aggregation service is available +``` +Replace with: +``` +- `auditd` installed and running with a baseline ruleset +- Logs shipped to a central location in near-real-time — all logs to an on-cluster + Loki, plus a security-relevant subset write-only off-site to `askari` so the audit + trail survives host (and full-cluster) compromise (ADR-018) +``` + +- [ ] **Step 2: The active-alerting bullet** + +Find: +``` +- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the + monitoring/alerting stack (planned; ties to the Loki/Grafana effort) +``` +Replace with: +``` +- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus + log-source-silence (a host that stops shipping) — into Grafana alerting on the + Loki/Grafana stack (ADR-018; planned) +``` + +- [ ] **Step 3: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/decisions/002-security.md` +Expected: Passed/Skipped. +```bash +git add docs/decisions/002-security.md +git commit -m "ADR-002: make central-logging + alerting controls concrete (ADR-018)" +``` + +--- + +### Task 3: Add accepted-risk R4 (no WORM for logs) + +**Files:** +- Modify: `docs/security/accepted-risks.md` + +Read the file first, then two exact edits (add R4 after R3; bump the "Last reviewed" date). 
+ +- [ ] **Step 1: Add the R4 row** + +Find this exact line (the R3 row): +``` +| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering | +``` +Add immediately **after** it: +``` +| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks), and the security trail survives even full-cluster compromise. True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target | +``` + +- [ ] **Step 2: Bump the "Last reviewed" date** + +Find: +``` +_Last reviewed: 2026-06-05. The prior gaps +``` +Replace with: +``` +_Last reviewed: 2026-06-06. The prior gaps +``` + +- [ ] **Step 3: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md` +Expected: Passed/Skipped. 
+```bash +git add docs/security/accepted-risks.md +git commit -m "accepted-risks: add R4 (no cryptographic WORM for logs)" +``` + +--- + +### Task 4: Update CAPABILITIES §3 (Observability) + +**Files:** +- Modify: `docs/CAPABILITIES.md` + +Read the file first, then three exact edits. + +- [ ] **Step 1: Loki row → decided, note the off-site sink** + +Find: +``` +| Logs | Loki | P | planned | Log aggregation | TODO 3.6 | +``` +Replace with: +``` +| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** | +``` + +- [ ] **Step 2: Add the Alloy agent row** (right after the Loki row just edited) **and note security alerting on the Dashboards row** + +Find: +``` +| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 | +``` +Replace with: +``` +| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** | +| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 | +``` + +- [ ] **Step 3: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md` +Expected: Passed/Skipped. +```bash +git add docs/CAPABILITIES.md +git commit -m "CAPABILITIES: Loki decided + Alloy agent + security alerting (ADR-018)" +``` + +--- + +### Task 5: ADR-012 — log-storage allocation + wearout metric + +**Files:** +- Modify: `docs/decisions/012-hardware-capacity.md` + +Read the file first, then one exact edit (add a Consequences bullet). + +- [ ] **Step 1: Add a Consequences bullet** + +Find this exact block: +``` +## Consequences + +- Right-sizing advice is intent-based until usage data exists; reports say so. +- `reference.md` table headers are a parser contract — changing them needs a + matching `capacity-scan.py` change. 
+``` +Replace with: +``` +## Consequences + +- Right-sizing advice is intent-based until usage data exists; reports say so. +- `reference.md` table headers are a parser contract — changing them needs a + matching `capacity-scan.py` change. +- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention + budget and `askari`'s security-subset volume belong in `reference.md`, and SSD + **wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched, + not assumed. +``` + +- [ ] **Step 2: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md` +Expected: Passed/Skipped. +```bash +git add docs/decisions/012-hardware-capacity.md +git commit -m "ADR-012: track log-storage allocation + SSD wearout (ADR-018)" +``` + +--- + +### Task 6: Add logging rows to STATUS.md + +**Files:** +- Modify: `STATUS.md` + +Read the file first, then one exact edit (add two rows after the Level 4 row). + +- [ ] **Step 1: Add the rows** + +Find this exact line: +``` +| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. | +``` +Replace with that SAME line followed by the two new rows: +``` +| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. | +| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. 
**Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. | +| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). | +``` + +- [ ] **Step 2: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files STATUS.md` +Expected: Passed/Skipped. +```bash +git add STATUS.md +git commit -m "STATUS: record logging pipeline + security alerting (ADR-018)" +``` + +--- + +### Task 7: Reconcile TODO 3.1 and 3.6 + +**Files:** +- Modify: `docs/TODO.md` + +Read the file first, then two exact edits. (Preserve the `~~strikethrough~~` markers.) + +- [ ] **Step 1: Mark 3.1 decided** + +Find: +``` +3. **Building services** + 1. Decide how to manage logs. +``` +Replace with: +``` +3. **Building services** + 1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via + Grafana Alloy (in `base`); a security subset also ships write-only off-site to + `askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4). +``` + +- [ ] **Step 2: Reconcile 3.6's "on askari" phrasing** + +Find: +``` + 6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime + Kuma alerts on askari. +``` +Replace with: +``` + 6. Wire up the monitoring stack. Logging topology DECIDED (ADR-018): cluster Loki + (all logs) + off-site security subset on `askari` + Grafana on-cluster (not the + whole stack on `askari`). Still to design/build: Prometheus + metric exporters, + Uptime Kuma, and exactly which alerts live where. +``` + +- [ ] **Step 3: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files docs/TODO.md` +Expected: Passed/Skipped. 
+```bash +git add docs/TODO.md +git commit -m "TODO: mark log management decided (ADR-018); reconcile 3.6" +``` + +--- + +### Task 8: Link ADR-018 from CLAUDE.md + +**Files:** +- Modify: `CLAUDE.md` + +Read the file first, then one exact edit. + +- [ ] **Step 1: Add the Further-reading row after Hardware & capacity** + +Find: +``` +| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` | +``` +Replace with that SAME line followed by the new row: +``` +| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` | +| Logging & log integrity | `docs/decisions/018-logging.md` | +``` + +- [ ] **Step 2: Verify and commit** + +Run: `rbw unlocked && pre-commit run --files CLAUDE.md` +Expected: Passed/Skipped. +```bash +git add CLAUDE.md +git commit -m "CLAUDE.md: link ADR-018 (logging)" +``` + +--- + +### Task 9: Final consistency sweep + +**Files:** none modified (verification only) + +- [ ] **Step 1: ADR-018 present + cross-linked (canonical docs only)** + +Run: +```bash +test -f docs/decisions/018-logging.md && echo "ADR-018 present" +grep -rl "ADR-018\|018-logging" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/" +``` +Expected: the file exists and the referencing docs appear — ADR-002, accepted-risks, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md. + +- [ ] **Step 2: No stale "logging undecided / if available" language** + +Run: +```bash +grep -rniE "log aggregation service is available|Logs \| Loki \| P \| planned|Decide how to manage logs\.($|[^~])" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/" +``` +Expected: no hits — the ADR-002 conditional, the "planned" Loki row, and the open "Decide how to manage logs" TODO are all now updated. + +- [ ] **Step 3: Full hook run** + +Run: `rbw unlocked && pre-commit run --all-files` +Expected: all hooks Passed/Skipped. Fix anything that fails (likely trailing whitespace / end-of-file) and amend the owning commit. 
+ +- [ ] **Step 4: Push (only if the user asks)** + +```bash +git push origin +``` + +--- + +## Self-review notes (author) + +- **Spec coverage:** decision/architecture/data-flow/security/retention → Task 1 (ADR-018); the spec's "Documentation & implementation changes" table → Tasks 2–8 (ADR-002, accepted-risks R4, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md). The role/pipeline rows in that table are deferred (recorded in ADR-018/STATUS), not implemented here. ✓ +- **Deferred, intentional:** Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog forwarding, the `vault.loki.*` credential, the metrics-stack dependency — all need the unbuilt machinery; named in ADR-018/STATUS. ✓ +- **No placeholders:** every create/edit shows exact text. ✓ +- **Name consistency:** `ADR-018` / `018-logging.md`, "security subset", `offsite_hosts`, Grafana Alloy, push-only credential, R4 used identically across tasks. ✓ +```
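Review note on Task 9: Steps 1 and 2 of the sweep can also be expressed as one pass/fail function, which is easier to re-run after a fix than reading grep output by eye. The following is a minimal sketch and is not part of the committed plan. The function name `sweep` and its optional root-directory argument are hypothetical; the list of files is taken from the plan's File map, and the stale-phrase pattern is copied from Step 2.

```bash
#!/usr/bin/env sh
# Hypothetical helper (not in the plan): Task 9 Steps 1-2 as one pass/fail check.
# Usage: sweep [repo-root]   (defaults to the current directory)
sweep() {
  root="${1:-.}"
  rc=0

  # Step 1a: the ADR itself must exist.
  [ -f "$root/docs/decisions/018-logging.md" ] || {
    echo "MISSING: docs/decisions/018-logging.md"; rc=1;
  }

  # Step 1b: every doc the File map marks "Modify" must reference ADR-018.
  for f in \
    docs/decisions/002-security.md \
    docs/security/accepted-risks.md \
    docs/CAPABILITIES.md \
    docs/decisions/012-hardware-capacity.md \
    STATUS.md \
    docs/TODO.md \
    CLAUDE.md
  do
    grep -qE 'ADR-018|018-logging' "$root/$f" 2>/dev/null || {
      echo "NO LINK: $f"; rc=1;
    }
  done

  # Step 2: stale wording must be gone (plans/specs are excluded, as in the plan).
  # Any surviving hit is printed and fails the sweep.
  if grep -rniE 'log aggregation service is available|Logs \| Loki \| P \| planned|Decide how to manage logs\.($|[^~])' \
       "$root/docs" "$root/CLAUDE.md" "$root/STATUS.md" 2>/dev/null \
     | grep -vE 'superpowers/(plans|specs)/'
  then
    rc=1
  fi

  return "$rc"
}
```

Run `sweep` from the repository root (or `sweep /path/to/checkout`). A non-zero exit status means at least one doc is missing its ADR-018 link or still carries stale wording; the offending file or line is printed. It does not replace Step 3's full `pre-commit run --all-files`.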