boma/docs/superpowers/plans/2026-06-06-logging-log-integrity.md
sjat 96f8f20c05 Add implementation plan for logging + log integrity (ADR-018)
Task-by-task docs plan: author ADR-018 and reconcile ADR-002, accepted-risks
(R4), CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md. Roles/pipeline deferred
on the base + service-role machinery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 06:59:58 +02:00

# Logging & Log Integrity Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Record the logging architecture (all logs → on-cluster Loki; a security subset also write-only off-site to `askari`) by authoring ADR-018 and reconciling every doc that touches logging/observability.
**Architecture:** Documentation-only. The runtime pieces — Alloy in the `base` role, the `loki`/`grafana` service roles, OPNsense syslog forwarding — wait on the `base` + service-role machinery STATUS.md lists as not-yet-built. This plan settles the decision and the doc reconciliation.
**Tech Stack:** Markdown. Verification is the repo's pre-commit hooks + a final cross-reference sweep. No markdown linter, so "tests" are hook-pass + grep checks.
---
## Pre-flight (read once)
- **`rbw` must be unlocked before every commit** (pre-commit ansible-lint decrypts `vault.yml`). Run `rbw unlocked`; if it exits non-zero, stop and ask the user to run `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order:** Task 1 (ADR-018) first — later tasks link to it.
- **Spec:** `docs/superpowers/specs/2026-06-05-logging-log-integrity-design.md`.
- **Branch:** controller creates `chore/logging-log-integrity-docs` off `main` before Task 1; do not implement on `main`.
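
The two mechanical pre-flight rules (vault unlocked, subject length) can be checked before each commit. The following is a minimal sketch in POSIX shell; `require_unlocked` and `subject_ok` are hypothetical helper names that do not exist in the repo. A typical call would be `require_unlocked rbw unlocked && subject_ok "Add ADR-018 (logging and log integrity)"`.

```bash
# Hypothetical helpers; neither exists in the repo.

# Run the lock check passed as arguments (the plan uses `rbw unlocked`)
# and fail with a hint when it exits non-zero.
require_unlocked() {
  if ! "$@" >/dev/null 2>&1; then
    echo "vault locked: ask the user to run 'rbw unlock'" >&2
    return 1
  fi
}

# Succeed only when the commit subject ($1) is at most 72 characters.
subject_ok() {
  [ "${#1}" -le 72 ]
}
```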
---
## File map
| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/018-logging.md` | Create | Home of record for the logging architecture |
| `docs/decisions/002-security.md` | Modify | Make the "logs to central" + "active alerting" bullets concrete (→ ADR-018) |
| `docs/security/accepted-risks.md` | Modify | Add R4 — no cryptographic WORM for logs |
| `docs/CAPABILITIES.md` | Modify | Loki row → decided; add Alloy agent row; note security alerting |
| `docs/decisions/012-hardware-capacity.md` | Modify | Log-storage allocation + SSD-wearout tracked metric |
| `STATUS.md` | Modify | Rows: logging pipeline (designed, not built) |
| `docs/TODO.md` | Modify | Mark 3.1 decided; reconcile 3.6's "on askari" phrasing |
| `CLAUDE.md` | Modify | ADR-018 in Further reading |
**Deferred (not in this plan):** the Alloy task in `base`, the `loki`/`grafana` service roles, OPNsense Suricata syslog forwarding, the push-only `vault.loki.*` credential, and the live pipeline — all recorded in ADR-018/STATUS, built when the stack exists.
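
Before Task 1 it may help to confirm that the repo matches the file map: the ADR to create is absent and every file marked "Modify" is present. A sketch under that assumption follows; `check_file_map` is a hypothetical name, and the paths are copied from the table above. Run it from the repo root as `check_file_map .`.

```bash
# Hypothetical pre-check for the file map: the ADR must not exist yet,
# and every file marked "Modify" must already be present.
# $1 = repository root (defaults to the current directory).
check_file_map() (
  root=${1:-.}
  rc=0
  if [ -e "$root/docs/decisions/018-logging.md" ]; then
    echo "already exists: docs/decisions/018-logging.md"
    rc=1
  fi
  for f in \
    docs/decisions/002-security.md \
    docs/security/accepted-risks.md \
    docs/CAPABILITIES.md \
    docs/decisions/012-hardware-capacity.md \
    STATUS.md \
    docs/TODO.md \
    CLAUDE.md
  do
    # Report every missing file rather than stopping at the first one.
    [ -f "$root/$f" ] || { echo "missing: $f"; rc=1; }
  done
  exit "$rc"
)
```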
---
### Task 1: Author ADR-018 (the home of record)
**Files:**
- Create: `docs/decisions/018-logging.md`
- [ ] **Step 1: Create the ADR**
Create `docs/decisions/018-logging.md` with exactly this content (preserve em-dashes —, backticks, table pipes, `≠`, `~`):
```markdown
# ADR-018 — Logging and log integrity
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.
The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.
## Decision
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only**:
tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
retention + wearout monitoring.
## Architecture
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
+ the alerting ADR-002 calls for.
## Data flow & the security subset
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
## Security, integrity & residual risks
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.
## Retention & disk-wear
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
## Status
Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).
```
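
The ADR's append-only claim rests on hosts only ever calling Loki's push endpoint. In the real pipeline Alloy does the pushing, but for a later smoke test a single push can be sketched by hand. Everything site-specific here is an assumption: the base URL, the `LOKI_PUSH_USER` / `LOKI_PUSH_PASS` variables, and the use of HTTP basic auth (Loki does not authenticate requests itself, so the push-only credential would be enforced by whatever fronts it, which this plan does not specify). The `job` label is also illustrative; `security="true"` is the tag named in the ADR.

```bash
# Build a Loki push payload for one log line.
# $1 = job label, $2 = timestamp in Unix nanoseconds, $3 = log line.
loki_payload() (
  # Escape backslashes and double quotes so the line is a valid JSON string.
  line=$(printf '%s' "$3" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"streams":[{"stream":{"job":"%s","security":"true"},"values":[["%s","%s"]]}]}\n' \
    "$1" "$2" "$line"
)

# Hypothetical smoke test: POST one line to <base-url>/loki/api/v1/push.
# $1 = base URL, $2 = job, $3 = timestamp (ns), $4 = log line.
push_line() {
  loki_payload "$2" "$3" "$4" |
    curl -fsS -u "$LOKI_PUSH_USER:$LOKI_PUSH_PASS" \
      -H 'Content-Type: application/json' \
      --data-binary @- "$1/loki/api/v1/push"
}
```

With a push-only credential, the same credential used against a query path should be rejected; that negative check is the part that demonstrates "append but not read".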
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/018-logging.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/018-logging.md
git commit -m "Add ADR-018 (logging and log integrity)"
```
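
Because Step 1 asks for exact content, a quick structural check can catch a truncated paste before the commit. This is a sketch only; `adr_sections` and `adr_complete` are hypothetical helpers, and the section names are the nine level-2 headings of the ADR text above. Call it as `adr_complete docs/decisions/018-logging.md`.

```bash
# Print the level-2 headings of a markdown file, one per line.
adr_sections() {
  grep '^## ' "$1" | sed 's/^## //'
}

# Hypothetical check: succeed only if every expected ADR-018 section exists.
# Prints the first missing section and fails otherwise.
adr_complete() (
  have=$(adr_sections "$1")
  for want in \
    "Context" "Decision" "Architecture" \
    "Data flow & the security subset" \
    "Security, integrity & residual risks" \
    "Retention & disk-wear" \
    "Status" "Dependencies" "What was ruled out"
  do
    printf '%s\n' "$have" | grep -qxF -- "$want" ||
      { echo "missing section: $want"; exit 1; }
  done
)
```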
---
### Task 2: Make ADR-002's logging bullets concrete
**Files:**
- Modify: `docs/decisions/002-security.md`
Read the file first, then two exact edits.
- [ ] **Step 1: The audit-trail bullet**
Find:
```
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location if a log aggregation service is available
```
Replace with:
```
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location in near-real-time — all logs to an on-cluster
Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
trail survives host (and full-cluster) compromise (ADR-018)
```
- [ ] **Step 2: The active-alerting bullet**
Find:
```
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
```
Replace with:
```
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
log-source-silence (a host that stops shipping) — into Grafana alerting on the
Loki/Grafana stack (ADR-018; planned)
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/002-security.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/002-security.md
git commit -m "ADR-002: make central-logging + alerting controls concrete (ADR-018)"
```
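
Every "Find / Replace with" step in Tasks 2 to 8 assumes the anchor text is present and unique. A small guard can confirm that before editing. `expect_once` is a hypothetical helper; it checks one line at a time, so for a multi-line Find block pass its most distinctive line, for example `expect_once docs/decisions/002-security.md 'if a log aggregation service is available'`.

```bash
# Succeed only when the fixed string $2 occurs on exactly one line of file $1.
# -F treats the pattern literally, so backticks, pipes and asterisks in the
# markdown are not interpreted as regex syntax.
expect_once() (
  n=$(grep -cF -- "$2" "$1" || true)
  [ "$n" -eq 1 ]
)
```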
---
### Task 3: Add accepted-risk R4 (no WORM for logs)
**Files:**
- Modify: `docs/security/accepted-risks.md`
Read the file first, then one exact edit (add R4 after R3).
- [ ] **Step 1: Add the R4 row**
Find this exact line (the R3 row):
```
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
```
Add immediately **after** it:
```
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and the security subset is copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks) and survives even full-cluster compromise. True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
```
- [ ] **Step 2: Bump the "Last reviewed" date**
Find:
```
_Last reviewed: 2026-06-05. The prior gaps
```
Replace with:
```
_Last reviewed: 2026-06-06. The prior gaps
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: Passed/Skipped.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: add R4 (no cryptographic WORM for logs)"
```
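
The R4 row is long enough that a dropped pipe is easy to miss and the hooks may not catch it. One way to check is to count cells and compare against the R3 row. `row_cells` is a hypothetical helper, and it assumes no cell contains a literal `|`. For example, `grep -F '| R4 |' docs/security/accepted-risks.md | row_cells` should print the same number as the equivalent command for `| R3 |`.

```bash
# Print the number of cells in each markdown table row read from stdin.
# Assumes every row starts and ends with '|' and no cell contains a literal '|'.
row_cells() {
  # Splitting on '|' yields one empty field before the first pipe and one
  # after the last, hence NF - 2.
  awk -F'|' '{ print NF - 2 }'
}
```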
---
### Task 4: Update CAPABILITIES §3 (Observability)
**Files:**
- Modify: `docs/CAPABILITIES.md`
Read the file first, then two exact edits covering three changes (the Loki row, a new Alloy row, and the Dashboards note).
- [ ] **Step 1: Loki row → decided, note the off-site sink**
Find:
```
| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
```
Replace with:
```
| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
```
- [ ] **Step 2: Add the Alloy agent row** (right after the Loki row just edited)
Find:
```
| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
```
Replace with:
```
| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
Expected: Passed/Skipped.
```bash
git add docs/CAPABILITIES.md
git commit -m "CAPABILITIES: Loki decided + Alloy agent + security alerting (ADR-018)"
```
---
### Task 5: ADR-012 — log-storage allocation + wearout metric
**Files:**
- Modify: `docs/decisions/012-hardware-capacity.md`
Read the file first, then one exact edit (add a Consequences bullet).
- [ ] **Step 1: Add a Consequences bullet**
Find this exact block:
```
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
```
Replace with:
```
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
not assumed.
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/012-hardware-capacity.md
git commit -m "ADR-012: track log-storage allocation + SSD wearout (ADR-018)"
```
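
The "tracked allocation" in `reference.md` comes down to simple arithmetic: daily ingest times retention window. A sketch of that calculation follows. `retention_gb` and `daily_mb` are hypothetical helpers, the inputs are placeholders rather than measurements, and compression and index overhead are ignored. For instance, `retention_gb 200 90` gives the budget for an assumed 200 MB/day held for 90 days.

```bash
# Disk needed (decimal GB) for a given daily ingest and retention window.
# $1 = ingest in MB/day, $2 = retention in days.
retention_gb() {
  awk -v mb="$1" -v days="$2" 'BEGIN { printf "%.1f\n", mb * days / 1000 }'
}

# Average daily volume (MB/day) implied by a yearly budget in decimal GB.
# $1 = budget in GB/yr.
daily_mb() {
  awk -v gb="$1" 'BEGIN { printf "%.1f\n", gb * 1000 / 365 }'
}
```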
---
### Task 6: Add logging rows to STATUS.md
**Files:**
- Modify: `STATUS.md`
Read the file first, then one exact edit (add two rows after the Level 4 row).
- [ ] **Step 1: Add the rows**
Find this exact line:
```
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
```
Replace with that SAME line followed by the two new rows:
```
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record logging pipeline + security alerting (ADR-018)"
```
---
### Task 7: Reconcile TODO 3.1 and 3.6
**Files:**
- Modify: `docs/TODO.md`
Read the file first, then two exact edits. (Preserve the `~~strikethrough~~` markers.)
- [ ] **Step 1: Mark 3.1 decided**
Find:
```
3. **Building services**
1. Decide how to manage logs.
```
Replace with:
```
3. **Building services**
1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
Grafana Alloy (in `base`); a security subset also ships write-only off-site to
`askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
```
- [ ] **Step 2: Reconcile 3.6's "on askari" phrasing**
Find:
```
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
Kuma alerts on askari.
```
Replace with:
```
6. Wire up the monitoring stack. Logging topology DECIDED (ADR-018): cluster Loki
(all logs) + off-site security subset on `askari` + Grafana on-cluster (not the
whole stack on `askari`). Still to design/build: Prometheus + metric exporters,
Uptime Kuma, and exactly which alerts live where.
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/TODO.md`
Expected: Passed/Skipped.
```bash
git add docs/TODO.md
git commit -m "TODO: mark log management decided (ADR-018); reconcile 3.6"
```
---
### Task 8: Link ADR-018 from CLAUDE.md
**Files:**
- Modify: `CLAUDE.md`
Read the file first, then one exact edit.
- [ ] **Step 1: Add the Further-reading row after Hardware & capacity**
Find:
```
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
```
Replace with that SAME line followed by the new row:
```
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Logging & log integrity | `docs/decisions/018-logging.md` |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-018 (logging)"
```
---
### Task 9: Final consistency sweep
**Files:** none modified (verification only)
- [ ] **Step 1: ADR-018 present + cross-linked (canonical docs only)**
Run:
```bash
test -f docs/decisions/018-logging.md && echo "ADR-018 present"
grep -rl "ADR-018\|018-logging" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the file exists and the referencing docs appear — ADR-002, accepted-risks, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md.
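
The expected list can also be checked mechanically instead of by eye. The sketch below inverts the `grep -rl` above: it names each canonical doc that does not yet mention ADR-018. `missing_refs` is a hypothetical helper; the file list is the one from the file map.

```bash
# Hypothetical cross-reference check: print each canonical doc that does NOT
# mention ADR-018 and fail if there is any.
# $1 = repository root (defaults to the current directory).
missing_refs() (
  root=${1:-.}
  rc=0
  for f in \
    docs/decisions/002-security.md \
    docs/security/accepted-risks.md \
    docs/CAPABILITIES.md \
    docs/decisions/012-hardware-capacity.md \
    STATUS.md \
    docs/TODO.md \
    CLAUDE.md
  do
    # A missing file counts as a missing reference.
    grep -qE 'ADR-018|018-logging' "$root/$f" 2>/dev/null ||
      { echo "no ADR-018 reference: $f"; rc=1; }
  done
  exit "$rc"
)
```

Running `missing_refs .` from the repo root should print nothing once Tasks 2 to 8 are committed.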
- [ ] **Step 2: No stale "logging undecided / if available" language**
Run:
```bash
grep -rniE "log aggregation service is available|Logs \| Loki \| P \| planned|Decide how to manage logs\.($|[^~])" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: no hits — the ADR-002 conditional, the "planned" Loki row, and the open "Decide how to manage logs" TODO are all now updated.
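
The third alternative in that pattern is the subtle one: the trailing `($|[^~])` is what lets the struck-through TODO pass while an untouched one still trips the check. A small demonstration, using a hypothetical `open_todo_count` wrapper around the same expression:

```bash
# Count lines on stdin that still look like an open "Decide how to manage logs."
# item. The trailing group requires end-of-line or any character other than
# '~', so the struck-through form "~~...logs.~~" is not counted.
open_todo_count() {
  grep -ciE 'Decide how to manage logs\.($|[^~])'
}
```

Piping the original TODO line into `open_todo_count` counts one match; piping the rewritten `~~Decide how to manage logs.~~ DECIDED (ADR-018)` line counts none.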
- [ ] **Step 3: Full hook run**
Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks Passed/Skipped. Fix anything that fails (likely trailing whitespace / end-of-file) and amend the owning commit.
- [ ] **Step 4: Push (only if the user asks)**
```bash
git push origin <branch-or-main-after-merge>
```
---
## Self-review notes (author)
- **Spec coverage:** decision/architecture/data-flow/security/retention → Task 1 (ADR-018); the spec's "Documentation & implementation changes" table → Tasks 2–8 (ADR-002, accepted-risks R4, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md). The role/pipeline rows in that table are deferred (recorded in ADR-018/STATUS), not implemented here. ✓
- **Deferred, intentional:** Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog forwarding, the `vault.loki.*` credential, the metrics-stack dependency — all need the unbuilt machinery; named in ADR-018/STATUS. ✓
- **No placeholders:** every create/edit shows exact text. ✓
- **Name consistency:** `ADR-018` / `018-logging.md`, "security subset", `offsite_hosts`, Grafana Alloy, push-only credential, R4 used identically across tasks. ✓