Compare commits

...

10 commits

Author SHA1 Message Date
9bdb3017bb CLAUDE.md: link ADR-018 (logging) 2026-06-06 07:07:43 +02:00
12baeba750 TODO: mark log management decided (ADR-018); reconcile 3.6 2026-06-06 07:07:01 +02:00
1021c6d25d STATUS: record logging pipeline + security alerting (ADR-018) 2026-06-06 07:06:06 +02:00
c6aa45037d ADR-012: track log-storage allocation + SSD wearout (ADR-018) 2026-06-06 07:05:15 +02:00
687d623a52 CAPABILITIES: Loki decided + Alloy agent + security alerting (ADR-018) 2026-06-06 07:04:26 +02:00
6f68f8b8c5 accepted-risks: add R4 (no cryptographic WORM for logs) 2026-06-06 07:03:27 +02:00
30c6a93c28 ADR-002: make central-logging + alerting controls concrete (ADR-018) 2026-06-06 07:02:32 +02:00
2894319f01 Add ADR-018 (logging and log integrity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 07:01:36 +02:00
96f8f20c05 Add implementation plan for logging + log integrity (ADR-018)
Task-by-task docs plan: author ADR-018 and reconcile ADR-002, accepted-risks
(R4), CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md. Roles/pipeline deferred
on the base + service-role machinery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 06:59:58 +02:00
8eb5ccf97d Add design spec for logging + log integrity (ship all to Loki)
All logs -> on-cluster Loki for troubleshooting/trends; a security-relevant
subset also ships write-only off-site to askari (append-only, tamper-resistant
against full-cluster compromise); skip WORM (accepted-risk R4). Alloy agent in
base; loki/grafana service roles; disk-wear handled as a design parameter.
Basis for ADR-018.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 22:03:31 +02:00
10 changed files with 816 additions and 9 deletions


@ -214,6 +214,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
 | Update management | `docs/decisions/011-update-management.md` |
 | Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
+| Logging & log integrity | `docs/decisions/018-logging.md` |
 | Adding a new role | `docs/runbooks/new-role.md` |
 | Adding a new host | `docs/runbooks/new-host.md` |
 | Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |


@ -56,6 +56,8 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
 | NetBird mesh — coordinator on `askari` | ADR-016 | **Design RESOLVED** (ADR-016 + spec + plan); resolves ADR-015 deferred #1. Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. **Build pending:** not deployed (askari + service-role machinery not built). |
 | NetBird agent enrollment in `base` | ADR-016 | **Design RESOLVED** (ADR-016). Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. **Build pending:** base role not built. |
 | Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
+| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
+| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
 ## Keeping this honest


@ -43,8 +43,9 @@ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not cont
 | Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
 |---|---|---|---|---|---|
 | Metrics | Prometheus | P | planned | Time-series metrics + alert rules | TODO 3.6 |
-| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
-| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
+| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
+| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
+| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
 | Uptime checks | Uptime Kuma | P | planned | Endpoint up/down checks | TODO 3.6 |
 | External watchdog | askari (Hetzner VPS) | P | core | Off-site monitoring that survives a homelab outage | ADR-007 |
 | Notify / alerting | ntfy · Matrix · email (multi-channel) | S | planned | Deliver alerts to the user across channels | TODO 9; Matrix homeserver in §8 |


@ -15,15 +15,19 @@
 `/verify-service` report.
 3. **Building services**
-1. Decide how to manage logs.
+1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
+Grafana Alloy (in `base`); a security subset also ships write-only off-site to
+`askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
 2. Decide how to manage APIs / API access.
 3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
 translate-don't-transplant — V4 is a source only of gotchas + working config
 snippets, re-derived on boma's terms; never structure/requirements/values.
 4. Decide what each node runs — base packages plus which apps/services.
 5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
-6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
-Kuma alerts on askari.
+6. Wire up the monitoring stack. Logging topology DECIDED (ADR-018): cluster Loki
+(all logs) + off-site security subset on `askari` + Grafana on-cluster (not the
+whole stack on `askari`). Still to design/build: Prometheus + metric exporters,
+Uptime Kuma, and exactly which alerts live where.
 7. Define a tagging standard that lets us target runs without over-tagging.
 8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
 9. Decide: a central database server, or individual database services per app?


@ -87,7 +87,9 @@ time. Each heading tags the threat(s) it primarily serves.
 ### Audit trail — *agent error, blast radius*
 - `auditd` installed and running with a baseline ruleset
-- Logs shipped to a central location if a log aggregation service is available
+- Logs shipped to a central location in near-real-time — all logs to an on-cluster
+Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
+trail survives host (and full-cluster) compromise (ADR-018)
 ### Mandatory access control — *blast radius*
@ -102,8 +104,9 @@ time. Each heading tags the threat(s) it primarily serves.
 - **AIDE** file-integrity monitoring (required by the CIS Debian benchmark) — detects
 unexpected changes to system files
 - **Network IDS** — Suricata on OPNsense (planned; see STATUS.md / TODO)
-- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
-monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
+- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
+log-source-silence (a host that stops shipping) — into Grafana alerting on the
+Loki/Grafana stack (ADR-018; planned)
 ## Secrets management — *agent error, opportunistic*


@ -36,5 +36,9 @@ workload that should move, or a node due an upgrade.
 - Right-sizing advice is intent-based until usage data exists; reports say so.
 - `reference.md` table headers are a parser contract — changing them needs a
 matching `capacity-scan.py` change.
+- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
+budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
+**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
+not assumed.
 See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).


@ -0,0 +1,99 @@
# ADR-018 — Logging and log integrity
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.
The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.
## Decision
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only**
tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
retention + wearout monitoring.
## Architecture
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
+ the alerting ADR-002 calls for.
## Data flow & the security subset
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.
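
The fan-out described above can be modelled in a few lines. This is only a sketch in plain Python, not Alloy configuration: the source names come from the ADR, while the `route` function and the sink names `cluster` and `askari` are hypothetical, and "key container security events" are left out because the ADR does not enumerate them.

```python
# Illustrative model of the fan-out: every entry goes to the cluster Loki,
# and entries from security sources also go to the off-site askari Loki.
# The function and sink names are assumptions; the real pipeline would be
# an Alloy relabel/match stage, which is not built yet.

SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


def route(source: str) -> tuple[dict[str, str], list[str]]:
    """Return (labels, sinks) for a log entry from the given source."""
    labels = {"source": source}
    sinks = ["cluster"]  # all logs go to the on-cluster Loki
    if source.lower() in SECURITY_SOURCES:
        labels["security"] = "true"  # the tag the match stage would apply
        sinks.append("askari")  # filtered copy goes off-site
    return labels, sinks
```

Keeping the security set explicit and small is what bounds the off-site volume: anything not listed stays on-cluster only.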
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
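
One way to picture the write-only rule is as a request allowlist in front of the `askari` Loki. The sketch below assumes such a gate exists (for example in a reverse proxy); the `is_allowed` function is hypothetical, and only the `/loki/api/v1/push` path is taken from the ADR.

```python
# Hypothetical allowlist: hosts may POST to the push endpoint and nothing else.
# How the gate is actually enforced (proxy, auth layer) is not decided here.

PUSH_PATH = "/loki/api/v1/push"


def is_allowed(method: str, path: str) -> bool:
    """Accept only POST requests to the push endpoint."""
    return method.upper() == "POST" and path == PUSH_PATH
```

With a default-deny rule like this, query and delete paths never need to be listed individually; anything that is not the push endpoint is refused.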
## Security, integrity & residual risks
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.
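
The "host going silent is itself an alert" residual reduces to a staleness check over the last time each host shipped a line. The following is a sketch under assumptions: the `silent_hosts` helper, the host names, and any threshold passed to it are hypothetical, and in practice the rule would live in Grafana alerting rather than in Python.

```python
from datetime import datetime, timedelta


def silent_hosts(
    last_seen: dict[str, datetime], now: datetime, max_gap: timedelta
) -> list[str]:
    """Return hosts whose most recent shipped line is older than max_gap."""
    return sorted(host for host, ts in last_seen.items() if now - ts > max_gap)
```

The threshold has to sit above the agent's normal buffering and retry window, otherwise a brief `askari` outage would page for every host at once.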
## Retention & disk-wear
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
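
Until real measurements exist, the retention budget and the wear budget are both simple products. The helpers below are a back-of-envelope sketch: the function names are hypothetical, any numbers fed in are placeholders rather than measured values, and the arithmetic ignores compression, compaction rewrites, and write amplification.

```python
def retention_footprint_gb(daily_ingest_gb: float, retention_days: int) -> float:
    """Disk needed to hold retention_days of logs at a steady daily ingest."""
    return daily_ingest_gb * retention_days


def years_to_rated_tbw(daily_write_gb: float, rated_tbw_tb: float) -> float:
    """Years until a drive's rated TBW is reached at a steady daily write rate."""
    return rated_tbw_tb * 1000 / daily_write_gb / 365
```

Once the pipeline runs, the measured ingest rate replaces the placeholder inputs and the results become the figures tracked in `docs/hardware/reference.md`.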
## Status
Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).


@ -16,8 +16,9 @@ revisit (trigger).
 | R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
 | R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
 | R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
+| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks), and the off-site copy survives even full-cluster compromise. True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
-_Last reviewed: 2026-06-05. The prior gaps (full CIS hardening, SELinux/AppArmor,
+_Last reviewed: 2026-06-06. The prior gaps (full CIS hardening, SELinux/AppArmor,
 IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
 Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
 part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build


@ -0,0 +1,480 @@
# Logging & Log Integrity Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Record the logging architecture (all logs → on-cluster Loki; a security subset also write-only off-site to `askari`) by authoring ADR-018 and reconciling every doc that touches logging/observability.
**Architecture:** Documentation-only. The runtime pieces — Alloy in the `base` role, the `loki`/`grafana` service roles, OPNsense syslog forwarding — wait on the `base` + service-role machinery STATUS.md lists as not-yet-built. This plan settles the decision and the doc reconciliation.
**Tech Stack:** Markdown. Verification is the repo's pre-commit hooks + a final cross-reference sweep. No markdown linter, so "tests" are hook-pass + grep checks.
---
## Pre-flight (read once)
- **`rbw` must be unlocked before every commit** (pre-commit ansible-lint decrypts `vault.yml`). Run `rbw unlocked`; if it exits non-zero, stop and ask the user to run `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order:** Task 1 (ADR-018) first — later tasks link to it.
- **Spec:** `docs/superpowers/specs/2026-06-05-logging-log-integrity-design.md`.
- **Branch:** controller creates `chore/logging-log-integrity-docs` off `main` before Task 1; do not implement on `main`.
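
The first two pre-flight rules are mechanical enough to check before each commit. The sketch below is one possible gate, not part of the plan: the function names are hypothetical, and the only facts taken from the plan are that `rbw unlocked` signals a locked vault with a non-zero exit and that subjects stay within 72 characters.

```python
import subprocess
import sys

MAX_SUBJECT_LEN = 72  # limit stated in the commit-style rule


def vault_unlocked(cmd: tuple[str, ...] = ("rbw", "unlocked")) -> bool:
    """True when the check command exits 0; a missing binary counts as locked."""
    try:
        return subprocess.run(list(cmd), check=False).returncode == 0
    except FileNotFoundError:
        return False


def subject_ok(subject: str) -> bool:
    """True when the first line of the message is non-empty and short enough."""
    first_line = subject.splitlines()[0] if subject else ""
    return 0 < len(first_line) <= MAX_SUBJECT_LEN


def require_unlocked() -> None:
    """Stop with a message for the user instead of letting the hook fail later."""
    if not vault_unlocked():
        sys.exit("rbw is locked: ask the user to run `rbw unlock`")
```

Failing early here gives a clearer message than a pre-commit hook that cannot decrypt `vault.yml`.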
---
## File map
| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/018-logging.md` | Create | Home of record for the logging architecture |
| `docs/decisions/002-security.md` | Modify | Make the "logs to central" + "active alerting" bullets concrete (→ ADR-018) |
| `docs/security/accepted-risks.md` | Modify | Add R4 — no cryptographic WORM for logs |
| `docs/CAPABILITIES.md` | Modify | Loki row → decided; add Alloy agent row; note security alerting |
| `docs/decisions/012-hardware-capacity.md` | Modify | Log-storage allocation + SSD-wearout tracked metric |
| `STATUS.md` | Modify | Rows: logging pipeline + security alerting (designed, not built) |
| `docs/TODO.md` | Modify | Mark 3.1 decided; reconcile 3.6's "on askari" phrasing |
| `CLAUDE.md` | Modify | ADR-018 in Further reading |
**Deferred (not in this plan):** the Alloy task in `base`, the `loki`/`grafana` service roles, OPNsense Suricata syslog forwarding, the push-only `vault.loki.*` credential, and the live pipeline — all recorded in ADR-018/STATUS, built when the stack exists.
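
The final cross-reference sweep can be approximated by checking that every file in the map mentions the new ADR. This is a sketch under assumptions: the `missing_references` helper is hypothetical, and matching on either `ADR-018` or `018-logging` is a guess at what counts as a reference (the `CLAUDE.md` row links the file path rather than the ADR number).

```python
from pathlib import Path

# Paths copied from the file map above.
PLAN_FILES = [
    "docs/decisions/018-logging.md",
    "docs/decisions/002-security.md",
    "docs/security/accepted-risks.md",
    "docs/CAPABILITIES.md",
    "docs/decisions/012-hardware-capacity.md",
    "STATUS.md",
    "docs/TODO.md",
    "CLAUDE.md",
]
NEEDLES = ("ADR-018", "018-logging")


def missing_references(root, files=PLAN_FILES, needles=NEEDLES):
    """Return the files under root that are absent or mention none of the needles."""
    missing = []
    for rel in files:
        path = Path(root) / rel
        text = path.read_text(encoding="utf-8") if path.is_file() else ""
        if not any(needle in text for needle in needles):
            missing.append(rel)
    return missing
```

An empty list from `missing_references(Path("."))` at the repo root would mean every file in the map references the ADR in some form.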
---
### Task 1: Author ADR-018 (the home of record)
**Files:**
- Create: `docs/decisions/018-logging.md`
- [ ] **Step 1: Create the ADR**
Create `docs/decisions/018-logging.md` with exactly this content (preserve em-dashes —, backticks, table pipes, `≠`, `~`):
```markdown
# ADR-018 — Logging and log integrity
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.
The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.
## Decision
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only**
tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
retention + wearout monitoring.
## Architecture
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
+ the alerting ADR-002 calls for.
## Data flow & the security subset
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
## Security, integrity & residual risks
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.
## Retention & disk-wear
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
## Status
Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/018-logging.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/018-logging.md
git commit -m "Add ADR-018 (logging and log integrity)"
```
---
### Task 2: Make ADR-002's logging bullets concrete
**Files:**
- Modify: `docs/decisions/002-security.md`
Read the file first, then two exact edits.
- [ ] **Step 1: The audit-trail bullet**
Find:
```
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location if a log aggregation service is available
```
Replace with:
```
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location in near-real-time — all logs to an on-cluster
Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
trail survives host (and full-cluster) compromise (ADR-018)
```
- [ ] **Step 2: The active-alerting bullet**
Find:
```
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
```
Replace with:
```
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
log-source-silence (a host that stops shipping) — into Grafana alerting on the
Loki/Grafana stack (ADR-018; planned)
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/002-security.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/002-security.md
git commit -m "ADR-002: make central-logging + alerting controls concrete (ADR-018)"
```
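
The "Find / Replace with" steps in this task amount to an exact, single-occurrence substitution. A minimal sketch of that contract follows; the `replace_once` helper is hypothetical and is not something the plan prescribes.

```python
def replace_once(text: str, find: str, replacement: str) -> str:
    """Replace find with replacement, refusing unless it occurs exactly once."""
    count = text.count(find)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return text.replace(find, replacement, 1)
```

Refusing on zero or multiple matches is the point: a drifted or duplicated bullet should stop the task rather than produce a silent partial edit.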
---
### Task 3: Add accepted-risk R4 (no WORM for logs)
**Files:**
- Modify: `docs/security/accepted-risks.md`
Read the file first, then one exact edit (add R4 after R3).
- [ ] **Step 1: Add the R4 row**
Find this exact line (the R3 row):
```
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
```
Add immediately **after** it:
```
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
```
- [ ] **Step 2: Bump the "Last reviewed" date**
Find:
```
_Last reviewed: 2026-06-05. The prior gaps
```
Replace with:
```
_Last reviewed: 2026-06-06. The prior gaps
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: Passed/Skipped.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: add R4 (no cryptographic WORM for logs)"
```
---
### Task 4: Update CAPABILITIES §3 (Observability)
**Files:**
- Modify: `docs/CAPABILITIES.md`
Read the file first, then three exact edits.
- [ ] **Step 1: Loki row → decided, note the off-site sink**
Find:
```
| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
```
Replace with:
```
| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
```
- [ ] **Step 2: Add the Alloy agent row** (right after the Loki row just edited)
Find:
```
| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
```
Replace with:
```
| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
Expected: Passed/Skipped.
```bash
git add docs/CAPABILITIES.md
git commit -m "CAPABILITIES: Loki decided + Alloy agent + security alerting (ADR-018)"
```
---
### Task 5: ADR-012 — log-storage allocation + wearout metric
**Files:**
- Modify: `docs/decisions/012-hardware-capacity.md`
Read the file first, then one exact edit (add a Consequences bullet).
- [ ] **Step 1: Add a Consequences bullet**
Find this exact block:
```
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
```
Replace with:
```
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
not assumed.
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/012-hardware-capacity.md
git commit -m "ADR-012: track log-storage allocation + SSD wearout (ADR-018)"
```
---
### Task 6: Add logging rows to STATUS.md
**Files:**
- Modify: `STATUS.md`
Read the file first, then one exact edit (add two rows after the Level 4 row).
- [ ] **Step 1: Add the rows**
Find this exact line:
```
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
```
Replace with that SAME line followed by the two new rows:
```
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record logging pipeline + security alerting (ADR-018)"
```
---
### Task 7: Reconcile TODO 3.1 and 3.6
**Files:**
- Modify: `docs/TODO.md`
Read the file first, then two exact edits. (Preserve the `~~strikethrough~~` markers.)
- [ ] **Step 1: Mark 3.1 decided**
Find:
```
3. **Building services**
1. Decide how to manage logs.
```
Replace with:
```
3. **Building services**
1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
Grafana Alloy (in `base`); a security subset also ships write-only off-site to
`askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
```
- [ ] **Step 2: Reconcile 3.6's "on askari" phrasing**
Find:
```
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
Kuma alerts on askari.
```
Replace with:
```
6. Wire up the monitoring stack. Logging topology DECIDED (ADR-018): cluster Loki
(all logs) + off-site security subset on `askari` + Grafana on-cluster (not the
whole stack on `askari`). Still to design/build: Prometheus + metric exporters,
Uptime Kuma, and exactly which alerts live where.
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/TODO.md`
Expected: Passed/Skipped.
```bash
git add docs/TODO.md
git commit -m "TODO: mark log management decided (ADR-018); reconcile 3.6"
```
---
### Task 8: Link ADR-018 from CLAUDE.md
**Files:**
- Modify: `CLAUDE.md`
Read the file first, then one exact edit.
- [ ] **Step 1: Add the Further-reading row after Hardware & capacity**
Find:
```
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
```
Replace with that SAME line followed by the new row:
```
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Logging & log integrity | `docs/decisions/018-logging.md` |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-018 (logging)"
```
---
### Task 9: Final consistency sweep
**Files:** none modified (verification only)
- [ ] **Step 1: ADR-018 present + cross-linked (canonical docs only)**
Run:
```bash
test -f docs/decisions/018-logging.md && echo "ADR-018 present"
grep -rl "ADR-018\|018-logging" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the file exists and the referencing docs appear — ADR-002, accepted-risks, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md.
- [ ] **Step 2: No stale "logging undecided / if available" language**
Run:
```bash
grep -rniE "log aggregation service is available|Logs \| Loki \| P \| planned|Decide how to manage logs\.($|[^~])" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: no hits — the ADR-002 conditional, the "planned" Loki row, and the open "Decide how to manage logs" TODO are all now updated.
- [ ] **Step 3: Full hook run**
Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks Passed/Skipped. Fix anything that fails (likely trailing whitespace / end-of-file) and amend the owning commit.
- [ ] **Step 4: Push (only if the user asks)**
```bash
git push origin <branch-or-main-after-merge>
```
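Step 1's `grep -rl` prints whichever files match, so a doc that was missed has to be spotted by eye. A minimal sketch of a per-file check over the seven files this plan modifies follows. The function name `check_adr018_links` is hypothetical and not part of the repo.

```bash
# Hypothetical helper: report every canonical doc that does not yet
# reference ADR-018 (by ADR number or by file name).
# Returns 0 when all seven docs link it, 1 otherwise.
check_adr018_links() {
  missing=0
  for f in docs/decisions/002-security.md docs/security/accepted-risks.md \
           docs/CAPABILITIES.md docs/decisions/012-hardware-capacity.md \
           STATUS.md docs/TODO.md CLAUDE.md; do
    if ! grep -qE 'ADR-018|018-logging' "$f" 2>/dev/null; then
      echo "missing ADR-018 reference: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_adr018_links` from the repository root. It prints nothing and returns 0 when every doc carries the reference.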
---
## Self-review notes (author)
- **Spec coverage:** decision/architecture/data-flow/security/retention → Task 1 (ADR-018); the spec's "Documentation & implementation changes" table → Tasks 2–8 (ADR-002, accepted-risks R4, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md). The role/pipeline rows in that table are deferred (recorded in ADR-018/STATUS), not implemented here. ✓
- **Deferred, intentional:** Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog forwarding, the `vault.loki.*` credential, the metrics-stack dependency — all need the unbuilt machinery; named in ADR-018/STATUS. ✓
- **No placeholders:** every create/edit shows exact text. ✓
- **Name consistency:** `ADR-018` / `018-logging.md`, "security subset", `offsite_hosts`, Grafana Alloy, push-only credential, R4 used identically across tasks. ✓
```


@ -0,0 +1,212 @@
# Design — Logging and log integrity (ship all logs to Loki)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** TODO 3.1 ("Decide how to manage logs"); makes concrete ADR-002's
"logs shipped to a central location" + "active alerting" controls; advances TODO 3.6
- **Becomes:** ADR-018 (this design is the basis for that ADR)
---
## Problem
boma wants **all logs in one queryable store** for three things: day-to-day
troubleshooting, spotting issues/trends over time, and **detecting intrusions /
malicious activity**. ADR-002 already commits in principle ("`auditd`… Logs shipped to
a central location if a log aggregation service is available"; "Active alerting wires
AIDE/`auditd`/`fail2ban`/Suricata into the monitoring/alerting stack… ties to the
Loki/Grafana effort"), and CAPABILITIES lists Loki (planned) + `askari` as the off-site
watchdog. What's undecided is the **architecture** and, critically, the **integrity**
dimension: an attacker who roots a host will try to clear logs to cover their tracks.
The key insight that frames the integrity question: **the biggest anti-tampering win is
that logs leave the host in near-real-time.** Once a line is in a store the attacker
doesn't control, wiping the local copy is futile. The remaining question is only *how
far* to harden the central store — set by the threat model.
## Decisions (the settled forks)
1. **Threat model — opportunistic + blast-radius**, per ADR-002 / accepted-risk R1.
Not forensic-grade. This sizes everything below.
2. **Ship all logs to an on-cluster Loki** — the single monitoring DB for
troubleshooting + trends. Near-real-time shipping already defeats per-host
track-covering.
3. **Split: a security-relevant subset ALSO ships off-site to `askari`, write-only.**
Tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock (Tier 3)** — recorded as accepted-risk R4; append-only push
+ off-site is the proportionate control.
5. **Disk-wear is a managed design parameter, not a blocker** — storage media choice +
bounded verbosity + tuned retention + wearout monitoring (Section: Retention & wear).
## Architecture & components
**Agent — Grafana Alloy on every host, installed by the `base` role.** Alloy reads
journald + container logs + the security sources (`auditd`, `authpriv`, `fail2ban`,
AIDE) on every host (docker_hosts, proxmox nodes, `ubongo`, `askari`) and ships them.
Placing it in `base` ties it to ADR-002's baseline "logs shipped to central" control.
**Two Loki instances, one Grafana:**
```
┌──────────────────── per host (base role) ─────────────────────┐
│ Grafana Alloy: collect journald + container + auditd/auth/... │
└──────────┬───────────────────────────────────┬────────────────┘
ALL logs │ security subset │ (over the NetBird mesh)
▼ ▼
┌────────────────────────┐ ┌──────────────────────────────┐
│ Loki (cluster) all logs│ │ Loki (askari) security only │
│ docker_host, NVMe, │ │ off-site, write-only push, │
│ bounded hot retention │ │ long retention, append-only │
└───────────┬────────────┘ └──────────────┬───────────────┘
└───────────────┬────────────────────┘
┌────────────────────────────────────┐
│ Grafana (cluster): both datasources │
│ dashboards + alerts (AIDE/auditd/ │
│ fail2ban/Suricata + log-silence) │
└────────────────────────────────────┘
```
- **Loki (cluster):** `loki` service role on a docker_host; **all** logs; monolithic
single-binary mode (ample at this scale); NVMe; bounded retention.
- **Loki (`askari`)** — the same role parameterised, deployed to the `offsite_hosts`
group; **security subset only**, **write-only**, long retention, tiny volume.
- **Grafana:** `grafana` service role on the cluster; both Lokis as datasources (one
pane queries both); where ADR-002's "active alerting" lands.
Reuses what boma already has: `askari` (off-site, on the mesh per ADR-016) and the
`base`/service-role machinery.
## Data flow & the security subset
Each host's Alloy pipeline writes **everything** to the cluster Loki and a **filtered
copy** of security events to the `askari` Loki — a relabel/match stage tags security
sources (`security="true"`) and routes only those to the second `loki.write` target.
One agent, two destinations.
**Security subset** (high-value, bounded volume): `auditd` (auth, privilege, file
watches), `authpriv` (SSH, `sudo`), `fail2ban` (bans), AIDE (file-integrity reports),
**Suricata** (OPNsense isn't a `base` host, so it **syslog-forwards** alerts to the
ingest point), and key container security events (reverse-proxy 401/403, Authentik
login events, Docker daemon events).
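The routing rule can be modelled in a few lines. This is only a sketch of the decision, not Alloy configuration: in the real pipeline the rule would live in an Alloy relabel/match stage. The function name and the source labels are illustrative assumptions, and the container security events are left out because they need content-based matching rather than a source name.

```bash
# Print the destinations for a log source. Every source goes to the cluster
# Loki; sources in the security subset are also copied to askari.
# The source names are illustrative labels, not real Alloy component names.
log_destinations() {
  case "$1" in
    auditd|authpriv|fail2ban|aide|suricata)
      echo "cluster askari" ;;
    *)
      echo "cluster" ;;
  esac
}
```

For example, `log_destinations fail2ban` prints `cluster askari`, while an ordinary application source prints only `cluster`.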
**Write-only / append-only** (the tamper-resistance mechanism):
- The `askari` Loki push endpoint (`/loki/api/v1/push`) is reachable only over the
**NetBird mesh**, with a **push-only credential**; hosts hold *only* that.
- Loki's query/admin/delete APIs on `askari` are **not exposed to hosts** (localhost /
mesh-ACL'd to operator + Grafana). The push API has no edit/delete verb, so a
compromised host can **append but not read/edit/delete**. Deletion needs the
admin/compactor API or filesystem — unreachable from a host.
- The cluster Loki uses the same push-only credential, blocking per-host log-clearing
via API there too.
**Reliability:** Alloy buffers (WAL) and retries, so a brief `askari`/mesh outage
doesn't lose logs — they flush on reconnect with only a small local buffer.
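To make the append-only surface concrete, here is a hedged sketch of what a host-side push looks like against Loki's `/loki/api/v1/push` endpoint. It assumes the push-only credential is enforced as HTTP basic auth in front of Loki, which the spec does not specify. `LOKI_URL`, `LOKI_USER`, `LOKI_PASS` and both function names are hypothetical. In practice Alloy does this, not a shell script.

```bash
# Build the JSON body for Loki's push API: one stream with one log line.
# $1 = host label, $2 = log line (assumed free of quotes and backslashes,
# since this sketch does no JSON escaping), $3 = timestamp in nanoseconds
# since the epoch, passed as a string.
loki_payload() {
  printf '{"streams":[{"stream":{"host":"%s","security":"true"},"values":[["%s","%s"]]}]}\n' \
    "$1" "$3" "$2"
}

# Append one line to the off-site Loki. LOKI_URL, LOKI_USER and LOKI_PASS are
# hypothetical names for the mesh-only endpoint and the push-only credential.
push_line() {
  loki_payload "$1" "$2" "$(date +%s)000000000" |
    curl -fsS -u "$LOKI_USER:$LOKI_PASS" \
      -H 'Content-Type: application/json' \
      --data-binary @- "$LOKI_URL/loki/api/v1/push"
}
```

The request can only add entries to a stream, which is the property the design relies on: reading and deleting go through separate endpoints that hosts cannot reach.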
## Security, integrity & residual risks
**Defeated:** opportunistic track-covering (`rm`/`vacuum`) — lines are already off the
host; **host pivot to the store** — an attacker rooting any cluster host can append but
not delete, and cannot reach `askari`'s admin plane. **The security trail survives full
cluster compromise.**
**Honest residual risks (conscious, recorded):**
1. **Append-only ≠ cryptographic WORM** — a root-on-`askari` attacker could edit chunk
files on disk. Skipping object-lock is **accepted-risk R4**; mitigated by `askari`
being minimal/hardened/operator-only/mesh-only.
2. **Un-shipped window** — a few seconds of not-yet-flushed logs live on the host;
near-real-time minimises it. Accept.
3. **Agent compromise (forward-looking)** — rooting a host lets the attacker stop *that
host's* Alloy or inject *future* false logs, but **cannot alter shipped history**.
4. **Detection as a feature** — a host that **goes silent** (Alloy stops) is an
**alert**; the tamper attempt becomes a signal. "Log-source silence" is wired into
Grafana alerting.
5. **Credential theft / `askari` outage** — a stolen push credential allows appending
noise, not deletion (bounded, rotatable); an `askari` outage buffers on hosts and
flushes on reconnect (a very long outage eventually drops oldest — monitor it).
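The log-source-silence check in item 4 reduces to comparing each host's last-seen timestamp against a threshold. A minimal sketch of that logic, assuming input lines of the form `host last_seen_epoch`; the function name and input format are illustrative, and the real alert would be a Grafana rule over Loki data.

```bash
# Read "host last_seen_epoch" lines on stdin and print every host whose most
# recent log line is older than the allowed silence window.
# $1 = current time (epoch seconds), $2 = maximum allowed silence (seconds).
silent_hosts() {
  awk -v now="$1" -v max="$2" 'now - $2 > max { print $1 }'
}
```

With `now=1100` and a 600-second window, a host last seen at `1000` is healthy and one last seen at `400` is reported.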
**ADR-002 fit:** realises "logs shipped to central" + "active alerting"; the off-site +
append-only model is a clean blast-radius-containment enhancement for the opportunistic
threat model.
## Retention, sizing & disk-wear
**Sizing (estimates — intent-based until measured, like `/capacity-review`):** a 25
host homelab generates ~1–3 GB/day raw "typical" (≪1 GB/day quiet; 5–15 GB/day very
chatty); Loki compresses ~7–10× → ~0.1–0.4 GB/day stored; the security subset is
~10–20% of that.
**Retention (tunable in `group_vars`):**
- **Cluster Loki (all logs):** bounded hot retention, start **30–90 days** (~10–35 GB
at 90d on NVMe).
- **`askari` Loki (security subset):** **1 year+** (~5–25 GB/yr), small enough to keep
the security trail long for over-time detection.
- Defaults now; **re-measure real volume after a few weeks live** and tune.
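The sizing and retention figures are simple arithmetic, so they are easy to redo once real volume has been measured. A sketch of the two calculations, with hypothetical function names and illustrative inputs rather than measured values:

```bash
# Stored GB/day after compression: raw GB/day divided by the compression ratio.
stored_gb_per_day() {
  awk -v raw="$1" -v ratio="$2" 'BEGIN { printf "%.2f\n", raw / ratio }'
}

# On-disk footprint for a retention window: stored GB/day times days kept.
retention_gb() {
  awk -v daily="$1" -v days="$2" 'BEGIN { printf "%.1f\n", daily * days }'
}
```

For example, `stored_gb_per_day 3 7` prints `0.43`, and `retention_gb 0.4 90` prints `36.0`, which sits at the upper end of the cluster estimate.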
**Disk-wear (the lore is real only for specific media/misconfig; mitigated as design
rules):** at boma's volume even ~10–40 GB/day of amplified writes is decades of life on
a ~600-TBW/TB NVMe. Rules:
1. Log storage on **NVMe/SSD** (or **HDD** for a long-retention cold tier — sequential,
endurance-unlimited); **never SD/USB flash**.
2. **Bounded verbosity at source** (sane log levels, selective access logging, a
*targeted* `auditd` ruleset) — the one lever that controls wear *and* firehose size.
3. Tuned Loki **retention + compaction** so neither store grows unbounded.
4. **SSD wearout/TBW is a monitored metric** (Proxmox wearout %, `node_exporter`
smartmon) with an alert — wear is a graph, not a surprise. (Depends on the metrics
stack — see Dependencies.)
Capacity bookkeeping ties into ADR-012: a log-storage allocation line (cluster +
`askari`) and SSD-wearout as a tracked metric.
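The "decades of life" claim can be checked with one division. The sketch below assumes a 1 TB drive rated at 600 TBW and decimal units (1 TB = 1000 GB). The function name is hypothetical, and real endurance depends on the specific drive and its write amplification.

```bash
# Years until a drive's rated endurance is consumed at a steady write rate.
# $1 = rated endurance in TBW, $2 = sustained writes in GB/day.
endurance_years() {
  awk -v tbw="$1" -v gbday="$2" \
    'BEGIN { printf "%.1f\n", tbw * 1000 / (gbday * 365) }'
}
```

At 40 GB/day, `endurance_years 600 40` prints `41.1`. Lighter write loads stretch that further, which is why the design monitors wear as a metric instead of treating it as a blocker.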
## Documentation & implementation changes
This is a substantial capability → its own ADR-018, with reconciliations:
| Doc / artifact | Change |
|---|---|
| ADR-018 (new) | Home of record: ship-all-to-Loki, the off-site write-only security subset, append-only model, skip-WORM (R4), disk-wear rules. |
| `base` role (when built) | Install + configure Alloy (all → cluster Loki; subset → `askari` write-only). |
| `loki` service role (new, when built) | One role, two deployments (cluster all-logs; `askari` security-subset write-only). `SECURITY.md` + `VERIFY.md`. |
| `grafana` service role (new, when built) | Both Lokis as datasources; dashboards + alerting (AIDE/`auditd`/`fail2ban`/Suricata + log-silence). |
| OPNsense (Ansible-managed) | Syslog-forward Suricata alerts to the ingest point. |
| ADR-002 | "Logs shipped to central" + "active alerting" bullets point to ADR-018. |
| `docs/security/accepted-risks.md` | Add **R4** — no cryptographic WORM for logs (append-only + off-site is the control). |
| `docs/CAPABILITIES.md` §3 | Loki → decided; add the off-site security sink + Alloy agent rows; mark the alerting wiring. |
| `docs/decisions/012-hardware-capacity.md` | Log-storage allocation (cluster + `askari`) + SSD-wearout tracked metric. |
| `STATUS.md` + `docs/TODO.md` (3.1 / 3.6) | Mark "how to manage logs" decided by ADR-018; rows as designed-not-built. |
| `vault.yml` | Push-only Loki credential (`vault.loki.*`). |
**Buildable now:** ADR-018 + the ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO
reconciliations. **Deferred on the stack:** Alloy in `base`, the `loki`/`grafana`
service roles, the OPNsense syslog config, and the live pipeline.
## Dependencies
- `base` role + service-role machinery (unbuilt) — STATUS.md.
- The running cluster + `askari` (`offsite_hosts`, designed) — ADR-016.
- OPNsense automation (for Suricata syslog forwarding) — ADR-007.
- The **metrics stack** (Prometheus / `node_exporter`) for SSD-wearout + log-silence
alerting — sibling effort, TODO 3.6.
## Deferred / out of scope
1. **WORM / object-lock (Tier 3)** — accepted-risk R4; revisit only if the threat model
shifts to targeted/forensic.
2. **The metrics pipeline** (Prometheus/`node_exporter`) — sibling effort; this spec is
**logs**. SSD-wearout + silence alerting depend on it.
3. **Cold archival beyond Loki retention** (export to backups) and **structured/parsed
per-service log standards** — future refinements.
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose (tens to hundreds of GB/yr) is disk-hungry on a small VPS; keep volume where storage is cheap (on-cluster) and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only logging (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail needs to be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-size-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice; one agent for logs (and later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).