boma/docs/decisions/018-logging.md
sjat 2894319f01 Add ADR-018 (logging and log integrity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 07:01:36 +02:00

# ADR-018 — Logging and log integrity
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.
The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.
## Decision
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only**:
tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
retention + wearout monitoring.
## Architecture
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
+ the alerting ADR-002 calls for.
## Data flow & the security subset
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.
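
The tagging and fan-out described above can be sketched in a few lines of Python. This is an illustration of the routing logic only, not Alloy configuration: the `source` label, the sink names, and both function names are hypothetical, and "key container security events" would need an extra match rule that is not modelled here.

```python
from typing import Dict, List

# Source names follow the subset listed in this ADR; the "source" label
# key itself is an assumption made for this sketch.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}

CLUSTER = "cluster-loki"  # hypothetical sink name
OFFSITE = "askari-loki"   # hypothetical sink name


def tag_security(labels: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the labels, adding security="true" for security sources."""
    tagged = dict(labels)
    if tagged.get("source", "").lower() in SECURITY_SOURCES:
        tagged["security"] = "true"
    return tagged


def destinations(labels: Dict[str, str]) -> List[str]:
    """Every entry goes to the cluster store; tagged entries also go off-site."""
    sinks = [CLUSTER]
    if labels.get("security") == "true":
        sinks.append(OFFSITE)
    return sinks
```

For example, `destinations(tag_security({"source": "auditd"}))` yields both sinks, while an untagged application log yields only the cluster sink.
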
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
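
To show why the push path is append-only, here is a minimal sketch of the JSON body a client sends to `/loki/api/v1/push`: a list of streams, each with a label set and `[timestamp_ns, line]` pairs. The body can only add entries; it has no way to address existing ones. The helper name and the labels used are hypothetical, and transport, authentication, and retry are deliberately left out.

```python
import json
import time
from typing import Dict, List, Optional

PUSH_PATH = "/loki/api/v1/push"  # path named in this ADR; host and auth are out of scope


def build_push_body(
    labels: Dict[str, str],
    lines: List[str],
    now_ns: Optional[int] = None,
) -> str:
    """Build a Loki JSON push body for one stream.

    Timestamps are nanoseconds since the epoch, encoded as strings.
    Each line gets start+i so entries in one batch stay ordered
    (a choice made for this sketch, not a Loki requirement).
    """
    start = time.time_ns() if now_ns is None else now_ns
    values = [[str(start + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})
```

A sender would POST this string with `Content-Type: application/json`, using whatever push-only credential the deployment settles on.
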
## Security, integrity & residual risks
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.
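
The "host going silent is itself an alert" residual can be sketched as a simple gap check. This is a hedged illustration, not the alerting rule itself: the function name, the per-host last-seen map, and the 300-second default threshold are all assumptions, and the real check would likely live in the metrics/alerting stack.

```python
from typing import Dict, List


def silent_hosts(
    last_seen: Dict[str, float],
    now: float,
    max_gap_s: float = 300.0,  # assumed threshold; tune to the shipping interval
) -> List[str]:
    """Return hosts whose newest log entry is older than max_gap_s seconds.

    last_seen maps host name to the Unix timestamp of its latest entry.
    """
    return sorted(host for host, ts in last_seen.items() if now - ts > max_gap_s)
```

With `last_seen = {"a": 1000.0, "b": 1290.0}` and `now = 1400.0`, only host `a` exceeds the default gap and would be reported.
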
## Retention & disk-wear
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30-90 days). `askari` subset: long (~1 year+, ~5-25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
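
The arithmetic behind these estimates is simple enough to sketch. The two helpers below are hypothetical and ignore compression, compaction, and write amplification; every input value is a placeholder to be replaced once real volumes are measured.

```python
def retained_gb(daily_gb: float, retention_days: int) -> float:
    """Steady-state storage needed to hold retention_days of logs."""
    return daily_gb * retention_days


def wear_fraction(written_tb: float, rated_tbw: float) -> float:
    """Fraction of an SSD's rated endurance (TBW) already consumed."""
    if rated_tbw <= 0:
        raise ValueError("rated_tbw must be positive")
    return written_tb / rated_tbw
```

For instance, an illustrative 2 GB/day at 30 days of retention needs about 60 GB, and a drive rated for 600 TBW that has written 150 TB is a quarter of the way through its rated endurance.
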
## Status
Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |

See also: ADR-002 (security baseline, realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks, distinct from this).