# ADR-018 — Logging and log integrity
## Context

boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question, since an
attacker who roots a host will try to clear logs to cover their tracks.

The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time**. Once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.
## Decision

1. **Threat model: opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
   forensic-grade.
2. **All logs → an on-cluster Loki**, the single monitoring DB for troubleshooting +
   trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only**:
   tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** (accepted-risk R4); append-only push + off-site is the
   proportionate control.
5. **Disk-wear is a managed parameter**: media choice + bounded verbosity + tuned
   retention + wearout monitoring.
## Architecture

- **Agent:** Grafana Alloy on every host, installed by the `base` role. Reads journald +
  container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a `docker_host`; all logs; monolithic
  single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
  only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards +
  the alerting ADR-002 calls for.
## Data flow & the security subset

Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host, so
it syslog-forwards its alerts to the ingest point), and key container security events.

**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
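The fan-out and the push call described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Alloy configuration: the `source` label, the lowercase source names, and the helper names `tag`, `fan_out`, and `push_body` are hypothetical, and container security events are left out for brevity. The one detail taken from Loki itself is the JSON shape of a `/loki/api/v1/push` body (streams carrying a label set and `[timestamp_ns, line]` pairs).

```python
import json
import time

# Hypothetical source names; the set mirrors the subset listed in this section.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


def tag(labels: dict) -> dict:
    """Return a copy of labels, adding security="true" for security sources."""
    tagged = dict(labels)
    if tagged.get("source") in SECURITY_SOURCES:
        tagged["security"] = "true"
    return tagged


def fan_out(entries):
    """Split (labels, line) entries into (cluster, askari).

    The cluster list receives every entry; the askari list receives only
    the entries that were tagged security="true".
    """
    cluster = [(tag(labels), line) for labels, line in entries]
    askari = [(labels, line) for labels, line in cluster
              if labels.get("security") == "true"]
    return cluster, askari


def push_body(labels: dict, lines, now_ns=None) -> str:
    """Build the JSON body for POST /loki/api/v1/push for a single stream."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    ts = str(now_ns if now_ns is not None else time.time_ns())
    stream = {"stream": labels, "values": [[ts, line] for line in lines]}
    return json.dumps({"streams": [stream]})
```

The sketch only ever builds a push body, which mirrors the append-only property: nothing in it can name an existing entry to edit or delete.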
## Security, integrity & residual risks

The design defeats opportunistic track-covering (logs are already off-host) and
host-pivot-to-store (append-only, off-cluster). The security trail survives full-cluster
compromise. Conscious residuals: append-only ≠ cryptographic WORM (root on `askari` could
edit chunks; see R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history, and **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; during an `askari` outage, Alloy
buffers + flushes on reconnect.
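The "silent host" check can be illustrated with a minimal Python sketch. It assumes a mapping from host name to the timestamp of that host's newest log line; the function name `silent_hosts` and the five-minute threshold in the usage example are hypothetical. The real alert is expected to live in the metrics stack listed under Dependencies.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(last_seen: dict, now: datetime, max_gap: timedelta) -> list:
    """Return hosts whose newest log line is older than max_gap, sorted by name."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_gap)


# Usage: one host shipped a line a minute ago, the other ten minutes ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "host-a": now - timedelta(minutes=1),
    "host-b": now - timedelta(minutes=10),
}
quiet = silent_hosts(last_seen, now, timedelta(minutes=5))
```

Here `quiet` contains only `host-b`. A threshold comfortably above the shipping interval avoids alerting on ordinary batching delays.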
## Retention & disk-wear

Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
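The arithmetic behind these estimates is simple enough to sketch. The Python below converts the `askari` yearly range into a daily ingest figure and estimates how long a drive lasts against its TBW rating. The function names, the decimal units (1 GB = 1000 MB), and every input in the wear example (a 600 TB rating, 10 GB/day of writes, 2x write amplification) are hypothetical placeholders, to be replaced with measured values.

```python
def daily_mb(gb_per_year: float) -> float:
    """Average daily ingest in MB for a yearly volume given in GB."""
    return gb_per_year * 1000 / 365


def years_to_tbw(tbw_tb: float, daily_write_gb: float,
                 write_amplification: float = 1.0) -> float:
    """Years until a drive reaches its rated TBW at a steady write rate."""
    yearly_gb = daily_write_gb * write_amplification * 365
    return tbw_tb * 1000 / yearly_gb


# The askari subset range from this section: ~5-25 GB/yr.
low, high = daily_mb(5), daily_mb(25)

# Hypothetical drive and workload, for illustration only.
lifetime = years_to_tbw(tbw_tb=600, daily_write_gb=10, write_amplification=2)
```

The 5–25 GB/yr range works out to roughly 13.7–68.5 MB/day, which is why the off-site subset fits on a small VPS.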
## Status

Designed. **Authorable now:** this ADR + the
ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO reconciliations.
**Deferred on the stack:** Alloy-in-`base`, the `loki`/`grafana` service roles,
OPNsense syslog config, the push-only credential, and the live pipeline.
## Dependencies

`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
## What was ruled out

| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model; YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |

See also: ADR-002 (security baseline, realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks, distinct from this).