boma/docs/superpowers/specs/2026-06-05-logging-log-integrity-design.md
sjat 8eb5ccf97d Add design spec for logging + log integrity (ship all to Loki)
All logs -> on-cluster Loki for troubleshooting/trends; a security-relevant
subset also ships write-only off-site to askari (append-only, tamper-resistant
against full-cluster compromise); skip WORM (accepted-risk R4). Alloy agent in
base; loki/grafana service roles; disk-wear handled as a design parameter.
Basis for ADR-018.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 22:03:31 +02:00


# Design — Logging and log integrity (ship all logs to Loki)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** TODO 3.1 ("Decide how to manage logs"); makes concrete ADR-002's
"logs shipped to a central location" + "active alerting" controls; advances TODO 3.6
- **Becomes:** ADR-018 (this design is the basis for that ADR)
---
## Problem
boma wants **all logs in one queryable store** for three things: day-to-day
troubleshooting, spotting issues/trends over time, and **detecting intrusions /
malicious activity**. ADR-002 already commits in principle ("`auditd`… Logs shipped to
a central location if a log aggregation service is available"; "Active alerting wires
AIDE/`auditd`/`fail2ban`/Suricata into the monitoring/alerting stack… ties to the
Loki/Grafana effort"), and CAPABILITIES lists Loki (planned) + `askari` as the off-site
watchdog. What's undecided is the **architecture** and, critically, the **integrity**
dimension: an attacker who roots a host will try to clear logs to cover their tracks.

The key insight that frames the integrity question: **the biggest anti-tampering win is
that logs leave the host in near-real-time.** Once a line is in a store the attacker
doesn't control, wiping the local copy is futile. The remaining question is only *how
far* to harden the central store — set by the threat model.
## Decisions (the settled forks)
1. **Threat model — opportunistic + blast-radius**, per ADR-002 / accepted-risk R1.
Not forensic-grade. This sizes everything below.
2. **Ship all logs to an on-cluster Loki** — the single monitoring DB for
troubleshooting + trends. Near-real-time shipping already defeats per-host
track-covering.
3. **Split: a security-relevant subset ALSO ships off-site to `askari`, write-only.**
Tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock (Tier 3)** — recorded as accepted-risk R4; append-only push
+ off-site is the proportionate control.
5. **Disk-wear is a managed design parameter, not a blocker**: storage media choice +
bounded verbosity + tuned retention + wearout monitoring (see "Retention, sizing & disk-wear").
## Architecture & components
**Agent — Grafana Alloy on every host, installed by the `base` role.** Alloy reads
journald + container logs + the security sources (`auditd`, `authpriv`, `fail2ban`,
AIDE) on every host (docker_hosts, proxmox nodes, `ubongo`, `askari`) and ships them.
Placing it in `base` ties it to ADR-002's baseline "logs shipped to central" control.

**Two Loki instances, one Grafana:**
```
┌──────────────────── per host (base role) ─────────────────────┐
│ Grafana Alloy: collect journald + container + auditd/auth/... │
└──────────┬───────────────────────────────────┬────────────────┘
  ALL logs │                   security subset │ (over the NetBird mesh)
           ▼                                   ▼
┌────────────────────────┐       ┌──────────────────────────────┐
│ Loki (cluster) all logs│       │ Loki (askari) security only  │
│ docker_host, NVMe,     │       │ off-site, write-only push,   │
│ bounded hot retention  │       │ long retention, append-only  │
└───────────┬────────────┘       └──────────────┬───────────────┘
            └─────────────────┬─────────────────┘
           ┌─────────────────────────────────────┐
           │ Grafana (cluster): both datasources │
           │ dashboards + alerts (AIDE/auditd/   │
           │ fail2ban/Suricata + log-silence)    │
           └─────────────────────────────────────┘
```
- **Loki (cluster)** — `loki` service role on a docker_host; **all** logs; monolithic
single-binary mode (ample at this scale); NVMe; bounded retention.
- **Loki (`askari`)** — the same role parameterised, deployed to the `offsite_hosts`
group; **security subset only**, **write-only**, long retention, tiny volume.
- **Grafana** — `grafana` service role on the cluster; both Lokis as datasources (one
pane queries both); where ADR-002's "active alerting" lands.

Reuses what boma already has: `askari` (off-site, on the mesh per ADR-016) and the
`base`/service-role machinery.
## Data flow & the security subset
Each host's Alloy pipeline writes **everything** to the cluster Loki and a **filtered
copy** of security events to the `askari` Loki — a relabel/match stage tags security
sources (`security="true"`) and routes only those to the second `loki.write` target.
One agent, two destinations.
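
The "one agent, two destinations" routing can be sketched in a few lines. This is only an illustration of the decision logic, not Alloy configuration: the source names in `SECURITY_SOURCES`, the `LogEntry` type, and the destination names are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, field

# Assumption: illustrative source labels, not Alloy component names.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


@dataclass
class LogEntry:
    source: str
    line: str
    labels: dict = field(default_factory=dict)


def route(entry: LogEntry) -> list[str]:
    """Return the destinations for one entry.

    Every entry goes to the cluster store; entries from a security
    source are tagged and also copied to the off-site store.
    """
    destinations = ["cluster"]
    if entry.source in SECURITY_SOURCES:
        entry.labels["security"] = "true"  # the tag the filtered copy matches on
        destinations.append("askari")
    return destinations


if __name__ == "__main__":
    print(route(LogEntry("auditd", "type=USER_AUTH ...")))
    print(route(LogEntry("nginx", "GET /healthz 200")))
```

The point of the shape is that the off-site copy is a strict subset: nothing reaches `askari` that did not also reach the cluster store.
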

**Security subset** (high-value, bounded volume): `auditd` (auth, privilege, file
watches), `authpriv` (SSH, `sudo`), `fail2ban` (bans), AIDE (file-integrity reports),
**Suricata** (OPNsense isn't a `base` host, so it **syslog-forwards** alerts to the
ingest point), and key container security events (reverse-proxy 401/403, Authentik
login events, Docker daemon events).

**Write-only / append-only** (the tamper-resistance mechanism):
- The `askari` Loki push endpoint (`/loki/api/v1/push`) is reachable only over the
**NetBird mesh**, with a **push-only credential**; hosts hold *only* that.
- Loki's query/admin/delete APIs on `askari` are **not exposed to hosts** (localhost /
mesh-ACL'd to operator + Grafana). The push API has no edit/delete verb, so a
compromised host can **append but not read/edit/delete**. Deletion needs the
admin/compactor API or filesystem — unreachable from a host.
- The cluster Loki uses the same push-only credential, blocking per-host log-clearing
via API there too.
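
The access rule described in the bullets above amounts to a small allow-list. The sketch below models it as a pure function so the intent is easy to test; the role names (`host`, `operator`, `grafana`) are assumptions for illustration, and in practice the enforcement would sit in the mesh ACL and a reverse proxy rather than in Python.

```python
# Only the push path comes from the spec; role names are hypothetical.
PUSH_PATH = "/loki/api/v1/push"
TRUSTED_ROLES = {"operator", "grafana"}


def is_allowed(role: str, method: str, path: str) -> bool:
    """Decide whether a request may reach the Loki instance.

    Hosts hold a push-only credential: they may POST to the push
    endpoint and nothing else. Trusted roles may reach the
    query/admin surface. Unknown roles are denied.
    """
    if role == "host":
        return method == "POST" and path == PUSH_PATH
    return role in TRUSTED_ROLES


if __name__ == "__main__":
    print(is_allowed("host", "POST", PUSH_PATH))
    print(is_allowed("host", "GET", "/loki/api/v1/query_range"))
```

A compromised host therefore keeps exactly one capability, appending, which is the property the residual-risk analysis relies on.
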

**Reliability:** Alloy buffers to a write-ahead log (WAL) and retries, so a brief
`askari`/mesh outage doesn't lose logs; they flush on reconnect, and the local buffer
stays small.
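
The buffering behaviour (hold lines during an outage, flush oldest-first on reconnect, drop the oldest once the buffer is full) can be illustrated with a bounded queue. This is an in-memory sketch under that assumption, not Alloy's actual WAL; `BoundedBuffer` and the `send` callback are hypothetical names.

```python
from collections import deque
from typing import Callable


class BoundedBuffer:
    """Holds unsent log lines; when full, the oldest line is dropped."""

    def __init__(self, max_entries: int) -> None:
        # deque(maxlen=...) discards from the left when appending to a full queue
        self._queue: deque = deque(maxlen=max_entries)
        self.dropped = 0

    def __len__(self) -> int:
        return len(self._queue)

    def append(self, line: str) -> None:
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1  # count what a long outage costs
        self._queue.append(line)

    def flush(self, send: Callable[[str], bool]) -> int:
        """Send oldest-first; stop at the first failure and keep the rest."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent
```

Exposing `dropped` as a counter mirrors the spec's advice to monitor very long outages rather than discover the loss afterwards.
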
## Security, integrity & residual risks
**Defeated:** opportunistic track-covering (`rm`/`vacuum`) — lines are already off the
host; **host pivot to the store** — an attacker rooting any cluster host can append but
not delete, and cannot reach `askari`'s admin plane. **The security trail survives full
cluster compromise.**

**Honest residual risks (conscious, recorded):**
1. **Append-only ≠ cryptographic WORM** — a root-on-`askari` attacker could edit chunk
files on disk. Skipping object-lock is **accepted-risk R4**; mitigated by `askari`
being minimal/hardened/operator-only/mesh-only.
2. **Un-shipped window** — a few seconds of not-yet-flushed logs live on the host;
near-real-time minimises it. Accept.
3. **Agent compromise (forward-looking)** — rooting a host lets the attacker stop *that
host's* Alloy or inject *future* false logs, but **cannot alter shipped history**.
4. **Detection as a feature** — a host that **goes silent** (Alloy stops) is an
**alert**; the tamper attempt becomes a signal. "Log-source silence" is wired into
Grafana alerting.
5. **Credential theft / `askari` outage** — a stolen push credential allows appending
noise, not deletion (bounded, rotatable); an `askari` outage buffers on hosts and
flushes on reconnect (a very long outage eventually drops oldest — monitor it).

**ADR-002 fit:** realises "logs shipped to central" + "active alerting"; the off-site +
append-only model is a clean blast-radius-containment enhancement for the opportunistic
threat model.
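
The "log-source silence" check from the residual-risk list reduces to comparing each host's last-seen timestamp against a threshold. A minimal sketch follows; the 10-minute default and the function name are assumptions, and the real rule would live in Grafana alerting rather than in a script.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(
    last_seen: dict[str, datetime],
    now: datetime,
    threshold: timedelta = timedelta(minutes=10),  # assumed default, tune per host
) -> list[str]:
    """Return hosts whose most recent log line is older than the threshold."""
    return sorted(
        host for host, seen in last_seen.items() if now - seen > threshold
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    seen = {
        "ubongo": now - timedelta(minutes=2),
        "askari": now - timedelta(minutes=45),
    }
    print(silent_hosts(seen, now))
```

The threshold has to sit above the quietest host's normal gap between lines, otherwise an idle host is indistinguishable from a stopped agent.
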
## Retention, sizing & disk-wear
**Sizing (estimates — intent-based until measured, like `/capacity-review`):** a 25
host homelab generates ~1–3 GB/day raw "typical" (≪1 GB/day quiet; 5–15 GB/day very
chatty); Loki compresses ~7–10× → ~0.1–0.4 GB/day stored; the security subset is
~10–20% of that.

**Retention (tunable in `group_vars`):**
- **Cluster Loki (all logs):** bounded hot retention, start **30–90 days** (~10–35 GB
at 90d on NVMe).
- **`askari` Loki (security subset):** **1 year+** (~5–25 GB/yr), small enough to keep
the security trail long for over-time detection.
- Defaults now; **re-measure real volume after a few weeks live** and tune.
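
The sizing arithmetic can be replayed as a quick sanity check. The sketch assumes the estimates read as 1 to 3 GB/day raw, 7x to 10x compression, and a security subset of 10% to 20% of stored volume; the function names are illustrative.

```python
def stored_gb_per_day(raw_gb_per_day: float, compression_ratio: float) -> float:
    """Compressed volume written to Loki per day."""
    return raw_gb_per_day / compression_ratio


def retained_gb(stored_per_day: float, retention_days: int) -> float:
    """Steady-state disk use once the retention window is full."""
    return stored_per_day * retention_days


# Assumed readings of the estimates: best case pairs low volume with high
# compression, worst case pairs high volume with low compression.
low = stored_gb_per_day(1.0, 10.0)
high = stored_gb_per_day(3.0, 7.0)

cluster_90d = (retained_gb(low, 90), retained_gb(high, 90))
askari_1y = (retained_gb(low * 0.10, 365), retained_gb(high * 0.20, 365))

print(f"cluster, 90 days: {cluster_90d[0]:.0f} to {cluster_90d[1]:.0f} GB")
print(f"askari, 1 year:   {askari_1y[0]:.0f} to {askari_1y[1]:.0f} GB")
```

Under these assumptions the extremes come out at roughly 9 to 39 GB for the cluster store and 4 to 31 GB per year for `askari`, which brackets the rounded figures in the bullets; real volume still needs measuring once the pipeline is live.
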

**Disk-wear (the lore is real only for specific media/misconfig; mitigated as design
rules):** at boma's volume even ~10–40 GB/day of amplified writes is decades of life on
a ~600-TBW/TB NVMe. Rules:
1. Log storage on **NVMe/SSD** (or **HDD** for a long-retention cold tier — sequential,
endurance-unlimited); **never SD/USB flash**.
2. **Bounded verbosity at source** (sane log levels, selective access logging, a
*targeted* `auditd` ruleset) — the one lever that controls wear *and* firehose size.
3. Tuned Loki **retention + compaction** so neither store grows unbounded.
4. **SSD wearout/TBW is a monitored metric** (Proxmox wearout %, `node_exporter`
smartmon) with an alert — wear is a graph, not a surprise. (Depends on the metrics
stack — see Dependencies.)

Capacity bookkeeping ties into ADR-012: a log-storage allocation line (cluster +
`askari`) and SSD-wearout as a tracked metric.
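
The "decades of life" claim in the disk-wear paragraph is simple division, sketched here for a hypothetical 1 TB drive rated at 600 TBW and amplified writes of 10 to 40 GB/day (the reading assumed for the spec's figure). Decimal units are used (1 TB = 1000 GB).

```python
def endurance_years(tbw_rating_tb: float, writes_gb_per_day: float) -> float:
    """Years until the rated terabytes-written figure is reached."""
    days = tbw_rating_tb * 1000 / writes_gb_per_day
    return days / 365


# Assumption: a 1 TB NVMe with a 600 TBW endurance rating.
worst_case = endurance_years(600, 40)
best_case = endurance_years(600, 10)

print(f"{worst_case:.0f} to {best_case:.0f} years")
```

Even the worst case lands at about four decades, which is why the rules focus on media choice and monitoring rather than on reducing writes at any cost.
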
## Documentation & implementation changes
This is a substantial capability → its own ADR-018, with reconciliations:

| Doc / artifact | Change |
|---|---|
| ADR-018 (new) | Home of record: ship-all-to-Loki, the off-site write-only security subset, append-only model, skip-WORM (R4), disk-wear rules. |
| `base` role (when built) | Install + configure Alloy (all → cluster Loki; subset → `askari` write-only). |
| `loki` service role (new, when built) | One role, two deployments (cluster all-logs; `askari` security-subset write-only). `SECURITY.md` + `VERIFY.md`. |
| `grafana` service role (new, when built) | Both Lokis as datasources; dashboards + alerting (AIDE/`auditd`/`fail2ban`/Suricata + log-silence). |
| OPNsense (Ansible-managed) | Syslog-forward Suricata alerts to the ingest point. |
| ADR-002 | "Logs shipped to central" + "active alerting" bullets point to ADR-018. |
| `docs/security/accepted-risks.md` | Add **R4** — no cryptographic WORM for logs (append-only + off-site is the control). |
| `docs/CAPABILITIES.md` §3 | Loki → decided; add the off-site security sink + Alloy agent rows; mark the alerting wiring. |
| `docs/decisions/012-hardware-capacity.md` | Log-storage allocation (cluster + `askari`) + SSD-wearout tracked metric. |
| `STATUS.md` + `docs/TODO.md` (3.1 / 3.6) | Mark "how to manage logs" decided by ADR-018; rows as designed-not-built. |
| `vault.yml` | Push-only Loki credential (`vault.loki.*`). |

**Buildable now:** ADR-018 + the ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO
reconciliations. **Deferred on the stack:** Alloy in `base`, the `loki`/`grafana`
service roles, the OPNsense syslog config, and the live pipeline.
## Dependencies
- `base` role + service-role machinery (unbuilt) — STATUS.md.
- The running cluster + `askari` (`offsite_hosts`, designed) — ADR-016.
- OPNsense automation (for Suricata syslog forwarding) — ADR-007.
- The **metrics stack** (Prometheus / `node_exporter`) for SSD-wearout + log-silence
alerting — sibling effort, TODO 3.6.
## Deferred / out of scope
1. **WORM / object-lock (Tier 3)** — accepted-risk R4; revisit only if the threat model
shifts to targeted/forensic.
2. **The metrics pipeline** (Prometheus/`node_exporter`) — sibling effort; this spec is
**logs**. SSD-wearout + silence alerting depend on it.
3. **Cold archival beyond Loki retention** (export to backups) and **structured/parsed
per-service log standards** — future refinements.
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose (tens to hundreds of GB/yr) is disk-hungry on a small VPS; keep volume where storage is cheap (on-cluster) and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only logging (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail needs to be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-size-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice; one agent for logs (and later metrics). |

See also: ADR-002 (security baseline, realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks, distinct from this).