boma/docs/superpowers/specs/2026-06-05-logging-log-integrity-design.md
sjat 8eb5ccf97d Add design spec for logging + log integrity (ship all to Loki)
All logs -> on-cluster Loki for troubleshooting/trends; a security-relevant
subset also ships write-only off-site to askari (append-only, tamper-resistant
against full-cluster compromise); skip WORM (accepted-risk R4). Alloy agent in
base; loki/grafana service roles; disk-wear handled as a design parameter.
Basis for ADR-018.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 22:03:31 +02:00


Design — Logging and log integrity (ship all logs to Loki)

  • Date: 2026-06-05
  • Status: Approved design — pending implementation plan
  • Resolves: TODO 3.1 ("Decide how to manage logs"); makes concrete ADR-002's "logs shipped to a central location" + "active alerting" controls; advances TODO 3.6
  • Becomes: ADR-018 (this design is the basis for that ADR)

Problem

boma wants all logs in one queryable store for three things: day-to-day troubleshooting, spotting issues/trends over time, and detecting intrusions / malicious activity. ADR-002 already commits in principle ("auditd… Logs shipped to a central location if a log aggregation service is available"; "Active alerting wires AIDE/auditd/fail2ban/Suricata into the monitoring/alerting stack… ties to the Loki/Grafana effort"), and CAPABILITIES lists Loki (planned) + askari as the off-site watchdog. What's undecided is the architecture and, critically, the integrity dimension: an attacker who roots a host will try to clear logs to cover their tracks.

The key insight that frames the integrity question: the biggest anti-tampering win is that logs leave the host in near-real-time. Once a line is in a store the attacker doesn't control, wiping the local copy is futile. The remaining question is only how far to harden the central store — set by the threat model.

Decisions (the settled forks)

  1. Threat model — opportunistic + blast-radius, per ADR-002 / accepted-risk R1. Not forensic-grade. This sizes everything below.
  2. Ship all logs to an on-cluster Loki — the single monitoring DB for troubleshooting + trends. Near-real-time shipping already defeats per-host track-covering.
  3. Split: a security-relevant subset ALSO ships off-site to askari, write-only. Tamper-resistant against full-cluster compromise, at bounded volume.
  4. Skip WORM/object-lock (Tier 3): recorded as accepted-risk R4; append-only push + off-site is the proportionate control.
  5. Disk-wear is a managed design parameter, not a blocker — storage media choice + bounded verbosity + tuned retention + wearout monitoring (Section: Retention & wear).

Architecture & components

Agent: Grafana Alloy, installed by the base role on every host (docker_hosts, proxmox nodes, ubongo, askari). Alloy reads journald, container logs, and the security sources (auditd, authpriv, fail2ban, AIDE) and ships them. Placing it in base ties it to ADR-002's baseline "logs shipped to central" control.

Two Loki instances, one Grafana:

        ┌──────────────────── per host (base role) ─────────────────────┐
        │ Grafana Alloy: collect journald + container + auditd/auth/...  │
        └──────────┬───────────────────────────────────┬────────────────┘
      ALL logs     │                  security subset    │ (over the NetBird mesh)
                   ▼                                     ▼
        ┌────────────────────────┐        ┌──────────────────────────────┐
        │ Loki (cluster) all logs│        │ Loki (askari) security only  │
        │ docker_host, NVMe,     │        │ off-site, write-only push,   │
        │ bounded hot retention  │        │ long retention, append-only  │
        └───────────┬────────────┘        └──────────────┬───────────────┘
                    └───────────────┬────────────────────┘
                                    ▼
                    ┌────────────────────────────────────┐
                    │ Grafana (cluster): both datasources │
                    │ dashboards + alerts (AIDE/auditd/   │
                    │ fail2ban/Suricata + log-silence)    │
                    └────────────────────────────────────┘
  • Loki (cluster): loki service role on a docker_host; all logs; monolithic single-binary mode (ample at this scale); NVMe; bounded retention.
  • Loki (askari): the same role parameterised, deployed to the offsite_hosts group; security subset only, write-only, long retention, tiny volume.
  • Grafana: grafana service role on the cluster; both Lokis as datasources (one pane queries both); where ADR-002's "active alerting" lands.

Reuses what boma already has: askari (off-site, on the mesh per ADR-016) and the base/service-role machinery.

Data flow & the security subset

Each host's Alloy pipeline writes everything to the cluster Loki and a filtered copy of security events to the askari Loki — a relabel/match stage tags security sources (security="true") and routes only those to the second loki.write target. One agent, two destinations.

Security subset (high-value, bounded volume): auditd (auth, privilege, file watches), authpriv (SSH, sudo), fail2ban (bans), AIDE (file-integrity reports), Suricata (OPNsense isn't a base host, so it syslog-forwards alerts to the ingest point), and key container security events (reverse-proxy 401/403, Authentik login events, Docker daemon events).
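The "one agent, two destinations" routing can be modelled in a few lines. This is a minimal sketch, not Alloy configuration: the `source` label, the source names in `SECURITY_SOURCES`, and the two target names are hypothetical placeholders, and the content-based cases (reverse-proxy 401/403, Authentik logins, Docker daemon events) are left out because they need line matching rather than a source lookup.

```python
# Hypothetical source names; the real selectors depend on how the agent labels each source.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}

CLUSTER = "loki-cluster"  # placeholder name for the all-logs store
ASKARI = "loki-askari"    # placeholder name for the off-site security store


def tag(entry: dict) -> dict:
    """Return a copy of the entry, labelled security="true" if its source qualifies."""
    labels = dict(entry["labels"])
    if labels.get("source") in SECURITY_SOURCES:
        labels["security"] = "true"
    return {**entry, "labels": labels}


def route(entry: dict) -> list[str]:
    """Every entry goes to the cluster store; tagged entries also go off-site."""
    tagged = tag(entry)
    targets = [CLUSTER]
    if tagged["labels"].get("security") == "true":
        targets.append(ASKARI)
    return targets
```

For example, `route({"labels": {"source": "auditd", "host": "ubongo"}, "line": "..."})` yields both targets, while an entry from an untagged source yields only the cluster target.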

Write-only / append-only (the tamper-resistance mechanism):

  • The askari Loki push endpoint (/loki/api/v1/push) is reachable only over the NetBird mesh, with a push-only credential; hosts hold only that.
  • Loki's query/admin/delete APIs on askari are not exposed to hosts (localhost / mesh-ACL'd to operator + Grafana). The push API has no edit/delete verb, so a compromised host can append but not read/edit/delete. Deletion needs the admin/compactor API or filesystem — unreachable from a host.
  • The cluster Loki uses the same push-only credential, blocking per-host log-clearing via API there too.
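To make the "append but not edit/delete" point concrete, here is a hedged sketch of what a host-side push looks like. The JSON body follows Loki's documented push format (streams with a label set and `[timestamp_ns, line]` pairs). Everything else is an assumption for illustration: the spec does not say how the push-only credential is enforced, so HTTP basic auth in front of Loki is assumed, and the URL, user, and password are placeholders. The request is built but not sent.

```python
import base64
import json
import time
import urllib.request


def build_push_request(base_url, user, password, labels, lines):
    """Build (but do not send) a POST to Loki's push endpoint."""
    now_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    body = {
        "streams": [
            {"stream": labels, "values": [[now_ns, line] for line in lines]}
        ]
    }
    # Assumption: the push-only credential is HTTP basic auth checked by a proxy.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/loki/api/v1/push",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

The body can only carry new entries; nothing in it addresses an existing line, which is why holding this credential alone does not give a host a way to rewrite history.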

Reliability: Alloy buffers in a write-ahead log (WAL) and retries, so a brief askari or mesh outage doesn't lose logs. Buffered lines flush on reconnect, and the cost is only a small local buffer on each host.
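The buffer-and-retry behaviour, including the "very long outage eventually drops oldest" case, can be illustrated with a bounded queue. This is a toy model under stated assumptions, not Alloy's WAL implementation: `ShipBuffer`, its capacity, and the `send` callback are hypothetical names introduced here.

```python
from collections import deque


class ShipBuffer:
    """Bounded in-memory stand-in for an agent-side buffer."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently discards its oldest item when full.
        self.pending = deque(maxlen=capacity)
        self.dropped = 0  # count of lines lost to overflow, worth monitoring

    def enqueue(self, line: str) -> None:
        if len(self.pending) == self.pending.maxlen:
            self.dropped += 1
        self.pending.append(line)

    def flush(self, send) -> int:
        """Send oldest-first; stop at the first failure and keep the rest."""
        sent = 0
        while self.pending:
            if not send(self.pending[0]):
                break
            self.pending.popleft()
            sent += 1
        return sent
```

A failed `flush` leaves the queue untouched, so the next attempt resumes from the same line; the `dropped` counter is the quantity an operator would want to alert on during a long outage.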

Security, integrity & residual risks

Defeated: opportunistic track-covering (rm/vacuum) — lines are already off the host; host pivot to the store — an attacker rooting any cluster host can append but not delete, and cannot reach askari's admin plane. The security trail survives full cluster compromise.

Honest residual risks (conscious, recorded):

  1. Append-only ≠ cryptographic WORM — a root-on-askari attacker could edit chunk files on disk. Skipping object-lock is accepted-risk R4; mitigated by askari being minimal/hardened/operator-only/mesh-only.
  2. Un-shipped window — a few seconds of not-yet-flushed logs live on the host; near-real-time minimises it. Accept.
  3. Agent compromise (forward-looking) — rooting a host lets the attacker stop that host's Alloy or inject future false logs, but cannot alter shipped history.
  4. Detection as a feature — a host that goes silent (Alloy stops) is an alert; the tamper attempt becomes a signal. "Log-source silence" is wired into Grafana alerting.
  5. Credential theft / askari outage — a stolen push credential allows appending noise, not deletion (bounded, rotatable); an askari outage buffers on hosts and flushes on reconnect (a very long outage eventually drops oldest — monitor it).
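The "log-source silence" signal from item 4 reduces to a simple check: any expected host whose most recent line is older than a threshold, or that has never shipped, is flagged. The sketch below is illustrative only; the function name, the 300-second default, and the input shapes are assumptions, and the real rule would live in Grafana alerting.

```python
def silent_hosts(expected, last_seen, now, max_gap_s=300.0):
    """Return the expected hosts with no log line within max_gap_s seconds.

    expected  -- iterable of host names that should be shipping logs
    last_seen -- mapping of host name to the UNIX time of its latest line
    now       -- current UNIX time
    """
    silent = []
    for host in sorted(expected):
        ts = last_seen.get(host)
        # A host that never shipped anything counts as silent too.
        if ts is None or now - ts > max_gap_s:
            silent.append(host)
    return silent
```

Checking against an explicit `expected` list matters: a host that an attacker silenced before it ever shipped a line would otherwise never appear in the data at all.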

ADR-002 fit: realises "logs shipped to central" + "active alerting"; the off-site + append-only model is a clean blast-radius-containment enhancement for the opportunistic threat model.

Retention, sizing & disk-wear

Sizing (estimates, intent-based until measured, like /capacity-review): a 25 host homelab generates ~1–3 GB/day raw "typical" (≪1 GB/day quiet; 5–15 GB/day very chatty); Loki compresses ~7–10× → ~0.1–0.4 GB/day stored; the security subset is ~10–20% of that.

Retention (tunable in group_vars):

  • Cluster Loki (all logs): bounded hot retention, start 30–90 days (~10–35 GB at 90d on NVMe).
  • askari Loki (security subset): 1 year+ (~5–25 GB/yr); small enough to keep the security trail long for over-time detection.
  • Defaults now; re-measure real volume after a few weeks live and tune.
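The sizing and retention figures follow from straightforward arithmetic, sketched here so the defaults are easy to recompute once real volume is measured. The inputs are a reading of the estimates above, taken as 1 to 3 GB/day raw, 7 to 10x compression, and a 10 to 20% security share; the function names are hypothetical.

```python
def stored_gb_per_day(raw_gb_per_day: float, compression_ratio: float) -> float:
    """Compressed volume written to Loki per day."""
    return raw_gb_per_day / compression_ratio


def retained_gb(stored_per_day: float, days: int, share: float = 1.0) -> float:
    """Disk held at steady state for a retention window; share selects a subset."""
    return stored_per_day * days * share


# Best case: least raw volume, best compression. Worst case: the opposite.
low = stored_gb_per_day(1.0, 10.0)
high = stored_gb_per_day(3.0, 7.0)

cluster_90d = (retained_gb(low, 90), retained_gb(high, 90))
askari_1y = (retained_gb(low, 365, 0.10), retained_gb(high, 365, 0.20))
```

With these inputs the cluster store at 90 days lands between roughly 9 and 39 GB and the askari store between roughly 4 and 31 GB per year, which brackets the rounded figures in the retention list.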

Disk-wear (the lore is real only for specific media/misconfig; mitigated as design rules): at boma's volume even ~10–40 GB/day of amplified writes is decades of life on a ~600-TBW/TB NVMe. Rules:

  1. Log storage on NVMe/SSD (or HDD for a long-retention cold tier — sequential, endurance-unlimited); never SD/USB flash.
  2. Bounded verbosity at source (sane log levels, selective access logging, a targeted auditd ruleset) — the one lever that controls wear and firehose size.
  3. Tuned Loki retention + compaction so neither store grows unbounded.
  4. SSD wearout/TBW is a monitored metric (Proxmox wearout %, node_exporter smartmon) with an alert — wear is a graph, not a surprise. (Depends on the metrics stack — see Dependencies.)
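The "decades of life" claim is a one-line division, shown here as a sketch so it can be rerun against a real drive's rating. The 1 TB capacity is an assumed example; the 600 TBW/TB rating and the 10 to 40 GB/day write range are read from the disk-wear paragraph above, and the function name is hypothetical.

```python
def endurance_years(capacity_tb: float, tbw_per_tb: float, writes_gb_per_day: float) -> float:
    """Years until the rated write endurance is consumed at a constant write rate."""
    rated_gb = capacity_tb * tbw_per_tb * 1000.0  # TBW rating expressed in GB
    return rated_gb / writes_gb_per_day / 365.0


worst_case = endurance_years(1.0, 600.0, 40.0)  # heaviest assumed write load
best_case = endurance_years(1.0, 600.0, 10.0)   # lightest assumed write load
```

At 40 GB/day a 1 TB drive with that rating takes about 41 years to reach its rated endurance, and at 10 GB/day about 164 years, so the monitored wearout metric in rule 4 is a safeguard against misconfiguration rather than an expected failure mode.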

Capacity bookkeeping ties into ADR-012: a log-storage allocation line (cluster + askari) and SSD-wearout as a tracked metric.

Documentation & implementation changes

This is a substantial capability → its own ADR-018, with reconciliations:

| Doc / artifact | Change |
|---|---|
| ADR-018 (new) | Home of record: ship-all-to-Loki, the off-site write-only security subset, append-only model, skip-WORM (R4), disk-wear rules. |
| base role (when built) | Install + configure Alloy (all → cluster Loki; subset → askari write-only). |
| loki service role (new, when built) | One role, two deployments (cluster all-logs; askari security-subset write-only). SECURITY.md + VERIFY.md. |
| grafana service role (new, when built) | Both Lokis as datasources; dashboards + alerting (AIDE/auditd/fail2ban/Suricata + log-silence). |
| OPNsense (Ansible-managed) | Syslog-forward Suricata alerts to the ingest point. |
| ADR-002 | "Logs shipped to central" + "active alerting" bullets point to ADR-018. |
| docs/security/accepted-risks.md | Add R4: no cryptographic WORM for logs (append-only + off-site is the control). |
| docs/CAPABILITIES.md §3 | Loki → decided; add the off-site security sink + Alloy agent rows; mark the alerting wiring. |
| docs/decisions/012-hardware-capacity.md | Log-storage allocation (cluster + askari) + SSD-wearout tracked metric. |
| STATUS.md + docs/TODO.md (3.1 / 3.6) | Mark "how to manage logs" decided by ADR-018; rows as designed-not-built. |
| vault.yml | Push-only Loki credential (vault.loki.*). |

Buildable now: ADR-018 + the ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO reconciliations. Deferred on the stack: the Alloy-in-base, loki/grafana service roles, OPNsense syslog config, and the live pipeline.

Dependencies

  • base role + service-role machinery (unbuilt) — STATUS.md.
  • The running cluster + askari (offsite_hosts, designed) — ADR-016.
  • OPNsense automation (for Suricata syslog forwarding) — ADR-007.
  • The metrics stack (Prometheus / node_exporter) for SSD-wearout + log-silence alerting — sibling effort, TODO 3.6.

Deferred / out of scope

  1. WORM / object-lock (Tier 3) — accepted-risk R4; revisit only if the threat model shifts to targeted/forensic.
  2. The metrics pipeline (Prometheus/node_exporter) — sibling effort; this spec is logs. SSD-wearout + silence alerting depend on it.
  3. Cold archival beyond Loki retention (export to backups) and structured/parsed per-service log standards — future refinements.

What was ruled out

| Option | Reason |
|---|---|
| Everything off-site on askari (no on-cluster Loki) | The firehose (tens to hundreds of GB/yr) is disk-hungry on a small VPS; keep volume where storage is cheap (on-cluster) and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model; YAGNI (R4). |
| On-cluster-only logging (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail needs to be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-size-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice; one agent for logs (and later metrics). |

See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / askari), ADR-007 (OPNsense / askari), ADR-012 (hardware/capacity), ADR-004 (service-role standard), ADR-011 (health checks — distinct from this).