All logs -> on-cluster Loki for troubleshooting/trends; a security-relevant subset also ships write-only off-site to askari (append-only, tamper-resistant against full-cluster compromise); skip WORM (accepted-risk R4). Alloy agent in base; loki/grafana service roles; disk-wear handled as a design parameter. Basis for ADR-018. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Design — Logging and log integrity (ship all logs to Loki)
- Date: 2026-06-05
- Status: Approved design — pending implementation plan
- Resolves: TODO 3.1 ("Decide how to manage logs"); makes concrete ADR-002's "logs shipped to a central location" + "active alerting" controls; advances TODO 3.6
- Becomes: ADR-018 (this design is the basis for that ADR)
Problem
boma wants all logs in one queryable store for three things: day-to-day
troubleshooting, spotting issues/trends over time, and detecting intrusions /
malicious activity. ADR-002 already commits in principle ("auditd… Logs shipped to
a central location if a log aggregation service is available"; "Active alerting wires
AIDE/auditd/fail2ban/Suricata into the monitoring/alerting stack… ties to the
Loki/Grafana effort"), and CAPABILITIES lists Loki (planned) + askari as the off-site
watchdog. What's undecided is the architecture and, critically, the integrity
dimension: an attacker who roots a host will try to clear logs to cover their tracks.
The key insight that frames the integrity question: the biggest anti-tampering win is that logs leave the host in near-real-time. Once a line is in a store the attacker doesn't control, wiping the local copy is futile. The remaining question is only how far to harden the central store — set by the threat model.
Decisions (the settled forks)
- Threat model — opportunistic + blast-radius, per ADR-002 / accepted-risk R1. Not forensic-grade. This sizes everything below.
- Ship all logs to an on-cluster Loki — the single monitoring DB for troubleshooting + trends. Near-real-time shipping already defeats per-host track-covering.
- Split: a security-relevant subset ALSO ships off-site to askari, write-only. Tamper-resistant against full-cluster compromise, at bounded volume.
- Skip WORM/object-lock (Tier 3): recorded as accepted-risk R4; append-only push + off-site is the proportionate control.
- Disk-wear is a managed design parameter, not a blocker — storage media choice + bounded verbosity + tuned retention + wearout monitoring (Section: Retention & wear).
Architecture & components
Agent — Grafana Alloy on every host, installed by the base role. Alloy reads
journald + container logs + the security sources (auditd, authpriv, fail2ban,
AIDE) on every host (docker_hosts, proxmox nodes, ubongo, askari) and ships them.
Placing it in base ties it to ADR-002's baseline "logs shipped to central" control.
Two Loki instances, one Grafana:
┌──────────────────── per host (base role) ─────────────────────┐
│ Grafana Alloy: collect journald + container + auditd/auth/... │
└──────────┬───────────────────────────────────┬────────────────┘
  ALL logs │                   security subset │ (over the NetBird mesh)
           ▼                                   ▼
┌────────────────────────┐       ┌──────────────────────────────┐
│ Loki (cluster) all logs│       │ Loki (askari) security only  │
│ docker_host, NVMe,     │       │ off-site, write-only push,   │
│ bounded hot retention  │       │ long retention, append-only  │
└───────────┬────────────┘       └──────────────┬───────────────┘
            └─────────────────┬─────────────────┘
                              ▼
           ┌─────────────────────────────────────┐
           │ Grafana (cluster): both datasources │
           │ dashboards + alerts (AIDE/auditd/   │
           │ fail2ban/Suricata + log-silence)    │
           └─────────────────────────────────────┘
- Loki (cluster): loki service role on a docker_host; all logs; monolithic single-binary mode (ample at this scale); NVMe; bounded retention.
- Loki (askari): the same role parameterised, deployed to the offsite_hosts group; security subset only, write-only, long retention, tiny volume.
- Grafana: grafana service role on the cluster; both Lokis as datasources (one pane queries both); where ADR-002's "active alerting" lands.
Reuses what boma already has: askari (off-site, on the mesh per ADR-016) and the
base/service-role machinery.
Data flow & the security subset
Each host's Alloy pipeline writes everything to the cluster Loki and a filtered
copy of security events to the askari Loki — a relabel/match stage tags security
sources (security="true") and routes only those to the second loki.write target.
One agent, two destinations.
Security subset (high-value, bounded volume): auditd (auth, privilege, file
watches), authpriv (SSH, sudo), fail2ban (bans), AIDE (file-integrity reports),
Suricata (OPNsense isn't a base host, so it syslog-forwards alerts to the
ingest point), and key container security events (reverse-proxy 401/403, Authentik
login events, Docker daemon events).
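The "one agent, two destinations" routing can be modelled in a few lines. This is a minimal sketch of the decision only, not Alloy configuration: the destination names and the `source` label key are hypothetical, and the container security events (401/403, login events) would need content-based matching that is omitted here.

```python
# Hypothetical model of the per-host routing decision; not Alloy configuration.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}

CLUSTER = "loki-cluster"  # hypothetical destination names
OFFSITE = "loki-askari"


def tag(record: dict) -> dict:
    """Return a copy of the record, adding security="true" when the source qualifies."""
    labels = dict(record)
    if labels.get("source") in SECURITY_SOURCES:
        labels["security"] = "true"
    return labels


def destinations(record: dict) -> list[str]:
    """Every record goes to the cluster store; tagged records are also copied off-site."""
    targets = [CLUSTER]
    if tag(record).get("security") == "true":
        targets.append(OFFSITE)
    return targets


print(destinations({"source": "auditd", "host": "node1"}))
print(destinations({"source": "nginx", "host": "node1"}))
```

The point of the model is that the off-site copy is additive: nothing is withheld from the cluster Loki, so troubleshooting never has to consult two stores for ordinary logs.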
Write-only / append-only (the tamper-resistance mechanism):
- The askari Loki push endpoint (/loki/api/v1/push) is reachable only over the NetBird mesh, with a push-only credential; hosts hold only that.
- Loki's query/admin/delete APIs on askari are not exposed to hosts (localhost / mesh-ACL'd to operator + Grafana). The push API has no edit/delete verb, so a compromised host can append but not read/edit/delete. Deletion needs the admin/compactor API or filesystem, both unreachable from a host.
- The cluster Loki uses the same push-only credential, blocking per-host log-clearing via API there too.
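The access rule above reduces to a small allowlist. Loki ships without a built-in authentication layer, so this sketch assumes the rule is enforced in front of it (a reverse proxy or mesh ACL; which one is not decided in this spec). The role names are hypothetical.

```python
# Hypothetical access rule in front of the askari Loki; role names are illustrative.
PUSH_PATH = "/loki/api/v1/push"


def is_allowed(role: str, method: str, path: str) -> bool:
    """Hosts may only POST to the push path; operator and Grafana reach the rest."""
    if role == "host":
        return method == "POST" and path == PUSH_PATH
    return role in {"operator", "grafana"}


print(is_allowed("host", "POST", "/loki/api/v1/push"))
print(is_allowed("host", "GET", "/loki/api/v1/query_range"))
print(is_allowed("host", "POST", "/loki/api/v1/delete"))
```

A host credential therefore carries exactly one capability, which is what bounds the damage from a stolen credential to appended noise.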
Reliability: Alloy buffers (WAL) and retries, so a brief askari/mesh outage
doesn't lose logs: they flush on reconnect, at the cost of only a small local buffer.
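The buffering behaviour can be illustrated with a toy bounded queue. This is a sketch of the idea (hold, retry in order, drop oldest when full), not Alloy's WAL implementation; the class and its capacity parameter are hypothetical.

```python
from collections import deque


class ShipBuffer:
    """Toy bounded buffer: holds unsent lines and drops the oldest when full."""

    def __init__(self, capacity: int) -> None:
        self.pending = deque(maxlen=capacity)  # deque evicts from the left when full
        self.dropped = 0

    def add(self, line: str) -> None:
        if len(self.pending) == self.pending.maxlen:
            self.dropped += 1  # the append below will evict the oldest line
        self.pending.append(line)

    def flush(self, send) -> int:
        """Send pending lines in order; stop at the first failure and keep the rest."""
        sent = 0
        while self.pending:
            if not send(self.pending[0]):
                break
            self.pending.popleft()
            sent += 1
        return sent


buf = ShipBuffer(capacity=3)
for i in range(5):  # five lines arrive during an outage; only three fit
    buf.add(f"line {i}")

received = []
buf.flush(lambda line: received.append(line) or True)  # link is back
print(received, buf.dropped)
```

The `dropped` counter is the thing worth exporting: a non-zero value means an outage outlasted the buffer and some of the trail is gone.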
Security, integrity & residual risks
Defeated: opportunistic track-covering (rm/vacuum) — lines are already off the
host; host pivot to the store — an attacker rooting any cluster host can append but
not delete, and cannot reach askari's admin plane. The security trail survives full
cluster compromise.
Honest residual risks (conscious, recorded):
- Append-only ≠ cryptographic WORM: a root-on-askari attacker could edit chunk files on disk. Skipping object-lock is accepted-risk R4; mitigated by askari being minimal/hardened/operator-only/mesh-only.
- Un-shipped window: a few seconds of not-yet-flushed logs live on the host; near-real-time shipping minimises it. Accept.
- Agent compromise (forward-looking) — rooting a host lets the attacker stop that host's Alloy or inject future false logs, but cannot alter shipped history.
- Detection as a feature — a host that goes silent (Alloy stops) is an alert; the tamper attempt becomes a signal. "Log-source silence" is wired into Grafana alerting.
- Credential theft / askari outage: a stolen push credential allows appending noise, not deletion (bounded, rotatable); an askari outage buffers on hosts and flushes on reconnect (a very long outage eventually drops the oldest entries, so monitor it).
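The "log-source silence" check amounts to comparing each expected host's newest log timestamp against a threshold. A minimal sketch, assuming a host inventory and per-host last-seen timestamps are available; the function name, parameters and the 300-second gap are hypothetical, and in practice this would be a Grafana alert rule rather than Python.

```python
def silent_hosts(
    expected: list[str],
    last_seen: dict[str, float],
    now: float,
    max_gap_s: float,
) -> list[str]:
    """Return expected hosts whose newest log line is older than max_gap_s seconds.

    Hosts that have never shipped a line count as silent.
    """
    never = float("-inf")
    return sorted(h for h in expected if now - last_seen.get(h, never) > max_gap_s)


hosts = ["askari", "node1", "node2"]
seen = {"node1": 1_000.0, "askari": 1_290.0}  # node2 has never reported
print(silent_hosts(hosts, seen, now=1_400.0, max_gap_s=300.0))
```

Checking against the inventory rather than against whatever has reported matters: a host whose agent was stopped before its first line would otherwise never appear in the alert.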
ADR-002 fit: realises "logs shipped to central" + "active alerting"; the off-site + append-only model is a clean blast-radius-containment enhancement for the opportunistic threat model.
Retention, sizing & disk-wear
Sizing (estimates — intent-based until measured, like /capacity-review): a 2–5
host homelab generates ~1–3 GB/day raw "typical" (≪1 GB/day quiet; 5–15 GB/day very
chatty); Loki compresses ~7–10× → ~0.1–0.4 GB/day stored; the security subset is
~10–20% of that.
Retention (tunable in group_vars):
- Cluster Loki (all logs): bounded hot retention, start 30–90 days (~10–35 GB at 90d on NVMe).
- askari Loki (security subset): 1 year+ (~5–25 GB/yr), small enough to keep the security trail long for over-time detection.
- Defaults now; re-measure real volume after a few weeks live and tune.
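The sizing figures follow from simple arithmetic, reproduced here so the estimates can be re-run once real volume is measured. A sketch using the document's own ranges as inputs; the function names are hypothetical.

```python
def stored_gb_per_day(raw_gb_per_day: float, compression_ratio: float) -> float:
    """Stored volume after compression."""
    return raw_gb_per_day / compression_ratio


def retention_gb(stored_per_day: float, days: int, fraction: float = 1.0) -> float:
    """Disk needed to hold `fraction` of the stored stream for `days` days."""
    return stored_per_day * days * fraction


# Raw 1-3 GB/day at 7-10x compression gives the stored range.
print(round(stored_gb_per_day(1, 10), 2), round(stored_gb_per_day(3, 7), 2))

# Cluster Loki at 90 days, using the rounded 0.1-0.4 GB/day stored figure.
print(round(retention_gb(0.1, 90), 1), round(retention_gb(0.4, 90), 1))

# Security subset (10-20% of stored) kept for one year on askari.
print(round(retention_gb(0.1, 365, 0.10), 1), round(retention_gb(0.4, 365, 0.20), 1))
```

The extremes come out slightly wider than the rounded ranges quoted above, which is expected: the quoted figures are estimates, and the inputs should be replaced with measured volume after the first weeks live.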
Disk-wear (the lore is real only for specific media/misconfig; mitigated as design rules): at boma's volume even ~10–40 GB/day of amplified writes is decades of life on a ~600-TBW/TB NVMe. Rules:
- Log storage on NVMe/SSD (or HDD for a long-retention cold tier — sequential, endurance-unlimited); never SD/USB flash.
- Bounded verbosity at source (sane log levels, selective access logging, a targeted auditd ruleset): the one lever that controls wear and firehose size.
- Tuned Loki retention + compaction so neither store grows unbounded.
- SSD wearout/TBW is a monitored metric (Proxmox wearout %, node_exporter smartmon) with an alert: wear is a graph, not a surprise. (Depends on the metrics stack; see Dependencies.)
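The "decades of life" claim can be checked directly. A sketch assuming a 1 TB drive rated at roughly 600 TBW (the document's ~600-TBW/TB figure) and decimal units (1 TB = 1000 GB); the function name is hypothetical.

```python
def endurance_years(rated_tbw: float, writes_gb_per_day: float) -> float:
    """Years until the rated write endurance is consumed at a constant write rate."""
    days = rated_tbw * 1000 / writes_gb_per_day  # TB -> GB, then GB / (GB/day)
    return days / 365


print(round(endurance_years(600, 40), 1))  # heaviest amplified-write estimate
print(round(endurance_years(600, 10), 1))  # lightest amplified-write estimate
```

Even the heavy end of the estimate leaves about four decades of rated endurance, which is why wear is treated as a monitored parameter and not a blocker.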
Capacity bookkeeping ties into ADR-012: a log-storage allocation line (cluster +
askari) and SSD-wearout as a tracked metric.
Documentation & implementation changes
This is a substantial capability → its own ADR-018, with reconciliations:
| Doc / artifact | Change |
|---|---|
| ADR-018 (new) | Home of record: ship-all-to-Loki, the off-site write-only security subset, append-only model, skip-WORM (R4), disk-wear rules. |
| base role (when built) | Install + configure Alloy (all → cluster Loki; subset → askari write-only). |
| loki service role (new, when built) | One role, two deployments (cluster all-logs; askari security-subset write-only). SECURITY.md + VERIFY.md. |
| grafana service role (new, when built) | Both Lokis as datasources; dashboards + alerting (AIDE/auditd/fail2ban/Suricata + log-silence). |
| OPNsense (Ansible-managed) | Syslog-forward Suricata alerts to the ingest point. |
| ADR-002 | "Logs shipped to central" + "active alerting" bullets point to ADR-018. |
| docs/security/accepted-risks.md | Add R4: no cryptographic WORM for logs (append-only + off-site is the control). |
| docs/CAPABILITIES.md §3 | Loki → decided; add the off-site security sink + Alloy agent rows; mark the alerting wiring. |
| docs/decisions/012-hardware-capacity.md | Log-storage allocation (cluster + askari) + SSD-wearout tracked metric. |
| STATUS.md + docs/TODO.md (3.1 / 3.6) | Mark "how to manage logs" decided by ADR-018; rows as designed-not-built. |
| vault.yml | Push-only Loki credential (vault.loki.*). |
Buildable now: ADR-018 + the ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO
reconciliations. Deferred on the stack: the Alloy-in-base, loki/grafana
service roles, OPNsense syslog config, and the live pipeline.
Dependencies
- base role + service-role machinery (unbuilt): STATUS.md.
- The running cluster + askari (offsite_hosts, designed): ADR-016.
- OPNsense automation (for Suricata syslog forwarding): ADR-007.
- The metrics stack (Prometheus / node_exporter) for SSD-wearout + log-silence alerting: sibling effort, TODO 3.6.
Deferred / out of scope
- WORM / object-lock (Tier 3) — accepted-risk R4; revisit only if the threat model shifts to targeted/forensic.
- The metrics pipeline (Prometheus/node_exporter): sibling effort; this spec is logs. SSD-wearout + silence alerting depend on it.
- Cold archival beyond Loki retention (export to backups) and structured/parsed per-service log standards: future refinements.
What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on askari (no on-cluster Loki) | The firehose (tens–hundreds of GB/yr) is disk-hungry on a small VPS; keep volume where storage is cheap (on-cluster) and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only logging (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail needs to be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-size-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice; one agent for logs (and later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / askari),
ADR-007 (OPNsense / askari), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).