boma/docs/decisions/018-logging.md
sjat 175777e36a docs: reconcile 2026-06-14 review findings (O1-O7,O18,O22)
- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file
  table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope
  ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)

O9 (hosts.yml header) left open: the file is generator-owned (hook-protected);
fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:06:33 +02:00


ADR-018 — Logging and log integrity

Status

Accepted (2026-06-06). Designed. Authorable now: this ADR + the ADR-002 / CAPABILITIES / ADR-012 / accepted-risks / STATUS / TODO reconciliations. Deferred on the stack: Alloy-in-base, the loki/grafana service roles, OPNsense syslog config, the push-only credential, and the live pipeline.

Context

boma wants all logs in one queryable store for troubleshooting, spotting issues over time, and detecting intrusions / malicious activity. ADR-002 commits in principle ("logs shipped to a central location"; "active alerting wires AIDE/auditd/fail2ban/Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and askari (the off-site watchdog). Two things are undecided: the architecture, and the integrity question, since an attacker who roots a host will try to clear logs to cover their tracks.

The framing insight: the biggest anti-tampering win is that logs leave the host in near-real-time — once a line is in a store the attacker doesn't control, wiping the local copy is futile. How far to harden the central store is set by the threat model.

Decision

  1. Threat model — opportunistic + blast-radius (ADR-002 / accepted-risk R1). Not forensic-grade.
  2. All logs → an on-cluster Loki — the single monitoring DB for troubleshooting + trends. Near-real-time shipping already defeats per-host track-covering.
  3. A security-relevant subset ALSO ships off-site to askari, write-only — tamper-resistant against full-cluster compromise, at bounded volume.
  4. Skip WORM/object-lock — accepted-risk R4; append-only push + off-site is the proportionate control.
  5. Disk-wear is a managed parameter — media choice + bounded verbosity + tuned retention + wearout monitoring.

Architecture

  • Agent: Grafana Alloy on every host, installed by the base role; reads journald + container logs + security sources (auditd, authpriv, fail2ban, AIDE).
  • Loki (cluster): a loki service role on a docker_host; all logs; monolithic single-binary mode; NVMe; bounded retention.
  • Loki (askari): the same role parameterised, in offsite_hosts; security subset only, write-only, long retention, tiny volume.
  • Grafana (cluster): both Lokis as datasources (one pane queries both); dashboards + the alerting ADR-002 calls for.

Data flow & the security subset

Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage tags security sources security="true") to the askari Loki. Subset: auditd, authpriv (SSH/sudo), fail2ban, AIDE, Suricata (OPNsense isn't a base host — it syslog-forwards its alerts to the ingest point), and key container security events.
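The routing rule above can be sketched in a few lines of Python. This is an illustration of the logic only, not Alloy configuration: the source names in `SECURITY_SOURCES`, the `LogEntry` type, and the destination names `cluster-loki` / `askari-loki` are all hypothetical labels chosen for the sketch.

```python
from dataclasses import dataclass

# Sources the ADR lists as the security subset. The exact label values are
# assumptions for this sketch; real names depend on the agent configuration.
SECURITY_SOURCES = frozenset(
    {"auditd", "authpriv", "fail2ban", "aide", "suricata", "container-security"}
)


@dataclass(frozen=True)
class LogEntry:
    host: str
    source: str
    line: str


def label_entry(entry: LogEntry) -> dict:
    """Return stream labels, tagging security sources with security="true"."""
    labels = {"host": entry.host, "source": entry.source}
    if entry.source in SECURITY_SOURCES:
        labels["security"] = "true"
    return labels


def destinations(entry: LogEntry) -> list:
    """Everything goes to the cluster store; the security subset also goes off-site."""
    targets = ["cluster-loki"]
    if label_entry(entry).get("security") == "true":
        targets.append("askari-loki")
    return targets
```

With these assumptions, an `authpriv` entry is routed to both stores, while an ordinary application log line reaches only the cluster store.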

Write-only / append-only: the askari push endpoint (/loki/api/v1/push) is mesh-only with a push-only credential; query/admin/delete APIs are not exposed to hosts. The push API has no edit/delete verb, so a compromised host can append but not read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers (WAL) + retries across a brief outage.
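To make the append-only property concrete, here is a minimal sketch of the JSON body a client sends to `/loki/api/v1/push`: a list of streams, each with a label set and `[timestamp, line]` pairs, where the timestamp is a string of nanoseconds since the Unix epoch. The helper name `build_push_payload` is hypothetical, and the sketch deliberately sends nothing over the network and leaves credential handling out.

```python
import json
import time
from typing import Optional


def build_push_payload(labels: dict, lines: list, ts_ns: Optional[int] = None) -> str:
    """Build a JSON body for Loki's push endpoint from labels and log lines."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Offset each line by one nanosecond so entries keep their input order.
    values = [[str(ts_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})
```

The body can only describe new entries; it has no field that addresses an existing entry, which is why a holder of a push-only credential can add lines but has nothing to point an edit or delete at.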

Security, integrity & residual risks

Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store (append-only, off-cluster). The security trail survives full-cluster compromise. Conscious residuals: append-only ≠ cryptographic WORM (root-on-askari could edit chunks — R4); a few-seconds un-shipped window; agent compromise can stop future shipping but not alter shipped history; a host going silent is itself an alert; a stolen push credential appends noise but can't delete; an askari outage buffers + flushes on reconnect.
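The "host going silent is itself an alert" residual reduces to a simple check, sketched below. It assumes some query has already produced the timestamp of the most recent log line per host; the function name `silent_hosts` and the threshold are hypothetical, and a real deployment would more likely express this as an alert rule in the monitoring stack.

```python
from datetime import datetime, timedelta


def silent_hosts(last_seen: dict, now: datetime, max_gap: timedelta) -> list:
    """Return hosts whose newest log line is older than max_gap, sorted by name."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_gap)
```

The threshold should sit comfortably above the agent's normal buffering and retry delay, so that a brief outage does not page anyone while a stopped agent still does.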

Retention & disk-wear

Estimates are intent-based until measured (like /capacity-review). Cluster Loki: bounded hot retention (~30-90 days). askari subset: long (~1 year+, ~5-25 GB/yr). Disk-wear rules: (1) log storage on NVMe/SSD or HDD, never SD/USB flash; (2) bounded verbosity at source (sane levels, selective access logging, a targeted auditd ruleset); (3) tuned Loki retention/compaction; (4) SSD wearout/TBW is a monitored metric (Proxmox wearout %, node_exporter smartmon) with an alert. Log storage is a tracked allocation in docs/hardware/reference.md (ADR-012).
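Until real measurements exist, the sizing and wear questions come down to two pieces of arithmetic, sketched here. All inputs are placeholders to be replaced with measured values; the function names are hypothetical, and the model ignores compression, index overhead, and bursty ingest.

```python
def steady_state_gb(daily_ingest_gb: float, retention_days: int) -> float:
    """Stored volume once retention is full: ingest rate times retention window."""
    return daily_ingest_gb * retention_days


def endurance_years(
    tbw_rating_tb: float, daily_write_gb: float, write_amplification: float = 1.0
) -> float:
    """Years until a drive's rated TBW is consumed at a constant write rate."""
    tb_per_year = daily_write_gb * write_amplification * 365 / 1000
    return tbw_rating_tb / tb_per_year
```

For example, a hypothetical 2 GB/day ingest with 30-day retention settles at 60 GB on disk. The same arithmetic shows why bounded verbosity matters: halving the daily write rate doubles the estimated endurance.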

Dependencies

base role + service-role machinery (unbuilt, STATUS.md); the running cluster + askari (offsite_hosts, ADR-016); OPNsense automation for Suricata syslog (ADR-007); the metrics stack (Prometheus / node_exporter) for SSD-wearout + log-silence alerting (sibling effort, TODO 3.6).

What was ruled out

| Option | Reason |
| --- | --- |
| Everything off-site on askari (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model: YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |

See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / askari), ADR-007 (OPNsense / askari), ADR-012 (hardware/capacity), ADR-004 (service-role standard), ADR-011 (health checks — distinct from this).

Consequences

  • Opportunistic track-covering and host-pivot-to-store are defeated because logs leave the host in near-real-time and the off-cluster security trail is append-only, so it survives full-cluster compromise (Security, integrity & residual risks).
  • Conscious residuals remain: append-only is not cryptographic WORM (root-on-askari could edit chunks — R4); there is a few-seconds un-shipped window; agent compromise can stop future shipping but not alter shipped history; a stolen push credential appends noise but cannot delete; and an askari outage buffers then flushes on reconnect (Security, integrity & residual risks).
  • A host going silent is itself an alert (Security, integrity & residual risks).
  • Only a bounded security subset ships off-site — auditd, authpriv, fail2ban, AIDE, Suricata and key container security events tagged security="true" — while the cluster Loki holds everything, keeping off-site volume small (Data flow & the security subset).
  • Disk-wear is a managed parameter: log storage on NVMe/SSD or HDD, never SD/USB flash; bounded verbosity at source; tuned Loki retention/compaction; and monitored SSD wearout/TBW with an alert. Log storage is a tracked allocation in docs/hardware/reference.md (Retention & disk-wear).
  • The decision is authorable now but the live pipeline is deferred on the stack: Alloy-in-base, the loki/grafana service roles, OPNsense syslog config, and the push-only credential (Status; Dependencies).