ADR-018 — Logging and log integrity
Status
Accepted (2026-06-06). Designed. Authorable now: this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. Deferred on the stack: Alloy-in-base,
the loki/grafana service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/auditd/fail2ban/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and askari (the
off-site watchdog). Undecided: the architecture and the integrity question — an
attacker who roots a host will try to clear logs to cover their tracks.
The framing insight: the biggest anti-tampering win is that logs leave the host in near-real-time — once a line is in a store the attacker doesn't control, wiping the local copy is futile. How far to harden the central store is set by the threat model.
Decision
- Threat model — opportunistic + blast-radius (ADR-002 / accepted-risk R1). Not forensic-grade.
- All logs → an on-cluster Loki — the single monitoring DB for troubleshooting + trends. Near-real-time shipping already defeats per-host track-covering.
- A security-relevant subset ALSO ships off-site to askari, write-only: tamper-resistant against full-cluster compromise, at bounded volume.
- Skip WORM/object-lock (accepted-risk R4); append-only push + off-site is the proportionate control.
- Disk-wear is a managed parameter — media choice + bounded verbosity + tuned retention + wearout monitoring.
Architecture
- Agent: Grafana Alloy on every host, installed by the base role; reads journald + container logs + security sources (auditd, authpriv, fail2ban, AIDE).
- Loki (cluster): a loki service role on a docker_host; all logs; monolithic single-binary mode; NVMe; bounded retention.
- Loki (askari): the same role parameterised, in offsite_hosts; security subset only, write-only, long retention, tiny volume.
- Grafana (cluster): both Lokis as datasources (one pane queries both); dashboards + the alerting ADR-002 calls for.
Data flow & the security subset
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources security="true") to the askari Loki. Subset: auditd,
authpriv (SSH/sudo), fail2ban, AIDE, Suricata (OPNsense isn't a base host —
it syslog-forwards its alerts to the ingest point), and key container security events.
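
The fan-out described above can be sketched in a few lines of Python. This is only a model of the routing rule, not Alloy configuration: `LogEntry`, `route`, and the lowercase source names in `SECURITY_SOURCES` are hypothetical, and the "key container security events" are left out because the ADR does not yet say how they are identified.

```python
from dataclasses import dataclass, field

# Sources the ADR lists as the security subset. The string values are
# illustrative labels, not names taken from a real Alloy configuration.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


@dataclass
class LogEntry:
    source: str
    line: str
    labels: dict = field(default_factory=dict)


def route(entry: LogEntry) -> list:
    """Return the destinations for one entry, tagging security sources."""
    destinations = ["cluster"]            # everything goes to the cluster Loki
    if entry.source in SECURITY_SOURCES:
        entry.labels["security"] = "true"
        destinations.append("askari")     # filtered copy also goes off-site
    return destinations
```

An `auditd` entry would be tagged and sent to both stores, while an ordinary application log line would go to the cluster Loki only.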
Write-only / append-only: the askari push endpoint (/loki/api/v1/push) is
mesh-only with a push-only credential; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
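
As a rough illustration of the push-only policy, the sketch below models the allow rule and builds a request body in Loki's JSON push format (a `streams` list, each with a `stream` label map and `values` of nanosecond-timestamp/line pairs). The ADR does not say which component enforces the policy, so `is_allowed` and `build_push_body` are hypothetical helpers, and nothing is sent over the network.

```python
import json
import time

PUSH_PATH = "/loki/api/v1/push"


def is_allowed(method: str, path: str) -> bool:
    """Model of a push-only policy: only POSTs to the push path pass."""
    return method.upper() == "POST" and path == PUSH_PATH


def build_push_body(labels: dict, lines: list, now_ns=None) -> str:
    """Build a JSON body in Loki's push format for one stream."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    ts = str(now_ns if now_ns is not None else time.time_ns())
    stream = {"stream": labels, "values": [[ts, line] for line in lines]}
    return json.dumps({"streams": [stream]})
```

Under this model a host can append new lines, but a query or delete request from the same host is rejected before it reaches Loki.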
Security, integrity & residual risks
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-askari could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop future
shipping but not alter shipped history; a host going silent is itself an alert; a
stolen push credential appends noise but can't delete; an askari outage buffers +
flushes on reconnect.
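
The "host going silent is itself an alert" rule could be checked along these lines. This is a minimal sketch: the `last_seen` mapping (host name to the Unix time of its newest log line) and the 300-second default gap are assumptions, not values from this ADR, and in practice the check would live in the alerting stack rather than in a script.

```python
def silent_hosts(last_seen: dict, now: float, max_gap_s: float = 300.0) -> list:
    """Return hosts, sorted, whose newest log line is older than max_gap_s."""
    return sorted(
        host for host, ts in last_seen.items() if now - ts > max_gap_s
    )
```

The gap needs to sit comfortably above the normal shipping delay and Alloy's retry window, otherwise a brief outage would be reported as a silent host.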
Retention & disk-wear
Estimates are intent-based until measured (like /capacity-review). Cluster Loki:
bounded hot retention (~30–90 days). askari subset: long (~1 year+, ~5–25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, never SD/USB flash; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted auditd
ruleset); (3) tuned Loki retention/compaction; (4) SSD wearout/TBW is a monitored
metric (Proxmox wearout %, node_exporter smartmon) with an alert. Log storage is a
tracked allocation in docs/hardware/reference.md (ADR-012).
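
To sanity-check the off-site estimate once real ingest numbers exist, the arithmetic can be sketched as below. The ~5–25 GB/yr range works out to roughly 14 to 68 MB per day on average; `yearly_gb` and `within_budget` are hypothetical helpers that use decimal units (1 GB = 1000 MB).

```python
def yearly_gb(mb_per_day: float) -> float:
    """Convert an average daily ingest in MB to GB per year."""
    return mb_per_day * 365 / 1000


def within_budget(mb_per_day: float, low_gb: float = 5.0,
                  high_gb: float = 25.0) -> bool:
    """True if the yearly volume falls inside the intent-based estimate."""
    return low_gb <= yearly_gb(mb_per_day) <= high_gb
```

A measured rate outside the range does not by itself mean something is wrong, but it is a prompt to revisit either the subset filter or the estimate.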
Dependencies
base role + service-role machinery (unbuilt, STATUS.md); the running cluster +
askari (offsite_hosts, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / node_exporter) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on askari (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
Consequences
- Opportunistic track-covering and host-pivot-to-store are defeated because logs leave the host in near-real-time and the off-cluster security trail is append-only, so it survives full-cluster compromise (Security, integrity & residual risks).
- Conscious residuals remain: append-only is not cryptographic WORM (root-on-askari could edit chunks, R4); there is a few-seconds un-shipped window; agent compromise can stop future shipping but not alter shipped history; a stolen push credential appends noise but cannot delete; and an askari outage buffers then flushes on reconnect (Security, integrity & residual risks).
- A host going silent is itself an alert (Security, integrity & residual risks).
- Only a bounded security subset ships off-site: auditd, authpriv, fail2ban, AIDE, Suricata and key container security events tagged security="true". The cluster Loki holds everything, which keeps off-site volume small (Data flow & the security subset).
- Disk-wear is a managed parameter: log storage on NVMe/SSD or HDD, never SD/USB flash; bounded verbosity at source; tuned Loki retention/compaction; and monitored SSD wearout/TBW with an alert. Log storage is a tracked allocation in docs/hardware/reference.md (Retention & disk-wear).
- The decision is authorable now but the live pipeline is deferred on the stack: Alloy-in-base, the loki/grafana service roles, OPNsense syslog config, and the push-only credential (Status; Dependencies).
Related
ADR-002 (security baseline — realised here), ADR-016 (mesh / askari),
ADR-007 (OPNsense / askari), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).