boma/docs/decisions/018-logging.md
# ADR-018 — Logging and log integrity
## Status
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.
The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.
## Decision
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only**, so it is
tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
retention + wearout monitoring.
## Architecture
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
+ the alerting ADR-002 calls for.
## Data flow & the security subset
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
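
The tagging and fan-out described above can be sketched in a few lines. This is an illustration of the routing logic only, not the Alloy configuration, which is still deferred. The `source` label name, the lowercase source identifiers, and the helper names are assumptions made for the example; key container security events are not modelled. The JSON body follows the stream/values shape that Loki's `/loki/api/v1/push` endpoint accepts.

```python
import json
import time

# Sources the ADR lists as the security subset. The lowercase identifiers
# and the "source" label name are assumptions for this sketch.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


def tag_entry(labels):
    """Return a copy of the labels with security="true" added for subset sources."""
    tagged = dict(labels)
    if tagged.get("source") in SECURITY_SOURCES:
        tagged["security"] = "true"
    return tagged


def route(labels):
    """Every entry goes to the cluster Loki; tagged entries also go to askari."""
    targets = ["cluster"]
    if labels.get("security") == "true":
        targets.append("askari")
    return targets


def push_body(labels, line, ts_ns=None):
    """Build a JSON body for Loki's push endpoint (one stream, one log line)."""
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    return json.dumps({"streams": [{"stream": labels, "values": [[str(ts_ns), line]]}]})
```

Note that the body carries only appended lines. Nothing in it can address or modify an earlier entry, which is the property the append-only argument relies on.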
## Security, integrity & residual risks
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.
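
The "host going silent" residual implies a simple check: compare each host's newest log timestamp against a silence budget. The sketch below assumes the last-seen timestamps are already available as a mapping (for example from a Loki query). The function name and the 10-minute default are hypothetical and are not values set by this ADR.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return the hosts whose newest log line is older than max_silence, sorted by name."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_silence)


# Usage with hypothetical host names and timestamps.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
seen = {
    "host-a": now - timedelta(minutes=2),   # still shipping
    "host-b": now - timedelta(minutes=45),  # silent long enough to alert
}
overdue = silent_hosts(seen, now)
```

A host that is deliberately powered off would also appear here, so a real alert rule would probably need an allow-list or a maintenance flag.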
## Retention & disk-wear
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30-90 days). `askari` subset: long (~1 year+, ~5-25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
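
Until real measurements exist, the sizing arithmetic behind these estimates can be kept explicit. The sketch below is back-of-envelope only: the ingest rate, the rated TBW, and the daily write volume are placeholder inputs rather than measured boma figures, and the endurance estimate ignores write amplification and compaction rewrites.

```python
def stored_gb(ingest_gb_per_day, retention_days):
    """Approximate on-disk volume once retention is full (ignores compression)."""
    return ingest_gb_per_day * retention_days


def years_until_tbw(rated_tbw_tb, writes_gb_per_day):
    """Years until the drive's rated terabytes-written budget is used up."""
    if writes_gb_per_day <= 0:
        raise ValueError("writes_gb_per_day must be positive")
    days = rated_tbw_tb * 1000 / writes_gb_per_day  # 1 TB = 1000 GB
    return days / 365


# Placeholder inputs: 0.5 GB/day ingest kept for 90 days,
# and a 600 TBW drive absorbing 10 GB/day of writes in total.
hot_store_gb = stored_gb(0.5, 90)
endurance_years = years_until_tbw(600, 10)
```

Replacing the placeholders with observed ingest and SMART write counters turns the same two functions into a check against the allocation tracked in `docs/hardware/reference.md`.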
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
## Consequences
- Opportunistic track-covering and host-pivot-to-store are defeated because logs leave
the host in near-real-time and the off-cluster security trail is append-only, so it
survives full-cluster compromise (Security, integrity & residual risks).
- Conscious residuals remain: append-only is not cryptographic WORM (root-on-`askari`
could edit chunks — R4); there is a few-seconds un-shipped window; agent compromise
can stop future shipping but not alter shipped history; a stolen push credential
appends noise but cannot delete; and an `askari` outage buffers then flushes on
reconnect (Security, integrity & residual risks).
- A host going silent is itself an alert (Security, integrity & residual risks).
- Only a bounded security subset ships off-site — `auditd`, `authpriv`, `fail2ban`,
AIDE, Suricata and key container security events tagged `security="true"` — while the
cluster Loki holds everything, keeping off-site volume small (Data flow & the security
subset).
- Disk-wear is a managed parameter: log storage on NVMe/SSD or HDD, never SD/USB flash,
bounded verbosity at source, tuned Loki retention/compaction, and monitored SSD
wearout/TBW with an alert; log storage is a tracked allocation in
`docs/hardware/reference.md` (Retention & disk-wear).
- The decision is authorable now but the live pipeline is deferred on the stack:
Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog config, and the
push-only credential (Status; Dependencies).
## Related
ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).