Add design spec for logging + log integrity (ship all to Loki)

All logs -> on-cluster Loki for troubleshooting/trends; a security-relevant subset also
ships write-only off-site to askari (append-only, tamper-resistant against full-cluster
compromise); skip WORM (accepted-risk R4). Alloy agent in base; loki/grafana service
roles; disk-wear handled as a design parameter. Basis for ADR-018.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 568729e7bd
commit 8eb5ccf97d
1 changed file with 212 additions and 0 deletions
@@ -0,0 +1,212 @@
# Design: Logging and log integrity (ship all logs to Loki)

- **Date:** 2026-06-05
- **Status:** Approved design, pending implementation plan
- **Resolves:** TODO 3.1 ("Decide how to manage logs"); makes concrete ADR-002's
  "logs shipped to a central location" + "active alerting" controls; advances TODO 3.6
- **Becomes:** ADR-018 (this design is the basis for that ADR)

---

## Problem

boma wants **all logs in one queryable store** for three things: day-to-day
troubleshooting, spotting issues/trends over time, and **detecting intrusions /
malicious activity**. ADR-002 already commits in principle ("`auditd`… Logs shipped to
a central location if a log aggregation service is available"; "Active alerting wires
AIDE/`auditd`/`fail2ban`/Suricata into the monitoring/alerting stack… ties to the
Loki/Grafana effort"), and CAPABILITIES lists Loki (planned) + `askari` as the off-site
watchdog. What's undecided is the **architecture** and, critically, the **integrity**
dimension: an attacker who roots a host will try to clear logs to cover their tracks.

The key insight that frames the integrity question: **the biggest anti-tampering win is
that logs leave the host in near-real-time.** Once a line is in a store the attacker
doesn't control, wiping the local copy is futile. The remaining question is only *how
far* to harden the central store, and the threat model sets that.

## Decisions (the settled forks)

1. **Threat model: opportunistic + blast-radius**, per ADR-002 / accepted-risk R1.
   Not forensic-grade. This sizes everything below.
2. **Ship all logs to an on-cluster Loki**, the single monitoring DB for
   troubleshooting + trends. Near-real-time shipping already defeats per-host
   track-covering.
3. **Split: a security-relevant subset ALSO ships off-site to `askari`, write-only.**
   Tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock (Tier 3)**: recorded as accepted-risk R4; append-only
   push + off-site is the proportionate control.
5. **Disk-wear is a managed design parameter, not a blocker**: storage media choice +
   bounded verbosity + tuned retention + wearout monitoring (see "Retention, sizing &
   disk-wear").

## Architecture & components

**Agent: Grafana Alloy on every host, installed by the `base` role.** Alloy reads
journald + container logs + the security sources (`auditd`, `authpriv`, `fail2ban`,
AIDE) on every host (docker_hosts, proxmox nodes, `ubongo`, `askari`) and ships them.
Placing it in `base` ties it to ADR-002's baseline "logs shipped to central" control.

**Two Loki instances, one Grafana:**

```
┌──────────────────── per host (base role) ─────────────────────┐
│ Grafana Alloy: collect journald + container + auditd/auth/... │
└──────────┬───────────────────────────────────┬────────────────┘
  ALL logs │                   security subset │ (over the NetBird mesh)
           ▼                                   ▼
┌────────────────────────┐        ┌──────────────────────────────┐
│ Loki (cluster) all logs│        │ Loki (askari) security only  │
│ docker_host, NVMe,     │        │ off-site, write-only push,   │
│ bounded hot retention  │        │ long retention, append-only  │
└───────────┬────────────┘        └──────────────┬───────────────┘
            └───────────────┬────────────────────┘
                            ▼
          ┌─────────────────────────────────────┐
          │ Grafana (cluster): both datasources │
          │ dashboards + alerts (AIDE/auditd/   │
          │ fail2ban/Suricata + log-silence)    │
          └─────────────────────────────────────┘
```

- **Loki (cluster):** `loki` service role on a docker_host; **all** logs; monolithic
  single-binary mode (ample at this scale); NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, deployed to the `offsite_hosts`
  group; **security subset only**, **write-only**, long retention, tiny volume.
- **Grafana:** `grafana` service role on the cluster; both Lokis as datasources (one
  pane queries both); this is where ADR-002's "active alerting" lands.

Reuses what boma already has: `askari` (off-site, on the mesh per ADR-016) and the
`base`/service-role machinery.

## Data flow & the security subset

Each host's Alloy pipeline writes **everything** to the cluster Loki and a **filtered
copy** of security events to the `askari` Loki: a relabel/match stage tags security
sources (`security="true"`) and routes only those to the second `loki.write` target.
One agent, two destinations.

**Security subset** (high-value, bounded volume): `auditd` (auth, privilege, file
watches), `authpriv` (SSH, `sudo`), `fail2ban` (bans), AIDE (file-integrity reports),
**Suricata** (OPNsense isn't a `base` host, so it **syslog-forwards** alerts to the
ingest point), and key container security events (reverse-proxy 401/403, Authentik
login events, Docker daemon events).
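
The tag-then-route rule can be sketched in a few lines of Python. This only illustrates the decision logic and is not the Alloy configuration itself: the `LogEntry` type, the `SECURITY_SOURCES` set and the destination names `"cluster"` / `"askari"` are hypothetical, chosen to mirror the subset listed above, and the container security events are left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical source names mirroring the security subset above. The real
# selectors would live in the Alloy config, which this spec does not define.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


@dataclass(frozen=True)
class LogEntry:
    host: str
    source: str  # e.g. "auditd", "nginx", "journald"
    line: str


def label(entry: LogEntry) -> dict:
    """Attach security="true" to entries that come from a security source."""
    labels = {"host": entry.host, "source": entry.source}
    if entry.source in SECURITY_SOURCES:
        labels["security"] = "true"
    return labels


def destinations(labels: dict) -> list:
    """Every entry goes to the cluster Loki; tagged entries also go to askari."""
    targets = ["cluster"]
    if labels.get("security") == "true":
        targets.append("askari")
    return targets
```

Under these assumptions an `auditd` line is routed to both stores, while an ordinary application line only reaches the cluster Loki.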

**Write-only / append-only** (the tamper-resistance mechanism):
- The `askari` Loki push endpoint (`/loki/api/v1/push`) is reachable only over the
  **NetBird mesh**, with a **push-only credential**; hosts hold *only* that.
- Loki's query/admin/delete APIs on `askari` are **not exposed to hosts** (localhost /
  mesh-ACL'd to operator + Grafana). The push API has no edit/delete verb, so a
  compromised host can **append but not read/edit/delete**. Deletion needs the
  admin/compactor API or filesystem, both unreachable from a host.
- The cluster Loki uses the same push-only credential, blocking per-host log-clearing
  via API there too.
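
To make "no edit/delete verb" concrete, here is a minimal sketch of the JSON body a client sends to `/loki/api/v1/push`: a list of streams, each with a label set and `[timestamp_ns, line]` pairs. The helper name `build_push_body` and the example labels are illustrative assumptions; Alloy builds the equivalent request itself, and how the push-only credential is enforced in front of Loki is outside this sketch.

```python
import json
import time
from typing import Optional

PUSH_PATH = "/loki/api/v1/push"  # the only askari endpoint hosts can reach


def build_push_body(labels: dict, lines: list, now_ns: Optional[int] = None) -> str:
    """Build a Loki push body: one stream, one [timestamp_ns, line] pair per line.

    The payload can only add entries. It has no field that addresses, edits
    or deletes an entry that is already stored.
    """
    ts = time.time_ns() if now_ns is None else now_ns
    # Offset each line by 1 ns so entries in the batch keep their order.
    values = [[str(ts + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})
```

A host holding only the push credential can submit bodies like this and nothing else, which is what bounds a compromised host to appending.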

**Reliability:** Alloy buffers (WAL) and retries, so a brief `askari`/mesh outage
doesn't lose logs; they flush on reconnect with only a small local buffer.
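
The buffering behaviour, including the drop-oldest case noted under residual risks, can be illustrated with a bounded queue. This is a sketch of the behaviour only, not Alloy's WAL implementation; the `BoundedBuffer` class and its capacity are hypothetical.

```python
from collections import deque


class BoundedBuffer:
    """Keep at most `capacity` unsent lines; when full, evict the oldest."""

    def __init__(self, capacity: int) -> None:
        self._queue = deque(maxlen=capacity)
        self.dropped = 0  # count of lines lost to overflow, worth monitoring

    def append(self, line: str) -> None:
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1  # the oldest entry is about to be evicted
        self._queue.append(line)

    def flush(self, send) -> int:
        """Send oldest-first; stop at the first failure and keep the rest."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent
```

A brief outage fits inside the buffer and flushes in order on reconnect. An outage long enough to overflow it shows up as a non-zero `dropped` count, which is the number to alert on.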

## Security, integrity & residual risks

**Defeated:** opportunistic track-covering (`rm`/`vacuum`), because lines are already
off the host; and **host pivot to the store**, because an attacker rooting any cluster
host can append but not delete, and cannot reach `askari`'s admin plane. **The security
trail survives full cluster compromise.**

**Honest residual risks (conscious, recorded):**
1. **Append-only ≠ cryptographic WORM:** a root-on-`askari` attacker could edit chunk
   files on disk. Skipping object-lock is **accepted-risk R4**; mitigated by `askari`
   being minimal/hardened/operator-only/mesh-only.
2. **Un-shipped window:** a few seconds of not-yet-flushed logs live on the host;
   near-real-time shipping minimises it. Accepted.
3. **Agent compromise (forward-looking):** rooting a host lets the attacker stop *that
   host's* Alloy or inject *future* false logs, but they **cannot alter shipped
   history**.
4. **Detection as a feature:** a host that **goes silent** (Alloy stops) is an
   **alert**; the tamper attempt becomes a signal. "Log-source silence" is wired into
   Grafana alerting.
5. **Credential theft / `askari` outage:** a stolen push credential allows appending
   noise, not deletion (bounded, rotatable); an `askari` outage buffers on hosts and
   flushes on reconnect (a very long outage eventually drops the oldest entries, so
   monitor it).
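
The "log-source silence" rule in item 4 reduces to comparing each host's last-seen timestamp against a threshold. The sketch below shows that logic in Python; in the design the rule lives in Grafana alerting, and the function name, the host names and the 10-minute threshold in the usage example are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(last_seen: dict, now: datetime, threshold: timedelta) -> list:
    """Return hosts whose most recent log line is older than `threshold`."""
    return sorted(host for host, ts in last_seen.items() if now - ts > threshold)


# Usage example with made-up hosts and a hypothetical 10-minute threshold.
now = datetime(2026, 6, 5, 12, 0, tzinfo=timezone.utc)
seen = {
    "docker1": now - timedelta(minutes=1),
    "ubongo": now - timedelta(minutes=45),
}
alerts = silent_hosts(seen, now, timedelta(minutes=10))
```

The threshold has to sit above the quietest host's normal gap between log lines, otherwise idle hosts alert constantly.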

**ADR-002 fit:** realises "logs shipped to central" + "active alerting"; the off-site +
append-only model is a clean blast-radius-containment enhancement for the opportunistic
threat model.

## Retention, sizing & disk-wear

**Sizing (estimates, intent-based until measured, like `/capacity-review`):** a 2–5
host homelab generates ~1–3 GB/day raw "typical" (≪1 GB/day quiet; 5–15 GB/day very
chatty); Loki compresses ~7–10× → ~0.1–0.4 GB/day stored; the security subset is
~10–20% of that.

**Retention (tunable in `group_vars`):**
- **Cluster Loki (all logs):** bounded hot retention, start **30–90 days** (~10–35 GB
  at 90d on NVMe).
- **`askari` Loki (security subset):** **1 year+** (~5–25 GB/yr), small enough to keep
  the security trail long for over-time detection.
- Defaults now; **re-measure real volume after a few weeks live** and tune.
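
The sizing figures above follow from simple arithmetic, sketched here so the estimate can be re-run once real volumes are measured. The function names are illustrative. The inputs are the range endpoints quoted in this section, so the outputs land close to, but not exactly on, the rounded figures in the text.

```python
def stored_gb_per_day(raw_gb_per_day: float, compression_ratio: float) -> float:
    """Raw daily volume divided by the compression ratio Loki achieves."""
    return raw_gb_per_day / compression_ratio


def retained_gb(stored_per_day: float, retention_days: int) -> float:
    """Steady-state store size for a given retention window."""
    return stored_per_day * retention_days


# "Typical" range from the text: 1-3 GB/day raw, 7-10x compression.
low = stored_gb_per_day(1.0, 10.0)   # best case: least raw, best compression
high = stored_gb_per_day(3.0, 7.0)   # worst case: most raw, weakest compression

cluster_90d = (retained_gb(low, 90), retained_gb(high, 90))

# Security subset: 10-20% of the stored volume, kept for one year.
askari_1y = (retained_gb(low * 0.10, 365), retained_gb(high * 0.20, 365))
```

With these endpoints the cluster store comes out at roughly 9 to 39 GB for 90 days and the `askari` store at roughly 4 to 31 GB per year, the same order of magnitude as the rounded estimates above.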

**Disk-wear (the lore is real only for specific media/misconfig; mitigated as design
rules):** at boma's volume even ~10–40 GB/day of amplified writes is decades of life on
an NVMe rated ~600 TBW per TB of capacity. Rules:
1. Log storage on **NVMe/SSD** (or **HDD** for a long-retention cold tier: sequential
   writes, no write-endurance limit); **never SD/USB flash**.
2. **Bounded verbosity at source** (sane log levels, selective access logging, a
   *targeted* `auditd` ruleset), the one lever that controls wear *and* firehose size.
3. Tuned Loki **retention + compaction** so neither store grows unbounded.
4. **SSD wearout/TBW is a monitored metric** (Proxmox wearout %, `node_exporter`
   smartmon) with an alert, so wear is a graph, not a surprise. (Depends on the metrics
   stack; see Dependencies.)
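
The "decades of life" claim can be checked with a one-line endurance calculation. The sketch assumes a 1 TB drive rated at 600 TBW and decimal units (1 TB = 1000 GB); the function name is illustrative.

```python
def endurance_years(rated_tbw: float, writes_gb_per_day: float) -> float:
    """Years until the rated terabytes-written budget is used up."""
    days = (rated_tbw * 1000.0) / writes_gb_per_day  # TB -> GB (decimal)
    return days / 365.0


# Assumed drive: 1 TB NVMe at ~600 TBW, with the amplified write range from the text.
worst_case = endurance_years(600.0, 40.0)  # about 41 years
best_case = endurance_years(600.0, 10.0)   # about 164 years
```

Even the 40 GB/day case leaves about four decades of rated endurance, which is why wear is treated as a monitored parameter and not a blocker.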

Capacity bookkeeping ties into ADR-012: a log-storage allocation line (cluster +
`askari`) and SSD-wearout as a tracked metric.

## Documentation & implementation changes

This is a substantial capability → its own ADR-018, with reconciliations:

| Doc / artifact | Change |
|---|---|
| ADR-018 (new) | Home of record: ship-all-to-Loki, the off-site write-only security subset, append-only model, skip-WORM (R4), disk-wear rules. |
| `base` role (when built) | Install + configure Alloy (all → cluster Loki; subset → `askari` write-only). |
| `loki` service role (new, when built) | One role, two deployments (cluster all-logs; `askari` security-subset write-only). `SECURITY.md` + `VERIFY.md`. |
| `grafana` service role (new, when built) | Both Lokis as datasources; dashboards + alerting (AIDE/`auditd`/`fail2ban`/Suricata + log-silence). |
| OPNsense (Ansible-managed) | Syslog-forward Suricata alerts to the ingest point. |
| ADR-002 | "Logs shipped to central" + "active alerting" bullets point to ADR-018. |
| `docs/security/accepted-risks.md` | Add **R4**: no cryptographic WORM for logs (append-only + off-site is the control). |
| `docs/CAPABILITIES.md` §3 | Loki → decided; add the off-site security sink + Alloy agent rows; mark the alerting wiring. |
| `docs/decisions/012-hardware-capacity.md` | Log-storage allocation (cluster + `askari`) + SSD-wearout tracked metric. |
| `STATUS.md` + `docs/TODO.md` (3.1 / 3.6) | Mark "how to manage logs" decided by ADR-018; rows as designed-not-built. |
| `vault.yml` | Push-only Loki credential (`vault.loki.*`). |

**Buildable now:** ADR-018 + the ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO
reconciliations. **Deferred until the stack exists:** Alloy in `base`, the
`loki`/`grafana` service roles, OPNsense syslog config, and the live pipeline.

## Dependencies

- `base` role + service-role machinery (unbuilt): see STATUS.md.
- The running cluster + `askari` (`offsite_hosts`, designed): see ADR-016.
- OPNsense automation (for Suricata syslog forwarding): see ADR-007.
- The **metrics stack** (Prometheus / `node_exporter`) for SSD-wearout + log-silence
  alerting: sibling effort, TODO 3.6.

## Deferred / out of scope

1. **WORM / object-lock (Tier 3):** accepted-risk R4; revisit only if the threat model
   shifts to targeted/forensic.
2. **The metrics pipeline** (Prometheus/`node_exporter`): sibling effort; this spec
   covers **logs**. SSD-wearout + silence alerting depend on it.
3. **Cold archival beyond Loki retention** (export to backups) and **structured/parsed
   per-service log standards**: future refinements.

## What was ruled out

| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose (tens–hundreds of GB/yr) is disk-hungry on a small VPS; keep volume where storage is cheap (on-cluster) and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model; YAGNI (R4). |
| On-cluster-only logging (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail needs to be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-size-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice; one agent for logs (and later metrics). |

See also: ADR-002 (security baseline, realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks, distinct from this).