Compare commits: 568729e7bd...9bdb3017bb (10 commits)

| SHA1 |
|---|
| 9bdb3017bb |
| 12baeba750 |
| 1021c6d25d |
| c6aa45037d |
| 687d623a52 |
| 6f68f8b8c5 |
| 30c6a93c28 |
| 2894319f01 |
| 96f8f20c05 |
| 8eb5ccf97d |

10 changed files with 816 additions and 9 deletions

@@ -214,6 +214,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
 | Update management | `docs/decisions/011-update-management.md` |
 | Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
+| Logging & log integrity | `docs/decisions/018-logging.md` |
 | Adding a new role | `docs/runbooks/new-role.md` |
 | Adding a new host | `docs/runbooks/new-host.md` |
 | Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |

@@ -56,6 +56,8 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
 | NetBird mesh — coordinator on `askari` | ADR-016 | **Design RESOLVED** (ADR-016 + spec + plan); resolves ADR-015 deferred #1. Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. **Build pending:** not deployed (askari + service-role machinery not built). |
 | NetBird agent enrollment in `base` | ADR-016 | **Design RESOLVED** (ADR-016). Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. **Build pending:** base role not built. |
 | Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
+| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
+| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |

 ## Keeping this honest

@@ -43,8 +43,9 @@ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not cont
 | Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
 |---|---|---|---|---|---|
 | Metrics | Prometheus | P | planned | Time-series metrics + alert rules | TODO 3.6 |
-| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
-| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
+| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
+| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
+| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
 | Uptime checks | Uptime Kuma | P | planned | Endpoint up/down checks | TODO 3.6 |
 | External watchdog | askari (Hetzner VPS) | P | core | Off-site monitoring that survives a homelab outage | ADR-007 |
 | Notify / alerting | ntfy · Matrix · email (multi-channel) | S | planned | Deliver alerts to the user across channels | TODO 9; Matrix homeserver in §8 |

10
docs/TODO.md
@@ -15,15 +15,19 @@
 `/verify-service` report.

 3. **Building services**
-   1. Decide how to manage logs.
+   1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
+      Grafana Alloy (in `base`); a security subset also ships write-only off-site to
+      `askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
    2. Decide how to manage APIs / API access.
    3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
       translate-don't-transplant — V4 is a source only of gotchas + working config
       snippets, re-derived on boma's terms; never structure/requirements/values.
    4. Decide what each node runs — base packages plus which apps/services.
    5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
-   6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
-      Kuma alerts on askari.
+   6. Wire up the monitoring stack. Logging topology DECIDED (ADR-018): cluster Loki
+      (all logs) + off-site security subset on `askari` + Grafana on-cluster (not the
+      whole stack on `askari`). Still to design/build: Prometheus + metric exporters,
+      Uptime Kuma, and exactly which alerts live where.
    7. Define a tagging standard that lets us target runs without over-tagging.
    8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
    9. Decide: a central database server, or individual database services per app?

@@ -87,7 +87,9 @@ time. Each heading tags the threat(s) it primarily serves.
 ### Audit trail — *agent error, blast radius*

 - `auditd` installed and running with a baseline ruleset
-- Logs shipped to a central location if a log aggregation service is available
+- Logs shipped to a central location in near-real-time — all logs to an on-cluster
+  Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
+  trail survives host (and full-cluster) compromise (ADR-018)

 ### Mandatory access control — *blast radius*

@@ -102,8 +104,9 @@ time. Each heading tags the threat(s) it primarily serves.
 - **AIDE** file-integrity monitoring (required by the CIS Debian benchmark) — detects
   unexpected changes to system files
 - **Network IDS** — Suricata on OPNsense (planned; see STATUS.md / TODO)
-- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
-  monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
+- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
+  log-source-silence (a host that stops shipping) — into Grafana alerting on the
+  Loki/Grafana stack (ADR-018; planned)

 ## Secrets management — *agent error, opportunistic*

@@ -36,5 +36,9 @@ workload that should move, or a node due an upgrade.
 - Right-sizing advice is intent-based until usage data exists; reports say so.
 - `reference.md` table headers are a parser contract — changing them needs a
   matching `capacity-scan.py` change.
+- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
+  budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
+  **wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
+  not assumed.

 See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).

99
docs/decisions/018-logging.md
Normal file
@@ -0,0 +1,99 @@
# ADR-018 — Logging and log integrity

## Context

boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.

The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.

## Decision

1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
   forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
   trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only** —
   tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
   proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
   retention + wearout monitoring.

## Architecture

- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
  + container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
  single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
  only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
  + the alerting ADR-002 calls for.

## Data flow & the security subset

Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.

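The split described above can be sketched as a small routing function. This is an illustration of the tagging logic in Python, not Alloy configuration: the entries in `SECURITY_SOURCES` mirror the subset listed above, while `route`, the `source` label key, and the sink names `cluster` and `askari` are hypothetical. "Key container security events" are left out because the ADR does not define them yet.

```python
# Hypothetical sketch of the security-subset split. In the real pipeline this
# decision would live in Alloy's relabel/match stage, not in Python.
SECURITY_SOURCES = frozenset({"auditd", "authpriv", "fail2ban", "aide", "suricata"})


def route(labels: dict) -> tuple:
    """Return (labels, sinks): the possibly tagged labels and where the entry goes."""
    tagged = dict(labels)   # copy so the caller's labels are not mutated
    sinks = ["cluster"]     # every entry goes to the on-cluster Loki
    if tagged.get("source", "").lower() in SECURITY_SOURCES:
        tagged["security"] = "true"
        sinks.append("askari")  # the filtered copy also goes off-site
    return tagged, sinks
```

An `auditd` entry is tagged and sent to both sinks; anything else goes to the cluster Loki only.
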
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.

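To make the append-only property concrete, here is a minimal sketch of the JSON body a client sends to `/loki/api/v1/push`. The body shape (a `streams` list of label sets with `[timestamp-in-nanoseconds, line]` pairs) is Loki's push format; the helper name `build_push_body` and the example labels are assumptions, and no network call or credential handling is shown.

```python
import json
import time


def build_push_body(labels, lines, now_ns=None):
    """Build the JSON body for POST /loki/api/v1/push (illustrative helper)."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    ts = str(now_ns if now_ns is not None else time.time_ns())
    payload = {
        "streams": [
            {
                "stream": dict(labels),                      # label set of the stream
                "values": [[ts, line] for line in lines],    # new entries only
            }
        ]
    }
    return json.dumps(payload)
```

The body can only carry new entries; nothing in it addresses or rewrites lines that are already stored, which is why a push-only credential limits a compromised host to appending.
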
## Security, integrity & residual risks

Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.

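The "host going silent" residual can be sketched as a simple check over last-seen timestamps. This is only the logic, assuming something else (for example a Grafana alert rule) supplies the newest log timestamp per host; the function name `silent_hosts` and the 10-minute default are hypothetical, not values from this ADR.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(last_seen, now, max_gap=timedelta(minutes=10)):
    """Return the hosts whose newest log line is older than max_gap, sorted by name."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_gap)
```

A host is flagged only once its gap exceeds the threshold, so the threshold needs to sit above the normal quiet period of the least chatty host.
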
## Retention & disk-wear

Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).

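The retention figures above translate into storage budgets with simple arithmetic. The sketch below is a back-of-envelope helper, not a measurement: the function names are hypothetical, compression and index overhead are ignored, and the 500 MB/day cluster ingest rate used in the note after the code is a made-up placeholder until real usage data exists.

```python
def stored_gb(daily_mb, retention_days):
    """Steady-state volume in GB for a given daily ingest and retention window."""
    return daily_mb * retention_days / 1000


def daily_mb_for_yearly(gb_per_year):
    """Average daily ingest in MB implied by a yearly volume in GB."""
    return gb_per_year * 1000 / 365
```

Under these assumptions, the ~5–25 GB/yr estimate for the `askari` subset corresponds to roughly 14–68 MB/day, and a placeholder 500 MB/day on the cluster Loki would need about 15 GB at 30 days of retention or 45 GB at 90 days.
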
## Status

Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.

## Dependencies

`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).

## What was ruled out

| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |

See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).

@@ -16,8 +16,9 @@ revisit (trigger).
 | R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
 | R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
 | R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
+| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering their tracks), and the security trail survives even full-cluster compromise. True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |

-_Last reviewed: 2026-06-05. The prior gaps (full CIS hardening, SELinux/AppArmor,
+_Last reviewed: 2026-06-06. The prior gaps (full CIS hardening, SELinux/AppArmor,
 IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
 Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
 part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build

480
docs/superpowers/plans/2026-06-06-logging-log-integrity.md
Normal file
|
|
@@ -0,0 +1,480 @@
|
|||
# Logging & Log Integrity Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Record the logging architecture (all logs → on-cluster Loki; a security subset also write-only off-site to `askari`) by authoring ADR-018 and reconciling every doc that touches logging/observability.
|
||||
|
||||
**Architecture:** Documentation-only. The runtime pieces — Alloy in the `base` role, the `loki`/`grafana` service roles, OPNsense syslog forwarding — wait on the `base` + service-role machinery STATUS.md lists as not-yet-built. This plan settles the decision and the doc reconciliation.
|
||||
|
||||
**Tech Stack:** Markdown. Verification is the repo's pre-commit hooks + a final cross-reference sweep. No markdown linter, so "tests" are hook-pass + grep checks.
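
The final cross-reference sweep could look something like the following. This is a sketch only: it assumes the sweep just needs to confirm that each reconciled document mentions ADR-018 at least once, and the function name `missing_refs` and the in-memory `docs` mapping (file name to text) are hypothetical stand-ins for reading the real files.

```python
import re

# Matches references such as "ADR-018" and captures the three-digit number.
ADR_REF = re.compile(r"\bADR-(\d{3})\b")


def missing_refs(docs, adr="018"):
    """Return the names of documents that never mention the given ADR, sorted."""
    return sorted(
        name for name, text in docs.items() if adr not in ADR_REF.findall(text)
    )
```

An empty result means every listed file links back to the ADR; any name returned is a file the reconciliation missed.
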
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight (read once)
|
||||
|
||||
- **`rbw` must be unlocked before every commit** (pre-commit ansible-lint decrypts `vault.yml`). Run `rbw unlocked`; if it exits non-zero, stop and ask the user to run `rbw unlock`.
|
||||
- **Commit style:** one commit per task, imperative subject ≤72 chars.
|
||||
- **Order:** Task 1 (ADR-018) first — later tasks link to it.
|
||||
- **Spec:** `docs/superpowers/specs/2026-06-05-logging-log-integrity-design.md`.
|
||||
- **Branch:** controller creates `chore/logging-log-integrity-docs` off `main` before Task 1; do not implement on `main`.
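
The `rbw` pre-flight check above can be wrapped in a small helper for scripted use. This is a sketch under assumptions: it relies only on the exit status of `rbw unlocked` as described in the list above, the function name `vault_unlocked` is hypothetical, and the `run` parameter exists purely so the check can be exercised without `rbw` installed.

```python
import subprocess


def vault_unlocked(run=subprocess.run):
    """Return True when `rbw unlocked` exits with status 0."""
    try:
        result = run(["rbw", "unlocked"], check=False)
    except FileNotFoundError:
        # rbw is not installed or not on PATH: treat the vault as locked.
        return False
    return result.returncode == 0
```

A caller would stop and ask the user to run `rbw unlock` whenever this returns `False`, rather than attempting the commit.
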
|
||||
|
||||
---
|
||||
|
||||
## File map
|
||||
|
||||
| File | Action | Responsibility |
|
||||
|---|---|---|
|
||||
| `docs/decisions/018-logging.md` | Create | Home of record for the logging architecture |
|
||||
| `docs/decisions/002-security.md` | Modify | Make the "logs to central" + "active alerting" bullets concrete (→ ADR-018) |
|
||||
| `docs/security/accepted-risks.md` | Modify | Add R4 — no cryptographic WORM for logs |
|
||||
| `docs/CAPABILITIES.md` | Modify | Loki row → decided; add Alloy agent row; note security alerting |
|
||||
| `docs/decisions/012-hardware-capacity.md` | Modify | Log-storage allocation + SSD-wearout tracked metric |
|
||||
| `STATUS.md` | Modify | Rows: logging pipeline (designed, not built) |
|
||||
| `docs/TODO.md` | Modify | Mark 3.1 decided; reconcile 3.6's "on askari" phrasing |
|
||||
| `CLAUDE.md` | Modify | ADR-018 in Further reading |
|
||||
|
||||
**Deferred (not in this plan):** the Alloy task in `base`, the `loki`/`grafana` service roles, OPNsense Suricata syslog forwarding, the push-only `vault.loki.*` credential, and the live pipeline — all recorded in ADR-018/STATUS, built when the stack exists.
|
||||
|
||||
---
|
||||
|
||||
### Task 1: Author ADR-018 (the home of record)
|
||||
|
||||
**Files:**
|
||||
- Create: `docs/decisions/018-logging.md`
|
||||
|
||||
- [ ] **Step 1: Create the ADR**
|
||||
|
||||
Create `docs/decisions/018-logging.md` with exactly this content (preserve em-dashes —, backticks, table pipes, `≠`, `~`):
|
||||
|
||||
```markdown
|
||||
# ADR-018 — Logging and log integrity
|
||||
|
||||
## Context
|
||||
|
||||
boma wants all logs in one queryable store for troubleshooting, spotting issues over
|
||||
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
|
||||
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
|
||||
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
|
||||
off-site watchdog). Undecided: the architecture and the **integrity** question — an
|
||||
attacker who roots a host will try to clear logs to cover their tracks.
|
||||
|
||||
The framing insight: the biggest anti-tampering win is that logs **leave the host in
|
||||
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
|
||||
local copy is futile. How far to harden the central store is set by the threat model.
|
||||
|
||||
## Decision
|
||||
|
||||
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
|
||||
forensic-grade.
|
||||
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
|
||||
trends. Near-real-time shipping already defeats per-host track-covering.
|
||||
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only** —
|
||||
tamper-resistant against full-cluster compromise, at bounded volume.
|
||||
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
|
||||
proportionate control.
|
||||
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
|
||||
retention + wearout monitoring.
|
||||
|
||||
## Architecture
|
||||
|
||||
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
|
||||
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
|
||||
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
|
||||
single-binary mode; NVMe; bounded retention.
|
||||
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
|
||||
only, write-only, long retention, tiny volume.
|
||||
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
|
||||
+ the alerting ADR-002 calls for.
|
||||
|
||||
## Data flow & the security subset
|
||||
|
||||
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
|
||||
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
|
||||
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
|
||||
it syslog-forwards its alerts to the ingest point), and key container security events.
|
||||
|
||||
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
|
||||
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
|
||||
hosts. The push API has no edit/delete verb, so a compromised host can append but not
|
||||
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
|
||||
(WAL) + retries across a brief outage.
|
||||
|
||||
## Security, integrity & residual risks
|
||||
|
||||
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
|
||||
(append-only, off-cluster). The security trail survives full-cluster compromise.
|
||||
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
|
||||
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
|
||||
shipping but not alter shipped history; **a host going silent is itself an alert**; a
|
||||
stolen push credential appends noise but can't delete; an `askari` outage buffers +
|
||||
flushes on reconnect.
|
||||
|
||||
## Retention & disk-wear
|
||||
|
||||
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
|
||||
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
|
||||
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
|
||||
verbosity at source (sane levels, selective access logging, a targeted `auditd`
|
||||
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
|
||||
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
|
||||
tracked allocation in `docs/hardware/reference.md` (ADR-012).
|
||||
|
||||
## Status
|
||||
|
||||
Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
|
||||
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
|
||||
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
|
||||
and the live pipeline.
|
||||
|
||||
## Dependencies
|
||||
|
||||
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
|
||||
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
|
||||
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
|
||||
(sibling effort, TODO 3.6).
|
||||
|
||||
## What was ruled out
|
||||
|
||||
| Option | Reason |
|
||||
|---|---|
|
||||
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
|
||||
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
|
||||
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
|
||||
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
|
||||
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
|
||||
|
||||
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
|
||||
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
|
||||
standard), ADR-011 (health checks — distinct from this).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/018-logging.md`
|
||||
Expected: Passed/Skipped.
|
||||
```bash
|
||||
git add docs/decisions/018-logging.md
|
||||
git commit -m "Add ADR-018 (logging and log integrity)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Make ADR-002's logging bullets concrete
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/002-security.md`
|
||||
|
||||
Read the file first, then two exact edits.
|
||||
|
||||
- [ ] **Step 1: The audit-trail bullet**
|
||||
|
||||
Find:
|
||||
```
|
||||
- `auditd` installed and running with a baseline ruleset
|
||||
- Logs shipped to a central location if a log aggregation service is available
|
||||
```
|
||||
Replace with:
|
||||
```
|
||||
- `auditd` installed and running with a baseline ruleset
|
||||
- Logs shipped to a central location in near-real-time — all logs to an on-cluster
|
||||
Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
|
||||
trail survives host (and full-cluster) compromise (ADR-018)
|
||||
```
|
||||
|
||||
- [ ] **Step 2: The active-alerting bullet**
|
||||
|
||||
Find:
|
||||
```
|
||||
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
|
||||
monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
|
||||
```
|
||||
Replace with:
|
||||
```
|
||||
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
|
||||
log-source-silence (a host that stops shipping) — into Grafana alerting on the
|
||||
Loki/Grafana stack (ADR-018; planned)
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/002-security.md`
|
||||
Expected: Passed/Skipped.
|
||||
```bash
|
||||
git add docs/decisions/002-security.md
|
||||
git commit -m "ADR-002: make central-logging + alerting controls concrete (ADR-018)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Add accepted-risk R4 (no WORM for logs)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/security/accepted-risks.md`
|
||||
|
||||
Read the file first, then one exact edit (add R4 after R3).
|
||||
|
||||
- [ ] **Step 1: Add the R4 row**
|
||||
|
||||
Find this exact line (the R3 row):
|
||||
```
|
||||
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
|
||||
```
|
||||
Add immediately **after** it:
|
||||
```
|
||||
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering their tracks), and the security trail survives even full-cluster compromise. True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Bump the "Last reviewed" date**
|
||||
|
||||
Find:
|
||||
```
|
||||
_Last reviewed: 2026-06-05. The prior gaps
|
||||
```
|
||||
Replace with:
|
||||
```
|
||||
_Last reviewed: 2026-06-06. The prior gaps
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
|
||||
Expected: Passed/Skipped.
|
||||
```bash
|
||||
git add docs/security/accepted-risks.md
|
||||
git commit -m "accepted-risks: add R4 (no cryptographic WORM for logs)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Update CAPABILITIES §3 (Observability)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/CAPABILITIES.md`
|
||||
|
||||
Read the file first, then two exact edits.
|
||||
|
||||
- [ ] **Step 1: Loki row → decided, note the off-site sink**
|
||||
|
||||
Find:
|
||||
```
|
||||
| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
|
||||
```
|
||||
Replace with:
|
||||
```
|
||||
| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Add the Alloy agent row** (right after the Loki row just edited)
|
||||
|
||||
Find:
|
||||
```
|
||||
| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
|
||||
```
|
||||
Replace with:
|
||||
```
|
||||
| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
|
||||
| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
|
||||
Expected: Passed/Skipped.
|
||||
```bash
|
||||
git add docs/CAPABILITIES.md
|
||||
git commit -m "CAPABILITIES: Loki decided + Alloy agent + security alerting (ADR-018)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5: ADR-012 — log-storage allocation + wearout metric
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/decisions/012-hardware-capacity.md`
|
||||
|
||||
Read the file first, then one exact edit (add a Consequences bullet).
|
||||
|
||||
- [ ] **Step 1: Add a Consequences bullet**
|
||||
|
||||
Find this exact block:
|
||||
```
|
||||
## Consequences
|
||||
|
||||
- Right-sizing advice is intent-based until usage data exists; reports say so.
|
||||
- `reference.md` table headers are a parser contract — changing them needs a
|
||||
matching `capacity-scan.py` change.
|
||||
```
|
||||
Replace with:
|
||||
```
|
||||
## Consequences
|
||||
|
||||
- Right-sizing advice is intent-based until usage data exists; reports say so.
|
||||
- `reference.md` table headers are a parser contract — changing them needs a
|
||||
matching `capacity-scan.py` change.
|
||||
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
|
||||
budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
|
||||
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
|
||||
not assumed.
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md`
|
||||
Expected: Passed/Skipped.
|
||||
```bash
|
||||
git add docs/decisions/012-hardware-capacity.md
|
||||
git commit -m "ADR-012: track log-storage allocation + SSD wearout (ADR-018)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Add logging rows to STATUS.md
|
||||
|
||||
**Files:**
|
||||
- Modify: `STATUS.md`
|
||||
|
||||
Read the file first, then one exact edit (add two rows after the Level 4 row).
|
||||
|
||||
- [ ] **Step 1: Add the rows**
|
||||
|
||||
Find this exact line:
|
||||
```
|
||||
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
|
||||
```
|
||||
Replace with that SAME line followed by the two new rows:
|
||||
```
|
||||
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
|
||||
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
|
||||
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files STATUS.md`
|
||||
Expected: Passed/Skipped.
|
||||
```bash
|
||||
git add STATUS.md
|
||||
git commit -m "STATUS: record logging pipeline + security alerting (ADR-018)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Reconcile TODO 3.1 and 3.6
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/TODO.md`
|
||||
|
||||
Read the file first, then two exact edits. (Preserve the `~~strikethrough~~` markers.)
|
||||
|
||||
- [ ] **Step 1: Mark 3.1 decided**
|
||||
|
||||
Find:
|
||||
```
|
||||
3. **Building services**
|
||||
1. Decide how to manage logs.
|
||||
```
|
||||
Replace with:
|
||||
```
|
||||
3. **Building services**
|
||||
1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
|
||||
Grafana Alloy (in `base`); a security subset also ships write-only off-site to
|
||||
`askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Reconcile 3.6's "on askari" phrasing**
|
||||
|
||||
Find:
|
||||
```
|
||||
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
|
||||
Kuma alerts on askari.
|
||||
```
|
||||
Replace with:
|
||||
```
|
||||
6. Wire up the monitoring stack. Logging topology DECIDED (ADR-018): cluster Loki
|
||||
(all logs) + off-site security subset on `askari` + Grafana on-cluster (not the
|
||||
whole stack on `askari`). Still to design/build: Prometheus + metric exporters,
|
||||
Uptime Kuma, and exactly which alerts live where.
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files docs/TODO.md`
|
||||
Expected: Passed/Skipped.
|
||||
```bash
|
||||
git add docs/TODO.md
|
||||
git commit -m "TODO: mark log management decided (ADR-018); reconcile 3.6"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 8: Link ADR-018 from CLAUDE.md
|
||||
|
||||
**Files:**
|
||||
- Modify: `CLAUDE.md`
|
||||
|
||||
Read the file first, then one exact edit.
|
||||
|
||||
- [ ] **Step 1: Add the Further-reading row after Hardware & capacity**
|
||||
|
||||
Find:
|
||||
```
|
||||
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
|
||||
```
|
||||
Replace with that SAME line followed by the new row:
|
||||
```
|
||||
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
|
||||
| Logging & log integrity | `docs/decisions/018-logging.md` |
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Verify and commit**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
|
||||
Expected: Passed/Skipped.
|
||||
```bash
|
||||
git add CLAUDE.md
|
||||
git commit -m "CLAUDE.md: link ADR-018 (logging)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 9: Final consistency sweep
|
||||
|
||||
**Files:** none modified (verification only)
|
||||
|
||||
- [ ] **Step 1: ADR-018 present + cross-linked (canonical docs only)**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
test -f docs/decisions/018-logging.md && echo "ADR-018 present"
|
||||
grep -rl "ADR-018\|018-logging" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
|
||||
```
|
||||
Expected: the file exists and the referencing docs appear — ADR-002, accepted-risks, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md.
|
||||
|
||||
- [ ] **Step 2: No stale "logging undecided / if available" language**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
grep -rniE "log aggregation service is available|Logs \| Loki \| P \| planned|Decide how to manage logs\.($|[^~])" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
|
||||
```
|
||||
Expected: no hits — the ADR-002 conditional, the "planned" Loki row, and the open "Decide how to manage logs" TODO are all now updated.
|
||||
|
||||
- [ ] **Step 3: Full hook run**
|
||||
|
||||
Run: `rbw unlocked && pre-commit run --all-files`
|
||||
Expected: all hooks Passed/Skipped. Fix anything that fails (likely trailing whitespace / end-of-file) and amend the owning commit.
|
||||
|
||||
- [ ] **Step 4: Push (only if the user asks)**
|
||||
|
||||
```bash
|
||||
git push origin <branch-or-main-after-merge>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review notes (author)
|
||||
|
||||
- **Spec coverage:** decision/architecture/data-flow/security/retention → Task 1 (ADR-018); the spec's "Documentation & implementation changes" table → Tasks 2–8 (ADR-002, accepted-risks R4, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md). The role/pipeline rows in that table are deferred (recorded in ADR-018/STATUS), not implemented here. ✓
|
||||
- **Deferred, intentional:** Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog forwarding, the `vault.loki.*` credential, the metrics-stack dependency — all need the unbuilt machinery; named in ADR-018/STATUS. ✓
|
||||
- **No placeholders:** every create/edit shows exact text. ✓
|
||||
- **Name consistency:** `ADR-018` / `018-logging.md`, "security subset", `offsite_hosts`, Grafana Alloy, push-only credential, R4 used identically across tasks. ✓
|
||||
```
|
||||
|
|
@ -0,0 +1,212 @@
|
|||
# Design — Logging and log integrity (ship all logs to Loki)
|
||||
|
||||
- **Date:** 2026-06-05
|
||||
- **Status:** Approved design — pending implementation plan
|
||||
- **Resolves:** TODO 3.1 ("Decide how to manage logs"); makes concrete ADR-002's
|
||||
"logs shipped to a central location" + "active alerting" controls; advances TODO 3.6
|
||||
- **Becomes:** ADR-018 (this design is the basis for that ADR)
|
||||
|
||||
---
|
||||
|
||||
## Problem
|
||||
|
||||
boma wants **all logs in one queryable store** for three things: day-to-day
|
||||
troubleshooting, spotting issues/trends over time, and **detecting intrusions /
|
||||
malicious activity**. ADR-002 already commits in principle ("`auditd`… Logs shipped to
|
||||
a central location if a log aggregation service is available"; "Active alerting wires
|
||||
AIDE/`auditd`/`fail2ban`/Suricata into the monitoring/alerting stack… ties to the
|
||||
Loki/Grafana effort"), and CAPABILITIES lists Loki (planned) + `askari` as the off-site
|
||||
watchdog. What's undecided is the **architecture** and, critically, the **integrity**
|
||||
dimension: an attacker who roots a host will try to clear logs to cover their tracks.
|
||||
|
||||
The key insight that frames the integrity question: **the biggest anti-tampering win is
|
||||
that logs leave the host in near-real-time.** Once a line is in a store the attacker
|
||||
doesn't control, wiping the local copy is futile. The remaining question is only *how
|
||||
far* to harden the central store — set by the threat model.
|
||||
|
||||
## Decisions (the settled forks)
|
||||
|
||||
1. **Threat model — opportunistic + blast-radius**, per ADR-002 / accepted-risk R1.
|
||||
Not forensic-grade. This sizes everything below.
|
||||
2. **Ship all logs to an on-cluster Loki** — the single monitoring DB for
|
||||
troubleshooting + trends. Near-real-time shipping already defeats per-host
|
||||
track-covering.
|
||||
3. **Split: a security-relevant subset ALSO ships off-site to `askari`, write-only.**
|
||||
Tamper-resistant against full-cluster compromise, at bounded volume.
|
||||
4. **Skip WORM/object-lock (Tier 3)** — recorded as accepted-risk R4; append-only push
|
||||
+ off-site is the proportionate control.
|
||||
5. **Disk-wear is a managed design parameter, not a blocker** — storage media choice +
|
||||
bounded verbosity + tuned retention + wearout monitoring (see "Retention, sizing & disk-wear").
|
||||
|
||||
## Architecture & components
|
||||
|
||||
**Agent — Grafana Alloy on every host, installed by the `base` role.** Alloy reads
|
||||
journald + container logs + the security sources (`auditd`, `authpriv`, `fail2ban`,
|
||||
AIDE) on every host (docker_hosts, proxmox nodes, `ubongo`, `askari`) and ships them.
|
||||
Placing it in `base` ties it to ADR-002's baseline "logs shipped to central" control.
|
||||
|
||||
**Two Loki instances, one Grafana:**
|
||||
|
||||
```
|
||||
┌──────────────────── per host (base role) ─────────────────────┐
|
||||
│ Grafana Alloy: collect journald + container + auditd/auth/... │
|
||||
└──────────┬───────────────────────────────────┬────────────────┘
|
||||
ALL logs │ security subset │ (over the NetBird mesh)
|
||||
▼ ▼
|
||||
┌────────────────────────┐ ┌──────────────────────────────┐
|
||||
│ Loki (cluster) all logs│ │ Loki (askari) security only │
|
||||
│ docker_host, NVMe, │ │ off-site, write-only push, │
|
||||
│ bounded hot retention │ │ long retention, append-only │
|
||||
└───────────┬────────────┘ └──────────────┬───────────────┘
|
||||
└───────────────┬────────────────────┘
|
||||
▼
|
||||
┌────────────────────────────────────┐
|
||||
│ Grafana (cluster): both datasources │
|
||||
│ dashboards + alerts (AIDE/auditd/ │
|
||||
│ fail2ban/Suricata + log-silence) │
|
||||
└────────────────────────────────────┘
|
||||
```
|
||||
|
||||
- **Loki (cluster)** — `loki` service role on a docker_host; **all** logs; monolithic
|
||||
single-binary mode (ample at this scale); NVMe; bounded retention.
|
||||
- **Loki (`askari`)** — the same role parameterised, deployed to the `offsite_hosts`
|
||||
group; **security subset only**, **write-only**, long retention, tiny volume.
|
||||
- **Grafana** — `grafana` service role on the cluster; both Lokis as datasources (one
|
||||
pane queries both); where ADR-002's "active alerting" lands.
|
||||
|
||||
Reuses what boma already has: `askari` (off-site, on the mesh per ADR-016) and the
|
||||
`base`/service-role machinery.
|
||||
|
||||
## Data flow & the security subset
|
||||
|
||||
Each host's Alloy pipeline writes **everything** to the cluster Loki and a **filtered
|
||||
copy** of security events to the `askari` Loki — a relabel/match stage tags security
|
||||
sources (`security="true"`) and routes only those to the second `loki.write` target.
|
||||
One agent, two destinations.
|
||||
|
||||
**Security subset** (high-value, bounded volume): `auditd` (auth, privilege, file
|
||||
watches), `authpriv` (SSH, `sudo`), `fail2ban` (bans), AIDE (file-integrity reports),
|
||||
**Suricata** (OPNsense isn't a `base` host, so it **syslog-forwards** alerts to the
|
||||
ingest point), and key container security events (reverse-proxy 401/403, Authentik
|
||||
login events, Docker daemon events).
|
||||
|
||||
**Write-only / append-only** (the tamper-resistance mechanism):
|
||||
- The `askari` Loki push endpoint (`/loki/api/v1/push`) is reachable only over the
|
||||
**NetBird mesh**, with a **push-only credential**; hosts hold *only* that.
|
||||
- Loki's query/admin/delete APIs on `askari` are **not exposed to hosts** (localhost /
|
||||
mesh-ACL'd to operator + Grafana). The push API has no edit/delete verb, so a
|
||||
compromised host can **append but not read/edit/delete**. Deletion needs the
|
||||
admin/compactor API or filesystem — unreachable from a host.
|
||||
- The cluster Loki uses the same push-only credential, blocking per-host log-clearing
|
||||
via API there too.
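
The append-only property described above rests on the push endpoint accepting nothing but a POST of new stream entries. As a rough sketch for manual testing (Alloy does the real shipping), the functions below build a minimal payload in Loki's push JSON shape and send it with `curl`. `LOKI_URL`, `LOKI_USER`, `LOKI_PASS`, the `host` label and the use of HTTP basic auth (for example via a proxy in front of Loki) are assumptions for illustration; this design does not specify how the push-only credential is enforced.

```bash
# Build a minimal Loki push payload: one stream, one log line.
# $1 = host label value, $2 = log line, $3 = timestamp in nanoseconds.
# Assumption: the log line contains no characters that need JSON escaping.
push_payload() {
  printf '{"streams":[{"stream":{"host":"%s","security":"true"},"values":[["%s","%s"]]}]}\n' \
    "$1" "$3" "$2"
}

# POST one line to the push endpoint. Nothing is sent until this is called.
# LOKI_URL, LOKI_USER and LOKI_PASS are hypothetical environment variables.
push_line() {
  # date +%s gives seconds; pad with nine zeros to get nanoseconds.
  push_payload "$1" "$2" "$(date +%s)000000000" |
    curl --silent --fail --user "$LOKI_USER:$LOKI_PASS" \
      -H 'Content-Type: application/json' \
      --data-binary @- "$LOKI_URL/loki/api/v1/push"
}
```

A call such as `push_line web1 "sshd: failed login"` can only add an entry. Reading or deleting entries would need the query or admin APIs, which this design keeps unreachable from hosts.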
|
||||
|
||||
**Reliability:** Alloy buffers (WAL) and retries, so a brief `askari`/mesh outage
|
||||
doesn't lose logs: they flush on reconnect, and each host needs only a small local buffer.
|
||||
|
||||
## Security, integrity & residual risks
|
||||
|
||||
**Defeated:** opportunistic track-covering (`rm`/`vacuum`) — lines are already off the
|
||||
host; **host pivot to the store** — an attacker rooting any cluster host can append but
|
||||
not delete, and cannot reach `askari`'s admin plane. **The security trail survives full
|
||||
cluster compromise.**
|
||||
|
||||
**Honest residual risks (conscious, recorded):**
|
||||
1. **Append-only ≠ cryptographic WORM** — a root-on-`askari` attacker could edit chunk
|
||||
files on disk. Skipping object-lock is **accepted-risk R4**; mitigated by `askari`
|
||||
being minimal/hardened/operator-only/mesh-only.
|
||||
2. **Un-shipped window** — a few seconds of not-yet-flushed logs live on the host;
|
||||
near-real-time minimises it. Accept.
|
||||
3. **Agent compromise (forward-looking)** — rooting a host lets the attacker stop *that
|
||||
host's* Alloy or inject *future* false logs, but the attacker **cannot alter shipped history**.
|
||||
4. **Detection as a feature** — a host that **goes silent** (Alloy stops) is an
|
||||
**alert**; the tamper attempt becomes a signal. "Log-source silence" is wired into
|
||||
Grafana alerting.
|
||||
5. **Credential theft / `askari` outage** — a stolen push credential allows appending
|
||||
noise, not deletion (bounded, rotatable); an `askari` outage buffers on hosts and
|
||||
flushes on reconnect (a very long outage eventually drops oldest — monitor it).
|
||||
|
||||
**ADR-002 fit:** realises "logs shipped to central" + "active alerting"; the off-site +
|
||||
append-only model is a clean blast-radius-containment enhancement for the opportunistic
|
||||
threat model.
|
||||
|
||||
## Retention, sizing & disk-wear
|
||||
|
||||
**Sizing (estimates — intent-based until measured, like `/capacity-review`):** a 2–5
|
||||
host homelab generates ~1–3 GB/day raw "typical" (≪1 GB/day quiet; 5–15 GB/day very
|
||||
chatty); Loki compresses ~7–10× → ~0.1–0.4 GB/day stored; the security subset is
|
||||
~10–20% of that.
|
||||
|
||||
**Retention (tunable in `group_vars`):**
|
||||
- **Cluster Loki (all logs):** bounded hot retention, start **30–90 days** (~10–35 GB
|
||||
at 90d on NVMe).
|
||||
- **`askari` Loki (security subset):** **1 year+** (~5–25 GB/yr) — small enough to keep
|
||||
the security trail long for over-time detection.
|
||||
- Defaults now; **re-measure real volume after a few weeks live** and tune.
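
The sizing and retention figures above can be reproduced with a small calculation. This is a sketch only: the helper name `stored_gb` is hypothetical, the inputs are the estimate endpoints quoted in this section (raw GB/day, compression ratio, retention days, subset fraction), and real volumes should replace them once measured.

```bash
# Stored volume in whole GB: raw GB/day / compression ratio * retention days * fraction.
# $1 = raw GB/day, $2 = compression ratio, $3 = retention in days,
# $4 = fraction of the stream kept (optional, defaults to 1 = all logs).
stored_gb() {
  awk -v raw="$1" -v ratio="$2" -v days="$3" -v frac="${4:-1}" \
    'BEGIN { printf "%.0f\n", raw / ratio * days * frac }'
}

echo "cluster, 90 d, low estimate:   $(stored_gb 1 10 90) GB"
echo "cluster, 90 d, high estimate:  $(stored_gb 3 7 90) GB"
echo "askari, 365 d, low estimate:   $(stored_gb 1 10 365 0.10) GB"
echo "askari, 365 d, high estimate:  $(stored_gb 3 7 365 0.20) GB"
```

With these endpoints the result is roughly 9 to 39 GB for the cluster store at 90 days and 4 to 31 GB per year for the `askari` subset, a slightly wider envelope than the rounded ranges quoted above.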
|
||||
|
||||
**Disk-wear (the lore is real only for specific media/misconfig; mitigated as design
|
||||
rules):** at boma's volume even ~10–40 GB/day of amplified writes is decades of life on
|
||||
a ~600-TBW/TB NVMe. Rules:
|
||||
1. Log storage on **NVMe/SSD** (or **HDD** for a long-retention cold tier — sequential,
|
||||
endurance-unlimited); **never SD/USB flash**.
|
||||
2. **Bounded verbosity at source** (sane log levels, selective access logging, a
|
||||
*targeted* `auditd` ruleset) — the one lever that controls wear *and* firehose size.
|
||||
3. Tuned Loki **retention + compaction** so neither store grows unbounded.
|
||||
4. **SSD wearout/TBW is a monitored metric** (Proxmox wearout %, `node_exporter`
|
||||
smartmon) with an alert — wear is a graph, not a surprise. (Depends on the metrics
|
||||
stack — see Dependencies.)
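
The "decades of life" claim can be checked with simple arithmetic. The sketch below assumes a 1 TB drive rated at 600 TBW, decimal units (1 TB = 1000 GB) and a constant daily write rate; the helper name `endurance_years` is hypothetical.

```bash
# Years until the rated endurance is consumed at a constant write rate.
# $1 = rated endurance in TBW, $2 = amplified writes in GB/day.
endurance_years() {
  awk -v tbw="$1" -v gbday="$2" \
    'BEGIN { printf "%.0f\n", tbw * 1000 / gbday / 365 }'
}

echo "40 GB/day: $(endurance_years 600 40) years"
echo "10 GB/day: $(endurance_years 600 10) years"
```

Under those assumptions the drive lasts about 41 years at 40 GB/day and about 164 years at 10 GB/day, so wear stays a metric to watch rather than a sizing constraint.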
|
||||
|
||||
Capacity bookkeeping ties into ADR-012: a log-storage allocation line (cluster +
|
||||
`askari`) and SSD-wearout as a tracked metric.
|
||||
|
||||
## Documentation & implementation changes
|
||||
|
||||
This is a substantial capability → its own ADR-018, with reconciliations:
|
||||
|
||||
| Doc / artifact | Change |
|
||||
|---|---|
|
||||
| ADR-018 (new) | Home of record: ship-all-to-Loki, the off-site write-only security subset, append-only model, skip-WORM (R4), disk-wear rules. |
|
||||
| `base` role (when built) | Install + configure Alloy (all → cluster Loki; subset → `askari` write-only). |
|
||||
| `loki` service role (new, when built) | One role, two deployments (cluster all-logs; `askari` security-subset write-only). `SECURITY.md` + `VERIFY.md`. |
|
||||
| `grafana` service role (new, when built) | Both Lokis as datasources; dashboards + alerting (AIDE/`auditd`/`fail2ban`/Suricata + log-silence). |
|
||||
| OPNsense (Ansible-managed) | Syslog-forward Suricata alerts to the ingest point. |
|
||||
| ADR-002 | "Logs shipped to central" + "active alerting" bullets point to ADR-018. |
|
||||
| `docs/security/accepted-risks.md` | Add **R4** — no cryptographic WORM for logs (append-only + off-site is the control). |
|
||||
| `docs/CAPABILITIES.md` §3 | Loki → decided; add the off-site security sink + Alloy agent rows; mark the alerting wiring. |
|
||||
| `docs/decisions/012-hardware-capacity.md` | Log-storage allocation (cluster + `askari`) + SSD-wearout tracked metric. |
|
||||
| `STATUS.md` + `docs/TODO.md` (3.1 / 3.6) | Mark "how to manage logs" decided by ADR-018; rows as designed-not-built. |
|
||||
| `vault.yml` | Push-only Loki credential (`vault.loki.*`). |
|
||||
|
||||
**Buildable now:** ADR-018 + the ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO
|
||||
reconciliations. **Deferred until the stack exists:** Alloy in `base`, the `loki`/`grafana`
|
||||
service roles, OPNsense syslog config, and the live pipeline.
|
||||
|
||||
## Dependencies
|
||||
|
||||
- `base` role + service-role machinery (unbuilt) — STATUS.md.
|
||||
- The running cluster + `askari` (`offsite_hosts`, designed) — ADR-016.
|
||||
- OPNsense automation (for Suricata syslog forwarding) — ADR-007.
|
||||
- The **metrics stack** (Prometheus / `node_exporter`) for SSD-wearout + log-silence
|
||||
alerting — sibling effort, TODO 3.6.
|
||||
|
||||
## Deferred / out of scope
|
||||
|
||||
1. **WORM / object-lock (Tier 3)** — accepted-risk R4; revisit only if the threat model
|
||||
shifts to targeted/forensic.
|
||||
2. **The metrics pipeline** (Prometheus/`node_exporter`) — sibling effort; this spec is
|
||||
**logs**. SSD-wearout + silence alerting depend on it.
|
||||
3. **Cold archival beyond Loki retention** (export to backups) and **structured/parsed
|
||||
per-service log standards** — future refinements.
|
||||
|
||||
## What was ruled out
|
||||
|
||||
| Option | Reason |
|
||||
|---|---|
|
||||
| Everything off-site on `askari` (no on-cluster Loki) | The firehose (tens–hundreds of GB/yr) is disk-hungry on a small VPS; keep volume where storage is cheap (on-cluster) and send only the bounded security subset off-site. |
|
||||
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
|
||||
| On-cluster-only logging (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail needs to be off-cluster + append-only. |
|
||||
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-size-caps + real-time shipping is safer. |
|
||||
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice; one agent for logs (and later metrics). |
|
||||
|
||||
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
|
||||
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
|
||||
standard), ADR-011 (health checks — distinct from this).
|
||||