Re-challenge accepted risks; adopt CIS hardening + IDS

Walked the seeded accepted-risk register (R1-R4) and turned inherited gaps into
deliberate decisions:

- Supply chain (R1): tightened to required baseline hygiene (digest pinning,
  official/verified images); active scanning deferred, stays an accepted risk
- CIS (R2): adopted as a positive decision: CIS Debian L1+L2 (base role) + CIS
  Docker (docker_host + service checklist); app layer via the checklist
- SELinux/AppArmor (R3): AppArmor becomes a baseline control (CIS-enforced);
  register keeps a clean "no SELinux" accept
- IDS (R4): adopt AIDE (baseline via CIS) + Suricata on OPNsense + active alerting

Register shrinks from 4 inherited gaps to 2 deliberate accepts. ADR-002 gains a
Hardening standard section; STATUS + TODO 15 track the (unbuilt) implementation,
including the CIS L2 partition impact on VM provisioning (ADR-006).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Expand ADR-002 into a security baseline + strategy
2026-06-04 15:15:39 +02:00 · 2026-06-04 14:39:51 +02:00
7 changed files with 245 additions and 24 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -154,6 +154,10 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 - Edit vault-encrypted files directly — decrypt first, re-encrypt after
 - Force-push or rewrite already-pushed history on `main`
 - Add a collection to `requirements.yml` without a specific module need in existing role tasks
+- Open a firewall port anywhere but the `group_vars` firewall definitions — never ad-hoc on a host (ADR-002)
+- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
+- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
+- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)

 ---

@@ -162,7 +166,9 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Topic                  | File                                  |
 |------------------------|---------------------------------------|
 | Architecture overview  | `docs/decisions/001-architecture.md`  |
-| Security baseline      | `docs/decisions/002-security.md`      |
+| Security baseline & strategy | `docs/decisions/002-security.md`      |
+| Accepted security risks | `docs/security/accepted-risks.md`     |
+| Per-service security checklist | `docs/security/service-checklist.md` |
 | Toolchain choices      | `docs/decisions/003-toolchain.md`     |
 | Docker & Compose model | `docs/decisions/004-docker-model.md`  |
 | Bootstrapping hosts    | `docs/decisions/005-bootstrapping.md` |
--- a/STATUS.md
+++ b/STATUS.md
@@ -23,6 +23,7 @@ _Last reviewed: 2026-05-30._
 | Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
 | `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
 | `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
+| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |

 ## Scaffolded but empty — NOT implemented

@@ -47,6 +48,9 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
 | Per-service roles | ADR-004 | Model defined; no service roles built |
 | Forgejo Actions CI | ADR-003 / ADR-008 | Remote is live (pushed); Actions/`act_runner` pipeline not yet built |
 | Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
+| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
+| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
+| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |

 ## Keeping this honest

--- a/docs/TODO.md
+++ b/docs/TODO.md
@@ -60,6 +60,12 @@
      Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely set up anyway
      (richer, per-process, but more to run) — see TODO 3.6. Don't build the
      Proxmox-RRD hook before settling this, to avoid throwaway work.
+   5. Build a `/security-review` skill (sibling to `/review-repo`): re-check the
+      security posture against ADR-002, surface drift, and re-challenge the
+      accepted-risk register (`docs/security/accepted-risks.md`). Could pair a
+      deterministic pre-scan (undeclared open ports, disabled baseline controls,
+      world-readable secrets, services not behind auth) with a judgement pass.
+      Open question: standalone, or folded into the kaizen `/retro` (item 11)?
 9. Should we make a basic function so that tools (and AI) can send messages to the user - email, matrix or ntfy?

 10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
@@ -68,6 +74,9 @@
    1. Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.
    2. Policy for how we write key documents like ADRs.
     3. Further development on how we collaborate on designing the foundation for the project - separate from how we implement new containers etc.
+    4. How do we make sure agents always use the latest official documentation for the technologies etc. we use?
+    5. Always subagent driven?
+    6. When AI deploys (i.e. runs playbooks etc.), should we define a methodology so that it does not have to poll constantly or review all the output? Perhaps something about the MAKE method could provide only the relevant feedback?

 11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
    1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
@@ -88,3 +97,20 @@
    whether selectively allowing libraries (e.g. PyYAML — already present via
    Ansible) is a better fit in general: weigh the parsing-correctness win
    against losing zero-setup portability. Decide a clear rule and record it.
+
+15. **Security hardening implementation** — build out the ADR-002 hardening standard.
+    1. Implement the CIS Debian Benchmark **Level 1 + Level 2** in the `base` role
+       (local tasks; CIS / `dev-sec` as reference only — no Galaxy roles). Includes
+       AppArmor (enforce mode) and AIDE file-integrity.
+    2. Implement the CIS Docker Benchmark: daemon/engine settings in `docker_host`;
+       per-container settings enforced via `docs/security/service-checklist.md`.
+    3. VM disk layout for CIS L2: separate `/tmp`, `/var`, `/var/log`, `/home`
+       partitions with `nodev,nosuid,noexec` — a Terraform/cloud-init concern
+       (ADR-006). Decide the template layout **before** provisioning, since it is
+       painful to retrofit.
+    4. Network IDS: enable Suricata on OPNsense (IDS first; IPS later?).
+    5. Active security alerting: wire AIDE, `auditd`, `fail2ban`, and Suricata into
+       the Loki/Grafana alerting stack (ties to 3.6).
+    6. Supply-chain hygiene: enforce image digest pinning + official/verified images
+       via the service checklist; revisit active scanning (Trivy/Grype) once a
+       triage stack exists (accepted-risk R1).
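A minimal sketch of the deterministic part of the supply-chain hygiene item above (and of the pre-scan idea in item 8.5): classify image references as digest-pinned, tag-pinned, or floating. The function names `classify_image_ref` and `unpinned_images` are hypothetical and not part of the repo; the only assumption about the reference format is that a digest pin ends in `@sha256:` followed by 64 hex characters.

```python
import re

# Assumption: a digest-pinned reference ends in "@sha256:" + 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def classify_image_ref(ref: str) -> str:
    """Return 'digest', 'tag', or 'floating' for a container image reference."""
    if DIGEST_RE.search(ref):
        return "digest"
    # Look only at the last path segment so a registry port
    # (e.g. "registry.example:5000/app") is not mistaken for a tag.
    last_segment = ref.rsplit("/", 1)[-1]
    if ":" in last_segment:
        tag = last_segment.split(":", 1)[1]
        return "floating" if tag == "latest" else "tag"
    # No tag at all resolves to "latest" at pull time.
    return "floating"


def unpinned_images(refs):
    """Return the references that are not pinned by digest."""
    return [ref for ref in refs if classify_image_ref(ref) != "digest"]
```

A check like this could feed the manual review today and a `/security-review` pre-scan later; whether a tag pin is acceptable or only a digest pin passes is a policy choice the sketch leaves to the caller.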
--- a/docs/decisions/002-security.md
+++ b/docs/decisions/002-security.md
@@ -1,24 +1,61 @@
-# ADR-002 — Security baseline
+# ADR-002 — Security baseline and strategy

 ## Context

-Every managed host must reach a defined security baseline before any services
-are deployed. This baseline is applied by the `base` role and is non-negotiable —
-it runs first, on every host, every time.
+Security here is not a single control but the sum of several combined efforts —
+host hardening, network segmentation, secrets handling, supply-chain hygiene, and
+disciplined automation. This ADR is the frame that organizes them: it records the
+**threat model** we design against, the **principles** every control serves, the
+host-level **baseline** the `base` role enforces, and the **governance** that keeps
+security sharp as the homelab grows.

-The goal is a principled, maintainable baseline appropriate for a homelab with
-some public-facing services — not a compliance exercise.
+The goal is a principled, maintainable posture for a homelab with some
+public-facing services — effective against a realistic threat model, not a
+compliance exercise.

-## Baseline components
+Related decisions: network segmentation (ADR-007), secrets structure (ADR-003),
+per-service roles (ADR-004), CI secret-scanning (ADR-010).

-### Access & authentication
+## Threat model
+
+What we deliberately design against — and, just as importantly, what we do not:
+
+| Threat | In scope? | What it drives |
+|---|---|---|
+| **Opportunistic external** — bots scanning, credential stuffing, mass-exploiting known CVEs in exposed services | Yes — primary | SSH key-only + fail2ban, deny-by-default firewall, security auto-patching, minimal attack surface, services behind a reverse proxy with auth |
+| **Lateral movement / blast radius** — assume one service *is* compromised; limit how far it spreads | Yes | VLAN segmentation (ADR-007), least-privilege containers, no host network mode, per-service isolation, no shared credentials |
+| **Operator / agent error** — accidental secret leak, misconfiguration, or an AI agent making an unsafe change | Yes | Vault + gitleaks, declarative firewall (no ad-hoc ports), review gates, agent guardrails (below), pre-commit hooks |
+| **Supply chain** — compromised images, base images, dependencies, collections | Acknowledged, lower priority | Baseline hygiene required: image digest pinning + prefer official/verified images (ADR-011, service checklist), gitleaks. Active vuln scanning deferred — accepted risk |
+| **Targeted / physical** — a determined adversary specifically after this homelab, or physical device access | Out of scope | Not designed against at this scale; revisit if the threat model changes |
+
+Supply chain is consciously deprioritized, not forgotten — see
+`docs/security/accepted-risks.md`.
+
+## Security principles
+
+Every control below should trace back to one of these:
+
+- **Defense in depth** — no single control is load-bearing; layers compensate.
+- **Least privilege** — accounts, containers, and automation get the minimum they need.
+- **Deny / secure by default** — closed unless explicitly opened; safe defaults.
+- **Contain the blast radius** — segment and isolate so one compromise isn't total.
+- **Automated & reproducible** — the baseline is reached by Ansible, never by hand.
+- **Explicit & revisitable** — decisions and accepted risks are written down and
+  re-challenged, not left implicit.
+
+## Baseline controls
+
+Applied by the `base` role, non-negotiable — it runs first, on every host, every
+time. Each heading tags the threat(s) it primarily serves.
+
+### Access & authentication — *opportunistic, agent error*

 - SSH key authentication only — password auth disabled
 - Root login disabled — `PermitRootLogin no`
 - Dedicated `ansible` user with locked-down sudo (NOPASSWD for automation)
 - No shared user accounts — per-person SSH keys in `group_vars/all/vars.yml`

-### Firewall
+### Firewall — *opportunistic, blast radius, agent error*

 - `nftables` (native on Debian 13, replaces iptables)
 - Default policy: deny inbound, allow established/related, allow loopback
@@ -30,29 +67,45 @@ some public-facing services — not a compliance exercise.
 > This is addressed by setting `"iptables": false` in Docker daemon config and managing
 > all rules via nftables explicitly. See `docs/decisions/004-docker-model.md`.

-### Intrusion deterrence
+### Intrusion deterrence — *opportunistic*

 - `fail2ban` monitoring SSH (and optionally reverse proxy logs)
 - Configured to ban after 5 failed attempts, 1-hour ban

-### Updates
+### Updates — *opportunistic*

 - `unattended-upgrades` enabled for **security patches only**
 - Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
 - No automatic reboots — reboots are a conscious operational decision

-### Minimal attack surface
+### Minimal attack surface — *opportunistic, blast radius*

 - No unnecessary packages installed
 - Docker daemon TCP socket disabled — Unix socket only
 - No open ports beyond those explicitly defined in firewall rules

-### Audit trail
+### Audit trail — *agent error, blast radius*

 - `auditd` installed and running with a baseline ruleset
 - Logs shipped to a central location if a log aggregation service is available

-## Secrets management
+### Mandatory access control — *blast radius*
+
+- **AppArmor** enabled with profiles in enforce mode — Debian-native MAC, default-on,
+  and required by the CIS Debian benchmark. Docker applies its `docker-default`
+  profile to containers; tighter per-service profiles are authored as needed.
+- **SELinux is not used** — non-native to Debian and redundant with AppArmor
+  (see `docs/security/accepted-risks.md`).
+
+### File integrity & intrusion detection — *opportunistic, blast radius, agent error*
+
+- **AIDE** file-integrity monitoring (required by the CIS Debian benchmark) — detects
+  unexpected changes to system files
+- **Network IDS** — Suricata on OPNsense (planned; see STATUS.md / TODO)
+- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
+  monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
+
+## Secrets management — *agent error, opportunistic*

 - Ansible Vault for all secrets (API keys, passwords, certificates), structured as a
  nested `vault.<service>.<key>` map (ADR-003)
@@ -62,15 +115,65 @@ some public-facing services — not a compliance exercise.
  `rbw unlock`; nothing decryptable sits at rest in the repo or working tree
 - See `docs/runbooks/rotate-secrets.md` for `rbw` setup and rotation

-## What this baseline does not include
+## Hardening standard

-- Full CIS benchmark hardening — adds complexity for marginal gain at this scale
-- SELinux / AppArmor — not applied by default, revisit if threat model changes
-- Intrusion detection (IDS) — out of scope for now
+The baseline above is implemented to a recognised benchmark rather than ad-hoc:
+
+- **Hosts** — the **CIS Debian Benchmark, Levels 1 and 2**, applied by the `base`
+  role. Some L2 items require separate partitions (`/tmp`, `/var`, `/var/log`,
+  `/home`) with restrictive mount options (`nodev,nosuid,noexec`) — that reaches into
+  VM disk layout, a provisioning concern (Terraform / cloud-init, ADR-006), not just
+  the `base` role.
+- **Container runtime** — the **CIS Docker Benchmark**: daemon/engine settings in the
+  `docker_host` role; per-container run settings (non-root, read-only rootfs, dropped
+  capabilities, no `privileged`, no host namespaces) enforced via
+  `docs/security/service-checklist.md`.
+- **Application containers** — no CIS benchmark exists for the app long tail
+  (Jellyfin, Nextcloud, Forgejo, …); they are covered by the CIS Docker run settings
+  plus the service checklist plus upstream hardening guidance.
+
+Hardening controls are **implemented as local roles** (per the no-Galaxy-roles
+policy, ADR-003), using the CIS benchmarks and community roles (e.g. `dev-sec`) only
+as reference. Any specific CIS item that proves impractical is exempted into
+`docs/security/accepted-risks.md` with a rationale — so the register records named
+exceptions, not a blanket opt-out.
+
+## Governance
+
+Security is maintained, not achieved once. This ADR **establishes** four
+mechanisms; each lives where change is cheap and is linked from here.
+
+- **Per-service security bar** — every exposed service must clear a defined
+  checklist before deploy (secrets in vault, no default creds, least-privilege /
+  non-root, declared firewall ports, reverse-proxy + auth if exposed). Lives in
+  `docs/security/service-checklist.md`; referenced from `docs/runbooks/new-role.md`.
+  Enforced manually in review today; the planned `/security-review` will automate it.
+- **Periodic security review** — a recurring review that re-checks posture,
+  surfaces drift, and re-challenges accepted risks. Planned as a `/security-review`
+  skill (sibling to `/review-repo`); see `docs/TODO.md` (Scheduled work). Not built
+  yet — see STATUS.md.
+- **Accepted-risk register** — the conscious trade-offs we choose to live with, each
+  with rationale and a revisit trigger. Lives in `docs/security/accepted-risks.md`
+  (expected to change; kept out of this ADR so the ADR stays stable).
+- **Agent / automation guardrails** — what AI agents and automation may do
+  unsupervised vs. what needs a human gate, since operator/agent error is in the
+  threat model. Encoded in `CLAUDE.md` ("What Claude must not do without explicit
+  instruction") and enforced by PreToolUse hooks (generated-file guard, `rbw`
+  pre-flight).

 ## Decision

-This baseline was chosen to be:
-- **Effective** against the realistic threat model (exposed services, shared repo)
-- **Maintainable** by a small team without security expertise overhead
-- **Automated** — no manual steps should be needed to reach baseline state
+This posture was chosen to be:
+
+- **Effective** against the stated threat model (opportunistic external, lateral
+  movement, operator/agent error)
+- **Maintainable** by a small team without security-expertise overhead
+- **Automated** — no manual steps to reach baseline state
+- **Legible & revisitable** — the threat model, principles, and accepted risks are
+  written down and reviewed over time, not implicit
+- **Benchmarked** — host and container hardening follow CIS (Debian L1+L2, Docker),
+  not ad-hoc choices
+
+Out-of-scope items and conscious trade-offs are recorded in
+`docs/security/accepted-risks.md` rather than here, so this decision record stays
+stable while the risk posture evolves.
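The separate-partition requirement in the Hardening standard section is easy to verify mechanically once a VM exists. The sketch below parses `/proc/mounts`-style text and reports which of the listed mount points are missing or lack the restrictive options. The mount list and option set are taken from the ADR text as written; the actual benchmark may require a different option set per mount point, so both are parameters. The function names are hypothetical.

```python
# Assumption: mount points and options as stated in ADR-002; adjust per benchmark item.
REQUIRED_MOUNTS = ("/tmp", "/var", "/var/log", "/home")
REQUIRED_OPTIONS = frozenset({"nodev", "nosuid", "noexec"})


def parse_mounts(text):
    """Map mount point -> set of mount options from /proc/mounts-style text."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mounts[fields[1]] = set(fields[3].split(","))
    return mounts


def mount_findings(text, required_mounts=REQUIRED_MOUNTS,
                   required_options=REQUIRED_OPTIONS):
    """Return {mount_point: problem} for every required mount that falls short."""
    mounts = parse_mounts(text)
    findings = {}
    for mount_point in required_mounts:
        if mount_point not in mounts:
            findings[mount_point] = "not a separate mount"
            continue
        missing = set(required_options) - mounts[mount_point]
        if missing:
            findings[mount_point] = "missing options: " + ",".join(sorted(missing))
    return findings
```

On a host, the input would typically be the contents of `/proc/mounts`; an empty result means every required mount point is present with the expected options.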
--- a/docs/runbooks/new-role.md
+++ b/docs/runbooks/new-role.md
@@ -71,7 +71,16 @@ Fix any lint or test failures before committing.
 Add the role to the appropriate playbook in `playbooks/` and add the host group
 to `inventories/staging/hosts.yml` for integration testing.

-### 9. Commit
+### 9. Clear the security checklist (services)
+
+If the role is a **service** — especially one reachable beyond its own host —
+walk `docs/security/service-checklist.md` and confirm every item passes (secrets
+in vault, no default creds, least-privilege, declared firewall ports, behind the
+reverse proxy with auth if exposed). Record any conscious deviation in
+`docs/security/accepted-risks.md`. This bar is established by ADR-002; enforcement
+is manual in review today, with the planned `/security-review` to automate it.
+
+### 10. Commit

 ```bash
 git checkout -b role/<rolename>
--- a/docs/security/accepted-risks.md
+++ b/docs/security/accepted-risks.md
@@ -0,0 +1,24 @@
+# Accepted security risks
+
+Conscious security trade-offs we are choosing to live with — recorded so "what we
+are *not* doing" is explicit and revisitable, not forgotten. This register is a
+**living document**, deliberately kept out of ADR-002 (which records durable
+decisions) so the ADR stays stable.
+
+Owned by **ADR-002** (Security baseline and strategy). Re-challenged during the
+periodic security review (planned `/security-review`; see `docs/TODO.md`).
+
+**Each entry:** the risk · why we accept it (rationale) · what would make us
+revisit (trigger).
+
+| # | Accepted risk | Rationale | Revisit trigger |
+|---|---|---|---|
+| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (image digest pinning + prefer official/verified images, ADR-011 / service checklist; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
+| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
+
+_Last reviewed: 2026-06-04. The prior gaps (full CIS hardening, SELinux/AppArmor,
+IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
+Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
+part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build
+status. As CIS is implemented, any specific item that proves impractical is added
+here as a named exception._
--- a/docs/security/service-checklist.md
+++ b/docs/security/service-checklist.md
@@ -0,0 +1,49 @@
+# Per-service security checklist
+
+The bar every service (a per-service role — ADR-004) must clear **before deploy**,
+especially anything reachable beyond its own host. Established by **ADR-002**
+(Security baseline and strategy); referenced from `docs/runbooks/new-role.md`.
+Enforced manually in review today; the planned `/security-review` skill (see
+`docs/TODO.md`) will automate the check.
+
+Treat each item as must-pass **unless** a deviation is recorded in
+`docs/security/accepted-risks.md` with a rationale and a revisit trigger.
+
+## Secrets & credentials
+
+- [ ] All secrets live in an encrypted `vault.yml` (`vault.<service>.<key>`); none in
+      plaintext files, templates, or Compose env literals
+- [ ] No default or vendor-shipped credentials remain — admin passwords/tokens are
+      generated and stored in vault
+- [ ] Nothing secret is baked into an image or committed to git (gitleaks must pass)
+
+## Least privilege
+
+- [ ] Container runs as a non-root user where the image supports it
+- [ ] No `privileged: true` and no host network mode unless explicitly justified
+- [ ] Only the volumes/paths the service needs are mounted; read-only where possible
+- [ ] Linux capabilities dropped to what's required (no blanket grants)
+
+## Network & exposure
+
+- [ ] Every listening port is declared in `group_vars` firewall definitions — never
+      opened ad-hoc on a host
+- [ ] The service is not published directly to a LAN/WAN port if it can sit behind the
+      reverse proxy instead
+- [ ] Anything reachable beyond the `srv` VLAN is behind the reverse proxy **with
+      authentication** (and TLS)
+- [ ] Inter-service reach follows least privilege — no broad `srv`→`srv` access where a
+      single declared dependency suffices
+
+## Updates & provenance
+
+- [ ] Image/source version is pinned (tag or digest), not floating `latest` (ADR-011)
+- [ ] The update path is known — how this service gets patched
+
+## Operability (security-adjacent)
+
+- [ ] Logs go somewhere reviewable (central aggregation when available)
+- [ ] Backup/restore is covered if the service holds state
+
+> Deviations are allowed but must be **conscious**: record them in
+> `docs/security/accepted-risks.md`, don't leave them implicit.
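Some of the least-privilege items in the checklist above are mechanical enough to pre-check. The following is a rough sketch, not the planned `/security-review` implementation: it takes one Compose service definition, already loaded into a dict, and lists the items it appears to violate. The keys `privileged`, `network_mode`, `user`, and `cap_drop` are standard Compose service keys; treating `cap_drop: [ALL]` as the expected starting point is an assumption of this sketch, and `checklist_findings` is a hypothetical name.

```python
def checklist_findings(service: dict) -> list:
    """Return least-privilege checklist violations for one Compose service dict."""
    findings = []

    if service.get("privileged") is True:
        findings.append("privileged: true")

    if service.get("network_mode") == "host":
        findings.append("host network mode")

    # An unset user usually means the image default, which is often root.
    user = str(service.get("user", ""))
    if user in ("", "0", "root") or user.startswith(("0:", "root:")):
        findings.append("runs as root or user not set")

    # Assumption: drop everything first, then add back what is required via cap_add.
    if "ALL" not in service.get("cap_drop", []):
        findings.append("capabilities not dropped")

    return findings
```

Any finding is either fixed or recorded as a conscious deviation in `docs/security/accepted-risks.md`; items such as reverse-proxy placement and backup coverage still need a human pass.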