Expand ADR-002 into a security baseline + strategy

Add a managerial security frame on top of the host baseline: explicit threat
model (opportunistic external, lateral movement/blast radius, operator/agent
error; supply chain accepted-lower-priority), security principles, and four
governance mechanisms that ADR-002 establishes and links out to:

- docs/security/service-checklist.md — per-service security bar (referenced from the new-role runbook)
- docs/security/accepted-risks.md — living accepted-risk register (R1-R4)
- planned /security-review skill (TODO 8.5)
- agent guardrails in CLAUDE.md "what Claude must not do"

STATUS.md records the frame as present (manual enforcement) and
/security-review as planned-not-built.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:39:51 +02:00 · f338bccd46
commit f338bccd46
parent c57910eda8
7 changed files with 182 additions and 24 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -154,6 +154,10 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 - Edit vault-encrypted files directly — decrypt first, re-encrypt after
 - Force-push or rewrite already-pushed history on `main`
 - Add a collection to `requirements.yml` without a specific module need in existing role tasks
+- Open a firewall port anywhere but the `group_vars` firewall definitions — never ad-hoc on a host (ADR-002)
+- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
+- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
+- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)

 ---

@@ -162,7 +166,9 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Topic                  | File                                  |
 |------------------------|---------------------------------------|
 | Architecture overview  | `docs/decisions/001-architecture.md`  |
-| Security baseline      | `docs/decisions/002-security.md`      |
+| Security baseline & strategy | `docs/decisions/002-security.md`      |
+| Accepted security risks | `docs/security/accepted-risks.md`     |
+| Per-service security checklist | `docs/security/service-checklist.md` |
 | Toolchain choices      | `docs/decisions/003-toolchain.md`     |
 | Docker & Compose model | `docs/decisions/004-docker-model.md`  |
 | Bootstrapping hosts    | `docs/decisions/005-bootstrapping.md` |
--- a/STATUS.md
+++ b/STATUS.md
@@ -23,6 +23,7 @@ _Last reviewed: 2026-05-30._
 | Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
 | `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
 | `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
+| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |

 ## Scaffolded but empty — NOT implemented

@@ -47,6 +48,7 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
 | Per-service roles | ADR-004 | Model defined; no service roles built |
 | Forgejo Actions CI | ADR-003 / ADR-008 | Remote is live (pushed); Actions/`act_runner` pipeline not yet built |
 | Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
+| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |

 ## Keeping this honest

--- a/docs/TODO.md
+++ b/docs/TODO.md
@@ -60,6 +60,12 @@
      Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely set up anyway
      (richer, per-process, but more to run) — see TODO 3.6. Don't build the
      Proxmox-RRD hook before settling this, to avoid throwaway work.
+   5. Build a `/security-review` skill (sibling to `/review-repo`): re-check the
+      security posture against ADR-002, surface drift, and re-challenge the
+      accepted-risk register (`docs/security/accepted-risks.md`). Could pair a
+      deterministic pre-scan (undeclared open ports, disabled baseline controls,
+      world-readable secrets, services not behind auth) with a judgement pass.
+      Open question: standalone, or folded into the kaizen `/retro` (item 11)?
 9. Should we make a basic function so that tools (and AI) can send messages to the user - email, matrix or ntfy?

 10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
@@ -68,6 +74,9 @@
    1. Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.
    2. Policy for how we write key documents like ADRs.
    3. Further development on how we collaborate on designing the foundation for the project - separate from how we implement new containers etc.
+    4. How do we make sure agents always use the latest official documentation for the technologies etc. we use?
+    5. Always subagent driven?
+    6. When AI deploys (i.e. runs playbooks etc.), should we define a methodology so that it does not have to poll constantly or review all the output? Perhaps something about the MAKE method could provide only the relevant feedback?

 11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
    1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
--- a/docs/decisions/002-security.md
+++ b/docs/decisions/002-security.md
@@ -1,24 +1,61 @@
-# ADR-002 — Security baseline
+# ADR-002 — Security baseline and strategy

 ## Context

-Every managed host must reach a defined security baseline before any services
-are deployed. This baseline is applied by the `base` role and is non-negotiable —
-it runs first, on every host, every time.
+Security here is not a single control but the sum of several combined efforts —
+host hardening, network segmentation, secrets handling, supply-chain hygiene, and
+disciplined automation. This ADR is the frame that organizes them: it records the
+**threat model** we design against, the **principles** every control serves, the
+host-level **baseline** the `base` role enforces, and the **governance** that keeps
+security sharp as the homelab grows.

-The goal is a principled, maintainable baseline appropriate for a homelab with
-some public-facing services — not a compliance exercise.
+The goal is a principled, maintainable posture for a homelab with some
+public-facing services — effective against a realistic threat model, not a
+compliance exercise.

-## Baseline components
+Related decisions: network segmentation (ADR-007), secrets structure (ADR-003),
+per-service roles (ADR-004), CI secret-scanning (ADR-010).

-### Access & authentication
+## Threat model
+
+What we deliberately design against — and, just as importantly, what we do not:
+
+| Threat | In scope? | What it drives |
+|---|---|---|
+| **Opportunistic external** — bots scanning, credential stuffing, mass-exploiting known CVEs in exposed services | Yes — primary | SSH key-only + fail2ban, deny-by-default firewall, security auto-patching, minimal attack surface, services behind a reverse proxy with auth |
+| **Lateral movement / blast radius** — assume one service *is* compromised; limit how far it spreads | Yes | VLAN segmentation (ADR-007), least-privilege containers, no host network mode, per-service isolation, no shared credentials |
+| **Operator / agent error** — accidental secret leak, misconfiguration, or an AI agent making an unsafe change | Yes | Vault + gitleaks, declarative firewall (no ad-hoc ports), review gates, agent guardrails (below), pre-commit hooks |
+| **Supply chain** — compromised images, base images, dependencies, collections | Acknowledged, lower priority | Version pinning where practical (ADR-011), gitleaks; tracked as an accepted risk with a revisit trigger |
+| **Targeted / physical** — a determined adversary specifically after this homelab, or physical device access | Out of scope | Not designed against at this scale; revisit if the threat model changes |
+
+Supply chain is consciously deprioritized, not forgotten — see
+`docs/security/accepted-risks.md`.
+
+## Security principles
+
+Every control below should trace back to one of these:
+
+- **Defense in depth** — no single control is load-bearing; layers compensate.
+- **Least privilege** — accounts, containers, and automation get the minimum they need.
+- **Deny / secure by default** — closed unless explicitly opened; safe defaults.
+- **Contain the blast radius** — segment and isolate so one compromise isn't total.
+- **Automated & reproducible** — the baseline is reached by Ansible, never by hand.
+- **Explicit & revisitable** — decisions and accepted risks are written down and
+  re-challenged, not left implicit.
+
+## Baseline controls
+
+Applied by the `base` role, non-negotiable — it runs first, on every host, every
+time. Each heading tags the threat(s) it primarily serves.
+
+### Access & authentication — *opportunistic, agent error*

 - SSH key authentication only — password auth disabled
 - Root login disabled — `PermitRootLogin no`
 - Dedicated `ansible` user with locked-down sudo (NOPASSWD for automation)
 - No shared user accounts — per-person SSH keys in `group_vars/all/vars.yml`

-### Firewall
+### Firewall — *opportunistic, blast radius, agent error*

 - `nftables` (native on Debian 13, replaces iptables)
 - Default policy: deny inbound, allow established/related, allow loopback
@@ -30,29 +67,29 @@ some public-facing services — not a compliance exercise.
 > This is addressed by setting `"iptables": false` in Docker daemon config and managing
 > all rules via nftables explicitly. See `docs/decisions/004-docker-model.md`.

-### Intrusion deterrence
+### Intrusion deterrence — *opportunistic*

 - `fail2ban` monitoring SSH (and optionally reverse proxy logs)
 - Configured to ban after 5 failed attempts, 1-hour ban

-### Updates
+### Updates — *opportunistic*

 - `unattended-upgrades` enabled for **security patches only**
 - Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
 - No automatic reboots — reboots are a conscious operational decision

-### Minimal attack surface
+### Minimal attack surface — *opportunistic, blast radius*

 - No unnecessary packages installed
 - Docker daemon TCP socket disabled — Unix socket only
 - No open ports beyond those explicitly defined in firewall rules

-### Audit trail
+### Audit trail — *agent error, blast radius*

 - `auditd` installed and running with a baseline ruleset
 - Logs shipped to a central location if a log aggregation service is available

-## Secrets management
+## Secrets management — *agent error, opportunistic*

 - Ansible Vault for all secrets (API keys, passwords, certificates), structured as a
  nested `vault.<service>.<key>` map (ADR-003)
@@ -62,15 +99,40 @@ some public-facing services — not a compliance exercise.
  `rbw unlock`; nothing decryptable sits at rest in the repo or working tree
 - See `docs/runbooks/rotate-secrets.md` for `rbw` setup and rotation

-## What this baseline does not include
+## Governance

-- Full CIS benchmark hardening — adds complexity for marginal gain at this scale
-- SELinux / AppArmor — not applied by default, revisit if threat model changes
-- Intrusion detection (IDS) — out of scope for now
+Security is maintained, not achieved once. This ADR **establishes** four
+mechanisms; each lives where change is cheap and is linked from here.
+
+- **Per-service security bar** — every exposed service must clear a defined
+  checklist before deploy (secrets in vault, no default creds, least-privilege /
+  non-root, declared firewall ports, reverse-proxy + auth if exposed). Lives in
+  `docs/security/service-checklist.md`; referenced from `docs/runbooks/new-role.md`.
+  Enforced manually in review today; the planned `/security-review` will automate it.
+- **Periodic security review** — a recurring review that re-checks posture,
+  surfaces drift, and re-challenges accepted risks. Planned as a `/security-review`
+  skill (sibling to `/review-repo`); see `docs/TODO.md` (Scheduled work). Not built
+  yet — see STATUS.md.
+- **Accepted-risk register** — the conscious trade-offs we choose to live with, each
+  with rationale and a revisit trigger. Lives in `docs/security/accepted-risks.md`
+  (expected to change; kept out of this ADR so the ADR stays stable).
+- **Agent / automation guardrails** — what AI agents and automation may do
+  unsupervised vs. what needs a human gate, since operator/agent error is in the
+  threat model. Encoded in `CLAUDE.md` ("What Claude must not do without explicit
+  instruction") and enforced by PreToolUse hooks (generated-file guard, `rbw`
+  pre-flight).

 ## Decision

-This baseline was chosen to be:
-- **Effective** against the realistic threat model (exposed services, shared repo)
-- **Maintainable** by a small team without security expertise overhead
-- **Automated** — no manual steps should be needed to reach baseline state
+This posture was chosen to be:
+
+- **Effective** against the stated threat model (opportunistic external, lateral
+  movement, operator/agent error)
+- **Maintainable** by a small team without security-expertise overhead
+- **Automated** — no manual steps to reach baseline state
+- **Legible & revisitable** — the threat model, principles, and accepted risks are
+  written down and reviewed over time, not implicit
+
+Out-of-scope items and conscious trade-offs are recorded in
+`docs/security/accepted-risks.md` rather than here, so this decision record stays
+stable while the risk posture evolves.
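
Reviewer note (not part of the commit): the deterministic pre-scan floated in TODO 8.5 could start as two small stdlib checks, one for undeclared open ports and one for world-readable secrets. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and how the listening ports and the declared `group_vars` ports are collected is left open, so both arrive as plain sets.

```python
from __future__ import annotations

import stat


def undeclared_ports(listening: set[int], declared: set[int]) -> list[int]:
    """Return ports that are listening but missing from the declared set.

    Both inputs are assumptions: 'listening' would come from a host scan,
    'declared' from the group_vars firewall definitions.
    """
    return sorted(listening - declared)


def is_world_readable(mode: int) -> bool:
    """True if the permission bits let 'other' users read the file."""
    return bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the repository.
    print(undeclared_ports({22, 443, 8080}, {22, 443}))
    print(is_world_readable(0o600))
```

A real scan would pass `os.stat(path).st_mode` to `is_world_readable`; the judgement pass described in the TODO item stays out of scope for a sketch like this.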
--- a/docs/runbooks/new-role.md
+++ b/docs/runbooks/new-role.md
@@ -71,7 +71,16 @@ Fix any lint or test failures before committing.
 Add the role to the appropriate playbook in `playbooks/` and add the host group
 to `inventories/staging/hosts.yml` for integration testing.

-### 9. Commit
+### 9. Clear the security checklist (services)
+
+If the role is a **service** — especially one reachable beyond its own host —
+walk `docs/security/service-checklist.md` and confirm every item passes (secrets
+in vault, no default creds, least-privilege, declared firewall ports, behind the
+reverse proxy with auth if exposed). Record any conscious deviation in
+`docs/security/accepted-risks.md`. This bar is established by ADR-002; enforcement
+is manual in review today, with the planned `/security-review` to automate it.
+
+### 10. Commit

 ```bash
 git checkout -b role/<rolename>
--- a/docs/security/accepted-risks.md
+++ b/docs/security/accepted-risks.md
@@ -0,0 +1,21 @@
+# Accepted security risks
+
+Conscious security trade-offs we are choosing to live with — recorded so "what we
+are *not* doing" is explicit and revisitable, not forgotten. This register is a
+**living document** and is expected to change; it is deliberately kept out of
+ADR-002 (which records durable decisions) so the ADR stays stable.
+
+Owned by **ADR-002** (Security baseline and strategy). Re-challenged during the
+periodic security review (planned `/security-review`; see `docs/TODO.md`).
+
+**Each entry:** the risk · why we accept it (rationale) · what would make us
+revisit (trigger).
+
+| # | Accepted risk | Rationale | Revisit trigger |
+|---|---|---|---|
+| R1 | **Supply chain not actively defended** — third-party container/base images, dependencies, and Ansible collections are trusted as pulled | Out of proportion to a homelab's effort budget; the realistic threat is opportunistic, not a targeted supply-chain attack. gitleaks + version pinning (ADR-011) give partial cover | Hosting high-value data/finances for others; a relevant upstream compromise; appetite for image signing / SBOM / pinned digests |
+| R2 | **No full CIS benchmark hardening** | Significant complexity for marginal gain at this scale | A compliance need, or hosting third-party data with obligations |
+| R3 | **No SELinux / AppArmor** mandatory access control | Operational overhead exceeds benefit for the current threat model | Threat model shifts toward targeted attackers; a service with a poor security history |
+| R4 | **No intrusion detection system (IDS)** | Detection is only useful with the capacity to triage it; alerts no one reads are noise | Monitoring/alerting stack (Prometheus/Loki/Grafana) is in place and someone will act on alerts |
+
+_Last reviewed: 2026-06-04 (seeded — pending a first re-challenge pass)._
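
Reviewer note (not part of the commit): because every register entry is meant to carry a rationale and a revisit trigger, a re-challenge pass could begin by checking that no row leaves either cell empty. The following is a minimal sketch, assuming the four-column table layout above and no literal `|` inside cells; `parse_register` and `incomplete_entries` are hypothetical names.

```python
from __future__ import annotations


def parse_register(markdown: str) -> list[dict[str, str]]:
    """Extract the R<n> rows of the accepted-risk table as dictionaries."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("| R"):
            continue  # skips the header, the separator, and surrounding prose
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 4:
            continue  # malformed row; a real tool might report it instead
        rows.append(dict(zip(("id", "risk", "rationale", "trigger"), cells)))
    return rows


def incomplete_entries(rows: list[dict[str, str]]) -> list[str]:
    """Return the IDs of entries missing a rationale or a revisit trigger."""
    return [r["id"] for r in rows if not r["rationale"] or not r["trigger"]]


if __name__ == "__main__":
    # Hypothetical sample; R9 is deliberately incomplete.
    sample = (
        "| # | Accepted risk | Rationale | Revisit trigger |\n"
        "|---|---|---|---|\n"
        "| R1 | Supply chain | Effort budget | Upstream compromise |\n"
        "| R9 | Example | | |\n"
    )
    print(incomplete_entries(parse_register(sample)))
```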
--- a/docs/security/service-checklist.md
+++ b/docs/security/service-checklist.md
@@ -0,0 +1,49 @@
+# Per-service security checklist
+
+The bar every service (a per-service role — ADR-004) must clear **before deploy**,
+especially anything reachable beyond its own host. Established by **ADR-002**
+(Security baseline and strategy); referenced from `docs/runbooks/new-role.md`.
+Enforced manually in review today; the planned `/security-review` skill (see
+`docs/TODO.md`) will automate the check.
+
+Treat each item as must-pass **unless** a deviation is recorded in
+`docs/security/accepted-risks.md` with a rationale and a revisit trigger.
+
+## Secrets & credentials
+
+- [ ] All secrets live in an encrypted `vault.yml` (`vault.<service>.<key>`); none in
+      plaintext files, templates, or Compose env literals
+- [ ] No default or vendor-shipped credentials remain — admin passwords/tokens are
+      generated and stored in vault
+- [ ] Nothing secret is baked into an image or committed to git (gitleaks must pass)
+
+## Least privilege
+
+- [ ] Container runs as a non-root user where the image supports it
+- [ ] No `privileged: true` and no host network mode unless explicitly justified
+- [ ] Only the volumes/paths the service needs are mounted; read-only where possible
+- [ ] Linux capabilities dropped to what's required (no blanket grants)
+
+## Network & exposure
+
+- [ ] Every listening port is declared in `group_vars` firewall definitions — never
+      opened ad-hoc on a host
+- [ ] The service is not published directly to a LAN/WAN port if it can sit behind the
+      reverse proxy instead
+- [ ] Anything reachable beyond the `srv` VLAN is behind the reverse proxy **with
+      authentication** (and TLS)
+- [ ] Inter-service reach follows least privilege — no broad `srv`→`srv` access where a
+      single declared dependency suffices
+
+## Updates & provenance
+
+- [ ] Image/source version is pinned (tag or digest), not floating `latest` (ADR-011)
+- [ ] The update path is known — how this service gets patched
+
+## Operability (security-adjacent)
+
+- [ ] Logs go somewhere reviewable (central aggregation when available)
+- [ ] Backup/restore is covered if the service holds state
+
+> Deviations are allowed but must be **conscious**: record them in
+> `docs/security/accepted-risks.md`, don't leave them implicit.
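
Reviewer note (not part of the commit): a few of the checklist items above are mechanical enough to check against a parsed Compose service definition. The sketch below assumes the service has already been loaded into a plain dictionary (the loader is left out) and covers only four items: `privileged`, host network mode, an explicit root user, and an unpinned image. `checklist_findings` is a hypothetical name, and the remaining checklist items still need a human pass.

```python
from __future__ import annotations


def checklist_findings(name: str, service: dict) -> list[str]:
    """Return human-readable findings for one Compose service mapping."""
    findings = []

    if service.get("privileged") is True:
        findings.append(f"{name}: privileged: true")

    if service.get("network_mode") == "host":
        findings.append(f"{name}: host network mode")

    # Only an explicit root user is flagged; a missing 'user' key is left
    # alone because the image itself may already run as non-root.
    user = str(service.get("user", "")).split(":")[0]
    if user in ("0", "root"):
        findings.append(f"{name}: runs as root")

    # A digest counts as pinned. Otherwise take the tag from the last path
    # segment so a registry port (host:5000/app) is not mistaken for a tag.
    image = str(service.get("image", ""))
    if image and "@sha256:" not in image:
        tag = image.rsplit("/", 1)[-1].partition(":")[2]
        if tag in ("", "latest"):
            findings.append(f"{name}: image not pinned ({image})")

    return findings


if __name__ == "__main__":
    # Hypothetical service definition for illustration only.
    risky = {"image": "nginx:latest", "privileged": True,
             "network_mode": "host", "user": "0:0"}
    for finding in checklist_findings("web", risky):
        print(finding)
```

Findings are returned as strings rather than raised as errors so that a deviation can be recorded in `docs/security/accepted-risks.md` instead of blocking outright, which matches the "deviations are allowed but must be conscious" rule.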