Compare commits

2 commits

Author SHA1 Message Date
19dd89b875 Re-challenge accepted risks; adopt CIS hardening + IDS
Walked the seeded accepted-risk register (R1-R4) and turned inherited gaps into
deliberate decisions:

- Supply chain (R1): tightened to required baseline hygiene (digest pinning,
  official/verified images); active scanning deferred — stays an accepted risk
- CIS (R2): adopted as a positive decision — CIS Debian L1+L2 (base role) + CIS
  Docker (docker_host + service checklist); app layer via the checklist
- SELinux/AppArmor (R3): AppArmor becomes a baseline control (CIS-enforced);
  register keeps a clean "no SELinux" accept
- IDS (R4): adopt AIDE (baseline via CIS) + Suricata on OPNsense + active alerting

Register shrinks from 4 inherited gaps to 2 deliberate accepts. ADR-002 gains a
Hardening standard section; STATUS + TODO 15 track the (unbuilt) implementation,
including the CIS L2 partition impact on VM provisioning (ADR-006).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:15:39 +02:00
f338bccd46 Expand ADR-002 into a security baseline + strategy
Add a managerial security frame on top of the host baseline: explicit threat
model (opportunistic external, lateral movement/blast radius, operator/agent
error; supply chain accepted-lower-priority), security principles, and four
governance mechanisms that ADR-002 establishes and links out to:

- docs/security/service-checklist.md — per-service security bar (referenced
  from the new-role runbook)
- docs/security/accepted-risks.md — living accepted-risk register (R1-R4)
- planned /security-review skill (TODO 8.5)
- agent guardrails in CLAUDE.md "what Claude must not do"

STATUS.md records the frame as present (manual enforcement) and /security-review
as planned-not-built.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:39:51 +02:00
7 changed files with 245 additions and 24 deletions

@@ -154,6 +154,10 @@ Single-contributor, trunk-based (no merge requests / approval gates):
- Edit vault-encrypted files directly — decrypt first, re-encrypt after
- Force-push or rewrite already-pushed history on `main`
- Add a collection to `requirements.yml` without a specific module need in existing role tasks
- Open a firewall port anywhere but the `group_vars` firewall definitions — never ad-hoc on a host (ADR-002)
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)
---
@@ -162,7 +166,9 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Topic | File |
|------------------------|---------------------------------------|
| Architecture overview | `docs/decisions/001-architecture.md` |
| Security baseline | `docs/decisions/002-security.md` |
| Security baseline & strategy | `docs/decisions/002-security.md` |
| Accepted security risks | `docs/security/accepted-risks.md` |
| Per-service security checklist | `docs/security/service-checklist.md` |
| Toolchain choices | `docs/decisions/003-toolchain.md` |
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |

@@ -23,6 +23,7 @@ _Last reviewed: 2026-05-30._
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
## Scaffolded but empty — NOT implemented
@@ -47,6 +48,9 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
| Per-service roles | ADR-004 | Model defined; no service roles built |
| Forgejo Actions CI | ADR-003 / ADR-008 | Remote is live (pushed); Actions/`act_runner` pipeline not yet built |
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
## Keeping this honest

@@ -60,6 +60,12 @@
Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely set up anyway
(richer, per-process, but more to run) — see TODO 3.6. Don't build the
Proxmox-RRD hook before settling this, to avoid throwaway work.
5. Build a `/security-review` skill (sibling to `/review-repo`): re-check the
security posture against ADR-002, surface drift, and re-challenge the
accepted-risk register (`docs/security/accepted-risks.md`). Could pair a
deterministic pre-scan (undeclared open ports, disabled baseline controls,
world-readable secrets, services not behind auth) with a judgement pass.
Open question: standalone, or folded into the kaizen `/retro` (item 11)?
9. Should we make a basic function so that tools (and AI) can send messages to the user - email, Matrix or ntfy?
10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
@@ -68,6 +74,9 @@
1. Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.
2. Policy for how we write key documents like ADRs.
3. Further development on how we collaborate on designing the foundation for the project - separate from how we implement new containers etc.
4. How do we make sure agents always use the latest official documentation for the technologies etc. we use?
5. Always subagent driven?
6. When AI deploys (i.e. runs playbooks etc.), should we define a method so that it does not have to poll constantly or review all the output? Perhaps something about the MAKE method could provide only the relevant feedback?
11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
@@ -88,3 +97,20 @@
whether selectively allowing libraries (e.g. PyYAML — already present via
Ansible) is a better fit in general: weigh the parsing-correctness win
against losing zero-setup portability. Decide a clear rule and record it.
15. **Security hardening implementation** — build out the ADR-002 hardening standard.
1. Implement the CIS Debian Benchmark **Level 1 + Level 2** in the `base` role
(local tasks; CIS / `dev-sec` as reference only — no Galaxy roles). Includes
AppArmor (enforce mode) and AIDE file-integrity.
2. Implement the CIS Docker Benchmark: daemon/engine settings in `docker_host`;
per-container settings enforced via `docs/security/service-checklist.md`.
3. VM disk layout for CIS L2: separate `/tmp`, `/var`, `/var/log`, `/home`
partitions with `nodev,nosuid,noexec` — a Terraform/cloud-init concern
(ADR-006). Decide the template layout **before** provisioning, since it is
painful to retrofit.
4. Network IDS: enable Suricata on OPNsense (IDS first; IPS later?).
5. Active security alerting: wire AIDE, `auditd`, `fail2ban`, and Suricata into
the Loki/Grafana alerting stack (ties to 3.6).
6. Supply-chain hygiene: enforce image digest pinning + official/verified images
via the service checklist; revisit active scanning (Trivy/Grype) once a
triage stack exists (accepted-risk R1).
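
Item 15.3 could be verified on a provisioned VM with a small stdlib check. The sketch below is illustrative only: the name `missing_mount_options` is hypothetical, it assumes input in the `/proc/mounts` column layout, and it applies `nodev,nosuid,noexec` to all four mount points as stated above, which may be stricter than the benchmark requires for some of them.

```python
# Assumption: these values mirror TODO item 15.3; adjust per mount point if
# the benchmark turns out to require fewer options on some of them.
REQUIRED_MOUNTS = ("/tmp", "/var", "/var/log", "/home")
REQUIRED_OPTIONS = frozenset({"nodev", "nosuid", "noexec"})


def missing_mount_options(mounts_text, mounts=REQUIRED_MOUNTS,
                          options=REQUIRED_OPTIONS):
    """Map each non-compliant mount point to its sorted missing options.

    A mount point that is not a separate filesystem is reported as missing
    every required option.
    """
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a "device mountpoint fstype options ..." line
        # If a mount point appears more than once, the last entry wins.
        found[fields[1]] = set(fields[3].split(","))

    problems = {}
    for mountpoint in mounts:
        if mountpoint not in found:
            problems[mountpoint] = sorted(options)
            continue
        missing = options - found[mountpoint]
        if missing:
            problems[mountpoint] = sorted(missing)
    return problems
```

On a Linux host, `missing_mount_options(open("/proc/mounts").read())` would return an empty dict when every listed mount point is a separate filesystem carrying all three options.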

@@ -1,24 +1,61 @@
# ADR-002 — Security baseline
# ADR-002 — Security baseline and strategy
## Context
Every managed host must reach a defined security baseline before any services
are deployed. This baseline is applied by the `base` role and is non-negotiable —
it runs first, on every host, every time.
Security here is not a single control but the sum of several combined efforts —
host hardening, network segmentation, secrets handling, supply-chain hygiene, and
disciplined automation. This ADR is the frame that organizes them: it records the
**threat model** we design against, the **principles** every control serves, the
host-level **baseline** the `base` role enforces, and the **governance** that keeps
security sharp as the homelab grows.
The goal is a principled, maintainable baseline appropriate for a homelab with
some public-facing services — not a compliance exercise.
The goal is a principled, maintainable posture for a homelab with some
public-facing services — effective against a realistic threat model, not a
compliance exercise.
## Baseline components
Related decisions: network segmentation (ADR-007), secrets structure (ADR-003),
per-service roles (ADR-004), CI secret-scanning (ADR-010).
### Access & authentication
## Threat model
What we deliberately design against — and, just as importantly, what we do not:
| Threat | In scope? | What it drives |
|---|---|---|
| **Opportunistic external** — bots scanning, credential stuffing, mass-exploiting known CVEs in exposed services | Yes — primary | SSH key-only + fail2ban, deny-by-default firewall, security auto-patching, minimal attack surface, services behind a reverse proxy with auth |
| **Lateral movement / blast radius** — assume one service *is* compromised; limit how far it spreads | Yes | VLAN segmentation (ADR-007), least-privilege containers, no host network mode, per-service isolation, no shared credentials |
| **Operator / agent error** — accidental secret leak, misconfiguration, or an AI agent making an unsafe change | Yes | Vault + gitleaks, declarative firewall (no ad-hoc ports), review gates, agent guardrails (below), pre-commit hooks |
| **Supply chain** — compromised images, base images, dependencies, collections | Acknowledged, lower priority | Baseline hygiene required: image digest pinning + prefer official/verified images (ADR-011, service checklist), gitleaks. Active vuln scanning deferred — accepted risk |
| **Targeted / physical** — a determined adversary specifically after this homelab, or physical device access | Out of scope | Not designed against at this scale; revisit if the threat model changes |
Supply chain is consciously deprioritized, not forgotten — see
`docs/security/accepted-risks.md`.
## Security principles
Every control below should trace back to one of these:
- **Defense in depth** — no single control is load-bearing; layers compensate.
- **Least privilege** — accounts, containers, and automation get the minimum they need.
- **Deny / secure by default** — closed unless explicitly opened; safe defaults.
- **Contain the blast radius** — segment and isolate so one compromise isn't total.
- **Automated & reproducible** — the baseline is reached by Ansible, never by hand.
- **Explicit & revisitable** — decisions and accepted risks are written down and
re-challenged, not left implicit.
## Baseline controls
Applied by the `base` role, non-negotiable — it runs first, on every host, every
time. Each heading tags the threat(s) it primarily serves.
### Access & authentication — *opportunistic, agent error*
- SSH key authentication only — password auth disabled
- Root login disabled — `PermitRootLogin no`
- Dedicated `ansible` user with locked-down sudo (NOPASSWD for automation)
- No shared user accounts — per-person SSH keys in `group_vars/all/vars.yml`
### Firewall
### Firewall — *opportunistic, blast radius, agent error*
- `nftables` (native on Debian 13, replaces iptables)
- Default policy: deny inbound, allow established/related, allow loopback
@@ -30,29 +67,45 @@ some public-facing services — not a compliance exercise.
> This is addressed by setting `"iptables": false` in Docker daemon config and managing
> all rules via nftables explicitly. See `docs/decisions/004-docker-model.md`.
### Intrusion deterrence
### Intrusion deterrence — *opportunistic*
- `fail2ban` monitoring SSH (and optionally reverse proxy logs)
- Configured to ban after 5 failed attempts, 1-hour ban
### Updates
### Updates — *opportunistic*
- `unattended-upgrades` enabled for **security patches only**
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
- No automatic reboots — reboots are a conscious operational decision
### Minimal attack surface
### Minimal attack surface — *opportunistic, blast radius*
- No unnecessary packages installed
- Docker daemon TCP socket disabled — Unix socket only
- No open ports beyond those explicitly defined in firewall rules
### Audit trail
### Audit trail — *agent error, blast radius*
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location if a log aggregation service is available
## Secrets management
### Mandatory access control — *blast radius*
- **AppArmor** enabled with profiles in enforce mode — Debian-native MAC, default-on,
and required by the CIS Debian benchmark. Docker applies its `docker-default`
profile to containers; tighter per-service profiles are authored as needed.
- **SELinux is not used** — non-native to Debian and redundant with AppArmor
(see `docs/security/accepted-risks.md`).
### File integrity & intrusion detection — *opportunistic, blast radius, agent error*
- **AIDE** file-integrity monitoring (required by the CIS Debian benchmark) — detects
unexpected changes to system files
- **Network IDS** — Suricata on OPNsense (planned; see STATUS.md / TODO)
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
## Secrets management — *agent error, opportunistic*
- Ansible Vault for all secrets (API keys, passwords, certificates), structured as a
nested `vault.<service>.<key>` map (ADR-003)
@@ -62,15 +115,65 @@ some public-facing services — not a compliance exercise.
`rbw unlock`; nothing decryptable sits at rest in the repo or working tree
- See `docs/runbooks/rotate-secrets.md` for `rbw` setup and rotation
## What this baseline does not include
## Hardening standard
- Full CIS benchmark hardening — adds complexity for marginal gain at this scale
- SELinux / AppArmor — not applied by default, revisit if threat model changes
- Intrusion detection (IDS) — out of scope for now
The baseline above is implemented to a recognised benchmark rather than ad-hoc:
- **Hosts** — the **CIS Debian Benchmark, Levels 1 and 2**, applied by the `base`
role. Some L2 items require separate partitions (`/tmp`, `/var`, `/var/log`,
`/home`) with restrictive mount options (`nodev,nosuid,noexec`) — that reaches into
VM disk layout, a provisioning concern (Terraform / cloud-init, ADR-006), not just
the `base` role.
- **Container runtime** — the **CIS Docker Benchmark**: daemon/engine settings in the
`docker_host` role; per-container run settings (non-root, read-only rootfs, dropped
capabilities, no `privileged`, no host namespaces) enforced via
`docs/security/service-checklist.md`.
- **Application containers** — no CIS benchmark exists for the app long tail
(Jellyfin, Nextcloud, Forgejo, …); they are covered by the CIS Docker run settings
plus the service checklist plus upstream hardening guidance.
Hardening controls are **implemented as local roles** (per the no-Galaxy-roles
policy, ADR-003), using the CIS benchmarks and community roles (e.g. `dev-sec`) only
as reference. Any specific CIS item that proves impractical is exempted into
`docs/security/accepted-risks.md` with a rationale — so the register records named
exceptions, not a blanket opt-out.
## Governance
Security is maintained, not achieved once. This ADR **establishes** four
mechanisms; each lives where change is cheap and is linked from here.
- **Per-service security bar** — every exposed service must clear a defined
checklist before deploy (secrets in vault, no default creds, least-privilege /
non-root, declared firewall ports, reverse-proxy + auth if exposed). Lives in
`docs/security/service-checklist.md`; referenced from `docs/runbooks/new-role.md`.
Enforced manually in review today; the planned `/security-review` will automate it.
- **Periodic security review** — a recurring review that re-checks posture,
surfaces drift, and re-challenges accepted risks. Planned as a `/security-review`
skill (sibling to `/review-repo`); see `docs/TODO.md` (Scheduled work). Not built
yet — see STATUS.md.
- **Accepted-risk register** — the conscious trade-offs we choose to live with, each
with rationale and a revisit trigger. Lives in `docs/security/accepted-risks.md`
(expected to change; kept out of this ADR so the ADR stays stable).
- **Agent / automation guardrails** — what AI agents and automation may do
unsupervised vs. what needs a human gate, since operator/agent error is in the
threat model. Encoded in `CLAUDE.md` ("What Claude must not do without explicit
instruction") and enforced by PreToolUse hooks (generated-file guard, `rbw`
pre-flight).
## Decision
This baseline was chosen to be:
- **Effective** against the realistic threat model (exposed services, shared repo)
- **Maintainable** by a small team without security expertise overhead
- **Automated** — no manual steps should be needed to reach baseline state
This posture was chosen to be:
- **Effective** against the stated threat model (opportunistic external, lateral
movement, operator/agent error)
- **Maintainable** by a small team without security-expertise overhead
- **Automated** — no manual steps to reach baseline state
- **Legible & revisitable** — the threat model, principles, and accepted risks are
written down and reviewed over time, not implicit
- **Benchmarked** — host and container hardening follow CIS (Debian L1+L2, Docker),
not ad-hoc choices
Out-of-scope items and conscious trade-offs are recorded in
`docs/security/accepted-risks.md` rather than here, so this decision record stays
stable while the risk posture evolves.

@@ -71,7 +71,16 @@ Fix any lint or test failures before committing.
Add the role to the appropriate playbook in `playbooks/` and add the host group
to `inventories/staging/hosts.yml` for integration testing.
### 9. Commit
### 9. Clear the security checklist (services)
If the role is a **service** — especially one reachable beyond its own host —
walk `docs/security/service-checklist.md` and confirm every item passes (secrets
in vault, no default creds, least-privilege, declared firewall ports, behind the
reverse proxy with auth if exposed). Record any conscious deviation in
`docs/security/accepted-risks.md`. This bar is established by ADR-002; enforcement
is manual in review today, with the planned `/security-review` to automate it.
### 10. Commit
```bash
git checkout -b role/<rolename>
```

@@ -0,0 +1,24 @@
# Accepted security risks
Conscious security trade-offs we are choosing to live with — recorded so "what we
are *not* doing" is explicit and revisitable, not forgotten. This register is a
**living document**, deliberately kept out of ADR-002 (which records durable
decisions) so the ADR stays stable.
Owned by **ADR-002** (Security baseline and strategy). Re-challenged during the
periodic security review (planned `/security-review`; see `docs/TODO.md`).
**Each entry:** the risk · why we accept it (rationale) · what would make us
revisit (trigger).
| # | Accepted risk | Rationale | Revisit trigger |
|---|---|---|---|
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (image digest pinning + prefer official/verified images, ADR-011 / service checklist; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
_Last reviewed: 2026-06-04. The prior gaps (full CIS hardening, SELinux/AppArmor,
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build
status. As CIS is implemented, any specific item that proves impractical is added
here as a named exception._
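
Because each entry is meant to carry a risk, a rationale, and a revisit trigger, a deterministic check could flag rows where one of them is missing. A minimal sketch, assuming the four-column table layout shown above and no `|` characters inside cells; `parse_register` and `incomplete_entries` are hypothetical names, not part of the repo.

```python
import re

COLUMNS = ("id", "risk", "rationale", "trigger")


def parse_register(markdown):
    """Extract register rows (R1, R2, ...) from a Markdown table."""
    entries = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row, the separator row, and malformed rows.
        if len(cells) != len(COLUMNS) or not re.fullmatch(r"R\d+", cells[0]):
            continue
        entries.append(dict(zip(COLUMNS, cells)))
    return entries


def incomplete_entries(entries):
    """Return the ids of entries with an empty risk, rationale, or trigger."""
    return [
        entry["id"]
        for entry in entries
        if not all(entry[key] for key in ("risk", "rationale", "trigger"))
    ]
```

Rows with the wrong number of cells are skipped silently in this sketch; a real check would probably want to report them instead.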

@@ -0,0 +1,49 @@
# Per-service security checklist
The bar every service (a per-service role — ADR-004) must clear **before deploy**,
especially anything reachable beyond its own host. Established by **ADR-002**
(Security baseline and strategy); referenced from `docs/runbooks/new-role.md`.
Enforced manually in review today; the planned `/security-review` skill (see
`docs/TODO.md`) will automate the check.
Treat each item as must-pass **unless** a deviation is recorded in
`docs/security/accepted-risks.md` with a rationale and a revisit trigger.
## Secrets & credentials
- [ ] All secrets live in an encrypted `vault.yml` (`vault.<service>.<key>`); none in
plaintext files, templates, or Compose env literals
- [ ] No default or vendor-shipped credentials remain — admin passwords/tokens are
generated and stored in vault
- [ ] Nothing secret is baked into an image or committed to git (gitleaks must pass)
## Least privilege
- [ ] Container runs as a non-root user where the image supports it
- [ ] No `privileged: true` and no host network mode unless explicitly justified
- [ ] Only the volumes/paths the service needs are mounted; read-only where possible
- [ ] Linux capabilities dropped to what's required (no blanket grants)
## Network & exposure
- [ ] Every listening port is declared in `group_vars` firewall definitions — never
opened ad-hoc on a host
- [ ] The service is not published directly to a LAN/WAN port if it can sit behind the
reverse proxy instead
- [ ] Anything reachable beyond the `srv` VLAN is behind the reverse proxy **with
authentication** (and TLS)
- [ ] Inter-service reach follows least privilege — no broad `srv`-to-`srv` access where a
single declared dependency suffices
## Updates & provenance
- [ ] Image/source version is pinned (tag or digest), not floating `latest` (ADR-011)
- [ ] The update path is known — how this service gets patched
## Operability (security-adjacent)
- [ ] Logs go somewhere reviewable (central aggregation when available)
- [ ] Backup/restore is covered if the service holds state
> Deviations are allowed but must be **conscious**: record them in
> `docs/security/accepted-risks.md`, don't leave them implicit.
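
Several items above are mechanical enough to pre-check before the manual review. The sketch below is one possible shape, not the planned `/security-review` implementation: `checklist_findings` and `image_is_pinned` are hypothetical names, and the input is assumed to be a single Compose service already loaded into a dict (for example by a YAML parser). It covers only `privileged`, host network mode, an explicit root `user`, a blanket `cap_add`, and floating image tags.

```python
def image_is_pinned(image):
    """True if the image reference carries a digest or a non-`latest` tag."""
    if "@sha256:" in image:
        return True
    # Only the last path segment can hold a tag; an earlier colon is a
    # registry port (e.g. "registry:5000/app").
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False
    return last_segment.rsplit(":", 1)[1] not in ("", "latest")


def checklist_findings(name, service):
    """Return human-readable findings for one Compose service mapping."""
    findings = []
    if service.get("privileged") is True:
        findings.append(f"{name}: privileged is set")
    if service.get("network_mode") == "host":
        findings.append(f"{name}: host network mode")
    # Only an explicit root user is flagged; an unset `user` falls back to
    # the image default, which this sketch cannot see.
    user = str(service.get("user", ""))
    if user.split(":", 1)[0] in ("0", "root"):
        findings.append(f"{name}: runs as root")
    if "ALL" in service.get("cap_add", []):
        findings.append(f"{name}: blanket capability grant (cap_add: ALL)")
    image = service.get("image")
    if image is not None and not image_is_pinned(image):
        findings.append(f"{name}: image {image!r} is not pinned")
    return findings
```

An empty list means none of these five checks fired; the remaining checklist items (secrets, firewall declarations, reverse proxy, backups) still need the manual walk-through.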