# ADR-002: Security baseline and strategy

## Context

Security here is not a single control but the sum of several combined efforts:
host hardening, network segmentation, secrets handling, supply-chain hygiene, and
disciplined automation. This ADR is the frame that organizes them. It records the
**threat model** we design against, the **principles** every control serves, the
host-level **baseline** the `base` role enforces, and the **governance** that keeps
security sharp as the homelab grows.

The goal is a principled, maintainable posture for a homelab with some
public-facing services: effective against a realistic threat model, not a
compliance exercise.

Related decisions: network segmentation (ADR-007), secrets structure (ADR-003),
per-service roles (ADR-004), CI secret-scanning (ADR-010).

## Threat model

What we deliberately design against and, just as importantly, what we do not:

| Threat | In scope? | What it drives |
|---|---|---|
| **Opportunistic external**: bots scanning, credential stuffing, mass-exploiting known CVEs in exposed services | Yes (primary) | SSH key-only + fail2ban, deny-by-default firewall, security auto-patching, minimal attack surface, services behind a reverse proxy with auth |
| **Lateral movement / blast radius**: assume one service *is* compromised; limit how far it spreads | Yes | VLAN segmentation (ADR-007), least-privilege containers, no host network mode, per-service isolation, no shared credentials |
| **Operator / agent error**: accidental secret leak, misconfiguration, or an AI agent making an unsafe change | Yes | Vault + gitleaks, declarative firewall (no ad-hoc ports), review gates, agent guardrails (below), pre-commit hooks |
| **Supply chain**: compromised images, base images, dependencies, collections | Acknowledged, lower priority | Baseline hygiene required: tiered image pinning (stateful `tag@digest`, stateless rolling; see ADR-011), a preference for official/verified images, and gitleaks. Active vulnerability scanning is deferred as an accepted risk |
| **Targeted / physical**: a determined adversary specifically after this homelab, or physical device access | Out of scope | Not designed against at this scale; revisit if the threat model changes |

Supply chain is consciously deprioritized, not forgotten; see
`docs/security/accepted-risks.md`.

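
The tiered pinning rule in the supply-chain row lends itself to a mechanical check. The following is a minimal sketch, not the repository's implementation: the `services` mapping, its `stateful` flag, and both function names are assumptions made for illustration, and ADR-011 remains the authority on which services count as stateful.

```python
import re

# Matches "name:tag@sha256:<64 hex chars>"; the tag may not contain "/" or ":".
_PINNED = re.compile(r"[^\s@]+:[^\s@:/]+@sha256:[0-9a-f]{64}")


def is_pinned(image: str) -> bool:
    """Return True when the reference carries both a tag and a sha256 digest."""
    return _PINNED.fullmatch(image) is not None


def pin_violations(services: dict) -> list:
    """Names of stateful services whose image is not pinned as tag@digest.

    `services` is a hypothetical mapping: name -> {"image": str, "stateful": bool}.
    Stateless services may follow a rolling tag, so they are skipped.
    """
    return sorted(
        name
        for name, spec in services.items()
        if spec.get("stateful") and not is_pinned(spec["image"])
    )
```

A check like this could run in pre-commit or CI next to gitleaks, so that an unpinned stateful image is caught in review rather than at deploy time.
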
## Security principles

Every control below should trace back to one of these:

- **Defense in depth**: no single control is load-bearing; layers compensate.
- **Least privilege**: accounts, containers, and automation get the minimum they need.
- **Deny / secure by default**: closed unless explicitly opened; safe defaults.
- **Contain the blast radius**: segment and isolate so one compromise isn't total.
- **Automated & reproducible**: the baseline is reached by Ansible, never by hand.
- **Explicit & revisitable**: decisions and accepted risks are written down and
  re-challenged, not left implicit.

## Baseline controls

The `base` role applies these controls, and they are non-negotiable: the role runs
first, on every host, every time. Each heading tags the threat(s) it primarily serves.

### Access & authentication (*opportunistic, agent error*)

- SSH key authentication only; password auth disabled (`PasswordAuthentication no`)
- Root login disabled (`PermitRootLogin no`)
- Dedicated `ansible` user with locked-down sudo (NOPASSWD for automation)
- No shared user accounts; per-person SSH keys in `group_vars/all/vars.yml`

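
To make the two SSH directives above verifiable, a small audit helper can read an `sshd_config` and report what is missing. This is a hedged sketch under stated assumptions: the function names are invented here, `Include` files and `Match` blocks are deliberately ignored, and only whitespace-separated `Keyword value` lines are parsed. It relies on the documented sshd behaviour that the first value obtained for a keyword is the one used.

```python
# Directives the baseline requires, keyed by lowercase keyword.
REQUIRED = {"passwordauthentication": "no", "permitrootlogin": "no"}


def effective_settings(config_text: str) -> dict:
    """Collect global sshd settings; the first occurrence of a keyword wins."""
    settings = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        if key == "match":
            break  # conditional blocks are out of scope for this sketch
        settings.setdefault(key, value)
    return settings


def violations(config_text: str) -> list:
    """Required keywords that are absent or set to a different value."""
    found = effective_settings(config_text)
    return sorted(key for key, want in REQUIRED.items() if found.get(key) != want)
```

An absent directive is treated as a violation on purpose, because the OpenSSH defaults (`PasswordAuthentication yes`, `PermitRootLogin prohibit-password`) do not meet the baseline.
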
### Firewall (*opportunistic, blast radius, agent error*)

- `nftables` (native on Debian 13, replaces iptables)
- Default policy: deny inbound, allow established/related, allow loopback
- Rules managed entirely by Ansible, never edited manually on hosts
- Port definitions live in `group_vars/` so rules stay in sync with deployed services
- Docker's own iptables rules are disabled; nftables manages all filtering

> **Note on Docker + nftables**: By default Docker inserts its own iptables rules, so
> published container ports can bypass a host firewall. This is addressed by setting
> `"iptables": false` in the Docker daemon config and managing all rules via nftables
> explicitly. See `docs/decisions/004-docker-model.md`.

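
The default policy above can be illustrated by rendering an input chain from a list of ports. This is only a sketch: in the repository the ruleset would presumably come from an Ansible template fed by `group_vars/`, and `render_ruleset` and its argument are hypothetical. It covers inbound TCP only and leaves out the forward chain and the NAT rules that containers need once Docker's own iptables handling is disabled.

```python
def render_ruleset(tcp_ports) -> str:
    """Render a deny-by-default nftables input chain for the given TCP ports."""
    ports = ", ".join(str(port) for port in sorted(set(tcp_ports)))
    lines = [
        "table inet filter {",
        "  chain input {",
        "    type filter hook input priority 0; policy drop;",
        '    iif "lo" accept',
        "    ct state established,related accept",
    ]
    if ports:
        # Anonymous set: one rule covers every declared port.
        lines.append(f"    tcp dport {{ {ports} }} accept")
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ruleset([22, 443]), end="")
```

Because the port list is the only input, a port that is not declared in the variables never appears in the ruleset, which is the property the "no ad-hoc ports" rule depends on.
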
### Intrusion deterrence (*opportunistic*)

- `fail2ban` monitoring SSH (and optionally reverse proxy logs)
- Configured to ban after 5 failed attempts, 1-hour ban

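
The ban policy maps onto two `fail2ban` jail options, `maxretry` and `bantime`. The sketch below renders a jail fragment with those values. The function name and output shape are assumptions for illustration; the source does not specify `findtime`, so it is left at fail2ban's default.

```python
import configparser
import io


def render_sshd_jail(maxretry: int = 5, bantime_seconds: int = 3600) -> str:
    """Render an INI fragment enabling the sshd jail with the baseline policy."""
    parser = configparser.ConfigParser()
    parser["sshd"] = {
        "enabled": "true",
        "maxretry": str(maxretry),        # failed attempts before a ban
        "bantime": str(bantime_seconds),  # ban length in seconds (1 hour)
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()
```

A fragment like this would normally live in a drop-in such as `jail.local` rather than in the packaged `jail.conf`, so package upgrades do not overwrite it.
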
### Updates (*opportunistic*)

- `unattended-upgrades` enabled for **security patches only**
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
- No automatic reboots; reboots are a conscious operational decision

### Minimal attack surface (*opportunistic, blast radius*)

- No unnecessary packages installed
- Docker daemon TCP socket disabled; Unix socket only
- No open ports beyond those explicitly defined in firewall rules

### Audit trail (*agent error, blast radius*)

- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location in near real time: all logs go to an on-cluster
  Loki, and a security-relevant subset is also shipped write-only off-site to `askari`
  so the audit trail survives host (and full-cluster) compromise (ADR-018)

### Mandatory access control (*blast radius*)

- **AppArmor** enabled with profiles in enforce mode. It is the Debian-native MAC,
  on by default, and required by the CIS Debian benchmark. Docker applies its
  `docker-default` profile to containers; tighter per-service profiles are authored
  as needed.
- **SELinux is not used**: it is non-native to Debian and redundant with AppArmor
  (see `docs/security/accepted-risks.md`).

### File integrity & intrusion detection (*opportunistic, blast radius, agent error*)

- **AIDE** file-integrity monitoring (required by the CIS Debian benchmark), which
  detects unexpected changes to system files
- **Network IDS**: Suricata on OPNsense (planned; see STATUS.md / TODO)
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata, plus
  log-source silence (a host that stops shipping logs), into Grafana alerting on the
  Loki/Grafana stack (ADR-018; planned)

## Secrets management (*agent error, opportunistic*)

- Ansible Vault for all secrets (API keys, passwords, certificates), structured as a
  nested `vault.<service>.<key>` map (ADR-003)
- The master vault password lives in **Vaultwarden** and is fetched on demand by
  `scripts/vault-pass-client.sh` (wired as `vault_password_file`) through the `rbw`
  agent. It is never written to a plaintext file on disk. Unlock once per session
  with `rbw unlock`; nothing decryptable sits at rest in the repo or working tree
- See `docs/runbooks/rotate-secrets.md` for `rbw` setup and rotation

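
The nested `vault.<service>.<key>` layout means every secret is addressed by a two-part path under a single top-level key. The sketch below shows that lookup on a plain dictionary. It is illustrative only: `vault_lookup`, the service name `nextcloud`, and the key `admin_password` are made-up examples, and in practice Ansible resolves these paths itself through normal variable access.

```python
def vault_lookup(vault: dict, dotted_path: str):
    """Resolve a "<service>.<key>" path inside the decrypted vault mapping.

    Raises KeyError with the full path so a missing secret is easy to trace.
    """
    node = vault
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"missing vault key: {dotted_path}")
        node = node[part]
    return node


if __name__ == "__main__":
    # Placeholder data only; real values live encrypted in Ansible Vault.
    example = {"nextcloud": {"admin_password": "not-a-real-secret"}}
    print(vault_lookup(example, "nextcloud.admin_password"))
```

Grouping secrets per service keeps each role's references confined to its own subtree, which makes rotation and review of a single service straightforward.
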
## Hardening standard

The baseline above is implemented to a recognised benchmark rather than ad hoc:

- **Hosts**: the **CIS Debian Benchmark, Levels 1 and 2**, applied by the `base`
  role. Some L2 items require separate partitions (`/tmp`, `/var`, `/var/log`,
  `/home`) with restrictive mount options (`nodev,nosuid,noexec`). That reaches into
  VM disk layout, a provisioning concern (Terraform / cloud-init, ADR-006), not just
  the `base` role.
- **Container runtime**: the **CIS Docker Benchmark**, with daemon/engine settings in
  the `docker_host` role and per-container run settings (non-root, read-only rootfs,
  dropped capabilities, no `privileged`, no host namespaces) enforced via
  `docs/security/service-checklist.md`.
- **Application containers**: no CIS benchmark exists for the app long tail
  (Jellyfin, Nextcloud, Forgejo, …); they are covered by the CIS Docker run settings
  plus the service checklist plus upstream hardening guidance.

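
The per-container run settings listed above can be expressed as a check over a Compose-style service definition. This is a minimal sketch, not the contents of `docs/security/service-checklist.md`: the function name and finding strings are assumptions, and the input is assumed to be one already-parsed service mapping using the Compose keys `user`, `read_only`, `cap_drop`, `privileged`, `network_mode`, `pid`, and `ipc`.

```python
def run_setting_findings(service: dict) -> list:
    """List the ways a Compose-style service departs from the run settings."""
    findings = []
    # An unset user is flagged conservatively, even though the image itself
    # may switch to a non-root user.
    user = str(service.get("user", "")).split(":")[0]
    if user in ("", "0", "root"):
        findings.append("runs as root or user not set")
    if service.get("read_only") is not True:
        findings.append("root filesystem is writable")
    if "ALL" not in service.get("cap_drop", []):
        findings.append("capabilities not dropped")
    if service.get("privileged") is True:
        findings.append("privileged mode")
    for key in ("network_mode", "pid", "ipc"):
        if service.get(key) == "host":
            findings.append(f"host namespace: {key}")
    return findings
```

An empty list means the service meets the generic run settings; anything else is either fixed or recorded as a named exception with a rationale.
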
Hardening controls are **implemented as local roles** (per the no-Galaxy-roles
policy, ADR-003), using the CIS benchmarks and community roles (e.g. `dev-sec`) only
as reference. Any specific CIS item that proves impractical is exempted into
`docs/security/accepted-risks.md` with a rationale, so the register records named
exceptions, not a blanket opt-out.

## Governance

Security is maintained, not achieved once. This ADR **establishes** four
mechanisms; each lives where change is cheap and is linked from here.

- **Per-service security bar**: every exposed service must clear a defined
  checklist before deploy (secrets in vault, no default creds, least-privilege /
  non-root, declared firewall ports, reverse-proxy + auth if exposed). The generic
  bar lives in `docs/security/service-checklist.md`, and each service
  records how it meets the bar (plus service-specific hardening) in its own
  `roles/<service>/SECURITY.md`, created from `docs/security/service-security-template.md`
  (ADR-004). The bar is enforced manually in review today; the planned
  `/security-review` skill will aggregate every `roles/*/SECURITY.md` and cross-check
  it against the role's config.
- **Periodic security review**: a recurring review that re-checks posture,
  surfaces drift, and re-challenges accepted risks. Planned as a `/security-review`
  skill (sibling to `/review-repo`); see `docs/TODO.md` (Scheduled work). It is not
  built yet; see STATUS.md.
- **Accepted-risk register**: the conscious trade-offs we choose to live with, each
  with rationale and a revisit trigger. Lives in `docs/security/accepted-risks.md`
  (expected to change; kept out of this ADR so the ADR stays stable).
- **Agent / automation guardrails**: what AI agents and automation may do
  unsupervised vs. what needs a human gate, since operator/agent error is in the
  threat model. Encoded in `CLAUDE.md` ("What Claude must not do without explicit
  instruction") and enforced by PreToolUse hooks (generated-file guard, `rbw`
  pre-flight).

## Decision

This posture was chosen to be:

- **Effective** against the stated threat model (opportunistic external, lateral
  movement, operator/agent error)
- **Maintainable** by a small team without security-expertise overhead
- **Automated**: no manual steps to reach baseline state
- **Legible & revisitable**: the threat model, principles, and accepted risks are
  written down and reviewed over time, not implicit
- **Benchmarked**: host and container hardening follow CIS (Debian L1+L2, Docker),
  not ad-hoc choices

Out-of-scope items and conscious trade-offs are recorded in
`docs/security/accepted-risks.md` rather than here, so this decision record stays
stable while the risk posture evolves.