
ADR-002 — Security baseline and strategy

Status

Accepted (2026-05-30)

Context

Security here is not a single control but the sum of several combined efforts — host hardening, network segmentation, secrets handling, supply-chain hygiene, and disciplined automation. This ADR is the frame that organizes them: it records the threat model we design against, the principles every control serves, the host-level baseline the base role enforces, and the governance that keeps security sharp as the homelab grows.

The goal is a principled, maintainable posture for a homelab with some public-facing services — effective against a realistic threat model, not a compliance exercise.

Related decisions: network segmentation (ADR-007), secrets structure (ADR-003), per-service roles (ADR-004), CI secret-scanning (ADR-010).

Threat model

What we deliberately design against — and, just as importantly, what we do not:

| Threat | In scope? | What it drives |
|---|---|---|
| Opportunistic external: bots scanning, credential stuffing, mass-exploiting known CVEs in exposed services | Yes (primary) | SSH key-only + fail2ban, deny-by-default firewall, security auto-patching, minimal attack surface, services behind a reverse proxy with auth |
| Lateral movement / blast radius: assume one service is compromised; limit how far it spreads | Yes | VLAN segmentation (ADR-007), least-privilege containers, no host network mode, per-service isolation, no shared credentials |
| Operator / agent error: accidental secret leak, misconfiguration, or an AI agent making an unsafe change | Yes | Vault + gitleaks, declarative firewall (no ad-hoc ports), review gates, agent guardrails (below), pre-commit hooks |
| Supply chain: compromised images, base images, dependencies, collections | Acknowledged, lower priority | Baseline hygiene required: tiered image pinning (stateful tag@digest, stateless rolling; ADR-011) + prefer official/verified images, gitleaks. Active vuln scanning deferred (accepted risk) |
| Targeted / physical: a determined adversary specifically after this homelab, or physical device access | Out of scope | Not designed against at this scale; revisit if the threat model changes |

Supply chain is consciously deprioritized, not forgotten — see docs/security/accepted-risks.md.
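
The tiered pinning rule in the table (stateful services pin tag@digest, stateless services may follow a rolling tag) can be checked mechanically. The following is a minimal sketch only: the function name and the `stateful` flag are hypothetical, and how ADR-011 actually classifies services is not described here.

```python
import re
from typing import Optional

# Assumption: digests are sha256, the form Docker registries commonly use.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def pinning_problem(image: str, stateful: bool) -> Optional[str]:
    """Return a description of the pinning problem, or None if acceptable."""
    if not stateful:
        return None  # stateless tier: a rolling tag is allowed
    if DIGEST_RE.search(image) is None:
        return f"{image}: stateful service must pin a digest"
    name_and_tag = image[: image.index("@")]
    # Only look for ':' after the last '/', so a registry port is not
    # mistaken for a tag (e.g. registry.example:5000/postgres).
    last_segment = name_and_tag.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return f"{image}: stateful service must carry a tag alongside the digest"
    return None
```

A check like this could run in pre-commit next to gitleaks, but that wiring is an assumption, not something this ADR specifies.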

Security principles

Every control below should trace back to one of these:

  • Defense in depth — no single control is load-bearing; layers compensate.
  • Least privilege — accounts, containers, and automation get the minimum they need.
  • Deny / secure by default — closed unless explicitly opened; safe defaults.
  • Contain the blast radius — segment and isolate so one compromise isn't total.
  • Automated & reproducible — the baseline is reached by Ansible, never by hand.
  • Explicit & revisitable — decisions and accepted risks are written down and re-challenged, not left implicit.

Baseline controls

Applied by the base role, non-negotiable — it runs first, on every host, every time. Each heading tags the threat(s) it primarily serves.

Access & authentication — opportunistic, agent error

  • SSH key authentication only — password auth disabled
  • Root login disabled — PermitRootLogin no
  • Dedicated ansible user with locked-down sudo (NOPASSWD for automation)
  • No shared user accounts — per-person SSH keys in group_vars/all/vars.yml
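
The two SSH settings above lend themselves to a simple drift check. This is a sketch under stated assumptions: the function name is hypothetical, only the global section of `sshd_config` is inspected (anything after the first `Match` line is ignored), and `Include` files are not followed.

```python
from typing import Dict, List

# Keywords are case-insensitive in sshd_config, so compare in lower case.
REQUIRED = {"passwordauthentication": "no", "permitrootlogin": "no"}


def sshd_violations(config_text: str) -> List[str]:
    """List required settings that are missing or set to another value."""
    effective: Dict[str, str] = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        if key == "match":
            break  # conditional blocks are out of scope for this sketch
        # sshd uses the first value it obtains for each keyword.
        effective.setdefault(key, value)
    return [
        f"{key} should be {want!r}, found {effective.get(key)!r}"
        for key, want in sorted(REQUIRED.items())
        if effective.get(key) != want
    ]
```

A missing keyword is reported as a violation on purpose: the baseline asks for the settings to be explicit rather than relying on the compiled-in defaults.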

Firewall — opportunistic, blast radius, agent error

  • nftables (native on Debian 13, replaces iptables)
  • Default policy: deny inbound, allow established/related, allow loopback
  • Rules managed entirely by Ansible — never edited manually on hosts
  • Port definitions live in group_vars/ so rules stay in sync with deployed services
  • Docker's own iptables rules are disabled — nftables manages all filtering

Note on Docker + nftables: Docker inserts its own iptables rules, which has historically let published container ports bypass the host firewall policy. This is addressed by setting "iptables": false in the Docker daemon config and managing all rules via nftables explicitly. See docs/decisions/004-docker-model.md.
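
To illustrate how a default-deny input chain can be derived from declared ports, here is a minimal sketch. The function names and the flat list of TCP ports are hypothetical; the actual structure of the port definitions in group_vars/ is not shown in this ADR. Forward and NAT rules, which container traffic needs once Docker stops managing iptables, are deliberately left out.

```python
import json
from typing import Iterable


def render_input_chain(tcp_ports: Iterable[int]) -> str:
    """Render an nftables input chain: drop by default, allow listed ports."""
    lines = [
        "table inet filter {",
        "  chain input {",
        "    type filter hook input priority 0; policy drop;",
        '    iif "lo" accept',
        "    ct state established,related accept",
    ]
    for port in sorted(set(tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        lines.append(f"    tcp dport {port} accept")
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"


def docker_daemon_json() -> str:
    """Daemon config fragment that stops Docker from writing iptables rules."""
    return json.dumps({"iptables": False}, indent=2)
```

Sorting and de-duplicating the ports keeps the rendered file stable between runs, which matters when the output is managed declaratively.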

Intrusion deterrence — opportunistic

  • fail2ban monitoring SSH (and optionally reverse proxy logs)
  • Bans an address for 1 hour after 5 failed attempts
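
The ban policy above maps onto fail2ban's `maxretry` and `bantime` jail options. As a sketch, the jail fragment could be rendered like this; the function name is hypothetical, and `findtime` is left at fail2ban's default because this ADR does not state a counting window.

```python
import configparser
import io


def render_jail_local(maxretry: int = 5, bantime_seconds: int = 3600) -> str:
    """Render a jail.local fragment enabling the sshd jail."""
    cfg = configparser.ConfigParser()
    cfg["sshd"] = {
        "enabled": "true",
        "maxretry": str(maxretry),        # failed attempts before a ban
        "bantime": str(bantime_seconds),  # ban duration in seconds
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()
```

In the repository the same values would more naturally live in an Ansible template; the Python form is only meant to make the mapping from policy to options explicit.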

Updates — opportunistic

  • unattended-upgrades enabled for security patches only
  • Full system upgrades triggered deliberately via Ansible (planned — a dedicated upgrade playbook per ADR-011; not yet built, no upgrade.yml exists today)
  • No automatic reboots — reboots are a conscious operational decision

Minimal attack surface — opportunistic, blast radius

  • No unnecessary packages installed
  • Docker daemon TCP socket disabled — Unix socket only
  • No open ports beyond those explicitly defined in firewall rules

Audit trail — agent error, blast radius

  • auditd installed and running with a baseline ruleset
  • Logs shipped to a central location in near-real-time: all logs go to an on-cluster Loki, and a security-relevant subset is also written write-only off-site to askari, so the audit trail survives host (and full-cluster) compromise (ADR-018)

Mandatory access control — blast radius

  • AppArmor enabled with profiles in enforce mode — Debian-native MAC, default-on, and required by the CIS Debian benchmark. Docker applies its docker-default profile to containers; tighter per-service profiles are authored as needed.
  • SELinux is not used — non-native to Debian and redundant with AppArmor (see docs/security/accepted-risks.md).

File integrity & intrusion detection — opportunistic, blast radius, agent error

  • AIDE file-integrity monitoring (required by the CIS Debian benchmark) — detects unexpected changes to system files
  • Network IDS — Suricata on OPNsense (planned; see STATUS.md / TODO)
  • Active alerting wires AIDE, auditd, fail2ban, and Suricata — plus log-source-silence (a host that stops shipping) — into Grafana alerting on the Loki/Grafana stack (ADR-018; planned)
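
The log-source-silence condition mentioned above amounts to comparing each host's last shipped log against a threshold. A minimal sketch of that logic follows; the function name, the 15-minute default, and the shape of the `last_seen` mapping are all assumptions, and the planned alert would presumably be expressed as a Grafana rule rather than in Python.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def silent_hosts(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=15),  # hypothetical threshold
) -> List[str]:
    """Return hosts whose most recent log line is older than max_silence."""
    return sorted(
        host for host, seen in last_seen.items() if now - seen > max_silence
    )
```

A host that stops shipping is treated as a signal in its own right, since silence can mean a crashed shipper or tampering with the audit trail.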

Secrets management — agent error, opportunistic

  • Ansible Vault for all secrets (API keys, passwords, certificates), structured as a nested vault.<service>.<key> map (ADR-003)
  • The master vault password lives in Vaultwarden and is fetched on demand by scripts/vault-pass-client.sh (wired as vault_password_file) through the rbw agent — never written to a plaintext file on disk. Unlock once per session with rbw unlock; nothing decryptable sits at rest in the repo or working tree
  • See docs/runbooks/rotate-secrets.md for rbw setup and rotation
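
The nested vault.<service>.<key> layout can be pictured as a two-level mapping. The sketch below resolves a dotted path against such a mapping and fails loudly on a missing entry; the function name and the sample keys are hypothetical, and in practice Ansible resolves these variables itself.

```python
from typing import Any, Mapping


def vault_lookup(vault: Mapping[str, Any], path: str) -> Any:
    """Resolve 'vault.<service>.<key>' against a nested mapping."""
    parts = path.split(".")
    if len(parts) != 3 or parts[0] != "vault" or not all(parts):
        raise ValueError(f"expected vault.<service>.<key>, got {path!r}")
    _, service, key = parts
    service_secrets = vault.get(service)
    if not isinstance(service_secrets, Mapping) or key not in service_secrets:
        # Name the path, never the value, so errors are safe to log.
        raise KeyError(f"missing secret: {path}")
    return service_secrets[key]
```

For example, `vault_lookup({"grafana": {"admin_password": "..."}}, "vault.grafana.admin_password")` returns the stored value, while a misspelled service or key raises `KeyError`.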

Hardening standard

The baseline above is implemented to a recognised benchmark rather than ad-hoc:

  • Hosts — the CIS Debian Benchmark, Levels 1 and 2, applied by the base role. Some L2 items require separate partitions (/tmp, /var, /var/log, /home) with restrictive mount options (nodev,nosuid,noexec) — that reaches into VM disk layout, a provisioning concern (Terraform / cloud-init, ADR-006), not just the base role.
  • Container runtime — the CIS Docker Benchmark: daemon/engine settings in the docker_host role; per-container run settings (non-root, read-only rootfs, dropped capabilities, no privileged, no host namespaces) enforced via docs/security/service-checklist.md.
  • Application containers — no CIS benchmark exists for the app long tail (Jellyfin, Nextcloud, Forgejo, …); they are covered by the CIS Docker run settings plus the service checklist plus upstream hardening guidance.
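
The per-container run settings listed under Container runtime (non-root, read-only rootfs, dropped capabilities, no privileged, no host namespaces) can be checked against a Compose-style service definition. This is a sketch, not the contents of docs/security/service-checklist.md: the function name is hypothetical, and requiring `cap_drop: [ALL]` is an assumption about what "dropped capabilities" means here.

```python
from typing import Any, Dict, List


def run_setting_findings(service: Dict[str, Any]) -> List[str]:
    """Return hardening gaps in one Compose service mapping."""
    findings = []
    user = str(service.get("user", ""))
    # A missing 'user' is flagged even though the image may set USER itself;
    # the sketch prefers an explicit declaration.
    if user in ("", "0", "root") or user.startswith(("0:", "root:")):
        findings.append("runs as root or user not declared")
    if service.get("read_only") is not True:
        findings.append("root filesystem is writable")
    if "ALL" not in service.get("cap_drop", []):
        findings.append("capabilities not dropped")
    if service.get("privileged"):
        findings.append("privileged mode enabled")
    for key in ("network_mode", "pid", "ipc"):
        if service.get(key) == "host":
            findings.append(f"{key} shares the host namespace")
    return findings
```

An empty list means the service clears these particular checks; it says nothing about the service-specific hardening that the checklist also asks for.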

Hardening controls are implemented as local roles (per the no-Galaxy-roles policy, ADR-003), using the CIS benchmarks and community roles (e.g. dev-sec) only as reference. Any specific CIS item that proves impractical is exempted into docs/security/accepted-risks.md with a rationale — so the register records named exceptions, not a blanket opt-out.

Governance

Security is maintained, not achieved once. This ADR establishes four mechanisms; each lives where change is cheap and is linked from here.

  • Per-service security bar — every exposed service must clear a defined checklist before deploy (secrets in vault, no default creds, least-privilege / non-root, declared firewall ports, reverse-proxy + auth if exposed). The generic bar lives in docs/security/service-checklist.md, and each service records how it meets the bar (plus service-specific hardening) in its own roles/<service>/SECURITY.md, created from docs/security/service-security-template.md (ADR-004). Enforced manually in review today; the planned /security-review aggregates every roles/*/SECURITY.md and cross-checks it against the role's config.
  • Periodic security review — a recurring review that re-checks posture, surfaces drift, and re-challenges accepted risks. Planned as a /security-review skill (sibling to /review-repo); see docs/TODO.md (Scheduled work). Not built yet — see STATUS.md.
  • Accepted-risk register — the conscious trade-offs we choose to live with, each with rationale and a revisit trigger. Lives in docs/security/accepted-risks.md (expected to change; kept out of this ADR so the ADR stays stable).
  • Agent / automation guardrails — what AI agents and automation may do unsupervised vs. what needs a human gate, since operator/agent error is in the threat model. Encoded in CLAUDE.md ("What Claude must not do without explicit instruction") and enforced by PreToolUse hooks (generated-file guard, rbw pre-flight).
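
One small piece of the per-service bar, whether each role carries a SECURITY.md at all, is easy to check before the planned /security-review exists. The sketch below covers only that presence check; the function name is hypothetical, and it treats every directory under roles/ as a service role, which would need an exclusion list for non-service roles.

```python
from pathlib import Path
from typing import List, Union


def roles_missing_security_md(roles_dir: Union[str, Path]) -> List[str]:
    """Return names of role directories that lack a SECURITY.md file."""
    root = Path(roles_dir)
    return sorted(
        role.name
        for role in root.iterdir()
        if role.is_dir() and not (role / "SECURITY.md").is_file()
    )
```

Running it as `roles_missing_security_md("roles")` from the repository root would list the roles that still need the file created from the template.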

Decision

This posture was chosen to be:

  • Effective against the stated threat model (opportunistic external, lateral movement, operator/agent error)
  • Maintainable by a small team without security-expertise overhead
  • Automated — no manual steps to reach baseline state
  • Legible & revisitable — the threat model, principles, and accepted risks are written down and reviewed over time, not implicit
  • Benchmarked — host and container hardening follow CIS (Debian L1+L2, Docker), not ad-hoc choices

Out-of-scope items and conscious trade-offs are recorded in docs/security/accepted-risks.md rather than here, so this decision record stays stable while the risk posture evolves.

Consequences

Drawn from the trade-offs, scoping, and follow-on work this ADR already states:

  • Targeted/physical adversaries are out of scope at this scale, and supply chain is consciously deprioritized — active vuln scanning is deferred as an accepted risk (per Threat model; docs/security/accepted-risks.md).
  • SELinux is not used (non-native to Debian, redundant with AppArmor), recorded as an accepted risk (per Mandatory access control).
  • Some CIS L2 items require separate partitions with restrictive mount options, which reaches into VM disk layout — a provisioning concern (Terraform / cloud-init, ADR-006), not just the base role (per Hardening standard). Any impractical CIS item is exempted into the accepted-risk register with rationale, recording named exceptions rather than a blanket opt-out.
  • Several controls and governance mechanisms are stated as planned, not yet built: Suricata network IDS, active alerting wiring AIDE/auditd/fail2ban/Suricata plus log-source-silence into Grafana, the /security-review skill and its aggregation of every roles/*/SECURITY.md, and the periodic security review (per File integrity / Governance; STATUS.md / docs/TODO.md).
  • The per-service security bar is enforced manually in review today, pending the planned /security-review automation (per Governance).
  • The accepted-risk register is kept out of this ADR so the record stays stable while the risk posture evolves (per Decision; docs/security/accepted-risks.md).