sjat/boma

sjat 175777e36a docs: reconcile 2026-06-14 review findings (O1-O7,O18,O22)

- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file
  table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope
  ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)

O9 (hosts.yml header) left open: the file is generator-owned (hook-protected);
fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-14 19:06:33 +02:00

12 KiB


ADR-002 — Security baseline and strategy

Status

Accepted (2026-05-30)

Context

Security here is not a single control but the sum of several combined efforts — host hardening, network segmentation, secrets handling, supply-chain hygiene, and disciplined automation. This ADR is the frame that organizes them: it records the threat model we design against, the principles every control serves, the host-level baseline the base role enforces, and the governance that keeps security sharp as the homelab grows.

The goal is a principled, maintainable posture for a homelab with some public-facing services — effective against a realistic threat model, not a compliance exercise.

Related decisions: network segmentation (ADR-007), secrets structure (ADR-003), per-service roles (ADR-004), CI secret-scanning (ADR-010).

Threat model

What we deliberately design against — and, just as importantly, what we do not:

Threat	In scope?	What it drives
Opportunistic external — bots scanning, credential stuffing, mass-exploiting known CVEs in exposed services	Yes — primary	SSH key-only + fail2ban, deny-by-default firewall, security auto-patching, minimal attack surface, services behind a reverse proxy with auth
Lateral movement / blast radius — assume one service is compromised; limit how far it spreads	Yes	VLAN segmentation (ADR-007), least-privilege containers, no host network mode, per-service isolation, no shared credentials
Operator / agent error — accidental secret leak, misconfiguration, or an AI agent making an unsafe change	Yes	Vault + gitleaks, declarative firewall (no ad-hoc ports), review gates, agent guardrails (below), pre-commit hooks
Supply chain — compromised images, base images, dependencies, collections	Acknowledged, lower priority	Baseline hygiene required: tiered image pinning (stateful `tag@digest`, stateless rolling — ADR-011) + prefer official/verified images, gitleaks. Active vuln scanning deferred — accepted risk
Targeted / physical — a determined adversary specifically after this homelab, or physical device access	Out of scope	Not designed against at this scale; revisit if the threat model changes

Supply chain is consciously deprioritized, not forgotten — see docs/security/accepted-risks.md.
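
The tiered pinning rule in the supply-chain row can be checked mechanically. The following is a minimal sketch, assuming a hypothetical service inventory where each entry carries an `image` string and a `stateful` flag (neither name comes from this repo); the regular expression is a simplified match for `name:tag@sha256:<digest>`, not a full OCI reference parser.

```python
import re

# Simplified pattern for "name:tag@sha256:<64 hex chars>".
# Assumption: registry hosts with a port (e.g. "host:5000/app") are not handled.
PINNED = re.compile(r"[\w./-]+:[\w.-]+@sha256:[0-9a-f]{64}")


def is_digest_pinned(image: str) -> bool:
    """True when the image reference carries both a tag and a sha256 digest."""
    return PINNED.fullmatch(image) is not None


def unpinned_stateful(services: dict[str, dict]) -> list[str]:
    """Names of stateful services whose image is not tag@digest pinned.

    Stateless services are skipped because they are allowed to roll.
    """
    return sorted(
        name
        for name, svc in services.items()
        if svc.get("stateful") and not is_digest_pinned(svc["image"])
    )
```

Stateless services are deliberately ignored here, which mirrors the "stateless rolling" half of the tiering.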

Security principles

Every control below should trace back to one of these:

Defense in depth — no single control is load-bearing; layers compensate.
Least privilege — accounts, containers, and automation get the minimum they need.
Deny / secure by default — closed unless explicitly opened; safe defaults.
Contain the blast radius — segment and isolate so one compromise isn't total.
Automated & reproducible — the baseline is reached by Ansible, never by hand.
Explicit & revisitable — decisions and accepted risks are written down and re-challenged, not left implicit.

Baseline controls

Applied by the base role, non-negotiable — it runs first, on every host, every time. Each heading tags the threat(s) it primarily serves.

Access & authentication — opportunistic, agent error

SSH key authentication only — password auth disabled
Root login disabled — PermitRootLogin no
Dedicated ansible user with locked-down sudo (NOPASSWD for automation)
No shared user accounts — per-person SSH keys in group_vars/all/vars.yml
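
As an illustration of how the first two items could be verified on a host, here is a small sketch that parses an `sshd_config` text and reports which baseline keywords are off. The helper names are hypothetical, and the parser ignores `Match` blocks and `Include` directives, so it is a spot check rather than a full evaluation of the effective configuration.

```python
# Baseline values taken from the list above; keys are lowercased keywords.
REQUIRED = {"passwordauthentication": "no", "permitrootlogin": "no"}


def parse_sshd_config(text: str) -> dict[str, str]:
    """Collect keyword/value pairs; the first occurrence of a keyword wins,
    which is how sshd itself resolves duplicates."""
    settings: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def baseline_violations(text: str) -> list[str]:
    """Keywords that are missing or not set to the required value.

    A missing keyword counts as a violation because sshd's defaults
    for both settings are more permissive than the baseline.
    """
    settings = parse_sshd_config(text)
    return [
        keyword
        for keyword, wanted in REQUIRED.items()
        if settings.get(keyword, "").lower() != wanted
    ]
```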

Firewall — opportunistic, blast radius, agent error

nftables (native on Debian 13, replaces iptables)
Default policy: deny inbound, allow established/related, allow loopback
Rules managed entirely by Ansible — never edited manually on hosts
Port definitions live in group_vars/ so rules stay in sync with deployed services
Docker's own iptables rules are disabled — nftables manages all filtering
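
To make the default policy concrete, the sketch below renders an nftables input chain from a list of TCP ports, the way a template fed from group_vars/ might. It is illustrative only: the function name, the table name `filter`, and the flat port list are assumptions, and the real role's template may look quite different.

```python
def render_ruleset(tcp_ports: list[int]) -> str:
    """Render a deny-by-default input chain that opens only the given TCP ports."""
    ports = ", ".join(str(p) for p in sorted(set(tcp_ports)))
    lines = [
        "table inet filter {",
        "  chain input {",
        # Anything not matched below is dropped.
        "    type filter hook input priority 0; policy drop;",
        "    ct state established,related accept",
        '    iif "lo" accept',
    ]
    if ports:
        lines.append(f"    tcp dport {{ {ports} }} accept")
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"
```

Because the port list is the only input, a port that is not declared in the variables never appears in the ruleset, which is the property the "no ad-hoc ports" rule relies on.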

Note on Docker + nftables: Docker inserts its own iptables rules, which historically let published container ports bypass the host firewall. This is addressed by setting "iptables": false in the Docker daemon config and managing all rules explicitly via nftables. See docs/decisions/004-docker-model.md.

Intrusion deterrence — opportunistic

fail2ban monitoring SSH (and optionally reverse proxy logs)
Bans an address for 1 hour after 5 failed attempts
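
The ban policy can be modelled in a few lines. This is a toy model of the thresholds (5 failures, 1-hour ban), not fail2ban's implementation; the 10-minute counting window is an assumption, since this ADR does not state one.

```python
from collections import defaultdict, deque

MAX_RETRY = 5        # failures before a ban (from the baseline)
BAN_SECONDS = 3600   # 1-hour ban (from the baseline)
FIND_SECONDS = 600   # assumption: window in which failures are counted


class BanTracker:
    """Tracks failed logins per address and applies the ban policy."""

    def __init__(self) -> None:
        self._failures: dict[str, deque] = defaultdict(deque)
        self._banned_until: dict[str, float] = {}

    def is_banned(self, ip: str, now: float) -> bool:
        return self._banned_until.get(ip, 0.0) > now

    def record_failure(self, ip: str, now: float) -> bool:
        """Record one failed attempt; return True if the address is now banned."""
        if self.is_banned(ip, now):
            return True
        window = self._failures[ip]
        window.append(now)
        # Forget failures that fell out of the counting window.
        while window and now - window[0] > FIND_SECONDS:
            window.popleft()
        if len(window) >= MAX_RETRY:
            self._banned_until[ip] = now + BAN_SECONDS
            window.clear()
            return True
        return False
```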

Updates — opportunistic

unattended-upgrades enabled for security patches only
Full system upgrades triggered deliberately via Ansible (planned — a dedicated upgrade playbook per ADR-011; not yet built, no upgrade.yml exists today)
No automatic reboots — reboots are a conscious operational decision

Minimal attack surface — opportunistic, blast radius

No unnecessary packages installed
Docker daemon TCP socket disabled — Unix socket only
No open ports beyond those explicitly defined in firewall rules
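
Two of the Docker daemon settings in this baseline ("iptables": false and no TCP socket) are visible in the daemon configuration file, so they lend themselves to a simple check. The sketch below assumes the settings live in a JSON daemon config; the function name is hypothetical, and it does not see listeners passed on the dockerd command line.

```python
import json


def daemon_config_problems(raw: str) -> list[str]:
    """Report baseline deviations found in a Docker daemon config (JSON text)."""
    cfg = json.loads(raw)
    problems = []
    # Docker manages iptables unless this is explicitly set to false.
    if cfg.get("iptables", True) is not False:
        problems.append("iptables must be false")
    # Any tcp:// listener exposes the daemon API over the network.
    if any(h.startswith("tcp://") for h in cfg.get("hosts", [])):
        problems.append("TCP socket must not be enabled")
    return problems
```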

Audit trail — agent error, blast radius

auditd installed and running with a baseline ruleset
Logs are shipped centrally in near real time: all logs go to an on-cluster Loki, and a security-relevant subset is also written off-site (write-only) to askari, so the audit trail survives host (and full-cluster) compromise (ADR-018)

Mandatory access control — blast radius

AppArmor enabled with profiles in enforce mode — Debian-native MAC, default-on, and required by the CIS Debian benchmark. Docker applies its docker-default profile to containers; tighter per-service profiles are authored as needed.
SELinux is not used — non-native to Debian and redundant with AppArmor (see docs/security/accepted-risks.md).

File integrity & intrusion detection — opportunistic, blast radius, agent error

AIDE file-integrity monitoring (required by the CIS Debian benchmark) — detects unexpected changes to system files
Network IDS — Suricata on OPNsense (planned; see STATUS.md / TODO)
Active alerting wires AIDE, auditd, fail2ban, and Suricata — plus log-source-silence (a host that stops shipping) — into Grafana alerting on the Loki/Grafana stack (ADR-018; planned)
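
The log-source-silence condition is the least obvious of these alerts, so a sketch may help. Assuming a hypothetical map of host name to the timestamp of its last received log line, and an assumed 15-minute threshold (this ADR does not set one), the check reduces to:

```python
def silent_hosts(
    last_seen: dict[str, float], now: float, max_silence: float = 900.0
) -> list[str]:
    """Hosts whose most recent log line is older than max_silence seconds.

    last_seen maps host name to a Unix timestamp. The 900-second default
    is an illustrative assumption, not a value from ADR-018.
    """
    return sorted(
        host for host, ts in last_seen.items() if now - ts > max_silence
    )
```

In the planned setup this logic would presumably live in a Grafana alert rule over Loki rather than in Python; the function only shows the condition being alerted on.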

Secrets management — agent error, opportunistic

Ansible Vault for all secrets (API keys, passwords, certificates), structured as a nested vault.<service>.<key> map (ADR-003)
The master vault password lives in Vaultwarden and is fetched on demand by scripts/vault-pass-client.sh (wired as vault_password_file) through the rbw agent — never written to a plaintext file on disk. Unlock once per session with rbw unlock; nothing decryptable sits at rest in the repo or working tree
See docs/runbooks/rotate-secrets.md for rbw setup and rotation
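
The nested vault.<service>.<key> layout means every secret has exactly one dotted address. A minimal sketch of resolving such an address against already-decrypted data follows; the function name and the sample service names in the usage note are hypothetical.

```python
def vault_lookup(vault: dict, path: str) -> str:
    """Resolve a 'vault.<service>.<key>' path against decrypted vault data."""
    parts = path.split(".")
    if len(parts) != 3 or parts[0] != "vault":
        raise ValueError(f"expected vault.<service>.<key>, got {path!r}")
    _, service, key = parts
    try:
        return vault[service][key]
    except KeyError:
        # Fail loudly rather than fall back to a default credential.
        raise KeyError(f"missing secret: {path}") from None
```

For example, `vault_lookup({"forgejo": {"db_password": "s3cret"}}, "vault.forgejo.db_password")` returns the stored value, while a path for an undeclared service raises instead of silently returning nothing.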

Hardening standard

The baseline above is implemented to a recognised benchmark rather than ad-hoc:

Hosts — the CIS Debian Benchmark, Levels 1 and 2, applied by the base role. Some L2 items require separate partitions (/tmp, /var, /var/log, /home) with restrictive mount options (nodev,nosuid,noexec) — that reaches into VM disk layout, a provisioning concern (Terraform / cloud-init, ADR-006), not just the base role.
Container runtime — the CIS Docker Benchmark: daemon/engine settings in the docker_host role; per-container run settings (non-root, read-only rootfs, dropped capabilities, no privileged, no host namespaces) enforced via docs/security/service-checklist.md.
Application containers — no CIS benchmark exists for the app long tail (Jellyfin, Nextcloud, Forgejo, …); they are covered by the CIS Docker run settings plus the service checklist plus upstream hardening guidance.
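
The per-container run settings listed above (non-root, read-only rootfs, dropped capabilities, no privileged, no host namespaces) can be checked against a Compose-style service definition. The sketch below makes two assumptions that are stricter than the text: a service with no explicit `user` is treated as root, and "dropped capabilities" is read as `cap_drop: [ALL]`. The function name is hypothetical.

```python
def run_setting_violations(svc: dict) -> list[str]:
    """Check one Compose-style service mapping against the run settings."""
    problems = []
    user = str(svc.get("user", ""))
    # Assumption: no explicit user means root, even if the image sets USER.
    if user in ("", "0", "root") or user.startswith(("0:", "root:")):
        problems.append("runs as root")
    if not svc.get("read_only", False):
        problems.append("rootfs is writable")
    # Assumption: the bar is dropping ALL capabilities, then adding back as needed.
    if "ALL" not in svc.get("cap_drop", []):
        problems.append("capabilities not dropped")
    if svc.get("privileged", False):
        problems.append("privileged")
    for key in ("network_mode", "pid", "ipc"):
        if svc.get(key) == "host":
            problems.append(f"{key} uses host namespace")
    return problems
```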

Hardening controls are implemented as local roles (per the no-Galaxy-roles policy, ADR-003), using the CIS benchmarks and community roles (e.g. dev-sec) only as reference. Any specific CIS item that proves impractical is exempted into docs/security/accepted-risks.md with a rationale — so the register records named exceptions, not a blanket opt-out.

Governance

Security is maintained, not achieved once. This ADR establishes four mechanisms; each lives where change is cheap and is linked from here.

Per-service security bar — every exposed service must clear a defined checklist before deploy (secrets in vault, no default creds, least-privilege / non-root, declared firewall ports, reverse-proxy + auth if exposed). The generic bar lives in docs/security/service-checklist.md, and each service records how it meets the bar (plus service-specific hardening) in its own roles/<service>/SECURITY.md, created from docs/security/service-security-template.md (ADR-004). Enforced manually in review today; the planned /security-review aggregates every roles/*/SECURITY.md and cross-checks it against the role's config.
Periodic security review — a recurring review that re-checks posture, surfaces drift, and re-challenges accepted risks. Planned as a /security-review skill (sibling to /review-repo); see docs/TODO.md (Scheduled work). Not built yet — see STATUS.md.
Accepted-risk register — the conscious trade-offs we choose to live with, each with rationale and a revisit trigger. Lives in docs/security/accepted-risks.md (expected to change; kept out of this ADR so the ADR stays stable).
Agent / automation guardrails — what AI agents and automation may do unsupervised vs. what needs a human gate, since operator/agent error is in the threat model. Encoded in CLAUDE.md ("What Claude must not do without explicit instruction") and enforced by PreToolUse hooks (generated-file guard, rbw pre-flight).
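
One small piece of the planned /security-review aggregation can be sketched today: finding roles that have no SECURITY.md at all. This is a hypothetical helper, not the planned skill, and it assumes every directory under roles/ is a service role, which may not hold for infrastructure roles such as base.

```python
from pathlib import Path


def roles_missing_security_doc(repo_root: Path) -> list[str]:
    """Names of role directories under roles/ that lack a SECURITY.md file."""
    roles_dir = repo_root / "roles"
    if not roles_dir.is_dir():
        return []
    return sorted(
        role.name
        for role in roles_dir.iterdir()
        if role.is_dir() and not (role / "SECURITY.md").is_file()
    )
```

A check like this only proves the file exists; cross-checking its content against the role's config remains the job of the review itself.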

Decision

This posture was chosen to be:

Effective against the stated threat model (opportunistic external, lateral movement, operator/agent error)
Maintainable by a small team without security-expertise overhead
Automated — no manual steps to reach baseline state
Legible & revisitable — the threat model, principles, and accepted risks are written down and reviewed over time, not implicit
Benchmarked — host and container hardening follow CIS (Debian L1+L2, Docker), not ad-hoc choices

Out-of-scope items and conscious trade-offs are recorded in docs/security/accepted-risks.md rather than here, so this decision record stays stable while the risk posture evolves.

Consequences

Drawn from the trade-offs, scoping, and follow-on work this ADR already states:

Targeted/physical adversaries are out of scope at this scale, and supply chain is consciously deprioritized — active vuln scanning is deferred as an accepted risk (per Threat model; docs/security/accepted-risks.md).
SELinux is not used (non-native to Debian, redundant with AppArmor), recorded as an accepted risk (per Mandatory access control).
Some CIS L2 items require separate partitions with restrictive mount options, which reaches into VM disk layout — a provisioning concern (Terraform / cloud-init, ADR-006), not just the base role (per Hardening standard). Any impractical CIS item is exempted into the accepted-risk register with rationale, recording named exceptions rather than a blanket opt-out.
Several controls and governance mechanisms are stated as planned, not yet built: Suricata network IDS, active alerting wiring AIDE/auditd/fail2ban/Suricata plus log-source-silence into Grafana, the /security-review skill and its aggregation of every roles/*/SECURITY.md, and the periodic security review (per File integrity / Governance; STATUS.md / docs/TODO.md).
The per-service security bar is enforced manually in review today, pending the planned /security-review automation (per Governance).
The accepted-risk register is kept out of this ADR so the record stays stable while the risk posture evolves (per Decision; docs/security/accepted-risks.md).
