boma/docs/security/accepted-risks.md

# Accepted security risks

Conscious security trade-offs we are choosing to live with — recorded so "what we
are *not* doing" is explicit and revisitable, not forgotten. This register is a
**living document**, deliberately kept out of ADR-002 (which records durable
decisions) so the ADR stays stable.

Owned by **ADR-002** (Security baseline and strategy). Re-challenged during the
periodic security review (planned `/security-review`; see `docs/TODO.md`).

**Each entry:** the risk · why we accept it (rationale) · what would make us
revisit (trigger).

| # | Accepted risk | Rationale | Revisit trigger |
|---|---|---|---|
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and STUN (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh (NetBird v0.72.4 embeds STUN in the combined server — no separate Coturn) | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
| R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |
| R6 | **`le-prod-wildcard` integration runs** — when `CERTS=le-prod-wildcard` is passed to `make test-integration`, the production Gandi PAT (`vault.gandi.pat`) is passed to an ephemeral local test VM via the var overlay, and transient `_acme-challenge` TXT records are written into the real `wingu.me` DNS zone to satisfy the Let's Encrypt DNS-01 challenge. A compromised or long-lived test VM could exfiltrate the PAT; the real zone is briefly (seconds) modified | Scope is **on-demand only** — `le-staging` is the default cert tier (`CERTS=internal` for incident repro); `le-prod-wildcard` is an explicit opt-in. Compensating controls: the VM is ephemeral and destroyed on success; it sits on an isolated libvirt NAT network (no LAN/mesh access); TXT records are auto-removed by Caddy immediately after validation; the PAT is not persisted inside the VM after the run. ADR-025 documents the cert-tier design and the three isolation invariants | The PAT is exfiltrated from a test VM; the `wingu.me` zone shows unexpected records; a `CERTS=le-prod-wildcard` run must be audited or the tier must be revoked |

_Last reviewed: 2026-06-11. The prior gaps (full CIS hardening, SELinux/AppArmor,
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build
status. As CIS is implemented, any specific item that proves impractical is added
here as a named exception._
Expand ADR-002 into a security baseline + strategy Add a managerial security frame on top of the host baseline: explicit threat model (opportunistic external, lateral movement/blast radius, operator/agent error; supply chain accepted-lower-priority), security principles, and four governance mechanisms that ADR-002 establishes and links out to: - docs/security/service-checklist.md — per-service security bar (referenced from the new-role runbook) - docs/security/accepted-risks.md — living accepted-risk register (R1-R4) - planned /security-review skill (TODO 8.5) - agent guardrails in CLAUDE.md "what Claude must not do" STATUS.md records the frame as present (manual enforcement) and /security-review as planned-not-built. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-04 14:39:51 +02:00			`# Accepted security risks`

			`Conscious security trade-offs we are choosing to live with — recorded so "what we`
			`are not doing" is explicit and revisitable, not forgotten. This register is a`
Re-challenge accepted risks; adopt CIS hardening + IDS Walked the seeded accepted-risk register (R1-R4) and turned inherited gaps into deliberate decisions: - Supply chain (R1): tightened to required baseline hygiene (digest pinning, official/verified images); active scanning deferred — stays an accepted risk - CIS (R2): adopted as a positive decision — CIS Debian L1+L2 (base role) + CIS Docker (docker_host + service checklist); app layer via the checklist - SELinux/AppArmor (R3): AppArmor becomes a baseline control (CIS-enforced); register keeps a clean "no SELinux" accept - IDS (R4): adopt AIDE (baseline via CIS) + Suricata on OPNsense + active alerting Register shrinks from 4 inherited gaps to 2 deliberate accepts. ADR-002 gains a Hardening standard section; STATUS + TODO 15 track the (unbuilt) implementation, including the CIS L2 partition impact on VM provisioning (ADR-006). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-04 15:15:39 +02:00			`living document, deliberately kept out of ADR-002 (which records durable`
			`decisions) so the ADR stays stable.`
Expand ADR-002 into a security baseline + strategy Add a managerial security frame on top of the host baseline: explicit threat model (opportunistic external, lateral movement/blast radius, operator/agent error; supply chain accepted-lower-priority), security principles, and four governance mechanisms that ADR-002 establishes and links out to: - docs/security/service-checklist.md — per-service security bar (referenced from the new-role runbook) - docs/security/accepted-risks.md — living accepted-risk register (R1-R4) - planned /security-review skill (TODO 8.5) - agent guardrails in CLAUDE.md "what Claude must not do" STATUS.md records the frame as present (manual enforcement) and /security-review as planned-not-built. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-04 14:39:51 +02:00
			`Owned by ADR-002 (Security baseline and strategy). Re-challenged during the`
			periodic security review (planned `/security-review`; see `docs/TODO.md`).

			`Each entry: the risk · why we accept it (rationale) · what would make us`
			`revisit (trigger).`

			`\| # \| Accepted risk \| Rationale \| Revisit trigger \|`
			`\|---\|---\|---\|---\|`
Reconcile image pinning to a tiered tag@digest rule Resolve the conflict between ADR-011 (tags-not-digests) and the security work (digest pinning) with one coherent rule that respects ADR-011's stateless/stateful split: - Stateful → pin `tag@digest` (readable tag + integrity digest): legible diffs AND tamper-evidence. Snapshots cover broken updates; the digest covers swapped images. - Stateless → rolling tags (latest/stable); digest-pinning would defeat the rolling design. Integrity rests on official/verified images + disposability. Aligned across ADR-011 (decision 2), ADR-004 (image management), ADR-002 (supply-chain row), accepted-risk R1, the service checklist, and TODO 15.6. TODO 16.7 marked decided. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-04 19:21:36 +02:00			\| R1 \| Active supply-chain scanning deferred — baseline hygiene is required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified \| Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack \| A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise \|
Re-challenge accepted risks; adopt CIS hardening + IDS Walked the seeded accepted-risk register (R1-R4) and turned inherited gaps into deliberate decisions: - Supply chain (R1): tightened to required baseline hygiene (digest pinning, official/verified images); active scanning deferred — stays an accepted risk - CIS (R2): adopted as a positive decision — CIS Debian L1+L2 (base role) + CIS Docker (docker_host + service checklist); app layer via the checklist - SELinux/AppArmor (R3): AppArmor becomes a baseline control (CIS-enforced); register keeps a clean "no SELinux" accept - IDS (R4): adopt AIDE (baseline via CIS) + Suricata on OPNsense + active alerting Register shrinks from 4 inherited gaps to 2 deliberate accepts. ADR-002 gains a Hardening standard section; STATUS + TODO 15 track the (unbuilt) implementation, including the CIS L2 partition impact on VM provisioning (ADR-006). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-04 15:15:39 +02:00			`\| R2 \| SELinux not used — no SELinux mandatory access control \| AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain \| A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers \|`
docs(netbird): M4b done — STATUS/ROADMAP/risks/friction netbird_coordinator built + applied to askari (first service role, dashboard live). STATUS: new "real and working" row + askari/coordinator rows updated. ROADMAP: M4b done, M5 (peer enrol) next, recorded the v0.72.4 combined-container/embedded-Dex/ no-Coturn reality. accepted-risks R3: Coturn -> STUN wording. FRICTION: single-file bind-mount stale-inode gotcha + check-before-first-deploy artifact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-16 07:48:53 +02:00			\| R3 \| Self-hosted mesh control plane is a public target on `askari` — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and STUN (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh (NetBird v0.72.4 embeds STUN in the combined server — no separate Coturn) \| Self-hosting means no third-party trust and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence \| A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering \|
accepted-risks: add R4 (no cryptographic WORM for logs) 2026-06-06 07:03:27 +02:00			\| R4 \| No cryptographic WORM for logs — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history \| Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) \| Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target \|
docs: record ubongo physical build (2026-06-11) Move ubongo to 'Built (partial)' in STATUS; fill real M70q hardware specs (i3-10100T, 16 GB, 256 GB SanDisk X600 SATA, no disk encryption). Record in ADR-015 the dedicated claude AI-worker identity, LAN-SSH-only operational reality, and the no-encryption decision; close the rbw offline-cache recovery-verification item (ADR-015 + rotate-secrets). Add accepted-risk R5 (control-node disk unencrypted at rest) with its compensating controls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-11 10:32:26 +02:00			\| R5 \| No disk encryption on `ubongo` — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them \| `ubongo` is always-on in a physically controlled location; compensating controls are a BIOS supervisor password and disabled external/USB + PXE boot (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level \| `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken \|
docs: wire ADR-025 into testing/control-host/risks/status/capacity - ADR-008: add reboot-survivability gap row + ADR-025 pointer to the "not tested in Molecule" table - ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs (ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing - accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records) - CLAUDE.md: add make test-integration[/-clean] to key-commands; add ADR-025 + runbook rows to further-reading - hardware/reference.md: note one ephemeral KVM test VM on ubongo - STATUS.md: add integration harness entry (built, lint+pytest clean; RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-18 12:51:22 +02:00			\| R6 \| `le-prod-wildcard` integration runs — when `CERTS=le-prod-wildcard` is passed to `make test-integration`, the production Gandi PAT (`vault.gandi.pat`) is passed to an ephemeral local test VM via the var overlay, and transient `_acme-challenge` TXT records are written into the real `wingu.me` DNS zone to satisfy the Let's Encrypt DNS-01 challenge. A compromised or long-lived test VM could exfiltrate the PAT; the real zone is briefly (seconds) modified \| Scope is on-demand only — `le-staging` is the default cert tier (`CERTS=internal` for incident repro); `le-prod-wildcard` is an explicit opt-in. Compensating controls: the VM is ephemeral and destroyed on success; it sits on an isolated libvirt NAT network (no LAN/mesh access); TXT records are auto-removed by Caddy immediately after validation; the PAT is not persisted inside the VM after the run. ADR-025 documents the cert-tier design and the three isolation invariants \| The PAT is exfiltrated from a test VM; the `wingu.me` zone shows unexpected records; a `CERTS=le-prod-wildcard` run must be audited or the tier must be revoked \|
Expand ADR-002 into a security baseline + strategy Add a managerial security frame on top of the host baseline: explicit threat model (opportunistic external, lateral movement/blast radius, operator/agent error; supply chain accepted-lower-priority), security principles, and four governance mechanisms that ADR-002 establishes and links out to: - docs/security/service-checklist.md — per-service security bar (referenced from the new-role runbook) - docs/security/accepted-risks.md — living accepted-risk register (R1-R4) - planned /security-review skill (TODO 8.5) - agent guardrails in CLAUDE.md "what Claude must not do" STATUS.md records the frame as present (manual enforcement) and /security-review as planned-not-built. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-04 14:39:51 +02:00
docs: record ubongo physical build (2026-06-11) Move ubongo to 'Built (partial)' in STATUS; fill real M70q hardware specs (i3-10100T, 16 GB, 256 GB SanDisk X600 SATA, no disk encryption). Record in ADR-015 the dedicated claude AI-worker identity, LAN-SSH-only operational reality, and the no-encryption decision; close the rbw offline-cache recovery-verification item (ADR-015 + rotate-secrets). Add accepted-risk R5 (control-node disk unencrypted at rest) with its compensating controls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-11 10:32:26 +02:00			`_Last reviewed: 2026-06-11. The prior gaps (full CIS hardening, SELinux/AppArmor,`
Re-challenge accepted risks; adopt CIS hardening + IDS Walked the seeded accepted-risk register (R1-R4) and turned inherited gaps into deliberate decisions: - Supply chain (R1): tightened to required baseline hygiene (digest pinning, official/verified images); active scanning deferred — stays an accepted risk - CIS (R2): adopted as a positive decision — CIS Debian L1+L2 (base role) + CIS Docker (docker_host + service checklist); app layer via the checklist - SELinux/AppArmor (R3): AppArmor becomes a baseline control (CIS-enforced); register keeps a clean "no SELinux" accept - IDS (R4): adopt AIDE (baseline via CIS) + Suricata on OPNsense + active alerting Register shrinks from 4 inherited gaps to 2 deliberate accepts. ADR-002 gains a Hardening standard section; STATUS + TODO 15 track the (unbuilt) implementation, including the CIS L2 partition impact on VM provisioning (ADR-006). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-04 15:15:39 +02:00			`IDS) were re-challenged and adopted rather than accepted: CIS Debian L1+L2 + CIS`
			`Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now`
			part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build
			`status. As CIS is implemented, any specific item that proves impractical is added`
			`here as a named exception._`