boma/docs/TODO.md
sjat 19dd89b875 Re-challenge accepted risks; adopt CIS hardening + IDS
Walked the seeded accepted-risk register (R1-R4) and turned inherited gaps into
deliberate decisions:

- Supply chain (R1): tightened to required baseline hygiene (digest pinning,
  official/verified images); active scanning deferred — stays an accepted risk
- CIS (R2): adopted as a positive decision — CIS Debian L1+L2 (base role) + CIS
  Docker (docker_host + service checklist); app layer via the checklist
- SELinux/AppArmor (R3): AppArmor becomes a baseline control (CIS-enforced);
  register keeps a clean "no SELinux" accept
- IDS (R4): adopt AIDE (baseline via CIS) + Suricata on OPNsense + active alerting

Register shrinks from 4 inherited gaps to 2 deliberate accepts. ADR-002 gains a
Hardening standard section; STATUS + TODO 15 track the (unbuilt) implementation,
including the CIS L2 partition impact on VM provisioning (ADR-006).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:15:39 +02:00

# ToDo
1. **Forgejo CI** — what CI work remains after ADR-010 (which workflows, runner
setup, etc. still need to be built)?
2. **Testing**
    1. Choose and configure code-testing tooling (Molecule, etc.).
    2. Decide how the AI interprets Molecule output and performs live testing:
       API calls, curl pulls of web products, log reviews, and headless browsing.
    3. Define a standard for generating test users and for instructing the user to
       perform relevant manual tests.
3. **Building services**
    1. Decide how to manage logs.
    2. Decide how to manage APIs / API access.
    3. Decide how to import or integrate from baobabAnsibleV4.
    4. Decide what each node runs: base packages plus which apps/services.
    5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
    6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
       Kuma alerts on askari.
    7. Define a tagging standard that lets us target runs without over-tagging.
    8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
    9. Decide: a central database server, or individual database services per app?
    10. Should we continue to use the base-container method, or do the improved
        methods in boma moot the point?
4. **Split-horizon FQDN** — adopt split-horizon FQDN with or without nyumbani?
5. **Control node**
    1. Set up and test the control node while waiting for hardware.
    2. Define control-node bootstrapping: a dedicated recipe and playbook?
    3. Decide the role of mamba: access/availability vs compute power and ease?
    4. Set up rbw on the control node.
6. **Updating**
    1. Decide pinning vs latest for versions.
    2. Decide the update strategy across services & containers vs packages &
       builds / GitHub pulls / Flatpaks.
    3. Define scheduling of updates and reboots, including post-update testing.
7. **Shell setup**
    1. Decide what shell setup matters for the AI's work on the control node.
    2. Decide what to set up on the hosts, given that direct access will be rare.
8. **Scheduled work**
    1. Run `/review-repo` as `claude -p` via cron every two weeks?
    2. Build sanity checks (e.g. does PhotoPrism have its pictures? are email
       services receiving and sending?).
    3. Design a declarative `scheduled_jobs` role so the repo owns which cronjobs
       run on a host, enforced by Ansible. Sketch (deferred until we have hosts):
       reads a `scheduled_jobs__jobs` list from group_vars/host_vars, rendered via
       a managed `/etc/cron.d` file. Open questions:
        1. General role vs control-node-only?
        2. Prune undeclared jobs (repo authoritative) vs additive?
        3. Validate headless email and that cron's env has the `claude` CLI.
        4. (The fortnightly `/review-repo` job is the first entry.)
    4. Schedule `/capacity-review` to run periodically (on-demand only for now).
       Revisit once the physical cluster + a live usage-stats hook exist, so it
       reasons on real usage rather than declared intent alone. **Decide the usage
       source first:** Proxmox RRD (built-in, no extra infra) vs the
       Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely set up anyway
       (richer, per-process, but more to run); see TODO 3.6. Don't build the
       Proxmox-RRD hook before settling this, to avoid throwaway work.
    5. Build a `/security-review` skill (sibling to `/review-repo`): re-check the
       security posture against ADR-002, surface drift, and re-challenge the
       accepted-risk register (`docs/security/accepted-risks.md`). Could pair a
       deterministic pre-scan (undeclared open ports, disabled baseline controls,
       world-readable secrets, services not behind auth) with a judgement pass.
       Open question: standalone, or folded into the kaizen `/retro` (item 11)?
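
The rendering step of the `scheduled_jobs` role sketched in item 8.3 might look roughly like this in plain Python. This is only an illustration: the job fields (`name`, `schedule`, `user`, `command`), the `ansible` user, the exact `claude` invocation, and the function name `render_cron_d` are all assumptions. The notes above only fix the `scheduled_jobs__jobs` variable and the managed `/etc/cron.d` target, and the real role would presumably use an Ansible template rather than a script.

```python
"""Hypothetical sketch: render a managed /etc/cron.d file from a job list."""

# Assumed shape of the scheduled_jobs__jobs variable (field names are illustrative).
scheduled_jobs__jobs = [
    {
        "name": "review-repo",
        "schedule": "0 6 1,15 * *",  # roughly fortnightly: 1st and 15th of the month
        "user": "ansible",           # assumed account; not specified in the notes
        "command": 'claude -p "/review-repo"',
    },
]


def render_cron_d(jobs):
    """Return the text of a cron.d file containing exactly the declared jobs."""
    lines = ["# Managed by Ansible (scheduled_jobs role). Manual edits will be overwritten."]
    for job in sorted(jobs, key=lambda j: j["name"]):
        # cron entries are single-line; reject values that would break the file format.
        for field in ("name", "schedule", "user", "command"):
            if "\n" in job[field]:
                raise ValueError(f"newline in {field!r} of job {job['name']!r}")
        if len(job["schedule"].split()) != 5:
            raise ValueError(f"job {job['name']!r}: schedule needs five fields")
        lines.append(f"# {job['name']}")
        lines.append(f"{job['schedule']} {job['user']} {job['command']}")
    return "\n".join(lines) + "\n"  # cron ignores a final line without a newline


if __name__ == "__main__":
    print(render_cron_d(scheduled_jobs__jobs), end="")
```

Because the whole file is regenerated from the list, this variant is implicitly repo-authoritative for that one file, which bears on the prune-vs-additive open question.
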
9. Should we build a basic function so that tools (and the AI) can send messages to the user via email, Matrix, or ntfy?
10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
files); hooks + slash commands + `/review-repo` for enforcement at scale. Any
remaining setup to carry out from this decision?
    1. Policy for how we draw on baobabAnsibleV4 as a reference without misusing it.
    2. Policy for how we write key documents like ADRs.
    3. Further develop how we collaborate on designing the foundation for the project, separate from how we implement new containers etc.
    4. How do we make sure agents always use the latest official documentation for the technologies etc. we use?
    5. Always subagent-driven?
    6. When the AI deploys (i.e. runs playbooks etc.), should we define a methodology so that it does not have to poll all the time or review all the output? Perhaps something about the MAKE method could provide only the relevant feedback.
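
One way to approach item 10.6 is to reduce a playbook run to a per-host verdict so the agent reads a few lines instead of the full log. The sketch below assumes the run was captured with Ansible's JSON callback and that its per-host `stats` counters are available. The function name `summarize` and the sample hosts are hypothetical, and nothing here has been decided.

```python
"""Hypothetical sketch: condense a playbook run into the lines an agent needs."""
import json

# Assumed input: the "stats" mapping from Ansible's JSON callback output
# (per-host counters such as ok / changed / failures / unreachable).
SAMPLE_RUN = json.loads("""
{
  "stats": {
    "host-a": {"ok": 12, "changed": 2, "failures": 0, "unreachable": 0},
    "host-b": {"ok": 3,  "changed": 0, "failures": 1, "unreachable": 0}
  }
}
""")


def summarize(run):
    """Return an overall verdict line, then one line per host, problem hosts first."""
    rows = []
    for host, counts in run.get("stats", {}).items():
        bad = counts.get("failures", 0) + counts.get("unreachable", 0)
        if bad:
            status = "FAILED"
        elif counts.get("changed", 0):
            status = "changed"
        else:
            status = "ok"
        rows.append((0 if bad else 1, host, status))
    rows.sort()  # failed hosts sort to the top, then alphabetically by host
    verdict = "FAILED" if any(rank == 0 for rank, _, _ in rows) else "OK"
    return [f"run: {verdict}"] + [f"{host}: {status}" for _, host, status in rows]


if __name__ == "__main__":
    print("\n".join(summarize(SAMPLE_RUN)))
```

Only hosts marked `FAILED` would then need their detailed task output pulled in for review.
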
11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
    1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
       findings + a tooling-usage inventory; proposes add / change / **remove**
       (biased to remove); records decisions as ADRs; evaluates itself.
       Recurrence-triggered plus a light periodic sweep.
    2. Keep appending raw signals to `docs/FRICTION.md` (live now) until the
       retro consumes them.
12. **Spin-up order** — what is the right order of operations when spinning up
from scratch (OS, DNS, Authentik, Traefik, …)?
13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the README files, but is that enough? Also, how do we re-challenge decisions and how they interact over time? For example: we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to a service has emerged that is actually better.
14. **Script dependencies policy** — utility scripts (`tf_to_inventory.py`,
`repo-scan.py`, `capacity-scan.py`) are stdlib-only by convention, for
run-anywhere portability (control node, CI, bare clone, no venv). Reevaluate
whether selectively allowing libraries (e.g. PyYAML — already present via
Ansible) is a better fit in general: weigh the parsing-correctness win
against losing zero-setup portability. Decide a clear rule and record it.
15. **Security hardening implementation** — build out the ADR-002 hardening standard.
    1. Implement the CIS Debian Benchmark **Level 1 + Level 2** in the `base` role
       (local tasks; CIS / `dev-sec` as reference only, no Galaxy roles). Includes
       AppArmor (enforce mode) and AIDE file-integrity.
    2. Implement the CIS Docker Benchmark: daemon/engine settings in `docker_host`;
       per-container settings enforced via `docs/security/service-checklist.md`.
    3. VM disk layout for CIS L2: separate `/tmp`, `/var`, `/var/log`, `/home`
       partitions with `nodev,nosuid,noexec`; a Terraform/cloud-init concern
       (ADR-006). Decide the template layout **before** provisioning, since it is
       painful to retrofit.
    4. Network IDS: enable Suricata on OPNsense (IDS first; IPS later?).
    5. Active security alerting: wire AIDE, `auditd`, `fail2ban`, and Suricata into
       the Loki/Grafana alerting stack (ties to 3.6).
    6. Supply-chain hygiene: enforce image digest pinning + official/verified images
       via the service checklist; revisit active scanning (Trivy/Grype) once a
       triage stack exists (accepted-risk R1).
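
The digest-pinning rule in item 15.6 lends itself to a deterministic check. The following is a minimal stdlib-only sketch, assuming image references are already available as plain strings; how they would be collected from role variables or compose files is left open. `unpinned_images` is a hypothetical name, and the check does not cover the official/verified-image requirement.

```python
"""Hypothetical sketch: flag container image references not pinned by digest."""
import re

# A pinned reference ends in @sha256:<64 hex chars>; a tag alone is mutable.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(image_refs):
    """Return the references that lack a sha256 digest, in input order."""
    return [ref for ref in image_refs if not DIGEST_RE.search(ref)]


if __name__ == "__main__":
    fake_digest = "a" * 64  # placeholder digest, not a real image
    refs = [
        f"docker.io/library/nginx:1.27@sha256:{fake_digest}",
        "docker.io/library/nginx:latest",
        "docker.io/library/redis:7",
    ]
    for ref in unpinned_images(refs):
        print(f"not pinned by digest: {ref}")
```

A check like this could run in CI or as part of a deterministic pre-scan, failing when the returned list is non-empty.
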