boma/docs/TODO.md
sjat 9f0626040b docs(todo): add note on ubongo↔cluster network topology question
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:15:18 +02:00


ToDo

Build order lives in docs/ROADMAP.md — that sequences this backlog into milestones. This file is the decision backlog; the roadmap is the order we build them.

Open items only. Item numbers are stable cross-references (cited by ROADMAP, STATUS, ADRs, scripts) — never renumber. When an item is decided or built, collapse it to a one-line pointer in place; the full record lives in its ADR / STATUS.md / the FRICTION.md decisions ledger.

  1. Forgejo CI — what CI work remains after ADR-010 (which workflows, runner setup, etc. still need to be built)?

  2. Testing

    1. Choose and configure code-testing tooling (Molecule, etc.).
    2. Decide how the AI interprets Molecule output and performs live testing — API calls, curl pulls of web products, log reviews. Headless browsing → ADR-017 (/verify-service); the API/curl/log-review siblings remain open.
    3. Standard for test users + manual-test instructions. → ADR-017.
    4. Local VM integration testing on ubongo. → ADR-025 / make test-integration (built + RED→GREEN validated 2026-06-18).
  3. Building services

    1. Decide how to manage logs. → ADR-018.
    2. Decide how to manage APIs / API access. → ADR-021.
    3. Decide how to import/integrate from baobabAnsibleV4. → ADR-013.
    4. Decide what each node runs — base packages plus which apps/services.
    5. Decide the firewall strategy. → ADR-020 (builds: host nftables in base done; OPNsense-as-code pending).
    6. Wire up the monitoring stack — Prometheus + metric exporters, Uptime Kuma, and exactly which alerts live where. (Logging topology → ADR-018.)
    7. Define a tagging standard. → ADR-019.
    8. Ensure the right things are backed up. → ADR-022 (build: the backup role, Plans 23, pending).
    9. Decide: a central database server, or individual database services per app?
    10. Should we keep the custom base-container (Molecule test image) method for role testing, or revisit it as boma's testing approach matures (ADR-008)?
    11. Deliberate tagging strategy. → ADR-019 (folded into 3.7).
  4. Split-horizon FQDN. → ADR-007 / M1 (wingu.me three-tier; nyumbani dropped; mesh/LAN-only default).

  5. Control node

    1. Set up and test the control node while waiting for hardware.
    2. Define control-node bootstrapping — a dedicated recipe and playbook?
    3. Set up rbw on the control node.
  6. Updating — 1. Decide the update strategy across services & containers vs packages & builds / GitHub pulls / Flatpaks. 2. Define scheduling of updates and reboots, including post-update testing. (Tracked in item 16 / ADR-011.)

  7. Shell setup

    1. Decide what shell setup matters for the AI's work on the control node.
    2. Decide what to set up on the hosts (direct access rare). → ADR-021.
  8. Scheduled work

    1. Run /review-repo as claude -p via cron every two weeks?
    2. Build sanity checks (e.g. does PhotoPrism have its pictures? are email services receiving and sending?).
    3. Design a declarative scheduled_jobs role so the repo owns which cronjobs run on a host, enforced by Ansible. Sketch (deferred until we have hosts): reads a scheduled_jobs__jobs list from group_vars/host_vars, rendered via a managed /etc/cron.d file. Open questions:
      1. General role vs control-node-only?
      2. Prune undeclared jobs (repo authoritative) vs additive?
      3. Validate headless email and that cron's env has the claude CLI.
      4. (The fortnightly /review-repo job is the first entry.)
    4. Schedule /capacity-review to run periodically (on-demand only for now). Revisit once the physical cluster + a live usage-stats hook exist, so it reasons on real usage rather than declared intent alone. Decide the usage source first: Proxmox RRD (built-in, no extra infra) vs the Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely set up anyway (richer, per-process, but more to run) — see TODO 3.6. Don't build the Proxmox-RRD hook before settling this, to avoid throwaway work.
    5. Build a /security-review skill (sibling to /review-repo): re-check the security posture against ADR-002, surface drift, and re-challenge the accepted-risk register (docs/security/accepted-risks.md). Could pair a deterministic pre-scan (undeclared open ports, disabled baseline controls, world-readable secrets, services not behind auth) with a judgement pass. Open question: standalone, or folded into /kaizen (item 11)?
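
Item 8.3 describes a role that renders a `scheduled_jobs__jobs` list into a managed /etc/cron.d file. A minimal Python sketch of that rendering step follows, purely to make the data shape concrete. The field names (`name`, `schedule`, `user`, `command`), the `PATH` default, and the example job are assumptions for illustration and are not decided anywhere in this backlog; the real role would presumably be an Ansible template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledJob:
    """One entry of the scheduled_jobs__jobs list (field names are hypothetical)."""
    name: str      # label, written as a comment above the entry
    schedule: str  # five-field cron expression
    user: str      # /etc/cron.d entries carry a user field
    command: str


def render_cron_d(jobs, path="/usr/local/bin:/usr/bin:/bin"):
    """Render the full content of one managed /etc/cron.d file."""
    lines = [
        "# Managed by Ansible (scheduled_jobs role); manual edits are overwritten.",
        # Setting PATH explicitly touches open question 3 (cron's environment).
        f"PATH={path}",
    ]
    # Sort by name so the rendered file is stable between runs.
    for job in sorted(jobs, key=lambda j: j.name):
        if len(job.schedule.split()) != 5:
            raise ValueError(f"{job.name}: schedule must have five fields")
        if "\n" in job.command:
            raise ValueError(f"{job.name}: command must be a single line")
        lines.append(f"# {job.name}")
        lines.append(f"{job.schedule} {job.user} {job.command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical first entry: cron has no native "every two weeks",
    # so the 1st and 15th of the month stand in for "fortnightly".
    jobs = [ScheduledJob("review-repo", "0 6 1,15 * *", "boma",
                         "claude -p /review-repo")]
    print(render_cron_d(jobs), end="")
```

Rendering the whole file from the list makes the repo authoritative for that one file, which leans toward the "prune undeclared jobs" side of open question 2 without touching jobs defined elsewhere on the host.
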
  9. Should we build a basic notification function so that tools (and the AI) can send messages to the user via email, Matrix, or ntfy?

  10. Claude setup — DECIDED: brainstorm for intent → ADRs; hooks + slash commands + /review-repo for enforcement at scale. Remaining:

    1. V4 collaboration policy. → ADR-013.
    2. Policy for how we write key documents like ADRs. → ADR-023.
    3. Further develop how we collaborate on designing the project's foundation, as distinct from how we implement new containers etc.
    4. Always-latest official documentation for our tech. → ADR-014.
    5. Always subagent-driven? → DECIDED: yes (standing agreement; enforced by .claude/hooks/guard-execution-mode-menu.sh).
    6. When the AI deploys (runs playbooks etc.), should we define a method so that it does not have to poll constantly or review all the output? Perhaps the make-based workflow could return only the relevant feedback?
    7. Reproducible agent toolchain. → .claude/settings.json + docs/runbooks/claude-code-setup.md.
    8. Screenshot hand-off to the agent. Give the operator a smooth way to hand the agent a screenshot (e.g. of a Hetzner/VNC console during an incident) — the agent can already read image files; the gap is the hand-off. During the 2026-06-17 incident the only diagnostic channel was console screenshots, copied manually to /tmp and find-located. Options: a known drop path the agent checks (e.g. ~/screenshots/), a small screenshot/paste helper or slash-command, or a clipboard→file convention. Cheap, high-value for incident work.
  11. Kaizen loop. → /kaizen built (STATUS).

    1. Build the loop command. → /kaizen (scripts/friction-scan.py + .claude/commands/kaizen.md; spec docs/superpowers/specs/2026-06-14-kaizen-command-design.md).
    2. Keep appending raw signals to docs/FRICTION.md (ongoing practice; see FRICTION.md).
    3. Automation deferred (revisit when the notify + cron stack is up): wire a scheduled headless run — report-only (proposes verdicts + notifies, does not auto-curate/commit). The on-demand command + recurrence/age nudge ship now.
  12. Spin-up / build order — what is the right order of operations when spinning up from scratch (OS, DNS, Authentik, Caddy, …)?

  13. Intentions: does the current setup clearly identify intentions throughout? We have the README files, but is that enough? Also, how do we re-challenge decisions, and how they interact, over time? For example: we have two services running, but extending one slightly could make the other redundant, so we could remove it. Or an alternative to a service has emerged that is actually better.

  14. Script dependencies policy — utility scripts (tf_to_inventory.py, repo-scan.py, capacity-scan.py, friction-scan.py) are stdlib-only by convention, for run-anywhere portability (control node, CI, bare clone, no venv). Reevaluate whether selectively allowing libraries (e.g. PyYAML — already present via Ansible) is a better fit in general: weigh the parsing-correctness win against losing zero-setup portability. Decide a clear rule and record it.
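
To make the trade-off in item 14 concrete, here is a minimal sketch of one possible rule: use PyYAML when it is importable, otherwise fall back to a deliberately naive stdlib parser. The function name `load_flat_mapping` and the `force_stdlib` switch are hypothetical and not taken from any existing script. The fallback only handles flat `key: value` lines and returns every value as a string, which is exactly the parsing-correctness gap the item asks us to weigh.

```python
try:
    import yaml  # PyYAML; optional, usually present wherever Ansible is installed
except ImportError:
    yaml = None


def load_flat_mapping(text, force_stdlib=False):
    """Parse flat 'key: value' text, preferring PyYAML when available."""
    if yaml is not None and not force_stdlib:
        data = yaml.safe_load(text)
        return data if data is not None else {}

    # Naive stdlib fallback: no nesting, no lists, no quoting rules,
    # and a '#' inside a quoted value would be cut off as a comment.
    result = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"cannot parse line: {raw!r}")
        result[key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    print(load_flat_mapping("name: boma\nrole: control\n"))
    # The fallback keeps numbers as strings; PyYAML would return an int here.
    print(load_flat_mapping("port: 8080\n", force_stdlib=True))
```

The two code paths agree on plain string values but diverge on typed ones, so a rule of this kind only stays safe if the scripts never depend on YAML typing.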

  15. Security hardening implementation — build out the ADR-002 hardening standard.

    1. Implement the CIS Debian Benchmark Level 1 + Level 2 in the base role (local tasks; CIS / dev-sec as reference only — no Galaxy roles). Includes AppArmor (enforce mode) and AIDE file-integrity.
    2. Implement the CIS Docker Benchmark: daemon/engine settings in docker_host; per-container settings enforced via docs/security/service-checklist.md.
    3. VM disk layout for CIS L2: separate /tmp, /var, /var/log, /home partitions with nodev,nosuid,noexec — a Terraform/cloud-init concern (ADR-006). Decide the template layout before provisioning, since it is painful to retrofit.
    4. Network IDS: enable Suricata on OPNsense (IDS first; IPS later?).
    5. Active security alerting: wire AIDE, auditd, fail2ban, and Suricata into the Loki/Grafana alerting stack (ties to 3.6).
    6. Supply-chain hygiene: enforce tiered image pinning (stateful tag@digest; stateless rolling tags — ADR-011) + official/verified images via the service checklist; revisit active scanning (Trivy/Grype) once a triage stack exists (R1).
    7. Is our network setup as it should be? I am not sure whether all traffic between ubongo and the cluster nodes goes via askari. If askari breaks, will the rest still work?
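
The disk-layout requirement in item 15.3 lends itself to a deterministic check. The sketch below parses text in the /proc/mounts format and reports which of the listed paths are not separate mounts or lack the expected options. The `REQUIRED` table simply mirrors the item's wording (the same three options on all four paths), which is an assumption; the final per-mount option set should come from the benchmark and the template decision. The function names are hypothetical.

```python
# Expected mount options per path; mirrors item 15.3 and is an assumption.
REQUIRED = {
    "/tmp": {"nodev", "nosuid", "noexec"},
    "/var": {"nodev", "nosuid", "noexec"},
    "/var/log": {"nodev", "nosuid", "noexec"},
    "/home": {"nodev", "nosuid", "noexec"},
}


def parse_mounts(text):
    """Map mount point -> set of options from /proc/mounts-style text."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # Field order: device, mount point, filesystem type, options, ...
        mounts[fields[1]] = set(fields[3].split(","))
    return mounts


def layout_problems(mounts_text, required=REQUIRED):
    """Return {path: [problems]} for every path that fails the check."""
    mounts = parse_mounts(mounts_text)
    problems = {}
    for path, options in required.items():
        if path not in mounts:
            problems[path] = ["not a separate mount"]
            continue
        missing = sorted(options - mounts[path])
        if missing:
            problems[path] = missing
    return problems


if __name__ == "__main__":
    with open("/proc/mounts", encoding="utf-8") as handle:
        for path, issues in layout_problems(handle.read()).items():
            print(f"{path}: {', '.join(issues)}")
```

A check like this could also feed the deterministic pre-scan floated in item 8.5.
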
  16. ADR-011 (update management) — resolve open questions + accept. Committed as Proposed; resolve before marking Accepted:

    1. Snapshot driver — control node calling the Proxmox API vs a Proxmox-side hook (crosses the TF/Ansible boundary, ADR-006/009).
    2. Cadences — is weekly OS patching right; should reboots be rarer than apt?
    3. Health-check harness — where it lives and the minimum bar that counts as "in order" before the weekly run ships (ties to ADR-008, TODO 2.2 / 8.2).
    4. Stateful classification home — per-role __stateful flag vs a group_vars list.
    5. Staging-first? — hit a staging host before production, or is snapshot-before + Friday timing enough at this scale?
    6. Notification/control channel — boma's own ntfy topics (ADR-013) + a "skip this week" / "pause" switch (ties to TODO 9).
    7. Reconcile pinning conflict (tags vs digests). → DECIDED: tiered (stateful tag@digest, stateless rolling); ADR-011 dec. 2 / ADR-004 / ADR-002.
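
The tiered pinning rule from items 15.6 and 16.7 (stateful images pinned as tag@digest, stateless images allowed on rolling tags) can be expressed as a small check. This is a sketch under stated assumptions: the function names are hypothetical, the `stateful` boolean is assumed to come from whichever classification item 16.4 settles on, and the parsing covers only the common `name[:tag][@sha256:...]` shape rather than the full image-reference grammar.

```python
import re

DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def split_image_ref(ref):
    """Split an image reference into (name, tag, digest); absent parts are ''."""
    name_part, _, digest = ref.partition("@")
    # A ':' only marks a tag if it sits in the last path segment;
    # otherwise it is a registry port such as registry.example:5000/app.
    last_segment = name_part.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, _, tag = name_part.rpartition(":")
    else:
        name, tag = name_part, ""
    return name, tag, digest


def pinning_problem(ref, stateful):
    """Return a description of the violation, or None if the ref is acceptable."""
    _, tag, digest = split_image_ref(ref)
    if digest and not DIGEST_RE.match(digest):
        return "malformed digest"
    if stateful:
        if not tag:
            return "stateful image needs an explicit tag"
        if not digest:
            return "stateful image needs a digest"
    return None  # stateless images may use rolling tags


if __name__ == "__main__":
    digest = "sha256:" + "a" * 64  # placeholder digest for the demo
    for ref, stateful in [
        ("postgres:16@" + digest, True),
        ("postgres:16", True),
        ("caddy:latest", False),
    ]:
        print(ref, "->", pinning_problem(ref, stateful))
```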