Add hardware reference & capacity-evaluation design spec

Brainstormed design for docs/hardware/reference.md (physical compute + network gear + workload placement intent), a stdlib-only capacity-scan.py, and an on-demand /capacity-review skill that reports to docs/hardware/reviews/. Mirrors the repo-scan -> /review-repo -> docs/reviews triad.

TODO additions: schedule /capacity-review later and decide its usage-stats source (Proxmox RRD vs the Prometheus/Loki/Grafana/Alloy stack) before building any hook (8.4); reevaluate the stdlib-only script policy (#14).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 09:59:16 +02:00 · 88210db09c
commit 88210db09c
parent ed3eeb0199
2 changed files with 244 additions and 50 deletions
--- a/docs/TODO.md
+++ b/docs/TODO.md
@@ -1,64 +1,90 @@
 # ToDo

- [x] Main readme only says ansible, not terraform. Should properbly be included.
- [x] Main readme does not include a description of the name boma, nor the scope (i.e. infrastructure - not laptops)
+1. **Forgejo CI** — what CI work remains after ADR-010 (which workflows, runner
+   setup, etc. still need to be built)?

- [x] Method to review repo to ensure
-  - We dont carry around code, comments, notes, etc. that is no longer needed but was perhaps added to fix an issue that has been resolved.
-  - That all code, structure, comments, notes etc. follow our design decisions.
-  - That clear intent is documented throughout - and that there are not any overlaps, contradictions etc.
+2. **Testing**
+   1. Choose and configure code-testing tooling (Molecule, etc.).
+   2. Decide how the AI interprets Molecule output and performs live testing:
+      API calls, curl pulls of web products, log reviews, and headless browsing.
+   3. Define a standard for generating test users and for instructing the user to
+      perform relevant manual tests.

- [ ] Forgejo CI
+3. **Building services**
+   1. Decide how to manage logs.
+   2. Decide how to manage APIs / API access.
+   3. Decide how to import or integrate from baobabAnsibleV4.
+   4. Decide what each node runs — base packages plus which apps/services.
+   5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
+   6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
+      Kuma alerts on askari.
+   7. Define a tagging standard that lets us target runs without over-tagging.
+   8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
+   9. Decide: a central database server, or individual database services per app?
+   10. Decide whether to keep the base-container method, or whether the improved methods in boma make it moot.

- [ ] Testing
-  - Code testing tools (molecule etc.)
-  - AI interpretation of molecule etc, but also actual testing via API-calls, CURL pulls of web products, log reviews and perhaps even headless browsing
+4. **Split-horizon FQDN** — adopt split-horizon FQDN with or without nyumbani?

- [ ] Building stuff
-  - How to manage logs
-  - How to manage APIs
-  - How to import/integrate from baobabAnsibleV4?
-  - What to install on nodes?
-  - firewalls?
-  - apps?
-  - wirering up loki, prometheous, grafana dashboards, grafana alerts, uptimekuma alerts on askari
-  - tagging strategy - we need a specific standard so that we can target runs, but dont over-tag.
+5. **Control node**
+   1. Set up and test the control node while waiting for hardware.
+   2. Define control-node bootstrapping — a dedicated recipe and playbook?
+   3. Decide the role of mamba — access/availability vs compute power and ease?
+   4. Set up rbw on the control node.

- [ ] Split horizon FQDN - with or without nyumbani
+6. **Updating**
+   1. Decide pinning vs latest for versions.
+   2. Decide the update strategy across services & containers vs packages &
+      builds / GitHub pulls / Flatpaks.
+   3. Define scheduling of updates and reboots, including post-update testing.

- [ ] Control node
-  - Setup and testing while waiting for hardware?
-  - Bootstrapping - perhaps dedicated recipe and playbook?
-  - Role of mamba? - Access/availability vs compute power and ease?
-  - rbw on control node
+7. **Shell setup**
+   1. Decide what shell setup matters for the AI's work on the control node.
+   2. Decide what to set up on the hosts, given that direct access will be rare.

- [ ] Updating
-  - Pinning vs latest.
-  - services and containers vs packages and builds/github pulls/flatpacks
-  - scheduling of updates and reboots - incl. testing afterwards.
+8. **Scheduled work**
+   1. Run `/review-repo` as `claude -p` via cron every two weeks?
+   2. Build sanity checks (e.g. does PhotoPrism have its pictures? are email
+      services receiving and sending?).
+   3. Design a declarative `scheduled_jobs` role so the repo owns which cronjobs
+      run on a host, enforced by Ansible. Sketch (deferred until we have hosts):
+      reads a `scheduled_jobs__jobs` list from group_vars/host_vars, rendered via
+      a managed `/etc/cron.d` file. Open questions:
+      1. General role vs control-node-only?
+      2. Prune undeclared jobs (repo authoritative) vs additive?
+      3. Validate headless email and that cron's env has the `claude` CLI.
+      4. (The fortnightly `/review-repo` job is the first entry.)
+   4. Schedule `/capacity-review` to run periodically (on-demand only for now).
+      Revisit once the physical cluster + a live usage-stats hook exist, so it
+      reasons on real usage rather than declared intent alone. **Decide the usage
+      source first:** Proxmox RRD (built-in, no extra infra) vs the
+      Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely set up anyway
+      (richer, per-process, but more to run) — see TODO 3.6. Don't build the
+      Proxmox-RRD hook before settling this, to avoid throwaway work.
+9. Should we build a basic function so that tools (and the AI) can send messages to the user via email, Matrix, or ntfy?

- [ ] shell setup
- What does it matter in relations to the AIs work on the control node?
- What should we set up on the hosts, if i'll rarely go there?
+10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
+    files); hooks + slash commands + `/review-repo` for enforcement at scale. Any
+    remaining setup to carry out from this decision?
+    1. Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.
+    2. Policy for how we write key documents like ADRs.
+    3. Further development of how we collaborate on designing the foundation for the project, separate from how we implement new containers etc.

- [ ] Scheduled work
- /review-repo maybe as claude -p via cron every two weeks?
- Sanity checks: does a photoprism have its pictures? are email services recieving and sending?
- Cron "section": a declarative way for the repo to own which cronjobs are active on a
-  host, enforced by Ansible. Sketch (deferred until we have hosts): a `scheduled_jobs`
-  role reading a `scheduled_jobs__jobs` list from group_vars/host_vars, rendered via a
-  managed /etc/cron.d file. Open Qs: general role vs control-node-only; prune
-  undeclared jobs (repo authoritative) vs additive; validate headless email + that
-  cron's env has the `claude` CLI. The /review-repo fortnightly job is the first entry.
+11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
+    1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
+       findings + a tooling-usage inventory; proposes add / change / **remove**
+       (biased to remove); records decisions as ADRs; evaluates itself.
+       Recurrence-triggered plus a light periodic sweep.
+    2. Keep appending raw signals to `docs/FRICTION.md` (live now) until the
+       retro consumes them.

- [ ] Claude setup
- superpowers or other methodologies?  → decided: brainstorm for intent, capture as
-  ADRs (skip plan files); hooks + slash commands + /review-repo for enforcement at scale.
+12. **Spin-up order** — what is the right order of operations when spinning up
+    from scratch (OS, DNS, Authentik, Traefik, …)?

- [ ] Kaizen loop — set up ~2026-06-06 (one week from now)
-  - Build `/retro`: reads `docs/FRICTION.md` + `/review-repo` recurring findings + a
-    tooling-usage inventory; proposes add / change / **remove** (biased to remove);
-    records decisions as ADRs; evaluates itself. Recurrence-triggered + light periodic sweep.
-  - `docs/FRICTION.md` is live now — keep appending raw signals until the retro consumes them.
+13. **Intentions** - Does the current setup clearly identify intentions throughout? We have the README files, but is that enough? Also, how do we re-challenge decisions and how they interact over time? E.g. we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to a service has emerged that is actually better.

- [ ] What is the right order of operation when spinning up from scratch? (OS, DNS, authentik, traefik...?)
+14. **Script dependencies policy** — utility scripts (`tf_to_inventory.py`,
+    `repo-scan.py`, `capacity-scan.py`) are stdlib-only by convention, for
+    run-anywhere portability (control node, CI, bare clone, no venv). Reevaluate
+    whether selectively allowing libraries (e.g. PyYAML — already present via
+    Ansible) is a better fit in general: weigh the parsing-correctness win
+    against losing zero-setup portability. Decide a clear rule and record it.
--- a/docs/superpowers/specs/2026-06-01-hardware-capacity-design.md
+++ b/docs/superpowers/specs/2026-06-01-hardware-capacity-design.md
@@ -0,0 +1,168 @@
+# Design — Hardware reference & capacity evaluation
+
+_Date: 2026-06-01 · Status: approved for planning_
+
+## Problem
+
+The repo documents the **logical/network** layer well — Terraform declares per-VM
+`cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology.
+But the **physical** layer is undocumented: how many Proxmox nodes physically
+exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and
+`askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and
+nothing evaluates whether the design is well-proportioned — e.g. a service that
+needn't be HA, a workload that should move nodes, or a node due a RAM/disk
+upgrade.
+
+## Goal
+
+1. A single, human-first **hardware reference document** capturing physical
+   compute + network gear and the intended workload placement.
+2. A **capacity evaluator** ("script + skill") that reasons about optimization:
+   HA overkill / missing redundancy, right-sizing, placement moves, and
+   upgrade timing — emitting a dated report.
+
+## Scope
+
+- **In:** Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense,
+  managed switch, APs); per-workload placement intent.
+- **Out (for now):** power/UPS budget, NAS, cabling, rack layout, asset
+  register, warranty/serial tracking.
+
+## Non-negotiable repo conventions this must honor
+
+- Mirror the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad
+  (deterministic scan feeds a judgement skill; report is dated markdown).
+- Utility scripts are **stdlib-only** for run-anywhere portability (control
+  node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
+- Be honest about real-vs-planned (STATUS.md). The physical cluster is **not
+  stood up yet**, so live usage stats are a documented future hook, not a
+  current capability.
+
+## Architecture
+
+Four pieces: the reference doc, the scan script, the evaluator skill, and the tracking updates that record them.
+
+### 1. Reference doc — `docs/hardware/reference.md`
+
+One hand-maintained markdown file, the source of truth for physical facts and
+placement intent. Four parts:
+
+1. **Physical compute** — one subsection per node (`pve0..2`, `askari`):
+   model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots),
+   storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes.
+2. **Network gear** — OPNsense box, managed switch, APs: model, port/PoE
+   counts, throughput, uplinks. Short table.
+3. **Workload placement & intent** — one row per planned VM/service, columns:
+   `Service | Home node | Criticality | HA intent | Resource profile |
+   Placement constraints | Growth notes`. These columns map onto the four
+   attribute groups chosen during brainstorming and give the evaluator concrete
+   intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different
+   nodes).
+4. **Capacity summary** — per-node "allocated vs physical" rollup (RAM / cores /
+   disk, headroom %).
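The anti-affinity intent in part 3 is one constraint that can be checked mechanically. The following is an illustrative sketch only, not part of this commit: `anti_affinity_violations`, the `placements` mapping, and the `groups` list are hypothetical names, and the node assignments are made-up sample data.

```python
from collections import defaultdict


def anti_affinity_violations(placements, groups):
    """Return messages for services that must be separated but share a node.

    placements: {service_name: home_node} (hypothetical shape)
    groups: list of service-name lists that must not share a node
    """
    violations = []
    for group in groups:
        by_node = defaultdict(list)
        for service in group:
            if service in placements:  # ignore services not placed yet
                by_node[placements[service]].append(service)
        for node, services in sorted(by_node.items()):
            if len(services) > 1:
                violations.append(f"{', '.join(sorted(services))} share {node}")
    return violations


# Made-up placement: both DNS servers landed on the same node.
sample = {"dns1": "pve0", "dns2": "pve0", "forgejo": "pve1"}
print(anti_affinity_violations(sample, [["dns1", "dns2"]]))
```

A check like this only flags the violation; deciding where to move a workload stays with the evaluator and the human reading the report.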
+
+Node-capacity tables use a **strict, documented format** so the scan script can
+parse the numbers without a YAML dependency.
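A minimal sketch of what a strict table format could look like to a stdlib-only parser, not part of this commit. It assumes a pipe table with the exact header `| Node | Cores | RAM GB | Disk GB |` and that the input is the text of one table; the column names and `parse_node_table` are hypothetical, since the spec does not fix them.

```python
import re

# Hypothetical strict header; the real columns are not fixed by the spec.
EXPECTED_HEADER = ["Node", "Cores", "RAM GB", "Disk GB"]


def split_row(line):
    """Split one markdown pipe-table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_node_table(markdown):
    """Return {node: {"cores": int, "ram_gb": int, "disk_gb": int}}.

    Raises ValueError on any deviation from the strict format, so a
    malformed table fails loudly instead of yielding wrong numbers.
    """
    rows = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(rows) < 2 or split_row(rows[0]) != EXPECTED_HEADER:
        raise ValueError("node table header does not match the strict format")
    if not re.fullmatch(r"[\s|:-]+", rows[1]):
        raise ValueError("missing separator row under the header")
    nodes = {}
    for line in rows[2:]:
        cells = split_row(line)
        if len(cells) != len(EXPECTED_HEADER):
            raise ValueError(f"wrong column count: {line!r}")
        name, cores, ram_gb, disk_gb = cells
        nodes[name.strip("`")] = {
            "cores": int(cores),  # int() raises ValueError on non-numbers
            "ram_gb": int(ram_gb),
            "disk_gb": int(disk_gb),
        }
    return nodes


# Made-up sample values, for illustration only.
SAMPLE = """
| Node | Cores | RAM GB | Disk GB |
|---|---|---|---|
| `pve0` | 8 | 64 | 1000 |
"""
print(parse_node_table(SAMPLE))
```

Rejecting anything outside the fixed header keeps the parser small and makes format drift in `reference.md` visible immediately.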
+
+### 2. Scan script — `scripts/capacity-scan.py`
+
+Stdlib-only, deterministic, JSON to stdout (like `repo-scan.py`). Avoids
+hand-parsing YAML by shelling out for JSON, the pattern `tf_to_inventory.py`
+already uses.
+
+Gathers **today**:
+- **Declared allocations** — `terraform output -json` (and/or the `.tf` module
+  calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no
+  real VMs yet (current reality) instead of failing.
+- **Inventory hosts** — `ansible-inventory -i inventories/<env>/hosts.yml
+  --list` → JSON.
+- **Physical capacities** — parses the strict node tables in `reference.md`.
+- **Rollup math** — per node: allocated vs physical, headroom %,
+  `oversubscribed` flag.
+- **Drift warnings** — e.g. `reference.md` lists a host no Terraform VM
+  declares; surfaced in a `warnings[]` array (free doc↔Terraform drift check).
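The gathering steps above could be sketched as follows, not part of this commit. `run_json` and `drift_warnings` are hypothetical helper names; the only external commands assumed are the ones the spec names (`terraform output -json` and `ansible-inventory ... --list`), and the host names are sample data.

```python
import json
import subprocess


def run_json(cmd):
    """Run a command that prints JSON and return (data, warning).

    Any failure (binary missing, non-zero exit, invalid JSON) yields
    ({}, "<reason>") so the scan degrades instead of crashing.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(proc.stdout or "{}"), None
    except (OSError, subprocess.CalledProcessError, json.JSONDecodeError) as exc:
        return {}, f"{cmd[0]} unavailable: {type(exc).__name__}"


def drift_warnings(reference_hosts, declared_vms):
    """Warn about hosts named in reference.md that no Terraform VM declares."""
    return [
        f"reference.md lists {host} but no Terraform VM declares it"
        for host in sorted(set(reference_hosts) - set(declared_vms))
    ]


# Sample host names only; a real run would take them from the parsed sources.
print(drift_warnings(["dns1", "forgejo"], ["forgejo"]))
```

A call such as `run_json(["terraform", "output", "-json"])` would then return an empty dict plus a warning string when Terraform is absent or has no state, which can be appended to `warnings[]` rather than aborting the scan.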
+
+**Stubbed future hook** (honest, à la STATUS.md):
+```python
+# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
+# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
+def gather_usage():
+    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}
+```
+
+Output sketch:
+```json
+{
+  "nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
+  "workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
+  "usage": {"available": false, "reason": "cluster not provisioned"},
+  "warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
+}
+```
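The per-node numbers in the output sketch can be reproduced with simple arithmetic. This is an illustrative sketch, not part of this commit: `rollup` is a hypothetical name, and both the rounding and the definition of `oversubscribed` (allocated RAM exceeds physical RAM) are assumptions.

```python
def rollup(ram_gb, allocated_mb):
    """Roll up one node's RAM: physical GB vs the sum of per-VM memory_mb."""
    allocated_gb = sum(allocated_mb) / 1024
    return {
        "ram_gb": ram_gb,
        "ram_allocated_gb": round(allocated_gb),
        # Headroom is the unallocated share of physical RAM, in percent.
        "headroom_pct": round(100 * (ram_gb - allocated_gb) / ram_gb),
        # Assumption: oversubscribed means allocations exceed physical RAM.
        "oversubscribed": allocated_gb > ram_gb,
    }


# 4 GB + 8 GB allocated on a 64 GB node: (64 - 12) / 64 = 81.25 %, rounded to 81.
print(rollup(64, [4096, 8192]))
```

The same shape would apply to cores and disk; a negative `headroom_pct` is the numeric counterpart of `oversubscribed` being true.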
+
+### 3. Evaluator skill — `/capacity-review`
+
+A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow:
+
+1. Run `python3 scripts/capacity-scan.py` → JSON.
+2. Read `docs/hardware/reference.md` for intent columns the math can't capture.
+3. Reason across dimensions, each recommendation **tagged by type** and stating
+   **what it is based on** (declared intent vs measured usage):
+   - **HA / redundancy** — anti-affinity violations, SPOFs, HA-overkill,
+     critical-but-unredundant services.
+   - **Right-sizing** — over/under-provisioned VMs. *Intent-based today*;
+     explicitly upgradeable to usage-based once the usage hook is live.
+   - **Placement / moves** — oversubscribed nodes, constraint-driven relocation.
+   - **Upgrade timing** — growth notes vs headroom → rough runway.
+   - **Drift** — surfaces the scan's `warnings[]`.
+4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`),
+   mirroring `docs/reviews/`.
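Step 4 of the flow could be sketched as below, not part of this commit. `write_report` is a hypothetical helper, and treating `latest.md` as a plain copy of the dated file (rather than a symlink) is an assumption the spec does not settle.

```python
import datetime
import pathlib
import shutil
import tempfile


def write_report(body, reviews_dir, today=None):
    """Write YYYY-MM-DD-capacity.md and refresh latest.md; return the dated path."""
    today = today or datetime.date.today()
    reviews_dir = pathlib.Path(reviews_dir)
    reviews_dir.mkdir(parents=True, exist_ok=True)
    dated = reviews_dir / f"{today.isoformat()}-capacity.md"
    dated.write_text(body, encoding="utf-8")
    # Assumption: latest.md is a copy, which survives checkouts without symlinks.
    shutil.copyfile(dated, reviews_dir / "latest.md")
    return dated


# Demo in a throwaway directory so nothing is written into the repo.
with tempfile.TemporaryDirectory() as tmp:
    path = write_report("# Capacity review\n", tmp, datetime.date(2026, 6, 1))
    print(path.name)
```

Passing `today` explicitly keeps the function deterministic, which helps if the report writer is ever exercised from a test or a scheduled job.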
+
+### 4. Recording — ADR + STATUS + CLAUDE.md
+
+- **ADR-012 — Hardware reference & capacity evaluation**
+  (`docs/decisions/012-hardware-capacity.md`): records the decision and
+  rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as
+  an open decision (below).
+- **STATUS.md** rows: `reference.md` + `capacity-scan.py` → real/working
+  (skeleton); `/capacity-review` → working, intent-only; live usage → designed,
+  not built.
+- **CLAUDE.md**: a "Review capacity/hardware → `/capacity-review`" commands-table
+  row + a "Further reading" pointer to ADR-012.
+
+## Data flow
+
+```
+reference.md ──┐
+               ├─→ capacity-scan.py ──→ scan JSON ──┐
+terraform ─────┤      (stdlib, JSON-via-subprocess)  ├─→ /capacity-review ─→ docs/hardware/reviews/
+inventory ─────┘                                     │        (judgement)
+reference.md (intent columns) ───────────────────────┘
+```
+
+## Open decisions (deferred, tracked in TODO)
+
+- **Usage-stats source** (TODO 8.4): **Proxmox RRD** (built-in, no extra infra)
+  vs the **Prometheus/Loki/Grafana/Grafana-Alloy** stack we will likely run
+  anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before
+  building any usage hook** to avoid throwaway work.
+- **Script dependency policy** (TODO #14): whether stdlib-only remains the rule
+  for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
+- **Scheduling** (TODO 8.4): `/capacity-review` is on-demand now; cron later.
+
+## Deliverables & state at delivery
+
+| Piece | Path | State |
+|---|---|---|
+| Reference doc | `docs/hardware/reference.md` | Skeleton + real node data |
+| Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) |
+| Evaluator skill | `/capacity-review` → `docs/hardware/reviews/` | Working, intent-based |
+| Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR |
+| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |
+
+## Out of scope / YAGNI
+
+- No usage-stats collection until the cluster exists and the source is decided.
+- No structured-data (YAML) source of truth — markdown is the single hand-edited
+  source by choice; revisit only if parsing pain demands it.
+- No automated moves/remediation — the evaluator recommends; humans act.