# Design — Hardware reference & capacity evaluation _Date: 2026-06-01 · Status: approved for planning_ ## Problem The repo documents the **logical/network** layer well — Terraform declares per-VM `cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology. But the **physical** layer is undocumented: how many Proxmox nodes physically exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and `askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and nothing evaluates whether the design is well-proportioned — e.g. a service that needn't be HA, a workload that should move nodes, or a node due a RAM/disk upgrade. ## Goal 1. A single, human-first **hardware reference document** capturing physical compute + network gear and the intended workload placement. 2. A **capacity evaluator** ("script + skill") that reasons about optimization: HA overkill / missing redundancy, right-sizing, placement moves, and upgrade timing — emitting a dated report. ## Scope - **In:** Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense, managed switch, APs); per-workload placement intent. - **Out (for now):** power/UPS budget, NAS, cabling, rack layout, asset register, warranty/serial tracking. ## Non-negotiable repo conventions this must honor - Mirror the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad (deterministic scan feeds a judgement skill; report is dated markdown). - Utility scripts are **stdlib-only** for run-anywhere portability (control node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation. - Be honest about real-vs-planned (STATUS.md). The physical cluster is **not stood up yet**, so live usage stats are a documented future hook, not a current capability. ## Architecture Four pieces, plus tracking updates. ### 1. Reference doc — `docs/hardware/reference.md` One hand-maintained markdown file, the source of truth for physical facts and placement intent. Four parts: 1. **Physical compute** — one subsection per node (`pve0..2`, `askari`): model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots), storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes. 2. **Network gear** — OPNsense box, managed switch, APs: model, port/PoE counts, throughput, uplinks. Short table. 3. **Workload placement & intent** — one row per planned VM/service, columns: `Service | Home node | Criticality | HA intent | Resource profile | Placement constraints | Growth notes`. These columns map onto the four attribute groups chosen during brainstorming and give the evaluator concrete intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different nodes). 4. **Capacity summary** — per-node "allocated vs physical" rollup (RAM / cores / disk, headroom %). Node-capacity tables use a **strict, documented format** so the scan script can parse the numbers without a YAML dependency. ### 2. Scan script — `scripts/capacity-scan.py` Stdlib-only, deterministic, JSON to stdout (like `repo-scan.py`). Avoids hand-parsing YAML by shelling out for JSON, the pattern `tf_to_inventory.py` already uses. Gathers **today**: - **Declared allocations** — `terraform output -json` (and/or the `.tf` module calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no real VMs yet (current reality) instead of failing. - **Inventory hosts** — `ansible-inventory -i inventories//hosts.yml --list` → JSON. - **Physical capacities** — parses the strict node tables in `reference.md`. - **Rollup math** — per node: allocated vs physical, headroom %, `oversubscribed` flag. - **Drift warnings** — e.g. `reference.md` lists a host no Terraform VM declares; surfaced in a `warnings[]` array (free doc↔Terraform drift check). **Stubbed future hook** (honest, à la STATUS.md): ```python # FUTURE: live usage stats (per-VM CPU/RAM/disk history). # Requires the physical cluster online. Source UNDECIDED — see "Open decisions". def gather_usage(): return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"} ``` Output sketch: ```json { "nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}}, "workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}], "usage": {"available": false, "reason": "cluster not provisioned"}, "warnings": ["reference.md lists dns1 but no Terraform VM declares it"] } ``` ### 3. Evaluator skill — `/capacity-review` A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow: 1. Run `python3 scripts/capacity-scan.py` → JSON. 2. Read `docs/hardware/reference.md` for intent columns the math can't capture. 3. Reason across dimensions, each recommendation **tagged by type** and stating **what it is based on** (declared intent vs measured usage): - **HA / redundancy** — anti-affinity violations, SPOFs, HA-overkill, critical-but-unredundant services. - **Right-sizing** — over/under-provisioned VMs. *Intent-based today*; explicitly upgradeable to usage-based once the usage hook is live. - **Placement / moves** — oversubscribed nodes, constraint-driven relocation. - **Upgrade timing** — growth notes vs headroom → rough runway. - **Drift** — surfaces the scan's `warnings[]`. 4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`), mirroring `docs/reviews/`. ### 4. Recording — ADR + STATUS + CLAUDE.md - **ADR-012 — Hardware reference & capacity evaluation** (`docs/decisions/012-hardware-capacity.md`): records the decision and rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as an open decision (below). - **STATUS.md** rows: `reference.md` + `capacity-scan.py` → real/working (skeleton); `/capacity-review` → working, intent-only; live usage → designed, not built. - **CLAUDE.md**: a "Review capacity/hardware → `/capacity-review`" commands-table row + a "Further reading" pointer to ADR-012. ## Data flow ``` reference.md ──┐ ├─→ capacity-scan.py ──→ scan JSON ──┐ terraform ─────┤ (stdlib, JSON-via-subprocess) ├─→ /capacity-review ─→ docs/hardware/reviews/ inventory ─────┘ │ (judgement) reference.md (intent columns) ───────────────────────┘ ``` ## Open decisions (deferred, tracked in TODO) - **Usage-stats source** (TODO 8.4): **Proxmox RRD** (built-in, no extra infra) vs the **Prometheus/Loki/Grafana/Grafana-Alloy** stack we will likely run anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before building any usage hook** to avoid throwaway work. - **Script dependency policy** (TODO #14): whether stdlib-only remains the rule for utility scripts or libraries (e.g. PyYAML) are selectively allowed. - **Scheduling** (TODO 8.4): `/capacity-review` is on-demand now; cron later. ## Deliverables & state at delivery | Piece | Path | State | |---|---|---| | Reference doc | `docs/hardware/reference.md` | Skeleton + real node data | | Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) | | Evaluator skill | `/capacity-review` → `docs/hardware/reviews/` | Working, intent-based | | Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR | | Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated | ## Out of scope / YAGNI - No usage-stats collection until the cluster exists and the source is decided. - No structured-data (YAML) source of truth — markdown is the single hand-edited source by choice; revisit only if parsing pain demands it. - No automated moves/remediation — the evaluator recommends; humans act.