Brainstormed design for docs/hardware/reference.md (physical compute + network gear + workload placement intent), a stdlib-only capacity-scan.py, and an on-demand /capacity-review skill that reports to docs/hardware/reviews/. Mirrors the repo-scan -> /review-repo -> docs/reviews triad. TODO additions: schedule /capacity-review later and decide its usage-stats source (Proxmox RRD vs the Prometheus/Loki/Grafana/Alloy stack) before building any hook (8.4); reevaluate the stdlib-only script policy (#14). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Design — Hardware reference & capacity evaluation
Date: 2026-06-01 · Status: approved for planning
Problem
The repo documents the logical/network layer well — Terraform declares per-VM
cores/memory_mb/disk_size_gb, and ADR-007 records VLANs, IPs, and topology.
But the physical layer is undocumented: how many Proxmox nodes physically
exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and
askari. Nothing records "this node has 64 GB, X is allocated, Y is free," and
nothing evaluates whether the design is well-proportioned — e.g. a service that
needn't be HA, a workload that should move nodes, or a node due a RAM/disk
upgrade.
Goal
- A single, human-first hardware reference document capturing physical compute + network gear and the intended workload placement.
- A capacity evaluator ("script + skill") that reasons about optimization: HA overkill / missing redundancy, right-sizing, placement moves, and upgrade timing — emitting a dated report.
Scope
- In: Proxmox compute nodes (pve0..2) + askari; network gear (OPNsense, managed switch, APs); per-workload placement intent.
- Out (for now): power/UPS budget, NAS, cabling, rack layout, asset register, warranty/serial tracking.
Non-negotiable repo conventions this must honor
- Mirror the existing repo-scan.py → /review-repo → docs/reviews/ triad (deterministic scan feeds a judgement skill; report is dated markdown).
- Utility scripts are stdlib-only for run-anywhere portability (control node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
- Be honest about real-vs-planned (STATUS.md). The physical cluster is not stood up yet, so live usage stats are a documented future hook, not a current capability.
Architecture
Four pieces, plus tracking updates.
1. Reference doc — docs/hardware/reference.md
One hand-maintained markdown file, the source of truth for physical facts and placement intent. Four parts:
- Physical compute: one subsection per node (pve0..2, askari): model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots), storage (disks → pools, e.g. local-zfs / local-lvm), NICs, notes.
- Network gear: OPNsense box, managed switch, APs: model, port/PoE counts, throughput, uplinks. Short table.
- Workload placement & intent: one row per planned VM/service, columns: Service | Home node | Criticality | HA intent | Resource profile | Placement constraints | Growth notes. These columns map onto the four attribute groups chosen during brainstorming and give the evaluator concrete intent to judge against (e.g. anti-affinity: dns1/dns2 on different nodes).
- Capacity summary: per-node "allocated vs physical" rollup (RAM / cores / disk, headroom %).
Node-capacity tables use a strict, documented format so the scan script can parse the numbers without a YAML dependency.
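The strict format itself is not fixed by this design. As a minimal sketch of what "parse the numbers without a YAML dependency" could look like, assume a single pipe table whose header is exactly Node | Cores | RAM GB | Disk GB. That header, the function names, and the sample values are all hypothetical; the real layout is whatever reference.md ends up documenting.

```python
import re

# Hypothetical strict format: one pipe table whose header is exactly
# Node | Cores | RAM GB | Disk GB. The real format is defined in reference.md.
EXPECTED_HEADER = ["Node", "Cores", "RAM GB", "Disk GB"]


def split_row(line):
    """Split a markdown pipe-table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_node_table(markdown):
    """Return {node: {"cores": int, "ram_gb": int, "disk_gb": int}}."""
    nodes = {}
    in_table = False
    for line in markdown.splitlines():
        if not line.lstrip().startswith("|"):
            in_table = False  # any non-table line ends the current table
            continue
        cells = split_row(line)
        if cells == EXPECTED_HEADER:
            in_table = True
            continue
        if not in_table or all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue  # rows of other tables, or the |---| separator row
        if len(cells) != len(EXPECTED_HEADER):
            raise ValueError(f"malformed node row: {line!r}")
        name, cores, ram_gb, disk_gb = cells
        nodes[name] = {
            "cores": int(cores),
            "ram_gb": int(ram_gb),
            "disk_gb": int(disk_gb),
        }
    return nodes


# Illustrative values only, not real hardware data.
SAMPLE = """\
| Node | Cores | RAM GB | Disk GB |
|---|---|---|---|
| pve0 | 8 | 64 | 1000 |
| askari | 4 | 16 | 500 |
"""
```

Failing loudly on a malformed row (rather than skipping it) keeps a typo in the hand-edited table from silently turning into wrong headroom numbers.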
2. Scan script — scripts/capacity-scan.py
Stdlib-only, deterministic, JSON to stdout (like repo-scan.py). Avoids
hand-parsing YAML by shelling out for JSON, the pattern tf_to_inventory.py
already uses.
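A minimal sketch of the shell-out-for-JSON pattern with graceful degradation, under the assumption that every failure should become a reason string rather than an exception. The helper name run_json is hypothetical and is not taken from tf_to_inventory.py.

```python
import json
import subprocess


def run_json(cmd):
    """Run cmd and parse its stdout as JSON.

    Returns (data, None) on success or (None, reason) on failure, so the
    caller can record a warning instead of crashing.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(proc.stdout), None
    except FileNotFoundError:
        return None, f"{cmd[0]} not installed"
    except subprocess.CalledProcessError as exc:
        return None, f"{cmd[0]} exited with status {exc.returncode}"
    except json.JSONDecodeError:
        return None, f"{cmd[0]} did not print valid JSON"
```

The scan could then call run_json(["terraform", "output", "-json"]) and run_json(["ansible-inventory", "-i", "inventories/<env>/hosts.yml", "--list"]), appending any returned reason to warnings[].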
Gathers today:
- Declared allocations: terraform output -json (and/or the .tf module calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no real VMs yet (current reality) instead of failing.
- Inventory hosts: ansible-inventory -i inventories/<env>/hosts.yml --list → JSON.
- Physical capacities: parses the strict node tables in reference.md.
- Rollup math: per node, allocated vs physical, headroom %, and an oversubscribed flag.
- Drift warnings: e.g. reference.md lists a host no Terraform VM declares; surfaced in a warnings[] array (free doc↔Terraform drift check).
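The rollup and drift checks above can be sketched as follows. This is an illustration under assumptions, not the script itself: it covers RAM only, assumes ram_gb is greater than zero, and the function names are hypothetical. The field names follow the output sketch in this section.

```python
def rollup(ram_gb, allocated_mb):
    """Per-node RAM rollup: allocated vs physical, headroom %, oversubscribed.

    ram_gb: physical RAM of the node in GB (assumed > 0).
    allocated_mb: memory_mb of every VM placed on the node.
    """
    allocated_gb = sum(allocated_mb) / 1024
    headroom_pct = round(100 * (ram_gb - allocated_gb) / ram_gb)
    return {
        "ram_gb": ram_gb,
        "ram_allocated_gb": allocated_gb,
        "headroom_pct": headroom_pct,  # negative when oversubscribed
        "oversubscribed": allocated_gb > ram_gb,
    }


def drift_warnings(reference_hosts, terraform_hosts):
    """List hosts named in reference.md that no Terraform VM declares."""
    missing = sorted(set(reference_hosts) - set(terraform_hosts))
    return [
        f"reference.md lists {host} but no Terraform VM declares it"
        for host in missing
    ]
```

For a 64 GB node carrying VMs of 4096 MB and 8192 MB, rollup(64, [4096, 8192]) gives 12 GB allocated and 81 % headroom, the same figures as the pve0 entry in the output sketch.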
Stubbed future hook (honest, à la STATUS.md):
# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
def gather_usage():
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}
Output sketch:
{
  "nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
  "workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
  "usage": {"available": false, "reason": "cluster not provisioned"},
  "warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
}
3. Evaluator skill — /capacity-review
A skill in .claude/ (mirrors /review-repo), on-demand. Flow:
- Run python3 scripts/capacity-scan.py → JSON.
- Read docs/hardware/reference.md for intent columns the math can't capture.
- Reason across dimensions, each recommendation tagged by type and stating what it is based on (declared intent vs measured usage):
  - HA / redundancy: anti-affinity violations, SPOFs, HA-overkill, critical-but-unredundant services.
  - Right-sizing: over/under-provisioned VMs. Intent-based today; explicitly upgradeable to usage-based once the usage hook is live.
  - Placement / moves: oversubscribed nodes, constraint-driven relocation.
  - Upgrade timing: growth notes vs headroom → rough runway.
  - Drift: surfaces the scan's warnings[].
- Write docs/hardware/reviews/YYYY-MM-DD-capacity.md (+ latest.md), mirroring docs/reviews/.
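The skill reasons in prose, but the HA / redundancy dimension is concrete enough to illustrate. Below is a sketch of an anti-affinity check over placement rows, showing a finding tagged by type and by what it is based on. The row keys ("service", "node", "group") and the finding fields are hypothetical; the design does not define them.

```python
from collections import defaultdict


def anti_affinity_violations(placements):
    """Return findings for anti-affinity groups whose members share a node.

    placements: list of dicts with hypothetical keys "service", "node" and
    "group" (the anti-affinity group, or None when unconstrained).
    """
    by_group_node = defaultdict(list)
    for row in placements:
        if row.get("group"):
            by_group_node[(row["group"], row["node"])].append(row["service"])

    findings = []
    for (group, node), services in sorted(by_group_node.items()):
        if len(services) > 1:  # two or more group members on one node
            findings.append({
                "type": "ha-redundancy",
                "basis": "declared intent",
                "detail": f"{', '.join(sorted(services))} ({group}) all on {node}",
            })
    return findings
```

With dns1 and dns2 both placed on pve0 in group "dns", this yields one finding; moving dns2 to another node yields none.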
4. Recording — ADR + STATUS + CLAUDE.md
- ADR-012, Hardware reference & capacity evaluation (docs/decisions/012-hardware-capacity.md): records the decision and rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as an open decision (below).
- STATUS.md rows: reference.md + capacity-scan.py → real/working (skeleton); /capacity-review → working, intent-only; live usage → designed, not built.
- CLAUDE.md: a "Review capacity/hardware → /capacity-review" commands-table row + a "Further reading" pointer to ADR-012.
Data flow
reference.md ──┐
               ├─→ capacity-scan.py ──→ scan JSON ──┐
terraform ─────┤   (stdlib, JSON-via-subprocess)    ├─→ /capacity-review ─→ docs/hardware/reviews/
inventory ─────┘                                    │    (judgement)
reference.md (intent columns) ──────────────────────┘
Open decisions (deferred, tracked in TODO)
- Usage-stats source (TODO 8.4): Proxmox RRD (built-in, no extra infra) vs the Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely run anyway (richer, per-process, more to operate; see TODO 3.6). Decide before building any usage hook to avoid throwaway work.
- Script dependency policy (TODO #14): whether stdlib-only remains the rule for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
- Scheduling (TODO 8.4): /capacity-review is on-demand now; cron later.
Deliverables & state at delivery
| Piece | Path | State |
|---|---|---|
| Reference doc | docs/hardware/reference.md | Skeleton + real node data |
| Scan script | scripts/capacity-scan.py | Working (stdlib, usage hook stubbed) |
| Evaluator skill | /capacity-review → docs/hardware/reviews/ | Working, intent-based |
| Decision record | docs/decisions/012-hardware-capacity.md | New ADR |
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |
Out of scope / YAGNI
- No usage-stats collection until the cluster exists and the source is decided.
- No structured-data (YAML) source of truth — markdown is the single hand-edited source by choice; revisit only if parsing pain demands it.
- No automated moves/remediation — the evaluator recommends; humans act.