boma/docs/superpowers/specs/2026-06-01-hardware-capacity-design.md
sjat 88210db09c Add hardware reference & capacity-evaluation design spec
Brainstormed design for docs/hardware/reference.md (physical compute +
network gear + workload placement intent), a stdlib-only capacity-scan.py,
and an on-demand /capacity-review skill that reports to docs/hardware/reviews/.
Mirrors the repo-scan -> /review-repo -> docs/reviews triad.

TODO additions: schedule /capacity-review later and decide its usage-stats
source (Proxmox RRD vs the Prometheus/Loki/Grafana/Alloy stack) before
building any hook (8.4); reevaluate the stdlib-only script policy (#14).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 09:59:16 +02:00


Design — Hardware reference & capacity evaluation

Date: 2026-06-01 · Status: approved for planning

Problem

The repo documents the logical/network layer well — Terraform declares per-VM cores/memory_mb/disk_size_gb, and ADR-007 records VLANs, IPs, and topology. But the physical layer is undocumented: how many Proxmox nodes physically exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and askari. Nothing records "this node has 64 GB, X is allocated, Y is free," and nothing evaluates whether the design is well-proportioned — e.g. a service that needn't be HA, a workload that should move nodes, or a node due a RAM/disk upgrade.

Goal

  1. A single, human-first hardware reference document capturing physical compute + network gear and the intended workload placement.
  2. A capacity evaluator ("script + skill") that reasons about optimization: HA overkill / missing redundancy, right-sizing, placement moves, and upgrade timing — emitting a dated report.

Scope

  • In: Proxmox compute nodes (pve0..2) + askari; network gear (OPNsense, managed switch, APs); per-workload placement intent.
  • Out (for now): power/UPS budget, NAS, cabling, rack layout, asset register, warranty/serial tracking.

Non-negotiable repo conventions this must honor

  • Mirror the existing repo-scan.py → /review-repo → docs/reviews/ triad (deterministic scan feeds a judgement skill; report is dated markdown).
  • Utility scripts are stdlib-only for run-anywhere portability (control node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
  • Be honest about real-vs-planned (STATUS.md). The physical cluster is not stood up yet, so live usage stats are a documented future hook, not a current capability.

Architecture

Four pieces, plus tracking updates.

1. Reference doc — docs/hardware/reference.md

One hand-maintained markdown file, the source of truth for physical facts and placement intent. Four parts:

  1. Physical compute — one subsection per node (pve0..2, askari): model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots), storage (disks → pools, e.g. local-zfs / local-lvm), NICs, notes.
  2. Network gear — OPNsense box, managed switch, APs: model, port/PoE counts, throughput, uplinks. Short table.
  3. Workload placement & intent — one row per planned VM/service, columns: Service | Home node | Criticality | HA intent | Resource profile | Placement constraints | Growth notes. These columns map onto the four attribute groups chosen during brainstorming and give the evaluator concrete intent to judge against (e.g. anti-affinity: dns1/dns2 on different nodes).
  4. Capacity summary — per-node "allocated vs physical" rollup (RAM / cores / disk, headroom %).

Node-capacity tables use a strict, documented format so the scan script can parse the numbers without a YAML dependency.
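The strict table format itself is not specified in this design. As a minimal sketch, assuming a pipe table whose first header cell is "Node" and whose remaining cells are integers (the column names and the parse_node_table helper are hypothetical), stdlib-only parsing could look like this:

```python
import re


def parse_node_table(markdown):
    """Parse the first pipe table whose header row starts with 'Node'.

    ASSUMPTION: the table layout is illustrative, not the format
    reference.md will actually use. Every non-name cell must be an integer.
    """
    nodes = {}
    header = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            if header is not None:
                break  # first non-table line after the header ends the table
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            # Skip rows of unrelated tables until the node table's header appears.
            if cells[0].lower() == "node":
                header = [c.lower().replace(" ", "_") for c in cells]
            continue
        if all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue  # markdown separator row
        if len(cells) != len(header):
            raise ValueError(f"malformed node row: {raw!r}")
        # int() raises ValueError on non-numeric cells, which keeps the format strict.
        nodes[cells[0]] = {key: int(value) for key, value in zip(header[1:], cells[1:])}
    return nodes


SAMPLE = """
| Node | Cores | RAM GB | Disk GB |
|------|-------|--------|---------|
| pve0 | 8     | 64     | 1000    |
| pve1 | 4     | 32     | 500     |

Notes follow the table.
"""

if __name__ == "__main__":
    print(parse_node_table(SAMPLE))
```

Failing loudly on a malformed row is a deliberate choice in this sketch: a silently skipped node would make the rollup look healthier than it is.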

2. Scan script — scripts/capacity-scan.py

Stdlib-only, deterministic, JSON to stdout (like repo-scan.py). Avoids hand-parsing YAML by shelling out for JSON, the pattern tf_to_inventory.py already uses.
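The shell-out-for-JSON pattern with graceful degradation might be sketched as follows. The run_json helper, its warning strings, and the 60-second timeout are assumptions for illustration; how tf_to_inventory.py actually does this is not shown in this spec.

```python
import json
import subprocess
import sys


def run_json(cmd, timeout=60):
    """Run cmd and parse its stdout as JSON.

    Returns (data, warning). On any failure data is None and warning
    explains why, so the caller can degrade instead of crashing.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=timeout
        )
        return json.loads(proc.stdout), None
    except FileNotFoundError:
        return None, f"{cmd[0]} not installed"
    except subprocess.CalledProcessError as exc:
        return None, f"{cmd[0]} exited {exc.returncode}"
    except subprocess.TimeoutExpired:
        return None, f"{cmd[0]} timed out"
    except json.JSONDecodeError:
        return None, f"{cmd[0]} did not emit valid JSON"


if __name__ == "__main__":
    # Stand-in command so the sketch runs without Terraform installed.
    data, warning = run_json([sys.executable, "-c", "print('{\"vms\": []}')"])
    print(data, warning)
```

In the scan script the same helper would presumably be called as run_json(["terraform", "output", "-json"]), with any returned warning appended to warnings[] rather than raised.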

Gathers today:

  • Declared allocations: terraform output -json (and/or the .tf module calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no real VMs yet (current reality) instead of failing.
  • Inventory hosts: ansible-inventory -i inventories/<env>/hosts.yml --list → JSON.
  • Physical capacities: parses the strict node tables in reference.md.
  • Rollup math: per node, allocated vs physical, headroom %, oversubscribed flag.
  • Drift warnings: e.g. reference.md lists a host no Terraform VM declares; surfaced in a warnings[] array (free doc↔Terraform drift check).
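The rollup and drift steps above can be sketched for RAM only. The function names, the input shapes, and the sample workloads are assumptions; the reverse drift direction (Terraform declares a host the doc omits) is also an assumption, not something this spec requires.

```python
def rollup(physical, workloads):
    """Per-node RAM rollup: allocated vs physical, headroom %, oversubscribed flag.

    physical:  {"pve0": {"ram_gb": 64}, ...}             (hypothetical shape)
    workloads: [{"name": ..., "node": ..., "memory_mb": ...}, ...]
    """
    nodes = {}
    for name, spec in physical.items():
        allocated_gb = sum(
            w["memory_mb"] for w in workloads if w["node"] == name
        ) / 1024
        ram_gb = spec["ram_gb"]
        nodes[name] = {
            "ram_gb": ram_gb,
            "ram_allocated_gb": round(allocated_gb, 1),
            "headroom_pct": round(100 * (ram_gb - allocated_gb) / ram_gb),
            "oversubscribed": allocated_gb > ram_gb,
        }
    return nodes


def drift_warnings(documented, declared):
    """Compare host names in reference.md against those Terraform declares."""
    warnings = [
        f"reference.md lists {host} but no Terraform VM declares it"
        for host in sorted(documented - declared)
    ]
    # ASSUMPTION: the reverse check is an illustrative extension.
    warnings += [
        f"Terraform declares {host} but reference.md does not list it"
        for host in sorted(declared - documented)
    ]
    return warnings


if __name__ == "__main__":
    # Made-up sample data, not real cluster figures.
    physical = {"pve0": {"ram_gb": 64}}
    workloads = [
        {"name": "vm-a", "node": "pve0", "memory_mb": 8192},
        {"name": "vm-b", "node": "pve0", "memory_mb": 4096},
    ]
    print(rollup(physical, workloads))
    print(drift_warnings({"dns1", "vm-a"}, {"vm-a", "vm-b"}))
```

With 12 GB allocated on a 64 GB node, headroom is (64 − 12) / 64 ≈ 81 %. A node with more allocated than physical RAM gets a negative headroom and oversubscribed set to true.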

Stubbed future hook (honest, à la STATUS.md):

# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
def gather_usage():
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}

Output sketch:

{
  "nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
  "workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
  "usage": {"available": false, "reason": "cluster not provisioned"},
  "warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
}

3. Evaluator skill — /capacity-review

A skill in .claude/ (mirrors /review-repo), on-demand. Flow:

  1. Run python3 scripts/capacity-scan.py → JSON.
  2. Read docs/hardware/reference.md for intent columns the math can't capture.
  3. Reason across the dimensions below. Tag each recommendation by type and state what it is based on (declared intent vs measured usage):
    • HA / redundancy — anti-affinity violations, SPOFs, HA-overkill, critical-but-unredundant services.
    • Right-sizing — over/under-provisioned VMs. Intent-based today; explicitly upgradeable to usage-based once the usage hook is live.
    • Placement / moves — oversubscribed nodes, constraint-driven relocation.
    • Upgrade timing — growth notes vs headroom → rough runway.
    • Drift — surfaces the scan's warnings[].
  4. Write docs/hardware/reviews/YYYY-MM-DD-capacity.md (+ latest.md), mirroring docs/reviews/.
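Most of step 3 is judgement, but some checks are mechanical enough to illustrate. The sketch below flags anti-affinity violations and tags each finding with a type and a basis. The row keys (service, node, anti_affinity_group) and the finding shape are hypothetical; the real columns are the ones in the placement table of reference.md.

```python
from collections import defaultdict


def anti_affinity_violations(placements):
    """Flag services in the same anti-affinity group that share a node.

    ASSUMPTION: each row is a dict with 'service', 'node', and an
    optional 'anti_affinity_group' key (illustrative names only).
    """
    by_group = defaultdict(lambda: defaultdict(list))
    for row in placements:
        group = row.get("anti_affinity_group")
        if group:
            by_group[group][row["node"]].append(row["service"])

    findings = []
    for group in sorted(by_group):
        for node, services in sorted(by_group[group].items()):
            if len(services) > 1:
                findings.append({
                    "type": "ha-redundancy",
                    "basis": "declared intent",
                    "detail": f"{', '.join(sorted(services))} share {node} (group {group})",
                })
    return findings


if __name__ == "__main__":
    # Made-up placement rows for illustration.
    rows = [
        {"service": "dns1", "node": "pve0", "anti_affinity_group": "dns"},
        {"service": "dns2", "node": "pve0", "anti_affinity_group": "dns"},
        {"service": "forgejo", "node": "pve1"},
    ]
    for finding in anti_affinity_violations(rows):
        print(finding)
```

Keeping the basis field explicit means a finding can later be relabelled as usage-based without changing the report structure.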

4. Recording — ADR + STATUS + CLAUDE.md

  • ADR-012 — Hardware reference & capacity evaluation (docs/decisions/012-hardware-capacity.md): records the decision and rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as an open decision (below).
  • STATUS.md rows: reference.md + capacity-scan.py → real/working (skeleton); /capacity-review → working, intent-only; live usage → designed, not built.
  • CLAUDE.md: a "Review capacity/hardware → /capacity-review" commands-table row + a "Further reading" pointer to ADR-012.

Data flow

reference.md ──┐
               ├─→ capacity-scan.py ──→ scan JSON ──┐
terraform ─────┤      (stdlib, JSON-via-subprocess)  ├─→ /capacity-review ─→ docs/hardware/reviews/
inventory ─────┘                                     │        (judgement)
reference.md (intent columns) ───────────────────────┘

Open decisions (deferred, tracked in TODO)

  • Usage-stats source (TODO 8.4): Proxmox RRD (built-in, no extra infra) vs the Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely run anyway (richer, per-process, more to operate; see TODO 3.6). Decide before building any usage hook to avoid throwaway work.
  • Script dependency policy (TODO #14): whether stdlib-only remains the rule for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
  • Scheduling (TODO 8.4): /capacity-review is on-demand for now; a scheduled (cron) run can be added later.

Deliverables & state at delivery

| Piece | Path | State |
|---|---|---|
| Reference doc | docs/hardware/reference.md | Skeleton + real node data |
| Scan script | scripts/capacity-scan.py | Working (stdlib, usage hook stubbed) |
| Evaluator skill | /capacity-review → docs/hardware/reviews/ | Working, intent-based |
| Decision record | docs/decisions/012-hardware-capacity.md | New ADR |
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |

Out of scope / YAGNI

  • No usage-stats collection until the cluster exists and the source is decided.
  • No structured-data (YAML) source of truth — markdown is the single hand-edited source by choice; revisit only if parsing pain demands it.
  • No automated moves/remediation — the evaluator recommends; humans act.