boma/docs/superpowers/specs/2026-06-01-hardware-capacity-design.md

# Design — Hardware reference & capacity evaluation

_Date: 2026-06-01 · Status: approved for planning_

## Problem

The repo documents the **logical/network** layer well — Terraform declares per-VM
`cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology.
But the **physical** layer is undocumented: how many Proxmox nodes physically
exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and
`askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and
nothing evaluates whether the design is well-proportioned — e.g. a service that
needn't be HA, a workload that should move nodes, or a node due a RAM/disk
upgrade.

## Goal

1. A single, human-first **hardware reference document** capturing physical
   compute + network gear and the intended workload placement.
2. A **capacity evaluator** ("script + skill") that reasons about optimization:
   HA overkill / missing redundancy, right-sizing, placement moves, and
   upgrade timing — emitting a dated report.

## Scope

- **In:** Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense,
  managed switch, APs); per-workload placement intent.
- **Out (for now):** power/UPS budget, NAS, cabling, rack layout, asset
  register, warranty/serial tracking.

## Non-negotiable repo conventions this must honor

- Mirror the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad
  (deterministic scan feeds a judgement skill; report is dated markdown).
- Utility scripts are **stdlib-only** for run-anywhere portability (control
  node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
- Be honest about real-vs-planned (STATUS.md). The physical cluster is **not
  stood up yet**, so live usage stats are a documented future hook, not a
  current capability.

## Architecture

Four pieces: three artifacts, plus the tracking updates that record them.

### 1. Reference doc — `docs/hardware/reference.md`

One hand-maintained markdown file, the source of truth for physical facts and
placement intent. Four parts:

1. **Physical compute** — one subsection per node (`pve0..2`, `askari`):
   model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots),
   storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes.
2. **Network gear** — OPNsense box, managed switch, APs: model, port/PoE
   counts, throughput, uplinks. Short table.
3. **Workload placement & intent** — one row per planned VM/service, columns:
   `Service | Home node | Criticality | HA intent | Resource profile |
   Placement constraints | Growth notes`. These columns map onto the four
   attribute groups chosen during brainstorming and give the evaluator concrete
   intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different
   nodes).
4. **Capacity summary** — per-node "allocated vs physical" rollup (RAM / cores /
   disk, headroom %).

Node-capacity tables use a **strict, documented format** so the scan script can
parse the numbers without a YAML dependency.
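
A minimal sketch of how such a table could be parsed with the standard library alone. The column set (`Node | Cores | RAM (GB) | Disk (GB)`), the sample values, and the `parse_node_table` name are assumptions for illustration; the real format is whatever `reference.md` ends up documenting.

```python
import re

# Hypothetical strict table layout; the real columns are defined in reference.md.
SAMPLE = """\
| Node | Cores | RAM (GB) | Disk (GB) |
|---|---|---|---|
| pve0 | 8 | 64 | 1000 |
| pve1 | 8 | 32 | 500 |
"""

# Maps the exact header text to the key used in the scan output.
COLUMNS = {"Node": "node", "Cores": "cores", "RAM (GB)": "ram_gb", "Disk (GB)": "disk_gb"}


def split_row(line):
    """Split one markdown table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_node_table(markdown):
    """Return {node: {cores, ram_gb, disk_gb}} from tables with the strict header."""
    nodes = {}
    header = None
    for line in markdown.splitlines():
        if not line.lstrip().startswith("|"):
            header = None  # a non-table line ends the current table
            continue
        cells = split_row(line)
        if header is None:
            # Only accept tables whose header matches exactly; ignore the rest.
            header = cells if cells == list(COLUMNS) else []
            continue
        if not header or all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue  # foreign table or separator row
        if len(cells) != len(header):
            raise ValueError(f"malformed row: {line!r}")
        row = dict(zip((COLUMNS[h] for h in header), cells))
        name = row.pop("node")
        nodes[name] = {key: int(value) for key, value in row.items()}
    return nodes
```

Failing loudly on a malformed row, rather than guessing, is what keeps the format "strict": a typo in the doc surfaces as a scan error instead of a silently wrong rollup.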

### 2. Scan script — `scripts/capacity-scan.py`

Stdlib-only and deterministic; writes JSON to stdout (like `repo-scan.py`). It
avoids hand-parsing YAML by shelling out to tools that already emit JSON, the
same pattern `tf_to_inventory.py` uses.
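
The shell-out pattern might look roughly like the helper below. The `run_json` name, the 60-second timeout, and the `(data, warning)` return shape are assumptions for this sketch, not something `tf_to_inventory.py` is known to define. The point is that a missing tool or a non-zero exit becomes a warning rather than a crash.

```python
import json
import subprocess


def run_json(cmd, cwd=None):
    """Run cmd and return (parsed_json, warning); never raises on tool failure."""
    try:
        proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, timeout=60)
    except OSError as exc:
        # Covers a missing binary, a missing cwd, or a permission problem.
        return None, f"{cmd[0]} could not be run: {exc.strerror}"
    except subprocess.TimeoutExpired:
        return None, f"{cmd[0]} timed out"
    if proc.returncode != 0:
        return None, f"{cmd[0]} exited {proc.returncode}"
    try:
        return json.loads(proc.stdout), None
    except json.JSONDecodeError as exc:
        return None, f"{cmd[0]} produced invalid JSON: {exc.msg}"
```

For example, `run_json(["terraform", "output", "-json"])` or `run_json(["ansible-inventory", "-i", "inventories/<env>/hosts.yml", "--list"])` would each yield either parsed JSON or a warning string suitable for `warnings[]`.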

Gathers **today**:
- **Declared allocations** — `terraform output -json` (and/or the `.tf` module
  calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no
  real VMs yet (current reality) instead of failing.
- **Inventory hosts** — `ansible-inventory -i inventories/<env>/hosts.yml
  --list` → JSON.
- **Physical capacities** — parses the strict node tables in `reference.md`.
- **Rollup math** — per node: allocated vs physical, headroom %,
  `oversubscribed` flag.
- **Drift warnings** — e.g. `reference.md` lists a host no Terraform VM
  declares; surfaced in a `warnings[]` array (a doc↔Terraform drift check that
  falls out of the scan at no extra cost).
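
As an illustration of the rollup math, here is a RAM-only sketch. It assumes headroom is `(physical - allocated) / physical` rounded to a whole percent, that `oversubscribed` means allocation exceeds physical RAM, and that 1 GB is 1024 MB; the `rollup` name and its input shapes are hypothetical. Cores and disk would follow the same pattern.

```python
def rollup(physical_ram_gb, workloads):
    """Per-node RAM rollup: allocated GB, headroom %, oversubscribed flag.

    physical_ram_gb: {node: total RAM in GB}
    workloads: [{"name": ..., "node": ..., "memory_mb": ...}, ...]
    """
    allocated_mb = {node: 0 for node in physical_ram_gb}
    for vm in workloads:
        # VMs on unknown nodes are skipped here; a drift check would report them.
        if vm["node"] in allocated_mb:
            allocated_mb[vm["node"]] += vm["memory_mb"]

    report = {}
    for node, ram_gb in physical_ram_gb.items():
        allocated_gb = allocated_mb[node] / 1024
        report[node] = {
            "ram_gb": ram_gb,
            "ram_allocated_gb": round(allocated_gb, 1),
            "headroom_pct": round(100 * (ram_gb - allocated_gb) / ram_gb),
            "oversubscribed": allocated_gb > ram_gb,
        }
    return report
```

With 12 GB allocated on a 64 GB node this gives `headroom_pct` of 81, since (64 − 12) / 64 = 81.25 %.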

**Stubbed future hook** (honest, à la STATUS.md):
```python
# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
def gather_usage():
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}
```

Output sketch:
```json
{
  "nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
  "workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
  "usage": {"available": false, "reason": "cluster not provisioned (see STATUS.md)"},
  "warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
}
```
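
The `warnings` entry in the sketch above could come from a plain set comparison. The `drift_warnings` name is hypothetical, and the reverse direction (a Terraform VM missing from `reference.md`) is an assumed extension; the design only names the doc-side case.

```python
def drift_warnings(reference_hosts, terraform_vms):
    """Compare host names from reference.md against Terraform-declared VMs."""
    ref, tf = set(reference_hosts), set(terraform_vms)
    warnings = [
        f"reference.md lists {host} but no Terraform VM declares it"
        for host in sorted(ref - tf)
    ]
    # Assumed extension: also flag VMs that the reference doc never mentions.
    warnings += [
        f"Terraform declares {host} but reference.md does not list it"
        for host in sorted(tf - ref)
    ]
    return warnings
```

Sorting keeps the output deterministic, so two scans of an unchanged repo produce identical JSON.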

### 3. Evaluator skill — `/capacity-review`

A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow:

1. Run `python3 scripts/capacity-scan.py` → JSON.
2. Read `docs/hardware/reference.md` for intent columns the math can't capture.
3. Reason across dimensions, each recommendation **tagged by type** and stating
   **what it is based on** (declared intent vs measured usage):
   - **HA / redundancy** — anti-affinity violations, SPOFs, HA-overkill,
     critical-but-unredundant services.
   - **Right-sizing** — over/under-provisioned VMs. *Intent-based today*;
     explicitly upgradeable to usage-based once the usage hook is live.
   - **Placement / moves** — oversubscribed nodes, constraint-driven relocation.
   - **Upgrade timing** — growth notes vs headroom → rough runway.
   - **Drift** — surfaces the scan's `warnings[]`.
4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`),
   mirroring `docs/reviews/`.
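
The skill itself is a judgement step, not a script, but some of its inputs are mechanical. As one example, an anti-affinity check over the placement table could be sketched as below. The function name, the input shapes, and the `type`/`basis` tag values are all assumptions for illustration.

```python
from collections import defaultdict


def anti_affinity_violations(placements, groups):
    """Find services that must be separated but share a node.

    placements: {service: node}, e.g. from the placement table's "Home node" column
    groups: lists of services that must not share a node, e.g. [["dns1", "dns2"]]
    """
    violations = []
    for group in groups:
        by_node = defaultdict(list)
        for service in group:
            if service in placements:  # unplaced services cannot collide yet
                by_node[placements[service]].append(service)
        for node, services in sorted(by_node.items()):
            if len(services) > 1:
                violations.append({
                    "type": "ha-redundancy",     # hypothetical tag value
                    "basis": "declared intent",  # not measured usage
                    "node": node,
                    "services": sorted(services),
                })
    return violations
```

Each finding carries both a type and a basis, matching the requirement that every recommendation says what it rests on.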

### 4. Recording — ADR + STATUS + CLAUDE.md

- **ADR-012 — Hardware reference & capacity evaluation**
  (`docs/decisions/012-hardware-capacity.md`): records the decision and
  rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as
  an open decision (below).
- **STATUS.md** rows: `reference.md` + `capacity-scan.py` → real/working
  (skeleton); `/capacity-review` → working, intent-only; live usage → designed,
  not built.
- **CLAUDE.md**: a "Review capacity/hardware → `/capacity-review`" commands-table
  row + a "Further reading" pointer to ADR-012.

## Data flow

```
reference.md ──┐
               ├─→ capacity-scan.py ──→ scan JSON ──┐
terraform ─────┤      (stdlib, JSON-via-subprocess)  ├─→ /capacity-review ─→ docs/hardware/reviews/
inventory ─────┘                                     │        (judgement)
reference.md (intent columns) ───────────────────────┘
```

## Open decisions (deferred, tracked in TODO)

- **Usage-stats source** (TODO 8.4): **Proxmox RRD** (built-in, no extra infra)
  vs the **Prometheus/Loki/Grafana/Grafana-Alloy** stack we will likely run
  anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before
  building any usage hook** to avoid throwaway work.
- **Script dependency policy** (TODO #14): whether stdlib-only remains the rule
  for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
- **Scheduling** (TODO 8.4): `/capacity-review` is on-demand now; cron later.

## Deliverables & state at delivery

| Piece | Path | State |
|---|---|---|
| Reference doc | `docs/hardware/reference.md` | Skeleton + real node data |
| Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) |
| Evaluator skill | `/capacity-review` → `docs/hardware/reviews/` | Working, intent-based |
| Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR |
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |

## Out of scope / YAGNI

- No usage-stats collection until the cluster exists and the source is decided.
- No structured-data (YAML) source of truth — markdown is the single hand-edited
  source by choice; revisit only if parsing pain demands it.
- No automated moves/remediation — the evaluator recommends; humans act.