boma/docs/superpowers/specs/2026-06-01-hardware-capacity-design.md

# Design — Hardware reference & capacity evaluation

_Date: 2026-06-01 · Status: approved for planning_

## Problem

The repo documents the **logical/network** layer well — Terraform declares per-VM
`cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology.
But the **physical** layer is undocumented: how many Proxmox nodes physically
exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and
`askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and
nothing evaluates whether the design is well-proportioned — e.g. a service that
needn't be HA, a workload that should move nodes, or a node due a RAM/disk
upgrade.

## Goal

1. A single, human-first **hardware reference document** capturing physical
   compute + network gear and the intended workload placement.
2. A **capacity evaluator** ("script + skill") that reasons about optimization:
   HA overkill / missing redundancy, right-sizing, placement moves, and
   upgrade timing — emitting a dated report.

## Scope

- **In:** Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense,
  managed switch, APs); per-workload placement intent.
- **Out (for now):** power/UPS budget, NAS, cabling, rack layout, asset
  register, warranty/serial tracking.

## Non-negotiable repo conventions this must honor

- Mirror the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad
  (deterministic scan feeds a judgement skill; report is dated markdown).
- Utility scripts are **stdlib-only** for run-anywhere portability (control
  node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
- Be honest about real-vs-planned (STATUS.md). The physical cluster is **not
  stood up yet**, so live usage stats are a documented future hook, not a
  current capability.

## Architecture

Four pieces, plus tracking updates.

### 1. Reference doc — `docs/hardware/reference.md`

One hand-maintained markdown file, the source of truth for physical facts and
placement intent. Four parts:

1. **Physical compute** — one subsection per node (`pve0..2`, `askari`):
   model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots),
   storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes.
2. **Network gear** — OPNsense box, managed switch, APs: model, port/PoE
   counts, throughput, uplinks. Short table.
3. **Workload placement & intent** — one row per planned VM/service, columns:
   `Service | Home node | Criticality | HA intent | Resource profile |
   Placement constraints | Growth notes`. These columns map onto the four
   attribute groups chosen during brainstorming and give the evaluator concrete
   intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different
   nodes).
4. **Capacity summary** — per-node "allocated vs physical" rollup (RAM / cores /
   disk, headroom %).

Node-capacity tables use a **strict, documented format** so the scan script can
parse the numbers without a YAML dependency.

### 2. Scan script — `scripts/capacity-scan.py`

Stdlib-only, deterministic, JSON to stdout (like `repo-scan.py`). Avoids
hand-parsing YAML by shelling out for JSON, the pattern `tf_to_inventory.py`
already uses.

Gathers **today**:
- **Declared allocations** — `terraform output -json` (and/or the `.tf` module
  calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no
  real VMs yet (current reality) instead of failing.
- **Inventory hosts** — `ansible-inventory -i inventories/<env>/hosts.yml
  --list` → JSON.
- **Physical capacities** — parses the strict node tables in `reference.md`.
- **Rollup math** — per node: allocated vs physical, headroom %,
  `oversubscribed` flag.
- **Drift warnings** — e.g. `reference.md` lists a host no Terraform VM
  declares; surfaced in a `warnings[]` array (free doc↔Terraform drift check).

**Stubbed future hook** (honest, à la STATUS.md):
```python
# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
def gather_usage():
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}
```

Output sketch:
```json
{
  "nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
  "workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
  "usage": {"available": false, "reason": "cluster not provisioned"},
  "warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
}
```

### 3. Evaluator skill — `/capacity-review`

A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow:

1. Run `python3 scripts/capacity-scan.py` → JSON.
2. Read `docs/hardware/reference.md` for intent columns the math can't capture.
3. Reason across dimensions, each recommendation **tagged by type** and stating
   **what it is based on** (declared intent vs measured usage):
   - **HA / redundancy** — anti-affinity violations, SPOFs, HA-overkill,
     critical-but-unredundant services.
   - **Right-sizing** — over/under-provisioned VMs. *Intent-based today*;
     explicitly upgradeable to usage-based once the usage hook is live.
   - **Placement / moves** — oversubscribed nodes, constraint-driven relocation.
   - **Upgrade timing** — growth notes vs headroom → rough runway.
   - **Drift** — surfaces the scan's `warnings[]`.
4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`),
   mirroring `docs/reviews/`.

### 4. Recording — ADR + STATUS + CLAUDE.md

- **ADR-012 — Hardware reference & capacity evaluation**
  (`docs/decisions/012-hardware-capacity.md`): records the decision and
  rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as
  an open decision (below).
- **STATUS.md** rows: `reference.md` + `capacity-scan.py` → real/working
  (skeleton); `/capacity-review` → working, intent-only; live usage → designed,
  not built.
- **CLAUDE.md**: a "Review capacity/hardware → `/capacity-review`" commands-table
  row + a "Further reading" pointer to ADR-012.

## Data flow

```
reference.md ──┐
               ├─→ capacity-scan.py ──→ scan JSON ──┐
terraform ─────┤      (stdlib, JSON-via-subprocess)  ├─→ /capacity-review ─→ docs/hardware/reviews/
inventory ─────┘                                     │        (judgement)
reference.md (intent columns) ───────────────────────┘
```

## Open decisions (deferred, tracked in TODO)

- **Usage-stats source** (TODO 8.4): **Proxmox RRD** (built-in, no extra infra)
  vs the **Prometheus/Loki/Grafana/Grafana-Alloy** stack we will likely run
  anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before
  building any usage hook** to avoid throwaway work.
- **Script dependency policy** (TODO #14): whether stdlib-only remains the rule
  for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
- **Scheduling** (TODO 8.4): `/capacity-review` is on-demand now; cron later.

## Deliverables & state at delivery

| Piece | Path | State |
|---|---|---|
| Reference doc | `docs/hardware/reference.md` | Skeleton + real node data |
| Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) |
| Evaluator skill | `/capacity-review` → `docs/hardware/reviews/` | Working, intent-based |
| Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR |
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |

## Out of scope / YAGNI

- No usage-stats collection until the cluster exists and the source is decided.
- No structured-data (YAML) source of truth — markdown is the single hand-edited
  source by choice; revisit only if parsing pain demands it.
- No automated moves/remediation — the evaluator recommends; humans act.
Add hardware reference & capacity-evaluation design spec Brainstormed design for docs/hardware/reference.md (physical compute + network gear + workload placement intent), a stdlib-only capacity-scan.py, and an on-demand /capacity-review skill that reports to docs/hardware/reviews/. Mirrors the repo-scan -> /review-repo -> docs/reviews triad. TODO additions: schedule /capacity-review later and decide its usage-stats source (Proxmox RRD vs the Prometheus/Loki/Grafana/Alloy stack) before building any hook (8.4); reevaluate the stdlib-only script policy (#14). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-01 09:59:16 +02:00			`# Design — Hardware reference & capacity evaluation`

			`_Date: 2026-06-01 · Status: approved for planning_`

			`## Problem`

			`The repo documents the logical/network layer well — Terraform declares per-VM`
			`cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology.
			`But the physical layer is undocumented: how many Proxmox nodes physically`
			`exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and`
			`askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and
			`nothing evaluates whether the design is well-proportioned — e.g. a service that`
			`needn't be HA, a workload that should move nodes, or a node due a RAM/disk`
			`upgrade.`

			`## Goal`

			`1. A single, human-first hardware reference document capturing physical`
			`compute + network gear and the intended workload placement.`
			`2. A capacity evaluator ("script + skill") that reasons about optimization:`
			`HA overkill / missing redundancy, right-sizing, placement moves, and`
			`upgrade timing — emitting a dated report.`

			`## Scope`

			- In: Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense,
			`managed switch, APs); per-workload placement intent.`
			`- Out (for now): power/UPS budget, NAS, cabling, rack layout, asset`
			`register, warranty/serial tracking.`

			`## Non-negotiable repo conventions this must honor`

			- Mirror the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad
			`(deterministic scan feeds a judgement skill; report is dated markdown).`
			`- Utility scripts are stdlib-only for run-anywhere portability (control`
			`node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.`
			`- Be honest about real-vs-planned (STATUS.md). The physical cluster is **not`
			`stood up yet**, so live usage stats are a documented future hook, not a`
			`current capability.`

			`## Architecture`

			`Four pieces, plus tracking updates.`

			### 1. Reference doc — `docs/hardware/reference.md`

			`One hand-maintained markdown file, the source of truth for physical facts and`
			`placement intent. Four parts:`

			1. Physical compute — one subsection per node (`pve0..2`, `askari`):
			`model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots),`
			storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes.
			`2. Network gear — OPNsense box, managed switch, APs: model, port/PoE`
			`counts, throughput, uplinks. Short table.`
			`3. Workload placement & intent — one row per planned VM/service, columns:`
			`Service \| Home node \| Criticality \| HA intent \| Resource profile \|
			Placement constraints \| Growth notes`. These columns map onto the four
			`attribute groups chosen during brainstorming and give the evaluator concrete`
			intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different
			`nodes).`
			`4. Capacity summary — per-node "allocated vs physical" rollup (RAM / cores /`
			`disk, headroom %).`

			`Node-capacity tables use a strict, documented format so the scan script can`
			`parse the numbers without a YAML dependency.`

			### 2. Scan script — `scripts/capacity-scan.py`

			Stdlib-only, deterministic, JSON to stdout (like `repo-scan.py`). Avoids
			hand-parsing YAML by shelling out for JSON, the pattern `tf_to_inventory.py`
			`already uses.`

			`Gathers today:`
			- Declared allocations — `terraform output -json` (and/or the `.tf` module
			`calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no`
			`real VMs yet (current reality) instead of failing.`
			- Inventory hosts — `ansible-inventory -i inventories/<env>/hosts.yml
			--list` → JSON.
			- Physical capacities — parses the strict node tables in `reference.md`.
			`- Rollup math — per node: allocated vs physical, headroom %,`
			`oversubscribed` flag.
			- Drift warnings — e.g. `reference.md` lists a host no Terraform VM
			declares; surfaced in a `warnings[]` array (free doc↔Terraform drift check).

			`Stubbed future hook (honest, à la STATUS.md):`
			```python
			`# FUTURE: live usage stats (per-VM CPU/RAM/disk history).`
			`# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".`
			`def gather_usage():`
			`return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}`
			```

			`Output sketch:`
			```json
			`{`
			`"nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},`
			`"workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],`
			`"usage": {"available": false, "reason": "cluster not provisioned"},`
			`"warnings": ["reference.md lists dns1 but no Terraform VM declares it"]`
			`}`
			```

			### 3. Evaluator skill — `/capacity-review`

			A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow:

			1. Run `python3 scripts/capacity-scan.py` → JSON.
			2. Read `docs/hardware/reference.md` for intent columns the math can't capture.
			`3. Reason across dimensions, each recommendation tagged by type and stating`
			`what it is based on (declared intent vs measured usage):`
			`- HA / redundancy — anti-affinity violations, SPOFs, HA-overkill,`
			`critical-but-unredundant services.`
			`- Right-sizing — over/under-provisioned VMs. Intent-based today;`
			`explicitly upgradeable to usage-based once the usage hook is live.`
			`- Placement / moves — oversubscribed nodes, constraint-driven relocation.`
			`- Upgrade timing — growth notes vs headroom → rough runway.`
			- Drift — surfaces the scan's `warnings[]`.
			4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`),
			mirroring `docs/reviews/`.

			`### 4. Recording — ADR + STATUS + CLAUDE.md`

			`- ADR-012 — Hardware reference & capacity evaluation`
			(`docs/decisions/012-hardware-capacity.md`): records the decision and
			`rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as`
			`an open decision (below).`
			- STATUS.md rows: `reference.md` + `capacity-scan.py` → real/working
			(skeleton); `/capacity-review` → working, intent-only; live usage → designed,
			`not built.`
			- CLAUDE.md: a "Review capacity/hardware → `/capacity-review`" commands-table
			`row + a "Further reading" pointer to ADR-012.`

			`## Data flow`

			```
			`reference.md ──┐`
			`├─→ capacity-scan.py ──→ scan JSON ──┐`
			`terraform ─────┤ (stdlib, JSON-via-subprocess) ├─→ /capacity-review ─→ docs/hardware/reviews/`
			`inventory ─────┘ │ (judgement)`
			`reference.md (intent columns) ───────────────────────┘`
			```

			`## Open decisions (deferred, tracked in TODO)`

			`- Usage-stats source (TODO 8.4): Proxmox RRD (built-in, no extra infra)`
			`vs the Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely run`
			`anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before`
			`building any usage hook** to avoid throwaway work.`
			`- Script dependency policy (TODO #14): whether stdlib-only remains the rule`
			`for utility scripts or libraries (e.g. PyYAML) are selectively allowed.`
			- Scheduling (TODO 8.4): `/capacity-review` is on-demand now; cron later.

			`## Deliverables & state at delivery`

			`\| Piece \| Path \| State \|`
			`\|---\|---\|---\|`
			\| Reference doc \| `docs/hardware/reference.md` \| Skeleton + real node data \|
			\| Scan script \| `scripts/capacity-scan.py` \| Working (stdlib, usage hook stubbed) \|
			\| Evaluator skill \| `/capacity-review` → `docs/hardware/reviews/` \| Working, intent-based \|
			\| Decision record \| `docs/decisions/012-hardware-capacity.md` \| New ADR \|
			`\| Tracking \| STATUS.md, CLAUDE.md, TODO #14 + 8.4 \| Updated \|`

			`## Out of scope / YAGNI`

			`- No usage-stats collection until the cluster exists and the source is decided.`
			`- No structured-data (YAML) source of truth — markdown is the single hand-edited`
			`source by choice; revisit only if parsing pain demands it.`
			`- No automated moves/remediation — the evaluator recommends; humans act.`