boma/docs/superpowers/plans/2026-06-01-hardware-capacity.md
sjat 6ff5d55810 Add implementation plan for hardware capacity tooling
Task-by-task TDD plan: reference.md skeleton, stdlib-only capacity-scan.py
(parse_table, compute_rollup, drift, usage stub, main), /capacity-review skill,
and ADR-012 + STATUS/CLAUDE/scripts-README updates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 10:04:59 +02:00


# Hardware Reference & Capacity Evaluation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a hand-maintained hardware reference doc, a stdlib-only `capacity-scan.py` that emits deterministic capacity facts, and an on-demand `/capacity-review` skill that reasons about HA / right-sizing / placement / upgrade timing.
**Architecture:** `docs/hardware/reference.md` is the single machine-readable source of truth (physical node capacities + workload allocations + placement intent). `scripts/capacity-scan.py` parses its tables, computes per-node allocated-vs-physical rollups, and cross-checks workload hostnames against `terraform output -json` / `ansible-inventory --list` to surface drift, degrading gracefully when nothing is provisioned. `/capacity-review` runs the scan, reads the intent columns, and writes a dated report to `docs/hardware/reviews/`. Live usage stats are a stubbed future hook. Mirrors the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad.
**Tech Stack:** Python 3 standard library only (no third-party imports in the script); `pytest` for unit tests (already in `requirements.txt`); markdown docs; a `.claude` slash-command skill.
---
## Spec
Design spec: `docs/superpowers/specs/2026-06-01-hardware-capacity-design.md`.
**Refinement vs spec:** the spec said allocations come from Terraform. The current
`terraform output "vms"` only exposes `{ip, group}`, not cores/RAM/disk, so numeric
allocations are read from `reference.md` instead; Terraform/inventory are used only
for hostname-drift cross-checks. This better honors the "self-contained markdown
source of truth" decision and needs no Terraform module changes.
## File Structure
- `docs/hardware/reference.md`: **create.** Source of truth. Human sections
  (physical compute, network gear) + two machine-readable tables (node capacity,
  workload placement) the script parses.
- `scripts/capacity-scan.py`: **create.** Stdlib-only. Pure parse/math functions
  + thin subprocess glue + `main()` emitting JSON to stdout.
- `tests/test_capacity_scan.py`: **create.** Pytest unit tests for the pure
  functions + a smoke test against the real `reference.md`.
- `.claude/commands/capacity-review.md`: **create.** The `/capacity-review` skill.
- `docs/hardware/reviews/.gitkeep`: **create.** Report output dir.
- `docs/decisions/012-hardware-capacity.md`: **create.** ADR recording the decision.
- `STATUS.md`: **modify.** Add real-vs-planned rows.
- `CLAUDE.md`: **modify.** Commands-table row + Further-reading pointer.
- `scripts/README.md`: **modify.** Document `capacity-scan.py`.
### Machine-readable table contract (used by Task 1 and the parser)
`reference.md` must contain these two tables verbatim in header shape. The parser
keys on header names, so column order is flexible and extra free-text columns are
ignored.
**Node capacity** — header contains `node, cores, ram_gb, disk_gb` (integers/floats):
```
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
```
**Workload placement** — header contains the numeric columns `workload, node,
cores, ram_mb, disk_gb` plus any free-text intent columns:
```
| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny | anti-affinity: dns2 elsewhere | flat |
```
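Because the parser keys on header names, reordering columns or adding a free-text column should not change what it extracts. The following is a minimal sketch of that property, assuming a simplified single-table reader; `rows_by_header`, `required_only`, and the `notes` column are hypothetical and not part of this plan's `capacity-scan.py`.

```python
def split_row(line):
    """Split one markdown table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def rows_by_header(table_text):
    """Map each body row of a single markdown table to {header: cell}."""
    lines = [ln for ln in table_text.strip().splitlines() if ln.strip()]
    headers = split_row(lines[0])
    # lines[1] is the |---| separator row; body rows start at index 2.
    return [dict(zip(headers, split_row(ln))) for ln in lines[2:]]


contract_order = """
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
"""

# Same data, columns shuffled, plus a hypothetical free-text column.
shuffled = """
| disk_gb | node | notes | ram_gb | cores |
|---------|------|-------|--------|-------|
| 4000 | pve0 | rack A | 64 | 20 |
"""

REQUIRED = ("node", "cores", "ram_gb", "disk_gb")


def required_only(row):
    """Keep just the contract columns, dropping any extras."""
    return {name: row[name] for name in REQUIRED}


a = required_only(rows_by_header(contract_order)[0])
b = required_only(rows_by_header(shuffled)[0])
print(a == b)  # True
```

Both layouts yield the same four contract values, which is why the contract constrains header names rather than column positions.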
---
## Task 1: Reference doc skeleton
**Files:**
- Create: `docs/hardware/reference.md`
- Create: `docs/hardware/reviews/.gitkeep`
- [ ] **Step 1: Write `docs/hardware/reference.md`**
```markdown
# Hardware reference — boma
> Hand-maintained source of truth for **physical** compute + network gear and
> **workload placement intent**. The two machine-readable tables (Node capacity,
> Workload placement) are parsed by `scripts/capacity-scan.py` — keep their
> headers intact. Evaluated by `/capacity-review`. See ADR-012.
>
> _Status: skeleton. Replace example rows with real hardware once the cluster is
> stood up (STATUS.md tracks real-vs-planned)._
## 1. Physical compute
### pve0
- **Model / form factor:** _TBD (e.g. Minisforum MS-01, mini-PC)_
- **CPU:** _TBD (e.g. i9-13900H, 14C/20T)_
- **RAM:** _TBD total; max _; free DIMM slots _
- **Storage:** _TBD (disks → pools, e.g. 2× 2 TB NVMe → `local-zfs`)_
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_
_(repeat for pve1, pve2, askari)_
## 2. Network gear
| device | model | ports | poe | throughput | uplinks | notes |
|----------|-------|-------|-----|------------|---------|-------|
| opnsense | _TBD_ | _TBD_ | n/a | _TBD_ | WAN+LAN | dedicated hardware |
| switch | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | managed, 802.1q |
| ap1 | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | multi-SSID per VLAN |
## 3. Workload placement & intent
The numeric columns (`cores, ram_mb, disk_gb`) feed `capacity-scan.py`; the
free-text columns feed `/capacity-review`'s judgement.
| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny/steady | anti-affinity: dns2 on a different node | flat |
| dns2 | pve1 | 1 | 512 | 10 | high | pair/dns1 | tiny/steady | anti-affinity: dns1 on a different node | flat |
## 4. Node capacity (machine-readable)
Physical totals per node. `cores` is an integer; `ram_gb` and `disk_gb` may be decimals.
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
## 5. Capacity notes
Free-text running notes for the evaluator (trends, planned moves, upgrade ideas).
```
- [ ] **Step 2: Create the reports directory**
Run: `mkdir -p docs/hardware/reviews && touch docs/hardware/reviews/.gitkeep`
Expected: both paths exist.
- [ ] **Step 3: Verify the machine-readable headers match the contract**
Run: `grep -n '| node | cores | ram_gb | disk_gb |' docs/hardware/reference.md && grep -n '| workload | node | cores | ram_mb | disk_gb |' docs/hardware/reference.md`
Expected: each grep prints one matching line (the table headers the parser keys on).
- [ ] **Step 4: Commit**
```bash
git add docs/hardware/reference.md docs/hardware/reviews/.gitkeep
git commit -m "Add hardware reference doc skeleton + reviews dir"
```
---
## Task 2: Scan script — `parse_table()`
**Files:**
- Create: `scripts/capacity-scan.py`
- Create: `tests/test_capacity_scan.py`
- [ ] **Step 1: Write the failing test**
Create `tests/test_capacity_scan.py`:
```python
import importlib.util
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "capacity-scan.py"
_spec = importlib.util.spec_from_file_location("capacity_scan", _PATH)
cs = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(cs)


def test_parse_table_keys_on_header_and_ignores_extra_cols():
    md = """
intro text
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
trailing text
"""
    rows = cs.parse_table(md, ["node", "cores", "ram_gb", "disk_gb"])
    assert rows == [
        {"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"},
        {"node": "pve1", "cores": "20", "ram_gb": "64", "disk_gb": "4000"},
    ]


def test_parse_table_returns_empty_when_header_absent():
    assert cs.parse_table("no tables here", ["node", "cores"]) == []
```
- [ ] **Step 2: Run test to verify it fails**
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
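Expected: FAIL at import time, because `scripts/capacity-scan.py` does not exist yet (typically a `FileNotFoundError` from the importlib loader); once the file exists without `parse_table`, the failure becomes an `AttributeError`.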
- [ ] **Step 3: Write minimal implementation**
Create `scripts/capacity-scan.py`:
```python
#!/usr/bin/env python3
"""capacity-scan.py — deterministic capacity facts for /capacity-review.
Python standard library only. Emits a JSON object to stdout.
Reads physical capacities and workload allocations from the machine-readable
tables in docs/hardware/reference.md, computes per-node allocated-vs-physical
rollups, and cross-checks workload hostnames against `terraform output -json`
and `ansible-inventory --list` to surface drift. Degrades gracefully when
nothing is provisioned. Live usage stats are a documented future hook.
Usage: python3 scripts/capacity-scan.py [--env staging] [--reference PATH]
"""

import argparse
import json
import os
import subprocess
import sys

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


def parse_table(markdown, required_cols):
    """Return row dicts for the first markdown table whose header contains all
    required_cols. Keys are header names; values are raw cell strings."""
    lines = markdown.splitlines()
    required = set(required_cols)
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = [c.strip() for c in line.strip("|").split("|")]
        if not required.issubset(set(headers)):
            continue
        rows = []
        # Skip the header (i) and the |---| separator (i + 1).
        for body in lines[i + 2:]:
            if not body.strip().startswith("|"):
                break
            cells = [c.strip() for c in body.strip().strip("|").split("|")]
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []
```
- [ ] **Step 4: Run test to verify it passes**
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: PASS (2 passed).
- [ ] **Step 5: Commit**
```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add capacity-scan.py with parse_table()"
```
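`parse_table()` returns raw cell strings, so numeric conversion happens in a later step. The following is a minimal sketch of one way to coerce a cell, assuming placeholder cells such as `_TBD_` should be treated as missing rather than stop the scan; `to_number` is a hypothetical helper and not part of this plan's script.

```python
def to_number(cell, kind=float):
    """Convert a raw table cell to a number, or None if it is not numeric."""
    try:
        return kind(cell.strip())
    except ValueError:
        return None


# Sample row in the shape parse_table() returns (all values are strings).
row = {"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "_TBD_"}

cores = to_number(row["cores"], int)   # integer column
ram_gb = to_number(row["ram_gb"])      # float column
disk_gb = to_number(row["disk_gb"])    # placeholder, so not a number

print(cores, ram_gb, disk_gb)  # 20 64.0 None
```

The `compute_rollup()` defined in Task 3 calls `int()` and `float()` directly, so a placeholder left in a parsed table would raise `ValueError` there. Whether to tolerate placeholders or fail loudly is a design choice this plan leaves implicit.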
---
## Task 3: Rollup math — `compute_rollup()`
**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`
- [ ] **Step 1: Write the failing test (append to `tests/test_capacity_scan.py`)**
```python
def test_compute_rollup_sums_allocations_and_flags_headroom():
    node_rows = [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}]
    workload_rows = [
        {"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"},
        {"workload": "forgejo", "node": "pve0", "cores": "4", "ram_mb": "8192", "disk_gb": "100"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    pve0 = nodes["pve0"]
    assert pve0["alloc_cores"] == 5
    assert pve0["alloc_ram_gb"] == 8.5  # (512 + 8192) / 1024
    assert pve0["alloc_disk_gb"] == 110.0
    assert pve0["ram_headroom_pct"] == 87  # round(100 * (64 - 8.5) / 64)
    assert pve0["oversubscribed"] is False


def test_compute_rollup_flags_oversubscription():
    node_rows = [{"node": "tiny", "cores": "2", "ram_gb": "4", "disk_gb": "50"}]
    workload_rows = [
        {"workload": "hog", "node": "tiny", "cores": "4", "ram_mb": "1024", "disk_gb": "10"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    assert nodes["tiny"]["oversubscribed"] is True  # 4 cores > 2


def test_compute_rollup_ignores_workloads_on_unknown_nodes():
    nodes = cs.compute_rollup(
        [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}],
        [{"workload": "ghost", "node": "nope", "cores": "1", "ram_mb": "512", "disk_gb": "10"}],
    )
    assert nodes["pve0"]["alloc_cores"] == 0
```
- [ ] **Step 2: Run test to verify it fails**
Run: `python3 -m pytest tests/test_capacity_scan.py -k compute_rollup -v`
Expected: FAIL — `AttributeError: module 'capacity_scan' has no attribute 'compute_rollup'`.
- [ ] **Step 3: Write minimal implementation (append to `scripts/capacity-scan.py`, before any `main`)**
```python
def compute_rollup(node_rows, workload_rows):
    """Per node: physical totals, summed allocations, RAM headroom %, and an
    oversubscribed flag. Workloads on unknown nodes are ignored."""
    nodes = {}
    for r in node_rows:
        nodes[r["node"]] = {
            "cores": int(r["cores"]),
            "ram_gb": float(r["ram_gb"]),
            "disk_gb": float(r["disk_gb"]),
            "alloc_cores": 0,
            "alloc_ram_mb": 0,
            "alloc_disk_gb": 0.0,
        }
    for w in workload_rows:
        node = nodes.get(w["node"])
        if node is None:
            continue
        node["alloc_cores"] += int(w["cores"])
        node["alloc_ram_mb"] += int(w["ram_mb"])
        node["alloc_disk_gb"] += float(w["disk_gb"])
    for node in nodes.values():
        # Workload RAM is declared in MB; report it in GB like the node total.
        node["alloc_ram_gb"] = round(node.pop("alloc_ram_mb") / 1024, 1)
        node["ram_headroom_pct"] = (
            round(100 * (node["ram_gb"] - node["alloc_ram_gb"]) / node["ram_gb"])
            if node["ram_gb"]
            else 0
        )
        node["oversubscribed"] = (
            node["alloc_cores"] > node["cores"]
            or node["alloc_ram_gb"] > node["ram_gb"]
            or node["alloc_disk_gb"] > node["disk_gb"]
        )
    return nodes
```
- [ ] **Step 4: Run test to verify it passes**
Run: `python3 -m pytest tests/test_capacity_scan.py -k compute_rollup -v`
Expected: PASS (3 passed).
- [ ] **Step 5: Commit**
```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add compute_rollup() to capacity-scan.py"
```
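The rollup arithmetic can be checked by hand. The following is a worked sketch using the same figures as the first rollup test (one 64 GB node carrying `dns1` and `forgejo`); the `MB_PER_GB` constant and the edge-case value are illustrative assumptions, not names from the script.

```python
MB_PER_GB = 1024

node = {"cores": 20, "ram_gb": 64.0, "disk_gb": 4000.0}
workloads = [
    {"cores": 1, "ram_mb": 512, "disk_gb": 10.0},    # dns1
    {"cores": 4, "ram_mb": 8192, "disk_gb": 100.0},  # forgejo
]

alloc_cores = sum(w["cores"] for w in workloads)
alloc_ram_gb = round(sum(w["ram_mb"] for w in workloads) / MB_PER_GB, 1)
alloc_disk_gb = sum(w["disk_gb"] for w in workloads)
ram_headroom_pct = round(100 * (node["ram_gb"] - alloc_ram_gb) / node["ram_gb"])

print(alloc_cores, alloc_ram_gb, alloc_disk_gb, ram_headroom_pct)
# 5 8.5 110.0 87

# Edge case: 65570 MB is 34 MB more than 64 GB, but rounds to 64.0 GB.
slightly_over_gb = round(65570 / MB_PER_GB, 1)
print(slightly_over_gb > node["ram_gb"])  # False
```

Because `alloc_ram_gb` is rounded to one decimal before the comparison, an allocation that exceeds physical RAM by less than roughly 50 MB would not set `oversubscribed`. That may be acceptable at this granularity; comparing in MB would be the stricter alternative.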
---
## Task 4: Drift detection — `find_drift()` + hostname parsers
**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`
- [ ] **Step 1: Write the failing test (append)**
```python
def test_parse_tf_hostnames_reads_vms_value_keys():
    tf_json = '{"vms": {"value": {"dns1": {"ip": "10.20.0.10", "group": "docker_hosts"}}}}'
    assert cs.parse_tf_hostnames(tf_json) == {"dns1"}


def test_parse_inventory_hostnames_reads_meta_hostvars():
    inv_json = '{"_meta": {"hostvars": {"dns1": {}, "proxy": {}}}}'
    assert cs.parse_inventory_hostnames(inv_json) == {"dns1", "proxy"}


def test_find_drift_reports_both_directions():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    warnings = cs.find_drift(workload_rows, {"proxy"})
    assert any("dns1" in w and "no Terraform" in w for w in warnings)
    assert any("proxy" in w and "absent from reference.md" in w for w in warnings)


def test_find_drift_silent_when_no_hostnames_known():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    assert cs.find_drift(workload_rows, set()) == []
```
- [ ] **Step 2: Run test to verify it fails**
Run: `python3 -m pytest tests/test_capacity_scan.py -k "drift or hostnames" -v`
Expected: FAIL — attributes `parse_tf_hostnames` / `parse_inventory_hostnames` / `find_drift` not defined.
- [ ] **Step 3: Write minimal implementation (append)**
```python
def parse_tf_hostnames(tf_json):
    """Hostnames from `terraform output -json` (the `vms` map keys)."""
    data = json.loads(tf_json)
    return set(data.get("vms", {}).get("value", {}).keys())


def parse_inventory_hostnames(inv_json):
    """Hostnames from `ansible-inventory --list` (_meta.hostvars keys)."""
    data = json.loads(inv_json)
    return set(data.get("_meta", {}).get("hostvars", {}).keys())


def find_drift(workload_rows, known_hostnames):
    """Warn when reference.md workloads and live hostnames disagree. Silent when
    no hostnames are known (pre-provisioning): nothing to compare against."""
    warnings = []
    declared = {w["workload"] for w in workload_rows}
    if not known_hostnames:
        return warnings
    for name in sorted(declared - known_hostnames):
        warnings.append(
            f"reference.md lists '{name}' but no Terraform/inventory host declares it"
        )
    for name in sorted(known_hostnames - declared):
        warnings.append(
            f"host '{name}' exists in Terraform/inventory but is absent from reference.md"
        )
    return warnings
```
- [ ] **Step 4: Run test to verify it passes**
Run: `python3 -m pytest tests/test_capacity_scan.py -k "drift or hostnames" -v`
Expected: PASS (4 passed).
- [ ] **Step 5: Commit**
```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add hostname parsers + find_drift() to capacity-scan.py"
```
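Drift detection reduces to two set differences over hostnames. The following is a worked sketch using the JSON shapes from the tests above; the `declared` set and the variable names are sample values chosen for illustration.

```python
import json

# Shapes as in the tests: `terraform output -json` and `ansible-inventory --list`.
tf_output = json.loads(
    '{"vms": {"value": {"dns1": {"ip": "10.20.0.10", "group": "docker_hosts"}}}}'
)
inventory = json.loads('{"_meta": {"hostvars": {"dns1": {}, "proxy": {}}}}')

# Union of hostnames from both live sources.
live = set(tf_output["vms"]["value"]) | set(inventory["_meta"]["hostvars"])

# Workload names as declared in reference.md (sample values).
declared = {"dns1", "dns2"}

only_in_reference = sorted(declared - live)  # declared but not provisioned
only_in_live = sorted(live - declared)       # provisioned but undocumented

print(only_in_reference)  # ['dns2']
print(only_in_live)       # ['proxy']
```

Each direction maps to one of the two warning messages: `dns2` is listed in `reference.md` with no live host, and `proxy` is live but absent from `reference.md`.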
---
## Task 5: Subprocess glue + usage stub + `main()`
**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`
- [ ] **Step 1: Write the failing test (append)**
```python
import json as _json


def test_gather_usage_is_stubbed_unavailable():
    usage = cs.gather_usage()
    assert usage["available"] is False
    assert "reason" in usage


def test_known_hostnames_degrades_to_empty(monkeypatch):
    # Simulate terraform/ansible-inventory being absent or failing.
    def boom(*a, **k):
        raise FileNotFoundError("no such tool")

    monkeypatch.setattr(cs.subprocess, "run", boom)
    assert cs.known_hostnames("staging") == set()


def test_main_emits_valid_json_against_real_reference(monkeypatch, capsys):
    # Isolate from the host: no real terraform/ansible needed.
    monkeypatch.setattr(cs, "known_hostnames", lambda env: set())
    monkeypatch.setattr("sys.argv", ["capacity-scan.py"])
    cs.main()
    out = _json.loads(capsys.readouterr().out)
    assert set(out) == {"nodes", "workloads", "usage", "warnings"}
    assert out["usage"]["available"] is False
    assert "pve0" in out["nodes"]  # from the skeleton reference.md (Task 1)
```
- [ ] **Step 2: Run test to verify it fails**
Run: `python3 -m pytest tests/test_capacity_scan.py -k "usage or known_hostnames or main" -v`
Expected: FAIL — `gather_usage` / `known_hostnames` / `main` not defined.
- [ ] **Step 3: Write minimal implementation (append)**
```python
def gather_usage():
    """FUTURE: live per-VM CPU/RAM/disk usage history. Requires the physical
    cluster online; source UNDECIDED (Proxmox RRD vs Prometheus/Loki/Grafana;
    see docs/TODO.md 8.4). Until then the evaluator reasons on declared intent."""
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}


def _run_json(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def known_hostnames(env):
    """Union of hostnames from Terraform output and Ansible inventory. Each
    source is best-effort: missing tool / no state / bad JSON yields nothing."""
    hosts = set()
    tf_dir = os.path.join(REPO_ROOT, "terraform", "environments", env)
    try:
        hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
    except Exception:
        pass
    inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
    try:
        hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
    except Exception:
        pass
    return hosts


def main():
    parser = argparse.ArgumentParser(description="Deterministic capacity facts for /capacity-review.")
    parser.add_argument("--env", default="staging")
    parser.add_argument(
        "--reference",
        default=os.path.join(REPO_ROOT, "docs", "hardware", "reference.md"),
    )
    args = parser.parse_args()
    with open(args.reference, encoding="utf-8") as fh:
        markdown = fh.read()
    node_rows = parse_table(markdown, ["node", "cores", "ram_gb", "disk_gb"])
    workload_rows = parse_table(markdown, ["workload", "node", "cores", "ram_mb", "disk_gb"])
    nodes = compute_rollup(node_rows, workload_rows)
    warnings = find_drift(workload_rows, known_hostnames(args.env))
    json.dump(
        {"nodes": nodes, "workloads": workload_rows, "usage": gather_usage(), "warnings": warnings},
        sys.stdout,
        indent=2,
        sort_keys=True,
    )
    sys.stdout.write("\n")


if __name__ == "__main__":
    main()
```
- [ ] **Step 4: Run the full test file**
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: PASS (all tests).
- [ ] **Step 5: Smoke-run the script end to end**
Run: `python3 scripts/capacity-scan.py | python3 -m json.tool`
Expected: valid JSON with `nodes.pve0`, a `workloads` list, `usage.available: false`, and a `warnings` array (likely empty with no Terraform state).
- [ ] **Step 6: Commit**
```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Complete capacity-scan.py: usage stub, subprocess glue, main()"
```
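A downstream consumer only needs the four top-level keys of the scan output. The following is a minimal sketch of reading it, assuming a trimmed sample payload; the `LOW_HEADROOM_PCT` threshold, the `nodes_needing_attention` helper, and the `pve1`/`tiny` figures are hypothetical and not defined by this plan.

```python
import json

# Sample scan output, trimmed to the fields used here (values are illustrative).
scan_json = """
{
  "nodes": {
    "pve0": {"alloc_cores": 5, "cores": 20, "oversubscribed": false, "ram_headroom_pct": 87},
    "pve1": {"alloc_cores": 18, "cores": 20, "oversubscribed": false, "ram_headroom_pct": 12},
    "tiny": {"alloc_cores": 4, "cores": 2, "oversubscribed": true, "ram_headroom_pct": 75}
  },
  "usage": {"available": false, "reason": "cluster not provisioned (see STATUS.md)"},
  "warnings": [],
  "workloads": []
}
"""

LOW_HEADROOM_PCT = 20  # hypothetical threshold, not defined by the plan


def nodes_needing_attention(scan):
    """Names of nodes that are oversubscribed or short on RAM headroom."""
    return sorted(
        name
        for name, node in scan["nodes"].items()
        if node["oversubscribed"] or node["ram_headroom_pct"] < LOW_HEADROOM_PCT
    )


scan = json.loads(scan_json)
# Recommendations rest on declared intent until live usage is available.
basis = "usage" if scan["usage"]["available"] else "intent"
flagged = nodes_needing_attention(scan)
print(flagged, basis)  # ['pve1', 'tiny'] intent
```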
---
## Task 6: The `/capacity-review` skill
**Files:**
- Create: `.claude/commands/capacity-review.md`
- [ ] **Step 1: Confirm the existing command pattern**
Run: `ls .claude/commands/ && sed -n '1,20p' .claude/commands/review-repo.md`
Expected: lists existing commands; shows the frontmatter/structure to mirror.
- [ ] **Step 2: Write `.claude/commands/capacity-review.md`**
Mirror the frontmatter style of `review-repo.md` (adjust `description`/`allowed-tools` to match that file's actual keys). Body:
```markdown
---
description: Evaluate hardware capacity and placement; recommend optimizations
---
# /capacity-review
Evaluate the homelab's hardware capacity and workload placement, and recommend
optimizations. On-demand only (scheduling is deferred — see docs/TODO.md 8.4).
## Steps
1. **Gather facts.** Run `python3 scripts/capacity-scan.py` and parse its JSON
(`nodes`, `workloads`, `usage`, `warnings`). If `usage.available` is false,
note in the report that recommendations are **intent-based, not usage-based**.
2. **Read intent.** Read `docs/hardware/reference.md` for the free-text columns
the scan does not parse: `criticality`, `ha_intent`, `profile`, `constraints`,
`growth`, plus the "Capacity notes" section.
3. **Reason across dimensions.** Produce recommendations, each tagged with its
type and the basis it rests on (declared intent vs measured usage):
- **HA / redundancy** — anti-affinity violations (e.g. an HA pair sharing one
node), single points of failure, HA that looks like overkill, or a
high-criticality workload with no redundancy.
- **Right-sizing** — over/under-provisioned workloads. Today this is
intent-based (allocation vs `profile`); flag that it becomes usage-based
once the `gather_usage()` hook is live.
- **Placement / moves** — oversubscribed nodes (`oversubscribed: true`, low
`ram_headroom_pct`) or constraint-driven relocations.
- **Upgrade timing** — `growth` notes vs headroom → rough runway.
- **Drift** — surface every entry in the scan's `warnings` array.
4. **Write the report.** Save to `docs/hardware/reviews/YYYY-MM-DD-capacity.md`
and copy it to `docs/hardware/reviews/latest.md`. Structure: a one-line
summary, then a section per dimension with concrete, actionable items. State
the basis (intent vs usage) on every recommendation.
```
- [ ] **Step 3: Verify the file is well-formed**
Run: `head -5 .claude/commands/capacity-review.md`
Expected: frontmatter block present and consistent with `review-repo.md`'s keys.
- [ ] **Step 4: Commit**
```bash
git add .claude/commands/capacity-review.md
git commit -m "Add /capacity-review skill"
```
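The skill performs the report-writing step itself, but the naming convention in its step 4 can be sketched in code. This is a minimal illustration, assuming the report body is already composed; `write_report` is a hypothetical helper, and the demo writes to a throwaway directory rather than `docs/hardware/reviews/`.

```python
import datetime
import pathlib
import shutil
import tempfile


def write_report(reviews_dir, body, today=None):
    """Write the dated report and refresh latest.md; return the dated path."""
    reviews_dir = pathlib.Path(reviews_dir)
    reviews_dir.mkdir(parents=True, exist_ok=True)
    today = today or datetime.date.today()
    dated = reviews_dir / f"{today.isoformat()}-capacity.md"
    dated.write_text(body, encoding="utf-8")
    # latest.md is a plain copy, so it stays readable without symlink support.
    shutil.copyfile(dated, reviews_dir / "latest.md")
    return dated


body = "# Capacity review\n"
with tempfile.TemporaryDirectory() as tmp:
    path = write_report(tmp, body, datetime.date(2026, 6, 1))
    report_name = path.name
    latest_text = (pathlib.Path(tmp) / "latest.md").read_text(encoding="utf-8")

print(report_name, latest_text == body)  # 2026-06-01-capacity.md True
```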
---
## Task 7: ADR-012, STATUS, CLAUDE.md, scripts/README
**Files:**
- Create: `docs/decisions/012-hardware-capacity.md`
- Modify: `STATUS.md`
- Modify: `CLAUDE.md`
- Modify: `scripts/README.md`
- [ ] **Step 1: Write `docs/decisions/012-hardware-capacity.md`**
Match the heading style of an existing ADR (`sed -n '1,15p' docs/decisions/010-forgejo-ci.md` first). Content:
```markdown
# ADR-012 — Hardware reference & capacity evaluation
## Context
The repo modelled the logical/network layer (Terraform VM specs, ADR-007
topology) but not the physical layer — node CPU/RAM/disk capacity, network gear,
or which workloads are designed to run where with what headroom. There was also
no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a
workload that should move, or a node due an upgrade.
## Decision
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
physical compute + network gear and workload placement intent. Two
machine-readable tables (node capacity, workload placement) carry the numbers.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
parses those tables, computes per-node allocated-vs-physical rollups, and
cross-checks workload hostnames against `terraform output -json` /
`ansible-inventory --list` to surface drift.
- `/capacity-review` reads the scan + intent columns and writes a dated report to
`docs/hardware/reviews/`, mirroring `/review-repo` → `docs/reviews/`.
- Numeric allocations live in `reference.md`, not Terraform: the current
`terraform output` exposes only `{ip, group}`. Terraform/inventory are used
only for hostname-drift cross-checks.
- **Live usage stats are a future hook.** The cluster is not stood up;
`gather_usage()` returns `available: false` and the evaluator reasons on
declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/
Alloy) is undecided — see docs/TODO.md 8.4, to be settled before any hook is
built.
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF↔Ansible handoff).
```
- [ ] **Step 2: Add STATUS.md rows**
In `STATUS.md`, add to the "Real and working today" table:
```markdown
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
```
And to the "Designed but not built" table:
```markdown
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
```
- [ ] **Step 3: Add the CLAUDE.md command row + further-reading pointer**
In `CLAUDE.md` "Key commands" table, add:
```markdown
| Review hardware capacity | `/capacity-review` (Claude command) |
```
In the "Further reading" table, add:
```markdown
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
```
- [ ] **Step 4: Document the script in scripts/README.md**
Add under the existing list in `scripts/README.md`:
```markdown
- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses
the machine-readable tables in `docs/hardware/reference.md`, computes per-node
allocated-vs-physical rollups, and cross-checks workload hostnames against
Terraform output / Ansible inventory for drift. Emits JSON. See **ADR-012**.
```
- [ ] **Step 5: Verify references resolve**
Run: `python3 scripts/repo-scan.py | python3 -c "import json,sys; d=json.load(sys.stdin); print('broken_refs:', [f for f in d.get('findings',{}).get('broken_refs',[]) if '012' in str(f) or 'hardware' in str(f)])"`
Expected: no broken refs mentioning ADR-012 or the hardware paths (empty list). If the scan's JSON shape differs, instead run `python3 scripts/repo-scan.py >/dev/null && echo OK` and eyeball the findings.
- [ ] **Step 6: Commit**
```bash
git add docs/decisions/012-hardware-capacity.md STATUS.md CLAUDE.md scripts/README.md
git commit -m "Record ADR-012 + STATUS/CLAUDE/scripts docs for capacity tooling"
```
---
## Task 8: Final verification
**Files:** none (verification only)
- [ ] **Step 1: Run the full unit-test suite**
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: all tests pass.
- [ ] **Step 2: Run the lint suite**
Run: `make lint`
Expected: passes (markdown/script changes do not break ansible-lint/yamllint).
- [ ] **Step 3: End-to-end scan**
Run: `python3 scripts/capacity-scan.py`
Expected: valid JSON; `nodes.pve0` present; `usage.available: false`.
- [ ] **Step 4: Confirm working tree is clean**
Run: `git status --short`
Expected: no uncommitted changes from this plan (pre-existing FRICTION.md / ADR-011 may remain — leave them).