Add implementation plan for hardware capacity tooling
Task-by-task TDD plan: reference.md skeleton, stdlib-only capacity-scan.py (parse_table, compute_rollup, drift, usage stub, main), /capacity-review skill, and ADR-012 + STATUS/CLAUDE/scripts-README updates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 88210db09c
commit 6ff5d55810
1 changed file with 753 additions and 0 deletions
753
docs/superpowers/plans/2026-06-01-hardware-capacity.md
Normal file
@@ -0,0 +1,753 @@
# Hardware Reference & Capacity Evaluation Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add a hand-maintained hardware reference doc, a stdlib-only `capacity-scan.py` that emits deterministic capacity facts, and an on-demand `/capacity-review` skill that reasons about HA / right-sizing / placement / upgrade timing.

**Architecture:** `docs/hardware/reference.md` is the single machine-readable source of truth (physical node capacities + workload allocations + placement intent). `scripts/capacity-scan.py` parses its tables, computes per-node allocated-vs-physical rollups, and cross-checks workload hostnames against `terraform output -json` / `ansible-inventory --list` to surface drift — degrading gracefully when nothing is provisioned. `/capacity-review` runs the scan, reads the intent columns, and writes a dated report to `docs/hardware/reviews/`. Live usage stats are a stubbed future hook. Mirrors the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad.

**Tech Stack:** Python 3 standard library only (no third-party imports in the script); `pytest` for unit tests (already in `requirements.txt`); markdown docs; a `.claude` slash-command skill.

---

## Spec

Design spec: `docs/superpowers/specs/2026-06-01-hardware-capacity-design.md`.

**Refinement vs spec:** the spec said allocations come from Terraform. The current
`terraform output "vms"` only exposes `{ip, group}`, not cores/RAM/disk, so numeric
allocations are read from `reference.md` instead; Terraform/inventory are used only
for hostname-drift cross-checks. This better honors the "self-contained markdown
source of truth" decision and needs no Terraform module changes.

## File Structure

- `docs/hardware/reference.md` — **create.** Source of truth. Human sections
  (physical compute, network gear) + two machine-readable tables (node capacity,
  workload placement) the script parses.
- `scripts/capacity-scan.py` — **create.** Stdlib-only. Pure parse/math functions
  + thin subprocess glue + `main()` emitting JSON to stdout.
- `tests/test_capacity_scan.py` — **create.** Pytest unit tests for the pure
  functions + a smoke test against the real `reference.md`.
- `.claude/commands/capacity-review.md` — **create.** The `/capacity-review` skill.
- `docs/hardware/reviews/.gitkeep` — **create.** Report output dir.
- `docs/decisions/012-hardware-capacity.md` — **create.** ADR recording the decision.
- `STATUS.md` — **modify.** Add real-vs-planned rows.
- `CLAUDE.md` — **modify.** Commands-table row + Further-reading pointer.
- `scripts/README.md` — **modify.** Document `capacity-scan.py`.

### Machine-readable table contract (used by Task 1 and the parser)

`reference.md` must contain these two tables with the header names shown. The parser
keys on header names, so column order is flexible; extra free-text columns are
carried through as raw strings and ignored by the rollup math.

**Node capacity** — header contains `node, cores, ram_gb, disk_gb` (`cores` is an integer; `ram_gb` and `disk_gb` may be decimals):

```
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
```

**Workload placement** — header contains the keyed columns `workload, node,
cores, ram_mb, disk_gb` (the last three numeric) plus any free-text intent columns:

```
| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny | anti-affinity: dns2 elsewhere | flat |
```
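
To make the contract concrete, here is a minimal sketch of header-keyed row parsing. It is not the plan's `parse_table()` (Task 2 defines that); the helper name `rows_by_header` and the `notes` column are made up for illustration. It shows why column order is flexible: each cell is paired with its header name, never with a fixed position.

```python
def rows_by_header(table_lines):
    """Pair each body cell with its header name (hypothetical helper)."""
    def cells(line):
        # "| a | b |" -> ["a", "b"]
        return [c.strip() for c in line.strip().strip("|").split("|")]

    headers = cells(table_lines[0])
    # table_lines[1] is the |---| separator row, so body rows start at index 2.
    return [dict(zip(headers, cells(line))) for line in table_lines[2:]]


canonical = [
    "| node | cores | ram_gb | disk_gb |",
    "|------|-------|--------|---------|",
    "| pve0 | 20 | 64 | 4000 |",
]
reordered = [
    "| disk_gb | node | notes | ram_gb | cores |",
    "|---------|------|-------|--------|-------|",
    "| 4000 | pve0 | example row | 64 | 20 |",
]

wanted = ["node", "cores", "ram_gb", "disk_gb"]
a = {k: rows_by_header(canonical)[0][k] for k in wanted}
b = {k: rows_by_header(reordered)[0][k] for k in wanted}
print(a == b)
```

Both layouts yield the same keyed values, which is the property the contract relies on. Cell values stay raw strings here; numeric conversion is left to the rollup step.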

---

## Task 1: Reference doc skeleton

**Files:**
- Create: `docs/hardware/reference.md`
- Create: `docs/hardware/reviews/.gitkeep`

- [ ] **Step 1: Write `docs/hardware/reference.md`**

```markdown
# Hardware reference — boma

> Hand-maintained source of truth for **physical** compute + network gear and
> **workload placement intent**. The two machine-readable tables (Node capacity,
> Workload placement) are parsed by `scripts/capacity-scan.py` — keep their
> headers intact. Evaluated by `/capacity-review`. See ADR-012.
>
> _Status: skeleton. Replace example rows with real hardware once the cluster is
> stood up (STATUS.md tracks real-vs-planned)._

## 1. Physical compute

### pve0
- **Model / form factor:** _TBD (e.g. Minisforum MS-01, mini-PC)_
- **CPU:** _TBD (e.g. i9-13900H, 14C/20T)_
- **RAM:** _TBD total; max _; free DIMM slots _
- **Storage:** _TBD (disks → pools, e.g. 2× 2 TB NVMe → `local-zfs`)_
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_

_(repeat for pve1, pve2, askari)_

## 2. Network gear

| device | model | ports | poe | throughput | uplinks | notes |
|----------|-------|-------|-----|------------|---------|-------|
| opnsense | _TBD_ | _TBD_ | n/a | _TBD_ | WAN+LAN | dedicated hardware |
| switch | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | managed, 802.1q |
| ap1 | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | multi-SSID per VLAN |

## 3. Workload placement & intent

The numeric columns (`cores, ram_mb, disk_gb`) feed `capacity-scan.py`; the
free-text columns feed `/capacity-review`'s judgement.

| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny/steady | anti-affinity: dns2 on a different node | flat |
| dns2 | pve1 | 1 | 512 | 10 | high | pair/dns1 | tiny/steady | anti-affinity: dns1 on a different node | flat |

## 4. Node capacity (machine-readable)

Physical totals per node. `cores` is an integer; `ram_gb` and `disk_gb` may be decimals.

| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |

## 5. Capacity notes

Free-text running notes for the evaluator (trends, planned moves, upgrade ideas).
```

- [ ] **Step 2: Create the reports directory**

Run: `mkdir -p docs/hardware/reviews && touch docs/hardware/reviews/.gitkeep`
Expected: both paths exist.

- [ ] **Step 3: Verify the machine-readable headers match the contract**

Run: `grep -n '| node | cores | ram_gb | disk_gb |' docs/hardware/reference.md && grep -n '| workload | node | cores | ram_mb | disk_gb |' docs/hardware/reference.md`
Expected: each grep prints one matching line (the table headers the parser keys on).

- [ ] **Step 4: Commit**

```bash
git add docs/hardware/reference.md docs/hardware/reviews/.gitkeep
git commit -m "Add hardware reference doc skeleton + reviews dir"
```

---

## Task 2: Scan script — `parse_table()`

**Files:**
- Create: `scripts/capacity-scan.py`
- Create: `tests/test_capacity_scan.py`

- [ ] **Step 1: Write the failing test**

Create `tests/test_capacity_scan.py`:

```python
import importlib.util
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "capacity-scan.py"
_spec = importlib.util.spec_from_file_location("capacity_scan", _PATH)
cs = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(cs)


def test_parse_table_keys_on_header_and_ignores_extra_cols():
    md = """
intro text
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |

trailing text
"""
    rows = cs.parse_table(md, ["node", "cores", "ram_gb", "disk_gb"])
    assert rows == [
        {"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"},
        {"node": "pve1", "cores": "20", "ram_gb": "64", "disk_gb": "4000"},
    ]


def test_parse_table_returns_empty_when_header_absent():
    assert cs.parse_table("no tables here", ["node", "cores"]) == []
```
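
The test loads the script through `importlib` because the file name `capacity-scan.py` contains a hyphen, so a plain `import` statement cannot name it. A self-contained sketch of the same technique follows, using a throwaway file; the file name `demo-script.py`, the module name `demo_script`, and the function `answer` are invented for illustration.

```python
import importlib.util
import pathlib
import tempfile


def load_by_path(path, module_name):
    """Load a Python file as a module, regardless of its file name."""
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


with tempfile.TemporaryDirectory() as tmp:
    script = pathlib.Path(tmp) / "demo-script.py"  # hyphen: not importable by name
    script.write_text("def answer():\n    return 42\n", encoding="utf-8")
    demo = load_by_path(script, "demo_script")
    result = demo.answer()

print(result)
```

Because `exec_module` runs the file's top-level code, the script under test should keep side effects behind `if __name__ == "__main__":`, as the plan's `main()` guard does.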

- [ ] **Step 2: Run test to verify it fails**

Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: FAIL — collection error, `FileNotFoundError` (the script does not exist yet; once the file exists without `parse_table`, the failure becomes `AttributeError`).

- [ ] **Step 3: Write minimal implementation**

Create `scripts/capacity-scan.py`:

```python
#!/usr/bin/env python3
"""capacity-scan.py — deterministic capacity facts for /capacity-review.

Python standard library only. Emits a JSON object to stdout.

Reads physical capacities and workload allocations from the machine-readable
tables in docs/hardware/reference.md, computes per-node allocated-vs-physical
rollups, and cross-checks workload hostnames against `terraform output -json`
and `ansible-inventory --list` to surface drift. Degrades gracefully when
nothing is provisioned. Live usage stats are a documented future hook.

Usage: python3 scripts/capacity-scan.py [--env staging] [--reference PATH]
"""
import argparse
import json
import os
import subprocess
import sys

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


def parse_table(markdown, required_cols):
    """Return row dicts for the first markdown table whose header contains all
    required_cols. Keys are header names; values are raw cell strings."""
    lines = markdown.splitlines()
    required = set(required_cols)
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = [c.strip() for c in line.strip("|").split("|")]
        if not required.issubset(set(headers)):
            continue
        rows = []
        # lines[i + 1] is the |---| separator row; body rows start at i + 2.
        for body in lines[i + 2:]:
            if not body.strip().startswith("|"):
                break
            cells = [c.strip() for c in body.strip().strip("|").split("|")]
            # Rows whose cell count differs from the header are skipped.
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []
```

- [ ] **Step 4: Run test to verify it passes**

Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: PASS (2 passed).

- [ ] **Step 5: Commit**

```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add capacity-scan.py with parse_table()"
```

---

## Task 3: Rollup math — `compute_rollup()`

**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`

- [ ] **Step 1: Write the failing test (append to `tests/test_capacity_scan.py`)**

```python
def test_compute_rollup_sums_allocations_and_flags_headroom():
    node_rows = [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}]
    workload_rows = [
        {"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"},
        {"workload": "forgejo", "node": "pve0", "cores": "4", "ram_mb": "8192", "disk_gb": "100"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    pve0 = nodes["pve0"]
    assert pve0["alloc_cores"] == 5
    assert pve0["alloc_ram_gb"] == 8.5  # (512 + 8192) / 1024
    assert pve0["alloc_disk_gb"] == 110.0
    assert pve0["ram_headroom_pct"] == 87  # round(100 * (64 - 8.5) / 64)
    assert pve0["oversubscribed"] is False


def test_compute_rollup_flags_oversubscription():
    node_rows = [{"node": "tiny", "cores": "2", "ram_gb": "4", "disk_gb": "50"}]
    workload_rows = [
        {"workload": "hog", "node": "tiny", "cores": "4", "ram_mb": "1024", "disk_gb": "10"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    assert nodes["tiny"]["oversubscribed"] is True  # 4 cores > 2


def test_compute_rollup_ignores_workloads_on_unknown_nodes():
    nodes = cs.compute_rollup(
        [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}],
        [{"workload": "ghost", "node": "nope", "cores": "1", "ram_mb": "512", "disk_gb": "10"}],
    )
    assert nodes["pve0"]["alloc_cores"] == 0
```

- [ ] **Step 2: Run test to verify it fails**

Run: `python3 -m pytest tests/test_capacity_scan.py -k compute_rollup -v`
Expected: FAIL — `AttributeError: module 'capacity_scan' has no attribute 'compute_rollup'`.

- [ ] **Step 3: Write minimal implementation (append to `scripts/capacity-scan.py`, before any `main`)**

```python
def compute_rollup(node_rows, workload_rows):
    """Per node: physical totals, summed allocations, RAM headroom %, and an
    oversubscribed flag. Workloads on unknown nodes are ignored."""
    nodes = {}
    for r in node_rows:
        nodes[r["node"]] = {
            "cores": int(r["cores"]),
            "ram_gb": float(r["ram_gb"]),
            "disk_gb": float(r["disk_gb"]),
            "alloc_cores": 0,
            "alloc_ram_mb": 0,
            "alloc_disk_gb": 0.0,
        }
    for w in workload_rows:
        node = nodes.get(w["node"])
        if node is None:
            continue
        node["alloc_cores"] += int(w["cores"])
        node["alloc_ram_mb"] += int(w["ram_mb"])
        node["alloc_disk_gb"] += float(w["disk_gb"])
    for node in nodes.values():
        node["alloc_ram_gb"] = round(node.pop("alloc_ram_mb") / 1024, 1)
        node["ram_headroom_pct"] = (
            round(100 * (node["ram_gb"] - node["alloc_ram_gb"]) / node["ram_gb"])
            if node["ram_gb"]
            else 0
        )
        node["oversubscribed"] = (
            node["alloc_cores"] > node["cores"]
            or node["alloc_ram_gb"] > node["ram_gb"]
            or node["alloc_disk_gb"] > node["disk_gb"]
        )
    return nodes
```

- [ ] **Step 4: Run test to verify it passes**

Run: `python3 -m pytest tests/test_capacity_scan.py -k compute_rollup -v`
Expected: PASS (3 passed).
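
A worked example of the RAM arithmetic may help when reviewing the numbers. The sketch below restates the two formulas from `compute_rollup()` inline instead of importing the script; the helper name `ram_rollup` is hypothetical. Note the unit change: workloads declare `ram_mb`, nodes declare `ram_gb`, so allocations are divided by 1024 before comparison.

```python
def ram_rollup(node_ram_gb, workload_ram_mb):
    """Return (allocated GB, headroom %) using the same rounding as the plan."""
    alloc_ram_gb = round(sum(workload_ram_mb) / 1024, 1)  # MB -> GB
    headroom_pct = round(100 * (node_ram_gb - alloc_ram_gb) / node_ram_gb)
    return alloc_ram_gb, headroom_pct


# Skeleton reference.md: dns1 (512 MB) is the only workload on pve0 (64 GB).
skeleton = ram_rollup(64.0, [512])

# The Task 3 test case: dns1 (512 MB) + forgejo (8192 MB) on pve0.
task3_case = ram_rollup(64.0, [512, 8192])

print(skeleton, task3_case)
```

For the Task 3 case, 8704 MB is 8.5 GB, and 100 × 55.5 / 64 = 86.71875, which rounds to the 87 the test asserts.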

- [ ] **Step 5: Commit**

```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add compute_rollup() to capacity-scan.py"
```

---

## Task 4: Drift detection — `find_drift()` + hostname parsers

**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`

- [ ] **Step 1: Write the failing test (append)**

```python
def test_parse_tf_hostnames_reads_vms_value_keys():
    tf_json = '{"vms": {"value": {"dns1": {"ip": "10.20.0.10", "group": "docker_hosts"}}}}'
    assert cs.parse_tf_hostnames(tf_json) == {"dns1"}


def test_parse_inventory_hostnames_reads_meta_hostvars():
    inv_json = '{"_meta": {"hostvars": {"dns1": {}, "proxy": {}}}}'
    assert cs.parse_inventory_hostnames(inv_json) == {"dns1", "proxy"}


def test_find_drift_reports_both_directions():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    warnings = cs.find_drift(workload_rows, {"proxy"})
    assert any("dns1" in w and "no Terraform" in w for w in warnings)
    assert any("proxy" in w and "absent from reference.md" in w for w in warnings)


def test_find_drift_silent_when_no_hostnames_known():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    assert cs.find_drift(workload_rows, set()) == []
```

- [ ] **Step 2: Run test to verify it fails**

Run: `python3 -m pytest tests/test_capacity_scan.py -k "drift or hostnames" -v`
Expected: FAIL — attributes `parse_tf_hostnames` / `parse_inventory_hostnames` / `find_drift` not defined.

- [ ] **Step 3: Write minimal implementation (append)**

```python
def parse_tf_hostnames(tf_json):
    """Hostnames from `terraform output -json` (the `vms` map keys)."""
    data = json.loads(tf_json)
    return set(data.get("vms", {}).get("value", {}).keys())


def parse_inventory_hostnames(inv_json):
    """Hostnames from `ansible-inventory --list` (_meta.hostvars keys)."""
    data = json.loads(inv_json)
    return set(data.get("_meta", {}).get("hostvars", {}).keys())


def find_drift(workload_rows, known_hostnames):
    """Warn when reference.md workloads and live hostnames disagree. Silent when
    no hostnames are known (pre-provisioning) — nothing to compare against."""
    warnings = []
    declared = {w["workload"] for w in workload_rows}
    if not known_hostnames:
        return warnings
    for name in sorted(declared - known_hostnames):
        warnings.append(
            f"reference.md lists '{name}' but no Terraform/inventory host declares it"
        )
    for name in sorted(known_hostnames - declared):
        warnings.append(
            f"host '{name}' exists in Terraform/inventory but is absent from reference.md"
        )
    return warnings
```

- [ ] **Step 4: Run test to verify it passes**

Run: `python3 -m pytest tests/test_capacity_scan.py -k "drift or hostnames" -v`
Expected: PASS (4 passed).
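
The drift check boils down to two set differences over hostnames. A standalone sketch of that idea, reusing the JSON shapes from the tests above; the `declared` set is example data, and the variable names `missing_live` and `undocumented` are illustrative only:

```python
import json

# Same shapes as the Task 4 test fixtures.
tf_output = json.loads('{"vms": {"value": {"dns1": {"ip": "10.20.0.10", "group": "docker_hosts"}}}}')
inventory = json.loads('{"_meta": {"hostvars": {"dns1": {}, "proxy": {}}}}')

declared = {"dns1", "dns2"}  # workload names as they might appear in reference.md
known = set(tf_output["vms"]["value"]) | set(inventory["_meta"]["hostvars"])

missing_live = sorted(declared - known)   # documented but not provisioned
undocumented = sorted(known - declared)   # provisioned but not documented
print(missing_live, undocumented)
```

Taking the union of both sources first means a host only counts as missing when neither Terraform nor the inventory knows it, which matches how `known_hostnames()` feeds `find_drift()` in Task 5.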

- [ ] **Step 5: Commit**

```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add hostname parsers + find_drift() to capacity-scan.py"
```

---

## Task 5: Subprocess glue + usage stub + `main()`

**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`

- [ ] **Step 1: Write the failing test (append)**

```python
import json as _json


def test_gather_usage_is_stubbed_unavailable():
    usage = cs.gather_usage()
    assert usage["available"] is False
    assert "reason" in usage


def test_known_hostnames_degrades_to_empty(monkeypatch):
    # Simulate terraform/ansible-inventory being absent or failing.
    def boom(*a, **k):
        raise FileNotFoundError("no such tool")

    monkeypatch.setattr(cs.subprocess, "run", boom)
    assert cs.known_hostnames("staging") == set()


def test_main_emits_valid_json_against_real_reference(monkeypatch, capsys):
    # Isolate from the host: no real terraform/ansible needed.
    monkeypatch.setattr(cs, "known_hostnames", lambda env: set())
    monkeypatch.setattr("sys.argv", ["capacity-scan.py"])
    cs.main()
    out = _json.loads(capsys.readouterr().out)
    assert set(out) == {"nodes", "workloads", "usage", "warnings"}
    assert out["usage"]["available"] is False
    assert "pve0" in out["nodes"]  # from the skeleton reference.md (Task 1)
```

- [ ] **Step 2: Run test to verify it fails**

Run: `python3 -m pytest tests/test_capacity_scan.py -k "usage or known_hostnames or main" -v`
Expected: FAIL — `gather_usage` / `known_hostnames` / `main` not defined.

- [ ] **Step 3: Write minimal implementation (append)**

```python
def gather_usage():
    """FUTURE: live per-VM CPU/RAM/disk usage history. Requires the physical
    cluster online; source UNDECIDED (Proxmox RRD vs Prometheus/Loki/Grafana —
    see docs/TODO.md 8.4). Until then the evaluator reasons on declared intent."""
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}


def _run_json(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def known_hostnames(env):
    """Union of hostnames from Terraform output and Ansible inventory. Each
    source is best-effort: missing tool / no state / bad JSON yields nothing."""
    hosts = set()
    tf_dir = os.path.join(REPO_ROOT, "terraform", "environments", env)
    try:
        hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
    except Exception:
        pass
    inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
    try:
        hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
    except Exception:
        pass
    return hosts


def main():
    parser = argparse.ArgumentParser(description="Deterministic capacity facts for /capacity-review.")
    parser.add_argument("--env", default="staging")
    parser.add_argument(
        "--reference",
        default=os.path.join(REPO_ROOT, "docs", "hardware", "reference.md"),
    )
    args = parser.parse_args()

    with open(args.reference, encoding="utf-8") as fh:
        markdown = fh.read()

    node_rows = parse_table(markdown, ["node", "cores", "ram_gb", "disk_gb"])
    workload_rows = parse_table(markdown, ["workload", "node", "cores", "ram_mb", "disk_gb"])
    nodes = compute_rollup(node_rows, workload_rows)
    warnings = find_drift(workload_rows, known_hostnames(args.env))

    json.dump(
        {"nodes": nodes, "workloads": workload_rows, "usage": gather_usage(), "warnings": warnings},
        sys.stdout,
        indent=2,
        sort_keys=True,
    )
    sys.stdout.write("\n")


if __name__ == "__main__":
    main()
```

- [ ] **Step 4: Run the full test file**

Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: PASS (all tests).

- [ ] **Step 5: Smoke-run the script end to end**

Run: `python3 scripts/capacity-scan.py | python3 -m json.tool`
Expected: valid JSON with `nodes.pve0`, a `workloads` list, `usage.available: false`, and a `warnings` array (likely empty with no Terraform state).
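
As a rough illustration of how a consumer might read this output, the sketch below filters nodes that deserve attention. The sample JSON is hand-written in the shape `main()` emits, with illustrative values; the `LOW_HEADROOM_PCT` threshold and the helper `nodes_needing_attention` are assumptions and appear nowhere in the plan.

```python
import json

# Hand-written sample in the shape main() emits; values are illustrative only.
scan_json = """
{
  "nodes": {
    "pve0": {"alloc_cores": 5, "cores": 20, "ram_headroom_pct": 87, "oversubscribed": false},
    "tiny": {"alloc_cores": 4, "cores": 2, "ram_headroom_pct": 75, "oversubscribed": true}
  },
  "usage": {"available": false, "reason": "cluster not provisioned (see STATUS.md)"},
  "warnings": [],
  "workloads": []
}
"""

LOW_HEADROOM_PCT = 20  # assumed threshold, not defined anywhere in the plan


def nodes_needing_attention(scan):
    """Names of nodes that are oversubscribed or short on RAM headroom."""
    return sorted(
        name
        for name, node in scan["nodes"].items()
        if node["oversubscribed"] or node["ram_headroom_pct"] < LOW_HEADROOM_PCT
    )


scan = json.loads(scan_json)
flagged = nodes_needing_attention(scan)
# Recommendations rest on declared intent until usage data is available.
basis = "usage" if scan["usage"]["available"] else "intent"
print(flagged, basis)
```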

- [ ] **Step 6: Commit**

```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Complete capacity-scan.py: usage stub, subprocess glue, main()"
```

---

## Task 6: The `/capacity-review` skill

**Files:**
- Create: `.claude/commands/capacity-review.md`

- [ ] **Step 1: Confirm the existing command pattern**

Run: `ls .claude/commands/ && sed -n '1,20p' .claude/commands/review-repo.md`
Expected: lists existing commands; shows the frontmatter/structure to mirror.

- [ ] **Step 2: Write `.claude/commands/capacity-review.md`**

Mirror the frontmatter style of `review-repo.md` (adjust `description`/`allowed-tools` to match that file's actual keys). Body:

```markdown
---
description: Evaluate hardware capacity and placement; recommend optimizations
---

# /capacity-review

Evaluate the homelab's hardware capacity and workload placement, and recommend
optimizations. On-demand only (scheduling is deferred — see docs/TODO.md 8.4).

## Steps

1. **Gather facts.** Run `python3 scripts/capacity-scan.py` and parse its JSON
   (`nodes`, `workloads`, `usage`, `warnings`). If `usage.available` is false,
   note in the report that recommendations are **intent-based, not usage-based**.
2. **Read intent.** Read `docs/hardware/reference.md` for the free-text columns
   the scan does not interpret: `criticality`, `ha_intent`, `profile`, `constraints`,
   `growth`, plus the "Capacity notes" section.
3. **Reason across dimensions.** Produce recommendations, each tagged with its
   type and the basis it rests on (declared intent vs measured usage):
   - **HA / redundancy** — anti-affinity violations (e.g. an HA pair sharing one
     node), single points of failure, HA that looks like overkill, or a
     high-criticality workload with no redundancy.
   - **Right-sizing** — over/under-provisioned workloads. Today this is
     intent-based (allocation vs `profile`); flag that it becomes usage-based
     once the `gather_usage()` hook is live.
   - **Placement / moves** — oversubscribed nodes (`oversubscribed: true`, low
     `ram_headroom_pct`) or constraint-driven relocations.
   - **Upgrade timing** — `growth` notes vs headroom → rough runway.
   - **Drift** — surface every entry in the scan's `warnings` array.
4. **Write the report.** Save to `docs/hardware/reviews/YYYY-MM-DD-capacity.md`
   and copy it to `docs/hardware/reviews/latest.md`. Structure: a one-line
   summary, then a section per dimension with concrete, actionable items. State
   the basis (intent vs usage) on every recommendation.
```
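
The skill leaves the HA judgement to the evaluator, but the anti-affinity case it names (an HA pair sharing one node) can be pictured mechanically. The sketch below is only an illustration: it assumes the `pair/<partner>` convention seen in the example `ha_intent` cells, the helper name `shared_node_pairs` is hypothetical, and the `none` value is invented for the third row.

```python
def shared_node_pairs(workload_rows):
    """Return HA pairs whose two members are placed on the same node."""
    node_of = {w["workload"]: w["node"] for w in workload_rows}
    violations = set()
    for w in workload_rows:
        # "pair/dns2" -> kind "pair", partner "dns2" (assumed convention).
        kind, _, partner = w.get("ha_intent", "").partition("/")
        if kind != "pair" or partner not in node_of:
            continue
        if node_of[partner] == w["node"]:
            # Sort the names so (dns1, dns2) and (dns2, dns1) collapse to one entry.
            violations.add(tuple(sorted((w["workload"], partner))))
    return sorted(violations)


rows = [
    {"workload": "dns1", "node": "pve0", "ha_intent": "pair/dns2"},
    {"workload": "dns2", "node": "pve0", "ha_intent": "pair/dns1"},
    {"workload": "proxy", "node": "pve1", "ha_intent": "none"},
]
colocated = shared_node_pairs(rows)

# Moving dns2 to another node, as in the skeleton reference.md, clears the violation.
rows[1]["node"] = "pve1"
after_move = shared_node_pairs(rows)
print(colocated, after_move)
```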

- [ ] **Step 3: Verify the file is well-formed**

Run: `head -5 .claude/commands/capacity-review.md`
Expected: frontmatter block present and consistent with `review-repo.md`'s keys.

- [ ] **Step 4: Commit**

```bash
git add .claude/commands/capacity-review.md
git commit -m "Add /capacity-review skill"
```

---

## Task 7: ADR-012, STATUS, CLAUDE.md, scripts/README

**Files:**
- Create: `docs/decisions/012-hardware-capacity.md`
- Modify: `STATUS.md`
- Modify: `CLAUDE.md`
- Modify: `scripts/README.md`

- [ ] **Step 1: Write `docs/decisions/012-hardware-capacity.md`**

Match the heading style of an existing ADR (`sed -n '1,15p' docs/decisions/010-forgejo-ci.md` first). Content:

```markdown
# ADR-012 — Hardware reference & capacity evaluation

## Context

The repo modelled the logical/network layer (Terraform VM specs, ADR-007
topology) but not the physical layer — node CPU/RAM/disk capacity, network gear,
or which workloads are designed to run where with what headroom. There was also
no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a
workload that should move, or a node due an upgrade.

## Decision

- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
  physical compute + network gear and workload placement intent. Two
  machine-readable tables (node capacity, workload placement) carry the numbers.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
  parses those tables, computes per-node allocated-vs-physical rollups, and
  cross-checks workload hostnames against `terraform output -json` /
  `ansible-inventory --list` to surface drift.
- `/capacity-review` reads the scan + intent columns and writes a dated report to
  `docs/hardware/reviews/`, mirroring `/review-repo` → `docs/reviews/`.
- Numeric allocations live in `reference.md`, not Terraform: the current
  `terraform output` exposes only `{ip, group}`. Terraform/inventory are used
  only for hostname-drift cross-checks.
- **Live usage stats are a future hook.** The cluster is not stood up;
  `gather_usage()` returns `available: false` and the evaluator reasons on
  declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/
  Alloy) is undecided — see docs/TODO.md 8.4, to be settled before any hook is
  built.

## Consequences

- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
  matching `capacity-scan.py` change.

See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF↔Ansible handoff).
```

- [ ] **Step 2: Add STATUS.md rows**

In `STATUS.md`, add to the "Real and working today" table:

```markdown
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
```

And to the "Designed but not built" table:

```markdown
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
```

- [ ] **Step 3: Add the CLAUDE.md command row + further-reading pointer**

In `CLAUDE.md` "Key commands" table, add:

```markdown
| Review hardware capacity | `/capacity-review` (Claude command) |
```

In the "Further reading" table, add:

```markdown
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
```

- [ ] **Step 4: Document the script in scripts/README.md**

Add under the existing list in `scripts/README.md`:

```markdown
- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses
  the machine-readable tables in `docs/hardware/reference.md`, computes per-node
  allocated-vs-physical rollups, and cross-checks workload hostnames against
  Terraform output / Ansible inventory for drift. Emits JSON. See **ADR-012**.
```

- [ ] **Step 5: Verify references resolve**

Run: `python3 scripts/repo-scan.py | python3 -c "import json,sys; d=json.load(sys.stdin); print('broken_refs:', [f for f in d.get('findings',{}).get('broken_refs',[]) if '012' in str(f) or 'hardware' in str(f)])"`
Expected: no broken refs mentioning ADR-012 or the hardware paths (empty list). If the scan's JSON shape differs, instead run `python3 scripts/repo-scan.py >/dev/null && echo OK` and eyeball the findings.

- [ ] **Step 6: Commit**

```bash
git add docs/decisions/012-hardware-capacity.md STATUS.md CLAUDE.md scripts/README.md
git commit -m "Record ADR-012 + STATUS/CLAUDE/scripts docs for capacity tooling"
```

---

## Task 8: Final verification

**Files:** none (verification only)

- [ ] **Step 1: Run the full unit-test suite**

Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: all tests pass.

- [ ] **Step 2: Run the lint suite**

Run: `make lint`
Expected: passes (markdown/script changes do not break ansible-lint/yamllint).

- [ ] **Step 3: End-to-end scan**

Run: `python3 scripts/capacity-scan.py`
Expected: valid JSON; `nodes.pve0` present; `usage.available: false`.

- [ ] **Step 4: Confirm working tree is clean**

Run: `git status --short`
Expected: no uncommitted changes from this plan (pre-existing FRICTION.md / ADR-011 may remain — leave them).