boma/docs/superpowers/specs/2026-06-01-hardware-capacity-design.md

169 lines
7.7 KiB
Markdown
Raw Normal View History

# Design — Hardware reference & capacity evaluation
_Date: 2026-06-01 · Status: approved for planning_
## Problem
The repo documents the **logical/network** layer well — Terraform declares per-VM
`cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology.
But the **physical** layer is undocumented: how many Proxmox nodes physically
exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and
`askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and
nothing evaluates whether the design is well-proportioned — e.g. a service that
needn't be HA, a workload that should move nodes, or a node due a RAM/disk
upgrade.
## Goal
1. A single, human-first **hardware reference document** capturing physical
compute + network gear and the intended workload placement.
2. A **capacity evaluator** ("script + skill") that reasons about optimization:
HA overkill / missing redundancy, right-sizing, placement moves, and
upgrade timing — emitting a dated report.
## Scope
- **In:** Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense,
managed switch, APs); per-workload placement intent.
- **Out (for now):** power/UPS budget, NAS, cabling, rack layout, asset
register, warranty/serial tracking.
## Non-negotiable repo conventions this must honor
- Mirror the existing `repo-scan.py``/review-repo``docs/reviews/` triad
(deterministic scan feeds a judgement skill; report is dated markdown).
- Utility scripts are **stdlib-only** for run-anywhere portability (control
node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
- Be honest about real-vs-planned (STATUS.md). The physical cluster is **not
stood up yet**, so live usage stats are a documented future hook, not a
current capability.
## Architecture
Four pieces, plus tracking updates.
### 1. Reference doc — `docs/hardware/reference.md`
One hand-maintained markdown file, the source of truth for physical facts and
placement intent. Four parts:
1. **Physical compute** — one subsection per node (`pve0..2`, `askari`):
model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots),
storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes.
2. **Network gear** — OPNsense box, managed switch, APs: model, port/PoE
counts, throughput, uplinks. Short table.
3. **Workload placement & intent** — one row per planned VM/service, columns:
`Service | Home node | Criticality | HA intent | Resource profile |
Placement constraints | Growth notes`. These columns map onto the four
attribute groups chosen during brainstorming and give the evaluator concrete
intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different
nodes).
4. **Capacity summary** — per-node "allocated vs physical" rollup (RAM / cores /
disk, headroom %).
Node-capacity tables use a **strict, documented format** so the scan script can
parse the numbers without a YAML dependency.
### 2. Scan script — `scripts/capacity-scan.py`
Stdlib-only, deterministic, JSON to stdout (like `repo-scan.py`). Avoids
hand-parsing YAML by shelling out for JSON, the pattern `tf_to_inventory.py`
already uses.
Gathers **today**:
- **Declared allocations** — `terraform output -json` (and/or the `.tf` module
calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no
real VMs yet (current reality) instead of failing.
- **Inventory hosts** — `ansible-inventory -i inventories/<env>/hosts.yml
--list` → JSON.
- **Physical capacities** — parses the strict node tables in `reference.md`.
- **Rollup math** — per node: allocated vs physical, headroom %,
`oversubscribed` flag.
- **Drift warnings** — e.g. `reference.md` lists a host no Terraform VM
declares; surfaced in a `warnings[]` array (free doc↔Terraform drift check).
**Stubbed future hook** (honest, à la STATUS.md):
```python
# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
def gather_usage():
return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}
```
Output sketch:
```json
{
"nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
"workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
"usage": {"available": false, "reason": "cluster not provisioned"},
"warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
}
```
### 3. Evaluator skill — `/capacity-review`
A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow:
1. Run `python3 scripts/capacity-scan.py` → JSON.
2. Read `docs/hardware/reference.md` for intent columns the math can't capture.
3. Reason across dimensions, each recommendation **tagged by type** and stating
**what it is based on** (declared intent vs measured usage):
- **HA / redundancy** — anti-affinity violations, SPOFs, HA-overkill,
critical-but-unredundant services.
- **Right-sizing** — over/under-provisioned VMs. *Intent-based today*;
explicitly upgradeable to usage-based once the usage hook is live.
- **Placement / moves** — oversubscribed nodes, constraint-driven relocation.
- **Upgrade timing** — growth notes vs headroom → rough runway.
- **Drift** — surfaces the scan's `warnings[]`.
4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`),
mirroring `docs/reviews/`.
### 4. Recording — ADR + STATUS + CLAUDE.md
- **ADR-012 — Hardware reference & capacity evaluation**
(`docs/decisions/012-hardware-capacity.md`): records the decision and
rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as
an open decision (below).
- **STATUS.md** rows: `reference.md` + `capacity-scan.py` → real/working
(skeleton); `/capacity-review` → working, intent-only; live usage → designed,
not built.
- **CLAUDE.md**: a "Review capacity/hardware → `/capacity-review`" commands-table
row + a "Further reading" pointer to ADR-012.
## Data flow
```
reference.md ──┐
├─→ capacity-scan.py ──→ scan JSON ──┐
terraform ─────┤ (stdlib, JSON-via-subprocess) ├─→ /capacity-review ─→ docs/hardware/reviews/
inventory ─────┘ │ (judgement)
reference.md (intent columns) ───────────────────────┘
```
## Open decisions (deferred, tracked in TODO)
- **Usage-stats source** (TODO 8.4): **Proxmox RRD** (built-in, no extra infra)
vs the **Prometheus/Loki/Grafana/Grafana-Alloy** stack we will likely run
anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before
building any usage hook** to avoid throwaway work.
- **Script dependency policy** (TODO #14): whether stdlib-only remains the rule
for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
- **Scheduling** (TODO 8.4): `/capacity-review` is on-demand now; cron later.
## Deliverables & state at delivery
| Piece | Path | State |
|---|---|---|
| Reference doc | `docs/hardware/reference.md` | Skeleton + real node data |
| Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) |
| Evaluator skill | `/capacity-review``docs/hardware/reviews/` | Working, intent-based |
| Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR |
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |
## Out of scope / YAGNI
- No usage-stats collection until the cluster exists and the source is decided.
- No structured-data (YAML) source of truth — markdown is the single hand-edited
source by choice; revisit only if parsing pain demands it.
- No automated moves/remediation — the evaluator recommends; humans act.