Brainstormed design for docs/hardware/reference.md (physical compute + network gear + workload placement intent), a stdlib-only capacity-scan.py, and an on-demand /capacity-review skill that reports to docs/hardware/reviews/. Mirrors the repo-scan -> /review-repo -> docs/reviews triad. TODO additions: schedule /capacity-review later and decide its usage-stats source (Proxmox RRD vs the Prometheus/Loki/Grafana/Alloy stack) before building any hook (8.4); reevaluate the stdlib-only script policy (#14). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
168 lines
7.7 KiB
Markdown
168 lines
7.7 KiB
Markdown
# Design — Hardware reference & capacity evaluation
|
|
|
|
_Date: 2026-06-01 · Status: approved for planning_
|
|
|
|
## Problem
|
|
|
|
The repo documents the **logical/network** layer well — Terraform declares per-VM
|
|
`cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology.
|
|
But the **physical** layer is undocumented: how many Proxmox nodes physically
|
|
exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and
|
|
`askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and
|
|
nothing evaluates whether the design is well-proportioned — e.g. a service that
|
|
needn't be HA, a workload that should move nodes, or a node due a RAM/disk
|
|
upgrade.
|
|
|
|
## Goal
|
|
|
|
1. A single, human-first **hardware reference document** capturing physical
|
|
compute + network gear and the intended workload placement.
|
|
2. A **capacity evaluator** ("script + skill") that reasons about optimization:
|
|
HA overkill / missing redundancy, right-sizing, placement moves, and
|
|
upgrade timing — emitting a dated report.
|
|
|
|
## Scope
|
|
|
|
- **In:** Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense,
|
|
managed switch, APs); per-workload placement intent.
|
|
- **Out (for now):** power/UPS budget, NAS, cabling, rack layout, asset
|
|
register, warranty/serial tracking.
|
|
|
|
## Non-negotiable repo conventions this must honor
|
|
|
|
- Mirror the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad
|
|
(deterministic scan feeds a judgement skill; report is dated markdown).
|
|
- Utility scripts are **stdlib-only** for run-anywhere portability (control
|
|
node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
|
|
- Be honest about real-vs-planned (STATUS.md). The physical cluster is **not
|
|
stood up yet**, so live usage stats are a documented future hook, not a
|
|
current capability.
|
|
|
|
## Architecture
|
|
|
|
Four pieces, plus tracking updates.
|
|
|
|
### 1. Reference doc — `docs/hardware/reference.md`
|
|
|
|
One hand-maintained markdown file, the source of truth for physical facts and
|
|
placement intent. Four parts:
|
|
|
|
1. **Physical compute** — one subsection per node (`pve0..2`, `askari`):
|
|
model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots),
|
|
storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes.
|
|
2. **Network gear** — OPNsense box, managed switch, APs: model, port/PoE
|
|
counts, throughput, uplinks. Short table.
|
|
3. **Workload placement & intent** — one row per planned VM/service, columns:
|
|
`Service | Home node | Criticality | HA intent | Resource profile |
|
|
Placement constraints | Growth notes`. These columns map onto the four
|
|
attribute groups chosen during brainstorming and give the evaluator concrete
|
|
intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different
|
|
nodes).
|
|
4. **Capacity summary** — per-node "allocated vs physical" rollup (RAM / cores /
|
|
disk, headroom %).
|
|
|
|
Node-capacity tables use a **strict, documented format** so the scan script can
|
|
parse the numbers without a YAML dependency.
|
|
|
|
### 2. Scan script — `scripts/capacity-scan.py`
|
|
|
|
Stdlib-only, deterministic, JSON to stdout (like `repo-scan.py`). Avoids
|
|
hand-parsing YAML by shelling out for JSON, the pattern `tf_to_inventory.py`
|
|
already uses.
|
|
|
|
Gathers **today**:
|
|
- **Declared allocations** — `terraform output -json` (and/or the `.tf` module
|
|
calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no
|
|
real VMs yet (current reality) instead of failing.
|
|
- **Inventory hosts** — `ansible-inventory -i inventories/<env>/hosts.yml
|
|
--list` → JSON.
|
|
- **Physical capacities** — parses the strict node tables in `reference.md`.
|
|
- **Rollup math** — per node: allocated vs physical, headroom %,
|
|
`oversubscribed` flag.
|
|
- **Drift warnings** — e.g. `reference.md` lists a host no Terraform VM
|
|
declares; surfaced in a `warnings[]` array (free doc↔Terraform drift check).
|
|
|
|
**Stubbed future hook** (honest, à la STATUS.md):
|
|
```python
|
|
# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
|
|
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
|
|
def gather_usage():
|
|
return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}
|
|
```
|
|
|
|
Output sketch:
|
|
```json
|
|
{
|
|
"nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
|
|
"workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
|
|
"usage": {"available": false, "reason": "cluster not provisioned"},
|
|
"warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
|
|
}
|
|
```
|
|
|
|
### 3. Evaluator skill — `/capacity-review`
|
|
|
|
A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow:
|
|
|
|
1. Run `python3 scripts/capacity-scan.py` → JSON.
|
|
2. Read `docs/hardware/reference.md` for intent columns the math can't capture.
|
|
3. Reason across dimensions, each recommendation **tagged by type** and stating
|
|
**what it is based on** (declared intent vs measured usage):
|
|
- **HA / redundancy** — anti-affinity violations, SPOFs, HA-overkill,
|
|
critical-but-unredundant services.
|
|
- **Right-sizing** — over/under-provisioned VMs. *Intent-based today*;
|
|
explicitly upgradeable to usage-based once the usage hook is live.
|
|
- **Placement / moves** — oversubscribed nodes, constraint-driven relocation.
|
|
- **Upgrade timing** — growth notes vs headroom → rough runway.
|
|
- **Drift** — surfaces the scan's `warnings[]`.
|
|
4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`),
|
|
mirroring `docs/reviews/`.
|
|
|
|
### 4. Recording — ADR + STATUS + CLAUDE.md
|
|
|
|
- **ADR-012 — Hardware reference & capacity evaluation**
|
|
(`docs/decisions/012-hardware-capacity.md`): records the decision and
|
|
rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as
|
|
an open decision (below).
|
|
- **STATUS.md** rows: `reference.md` + `capacity-scan.py` → real/working
|
|
(skeleton); `/capacity-review` → working, intent-only; live usage → designed,
|
|
not built.
|
|
- **CLAUDE.md**: a "Review capacity/hardware → `/capacity-review`" commands-table
|
|
row + a "Further reading" pointer to ADR-012.
|
|
|
|
## Data flow
|
|
|
|
```
|
|
reference.md ──┐
|
|
├─→ capacity-scan.py ──→ scan JSON ──┐
|
|
terraform ─────┤ (stdlib, JSON-via-subprocess) ├─→ /capacity-review ─→ docs/hardware/reviews/
|
|
inventory ─────┘ │ (judgement)
|
|
reference.md (intent columns) ───────────────────────┘
|
|
```
|
|
|
|
## Open decisions (deferred, tracked in TODO)
|
|
|
|
- **Usage-stats source** (TODO 8.4): **Proxmox RRD** (built-in, no extra infra)
|
|
vs the **Prometheus/Loki/Grafana/Grafana-Alloy** stack we will likely run
|
|
anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before
|
|
building any usage hook** to avoid throwaway work.
|
|
- **Script dependency policy** (TODO #14): whether stdlib-only remains the rule
|
|
for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
|
|
- **Scheduling** (TODO 8.4): `/capacity-review` is on-demand now; cron later.
|
|
|
|
## Deliverables & state at delivery
|
|
|
|
| Piece | Path | State |
|
|
|---|---|---|
|
|
| Reference doc | `docs/hardware/reference.md` | Skeleton + real node data |
|
|
| Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) |
|
|
| Evaluator skill | `/capacity-review` → `docs/hardware/reviews/` | Working, intent-based |
|
|
| Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR |
|
|
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |
|
|
|
|
## Out of scope / YAGNI
|
|
|
|
- No usage-stats collection until the cluster exists and the source is decided.
|
|
- No structured-data (YAML) source of truth — markdown is the single hand-edited
|
|
source by choice; revisit only if parsing pain demands it.
|
|
- No automated moves/remediation — the evaluator recommends; humans act.
|