sjat/boma

sjat 88210db09c Add hardware reference & capacity-evaluation design spec

Brainstormed design for docs/hardware/reference.md (physical compute +
network gear + workload placement intent), a stdlib-only capacity-scan.py,
and an on-demand /capacity-review skill that reports to docs/hardware/reviews/.
Mirrors the repo-scan -> /review-repo -> docs/reviews triad.

TODO additions: schedule /capacity-review later and decide its usage-stats
source (Proxmox RRD vs the Prometheus/Loki/Grafana/Alloy stack) before
building any hook (8.4); reevaluate the stdlib-only script policy (#14).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-01 09:59:16 +02:00

7.7 KiB


Design — Hardware reference & capacity evaluation

Date: 2026-06-01 · Status: approved for planning

Problem

The repo documents the logical/network layer well — Terraform declares per-VM cores/memory_mb/disk_size_gb, and ADR-007 records VLANs, IPs, and topology. But the physical layer is undocumented: how many Proxmox nodes physically exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and askari. Nothing records "this node has 64 GB, X is allocated, Y is free," and nothing evaluates whether the design is well-proportioned — e.g. a service that needn't be HA, a workload that should move nodes, or a node due a RAM/disk upgrade.

Goal

A single, human-first hardware reference document capturing physical compute + network gear and the intended workload placement.
A capacity evaluator ("script + skill") that reasons about optimization: HA overkill / missing redundancy, right-sizing, placement moves, and upgrade timing — emitting a dated report.

Scope

In: Proxmox compute nodes (pve0..2) + askari; network gear (OPNsense, managed switch, APs); per-workload placement intent.
Out (for now): power/UPS budget, NAS, cabling, rack layout, asset register, warranty/serial tracking.

Non-negotiable repo conventions this must honor

Mirror the existing repo-scan.py → /review-repo → docs/reviews/ triad (deterministic scan feeds a judgement skill; report is dated markdown).
Utility scripts are stdlib-only for run-anywhere portability (control node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
Be honest about real-vs-planned (STATUS.md). The physical cluster is not stood up yet, so live usage stats are a documented future hook, not a current capability.

Architecture

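Four pieces: the reference doc, the scan script, the evaluator skill, and a recording step that carries the tracking updates (ADR, STATUS.md, CLAUDE.md).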

1. Reference doc — `docs/hardware/reference.md`

One hand-maintained markdown file, the source of truth for physical facts and placement intent. Four parts:

Physical compute — one subsection per node (pve0..2, askari): model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots), storage (disks → pools, e.g. local-zfs / local-lvm), NICs, notes.
Network gear — OPNsense box, managed switch, APs: model, port/PoE counts, throughput, uplinks. Short table.
Workload placement & intent — one row per planned VM/service, columns: Service | Home node | Criticality | HA intent | Resource profile | Placement constraints | Growth notes. These columns map onto the four attribute groups chosen during brainstorming and give the evaluator concrete intent to judge against (e.g. anti-affinity: dns1/dns2 on different nodes).
Capacity summary — per-node "allocated vs physical" rollup (RAM / cores / disk, headroom %).

Node-capacity tables use a strict, documented format so the scan script can parse the numbers without a YAML dependency.
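A minimal sketch of what "strict" could mean in practice, assuming the node tables are plain pipe tables whose first header cell is `Node` and whose remaining cells are integers. The table layout, the column names (`cores`, `ram_gb`) and the function name `parse_node_table` are illustrative assumptions, not the format the repo has settled on.

```python
def parse_node_table(markdown):
    """Return {node: {column: int}} from pipe tables whose first header is 'Node'.

    Hypothetical format: any other table is ignored, and a non-integer cell
    raises ValueError so a malformed edit fails loudly instead of silently.
    """
    nodes = {}
    header = None  # None = not inside a table; [] = inside a table we ignore
    for raw in markdown.splitlines():
        line = raw.strip()
        if not (line.startswith("|") and line.endswith("|")):
            header = None  # any non-table line ends the current table
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells if cells[0] == "Node" else []
            continue
        if not header or set("".join(cells)) <= set("-: "):
            continue  # foreign table, or the |---|---| separator row
        if len(cells) != len(header):
            raise ValueError(f"wrong column count in row: {line}")
        nodes[cells[0]] = {k: int(v) for k, v in zip(header[1:], cells[1:])}
    return nodes


# Example values are made up for illustration.
example = """
| Node | cores | ram_gb |
|------|------:|-------:|
| pve0 | 8     | 64     |
"""
print(parse_node_table(example))
```

Failing on a bad cell rather than skipping it keeps the rollup math from quietly running on partial data.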

2. Scan script — `scripts/capacity-scan.py`

Stdlib-only, deterministic, JSON to stdout (like repo-scan.py). Avoids hand-parsing YAML by shelling out for JSON, the pattern tf_to_inventory.py already uses.
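The shell-out-for-JSON pattern could look roughly like the sketch below. The helper name `run_json`, the 60-second timeout, and the choice to swallow every failure into a default are assumptions made for illustration; how tf_to_inventory.py actually handles errors is not shown here.

```python
import json
import subprocess


def run_json(cmd, default):
    """Run cmd and parse its stdout as JSON; return default on any failure.

    Covers a missing binary (OSError), a non-zero exit or timeout
    (SubprocessError), and output that is not valid JSON.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=60
        )
        return json.loads(proc.stdout)
    except (OSError, subprocess.SubprocessError, json.JSONDecodeError):
        return default


# Degrades to an empty dict when Terraform is absent or has no state yet.
tf_outputs = run_json(["terraform", "output", "-json"], default={})
```

Returning a default keeps the scan deterministic on a bare clone; a real script would probably also record why the call failed in `warnings[]`.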

Gathers today:

Declared allocations — terraform output -json (and/or the .tf module calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no real VMs yet (current reality) instead of failing.
Inventory hosts — ansible-inventory -i inventories/<env>/hosts.yml --list → JSON.
Physical capacities — parses the strict node tables in reference.md.
Rollup math — per node: allocated vs physical, headroom %, oversubscribed flag.
Drift warnings — e.g. reference.md lists a host no Terraform VM declares; surfaced in a warnings[] array (free doc↔Terraform drift check).
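The rollup and drift items above reduce to small pure functions. This is a sketch for RAM only; the function names, the rounding to a whole percent, and treating "allocated > physical" as the oversubscribed test are assumptions chosen to match the output sketch further down, not decided behaviour.

```python
def rollup(physical_gb, allocated_gb):
    """Per-node RAM rollup: headroom as a whole percent of physical capacity."""
    if physical_gb <= 0:
        raise ValueError("physical capacity must be positive")
    headroom_pct = round(100 * (physical_gb - allocated_gb) / physical_gb)
    return {
        "ram_gb": physical_gb,
        "ram_allocated_gb": allocated_gb,
        "headroom_pct": headroom_pct,  # negative when oversubscribed
        "oversubscribed": allocated_gb > physical_gb,
    }


def drift_warnings(doc_hosts, tf_vms):
    """Hosts named in reference.md that no Terraform VM declares."""
    missing = sorted(set(doc_hosts) - set(tf_vms))
    return [f"reference.md lists {h} but no Terraform VM declares it" for h in missing]


print(rollup(64, 12))
print(drift_warnings(["dns1", "forgejo"], ["forgejo"]))
```

With 12 GB allocated on a 64 GB node the headroom is (64 - 12) / 64 = 81.25 %, reported as 81.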

Stubbed future hook (honest, à la STATUS.md):

# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
def gather_usage():
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}

Output sketch:

{
  "nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
  "workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
  "usage": {"available": false, "reason": "cluster not provisioned"},
  "warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
}

3. Evaluator skill — `/capacity-review`

A skill in .claude/ (mirrors /review-repo), on-demand. Flow:

Run python3 scripts/capacity-scan.py → JSON.
Read docs/hardware/reference.md for intent columns the math can't capture.
Reason across dimensions, each recommendation tagged by type and stating what it is based on (declared intent vs measured usage):
- HA / redundancy — anti-affinity violations, SPOFs, HA-overkill, critical-but-unredundant services.
- Right-sizing — over/under-provisioned VMs. Intent-based today; explicitly upgradeable to usage-based once the usage hook is live.
- Placement / moves — oversubscribed nodes, constraint-driven relocation.
- Upgrade timing — growth notes vs headroom → rough runway.
- Drift — surfaces the scan's warnings[].
Write docs/hardware/reviews/YYYY-MM-DD-capacity.md (+ latest.md), mirroring docs/reviews/.
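Most of this flow is judgement, but the anti-affinity part of the HA / redundancy dimension is mechanical enough to sketch. The data shapes below (a service-to-node mapping and a list of "must not share a node" groups) and the function name are hypothetical; the spec does not say whether this check lives in the scan script or in the skill's reasoning.

```python
from collections import defaultdict


def anti_affinity_violations(placements, groups):
    """Report services from the same anti-affinity group that share a node.

    placements: {service: node}; groups: iterable of sets of service names.
    Services missing from placements are ignored rather than flagged.
    """
    violations = []
    for group in groups:
        by_node = defaultdict(list)
        for service in sorted(group):
            if service in placements:
                by_node[placements[service]].append(service)
        for node, services in sorted(by_node.items()):
            if len(services) > 1:
                violations.append(
                    f"anti-affinity: {', '.join(services)} share {node}"
                )
    return violations


# Node assignments here are invented for the example.
print(anti_affinity_violations({"dns1": "pve0", "dns2": "pve0"}, [{"dns1", "dns2"}]))
```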

4. Recording — ADR + STATUS + CLAUDE.md

ADR-012 — Hardware reference & capacity evaluation (docs/decisions/012-hardware-capacity.md): records the decision and rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as an open decision (below).
STATUS.md rows: reference.md + capacity-scan.py → real/working (skeleton); /capacity-review → working, intent-only; live usage → designed, not built.
CLAUDE.md: a "Review capacity/hardware → /capacity-review" commands-table row + a "Further reading" pointer to ADR-012.

Data flow

reference.md ──┐
               ├─→ capacity-scan.py ──→ scan JSON ──┐
terraform ─────┤      (stdlib, JSON-via-subprocess)  ├─→ /capacity-review ─→ docs/hardware/reviews/
inventory ─────┘                                     │        (judgement)
reference.md (intent columns) ───────────────────────┘

Open decisions (deferred, tracked in TODO)

Usage-stats source (TODO 8.4): Proxmox RRD (built-in, no extra infra) vs the Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely run anyway (richer, per-process, more to operate; see TODO 3.6). Decide before building any usage hook to avoid throwaway work.
Script dependency policy (TODO #14): whether stdlib-only remains the rule for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
Scheduling (TODO 8.4): /capacity-review is on-demand now; cron later.

Deliverables & state at delivery

| Piece | Path | State |
|---|---|---|
| Reference doc | `docs/hardware/reference.md` | Skeleton + real node data |
| Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) |
| Evaluator skill | `/capacity-review` → `docs/hardware/reviews/` | Working, intent-based |
| Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR |
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |

Out of scope / YAGNI

No usage-stats collection until the cluster exists and the source is decided.
No structured-data (YAML) source of truth — markdown is the single hand-edited source by choice; revisit only if parsing pain demands it.
No automated moves/remediation — the evaluator recommends; humans act.
