boma/.claude/commands/capacity-review.md
sjat 1060a9c08a Add /capacity-review skill
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:32:07 +02:00

3.2 KiB

Evaluate the homelab's hardware capacity and workload placement

Assess current allocation headroom, HA posture, and workload placement against declared intent, and write a tracked report to docs/hardware/reviews/. On-demand only; scheduled runs are deferred (see docs/TODO.md 8.4).

Reference material

  • docs/hardware/reference.md — physical node specs, workload allocations, and the free-text intent columns (criticality, ha_intent, profile, constraints, growth).
  • scripts/capacity-scan.py — deterministic scan; emits JSON with keys nodes, workloads, usage, warnings.

Process

Phase 0 — gather facts

Run python3 scripts/capacity-scan.py and parse its JSON output:

  • nodes — per-node physical totals, allocated totals, ram_headroom_pct, and the oversubscribed flag.
  • workloads — per-workload allocation rows from reference.md.
  • usage — live usage stats if available; check usage.available. If false, every recommendation in the report is intent-based, not usage-based — state this prominently in the report header.
  • warnings — drift findings the scan has already detected (reference vs Terraform/inventory).

Phase 1 — read intent

Read docs/hardware/reference.md for the free-text columns the scan does not parse: criticality, ha_intent, profile, constraints, and growth, plus the "Capacity notes" section at the bottom of the file.

Phase 2 — reason across five dimensions

Produce concrete, actionable recommendations. Tag every item with its type and the basis it rests on (intent-based vs usage-based):

  1. HA / redundancy — anti-affinity violations (e.g. an HA pair co-located on one node), single points of failure, HA posture that looks like overkill for the declared criticality, and high-criticality workloads with no redundancy.
  2. Right-sizing — over- or under-provisioned workloads compared to their profile. Today this is intent-based (declared allocation vs profile); flag explicitly that it becomes usage-based once the gather_usage() hook in the scan script is live.
  3. Placement / moves — oversubscribed nodes (oversubscribed: true or low ram_headroom_pct) and constraint-driven relocations indicated by constraints.
  4. Upgrade timing — cross-reference growth notes against current headroom to estimate a rough runway before a node upgrade is needed.
  5. Drift — surface every entry in the scan's warnings array verbatim.

Phase 3 — write the report

Save the report to docs/hardware/reviews/YYYY-MM-DD-capacity.md and overwrite docs/hardware/reviews/latest.md with the same content.

Report structure:

  • One-line summary — overall health signal (e.g. "All nodes within headroom; two HA violations detected").
  • Run metadata — date, reviewed commit SHA, usage.available status.
  • Section per dimension — each with concrete, actionable items; every item states its basis (intent-based or usage-based) and the evidence behind it.
  • Follow-up prompt — a generated, copy-pasteable prompt for the next review or for acting on the top finding.

Commit the report files per CLAUDE.md git conventions.