boma/.claude/commands/capacity-review.md

# Evaluate the homelab's hardware capacity and workload placement

Assess current allocation headroom, HA posture, and workload placement against declared
intent, and write a tracked report to `docs/hardware/reviews/`. On-demand only;
scheduled runs are deferred (see `docs/TODO.md` 8.4).

## Reference material

- `docs/hardware/reference.md` — physical node specs, workload allocations, and the
  free-text intent columns (`criticality`, `ha_intent`, `profile`, `constraints`, `growth`).
- `scripts/capacity-scan.py` — deterministic scan; emits JSON with keys `nodes`,
  `workloads`, `usage`, `warnings`.

## Process

### Phase 0 — gather facts

Run `python3 scripts/capacity-scan.py` and parse its JSON output:

- `nodes` — per-node physical totals, allocated totals, `ram_headroom_pct`, and the
  `oversubscribed` flag.
- `workloads` — per-workload allocation rows from `reference.md`.
- `usage` — live usage stats if available; check `usage.available`. If `false`, every
  recommendation in the report is **intent-based, not usage-based** — state this
  prominently in the report header.
- `warnings` — drift findings the scan has already detected (reference vs Terraform/inventory).

### Phase 1 — read intent

Read `docs/hardware/reference.md` for the free-text columns the scan does not parse:
`criticality`, `ha_intent`, `profile`, `constraints`, and `growth`, plus the
"Capacity notes" section at the bottom of the file.

### Phase 2 — reason across five dimensions

Produce concrete, actionable recommendations. Tag every item with its type and the
basis it rests on (**intent-based** vs **usage-based**):

1. **HA / redundancy** — anti-affinity violations (e.g. an HA pair co-located on one
   node), single points of failure, HA posture that looks like overkill for the
   declared `criticality`, and high-criticality workloads with no redundancy.
2. **Right-sizing** — over- or under-provisioned workloads compared to their `profile`.
   Today this is intent-based (declared allocation vs profile); flag explicitly that it
   becomes usage-based once the `gather_usage()` hook in the scan script is live.
3. **Placement / moves** — oversubscribed nodes (`oversubscribed: true` or low
   `ram_headroom_pct`) and constraint-driven relocations indicated by `constraints`.
4. **Upgrade timing** — cross-reference `growth` notes against current headroom to
   estimate a rough runway before a node upgrade is needed.
5. **Drift** — surface every entry in the scan's `warnings` array verbatim.

### Phase 3 — write the report

Save the report to `docs/hardware/reviews/YYYY-MM-DD-capacity.md` and overwrite
`docs/hardware/reviews/latest.md` with the same content.

Report structure:

- **One-line summary** — overall health signal (e.g. "All nodes within headroom;
  two HA violations detected").
- **Run metadata** — date, reviewed commit SHA, `usage.available` status.
- **Section per dimension** — each with concrete, actionable items; every item states
  its basis (intent-based or usage-based) and the evidence behind it.
- **Follow-up prompt** — a generated, copy-pasteable prompt for the next review or
  for acting on the top finding.

Commit the report files per CLAUDE.md git conventions.
Add /capacity-review skill Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> 2026-06-01 10:32:07 +02:00			`# Evaluate the homelab's hardware capacity and workload placement`

			`Assess current allocation headroom, HA posture, and workload placement against declared`
			intent, and write a tracked report to `docs/hardware/reviews/`. On-demand only;
			scheduled runs are deferred (see `docs/TODO.md` 8.4).

			`## Reference material`

			- `docs/hardware/reference.md` — physical node specs, workload allocations, and the
			free-text intent columns (`criticality`, `ha_intent`, `profile`, `constraints`, `growth`).
			- `scripts/capacity-scan.py` — deterministic scan; emits JSON with keys `nodes`,
			`workloads`, `usage`, `warnings`.

			`## Process`

			`### Phase 0 — gather facts`

			Run `python3 scripts/capacity-scan.py` and parse its JSON output:

			- `nodes` — per-node physical totals, allocated totals, `ram_headroom_pct`, and the
			`oversubscribed` flag.
			- `workloads` — per-workload allocation rows from `reference.md`.
			- `usage` — live usage stats if available; check `usage.available`. If `false`, every
			`recommendation in the report is intent-based, not usage-based — state this`
			`prominently in the report header.`
			- `warnings` — drift findings the scan has already detected (reference vs Terraform/inventory).

			`### Phase 1 — read intent`

			Read `docs/hardware/reference.md` for the free-text columns the scan does not parse:
			`criticality`, `ha_intent`, `profile`, `constraints`, and `growth`, plus the
			`"Capacity notes" section at the bottom of the file.`

			`### Phase 2 — reason across five dimensions`

			`Produce concrete, actionable recommendations. Tag every item with its type and the`
			`basis it rests on (intent-based vs usage-based):`

			`1. HA / redundancy — anti-affinity violations (e.g. an HA pair co-located on one`
			`node), single points of failure, HA posture that looks like overkill for the`
			declared `criticality`, and high-criticality workloads with no redundancy.
			2. Right-sizing — over- or under-provisioned workloads compared to their `profile`.
			`Today this is intent-based (declared allocation vs profile); flag explicitly that it`
			becomes usage-based once the `gather_usage()` hook in the scan script is live.
			3. Placement / moves — oversubscribed nodes (`oversubscribed: true` or low
			`ram_headroom_pct`) and constraint-driven relocations indicated by `constraints`.
			4. Upgrade timing — cross-reference `growth` notes against current headroom to
			`estimate a rough runway before a node upgrade is needed.`
			5. Drift — surface every entry in the scan's `warnings` array verbatim.

			`### Phase 3 — write the report`

			Save the report to `docs/hardware/reviews/YYYY-MM-DD-capacity.md` and overwrite
			`docs/hardware/reviews/latest.md` with the same content.

			`Report structure:`

			`- One-line summary — overall health signal (e.g. "All nodes within headroom;`
			`two HA violations detected").`
			- Run metadata — date, reviewed commit SHA, `usage.available` status.
			`- Section per dimension — each with concrete, actionable items; every item states`
			`its basis (intent-based or usage-based) and the evidence behind it.`
			`- Follow-up prompt — a generated, copy-pasteable prompt for the next review or`
			`for acting on the top finding.`

			`Commit the report files per CLAUDE.md git conventions.`