Compare commits

...

10 commits

Author SHA1 Message Date
e12326148c Note latest.md report mirror in ADR-012
Final-review minor: the /capacity-review skill overwrites a latest.md
pointer alongside the dated report; record that in the ADR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 10:40:16 +02:00
4c535c908e Record ADR-012 + STATUS/CLAUDE/scripts docs for capacity tooling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:34:38 +02:00
1060a9c08a Add /capacity-review skill
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:32:07 +02:00
05694f6ea4 Complete capacity-scan.py: usage stub, subprocess glue, main()
Adds gather_usage() (stubbed, returns available:false), known_hostnames()
with graceful degradation when terraform/ansible-inventory are absent,
_run_json() helper, and main() that parses reference.md and emits JSON.
Three new TDD tests (12 total, all passing). Script exits 0 with valid
JSON even when no cluster is provisioned.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:30:45 +02:00
8ed00c9206 Add hostname parsers + find_drift() to capacity-scan.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:24:11 +02:00
b240fa8bfe Add compute_rollup() to capacity-scan.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:21:22 +02:00
07ecbb2789 Add capacity-scan.py with parse_table()
Implements the parse_table() function and pytest test harness for the
capacity-scan script. Tests cover header matching and graceful empty
return when the required header is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:20:10 +02:00
3ea9109ba2 Add hardware reference doc skeleton + reviews dir
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:14:53 +02:00
6ff5d55810 Add implementation plan for hardware capacity tooling
Task-by-task TDD plan: reference.md skeleton, stdlib-only capacity-scan.py
(parse_table, compute_rollup, drift, usage stub, main), /capacity-review skill,
and ADR-012 + STATUS/CLAUDE/scripts-README updates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 10:04:59 +02:00
88210db09c Add hardware reference & capacity-evaluation design spec
Brainstormed design for docs/hardware/reference.md (physical compute +
network gear + workload placement intent), a stdlib-only capacity-scan.py,
and an on-demand /capacity-review skill that reports to docs/hardware/reviews/.
Mirrors the repo-scan -> /review-repo -> docs/reviews triad.

TODO additions: schedule /capacity-review later and decide its usage-stats
source (Proxmox RRD vs the Prometheus/Loki/Grafana/Alloy stack) before
building any hook (8.4); reevaluate the stdlib-only script policy (#14).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 09:59:16 +02:00
12 changed files with 1439 additions and 50 deletions


@ -0,0 +1,66 @@
# Evaluate the homelab's hardware capacity and workload placement
Assess current allocation headroom, HA posture, and workload placement against declared
intent, and write a tracked report to `docs/hardware/reviews/`. On-demand only;
scheduled runs are deferred (see `docs/TODO.md` 8.4).
## Reference material
- `docs/hardware/reference.md` — physical node specs, workload allocations, and the
free-text intent columns (`criticality`, `ha_intent`, `profile`, `constraints`, `growth`).
- `scripts/capacity-scan.py` — deterministic scan; emits JSON with keys `nodes`,
`workloads`, `usage`, `warnings`.
## Process
### Phase 0 — gather facts
Run `python3 scripts/capacity-scan.py` and parse its JSON output:
- `nodes` — per-node physical totals, allocated totals, `ram_headroom_pct`, and the
`oversubscribed` flag.
- `workloads` — per-workload allocation rows from `reference.md`.
- `usage` — live usage stats if available; check `usage.available`. If `false`, every
recommendation in the report is **intent-based, not usage-based** — state this
prominently in the report header.
- `warnings` — drift findings the scan has already detected (reference vs Terraform/inventory).
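
A minimal sketch of how the Phase 0 output might be reduced to the facts the report header needs. It assumes `nodes` is a JSON object keyed by node name, and the 20% low-headroom threshold and the `summarize_scan` helper are hypothetical, not part of the scan script:

```python
import json


def summarize_scan(scan, low_headroom_pct=20):
    """Reduce the scan JSON to the facts the report header needs.

    Assumes `nodes` maps node name -> rollup dict; the threshold is illustrative.
    """
    usage = scan.get("usage") or {}
    basis = "usage-based" if usage.get("available") else "intent-based"
    flagged = sorted(
        name
        for name, node in scan.get("nodes", {}).items()
        if node.get("oversubscribed")
        or node.get("ram_headroom_pct", 100) < low_headroom_pct
    )
    return {
        "basis": basis,
        "flagged_nodes": flagged,
        "drift_warnings": list(scan.get("warnings", [])),
    }


# Example input shaped like the documented keys (all values are made up).
example = json.loads("""
{
  "nodes": {
    "pve0": {"ram_headroom_pct": 87, "oversubscribed": false},
    "pve1": {"ram_headroom_pct": 12, "oversubscribed": false}
  },
  "workloads": [],
  "usage": {"available": false},
  "warnings": []
}
""")
summary = summarize_scan(example)
print(summary["basis"], summary["flagged_nodes"])
```

In a real run the `scan` dict would come from `json.loads()` applied to the stdout of `python3 scripts/capacity-scan.py`.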
### Phase 1 — read intent
Read `docs/hardware/reference.md` for the free-text columns the scan does not parse:
`criticality`, `ha_intent`, `profile`, `constraints`, and `growth`, plus the
"Capacity notes" section at the bottom of the file.
### Phase 2 — reason across five dimensions
Produce concrete, actionable recommendations. Tag every item with its type and the
basis it rests on (**intent-based** vs **usage-based**):
1. **HA / redundancy** — anti-affinity violations (e.g. an HA pair co-located on one
node), single points of failure, HA posture that looks like overkill for the
declared `criticality`, and high-criticality workloads with no redundancy.
2. **Right-sizing** — over- or under-provisioned workloads compared to their `profile`.
Today this is intent-based (declared allocation vs profile); flag explicitly that it
becomes usage-based once the `gather_usage()` hook in the scan script is live.
3. **Placement / moves** — oversubscribed nodes (`oversubscribed: true` or low
`ram_headroom_pct`) and constraint-driven relocations indicated by `constraints`.
4. **Upgrade timing** — cross-reference `growth` notes against current headroom to
estimate a rough runway before a node upgrade is needed.
5. **Drift** — surface every entry in the scan's `warnings` array verbatim.
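
As an illustration of the HA / redundancy dimension, the anti-affinity check could be sketched as below. It assumes `ha_intent` follows the `pair/<peer>` shape used in the example rows of `reference.md`; `colocated_pairs` is a hypothetical helper, and free-text intent that does not follow that shape still needs human judgement:

```python
def colocated_pairs(workload_rows):
    """Return sorted (workload, peer, node) tuples for HA pairs sharing a node.

    Assumes `ha_intent` uses the `pair/<peer>` shape seen in the example rows.
    """
    node_of = {row["workload"]: row["node"] for row in workload_rows}
    violations = set()
    for row in workload_rows:
        intent = row.get("ha_intent", "")
        if not intent.startswith("pair/"):
            continue
        peer = intent.split("/", 1)[1]
        if node_of.get(peer) == row["node"]:
            # Order the names so each pair is reported once, not twice.
            first, second = sorted((row["workload"], peer))
            violations.add((first, second, row["node"]))
    return sorted(violations)


rows = [
    {"workload": "dns1", "node": "pve0", "ha_intent": "pair/dns2"},
    {"workload": "dns2", "node": "pve0", "ha_intent": "pair/dns1"},
    {"workload": "forgejo", "node": "pve1", "ha_intent": "none"},
]
print(colocated_pairs(rows))  # [('dns1', 'dns2', 'pve0')]
```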
### Phase 3 — write the report
Save the report to `docs/hardware/reviews/YYYY-MM-DD-capacity.md` and overwrite
`docs/hardware/reviews/latest.md` with the same content.
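
The dated-file-plus-mirror step can be sketched as follows. `write_report` is a hypothetical helper, shown against a temporary directory rather than the real `docs/hardware/reviews/`:

```python
import datetime
import pathlib
import tempfile


def write_report(reviews_dir, content, today=None):
    """Write the dated report and mirror it to latest.md; return both paths."""
    today = today or datetime.date.today()
    reviews = pathlib.Path(reviews_dir)
    reviews.mkdir(parents=True, exist_ok=True)
    dated = reviews / f"{today.isoformat()}-capacity.md"
    latest = reviews / "latest.md"
    for path in (dated, latest):
        path.write_text(content, encoding="utf-8")
    return dated, latest


# Demonstrated in a throwaway directory so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    dated, latest = write_report(tmp, "# Capacity review\n", datetime.date(2026, 6, 1))
    print(dated.name, latest.name)
```

Writing both files from the same string keeps `latest.md` byte-identical to the dated report.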
Report structure:
- **One-line summary** — overall health signal (e.g. "All nodes within headroom;
two HA violations detected").
- **Run metadata** — date, reviewed commit SHA, `usage.available` status.
- **Section per dimension** — each with concrete, actionable items; every item states
its basis (intent-based or usage-based) and the evidence behind it.
- **Follow-up prompt** — a generated, copy-pasteable prompt for the next review or
for acting on the top finding.
Commit the report files per CLAUDE.md git conventions.


@ -31,6 +31,7 @@ Full design rationale: `docs/decisions/`
| Deploy a playbook | `make deploy PLAYBOOK=<name>` |
| Scaffold a new role | `make new-role NAME=<name>` |
| Review repo for drift/cruft | `/review-repo` (Claude command) |
| Review hardware capacity | `/capacity-review` (Claude command) |
| Encrypt a vault file | `make encrypt FILE=<path>` |
| Decrypt a vault file | `make decrypt FILE=<path>` |
| Install Python deps | `make setup` |
@ -170,6 +171,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Testing methodology | `docs/decisions/008-testing.md` |
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Adding a new role | `docs/runbooks/new-role.md` |
| Adding a new host | `docs/runbooks/new-host.md` |
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |


@ -21,6 +21,8 @@ _Last reviewed: 2026-05-30._
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
## Scaffolded but empty — NOT implemented
@ -44,6 +46,7 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
| Per-service roles | ADR-004 | Model defined; no service roles built |
| Forgejo Actions CI | ADR-003 / ADR-008 | Remote is live (pushed); Actions/`act_runner` pipeline not yet built |
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
## Keeping this honest


@ -1,64 +1,90 @@
# ToDo
- [x] Main readme only says ansible, not terraform. Should probably be included.
- [x] Main readme does not include a description of the name boma, nor the scope (i.e. infrastructure - not laptops)
1. **Forgejo CI** — what CI work remains after ADR-010 (which workflows, runner
setup, etc. still need to be built)?
- [x] Method to review repo to ensure
- We dont carry around code, comments, notes, etc. that is no longer needed but was perhaps added to fix an issue that has been resolved.
- That all code, structure, comments, notes etc. follow our design decisions.
- That clear intent is documented throughout - and that there are not any overlaps, contradictions etc.
2. **Testing**
1. Choose and configure code-testing tooling (Molecule, etc.).
2. Decide how the AI interprets Molecule output and performs live testing:
API calls, curl pulls of web products, log reviews, and headless browsing.
3. Define a standard for generating test users and for instructing the user to
perform relevant manual tests.
- [ ] Forgejo CI
3. **Building services**
1. Decide how to manage logs.
2. Decide how to manage APIs / API access.
3. Decide how to import or integrate from baobabAnsibleV4.
4. Decide what each node runs — base packages plus which apps/services.
5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
Kuma alerts on askari.
7. Define a tagging standard that lets us target runs without over-tagging.
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
9. Decide: a central database server, or individual database services per app?
10. Should we continue to use the base-container method, or does something in boma's improved methods moot the point?
- [ ] Testing
- Code testing tools (molecule etc.)
- AI interpretation of molecule etc, but also actual testing via API-calls, CURL pulls of web products, log reviews and perhaps even headless browsing
4. **Split-horizon FQDN** — adopt split-horizon FQDN with or without nyumbani?
- [ ] Building stuff
- How to manage logs
- How to manage APIs
- How to import/integrate from baobabAnsibleV4?
- What to install on nodes?
- firewalls?
- apps?
- wiring up loki, prometheus, grafana dashboards, grafana alerts, uptimekuma alerts on askari
- tagging strategy - we need a specific standard so that we can target runs, but dont over-tag.
5. **Control node**
1. Set up and test the control node while waiting for hardware.
2. Define control-node bootstrapping — a dedicated recipe and playbook?
3. Decide the role of mamba — access/availability vs compute power and ease?
4. Set up rbw on the control node.
- [ ] Split horizon FQDN - with or without nyumbani
6. **Updating**
1. Decide pinning vs latest for versions.
2. Decide the update strategy across services & containers vs packages &
builds / GitHub pulls / Flatpaks.
3. Define scheduling of updates and reboots, including post-update testing.
- [ ] Control node
- Setup and testing while waiting for hardware?
- Bootstrapping - perhaps dedicated recipe and playbook?
- Role of mamba? - Access/availability vs compute power and ease?
- rbw on control node
7. **Shell setup**
1. Decide what shell setup matters for the AI's work on the control node.
2. Decide what to set up on the hosts, given that direct access will be rare.
- [ ] Updating
- Pinning vs latest.
- services and containers vs packages and builds/github pulls/flatpaks
- scheduling of updates and reboots - incl. testing afterwards.
8. **Scheduled work**
1. Run `/review-repo` as `claude -p` via cron every two weeks?
2. Build sanity checks (e.g. does PhotoPrism have its pictures? are email
services receiving and sending?).
3. Design a declarative `scheduled_jobs` role so the repo owns which cronjobs
run on a host, enforced by Ansible. Sketch (deferred until we have hosts):
reads a `scheduled_jobs__jobs` list from group_vars/host_vars, rendered via
a managed `/etc/cron.d` file. Open questions:
1. General role vs control-node-only?
2. Prune undeclared jobs (repo authoritative) vs additive?
3. Validate headless email and that cron's env has the `claude` CLI.
4. (The fortnightly `/review-repo` job is the first entry.)
4. Schedule `/capacity-review` to run periodically (on-demand only for now).
Revisit once the physical cluster + a live usage-stats hook exist, so it
reasons on real usage rather than declared intent alone. **Decide the usage
source first:** Proxmox RRD (built-in, no extra infra) vs the
Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely set up anyway
(richer, per-process, but more to run) — see TODO 3.6. Don't build the
Proxmox-RRD hook before settling this, to avoid throwaway work.
9. Should we make a basic function so that tools (and AI) can send messages to the user - email, matrix or ntfy?
- [ ] shell setup
- What does it matter in relation to the AI's work on the control node?
- What should we set up on the hosts, if I'll rarely go there?
10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
files); hooks + slash commands + `/review-repo` for enforcement at scale. Any
remaining setup to carry out from this decision?
1. Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.
2. Policy for how we write key documents like ADRs.
3. Further development on how we collaborate on designing the foundation for the project - separate from how we implement new containers etc.
- [ ] Scheduled work
- /review-repo maybe as claude -p via cron every two weeks?
- Sanity checks: does PhotoPrism have its pictures? are email services receiving and sending?
- Cron "section": a declarative way for the repo to own which cronjobs are active on a
host, enforced by Ansible. Sketch (deferred until we have hosts): a `scheduled_jobs`
role reading a `scheduled_jobs__jobs` list from group_vars/host_vars, rendered via a
managed /etc/cron.d file. Open Qs: general role vs control-node-only; prune
undeclared jobs (repo authoritative) vs additive; validate headless email + that
cron's env has the `claude` CLI. The /review-repo fortnightly job is the first entry.
11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
findings + a tooling-usage inventory; proposes add / change / **remove**
(biased to remove); records decisions as ADRs; evaluates itself.
Recurrence-triggered plus a light periodic sweep.
2. Keep appending raw signals to `docs/FRICTION.md` (live now) until the
retro consumes them.
- [ ] Claude setup
- superpowers or other methodologies? → decided: brainstorm for intent, capture as
ADRs (skip plan files); hooks + slash commands + /review-repo for enforcement at scale.
12. **Spin-up order** — what is the right order of operations when spinning up
from scratch (OS, DNS, Authentik, Traefik, …)?
- [ ] Kaizen loop — set up ~2026-06-06 (one week from now)
- Build `/retro`: reads `docs/FRICTION.md` + `/review-repo` recurring findings + a
tooling-usage inventory; proposes add / change / **remove** (biased to remove);
records decisions as ADRs; evaluates itself. Recurrence-triggered + light periodic sweep.
- `docs/FRICTION.md` is live now — keep appending raw signals until the retro consumes them.
13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the readme files, but is that enough? Also, how do we rechallenge decisions and how they interact over time? E.g. we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to a service has emerged that is actually better.
- [ ] What is the right order of operation when spinning up from scratch? (OS, DNS, authentik, traefik...?)
14. **Script dependencies policy** — utility scripts (`tf_to_inventory.py`,
`repo-scan.py`, `capacity-scan.py`) are stdlib-only by convention, for
run-anywhere portability (control node, CI, bare clone, no venv). Reevaluate
whether selectively allowing libraries (e.g. PyYAML — already present via
Ansible) is a better fit in general: weigh the parsing-correctness win
against losing zero-setup portability. Decide a clear rule and record it.


@ -0,0 +1,38 @@
# ADR-012 — Hardware reference & capacity evaluation
## Context
The repo modelled the logical/network layer (Terraform VM specs, ADR-007
topology) but not the physical layer — node CPU/RAM/disk capacity, network gear,
or which workloads are designed to run where with what headroom. There was also
no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a
workload that should move, or a node due an upgrade.
## Decision
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
physical compute + network gear and workload placement intent. Two
machine-readable tables (node capacity, workload placement) carry the numbers.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
parses those tables, computes per-node allocated-vs-physical rollups, and
cross-checks workload hostnames against `terraform output -json` /
`ansible-inventory --list` to surface drift.
- `/capacity-review` reads the scan + intent columns and writes a dated report to
`docs/hardware/reviews/YYYY-MM-DD-capacity.md`, also overwriting
`docs/hardware/reviews/latest.md`, mirroring `/review-repo` → `docs/reviews/`.
- Numeric allocations live in `reference.md`, not Terraform: the current
`terraform output` exposes only `{ip, group}`. Terraform/inventory are used
only for hostname-drift cross-checks.
- **Live usage stats are a future hook.** The cluster is not stood up;
`gather_usage()` returns `available: false` and the evaluator reasons on
declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/
Alloy) is undecided — see docs/TODO.md 8.4, to be settled before any hook is
built.
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).


@ -0,0 +1,52 @@
# Hardware reference — boma
> Hand-maintained source of truth for **physical** compute + network gear and
> **workload placement intent**. The two machine-readable tables (Node capacity,
> Workload placement) are parsed by `scripts/capacity-scan.py` — keep their
> headers intact. Evaluated by `/capacity-review`. See ADR-012.
>
> _Status: skeleton. Replace example rows with real hardware once the cluster is
> stood up (STATUS.md tracks real-vs-planned)._
## 1. Physical compute
### pve0
- **Model / form factor:** _TBD (e.g. Minisforum MS-01, mini-PC)_
- **CPU:** _TBD (e.g. i9-13900H, 14C/20T)_
- **RAM:** _TBD total; max _; free DIMM slots _
- **Storage:** _TBD (disks → pools, e.g. 2× 2 TB NVMe → `local-zfs`)_
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_
_(repeat for pve1, pve2, askari)_
## 2. Network gear
| device | model | ports | poe | throughput | uplinks | notes |
|----------|-------|-------|-----|------------|---------|-------|
| opnsense | _TBD_ | _TBD_ | n/a | _TBD_ | WAN+LAN | dedicated hardware |
| switch | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | managed, 802.1q |
| ap1 | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | multi-SSID per VLAN |
## 3. Workload placement & intent
The numeric columns (`cores, ram_mb, disk_gb`) feed `capacity-scan.py`; the
free-text columns feed `/capacity-review`'s judgement.
| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny/steady | anti-affinity: dns2 on a different node | flat |
| dns2 | pve1 | 1 | 512 | 10 | high | pair/dns1 | tiny/steady | anti-affinity: dns1 on a different node | flat |
## 4. Node capacity (machine-readable)
Physical totals per node. `cores` is an integer; `ram_gb` and `disk_gb` may be decimals.
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
## 5. Capacity notes
Free-text running notes for the evaluator (trends, planned moves, upgrade ideas).


@ -0,0 +1,753 @@
# Hardware Reference & Capacity Evaluation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a hand-maintained hardware reference doc, a stdlib-only `capacity-scan.py` that emits deterministic capacity facts, and an on-demand `/capacity-review` skill that reasons about HA / right-sizing / placement / upgrade timing.
**Architecture:** `docs/hardware/reference.md` is the single machine-readable source of truth (physical node capacities + workload allocations + placement intent). `scripts/capacity-scan.py` parses its tables, computes per-node allocated-vs-physical rollups, and cross-checks workload hostnames against `terraform output -json` / `ansible-inventory --list` to surface drift, degrading gracefully when nothing is provisioned. `/capacity-review` runs the scan, reads the intent columns, and writes a dated report to `docs/hardware/reviews/`. Live usage stats are a stubbed future hook. Mirrors the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad.
**Tech Stack:** Python 3 standard library only (no third-party imports in the script); `pytest` for unit tests (already in `requirements.txt`); markdown docs; a `.claude` slash-command skill.
---
## Spec
Design spec: `docs/superpowers/specs/2026-06-01-hardware-capacity-design.md`.
**Refinement vs spec:** the spec said allocations come from Terraform. The current
`terraform output "vms"` only exposes `{ip, group}`, not cores/RAM/disk, so numeric
allocations are read from `reference.md` instead; Terraform/inventory are used only
for hostname-drift cross-checks. This better honors the "self-contained markdown
source of truth" decision and needs no Terraform module changes.
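
A hedged sketch of the hostname-only cross-check this refinement implies. `terraform output -json` wraps each output in an object with a `value` field; the assumption here is that the `vms` value is a map of hostname to `{ip, group}`. Both helper names are hypothetical and are not the script's actual functions:

```python
import json


def hostnames_from_tf_output(tf_json_text):
    """Extract VM hostnames from `terraform output -json` text.

    Assumes the `vms` output value is a map of hostname -> {ip, group}.
    """
    outputs = json.loads(tf_json_text)
    vms = outputs.get("vms", {}).get("value", {})
    return set(vms)


def hostname_drift(workload_names, known_hostnames):
    """Return workloads declared in the reference that provisioning does not know."""
    return sorted(set(workload_names) - set(known_hostnames))


# Made-up sample; real output also carries a `type` field per output.
tf_text = json.dumps({
    "vms": {
        "sensitive": False,
        "value": {"dns1": {"ip": "10.0.0.53", "group": "dns"}},
    }
})
known = hostnames_from_tf_output(tf_text)
print(hostname_drift(["dns1", "dns2"], known))  # ['dns2']
```

Because only names are compared, cores, RAM and disk never need to appear in the Terraform outputs.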
## File Structure
- `docs/hardware/reference.md`: **create.** Source of truth. Human sections
  (physical compute, network gear) + two machine-readable tables (node capacity,
  workload placement) the script parses.
- `scripts/capacity-scan.py`: **create.** Stdlib-only. Pure parse/math functions
  + thin subprocess glue + `main()` emitting JSON to stdout.
- `tests/test_capacity_scan.py`: **create.** Pytest unit tests for the pure
  functions + a smoke test against the real `reference.md`.
- `.claude/commands/capacity-review.md`: **create.** The `/capacity-review` skill.
- `docs/hardware/reviews/.gitkeep`: **create.** Report output dir.
- `docs/decisions/012-hardware-capacity.md`: **create.** ADR recording the decision.
- `STATUS.md`: **modify.** Add real-vs-planned rows.
- `CLAUDE.md`: **modify.** Commands-table row + Further-reading pointer.
- `scripts/README.md`: **modify.** Document `capacity-scan.py`.
### Machine-readable table contract (used by Task 1 and the parser)
`reference.md` must contain these two tables verbatim in header shape. The parser
keys on header names, so column order is flexible and extra free-text columns are
ignored.
**Node capacity** — header contains `node, cores, ram_gb, disk_gb` (integers/floats):
```
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
```
**Workload placement** — header contains the numeric columns `workload, node,
cores, ram_mb, disk_gb` plus any free-text intent columns:
```
| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny | anti-affinity: dns2 elsewhere | flat |
```
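
To make the contract concrete, here is a small sketch of a header check that is independent of column order and tolerant of extra columns. `missing_columns` is a hypothetical helper for illustration, not a function of `capacity-scan.py`:

```python
def missing_columns(markdown, required_cols):
    """Return the required columns absent from the best-matching table header."""
    required = set(required_cols)
    best_missing = required
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = {cell.strip() for cell in line.strip("|").split("|")}
        missing = required - headers
        if len(missing) < len(best_missing):
            best_missing = missing
    return sorted(best_missing)


NODE_COLS = ["node", "cores", "ram_gb", "disk_gb"]

# Columns reordered and an extra free-text column added: still satisfies the contract.
reordered = """
| ram_gb | node | disk_gb | cores | notes |
|--------|------|---------|-------|-------|
| 64     | pve0 | 4000    | 20    | spare |
"""
# A renamed header breaks the contract.
renamed = """
| node | cores | ram | disk_gb |
|------|-------|-----|---------|
| pve0 | 20    | 64  | 4000    |
"""
print(missing_columns(reordered, NODE_COLS))  # []
print(missing_columns(renamed, NODE_COLS))    # ['ram_gb']
```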
---
## Task 1: Reference doc skeleton
**Files:**
- Create: `docs/hardware/reference.md`
- Create: `docs/hardware/reviews/.gitkeep`
- [ ] **Step 1: Write `docs/hardware/reference.md`**
```markdown
# Hardware reference — boma
> Hand-maintained source of truth for **physical** compute + network gear and
> **workload placement intent**. The two machine-readable tables (Node capacity,
> Workload placement) are parsed by `scripts/capacity-scan.py` — keep their
> headers intact. Evaluated by `/capacity-review`. See ADR-012.
>
> _Status: skeleton. Replace example rows with real hardware once the cluster is
> stood up (STATUS.md tracks real-vs-planned)._
## 1. Physical compute
### pve0
- **Model / form factor:** _TBD (e.g. Minisforum MS-01, mini-PC)_
- **CPU:** _TBD (e.g. i9-13900H, 14C/20T)_
- **RAM:** _TBD total; max _; free DIMM slots _
- **Storage:** _TBD (disks → pools, e.g. 2× 2 TB NVMe → `local-zfs`)_
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_
_(repeat for pve1, pve2, askari)_
## 2. Network gear
| device | model | ports | poe | throughput | uplinks | notes |
|----------|-------|-------|-----|------------|---------|-------|
| opnsense | _TBD_ | _TBD_ | n/a | _TBD_ | WAN+LAN | dedicated hardware |
| switch | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | managed, 802.1q |
| ap1 | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | multi-SSID per VLAN |
## 3. Workload placement & intent
The numeric columns (`cores, ram_mb, disk_gb`) feed `capacity-scan.py`; the
free-text columns feed `/capacity-review`'s judgement.
| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny/steady | anti-affinity: dns2 on a different node | flat |
| dns2 | pve1 | 1 | 512 | 10 | high | pair/dns1 | tiny/steady | anti-affinity: dns1 on a different node | flat |
## 4. Node capacity (machine-readable)
Physical totals per node. `cores` is an integer; `ram_gb` and `disk_gb` may be decimals.
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
## 5. Capacity notes
Free-text running notes for the evaluator (trends, planned moves, upgrade ideas).
```
- [ ] **Step 2: Create the reports directory**
Run: `mkdir -p docs/hardware/reviews && touch docs/hardware/reviews/.gitkeep`
Expected: both paths exist.
- [ ] **Step 3: Verify the machine-readable headers match the contract**
Run: `grep -n '| node | cores | ram_gb | disk_gb |' docs/hardware/reference.md && grep -n '| workload | node | cores | ram_mb | disk_gb |' docs/hardware/reference.md`
Expected: each grep prints one matching line (the table headers the parser keys on).
- [ ] **Step 4: Commit**
```bash
git add docs/hardware/reference.md docs/hardware/reviews/.gitkeep
git commit -m "Add hardware reference doc skeleton + reviews dir"
```
---
## Task 2: Scan script — `parse_table()`
**Files:**
- Create: `scripts/capacity-scan.py`
- Create: `tests/test_capacity_scan.py`
- [ ] **Step 1: Write the failing test**
Create `tests/test_capacity_scan.py`:
```python
import importlib.util
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "capacity-scan.py"
_spec = importlib.util.spec_from_file_location("capacity_scan", _PATH)
cs = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(cs)


def test_parse_table_keys_on_header_and_ignores_extra_cols():
    md = """
intro text
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
trailing text
"""
    rows = cs.parse_table(md, ["node", "cores", "ram_gb", "disk_gb"])
    assert rows == [
        {"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"},
        {"node": "pve1", "cores": "20", "ram_gb": "64", "disk_gb": "4000"},
    ]


def test_parse_table_returns_empty_when_header_absent():
    assert cs.parse_table("no tables here", ["node", "cores"]) == []
```
- [ ] **Step 2: Run test to verify it fails**
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: FAIL with `FileNotFoundError` during collection (the script does not exist yet), or `AttributeError` if the file exists but `parse_table` is not defined.
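
Because the test loads the script by path (its name contains a hyphen, so it cannot be imported by name), the failure before the script exists surfaces when the loader tries to read the file. A small sketch of that behaviour, using a temporary path as a stand-in for the missing script:

```python
import importlib.util
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    missing = pathlib.Path(tmp) / "capacity-scan.py"  # deliberately never created
    # Building the spec does not touch the file system, so this succeeds.
    spec = importlib.util.spec_from_file_location("capacity_scan", str(missing))
    module = importlib.util.module_from_spec(spec)
    try:
        # Executing the module is what actually opens the file.
        spec.loader.exec_module(module)
        outcome = "loaded"
    except FileNotFoundError:
        outcome = "FileNotFoundError"

print(outcome)
```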
- [ ] **Step 3: Write minimal implementation**
Create `scripts/capacity-scan.py`:
```python
#!/usr/bin/env python3
"""capacity-scan.py — deterministic capacity facts for /capacity-review.
Python standard library only. Emits a JSON object to stdout.
Reads physical capacities and workload allocations from the machine-readable
tables in docs/hardware/reference.md, computes per-node allocated-vs-physical
rollups, and cross-checks workload hostnames against `terraform output -json`
and `ansible-inventory --list` to surface drift. Degrades gracefully when
nothing is provisioned. Live usage stats are a documented future hook.
Usage: python3 scripts/capacity-scan.py [--env staging] [--reference PATH]
"""
import argparse
import json
import os
import subprocess
import sys

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


def parse_table(markdown, required_cols):
    """Return row dicts for the first markdown table whose header contains all
    required_cols. Keys are header names; values are raw cell strings."""
    lines = markdown.splitlines()
    required = set(required_cols)
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = [c.strip() for c in line.strip("|").split("|")]
        if not required.issubset(set(headers)):
            continue
        rows = []
        for body in lines[i + 2:]:
            if not body.strip().startswith("|"):
                break
            cells = [c.strip() for c in body.strip().strip("|").split("|")]
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []
```
- [ ] **Step 4: Run test to verify it passes**
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: PASS (2 passed).
- [ ] **Step 5: Commit**
```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add capacity-scan.py with parse_table()"
```
---
## Task 3: Rollup math — `compute_rollup()`
**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`
- [ ] **Step 1: Write the failing test (append to `tests/test_capacity_scan.py`)**
```python
def test_compute_rollup_sums_allocations_and_flags_headroom():
    node_rows = [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}]
    workload_rows = [
        {"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"},
        {"workload": "forgejo", "node": "pve0", "cores": "4", "ram_mb": "8192", "disk_gb": "100"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    pve0 = nodes["pve0"]
    assert pve0["alloc_cores"] == 5
    assert pve0["alloc_ram_gb"] == 8.5  # (512 + 8192) / 1024
    assert pve0["alloc_disk_gb"] == 110.0
    assert pve0["ram_headroom_pct"] == 87  # round(100 * (64 - 8.5) / 64)
    assert pve0["oversubscribed"] is False


def test_compute_rollup_flags_oversubscription():
    node_rows = [{"node": "tiny", "cores": "2", "ram_gb": "4", "disk_gb": "50"}]
    workload_rows = [
        {"workload": "hog", "node": "tiny", "cores": "4", "ram_mb": "1024", "disk_gb": "10"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    assert nodes["tiny"]["oversubscribed"] is True  # 4 cores > 2


def test_compute_rollup_ignores_workloads_on_unknown_nodes():
    nodes = cs.compute_rollup(
        [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}],
        [{"workload": "ghost", "node": "nope", "cores": "1", "ram_mb": "512", "disk_gb": "10"}],
    )
    assert nodes["pve0"]["alloc_cores"] == 0
```
- [ ] **Step 2: Run test to verify it fails**
Run: `python3 -m pytest tests/test_capacity_scan.py -k compute_rollup -v`
Expected: FAIL — `AttributeError: module 'capacity_scan' has no attribute 'compute_rollup'`.
- [ ] **Step 3: Write minimal implementation (append to `scripts/capacity-scan.py`, before any `main`)**
```python
def compute_rollup(node_rows, workload_rows):
    """Per node: physical totals, summed allocations, RAM headroom %, and an
    oversubscribed flag. Workloads on unknown nodes are ignored."""
    nodes = {}
    for r in node_rows:
        nodes[r["node"]] = {
            "cores": int(r["cores"]),
            "ram_gb": float(r["ram_gb"]),
            "disk_gb": float(r["disk_gb"]),
            "alloc_cores": 0,
            "alloc_ram_mb": 0,
            "alloc_disk_gb": 0.0,
        }
    for w in workload_rows:
        node = nodes.get(w["node"])
        if node is None:
            continue
        node["alloc_cores"] += int(w["cores"])
        node["alloc_ram_mb"] += int(w["ram_mb"])
        node["alloc_disk_gb"] += float(w["disk_gb"])
    for node in nodes.values():
        node["alloc_ram_gb"] = round(node.pop("alloc_ram_mb") / 1024, 1)
        node["ram_headroom_pct"] = (
            round(100 * (node["ram_gb"] - node["alloc_ram_gb"]) / node["ram_gb"])
            if node["ram_gb"]
            else 0
        )
        node["oversubscribed"] = (
            node["alloc_cores"] > node["cores"]
            or node["alloc_ram_gb"] > node["ram_gb"]
            or node["alloc_disk_gb"] > node["disk_gb"]
        )
    return nodes
```
- [ ] **Step 4: Run test to verify it passes**
Run: `python3 -m pytest tests/test_capacity_scan.py -k compute_rollup -v`
Expected: PASS (3 passed).
- [ ] **Step 5: Commit**
```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add compute_rollup() to capacity-scan.py"
```
---
## Task 4: Drift detection — `find_drift()` + hostname parsers
**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`
- [ ] **Step 1: Write the failing test (append)**
```python
def test_parse_tf_hostnames_reads_vms_value_keys():
    tf_json = '{"vms": {"value": {"dns1": {"ip": "10.20.0.10", "group": "docker_hosts"}}}}'
    assert cs.parse_tf_hostnames(tf_json) == {"dns1"}


def test_parse_inventory_hostnames_reads_meta_hostvars():
    inv_json = '{"_meta": {"hostvars": {"dns1": {}, "proxy": {}}}}'
    assert cs.parse_inventory_hostnames(inv_json) == {"dns1", "proxy"}


def test_find_drift_reports_both_directions():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    warnings = cs.find_drift(workload_rows, {"proxy"})
    assert any("dns1" in w and "no Terraform" in w for w in warnings)
    assert any("proxy" in w and "absent from reference.md" in w for w in warnings)


def test_find_drift_silent_when_no_hostnames_known():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    assert cs.find_drift(workload_rows, set()) == []
```
- [ ] **Step 2: Run test to verify it fails**
Run: `python3 -m pytest tests/test_capacity_scan.py -k "drift or hostnames" -v`
Expected: FAIL — attributes `parse_tf_hostnames` / `parse_inventory_hostnames` / `find_drift` not defined.
- [ ] **Step 3: Write minimal implementation (append)**
```python
def parse_tf_hostnames(tf_json):
    """Hostnames from `terraform output -json` (the `vms` map keys)."""
    data = json.loads(tf_json)
    return set(data.get("vms", {}).get("value", {}).keys())


def parse_inventory_hostnames(inv_json):
    """Hostnames from `ansible-inventory --list` (_meta.hostvars keys)."""
    data = json.loads(inv_json)
    return set(data.get("_meta", {}).get("hostvars", {}).keys())


def find_drift(workload_rows, known_hostnames):
    """Warn when reference.md workloads and live hostnames disagree. Silent when
    no hostnames are known (pre-provisioning) — nothing to compare against."""
    warnings = []
    declared = {w["workload"] for w in workload_rows}
    if not known_hostnames:
        return warnings
    for name in sorted(declared - known_hostnames):
        warnings.append(
            f"reference.md lists '{name}' but no Terraform/inventory host declares it"
        )
    for name in sorted(known_hostnames - declared):
        warnings.append(
            f"host '{name}' exists in Terraform/inventory but is absent from reference.md"
        )
    return warnings
```
- [ ] **Step 4: Run test to verify it passes**
Run: `python3 -m pytest tests/test_capacity_scan.py -k "drift or hostnames" -v`
Expected: PASS (4 passed).
- [ ] **Step 5: Commit**
```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Add hostname parsers + find_drift() to capacity-scan.py"
```
---
## Task 5: Subprocess glue + usage stub + `main()`
**Files:**
- Modify: `scripts/capacity-scan.py`
- Modify: `tests/test_capacity_scan.py`
- [ ] **Step 1: Write the failing test (append)**
```python
import json as _json


def test_gather_usage_is_stubbed_unavailable():
    usage = cs.gather_usage()
    assert usage["available"] is False
    assert "reason" in usage


def test_known_hostnames_degrades_to_empty(monkeypatch):
    # Simulate terraform/ansible-inventory being absent or failing.
    def boom(*a, **k):
        raise FileNotFoundError("no such tool")

    monkeypatch.setattr(cs.subprocess, "run", boom)
    assert cs.known_hostnames("staging") == set()


def test_main_emits_valid_json_against_real_reference(monkeypatch, capsys):
    # Isolate from the host: no real terraform/ansible needed.
    monkeypatch.setattr(cs, "known_hostnames", lambda env: set())
    monkeypatch.setattr("sys.argv", ["capacity-scan.py"])
    cs.main()
    out = _json.loads(capsys.readouterr().out)
    assert set(out) == {"nodes", "workloads", "usage", "warnings"}
    assert out["usage"]["available"] is False
    assert "pve0" in out["nodes"]  # from the skeleton reference.md (Task 1)
```
- [ ] **Step 2: Run test to verify it fails**
Run: `python3 -m pytest tests/test_capacity_scan.py -k "usage or known_hostnames or main" -v`
Expected: FAIL — `gather_usage` / `known_hostnames` / `main` not defined.
- [ ] **Step 3: Write minimal implementation (append)**
```python
def gather_usage():
    """FUTURE: live per-VM CPU/RAM/disk usage history. Requires the physical
    cluster online; source UNDECIDED (Proxmox RRD vs Prometheus/Loki/Grafana —
    see docs/TODO.md 8.4). Until then the evaluator reasons on declared intent."""
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}


def _run_json(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def known_hostnames(env):
    """Union of hostnames from Terraform output and Ansible inventory. Each
    source is best-effort: missing tool / no state / bad JSON yields nothing."""
    hosts = set()
    tf_dir = os.path.join(REPO_ROOT, "terraform", "environments", env)
    try:
        hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
    except Exception:
        pass
    inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
    try:
        hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
    except Exception:
        pass
    return hosts


def main():
    parser = argparse.ArgumentParser(description="Deterministic capacity facts for /capacity-review.")
    parser.add_argument("--env", default="staging")
    parser.add_argument(
        "--reference",
        default=os.path.join(REPO_ROOT, "docs", "hardware", "reference.md"),
    )
    args = parser.parse_args()
    with open(args.reference, encoding="utf-8") as fh:
        markdown = fh.read()
    node_rows = parse_table(markdown, ["node", "cores", "ram_gb", "disk_gb"])
    workload_rows = parse_table(markdown, ["workload", "node", "cores", "ram_mb", "disk_gb"])
    nodes = compute_rollup(node_rows, workload_rows)
    warnings = find_drift(workload_rows, known_hostnames(args.env))
    json.dump(
        {"nodes": nodes, "workloads": workload_rows, "usage": gather_usage(), "warnings": warnings},
        sys.stdout,
        indent=2,
        sort_keys=True,
    )
    sys.stdout.write("\n")


if __name__ == "__main__":
    main()
```
- [ ] **Step 4: Run the full test file**
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: PASS (all tests).
- [ ] **Step 5: Smoke-run the script end to end**
Run: `python3 scripts/capacity-scan.py | python3 -m json.tool`
Expected: valid JSON with `nodes.pve0`, a `workloads` list, `usage.available: false`, and a `warnings` array (likely empty with no Terraform state).
- [ ] **Step 6: Commit**
```bash
git add scripts/capacity-scan.py tests/test_capacity_scan.py
git commit -m "Complete capacity-scan.py: usage stub, subprocess glue, main()"
```
---
## Task 6: The `/capacity-review` skill
**Files:**
- Create: `.claude/commands/capacity-review.md`
- [ ] **Step 1: Confirm the existing command pattern**
Run: `ls .claude/commands/ && sed -n '1,20p' .claude/commands/review-repo.md`
Expected: lists existing commands; shows the frontmatter/structure to mirror.
- [ ] **Step 2: Write `.claude/commands/capacity-review.md`**
Mirror the frontmatter style of `review-repo.md` (adjust `description`/`allowed-tools` to match that file's actual keys). Body:
```markdown
---
description: Evaluate hardware capacity and placement; recommend optimizations
---
# /capacity-review
Evaluate the homelab's hardware capacity and workload placement, and recommend
optimizations. On-demand only (scheduling is deferred — see docs/TODO.md 8.4).
## Steps
1. **Gather facts.** Run `python3 scripts/capacity-scan.py` and parse its JSON
   (`nodes`, `workloads`, `usage`, `warnings`). If `usage.available` is false,
   note in the report that recommendations are **intent-based, not usage-based**.
2. **Read intent.** Read `docs/hardware/reference.md` for the free-text columns
   the scan does not parse: `criticality`, `ha_intent`, `profile`, `constraints`,
   `growth`, plus the "Capacity notes" section.
3. **Reason across dimensions.** Produce recommendations, each tagged with its
   type and the basis it rests on (declared intent vs measured usage):
   - **HA / redundancy** — anti-affinity violations (e.g. an HA pair sharing one
     node), single points of failure, HA that looks like overkill, or a
     high-criticality workload with no redundancy.
   - **Right-sizing** — over/under-provisioned workloads. Today this is
     intent-based (allocation vs `profile`); flag that it becomes usage-based
     once the `gather_usage()` hook is live.
   - **Placement / moves** — oversubscribed nodes (`oversubscribed: true`, low
     `ram_headroom_pct`) or constraint-driven relocations.
   - **Upgrade timing** — `growth` notes vs headroom → rough runway.
   - **Drift** — surface every entry in the scan's `warnings` array.
4. **Write the report.** Save to `docs/hardware/reviews/YYYY-MM-DD-capacity.md`
   and copy it to `docs/hardware/reviews/latest.md`. Structure: a one-line
   summary, then a section per dimension with concrete, actionable items. State
   the basis (intent vs usage) on every recommendation.
```
- [ ] **Step 3: Verify the file is well-formed**
Run: `head -5 .claude/commands/capacity-review.md`
Expected: frontmatter block present and consistent with `review-repo.md`'s keys.
- [ ] **Step 4: Commit**
```bash
git add .claude/commands/capacity-review.md
git commit -m "Add /capacity-review skill"
```
---
## Task 7: ADR-012, STATUS, CLAUDE.md, scripts/README
**Files:**
- Create: `docs/decisions/012-hardware-capacity.md`
- Modify: `STATUS.md`
- Modify: `CLAUDE.md`
- Modify: `scripts/README.md`
- [ ] **Step 1: Write `docs/decisions/012-hardware-capacity.md`**
Match the heading style of an existing ADR (`sed -n '1,15p' docs/decisions/010-forgejo-ci.md` first). Content:
```markdown
# ADR-012 — Hardware reference & capacity evaluation
## Context
The repo modelled the logical/network layer (Terraform VM specs, ADR-007
topology) but not the physical layer — node CPU/RAM/disk capacity, network gear,
or which workloads are designed to run where with what headroom. There was also
no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a
workload that should move, or a node due an upgrade.
## Decision
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
physical compute + network gear and workload placement intent. Two
machine-readable tables (node capacity, workload placement) carry the numbers.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
parses those tables, computes per-node allocated-vs-physical rollups, and
cross-checks workload hostnames against `terraform output -json` /
`ansible-inventory --list` to surface drift.
- `/capacity-review` reads the scan + intent columns and writes a dated report to
`docs/hardware/reviews/`, mirroring `/review-repo` → `docs/reviews/`.
- Numeric allocations live in `reference.md`, not Terraform: the current
`terraform output` exposes only `{ip, group}`. Terraform/inventory are used
only for hostname-drift cross-checks.
- **Live usage stats are a future hook.** The cluster is not stood up;
`gather_usage()` returns `available: false` and the evaluator reasons on
declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/
Alloy) is undecided — see docs/TODO.md 8.4, to be settled before any hook is
built.
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF↔Ansible handoff).
```
- [ ] **Step 2: Add STATUS.md rows**
In `STATUS.md`, add to the "Real and working today" table:
```markdown
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
```
And to the "Designed but not built" table:
```markdown
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
```
- [ ] **Step 3: Add the CLAUDE.md command row + further-reading pointer**
In `CLAUDE.md` "Key commands" table, add:
```markdown
| Review hardware capacity | `/capacity-review` (Claude command) |
```
In the "Further reading" table, add:
```markdown
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
```
- [ ] **Step 4: Document the script in scripts/README.md**
Add under the existing list in `scripts/README.md`:
```markdown
- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses
the machine-readable tables in `docs/hardware/reference.md`, computes per-node
allocated-vs-physical rollups, and cross-checks workload hostnames against
Terraform output / Ansible inventory for drift. Emits JSON. See **ADR-012**.
```
- [ ] **Step 5: Verify references resolve**
Run: `python3 scripts/repo-scan.py | python3 -c "import json,sys; d=json.load(sys.stdin); print('broken_refs:', [f for f in d.get('findings',{}).get('broken_refs',[]) if '012' in str(f) or 'hardware' in str(f)])"`
Expected: no broken refs mentioning ADR-012 or the hardware paths (empty list). If the scan's JSON shape differs, instead run `python3 scripts/repo-scan.py >/dev/null && echo OK` and eyeball the findings.
- [ ] **Step 6: Commit**
```bash
git add docs/decisions/012-hardware-capacity.md STATUS.md CLAUDE.md scripts/README.md
git commit -m "Record ADR-012 + STATUS/CLAUDE/scripts docs for capacity tooling"
```
---
## Task 8: Final verification
**Files:** none (verification only)
- [ ] **Step 1: Run the full unit-test suite**
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: all tests pass.
- [ ] **Step 2: Run the lint suite**
Run: `make lint`
Expected: passes (markdown/script changes do not break ansible-lint/yamllint).
- [ ] **Step 3: End-to-end scan**
Run: `python3 scripts/capacity-scan.py`
Expected: valid JSON; `nodes.pve0` present; `usage.available: false`.
- [ ] **Step 4: Confirm working tree is clean**
Run: `git status --short`
Expected: no uncommitted changes from this plan (pre-existing FRICTION.md / ADR-011 may remain — leave them).
```


@ -0,0 +1,168 @@
# Design — Hardware reference & capacity evaluation
_Date: 2026-06-01 · Status: approved for planning_
## Problem
The repo documents the **logical/network** layer well — Terraform declares per-VM
`cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology.
But the **physical** layer is undocumented: how many Proxmox nodes physically
exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and
`askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and
nothing evaluates whether the design is well-proportioned — e.g. a service that
needn't be HA, a workload that should move nodes, or a node due a RAM/disk
upgrade.
## Goal
1. A single, human-first **hardware reference document** capturing physical
compute + network gear and the intended workload placement.
2. A **capacity evaluator** ("script + skill") that reasons about optimization:
HA overkill / missing redundancy, right-sizing, placement moves, and
upgrade timing — emitting a dated report.
## Scope
- **In:** Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense,
managed switch, APs); per-workload placement intent.
- **Out (for now):** power/UPS budget, NAS, cabling, rack layout, asset
register, warranty/serial tracking.
## Non-negotiable repo conventions this must honor
- Mirror the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad
(deterministic scan feeds a judgement skill; report is dated markdown).
- Utility scripts are **stdlib-only** for run-anywhere portability (control
node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
- Be honest about real-vs-planned (STATUS.md). The physical cluster is **not
stood up yet**, so live usage stats are a documented future hook, not a
current capability.
## Architecture
Four pieces, plus tracking updates.
### 1. Reference doc — `docs/hardware/reference.md`
One hand-maintained markdown file, the source of truth for physical facts and
placement intent. Four parts:
1. **Physical compute** — one subsection per node (`pve0..2`, `askari`):
model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots),
storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes.
2. **Network gear** — OPNsense box, managed switch, APs: model, port/PoE
counts, throughput, uplinks. Short table.
3. **Workload placement & intent** — one row per planned VM/service, columns:
`Service | Home node | Criticality | HA intent | Resource profile |
Placement constraints | Growth notes`. These columns map onto the four
attribute groups chosen during brainstorming and give the evaluator concrete
intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different
nodes).
4. **Capacity summary** — per-node "allocated vs physical" rollup (RAM / cores /
disk, headroom %).
Node-capacity tables use a **strict, documented format** so the scan script can
parse the numbers without a YAML dependency.
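Because the table headers act as a contract between the document and the script, a small pre-flight check can make a broken contract obvious. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, not part of the planned script, and the column names are assumed to match the ones the scan expects.

```python
def missing_columns(markdown, required_cols):
    """Return the required columns absent from the best-matching table header.

    An empty list means at least one pipe-delimited line carries every
    required column. Hypothetical helper, for illustration only.
    """
    required = set(required_cols)
    best = required  # worst case: no table line matches anything
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = {c.strip() for c in line.strip("|").split("|")}
        missing = required - headers
        if len(missing) < len(best):
            best = missing
    return sorted(best)


good = "| node | cores | ram_gb | disk_gb |\n|---|---|---|---|\n| pve0 | 20 | 64 | 4000 |"
renamed = "| node | cores | memory |\n|---|---|---|\n| pve0 | 20 | 64 |"

print(missing_columns(good, ["node", "cores", "ram_gb", "disk_gb"]))
print(missing_columns(renamed, ["node", "cores", "ram_gb", "disk_gb"]))
```

A check like this reports which header was renamed instead of letting the scan silently return no rows.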
### 2. Scan script — `scripts/capacity-scan.py`
Stdlib-only, deterministic, JSON to stdout (like `repo-scan.py`). Avoids
hand-parsing YAML by shelling out for JSON, the pattern `tf_to_inventory.py`
already uses.
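A minimal sketch of the shell-out-for-JSON pattern, assuming only the standard library. `run_json_or_none` is a hypothetical name, and the commands below are stand-ins (the Python interpreter itself and a deliberately nonexistent tool) so the example runs without Terraform or Ansible installed.

```python
import json
import subprocess
import sys


def run_json_or_none(cmd):
    """Run cmd and parse its stdout as JSON.

    Returns None when the tool is missing, exits non-zero, or prints
    something that is not JSON, so callers can degrade gracefully.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(done.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


# Stand-in for a tool that prints JSON.
ok = run_json_or_none([sys.executable, "-c", "print('{\"vms\": {}}')"])
# Stand-in for a tool that is not installed.
absent = run_json_or_none(["definitely-not-a-real-tool-xyz", "output", "-json"])
# Stand-in for a tool that fails (for example, no state yet).
failed = run_json_or_none([sys.executable, "-c", "import sys; sys.exit(3)"])

print(ok, absent, failed)
```

Catching a narrow set of exceptions keeps genuine programming errors visible while still treating an unprovisioned environment as "no data".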
Gathers **today**:
- **Declared allocations** — `terraform output -json` (and/or the `.tf` module
calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no
real VMs yet (current reality) instead of failing.
- **Inventory hosts** — `ansible-inventory -i inventories/<env>/hosts.yml
--list` → JSON.
- **Physical capacities** — parses the strict node tables in `reference.md`.
- **Rollup math** — per node: allocated vs physical, headroom %,
`oversubscribed` flag.
- **Drift warnings** — e.g. `reference.md` lists a host no Terraform VM
declares; surfaced in a `warnings[]` array (free doc↔Terraform drift check).
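The rollup arithmetic can be worked through on one node. This is a sketch under assumed inputs (a 20-core, 64 GB, 4000 GB node with 5 cores, 8.5 GB and 110 GB allocated); `headroom_pct` and `is_oversubscribed` are illustrative names, not necessarily those of the final script.

```python
def headroom_pct(physical, allocated):
    """Percentage of a resource still unallocated, rounded to an integer."""
    if not physical:
        return 0  # avoid dividing by zero on an unfilled skeleton row
    return round(100 * (physical - allocated) / physical)


def is_oversubscribed(physical, allocated):
    """True when any allocated resource exceeds its physical total."""
    return any(allocated[key] > physical[key] for key in physical)


physical = {"cores": 20, "ram_gb": 64.0, "disk_gb": 4000.0}
allocated = {"cores": 5, "ram_gb": 8.5, "disk_gb": 110.0}

# 100 * (64 - 8.5) / 64 = 86.71875, which rounds to 87.
print(headroom_pct(physical["ram_gb"], allocated["ram_gb"]))
print(is_oversubscribed(physical, allocated))
```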
**Stubbed future hook** (honest, à la STATUS.md):
```python
# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
def gather_usage():
return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}
```
Output sketch:
```json
{
"nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
"workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
"usage": {"available": false, "reason": "cluster not provisioned"},
"warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
}
```
### 3. Evaluator skill — `/capacity-review`
A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow:
1. Run `python3 scripts/capacity-scan.py` → JSON.
2. Read `docs/hardware/reference.md` for intent columns the math can't capture.
3. Reason across dimensions, each recommendation **tagged by type** and stating
   **what it is based on** (declared intent vs measured usage):
   - **HA / redundancy** — anti-affinity violations, SPOFs, HA-overkill,
     critical-but-unredundant services.
   - **Right-sizing** — over/under-provisioned VMs. *Intent-based today*;
     explicitly upgradeable to usage-based once the usage hook is live.
   - **Placement / moves** — oversubscribed nodes, constraint-driven relocation.
   - **Upgrade timing** — growth notes vs headroom → rough runway.
   - **Drift** — surfaces the scan's `warnings[]`.
4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`),
   mirroring `docs/reviews/`.
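The anti-affinity check in the HA / redundancy dimension is the most mechanical of these judgements, so it can be sketched in code. This is only an illustration: the `ha_group` field is an assumption (the reference doc records HA intent as free text), and `anti_affinity_violations` is a hypothetical name.

```python
from collections import defaultdict


def anti_affinity_violations(workloads):
    """Report HA groups that have more than one member on the same node.

    Each workload is a dict with "workload", "node" and an optional
    "ha_group" label (assumed field, not part of the scan output).
    """
    members = defaultdict(list)
    for w in workloads:
        group = w.get("ha_group")
        if group:
            members[(group, w["node"])].append(w["workload"])
    return [
        f"{group}: {', '.join(sorted(names))} share node {node}"
        for (group, node), names in sorted(members.items())
        if len(names) > 1
    ]


colocated = [
    {"workload": "dns1", "node": "pve0", "ha_group": "dns"},
    {"workload": "dns2", "node": "pve0", "ha_group": "dns"},
    {"workload": "forgejo", "node": "pve0"},
]
spread = [
    {"workload": "dns1", "node": "pve0", "ha_group": "dns"},
    {"workload": "dns2", "node": "pve1", "ha_group": "dns"},
]

print(anti_affinity_violations(colocated))
print(anti_affinity_violations(spread))
```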
### 4. Recording — ADR + STATUS + CLAUDE.md
- **ADR-012 — Hardware reference & capacity evaluation**
(`docs/decisions/012-hardware-capacity.md`): records the decision and
rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as
an open decision (below).
- **STATUS.md** rows: `reference.md` + `capacity-scan.py` → real/working
(skeleton); `/capacity-review` → working, intent-only; live usage → designed,
not built.
- **CLAUDE.md**: a "Review capacity/hardware → `/capacity-review`" commands-table
row + a "Further reading" pointer to ADR-012.
## Data flow
```
reference.md ──┐
               ├─→ capacity-scan.py ──→ scan JSON ──┐
terraform ─────┤   (stdlib, JSON-via-subprocess)    ├─→ /capacity-review ─→ docs/hardware/reviews/
inventory ─────┘                                    │      (judgement)
reference.md (intent columns) ──────────────────────┘
```
## Open decisions (deferred, tracked in TODO)
- **Usage-stats source** (TODO 8.4): **Proxmox RRD** (built-in, no extra infra)
vs the **Prometheus/Loki/Grafana/Grafana-Alloy** stack we will likely run
anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before
building any usage hook** to avoid throwaway work.
- **Script dependency policy** (TODO #14): whether stdlib-only remains the rule
for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
- **Scheduling** (TODO 8.4): `/capacity-review` is on-demand now; cron later.
## Deliverables & state at delivery
| Piece | Path | State |
|---|---|---|
| Reference doc | `docs/hardware/reference.md` | Skeleton + real node data |
| Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) |
| Evaluator skill | `/capacity-review` → `docs/hardware/reviews/` | Working, intent-based |
| Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR |
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |
## Out of scope / YAGNI
- No usage-stats collection until the cluster exists and the source is decided.
- No structured-data (YAML) source of truth — markdown is the single hand-edited
source by choice; revisit only if parsing pain demands it.
- No automated moves/remediation — the evaluator recommends; humans act.


@ -11,3 +11,7 @@ dependencies (keeps them runnable anywhere without a venv).
plaintext secrets.
- `repo-scan.py` — Phase-0 deterministic scan for `/review-repo` (markers, broken
refs, unencrypted vaults, inventory).
- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses
the machine-readable tables in `docs/hardware/reference.md`, computes per-node
allocated-vs-physical rollups, and cross-checks workload hostnames against
Terraform output / Ansible inventory for drift. Emits JSON. See **ADR-012**.

168
scripts/capacity-scan.py Normal file

@ -0,0 +1,168 @@
#!/usr/bin/env python3
"""capacity-scan.py — deterministic capacity facts for /capacity-review.

Python standard library only. Emits a JSON object to stdout.

Reads physical capacities and workload allocations from the machine-readable
tables in docs/hardware/reference.md, computes per-node allocated-vs-physical
rollups, and cross-checks workload hostnames against `terraform output -json`
and `ansible-inventory --list` to surface drift. Degrades gracefully when
nothing is provisioned. Live usage stats are a documented future hook.

Usage: python3 scripts/capacity-scan.py [--env staging] [--reference PATH]
"""
import argparse
import json
import os
import subprocess
import sys

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


def parse_table(markdown, required_cols):
    """Return row dicts for the first markdown table whose header contains all
    required_cols. Keys are header names; values are raw cell strings.
    Rows whose cell count does not match the header are skipped."""
    lines = markdown.splitlines()
    required = set(required_cols)
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = [c.strip() for c in line.strip("|").split("|")]
        if not required.issubset(set(headers)):
            continue
        rows = []
        # i + 2 skips the header's GFM separator row (|---|---|)
        for body in lines[i + 2:]:
            if not body.strip().startswith("|"):
                break
            cells = [c.strip() for c in body.strip().strip("|").split("|")]
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []


def compute_rollup(node_rows, workload_rows):
    """Per node: physical totals, summed allocations, RAM headroom %, and an
    oversubscribed flag. Workloads on unknown nodes are ignored."""
    nodes = {}
    for r in node_rows:
        nodes[r["node"]] = {
            "cores": int(r["cores"]),
            "ram_gb": float(r["ram_gb"]),
            "disk_gb": float(r["disk_gb"]),
            "alloc_cores": 0,
            "alloc_ram_mb": 0,
            "alloc_disk_gb": 0.0,
        }
    for w in workload_rows:
        node = nodes.get(w["node"])
        if node is None:
            continue
        node["alloc_cores"] += int(w["cores"])
        node["alloc_ram_mb"] += int(w["ram_mb"])
        node["alloc_disk_gb"] += float(w["disk_gb"])
    for node in nodes.values():
        node["alloc_ram_gb"] = round(node.pop("alloc_ram_mb") / 1024, 1)
        node["ram_headroom_pct"] = (
            round(100 * (node["ram_gb"] - node["alloc_ram_gb"]) / node["ram_gb"])
            if node["ram_gb"]
            else 0
        )
        node["oversubscribed"] = (
            node["alloc_cores"] > node["cores"]
            or node["alloc_ram_gb"] > node["ram_gb"]
            or node["alloc_disk_gb"] > node["disk_gb"]
        )
    return nodes


def parse_tf_hostnames(tf_json):
    """Hostnames from `terraform output -json` (the `vms` map keys)."""
    data = json.loads(tf_json)
    return set(data.get("vms", {}).get("value", {}).keys())


def parse_inventory_hostnames(inv_json):
    """Hostnames from `ansible-inventory --list` (_meta.hostvars keys)."""
    data = json.loads(inv_json)
    return set(data.get("_meta", {}).get("hostvars", {}).keys())


def find_drift(workload_rows, known_hostnames):
    """Warn when reference.md workloads and live hostnames disagree. Silent when
    no hostnames are known (pre-provisioning) — nothing to compare against."""
    warnings = []
    declared = {w["workload"] for w in workload_rows}
    if not known_hostnames:
        return warnings
    for name in sorted(declared - known_hostnames):
        warnings.append(
            f"reference.md lists '{name}' but no Terraform/inventory host declares it"
        )
    for name in sorted(known_hostnames - declared):
        warnings.append(
            f"host '{name}' exists in Terraform/inventory but is absent from reference.md"
        )
    return warnings


def gather_usage():
    """FUTURE: live per-VM CPU/RAM/disk usage history. Requires the physical
    cluster online; source UNDECIDED (Proxmox RRD vs Prometheus/Loki/Grafana —
    see docs/TODO.md 8.4). Until then the evaluator reasons on declared intent."""
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}


def _run_json(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def known_hostnames(env):
    """Union of hostnames from Terraform output and Ansible inventory. Each
    source is best-effort: missing tool / no state / bad JSON yields nothing."""
    hosts = set()
    tf_dir = os.path.join(REPO_ROOT, "terraform", "environments", env)
    try:
        hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
    except (OSError, subprocess.CalledProcessError, ValueError):
        pass
    inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
    try:
        hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
    except (OSError, subprocess.CalledProcessError, ValueError):
        pass
    return hosts


def main():
    parser = argparse.ArgumentParser(description="Deterministic capacity facts for /capacity-review.")
    parser.add_argument("--env", default="staging")
    parser.add_argument(
        "--reference",
        default=os.path.join(REPO_ROOT, "docs", "hardware", "reference.md"),
    )
    args = parser.parse_args()
    with open(args.reference, encoding="utf-8") as fh:
        markdown = fh.read()
    node_rows = parse_table(markdown, ["node", "cores", "ram_gb", "disk_gb"])
    workload_rows = parse_table(markdown, ["workload", "node", "cores", "ram_mb", "disk_gb"])
    nodes = compute_rollup(node_rows, workload_rows)
    warnings = find_drift(workload_rows, known_hostnames(args.env))
    json.dump(
        {"nodes": nodes, "workloads": workload_rows, "usage": gather_usage(), "warnings": warnings},
        sys.stdout,
        indent=2,
        sort_keys=True,
    )
    sys.stdout.write("\n")


if __name__ == "__main__":
    main()

109
tests/test_capacity_scan.py Normal file

@ -0,0 +1,109 @@
import importlib.util
import json as _json
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "capacity-scan.py"
_spec = importlib.util.spec_from_file_location("capacity_scan", _PATH)
cs = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(cs)


def test_parse_table_keys_on_header_and_ignores_extra_cols():
    md = """
intro text
| node | cores | ram_gb | disk_gb | notes |
|------|-------|--------|---------|-------|
| pve0 | 20 | 64 | 4000 | nvme |
| pve1 | 20 | 64 | 4000 | nvme |
trailing text
"""
    rows = cs.parse_table(md, ["node", "cores", "ram_gb", "disk_gb"])
    assert rows == [
        {"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000", "notes": "nvme"},
        {"node": "pve1", "cores": "20", "ram_gb": "64", "disk_gb": "4000", "notes": "nvme"},
    ]


def test_parse_table_returns_empty_when_header_absent():
    assert cs.parse_table("no tables here", ["node", "cores"]) == []


def test_compute_rollup_sums_allocations_and_flags_headroom():
    node_rows = [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}]
    workload_rows = [
        {"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"},
        {"workload": "forgejo", "node": "pve0", "cores": "4", "ram_mb": "8192", "disk_gb": "100"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    pve0 = nodes["pve0"]
    assert pve0["alloc_cores"] == 5
    assert pve0["alloc_ram_gb"] == 8.5  # (512 + 8192) / 1024
    assert pve0["alloc_disk_gb"] == 110.0
    assert pve0["ram_headroom_pct"] == 87  # round(100 * (64 - 8.5) / 64)
    assert pve0["oversubscribed"] is False


def test_compute_rollup_flags_oversubscription():
    node_rows = [{"node": "tiny", "cores": "2", "ram_gb": "4", "disk_gb": "50"}]
    workload_rows = [
        {"workload": "hog", "node": "tiny", "cores": "4", "ram_mb": "1024", "disk_gb": "10"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    assert nodes["tiny"]["oversubscribed"] is True  # 4 cores > 2


def test_compute_rollup_ignores_workloads_on_unknown_nodes():
    nodes = cs.compute_rollup(
        [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}],
        [{"workload": "ghost", "node": "nope", "cores": "1", "ram_mb": "512", "disk_gb": "10"}],
    )
    assert nodes["pve0"]["alloc_cores"] == 0


def test_parse_tf_hostnames_reads_vms_value_keys():
    tf_json = '{"vms": {"value": {"dns1": {"ip": "10.20.0.10", "group": "docker_hosts"}}}}'
    assert cs.parse_tf_hostnames(tf_json) == {"dns1"}


def test_parse_inventory_hostnames_reads_meta_hostvars():
    inv_json = '{"_meta": {"hostvars": {"dns1": {}, "proxy": {}}}}'
    assert cs.parse_inventory_hostnames(inv_json) == {"dns1", "proxy"}


def test_find_drift_reports_both_directions():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    warnings = cs.find_drift(workload_rows, {"proxy"})
    assert any("dns1" in w and "no Terraform" in w for w in warnings)
    assert any("proxy" in w and "absent from reference.md" in w for w in warnings)


def test_find_drift_silent_when_no_hostnames_known():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    assert cs.find_drift(workload_rows, set()) == []


def test_gather_usage_is_stubbed_unavailable():
    usage = cs.gather_usage()
    assert usage["available"] is False
    assert "reason" in usage


def test_known_hostnames_degrades_to_empty(monkeypatch):
    # Simulate terraform/ansible-inventory being absent or failing.
    def boom(*a, **k):
        raise FileNotFoundError("no such tool")

    monkeypatch.setattr(cs.subprocess, "run", boom)
    assert cs.known_hostnames("staging") == set()


def test_main_emits_valid_json_against_real_reference(monkeypatch, capsys):
    # Isolate from the host: no real terraform/ansible needed.
    monkeypatch.setattr(cs, "known_hostnames", lambda env: set())
    monkeypatch.setattr("sys.argv", ["capacity-scan.py"])
    cs.main()
    out = _json.loads(capsys.readouterr().out)
    assert set(out) == {"nodes", "workloads", "usage", "warnings"}
    assert out["usage"]["available"] is False
    assert "pve0" in out["nodes"]  # from the skeleton reference.md