Compare commits
10 commits
ed3eeb0199
...
e12326148c
| Author | SHA1 | Date | |
|---|---|---|---|
| | e12326148c | | |
| | 4c535c908e | | |
| | 1060a9c08a | | |
| | 05694f6ea4 | | |
| | 8ed00c9206 | | |
| | b240fa8bfe | | |
| | 07ecbb2789 | | |
| | 3ea9109ba2 | | |
| | 6ff5d55810 | | |
| | 88210db09c | | |
12 changed files with 1439 additions and 50 deletions
66
.claude/commands/capacity-review.md
Normal file
@@ -0,0 +1,66 @@
# Evaluate the homelab's hardware capacity and workload placement

Assess current allocation headroom, HA posture, and workload placement against declared
intent, and write a tracked report to `docs/hardware/reviews/`. On-demand only;
scheduled runs are deferred (see `docs/TODO.md` 8.4).

## Reference material

- `docs/hardware/reference.md` — physical node specs, workload allocations, and the
  free-text intent columns (`criticality`, `ha_intent`, `profile`, `constraints`, `growth`).
- `scripts/capacity-scan.py` — deterministic scan; emits JSON with keys `nodes`,
  `workloads`, `usage`, `warnings`.

## Process

### Phase 0 — gather facts

Run `python3 scripts/capacity-scan.py` and parse its JSON output:

- `nodes` — per-node physical totals, allocated totals, `ram_headroom_pct`, and the
  `oversubscribed` flag.
- `workloads` — per-workload allocation rows from `reference.md`.
- `usage` — live usage stats if available; check `usage.available`. If `false`, every
  recommendation in the report is **intent-based, not usage-based** — state this
  prominently in the report header.
- `warnings` — drift findings the scan has already detected (reference vs Terraform/inventory).
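
The Phase 0 facts above could be pulled out of the scan output roughly as follows. This is a minimal sketch, not the scan's actual contract: it assumes `nodes` is a mapping from node name to a facts object (the unit tests index the rollup by node name, but the JSON shape is not spelled out here), and `summarize_scan` and the sample payload are hypothetical.

```python
import json


def summarize_scan(scan):
    """Reduce the scan JSON to the facts the report header needs.

    Assumption: `nodes` maps node name -> facts dict; adjust if the real
    scan emits a list instead.
    """
    usage = scan.get("usage") or {}
    nodes = scan.get("nodes") or {}
    return {
        # Without live usage data every recommendation is intent-based.
        "basis": "usage-based" if usage.get("available") else "intent-based",
        "oversubscribed": sorted(
            name for name, facts in nodes.items() if facts.get("oversubscribed")
        ),
        "warnings": list(scan.get("warnings") or []),
    }


# Hypothetical sample payload standing in for the real scan output.
sample = json.loads("""
{
  "nodes": {
    "pve0": {"ram_headroom_pct": 87, "oversubscribed": false},
    "pve1": {"ram_headroom_pct": -5, "oversubscribed": true}
  },
  "workloads": [],
  "usage": {"available": false},
  "warnings": ["dns2: in reference.md but not in inventory"]
}
""")

summary = summarize_scan(sample)
print(summary["basis"], summary["oversubscribed"])
```

In a real run the payload would come from the scan's stdout rather than an inline string.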

### Phase 1 — read intent

Read `docs/hardware/reference.md` for the free-text columns the scan does not parse:
`criticality`, `ha_intent`, `profile`, `constraints`, and `growth`, plus the
"Capacity notes" section at the bottom of the file.

### Phase 2 — reason across five dimensions

Produce concrete, actionable recommendations. Tag every item with its type and the
basis it rests on (**intent-based** vs **usage-based**):

1. **HA / redundancy** — anti-affinity violations (e.g. an HA pair co-located on one
   node), single points of failure, HA posture that looks like overkill for the
   declared `criticality`, and high-criticality workloads with no redundancy.
2. **Right-sizing** — over- or under-provisioned workloads compared to their `profile`.
   Today this is intent-based (declared allocation vs profile); flag explicitly that it
   becomes usage-based once the `gather_usage()` hook in the scan script is live.
3. **Placement / moves** — oversubscribed nodes (`oversubscribed: true` or low
   `ram_headroom_pct`) and constraint-driven relocations indicated by `constraints`.
4. **Upgrade timing** — cross-reference `growth` notes against current headroom to
   estimate a rough runway before a node upgrade is needed.
5. **Drift** — surface every entry in the scan's `warnings` array verbatim.
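
As one illustration of the HA / redundancy dimension, a co-located HA pair could be detected mechanically before any judgement is applied. The sketch below assumes `ha_intent` uses the `pair/<peer>` form seen in the example rows of `reference.md`; the column is free text, so real rows may need looser matching. `find_colocated_pairs` is a hypothetical helper, not part of `capacity-scan.py`.

```python
def find_colocated_pairs(workloads):
    """Return sorted (workload, peer, node) tuples for HA pairs on one node.

    Assumption: ha_intent looks like "pair/<peer>"; anything else is skipped.
    """
    node_of = {row["workload"]: row["node"] for row in workloads}
    violations = set()
    for row in workloads:
        kind, _, peer = row.get("ha_intent", "").partition("/")
        if kind != "pair" or peer not in node_of:
            continue
        if node_of[peer] == row["node"]:
            # Sort the names so each pair is reported once, not twice.
            first, second = sorted((row["workload"], peer))
            violations.add((first, second, row["node"]))
    return sorted(violations)


# Hypothetical rows: dns1 and dns2 wrongly share pve0.
rows = [
    {"workload": "dns1", "node": "pve0", "ha_intent": "pair/dns2"},
    {"workload": "dns2", "node": "pve0", "ha_intent": "pair/dns1"},
    {"workload": "forgejo", "node": "pve1", "ha_intent": "none"},
]
print(find_colocated_pairs(rows))
```

A finding from a check like this would be tagged intent-based, since it rests only on declared placement.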

### Phase 3 — write the report

Save the report to `docs/hardware/reviews/YYYY-MM-DD-capacity.md` and overwrite
`docs/hardware/reviews/latest.md` with the same content.
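
The save step above amounts to writing identical content to a dated file and to `latest.md`. A minimal sketch, assuming plain UTF-8 markdown and a hypothetical `write_report` helper (the command itself does not prescribe any helper):

```python
import datetime
import pathlib
import tempfile


def write_report(content, reviews_dir, today=None):
    """Write the report to YYYY-MM-DD-capacity.md and mirror it to latest.md."""
    today = today or datetime.date.today()
    reviews_dir = pathlib.Path(reviews_dir)
    reviews_dir.mkdir(parents=True, exist_ok=True)
    dated = reviews_dir / f"{today.isoformat()}-capacity.md"
    latest = reviews_dir / "latest.md"
    for path in (dated, latest):
        path.write_text(content, encoding="utf-8")
    return dated, latest


# Demonstrate against a throwaway directory instead of docs/hardware/reviews/.
with tempfile.TemporaryDirectory() as tmp:
    dated, latest = write_report(
        "# Capacity review\n", tmp, datetime.date(2026, 6, 1)
    )
    dated_name = dated.name
    same = dated.read_text(encoding="utf-8") == latest.read_text(encoding="utf-8")

print(dated_name, same)
```

Passing the date explicitly keeps the file name deterministic, which is convenient when testing.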

Report structure:

- **One-line summary** — overall health signal (e.g. "All nodes within headroom;
  two HA violations detected").
- **Run metadata** — date, reviewed commit SHA, `usage.available` status.
- **Section per dimension** — each with concrete, actionable items; every item states
  its basis (intent-based or usage-based) and the evidence behind it.
- **Follow-up prompt** — a generated, copy-pasteable prompt for the next review or
  for acting on the top finding.

Commit the report files per CLAUDE.md git conventions.

@@ -31,6 +31,7 @@ Full design rationale: `docs/decisions/`
| Deploy a playbook | `make deploy PLAYBOOK=<name>` |
| Scaffold a new role | `make new-role NAME=<name>` |
| Review repo for drift/cruft | `/review-repo` (Claude command) |
| Review hardware capacity | `/capacity-review` (Claude command) |
| Encrypt a vault file | `make encrypt FILE=<path>` |
| Decrypt a vault file | `make decrypt FILE=<path>` |
| Install Python deps | `make setup` |

@@ -170,6 +171,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Testing methodology | `docs/decisions/008-testing.md` |
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Adding a new role | `docs/runbooks/new-role.md` |
| Adding a new host | `docs/runbooks/new-host.md` |
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |

@@ -21,6 +21,8 @@ _Last reviewed: 2026-05-30._
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |

## Scaffolded but empty — NOT implemented

@@ -44,6 +46,7 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
| Per-service roles | ADR-004 | Model defined; no service roles built |
| Forgejo Actions CI | ADR-003 / ADR-008 | Remote is live (pushed); Actions/`act_runner` pipeline not yet built |
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |

## Keeping this honest

126
docs/TODO.md
|
|
@@ -1,64 +1,90 @@
|
|||
# ToDo
|
||||
|
||||
- [x] Main readme only says ansible, not terraform. Should probably be included.
|
||||
- [x] Main readme does not include a description of the name boma, nor the scope (i.e. infrastructure - not laptops)
|
||||
1. **Forgejo CI** — what CI work remains after ADR-010 (which workflows, runner
|
||||
setup, etc. still need to be built)?
|
||||
|
||||
- [x] Method to review repo to ensure
|
||||
- We dont carry around code, comments, notes, etc. that is no longer needed but was perhaps added to fix an issue that has been resolved.
|
||||
- That all code, structure, comments, notes etc. follow our design decisions.
|
||||
- That clear intent is documented throughout - and that there are not any overlaps, contradictions etc.
|
||||
2. **Testing**
|
||||
1. Choose and configure code-testing tooling (Molecule, etc.).
|
||||
2. Decide how the AI interprets Molecule output and performs live testing:
|
||||
API calls, curl pulls of web products, log reviews, and headless browsing.
|
||||
3. Define a standard for generating test users and for instructing the user to
|
||||
perform relevant manual tests.
|
||||
|
||||
- [ ] Forgejo CI
|
||||
3. **Building services**
|
||||
1. Decide how to manage logs.
|
||||
2. Decide how to manage APIs / API access.
|
||||
3. Decide how to import or integrate from baobabAnsibleV4.
|
||||
4. Decide what each node runs — base packages plus which apps/services.
|
||||
5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
|
||||
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
|
||||
Kuma alerts on askari.
|
||||
7. Define a tagging standard that lets us target runs without over-tagging.
|
||||
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
|
||||
9. Decide: a central database server, or individual database services per app?
|
||||
10. Should we continue to use the base-container method, or do the improved methods in boma make it moot?
|
||||
|
||||
- [ ] Testing
|
||||
- Code testing tools (molecule etc.)
|
||||
- AI interpretation of molecule etc, but also actual testing via API-calls, CURL pulls of web products, log reviews and perhaps even headless browsing
|
||||
4. **Split-horizon FQDN** — adopt split-horizon FQDN with or without nyumbani?
|
||||
|
||||
- [ ] Building stuff
|
||||
- How to manage logs
|
||||
- How to manage APIs
|
||||
- How to import/integrate from baobabAnsibleV4?
|
||||
- What to install on nodes?
|
||||
- firewalls?
|
||||
- apps?
|
||||
- wiring up loki, prometheus, grafana dashboards, grafana alerts, uptimekuma alerts on askari
|
||||
- tagging strategy - we need a specific standard so that we can target runs, but dont over-tag.
|
||||
5. **Control node**
|
||||
1. Set up and test the control node while waiting for hardware.
|
||||
2. Define control-node bootstrapping — a dedicated recipe and playbook?
|
||||
3. Decide the role of mamba — access/availability vs compute power and ease?
|
||||
4. Set up rbw on the control node.
|
||||
|
||||
- [ ] Split horizon FQDN - with or without nyumbani
|
||||
6. **Updating**
|
||||
1. Decide pinning vs latest for versions.
|
||||
2. Decide the update strategy across services & containers vs packages &
|
||||
builds / GitHub pulls / Flatpaks.
|
||||
3. Define scheduling of updates and reboots, including post-update testing.
|
||||
|
||||
- [ ] Control node
|
||||
- Setup and testing while waiting for hardware?
|
||||
- Bootstrapping - perhaps dedicated recipe and playbook?
|
||||
- Role of mamba? - Access/availability vs compute power and ease?
|
||||
- rbw on control node
|
||||
7. **Shell setup**
|
||||
1. Decide what shell setup matters for the AI's work on the control node.
|
||||
2. Decide what to set up on the hosts, given that direct access will be rare.
|
||||
|
||||
- [ ] Updating
|
||||
- Pinning vs latest.
|
||||
- services and containers vs packages and builds/github pulls/flatpacks
|
||||
- scheduling of updates and reboots - incl. testing afterwards.
|
||||
8. **Scheduled work**
|
||||
1. Run `/review-repo` as `claude -p` via cron every two weeks?
|
||||
2. Build sanity checks (e.g. does PhotoPrism have its pictures? are email
|
||||
services receiving and sending?).
|
||||
3. Design a declarative `scheduled_jobs` role so the repo owns which cronjobs
|
||||
run on a host, enforced by Ansible. Sketch (deferred until we have hosts):
|
||||
reads a `scheduled_jobs__jobs` list from group_vars/host_vars, rendered via
|
||||
a managed `/etc/cron.d` file. Open questions:
|
||||
1. General role vs control-node-only?
|
||||
2. Prune undeclared jobs (repo authoritative) vs additive?
|
||||
3. Validate headless email and that cron's env has the `claude` CLI.
|
||||
4. (The fortnightly `/review-repo` job is the first entry.)
|
||||
4. Schedule `/capacity-review` to run periodically (on-demand only for now).
|
||||
Revisit once the physical cluster + a live usage-stats hook exist, so it
|
||||
reasons on real usage rather than declared intent alone. **Decide the usage
|
||||
source first:** Proxmox RRD (built-in, no extra infra) vs the
|
||||
Prometheus/Loki/Grafana/Grafana-Alloy stack we will likely set up anyway
|
||||
(richer, per-process, but more to run) — see TODO 3.6. Don't build the
|
||||
Proxmox-RRD hook before settling this, to avoid throwaway work.
|
||||
9. Should we make a basic function so that tools (and AI) can send messages to the user - email, matrix or ntfy?
|
||||
|
||||
- [ ] shell setup
|
||||
- What does it matter in relations to the AIs work on the control node?
|
||||
- What should we set up on the hosts, if i'll rarely go there?
|
||||
10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
|
||||
files); hooks + slash commands + `/review-repo` for enforcement at scale. Any
|
||||
remaining setup to carry out from this decision?
|
||||
1. Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.
|
||||
2. Policy for how we write key documents like ADRs.
|
||||
3. Further development on how we collaborate on designing the foundation for the project, separate from how we implement new containers etc.
|
||||
|
||||
- [ ] Scheduled work
|
||||
- /review-repo maybe as claude -p via cron every two weeks?
|
||||
- Sanity checks: does a photoprism have its pictures? are email services receiving and sending?
|
||||
- Cron "section": a declarative way for the repo to own which cronjobs are active on a
|
||||
host, enforced by Ansible. Sketch (deferred until we have hosts): a `scheduled_jobs`
|
||||
role reading a `scheduled_jobs__jobs` list from group_vars/host_vars, rendered via a
|
||||
managed /etc/cron.d file. Open Qs: general role vs control-node-only; prune
|
||||
undeclared jobs (repo authoritative) vs additive; validate headless email + that
|
||||
cron's env has the `claude` CLI. The /review-repo fortnightly job is the first entry.
|
||||
11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
|
||||
1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
|
||||
findings + a tooling-usage inventory; proposes add / change / **remove**
|
||||
(biased to remove); records decisions as ADRs; evaluates itself.
|
||||
Recurrence-triggered plus a light periodic sweep.
|
||||
2. Keep appending raw signals to `docs/FRICTION.md` (live now) until the
|
||||
retro consumes them.
|
||||
|
||||
- [ ] Claude setup
|
||||
- superpowers or other methodologies? → decided: brainstorm for intent, capture as
|
||||
ADRs (skip plan files); hooks + slash commands + /review-repo for enforcement at scale.
|
||||
12. **Spin-up order** — what is the right order of operations when spinning up
|
||||
from scratch (OS, DNS, Authentik, Traefik, …)?
|
||||
|
||||
- [ ] Kaizen loop — set up ~2026-06-06 (one week from now)
|
||||
- Build `/retro`: reads `docs/FRICTION.md` + `/review-repo` recurring findings + a
|
||||
tooling-usage inventory; proposes add / change / **remove** (biased to remove);
|
||||
records decisions as ADRs; evaluates itself. Recurrence-triggered + light periodic sweep.
|
||||
- `docs/FRICTION.md` is live now — keep appending raw signals until the retro consumes them.
|
||||
13. **Intentions** - Does the current setup clearly identify intentions throughout? We have the readme files, but is that enough? Also, how do we re-challenge decisions and how they interact over time? E.g. we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to a service has emerged, and it is actually better.
|
||||
|
||||
- [ ] What is the right order of operation when spinning up from scratch? (OS, DNS, authentik, traefik...?)
|
||||
14. **Script dependencies policy** — utility scripts (`tf_to_inventory.py`,
|
||||
`repo-scan.py`, `capacity-scan.py`) are stdlib-only by convention, for
|
||||
run-anywhere portability (control node, CI, bare clone, no venv). Reevaluate
|
||||
whether selectively allowing libraries (e.g. PyYAML — already present via
|
||||
Ansible) is a better fit in general: weigh the parsing-correctness win
|
||||
against losing zero-setup portability. Decide a clear rule and record it.
|
||||
|
|
|
|||
38
docs/decisions/012-hardware-capacity.md
Normal file
@@ -0,0 +1,38 @@
# ADR-012 — Hardware reference & capacity evaluation

## Context

The repo modelled the logical/network layer (Terraform VM specs, ADR-007
topology) but not the physical layer — node CPU/RAM/disk capacity, network gear,
or which workloads are designed to run where with what headroom. There was also
no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a
workload that should move, or a node due an upgrade.

## Decision

- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
  physical compute + network gear and workload placement intent. Two
  machine-readable tables (node capacity, workload placement) carry the numbers.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
  parses those tables, computes per-node allocated-vs-physical rollups, and
  cross-checks workload hostnames against `terraform output -json` /
  `ansible-inventory --list` to surface drift.
- `/capacity-review` reads the scan + intent columns and writes a dated report to
  `docs/hardware/reviews/YYYY-MM-DD-capacity.md`, also overwriting
  `docs/hardware/reviews/latest.md`, mirroring `/review-repo` → `docs/reviews/`.
- Numeric allocations live in `reference.md`, not Terraform: the current
  `terraform output` exposes only `{ip, group}`. Terraform/inventory are used
  only for hostname-drift cross-checks.
- **Live usage stats are a future hook.** The cluster is not stood up;
  `gather_usage()` returns `available: false` and the evaluator reasons on
  declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/
  Alloy) is undecided — see docs/TODO.md 8.4, to be settled before any hook is
  built.
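
The hostname-drift cross-check in the decision above reduces to a set comparison between the hostnames declared in `reference.md` and those known to Terraform or the inventory. The sketch below is an illustration only: `hostname_drift` and its message wording are hypothetical, and treating "nothing provisioned" as "no drift" is an assumption about how the scan behaves before the cluster exists.

```python
def hostname_drift(reference_hosts, provisioned_hosts):
    """Return human-readable drift warnings between the two hostname sets.

    Assumption: provisioned_hosts is None when Terraform and the inventory
    are unavailable; that case yields no warnings rather than an error.
    """
    if provisioned_hosts is None:
        return []
    reference = set(reference_hosts)
    provisioned = set(provisioned_hosts)
    warnings = [
        f"{host}: in reference.md but not provisioned"
        for host in sorted(reference - provisioned)
    ]
    warnings += [
        f"{host}: provisioned but missing from reference.md"
        for host in sorted(provisioned - reference)
    ]
    return warnings


# Hypothetical inputs: dns2 is declared only, web1 is provisioned only.
drift = hostname_drift(["dns1", "dns2"], ["dns1", "web1"])
print(drift)
```

Sorting each difference keeps the warnings deterministic, which matters if reports are diffed between runs.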

## Consequences

- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
  matching `capacity-scan.py` change.

See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
52
docs/hardware/reference.md
Normal file
@@ -0,0 +1,52 @@
# Hardware reference — boma

> Hand-maintained source of truth for **physical** compute + network gear and
> **workload placement intent**. The two machine-readable tables (Node capacity,
> Workload placement) are parsed by `scripts/capacity-scan.py` — keep their
> headers intact. Evaluated by `/capacity-review`. See ADR-012.
>
> _Status: skeleton. Replace example rows with real hardware once the cluster is
> stood up (STATUS.md tracks real-vs-planned)._

## 1. Physical compute

### pve0
- **Model / form factor:** _TBD (e.g. Minisforum MS-01, mini-PC)_
- **CPU:** _TBD (e.g. i9-13900H, 14C/20T)_
- **RAM:** _TBD total; max _; free DIMM slots _
- **Storage:** _TBD (disks → pools, e.g. 2× 2 TB NVMe → `local-zfs`)_
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_

_(repeat for pve1, pve2, askari)_

## 2. Network gear

| device | model | ports | poe | throughput | uplinks | notes |
|----------|-------|-------|-----|------------|---------|-------|
| opnsense | _TBD_ | _TBD_ | n/a | _TBD_ | WAN+LAN | dedicated hardware |
| switch | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | managed, 802.1q |
| ap1 | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | multi-SSID per VLAN |

## 3. Workload placement & intent

The numeric columns (`cores, ram_mb, disk_gb`) feed `capacity-scan.py`; the
free-text columns feed `/capacity-review`'s judgement.

| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny/steady | anti-affinity: dns2 on a different node | flat |
| dns2 | pve1 | 1 | 512 | 10 | high | pair/dns1 | tiny/steady | anti-affinity: dns1 on a different node | flat |

## 4. Node capacity (machine-readable)

Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals.

| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |

## 5. Capacity notes

Free-text running notes for the evaluator (trends, planned moves, upgrade ideas).
0
docs/hardware/reviews/.gitkeep
Normal file
753
docs/superpowers/plans/2026-06-01-hardware-capacity.md
Normal file
@@ -0,0 +1,753 @@
# Hardware Reference & Capacity Evaluation Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add a hand-maintained hardware reference doc, a stdlib-only `capacity-scan.py` that emits deterministic capacity facts, and an on-demand `/capacity-review` skill that reasons about HA / right-sizing / placement / upgrade timing.

**Architecture:** `docs/hardware/reference.md` is the single machine-readable source of truth (physical node capacities + workload allocations + placement intent). `scripts/capacity-scan.py` parses its tables, computes per-node allocated-vs-physical rollups, and cross-checks workload hostnames against `terraform output -json` / `ansible-inventory --list` to surface drift — degrading gracefully when nothing is provisioned. `/capacity-review` runs the scan, reads the intent columns, and writes a dated report to `docs/hardware/reviews/`. Live usage stats are a stubbed future hook. Mirrors the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad.

**Tech Stack:** Python 3 standard library only (no third-party imports in the script); `pytest` for unit tests (already in `requirements.txt`); markdown docs; a `.claude` slash-command skill.

---

## Spec

Design spec: `docs/superpowers/specs/2026-06-01-hardware-capacity-design.md`.

**Refinement vs spec:** the spec said allocations come from Terraform. The current
`terraform output "vms"` only exposes `{ip, group}`, not cores/RAM/disk, so numeric
allocations are read from `reference.md` instead; Terraform/inventory are used only
for hostname-drift cross-checks. This better honors the "self-contained markdown
source of truth" decision and needs no Terraform module changes.

## File Structure

- `docs/hardware/reference.md` — **create.** Source of truth. Human sections
  (physical compute, network gear) + two machine-readable tables (node capacity,
  workload placement) the script parses.
- `scripts/capacity-scan.py` — **create.** Stdlib-only. Pure parse/math functions
  + thin subprocess glue + `main()` emitting JSON to stdout.
- `tests/test_capacity_scan.py` — **create.** Pytest unit tests for the pure
  functions + a smoke test against the real `reference.md`.
- `.claude/commands/capacity-review.md` — **create.** The `/capacity-review` skill.
- `docs/hardware/reviews/.gitkeep` — **create.** Report output dir.
- `docs/decisions/012-hardware-capacity.md` — **create.** ADR recording the decision.
- `STATUS.md` — **modify.** Add real-vs-planned rows.
- `CLAUDE.md` — **modify.** Commands-table row + Further-reading pointer.
- `scripts/README.md` — **modify.** Document `capacity-scan.py`.

### Machine-readable table contract (used by Task 1 and the parser)

`reference.md` must contain these two tables verbatim in header shape. The parser
keys on header names, so column order is flexible and extra free-text columns are
ignored.

**Node capacity** — header contains `node, cores, ram_gb, disk_gb` (integers/floats):

```
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
```

**Workload placement** — header contains the numeric columns `workload, node,
cores, ram_mb, disk_gb` plus any free-text intent columns:

```
| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny | anti-affinity: dns2 elsewhere | flat |
```
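
The contract above (match on header names, ignore order and extra columns) can be checked with a simple subset test. This is a minimal sketch under that reading of the contract; `header_cells` and `satisfies` are hypothetical names and are not the plan's `parse_table()`.

```python
NODE_COLS = {"node", "cores", "ram_gb", "disk_gb"}
WORKLOAD_COLS = {"workload", "node", "cores", "ram_mb", "disk_gb"}


def header_cells(line):
    """Split one markdown table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def satisfies(line, required):
    """True when the header row carries every required column name."""
    return required.issubset(header_cells(line))


# Columns reordered and an extra free-text column added: still a valid
# node-capacity header, but not a workload-placement header.
reordered = "| disk_gb | node | notes | ram_gb | cores |"
is_node_table = satisfies(reordered, NODE_COLS)
is_workload_table = satisfies(reordered, WORKLOAD_COLS)
print(is_node_table, is_workload_table)
```

Because the check is a subset test, renaming a required column breaks the match, which is why header changes need a matching script change.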

---

## Task 1: Reference doc skeleton

**Files:**
- Create: `docs/hardware/reference.md`
- Create: `docs/hardware/reviews/.gitkeep`

- [ ] **Step 1: Write `docs/hardware/reference.md`**

```markdown
# Hardware reference — boma

> Hand-maintained source of truth for **physical** compute + network gear and
> **workload placement intent**. The two machine-readable tables (Node capacity,
> Workload placement) are parsed by `scripts/capacity-scan.py` — keep their
> headers intact. Evaluated by `/capacity-review`. See ADR-012.
>
> _Status: skeleton. Replace example rows with real hardware once the cluster is
> stood up (STATUS.md tracks real-vs-planned)._

## 1. Physical compute

### pve0
- **Model / form factor:** _TBD (e.g. Minisforum MS-01, mini-PC)_
- **CPU:** _TBD (e.g. i9-13900H, 14C/20T)_
- **RAM:** _TBD total; max _; free DIMM slots _
- **Storage:** _TBD (disks → pools, e.g. 2× 2 TB NVMe → `local-zfs`)_
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_

_(repeat for pve1, pve2, askari)_

## 2. Network gear

| device | model | ports | poe | throughput | uplinks | notes |
|----------|-------|-------|-----|------------|---------|-------|
| opnsense | _TBD_ | _TBD_ | n/a | _TBD_ | WAN+LAN | dedicated hardware |
| switch | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | managed, 802.1q |
| ap1 | _TBD_ | _TBD_ | _TBD_ | _TBD_ | trunk | multi-SSID per VLAN |

## 3. Workload placement & intent

The numeric columns (`cores, ram_mb, disk_gb`) feed `capacity-scan.py`; the
free-text columns feed `/capacity-review`'s judgement.

| workload | node | cores | ram_mb | disk_gb | criticality | ha_intent | profile | constraints | growth |
|----------|------|-------|--------|---------|-------------|-----------|---------|-------------|--------|
| dns1 | pve0 | 1 | 512 | 10 | high | pair/dns2 | tiny/steady | anti-affinity: dns2 on a different node | flat |
| dns2 | pve1 | 1 | 512 | 10 | high | pair/dns1 | tiny/steady | anti-affinity: dns1 on a different node | flat |

## 4. Node capacity (machine-readable)

Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals.

| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |

## 5. Capacity notes

Free-text running notes for the evaluator (trends, planned moves, upgrade ideas).
```

- [ ] **Step 2: Create the reports directory**

Run: `mkdir -p docs/hardware/reviews && touch docs/hardware/reviews/.gitkeep`
Expected: both paths exist.

- [ ] **Step 3: Verify the machine-readable headers match the contract**

Run: `grep -n '| node | cores | ram_gb | disk_gb |' docs/hardware/reference.md && grep -n '| workload | node | cores | ram_mb | disk_gb |' docs/hardware/reference.md`
Expected: each grep prints one matching line (the table headers the parser keys on).

- [ ] **Step 4: Commit**

```bash
git add docs/hardware/reference.md docs/hardware/reviews/.gitkeep
git commit -m "Add hardware reference doc skeleton + reviews dir"
```

---

## Task 2: Scan script — `parse_table()`

**Files:**
- Create: `scripts/capacity-scan.py`
- Create: `tests/test_capacity_scan.py`

- [ ] **Step 1: Write the failing test**

Create `tests/test_capacity_scan.py`:

```python
import importlib.util
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "capacity-scan.py"
_spec = importlib.util.spec_from_file_location("capacity_scan", _PATH)
cs = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(cs)


def test_parse_table_keys_on_header_and_ignores_extra_cols():
    md = """
intro text
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |

trailing text
"""
    rows = cs.parse_table(md, ["node", "cores", "ram_gb", "disk_gb"])
    assert rows == [
        {"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"},
        {"node": "pve1", "cores": "20", "ram_gb": "64", "disk_gb": "4000"},
    ]


def test_parse_table_returns_empty_when_header_absent():
    assert cs.parse_table("no tables here", ["node", "cores"]) == []
```

- [ ] **Step 2: Run test to verify it fails**

Run: `python3 -m pytest tests/test_capacity_scan.py -v`
Expected: FAIL with `FileNotFoundError` during collection, because `scripts/capacity-scan.py` does not exist yet (or `AttributeError` if the file exists but `parse_table` is not defined).

- [ ] **Step 3: Write minimal implementation**

Create `scripts/capacity-scan.py`:

```python
#!/usr/bin/env python3
"""capacity-scan.py — deterministic capacity facts for /capacity-review.

Python standard library only. Emits a JSON object to stdout.

Reads physical capacities and workload allocations from the machine-readable
tables in docs/hardware/reference.md, computes per-node allocated-vs-physical
rollups, and cross-checks workload hostnames against `terraform output -json`
and `ansible-inventory --list` to surface drift. Degrades gracefully when
nothing is provisioned. Live usage stats are a documented future hook.

Usage: python3 scripts/capacity-scan.py [--env staging] [--reference PATH]
"""
import argparse
import json
import os
import subprocess
import sys

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


def parse_table(markdown, required_cols):
    """Return row dicts for the first markdown table whose header contains all
    required_cols. Keys are header names; values are raw cell strings."""
    lines = markdown.splitlines()
    required = set(required_cols)
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = [c.strip() for c in line.strip("|").split("|")]
        if not required.issubset(set(headers)):
            continue
        rows = []
        for body in lines[i + 2:]:
            if not body.strip().startswith("|"):
                break
            cells = [c.strip() for c in body.strip().strip("|").split("|")]
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []
```
|
||||
|
||||
- [ ] **Step 4: Run test to verify it passes**
|
||||
|
||||
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
|
||||
Expected: PASS (2 passed).
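A minimal sketch of how `parse_table()` treats a row whose cell count does not match the header, assuming the function body from Step 3 (repeated here only so the snippet runs standalone). The `SAMPLE` table is hypothetical; the real tables live in `docs/hardware/reference.md`.

```python
def parse_table(markdown, required_cols):
    """Same logic as the plan's parse_table(), copied for a standalone run."""
    lines = markdown.splitlines()
    required = set(required_cols)
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = [c.strip() for c in line.strip("|").split("|")]
        if not required.issubset(set(headers)):
            continue
        rows = []
        for body in lines[i + 2:]:  # lines[i + 1] is the |---| separator row
            if not body.strip().startswith("|"):
                break
            cells = [c.strip() for c in body.strip().strip("|").split("|")]
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []


# Hypothetical sample: the pve1 row is missing one cell on purpose.
SAMPLE = """\
Some prose first.

| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20    | 64     | 4000    |
| pve1 | 20    | 64              |
| pve2 | 16    | 32     | 2000    |

Trailing prose ends the table.
"""

rows = parse_table(SAMPLE, ["node", "cores"])
print([r["node"] for r in rows])  # ['pve0', 'pve2']
```

The short row is dropped silently rather than raising, so a typo in `reference.md` shows up as a missing node in the scan output, not as an error. Values stay raw strings; numeric conversion happens later in the rollup.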
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add scripts/capacity-scan.py tests/test_capacity_scan.py
|
||||
git commit -m "Add capacity-scan.py with parse_table()"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 3: Rollup math — `compute_rollup()`
|
||||
|
||||
**Files:**
|
||||
- Modify: `scripts/capacity-scan.py`
|
||||
- Modify: `tests/test_capacity_scan.py`
|
||||
|
||||
- [ ] **Step 1: Write the failing test (append to `tests/test_capacity_scan.py`)**
|
||||
|
||||
```python
def test_compute_rollup_sums_allocations_and_flags_headroom():
    node_rows = [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}]
    workload_rows = [
        {"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"},
        {"workload": "forgejo", "node": "pve0", "cores": "4", "ram_mb": "8192", "disk_gb": "100"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    pve0 = nodes["pve0"]
    assert pve0["alloc_cores"] == 5
    assert pve0["alloc_ram_gb"] == 8.5  # (512 + 8192) / 1024
    assert pve0["alloc_disk_gb"] == 110.0
    assert pve0["ram_headroom_pct"] == 87  # round(100 * (64 - 8.5) / 64)
    assert pve0["oversubscribed"] is False


def test_compute_rollup_flags_oversubscription():
    node_rows = [{"node": "tiny", "cores": "2", "ram_gb": "4", "disk_gb": "50"}]
    workload_rows = [
        {"workload": "hog", "node": "tiny", "cores": "4", "ram_mb": "1024", "disk_gb": "10"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    assert nodes["tiny"]["oversubscribed"] is True  # 4 cores > 2


def test_compute_rollup_ignores_workloads_on_unknown_nodes():
    nodes = cs.compute_rollup(
        [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}],
        [{"workload": "ghost", "node": "nope", "cores": "1", "ram_mb": "512", "disk_gb": "10"}],
    )
    assert nodes["pve0"]["alloc_cores"] == 0
```
|
||||
|
||||
- [ ] **Step 2: Run test to verify it fails**
|
||||
|
||||
Run: `python3 -m pytest tests/test_capacity_scan.py -k compute_rollup -v`
|
||||
Expected: FAIL — `AttributeError: module 'capacity_scan' has no attribute 'compute_rollup'`.
|
||||
|
||||
- [ ] **Step 3: Write minimal implementation (append to `scripts/capacity-scan.py`, before any `main`)**
|
||||
|
||||
```python
def compute_rollup(node_rows, workload_rows):
    """Per node: physical totals, summed allocations, RAM headroom %, and an
    oversubscribed flag. Workloads on unknown nodes are ignored."""
    nodes = {}
    for r in node_rows:
        nodes[r["node"]] = {
            "cores": int(r["cores"]),
            "ram_gb": float(r["ram_gb"]),
            "disk_gb": float(r["disk_gb"]),
            "alloc_cores": 0,
            "alloc_ram_mb": 0,
            "alloc_disk_gb": 0.0,
        }
    for w in workload_rows:
        node = nodes.get(w["node"])
        if node is None:
            continue
        node["alloc_cores"] += int(w["cores"])
        node["alloc_ram_mb"] += int(w["ram_mb"])
        node["alloc_disk_gb"] += float(w["disk_gb"])
    for node in nodes.values():
        node["alloc_ram_gb"] = round(node.pop("alloc_ram_mb") / 1024, 1)
        node["ram_headroom_pct"] = (
            round(100 * (node["ram_gb"] - node["alloc_ram_gb"]) / node["ram_gb"])
            if node["ram_gb"]
            else 0
        )
        node["oversubscribed"] = (
            node["alloc_cores"] > node["cores"]
            or node["alloc_ram_gb"] > node["ram_gb"]
            or node["alloc_disk_gb"] > node["disk_gb"]
        )
    return nodes
```
|
||||
|
||||
- [ ] **Step 4: Run test to verify it passes**
|
||||
|
||||
Run: `python3 -m pytest tests/test_capacity_scan.py -k compute_rollup -v`
|
||||
Expected: PASS (3 passed).
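The rollup arithmetic can be checked by hand. The sketch below reuses the numbers from `test_compute_rollup_sums_allocations_and_flags_headroom` and then probes one edge case; `edge_mb` is a made-up value chosen to sit just above 64 GB, not something taken from `reference.md`.

```python
# Worked numbers from the first compute_rollup test.
ram_gb = 64.0
alloc_ram_mb = 512 + 8192

alloc_ram_gb = round(alloc_ram_mb / 1024, 1)                  # 8704 / 1024
headroom_pct = round(100 * (ram_gb - alloc_ram_gb) / ram_gb)  # 86.71875 rounds up
print(alloc_ram_gb, headroom_pct)  # 8.5 87

# Edge case (hypothetical): 34 MB more than the node physically has.
edge_mb = 65570
edge_gb = round(edge_mb / 1024, 1)    # 64.033... rounds down to 64.0
flag_on_rounded = edge_gb > ram_gb    # comparison as compute_rollup() does it
flag_on_raw_mb = edge_mb > ram_gb * 1024
print(flag_on_rounded, flag_on_raw_mb)  # False True
```

Because the RAM comparison runs on the value already rounded to one decimal, an overshoot of less than about 50 MB is not flagged. For a homelab report that may well be acceptable; if it is not, comparing raw megabytes before rounding would close the gap.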
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add scripts/capacity-scan.py tests/test_capacity_scan.py
|
||||
git commit -m "Add compute_rollup() to capacity-scan.py"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 4: Drift detection — `find_drift()` + hostname parsers
|
||||
|
||||
**Files:**
|
||||
- Modify: `scripts/capacity-scan.py`
|
||||
- Modify: `tests/test_capacity_scan.py`
|
||||
|
||||
- [ ] **Step 1: Write the failing test (append)**
|
||||
|
||||
```python
def test_parse_tf_hostnames_reads_vms_value_keys():
    tf_json = '{"vms": {"value": {"dns1": {"ip": "10.20.0.10", "group": "docker_hosts"}}}}'
    assert cs.parse_tf_hostnames(tf_json) == {"dns1"}


def test_parse_inventory_hostnames_reads_meta_hostvars():
    inv_json = '{"_meta": {"hostvars": {"dns1": {}, "proxy": {}}}}'
    assert cs.parse_inventory_hostnames(inv_json) == {"dns1", "proxy"}


def test_find_drift_reports_both_directions():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    warnings = cs.find_drift(workload_rows, {"proxy"})
    assert any("dns1" in w and "no Terraform" in w for w in warnings)
    assert any("proxy" in w and "absent from reference.md" in w for w in warnings)


def test_find_drift_silent_when_no_hostnames_known():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    assert cs.find_drift(workload_rows, set()) == []
```
|
||||
|
||||
- [ ] **Step 2: Run test to verify it fails**
|
||||
|
||||
Run: `python3 -m pytest tests/test_capacity_scan.py -k "drift or hostnames" -v`
|
||||
Expected: FAIL — attributes `parse_tf_hostnames` / `parse_inventory_hostnames` / `find_drift` not defined.
|
||||
|
||||
- [ ] **Step 3: Write minimal implementation (append)**
|
||||
|
||||
```python
def parse_tf_hostnames(tf_json):
    """Hostnames from `terraform output -json` (the `vms` map keys)."""
    data = json.loads(tf_json)
    return set(data.get("vms", {}).get("value", {}).keys())


def parse_inventory_hostnames(inv_json):
    """Hostnames from `ansible-inventory --list` (_meta.hostvars keys)."""
    data = json.loads(inv_json)
    return set(data.get("_meta", {}).get("hostvars", {}).keys())


def find_drift(workload_rows, known_hostnames):
    """Warn when reference.md workloads and live hostnames disagree. Silent when
    no hostnames are known (pre-provisioning) — nothing to compare against."""
    warnings = []
    declared = {w["workload"] for w in workload_rows}
    if not known_hostnames:
        return warnings
    for name in sorted(declared - known_hostnames):
        warnings.append(
            f"reference.md lists '{name}' but no Terraform/inventory host declares it"
        )
    for name in sorted(known_hostnames - declared):
        warnings.append(
            f"host '{name}' exists in Terraform/inventory but is absent from reference.md"
        )
    return warnings
```
|
||||
|
||||
- [ ] **Step 4: Run test to verify it passes**
|
||||
|
||||
Run: `python3 -m pytest tests/test_capacity_scan.py -k "drift or hostnames" -v`
|
||||
Expected: PASS (4 passed).
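The drift check boils down to two set differences over workload names. The sketch below uses made-up hostnames to show both directions, plus one case the check does not cover when the Terraform and inventory hostnames are merged into a single set first.

```python
# Hypothetical names, for illustration only.
declared = {"dns1", "dns2", "forgejo"}   # workload column of reference.md
tf_hosts = {"forgejo"}                   # keys of the Terraform `vms` output
inv_hosts = {"forgejo", "dns1", "proxy"} # keys of _meta.hostvars

known = tf_hosts | inv_hosts

only_in_reference = sorted(declared - known)  # documented, not provisioned
only_in_live = sorted(known - declared)       # provisioned, not documented
print(only_in_reference, only_in_live)        # ['dns2'] ['proxy']

# Not reported by the two differences above: hosts that only one of the
# two live sources knows about.
tf_inv_mismatch = sorted(tf_hosts ^ inv_hosts)
print(tf_inv_mismatch)                        # ['dns1', 'proxy']
```

Here `dns1` is in the inventory but not in Terraform, yet it raises no warning because the union hides the disagreement. That looks like a deliberate simplification for a doc-versus-live check; a separate Terraform-versus-inventory comparison would be a different feature.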
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add scripts/capacity-scan.py tests/test_capacity_scan.py
|
||||
git commit -m "Add hostname parsers + find_drift() to capacity-scan.py"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 5: Subprocess glue + usage stub + `main()`
|
||||
|
||||
**Files:**
|
||||
- Modify: `scripts/capacity-scan.py`
|
||||
- Modify: `tests/test_capacity_scan.py`
|
||||
|
||||
- [ ] **Step 1: Write the failing test (append)**
|
||||
|
||||
```python
import json as _json


def test_gather_usage_is_stubbed_unavailable():
    usage = cs.gather_usage()
    assert usage["available"] is False
    assert "reason" in usage


def test_known_hostnames_degrades_to_empty(monkeypatch):
    # Simulate terraform/ansible-inventory being absent or failing.
    def boom(*a, **k):
        raise FileNotFoundError("no such tool")

    monkeypatch.setattr(cs.subprocess, "run", boom)
    assert cs.known_hostnames("staging") == set()


def test_main_emits_valid_json_against_real_reference(monkeypatch, capsys):
    # Isolate from the host: no real terraform/ansible needed.
    monkeypatch.setattr(cs, "known_hostnames", lambda env: set())
    monkeypatch.setattr("sys.argv", ["capacity-scan.py"])
    cs.main()
    out = _json.loads(capsys.readouterr().out)
    assert set(out) == {"nodes", "workloads", "usage", "warnings"}
    assert out["usage"]["available"] is False
    assert "pve0" in out["nodes"]  # from the skeleton reference.md (Task 1)
```
|
||||
|
||||
- [ ] **Step 2: Run test to verify it fails**
|
||||
|
||||
Run: `python3 -m pytest tests/test_capacity_scan.py -k "usage or known_hostnames or main" -v`
|
||||
Expected: FAIL — `gather_usage` / `known_hostnames` / `main` not defined.
|
||||
|
||||
- [ ] **Step 3: Write minimal implementation (append)**
|
||||
|
||||
```python
def gather_usage():
    """FUTURE: live per-VM CPU/RAM/disk usage history. Requires the physical
    cluster online; source UNDECIDED (Proxmox RRD vs Prometheus/Loki/Grafana —
    see docs/TODO.md 8.4). Until then the evaluator reasons on declared intent."""
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}


def _run_json(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def known_hostnames(env):
    """Union of hostnames from Terraform output and Ansible inventory. Each
    source is best-effort: missing tool / no state / bad JSON yields nothing."""
    hosts = set()
    tf_dir = os.path.join(REPO_ROOT, "terraform", "environments", env)
    try:
        hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
    except Exception:
        pass
    inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
    try:
        hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
    except Exception:
        pass
    return hosts


def main():
    parser = argparse.ArgumentParser(description="Deterministic capacity facts for /capacity-review.")
    parser.add_argument("--env", default="staging")
    parser.add_argument(
        "--reference",
        default=os.path.join(REPO_ROOT, "docs", "hardware", "reference.md"),
    )
    args = parser.parse_args()

    with open(args.reference, encoding="utf-8") as fh:
        markdown = fh.read()

    node_rows = parse_table(markdown, ["node", "cores", "ram_gb", "disk_gb"])
    workload_rows = parse_table(markdown, ["workload", "node", "cores", "ram_mb", "disk_gb"])
    nodes = compute_rollup(node_rows, workload_rows)
    warnings = find_drift(workload_rows, known_hostnames(args.env))

    json.dump(
        {"nodes": nodes, "workloads": workload_rows, "usage": gather_usage(), "warnings": warnings},
        sys.stdout,
        indent=2,
        sort_keys=True,
    )
    sys.stdout.write("\n")


if __name__ == "__main__":
    main()
```
|
||||
|
||||
- [ ] **Step 4: Run the full test file**
|
||||
|
||||
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
|
||||
Expected: PASS (all tests).
|
||||
|
||||
- [ ] **Step 5: Smoke-run the script end to end**
|
||||
|
||||
Run: `python3 scripts/capacity-scan.py | python3 -m json.tool`
|
||||
Expected: valid JSON with `nodes.pve0`, a `workloads` list, `usage.available: false`, and a `warnings` array (likely empty with no Terraform state).
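For a quick sanity check beyond `json.tool`, the scan output can be reduced to the two facts a reviewer needs first: the basis of the report and any oversubscribed nodes. This is a sketch only; `SCAN` is a hand-written sample shaped like the four keys `main()` emits, not real output.

```python
import json

# Hypothetical scan output; node names and numbers are illustrative.
SCAN = json.loads("""
{
  "nodes": {
    "pve0": {"alloc_cores": 5, "cores": 20, "oversubscribed": false, "ram_headroom_pct": 87},
    "tiny": {"alloc_cores": 4, "cores": 2, "oversubscribed": true, "ram_headroom_pct": 75}
  },
  "usage": {"available": false, "reason": "cluster not provisioned (see STATUS.md)"},
  "warnings": [],
  "workloads": []
}
""")

# The report must say which basis its recommendations rest on.
basis = "usage-based" if SCAN["usage"]["available"] else "intent-based"

# Nodes the rollup flagged, in stable order.
hot = sorted(name for name, node in SCAN["nodes"].items() if node["oversubscribed"])

print(basis, hot)  # intent-based ['tiny']
```

In practice the same two lines could read from `sys.stdin` instead of a literal, so the script's output can be piped straight in.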
|
||||
|
||||
- [ ] **Step 6: Commit**
|
||||
|
||||
```bash
|
||||
git add scripts/capacity-scan.py tests/test_capacity_scan.py
|
||||
git commit -m "Complete capacity-scan.py: usage stub, subprocess glue, main()"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 6: The `/capacity-review` skill
|
||||
|
||||
**Files:**
|
||||
- Create: `.claude/commands/capacity-review.md`
|
||||
|
||||
- [ ] **Step 1: Confirm the existing command pattern**
|
||||
|
||||
Run: `ls .claude/commands/ && sed -n '1,20p' .claude/commands/review-repo.md`
|
||||
Expected: lists existing commands; shows the frontmatter/structure to mirror.
|
||||
|
||||
- [ ] **Step 2: Write `.claude/commands/capacity-review.md`**
|
||||
|
||||
Mirror the frontmatter style of `review-repo.md` (adjust `description`/`allowed-tools` to match that file's actual keys). Body:
|
||||
|
||||
```markdown
---
description: Evaluate hardware capacity and placement; recommend optimizations
---

# /capacity-review

Evaluate the homelab's hardware capacity and workload placement, and recommend
optimizations. On-demand only (scheduling is deferred — see docs/TODO.md 8.4).

## Steps

1. **Gather facts.** Run `python3 scripts/capacity-scan.py` and parse its JSON
   (`nodes`, `workloads`, `usage`, `warnings`). If `usage.available` is false,
   note in the report that recommendations are **intent-based, not usage-based**.
2. **Read intent.** Read `docs/hardware/reference.md` for the free-text columns
   the scan does not parse: `criticality`, `ha_intent`, `profile`, `constraints`,
   `growth`, plus the "Capacity notes" section.
3. **Reason across dimensions.** Produce recommendations, each tagged with its
   type and the basis it rests on (declared intent vs measured usage):
   - **HA / redundancy** — anti-affinity violations (e.g. an HA pair sharing one
     node), single points of failure, HA that looks like overkill, or a
     high-criticality workload with no redundancy.
   - **Right-sizing** — over/under-provisioned workloads. Today this is
     intent-based (allocation vs `profile`); flag that it becomes usage-based
     once the `gather_usage()` hook is live.
   - **Placement / moves** — oversubscribed nodes (`oversubscribed: true`, low
     `ram_headroom_pct`) or constraint-driven relocations.
   - **Upgrade timing** — `growth` notes vs headroom → rough runway.
   - **Drift** — surface every entry in the scan's `warnings` array.
4. **Write the report.** Save to `docs/hardware/reviews/YYYY-MM-DD-capacity.md`
   and copy it to `docs/hardware/reviews/latest.md`. Structure: a one-line
   summary, then a section per dimension with concrete, actionable items. State
   the basis (intent vs usage) on every recommendation.
```
|
||||
|
||||
- [ ] **Step 3: Verify the file is well-formed**
|
||||
|
||||
Run: `head -5 .claude/commands/capacity-review.md`
|
||||
Expected: frontmatter block present and consistent with `review-repo.md`'s keys.
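The command body's last step (write a dated report, then copy it to `latest.md`) is carried out by the skill itself, not by a script. If it ever needed to be scripted, it could look roughly like the sketch below. `write_report` is a hypothetical helper, not part of this plan, and the demo writes into a temporary directory instead of `docs/hardware/reviews/`.

```python
import datetime
import pathlib
import shutil
import tempfile


def write_report(reviews_dir, body, today=None):
    """Write YYYY-MM-DD-capacity.md and refresh latest.md; return the dated path."""
    today = today or datetime.date.today()
    reviews_dir = pathlib.Path(reviews_dir)
    reviews_dir.mkdir(parents=True, exist_ok=True)
    dated = reviews_dir / f"{today.isoformat()}-capacity.md"
    dated.write_text(body, encoding="utf-8")
    # latest.md is a plain copy, so it still works where symlinks do not.
    shutil.copyfile(dated, reviews_dir / "latest.md")
    return dated


with tempfile.TemporaryDirectory() as tmp:
    path = write_report(tmp, "# Capacity review\n", datetime.date(2026, 6, 1))
    name = path.name
    latest_matches = (path.parent / "latest.md").read_text(encoding="utf-8") == "# Capacity review\n"

print(name, latest_matches)  # 2026-06-01-capacity.md True
```

Running the review twice on the same day overwrites that day's file under this naming scheme, which matches the one-report-per-date pattern the filename implies.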
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
|
||||
git add .claude/commands/capacity-review.md
|
||||
git commit -m "Add /capacity-review skill"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 7: ADR-012, STATUS, CLAUDE.md, scripts/README
|
||||
|
||||
**Files:**
|
||||
- Create: `docs/decisions/012-hardware-capacity.md`
|
||||
- Modify: `STATUS.md`
|
||||
- Modify: `CLAUDE.md`
|
||||
- Modify: `scripts/README.md`
|
||||
|
||||
- [ ] **Step 1: Write `docs/decisions/012-hardware-capacity.md`**
|
||||
|
||||
Match the heading style of an existing ADR (`sed -n '1,15p' docs/decisions/010-forgejo-ci.md` first). Content:
|
||||
|
||||
```markdown
|
||||
# ADR-012 — Hardware reference & capacity evaluation
|
||||
|
||||
## Context
|
||||
|
||||
The repo modelled the logical/network layer (Terraform VM specs, ADR-007
|
||||
topology) but not the physical layer — node CPU/RAM/disk capacity, network gear,
|
||||
or which workloads are designed to run where with what headroom. There was also
|
||||
no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a
|
||||
workload that should move, or a node due an upgrade.
|
||||
|
||||
## Decision
|
||||
|
||||
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
|
||||
physical compute + network gear and workload placement intent. Two
|
||||
machine-readable tables (node capacity, workload placement) carry the numbers.
|
||||
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
|
||||
parses those tables, computes per-node allocated-vs-physical rollups, and
|
||||
cross-checks workload hostnames against `terraform output -json` /
|
||||
`ansible-inventory --list` to surface drift.
|
||||
- `/capacity-review` reads the scan + intent columns and writes a dated report to
|
||||
`docs/hardware/reviews/`, mirroring `/review-repo` → `docs/reviews/`.
|
||||
- Numeric allocations live in `reference.md`, not Terraform: the current
|
||||
`terraform output` exposes only `{ip, group}`. Terraform/inventory are used
|
||||
only for hostname-drift cross-checks.
|
||||
- **Live usage stats are a future hook.** The cluster is not stood up;
|
||||
`gather_usage()` returns `available: false` and the evaluator reasons on
|
||||
declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/
|
||||
Alloy) is undecided — see docs/TODO.md 8.4, to be settled before any hook is
|
||||
built.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Right-sizing advice is intent-based until usage data exists; reports say so.
|
||||
- `reference.md` table headers are a parser contract — changing them needs a
|
||||
matching `capacity-scan.py` change.
|
||||
|
||||
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF↔Ansible handoff).
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Add STATUS.md rows**
|
||||
|
||||
In `STATUS.md`, add to the "Real and working today" table:
|
||||
|
||||
```markdown
|
||||
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
|
||||
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
|
||||
```
|
||||
|
||||
And to the "Designed but not built" table:
|
||||
|
||||
```markdown
|
||||
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Add the CLAUDE.md command row + further-reading pointer**
|
||||
|
||||
In `CLAUDE.md` "Key commands" table, add:
|
||||
|
||||
```markdown
|
||||
| Review hardware capacity | `/capacity-review` (Claude command) |
|
||||
```
|
||||
|
||||
In the "Further reading" table, add:
|
||||
|
||||
```markdown
|
||||
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
|
||||
```
|
||||
|
||||
- [ ] **Step 4: Document the script in scripts/README.md**
|
||||
|
||||
Add under the existing list in `scripts/README.md`:
|
||||
|
||||
```markdown
|
||||
- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses
|
||||
the machine-readable tables in `docs/hardware/reference.md`, computes per-node
|
||||
allocated-vs-physical rollups, and cross-checks workload hostnames against
|
||||
Terraform output / Ansible inventory for drift. Emits JSON. See **ADR-012**.
|
||||
```
|
||||
|
||||
- [ ] **Step 5: Verify references resolve**
|
||||
|
||||
Run: `python3 scripts/repo-scan.py | python3 -c "import json,sys; d=json.load(sys.stdin); print('broken_refs:', [f for f in d.get('findings',{}).get('broken_refs',[]) if '012' in str(f) or 'hardware' in str(f)])"`
|
||||
Expected: no broken refs mentioning ADR-012 or the hardware paths (empty list). If the scan's JSON shape differs, instead run `python3 scripts/repo-scan.py >/dev/null && echo OK` and eyeball the findings.
|
||||
|
||||
- [ ] **Step 6: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/decisions/012-hardware-capacity.md STATUS.md CLAUDE.md scripts/README.md
|
||||
git commit -m "Record ADR-012 + STATUS/CLAUDE/scripts docs for capacity tooling"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 8: Final verification
|
||||
|
||||
**Files:** none (verification only)
|
||||
|
||||
- [ ] **Step 1: Run the full unit-test suite**
|
||||
|
||||
Run: `python3 -m pytest tests/test_capacity_scan.py -v`
|
||||
Expected: all tests pass.
|
||||
|
||||
- [ ] **Step 2: Run the lint suite**
|
||||
|
||||
Run: `make lint`
|
||||
Expected: passes (markdown/script changes do not break ansible-lint/yamllint).
|
||||
|
||||
- [ ] **Step 3: End-to-end scan**
|
||||
|
||||
Run: `python3 scripts/capacity-scan.py`
|
||||
Expected: valid JSON; `nodes.pve0` present; `usage.available: false`.
|
||||
|
||||
- [ ] **Step 4: Confirm working tree is clean**
|
||||
|
||||
Run: `git status --short`
|
||||
Expected: no uncommitted changes from this plan (pre-existing FRICTION.md / ADR-011 may remain — leave them).
|
||||
```
|
||||
168
docs/superpowers/specs/2026-06-01-hardware-capacity-design.md
Normal file
|
|
@ -0,0 +1,168 @@
|
|||
# Design — Hardware reference & capacity evaluation
|
||||
|
||||
_Date: 2026-06-01 · Status: approved for planning_
|
||||
|
||||
## Problem
|
||||
|
||||
The repo documents the **logical/network** layer well — Terraform declares per-VM
|
||||
`cores`/`memory_mb`/`disk_size_gb`, and ADR-007 records VLANs, IPs, and topology.
|
||||
But the **physical** layer is undocumented: how many Proxmox nodes physically
|
||||
exist, their real CPU/RAM/disk capacity, storage pools, the network gear, and
|
||||
`askari`. Nothing records "this node has 64 GB, X is allocated, Y is free," and
|
||||
nothing evaluates whether the design is well-proportioned — e.g. a service that
|
||||
needn't be HA, a workload that should move nodes, or a node due a RAM/disk
|
||||
upgrade.
|
||||
|
||||
## Goal
|
||||
|
||||
1. A single, human-first **hardware reference document** capturing physical
|
||||
compute + network gear and the intended workload placement.
|
||||
2. A **capacity evaluator** ("script + skill") that reasons about optimization:
|
||||
HA overkill / missing redundancy, right-sizing, placement moves, and
|
||||
upgrade timing — emitting a dated report.
|
||||
|
||||
## Scope
|
||||
|
||||
- **In:** Proxmox compute nodes (`pve0..2`) + `askari`; network gear (OPNsense,
|
||||
managed switch, APs); per-workload placement intent.
|
||||
- **Out (for now):** power/UPS budget, NAS, cabling, rack layout, asset
|
||||
register, warranty/serial tracking.
|
||||
|
||||
## Non-negotiable repo conventions this must honor
|
||||
|
||||
- Mirror the existing `repo-scan.py` → `/review-repo` → `docs/reviews/` triad
|
||||
(deterministic scan feeds a judgement skill; report is dated markdown).
|
||||
- Utility scripts are **stdlib-only** for run-anywhere portability (control
|
||||
node, CI, bare clone, no venv). See TODO #14 for the standing reevaluation.
|
||||
- Be honest about real-vs-planned (STATUS.md). The physical cluster is **not
|
||||
stood up yet**, so live usage stats are a documented future hook, not a
|
||||
current capability.
|
||||
|
||||
## Architecture
|
||||
|
||||
Four pieces, plus tracking updates.
|
||||
|
||||
### 1. Reference doc — `docs/hardware/reference.md`
|
||||
|
||||
One hand-maintained markdown file, the source of truth for physical facts and
|
||||
placement intent. Four parts:
|
||||
|
||||
1. **Physical compute** — one subsection per node (`pve0..2`, `askari`):
|
||||
model/form factor, CPU (cores/threads), RAM total (+ max & free DIMM slots),
|
||||
storage (disks → pools, e.g. `local-zfs` / `local-lvm`), NICs, notes.
|
||||
2. **Network gear** — OPNsense box, managed switch, APs: model, port/PoE
|
||||
counts, throughput, uplinks. Short table.
|
||||
3. **Workload placement & intent** — one row per planned VM/service, columns:
|
||||
`Service | Home node | Criticality | HA intent | Resource profile |
|
||||
Placement constraints | Growth notes`. These columns map onto the four
|
||||
attribute groups chosen during brainstorming and give the evaluator concrete
|
||||
intent to judge against (e.g. anti-affinity: `dns1`/`dns2` on different
|
||||
nodes).
|
||||
4. **Capacity summary** — per-node "allocated vs physical" rollup (RAM / cores /
|
||||
disk, headroom %).
|
||||
|
||||
Node-capacity tables use a **strict, documented format** so the scan script can
|
||||
parse the numbers without a YAML dependency.
|
||||
|
||||
### 2. Scan script — `scripts/capacity-scan.py`
|
||||
|
||||
Stdlib-only, deterministic, JSON to stdout (like `repo-scan.py`). Avoids
|
||||
hand-parsing YAML by shelling out for JSON, the pattern `tf_to_inventory.py`
|
||||
already uses.
|
||||
|
||||
Gathers **today**:
|
||||
- **Declared allocations** — `terraform output -json` (and/or the `.tf` module
|
||||
calls) for each VM's cores/RAM/disk; degrades gracefully when Terraform has no
|
||||
real VMs yet (current reality) instead of failing.
|
||||
- **Inventory hosts** — `ansible-inventory -i inventories/<env>/hosts.yml
|
||||
--list` → JSON.
|
||||
- **Physical capacities** — parses the strict node tables in `reference.md`.
|
||||
- **Rollup math** — per node: allocated vs physical, headroom %,
|
||||
`oversubscribed` flag.
|
||||
- **Drift warnings** — e.g. `reference.md` lists a host no Terraform VM
|
||||
declares; surfaced in a `warnings[]` array (free doc↔Terraform drift check).
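The "shell out for JSON, degrade gracefully" pattern named above can be sketched without Terraform or Ansible installed. `best_effort_json` is an illustrative helper name, and the commands are stand-ins that use the current Python interpreter; the real script would call `terraform output -json` and `ansible-inventory --list`.

```python
import json
import subprocess
import sys


def best_effort_json(cmd):
    """Run cmd and parse its stdout as JSON; return None on any failure."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(done.stdout)
    except (OSError, subprocess.CalledProcessError, json.JSONDecodeError):
        # Missing tool, non-zero exit, or output that is not JSON.
        return None


# Stand-in for a tool that prints JSON.
ok = best_effort_json(
    [sys.executable, "-c", "import json; print(json.dumps({'vms': {'value': {}}}))"]
)
# Stand-in for a tool that is not installed.
missing = best_effort_json(["definitely-not-a-real-tool-xyz", "--list"])
# Stand-in for a tool that exits with an error (e.g. no state yet).
failed = best_effort_json([sys.executable, "-c", "import sys; sys.exit(3)"])

print(ok, missing, failed)  # {'vms': {'value': {}}} None None
```

Catching only these three exception types keeps genuine programming errors visible while still tolerating an unprovisioned environment.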
|
||||
|
||||
**Stubbed future hook** (honest, à la STATUS.md):
|
||||
```python
# FUTURE: live usage stats (per-VM CPU/RAM/disk history).
# Requires the physical cluster online. Source UNDECIDED — see "Open decisions".
def gather_usage():
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}
```
|
||||
|
||||
Output sketch:
|
||||
```json
{
  "nodes": {"pve0": {"ram_gb": 64, "ram_allocated_gb": 12, "headroom_pct": 81, "oversubscribed": false}},
  "workloads": [{"name": "forgejo", "node": "pve1", "cores": 2, "memory_mb": 4096}],
  "usage": {"available": false, "reason": "cluster not provisioned"},
  "warnings": ["reference.md lists dns1 but no Terraform VM declares it"]
}
```
|
||||
|
||||
### 3. Evaluator skill — `/capacity-review`
|
||||
|
||||
A skill in `.claude/` (mirrors `/review-repo`), on-demand. Flow:
|
||||
|
||||
1. Run `python3 scripts/capacity-scan.py` → JSON.
|
||||
2. Read `docs/hardware/reference.md` for intent columns the math can't capture.
|
||||
3. Reason across dimensions, each recommendation **tagged by type** and stating
|
||||
**what it is based on** (declared intent vs measured usage):
|
||||
- **HA / redundancy** — anti-affinity violations, SPOFs, HA-overkill,
|
||||
critical-but-unredundant services.
|
||||
- **Right-sizing** — over/under-provisioned VMs. *Intent-based today*;
|
||||
explicitly upgradeable to usage-based once the usage hook is live.
|
||||
- **Placement / moves** — oversubscribed nodes, constraint-driven relocation.
|
||||
- **Upgrade timing** — growth notes vs headroom → rough runway.
|
||||
- **Drift** — surfaces the scan's `warnings[]`.
|
||||
4. Write `docs/hardware/reviews/YYYY-MM-DD-capacity.md` (+ `latest.md`),
|
||||
mirroring `docs/reviews/`.
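The anti-affinity check in the flow above is a judgement call made from free-text intent, but its core can be illustrated mechanically. In this sketch `ha_group` is an invented field used only for the example; `reference.md` records HA intent as free text, so a real check would first have to interpret that column.

```python
from collections import defaultdict

# Hypothetical rows; "ha_group" is not a reference.md column.
workloads = [
    {"workload": "dns1", "node": "pve0", "ha_group": "dns"},
    {"workload": "dns2", "node": "pve0", "ha_group": "dns"},
    {"workload": "forgejo", "node": "pve1", "ha_group": ""},
]


def anti_affinity_violations(rows):
    """Return {group: {node: [workloads]}} for groups with 2+ members on one node."""
    placed = defaultdict(lambda: defaultdict(list))
    for row in rows:
        if row["ha_group"]:
            placed[row["ha_group"]][row["node"]].append(row["workload"])
    return {
        group: {node: names for node, names in by_node.items() if len(names) > 1}
        for group, by_node in placed.items()
        if any(len(names) > 1 for names in by_node.values())
    }


violations = anti_affinity_violations(workloads)
print(violations)  # {'dns': {'pve0': ['dns1', 'dns2']}}
```

Moving `dns2` to another node empties the result, which is the outcome the design's `dns1`/`dns2` example asks for.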
|
||||
|
||||
### 4. Recording — ADR + STATUS + CLAUDE.md
|
||||
|
||||
- **ADR-012 — Hardware reference & capacity evaluation**
|
||||
(`docs/decisions/012-hardware-capacity.md`): records the decision and
|
||||
rationale; cross-links ADR-001 / ADR-007 / ADR-009. Names the usage-source as
|
||||
an open decision (below).
|
||||
- **STATUS.md** rows: `reference.md` + `capacity-scan.py` → real/working
|
||||
(skeleton); `/capacity-review` → working, intent-only; live usage → designed,
|
||||
not built.
|
||||
- **CLAUDE.md**: a "Review capacity/hardware → `/capacity-review`" commands-table
|
||||
row + a "Further reading" pointer to ADR-012.
|
||||
|
||||
## Data flow
|
||||
|
||||
```
reference.md ──┐
               ├─→ capacity-scan.py ──→ scan JSON ──┐
terraform ─────┤   (stdlib, JSON-via-subprocess)    ├─→ /capacity-review ─→ docs/hardware/reviews/
inventory ─────┘                                    │      (judgement)
reference.md (intent columns) ──────────────────────┘
```
|
||||
|
||||
## Open decisions (deferred, tracked in TODO)
|
||||
|
||||
- **Usage-stats source** (TODO 8.4): **Proxmox RRD** (built-in, no extra infra)
|
||||
vs the **Prometheus/Loki/Grafana/Grafana-Alloy** stack we will likely run
|
||||
anyway (richer, per-process, more to operate; see TODO 3.6). **Decide before
|
||||
building any usage hook** to avoid throwaway work.
|
||||
- **Script dependency policy** (TODO #14): whether stdlib-only remains the rule
|
||||
for utility scripts or libraries (e.g. PyYAML) are selectively allowed.
|
||||
- **Scheduling** (TODO 8.4): `/capacity-review` is on-demand now; cron later.
|
||||
|
||||
## Deliverables & state at delivery
|
||||
|
||||
| Piece | Path | State |
|
||||
|---|---|---|
|
||||
| Reference doc | `docs/hardware/reference.md` | Skeleton + real node data |
|
||||
| Scan script | `scripts/capacity-scan.py` | Working (stdlib, usage hook stubbed) |
|
||||
| Evaluator skill | `/capacity-review` → `docs/hardware/reviews/` | Working, intent-based |
|
||||
| Decision record | `docs/decisions/012-hardware-capacity.md` | New ADR |
|
||||
| Tracking | STATUS.md, CLAUDE.md, TODO #14 + 8.4 | Updated |
|
||||
|
||||
## Out of scope / YAGNI
|
||||
|
||||
- No usage-stats collection until the cluster exists and the source is decided.
|
||||
- No structured-data (YAML) source of truth — markdown is the single hand-edited
|
||||
source by choice; revisit only if parsing pain demands it.
|
||||
- No automated moves/remediation — the evaluator recommends; humans act.
|
||||
|
|
@ -11,3 +11,7 @@ dependencies (keeps them runnable anywhere without a venv).
|
|||
plaintext secrets.
|
||||
- `repo-scan.py` — Phase-0 deterministic scan for `/review-repo` (markers, broken
|
||||
refs, unencrypted vaults, inventory).
|
||||
- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses
|
||||
the machine-readable tables in `docs/hardware/reference.md`, computes per-node
|
||||
allocated-vs-physical rollups, and cross-checks workload hostnames against
|
||||
Terraform output / Ansible inventory for drift. Emits JSON. See **ADR-012**.
|
||||
|
|
|
|||
168
scripts/capacity-scan.py
Normal file
|
|
@ -0,0 +1,168 @@
|
|||
#!/usr/bin/env python3
|
||||
"""capacity-scan.py — deterministic capacity facts for /capacity-review.
|
||||
|
||||
Python standard library only. Emits a JSON object to stdout.
|
||||
|
||||
Reads physical capacities and workload allocations from the machine-readable
|
||||
tables in docs/hardware/reference.md, computes per-node allocated-vs-physical
|
||||
rollups, and cross-checks workload hostnames against `terraform output -json`
|
||||
and `ansible-inventory --list` to surface drift. Degrades gracefully when
|
||||
nothing is provisioned. Live usage stats are a documented future hook.
|
||||
|
||||
Usage: python3 scripts/capacity-scan.py [--env staging] [--reference PATH]
|
||||
"""
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||
|
||||
|
||||
def parse_table(markdown, required_cols):
    """Return row dicts for the first markdown table whose header contains all
    required_cols. Keys are header names; values are raw cell strings.
    Rows whose cell count does not match the header are skipped."""
    lines = markdown.splitlines()
    required = set(required_cols)
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = [c.strip() for c in line.strip("|").split("|")]
        if not required.issubset(set(headers)):
            continue
        rows = []
        # i + 2 skips the header's GFM separator row (|---|---|)
        for body in lines[i + 2:]:
            if not body.strip().startswith("|"):
                break
            cells = [c.strip() for c in body.strip().strip("|").split("|")]
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []


def compute_rollup(node_rows, workload_rows):
    """Per node: physical totals, summed allocations, RAM headroom %, and an
    oversubscribed flag. Workloads on unknown nodes are ignored."""
    nodes = {}
    for r in node_rows:
        nodes[r["node"]] = {
            "cores": int(r["cores"]),
            "ram_gb": float(r["ram_gb"]),
            "disk_gb": float(r["disk_gb"]),
            "alloc_cores": 0,
            "alloc_ram_mb": 0,
            "alloc_disk_gb": 0.0,
        }
    for w in workload_rows:
        node = nodes.get(w["node"])
        if node is None:
            continue
        node["alloc_cores"] += int(w["cores"])
        node["alloc_ram_mb"] += int(w["ram_mb"])
        node["alloc_disk_gb"] += float(w["disk_gb"])
    for node in nodes.values():
        node["alloc_ram_gb"] = round(node.pop("alloc_ram_mb") / 1024, 1)
        node["ram_headroom_pct"] = (
            round(100 * (node["ram_gb"] - node["alloc_ram_gb"]) / node["ram_gb"])
            if node["ram_gb"]
            else 0
        )
        node["oversubscribed"] = (
            node["alloc_cores"] > node["cores"]
            or node["alloc_ram_gb"] > node["ram_gb"]
            or node["alloc_disk_gb"] > node["disk_gb"]
        )
    return nodes


def parse_tf_hostnames(tf_json):
    """Hostnames from `terraform output -json` (the `vms` map keys)."""
    data = json.loads(tf_json)
    return set(data.get("vms", {}).get("value", {}).keys())


def parse_inventory_hostnames(inv_json):
    """Hostnames from `ansible-inventory --list` (_meta.hostvars keys)."""
    data = json.loads(inv_json)
    return set(data.get("_meta", {}).get("hostvars", {}).keys())


def find_drift(workload_rows, known_hostnames):
    """Warn when reference.md workloads and live hostnames disagree. Silent when
    no hostnames are known (pre-provisioning): nothing to compare against."""
    warnings = []
    declared = {w["workload"] for w in workload_rows}
    if not known_hostnames:
        return warnings
    for name in sorted(declared - known_hostnames):
        warnings.append(
            f"reference.md lists '{name}' but no Terraform/inventory host declares it"
        )
    for name in sorted(known_hostnames - declared):
        warnings.append(
            f"host '{name}' exists in Terraform/inventory but is absent from reference.md"
        )
    return warnings


def gather_usage():
    """FUTURE: live per-VM CPU/RAM/disk usage history. Requires the physical
    cluster online; source UNDECIDED (Proxmox RRD vs Prometheus/Loki/Grafana;
    see docs/TODO.md 8.4). Until then the evaluator reasons on declared intent."""
    return {"available": False, "reason": "cluster not provisioned (see STATUS.md)"}


def _run_json(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def known_hostnames(env):
    """Union of hostnames from Terraform output and Ansible inventory. Each
    source is best-effort: missing tool / no state / bad JSON yields nothing."""
    hosts = set()
    tf_dir = os.path.join(REPO_ROOT, "terraform", "environments", env)
    try:
        hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
    except (OSError, subprocess.CalledProcessError, ValueError):
        pass
    inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
    try:
        hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
    except (OSError, subprocess.CalledProcessError, ValueError):
        pass
    return hosts


def main():
    parser = argparse.ArgumentParser(description="Deterministic capacity facts for /capacity-review.")
    parser.add_argument("--env", default="staging")
    parser.add_argument(
        "--reference",
        default=os.path.join(REPO_ROOT, "docs", "hardware", "reference.md"),
    )
    args = parser.parse_args()

    with open(args.reference, encoding="utf-8") as fh:
        markdown = fh.read()

    node_rows = parse_table(markdown, ["node", "cores", "ram_gb", "disk_gb"])
    workload_rows = parse_table(markdown, ["workload", "node", "cores", "ram_mb", "disk_gb"])
    nodes = compute_rollup(node_rows, workload_rows)
    warnings = find_drift(workload_rows, known_hostnames(args.env))

    json.dump(
        {"nodes": nodes, "workloads": workload_rows, "usage": gather_usage(), "warnings": warnings},
        sys.stdout,
        indent=2,
        sort_keys=True,
    )
    sys.stdout.write("\n")


if __name__ == "__main__":
    main()
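
The RAM headroom arithmetic in `compute_rollup` can be checked by hand. The sketch below isolates that expression in a standalone function; `ram_headroom_pct` as a function name is hypothetical (in the script it is a dictionary key computed inline), but the rounding steps are copied from the code above.

```python
def ram_headroom_pct(ram_gb, alloc_ram_mb):
    """Hypothetical helper mirroring the expression used in compute_rollup."""
    alloc_ram_gb = round(alloc_ram_mb / 1024, 1)  # MB -> GB, one decimal place
    if not ram_gb:
        return 0  # a node with no declared RAM would otherwise divide by zero
    return round(100 * (ram_gb - alloc_ram_gb) / ram_gb)


# Two workloads of 512 MB and 8192 MB on a 64 GB node: 8.5 GB allocated.
headroom = ram_headroom_pct(64.0, 512 + 8192)
print(headroom)  # 87

# Allocating 6 GB on a 4 GB node yields a negative percentage.
negative = ram_headroom_pct(4.0, 6144)
print(negative)  # -50
```

One consequence of rounding the allocation to one decimal place before comparing: an overshoot smaller than about 0.05 GB rounds back down to the physical figure, so it would not set the `oversubscribed` flag on RAM alone.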

109
tests/test_capacity_scan.py
Normal file
@ -0,0 +1,109 @@
import importlib.util
import json as _json
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "capacity-scan.py"
_spec = importlib.util.spec_from_file_location("capacity_scan", _PATH)
cs = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(cs)


def test_parse_table_keys_on_header_and_ignores_extra_cols():
    md = """
intro text
| node | cores | ram_gb | disk_gb | notes |
|------|-------|--------|---------|-------|
| pve0 | 20 | 64 | 4000 | nvme |
| pve1 | 20 | 64 | 4000 | nvme |

trailing text
"""
    rows = cs.parse_table(md, ["node", "cores", "ram_gb", "disk_gb"])
    assert rows == [
        {"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000", "notes": "nvme"},
        {"node": "pve1", "cores": "20", "ram_gb": "64", "disk_gb": "4000", "notes": "nvme"},
    ]


def test_parse_table_returns_empty_when_header_absent():
    assert cs.parse_table("no tables here", ["node", "cores"]) == []


def test_compute_rollup_sums_allocations_and_flags_headroom():
    node_rows = [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}]
    workload_rows = [
        {"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"},
        {"workload": "forgejo", "node": "pve0", "cores": "4", "ram_mb": "8192", "disk_gb": "100"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    pve0 = nodes["pve0"]
    assert pve0["alloc_cores"] == 5
    assert pve0["alloc_ram_gb"] == 8.5  # (512 + 8192) / 1024
    assert pve0["alloc_disk_gb"] == 110.0
    assert pve0["ram_headroom_pct"] == 87  # round(100 * (64 - 8.5) / 64)
    assert pve0["oversubscribed"] is False


def test_compute_rollup_flags_oversubscription():
    node_rows = [{"node": "tiny", "cores": "2", "ram_gb": "4", "disk_gb": "50"}]
    workload_rows = [
        {"workload": "hog", "node": "tiny", "cores": "4", "ram_mb": "1024", "disk_gb": "10"},
    ]
    nodes = cs.compute_rollup(node_rows, workload_rows)
    assert nodes["tiny"]["oversubscribed"] is True  # 4 cores > 2


def test_compute_rollup_ignores_workloads_on_unknown_nodes():
    nodes = cs.compute_rollup(
        [{"node": "pve0", "cores": "20", "ram_gb": "64", "disk_gb": "4000"}],
        [{"workload": "ghost", "node": "nope", "cores": "1", "ram_mb": "512", "disk_gb": "10"}],
    )
    assert nodes["pve0"]["alloc_cores"] == 0


def test_parse_tf_hostnames_reads_vms_value_keys():
    tf_json = '{"vms": {"value": {"dns1": {"ip": "10.20.0.10", "group": "docker_hosts"}}}}'
    assert cs.parse_tf_hostnames(tf_json) == {"dns1"}


def test_parse_inventory_hostnames_reads_meta_hostvars():
    inv_json = '{"_meta": {"hostvars": {"dns1": {}, "proxy": {}}}}'
    assert cs.parse_inventory_hostnames(inv_json) == {"dns1", "proxy"}


def test_find_drift_reports_both_directions():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    warnings = cs.find_drift(workload_rows, {"proxy"})
    assert any("dns1" in w and "no Terraform" in w for w in warnings)
    assert any("proxy" in w and "absent from reference.md" in w for w in warnings)


def test_find_drift_silent_when_no_hostnames_known():
    workload_rows = [{"workload": "dns1", "node": "pve0", "cores": "1", "ram_mb": "512", "disk_gb": "10"}]
    assert cs.find_drift(workload_rows, set()) == []


def test_gather_usage_is_stubbed_unavailable():
    usage = cs.gather_usage()
    assert usage["available"] is False
    assert "reason" in usage


def test_known_hostnames_degrades_to_empty(monkeypatch):
    # Simulate terraform/ansible-inventory being absent or failing.
    def boom(*a, **k):
        raise FileNotFoundError("no such tool")

    monkeypatch.setattr(cs.subprocess, "run", boom)
    assert cs.known_hostnames("staging") == set()


def test_main_emits_valid_json_against_real_reference(monkeypatch, capsys):
    # Isolate from the host: no real terraform/ansible needed.
    monkeypatch.setattr(cs, "known_hostnames", lambda env: set())
    monkeypatch.setattr("sys.argv", ["capacity-scan.py"])
    cs.main()
    out = _json.loads(capsys.readouterr().out)
    assert set(out) == {"nodes", "workloads", "usage", "warnings"}
    assert out["usage"]["available"] is False
    assert "pve0" in out["nodes"]  # from the skeleton reference.md
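
The `parse_table` docstring says rows whose cell count does not match the header are skipped, which the tests above do not exercise directly. The sketch below shows that behaviour in isolation. It carries a copy of `parse_table` so it runs on its own, and the sample table (including the short `pve1` row) is invented for illustration.

```python
def parse_table(markdown, required_cols):
    """Copy of parse_table from scripts/capacity-scan.py, for a standalone run."""
    lines = markdown.splitlines()
    required = set(required_cols)
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("|"):
            continue
        headers = [c.strip() for c in line.strip("|").split("|")]
        if not required.issubset(set(headers)):
            continue
        rows = []
        for body in lines[i + 2:]:  # skip the separator row under the header
            if not body.strip().startswith("|"):
                break
            cells = [c.strip() for c in body.strip().strip("|").split("|")]
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []


# Hypothetical table: the pve1 row is missing its disk_gb cell.
MD = """
| node | cores | ram_gb | disk_gb |
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 |
| pve2 | 8 | 32 | 500 |
"""

rows = parse_table(MD, ["node", "cores", "ram_gb", "disk_gb"])
print([r["node"] for r in rows])  # ['pve0', 'pve2']
```

The short row is dropped silently rather than reported, so a hand-editing slip in `reference.md` would show up as a missing node or workload in the output, not as an error.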