diff --git a/CLAUDE.md b/CLAUDE.md
index f3f37c6..c4ae55e 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -31,6 +31,7 @@ Full design rationale: `docs/decisions/`
 | Deploy a playbook | `make deploy PLAYBOOK=` |
 | Scaffold a new role | `make new-role NAME=` |
 | Review repo for drift/cruft | `/review-repo` (Claude command) |
+| Review hardware capacity | `/capacity-review` (Claude command) |
 | Encrypt a vault file | `make encrypt FILE=` |
 | Decrypt a vault file | `make decrypt FILE=` |
 | Install Python deps | `make setup` |
@@ -170,6 +171,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 | Testing methodology | `docs/decisions/008-testing.md` |
 | TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
 | Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
+| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
 | Adding a new role | `docs/runbooks/new-role.md` |
 | Adding a new host | `docs/runbooks/new-host.md` |
 | Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
diff --git a/STATUS.md b/STATUS.md
index 3fc1c39..d17580e 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -21,6 +21,8 @@ _Last reviewed: 2026-05-30._
 | Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
 | `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
 | Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
+| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
+| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
 
 ## Scaffolded but empty — NOT implemented
 
@@ -44,6 +46,7 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
 | Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
 | Per-service roles | ADR-004 | Model defined; no service roles built |
 | Forgejo Actions CI | ADR-003 / ADR-008 | Remote is live (pushed); Actions/`act_runner` pipeline not yet built |
+| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
 
 ## Keeping this honest
 
diff --git a/docs/decisions/012-hardware-capacity.md b/docs/decisions/012-hardware-capacity.md
new file mode 100644
index 0000000..46b5c5c
--- /dev/null
+++ b/docs/decisions/012-hardware-capacity.md
@@ -0,0 +1,37 @@
+# ADR-012 — Hardware reference & capacity evaluation
+
+## Context
+
+The repo modelled the logical/network layer (Terraform VM specs, ADR-007
+topology) but not the physical layer — node CPU/RAM/disk capacity, network gear,
+or which workloads are designed to run where with what headroom. There was also
+no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a
+workload that should move, or a node due an upgrade.
+
+## Decision
+
+- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
+  physical compute + network gear and workload placement intent. Two
+  machine-readable tables (node capacity, workload placement) carry the numbers.
+- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
+  parses those tables, computes per-node allocated-vs-physical rollups, and
+  cross-checks workload hostnames against `terraform output -json` /
+  `ansible-inventory --list` to surface drift.
+- `/capacity-review` reads the scan + intent columns and writes a dated report to
+  `docs/hardware/reviews/`, mirroring `/review-repo` → `docs/reviews/`.
+- Numeric allocations live in `reference.md`, not Terraform: the current
+  `terraform output` exposes only `{ip, group}`. Terraform/inventory are used
+  only for hostname-drift cross-checks.
+- **Live usage stats are a future hook.** The cluster is not stood up;
+  `gather_usage()` returns `available: false` and the evaluator reasons on
+  declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/
+  Alloy) is undecided — see docs/TODO.md 8.4, to be settled before any hook is
+  built.
+
+## Consequences
+
+- Right-sizing advice is intent-based until usage data exists; reports say so.
+- `reference.md` table headers are a parser contract — changing them needs a
+  matching `capacity-scan.py` change.
+
+See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
diff --git a/scripts/README.md b/scripts/README.md
index 9e4233b..931191c 100644
--- a/scripts/README.md
+++ b/scripts/README.md
@@ -11,3 +11,7 @@ dependencies (keeps them runnable anywhere without a venv).
   plaintext secrets.
 - `repo-scan.py` — Phase-0 deterministic scan for `/review-repo` (markers, broken
   refs, unencrypted vaults, inventory).
+- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses
+  the machine-readable tables in `docs/hardware/reference.md`, computes per-node
+  allocated-vs-physical rollups, and cross-checks workload hostnames against
+  Terraform output / Ansible inventory for drift. Emits JSON. See **ADR-012**.
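
Note on the scan ADR-012 describes: the diff does not include `scripts/capacity-scan.py` itself, so the following is only a minimal, stdlib-only sketch of the three steps named in the ADR (parse the tables, roll up allocated vs physical per node, flag hostname drift). The column names (`node`, `workload`, `cpu`, `ram_gb`), the sample hosts, and the helper names `parse_md_table`, `rollup` and `hostname_drift` are assumptions made for illustration; only `gather_usage()` returning `available: false` is taken from the ADR text.

```python
import json
from collections import defaultdict

# Hypothetical stand-ins for the two machine-readable tables in reference.md.
NODES_MD = """
| node   | cpu | ram_gb |
|--------|-----|--------|
| node-a | 8   | 32     |
| node-b | 4   | 16     |
"""

WORKLOADS_MD = """
| workload | node   | cpu | ram_gb |
|----------|--------|-----|--------|
| vm-1     | node-a | 2   | 4      |
| vm-2     | node-a | 4   | 8      |
| vm-3     | node-b | 2   | 8      |
"""


def parse_md_table(text: str) -> list[dict[str, str]]:
    """Parse one pipe-delimited Markdown table into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


def rollup(nodes: list[dict[str, str]], workloads: list[dict[str, str]]) -> dict:
    """Sum declared allocations per node and pair them with physical capacity."""
    allocated = defaultdict(lambda: {"cpu": 0, "ram_gb": 0})
    for w in workloads:
        allocated[w["node"]]["cpu"] += int(w["cpu"])
        allocated[w["node"]]["ram_gb"] += int(w["ram_gb"])
    return {
        n["node"]: {
            res: {"allocated": allocated[n["node"]][res], "physical": int(n[res])}
            for res in ("cpu", "ram_gb")
        }
        for n in nodes
    }


def hostname_drift(workloads: list[dict[str, str]], inventory_hosts: set[str]) -> dict:
    """Compare hostnames declared in the reference doc with inventory hostnames."""
    declared = {w["workload"] for w in workloads}
    return {
        "missing_from_inventory": sorted(declared - inventory_hosts),
        "missing_from_reference": sorted(inventory_hosts - declared),
    }


def gather_usage() -> dict:
    """Stub, as in the ADR: no live usage source has been chosen yet."""
    return {"available": False}


if __name__ == "__main__":
    nodes = parse_md_table(NODES_MD)
    workloads = parse_md_table(WORKLOADS_MD)
    # Hypothetical hostnames, standing in for terraform/ansible output.
    inventory = {"vm-1", "vm-2", "vm-4"}
    print(json.dumps({
        "rollup": rollup(nodes, workloads),
        "drift": hostname_drift(workloads, inventory),
        "usage": gather_usage(),
    }, indent=2))
```

Keying each row dict by the header text is what makes the table headers a parser contract in this sketch: renaming a column in the reference doc would raise a `KeyError` here until the script is updated to match.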