ADR-012 — Hardware reference & capacity evaluation
Context
The repo modelled the logical/network layer (Terraform VM specs, ADR-007 topology) but not the physical layer — node CPU/RAM/disk capacity, network gear, or which workloads are designed to run where with what headroom. There was also no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a workload that should move, or a node due an upgrade.
Decision
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for physical compute + network gear and workload placement intent. Two machine-readable tables (node capacity, workload placement) carry the numbers. This includes `ubongo`, the physical control node (ADR-015), even though it sits outside the Proxmox cluster.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`) parses those tables, computes per-node allocated-vs-physical rollups, and cross-checks workload hostnames against `terraform output -json` / `ansible-inventory --list` to surface drift.
- `/capacity-review` reads the scan + intent columns and writes a dated report to `docs/hardware/reviews/YYYY-MM-DD-capacity.md`, also overwriting `docs/hardware/reviews/latest.md`, mirroring `/review-repo` → `docs/reviews/`.
- Numeric allocations live in `reference.md`, not Terraform: the current `terraform output` exposes only `{ip, group}`. Terraform/inventory are used only for hostname-drift cross-checks.
- Live usage stats are a future hook. The cluster is not stood up; `gather_usage()` returns `available: false` and the evaluator reasons on declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/Alloy) is undecided (see docs/TODO.md 8.4) and is to be settled before any hook is built.
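The scan steps above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the actual `capacity-scan.py`: the table contents, the column names (`node`, `host`, `cores`, `ram_gb`), the helper names `parse_table`, `rollup`, `hostname_drift` and `report_paths`, and the assumption that the Terraform data has already been unwrapped into a hostname-keyed map are all hypothetical. Only the `available: false` return of `gather_usage()` and the report paths come from this ADR.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical excerpts of the two reference.md tables; real headers may differ.
NODES_MD = """\
| node   | cores | ram_gb |
|--------|-------|--------|
| node-a | 8     | 32     |
| node-b | 4     | 16     |
"""

WORKLOADS_MD = """\
| host | node   | cores | ram_gb |
|------|--------|-------|--------|
| vm-1 | node-a | 2     | 8      |
| vm-2 | node-a | 4     | 16     |
| vm-3 | node-b | 2     | 4      |
"""


def parse_table(md):
    """Parse a pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in md.splitlines() if ln.strip().startswith("|")]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [c.strip() for c in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def rollup(nodes, workloads, fields=("cores", "ram_gb")):
    """Sum declared allocations per node next to the node's physical capacity."""
    out = {}
    for n in nodes:
        placed = [w for w in workloads if w["node"] == n["node"]]
        out[n["node"]] = {
            f: {"allocated": sum(int(w[f]) for w in placed), "physical": int(n[f])}
            for f in fields
        }
    return out


def hostname_drift(workloads, tf_hosts):
    """Report hostnames that appear on only one side of the cross-check."""
    documented = {w["host"] for w in workloads}
    deployed = set(tf_hosts)
    return {
        "missing_in_terraform": sorted(documented - deployed),
        "missing_in_reference": sorted(deployed - documented),
    }


def gather_usage():
    """Future hook: no live usage source exists yet, so report that plainly."""
    return {"available": False}


def report_paths(day):
    """Return the dated report path and the latest.md path that is overwritten."""
    base = PurePosixPath("docs/hardware/reviews")
    return base / f"{day.isoformat()}-capacity.md", base / "latest.md"
```

With the sample tables, `rollup(parse_table(NODES_MD), parse_table(WORKLOADS_MD))` gives `node-a` 24 of 32 GB RAM allocated, and `hostname_drift` flags any host that exists in only one of the two sources. Keeping the numbers in the Markdown tables means the rollup never needs Terraform to expose more than `{ip, group}`.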
Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract: changing them needs a matching `capacity-scan.py` change.
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention budget and `askari`'s security-subset volume belong in `reference.md`, and SSD wearout/TBW is a monitored metric. Logging is write-heavy, so wear is watched, not assumed.
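Since the table headers act as a parser contract, one option is for the scanner to validate the header row before reading any data, so a renamed column fails loudly instead of producing silent zeros. A minimal sketch, assuming a hypothetical column set and a hypothetical `check_headers` helper; neither is taken from the actual `capacity-scan.py`:

```python
# Hypothetical contract: the columns a scanner expects, in order.
EXPECTED_NODE_HEADERS = ("node", "cores", "ram_gb", "disk_gb")


def check_headers(header_line, expected):
    """Raise ValueError if a Markdown header row deviates from the contract."""
    found = tuple(c.strip() for c in header_line.strip().strip("|").split("|"))
    if found != tuple(expected):
        missing = [c for c in expected if c not in found]
        extra = [c for c in found if c not in expected]
        raise ValueError(
            f"header mismatch: missing={missing} extra={extra} found={found}"
        )
    return found
```

A check like this makes the coupling explicit: editing a header in `reference.md` without updating the expected tuple stops the scan at the first table rather than skewing the rollups.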
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).