boma/docs/decisions/012-hardware-capacity.md

# ADR-012 — Hardware reference & capacity evaluation

## Context

The repo modelled the logical/network layer (Terraform VM specs, ADR-007
topology) but not the physical layer — node CPU/RAM/disk capacity, network gear,
or which workloads are designed to run where with what headroom. There was also
no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a
workload that should move, or a node due an upgrade.

## Decision

- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
  physical compute + network gear and workload placement intent. Two
  machine-readable tables (node capacity, workload placement) carry the numbers.
  This includes `ubongo`, the physical control node (ADR-015), even though it sits
  outside the Proxmox cluster.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
  parses those tables, computes per-node allocated-vs-physical rollups, and
  cross-checks workload hostnames against `terraform output -json` /
  `ansible-inventory --list` to surface drift.
- `/capacity-review` reads the scan + intent columns and writes a dated report to
  `docs/hardware/reviews/YYYY-MM-DD-capacity.md`, also overwriting
  `docs/hardware/reviews/latest.md`, mirroring `/review-repo` → `docs/reviews/`.
- Numeric allocations live in `reference.md`, not Terraform: the current
  `terraform output` exposes only `{ip, group}`. Terraform/inventory are used
  only for hostname-drift cross-checks.
- **Live usage stats are a future hook.** The cluster is not stood up;
  `gather_usage()` returns `available: false` and the evaluator reasons on
  declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/
  Alloy) is undecided — see docs/TODO.md 8.4, to be settled before any hook is
  built.
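
As an illustration only, the table parsing, per-node rollup, and hostname-drift check described above could look roughly like the stdlib-only sketch below. It is not the contents of `capacity-scan.py`. The column names (`node`, `cores`, `ram_gb`, `workload`), the sample hosts, the helper names, and the Terraform output name `hosts` are all assumptions made for the example; the only thing taken from this ADR is that the output carries `{ip, group}` per host.

```python
import json

# Hypothetical excerpt of reference.md; real column names may differ.
REFERENCE_MD = """\
| node   | cores | ram_gb |
|--------|-------|--------|
| node-a | 8     | 32     |
| node-b | 4     | 16     |

| workload | node   | cores | ram_gb |
|----------|--------|-------|--------|
| vm-1     | node-a | 4     | 8      |
| vm-2     | node-a | 2     | 16     |
| vm-3     | node-b | 2     | 4      |
"""

# Hypothetical `terraform output -json` payload with an assumed output named "hosts".
TERRAFORM_JSON = json.dumps({
    "hosts": {"value": {
        "vm-1": {"ip": "192.0.2.11", "group": "apps"},
        "vm-2": {"ip": "192.0.2.12", "group": "apps"},
        "vm-9": {"ip": "192.0.2.19", "group": "apps"},
    }}
})


def parse_table(markdown: str, first_header: str) -> list[dict[str, str]]:
    """Return the rows of the first pipe table whose first header cell matches."""
    rows, headers = [], None
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if headers is not None:
                break  # the table we were reading has ended
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if headers is None:
            if cells[0] == first_header:
                headers = cells
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row such as |---|---|
        rows.append(dict(zip(headers, cells)))
    return rows


def rollup(nodes: list[dict[str, str]], workloads: list[dict[str, str]]) -> dict:
    """Sum declared workload allocations per node next to physical capacity."""
    totals = {n["node"]: {"cores": 0, "ram_gb": 0} for n in nodes}
    for w in workloads:
        for key in ("cores", "ram_gb"):
            totals[w["node"]][key] += int(w[key])
    return {
        n["node"]: {
            key: {"allocated": totals[n["node"]][key], "physical": int(n[key])}
            for key in ("cores", "ram_gb")
        }
        for n in nodes
    }


def hostname_drift(declared: set[str], terraform_json: str) -> dict[str, list[str]]:
    """Compare declared workload hostnames with those Terraform reports."""
    live = set(json.loads(terraform_json)["hosts"]["value"])
    return {
        "missing_from_terraform": sorted(declared - live),
        "undeclared_in_reference": sorted(live - declared),
    }


nodes = parse_table(REFERENCE_MD, "node")
workloads = parse_table(REFERENCE_MD, "workload")
report = rollup(nodes, workloads)
drift = hostname_drift({w["workload"] for w in workloads}, TERRAFORM_JSON)
```

With this sample data, `report["node-a"]["ram_gb"]` shows 24 GB allocated against 32 GB physical, and `drift` flags `vm-3` as declared but absent from Terraform and `vm-9` as present in Terraform but undeclared. Only hostnames are read from the Terraform payload, matching the decision that numeric allocations live in `reference.md`.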

## Consequences

- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
  matching `capacity-scan.py` change.
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
  budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
  **wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
  not assumed.
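
Two of these consequences lend themselves to a small sketch: failing loudly when a table header no longer matches what the parser expects, and expressing SSD wear as a fraction of rated endurance. The header tuples, the function names, and the example figures below are hypothetical; the actual contract is whatever `capacity-scan.py` and `reference.md` agree on, and how written-bytes figures are collected is not decided here.

```python
# Assumed header contract; the real column names live in reference.md.
EXPECTED_HEADERS = {
    "node capacity": ("node", "cores", "ram_gb"),
    "workload placement": ("workload", "node", "cores", "ram_gb"),
}


def check_headers(table_name: str, header_line: str) -> None:
    """Raise ValueError if a table's header row differs from the contract."""
    found = tuple(c.strip() for c in header_line.strip().strip("|").split("|"))
    expected = EXPECTED_HEADERS[table_name]
    if found != expected:
        raise ValueError(
            f"{table_name}: expected headers {expected}, found {found}"
        )


def wear_fraction(written_tb: float, rated_tbw: float) -> float:
    """Return the share of a drive's rated TBW already consumed (0.0 to 1.0+)."""
    if rated_tbw <= 0:
        raise ValueError("rated_tbw must be positive")
    return written_tb / rated_tbw


# A header that matches the assumed contract passes silently.
check_headers("node capacity", "| node | cores | ram_gb |")

# A renamed column is reported instead of being silently misparsed.
try:
    check_headers("node capacity", "| node | cpus | ram_gb |")
    header_error = None
except ValueError as exc:
    header_error = str(exc)

# Example figures only: 150 TB written against a 600 TBW rating.
consumed = wear_fraction(150, 600)
```

Checking the header before parsing turns a silent misread into an explicit error, which is the practical meaning of treating the headers as a contract.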

See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).