boma/docs/decisions/012-hardware-capacity.md

ADR-012 — Hardware reference & capacity evaluation

Status

Accepted (2026-06-01)

Context

The repo modelled the logical/network layer (Terraform VM specs, ADR-007 topology) but not the physical layer — node CPU/RAM/disk capacity, network gear, or which workloads are designed to run where with what headroom. There was also no way to ask "is this well-proportioned?" — e.g. HA that isn't needed, a workload that should move, or a node due an upgrade.

Decision

  • docs/hardware/reference.md is the single, hand-maintained source of truth for physical compute + network gear and workload placement intent. Two machine-readable tables (node capacity, workload placement) carry the numbers. This includes ubongo, the physical control node (ADR-015), even though it sits outside the Proxmox cluster.
  • scripts/capacity-scan.py (stdlib-only, like repo-scan.py / tf_to_inventory.py) parses those tables, computes per-node allocated-vs-physical rollups, and cross-checks workload hostnames against terraform output -json / ansible-inventory --list to surface drift.
  • /capacity-review reads the scan + intent columns and writes a dated report to docs/hardware/reviews/YYYY-MM-DD-capacity.md, also overwriting docs/hardware/reviews/latest.md, mirroring how /review-repo writes to docs/reviews/.
  • Numeric allocations live in reference.md, not Terraform: the current terraform output exposes only {ip, group}. Terraform/inventory are used only for hostname-drift cross-checks.
  • Live usage stats are a future hook. The cluster is not stood up yet, so gather_usage() returns available: false and the evaluator reasons on declared intent. The usage source (Proxmox RRD vs Prometheus/Loki/Grafana/Alloy) is undecided; see docs/TODO.md 8.4, to be settled before any hook is built.
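
The scan described above can be sketched roughly as follows. This is a minimal illustration, not the actual capacity-scan.py: the column names (node, workload, cpu_cores, ram_gb), all figures, and the helper names parse_tables, rollup, hosts_from_ansible and hostname_drift are assumptions made for the example. Only gather_usage() and its available: false result come from this ADR.

```python
import json

# Hypothetical excerpt: the real reference.md column names and figures are not
# given in this ADR, so every header and number below is invented.
REFERENCE_MD = """\
| node | cpu_cores | ram_gb |
|------|-----------|--------|
| pve0 | 8         | 64     |

| workload    | node | cpu_cores | ram_gb |
|-------------|------|-----------|--------|
| vaultwarden | pve0 | 2         | 4      |
| loki        | pve0 | 4         | 16     |
"""

# Trimmed sample in the shape `ansible-inventory --list` emits
# (host names appear as keys under _meta.hostvars).
INVENTORY_JSON = '{"_meta": {"hostvars": {"vaultwarden": {}, "askari": {}}}}'


def parse_tables(text):
    """Return every pipe table in text as a list of row dicts keyed by header."""
    tables, header, rows = [], None, []
    for line in text.splitlines() + [""]:  # trailing "" flushes the last table
        line = line.strip()
        if line.startswith("|"):
            cells = [c.strip() for c in line.strip("|").split("|")]
            if header is None:
                header = cells
            elif all(set(c) <= set("-: ") for c in cells):
                continue  # the |---|---| separator row carries no data
            else:
                rows.append(dict(zip(header, cells)))
        elif header is not None:
            tables.append(rows)
            header, rows = None, []
    return tables


def rollup(nodes, workloads, fields=("cpu_cores", "ram_gb")):
    """Sum declared workload allocations per node next to physical capacity."""
    report = {}
    for node in nodes:
        placed = [w for w in workloads if w["node"] == node["node"]]
        report[node["node"]] = {
            f: {"physical": int(node[f]),
                "allocated": sum(int(w[f]) for w in placed)}
            for f in fields
        }
    return report


def hosts_from_ansible(inventory_json):
    """Collect host names from `ansible-inventory --list` JSON output."""
    return set(json.loads(inventory_json)["_meta"]["hostvars"])


def hostname_drift(workloads, known_hosts):
    """Report hostnames present on only one side of the cross-check."""
    declared = {w["workload"] for w in workloads}
    return {
        "missing_from_inventory": sorted(declared - known_hosts),
        "missing_from_reference": sorted(known_hosts - declared),
    }


def gather_usage():
    """Future hook: no live usage source exists yet, so report that plainly."""
    return {"available": False}


nodes, workloads = parse_tables(REFERENCE_MD)
report = rollup(nodes, workloads)
drift = hostname_drift(workloads, hosts_from_ansible(INVENTORY_JSON))
```

With the invented data above, the rollup compares declared allocations against physical capacity per node, and the drift result lists hostnames that appear in reference.md but not in the inventory, and the reverse. Because gather_usage() reports available: false, any advice built on this output rests on declared intent only.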

Consequences

  • Right-sizing advice is intent-based until usage data exists; reports say so.
  • reference.md table headers are a parser contract — changing them needs a matching capacity-scan.py change.
  • Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention budget and askari's security-subset volume belong in reference.md, and SSD wearout/TBW is a monitored metric — logging is write-heavy, so wear is watched, not assumed.
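
One way to make the header contract fail loudly is a check that runs before any parsing. The sketch below is an assumption about how such a guard could look, not a description of capacity-scan.py: the table labels, the expected column names, and the check_headers helper are all hypothetical.

```python
# Hypothetical contract: the real reference.md headers are not listed in this ADR.
EXPECTED_HEADERS = {
    "node capacity": ["node", "cpu_cores", "ram_gb", "disk_gb"],
    "workload placement": ["workload", "node", "cpu_cores", "ram_gb", "disk_gb"],
}


def check_headers(found, expected=EXPECTED_HEADERS):
    """Return human-readable errors; an empty list means the contract holds."""
    errors = []
    for table, want in expected.items():
        got = found.get(table)
        if got is None:
            errors.append(f"{table}: table not found")
        elif got != want:
            missing = [h for h in want if h not in got]
            extra = [h for h in got if h not in want]
            if missing or extra:
                detail = f"missing={missing} extra={extra}"
            else:
                detail = "columns reordered"
            errors.append(f"{table}: {detail}")
    return errors


# Example: someone renamed the workload table's "node" column to "host".
found = {
    "node capacity": ["node", "cpu_cores", "ram_gb", "disk_gb"],
    "workload placement": ["workload", "host", "cpu_cores", "ram_gb", "disk_gb"],
}
errors = check_headers(found)
```

Failing on a header mismatch, instead of silently skipping unknown columns, keeps a renamed column from turning into a rollup of zeros.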

Related

ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).