boma/docs/reviews/2026-06-14-review.md
sjat 64f1e821d8 docs(review): 2026-06-14 repo audit — M4a doc drift + Traefik→Caddy lag
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:37:54 +02:00

157 lines
9.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Repo review — 2026-06-14
- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
- **Previous run:** 2026-06-11 (`67f2aba`)
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).
## Summary
A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.
### Counts
| Dimension | High | Medium | Low | Total |
|---|---|---|---|---|
| Cruft / staleness | 0 | 0 | 0 | 0 |
| Design conformance | 1 | 2 | 2 | 5 |
| Consistency & intent | 2 | 2 | 9 | 13 |
| Docs-vs-reality drift | 1 | 4 | 5 | 10 |
| **Open total** | **4** | **8** | **16** | **29** |
Plus **11 auto-fixes applied** (3 high, 5 medium, 3 low).
### Phase-0 scan
`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
not a TODO). Details in the findings JSON.
### Deferral checklist
All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
harness home, classification home, staging-first) confirmed **genuinely still open**
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).
## Auto-fixes applied
All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
descriptions) — no logic, variable, secret, or task-order changes.
| ID | Sev | File | What |
|---|---|---|---|
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1``cx23@hel1` (actual server) |
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)``cx23 (x86) or cax11 (ARM)` |
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik``Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |
## Open findings (prioritised)
### High
- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
"Built + applied" rows and the real tasks file. Reword the scaffold row + closing
paragraph (left for the operator — STATUS is the ground-truth doc).
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
"not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
ADR-004 at ADR-022 (ADR↔ADR design decision — report).
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
still say Traefik too. Either finish the rename or soften ADR-024's claim.
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
claims ADRs 001018 were restructured to lead with `## Status`, but 016/017/018 still
open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.
### Medium
- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
`upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
"confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
carve-out (not a token swap, so left from AF10).
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
`set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
`--tags users|config` run breaks (base guards this; dev_env doesn't).
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
ubongo. Make the header honest.
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
`pve0`; verify the real node name before changing.
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt — control-
node external-API role, not a host service.)
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
`boma.wingu.me` used inconsistently across the doc set after M1; state the transition
in one place.
### Low
O13 (See-also vs `## Related` in ADR-012/013/015/016/017/018 — recurring), O14
(docs/README + inventories/README narrow enumerations — recurring), O16 (.zshrc rclone
alias + unguarded direnv hook — recurring), O17 (oh_my_posh zen.toml tasks missing
`config` tag — recurring), O19 (ADR-009:122 `nyumbani` example after retirement —
recurring), O20 (ROADMAP M2 CAX11/ARM vs cx23/x86 — new), O21 (ADR-020 "ports will be
added in M4" stale; already opened in M4a — new), O22 (ADR-024 body still asserts custom-
image obligation contradicting its revised Status — new), O23 (`netbird_coordinator` vs
`netbird` role name across ADRs/ROADMAP/plan — new), O24 (`*.boma.<domain>` vs
`*.<boma-domain>` wildcard scope ADR-024 vs ROADMAP — new), O25 (`tags: [verify]` out of
the ADR-019 vocabulary in molecule verify — new), O26 (reverse_proxy templates lack
`ansible_managed` header — new), O27 (reverse_proxy vars in `group_vars/all/` not
`offsite_hosts/` — new), O28 (capacity-scan.py ignores `offsite.yml` — new), O29
(offsite.yml duplicates empty groups from hosts.yml, undocumented merge — new).
Full detail + suggested fixes in `2026-06-14-findings.json`.
## Themes worth a deliberate pass
1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
sweep across ADR-008/017/019 closes it.
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
mirror base's `apply: tags:` guard.
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.
## Follow-up prompt
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
> with its own revised HTTP-01 Status note.
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
> ACCESS.md + BACKUP.md rows to its service-role file table.
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
> `--tags config` run still resolves the home dir.
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
> batch; commit per CLAUDE.md git conventions.