Compare commits
6 commits
e3461375f5...9e0c264658
| Author | SHA1 | Date | |
|---|---|---|---|
| | 9e0c264658 | | |
| | 9b5851ba4b | | |
| | 175777e36a | | |
| | cb8f924d4b | | |
| | 718781053f | | |
| | 64f1e821d8 | | |
60 changed files with 818 additions and 269 deletions
|
|
@ -108,6 +108,13 @@ See `Makefile` for the full list of targets.
|
|||
- Control / AI-worker host (`ubongo`): `docs/decisions/015-control-host.md`
- Mesh VPN (NetBird): `docs/decisions/016-mesh-vpn.md`
- Service-UI verification (Level 4): `docs/decisions/017-service-ui-verification.md`
- Logging & log integrity: `docs/decisions/018-logging.md`
- Tagging & run-targeting: `docs/decisions/019-tagging.md`
- Firewall strategy: `docs/decisions/020-firewall.md`
- Operational access: `docs/decisions/021-operational-access.md`
- Backup & disaster recovery: `docs/decisions/022-backup.md`
- ADR structure & lifecycle: `docs/decisions/023-adr-structure.md`
- Reverse proxy (Caddy): `docs/decisions/024-reverse-proxy.md`

(CLAUDE.md carries the full cross-referenced table, including the runbooks and
security/testing docs.)
|
||||
|
|
|
|||
16
STATUS.md
|
|
@ -38,14 +38,20 @@ _Last reviewed: 2026-06-14._
|
|||
| Thing | State |
|---|---|
| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is built but **not yet applied** to any host (mesh-gated to avoid lockout — M5). Not built: auditd, packages, users (Phase 2 / TODO 15). |
| `roles/docker_host/` | **Scaffolded, no tasks.** In git (meta/README/molecule filled), wired into `playbooks/site.yml` so the standard state is expressed end-to-end and `make lint` covers it, but it has no tasks yet — applying it is a no-op. Planned scope (Docker engine + Compose, daemon hardening, `nftables.d` container rules) in ADR-004/ADR-020. |
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
|
||||
|
||||
So `make deploy PLAYBOOK=site` has no real content to apply — `base` is only partially
|
||||
built (its `firewall` concern only) and the `docker_host` role is scaffolded but has no
|
||||
tasks yet. (The `make check`/`deploy` machinery itself now works — first proven by
|
||||
applying `dev_env` via `playbooks/workstation.yml`.)
|
||||
(`roles/docker_host/` is no longer scaffold-only — it installs the Docker engine + Compose
|
||||
and is built + applied to askari; see "Real and working today". Its deferred scope —
|
||||
daemon hardening + `nftables.d` container rules, ADR-004/ADR-020 — is still pending.)
|
||||
|
||||
A `make deploy PLAYBOOK=site` run now applies real content — `base` (its `firewall` +
|
||||
`hardening` concerns) plus a functional `docker_host` (Docker engine) on docker hosts —
|
||||
but in practice it is still limited: the production cluster has no docker hosts yet, and
|
||||
`base`'s `firewall` concern is mesh-gated until M5, so a full cluster `site` run does not
|
||||
yet exist. (The `make check`/`deploy` machinery itself works — first proven by applying
|
||||
`dev_env` via `playbooks/workstation.yml`, then `base`/`docker_host`/`reverse_proxy` on
|
||||
askari.)
|
||||
|
||||
## Designed but not built
|
||||
|
||||
|
|
|
|||
|
|
@ -24,9 +24,9 @@ decisions this frame enables.
|
|||
|
||||
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|---|---|---|---|---|---|
| Reverse proxy / TLS | Traefik | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
| Reverse proxy / TLS | Caddy (ADR-024) | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
| Internal DNS | `dns` role → dns1/dns2 | P | core | Authoritative internal zone (ADR-007) | Ansible-rendered zone |
| Public DNS | `public_dns` role → Gandi LiveDNS | P | core | wingu.me zone as code (ADR-007) | anti-spoof baseline; mesh/LAN-only default; apply pending |
| Public DNS | `public_dns` role → Gandi LiveDNS | P | core | wingu.me zone as code (ADR-007) | anti-spoof baseline; mesh/LAN-only default; applied (M1) |
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
| Service portal / dashboard | Homepage | A | candidate | One landing page listing all services — a "what does what" front door | Gap surfaced by V4; fits boma's legibility goal |
|
||||
|
||||
|
|
@ -148,8 +148,11 @@ AI/LLM, a game server (Minecraft), generic static-site hosting. Plausible someda
|
|||
none are committed.
|
||||
|
||||
**Confirmed exclusions (V4 had them; boma deliberately does not).** V4 mixed in a lot
|
||||
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, nvim/kitty/tmux,
|
||||
LibreOffice, antivirus, remote desktop. boma is **server-only**, so these are correctly
|
||||
absent. Likewise the removed Knowledge domain (Discourse, Snipe-IT, MRBS booking) and
|
||||
V4-specific project websites — out of boma's scope by design. The narrower surface is
|
||||
intentional, not an oversight.
|
||||
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, LibreOffice,
|
||||
antivirus, remote desktop. boma's **managed cluster/server hosts** stay server-only, so
|
||||
these are correctly absent. (One scoped exception: the control / AI-worker host `ubongo`
|
||||
runs an interactive `dev_env` — zsh/tmux/neovim — per ADR-015; that is the developer
|
||||
environment of an infrastructure worker host, not a personal desktop, and does not apply
|
||||
to managed service hosts.) Likewise the removed Knowledge domain (Discourse, Snipe-IT,
|
||||
MRBS booking) and V4-specific project websites — out of boma's scope by design. The
|
||||
narrower surface is intentional, not an oversight.
|
||||
|
|
|
|||
|
|
@ -6,6 +6,15 @@ Project documentation.
|
|||
Numbered from 001; each records context, the decision, and what was ruled out.
- `runbooks/` — step-by-step operational procedures (add a host, add a role, rotate
secrets).
- `security/` — security baseline, accepted-risk register, per-service checklist +
template (ADR-002/004).
- `testing/` — testing methodology artifacts + the `VERIFY.md` template (ADR-008/017).
- `access/` — operational-access doctrine + the `ACCESS.md` template (ADR-021).
- `backup/` — backup doctrine + the `BACKUP.md` template (ADR-022).
- `hardware/` — capacity reference + `/capacity-review` output (ADR-012).
- `reviews/` — `/review-repo` audit trail.
- `CAPABILITIES.md` / `ROADMAP.md` / `TODO.md` / `FRICTION.md` — what boma does, the
build order, the backlog, and recurring-friction notes.

For what is actually **built vs only designed**, see `STATUS.md` at the repo root —
the ADRs describe intent, not necessarily current reality.
|
||||
|
|
|
|||
|
|
@ -79,9 +79,10 @@ zero-risk and *born at Gandi*.
|
|||
|
||||
### M2 · `askari` provisioned + under Ansible
|
||||
|
||||
Provision the Hetzner VPS **as IaC with Terraform** (CAX11 ARM / Helsinki / Debian 13,
|
||||
behind a TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap
|
||||
it. Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
|
||||
Provision the Hetzner VPS **as IaC with Terraform** (Helsinki / Debian 13, behind a
|
||||
TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap it.
|
||||
**Shipped as cx23/x86** (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
|
||||
x86, cheaper). Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
|
||||
|
||||
- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
|
||||
existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
|
||||
|
|
@ -113,8 +114,8 @@ Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's
|
|||
**Caddy** reverse proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
|
||||
valid Let's Encrypt cert (HTTP-01 — DNS-01 deferred to Phase 2, see ADR-024/FRICTION).
|
||||
Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
|
||||
`…2026-06-14-m4a-docker-caddy.md`. **M4b (next):** the `netbird` service role — read
|
||||
NetBird's current self-host compose then.
|
||||
`…2026-06-14-m4a-docker-caddy.md`. **M4b (next):** the `netbird_coordinator` service
|
||||
role — read NetBird's current self-host compose then.
|
||||
|
||||
Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the
|
||||
**embedded IdP** (ADR-016 — no Authentik dependency), fronted by the now-proven Caddy.
|
||||
|
|
|
|||
|
|
@ -122,7 +122,7 @@
|
|||
retro consumes them.
|
||||
|
||||
12. **Spin-up / build order** — what is the right order of operations when spinning up
|
||||
from scratch (OS, DNS, Authentik, Traefik, …)?
|
||||
from scratch (OS, DNS, Authentik, Caddy, …)?
|
||||
|
||||
13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the readme files but is that enough? Also, how do we re-challenge decisions and how they interact over time? E.g. we have these two services running, but extending one a little bit could make the other redundant so we could remove it. Or an alternative to this service has emerged, and it is actually better.
|
||||
|
||||
|
|
|
|||
|
|
@ -79,7 +79,8 @@ time. Each heading tags the threat(s) it primarily serves.
|
|||
### Updates — *opportunistic*
|
||||
|
||||
- `unattended-upgrades` enabled for **security patches only**
|
||||
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
|
||||
- Full system upgrades triggered deliberately via Ansible (planned — a dedicated upgrade
|
||||
playbook per ADR-011; not yet built, no `upgrade.yml` exists today)
|
||||
- No automatic reboots — reboots are a conscious operational decision
|
||||
|
||||
### Minimal attack surface — *opportunistic, blast radius*
|
||||
|
|
|
|||
|
|
@ -47,6 +47,8 @@ below). Each service role contains a standard set of files:
|
|||
| `README.md` | Purpose, variables, usage (role convention) |
| `SECURITY.md` | Per-service security record — see ADR-002 and `docs/security/service-security-template.md` |
| `VERIFY.md` | Per-service UI acceptance spec — see ADR-008 Level 4 / ADR-017 and `docs/testing/service-verify-template.md` |
| `ACCESS.md` | Per-service operational-access record — see ADR-021 and `docs/access/service-access-template.md` |
| `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
| `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |
|
||||
|
||||
### Standard deploy mechanics
|
||||
|
|
@ -102,7 +104,9 @@ Managed by the `docker_host` role. Key settings:
|
|||
|
||||
- Bind mounts preferred over named volumes for data that must be backed up
|
||||
- All bind mount paths are under `/opt/services/<name>/data/`
|
||||
- Backup strategy is defined separately (not in scope of this repo)
|
||||
- Backup strategy is defined in **ADR-022** — the bind mounts under
|
||||
`/opt/services/<name>/data/` are exactly the unit ADR-022's per-service `backup__*`
|
||||
contract (and `BACKUP.md`) captures
|
||||
|
||||
## Decision
|
||||
|
||||
|
|
@ -128,5 +132,6 @@ Drawn from the trade-offs and deferred items this ADR already states:
|
|||
- Bare `latest` is acceptable only on the stateless tier; the stateful tier is always
|
||||
pinned `tag@digest`, and image updates are a deliberate operation (per Image management;
|
||||
ADR-011).
|
||||
- Backup strategy is stated as defined separately, not in scope of this ADR (per Persistent
|
||||
data).
|
||||
- Backup strategy is defined in ADR-022 (not in this ADR); the persistent bind mounts
|
||||
under `/opt/services/<name>/data/` are the unit ADR-022's per-service `backup__*`
|
||||
contract captures (per Persistent data).
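The image-pinning rule in the first bullet of this list (bare `latest` only on the stateless tier, `tag@digest` on the stateful tier) could be checked mechanically. The following is a minimal Python sketch, not part of the repo: the function name `image_ref_allowed`, the regular expression, and the placeholder digest are all assumptions made for illustration.

```python
import re

# Assumed shape of a pinned reference: "<name>:<tag>@sha256:<64 hex digits>".
_PINNED = re.compile(r"^[^@\s]+:[^@:/\s]+@sha256:[0-9a-f]{64}$")


def image_ref_allowed(image: str, stateful: bool) -> bool:
    """Return True if `image` satisfies the pinning rule for its tier."""
    if stateful:
        # Stateful tier: the reference must carry both a tag and a digest.
        return _PINNED.match(image) is not None
    # Stateless tier: any reference is accepted, including bare `latest`.
    return True


digest = "sha256:" + "a" * 64  # placeholder digest, not a real image
print(image_ref_allowed("caddy:latest", stateful=False))
print(image_ref_allowed("postgres:16", stateful=True))
print(image_ref_allowed(f"postgres:16@{digest}", stateful=True))
```

A check like this would fit a lint step over rendered Compose templates, though where (or whether) boma enforces the rule is not stated in this diff.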
|
||||
|
|
|
|||
|
|
@ -87,6 +87,14 @@ Assigned infrastructure addresses:
|
|||
| `10.20.0.12` | `proxy` | Reverse proxy |
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
|
||||
|
||||
> **Control node `ubongo` — legacy V4 network (transitional).** `ubongo` (ADR-015) is the
|
||||
> manually-provisioned physical control node and currently lives on the **legacy V4
|
||||
> homelab network at `10.20.10.151`** — boma is being built up from the V4 base, and the
|
||||
> physical LAN has not yet been re-cut to this VLAN scheme. That address is therefore
|
||||
> **outside** the planned `srv` `10.20.0.0/24`; `base__firewall_control_addr` and the
|
||||
> inventory point at the real (V4) address. When the network is migrated to these VLANs,
|
||||
> `ubongo` moves into `mgmt`/`srv` and this note is retired.
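The claim that `ubongo`'s current address lies outside the planned `srv` range can be verified with a short check. This is an illustrative Python sketch using only the addresses quoted in this section; the constant and function names are assumptions, not identifiers from the repo.

```python
import ipaddress

# Values quoted in the note and address table above; names are illustrative.
SRV_NETWORK = ipaddress.ip_network("10.20.0.0/24")   # planned `srv` VLAN
UBONGO_ADDR = ipaddress.ip_address("10.20.10.151")   # legacy V4 address


def in_planned_srv(addr) -> bool:
    """Return True if `addr` falls inside the planned `srv` subnet."""
    return addr in SRV_NETWORK


print(in_planned_srv(UBONGO_ADDR))                         # ubongo today
print(in_planned_srv(ipaddress.ip_address("10.20.0.12")))  # `proxy`
```

The first call prints `False` and the second `True`, which is why the firewall variable and the inventory have to carry the real (V4) address until the LAN is re-cut.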
|
||||
|
||||
#### VLAN 30 — lan (10.30.0.0/24)
|
||||
|
||||
| Range | Purpose |
|
||||
|
|
@ -164,15 +172,21 @@ IoT devices cannot initiate connections to `srv`.
|
|||
|
||||
### DNS zones and split-horizon
|
||||
|
||||
**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
|
||||
**Internal zone**: `boma.baobab.band` **today** (the `dns` role is unbuilt) — served by
|
||||
`dns1` and `dns2`. **Target:** it is renamed to `boma.wingu.me` in Phase 2 when the `dns`
|
||||
role lands. Until then `boma.baobab.band` is the authoritative internal name **everywhere
|
||||
it appears** (the naming table above, split-horizon below, the OPNsense forwarder, and
|
||||
ADR-009/016). This is the single source for that transition; other references use the
|
||||
current name and inherit this caveat.
|
||||
The zone is rendered by the Ansible `dns` role: host A records come from the
|
||||
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
|
||||
and service/alias/split-horizon records are explicit zone data in `group_vars`.
|
||||
Terraform itself writes no DNS records — see ADR-009.
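As a rough illustration of "host A records come from the inventory", the sketch below renders zone-file lines from a host-to-address mapping. It is Python rather than the Jinja template an Ansible role would normally use, and the function name, the mapping shape, and the record layout are assumptions; the two sample hosts come from the `srv` address table earlier in this file.

```python
def render_a_records(zone: str, hosts: dict) -> list:
    """Render one A record per inventory host, sorted by host name."""
    records = []
    for name in sorted(hosts):
        # Fully qualified owner name with trailing dot, then the address.
        records.append(f"{name}.{zone}. IN A {hosts[name]}")
    return records


# Hypothetical inventory slice; the real data flows from `make tf-inventory`.
inventory_hosts = {
    "proxy": "10.20.0.12",
    "homeassistant": "10.20.0.13",
}

for line in render_a_records("boma.baobab.band", inventory_hosts):
    print(line)
```

Service, alias, and split-horizon records are not derived this way; per the paragraph above they stay as explicit zone data in `group_vars`.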
|
||||
|
||||
**Public zone**: `wingu.me` — Gandi LiveDNS, **managed as code** by the `public_dns`
|
||||
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal),
|
||||
services `<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
|
||||
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal — the
|
||||
Phase-2 target; currently `boma.baobab.band`, see *Internal zone* above), services
|
||||
`<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
|
||||
`nyumbani` is retired. **Mesh/LAN-only by default**: home services have no public record
|
||||
(reached over LAN or the NetBird mesh); only deliberate exceptions are published. The
|
||||
project is `boma`; the domain is `wingu.me`. The legacy `baobab.band` zone (Cloudflare)
|
||||
|
|
|
|||
|
|
@ -67,7 +67,7 @@ configuration issues invisible to Ansible check mode.
|
|||
A Claude-driven exploratory check of a service's **application UI**, run as
|
||||
`/verify-service <name>` on `ubongo` (ADR-017). Claude drives Chromium via the
|
||||
`playwright` plugin against a **staging** deploy, authenticates through the real
|
||||
Traefik + Authentik SSO flow using a test user in the staging `test` group, then
|
||||
Caddy (ADR-024) + Authentik SSO flow using a test user in the staging `test` group, then
|
||||
executes the service's `roles/<service>/VERIFY.md` acceptance journeys *and*
|
||||
free-explores — judging pass/fail, screenshotting key states. It writes a dated report
|
||||
to `docs/testing/reviews/` and hands the operator a manual-test checklist for anything
|
||||
|
|
|
|||
|
|
@ -119,7 +119,8 @@ rendered entirely by the Ansible `dns` role:
|
|||
remains the ultimate source of truth for which hosts exist; the data simply flows
|
||||
through the inventory instead of through a direct Terraform→DNS write.
|
||||
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
|
||||
gateway, `forgejo.nyumbani.baobab.band` → proxy) are explicit zone data in `group_vars`.
|
||||
gateway, `vaultwarden.wingu.me` → proxy split-horizon) are explicit zone data in
|
||||
`group_vars`.
|
||||
|
||||
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
|
||||
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
|
||||
|
|
|
|||
|
|
@ -21,7 +21,7 @@ Each container role declares its class, e.g. `<role>__stateful: true|false` (def
|
|||
`false`). The split is the load-bearing classification for the whole policy.
|
||||
|
||||
- **Stateless** — no durable data of its own; losing the container loses nothing.
|
||||
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
|
||||
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Caddy,
|
||||
reverse proxies, FlareSolverr.
|
||||
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
|
||||
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
|
||||
|
|
@ -56,7 +56,7 @@ per host, in strict order with a verification gate between every phase:
|
|||
5. **Verify** again; alert on failure.
|
||||
|
||||
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
|
||||
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
|
||||
**before** the rest follow — so a DNS/Caddy failure doesn't make every host look
|
||||
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
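The host-ordering rule (DNS first, then the reverse proxy, then everything else) amounts to a priority sort. A minimal Python sketch follows, assuming each host carries a single role label; the labels, the `UPDATE_PRIORITY` table, and the sample hosts are hypothetical, and the per-host verification gate between phases is deliberately left out.

```python
# Lower number updates first; any role not listed falls into the last group.
UPDATE_PRIORITY = {"dns": 0, "reverse_proxy": 1}


def update_order(hosts: dict) -> list:
    """Return host names ordered DNS -> reverse proxy -> the rest."""
    return sorted(
        hosts,
        key=lambda name: (UPDATE_PRIORITY.get(hosts[name], 2), name),
    )


# Hypothetical fleet: host name -> role label.
fleet = {
    "media1": "app",
    "proxy": "reverse_proxy",
    "dns2": "dns",
    "dns1": "dns",
}

print(update_order(fleet))
```

For the sample fleet this yields `dns1`, `dns2`, `proxy`, `media1`. Hosts would still be processed one at a time, since the policy forbids rebooting the whole fleet at once.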
|
||||
|
||||
### 4. Snapshot-before is the rollback mechanism
|
||||
|
|
|
|||
|
|
@ -45,4 +45,6 @@ workload that should move, or a node due an upgrade.
|
|||
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
|
||||
not assumed.
|
||||
|
||||
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
|
||||
## Related
|
||||
|
||||
ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
|
||||
|
|
|
|||
|
|
@ -74,5 +74,7 @@ copy.
|
|||
cost of a clean methodological break.
|
||||
- The policy is enforceable in review and by the AI guardrails above.
|
||||
|
||||
See also: ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
|
||||
## Related
|
||||
|
||||
ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
|
||||
(update management — ntfy topics decided fresh per this policy).
|
||||
|
|
|
|||
|
|
@ -153,5 +153,7 @@ master password.
|
|||
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
|
||||
|
||||
See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
|
||||
## Related
|
||||
|
||||
ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
|
||||
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).
|
||||
|
|
|
|||
|
|
@ -1,5 +1,11 @@
|
|||
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
|
||||
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
|
||||
`base` exists.
|
||||
|
||||
## Context
|
||||
|
||||
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
|
||||
|
|
@ -89,12 +95,6 @@ allocated for it.
|
|||
version-pinned (ADR-011). boma's `dns` role stays authoritative for
|
||||
`boma.baobab.band`; NetBird built-in DNS scoped/off.
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
|
||||
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
|
||||
`base` exists.
|
||||
|
||||
## What was ruled out
|
||||
|
||||
| Option | Reason |
|
||||
|
|
@ -106,11 +106,6 @@ Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` rol
|
|||
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
|
||||
|
||||
See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
|
||||
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
|
||||
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
|
||||
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).
|
||||
|
||||
## Consequences
|
||||
|
||||
- A new public surface appears on `askari` — management API + dashboard (80/443) +
|
||||
|
|
@ -129,3 +124,10 @@ ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN addre
|
|||
operator footprint (What was ruled out).
|
||||
- Implementation is pending: the role tasks land only once the unbuilt `base` role and
|
||||
service-role machinery exist (Status).
|
||||
|
||||
## Related
|
||||
|
||||
ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
|
||||
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
|
||||
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
|
||||
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).
|
||||
|
|
|
|||
|
|
@ -1,5 +1,11 @@
|
|||
# ADR-017 — Service-UI acceptance verification (Level 4)
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
|
||||
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
|
||||
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
|
||||
|
||||
## Context
|
||||
|
||||
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
|
||||
|
|
@ -24,7 +30,7 @@ A Claude-driven exploratory service-UI verification harness — **Level 4** —
|
|||
(incl. destructive flows) against a *staging* deploy; the rebuildable sandbox
|
||||
resolves safety.
|
||||
4. **Test users in Authentik (central IdP), real SSO flow** — authenticates through
|
||||
Traefik + Authentik as a real user would.
|
||||
Caddy (ADR-024) + Authentik as a real user would.
|
||||
5. **Per-service `VERIFY.md` backbone + free exploration** — each service role ships an
|
||||
acceptance spec of critical journeys; Claude executes it and explores beyond it.
|
||||
|
||||
|
|
@ -63,12 +69,6 @@ them.
|
|||
- **No secrets leaked** — the git-ignored screenshot dir is the safety boundary;
|
||||
avoid capturing credential screens.
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
|
||||
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
|
||||
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
|
||||
|
||||
## Dependencies
|
||||
|
||||
- `ubongo` (ADR-015) — runs the browser. Designed, not built.
|
||||
|
|
@ -85,12 +85,9 @@ template, the `/verify-service` skill, the convention/checklist/Further-reading
|
|||
| Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
| Free-form, no per-service spec | Non-repeatable, can miss a critical flow; `VERIFY.md` gives a backbone. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Traefik+Authentik path; central test users are faithful. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Caddy+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |
|
||||
|
||||
See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
|
||||
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).
|
||||
|
||||
## Consequences
|
||||
|
||||
- The harness is confined to staging by a hard stop: it refuses to run against
|
||||
|
|
@ -108,3 +105,8 @@ ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge
|
|||
skill, conventions/checklist edits), but running is deferred on its dependencies:
|
||||
`ubongo`, the `playwright` plugin, Authentik, a staging deploy, and `make new-role`
|
||||
scaffolding `VERIFY.md` (Status; Dependencies).
|
||||
|
||||
## Related
|
||||
|
||||
ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
|
||||
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).
|
||||
|
|
|
|||
|
|
@ -1,5 +1,12 @@
|
|||
# ADR-018 — Logging and log integrity
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
|
||||
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
|
||||
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
|
||||
and the live pipeline.
|
||||
|
||||
## Context
|
||||
|
||||
boma wants all logs in one queryable store for troubleshooting, spotting issues over
|
||||
|
|
@ -70,13 +77,6 @@ ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a moni
|
|||
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
|
||||
tracked allocation in `docs/hardware/reference.md` (ADR-012).
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
|
||||
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
|
||||
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
|
||||
and the live pipeline.
|
||||
|
||||
## Dependencies
|
||||
|
||||
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
|
||||
|
|
@ -94,10 +94,6 @@ the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence a
|
|||
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
|
||||
|
||||
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
|
||||
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
|
||||
standard), ADR-011 (health checks — distinct from this).
|
||||
|
||||
## Consequences
|
||||
|
||||
- Opportunistic track-covering and host-pivot-to-store are defeated because logs leave
|
||||
|
|
@ -120,3 +116,9 @@ standard), ADR-011 (health checks — distinct from this).
|
|||
- The decision is authorable now but the live pipeline is deferred on the stack:
|
||||
Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog config, and the
|
||||
push-only credential (Status; Dependencies).
|
||||
|
||||
## Related
|
||||
|
||||
ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
|
||||
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
|
||||
standard), ADR-011 (health checks — distinct from this).
|
||||
|
|
|
|||
|
|
@ -49,7 +49,7 @@ slice on its own, and (c) doesn't overlap confusingly with another.
|
|||
| `monitoring` | metric exporters / health checks |
| `config` | render templated config/compose files to disk — **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
| `proxy` | reverse-proxy + TLS registration (Caddy routes, Authentik) |
|
||||
|
||||
The `config`/`deploy` split lets you re-render and diff configuration (`--tags
|
||||
config`) without bouncing services, then restart deliberately (`--tags deploy`).
|
||||
|
|
|
|||
|
|
@ -88,9 +88,9 @@ declarations (real drift risk).
|
|||
|
||||
`askari` sits outside the Proxmox cluster and has no OPNsense. Its **perimeter** layer
|
||||
is a TF-managed **Hetzner Cloud Firewall** (declared in `terraform/environments/offsite/`)
|
||||
alongside the VM itself. Current rule set (M2): SSH inbound from `ubongo`'s public IP
|
||||
only. NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator
|
||||
role is built.
|
||||
alongside the VM itself. Rule set: SSH inbound from `ubongo`'s public IP (M2), plus
|
||||
TCP 80/443 + UDP 3478 opened in **M4a** (Caddy + NetBird). The `netbird_coordinator`
|
||||
service role that uses 3478 lands in **M4b**; the ports are already open.
|
||||
|
||||
The `group_vars` service catalog remains authoritative for `askari`'s **host nftables**
|
||||
layer — the same two-layer model applies, with Hetzner Cloud Firewall substituting for
|
||||
|
|
|
|||
|
|
@ -19,9 +19,9 @@ Accepted (2026-06-14). Amends the soft Traefik assumption carried by the roadmap
|
|||
|
||||
boma needs a reverse proxy to front its services with TLS. ADR-002 requires every
|
||||
service to sit behind a proxy with authentication before it is reachable; ADR-007/M1
|
||||
delivers a `*.boma.<domain>` wildcard cert via ACME DNS-01 against Gandi — the only
|
||||
viable cert path for mesh/LAN-only services that cannot satisfy HTTP-01 (no public
|
||||
A-record to point at).
|
||||
delivers a `*.<domain>` wildcard cert via ACME DNS-01 against Gandi (the apex `boma`
|
||||
domain, matching ROADMAP M1) — the only viable cert path for mesh/LAN-only services
|
||||
that cannot satisfy HTTP-01 (no public A-record to point at).
|
||||
|
||||
The roadmap (Phase-2, step 5) and ADR-017 prose assumed **Traefik + Authentik** as the
|
||||
auth-and-proxy pair without an ADR ever pinning Traefik. On closer inspection:
|
||||
|
|
@ -57,10 +57,14 @@ boma's reverse proxy is **Caddy**.
|
|||
5. `forward_auth` to Authentik is a first-class Caddy directive — the planned
|
||||
Authentik auth story (ADR-002) is preserved without Traefik as the middleman.
|
||||
|
||||
### 2. Custom image
|
||||
### 2. Custom image (DNS-01 path only — Phase 2)
|
||||
|
||||
> Applies only to the **DNS-01** path, which is **deferred to Phase 2** (see the Status
|
||||
> note). M4a ships **vanilla `caddy:2`** on askari (HTTP-01) — no custom image.
|
||||
|
||||
Caddy's official Docker image does not include third-party DNS plugins. The `caddy-dns/gandi`
|
||||
plugin must be compiled in via `xcaddy`. boma builds a custom image:
|
||||
plugin must be compiled in via `xcaddy`. When the cluster's mesh/LAN-only services need
|
||||
DNS-01, boma builds a custom image:
|
||||
|
||||
```
|
||||
FROM caddy:builder AS builder
|
||||
|
|
@ -70,14 +74,16 @@ FROM caddy:latest
|
|||
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
|
||||
```

This image is maintained as a boma artifact (Forgejo registry, pinned digest in the
Compose template). It is the cost of the Gandi DNS-01 path — unavoidable regardless of
proxy choice.
That image would be maintained as a boma artifact (Forgejo registry, pinned digest in the
Compose template) — the cost of the Gandi DNS-01 path. (On askari this approach hit two
blockers, so DNS-01 is deferred; see the Status note.)

### 3. Deployment scope

The first Caddy instance fronts the NetBird stack on `askari` (M4). The pattern
generalises to the Proxmox cluster in Phase 2 when services multiply.
The first Caddy instance runs on `askari` (M4a), serving a test vhost over HTTP-01 to
prove the proxy + ACME path. It fronts the NetBird stack in **M4b** (when the
`netbird_coordinator` role is built). The pattern generalises to the Proxmox cluster in
Phase 2 when services multiply.

### 4. Authentik integration (deferred)

@@ -90,8 +96,9 @@ middleware migration is required.
- **Roadmap Phase-2 step 5** is updated from "Authentik + Traefik" to "Authentik +
Caddy (ADR-024)".
- **ADR-017 prose** that mentioned Traefik is updated to read "Caddy (ADR-024)".
- A custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must be built, pushed to the
Forgejo registry, and kept current (plugin + base image updates).
- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. **If/when**
the Phase-2 DNS-01 path lands, a custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must
be built, pushed to the Forgejo registry, and kept current (plugin + base image updates).
- Caddyfile config is rendered by Ansible from `group_vars` — consistent with ADR-004
and easier to review than distributed container labels.
- `forward_auth` to Authentik is available when Authentik is deployed; no extra
76
docs/reviews/2026-06-14-findings.json
Normal file
@@ -0,0 +1,76 @@
{
"date": "2026-06-14",
"reviewed_commit": "e346137",
"fixes_commit": null,
"mode": "on-demand",
"counts": {
"auto_fixed": 11,
"open": 29,
"scan": {
"broken-adr-ref": 4,
"broken-path-ref": 2,
"marker": 14,
"open-deferred-item": 5,
"stale-deferred": 0
}
},
"deferral_checklist": {
"adr-011-open-items": "all 5 ('Open questions': Proxmox snapshot driver, exact cadences, health-check harness home, classification home, staging-first) confirmed genuinely still open. ADR-011 is still Proposed/unbuilt; the same questions are echoed open in docs/TODO.md item 16; no later ADR or STATUS decides any of them. No stale-deferred.",
"stale_deferred_found": 0
},
"scan_false_positives": [
{"check": "broken-adr-ref", "location": "tests/test_repo_scan.py:10,43; docs/superpowers/plans/2026-06-10-adr-structure.md:50,83", "why": "ADR-099/ADR-100 are intentional test fixtures exercising the scanner's bad-ref detection."},
{"check": "broken-path-ref", "location": "docs/superpowers/plans/2026-06-14-m4b-netbird.md:28,56", "why": "roles/netbird/ is referenced by the M4b implementation plan for a role to be scaffolded via make new-role; forward-looking plan for unbuilt work, not a dead ref."},
{"check": "marker", "location": "docs/decisions/019-tagging.md:14 + docs/superpowers/plans/* + docs/superpowers/specs/*", "why": "019-tagging.md:14 is prose discussing 'over-tagging' as a concept ('the TODO explicitly warns against...'), not an actionable TODO. The 13 superpowers markers are historical planning artifacts (commit-message TODOs, plan steps)."}
],
"auto_fixed": [
{"id": "AF1", "dimension": "drift", "severity": "high", "location": "roles/reverse_proxy/meta/main.yml:4-6", "description": "meta description said 'ACME DNS-01 TLS via Gandi ... builds the custom image on-host (caddy-dns/gandi)' — but the role is now vanilla Caddy + HTTP-01 (commit b7e919d dropped the custom image); README/defaults/compose/STATUS all reflect vanilla. Only meta was stale and contradicted the code.", "fix": "rewrote description to 'Vanilla Caddy reverse proxy (ADR-024); TLS via ACME HTTP-01 for public hosts. Routes from reverse_proxy__routes, managed via Docker Compose.'", "tag": "new"},
{"id": "AF2", "dimension": "cruft", "severity": "medium", "location": "roles/README.md:11-15", "description": "Current-state paragraph said base hardening (SSH/fail2ban), auditd, packages, users 'not yet built' and docker_host 'scaffolded but has no tasks yet' — but STATUS records the hardening concern built+tested+applied to askari, and docker_host/reverse_proxy/public_dns all built.", "fix": "rewrote to: base firewall+hardening built (hardening applied to askari), docker_host/reverse_proxy/public_dns/dev_env built; auditd/packages/users pending.", "tag": "recurring"},
{"id": "AF3", "dimension": "drift", "severity": "medium", "location": "playbooks/README.md:6-13", "description": "site.yml note said docker_host 'scaffolded with no tasks yet' (now installs Docker engine) and the file omitted dns.yml and offsite.yml entirely.", "fix": "reworded site.yml note (base firewall+hardening, no cluster docker hosts yet) and added dns.yml + offsite.yml bullets.", "tag": "new"},
{"id": "AF4", "dimension": "cruft", "severity": "low", "location": "roles/public_dns/README.md:7-9", "description": "'the anti-spoof baseline now; askari in M4' — M4a is done; askari + *.askari records are applied.", "fix": "updated to note askari.wingu.me + *.askari wildcard applied in M4a.", "tag": "new"},
{"id": "AF5", "dimension": "cruft", "severity": "low", "location": "scripts/README.md:17", "description": "Helper-script list omitted check-tags.py, which exists and is run by make lint (ADR-019).", "fix": "added a check-tags.py bullet.", "tag": "new"},
{"id": "AF6", "dimension": "drift", "severity": "medium", "location": "terraform/README.md:7-15", "description": "Top-level terraform README omitted modules/hetzner_vm and environments/offsite — the only built+applied TF environment (askari).", "fix": "added hetzner_vm + offsite env bullets; scoped 'not yet init'ed' to the Proxmox envs.", "tag": "new"},
{"id": "AF7", "dimension": "cruft", "severity": "low", "location": "terraform/environments/offsite/providers.tf:1", "description": "Verified-stamp said 'cax11@hel1' but the deployed server is cx23 (CAX11 out of stock).", "fix": "stamp now reads cx23@hel1.", "tag": "new"},
{"id": "AF8", "dimension": "cruft", "severity": "low", "location": "terraform/modules/hetzner_vm/variables.tf:7", "description": "server_type description example was 'e.g. cax11 (ARM)'; the only consumer uses cx23.", "fix": "example now 'e.g. cx23 (x86) or cax11 (ARM)'.", "tag": "new"},
{"id": "AF9", "dimension": "drift", "severity": "medium", "location": "inventories/production/group_vars/all/public_dns.yml:16-17", "description": "Comment on the *.askari wildcard said 'Caddy gets a *.askari.wingu.me cert via DNS-01 (M4a)' — M4a uses HTTP-01 (the wildcard A record itself is still legitimately needed for name resolution).", "fix": "comment now says per-host certs via ACME HTTP-01 (M4a).", "tag": "new"},
{"id": "AF10", "dimension": "drift", "severity": "high", "location": "docs/CAPABILITIES.md:27,29", "description": "Capability table named Traefik as the reverse-proxy candidate (ADR-024 chose Caddy, built+applied) and marked public DNS 'apply pending' (applied 2026-06-14).", "fix": "reverse-proxy row -> 'Caddy (ADR-024)'; public DNS note -> 'applied (M1)'. (The V4-history Traefik mention at line 134 is correct and left as-is.)", "tag": "new"},
{"id": "AF11", "dimension": "cruft", "severity": "low", "location": "README.md:110-119", "description": "README 'Documentation' ADR list stopped at ADR-017; ADR-018..024 exist.", "fix": "extended the list through ADR-024 (logging, tagging, firewall, access, backup, ADR-structure, reverse-proxy).", "tag": "recurring"}
],
"open": [
{"id": "O1", "dimension": "drift", "severity": "high", "location": "STATUS.md:41 (+ 45-48) ↔ STATUS.md:33-34", "description": "The 'Scaffolded but empty — NOT implemented' table still lists roles/docker_host as 'Scaffolded, no tasks ... applying it is a no-op', and the trailing prose (45-48) repeats it. This contradicts STATUS.md:33-34 ('Built + applied', installs Docker CE + compose) and the actual roles/docker_host/tasks/main.yml. An internal STATUS contradiction; one side is plainly correct (docker_host is built).", "suggested_fix": "Remove/rewrite the docker_host row in the 'Scaffolded but empty' table and the 45-48 paragraph: docker_host now installs the Docker engine; only its deferred daemon-hardening + nftables.d scope (ADR-004/020) remains. Report (STATUS is the operator's ground-truth doc — reword deliberately).", "tag": "new", "auto_fixable": false},
{"id": "O2", "dimension": "consistency", "severity": "high", "location": "docs/decisions/004-docker-model.md:105,131 ↔ docs/decisions/022-backup.md", "description": "ADR-004 states twice that 'Backup strategy is defined separately (not in scope of this repo)'. ADR-022 defines a full in-repo backup/DR doctrine (restic, fisi pull node, per-service backup__* + BACKUP.md). Direct ADR↔ADR scope contradiction.", "suggested_fix": "Reword ADR-004's lines to point at ADR-022 (backup is now in-repo scope) and cross-link, per ADR-023's no-silent-reversal rule. Design decision — report.", "tag": "recurring", "auto_fixable": false},
{"id": "O3", "dimension": "consistency", "severity": "high", "location": "docs/decisions/024-reverse-proxy.md (Consequences) ↔ 008-testing.md:70; 017-service-ui-verification.md:27,88; 019-tagging.md:52", "description": "ADR-024's Consequences claim 'ADR-017 prose that mentioned Traefik is updated to read Caddy'. That update was NOT done: ADR-017:27,88 still say 'Traefik + Authentik'; ADR-008:70 'Traefik + Authentik SSO flow'; ADR-019:52 'Traefik routes, Authentik'. The doc set still designs around Traefik while ADR-024 overclaims the reconciliation was completed.", "suggested_fix": "Replace Traefik with Caddy (ADR-024) in ADR-008:70, ADR-017:27,88, ADR-019:52, OR soften ADR-024's Consequences to 'to be updated'. ADR prose = design docs — report (not auto-fixed).", "tag": "new", "auto_fixable": false},
{"id": "O4", "dimension": "conformance", "severity": "high", "location": "docs/decisions/023-adr-structure.md:7-8,77-80 ↔ 016-mesh-vpn.md:3; 017-service-ui-verification.md:3; 018-logging.md:3", "description": "ADR-023 §2 mandates ## Status as the first section and §6 explicitly claims ADRs 001–018 were retroactively restructured to lead with Status (calling out 016–018). But ADR-016/017/018 still open with ## Context, Status buried late (016:~92, 017:~66, 018:~73). ADR-023's own conformance claim is contradicted by three in-scope files. (Older ADRs 001–010 lead with Status but place Decision/Consequences after topical sections — an accepted presentational trade-off per ADR-023 §5/§6.)", "suggested_fix": "Either add a top-of-file ## Status section to ADR-016/017/018 (move the existing build-state line up), or correct ADR-023 §6 to exclude them. Reordering judgement — report.", "tag": "recurring", "auto_fixable": false},
{"id": "O5", "dimension": "consistency", "severity": "medium", "location": "docs/decisions/004-docker-model.md:48-50", "description": "The service-role file table (the canonical standard) lists only README/SECURITY/VERIFY; it omits ACCESS.md (ADR-021) and BACKUP.md (ADR-022), both of which CLAUDE.md + those ADRs mandate as required per-service-role files.", "suggested_fix": "Add ACCESS.md (ADR-021) and BACKUP.md (ADR-022, stateful) rows to ADR-004's file table.", "tag": "recurring", "auto_fixable": false},
{"id": "O6", "dimension": "drift", "severity": "medium", "location": "docs/decisions/002-security.md:82", "description": "References 'make deploy PLAYBOOK=upgrade' as the deliberate full-upgrade mechanism, but no upgrade.yml exists (only bootstrap/dns/offsite/site/workstation) and ADR-011 is still Proposed/unbuilt — stated without the '(planned)' caveat ADR-002 uses for its other unbuilt controls.", "suggested_fix": "Add a '(planned — ADR-011, not yet built)' caveat to the upgrade line, or drop the concrete command until upgrade.yml exists.", "tag": "recurring", "auto_fixable": false},
{"id": "O7", "dimension": "drift", "severity": "medium", "location": "docs/CAPABILITIES.md:150-155 ↔ STATUS.md:29", "description": "CAPABILITIES still lists nvim/kitty/tmux among 'Confirmed exclusions' boma 'deliberately does not' have, but the dev_env role (built+applied to ubongo) installs neovim + tmux. (The reverse-proxy/public-DNS rows in this file were auto-fixed in AF10; this exclusions block was left because it needs a scoped carve-out, not a token swap.)", "suggested_fix": "Scope the exclusion to managed cluster/server hosts and note the control/dev host (ubongo, ADR-015) runs an interactive dev_env, or drop nvim/tmux from the list.", "tag": "recurring", "auto_fixable": false},
{"id": "O8", "dimension": "conformance", "severity": "medium", "location": "roles/dev_env/tasks/main.yml (include_tasks per_user.yml) + roles/dev_env/tasks/per_user.yml:4-9", "description": "per_user.yml's getent + set_fact dev_env__home preflight is untagged, and the include_tasks that pulls it in carries no 'apply: tags:'. base/tasks/main.yml documents and guards exactly this gotcha with apply: tags:; dev_env does not. A partial --tags users or --tags config run selects only the include statement (running nothing) or, if made tag-aware, skips the set_fact and fails the dependent [config] tasks on an undefined dev_env__home. Against ADR-019's concern-runnable-in-isolation intent.", "suggested_fix": "Add apply: tags: [users, config] to the per_user.yml include (mirroring base), and tag the getent+set_fact with 'always' (or the union [users, config]).", "tag": "recurring", "auto_fixable": false},
{"id": "O9", "dimension": "drift", "severity": "medium", "location": "inventories/production/hosts.yml:1-17", "description": "Header claims 'Generated from Terraform outputs: make tf-inventory TF_ENV=production', but the file is hand-maintained: it carries the manual control host (ubongo) and omits the offsite_hosts group that tf_to_inventory.py always emits (VALID_GROUPS). Running tf-inventory against the empty production env would DROP ubongo and ADD offsite_hosts, so the header misrepresents how the file is managed.", "suggested_fix": "Make the header honest (hand-maintained for the manual control-node exception while production TF has no VMs; offsite hosts live in offsite.yml), and reconcile the declared group set with tf_to_inventory.py. Do NOT hand-regenerate hosts.yml in a way that drops ubongo.", "tag": "recurring", "auto_fixable": false},
{"id": "O10", "dimension": "consistency", "severity": "medium", "location": "inventories/production/group_vars/all/vars.yml:42 + hosts.yml:12 ↔ docs/decisions/007-network.md", "description": "ubongo's address is 10.20.10.151 (control host_var + base__firewall_control_addr), but ADR-007 defines srv as 10.20.0.0/24 (network__srv_subnet) and mgmt as 10.10.0.0/24 — 10.20.10.151 is in neither, and ADR-007's addressing tables don't record where the physical control node lives. base__firewall_control_addr (ADR-021 recovery path) depends on this being right.", "suggested_fix": "Add ubongo to ADR-007's addressing table (which VLAN/segment 10.20.10.151 belongs to, clearly outside srv 10.20.0.0/24), or correct the address. Confirm the real address with the operator first.", "tag": "recurring", "auto_fixable": false},
{"id": "O11", "dimension": "consistency", "severity": "medium", "location": "terraform/environments/{staging,production}/terraform.tfvars.example:9-11 + variables.tf:5", "description": "Proxmox node naming uses 'pve01' (two-digit) in both tfvars.example files and the proxmox_endpoint var descriptions; ADR-007 defines single-digit node names pve0/pve1/pve2, and internal FQDNs as <host>.boma.<domain>. Example contradicts the naming convention.", "suggested_fix": "Align example values with ADR-007 (proxmox_node = pve0; endpoint = https://pve0.boma.<domain>:8006/). Verify the intended node name with the operator before changing — report rather than auto-fix.", "tag": "recurring", "auto_fixable": false},
{"id": "O12", "dimension": "conformance", "severity": "medium", "location": "roles/reverse_proxy/ (missing SECURITY.md, VERIFY.md, ACCESS.md, BACKUP.md)", "description": "CLAUDE.md requires every service role to carry SECURITY.md (ADR-002/004), VERIFY.md (ADR-008/017), ACCESS.md (ADR-021), and a stateful BACKUP.md (ADR-022); a stateless service records backup__state: false with a reason. reverse_proxy is the first real built+applied service role (askari, M4a) but ships only README.md. (Judgement recorded: public_dns is exempt — it runs on the control node against an external DNS API, provisioning no host-resident service/port, so it is not a 'service' role in the ADR-004 sense.)", "suggested_fix": "Add the four files from docs/security|testing|access|backup/ templates. BACKUP.md can declare backup__state: false (Caddy state = re-issuable ACME certs).", "tag": "new", "auto_fixable": false},
{"id": "O13", "dimension": "consistency", "severity": "low", "location": "docs/decisions/012-hardware-capacity.md; 013-heritage-v4.md:77; 015-control-host.md; 016-mesh-vpn.md; 017-service-ui-verification.md; 018-logging.md", "description": "Inconsistent cross-reference convention: ADRs 014/019/020/021/022/023 + adr-template use a dedicated '## Related' section, while 012/013/015/016/017/018 use an inline 'See also:' prose line (placed mid-document in 016/017/018). ADR-023 §3 names ## Related as the optional section; 'See also:' is an undocumented variant.", "suggested_fix": "Convert the 'See also:' prose into ## Related sections (after Consequences) in ADR-012/013/015/016/017/018 for uniformity. Cosmetic.", "tag": "recurring", "auto_fixable": false},
{"id": "O14", "dimension": "consistency", "severity": "low", "location": "docs/README.md:4-8; inventories/README.md", "description": "docs/README.md lists only decisions/ + runbooks/ (omits security/testing/access/backup/hardware/reviews); inventories/README.md omits the offsite_hosts group documented in CLAUDE.md. Both narrower than current reality.", "suggested_fix": "Add the missing subdir rows / note offsite_hosts, or explicitly defer to the canonical list in the repo README / CLAUDE.md.", "tag": "recurring", "auto_fixable": false},
{"id": "O15", "dimension": "drift", "severity": "medium", "location": "docs/runbooks/new-host.md:82,114-138 (Part E)", "description": "Part E (control node ubongo) still instructs 'ssh ansible@<IP>' / an ansible-user flow, but STATUS records ubongo is deliberately managed as the operator account sjat (group_vars/control ansible_user: sjat) with the ansible-user bootstrap listed as Pending.", "suggested_fix": "Update Part E to reflect ubongo managed as sjat (no ansible user yet), the ansible-user bootstrap a pending item per STATUS.md.", "tag": "recurring", "auto_fixable": false},
{"id": "O16", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/files/dotfiles/zsh/.zshrc:28,55", "description": "Shipped .zshrc hard-codes alias rclone=\"/usr/bin/rclone\" (rclone not installed by dev_env) and 'eval \"$(direnv hook zsh)\"' unguarded (unlike the guarded oh-my-posh block) — heritage fisi/V4 carryovers. If direnv is dropped from dev_env__packages, every shell startup errors.", "suggested_fix": "Drop the rclone alias and guard the direnv hook with 'command -v direnv', or document direnv as a hard dependency of the shipped .zshrc.", "tag": "recurring", "auto_fixable": false},
{"id": "O17", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/tasks/oh_my_posh.yml:15-26", "description": "The zen.toml theme-directory + deploy tasks render config to disk but carry no 'config' tag, while analogous dotfile tasks in per_user.yml are tagged config — inconsistent concern tagging within the role.", "suggested_fix": "Add tags: [config] to the zen.toml directory + deploy tasks.", "tag": "recurring", "auto_fixable": false},
{"id": "O18", "dimension": "drift", "severity": "medium", "location": "docs/decisions/007-network.md:159,167,186 + 009-provisioning-handoff.md:114 + 016-mesh-vpn.md:90 ↔ 007-network.md:174,184", "description": "Internal-zone name is inconsistent across the doc set: ADR-007:159/167/186, ADR-009:114, ADR-016:90 call it 'boma.baobab.band', while ADR-007:174/184 says infra is '<host>.boma.wingu.me' and the internal zone 'will be renamed to boma.wingu.me' (Phase 2). M1 moved boma's home to wingu.me. A reader can't tell which domain the unbuilt dns role should render.", "suggested_fix": "State the transitional state in one authoritative place (current = boma.baobab.band, target = boma.wingu.me in Phase 2), or align all references on the target. Report.", "tag": "new", "auto_fixable": false},
{"id": "O19", "dimension": "consistency", "severity": "low", "location": "docs/decisions/009-provisioning-handoff.md:122", "description": "M1 retired 'nyumbani' as a naming tier (ROADMAP:70, ADR-007:176). ADR-009:122 still uses 'forgejo.nyumbani.baobab.band' as the worked example of internal-zone data the dns role would render. (Note: STATUS:19 + ADR-003/008/010 use the same name for the LIVE legacy Forgejo host, which is legitimately legacy infra — distinguish.)", "suggested_fix": "Update the ADR-009:122 example to a non-nyumbani name consistent with the retired-nyumbani decision; annotate the legacy Forgejo references as intentionally legacy where they remain.", "tag": "recurring", "auto_fixable": false},
{"id": "O20", "dimension": "drift", "severity": "low", "location": "docs/ROADMAP.md:82-83", "description": "ROADMAP M2 still describes askari as 'CAX11 ARM / Helsinki', but STATUS records it provisioned as cx23/x86 (CAX11/ARM out of stock EU-wide on 2026-06-14). M3/M4 sections got DONE notes; M2's spec line wasn't corrected.", "suggested_fix": "Update ROADMAP M2 to note askari shipped as cx23/x86 (CAX11 unavailable), or add a DONE note mirroring M3/M4.", "tag": "new", "auto_fixable": false},
{"id": "O21", "dimension": "drift", "severity": "low", "location": "docs/decisions/020-firewall.md:91-93", "description": "ADR-020 says askari's Hetzner Cloud Firewall 'NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator role is built' — but M4a is DONE and the firewall already opens 80/443/3478. Future-tense is stale; only the netbird role (M4b) remains.", "suggested_fix": "Update ADR-020 to past tense (80/443/3478 opened in M4a); keep the netbird coordinator role (M4b) caveated as unbuilt.", "tag": "new", "auto_fixable": false},
{"id": "O22", "dimension": "consistency", "severity": "low", "location": "docs/decisions/024-reverse-proxy.md:60-92", "description": "ADR-024 is internally inconsistent post-revision: the revised Status note says askari ships HTTP-01 with vanilla Caddy (custom-image DNS-01 deferred to Phase 2), but Decision §2 still asserts boma builds/maintains the custom xcaddy+gandi image, §3 says 'fronts the NetBird stack on askari (M4)' (M4b unbuilt), and Consequences still lists 'a custom Caddy image must be built/pushed/kept current' as a present obligation.", "suggested_fix": "Scope the custom-image obligation (§2, Consequences) to the deferred Phase-2 DNS-01 path; soften §3 to reflect that M4a ships a test vhost and the NetBird front-end is M4b. Report (touches decision substance).", "tag": "new", "auto_fixable": false},
{"id": "O23", "dimension": "consistency", "severity": "low", "location": "docs/decisions/001-architecture.md:50 + 016-mesh-vpn.md:87 ↔ docs/ROADMAP.md:116", "description": "The future NetBird service role is named 'netbird_coordinator' in ADR-001:50 + ADR-016:87 (coordinator framing also in STATUS), but ROADMAP M4b:116 calls it 'the netbird service role'. make new-role creates one directory name; the committed names will mismatch the actual role at build time. (The M4b plan at docs/superpowers/plans/2026-06-14-m4b-netbird.md also uses 'netbird'.)", "suggested_fix": "Settle one role name and align ADR-001/016, ROADMAP, and the M4b plan before scaffolding.", "tag": "new", "auto_fixable": false},
{"id": "O24", "dimension": "consistency", "severity": "low", "location": "docs/decisions/024-reverse-proxy.md:22 ↔ docs/ROADMAP.md:71", "description": "ADR-024 describes the M1 ACME DNS-01 wildcard as '*.boma.<domain>' (infra subdomain), while ROADMAP:71 specifies '*.<boma-domain>' (apex). Different name spaces — the cert's actual SAN coverage for unexposed services is ambiguous across the two docs.", "suggested_fix": "Align the wildcard scope (decide *.wingu.me vs *.boma.wingu.me vs both) and state it identically in ADR-024 and ROADMAP.", "tag": "new", "auto_fixable": false},
{"id": "O25", "dimension": "consistency", "severity": "low", "location": "roles/reverse_proxy/molecule/default/verify.yml:11,22; roles/public_dns/molecule/default/verify.yml:12", "description": "Molecule verify tasks use tags: [verify], which is not in the tests/tags.yml vocabulary (concerns/special/opt_ins/playbooks). check-tags.py exempts molecule/ paths so the linter doesn't flag it, and 4 roles use this de-facto convention — but it's an out-of-vocabulary tag the ADR-019 standard doesn't sanction.", "suggested_fix": "Either drop the tags from molecule verify tasks (the linter ignores molecule anyway) or add 'verify' as a sanctioned testing-only tag in tests/tags.yml with an ADR-019 note. Repo-wide convention call.", "tag": "new", "auto_fixable": false},
{"id": "O26", "dimension": "consistency", "severity": "low", "location": "roles/reverse_proxy/templates/Caddyfile.j2:1; docker-compose.yml.j2:1", "description": "Neither rendered template carries an {{ ansible_managed }} header, though ADR-024 §1.2 cites 'one ansible_managed header' as a Caddy advantage. (No template in the repo currently uses ansible_managed — consistent with current practice but inconsistent with the ADR's stated intent.)", "suggested_fix": "Add a commented '# {{ ansible_managed }}' header to both templates (and ideally adopt the convention repo-wide).", "tag": "new", "auto_fixable": false},
{"id": "O27", "dimension": "consistency", "severity": "low", "location": "inventories/production/group_vars/all/reverse_proxy.yml", "description": "reverse_proxy production vars live in group_vars/all/ (every host) though the role only runs on offsite_hosts via offsite.yml; CLAUDE.md establishes an offsite_hosts/ group_vars dir for askari-specific config, which doesn't exist on disk. Harmless today (only askari imports the role) but broader scope than intended.", "suggested_fix": "Consider moving reverse_proxy.yml (and the offsite firewall opens) to group_vars/offsite_hosts/ for scope clarity, or leave if intentionally global. Judgement call.", "tag": "new", "auto_fixable": false},
{"id": "O28", "dimension": "drift", "severity": "low", "location": "scripts/capacity-scan.py:133", "description": "capacity-scan.py cross-checks workload hostnames only against inventories/<env>/hosts.yml. askari lives in inventories/production/offsite.yml, not hosts.yml, so the drift cross-check never sees it. Minor (capacity is intent-based today) but a latent gap as offsite hosts grow.", "suggested_fix": "Also read offsite.yml (or glob inventories/<env>/*.yml host files) so offsite_hosts are included.", "tag": "new", "auto_fixable": false},
{"id": "O29", "dimension": "consistency", "severity": "low", "location": "inventories/production/offsite.yml:1-16 ↔ inventories/production/hosts.yml:7-16", "description": "offsite.yml (generated by tf-inventory-offsite) re-declares control/docker_hosts/proxmox_hosts with empty host maps because tf_to_inventory.py always emits all four VALID_GROUPS — duplicating groups in hosts.yml in the same inventory dir. Ansible merges them harmlessly, but the duplication/merge is undocumented.", "suggested_fix": "Document in inventories/README.md that offsite.yml is a second generated inventory file merged with hosts.yml, or have tf_to_inventory.py emit only non-empty groups for offsite. Leave as-is if intended; just document.", "tag": "new", "auto_fixable": false}
],
"prior_resolved": [
{"id": "O1@2026-06-11", "description": "make lint RED on main (site.yml imported nonexistent docker_host role)", "status": "resolved — docker_host scaffolded (03d33f8) then built (456c27d); make lint green this run."},
{"id": "O10@2026-06-11", "description": "README ADR list stopped early (recurring)", "status": "resolved — auto-fixed this run (AF11), extended through ADR-024."},
{"id": "O17@2026-06-11", "description": "empty handlers/main.yml scaffold artifacts in base/dev_env", "status": "resolved (accepted) — treated as an intentional make new-role scaffold convention; not re-raised."},
{"id": "O2,O3,O4,O5,O6,O7,O8,O9,O11,O12,O13,O14,O15,O16,O18@2026-06-11", "description": "ADR-004 backup scope; ADR-004 ACCESS/BACKUP table; CAPABILITIES nvim/tmux; ADR-002 upgrade caveat; hosts.yml offsite_hosts; new-host Part E; dev_env set_fact tag; ubongo subnet; ADR section order; ADR-007 example; .zshrc rclone/direnv; oh_my_posh config tag; tfvars pve01; See-also vs Related; docs/inventories README narrowness", "status": "still open — carried forward as O2,O5,O7,O6,O9,O15,O8,O10,O4,O18/O19,O16,O17,O11,O13,O14 respectively (renumbered)."}
]
}
157
docs/reviews/2026-06-14-review.md
Normal file
@@ -0,0 +1,157 @@
# Repo review — 2026-06-14

- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
- **Previous run:** 2026-06-11 (`67f2aba`)
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).

## Summary

A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.

### Counts

| Dimension | High | Medium | Low | Total |
|---|---|---|---|---|
| Cruft / staleness | 0 | 0 | 0 | 0 |
| Design conformance | 1 | 2 | 0 | 3 |
| Consistency & intent | 2 | 3 | 12 | 17 |
| Docs-vs-reality drift | 1 | 5 | 3 | 9 |
| **Open total** | **4** | **10** | **15** | **29** |

Plus **11 auto-fixes applied** (2 high, 4 medium, 5 low).
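
The counts above are a per-dimension, per-severity tally of the `open` array in the findings JSON. A minimal sketch of such a tally follows, assuming each finding carries `dimension` and `severity` keys as in `docs/reviews/2026-06-14-findings.json`. The sample data and the helpers `tally` and `render_table` are hypothetical, written for illustration only, and are not part of the repo's tooling.

```python
import json
from collections import Counter

# Hypothetical excerpt shaped like the "open" array in the findings JSON;
# only the fields needed for the tally are kept.
SAMPLE = json.loads("""
{"open": [
  {"id": "O1", "dimension": "drift", "severity": "high"},
  {"id": "O4", "dimension": "conformance", "severity": "high"},
  {"id": "O5", "dimension": "consistency", "severity": "medium"},
  {"id": "O13", "dimension": "consistency", "severity": "low"}
]}
""")

SEVERITIES = ("high", "medium", "low")


def tally(findings):
    """Count findings per (dimension, severity) pair."""
    return Counter((f["dimension"], f["severity"]) for f in findings)


def render_table(findings):
    """Render a Markdown counts table with a totals row."""
    counts = tally(findings)
    dimensions = sorted({f["dimension"] for f in findings})
    lines = ["| Dimension | High | Medium | Low | Total |", "|---|---|---|---|---|"]
    for dim in dimensions:
        row = [counts[(dim, sev)] for sev in SEVERITIES]
        lines.append(f"| {dim} | {row[0]} | {row[1]} | {row[2]} | {sum(row)} |")
    totals = [sum(counts[(d, s)] for d in dimensions) for s in SEVERITIES]
    lines.append(f"| Open total | {totals[0]} | {totals[1]} | {totals[2]} | {sum(totals)} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table(SAMPLE["open"]))
```

Generating the table from the JSON, rather than maintaining it by hand, would keep the row and column totals in step with the `counts.open` field.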
### Phase-0 scan

`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
not a TODO). Details in the findings JSON.
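
For readers unfamiliar with the `broken-adr-ref` check, the idea can be sketched as follows. This is an illustrative assumption about how such a check could work, not the actual `repo-scan.py` implementation; the function `broken_adr_refs`, the regular expression, and the `known` set (taken here to be ADR-001 through ADR-024) are all hypothetical.

```python
import re

# Matches references of the form ADR-NNN (three digits), an assumed convention.
ADR_REF = re.compile(r"\bADR-(\d{3})\b")


def broken_adr_refs(text, known):
    """Return the sorted ADR numbers referenced in text but absent from known."""
    referenced = {int(m.group(1)) for m in ADR_REF.finditer(text)}
    return sorted(referenced - set(known))


# Assumed set of existing ADR numbers for this sketch.
known = set(range(1, 25))
sample = "See ADR-024 and ADR-002; ADR-099 is a deliberate bad ref."

if __name__ == "__main__":
    print(broken_adr_refs(sample, known))
```

A check of this shape would flag the ADR-099/100 test fixtures, which is why they appear in the scan counts and are then recorded as intentional false positives.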
### Deferral checklist

All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
harness home, classification home, staging-first) confirmed **genuinely still open** —
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).

## Auto-fixes applied

All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
descriptions) — no logic, variable, secret, or task-order changes.

| ID | Sev | File | What |
|---|---|---|---|
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1` → `cx23@hel1` (actual server) |
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)` → `cx23 (x86) or cax11 (ARM)` |
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik` → `Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
|
||||
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |
|
||||
|
||||
## Open findings (prioritised)
|
||||
|
||||
### High
|
||||
|
||||
- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
|
||||
the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
|
||||
"Built + applied" rows and the real tasks file. Reword the scaffold row + closing
|
||||
paragraph (left for the operator — STATUS is the ground-truth doc).
|
||||
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
|
||||
"not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
|
||||
ADR-004 at ADR-022 (ADR↔ADR design decision — report).
|
||||
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
|
||||
ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
|
||||
still say Traefik too. Either finish the rename or soften ADR-024's claim.
|
||||
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
|
||||
claims ADRs 001–018 were restructured to lead with `## Status`, but 016/017/018 still
|
||||
open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.
|
||||
|
||||
### Medium
|
||||
|
||||
- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
|
||||
BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
|
||||
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
|
||||
`upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
|
||||
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
|
||||
"confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
|
||||
carve-out (not a token swap, so left out of AF10).
|
||||
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
|
||||
`set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
|
||||
`--tags users|config` run breaks (base guards this; dev_env doesn't).
|
||||
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
|
||||
it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
|
||||
ubongo. Make the header honest.
|
||||
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
|
||||
in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
|
||||
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
|
||||
`pve0`; verify the real node name before changing.
|
||||
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
|
||||
SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt —
|
||||
control-node external-API role, not a host service.)
|
||||
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
|
||||
on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
|
||||
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
|
||||
`boma.wingu.me` used inconsistently across the doc set after M1; state the transition
|
||||
in one place.
|
||||
|
||||
### Low
|
||||
|
||||
O13 (See-also vs `## Related` in ADR-012/013/015/016/017/018 — recurring), O14
|
||||
(docs/README + inventories/README narrow enumerations — recurring), O16 (.zshrc rclone
|
||||
alias + unguarded direnv hook — recurring), O17 (oh_my_posh zen.toml tasks missing
|
||||
`config` tag — recurring), O19 (ADR-009:122 `nyumbani` example after retirement —
|
||||
recurring), O20 (ROADMAP M2 CAX11/ARM vs cx23/x86 — new), O21 (ADR-020 "ports will be
|
||||
added in M4" stale; already opened in M4a — new), O22 (ADR-024 body still asserts
|
||||
custom-image obligation contradicting its revised Status — new), O23 (`netbird_coordinator` vs
|
||||
`netbird` role name across ADRs/ROADMAP/plan — new), O24 (`*.boma.<domain>` vs
|
||||
`*.<boma-domain>` wildcard scope ADR-024 vs ROADMAP — new), O25 (`tags: [verify]` out of
|
||||
the ADR-019 vocabulary in molecule verify — new), O26 (reverse_proxy templates lack
|
||||
`ansible_managed` header — new), O27 (reverse_proxy vars in `group_vars/all/` not
|
||||
`offsite_hosts/` — new), O28 (capacity-scan.py ignores `offsite.yml` — new), O29
|
||||
(offsite.yml duplicates empty groups from hosts.yml, undocumented merge — new).
|
||||
|
||||
Full detail + suggested fixes in `2026-06-14-findings.json`.
|
||||
|
||||
## Themes worth a deliberate pass
|
||||
|
||||
1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
|
||||
sweep across ADR-008/017/019 closes it.
|
||||
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
|
||||
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
|
||||
body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
|
||||
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
|
||||
mirror base's `apply: tags:` guard.
|
||||
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
|
||||
future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.
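
A minimal sketch of the O8 fix, under stated assumptions: the loop variable `dev_env__user`, the list `dev_env__users`, the main tasks file name, and the `/home/...` expression are hypothetical placeholders, and only `dev_env__home`, `per_user.yml`, and the `users`/`config` tags come from the findings. In Ansible, tags placed on `include_tasks` select only the include statement itself, so the included tasks need `apply: tags:`, and the preflight `set_fact` needs `always` if a partial `--tags` run should still resolve the home directory:

```yaml
# roles/dev_env/tasks/main.yml (assumed location of the include)
- name: Configure each managed user
  ansible.builtin.include_tasks:
    file: per_user.yml
    apply:
      # Propagate the tags to every task inside per_user.yml.
      tags: [users, config]
  loop: "{{ dev_env__users }}"        # hypothetical list of usernames
  loop_control:
    loop_var: dev_env__user           # hypothetical loop variable
  # The include itself must also carry the tags, or --tags skips it entirely.
  tags: [users, config]
---
# roles/dev_env/tasks/per_user.yml
- name: Resolve the user's home directory (preflight)
  ansible.builtin.set_fact:
    dev_env__home: "/home/{{ dev_env__user }}"   # placeholder expression
  # 'always' keeps the fact available whichever tag subset is requested.
  tags: [always]
```

With this shape, a run limited to `--tags config` would still execute the include and the preflight, which is the behaviour the suggested Molecule assertion is meant to pin down.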
|
||||
|
||||
## Follow-up prompt
|
||||
|
||||
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
|
||||
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
|
||||
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
|
||||
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
|
||||
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
|
||||
> with its own revised HTTP-01 Status note.
|
||||
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
|
||||
> ACCESS.md + BACKUP.md rows to its service-role file table.
|
||||
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
|
||||
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
|
||||
> `--tags config` run still resolves the home dir.
|
||||
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
|
||||
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
|
||||
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
|
||||
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
|
||||
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
|
||||
> batch; commit per CLAUDE.md git conventions.
|
||||
|
|
@ -1,161 +1,157 @@
|
|||
# Repo review — 2026-06-11
|
||||
# Repo review — 2026-06-14
|
||||
|
||||
- **Reviewed commit:** `67f2aba` (main)
|
||||
- **Mode:** on-demand (interactive)
|
||||
- **Previous run:** `2026-06-05` (commit `f566fd1`)
|
||||
- **Process:** Phase 0 deterministic scan → 5 parallel shard reviewers + 1 cross-cutting
|
||||
reviewer → synthesis, deferral-checklist resolution, prior-run diff → safe auto-fixes.
|
||||
- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
|
||||
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
|
||||
- **Previous run:** 2026-06-11 (`67f2aba`)
|
||||
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).
|
||||
|
||||
## Summary
|
||||
|
||||
| | High | Medium | Low | Total |
|
||||
A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
|
||||
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
|
||||
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
|
||||
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
|
||||
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
|
||||
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.
|
||||
|
||||
### Counts
|
||||
|
||||
| Dimension | High | Medium | Low | Total |
|
||||
|---|---|---|---|---|
|
||||
| **Auto-fixed** | 1 | 2 | 2 | 5 |
|
||||
| **Open (report-only)** | 2 | 7 | 9 | 18 |
|
||||
| Cruft / staleness | 0 | 0 | 0 | 0 |
|
||||
| Design conformance | 1 | 2 | 2 | 5 |
|
||||
| Consistency & intent | 2 | 2 | 9 | 13 |
|
||||
| Docs-vs-reality drift | 1 | 4 | 5 | 10 |
|
||||
| **Open total** | **4** | **8** | **16** | **29** |
|
||||
|
||||
By dimension (open): conformance 3 · consistency 8 · drift 6 · cruft 1.
|
||||
Plus **11 auto-fixes applied** (2 high, 4 medium, 5 low).
|
||||
|
||||
**Headline:** `make lint` is currently **red on `main`** — `playbooks/site.yml` imports the
|
||||
not-yet-existent `docker_host` role (confirmed at clean HEAD, unrelated to this run's
|
||||
edits). That breaks CLAUDE.md's "main must always work" / "Never skip lint" contract and
|
||||
is the top open finding (O1). The bulk of the rest is documentation drift created by the
|
||||
recent `base` (firewall) + `dev_env` build wave: several READMEs/playbook notes still
|
||||
described the roles as "empty / not built." Those were the safe auto-fixes.
|
||||
### Phase-0 scan
|
||||
|
||||
**Good news:** 7 of the 12 open findings from the 2026-06-05 run are confirmed resolved
|
||||
(VERIFY.md row + runbook step, backend.tf relabel, askari group naming, ADR-014
|
||||
reproducibility, CAPABILITIES Level-4 row, TODO 3.10). The deferral checklist is clean —
|
||||
**0 stale-deferred** this run (the recurring miss logged in FRICTION.md did not recur).
|
||||
`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
|
||||
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
|
||||
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
|
||||
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
|
||||
not a TODO). Details in the findings JSON.
|
||||
|
||||
### Deferral checklist
|
||||
|
||||
All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
|
||||
harness home, classification home, staging-first) confirmed **genuinely still open** —
|
||||
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
|
||||
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).
|
||||
|
||||
## Auto-fixes applied
|
||||
|
||||
Markdown / YAML-comment only; no runtime behaviour, logic, vars, or task order touched.
|
||||
All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
|
||||
descriptions) — no logic, variable, secret, or task-order changes.
|
||||
|
||||
| ID | Sev | File(s) | What |
|
||||
| ID | Sev | File | What |
|
||||
|---|---|---|---|
|
||||
| AF1 | high | `roles/README.md` | Rewrote stale "base & docker_host are empty untracked dirs, site.yml would fail on a clean clone" → base partially built (firewall), docker_host not yet created, dev_env built+applied. |
|
||||
| AF2 | med | `playbooks/site.yml` | NOTE no longer claims base is unbuilt / "fails on a clean clone"; now reflects firewall-only base + missing docker_host. |
|
||||
| AF3 | med | `playbooks/README.md` | Dropped the "currently a no-op" claim; added a `workstation.yml` bullet. |
|
||||
| AF4 | low | `README.md` | Added `docs/access/`, `docs/backup/`, `roles/dev_env/`, `playbooks/workstation.yml` to the project-structure tree. |
|
||||
| AF5 | low | `docs/decisions/016-mesh-vpn.md`, `docs/decisions/020-firewall.md` | Added the reciprocal `ADR-021` cross-reference that ADR-021 says it amended in. |
|
||||
|
||||
> `make lint` was re-run after the fixes: it fails **only** on the pre-existing
|
||||
> `docker_host` syntax-check (O1), identical to clean HEAD. No auto-fix introduced or
|
||||
> changed any lint result, so none were reverted.
|
||||
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
|
||||
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
|
||||
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
|
||||
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
|
||||
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
|
||||
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
|
||||
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1` → `cx23@hel1` (actual server) |
|
||||
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)` → `cx23 (x86) or cax11 (ARM)` |
|
||||
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
|
||||
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik` → `Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
|
||||
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |
|
||||
|
||||
## Open findings (prioritised)
|
||||
|
||||
### High
|
||||
|
||||
- **O1 — `make lint` is red on `main`** · `playbooks/site.yml:18` · *conformance*
|
||||
site.yml imports the `docker_host` role, which does not exist, so ansible-lint's
|
||||
syntax-check fails on a clean checkout. Violates "main must always work" + "Never skip
|
||||
lint" (pre-commit would block every commit unless bypassed).
|
||||
*Fix (judgement):* guard/skip the docker_host play until the role exists, scaffold a
|
||||
stub via `make new-role NAME=docker_host`, or exclude site.yml from syntax-check until
|
||||
built — and record the choice. **new**
|
||||
|
||||
- **O2 — ADR-004 ↔ ADR-022 backup-scope contradiction** ·
|
||||
`docs/decisions/004-docker-model.md:105` · *consistency*
|
||||
ADR-004 says "Backup strategy is defined separately (not in scope of this repo)";
|
||||
ADR-022 defines a full in-repo backup strategy. Per ADR-023 (no silent reversals),
|
||||
update ADR-004's line to defer to ADR-022 and cross-link. Design decision — report. **new**
|
||||
- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
|
||||
the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
|
||||
"Built + applied" rows and the real tasks file. Reword the scaffold row + closing
|
||||
paragraph (left for the operator — STATUS is the ground-truth doc).
|
||||
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
|
||||
"not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
|
||||
ADR-004 at ADR-022 (ADR↔ADR design decision — report).
|
||||
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
|
||||
ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
|
||||
still say Traefik too. Either finish the rename or soften ADR-024's claim.
|
||||
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
|
||||
claims ADRs 001–018 were restructured to lead with `## Status`, but 016/017/018 still
|
||||
open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.
|
||||
|
||||
### Medium
|
||||
|
||||
- **O3 — ADR-004 service-role file table missing ACCESS.md + BACKUP.md** ·
|
||||
`docs/decisions/004-docker-model.md:48` · *consistency* — CLAUDE.md + ADR-021/022 now
|
||||
mandate both for service roles; the canonical table lists only SECURITY.md + VERIFY.md.
|
||||
(Prior "missing VERIFY.md" is resolved; this is the next evolution.) **new**
|
||||
- **O4 — CAPABILITIES nvim/tmux exclusion ↔ dev_env built** ·
|
||||
`docs/CAPABILITIES.md:149` · *consistency* — listed as a confirmed exclusion
|
||||
("server-only"), but `dev_env` (built+applied to ubongo) installs exactly that. Carve
|
||||
out the control-node/AI-worker exception (ADR-015). **new**
|
||||
- **O5 — phantom `make deploy PLAYBOOK=upgrade`** · `docs/decisions/002-security.md:82` ·
|
||||
*drift* — no `upgrade.yml` exists; ADR-011 is unbuilt. Add a "(planned)" caveat. **new**
|
||||
- **O6 — hosts.yml stubs missing `offsite_hosts` group** ·
|
||||
`inventories/{production,staging}/hosts.yml` · *drift* — the generator emits it (one of
|
||||
four VALID_GROUPS); the hand-stubs predate the standard. Regenerate via
|
||||
`make tf-inventory` (don't hand-edit). (Prior "askari group unnamed" is resolved.) **new**
|
||||
- **O7 — new-host runbook Part E vs ubongo reality** · `docs/runbooks/new-host.md:81-130`
|
||||
· *drift* — instructs creating an `ansible` user / `ssh ansible@`; STATUS records ubongo
|
||||
is managed as `sjat`, ansible-user bootstrap pending. **new**
|
||||
- **O8 — dev_env untagged `set_fact` under tagged consumers** ·
|
||||
`roles/dev_env/tasks/per_user.yml:2-9` · *conformance* — partial `--tags users|config`
|
||||
runs skip the `dev_env__home` set_fact and fail. Tag the preflight `[users, config]` or
|
||||
`always`. **new**
|
||||
- **O9 — ubongo address outside ADR-007 subnets** · `STATUS.md:31 ↔ 007-network.md` ·
|
||||
*drift* — 10.20.10.151 is in neither srv (10.20.0.0/24) nor mgmt (10.10.0.0/24);
|
||||
`base__firewall_control_addr` depends on it. Already a tracked follow-up in the
|
||||
ubongo-build plan. Reconcile address or ADR-007. **new**
|
||||
- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
|
||||
BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
|
||||
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
|
||||
`upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
|
||||
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
|
||||
"confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
|
||||
carve-out (not a token swap, so left out of AF10).
|
||||
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
|
||||
`set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
|
||||
`--tags users|config` run breaks (base guards this; dev_env doesn't).
|
||||
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
|
||||
it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
|
||||
ubongo. Make the header honest.
|
||||
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
|
||||
in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
|
||||
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
|
||||
`pve0`; verify the real node name before changing.
|
||||
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
|
||||
SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt —
|
||||
control-node external-API role, not a host service.)
|
||||
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
|
||||
on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
|
||||
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
|
||||
`boma.wingu.me` used inconsistently across the doc set after M1; state the transition
|
||||
in one place.
|
||||
|
||||
### Low
|
||||
|
||||
- **O10 — README ADR list stops at 017** · `README.md:104` · *drift* — 018–023 exist;
|
||||
extend or trim to a pointer. **recurring** (evolved from prior O3)
|
||||
- **O11 — ADR section-order vs ADR-023 §2** · `008:3, 014:98, 016:91, 017:66, 018:73` ·
|
||||
*conformance* — Status-not-first / Decision-late; passes lint (order not gated) but not
|
||||
the standard. Presentational restructure. **new**
|
||||
- **O12 — ADR-007 FQDN convention vs its own example** · `007-network.md:160` ·
|
||||
*consistency* — `<service>.baobab.band` vs `forgejo.nyumbani.baobab.band`; ties to open
|
||||
TODO 4 (split-horizon). **new**
|
||||
- **O13 — dev_env `.zshrc` heritage carryovers** ·
|
||||
`roles/dev_env/files/dotfiles/zsh/.zshrc:28,55` · *consistency* — hard-coded
|
||||
`/usr/bin/rclone` alias (not installed by the role) + unguarded `direnv` hook. **new**
|
||||
- **O14 — oh_my_posh config tasks untagged** · `roles/dev_env/tasks/oh_my_posh.yml:15-26`
|
||||
· *consistency* — inconsistent `config` tagging vs per_user.yml. **new**
|
||||
- **O15 — tfvars.example `pve01` vs ADR-007 `pve0`** ·
|
||||
`terraform/environments/*/terraform.tfvars.example:9` · *consistency* — verify the real
|
||||
node name, then align. **new**
|
||||
- **O16 — ADR-013/015 "See also:" vs `## Related`** · *consistency* — stylistic; convert
|
||||
for uniformity. **new**
|
||||
- **O17 — empty scaffold `handlers/main.yml`** · `roles/{dev_env,base}/handlers/main.yml`
|
||||
· *cruft* — confirm convention or delete. **new**
|
||||
- **O18 — docs/README.md + inventories/README.md narrower than reality** · *consistency*
|
||||
— omit several real subdirs / the offsite_hosts group. **new**
|
||||
O13 (See-also vs `## Related` in ADR-012/013/015/016/017/018 — recurring), O14
|
||||
(docs/README + inventories/README narrow enumerations — recurring), O16 (.zshrc rclone
|
||||
alias + unguarded direnv hook — recurring), O17 (oh_my_posh zen.toml tasks missing
|
||||
`config` tag — recurring), O19 (ADR-009:122 `nyumbani` example after retirement —
|
||||
recurring), O20 (ROADMAP M2 CAX11/ARM vs cx23/x86 — new), O21 (ADR-020 "ports will be
|
||||
added in M4" stale; already opened in M4a — new), O22 (ADR-024 body still asserts
|
||||
custom-image obligation contradicting its revised Status — new), O23 (`netbird_coordinator` vs
|
||||
`netbird` role name across ADRs/ROADMAP/plan — new), O24 (`*.boma.<domain>` vs
|
||||
`*.<boma-domain>` wildcard scope ADR-024 vs ROADMAP — new), O25 (`tags: [verify]` out of
|
||||
the ADR-019 vocabulary in molecule verify — new), O26 (reverse_proxy templates lack
|
||||
`ansible_managed` header — new), O27 (reverse_proxy vars in `group_vars/all/` not
|
||||
`offsite_hosts/` — new), O28 (capacity-scan.py ignores `offsite.yml` — new), O29
|
||||
(offsite.yml duplicates empty groups from hosts.yml, undocumented merge — new).
|
||||
|
||||
## Deferral checklist (Phase 2)
|
||||
Full detail + suggested fixes in `2026-06-14-findings.json`.
|
||||
|
||||
| Source | Items | Verdict |
|
||||
|---|---|---|
|
||||
| ADR-011 Deferred/Open | 5 (snapshot driver, cadences, health-check harness home, classification home, staging-first) | **All genuinely still open** — cross-checked against later ADRs + TODO 16. None silently resolved. |
|
||||
| ADR-015 Deferred | #1 mesh VPN, #2 service-UI, #3 build | **All marked RESOLVED in place** (ADR-016 / ADR-017 / 2026-06-11 build). |
|
||||
## Themes worth a deliberate pass
|
||||
|
||||
**Stale-deferred found: 0.** The recurring FRICTION.md miss did not recur this run.
|
||||
1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
|
||||
sweep across ADR-008/017/019 closes it.
|
||||
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
|
||||
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
|
||||
body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
|
||||
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
|
||||
mirror base's `apply: tags:` guard.
|
||||
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
|
||||
future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.
|
||||
|
||||
## Scan false positives (folded in, not actionable)
|
||||
## Follow-up prompt
|
||||
|
||||
- `broken-path-ref STATUS.md:38` — STATUS legitimately documents `roles/docker_host/` as
|
||||
"Not in git." (intentional reference to an unbuilt role).
|
||||
- `broken-adr-ref` ×4 — `ADR-099`/`ADR-100` in `tests/test_repo_scan.py` and the
|
||||
adr-structure plan are intentional **test fixtures** for the scanner's bad-ref check.
|
||||
- `marker` ×14 — all in `docs/superpowers/{plans,specs}/*` (historical commit-message
|
||||
TODOs / plan steps) or prose discussing "over-tagging" as a concept. Not cruft.
|
||||
|
||||
## Prior-run diff (vs 2026-06-05)
|
||||
|
||||
**Resolved (7):** O1 VERIFY.md row · O2 new-role VERIFY step · O4 askari group naming ·
|
||||
O5 backend.tf relabel · O6 ADR-014 reproducibility · O11 CAPABILITIES Level-4 row ·
|
||||
O12 TODO 3.10. **Partial:** O3 (docs tree fixed in AF4; ADR-list carried as O10).
|
||||
**Not re-detected (verify next run):** O7–O10 (ADR-011 still Proposed).
|
||||
|
||||
## Follow-up prompt (copy-paste)
|
||||
|
||||
> Act on the open findings from `docs/reviews/2026-06-11-review.md`. Priority order:
|
||||
> 1. **O1 (high):** `make lint` is red on `main` — `playbooks/site.yml` imports the
|
||||
> non-existent `docker_host` role. Pick an interim posture (guard/skip the play, or
|
||||
> `make new-role NAME=docker_host` to scaffold a stub, or exclude from syntax-check
|
||||
> until built) so the trunk lints clean again, and record the choice in STATUS.md.
|
||||
> 2. **O2 (high):** Resolve the ADR-004 ↔ ADR-022 backup-scope contradiction —
|
||||
> update ADR-004's "not in scope of this repo" line to defer to ADR-022 (per ADR-023's
|
||||
> no-silent-reversal rule) and cross-link.
|
||||
> 3. **O3:** Add ACCESS.md + BACKUP.md rows to ADR-004's service-role file table.
|
||||
> 4. **O4:** Reconcile CAPABILITIES' nvim/tmux exclusion with the built `dev_env` role
|
||||
> (carve out the ubongo control-node exception).
|
||||
> 5. **O8 (conformance):** Tag the `dev_env__home` preflight `set_fact` so partial
|
||||
> `--tags users|config` runs don't fail.
|
||||
> 6. **O6 / O9:** Regenerate the inventory stubs to include `offsite_hosts`; reconcile
|
||||
> ubongo's 10.20.10.151 against ADR-007's subnets (or amend ADR-007).
|
||||
> 7. Sweep the low-severity doc items (O5 caveat, O7 runbook, O10 ADR list, O11 ADR
|
||||
> section order, O12–O18) as a single docs-hygiene batch.
|
||||
> Run `make lint` before committing; commit per CLAUDE.md git conventions.
|
||||
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
|
||||
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
|
||||
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
|
||||
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
|
||||
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
|
||||
> with its own revised HTTP-01 Status note.
|
||||
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
|
||||
> ACCESS.md + BACKUP.md rows to its service-role file table.
|
||||
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
|
||||
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
|
||||
> `--tags config` run still resolves the home dir.
|
||||
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
|
||||
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
|
||||
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
|
||||
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
|
||||
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
|
||||
> batch; commit per CLAUDE.md git conventions.
|
||||
|
|
|
|||
|
|
@ -118,8 +118,14 @@ Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
|
|||
machine outside the cluster — not a Proxmox guest. It is the **one** host
|
||||
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
|
||||
|
||||
> **Current state (STATUS.md):** `ubongo` is today managed as the operator account
|
||||
> `sjat` (`group_vars/control` sets `ansible_user: sjat`); it has **no** dedicated
|
||||
> `ansible` service user yet. The dedicated-`ansible`-user bootstrap (step 2) is a
|
||||
> **pending** item. Steps below describe the intended end state.
|
||||
|
||||
1. Install Debian 13 on the physical box by hand (no template to clone).
|
||||
2. Create the `ansible` user and install its SSH public key.
|
||||
2. Create the `ansible` user and install its SSH public key. *(Pending for `ubongo` —
|
||||
currently managed as `sjat`; see the note above.)*
|
||||
3. Set up the Ansible environment on it:
|
||||
```bash
|
||||
git clone <repo> ~/ansible
|
||||
|
|
|
|||
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use `- [ ]` checkboxes.
|
||||
|
||||
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
|
||||
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird_coordinator`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
|
||||
|
||||
**Architecture:** NetBird's own `configure.sh` generates the canonical compose + config for a pinned version; boma **captures that reference once and translates it into role templates** (ADR-004/013 — don't run their imperative script in production, render from templates). Runs in **external-reverse-proxy mode** (no bundled Traefik); Caddy adds a `netbird.askari.wingu.me` route. Secrets (datastore encryption key, TURN password, Dex secrets) are generated into vault; the setup key is stubbed `CHANGEME` for M5.
|
||||
|
||||
|
|
@ -23,9 +23,9 @@
|
|||
|
||||
---
|
||||
|
||||
### Task 2: `netbird` service role — templates
|
||||
### Task 2: `netbird_coordinator` service role — templates
|
||||
|
||||
**Files:** `roles/netbird/` (scaffold via `make new-role NAME=netbird`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
|
||||
**Files:** `roles/netbird_coordinator/` (scaffold via `make new-role NAME=netbird_coordinator`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
|
||||
|
||||
- [ ] **Step 1:** Translate the captured compose into `templates/docker-compose.yml.j2` — containers, the shared `boma` Docker network (so Caddy reaches them by name), **no host port mappings except what Caddy/Coturn need** (Coturn 3478/udp; everything else internal, Caddy fronts it). Pin image tags (ADR-011).
|
||||
- [ ] **Step 2:** Translate `management.json`/`config.yaml` into a template — fill `Datadir`, `DataStoreEncryptionKey` (`{{ vault.netbird.datastore_key }}`), `HttpConfig` (public URL `https://netbird.askari.wingu.me`), `TURNConfig` (coturn host + `{{ vault.netbird.turn_password }}`), `Signal`, `Relay`, `Store` (sqlite), and the embedded-Dex IdP block (DeviceAuthorizationFlow/PKCE, `openid-configuration.json` URL).
|
||||
|
|
@ -53,7 +53,7 @@
|
|||
|
||||
### Task 5: Service-role standard files (ADR-004, authored)
|
||||
|
||||
- [ ] **Step 1:** Author `roles/netbird/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
|
||||
- [ ] **Step 1:** Author `roles/netbird_coordinator/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
|
||||
- [ ] **Step 2:** `VERIFY.md` (copy the template; the `/verify-service` UI spec — run later when the playwright harness exists).
|
||||
- [ ] **Step 3:** `ACCESS.md` (ADR-021; the dashboard/admin access + `access__*` intent).
|
||||
- [ ] **Step 4:** `BACKUP.md` (ADR-022; the **datastore is stateful** → `backup__*` data; record that off-site backup is **pending `fisi`** — an accepted risk for now).
|
||||
|
|
@ -63,7 +63,7 @@
|
|||
|
||||
### Task 6: Add netbird to the offsite playbook
|
||||
|
||||
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird` after `reverse_proxy` (role-name tag). `make lint`. Commit.
|
||||
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird_coordinator` after `reverse_proxy` (role-name tag). `make lint`. Commit.
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -80,7 +80,7 @@
|
|||
|
||||
### Task 8: Docs
|
||||
|
||||
- [ ] **Step 1:** STATUS — `netbird` coordinator built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
|
||||
- [ ] **Step 1:** STATUS — `netbird_coordinator` built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
|
||||
|
||||
---
|
||||
|
||||
|
|
|
|||
|
|
@ -6,6 +6,11 @@ hold per-group and per-host configuration.
|
|||
|
||||
- `hosts.yml` is **generated** from Terraform outputs by `make tf-inventory` — do not
|
||||
hand-edit. The control node is the one manual exception.
|
||||
- `offsite.yml` (in `production/`) is a **second** generated inventory file, written by
|
||||
`make tf-inventory-offsite` from the offsite Terraform env; it holds the `offsite_hosts`
|
||||
group (`askari`). Ansible merges it with `hosts.yml`, so both can declare the same group
|
||||
names harmlessly (the offsite generator emits all four groups, most empty).
|
||||
- Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
|
||||
- Terraform→inventory data flow and the data contract: **ADR-009**.
|
||||
- Addressing conventions (subnets, ranges): **ADR-007**.
|
||||
- Layout and host groups: see CLAUDE.md ("Inventory structure").
|
||||
|
|
|
|||
|
|
@ -13,8 +13,8 @@ public_dns__records:
|
|||
# askari (off-site host, TF-provisioned M2) — public A so it's reachable by name +
|
||||
# for future ACME on *.askari.wingu.me. Mesh/LAN-only home services never appear here.
|
||||
- {record: askari, type: A, values: ["77.42.120.136"], ttl: 1800}
|
||||
# Wildcard for askari's services (test/netbird/...) → same host; Caddy gets a
|
||||
# *.askari.wingu.me cert via DNS-01 (M4a).
|
||||
# Wildcard for askari's services (test/netbird/...) → same host; Caddy gets
|
||||
# per-host certs via ACME HTTP-01 (M4a).
|
||||
- {record: "*.askari", type: A, values: ["77.42.120.136"], ttl: 1800}
|
||||
|
||||
# Absent — Gandi's auto-seeded defaults we don't want (purged once, idempotent thereafter).
|
||||
|
|
|
|||
|
|
@ -39,4 +39,4 @@ services__base_dir: /opt/services
|
|||
base__unattended_upgrades_enabled: true
|
||||
|
||||
# Management plane — activates the dormant ssh-from-control firewall rule
|
||||
base__firewall_control_addr: "10.20.10.151" # ubongo (control node) LAN address — ADR-021 ssh-from-control source
|
||||
base__firewall_control_addr: "10.20.10.151" # ubongo — legacy V4 addr (ADR-007); ADR-021 ssh-from-control
|
||||
|
|
|
|||
|
|
@ -4,10 +4,15 @@ Top-level orchestration playbooks. No inline vars — configuration comes from
|
|||
`group_vars/` / `host_vars/` (see CLAUDE.md).
|
||||
|
||||
- `site.yml` — full standard state: applies `base` to all hosts and `docker_host`
|
||||
to docker hosts. **Note:** `base` is only partially built (its `firewall` concern)
|
||||
and `docker_host` is scaffolded with no tasks yet, so this is incomplete — see `STATUS.md`.
|
||||
to docker hosts. **Note:** `base` is only partially built (its `firewall` +
|
||||
`hardening` concerns) and the cluster has no docker hosts yet, so this is
|
||||
incomplete — see `STATUS.md`.
|
||||
- `workstation.yml` — applies the `dev_env` role (interactive developer environment)
|
||||
to the `control` group; built and applied to `ubongo` (see `STATUS.md`).
|
||||
- `dns.yml` — manages the public DNS zone (wingu.me) at Gandi LiveDNS via the
|
||||
`public_dns` role; runs from the control node against an external API.
|
||||
- `offsite.yml` — off-site hosts (`askari`): `docker_host` (Docker engine) +
|
||||
`reverse_proxy` (Caddy). NetBird coordinator appended in M4b.
|
||||
- `bootstrap.yml` — first-run setup for a host that may not have Python yet;
|
||||
self-contained (does not depend on the roles).
|
||||
|
||||
|
|
|
|||
|
|
@ -8,8 +8,9 @@ Each role must have: a `molecule/default/` scenario (Debian 13), a populated
|
|||
`README.md`, and a filled-in `meta/main.yml`. Conventions: CLAUDE.md and
|
||||
`docs/runbooks/new-role.md`.
|
||||
|
||||
Current state: `base` is **partially built** — its `firewall` concern (nftables) is
|
||||
implemented and tested; the other concerns (SSH hardening, fail2ban, auditd, packages,
|
||||
users) are not yet built. `docker_host` is **scaffolded but has no tasks yet**. `dev_env` (interactive
|
||||
developer environment) is built and applied. See `STATUS.md` for the authoritative
|
||||
breakdown.
|
||||
Current state: `base` is **partially built** — its `firewall` (nftables) and
|
||||
`hardening` (SSH key-only + fail2ban) concerns are implemented, tested, and the
|
||||
hardening concern is applied to `askari`; the remaining concerns (auditd, packages,
|
||||
users) are not yet built. `docker_host` (Docker engine + Compose), `reverse_proxy`
|
||||
(Caddy), `public_dns` (Gandi), and `dev_env` are built. See `STATUS.md` for the
|
||||
authoritative breakdown.
|
||||
|
|
|
|||
|
|
@ -51,14 +51,9 @@
|
|||
- name: Sshd drop-in present and config valid
|
||||
ansible.builtin.command: sshd -t
|
||||
changed_when: false
|
||||
tags: [verify]
|
||||
|
||||
- name: PasswordAuthentication is disabled
|
||||
ansible.builtin.command: grep -q '^PasswordAuthentication no' /etc/ssh/sshd_config.d/10-boma.conf
|
||||
changed_when: false
|
||||
tags: [verify]
|
||||
|
||||
- name: Fail2ban sshd jail configured
|
||||
ansible.builtin.command: grep -q '^\[sshd\]' /etc/fail2ban/jail.d/sshd.local
|
||||
changed_when: false
|
||||
tags: [verify]
|
||||
|
|
|
|||
|
|
@ -25,7 +25,6 @@ alias ll="ls -lh"
|
|||
alias la="ls -lha"
|
||||
alias ..="cd .."
|
||||
alias update="sudo apt update && sudo apt upgrade -y"
|
||||
alias rclone="/usr/bin/rclone"
|
||||
|
||||
# Use neovim for vim/vi commands
|
||||
alias vim='nvim'
|
||||
|
|
@ -50,6 +49,5 @@ export PATH="$HOME/.local/bin:$HOME/bin:$PATH"
|
|||
# Ensure USER is set (edge cases)
|
||||
export USER=$(whoami)
|
||||
|
||||
# Extras from inventory
|
||||
# Enable direnv for automatic virtualenv activation
|
||||
eval "$(direnv hook zsh)"
|
||||
# Enable direnv for automatic virtualenv activation (guarded — direnv may not be installed)
|
||||
command -v direnv >/dev/null 2>&1 && eval "$(direnv hook zsh)"
|
||||
|
|
|
|||
|
|
@ -7,9 +7,38 @@
|
|||
dev_env__users:
|
||||
- tester
|
||||
pre_tasks:
|
||||
# `always` so the test user exists even under a partial `--tags` converge.
|
||||
- name: Create a test user to receive the environment
|
||||
ansible.builtin.user:
|
||||
name: tester
|
||||
create_home: true
|
||||
tags: [always]
|
||||
roles:
|
||||
- role: dev_env
|
||||
|
||||
# Partial-tags regression guard (O8): apply only the `config` concern to a fresh user.
|
||||
# The dev_env__home preflight is tagged `always`, so a config-only run must still resolve
|
||||
# the home dir and stow the dotfiles. Run the true partial path with:
|
||||
# molecule converge -- --tags config
|
||||
# (a full `molecule test` runs every tag, which still exercises this play idempotently).
|
||||
- name: Converge — config concern only, fresh user
|
||||
hosts: all
|
||||
become: true
|
||||
gather_facts: true
|
||||
vars:
|
||||
dev_env__users:
|
||||
- tagtester
|
||||
pre_tasks:
|
||||
# `always` so the test user exists even under a partial `--tags config` converge.
|
||||
- name: Create a second test user for the config-only path
|
||||
ansible.builtin.user:
|
||||
name: tagtester
|
||||
create_home: true
|
||||
tags: [always]
|
||||
tasks:
|
||||
- name: Apply dev_env restricted to the config concern
|
||||
ansible.builtin.include_role:
|
||||
name: dev_env
|
||||
apply:
|
||||
tags: [config]
|
||||
tags: [config]
|
||||
|
|
|
|||
|
|
@ -71,3 +71,18 @@
|
|||
- dev_env__dots.results[3].stat.exists
|
||||
- dev_env__dots.results[4].stat.exists
|
||||
fail_msg: dotfiles not stowed or omz/tpm not cloned
|
||||
|
||||
# Partial-tags regression guard (O8): the config-only converge play provisioned
|
||||
# `tagtester`. Its stowed .zshrc proves dev_env__home resolved (the `always` preflight)
|
||||
# and stow (a `config` task) ran without the `users`/`packages` concerns.
|
||||
- name: Stat the config-only user's stowed .zshrc
|
||||
ansible.builtin.stat:
|
||||
path: /home/tagtester/.zshrc
|
||||
register: dev_env__tagtester_zshrc
|
||||
|
||||
- name: Assert the config concern alone resolved home and stowed dotfiles
|
||||
ansible.builtin.assert:
|
||||
that:
|
||||
- dev_env__tagtester_zshrc.stat.exists
|
||||
- dev_env__tagtester_zshrc.stat.islnk
|
||||
fail_msg: config-only run did not resolve dev_env__home / stow dotfiles for tagtester
|
||||
|
|
|
|||
|
|
@ -7,21 +7,44 @@
|
|||
cache_valid_time: 3600
|
||||
tags: [packages]
|
||||
|
||||
# `apply: tags:` propagates the concern tag onto the INCLUDED tasks — without it a tag on
|
||||
# a dynamic include_tasks only selects the include itself, not its (untagged) contents, so
|
||||
# `--tags <concern>` would run nothing (Ansible gotcha; mirrors roles/base/tasks/main.yml).
|
||||
- name: Install Neovim (pinned release)
|
||||
ansible.builtin.include_tasks: neovim.yml
|
||||
ansible.builtin.include_tasks:
|
||||
file: neovim.yml
|
||||
apply:
|
||||
tags: [packages]
|
||||
tags: [packages]
|
||||
|
||||
# Also reachable under `config`: oh_my_posh.yml renders /etc/oh-my-posh/zen.toml (a config
|
||||
# task, tagged `config` within the file) alongside the binary install (`packages`). apply
|
||||
# keeps `packages` on the untagged binary tasks; the include carries both so `--tags config`
|
||||
# enters it and re-renders just the theme.
|
||||
- name: Install oh-my-posh prompt (pinned release)
|
||||
ansible.builtin.include_tasks: oh_my_posh.yml
|
||||
tags: [packages]
|
||||
ansible.builtin.include_tasks:
|
||||
file: oh_my_posh.yml
|
||||
apply:
|
||||
tags: [packages]
|
||||
tags: [packages, config]
|
||||
|
||||
- name: Install Node.js (pinned release)
|
||||
ansible.builtin.include_tasks: nodejs.yml
|
||||
ansible.builtin.include_tasks:
|
||||
file: nodejs.yml
|
||||
apply:
|
||||
tags: [packages]
|
||||
tags: [packages]
|
||||
|
||||
# per_user.yml resolves dev_env__home (tagged `always`, below) then runs both the `users`
|
||||
# (login shell) and `config` (dotfiles/stow) concerns; tag + apply both so either
|
||||
# `--tags users` or `--tags config` reaches in and the home-dir preflight always runs.
|
||||
- name: Configure each developer user
|
||||
ansible.builtin.include_tasks: per_user.yml
|
||||
ansible.builtin.include_tasks:
|
||||
file: per_user.yml
|
||||
apply:
|
||||
tags: [users, config]
|
||||
loop: "{{ dev_env__users }}"
|
||||
loop_control:
|
||||
loop_var: dev_env__user
|
||||
label: "{{ dev_env__user }}"
|
||||
tags: [users, config]
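
The tag-propagation rule described in the comments above can be illustrated with a small toy model. This is only a sketch of the selection logic as the comments describe it, not Ansible's implementation; the `Task`, `Include`, `selected`, and `tasks_run` names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    tags: set = field(default_factory=set)


@dataclass
class Include:
    name: str
    tags: set                      # tags on the include statement itself
    children: list                 # tasks inside the included file
    apply_tags: set = field(default_factory=set)  # models `apply: tags:`


def selected(tags, run_tags):
    """A task is selected if it is tagged `always` or shares a tag with the run."""
    return "always" in tags or bool(tags & run_tags)


def tasks_run(include, run_tags):
    """Names of child tasks that would run for a given `--tags` set (toy model)."""
    # The include itself must be selected before its file is even entered.
    if not selected(include.tags, run_tags):
        return []
    # `apply: tags:` adds its tags to every child; without it, children keep only their own.
    return [
        t.name
        for t in include.children
        if selected(t.tags | include.apply_tags, run_tags)
    ]


# Without apply: the include is selected, but its untagged children are not.
bare = Include("neovim.yml", {"packages"}, [Task("download"), Task("install")])
# With apply: the concern tag reaches the children.
applied = Include("neovim.yml", {"packages"}, [Task("download"), Task("install")], {"packages"})
# oh_my_posh.yml: include carries both tags, apply carries only `packages`.
posh = Include(
    "oh_my_posh.yml",
    {"packages", "config"},
    [Task("install binary"), Task("deploy theme", {"config"})],
    {"packages"},
)

print(tasks_run(bare, {"packages"}))
print(tasks_run(applied, {"packages"}))
print(tasks_run(posh, {"config"}))
```

In this model a `--tags config` run enters `oh_my_posh.yml` and selects only the theme task, which matches the intent stated in the comment on that include.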
|
||||
|
|
|
|||
|
|
@ -17,9 +17,11 @@
|
|||
path: /etc/oh-my-posh
|
||||
state: directory
|
||||
mode: "0755"
|
||||
tags: [config]
|
||||
|
||||
- name: Oh-my-posh | Deploy zen.toml theme (system-wide)
|
||||
ansible.builtin.copy:
|
||||
src: oh-my-posh/zen.toml
|
||||
dest: /etc/oh-my-posh/zen.toml
|
||||
mode: "0644"
|
||||
tags: [config]
|
||||
|
|
|
|||
|
|
@ -1,12 +1,17 @@
|
|||
---
|
||||
# `always`: dev_env__home must resolve on every entry into per_user.yml, including a
|
||||
# partial `--tags users` or `--tags config` run — the dotfile/stow (config) and login-shell
|
||||
# (users) tasks below all depend on it, so it must never be filtered out (ADR-019).
|
||||
- name: Look up account for {{ dev_env__user }}
|
||||
ansible.builtin.getent:
|
||||
database: passwd
|
||||
key: "{{ dev_env__user }}"
|
||||
tags: [always]
|
||||
|
||||
- name: Resolve home directory for {{ dev_env__user }}
|
||||
ansible.builtin.set_fact:
|
||||
dev_env__home: "{{ getent_passwd[dev_env__user][4] }}"
|
||||
tags: [always]
|
||||
|
||||
- name: Set login shell to zsh for {{ dev_env__user }}
|
||||
ansible.builtin.user:
|
||||
|
|
|
|||
|
|
@ -8,10 +8,7 @@
|
|||
ansible.builtin.command: docker --version
|
||||
register: docker_version_output
|
||||
changed_when: false
|
||||
tags: [verify]
|
||||
|
||||
- name: Assert docker --version succeeded
|
||||
ansible.builtin.assert:
|
||||
that: docker_version_output.rc == 0
|
||||
msg: "docker --version failed — Docker was not installed correctly"
|
||||
tags: [verify]
|
||||
|
|
|
|||
|
|
@ -5,8 +5,8 @@ Manages boma's public DNS zone (**wingu.me**) at **Gandi LiveDNS** as code, via
|
|||
name on purpose. Run from the control node: `make check/deploy PLAYBOOK=dns`.
|
||||
|
||||
Mesh/LAN-only by default — only deliberate public records live in the zone (the
|
||||
anti-spoof baseline now; `askari` in M4). Everything else is reached over LAN/mesh and
|
||||
never appears here.
|
||||
anti-spoof baseline plus `askari.wingu.me` + the `*.askari` wildcard, applied in M4a).
|
||||
Everything else is reached over LAN/mesh and never appears here.
|
||||
|
||||
## Data (in `group_vars/all/public_dns.yml`)
|
||||
|
||||
|
|
|
|||
|
|
@ -9,4 +9,3 @@
|
|||
- public_dns__domain == "example.test"
|
||||
- public_dns__apply | bool == false
|
||||
msg: "public_dns defaults/vars did not resolve as expected"
|
||||
tags: [verify]
|
||||
|
|
|
|||
37
roles/reverse_proxy/ACCESS.md
Normal file
|
|
@ -0,0 +1,37 @@
|
|||
# Access — reverse_proxy (Caddy)
|
||||
|
||||
Rendered from the role's `access__*` data (`roles/reverse_proxy/defaults/main.yml`) —
|
||||
the source of truth that also drives `/check-access`. Regenerate from the data; edit the
|
||||
data, not the tables. Host: `askari` (off-site Hetzner; ADR-007/016).
|
||||
|
||||
## Access paths
|
||||
|
||||
The documented ways in, by tier (rendered from `access__*`):
|
||||
|
||||
| Tier | Path | Invocation |
|
||||
|---|---|---|
|
||||
| primary | `wt0` mesh SSH | `ssh askari` (over the NetBird mesh — pending M5; see notes) |
|
||||
| secondary | LAN/WAN SSH from `ubongo` | `ssh ansible@askari` (from the control node; Hetzner firewall allows only ubongo's WAN) |
|
||||
| — | container exec + compose | `docker compose -p reverse_proxy -f /opt/services/reverse_proxy/docker-compose.yml ps` / `… exec caddy sh` |
|
||||
| — | logs | `docker logs caddy` now; Loki labels `{service: caddy}` once the ADR-018 pipeline lands |
|
||||
| — | admin API | n/a — Caddy admin API bound to container localhost `:2019`, never exposed (`access__api.enabled: false`) |
|
||||
|
||||
## Break-glass
|
||||
|
||||
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
|
||||
|
||||
- **Hetzner rescue system + Cloud Console** (VNC) for `askari` — boot the rescue image
|
||||
or attach the web console from the Hetzner Cloud panel if SSH is unreachable.
|
||||
|
||||
## Operational notes
|
||||
|
||||
- **Mesh not yet enrolled (M5).** Until `askari` joins the NetBird mesh, the `wt0`
|
||||
primary path does not exist — the only SSH route is the secondary one (from `ubongo`'s
|
||||
WAN IP, which the TF-managed Hetzner Cloud Firewall allowlists). Promote `wt0` to
|
||||
primary once M5 lands.
|
||||
- **Caddy wedged / bad config:** the Caddyfile is rendered read-only by Ansible; to
|
||||
recover, fix `reverse_proxy__routes` in `group_vars` and re-run the role (it reloads
|
||||
Caddy via the handler). To inspect live config: `docker exec caddy caddy validate
|
||||
--config /etc/caddy/Caddyfile`.
|
||||
- **Cert issuance failing:** check that port 80 is reachable from the internet (HTTP-01
|
||||
needs it) and watch `docker logs caddy` for ACME errors before assuming a routing fault.
|
||||
61
roles/reverse_proxy/SECURITY.md
Normal file
|
|
@ -0,0 +1,61 @@
|
|||
# Security — reverse_proxy (Caddy)
|
||||
|
||||
## Exposure
|
||||
|
||||
- **Published ports:** `80/tcp` + `443/tcp` (HTTP→HTTPS redirect + TLS). Both are
|
||||
declared in the `group_vars` firewall catalog as the askari `public_web` opens
|
||||
(ADR-020); the Hetzner Cloud Firewall also opens 80/443 (and 3478 for NetBird).
|
||||
Port 80 must stay open to the internet for the ACME HTTP-01 challenge.
|
||||
- **Auth surface:** none of its own. Caddy is the TLS terminator and router; per-service
|
||||
authentication (Authentik `forward_auth`) is added at each route in Phase 2 (ADR-024
|
||||
§4). Today it fronts only a static `respond` test vhost and (M4b) the NetBird stack,
|
||||
which carries its own auth.
|
||||
- **Reachability:** public — askari is internet-facing. Caddy is the single public entry
|
||||
point; upstreams sit on the internal `boma` Docker network and are reached by name, not
|
||||
published directly.
|
||||
- **Data sensitivity:** none persistent worth protecting — only ACME account keys +
|
||||
issued certificates in the `caddy_data` volume, which are re-issuable (HTTP-01). No
|
||||
user data, no secrets at rest. See backup record: `backup__state: false` (stateless).
|
||||
|
||||
## Checklist status
|
||||
|
||||
Each item from `docs/security/service-checklist.md`:
|
||||
|
||||
- [x] Secrets in vault; no default creds; nothing secret in git/images — ✅ n/a: HTTP-01
|
||||
needs no credentials; the only config input is `reverse_proxy__acme_email` (not secret).
|
||||
- [x] Non-root; no `privileged`/host-network unless justified; minimal mounts; caps
|
||||
dropped — ⚠️ official `caddy:2` runs as root (to bind 80/443); no `privileged`, no host
|
||||
network (bridge `boma`); mounts are the read-only Caddyfile + two named volumes. Root
|
||||
inside the container is the upstream default; revisit if Caddy ships a rootless variant.
|
||||
- [x] Ports declared in `group_vars`; behind reverse proxy + auth if exposed;
|
||||
least-privilege inter-service reach — ✅ 80/443 in the catalog; Caddy *is* the proxy;
|
||||
upstreams are not published, only reachable on the `boma` network.
|
||||
- [x] Image pinned (tag/digest), update path known — ⚠️ pinned to the `caddy:2` major
|
||||
tag (stateless tier, ADR-011/ADR-004), not a digest; refreshed deliberately and watched
|
||||
by DIUN. Tighten to `tag@digest` if the proxy is reclassified as stateful.
|
||||
- [x] Logs reviewable; backup/restore covered if stateful — ✅ stateless (no backup
|
||||
needed); logs via `docker logs caddy` now, Loki labels declared for the ADR-018 pipeline.
|
||||
|
||||
## Service-specific hardening
|
||||
|
||||
- **HTTP-01 only, no DNS token:** vanilla `caddy:2`, no `caddy-dns/gandi` plugin and no
|
||||
Gandi API token on the host — removes a credential and a custom-image supply chain
|
||||
(ADR-024 revised Status).
|
||||
- **Caddyfile is read-only** in the container (`:ro` mount); rendered solely by Ansible
|
||||
from the `group_vars` route catalog — no dynamic label discovery, so no route exists
|
||||
that wasn't declared (the reason Caddy was chosen over Traefik, ADR-024 §1).
|
||||
- **Admin API not exposed:** Caddy's admin endpoint stays on container-localhost `:2019`;
|
||||
never published, never in the firewall catalog (`access__api.enabled: false`).
|
||||
- **Automatic HTTPS:** HTTP is redirected to HTTPS and modern TLS defaults are Caddy's
|
||||
out-of-the-box behaviour (no manual cipher config needed).
|
||||
|
||||
## Residual / accepted risks
|
||||
|
||||
- **Container runs as root** — upstream `caddy:2` default (needs to bind low ports).
|
||||
Rationale: official image, no rootless variant wired yet; blast radius limited to the
|
||||
proxy container. Revisit: adopt a rootless Caddy image if upstream stabilises one.
|
||||
- **Image pinned to a major tag, not a digest** — accepted for the stateless tier
|
||||
(ADR-011). Revisit if the role gains state.
|
||||
- **ACME re-issuance vs Let's Encrypt rate limits** — losing `caddy_data` triggers
|
||||
re-issuance; rapid repeated rebuilds could hit LE rate limits. Acceptable for a handful
|
||||
of askari hostnames; noted in the backup rationale.
|
||||
44
roles/reverse_proxy/VERIFY.md
Normal file
|
|
@ -0,0 +1,44 @@
|
|||
# Verify — reverse_proxy (Caddy)
|
||||
|
||||
`reverse_proxy` has no application UI of its own — it is the TLS terminator and router.
|
||||
"Working" is verified at the HTTP/TLS layer (what `/verify-service` can drive with a
|
||||
browser/HTTP client against the public hostnames it serves), not via an app login.
|
||||
|
||||
## Critical user journeys
|
||||
|
||||
1. **HTTPS serves with a valid cert** — request `https://<a host in
|
||||
reverse_proxy__routes>` (e.g. `https://test.askari.wingu.me`) → 200 with a valid
|
||||
Let's Encrypt certificate (trusted chain, CN/SAN matches the host, not expired).
|
||||
2. **HTTP redirects to HTTPS** — request `http://<host>` → 308/301 redirect to the
|
||||
`https://` URL (Caddy's automatic-HTTPS redirect).
|
||||
3. **A `respond` route returns its static body** — the test vhost returns its configured
|
||||
string with 200.
|
||||
4. **An `upstream` route proxies through** — once a real upstream is registered (M4b
|
||||
NetBird), `https://<host>` reaches the upstream's response, not a Caddy error page.
|
||||
5. **An unknown host is not served a valid cert** — a hostname not in
|
||||
`reverse_proxy__routes` does not get a certificate / is not routed (no accidental
|
||||
catch-all).
|
||||
|
||||
## What good looks like
|
||||
|
||||
- The browser padlock shows a valid Let's Encrypt certificate for the requested host;
|
||||
the SAN matches and the chain is trusted.
|
||||
- `http://` visibly becomes `https://` in the address bar.
|
||||
- The expected body (static `respond` text, or the upstream's page) renders.
|
||||
|
||||
## Not browser-verifiable
|
||||
|
||||
- Certificate *renewal* (60-day cadence) — confirm out of band via `docker logs caddy`
|
||||
/ Loki, not a single browser session.
|
||||
- Behaviour when port 80 is blocked (HTTP-01 would fail) — an infrastructure/firewall
|
||||
check, route to the manual handoff.
|
||||
- The deferred DNS-01 path for mesh/LAN-only services (Phase 2, ADR-024) — not yet live.
|
||||
|
||||
## Test data
|
||||
|
||||
Provisioned in the **staging** deploy (no Authentik user needed — there is no SSO on the
|
||||
proxy itself):
|
||||
|
||||
- At least one `reverse_proxy__routes` entry with a public DNS A-record pointing at the
|
||||
staging host, so HTTP-01 can complete. A static `respond` route is enough for journeys
|
||||
1–3 and 5.
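
Journey 2 (HTTP redirects to HTTPS) could be checked with a small predicate over a response's status and `Location` header. The sketch below is an assumption about how such a check might look, not part of `/verify-service`; `is_https_redirect` and `REDIRECT_CODES` are hypothetical names, and the accepted status codes are taken from the "308/301" wording in the journey.

```python
from urllib.parse import urlsplit

# Status codes the journey accepts as a permanent redirect.
REDIRECT_CODES = {301, 308}


def is_https_redirect(status: int, location: str, host: str) -> bool:
    """True if the response redirects to the https:// URL of the same host."""
    if status not in REDIRECT_CODES:
        return False
    target = urlsplit(location)
    return target.scheme == "https" and target.hostname == host


print(is_https_redirect(308, "https://test.askari.wingu.me/", "test.askari.wingu.me"))
```

Keeping the check a pure function means it can be exercised without network access; the actual request against the staging host would be made separately by whatever HTTP client the harness uses.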
|
||||
|
|
@ -4,3 +4,25 @@ reverse_proxy__base_dir: /opt/services/reverse_proxy
|
|||
reverse_proxy__acme_email: admin@example.test
|
||||
reverse_proxy__routes: [] # each: {host: x, upstream: "svc:port"} OR {host: x, respond: "text"}
|
||||
reverse_proxy__manage: true # set false in Molecule to render without Docker
|
||||
|
||||
# access__*/backup__* are the ADR-021/022 CROSS-ROLE conventions — shared field names that
|
||||
# render ACCESS.md/BACKUP.md and drive /check-access · /check-backup. They intentionally do
|
||||
# NOT carry the reverse_proxy__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
|
||||
# (ansible-lint's role-prefix rule has no per-prefix allowlist; keeping it enabled elsewhere).
|
||||
|
||||
# Operational-access record (ADR-021) — source of truth for ACCESS.md + /check-access.
|
||||
access__service: reverse_proxy # noqa: var-naming[no-role-prefix]
|
||||
access__compose_project: reverse_proxy # noqa: var-naming[no-role-prefix]
|
||||
access__compose_path: "{{ reverse_proxy__base_dir }}/docker-compose.yml" # noqa: var-naming[no-role-prefix]
|
||||
access__containers: [caddy] # noqa: var-naming[no-role-prefix]
|
||||
access__log: # noqa: var-naming[no-role-prefix]
|
||||
loki_labels: { service: caddy } # intent; Loki/Alloy pipeline is ADR-018 (pending)
|
||||
access__api: # noqa: var-naming[no-role-prefix]
|
||||
enabled: false
|
||||
reason: "Caddy admin API bound to container localhost :2019; never exposed (ADR-020 catalog owns ports)"
|
||||
|
||||
# Backup contract (ADR-022). Stateless: Caddy's /data holds only ACME account keys +
|
||||
# issued certs, which are re-requested automatically on restart via HTTP-01 (no manual
|
||||
# steps). Residual risk: Let's Encrypt rate limits on rapid repeated re-issuance.
|
||||
backup__service: reverse_proxy # noqa: var-naming[no-role-prefix]
|
||||
backup__state: false # noqa: var-naming[no-role-prefix]
|
||||
|
|
|
|||
|
|
@ -2,8 +2,8 @@
|
|||
galaxy_info:
|
||||
author: sjat
|
||||
description: >-
|
||||
Caddy reverse proxy with ACME DNS-01 TLS via Gandi (ADR-024). Builds the
|
||||
custom image on-host (caddy-dns/gandi) and manages it via Docker Compose.
|
||||
Vanilla Caddy reverse proxy (ADR-024); TLS via ACME HTTP-01 for public
|
||||
hosts. Routes from reverse_proxy__routes, managed via Docker Compose.
|
||||
license: MIT
|
||||
min_ansible_version: "2.17"
|
||||
platforms:
|
||||
|
|
|
|||
|
|
@ -8,8 +8,6 @@
|
|||
ansible.builtin.slurp:
|
||||
src: /opt/services/reverse_proxy/Caddyfile
|
||||
register: _caddyfile
|
||||
tags: [verify]
|
||||
|
||||
- name: Assert Caddyfile exists and contains expected content
|
||||
ansible.builtin.assert:
|
||||
that:
|
||||
|
|
@ -19,4 +17,3 @@
|
|||
- "'respond \"ok\" 200' in (_caddyfile.content | b64decode)"
|
||||
fail_msg: "Caddyfile is missing expected content"
|
||||
success_msg: "Caddyfile rendered correctly"
|
||||
tags: [verify]
|
||||
|
|
|
|||
|
|
@ -1,3 +1,4 @@
|
|||
# {{ ansible_managed }}
|
||||
{
|
||||
email {{ reverse_proxy__acme_email }}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -1,3 +1,4 @@
|
|||
# {{ ansible_managed }}
|
||||
services:
|
||||
caddy:
|
||||
image: caddy:2
|
||||
|
|
|
|||
|
|
@ -14,6 +14,9 @@ exception: `check-vault.py` is a vault tool that needs the ansible venv (PyYAML
|
|||
`rbw`. Wired as `vault_password_file` (ADR-002).
|
||||
- `check-vault-encrypted.sh` — pre-commit guard: fails if a `vault.yml` holds
|
||||
plaintext secrets.
|
||||
- `check-tags.py` — enforces the closed tag vocabulary (`tests/tags.yml`) and that
|
||||
each role import in a play carries its role-name tag. Invoked by `make lint`. See
|
||||
**ADR-019**.
|
||||
- `repo-scan.py` — Phase-0 deterministic scan for `/review-repo` (markers, broken
|
||||
refs, unencrypted vaults, inventory).
|
||||
- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses
|
||||
|
|
|
|||
|
|
@ -130,7 +130,9 @@ def known_hostnames(env):
|
|||
hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
|
||||
except (OSError, subprocess.CalledProcessError, ValueError):
|
||||
pass
|
||||
inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
|
||||
# Point at the inventory DIRECTORY so every source file merges — hosts.yml AND
|
||||
# offsite.yml (offsite_hosts / askari), which a bare hosts.yml would miss.
|
||||
inv = os.path.join(REPO_ROOT, "inventories", env)
|
||||
try:
|
||||
hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
|
||||
except (OSError, subprocess.CalledProcessError, ValueError):
|
||||
|
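
Why the directory matters can be seen from the shape of `ansible-inventory --list` output: hosts appear under `_meta.hostvars` and under each group's `hosts` list, so a source file that is not loaded contributes nothing. The sketch below is a hypothetical stand-in for `parse_inventory_hostnames` (its real body is not shown in this diff), and the two sample listings are invented for illustration.

```python
def inventory_hostnames(listing: dict) -> set[str]:
    """Collect host names from `ansible-inventory --list` style JSON (hypothetical helper)."""
    hosts = set(listing.get("_meta", {}).get("hostvars", {}))
    for group, body in listing.items():
        # Every non-_meta key is a group; its `hosts` list is optional.
        if group != "_meta" and isinstance(body, dict):
            hosts.update(body.get("hosts", []))
    return hosts


# Invented sample: what a bare hosts.yml source might yield.
hosts_yml_only = {
    "_meta": {"hostvars": {"ubongo": {}}},
    "all": {"children": ["control", "ungrouped"]},
    "control": {"hosts": ["ubongo"]},
}

# Invented sample: the merged directory source, which also loads offsite.yml.
merged_directory = {
    "_meta": {"hostvars": {"ubongo": {}, "askari": {}}},
    "all": {"children": ["control", "offsite_hosts", "ungrouped"]},
    "control": {"hosts": ["ubongo"]},
    "offsite_hosts": {"hosts": ["askari"]},
}

print(sorted(inventory_hostnames(hosts_yml_only)))
print(sorted(inventory_hostnames(merged_directory)))
```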
|
|
|||
|
|
@ -53,6 +53,8 @@ def main() -> None:
|
|||
"---",
|
||||
"# Generated by scripts/tf_to_inventory.py — do not edit manually.",
|
||||
"# Regenerate with: make tf-inventory TF_ENV=<env>",
|
||||
"# This OVERWRITES the file, including any manually-added control node (ubongo) —",
|
||||
"# re-add it afterwards (the one hand-edit exception; docs/runbooks/new-host.md Part E).",
|
||||
"",
|
||||
"all:",
|
||||
" children:",
|
||||
|
|
|
|||
|
|
@ -5,9 +5,13 @@ destroying Proxmox VMs. It writes no DNS records and configures nothing inside a
|
|||
VM; Ansible owns all of that.
|
||||
|
||||
- `modules/proxmox_vm/` — reusable VM module (Proxmox only).
|
||||
- `environments/{staging,production}/` — separate state per environment. Add a VM by
|
||||
editing `local.vms` in that env's `main.tf`, then `make tf-plan` → `tf-apply` →
|
||||
`tf-inventory`.
|
||||
- `modules/hetzner_vm/` — reusable VM module (Hetzner Cloud: server + firewall +
|
||||
SSH key + cloud-init).
|
||||
- `environments/{staging,production}/` — separate state per environment (Proxmox).
|
||||
Add a VM by editing `local.vms` in that env's `main.tf`, then `make tf-plan` →
|
||||
`tf-apply` → `tf-inventory`. Not yet `terraform init`ed.
|
||||
- `environments/offsite/` — the off-site Hetzner host (`askari`); the one
|
||||
**applied** environment. Use `make tf-* TF_ENV=offsite` and `tf-inventory-offsite`.
|
||||
|
||||
Rationale: **ADR-006**. Handoff to Ansible: **ADR-009**. Secrets via `TF_VAR_*`
|
||||
only — never in `.tfvars`. Not yet `terraform init`ed — see `STATUS.md`.
|
||||
only — never in `.tfvars`. See `STATUS.md` for what is provisioned.
|
||||
|
|
|
|||
|
|
@@ -1,4 +1,4 @@
-# verified: hetznercloud/hcloud 1.65.0 · debian-13 image · cax11@hel1 · terraform-registry · 2026-06-14
+# verified: hetznercloud/hcloud 1.65.0 · debian-13 image · cx23@hel1 · terraform-registry · 2026-06-14
 terraform {
   required_version = ">= 1.9"

@@ -6,9 +6,9 @@
 #
 # State is local (see backend.tf) — no Forgejo backend credentials needed.

-proxmox_endpoint = "https://pve01.baobab.band:8006/"
+proxmox_endpoint = "https://pve0.boma.baobab.band:8006/"
 proxmox_insecure = false
-proxmox_node = "pve01"
+proxmox_node = "pve0"
 vm_template_id = 9000 # Proxmox VM ID of the Debian 13 cloud-init template
 vm_datastore_id = "local-lvm"

@@ -1,7 +1,7 @@
 # ── Proxmox ───────────────────────────────────────────────────────────────────

 variable "proxmox_endpoint" {
-  description = "Proxmox API URL, e.g. https://pve01.baobab.band:8006/"
+  description = "Proxmox API URL, e.g. https://pve0.boma.baobab.band:8006/"
   type = string
 }

@@ -6,9 +6,9 @@
 #
 # State is local (see backend.tf) — no Forgejo backend credentials needed.

-proxmox_endpoint = "https://pve01.baobab.band:8006/"
+proxmox_endpoint = "https://pve0.boma.baobab.band:8006/"
 proxmox_insecure = true # set false once a valid TLS cert is in place
-proxmox_node = "pve01"
+proxmox_node = "pve0"
 vm_template_id = 9000 # Proxmox VM ID of the Debian 13 cloud-init template
 vm_datastore_id = "local-lvm"

@@ -1,7 +1,7 @@
 # ── Proxmox ───────────────────────────────────────────────────────────────────

 variable "proxmox_endpoint" {
-  description = "Proxmox API URL, e.g. https://pve01.baobab.band:8006/"
+  description = "Proxmox API URL, e.g. https://pve0.boma.baobab.band:8006/"
   type = string
 }

@@ -4,7 +4,7 @@ variable "name" {
 }

 variable "server_type" {
-  description = "Hetzner server type, e.g. cax11 (ARM)"
+  description = "Hetzner server type, e.g. cx23 (x86) or cax11 (ARM)"
   type = string
 }

@@ -19,7 +19,7 @@ concerns:
 - monitoring # metric exporters / health checks
 - config # render templated config/compose files to disk — no restart
 - deploy # bring services up / restart (compose up -d)
-- proxy # reverse-proxy + TLS registration (Traefik routes, Authentik)
+- proxy # reverse-proxy + TLS registration (Caddy routes, Authentik)

 # Ansible built-in special tags. Narrow use only:
 # always — cheap preflight assertions (run regardless of --tags)
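Review note: the tag catalog above enumerates the allowed concern tags, so a run-targeting wrapper could reject a mistyped `TAGS=` value before Ansible silently matches nothing. The sketch below is illustrative only. It hard-codes just the four concerns visible in this hunk (the real catalog is longer), and the function name `unknown_tags` is hypothetical.

```python
# Only the concerns visible in this hunk; the full catalog has more entries.
CONCERN_TAGS = {"monitoring", "config", "deploy", "proxy"}
# Ansible's built-in special tags and keywords.
SPECIAL_TAGS = {"always", "never", "tagged", "untagged", "all"}


def unknown_tags(requested: str) -> list[str]:
    """Return the comma-separated tags that are in neither set, sorted."""
    tags = [t.strip() for t in requested.split(",") if t.strip()]
    allowed = CONCERN_TAGS | SPECIAL_TAGS
    return sorted(t for t in tags if t not in allowed)


if __name__ == "__main__":
    # A stale tag name is reported; catalogued tags pass.
    print(unknown_tags("config, deploy, traefik"))
```

A real check would load the concern list from the catalog file rather than duplicating it, so the two cannot drift apart.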