Compare commits

6 commits

Author SHA1 Message Date
9e0c264658 docs: reconcile lower-severity review findings (O9-O24)
- ADR-007: document ubongo on the legacy V4 net at 10.20.10.151 (transitional,
  outside the planned srv /24 until the LAN is re-cut) (O10); single authoritative
  boma.baobab.band -> boma.wingu.me transition note already added earlier
- terraform tfvars.example + variables.tf (both envs): pve01 -> pve0 and
  <host>.boma.baobab.band per ADR-007 naming (O11)
- ADR-012/013/015/016/017/018: convert "See also:" prose to `## Related` sections
  placed after Consequences, matching ADR-014/019-023 (O13)
- docs/README + inventories/README: list the missing subdirs / offsite_hosts +
  offsite.yml merge behaviour (O14, O29 note)
- ADR-009: drop the retired `nyumbani` example; use vaultwarden.wingu.me split-horizon (O19)
- ROADMAP M2: askari shipped as cx23/x86 (CAX11/ARM out of stock) (O20)
- ADR-020: 80/443/3478 opened in M4a (past tense); coordinator role is M4b (O21)
- netbird -> netbird_coordinator across ROADMAP M4b, the M4b plan, ADR-024 (O23)
- ADR-024: align the M1 DNS-01 wildcard scope wording with ROADMAP (O24)
- capacity-scan.py: read the inventory directory so offsite.yml (askari) is seen (O28)
- tf_to_inventory.py: generated header now warns it overwrites the manual control node (O9)
- tests/tags.yml: proxy concern comment Traefik -> Caddy (missed in the O3 sweep)

O9's existing stub hosts.yml header stays as-is (generator-owned, hook-protected);
the fix lives in the generator for the next regeneration. make lint + pytest (57) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:31:40 +02:00
9b5851ba4b chore(roles): role/test hygiene from review (O16,O17,O25,O26)
- dev_env .zshrc: drop the rclone alias (not installed) and guard the direnv
  hook with `command -v direnv` so a missing direnv doesn't error every shell (O16)
- dev_env oh-my-posh: tag the zen.toml theme deploy `config` (it renders config to
  disk like the per_user dotfiles); the include now carries packages+config so a
  `--tags config` run re-renders the theme while the binary install stays packages
  only (O17). Verified via `molecule converge -- --tags config`.
- drop the non-vocabulary `tags: [verify]` from molecule verify playbooks across
  base/docker_host/public_dns/reverse_proxy (check-tags exempts molecule anyway) (O25)
- reverse_proxy templates: add the `{{ ansible_managed }}` header (ADR-024 §1.2) (O26)

make lint green; dev_env + reverse_proxy molecule green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:31:23 +02:00
175777e36a docs: reconcile 2026-06-14 review findings (O1-O7,O18,O22)
- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file
  table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope
  ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)

O9 (hosts.yml header) left open: the file is generator-owned (hook-protected);
fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:06:33 +02:00
cb8f924d4b docs(reverse_proxy): service-role SECURITY/VERIFY/ACCESS records (O12)
reverse_proxy is the first built+applied service role; add the per-service
records CLAUDE.md/ADR-002/008/017/021 require. Add access__*/backup__* data to
defaults as the source of truth (ADR-021/022). reverse_proxy is stateless (ACME
certs re-issue via HTTP-01), so it declares backup__state: false with a reason
rather than a BACKUP.md (ADR-022 convention).

The access__*/backup__* cross-role field names intentionally don't carry the
reverse_proxy__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
(ansible-lint has no per-prefix allowlist; rule stays enabled elsewhere).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:06:23 +02:00
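
The `noqa` convention described in cb8f924d4b could look roughly like the excerpt below. This is a sketch, not the role's actual defaults: only `backup__state: false` and the `noqa` tag come from the commit message, while `backup__reason` and `reverse_proxy__http_port` are hypothetical names used for contrast.

```yaml
# roles/reverse_proxy/defaults/main.yml (illustrative excerpt)

# Role-owned variable: carries the role prefix, so var-naming passes as usual.
reverse_proxy__http_port: 80  # hypothetical variable

# Cross-role contract fields (ADR-021/022): intentionally unprefixed, so each
# line opts out of the one sub-rule instead of disabling var-naming globally.
backup__state: false  # noqa: var-naming[no-role-prefix]
backup__reason: "Stateless: ACME certs re-issue via HTTP-01"  # noqa: var-naming[no-role-prefix]  (hypothetical field name)
```

Scoping the exemption per line keeps the rule active for every other variable in the role, which is the trade-off the commit message describes.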
718781053f fix(dev_env): make concern tags reach included tasks (O8)
Dynamic include_tasks only filter on the include's own tags, not their
(untagged) contents — so `--tags packages` ran none of the neovim/oh-my-posh/
nodejs installs, and `--tags users|config` never entered per_user.yml. Add
`apply: tags:` to all four includes (mirroring base/tasks/main.yml) and tag the
dev_env__home getent+set_fact preflight `always` so a partial run still resolves
the home dir before the dotfile/stow tasks consume it.

Molecule: add a config-only converge play for a fresh user + a verify assertion.
Proven with `molecule converge -- --tags config` (idempotent, home resolved).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:06:15 +02:00
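
The tag-propagation fix in 718781053f can be sketched as follows. This is a minimal illustration under stated assumptions: `neovim.yml` and `dev_env__user` are hypothetical names, and only `dev_env__home`, `per_user.yml`, and the tag names come from the commit message. With a dynamic `include_tasks`, `tags:` on the include only decides whether the include itself runs; `apply: tags:` is what passes those tags on to the included tasks.

```yaml
# Illustrative only: names other than dev_env__home and per_user.yml are
# assumptions, not the role's actual contents.

- name: Look up the target user in passwd (preflight)
  ansible.builtin.getent:
    database: passwd
    key: "{{ dev_env__user }}"   # hypothetical variable
  tags: [always]                 # still runs under --tags config

- name: Resolve the home directory before dotfile tasks consume it
  ansible.builtin.set_fact:
    dev_env__home: "{{ ansible_facts.getent_passwd[dev_env__user][4] }}"
  tags: [always]

- name: Install neovim
  ansible.builtin.include_tasks:
    file: neovim.yml             # hypothetical file name
    apply:
      tags: [packages]           # inherited by every task inside the file
  tags: [packages]               # lets --tags packages select the include

- name: Per-user dotfiles
  ansible.builtin.include_tasks:
    file: per_user.yml
    apply:
      tags: [users, config]
  tags: [users, config]
```

Without the `apply:` block, `--tags packages` would select the include but skip its untagged contents, which matches the symptom the commit describes.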
64f1e821d8 docs(review): 2026-06-14 repo audit — M4a doc drift + Traefik→Caddy lag
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:37:54 +02:00
60 changed files with 818 additions and 269 deletions

View file

@ -108,6 +108,13 @@ See `Makefile` for the full list of targets.
- Control / AI-worker host (`ubongo`): `docs/decisions/015-control-host.md`
- Mesh VPN (NetBird): `docs/decisions/016-mesh-vpn.md`
- Service-UI verification (Level 4): `docs/decisions/017-service-ui-verification.md`
- Logging & log integrity: `docs/decisions/018-logging.md`
- Tagging & run-targeting: `docs/decisions/019-tagging.md`
- Firewall strategy: `docs/decisions/020-firewall.md`
- Operational access: `docs/decisions/021-operational-access.md`
- Backup & disaster recovery: `docs/decisions/022-backup.md`
- ADR structure & lifecycle: `docs/decisions/023-adr-structure.md`
- Reverse proxy (Caddy): `docs/decisions/024-reverse-proxy.md`
(CLAUDE.md carries the full cross-referenced table, including the runbooks and
security/testing docs.)

View file

@ -38,14 +38,20 @@ _Last reviewed: 2026-06-14._
| Thing | State |
|---|---|
| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is built but **not yet applied** to any host (mesh-gated to avoid lockout — M5). Not built: auditd, packages, users (Phase 2 / TODO 15). |
| `roles/docker_host/` | **Scaffolded, no tasks.** In git (meta/README/molecule filled), wired into `playbooks/site.yml` so the standard state is expressed end-to-end and `make lint` covers it, but it has no tasks yet — applying it is a no-op. Planned scope (Docker engine + Compose, daemon hardening, `nftables.d` container rules) in ADR-004/ADR-020. |
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
So `make deploy PLAYBOOK=site` has no real content to apply — `base` is only partially
built (its `firewall` concern only) and the `docker_host` role is scaffolded but has no
tasks yet. (The `make check`/`deploy` machinery itself now works — first proven by
applying `dev_env` via `playbooks/workstation.yml`.)
(`roles/docker_host/` is no longer scaffold-only — it installs the Docker engine + Compose
and is built + applied to askari; see "Real and working today". Its deferred scope —
daemon hardening + `nftables.d` container rules, ADR-004/ADR-020 — is still pending.)
A `make deploy PLAYBOOK=site` run now applies real content — `base` (its `firewall` +
`hardening` concerns) plus a functional `docker_host` (Docker engine) on docker hosts —
but in practice it is still limited: the production cluster has no docker hosts yet, and
`base`'s `firewall` concern is mesh-gated until M5, so a full cluster `site` run does not
yet exist. (The `make check`/`deploy` machinery itself works — first proven by applying
`dev_env` via `playbooks/workstation.yml`, then `base`/`docker_host`/`reverse_proxy` on
askari.)
## Designed but not built

View file

@ -24,9 +24,9 @@ decisions this frame enables.
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|---|---|---|---|---|---|
| Reverse proxy / TLS | Traefik | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
| Reverse proxy / TLS | Caddy (ADR-024) | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
| Internal DNS | `dns` role → dns1/dns2 | P | core | Authoritative internal zone (ADR-007) | Ansible-rendered zone |
| Public DNS | `public_dns` role → Gandi LiveDNS | P | core | wingu.me zone as code (ADR-007) | anti-spoof baseline; mesh/LAN-only default; apply pending |
| Public DNS | `public_dns` role → Gandi LiveDNS | P | core | wingu.me zone as code (ADR-007) | anti-spoof baseline; mesh/LAN-only default; applied (M1) |
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
| Service portal / dashboard | Homepage | A | candidate | One landing page listing all services — a "what does what" front door | Gap surfaced by V4; fits boma's legibility goal |
@ -148,8 +148,11 @@ AI/LLM, a game server (Minecraft), generic static-site hosting. Plausible someda
none are committed.
**Confirmed exclusions (V4 had them; boma deliberately does not).** V4 mixed in a lot
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, nvim/kitty/tmux,
LibreOffice, antivirus, remote desktop. boma is **server-only**, so these are correctly
absent. Likewise the removed Knowledge domain (Discourse, Snipe-IT, MRBS booking) and
V4-specific project websites — out of boma's scope by design. The narrower surface is
intentional, not an oversight.
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, LibreOffice,
antivirus, remote desktop. boma's **managed cluster/server hosts** stay server-only, so
these are correctly absent. (One scoped exception: the control / AI-worker host `ubongo`
runs an interactive `dev_env` — zsh/tmux/neovim — per ADR-015; that is the developer
environment of an infrastructure worker host, not a personal desktop, and does not apply
to managed service hosts.) Likewise the removed Knowledge domain (Discourse, Snipe-IT,
MRBS booking) and V4-specific project websites — out of boma's scope by design. The
narrower surface is intentional, not an oversight.

View file

@ -6,6 +6,15 @@ Project documentation.
Numbered from 001; each records context, the decision, and what was ruled out.
- `runbooks/` — step-by-step operational procedures (add a host, add a role, rotate
secrets).
- `security/` — security baseline, accepted-risk register, per-service checklist +
template (ADR-002/004).
- `testing/` — testing methodology artifacts + the `VERIFY.md` template (ADR-008/017).
- `access/` — operational-access doctrine + the `ACCESS.md` template (ADR-021).
- `backup/` — backup doctrine + the `BACKUP.md` template (ADR-022).
- `hardware/` — capacity reference + `/capacity-review` output (ADR-012).
- `reviews/`: `/review-repo` audit trail.
- `CAPABILITIES.md` / `ROADMAP.md` / `TODO.md` / `FRICTION.md` — what boma does, the
build order, the backlog, and recurring-friction notes.
For what is actually **built vs only designed**, see `STATUS.md` at the repo root —
the ADRs describe intent, not necessarily current reality.

View file

@ -79,9 +79,10 @@ zero-risk and *born at Gandi*.
### M2 · `askari` provisioned + under Ansible
Provision the Hetzner VPS **as IaC with Terraform** (CAX11 ARM / Helsinki / Debian 13,
behind a TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap
it. Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
Provision the Hetzner VPS **as IaC with Terraform** (Helsinki / Debian 13, behind a
TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap it.
**Shipped as cx23/x86** (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
x86, cheaper). Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
@ -113,8 +114,8 @@ Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's
**Caddy** reverse proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
valid Let's Encrypt cert (HTTP-01 — DNS-01 deferred to Phase 2, see ADR-024/FRICTION).
Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
`…2026-06-14-m4a-docker-caddy.md`. **M4b (next):** the `netbird` service role — read
NetBird's current self-host compose then.
`…2026-06-14-m4a-docker-caddy.md`. **M4b (next):** the `netbird_coordinator` service
role — read NetBird's current self-host compose then.
Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the
**embedded IdP** (ADR-016 — no Authentik dependency), fronted by the now-proven Caddy.

View file

@ -122,7 +122,7 @@
retro consumes them.
12. **Spin-up / build order** — what is the right order of operations when spinning up
from scratch (OS, DNS, Authentik, Traefik, …)?
from scratch (OS, DNS, Authentik, Caddy, …)?
13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the README files, but is that enough? Also, how do we re-challenge decisions and how they interact over time? For example, we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to a service has emerged and is actually better.

View file

@ -79,7 +79,8 @@ time. Each heading tags the threat(s) it primarily serves.
### Updates — *opportunistic*
- `unattended-upgrades` enabled for **security patches only**
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
- Full system upgrades triggered deliberately via Ansible (planned — a dedicated upgrade
playbook per ADR-011; not yet built, no `upgrade.yml` exists today)
- No automatic reboots — reboots are a conscious operational decision
### Minimal attack surface — *opportunistic, blast radius*

View file

@ -47,6 +47,8 @@ below). Each service role contains a standard set of files:
| `README.md` | Purpose, variables, usage (role convention) |
| `SECURITY.md` | Per-service security record — see ADR-002 and `docs/security/service-security-template.md` |
| `VERIFY.md` | Per-service UI acceptance spec — see ADR-008 Level 4 / ADR-017 and `docs/testing/service-verify-template.md` |
| `ACCESS.md` | Per-service operational-access record — see ADR-021 and `docs/access/service-access-template.md` |
| `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
| `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |
### Standard deploy mechanics
@ -102,7 +104,9 @@ Managed by the `docker_host` role. Key settings:
- Bind mounts preferred over named volumes for data that must be backed up
- All bind mount paths are under `/opt/services/<name>/data/`
- Backup strategy is defined separately (not in scope of this repo)
- Backup strategy is defined in **ADR-022** — the bind mounts under
`/opt/services/<name>/data/` are exactly the unit ADR-022's per-service `backup__*`
contract (and `BACKUP.md`) captures
## Decision
@ -128,5 +132,6 @@ Drawn from the trade-offs and deferred items this ADR already states:
- Bare `latest` is acceptable only on the stateless tier; the stateful tier is always
pinned `tag@digest`, and image updates are a deliberate operation (per Image management;
ADR-011).
- Backup strategy is stated as defined separately, not in scope of this ADR (per Persistent
data).
- Backup strategy is defined in ADR-022 (not in this ADR); the persistent bind mounts
under `/opt/services/<name>/data/` are the unit ADR-022's per-service `backup__*`
contract captures (per Persistent data).

View file

@ -87,6 +87,14 @@ Assigned infrastructure addresses:
| `10.20.0.12` | `proxy` | Reverse proxy |
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
> **Control node `ubongo` — legacy V4 network (transitional).** `ubongo` (ADR-015) is the
> manually-provisioned physical control node and currently lives on the **legacy V4
> homelab network at `10.20.10.151`** — boma is being built up from the V4 base, and the
> physical LAN has not yet been re-cut to this VLAN scheme. That address is therefore
> **outside** the planned `srv` `10.20.0.0/24`; `base__firewall_control_addr` and the
> inventory point at the real (V4) address. When the network is migrated to these VLANs,
> `ubongo` moves into `mgmt`/`srv` and this note is retired.
#### VLAN 30 — lan (10.30.0.0/24)
| Range | Purpose |
@ -164,15 +172,21 @@ IoT devices cannot initiate connections to `srv`.
### DNS zones and split-horizon
**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
**Internal zone**: `boma.baobab.band` **today** (the `dns` role is unbuilt) — served by
`dns1` and `dns2`. **Target:** it is renamed to `boma.wingu.me` in Phase 2 when the `dns`
role lands. Until then `boma.baobab.band` is the authoritative internal name **everywhere
it appears** (the naming table above, split-horizon below, the OPNsense forwarder, and
ADR-009/016). This is the single source for that transition; other references use the
current name and inherit this caveat.
The zone is rendered by the Ansible `dns` role: host A records come from the
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
and service/alias/split-horizon records are explicit zone data in `group_vars`.
Terraform itself writes no DNS records — see ADR-009.
**Public zone**: `wingu.me` — Gandi LiveDNS, **managed as code** by the `public_dns`
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal),
services `<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal — the
Phase-2 target; currently `boma.baobab.band`, see *Internal zone* above), services
`<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
`nyumbani` is retired. **Mesh/LAN-only by default**: home services have no public record
(reached over LAN or the NetBird mesh); only deliberate exceptions are published. The
project is `boma`; the domain is `wingu.me`. The legacy `baobab.band` zone (Cloudflare)

View file

@ -67,7 +67,7 @@ configuration issues invisible to Ansible check mode.
A Claude-driven exploratory check of a service's **application UI**, run as
`/verify-service <name>` on `ubongo` (ADR-017). Claude drives Chromium via the
`playwright` plugin against a **staging** deploy, authenticates through the real
Traefik + Authentik SSO flow using a test user in the staging `test` group, then
Caddy (ADR-024) + Authentik SSO flow using a test user in the staging `test` group, then
executes the service's `roles/<service>/VERIFY.md` acceptance journeys *and*
free-explores — judging pass/fail, screenshotting key states. It writes a dated report
to `docs/testing/reviews/` and hands the operator a manual-test checklist for anything

View file

@ -119,7 +119,8 @@ rendered entirely by the Ansible `dns` role:
remains the ultimate source of truth for which hosts exist; the data simply flows
through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
gateway, `forgejo.nyumbani.baobab.band` → proxy) are explicit zone data in `group_vars`.
gateway, `vaultwarden.wingu.me` → proxy split-horizon) are explicit zone data in
`group_vars`.
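
As a rough illustration of "explicit zone data in `group_vars`", a split-horizon entry might be declared as below. The variable and key names (`dns__static_records`, `fqdn`, `type`, `value`) are assumptions, since the `dns` role is unbuilt and its schema is not shown here; the address is the `proxy` entry from the ADR-007 address table.

```yaml
# group_vars for the DNS hosts (illustrative; schema is hypothetical)
dns__static_records:
  - fqdn: vaultwarden.wingu.me
    type: A
    value: 10.20.0.12   # `proxy` (reverse proxy) in the srv VLAN
    # Internal view only: the public wingu.me zone carries no record for
    # mesh/LAN-only services.
```

Host A records would not appear in this list; per the text above they come from the inventory.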
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would

View file

@ -21,7 +21,7 @@ Each container role declares its class, e.g. `<role>__stateful: true|false` (def
`false`). The split is the load-bearing classification for the whole policy.
- **Stateless** — no durable data of its own; losing the container loses nothing.
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Caddy,
reverse proxies, FlareSolverr.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
@ -56,7 +56,7 @@ per host, in strict order with a verification gate between every phase:
5. **Verify** again; alert on failure.
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
**before** the rest follow — so a DNS/Caddy failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
### 4. Snapshot-before is the rollback mechanism

View file

@ -45,4 +45,6 @@ workload that should move, or a node due an upgrade.
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
not assumed.
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
## Related
ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).

View file

@ -74,5 +74,7 @@ copy.
cost of a clean methodological break.
- The policy is enforceable in review and by the AI guardrails above.
See also: ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
## Related
ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
(update management — ntfy topics decided fresh per this policy).

View file

@ -153,5 +153,7 @@ master password.
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
## Related
ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).

View file

@ -1,5 +1,11 @@
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
## Status
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## Context
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
@ -89,12 +95,6 @@ allocated for it.
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
## Status
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## What was ruled out
| Option | Reason |
@ -106,11 +106,6 @@ Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` rol
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).
## Consequences
- A new public surface appears on `askari` — management API + dashboard (80/443) +
@ -129,3 +124,10 @@ ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN addre
operator footprint (What was ruled out).
- Implementation is pending: the role tasks land only once the unbuilt `base` role and
service-role machinery exist (Status).
## Related
ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).

View file

@ -1,5 +1,11 @@
# ADR-017 — Service-UI acceptance verification (Level 4)
## Status
Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
## Context
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
@ -24,7 +30,7 @@ A Claude-driven exploratory service-UI verification harness — **Level 4** —
(incl. destructive flows) against a *staging* deploy; the rebuildable sandbox
resolves safety.
4. **Test users in Authentik (central IdP), real SSO flow** — authenticates through
Traefik + Authentik as a real user would.
Caddy (ADR-024) + Authentik as a real user would.
5. **Per-service `VERIFY.md` backbone + free exploration** — each service role ships an
acceptance spec of critical journeys; Claude executes it and explores beyond it.
@ -63,12 +69,6 @@ them.
- **No secrets leaked** — the git-ignored screenshot dir is the safety boundary;
avoid capturing credential screens.
## Status
Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
## Dependencies
- `ubongo` (ADR-015) — runs the browser. Designed, not built.
@ -85,12 +85,9 @@ template, the `/verify-service` skill, the convention/checklist/Further-reading
| Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
| Free-form, no per-service spec | Non-repeatable, can miss a critical flow; `VERIFY.md` gives a backbone. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Traefik+Authentik path; central test users are faithful. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Caddy+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |
See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).
## Consequences
- The harness is confined to staging by a hard stop: it refuses to run against
@ -108,3 +105,8 @@ ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge
skill, conventions/checklist edits), but running is deferred on its dependencies:
`ubongo`, the `playwright` plugin, Authentik, a staging deploy, and `make new-role`
scaffolding `VERIFY.md` (Status; Dependencies).
## Related
ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).

View file

@ -1,5 +1,12 @@
# ADR-018 — Logging and log integrity
## Status
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
@ -70,13 +77,6 @@ ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a moni
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
## Status
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
@ -94,10 +94,6 @@ the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence a
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).
## Consequences
- Opportunistic track-covering and host-pivot-to-store are defeated because logs leave
@ -120,3 +116,9 @@ standard), ADR-011 (health checks — distinct from this).
- The decision is authorable now but the live pipeline is deferred on the stack:
Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog config, and the
push-only credential (Status; Dependencies).
## Related
ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).

View file

@ -49,7 +49,7 @@ slice on its own, and (c) doesn't overlap confusingly with another.
| `monitoring` | metric exporters / health checks |
| `config` | render templated config/compose files to disk — **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
| `proxy` | reverse-proxy + TLS registration (Caddy routes, Authentik) |
The `config`/`deploy` split lets you re-render and diff configuration (`--tags
config`) without bouncing services, then restart deliberately (`--tags deploy`).
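
A minimal sketch of the `config`/`deploy` split, assuming a hypothetical service named `example` and a hypothetical `compose.yml.j2` template; the `/opt/services/<name>/` layout follows ADR-004, and `community.docker.docker_compose_v2` is one possible module choice rather than what the roles necessarily use.

```yaml
# Illustrative tasks for a service role; names are assumptions.

- name: Render the compose file to disk (no restart)
  ansible.builtin.template:
    src: compose.yml.j2                      # hypothetical template
    dest: /opt/services/example/compose.yml
    mode: "0640"
  tags: [config]

- name: Bring the stack up or restart it
  community.docker.docker_compose_v2:
    project_src: /opt/services/example
    state: present
  tags: [deploy]
```

Under these assumptions, a `--tags config` run with `--diff` shows what would change on disk, and a later `--tags deploy` run applies it to the running containers.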

View file

@ -88,9 +88,9 @@ declarations (real drift risk).
`askari` sits outside the Proxmox cluster and has no OPNsense. Its **perimeter** layer
is a TF-managed **Hetzner Cloud Firewall** (declared in `terraform/environments/offsite/`)
alongside the VM itself. Current rule set (M2): SSH inbound from `ubongo`'s public IP
only. NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator
role is built.
alongside the VM itself. Rule set: SSH inbound from `ubongo`'s public IP (M2), plus
TCP 80/443 + UDP 3478 opened in **M4a** (Caddy + NetBird). The `netbird_coordinator`
service role that uses 3478 lands in **M4b**; the ports are already open.
The `group_vars` service catalog remains authoritative for `askari`'s **host nftables**
layer — the same two-layer model applies, with Hetzner Cloud Firewall substituting for

@ -19,9 +19,9 @@ Accepted (2026-06-14). Amends the soft Traefik assumption carried by the roadmap
boma needs a reverse proxy to front its services with TLS. ADR-002 requires every
service to sit behind a proxy with authentication before it is reachable; ADR-007/M1
delivers a `*.boma.<domain>` wildcard cert via ACME DNS-01 against Gandi — the only
viable cert path for mesh/LAN-only services that cannot satisfy HTTP-01 (no public
A-record to point at).
delivers a `*.<domain>` wildcard cert via ACME DNS-01 against Gandi (the apex `boma`
domain, matching ROADMAP M1) — the only viable cert path for mesh/LAN-only services
that cannot satisfy HTTP-01 (no public A-record to point at).
The roadmap (Phase-2, step 5) and ADR-017 prose assumed **Traefik + Authentik** as the
auth-and-proxy pair without an ADR ever pinning Traefik. On closer inspection:
@ -57,10 +57,14 @@ boma's reverse proxy is **Caddy**.
5. `forward_auth` to Authentik is a first-class Caddy directive — the planned
Authentik auth story (ADR-002) is preserved without Traefik as the middleman.
### 2. Custom image
### 2. Custom image (DNS-01 path only — Phase 2)
> Applies only to the **DNS-01** path, which is **deferred to Phase 2** (see the Status
> note). M4a ships **vanilla `caddy:2`** on askari (HTTP-01) — no custom image.
Caddy's official Docker image does not include third-party DNS plugins. The `caddy-dns/gandi`
plugin must be compiled in via `xcaddy`. boma builds a custom image:
plugin must be compiled in via `xcaddy`. When the cluster's mesh/LAN-only services need
DNS-01, boma builds a custom image:
```
FROM caddy:builder AS builder
@ -70,14 +74,16 @@ FROM caddy:latest
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
```
This image is maintained as a boma artifact (Forgejo registry, pinned digest in the
Compose template). It is the cost of the Gandi DNS-01 path — unavoidable regardless of
proxy choice.
That image would be maintained as a boma artifact (Forgejo registry, pinned digest in the
Compose template) — the cost of the Gandi DNS-01 path. (On askari this approach hit two
blockers, so DNS-01 is deferred; see the Status note.)
### 3. Deployment scope
The first Caddy instance fronts the NetBird stack on `askari` (M4). The pattern
generalises to the Proxmox cluster in Phase 2 when services multiply.
The first Caddy instance runs on `askari` (M4a), serving a test vhost over HTTP-01 to
prove the proxy + ACME path. It fronts the NetBird stack in **M4b** (when the
`netbird_coordinator` role is built). The pattern generalises to the Proxmox cluster in
Phase 2 when services multiply.
### 4. Authentik integration (deferred)
@ -90,8 +96,9 @@ middleware migration is required.
- **Roadmap Phase-2 step 5** is updated from "Authentik + Traefik" to "Authentik +
Caddy (ADR-024)".
- **ADR-017 prose** that mentioned Traefik is updated to read "Caddy (ADR-024)".
- A custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must be built, pushed to the
Forgejo registry, and kept current (plugin + base image updates).
- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. **If/when**
the Phase-2 DNS-01 path lands, a custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must
be built, pushed to the Forgejo registry, and kept current (plugin + base image updates).
- Caddyfile config is rendered by Ansible from `group_vars` — consistent with ADR-004
and easier to review than distributed container labels.
- `forward_auth` to Authentik is available when Authentik is deployed; no extra

@ -0,0 +1,76 @@
{
"date": "2026-06-14",
"reviewed_commit": "e346137",
"fixes_commit": null,
"mode": "on-demand",
"counts": {
"auto_fixed": 11,
"open": 29,
"scan": {
"broken-adr-ref": 4,
"broken-path-ref": 2,
"marker": 14,
"open-deferred-item": 5,
"stale-deferred": 0
}
},
"deferral_checklist": {
"adr-011-open-items": "all 5 ('Open questions': Proxmox snapshot driver, exact cadences, health-check harness home, classification home, staging-first) confirmed genuinely still open. ADR-011 is still Proposed/unbuilt; the same questions are echoed open in docs/TODO.md item 16; no later ADR or STATUS decides any of them. No stale-deferred.",
"stale_deferred_found": 0
},
"scan_false_positives": [
{"check": "broken-adr-ref", "location": "tests/test_repo_scan.py:10,43; docs/superpowers/plans/2026-06-10-adr-structure.md:50,83", "why": "ADR-099/ADR-100 are intentional test fixtures exercising the scanner's bad-ref detection."},
{"check": "broken-path-ref", "location": "docs/superpowers/plans/2026-06-14-m4b-netbird.md:28,56", "why": "roles/netbird/ is referenced by the M4b implementation plan for a role to be scaffolded via make new-role; forward-looking plan for unbuilt work, not a dead ref."},
{"check": "marker", "location": "docs/decisions/019-tagging.md:14 + docs/superpowers/plans/* + docs/superpowers/specs/*", "why": "019-tagging.md:14 is prose discussing 'over-tagging' as a concept ('the TODO explicitly warns against...'), not an actionable TODO. The 13 superpowers markers are historical planning artifacts (commit-message TODOs, plan steps)."}
],
"auto_fixed": [
{"id": "AF1", "dimension": "drift", "severity": "high", "location": "roles/reverse_proxy/meta/main.yml:4-6", "description": "meta description said 'ACME DNS-01 TLS via Gandi ... builds the custom image on-host (caddy-dns/gandi)' — but the role is now vanilla Caddy + HTTP-01 (commit b7e919d dropped the custom image); README/defaults/compose/STATUS all reflect vanilla. Only meta was stale and contradicted the code.", "fix": "rewrote description to 'Vanilla Caddy reverse proxy (ADR-024); TLS via ACME HTTP-01 for public hosts. Routes from reverse_proxy__routes, managed via Docker Compose.'", "tag": "new"},
{"id": "AF2", "dimension": "cruft", "severity": "medium", "location": "roles/README.md:11-15", "description": "Current-state paragraph said base hardening (SSH/fail2ban), auditd, packages, users 'not yet built' and docker_host 'scaffolded but has no tasks yet', but STATUS records the hardening concern built+tested+applied to askari, and docker_host/reverse_proxy/public_dns all built.", "fix": "rewrote to: base firewall+hardening built (hardening applied to askari), docker_host/reverse_proxy/public_dns/dev_env built; auditd/packages/users pending.", "tag": "recurring"},
{"id": "AF3", "dimension": "drift", "severity": "medium", "location": "playbooks/README.md:6-13", "description": "site.yml note said docker_host 'scaffolded with no tasks yet' (now installs Docker engine) and the file omitted dns.yml and offsite.yml entirely.", "fix": "reworded site.yml note (base firewall+hardening, no cluster docker hosts yet) and added dns.yml + offsite.yml bullets.", "tag": "new"},
{"id": "AF4", "dimension": "cruft", "severity": "low", "location": "roles/public_dns/README.md:7-9", "description": "'the anti-spoof baseline now; askari in M4' — M4a is done; askari + *.askari records are applied.", "fix": "updated to note askari.wingu.me + *.askari wildcard applied in M4a.", "tag": "new"},
{"id": "AF5", "dimension": "cruft", "severity": "low", "location": "scripts/README.md:17", "description": "Helper-script list omitted check-tags.py, which exists and is run by make lint (ADR-019).", "fix": "added a check-tags.py bullet.", "tag": "new"},
{"id": "AF6", "dimension": "drift", "severity": "medium", "location": "terraform/README.md:7-15", "description": "Top-level terraform README omitted modules/hetzner_vm and environments/offsite — the only built+applied TF environment (askari).", "fix": "added hetzner_vm + offsite env bullets; scoped 'not yet init'ed' to the Proxmox envs.", "tag": "new"},
{"id": "AF7", "dimension": "cruft", "severity": "low", "location": "terraform/environments/offsite/providers.tf:1", "description": "Verified-stamp said 'cax11@hel1' but the deployed server is cx23 (CAX11 out of stock).", "fix": "stamp now reads cx23@hel1.", "tag": "new"},
{"id": "AF8", "dimension": "cruft", "severity": "low", "location": "terraform/modules/hetzner_vm/variables.tf:7", "description": "server_type description example was 'e.g. cax11 (ARM)'; the only consumer uses cx23.", "fix": "example now 'e.g. cx23 (x86) or cax11 (ARM)'.", "tag": "new"},
{"id": "AF9", "dimension": "drift", "severity": "medium", "location": "inventories/production/group_vars/all/public_dns.yml:16-17", "description": "Comment on the *.askari wildcard said 'Caddy gets a *.askari.wingu.me cert via DNS-01 (M4a)' — M4a uses HTTP-01 (the wildcard A record itself is still legitimately needed for name resolution).", "fix": "comment now says per-host certs via ACME HTTP-01 (M4a).", "tag": "new"},
{"id": "AF10", "dimension": "drift", "severity": "high", "location": "docs/CAPABILITIES.md:27,29", "description": "Capability table named Traefik as the reverse-proxy candidate (ADR-024 chose Caddy, built+applied) and marked public DNS 'apply pending' (applied 2026-06-14).", "fix": "reverse-proxy row -> 'Caddy (ADR-024)'; public DNS note -> 'applied (M1)'. (The V4-history Traefik mention at line 134 is correct and left as-is.)", "tag": "new"},
{"id": "AF11", "dimension": "cruft", "severity": "low", "location": "README.md:110-119", "description": "README 'Documentation' ADR list stopped at ADR-017; ADR-018..024 exist.", "fix": "extended the list through ADR-024 (logging, tagging, firewall, access, backup, ADR-structure, reverse-proxy).", "tag": "recurring"}
],
"open": [
{"id": "O1", "dimension": "drift", "severity": "high", "location": "STATUS.md:41 (+ 45-48) ↔ STATUS.md:33-34", "description": "The 'Scaffolded but empty — NOT implemented' table still lists roles/docker_host as 'Scaffolded, no tasks ... applying it is a no-op', and the trailing prose (45-48) repeats it. This contradicts STATUS.md:33-34 ('Built + applied', installs Docker CE + compose) and the actual roles/docker_host/tasks/main.yml. An internal STATUS contradiction; one side is plainly correct (docker_host is built).", "suggested_fix": "Remove/rewrite the docker_host row in the 'Scaffolded but empty' table and the 45-48 paragraph: docker_host now installs the Docker engine; only its deferred daemon-hardening + nftables.d scope (ADR-004/020) remains. Report (STATUS is the operator's ground-truth doc — reword deliberately).", "tag": "new", "auto_fixable": false},
{"id": "O2", "dimension": "consistency", "severity": "high", "location": "docs/decisions/004-docker-model.md:105,131 ↔ docs/decisions/022-backup.md", "description": "ADR-004 states twice that 'Backup strategy is defined separately (not in scope of this repo)'. ADR-022 defines a full in-repo backup/DR doctrine (restic, fisi pull node, per-service backup__* + BACKUP.md). Direct ADR↔ADR scope contradiction.", "suggested_fix": "Reword ADR-004's lines to point at ADR-022 (backup is now in-repo scope) and cross-link, per ADR-023's no-silent-reversal rule. Design decision — report.", "tag": "recurring", "auto_fixable": false},
{"id": "O3", "dimension": "consistency", "severity": "high", "location": "docs/decisions/024-reverse-proxy.md (Consequences) ↔ 008-testing.md:70; 017-service-ui-verification.md:27,88; 019-tagging.md:52", "description": "ADR-024's Consequences claim 'ADR-017 prose that mentioned Traefik is updated to read Caddy'. That update was NOT done: ADR-017:27,88 still say 'Traefik + Authentik'; ADR-008:70 'Traefik + Authentik SSO flow'; ADR-019:52 'Traefik routes, Authentik'. The doc set still designs around Traefik while ADR-024 overclaims the reconciliation was completed.", "suggested_fix": "Replace Traefik with Caddy (ADR-024) in ADR-008:70, ADR-017:27,88, ADR-019:52, OR soften ADR-024's Consequences to 'to be updated'. ADR prose = design docs — report (not auto-fixed).", "tag": "new", "auto_fixable": false},
{"id": "O4", "dimension": "conformance", "severity": "high", "location": "docs/decisions/023-adr-structure.md:7-8,77-80 ↔ 016-mesh-vpn.md:3; 017-service-ui-verification.md:3; 018-logging.md:3", "description": "ADR-023 §2 mandates ## Status as the first section and §6 explicitly claims ADRs 001–018 were retroactively restructured to lead with Status (calling out 016–018). But ADR-016/017/018 still open with ## Context, Status buried late (016:~92, 017:~66, 018:~73). ADR-023's own conformance claim is contradicted by three in-scope files. (Older ADRs 001–010 lead with Status but place Decision/Consequences after topical sections, an accepted presentational trade-off per ADR-023 §5/§6.)", "suggested_fix": "Either add a top-of-file ## Status section to ADR-016/017/018 (move the existing build-state line up), or correct ADR-023 §6 to exclude them. Reordering judgement: report.", "tag": "recurring", "auto_fixable": false},
{"id": "O5", "dimension": "consistency", "severity": "medium", "location": "docs/decisions/004-docker-model.md:48-50", "description": "The service-role file table (the canonical standard) lists only README/SECURITY/VERIFY; it omits ACCESS.md (ADR-021) and BACKUP.md (ADR-022), both of which CLAUDE.md + those ADRs mandate as required per-service-role files.", "suggested_fix": "Add ACCESS.md (ADR-021) and BACKUP.md (ADR-022, stateful) rows to ADR-004's file table.", "tag": "recurring", "auto_fixable": false},
{"id": "O6", "dimension": "drift", "severity": "medium", "location": "docs/decisions/002-security.md:82", "description": "References 'make deploy PLAYBOOK=upgrade' as the deliberate full-upgrade mechanism, but no upgrade.yml exists (only bootstrap/dns/offsite/site/workstation) and ADR-011 is still Proposed/unbuilt — stated without the '(planned)' caveat ADR-002 uses for its other unbuilt controls.", "suggested_fix": "Add a '(planned — ADR-011, not yet built)' caveat to the upgrade line, or drop the concrete command until upgrade.yml exists.", "tag": "recurring", "auto_fixable": false},
{"id": "O7", "dimension": "drift", "severity": "medium", "location": "docs/CAPABILITIES.md:150-155 ↔ STATUS.md:29", "description": "CAPABILITIES still lists nvim/kitty/tmux among 'Confirmed exclusions' boma 'deliberately does not' have, but the dev_env role (built+applied to ubongo) installs neovim + tmux. (The reverse-proxy/public-DNS rows in this file were auto-fixed in AF10; this exclusions block was left because it needs a scoped carve-out, not a token swap.)", "suggested_fix": "Scope the exclusion to managed cluster/server hosts and note the control/dev host (ubongo, ADR-015) runs an interactive dev_env, or drop nvim/tmux from the list.", "tag": "recurring", "auto_fixable": false},
{"id": "O8", "dimension": "conformance", "severity": "medium", "location": "roles/dev_env/tasks/main.yml (include_tasks per_user.yml) + roles/dev_env/tasks/per_user.yml:4-9", "description": "per_user.yml's getent + set_fact dev_env__home preflight is untagged, and the include_tasks that pulls it in carries no 'apply: tags:'. base/tasks/main.yml documents and guards exactly this gotcha with apply: tags:; dev_env does not. A partial --tags users or --tags config run selects only the include statement (running nothing) or, if made tag-aware, skips the set_fact and fails the dependent [config] tasks on an undefined dev_env__home. Against ADR-019's concern-runnable-in-isolation intent.", "suggested_fix": "Add apply: tags: [users, config] to the per_user.yml include (mirroring base), and tag the getent+set_fact with 'always' (or the union [users, config]).", "tag": "recurring", "auto_fixable": false},
{"id": "O9", "dimension": "drift", "severity": "medium", "location": "inventories/production/hosts.yml:1-17", "description": "Header claims 'Generated from Terraform outputs: make tf-inventory TF_ENV=production', but the file is hand-maintained: it carries the manual control host (ubongo) and omits the offsite_hosts group that tf_to_inventory.py always emits (VALID_GROUPS). Running tf-inventory against the empty production env would DROP ubongo and ADD offsite_hosts, so the header misrepresents how the file is managed.", "suggested_fix": "Make the header honest (hand-maintained for the manual control-node exception while production TF has no VMs; offsite hosts live in offsite.yml), and reconcile the declared group set with tf_to_inventory.py. Do NOT hand-regenerate hosts.yml in a way that drops ubongo.", "tag": "recurring", "auto_fixable": false},
{"id": "O10", "dimension": "consistency", "severity": "medium", "location": "inventories/production/group_vars/all/vars.yml:42 + hosts.yml:12 ↔ docs/decisions/007-network.md", "description": "ubongo's address is 10.20.10.151 (control host_var + base__firewall_control_addr), but ADR-007 defines srv as 10.20.0.0/24 (network__srv_subnet) and mgmt as 10.10.0.0/24 — 10.20.10.151 is in neither, and ADR-007's addressing tables don't record where the physical control node lives. base__firewall_control_addr (ADR-021 recovery path) depends on this being right.", "suggested_fix": "Add ubongo to ADR-007's addressing table (which VLAN/segment 10.20.10.151 belongs to, clearly outside srv 10.20.0.0/24), or correct the address. Confirm the real address with the operator first.", "tag": "recurring", "auto_fixable": false},
{"id": "O11", "dimension": "consistency", "severity": "medium", "location": "terraform/environments/{staging,production}/terraform.tfvars.example:9-11 + variables.tf:5", "description": "Proxmox node naming uses 'pve01' (two-digit) in both tfvars.example files and the proxmox_endpoint var descriptions; ADR-007 defines single-digit node names pve0/pve1/pve2, and internal FQDNs as <host>.boma.<domain>. Example contradicts the naming convention.", "suggested_fix": "Align example values with ADR-007 (proxmox_node = pve0; endpoint = https://pve0.boma.<domain>:8006/). Verify the intended node name with the operator before changing — report rather than auto-fix.", "tag": "recurring", "auto_fixable": false},
{"id": "O12", "dimension": "conformance", "severity": "medium", "location": "roles/reverse_proxy/ (missing SECURITY.md, VERIFY.md, ACCESS.md, BACKUP.md)", "description": "CLAUDE.md requires every service role to carry SECURITY.md (ADR-002/004), VERIFY.md (ADR-008/017), ACCESS.md (ADR-021), and a stateful BACKUP.md (ADR-022); a stateless service records backup__state: false with a reason. reverse_proxy is the first real built+applied service role (askari, M4a) but ships only README.md. (Judgement recorded: public_dns is exempt — it runs on the control node against an external DNS API, provisioning no host-resident service/port, so it is not a 'service' role in the ADR-004 sense.)", "suggested_fix": "Add the four files from docs/security|testing|access|backup/ templates. BACKUP.md can declare backup__state: false (Caddy state = re-issuable ACME certs).", "tag": "new", "auto_fixable": false},
{"id": "O13", "dimension": "consistency", "severity": "low", "location": "docs/decisions/012-hardware-capacity.md; 013-heritage-v4.md:77; 015-control-host.md; 016-mesh-vpn.md; 017-service-ui-verification.md; 018-logging.md", "description": "Inconsistent cross-reference convention: ADRs 014/019/020/021/022/023 + adr-template use a dedicated '## Related' section, while 012/013/015/016/017/018 use an inline 'See also:' prose line (placed mid-document in 016/017/018). ADR-023 §3 names ## Related as the optional section; 'See also:' is an undocumented variant.", "suggested_fix": "Convert the 'See also:' prose into ## Related sections (after Consequences) in ADR-012/013/015/016/017/018 for uniformity. Cosmetic.", "tag": "recurring", "auto_fixable": false},
{"id": "O14", "dimension": "consistency", "severity": "low", "location": "docs/README.md:4-8; inventories/README.md", "description": "docs/README.md lists only decisions/ + runbooks/ (omits security/testing/access/backup/hardware/reviews); inventories/README.md omits the offsite_hosts group documented in CLAUDE.md. Both narrower than current reality.", "suggested_fix": "Add the missing subdir rows / note offsite_hosts, or explicitly defer to the canonical list in the repo README / CLAUDE.md.", "tag": "recurring", "auto_fixable": false},
{"id": "O15", "dimension": "drift", "severity": "medium", "location": "docs/runbooks/new-host.md:82,114-138 (Part E)", "description": "Part E (control node ubongo) still instructs 'ssh ansible@<IP>' / an ansible-user flow, but STATUS records ubongo is deliberately managed as the operator account sjat (group_vars/control ansible_user: sjat) with the ansible-user bootstrap listed as Pending.", "suggested_fix": "Update Part E to reflect ubongo managed as sjat (no ansible user yet), the ansible-user bootstrap a pending item per STATUS.md.", "tag": "recurring", "auto_fixable": false},
{"id": "O16", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/files/dotfiles/zsh/.zshrc:28,55", "description": "Shipped .zshrc hard-codes alias rclone=\"/usr/bin/rclone\" (rclone not installed by dev_env) and 'eval \"$(direnv hook zsh)\"' unguarded (unlike the guarded oh-my-posh block) — heritage fisi/V4 carryovers. If direnv is dropped from dev_env__packages, every shell startup errors.", "suggested_fix": "Drop the rclone alias and guard the direnv hook with 'command -v direnv', or document direnv as a hard dependency of the shipped .zshrc.", "tag": "recurring", "auto_fixable": false},
{"id": "O17", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/tasks/oh_my_posh.yml:15-26", "description": "The zen.toml theme-directory + deploy tasks render config to disk but carry no 'config' tag, while analogous dotfile tasks in per_user.yml are tagged config — inconsistent concern tagging within the role.", "suggested_fix": "Add tags: [config] to the zen.toml directory + deploy tasks.", "tag": "recurring", "auto_fixable": false},
{"id": "O18", "dimension": "drift", "severity": "medium", "location": "docs/decisions/007-network.md:159,167,186 + 009-provisioning-handoff.md:114 + 016-mesh-vpn.md:90 ↔ 007-network.md:174,184", "description": "Internal-zone name is inconsistent across the doc set: ADR-007:159/167/186, ADR-009:114, ADR-016:90 call it 'boma.baobab.band', while ADR-007:174/184 says infra is '<host>.boma.wingu.me' and the internal zone 'will be renamed to boma.wingu.me' (Phase 2). M1 moved boma's home to wingu.me. A reader can't tell which domain the unbuilt dns role should render.", "suggested_fix": "State the transitional state in one authoritative place (current = boma.baobab.band, target = boma.wingu.me in Phase 2), or align all references on the target. Report.", "tag": "new", "auto_fixable": false},
{"id": "O19", "dimension": "consistency", "severity": "low", "location": "docs/decisions/009-provisioning-handoff.md:122", "description": "M1 retired 'nyumbani' as a naming tier (ROADMAP:70, ADR-007:176). ADR-009:122 still uses 'forgejo.nyumbani.baobab.band' as the worked example of internal-zone data the dns role would render. (Note: STATUS:19 + ADR-003/008/010 use the same name for the LIVE legacy Forgejo host, which is legitimately legacy infra — distinguish.)", "suggested_fix": "Update the ADR-009:122 example to a non-nyumbani name consistent with the retired-nyumbani decision; annotate the legacy Forgejo references as intentionally legacy where they remain.", "tag": "recurring", "auto_fixable": false},
{"id": "O20", "dimension": "drift", "severity": "low", "location": "docs/ROADMAP.md:82-83", "description": "ROADMAP M2 still describes askari as 'CAX11 ARM / Helsinki', but STATUS records it provisioned as cx23/x86 (CAX11/ARM out of stock EU-wide on 2026-06-14). M3/M4 sections got DONE notes; M2's spec line wasn't corrected.", "suggested_fix": "Update ROADMAP M2 to note askari shipped as cx23/x86 (CAX11 unavailable), or add a DONE note mirroring M3/M4.", "tag": "new", "auto_fixable": false},
{"id": "O21", "dimension": "drift", "severity": "low", "location": "docs/decisions/020-firewall.md:91-93", "description": "ADR-020 says askari's Hetzner Cloud Firewall 'NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator role is built' — but M4a is DONE and the firewall already opens 80/443/3478. Future-tense is stale; only the netbird role (M4b) remains.", "suggested_fix": "Update ADR-020 to past tense (80/443/3478 opened in M4a); keep the netbird coordinator role (M4b) caveated as unbuilt.", "tag": "new", "auto_fixable": false},
{"id": "O22", "dimension": "consistency", "severity": "low", "location": "docs/decisions/024-reverse-proxy.md:60-92", "description": "ADR-024 is internally inconsistent post-revision: the revised Status note says askari ships HTTP-01 with vanilla Caddy (custom-image DNS-01 deferred to Phase 2), but Decision §2 still asserts boma builds/maintains the custom xcaddy+gandi image, §3 says 'fronts the NetBird stack on askari (M4)' (M4b unbuilt), and Consequences still lists 'a custom Caddy image must be built/pushed/kept current' as a present obligation.", "suggested_fix": "Scope the custom-image obligation (§2, Consequences) to the deferred Phase-2 DNS-01 path; soften §3 to reflect that M4a ships a test vhost and the NetBird front-end is M4b. Report (touches decision substance).", "tag": "new", "auto_fixable": false},
{"id": "O23", "dimension": "consistency", "severity": "low", "location": "docs/decisions/001-architecture.md:50 + 016-mesh-vpn.md:87 ↔ docs/ROADMAP.md:116", "description": "The future NetBird service role is named 'netbird_coordinator' in ADR-001:50 + ADR-016:87 (coordinator framing also in STATUS), but ROADMAP M4b:116 calls it 'the netbird service role'. make new-role creates one directory name; the committed names will mismatch the actual role at build time. (The M4b plan at docs/superpowers/plans/2026-06-14-m4b-netbird.md also uses 'netbird'.)", "suggested_fix": "Settle one role name and align ADR-001/016, ROADMAP, and the M4b plan before scaffolding.", "tag": "new", "auto_fixable": false},
{"id": "O24", "dimension": "consistency", "severity": "low", "location": "docs/decisions/024-reverse-proxy.md:22 ↔ docs/ROADMAP.md:71", "description": "ADR-024 describes the M1 ACME DNS-01 wildcard as '*.boma.<domain>' (infra subdomain), while ROADMAP:71 specifies '*.<boma-domain>' (apex). Different name spaces — the cert's actual SAN coverage for unexposed services is ambiguous across the two docs.", "suggested_fix": "Align the wildcard scope (decide *.wingu.me vs *.boma.wingu.me vs both) and state it identically in ADR-024 and ROADMAP.", "tag": "new", "auto_fixable": false},
{"id": "O25", "dimension": "consistency", "severity": "low", "location": "roles/reverse_proxy/molecule/default/verify.yml:11,22; roles/public_dns/molecule/default/verify.yml:12", "description": "Molecule verify tasks use tags: [verify], which is not in the tests/tags.yml vocabulary (concerns/special/opt_ins/playbooks). check-tags.py exempts molecule/ paths so the linter doesn't flag it, and 4 roles use this de-facto convention — but it's an out-of-vocabulary tag the ADR-019 standard doesn't sanction.", "suggested_fix": "Either drop the tags from molecule verify tasks (the linter ignores molecule anyway) or add 'verify' as a sanctioned testing-only tag in tests/tags.yml with an ADR-019 note. Repo-wide convention call.", "tag": "new", "auto_fixable": false},
{"id": "O26", "dimension": "consistency", "severity": "low", "location": "roles/reverse_proxy/templates/Caddyfile.j2:1; docker-compose.yml.j2:1", "description": "Neither rendered template carries an {{ ansible_managed }} header, though ADR-024 §1.2 cites 'one ansible_managed header' as a Caddy advantage. (No template in the repo currently uses ansible_managed — consistent with current practice but inconsistent with the ADR's stated intent.)", "suggested_fix": "Add a commented '# {{ ansible_managed }}' header to both templates (and ideally adopt the convention repo-wide).", "tag": "new", "auto_fixable": false},
{"id": "O27", "dimension": "consistency", "severity": "low", "location": "inventories/production/group_vars/all/reverse_proxy.yml", "description": "reverse_proxy production vars live in group_vars/all/ (every host) though the role only runs on offsite_hosts via offsite.yml; CLAUDE.md establishes an offsite_hosts/ group_vars dir for askari-specific config, which doesn't exist on disk. Harmless today (only askari imports the role) but broader scope than intended.", "suggested_fix": "Consider moving reverse_proxy.yml (and the offsite firewall opens) to group_vars/offsite_hosts/ for scope clarity, or leave if intentionally global. Judgement call.", "tag": "new", "auto_fixable": false},
{"id": "O28", "dimension": "drift", "severity": "low", "location": "scripts/capacity-scan.py:133", "description": "capacity-scan.py cross-checks workload hostnames only against inventories/<env>/hosts.yml. askari lives in inventories/production/offsite.yml, not hosts.yml, so the drift cross-check never sees it. Minor (capacity is intent-based today) but a latent gap as offsite hosts grow.", "suggested_fix": "Also read offsite.yml (or glob inventories/<env>/*.yml host files) so offsite_hosts are included.", "tag": "new", "auto_fixable": false},
{"id": "O29", "dimension": "consistency", "severity": "low", "location": "inventories/production/offsite.yml:1-16 ↔ inventories/production/hosts.yml:7-16", "description": "offsite.yml (generated by tf-inventory-offsite) re-declares control/docker_hosts/proxmox_hosts with empty host maps because tf_to_inventory.py always emits all four VALID_GROUPS — duplicating groups in hosts.yml in the same inventory dir. Ansible merges them harmlessly, but the duplication/merge is undocumented.", "suggested_fix": "Document in inventories/README.md that offsite.yml is a second generated inventory file merged with hosts.yml, or have tf_to_inventory.py emit only non-empty groups for offsite. Leave as-is if intended; just document.", "tag": "new", "auto_fixable": false}
],
"prior_resolved": [
{"id": "O1@2026-06-11", "description": "make lint RED on main (site.yml imported nonexistent docker_host role)", "status": "resolved — docker_host scaffolded (03d33f8) then built (456c27d); make lint green this run."},
{"id": "O10@2026-06-11", "description": "README ADR list stopped early (recurring)", "status": "resolved — auto-fixed this run (AF11), extended through ADR-024."},
{"id": "O17@2026-06-11", "description": "empty handlers/main.yml scaffold artifacts in base/dev_env", "status": "resolved (accepted) — treated as an intentional make new-role scaffold convention; not re-raised."},
{"id": "O2,O3,O4,O5,O6,O7,O8,O9,O11,O12,O13,O14,O15,O16,O18@2026-06-11", "description": "ADR-004 backup scope; ADR-004 ACCESS/BACKUP table; CAPABILITIES nvim/tmux; ADR-002 upgrade caveat; hosts.yml offsite_hosts; new-host Part E; dev_env set_fact tag; ubongo subnet; ADR section order; ADR-007 example; .zshrc rclone/direnv; oh_my_posh config tag; tfvars pve01; See-also vs Related; docs/inventories README narrowness", "status": "still open — carried forward as O2,O5,O7,O6,O9,O15,O8,O10,O4,O18/O19,O16,O17,O11,O13,O14 respectively (renumbered)."}
]
}

@ -0,0 +1,157 @@
# Repo review — 2026-06-14
- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
- **Previous run:** 2026-06-11 (`67f2aba`)
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).
## Summary
A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.
### Counts
| Dimension | High | Medium | Low | Total |
|---|---|---|---|---|
| Cruft / staleness | 0 | 0 | 0 | 0 |
| Design conformance | 1 | 2 | 2 | 5 |
| Consistency & intent | 2 | 2 | 9 | 13 |
| Docs-vs-reality drift | 1 | 4 | 5 | 10 |
| **Open total** | **4** | **8** | **16** | **29** |
Plus **11 auto-fixes applied** (3 high, 5 medium, 3 low).
### Phase-0 scan
`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
not a TODO). Details in the findings JSON.
### Deferral checklist
All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
harness home, classification home, staging-first) confirmed **genuinely still open**:
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).
## Auto-fixes applied
All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
descriptions) — no logic, variable, secret, or task-order changes.
| ID | Sev | File | What |
|---|---|---|---|
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1` → `cx23@hel1` (actual server) |
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)` → `cx23 (x86) or cax11 (ARM)` |
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik` → `Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |
## Open findings (prioritised)
### High
- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
"Built + applied" rows and the real tasks file. Reword the scaffold row + closing
paragraph (left for the operator — STATUS is the ground-truth doc).
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
"not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
ADR-004 at ADR-022 (ADR↔ADR design decision — report).
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
still say Traefik too. Either finish the rename or soften ADR-024's claim.
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
claims ADRs 001–018 were restructured to lead with `## Status`, but 016/017/018 still
open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.
### Medium
- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
`upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
"confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
carve-out (not a token swap, so left from AF10).
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
`set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
`--tags users|config` run breaks (base guards this; dev_env doesn't).
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
ubongo. Make the header honest.
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
`pve0`; verify the real node name before changing.
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt — control-
node external-API role, not a host service.)
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
`boma.wingu.me` used inconsistently across the doc set after M1; state the transition
in one place.
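
O10's subnet claim can be checked mechanically. The following is a minimal sketch, assuming the srv and mgmt ranges quoted for ADR-007 in the 2026-06-11 review (10.20.0.0/24 and 10.10.0.0/24) are still the planned ones; `containing_subnets` is a hypothetical helper for illustration, not something that exists in the repo.

```python
import ipaddress

# Planned subnets as quoted in the 2026-06-11 review (assumed unchanged in ADR-007).
ADR_007_SUBNETS = {
    "srv": ipaddress.ip_network("10.20.0.0/24"),
    "mgmt": ipaddress.ip_network("10.10.0.0/24"),
}


def containing_subnets(addr: str) -> list[str]:
    """Return the names of the planned subnets that contain addr."""
    ip = ipaddress.ip_address(addr)
    return [name for name, net in ADR_007_SUBNETS.items() if ip in net]


# ubongo's current address falls in neither range, which is the finding.
print(containing_subnets("10.20.10.151"))  # prints []
```

A check of this shape could sit next to the inventory tests so that an undocumented address shows up as a failing assertion rather than a review finding.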
### Low
O13 (See-also vs `## Related` in ADR-012/013/015/016/017/018 — recurring), O14
(docs/README + inventories/README narrow enumerations — recurring), O16 (.zshrc rclone
alias + unguarded direnv hook — recurring), O17 (oh_my_posh zen.toml tasks missing
`config` tag — recurring), O19 (ADR-009:122 `nyumbani` example after retirement —
recurring), O20 (ROADMAP M2 CAX11/ARM vs cx23/x86 — new), O21 (ADR-020 "ports will be
added in M4" stale; already opened in M4a — new), O22 (ADR-024 body still asserts custom-
image obligation contradicting its revised Status — new), O23 (`netbird_coordinator` vs
`netbird` role name across ADRs/ROADMAP/plan — new), O24 (`*.boma.<domain>` vs
`*.<boma-domain>` wildcard scope ADR-024 vs ROADMAP — new), O25 (`tags: [verify]` out of
the ADR-019 vocabulary in molecule verify — new), O26 (reverse_proxy templates lack
`ansible_managed` header — new), O27 (reverse_proxy vars in `group_vars/all/` not
`offsite_hosts/` — new), O28 (capacity-scan.py ignores `offsite.yml` — new), O29
(offsite.yml duplicates empty groups from hosts.yml, undocumented merge — new).
Full detail + suggested fixes in `2026-06-14-findings.json`.
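
For O29, one of the two suggested fixes is to have the generator emit only non-empty groups for the offsite file. A rough sketch of that option follows. The group names come from the inventories README; `build_groups` and its `emit_empty` flag are hypothetical, and the real structure of `tf_to_inventory.py` may differ.

```python
# The four generator groups named in the review (VALID_GROUPS in the finding).
VALID_GROUPS = ("control", "docker_hosts", "proxmox_hosts", "offsite_hosts")


def build_groups(hosts_by_group, emit_empty=True):
    """Build the inventory group mapping, optionally skipping empty groups.

    hosts_by_group maps a group name to a {hostname: hostvars} dict.
    """
    groups = {}
    for group in VALID_GROUPS:
        hosts = hosts_by_group.get(group, {})
        if hosts or emit_empty:
            groups[group] = {"hosts": dict(hosts)}
    return groups


offsite = {"offsite_hosts": {"askari": {}}}

# Current behaviour as described in O29: all four groups, three of them empty.
print(sorted(build_groups(offsite)))
# Alternative: only the group that actually has hosts.
print(sorted(build_groups(offsite, emit_empty=False)))
```

Keeping the current behaviour as the default means the stricter output is opt-in, so the main `hosts.yml` generation would not change shape.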
## Themes worth a deliberate pass
1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
sweep across ADR-008/017/019 closes it.
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
mirror base's `apply: tags:` guard.
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.
## Follow-up prompt
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
> with its own revised HTTP-01 Status note.
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
> ACCESS.md + BACKUP.md rows to its service-role file table.
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
> `--tags config` run still resolves the home dir.
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
> batch; commit per CLAUDE.md git conventions.


@ -1,161 +1,157 @@
# Repo review — 2026-06-11
# Repo review — 2026-06-14
- **Reviewed commit:** `67f2aba` (main)
- **Mode:** on-demand (interactive)
- **Previous run:** `2026-06-05` (commit `f566fd1`)
- **Process:** Phase 0 deterministic scan → 5 parallel shard reviewers + 1 cross-cutting
reviewer → synthesis, deferral-checklist resolution, prior-run diff → safe auto-fixes.
- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
- **Previous run:** 2026-06-11 (`67f2aba`)
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).
## Summary
| | High | Medium | Low | Total |
A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.
### Counts
| Dimension | High | Medium | Low | Total |
|---|---|---|---|---|
| **Auto-fixed** | 1 | 2 | 2 | 5 |
| **Open (report-only)** | 2 | 7 | 9 | 18 |
| Cruft / staleness | 0 | 0 | 0 | 0 |
| Design conformance | 1 | 2 | 2 | 5 |
| Consistency & intent | 2 | 2 | 9 | 13 |
| Docs-vs-reality drift | 1 | 4 | 5 | 10 |
| **Open total** | **4** | **8** | **16** | **29** |
By dimension (open): conformance 3 · consistency 8 · drift 6 · cruft 1.
Plus **11 auto-fixes applied** (3 high, 5 medium, 3 low).
**Headline:** `make lint` is currently **red on `main`**: `playbooks/site.yml` imports the
not-yet-existent `docker_host` role (confirmed at clean HEAD, unrelated to this run's
edits). That breaks CLAUDE.md's "main must always work" / "Never skip lint" contract and
is the top open finding (O1). The bulk of the rest is documentation drift created by the
recent `base` (firewall) + `dev_env` build wave: several READMEs/playbook notes still
described the roles as "empty / not built." Those were the safe auto-fixes.
### Phase-0 scan
**Good news:** 7 of the 12 open findings from the 2026-06-05 run are confirmed resolved
(VERIFY.md row + runbook step, backend.tf relabel, askari group naming, ADR-014
reproducibility, CAPABILITIES Level-4 row, TODO 3.10). The deferral checklist is clean —
**0 stale-deferred** this run (the recurring miss logged in FRICTION.md did not recur).
`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
not a TODO). Details in the findings JSON.
### Deferral checklist
All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
harness home, classification home, staging-first) confirmed **genuinely still open**:
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).
## Auto-fixes applied
Markdown / YAML-comment only; no runtime behaviour, logic, vars, or task order touched.
All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
descriptions) — no logic, variable, secret, or task-order changes.
| ID | Sev | File(s) | What |
| ID | Sev | File | What |
|---|---|---|---|
| AF1 | high | `roles/README.md` | Rewrote stale "base & docker_host are empty untracked dirs, site.yml would fail on a clean clone" → base partially built (firewall), docker_host not yet created, dev_env built+applied. |
| AF2 | med | `playbooks/site.yml` | NOTE no longer claims base is unbuilt / "fails on a clean clone"; now reflects firewall-only base + missing docker_host. |
| AF3 | med | `playbooks/README.md` | Dropped the "currently a no-op" claim; added a `workstation.yml` bullet. |
| AF4 | low | `README.md` | Added `docs/access/`, `docs/backup/`, `roles/dev_env/`, `playbooks/workstation.yml` to the project-structure tree. |
| AF5 | low | `docs/decisions/016-mesh-vpn.md`, `docs/decisions/020-firewall.md` | Added the reciprocal `ADR-021` cross-reference that ADR-021 says it amended in. |
> `make lint` was re-run after the fixes: it fails **only** on the pre-existing
> `docker_host` syntax-check (O1), identical to clean HEAD. No auto-fix introduced or
> changed any lint result, so none were reverted.
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1` → `cx23@hel1` (actual server) |
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)` → `cx23 (x86) or cax11 (ARM)` |
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik` → `Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |
## Open findings (prioritised)
### High
- **O1 — `make lint` is red on `main`** · `playbooks/site.yml:18` · *conformance*
site.yml imports the `docker_host` role, which does not exist, so ansible-lint's
syntax-check fails on a clean checkout. Violates "main must always work" + "Never skip
lint" (pre-commit would block every commit unless bypassed).
*Fix (judgement):* guard/skip the docker_host play until the role exists, scaffold a
stub via `make new-role NAME=docker_host`, or exclude site.yml from syntax-check until
built — and record the choice. **new**
- **O2 — ADR-004 ↔ ADR-022 backup-scope contradiction** ·
`docs/decisions/004-docker-model.md:105` · *consistency*
ADR-004 says "Backup strategy is defined separately (not in scope of this repo)";
ADR-022 defines a full in-repo backup strategy. Per ADR-023 (no silent reversals),
update ADR-004's line to defer to ADR-022 and cross-link. Design decision — report. **new**
- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
"Built + applied" rows and the real tasks file. Reword the scaffold row + closing
paragraph (left for the operator — STATUS is the ground-truth doc).
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
"not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
ADR-004 at ADR-022 (ADR↔ADR design decision — report).
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
still say Traefik too. Either finish the rename or soften ADR-024's claim.
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
claims ADRs 001–018 were restructured to lead with `## Status`, but 016/017/018 still
open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.
### Medium
- **O3 — ADR-004 service-role file table missing ACCESS.md + BACKUP.md** ·
`docs/decisions/004-docker-model.md:48` · *consistency* — CLAUDE.md + ADR-021/022 now
mandate both for service roles; the canonical table lists only SECURITY.md + VERIFY.md.
(Prior "missing VERIFY.md" is resolved; this is the next evolution.) **new**
- **O4 — CAPABILITIES nvim/tmux exclusion ↔ dev_env built** ·
`docs/CAPABILITIES.md:149` · *consistency* — listed as a confirmed exclusion
("server-only"), but `dev_env` (built+applied to ubongo) installs exactly that. Carve
out the control-node/AI-worker exception (ADR-015). **new**
- **O5 — phantom `make deploy PLAYBOOK=upgrade`** · `docs/decisions/002-security.md:82` ·
*drift* — no `upgrade.yml` exists; ADR-011 is unbuilt. Add a "(planned)" caveat. **new**
- **O6 — hosts.yml stubs missing `offsite_hosts` group** ·
`inventories/{production,staging}/hosts.yml` · *drift* — the generator emits it (one of
four VALID_GROUPS); the hand-stubs predate the standard. Regenerate via
`make tf-inventory` (don't hand-edit). (Prior "askari group unnamed" is resolved.) **new**
- **O7 — new-host runbook Part E vs ubongo reality** · `docs/runbooks/new-host.md:81-130`
· *drift* — instructs creating an `ansible` user / `ssh ansible@`; STATUS records ubongo
is managed as `sjat`, ansible-user bootstrap pending. **new**
- **O8 — dev_env untagged `set_fact` under tagged consumers** ·
`roles/dev_env/tasks/per_user.yml:2-9` · *conformance* — partial `--tags users|config`
runs skip the `dev_env__home` set_fact and fail. Tag the preflight `[users, config]` or
`always`. **new**
- **O9 — ubongo address outside ADR-007 subnets** · `STATUS.md:31 ↔ 007-network.md` ·
*drift* — 10.20.10.151 is in neither srv (10.20.0.0/24) nor mgmt (10.10.0.0/24);
`base__firewall_control_addr` depends on it. Already a tracked follow-up in the
ubongo-build plan. Reconcile address or ADR-007. **new**
- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
`upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
"confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
carve-out (not a token swap, so left from AF10).
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
`set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
`--tags users|config` run breaks (base guards this; dev_env doesn't).
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
ubongo. Make the header honest.
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
`pve0`; verify the real node name before changing.
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt — control-
node external-API role, not a host service.)
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
`boma.wingu.me` used inconsistently across the doc set after M1; state the transition
in one place.
### Low
- **O10 — README ADR list stops at 017** · `README.md:104` · *drift* — 018–023 exist;
extend or trim to a pointer. **recurring** (evolved from prior O3)
- **O11 — ADR section-order vs ADR-023 §2** · `008:3, 014:98, 016:91, 017:66, 018:73` ·
*conformance* — Status-not-first / Decision-late; passes lint (order not gated) but not
the standard. Presentational restructure. **new**
- **O12 — ADR-007 FQDN convention vs its own example** · `007-network.md:160` ·
*consistency*: `<service>.baobab.band` vs `forgejo.nyumbani.baobab.band`; ties to open
TODO 4 (split-horizon). **new**
- **O13 — dev_env `.zshrc` heritage carryovers** ·
`roles/dev_env/files/dotfiles/zsh/.zshrc:28,55` · *consistency* — hard-coded
`/usr/bin/rclone` alias (not installed by the role) + unguarded `direnv` hook. **new**
- **O14 — oh_my_posh config tasks untagged** · `roles/dev_env/tasks/oh_my_posh.yml:15-26`
· *consistency* — inconsistent `config` tagging vs per_user.yml. **new**
- **O15 — tfvars.example `pve01` vs ADR-007 `pve0`** ·
`terraform/environments/*/terraform.tfvars.example:9` · *consistency* — verify the real
node name, then align. **new**
- **O16 — ADR-013/015 "See also:" vs `## Related`** · *consistency* — stylistic; convert
for uniformity. **new**
- **O17 — empty scaffold `handlers/main.yml`** · `roles/{dev_env,base}/handlers/main.yml`
· *cruft* — confirm convention or delete. **new**
- **O18 — docs/README.md + inventories/README.md narrower than reality** · *consistency*
— omit several real subdirs / the offsite_hosts group. **new**
O13 (See-also vs `## Related` in ADR-012/013/015/016/017/018 — recurring), O14
(docs/README + inventories/README narrow enumerations — recurring), O16 (.zshrc rclone
alias + unguarded direnv hook — recurring), O17 (oh_my_posh zen.toml tasks missing
`config` tag — recurring), O19 (ADR-009:122 `nyumbani` example after retirement —
recurring), O20 (ROADMAP M2 CAX11/ARM vs cx23/x86 — new), O21 (ADR-020 "ports will be
added in M4" stale; already opened in M4a — new), O22 (ADR-024 body still asserts custom-
image obligation contradicting its revised Status — new), O23 (`netbird_coordinator` vs
`netbird` role name across ADRs/ROADMAP/plan — new), O24 (`*.boma.<domain>` vs
`*.<boma-domain>` wildcard scope ADR-024 vs ROADMAP — new), O25 (`tags: [verify]` out of
the ADR-019 vocabulary in molecule verify — new), O26 (reverse_proxy templates lack
`ansible_managed` header — new), O27 (reverse_proxy vars in `group_vars/all/` not
`offsite_hosts/` — new), O28 (capacity-scan.py ignores `offsite.yml` — new), O29
(offsite.yml duplicates empty groups from hosts.yml, undocumented merge — new).
## Deferral checklist (Phase 2)
Full detail + suggested fixes in `2026-06-14-findings.json`.
| Source | Items | Verdict |
|---|---|---|
| ADR-011 Deferred/Open | 5 (snapshot driver, cadences, health-check harness home, classification home, staging-first) | **All genuinely still open** — cross-checked against later ADRs + TODO 16. None silently resolved. |
| ADR-015 Deferred | #1 mesh VPN, #2 service-UI, #3 build | **All marked RESOLVED in place** (ADR-016 / ADR-017 / 2026-06-11 build). |
## Themes worth a deliberate pass
**Stale-deferred found: 0.** The recurring FRICTION.md miss did not recur this run.
1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
sweep across ADR-008/017/019 closes it.
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
mirror base's `apply: tags:` guard.
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.
## Scan false positives (folded in, not actionable)
## Follow-up prompt
- `broken-path-ref STATUS.md:38` — STATUS legitimately documents `roles/docker_host/` as
"Not in git." (intentional reference to an unbuilt role).
- `broken-adr-ref` ×4 — `ADR-099`/`ADR-100` in `tests/test_repo_scan.py` and the
adr-structure plan are intentional **test fixtures** for the scanner's bad-ref check.
- `marker` ×14 — all in `docs/superpowers/{plans,specs}/*` (historical commit-message
TODOs / plan steps) or prose discussing "over-tagging" as a concept. Not cruft.
## Prior-run diff (vs 2026-06-05)
**Resolved (7):** O1 VERIFY.md row · O2 new-role VERIFY step · O4 askari group naming ·
O5 backend.tf relabel · O6 ADR-014 reproducibility · O11 CAPABILITIES Level-4 row ·
O12 TODO 3.10. **Partial:** O3 (docs tree fixed in AF4; ADR-list carried as O10).
**Not re-detected (verify next run):** O7–O10 (ADR-011 still Proposed).
## Follow-up prompt (copy-paste)
> Act on the open findings from `docs/reviews/2026-06-11-review.md`. Priority order:
> 1. **O1 (high):** `make lint` is red on `main`: `playbooks/site.yml` imports the
> non-existent `docker_host` role. Pick an interim posture (guard/skip the play, or
> `make new-role NAME=docker_host` to scaffold a stub, or exclude from syntax-check
> until built) so the trunk lints clean again, and record the choice in STATUS.md.
> 2. **O2 (high):** Resolve the ADR-004 ↔ ADR-022 backup-scope contradiction —
> update ADR-004's "not in scope of this repo" line to defer to ADR-022 (per ADR-023's
> no-silent-reversal rule) and cross-link.
> 3. **O3:** Add ACCESS.md + BACKUP.md rows to ADR-004's service-role file table.
> 4. **O4:** Reconcile CAPABILITIES' nvim/tmux exclusion with the built `dev_env` role
> (carve out the ubongo control-node exception).
> 5. **O8 (conformance):** Tag the `dev_env__home` preflight `set_fact` so partial
> `--tags users|config` runs don't fail.
> 6. **O6 / O9:** Regenerate the inventory stubs to include `offsite_hosts`; reconcile
> ubongo's 10.20.10.151 against ADR-007's subnets (or amend ADR-007).
> 7. Sweep the low-severity doc items (O5 caveat, O7 runbook, O10 ADR list, O11 ADR
> section order, O12–O18) as a single docs-hygiene batch.
> Run `make lint` before committing; commit per CLAUDE.md git conventions.
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
> with its own revised HTTP-01 Status note.
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
> ACCESS.md + BACKUP.md rows to its service-role file table.
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
> `--tags config` run still resolves the home dir.
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
> batch; commit per CLAUDE.md git conventions.


@ -118,8 +118,14 @@ Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
machine outside the cluster — not a Proxmox guest. It is the **one** host
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
> **Current state (STATUS.md):** `ubongo` is today managed as the operator account
> `sjat` (`group_vars/control` sets `ansible_user: sjat`); it has **no** dedicated
> `ansible` service user yet. The dedicated-`ansible`-user bootstrap (step 2) is a
> **pending** item. Steps below describe the intended end state.
1. Install Debian 13 on the physical box by hand (no template to clone).
2. Create the `ansible` user and install its SSH public key.
2. Create the `ansible` user and install its SSH public key. *(Pending for `ubongo`,
currently managed as `sjat`; see the note above.)*
3. Set up the Ansible environment on it:
```bash
git clone <repo> ~/ansible


@ -2,7 +2,7 @@
> **For agentic workers:** REQUIRED SUB-SKILL: superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use `- [ ]` checkboxes.
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird_coordinator`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
**Architecture:** NetBird's own `configure.sh` generates the canonical compose + config for a pinned version; boma **captures that reference once and translates it into role templates** (ADR-004/013 — don't run their imperative script in production, render from templates). Runs in **external-reverse-proxy mode** (no bundled Traefik); Caddy adds a `netbird.askari.wingu.me` route. Secrets (datastore encryption key, TURN password, Dex secrets) are generated into vault; the setup key is stubbed `CHANGEME` for M5.
@ -23,9 +23,9 @@
---
### Task 2: `netbird` service role — templates
### Task 2: `netbird_coordinator` service role — templates
**Files:** `roles/netbird/` (scaffold via `make new-role NAME=netbird`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
**Files:** `roles/netbird_coordinator/` (scaffold via `make new-role NAME=netbird_coordinator`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
- [ ] **Step 1:** Translate the captured compose into `templates/docker-compose.yml.j2` — containers, the shared `boma` Docker network (so Caddy reaches them by name), **no host port mappings except what Caddy/Coturn need** (Coturn 3478/udp; everything else internal, Caddy fronts it). Pin image tags (ADR-011).
- [ ] **Step 2:** Translate `management.json`/`config.yaml` into a template — fill `Datadir`, `DataStoreEncryptionKey` (`{{ vault.netbird.datastore_key }}`), `HttpConfig` (public URL `https://netbird.askari.wingu.me`), `TURNConfig` (coturn host + `{{ vault.netbird.turn_password }}`), `Signal`, `Relay`, `Store` (sqlite), and the embedded-Dex IdP block (DeviceAuthorizationFlow/PKCE, `openid-configuration.json` URL).
@ -53,7 +53,7 @@
### Task 5: Service-role standard files (ADR-004, authored)
- [ ] **Step 1:** Author `roles/netbird/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
- [ ] **Step 1:** Author `roles/netbird_coordinator/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
- [ ] **Step 2:** `VERIFY.md` (copy the template; the `/verify-service` UI spec — run later when the playwright harness exists).
- [ ] **Step 3:** `ACCESS.md` (ADR-021; the dashboard/admin access + `access__*` intent).
- [ ] **Step 4:** `BACKUP.md` (ADR-022; the **datastore is stateful** → `backup__*` data; record that off-site backup is **pending `fisi`** — an accepted risk for now).
@ -63,7 +63,7 @@
### Task 6: Add netbird to the offsite playbook
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird` after `reverse_proxy` (role-name tag). `make lint`. Commit.
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird_coordinator` after `reverse_proxy` (role-name tag). `make lint`. Commit.
---
@ -80,7 +80,7 @@
### Task 8: Docs
- [ ] **Step 1:** STATUS — `netbird` coordinator built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
- [ ] **Step 1:** STATUS — `netbird_coordinator` built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
---


@ -6,6 +6,11 @@ hold per-group and per-host configuration.
- `hosts.yml` is **generated** from Terraform outputs by `make tf-inventory` — do not
hand-edit. The control node is the one manual exception.
- `offsite.yml` (in `production/`) is a **second** generated inventory file, written by
`make tf-inventory-offsite` from the offsite Terraform env; it holds the `offsite_hosts`
group (`askari`). Ansible merges it with `hosts.yml`, so both can declare the same group
names harmlessly (the offsite generator emits all four groups, most empty).
- Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
- Terraform→inventory data flow and the data contract: **ADR-009**.
- Addressing conventions (subnets, ranges): **ADR-007**.
- Layout and host groups: see CLAUDE.md ("Inventory structure").
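
The merge behaviour described above can be pictured with a small Python model. This is only a sketch under stated assumptions: `merge_inventories` is a hypothetical helper, the sample group contents are illustrative, and real Ansible merging also covers variables and nested children, which this ignores.

```python
def merge_inventories(*sources):
    """Union group -> hosts mappings from several inventory sources.

    Each source is a dict of group name -> set of host names (a simplified,
    hypothetical stand-in for a parsed inventory file).
    """
    merged = {}
    for source in sources:
        for group, hosts in source.items():
            # An empty group in one file never removes hosts declared in another.
            merged.setdefault(group, set()).update(hosts)
    return merged


# Illustrative data only: both files may declare the same group names.
hosts_yml = {"control": {"ubongo"}, "docker_hosts": set(), "proxmox_hosts": set()}
offsite_yml = {
    "control": set(),
    "docker_hosts": set(),
    "proxmox_hosts": set(),
    "offsite_hosts": {"askari"},
}

merged = merge_inventories(hosts_yml, offsite_yml)
```

In this model the empty `control` group emitted by the offsite file leaves `ubongo` in place, which is the sense in which duplicate group names are harmless.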


@ -13,8 +13,8 @@ public_dns__records:
# askari (off-site host, TF-provisioned M2) — public A so it's reachable by name +
# for future ACME on *.askari.wingu.me. Mesh/LAN-only home services never appear here.
- {record: askari, type: A, values: ["77.42.120.136"], ttl: 1800}
# Wildcard for askari's services (test/netbird/...) → same host; Caddy gets a
# *.askari.wingu.me cert via DNS-01 (M4a).
# Wildcard for askari's services (test/netbird/...) → same host; Caddy gets
# per-host certs via ACME HTTP-01 (M4a).
- {record: "*.askari", type: A, values: ["77.42.120.136"], ttl: 1800}
# Absent — Gandi's auto-seeded defaults we don't want (purged once, idempotent thereafter).


@ -39,4 +39,4 @@ services__base_dir: /opt/services
base__unattended_upgrades_enabled: true
# Management plane — activates the dormant ssh-from-control firewall rule
base__firewall_control_addr: "10.20.10.151" # ubongo (control node) LAN address — ADR-021 ssh-from-control source
base__firewall_control_addr: "10.20.10.151" # ubongo — legacy V4 addr (ADR-007); ADR-021 ssh-from-control


@ -4,10 +4,15 @@ Top-level orchestration playbooks. No inline vars — configuration comes from
`group_vars/` / `host_vars/` (see CLAUDE.md).
- `site.yml` — full standard state: applies `base` to all hosts and `docker_host`
to docker hosts. **Note:** `base` is only partially built (its `firewall` concern)
and `docker_host` is scaffolded with no tasks yet, so this is incomplete — see `STATUS.md`.
to docker hosts. **Note:** `base` is only partially built (its `firewall` +
`hardening` concerns) and the cluster has no docker hosts yet, so this is
incomplete — see `STATUS.md`.
- `workstation.yml` — applies the `dev_env` role (interactive developer environment)
to the `control` group; built and applied to `ubongo` (see `STATUS.md`).
- `dns.yml` — manages the public DNS zone (wingu.me) at Gandi LiveDNS via the
`public_dns` role; runs from the control node against an external API.
- `offsite.yml` — off-site hosts (`askari`): `docker_host` (Docker engine) +
`reverse_proxy` (Caddy). NetBird coordinator appended in M4b.
- `bootstrap.yml` — first-run setup for a host that may not have Python yet;
self-contained (does not depend on the roles).


@ -8,8 +8,9 @@ Each role must have: a `molecule/default/` scenario (Debian 13), a populated
`README.md`, and a filled-in `meta/main.yml`. Conventions: CLAUDE.md and
`docs/runbooks/new-role.md`.
Current state: `base` is **partially built** — its `firewall` concern (nftables) is
implemented and tested; the other concerns (SSH hardening, fail2ban, auditd, packages,
users) are not yet built. `docker_host` is **scaffolded but has no tasks yet**. `dev_env` (interactive
developer environment) is built and applied. See `STATUS.md` for the authoritative
breakdown.
Current state: `base` is **partially built** — its `firewall` (nftables) and
`hardening` (SSH key-only + fail2ban) concerns are implemented, tested, and the
hardening concern is applied to `askari`; the remaining concerns (auditd, packages,
users) are not yet built. `docker_host` (Docker engine + Compose), `reverse_proxy`
(Caddy), `public_dns` (Gandi), and `dev_env` are built. See `STATUS.md` for the
authoritative breakdown.


@ -51,14 +51,9 @@
- name: Sshd drop-in present and config valid
ansible.builtin.command: sshd -t
changed_when: false
tags: [verify]
- name: PasswordAuthentication is disabled
ansible.builtin.command: grep -q '^PasswordAuthentication no' /etc/ssh/sshd_config.d/10-boma.conf
changed_when: false
tags: [verify]
- name: Fail2ban sshd jail configured
ansible.builtin.command: grep -q '^\[sshd\]' /etc/fail2ban/jail.d/sshd.local
changed_when: false
tags: [verify]


@ -25,7 +25,6 @@ alias ll="ls -lh"
alias la="ls -lha"
alias ..="cd .."
alias update="sudo apt update && sudo apt upgrade -y"
alias rclone="/usr/bin/rclone"
# Use neovim for vim/vi commands
alias vim='nvim'
@ -50,6 +49,5 @@ export PATH="$HOME/.local/bin:$HOME/bin:$PATH"
# Ensure USER is set (edge cases)
export USER=$(whoami)
# Extras from inventory
# Enable direnv for automatic virtualenv activation
eval "$(direnv hook zsh)"
# Enable direnv for automatic virtualenv activation (guarded — direnv may not be installed)
command -v direnv >/dev/null 2>&1 && eval "$(direnv hook zsh)"


@ -7,9 +7,38 @@
dev_env__users:
- tester
pre_tasks:
# `always` so the test user exists even under a partial `--tags` converge.
- name: Create a test user to receive the environment
ansible.builtin.user:
name: tester
create_home: true
tags: [always]
roles:
- role: dev_env
# Partial-tags regression guard (O8): apply only the `config` concern to a fresh user.
# The dev_env__home preflight is tagged `always`, so a config-only run must still resolve
# the home dir and stow the dotfiles. Run the true partial path with:
# molecule converge -- --tags config
# (a full `molecule test` runs every tag, which still exercises this play idempotently).
- name: Converge — config concern only, fresh user
hosts: all
become: true
gather_facts: true
vars:
dev_env__users:
- tagtester
pre_tasks:
# `always` so the test user exists even under a partial `--tags config` converge.
- name: Create a second test user for the config-only path
ansible.builtin.user:
name: tagtester
create_home: true
tags: [always]
tasks:
- name: Apply dev_env restricted to the config concern
ansible.builtin.include_role:
name: dev_env
apply:
tags: [config]
tags: [config]


@ -71,3 +71,18 @@
- dev_env__dots.results[3].stat.exists
- dev_env__dots.results[4].stat.exists
fail_msg: dotfiles not stowed or omz/tpm not cloned
# Partial-tags regression guard (O8): the config-only converge play provisioned
# `tagtester`. Its stowed .zshrc proves dev_env__home resolved (the `always` preflight)
# and stow (a `config` task) ran without the `users`/`packages` concerns.
- name: Stat the config-only user's stowed .zshrc
ansible.builtin.stat:
path: /home/tagtester/.zshrc
register: dev_env__tagtester_zshrc
- name: Assert the config concern alone resolved home and stowed dotfiles
ansible.builtin.assert:
that:
- dev_env__tagtester_zshrc.stat.exists
- dev_env__tagtester_zshrc.stat.islnk
fail_msg: config-only run did not resolve dev_env__home / stow dotfiles for tagtester


@ -7,21 +7,44 @@
cache_valid_time: 3600
tags: [packages]
# `apply: tags:` propagates the concern tag onto the INCLUDED tasks — without it a tag on
# a dynamic include_tasks only selects the include itself, not its (untagged) contents, so
# `--tags <concern>` would run nothing (Ansible gotcha; mirrors roles/base/tasks/main.yml).
- name: Install Neovim (pinned release)
ansible.builtin.include_tasks: neovim.yml
ansible.builtin.include_tasks:
file: neovim.yml
apply:
tags: [packages]
tags: [packages]
# Also reachable under `config`: oh_my_posh.yml renders /etc/oh-my-posh/zen.toml (a config
# task, tagged `config` within the file) alongside the binary install (`packages`). apply
# keeps `packages` on the untagged binary tasks; the include carries both so `--tags config`
# enters it and re-renders just the theme.
- name: Install oh-my-posh prompt (pinned release)
ansible.builtin.include_tasks: oh_my_posh.yml
tags: [packages]
ansible.builtin.include_tasks:
file: oh_my_posh.yml
apply:
tags: [packages]
tags: [packages, config]
- name: Install Node.js (pinned release)
ansible.builtin.include_tasks: nodejs.yml
ansible.builtin.include_tasks:
file: nodejs.yml
apply:
tags: [packages]
tags: [packages]
# per_user.yml resolves dev_env__home (tagged `always`, below) then runs both the `users`
# (login shell) and `config` (dotfiles/stow) concerns; tag + apply both so either
# `--tags users` or `--tags config` reaches in and the home-dir preflight always runs.
- name: Configure each developer user
ansible.builtin.include_tasks: per_user.yml
ansible.builtin.include_tasks:
file: per_user.yml
apply:
tags: [users, config]
loop: "{{ dev_env__users }}"
loop_control:
loop_var: dev_env__user
label: "{{ dev_env__user }}"
tags: [users, config]
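
The tag behaviour the comments in this file describe can be modelled in a few lines of Python. This is a simplified sketch, not Ansible's implementation: `selected` and `run_include` are hypothetical names, the task named "install binary" is an illustrative stand-in for the untagged tasks in `oh_my_posh.yml`, and `--skip-tags` is ignored.

```python
def selected(tags, run_tags):
    """A task runs when it carries a requested tag or the special `always` tag."""
    return "always" in tags or bool(set(tags) & set(run_tags))


def run_include(include_tags, apply_tags, children, run_tags):
    """Return the names of included tasks that would run under --tags run_tags.

    children is a list of (name, own_tags) pairs for the included file.
    """
    if not selected(include_tags, run_tags):
        return []  # the include itself is filtered out, so nothing is loaded
    # `apply: tags:` adds tags to every included task on top of its own tags.
    return [
        name
        for name, own_tags in children
        if selected(set(own_tags) | set(apply_tags), run_tags)
    ]


# Modelled on the oh_my_posh include: tags [packages, config], apply [packages].
oh_my_posh = [("install binary", []), ("deploy zen.toml", ["config"])]

config_only = run_include(["packages", "config"], ["packages"], oh_my_posh, ["config"])
packages_only = run_include(["packages", "config"], ["packages"], oh_my_posh, ["packages"])
# Without apply, the include is entered but its untagged contents are not selected.
no_apply = run_include(["packages"], [], [("install binary", [])], ["packages"])
```

Under this model a `--tags config` run enters the include and re-renders only the theme, while dropping `apply` leaves the untagged tasks unselected, which is the gotcha the first comment warns about.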


@ -17,9 +17,11 @@
path: /etc/oh-my-posh
state: directory
mode: "0755"
tags: [config]
- name: Oh-my-posh | Deploy zen.toml theme (system-wide)
ansible.builtin.copy:
src: oh-my-posh/zen.toml
dest: /etc/oh-my-posh/zen.toml
mode: "0644"
tags: [config]


@ -1,12 +1,17 @@
---
# `always`: dev_env__home must resolve on every entry into per_user.yml, including a
# partial `--tags users` or `--tags config` run — the dotfile/stow (config) and login-shell
# (users) tasks below all depend on it, so it must never be filtered out (ADR-019).
- name: Look up account for {{ dev_env__user }}
ansible.builtin.getent:
database: passwd
key: "{{ dev_env__user }}"
tags: [always]
- name: Resolve home directory for {{ dev_env__user }}
ansible.builtin.set_fact:
dev_env__home: "{{ getent_passwd[dev_env__user][4] }}"
tags: [always]
- name: Set login shell to zsh for {{ dev_env__user }}
ansible.builtin.user:


@ -8,10 +8,7 @@
ansible.builtin.command: docker --version
register: docker_version_output
changed_when: false
tags: [verify]
- name: Assert docker --version succeeded
ansible.builtin.assert:
that: docker_version_output.rc == 0
msg: "docker --version failed — Docker was not installed correctly"
tags: [verify]


@ -5,8 +5,8 @@ Manages boma's public DNS zone (**wingu.me**) at **Gandi LiveDNS** as code, via
name on purpose. Run from the control node: `make check/deploy PLAYBOOK=dns`.
Mesh/LAN-only by default — only deliberate public records live in the zone (the
anti-spoof baseline now; `askari` in M4). Everything else is reached over LAN/mesh and
never appears here.
anti-spoof baseline plus `askari.wingu.me` + the `*.askari` wildcard, applied in M4a).
Everything else is reached over LAN/mesh and never appears here.
## Data (in `group_vars/all/public_dns.yml`)


@ -9,4 +9,3 @@
- public_dns__domain == "example.test"
- public_dns__apply | bool == false
msg: "public_dns defaults/vars did not resolve as expected"
tags: [verify]


@ -0,0 +1,37 @@
# Access — reverse_proxy (Caddy)
Rendered from the role's `access__*` data (`roles/reverse_proxy/defaults/main.yml`) —
the source of truth that also drives `/check-access`. Regenerate from the data; edit the
data, not the tables. Host: `askari` (off-site Hetzner; ADR-007/016).
## Access paths
The documented ways in, by tier (rendered from `access__*`):
| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh askari` (over the NetBird mesh — pending M5; see notes) |
| secondary | LAN/WAN SSH from `ubongo` | `ssh ansible@askari` (from the control node; Hetzner firewall allows only ubongo's WAN) |
| — | container exec + compose | `docker compose -p reverse_proxy -f /opt/services/reverse_proxy/docker-compose.yml ps` / `… exec caddy sh` |
| — | logs | `docker logs caddy` now; Loki labels `{service: caddy}` once the ADR-018 pipeline lands |
| — | admin API | n/a — Caddy admin API bound to container localhost `:2019`, never exposed (`access__api.enabled: false`) |
## Break-glass
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
- **Hetzner rescue system + Cloud Console** (VNC) for `askari` — boot the rescue image
or attach the web console from the Hetzner Cloud panel if SSH is unreachable.
## Operational notes
- **Mesh not yet enrolled (M5).** Until `askari` joins the NetBird mesh, the `wt0`
primary path does not exist — the only SSH route is the secondary one (from `ubongo`'s
WAN IP, which the TF-managed Hetzner Cloud Firewall allowlists). Promote `wt0` to
primary once M5 lands.
- **Caddy wedged / bad config:** the Caddyfile is rendered read-only by Ansible; to
recover, fix `reverse_proxy__routes` in `group_vars` and re-run the role (it reloads
Caddy via the handler). To inspect live config: `docker exec caddy caddy validate
--config /etc/caddy/Caddyfile`.
- **Cert issuance failing:** check that port 80 is reachable from the internet (HTTP-01
needs it) and watch `docker logs caddy` for ACME errors before assuming a routing fault.


@ -0,0 +1,61 @@
# Security — reverse_proxy (Caddy)
## Exposure
- **Published ports:** `80/tcp` + `443/tcp` (HTTP→HTTPS redirect + TLS). Both are
declared in the `group_vars` firewall catalog as the askari `public_web` opens
(ADR-020); the Hetzner Cloud Firewall also opens 80/443 (and 3478 for NetBird).
Port 80 must stay open to the internet for the ACME HTTP-01 challenge.
- **Auth surface:** none of its own. Caddy is the TLS terminator and router; per-service
authentication (Authentik `forward_auth`) is added at each route in Phase 2 (ADR-024
§4). Today it fronts only a static `respond` test vhost and (M4b) the NetBird stack,
which carries its own auth.
- **Reachability:** public — askari is internet-facing. Caddy is the single public entry
point; upstreams sit on the internal `boma` Docker network and are reached by name, not
published directly.
- **Data sensitivity:** none persistent worth protecting — only ACME account keys +
issued certificates in the `caddy_data` volume, which are re-issuable (HTTP-01). No
user data, no secrets at rest. See backup record: `backup__state: false` (stateless).
## Checklist status
Each item from `docs/security/service-checklist.md`:
- [x] Secrets in vault; no default creds; nothing secret in git/images — ✅ n/a: HTTP-01
needs no credentials; the only config input is `reverse_proxy__acme_email` (not secret).
- [x] Non-root; no `privileged`/host-network unless justified; minimal mounts; caps
dropped — ⚠️ official `caddy:2` runs as root (to bind 80/443); no `privileged`, no host
network (bridge `boma`); mounts are the read-only Caddyfile + two named volumes. Root
inside the container is the upstream default; revisit if Caddy ships a rootless variant.
- [x] Ports declared in `group_vars`; behind reverse proxy + auth if exposed;
least-privilege inter-service reach — ✅ 80/443 in the catalog; Caddy *is* the proxy;
upstreams are not published, only reachable on the `boma` network.
- [x] Image pinned (tag/digest), update path known — ⚠️ pinned to the `caddy:2` major
tag (stateless tier, ADR-011/ADR-004), not a digest; refreshed deliberately and watched
by DIUN. Tighten to `tag@digest` if the proxy is reclassified as stateful.
- [x] Logs reviewable; backup/restore covered if stateful — ✅ stateless (no backup
needed); logs via `docker logs caddy` now, Loki labels declared for the ADR-018 pipeline.
## Service-specific hardening
- **HTTP-01 only, no DNS token:** vanilla `caddy:2`, no `caddy-dns/gandi` plugin and no
Gandi API token on the host — removes a credential and a custom-image supply chain
(ADR-024 revised Status).
- **Caddyfile is read-only** in the container (`:ro` mount); rendered solely by Ansible
from the `group_vars` route catalog — no dynamic label discovery, so no route exists
that wasn't declared (the reason Caddy was chosen over Traefik, ADR-024 §1).
- **Admin API not exposed:** Caddy's admin endpoint stays on container-localhost `:2019`;
never published, never in the firewall catalog (`access__api.enabled: false`).
- **Automatic HTTPS:** HTTP is redirected to HTTPS and modern TLS defaults are Caddy's
out-of-the-box behaviour (no manual cipher config needed).
## Residual / accepted risks
- **Container runs as root** — upstream `caddy:2` default (needs to bind low ports).
Rationale: official image, no rootless variant wired yet; blast radius limited to the
proxy container. Revisit: adopt a rootless Caddy image if upstream stabilises one.
- **Image pinned to a major tag, not a digest** — accepted for the stateless tier
(ADR-011). Revisit if the role gains state.
- **ACME re-issuance vs Let's Encrypt rate limits** — losing `caddy_data` triggers
re-issuance; rapid repeated rebuilds could hit LE rate limits. Acceptable for a handful
of askari hostnames; noted in the backup rationale.


@ -0,0 +1,44 @@
# Verify — reverse_proxy (Caddy)
`reverse_proxy` has no application UI of its own — it is the TLS terminator and router.
"Working" is verified at the HTTP/TLS layer (what `/verify-service` can drive with a
browser/HTTP client against the public hostnames it serves), not via an app login.
## Critical user journeys
1. **HTTPS serves with a valid cert** — request `https://<a host in
reverse_proxy__routes>` (e.g. `https://test.askari.wingu.me`) → 200 with a valid
Let's Encrypt certificate (trusted chain, CN/SAN matches the host, not expired).
2. **HTTP redirects to HTTPS** — request `http://<host>` → 308/301 redirect to the
`https://` URL (Caddy's automatic-HTTPS redirect).
3. **A `respond` route returns its static body** — the test vhost returns its configured
string with 200.
4. **An `upstream` route proxies through** — once a real upstream is registered (M4b
NetBird), `https://<host>` reaches the upstream's response, not a Caddy error page.
5. **An unknown host is not served a valid cert** — a hostname not in
`reverse_proxy__routes` does not get a certificate / is not routed (no accidental
catch-all).
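
Journey 2 above reduces to a check on a status code and a `Location` header. A minimal Python sketch of that check follows; `is_https_redirect` is a hypothetical helper, not part of the `/verify-service` harness, and it only inspects values an HTTP client has already fetched.

```python
from urllib.parse import urlsplit


def is_https_redirect(status, location, host):
    """True when a plain-HTTP response redirects to the https:// URL for host."""
    if status not in (301, 308):
        return False  # journey 2 accepts only a permanent redirect
    parts = urlsplit(location)
    return parts.scheme == "https" and parts.hostname == host


ok = is_https_redirect(308, "https://test.askari.wingu.me/", "test.askari.wingu.me")
not_redirected = is_https_redirect(200, "", "test.askari.wingu.me")
wrong_scheme = is_https_redirect(301, "http://test.askari.wingu.me/", "test.askari.wingu.me")
```

Keeping the check a pure function means it can be exercised without network access; the actual request would be made separately against a host listed in `reverse_proxy__routes`.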
## What good looks like
- The browser padlock shows a valid Let's Encrypt certificate for the requested host;
the SAN matches and the chain is trusted.
- `http://` visibly becomes `https://` in the address bar.
- The expected body (static `respond` text, or the upstream's page) renders.
## Not browser-verifiable
- Certificate *renewal* (60-day cadence) — confirm out of band via `docker logs caddy`
/ Loki, not a single browser session.
- Behaviour when port 80 is blocked (HTTP-01 would fail) — an infrastructure/firewall
check, route to the manual handoff.
- The deferred DNS-01 path for mesh/LAN-only services (Phase 2, ADR-024) — not yet live.
## Test data
Provisioned in the **staging** deploy (no Authentik user needed — there is no SSO on the
proxy itself):
- At least one `reverse_proxy__routes` entry with a public DNS A-record pointing at the
staging host, so HTTP-01 can complete. A static `respond` route is enough for journeys
1–3 and 5.


@ -4,3 +4,25 @@ reverse_proxy__base_dir: /opt/services/reverse_proxy
reverse_proxy__acme_email: admin@example.test
reverse_proxy__routes: [] # each: {host: x, upstream: "svc:port"} OR {host: x, respond: "text"}
reverse_proxy__manage: true # set false in Molecule to render without Docker
# access__*/backup__* are the ADR-021/022 CROSS-ROLE conventions — shared field names that
# render ACCESS.md/BACKUP.md and drive /check-access · /check-backup. They intentionally do
# NOT carry the reverse_proxy__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
# (ansible-lint's role-prefix rule has no per-prefix allowlist; keeping it enabled elsewhere).
# Operational-access record (ADR-021) — source of truth for ACCESS.md + /check-access.
access__service: reverse_proxy # noqa: var-naming[no-role-prefix]
access__compose_project: reverse_proxy # noqa: var-naming[no-role-prefix]
access__compose_path: "{{ reverse_proxy__base_dir }}/docker-compose.yml" # noqa: var-naming[no-role-prefix]
access__containers: [caddy] # noqa: var-naming[no-role-prefix]
access__log: # noqa: var-naming[no-role-prefix]
loki_labels: { service: caddy } # intent; Loki/Alloy pipeline is ADR-018 (pending)
access__api: # noqa: var-naming[no-role-prefix]
enabled: false
reason: "Caddy admin API bound to container localhost :2019; never exposed (ADR-020 catalog owns ports)"
# Backup contract (ADR-022). Stateless: Caddy's /data holds only ACME account keys +
# issued certs, which are re-requested automatically on restart via HTTP-01 (no manual
# steps). Residual risk: Let's Encrypt rate limits on rapid repeated re-issuance.
backup__service: reverse_proxy # noqa: var-naming[no-role-prefix]
backup__state: false # noqa: var-naming[no-role-prefix]
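
The `reverse_proxy__routes` shape documented above (each entry has a `host` plus either `upstream` or `respond`) can be illustrated with a small Python renderer. This is a sketch only: the role's real Jinja2 template is not shown in this diff, `render_route` is a hypothetical name, and the upstream value `dashboard:80` is made up for illustration.

```python
def render_route(route):
    """Render one route entry as a Caddyfile site block (illustrative only)."""
    host = route["host"]
    has_upstream = "upstream" in route
    has_respond = "respond" in route
    if has_upstream == has_respond:
        # Both present or both missing: the entry is ambiguous or empty.
        raise ValueError(f"route for {host!r} needs exactly one of 'upstream' or 'respond'")
    if has_upstream:
        directive = "reverse_proxy " + route["upstream"]
    else:
        directive = 'respond "' + route["respond"] + '" 200'
    return host + " {\n    " + directive + "\n}\n"


static_block = render_route({"host": "test.askari.wingu.me", "respond": "ok"})
proxy_block = render_route({"host": "netbird.askari.wingu.me", "upstream": "dashboard:80"})
```

Rejecting entries that carry both keys or neither keeps the rendered Caddyfile limited to routes that were declared unambiguously in `group_vars`.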


@ -2,8 +2,8 @@
galaxy_info:
author: sjat
description: >-
Caddy reverse proxy with ACME DNS-01 TLS via Gandi (ADR-024). Builds the
custom image on-host (caddy-dns/gandi) and manages it via Docker Compose.
Vanilla Caddy reverse proxy (ADR-024); TLS via ACME HTTP-01 for public
hosts. Routes from reverse_proxy__routes, managed via Docker Compose.
license: MIT
min_ansible_version: "2.17"
platforms:


@ -8,8 +8,6 @@
ansible.builtin.slurp:
src: /opt/services/reverse_proxy/Caddyfile
register: _caddyfile
tags: [verify]
- name: Assert Caddyfile exists and contains expected content
ansible.builtin.assert:
that:
@ -19,4 +17,3 @@
- "'respond \"ok\" 200' in (_caddyfile.content | b64decode)"
fail_msg: "Caddyfile is missing expected content"
success_msg: "Caddyfile rendered correctly"
tags: [verify]


@ -1,3 +1,4 @@
# {{ ansible_managed }}
{
email {{ reverse_proxy__acme_email }}
}


@ -1,3 +1,4 @@
# {{ ansible_managed }}
services:
caddy:
image: caddy:2


@ -14,6 +14,9 @@ exception: `check-vault.py` is a vault tool that needs the ansible venv (PyYAML
`rbw`. Wired as `vault_password_file` (ADR-002).
- `check-vault-encrypted.sh` — pre-commit guard: fails if a `vault.yml` holds
plaintext secrets.
- `check-tags.py` — enforces the closed tag vocabulary (`tests/tags.yml`) and that
each role import in a play carries its role-name tag. Invoked by `make lint`. See
**ADR-019**.
- `repo-scan.py` — Phase-0 deterministic scan for `/review-repo` (markers, broken
refs, unencrypted vaults, inventory).
- `capacity-scan.py` — deterministic capacity facts for `/capacity-review`: parses


@ -130,7 +130,9 @@ def known_hostnames(env):
hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
except (OSError, subprocess.CalledProcessError, ValueError):
pass
inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
# Point at the inventory DIRECTORY so every source file merges — hosts.yml AND
# offsite.yml (offsite_hosts / askari), which a bare hosts.yml would miss.
inv = os.path.join(REPO_ROOT, "inventories", env)
try:
hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
except (OSError, subprocess.CalledProcessError, ValueError):
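
For context on the hunk above: `ansible-inventory --list` emits JSON with a `_meta.hostvars` mapping plus one object per group, so pointing `-i` at the directory yields hosts from every source file in a single document. The sketch below shows one way host names could be collected from that JSON. It is an assumption-laden illustration: `hostnames_from_inventory` is a hypothetical function, not the script's own `parse_inventory_hostnames`, whose body is not shown here, and the sample data is invented.

```python
def hostnames_from_inventory(data):
    """Collect host names from parsed `ansible-inventory --list` JSON."""
    # _meta.hostvars is keyed by host name for every host with variables.
    hosts = set(data.get("_meta", {}).get("hostvars", {}))
    for name, group in data.items():
        if name != "_meta" and isinstance(group, dict):
            # Each group object may carry a "hosts" list.
            hosts.update(group.get("hosts", []))
    return hosts


# Invented sample resembling a merged hosts.yml + offsite.yml listing.
sample = {
    "_meta": {"hostvars": {"ubongo": {}}},
    "all": {"children": ["control", "offsite_hosts", "ungrouped"]},
    "control": {"hosts": ["ubongo"]},
    "offsite_hosts": {"hosts": ["askari"]},
}

found = hostnames_from_inventory(sample)
```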


@ -53,6 +53,8 @@ def main() -> None:
"---",
"# Generated by scripts/tf_to_inventory.py — do not edit manually.",
"# Regenerate with: make tf-inventory TF_ENV=<env>",
"# This OVERWRITES the file, including any manually-added control node (ubongo) —",
"# re-add it afterwards (the one hand-edit exception; docs/runbooks/new-host.md Part E).",
"",
"all:",
" children:",


@ -5,9 +5,13 @@ destroying Proxmox VMs. It writes no DNS records and configures nothing inside a
VM; Ansible owns all of that.
- `modules/proxmox_vm/` — reusable VM module (Proxmox only).
- `environments/{staging,production}/` — separate state per environment. Add a VM by
editing `local.vms` in that env's `main.tf`, then `make tf-plan` → `tf-apply` →
`tf-inventory`.
- `modules/hetzner_vm/` — reusable VM module (Hetzner Cloud: server + firewall +
SSH key + cloud-init).
- `environments/{staging,production}/` — separate state per environment (Proxmox).
Add a VM by editing `local.vms` in that env's `main.tf`, then `make tf-plan` →
`tf-apply` → `tf-inventory`. Not yet `terraform init`ed.
- `environments/offsite/` — the off-site Hetzner host (`askari`); the one
**applied** environment. Use `make tf-* TF_ENV=offsite` and `tf-inventory-offsite`.
Rationale: **ADR-006**. Handoff to Ansible: **ADR-009**. Secrets via `TF_VAR_*`
only — never in `.tfvars`. Not yet `terraform init`ed — see `STATUS.md`.
only — never in `.tfvars`. See `STATUS.md` for what is provisioned.


@ -1,4 +1,4 @@
# verified: hetznercloud/hcloud 1.65.0 · debian-13 image · cax11@hel1 · terraform-registry · 2026-06-14
# verified: hetznercloud/hcloud 1.65.0 · debian-13 image · cx23@hel1 · terraform-registry · 2026-06-14
terraform {
required_version = ">= 1.9"


@ -6,9 +6,9 @@
#
# State is local (see backend.tf) — no Forgejo backend credentials needed.
proxmox_endpoint = "https://pve01.baobab.band:8006/"
proxmox_endpoint = "https://pve0.boma.baobab.band:8006/"
proxmox_insecure = false
proxmox_node = "pve01"
proxmox_node = "pve0"
vm_template_id = 9000 # Proxmox VM ID of the Debian 13 cloud-init template
vm_datastore_id = "local-lvm"


@ -1,7 +1,7 @@
# Proxmox
variable "proxmox_endpoint" {
description = "Proxmox API URL, e.g. https://pve01.baobab.band:8006/"
description = "Proxmox API URL, e.g. https://pve0.boma.baobab.band:8006/"
type = string
}


@ -6,9 +6,9 @@
#
# State is local (see backend.tf) — no Forgejo backend credentials needed.
proxmox_endpoint = "https://pve01.baobab.band:8006/"
proxmox_endpoint = "https://pve0.boma.baobab.band:8006/"
proxmox_insecure = true # set false once a valid TLS cert is in place
proxmox_node = "pve01"
proxmox_node = "pve0"
vm_template_id = 9000 # Proxmox VM ID of the Debian 13 cloud-init template
vm_datastore_id = "local-lvm"


@ -1,7 +1,7 @@
# Proxmox
variable "proxmox_endpoint" {
description = "Proxmox API URL, e.g. https://pve01.baobab.band:8006/"
description = "Proxmox API URL, e.g. https://pve0.boma.baobab.band:8006/"
type = string
}


@ -4,7 +4,7 @@ variable "name" {
}
variable "server_type" {
description = "Hetzner server type, e.g. cax11 (ARM)"
description = "Hetzner server type, e.g. cx23 (x86) or cax11 (ARM)"
type = string
}


@ -19,7 +19,7 @@ concerns:
- monitoring # metric exporters / health checks
- config # render templated config/compose files to disk — no restart
- deploy # bring services up / restart (compose up -d)
- proxy # reverse-proxy + TLS registration (Traefik routes, Authentik)
- proxy # reverse-proxy + TLS registration (Caddy routes, Authentik)
# Ansible built-in special tags. Narrow use only:
# always — cheap preflight assertions (run regardless of --tags)