From b3468b34e4b0874251ce4f0ccb1d5e95c9425b70 Mon Sep 17 00:00:00 2001
From: sjat
Date: Mon, 15 Jun 2026 06:57:55 +0200
Subject: [PATCH] docs: record Caddy/Gandi DNS-01 as resolved + proven (was M4a deferral)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ADR-024 Status/Consequences, STATUS.md, ROADMAP M4a, and the FRICTION
ledger now record that the DNS-01 path is built and proven, with the
root cause of the M4a failure (version skew: pre-Bearer libdns/gandi
sent the deprecated Apikey header; plus building on a Hetzner IP).
Traefik was reconsidered and rejected again — lego's Gandi provider has
the same PAT-vs-Apikey question, so it would not have helped. Dated
review reports and spec/plan snapshots are left as historical records.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 STATUS.md                           |  2 +-
 docs/FRICTION.md                    |  2 +-
 docs/ROADMAP.md                     |  3 +-
 docs/decisions/024-reverse-proxy.md | 77 ++++++++++++++++++-----------
 4 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/STATUS.md b/STATUS.md
index 4c54b7a..cdf6f5a 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -32,7 +32,7 @@ _Last reviewed: 2026-06-14._
 | `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
 | `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **Pending:** NetBird mesh enrollment (so SSH is LAN-only); full `base` hardening (only the `firewall` concern exists, and it is NOT applied here — applying default-deny with no mesh would lock out inbound SSH on the physical NIC); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (now relevant — the offsite tfstate exists). |
 | `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **Pending:** NetBird coordinator (M4b), host firewall + mesh enrollment (M5), offsite tfstate backup (ADR-022). |
-| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). DNS-01 for cluster mesh/LAN-only services is deferred to Phase 2 (caddy-dns/gandi unresolved — see FRICTION). |
+| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). |
 
 ## Scaffolded but empty — NOT implemented
 
diff --git a/docs/FRICTION.md b/docs/FRICTION.md
index be2ceed..e98ffdb 100644
--- a/docs/FRICTION.md
+++ b/docs/FRICTION.md
@@ -52,7 +52,7 @@ which migrate or archive (knowledge is never deleted).
 | Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
 | Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
 | ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
-| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT | ADR-024's revised Status records the HTTP-01 decision, the DNS-01 deferral to Phase 2, and the Hetzner-build + plugin blocks. |
+| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
 | `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
 | Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
 | apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
diff --git a/docs/ROADMAP.md b/docs/ROADMAP.md
index e9f88a3..987332e 100644
--- a/docs/ROADMAP.md
+++ b/docs/ROADMAP.md
@@ -112,7 +112,8 @@ active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO
 Built in two phases.
 **M4a (platform) — ✅ DONE:** Docker on askari + boma's standard **Caddy** reverse
 proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
-valid Let's Encrypt cert (HTTP-01 — DNS-01 deferred to Phase 2, see ADR-024/FRICTION).
+valid Let's Encrypt cert (HTTP-01; the Gandi **DNS-01** path is now built + proven —
+2026-06-15, see ADR-024 — for mesh/LAN-only cluster services).
 Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` / `…2026-06-14-m4a-docker-caddy.md`.
 **M4b (next):** the `netbird_coordinator` service role — read NetBird's current self-host
 compose then.
diff --git a/docs/decisions/024-reverse-proxy.md b/docs/decisions/024-reverse-proxy.md
index 5dc9474..d5d8a5e 100644
--- a/docs/decisions/024-reverse-proxy.md
+++ b/docs/decisions/024-reverse-proxy.md
@@ -2,18 +2,31 @@
 
 ## Status
 
-Accepted (2026-06-14). Amends the soft Traefik assumption carried by the roadmap
-(Phase-2 step 5) and ADR-017 prose; those are updated to read "Caddy (ADR-024)".
+Accepted (2026-06-14; DNS-01 path resolved + proven 2026-06-15). Amends the soft
+Traefik assumption carried by the roadmap (Phase-2 step 5) and ADR-017 prose; those
+are updated to read "Caddy (ADR-024)".
 
-> **Cert method follows exposure (revised 2026-06-14, M4a).** The cert *challenge*
-> depends on whether a host is publicly reachable: **public hosts** (askari) use
-> **HTTP-01** with **vanilla Caddy** — simplest, no plugin; **mesh/LAN-only cluster
-> services** (no public A-record) need **DNS-01** (the M1 Gandi capability), since they
-> can't satisfy HTTP-01. The DNS-01 path is **deferred to Phase 2**: the `caddy-dns/gandi`
-> plugin did not create the ACME TXT records on askari despite a verified-valid token
-> (and Hetzner IPs are 403'd by Google's Go module infra, blocking the on-host custom
-> build) — both to be sorted when the cluster's private services actually need DNS-01.
-> The body below describes the DNS-01 design; askari (M4a) ships on HTTP-01.
+> **Cert method follows exposure.** The cert *challenge* depends on whether a host is
+> publicly reachable: **public hosts** (askari) use **HTTP-01** with **vanilla Caddy** —
+> simplest, no plugin; **mesh/LAN-only cluster services** (no public A-record) use
+> **DNS-01** via Gandi (the M1 capability), since they can't satisfy HTTP-01.
+>
+> **DNS-01 resolved + proven (2026-06-15) — the M4a deferral is closed.** The original
+> failure was diagnosed as **version skew**: the image built at M4a used a pre-Bearer
+> `libdns/gandi` that sent Gandi's **deprecated `Apikey` header** (→ 403 on a
+> verified-valid token), and the `xcaddy` build ran *on a Hetzner IP* (Google's Go
+> module proxy 403s those ranges). Both have clean, boma-aligned fixes: **pin
+> caddy-dns/gandi v1.1.0** (→ `libdns/gandi` v1.1.0, which sends the PAT as
+> `Authorization: Bearer` to `https://api.gandi.net/v5/livedns`) and **build the image
+> on ubongo, not Hetzner**. Verified end-to-end (2026-06-15): the custom image issues a
+> real **wildcard** cert (`*.dns01test.wingu.me`) against Let's Encrypt **staging** via
+> Gandi DNS-01 using `vault.gandi.pat`; `caddy validate` accepts `acme_dns gandi` on the
+> custom image and rejects it on vanilla `caddy:2`. Build with `make caddy-image`; the
+> `reverse_proxy` role enables it per-instance via `reverse_proxy__acme_dns_provider:
+> gandi` + `reverse_proxy__image`. **Traefik was reconsidered and rejected again** —
+> lego's Gandi provider faces the *same* PAT-vs-Apikey question, so switching would not
+> have dodged the issue, and would reverse this ADR for nothing. askari (M4a) stays on
+> HTTP-01 (a public host needs no DNS-01).
 
 ## Context
 
@@ -57,26 +70,32 @@ boma's reverse proxy is **Caddy**.
 5. `forward_auth` to Authentik is a first-class Caddy directive — the planned Authentik
    auth story (ADR-002) is preserved without Traefik as the middleman.
 
-### 2. Custom image (DNS-01 path only — Phase 2)
+### 2. Custom image (DNS-01 path — built)
 
-> Applies only to the **DNS-01** path, which is **deferred to Phase 2** (see the Status
-> note). M4a ships **vanilla `caddy:2`** on askari (HTTP-01) — no custom image.
+> Applies only to the **DNS-01** path. M4a ships **vanilla `caddy:2`** on askari
+> (HTTP-01) — no custom image; only DNS-01 hosts pull the custom one.
 
-Caddy's official Docker image does not include third-party DNS plugins. The `caddy-dns/gandi`
-plugin must be compiled in via `xcaddy`. When the cluster's mesh/LAN-only services need
-DNS-01, boma builds a custom image:
+Caddy's official Docker image does not include third-party DNS plugins. The
+`caddy-dns/gandi` plugin must be compiled in via `xcaddy`. boma builds a custom image
+(`.docker/caddy-gandi/Dockerfile`, `make caddy-image`), **pinned** (ADR-011/ADR-014):
 
-```
-FROM caddy:builder AS builder
-RUN xcaddy build --with github.com/caddy-dns/gandi
+```dockerfile
+FROM caddy:2.11.4-builder AS build
+RUN xcaddy build v2.11.4 --with github.com/caddy-dns/gandi@v1.1.0
 
-FROM caddy:latest
-COPY --from=builder /usr/bin/caddy /usr/bin/caddy
+FROM caddy:2.11.4
+COPY --from=build /usr/bin/caddy /usr/bin/caddy
 ```
 
-That image would be maintained as a boma artifact (Forgejo registry, pinned digest in the
-Compose template) — the cost of the Gandi DNS-01 path. (On askari this approach hit two
-blockers, so DNS-01 is deferred; see the Status note.)
+Two hard constraints, both learned from the M4a failure:
+
+1. **Build on ubongo, not Hetzner.** Google's Go module proxy 403s Hetzner IP ranges, so
+   the on-host build on askari failed. ubongo (the control node) builds it in ~1 min,
+   then it is pushed to the Forgejo registry (`make caddy-image-push`) and pulled by
+   DNS-01 hosts — the same artifact pattern as the Molecule image.
+2. **Pin a Bearer-capable plugin.** caddy-dns/gandi v1.1.0 → libdns/gandi v1.1.0 sends
+   the PAT as `Authorization: Bearer`. Older versions used the deprecated `Apikey`
+   header and 403 on a PAT — that was the M4a "valid token but no TXT record" symptom.
 
 ### 3. Deployment scope
 
@@ -96,9 +115,11 @@ middleware migration is required.
 - **Roadmap Phase-2 step 5** is updated from "Authentik + Traefik" to "Authentik + Caddy
   (ADR-024)".
 - **ADR-017 prose** that mentioned Traefik is updated to read "Caddy (ADR-024)".
-- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. **If/when**
-  the Phase-2 DNS-01 path lands, a custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must
-  be built, pushed to the Forgejo registry, and kept current (plugin + base image updates).
+- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. The DNS-01
+  custom Caddy image (`xcaddy` + `caddy-dns/gandi`, `.docker/caddy-gandi/`) is **built and
+  proven**; it must be pushed to the Forgejo registry (`make caddy-image-push`, needs
+  `docker login`) and kept current (plugin + base-image version bumps, pinned per
+  ADR-011/ADR-014) as DNS-01 cluster services come online.
 - Caddyfile config is rendered by Ansible from `group_vars` — consistent with ADR-004 and
   easier to review than distributed container labels.
 - `forward_auth` to Authentik is available when Authentik is deployed; no extra