boma/docs/decisions/024-reverse-proxy.md
sjat b3468b34e4 docs: record Caddy/Gandi DNS-01 as resolved + proven (was M4a deferral)
ADR-024 Status/Consequences, STATUS.md, ROADMAP M4a, and the FRICTION ledger now
record that the DNS-01 path is built and proven, with the root cause of the M4a
failure (version skew: pre-Bearer libdns/gandi sent the deprecated Apikey header;
plus building on a Hetzner IP). Traefik was reconsidered and rejected again — lego's
Gandi provider has the same PAT-vs-Apikey question, so it would not have helped.

Dated review reports and spec/plan snapshots are left as historical records.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 06:57:55 +02:00

ADR-024 — Reverse proxy: Caddy (ACME — HTTP-01 public, DNS-01 private)

Status

Accepted (2026-06-14; DNS-01 path resolved + proven 2026-06-15). Amends the soft Traefik assumption carried by the roadmap (Phase-2 step 5) and ADR-017 prose; those are updated to read "Caddy (ADR-024)".

Cert method follows exposure. The cert challenge depends on whether a host is publicly reachable: public hosts (askari) use HTTP-01 with vanilla Caddy — simplest, no plugin; mesh/LAN-only cluster services (no public A-record) use DNS-01 via Gandi (the M1 capability), since they can't satisfy HTTP-01.

DNS-01 resolved + proven (2026-06-15): the M4a deferral is closed. The original failure had two causes. The first was version skew: the image built at M4a used a pre-Bearer libdns/gandi that sent Gandi's deprecated Apikey header (→ 403 on a verified-valid token). The second was the build location: the xcaddy build ran on a Hetzner IP, and Google's Go module proxy 403s those ranges. Both have clean, boma-aligned fixes: pin caddy-dns/gandi v1.1.0 (→ libdns/gandi v1.1.0, which sends the PAT as Authorization: Bearer to https://api.gandi.net/v5/livedns) and build the image on ubongo, not Hetzner.

Verified end-to-end (2026-06-15): the custom image issues a real wildcard cert (*.dns01test.wingu.me) against Let's Encrypt staging via Gandi DNS-01 using vault.gandi.pat; caddy validate accepts acme_dns gandi on the custom image and rejects it on vanilla caddy:2. Build with make caddy-image; the reverse_proxy role enables it per-instance via reverse_proxy__acme_dns_provider: gandi + reverse_proxy__image.

Traefik was reconsidered and rejected again: lego's Gandi provider faces the same PAT-vs-Apikey question, so switching would not have dodged the issue and would have reversed this ADR for nothing. askari (M4a) stays on HTTP-01 (a public host needs no DNS-01).
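
The rule that cert method follows exposure, together with the per-instance toggle, could be modelled roughly as below. This is an illustrative Python sketch, not the role's actual logic: the ProxyPlan type, the plan_for and global_options helpers, the custom image reference, and the GANDI_BEARER_TOKEN environment variable name are all assumptions made for the example. Only caddy:2, acme_dns gandi, and the HTTP-01/DNS-01 split come from this ADR.

```python
from dataclasses import dataclass

VANILLA_IMAGE = "caddy:2"
# Hypothetical registry path; the ADR only says the image is pushed to the Forgejo registry.
CUSTOM_IMAGE = "registry.example/boma/caddy-gandi:2.11.4"


@dataclass(frozen=True)
class ProxyPlan:
    challenge: str          # "http-01" or "dns-01"
    image: str              # container image the instance should run
    acme_dns_provider: str  # "" for HTTP-01, "gandi" for DNS-01


def plan_for(host_is_public: bool) -> ProxyPlan:
    """Pick the ACME challenge from exposure: public -> HTTP-01, private -> DNS-01."""
    if host_is_public:
        return ProxyPlan("http-01", VANILLA_IMAGE, "")
    return ProxyPlan("dns-01", CUSTOM_IMAGE, "gandi")


def global_options(plan: ProxyPlan, token_env: str = "GANDI_BEARER_TOKEN") -> str:
    """Render a Caddyfile global options block; empty when no DNS provider is set."""
    if not plan.acme_dns_provider:
        return ""
    return "{\n\tacme_dns %s {env.%s}\n}\n" % (plan.acme_dns_provider, token_env)
```

Under these assumptions, plan_for(True) describes askari (vanilla image, no global block), while plan_for(False) describes a mesh/LAN-only service that needs the custom image and an acme_dns gandi line.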

Context

boma needs a reverse proxy to front its services with TLS. ADR-002 requires every service to sit behind a proxy with authentication before it is reachable; ADR-007/M1 delivers a *.<domain> wildcard cert via ACME DNS-01 against Gandi (the apex boma domain, matching ROADMAP M1) — the only viable cert path for mesh/LAN-only services that cannot satisfy HTTP-01 (no public A-record to point at).

The roadmap (Phase-2, step 5) and ADR-017 prose assumed Traefik + Authentik as the auth-and-proxy pair without an ADR ever pinning Traefik. On closer inspection:

  • Traefik's headline feature is dynamic Docker-label discovery: it discovers and routes services automatically from container labels, without any per-service static config.
  • boma already renders all config from Ansible templates and the group_vars catalog (ADR-004). That makes dynamic label discovery a disadvantage: a service that is not in the catalog does not exist (CLAUDE.md), so any route that Traefik auto-discovers outside the catalog would be unaudited.
  • The first reverse-proxy instance is needed on askari for M4 (NetBird), a host where docker_hosts patterns are being established under off-site/VPS constraints, not a full Proxmox cluster with many services.
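
To make the audit argument concrete, here is a minimal sketch of catalog-driven rendering. The catalog shape, hostnames, upstreams, and helper names are hypothetical (the real group_vars schema is not shown in this ADR). The point is only that every route in the rendered output traces back to a catalog entry, and that anything a label-discovering proxy would pick up beyond that is detectable as unaudited.

```python
# Hypothetical catalog shape: hostname -> settings. The real group_vars schema may differ.
CATALOG = {
    "netbird.example.org": {"upstream": "netbird:8080"},
}


def render_site_blocks(catalog):
    """Render one static Caddy site block per catalog entry, in sorted order."""
    blocks = []
    for host in sorted(catalog):
        upstream = catalog[host]["upstream"]
        blocks.append(f"{host} {{\n\treverse_proxy {upstream}\n}}\n")
    return "\n".join(blocks)


def unaudited_routes(catalog, discovered_hosts):
    """Hosts a label-discovering proxy would route that the catalog never declared."""
    return sorted(set(discovered_hosts) - set(catalog))
```

With static rendering the second function has nothing to report by construction; it exists here only to show what dynamic discovery would need to be checked against.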

No production investment in Traefik config has been made; the decision can be made cleanly here.

Decision

boma's reverse proxy is Caddy.

1. Rationale for Caddy over Traefik

  1. Traefik's dynamic label discovery is wasted — boma renders config from the catalog; Caddy's static Caddyfile maps naturally to "render from templates" (ADR-004).
  2. Caddy's Caddyfile is simple to template with ansible.builtin.template; one file, one ansible_managed header, no side-channel label state.
  3. Automatic HTTPS via ACME DNS-01: the caddy-dns/gandi plugin satisfies the Gandi DNS-01 challenge, which is the only cert path for services with no public A-record (ADR-007/M1 wildcard strategy).
  4. Far simpler for a solo operator: no dashboard-as-a-service, no routing-rule DSL, no dynamic config files to reconcile.
  5. forward_auth to Authentik is a first-class Caddy directive — the planned Authentik auth story (ADR-002) is preserved without Traefik as the middleman.

2. Custom image (DNS-01 path — built)

Applies only to the DNS-01 path. M4a ships vanilla caddy:2 on askari (HTTP-01) — no custom image; only DNS-01 hosts pull the custom one.

Caddy's official Docker image does not include third-party DNS plugins. The caddy-dns/gandi plugin must be compiled in via xcaddy. boma builds a custom image (.docker/caddy-gandi/Dockerfile, make caddy-image), pinned (ADR-011/ADR-014):

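# Build stage: compile Caddy with the Gandi DNS plugin via xcaddy.
FROM caddy:2.11.4-builder AS build
RUN xcaddy build v2.11.4 --with github.com/caddy-dns/gandi@v1.1.0

# Runtime stage: official image with the custom binary swapped in.
FROM caddy:2.11.4
COPY --from=build /usr/bin/caddy /usr/bin/caddy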
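
Because the Caddy version is pinned in three places in this Dockerfile, a version bump can leave them out of step. A small check along the following lines could guard the pin. It is a sketch only: the helper names are hypothetical, and the regexes assume the exact layout shown above.

```python
import re

DOCKERFILE = """\
FROM caddy:2.11.4-builder AS build
RUN xcaddy build v2.11.4 --with github.com/caddy-dns/gandi@v1.1.0

FROM caddy:2.11.4
COPY --from=build /usr/bin/caddy /usr/bin/caddy
"""


def caddy_versions(dockerfile: str) -> dict:
    """Extract the three places where the Caddy version is pinned."""
    found = {
        "builder": re.search(r"^FROM caddy:([\d.]+)-builder\b", dockerfile, re.M),
        "xcaddy": re.search(r"xcaddy build v([\d.]+)", dockerfile),
        "runtime": re.search(r"^FROM caddy:([\d.]+)\s*$", dockerfile, re.M),
    }
    missing = [name for name, match in found.items() if match is None]
    if missing:
        raise ValueError(f"unpinned or missing: {', '.join(missing)}")
    return {name: match.group(1) for name, match in found.items()}


def pins_agree(dockerfile: str) -> bool:
    """True when builder image, xcaddy target, and runtime image share one version."""
    return len(set(caddy_versions(dockerfile).values())) == 1
```

A floating tag such as caddy:2 makes caddy_versions raise instead of passing quietly, which matches the intent of pinning.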

Two hard constraints, both learned from the M4a failure:

  1. Build on ubongo, not Hetzner. Google's Go module proxy 403s Hetzner IP ranges, so the on-host build on askari failed. ubongo (the control node) builds it in ~1 min, then it is pushed to the Forgejo registry (make caddy-image-push) and pulled by DNS-01 hosts — the same artifact pattern as the Molecule image.
  2. Pin a Bearer-capable plugin. caddy-dns/gandi v1.1.0 → libdns/gandi v1.1.0 sends the PAT as Authorization: Bearer. Older versions send the deprecated Apikey header, which Gandi answers with a 403 when the credential is a PAT; that was the M4a "valid token but no TXT record" symptom.
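
The difference between the two auth schemes comes down to one header. The sketch below builds, but does not send, a LiveDNS request in either style so the two can be compared side by side. The livedns_request helper, the example path, and the token value are hypothetical, and the exact legacy header format (Authorization: Apikey followed by the key) is an assumption based on Gandi's documented scheme, not something this ADR records about the old plugin.

```python
import urllib.request

LIVEDNS_BASE = "https://api.gandi.net/v5/livedns"


def livedns_request(path: str, pat: str, legacy: bool = False) -> urllib.request.Request:
    """Build (but do not send) a LiveDNS request with the chosen auth scheme."""
    # legacy=True mimics the deprecated scheme assumed to be used by pre-Bearer plugins.
    scheme = "Apikey" if legacy else "Bearer"
    return urllib.request.Request(
        LIVEDNS_BASE + path,
        headers={"Authorization": f"{scheme} {pat}"},
    )
```

Printing req.get_header("Authorization") for both variants is a quick way to confirm which scheme a given client build would present, without touching the live API.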

3. Deployment scope

The first Caddy instance runs on askari (M4a), serving a test vhost over HTTP-01 to prove the proxy + ACME path. It fronts the NetBird stack in M4b (when the netbird_coordinator role is built). The pattern generalises to the Proxmox cluster in Phase 2 when services multiply.

4. Authentik integration (deferred)

forward_auth to Authentik is deferred to Phase 2 (when Authentik is deployed on the cluster). The Caddyfile template will carry a placeholder comment. No Traefik-Authentik middleware migration is required.

Consequences

  • Roadmap Phase-2 step 5 is updated from "Authentik + Traefik" to "Authentik + Caddy (ADR-024)".
  • ADR-017 prose that mentioned Traefik is updated to read "Caddy (ADR-024)".
  • M4a (public hosts, HTTP-01) runs vanilla caddy:2 — no custom image. The DNS-01 custom Caddy image (xcaddy + caddy-dns/gandi, .docker/caddy-gandi/) is built and proven; it must be pushed to the Forgejo registry (make caddy-image-push, needs docker login) and kept current (plugin + base-image version bumps, pinned per ADR-011/ADR-014) as DNS-01 cluster services come online.
  • Caddyfile config is rendered by Ansible from group_vars — consistent with ADR-004 and easier to review than distributed container labels.
  • forward_auth to Authentik is available when Authentik is deployed; no extra middleware layer required.
  • The proxy concern tag (already in tests/tags.yml) covers Caddy config tasks.

What was ruled out

  • Traefik — dynamic label discovery is a mismatch for boma's catalog-rendered config model (ADR-004); more complex for a solo operator; no prior investment to protect.
  • nginx / HAProxy — no built-in ACME; require a separate ACME client (certbot, acme.sh) adding operational surface; Caddy's integrated ACME is simpler.
  • NetBird's bundled TLS — NetBird's management UI can serve its own TLS, but that doesn't generalise; a real proxy separates concerns and applies to every service.

Related

  • ADR-002 — services behind a proxy with authentication (the requirement this satisfies).
  • ADR-004 — Docker & Compose model (template-rendered config, catalog-driven).
  • ADR-007 / M1 — Gandi DNS-01 ACME path (the TLS strategy Caddy implements).
  • ADR-016 — NetBird (M4 is the first deployment of this proxy).
  • ADR-017 — service-UI verification; forward_auth to Authentik is the future auth story.