# ADR-024 — Reverse proxy: Caddy (ACME — HTTP-01 public, DNS-01 private)
## Status
Accepted (2026-06-14; DNS-01 path resolved + proven 2026-06-15). Amends the soft
Traefik assumption carried by the roadmap (Phase-2 step 5) and ADR-017 prose; those
are updated to read "Caddy (ADR-024)".
> **Cert method follows exposure.** The cert *challenge* depends on whether a host is
> publicly reachable: **public hosts** (askari) use **HTTP-01** with **vanilla Caddy** —
> simplest, no plugin; **mesh/LAN-only cluster services** (no public A-record) use
> **DNS-01** via Gandi (the M1 capability), since they can't satisfy HTTP-01.
>
> **DNS-01 resolved + proven (2026-06-15) — the M4a deferral is closed.** The original
> failure was diagnosed as **version skew**: the image built at M4a used a pre-Bearer
> `libdns/gandi` that sent Gandi's **deprecated `Apikey` header** (→ 403 on a
> verified-valid token), and the `xcaddy` build ran *on a Hetzner IP* (Google's Go
> module proxy 403s those ranges). Both have clean, boma-aligned fixes: **pin
> caddy-dns/gandi v1.1.0** (→ `libdns/gandi` v1.1.0, which sends the PAT as
> `Authorization: Bearer` to `https://api.gandi.net/v5/livedns`) and **build the image
> on ubongo, not Hetzner**. Verified end-to-end (2026-06-15): the custom image issues a
> real **wildcard** cert (`*.dns01test.wingu.me`) against Let's Encrypt **staging** via
> Gandi DNS-01 using `vault.gandi.pat`; `caddy validate` accepts `acme_dns gandi` on the
> custom image and rejects it on vanilla `caddy:2`. Build with `make caddy-image`; the
> `reverse_proxy` role enables it per-instance via `reverse_proxy__acme_dns_provider:
> gandi` + `reverse_proxy__image`. **Traefik was reconsidered and rejected again** —
> lego's Gandi provider faces the *same* PAT-vs-Apikey question, so switching would not
> have dodged the issue, and would reverse this ADR for nothing. askari (M4a) stays on
> HTTP-01 (a public host needs no DNS-01).
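
As an illustration of the rule above, here is a minimal Python sketch of the selection logic. It is not part of the `reverse_proxy` role: the `Host` type, the `public_a_record` field, and the custom image name are hypothetical, while `caddy:2` and the `gandi` provider value come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional

VANILLA_IMAGE = "caddy:2"                      # named in this ADR
CUSTOM_IMAGE = "registry.example/caddy-gandi"  # hypothetical image reference


@dataclass(frozen=True)
class Host:
    name: str
    public_a_record: bool  # True if the host is reachable from the internet


@dataclass(frozen=True)
class CertPlan:
    challenge: str
    image: str
    acme_dns_provider: Optional[str]


def cert_plan(host: Host) -> CertPlan:
    """Pick the ACME challenge and image from the host's exposure."""
    if host.public_a_record:
        # Public hosts can answer HTTP-01, so no DNS plugin is needed.
        return CertPlan("http-01", VANILLA_IMAGE, None)
    # Mesh/LAN-only hosts cannot answer HTTP-01 and fall back to DNS-01.
    return CertPlan("dns-01", CUSTOM_IMAGE, "gandi")
```

Under these assumptions, `cert_plan(Host("askari", True))` selects HTTP-01 on the vanilla image, and any host without a public A-record selects DNS-01 with the `gandi` provider.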
## Context
boma needs a reverse proxy to front its services with TLS. ADR-002 requires every
service to sit behind a proxy with authentication before it is reachable; ADR-007/M1
delivers a `*.<domain>` wildcard cert via ACME DNS-01 against Gandi (the apex `boma`
domain, matching ROADMAP M1) — the only viable cert path for mesh/LAN-only services
that cannot satisfy HTTP-01 (no public A-record to point at).

The roadmap (Phase-2, step 5) and ADR-017 prose assumed **Traefik + Authentik** as the
auth-and-proxy pair without an ADR ever pinning Traefik. On closer inspection:
- Traefik's headline feature is **dynamic Docker-label discovery** — it discovers and
routes services automatically from container labels without any static config.
- boma already renders *all* config from Ansible templates and the `group_vars` catalog
(ADR-004). That makes dynamic label discovery a disadvantage: a service that is not in
the catalog does not exist (CLAUDE.md), so any route that Traefik auto-discovers
outside the catalog would be unaudited.
- The first reverse-proxy instance is needed on `askari` for M4 (NetBird), a host where
`docker_hosts` patterns are being established under off-site/VPS constraints, not a
full Proxmox cluster with many services.

No production investment in Traefik config has been made; the decision can be made
cleanly here.
## Decision
boma's reverse proxy is **Caddy**.
### 1. Rationale for Caddy over Traefik
1. Traefik's dynamic label discovery is wasted — boma renders config from the catalog;
Caddy's static Caddyfile maps naturally to "render from templates" (ADR-004).
2. Caddy's Caddyfile is simple to template with `ansible.builtin.template`; one file,
one `ansible_managed` header, no side-channel label state.
3. **Automatic HTTPS** via ACME DNS-01: the `caddy-dns/gandi` plugin satisfies the
Gandi DNS-01 challenge, which is the only cert path for services with no public
A-record (ADR-007/M1 wildcard strategy).
4. Far simpler for a solo operator: no dashboard-as-a-service, no routing-rule DSL,
no dynamic config files to reconcile.
5. `forward_auth` to Authentik is a first-class Caddy directive — the planned
Authentik auth story (ADR-002) is preserved without Traefik as the middleman.
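
The "render from the catalog" argument can be sketched in a few lines of Python. boma itself renders the Caddyfile with `ansible.builtin.template`; this standalone sketch only mirrors the idea. The catalog shape (`fqdn`, `upstream`), the function name, and the `GANDI_BEARER_TOKEN` environment variable name are assumptions for illustration.

```python
def render_caddyfile(services, acme_dns_provider=None,
                     token_env="GANDI_BEARER_TOKEN"):
    """Render a static Caddyfile from a list of catalog entries.

    Only services present in `services` get a site block, so nothing is
    routed that the catalog does not declare.
    """
    lines = []
    if acme_dns_provider:
        # Global options block: enable DNS-01 for every site in the file.
        lines += [
            "{",
            f"\tacme_dns {acme_dns_provider} {{env.{token_env}}}",
            "}",
            "",
        ]
    # Sort by hostname so repeated renders produce identical output.
    for svc in sorted(services, key=lambda s: s["fqdn"]):
        lines += [
            f"{svc['fqdn']} {{",
            f"\treverse_proxy {svc['upstream']}",
            "}",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    catalog = [{"fqdn": "svc.example.test", "upstream": "app:8080"}]
    print(render_caddyfile(catalog, acme_dns_provider="gandi"))
```

Sorting the entries keeps the rendered file stable between runs, so a diff of the Caddyfile shows only real catalog changes.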
### 2. Custom image (DNS-01 path — built)
> Applies only to the **DNS-01** path. M4a ships **vanilla `caddy:2`** on askari
> (HTTP-01) — no custom image; only DNS-01 hosts pull the custom one.

Caddy's official Docker image does not include third-party DNS plugins. The
`caddy-dns/gandi` plugin must be compiled in via `xcaddy`. boma builds a custom image
(`.docker/caddy-gandi/Dockerfile`, `make caddy-image`), **pinned** (ADR-011/ADR-014):
```dockerfile
FROM caddy:2.11.4-builder AS build
RUN xcaddy build v2.11.4 --with github.com/caddy-dns/gandi@v1.1.0
FROM caddy:2.11.4
COPY --from=build /usr/bin/caddy /usr/bin/caddy
```
Two hard constraints, both learned from the M4a failure:
1. **Build on ubongo, not Hetzner.** Google's Go module proxy 403s Hetzner IP ranges, so
the on-host build on askari failed. ubongo (the control node) builds it in ~1 min,
then it is pushed to the Forgejo registry (`make caddy-image-push`) and pulled by
DNS-01 hosts — the same artifact pattern as the Molecule image.
2. **Pin a Bearer-capable plugin.** caddy-dns/gandi v1.1.0 (which pulls in libdns/gandi
v1.1.0) sends the PAT as `Authorization: Bearer`. Older versions used the deprecated
`Apikey` header, which Gandi rejects with a 403 when given a PAT; that was the M4a
"valid token but no TXT record" symptom.
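
To make the second constraint concrete, here is a small Python sketch that builds (without sending) a LiveDNS request the way a Bearer-capable client authenticates. The base URL is the one quoted in this ADR; the `/domains` path and the function name are assumptions for illustration.

```python
import urllib.request

LIVEDNS_BASE = "https://api.gandi.net/v5/livedns"  # quoted in this ADR


def livedns_request(pat: str, path: str = "/domains") -> urllib.request.Request:
    """Build, but do not send, a LiveDNS request authenticated with a PAT.

    The PAT travels in the Authorization header using the Bearer scheme.
    """
    return urllib.request.Request(
        LIVEDNS_BASE + path,
        headers={"Authorization": f"Bearer {pat}"},
        method="GET",
    )


if __name__ == "__main__":
    req = livedns_request("example-token")  # placeholder, not a real PAT
    print(req.get_method(), req.full_url)
    print(req.get_header("Authorization"))
```

Passing the returned request to `urllib.request.urlopen` with a real PAT would be one way to check the token independently of Caddy. That step is left out here so the sketch runs without network access or credentials.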
### 3. Deployment scope
The first Caddy instance runs on `askari` (M4a), serving a test vhost over HTTP-01 to
prove the proxy + ACME path. It fronts the NetBird stack in **M4b** (when the
`netbird_coordinator` role is built). The pattern generalises to the Proxmox cluster in
Phase 2 when services multiply.
### 4. Authentik integration (deferred)
`forward_auth` to Authentik is deferred to Phase 2 (when Authentik is deployed on the
cluster). The Caddyfile template will carry a placeholder comment. No Traefik-Authentik
middleware migration is required.
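
A hedged Python sketch of how the placeholder could later give way to a real `forward_auth` block. `uri` and `copy_headers` are existing `forward_auth` subdirectives; the outpost address, the auth path, and the header names follow the pattern commonly shown for Authentik and should be treated as assumptions until the Phase 2 deployment pins them. The function name and its parameters are hypothetical.

```python
def site_block(fqdn, upstream, authentik_outpost=None):
    """Render one Caddyfile site block, with or without forward_auth."""
    body = []
    if authentik_outpost:
        body += [
            f"\tforward_auth {authentik_outpost} {{",
            "\t\turi /outpost.goauthentik.io/auth/caddy",  # assumed auth path
            "\t\tcopy_headers X-Authentik-Username X-Authentik-Groups",
            "\t}",
        ]
    else:
        # Placeholder comment carried by the template until Phase 2.
        body.append("\t# forward_auth: deferred until Authentik is deployed")
    body.append(f"\treverse_proxy {upstream}")
    return "\n".join([f"{fqdn} {{", *body, "}"])


if __name__ == "__main__":
    print(site_block("svc.example.test", "app:8080"))
    print(site_block("svc.example.test", "app:8080", "http://outpost:9000"))
```

A real deployment would likely also need a route for the outpost's own endpoints; the sketch leaves that out to keep the toggle itself visible.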
## Consequences
- **Roadmap Phase-2 step 5** is updated from "Authentik + Traefik" to "Authentik +
Caddy (ADR-024)".
- **ADR-017 prose** that mentioned Traefik is updated to read "Caddy (ADR-024)".
- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. The DNS-01
custom Caddy image (`xcaddy` + `caddy-dns/gandi`, `.docker/caddy-gandi/`) is **built and
proven**; it must be pushed to the Forgejo registry (`make caddy-image-push`, needs
`docker login`) and kept current (plugin + base-image version bumps, pinned per
ADR-011/ADR-014) as DNS-01 cluster services come online.
- Caddyfile config is rendered by Ansible from `group_vars` — consistent with ADR-004
and easier to review than distributed container labels.
- `forward_auth` to Authentik is available when Authentik is deployed; no extra
middleware layer required.
- The `proxy` concern tag (already in `tests/tags.yml`) covers Caddy config tasks.
## What was ruled out
- **Traefik** — dynamic label discovery is a mismatch for boma's catalog-rendered
config model (ADR-004); more complex for a solo operator; no prior investment to
protect.
- **nginx / HAProxy** — no built-in ACME; require a separate ACME client (certbot,
acme.sh) adding operational surface; Caddy's integrated ACME is simpler.
- **NetBird's bundled TLS** — NetBird's management UI can serve its own TLS, but that
doesn't generalise; a real proxy separates concerns and applies to every service.
## Related
- ADR-002 — services behind a proxy with authentication (the requirement this satisfies).
- ADR-004 — Docker & Compose model (template-rendered config, catalog-driven).
- ADR-007 / M1 — Gandi DNS-01 ACME path (the TLS strategy Caddy implements).
- ADR-016 — NetBird (M4 is the first deployment of this proxy).
- ADR-017 — service-UI verification; forward_auth to Authentik is the future auth story.