diff --git a/roles/reverse_proxy/ACCESS.md b/roles/reverse_proxy/ACCESS.md
new file mode 100644
index 0000000..f4ebb96
--- /dev/null
+++ b/roles/reverse_proxy/ACCESS.md
@@ -0,0 +1,37 @@
+# Access — reverse_proxy (Caddy)
+
+Rendered from the role's `access__*` data (`roles/reverse_proxy/defaults/main.yml`) —
+the source of truth that also drives `/check-access`. Regenerate from the data; edit the
+data, not the tables. Host: `askari` (off-site Hetzner; ADR-007/016).
+
+## Access paths
+
+The documented ways in, by tier (rendered from `access__*`):
+
+| Tier | Path | Invocation |
+|---|---|---|
+| primary | `wt0` mesh SSH | `ssh askari` (over the NetBird mesh — pending M5; see notes) |
+| secondary | LAN/WAN SSH from `ubongo` | `ssh ansible@askari` (from the control node; Hetzner firewall allows only ubongo's WAN) |
+| — | container exec + compose | `docker compose -p reverse_proxy -f /opt/services/reverse_proxy/docker-compose.yml ps` / `… exec caddy sh` |
+| — | logs | `docker logs caddy` now; Loki labels `{service: caddy}` once the ADR-018 pipeline lands |
+| — | admin API | n/a — Caddy admin API bound to container localhost `:2019`, never exposed (`access__api.enabled: false`) |
+
+## Break-glass
+
+Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
+
+- **Hetzner rescue system + Cloud Console** (VNC) for `askari` — boot the rescue image
+  or attach the web console from the Hetzner Cloud panel if SSH is unreachable.
+
+## Operational notes
+
+- **Mesh not yet enrolled (M5).** Until `askari` joins the NetBird mesh, the `wt0`
+  primary path does not exist — the only SSH route is the secondary one (from `ubongo`'s
+  WAN IP, which the TF-managed Hetzner Cloud Firewall allowlists). Promote `wt0` to
+  primary once M5 lands.
+- **Caddy wedged / bad config:** the Caddyfile is rendered read-only by Ansible; to
+  recover, fix `reverse_proxy__routes` in `group_vars` and re-run the role (it reloads
+  Caddy via the handler).
To validate the rendered config: `docker exec caddy caddy validate
+  --config /etc/caddy/Caddyfile`.
+- **Cert issuance failing:** check that port 80 is reachable from the internet (HTTP-01
+  needs it) and watch `docker logs caddy` for ACME errors before assuming a routing fault.
diff --git a/roles/reverse_proxy/SECURITY.md b/roles/reverse_proxy/SECURITY.md
new file mode 100644
index 0000000..602755c
--- /dev/null
+++ b/roles/reverse_proxy/SECURITY.md
@@ -0,0 +1,61 @@
+# Security — reverse_proxy (Caddy)
+
+## Exposure
+
+- **Published ports:** `80/tcp` + `443/tcp` (HTTP→HTTPS redirect + TLS). Both are
+  declared in the `group_vars` firewall catalog as the askari `public_web` opens
+  (ADR-020); the Hetzner Cloud Firewall also opens 80/443 (and 3478 for NetBird).
+  Port 80 must stay open to the internet for the ACME HTTP-01 challenge.
+- **Auth surface:** none of its own. Caddy is the TLS terminator and router; per-service
+  authentication (Authentik `forward_auth`) is added at each route in Phase 2 (ADR-024
+  §4). Today it fronts only a static `respond` test vhost and (M4b) the NetBird stack,
+  which carries its own auth.
+- **Reachability:** public — askari is internet-facing. Caddy is the single public entry
+  point; upstreams sit on the internal `boma` Docker network and are reached by name, not
+  published directly.
+- **Data sensitivity:** none persistent worth protecting — only ACME account keys +
+  issued certificates in the `caddy_data` volume, which are re-issuable (HTTP-01). No
+  user data, no secrets at rest. See backup record: `backup__state: false` (stateless).
+
+## Checklist status
+
+Each item from `docs/security/service-checklist.md`:
+
+- [x] Secrets in vault; no default creds; nothing secret in git/images — ✅ n/a: HTTP-01
+  needs no credentials; the only config input is `reverse_proxy__acme_email` (not secret).
+- [x] Non-root; no `privileged`/host-network unless justified; minimal mounts; caps
+  dropped — ⚠️ official `caddy:2` runs as root (to bind 80/443); no `privileged`, no host
+  network (bridge `boma`); mounts are the read-only Caddyfile + two named volumes. Root
+  inside the container is the upstream default; revisit if Caddy ships a rootless variant.
+- [x] Ports declared in `group_vars`; behind reverse proxy + auth if exposed;
+  least-privilege inter-service reach — ✅ 80/443 in the catalog; Caddy *is* the proxy;
+  upstreams are not published, only reachable on the `boma` network.
+- [x] Image pinned (tag/digest), update path known — ⚠️ pinned to the `caddy:2` major
+  tag (stateless tier, ADR-011/ADR-004), not a digest; refreshed deliberately and watched
+  by DIUN. Tighten to `tag@digest` if the proxy is reclassified as stateful.
+- [x] Logs reviewable; backup/restore covered if stateful — ✅ stateless (no backup
+  needed); logs via `docker logs caddy` now, Loki labels declared for the ADR-018 pipeline.
+
+## Service-specific hardening
+
+- **HTTP-01 only, no DNS token:** vanilla `caddy:2`, no `caddy-dns/gandi` plugin and no
+  Gandi API token on the host — removes a credential and a custom-image supply chain
+  (ADR-024 revised Status).
+- **Caddyfile is read-only** in the container (`:ro` mount); rendered solely by Ansible
+  from the `group_vars` route catalog — no dynamic label discovery, so no route exists
+  that wasn't declared (the reason Caddy was chosen over Traefik, ADR-024 §1).
+- **Admin API not exposed:** Caddy's admin endpoint stays on container-localhost `:2019`;
+  never published, never in the firewall catalog (`access__api.enabled: false`).
+- **Automatic HTTPS:** HTTP is redirected to HTTPS and modern TLS defaults are Caddy's
+  out-of-the-box behaviour (no manual cipher config needed).
+
+## Residual / accepted risks
+
+- **Container runs as root** — upstream `caddy:2` default (needs to bind low ports).
+  Rationale: official image, no rootless variant wired yet; blast radius limited to the
+  proxy container. Revisit: adopt a rootless Caddy image if upstream stabilises one.
+- **Image pinned to a major tag, not a digest** — accepted for the stateless tier
+  (ADR-011). Revisit if the role gains state.
+- **ACME re-issuance vs Let's Encrypt rate limits** — losing `caddy_data` triggers
+  re-issuance; rapid repeated rebuilds could hit LE rate limits. Acceptable for a handful
+  of askari hostnames; noted in the backup rationale.
diff --git a/roles/reverse_proxy/VERIFY.md b/roles/reverse_proxy/VERIFY.md
new file mode 100644
index 0000000..db427b1
--- /dev/null
+++ b/roles/reverse_proxy/VERIFY.md
@@ -0,0 +1,44 @@
+# Verify — reverse_proxy (Caddy)
+
+`reverse_proxy` has no application UI of its own — it is the TLS terminator and router.
+"Working" is verified at the HTTP/TLS layer (what `/verify-service` can drive with a
+browser/HTTP client against the public hostnames it serves), not via an app login.
+
+## Critical user journeys
+
+1. **HTTPS serves with a valid cert** — request `https://` (e.g. `https://test.askari.wingu.me`) → 200 with a valid
+   Let's Encrypt certificate (trusted chain, CN/SAN matches the host, not expired).
+2. **HTTP redirects to HTTPS** — request `http://` → 308/301 redirect to the
+   `https://` URL (Caddy's automatic-HTTPS redirect).
+3. **A `respond` route returns its static body** — the test vhost returns its configured
+   string with 200.
+4. **An `upstream` route proxies through** — once a real upstream is registered (M4b
+   NetBird), `https://` reaches the upstream's response, not a Caddy error page.
+5. **An unknown host is not served a valid cert** — a hostname not in
+   `reverse_proxy__routes` does not get a certificate / is not routed (no accidental
+   catch-all).
+
+## What good looks like
+
+- The browser padlock shows a valid Let's Encrypt certificate for the requested host;
+  the SAN matches and the chain is trusted.
+- `http://` visibly becomes `https://` in the address bar.
+- The expected body (static `respond` text, or the upstream's page) renders.
+
+## Not browser-verifiable
+
+- Certificate *renewal* (60-day cadence) — confirm out of band via `docker logs caddy`
+  / Loki, not a single browser session.
+- Behaviour when port 80 is blocked (HTTP-01 would fail) — an infrastructure/firewall
+  check, route to the manual handoff.
+- The deferred DNS-01 path for mesh/LAN-only services (Phase 2, ADR-024) — not yet live.
+
+## Test data
+
+Provisioned in the **staging** deploy (no Authentik user needed — there is no SSO on the
+proxy itself):
+
+- At least one `reverse_proxy__routes` entry with a public DNS A-record pointing at the
+  staging host, so HTTP-01 can complete. A static `respond` route is enough for journeys
+  1–3 and 5.
diff --git a/roles/reverse_proxy/defaults/main.yml b/roles/reverse_proxy/defaults/main.yml
index 7060df3..cb57d48 100644
--- a/roles/reverse_proxy/defaults/main.yml
+++ b/roles/reverse_proxy/defaults/main.yml
@@ -4,3 +4,25 @@ reverse_proxy__base_dir: /opt/services/reverse_proxy
 reverse_proxy__acme_email: admin@example.test
 reverse_proxy__routes: [] # each: {host: x, upstream: "svc:port"} OR {host: x, respond: "text"}
 reverse_proxy__manage: true # set false in Molecule to render without Docker
+
+# access__*/backup__* are the ADR-021/022 CROSS-ROLE conventions — shared field names that
+# render ACCESS.md/BACKUP.md and drive /check-access · /check-backup. They intentionally do
+# NOT carry the reverse_proxy__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
+# (ansible-lint's role-prefix rule has no per-prefix allowlist; keeping it enabled elsewhere).
+
+# Operational-access record (ADR-021) — source of truth for ACCESS.md + /check-access.
+access__service: reverse_proxy # noqa: var-naming[no-role-prefix]
+access__compose_project: reverse_proxy # noqa: var-naming[no-role-prefix]
+access__compose_path: "{{ reverse_proxy__base_dir }}/docker-compose.yml" # noqa: var-naming[no-role-prefix]
+access__containers: [caddy] # noqa: var-naming[no-role-prefix]
+access__log: # noqa: var-naming[no-role-prefix]
+  loki_labels: { service: caddy } # intent; Loki/Alloy pipeline is ADR-018 (pending)
+access__api: # noqa: var-naming[no-role-prefix]
+  enabled: false
+  reason: "Caddy admin API bound to container localhost :2019; never exposed (ADR-020 catalog owns ports)"
+
+# Backup contract (ADR-022). Stateless: Caddy's /data holds only ACME account keys +
+# issued certs, which are re-requested automatically on restart via HTTP-01 (no manual
+# steps). Residual risk: Let's Encrypt rate limits on rapid repeated re-issuance.
+backup__service: reverse_proxy # noqa: var-naming[no-role-prefix]
+backup__state: false # noqa: var-naming[no-role-prefix]
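
Review note: the comment on `reverse_proxy__routes` gives the entry shape as a `host` plus either `upstream` ("svc:port") or `respond` ("text"). A minimal Python sketch of a pre-render check for that shape follows, assuming the routes are already loaded into a list of dicts. `validate_routes` and its messages are hypothetical and not part of the role; the role may validate differently or not at all.

```python
from typing import Any


def validate_routes(routes: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems found in a reverse_proxy__routes-style list.

    Hypothetical helper for illustration only; an empty list means no problems.
    """
    problems: list[str] = []
    for index, route in enumerate(routes):
        host = route.get("host")
        if not isinstance(host, str) or not host:
            problems.append(f"route {index}: missing or empty 'host'")

        # The defaults comment describes the two shapes as alternatives,
        # so require exactly one of 'upstream' / 'respond'.
        has_upstream = "upstream" in route
        has_respond = "respond" in route
        if has_upstream == has_respond:
            problems.append(
                f"route {index}: needs exactly one of 'upstream' or 'respond'"
            )
        elif has_upstream:
            # Assumed format "svc:port", taken from the comment in defaults/main.yml.
            name, sep, port = str(route["upstream"]).rpartition(":")
            if not (sep and name and port.isdigit()):
                problems.append(f"route {index}: upstream must look like 'svc:port'")
    return problems
```

A check like this could run in CI or a Molecule scenario before the Caddyfile is rendered, so a malformed route is caught before the handler reloads Caddy.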