docs(spec): M4 — NetBird coordinator on askari + Caddy reverse proxy

Caddy becomes boma's standard reverse proxy (amends the soft Traefik assumption;
new ADR) with Gandi DNS-01 certs (custom xcaddy image, reuses vault.gandi.pat) —
the only cert path for mesh/LAN-only services. NetBird self-hosted in
external-proxy mode (embedded Dex), compose rendered from boma templates
(ADR-004/013). Three roles: docker_host (first real content), reverse_proxy (new,
Caddy), netbird (first service role w/ full ADR-004 standard files). Firewall +
DNS amendments; backup execution deferred (fisi). caddy-dns/gandi + NetBird
self-host facts verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-14 17:19:21 +02:00
parent 181a02fd3a
commit 65cf20a993


@@ -0,0 +1,120 @@
# Design — NetBird coordinator on askari + Caddy reverse proxy (M4)
- **Date:** 2026-06-14
- **Status:** Draft → straight to plan (per the standing skip-the-spec-review-gate agreement)
- **Roadmap milestone:** M4 (`docs/ROADMAP.md`)
- **Implements:** ADR-016 (NetBird coordinator self-hosted on askari), ADR-004 (first
service role)
- **Establishes:** a new **ADR — boma's reverse proxy is Caddy** (amends the soft Traefik
assumption in the roadmap/ADR-017 prose)
---
## Problem
The NetBird mesh control plane (ADR-016) must run on askari so ubongo + road-warrior
laptops can enrol (M5) and reach ubongo from anywhere. This is also boma's **first real
service role** (ADR-004) and its **first reverse proxy** — so M4 sets two precedents:
the service-role pattern, and Caddy as boma's standard reverse proxy.
## Decisions (as settled)
1. **Caddy is boma's standard reverse proxy** (replaces the soft Traefik assumption — no
ADR ever pinned Traefik). Rationale: boma renders all config from Ansible templates
(ADR-004), so Traefik's dynamic Docker-label discovery is wasted; Caddy's templated
Caddyfile + automatic HTTPS fits the "render from the catalog" model; far simpler for a
solo operator; `forward_auth` to Authentik later keeps the auth story. → small new ADR.
2. **Caddy + Gandi DNS-01** (not HTTP-01). boma's services are mostly **mesh/LAN-only with
no public DNS record**, and you cannot HTTP-01 an unexposed host — DNS-01 is the only
cert path for them (the reason M1 built Gandi DNS-01). One mechanism fleet-wide; reuses
`vault.gandi.pat`. Cost: a **custom Caddy image** (`xcaddy` + `caddy-dns/gandi`) — fits
boma's "build our own images" pattern (the Molecule image).
3. **NetBird in external-reverse-proxy mode** — disable its bundled Traefik; boma's Caddy
terminates TLS for `netbird.askari.wingu.me` and proxies to the NetBird containers.
Embedded **Dex** IdP (ADR-016). The compose + server config are **rendered from boma
Jinja templates** (ADR-004 + ADR-013 translate-don't-transplant), based on NetBird's
current self-host reference read at implementation time.
4. **Three roles, applied to askari (`offsite_hosts`):**
- **`docker_host`** — first real content: install Docker engine + compose plugin,
version-pinned (ADR-011). (Cluster daemon-hardening + `nftables.d` integration stay
deferred to the cluster.)
- **`reverse_proxy`** (new) — the custom Caddy image + a `Caddyfile` rendered from route
data + `.env` with `GANDI_BEARER_TOKEN={{ vault.gandi.pat }}`. boma's standard proxy;
generalises to the cluster later (not built now).
- **`netbird`** (new) — **boma's first service role**: renders the NetBird compose +
server config + `.env` from vault; the full ADR-004 standard files.
5. **Firewall:** amend the M2 Hetzner Cloud Firewall (TF `offsite`) to open **80/443 TCP +
3478 UDP** (NetBird's public ports). SSH-from-ubongo stays.
6. **DNS:** add `netbird.askari.wingu.me` → askari's IP via `public_dns` (M1 role).
7. **Standard service-role files authored, execution deferred.** SECURITY/VERIFY/ACCESS/
BACKUP.md written for `netbird` (the precedent), but `/verify-service` (playwright) and
`fisi` backup don't exist yet — BACKUP.md records the datastore + an **accepted risk**
that off-site backup is pending; VERIFY.md is authored, run later.
8. **Setup keys are an M5 artifact** (created post-deploy via the dashboard/API). M4 stubs
`vault.netbird.setup_key: CHANGEME` (the placeholder convention) for M5 to fill.
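
The placeholder convention in decision 8 lends itself to a pre-deploy check. The following is a minimal sketch, assuming the vault has already been decrypted into a nested dict; the function name `find_placeholders` and the sample data are hypothetical, not part of boma.

```python
PLACEHOLDER = "CHANGEME"  # the placeholder convention named in decision 8


def find_placeholders(tree, path=""):
    """Return dotted paths of every value still set to the placeholder."""
    hits = []
    if isinstance(tree, dict):
        for key, value in tree.items():
            child = f"{path}.{key}" if path else str(key)
            hits.extend(find_placeholders(value, child))
    elif tree == PLACEHOLDER:
        hits.append(path)
    return hits


# Hypothetical decrypted vault contents, for illustration only.
sample_vault = {
    "gandi": {"pat": "not-a-real-token"},
    "netbird": {"setup_key": "CHANGEME"},
}
print(find_placeholders(sample_vault, "vault"))
```

In M4 the `setup_key` hit is expected; the same check could gate M5, where an unfilled key would be a real error.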
## Verified facts (ADR-014)
> verified: caddy-dns/gandi v1.1.0 (2025-07) · module `dns.providers.gandi`, `xcaddy`
> build, PAT via `GANDI_BEARER_TOKEN`, `tls { dns gandi {env.GANDI_BEARER_TOKEN} }` ·
> WebFetch github.com/caddy-dns/gandi · 2026-06-14
> verified: NetBird self-host · Docker Compose (management + signal + relay + coturn +
> dashboard), embedded Dex, ports 80/443 TCP + 3478 UDP, supports an external reverse
> proxy · WebFetch docs.netbird.io/selfhosted · 2026-06-14
> to verify in the plan: exact NetBird compose/`config.yaml`/`dashboard.env` schema for
> the pinned version, the external-proxy config knobs, and which secrets are
> role-generated vs operator-supplied.
## Architecture & data flow
```
road-warrior / ubongo ──TLS──> Caddy (askari:443, netbird.askari.wingu.me)
                               │ cert: ACME DNS-01 via Gandi (vault.gandi.pat)
                               └─> NetBird dashboard + management/signal (HTTP, internal)
NetBird agents ──UDP 3478──> Coturn (STUN/TURN) ; ──relay──> relay
```
- All containers on askari via Docker Compose (rendered by Ansible).
- Caddy and NetBird share a Docker network; only Caddy (80/443) + Coturn (3478) face the
internet (Hetzner Cloud Firewall + the container port mapping).
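
The exposure rule above (only Caddy and Coturn face the internet) can be cross-checked mechanically. This is a sketch under stated assumptions: the service names `caddy` and `coturn` and the exact port mappings are illustrative, and the SSH-from-ubongo rule is left out because it targets the host's sshd rather than a container.

```python
# Inbound ports opened on the Hetzner Cloud Firewall by the M4 amendment.
FIREWALL_OPEN = {(80, "tcp"), (443, "tcp"), (3478, "udp")}

# Assumed published ports per compose service; all other containers
# stay reachable only on the shared Docker network.
PUBLISHED = {
    "caddy": {(80, "tcp"), (443, "tcp")},
    "coturn": {(3478, "udp")},
}


def exposure_mismatches(firewall, published):
    """Compare firewall openings with container port mappings."""
    published_all = set().union(*published.values())
    return {
        "open_but_unpublished": firewall - published_all,
        "published_but_blocked": published_all - firewall,
    }


print(exposure_mismatches(FIREWALL_OPEN, PUBLISHED))
```

Both sets coming back empty means the firewall and the compose port mappings agree; anything else points at drift between the Terraform and the rendered compose file.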
## Roles (units, each testable)
- `docker_host`: `tasks/main.yml` adds the Docker apt repo (pinned), installs
`docker-ce` + `docker-compose-plugin`, and enables the service. Molecule: install, then `docker
--version`. (Tags: `packages` and the role name.)
- `reverse_proxy` — custom image (`.docker/caddy-gandi/Dockerfile`, `xcaddy` +
`caddy-dns/gandi`, built/pushed like the Molecule image); `templates/{docker-compose,
Caddyfile,env}.j2`; route data in `group_vars` (`reverse_proxy__routes`). Molecule:
render + `caddy validate`.
- `netbird`: `templates/{docker-compose.yml,config.yaml,dashboard.env,...}.j2` rendered
from `netbird__*` + vault; the ADR-004 standard files. Deploy mechanics per ADR-004.
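
To make "a `Caddyfile` rendered from route data" concrete, here is a minimal Python sketch of the idea. The real role would use a Jinja template; the `Route` fields (`host`, `upstream`) and the upstream address `dashboard:80` are assumptions for illustration, and only the `tls { dns gandi ... }` line comes from the verified facts above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    host: str      # public hostname Caddy serves
    upstream: str  # assumed internal address on the shared Docker network


def render_caddyfile(routes):
    """Render one site block per route, each using Gandi DNS-01 for TLS."""
    blocks = []
    for route in routes:
        blocks.append(
            f"{route.host} {{\n"
            "\ttls {\n"
            "\t\tdns gandi {env.GANDI_BEARER_TOKEN}\n"
            "\t}\n"
            f"\treverse_proxy {route.upstream}\n"
            "}\n"
        )
    return "\n".join(blocks)


print(render_caddyfile([Route("netbird.askari.wingu.me", "dashboard:80")]))
```

A real NetBird site block will likely need more than one `reverse_proxy` line, since the dashboard and the management/signal services are separate upstreams; the sketch shows only the shape of the route-to-block mapping.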
## Testing
- **Molecule** per role where it fits (docker_host: install; reverse_proxy: `caddy
validate` on the rendered Caddyfile). NetBird's full stack is heavy for a container —
rely on **live verification on askari** (compose up; `curl -sI https://netbird.askari.wingu.me`
→ 200 + valid cert; dashboard loads; `docker compose ps` healthy).
- **Live (gated, on askari):** deploy the three roles; verify the cert issues via DNS-01,
the dashboard is reachable over TLS, and the NetBird services are healthy. (Enrolment is
M5.)
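
The "`docker compose ps` healthy" check can be scripted rather than eyeballed. A minimal sketch, assuming the JSON fields `Service`, `State` and `Health` as emitted by recent Compose v2 releases (to be confirmed against the pinned Docker version); the function names are hypothetical.

```python
import json


def parse_compose_ps(output):
    """Parse `docker compose ps --format json` output.

    Accepts either a single JSON array or one JSON object per line,
    since the shape has differed between Compose releases.
    """
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unhealthy_services(containers):
    """Return services that are not running or report a failing healthcheck."""
    bad = []
    for c in containers:
        running = c.get("State") == "running"
        # Health is empty when the container defines no healthcheck.
        health_ok = c.get("Health", "") in ("", "healthy")
        if not (running and health_ok):
            bad.append(c.get("Service", c.get("Name", "?")))
    return bad
```

Feeding it the captured output of `docker compose ps --format json` on askari and asserting an empty result would turn the live check into a pass/fail step.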
## Scope boundaries — what M4 is NOT
- **Not** enrolment (ubongo + laptops) or narrowing SSH to `wt0`: that is **M5**.
- **Not** the cluster reverse proxy / Authentik forward-auth — **Phase 2** (the
`reverse_proxy` role is built to generalise, but only askari/NetBird is wired now).
- **Not** off-site backup execution of the datastore — pending `fisi` (ADR-022); recorded
as an accepted risk with BACKUP.md authored.
- **Not** auditd/CIS hardening or a host firewall on askari (still M5/Phase 2).
## Open items (resolve in the plan)
- Pin the NetBird version + read its current self-host compose/config; pin Caddy +
`caddy-dns/gandi` versions; pin Docker CE.
- Decide which `netbird` secrets are **role-generated** (turn password, dex secrets — via
`community.general.random_string`/lookups, persisted to vault) vs operator-supplied
(none expected beyond the M5 setup key).
- Confirm the custom Caddy image build/host (local build vs the Forgejo registry, like the
Molecule image).
- `netbird.askari.wingu.me` as an A (to askari's IP) vs CNAME to `askari.wingu.me`.
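
For the role-generated secrets item, the property that matters is generate-once-then-reuse: a re-run must not rotate a secret that is already persisted. The sketch below illustrates that in plain Python with the standard `secrets` module, which is a stand-in for the `community.general.random_string` approach named above, not a replacement for it. The secret names are hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def generate_secret(length=32):
    """Return a random alphanumeric secret from a CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def fill_missing(existing, names):
    """Generate only the secrets that are absent, leaving stored ones untouched."""
    result = dict(existing)
    for name in names:
        if name not in result:
            result[name] = generate_secret()
    return result


# Hypothetical secret names; the real list is an open item for the plan.
stored = {"turn_password": "already-persisted"}
print(sorted(fill_missing(stored, ["turn_password", "dex_client_secret"])))
```

Running `fill_missing` a second time over its own output changes nothing, which is the idempotency the Ansible role would need once the values are persisted to vault.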