boma/docs/superpowers/specs/2026-06-14-netbird-coordinator-m4-design.md
sjat 65cf20a993 docs(spec): M4 — NetBird coordinator on askari + Caddy reverse proxy
Caddy becomes boma's standard reverse proxy (amends the soft Traefik assumption;
new ADR) with Gandi DNS-01 certs (custom xcaddy image, reuses vault.gandi.pat) —
the only cert path for mesh/LAN-only services. NetBird self-hosted in
external-proxy mode (embedded Dex), compose rendered from boma templates
(ADR-004/013). Three roles: docker_host (first real content), reverse_proxy (new,
Caddy), netbird (first service role w/ full ADR-004 standard files). Firewall +
DNS amendments; backup execution deferred (fisi). caddy-dns/gandi + NetBird
self-host facts verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:19:21 +02:00


Design — NetBird coordinator on askari + Caddy reverse proxy (M4)

  • Date: 2026-06-14
  • Status: Draft → straight to plan (per the standing skip-the-spec-review-gate agreement)
  • Roadmap milestone: M4 (docs/ROADMAP.md)
  • Implements: ADR-016 (NetBird coordinator self-hosted on askari), ADR-004 (first service role)
  • Establishes: a new ADR — boma's reverse proxy is Caddy (amends the soft Traefik assumption in the roadmap/ADR-017 prose)

Problem

The NetBird mesh control plane (ADR-016) must run on askari so ubongo + road-warrior laptops can enrol (M5) and reach ubongo from anywhere. This is also boma's first real service role (ADR-004) and its first reverse proxy — so M4 sets two precedents: the service-role pattern, and Caddy as boma's standard reverse proxy.

Decisions (as settled)

  1. Caddy is boma's standard reverse proxy (replaces the soft Traefik assumption — no ADR ever pinned Traefik). Rationale: boma renders all config from Ansible templates (ADR-004), so Traefik's dynamic Docker-label discovery is wasted; Caddy's templated Caddyfile + automatic HTTPS fits the "render from the catalog" model; far simpler for a solo operator; forward_auth to Authentik later keeps the auth story. → small new ADR.
  2. Caddy + Gandi DNS-01 (not HTTP-01). boma's services are mostly mesh/LAN-only with no public DNS record, and you cannot HTTP-01 an unexposed host — DNS-01 is the only cert path for them (the reason M1 built Gandi DNS-01). One mechanism fleet-wide; reuses vault.gandi.pat. Cost: a custom Caddy image (xcaddy + caddy-dns/gandi) — fits boma's "build our own images" pattern (the Molecule image).
  3. NetBird in external-reverse-proxy mode — disable its bundled Traefik; boma's Caddy terminates TLS for netbird.askari.wingu.me and proxies to the NetBird containers. Embedded Dex IdP (ADR-016). The compose + server config are rendered from boma Jinja templates (ADR-004 + ADR-013 translate-don't-transplant), based on NetBird's current self-host reference read at implementation time.
  4. Three roles, applied to askari (offsite_hosts):
    • docker_host — first real content: install Docker engine + compose plugin, version-pinned (ADR-011). (Cluster daemon-hardening + nftables.d integration stay deferred to the cluster.)
    • reverse_proxy (new) — the custom Caddy image + a Caddyfile rendered from route data + .env with GANDI_BEARER_TOKEN={{ vault.gandi.pat }}. boma's standard proxy; generalises to the cluster later (not built now).
    • netbird (new) — boma's first service role: renders the NetBird compose + server config + .env from vault; the full ADR-004 standard files.
  5. Firewall: amend the M2 Hetzner Cloud Firewall (TF offsite) to open 80/443 TCP + 3478 UDP (NetBird's public ports). SSH-from-ubongo stays.
  6. DNS: add netbird.askari.wingu.me → askari's IP via public_dns (M1 role).
  7. Standard service-role files authored, execution deferred. SECURITY/VERIFY/ACCESS/BACKUP.md are written for netbird (the precedent), but /verify-service (playwright) and fisi backup don't exist yet, so BACKUP.md records the datastore plus an accepted risk that off-site backup is pending, and VERIFY.md is authored now and run later.
  8. Setup keys are an M5 artifact (created post-deploy via the dashboard/API). M4 stubs vault.netbird.setup_key: CHANGEME (the placeholder convention) for M5 to fill.
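
A minimal sketch of the data that decisions 4 and 8 imply, assuming hypothetical file paths and route keys (host, upstream). Only reverse_proxy__routes and vault.netbird.setup_key come from this spec; the upstream value is a placeholder until the pinned NetBird compose is read in the plan.

```yaml
# group_vars/offsite_hosts/reverse_proxy.yml  (path is an assumption)
reverse_proxy__routes:
  - host: netbird.askari.wingu.me
    # Placeholder upstream: the real service name and port come from the pinned
    # NetBird self-host reference, and more than one upstream may be needed.
    upstream: "dashboard:80"
---
# group_vars/offsite_hosts/vault.yml  (path is an assumption; encrypted at rest)
vault:
  netbird:
    # Stub following the CHANGEME placeholder convention; M5 fills it in.
    # Sits alongside the existing vault.gandi.pat from M1.
    setup_key: CHANGEME
```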

Verified facts (ADR-014)

verified: caddy-dns/gandi v1.1.0 (2025-07) · module dns.providers.gandi, xcaddy build, PAT via GANDI_BEARER_TOKEN, tls { dns gandi {env.GANDI_BEARER_TOKEN} } · WebFetch github.com/caddy-dns/gandi · 2026-06-14
verified: NetBird self-host · Docker Compose (management + signal + relay + coturn + dashboard), embedded Dex, ports 80/443 TCP + 3478 UDP, supports an external reverse proxy · WebFetch docs.netbird.io/selfhosted · 2026-06-14
to verify in the plan: exact NetBird compose/config.yaml/dashboard.env schema for the pinned version, the external-proxy config knobs, and which secrets are role-generated vs operator-supplied.

Architecture & data flow

road-warrior / ubongo ──TLS──> Caddy (askari:443, netbird.askari.wingu.me)
                                  │  cert: ACME DNS-01 via Gandi (vault.gandi.pat)
                                  └─> NetBird dashboard + management/signal (HTTP, internal)
NetBird agents ──UDP 3478──> Coturn (STUN/TURN) ; ──relay──> relay
  • All containers on askari via Docker Compose (rendered by Ansible).
  • Caddy and NetBird share a Docker network; only Caddy (80/443) + Coturn (3478) face the internet (Hetzner Cloud Firewall + the container port mapping).
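
The exposure rule above can be sketched as the reverse_proxy compose file. This assumes a hypothetical image reference and network name (proxy); the mount paths follow the official Caddy image layout, which the custom xcaddy build is assumed to keep.

```yaml
services:
  caddy:
    # Hypothetical image reference; the real registry and tag are open items.
    image: registry.example.invalid/boma/caddy-gandi:CHANGEME
    restart: unless-stopped
    ports:
      - "80:80"     # ACME redirects / plain HTTP
      - "443:443"   # TLS termination for netbird.askari.wingu.me
    env_file:
      - .env        # carries GANDI_BEARER_TOKEN for the DNS-01 challenge
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data   # issued certificates persist here
    networks:
      - proxy

volumes:
  caddy_data:

networks:
  proxy:
    name: proxy   # assumed shared network; the NetBird compose joins it as external
```

Under this layout the NetBird services would publish no host ports except Coturn's 3478/udp, so dashboard, management and signal stay reachable only through Caddy.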

Roles (units, each testable)

  • docker_host: tasks/main.yml adds the Docker apt repo (pinned), installs docker-ce + docker-compose-plugin, and enables the service. Molecule: install + docker --version. (Tag packages/role-name.)
  • reverse_proxy: custom image (.docker/caddy-gandi/Dockerfile, xcaddy + caddy-dns/gandi, built/pushed like the Molecule image); templates/{docker-compose,Caddyfile,env}.j2; route data in group_vars (reverse_proxy__routes). Molecule: render + caddy validate.
  • netbird: templates/{docker-compose.yml,config.yaml,dashboard.env,...}.j2 rendered from netbird__* + vault; the ADR-004 standard files. Deploy mechanics per ADR-004.
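
A rough sketch of the docker_host tasks described above, assuming askari runs Debian and using hypothetical version variables (docker_host__docker_ce_version, docker_host__compose_version). The real pins are an open item for the plan.

```yaml
# roles/docker_host/tasks/main.yml (sketch; Debian is an assumption)
- name: Ensure the apt keyring directory exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"

- name: Install the Docker apt signing key
  ansible.builtin.get_url:
    url: https://download.docker.com/linux/debian/gpg
    dest: /etc/apt/keyrings/docker.asc
    mode: "0644"

- name: Add the Docker apt repository
  ansible.builtin.apt_repository:
    repo: >-
      deb [signed-by=/etc/apt/keyrings/docker.asc]
      https://download.docker.com/linux/debian
      {{ ansible_distribution_release }} stable
    filename: docker
    state: present

- name: Install the pinned Docker engine and compose plugin
  ansible.builtin.apt:
    name:
      # Hypothetical variables holding exact apt version strings (ADR-011).
      - "docker-ce={{ docker_host__docker_ce_version }}"
      - "docker-compose-plugin={{ docker_host__compose_version }}"
    state: present
    update_cache: true

- name: Enable and start the Docker service
  ansible.builtin.service:
    name: docker
    enabled: true
    state: started
```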

Testing

  • Molecule per role where it fits (docker_host: install; reverse_proxy: caddy validate on the rendered Caddyfile). NetBird's full stack is heavy for a container — rely on live verification on askari (compose up; curl -sI https://netbird.askari.wingu.me → 200 + valid cert; dashboard loads; docker compose ps healthy).
  • Live (gated, on askari): deploy the three roles; verify the cert issues via DNS-01, the dashboard is reachable over TLS, and the NetBird services are healthy. (Enrolment is M5.)
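
The live checks could be captured as Ansible tasks along these lines. This is a sketch only, assuming a hypothetical compose directory (/opt/netbird) and that the HTTPS probe runs from the control node.

```yaml
- name: Dashboard answers 200 over TLS with a trusted certificate
  ansible.builtin.uri:
    url: https://netbird.askari.wingu.me
    method: HEAD
    status_code: 200
    validate_certs: true   # fails if the DNS-01 certificate is missing or untrusted
  delegate_to: localhost

- name: Collect NetBird container state on askari
  ansible.builtin.command:
    cmd: docker compose ps
    chdir: /opt/netbird    # hypothetical deploy path
  register: netbird__compose_ps
  changed_when: false      # read-only check, never reports a change

- name: Show container state for the operator
  ansible.builtin.debug:
    var: netbird__compose_ps.stdout_lines
```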

Scope boundaries — what M4 is NOT

  • Not enrolment (ubongo + laptops) or narrowing SSH to wt0; both are M5.
  • Not the cluster reverse proxy / Authentik forward-auth — Phase 2 (the reverse_proxy role is built to generalise, but only askari/NetBird is wired now).
  • Not off-site backup execution of the datastore — pending fisi (ADR-022); recorded as an accepted risk with BACKUP.md authored.
  • Not auditd/CIS or a host firewall on askari (still M5/Phase 2).

Open items (resolve in the plan)

  • Pin the NetBird version + read its current self-host compose/config; pin Caddy + caddy-dns/gandi versions; pin Docker CE.
  • Decide which netbird secrets are role-generated (turn password, dex secrets — via community.general.random_string/lookups, persisted to vault) vs operator-supplied (none expected beyond the M5 setup key).
  • Confirm the custom Caddy image build/host (local build vs the Forgejo registry, like the Molecule image).
  • Decide whether netbird.askari.wingu.me is an A record (to askari's IP) or a CNAME to askari.wingu.me.
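
For the secrets item, a hedged sketch of role-side generation with the community.general.random_string lookup. The variable names (netbird__turn_password, vault.netbird.turn_password) are hypothetical, and persisting the value back to vault is left to the plan.

```yaml
- name: Use the stored TURN password, or generate one when vault holds none
  ansible.builtin.set_fact:
    # Hypothetical variable names; length and character set are assumptions.
    netbird__turn_password: "{{ vault.netbird.turn_password | default(lookup('community.general.random_string', length=32, special=false)) }}"
  no_log: true   # keep the secret out of task output
```

Without persistence the lookup yields a new value on every run, which is why generation only makes sense together with storing the result in vault.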