Compare commits

8 commits

Author SHA1 Message Date
94dd6da14c docs(netbird): describe gRPC routing as the deployed Content-Type matcher
README/SECURITY said gRPC was path-matched (/management.ManagementService/* etc.);
the deployed Caddy route selects gRPC by Content-Type: application/grpc* (NetBird's
own external-proxy example). Reconciled the prose to what actually runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 07:54:09 +02:00
684718f4a5 docs(netbird): M4b done — STATUS/ROADMAP/risks/friction
netbird_coordinator built + applied to askari (first service role, dashboard live).
STATUS: new "real and working" row + askari/coordinator rows updated. ROADMAP: M4b
done, M5 (peer enrol) next, recorded the v0.72.4 combined-container/embedded-Dex/
no-Coturn reality. accepted-risks R3: Coturn -> STUN wording. FRICTION: single-file
bind-mount stale-inode gotcha + check-before-first-deploy artifact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 07:48:53 +02:00
3a31b8e6f4 fix(reverse_proxy): bind-mount the Caddy config dir so reload sees changes
The Caddyfile was bind-mounted as a single file. ansible.builtin.template writes
atomically (temp + rename), so a re-render swaps the file's inode while the running
container keeps the old one — `caddy reload` then re-read stale config and silently
no-op'd ("config is unchanged"), so new routes never loaded. Surfaced deploying the
NetBird route: Caddy never requested its cert. Fix: render to ./caddy/Caddyfile and
mount the ./caddy DIRECTORY at /etc/caddy — directory mounts reflect inode swaps, so
graceful `caddy reload` works. Proven on askari: atomic replace in the host dir is
visible inside the running container.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 07:44:45 +02:00
0e8d448f2b feat(offsite): apply netbird_coordinator after reverse_proxy
NetBird joins the boma Docker network that reverse_proxy creates, so it's
ordered last. Carries its netbird_coordinator role-name tag (check-tags).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:05:12 +02:00
070d6f293b docs(netbird): service-role standard files (SECURITY/VERIFY/ACCESS/BACKUP)
Author the four ADR-mandated service-role docs for netbird_coordinator and
add the cross-role access__*/backup__* data (ADR-021/022). First stateful
service: backup__state=true; off-site capture pending the fisi pull node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:01:29 +02:00
1333ec181f feat(reverse_proxy): raw-directive route type; wire NetBird (gRPC/WS) route
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:55:05 +02:00
3762be4622 feat(netbird): vault secrets — auth_secret + datastore_key
Self-generated random values for the NetBird coordinator: auth_secret (relay/JWT
shared secret) and datastore_key (SQLite store encryption, base64 32 bytes with
padding). Wired into roles/netbird_coordinator config.yaml via vault.netbird.*.
No CHANGEME — both are agent-generatable (not operator-supplied). The M5 peer
setup key is a runtime dashboard artifact, added to vault when M5 wires it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:52:16 +02:00
ab1b0678ab feat(netbird): coordinator service role (combined server + dashboard, v0.72.4)
First real service role. NetBird v0.72.4 self-hosted control plane: single
netbirdio/netbird-server:0.72.4 (management + signal + relay + STUN + embedded
Dex) plus netbirdio/dashboard:v2.39.0, both on the shared boma Docker network so
the M4a Caddy fronts them. Renders docker-compose.yml + config.yaml (secrets from
vault.netbird.*, no_log) + dashboard.env. STUN 3478/udp host-exposed; everything
else via the proxy. netbird_coordinator__manage gates the compose-up for Molecule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:49:57 +02:00
27 changed files with 741 additions and 61 deletions
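
The stale-config failure described in commit 3a31b8e6f4 can be reproduced without Docker. The sketch below is an analogy, not the role's code: an open file handle stands in for the single-file bind mount (both stay attached to the original inode), and re-opening the path by name stands in for the directory mount. The names `atomic_write` and `demo` are hypothetical, and POSIX rename semantics are assumed.

```python
import os
import tempfile


def atomic_write(path: str, text: str) -> None:
    """Write to a temp file in the same directory, then rename it over `path`."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)  # atomic rename: `path` now names a new inode


def demo() -> dict:
    with tempfile.TemporaryDirectory() as conf_dir:
        path = os.path.join(conf_dir, "Caddyfile")
        with open(path, "w") as handle:
            handle.write("old config\n")

        # Stand-in for the single-file mount: pinned to the original inode.
        pinned = open(path)
        inode_before = os.stat(path).st_ino

        atomic_write(path, "new config\n")
        inode_after = os.stat(path).st_ino

        pinned.seek(0)
        via_pinned = pinned.read()
        pinned.close()

        # Stand-in for the directory mount: the name is resolved again.
        with open(path) as handle:
            via_directory = handle.read()

    return {
        "inode_changed": inode_before != inode_after,
        "via_pinned": via_pinned,
        "via_directory": via_directory,
    }


if __name__ == "__main__":
    print(demo())
```

The pinned handle keeps returning the old text after the rename, which mirrors why an in-container `caddy reload` reported the config as unchanged, while a lookup through the directory sees the new file.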


@ -31,8 +31,9 @@ _Last reviewed: 2026-06-14._
| `make check` / `make deploy PLAYBOOK=<name>` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). |
| `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **Pending:** NetBird mesh enrollment (so SSH is LAN-only); full `base` hardening (only the `firewall` concern exists, and it is NOT applied here — applying default-deny with no mesh would lock out inbound SSH on the physical NIC); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (now relevant — the offsite tfstate exists). |
| `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **Pending:** NetBird coordinator (M4b), host firewall + mesh enrollment (M5), offsite tfstate backup (ADR-022). |
| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). |
| `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **NetBird coordinator (M4b):** `netbird_coordinator` deployed — dashboard live at `https://netbird.askari.wingu.me` (valid LE cert), management API behind embedded Dex (401 unauth), STUN on 3478/udp. **Pending:** host firewall + mesh enrollment (M5 — incl. first-boot `/setup` admin + peer setup keys), offsite tfstate backup (ADR-022). |
| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). The `reverse_proxy` Caddyfile is bind-mounted as a **directory** (`./caddy` → `/etc/caddy`) so atomic re-renders are visible in-container and `caddy reload` actually applies new routes (a single-file mount pinned the stale inode). |
| `roles/netbird_coordinator/` — NetBird control plane (ADR-016, M4b) | **Built + applied (askari, 2026-06-16). boma's FIRST real service role.** Self-hosted NetBird **v0.72.4**: a single combined `netbird-server` container (management + signal + relay + STUN + **embedded Dex IdP** at `/oauth2`) + `dashboard:v2.39.0`, on the shared `boma` network behind the M4a Caddy via gRPC-h2c + WebSocket + path routing (`reverse_proxy__routes` gained a raw-`caddy` route type). Secrets `vault.netbird.{auth_secret,datastore_key}` (self-generated). Carries the full service-role file set (SECURITY/VERIFY/ACCESS/BACKUP) — **first stateful role** (`backup__state: true`; encrypted SQLite at `/var/lib/netbird`, off-site backup pending `fisi`/ADR-022). **Verified live:** dashboard 200 + valid LE cert, `/api` 401 (auth-gated, routes OK), STUN up. **Not yet configured:** first-boot `/setup` admin + peer enrolment = M5. |
## Scaffolded but empty — NOT implemented
@ -68,7 +69,7 @@ askari.)
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
| NetBird mesh — coordinator on `askari` | ADR-016 | **Design RESOLVED** (ADR-016 + spec + plan); resolves ADR-015 deferred #1. Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. **Build pending:** not deployed (askari + service-role machinery not built). |
| NetBird mesh — coordinator on `askari` | ADR-016 | **BUILT + applied (M4b, 2026-06-16)** — moved up to "Real and working today" (`roles/netbird_coordinator/`). Self-hosted control plane on askari; replaces ADR-007 WireGuard. Mesh **peer enrolment = M5** (next row). |
| NetBird agent enrollment in `base` | ADR-016 | **Design RESOLVED** (ADR-016). Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. **Build pending:** base role not built. |
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |


@ -33,6 +33,29 @@ _(append new raw signals here; the next kaizen review consumes them)_
`--password-stdin`, `no_log`) so pushes are agent-completable like every other
vault-backed action.
- `[gotcha]` **Single-file Docker bind mount + atomic config rewrite = stale config in
the running container** (2026-06-16): `reverse_proxy` bind-mounted the Caddyfile as a
single file; `ansible.builtin.template` writes atomically (temp + rename → new inode),
so the running container kept the OLD inode and `caddy reload` (in-container, no restart)
re-read stale config and silently no-op'd (`"config is unchanged"`). The NetBird route
never loaded → Caddy never requested its cert; surfaced only by a TLS handshake failure.
Fix: mount the config **directory** (`./caddy` → `/etc/caddy`) — directory mounts reflect
inode swaps, so live reload works (proven on askari). NOTE the sibling case: NetBird also
single-file-mounts `config.yaml`, but its handler does `docker compose restart` (not an
in-container reload), and a restart DOES re-resolve the bind mount (verified: 0 before,
1 after) — so restart-based roles are safe; only in-place-reload roles need the dir mount.
→ candidate gotcha doc (`docs/testing/gotchas.md`): "reload-in-place needs a directory
mount; restart-based roles are fine with a single-file mount."
- `[friction]` **`make check` always fails on the first-ever deploy of a compose service
role** (2026-06-16): in check mode the "ensure base_dir" task is reported-but-not-run, so
the later `community.docker.docker_compose_v2` up fails with `"…is not a directory"`
(missing `project_src`). Not a defect — a real deploy creates the dir — but it means the
CLAUDE.md "always `make check` before `make deploy`" step is guaranteed-red for any brand
new stateful role, which erodes trust in the check. → candidate: guard the compose-up with
`not ansible_check_mode` (clean "skipped" in dry-run; compose can't be meaningfully
dry-run before first deploy anyway), OR document the one-time expected failure. Decide one.
- `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
(2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said


@ -115,11 +115,15 @@ Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's
valid Let's Encrypt cert (HTTP-01; the Gandi **DNS-01** path is now built + proven —
2026-06-15, see ADR-024 — for mesh/LAN-only cluster services).
Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
`…2026-06-14-m4a-docker-caddy.md`. **M4b (next):** the `netbird_coordinator` service
role — read NetBird's current self-host compose then.
`…2026-06-14-m4a-docker-caddy.md` / `…2026-06-14-m4b-netbird.md`.
Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the
**embedded IdP** (ADR-016 — no Authentik dependency), fronted by the now-proven Caddy.
**M4b — ✅ DONE (2026-06-16):** the `netbird_coordinator` service role, deployed to askari.
Reality differed from the original plan (captured fresh per ADR-014): NetBird **v0.72.4**
ships a **single combined `netbird-server`** container (management + signal + relay + STUN
+ **embedded Dex** IdP at `/oauth2`) plus `dashboard:v2.39.0` — **no separate signal/relay
container and no Coturn**. Fronted by the M4a Caddy via gRPC-h2c + WebSocket + path routing.
Dashboard live at `https://netbird.askari.wingu.me` (valid LE cert); `/api` auth-gated.
**M5 (enrol peers) is next** — incl. the first-boot `/setup` admin + setup keys.
- **First exercise of:** the service-role conventions (`SECURITY.md` / `VERIFY.md` /
`ACCESS.md` / `BACKUP.md`), public **TLS / ACME**, and the **backup contract**


@ -15,7 +15,7 @@ revisit (trigger).
|---|---|---|---|
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and STUN (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh (NetBird v0.72.4 embeds STUN in the combined server — no separate Coturn) | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
| R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |


@ -3,4 +3,15 @@
reverse_proxy__acme_email: admin@wingu.me
reverse_proxy__routes:
- {host: test.askari.wingu.me, respond: "boma reverse proxy"}
# M4b appends: {host: netbird.askari.wingu.me, upstream: "netbird-dashboard:80"}
# NetBird control plane (M4b). gRPC (h2c) + WebSocket + path-based routing to two
# backend containers on the shared `boma` Docker network (ADR-024, ADR-016).
- host: netbird.askari.wingu.me
caddy: |
# gRPC needs HTTP/2 cleartext (h2c) to the backend
@grpc header Content-Type application/grpc*
reverse_proxy @grpc h2c://netbird-server:80
# management REST, embedded IdP, relay + ws-proxy (WebSocket) — same backend
@backend path /relay* /ws-proxy/* /api/* /oauth2/*
reverse_proxy @backend netbird-server:80
# dashboard SPA (everything else)
reverse_proxy /* netbird-dashboard:80
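
The intent of the three `reverse_proxy` directives above can be modelled as a small decision function. This is a rough sketch under stated assumptions: it applies the matchers in the order the comments describe (gRPC by `Content-Type`, then the backend paths, then the dashboard), treats a trailing `*` as a plain prefix match, and ignores Caddy's own directive sorting and case handling. The function and constant names are hypothetical; only the upstream addresses and patterns come from the route.

```python
GRPC_UPSTREAM = "h2c://netbird-server:80"
BACKEND_UPSTREAM = "netbird-server:80"
DASHBOARD_UPSTREAM = "netbird-dashboard:80"

# Patterns copied from the @grpc and @backend matchers in the route.
GRPC_CONTENT_TYPE = "application/grpc*"
BACKEND_PATHS = ("/relay*", "/ws-proxy/*", "/api/*", "/oauth2/*")


def _matches(pattern: str, value: str) -> bool:
    """Trailing-* prefix match, the only wildcard form used in this route."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def pick_upstream(path: str, content_type: str = "") -> str:
    """Return the upstream this sketch would choose for one request."""
    if _matches(GRPC_CONTENT_TYPE, content_type):
        return GRPC_UPSTREAM  # gRPC is selected by header, not by path
    if any(_matches(pattern, path) for pattern in BACKEND_PATHS):
        return BACKEND_UPSTREAM  # REST, embedded IdP, relay, ws-proxy
    return DASHBOARD_UPSTREAM  # everything else is the dashboard SPA


if __name__ == "__main__":
    print(pick_upstream("/management.ManagementService/Sync", "application/grpc"))
    print(pick_upstream("/api/peers", "application/json"))
    print(pick_upstream("/"))
```

Selecting gRPC by header means a gRPC call reaches the h2c upstream whatever its path is, which is the behaviour commit 94dd6da14c reconciles the docs to.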


@ -1,44 +1,70 @@
$ANSIBLE_VAULT;1.1;AES256
32663033666462323861636161306437393231663035646137646130326433366638356632333463
3338333435356631306330376134376139333233336334300a336164376539363833356431633465
61313531366161663761373038613166303132636261363138636438316631313133326265623166
3439643431646261340a383734386163373630633261613231643530393064303431633437343434
34323434346265336535663637326433643837366564363633666132633537313230303731313264
64366364626266663437303032353933653664313932383765346431303035303136326637616131
31666237663930303035306632633765626133346561653434653131323962623730613338343532
63363235326537636434303163646131656535376234353732366264666131366532656333383066
62363631316533333330373763653366376162336531373539666466323934353461666433616231
38333639333831363861326636303434316130353662336235336261346433343539366233643337
64656564613531623533663865366138356633373065643263613832373961653237303831336539
61643363656566396164383236383361643035383233363064313766336561626564366435626539
37376262396234343765313430303736623038353765666337363162643666323766373333306438
63316639363864386662373865396139313933666533333062376266393737356535366164636261
63663764623132356131393966323563313265303261666232623033653136633763653933616166
66373137633536313863646134633435643735356165313863343662393065306336613737356131
36636466626639346238303239326462393966316233343531383137343633626439316130373836
30383434653234343964353633313764386639326130373331343130336432653164383935336663
31343166353833343535373338616464316437386163353865353363386462393038323563633837
30666161626537316532663234366633356336363965396166333062306335346639373262303633
32316262623037336166613466623662383134363463663136353433396237393935313661366461
31336531396163633065346364323037356665383039633437346465306431303530336263356536
33373033313538303464373562336131336238373433656439626462343930356363393033323736
34663666356434393263666633666439373639383336333165663036623332336330626165376634
31333438346132613433313162636439316531303436313436383063646438316366663661373363
37326263343163666230363530313066383534636635636136636261333037333533303937313861
36363137393264313734636165656631643234646634653835656666306635373761356535663232
63626137613839333833303832623135326237616662333563626461636436363562653338343938
39636430313739323965626362623034613364323162376161366236373439373262383036383234
61303962633265326563386139313966626334663865623762326139666535613232373261623264
37623730386661643662396639643737313265646532353561333537316464623064393236356230
63666163346531333363393434643337643038303232333862353831363434313961386133613163
63376264343036373230633130383732316332303437393936646464383630343130316432636134
37643335633763303931623133396166646231616233653533623731643231323331393732383935
38366564626637623737366336356433613435653631653762333833373662643634383265353266
65613736313534646134316333343566313564383838316633386235343136383239633636303862
34623635623530373961313434366562366564373533633839356664643064383139393561373833
66373466323863383734663834613832663339623236656636353032663237623733303136393138
64376331666666633361666538623138393065626664623139333832663930333065393332623235
31633135616633336630353136326463366664646133316637313066303637636231313766616464
30643165323530366631326238636437613664396633613163353536623934313163333330373665
35336364393236313934653339653462663639616562366264313334313062323235346239343534
373835613065313662636665646537633036
30613536623331613935326162646664303565646530376564343633313431636165626431313264
3034343032356461306137346162626637663139653033650a346164393839343264633062663030
34623764363631356239363737393265613961323633316239653032633532623032636632316132
6633306637653839660a656161643030626562653062303762616162393836383635323461303337
64336339633161643664306666306530333363653362313731613231396133346433666437346231
61306232646235396161383634363265623432373737353036656664326630313335313463323462
65383832313066633539643632326563336438616261623630316230643239323435613337303361
30383039653531396264613237396138363836616163626366363063383634323335383038386362
34343064656365336530666435636465393966373434633230373534353261656566656138356636
35303032396365346136383837633134343037343332393863363832616139633964396638643833
30323834303737346266383237643530306366653262633131663332613937626632363732613932
39313864613038356262323465393062636535663138633738376163363930313132323735383565
66353964633230613266663433313234383961666464636561343364303536306334336536373836
63363366333136623137346537623230393935656334396565303231663833386338666636663833
65656633663335316162343438396264663738343861393237353537393432333035393331626566
31383534623339346338646434613230336566363063333137343636666238303637626637363263
30663131636634623861623232303031633862343538386232653637613866376231336465393535
33343164346661626361623136383230393365396636356266656637366537613331343330643532
30653333346235303338396336303965656232323165643465653235616666636631336633613132
30316161626236663365613133336430643030323164366563613633666362666239333164306662
62623163343162373261653332636333373462323635633364356531396538316134356234663861
33363766343666303135356165626136626164666630386161393062636662383835653731373830
65653335373664316638376239633137393537343362346535343138656265323836653332616630
33623938626139343662626639386536626134396663653930373532343865386565356336303334
35343134323765653764613132656430333362373535613662336234366164313965666362613733
66626663353030626332326433623734643961646336613937346637333439333731633438336339
64653537326131623061316239323132663231653863393230653439646266633764633934366563
63656136376531393761306331383866656338306432333966636166643831323065353866313336
39636232306362386439663866653238626532316339346432373933316565386265303739663764
61626337313963363365666438363837353732353938613962303832343639306135363864333536
64306634303235396163663837323931383031616335396438383636336264346361316439326133
64363231636434396166343530643037373232386336623433393436356236643938386331376237
30613164616532356163303430333861613863643132303565346461613433343634353036666638
33333533646263636362393865633166353962653137353831346366336235326465333436333230
62633339313364663765613361316264346161356334313866656436643666393631376433313333
62373361313633663037643038333233346461383732613939343935353635373738326566333838
37313338336532383938303762303062373138353436626462323439356663626431366563633863
38636234383164626636633566643963366466666334323131336166623837313933343262323834
32643134336137616462353862336533653062313664346138383762356663343861386134393361
63303763346161656465626465666463356631336539333234623931373764323638623331396234
65333036653131353737376663633134633238666535636661383530333032333466643438333163
33333231393134393165363036353262323836363965323037396531373865656363366534636263
34633131616637323432396162646466316166373639313731626364633234393339333333373663
38363164353262376338333933383732393631626532313861633465646231666335396233376266
66313165633165313566646266663265633730316330643261373838343665613035373662323365
35333836336235396237333934653766333732383533373732353633383931323232663731393965
39623161613131626562663632373031663234656438316363373462316137646236323438663031
36613666383863623033383231333338613537333565653633616635366463313062613263343938
31633839376164383261333465326538373439653265373665323063623366616366356138666265
34666164393165386566346533396638623464383937623539346234303730626463636435333434
36313466306166333264346533623132306262646538316335343936373862363931303366643765
65396132306664393435643531646637633939616636663933393138383137633536656362386165
35653337326537633539626332333565643831633339393866616164653862306333393531336130
30383561366431303030376436643434643466323562323730633638643663613339386239646562
31353266386164343832376464303962363665316261383031633534333333333766656530306664
31633931653231653530343763383738336333323161663031646331333638356233343661656463
63316234333430643730663661363662373030653730613762663464393937393962373064623631
38343864313764633737303838616133383636666130396339316138346162386438306664306363
32333438383033626235656133356335623834656637386633333839343134363137363266636536
36303235393264653462353833323030333263666464663864623964363738393166613439313639
62656538343662343665356339326364613032363334376232666434613836346638333464623266
65346236393161663562323865613437633863396437383233363532396136336534376431613937
35383231323430323462343861666564663734666564393131313932333831643035303036613430
30313636613939616336323534636131393761356534306332343735616136333531366339343936
65393536393636666639633236303234653766306263393237653437353632373430653438633736
63633139323732653566663062373537376463383439643336383434646533353762623636323031
65323233306666323630366164366331646632303263333665336432396262383138643432666365
37633336663362323132646438363832346438653361346438303630636131646638323461376534
36663333353962636266643336373963623564326366333736393936396333326262
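
The encrypted content above cannot be inspected, but commit 3762be4622 describes `datastore_key` as base64 of 32 bytes with padding. A minimal sketch of generating and checking a value of that shape follows. The helper names are hypothetical and this is not necessarily how the repo produced its key; the format of `auth_secret` is not stated in the commit, so it is left out.

```python
import base64
import binascii
import secrets


def generate_datastore_key() -> str:
    """Return 32 random bytes as standard, padded base64 text."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


def is_valid_datastore_key(value: str) -> bool:
    """True only if `value` is padded base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32


if __name__ == "__main__":
    key = generate_datastore_key()
    print(len(key), is_valid_datastore_key(key))
```

Because 32 is not a multiple of 3, the encoded form is always 44 characters and ends in a single `=`. A tool that strips the padding would produce a value this check rejects.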


@ -1,6 +1,8 @@
---
# offsite.yml — off-site hosts (askari): Docker engine + the Caddy reverse proxy.
# NetBird (M4b) appends to this play. Run: make deploy PLAYBOOK=offsite LIMIT=askari
# offsite.yml — off-site hosts (askari): Docker engine, the Caddy reverse proxy,
# and the NetBird coordinator. Run: make deploy PLAYBOOK=offsite LIMIT=askari
# Order matters: reverse_proxy creates the `boma` Docker network that
# netbird_coordinator joins as external, so it must come after.
- name: Configure off-site hosts
hosts: offsite_hosts
become: true
@ -9,3 +11,5 @@
tags: [docker_host]
- role: reverse_proxy
tags: [reverse_proxy]
- role: netbird_coordinator
tags: [netbird_coordinator]


@ -0,0 +1,47 @@
# Access — netbird_coordinator (NetBird control plane)
Rendered from the role's `access__*` data (`roles/netbird_coordinator/defaults/main.yml`)
— the source of truth that also drives `/check-access`. Regenerate from the data; edit the
data, not the tables. Host: `askari` (off-site Hetzner; ADR-007/016).
## Access paths
The documented ways in, by tier (rendered from `access__*`):
| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh askari` (over the NetBird mesh — pending M5; see notes) |
| secondary | LAN/WAN SSH from `ubongo` | `ssh ansible@askari` (from the control node; Hetzner firewall allows only ubongo's WAN) |
| — | container exec + compose | `docker compose -p netbird -f /opt/services/netbird/docker-compose.yml ps` / `… exec netbird-server sh` |
| — | logs | `docker logs netbird-server` / `docker logs netbird-dashboard` now; Loki labels `{service: netbird}` once the ADR-018 pipeline lands |
| — | admin API | management REST/gRPC API at `https://netbird.askari.wingu.me/api` (and gRPC), via Caddy, **behind embedded-Dex auth** (`access__api.enabled: true`) — admin surface is the dashboard at `https://netbird.askari.wingu.me` |
## Break-glass
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
- **Hetzner rescue system + Cloud Console** (VNC) for `askari` — boot the rescue image
or attach the web console from the Hetzner Cloud panel if SSH is unreachable.
## Operational notes
- **The admin surface is the dashboard, not a raw port.** Day-to-day administration
(peers, setup keys, ACLs, users) is the web dashboard at
`https://netbird.askari.wingu.me`, behind the embedded Dex login. The management REST
API (`/api`) + gRPC are the same control plane the dashboard calls — reachable for
scripting **only with a Dex-issued JWT**; there is no separate unauthenticated admin
port (metrics `:9090` / healthcheck `:9000` are in-container only, never published).
- **First-admin bootstrap is one-shot.** On a fresh deploy the first admin is created via
`https://netbird.askari.wingu.me/setup`, reachable only while zero users exist — it
self-closes after the first account. If you ever lose all admins, recovery means
resetting the datastore (and re-enrolling), not re-opening `/setup`.
- **Mesh not yet enrolled (M5).** Until `askari` joins the NetBird mesh, the `wt0`
primary SSH path does not exist — the only SSH route is the secondary one (from
ubongo's WAN IP, which the Hetzner Cloud Firewall allowlists). Promote `wt0` to primary
once M5 lands. (askari runs the coordinator the mesh depends on, so a coordinator
outage can also take down its own `wt0` path — fall back to LAN/WAN SSH then.)
- **Config wedged / bad render:** `config.yaml` is rendered read-only by Ansible (mode
`0640`, `no_log` — it holds the two vault secrets). To recover, fix the
`netbird_coordinator__*` vars and re-run the role (the `restart netbird` handler
recreates the stack). Note the compose project name is **`netbird`** (the base-dir
basename), not `netbird_coordinator`.
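A minimal sketch of why the project name comes out as `netbird`, assuming Compose derives its default project name from the basename of the project directory. The helper `default_compose_project` is hypothetical and the normalisation is simplified to lower-casing.

```python
from pathlib import PurePosixPath


def default_compose_project(base_dir: str) -> str:
    """Approximate the default Compose project name: the directory basename, lower-cased."""
    return PurePosixPath(base_dir).name.lower()


# The role's base_dir default, taken from the variables table.
print(default_compose_project("/opt/services/netbird"))
```

This is why `docker compose -p netbird ...` is the invocation to use, rather than the role name.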


@ -0,0 +1,55 @@
# Backup — netbird_coordinator (NetBird control plane)
Rendered from the role's `backup__*` data (`roles/netbird_coordinator/defaults/main.yml`)
— the source of truth that also drives `/check-backup`. Regenerate from the data; edit the
data, not the tables. Host: `askari` (off-site Hetzner; ADR-007/016).
This is boma's **first stateful service** (`backup__state: true`). It holds the entire
mesh control-plane state in an encrypted SQLite datastore.
## State captured
Rendered from `backup__*`:
| What | Source | How captured |
|---|---|---|
| datastore volume | `/var/lib/netbird` (Docker named volume `netbird_data`) | file-level, pulled read-only — the SQLite DB (peers, setup keys, ACLs, embedded-IdP users) |
- **Encryption key is part of the backup contract.** The datastore is **encrypted** with
`vault.netbird.datastore_key` (`server.store.encryptionKey`, base64 32 bytes). A
restore needs **both** the captured volume **and** that key. The key already lives in
the Ansible Vault (off-host, in the repo); it is **not** re-captured by the data backup
and must not be — the vault is its own backup. Lose the key and the snapshot is
unreadable.
- **Quiesce:** `false` — SQLite is captured file-level from the named volume. ADR-022
Decision 7 prefers a logical dump; NetBird exposes no dump command and uses an embedded
store, so this is the file-level escape hatch (Decision 7 B). If a live file-level copy
proves inconsistent in practice, flip `backup__quiesce: true` (stop → snapshot →
restart) — the stack tolerates a brief restart.
- **RPO:** ~24 h (nightly; ADR-022 Decision 2) — **once the pipeline exists** (see below).
## Restore procedure
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A. This
renders `config.yaml` with `vault.netbird.datastore_key` from the vault (the *same*
key the snapshot was encrypted under — do not rotate it across a restore).
2. Stop the stack, `restic restore` the latest snapshot for `netbird_coordinator` into
the `netbird_data` volume / `/var/lib/netbird`, then start the stack.
3. No logical dump to replay (file-level store).
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017) — dashboard loads, login via
the embedded IdP works, the management API lists the restored peers/keys.
## Restore notes
- **The encryption key must match the snapshot.** The datastore is unreadable without the
exact `vault.netbird.datastore_key` it was written under. Restore the vault first (or
confirm the key is unchanged) before restoring the data; never rotate the datastore key
as part of a restore.
- **Off-site backup is NOT yet captured — accepted risk.** The restic / `fisi` pull node
(ADR-022 Plan 2) is **not built yet**, so right now this state is **not** backed up
off-host. Until `fisi` lands, a loss of askari loses the mesh control-plane state; the
only recovery is to re-bootstrap a fresh coordinator (`/setup`) and re-enrol peers (M5).
Accepted for now; this record exists so the gap is explicit and `/check-backup` flags
it. Revisit when the `fisi` pull node + restic repo are live.
- **Compose project name is `netbird`** (the base-dir basename), not
`netbird_coordinator` — relevant when stopping the stack to quiesce a restore.


@ -0,0 +1,67 @@
# netbird_coordinator
Self-hosted **NetBird coordinator** — the mesh-VPN control plane (ADR-016). Runs on
`askari` (the off-site Hetzner host) and is the rendezvous point every NetBird peer
talks to. Deployed via Docker Compose (ADR-004), behind the Caddy reverse proxy.
## Architecture — combined server
NetBird's self-hosted stack is now a **single combined server image** plus a separate
dashboard UI — there is no longer a separate signal / relay / coturn / dex container,
and no `turnserver.conf` / `management.json` / `openid-configuration.json`.
| Container | Image | Role |
|---|---|---|
| `netbird-server` | `netbirdio/netbird-server` | Management API + Signal + Relay + STUN + embedded Dex IdP (`/oauth2`), all in one process. Config at `/etc/netbird/config.yaml`. State in the `netbird_data` volume (SQLite). |
| `netbird-dashboard` | `netbirdio/dashboard` | Web UI. Configured purely by environment (`dashboard.env`); a public PKCE OIDC client, so its client secret is intentionally empty. |
Both containers join the **existing external `boma` Docker network** (created by the
`reverse_proxy` role's compose) so Caddy reaches them by container name. The only
host-exposed port is **`3478/udp` (STUN)**; HTTP/gRPC/WS traffic enters via Caddy over
the boma network, not via host ports.
### Reverse-proxy routing (added separately — M4a Caddy)
This role does **not** add the Caddy route. The route is a separate task and must
front three kinds of traffic over the boma network. The first two share the
`netbird-server` backend; the catch-all goes to `netbird-dashboard`:
- Native gRPC (signal + management) — matched by **`Content-Type: application/grpc*`**
(not by path) → `h2c://netbird-server:80`
- HTTP + WebSocket — paths `/relay*`, `/ws-proxy/*`, `/api/*`, `/oauth2/*` → `netbird-server:80`
- Dashboard catch-all — `/*` → `netbird-dashboard:80`
This matches NetBird's own external-proxy Caddy example: gRPC (the
`/management.ManagementService/*` + `/signalexchange.SignalExchange/*` services) is
selected by content-type rather than enumerated by path. gRPC needs HTTP/2 (h2c)
upstream support; WS/gRPC need long timeouts (Caddy sets none by default).
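As an illustration only, the matching order described above can be modelled in a few lines of Python. This is a sketch of the decision logic, not the Caddy configuration itself: the function name `pick_upstream` is hypothetical, and shell-style wildcards stand in for Caddy's matchers.

```python
from fnmatch import fnmatchcase

# Path patterns routed to netbird-server over plain HTTP/WebSocket (from the list above).
SERVER_PATHS = ["/relay*", "/ws-proxy/*", "/api/*", "/oauth2/*"]


def pick_upstream(path: str, content_type: str = "") -> str:
    """Return the upstream a request would reach under the routing described above."""
    # gRPC is selected by its Content-Type prefix, before any path matching.
    if content_type.startswith("application/grpc"):
        return "h2c://netbird-server:80"
    if any(fnmatchcase(path, pattern) for pattern in SERVER_PATHS):
        return "netbird-server:80"
    # Everything else falls through to the dashboard catch-all.
    return "netbird-dashboard:80"


# The method name "Example" is a placeholder, not a real NetBird RPC.
print(pick_upstream("/management.ManagementService/Example", "application/grpc"))
print(pick_upstream("/api/peers", "application/json"))
print(pick_upstream("/peers"))
```

Selecting gRPC by content type means new gRPC services need no route change, at the cost of trusting the client-supplied header to pick the h2c upstream.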
## Variables — `netbird_coordinator__*`
| Variable | Default | Description |
|---|---|---|
| `netbird_coordinator__server_image` | `netbirdio/netbird-server:0.72.4` | Combined server image (pinned; never `latest`) |
| `netbird_coordinator__dashboard_image` | `netbirdio/dashboard:v2.39.0` | Dashboard image (versioned independently of the server) |
| `netbird_coordinator__base_dir` | `/opt/services/netbird` | Working directory for the Compose project |
| `netbird_coordinator__domain` | `netbird.askari.wingu.me` | Public hostname; feeds `exposedAddress`, the OIDC issuer, redirect URIs, and the dashboard endpoints |
| `netbird_coordinator__trusted_proxies` | `["172.16.0.0/12"]` | Source ranges NetBird trusts `X-Forwarded-*` from (`server.reverseProxy.trustedHTTPProxies`). Must cover Caddy's source IP on the boma network — verify the actual bridge subnet at deploy |
| `netbird_coordinator__manage` | `true` | Set `false` in Molecule to render templates without a Docker daemon |
Production overrides live in `inventories/production/group_vars/`.
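A quick way to sanity-check a `netbird_coordinator__trusted_proxies` value is to test whether Caddy's container IP falls inside it. The sketch below uses Python's standard `ipaddress` module; the helper name `proxy_is_trusted` and the sample addresses are assumptions for illustration, and the real Caddy IP should be read from `docker network inspect boma`.

```python
import ipaddress


def proxy_is_trusted(source_ip: str, trusted_ranges: list[str]) -> bool:
    """True if source_ip falls inside at least one of the trusted CIDR ranges."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in trusted_ranges)


DEFAULT_RANGES = ["172.16.0.0/12"]  # the role default

# Hypothetical addresses: one inside the default range, one outside it.
print(proxy_is_trusted("172.18.0.5", DEFAULT_RANGES))
print(proxy_is_trusted("192.168.1.10", DEFAULT_RANGES))
```

The default `/12` is deliberately broad; once the bridge subnet is known, a narrower range can be checked the same way before overriding the variable.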
## Secrets
Two secrets come from the vault and are rendered into the host-side `config.yaml`
(mode 0640, `no_log`); they never touch the work tree or the dashboard:
- `vault.netbird.auth_secret` → `server.authSecret`
- `vault.netbird.datastore_key` → `server.store.encryptionKey` (base64; keep the padding)
The dashboard's OIDC client is a public PKCE client, so `AUTH_CLIENT_SECRET` is
intentionally empty — `dashboard.env` carries no secrets.
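A minimal sketch of generating and checking a datastore key in the documented shape (base64 of 32 bytes, padding kept). The helper names are hypothetical, and nothing here is NetBird's own tooling; it only checks the encoding.

```python
import base64
import binascii
import secrets


def generate_datastore_key() -> str:
    """Return 32 random bytes, base64-encoded with padding."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


def is_valid_datastore_key(key: str) -> bool:
    """True if the key is strict base64 (padding included) that decodes to 32 bytes."""
    try:
        raw = base64.b64decode(key, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32


key = generate_datastore_key()
print(len(key), is_valid_datastore_key(key))
```

Stripping the trailing `=` makes the strict decode fail, which is the failure mode the "keep the padding" note guards against. The Molecule dummy value `ZHVtbXlrZXk=` is valid base64 but decodes to only 8 bytes, so it would not pass this check; it exists only so the template renders.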
## `netbird_coordinator__manage` toggle
Docker operations (`docker compose up`, the restart handler) are gated on
`netbird_coordinator__manage | bool`. Molecule sets it `false` so the role can be tested
(template rendering, directory creation) without a Docker daemon.


@ -0,0 +1,98 @@
# Security — netbird_coordinator (NetBird control plane)
## Exposure
- **Published ports:**
- `443/tcp`: **not host-published**; reached via the M4a Caddy reverse proxy on the
`boma` Docker network. Caddy fronts the dashboard SPA, the management REST API
(`/api`), the embedded Dex IdP (`/oauth2`), native gRPC over h2c (the management +
signal services, matched by `Content-Type: application/grpc*`), and the relay
WebSocket (`/relay*`, `/ws-proxy/*`). TLS terminates at Caddy (Let's Encrypt
HTTP-01); upstreams listen plain `:80` on the internal network only.
- `3478/udp`: **STUN, host-published directly** (`netbird-server`'s only host port),
bypassing Caddy because STUN is UDP and not HTTP.
- The **Hetzner Cloud Firewall already opens 80/443/3478** (done in M4a) — this role
adds **no** new firewall change. The host nftables `firewall_catalog` (ADR-020)
stays empty for askari; the cloud firewall is the authoritative edge here.
- In-container only, never published: metrics `:9090`, healthcheck `:9000`.
- **Auth surface:** the **embedded Dex IdP** shipped inside `netbird-server` (served at
`/oauth2`). The dashboard authenticates as a **public PKCE OIDC client**
(`AUTH_CLIENT_ID=netbird-dashboard`, **no client secret** — intentionally empty). The
management REST/gRPC API is behind Dex-issued JWTs. The **first admin user is created
via a one-time `/setup` page on first boot**, reachable only while zero users exist;
once an admin exists, `/setup` is closed. Peer enrolment uses **setup keys** minted in
the dashboard after login (used in M5, not part of this provisioning).
- **Reachability:** public — askari is internet-facing. The HTTP surface is reachable
only through Caddy (single public entry point, ADR-024); STUN/3478-udp is reachable
directly on askari's public IP. The management API controls the whole mesh, so this is
a deliberate public attack surface (see accepted risk **R3** below).
- **Data sensitivity:** **stateful** — holds the entire mesh control-plane state (peers,
setup keys, ACLs, IdP users) in an **encrypted SQLite datastore** at `/var/lib/netbird`
in the `netbird_data` volume. The datastore is encrypted with
`vault.netbird.datastore_key`; a restore needs **both** the volume **and** that key.
See backup record: `BACKUP.md` (`backup__state: true`).
## Checklist status
Each item from `docs/security/service-checklist.md`:
- [x] Secrets in vault; no default creds; nothing secret in git/images — ✅ two secrets
come from the vault (`vault.netbird.auth_secret`, `vault.netbird.datastore_key`),
rendered into host-side `config.yaml` (mode `0640`, task `no_log: true`). No default
creds: the first admin is bootstrapped interactively via `/setup`; the dashboard's
OIDC client secret is intentionally empty (public PKCE), not a leaked credential.
- [x] Non-root; no `privileged`/host-network unless justified; minimal mounts; caps
dropped — ⚠️ both containers run the upstream images' default user; no `privileged`,
no host networking (bridge `boma`). `netbird-server` mounts the read-only `config.yaml`
(`:ro`) and the `netbird_data` named volume; it publishes only `3478/udp`. Hardening
is the upstream default; revisit if NetBird documents a rootless/cap-drop posture.
- [x] Ports declared; behind reverse proxy + auth if exposed; least-privilege
inter-service reach — ✅ the HTTP surface (443) is behind Caddy + Dex auth; STUN/3478
is intentionally direct (UDP, can't proxy) and opened only at the Hetzner Cloud
Firewall (M4a). Containers reach Caddy by name on the `boma` network; nothing else is
published.
- [x] Image pinned (tag/digest), update path known — ⚠️ stateful tier (ADR-011) — pinned
to exact tags `netbirdio/netbird-server:0.72.4` and `netbirdio/dashboard:v2.39.0`, not
yet `tag@digest`. Watched by DIUN; bumped deliberately on boma's cadence (ADR-011).
Tighten to digests when convenient.
- [x] Logs reviewable; backup/restore covered if stateful — ✅ `docker logs
netbird-server` / `netbird-dashboard` now (json-file driver capped at 500m×2 since the
default never rotates), Loki labels declared for the ADR-018 pipeline. Stateful: backup
is declared in `BACKUP.md` but **not yet captured** (pending the fisi pull node — see
Residual risks).
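The pinning levels in the checklist (floating, exact tag, `tag@digest`) can be told apart mechanically. The sketch below is a hypothetical helper, not part of the repo's tooling, and it only inspects the reference string.

```python
def pin_level(image_ref: str) -> str:
    """Classify an image reference as 'digest', 'tag', or 'unpinned'."""
    if "@sha256:" in image_ref:
        return "digest"
    # Look for the tag only in the last path segment, so a registry port
    # such as "registry.example:5000/app" is not mistaken for a tag.
    name = image_ref.rsplit("/", 1)[-1]
    tag = name.split(":", 1)[1] if ":" in name else ""
    if tag and tag != "latest":
        return "tag"
    return "unpinned"


for ref in ("netbirdio/netbird-server:0.72.4", "netbirdio/dashboard:v2.39.0"):
    print(ref, pin_level(ref))
```

Both current defaults classify as `tag`, which matches the ⚠️ on the checklist item: pinned, but not yet at the digest level the stateful tier asks for.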
## Service-specific hardening
- **Trusted-proxy pinning:** `server.reverseProxy.trustedHTTPProxies` is set from
`netbird_coordinator__trusted_proxies` so NetBird honours `X-Forwarded-*` **only** from
Caddy's source range on the `boma` bridge — rendered via `to_json` so an empty override
becomes `[]` (trust nothing), never YAML `null`. Tighten the range to Caddy's actual
container subnet at deploy (`docker network inspect boma`).
- **`/setup` self-closes:** the one-time admin-bootstrap page is reachable only while the
IdP has zero users — first login closes the window, so there is no standing
unauthenticated admin-creation route.
- **No standing unauthenticated admin surface:** the management REST/gRPC API requires a
Dex-issued JWT; metrics (`:9090`) and healthcheck (`:9000`) are in-container only and
never published (`access__api` describes the authenticated path).
- **Secrets never reach the dashboard or work tree:** `config.yaml` (with both secrets)
is rendered `0640` with `no_log`; `dashboard.env` carries no secrets (public client).
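To make the trusted-proxy rendering point concrete, here is a small sketch contrasting a loop-style render with a whole-list JSON render. Python's `json.dumps` stands in for the template's `to_json` filter, and both function names are hypothetical.

```python
import json


def render_with_loop(proxies: list[str]) -> str:
    """Render a YAML block list item by item, as a template loop would."""
    lines = ["trustedHTTPProxies:"]
    lines += [f"  - {json.dumps(cidr)}" for cidr in proxies]
    return "\n".join(lines)


def render_with_json(proxies: list[str]) -> str:
    """Render the whole list inline as JSON, which is also valid YAML flow syntax."""
    return f"trustedHTTPProxies: {json.dumps(proxies)}"


print(render_with_loop([]))   # a bare key, which YAML reads as null
print(render_with_json([]))   # an explicit empty list
```

With an empty override the loop form leaves a bare `trustedHTTPProxies:` key, while the JSON form always emits an explicit list.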
## Residual / accepted risks
- **Public mesh control plane on askari** — the management API + dashboard (443 via
Caddy) and STUN (3478/udp) are exposed on askari's public IP; the management API
controls the whole mesh. Accepted as **R3** in `docs/security/accepted-risks.md`
(self-hosting = no third-party trust + an off-site control plane that survives a
homelab outage). Mitigated by TLS + embedded-Dex login, trusted-proxy pinning, `base`
hardening, and version-pinned NetBird patched on boma's cadence. Revisit per R3's
trigger (a coordinator compromise / unpatched NetBird CVE, or the management plane
becoming reachable without auth). *(Note: R3's text says "Coturn (UDP 3478)"; the
v0.72.4 combined server actually exposes plain STUN on 3478/udp with no Coturn — same
port and surface, no functional difference to the accepted risk.)*
- **Off-site backup not yet captured** — the service is stateful (`backup__state: true`)
but the restic/`fisi` pull pipeline (ADR-022 Plan 2) is not built. Until then, the
encrypted datastore is **not** backed up off-host: a loss of askari loses the mesh
control-plane state (recoverable only by re-bootstrapping a fresh coordinator and
re-enrolling peers). Accepted for now; revisit when `fisi` lands. See `BACKUP.md`.
- **Images pinned to tags, not digests** — stateful tier wants `tag@digest` (ADR-011);
currently exact tags. Revisit when convenient.


@ -0,0 +1,63 @@
# Verify — netbird_coordinator (NetBird control plane)
> **Authored now, executed later.** This is the acceptance spec for `/verify-service
> netbird_coordinator`. It cannot run yet: it needs the Playwright UI harness (ADR-017)
> **and** a live deploy of this role behind the M4a Caddy on askari. Until both exist,
> treat this as the spec to drive once they do — verification is deferred, not skipped.
NetBird's coordinator does have a real web UI (the dashboard), so this is a genuine
Level-4 UI spec, not just an HTTP/TLS check.
## Critical user journeys
The acceptance criteria — what "working" means. Numbered; action → expected result.
1. **Dashboard loads over a valid LE cert** — request
`https://netbird.askari.wingu.me` → the dashboard SPA renders; the browser shows a
valid Let's Encrypt certificate (trusted chain, SAN matches the host, not expired).
2. **First-boot `/setup` creates the first admin** — on a fresh deploy (zero users),
`https://netbird.askari.wingu.me/setup` is reachable and creating the first admin
account succeeds; re-visiting `/setup` afterwards no longer offers admin creation
(the window self-closes once a user exists).
3. **Login via the embedded Dex IdP succeeds** — logging in with the just-created admin
(OIDC redirect through `/oauth2`, public PKCE client, no client secret) lands on the
dashboard's authenticated home / peers view.
4. **The management API responds behind auth** — an authenticated dashboard session can
list peers / setup keys (the dashboard calls the management REST API at `/api`); an
**unauthenticated** request to `/api/...` is rejected (401/403), confirming the API
is not open.
5. **STUN answers on 3478/udp** — out of band (not browser): a STUN binding request to
`askari:3478/udp` returns a binding response (confirms the host-published UDP port is
live).
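Journey 5 can be driven by any STUN client. As a rough sketch of what such a probe does, the code below builds a bare STUN Binding Request (RFC 5389 header, no attributes) and checks that the reply is a Binding Success Response with the same transaction ID. The function names are hypothetical, and resolving `askari` is assumed to work from wherever the probe runs.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(transaction_id: bytes) -> bytes:
    """20-byte STUN header: type, attribute length (0), magic cookie, transaction ID."""
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def is_binding_success(packet: bytes, transaction_id: bytes) -> bool:
    """True if the packet is a Binding Success Response echoing our transaction ID."""
    if len(packet) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", packet[:8])
    return (
        msg_type == BINDING_SUCCESS
        and cookie == MAGIC_COOKIE
        and packet[8:20] == transaction_id
    )


def probe_stun(host: str, port: int = 3478, timeout: float = 3.0) -> bool:
    """Send one Binding Request over UDP and report whether a valid reply arrived."""
    txid = os.urandom(12)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_binding_request(txid), (host, port))
        try:
            data, _addr = sock.recvfrom(2048)
        except socket.timeout:
            return False
    return is_binding_success(data, txid)
```

Calling `probe_stun("askari")` from outside the host exercises the Hetzner firewall rule and the published UDP port together; a `False` result does not say which of the two is at fault.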
## What good looks like
Key states/screens to confirm (and screenshot):
- The browser padlock shows a valid Let's Encrypt cert for `netbird.askari.wingu.me`.
- The `/setup` page renders the admin-creation form on a fresh deploy, and the dashboard
reports an authenticated session after first login.
- The dashboard's peers/setup-keys view loads its data from the management API (no error
toast, no infinite spinner) — proving the `/api` + gRPC routing through Caddy works.
- An anonymous `/api` request returns 401/403, not data.
## Not browser-verifiable
Route these to the manual-test handoff:
- **STUN on 3478/udp** (journey 5) — UDP, not HTTP; verify with a STUN client, not a
browser.
- **gRPC over h2c** (management + signal exchange) and the **relay WebSocket** — exercised
end-to-end only by a real peer enrolling (M5), not by a headless dashboard session.
- **Peer enrolment via setup keys** — depends on the M5 client work; out of scope here.
- **Datastore encryption / restore** — proven by the `BACKUP.md` restore drill, not the UI.
## Test data
This service runs **only on production askari** — there is no staging Authentik group and
no SSO in front of it (it ships its own embedded IdP). The journeys provision their own:
- A **fresh deploy with zero users** so journey 2 (`/setup`) is reachable; journey 2
itself creates the single admin account used by journeys 3–4. No pre-seeded peers.
- Public DNS A-record for `netbird.askari.wingu.me` pointing at askari (so Caddy's
HTTP-01 cert can issue) — already provisioned with the M4a Caddy.


@ -0,0 +1,51 @@
---
# NetBird coordinator (self-hosted mesh-VPN control plane, ADR-016).
# Combined server image (Management + Signal + Relay + STUN) plus the dashboard UI.
netbird_coordinator__server_image: "netbirdio/netbird-server:0.72.4"
netbird_coordinator__dashboard_image: "netbirdio/dashboard:v2.39.0"
netbird_coordinator__base_dir: /opt/services/netbird
netbird_coordinator__domain: netbird.askari.wingu.me
# Source IP ranges Caddy fronts NetBird from, rendered into config.yaml
# server.reverseProxy.trustedHTTPProxies. NetBird trusts X-Forwarded-* only from
# these. MUST cover the Caddy container's source IP on the boma Docker network —
# verify the actual bridge subnet at deploy (docker network inspect boma) and tighten.
netbird_coordinator__trusted_proxies: ["172.16.0.0/12"]
netbird_coordinator__manage: true # set false in Molecule to render without Docker
# access__*/backup__* are the ADR-021/022 CROSS-ROLE conventions — shared field names that
# render ACCESS.md/BACKUP.md and drive /check-access · /check-backup. They intentionally do
# NOT carry the netbird_coordinator__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
# (ansible-lint's role-prefix rule has no per-prefix allowlist; keeping it enabled elsewhere).
# Operational-access record (ADR-021) — source of truth for ACCESS.md + /check-access.
# Compose project name defaults to the base_dir basename (= "netbird"), not the role name.
access__service: netbird_coordinator # noqa: var-naming[no-role-prefix]
access__compose_project: netbird # noqa: var-naming[no-role-prefix]
access__compose_path: "{{ netbird_coordinator__base_dir }}/docker-compose.yml" # noqa: var-naming[no-role-prefix]
access__containers: [netbird-server, netbird-dashboard] # noqa: var-naming[no-role-prefix]
access__log: # noqa: var-naming[no-role-prefix]
  loki_labels: { service: netbird } # intent; Loki/Alloy pipeline is ADR-018 (pending)
access__api: # noqa: var-naming[no-role-prefix]
  enabled: true
  # Management REST API at /api (+ gRPC), via Caddy, behind the embedded Dex IdP.
  # Needs a Dex-issued JWT — no unauthenticated admin port (metrics :9090 / health
  # :9000 are in-container only). Admin surface is the dashboard at the same host.
  base_url: "https://{{ netbird_coordinator__domain }}"
  health_path: "/api"
  auth:
    vault_ref: null # no static token — auth is a per-session Dex-issued JWT (dashboard login)
    note: "Bearer JWT from the embedded Dex IdP; /check-access can't curl this unauthenticated"
# Backup contract (ADR-022). STATEFUL — boma's first. Encrypted SQLite datastore in the
# netbird_data volume (/var/lib/netbird): peers, setup keys, ACLs, embedded-IdP users.
# Decryptable only with vault.netbird.datastore_key (lives in the vault, its own backup).
# Off-site capture is PENDING the fisi pull node + restic repo (ADR-022 Plan 2, not built)
# — an accepted gap for now; see BACKUP.md.
backup__service: netbird_coordinator # noqa: var-naming[no-role-prefix]
backup__state: true # noqa: var-naming[no-role-prefix]
backup__paths: # noqa: var-naming[no-role-prefix]
  - /var/lib/netbird # netbird_data named volume — encrypted SQLite store
backup__dumps: [] # noqa: var-naming[no-role-prefix] # embedded SQLite, no logical dump cmd
backup__quiesce: false # noqa: var-naming[no-role-prefix] # file-level copy; flip true if inconsistent


@ -0,0 +1,7 @@
---
- name: Restart netbird
  listen: restart netbird
  community.docker.docker_compose_v2:
    project_src: "{{ netbird_coordinator__base_dir }}"
    state: restarted
  when: netbird_coordinator__manage | bool


@ -0,0 +1,15 @@
---
galaxy_info:
  author: sjat
  description: >-
    Self-hosted NetBird control plane (ADR-016): combined server image
    (Management API + Signal + Relay + STUN + embedded Dex IdP) plus dashboard
    UI, run on askari via Docker Compose behind the Caddy reverse proxy. Stateful
    (encrypted SQLite store). Pinned images; secrets from vault.
  license: MIT
  min_ansible_version: "2.17"
  platforms:
    - name: Debian
      versions:
        - trixie
dependencies: []


@ -0,0 +1,16 @@
---
- name: Converge
  hosts: all
  gather_facts: true
  vars:
    netbird_coordinator__manage: false
    # Dummy vault values so the secret-bearing templates render under Molecule.
    # (datastore_key must be valid base64 — NetBird decodes it on the real host.)
    vault:
      netbird:
        auth_secret: "dummy-auth-secret"
        datastore_key: "ZHVtbXlrZXk="
  roles:
    - role: netbird_coordinator


@ -0,0 +1,31 @@
---
dependency:
  name: galaxy
  options:
    requirements-file: ../../requirements.yml

driver:
  name: docker

platforms:
  - name: instance
    # Project-owned image built from .docker/molecule-debian13/Dockerfile
    # and hosted in the Forgejo container registry.
    # Build/push with: make molecule-image / make molecule-image-push
    image: forgejo.nyumbani.baobab.band/sjat/molecule-debian13:latest
    pre_build_image: true
    privileged: true # required for systemd
    cgroupns_mode: host
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    command: /lib/systemd/systemd

provisioner:
  name: ansible
  inventory:
    host_vars:
      instance:
        ansible_user: root

verifier:
  name: ansible


@ -0,0 +1,32 @@
---
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Slurp the rendered config.yaml
      ansible.builtin.slurp:
        src: /opt/services/netbird/config.yaml
      register: _config
    - name: Assert config.yaml has expected content
      ansible.builtin.assert:
        that:
          - _config.content | b64decode | length > 0
          - "'netbird.askari.wingu.me' in (_config.content | b64decode)"
          - "'engine: \"sqlite\"' in (_config.content | b64decode)"
          - "'/oauth2' in (_config.content | b64decode)"
        fail_msg: "config.yaml is missing expected content"
        success_msg: "config.yaml rendered correctly"
    - name: Slurp the rendered docker-compose.yml
      ansible.builtin.slurp:
        src: /opt/services/netbird/docker-compose.yml
      register: _compose
    - name: Assert compose pins both image tags
      ansible.builtin.assert:
        that:
          - _compose.content | b64decode | length > 0
          - "'0.72.4' in (_compose.content | b64decode)"
          - "'v2.39.0' in (_compose.content | b64decode)"
        fail_msg: "docker-compose.yml is missing pinned image tags"
        success_msg: "docker-compose.yml pins both image tags"


@ -0,0 +1,38 @@
---
- name: Ensure the service directory exists
  ansible.builtin.file:
    path: "{{ netbird_coordinator__base_dir }}"
    state: directory
    mode: "0750"
  tags: [config]

- name: Render the combined server config
  ansible.builtin.template:
    src: config.yaml.j2
    dest: "{{ netbird_coordinator__base_dir }}/config.yaml"
    mode: "0640"
  no_log: true # holds authSecret + datastore encryption key
  notify: restart netbird
  tags: [config]

- name: Render the dashboard env file
  ansible.builtin.template:
    src: dashboard.env.j2
    dest: "{{ netbird_coordinator__base_dir }}/dashboard.env"
    mode: "0644"
  notify: restart netbird
  tags: [config]

- name: Render the compose file
  ansible.builtin.template:
    src: docker-compose.yml.j2
    dest: "{{ netbird_coordinator__base_dir }}/docker-compose.yml"
    mode: "0644"
  tags: [config]

- name: Bring the NetBird coordinator up
  community.docker.docker_compose_v2:
    project_src: "{{ netbird_coordinator__base_dir }}"
    state: present
  when: netbird_coordinator__manage | bool
  tags: [deploy]


@ -0,0 +1,26 @@
# {{ ansible_managed }}
server:
  listenAddress: ":80"
  exposedAddress: "https://{{ netbird_coordinator__domain }}:443"
  stunPorts: [3478]
  metricsPort: 9090
  healthcheckAddress: ":9000"
  logLevel: "info"
  logFile: "console"
  authSecret: "{{ vault.netbird.auth_secret }}"
  dataDir: "/var/lib/netbird"
  auth:
    issuer: "https://{{ netbird_coordinator__domain }}/oauth2"
    signKeyRefreshEnabled: true
    dashboardRedirectURIs:
      - "https://{{ netbird_coordinator__domain }}/nb-auth"
      - "https://{{ netbird_coordinator__domain }}/nb-silent-auth"
    cliRedirectURIs:
      - "http://localhost:53000/"
  reverseProxy:
    # to_json (not a loop) so an empty override renders [] not YAML null —
    # null would mean "trust no proxy" and silently break X-Forwarded-* from Caddy.
    trustedHTTPProxies: {{ netbird_coordinator__trusted_proxies | to_json }}
  store:
    engine: "sqlite"
    encryptionKey: "{{ vault.netbird.datastore_key }}"


@ -0,0 +1,13 @@
# {{ ansible_managed }}
NETBIRD_MGMT_API_ENDPOINT=https://{{ netbird_coordinator__domain }}
NETBIRD_MGMT_GRPC_API_ENDPOINT=https://{{ netbird_coordinator__domain }}
AUTH_AUDIENCE=netbird-dashboard
AUTH_CLIENT_ID=netbird-dashboard
AUTH_CLIENT_SECRET=
AUTH_AUTHORITY=https://{{ netbird_coordinator__domain }}/oauth2
USE_AUTH0=false
AUTH_SUPPORTED_SCOPES=openid profile email groups
AUTH_REDIRECT_URI=/nb-auth
AUTH_SILENT_REDIRECT_URI=/nb-silent-auth
NGINX_SSL_PORT=443
LETSENCRYPT_DOMAIN=none


@ -0,0 +1,33 @@
# {{ ansible_managed }}
services:
  dashboard:
    image: "{{ netbird_coordinator__dashboard_image }}"
    container_name: netbird-dashboard
    restart: unless-stopped
    env_file: [./dashboard.env]
    networks: [boma]
    # Cap json logs — Docker's default driver never rotates. Interim until ADR-018
    # (Alloy log shipping) lands; consider back-porting this to reverse_proxy too.
    logging:
      driver: json-file
      options: {max-size: "500m", max-file: "2"}
  netbird-server:
    image: "{{ netbird_coordinator__server_image }}"
    container_name: netbird-server
    restart: unless-stopped
    command: ["--config", "/etc/netbird/config.yaml"]
    ports:
      - "3478:3478/udp"
    volumes:
      - netbird_data:/var/lib/netbird
      - ./config.yaml:/etc/netbird/config.yaml:ro
    networks: [boma]
    logging:
      driver: json-file
      options: {max-size: "500m", max-file: "2"}
volumes:
  netbird_data:
networks:
  boma:
    external: true
    name: boma


@ -11,6 +11,10 @@
         upstream: "app:80"
       - host: t.example.test
         respond: "ok"
+      - host: grpc.example.test
+        caddy: |
+          @grpc header Content-Type application/grpc*
+          reverse_proxy @grpc h2c://backend:80
   roles:
     - role: reverse_proxy


@ -6,7 +6,7 @@
   tasks:
     - name: Slurp the rendered Caddyfile
       ansible.builtin.slurp:
-        src: /opt/services/reverse_proxy/Caddyfile
+        src: /opt/services/reverse_proxy/caddy/Caddyfile
       register: _caddyfile
     - name: Assert Caddyfile exists and contains expected content
       ansible.builtin.assert:
@ -15,5 +15,7 @@
           - "'app.example.test' in (_caddyfile.content | b64decode)"
           - "'reverse_proxy app:80' in (_caddyfile.content | b64decode)"
           - "'respond \"ok\" 200' in (_caddyfile.content | b64decode)"
+          - "'grpc.example.test' in (_caddyfile.content | b64decode)"
+          - "'reverse_proxy @grpc h2c://backend:80' in (_caddyfile.content | b64decode)"
         fail_msg: "Caddyfile is missing expected content"
         success_msg: "Caddyfile rendered correctly"


@ -6,10 +6,21 @@
     mode: "0750"
   tags: [config]
+- name: Ensure the Caddy config directory exists
+  ansible.builtin.file:
+    path: "{{ reverse_proxy__base_dir }}/caddy"
+    state: directory
+    mode: "0750"
+  tags: [config]
+# Render into a directory that is bind-mounted whole (./caddy -> /etc/caddy). Mounting
+# the directory, not the single file, means an atomic template rewrite (which swaps the
+# file inode) stays visible inside the running container, so `caddy reload` picks it up.
+# A single-file bind mount pins the original inode and reload silently no-ops (ADR-024).
 - name: Render the Caddyfile
   ansible.builtin.template:
     src: Caddyfile.j2
-    dest: "{{ reverse_proxy__base_dir }}/Caddyfile"
+    dest: "{{ reverse_proxy__base_dir }}/caddy/Caddyfile"
     mode: "0644"
   notify: reload caddy
   tags: [config]
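
The inode behaviour described in the task comment above can be reproduced without Docker. In this sketch, which assumes a POSIX filesystem, an already-open file handle stands in for the single-file bind mount (both keep pointing at the original inode), while re-opening the path stands in for a lookup through the mounted directory. The names `atomic_write` and `demo` are hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, content: str) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as tmp:
        tmp.write(content)
    os.replace(tmp_path, path)  # atomic rename: the path now names a new inode


def demo() -> dict:
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "Caddyfile")
        with open(path, "w") as f:
            f.write("old config\n")
        inode_before = os.stat(path).st_ino

        pinned = open(path)  # analogy for a single-file bind mount pinning the inode
        atomic_write(path, "new config\n")
        inode_after = os.stat(path).st_ino

        pinned_view = pinned.read()  # still the original inode's content
        pinned.close()
        with open(path) as f:  # fresh lookup by name, as through a directory mount
            directory_view = f.read()

    return {
        "inode_changed": inode_before != inode_after,
        "pinned_view": pinned_view,
        "directory_view": directory_view,
    }


print(demo())
```

The handle opened before the rename keeps returning the old content, which mirrors the stale config that made the reload a no-op; the lookup by path sees the new file.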


@ -9,11 +9,13 @@
 {% endif %}
 }
 {% for r in reverse_proxy__routes %}
-{{ r.host }} {
-{% if r.upstream is defined %}
-  reverse_proxy {{ r.upstream }}
+{{ r['host'] }} {
+{% if r['caddy'] is defined %}
+{{ r['caddy'] | trim | indent(2, first=true) }}
+{% elif r['upstream'] is defined %}
+  reverse_proxy {{ r['upstream'] }}
 {% else %}
-  respond "{{ r.respond | default('boma') }}" 200
+  respond "{{ r['respond'] | default('boma') }}" 200
 {% endif %}
 }
 {% endfor %}


@ -12,7 +12,7 @@ services:
       - ./env
 {% endif %}
     volumes:
-      - ./Caddyfile:/etc/caddy/Caddyfile:ro
+      - ./caddy:/etc/caddy:ro
       - caddy_data:/data
       - caddy_config:/config
     networks: