docs(netbird): service-role standard files (SECURITY/VERIFY/ACCESS/BACKUP)
Author the four ADR-mandated service-role docs for netbird_coordinator and add the cross-role access__*/backup__* data (ADR-021/022). First stateful service: backup__state=true; off-site capture pending the fisi pull node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 1333ec181f
commit 070d6f293b
6 changed files with 303 additions and 4 deletions
47
roles/netbird_coordinator/ACCESS.md
Normal file
@@ -0,0 +1,47 @@
# Access — netbird_coordinator (NetBird control plane)

Rendered from the role's `access__*` data (`roles/netbird_coordinator/defaults/main.yml`)
— the source of truth that also drives `/check-access`. Regenerate from the data; edit the
data, not the tables. Host: `askari` (off-site Hetzner; ADR-007/016).

## Access paths

The documented ways in, by tier (rendered from `access__*`):

| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh askari` (over the NetBird mesh — pending M5; see notes) |
| secondary | LAN/WAN SSH from `ubongo` | `ssh ansible@askari` (from the control node; Hetzner firewall allows only ubongo's WAN) |
| — | container exec + compose | `docker compose -p netbird -f /opt/services/netbird/docker-compose.yml ps` / `… exec netbird-server sh` |
| — | logs | `docker logs netbird-server` / `docker logs netbird-dashboard` now; Loki labels `{service: netbird}` once the ADR-018 pipeline lands |
| — | admin API | management REST/gRPC API at `https://netbird.askari.wingu.me/api` (and gRPC), via Caddy, **behind embedded-Dex auth** (`access__api.enabled: true`) — admin surface is the dashboard at `https://netbird.askari.wingu.me` |

## Break-glass

Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):

- **Hetzner rescue system + Cloud Console** (VNC) for `askari` — boot the rescue image
  or attach the web console from the Hetzner Cloud panel if SSH is unreachable.

## Operational notes

- **The admin surface is the dashboard, not a raw port.** Day-to-day administration
  (peers, setup keys, ACLs, users) is the web dashboard at
  `https://netbird.askari.wingu.me`, behind the embedded Dex login. The management REST
  API (`/api`) + gRPC are the same control plane the dashboard calls — reachable for
  scripting **only with a Dex-issued JWT**; there is no separate unauthenticated admin
  port (metrics `:9090` / healthcheck `:9000` are in-container only, never published).
- **First-admin bootstrap is one-shot.** On a fresh deploy the first admin is created via
  `https://netbird.askari.wingu.me/setup`, reachable only while zero users exist — it
  self-closes after the first account. If you ever lose all admins, recovery means
  resetting the datastore (and re-enrolling), not re-opening `/setup`.
- **Mesh not yet enrolled (M5).** Until `askari` joins the NetBird mesh, the `wt0`
  primary SSH path does not exist — the only SSH route is the secondary one (from
  ubongo's WAN IP, which the Hetzner Cloud Firewall allowlists). Promote `wt0` to primary
  once M5 lands. (askari runs the coordinator the mesh depends on, so a coordinator
  outage can also take down its own `wt0` path — fall back to LAN/WAN SSH then.)
- **Config wedged / bad render:** `config.yaml` is rendered read-only by Ansible (mode
  `0640`, `no_log` — it holds the two vault secrets). To recover, fix the
  `netbird_coordinator__*` vars and re-run the role (the `restart netbird` handler
  recreates the stack). Note the compose project name is **`netbird`** (the base-dir
  basename), not `netbird_coordinator`.
55
roles/netbird_coordinator/BACKUP.md
Normal file
@@ -0,0 +1,55 @@
# Backup — netbird_coordinator (NetBird control plane)

Rendered from the role's `backup__*` data (`roles/netbird_coordinator/defaults/main.yml`)
— the source of truth that also drives `/check-backup`. Regenerate from the data; edit the
data, not the tables. Host: `askari` (off-site Hetzner; ADR-007/016).

This is boma's **first stateful service** (`backup__state: true`). It holds the entire
mesh control-plane state in an encrypted SQLite datastore.

## State captured

Rendered from `backup__*`:

| What | Source | How captured |
|---|---|---|
| datastore volume | `/var/lib/netbird` (Docker named volume `netbird_data`) | file-level, pulled read-only — the SQLite DB (peers, setup keys, ACLs, embedded-IdP users) |

- **Encryption key is part of the backup contract.** The datastore is **encrypted** with
  `vault.netbird.datastore_key` (`server.store.encryptionKey`, base64 32 bytes). A
  restore needs **both** the captured volume **and** that key. The key already lives in
  the Ansible Vault (off-host, in the repo); it is **not** re-captured by the data backup
  and must not be — the vault is its own backup. Lose the key and the snapshot is
  unreadable.
- **Quiesce:** `false` — SQLite is captured file-level from the named volume. ADR-022
  Decision 7 prefers a logical dump; NetBird exposes no dump command and uses an embedded
  store, so this is the file-level escape hatch (Decision 7 B). If a live file-level copy
  proves inconsistent in practice, flip `backup__quiesce: true` (stop → snapshot →
  restart) — the stack tolerates a brief restart.
- **RPO:** ~24 h (nightly; ADR-022 Decision 2) — **once the pipeline exists** (see below).
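
The key format noted above (base64, 32 bytes once decoded) can be sanity-checked before a restore is attempted. A minimal sketch in Python, assuming the key is available as a base64 string; `is_valid_datastore_key` is a hypothetical helper for illustration, not part of NetBird or of this role:

```python
import base64
import binascii


def is_valid_datastore_key(key_b64: str) -> bool:
    """Return True if key_b64 is strict base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.b64decode(key_b64, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64 (bad characters or padding).
        return False
    return len(raw) == 32


if __name__ == "__main__":
    # All-zero demo key for illustration only; never a real secret.
    demo_key = base64.b64encode(bytes(32)).decode()
    print(is_valid_datastore_key(demo_key))
```

This only checks the shape of the key. It cannot tell whether the key is the one the snapshot was written under; that still depends on the vault being unchanged.
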

## Restore procedure

1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A. This
   renders `config.yaml` with `vault.netbird.datastore_key` from the vault (the *same*
   key the snapshot was encrypted under — do not rotate it across a restore).
2. Stop the stack, `restic restore` the latest snapshot for `netbird_coordinator` into
   the `netbird_data` volume / `/var/lib/netbird`, then start the stack.
3. No logical dump to replay (file-level store).
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017) — dashboard loads, login via
   the embedded IdP works, the management API lists the restored peers/keys.

## Restore notes

- **The encryption key must match the snapshot.** The datastore is unreadable without the
  exact `vault.netbird.datastore_key` it was written under. Restore the vault first (or
  confirm the key is unchanged) before restoring the data; never rotate the datastore key
  as part of a restore.
- **Off-site backup is NOT yet captured — accepted risk.** The restic / `fisi` pull node
  (ADR-022 Plan 2) is **not built yet**, so right now this state is **not** backed up
  off-host. Until `fisi` lands, a loss of askari loses the mesh control-plane state; the
  only recovery is to re-bootstrap a fresh coordinator (`/setup`) and re-enrol peers (M5).
  Accepted for now; this record exists so the gap is explicit and `/check-backup` flags
  it. Revisit when the `fisi` pull node + restic repo are live.
- **Compose project name is `netbird`** (the base-dir basename), not
  `netbird_coordinator` — relevant when stopping the stack to quiesce a restore.
98
roles/netbird_coordinator/SECURITY.md
Normal file
@@ -0,0 +1,98 @@
# Security — netbird_coordinator (NetBird control plane)

## Exposure

- **Published ports:**
  - `443/tcp` — **not host-published**; reached via the M4a Caddy reverse proxy on the
    `boma` Docker network. Caddy fronts the dashboard SPA, the management REST API
    (`/api`), the embedded Dex IdP (`/oauth2`), native gRPC over h2c
    (`/management.ManagementService/*`, `/signalexchange.SignalExchange/*`), and the
    relay WebSocket (`/relay*`, `/ws-proxy/*`). TLS terminates at Caddy (Let's Encrypt
    HTTP-01); upstreams listen plain `:80` on the internal network only.
  - `3478/udp` — **STUN, host-published directly** (`netbird-server`'s only host port),
    bypassing Caddy because STUN is UDP and not HTTP.
  - The **Hetzner Cloud Firewall already opens 80/443/3478** (done in M4a) — this role
    adds **no** new firewall change. The host nftables `firewall_catalog` (ADR-020)
    stays empty for askari; the cloud firewall is the authoritative edge here.
  - In-container only, never published: metrics `:9090`, healthcheck `:9000`.
- **Auth surface:** the **embedded Dex IdP** shipped inside `netbird-server` (served at
  `/oauth2`). The dashboard authenticates as a **public PKCE OIDC client**
  (`AUTH_CLIENT_ID=netbird-dashboard`, **no client secret** — intentionally empty). The
  management REST/gRPC API is behind Dex-issued JWTs. The **first admin user is created
  via a one-time `/setup` page on first boot**, reachable only while zero users exist;
  once an admin exists, `/setup` is closed. Peer enrolment uses **setup keys** minted in
  the dashboard after login (used in M5, not part of this provisioning).
- **Reachability:** public — askari is internet-facing. The HTTP surface is reachable
  only through Caddy (single public entry point, ADR-024); STUN/3478-udp is reachable
  directly on askari's public IP. The management API controls the whole mesh, so this is
  a deliberate public attack surface (see accepted risk **R3** below).
- **Data sensitivity:** **stateful** — holds the entire mesh control-plane state (peers,
  setup keys, ACLs, IdP users) in an **encrypted SQLite datastore** at `/var/lib/netbird`
  in the `netbird_data` volume. The datastore is encrypted with
  `vault.netbird.datastore_key`; a restore needs **both** the volume **and** that key.
  See backup record: `BACKUP.md` (`backup__state: true`).

## Checklist status

Each item from `docs/security/service-checklist.md`:

- [x] Secrets in vault; no default creds; nothing secret in git/images — ✅ two secrets
  come from the vault (`vault.netbird.auth_secret`, `vault.netbird.datastore_key`),
  rendered into host-side `config.yaml` (mode `0640`, task `no_log: true`). No default
  creds: the first admin is bootstrapped interactively via `/setup`; the dashboard's
  OIDC client secret is intentionally empty (public PKCE), not a leaked credential.
- [x] Non-root; no `privileged`/host-network unless justified; minimal mounts; caps
  dropped — ⚠️ both containers run the upstream images' default user; no `privileged`,
  no host networking (bridge `boma`). `netbird-server` mounts the read-only `config.yaml`
  (`:ro`) and the `netbird_data` named volume; it publishes only `3478/udp`. Hardening
  is the upstream default; revisit if NetBird documents a rootless/cap-drop posture.
- [x] Ports declared; behind reverse proxy + auth if exposed; least-privilege
  inter-service reach — ✅ the HTTP surface (443) is behind Caddy + Dex auth; STUN/3478
  is intentionally direct (UDP, can't proxy) and opened only at the Hetzner Cloud
  Firewall (M4a). Caddy reaches the containers by name on the `boma` network; nothing
  else is published.
- [x] Image pinned (tag/digest), update path known — ⚠️ stateful tier (ADR-011) — pinned
  to exact tags `netbirdio/netbird-server:0.72.4` and `netbirdio/dashboard:v2.39.0`, not
  yet `tag@digest`. Watched by DIUN; bumped deliberately on boma's cadence (ADR-011).
  Tighten to digests when convenient.
- [x] Logs reviewable; backup/restore covered if stateful — ✅ `docker logs
  netbird-server` / `netbird-dashboard` now (json-file driver capped at 500m×2 since the
  default never rotates), Loki labels declared for the ADR-018 pipeline. Stateful: backup
  is declared in `BACKUP.md` but **not yet captured** (pending the fisi pull node — see
  Residual risks).
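
The tag versus `tag@digest` distinction in the image-pinning item can be checked mechanically. A minimal sketch in Python, assuming simple `repo:tag` references without a registry `host:port` prefix; `pin_level` and its return labels are hypothetical, not part of DIUN or ADR-011:

```python
import re

# repo:tag@sha256:<64 lowercase hex chars>
_DIGEST_RE = re.compile(r"^[^:@\s]+:[^:@\s]+@sha256:[0-9a-f]{64}$")
# repo:tag (the tag is captured so "latest" can be rejected)
_TAG_RE = re.compile(r"^[^:@\s]+:(?P<tag>[^:@\s]+)$")


def pin_level(image_ref: str) -> str:
    """Classify an image reference as 'tag@digest', 'tag', or 'unpinned'."""
    if _DIGEST_RE.match(image_ref):
        return "tag@digest"
    match = _TAG_RE.match(image_ref)
    if match and match.group("tag") != "latest":
        return "tag"
    # No tag at all, or the floating "latest" tag.
    return "unpinned"


if __name__ == "__main__":
    for ref in ("netbirdio/netbird-server:0.72.4", "netbirdio/dashboard:v2.39.0"):
        print(ref, pin_level(ref))
```

With the two references from the checklist this reports `tag` for both, which matches the ⚠️ status: pinned, but not yet at the digest level the stateful tier asks for.
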

## Service-specific hardening

- **Trusted-proxy pinning:** `server.reverseProxy.trustedHTTPProxies` is set from
  `netbird_coordinator__trusted_proxies` so NetBird honours `X-Forwarded-*` **only** from
  Caddy's source range on the `boma` bridge — rendered via `to_json` so an empty override
  becomes `[]` (trust nothing), never YAML `null`. Tighten the range to Caddy's actual
  container subnet at deploy (`docker network inspect boma`).
- **`/setup` self-closes:** the one-time admin-bootstrap page is reachable only while the
  IdP has zero users; creating the first account closes the window, so there is no
  standing unauthenticated admin-creation route.
- **No standing unauthenticated admin surface:** the management REST/gRPC API requires a
  Dex-issued JWT; metrics (`:9090`) and healthcheck (`:9000`) are in-container only and
  never published (`access__api` describes the authenticated path).
- **Secrets never reach the dashboard or work tree:** `config.yaml` (with both secrets)
  is rendered `0640` with `no_log`; `dashboard.env` carries no secrets (public client).
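
The trusted-proxy behaviour described above can be illustrated with a short sketch. This is Python standing in for the Jinja `to_json` filter and for NetBird's own matching logic, neither of which is reproduced here; `render_trusted_proxies` and `is_trusted` are hypothetical names, and the narrower `172.18.0.0/16` subnet is an invented example value:

```python
import ipaddress
import json


def render_trusted_proxies(cidrs):
    """JSON-render the CIDR list; an empty list becomes '[]', not 'null'."""
    return json.dumps(list(cidrs))


def is_trusted(source_ip, cidrs):
    """Return True if source_ip falls inside any of the trusted CIDR ranges."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


if __name__ == "__main__":
    default_range = ["172.16.0.0/12"]   # role default from defaults/main.yml
    narrowed = ["172.18.0.0/16"]        # hypothetical subnet read from `docker network inspect boma`
    print(render_trusted_proxies([]))
    print(is_trusted("172.18.0.5", default_range), is_trusted("172.18.0.5", narrowed))
```

An empty list trusts no source at all, which is the safe failure mode the `to_json` rendering is meant to preserve.
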

## Residual / accepted risks

- **Public mesh control plane on askari** — the management API + dashboard (443 via
  Caddy) and STUN (3478/udp) are exposed on askari's public IP; the management API
  controls the whole mesh. Accepted as **R3** in `docs/security/accepted-risks.md`
  (self-hosting = no third-party trust + an off-site control plane that survives a
  homelab outage). Mitigated by TLS + embedded-Dex login, trusted-proxy pinning, `base`
  hardening, and version-pinned NetBird patched on boma's cadence. Revisit per R3's
  trigger (a coordinator compromise / unpatched NetBird CVE, or the management plane
  becoming reachable without auth). *(Note: R3's text says "Coturn (UDP 3478)"; the
  v0.72.4 combined server actually exposes plain STUN on 3478/udp with no Coturn — same
  port and surface, no functional difference to the accepted risk.)*
- **Off-site backup not yet captured** — the service is stateful (`backup__state: true`)
  but the restic/`fisi` pull pipeline (ADR-022 Plan 2) is not built. Until then, the
  encrypted datastore is **not** backed up off-host: a loss of askari loses the mesh
  control-plane state (recoverable only by re-bootstrapping a fresh coordinator and
  re-enrolling peers). Accepted for now; revisit when `fisi` lands. See `BACKUP.md`.
- **Images pinned to tags, not digests** — stateful tier wants `tag@digest` (ADR-011);
  currently exact tags. Revisit when convenient.
63
roles/netbird_coordinator/VERIFY.md
Normal file
@@ -0,0 +1,63 @@
# Verify — netbird_coordinator (NetBird control plane)

> **Authored now, executed later.** This is the acceptance spec for `/verify-service
> netbird_coordinator`. It cannot run yet: it needs the Playwright UI harness (ADR-017)
> **and** a live deploy of this role behind the M4a Caddy on askari. Until both exist,
> treat this as the spec to drive once they do — verification is deferred, not skipped.

NetBird's coordinator does have a real web UI (the dashboard), so this is a genuine
Level-4 UI spec, not just an HTTP/TLS check.

## Critical user journeys

The acceptance criteria — what "working" means. Numbered; action → expected result.

1. **Dashboard loads over a valid LE cert** — request
   `https://netbird.askari.wingu.me` → the dashboard SPA renders; the browser shows a
   valid Let's Encrypt certificate (trusted chain, SAN matches the host, not expired).
2. **First-boot `/setup` creates the first admin** — on a fresh deploy (zero users),
   `https://netbird.askari.wingu.me/setup` is reachable and creating the first admin
   account succeeds; re-visiting `/setup` afterwards no longer offers admin creation
   (the window self-closes once a user exists).
3. **Login via the embedded Dex IdP succeeds** — logging in with the just-created admin
   (OIDC redirect through `/oauth2`, public PKCE client, no client secret) lands on the
   dashboard's authenticated home / peers view.
4. **The management API responds behind auth** — an authenticated dashboard session can
   list peers / setup keys (the dashboard calls the management REST API at `/api`); an
   **unauthenticated** request to `/api/...` is rejected (401/403), confirming the API
   is not open.
5. **STUN answers on 3478/udp** — out of band (not browser): a STUN binding request to
   `askari:3478/udp` returns a binding response (confirms the host-published UDP port is
   live).
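
Journey 5 can be driven without a dedicated STUN client. A minimal sketch in Python using only the standard library: it builds an attribute-less binding request in the RFC 5389 header layout and checks that the reply is a binding success response carrying the same transaction ID. The function names are hypothetical, and the sketch does not parse the mapped address in the response:

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442   # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(txn_id=None):
    """Return (packet, txn_id) for a STUN binding request with no attributes."""
    if txn_id is None:
        txn_id = os.urandom(12)  # 96-bit transaction ID
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + txn_id, txn_id


def is_binding_success(packet, txn_id):
    """True if packet is a binding success response matching txn_id."""
    if len(packet) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", packet[:8])
    return (msg_type == BINDING_SUCCESS
            and cookie == MAGIC_COOKIE
            and packet[8:20] == txn_id)


def stun_probe(host, port=3478, timeout=3.0):
    """Send one binding request over UDP; False on timeout or socket error."""
    request, txn_id = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(request, (host, port))
            data, _addr = sock.recvfrom(2048)
        except OSError:
            return False
    return is_binding_success(data, txn_id)
```

Calling `stun_probe("netbird.askari.wingu.me")` from a machine outside askari is one way to run the check, on the assumption that the public A-record resolves to the same address that publishes 3478/udp. UDP is lossy, so a single `False` is worth retrying before it is treated as a failure.
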

## What good looks like

Key states/screens to confirm (and screenshot):

- The browser padlock shows a valid Let's Encrypt cert for `netbird.askari.wingu.me`.
- The `/setup` page renders the admin-creation form on a fresh deploy, and the dashboard
  reports an authenticated session after first login.
- The dashboard's peers/setup-keys view loads its data from the management API (no error
  toast, no infinite spinner) — proving the `/api` + gRPC routing through Caddy works.
- An anonymous `/api` request returns 401/403, not data.
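
The anonymous-request expectation in the last bullet can also be checked outside the browser. A minimal sketch in Python with `urllib`, assuming the URL to probe is passed in by the caller; `anonymous_status` and `classify_anonymous_status` are hypothetical helpers, and the concrete API route is deliberately left to the caller because the spec only names `/api`:

```python
import urllib.error
import urllib.request


def classify_anonymous_status(status):
    """Map an HTTP status from an unauthenticated request to a verdict."""
    if status in (401, 403):
        return "rejected"    # expected: the API refuses anonymous callers
    if 200 <= status < 300:
        return "open"        # failure: data was served without auth
    return "unexpected"      # e.g. 404 or 5xx, needs a human look


def anonymous_status(url, timeout=5.0):
    """Issue a GET with no credentials and return the HTTP status code."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what matters here.
        return err.code


if __name__ == "__main__":
    status = anonymous_status("https://netbird.askari.wingu.me/api")
    print(status, classify_anonymous_status(status))
```

`urlopen` follows redirects, so a redirect to a login page would surface as the final page's status. If the deployment answers anonymous calls that way, the classification above would need adjusting.
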

## Not browser-verifiable

Route these to the manual-test handoff:

- **STUN on 3478/udp** (journey 5) — UDP, not HTTP; verify with a STUN client, not a
  browser.
- **gRPC over h2c** (management + signal exchange) and the **relay WebSocket** — exercised
  end-to-end only by a real peer enrolling (M5), not by a headless dashboard session.
- **Peer enrolment via setup keys** — depends on the M5 client work; out of scope here.
- **Datastore encryption / restore** — proven by the `BACKUP.md` restore drill, not the UI.

## Test data

This service runs **only on production askari** — there is no staging Authentik group and
no SSO in front of it (it ships its own embedded IdP). The journeys provision their own:

- A **fresh deploy with zero users** so journey 2 (`/setup`) is reachable; journey 2
  itself creates the single admin account used by journeys 3–4. No pre-seeded peers.
- Public DNS A-record for `netbird.askari.wingu.me` pointing at askari (so Caddy's
  HTTP-01 cert can issue) — already provisioned with the M4a Caddy.
@@ -13,3 +13,39 @@ netbird_coordinator__domain: netbird.askari.wingu.me
netbird_coordinator__trusted_proxies: ["172.16.0.0/12"]

netbird_coordinator__manage: true  # set false in Molecule to render without Docker

# access__*/backup__* are the ADR-021/022 CROSS-ROLE conventions — shared field names that
# render ACCESS.md/BACKUP.md and drive /check-access · /check-backup. They intentionally do
# NOT carry the netbird_coordinator__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
# (ansible-lint's role-prefix rule has no per-prefix allowlist; keeping it enabled elsewhere).

# Operational-access record (ADR-021) — source of truth for ACCESS.md + /check-access.
# Compose project name defaults to the base_dir basename (= "netbird"), not the role name.
access__service: netbird_coordinator  # noqa: var-naming[no-role-prefix]
access__compose_project: netbird  # noqa: var-naming[no-role-prefix]
access__compose_path: "{{ netbird_coordinator__base_dir }}/docker-compose.yml"  # noqa: var-naming[no-role-prefix]
access__containers: [netbird-server, netbird-dashboard]  # noqa: var-naming[no-role-prefix]
access__log:  # noqa: var-naming[no-role-prefix]
  loki_labels: { service: netbird }  # intent; Loki/Alloy pipeline is ADR-018 (pending)
access__api:  # noqa: var-naming[no-role-prefix]
  enabled: true
  # Management REST API at /api (+ gRPC), via Caddy, behind the embedded Dex IdP.
  # Needs a Dex-issued JWT — no unauthenticated admin port (metrics :9090 / health
  # :9000 are in-container only). Admin surface is the dashboard at the same host.
  base_url: "https://{{ netbird_coordinator__domain }}"
  health_path: "/api"
  auth:
    vault_ref: null  # no static token — auth is a per-session Dex-issued JWT (dashboard login)
    note: "Bearer JWT from the embedded Dex IdP; /check-access can't curl this unauthenticated"

# Backup contract (ADR-022). STATEFUL — boma's first. Encrypted SQLite datastore in the
# netbird_data volume (/var/lib/netbird): peers, setup keys, ACLs, embedded-IdP users.
# Decryptable only with vault.netbird.datastore_key (lives in the vault, its own backup).
# Off-site capture is PENDING the fisi pull node + restic repo (ADR-022 Plan 2, not built)
# — an accepted gap for now; see BACKUP.md.
backup__service: netbird_coordinator  # noqa: var-naming[no-role-prefix]
backup__state: true  # noqa: var-naming[no-role-prefix]
backup__paths:  # noqa: var-naming[no-role-prefix]
  - /var/lib/netbird  # netbird_data named volume — encrypted SQLite store
backup__dumps: []  # noqa: var-naming[no-role-prefix] # embedded SQLite, no logical dump cmd
backup__quiesce: false  # noqa: var-naming[no-role-prefix] # file-level copy; flip true if inconsistent
@@ -2,10 +2,10 @@
 galaxy_info:
   author: sjat
   description: >-
-    Self-hosted NetBird coordinator (ADR-016): combined server image
-    (Management + Signal + Relay + STUN) plus dashboard UI, run on askari via
-    Docker Compose behind the Caddy reverse proxy. Pinned images; secrets from
-    vault.
+    Self-hosted NetBird control plane (ADR-016): combined server image
+    (Management API + Signal + Relay + STUN + embedded Dex IdP) plus dashboard
+    UI, run on askari via Docker Compose behind the Caddy reverse proxy. Stateful
+    (encrypted SQLite store). Pinned images; secrets from vault.
   license: MIT
   min_ansible_version: "2.17"
   platforms: