boma/roles/netbird_coordinator/BACKUP.md
sjat 070d6f293b docs(netbird): service-role standard files (SECURITY/VERIFY/ACCESS/BACKUP)
Author the four ADR-mandated service-role docs for netbird_coordinator and
add the cross-role access__*/backup__* data (ADR-021/022). First stateful
service: backup__state=true; off-site capture pending the fisi pull node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:01:29 +02:00

Backup — netbird_coordinator (NetBird control plane)

Rendered from the role's backup__* data (roles/netbird_coordinator/defaults/main.yml) — the source of truth that also drives /check-backup. Regenerate from the data; edit the data, not the tables. Host: askari (off-site Hetzner; ADR-007/016).

This is boma's first stateful service (backup__state: true). It holds the entire mesh control-plane state in an encrypted SQLite datastore.

State captured

Rendered from backup__*:

| What | Source | How captured |
|---|---|---|
| datastore volume | /var/lib/netbird (Docker named volume netbird_data) | file-level, pulled read-only: the SQLite DB (peers, setup keys, ACLs, embedded-IdP users) |
  • Encryption key is part of the backup contract. The datastore is encrypted with vault.netbird.datastore_key (server.store.encryptionKey; 32 bytes, base64-encoded). A restore needs both the captured volume and that key. The key already lives in the Ansible Vault (off-host, in the repo); the data backup does not re-capture it and must not, because the vault is its own backup. Lose the key and the snapshot is unreadable.
  • Quiesce: false. SQLite is captured file-level from the named volume while the stack keeps running. ADR-022 Decision 7 prefers a logical dump; NetBird exposes no dump command and uses an embedded store, so this is the file-level escape hatch (Decision 7 B). A file copy taken while SQLite is mid-write can be internally inconsistent; if that shows up in practice, flip backup__quiesce: true (stop → snapshot → restart). The stack tolerates a brief restart.
  • RPO: ~24 h (nightly; ADR-022 Decision 2), effective only once the backup pipeline exists (see Restore notes).
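
Because a restore depends on the exact datastore key, it can help to sanity-check the key's shape before relying on it. The following is a minimal sketch, assuming the key uses the standard base64 alphabet (the text above only says "base64"); the function name is hypothetical and not part of the role. It confirms only that the value decodes to 32 bytes, not that it matches any particular snapshot.

```python
import base64
import binascii
import os


def is_valid_datastore_key(key_b64: str) -> bool:
    """Return True if key_b64 is strict base64 that decodes to exactly 32 bytes.

    Assumption: standard base64 alphabet (not URL-safe).
    """
    try:
        raw = base64.b64decode(key_b64, validate=True)
    except (binascii.Error, ValueError):
        # Not decodable as strict base64.
        return False
    return len(raw) == 32


if __name__ == "__main__":
    # Throwaway key for illustration only; never print the real vault key.
    sample = base64.b64encode(os.urandom(32)).decode("ascii")
    print(is_valid_datastore_key(sample))
```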

Restore procedure

  1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A. This renders config.yaml with vault.netbird.datastore_key from the vault (the same key the snapshot was encrypted under — do not rotate it across a restore).
  2. Stop the stack, run restic restore for the latest netbird_coordinator snapshot into the netbird_data volume (/var/lib/netbird), then start the stack.
  3. No logical dump to replay (file-level store).
  4. Confirm with this role's VERIFY.md checks (ADR-008/017) — dashboard loads, login via the embedded IdP works, the management API lists the restored peers/keys.
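
Step 2 of the procedure above can be sketched as an ordered command plan. This is an illustration under stated assumptions, not the role's actual tooling: the function name is hypothetical, snapshots are assumed to carry the tag netbird_coordinator, the restic repository and password are assumed to come from the environment (for example RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE), and the restore target is left as a parameter because it depends on how the snapshot paths were recorded. The Compose project name netbird is the one this document states. The sketch only builds and prints the commands; it does not run them.

```python
import shlex
from typing import List


def restore_plan(
    target: str,
    project: str = "netbird",          # Compose project name stated in this document
    tag: str = "netbird_coordinator",  # assumption: snapshots are tagged with the role name
) -> List[List[str]]:
    """Return the stop -> restore -> start commands as argv lists, in order."""
    return [
        ["docker", "compose", "-p", project, "stop"],
        ["restic", "restore", "latest", "--tag", tag, "--target", target],
        ["docker", "compose", "-p", project, "start"],
    ]


if __name__ == "__main__":
    # Placeholder target; substitute the real restore location.
    for argv in restore_plan(target="/path/to/restore/target"):
        print(shlex.join(argv))
```

Returning argv lists rather than shell strings keeps the plan easy to review as a dry run, and each list can be passed to subprocess.run unchanged once the assumptions are confirmed.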

Restore notes

  • The encryption key must match the snapshot. The datastore is unreadable without the exact vault.netbird.datastore_key it was written under. Restore the vault first (or confirm the key is unchanged) before restoring the data; never rotate the datastore key as part of a restore.
  • Off-site backup is NOT yet captured — accepted risk. The restic / fisi pull node (ADR-022 Plan 2) is not built yet, so right now this state is not backed up off-host. Until fisi lands, a loss of askari loses the mesh control-plane state; the only recovery is to re-bootstrap a fresh coordinator (/setup) and re-enrol peers (M5). Accepted for now; this record exists so the gap is explicit and /check-backup flags it. Revisit when the fisi pull node + restic repo are live.
  • Compose project name is netbird (the base-dir basename), not netbird_coordinator — relevant when stopping the stack to quiesce a restore.
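
The off-site gap and the ~24 h RPO described in this document suggest a simple three-way status: no snapshot at all, a snapshot older than the RPO, or a fresh one. The sketch below is a hypothetical illustration of that classification. It is not how /check-backup is implemented, and the function name and status strings are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# RPO taken from this document: roughly 24 hours (nightly capture).
RPO = timedelta(hours=24)


def backup_status(
    latest_snapshot: Optional[datetime],
    now: datetime,
    rpo: timedelta = RPO,
) -> str:
    """Classify backup freshness as 'missing', 'stale', or 'ok'."""
    if latest_snapshot is None:
        # No off-site snapshot exists yet (the current state of this role).
        return "missing"
    if now - latest_snapshot > rpo:
        # A snapshot exists but is older than the RPO allows.
        return "stale"
    return "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(backup_status(None, now))
    print(backup_status(now - timedelta(hours=30), now))
    print(backup_status(now - timedelta(hours=2), now))
```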