Author the four ADR-mandated service-role docs for netbird_coordinator and add the cross-role access__*/backup__* data (ADR-021/022). First stateful service: backup__state=true; off-site capture pending the fisi pull node. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Backup — netbird_coordinator (NetBird control plane)
Rendered from the role's backup__* data (roles/netbird_coordinator/defaults/main.yml)
— the source of truth that also drives /check-backup. Regenerate from the data; edit the
data, not the tables. Host: askari (off-site Hetzner; ADR-007/016).
This is boma's first stateful service (backup__state: true). It holds the entire
mesh control-plane state in an encrypted SQLite datastore.
State captured
Rendered from backup__*:
| What | Source | How captured |
|---|---|---|
| datastore volume | /var/lib/netbird (Docker named volume netbird_data) | file-level, pulled read-only: the SQLite DB (peers, setup keys, ACLs, embedded-IdP users) |
- Encryption key is part of the backup contract. The datastore is encrypted with `vault.netbird.datastore_key` (`server.store.encryptionKey`, base64, 32 bytes). A restore needs both the captured volume and that key. The key already lives in the Ansible Vault (off-host, in the repo); it is not re-captured by the data backup and must not be, because the vault is its own backup. Lose the key and the snapshot is unreadable.
- Quiesce: `false`. SQLite is captured file-level from the named volume. ADR-022 Decision 7 prefers a logical dump; NetBird exposes no dump command and uses an embedded store, so this is the file-level escape hatch (Decision 7 B). If a live file-level copy proves inconsistent in practice, flip `backup__quiesce: true` (stop → snapshot → restart); the stack tolerates a brief restart.
- RPO: ~24 h (nightly; ADR-022 Decision 2), once the pipeline exists (see below).
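
Because a restore depends on the exact key, it can help to confirm that a candidate value really decodes to 32 bytes before relying on it. The following is a minimal sketch, not part of the role: the function name `key_len_ok` is hypothetical, and it assumes GNU `base64` (with `-d`) is available and that the key is passed in as a plain base64 string.

```shell
# Hypothetical helper (not part of the role): succeed only if the given
# base64 string decodes to exactly 32 bytes.
# Assumes GNU coreutils `base64 -d`; decoding errors are silenced and
# simply result in a byte count other than 32.
key_len_ok() {
  n=$(printf '%s' "$1" | base64 -d 2>/dev/null | wc -c)
  [ "$n" -eq 32 ]
}
```

A caller might run `key_len_ok "$candidate_key" || echo "datastore key is not 32 bytes" >&2`, where `candidate_key` is whatever variable holds the value read from the vault. This only checks the length; it cannot tell whether the key is the one the snapshot was written under.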
Restore procedure
- Re-provision the host (Terraform) and redeploy this role (Ansible): Model A. This renders `config.yaml` with `vault.netbird.datastore_key` from the vault (the same key the snapshot was encrypted under; do not rotate it across a restore).
- Stop the stack, `restic restore` the latest snapshot for `netbird_coordinator` into the `netbird_data` volume / `/var/lib/netbird`, then start the stack.
- No logical dump to replay (file-level store).
- Confirm with this role's `VERIFY.md` checks (ADR-008/017): dashboard loads, login via the embedded IdP works, the management API lists the restored peers/keys.
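
The stop, restore, start sequence in the procedure above could be sketched roughly as follows. This is illustrative only, since the restic pipeline does not exist yet: the function names `run` and `restore_netbird`, the `DRY_RUN` switch, the use of a `netbird_coordinator` snapshot tag, and the caller-supplied target directory are all assumptions. The only details taken from this document are the Compose project name `netbird` and the use of `restic restore` on the latest snapshot. The sketch also assumes restic finds its repository and password through its usual environment variables (`RESTIC_REPOSITORY`, `RESTIC_PASSWORD_FILE`).

```shell
# Sketch only: the snapshot tag and target layout are assumptions, because
# the restic pipeline for this role is not built yet.

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# restore_netbird TARGET_DIR
# Stops the stack, restores the latest snapshot into TARGET_DIR, then
# starts the stack again. Each step runs only if the previous one
# succeeded, so a failed restore leaves the stack stopped.
restore_netbird() {
  target="$1"
  # The Compose project is "netbird", not "netbird_coordinator".
  run docker compose -p netbird stop &&
    # Assumption: snapshots for this role carry the tag netbird_coordinator.
    run restic restore latest --tag netbird_coordinator --target "$target" &&
    run docker compose -p netbird start
}
```

Running `DRY_RUN=1 restore_netbird /some/target` prints the three commands without executing them, which is a cheap way to review the plan before touching a real host. Which target directory makes the files land in the `netbird_data` volume depends on how the snapshot paths were recorded, so that value is deliberately left to the caller.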
Restore notes
- The encryption key must match the snapshot. The datastore is unreadable without the exact `vault.netbird.datastore_key` it was written under. Restore the vault first (or confirm the key is unchanged) before restoring the data; never rotate the datastore key as part of a restore.
- Off-site backup is NOT yet captured (accepted risk). The restic / `fisi` pull node (ADR-022 Plan 2) is not built yet, so right now this state is not backed up off-host. Until `fisi` lands, a loss of askari loses the mesh control-plane state; the only recovery is to re-bootstrap a fresh coordinator (`/setup`) and re-enrol peers (M5). Accepted for now; this record exists so the gap is explicit and `/check-backup` flags it. Revisit when the `fisi` pull node + restic repo are live.
- Compose project name is `netbird` (the base-dir basename), not `netbird_coordinator`. This matters when stopping the stack to quiesce a restore.