# Backup — netbird_coordinator (NetBird control plane)
Rendered from the role's `backup__*` data (`roles/netbird_coordinator/defaults/main.yml`),
the source of truth that also drives `/check-backup`. To change this file, edit the data
and regenerate; do not edit the tables by hand. Host: `askari` (off-site Hetzner; ADR-007/016).

This is boma's **first stateful service** (`backup__state: true`). It holds the entire
mesh control-plane state in an encrypted SQLite datastore.
## State captured
Rendered from `backup__*`:
| What | Source | How captured |
|---|---|---|
| datastore volume | `/var/lib/netbird` (Docker named volume `netbird_data`) | file-level, pulled read-only — the SQLite DB (peers, setup keys, ACLs, embedded-IdP users) |
- **Encryption key is part of the backup contract.** The datastore is **encrypted** with
`vault.netbird.datastore_key` (`server.store.encryptionKey`, base64 32 bytes). A
restore needs **both** the captured volume **and** that key. The key already lives in
the Ansible Vault (off-host, in the repo); it is **not** re-captured by the data backup
and must not be — the vault is its own backup. Lose the key and the snapshot is
unreadable.
- **Quiesce:** `false` — SQLite is captured file-level from the named volume. ADR-022
Decision 7 prefers a logical dump; NetBird exposes no dump command and uses an embedded
store, so this is the file-level escape hatch (Decision 7 B). If a live file-level copy
proves inconsistent in practice, flip `backup__quiesce: true` (stop → snapshot →
restart) — the stack tolerates a brief restart.
- **RPO:** ~24 h (nightly; ADR-022 Decision 2) — **once the pipeline exists** (see below).
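
Because the snapshot is only as useful as the key, it can help to sanity-check the key's
shape before relying on it. A minimal sketch, assuming the key is standard-alphabet base64
(the record above only says "base64 32 bytes"); the function name `is_valid_datastore_key`
is hypothetical and not part of this role:

```python
import base64
import binascii


def is_valid_datastore_key(key_b64: str) -> bool:
    """Return True if key_b64 is strict base64 that decodes to exactly 32 bytes.

    Assumption: standard base64 alphabet with padding, not the URL-safe variant.
    """
    try:
        raw = base64.b64decode(key_b64, validate=True)
    except (binascii.Error, ValueError):
        # Non-alphabet characters, bad padding, or non-ASCII input.
        return False
    return len(raw) == 32
```

This checks only the format. It cannot tell whether the key is the one a given snapshot
was written under.
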
## Restore procedure
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A. This
renders `config.yaml` with `vault.netbird.datastore_key` from the vault (the *same*
key the snapshot was encrypted under — do not rotate it across a restore).
2. Stop the stack, `restic restore` the latest snapshot for `netbird_coordinator` into
the `netbird_data` volume / `/var/lib/netbird`, then start the stack.
3. No logical dump to replay (file-level store).
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017) — dashboard loads, login via
the embedded IdP works, the management API lists the restored peers/keys.
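
The ordering in step 2 (stop, restore, start) can be sketched as a dry-run plan. This is
illustrative only: the pipeline does not exist yet, so the snapshot tag
(`netbird_coordinator`) and the `target` path are assumptions, as is the helper name
`restore_commands`. The sketch builds the commands and executes nothing; the restic
repository and credentials are assumed to come from the environment.

```python
import shlex
from typing import List


def restore_commands(target: str,
                     project: str = "netbird",
                     tag: str = "netbird_coordinator") -> List[List[str]]:
    """Build the stop -> restore -> start sequence as argv lists (nothing is run).

    Assumptions: snapshots are tagged with the role name, and `target` is chosen
    to match how paths were recorded in the snapshot.
    """
    return [
        ["docker", "compose", "-p", project, "stop"],
        ["restic", "restore", "latest", "--tag", tag, "--target", target],
        ["docker", "compose", "-p", project, "start"],
    ]


def render(commands: List[List[str]]) -> str:
    """Render the plan as shell-quoted lines for review before running by hand."""
    return "\n".join(shlex.join(argv) for argv in commands)
```

restic recreates each file's recorded path underneath `--target`, so the right value
depends on how the pull node captured the volume. Confirm it against a real snapshot
before trusting this plan.
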
## Restore notes
- **The encryption key must match the snapshot.** The datastore is unreadable without the
exact `vault.netbird.datastore_key` it was written under. Restore the vault first (or
confirm the key is unchanged) before restoring the data; never rotate the datastore key
as part of a restore.
- **Off-site backup is NOT yet captured — accepted risk.** The restic / `fisi` pull node
(ADR-022 Plan 2) is **not built yet**, so right now this state is **not** backed up
off-host. Until `fisi` lands, a loss of askari loses the mesh control-plane state; the
only recovery is to re-bootstrap a fresh coordinator (`/setup`) and re-enrol peers (M5).
Accepted for now; this record exists so the gap is explicit and `/check-backup` flags
it. Revisit when the `fisi` pull node + restic repo are live.
- **Compose project name is `netbird`** (the base-dir basename), not
`netbird_coordinator` — relevant when stopping the stack to quiesce a restore.
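
One way to "confirm the key is unchanged" without copying the key anywhere is to compare
a one-way fingerprint. A minimal sketch, assuming a SHA-256 fingerprint was recorded when
the snapshot was taken; that practice and the names `key_fingerprint` and `key_matches`
are hypothetical, not something this role does today:

```python
import base64
import hashlib
import hmac


def key_fingerprint(key_b64: str) -> str:
    """SHA-256 hex digest of the decoded key bytes (a one-way digest, not the key)."""
    raw = base64.b64decode(key_b64, validate=True)
    return hashlib.sha256(raw).hexdigest()


def key_matches(key_b64: str, recorded_fingerprint: str) -> bool:
    """True if the current vault key hashes to the fingerprint recorded at backup time."""
    return hmac.compare_digest(key_fingerprint(key_b64), recorded_fingerprint)
```

If `key_matches` returns `False`, the key was rotated or the wrong vault revision is
checked out; resolve that before restoring the data.
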