# Backup — netbird_coordinator (NetBird control plane)

Rendered from the role's `backup__*` data (`roles/netbird_coordinator/defaults/main.yml`),
the source of truth that also drives `/check-backup`. Regenerate from the data; edit the
data, not the tables. Host: `askari` (off-site Hetzner; ADR-007/016).

This is boma's **first stateful service** (`backup__state: true`). It holds the entire
mesh control-plane state in an encrypted SQLite datastore.

## State captured

Rendered from `backup__*`:

| What | Source | How captured |
|---|---|---|
| datastore volume | `/var/lib/netbird` (Docker named volume `netbird_data`) | file-level, pulled read-only; holds the SQLite DB (peers, setup keys, ACLs, embedded-IdP users) |

- **Encryption key is part of the backup contract.** The datastore is **encrypted** with
  `vault.netbird.datastore_key` (`server.store.encryptionKey`, 32 bytes, base64-encoded). A
  restore needs **both** the captured volume **and** that key. The key already lives in
  the Ansible Vault (off-host, in the repo); it is **not** re-captured by the data backup
  and must not be, because the vault is its own backup. Lose the key and the snapshot is
  unreadable.
- **Quiesce:** `false`. SQLite is captured file-level from the named volume. ADR-022
  Decision 7 prefers a logical dump; NetBird exposes no dump command and uses an embedded
  store, so this is the file-level escape hatch (Decision 7 B). If a live file-level copy
  proves inconsistent in practice, flip `backup__quiesce: true` (stop → snapshot →
  restart); the stack tolerates a brief restart.
- **RPO:** ~24 h (nightly; ADR-022 Decision 2), **once the pipeline exists** (see below).

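
Because a restore depends on the datastore key having the stated shape (32 bytes, base64-encoded), a pre-flight check on the vault value can catch a truncated or mis-pasted key before it matters. The following is a minimal sketch, not part of the role: the function name `is_valid_datastore_key` is hypothetical, and it only checks the encoding and length, not that the key matches any particular snapshot.

```python
import base64
import binascii


def is_valid_datastore_key(key_b64: str) -> bool:
    """Return True if key_b64 is strict base64 that decodes to exactly 32 bytes.

    Hypothetical helper: it validates the *shape* of the key described above.
    It cannot tell whether this is the key a given snapshot was written under.
    """
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        raw = base64.b64decode(key_b64.strip(), validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32
```

A key that decodes cleanly but to 16 bytes, for example, is rejected, as is any string containing non-base64 characters.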
## Restore procedure

1. Re-provision the host (Terraform) and redeploy this role (Ansible), i.e. Model A. This
   renders `config.yaml` with `vault.netbird.datastore_key` from the vault (the *same*
   key the snapshot was encrypted under; do not rotate it across a restore).
2. Stop the stack, `restic restore` the latest snapshot for `netbird_coordinator` into
   the `netbird_data` volume / `/var/lib/netbird`, then start the stack.
3. No logical dump to replay (file-level store).
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017): dashboard loads, login via
   the embedded IdP works, the management API lists the restored peers/keys.

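
Step 2 can be sketched as an ordered command plan. This is an illustration under stated assumptions, not the pipeline's actual tooling (which does not exist yet): the function `restore_plan` is hypothetical, the snapshots are assumed to be tagged with the role name, and the restore target is left to the caller because the snapshot's path layout is not defined here. The Compose project name `netbird` comes from this record. The function only builds the argument lists; it runs nothing.

```python
from typing import List


def restore_plan(
    restore_target: str,
    compose_project: str = "netbird",
    snapshot_tag: str = "netbird_coordinator",  # ASSUMPTION: snapshots are tagged by role name
) -> List[List[str]]:
    """Build the stop -> restore -> start sequence as argv lists (hypothetical helper).

    Assumes restic finds its repository and password through its usual
    environment (for example RESTIC_REPOSITORY), and that the docker commands
    are run on the coordinator host.
    """
    return [
        # Stop the stack so nothing writes to the SQLite files during restore.
        ["docker", "compose", "-p", compose_project, "stop"],
        # Restore the most recent matching snapshot into the chosen target.
        ["restic", "restore", "latest", "--tag", snapshot_tag, "--target", restore_target],
        # Bring the stack back up on the restored data.
        ["docker", "compose", "-p", compose_project, "start"],
    ]
```

Returning the plan as data keeps the ordering reviewable and testable before anything touches the host; a caller could print it for a dry run or pass each entry to `subprocess.run(cmd, check=True)`.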
## Restore notes

- **The encryption key must match the snapshot.** The datastore is unreadable without the
  exact `vault.netbird.datastore_key` it was written under. Restore the vault first (or
  confirm the key is unchanged) before restoring the data; never rotate the datastore key
  as part of a restore.
- **Off-site backup is NOT yet captured (accepted risk).** The restic / `fisi` pull node
  (ADR-022 Plan 2) is **not built yet**, so right now this state is **not** backed up
  off-host. Until `fisi` lands, a loss of askari loses the mesh control-plane state; the
  only recovery is to re-bootstrap a fresh coordinator (`/setup`) and re-enrol peers (M5).
  Accepted for now; this record exists so the gap is explicit and `/check-backup` flags
  it. Revisit when the `fisi` pull node + restic repo are live.
- **Compose project name is `netbird`** (the base-dir basename), not
  `netbird_coordinator`. This matters when stopping the stack to quiesce a restore.

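
The accepted-risk gap and the ~24 h RPO can be expressed as a simple freshness check. This is a sketch of the idea only and makes no claim about how `/check-backup` is implemented: the function `backup_status` and its three status strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# RPO from this record: roughly 24 h (nightly).
RPO = timedelta(hours=24)


def backup_status(
    latest_snapshot: Optional[datetime],
    now: datetime,
    rpo: timedelta = RPO,
) -> str:
    """Classify backup freshness (hypothetical helper, not /check-backup itself).

    "gap"   -> no snapshot exists at all (the situation until `fisi` lands)
    "stale" -> a snapshot exists but is older than the RPO
    "ok"    -> the latest snapshot is within the RPO
    """
    if latest_snapshot is None:
        return "gap"
    if now - latest_snapshot > rpo:
        return "stale"
    return "ok"
```

With no pull node and therefore no snapshot, `backup_status(None, datetime.now(timezone.utc))` returns `"gap"`, which is the state this record currently accepts.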