diff --git a/STATUS.md b/STATUS.md
index 6f747a6..9a7d631 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -39,7 +39,7 @@ _Last reviewed: 2026-06-19._
 | Thing | State |
 |---|---|
-| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is **applied to ubongo** (mesh-hardening 2/3, 2026-06-19) **and askari** (mesh-hardening redesign, 2026-06-20) — both INPUT-only default-deny via the `base__firewall_input_only` knob (input default-deny + `wt0`/ssh-from-control/`base__firewall_admin_addrs` allow-list; forward left `accept` so Docker/libvirt-NAT survive), both **live reboot-validated**. On a Docker host (askari) base's `flush ruleset` wipes Docker's nat, so the cutover follows the firewall apply with a `restart docker` to rebuild it (FRICTION). Not built: auditd, packages, users (Phase 2 / TODO 15). |
+| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is **applied to ubongo** (mesh-hardening 2/3, 2026-06-19) **and askari** (mesh-hardening redesign, 2026-06-20) — both INPUT-only default-deny via the `base__firewall_input_only` knob (input default-deny + `wt0`/ssh-from-control/`base__firewall_admin_addrs` allow-list; forward left `accept` so Docker/libvirt-NAT survive), both **live reboot-validated**. On a Docker host (askari) base's `flush ruleset` wipes Docker's nat, so the cutover follows the firewall apply with a `restart docker` to rebuild it (FRICTION). Not built: auditd, packages, users (Phase 2 / TODO 15). The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment). |
 | `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
 | `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
diff --git a/docs/ROADMAP.md b/docs/ROADMAP.md
index 7b3da5c..e9e2280 100644
--- a/docs/ROADMAP.md
+++ b/docs/ROADMAP.md
@@ -215,8 +215,13 @@ coordinator; a real reboot recovered unattended. Remaining mesh-hardening sub-pr
 1. ~~`ubongo` nftables default-deny + `ssh-from-control`~~ → **DONE (2026-06-19).**
 2. ~~**redesign** `askari`'s SSH → `wt0`~~ → **DONE (2026-06-20)** — boot-race, coordinator-bootstrap chicken-egg, and Docker-nat-flush all resolved + live reboot-validated.
-3. **askari relay-SPOF reduction** (next) — `ubongo→askari` is currently `Relayed` through askari's own
- relay, so askari is a single point of failure for relayed mesh traffic; reduce it (second relay / direct P2P).
-4. tighten the NetBird ACL **off Allow-All** to scoped policies (open mechanism question — no headless API path).
+3. ~~**askari relay-SPOF reduction**~~ → **DONE (2026-06-20)** — assessed + **accepted** as a
+ documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
+ narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
+ second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
+ DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
+4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
+5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
+ BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
 **Then** the Procurement gate (`/capacity-review` → buy Proxmox hardware) opens Phase 2.
diff --git a/docs/decisions/016-mesh-vpn.md b/docs/decisions/016-mesh-vpn.md
index 5aaef1e..ec7361f 100644
--- a/docs/decisions/016-mesh-vpn.md
+++ b/docs/decisions/016-mesh-vpn.md
@@ -125,6 +125,38 @@ allocated for it.
 - Implementation is pending: the role tasks land only once the unbuilt `base` role and service-role machinery exist (Status).
+## Availability — an `askari` outage (amendment 2026-06-20)
+
+The coordinator is deliberately **single** (one off-site host). Recorded here so its
+availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).
+
+The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
+normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
+radius**:
+
+| Traffic | `askari` down |
+|---|---|
+| LAN device → LAN service (direct / via reverse proxy) | unaffected |
+| node ↔ node over LAN IPs (cluster) | unaffected |
+| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
+| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
+| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
+
+Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
+is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
+operations, above).
+
+**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
+once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
+gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
+the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
+hosts get the same pin via `base__mesh_coordinator_pin`.
+
+**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
+default-deny posture; only helps established sessions), a second relay (needs another public
+host / reintroduces the home public surface), a second coordinator (unsupported by
+self-hosted NetBird; against this ADR).
+
 ## Related
 ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
diff --git a/docs/security/accepted-risks.md b/docs/security/accepted-risks.md
index 0801afa..3e409e9 100644
--- a/docs/security/accepted-risks.md
+++ b/docs/security/accepted-risks.md
@@ -20,8 +20,9 @@ revisit (trigger).
 | R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |
 | R6 | **`le-prod-wildcard` integration runs** — when `CERTS=le-prod-wildcard` is passed to `make test-integration`, the production Gandi PAT (`vault.gandi.pat`) is passed to an ephemeral local test VM via the var overlay, and transient `_acme-challenge` TXT records are written into the real `wingu.me` DNS zone to satisfy the Let's Encrypt DNS-01 challenge. A compromised or long-lived test VM could exfiltrate the PAT; the real zone is briefly (seconds) modified | Scope is **on-demand only** — `le-staging` is the default cert tier (`CERTS=internal` for incident repro); `le-prod-wildcard` is an explicit opt-in. Compensating controls: the VM is ephemeral and destroyed on success; it sits on an isolated libvirt NAT network (no LAN/mesh access); TXT records are auto-removed by Caddy immediately after validation; the PAT is not persisted inside the VM after the run. ADR-025 documents the cert-tier design and the three isolation invariants | The PAT is exfiltrated from a test VM; the `wingu.me` zone shows unexpected records; a `CERTS=le-prod-wildcard` run must be audited or the tier must be revoked |
 | R7 | **`claude` AI-worker has `NOPASSWD:ALL` sudo on `ubongo`** — the automated AI-worker account can execute any command as root on the control node without a password prompt. A compromised or misbehaving agent session could make arbitrary root-level changes to ubongo | The account is **password-locked** (no interactive `claude` login; `NOPASSWD` sudo is the account's only escalation path, so there is no "su to claude + sudo" attack). `auditd` + Loki attribution (ADR-018) logs every `sudo` invocation with the originating user. The drop-in (`/etc/sudoers.d/claude-ai-worker`) is repo-managed via `base__ai_worker_user` — revocable in one commit + one deploy. Single-operator homelab; all changes in git; off-machine backups (ADR-022). Full rationale: ADR-015 amendment (2026-06-18) + ADR-021 §Sudo model. | The AI-worker executes a destructive action that cannot be rolled back via git; the account key is compromised; the threat model shifts toward targeted remote attackers |
+| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access** — `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
-_Last reviewed: 2026-06-18. The prior gaps (full CIS hardening, SELinux/AppArmor,
+_Last reviewed: 2026-06-20. The prior gaps (full CIS hardening, SELinux/AppArmor,
 IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build