Compare commits

...

4 commits

Author SHA1 Message Date
61cbcc6c18 docs(friction): re-asked settled defaults (push + subagent-driven) at plan->execute handoff
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 17:11:01 +02:00
6be758bece docs(plan): mesh-hardening redesign — askari implementation plan
Four tasks: netbird_coordinator geolocation disable (TDD via Molecule) -> inventory enablement (INPUT-only firewall + WAN break-glass + manage over wt0) -> an askari_inputonly integration profile (the reboot-safety GREEN gate) -> the operator-gated supervised live cutover + STATUS/ROADMAP update. Tasks 1-3 are autonomously implementable; Task 4 is operator-gated (live off-site host, lockout risk).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:32:27 +02:00
a178729587 docs(spec): mesh-hardening redesign — askari wt0-primary + WAN break-glass
Redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Mirrors the proven ubongo 2/3 pattern (INPUT-only default-deny, SSH scoped by iifname wt0, no sshd ListenAddress change -> no boot-race) and adds the coordinator-host exception the incident demanded: a permanent non-mesh break-glass (WAN :22 from ubongo's static WAN IP + the Hetzner console), WAN :22 deliberately left open. Folds in the netbird_coordinator geo-DB robustness fix (FRICTION #4) so a transient egress blip can't FATAL the control plane. Harness-GREEN gate before a supervised live cutover.

Operator decision (2026-06-19): do this redesign first, then a separate sub-project to reduce askari's SPOF role.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:25:26 +02:00
ef5e049e9b docs(status): mesh-hardening 2/3 — ubongo reboot-validated
After an operator reboot of ubongo, verified live that the INPUT-only default-deny ruleset re-applied on boot (input chain policy drop + the full wt0/ssh-from-control/admin-addr allow-list), the wt0 mesh came back (Management+Signal Connected), and both SSH paths recovered clean. Closes the 'real-host reboot validation pending' item for mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:25:19 +02:00
5 changed files with 637 additions and 2 deletions

View file

@ -30,7 +30,7 @@ _Last reviewed: 2026-06-19._
| `roles/dev_env/` — interactive developer environment | **Built + applied.** zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. **Applied to `ubongo`** for users `sjat` + `claude` (verified: zsh login shells, stow-symlinked `.zshrc`/`.tmux.conf` + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via `playbooks/workstation.yml` against the `control` group (no dedicated `workstations` group yet). | | `roles/dev_env/` — interactive developer environment | **Built + applied.** zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. **Applied to `ubongo`** for users `sjat` + `claude` (verified: zsh login shells, stow-symlinked `.zshrc`/`.tmux.conf` + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via `playbooks/workstation.yml` against the `control` group (no dedicated `workstations` group yet). |
| `make check` / `make deploy PLAYBOOK=<name>` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). | | `make check` / `make deploy PLAYBOOK=<name>` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). |
| `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. | | `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker + libvirt groups, **`NOPASSWD:ALL` sudo** — ADR-015 amended 2026-06-18; operator `sjat` uses password-required sudo via `sudo` group; the former `sjat-ansible` NOPASSWD drop-in removed 2026-06-18). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **NetBird mesh-enrolled (M5, 2026-06-17):** `wt0` up at `100.99.146.14` via the `base` `mesh` concern. **`base` firewall applied (mesh-hardening 2/3, 2026-06-19):** INPUT-only default-deny — input locked to `wt0` + ssh-from-control (`10.20.10.151`) + workstations (`10.20.10.50` mamba, `10.20.10.17`); forward `accept` (Docker/libvirt-NAT safe). Live-verified (SSH self-path + Docker egress, after a post-apply `restart docker` — base's flush wipes Docker nat, FRICTION); **real-host reboot validation pending** (low-risk — lockout-safe via the permanent console). `claude` now self-SSHes (ad-hoc `authorized_keys` grant so the agent can run SSH-based deploys with the auto-rollback safety; fold into the control-node bootstrap). **Pending:** full `base` hardening (auditd/CIS); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservations (10.20.10.151 MAC `88:a4:c2:e0:ee:da` + the `.50`/`.17` workstation leases); Terraform state backup (now relevant — the offsite tfstate exists). | | `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker + libvirt groups, **`NOPASSWD:ALL` sudo** — ADR-015 amended 2026-06-18; operator `sjat` uses password-required sudo via `sudo` group; the former `sjat-ansible` NOPASSWD drop-in removed 2026-06-18). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **NetBird mesh-enrolled (M5, 2026-06-17):** `wt0` up at `100.99.146.14` via the `base` `mesh` concern. **`base` firewall applied (mesh-hardening 2/3, 2026-06-19):** INPUT-only default-deny — input locked to `wt0` + ssh-from-control (`10.20.10.151`) + workstations (`10.20.10.50` mamba, `10.20.10.17`); forward `accept` (Docker/libvirt-NAT safe). Live-verified (SSH self-path + Docker egress, after a post-apply `restart docker` — base's flush wipes Docker nat, FRICTION); **real-host reboot-validated (2026-06-19):** after an operator reboot, the `policy drop` input chain + full allow-list re-applied on boot and the `wt0` mesh + SSH self-path came back clean. `claude` now self-SSHes (ad-hoc `authorized_keys` grant so the agent can run SSH-based deploys with the auto-rollback safety; fold into the control-node bootstrap). **Pending:** full `base` hardening (auditd/CIS); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservations (10.20.10.151 MAC `88:a4:c2:e0:ee:da` + the `.50`/`.17` workstation leases); Terraform state backup (now relevant — the offsite tfstate exists). |
| `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me``77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **NetBird coordinator (M4b):** `netbird_coordinator` deployed — dashboard live at `https://netbird.askari.wingu.me` (valid LE cert), management API behind embedded Dex (401 unauth), STUN on 3478/udp. **NetBird peer (M5, 2026-06-17):** also enrolled as a mesh agent (`base` `mesh` concern) — `wt0` at `100.99.226.39`, Management+Signal Connected; the agent coexists with the coordinator. **Pending:** host firewall + moving askari's SSH onto `wt0` (deferred mesh-hardening; the Hetzner Cloud Firewall is its perimeter until then), offsite tfstate backup (ADR-022). | | `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me``77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **NetBird coordinator (M4b):** `netbird_coordinator` deployed — dashboard live at `https://netbird.askari.wingu.me` (valid LE cert), management API behind embedded Dex (401 unauth), STUN on 3478/udp. **NetBird peer (M5, 2026-06-17):** also enrolled as a mesh agent (`base` `mesh` concern) — `wt0` at `100.99.226.39`, Management+Signal Connected; the agent coexists with the coordinator. **Pending:** host firewall + moving askari's SSH onto `wt0` (deferred mesh-hardening; the Hetzner Cloud Firewall is its perimeter until then), offsite tfstate backup (ADR-022). |
| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). The `reverse_proxy` Caddyfile is bind-mounted as a **directory** (`./caddy``/etc/caddy`) so atomic re-renders are visible in-container and `caddy reload` actually applies new routes (a single-file mount pinned the stale inode). | | `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). The `reverse_proxy` Caddyfile is bind-mounted as a **directory** (`./caddy``/etc/caddy`) so atomic re-renders are visible in-container and `caddy reload` actually applies new routes (a single-file mount pinned the stale inode). |
| `roles/netbird_coordinator/` — NetBird control plane (ADR-016, M4b) | **Built + applied (askari, 2026-06-16). boma's FIRST real service role.** Self-hosted NetBird **v0.72.4**: a single combined `netbird-server` container (management + signal + relay + STUN + **embedded Dex IdP** at `/oauth2`) + `dashboard:v2.39.0`, on the shared `boma` network behind the M4a Caddy via gRPC-h2c + WebSocket + path routing (`reverse_proxy__routes` gained a raw-`caddy` route type). Secrets `vault.netbird.{auth_secret,datastore_key}` (self-generated). Carries the full service-role file set (SECURITY/VERIFY/ACCESS/BACKUP) — **first stateful role** (`backup__state: true`; encrypted SQLite at `/var/lib/netbird`, off-site backup pending `fisi`/ADR-022). **Verified live:** dashboard 200 + valid LE cert, `/api` 401 (auth-gated, routes OK), STUN up. **Not yet configured:** first-boot `/setup` admin + peer enrolment = M5. | | `roles/netbird_coordinator/` — NetBird control plane (ADR-016, M4b) | **Built + applied (askari, 2026-06-16). boma's FIRST real service role.** Self-hosted NetBird **v0.72.4**: a single combined `netbird-server` container (management + signal + relay + STUN + **embedded Dex IdP** at `/oauth2`) + `dashboard:v2.39.0`, on the shared `boma` network behind the M4a Caddy via gRPC-h2c + WebSocket + path routing (`reverse_proxy__routes` gained a raw-`caddy` route type). Secrets `vault.netbird.{auth_secret,datastore_key}` (self-generated). Carries the full service-role file set (SECURITY/VERIFY/ACCESS/BACKUP) — **first stateful role** (`backup__state: true`; encrypted SQLite at `/var/lib/netbird`, off-site backup pending `fisi`/ADR-022). **Verified live:** dashboard 200 + valid LE cert, `/api` 401 (auth-gated, routes OK), STUN up. **Not yet configured:** first-boot `/setup` admin + peer enrolment = M5. |

View file

@ -22,6 +22,16 @@ earning its keep.
_(append new raw signals here; the next kaizen review consumes them)_ _(append new raw signals here; the next kaizen review consumes them)_
- `[friction]` **Re-asked settled defaults (push + subagent-driven) at the plan→execute handoff**
(2026-06-19): despite the standing preference (memory `dont-reask-settled-defaults`: push to
origin as off-machine backup **and** go subagent-driven, both WITHOUT asking), I again asked the
operator "which execution approach?" and "want me to push?". The `writing-plans` skill scripts
that handoff question ("Which approach?"), and confirming a push felt natural — both overrode the
memory. → at the writing-plans → execution handoff, default to subagent-driven execution and push
to origin without a confirmation gate; reserve questions for genuine forks. Recurrence of an
already-recorded signal — treat the skill's scripted "Which approach?" as pre-answered
(subagent-driven) for this operator.
<!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's <!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's
nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs
the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console + the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console +

View file

@ -209,7 +209,7 @@ Canonical dependency order:
**Phase 1 complete (M1M5); mesh-hardening 2/3 (ubongo default-deny) DONE (2026-06-19)** — **Phase 1 complete (M1M5); mesh-hardening 2/3 (ubongo default-deny) DONE (2026-06-19)** —
INPUT-only nftables default-deny applied + live-verified on `ubongo` (`base__firewall_input_only`; INPUT-only nftables default-deny applied + live-verified on `ubongo` (`base__firewall_input_only`;
spec/plan `docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-ubongo-default-deny*`; spec/plan `docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-ubongo-default-deny*`;
real-host reboot validation pending, low-risk — lockout-safe via the permanent console). real-host reboot-validated 2026-06-19 — ruleset re-applied on boot, mesh + SSH recovered clean).
Remaining mesh-hardening sub-projects, each its own spec → plan → implementation cycle: Remaining mesh-hardening sub-projects, each its own spec → plan → implementation cycle:
1. ~~`ubongo` nftables default-deny + `ssh-from-control`~~ → **DONE (2026-06-19).** 1. ~~`ubongo` nftables default-deny + `ssh-from-control`~~ → **DONE (2026-06-19).**

View file

@ -0,0 +1,409 @@
# Mesh-hardening redesign (askari) — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Harden askari's inbound surface with the proven ubongo INPUT-only default-deny pattern (SSH scoped by `iifname "wt0"` + a permanent WAN break-glass), and make the NetBird coordinator survive a no-egress startup — reboot-safe, no boot-race, no lockout.
**Architecture:** Mirror mesh-hardening 2/3 (ubongo): `base` firewall INPUT-only (`base__firewall_input_only: true`, forward stays `policy accept` so Docker forwarding/NAT survive), **no** sshd `ListenAddress` change (the firewall, not sshd, scopes `:22`). The coordinator-host exception: WAN `:22` stays open from ubongo's static WAN IP as the always-available non-mesh break-glass (the Hetzner console is the ultimate fallback). A `netbird_coordinator` change disables geolocation so a transient egress loss can't FATAL the control plane. Validate firewall reboot-safety on a throwaway VM (ADR-025 harness) GREEN before a supervised live cutover.
**Tech Stack:** Ansible (`base`, `netbird_coordinator` roles), nftables, Docker Compose, Molecule (Debian 13), the `scripts/integration-vm.py` ADR-025 harness, NetBird self-hosted `netbird-server:0.72.4`.
**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md`
## Global Constraints
- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
- **No sshd `ListenAddress` change**`base__ssh_listen_mesh_only` stays `false` everywhere here (this is what sidesteps the 2026-06-17 boot-race).
- **WAN `:22` is never closed** — no Terraform / Hetzner-Cloud-Firewall change in this plan.
- **`base__firewall_input_only: true` on askari** — the forward chain must stay `policy accept` (Docker host). Never apply a forward-`drop` firewall to askari.
- **ubongo's WAN IP is `91.226.145.80`** (operator-confirmed static 2026-06-19) — the break-glass anchor.
- **askari `wt0` IP is `100.99.226.39`**; askari domain `netbird.askari.wingu.me`.
- **Before any commit:** `rbw unlocked` must succeed (the pre-commit hook decrypts `vault.yml`); run `make lint` and it must be clean.
- **Tags:** import each role at play level with its role-name tag; only use concern tags from `tests/tags.yml`.
- **Harness GREEN before live** (Task 3 before Task 4). The live cutover (Task 4) is **operator-gated** — never run autonomously.
---
### Task 1: Disable geolocation in `netbird_coordinator` (FRICTION 2026-06-17 #4)
Make the control plane survive a startup with no container egress: NetBird's combined server downloads the GeoLite2 DB at boot and treats failure as FATAL. boma uses no geo posture (ACL is Allow-All), so disable geolocation entirely via the documented env var. TDD'd through the role's render-only Molecule scenario.
> verified: NetBird self-hosted geolocation knobs (`NB_DISABLE_GEOLOCATION`, `disableGeoliteUpdate`, GeoLite2 pre-seed) · WebFetch · docs.netbird.io/selfhosted/geo-support · 2026-06-19 — *from a docs summary; the live "healthy with egress blocked" check in Task 4 is the real gate, with a concrete pre-seed fallback there.*
**Files:**
- Modify: `roles/netbird_coordinator/defaults/main.yml` (add the knob)
- Modify: `roles/netbird_coordinator/templates/docker-compose.yml.j2:14-27` (add `environment:` to `netbird-server`)
- Test: `roles/netbird_coordinator/molecule/default/verify.yml:21-32` (assert the rendered compose)
- Modify: `roles/netbird_coordinator/README.md` (one line documenting the knob)
**Interfaces:**
- Produces: role default `netbird_coordinator__disable_geolocation` (bool, default `true`); rendered compose env `NB_DISABLE_GEOLOCATION: "true"` on the `netbird-server` service.
- [ ] **Step 1: Write the failing Molecule assertion**
Append to `roles/netbird_coordinator/molecule/default/verify.yml` (after the existing compose-tags assert, inside the same `tasks:` list):
```yaml
- name: Assert geolocation is disabled (FRICTION 2026-06-17 #4 — no geo-DB download FATAL)
ansible.builtin.assert:
that:
- "'NB_DISABLE_GEOLOCATION: \"true\"' in (_compose.content | b64decode)"
fail_msg: >-
compose must set NB_DISABLE_GEOLOCATION=true so a no-egress startup can't FATAL
the coordinator on the GeoLite2 download
success_msg: "geolocation disabled in compose"
```
- [ ] **Step 2: Run Molecule to verify it fails**
Run: `make test ROLE=netbird_coordinator`
Expected: FAIL at "Assert geolocation is disabled" — the rendered compose has no `NB_DISABLE_GEOLOCATION`.
- [ ] **Step 3: Add the default knob**
Add to `roles/netbird_coordinator/defaults/main.yml` (after line 7, the `__domain` line):
```yaml
# Disable NetBird's GeoLite2 geolocation (download + lookups). boma uses no geo posture
# (ACL is Allow-All), and the combined server treats a failed GeoLite2 download as FATAL —
# so a transient egress loss (NAT wiped on `nft flush`, or the boot window before Docker
# re-adds NAT) would crash-loop the whole control plane (FRICTION 2026-06-17 #4). Disabling
# removes that dependency. Revisit if a future ACL sub-project wants geo-based posture.
netbird_coordinator__disable_geolocation: true
```
- [ ] **Step 4: Render the env in the compose template**
In `roles/netbird_coordinator/templates/docker-compose.yml.j2`, add an `environment:` block to the `netbird-server` service, immediately after its `command:` line (line 18):
```yaml
environment:
# Disable geolocation so a no-egress startup can't FATAL the control plane
# (FRICTION 2026-06-17 #4). boma uses no geo posture (ACL Allow-All).
NB_DISABLE_GEOLOCATION: "{{ netbird_coordinator__disable_geolocation | string | lower }}"
```
- [ ] **Step 5: Run Molecule to verify it passes**
Run: `make test ROLE=netbird_coordinator`
Expected: PASS — all asserts green, including "geolocation disabled in compose"; Molecule idempotence clean.
- [ ] **Step 6: Document the knob**
Add one line to `roles/netbird_coordinator/README.md` under its variables/defaults section:
```markdown
- `netbird_coordinator__disable_geolocation` (default `true`) — sets `NB_DISABLE_GEOLOCATION` so a no-egress startup can't FATAL the server on the GeoLite2 download (FRICTION 2026-06-17 #4).
```
- [ ] **Step 7: Lint and commit**
```bash
rbw unlocked && make lint
git add roles/netbird_coordinator/defaults/main.yml \
roles/netbird_coordinator/templates/docker-compose.yml.j2 \
roles/netbird_coordinator/molecule/default/verify.yml \
roles/netbird_coordinator/README.md
git commit -m "feat(netbird_coordinator): disable geolocation so no-egress startup can't FATAL the control plane" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 2: Enable askari's host firewall (INPUT-only) + WAN break-glass + manage over `wt0`
Flip askari from "firewall not applied" to the redesigned INPUT-only default-deny, add the permanent WAN break-glass source, and point Ansible at the mesh. Pure inventory change — validated by lint + inventory resolution (the firewall *behavior* is proven in Task 3).
**Files:**
- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml` (replace the whole file body)
- Create: `inventories/production/host_vars/askari.yml`
**Interfaces:**
- Consumes: `base` knobs `base__firewall_apply`, `base__firewall_input_only`, `base__firewall_admin_addrs`, `base__ssh_listen_mesh_only`, `base__mesh_enabled` (all defined in `roles/base/defaults/main.yml`).
- Produces: askari resolves `ansible_host: 100.99.226.39`, `base__firewall_apply: true`, `base__firewall_input_only: true`, `base__firewall_admin_addrs: ["91.226.145.80"]`.
- [ ] **Step 1: Rewrite the offsite group_vars**
Replace the body of `inventories/production/group_vars/offsite_hosts/vars.yml` with:
```yaml
---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5).
#
# Mesh-hardening REDESIGN (2026-06-19): the 2026-06-17 attempt was backed out (forward
# `policy drop` broke Docker on reboot; wt0-only sshd left no break-glass; ip_nonlocal_bind
# did not beat the boot-race). The redesign mirrors the proven ubongo 2/3 pattern:
# - INPUT-only default-deny (base__firewall_input_only) — forward stays `policy accept`
# so Docker container forwarding/NAT survive a reboot;
# - SSH scoped by the host firewall (iifname wt0 + admin-addr), NOT a sshd ListenAddress
# change — base__ssh_listen_mesh_only stays false, so there is no boot-race;
# - WAN :22 is DELIBERATELY left open from ubongo's WAN IP (base__firewall_admin_addrs)
# as the permanent non-mesh break-glass — the coordinator-host exception (a host's only
# management path must never depend on a service that host itself hosts).
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
base__mesh_enabled: true
base__firewall_apply: true
base__firewall_input_only: true # forward stays `policy accept` → Docker-safe
base__ssh_listen_mesh_only: false # no sshd ListenAddress change → no boot-race
base__firewall_admin_addrs:
- 91.226.145.80 # ubongo's (static) WAN IP — the permanent non-mesh SSH break-glass
```
- [ ] **Step 2: Create the askari host_vars to manage over the mesh**
Create `inventories/production/host_vars/askari.yml`:
```yaml
---
# Manage askari over the NetBird mesh (wt0). Overrides the TF-generated WAN `ansible_host`
# in offsite.yml (host_vars are NOT regenerated by tf_to_inventory.py). The WAN :22 path
# (Hetzner Cloud Firewall + base__firewall_admin_addrs = ubongo's WAN) stays as the
# break-glass; the Hetzner web console is the IP-independent ultimate fallback.
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
ansible_host: 100.99.226.39
```
- [ ] **Step 3: Verify the inventory resolves**
Run: `ansible-inventory -i inventories/production --host askari`
Expected: JSON shows `"ansible_host": "100.99.226.39"`, `"base__firewall_apply": true`, `"base__firewall_input_only": true`, and `"base__firewall_admin_addrs": ["91.226.145.80"]`.
- [ ] **Step 4: Lint**
Run: `rbw unlocked && make lint`
Expected: clean (no yamllint/ansible-lint errors).
- [ ] **Step 5: Commit**
```bash
git add inventories/production/group_vars/offsite_hosts/vars.yml \
inventories/production/host_vars/askari.yml
git commit -m "feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 3: Integration harness "askari_inputonly" profile — the reboot-safety GREEN gate
Prove on a throwaway VM (ADR-025) that the redesigned firewall is reboot-safe BEFORE touching the real host: INPUT default-deny + forward accept + the admin-addr break-glass + published-port DNAT all survive a reboot. New profile (keeps the existing `askari` profile, which validates the `docker_host` container-forward drop-in path, intact).
**Files:**
- Create: `tests/integration/profiles/askari_inputonly.json`
- Create: `tests/integration/overrides/askari_inputonly.yml`
- Modify: `tests/integration/verify.yml` (allow-list + a new profile branch)
**Interfaces:**
- Consumes: the `scripts/integration-vm.py` harness; `make test-integration HOST=<profile>` maps `HOST` to `profiles/<HOST>.json` (a profile name, not a production inventory host).
- Produces: profile `askari_inputonly` with `integration_profile: askari_inputonly`.
- [ ] **Step 1: Add the new profile to the verify allow-list and a failing branch**
In `tests/integration/verify.yml`, change the allow-list assert (line 14) from:
```yaml
- integration_profile in ['askari', 'ubongo']
```
to:
```yaml
- integration_profile in ['askari', 'askari_inputonly', 'ubongo']
```
and update its `fail_msg` (line 15) to `"integration_profile must be set in the profile overlay (askari|askari_inputonly|ubongo)"`. Then append this block to the `tasks:` list (after the ubongo block):
```yaml
# ── askari_inputonly profile — the mesh-hardening REDESIGN (2026-06-19) ──
# INPUT-only default-deny on a Docker host: input policy drop, forward policy ACCEPT
# (Docker-safe), SSH via the admin-addr break-glass, published-port DNAT survives reboot.
- name: (askari_inputonly) Read the live nftables ruleset
when: integration_profile == 'askari_inputonly'
ansible.builtin.command: nft list ruleset
register: _nft_io
changed_when: false
- name: (askari_inputonly) INPUT default-deny, forward permissive, admin-addr break-glass
when: integration_profile == 'askari_inputonly'
ansible.builtin.assert:
that:
- "'hook input priority filter; policy drop;' in _nft_io.stdout"
- "'hook forward priority filter; policy accept;' in _nft_io.stdout"
- "'ip saddr 192.168.150.1 tcp dport 22 accept' in _nft_io.stdout"
fail_msg: >-
askari_inputonly: expected input policy drop, forward policy accept (input-only),
and the admin-addr break-glass (192.168.150.1) SSH allow in the live ruleset.
- name: (askari_inputonly) Gather service facts
when: integration_profile == 'askari_inputonly'
ansible.builtin.service_facts:
- name: (askari_inputonly) Docker daemon is active
when: integration_profile == 'askari_inputonly'
ansible.builtin.assert:
that: "ansible_facts.services['docker.service'].state == 'running'"
fail_msg: "docker.service is not running"
- name: (askari_inputonly) Published port answers from the controller (DNAT + forward alive)
when: integration_profile == 'askari_inputonly'
delegate_to: localhost
become: false
ansible.builtin.uri:
url: "http://{{ ansible_host }}/"
follow_redirects: none
status_code: [200, 301, 308, 404, 502, 503]
timeout: 10
register: _probe_io
retries: 5
delay: 6
until: _probe_io is succeeded
```
- [ ] **Step 2: Create the profile descriptor**
Create `tests/integration/profiles/askari_inputonly.json`:
```json
{
"groups": ["offsite_hosts"],
"applies": [
{"playbook": "site.yml", "tags": ["base"]},
{"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
],
"extra_vars_files": ["overrides/askari_inputonly.yml"],
"mem_mib": 3072,
"vcpus": 2
}
```
- [ ] **Step 3: Create the overlay**
Create `tests/integration/overrides/askari_inputonly.yml`:
```yaml
---
# Integration overlay (ADR-025) — the askari mesh-hardening REDESIGN (2026-06-19).
# Validates INPUT-only default-deny on a Docker host: input policy drop, forward policy
# accept (Docker-safe), SSH via the admin-addr break-glass, reboot-survivable.
integration_profile: askari_inputonly
base__firewall_apply: true
base__firewall_input_only: true
# No sshd ListenAddress change — never wt0-only in a throwaway VM.
base__ssh_listen_mesh_only: false
# Isolated VM: never touch the real mesh.
base__mesh_enabled: false
# The non-mesh SSH break-glass = the admin-addr path the real design uses. Point it at the
# VM's libvirt-NAT gateway (where the harness connects from), by source IP so it is
# interface-independent and the default-deny + reboot don't lock out the driver. This
# mirrors askari's real base__firewall_admin_addrs (ubongo's WAN) in the test topology.
base__firewall_admin_addrs:
- 192.168.150.1
```
- [ ] **Step 4: Run the harness — the GREEN gate**
Run: `make test-integration HOST=askari_inputonly`
Expected: GREEN. The harness boots a VM, applies `base` (INPUT-only) + `docker_host` + `reverse_proxy`, **reboots**, re-SSHes (proving the admin-addr break-glass survives), then `verify.yml` asserts input `policy drop`, forward `policy accept`, the `192.168.150.1` SSH allow, Docker active, and the published `:80` answering. Clean up: `make test-integration-clean`.
- [ ] **Step 5: Commit**
```bash
rbw unlocked && make lint
git add tests/integration/profiles/askari_inputonly.json \
tests/integration/overrides/askari_inputonly.yml \
tests/integration/verify.yml
git commit -m "test(integration): askari_inputonly profile — INPUT-only default-deny reboot gate" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 4: Supervised live cutover + STATUS/ROADMAP update — ⚠️ OPERATOR-GATED
> **⚠️ DO NOT run this task autonomously.** It changes the live off-site host (lockout risk) and runs `make deploy`. An automated executor must STOP here and hand back to the operator. Preconditions: Tasks 13 committed and GREEN; `rbw unlocked`; the **Hetzner web console** open in a browser (the out-of-band ultimate break-glass); the operator present. The WAN `:22` break-glass is never removed, so a fallback path is open throughout (FRICTION 2026-06-17 #6).
**Files (Step 7 only):**
- Modify: `STATUS.md` (askari row), `docs/ROADMAP.md` (Next step)
- [ ] **Step 1: Pre-check both paths are healthy**
```bash
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
ansible askari -i inventories/production -m ping
curl -sI https://test.askari.wingu.me | head -1
curl -sI https://netbird.askari.wingu.me | head -1
```
Expected: wt0 SSH OK; ping `pong`; both curls `HTTP/2 200`.
- [ ] **Step 2: Dry-run the converge (mandatory `check` before `deploy`)**
```bash
make check PLAYBOOK=site LIMIT=askari
```
Expected: changes limited to the `base` firewall (input-only ruleset, admin-addr) + the `netbird_coordinator` compose env (`NB_DISABLE_GEOLOCATION`). Review and show the output before proceeding.
- [ ] **Step 3: Apply (operator present, console open, auto-rollback armed)**
```bash
make deploy PLAYBOOK=site LIMIT=askari
```
The `base` firewall concern arms the auto-rollback timer (`base__firewall_rollback_timeout: 45`) and reconnects over `wt0` — a bad ruleset reverts itself. Expected: converge OK; SSH-over-`wt0` stays up.
- [ ] **Step 4: Rebuild NAT and confirm the coordinator is healthy with geo disabled**
`base`'s `flush ruleset` wipes Docker's nat (FRICTION) — rebuild it, then confirm the control plane:
```bash
ssh sjat@100.99.226.39 'sudo systemctl restart docker'
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
ssh sjat@100.99.226.39 'docker logs --since 2m netbird-server 2>&1 | grep -iE "geo|fatal" || echo "no geo/fatal log lines"'
```
Expected: `netbird-server` + `netbird-dashboard` Up; no geo-DB FATAL.
> **Contingency (only if `netbird-server` still FATALs on geolocation):** `NB_DISABLE_GEOLOCATION` was not honored by the pinned image. Pre-seed the DB into the volume instead — `ssh sjat@100.99.226.39 'sudo curl -fSL -o /var/lib/docker/volumes/netbird_data/_data/GeoLite2-City_20260101.mmdb https://pkgs.netbird.io/geolite2/GeoLite2-City.mmdb && sudo docker restart netbird-server'` — and add `disableGeoliteUpdate: true` under `server:` in `config.yaml.j2` so it never re-downloads. Re-verify, then fold the working fix back into the role (amend Task 1).
- [ ] **Step 5: Verify the new steady state (both SSH paths + services)**
```bash
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
# From ubongo: SSH to askari's WAN IP. ubongo's packets egress via OPNsense, SNAT'd to the
# WAN IP 91.226.145.80 — matching askari's admin-addr break-glass rule. (No BindAddress:
# ubongo does not hold 91.226.145.80; OPNsense does.)
ssh sjat@77.42.120.136 true && echo "WAN break-glass OK"
curl -sI https://test.askari.wingu.me | head -1
nc -vz -u 77.42.120.136 3478 # STUN answers
```
Expected: both SSH paths succeed; cert valid; STUN reachable.
- [ ] **Step 6: Reboot-resilience — the real test (console available)**
```bash
ssh sjat@100.99.226.39 'sudo systemctl reboot'
# wait ~60s, then from ubongo — no manual intervention:
sleep 60; ssh sjat@100.99.226.39 'nft list chain inet filter input | grep -E "policy drop|wt0|91.226.145.80"'
curl -sI https://netbird.askari.wingu.me | head -1
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
```
Expected, unattended: input `policy drop` with the `wt0` + `91.226.145.80` allows; public cert valid; both containers Up; `wt0` SSH back. (If lost: recover via the Hetzner console — the firewall auto-rollback and the WAN break-glass should make that unnecessary.)
- [ ] **Step 7: Record reality in the ground-truth docs and commit**
Update `STATUS.md` (the askari row): firewall now **applied** — INPUT-only default-deny, SSH `wt0`-primary + permanent WAN break-glass (ubongo's WAN), managed over `wt0`, geolocation disabled, **reboot-validated**. Update `docs/ROADMAP.md` "Next step": mark the askari SSH→`wt0` redesign **DONE**; the next mesh-hardening sub-project is the **SPOF reduction** (askari relay single-point-of-failure) — confirmed by the `ubongo → askari` `Relayed` finding (2026-06-19).
```bash
rbw unlocked && make lint
git add STATUS.md docs/ROADMAP.md
git commit -m "docs(status): mesh-hardening redesign — askari INPUT-only + WAN break-glass applied + reboot-validated" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
## Notes / out of scope (carry to the SPOF sub-project)
- **SPOF reduction is the next sub-project** (operator decision 2026-06-19): `ubongo → askari` is currently `Relayed` through askari's own relay; if askari is down, relayed peers lose the mesh data plane. Its own spec.
- **NetBird ACL stays Allow-All** — any enrolled peer can reach askari `wt0:22` until a later sub-project.
- **Full forward-chain hardening** (`docker_host` container-forward drop-in over the `input_only` baseline) — a later tightening; the existing `askari` integration profile already covers that path.
- **Coordinator off-site backup** (FRICTION 2026-06-17 #5, ADR-022) — still pending; not in scope.

View file

@ -0,0 +1,216 @@
# Spec — Mesh-hardening redesign: askari SSH `wt0`-primary + permanent WAN break-glass
Status: Accepted (2026-06-19)
## Context & scope
The **mesh-hardening follow-on** (deferred from M5) was decomposed into three independent
sub-projects, each with its own spec → plan → implementation cycle. Progress so far:
1. ~~askari SSH → `wt0`~~**attempted 2026-06-17, BACKED OUT** after it took askari down
on reboot (spec/plan `docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*`).
2. ubongo nftables INPUT-only default-deny — **DONE 2026-06-19**, reboot-validated
(`base__firewall_input_only`).
3. NetBird ACL off Allow-All → scoped policies — not started.
This spec is the **redesign of (1)**. The operator sequencing decision (2026-06-19) is:
do this redesign **first**, then a separate sub-project to reduce askari's
single-point-of-failure (SPOF) role. **This spec covers only the redesign of (1).** The SPOF
reduction is the named follow-on (its own later spec).
### Why the 2026-06-17 attempt was backed out
Four hazards, recorded in `docs/FRICTION.md` (the six 2026-06-17 signals):
1. **`base`'s `forward policy drop` breaks Docker hosts on reboot** — nftables loaded
default-deny before Docker, so container forwarding/NAT (WAN→Caddy, Caddy→coordinator)
died after reboot.
2. **`ip_nonlocal_bind` did NOT beat the sshd boot-race** — binding sshd `ListenAddress`
to the `wt0` IP still failed at boot ("could not assign the address"); and because
`wt0` never came up, sshd had no listener at all.
3. **The coordinator host can't bootstrap the mesh it depends on** — askari runs the
NetBird coordinator *and* is a mesh peer; its agent needs the local coordinator container
healthy to bring up `wt0`. After an unclean reboot the coordinator was down → `wt0`
never came up → with SSH `wt0`-only, the host was reachable only via the Hetzner console.
General rule: *never make a host's only management path depend on a service that host
itself hosts.*
4. **The coordinator FATAL-loops on the geolocation-DB download with no egress** — a
transient loss of container egress (here: NAT wiped by `nft flush`) crash-loops the whole
control plane.
### What changed since 2026-06-17 (enablers this redesign relies on)
- `docker_host` **container-forward nftables drop-in** (`172ae37`) — reboot-safe Docker
forwarding (available as a later tightening; not required by this pass).
- **`base__firewall_input_only`** — input-only default-deny, forward chain stays
`policy accept` (Docker-safe). **Proven on ubongo and reboot-validated 2026-06-19.**
- The **ADR-025 integration harness** — reproduces a host's boot on a throwaway local VM,
so reboot-safety is proven GREEN before the real host is touched.
## Goal / success criteria
- askari's host nftables firewall is **applied at last** (`base__firewall_apply: true`),
INPUT-only default-deny — matching ubongo.
- **Normal management is over the mesh:** `ansible_host` resolves to askari's `wt0` IP
(`100.99.226.39`); SSH-over-`wt0` and `ansible askari -m ping` over the mesh both succeed.
- **A permanent non-mesh break-glass survives a mesh/coordinator outage**, via two
independent channels:
- the **Hetzner web console** (out-of-band; always works, IP-independent); and
- **WAN `:22` reachable only from ubongo's WAN IP (`91.226.145.80`)**, enforced at *both*
the host nftables layer (`base__firewall_admin_addrs`) and the Hetzner Cloud Firewall.
WAN `:22` is **deliberately NOT closed** — the coordinator-host exception (FRICTION #3).
- **askari survives a reboot under the new firewall, unattended:** Docker forwarding/NAT
intact, `https://test.askari.wingu.me` + `https://netbird.askari.wingu.me` serve valid
certs, STUN `3478/udp` answers, the coordinator container is healthy (geo-DB no longer
FATAL), `wt0` returns, SSH is reachable over both `wt0` and the WAN break-glass.
- **No sshd `ListenAddress` change** (`base__ssh_listen_mesh_only` stays `false`) — this is
what sidesteps the boot-race that sank the 2026-06-17 attempt.
## Design — mirror ubongo 2/3, with the coordinator-host exception
The host firewall does the SSH scoping; sshd is left listening on all interfaces. This is
the ubongo 2/3 pattern, which is proven and reboot-validated.
1. **`base` firewall, INPUT-only default-deny** (`base__firewall_apply: true`,
`base__firewall_input_only: true`): the input chain defaults to `drop`; the forward chain
stays `policy accept` so Docker container forwarding/NAT and published-port DNAT keep
working across a reboot. Allowed ingress:
- `:22/tcp` via `iifname "wt0"` (the interface-name match that survives `wt0` being
absent at boot — `base__firewall_mgmt_interface: wt0`);
- `:22/tcp` from `91.226.145.80` (ubongo's WAN — the break-glass; via
`base__firewall_admin_addrs`);
- the public service surface from the catalog: `80,443/tcp` + `3478/udp` (WAN).
2. **No sshd change.** `base__ssh_listen_mesh_only` stays `false`; sshd keeps listening on
all interfaces. The firewall, not sshd, restricts where `:22` is reachable. There is no
`ListenAddress`, hence no `ip_nonlocal_bind`, hence no boot-race.
3. **The Hetzner Cloud Firewall is unchanged** — the `:22`-from-ubongo rule stays (the
2026-06-17 attempt removed it; this redesign keeps it as the perimeter break-glass).
4. **Coordinator geo-DB robustness** — make the `netbird_coordinator` control plane survive
a transient egress loss (the nat-flush window on apply, and the boot window before Docker
re-adds its NAT), so the coordinator stays healthy and `wt0` can come back. One of:
- **pre-seed** the GeoLite2 DB into the persistent `netbird_data:/var/lib/netbird` volume
so netbird-server finds it locally and never needs to download; or
- **disable / make non-fatal** the geolocation requirement in `config.yaml.j2`.
The exact v0.72.4 mechanism is verified against NetBird's pinned docs at plan time
(ADR-014) — the design fixes the *intent* (a transient egress blip must not FATAL the
control plane); the plan fixes the *knob*.
### Rejected alternatives (these are the 2026-06-17 failures)
- sshd `ListenAddress = wt0 IP` + `ip_nonlocal_bind` → boot-race; did not bind. **Out.**
- `forward policy drop` on a Docker host → broke forwarding on reboot. **Out** (use
`input_only`; the `docker_host` container-forward drop-in is a later tightening).
- Close WAN `:22` entirely → coordinator host left console-only on a bad reboot. **Out**
(keep WAN `:22`-from-ubongo as the always-available non-mesh path).
### How each 2026-06-17 failure is answered
| 2026-06-17 failure | Fix in this design |
|---|---|
| `forward drop` killed Docker on reboot | `base__firewall_input_only: true` → forward stays `accept` |
| `ip_nonlocal_bind` sshd boot-race | no sshd `ListenAddress` change; firewall scopes `:22` by `iifname "wt0"` |
| coordinator chicken-egg / lockout | permanent WAN `:22`-from-ubongo + Hetzner console; management never depends on a service askari hosts |
| coordinator geo-DB FATAL-loop | pre-seed / non-fatal geo so a transient egress blip can't crash the control plane |
## New & changed code
**Inventory:**
- `inventories/production/group_vars/offsite_hosts/vars.yml`
- `base__firewall_apply: true` (was `false`);
- `base__firewall_input_only: true` (new — forward stays `accept`, Docker-safe);
- `base__firewall_admin_addrs: ["91.226.145.80"]` (new — ubongo's WAN, the break-glass;
comment states what it is and why a coordinator host keeps a non-mesh path);
- `base__ssh_listen_mesh_only: false` stays (explicit — no boot-race);
- rewrite the header backout note → "redesigned 2026-06-19: `wt0`-primary + permanent WAN
break-glass; see this spec."
- `inventories/production/host_vars/askari.yml` (**new**) — `ansible_host: 100.99.226.39`
(the `wt0` IP), so Ansible manages askari over the mesh. Overrides the TF-generated WAN
`ansible_host` in `offsite.yml` (host_vars are not regenerated by `tf_to_inventory.py`).
Header comment explains why.
**Role `netbird_coordinator`:**
- The geo-DB robustness change above (`templates/config.yaml.j2` and/or a pre-seed task +
`templates/docker-compose.yml.j2` volume already persists `/var/lib/netbird`), with
Molecule/verify coverage that the control plane comes up without external geo egress.
**Firewall catalog** (`inventories/production/group_vars/all/firewall.yml`):
- **No change.** It already enumerates askari's public ingress (`reverse_proxy` 80/443,
`netbird_stun` 3478/udp). `:22` is handled by the `base` firewall's built-in SSH rules
(`mgmt_interface` `wt0` + `admin_addrs`), not the catalog.
**Terraform / Hetzner Cloud Firewall:**
- **No change.** The WAN `:22`-from-ubongo rule stays (the perimeter half of the break-glass).
**sshd:**
- **No change.**
## Validation
### Harness-first GREEN gate (ADR-025) — before any live change
A "be askari" integration profile (Docker host + a coordinator-like container on the shared
network + `base__firewall_input_only` + `admin_addrs`), driven through `make
test-integration HOST=askari` (reusing the existing profile/overlay/verify pattern):
- input chain default-deny with `:22` accepted via `iifname "wt0"` **and** from the
break-glass admin address; forward chain `policy accept`;
- published-port DNAT + NAT masquerade survive a **reboot** (the RED→GREEN reboot cycle);
- the coordinator-like container comes up healthy with **no external geo egress**;
- SSH path returns after reboot.
This must be GREEN before the live cutover.
### Live cutover — supervised, console open, break-glass never removed
Sequencing rule (FRICTION #6): validate reboot-recovery while a fallback path is still open.
Because the WAN break-glass is *never* removed in this design, that invariant holds by
construction.
1. **Pre-check:** `ssh sjat@100.99.226.39` (over `wt0`) and `ansible askari -m ping` (forced
over `wt0`) both succeed; public services + STUN healthy.
2. **Repoint Ansible:** add `host_vars/askari.yml` (`ansible_host` = `wt0` IP); confirm
`ansible askari -m ping` runs over the mesh.
3. **Apply `base` (+ the geo-DB fix):** one `make deploy PLAYBOOK=site LIMIT=askari`
converge applies INPUT-only default-deny with the `wt0` + admin-addr SSH allow and the
coordinator robustness change. The firewall concern's armed auto-rollback
(`base__firewall_rollback_timeout: 45`) reverts a bad ruleset. Then a post-apply
`restart docker` rebuilds NAT (base's `flush ruleset` wipes Docker's nat — FRICTION); the
coordinator now survives the egress window thanks to the geo-DB fix.
4. **Verify the new steady state:** public services serve valid certs; STUN answers; SSH
over `wt0` works; SSH over the WAN break-glass (`91.226.145.80``:22`) works.
5. **Reboot resilience (the real test):** reboot askari (Hetzner console available) and
confirm — with no intervention — Docker forwarding/NAT, public services, the coordinator,
`wt0`, and SSH (both paths) all return.
## Risks & rollback
- **ubongo's WAN IP anchors the break-glass.** If it is dynamic and rotates, the host
`admin_addrs` rule and the Hetzner FW rule must be updated. The **Hetzner console** is the
IP-independent ultimate break-glass. (Confirmed static by the operator 2026-06-19; it is
also already the Hetzner FW assumption today.)
- **Mid-cutover lockout:** mitigated by the staged order (a path open at each step), the
firewall auto-rollback timer, `ansible_host` = `wt0` (the confirm tests the real new path),
and the WAN break-glass that is never removed.
- **Reboot lockout:** mitigated by `iifname "wt0"` scoping (no sshd boot-race), the WAN
break-glass, the geo-DB fix (coordinator survives the egress window), and harness GREEN.
- **Default-deny breaks a public service:** mitigated by the catalog already enumerating all
live ingress and the §Validation service checks; reversible via `base__firewall_apply:
false`.
- **Ultimate break-glass:** the Hetzner web console (out-of-band).
## Out of scope / follow-ons
- **SPOF reduction (the next sub-project)** — reduce askari's single-point-of-failure role
(currently `ubongo → askari` is `Relayed` through askari's own relay; if askari is down the
mesh data plane for relayed peers is down). Its own spec, after this.
- **NetBird ACL off Allow-All** — until then any enrolled peer can reach askari's `wt0:22`;
scoping that is a separate sub-project.
- **Full forward-chain hardening** — the `docker_host` container-forward drop-in (full
forward default-deny, reboot-safe) as a later tightening over the `input_only` baseline.
- **Coordinator off-site backup** (FRICTION #5, ADR-022) — still pending; noted, not in scope.
- STATUS.md / ROADMAP updates land with the implementation, not this spec.