diff --git a/docs/superpowers/plans/2026-06-19-mesh-hardening-askari-redesign.md b/docs/superpowers/plans/2026-06-19-mesh-hardening-askari-redesign.md
new file mode 100644
index 0000000..04555a9
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-19-mesh-hardening-askari-redesign.md
@@ -0,0 +1,409 @@
+# Mesh-hardening redesign (askari) — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Harden askari's inbound surface with the proven ubongo INPUT-only default-deny pattern (SSH scoped by `iifname "wt0"` + a permanent WAN break-glass), and make the NetBird coordinator survive a no-egress startup — reboot-safe, no boot-race, no lockout.
+
+**Architecture:** Mirror mesh-hardening 2/3 (ubongo): `base` firewall INPUT-only (`base__firewall_input_only: true`, forward stays `policy accept` so Docker forwarding/NAT survive), **no** sshd `ListenAddress` change (the firewall, not sshd, scopes `:22`). The coordinator-host exception: WAN `:22` stays open from ubongo's static WAN IP as the always-available non-mesh break-glass (the Hetzner console is the ultimate fallback). A `netbird_coordinator` change disables geolocation so a transient egress loss can't FATAL the control plane. Validate firewall reboot-safety on a throwaway VM (ADR-025 harness) GREEN before a supervised live cutover.
+
+**Tech Stack:** Ansible (`base`, `netbird_coordinator` roles), nftables, Docker Compose, Molecule (Debian 13), the `scripts/integration-vm.py` ADR-025 harness, NetBird self-hosted `netbird-server:0.72.4`.
+
+**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md`
+
+## Global Constraints
+
+- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
+- **No sshd `ListenAddress` change** — `base__ssh_listen_mesh_only` stays `false` everywhere here (this is what sidesteps the 2026-06-17 boot-race).
+- **WAN `:22` is never closed** — no Terraform / Hetzner-Cloud-Firewall change in this plan.
+- **`base__firewall_input_only: true` on askari** — the forward chain must stay `policy accept` (Docker host). Never apply a forward-`drop` firewall to askari.
+- **ubongo's WAN IP is `91.226.145.80`** (operator-confirmed static 2026-06-19) — the break-glass anchor.
+- **askari `wt0` IP is `100.99.226.39`**; askari domain `netbird.askari.wingu.me`.
+- **Before any commit:** `rbw unlocked` must succeed (the pre-commit hook decrypts `vault.yml`); run `make lint` and it must be clean.
+- **Tags:** import each role at play level with its role-name tag; only use concern tags from `tests/tags.yml`.
+- **Harness GREEN before live** (Task 3 before Task 4). The live cutover (Task 4) is **operator-gated** — never run autonomously.
+
+---
+
+### Task 1: Disable geolocation in `netbird_coordinator` (FRICTION 2026-06-17 #4)
+
+Make the control plane survive a startup with no container egress: NetBird's combined server downloads the GeoLite2 DB at boot and treats failure as FATAL. boma uses no geo posture (ACL is Allow-All), so disable geolocation entirely via the documented env var. TDD'd through the role's render-only Molecule scenario.
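As a rough illustration of the property this task is after, the sketch below checks rendered compose text for the geolocation switch. It is a standalone Python analogue of the role's Molecule assertion, not part of the role: the helper name `geolocation_disabled` and the `SAMPLE` fragment are hypothetical, and the only thing taken from the plan is the rendered form `NB_DISABLE_GEOLOCATION: "true"`.

```python
import re

# Matches the env entry in the form the role is meant to render it:
# the key, a colon, then a double-quoted lowercase "true" on one line.
# Hypothetical helper for illustration; it is not part of the role.
_GEO_OFF = re.compile(r'^\s*NB_DISABLE_GEOLOCATION:\s*"true"\s*$', re.MULTILINE)


def geolocation_disabled(compose_text: str) -> bool:
    """Return True if the compose text sets NB_DISABLE_GEOLOCATION to "true"."""
    return _GEO_OFF.search(compose_text) is not None


# Assumed, trimmed-down compose fragment (only the relevant service and key).
SAMPLE = '''\
services:
  netbird-server:
    environment:
      NB_DISABLE_GEOLOCATION: "true"
'''
```

A check like this only proves the variable is rendered. Whether the pinned image honors it is still decided by the live egress-blocked check in Task 4.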
+
+> verified: NetBird self-hosted geolocation knobs (`NB_DISABLE_GEOLOCATION`, `disableGeoliteUpdate`, GeoLite2 pre-seed) · WebFetch · docs.netbird.io/selfhosted/geo-support · 2026-06-19 — *from a docs summary; the live "healthy with egress blocked" check in Task 4 is the real gate, with a concrete pre-seed fallback there.*
+
+**Files:**
+- Modify: `roles/netbird_coordinator/defaults/main.yml` (add the knob)
+- Modify: `roles/netbird_coordinator/templates/docker-compose.yml.j2:14-27` (add `environment:` to `netbird-server`)
+- Test: `roles/netbird_coordinator/molecule/default/verify.yml:21-32` (assert the rendered compose)
+- Modify: `roles/netbird_coordinator/README.md` (one line documenting the knob)
+
+**Interfaces:**
+- Produces: role default `netbird_coordinator__disable_geolocation` (bool, default `true`); rendered compose env `NB_DISABLE_GEOLOCATION: "true"` on the `netbird-server` service.
+
+- [ ] **Step 1: Write the failing Molecule assertion**
+
+Append to `roles/netbird_coordinator/molecule/default/verify.yml` (after the existing compose-tags assert, inside the same `tasks:` list):
+
+```yaml
+    - name: Assert geolocation is disabled (FRICTION 2026-06-17 #4 — no geo-DB download FATAL)
+      ansible.builtin.assert:
+        that:
+          - "'NB_DISABLE_GEOLOCATION: \"true\"' in (_compose.content | b64decode)"
+        fail_msg: >-
+          compose must set NB_DISABLE_GEOLOCATION=true so a no-egress startup can't FATAL
+          the coordinator on the GeoLite2 download
+        success_msg: "geolocation disabled in compose"
+```
+
+- [ ] **Step 2: Run Molecule to verify it fails**
+
+Run: `make test ROLE=netbird_coordinator`
+Expected: FAIL at "Assert geolocation is disabled" — the rendered compose has no `NB_DISABLE_GEOLOCATION`.
+
+- [ ] **Step 3: Add the default knob**
+
+Add to `roles/netbird_coordinator/defaults/main.yml` (after line 7, the `__domain` line):
+
+```yaml
+
+# Disable NetBird's GeoLite2 geolocation (download + lookups).
+# boma uses no geo posture
+# (ACL is Allow-All), and the combined server treats a failed GeoLite2 download as FATAL —
+# so a transient egress loss (NAT wiped on `nft flush`, or the boot window before Docker
+# re-adds NAT) would crash-loop the whole control plane (FRICTION 2026-06-17 #4). Disabling
+# removes that dependency. Revisit if a future ACL sub-project wants geo-based posture.
+netbird_coordinator__disable_geolocation: true
+```
+
+- [ ] **Step 4: Render the env in the compose template**
+
+In `roles/netbird_coordinator/templates/docker-compose.yml.j2`, add an `environment:` block to the `netbird-server` service, immediately after its `command:` line (line 18):
+
+```yaml
+    environment:
+      # Disable geolocation so a no-egress startup can't FATAL the control plane
+      # (FRICTION 2026-06-17 #4). boma uses no geo posture (ACL Allow-All).
+      NB_DISABLE_GEOLOCATION: "{{ netbird_coordinator__disable_geolocation | string | lower }}"
+```
+
+- [ ] **Step 5: Run Molecule to verify it passes**
+
+Run: `make test ROLE=netbird_coordinator`
+Expected: PASS — all asserts green, including "geolocation disabled in compose"; Molecule idempotence clean.
+
+- [ ] **Step 6: Document the knob**
+
+Add one line to `roles/netbird_coordinator/README.md` under its variables/defaults section:
+
+```markdown
+- `netbird_coordinator__disable_geolocation` (default `true`) — sets `NB_DISABLE_GEOLOCATION` so a no-egress startup can't FATAL the server on the GeoLite2 download (FRICTION 2026-06-17 #4).
+```
+
+- [ ] **Step 7: Lint and commit**
+
+```bash
+rbw unlocked && make lint
+git add roles/netbird_coordinator/defaults/main.yml \
+        roles/netbird_coordinator/templates/docker-compose.yml.j2 \
+        roles/netbird_coordinator/molecule/default/verify.yml \
+        roles/netbird_coordinator/README.md
+git commit -m "feat(netbird_coordinator): disable geolocation so no-egress startup can't FATAL the control plane" \
+           -m "Co-Authored-By: Claude Opus 4.8 (1M context) "
+```
+
+---
+
+### Task 2: Enable askari's host firewall (INPUT-only) + WAN break-glass + manage over `wt0`
+
+Flip askari from "firewall not applied" to the redesigned INPUT-only default-deny, add the permanent WAN break-glass source, and point Ansible at the mesh. Pure inventory change — validated by lint + inventory resolution (the firewall *behavior* is proven in Task 3).
+
+**Files:**
+- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml` (replace the whole file body)
+- Create: `inventories/production/host_vars/askari.yml`
+
+**Interfaces:**
+- Consumes: `base` knobs `base__firewall_apply`, `base__firewall_input_only`, `base__firewall_admin_addrs`, `base__ssh_listen_mesh_only`, `base__mesh_enabled` (all defined in `roles/base/defaults/main.yml`).
+- Produces: askari resolves `ansible_host: 100.99.226.39`, `base__firewall_apply: true`, `base__firewall_input_only: true`, `base__firewall_admin_addrs: ["91.226.145.80"]`.
+
+- [ ] **Step 1: Rewrite the offsite group_vars**
+
+Replace the body of `inventories/production/group_vars/offsite_hosts/vars.yml` with:
+
+```yaml
+---
+# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
+# (ADR-016, M5).
+#
+# Mesh-hardening REDESIGN (2026-06-19): the 2026-06-17 attempt was backed out (forward
+# `policy drop` broke Docker on reboot; wt0-only sshd left no break-glass; ip_nonlocal_bind
+# did not beat the boot-race).
+# The redesign mirrors the proven ubongo 2/3 pattern:
+#   - INPUT-only default-deny (base__firewall_input_only) — forward stays `policy accept`
+#     so Docker container forwarding/NAT survive a reboot;
+#   - SSH scoped by the host firewall (iifname wt0 + admin-addr), NOT a sshd ListenAddress
+#     change — base__ssh_listen_mesh_only stays false, so there is no boot-race;
+#   - WAN :22 is DELIBERATELY left open from ubongo's WAN IP (base__firewall_admin_addrs)
+#     as the permanent non-mesh break-glass — the coordinator-host exception (a host's only
+#     management path must never depend on a service that host itself hosts).
+# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
+base__mesh_enabled: true
+base__firewall_apply: true
+base__firewall_input_only: true  # forward stays `policy accept` → Docker-safe
+base__ssh_listen_mesh_only: false  # no sshd ListenAddress change → no boot-race
+base__firewall_admin_addrs:
+  - 91.226.145.80  # ubongo's (static) WAN IP — the permanent non-mesh SSH break-glass
+```
+
+- [ ] **Step 2: Create the askari host_vars to manage over the mesh**
+
+Create `inventories/production/host_vars/askari.yml`:
+
+```yaml
+---
+# Manage askari over the NetBird mesh (wt0). Overrides the TF-generated WAN `ansible_host`
+# in offsite.yml (host_vars are NOT regenerated by tf_to_inventory.py). The WAN :22 path
+# (Hetzner Cloud Firewall + base__firewall_admin_addrs = ubongo's WAN) stays as the
+# break-glass; the Hetzner web console is the IP-independent ultimate fallback.
+# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
+ansible_host: 100.99.226.39
+```
+
+- [ ] **Step 3: Verify the inventory resolves**
+
+Run: `ansible-inventory -i inventories/production --host askari`
+Expected: JSON shows `"ansible_host": "100.99.226.39"`, `"base__firewall_apply": true`, `"base__firewall_input_only": true`, and `"base__firewall_admin_addrs": ["91.226.145.80"]`.
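The resolution check in Step 3 can also be scripted instead of read by eye. This is a minimal sketch, assuming the JSON printed by `ansible-inventory --host askari` is a flat object of host variables. `EXPECTED` and `hostvar_mismatches` are hypothetical names, and the expected values are the ones this task lists under "Produces".

```python
import json

# Expected resolved values for askari, copied from this task's "Produces" list.
EXPECTED = {
    "ansible_host": "100.99.226.39",
    "base__firewall_apply": True,
    "base__firewall_input_only": True,
    "base__firewall_admin_addrs": ["91.226.145.80"],
}


def hostvar_mismatches(inventory_json: str) -> list[str]:
    """Compare `ansible-inventory --host` JSON against EXPECTED.

    Returns one human-readable line per missing or differing variable;
    an empty list means every expected value resolved as planned.
    """
    hostvars = json.loads(inventory_json)
    problems = []
    for key, want in EXPECTED.items():
        if key not in hostvars:
            problems.append(f"{key}: missing")
        elif hostvars[key] != want:
            problems.append(f"{key}: got {hostvars[key]!r}, want {want!r}")
    return problems
```

One way to use it is to capture the command's stdout and pass it to `hostvar_mismatches`. A result such as `ansible_host: got '77.42.120.136', ...` would indicate the `host_vars/askari.yml` override is not being picked up.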
+
+- [ ] **Step 4: Lint**
+
+Run: `rbw unlocked && make lint`
+Expected: clean (no yamllint/ansible-lint errors).
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add inventories/production/group_vars/offsite_hosts/vars.yml \
+        inventories/production/host_vars/askari.yml
+git commit -m "feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0" \
+           -m "Co-Authored-By: Claude Opus 4.8 (1M context) "
+```
+
+---
+
+### Task 3: Integration harness "askari_inputonly" profile — the reboot-safety GREEN gate
+
+Prove on a throwaway VM (ADR-025) that the redesigned firewall is reboot-safe BEFORE touching the real host: INPUT default-deny + forward accept + the admin-addr break-glass + published-port DNAT all survive a reboot. New profile (keeps the existing `askari` profile, which validates the `docker_host` container-forward drop-in path, intact).
+
+**Files:**
+- Create: `tests/integration/profiles/askari_inputonly.json`
+- Create: `tests/integration/overrides/askari_inputonly.yml`
+- Modify: `tests/integration/verify.yml` (allow-list + a new profile branch)
+
+**Interfaces:**
+- Consumes: the `scripts/integration-vm.py` harness; `make test-integration HOST=` maps `HOST` to `profiles/.json` (a profile name, not a production inventory host).
+- Produces: profile `askari_inputonly` with `integration_profile: askari_inputonly`.
+
+- [ ] **Step 1: Add the new profile to the verify allow-list and a failing branch**
+
+In `tests/integration/verify.yml`, change the allow-list assert (line 14) from:
+
+```yaml
+          - integration_profile in ['askari', 'ubongo']
+```
+
+to:
+
+```yaml
+          - integration_profile in ['askari', 'askari_inputonly', 'ubongo']
+```
+
+and update its `fail_msg` (line 15) to `"integration_profile must be set in the profile overlay (askari|askari_inputonly|ubongo)"`.
+Then append this block to the `tasks:` list (after the ubongo block):
+
+```yaml
+    # ── askari_inputonly profile — the mesh-hardening REDESIGN (2026-06-19) ──
+    # INPUT-only default-deny on a Docker host: input policy drop, forward policy ACCEPT
+    # (Docker-safe), SSH via the admin-addr break-glass, published-port DNAT survives reboot.
+    - name: (askari_inputonly) Read the live nftables ruleset
+      when: integration_profile == 'askari_inputonly'
+      ansible.builtin.command: nft list ruleset
+      register: _nft_io
+      changed_when: false
+
+    - name: (askari_inputonly) INPUT default-deny, forward permissive, admin-addr break-glass
+      when: integration_profile == 'askari_inputonly'
+      ansible.builtin.assert:
+        that:
+          - "'hook input priority filter; policy drop;' in _nft_io.stdout"
+          - "'hook forward priority filter; policy accept;' in _nft_io.stdout"
+          - "'ip saddr 192.168.150.1 tcp dport 22 accept' in _nft_io.stdout"
+        fail_msg: >-
+          askari_inputonly: expected input policy drop, forward policy accept (input-only),
+          and the admin-addr break-glass (192.168.150.1) SSH allow in the live ruleset.
+
+    - name: (askari_inputonly) Gather service facts
+      when: integration_profile == 'askari_inputonly'
+      ansible.builtin.service_facts:
+
+    - name: (askari_inputonly) Docker daemon is active
+      when: integration_profile == 'askari_inputonly'
+      ansible.builtin.assert:
+        that: "ansible_facts.services['docker.service'].state == 'running'"
+        fail_msg: "docker.service is not running"
+
+    - name: (askari_inputonly) Published port answers from the controller (DNAT + forward alive)
+      when: integration_profile == 'askari_inputonly'
+      delegate_to: localhost
+      become: false
+      ansible.builtin.uri:
+        url: "http://{{ ansible_host }}/"
+        follow_redirects: none
+        status_code: [200, 301, 308, 404, 502, 503]
+        timeout: 10
+      register: _probe_io
+      retries: 5
+      delay: 6
+      until: _probe_io is succeeded
+```
+
+- [ ] **Step 2: Create the profile descriptor**
+
+Create `tests/integration/profiles/askari_inputonly.json`:
+
+```json
+{
+  "groups": ["offsite_hosts"],
+  "applies": [
+    {"playbook": "site.yml", "tags": ["base"]},
+    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
+  ],
+  "extra_vars_files": ["overrides/askari_inputonly.yml"],
+  "mem_mib": 3072,
+  "vcpus": 2
+}
+```
+
+- [ ] **Step 3: Create the overlay**
+
+Create `tests/integration/overrides/askari_inputonly.yml`:
+
+```yaml
+---
+# Integration overlay (ADR-025) — the askari mesh-hardening REDESIGN (2026-06-19).
+# Validates INPUT-only default-deny on a Docker host: input policy drop, forward policy
+# accept (Docker-safe), SSH via the admin-addr break-glass, reboot-survivable.
+integration_profile: askari_inputonly
+base__firewall_apply: true
+base__firewall_input_only: true
+# No sshd ListenAddress change — never wt0-only in a throwaway VM.
+base__ssh_listen_mesh_only: false
+# Isolated VM: never touch the real mesh.
+base__mesh_enabled: false
+# The non-mesh SSH break-glass = the admin-addr path the real design uses.
+# Point it at the
+# VM's libvirt-NAT gateway (where the harness connects from), by source IP so it is
+# interface-independent and the default-deny + reboot don't lock out the driver. This
+# mirrors askari's real base__firewall_admin_addrs (ubongo's WAN) in the test topology.
+base__firewall_admin_addrs:
+  - 192.168.150.1
+```
+
+- [ ] **Step 4: Run the harness — the GREEN gate**
+
+Run: `make test-integration HOST=askari_inputonly`
+Expected: GREEN. The harness boots a VM, applies `base` (INPUT-only) + `docker_host` + `reverse_proxy`, **reboots**, re-SSHes (proving the admin-addr break-glass survives), then `verify.yml` asserts input `policy drop`, forward `policy accept`, the `192.168.150.1` SSH allow, Docker active, and the published `:80` answering. Clean up: `make test-integration-clean`.
+
+- [ ] **Step 5: Commit**
+
+```bash
+rbw unlocked && make lint
+git add tests/integration/profiles/askari_inputonly.json \
+        tests/integration/overrides/askari_inputonly.yml \
+        tests/integration/verify.yml
+git commit -m "test(integration): askari_inputonly profile — INPUT-only default-deny reboot gate" \
+           -m "Co-Authored-By: Claude Opus 4.8 (1M context) "
+```
+
+---
+
+### Task 4: Supervised live cutover + STATUS/ROADMAP update — ⚠️ OPERATOR-GATED
+
+> **⚠️ DO NOT run this task autonomously.** It changes the live off-site host (lockout risk) and runs `make deploy`. An automated executor must STOP here and hand back to the operator. Preconditions: Tasks 1–3 committed and GREEN; `rbw unlocked`; the **Hetzner web console** open in a browser (the out-of-band ultimate break-glass); the operator present. The WAN `:22` break-glass is never removed, so a fallback path is open throughout (FRICTION 2026-06-17 #6).
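The preconditions in the warning above can be written down as an explicit checklist. The sketch below is purely illustrative: `CutoverPreconditions`, its field names, and `unmet` are hypothetical and do not exist in the repo. It only reports what is missing. It never runs `make deploy`, and it does not replace the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CutoverPreconditions:
    """Operator-confirmed facts; every field name here is illustrative."""

    tasks_1_to_3_green: bool    # Tasks 1-3 committed and the harness GREEN
    vault_unlocked: bool        # `rbw unlocked` succeeded
    hetzner_console_open: bool  # out-of-band break-glass ready in a browser
    operator_present: bool      # a human is supervising the cutover


def unmet(pre: CutoverPreconditions) -> list[str]:
    """Return the names of preconditions that are not satisfied, in field order."""
    return [name for name, ok in vars(pre).items() if not ok]
```

An empty list from `unmet(...)` means the operator may proceed. Any returned name is a reason to stop and hand back.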
+
+**Files (Step 7 only):**
+- Modify: `STATUS.md` (askari row), `docs/ROADMAP.md` (Next step)
+
+- [ ] **Step 1: Pre-check both paths are healthy**
+
+```bash
+ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
+ansible askari -i inventories/production -m ping
+curl -sI https://test.askari.wingu.me | head -1
+curl -sI https://netbird.askari.wingu.me | head -1
+```
+Expected: wt0 SSH OK; ping `pong`; both curls `HTTP/2 200`.
+
+- [ ] **Step 2: Dry-run the converge (mandatory `check` before `deploy`)**
+
+```bash
+make check PLAYBOOK=site LIMIT=askari
+```
+Expected: changes limited to the `base` firewall (input-only ruleset, admin-addr) + the `netbird_coordinator` compose env (`NB_DISABLE_GEOLOCATION`). Review and show the output before proceeding.
+
+- [ ] **Step 3: Apply (operator present, console open, auto-rollback armed)**
+
+```bash
+make deploy PLAYBOOK=site LIMIT=askari
+```
+The `base` firewall concern arms the auto-rollback timer (`base__firewall_rollback_timeout: 45`) and reconnects over `wt0` — a bad ruleset reverts itself. Expected: converge OK; SSH-over-`wt0` stays up.
+
+- [ ] **Step 4: Rebuild NAT and confirm the coordinator is healthy with geo disabled**
+
+`base`'s `flush ruleset` wipes Docker's nat (FRICTION) — rebuild it, then confirm the control plane:
+
+```bash
+ssh sjat@100.99.226.39 'sudo systemctl restart docker'
+ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
+ssh sjat@100.99.226.39 'docker logs --since 2m netbird-server 2>&1 | grep -iE "geo|fatal" || echo "no geo/fatal log lines"'
+```
+Expected: `netbird-server` + `netbird-dashboard` Up; no geo-DB FATAL.
+
+> **Contingency (only if `netbird-server` still FATALs on geolocation):** `NB_DISABLE_GEOLOCATION` was not honored by the pinned image.
+> Pre-seed the DB into the volume instead — `ssh sjat@100.99.226.39 'sudo curl -fSL -o /var/lib/docker/volumes/netbird_data/_data/GeoLite2-City_20260101.mmdb https://pkgs.netbird.io/geolite2/GeoLite2-City.mmdb && sudo docker restart netbird-server'` — and add `disableGeoliteUpdate: true` under `server:` in `config.yaml.j2` so it never re-downloads. Re-verify, then fold the working fix back into the role (amend Task 1).
+
+- [ ] **Step 5: Verify the new steady state (both SSH paths + services)**
+
+```bash
+ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
+# From ubongo: SSH to askari's WAN IP. ubongo's packets egress via OPNsense, SNAT'd to the
+# WAN IP 91.226.145.80 — matching askari's admin-addr break-glass rule. (No BindAddress:
+# ubongo does not hold 91.226.145.80; OPNsense does.)
+ssh sjat@77.42.120.136 true && echo "WAN break-glass OK"
+curl -sI https://test.askari.wingu.me | head -1
+nc -vz -u 77.42.120.136 3478  # STUN port probe (UDP is connectionless, so this is only a weak reachability signal)
+```
+Expected: both SSH paths succeed; cert valid; STUN reachable.
+
+- [ ] **Step 6: Reboot-resilience — the real test (console available)**
+
+```bash
+ssh sjat@100.99.226.39 'sudo systemctl reboot'
+# wait ~60s, then from ubongo — no manual intervention:
+# (nft needs root to read the ruleset, hence sudo)
+sleep 60; ssh sjat@100.99.226.39 'sudo nft list chain inet filter input | grep -E "policy drop|wt0|91.226.145.80"'
+curl -sI https://netbird.askari.wingu.me | head -1
+ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
+```
+Expected, unattended: input `policy drop` with the `wt0` + `91.226.145.80` allows; public cert valid; both containers Up; `wt0` SSH back. (If lost: recover via the Hetzner console — the firewall auto-rollback and the WAN break-glass should make that unnecessary.)
+
+- [ ] **Step 7: Record reality in the ground-truth docs and commit**
+
+Update `STATUS.md` (the askari row): firewall now **applied** — INPUT-only default-deny, SSH `wt0`-primary + permanent WAN break-glass (ubongo's WAN), managed over `wt0`, geolocation disabled, **reboot-validated**.
+Update `docs/ROADMAP.md` "Next step": mark the askari SSH→`wt0` redesign **DONE**; the next mesh-hardening sub-project is the **SPOF reduction** (askari relay single-point-of-failure) — confirmed by the `ubongo → askari` `Relayed` finding (2026-06-19).
+
+```bash
+rbw unlocked && make lint
+git add STATUS.md docs/ROADMAP.md
+git commit -m "docs(status): mesh-hardening redesign — askari INPUT-only + WAN break-glass applied + reboot-validated" \
+           -m "Co-Authored-By: Claude Opus 4.8 (1M context) "
+```
+
+---
+
+## Notes / out of scope (carry to the SPOF sub-project)
+
+- **SPOF reduction is the next sub-project** (operator decision 2026-06-19): `ubongo → askari` is currently `Relayed` through askari's own relay; if askari is down, relayed peers lose the mesh data plane. Its own spec.
+- **NetBird ACL stays Allow-All** — any enrolled peer can reach askari `wt0:22` until a later sub-project.
+- **Full forward-chain hardening** (`docker_host` container-forward drop-in over the `input_only` baseline) — a later tightening; the existing `askari` integration profile already covers that path.
+- **Coordinator off-site backup** (FRICTION 2026-06-17 #5, ADR-022) — still pending; not in scope.