boma/docs/superpowers/plans/2026-06-19-mesh-hardening-askari-redesign.md
sjat 6be758bece docs(plan): mesh-hardening redesign — askari implementation plan
Four tasks: netbird_coordinator geolocation disable (TDD via Molecule) -> inventory enablement (INPUT-only firewall + WAN break-glass + manage over wt0) -> an askari_inputonly integration profile (the reboot-safety GREEN gate) -> the operator-gated supervised live cutover + STATUS/ROADMAP update. Tasks 1-3 are autonomously implementable; Task 4 is operator-gated (live off-site host, lockout risk).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:32:27 +02:00


Mesh-hardening redesign (askari) — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Harden askari's inbound surface with the proven ubongo INPUT-only default-deny pattern (SSH scoped by iifname "wt0" + a permanent WAN break-glass), and make the NetBird coordinator survive a no-egress startup — reboot-safe, no boot-race, no lockout.

Architecture: Mirror mesh-hardening 2/3 (ubongo): base firewall INPUT-only (base__firewall_input_only: true, forward stays policy accept so Docker forwarding/NAT survive), no sshd ListenAddress change (the firewall, not sshd, scopes :22). The coordinator-host exception: WAN :22 stays open from ubongo's static WAN IP as the always-available non-mesh break-glass (the Hetzner console is the ultimate fallback). A netbird_coordinator change disables geolocation so a transient egress loss can't FATAL the control plane. Validate firewall reboot-safety on a throwaway VM (ADR-025 harness) GREEN before a supervised live cutover.

Tech Stack: Ansible (base, netbird_coordinator roles), nftables, Docker Compose, Molecule (Debian 13), the scripts/integration-vm.py ADR-025 harness, NetBird self-hosted netbird-server:0.72.4.

Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md

Global Constraints

  • FQCN always (ansible.builtin.*); role defaults use the rolename__var namespace.
  • No sshd ListenAddress change: base__ssh_listen_mesh_only stays false everywhere here (this is what sidesteps the 2026-06-17 boot-race).
  • WAN :22 is never closed — no Terraform / Hetzner-Cloud-Firewall change in this plan.
  • base__firewall_input_only: true on askari — the forward chain must stay policy accept (Docker host). Never apply a forward-drop firewall to askari.
  • ubongo's WAN IP is 91.226.145.80 (operator-confirmed static 2026-06-19) — the break-glass anchor.
  • askari wt0 IP is 100.99.226.39; askari domain netbird.askari.wingu.me.
  • Before any commit: rbw unlocked must succeed (the pre-commit hook decrypts vault.yml); run make lint and it must be clean.
  • Tags: import each role at play level with its role-name tag; only use concern tags from tests/tags.yml.
  • Harness GREEN before live (Task 3 before Task 4). The live cutover (Task 4) is operator-gated — never run autonomously.
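The SSH scoping described in the Architecture and constraints above can be pictured as a small decision function. This is only a simplified model of the intended policy, not the rendered nftables ruleset: the function name `ssh_allowed` and the WAN interface name `eth0` in the usage note are hypothetical, and only inbound TCP :22 is modelled.

```python
from ipaddress import ip_address

# Values taken from the plan; the model itself is an illustrative simplification.
MESH_IFACE = "wt0"
ADMIN_ADDRS = {ip_address("91.226.145.80")}  # ubongo's static WAN IP (break-glass)


def ssh_allowed(iifname: str, saddr: str) -> bool:
    """Model the INPUT decision for an inbound TCP packet to port 22.

    Accept if it arrived on the mesh interface, or if its source address is
    one of the admin (break-glass) addresses; otherwise default-deny applies.
    """
    if iifname == MESH_IFACE:
        return True  # mesh path: SSH scoped by interface, not by sshd ListenAddress
    if ip_address(saddr) in ADMIN_ADDRS:
        return True  # permanent non-mesh break-glass
    return False  # INPUT policy drop
```

For example, `ssh_allowed("eth0", "91.226.145.80")` is true (break-glass), while the same call with any other WAN source address is false.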

Task 1: Disable geolocation in netbird_coordinator (FRICTION 2026-06-17 #4)

Make the control plane survive a startup with no container egress: NetBird's combined server downloads the GeoLite2 DB at boot and treats failure as FATAL. boma uses no geo posture (ACL is Allow-All), so disable geolocation entirely via the documented env var. TDD'd through the role's render-only Molecule scenario.

verified: NetBird self-hosted geolocation knobs (NB_DISABLE_GEOLOCATION, disableGeoliteUpdate, GeoLite2 pre-seed) · WebFetch · docs.netbird.io/selfhosted/geo-support · 2026-06-19 — from a docs summary; the live "healthy with egress blocked" check in Task 4 is the real gate, with a concrete pre-seed fallback there.

Files:

  • Modify: roles/netbird_coordinator/defaults/main.yml (add the knob)
  • Modify: roles/netbird_coordinator/templates/docker-compose.yml.j2:14-27 (add environment: to netbird-server)
  • Test: roles/netbird_coordinator/molecule/default/verify.yml:21-32 (assert the rendered compose)
  • Modify: roles/netbird_coordinator/README.md (one line documenting the knob)

Interfaces:

  • Produces: role default netbird_coordinator__disable_geolocation (bool, default true); rendered compose env NB_DISABLE_GEOLOCATION: "true" on the netbird-server service.

  • Step 1: Write the failing Molecule assertion

Append to roles/netbird_coordinator/molecule/default/verify.yml (after the existing compose-tags assert, inside the same tasks: list):

    - name: Assert geolocation is disabled (FRICTION 2026-06-17 #4 — no geo-DB download FATAL)
      ansible.builtin.assert:
        that:
          - "'NB_DISABLE_GEOLOCATION: \"true\"' in (_compose.content | b64decode)"
        fail_msg: >-
          compose must set NB_DISABLE_GEOLOCATION=true so a no-egress startup can't FATAL
          the coordinator on the GeoLite2 download
        success_msg: "geolocation disabled in compose"
  • Step 2: Run Molecule to verify it fails

Run: make test ROLE=netbird_coordinator

Expected: FAIL at "Assert geolocation is disabled" — the rendered compose has no NB_DISABLE_GEOLOCATION.

  • Step 3: Add the default knob

Add to roles/netbird_coordinator/defaults/main.yml (after line 7, the __domain line):


# Disable NetBird's GeoLite2 geolocation (download + lookups). boma uses no geo posture
# (ACL is Allow-All), and the combined server treats a failed GeoLite2 download as FATAL —
# so a transient egress loss (NAT wiped on `nft flush`, or the boot window before Docker
# re-adds NAT) would crash-loop the whole control plane (FRICTION 2026-06-17 #4). Disabling
# removes that dependency. Revisit if a future ACL sub-project wants geo-based posture.
netbird_coordinator__disable_geolocation: true
  • Step 4: Render the env in the compose template

In roles/netbird_coordinator/templates/docker-compose.yml.j2, add an environment: block to the netbird-server service, immediately after its command: line (line 18):

    environment:
      # Disable geolocation so a no-egress startup can't FATAL the control plane
      # (FRICTION 2026-06-17 #4). boma uses no geo posture (ACL Allow-All).
      NB_DISABLE_GEOLOCATION: "{{ netbird_coordinator__disable_geolocation | string | lower }}"
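The `| string | lower` chain in the template line above matters because a Python boolean stringifies with a capital letter. The sketch below mimics that rendering in plain Python (it does not call Jinja2; `render_env_bool` and `compose_env_line` are hypothetical helper names) to show why the rendered line matches the string the Molecule assertion looks for.

```python
def render_env_bool(value: bool) -> str:
    """Mimic `value | string | lower` for a Python bool."""
    # str(True) is "True"; lower-casing gives the "true" the assertion expects.
    return str(value).lower()


def compose_env_line(disable_geolocation: bool) -> str:
    """Build the env line as it would appear in the rendered compose file."""
    return f'NB_DISABLE_GEOLOCATION: "{render_env_bool(disable_geolocation)}"'
```

With the default of `true`, `compose_env_line(True)` yields `NB_DISABLE_GEOLOCATION: "true"`, the exact substring checked in Step 1.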
  • Step 5: Run Molecule to verify it passes

Run: make test ROLE=netbird_coordinator

Expected: PASS — all asserts green, including "geolocation disabled in compose"; Molecule idempotence clean.

  • Step 6: Document the knob

Add one line to roles/netbird_coordinator/README.md under its variables/defaults section:

- `netbird_coordinator__disable_geolocation` (default `true`) — sets `NB_DISABLE_GEOLOCATION` so a no-egress startup can't FATAL the server on the GeoLite2 download (FRICTION 2026-06-17 #4).
  • Step 7: Lint and commit
rbw unlocked && make lint
git add roles/netbird_coordinator/defaults/main.yml \
        roles/netbird_coordinator/templates/docker-compose.yml.j2 \
        roles/netbird_coordinator/molecule/default/verify.yml \
        roles/netbird_coordinator/README.md
git commit -m "feat(netbird_coordinator): disable geolocation so no-egress startup can't FATAL the control plane" \
           -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Task 2: Enable askari's host firewall (INPUT-only) + WAN break-glass + manage over wt0

Flip askari from "firewall not applied" to the redesigned INPUT-only default-deny, add the permanent WAN break-glass source, and point Ansible at the mesh. Pure inventory change — validated by lint + inventory resolution (the firewall behavior is proven in Task 3).

Files:

  • Modify: inventories/production/group_vars/offsite_hosts/vars.yml (replace the whole file body)
  • Create: inventories/production/host_vars/askari.yml

Interfaces:

  • Consumes: base knobs base__firewall_apply, base__firewall_input_only, base__firewall_admin_addrs, base__ssh_listen_mesh_only, base__mesh_enabled (all defined in roles/base/defaults/main.yml).

  • Produces: askari resolves ansible_host: 100.99.226.39, base__firewall_apply: true, base__firewall_input_only: true, base__firewall_admin_addrs: ["91.226.145.80"].

  • Step 1: Rewrite the offsite group_vars

Replace the body of inventories/production/group_vars/offsite_hosts/vars.yml with:

---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5).
#
# Mesh-hardening REDESIGN (2026-06-19): the 2026-06-17 attempt was backed out (forward
# `policy drop` broke Docker on reboot; wt0-only sshd left no break-glass; ip_nonlocal_bind
# did not beat the boot-race). The redesign mirrors the proven ubongo 2/3 pattern:
#   - INPUT-only default-deny (base__firewall_input_only) — forward stays `policy accept`
#     so Docker container forwarding/NAT survive a reboot;
#   - SSH scoped by the host firewall (iifname wt0 + admin-addr), NOT a sshd ListenAddress
#     change — base__ssh_listen_mesh_only stays false, so there is no boot-race;
#   - WAN :22 is DELIBERATELY left open from ubongo's WAN IP (base__firewall_admin_addrs)
#     as the permanent non-mesh break-glass — the coordinator-host exception (a host's only
#     management path must never depend on a service that host itself hosts).
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
base__mesh_enabled: true
base__firewall_apply: true
base__firewall_input_only: true     # forward stays `policy accept` → Docker-safe
base__ssh_listen_mesh_only: false   # no sshd ListenAddress change → no boot-race
base__firewall_admin_addrs:
  - 91.226.145.80   # ubongo's (static) WAN IP — the permanent non-mesh SSH break-glass
  • Step 2: Create the askari host_vars to manage over the mesh

Create inventories/production/host_vars/askari.yml:

---
# Manage askari over the NetBird mesh (wt0). Overrides the TF-generated WAN `ansible_host`
# in offsite.yml (host_vars are NOT regenerated by tf_to_inventory.py). The WAN :22 path
# (Hetzner Cloud Firewall + base__firewall_admin_addrs = ubongo's WAN) stays as the
# break-glass; the Hetzner web console is the IP-independent ultimate fallback.
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
ansible_host: 100.99.226.39
  • Step 3: Verify the inventory resolves

Run: ansible-inventory -i inventories/production --host askari

Expected: JSON shows "ansible_host": "100.99.226.39", "base__firewall_apply": true, "base__firewall_input_only": true, and "base__firewall_admin_addrs": ["91.226.145.80"].
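If eyeballing the JSON feels error-prone, the expected values can be compared mechanically. The following is a minimal sketch, assuming the command's JSON output has been captured as a string; `inventory_mismatches` is a hypothetical helper, not part of the repo.

```python
import json

# Expected resolved values for askari, as listed in this step.
EXPECTED = {
    "ansible_host": "100.99.226.39",
    "base__firewall_apply": True,
    "base__firewall_input_only": True,
    "base__firewall_admin_addrs": ["91.226.145.80"],
}


def inventory_mismatches(host_json: str, expected: dict = EXPECTED) -> dict:
    """Return {key: (expected, actual)} for every expected key that differs.

    `host_json` is the text printed by `ansible-inventory ... --host askari`.
    Keys not listed in `expected` are ignored.
    """
    actual = json.loads(host_json)
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }
```

An empty result means all four values resolved as intended; any entry shows the expected value next to what the inventory actually produced.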

  • Step 4: Lint

Run: rbw unlocked && make lint

Expected: clean (no yamllint/ansible-lint errors).

  • Step 5: Commit
git add inventories/production/group_vars/offsite_hosts/vars.yml \
        inventories/production/host_vars/askari.yml
git commit -m "feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0" \
           -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Task 3: Integration harness "askari_inputonly" profile — the reboot-safety GREEN gate

Prove on a throwaway VM (ADR-025) that the redesigned firewall is reboot-safe BEFORE touching the real host: INPUT default-deny + forward accept + the admin-addr break-glass + published-port DNAT all survive a reboot. This is a new profile; the existing askari profile, which validates the docker_host container-forward drop-in path, stays intact.

Files:

  • Create: tests/integration/profiles/askari_inputonly.json
  • Create: tests/integration/overrides/askari_inputonly.yml
  • Modify: tests/integration/verify.yml (allow-list + a new profile branch)

Interfaces:

  • Consumes: the scripts/integration-vm.py harness; make test-integration HOST=<profile> maps HOST to profiles/<HOST>.json (a profile name, not a production inventory host).

  • Produces: profile askari_inputonly with integration_profile: askari_inputonly.

  • Step 1: Add the new profile to the verify allow-list and a failing branch

In tests/integration/verify.yml, change the allow-list assert (line 14) from:

          - integration_profile in ['askari', 'ubongo']

to:

          - integration_profile in ['askari', 'askari_inputonly', 'ubongo']

and update its fail_msg (line 15) to "integration_profile must be set in the profile overlay (askari|askari_inputonly|ubongo)". Then append this block to the tasks: list (after the ubongo block):

    # ── askari_inputonly profile — the mesh-hardening REDESIGN (2026-06-19) ──
    # INPUT-only default-deny on a Docker host: input policy drop, forward policy ACCEPT
    # (Docker-safe), SSH via the admin-addr break-glass, published-port DNAT survives reboot.
    - name: (askari_inputonly) Read the live nftables ruleset
      when: integration_profile == 'askari_inputonly'
      ansible.builtin.command: nft list ruleset
      register: _nft_io
      changed_when: false

    - name: (askari_inputonly) INPUT default-deny, forward permissive, admin-addr break-glass
      when: integration_profile == 'askari_inputonly'
      ansible.builtin.assert:
        that:
          - "'hook input priority filter; policy drop;' in _nft_io.stdout"
          - "'hook forward priority filter; policy accept;' in _nft_io.stdout"
          - "'ip saddr 192.168.150.1 tcp dport 22 accept' in _nft_io.stdout"
        fail_msg: >-
          askari_inputonly: expected input policy drop, forward policy accept (input-only),
          and the admin-addr break-glass (192.168.150.1) SSH allow in the live ruleset.

    - name: (askari_inputonly) Gather service facts
      when: integration_profile == 'askari_inputonly'
      ansible.builtin.service_facts:

    - name: (askari_inputonly) Docker daemon is active
      when: integration_profile == 'askari_inputonly'
      ansible.builtin.assert:
        that: "ansible_facts.services['docker.service'].state == 'running'"
        fail_msg: "docker.service is not running"

    - name: (askari_inputonly) Published port answers from the controller (DNAT + forward alive)
      when: integration_profile == 'askari_inputonly'
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}/"
        follow_redirects: none
        status_code: [200, 301, 308, 404, 502, 503]
        timeout: 10
      register: _probe_io
      retries: 5
      delay: 6
      until: _probe_io is succeeded
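The published-port probe above relies on a bounded retry so that a slow container start after reboot does not fail the gate. The sketch below shows that general pattern in plain Python; it is a generic illustration and does not claim to reproduce Ansible's exact attempt counting for `retries`/`until`. The name `probe_until` and its parameters are hypothetical.

```python
import time
from typing import Callable


def probe_until(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, pausing `delay` seconds between failures.

    Returns True on the first successful probe, False if every attempt fails.
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)  # no pause after the final failed attempt
    return False
```

In practice `probe` would wrap an HTTP request to the VM and return whether the status code is in the accepted list.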
  • Step 2: Create the profile descriptor

Create tests/integration/profiles/askari_inputonly.json:

{
  "groups": ["offsite_hosts"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
  ],
  "extra_vars_files": ["overrides/askari_inputonly.yml"],
  "mem_mib": 3072,
  "vcpus": 2
}
  • Step 3: Create the overlay

Create tests/integration/overrides/askari_inputonly.yml:

---
# Integration overlay (ADR-025) — the askari mesh-hardening REDESIGN (2026-06-19).
# Validates INPUT-only default-deny on a Docker host: input policy drop, forward policy
# accept (Docker-safe), SSH via the admin-addr break-glass, reboot-survivable.
integration_profile: askari_inputonly
base__firewall_apply: true
base__firewall_input_only: true
# No sshd ListenAddress change — never wt0-only in a throwaway VM.
base__ssh_listen_mesh_only: false
# Isolated VM: never touch the real mesh.
base__mesh_enabled: false
# The non-mesh SSH break-glass = the admin-addr path the real design uses. Point it at the
# VM's libvirt-NAT gateway (where the harness connects from), by source IP so it is
# interface-independent and the default-deny + reboot don't lock out the driver. This
# mirrors askari's real base__firewall_admin_addrs (ubongo's WAN) in the test topology.
base__firewall_admin_addrs:
  - 192.168.150.1
  • Step 4: Run the harness — the GREEN gate

Run: make test-integration HOST=askari_inputonly

Expected: GREEN. The harness boots a VM, applies base (INPUT-only) + docker_host + reverse_proxy, reboots, re-SSHes (proving the admin-addr break-glass survives), then verify.yml asserts input policy drop, forward policy accept, the 192.168.150.1 SSH allow, Docker active, and the published :80 answering.

Clean up: make test-integration-clean.
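The three ruleset assertions in verify.yml are plain substring checks against `nft list ruleset` output. A minimal equivalent in Python, useful for checking a saved ruleset dump by hand, might look like this; `missing_fragments` is a hypothetical helper and the fragments are copied from the assert block in Step 1.

```python
# Substrings the askari_inputonly verify branch expects in the live ruleset.
REQUIRED_FRAGMENTS = (
    "hook input priority filter; policy drop;",
    "hook forward priority filter; policy accept;",
    "ip saddr 192.168.150.1 tcp dport 22 accept",
)


def missing_fragments(ruleset: str, required: tuple = REQUIRED_FRAGMENTS) -> list:
    """Return the required fragments that do not occur in the ruleset text."""
    return [fragment for fragment in required if fragment not in ruleset]
```

An empty list corresponds to the assert passing. Because these are exact substring matches, a change in how nft prints a rule (spacing, rule ordering within a line) would show up as a missing fragment even if the policy is equivalent.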

  • Step 5: Commit
rbw unlocked && make lint
git add tests/integration/profiles/askari_inputonly.json \
        tests/integration/overrides/askari_inputonly.yml \
        tests/integration/verify.yml
git commit -m "test(integration): askari_inputonly profile — INPUT-only default-deny reboot gate" \
           -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Task 4: Supervised live cutover + STATUS/ROADMAP update — ⚠️ OPERATOR-GATED

⚠️ DO NOT run this task autonomously. It changes the live off-site host (lockout risk) and runs make deploy. An automated executor must STOP here and hand back to the operator. Preconditions: Tasks 1-3 committed and GREEN; rbw unlocked; the Hetzner web console open in a browser (the out-of-band ultimate break-glass); the operator present. The WAN :22 break-glass is never removed, so a fallback path is open throughout (FRICTION 2026-06-17 #6).

Files (Step 7 only):

  • Modify: STATUS.md (askari row), docs/ROADMAP.md (Next step)

  • Step 1: Pre-check both paths are healthy

ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
ansible askari -i inventories/production -m ping
curl -sI https://test.askari.wingu.me | head -1
curl -sI https://netbird.askari.wingu.me | head -1

Expected: wt0 SSH OK; ping pong; both curls HTTP/2 200.

  • Step 2: Dry-run the converge (mandatory check before deploy)
make check PLAYBOOK=site LIMIT=askari

Expected: changes limited to the base firewall (input-only ruleset, admin-addr) + the netbird_coordinator compose env (NB_DISABLE_GEOLOCATION). Review and show the output before proceeding.

  • Step 3: Apply (operator present, console open, auto-rollback armed)
make deploy PLAYBOOK=site LIMIT=askari

The base firewall concern arms the auto-rollback timer (base__firewall_rollback_timeout: 45) and reconnects over wt0 — a bad ruleset reverts itself. Expected: converge OK; SSH-over-wt0 stays up.

  • Step 4: Rebuild NAT and confirm the coordinator is healthy with geo disabled

base's flush ruleset wipes Docker's nat (FRICTION) — rebuild it, then confirm the control plane:

ssh sjat@100.99.226.39 'sudo systemctl restart docker'
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
ssh sjat@100.99.226.39 'docker logs --since 2m netbird-server 2>&1 | grep -iE "geo|fatal" || echo "no geo/fatal log lines"'

Expected: netbird-server + netbird-dashboard Up; no geo-DB FATAL.
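The `grep -iE "geo|fatal"` filter above is a case-insensitive match on either word. For reviewing a saved log file offline, the same filter can be sketched in Python; `matching_lines` is a hypothetical helper and assumes the log text has already been captured.

```python
import re

# Same pattern as `grep -iE "geo|fatal"`: either word, any letter case.
GEO_OR_FATAL = re.compile(r"geo|fatal", re.IGNORECASE)


def matching_lines(log_text: str) -> list:
    """Return the log lines that mention geolocation or a fatal error."""
    return [line for line in log_text.splitlines() if GEO_OR_FATAL.search(line)]
```

An empty result corresponds to the "no geo/fatal log lines" branch of the shell command. Note that the pattern is deliberately broad, so a harmless line that merely mentions geolocation being disabled would also match and needs a human read.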

Contingency (only if netbird-server still FATALs on geolocation): NB_DISABLE_GEOLOCATION was not honored by the pinned image. Pre-seed the DB into the volume instead:

ssh sjat@100.99.226.39 'sudo curl -fSL -o /var/lib/docker/volumes/netbird_data/_data/GeoLite2-City_20260101.mmdb https://pkgs.netbird.io/geolite2/GeoLite2-City.mmdb && sudo docker restart netbird-server'

Then add disableGeoliteUpdate: true under server: in config.yaml.j2 so it never re-downloads. Re-verify, then fold the working fix back into the role (amend Task 1).

  • Step 5: Verify the new steady state (both SSH paths + services)
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
# From ubongo: SSH to askari's WAN IP. ubongo's packets egress via OPNsense, SNAT'd to the
# WAN IP 91.226.145.80 — matching askari's admin-addr break-glass rule. (No BindAddress:
# ubongo does not hold 91.226.145.80; OPNsense does.)
ssh sjat@77.42.120.136 true && echo "WAN break-glass OK"
curl -sI https://test.askari.wingu.me | head -1
nc -vz -u 77.42.120.136 3478   # UDP probe of the STUN port (nc -u only shows no ICMP unreachable came back, not a STUN reply)

Expected: both SSH paths succeed; cert valid; STUN reachable.

  • Step 6: Reboot-resilience — the real test (console available)
ssh sjat@100.99.226.39 'sudo systemctl reboot'
# wait ~60s, then from ubongo — no manual intervention:
sleep 60; ssh sjat@100.99.226.39 'sudo nft list chain inet filter input | grep -E "policy drop|wt0|91.226.145.80"'
curl -sI https://netbird.askari.wingu.me | head -1
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'

Expected, unattended: input policy drop with the wt0 + 91.226.145.80 allows; public cert valid; both containers Up; wt0 SSH back. (If lost: recover via the Hetzner console — the firewall auto-rollback and the WAN break-glass should make that unnecessary.)
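The fixed `sleep 60` above is a guess at the reboot time. A hedged alternative is to poll the SSH port until it accepts a TCP connection, with an overall deadline. The sketch below uses only the standard library; `wait_for_tcp` and its parameters are hypothetical, and a successful connect only shows that the port is open through the firewall, not that SSH authentication works.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout: float = 180.0,
                 interval: float = 5.0, connect_timeout: float = 3.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` elapses.

    Returns True as soon as a connection is accepted, False at the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Refused, unreachable, or timed out: host is probably still booting.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Something like `wait_for_tcp("100.99.226.39", 22)` could gate the post-reboot checks, with the remaining commands in this step still run afterwards to confirm the ruleset and containers.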

  • Step 7: Record reality in the ground-truth docs and commit

Update STATUS.md (the askari row): firewall now applied — INPUT-only default-deny, SSH wt0-primary + permanent WAN break-glass (ubongo's WAN), managed over wt0, geolocation disabled, reboot-validated. Update docs/ROADMAP.md "Next step": mark the askari SSH→wt0 redesign DONE; the next mesh-hardening sub-project is the SPOF reduction (askari relay single-point-of-failure) — confirmed by the ubongo → askari Relayed finding (2026-06-19).

rbw unlocked && make lint
git add STATUS.md docs/ROADMAP.md
git commit -m "docs(status): mesh-hardening redesign — askari INPUT-only + WAN break-glass applied + reboot-validated" \
           -m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Notes / out of scope (carry to the SPOF sub-project)

  • SPOF reduction is the next sub-project (operator decision 2026-06-19): ubongo → askari is currently Relayed through askari's own relay; if askari is down, relayed peers lose the mesh data plane. It gets its own spec.
  • NetBird ACL stays Allow-All — any enrolled peer can reach askari wt0:22 until a later sub-project.
  • Full forward-chain hardening (docker_host container-forward drop-in over the input_only baseline) — a later tightening; the existing askari integration profile already covers that path.
  • Coordinator off-site backup (FRICTION 2026-06-17 #5, ADR-022) — still pending; not in scope.