boma/docs/superpowers/plans/2026-06-19-mesh-hardening-askari-redesign.md
sjat 6be758bece docs(plan): mesh-hardening redesign — askari implementation plan
Four tasks: netbird_coordinator geolocation disable (TDD via Molecule) -> inventory enablement (INPUT-only firewall + WAN break-glass + manage over wt0) -> an askari_inputonly integration profile (the reboot-safety GREEN gate) -> the operator-gated supervised live cutover + STATUS/ROADMAP update. Tasks 1-3 are autonomously implementable; Task 4 is operator-gated (live off-site host, lockout risk).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:32:27 +02:00

# Mesh-hardening redesign (askari) — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Harden askari's inbound surface with the proven ubongo INPUT-only default-deny pattern (SSH scoped by `iifname "wt0"` + a permanent WAN break-glass), and make the NetBird coordinator survive a no-egress startup — reboot-safe, no boot-race, no lockout.

**Architecture:** Mirror mesh-hardening 2/3 (ubongo): `base` firewall INPUT-only (`base__firewall_input_only: true`, forward stays `policy accept` so Docker forwarding/NAT survive), **no** sshd `ListenAddress` change (the firewall, not sshd, scopes `:22`). The coordinator-host exception: WAN `:22` stays open from ubongo's static WAN IP as the always-available non-mesh break-glass (the Hetzner console is the ultimate fallback). A `netbird_coordinator` change disables geolocation so a transient egress loss can't FATAL the control plane. Validate firewall reboot-safety on a throwaway VM (ADR-025 harness) GREEN before a supervised live cutover.

**Tech Stack:** Ansible (`base`, `netbird_coordinator` roles), nftables, Docker Compose, Molecule (Debian 13), the `scripts/integration-vm.py` ADR-025 harness, NetBird self-hosted `netbird-server:0.72.4`.

**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md`

## Global Constraints
- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
- **No sshd `ListenAddress` change** — `base__ssh_listen_mesh_only` stays `false` everywhere here (this is what sidesteps the 2026-06-17 boot-race).
- **WAN `:22` is never closed** — no Terraform / Hetzner-Cloud-Firewall change in this plan.
- **`base__firewall_input_only: true` on askari** — the forward chain must stay `policy accept` (Docker host). Never apply a forward-`drop` firewall to askari.
- **ubongo's WAN IP is `91.226.145.80`** (operator-confirmed static 2026-06-19) — the break-glass anchor.
- **askari `wt0` IP is `100.99.226.39`**; askari domain `netbird.askari.wingu.me`.
- **Before any commit:** `rbw unlocked` must succeed (the pre-commit hook decrypts `vault.yml`); run `make lint` and it must be clean.
- **Tags:** import each role at play level with its role-name tag; only use concern tags from `tests/tags.yml`.
- **Harness GREEN before live** (Task 3 before Task 4). The live cutover (Task 4) is **operator-gated** — never run autonomously.
---
### Task 1: Disable geolocation in `netbird_coordinator` (FRICTION 2026-06-17 #4)
Make the control plane survive a startup with no container egress: NetBird's combined server downloads the GeoLite2 DB at boot and treats failure as FATAL. boma uses no geo posture (ACL is Allow-All), so disable geolocation entirely via the documented env var. TDD'd through the role's render-only Molecule scenario.
> verified: NetBird self-hosted geolocation knobs (`NB_DISABLE_GEOLOCATION`, `disableGeoliteUpdate`, GeoLite2 pre-seed) · WebFetch · docs.netbird.io/selfhosted/geo-support · 2026-06-19 — *from a docs summary; the live "healthy with egress blocked" check in Task 4 is the real gate, with a concrete pre-seed fallback there.*

**Files:**
- Modify: `roles/netbird_coordinator/defaults/main.yml` (add the knob)
- Modify: `roles/netbird_coordinator/templates/docker-compose.yml.j2:14-27` (add `environment:` to `netbird-server`)
- Test: `roles/netbird_coordinator/molecule/default/verify.yml:21-32` (assert the rendered compose)
- Modify: `roles/netbird_coordinator/README.md` (one line documenting the knob)
**Interfaces:**
- Produces: role default `netbird_coordinator__disable_geolocation` (bool, default `true`); rendered compose env `NB_DISABLE_GEOLOCATION: "true"` on the `netbird-server` service.
- [ ] **Step 1: Write the failing Molecule assertion**
Append to `roles/netbird_coordinator/molecule/default/verify.yml` (after the existing compose-tags assert, inside the same `tasks:` list):
```yaml
- name: Assert geolocation is disabled (FRICTION 2026-06-17 #4 — no geo-DB download FATAL)
  ansible.builtin.assert:
    that:
      - "'NB_DISABLE_GEOLOCATION: \"true\"' in (_compose.content | b64decode)"
    fail_msg: >-
      compose must set NB_DISABLE_GEOLOCATION=true so a no-egress startup can't FATAL
      the coordinator on the GeoLite2 download
    success_msg: "geolocation disabled in compose"
```
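The assertion above is a plain substring match against the decoded compose file. A minimal Python sketch of the same check follows, assuming `_compose` holds base64 content in the style of an `ansible.builtin.slurp` result; the compose fragments are illustrative stand-ins, not the role's real rendered output.

```python
import base64

# Exact text the Molecule assertion looks for (quotes included).
NEEDLE = 'NB_DISABLE_GEOLOCATION: "true"'


def encode(text: str) -> str:
    """Base64-encode text the way a slurped file's `content` field would be."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def geolocation_disabled(slurped_content: str) -> bool:
    """Return True if the decoded compose text contains the quoted env entry."""
    compose_text = base64.b64decode(slurped_content).decode("utf-8")
    return NEEDLE in compose_text


# Hypothetical compose fragments, before and after Step 4.
before = "services:\n  netbird-server:\n    image: example/netbird-server\n"
after = before + '    environment:\n      NB_DISABLE_GEOLOCATION: "true"\n'

print(geolocation_disabled(encode(before)))  # False
print(geolocation_disabled(encode(after)))   # True
```

Because the match is textual, it is sensitive to quoting: an unquoted `NB_DISABLE_GEOLOCATION: true` would not satisfy it, which is why the template in Step 4 wraps the value in double quotes.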
- [ ] **Step 2: Run Molecule to verify it fails**
Run: `make test ROLE=netbird_coordinator`
Expected: FAIL at "Assert geolocation is disabled" — the rendered compose has no `NB_DISABLE_GEOLOCATION`.
- [ ] **Step 3: Add the default knob**
Add to `roles/netbird_coordinator/defaults/main.yml` (after line 7, the `__domain` line):
```yaml
# Disable NetBird's GeoLite2 geolocation (download + lookups). boma uses no geo posture
# (ACL is Allow-All), and the combined server treats a failed GeoLite2 download as FATAL —
# so a transient egress loss (NAT wiped on `nft flush`, or the boot window before Docker
# re-adds NAT) would crash-loop the whole control plane (FRICTION 2026-06-17 #4). Disabling
# removes that dependency. Revisit if a future ACL sub-project wants geo-based posture.
netbird_coordinator__disable_geolocation: true
```
- [ ] **Step 4: Render the env in the compose template**
In `roles/netbird_coordinator/templates/docker-compose.yml.j2`, add an `environment:` block to the `netbird-server` service, immediately after its `command:` line (line 18):
```yaml
    environment:
      # Disable geolocation so a no-egress startup can't FATAL the control plane
      # (FRICTION 2026-06-17 #4). boma uses no geo posture (ACL Allow-All).
      NB_DISABLE_GEOLOCATION: "{{ netbird_coordinator__disable_geolocation | string | lower }}"
```
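The `| string | lower` chain matters because Jinja2 renders a boolean as `True` or `False`, while the Molecule assertion in Step 1 looks for lowercase `"true"`. A small Python sketch of that rendering, using `str()` as a stand-in for the Jinja2 filters (the helper names are hypothetical):

```python
def render_flag(value: bool) -> str:
    """Stand-in for Jinja2's `value | string | lower` applied to a boolean."""
    return str(value).lower()


def render_env_line(value: bool, lowercase: bool = True) -> str:
    """Render the compose env entry, with or without the lowercase step."""
    rendered = render_flag(value) if lowercase else str(value)
    return f'NB_DISABLE_GEOLOCATION: "{rendered}"'


needle = 'NB_DISABLE_GEOLOCATION: "true"'  # what verify.yml searches for

print(render_env_line(True))                             # NB_DISABLE_GEOLOCATION: "true"
print(render_env_line(True, lowercase=False))            # NB_DISABLE_GEOLOCATION: "True"
print(needle in render_env_line(True, lowercase=False))  # False
```

Without the lowercase step the rendered value would be `"True"`, and the substring assertion would fail even though the knob is enabled.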
- [ ] **Step 5: Run Molecule to verify it passes**
Run: `make test ROLE=netbird_coordinator`
Expected: PASS — all asserts green, including "geolocation disabled in compose"; Molecule idempotence clean.
- [ ] **Step 6: Document the knob**
Add one line to `roles/netbird_coordinator/README.md` under its variables/defaults section:
```markdown
- `netbird_coordinator__disable_geolocation` (default `true`) — sets `NB_DISABLE_GEOLOCATION` so a no-egress startup can't FATAL the server on the GeoLite2 download (FRICTION 2026-06-17 #4).
```
- [ ] **Step 7: Lint and commit**
```bash
rbw unlocked && make lint
git add roles/netbird_coordinator/defaults/main.yml \
roles/netbird_coordinator/templates/docker-compose.yml.j2 \
roles/netbird_coordinator/molecule/default/verify.yml \
roles/netbird_coordinator/README.md
git commit -m "feat(netbird_coordinator): disable geolocation so no-egress startup can't FATAL the control plane" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 2: Enable askari's host firewall (INPUT-only) + WAN break-glass + manage over `wt0`
Flip askari from "firewall not applied" to the redesigned INPUT-only default-deny, add the permanent WAN break-glass source, and point Ansible at the mesh. Pure inventory change — validated by lint + inventory resolution (the firewall *behavior* is proven in Task 3).
**Files:**
- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml` (replace the whole file body)
- Create: `inventories/production/host_vars/askari.yml`
**Interfaces:**
- Consumes: `base` knobs `base__firewall_apply`, `base__firewall_input_only`, `base__firewall_admin_addrs`, `base__ssh_listen_mesh_only`, `base__mesh_enabled` (all defined in `roles/base/defaults/main.yml`).
- Produces: askari resolves `ansible_host: 100.99.226.39`, `base__firewall_apply: true`, `base__firewall_input_only: true`, `base__firewall_admin_addrs: ["91.226.145.80"]`.
- [ ] **Step 1: Rewrite the offsite group_vars**
Replace the body of `inventories/production/group_vars/offsite_hosts/vars.yml` with:
```yaml
---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5).
#
# Mesh-hardening REDESIGN (2026-06-19): the 2026-06-17 attempt was backed out (forward
# `policy drop` broke Docker on reboot; wt0-only sshd left no break-glass; ip_nonlocal_bind
# did not beat the boot-race). The redesign mirrors the proven ubongo 2/3 pattern:
# - INPUT-only default-deny (base__firewall_input_only) — forward stays `policy accept`
# so Docker container forwarding/NAT survive a reboot;
# - SSH scoped by the host firewall (iifname wt0 + admin-addr), NOT a sshd ListenAddress
# change — base__ssh_listen_mesh_only stays false, so there is no boot-race;
# - WAN :22 is DELIBERATELY left open from ubongo's WAN IP (base__firewall_admin_addrs)
# as the permanent non-mesh break-glass — the coordinator-host exception (a host's only
# management path must never depend on a service that host itself hosts).
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
base__mesh_enabled: true
base__firewall_apply: true
base__firewall_input_only: true # forward stays `policy accept` → Docker-safe
base__ssh_listen_mesh_only: false # no sshd ListenAddress change → no boot-race
base__firewall_admin_addrs:
- 91.226.145.80 # ubongo's (static) WAN IP — the permanent non-mesh SSH break-glass
```
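The comments above describe two independent SSH allows in front of a default-deny input policy. A minimal Python model of that decision is sketched here; it covers only the `:22` rules named in the comments, and the interface name `eth0` in the usage lines is a hypothetical WAN interface, not taken from the real host.

```python
from ipaddress import ip_address

MESH_IFACE = "wt0"
# Mirrors base__firewall_admin_addrs: ubongo's static WAN IP.
ADMIN_ADDRS = {ip_address("91.226.145.80")}


def ssh_allowed(iifname: str, saddr: str) -> bool:
    """Model the two SSH allows; anything else falls through to policy drop."""
    if iifname == MESH_IFACE:
        return True  # mesh path: scoped by interface, any enrolled peer
    return ip_address(saddr) in ADMIN_ADDRS  # WAN break-glass: scoped by source IP


print(ssh_allowed("wt0", "100.99.1.2"))      # True
print(ssh_allowed("eth0", "91.226.145.80"))  # True
print(ssh_allowed("eth0", "203.0.113.7"))    # False
```

The two paths fail independently: losing the mesh leaves the source-IP rule intact, and a change of ubongo's WAN IP leaves the `wt0` rule intact.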
- [ ] **Step 2: Create the askari host_vars to manage over the mesh**
Create `inventories/production/host_vars/askari.yml`:
```yaml
---
# Manage askari over the NetBird mesh (wt0). Overrides the TF-generated WAN `ansible_host`
# in offsite.yml (host_vars are NOT regenerated by tf_to_inventory.py). The WAN :22 path
# (Hetzner Cloud Firewall + base__firewall_admin_addrs = ubongo's WAN) stays as the
# break-glass; the Hetzner web console is the IP-independent ultimate fallback.
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
ansible_host: 100.99.226.39
```
- [ ] **Step 3: Verify the inventory resolves**
Run: `ansible-inventory -i inventories/production --host askari`
Expected: JSON shows `"ansible_host": "100.99.226.39"`, `"base__firewall_apply": true`, `"base__firewall_input_only": true`, and `"base__firewall_admin_addrs": ["91.226.145.80"]`.
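
Eyeballing the JSON works, but the comparison can also be scripted. The sketch below checks the four expected values from this step against a parsed host-vars dictionary; the sample JSON is a hand-written stand-in for the command's output, and the helper name is hypothetical.

```python
import json

EXPECTED = {
    "ansible_host": "100.99.226.39",
    "base__firewall_apply": True,
    "base__firewall_input_only": True,
    "base__firewall_admin_addrs": ["91.226.145.80"],
}


def inventory_mismatches(hostvars: dict) -> dict:
    """Return {key: (expected, actual)} for each expected key that differs or is missing."""
    return {
        key: (want, hostvars.get(key))
        for key, want in EXPECTED.items()
        if hostvars.get(key) != want
    }


# Stand-in for the output of `ansible-inventory ... --host askari`.
sample = json.loads("""
{"ansible_host": "100.99.226.39", "base__firewall_apply": true,
 "base__firewall_input_only": true, "base__firewall_admin_addrs": ["91.226.145.80"],
 "base__mesh_enabled": true}
""")

print(inventory_mismatches(sample))  # {}
```

To run it against the real inventory, replace `sample` with `json.load(sys.stdin)` and pipe the `ansible-inventory` output into the script; an empty dictionary means all four values resolved as expected.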
- [ ] **Step 4: Lint**
Run: `rbw unlocked && make lint`
Expected: clean (no yamllint/ansible-lint errors).
- [ ] **Step 5: Commit**
```bash
git add inventories/production/group_vars/offsite_hosts/vars.yml \
inventories/production/host_vars/askari.yml
git commit -m "feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 3: Integration harness "askari_inputonly" profile — the reboot-safety GREEN gate
Prove on a throwaway VM (ADR-025) that the redesigned firewall is reboot-safe BEFORE touching the real host: INPUT default-deny + forward accept + the admin-addr break-glass + published-port DNAT all survive a reboot. This is a new profile; the existing `askari` profile, which validates the `docker_host` container-forward drop-in path, stays intact.
**Files:**
- Create: `tests/integration/profiles/askari_inputonly.json`
- Create: `tests/integration/overrides/askari_inputonly.yml`
- Modify: `tests/integration/verify.yml` (allow-list + a new profile branch)
**Interfaces:**
- Consumes: the `scripts/integration-vm.py` harness; `make test-integration HOST=<profile>` maps `HOST` to `profiles/<HOST>.json` (a profile name, not a production inventory host).
- Produces: profile `askari_inputonly` with `integration_profile: askari_inputonly`.
- [ ] **Step 1: Add the new profile to the verify allow-list and a failing branch**
In `tests/integration/verify.yml`, change the allow-list assert (line 14) from:
```yaml
- integration_profile in ['askari', 'ubongo']
```
to:
```yaml
- integration_profile in ['askari', 'askari_inputonly', 'ubongo']
```
and update its `fail_msg` (line 15) to `"integration_profile must be set in the profile overlay (askari|askari_inputonly|ubongo)"`. Then append this block to the `tasks:` list (after the ubongo block):
```yaml
# ── askari_inputonly profile — the mesh-hardening REDESIGN (2026-06-19) ──
# INPUT-only default-deny on a Docker host: input policy drop, forward policy ACCEPT
# (Docker-safe), SSH via the admin-addr break-glass, published-port DNAT survives reboot.
- name: (askari_inputonly) Read the live nftables ruleset
  when: integration_profile == 'askari_inputonly'
  ansible.builtin.command: nft list ruleset
  register: _nft_io
  changed_when: false
- name: (askari_inputonly) INPUT default-deny, forward permissive, admin-addr break-glass
  when: integration_profile == 'askari_inputonly'
  ansible.builtin.assert:
    that:
      - "'hook input priority filter; policy drop;' in _nft_io.stdout"
      - "'hook forward priority filter; policy accept;' in _nft_io.stdout"
      - "'ip saddr 192.168.150.1 tcp dport 22 accept' in _nft_io.stdout"
    fail_msg: >-
      askari_inputonly: expected input policy drop, forward policy accept (input-only),
      and the admin-addr break-glass (192.168.150.1) SSH allow in the live ruleset.
- name: (askari_inputonly) Gather service facts
  when: integration_profile == 'askari_inputonly'
  ansible.builtin.service_facts:
- name: (askari_inputonly) Docker daemon is active
  when: integration_profile == 'askari_inputonly'
  ansible.builtin.assert:
    that: "ansible_facts.services['docker.service'].state == 'running'"
    fail_msg: "docker.service is not running"
- name: (askari_inputonly) Published port answers from the controller (DNAT + forward alive)
  when: integration_profile == 'askari_inputonly'
  delegate_to: localhost
  become: false
  ansible.builtin.uri:
    url: "http://{{ ansible_host }}/"
    follow_redirects: none
    status_code: [200, 301, 308, 404, 502, 503]
    timeout: 10
  register: _probe_io
  retries: 5
  delay: 6
  until: _probe_io is succeeded
```
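The ruleset assertion above reduces to three substring checks on the output of `nft list ruleset`. A Python sketch of the same logic is shown here; the sample ruleset is hand-written to resemble nft output and is an assumption, not a capture from the harness VM.

```python
REQUIRED_FRAGMENTS = (
    "hook input priority filter; policy drop;",
    "hook forward priority filter; policy accept;",
    "ip saddr 192.168.150.1 tcp dport 22 accept",
)


def missing_fragments(ruleset: str) -> list[str]:
    """Return the required fragments that do not appear in the ruleset text."""
    return [frag for frag in REQUIRED_FRAGMENTS if frag not in ruleset]


# Illustrative ruleset text (assumed shape, not real harness output).
SAMPLE_RULESET = """\
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        iifname "wt0" tcp dport 22 accept
        ip saddr 192.168.150.1 tcp dport 22 accept
    }
    chain forward {
        type filter hook forward priority filter; policy accept;
    }
}
"""

print(missing_fragments(SAMPLE_RULESET))  # []
```

Since the checks are textual, they depend on how the installed nft version prints chain headers (for example `priority filter` rather than a numeric priority). If the harness fails on a ruleset that looks correct, comparing the exact output text against these fragments is a reasonable first step.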
- [ ] **Step 2: Create the profile descriptor**
Create `tests/integration/profiles/askari_inputonly.json`:
```json
{
  "groups": ["offsite_hosts"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
  ],
  "extra_vars_files": ["overrides/askari_inputonly.yml"],
  "mem_mib": 3072,
  "vcpus": 2
}
```
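For orientation, this is roughly how a profile name and descriptor could be turned into a list of playbook runs. It is a hypothetical sketch, not the logic of `scripts/integration-vm.py`: only the `profiles/<HOST>.json` mapping and the descriptor fields come from this plan, and both helper functions are invented for illustration.

```python
import json
from pathlib import PurePosixPath

PROFILE_DIR = PurePosixPath("tests/integration/profiles")


def profile_path(host: str) -> PurePosixPath:
    """Map the HOST argument to its descriptor path (profiles/<HOST>.json)."""
    return PROFILE_DIR / f"{host}.json"


def planned_applies(profile: dict) -> list[tuple[str, str]]:
    """Return (playbook, comma-joined tags) pairs in the order they are listed."""
    return [(item["playbook"], ",".join(item["tags"])) for item in profile["applies"]]


PROFILE = json.loads("""
{
  "groups": ["offsite_hosts"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
  ],
  "extra_vars_files": ["overrides/askari_inputonly.yml"],
  "mem_mib": 3072,
  "vcpus": 2
}
""")

print(profile_path("askari_inputonly"))
for playbook, tags in planned_applies(PROFILE):
    print(playbook, tags)
```

The ordering is the point of the descriptor: `base` (the firewall) is applied before `docker_host` and `reverse_proxy`, matching the sequence described in Step 4.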
- [ ] **Step 3: Create the overlay**
Create `tests/integration/overrides/askari_inputonly.yml`:
```yaml
---
# Integration overlay (ADR-025) — the askari mesh-hardening REDESIGN (2026-06-19).
# Validates INPUT-only default-deny on a Docker host: input policy drop, forward policy
# accept (Docker-safe), SSH via the admin-addr break-glass, reboot-survivable.
integration_profile: askari_inputonly
base__firewall_apply: true
base__firewall_input_only: true
# No sshd ListenAddress change — never wt0-only in a throwaway VM.
base__ssh_listen_mesh_only: false
# Isolated VM: never touch the real mesh.
base__mesh_enabled: false
# The non-mesh SSH break-glass = the admin-addr path the real design uses. Point it at the
# VM's libvirt-NAT gateway (where the harness connects from), by source IP so it is
# interface-independent and the default-deny + reboot don't lock out the driver. This
# mirrors askari's real base__firewall_admin_addrs (ubongo's WAN) in the test topology.
base__firewall_admin_addrs:
- 192.168.150.1
```
- [ ] **Step 4: Run the harness — the GREEN gate**
Run: `make test-integration HOST=askari_inputonly`
Expected: GREEN. The harness boots a VM, applies `base` (INPUT-only) + `docker_host` + `reverse_proxy`, **reboots**, re-SSHes (proving the admin-addr break-glass survives), then `verify.yml` asserts input `policy drop`, forward `policy accept`, the `192.168.150.1` SSH allow, Docker active, and the published `:80` answering. Clean up: `make test-integration-clean`.
- [ ] **Step 5: Commit**
```bash
rbw unlocked && make lint
git add tests/integration/profiles/askari_inputonly.json \
tests/integration/overrides/askari_inputonly.yml \
tests/integration/verify.yml
git commit -m "test(integration): askari_inputonly profile — INPUT-only default-deny reboot gate" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 4: Supervised live cutover + STATUS/ROADMAP update — ⚠️ OPERATOR-GATED
> **⚠️ DO NOT run this task autonomously.** It changes the live off-site host (lockout risk) and runs `make deploy`. An automated executor must STOP here and hand back to the operator. Preconditions: Tasks 1-3 committed and GREEN; `rbw unlocked`; the **Hetzner web console** open in a browser (the out-of-band ultimate break-glass); the operator present. The WAN `:22` break-glass is never removed, so a fallback path is open throughout (FRICTION 2026-06-17 #6).

**Files (Step 7 only):**
- Modify: `STATUS.md` (askari row), `docs/ROADMAP.md` (Next step)
- [ ] **Step 1: Pre-check both paths are healthy**
```bash
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
ansible askari -i inventories/production -m ping
curl -sI https://test.askari.wingu.me | head -1
curl -sI https://netbird.askari.wingu.me | head -1
```
Expected: wt0 SSH OK; ping `pong`; both curls `HTTP/2 200`.
- [ ] **Step 2: Dry-run the converge (mandatory `check` before `deploy`)**
```bash
make check PLAYBOOK=site LIMIT=askari
```
Expected: changes limited to the `base` firewall (input-only ruleset, admin-addr) + the `netbird_coordinator` compose env (`NB_DISABLE_GEOLOCATION`). Review and show the output before proceeding.
- [ ] **Step 3: Apply (operator present, console open, auto-rollback armed)**
```bash
make deploy PLAYBOOK=site LIMIT=askari
```
The `base` firewall concern arms the auto-rollback timer (`base__firewall_rollback_timeout: 45`) and reconnects over `wt0` — a bad ruleset reverts itself. Expected: converge OK; SSH-over-`wt0` stays up.
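
The auto-rollback idea can be pictured as a commit-confirm guard: apply the new ruleset, start a timer, and revert unless a successful reconnect confirms it in time. The sketch below is a generic model of that pattern with an injected clock. It is not the `base` role's implementation, and the class and method names are hypothetical; only the 45-second timeout comes from this plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RollbackGuard:
    previous_ruleset: str
    timeout: float = 45.0            # mirrors base__firewall_rollback_timeout
    active_ruleset: str = ""
    armed_at: Optional[float] = None

    def apply(self, new_ruleset: str, now: float) -> None:
        """Activate the new ruleset and arm the rollback timer."""
        self.active_ruleset = new_ruleset
        self.armed_at = now

    def confirm(self) -> None:
        """Called after a successful reconnect: keep the new ruleset."""
        self.armed_at = None

    def tick(self, now: float) -> None:
        """Revert to the previous ruleset if the timer expired unconfirmed."""
        if self.armed_at is not None and now - self.armed_at >= self.timeout:
            self.active_ruleset = self.previous_ruleset
            self.armed_at = None


guard = RollbackGuard(previous_ruleset="old")
guard.apply("bad", now=0.0)
guard.tick(now=45.0)         # nobody confirmed within the window
print(guard.active_ruleset)  # old
```

The useful property is that a lockout is self-healing: if the new ruleset blocks the reconnect, no confirmation arrives and the previous ruleset comes back without operator action.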
- [ ] **Step 4: Rebuild NAT and confirm the coordinator is healthy with geo disabled**
`base`'s `flush ruleset` wipes Docker's NAT rules (FRICTION). Restarting Docker rebuilds them; then confirm the control plane:
```bash
ssh sjat@100.99.226.39 'sudo systemctl restart docker'
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
ssh sjat@100.99.226.39 'docker logs --since 2m netbird-server 2>&1 | grep -iE "geo|fatal" || echo "no geo/fatal log lines"'
```
Expected: `netbird-server` + `netbird-dashboard` Up; no geo-DB FATAL.
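
The log check is a case-insensitive filter equivalent to `grep -iE "geo|fatal"`. A Python sketch of the same filter follows; the log lines are invented for illustration and do not reflect NetBird's real log format.

```python
import re

PATTERN = re.compile(r"geo|fatal", re.IGNORECASE)


def suspicious_lines(log_text: str) -> list[str]:
    """Return every log line that mentions geo or fatal, in any letter case."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]


# Invented log lines, for illustration only.
LOG = """\
2026-06-19T14:00:01Z INFO management server started
2026-06-19T14:00:02Z WARN geolocation disabled by environment
2026-06-19T14:00:03Z INFO signal server listening
"""

for line in suspicious_lines(LOG):
    print(line)
```

A match is not automatically a failure: a benign line announcing that geolocation is disabled also matches `geo`. Read any matching lines rather than treating their presence as an error.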
> **Contingency (only if `netbird-server` still FATALs on geolocation):** `NB_DISABLE_GEOLOCATION` was not honored by the pinned image. Pre-seed the DB into the volume instead — `ssh sjat@100.99.226.39 'sudo curl -fSL -o /var/lib/docker/volumes/netbird_data/_data/GeoLite2-City_20260101.mmdb https://pkgs.netbird.io/geolite2/GeoLite2-City.mmdb && sudo docker restart netbird-server'` — and add `disableGeoliteUpdate: true` under `server:` in `config.yaml.j2` so it never re-downloads. Re-verify, then fold the working fix back into the role (amend Task 1).
- [ ] **Step 5: Verify the new steady state (both SSH paths + services)**
```bash
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
# From ubongo: SSH to askari's WAN IP. ubongo's packets egress via OPNsense, SNAT'd to the
# WAN IP 91.226.145.80 — matching askari's admin-addr break-glass rule. (No BindAddress:
# ubongo does not hold 91.226.145.80; OPNsense does.)
ssh sjat@77.42.120.136 true && echo "WAN break-glass OK"
curl -sI https://test.askari.wingu.me | head -1
nc -vz -u 77.42.120.136 3478 # UDP probe only: passes unless an ICMP unreachable comes back; not proof of a STUN reply
```
Expected: both SSH paths succeed; cert valid; STUN reachable.
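
A UDP port probe cannot tell an answering STUN server from a silently dropped packet. A stricter check sends a STUN Binding Request and validates the reply. The sketch below follows the RFC 5389 header layout (message type, length, magic cookie `0x2112A442`, 12-byte transaction ID); it assumes the STUN service on askari answers unauthenticated Binding Requests, which has not been verified here.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(transaction_id: bytes) -> bytes:
    """Build a 20-byte STUN header with no attributes."""
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def is_binding_success(response: bytes, transaction_id: bytes) -> bool:
    """Check message type, magic cookie, and echoed transaction ID."""
    if len(response) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", response[:8])
    return (
        msg_type == BINDING_SUCCESS
        and cookie == MAGIC_COOKIE
        and response[8:20] == transaction_id
    )


def stun_answers(host: str, port: int = 3478, timeout: float = 3.0) -> bool:
    """Send one Binding Request over UDP and report whether a valid reply arrived."""
    txid = os.urandom(12)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_binding_request(txid), (host, port))
            response, _addr = sock.recvfrom(2048)
        except OSError:  # includes timeouts
            return False
    return is_binding_success(response, txid)
```

Calling `stun_answers("77.42.120.136")` from ubongo would then return `True` only when a well-formed success response comes back. A single lost UDP packet yields `False`, so one or two retries are reasonable before treating it as a failure.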
- [ ] **Step 6: Reboot-resilience — the real test (console available)**
```bash
ssh sjat@100.99.226.39 'sudo systemctl reboot'
# wait ~60s, then from ubongo — no manual intervention:
sleep 60; ssh sjat@100.99.226.39 'sudo nft list chain inet filter input | grep -E "policy drop|wt0|91.226.145.80"'
curl -sI https://netbird.askari.wingu.me | head -1
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
```
Expected, unattended: input `policy drop` with the `wt0` + `91.226.145.80` allows; public cert valid; both containers Up; `wt0` SSH back. (If lost: recover via the Hetzner console — the firewall auto-rollback and the WAN break-glass should make that unnecessary.)
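
A fixed `sleep 60` can be too short on a slow boot and wastes time on a fast one. One alternative is to poll until the host answers or a deadline passes. The sketch below is a generic polling helper with an injectable clock; the helper names and the 180-second budget in the usage note are assumptions, not values from the spec.

```python
import socket
import time
from typing import Callable


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(
    probe: Callable[[], bool],
    timeout: float,
    interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe every `interval` seconds until it succeeds or the deadline passes."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() + interval > deadline:
            return False  # another sleep would overshoot the deadline
        sleep(interval)
```

For example, `wait_until(lambda: tcp_port_open("100.99.226.39", 22), timeout=180, interval=5)` returns as soon as `wt0` SSH accepts connections. An open port only shows that sshd is listening, so the `nft`, `curl`, and `docker ps` checks above are still needed afterwards.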
- [ ] **Step 7: Record reality in the ground-truth docs and commit**
Update `STATUS.md` (the askari row): firewall now **applied** — INPUT-only default-deny, SSH `wt0`-primary + permanent WAN break-glass (ubongo's WAN), managed over `wt0`, geolocation disabled, **reboot-validated**. Update `docs/ROADMAP.md` "Next step": mark the askari SSH→`wt0` redesign **DONE**; the next mesh-hardening sub-project is the **SPOF reduction** (askari relay single-point-of-failure) — confirmed by the `ubongo → askari` `Relayed` finding (2026-06-19).
```bash
rbw unlocked && make lint
git add STATUS.md docs/ROADMAP.md
git commit -m "docs(status): mesh-hardening redesign — askari INPUT-only + WAN break-glass applied + reboot-validated" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
## Notes / out of scope (carry to the SPOF sub-project)
- **SPOF reduction is the next sub-project** (operator decision 2026-06-19): `ubongo → askari` is currently `Relayed` through askari's own relay; if askari is down, relayed peers lose the mesh data plane. This work gets its own spec.
- **NetBird ACL stays Allow-All** — any enrolled peer can reach askari `wt0:22` until a later sub-project.
- **Full forward-chain hardening** (`docker_host` container-forward drop-in over the `input_only` baseline) — a later tightening; the existing `askari` integration profile already covers that path.
- **Coordinator off-site backup** (FRICTION 2026-06-17 #5, ADR-022) — still pending; not in scope.