# Mesh-hardening 1/3: askari SSH onto wt0 (Implementation Plan)

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make askari's SSH reachable only over the NetBird mesh (`wt0`) and close the WAN `:22` surface at both the host nftables layer and the Hetzner Cloud Firewall, without dropping askari's public services.

**Architecture:** Three enforcement layers: (1) sshd `ListenAddress` bound to the live `wt0` IP (fail-closed, `ip_nonlocal_bind` to beat the post-boot bind race); (2) the base role's catalog-driven nftables default-deny (SSH already restricted to `wt0` via `base__firewall_mgmt_interface`; add a `public` zone + askari service entries so 80/443/3478 survive); (3) Terraform drops the Hetzner Cloud Firewall WAN `:22` rule. Tasks 1–4 are code (subagent-driven, each Molecule/lint/plan-verified). Task 5 is the live, operator-supervised cutover on the real host.

**Tech Stack:** Ansible (role `base`, FQCN), nftables, Molecule on Debian 13, `ansible.posix.sysctl`, pytest (filter unit tests), Terraform (`hcloud` provider).

**Spec:** `docs/superpowers/specs/2026-06-17-mesh-hardening-askari-ssh-wt0-design.md`

**Conventions:** `make lint` and `make test ROLE=base` before each commit; `make check` before `make deploy`; `make tf-plan` before `make tf-apply`; never hand-edit the generated `offsite.yml`; rbw unlocked for commits touching ansible content.


---

### Task 1: base role, sshd `ListenAddress` on wt0 + `ip_nonlocal_bind` (fail-closed)

**Files:**
- Modify: `roles/base/defaults/main.yml`
- Modify: `roles/base/tasks/ssh.yml`
- Modify: `roles/base/templates/sshd_hardening.conf.j2`
- Modify: `roles/base/molecule/default/converge.yml` (fixture)
- Modify: `roles/base/molecule/default/verify.yml` (assertions = the test)

- [ ] **Step 1: Write the failing test (extend Molecule verify)**

In `roles/base/molecule/default/verify.yml`, add these tasks after the existing "Sshd drop-in present and config valid" block:

```yaml
    - name: ListenAddress bound to the fixture mesh IP (mesh-only mode)
      ansible.builtin.command: grep -q '^ListenAddress 100.99.0.1$' /etc/ssh/sshd_config.d/10-boma.conf
      changed_when: false

    - name: ip_nonlocal_bind sysctl drop-in is present
      ansible.builtin.command: grep -q '^net.ipv4.ip_nonlocal_bind = 1' /etc/sysctl.d/60-boma-nonlocal-bind.conf
      changed_when: false

    - name: ip_nonlocal_bind is live in this netns
      ansible.builtin.command: sysctl -n net.ipv4.ip_nonlocal_bind
      register: _nonlocal
      changed_when: false
      failed_when: _nonlocal.stdout | trim != '1'
```

- [ ] **Step 2: Add the fixture that drives it (Molecule converge)**

In `roles/base/molecule/default/converge.yml`, add to the `vars:` block (alongside the existing `base__mesh_*`):

```yaml
    base__ssh_listen_mesh_only: true
    base__ssh_listen_addr: "100.99.0.1"   # fixture mesh IP (no wt0 in the container)
```

- [ ] **Step 3: Run the test to verify it fails**

Run: `make test ROLE=base`
Expected: FAIL. Converge errors or verify fails (`ListenAddress` not rendered; sysctl drop-in absent), because the feature isn't implemented yet.

- [ ] **Step 4: Add the defaults**

In `roles/base/defaults/main.yml`, after the `base__ssh_authorised_keys: []` line (end of the hardening block), add:

```yaml
# SSH listen-on-mesh (mesh-hardening 1/3, ADR-016/021). Opt-in: when true, sshd binds
# ListenAddress to this host's mesh IP only (not the WAN). The IP comes from the live wt0
# fact (ansible_facts.wt0.ipv4.address); base__ssh_listen_addr overrides it. ip_nonlocal_bind
# lets sshd bind the mesh IP before wt0 exists at boot. Fails closed: the play asserts a
# non-empty address rather than silently listening on all interfaces.
base__ssh_listen_mesh_only: false
base__ssh_listen_addr: ""
```

- [ ] **Step 5: Resolve + assert + sysctl in `ssh.yml`**

In `roles/base/tasks/ssh.yml`, insert these tasks at the TOP of the file (before "Ensure openssh-server is installed"):

```yaml
- name: Resolve the sshd mesh listen address (override, else live wt0 fact)
  ansible.builtin.set_fact:
    base__ssh_listen_addr_resolved: >-
      {{ base__ssh_listen_addr or ansible_facts.get('wt0', {}).get('ipv4', {}).get('address', '') }}
  when: base__ssh_listen_mesh_only | bool

- name: Fail closed, refuse to render sshd without a known mesh address
  ansible.builtin.assert:
    that:
      - base__ssh_listen_addr_resolved | length > 0
    fail_msg: >-
      base__ssh_listen_mesh_only is true but no mesh address resolved (set
      base__ssh_listen_addr or ensure wt0 is up so its fact is gathered). Refusing to
      render sshd ListenAddress empty (which would listen on ALL interfaces).
  when: base__ssh_listen_mesh_only | bool

- name: Allow sshd to bind the mesh IP before wt0 exists at boot
  ansible.posix.sysctl:
    name: net.ipv4.ip_nonlocal_bind
    value: "1"
    sysctl_set: true
    state: present
    reload: true
    sysctl_file: /etc/sysctl.d/60-boma-nonlocal-bind.conf
  when: base__ssh_listen_mesh_only | bool
```

- [ ] **Step 6: Render the conditional `ListenAddress`**

In `roles/base/templates/sshd_hardening.conf.j2`, append after the existing `KbdInteractiveAuthentication no` line:

```jinja
{% if base__ssh_listen_mesh_only | bool %}
ListenAddress {{ base__ssh_listen_addr_resolved }}
{% endif %}
```

- [ ] **Step 7: Run the test to verify it passes**

Run: `make test ROLE=base`
Expected: PASS. Converge succeeds; verify confirms `ListenAddress 100.99.0.1`, the sysctl drop-in, and the live value `1`.

> **Checkpoint (environmental):** if `make test` fails on the sysctl task because the Molecule container can't write `net.ipv4.ip_nonlocal_bind`, add `sysctls: {net.ipv4.ip_nonlocal_bind: "0"}` to the platform in `roles/base/molecule/default/molecule.yml` (pre-creates the namespaced sysctl so the task can set it), then re-run. Note the change in the commit.

- [ ] **Step 8: Lint**

Run: `make lint`
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.

- [ ] **Step 9: Commit**

```bash
git add roles/base/defaults/main.yml roles/base/tasks/ssh.yml \
  roles/base/templates/sshd_hardening.conf.j2 \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): opt-in sshd ListenAddress on the mesh IP (fail-closed)

base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with ip_nonlocal_bind to
beat the post-boot bind race and a fail-closed assert so an unresolved address never
silently listens on all interfaces. Molecule covers the render + sysctl.
Mesh-hardening 1/3 (ADR-016/021).

Co-Authored-By: Claude Opus 4.8 (1M context)"
```

---

### Task 2: firewall catalog, `public` zone + askari's public services

**Files:**
- Modify: `inventories/production/group_vars/all/firewall.yml`
- Modify: `roles/base/molecule/default/converge.yml` (fixture: public-zone rule)
- Modify: `roles/base/molecule/default/verify.yml` (assert the 0.0.0.0/0 rule)
- Test: `tests/test_firewall_rules.py` (unit: a `public` zone resolves to `0.0.0.0/0`)

Rationale: `base__firewall_mgmt_interface` already accepts `:22` on `wt0`. The gap is that the catalog is empty and has no "anywhere" source, so applying default-deny to askari would drop 80/443/3478. We add a `public` zone (`0.0.0.0/0`) and askari's service ingress.

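The rationale above can be illustrated with a minimal Python sketch. It is not the repository's filter: `resolve_sources` and `rules_for_host` are hypothetical stand-ins written only to show why an empty catalog yields no accept rules under default-deny, and how a `public` zone entry turns into a `0.0.0.0/0` source. The `host`-only placement handling is a simplifying assumption (the real catalog also allows `group` and `hosts`).

```python
def resolve_sources(ingress_from, zones):
    """Hypothetical stand-in for the filter's source lookup: zone name -> CIDR list."""
    if ingress_from not in zones:
        raise KeyError(f"unknown zone: {ingress_from!r}")
    return [zones[ingress_from]]


def rules_for_host(catalog, zones, host):
    """Collect ingress rules for every catalog service placed on `host`.

    Assumption: only `host` placement is modelled here.
    """
    rules = []
    for service in catalog.values():
        if service.get("host") != host:
            continue
        for ingress in service.get("ingress", []):
            rules.append({
                "proto": ingress["proto"],
                "port": ingress["port"],
                "sources": resolve_sources(ingress["from"], zones),
            })
    return rules


zones = {"mgmt": "10.10.0.0/24", "public": "0.0.0.0/0"}
catalog = {
    "reverse_proxy": {"host": "askari", "ingress": [
        {"from": "public", "port": 80, "proto": "tcp"},
        {"from": "public", "port": 443, "proto": "tcp"},
    ]},
    "netbird_stun": {"host": "askari", "ingress": [
        {"from": "public", "port": 3478, "proto": "udp"},
    ]},
}

# With the catalog populated, askari keeps its three public ports.
askari_rules = rules_for_host(catalog, zones, "askari")
# With an empty catalog there is nothing to accept, so default-deny drops everything.
empty_rules = rules_for_host({}, zones, "askari")
```

The point of the sketch is the contrast between `askari_rules` (three accept rules, all sourced from `0.0.0.0/0`) and `empty_rules` (none), which is the gap this task closes.
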
- [ ] **Step 1: Write the failing unit test**

In `tests/test_firewall_rules.py`, add:

```python
def test_public_zone_resolves_to_anywhere():
    catalog = {"web": {"host": "askari",
                       "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
    zones = {"public": "0.0.0.0/0"}
    rules = rs.resolve_firewall_rules(catalog, zones, "askari",
                                      {"askari": {"ansible_host": "100.99.226.39"}}, {})
    assert rules == [{"proto": "tcp", "port": 443, "sources": ["0.0.0.0/0"]}]
```

(Module is loaded by the existing importlib shim at the top of the test file as `rs`. If the filter is imported under a different alias there, match it.)

- [ ] **Step 2: Run it to verify it fails (or passes trivially)**

Run: `.venv/bin/python -m pytest tests/test_firewall_rules.py -q`
Expected: this test PASSES immediately if the filter already resolves arbitrary zones (it does: `_resolve_source` treats any `zones` key generically). That is fine: the unit test documents/locks the `public`-zone contract. If it fails, fix the filter. Either way it must end green.

- [ ] **Step 3: Add the Molecule fixture (public-zone rule)**

In `roles/base/molecule/default/converge.yml`, under `firewall_zones:` add `public: 0.0.0.0/0`, and under `firewall_catalog:` add:

```yaml
      netbird_stun:
        host: instance
        ingress:
          - { from: public, port: 3478, proto: udp }
```

- [ ] **Step 4: Add the Molecule assertion (the test)**

In `roles/base/molecule/default/verify.yml`, after the photoprism assertion block, add:

```yaml
    - name: Assert the public->stun:3478/udp ingress rule (0.0.0.0/0 source)
      ansible.builtin.assert:
        that:
          - "'0.0.0.0/0' in nft"
          - "'udp dport 3478 accept' in nft"
        fail_msg: "missing public->3478/udp rule for netbird_stun"
```

- [ ] **Step 5: Run the tests**

Run: `make test ROLE=base` then `.venv/bin/python -m pytest tests/test_firewall_rules.py -q`
Expected: both PASS (the rendered ruleset now contains the `0.0.0.0/0 ... udp dport 3478 accept` rule).

- [ ] **Step 6: Populate the real catalog**

In `inventories/production/group_vars/all/firewall.yml`, replace the `firewall_zones`/`firewall_catalog` blocks with:

```yaml
# Zone → subnet (from ADR-007). `public` = the WAN (anywhere) for deliberately public
# off-site services (askari); home/cluster services use the internal zones only.
firewall_zones:
  mgmt: 10.10.0.0/24
  srv: 10.20.0.0/24
  lan: 10.30.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24
  public: 0.0.0.0/0

# Service catalog: → placement (host | group | hosts) + ingress[].
# askari's public surface (ADR-024 Caddy + ADR-016 NetBird STUN). NOTE: the host
# nftables template renders IPv4 source rules only; askari is reached via its A record
# (no AAAA), so IPv4-only public rules are sufficient (see the spec's IPv6 note).
firewall_catalog:
  reverse_proxy:
    host: askari
    ingress:
      - { from: public, port: 80, proto: tcp }
      - { from: public, port: 443, proto: tcp }
  netbird_stun:
    host: askari
    ingress:
      - { from: public, port: 3478, proto: udp }
```

- [ ] **Step 7: Lint**

Run: `make lint`
Expected: clean pass (`check-tags: OK`).

- [ ] **Step 8: Commit**

```bash
git add inventories/production/group_vars/all/firewall.yml \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
  tests/test_firewall_rules.py
git commit -m "feat(firewall): public zone + askari's public services in the catalog

Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN (3478/udp)
ingress so the base nftables default-deny does not drop the live public services when
applied to askari. Molecule + filter unit test cover the public-zone rendering.
Mesh-hardening 1/3 (ADR-020/024/016).

Co-Authored-By: Claude Opus 4.8 (1M context)"
```

---

### Task 3: inventory, point Ansible at wt0 + enable mesh-only SSH on askari

**Files:**
- Create: `inventories/production/host_vars/askari.yml`
- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml`

- [ ] **Step 1: Create the host_var override**

Create `inventories/production/host_vars/askari.yml`:

```yaml
---
# Manage askari over the NetBird mesh (wt0), not its WAN IP. This OVERRIDES the
# TF-generated inventories/production/offsite.yml (ansible_host = 77.42.120.136); host_vars
# outrank the generated inventory and are NOT touched by `make tf-inventory-offsite`.
# Mesh-hardening 1/3: once SSH is wt0-only, the WAN IP is no longer reachable for SSH.
ansible_host: 100.99.226.39   # askari's wt0 address (NetBird, M5)
```

- [ ] **Step 2: Enable mesh-only SSH for offsite hosts**

In `inventories/production/group_vars/offsite_hosts/vars.yml`, replace the file body with:

```yaml
---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5). Mesh-hardening 1/3 (2026-06-17): SSH is moved onto wt0. sshd binds the
# mesh IP only (base__ssh_listen_mesh_only) and the base nftables default-deny applies
# (base__firewall_apply defaults true; SSH allowed on wt0 via base__firewall_mgmt_interface,
# public services via the catalog). base__mesh_enabled stays true (precondition from M5).
base__mesh_enabled: true
base__ssh_listen_mesh_only: true
```

- [ ] **Step 3: Verify the override resolves**

Run: `.venv/bin/ansible-inventory -i inventories/production/ --host askari 2>/dev/null | grep ansible_host`
Expected: `"ansible_host": "100.99.226.39"` (the host_var wins over the generated `offsite.yml`).

- [ ] **Step 4: Lint**

Run: `make lint`
Expected: clean pass.

- [ ] **Step 5: Commit**

```bash
git add inventories/production/host_vars/askari.yml \
  inventories/production/group_vars/offsite_hosts/vars.yml
git commit -m "feat(inventory): manage askari over wt0 + enable mesh-only SSH

host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context)"
```

---

### Task 4: Terraform, retire the Hetzner WAN `:22` rule

**Files:**
- Modify: `terraform/modules/hetzner_vm/main.tf`
- Modify: `terraform/modules/hetzner_vm/variables.tf`
- Modify: `terraform/environments/offsite/main.tf`

This task makes the SSH rule conditional and sets askari's admin CIDRs to empty (mesh-only). The live `tf-plan`/`tf-apply` happens in Task 5; here we only change + format/validate the code.

- [ ] **Step 1: Gate the SSH rule on a non-empty CIDR list**

In `terraform/modules/hetzner_vm/main.tf`, replace the static SSH `rule { ... }` block (the one with `port = "22"`) with a dynamic block:

```hcl
  # SSH from the control node only, and only when admin CIDRs are set. An empty
  # ssh_admin_cidrs removes the WAN :22 rule entirely (mesh-only SSH; reach the host over
  # wt0, break-glass = Hetzner console). Mesh-hardening 1/3.
  dynamic "rule" {
    for_each = length(var.ssh_admin_cidrs) > 0 ? [1] : []
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = "22"
      source_ips = var.ssh_admin_cidrs
    }
  }
```

- [ ] **Step 2: Default the variable to empty**

In `terraform/modules/hetzner_vm/variables.tf`, change the `ssh_admin_cidrs` variable to default to an empty list:

```hcl
variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to reach SSH over the WAN. Empty = no WAN SSH rule (mesh-only)."
  type        = list(string)
  default     = []
}
```

- [ ] **Step 3: Set askari to mesh-only SSH**

In `terraform/environments/offsite/main.tf`, change the `ssh_admin_cidrs` argument in the `module "askari"` block to:

```hcl
  ssh_admin_cidrs = [] # mesh-only: SSH is reached over wt0; WAN :22 retired (mesh-hardening 1/3)
```

- [ ] **Step 4: Format + validate**

Run: `cd terraform/environments/offsite && terraform fmt -recursive ../.. && terraform validate && cd -`
Expected: `fmt` lists any reformatted files (re-add them); `validate` prints `Success! The configuration is valid.` (offsite is already `init`ed; it has live state.)

- [ ] **Step 5: Commit**

```bash
git add terraform/modules/hetzner_vm/main.tf terraform/modules/hetzner_vm/variables.tf \
  terraform/environments/offsite/main.tf
git commit -m "feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)

The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty ssh_admin_cidrs
(default []); askari sets it empty so the WAN :22 rule is removed on the next apply.
SSH is reached over wt0; break-glass is the Hetzner console. Apply is the live cutover
(Task 5). Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context)"
```

---

### Task 5: Live staged cutover (operator-supervised, NOT a subagent task)

> This task touches the real askari over the network and is lockout-risky. Run it
> interactively with the operator, in order, verifying each step before the next. The
> firewall's auto-rollback timer + `wait_for_connection` over wt0 is the safety net; the
> Hetzner web console is the ultimate break-glass. Do NOT hand this to an unattended agent.

- [ ] **Step 1: Pre-check the mesh SSH path (before any change)**

Run: `.venv/bin/ansible askari -i inventories/production/ -m ping`
Expected: `SUCCESS`, confirming that Ansible reaches askari over `wt0` (Tasks 1–3 are merged, so `ansible_host` is now `100.99.226.39`). If this fails, STOP: the mesh path must work before closing the WAN.

- [ ] **Step 2: Dry-run the base apply (firewall + sshd)**

Run: `make check PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening`
Expected: shows the nftables ruleset diff (default-deny + wt0 SSH + public 80/443/3478) and the sshd drop-in diff (`ListenAddress 100.99.226.39`); no errors. Review that the public service rules are present (so they won't be dropped).

- [ ] **Step 3: Apply the host firewall + sshd (auto-rollback armed)**

Run: `make deploy PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening`
Expected: the firewall concern arms the rollback timer, applies, resets the connection, and `wait_for_connection` succeeds over wt0; sshd reloads with the mesh ListenAddress. If connectivity is lost, the timer auto-reverts the ruleset within `base__firewall_rollback_timeout` (45 s).

- [ ] **Step 4: Verify services + WAN SSH still open at the cloud edge**

```bash
curl -sSf -o /dev/null -w '%{http_code}\n' https://test.askari.wingu.me     # expect 200
curl -sSf -o /dev/null -w '%{http_code}\n' https://netbird.askari.wingu.me  # expect 200
```

Expected: both `200` (valid certs); the host firewall did not drop the public services. (WAN `:22` is now dropped by the host nftables, but the Hetzner FW still allows it until Step 5; that's fine.)

- [ ] **Step 5: Retire the Hetzner WAN `:22` (plan, review, apply)**

Run: `make tf-plan TF_ENV=offsite`
Expected: the plan shows the SSH firewall rule being **destroyed** (and nothing else of substance). Review it.

Then: `make tf-apply TF_ENV=offsite`
Expected: apply succeeds; the WAN `:22` rule is gone.

- [ ] **Step 6: Verify the end-state (out-of-band)**

From an OFF-MESH host (e.g. the operator's laptop with NetBird disconnected, or a quick check from askari's perspective):

```bash
nc -vz -w5 77.42.120.136 22    # expect: refused / timeout (WAN SSH closed)
nc -vz -w5 77.42.120.136 443   # expect: open (public service intact)
```

And from ubongo over the mesh: `.venv/bin/ansible askari -i inventories/production/ -m ping` → `SUCCESS`.

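If the off-mesh host has no `nc`, the same two TCP probes can be sketched in Python with only the standard library. This is an illustrative equivalent, not part of the plan's tooling: `tcp_port_open` and `end_state_ok` are hypothetical helper names, and the 5-second timeout simply mirrors `-w5` above. It covers TCP only, so it says nothing about 3478/udp.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connect to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, unreachable, and timed-out connects
        return False


def end_state_ok(host, closed_ports=(22,), open_ports=(443,), probe=tcp_port_open):
    """True when every port in closed_ports is unreachable and every port in open_ports is reachable.

    `probe` is injectable so the check can be exercised without touching the network.
    """
    closed = all(not probe(host, port) for port in closed_ports)
    opened = all(probe(host, port) for port in open_ports)
    return closed and opened
```

Run from an off-mesh host, `end_state_ok("77.42.120.136")` should return `True` once Step 5 has been applied; a `False` result means either WAN `:22` still answers or `:443` stopped answering, and the individual `tcp_port_open` calls show which.
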
- [ ] **Step 7: Reboot resilience check (optional but recommended)**

Reboot askari from the Hetzner console; after it comes back, confirm `ansible askari -m ping` succeeds over wt0 without intervention (proves `ip_nonlocal_bind` beat the post-boot bind race).

- [ ] **Step 8: Update STATUS + ROADMAP**

- In `STATUS.md`, update the askari row: SSH is now wt0-only; the host nftables default-deny is applied; the Hetzner WAN `:22` is retired. Move "host firewall + moving askari's SSH onto wt0" out of *Pending*.
- In `docs/ROADMAP.md`, mark mesh-hardening sub-project 1 (askari SSH→wt0) done; next is sub-project 2 (ubongo default-deny).

```bash
git add STATUS.md docs/ROADMAP.md
git commit -m "docs: askari SSH moved onto wt0 (mesh-hardening 1/3 done)

Co-Authored-By: Claude Opus 4.8 (1M context)"
```

- [ ] **Step 9: Push**

Run: `git push origin main`

---

## Self-review (against the spec)

- **§ three layers** → Task 1 (sshd ListenAddress), Task 2 (nftables catalog; SSH-on-wt0 pre-existing via `base__firewall_mgmt_interface`), Task 4 (Hetzner WAN :22). ✓
- **§ boot-race fix** (`ip_nonlocal_bind` + fail-closed assert + live wt0 fact) → Task 1 Steps 4–6. ✓
- **§ new code/vars** (`base__ssh_listen_mesh_only`, `base__ssh_listen_addr`, host_vars/askari.yml, offsite flag, catalog, TF) → Tasks 1–4. ✓
- **§ staged cutover** → Task 5 Steps 1–6, with the firewall auto-rollback as the gate. ✓
- **§ testing** → Molecule render asserts (ListenAddress, sysctl, public-zone rule) + filter unit test + live out-of-band checks. The fail-closed assert is exercised by code; to spot-check it, temporarily blank `base__ssh_listen_addr` in the converge fixture and confirm `make test ROLE=base` fails on the assert, then revert (manual, not automated; a deliberate-failure Molecule scenario is non-idiomatic). ✓
- **§ risks/rollback** → auto-rollback timer (Task 5 Step 3), `ip_nonlocal_bind` (Task 1), Hetzner console break-glass, re-addable TF rule. ✓
- **IPv6 note** → recorded in the catalog comment (Task 2 Step 6); acceptable because askari has only an A record.
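
For reviewers who want to reason about the fail-closed behaviour without running Molecule, the resolve-then-assert logic from Task 1 can be modelled as a small Python sketch. It only mirrors the Ansible tasks and the Jinja template for illustration: `resolve_listen_addr` and `render_listen_address` are hypothetical names, and `facts` stands in for `ansible_facts`.

```python
def resolve_listen_addr(override, facts):
    """Return the override if set, else the wt0 IPv4 address from the facts, else ''."""
    return override or facts.get("wt0", {}).get("ipv4", {}).get("address", "")


def render_listen_address(mesh_only, override, facts):
    """Model the template output: a ListenAddress line, nothing, or a hard failure.

    Raising here plays the role of the Ansible assert: an empty address must never
    reach the template, because sshd without ListenAddress listens on all interfaces.
    """
    if not mesh_only:
        return ""  # feature off: the template emits no ListenAddress line
    addr = resolve_listen_addr(override, facts)
    if not addr:
        raise ValueError("mesh-only SSH requested but no mesh address resolved")
    return f"ListenAddress {addr}"
```

The three outcomes map onto the plan as follows: the Molecule fixture takes the override path, askari in production takes the live `wt0` fact path, and a host with neither stops the play instead of rendering a config that listens everywhere.
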