Mesh-hardening 1/3 — askari SSH onto wt0 — Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Make askari's SSH reachable only over the NetBird mesh (wt0) and close the WAN :22 surface at both the host nftables layer and the Hetzner Cloud Firewall, without dropping askari's public services.
Architecture: Three enforcement layers — (1) sshd ListenAddress bound to the live wt0 IP (fail-closed, ip_nonlocal_bind to beat the post-boot bind race); (2) the base role's catalog-driven nftables default-deny (SSH already restricted to wt0 via base__firewall_mgmt_interface; add a public zone + askari service entries so 80/443/3478 survive); (3) Terraform drops the Hetzner Cloud Firewall WAN :22 rule. Tasks 1–4 are code (subagent-driven, each Molecule/lint/plan-verified). Task 5 is the live, operator-supervised cutover on the real host.
Tech Stack: Ansible (role base, FQCN), nftables, Molecule on Debian 13, ansible.posix.sysctl, pytest (filter unit tests), Terraform (hcloud provider).
Spec: docs/superpowers/specs/2026-06-17-mesh-hardening-askari-ssh-wt0-design.md
Conventions: make lint and make test ROLE=base before each commit; make check before make deploy; make tf-plan before make tf-apply; never hand-edit the generated offsite.yml; rbw unlocked for commits touching ansible content.
Task 1: base role — sshd ListenAddress on wt0 + ip_nonlocal_bind (fail-closed)
Files:
- Modify: roles/base/defaults/main.yml
- Modify: roles/base/tasks/ssh.yml
- Modify: roles/base/templates/sshd_hardening.conf.j2
- Modify: roles/base/molecule/default/converge.yml (fixture)
- Modify: roles/base/molecule/default/verify.yml (assertions = the test)
Step 1: Write the failing test (extend Molecule verify)
In roles/base/molecule/default/verify.yml, add these tasks after the existing "Sshd drop-in present and config valid" block:
- name: ListenAddress bound to the fixture mesh IP (mesh-only mode)
  ansible.builtin.command: grep -q '^ListenAddress 100.99.0.1$' /etc/ssh/sshd_config.d/10-boma.conf
  changed_when: false

- name: ip_nonlocal_bind sysctl drop-in is present
  ansible.builtin.command: grep -q '^net.ipv4.ip_nonlocal_bind = 1' /etc/sysctl.d/60-boma-nonlocal-bind.conf
  changed_when: false

- name: ip_nonlocal_bind is live in this netns
  ansible.builtin.command: sysctl -n net.ipv4.ip_nonlocal_bind
  register: _nonlocal
  changed_when: false
  failed_when: _nonlocal.stdout | trim != '1'
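The two grep assertions above amount to exact-line checks on the rendered files. As a rough illustration only, the sketch below expresses the same checks in Python; the helper names `has_exact_line` and `has_sysctl_setting` and the sample file contents are assumptions made for this example, not part of the role. Lines are compared literally, so the dots in the address are not treated as regex wildcards.

```python
def has_exact_line(text: str, expected: str) -> bool:
    """Return True if any line of `text` equals `expected` exactly."""
    return any(line == expected for line in text.splitlines())


def has_sysctl_setting(text: str, key: str, value: str) -> bool:
    """Return True if a `key = value` line is present (spacing around '=' is ignored)."""
    for line in text.splitlines():
        name, sep, val = line.partition("=")
        if sep and name.strip() == key and val.strip() == value:
            return True
    return False


# Illustrative contents only; the real files are rendered by the role.
sshd_dropin = "KbdInteractiveAuthentication no\nListenAddress 100.99.0.1\n"
sysctl_dropin = "net.ipv4.ip_nonlocal_bind = 1\n"

listen_ok = has_exact_line(sshd_dropin, "ListenAddress 100.99.0.1")
sysctl_ok = has_sysctl_setting(sysctl_dropin, "net.ipv4.ip_nonlocal_bind", "1")
```

The Molecule verify tasks remain the actual test; this only makes explicit what a passing state looks like.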
- Step 2: Add the fixture that drives it (Molecule converge)
In roles/base/molecule/default/converge.yml, add to the vars: block (alongside the existing base__mesh_*):
base__ssh_listen_mesh_only: true
base__ssh_listen_addr: "100.99.0.1" # fixture mesh IP (no wt0 in the container)
- Step 3: Run the test to verify it fails
Run: make test ROLE=base
Expected: FAIL — converge errors or verify fails (ListenAddress not rendered; sysctl drop-in absent), because the feature isn't implemented yet.
- Step 4: Add the defaults
In roles/base/defaults/main.yml, after the base__ssh_authorised_keys: [] line (end of the hardening block), add:
# SSH listen-on-mesh (mesh-hardening 1/3, ADR-016/021). Opt-in: when true, sshd binds
# ListenAddress to this host's mesh IP only (not the WAN). The IP comes from the live wt0
# fact (ansible_facts.wt0.ipv4.address); base__ssh_listen_addr overrides it. ip_nonlocal_bind
# lets sshd bind the mesh IP before wt0 exists at boot. Fails closed: the play asserts a
# non-empty address rather than silently listening on all interfaces.
base__ssh_listen_mesh_only: false
base__ssh_listen_addr: ""
- Step 5: Resolve + assert + sysctl in ssh.yml
In roles/base/tasks/ssh.yml, insert these tasks at the TOP of the file (before "Ensure openssh-server is installed"):
- name: Resolve the sshd mesh listen address (override, else live wt0 fact)
  ansible.builtin.set_fact:
    base__ssh_listen_addr_resolved: >-
      {{ base__ssh_listen_addr
         or ansible_facts.get('wt0', {}).get('ipv4', {}).get('address', '') }}
  when: base__ssh_listen_mesh_only | bool

- name: Fail closed — refuse to render sshd without a known mesh address
  ansible.builtin.assert:
    that:
      - base__ssh_listen_addr_resolved | length > 0
    fail_msg: >-
      base__ssh_listen_mesh_only is true but no mesh address resolved (set
      base__ssh_listen_addr or ensure wt0 is up so its fact is gathered). Refusing to
      render sshd ListenAddress empty (which would listen on ALL interfaces).
  when: base__ssh_listen_mesh_only | bool

- name: Allow sshd to bind the mesh IP before wt0 exists at boot
  ansible.posix.sysctl:
    name: net.ipv4.ip_nonlocal_bind
    value: "1"
    sysctl_set: true
    state: present
    reload: true
    sysctl_file: /etc/sysctl.d/60-boma-nonlocal-bind.conf
  when: base__ssh_listen_mesh_only | bool
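The resolve-then-assert pair above can be read as a single function: take the explicit override if set, otherwise the wt0 fact, and refuse to continue with an empty result. The Python below is a minimal sketch of that logic, assuming facts arrive as a nested dict shaped like `ansible_facts`; the function name `resolve_listen_addr` is hypothetical and exists only for this illustration.

```python
def resolve_listen_addr(override: str, facts: dict) -> str:
    """Explicit override first, else the wt0 IPv4 fact; never return an empty address."""
    resolved = override or facts.get("wt0", {}).get("ipv4", {}).get("address", "")
    if not resolved:
        # Fail closed: the plan refuses to render an empty ListenAddress.
        raise ValueError("mesh-only SSH requested but no mesh address resolved")
    return resolved


# Fixture case (no wt0 in the container): the override supplies the address.
fixture_addr = resolve_listen_addr("100.99.0.1", {})

# Live case: no override, the address comes from the gathered wt0 fact.
live_addr = resolve_listen_addr("", {"wt0": {"ipv4": {"address": "100.99.226.39"}}})
```

With neither an override nor a wt0 fact, the call raises instead of returning an empty string, which corresponds to the assert task stopping the play.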
- Step 6: Render the conditional ListenAddress
In roles/base/templates/sshd_hardening.conf.j2, append after the existing KbdInteractiveAuthentication no line:
{% if base__ssh_listen_mesh_only | bool %}
ListenAddress {{ base__ssh_listen_addr_resolved }}
{% endif %}
- Step 7: Run the test to verify it passes
Run: make test ROLE=base
Expected: PASS — converge succeeds; verify confirms ListenAddress 100.99.0.1, the sysctl drop-in, and the live value 1.
Checkpoint (environmental): if make test fails on the sysctl task because the Molecule container can't write net.ipv4.ip_nonlocal_bind, add sysctls: {net.ipv4.ip_nonlocal_bind: "0"} to the platform in roles/base/molecule/default/molecule.yml (pre-creates the namespaced sysctl so the task can set it), then re-run. Note the change in the commit.
- Step 8: Lint
Run: make lint
Expected: Passed: 0 failure(s) and check-tags: OK.
- Step 9: Commit
git add roles/base/defaults/main.yml roles/base/tasks/ssh.yml \
  roles/base/templates/sshd_hardening.conf.j2 \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): opt-in sshd ListenAddress on the mesh IP (fail-closed)

base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
Task 2: firewall catalog — public zone + askari's public services
Files:
- Modify: inventories/production/group_vars/all/firewall.yml
- Modify: roles/base/molecule/default/converge.yml (fixture: public-zone rule)
- Modify: roles/base/molecule/default/verify.yml (assert the 0.0.0.0/0 rule)
- Test: tests/test_firewall_rules.py (unit: a public zone resolves to 0.0.0.0/0)
Rationale: base__firewall_mgmt_interface already accepts :22 on wt0. The gap is that the catalog is empty and has no "anywhere" source, so applying default-deny to askari would drop 80/443/3478. We add a public zone (0.0.0.0/0) and askari's service ingress.
- Step 1: Write the failing unit test
In tests/test_firewall_rules.py, add:
def test_public_zone_resolves_to_anywhere():
    catalog = {"web": {"host": "askari",
                       "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
    zones = {"public": "0.0.0.0/0"}
    rules = rs.resolve_firewall_rules(catalog, zones, "askari",
                                      {"askari": {"ansible_host": "100.99.226.39"}}, {})
    assert rules == [{"proto": "tcp", "port": 443, "sources": ["0.0.0.0/0"]}]
(Module is loaded by the existing importlib shim at the top of the test file as rs. If the filter is imported under a different alias there, match it.)
- Step 2: Run it to verify it fails (or passes trivially)
Run: .venv/bin/python -m pytest tests/test_firewall_rules.py -q
Expected: this test PASSES immediately if the filter already resolves arbitrary zones (it does — _resolve_source treats any zones key generically). That is fine: the unit test documents/locks the public-zone contract. If it fails, fix the filter. Either way it must end green.
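To make the public-zone contract concrete, here is a deliberately simplified sketch of zone-to-CIDR resolution. It is not the repository's `resolve_firewall_rules` filter (which also takes host vars and groups and may resolve other source kinds); `sketch_resolve_rules` is a hypothetical stand-in that handles only the `host` placement key and zone-name sources.

```python
def sketch_resolve_rules(catalog: dict, zones: dict, host: str) -> list:
    """Collect ingress rules for services placed on `host`, mapping zone names to CIDRs."""
    rules = []
    for service in catalog.values():
        if service.get("host") != host:
            continue  # this sketch only understands the `host` placement key
        for ingress in service.get("ingress", []):
            rules.append({
                "proto": ingress["proto"],
                "port": ingress["port"],
                # Any zone key is looked up the same way; an unknown zone raises KeyError.
                "sources": [zones[ingress["from"]]],
            })
    return rules


catalog = {"web": {"host": "askari",
                   "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
zones = {"public": "0.0.0.0/0"}
rules = sketch_resolve_rules(catalog, zones, "askari")
```

Because the lookup is generic over zone names, adding `public` needs no special case in the resolver, which is consistent with the unit test passing immediately.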
- Step 3: Add the Molecule fixture (public-zone rule)
In roles/base/molecule/default/converge.yml, under firewall_zones: add public: 0.0.0.0/0, and under firewall_catalog: add:
netbird_stun:
  host: instance
  ingress:
    - { from: public, port: 3478, proto: udp }
- Step 4: Add the Molecule assertion (the test)
In roles/base/molecule/default/verify.yml, after the photoprism assertion block, add:
- name: Assert the public->stun:3478/udp ingress rule (0.0.0.0/0 source)
  ansible.builtin.assert:
    that:
      - "'0.0.0.0/0' in nft"
      - "'udp dport 3478 accept' in nft"
    fail_msg: "missing public->3478/udp rule for netbird_stun"
- Step 5: Run the tests
Run: make test ROLE=base then .venv/bin/python -m pytest tests/test_firewall_rules.py -q
Expected: both PASS (the rendered ruleset now contains the 0.0.0.0/0 ... udp dport 3478 accept rule).
- Step 6: Populate the real catalog
In inventories/production/group_vars/all/firewall.yml, replace the firewall_zones/firewall_catalog blocks with:
# Zone → subnet (from ADR-007). `public` = the WAN (anywhere) for deliberately public
# off-site services (askari); home/cluster services use the internal zones only.
firewall_zones:
  mgmt: 10.10.0.0/24
  srv: 10.20.0.0/24
  lan: 10.30.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24
  public: 0.0.0.0/0

# Service catalog: <name> → placement (host | group | hosts) + ingress[].
# askari's public surface (ADR-024 Caddy + ADR-016 NetBird STUN). NOTE: the host
# nftables template renders IPv4 source rules only; askari is reached via its A record
# (no AAAA), so IPv4-only public rules are sufficient (see the spec's IPv6 note).
firewall_catalog:
  reverse_proxy:
    host: askari
    ingress:
      - { from: public, port: 80, proto: tcp }
      - { from: public, port: 443, proto: tcp }
  netbird_stun:
    host: askari
    ingress:
      - { from: public, port: 3478, proto: udp }
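Since the point of this catalog is that 80/443/3478 survive default-deny, a quick coverage check can be useful before applying it. The sketch below is an illustrative pre-flight check, not something the plan or the repository defines: `public_ports` and `EXPECTED` are hypothetical names, and the catalog is written out as a Python dict mirroring the YAML above.

```python
def public_ports(catalog: dict, host: str) -> set:
    """Return the (port, proto) pairs the catalog opens from the `public` zone on `host`."""
    return {
        (rule["port"], rule["proto"])
        for service in catalog.values()
        if service.get("host") == host
        for rule in service.get("ingress", [])
        if rule.get("from") == "public"
    }


# Python mirror of the firewall_catalog YAML above.
firewall_catalog = {
    "reverse_proxy": {"host": "askari", "ingress": [
        {"from": "public", "port": 80, "proto": "tcp"},
        {"from": "public", "port": 443, "proto": "tcp"},
    ]},
    "netbird_stun": {"host": "askari", "ingress": [
        {"from": "public", "port": 3478, "proto": "udp"},
    ]},
}

# The public surface the plan says must survive default-deny.
EXPECTED = {(80, "tcp"), (443, "tcp"), (3478, "udp")}
missing = EXPECTED - public_ports(firewall_catalog, "askari")
```

An empty `missing` set means every service named in the goal has a matching public ingress entry; SSH is intentionally absent because it is allowed on wt0 through base__firewall_mgmt_interface rather than the catalog.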
- Step 7: Lint
Run: make lint
Expected: clean pass (check-tags: OK).
- Step 8: Commit
git add inventories/production/group_vars/all/firewall.yml \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
  tests/test_firewall_rules.py
git commit -m "feat(firewall): public zone + askari's public services in the catalog

Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
Task 3: inventory — point Ansible at wt0 + enable mesh-only SSH on askari
Files:
- Create: inventories/production/host_vars/askari.yml
- Modify: inventories/production/group_vars/offsite_hosts/vars.yml
- Step 1: Create the host_var override
Create inventories/production/host_vars/askari.yml:
---
# Manage askari over the NetBird mesh (wt0), not its WAN IP. This OVERRIDES the
# TF-generated inventories/production/offsite.yml (ansible_host = 77.42.120.136); host_vars
# outrank the generated inventory and are NOT touched by `make tf-inventory-offsite`.
# Mesh-hardening 1/3 — once SSH is wt0-only, the WAN IP is no longer reachable for SSH.
ansible_host: 100.99.226.39 # askari's wt0 address (NetBird, M5)
- Step 2: Enable mesh-only SSH for offsite hosts
In inventories/production/group_vars/offsite_hosts/vars.yml, replace the file body with:
---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5). Mesh-hardening 1/3 (2026-06-17): SSH is moved onto wt0 — sshd binds the
# mesh IP only (base__ssh_listen_mesh_only) and the base nftables default-deny applies
# (base__firewall_apply defaults true; SSH allowed on wt0 via base__firewall_mgmt_interface,
# public services via the catalog). base__mesh_enabled stays true (precondition from M5).
base__mesh_enabled: true
base__ssh_listen_mesh_only: true
- Step 3: Verify the override resolves
Run: .venv/bin/ansible-inventory -i inventories/production/ --host askari 2>/dev/null | grep ansible_host
Expected: "ansible_host": "100.99.226.39" (the host_var wins over the generated offsite.yml).
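The override relies on Ansible ranking host_vars files above variables set in an inventory file. As a loose mental model only (Ansible's real precedence has many more layers), it behaves like a dict merge in which the later source wins; `effective_host_vars` is a hypothetical helper written for this illustration.

```python
def effective_host_vars(inventory_file_vars: dict, host_vars_file: dict) -> dict:
    """Later sources win: host_vars/<host>.yml overrides vars set in the inventory file."""
    return {**inventory_file_vars, **host_vars_file}


generated = {"ansible_host": "77.42.120.136"}  # from the TF-generated offsite.yml
override = {"ansible_host": "100.99.226.39"}   # from host_vars/askari.yml

resolved = effective_host_vars(generated, override)
```

Regenerating offsite.yml changes only the first dict, so the wt0 address keeps winning for as long as host_vars/askari.yml exists.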
- Step 4: Lint
Run: make lint
Expected: clean pass.
- Step 5: Commit
git add inventories/production/host_vars/askari.yml \
  inventories/production/group_vars/offsite_hosts/vars.yml
git commit -m "feat(inventory): manage askari over wt0 + enable mesh-only SSH

host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
Task 4: Terraform — retire the Hetzner WAN :22 rule
Files:
- Modify: terraform/modules/hetzner_vm/main.tf
- Modify: terraform/modules/hetzner_vm/variables.tf
- Modify: terraform/environments/offsite/main.tf
This task makes the SSH rule conditional and sets askari's admin CIDRs to empty (mesh-only). The live tf-plan/tf-apply happens in Task 5 — here we only change + format/validate the code.
- Step 1: Gate the SSH rule on a non-empty CIDR list
In terraform/modules/hetzner_vm/main.tf, replace the static SSH rule { ... } block (the one with port = "22") with a dynamic block:
# SSH from the control node only — and only when admin CIDRs are set. An empty
# ssh_admin_cidrs removes the WAN :22 rule entirely (mesh-only SSH; reach the host over
# wt0, break-glass = Hetzner console). Mesh-hardening 1/3.
dynamic "rule" {
  for_each = length(var.ssh_admin_cidrs) > 0 ? [1] : []
  content {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = var.ssh_admin_cidrs
  }
}
- Step 2: Default the variable to empty
In terraform/modules/hetzner_vm/variables.tf, change the ssh_admin_cidrs variable to default to an empty list:
variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to reach SSH over the WAN. Empty = no WAN SSH rule (mesh-only)."
  type        = list(string)
  default     = []
}
- Step 3: Set askari to mesh-only SSH
In terraform/environments/offsite/main.tf, change the ssh_admin_cidrs argument in the module "askari" block to:
ssh_admin_cidrs = [] # mesh-only: SSH is reached over wt0; WAN :22 retired (mesh-hardening 1/3)
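The `for_each = length(...) > 0 ? [1] : []` idiom yields either one rule or none. A small Python model of that behaviour, offered as a sketch rather than anything Terraform runs, may make the effect of the empty list easier to see; `ssh_rules` is a hypothetical name and the sample CIDR comes from the documentation range.

```python
def ssh_rules(ssh_admin_cidrs: list) -> list:
    """Zero or one inbound tcp/22 rule, mirroring `length(...) > 0 ? [1] : []`."""
    if not ssh_admin_cidrs:
        return []  # empty list: the WAN :22 rule is not rendered at all
    return [{
        "direction": "in",
        "protocol": "tcp",
        "port": "22",
        "source_ips": list(ssh_admin_cidrs),
    }]


mesh_only = ssh_rules([])                   # askari after this task
with_admin = ssh_rules(["203.0.113.7/32"])  # example CIDR, for illustration only
```

The rule is omitted rather than rendered with an empty source list, which is why the later plan is expected to show the SSH rule being removed.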
- Step 4: Format + validate
Run: cd terraform/environments/offsite && terraform fmt -recursive ../.. && terraform validate && cd -
Expected: fmt lists any reformatted files (re-add them); validate prints Success! The configuration is valid. (offsite is already inited — it has live state.)
- Step 5: Commit
git add terraform/modules/hetzner_vm/main.tf terraform/modules/hetzner_vm/variables.tf \
  terraform/environments/offsite/main.tf
git commit -m "feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)

The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
Task 5: Live staged cutover (operator-supervised — NOT a subagent task)
This task touches the real askari over the network and is lockout-risky. Run it interactively with the operator, in order, verifying each step before the next. The firewall's auto-rollback timer + wait_for_connection over wt0 is the safety net; the Hetzner web console is the ultimate break-glass. Do NOT hand this to an unattended agent.
- Step 1: Pre-check the mesh SSH path (before any change)
Run: .venv/bin/ansible askari -i inventories/production/ -m ping
Expected: SUCCESS — confirms Ansible reaches askari over wt0 (Tasks 1–3 are merged, so ansible_host is now 100.99.226.39). If this fails, STOP — the mesh path must work before closing the WAN.
- Step 2: Dry-run the base apply (firewall + sshd)
Run: make check PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening
Expected: shows the nftables ruleset diff (default-deny + wt0 SSH + public 80/443/3478) and the sshd drop-in diff (ListenAddress 100.99.226.39); no errors. Review that the public service rules are present (so they won't be dropped).
- Step 3: Apply the host firewall + sshd (auto-rollback armed)
Run: make deploy PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening
Expected: the firewall concern arms the rollback timer, applies, resets the connection, and wait_for_connection succeeds over wt0; sshd reloads with the mesh ListenAddress. If connectivity is lost, the timer auto-reverts the ruleset within base__firewall_rollback_timeout (45 s).
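The safety of this step rests on a general pattern: arm a revert timer before changing anything, and cancel it only once connectivity is confirmed. The sketch below models that pattern in Python under stated assumptions; it is not the base role's implementation (which runs on the host and is not shown in this plan), and `apply_with_rollback` and its callbacks are hypothetical.

```python
import threading


def apply_with_rollback(apply, still_reachable, rollback, timeout_s: float) -> bool:
    """Arm a revert timer, apply the change, and cancel the timer only on confirmation.

    Returns True if the change was kept, False if the timer fired and reverted it.
    """
    fired = threading.Event()

    def _revert():
        fired.set()
        rollback()

    timer = threading.Timer(timeout_s, _revert)
    timer.start()        # arm first, so a lockout still reverts
    apply()
    if still_reachable():
        timer.cancel()   # confirmed: keep the new ruleset
    timer.join()         # returns once the timer was cancelled or has fired
    return not fired.is_set()


state = {"ruleset": "old"}
kept = apply_with_rollback(
    apply=lambda: state.update(ruleset="new"),
    still_reachable=lambda: True,             # stands in for wait_for_connection succeeding
    rollback=lambda: state.update(ruleset="old"),
    timeout_s=5.0,
)
```

The key design point is ordering: the timer is armed before the apply, so the revert does not depend on the controller still being able to reach the host.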
- Step 4: Verify services + WAN SSH still open at the cloud edge
curl -sSf -o /dev/null -w '%{http_code}\n' https://test.askari.wingu.me # expect 200
curl -sSf -o /dev/null -w '%{http_code}\n' https://netbird.askari.wingu.me # expect 200
Expected: both 200 (valid certs); the host firewall did not drop the public services. (WAN :22 is now dropped by the host nftables, but the Hetzner FW still allows it until Step 5 — that's fine.)
- Step 5: Retire the Hetzner WAN :22 (plan, review, apply)
Run: make tf-plan TF_ENV=offsite
Expected: the plan shows the SSH firewall rule being destroyed (and nothing else of substance). Review it.
Then: make tf-apply TF_ENV=offsite
Expected: apply succeeds; the WAN :22 rule is gone.
- Step 6: Verify the end-state (out-of-band)
From an OFF-MESH host (e.g. the operator's laptop with NetBird disconnected, or a quick check from askari's perspective):
nc -vz -w5 77.42.120.136 22 # expect: refused / timeout (WAN SSH closed)
nc -vz -w5 77.42.120.136 443 # expect: open (public service intact)
And from ubongo over the mesh: .venv/bin/ansible askari -i inventories/production/ -m ping → SUCCESS.
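If nc is not available on the off-mesh host, the same TCP checks can be made with the Python standard library. This is a minimal sketch; `tcp_port_open` is a hypothetical helper, and like nc's connect test it only covers TCP, so it says nothing about 3478/udp.

```python
import socket


def tcp_port_open(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection succeeds; False on refusal, timeout, or other socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable-network errors.
        return False
```

After the cutover, `tcp_port_open("77.42.120.136", 22)` would be expected to return False and `tcp_port_open("77.42.120.136", 443)` True when run from a host that is not on the mesh.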
- Step 7: Reboot resilience check (optional but recommended)
Reboot askari from the Hetzner console; after it comes back, confirm ansible askari -m ping succeeds over wt0 without intervention (proves ip_nonlocal_bind beat the post-boot bind race).
- Step 8: Update STATUS + ROADMAP
  - In STATUS.md, update the askari row: SSH is now wt0-only; the host nftables default-deny is applied; the Hetzner WAN :22 is retired. Move "host firewall + moving askari's SSH onto wt0" out of Pending.
  - In docs/ROADMAP.md, mark mesh-hardening sub-project 1 (askari SSH→wt0) done; next is sub-project 2 (ubongo default-deny).
git add STATUS.md docs/ROADMAP.md
git commit -m "docs: askari SSH moved onto wt0 (mesh-hardening 1/3 done)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
- Step 9: Push
Run: git push origin main
Self-review (against the spec)
- § three layers → Task 1 (sshd ListenAddress), Task 2 (nftables catalog; SSH-on-wt0 pre-existing via base__firewall_mgmt_interface), Task 4 (Hetzner WAN :22). ✓
- § boot-race fix (ip_nonlocal_bind + fail-closed assert + live wt0 fact) → Task 1 Steps 4–6. ✓
- § new code/vars (base__ssh_listen_mesh_only, base__ssh_listen_addr, host_vars/askari.yml, offsite flag, catalog, TF) → Tasks 1–4. ✓
- § staged cutover → Task 5 Steps 1–6, with the firewall auto-rollback as the gate. ✓
- § testing → Molecule render asserts (ListenAddress, sysctl, public-zone rule) + filter unit test + live out-of-band checks. The fail-closed assert is exercised by code; to spot-check it, temporarily blank base__ssh_listen_addr in the converge fixture and confirm make test ROLE=base fails on the assert, then revert (manual, not automated: a deliberate-failure Molecule scenario is non-idiomatic). ✓
- § risks/rollback → auto-rollback timer (Task 5 Step 3), ip_nonlocal_bind (Task 1), Hetzner console break-glass, re-addable TF rule. ✓
- IPv6 note → recorded in the catalog comment (Task 2 Step 6); acceptable because askari has only an A record.