Decomposes the M5 mesh-hardening follow-on into 3 independent sub-specs; this is sub-project 1. Three-layer SSH-on-wt0 (sshd ListenAddress=mesh + nftables iifname wt0 + retire the Hetzner WAN :22), ip_nonlocal_bind to beat the post-boot wt0 bind race (fail-closed), live wt0 fact for the listen addr, staged cutover with the firewall auto-rollback as the safety gate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spec — Mesh-hardening (1 of 3): move askari's SSH onto wt0
Status: Accepted (2026-06-17)
Context & scope
The mesh-hardening follow-on was deferred from M5 (ROADMAP). It was decomposed into three independent sub-projects, each with its own spec → plan → implementation cycle:
1. askari SSH → wt0 ← this spec
2. ubongo nftables default-deny + ssh-from-control (its own later spec)
3. NetBird ACL off Allow-All → scoped policies (its own later spec)
This spec covers only (1). It makes askari's SSH reachable only over the NetBird mesh
interface wt0 and closes the WAN :22 surface at both the host and the Hetzner Cloud
Firewall. It does not touch ubongo, the NetBird ACL (stays Allow-All for now — one
moving access-layer at a time), or askari's public service exposure (Caddy 80/443, NetBird
STUN 3478 stay on the WAN).
Current state (STATUS.md): askari is reached at ansible_host: 77.42.120.136 (WAN, in the
TF-generated inventories/production/offsite.yml); wt0 is up at 100.99.226.39
(Management+Signal Connected, M5); the base nftables firewall concern is built but not
applied to askari (the Hetzner Cloud Firewall is its perimeter today); the Hetzner Cloud
Firewall (terraform/modules/hetzner_vm) opens :22 from var.ssh_admin_cidrs plus
80/443/3478 from anywhere.
Goal / success criteria
- SSH to askari succeeds over wt0 (from ubongo) and fails from any off-mesh source.
- The WAN :22 surface is closed at both layers (host nftables = wt0-only; Hetzner Cloud Firewall drops the :22 rule).
- Public services are unaffected: https://test.askari.wingu.me and https://netbird.askari.wingu.me serve valid certs; STUN 3478/udp still answers.
- Ansible manages askari over wt0.
wt0. - Break-glass is the Hetzner web console (out-of-band; works even if the mesh is down).
- A reboot of askari does not lock SSH out (the boot-race below is solved).
Design — three enforcement layers (defense-in-depth)
- sshd binds ListenAddress to askari's wt0 IP only, so it does not accept on WAN.
- host nftables (base firewall concern, ADR-020): catalog-driven default-deny; :22 allowed only via iifname "wt0" (the interface-name match that survives wt0 being absent, see docs/testing/gotchas.md); public service ports stay open on WAN.
- Hetzner Cloud Firewall (Terraform): the :22 ssh_admin_cidrs rule is removed; 80/443/3478 stay.
The boot-race fix (load-bearing)
wt0 is brought up by NetBird after boot, so at sshd start the wt0 IP may not exist
yet. A plain ListenAddress 100.99.226.39 would fail to bind → sshd exits → lockout on
reboot. Solution:
- net.ipv4.ip_nonlocal_bind = 1 via a sysctl drop-in (ansible.posix.sysctl, persisted under /etc/sysctl.d/). This lets sshd bind the wt0 address even before the interface exists; once wt0 comes up with that IP, traffic is delivered to the existing listener, with no reload needed.
- The sshd drop-in fails closed: the mesh IP is resolved (see below) and the play asserts it is non-empty before rendering. An empty ListenAddress would silently fall back to listening on all interfaces, defeating the restriction; that must never render.
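The sysctl half of the fix could be expressed roughly as the task below. This is a minimal sketch rather than the role's actual task: the drop-in filename is a hypothetical placeholder, and the guard on base__ssh_listen_mesh_only simply mirrors the condition this spec describes.

```yaml
# Sketch only: the sysctl_file name is an assumption, not taken from the role.
- name: Let sshd bind the wt0 address before the interface exists
  ansible.posix.sysctl:
    name: net.ipv4.ip_nonlocal_bind
    value: "1"
    sysctl_file: /etc/sysctl.d/60-ssh-mesh-bind.conf  # hypothetical filename
    sysctl_set: true   # also set the value in the running kernel if it differs
    state: present
    reload: true
  when: base__ssh_listen_mesh_only | bool
```

Persisting the value in a drop-in, rather than only setting it at runtime, is what makes it available again on the next boot, which is the case the race is about.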
Mesh-IP source (decided): the live wt0 fact ansible_wt0.ipv4.address, gathered
at apply time (wt0 is up during the play, since M5), with a host_var fallback
(base__ssh_listen_addr, default "") and a fail-closed assert that one of them yielded
a non-empty address. Live fact is preferred (correct even if NetBird reassigns the IP);
the host_var is an explicit override / belt.
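One way the resolution and the fail-closed check might be written is sketched below. It assumes the ordering stated above (live fact first, host_var second) and reuses the resolved_mesh_ip name that the template section refers to; the task names and the failure message are illustrative only.

```yaml
# Sketch only: ordering and messages are assumptions based on this spec.
- name: Resolve the mesh IP (live wt0 fact first, host_var as fallback)
  ansible.builtin.set_fact:
    resolved_mesh_ip: >-
      {{ ansible_facts.get('wt0', {}).get('ipv4', {}).get('address', '')
         or base__ssh_listen_addr | default('') }}
  when: base__ssh_listen_mesh_only | bool

- name: Fail closed when no mesh IP could be resolved
  ansible.builtin.assert:
    that:
      - resolved_mesh_ip | length > 0
    fail_msg: >-
      base__ssh_listen_mesh_only is true but neither the wt0 fact nor
      base__ssh_listen_addr yielded an address; refusing to render sshd config.
  when: base__ssh_listen_mesh_only | bool
```

Using .get() with empty defaults keeps the expression from erroring when wt0 is absent (as in the Molecule container), so the assert, not a template error, is what stops the play.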
New & changed code
Role base (the hardening + firewall concerns):
- roles/base/defaults/main.yml: add
  - base__ssh_listen_mesh_only: false (opt-in; when true, sshd binds the mesh IP only).
  - base__ssh_listen_addr: "" (optional explicit mesh-IP override; fallback to the ansible_wt0 fact).
- roles/base/tasks/ssh.yml:
  - resolve the mesh IP (base__ssh_listen_addr or ansible_wt0.ipv4.address) into a fact;
  - assert it is non-empty when base__ssh_listen_mesh_only;
  - set net.ipv4.ip_nonlocal_bind = 1 (sysctl drop-in) under the same condition.
- roles/base/templates/sshd_hardening.conf.j2: append a conditional ListenAddress {{ resolved_mesh_ip }} block guarded by base__ssh_listen_mesh_only (unset → unchanged behaviour: listen on all). Keep the existing sshd -t validation.
Inventory:
- inventories/production/host_vars/askari.yml (new): ansible_host: 100.99.226.39 (overrides the TF-generated offsite.yml; host_vars are not regenerated by tf_to_inventory.py). A header comment explains why.
- inventories/production/group_vars/offsite_hosts/vars.yml: add base__ssh_listen_mesh_only: true; ensure base__firewall_apply: true. (base__mesh_enabled is already true for askari, set in M5, and is a precondition, not a change here.)
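For illustration, the two inventory changes might look like the fragments below. The header comment wording is an assumption (the spec only says that one exists), and the group_vars fragment shows only the keys named here, not the rest of the file.

```yaml
# inventories/production/host_vars/askari.yml
# Illustrative header comment: askari is managed over the NetBird mesh (wt0).
# This value overrides the WAN address in the TF-generated offsite.yml and is
# not touched when tf_to_inventory.py regenerates that file.
ansible_host: 100.99.226.39
---
# inventories/production/group_vars/offsite_hosts/vars.yml (relevant keys only)
base__ssh_listen_mesh_only: true
base__firewall_apply: true
```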
Firewall catalog (inventories/production/group_vars/all/firewall.yml):
- Enumerate askari's required ingress so catalog-driven default-deny does not drop a live public service. Derived from the existing reverse_proxy + netbird_coordinator definitions: :22/tcp on the mesh zone (wt0); 80,443/tcp + 3478/udp on the public zone (WAN). The exact catalog/zone YAML is finalised in the implementation plan against the resolve_firewall_rules filter's schema.
Terraform (terraform/environments/offsite + terraform/modules/hetzner_vm):
- Remove the WAN :22 ingress rule (e.g. drop ssh_admin_cidrs from the firewall, or set it empty and guard the rule). Keep 80/443/3478. Applied via make tf-plan/apply TF_ENV=offsite (plan shown before apply).
Staged cutover — a working path at every step
- Pre-check: confirm ssh sjat@100.99.226.39 and an ansible askari -m ping forced over wt0 both succeed before changing anything.
- Repoint Ansible: add host_vars/askari.yml (ansible_host = wt0 IP); verify ansible askari -m ping runs over the mesh. WAN :22 is still open as a fallback here.
- Apply base (firewall + sshd together): one make deploy PLAYBOOK=site LIMIT=askari converge applies catalog default-deny (:22 on wt0 + public ports) and the sshd ListenAddress=mesh + ip_nonlocal_bind drop-in. The firewall concern's reset_connection → wait_for_connection (now over wt0) plus the armed auto-rollback timer (base__firewall_rollback_timeout, 45 s) is the safety gate: a bad ruleset reverts itself. The sshd reload cannot drop the in-flight wt0 session. Verify the public services still respond.
- Retire the Hetzner WAN :22: the Terraform change above; make tf-plan TF_ENV=offsite (review) → make tf-apply. Verify: wt0 SSH works; off-mesh nc -vz 77.42.120.136 22 is refused/times out; :443 open; STUN answers.
Testing
- Molecule (base default scenario; wt0 absent in-container, base__firewall_apply: false render-only): assert (a) the rendered nftables allows :22 via iifname "wt0"; (b) with base__ssh_listen_mesh_only: true + a fixture mesh IP, the sshd drop-in renders ListenAddress <ip> and sshd -t passes; (c) with the flag set but no resolvable mesh IP, the play fails closed (the assert); (d) the ip_nonlocal_bind sysctl task is present. Keep existing firewall/hardening assertions green.
- Live, out-of-band: post-cutover, from an off-mesh host nc -vz 77.42.120.136 22 → refused or timed out; :443 → open; from ubongo over wt0, SSH + ansible -m ping succeed; reboot askari (Hetzner console) and confirm SSH-over-wt0 returns without intervention.
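The off-mesh port checks could also be captured as a small play, sketched below as an alternative to the manual nc commands. It is an assumption-laden sketch: it must be run from a machine that is not enrolled in the mesh, the timeout values are arbitrary, and it covers only the TCP checks (the STUN, certificate, and reboot checks remain manual).

```yaml
# Sketch only: run from a machine that is NOT enrolled in the NetBird mesh.
- name: Off-mesh checks after the cutover
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: WAN :22 must be closed (refused or timed out)
      ansible.builtin.wait_for:
        host: 77.42.120.136
        port: 22
        state: stopped   # succeeds once a connection attempt fails
        timeout: 15      # assumed value; fails if the port is still open

    - name: WAN :443 must still accept connections
      ansible.builtin.wait_for:
        host: 77.42.120.136
        port: 443
        state: started
        timeout: 15      # assumed value
```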
Risks & rollback
- Mid-cutover lockout: mitigated by the staged order (a path open at each step), the firewall auto-rollback timer, and ansible_host=wt0, so the connectivity confirm tests the real new path.
- Reboot lockout: mitigated by ip_nonlocal_bind (sshd binds wt0 regardless of interface timing) + the fail-closed assert (never silently listen-all).
- Default-deny breaks a public service: mitigated by enumerating all live ingress into the catalog and the §Testing service checks; reversible by re-running with base__firewall_apply: false or widening the catalog.
- Ultimate break-glass: the Hetzner web console (out-of-band). The TF :22 rule is trivially re-addable.
Out of scope / follow-ons
- ubongo default-deny + ssh-from-control (sub-project 2).
- NetBird ACL off Allow-All (sub-project 3): until then any enrolled peer can reach askari's wt0 :22; scoping that is sub-project 3's job.
- /check-access (ADR-021) live verification: designed, build still pending.
- STATUS.md / ROADMAP updates land with the implementation, not this spec.