# Spec - Mesh-hardening (1 of 3): move askari's SSH onto `wt0`

Status: Accepted (2026-06-17)

## Context & scope

The **mesh-hardening follow-on** was deferred from M5 (ROADMAP). It was decomposed into **three independent sub-projects**, each with its own spec → plan → implementation cycle:

1. **askari SSH → `wt0`** ← *this spec*
2. ubongo nftables default-deny + `ssh-from-control` (its own later spec)
3. NetBird ACL off Allow-All → scoped policies (its own later spec)

This spec covers **only (1)**. It makes askari's SSH reachable **only over the NetBird mesh interface `wt0`** and closes the WAN `:22` surface at both the host and the Hetzner Cloud Firewall. It does **not** touch ubongo, the NetBird ACL (stays Allow-All for now; one moving access-layer at a time), or askari's public service exposure (Caddy 80/443, NetBird STUN 3478 stay on the WAN).

Current state (STATUS.md):

- askari is reached at `ansible_host: 77.42.120.136` (WAN, in the TF-generated `inventories/production/offsite.yml`);
- `wt0` is up at `100.99.226.39` (Management+Signal Connected, M5);
- the base nftables `firewall` concern is **built but not applied** to askari (the Hetzner Cloud Firewall is its perimeter today);
- the Hetzner Cloud Firewall (`terraform/modules/hetzner_vm`) opens `:22` from `var.ssh_admin_cidrs` plus 80/443/3478 from anywhere.

## Goal / success criteria

- SSH to askari succeeds over `wt0` (from ubongo) and **fails from any off-mesh source**.
- The WAN `:22` surface is closed at **both** layers (host nftables = `wt0`-only; Hetzner Cloud Firewall drops the `:22` rule).
- Public services are unaffected: `https://test.askari.wingu.me` and `https://netbird.askari.wingu.me` serve valid certs; STUN `3478/udp` still answers.
- Ansible manages askari over `wt0`.
- Break-glass is the **Hetzner web console** (out-of-band; works even if the mesh is down).
- A reboot of askari does **not** lock SSH out (the boot-race below is solved).

## Design: three enforcement layers (defense-in-depth)

1. **sshd** binds `ListenAddress` to askari's `wt0` IP only, so it does not accept on WAN.
2. **host nftables** (base `firewall` concern, ADR-020): catalog-driven default-deny; `:22` allowed only via `iifname "wt0"` (the interface-name match that survives `wt0` being absent; see `docs/testing/gotchas.md`); public service ports stay open on WAN.
3. **Hetzner Cloud Firewall** (Terraform): the `:22` `ssh_admin_cidrs` rule is removed; 80/443/3478 stay.

## The boot-race fix (load-bearing)

`wt0` is brought up by NetBird **after** boot, so at sshd start the `wt0` IP may not exist yet. A plain `ListenAddress 100.99.226.39` would fail to bind → sshd exits → **lockout on reboot**. Solution:

- **`net.ipv4.ip_nonlocal_bind = 1`** via a sysctl drop-in (`ansible.posix.sysctl`, persisted under `/etc/sysctl.d/`). This lets sshd bind the `wt0` address even before the interface exists; once `wt0` comes up with that IP, traffic is delivered to the existing listener, with no reload needed.
- The sshd drop-in **fails closed**: the mesh IP is resolved (see below) and the play **asserts it is non-empty** before rendering. An empty `ListenAddress` would silently fall back to listening on all interfaces, defeating the restriction; that must never render.

**Mesh-IP source (decided):** the **live `wt0` fact** `ansible_wt0.ipv4.address`, gathered at apply time (`wt0` is up during the play, since M5), with a **`host_var` fallback** (`base__ssh_listen_addr`, default `""`) and a fail-closed `assert` that one of them yielded a non-empty address. Live fact is preferred (correct even if NetBird reassigns the IP); the host_var is an explicit override / belt.

## New & changed code

**Role `base` (the `hardening` + `firewall` concerns):**

- `roles/base/defaults/main.yml`, add:
  - `base__ssh_listen_mesh_only: false` (opt-in; when `true`, sshd binds the mesh IP only).
  - `base__ssh_listen_addr: ""` (optional explicit mesh-IP override; fallback to the `ansible_wt0` fact).
- `roles/base/tasks/ssh.yml`:
  - resolve the mesh IP (`base__ssh_listen_addr` or `ansible_wt0.ipv4.address`) into a fact;
  - `assert` it is non-empty **when** `base__ssh_listen_mesh_only`;
  - set `net.ipv4.ip_nonlocal_bind = 1` (sysctl drop-in) under the same condition.
- `roles/base/templates/sshd_hardening.conf.j2`: append a conditional `ListenAddress {{ resolved_mesh_ip }}` block guarded by `base__ssh_listen_mesh_only` (unset → unchanged behaviour: listen on all). Keep the existing `sshd -t` validation.

**Inventory:**

- `inventories/production/host_vars/askari.yml` (new): `ansible_host: 100.99.226.39` (overrides the TF-generated `offsite.yml`; host_vars are not regenerated by `tf_to_inventory.py`). A header comment explains why.
- `inventories/production/group_vars/offsite_hosts/vars.yml`: add `base__ssh_listen_mesh_only: true`; ensure `base__firewall_apply: true`. (`base__mesh_enabled` is already `true` for askari, set in M5, and is a precondition, not a change here.)

**Firewall catalog** (`inventories/production/group_vars/all/firewall.yml`):

- Enumerate askari's required ingress so catalog-driven default-deny does **not** drop a live public service. Derived from the existing `reverse_proxy` + `netbird_coordinator` definitions: `:22/tcp` on the **mesh** zone (`wt0`); `80,443/tcp` + `3478/udp` on the **public** zone (WAN). The exact catalog/zone YAML is finalised in the implementation plan against the `resolve_firewall_rules` filter's schema.

**Terraform** (`terraform/environments/offsite` + `terraform/modules/hetzner_vm`):

- Remove the WAN `:22` ingress rule (e.g. drop `ssh_admin_cidrs` from the firewall, or set it empty and guard the rule). Keep 80/443/3478. Applied via `make tf-plan/apply TF_ENV=offsite` (plan shown before apply).

## Staged cutover: a working path at every step

1. **Pre-check:** confirm `ssh sjat@100.99.226.39` and an `ansible askari -m ping` forced over `wt0` both succeed **before** changing anything.
2. **Repoint Ansible:** add `host_vars/askari.yml` (`ansible_host` = `wt0` IP); verify `ansible askari -m ping` runs over the mesh. WAN `:22` still open as a fallback here.
3. **Apply `base` (firewall + sshd together):** one `make deploy PLAYBOOK=site LIMIT=askari` converge applies catalog default-deny (`:22` on `wt0` + public ports) **and** the sshd `ListenAddress`=mesh + `ip_nonlocal_bind` drop-in. The firewall concern's `reset_connection` → `wait_for_connection` (now over `wt0`) plus the armed auto-rollback timer (`base__firewall_rollback_timeout`, 45 s) is the safety gate: a bad ruleset reverts itself. The sshd `reload` cannot drop the in-flight `wt0` session. Verify the public services still respond.
4. **Retire the Hetzner WAN `:22`:** the Terraform change above; `make tf-plan TF_ENV=offsite` (review) → `make tf-apply`. Verify: `wt0` SSH works; off-mesh `nc -vz 77.42.120.136 22` is refused/times out; `:443` open; STUN answers.

## Testing

- **Molecule** (base `default` scenario; `wt0` absent in-container, `base__firewall_apply: false` render-only): assert
  (a) the rendered nftables allows `:22` via `iifname "wt0"`;
  (b) with `base__ssh_listen_mesh_only: true` + a fixture mesh IP, the sshd drop-in renders a `ListenAddress` line carrying that fixture IP and `sshd -t` passes;
  (c) with the flag set but **no** resolvable mesh IP, the play **fails closed** (the `assert`);
  (d) the `ip_nonlocal_bind` sysctl task is present.

  Keep existing firewall/hardening assertions green.
- **Live, out-of-band:** post-cutover, from an off-mesh host `nc -vz 77.42.120.136 22` → refused; `:443` → open; from ubongo over `wt0`, SSH + `ansible -m ping` succeed; reboot askari (Hetzner console) and confirm SSH-over-`wt0` returns without intervention.

## Risks & rollback

- **Mid-cutover lockout:** mitigated by the staged order (a path open at each step), the firewall auto-rollback timer, and `ansible_host`=`wt0` so the connectivity confirm tests the real new path.
- **Reboot lockout:** mitigated by `ip_nonlocal_bind` (sshd binds `wt0` regardless of interface timing) + the fail-closed assert (never silently listen-all).
- **Default-deny breaks a public service:** mitigated by enumerating all live ingress into the catalog and the §Testing service checks; reversible by re-running with `base__firewall_apply: false` or widening the catalog.
- **Ultimate break-glass:** the Hetzner web console (out-of-band). The TF `:22` rule is trivially re-addable.

## Out of scope / follow-ons

- ubongo default-deny + `ssh-from-control` (sub-project 2).
- NetBird ACL off Allow-All (sub-project 3): until then any enrolled peer can reach askari's `wt0:22`; scoping that is sub-project 3's job.
- `/check-access` (ADR-021) live verification: designed, build still pending.
- STATUS.md / ROADMAP updates land with the implementation, not this spec.
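
## Appendix: illustrative sketch of the fail-closed sshd tasks

The three `roles/base/tasks/ssh.yml` steps listed under "New & changed code" (resolve the mesh IP, assert it is non-empty, set `ip_nonlocal_bind`) could look roughly like the tasks below. This is a minimal sketch under stated assumptions, not the final implementation: the fact name `resolved_mesh_ip` is taken from the template reference in this spec, the sysctl file name is hypothetical, and the precedence (live `wt0` fact first, `base__ssh_listen_addr` as fallback) follows the "Mesh-IP source" paragraph. If the host_var is meant to act as a hard override, the two operands would be swapped.

```yaml
# Hypothetical excerpt for roles/base/tasks/ssh.yml (illustrative only).

- name: Resolve the mesh IP (live wt0 fact first, host_var as fallback)
  ansible.builtin.set_fact:
    # ansible_facts['wt0'] is only present while the interface is up;
    # the chained .get() calls yield '' instead of an undefined-variable error.
    resolved_mesh_ip: "{{ ansible_facts.get('wt0', {}).get('ipv4', {}).get('address', '') | default(base__ssh_listen_addr | default(''), true) }}"
  when: base__ssh_listen_mesh_only | bool

- name: Fail closed when no mesh IP could be resolved
  ansible.builtin.assert:
    that:
      - resolved_mesh_ip | length > 0
    fail_msg: >-
      base__ssh_listen_mesh_only is true but neither the wt0 fact nor
      base__ssh_listen_addr yielded an address; refusing to render ListenAddress.
  when: base__ssh_listen_mesh_only | bool

- name: Let sshd bind the wt0 address before the interface exists
  ansible.posix.sysctl:
    name: net.ipv4.ip_nonlocal_bind
    value: "1"
    sysctl_file: /etc/sysctl.d/60-ssh-nonlocal-bind.conf  # hypothetical file name
    state: present
    reload: true
  when: base__ssh_listen_mesh_only | bool
```

Placing the `assert` before any template task means a host with the flag set but no resolvable address stops the play instead of rendering a drop-in without the restriction, which is the fail-closed behaviour the Molecule case (c) is meant to exercise.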