boma/docs/superpowers/plans/2026-06-17-mesh-hardening-askari-ssh-wt0.md
sjat dfa363cecd docs(plan): mesh-hardening 1/3 — askari SSH onto wt0 implementation plan
5 tasks: base sshd ListenAddress+ip_nonlocal_bind (Molecule), firewall public
zone + askari catalog, inventory wt0 override, TF retire WAN :22, then the live
operator-supervised staged cutover.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:25:59 +02:00


Mesh-hardening 1/3 — askari SSH onto wt0 — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Make askari's SSH reachable only over the NetBird mesh (wt0) and close the WAN :22 surface at both the host nftables layer and the Hetzner Cloud Firewall, without dropping askari's public services.

Architecture: Three enforcement layers. (1) sshd ListenAddress bound to the live wt0 IP (fail-closed, with ip_nonlocal_bind to beat the post-boot bind race); (2) the base role's catalog-driven nftables default-deny (SSH is already restricted to wt0 via base__firewall_mgmt_interface; add a public zone and askari service entries so 80/443/3478 survive); (3) Terraform drops the Hetzner Cloud Firewall WAN :22 rule. Tasks 1–4 are code (subagent-driven, each verified by Molecule, lint, or plan). Task 5 is the live, operator-supervised cutover on the real host.

Tech Stack: Ansible (role base, FQCN), nftables, Molecule on Debian 13, ansible.posix.sysctl, pytest (filter unit tests), Terraform (hcloud provider).

Spec: docs/superpowers/specs/2026-06-17-mesh-hardening-askari-ssh-wt0-design.md

Conventions: make lint and make test ROLE=base before each commit; make check before make deploy; make tf-plan before make tf-apply; never hand-edit the generated offsite.yml; rbw unlocked for commits touching ansible content.


Task 1: base role — sshd ListenAddress on wt0 + ip_nonlocal_bind (fail-closed)

Files:

  • Modify: roles/base/defaults/main.yml

  • Modify: roles/base/tasks/ssh.yml

  • Modify: roles/base/templates/sshd_hardening.conf.j2

  • Modify: roles/base/molecule/default/converge.yml (fixture)

  • Modify: roles/base/molecule/default/verify.yml (assertions = the test)

  • Step 1: Write the failing test (extend Molecule verify)

In roles/base/molecule/default/verify.yml, add these tasks after the existing "Sshd drop-in present and config valid" block:

    - name: ListenAddress bound to the fixture mesh IP (mesh-only mode)
      ansible.builtin.command: grep -q '^ListenAddress 100\.99\.0\.1$' /etc/ssh/sshd_config.d/10-boma.conf
      changed_when: false
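    - name: ip_nonlocal_bind sysctl drop-in is present
      # Tolerate optional whitespace around "=": the sysctl module may write "key=value".
      ansible.builtin.command: grep -Eq '^net\.ipv4\.ip_nonlocal_bind[[:space:]]*=[[:space:]]*1$' /etc/sysctl.d/60-boma-nonlocal-bind.conf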
      changed_when: false
    - name: ip_nonlocal_bind is live in this netns
      ansible.builtin.command: sysctl -n net.ipv4.ip_nonlocal_bind
      register: _nonlocal
      changed_when: false
      failed_when: _nonlocal.stdout | trim != '1'
  • Step 2: Add the fixture that drives it (Molecule converge)

In roles/base/molecule/default/converge.yml, add to the vars: block (alongside the existing base__mesh_*):

    base__ssh_listen_mesh_only: true
    base__ssh_listen_addr: "100.99.0.1"   # fixture mesh IP (no wt0 in the container)
  • Step 3: Run the test to verify it fails

Run: make test ROLE=base
Expected: FAIL. Converge errors or verify fails (ListenAddress not rendered; sysctl drop-in absent) because the feature isn't implemented yet.

  • Step 4: Add the defaults

In roles/base/defaults/main.yml, after the base__ssh_authorised_keys: [] line (end of the hardening block), add:

# SSH listen-on-mesh (mesh-hardening 1/3, ADR-016/021). Opt-in: when true, sshd binds
# ListenAddress to this host's mesh IP only (not the WAN). The IP comes from the live wt0
# fact (ansible_facts.wt0.ipv4.address); base__ssh_listen_addr overrides it. ip_nonlocal_bind
# lets sshd bind the mesh IP before wt0 exists at boot. Fails closed: the play asserts a
# non-empty address rather than silently listening on all interfaces.
base__ssh_listen_mesh_only: false
base__ssh_listen_addr: ""
  • Step 5: Resolve + assert + sysctl in ssh.yml

In roles/base/tasks/ssh.yml, insert these tasks at the TOP of the file (before "Ensure openssh-server is installed"):

- name: Resolve the sshd mesh listen address (override, else live wt0 fact)
  ansible.builtin.set_fact:
    base__ssh_listen_addr_resolved: >-
      {{ base__ssh_listen_addr
         or ansible_facts.get('wt0', {}).get('ipv4', {}).get('address', '') }}
  when: base__ssh_listen_mesh_only | bool

- name: Fail closed — refuse to render sshd without a known mesh address
  ansible.builtin.assert:
    that:
      - base__ssh_listen_addr_resolved | length > 0
    fail_msg: >-
      base__ssh_listen_mesh_only is true but no mesh address resolved (set
      base__ssh_listen_addr or ensure wt0 is up so its fact is gathered). Refusing to
      render sshd ListenAddress empty (which would listen on ALL interfaces).
  when: base__ssh_listen_mesh_only | bool

- name: Allow sshd to bind the mesh IP before wt0 exists at boot
  ansible.posix.sysctl:
    name: net.ipv4.ip_nonlocal_bind
    value: "1"
    sysctl_set: true
    state: present
    reload: true
    sysctl_file: /etc/sysctl.d/60-boma-nonlocal-bind.conf
  when: base__ssh_listen_mesh_only | bool
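
The resolve-then-assert logic in these tasks can be sketched in plain Python. This is an illustration only, not part of the role: resolve_listen_addr is a hypothetical helper name, and the facts dict stands in for ansible_facts.

```python
def resolve_listen_addr(override: str, facts: dict) -> str:
    """Return the explicit override, else the wt0 IPv4 fact; raise if neither is set."""
    # Mirrors the Jinja expression: the override wins, else wt0.ipv4.address, else ''.
    addr = override or facts.get("wt0", {}).get("ipv4", {}).get("address", "")
    if not addr:
        # Fail closed: never hand an empty address to the sshd template.
        raise ValueError("mesh-only SSH requested but no mesh address resolved")
    return addr


if __name__ == "__main__":
    live_facts = {"wt0": {"ipv4": {"address": "100.99.226.39"}}}
    print(resolve_listen_addr("", live_facts))            # falls back to the wt0 fact
    print(resolve_listen_addr("100.99.0.1", live_facts))  # the override wins
```

The Molecule fixture takes the override branch (there is no wt0 in the container), while the live host is expected to take the fact branch.
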
  • Step 6: Render the conditional ListenAddress

In roles/base/templates/sshd_hardening.conf.j2, append after the existing KbdInteractiveAuthentication no line:

{% if base__ssh_listen_mesh_only | bool %}
ListenAddress {{ base__ssh_listen_addr_resolved }}
{% endif %}
  • Step 7: Run the test to verify it passes

Run: make test ROLE=base
Expected: PASS. Converge succeeds; verify confirms ListenAddress 100.99.0.1, the sysctl drop-in, and the live value 1.

Checkpoint (environmental): if make test fails on the sysctl task because the Molecule container can't write net.ipv4.ip_nonlocal_bind, add sysctls: {net.ipv4.ip_nonlocal_bind: "0"} to the platform in roles/base/molecule/default/molecule.yml (pre-creates the namespaced sysctl so the task can set it), then re-run. Note the change in the commit.

  • Step 8: Lint

Run: make lint
Expected: Passed: 0 failure(s) and check-tags: OK.

  • Step 9: Commit
git add roles/base/defaults/main.yml roles/base/tasks/ssh.yml \
        roles/base/templates/sshd_hardening.conf.j2 \
        roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): opt-in sshd ListenAddress on the mesh IP (fail-closed)

base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Task 2: firewall catalog — public zone + askari's public services

Files:

  • Modify: inventories/production/group_vars/all/firewall.yml
  • Modify: roles/base/molecule/default/converge.yml (fixture: public-zone rule)
  • Modify: roles/base/molecule/default/verify.yml (assert the 0.0.0.0/0 rule)
  • Test: tests/test_firewall_rules.py (unit: a public zone resolves to 0.0.0.0/0)

Rationale: base__firewall_mgmt_interface already accepts :22 on wt0. The gap is that the catalog is empty and has no "anywhere" source, so applying default-deny to askari would drop 80/443/3478. We add a public zone (0.0.0.0/0) and askari's service ingress.

  • Step 1: Write the failing unit test

In tests/test_firewall_rules.py, add:

def test_public_zone_resolves_to_anywhere():
    catalog = {"web": {"host": "askari",
                       "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
    zones = {"public": "0.0.0.0/0"}
    rules = rs.resolve_firewall_rules(catalog, zones, "askari",
                                      {"askari": {"ansible_host": "100.99.226.39"}}, {})
    assert rules == [{"proto": "tcp", "port": 443, "sources": ["0.0.0.0/0"]}]

(Module is loaded by the existing importlib shim at the top of the test file as rs. If the filter is imported under a different alias there, match it.)

  • Step 2: Run it to verify it fails (or passes trivially)

Run: .venv/bin/python -m pytest tests/test_firewall_rules.py -q
Expected: this test PASSES immediately if the filter already resolves arbitrary zones (it does: _resolve_source treats any zones key generically). That is fine: the unit test documents and locks the public-zone contract. If it fails, fix the filter. Either way it must end green.
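
To illustrate why a generic zone lookup needs no special case for public, here is a simplified sketch. It is not the repository's filter: resolve_rules_sketch is a hypothetical stand-in that handles only host: placement and zone-name sources, and it omits the hostvars and groups arguments that the real resolve_firewall_rules takes.

```python
def resolve_rules_sketch(catalog: dict, zones: dict, host: str) -> list:
    """Flatten catalog entries placed on `host` into proto/port/sources rules."""
    rules = []
    for service in catalog.values():
        if service.get("host") != host:
            continue  # service is placed on a different host
        for ingress in service.get("ingress", []):
            # Any key present in `zones` resolves the same way, "public" included.
            source = zones[ingress["from"]]
            rules.append({"proto": ingress["proto"],
                          "port": ingress["port"],
                          "sources": [source]})
    return rules


catalog = {"web": {"host": "askari",
                   "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
zones = {"public": "0.0.0.0/0"}
print(resolve_rules_sketch(catalog, zones, "askari"))
```
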

  • Step 3: Add the Molecule fixture (public-zone rule)

In roles/base/molecule/default/converge.yml, under firewall_zones: add public: 0.0.0.0/0, and under firewall_catalog: add:

      netbird_stun:
        host: instance
        ingress:
          - { from: public, port: 3478, proto: udp }
  • Step 4: Add the Molecule assertion (the test)

In roles/base/molecule/default/verify.yml, after the photoprism assertion block, add:

    - name: Assert the public->stun:3478/udp ingress rule (0.0.0.0/0 source)
      ansible.builtin.assert:
        that:
          - "'0.0.0.0/0' in nft"
          - "'udp dport 3478 accept' in nft"
        fail_msg: "missing public->3478/udp rule for netbird_stun"
  • Step 5: Run the tests

Run: make test ROLE=base, then .venv/bin/python -m pytest tests/test_firewall_rules.py -q
Expected: both PASS (the rendered ruleset now contains the 0.0.0.0/0 ... udp dport 3478 accept rule).

  • Step 6: Populate the real catalog

In inventories/production/group_vars/all/firewall.yml, replace the firewall_zones/firewall_catalog blocks with:

# Zone → subnet (from ADR-007). `public` = the WAN (anywhere) for deliberately public
# off-site services (askari); home/cluster services use the internal zones only.
firewall_zones:
  mgmt: 10.10.0.0/24
  srv: 10.20.0.0/24
  lan: 10.30.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24
  public: 0.0.0.0/0

# Service catalog: <name> → placement (host | group | hosts) + ingress[].
# askari's public surface (ADR-024 Caddy + ADR-016 NetBird STUN). NOTE: the host
# nftables template renders IPv4 source rules only; askari is reached via its A record
# (no AAAA), so IPv4-only public rules are sufficient (see the spec's IPv6 note).
firewall_catalog:
  reverse_proxy:
    host: askari
    ingress:
      - { from: public, port: 80, proto: tcp }
      - { from: public, port: 443, proto: tcp }
  netbird_stun:
    host: askari
    ingress:
      - { from: public, port: 3478, proto: udp }
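
As a quick sanity check of the IPv4-only note in the catalog comment, Python's standard ipaddress module shows that 0.0.0.0/0 matches every IPv4 source but no IPv6 source. This is illustrative only; 2001:db8::1 is a documentation-range address chosen for the example, not one of askari's.

```python
import ipaddress

public = ipaddress.ip_network("0.0.0.0/0")


def covered(addr: str) -> bool:
    """True if addr falls inside the public zone's CIDR."""
    return ipaddress.ip_address(addr) in public


print(covered("77.42.120.136"))  # askari's WAN IPv4 from this plan
print(covered("2001:db8::1"))    # an IPv6 source is never inside an IPv4 network
```

If askari ever gains an AAAA record, the public zone would need an IPv6 counterpart for the same services to stay reachable.
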
  • Step 7: Lint

Run: make lint
Expected: clean pass (check-tags: OK).

  • Step 8: Commit
git add inventories/production/group_vars/all/firewall.yml \
        roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
        tests/test_firewall_rules.py
git commit -m "feat(firewall): public zone + askari's public services in the catalog

Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Task 3: inventory — point Ansible at wt0 + enable mesh-only SSH on askari

Files:

  • Create: inventories/production/host_vars/askari.yml

  • Modify: inventories/production/group_vars/offsite_hosts/vars.yml

  • Step 1: Create the host_var override

Create inventories/production/host_vars/askari.yml:

---
# Manage askari over the NetBird mesh (wt0), not its WAN IP. This OVERRIDES the
# TF-generated inventories/production/offsite.yml (ansible_host = 77.42.120.136); host_vars
# outrank the generated inventory and are NOT touched by `make tf-inventory-offsite`.
# Mesh-hardening 1/3 — once SSH is wt0-only, the WAN IP is no longer reachable for SSH.
ansible_host: 100.99.226.39   # askari's wt0 address (NetBird, M5)
  • Step 2: Enable mesh-only SSH for offsite hosts

In inventories/production/group_vars/offsite_hosts/vars.yml, replace the file body with:

---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5). Mesh-hardening 1/3 (2026-06-17): SSH is moved onto wt0 — sshd binds the
# mesh IP only (base__ssh_listen_mesh_only) and the base nftables default-deny applies
# (base__firewall_apply defaults true; SSH allowed on wt0 via base__firewall_mgmt_interface,
# public services via the catalog). base__mesh_enabled stays true (precondition from M5).
base__mesh_enabled: true
base__ssh_listen_mesh_only: true
  • Step 3: Verify the override resolves

Run: .venv/bin/ansible-inventory -i inventories/production/ --host askari 2>/dev/null | grep ansible_host
Expected: "ansible_host": "100.99.226.39" (the host_var wins over the generated offsite.yml).
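
If a scripted check is preferred over grep, the JSON that ansible-inventory --host prints can be parsed directly. A minimal sketch, assuming the output is the usual JSON object of host variables; ansible_host_from_json is a hypothetical helper, and the sample string stands in for real command output.

```python
import json


def ansible_host_from_json(raw: str) -> str:
    """Extract ansible_host from `ansible-inventory --host <name>` JSON output."""
    return json.loads(raw).get("ansible_host", "")


# Stand-in for the command's stdout (not captured from a real run).
sample = '{"ansible_host": "100.99.226.39", "base__mesh_enabled": true}'
assert ansible_host_from_json(sample) == "100.99.226.39"
```
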

  • Step 4: Lint

Run: make lint
Expected: clean pass.

  • Step 5: Commit
git add inventories/production/host_vars/askari.yml \
        inventories/production/group_vars/offsite_hosts/vars.yml
git commit -m "feat(inventory): manage askari over wt0 + enable mesh-only SSH

host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Task 4: Terraform — retire the Hetzner WAN :22 rule

Files:

  • Modify: terraform/modules/hetzner_vm/main.tf
  • Modify: terraform/modules/hetzner_vm/variables.tf
  • Modify: terraform/environments/offsite/main.tf

This task makes the SSH rule conditional and sets askari's admin CIDRs to empty (mesh-only). The live tf-plan/tf-apply happens in Task 5 — here we only change + format/validate the code.

  • Step 1: Gate the SSH rule on a non-empty CIDR list

In terraform/modules/hetzner_vm/main.tf, replace the static SSH rule { ... } block (the one with port = "22") with a dynamic block:

  # SSH from the control node only — and only when admin CIDRs are set. An empty
  # ssh_admin_cidrs removes the WAN :22 rule entirely (mesh-only SSH; reach the host over
  # wt0, break-glass = Hetzner console). Mesh-hardening 1/3.
  dynamic "rule" {
    for_each = length(var.ssh_admin_cidrs) > 0 ? [1] : []
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = "22"
      source_ips = var.ssh_admin_cidrs
    }
  }
  • Step 2: Default the variable to empty

In terraform/modules/hetzner_vm/variables.tf, change the ssh_admin_cidrs variable to default to an empty list:

variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to reach SSH over the WAN. Empty = no WAN SSH rule (mesh-only)."
  type        = list(string)
  default     = []
}
  • Step 3: Set askari to mesh-only SSH

In terraform/environments/offsite/main.tf, change the ssh_admin_cidrs argument in the module "askari" block to:

  ssh_admin_cidrs = [] # mesh-only: SSH is reached over wt0; WAN :22 retired (mesh-hardening 1/3)
  • Step 4: Format + validate

Run: cd terraform/environments/offsite && terraform fmt -recursive ../.. && terraform validate && cd -
Expected: fmt lists any reformatted files (re-add them); validate prints Success! The configuration is valid. (offsite is already initialized; it has live state.)

  • Step 5: Commit
git add terraform/modules/hetzner_vm/main.tf terraform/modules/hetzner_vm/variables.tf \
        terraform/environments/offsite/main.tf
git commit -m "feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)

The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"

Task 5: Live staged cutover (operator-supervised — NOT a subagent task)

This task touches the real askari over the network and is lockout-risky. Run it interactively with the operator, in order, verifying each step before the next. The firewall's auto-rollback timer + wait_for_connection over wt0 is the safety net; the Hetzner web console is the ultimate break-glass. Do NOT hand this to an unattended agent.

  • Step 1: Pre-check the mesh SSH path (before any change)

Run: .venv/bin/ansible askari -i inventories/production/ -m ping
Expected: SUCCESS, confirming that Ansible reaches askari over wt0 (Tasks 1–3 are merged, so ansible_host is now 100.99.226.39). If this fails, STOP: the mesh path must work before closing the WAN.

  • Step 2: Dry-run the base apply (firewall + sshd)

Run: make check PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening
Expected: shows the nftables ruleset diff (default-deny + wt0 SSH + public 80/443/3478) and the sshd drop-in diff (ListenAddress 100.99.226.39); no errors. Review that the public service rules are present (so they won't be dropped).

  • Step 3: Apply the host firewall + sshd (auto-rollback armed)

Run: make deploy PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening
Expected: the firewall concern arms the rollback timer, applies, resets the connection, and wait_for_connection succeeds over wt0; sshd reloads with the mesh ListenAddress. If connectivity is lost, the timer auto-reverts the ruleset within base__firewall_rollback_timeout (45 s).
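
The arm, apply, confirm pattern behind the auto-rollback described above can be sketched as follows. This is a generic illustration using threading.Timer, not the base role's implementation (which this plan does not show); all function names are hypothetical, and the short timeout is only for the demo.

```python
import threading


def apply_with_rollback(apply, confirm, rollback, timeout: float) -> bool:
    """Arm a rollback timer, apply the change, and cancel the timer only on success."""
    timer = threading.Timer(timeout, rollback)
    timer.start()      # armed before the risky change is made
    apply()
    if confirm():      # e.g. the management connection still works
        timer.cancel()
        return True
    timer.join()       # connectivity lost: wait for the rollback to fire
    return False


state = {"ruleset": "old"}
ok = apply_with_rollback(
    apply=lambda: state.update(ruleset="new"),
    confirm=lambda: False,  # simulate a lockout
    rollback=lambda: state.update(ruleset="old"),
    timeout=0.05,
)
print(ok, state["ruleset"])
```

The point of arming the timer first is that the revert does not depend on the operator still having a working session after the change.
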

  • Step 4: Verify services + WAN SSH still open at the cloud edge
curl -sSf -o /dev/null -w '%{http_code}\n' https://test.askari.wingu.me   # expect 200
curl -sSf -o /dev/null -w '%{http_code}\n' https://netbird.askari.wingu.me # expect 200

Expected: both 200 (valid certs); the host firewall did not drop the public services. (WAN :22 is now dropped by the host nftables, but the Hetzner FW still allows it until Step 5 — that's fine.)

  • Step 5: Retire the Hetzner WAN :22 — plan, review, apply

Run: make tf-plan TF_ENV=offsite
Expected: the plan shows the SSH firewall rule being destroyed (and nothing else of substance). Review it.

Then: make tf-apply TF_ENV=offsite
Expected: apply succeeds; the WAN :22 rule is gone.

  • Step 6: Verify the end-state (out-of-band)

From an OFF-MESH host (e.g. the operator's laptop with NetBird disconnected, or a quick check from askari's perspective):

nc -vz -w5 77.42.120.136 22   # expect: refused / timeout (WAN SSH closed)
nc -vz -w5 77.42.120.136 443  # expect: open (public service intact)
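
Where nc is not available, the same reachability check can be approximated with Python's standard socket module. A minimal sketch; tcp_port_open is a hypothetical helper, and like nc it only reports whether a TCP connect succeeds from the machine it runs on.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage from an off-mesh host (addresses as used in this step):
#   tcp_port_open("77.42.120.136", 22)   # expected False after the cutover
#   tcp_port_open("77.42.120.136", 443)  # expected True
```
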

And from ubongo over the mesh, run .venv/bin/ansible askari -i inventories/production/ -m ping and expect SUCCESS.

  • Step 7: Reboot resilience check (optional but recommended)

Reboot askari from the Hetzner console; after it comes back, confirm ansible askari -m ping succeeds over wt0 without intervention (proves ip_nonlocal_bind beat the post-boot bind race).

  • Step 8: Update STATUS + ROADMAP

  • In STATUS.md, update the askari row: SSH is now wt0-only; the host nftables default-deny is applied; the Hetzner WAN :22 is retired. Move "host firewall + moving askari's SSH onto wt0" out of Pending.

  • In docs/ROADMAP.md, mark mesh-hardening sub-project 1 (askari SSH→wt0) done; next is sub-project 2 (ubongo default-deny).

git add STATUS.md docs/ROADMAP.md
git commit -m "docs: askari SSH moved onto wt0 (mesh-hardening 1/3 done)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
  • Step 9: Push

Run: git push origin main


Self-review (against the spec)

  • § three layers → Task 1 (sshd ListenAddress), Task 2 (nftables catalog; SSH-on-wt0 pre-existing via base__firewall_mgmt_interface), Task 4 (Hetzner WAN :22). ✓
  • § boot-race fix (ip_nonlocal_bind + fail-closed assert + live wt0 fact) → Task 1 Steps 4–6. ✓
  • § new code/vars (base__ssh_listen_mesh_only, base__ssh_listen_addr, host_vars/askari.yml, offsite flag, catalog, TF) → Tasks 1–4. ✓
  • § staged cutover → Task 5 Steps 1–6, with the firewall auto-rollback as the gate. ✓
  • § testing → Molecule render asserts (ListenAddress, sysctl, public-zone rule) + filter unit test + live out-of-band checks. The fail-closed assert is exercised by code; to spot-check it, temporarily blank base__ssh_listen_addr in the converge fixture and confirm make test ROLE=base fails on the assert, then revert (manual, not automated — a deliberate-failure Molecule scenario is non-idiomatic). ✓
  • § risks/rollback → auto-rollback timer (Task 5 Step 3), ip_nonlocal_bind (Task 1), Hetzner console break-glass, re-addable TF rule. ✓
  • IPv6 note → recorded in the catalog comment (Task 2 Step 6); acceptable because askari has only an A record.