docs(plan): mesh-hardening 1/3 — askari SSH onto wt0 implementation plan
5 tasks: base sshd ListenAddress+ip_nonlocal_bind (Molecule), firewall public zone + askari catalog, inventory wt0 override, TF retire WAN :22, then the live operator-supervised staged cutover.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Parent: 292c204752
Commit: dfa363cecd
1 changed file with 466 additions and 0 deletions

# Mesh-hardening 1/3 — askari SSH onto wt0 — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make askari's SSH reachable only over the NetBird mesh (`wt0`) and close the WAN `:22` surface at both the host nftables layer and the Hetzner Cloud Firewall, without dropping askari's public services.

**Architecture:** Three enforcement layers: (1) sshd `ListenAddress` bound to the live `wt0` IP (fail-closed, with `ip_nonlocal_bind` to beat the post-boot bind race); (2) the base role's catalog-driven nftables default-deny (SSH is already restricted to `wt0` via `base__firewall_mgmt_interface`; this plan adds a `public` zone plus askari service entries so 80/443/3478 survive); (3) Terraform drops the Hetzner Cloud Firewall WAN `:22` rule. Tasks 1–4 are code (subagent-driven, each Molecule/lint/plan-verified). Task 5 is the live, operator-supervised cutover on the real host.

**Tech Stack:** Ansible (role `base`, FQCN), nftables, Molecule on Debian 13, `ansible.posix.sysctl`, pytest (filter unit tests), Terraform (`hcloud` provider).

**Spec:** `docs/superpowers/specs/2026-06-17-mesh-hardening-askari-ssh-wt0-design.md`

**Conventions:** `make lint` and `make test ROLE=base` before each commit; `make check` before `make deploy`; `make tf-plan` before `make tf-apply`; never hand-edit the generated `offsite.yml`; rbw unlocked for commits touching ansible content.

---

### Task 1: base role — sshd `ListenAddress` on wt0 + `ip_nonlocal_bind` (fail-closed)

**Files:**
- Modify: `roles/base/defaults/main.yml`
- Modify: `roles/base/tasks/ssh.yml`
- Modify: `roles/base/templates/sshd_hardening.conf.j2`
- Modify: `roles/base/molecule/default/converge.yml` (fixture)
- Modify: `roles/base/molecule/default/verify.yml` (assertions = the test)

- [ ] **Step 1: Write the failing test (extend Molecule verify)**

In `roles/base/molecule/default/verify.yml`, add these tasks after the existing "Sshd drop-in present and config valid" block:

```yaml
- name: ListenAddress bound to the fixture mesh IP (mesh-only mode)
  ansible.builtin.command: grep -q '^ListenAddress 100\.99\.0\.1$' /etc/ssh/sshd_config.d/10-boma.conf
  changed_when: false
# Tolerate both "key = value" and "key=value"; the sysctl module may write either form.
- name: ip_nonlocal_bind sysctl drop-in is present
  ansible.builtin.command: grep -Eq '^net\.ipv4\.ip_nonlocal_bind[[:space:]]*=[[:space:]]*1$' /etc/sysctl.d/60-boma-nonlocal-bind.conf
  changed_when: false
- name: ip_nonlocal_bind is live in this netns
  ansible.builtin.command: sysctl -n net.ipv4.ip_nonlocal_bind
  register: _nonlocal
  changed_when: false
  failed_when: _nonlocal.stdout | trim != '1'
```

- [ ] **Step 2: Add the fixture that drives it (Molecule converge)**

In `roles/base/molecule/default/converge.yml`, add to the `vars:` block (alongside the existing `base__mesh_*`):

```yaml
base__ssh_listen_mesh_only: true
base__ssh_listen_addr: "100.99.0.1"  # fixture mesh IP (no wt0 in the container)
```

- [ ] **Step 3: Run the test to verify it fails**

Run: `make test ROLE=base`
Expected: FAIL. Converge errors or verify fails (`ListenAddress` not rendered; sysctl drop-in absent) because the feature isn't implemented yet.

- [ ] **Step 4: Add the defaults**

In `roles/base/defaults/main.yml`, after the `base__ssh_authorised_keys: []` line (end of the hardening block), add:

```yaml
# SSH listen-on-mesh (mesh-hardening 1/3, ADR-016/021). Opt-in: when true, sshd binds
# ListenAddress to this host's mesh IP only (not the WAN). The IP comes from the live wt0
# fact (ansible_facts.wt0.ipv4.address); base__ssh_listen_addr overrides it. ip_nonlocal_bind
# lets sshd bind the mesh IP before wt0 exists at boot. Fails closed: the play asserts a
# non-empty address rather than silently listening on all interfaces.
base__ssh_listen_mesh_only: false
base__ssh_listen_addr: ""
```

- [ ] **Step 5: Resolve + assert + sysctl in `ssh.yml`**

In `roles/base/tasks/ssh.yml`, insert these tasks at the TOP of the file (before "Ensure openssh-server is installed"):

```yaml
- name: Resolve the sshd mesh listen address (override, else live wt0 fact)
  ansible.builtin.set_fact:
    base__ssh_listen_addr_resolved: >-
      {{ base__ssh_listen_addr
         or ansible_facts.get('wt0', {}).get('ipv4', {}).get('address', '') }}
  when: base__ssh_listen_mesh_only | bool

- name: Fail closed - refuse to render sshd without a known mesh address
  ansible.builtin.assert:
    that:
      - base__ssh_listen_addr_resolved | length > 0
    fail_msg: >-
      base__ssh_listen_mesh_only is true but no mesh address resolved (set
      base__ssh_listen_addr or ensure wt0 is up so its fact is gathered). Refusing to
      render sshd ListenAddress empty (which would listen on ALL interfaces).
  when: base__ssh_listen_mesh_only | bool

- name: Allow sshd to bind the mesh IP before wt0 exists at boot
  ansible.posix.sysctl:
    name: net.ipv4.ip_nonlocal_bind
    value: "1"
    sysctl_set: true
    state: present
    reload: true
    sysctl_file: /etc/sysctl.d/60-boma-nonlocal-bind.conf
  when: base__ssh_listen_mesh_only | bool
```
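
The resolution order and the fail-closed guard above can be sketched in plain Python. This is only an illustration of the intended logic, not code from the role: `resolve_listen_addr` and `MeshAddressError` are hypothetical names, and `facts` stands in for `ansible_facts`.

```python
class MeshAddressError(RuntimeError):
    """Raised instead of rendering an empty ListenAddress (hypothetical name)."""


def resolve_listen_addr(override: str, facts: dict) -> str:
    """Return the explicit override, else the wt0 IPv4 fact; never return ''."""
    from_fact = facts.get("wt0", {}).get("ipv4", {}).get("address", "")
    resolved = override or from_fact
    if not resolved:
        # Mirrors the assert task: stop rather than fall back to all interfaces.
        raise MeshAddressError("no mesh address resolved; refusing to render sshd")
    return resolved


if __name__ == "__main__":
    facts = {"wt0": {"ipv4": {"address": "100.99.226.39"}}}
    print(resolve_listen_addr("", facts))            # the wt0 fact is used
    print(resolve_listen_addr("100.99.0.1", facts))  # the override wins
```

The point of raising rather than returning an empty string is that an empty `ListenAddress` would be worse than a failed play: sshd would fall back to listening everywhere.
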

- [ ] **Step 6: Render the conditional `ListenAddress`**

In `roles/base/templates/sshd_hardening.conf.j2`, append after the existing `KbdInteractiveAuthentication no` line:

```jinja
{% if base__ssh_listen_mesh_only | bool %}
ListenAddress {{ base__ssh_listen_addr_resolved }}
{% endif %}
```

- [ ] **Step 7: Run the test to verify it passes**

Run: `make test ROLE=base`
Expected: PASS. Converge succeeds; verify confirms `ListenAddress 100.99.0.1`, the sysctl drop-in, and the live value `1`.

> **Checkpoint (environmental):** if `make test` fails on the sysctl task because the Molecule container can't write `net.ipv4.ip_nonlocal_bind`, add `sysctls: {net.ipv4.ip_nonlocal_bind: "0"}` to the platform in `roles/base/molecule/default/molecule.yml` (pre-creates the namespaced sysctl so the task can set it), then re-run. Note the change in the commit.
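
For background on why the sysctl matters: binding a TCP socket to an address that is not configured on any interface normally fails with `EADDRNOTAVAIL`, and `net.ipv4.ip_nonlocal_bind=1` lifts that restriction. The probe below is a hypothetical helper (`can_bind` is not part of the role) that reports whether such a bind would succeed. The result for the mesh IP depends on the host's sysctl state and on whether `wt0` is up, so no particular output is implied.

```python
import errno
import socket


def can_bind(addr: str, port: int = 0) -> bool:
    """Return True if a TCP socket can bind addr; False on EADDRNOTAVAIL."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((addr, port))  # port 0 lets the kernel pick a free port
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False  # address is not local and nonlocal bind is off
            raise
        return True


if __name__ == "__main__":
    print("loopback:", can_bind("127.0.0.1"))
    print("mesh IP :", can_bind("100.99.0.1"))  # varies with sysctl / wt0 state
```
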

- [ ] **Step 8: Lint**

Run: `make lint`
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.

- [ ] **Step 9: Commit**

```bash
git add roles/base/defaults/main.yml roles/base/tasks/ssh.yml \
  roles/base/templates/sshd_hardening.conf.j2 \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): opt-in sshd ListenAddress on the mesh IP (fail-closed)

base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

---

### Task 2: firewall catalog — `public` zone + askari's public services

**Files:**
- Modify: `inventories/production/group_vars/all/firewall.yml`
- Modify: `roles/base/molecule/default/converge.yml` (fixture: public-zone rule)
- Modify: `roles/base/molecule/default/verify.yml` (assert the 0.0.0.0/0 rule)
- Test: `tests/test_firewall_rules.py` (unit: a `public` zone resolves to `0.0.0.0/0`)

Rationale: `base__firewall_mgmt_interface` already accepts `:22` on `wt0`. The gap is that the catalog is empty and has no "anywhere" source, so applying default-deny to askari would drop 80/443/3478. We add a `public` zone (`0.0.0.0/0`) and askari's service ingress.

- [ ] **Step 1: Write the failing unit test**

In `tests/test_firewall_rules.py`, add:

```python
def test_public_zone_resolves_to_anywhere():
    catalog = {"web": {"host": "askari",
                       "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
    zones = {"public": "0.0.0.0/0"}
    rules = rs.resolve_firewall_rules(catalog, zones, "askari",
                                      {"askari": {"ansible_host": "100.99.226.39"}}, {})
    assert rules == [{"proto": "tcp", "port": 443, "sources": ["0.0.0.0/0"]}]
```

(Module is loaded by the existing importlib shim at the top of the test file as `rs`. If the filter is imported under a different alias there, match it.)

- [ ] **Step 2: Run it to verify it fails (or passes trivially)**

Run: `.venv/bin/python -m pytest tests/test_firewall_rules.py -q`
Expected: this test PASSES immediately if the filter already resolves arbitrary zones (it does: `_resolve_source` treats any `zones` key generically). That is fine: the unit test documents and locks the `public`-zone contract. If it fails, fix the filter. Either way it must end green.
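
To make the `public`-zone contract concrete, here is a deliberately simplified stand-in for the resolver. It is not the repo's filter: `sketch_resolve_rules` is a hypothetical function that only handles `host` placement and zone-name sources, which is enough to show why any key in `zones` (including `public`) resolves without special-casing.

```python
def sketch_resolve_rules(catalog: dict, zones: dict, host: str) -> list:
    """Expand catalog ingress entries placed on `host` into concrete rules."""
    rules = []
    for service in catalog.values():
        if service.get("host") != host:
            continue  # service is placed on another host
        for ingress in service.get("ingress", []):
            # A zone name is just a lookup key; an unknown zone raises KeyError.
            cidr = zones[ingress["from"]]
            rules.append({"proto": ingress["proto"],
                          "port": ingress["port"],
                          "sources": [cidr]})
    return rules


if __name__ == "__main__":
    catalog = {"web": {"host": "askari",
                       "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
    print(sketch_resolve_rules(catalog, {"public": "0.0.0.0/0"}, "askari"))
```
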

- [ ] **Step 3: Add the Molecule fixture (public-zone rule)**

In `roles/base/molecule/default/converge.yml`, under `firewall_zones:` add `public: 0.0.0.0/0`, and under `firewall_catalog:` add:

```yaml
netbird_stun:
  host: instance
  ingress:
    - { from: public, port: 3478, proto: udp }
```

- [ ] **Step 4: Add the Molecule assertion (the test)**

In `roles/base/molecule/default/verify.yml`, after the photoprism assertion block, add:

```yaml
- name: Assert the public->stun:3478/udp ingress rule (0.0.0.0/0 source)
  ansible.builtin.assert:
    that:
      - "'0.0.0.0/0' in nft"
      - "'udp dport 3478 accept' in nft"
    fail_msg: "missing public->3478/udp rule for netbird_stun"
```

- [ ] **Step 5: Run the tests**

Run: `make test ROLE=base` then `.venv/bin/python -m pytest tests/test_firewall_rules.py -q`
Expected: both PASS (the rendered ruleset now contains the `0.0.0.0/0 ... udp dport 3478 accept` rule).

- [ ] **Step 6: Populate the real catalog**

In `inventories/production/group_vars/all/firewall.yml`, replace the `firewall_zones`/`firewall_catalog` blocks with:

```yaml
# Zone → subnet (from ADR-007). `public` = the WAN (anywhere) for deliberately public
# off-site services (askari); home/cluster services use the internal zones only.
firewall_zones:
  mgmt: 10.10.0.0/24
  srv: 10.20.0.0/24
  lan: 10.30.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24
  public: 0.0.0.0/0

# Service catalog: <name> → placement (host | group | hosts) + ingress[].
# askari's public surface (ADR-024 Caddy + ADR-016 NetBird STUN). NOTE: the host
# nftables template renders IPv4 source rules only; askari is reached via its A record
# (no AAAA), so IPv4-only public rules are sufficient (see the spec's IPv6 note).
firewall_catalog:
  reverse_proxy:
    host: askari
    ingress:
      - { from: public, port: 80, proto: tcp }
      - { from: public, port: 443, proto: tcp }
  netbird_stun:
    host: askari
    ingress:
      - { from: public, port: 3478, proto: udp }
```
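
The IPv4-only caveat in the catalog comment can be checked with the standard-library `ipaddress` module. The sketch below (the helper name `covered_by_zone` is hypothetical) shows that `0.0.0.0/0` matches every IPv4 source but no IPv6 source, which is why the rule set relies on askari having no AAAA record.

```python
import ipaddress


def covered_by_zone(source: str, zone_cidr: str) -> bool:
    """True if `source` falls inside `zone_cidr`; mixed IP versions never match."""
    return ipaddress.ip_address(source) in ipaddress.ip_network(zone_cidr)


if __name__ == "__main__":
    # Both addresses come from documentation ranges, not from the real hosts.
    for src in ("203.0.113.5", "2001:db8::1"):
        print(src, covered_by_zone(src, "0.0.0.0/0"))
```

If askari ever gains an AAAA record, an equivalent `::/0` source and IPv6 rendering in the template would be needed; that is outside this plan.
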

- [ ] **Step 7: Lint**

Run: `make lint`
Expected: clean pass (`check-tags: OK`).

- [ ] **Step 8: Commit**

```bash
git add inventories/production/group_vars/all/firewall.yml \
  roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
  tests/test_firewall_rules.py
git commit -m "feat(firewall): public zone + askari's public services in the catalog

Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

---

### Task 3: inventory — point Ansible at wt0 + enable mesh-only SSH on askari

**Files:**
- Create: `inventories/production/host_vars/askari.yml`
- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml`

- [ ] **Step 1: Create the host_var override**

Create `inventories/production/host_vars/askari.yml`:

```yaml
---
# Manage askari over the NetBird mesh (wt0), not its WAN IP. This OVERRIDES the
# TF-generated inventories/production/offsite.yml (ansible_host = 77.42.120.136); host_vars
# outrank the generated inventory and are NOT touched by `make tf-inventory-offsite`.
# Mesh-hardening 1/3: once SSH is wt0-only, the WAN IP is no longer reachable for SSH.
ansible_host: 100.99.226.39  # askari's wt0 address (NetBird, M5)
```

- [ ] **Step 2: Enable mesh-only SSH for offsite hosts**

In `inventories/production/group_vars/offsite_hosts/vars.yml`, replace the file body with:

```yaml
---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5). Mesh-hardening 1/3 (2026-06-17): SSH is moved onto wt0. sshd binds the
# mesh IP only (base__ssh_listen_mesh_only) and the base nftables default-deny applies
# (base__firewall_apply defaults true; SSH allowed on wt0 via base__firewall_mgmt_interface,
# public services via the catalog). base__mesh_enabled stays true (precondition from M5).
base__mesh_enabled: true
base__ssh_listen_mesh_only: true
```

- [ ] **Step 3: Verify the override resolves**

Run: `.venv/bin/ansible-inventory -i inventories/production/ --host askari 2>/dev/null | grep ansible_host`
Expected: `"ansible_host": "100.99.226.39"` (the host_var wins over the generated `offsite.yml`).
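
The override works because Ansible ranks `host_vars/` files above variables set for the host inside an inventory file. A rough model of that merge, with a hypothetical helper (`merge_host_vars`) and only the two layers relevant here:

```python
def merge_host_vars(*layers: dict) -> dict:
    """Merge variable layers; later layers take precedence over earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


if __name__ == "__main__":
    generated_inventory = {"ansible_host": "77.42.120.136"}  # TF-generated offsite.yml
    host_vars_file = {"ansible_host": "100.99.226.39"}       # host_vars/askari.yml
    print(merge_host_vars(generated_inventory, host_vars_file)["ansible_host"])
```

Because the generated file is the lower layer, regenerating it with the WAN IP does not undo the mesh override.
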

- [ ] **Step 4: Lint**

Run: `make lint`
Expected: clean pass.

- [ ] **Step 5: Commit**

```bash
git add inventories/production/host_vars/askari.yml \
  inventories/production/group_vars/offsite_hosts/vars.yml
git commit -m "feat(inventory): manage askari over wt0 + enable mesh-only SSH

host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

---

### Task 4: Terraform — retire the Hetzner WAN `:22` rule

**Files:**
- Modify: `terraform/modules/hetzner_vm/main.tf`
- Modify: `terraform/modules/hetzner_vm/variables.tf`
- Modify: `terraform/environments/offsite/main.tf`

This task makes the SSH rule conditional and sets askari's admin CIDRs to empty (mesh-only). The live `tf-plan`/`tf-apply` happens in Task 5; here we only change, format, and validate the code.

- [ ] **Step 1: Gate the SSH rule on a non-empty CIDR list**

In `terraform/modules/hetzner_vm/main.tf`, replace the static SSH `rule { ... }` block (the one with `port = "22"`) with a dynamic block:

```hcl
  # SSH from the control node only, and only when admin CIDRs are set. An empty
  # ssh_admin_cidrs removes the WAN :22 rule entirely (mesh-only SSH; reach the host over
  # wt0, break-glass = Hetzner console). Mesh-hardening 1/3.
  dynamic "rule" {
    for_each = length(var.ssh_admin_cidrs) > 0 ? [1] : []
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = "22"
      source_ips = var.ssh_admin_cidrs
    }
  }
```
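
The conditional `for_each` is the usual Terraform idiom for an optional nested block: a one-element list emits the block once and an empty list omits it. The same decision expressed in Python, as a sketch with a hypothetical helper `ssh_rules` (the CIDR in the example is from a documentation range, not a real admin address):

```python
def ssh_rules(ssh_admin_cidrs: list) -> list:
    """Return the WAN SSH rule only when at least one admin CIDR is set."""
    if not ssh_admin_cidrs:
        return []  # mesh-only: no WAN :22 rule is rendered at all
    return [{"direction": "in", "protocol": "tcp", "port": "22",
             "source_ips": list(ssh_admin_cidrs)}]


if __name__ == "__main__":
    print(ssh_rules([]))
    print(ssh_rules(["198.51.100.7/32"]))
```

Gating on the list length rather than iterating over the CIDRs keeps a single rule carrying all sources, matching the static block it replaces.
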

- [ ] **Step 2: Default the variable to empty**

In `terraform/modules/hetzner_vm/variables.tf`, change the `ssh_admin_cidrs` variable to default to an empty list:

```hcl
variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to reach SSH over the WAN. Empty = no WAN SSH rule (mesh-only)."
  type        = list(string)
  default     = []
}
```

- [ ] **Step 3: Set askari to mesh-only SSH**

In `terraform/environments/offsite/main.tf`, change the `ssh_admin_cidrs` argument in the `module "askari"` block to:

```hcl
  ssh_admin_cidrs = [] # mesh-only: SSH is reached over wt0; WAN :22 retired (mesh-hardening 1/3)
```

- [ ] **Step 4: Format + validate**

Run: `(cd terraform/environments/offsite && terraform fmt -recursive ../.. && terraform validate)` (the subshell leaves you in the repo root even if a command fails)
Expected: `fmt` lists any reformatted files (re-add them); `validate` prints `Success! The configuration is valid.` (offsite is already `init`ed; it has live state.)

- [ ] **Step 5: Commit**

```bash
git add terraform/modules/hetzner_vm/main.tf terraform/modules/hetzner_vm/variables.tf \
  terraform/environments/offsite/main.tf
git commit -m "feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)

The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

---

### Task 5: Live staged cutover (operator-supervised — NOT a subagent task)

> This task touches the real askari over the network and is lockout-risky. Run it
> interactively with the operator, in order, verifying each step before the next. The
> firewall's auto-rollback timer + `wait_for_connection` over wt0 is the safety net; the
> Hetzner web console is the ultimate break-glass. Do NOT hand this to an unattended agent.

- [ ] **Step 1: Pre-check the mesh SSH path (before any change)**

Run: `.venv/bin/ansible askari -i inventories/production/ -m ping`
Expected: `SUCCESS`, confirming that Ansible reaches askari over `wt0` (Tasks 1–3 are merged, so `ansible_host` is now `100.99.226.39`). If this fails, STOP: the mesh path must work before closing the WAN.

- [ ] **Step 2: Dry-run the base apply (firewall + sshd)**

Run: `make check PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening`
Expected: shows the nftables ruleset diff (default-deny + wt0 SSH + public 80/443/3478) and the sshd drop-in diff (`ListenAddress 100.99.226.39`); no errors. Review that the public service rules are present (so they won't be dropped).

- [ ] **Step 3: Apply the host firewall + sshd (auto-rollback armed)**

Run: `make deploy PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening`
Expected: the firewall concern arms the rollback timer, applies, resets the connection, and `wait_for_connection` succeeds over wt0; sshd reloads with the mesh ListenAddress. If connectivity is lost, the timer auto-reverts the ruleset within `base__firewall_rollback_timeout` (45 s).
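
The safety net in this step follows an arm, apply, confirm pattern. The sketch below models it conceptually; it is not the role's implementation (which arms a timer on the host itself), and every name in it (`apply_with_rollback` and its callables) is hypothetical. `clock` and `sleep` are injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable


def apply_with_rollback(apply: Callable[[], None],
                        revert: Callable[[], None],
                        is_reachable: Callable[[], bool],
                        timeout_s: float = 45.0,
                        poll_s: float = 1.0,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Apply a change; keep it only if connectivity is confirmed before the deadline."""
    deadline = clock() + timeout_s  # "arm the timer" before touching anything
    apply()
    while clock() < deadline:
        if is_reachable():
            return True  # confirmed over the new path: the change stays
        sleep(poll_s)
    revert()  # deadline passed without confirmation
    return False
```

The important property is ordering: the revert is scheduled before the risky change, so losing the control connection cannot also lose the ability to undo it.
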

- [ ] **Step 4: Verify public services (WAN `:22` still open at the cloud edge)**

```bash
curl -sSf -o /dev/null -w '%{http_code}\n' https://test.askari.wingu.me     # expect 200
curl -sSf -o /dev/null -w '%{http_code}\n' https://netbird.askari.wingu.me  # expect 200
```
Expected: both `200` (valid certs); the host firewall did not drop the public services. (WAN `:22` is now dropped by the host nftables, but the Hetzner FW still allows it until Step 5; that's fine.)

- [ ] **Step 5: Retire the Hetzner WAN `:22` — plan, review, apply**

Run: `make tf-plan TF_ENV=offsite`
Expected: the plan shows the SSH (`port = "22"`) rule being removed from the Hetzner firewall, typically as an in-place update of the firewall resource, and nothing else of substance. Review it.

Then: `make tf-apply TF_ENV=offsite`
Expected: apply succeeds; the WAN `:22` rule is gone.

- [ ] **Step 6: Verify the end-state (out-of-band)**

From an OFF-MESH host (e.g. the operator's laptop with NetBird disconnected, or a quick check from askari's perspective):

```bash
nc -vz -w5 77.42.120.136 22    # expect: refused / timeout (WAN SSH closed)
nc -vz -w5 77.42.120.136 443   # expect: open (public service intact)
```
And from ubongo over the mesh: `.venv/bin/ansible askari -i inventories/production/ -m ping` → `SUCCESS`.
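
The `nc` checks can also be scripted. A minimal sketch using only the standard library; `port_open` is a hypothetical helper, and like `nc -z` it only tests whether a TCP handshake completes, so it cannot distinguish a refused connection from a silently dropped one.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False  # refused, timed out, or unreachable
```

For the end-state above, run off-mesh, `port_open("77.42.120.136", 22)` should be false and `port_open("77.42.120.136", 443)` should be true.
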

- [ ] **Step 7: Reboot resilience check (optional but recommended)**

Reboot askari from the Hetzner console; after it comes back, confirm `ansible askari -m ping` succeeds over wt0 without intervention (proves `ip_nonlocal_bind` beat the post-boot bind race).

- [ ] **Step 8: Update STATUS + ROADMAP**

- In `STATUS.md`, update the askari row: SSH is now wt0-only; the host nftables default-deny is applied; the Hetzner WAN `:22` is retired. Move "host firewall + moving askari's SSH onto wt0" out of *Pending*.
- In `docs/ROADMAP.md`, mark mesh-hardening sub-project 1 (askari SSH→wt0) done; next is sub-project 2 (ubongo default-deny).

```bash
git add STATUS.md docs/ROADMAP.md
git commit -m "docs: askari SSH moved onto wt0 (mesh-hardening 1/3 done)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```

- [ ] **Step 9: Push**

Run: `git push origin main`

---

## Self-review (against the spec)

- **§ three layers** → Task 1 (sshd ListenAddress), Task 2 (nftables catalog; SSH-on-wt0 pre-existing via `base__firewall_mgmt_interface`), Task 4 (Hetzner WAN :22). ✓
- **§ boot-race fix** (`ip_nonlocal_bind` + fail-closed assert + live wt0 fact) → Task 1 Steps 4–6. ✓
- **§ new code/vars** (`base__ssh_listen_mesh_only`, `base__ssh_listen_addr`, host_vars/askari.yml, offsite flag, catalog, TF) → Tasks 1–4. ✓
- **§ staged cutover** → Task 5 Steps 1–6, with the firewall auto-rollback as the gate. ✓
- **§ testing** → Molecule render asserts (ListenAddress, sysctl, public-zone rule) + filter unit test + live out-of-band checks. The fail-closed assert is exercised by code; to spot-check it, temporarily blank `base__ssh_listen_addr` in the converge fixture and confirm `make test ROLE=base` fails on the assert, then revert (manual, not automated; a deliberate-failure Molecule scenario is non-idiomatic). ✓
- **§ risks/rollback** → auto-rollback timer (Task 5 Step 3), `ip_nonlocal_bind` (Task 1), Hetzner console break-glass, re-addable TF rule. ✓
- **IPv6 note** → recorded in the catalog comment (Task 2 Step 6); acceptable because askari has only an A record.