ADR-002 baseline (key-only, no root, fail2ban 5/1h) as two base task files under the existing 'hardening' concern tag; applied to askari by tag (NOT the host firewall — that's mesh-gated to avoid lockout; Hetzner Cloud Firewall is the perimeter until M5). NetBird agent deferred to M4. Adds a LIMIT=/TAGS= passthrough to make check/deploy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Design — base SSH hardening + fail2ban (M3)
- Date: 2026-06-14
- Status: Draft → straight to plan (design is ADR-derived; per the standing skip-the-spec-review-gate agreement)
- Roadmap milestone: M3 (`docs/ROADMAP.md`), the "remote-access-sufficient" `base` subset
- Implements: ADR-002 (SSH key-only, `PermitRootLogin no`, fail2ban 5-fails/1-h)
- Amends: none (uses decided ADRs); touches ADR-021 only by reference
Problem
askari is a public host now (M2) but only cloud-init-hardened. The base role so
far implements only the firewall concern. M3 adds the SSH-hardening + fail2ban concerns
to base and applies them to askari — the minimum to make a public host
remote-access-safe — without locking anything out.
Decisions (as settled)
- Scope: two new `base` task files, `ssh.yml` (sshd hardening) and `fail2ban.yml`, both under the existing `hardening` concern tag (already in `tests/tags.yml` / ADR-019: "sshd config, fail2ban, auditd, sysctl"), so no vocab change. NetBird agent enrollment → M4 (needs the coordinator; ADR-016 bootstrap order). auditd / full CIS L1+L2 → Phase 2 (TODO 15). ubongo's `base` apply + the host firewall on askari → M5 (when the mesh exists).
- Apply only `ssh` + `fail2ban` to askari (by tag), NOT the host firewall. Applying `base`'s default-deny nftables to askari pre-mesh would block the WAN SSH from ubongo (the firewall allows SSH only on `wt0` + from the LAN `base__firewall_control_addr`, neither of which matches askari's WAN path) → lockout. The Hetzner Cloud Firewall (M2) is askari's perimeter until M5; the host firewall lands with the mesh.
- `base__ssh_authorised_keys` is populated with ubongo's control key (`claude@ubongo`) so the `ssh` concern's `authorized_keys` management doesn't remove the cloud-init key and lock us out. (Public key, set in `group_vars/all`.)
- `make check` / `make deploy` gain `LIMIT=` + `TAGS=` passthrough so a concern subset can be applied to one host (e.g. `make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`).
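To illustrate the apply-by-tag decision, here is a minimal sketch of how `roles/base/tasks/main.yml` could wire the concerns. It is not the role's actual file: the `firewall.yml` file name and the `firewall` tag are assumptions for illustration. It relies on the fact that a `tags:` keyword on `include_tasks` tags only the include statement itself; `apply:` is what passes the tag on to the included tasks.

```yaml
# roles/base/tasks/main.yml (sketch; firewall.yml and the `firewall` tag are assumed names)
- name: Firewall concern
  ansible.builtin.include_tasks:
    file: firewall.yml
    apply:
      tags: [firewall]
  tags: [firewall]

- name: SSH hardening concern
  ansible.builtin.include_tasks:
    file: ssh.yml
    apply:
      tags: [hardening]   # inherited by every task inside ssh.yml
  tags: [hardening]       # lets --tags hardening select the include itself

- name: fail2ban concern
  ansible.builtin.include_tasks:
    file: fail2ban.yml
    apply:
      tags: [hardening]
  tags: [hardening]
```

With this layout, `--limit askari --tags hardening` would run the two hardening includes and skip the firewall include, which is the behaviour the lockout argument depends on.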
ADR-002 controls implemented
- sshd: `PasswordAuthentication no`, `PermitRootLogin no`, `PubkeyAuthentication yes`, `ChallengeResponseAuthentication no`; the `ansible` user's `authorized_keys` from `base__ssh_authorised_keys`. Validate config (`sshd -t`) before reload; reload via handler.
- fail2ban: installed + enabled; `sshd` jail, maxretry 5, bantime 1 h (knobs in defaults).
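The controls above could be rendered roughly as follows. This is a sketch under assumptions, not the role's implementation: inline `content:` stands in for whatever template the role ships, the file modes are a guess, the jail path assumes the standard `/etc/fail2ban/jail.d/` directory, and the ban time is written in seconds rather than through the `base__fail2ban_*` knobs.

```yaml
# Sketch only: literal values stand in for the role's templates and defaults.
- name: Render the sshd hardening drop-in (ADR-002)
  ansible.builtin.copy:
    dest: /etc/ssh/sshd_config.d/10-boma.conf
    owner: root
    group: root
    mode: "0600"
    content: |
      PasswordAuthentication no
      PermitRootLogin no
      PubkeyAuthentication yes
      ChallengeResponseAuthentication no
    # %s is the temporary copy; this checks the fragment on its own,
    # not the merged configuration (the handler's `sshd -t` covers that).
    validate: /usr/sbin/sshd -t -f %s
  notify: reload sshd

- name: Install fail2ban
  ansible.builtin.apt:
    name: fail2ban
    state: present

- name: Render the sshd jail (5 failures, 1 h ban)
  ansible.builtin.copy:
    dest: /etc/fail2ban/jail.d/sshd.local
    owner: root
    group: root
    mode: "0644"
    content: |
      [sshd]
      enabled = true
      maxretry = 5
      bantime = 3600

- name: Enable and start fail2ban
  ansible.builtin.service:
    name: fail2ban
    state: started
    enabled: true
```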
Implementation
- `roles/base/tasks/ssh.yml` (tag `hardening`): render `/etc/ssh/sshd_config.d/10-boma.conf` (drop-in, validated), manage the `ansible` user's `authorized_keys` only when `base__ssh_authorised_keys` is non-empty (so Molecule, with empty keys, skips it and doesn't need a test user); notify "reload sshd".
- `roles/base/tasks/fail2ban.yml` (tag `hardening`): apt-install `fail2ban`, render a `jail.d/sshd.local`, enable+start the service.
- `roles/base/tasks/main.yml`: `include_tasks` both (each `tags: [hardening]`) after firewall.
- `roles/base/defaults/main.yml`: `base__ssh_*` + `base__fail2ban_*` knobs.
- `roles/base/handlers/main.yml`: `reload sshd` (listen-topic), config validated first.
- `inventories/production/group_vars/all/vars.yml`: populate `base__ssh_authorised_keys` with `claude@ubongo`'s control key.
- `Makefile`: `$(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS))` on check/deploy.
- Molecule: extend the `base` scenario to converge `ssh` + `fail2ban` (with `base__firewall_apply: false`) and verify (`sshd -t` clean, fail2ban jail present).
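Two of the items above carry the lock-out safety logic: the conditional `authorized_keys` management and the validate-then-reload handler. A possible shape is sketched below, assuming `base__ssh_authorised_keys` is a list of public-key strings, that keys are managed exclusively (which is why the control key must be in the list), and that the service is named `ssh` as on Debian. The handler names are illustrative; only the `reload sshd` topic comes from this design.

```yaml
# roles/base/tasks/ssh.yml (excerpt, sketch)
- name: Manage the ansible user's authorized_keys
  ansible.posix.authorized_key:
    user: ansible
    # With exclusive, all keys go in one newline-separated string.
    key: "{{ base__ssh_authorised_keys | join('\n') }}"
    exclusive: true
  when: base__ssh_authorised_keys | length > 0   # empty in Molecule, so skipped there

# roles/base/handlers/main.yml (sketch)
# Handlers listening on one topic run in the order they are defined,
# so a failed validation stops the host before the reload is reached.
- name: Validate the merged sshd configuration
  ansible.builtin.command: /usr/sbin/sshd -t
  changed_when: false
  listen: reload sshd

- name: Reload the ssh service
  ansible.builtin.service:
    name: ssh
    state: reloaded
  listen: reload sshd
```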
Testing
- Molecule (Debian 13 container): converge ssh + fail2ban; verify sshd drop-in valid, `PasswordAuthentication no` present, fail2ban sshd jail configured. (firewall stays `apply:false`.)
- Live on askari (gated): `make check` (review) → `make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening` → verify SSH still works (`ansible offsite_hosts -m ping` after) → confirm `fail2ban-client status sshd`. Lock-out guard: the `ansible` user keeps key auth throughout (we only disable password/root, which we don't use).
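The Molecule checks listed above might look like the following verify playbook. It is a sketch: the scenario layout and task names are assumptions, and it checks only what this design names (a clean `sshd -t`, the `PasswordAuthentication no` line, and the jail file).

```yaml
# verify.yml for the base scenario (sketch; exact scenario path is an assumption)
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: sshd accepts the merged configuration
      ansible.builtin.command: /usr/sbin/sshd -t
      changed_when: false

    - name: Read the hardening drop-in
      ansible.builtin.slurp:
        src: /etc/ssh/sshd_config.d/10-boma.conf
      register: dropin

    - name: Drop-in disables password authentication
      ansible.builtin.assert:
        that:
          - "'PasswordAuthentication no' in (dropin.content | b64decode)"

    - name: Look for the fail2ban sshd jail file
      ansible.builtin.stat:
        path: /etc/fail2ban/jail.d/sshd.local
      register: jail

    - name: Jail file is present
      ansible.builtin.assert:
        that:
          - jail.stat.exists
```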
Scope boundaries — what M3 is NOT
- Not the NetBird agent (M4), not the host firewall on askari or ubongo hardening (M5), not auditd / CIS L1+L2 (Phase 2), not `unattended-upgrades` (deferred: part of the ADR-002 baseline, but lands in Phase 2 with the rest of the OS-update story / ADR-011).
Open items (resolve in the plan)
- Whether to also apply to ubongo now (it's already manually key-only) or wait for M5. Default: wait (avoid any risk to the box I run on; bring it under `base` with the mesh).