boma/docs/superpowers/specs/2026-06-14-base-ssh-fail2ban-m3-design.md
sjat cff368ece2 docs(spec,plan): M3 — base ssh hardening + fail2ban
ADR-002 baseline (key-only, no root, fail2ban 5/1h) as two base task files under
the existing 'hardening' concern tag; applied to askari by tag (NOT the host
firewall — that's mesh-gated to avoid lockout; Hetzner Cloud Firewall is the
perimeter until M5). NetBird agent deferred to M4. Adds a LIMIT=/TAGS= passthrough
to make check/deploy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:38:38 +02:00


Design — base SSH hardening + fail2ban (M3)

  • Date: 2026-06-14
  • Status: Draft → straight to plan (design is ADR-derived; per the standing skip-the-spec-review-gate agreement)
  • Roadmap milestone: M3 (docs/ROADMAP.md) — the "remote-access-sufficient" base subset
  • Implements: ADR-002 (SSH key-only, PermitRootLogin no, fail2ban 5-fails/1-h)
  • Amends: none (uses decided ADRs); touches ADR-021 only by reference

Problem

askari is a public host now (M2) but only cloud-init-hardened. The base role so far implements only the firewall concern. M3 adds the SSH-hardening + fail2ban concerns to base and applies them to askari — the minimum to make a public host remote-access-safe — without locking anything out.

Decisions (as settled)

  1. Scope: two new base task files — ssh.yml (sshd hardening) and fail2ban.yml — both under the existing hardening concern tag (already in tests/tags.yml / ADR-019: "sshd config, fail2ban, auditd, sysctl") — no vocab change. NetBird agent enrollment → M4 (needs the coordinator; ADR-016 bootstrap order). auditd / full CIS L1+L2 → Phase 2 (TODO 15). ubongo's base apply + the host firewall on askari → M5 (when the mesh exists).
  2. Apply only ssh + fail2ban to askari (by tag), NOT the host firewall. Applying base's default-deny nftables to askari pre-mesh would block the WAN SSH from ubongo (the firewall allows SSH only on wt0 + from the LAN base__firewall_control_addr, neither of which matches askari's WAN path) → lockout. The Hetzner Cloud Firewall (M2) is askari's perimeter until M5; the host firewall lands with the mesh.
  3. base__ssh_authorised_keys is populated with ubongo's control key (claude@ubongo) so that the ssh concern's authorized_keys management doesn't remove the cloud-init key and lock us out. (It is a public key, set in group_vars/all.)
  4. make check/deploy gain LIMIT= + TAGS= passthrough so a concern subset can be applied to one host (e.g. make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening; ssh and fail2ban both sit under the hardening tag per decision 1).
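
Decision 3 could look roughly like the fragment below. This is a sketch only: the key type and key material are placeholders (the real public key is not reproduced in this spec), and the list-of-strings shape of base__ssh_authorised_keys is an assumption.

```yaml
# inventories/production/group_vars/all/vars.yml (sketch)
# Assumption: base__ssh_authorised_keys is a list of public-key strings.
# The key below is a PLACEHOLDER, not claude@ubongo's real key.
base__ssh_authorised_keys:
  - "ssh-ed25519 AAAA_REPLACE_WITH_REAL_PUBLIC_KEY claude@ubongo"
```

Keeping the key in group_vars/all means every host that later receives the ssh concern trusts the same control key, which is what keeps the cloud-init-provisioned access path intact on askari.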

ADR-002 controls implemented

  • sshd: PasswordAuthentication no, PermitRootLogin no, PubkeyAuthentication yes, ChallengeResponseAuthentication no (on current OpenSSH this is a deprecated alias for KbdInteractiveAuthentication); the ansible user's authorized_keys managed from base__ssh_authorised_keys. Validate config (sshd -t) before reload; reload via handler.
  • fail2ban: installed + enabled; sshd jail, maxretry 5, bantime 1 h (knobs in defaults).
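
As a rough illustration of the two controls above, the tasks might look like the following. This is a minimal sketch, not the final role: it inlines file content with ansible.builtin.copy instead of a template, the knob names base__fail2ban_maxretry and base__fail2ban_bantime are hypothetical, the user name "ansible" and exclusive key management are assumptions, and the "reload sshd" handler is assumed to exist in the role's handlers.

```yaml
# Flat task list for brevity; the design splits these into
# roles/base/tasks/ssh.yml and roles/base/tasks/fail2ban.yml.

- name: Render the sshd hardening drop-in (validated before install)
  ansible.builtin.copy:
    dest: /etc/ssh/sshd_config.d/10-boma.conf
    owner: root
    group: root
    mode: "0600"
    content: |
      PasswordAuthentication no
      PermitRootLogin no
      PubkeyAuthentication yes
      ChallengeResponseAuthentication no
    # Assumption: the fragment is syntax-checked in isolation via -f.
    validate: /usr/sbin/sshd -t -f %s
  notify: reload sshd

- name: Manage authorized_keys only when keys are provided
  ansible.posix.authorized_key:
    user: ansible          # assumption: the control user is named "ansible"
    key: "{{ base__ssh_authorised_keys | join('\n') }}"
    exclusive: true        # assumption: unlisted keys are removed
  when: base__ssh_authorised_keys | default([]) | length > 0

- name: Install fail2ban
  ansible.builtin.apt:
    name: fail2ban
    state: present
    update_cache: true

- name: Render the sshd jail (5 failures, 1 h ban by default)
  ansible.builtin.copy:
    dest: /etc/fail2ban/jail.d/sshd.local
    owner: root
    group: root
    mode: "0644"
    content: |
      [sshd]
      enabled = true
      maxretry = {{ base__fail2ban_maxretry | default(5) }}
      bantime = {{ base__fail2ban_bantime | default('1h') }}

- name: Enable and start fail2ban
  ansible.builtin.service:
    name: fail2ban
    enabled: true
    state: started
```

With exclusive key management, an empty or incomplete key list would strip the cloud-init key, which is why the task is skipped when base__ssh_authorised_keys is empty. A restart of fail2ban on jail changes is not shown here.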

Implementation

  • roles/base/tasks/ssh.yml (tag hardening) — render /etc/ssh/sshd_config.d/10-boma.conf (drop-in, validated), manage the ansible user's authorized_keys only when base__ssh_authorised_keys is non-empty (so Molecule, with empty keys, skips it and doesn't need a test user); notify "reload sshd".
  • roles/base/tasks/fail2ban.yml (tag hardening) — apt-install fail2ban, render a jail.d/sshd.local, enable+start the service.
  • roles/base/tasks/main.yml: include_tasks both (each tags: [hardening]) after firewall.
  • roles/base/defaults/main.yml: base__ssh_* + base__fail2ban_* knobs.
  • roles/base/handlers/main.yml: reload sshd (listen-topic), config validated first.
  • inventories/production/group_vars/all/vars.yml — populate base__ssh_authorised_keys with claude@ubongo's control key.
  • Makefile: $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) on check/deploy.
  • Molecule: extend the base scenario to converge ssh + fail2ban (with base__firewall_apply: false) and verify (sshd -t clean, fail2ban jail present).
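
One detail worth sketching is the main.yml wiring. With dynamic includes, a tag placed on include_tasks applies to the include itself and is not inherited by the included tasks, so a run with --tags hardening needs the tag applied inside as well. The fragment below is one possible shape, assuming the role does not already tag the tasks within ssh.yml and fail2ban.yml; the existing firewall include is not shown.

```yaml
# roles/base/tasks/main.yml (sketch; placed after the firewall include)

- name: Include SSH hardening tasks
  ansible.builtin.include_tasks:
    file: ssh.yml
    apply:
      tags: [hardening]    # propagate the tag to the included tasks
  tags: [hardening]        # lets --tags hardening reach the include itself

- name: Include fail2ban tasks
  ansible.builtin.include_tasks:
    file: fail2ban.yml
    apply:
      tags: [hardening]
  tags: [hardening]
```

An alternative with the same effect would be import_tasks, where tags are inherited statically; which one fits depends on how the existing firewall concern is wired.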

Testing

  • Molecule (Debian 13 container): converge ssh + fail2ban; verify sshd drop-in valid, PasswordAuthentication no present, fail2ban sshd jail configured. (firewall stays apply:false.)
  • Live on askari (gated): make check (review) → make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening → verify SSH still works (ansible offsite_hosts -m ping afterwards) → confirm fail2ban-client status sshd. Lock-out guard: the ansible user keeps key auth throughout (we only disable password/root, which we don't use).
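
The Molecule checks could be expressed along these lines. This is a sketch of a verify playbook under stated assumptions: the file paths come from the Implementation section, the task layout is illustrative, and sshd -t in a container assumes host keys and the privilege-separation directory are present.

```yaml
# molecule verify playbook (sketch)
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Check that the sshd configuration parses cleanly
      ansible.builtin.command: /usr/sbin/sshd -t
      changed_when: false

    - name: Read the sshd drop-in
      ansible.builtin.slurp:
        src: /etc/ssh/sshd_config.d/10-boma.conf
      register: sshd_dropin

    - name: Assert password authentication is disabled in the drop-in
      ansible.builtin.assert:
        that:
          - "'PasswordAuthentication no' in (sshd_dropin.content | b64decode)"

    - name: Stat the fail2ban sshd jail file
      ansible.builtin.stat:
        path: /etc/fail2ban/jail.d/sshd.local
      register: jail_file

    - name: Assert the fail2ban sshd jail is configured
      ansible.builtin.assert:
        that:
          - jail_file.stat.exists
```

These checks only cover configuration on disk; whether the jail is actually active is confirmed on the live host with fail2ban-client status sshd.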

Scope boundaries — what M3 is NOT

  • Not the NetBird agent (M4), not the host firewall on askari or ubongo hardening (M5), not auditd / CIS L1+L2 (Phase 2), not unattended-upgrades (deferred — ADR-002 baseline, but Phase 2 with the rest of the OS-update story / ADR-011).

Open items (resolve in the plan)

  • Whether to also apply to ubongo now (it's already manually key-only) or wait for M5. Default: wait (avoid any risk to the box I run on; bring it under base with the mesh).