ADR-002 baseline (key-only, no root, fail2ban 5/1h) as two base task files under the existing 'hardening' concern tag; applied to askari by tag (NOT the host firewall — that's mesh-gated to avoid lockout; Hetzner Cloud Firewall is the perimeter until M5). NetBird agent deferred to M4. Adds a LIMIT=/TAGS= passthrough to make check/deploy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Design — `base` SSH hardening + fail2ban (M3)

- **Date:** 2026-06-14
- **Status:** Draft → straight to plan (design is ADR-derived; per the standing
  skip-the-spec-review-gate agreement)
- **Roadmap milestone:** M3 (`docs/ROADMAP.md`) — the "remote-access-sufficient" `base` subset
- **Implements:** ADR-002 (SSH key-only, `PermitRootLogin no`, fail2ban 5-fails/1-h)
- **Amends:** none (uses decided ADRs); touches ADR-021 only by reference

---

## Problem

`askari` is a **public** host now (M2) but only cloud-init-hardened. The `base` role so
far implements only the `firewall` concern. M3 adds the SSH-hardening + fail2ban concerns
to `base` and applies them to `askari` — the minimum to make a public host
remote-access-safe — without locking anything out.

## Decisions (as settled)

1. **Scope:** two new `base` task files — **`ssh.yml`** (sshd hardening) and
   **`fail2ban.yml`** — both under the **existing `hardening` concern tag** (already in
   `tests/tags.yml` / ADR-019: "sshd config, fail2ban, auditd, sysctl") — no vocab change.
   NetBird agent enrollment → **M4** (needs the coordinator; ADR-016 bootstrap order).
   auditd / full CIS L1+L2 → **Phase 2** (TODO 15). ubongo's `base` apply + the host
   firewall on askari → **M5** (when the mesh exists).
2. **Apply only `ssh` + `fail2ban` to askari (by tag), NOT the host firewall.** Applying
   `base`'s default-deny nftables to askari pre-mesh would block the WAN SSH from ubongo
   (the firewall allows SSH only on `wt0` + from the *LAN* `base__firewall_control_addr`,
   neither of which matches askari's WAN path) → lockout. The **Hetzner Cloud Firewall**
   (M2) is askari's perimeter until M5; the host firewall lands with the mesh.
3. **`base__ssh_authorised_keys` is populated** with ubongo's control key
   (`claude@ubongo`) so that the `ssh` concern's `authorized_keys` management doesn't remove
   the cloud-init key and lock us out. (It is a public key, set in `group_vars/all`.)
4. **`make check`/`deploy` gain `LIMIT=` + `TAGS=` passthrough** so a concern subset can
   be applied to one host (e.g. `make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`;
   both task files carry the `hardening` tag per decision 1).
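
As an illustration of decision 3, the variable might be wired roughly as follows. This is a sketch under assumptions: the empty role default is inferred from the "only when non-empty" guard described for Molecule, and the key string is a placeholder, not the real key.

```yaml
# roles/base/defaults/main.yml (sketch): empty by default, so key management
# is skipped unless an inventory supplies keys.
base__ssh_authorised_keys: []
---
# inventories/production/group_vars/all/vars.yml (sketch)
base__ssh_authorised_keys:
  # Placeholder only: substitute the real public key for claude@ubongo.
  - "ssh-ed25519 AAAA...placeholder... claude@ubongo"
```

Keeping the list in `group_vars/all` means every production host receives the same control key, while test scenarios that never set the variable leave `authorized_keys` untouched.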

## ADR-002 controls implemented

- **sshd:** `PasswordAuthentication no`, `PermitRootLogin no`, `PubkeyAuthentication yes`,
  `ChallengeResponseAuthentication no`; the `ansible` user's `authorized_keys` from
  `base__ssh_authorised_keys`. Validate config (`sshd -t`) before reload; reload via handler.
- **fail2ban:** installed + enabled; `sshd` jail, **maxretry 5, bantime 1 h** (knobs in
  defaults).
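
The controls above could be expressed as tasks along these lines. This is a minimal sketch, not the actual role: the knob names `base__fail2ban_maxretry` and `base__fail2ban_bantime`, the `restart fail2ban` handler, the file modes, and the use of `exclusive: true` are all assumptions for illustration, and `ansible.posix.authorized_key` requires the `ansible.posix` collection.

```yaml
# roles/base/tasks/ssh.yml (sketch)
- name: Render the sshd hardening drop-in
  ansible.builtin.copy:
    dest: /etc/ssh/sshd_config.d/10-boma.conf
    owner: root
    group: root
    mode: "0644"
    content: |
      # Managed by Ansible (base role, hardening concern).
      PasswordAuthentication no
      PermitRootLogin no
      PubkeyAuthentication yes
      # Deprecated alias of KbdInteractiveAuthentication in newer OpenSSH.
      ChallengeResponseAuthentication no
    # Parses the drop-in on its own before it is installed; a bad file never lands.
    validate: /usr/sbin/sshd -t -f %s
  notify: reload sshd

- name: Manage the ansible user's authorized_keys
  ansible.posix.authorized_key:
    user: ansible
    key: "{{ base__ssh_authorised_keys | join('\n') }}"
    # Assumption: exclusive management, which is why the list must contain
    # the control key before this task runs against a live host.
    exclusive: true
  when: base__ssh_authorised_keys | length > 0
---
# roles/base/tasks/fail2ban.yml (sketch)
- name: Install fail2ban
  ansible.builtin.apt:
    name: fail2ban
    state: present
    update_cache: true
    cache_valid_time: 3600

- name: Render the sshd jail
  ansible.builtin.copy:
    dest: /etc/fail2ban/jail.d/sshd.local
    owner: root
    group: root
    mode: "0644"
    content: |
      [sshd]
      enabled = true
      maxretry = {{ base__fail2ban_maxretry | default(5) }}
      # bantime is in seconds: 3600 s = 1 h.
      bantime = {{ base__fail2ban_bantime | default(3600) }}
  notify: restart fail2ban  # hypothetical handler name

- name: Enable and start fail2ban
  ansible.builtin.service:
    name: fail2ban
    enabled: true
    state: started
```

A restart handler is sketched because the Debian package normally starts the service at install time, so a jail file written afterwards would otherwise not be loaded until the next restart. On a journald-only host the jail may additionally need `backend = systemd`; that detail is not covered by ADR-002 and is left to the plan.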

## Implementation

- `roles/base/tasks/ssh.yml` (tag `hardening`) — render `/etc/ssh/sshd_config.d/10-boma.conf`
  (drop-in, validated), manage the `ansible` user's `authorized_keys` **only when
  `base__ssh_authorised_keys` is non-empty** (so Molecule, with empty keys, skips it and
  doesn't need a test user); notify "reload sshd".
- `roles/base/tasks/fail2ban.yml` (tag `hardening`) — apt-install `fail2ban`, render a
  `jail.d/sshd.local`, enable+start the service.
- `roles/base/tasks/main.yml` — `include_tasks` both (each `tags: [hardening]`) after firewall.
- `roles/base/defaults/main.yml` — `base__ssh_*` + `base__fail2ban_*` knobs.
- `roles/base/handlers/main.yml` — `reload sshd` (listen-topic), config validated first.
- `inventories/production/group_vars/all/vars.yml` — populate `base__ssh_authorised_keys`
  with `claude@ubongo`'s control key.
- `Makefile` — `$(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS))` on check/deploy.
- Molecule: extend the `base` scenario to converge `ssh` + `fail2ban` (with
  `base__firewall_apply: false`) and verify (`sshd -t` clean, fail2ban jail present).
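
A possible shape for the wiring in `main.yml` and the handlers, offered as a sketch rather than the final files. One detail worth checking in the plan: with dynamic `include_tasks`, a `tags:` keyword on the include selects only the include statement itself, so the included tasks also need the tag passed through `apply:` to run under `--tags hardening`. The `restart fail2ban` handler is an assumption; only `reload sshd` is named in this design.

```yaml
# roles/base/tasks/main.yml (excerpt, sketch; placed after the firewall include)
- name: Include the SSH hardening tasks
  ansible.builtin.include_tasks:
    file: ssh.yml
    apply:
      tags: [hardening]   # tag the tasks inside the included file
  tags: [hardening]       # tag the include statement itself

- name: Include the fail2ban tasks
  ansible.builtin.include_tasks:
    file: fail2ban.yml
    apply:
      tags: [hardening]
  tags: [hardening]
---
# roles/base/handlers/main.yml (excerpt, sketch)
# Handlers sharing a listen topic run in the order they are defined here,
# so a failed validation stops the host before the reload is attempted.
- name: Validate the full sshd configuration
  ansible.builtin.command: /usr/sbin/sshd -t
  changed_when: false
  listen: reload sshd

- name: Reload sshd
  ansible.builtin.service:
    name: ssh             # Debian's unit name for the OpenSSH server
    state: reloaded
  listen: reload sshd

- name: Restart fail2ban  # hypothetical, not named in this design
  ansible.builtin.service:
    name: fail2ban
    state: restarted
  listen: restart fail2ban
```

Using `reloaded` rather than `restarted` for sshd keeps existing sessions alive, which fits the no-lockout goal of this milestone.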

## Testing

- **Molecule** (Debian 13 container): converge ssh + fail2ban; verify sshd drop-in valid,
  `PasswordAuthentication no` present, fail2ban sshd jail configured. (firewall stays
  `apply:false`.)
- **Live on askari** (gated): `make check` (review) →
  `make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening` → **verify SSH still works**
  (`ansible offsite_hosts -m ping` after) → confirm `fail2ban-client status sshd`.
  Lock-out guard: the `ansible` user keeps key auth throughout (we only disable
  password/root, which we don't use).
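
The Molecule checks could look roughly like the verify playbook below. It is a sketch under assumptions: the jail path and the `maxretry = 5` line follow this design's stated values, and the container is assumed to have host keys and `/run/sshd` in place, since `sshd -t` and `sshd -T` fail without them.

```yaml
# molecule/<scenario>/verify.yml (sketch; the scenario directory is not specified here)
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Check that sshd accepts the full configuration
      ansible.builtin.command: /usr/sbin/sshd -t
      changed_when: false

    - name: Dump the effective sshd settings
      ansible.builtin.command: /usr/sbin/sshd -T
      register: sshd_effective
      changed_when: false

    - name: Assert password and root login are off
      ansible.builtin.assert:
        that:
          # sshd -T prints keywords in lowercase.
          - "'passwordauthentication no' in sshd_effective.stdout"
          - "'permitrootlogin no' in sshd_effective.stdout"

    - name: Read the fail2ban sshd jail file
      ansible.builtin.slurp:
        src: /etc/fail2ban/jail.d/sshd.local
      register: jail_file

    - name: Assert the jail carries the ADR-002 retry limit
      ansible.builtin.assert:
        that:
          - "'[sshd]' in (jail_file.content | b64decode)"
          - "'maxretry = 5' in (jail_file.content | b64decode)"
```

Checking the effective configuration with `sshd -T` rather than grepping the drop-in also catches the case where another file in `sshd_config.d` overrides the hardening values.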

## Scope boundaries — what M3 is NOT

- Not the NetBird agent (M4), not the host firewall on askari or ubongo hardening (M5),
  not auditd / CIS L1+L2 (Phase 2), not `unattended-upgrades` (deferred — ADR-002 baseline,
  but Phase 2 with the rest of the OS-update story / ADR-011).

## Open items (resolve in the plan)

- Whether to also apply to ubongo now (it's already manually key-only) or wait for M5 —
  default **wait** (avoid any risk to the box I run on; bring it under `base` with the mesh).