Incident 2026-06-17: applying base's nftables default-deny (forward policy drop) to askari, a Docker host, broke container forwarding/NAT on reboot, and the wt0-only sshd ListenAddress left no break-glass (ip_nonlocal_bind did NOT beat the boot race).

Recovery: disable nftables + restart docker (restore the wiped NAT masquerade) + force-recreate the coordinator (it FATAL-looped unable to download its GeoLite2 DB with no egress) -> mesh re-formed.

Back out the enablement so a future deploy can't re-break askari:
- offsite_hosts: base__ssh_listen_mesh_only=false, base__firewall_apply=false
- remove host_vars/askari.yml (manage over the WAN again, not wt0)
- tf/offsite: re-open WAN :22 to ubongo only (break-glass; already applied)

askari now: sshd on all interfaces (Ansible-managed), nftables disabled, WAN :22 open -> stable + reboot-survivable. The base feature code (sshd ListenAddress option, firewall public zone) stays; it's just not enabled on Docker hosts. Mesh-hardening 1/3 to be re-spec'd before any retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
21 lines
929 B
HCL
# offsite/main.tf: off-site Hetzner hosts. Terraform owns VM existence (ADR-006,
# generalized to Hetzner). ALWAYS `make tf-plan TF_ENV=offsite` and review before
# `make tf-apply TF_ENV=offsite`.

module "askari" {
  source = "../../modules/hetzner_vm"

  name               = "askari"
  server_type        = "cx23" # x86, 2 vCPU / 4 GB / 40 GB (CAX11/ARM was out of stock in
                              # every EU location 2026-06-14; cx23 is same-spec + cheaper)
  location           = "hel1" # Helsinki
  image              = "debian-13"
  ansible_ssh_pubkey = var.ansible_ssh_pubkey
  ssh_admin_cidrs    = ["91.226.145.80/32"] # TEMP (incident recovery 2026-06-17): re-open WAN :22 to ubongo only; re-close once the firewall/Docker + boot-race issues are fixed
  public_web         = true # Caddy 80/443 + NetBird 3478 (M4)
  labels = {
    env        = "offsite"
    group      = "offsite_hosts"
    managed-by = "terraform"
  }
}
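
For context on what the ssh_admin_cidrs change does: the input is consumed inside modules/hetzner_vm, which is not part of this commit. The following is only a minimal sketch of how such a module might turn the list into a WAN :22 rule, assuming it uses the hcloud provider's hcloud_firewall resource. The resource name `this`, the firewall naming scheme, and the variable definitions are hypothetical, and the real module may differ.

```hcl
# Hypothetical sketch; the real modules/hetzner_vm is not shown in this commit.
terraform {
  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
  }
}

variable "name" {
  type = string
}

variable "ssh_admin_cidrs" {
  type        = list(string)
  default     = []
  description = "CIDRs allowed to reach WAN :22; an empty list emits no SSH rule."
}

resource "hcloud_firewall" "this" {
  name = "${var.name}-fw" # assumed naming scheme

  # Emit the inbound :22 rule only when at least one admin CIDR is given,
  # so an empty list closes WAN SSH entirely.
  dynamic "rule" {
    for_each = length(var.ssh_admin_cidrs) > 0 ? [1] : []
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = "22"
      source_ips = var.ssh_admin_cidrs
    }
  }
}
```

Under this assumed shape, the TEMP value above opens :22 to a single /32, and setting ssh_admin_cidrs back to [] would drop the rule on the next reviewed `make tf-apply TF_ENV=offsite`.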