Commit graph

15 commits

Author SHA1 Message Date
847d9885e2 revert: back out mesh-hardening 1/3 on askari after it broke the Docker host
Incident 2026-06-17: applying base's nftables default-deny (forward policy drop)
to askari — a Docker host — broke container forwarding/NAT on reboot, and the
wt0-only sshd ListenAddress left no break-glass (ip_nonlocal_bind did NOT beat
the boot race). Recovery: disable nftables + restart docker (restore the wiped
NAT masquerade) + force-recreate the coordinator (it FATAL-looped unable to
download its GeoLite2 DB with no egress) -> mesh re-formed.

Back out the enablement so a future deploy can't re-break askari:
- offsite_hosts: base__ssh_listen_mesh_only=false, base__firewall_apply=false
- remove host_vars/askari.yml (manage over the WAN again, not wt0)
- tf/offsite: re-open WAN :22 to ubongo only (break-glass; already applied)

askari now: sshd on all interfaces (Ansible-managed), nftables disabled, WAN :22
open -> stable + reboot-survivable. The base feature code (sshd ListenAddress
option, firewall public zone) stays; it's just not enabled on Docker hosts.
Mesh-hardening 1/3 to be re-spec'd before any retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:16:17 +02:00
b0511179cb feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)
The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.
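
The conditional rule can be sketched in Python, which the repository's helper scripts already use; the real change lives in the Terraform module. The function name `ssh_rules` and the rule dictionary keys are hypothetical, chosen only for illustration.

```python
def ssh_rules(ssh_admin_cidrs=()):
    """Return the inbound SSH rules for a list of admin CIDRs.

    Hypothetical helper: an empty ssh_admin_cidrs (the default) yields
    no WAN :22 rule at all, mirroring the behaviour described above.
    """
    if not ssh_admin_cidrs:
        return []  # empty list -> the SSH rule is omitted entirely
    return [
        {
            "direction": "in",
            "protocol": "tcp",
            "port": "22",
            "source_ips": list(ssh_admin_cidrs),
        }
    ]
```

With an empty list the function returns no rules, so nothing on the WAN side accepts port 22; any non-empty list produces exactly one rule scoped to those CIDRs.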

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:51:24 +02:00
9e0c264658 docs: reconcile lower-severity review findings (O9-O24)
- ADR-007: document ubongo on the legacy V4 net at 10.20.10.151 (transitional,
  outside the planned srv /24 until the LAN is re-cut) (O10); single authoritative
  boma.baobab.band -> boma.wingu.me transition note already added earlier
- terraform tfvars.example + variables.tf (both envs): pve01 -> pve0 and
  <host>.boma.baobab.band per ADR-007 naming (O11)
- ADR-012/013/015/016/017/018: convert "See also:" prose to `## Related` sections
  placed after Consequences, matching ADR-014/019-023 (O13)
- docs/README + inventories/README: list the missing subdirs / offsite_hosts +
  offsite.yml merge behaviour (O14, O29 note)
- ADR-009: drop the retired `nyumbani` example; use vaultwarden.wingu.me split-horizon (O19)
- ROADMAP M2: askari shipped as cx23/x86 (CAX11/ARM out of stock) (O20)
- ADR-020: 80/443/3478 opened in M4a (past tense); coordinator role is M4b (O21)
- netbird -> netbird_coordinator across ROADMAP M4b, the M4b plan, ADR-024 (O23)
- ADR-024: align the M1 DNS-01 wildcard scope wording with ROADMAP (O24)
- capacity-scan.py: read the inventory directory so offsite.yml (askari) is seen (O28)
- tf_to_inventory.py: generated header now warns it overwrites the manual control node (O9)
- tests/tags.yml: proxy concern comment Traefik -> Caddy (missed in the O3 sweep)

O9's existing stub hosts.yml header stays as-is (generator-owned, hook-protected);
the fix lives in the generator for the next regeneration. make lint + pytest (57) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:31:40 +02:00
64f1e821d8 docs(review): 2026-06-14 repo audit — M4a doc drift + Traefik→Caddy lag
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:37:54 +02:00
1ee343dfca feat(tf): open Caddy 80/443 + NetBird 3478 on askari (public_web)
hetzner_vm gains a public_web bool (default false); offsite sets it true. Firewall
adds 80/443 tcp + 3478 udp from anywhere (SSH-from-ubongo preserved). For M4 Caddy
+ NetBird.
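
A minimal Python sketch of the rule set that `public_web` toggles, assuming "from anywhere" means the IPv4 and IPv6 catch-all CIDRs. The function name `web_rules` and the dictionary layout are hypothetical; the actual rules are defined in the hetzner_vm Terraform module.

```python
def web_rules(public_web=False):
    """Return the extra inbound rules enabled by public_web.

    Hypothetical helper: the default (False) adds nothing, so
    environments that do not opt in keep their existing firewall.
    """
    if not public_web:
        return []
    anywhere = ["0.0.0.0/0", "::/0"]  # assumption: both address families
    return [
        {"direction": "in", "protocol": "tcp", "port": "80", "source_ips": anywhere},
        {"direction": "in", "protocol": "tcp", "port": "443", "source_ips": anywhere},
        {"direction": "in", "protocol": "udp", "port": "3478", "source_ips": anywhere},
    ]
```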

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:38:51 +02:00
917005174a feat(tf): provision askari — cx23/hel1 (CAX11 ARM was out of stock)
ARM (cax11) unavailable in all EU locations 2026-06-14; fell back to cx23 (x86,
same 2/4/40 spec, cheaper in hel1). Server created (id 141153963); offsite.yml
generated into the directory inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:23:01 +02:00
839fc632a1 fix(tf): declare required_providers in modules; pin offsite lock
terraform init failed: child modules using non-hashicorp providers must declare
required_providers, else TF infers hashicorp/{hcloud,proxmox} (nonexistent). Add
versions.tf to hetzner_vm AND proxmox_vm (same latent bug, never caught because
Proxmox TF was never init'd). Track the offsite lock (hcloud 1.65.0). Caught by
running 'make tf-init/plan TF_ENV=offsite' on ubongo — static review missed it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:14:05 +02:00
09b0aad342 fix(tf): cloud-init heredoc column-0 + firewall uses ubongo's WAN IP
Review catches: (1) <<-EOT strips by the closing marker's indent, so the
cloud-config body must match it (2 spaces) for '#cloud-config' to land at column
0; (2) the Hetzner Cloud Firewall filters public traffic, so ssh_admin_cidrs is
ubongo's WAN/egress IP, not its LAN address — a private CIDR would lock SSH out of
the live VPS.
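
The second catch lends itself to a pre-apply check. The sketch below uses Python's standard `ipaddress` module to flag private entries in `ssh_admin_cidrs`; the function name `private_cidrs` is hypothetical, and 8.8.8.8 is only a stand-in for a public address, not ubongo's real WAN IP.

```python
import ipaddress


def private_cidrs(cidrs):
    """Return the entries that fall in private address space.

    A public cloud firewall only ever sees public source addresses, so
    any CIDR returned here could never match real SSH traffic.
    """
    flagged = []
    for cidr in cidrs:
        # strict=False tolerates host bits, e.g. "10.20.10.151/24"
        network = ipaddress.ip_network(cidr, strict=False)
        if network.is_private:  # True for RFC 1918 ranges such as 10.0.0.0/8
            flagged.append(cidr)
    return flagged
```

For example, `private_cidrs(["10.20.10.151/32", "8.8.8.8/32"])` flags only the first entry, the LAN-style address.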

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:19:45 +02:00
127ade59a3 feat(tf): offsite environment — askari (CAX11/hel1/debian-13)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:03:31 +02:00
bbc287900a feat(tf): hetzner_vm module (server + firewall + ssh key + cloud-init)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:03:01 +02:00
9584cc2c76 feat(tags): Proxmox VM metadata convention (managed-by=terraform)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:39:19 +02:00
905bc92b15 Use local Terraform state; drop unworkable Forgejo HTTP backend (R10b)
Forgejo's /raw/ API is read-only so it cannot serve as a Terraform HTTP state
backend. Switch both envs to local state on the control node (ADR-006); remove
the dead TF_HTTP_* credential hints.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 21:34:05 +02:00
1642d1786a Wire Terraform vlan_tag and fix scaffold placeholder (R9,R11)
R9: pass vlan_tag (default 20 = srv VLAN, ADR-007) from both envs to the
proxmox_vm module so VMs are tagged, not on untagged vmbr0. R11: make new-role
now sed-substitutes ROLE_NAME_PLACEHOLDER so scaffolded molecule converge works
out of the box.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 19:34:02 +02:00
810e6d557b Correct Forgejo host to forgejo.nyumbani.baobab.band
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 18:16:38 +02:00
9a8181ef18 Add Terraform VM-provisioning skeleton
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:10:01 +02:00