Commit graph

243 commits

Author SHA1 Message Date
64f1e821d8 docs(review): 2026-06-14 repo audit — M4a doc drift + Traefik→Caddy lag
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:37:54 +02:00
e3461375f5 docs(plan): M4b — NetBird coordinator service role
Capture NetBird's configure.sh reference for a pinned version → translate into
boma role templates (compose + management.json + dex/openid + turnserver),
external-proxy mode behind the M4a Caddy (netbird.askari.wingu.me). First service
role: full ADR-004 standard files; secrets generated/CHANGEME-stubbed (setup key
for M5). Gated live deploy + verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:20:04 +02:00
1862b7a828 docs(m4a): HTTP-01 for askari; ADR-024 cert-method-follows-exposure; STATUS/roadmap/friction
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:14:38 +02:00
b7e919d6b3 refactor(reverse_proxy): vanilla Caddy + HTTP-01 (drop DNS-01 custom image)
Switch from a custom caddy-dns/gandi image built on-host to the official
caddy:2 image with per-host ACME HTTP-01 certificates. Removes the
Dockerfile, env.j2 (Gandi token), on-host image build/ship/load tasks,
the caddy-image Makefile target, and the wildcard DNS-01 Caddyfile.
Each route now gets its own server block and automatic certificate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:11:20 +02:00
9c169561d7 feat(offsite): *.askari.wingu.me wildcard + offsite.yml (docker_host + reverse_proxy)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:39:44 +02:00
1ee343dfca feat(tf): open Caddy 80/443 + NetBird 3478 on askari (public_web)
hetzner_vm gains a public_web bool (default false); offsite sets it true. Firewall
adds 80/443 tcp + 3478 udp from anywhere (SSH-from-ubongo preserved). For M4 Caddy
+ NetBird.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:38:51 +02:00
50b6445bdd feat(reverse_proxy): Caddy role (Gandi DNS-01, on-host image build, route catalog)
Implements the Caddy reverse proxy role (ADR-024): builds boma/caddy-gandi:latest
on-host (caddy-dns/gandi plugin), renders Caddyfile from route catalog, brings
Compose project up. Adds community.docker to requirements.yml, production group_vars,
and a caddy-image Makefile target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:36:58 +02:00
456c27d12b feat(docker_host): install Docker engine + compose plugin
Implements the docker_host role tasks: prerequisites, /etc/apt/keyrings
directory (ordered before the GPG key write), Docker APT key + repo, and
docker-ce/cli/containerd.io/compose-plugin install. Daemon hardening and
nftables.d integration remain deferred to Phase 2 (cluster + base firewall).
Updates defaults, README, and molecule verify to assert docker --version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:28:51 +02:00
d10f6de84b docs(adr): ADR-024 — Caddy is boma's reverse proxy
Adds ADR-024 pinning Caddy (xcaddy + caddy-dns/gandi) as boma's reverse
proxy, superseding the soft Traefik assumption in the roadmap and ADR-017
prose. Updates CLAUDE.md Further reading table and ROADMAP.md Phase-2 step 5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:28:42 +02:00
dd8c6825ba docs(plan): M4a — Docker + Caddy reverse proxy platform
First of M4's two build phases: docker_host (Docker engine), custom xcaddy Caddy
image (caddy-dns/gandi), reverse_proxy role (Caddyfile from a route catalog,
DNS-01 wildcard cert for *.askari.wingu.me via vault.gandi.pat), ADR-024 (Caddy is
boma's reverse proxy), firewall 80/443 + DNS, proven by serving a test route over
TLS. M4b (NetBird) follows, reading NetBird's current self-host compose then.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:20:53 +02:00
65cf20a993 docs(spec): M4 — NetBird coordinator on askari + Caddy reverse proxy
Caddy becomes boma's standard reverse proxy (amends the soft Traefik assumption;
new ADR) with Gandi DNS-01 certs (custom xcaddy image, reuses vault.gandi.pat) —
the only cert path for mesh/LAN-only services. NetBird self-hosted in
external-proxy mode (embedded Dex), compose rendered from boma templates
(ADR-004/013). Three roles: docker_host (first real content), reverse_proxy (new,
Caddy), netbird (first service role w/ full ADR-004 standard files). Firewall +
DNS amendments; backup execution deferred (fisi). caddy-dns/gandi + NetBird
self-host facts verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:19:21 +02:00
181a02fd3a docs(friction): include_tasks tag-propagation + check-mode gotchas (M3)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:56:23 +02:00
9d787a4f53 docs(base): M3 done — ssh hardening + fail2ban applied to askari; STATUS + roadmap
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:55:22 +02:00
db1e5db138 fix(base): propagate hardening tag to included tasks; check-mode-safe fail2ban
Two bugs caught by the live make check/deploy on askari:
- include_tasks with a tag selects the include but NOT its tasks, so --tags hardening
  ran nothing. Use apply: {tags:} to propagate (also fixed the firewall include).
- fail2ban service start + restart handler fail in a first-run --check (package not
  installed yet); guard both with when: not ansible_check_mode so check is clean.
Applied to askari: SSH hardened, fail2ban active, ping still works (no lockout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:54:23 +02:00
a111a20cc8 test(base): Molecule coverage for ssh hardening + fail2ban
Add explicit base__ssh_authorised_keys: [] default to prevent
undefined-var errors in Molecule. Extend verify.yml with sshd
drop-in validation, PasswordAuthentication check, and fail2ban
jail assertion. Pre-create /run/sshd in ssh.yml so sshd -t
works in containers before the service has ever started.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:47:42 +02:00
deec75de0f feat(base): ssh hardening + fail2ban (hardening concern, ADR-002)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:42:56 +02:00
22021210c4 feat(make): optional LIMIT= and TAGS= passthrough on check/deploy
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:41:59 +02:00
cff368ece2 docs(spec,plan): M3 — base ssh hardening + fail2ban
ADR-002 baseline (key-only, no root, fail2ban 5/1h) as two base task files under
the existing 'hardening' concern tag; applied to askari by tag (NOT the host
firewall — that's mesh-gated to avoid lockout; Hetzner Cloud Firewall is the
perimeter until M5). NetBird agent deferred to M4. Adds a LIMIT=/TAGS= passthrough
to make check/deploy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:38:38 +02:00
a1c0f4814b feat(askari): publish askari.wingu.me; mark M2 applied (askari live)
askari provisioned + bootstrapped (cx23/hel1/Debian 13.5, cloud-init ansible user
+ sudo, cloud firewall SSH-from-ubongo-WAN, reachable, in offsite_hosts). Added
askari.wingu.me A -> 77.42.120.136 via public_dns. STATUS: askari moves to
'Real and working today'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:26:26 +02:00
917005174a feat(tf): provision askari — cx23/hel1 (CAX11 ARM was out of stock)
ARM (cax11) unavailable in all EU locations 2026-06-14; fell back to cx23 (x86,
same 2/4/40 spec, cheaper in hel1). Server created (id 141153963); offsite.yml
generated into the directory inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:23:01 +02:00
e83c777b44 docs(friction): TF child-module required_providers gotcha (caught by live init)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:15:23 +02:00
839fc632a1 fix(tf): declare required_providers in modules; pin offsite lock
terraform init failed: child modules using non-hashicorp providers must declare
required_providers, else TF infers hashicorp/{hcloud,proxmox} (nonexistent). Add
versions.tf to hetzner_vm AND proxmox_vm (same latent bug, never caught because
Proxmox TF was never init'd). Track the offsite lock (hcloud 1.65.0). Caught by
running 'make tf-init/plan TF_ENV=offsite' on ubongo — static review missed it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:14:05 +02:00
9d4a49d49d feat(vault): CHANGEME placeholder convention + check-vault flags them
Streamline the recurring secret-entry friction: the agent stubs a needed secret as
vault.<service>.<key>=CHANGEME with a what/how-to-obtain comment, wires the code,
and commits; the operator fills it via make edit-vault (real value never hits chat).
check-vault now lists outstanding CHANGEME placeholders so none are forgotten.
Convention documented in CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:40:37 +02:00
09b0aad342 fix(tf): cloud-init heredoc column-0 + firewall uses ubongo's WAN IP
Review catches: (1) <<-EOT strips by the closing marker's indent, so the
cloud-config body must match it (2 spaces) for '#cloud-config' to land at column
0; (2) the Hetzner Cloud Firewall filters public traffic, so ssh_admin_cidrs is
ubongo's WAN/egress IP, not its LAN address — a private CIDR would lock SSH out of
the live VPS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:19:45 +02:00
3588904528 docs(askari): amend ADR-006/009/020/007/016 for TF-provisioned offsite host; STATUS (apply pending)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:09:20 +02:00
fd86ec6848 test(tf): lock the offsite_hosts inventory handoff
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:06:26 +02:00
07af037ff3 feat(make): offsite TF token injection + directory inventory + tf-inventory-offsite
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:05:41 +02:00
127ade59a3 feat(tf): offsite environment — askari (CAX11/hel1/debian-13)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:03:31 +02:00
bbc287900a feat(tf): hetzner_vm module (server + firewall + ssh key + cloud-init)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:03:01 +02:00
29921428c4 docs(plan): M2 — askari provisioning (Terraform + Hetzner Cloud)
9-task plan: verify hcloud facts; hetzner_vm module (server+firewall+ssh+cloud-init);
offsite env (CAX11/hel1/debian-13, local state); Makefile token-injection + directory
inventory + tf-inventory-offsite; offsite-handoff pytest; init/validate/plan; GATED
apply (billed VPS) + bootstrap; ADR-006/009/020/007/016 amendments. Resolves the
inventory-handoff open item via a directory inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:53:08 +02:00
993d7885e4 docs: mark M1 applied (STATUS); log item.values + Gandi null-MX gotchas
M1 public_dns applied to wingu.me (purge + SPF/DMARC, idempotent). Friction:
item.values dict-method collision, Gandi null-MX rejection, and the apply=false-
Molecule/data-only-pytest gap that let both bugs reach a live apply.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:58:03 +02:00
76bd1d63fc fix(public_dns): index loop keys with item['key'] not item.key
item.values resolved to the dict's built-in .values() METHOD, not the 'values'
key, so gandi_livedns received '<built-in method values of dict object at 0x..>'
as the TXT value — garbage AND non-idempotent (the address changes each run).
Bracket-index all loop fields. Caught only by the live apply (apply=false Molecule
+ data-only pytest both missed it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:57:23 +02:00
078d1ad9d9 fix(public_dns): drop null-MX (Gandi rejects '0 .'); remove MX instead
Gandi LiveDNS rejects the RFC-7505 null-MX value '0 .' ('invalid format for MX
record'), which failed the live apply. No MX + no apex A = no mail delivery, and
SPF -all + DMARC reject still prevent spoofing — so remove Gandi's seeded MX (add
@/MX to absent) rather than declare a null-MX present. Assert now requires an SPF
@/TXT record; tests + Molecule sample updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:53:54 +02:00
3cb6436ad2 docs(adr-007): fix askari FQDN to askari.wingu.me (review nit)
The naming-table amendment left the 'External monitoring' prose saying
askari.baobab.band; askari is greenfield (never on baobab.band), so its FQDN is
askari.wingu.me, off-site tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:44:21 +02:00
f170ffd936 docs(public_dns): amend ADR-007 to wingu.me/Gandi; resolve TODO 4; STATUS + CAPABILITIES
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:38:45 +02:00
e247af6e55 test(public_dns): Molecule scenario (apply disabled, no live API)
Converge runs in CI; the no-op apply=false scenario adds no local signal over
the pytest, and the test image is on an unreachable registry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:36:40 +02:00
a0a3e4d356 feat(public_dns): dns.yml play (control-node, Gandi LiveDNS)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:35:30 +02:00
bd84dd0213 feat(public_dns): role tasks, defaults, meta, README
Implement M1: manage wingu.me public DNS zone at Gandi LiveDNS via
community.general.gandi_livedns (PAT from vault.gandi.pat). Adds
assertion guard for domain + null-MX, present/absent record loops
with run_once, and apply-gate for Molecule dry-run mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:34:42 +02:00
9311968363 feat(public_dns): wingu.me record data + validation test
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:33:07 +02:00
91ad629c02 secrets(vault): rotate Gandi PAT (via make edit-vault)
The chat-exposed PAT was rotated at Gandi and swapped in via the new edit-vault
target; commit the re-encrypted vault so the rotation is versioned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:30:58 +02:00
70c302d7e5 scaffold(public_dns): empty role structure
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:30:02 +02:00
6f5c7b2bfb deps: add community.general for gandi_livedns (public_dns)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:29:57 +02:00
e96480692d docs(friction): execution-mode menu recurred despite the 06-10 mechanical fix
5th occurrence (06-14): asked the subagent-driven/inline menu at the M1 plan
handoff. The 06-10 ledger claims a Stop hook blocks this; it didn't fire. Flag to
verify the hook is present + its matcher catches the writing-plans menu wording.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:26:43 +02:00
b131ee317e docs(plan): M1 — public_dns implementation plan
Bite-sized TDD plan: add community.general; scaffold public_dns; wingu.me record
data + pytest; role tasks (gandi_livedns present/absent loops, apply toggle);
Molecule (apply=false, no live API); dns.yml play; gated live run on ubongo
(purge Gandi defaults + anti-spoof baseline + dig verify); ADR-007 amendment +
TODO 4 resolution + STATUS/CAPABILITIES.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:23:26 +02:00
602550fdaa docs(spec): M2 — provision askari via Terraform + Hetzner Cloud
askari is provisioned as IaC: Terraform owns its existence too, generalizing
ADR-006 from "Proxmox VM existence" to Proxmox + Hetzner (new hetznercloud/hcloud
provider, hetzner_vm module, offsite stack with local state). CAX11 (ARM) in
Helsinki on Debian 13, behind a TF-managed Hetzner Cloud Firewall (SSH-from-ubongo
now; NetBird ports in M4). Token via TF_VAR_hcloud_token from vault.hetzner.token.
Handoff stays ADR-009-shaped (tf_to_inventory.py extended to emit askari into
offsite_hosts). State in the ADR-022 backup scope; DR via terraform import.

Amends ADR-006/009/020/007/016. Point ROADMAP.md M2 at the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:12:10 +02:00
32d480efcf docs(spec): note project (boma) vs domain (wingu.me) in the naming scheme
Decided to keep the project named boma with wingu.me as its domain (boma was not
available as a domain). Record why the infra tier reads <host>.boma.wingu.me so it
isn't re-litigated; folds into the ADR-007 amendment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:47:13 +02:00
79f2315eee feat(make): add edit-vault + check-vault targets
`make edit-vault` runs `ansible-vault edit` (decrypt → nvim → re-encrypt on :wq,
abort on :cq) so editing the vault is one step with no plaintext left in the work
tree, then validates structure. `make check-vault` runs scripts/check-vault.py:
decrypts in-memory, asserts valid YAML with secrets under the nested `vault:` map
and no empty leaves, and prints a values-masked structure view (comments visible,
secrets never printed). Both default to the production all-vault; override VAULT=.

Update the vault header comment, CLAUDE.md (command table + Secrets section), and
scripts/README to point at edit-vault (note check-vault.py is the one venv-
dependent helper, by design).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:36:15 +02:00
43e5a4aa53 secrets(vault): add Gandi LiveDNS PAT as vault.gandi.pat
Personal Access Token for wingu.me LiveDNS, used by the M1 public_dns role via
community.general.gandi_livedns. Stored under the nested vault.<service>.<key> map
(CLAUDE.md); the placeholder canary is preserved. Verified the token authenticates
+ is scoped to wingu.me, and that the file round-trips (decrypts to the expected
structure). PAT to be rotated after M1 (transmitted in plaintext during setup).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:14:10 +02:00
f7fac5f5e3 docs(spec): M1 — finalize for wingu.me (greenfield), record Gandi-defaults purge
boma's domain is wingu.me (registered at Gandi; 'wingu' = Swahili for cloud).
Replace the parametric <boma-domain> placeholder with wingu.me throughout. The
zone was NOT empty — Gandi auto-seeded 13 default records (parking A, www redirect,
a full Gandi mailbox set), so M1 includes a one-time purge to a clean baseline plus
an anti-spoof null-mail set (null MX, SPF -all, DMARC reject) since wingu.me sends
no mail. Domain-pick open item closed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:14:10 +02:00
7a47dd9dec docs(spec): M1 — public DNS migration to Gandi (DNS-as-code) design
Settles the M1 design: full registrar transfer Cloudflare -> Gandi; three-tier
naming scheme (host.boma / service.bare / service.askari), nyumbani dropped,
mesh/LAN-only default; public-DNS-as-code via a control-node `public_dns` role
driven by group_vars data, using community.general.gandi_livedns with a PAT
(api_key is deprecated/rejected by Gandi — verified per ADR-014). Stale records +
unused MX cleaned by omission. Cert scope is DNS+PAT only (issuance deferred to
M4/Phase 2). Human/agent division of labour + token-scoping recorded.

Resolves TODO 4 and review finding O12 once the ADR-007 amendment lands. Point
ROADMAP.md M1 at the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 23:17:19 +02:00