Commit graph

33 commits

Author SHA1 Message Date
f83d68d7a0 feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)
Adds base__mesh_coordinator_pin (default empty = no-op). When set + base__mesh_enabled,
a lineinfile task writes "<ip> <fqdn>" to /etc/hosts so a managed mesh host survives a
local-DNS hiccup (the 2026-06-18 incident class). FQDN derived from base__mesh_management_url
via regex_replace (no community.general). Gated on base__mesh_enabled | bool and pin length;
the coordinator host (askari/offsite_hosts) stays exempt. Production pin wired for ubongo
(77.42.120.136). Molecule dns_servers fix included (Docker/NetBird DNS incompatibility).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-20 11:22:40 +02:00
d9b8676fce feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 17:18:58 +02:00
b3e14decb4 feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH
Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
raw leases, backstopped by wt0). Mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:42:49 +02:00
3fe6f68316 feat(base): codify AI-worker NOPASSWD sudo (ADR-015 amended)
Add base__ai_worker_user var (default empty), a new operational_access.yml
task file that drops a validated sudoers file for the named user, and wire it
into base/tasks/main.yml after the hardening includes under the `users` tag.

Set base__ai_worker_user: claude in group_vars/control so that applying base
to ubongo is idempotent with the manual /etc/sudoers.d/claude-ai-worker drop-in
already in place. Password remains locked; NOPASSWD is the only sudo path;
actions are attributed via auditd (ADR-021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:36:31 +02:00
847d9885e2 revert: back out mesh-hardening 1/3 on askari after it broke the Docker host
Incident 2026-06-17: applying base's nftables default-deny (forward policy drop)
to askari — a Docker host — broke container forwarding/NAT on reboot, and the
wt0-only sshd ListenAddress left no break-glass (ip_nonlocal_bind did NOT beat
the boot race). Recovery: disable nftables + restart docker (restore the wiped
NAT masquerade) + force-recreate the coordinator (it FATAL-looped unable to
download its GeoLite2 DB with no egress) -> mesh re-formed.

Back out the enablement so a future deploy can't re-break askari:
- offsite_hosts: base__ssh_listen_mesh_only=false, base__firewall_apply=false
- remove host_vars/askari.yml (manage over the WAN again, not wt0)
- tf/offsite: re-open WAN :22 to ubongo only (break-glass; already applied)

askari now: sshd on all interfaces (Ansible-managed), nftables disabled, WAN :22
open -> stable + reboot-survivable. The base feature code (sshd ListenAddress
option, firewall public zone) stays; it's just not enabled on Docker hosts.
Mesh-hardening 1/3 to be re-spec'd before any retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:16:17 +02:00
cc21344ab1 feat(inventory): manage askari over wt0 + enable mesh-only SSH
host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:49:15 +02:00
3b30e70ba5 feat(firewall): public zone + askari's public services in the catalog
Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:46:03 +02:00
5947ba8756 chore(vault): Forgejo registry_token supplied (operator-minted, encrypted)
registry-login verified end-to-end (docker login -> Login Succeeded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:37:11 +02:00
c1323a3f29 feat(make): registry-login via vaulted Forgejo token (kaizen)
scripts/registry-login.sh reads vault.forgejo.registry_token and pipes it to
docker login --password-stdin (never echoed, never on argv); 'make registry-login'
wires it with the venv binaries. Adds the operator-minted CHANGEME vault stub
(fill via make edit-vault) and a per-machine prereq note in the claude-code-setup
runbook, so 'make caddy-image-push'/'molecule-image-push' become agent-completable
non-interactively. Consumes the 2026-06-15 signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:50:07 +02:00
8d2a064542 chore(vault): NetBird setup_key supplied (operator-minted, encrypted)
Operator replaced the CHANGEME with a real reusable scoped setup key via
make edit-vault (re-encrypted in place). Encrypted ciphertext only; no plaintext.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:40:58 +02:00
d202b89480 feat(base): vault setup_key stub + enable mesh on ubongo + askari
vault.netbird.setup_key: CHANGEME (operator mints a reusable scoped key after the
dashboard /setup). base__mesh_enabled: true for control (ubongo) + offsite_hosts
(askari) so the base 'mesh' concern enrols them. Enrollment only — no firewall change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:12:28 +02:00
1333ec181f feat(reverse_proxy): raw-directive route type; wire NetBird (gRPC/WS) route
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:55:05 +02:00
3762be4622 feat(netbird): vault secrets — auth_secret + datastore_key
Self-generated random values for the NetBird coordinator: auth_secret (relay/JWT
shared secret) and datastore_key (SQLite store encryption, base64 32 bytes with
padding). Wired into roles/netbird_coordinator config.yaml via vault.netbird.*.
No CHANGEME — both are agent-generatable (not operator-supplied). The M5 peer
setup key is a runtime dashboard artifact, added to vault when M5 wires it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:52:16 +02:00
9e0c264658 docs: reconcile lower-severity review findings (O9-O24)
- ADR-007: document ubongo on the legacy V4 net at 10.20.10.151 (transitional,
  outside the planned srv /24 until the LAN is re-cut) (O10); single authoritative
  boma.baobab.band -> boma.wingu.me transition note already added earlier
- terraform tfvars.example + variables.tf (both envs): pve01 -> pve0 and
  <host>.boma.baobab.band per ADR-007 naming (O11)
- ADR-012/013/015/016/017/018: convert "See also:" prose to `## Related` sections
  placed after Consequences, matching ADR-014/019-023 (O13)
- docs/README + inventories/README: list the missing subdirs / offsite_hosts +
  offsite.yml merge behaviour (O14, O29 note)
- ADR-009: drop the retired `nyumbani` example; use vaultwarden.wingu.me split-horizon (O19)
- ROADMAP M2: askari shipped as cx23/x86 (CAX11/ARM out of stock) (O20)
- ADR-020: 80/443/3478 opened in M4a (past tense); coordinator role is M4b (O21)
- netbird -> netbird_coordinator across ROADMAP M4b, the M4b plan, ADR-024 (O23)
- ADR-024: align the M1 DNS-01 wildcard scope wording with ROADMAP (O24)
- capacity-scan.py: read the inventory directory so offsite.yml (askari) is seen (O28)
- tf_to_inventory.py: generated header now warns it overwrites the manual control node (O9)
- tests/tags.yml: proxy concern comment Traefik -> Caddy (missed in the O3 sweep)

O9's existing stub hosts.yml header stays as-is (generator-owned, hook-protected);
the fix lives in the generator for the next regeneration. make lint + pytest (57) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:31:40 +02:00
64f1e821d8 docs(review): 2026-06-14 repo audit — M4a doc drift + Traefik→Caddy lag
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:37:54 +02:00
b7e919d6b3 refactor(reverse_proxy): vanilla Caddy + HTTP-01 (drop DNS-01 custom image)
Switch from a custom caddy-dns/gandi image built on-host to the official
caddy:2 image with per-host ACME HTTP-01 certificates. Removes the
Dockerfile, env.j2 (Gandi token), on-host image build/ship/load tasks,
the caddy-image Makefile target, and the wildcard DNS-01 Caddyfile.
Each route now gets its own server block and automatic certificate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:11:20 +02:00
9c169561d7 feat(offsite): *.askari.wingu.me wildcard + offsite.yml (docker_host + reverse_proxy)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:39:44 +02:00
50b6445bdd feat(reverse_proxy): Caddy role (Gandi DNS-01, on-host image build, route catalog)
Implements the Caddy reverse proxy role (ADR-024): builds boma/caddy-gandi:latest
on-host (caddy-dns/gandi plugin), renders Caddyfile from route catalog, brings
Compose project up. Adds community.docker to requirements.yml, production group_vars,
and a caddy-image Makefile target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:36:58 +02:00
deec75de0f feat(base): ssh hardening + fail2ban (hardening concern, ADR-002)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:42:56 +02:00
a1c0f4814b feat(askari): publish askari.wingu.me; mark M2 applied (askari live)
askari provisioned + bootstrapped (cx23/hel1/Debian 13.5, cloud-init ansible user
+ sudo, cloud firewall SSH-from-ubongo-WAN, reachable, in offsite_hosts). Added
askari.wingu.me A -> 77.42.120.136 via public_dns. STATUS: askari moves to
'Real and working today'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:26:26 +02:00
917005174a feat(tf): provision askari — cx23/hel1 (CAX11 ARM was out of stock)
ARM (cax11) unavailable in all EU locations 2026-06-14; fell back to cx23 (x86,
same 2/4/40 spec, cheaper in hel1). Server created (id 141153963); offsite.yml
generated into the directory inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:23:01 +02:00
078d1ad9d9 fix(public_dns): drop null-MX (Gandi rejects '0 .'); remove MX instead
Gandi LiveDNS rejects the RFC-7505 null-MX value '0 .' ('invalid format for MX
record'), which failed the live apply. No MX + no apex A = no mail delivery, and
SPF -all + DMARC reject still prevent spoofing — so remove Gandi's seeded MX (add
@/MX to absent) rather than declare a null-MX present. Assert now requires an SPF
@/TXT record; tests + Molecule sample updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:53:54 +02:00
9311968363 feat(public_dns): wingu.me record data + validation test
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:33:07 +02:00
91ad629c02 secrets(vault): rotate Gandi PAT (via make edit-vault)
The chat-exposed PAT was rotated at Gandi and swapped in via the new edit-vault
target; commit the re-encrypted vault so the rotation is versioned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:30:58 +02:00
79f2315eee feat(make): add edit-vault + check-vault targets
`make edit-vault` runs `ansible-vault edit` (decrypt → nvim → re-encrypt on :wq,
abort on :cq) so editing the vault is one step with no plaintext left in the work
tree, then validates structure. `make check-vault` runs scripts/check-vault.py:
decrypts in-memory, asserts valid YAML with secrets under the nested `vault:` map
and no empty leaves, and prints a values-masked structure view (comments visible,
secrets never printed). Both default to the production all-vault; override VAULT=.

Update the vault header comment, CLAUDE.md (command table + Secrets section), and
scripts/README to point at edit-vault (note check-vault.py is the one venv-
dependent helper, by design).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:36:15 +02:00
43e5a4aa53 secrets(vault): add Gandi LiveDNS PAT as vault.gandi.pat
Personal Access Token for wingu.me LiveDNS, used by the M1 public_dns role via
community.general.gandi_livedns. Stored under the nested vault.<service>.<key> map
(CLAUDE.md); the placeholder canary is preserved. Verified the token authenticates
+ is scoped to wingu.me, and that the file round-trips (decrypts to the expected
structure). PAT to be rotated after M1 (transmitted in plaintext during setup).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:14:10 +02:00
6203513220 inventory: manage ubongo (control node) as the operator account
group_vars/all assumes the ansible service user (created by bootstrap on
Terraform VMs). ubongo is the manually-provisioned control node (ADR-009/
ADR-015 exception) with no bootstrapped ansible user, so connect as sjat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 14:09:15 +02:00
f3f382ae69 Add dev_env role: zsh/tmux/nvim for workstation-class hosts
A new role (separate from base) that gives workstation-class hosts (ubongo
now, mamba later) a clean interactive environment: zsh + oh-my-zsh +
oh-my-posh, tmux + TPM plugins, and neovim. Dotfiles are real files deployed
via GNU stow (not templated); pinned nvim v0.12.2 + oh-my-posh 29.0.1.

Configs re-derived (ADR-013) from AnsibleBaobabV4 + the operator's fisi setup
on boma's terms: no Nerd Font (headless host), no system LSP suite (nvim uses
mason), versions pinned (V4 tracks latest). Applied via playbooks/workstation.yml
to the control group for users sjat + claude. Lint + Molecule (idempotent) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 13:50:11 +02:00
7b5fd17e55 inventory: add ubongo to control group; set ssh-from-control addr
Wire the now-built physical control node ubongo (10.20.10.151) into the
production control group (the documented manual exception), and activate the
dormant base__firewall_control_addr knob (ADR-021 ssh-from-control source).
Forward-wiring only: no host has the base role applied yet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 10:32:24 +02:00
390cd3b335 feat(base): shared firewall catalog/zones + firewall defaults
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 18:49:40 +02:00
4ee1b66e23 Source vault password from Vaultwarden via rbw; nest vault structure
Master vault password is fetched from Vaultwarden via the rbw agent
(scripts/vault-pass-client.sh, wired as vault_password_file) instead of a
plaintext .vault_pass. Vault secrets use a nested vault.<service>.<key> map.
Encrypted vault.yml files are excluded from lint. Includes the host rename in
Makefile and STATUS.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 18:16:35 +02:00
2dfa8ca9d6 Harden lint setup and clean inventory placeholders
- Pin pre-commit ansible-lint hook to ansible-core==2.17.* (was floating, crashed)
- Add pre-commit to requirements.txt
- Align .yamllint with ansible-lint (comments-indentation off, octal rules on)
- Rewrite inventory placeholders to lint-clean empty-group form

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:56:16 +02:00
3f1d7eb128 Add core Ansible scaffold, tooling, and pre-commit guards
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:10:01 +02:00