Converge enables mesh with base__mesh_manage:false (+ dummy key) so the include
path runs hermetically; verify asserts netbird is not installed — proving the
concern is a clean no-op when the live actions are gated off. Existing firewall/
ssh/fail2ban assertions unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README/SECURITY said gRPC was path-matched (/management.ManagementService/* etc.);
the deployed Caddy route selects gRPC by Content-Type: application/grpc* (NetBird's
own external-proxy example). Reconciled the prose to what actually runs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Caddyfile was bind-mounted as a single file. ansible.builtin.template writes
atomically (temp + rename), so a re-render swaps the file's inode while the running
container keeps the old one — `caddy reload` then re-read stale config and silently
no-op'd ("config is unchanged"), so new routes never loaded. Surfaced deploying the
NetBird route: Caddy never requested its cert. Fix: render to ./caddy/Caddyfile and
mount the ./caddy DIRECTORY at /etc/caddy — directory mounts reflect inode swaps, so
graceful `caddy reload` works. Proven on askari: atomic replace in the host dir is
visible inside the running container.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author the four ADR-mandated service-role docs for netbird_coordinator and
add the cross-role access__*/backup__* data (ADR-021/022). First stateful
service: backup__state=true; off-site capture pending the fisi pull node.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First real service role. NetBird v0.72.4 self-hosted control plane: single
netbirdio/netbird-server:0.72.4 (management + signal + relay + STUN + embedded
Dex) plus netbirdio/dashboard:v2.39.0, both on the shared boma Docker network so
the M4a Caddy fronts them. Renders docker-compose.yml + config.yaml (secrets from
vault.netbird.*, no_log) + dashboard.env. STUN 3478/udp host-exposed; everything
else via the proxy. netbird_coordinator__manage gates the compose-up for Molecule.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a per-instance DNS-01 mode to the Caddy role for mesh/LAN-only hosts that
cannot satisfy HTTP-01. Default behaviour (vanilla caddy:2 + HTTP-01, what askari
runs) is unchanged.
- reverse_proxy__acme_dns_provider: "" (HTTP-01) | "gandi" (DNS-01)
- reverse_proxy__image: override to the custom caddy-gandi image for DNS-01
- Caddyfile gains a global `acme_dns gandi {env.GANDI_BEARER_TOKEN}` block
- the PAT (vault.gandi.pat) renders into a host-only 0600 env file (no_log),
loaded by compose only when DNS-01 is enabled
Verified: the custom image issues a real wildcard cert (*.dns01test.wingu.me)
end-to-end against LE staging via Gandi DNS-01; `caddy validate` accepts
`acme_dns gandi` on the custom image and rejects it on vanilla caddy:2. Molecule
(HTTP-01 default path) green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- dev_env .zshrc: drop the rclone alias (not installed) and guard the direnv
hook with `command -v direnv` so a missing direnv doesn't error every shell (O16)
- dev_env oh-my-posh: tag the zen.toml theme deploy `config` (it renders config to
disk like the per_user dotfiles); the include now carries packages+config so a
`--tags config` run re-renders the theme while the binary install stays packages
only (O17). Verified via `molecule converge -- --tags config`.
- drop the non-vocabulary `tags: [verify]` from molecule verify playbooks across
base/docker_host/public_dns/reverse_proxy (check-tags exempts molecule anyway) (O25)
- reverse_proxy templates: add the `{{ ansible_managed }}` header (ADR-024 §1.2) (O26)
make lint green; dev_env + reverse_proxy molecule green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
reverse_proxy is the first built+applied service role; add the per-service
records CLAUDE.md/ADR-002/008/017/021 require. Add access__*/backup__* data to
defaults as the source of truth (ADR-021/022). reverse_proxy is stateless (ACME
certs re-issue via HTTP-01), so it declares backup__state: false with a reason
rather than a BACKUP.md (ADR-022 convention).
The access__*/backup__* cross-role field names intentionally don't carry the
reverse_proxy__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
(ansible-lint has no per-prefix allowlist; rule stays enabled elsewhere).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dynamic include_tasks only filter on the include's own tags, not their
(untagged) contents — so `--tags packages` ran none of the neovim/oh-my-posh/
nodejs installs, and `--tags users|config` never entered per_user.yml. Add
`apply: tags:` to all four includes (mirroring base/tasks/main.yml) and tag the
dev_env__home getent+set_fact preflight `always` so a partial run still resolves
the home dir before the dotfile/stow tasks consume it.
Molecule: add a config-only converge play for a fresh user + a verify assertion.
Proven with `molecule converge -- --tags config` (idempotent, home resolved).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Switch from a custom caddy-dns/gandi image built on-host to the official
caddy:2 image with per-host ACME HTTP-01 certificates. Removes the
Dockerfile, env.j2 (Gandi token), on-host image build/ship/load tasks,
the caddy-image Makefile target, and the wildcard DNS-01 Caddyfile.
Each route now gets its own server block and automatic certificate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the Caddy reverse proxy role (ADR-024): builds boma/caddy-gandi:latest
on-host (caddy-dns/gandi plugin), renders Caddyfile from route catalog, brings
Compose project up. Adds community.docker to requirements.yml, production group_vars,
and a caddy-image Makefile target.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the docker_host role tasks: prerequisites, /etc/apt/keyrings
directory (ordered before the GPG key write), Docker APT key + repo, and
docker-ce/cli/containerd.io/compose-plugin install. Daemon hardening and
nftables.d integration remain deferred to Phase 2 (cluster + base firewall).
Updates defaults, README, and molecule verify to assert docker --version.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two bugs caught by the live make check/deploy on askari:
- include_tasks with a tag selects the include but NOT its tasks, so --tags hardening
ran nothing. Use apply: {tags:} to propagate (also fixed the firewall include).
- fail2ban service start + restart handler fail in a first-run --check (package not
installed yet); guard both with when: not ansible_check_mode so check is clean.
Applied to askari: SSH hardened, fail2ban active, ping still works (no lockout).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add explicit base__ssh_authorised_keys: [] default to prevent
undefined-var errors in Molecule. Extend verify.yml with sshd
drop-in validation, PasswordAuthentication check, and fail2ban
jail assertion. Pre-create /run/sshd in ssh.yml so sshd -t
works in containers before the service has ever started.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
item.values resolved to the dict's built-in .values() METHOD, not the 'values'
key, so gandi_livedns received '<built-in method values of dict object at 0x..>'
as the TXT value — garbage AND non-idempotent (the address changes each run).
Bracket-index all loop fields. Caught only by the live apply (apply=false Molecule
+ data-only pytest both missed it).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gandi LiveDNS rejects the RFC-7505 null-MX value '0 .' ('invalid format for MX
record'), which failed the live apply. No MX + no apex A = no mail delivery, and
SPF -all + DMARC reject still prevent spoofing — so remove Gandi's seeded MX (add
@/MX to absent) rather than declare a null-MX present. Assert now requires an SPF
@/TXT record; tests + Molecule sample updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Converge runs in CI; the no-op apply=false scenario adds no local signal over
the pytest, and the test image is on an unreachable registry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implement M1: manage wingu.me public DNS zone at Gandi LiveDNS via
community.general.gandi_livedns (PAT from vault.gandi.pat). Adds
assertion guard for domain + null-MX, present/absent record loops
with run_once, and apply-gate for Molecule dry-run mode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
playbooks/site.yml imports the docker_host role, but it didn't exist, so
ansible-lint's syntax-check failed on a clean checkout — breaking CLAUDE.md's
"main must always work" / "Never skip lint" (top open finding O1 from the
2026-06-11 review).
Scaffold docker_host as a proper placeholder via the prescribed mechanism
(make new-role): filled meta/main.yml + README, an honest no-task tasks/main.yml
documenting planned scope (Docker engine + Compose, daemon hardening, nftables.d
container rules per ADR-004/020), and the standard molecule scenario. This
preserves site.yml's full-standard-state intent rather than dropping the play.
Update STATUS.md (docker_host moves from "Not in git" to "scaffolded, no tasks")
and the role/playbook READMEs to match.
make lint: 0 failures, 0 warnings; check-tags OK.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/review-repo run at 67f2aba. Auto-fixed 5 safe doc-drift items left by the
base(firewall)+dev_env build wave: README/playbook/role notes that still called
the roles "empty/not built", plus README tree gaps and the reciprocal ADR-021
cross-links in ADR-016/020.
18 open findings reported (not fixed). Headline: `make lint` is red on `main`
(site.yml imports the non-existent docker_host role) and an ADR-004 <-> ADR-022
backup-scope contradiction. Deferral checklist clean (0 stale-deferred); 7 of
12 prior findings confirmed resolved. See docs/reviews/2026-06-11-review.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Debian's npm package pulls a ~400-package node-* tree (the first deploy
installed 527 packages). Replace apt nodejs+npm with a pinned upstream Node
tarball (v20.19.2) installed to /opt + symlinked, mirroring the nvim install
pattern (ADR-014 pinning). npm/npx come bundled. Molecule verifies node/npm
on PATH; lint + idempotent converge green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When the login user differs from the become_user (ubongo connects as sjat,
the role copies files as claude), Ansible needs ACLs on its temp files;
without the acl package it falls back to an unsupported chmod syntax and
fails. Molecule didn't catch it (root login can chown directly).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A new role (separate from base) that gives workstation-class hosts (ubongo
now, mamba later) a clean interactive environment: zsh + oh-my-zsh +
oh-my-posh, tmux + TPM plugins, and neovim. Dotfiles are real files deployed
via GNU stow (not templated); pinned nvim v0.12.2 + oh-my-posh 29.0.1.
Configs re-derived (ADR-013) from AnsibleBaobabV4 + the operator's fisi setup
on boma's terms: no Nerd Font (headless host), no system LSP suite (nvim uses
mason), versions pinned (V4 tracks latest). Applied via playbooks/workstation.yml
to the control group for users sjat + claude. Lint + Molecule (idempotent) green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bare 'nft list ruleset' has no leading flush, so the timer's 'nft -f rollback'
was a no-op on first apply (empty file) and errored ('table exists') on later
applies — the auto-rollback silently did nothing, defeating the askari lockout
safeguard. Prepend 'flush ruleset' so the revert is atomic + self-contained.
Verified the snapshot->lockout->revert round-trip in an isolated netns.
Also fix stale STATUS prose (base is partially built, not absent).
established/related keeps the in-flight session alive across the swap, so the
prior 'next task runs' confirm always passed even if new connections were
bricked — the rollback was theater. reset_connection + wait_for_connection now
force a fresh handshake through the new ruleset; failure aborts the play and the
armed timer reverts. (meta: reset_connection ignores 'when' — benign extra
reconnect on no-op runs; verified idempotent in molecule.)
nft -c rejects iif "wt0" when the interface is absent (container, or any host
before NetBird); iifname matches by name and is robust to wt0 coming/going.
Drop the ansible_host fixture override (the docker connection uses it as the
container name) — molecule covers zone resolution, pytest covers service->IP.
R6/R7: ADR-003 & ADR-008 CI pipelines rewritten trunk-based (push to main ->
test -> staging -> [manual gate] production); CLAUDE.md no longer forbids pushing
to main. R8: STATUS/roles-README/site.yml now say base & docker_host are not built
(not in git), so a clean clone errors. R15/R16: ADR-001 table flagged as intended
design; dropped the unbuilt 'monitoring agent' from the baseline.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>