Adds base__mesh_coordinator_pin (default empty = no-op). When set + base__mesh_enabled,
a lineinfile task writes "<ip> <fqdn>" to /etc/hosts so a managed mesh host survives a
local-DNS hiccup (the 2026-06-18 incident class). FQDN derived from base__mesh_management_url
via regex_replace (no community.general). Gated on base__mesh_enabled | bool and pin length;
the coordinator host (askari/offsite_hosts) stays exempt. Production pin wired for ubongo
(77.42.120.136). Molecule dns_servers fix included (Docker/NetBird DNS incompatibility).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a nftables drop-in (10-libvirt-boma.nft) to base's drop-in dir that
allows traffic on iifname "virbr-boma" in the inet filter input chain.
Fixes DHCP/DNS being dropped by base's default-deny INPUT policy for VMs
on the libvirt integration bridge. Mirrors docker_host's drop-in pattern.
Molecule scenario updated to exercise only the firewall tasks (package
install unavailable in the no-internet Docker container) via include_role
tasks_from; verify asserts the drop-in renders the virbr-boma accept rule.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add base__ai_worker_user var (default empty), a new operational_access.yml
task file that drops a validated sudoers file for the named user, and wire it
into base/tasks/main.yml after the hardening includes under the `users` tag.
Set base__ai_worker_user: claude in group_vars/control so that applying base
to ubongo is idempotent with the manual /etc/sudoers.d/claude-ai-worker drop-in
already in place. Password remains locked; NOPASSWD is the only sudo path;
actions are attributed via auditd (ADR-021).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base's inet-filter forward chain is policy-drop; on a Docker host that kills published-port DNAT + inter-container forwarding ON REBOOT (nftables loads default-deny before dockerd). This drop-in (loaded via base's /etc/nftables.d/*.nft include at boot) appends the container-bridge accepts so a rebooted Docker host keeps forwarding. Resolves FRICTION 2026-06-17 #1 and the GREEN half of ADR-025's acceptance test. NB nftables wildcard is br-*, not the iptables br-+.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The tls-internal/acme_ca knobs used {%- -%} trims validated only against raw jinja2; ansible (trim_blocks=True) double-stripped newlines and collapsed the Caddyfile onto single lines, crash-looping caddy. Match the role's existing plain {% %} style.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).
Environmental checkpoint applied: the molecule-debian13 container image lacks
procps (no sysctl binary). Added molecule/default/prepare.yml to install procps
and sysctls: {net.ipv4.ip_nonlocal_bind: "0"} to molecule.yml platform so the
ansible.posix.sysctl task can write and read back the value hermetically.
Sysctl file format is net.ipv4.ip_nonlocal_bind=1 (no spaces); verify.yml
grep pattern updated to match ansible.posix.sysctl's actual output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add check_mode: false to the state:directory base_dir tasks so that 'make check'
on a brand-new compose service role creates the scaffold during --check and the
rest of the dry-run (templates + docker_compose_v2 up) can be evaluated instead
of failing on a missing project_src. The directive is inert under a normal
converge (incl. Molecule + its tagged second converge), so role tests are
unchanged. Consumes the 2026-06-16 signal in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Converge enables mesh with base__mesh_manage:false (+ dummy key) so the include
path runs hermetically; verify asserts netbird is not installed — proving the
concern is a clean no-op when the live actions are gated off. Existing firewall/
ssh/fail2ban assertions unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README/SECURITY said gRPC was path-matched (/management.ManagementService/* etc.);
the deployed Caddy route selects gRPC by Content-Type: application/grpc* (NetBird's
own external-proxy example). Reconciled the prose to what actually runs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Caddyfile was bind-mounted as a single file. ansible.builtin.template writes
atomically (temp + rename), so a re-render swaps the file's inode while the running
container keeps the old one — `caddy reload` then re-read stale config and silently
no-op'd ("config is unchanged"), so new routes never loaded. Surfaced deploying the
NetBird route: Caddy never requested its cert. Fix: render to ./caddy/Caddyfile and
mount the ./caddy DIRECTORY at /etc/caddy — directory mounts reflect inode swaps, so
graceful `caddy reload` works. Proven on askari: atomic replace in the host dir is
visible inside the running container.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author the four ADR-mandated service-role docs for netbird_coordinator and
add the cross-role access__*/backup__* data (ADR-021/022). First stateful
service: backup__state=true; off-site capture pending the fisi pull node.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First real service role. NetBird v0.72.4 self-hosted control plane: single
netbirdio/netbird-server:0.72.4 (management + signal + relay + STUN + embedded
Dex) plus netbirdio/dashboard:v2.39.0, both on the shared boma Docker network so
the M4a Caddy fronts them. Renders docker-compose.yml + config.yaml (secrets from
vault.netbird.*, no_log) + dashboard.env. STUN 3478/udp host-exposed; everything
else via the proxy. netbird_coordinator__manage gates the compose-up for Molecule.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a per-instance DNS-01 mode to the Caddy role for mesh/LAN-only hosts that
cannot satisfy HTTP-01. Default behaviour (vanilla caddy:2 + HTTP-01, what askari
runs) is unchanged.
- reverse_proxy__acme_dns_provider: "" (HTTP-01) | "gandi" (DNS-01)
- reverse_proxy__image: override to the custom caddy-gandi image for DNS-01
- Caddyfile gains a global `acme_dns gandi {env.GANDI_BEARER_TOKEN}` block
- the PAT (vault.gandi.pat) renders into a host-only 0600 env file (no_log),
loaded by compose only when DNS-01 is enabled
Verified: the custom image issues a real wildcard cert (*.dns01test.wingu.me)
end-to-end against LE staging via Gandi DNS-01; `caddy validate` accepts
`acme_dns gandi` on the custom image and rejects it on vanilla caddy:2. Molecule
(HTTP-01 default path) green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- dev_env .zshrc: drop the rclone alias (not installed) and guard the direnv
hook with `command -v direnv` so a missing direnv doesn't error every shell (O16)
- dev_env oh-my-posh: tag the zen.toml theme deploy `config` (it renders config to
disk like the per_user dotfiles); the include now carries packages+config so a
`--tags config` run re-renders the theme while the binary install stays packages
only (O17). Verified via `molecule converge -- --tags config`.
- drop the non-vocabulary `tags: [verify]` from molecule verify playbooks across
base/docker_host/public_dns/reverse_proxy (check-tags exempts molecule anyway) (O25)
- reverse_proxy templates: add the `{{ ansible_managed }}` header (ADR-024 §1.2) (O26)
make lint green; dev_env + reverse_proxy molecule green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
reverse_proxy is the first built+applied service role; add the per-service
records CLAUDE.md/ADR-002/008/017/021 require. Add access__*/backup__* data to
defaults as the source of truth (ADR-021/022). reverse_proxy is stateless (ACME
certs re-issue via HTTP-01), so it declares backup__state: false with a reason
rather than a BACKUP.md (ADR-022 convention).
The access__*/backup__* cross-role field names intentionally don't carry the
reverse_proxy__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
(ansible-lint has no per-prefix allowlist; rule stays enabled elsewhere).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dynamic include_tasks only filter on the include's own tags, not their
(untagged) contents — so `--tags packages` ran none of the neovim/oh-my-posh/
nodejs installs, and `--tags users|config` never entered per_user.yml. Add
`apply: tags:` to all four includes (mirroring base/tasks/main.yml) and tag the
dev_env__home getent+set_fact preflight `always` so a partial run still resolves
the home dir before the dotfile/stow tasks consume it.
Molecule: add a config-only converge play for a fresh user + a verify assertion.
Proven with `molecule converge -- --tags config` (idempotent, home resolved).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Switch from a custom caddy-dns/gandi image built on-host to the official
caddy:2 image with per-host ACME HTTP-01 certificates. Removes the
Dockerfile, env.j2 (Gandi token), on-host image build/ship/load tasks,
the caddy-image Makefile target, and the wildcard DNS-01 Caddyfile.
Each route now gets its own server block and automatic certificate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the Caddy reverse proxy role (ADR-024): builds boma/caddy-gandi:latest
on-host (caddy-dns/gandi plugin), renders Caddyfile from route catalog, brings
Compose project up. Adds community.docker to requirements.yml, production group_vars,
and a caddy-image Makefile target.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the docker_host role tasks: prerequisites, /etc/apt/keyrings
directory (ordered before the GPG key write), Docker APT key + repo, and
docker-ce/cli/containerd.io/compose-plugin install. Daemon hardening and
nftables.d integration remain deferred to Phase 2 (cluster + base firewall).
Updates defaults, README, and molecule verify to assert docker --version.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two bugs caught by the live make check/deploy on askari:
- include_tasks with a tag selects the include but NOT its tasks, so --tags hardening
ran nothing. Use apply: {tags:} to propagate (also fixed the firewall include).
- fail2ban service start + restart handler fail in a first-run --check (package not
installed yet); guard both with when: not ansible_check_mode so check is clean.
Applied to askari: SSH hardened, fail2ban active, ping still works (no lockout).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add explicit base__ssh_authorised_keys: [] default to prevent
undefined-var errors in Molecule. Extend verify.yml with sshd
drop-in validation, PasswordAuthentication check, and fail2ban
jail assertion. Pre-create /run/sshd in ssh.yml so sshd -t
works in containers before the service has ever started.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
item.values resolved to the dict's built-in .values() METHOD, not the 'values'
key, so gandi_livedns received '<built-in method values of dict object at 0x..>'
as the TXT value — garbage AND non-idempotent (the address changes each run).
Bracket-index all loop fields. Caught only by the live apply (apply=false Molecule
+ data-only pytest both missed it).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gandi LiveDNS rejects the RFC-7505 null-MX value '0 .' ('invalid format for MX
record'), which failed the live apply. No MX + no apex A = no mail delivery, and
SPF -all + DMARC reject still prevent spoofing — so remove Gandi's seeded MX (add
@/MX to absent) rather than declare a null-MX present. Assert now requires an SPF
@/TXT record; tests + Molecule sample updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Converge runs in CI; the no-op apply=false scenario adds no local signal over
the pytest, and the test image is on an unreachable registry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implement M1: manage wingu.me public DNS zone at Gandi LiveDNS via
community.general.gandi_livedns (PAT from vault.gandi.pat). Adds
assertion guard for domain + null-MX, present/absent record loops
with run_once, and apply-gate for Molecule dry-run mode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
playbooks/site.yml imports the docker_host role, but it didn't exist, so
ansible-lint's syntax-check failed on a clean checkout — breaking CLAUDE.md's
"main must always work" / "Never skip lint" (top open finding O1 from the
2026-06-11 review).
Scaffold docker_host as a proper placeholder via the prescribed mechanism
(make new-role): filled meta/main.yml + README, an honest no-task tasks/main.yml
documenting planned scope (Docker engine + Compose, daemon hardening, nftables.d
container rules per ADR-004/020), and the standard molecule scenario. This
preserves site.yml's full-standard-state intent rather than dropping the play.
Update STATUS.md (docker_host moves from "Not in git" to "scaffolded, no tasks")
and the role/playbook READMEs to match.
make lint: 0 failures, 0 warnings; check-tags OK.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/review-repo run at 67f2aba. Auto-fixed 5 safe doc-drift items left by the
base(firewall)+dev_env build wave: README/playbook/role notes that still called
the roles "empty/not built", plus README tree gaps and the reciprocal ADR-021
cross-links in ADR-016/020.
18 open findings reported (not fixed). Headline: `make lint` is red on `main`
(site.yml imports the non-existent docker_host role) and an ADR-004 <-> ADR-022
backup-scope contradiction. Deferral checklist clean (0 stale-deferred); 7 of
12 prior findings confirmed resolved. See docs/reviews/2026-06-11-review.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Debian's npm package pulls a ~400-package node-* tree (the first deploy
installed 527 packages). Replace apt nodejs+npm with a pinned upstream Node
tarball (v20.19.2) installed to /opt + symlinked, mirroring the nvim install
pattern (ADR-014 pinning). npm/npx come bundled. Molecule verifies node/npm
on PATH; lint + idempotent converge green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When the login user differs from the become_user (ubongo connects as sjat,
the role copies files as claude), Ansible needs ACLs on its temp files;
without the acl package it falls back to an unsupported chmod syntax and
fails. Molecule didn't catch it (root login can chown directly).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A new role (separate from base) that gives workstation-class hosts (ubongo
now, mamba later) a clean interactive environment: zsh + oh-my-zsh +
oh-my-posh, tmux + TPM plugins, and neovim. Dotfiles are real files deployed
via GNU stow (not templated); pinned nvim v0.12.2 + oh-my-posh 29.0.1.
Configs re-derived (ADR-013) from AnsibleBaobabV4 + the operator's fisi setup
on boma's terms: no Nerd Font (headless host), no system LSP suite (nvim uses
mason), versions pinned (V4 tracks latest). Applied via playbooks/workstation.yml
to the control group for users sjat + claude. Lint + Molecule (idempotent) green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bare 'nft list ruleset' has no leading flush, so the timer's 'nft -f rollback'
was a no-op on first apply (empty file) and errored ('table exists') on later
applies — the auto-rollback silently did nothing, defeating the askari lockout
safeguard. Prepend 'flush ruleset' so the revert is atomic + self-contained.
Verified the snapshot->lockout->revert round-trip in an isolated netns.
Also fix stale STATUS prose (base is partially built, not absent).
established/related keeps the in-flight session alive across the swap, so the
prior 'next task runs' confirm always passed even if new connections were
bricked — the rollback was theater. reset_connection + wait_for_connection now
force a fresh handshake through the new ruleset; failure aborts the play and the
armed timer reverts. (meta: reset_connection ignores 'when' — benign extra
reconnect on no-op runs; verified idempotent in molecule.)