Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host (real Docker, real reboot, real role apply) to catch the reboot/firewall/boot-order class Molecule cannot - the 2026-06-17 mesh-hardening incident. First profile: be askari; tiered certs (internal + le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008 Level 2/3; to be recorded as ADR-025.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the 2026-06-17 mesh-hardening incident: Molecule can't catch
reboot/firewall-x-Docker/boot-order bugs — build local-VM pre-deploy testing
on ubongo (ADR-008 Level 2/3). And a smooth screenshot hand-off for the agent
during incidents.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race,
coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no
off-site coordinator backup, and reboot-tested-after-removing-break-glass.
For the next /kaizen + the mesh-hardening re-spec.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Incident 2026-06-17: applying base's nftables default-deny (forward policy drop)
to askari — a Docker host — broke container forwarding/NAT on reboot, and the
wt0-only sshd ListenAddress left no break-glass (ip_nonlocal_bind did NOT beat
the boot race). Recovery: disable nftables + restart docker (restore the wiped
NAT masquerade) + force-recreate the coordinator (it FATAL-looped unable to
download its GeoLite2 DB with no egress) -> mesh re-formed.
Back out the enablement so a future deploy can't re-break askari:
- offsite_hosts: base__ssh_listen_mesh_only=false, base__firewall_apply=false
- remove host_vars/askari.yml (manage over the WAN again, not wt0)
- tf/offsite: re-open WAN :22 to ubongo only (break-glass; already applied)
askari now: sshd on all interfaces (Ansible-managed), nftables disabled, WAN :22
open -> stable + reboot-survivable. The base feature code (sshd ListenAddress
option, firewall public zone) stays; it's just not enabled on Docker hosts.
Mesh-hardening 1/3 to be re-spec'd before any retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).
Environmental checkpoint applied: the molecule-debian13 container image lacks
procps (no sysctl binary). Added molecule/default/prepare.yml to install procps
and sysctls: {net.ipv4.ip_nonlocal_bind: "0"} to molecule.yml platform so the
ansible.posix.sysctl task can write and read back the value hermetically.
Sysctl file format is net.ipv4.ip_nonlocal_bind=1 (no spaces); verify.yml
grep pattern updated to match ansible.posix.sysctl's actual output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5 tasks: base sshd ListenAddress+ip_nonlocal_bind (Molecule), firewall public
zone + askari catalog, inventory wt0 override, TF retire WAN :22, then the live
operator-supervised staged cutover.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Decomposes the M5 mesh-hardening follow-on into 3 independent sub-specs; this
is sub-project 1. Three-layer SSH-on-wt0 (sshd ListenAddress=mesh + nftables
iifname wt0 + retire the Hetzner WAN :22), ip_nonlocal_bind to beat the
post-boot wt0 bind race (fail-closed), live wt0 fact for the listen addr,
staged cutover with the firewall auto-rollback as the safety gate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Migrate the single-file-bind-mount/stale-config gotcha (reload-in-place needs a
directory mount; restart-based roles don't) to docs/testing/gotchas.md, and move
all 7 open signals out of FRICTION.md's Open-signals section into the new
2026-06-17 decisions-ledger block: all consumed, 1 PARK (the ubongo
self-management gap, tracked in STATUS), 0 REMOVE. Relax test_load_signals to
accept an empty Open-signals section (the goal state after a kaizen pass).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/registry-login.sh reads vault.forgejo.registry_token and pipes it to
docker login --password-stdin (never echoed, never on argv); 'make registry-login'
wires it with the venv binaries. Adds the operator-minted CHANGEME vault stub
(fill via make edit-vault) and a per-machine prereq note in the claude-code-setup
runbook, so 'make caddy-image-push'/'molecule-image-push' become agent-completable
non-interactively. Consumes the 2026-06-15 signal in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
guard-vault-preflight: block a locked 'git commit' only when the staged set
(git diff --cached, plus -a/--all) contains ansible content matching the
pre-commit ansible-lint hook's files: scope. Docs-/config-only commits never
trigger that hook, so they no longer need the vault — fixing the false block on
docs-only commits. Fails safe to block when unsure.
guard-execution-mode-menu: widen the execution-mode arm to also catch free-form
prose re-asks of the subagent-vs-inline choice ('which execution approach?',
'subagent vs inline', ...), which the literal-menu matcher missed; the push
re-ask is intentionally left to the dont-reask-settled-defaults memory.
Consumes two 2026-06-17 signals in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add check_mode: false to the state:directory base_dir tasks so that 'make check'
on a brand-new compose service role creates the scaffold during --check and the
rest of the dry-run (templates + docker_compose_v2 up) can be evaluated instead
of failing on a missing project_src. The directive is inert under a normal
converge (incl. Molecule + its tagged second converge), so role tests are
unchanged. Consumes the 2026-06-16 signal in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a numbered ADR announces a rename Old->New, flag design-doc lines where
Old still appears in present tense — skipping the announcing ADR, lines that
also name New, and historical/negation cues, and rejecting ADR-NNN tokens as
terms. Structural cousin of stale-deferred; run by /review-repo. Zero findings
on the current tree (the Traefik->Caddy ripple edits have landed). Consumes the
2026-06-14 KEEP-OPEN signal in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds .claude/statusline.sh (reads context_window.used_percentage +
context_window_size straight from the statusLine JSON; green<70/yellow/red
bar) and wires it via .claude/settings.json statusLine. Committed in-repo so
it follows boma to any clone, matching how .claude/ already tracks hooks +
plugins.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mamba + work laptop enrolled in the mesh → ubongo reachable from anywhere; the
mobile-access goal is met and Phase 1 (remote access) is complete. Adds
docs/runbooks/netbird-client.md (reusable client-enrollment runbook) + STATUS/
ROADMAP flips + CLAUDE.md reading-table entry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Operator replaced the CHANGEME with a real reusable scoped setup key via
make edit-vault (re-encrypted in place). Encrypted ciphertext only; no plaintext.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vault.netbird.setup_key: CHANGEME (operator mints a reusable scoped key after the
dashboard /setup). base__mesh_enabled: true for control (ubongo) + offsite_hosts
(askari) so the base 'mesh' concern enrols them. Enrollment only — no firewall change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Converge enables mesh with base__mesh_manage:false (+ dummy key) so the include
path runs hermetically; verify asserts netbird is not installed — proving the
concern is a clean no-op when the live actions are gated off. Existing firewall/
ssh/fail2ban assertions unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
I re-surfaced two already-settled decisions as questions (push to origin; subagent
vs inline) at the M5 handoff. The existing execution-mode guard only matches the
writing-plans menu's literal text, so free-form prose re-asks slip through. Default:
push as backup and go subagent-driven without asking.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base 'mesh' concern enrols NetBird agents on ubongo + askari via a reusable scoped
setup key (vault); laptops enrolled by the operator. Reachability via the default
peer policy; the base nftables default-deny on ubongo + ACL tightening are deferred
to a follow-on. Resolves ROADMAP M5 design; next: writing-plans.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README/SECURITY said gRPC was path-matched (/management.ManagementService/* etc.);
the deployed Caddy route selects gRPC by Content-Type: application/grpc* (NetBird's
own external-proxy example). Reconciled the prose to what actually runs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Caddyfile was bind-mounted as a single file. ansible.builtin.template writes
atomically (temp + rename), so a re-render swaps the file's inode while the running
container keeps the old one — `caddy reload` then re-read stale config and silently
no-op'd ("config is unchanged"), so new routes never loaded. Surfaced deploying the
NetBird route: Caddy never requested its cert. Fix: render to ./caddy/Caddyfile and
mount the ./caddy DIRECTORY at /etc/caddy — directory mounts reflect inode swaps, so
graceful `caddy reload` works. Proven on askari: atomic replace in the host dir is
visible inside the running container.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NetBird joins the boma Docker network that reverse_proxy creates, so it's
ordered last. Carries its netbird_coordinator role-name tag (check-tags).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author the four ADR-mandated service-role docs for netbird_coordinator and
add the cross-role access__*/backup__* data (ADR-021/022). First stateful
service: backup__state=true; off-site capture pending the fisi pull node.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Self-generated random values for the NetBird coordinator: auth_secret (relay/JWT
shared secret) and datastore_key (SQLite store encryption, base64 32 bytes with
padding). Wired into roles/netbird_coordinator config.yaml via vault.netbird.*.
No CHANGEME — both are agent-generatable (not operator-supplied). The M5 peer
setup key is a runtime dashboard artifact, added to vault when M5 wires it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First real service role. NetBird v0.72.4 self-hosted control plane: single
netbirdio/netbird-server:0.72.4 (management + signal + relay + STUN + embedded
Dex) plus netbirdio/dashboard:v2.39.0, both on the shared boma Docker network so
the M4a Caddy fronts them. Renders docker-compose.yml + config.yaml (secrets from
vault.netbird.*, no_log) + dashboard.env. STUN 3478/udp host-exposed; everything
else via the proxy. netbird_coordinator__manage gates the compose-up for Molecule.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Building images is fully automatable; pushing to the Forgejo registry needs an
interactive docker login, and registry creds aren't in vault — so an agent can't
complete a push. Captured for the next kaizen review.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ADR-024 Status/Consequences, STATUS.md, ROADMAP M4a, and the FRICTION ledger now
record that the DNS-01 path is built and proven, with the root cause of the M4a
failure (version skew: pre-Bearer libdns/gandi sent the deprecated Apikey header;
plus building on a Hetzner IP). Traefik was reconsidered and rejected again — lego's
Gandi provider has the same PAT-vs-Apikey question, so it would not have helped.
Dated review reports and spec/plan snapshots are left as historical records.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a per-instance DNS-01 mode to the Caddy role for mesh/LAN-only hosts that
cannot satisfy HTTP-01. Default behaviour (vanilla caddy:2 + HTTP-01, what askari
runs) is unchanged.
- reverse_proxy__acme_dns_provider: "" (HTTP-01) | "gandi" (DNS-01)
- reverse_proxy__image: override to the custom caddy-gandi image for DNS-01
- Caddyfile gains a global `acme_dns gandi {env.GANDI_BEARER_TOKEN}` block
- the PAT (vault.gandi.pat) renders into a host-only 0600 env file (no_log),
loaded by compose only when DNS-01 is enabled
Verified: the custom image issues a real wildcard cert (*.dns01test.wingu.me)
end-to-end against LE staging via Gandi DNS-01; `caddy validate` accepts
`acme_dns gandi` on the custom image and rejects it on vanilla caddy:2. Molecule
(HTTP-01 default path) green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compiles caddy-dns/gandi v1.1.0 into Caddy v2.11.4 via xcaddy so mesh/LAN-only
hosts (no public A-record) can issue certs via ACME DNS-01. Pinned per ADR-011/014.
The M4a attempt failed for two reasons, both addressed here:
- built on a Hetzner IP -> Google's Go module proxy 403s those ranges. The
Makefile target is documented to build on ubongo, then push to Forgejo.
- older libdns/gandi sent Gandi's deprecated Apikey header. v1.1.0 sends the
PAT as Authorization: Bearer to api.gandi.net/v5/livedns.
make caddy-image / caddy-image-push mirror the molecule-image targets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TODO had accreted multi-line DECIDED/DONE summaries duplicating the ADRs they
cite. Collapsed every done item to a one-line "~~task~~ -> ADR-NNN" pointer and
added an "open items only" convention note up top. Item numbers are stable
cross-references (ROADMAP/STATUS/ADRs/scripts cite them) so they are PRESERVED,
not renumbered — verified all externally-referenced numbers survive. 176->136 lines.
No new ledger: the record already lives in the ADRs / STATUS.md / FRICTION ledger.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Add REPO_DIRS constant; still_exists now only checks tokens that start
with a known repo top-level dir, ignoring plugin names (caddy-dns/gandi),
make command fragments (tf-init/plan), and role-relative paths.
- Add test_still_exists_ignores_non_repo_tokens (was failing before fix).
- Add test_nudge_line_overdue_on_age to close coverage gap on age threshold.
- Add load_signals docstring.
- Replace manual --today date parsing with datetime.date.fromisoformat type
converter so malformed dates give a clean argparse error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>