Redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Mirrors the proven ubongo 2/3 pattern (INPUT-only default-deny, SSH scoped by iifname wt0, no sshd ListenAddress change -> no boot-race) and adds the coordinator-host exception the incident demanded: a permanent non-mesh break-glass (WAN :22 from ubongo's static WAN IP + the Hetzner console), WAN :22 deliberately left open. Folds in the netbird_coordinator geo-DB robustness fix (FRICTION #4) so a transient egress blip can't FATAL the control plane. Harness-GREEN gate before a supervised live cutover.
Operator decision (2026-06-19): do this redesign first, then a separate sub-project to reduce askari's SPOF role.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After an operator reboot of ubongo, verified live that the INPUT-only default-deny ruleset re-applied on boot (input chain policy drop + the full wt0/ssh-from-control/admin-addr allow-list), the wt0 mesh came back (Management+Signal Connected), and both SSH paths recovered clean. Closes the 'real-host reboot validation pending' item for mesh-hardening 2/3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base firewall applied + live-verified on ubongo (INPUT-only default-deny;
base__firewall_input_only). Records the Docker-nat-flush caveat (needs a restart
docker on a Docker host), the claude self-SSH grant, and reboot-validation-pending.
ROADMAP: sub-project 2 done; remaining = NetBird ACL + askari redesign.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Applying base's nftables (even INPUT-only/forward-accept) to a Docker host
flushes Docker's ip nat -> container egress breaks until 'systemctl restart
docker'. Found on the ubongo mesh-hardening 2/3 live cutover; the Docker-less
test VM couldn't surface it. Self-heals on reboot (dockerd re-adds nat;
forward=accept doesn't block). Runbook/docker_host follow-ups noted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final-review finding: the default Molecule scenario only renders the forward
drop (input_only off) branch; the accept branch is covered by the integration
harness only. Tracked for a kaizen decision (2nd scenario vs accept the split).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two signals from running the ubongo harness gate: (1) the operator wants a
standard pre-authorising isolated VM integration tests on ubongo so the agent
doesn't ask each time; (2) a stale agent session (shell predating the
integration_test libvirt-group grant) carries stale process groups, so the
harness's qemu-img/file writes are denied -> run via 'sg libvirt -c ...';
self-heal idea noted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Allow a second operator workstation (10.20.10.17) onto ubongo's LAN SSH
alongside mamba (10.20.10.50). Both are raw DHCP leases; recorded a FRICTION
open signal to replace them with MAC-pinned OPNsense reservations when
OPNsense-as-code lands (ADR-020 / TODO 3.5).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Five tasks: base knobs (input-only forward policy + admin-addr SSH allow,
TDD via Molecule) → enable on the control group → a 'be ubongo' integration
profile (profile-aware verify) → the real-VM harness GREEN gate → the
operator-supervised live cutover (signal-6 order, physical-console break-glass).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sub-project 2 of the mesh-hardening follow-on (the post-incident roadmap
ordering puts ubongo first). Harden the control node's inbound surface via
base's nftables firewall as INPUT-only default-deny: the forward chain stays
permissive (new base__firewall_input_only knob) so Docker egress + the
libvirt-NAT integration harness keep working, and there is no sshd ListenAddress
change — sidestepping the ip_nonlocal_bind boot-race that sank askari. SSH
allowed from wt0, ssh-from-control (Ansible self), and mamba on the LAN (new
base__firewall_admin_addrs). Harness-validated before an operator-supervised
cutover; the physical console is the permanent break-glass.
Design maps to the four relevant 2026-06-17 incident lessons (FRICTION signals
1/2/3/6).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the 2026-06-18 incident class: a road-warrior laptop losing DNS on a network transition strands NetBird (can't resolve the coordinator FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and why ubongo is relay-only (deferred mesh-hardening, not a bug) — including the break-glass rule that hardening must preserve.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The local-VM integration harness RED→GREEN acceptance passed on real hardware
(2026-06-18): a KVM VM on ubongo reproduced the 2026-06-17 nftables/Docker reboot
breakage (RED) and survived with the docker_host container-forward drop-in (GREEN).
ADR-025: Status updated to PASSED; shakedown learnings section added (UEFI boot
required, claude sudo load-bearing); ADR-021 added to Related.
STATUS.md: integration-harness section updated from PENDING to PASSED; ubongo
entry updated to reflect claude NOPASSWD sudo + sjat-ansible NOPASSWD removal;
last-reviewed date updated.
docs/TODO.md: item 2.4 collapsed to one-line pointer per the file's convention.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The integration-testing shakedown reversed ADR-015's "no local sudo" sub-decision:
the claude AI-worker now has NOPASSWD:ALL sudo on ubongo — without it, virsh,
nft, and journalctl all block during VM diagnosis. Compensating controls:
password-locked account, auditd/Loki attribution, repo-managed revocable drop-in.
ADR-015: dated amendment note in Status + expanded AI-worker identity section.
ADR-021: new §Sudo model (amendment 2026-06-18) — claude=NOPASSWD, sjat=password
required; former sjat NOPASSWD drop-in removed 2026-06-18 (least-privilege cleanup).
accepted-risks.md: R7 added (claude NOPASSWD:ALL on ubongo); last-reviewed updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- New docs/runbooks/integration-testing.md: when to use (firewall/
sshd/boot/Docker changes); make test-integration commands; lower-
level driver sub-commands; cert tier guidance; diagnostics dir;
VM inspection (virsh console / SSH); safety invariants; resource
constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying
lockout-risky changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- ADR-008: add reboot-survivability gap row + ADR-025 pointer to the
"not tested in Molecule" table
- ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs
(ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing
- accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records)
- CLAUDE.md: add make test-integration[/-clean] to key-commands;
add ADR-025 + runbook rows to further-reading
- hardware/reference.md: note one ephemeral KVM test VM on ubongo
- STATUS.md: add integration harness entry (built, lint+pytest clean;
RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Accepted decision to implement ADR-008 Level 2/3 on ubongo via
libvirt/KVM directly: throwaway VM overlays, stdlib-only driver,
tiered cert fidelity, three safety invariants. Addresses the
2026-06-17 mesh-hardening incident's reboot-survivability gap.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host (real Docker, real reboot, real role apply) to catch the reboot/firewall/boot-order class Molecule cannot - the 2026-06-17 mesh-hardening incident. First profile: be askari; tiered certs (internal + le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008 Level 2/3; to be recorded as ADR-025.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the 2026-06-17 mesh-hardening incident: Molecule can't catch
reboot/firewall-x-Docker/boot-order bugs — build local-VM pre-deploy testing
on ubongo (ADR-008 Level 2/3). And a smooth screenshot hand-off for the agent
during incidents.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race,
coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no
off-site coordinator backup, and reboot-tested-after-removing-break-glass.
For the next /kaizen + the mesh-hardening re-spec.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5 tasks: base sshd ListenAddress+ip_nonlocal_bind (Molecule), firewall public
zone + askari catalog, inventory wt0 override, TF retire WAN :22, then the live
operator-supervised staged cutover.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Decomposes the M5 mesh-hardening follow-on into 3 independent sub-specs; this
is sub-project 1. Three-layer SSH-on-wt0 (sshd ListenAddress=mesh + nftables
iifname wt0 + retire the Hetzner WAN :22), ip_nonlocal_bind to beat the
post-boot wt0 bind race (fail-closed), live wt0 fact for the listen addr,
staged cutover with the firewall auto-rollback as the safety gate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Migrate the single-file-bind-mount/stale-config gotcha (reload-in-place needs a
directory mount; restart-based roles don't) to docs/testing/gotchas.md, and move
all 7 open signals out of FRICTION.md's Open-signals section into the new
2026-06-17 decisions-ledger block: all consumed, 1 PARK (the ubongo
self-management gap, tracked in STATUS), 0 REMOVE. Relax test_load_signals to
accept an empty Open-signals section (the goal state after a kaizen pass).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/registry-login.sh reads vault.forgejo.registry_token and pipes it to
docker login --password-stdin (never echoed, never on argv); 'make registry-login'
wires it with the venv binaries. Adds the operator-minted CHANGEME vault stub
(fill via make edit-vault) and a per-machine prereq note in the claude-code-setup
runbook, so 'make caddy-image-push'/'molecule-image-push' become agent-completable
non-interactively. Consumes the 2026-06-15 signal in docs/FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mamba + work laptop enrolled in the mesh → ubongo reachable from anywhere; the
mobile-access goal is met and Phase 1 (remote access) is complete. Adds
docs/runbooks/netbird-client.md (reusable client-enrollment runbook) + STATUS/
ROADMAP flips + CLAUDE.md reading-table entry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
I re-surfaced two already-settled decisions as questions (push to origin; subagent
vs inline) at the M5 handoff. The existing execution-mode guard only matches the
writing-plans menu's literal text, so free-form prose re-asks slip through. Default:
push as backup and go subagent-driven without asking.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base 'mesh' concern enrols NetBird agents on ubongo + askari via a reusable scoped
setup key (vault); laptops enrolled by the operator. Reachability via the default
peer policy; the base nftables default-deny on ubongo + ACL tightening are deferred
to a follow-on. Resolves ROADMAP M5 design; next: writing-plans.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Building images is fully automatable; pushing to the Forgejo registry needs an
interactive docker login, and registry creds aren't in vault — so an agent can't
complete a push. Captured for the next kaizen review.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ADR-024 Status/Consequences, STATUS.md, ROADMAP M4a, and the FRICTION ledger now
record that the DNS-01 path is built and proven, with the root cause of the M4a
failure (version skew: pre-Bearer libdns/gandi sent the deprecated Apikey header;
plus building on a Hetzner IP). Traefik was reconsidered and rejected again — lego's
Gandi provider has the same PAT-vs-Apikey question, so it would not have helped.
Dated review reports and spec/plan snapshots are left as historical records.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TODO had accreted multi-line DECIDED/DONE summaries duplicating the ADRs they
cite. Collapsed every done item to a one-line "~~task~~ -> ADR-NNN" pointer and
added an "open items only" convention note up top. Item numbers are stable
cross-references (ROADMAP/STATUS/ADRs/scripts cite them) so they are PRESERVED,
not renumbered — verified all externally-referenced numbers survive. 176->136 lines.
No new ledger: the record already lives in the ADRs / STATUS.md / FRICTION ledger.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Curate-only consume pass over FRICTION.md Open signals: interactive guided
session, add/change/park/remove verdicts (park-with-resurrection-trigger to
protect out-of-phase tooling on a solo project), single source = FRICTION.md,
ledger is the durable record. Mirrors /review-repo (command md + stdlib scanner).
Stage 1 on-demand + stage-2 nudge; headless/cron deferred (TODO 11.3).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per brainstorm: ship the on-demand command + recurrence/age nudge first;
revisit a scheduled headless (report-only) run once ntfy + scheduled jobs exist.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three new Open signals: ansible-lint no-role-prefix vs ADR-021/022 access__/
backup__ conventions (first service role); Molecule tag-propagation now testable
via tagged converge + full-then-partial; ADRs over-claiming cross-doc reconciliation
(repo-scan check candidate, cousin of stale-deferred).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- ADR-007: document ubongo on the legacy V4 net at 10.20.10.151 (transitional,
outside the planned srv /24 until the LAN is re-cut) (O10); single authoritative
boma.baobab.band -> boma.wingu.me transition note already added earlier
- terraform tfvars.example + variables.tf (both envs): pve01 -> pve0 and
<host>.boma.baobab.band per ADR-007 naming (O11)
- ADR-012/013/015/016/017/018: convert "See also:" prose to `## Related` sections
placed after Consequences, matching ADR-014/019-023 (O13)
- docs/README + inventories/README: list the missing subdirs / offsite_hosts +
offsite.yml merge behaviour (O14, O29 note)
- ADR-009: drop the retired `nyumbani` example; use vaultwarden.wingu.me split-horizon (O19)
- ROADMAP M2: askari shipped as cx23/x86 (CAX11/ARM out of stock) (O20)
- ADR-020: 80/443/3478 opened in M4a (past tense); coordinator role is M4b (O21)
- netbird -> netbird_coordinator across ROADMAP M4b, the M4b plan, ADR-024 (O23)
- ADR-024: align the M1 DNS-01 wildcard scope wording with ROADMAP (O24)
- capacity-scan.py: read the inventory directory so offsite.yml (askari) is seen (O28)
- tf_to_inventory.py: generated header now warns it overwrites the manual control node (O9)
- tests/tags.yml: proxy concern comment Traefik -> Caddy (missed in the O3 sweep)
O9's existing stub hosts.yml header stays as-is (generator-owned, hook-protected);
the fix lives in the generator for the next regeneration. make lint + pytest (57) green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file
table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope
ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)
O9 (hosts.yml header) left open: the file is generator-owned (hook-protected);
fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture NetBird's configure.sh reference for a pinned version → translate into
boma role templates (compose + management.json + dex/openid + turnserver),
external-proxy mode behind the M4a Caddy (netbird.askari.wingu.me). First service
role: full ADR-004 standard files; secrets generated/CHANGEME-stubbed (setup key
for M5). Gated live deploy + verify.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds ADR-024 pinning Caddy (xcaddy + caddy-dns/gandi) as boma's reverse
proxy, superseding the soft Traefik assumption in the roadmap and ADR-017
prose. Updates CLAUDE.md Further reading table and ROADMAP.md Phase-2 step 5.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First of M4's two build phases: docker_host (Docker engine), custom xcaddy Caddy
image (caddy-dns/gandi), reverse_proxy role (Caddyfile from a route catalog,
DNS-01 wildcard cert for *.askari.wingu.me via vault.gandi.pat), ADR-024 (Caddy is
boma's reverse proxy), firewall 80/443 + DNS, proven by serving a test route over
TLS. M4b (NetBird) follows, reading NetBird's current self-host compose then.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Caddy becomes boma's standard reverse proxy (amends the soft Traefik assumption;
new ADR) with Gandi DNS-01 certs (custom xcaddy image, reuses vault.gandi.pat) —
the only cert path for mesh/LAN-only services. NetBird self-hosted in
external-proxy mode (embedded Dex), compose rendered from boma templates
(ADR-004/013). Three roles: docker_host (first real content), reverse_proxy (new,
Caddy), netbird (first service role w/ full ADR-004 standard files). Firewall +
DNS amendments; backup execution deferred (fisi). caddy-dns/gandi + NetBird
self-host facts verified.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>