Adds base__mesh_coordinator_pin (default empty = no-op). When set + base__mesh_enabled,
a lineinfile task writes "<ip> <fqdn>" to /etc/hosts so a managed mesh host survives a
local-DNS hiccup (the 2026-06-18 incident class). FQDN derived from base__mesh_management_url
via regex_replace (no community.general). Gated on base__mesh_enabled | bool and pin length;
the coordinator host (askari/offsite_hosts) stays exempt. Production pin wired for ubongo
(77.42.120.136). Molecule dns_servers fix included (Docker/NetBird DNS incompatibility).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two tasks: a base mesh coordinator-FQDN /etc/hosts pin (Molecule TDD) + the accept-and-document docs (R8, ADR-016 availability amendment, STATUS/ROADMAP). Coordinator backup deferred to ADR-022.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sub-project 3 of the mesh-hardening follow-on. Accepts the single off-site coordinator as a documented availability SPOF (R8 + ADR-016 amendment) given the narrow blast radius (LAN/intra-cluster/local traffic unaffected; only remote relayed mesh access breaks). Hardens the one real gap: a base mesh coordinator-FQDN /etc/hosts pin so managed hosts survive a local-DNS hiccup. Coordinator off-site backup explicitly deferred to an ADR-022 kickoff (no throwaway infra).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Live cutover complete: base INPUT-only default-deny + wt0-primary SSH + permanent WAN break-glass on askari, netbird_coordinator geo-disabled. A real reboot recovered unattended — firewall persisted, Docker forwarding + public services up, coordinator geo-disabled (no FATAL), mesh + both SSH paths back. ROADMAP sub-project 3 (askari redesign) marked DONE; next = relay-SPOF reduction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lets an operator pass extra ansible-playbook args through make without bypassing it — e.g. -e ansible_host=<WAN> to manage a host over a relay-independent path during a cutover that restarts its own mesh relay.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mesh-hardening redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Three tasks:
1. netbird_coordinator: disable geolocation (NB_DISABLE_GEOLOCATION) so a no-egress startup can't FATAL the control plane.
2. inventory: askari INPUT-only nftables default-deny (forward stays accept, Docker-safe) + ubongo's static WAN IP as a permanent SSH break-glass + manage over wt0; no sshd ListenAddress change (no boot-race); WAN :22 deliberately left open.
3. ADR-025 harness: askari_inputonly profile proves reboot-safety on a KVM VM (GREEN). Includes leaseshelper-independent VM-IP discovery (arp fallback) and an Ansible-managed virbr-boma nftables drop-in. A suid-root workaround the first implementer installed was backed out; nothing privileged reintroduced.
Whole-branch review (opus): ready to merge. Task 4 (live cutover) is operator-gated, not in this branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a nftables drop-in (10-libvirt-boma.nft) to base's drop-in dir that
allows traffic on iifname "virbr-boma" in the inet filter input chain.
Fixes DHCP/DNS being dropped by base's default-deny INPUT policy for VMs
on the libvirt integration bridge. Mirrors docker_host's drop-in pattern.
Molecule scenario updated to exercise only the firewall tasks (package
install unavailable in the no-internet Docker container) via include_role
tasks_from; verify asserts the drop-in renders the virbr-boma accept rule.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wait_for_ip now tries --source lease first then --source arp; both produce
identical output handled by parse_lease_ip. Removes the suid leaseshelper
dependency introduced and backed out in Task 3. New unit test confirms
parse_lease_ip works on --source arp output format.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Documents three blockers found while developing the askari_inputonly
integration-test profile:
1. inet filter default-deny silently blocks libvirt dnsmasq DHCP: nftables
multi-table independence means ip filter LIBVIRT_INP accept does NOT
prevent inet filter drop. Diagnosed via strace; fixed with a drop-in.
2. libvirt leaseshelper PID-file: virPidFileReleasePath unlinks the file after
every call; nobody cannot recreate in /run/. Fix: suid root C wrapper.
3. cloud-init rejects underscores in local-hostname → skips network-config
→ no DHCP. Fix: sanitize with replace("_", "-") in meta-data hostname.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three fixes found during askari_inputonly integration-test development:
1. Hostname sanitization: cloud-init rejects underscores in local-hostname
(silently skips network-config → VM never gets DHCP). Sanitize with
name.replace("_", "-") for the meta-data hostname; paths/domain names
keep the original (underscore is valid there).
2. Netplan explicit interface: match.name: en* with a named key produces a
.network file that networkd never DHCPs. Use explicit enp1s0 (all virtio
NICs in these KVM VMs) + renderer: networkd to bypass the bug.
3. ansible_ssh_common_args in the generated hosts.yml: integration VMs
reuse IPs (different VMs at same 192.168.150.x lease). StrictHostKey
accept-new from ansible.cfg blocks changed keys. Add StrictHostKeyChecking=no
+ UserKnownHostsFile=/dev/null per-host to the generated inventory so
stale known_hosts entries never block the apply step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the ADR-025 integration-test profile that proves the askari
mesh-hardening REDESIGN (INPUT-only default-deny, forward ACCEPT for Docker)
is reboot-safe on a throwaway KVM VM before the live cut-over.
Profile applies base (firewall + sshd) and offsite (docker_host +
reverse_proxy). Post-reboot verify checks: input policy drop, forward
policy accept, admin-addr break-glass SSH (192.168.150.1), Docker up,
and a published port answered from the controller. GREEN on 2026-06-19.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four tasks: netbird_coordinator geolocation disable (TDD via Molecule) -> inventory enablement (INPUT-only firewall + WAN break-glass + manage over wt0) -> an askari_inputonly integration profile (the reboot-safety GREEN gate) -> the operator-gated supervised live cutover + STATUS/ROADMAP update. Tasks 1-3 are autonomously implementable; Task 4 is operator-gated (live off-site host, lockout risk).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Mirrors the proven ubongo 2/3 pattern (INPUT-only default-deny, SSH scoped by iifname wt0, no sshd ListenAddress change -> no boot-race) and adds the coordinator-host exception the incident demanded: a permanent non-mesh break-glass (WAN :22 from ubongo's static WAN IP + the Hetzner console), WAN :22 deliberately left open. Folds in the netbird_coordinator geo-DB robustness fix (FRICTION #4) so a transient egress blip can't FATAL the control plane. Harness-GREEN gate before a supervised live cutover.
Operator decision (2026-06-19): do this redesign first, then a separate sub-project to reduce askari's SPOF role.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After an operator reboot of ubongo, verified live that the INPUT-only default-deny ruleset re-applied on boot (input chain policy drop + the full wt0/ssh-from-control/admin-addr allow-list), the wt0 mesh came back (Management+Signal Connected), and both SSH paths recovered clean. Closes the 'real-host reboot validation pending' item for mesh-hardening 2/3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sub-project 2 of the mesh-hardening follow-on. base gains base__firewall_input_only
(forward-policy knob) + base__firewall_admin_addrs; enabled on ubongo (INPUT-only
default-deny). 'be ubongo' integration profile + profile-aware verify, plus two
harness fixes found by running it (virt-install venv-PATH hijack; nft priority
format). Applied + live-verified on the real ubongo; real-host reboot validation
pending (low-risk). FRICTION: VM-testing standard, libvirt stale-session,
Docker-nat-flush, Molecule coverage gap.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base firewall applied + live-verified on ubongo (INPUT-only default-deny;
base__firewall_input_only). Records the Docker-nat-flush caveat (needs a restart
docker on a Docker host), the claude self-SSH grant, and reboot-validation-pending.
ROADMAP: sub-project 2 done; remaining = NetBird ACL + askari redesign.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Applying base's nftables (even INPUT-only/forward-accept) to a Docker host
flushes Docker's ip nat -> container egress breaks until 'systemctl restart
docker'. Found on the ubongo mesh-hardening 2/3 live cutover; the Docker-less
test VM couldn't surface it. Self-heals on reboot (dockerd re-adds nat;
forward=accept doesn't block). Runbook/docker_host follow-ups noted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final-review finding: the default Molecule scenario only renders the forward
drop (input_only off) branch; the accept branch is covered by the integration
harness only. Tracked for a kaizen decision (2nd scenario vs accept the split).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two signals from running the ubongo harness gate: (1) the operator wants a
standard pre-authorising isolated VM integration tests on ubongo so the agent
doesn't ask each time; (2) a stale agent session (shell predating the
integration_test libvirt-group grant) carries stale process groups, so the
harness's qemu-img/file writes are denied -> run via 'sg libvirt -c ...';
self-heal idea noted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`nft list ruleset` prints the symbolic chain priority (`filter` = 0); the ubongo
profile asserted `priority 0` (the rendered-file format the Molecule scenario
checks), so the live-ruleset assertion failed even though the firewall was
correct. Assert `priority filter` for the input/forward policy lines. Caught by
the harness GREEN gate (`make test-integration HOST=ubongo`).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Makefile prepends .venv/bin to PATH (so the venv's ansible tools resolve),
but virt-install's `#!/usr/bin/env python3` shebang then resolved to the
isolated venv, which lacks system PyGObject (gi) -> ModuleNotFoundError. Strip
.venv/bin from PATH for the virt-install call so its shebang finds
/usr/bin/python3 (which has gi); ansible runs via its absolute .venv path and is
unaffected. Surfaced running `make test-integration HOST=ubongo`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A control-group VM that applies base with INPUT-only default-deny (forward
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
integration_profile marker — the askari Docker/DNAT block is gated, and a ubongo
block asserts input drop + forward accept + the admin-addr rule. Enables
`make test-integration HOST=ubongo`. Mesh-hardening 2/3 (ADR-025).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
raw leases, backstopped by wt0). Mesh-hardening 2/3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Allow a second operator workstation (10.20.10.17) onto ubongo's LAN SSH
alongside mamba (10.20.10.50). Both are raw DHCP leases; recorded a FRICTION
open signal to replace them with MAC-pinned OPNsense reservations when
OPNsense-as-code lands (ADR-020 / TODO 3.5).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Five tasks: base knobs (input-only forward policy + admin-addr SSH allow,
TDD via Molecule) → enable on the control group → a 'be ubongo' integration
profile (profile-aware verify) → the real-VM harness GREEN gate → the
operator-supervised live cutover (signal-6 order, physical-console break-glass).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sub-project 2 of the mesh-hardening follow-on (the post-incident roadmap
ordering puts ubongo first). Harden the control node's inbound surface via
base's nftables firewall as INPUT-only default-deny: the forward chain stays
permissive (new base__firewall_input_only knob) so Docker egress + the
libvirt-NAT integration harness keep working, and there is no sshd ListenAddress
change — sidestepping the ip_nonlocal_bind boot-race that sank askari. SSH
allowed from wt0, ssh-from-control (Ansible self), and mamba on the LAN (new
base__firewall_admin_addrs). Harness-validated before an operator-supervised
cutover; the physical console is the permanent break-glass.
Design maps to the four relevant 2026-06-17 incident lessons (FRICTION signals
1/2/3/6).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the 2026-06-18 incident class: a road-warrior laptop losing DNS on a network transition strands NetBird (can't resolve the coordinator FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and why ubongo is relay-only (deferred mesh-hardening, not a bug) — including the break-glass rule that hardening must preserve.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A stdlib driver (scripts/integration-vm.py) boots throwaway KVM VMs on ubongo mirroring a real host, applies the real playbooks, performs a real reboot, and asserts outcomes - catching the reboot/firewall/Docker class Molecule cannot. Validated end-to-end on real hardware: RED->GREEN acceptance passed (reproduced the 2026-06-17 incident, then proved the docker_host container-forward drop-in survives reboot). Also: claude AI-worker granted NOPASSWD sudo (reverses ADR-015 no-local-sudo; ADR-015/021 + accepted-risk R7, codified in base); 9 shakedown findings in FRICTION.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The local-VM integration harness RED→GREEN acceptance passed on real hardware
(2026-06-18): a KVM VM on ubongo reproduced the 2026-06-17 nftables/Docker reboot
breakage (RED) and survived with the docker_host container-forward drop-in (GREEN).
ADR-025: Status updated to PASSED; shakedown learnings section added (UEFI boot
required, claude sudo load-bearing); ADR-021 added to Related.
STATUS.md: integration-harness section updated from PENDING to PASSED; ubongo
entry updated to reflect claude NOPASSWD sudo + sjat-ansible NOPASSWD removal;
last-reviewed date updated.
docs/TODO.md: item 2.4 collapsed to one-line pointer per the file's convention.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The integration-testing shakedown reversed ADR-015's "no local sudo" sub-decision:
the claude AI-worker now has NOPASSWD:ALL sudo on ubongo — without it, virsh,
nft, and journalctl all block during VM diagnosis. Compensating controls:
password-locked account, auditd/Loki attribution, repo-managed revocable drop-in.
ADR-015: dated amendment note in Status + expanded AI-worker identity section.
ADR-021: new §Sudo model (amendment 2026-06-18) — claude=NOPASSWD, sjat=password
required; former sjat NOPASSWD drop-in removed 2026-06-18 (least-privilege cleanup).
accepted-risks.md: R7 added (claude NOPASSWD:ALL on ubongo); last-reviewed updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add base__ai_worker_user var (default empty), a new operational_access.yml
task file that drops a validated sudoers file for the named user, and wire it
into base/tasks/main.yml after the hardening includes under the `users` tag.
Set base__ai_worker_user: claude in group_vars/control so that applying base
to ubongo is idempotent with the manual /etc/sudoers.d/claude-ai-worker drop-in
already in place. Password remains locked; NOPASSWD is the only sudo path;
actions are attributed via auditd (ADR-021).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Accept caddy's 308 on :80 as proof the DNAT+forward path is alive; don't follow into https (tls internal has no cert for a bare-IP request). This load-bearing end-to-end check is what caught the br-+/br-* nftables-wildcard bug that the string-presence assert missed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base's inet-filter forward chain is policy-drop; on a Docker host that kills published-port DNAT + inter-container forwarding ON REBOOT (nftables loads default-deny before dockerd). This drop-in (loaded via base's /etc/nftables.d/*.nft include at boot) appends the container-bridge accepts so a rebooted Docker host keeps forwarding. Resolves FRICTION 2026-06-17 #1 and the GREEN half of ADR-025's acceptance test. NB nftables wildcard is br-*, not the iptables br-+.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Running the harness leaves tests/integration/.run/ (gitignored, generated); exclude it from yamllint + ansible-lint so a post-run 'make lint' passes. Also emit a --- doc-start in the generated inventory.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
base's default-deny firewall would drop the driver's post-reboot SSH from the libvirt NAT gateway; set base__firewall_control_addr to the gateway (by source IP, interface-independent).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cloud-init package_update:true + block on 'cloud-init status --wait' in up() so apply sees populated apt lists (fresh genericcloud images ship empty lists); dump_diagnostics()/console() read the root:0600 serial log via sudo instead of shutil.copy, which raised PermissionError mid-diagnostics.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The tls-internal/acme_ca knobs used {%- -%} trims validated only against raw jinja2; ansible (trim_blocks=True) double-stripped newlines and collapsed the Caddyfile onto single lines, crash-looping caddy. Match the role's existing plain {% %} style.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Debian 13 genericcloud image triple-faults at the legacy real-mode kernel
handoff under SeaBIOS/q35 (boot-loops at GRUB, no 'Decompressing Linux', no DHCP
lease). Booting via UEFI (OVMF -> efistub) bypasses the legacy entry and boots
cleanly: cloud-init runs, DHCP lease obtained, SSH reachable. Verified end-to-end.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>