Commit graph

373 commits

Author SHA1 Message Date
a483f4e55c fix: address whole-branch review (anchor pin regexp, ADR-016 backup note, verify comment)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:41:19 +02:00
c09b7fe6a5 docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:34:21 +02:00
74e54b359b fix(base): confine /etc/hosts unsafe-write fallback to the Docker Molecule env
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:31:15 +02:00
f83d68d7a0 feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)
Adds base__mesh_coordinator_pin (default empty = no-op). When set + base__mesh_enabled,
a lineinfile task writes "<ip> <fqdn>" to /etc/hosts so a managed mesh host survives a
local-DNS hiccup (the 2026-06-18 incident class). FQDN derived from base__mesh_management_url
via regex_replace (no community.general). Gated on base__mesh_enabled | bool and pin length;
the coordinator host (askari/offsite_hosts) stays exempt. Production pin wired for ubongo
(77.42.120.136). Molecule dns_servers fix included (Docker/NetBird DNS incompatibility).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-20 11:22:40 +02:00
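The FQDN derivation and gating described in f83d68d7a0 can be sketched in Python. This is only an illustration: the role's actual regex_replace pattern is not shown in the log, so the pattern below and the function names `coordinator_fqdn` and `hosts_pin_line` are assumptions, as are the example URL and address.

```python
import re
from typing import Optional


def coordinator_fqdn(management_url: str) -> str:
    """Strip scheme, port and path from a URL, leaving only the host name.

    Hypothetical pattern; the role's real regex_replace is not shown.
    """
    return re.sub(r"^[a-z][a-z0-9+.-]*://([^/:]+).*$", r"\1", management_url)


def hosts_pin_line(pin_ip: str, management_url: str, mesh_enabled: bool) -> Optional[str]:
    """Return the "<ip> <fqdn>" line for /etc/hosts, or None when the pin is a no-op.

    Mirrors the gating in the commit message: the mesh must be enabled
    and the pin must be non-empty.
    """
    if not mesh_enabled or not pin_ip:
        return None
    return f"{pin_ip} {coordinator_fqdn(management_url)}"
```

With an empty pin (the stated default) the function returns None, which corresponds to the task being skipped.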
0286c78f36 docs(plan): mesh-hardening SPOF — accept + DNS-resilience implementation plan
Two tasks: a base mesh coordinator-FQDN /etc/hosts pin (Molecule TDD) + the accept-and-document docs (R8, ADR-016 availability amendment, STATUS/ROADMAP). Coordinator backup deferred to ADR-022.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:49:26 +02:00
3ba22d199a docs(spec): mesh-hardening SPOF — accept single-coordinator SPOF + DNS-resilience pin
Sub-project 3 of the mesh-hardening follow-on. Accepts the single off-site coordinator as a documented availability SPOF (R8 + ADR-016 amendment) given the narrow blast radius (LAN/intra-cluster/local traffic unaffected; only remote relayed mesh access breaks). Hardens the one real gap: a base mesh coordinator-FQDN /etc/hosts pin so managed hosts survive a local-DNS hiccup. Coordinator off-site backup explicitly deferred to an ADR-022 kickoff (no throwaway infra).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:42:19 +02:00
f10fe8bb60 docs(status): mesh-hardening askari redesign applied + live reboot-validated (2026-06-20)
Live cutover complete: base INPUT-only default-deny + wt0-primary SSH + permanent WAN break-glass on askari, netbird_coordinator geo-disabled. A real reboot recovered unattended — firewall persisted, Docker forwarding + public services up, coordinator geo-disabled (no FATAL), mesh + both SSH paths back. ROADMAP sub-project 3 (askari redesign) marked DONE; next = relay-SPOF reduction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 09:22:20 +02:00
dfc64da2eb feat(makefile): add EXTRA passthrough to check/deploy for ad-hoc ansible args
Lets an operator pass extra ansible-playbook args through make without bypassing it — e.g. -e ansible_host=<WAN> to manage a host over a relay-independent path during a cutover that restarts its own mesh relay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 09:22:20 +02:00
0194865437 Merge feat/mesh-hardening-askari-redesign: askari INPUT-only redesign + reboot gate
Mesh-hardening redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Three tasks:

1. netbird_coordinator: disable geolocation (NB_DISABLE_GEOLOCATION) so a no-egress startup can't FATAL the control plane.

2. inventory: askari INPUT-only nftables default-deny (forward stays accept, Docker-safe) + ubongo's static WAN IP as a permanent SSH break-glass + manage over wt0; no sshd ListenAddress change (no boot-race); WAN :22 deliberately left open.

3. ADR-025 harness: askari_inputonly profile proves reboot-safety on a KVM VM (GREEN). Includes leaseshelper-independent VM-IP discovery (arp fallback) and an Ansible-managed virbr-boma nftables drop-in. A suid-root workaround the first implementer installed was backed out; nothing privileged reintroduced.

Whole-branch review (opus): ready to merge. Task 4 (live cutover) is operator-gated, not in this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:47:03 +02:00
d6e80990b2 fix(integration): real wait_for_ip arp-fallback test + document substrate coverage gap
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:41:11 +02:00
d1941c987e feat(integration_test): Ansible-manage virbr-boma nftables input allow
Adds a nftables drop-in (10-libvirt-boma.nft) to base's drop-in dir that
allows traffic on iifname "virbr-boma" in the inet filter input chain.
Fixes DHCP/DNS being dropped by base's default-deny INPUT policy for VMs
on the libvirt integration bridge. Mirrors docker_host's drop-in pattern.

Molecule scenario updated to exercise only the firewall tasks (package
install unavailable in the no-internet Docker container) via include_role
tasks_from; verify asserts the drop-in renders the virbr-boma accept rule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:29:45 +02:00
dc5cc8933f fix(harness): fall back to --source arp for VM IP discovery (no leaseshelper)
wait_for_ip now tries --source lease first then --source arp; both produce
identical output handled by parse_lease_ip. Removes the suid leaseshelper
dependency introduced and backed out in Task 3. New unit test confirms
parse_lease_ip works on --source arp output format.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:29:35 +02:00
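The lease-then-arp fallback in dc5cc8933f can be sketched as follows. The harness's own wait_for_ip and parse_lease_ip are not shown in the log, so `parse_domifaddr_ip`, `discover_vm_ip` and `virsh_runner` are hypothetical stand-ins, and the regex assumes the usual `virsh domifaddr` table layout (a `Protocol` column followed by `address/prefix`).

```python
import re
import subprocess
from typing import Callable, List, Optional

# Matches e.g. "ipv4         192.168.150.23/24" in a domifaddr table row.
ADDR_RE = re.compile(r"\bipv4\s+(\d{1,3}(?:\.\d{1,3}){3})/\d+")


def parse_domifaddr_ip(output: str) -> Optional[str]:
    """Return the first IPv4 address in `virsh domifaddr` output, if any."""
    match = ADDR_RE.search(output)
    return match.group(1) if match else None


def virsh_runner(argv: List[str]) -> str:
    """Run a command and return its stdout (empty string on failure)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else ""


def discover_vm_ip(domain: str, run: Callable[[List[str]], str] = virsh_runner) -> Optional[str]:
    """Try the DHCP lease source first, then fall back to the ARP table."""
    for source in ("lease", "arp"):
        ip = parse_domifaddr_ip(run(["virsh", "domifaddr", domain, "--source", source]))
        if ip:
            return ip
    return None
```

Passing the runner as a parameter keeps the fallback order testable without a libvirt host; a real caller would wrap `discover_vm_ip` in a retry loop with a timeout.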
4933186d31 docs(friction): task-3 integration-gate findings (dnsmasq, nftables, hostname)
Documents three blockers found while developing the askari_inputonly
integration-test profile:

1. inet filter default-deny silently blocks libvirt dnsmasq DHCP: nftables
   multi-table independence means ip filter LIBVIRT_INP accept does NOT
   prevent inet filter drop. Diagnosed via strace; fixed with a drop-in.

2. libvirt leaseshelper PID-file: virPidFileReleasePath unlinks the file after
   every call; the `nobody` user cannot recreate it in /run/. Fix: suid root C wrapper.

3. cloud-init rejects underscores in local-hostname → skips network-config
   → no DHCP. Fix: sanitize with replace("_", "-") in meta-data hostname.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:16:45 +02:00
9f0626040b docs(todo): add note on ubongo↔cluster network topology question
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:15:18 +02:00
8ca42c389c fix(integration): fix VM boot: hostname, netplan, known_hosts handling
Three fixes found during askari_inputonly integration-test development:

1. Hostname sanitization: cloud-init rejects underscores in local-hostname
   (silently skips network-config → VM never gets DHCP). Sanitize with
   name.replace("_", "-") for the meta-data hostname; paths/domain names
   keep the original (underscore is valid there).

2. Netplan explicit interface: match.name: en* with a named key produces a
   .network file that networkd never DHCPs. Use explicit enp1s0 (all virtio
   NICs in these KVM VMs) + renderer: networkd to bypass the bug.

3. ansible_ssh_common_args in the generated hosts.yml: integration VMs
   reuse IPs (different VMs at the same 192.168.150.x lease), and
   StrictHostKeyChecking=accept-new from ansible.cfg rejects changed keys. Add
   StrictHostKeyChecking=no + UserKnownHostsFile=/dev/null per-host to the
   generated inventory so stale known_hosts entries never block the apply step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:15:07 +02:00
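The hostname sanitization from 8ca42c389c amounts to a one-line replace applied only where cloud-init reads the name. A minimal sketch, assuming a NoCloud-style meta-data document; the function names and the `instance-id` value are illustrative, not taken from the driver:

```python
def meta_data_hostname(vm_name: str) -> str:
    """Replace underscores, which cloud-init rejects in local-hostname."""
    return vm_name.replace("_", "-")


def render_meta_data(vm_name: str) -> str:
    """Render a minimal meta-data document for the seed.

    Only the hostname is sanitized; the instance-id keeps the original
    name, since the commit notes underscores are valid in paths and
    domain names.
    """
    return (
        f"instance-id: {vm_name}\n"
        f"local-hostname: {meta_data_hostname(vm_name)}\n"
    )
```

For the profile name askari_inputonly this yields a local-hostname of askari-inputonly while leaving every other use of the name untouched.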
1042f161b6 test(integration): askari_inputonly — INPUT-only default-deny reboot gate
Adds the ADR-025 integration-test profile that proves the askari
mesh-hardening REDESIGN (INPUT-only default-deny, forward ACCEPT for Docker)
is reboot-safe on a throwaway KVM VM before the live cut-over.

Profile applies base (firewall + sshd) and offsite (docker_host +
reverse_proxy). Post-reboot verify checks: input policy drop, forward
policy accept, admin-addr break-glass SSH (192.168.150.1), Docker up,
and a published port answered from the controller. GREEN on 2026-06-19.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:14:55 +02:00
d9b8676fce feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 17:18:58 +02:00
ab328a2f79 feat(netbird_coordinator): disable geolocation so no-egress startup can't FATAL the control plane
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 17:15:33 +02:00
61cbcc6c18 docs(friction): re-asked settled defaults (push + subagent-driven) at plan->execute handoff
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 17:11:01 +02:00
6be758bece docs(plan): mesh-hardening redesign — askari implementation plan
Four tasks: netbird_coordinator geolocation disable (TDD via Molecule) -> inventory enablement (INPUT-only firewall + WAN break-glass + manage over wt0) -> an askari_inputonly integration profile (the reboot-safety GREEN gate) -> the operator-gated supervised live cutover + STATUS/ROADMAP update. Tasks 1-3 are autonomously implementable; Task 4 is operator-gated (live off-site host, lockout risk).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:32:27 +02:00
a178729587 docs(spec): mesh-hardening redesign — askari wt0-primary + WAN break-glass
Redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Mirrors the proven ubongo 2/3 pattern (INPUT-only default-deny, SSH scoped by iifname wt0, no sshd ListenAddress change -> no boot-race) and adds the coordinator-host exception the incident demanded: a permanent non-mesh break-glass (WAN :22 from ubongo's static WAN IP + the Hetzner console), WAN :22 deliberately left open. Folds in the netbird_coordinator geo-DB robustness fix (FRICTION #4) so a transient egress blip can't FATAL the control plane. Harness-GREEN gate before a supervised live cutover.

Operator decision (2026-06-19): do this redesign first, then a separate sub-project to reduce askari's SPOF role.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:25:26 +02:00
ef5e049e9b docs(status): mesh-hardening 2/3 — ubongo reboot-validated
After an operator reboot of ubongo, verified live that the INPUT-only default-deny ruleset re-applied on boot (input chain policy drop + the full wt0/ssh-from-control/admin-addr allow-list), the wt0 mesh came back (Management+Signal Connected), and both SSH paths recovered clean. Closes the 'real-host reboot validation pending' item for mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:25:19 +02:00
215060bac1 Merge feat/mesh-hardening-ubongo: ubongo INPUT-only default-deny (mesh-hardening 2/3)
Sub-project 2 of the mesh-hardening follow-on. base gains base__firewall_input_only
(forward-policy knob) + base__firewall_admin_addrs; enabled on ubongo (INPUT-only
default-deny). 'be ubongo' integration profile + profile-aware verify, plus two
harness fixes found by running it (virt-install venv-PATH hijack; nft priority
format). Applied + live-verified on the real ubongo; real-host reboot validation
pending (low-risk). FRICTION: VM-testing standard, libvirt stale-session,
Docker-nat-flush, Molecule coverage gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 15:34:31 +02:00
fa2c4c6368 docs(status): mesh-hardening 2/3 — ubongo INPUT-only default-deny applied
base firewall applied + live-verified on ubongo (INPUT-only default-deny;
base__firewall_input_only). Records the Docker-nat-flush caveat (needs a
'systemctl restart docker' on a Docker host), the claude self-SSH grant, and reboot-validation-pending.
ROADMAP: sub-project 2 done; remaining = NetBird ACL + askari redesign.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 15:34:20 +02:00
a881185c73 docs(friction): base firewall flush wipes Docker nat (cutover finding)
Applying base's nftables (even INPUT-only/forward-accept) to a Docker host
flushes Docker's ip nat -> container egress breaks until 'systemctl restart
docker'. Found on the ubongo mesh-hardening 2/3 live cutover; the Docker-less
test VM couldn't surface it. Self-heals on reboot (dockerd re-adds nat;
forward=accept doesn't block). Runbook/docker_host follow-ups noted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 15:16:21 +02:00
180af46879 docs(friction): log the Molecule input_only-accept coverage gap
Final-review finding: the default Molecule scenario only renders the forward
drop (input_only off) branch; the accept branch is covered by the integration
harness only. Tracked for a kaizen decision (2nd scenario vs accept the split).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:40:29 +02:00
8d8c86fa39 docs(friction): VM-testing standard + libvirt stale-session gotcha
Two signals from running the ubongo harness gate: (1) the operator wants a
standard pre-authorising isolated VM integration tests on ubongo so the agent
doesn't ask each time; (2) a stale agent session (shell predating the
integration_test libvirt-group grant) carries stale process groups, so the
harness's qemu-img/file writes are denied -> run via 'sg libvirt -c ...';
self-heal idea noted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
468f8c3a92 fix(integration): match live nft priority filter in the ubongo verify
`nft list ruleset` prints the symbolic chain priority (`filter` = 0); the ubongo
profile asserted `priority 0` (the rendered-file format the Molecule scenario
checks), so the live-ruleset assertion failed even though the firewall was
correct. Assert `priority filter` for the input/forward policy lines. Caught by
the harness GREEN gate (`make test-integration HOST=ubongo`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
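The mismatch fixed in 468f8c3a92 comes from the same chain being printed two ways: `priority 0` in the rendered file and `priority filter` in the live `nft list ruleset` output. The commit's fix asserts the live form. The sketch below shows a different, more tolerant option: a hypothetical helper `chain_policy` that accepts either spelling and returns the policy, so one check could cover both inputs.

```python
import re
from typing import Optional


def chain_policy(ruleset: str, hook: str) -> Optional[str]:
    """Return the policy of the filter chain on `hook`, or None if absent.

    Accepts both the symbolic priority ("filter") and the numeric one ("0").
    """
    pattern = (
        rf"type filter hook {re.escape(hook)} "
        r"priority (?:filter|0); policy (\w+);"
    )
    match = re.search(pattern, ruleset)
    return match.group(1) if match else None
```

A verify step for the INPUT-only design would then expect `chain_policy(text, "input")` to be `drop` and `chain_policy(text, "forward")` to be `accept`.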
26bb7e442d fix(integration): pin system python for virt-install (venv PATH hijack)
The Makefile prepends .venv/bin to PATH (so the venv's ansible tools resolve),
but virt-install's `#!/usr/bin/env python3` shebang then resolved to the
isolated venv, which lacks system PyGObject (gi) -> ModuleNotFoundError. Strip
.venv/bin from PATH for the virt-install call so its shebang finds
/usr/bin/python3 (which has gi); ansible runs via its absolute .venv path and is
unaffected. Surfaced running `make test-integration HOST=ubongo`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
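The PATH fix in 26bb7e442d can be illustrated with a small helper that drops the venv directory before spawning a system tool. This is a sketch under assumptions: the function names and the example venv location are hypothetical, and the driver's actual code is not shown in the log.

```python
import os
from typing import Dict, Mapping


def path_without_venv(path_value: str, venv_bin: str) -> str:
    """Return PATH with every entry equal to `venv_bin` removed."""
    target = os.path.normpath(venv_bin)
    kept = [
        entry
        for entry in path_value.split(os.pathsep)
        if entry and os.path.normpath(entry) != target
    ]
    return os.pathsep.join(kept)


def env_for_system_tools(environ: Mapping[str, str], venv_bin: str) -> Dict[str, str]:
    """Copy an environment, stripping the venv from PATH only.

    Intended for tools whose `#!/usr/bin/env python3` shebang must
    resolve to the system interpreter.
    """
    env = dict(environ)
    env["PATH"] = path_without_venv(env.get("PATH", ""), venv_bin)
    return env
```

A caller would pass the result as `env=` to `subprocess.run` for the virt-install invocation only, leaving the process-wide PATH (and the venv's ansible tools) unchanged.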
6ac5afaf67 test(integration): add the 'be ubongo' profile (input-only default-deny)
A control-group VM that applies base with INPUT-only default-deny (forward
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
integration_profile marker — the askari Docker/DNAT block is gated, and a ubongo
block asserts input drop + forward accept + the admin-addr rule. Enables
`make test-integration HOST=ubongo`. Mesh-hardening 2/3 (ADR-025).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:52:17 +02:00
b3e14decb4 feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH
Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba at 10.20.10.50 and a second
workstation at 10.20.10.17; both raw DHCP leases, backstopped by wt0). Mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:42:49 +02:00
b10a33f439 feat(base): input-only forward policy + admin-addr SSH allow
base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:37:06 +02:00
66a9a0af08 docs: ubongo admin-addrs add 10.20.10.17 + flag raw-lease follow-up
Allow a second operator workstation (10.20.10.17) onto ubongo's LAN SSH
alongside mamba (10.20.10.50). Both are raw DHCP leases; recorded a FRICTION
open signal to replace them with MAC-pinned OPNsense reservations when
OPNsense-as-code lands (ADR-020 / TODO 3.5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:26:04 +02:00
e14e347047 docs(plan): mesh-hardening 2/3 — ubongo implementation plan
Five tasks: base knobs (input-only forward policy + admin-addr SSH allow,
TDD via Molecule) → enable on the control group → a 'be ubongo' integration
profile (profile-aware verify) → the real-VM harness GREEN gate → the
operator-supervised live cutover (signal-6 order, physical-console break-glass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:26:04 +02:00
24a1d909c9 docs(spec): mesh-hardening 2/3 — ubongo INPUT-only default-deny
Sub-project 2 of the mesh-hardening follow-on (the post-incident roadmap
ordering puts ubongo first). Harden the control node's inbound surface via
base's nftables firewall as INPUT-only default-deny: the forward chain stays
permissive (new base__firewall_input_only knob) so Docker egress + the
libvirt-NAT integration harness keep working, and there is no sshd ListenAddress
change — sidestepping the ip_nonlocal_bind boot-race that sank askari. SSH
allowed from wt0, ssh-from-control (Ansible self), and mamba on the LAN (new
base__firewall_admin_addrs). Harness-validated before an operator-supervised
cutover; the physical console is the permanent break-glass.

Design maps to the four relevant 2026-06-17 incident lessons (FRICTION signals
1/2/3/6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:12:58 +02:00
77a20b8d40 docs(runbook): netbird-client mesh-drop / DNS troubleshooting
Document the 2026-06-18 incident class: a road-warrior laptop losing DNS on a network transition strands NetBird (can't resolve the coordinator FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and why ubongo is relay-only (deferred mesh-hardening, not a bug) — including the break-glass rule that hardening must preserve.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 22:30:41 +02:00
a23ecd708d Merge feat/integration-testing: local VM integration testing (ADR-025, TODO 2.4)
A stdlib driver (scripts/integration-vm.py) boots throwaway KVM VMs on ubongo mirroring a real host, applies the real playbooks, performs a real reboot, and asserts outcomes - catching the reboot/firewall/Docker class Molecule cannot. Validated end-to-end on real hardware: RED->GREEN acceptance passed (reproduced the 2026-06-17 incident, then proved the docker_host container-forward drop-in survives reboot). Also: claude AI-worker granted NOPASSWD sudo (reverses ADR-015 no-local-sudo; ADR-015/021 + accepted-risk R7, codified in base); 9 shakedown findings in FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:52:59 +02:00
bc8592616b fix: address final whole-branch review findings
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:52:28 +02:00
d7bd31babb docs(adr/status): integration-testing harness RED→GREEN validated (ADR-025)
The local-VM integration harness RED→GREEN acceptance passed on real hardware
(2026-06-18): a KVM VM on ubongo reproduced the 2026-06-17 nftables/Docker reboot
breakage (RED) and survived with the docker_host container-forward drop-in (GREEN).

ADR-025: Status updated to PASSED; shakedown learnings section added (UEFI boot
required, claude sudo load-bearing); ADR-021 added to Related.
STATUS.md: integration-harness section updated from PENDING to PASSED; ubongo
entry updated to reflect claude NOPASSWD sudo + sjat-ansible NOPASSWD removal;
last-reviewed date updated.
docs/TODO.md: item 2.4 collapsed to one-line pointer per the file's convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:39:30 +02:00
cc772ff845 docs(adr/security): record claude NOPASSWD sudo model (ADR-015 amend + R7)
The integration-testing shakedown reversed ADR-015's "no local sudo" sub-decision:
the claude AI-worker now has NOPASSWD:ALL sudo on ubongo — without it, virsh,
nft, and journalctl all block during VM diagnosis. Compensating controls:
password-locked account, auditd/Loki attribution, repo-managed revocable drop-in.

ADR-015: dated amendment note in Status + expanded AI-worker identity section.
ADR-021: new §Sudo model (amendment 2026-06-18) — claude=NOPASSWD, sjat=password
required; former sjat NOPASSWD drop-in removed 2026-06-18 (least-privilege cleanup).
accepted-risks.md: R7 added (claude NOPASSWD:ALL on ubongo); last-reviewed updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:39:20 +02:00
3fe6f68316 feat(base): codify AI-worker NOPASSWD sudo (ADR-015 amended)
Add base__ai_worker_user var (default empty), a new operational_access.yml
task file that drops a validated sudoers file for the named user, and wire it
into base/tasks/main.yml after the hardening includes under the `users` tag.

Set base__ai_worker_user: claude in group_vars/control so that applying base
to ubongo is idempotent with the manual /etc/sudoers.d/claude-ai-worker drop-in
already in place. Password remains locked; NOPASSWD is the only sudo path;
actions are attributed via auditd (ADR-021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:36:31 +02:00
b1aa0f49d9 fix(integration): verify probes :80 without following redirects
Accept caddy's 308 on :80 as proof the DNAT+forward path is alive; don't follow into https (tls internal has no cert for a bare-IP request). This load-bearing end-to-end check is what caught the br-+/br-* nftables-wildcard bug that the string-presence assert missed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:57:47 +02:00
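The no-redirect probe described in b1aa0f49d9 can be sketched with Python's `http.client`, which never follows redirects on its own. The real check lives in the Ansible verify playbook and is not shown in the log; `probe_status`, `forward_path_alive` and the default accepted set are assumptions made for illustration.

```python
import http.client
from typing import Iterable


def probe_status(host: str, port: int = 80, path: str = "/", timeout: float = 5.0) -> int:
    """Send one GET and return the raw status code without following redirects."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


def forward_path_alive(status: int, accepted: Iterable[int] = (308,)) -> bool:
    """Treat the redirect itself as proof that the DNAT + forward path answered."""
    return status in accepted
```

Stopping at the 308 avoids the https hop, which the commit notes cannot succeed for a bare-IP request under tls internal.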
172ae37953 feat(docker_host): container-forward nftables drop-in (reboot-safe Docker forwarding)
base's inet-filter forward chain is policy-drop; on a Docker host that kills published-port DNAT + inter-container forwarding ON REBOOT (nftables loads default-deny before dockerd). This drop-in (loaded via base's /etc/nftables.d/*.nft include at boot) appends the container-bridge accepts so a rebooted Docker host keeps forwarding. Resolves FRICTION 2026-06-17 #1 and the GREEN half of ADR-025's acceptance test. NB nftables wildcard is br-*, not the iptables br-+.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:57:47 +02:00
051c040343 fix(integration): exclude transient .run/ from linters; --- in generated inventory
Running the harness leaves tests/integration/.run/ (gitignored, generated); exclude it from yamllint + ansible-lint so a post-run 'make lint' passes. Also emit a --- doc-start in the generated inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:44:12 +02:00
c7194ca147 feat(integration): allow SSH from the NAT gateway in the askari overlay
base's default-deny firewall would drop the driver's post-reboot SSH from the libvirt NAT gateway; set base__firewall_control_addr to the gateway (by source IP, interface-independent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
35446538df fix(integration-vm): apt-ready VMs + sudo-read serial console diagnostics
cloud-init package_update:true + block on 'cloud-init status --wait' in up() so apply sees populated apt lists (fresh genericcloud images ship empty lists); dump_diagnostics()/console() read the root:0600 serial log via sudo instead of shutil.copy, which raised PermissionError mid-diagnostics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
83983d739c fix(reverse_proxy): plain {% %} tags so the Caddyfile renders under ansible trim_blocks
The tls-internal/acme_ca knobs used {%- -%} trims validated only against raw jinja2; ansible (trim_blocks=True) double-stripped newlines and collapsed the Caddyfile onto single lines, crash-looping caddy. Match the role's existing plain {% %} style.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
941141e270 docs(friction): capture 9 signals from the ADR-025 harness shakedown
UEFI-vs-BIOS boot loop, no-sudo diagnosis gap (-> claude sudo decision), qemu
session-vs-system URI, system-qemu home-traversal, directory-inventory phantom
hosts, jinja trim_blocks render trap, empty apt lists on fresh cloud images,
NAT-gateway firewall allow, and the review-vs-hardware coverage lesson.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:30:13 +02:00
f27514860e fix(integration-vm): boot test VMs via UEFI
The Debian 13 genericcloud image triple-faults at the legacy real-mode kernel
handoff under SeaBIOS/q35 (boot-loops at GRUB, no 'Decompressing Linux', no DHCP
lease). Booting via UEFI (OVMF -> efistub) bypasses the legacy entry and boots
cleanly: cloud-init runs, DHCP lease obtained, SSH reachable. Verified end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:13:35 +02:00
65bacb25fa feat(integration-vm): force DHCP via explicit cloud-init network-config
Don't rely on the genericcloud image's network fallback; the seed now carries a
network-config forcing dhcp4 on en* interfaces. This is a prerequisite for the VM
to get networking once cloud-init processes the seed. (Note: a separate no-DHCP-lease
issue on first real boot is still under investigation — the guest isn't networking
and, under the no-sudo claude model, the VM console/logs aren't introspectable
without libguestfs; see next steps.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 15:05:49 +02:00