docs(friction): task-3 integration-gate findings (dnsmasq, nftables, hostname)
Documents three blockers found while developing the askari_inputonly
integration-test profile:
1. inet filter default-deny silently blocks libvirt dnsmasq DHCP: nftables
multi-table independence means ip filter LIBVIRT_INP accept does NOT
prevent inet filter drop. Diagnosed via strace; fixed with a drop-in.
2. libvirt leaseshelper PID-file: virPidFileReleasePath unlinks the file after
every call; user nobody cannot recreate it in /run/. Fix: suid root C wrapper.
3. cloud-init rejects underscores in local-hostname → skips network-config
→ no DHCP. Fix: sanitize with replace("_", "-") in meta-data hostname.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 9f0626040b
commit 4933186d31
1 changed files with 40 additions and 0 deletions
@@ -224,6 +224,46 @@ harness on ubongo and shaking it down against real KVM (spec/plan in docs/superp
  `flush` is safe; (3) the firewall final-review checklist should include "does the host run
  Docker/libvirt? the flush wipes their nat."

<!-- From the 2026-06-19 mesh-hardening 3/3 (askari INPUT-only integration gate). -->

- `[gotcha]` **`inet filter` default-deny blocks libvirt dnsmasq DHCP: silent, hard to diagnose**
  (2026-06-19, task-3 integration gate): when `base__firewall_input_only: true` is applied to
  ubongo, `table inet filter { chain input { policy drop; } }` drops DHCP packets that arrive
  via the libvirt bridge (`virbr-boma`). In nftables, base chains from different tables that
  attach to the same hook (even at the same priority) are all evaluated independently; an
  `accept` verdict in `table ip filter` (chain `LIBVIRT_INP`) does NOT prevent
  `table inet filter` from seeing and dropping the same packet. Symptom: VMs never got DHCP
  leases; strace showed the dnsmasq socket never receiving POLLIN even though tcpdump saw the
  packet on `virbr-boma`. Diagnosed by temporarily changing `inet filter input` to
  `policy accept`, after which fd=3 fired immediately. Fix: a
  `/etc/nftables.d/10-libvirt-boma.nft` drop-in adding `iifname "virbr-boma" accept`
  (survives service restarts via `include "/etc/nftables.d/*.nft"`).
  → The `base` role's template needs a `base__firewall_trusted_bridges` variable so this is
  encoded at the Ansible level, not in a manual host drop-in. Every host that runs Docker or
  libvirt and also has `base__firewall_input_only: true` needs an analogous exception.

- `[gotcha]` **libvirt `leaseshelper` PID-file permission: `virPidFileReleasePath` unlinks
  `/run/leaseshelper.pid` after EVERY call; `nobody` cannot recreate it** (2026-06-19, task-3
  integration gate): dnsmasq runs as `nobody`; `libvirt_leaseshelper` is its `--dhcp-script`.
  The helper acquires a PID-file mutex at `/run/leaseshelper.pid`, but `virPidFileReleasePath`
  UNLINKS the file on exit. `/run/` is `root:root 755`, so `nobody` cannot create the file
  after the first unlink → every subsequent `add` call fails with `errno=13` (EACCES) and
  dnsmasq silently drops the DHCP grant (no log, no error to the client). Fix: a suid root C
  wrapper at `/usr/lib/libvirt/libvirt_leaseshelper` (original moved to `.real`) that
  pre-creates `/run/leaseshelper.pid` owned by `nobody`, then drops privileges and execs the
  real helper. The root dnsmasq fork calls the wrapper; suid gives it permission to touch
  `/run/`; after the drop back to the `nobody` uid the PID file stays in place. Also:
  `/var/lib/libvirt/dnsmasq/` must be `nobody:nogroup 775` so leaseshelper can update
  `virbr-boma.status`. This fix is host-local on ubongo and NOT in Ansible; encode it in an
  `integration_test` role task (or a libvirt role) before the harness can be safely
  re-deployed.

- `[gotcha]` **cloud-init rejects underscores in `local-hostname` → silently skips
  network-config → VM never gets DHCP** (2026-06-19, task-3 integration gate): setting
  `local-hostname: boma-it-askari_inputonly-<uuid>` caused cloud-init-local to consider the
  hostname invalid and skip writing the network-config to the system. systemd-networkd then
  used the genericcloud default (no DHCP), so VMs got only an IPv6 link-local address. Fix in
  `scripts/integration-vm.py`: `name.replace("_", "-")` in the meta-data hostname (disk paths
  and virsh domain names keep the original underscore). Sanitization rule: RFC 952 hostnames
  allow hyphens, not underscores.
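The multi-table behavior from the `inet filter` gotcha can be illustrated with a toy model. This is only a sketch of the verdict semantics, not of nftables itself: `BaseChain`, `hook_verdict`, and `accept_iif` are hypothetical names, priority `0` for both chains is an assumption, and `LIBVIRT_INP` is collapsed into its parent base chain for brevity. The idea it shows: `accept` only ends evaluation of the current base chain, while `drop` is final for the packet.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

ACCEPT = "accept"
DROP = "drop"

Packet = Dict[str, object]
Rule = Callable[[Packet], Optional[str]]


@dataclass
class BaseChain:
    """Toy stand-in for one base chain attached to the input hook."""
    name: str
    priority: int
    policy: str
    rules: List[Rule] = field(default_factory=list)

    def evaluate(self, packet: Packet) -> str:
        # The first rule that returns a verdict wins; otherwise the policy applies.
        for rule in self.rules:
            verdict = rule(packet)
            if verdict is not None:
                return verdict
        return self.policy


def hook_verdict(chains: List[BaseChain], packet: Packet) -> str:
    """A packet survives the hook only if every base chain accepts it."""
    for chain in sorted(chains, key=lambda c: c.priority):
        if chain.evaluate(packet) == DROP:
            return DROP  # drop is final; later chains never see the packet
    return ACCEPT


def accept_iif(name: str) -> Rule:
    """Rule factory: accept packets arriving on the given interface."""
    return lambda pkt: ACCEPT if pkt.get("iifname") == name else None


# Assumed shape of the DHCP request seen on the bridge.
dhcp_request: Packet = {"iifname": "virbr-boma", "udp_dport": 67}

libvirt_chain = BaseChain("ip filter INPUT (LIBVIRT_INP)", 0, ACCEPT,
                          [accept_iif("virbr-boma")])
strict_chain = BaseChain("inet filter input", 0, DROP, [accept_iif("lo")])
fixed_chain = BaseChain("inet filter input + drop-in", 0, DROP,
                        [accept_iif("lo"), accept_iif("virbr-boma")])
```

In this model `hook_verdict([libvirt_chain, strict_chain], dhcp_request)` yields a drop despite libvirt's accept, and swapping in `fixed_chain` (the equivalent of the `iifname "virbr-boma" accept` drop-in) lets the packet through.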
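The permission arithmetic behind the `leaseshelper` gotcha can be checked with a simplified model of the classic owner/group/other test. This is a sketch under stated assumptions: `can_create_in_dir` is a hypothetical helper, it ignores ACLs and capabilities, and the numeric ids for `nobody`/`nogroup` (65534) are an assumed Debian-style convention rather than something taken from ubongo.

```python
from typing import Iterable

# Assumed ids; check the real values on the target host.
ROOT_UID = 0
ROOT_GID = 0
NOBODY_UID = 65534
NOGROUP_GID = 65534


def can_create_in_dir(dir_uid: int, dir_gid: int, dir_mode: int,
                      uid: int, gids: Iterable[int]) -> bool:
    """Simplified POSIX check: creating a file needs write+search on the directory."""
    if uid == ROOT_UID:
        return True  # root bypasses the mode bits
    if uid == dir_uid:
        bits = (dir_mode >> 6) & 0o7  # owner class
    elif dir_gid in set(gids):
        bits = (dir_mode >> 3) & 0o7  # group class
    else:
        bits = dir_mode & 0o7         # other class
    # Write (2) and execute/search (1) must both be set.
    return bits & 0o3 == 0o3


# /run/ as described: root:root 755, accessed by nobody.
run_dir_ok = can_create_in_dir(ROOT_UID, ROOT_GID, 0o755,
                               NOBODY_UID, [NOGROUP_GID])

# /var/lib/libvirt/dnsmasq/ after the fix: nobody:nogroup 775.
dnsmasq_dir_ok = can_create_in_dir(NOBODY_UID, NOGROUP_GID, 0o775,
                                   NOBODY_UID, [NOGROUP_GID])
```

Here `run_dir_ok` is false and `dnsmasq_dir_ok` is true, which matches the two halves of the fix: something privileged has to pre-create the PID file in `/run/`, while the status directory only needs its ownership and mode changed.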
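The hostname sanitization from the cloud-init gotcha could look roughly like the sketch below. The actual code in `scripts/integration-vm.py` is not shown in this commit, so `metadata_hostname` is a hypothetical name, and the extra validation step (letters, digits, and hyphens only, at most 63 characters, no leading or trailing hyphen) is an assumption added here so that a bad name fails loudly on the host instead of silently inside the VM.

```python
import re

# One DNS label: alphanumeric at both ends, hyphens allowed inside, max 63 chars.
_LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def metadata_hostname(vm_name: str) -> str:
    """Derive the meta-data `local-hostname` from a VM name.

    Only the hostname is rewritten; callers keep using `vm_name` unchanged
    for disk paths and virsh domain names.
    """
    label = vm_name.replace("_", "-")
    if not _LABEL_RE.fullmatch(label):
        raise ValueError(f"not a valid hostname label: {label!r}")
    return label


# Hypothetical short suffix standing in for the real uuid.
example = metadata_hostname("boma-it-askari_inputonly-1234")
```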
---
## Kaizen reviews — decisions ledger