sjat 2a65391c0e docs(spec): firewall strategy design (TODO 3.5 → ADR-020)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-06 15:36:24 +02:00

Design — Firewall strategy (two-layer model + shared catalog)

Date: 2026-06-06
Status: Approved design — pending implementation plan
Resolves: TODO 3.5 ("Decide the firewall strategy — which firewall, ruleset, per-host vs central")
Becomes: ADR-020 (this design is the basis for that ADR)
Scope note: This is the strategy ADR. It pins the architecture and responsibilities; the detailed builds (host nftables in base, OPNsense-as-code) are separate follow-up specs (see Scope).

Problem

boma needs a firewall strategy that is predictable, declarative, and defends the stated threat model (opportunistic external, lateral movement / blast radius, operator/agent error — ADR-002). The ADRs already commit to pieces of this — nftables default-deny on hosts (ADR-002), OPNsense at the perimeter (ADR-007), Docker with iptables: false (ADR-004) — but no document ties them together: which layer owns what, where firewall intent is declared, and how the two layers stay consistent. Without that, ports drift open ad-hoc and "per-host vs central" stays unanswered.

The roles that would hold the host firewall (base, docker_host) are empty, and there is no OPNsense automation yet — so this is greenfield strategy work.

The two-layer model

Two firewall layers, each with a distinct job; the host layer adds deliberate defense-in-depth for the one thing the perimeter structurally cannot see.

OPNsense — perimeter + inter-VLAN

Owns everything between zones and at the edge:

WAN edge (the internet boundary).
Inter-VLAN policy: lan/iot/guest → srv, mgmt access, the documented per-VLAN egress rules (ADR-007).
Limitation: OPNsense is structurally blind to intra-srv traffic, because services share the srv subnet (VLAN 20), which is switched and never reaches the OPNsense gateway.

Host nftables — host-local + east-west within `srv` (in `base`)

Runs on every Debian VM:

Default-deny inbound; allow loopback + established/related.
East-west allowlist: a service host accepts a connection only from declared sources (e.g. the reverse proxy, a named peer). This is the lateral-movement control OPNsense cannot provide — the blast-radius goal in ADR-002.
Permissive egress: allow outbound + established/related. Per-VLAN egress restriction stays at OPNsense (where it already lives, ADR-007). Rationale: host-level egress allowlisting is high-friction (every DNS/NTP/update/registry/webhook call must be enumerated) for limited additional benefit given OPNsense already bounds where each VLAN can go.
Docker integration: Docker daemon runs with "iptables": false; nftables owns all filtering, including container traffic (ADR-004).
Guaranteed management plane: loopback, established/related, and wt0 (the NetBird overlay, ADR-016) for SSH + Ansible are always allowed, independent of the catalog, and the ruleset is applied atomically — so a malformed or empty catalog can never lock out management. (ADR-016: SSH is allowed only on wt0, not the LAN.)
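
The ordering guarantee above can be illustrated with a minimal sketch, assuming a renderer that emits the management-plane rules unconditionally and only then appends catalog entries. Every name here (`MANAGEMENT_RULES`, `render_host_ruleset`, the table layout, SSH on port 22) is hypothetical and not the eventual `base` template; the forward chain needed for container traffic is left out for brevity.

```python
# Hypothetical sketch: rule text and names are illustrative assumptions,
# not the template that the follow-up `base` spec will define.

MANAGEMENT_RULES = [
    'iifname "lo" accept',
    "ct state established,related accept",
    'iifname "wt0" tcp dport 22 accept',  # SSH only on the NetBird overlay (port 22 assumed)
]


def render_host_ruleset(catalog_rules):
    """Return an nftables script for one host.

    catalog_rules: list of dicts with already-resolved 'saddr', 'port', 'proto'.
    The management rules never depend on this argument, so an empty or
    malformed catalog still yields a ruleset that keeps SSH over wt0 open.
    """
    lines = [
        "flush ruleset",
        "table inet filter {",
        "  chain input {",
        "    type filter hook input priority 0; policy drop;",
    ]
    lines += [f"    {rule}" for rule in MANAGEMENT_RULES]
    for r in catalog_rules:
        lines.append(
            f"    ip saddr {r['saddr']} {r['proto']} dport {r['port']} accept"
        )
    lines += [
        "  }",
        "  chain output {",
        "    type filter hook output priority 0; policy accept;",  # permissive egress
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"
```

Writing the result to a file and loading it with `nft -f` replaces the ruleset in a single transaction, which is one way to get the atomic apply described above.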

The shared service catalog (single source of truth)

A central, declarative service catalog in group_vars/ is the one source of truth for firewall intent. This aligns with ADR-002's existing rule that "port definitions live in group_vars/ so rules stay in sync with deployed services," and keeps connectivity topology (inherently cross-cutting) in inventory rather than in any one self-contained service role (ADR-004).

Each entry describes a service's ingress as a list of allow rules:

photoprism:
  ingress:
    - { from: reverse_proxy, port: 2342, proto: tcp }
reverse_proxy:
  ingress:
    - { from: lan, port: 443, proto: tcp }

`from` is symbolic, resolved at render time:

a host or group → IP(s) from inventory;
a role (e.g. reverse_proxy) → the host(s) filling it;
a VLAN/zone (e.g. lan) → the subnet from the ADR-007 table.

Symbolic sources keep the catalog readable and resilient to IP changes.
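
A minimal sketch of that resolution step, assuming simple lookup tables. The addresses, host names, and lookup precedence are made up for illustration; real values would come from the Ansible inventory and the ADR-007 VLAN table, neither of which is reproduced here.

```python
# Hypothetical data: all names and addresses below are illustrative only.
HOSTS = {"proxy01": "10.0.20.5", "photos01": "10.0.20.11"}
GROUPS = {"media": ["photos01"]}
ROLES = {"reverse_proxy": ["proxy01"]}
ZONES = {"lan": "10.0.10.0/24", "srv": "10.0.20.0/24"}


def resolve_source(name):
    """Resolve a symbolic `from` value to a list of IPs or CIDR subnets.

    Lookup order (zone, role, group, host) is an assumption of this sketch.
    Unknown names raise instead of silently rendering no rule.
    """
    if name in ZONES:
        return [ZONES[name]]
    if name in ROLES:
        return [HOSTS[h] for h in ROLES[name]]
    if name in GROUPS:
        return [HOSTS[h] for h in GROUPS[name]]
    if name in HOSTS:
        return [HOSTS[name]]
    raise KeyError(f"unknown firewall source: {name!r}")
```

Failing loudly on an unknown source is a deliberate choice in this sketch: a typo in the catalog should stop the render rather than drop a rule.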

Each layer renders only its own slice

The same catalog feeds both layers; each filters for the rules it owns:

| Ingress rule | Host nftables | OPNsense |
| --- | --- | --- |
| `from: reverse_proxy` (a `srv` peer) | allow proxy IP → port | none (intra-`srv`, invisible to OPNsense) |
| `from: lan` (cross-VLAN) | allow `lan` subnet → port | allow `lan` → host:port |

The dominant pattern falls out naturally: most services are proxied, so their only ingress is `from: reverse_proxy`. Users reach them through the reverse proxy, which alone carries `from: lan, port: 443`. This matches "services sit behind the reverse proxy with authentication" (ADR-002).

"Shared catalog, each layer renders its own" was chosen over two alternatives. A single connectivity model that generates both rulesets was rejected as too much machinery and too tight a coupling of two very different rule domains. Fully independent per-layer declarations were rejected because of the real drift risk: a port opened on the host but not at OPNsense, or vice versa.
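
The per-layer filtering can be sketched as two small functions over the example catalog. This is an illustration under stated assumptions: `SOURCE_ZONE`, `host_slice`, and `opnsense_slice` are hypothetical names, and the zone of each symbolic source is hard-coded here rather than derived from inventory.

```python
# Hypothetical sketch; the source-to-zone table is made up for illustration.
SOURCE_ZONE = {"reverse_proxy": "srv", "lan": "lan"}

CATALOG = {
    "photoprism": {
        "ingress": [{"from": "reverse_proxy", "port": 2342, "proto": "tcp"}],
    },
    "reverse_proxy": {
        "ingress": [{"from": "lan", "port": 443, "proto": "tcp"}],
    },
}


def host_slice(service):
    """Host nftables renders every ingress rule of a service it runs."""
    return list(CATALOG[service]["ingress"])


def opnsense_slice(service, service_zone="srv"):
    """OPNsense renders only rules whose source lies in another zone."""
    return [
        rule
        for rule in CATALOG[service]["ingress"]
        if SOURCE_ZONE[rule["from"]] != service_zone
    ]
```

With this data, `opnsense_slice("photoprism")` is empty (the proxy is an intra-`srv` peer), while `opnsense_slice("reverse_proxy")` keeps the `lan` rule, mirroring the two rows of the table.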

OPNsense automation — owned here, mechanism deferred

OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider"). It renders the cross-VLAN slice of the catalog (every from: <other-zone> rule) plus the static ADR-007 facts (WAN edge, per-VLAN egress, mgmt access, inter-VLAN defaults).

This ADR pins what OPNsense owns and that it renders from the shared catalog. The how — config-XML templating vs the OPNsense API vs a plugin — is a substantial, separate tooling decision, deferred to the OPNsense-as-code follow-up spec. Recorded here as an explicit open sub-decision so it is not lost.

Guardrails & enforcement

The catalog is authoritative. Apart from the guaranteed management plane, a port that is not in the catalog is not opened at either layer. This hardens the existing CLAUDE.md guardrail ("never open a firewall port ad-hoc on a host") into a positive contract.
The firewall tag (ADR-019) marks firewall tasks, so --tags firewall re-renders rules on base and any service role that contributes them.
Drift detection (aspiration). A deterministic check — in the spirit of scripts/check-tags.py — compares each host's actual listening ports / live nft ruleset against the catalog and flags anything undeclared. Ties to TODO 8.5 (/security-review) and the "undeclared open ports" pre-scan idea. Listed as a consequence and future guardrail; not necessarily built in the first implementation.
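
One possible shape for the listening-port half of that check, as a rough sketch only. It assumes input in the column layout printed by `ss -H -tln` (state, queues, local address, peer address) and treats port 22 as always allowed; the function names and the sample data are hypothetical, and comparing the live nft ruleset is not covered.

```python
def listening_ports(ss_output):
    """Extract TCP listening ports from `ss -H -tln` style output."""
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        local = fields[3]  # e.g. "0.0.0.0:2342" or "[::]:22"
        ports.add(int(local.rsplit(":", 1)[1]))
    return ports


def undeclared_ports(ss_output, declared, always_allowed=frozenset({22})):
    """Return listening ports that are neither declared nor management."""
    return sorted(listening_ports(ss_output) - set(declared) - always_allowed)


# Made-up sample output for a host whose catalog declares only port 2342.
SAMPLE = """\
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 4096 0.0.0.0:2342 0.0.0.0:*
LISTEN 0 4096 0.0.0.0:9000 0.0.0.0:*
"""

print(undeclared_ports(SAMPLE, declared={2342}))  # prints [9000]
```

A real check would probably also ignore loopback-only listeners, since default-deny inbound already makes them unreachable from other hosts.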

Consequences

"Per-host vs central" is answered: both, with clear ownership — central perimeter (OPNsense) + per-host default-deny with east-west allowlisting, fed by one catalog.
Lateral movement within srv is constrained (the gap OPNsense can't close).
One declarative catalog means no ad-hoc ports and no cross-layer drift on the shared facts (ports, IPs, sources).
Cost: the catalog and the render-per-layer machinery must be built and maintained; east-west allowlisting adds per-service ingress declarations (mitigated by the proxied-by-default pattern, which keeps most entries to a single line).

Scope

This ADR decides: the two-layer model and each layer's responsibilities; host nftables = default-deny inbound + east-west allowlist + permissive egress + guaranteed management plane + Docker iptables:false; the shared group_vars service catalog as single source of truth with symbolic sources; each layer renders its own slice; the no-ad-hoc-ports guardrail.

Deferred to follow-up specs (each its own brainstorm → plan):

Host nftables implementation in base — exact catalog schema, nftables template structure, Docker iptables:false integration, fail-safe ordering, Molecule tests. The natural next spec.
OPNsense-as-code — the tooling mechanism + cross-VLAN rule rendering.
Drift-detection check — if/when we build it.

Related ADRs: ADR-002 (security baseline: nftables default-deny, fail2ban, blast radius), ADR-004 (Docker model: iptables:false), ADR-007 (network topology, VLANs, OPNsense, per-VLAN egress), ADR-016 (NetBird mesh: SSH on wt0 only), ADR-019 (firewall tag).
