boma/docs/superpowers/specs/2026-06-06-firewall-strategy-design.md

# Design — Firewall strategy (two-layer model + shared catalog)

- **Date:** 2026-06-06
- **Status:** Approved design — pending implementation plan
- **Resolves:** TODO 3.5 ("Decide the firewall strategy — which firewall, ruleset,
  per-host vs central")
- **Becomes:** ADR-020 (this design is the basis for that ADR)
- **Scope note:** This is the **strategy** ADR. It pins the architecture and
  responsibilities; the detailed builds (host nftables in `base`, OPNsense-as-code) are
  separate follow-up specs (see *Scope*).

---

## Problem

boma needs a firewall strategy that is **predictable, declarative, and defends against
the stated threat model** (opportunistic external attackers, lateral movement / blast
radius, operator/agent error; ADR-002). The ADRs already commit to pieces of this
(`nftables` default-deny on hosts in ADR-002, OPNsense at the perimeter in ADR-007,
Docker with `iptables: false` in ADR-004), but no document ties them together: *which
layer owns what, where firewall intent is declared, and how the two layers stay
consistent.* Without that, ports drift open ad-hoc and "per-host vs central" stays
unanswered.

The roles that would hold the host firewall (`base`, `docker_host`) are empty, and there
is no OPNsense automation yet — so this is greenfield strategy work.

## The two-layer model

Two firewall layers, each with a distinct job; the host layer adds deliberate
defense-in-depth for the one thing the perimeter structurally cannot see.

### OPNsense — perimeter + inter-VLAN

Owns everything *between zones* and at the edge:

- WAN edge (the internet boundary).
- Inter-VLAN policy: `lan`/`iot`/`guest` → `srv`, `mgmt` access, the documented
  per-VLAN egress rules (ADR-007).
- **Structurally blind to intra-`srv` traffic**: services share the `srv` subnet
  (VLAN 20), so traffic between them is switched at layer 2 and never reaches the
  OPNsense gateway.

### Host nftables — host-local + east-west within `srv` (in `base`)

Runs on every Debian VM:

- **Default-deny inbound**; allow loopback + established/related.
- **East-west allowlist**: a service host accepts a connection only from declared
  sources (e.g. the reverse proxy, a named peer). This is the lateral-movement control
  OPNsense cannot provide — the blast-radius goal in ADR-002.
- **Permissive egress**: allow outbound + established/related. Per-VLAN egress
  restriction stays at OPNsense (where it already lives, ADR-007). Rationale: host-level
  egress allowlisting is high-friction (every DNS/NTP/update/registry/webhook call must
  be enumerated) for limited additional benefit given OPNsense already bounds where each
  VLAN can go.
- **Docker integration**: Docker daemon runs with `"iptables": false`; nftables owns all
  filtering, including container traffic (ADR-004).
- **Guaranteed management plane**: loopback, established/related, and `wt0` (the NetBird
  overlay, ADR-016) for SSH + Ansible are *always* allowed, independent of the catalog,
  and the ruleset is applied atomically — so a malformed or empty catalog can never lock
  out management. (ADR-016: SSH is allowed only on `wt0`, not the LAN.)
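
The ordering guarantee above can be illustrated with a minimal Python sketch. It is not the planned `base` template (that is deferred to the follow-up spec); the `Rule` tuple shape, the `render_ruleset` name, and the SSH port are assumptions made for illustration. The management-plane rules are fixed and emitted before any catalog-derived rule, and an invalid catalog rule aborts rendering before anything is written. Container forwarding under Docker `iptables: false` is deliberately left out.

```python
from typing import Iterable

# Hypothetical rule shape: (source IP or CIDR, port, proto), already resolved
# from the catalog's symbolic `from` values.
Rule = tuple[str, int, str]

# Fixed rules that never depend on the catalog (port 22 for SSH is an assumption).
MANAGEMENT_PLANE = [
    'iif "lo" accept',
    "ct state established,related accept",
    'iifname "wt0" tcp dport 22 accept',
]


def render_ruleset(catalog_rules: Iterable[Rule]) -> str:
    """Render one nftables file: management rules first, catalog rules after."""
    lines = [
        "flush ruleset",
        "table inet filter {",
        "  chain input {",
        "    type filter hook input priority 0; policy drop;",
    ]
    lines += [f"    {rule}" for rule in MANAGEMENT_PLANE]
    for source, port, proto in catalog_rules:
        # Fail the whole render rather than emit a partial or bogus ruleset.
        if proto not in ("tcp", "udp") or not 0 < port < 65536:
            raise ValueError(f"invalid catalog rule: {(source, port, proto)}")
        lines.append(f"    ip saddr {source} {proto} dport {port} accept")
    lines += [
        "  }",
        "  chain output {",
        "    type filter hook output priority 0; policy accept;",
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # An empty catalog still yields a ruleset that keeps management reachable.
    print(render_ruleset([]))
```

Loading the rendered file with `nft -f` applies it as a single transaction, which is what makes the "applied atomically" property achievable: either the whole new ruleset is in effect or the old one still is.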

## The shared service catalog (single source of truth)

A central, declarative **service catalog** in `group_vars/` is the one source of truth
for firewall intent. This aligns with ADR-002's existing rule that "port definitions
live in `group_vars/` so rules stay in sync with deployed services," and keeps
connectivity *topology* (inherently cross-cutting) in inventory rather than in any one
self-contained service role (ADR-004).

Each entry describes a service's **ingress** as a list of allow rules:

```yaml
photoprism:
  ingress:
    - { from: reverse_proxy, port: 2342, proto: tcp }
reverse_proxy:
  ingress:
    - { from: lan, port: 443, proto: tcp }
```

`from` is **symbolic**, resolved at render time:

- a **host or group** → IP(s) from inventory;
- a **role** (e.g. `reverse_proxy`) → the host(s) filling it;
- a **VLAN/zone** (e.g. `lan`) → the subnet from the ADR-007 table.

Symbolic sources keep the catalog readable and resilient to IP changes.
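
As a rough illustration of render-time resolution, the following Python sketch maps a symbolic `from` value to addresses. Every host name, address, and subnet in it is made up, and the lookup order (zone, then role, then group, then host) is an assumption; the real schema and data sources belong to the follow-up spec.

```python
# All names and addresses below are illustrative, not boma's real inventory.
INVENTORY_HOSTS = {"proxy01": "10.0.20.5", "photos01": "10.0.20.11"}
INVENTORY_GROUPS = {"media": ["photos01"]}
ROLES = {"reverse_proxy": ["proxy01"]}
ZONES = {"lan": "10.0.10.0/24", "srv": "10.0.20.0/24"}


def resolve_source(name: str) -> list[str]:
    """Resolve a symbolic `from` value to a list of IPs or CIDRs."""
    if name in ZONES:
        return [ZONES[name]]
    if name in ROLES:
        return [INVENTORY_HOSTS[host] for host in ROLES[name]]
    if name in INVENTORY_GROUPS:
        return [INVENTORY_HOSTS[host] for host in INVENTORY_GROUPS[name]]
    if name in INVENTORY_HOSTS:
        return [INVENTORY_HOSTS[name]]
    # An unknown name is an error, never a silent "allow nothing" or "allow all".
    raise KeyError(f"unknown source {name!r}: not a host, group, role, or zone")


if __name__ == "__main__":
    for symbol in ("reverse_proxy", "lan", "media", "photos01"):
        print(symbol, "->", resolve_source(symbol))
```

Because only the lookup tables hold addresses, renumbering a host or subnet changes one inventory value and leaves every catalog entry untouched.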

### Each layer renders only its own slice

The same catalog feeds both layers; each filters for the rules it owns:

| Ingress rule | Host nftables | OPNsense |
|---|---|---|
| `from: reverse_proxy` (a `srv` peer) | allow proxy IP → port | — (intra-`srv`, invisible) |
| `from: lan` (cross-VLAN) | allow `lan` subnet → port | allow `lan` → host:port |
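
The slicing in the table can be sketched in Python as follows. This is an assumption-laden illustration, not the planned renderer: the `Ingress` type and function names are hypothetical, every service is assumed to live in `srv`, and any source that is not a zone name is treated as a peer inside `srv`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ingress:
    service: str
    source: str  # symbolic `from` value
    port: int
    proto: str


# Zone names taken from the VLANs mentioned in this document.
ZONES = {"lan", "iot", "guest", "mgmt", "srv"}
SERVICE_ZONE = "srv"  # assumption: all catalog services sit in `srv`


def source_zone(source: str) -> str:
    """A zone name is its own zone; hosts, groups, and roles count as `srv` peers."""
    return source if source in ZONES else SERVICE_ZONE


def host_slice(rules: list[Ingress]) -> list[Ingress]:
    """The host firewall enforces every declared ingress rule."""
    return list(rules)


def opnsense_slice(rules: list[Ingress]) -> list[Ingress]:
    """OPNsense only sees, and so only renders, rules that cross a zone boundary."""
    return [rule for rule in rules if source_zone(rule.source) != SERVICE_ZONE]


if __name__ == "__main__":
    catalog = [
        Ingress("photoprism", "reverse_proxy", 2342, "tcp"),
        Ingress("reverse_proxy", "lan", 443, "tcp"),
    ]
    print([r.service for r in host_slice(catalog)])
    print([r.service for r in opnsense_slice(catalog)])
```

Both slices read the same list, so a port cannot be declared for one layer and forgotten for the other.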

The dominant pattern falls out naturally: most services are **proxied** — their only
ingress is `from: reverse_proxy`; users reach them *through* the reverse proxy, which
alone carries `from: lan, port: 443`. This matches "services sit behind the reverse
proxy with authentication" (ADR-002).

"Shared catalog, each layer renders its own" was chosen over two alternatives: a single
connectivity model that generates both rulesets (too much machinery, and tight coupling
of two very different rule domains), and fully independent per-layer declarations (real
drift risk: a port opened on the host but not at OPNsense, or vice versa).

## OPNsense automation — owned here, mechanism deferred

OPNsense is **Ansible-managed** (CLAUDE.md: "OPNsense is entirely Ansible; do not reach
for a Terraform OPNsense provider"). It renders the **cross-VLAN slice** of the catalog
(every `from: <other-zone>` rule) plus the static ADR-007 facts (WAN edge, per-VLAN
egress, mgmt access, inter-VLAN defaults).

This ADR pins **what** OPNsense owns and that it renders from the shared catalog. The
**how** — config-XML templating vs the OPNsense API vs a plugin — is a substantial,
separate tooling decision, **deferred to the OPNsense-as-code follow-up spec**. Recorded
here as an explicit open sub-decision so it is not lost.

## Guardrails & enforcement

- **The catalog is authoritative.** If a port is not in the catalog, neither layer
  opens it. This hardens the existing CLAUDE.md guardrail ("never open a firewall port
  ad-hoc on a host") into a positive contract.
- **The `firewall` tag** (ADR-019) marks firewall tasks, so `--tags firewall` re-renders
  rules on `base` and any service role that contributes them.
- **Drift detection (aspiration).** A deterministic check — in the spirit of
  `scripts/check-tags.py` — compares each host's actual listening ports / live `nft`
  ruleset against the catalog and flags anything undeclared. Ties to TODO 8.5
  (`/security-review`) and the "undeclared open ports" pre-scan idea. Listed as a
  consequence and future guardrail; not necessarily built in the first implementation.
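
One possible shape for the drift check's core comparison, as a hedged Python sketch. The function names and the `ALWAYS_ALLOWED` set are hypothetical, and collecting the listening ports (for example by parsing `ss -tuln` output on each host) is left out; only the set difference against the catalog is shown.

```python
# Catalog in the same shape as the YAML example in this document.
CATALOG = {
    "photoprism": {"ingress": [{"from": "reverse_proxy", "port": 2342, "proto": "tcp"}]},
    "reverse_proxy": {"ingress": [{"from": "lan", "port": 443, "proto": "tcp"}]},
}

# Assumption: management-plane ports are allowed independently of the catalog.
ALWAYS_ALLOWED = {(22, "tcp")}


def declared_ports(catalog: dict, services: list[str]) -> set[tuple[int, str]]:
    """Collect (port, proto) pairs declared for the services placed on one host."""
    return {
        (rule["port"], rule["proto"])
        for service in services
        for rule in catalog[service]["ingress"]
    }


def undeclared_ports(
    listening: set[tuple[int, str]], catalog: dict, services: list[str]
) -> list[tuple[int, str]]:
    """Return listening (port, proto) pairs that the catalog does not account for."""
    allowed = declared_ports(catalog, services) | ALWAYS_ALLOWED
    return sorted(listening - allowed)


if __name__ == "__main__":
    observed = {(2342, "tcp"), (22, "tcp"), (8080, "tcp")}
    print(undeclared_ports(observed, CATALOG, ["photoprism"]))
```

A non-empty result would be the signal to fail the check; an empty list means every listening port on that host is either declared or part of the management plane.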

## Consequences

- "Per-host vs central" is answered: **both**, with clear ownership — central perimeter
  (OPNsense) + per-host default-deny with east-west allowlisting, fed by one catalog.
- Lateral movement within `srv` is constrained (the gap OPNsense can't close).
- One declarative catalog means no ad-hoc ports and no cross-layer drift on the shared
  facts (ports, IPs, sources).
- Cost: the catalog and the render-per-layer machinery must be built and maintained;
  east-west allowlisting adds per-service ingress declarations (mitigated by the
  proxied-by-default pattern, which keeps most entries to a single line).

## Scope

**This ADR decides:** the two-layer model and each layer's responsibilities; host
nftables = default-deny inbound + east-west allowlist + permissive egress + guaranteed
management plane + Docker `iptables:false`; the shared `group_vars` service catalog as
single source of truth with symbolic sources; each layer renders its own slice; the
no-ad-hoc-ports guardrail.

**Deferred to follow-up specs (each its own brainstorm → plan):**

1. **Host nftables implementation** in `base` — exact catalog schema, nftables template
   structure, Docker `iptables:false` integration, fail-safe ordering, Molecule tests.
   The natural next spec.
2. **OPNsense-as-code** — the tooling mechanism + cross-VLAN rule rendering.
3. **Drift-detection check** — if/when we build it.

## Related

ADR-002 (security baseline: nftables default-deny, fail2ban, blast radius),
ADR-004 (Docker model: `iptables:false`), ADR-007 (network topology, VLANs, OPNsense,
per-VLAN egress), ADR-016 (NetBird mesh: SSH on `wt0` only), ADR-019 (`firewall` tag).

docs(spec): firewall strategy design (TODO 3.5 → ADR-020) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-06 15:36:24 +02:00