# ADR-007 — Network topology and addressing
## Status
Accepted (2026-05-30)
## Context
The boma homelab is a Proxmox cluster on a dedicated private network behind an
OPNsense firewall. This document records the agreed physical topology, VLAN
design, IP addressing conventions, naming scheme, and DNS zone structure.
Everything here feeds directly into Terraform variables, Ansible inventory,
and OPNsense configuration.

---
## Decision
### Physical topology
```
ISP
 └── OPNsense (dedicated hardware)
      ├── WAN: ISP uplink
      └── LAN: 802.1q trunk to managed switch
       ┌──────────────┼──────────────┬──────────────┐
       │              │              │              │
      pve0           pve1           pve2        AP1 / AP2
  (eno1 trunk)   (eno1 trunk)   (eno1 trunk)     (trunk)
(eno2 corosync)(eno2 corosync)(eno2 corosync)
       └──────────────┴──────────────┘
        172.16.0.0/24 (corosync ring, not on managed switch)
```
**Dual NICs per Proxmox node:**
- `eno1` — VLAN-aware trunk. Carries all VLANs via a single VLAN-aware bridge
(`vmbr0`). VMs get their VLAN tag assigned in Proxmox.
- `eno2` — Dedicated corosync ring (`vmbr1`). Direct link or tiny unmanaged
switch between the three nodes only. Never touches the main switch fabric.

**Access points** broadcast multiple SSIDs, each tagged to its corresponding VLAN
(trusted WiFi → VLAN 30, IoT → VLAN 40, guest → VLAN 50).

---
### VLAN design
| VLAN | Name | Subnet | Purpose |
|---|---|---|---|
| 10 | `mgmt` | `10.10.0.0/24` | Proxmox hosts, OPNsense, managed switch. No internet except update repos. |
| 20 | `srv` | `10.20.0.0/24` | All Debian VMs and Docker services. 100% static. Terraform provisions here. |
| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
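
The VLAN table can be checked mechanically before it is copied into Terraform variables or OPNsense. The following is a minimal sketch, not part of the repository: the `VLANS` mapping and both helper names are hypothetical, and only the VLAN IDs, names, and subnets are taken from the table above.

```python
import ipaddress
from itertools import combinations

# VLAN id -> (name, subnet), transcribed from the table above.
# VLAN 99 is retired and deliberately omitted.
VLANS = {
    10: ("mgmt", "10.10.0.0/24"),
    20: ("srv", "10.20.0.0/24"),
    30: ("lan", "10.30.0.0/24"),
    40: ("iot", "10.40.0.0/24"),
    50: ("guest", "10.50.0.0/24"),
}


def overlapping_vlans(vlans):
    """Return pairs of VLAN ids whose subnets overlap (expected: none)."""
    nets = {vid: ipaddress.ip_network(cidr) for vid, (_, cidr) in vlans.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def vlan_for(address, vlans=VLANS):
    """Return the name of the VLAN containing `address`, or None."""
    ip = ipaddress.ip_address(address)
    for _, (name, cidr) in sorted(vlans.items()):
        if ip in ipaddress.ip_network(cidr):
            return name
    return None
```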
---
### IP addressing
#### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP
| Address | Host |
|---|---|
| `10.10.0.1` | OPNsense LAN (mgmt) |
| `10.10.0.2` | Managed switch |
| `10.10.0.200` | `pve0` |
| `10.10.0.201` | `pve1` |
| `10.10.0.202` | `pve2` |
#### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static
| Range | Purpose |
|---|---|
| `10.20.0.1` | OPNsense gateway |
| `10.20.0.10`–`.19` | Core infrastructure VMs (DNS, proxy) |
| `10.20.0.20`–`.49` | Additional static infrastructure |
| `10.20.0.50`–`.249` | Terraform-provisioned VMs |

Assigned infrastructure addresses:

| Address | Host | Role |
|---|---|---|
| `10.20.0.10` | `dns1` | Primary DNS server |
| `10.20.0.11` | `dns2` | Secondary DNS server |
| `10.20.0.12` | `proxy` | Reverse proxy |
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
> **Control node `ubongo` — legacy V4 network (transitional).** `ubongo` (ADR-015) is the
> manually-provisioned physical control node and currently lives on the **legacy V4
> homelab network at `10.20.10.151`** — boma is being built up from the V4 base, and the
> physical LAN has not yet been re-cut to this VLAN scheme. That address is therefore
> **outside** the planned `srv` `10.20.0.0/24`; `base__firewall_control_addr` and the
> inventory point at the real (V4) address. When the network is migrated to these VLANs,
> `ubongo` moves into `mgmt`/`srv` and this note is retired.
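
To make the `srv` range plan and the `ubongo` caveat concrete, here is a small sketch that classifies an address against the range table. The function name and the `SRV_RANGES` structure are illustrative assumptions; the boundaries come from the table above, and addresses that fall between the listed ranges are reported as unallocated.

```python
import ipaddress

SRV_NET = ipaddress.ip_network("10.20.0.0/24")

# (first host octet, last host octet, purpose), from the srv range table.
SRV_RANGES = [
    (1, 1, "OPNsense gateway"),
    (10, 19, "Core infrastructure VMs"),
    (20, 49, "Additional static infrastructure"),
    (50, 249, "Terraform-provisioned VMs"),
]


def classify_srv(address):
    """Return the planned purpose of an address, or why it has none."""
    ip = ipaddress.ip_address(address)
    if ip not in SRV_NET:
        return "outside srv"
    last_octet = int(ip) & 0xFF
    for first, last, purpose in SRV_RANGES:
        if first <= last_octet <= last:
            return purpose
    return "unallocated"
```

With these definitions, `classify_srv("10.20.10.151")` returns `"outside srv"`, which matches the note above: the legacy V4 address of `ubongo` shares the `10.20.` prefix but is not inside `10.20.0.0/24`.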
#### VLAN 30 — lan (10.30.0.0/24)
| Range | Purpose |
|---|---|
| `10.30.0.1` | OPNsense gateway |
| `10.30.0.100`–`.249` | DHCP pool |
#### VLAN 40 — iot (10.40.0.0/24)
| Range | Purpose |
|---|---|
| `10.40.0.1` | OPNsense gateway |
| `10.40.0.100`–`.249` | DHCP pool |
#### VLAN 50 — guest (10.50.0.0/24)
| Range | Purpose |
|---|---|
| `10.50.0.1` | OPNsense gateway |
| `10.50.0.100`–`.249` | DHCP pool |
#### VLAN 99 — vpn — retired
The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
`10.99.0.0/24` is freed.
#### Corosync ring (172.16.0.0/24) — not on managed switch
| Address | Host |
|---|---|
| `172.16.0.200` | `pve0` |
| `172.16.0.201` | `pve1` |
| `172.16.0.202` | `pve2` |
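
In both the `mgmt` and corosync tables, node `pve<n>` sits at host address `.200 + n`. The sketch below derives both addresses from the node index. It assumes this pattern is intentional, which the tables suggest but the text does not state, and `node_addresses` is a hypothetical helper name.

```python
import ipaddress

MGMT_BASE = ipaddress.ip_address("10.10.0.200")      # pve0 on VLAN 10
COROSYNC_BASE = ipaddress.ip_address("172.16.0.200")  # pve0 on the ring


def node_addresses(n):
    """Addresses for Proxmox node pve<n>, assuming the .200 + n pattern."""
    if not 0 <= n <= 2:
        raise ValueError("only pve0..pve2 are defined in the tables")
    return {
        "name": f"pve{n}",
        "mgmt": str(MGMT_BASE + n),
        "corosync": str(COROSYNC_BASE + n),
    }
```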
---
### OPNsense firewall rules (intent)
| Source | Destination | Policy |
|---|---|---|
| `mgmt` | anywhere | allow (administrator access) |
| `srv` | `srv` | allow (inter-service communication) |
| `srv` | internet | allow (updates, image pulls) |
| `lan` | `srv` (allow-list) | allow specific published ports only |
| `lan` | internet | allow |
| `iot` | internet | allow egress only |
| `iot` | `srv` (HA IP only) | allow on integration ports |
| `guest` | internet | allow, isolated from all internal |
| mesh peers | `srv` (metrics ports) | allow (monitoring); enforced by NetBird ACLs, not OPNsense (ADR-016) |
| mesh peers | `mgmt` | allow (administration); enforced by NetBird ACLs (ADR-016) |

**Home Assistant ↔ IoT**: The HA VM at `10.20.0.13` can reach the IoT VLAN on the
required ports. OPNsense Avahi (mDNS reflector) bridges `srv` and `iot` for device
discovery. IoT devices cannot initiate connections to `srv`.
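
The OPNsense rows of the intent table can be read as an ordered allow-list. The sketch below models them as data, with a default-deny fallback for any pair the table does not list; that fallback is an assumption, as are the `RULES` structure and the `lookup` name. The mesh-peer rows are left out because they are enforced by NetBird ACLs rather than OPNsense, and the real rules match on addresses and ports, not zone names.

```python
# (source zone, destination zone, condition) from the intent table.
# "*" means any destination; a condition of None means unrestricted.
RULES = [
    ("mgmt", "*", None),
    ("srv", "srv", None),
    ("srv", "internet", None),
    ("lan", "srv", "published ports only"),
    ("lan", "internet", None),
    ("iot", "internet", None),
    ("iot", "srv", "Home Assistant IP, integration ports only"),
    ("guest", "internet", None),
]


def lookup(source, destination):
    """Return (allowed, condition) for a zone pair; unlisted pairs are denied."""
    for src, dst, condition in RULES:
        if src == source and dst in ("*", destination):
            return True, condition
    return False, None
```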
---
### Naming scheme
| Layer | Convention | Examples |
|---|---|---|
| Homelab name | `boma` | — |
| Proxmox nodes | `pve<n>` | `pve0`, `pve1`, `pve2` |
| Infrastructure VMs | `<role><n>` | `dns1`, `dns2`, `proxy` |
| Hetzner VPS | `askari` | Swahili for guard/sentinel |
| Internal FQDN | `<host>.boma.baobab.band` | `dns1.boma.baobab.band` |
| Public service FQDN | `<service>.wingu.me` | `vaultwarden.wingu.me` |
| Off-site (VPS) FQDN | `<service>.askari.wingu.me` | `netbird.askari.wingu.me` |
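
The three FQDN conventions in the table reduce to simple string templates. A minimal sketch, with hypothetical helper names; the zone names are the ones used in this document, and the internal zone is the current name rather than the planned rename target.

```python
INTERNAL_ZONE = "boma.baobab.band"  # current internal zone name
PUBLIC_ZONE = "wingu.me"


def internal_fqdn(host):
    """Internal FQDN: <host>.boma.baobab.band"""
    return f"{host}.{INTERNAL_ZONE}"


def service_fqdn(service):
    """Public service FQDN: <service>.wingu.me"""
    return f"{service}.{PUBLIC_ZONE}"


def offsite_fqdn(service):
    """Off-site (VPS) FQDN: <service>.askari.wingu.me"""
    return f"{service}.askari.{PUBLIC_ZONE}"
```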
---
### DNS zones and split-horizon
**Internal zone**: `boma.baobab.band` **today**, served by `dns1` and `dns2` (the `dns`
role itself is not built yet). **Target:** the zone is renamed to `boma.wingu.me` in
Phase 2, when the `dns` role lands. Until then, `boma.baobab.band` is the authoritative
internal name **everywhere it appears**: the naming table above, the split-horizon
description below, the OPNsense forwarder, and ADR-009/016. This paragraph is the single
source for that transition; other references use the current name and inherit this caveat.

The zone is rendered by the Ansible `dns` role: host A records come from the
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
and service/alias/split-horizon records are explicit zone data in `group_vars`.
Terraform itself writes no DNS records; see ADR-009.
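
As an illustration of the inventory-to-zone step, the sketch below turns a host-to-address mapping into A record lines. It is not the `dns` role's implementation: the input shape, the function name, and the record text format are assumptions made for the example.

```python
import ipaddress


def render_a_records(hosts, zone="boma.baobab.band"):
    """Render A record lines from a {short hostname: IPv4 address} mapping.

    The mapping stands in for inventory data; the real role may read a
    different structure and use its own template.
    """
    lines = []
    for name, addr in sorted(hosts.items()):
        ipaddress.IPv4Address(addr)  # raises ValueError on a malformed address
        lines.append(f"{name}.{zone}. IN A {addr}")
    return lines
```

For example, `render_a_records({"dns1": "10.20.0.10", "proxy": "10.20.0.12"})` yields one line per host, sorted by name, each ending in the host's `srv` address.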

**Public zone**: `wingu.me`, hosted on Gandi LiveDNS and **managed as code** by the
`public_dns` role (`vault.gandi.pat`). Naming has three tiers: infra
`<host>.boma.wingu.me` (internal; this is the Phase-2 target, currently
`boma.baobab.band`, see *Internal zone* above), services `<service>.wingu.me`
(split-horizon), and off-site `<service>.askari.wingu.me`. `nyumbani` is retired.
**Mesh/LAN-only by default**: home services have no public record (they are reached over
the LAN or the NetBird mesh); only deliberate exceptions are published. The project is
`boma`; the domain is `wingu.me`. The legacy `baobab.band` zone (Cloudflare) is out of
scope here.

**Split-horizon**: `dns1`/`dns2` serve internal answers for any hostname that has
both a public and a private face. Example: `vaultwarden.wingu.me` resolves to
`10.20.0.12` (proxy) internally and to the public IP externally. The internal zone
will be renamed to `boma.wingu.me` when the `dns` role is built (Phase 2).

The OPNsense DNS resolver forwards `boma.baobab.band` queries to `dns1`/`dns2`.
All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).
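
The forwarding rule and the split-horizon example can be sketched as two small lookups. This is only a model of the behaviour described above, not resolver configuration: the function names are hypothetical, and the public address is an RFC 5737 documentation placeholder because the real public IP is not stated here.

```python
INTERNAL_ZONE = "boma.baobab.band"
INTERNAL_RESOLVERS = ["10.20.0.10", "10.20.0.11"]  # dns1, dns2
UPSTREAM_RESOLVERS = ["1.1.1.1", "9.9.9.9"]

# Internal answers for names that also have a public face.
INTERNAL_OVERRIDES = {"vaultwarden.wingu.me": "10.20.0.12"}  # proxy
PUBLIC_PLACEHOLDER = "203.0.113.10"  # placeholder, not the real public IP


def forward_targets(qname):
    """Resolvers the OPNsense forwarder would send `qname` to."""
    name = qname.rstrip(".").lower()
    if name == INTERNAL_ZONE or name.endswith("." + INTERNAL_ZONE):
        return INTERNAL_RESOLVERS
    return UPSTREAM_RESOLVERS


def split_horizon_answer(qname, view):
    """Answer for `qname` as seen from the "internal" or "external" view."""
    name = qname.rstrip(".").lower()
    if view == "internal" and name in INTERNAL_OVERRIDES:
        return INTERNAL_OVERRIDES[name]
    return PUBLIC_PLACEHOLDER
```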
---
### External monitoring — askari
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
ACLs; there is no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.

`askari` is provisioned as **Terraform IaC** (`hetznercloud/hcloud`), managed
independently of the Proxmox cluster (its own provider and local state in
`terraform/environments/offsite/`). It must be reachable even when the homelab is down
(its entire purpose), which is also why the mesh coordinator lives here: an off-site
control plane survives a homelab outage.

FQDN: `askari.wingu.me` (off-site tier; record added by `public_dns` when askari exists, M2/M4).

---
## Consequences
Drawn from the implications already stated above:
- VLAN 99 (`vpn`, `10.99.0.0/24`) is retired and the subnet freed; remote access is
carried by the self-hosted NetBird mesh instead of an OPNsense WireGuard subnet
(VLAN design; IP addressing — VLAN 99 retired).
- Mesh-peer firewall allowances (to `srv` metrics ports and `mgmt`) are enforced by
NetBird ACLs, not OPNsense rules (OPNsense firewall rules (intent)).
- IoT devices cannot initiate connections to `srv`; only Home Assistant at
`10.20.0.13` may reach the IoT VLAN, with OPNsense Avahi bridging `srv` and `iot`
for discovery (OPNsense firewall rules (intent)).
- Terraform writes no DNS records; the Ansible `dns` role renders the internal zone
from inventory plus `group_vars`, with `dns1`/`dns2` serving split-horizon answers
(DNS zones and split-horizon).
- `askari` runs independently of the cluster so it survives a homelab outage, which
is why the off-site NetBird control plane lives there (External monitoring —
askari).