docs: reconcile lower-severity review findings (O9-O24)

- ADR-007: document ubongo on the legacy V4 net at 10.20.10.151 (transitional,
  outside the planned srv /24 until the LAN is re-cut) (O10); single authoritative
  boma.baobab.band -> boma.wingu.me transition note already added earlier
- terraform tfvars.example + variables.tf (both envs): pve01 -> pve0 and
  <host>.boma.baobab.band per ADR-007 naming (O11)
- ADR-012/013/015/016/017/018: convert "See also:" prose to `## Related` sections
  placed after Consequences, matching ADR-014/019-023 (O13)
- docs/README + inventories/README: list the missing subdirs / offsite_hosts +
  offsite.yml merge behaviour (O14, O29 note)
- ADR-009: drop the retired `nyumbani` example; use vaultwarden.wingu.me split-horizon (O19)
- ROADMAP M2: askari shipped as cx23/x86 (CAX11/ARM out of stock) (O20)
- ADR-020: 80/443/3478 opened in M4a (past tense); coordinator role is M4b (O21)
- netbird -> netbird_coordinator across ROADMAP M4b, the M4b plan, ADR-024 (O23)
- ADR-024: align the M1 DNS-01 wildcard scope wording with ROADMAP (O24)
- capacity-scan.py: read the inventory directory so offsite.yml (askari) is seen (O28)
- tf_to_inventory.py: generated header now warns it overwrites the manual control node (O9)
- tests/tags.yml: proxy concern comment Traefik -> Caddy (missed in the O3 sweep)

O9's existing stub hosts.yml header stays as-is (generator-owned, hook-protected);
the fix lives in the generator for the next regeneration. make lint + pytest (57) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-14 19:31:40 +02:00
parent 9b5851ba4b
commit 9e0c264658
22 changed files with 83 additions and 43 deletions

@ -6,6 +6,15 @@ Project documentation.
Numbered from 001; each records context, the decision, and what was ruled out.
- `runbooks/` — step-by-step operational procedures (add a host, add a role, rotate
secrets).
- `security/` — security baseline, accepted-risk register, per-service checklist +
template (ADR-002/004).
- `testing/` — testing methodology artifacts + the `VERIFY.md` template (ADR-008/017).
- `access/` — operational-access doctrine + the `ACCESS.md` template (ADR-021).
- `backup/` — backup doctrine + the `BACKUP.md` template (ADR-022).
- `hardware/` — capacity reference + `/capacity-review` output (ADR-012).
- `reviews/` — `/review-repo` audit trail.
- `CAPABILITIES.md` / `ROADMAP.md` / `TODO.md` / `FRICTION.md` — what boma does, the
build order, the backlog, and recurring-friction notes.
For what is actually **built vs only designed**, see `STATUS.md` at the repo root —
the ADRs describe intent, not necessarily current reality.

@ -79,9 +79,10 @@ zero-risk and *born at Gandi*.
### M2 · `askari` provisioned + under Ansible
Provision the Hetzner VPS **as IaC with Terraform** (CAX11 ARM / Helsinki / Debian 13,
behind a TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap
it. Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
Provision the Hetzner VPS **as IaC with Terraform** (Helsinki / Debian 13, behind a
TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap it.
**Shipped as cx23/x86** (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
x86, cheaper). Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
@ -113,8 +114,8 @@ Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's
**Caddy** reverse proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
valid Let's Encrypt cert (HTTP-01 — DNS-01 deferred to Phase 2, see ADR-024/FRICTION).
Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
`…2026-06-14-m4a-docker-caddy.md`. **M4b (next):** the `netbird` service role — read
NetBird's current self-host compose then.
`…2026-06-14-m4a-docker-caddy.md`. **M4b (next):** the `netbird_coordinator` service
role — read NetBird's current self-host compose then.
Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the
**embedded IdP** (ADR-016 — no Authentik dependency), fronted by the now-proven Caddy.

@ -87,6 +87,14 @@ Assigned infrastructure addresses:
| `10.20.0.12` | `proxy` | Reverse proxy |
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
> **Control node `ubongo` — legacy V4 network (transitional).** `ubongo` (ADR-015) is the
> manually-provisioned physical control node and currently lives on the **legacy V4
> homelab network at `10.20.10.151`** — boma is being built up from the V4 base, and the
> physical LAN has not yet been re-cut to this VLAN scheme. That address is therefore
> **outside** the planned `srv` `10.20.0.0/24`; `base__firewall_control_addr` and the
> inventory point at the real (V4) address. When the network is migrated to these VLANs,
> `ubongo` moves into `mgmt`/`srv` and this note is retired.
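
The "outside the planned `srv` /24" statement in the note above can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the constant and function names are illustrative and do not come from the repo.

```python
import ipaddress

# Addresses quoted in the note and table above; the names are illustrative only.
SRV_NET = ipaddress.ip_network("10.20.0.0/24")       # planned srv range
UBONGO_ADDR = ipaddress.ip_address("10.20.10.151")   # legacy V4 address


def in_planned_srv(addr) -> bool:
    """Return True if addr falls inside the planned srv /24."""
    return addr in SRV_NET


print(in_planned_srv(UBONGO_ADDR))                          # False
print(in_planned_srv(ipaddress.ip_address("10.20.0.12")))   # True (proxy)
```

A check of this shape could be run against any address meant to live in `srv`; until the LAN is re-cut, `ubongo` is expected to fail it.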
#### VLAN 30 — lan (10.30.0.0/24)
| Range | Purpose |

@ -119,7 +119,8 @@ rendered entirely by the Ansible `dns` role:
remains the ultimate source of truth for which hosts exist; the data simply flows
through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
gateway, `forgejo.nyumbani.baobab.band` → proxy) are explicit zone data in `group_vars`.
gateway, `vaultwarden.wingu.me` → proxy split-horizon) are explicit zone data in
`group_vars`.
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would

@ -45,4 +45,6 @@ workload that should move, or a node due an upgrade.
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
not assumed.
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
## Related
ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).

@ -74,5 +74,7 @@ copy.
cost of a clean methodological break.
- The policy is enforceable in review and by the AI guardrails above.
See also: ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
## Related
ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
(update management — ntfy topics decided fresh per this policy).

@ -153,5 +153,7 @@ master password.
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
See also: ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
## Related
ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).

@ -106,11 +106,6 @@ allocated for it.
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).
## Consequences
- A new public surface appears on `askari` — management API + dashboard (80/443) +
@ -129,3 +124,10 @@ ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN addre
operator footprint (What was ruled out).
- Implementation is pending: the role tasks land only once the unbuilt `base` role and
service-role machinery exist (Status).
## Related
ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).

@ -88,9 +88,6 @@ them.
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Caddy+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |
See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).
## Consequences
- The harness is confined to staging by a hard stop: it refuses to run against
@ -108,3 +105,8 @@ ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge
skill, conventions/checklist edits), but running is deferred on its dependencies:
`ubongo`, the `playwright` plugin, Authentik, a staging deploy, and `make new-role`
scaffolding `VERIFY.md` (Status; Dependencies).
## Related
ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).

@ -94,10 +94,6 @@ the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence a
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).
## Consequences
- Opportunistic track-covering and host-pivot-to-store are defeated because logs leave
@ -120,3 +116,9 @@ standard), ADR-011 (health checks — distinct from this).
- The decision is authorable now but the live pipeline is deferred on the stack:
Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog config, and the
push-only credential (Status; Dependencies).
## Related
ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).

@ -88,9 +88,9 @@ declarations (real drift risk).
`askari` sits outside the Proxmox cluster and has no OPNsense. Its **perimeter** layer
is a TF-managed **Hetzner Cloud Firewall** (declared in `terraform/environments/offsite/`)
alongside the VM itself. Current rule set (M2): SSH inbound from `ubongo`'s public IP
only. NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator
role is built.
alongside the VM itself. Rule set: SSH inbound from `ubongo`'s public IP (M2), plus
TCP 80/443 + UDP 3478 opened in **M4a** (Caddy + NetBird). The `netbird_coordinator`
service role that uses 3478 lands in **M4b**; the ports are already open.
The `group_vars` service catalog remains authoritative for `askari`'s **host nftables**
layer — the same two-layer model applies, with Hetzner Cloud Firewall substituting for

@ -19,9 +19,9 @@ Accepted (2026-06-14). Amends the soft Traefik assumption carried by the roadmap
boma needs a reverse proxy to front its services with TLS. ADR-002 requires every
service to sit behind a proxy with authentication before it is reachable; ADR-007/M1
delivers a `*.boma.<domain>` wildcard cert via ACME DNS-01 against Gandi — the only
viable cert path for mesh/LAN-only services that cannot satisfy HTTP-01 (no public
A-record to point at).
delivers a `*.<domain>` wildcard cert via ACME DNS-01 against Gandi (the apex `boma`
domain, matching ROADMAP M1) — the only viable cert path for mesh/LAN-only services
that cannot satisfy HTTP-01 (no public A-record to point at).
The roadmap (Phase-2, step 5) and ADR-017 prose assumed **Traefik + Authentik** as the
auth-and-proxy pair without an ADR ever pinning Traefik. On closer inspection:
@ -82,7 +82,7 @@ blockers, so DNS-01 is deferred; see the Status note.)
The first Caddy instance runs on `askari` (M4a), serving a test vhost over HTTP-01 to
prove the proxy + ACME path. It fronts the NetBird stack in **M4b** (when the
`netbird` coordinator role is built). The pattern generalises to the Proxmox cluster in
`netbird_coordinator` role is built). The pattern generalises to the Proxmox cluster in
Phase 2 when services multiply.
### 4. Authentik integration (deferred)

@ -2,7 +2,7 @@
> **For agentic workers:** REQUIRED SUB-SKILL: superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use `- [ ]` checkboxes.
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird_coordinator`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
**Architecture:** NetBird's own `configure.sh` generates the canonical compose + config for a pinned version; boma **captures that reference once and translates it into role templates** (ADR-004/013 — don't run their imperative script in production, render from templates). Runs in **external-reverse-proxy mode** (no bundled Traefik); Caddy adds a `netbird.askari.wingu.me` route. Secrets (datastore encryption key, TURN password, Dex secrets) are generated into vault; the setup key is stubbed `CHANGEME` for M5.
@ -23,9 +23,9 @@
---
### Task 2: `netbird` service role — templates
### Task 2: `netbird_coordinator` service role — templates
**Files:** `roles/netbird/` (scaffold via `make new-role NAME=netbird`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
**Files:** `roles/netbird_coordinator/` (scaffold via `make new-role NAME=netbird_coordinator`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
- [ ] **Step 1:** Translate the captured compose into `templates/docker-compose.yml.j2` — containers, the shared `boma` Docker network (so Caddy reaches them by name), **no host port mappings except what Caddy/Coturn need** (Coturn 3478/udp; everything else internal, Caddy fronts it). Pin image tags (ADR-011).
- [ ] **Step 2:** Translate `management.json`/`config.yaml` into a template — fill `Datadir`, `DataStoreEncryptionKey` (`{{ vault.netbird.datastore_key }}`), `HttpConfig` (public URL `https://netbird.askari.wingu.me`), `TURNConfig` (coturn host + `{{ vault.netbird.turn_password }}`), `Signal`, `Relay`, `Store` (sqlite), and the embedded-Dex IdP block (DeviceAuthorizationFlow/PKCE, `openid-configuration.json` URL).
@ -53,7 +53,7 @@
### Task 5: Service-role standard files (ADR-004, authored)
- [ ] **Step 1:** Author `roles/netbird/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
- [ ] **Step 1:** Author `roles/netbird_coordinator/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
- [ ] **Step 2:** `VERIFY.md` (copy the template; the `/verify-service` UI spec — run later when the playwright harness exists).
- [ ] **Step 3:** `ACCESS.md` (ADR-021; the dashboard/admin access + `access__*` intent).
- [ ] **Step 4:** `BACKUP.md` (ADR-022; the **datastore is stateful** — `backup__*` data; record that off-site backup is **pending `fisi`** — an accepted risk for now).
@ -63,7 +63,7 @@
### Task 6: Add netbird to the offsite playbook
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird` after `reverse_proxy` (role-name tag). `make lint`. Commit.
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird_coordinator` after `reverse_proxy` (role-name tag). `make lint`. Commit.
---
@ -80,7 +80,7 @@
### Task 8: Docs
- [ ] **Step 1:** STATUS — `netbird` coordinator built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
- [ ] **Step 1:** STATUS — `netbird_coordinator` built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
---

@ -6,6 +6,11 @@ hold per-group and per-host configuration.
- `hosts.yml` is **generated** from Terraform outputs by `make tf-inventory` — do not
hand-edit. The control node is the one manual exception.
- `offsite.yml` (in `production/`) is a **second** generated inventory file, written by
`make tf-inventory-offsite` from the offsite Terraform env; it holds the `offsite_hosts`
group (`askari`). Ansible merges it with `hosts.yml`, so both can declare the same group
names harmlessly (the offsite generator emits all four groups, most empty).
- Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
- Terraform→inventory data flow and the data contract: **ADR-009**.
- Addressing conventions (subnets, ranges): **ADR-007**.
- Layout and host groups: see CLAUDE.md ("Inventory structure").
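
The merge behaviour described above can be sketched roughly as a per-group union of hosts across the inventory files. This is a simplification: real Ansible also merges variables and child groups. The `merge_inventories` helper and the sample group contents are illustrative assumptions, not code or data from the repo.

```python
from collections import defaultdict


def merge_inventories(*sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Union hosts per group across sources, keeping first-seen order."""
    merged: dict[str, list[str]] = defaultdict(list)
    for source in sources:
        for group, hosts in source.items():
            bucket = merged[group]  # registers the group even when it is empty
            for host in hosts:
                if host not in bucket:
                    bucket.append(host)
    return dict(merged)


# Assumed shapes: both files declare the same four groups, most of them empty.
hosts_yml = {
    "control": ["ubongo"],
    "docker_hosts": [],
    "proxmox_hosts": [],
    "offsite_hosts": [],
}
offsite_yml = {
    "control": [],
    "docker_hosts": [],
    "proxmox_hosts": [],
    "offsite_hosts": ["askari"],
}

merged = merge_inventories(hosts_yml, offsite_yml)
print(merged["control"])        # ['ubongo']
print(merged["offsite_hosts"])  # ['askari']
```

Because empty groups contribute nothing to the union, declaring the same group name in both files is harmless, which is the point the README entry makes.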

@ -39,4 +39,4 @@ services__base_dir: /opt/services
base__unattended_upgrades_enabled: true
# Management plane — activates the dormant ssh-from-control firewall rule
base__firewall_control_addr: "10.20.10.151" # ubongo (control node) LAN address — ADR-021 ssh-from-control source
base__firewall_control_addr: "10.20.10.151" # ubongo — legacy V4 addr (ADR-007); ADR-021 ssh-from-control

@ -130,7 +130,9 @@ def known_hostnames(env):
hosts |= parse_tf_hostnames(_run_json(["terraform", f"-chdir={tf_dir}", "output", "-json"]))
except (OSError, subprocess.CalledProcessError, ValueError):
pass
inv = os.path.join(REPO_ROOT, "inventories", env, "hosts.yml")
# Point at the inventory DIRECTORY so every source file merges — hosts.yml AND
# offsite.yml (offsite_hosts / askari), which a bare hosts.yml would miss.
inv = os.path.join(REPO_ROOT, "inventories", env)
try:
hosts |= parse_inventory_hostnames(_run_json(["ansible-inventory", "-i", inv, "--list"]))
except (OSError, subprocess.CalledProcessError, ValueError):

@ -53,6 +53,8 @@ def main() -> None:
"---",
"# Generated by scripts/tf_to_inventory.py — do not edit manually.",
"# Regenerate with: make tf-inventory TF_ENV=<env>",
"# This OVERWRITES the file, including any manually-added control node (ubongo) —",
"# re-add it afterwards (the one hand-edit exception; docs/runbooks/new-host.md Part E).",
"",
"all:",
" children:",

@ -6,9 +6,9 @@
#
# State is local (see backend.tf) — no Forgejo backend credentials needed.
proxmox_endpoint = "https://pve01.baobab.band:8006/"
proxmox_endpoint = "https://pve0.boma.baobab.band:8006/"
proxmox_insecure = false
proxmox_node = "pve01"
proxmox_node = "pve0"
vm_template_id = 9000 # Proxmox VM ID of the Debian 13 cloud-init template
vm_datastore_id = "local-lvm"

@ -1,7 +1,7 @@
# Proxmox
variable "proxmox_endpoint" {
description = "Proxmox API URL, e.g. https://pve01.baobab.band:8006/"
description = "Proxmox API URL, e.g. https://pve0.boma.baobab.band:8006/"
type = string
}

@ -6,9 +6,9 @@
#
# State is local (see backend.tf) — no Forgejo backend credentials needed.
proxmox_endpoint = "https://pve01.baobab.band:8006/"
proxmox_endpoint = "https://pve0.boma.baobab.band:8006/"
proxmox_insecure = true # set false once a valid TLS cert is in place
proxmox_node = "pve01"
proxmox_node = "pve0"
vm_template_id = 9000 # Proxmox VM ID of the Debian 13 cloud-init template
vm_datastore_id = "local-lvm"

@ -1,7 +1,7 @@
# Proxmox
variable "proxmox_endpoint" {
description = "Proxmox API URL, e.g. https://pve01.baobab.band:8006/"
description = "Proxmox API URL, e.g. https://pve0.boma.baobab.band:8006/"
type = string
}

@ -19,7 +19,7 @@ concerns:
- monitoring # metric exporters / health checks
- config # render templated config/compose files to disk — no restart
- deploy # bring services up / restart (compose up -d)
- proxy # reverse-proxy + TLS registration (Traefik routes, Authentik)
- proxy # reverse-proxy + TLS registration (Caddy routes, Authentik)
# Ansible built-in special tags. Narrow use only:
# always — cheap preflight assertions (run regardless of --tags)