boma/docs/ROADMAP.md
sjat c09b7fe6a5 docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:34:21 +02:00

# ROADMAP — boma build order
High-level **build order** for the project. Almost everything in `docs/decisions/`
(the ADRs) is *designed, not built* — this file sequences that backlog into milestones
and records *why* the order is what it is.
- **What is built vs planned:** `STATUS.md` (ground truth — always check there first).
- **The backlog of decisions:** `docs/TODO.md` (this roadmap sequences it).
- **The design rationale:** `docs/decisions/` (ADRs).

This is a **living document**: update it as milestones land (move them to `STATUS.md`),
as ordering changes, or as new milestones appear. Each milestone gets its own
spec → plan → implementation cycle (`docs/superpowers/specs/` then `…/plans/`) when it
comes up; this file stays high-level.
_Last updated: 2026-06-19._

---
## Strategy — "remote-access first" (Approach A)
One focused track now (**Off-site / Remote-access**), a **procurement gate**, then the
**Cluster** track. Cross-cutting/ongoing work runs underneath both.
**Why this order.** The only physical machine that exists today is `ubongo` (the control
node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal
— reach `ubongo` from `mamba` / a work laptop while on the move — needs only things
already available or cheap to spin up (`askari` at Hetzner, the laptops). Doing the
remote-access track first:
1. **delivers the mobile-access goal in the first phase**, and
2. **doubles as the proving ground** for boma's core machinery — the first real *service
role* (NetBird), the `base` role on a *real, internet-facing* host, the `offsite_hosts`
pattern, public DNS + ACME, the backup contract, and `rbw`/vault in anger — all on two
cheap, low-stakes hosts **before** spending on the cluster.

Cluster hardware is then procured *after* those patterns are proven and a
`/capacity-review` informs the sizing — so the spend happens once, with knowledge.
Rejected alternatives: **B — procure now, build strictly bottom-up** (mobile access lands
late; spend precedes any proven pattern). **C — two parallel tracks** (for a solo operator
this collapses into interleaving with extra context-switching cost).

---
## Phase 1 — Off-site / Remote-access — ✅ COMPLETE (2026-06-17)
Delivers mobile access to `ubongo`; proves the machinery. Ordered by *real* dependencies.
All milestones (M1-M5) are done; the mobile-access goal is met. Next: the Procurement gate.
### M1 · boma's DNS home — a new domain at Gandi, managed as code
Register a **new Swahili-themed domain at Gandi** for boma and manage its records **as
code (IaC)**. Greenfield, not a migration: investigating the existing domains ruled them
out as boma's home — `baobab.band` is the **live legacy homelab** (Cloudflare; vaultwarden
/ nextcloud / matrix in daily use), and `ziethen.dk` is the **family's primary email**
(Fastmail); moving either's authoritative DNS risks breaking production. A fresh domain is
zero-risk and *born at Gandi*.
- **Driver:** values/sovereignty (Gandi) + a clean, decoupled home so boma builds without
endangering anything live. `baobab.band`'s Cloudflare exit / V4 decommission is a
**separate, later track**, not part of this build. `ziethen.dk` is untouched.
- **IaC approach:** follow boma's grain — internal DNS is already Ansible-rendered and
Terraform owns *no* DNS (CLAUDE.md), so **public DNS is Ansible-managed too** (Gandi
LiveDNS via an Ansible module — exact module pinned in M1's spec, verified per ADR-014).
- **Naming scheme (decided):** three tiers (on boma's new domain, `<boma-domain>`) —
`<host>.boma.<boma-domain>` (infra, internal-only) · `<service>.<boma-domain>`
(home/cluster services, split-horizon) · `<service>.askari.<boma-domain>` (off-site/VPS,
public). **`nyumbani` dropped.** Home services are **mesh/LAN-only by default** (no
public record; reached over LAN or the NetBird mesh), with public Gandi records only for
deliberate exceptions. The NetBird mesh carries the `<boma-domain>` match-domain to
road-warriors (resolver = dns1/dns2 over `wt0`); a `*.<boma-domain>` ACME **DNS-01**
wildcard cert (Gandi API) gives even unexposed services real TLS. Resolves TODO 4 and
review finding O12.
- **Records as a new/updated ADR:** amends ADR-007 — boma's public zone is
`<boma-domain>` at Gandi LiveDNS managed as code; the three-tier naming scheme;
`nyumbani` removed; mesh/LAN-only default; `baobab.band` (legacy, Cloudflare) is out of
scope.
- **Maps to:** ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (**resolved here**).
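The three-tier naming scheme above can be sketched as a small helper. This is an illustration only, not code from the repo: the function names and tier labels (`infra`, `service`, `offsite`) are hypothetical, and the default domain is a placeholder because the roadmap writes the zone as `<boma-domain>`.

```python
"""Sketch of the three-tier naming scheme (hypothetical helper, not part of boma)."""

# Placeholder zone: the roadmap only refers to it as <boma-domain>.
BOMA_DOMAIN = "example.org"


def fqdn(tier: str, name: str, domain: str = BOMA_DOMAIN) -> str:
    """Return the FQDN for `name` in one of the three tiers."""
    if tier == "infra":    # <host>.boma.<boma-domain>, internal-only
        return f"{name}.boma.{domain}"
    if tier == "service":  # <service>.<boma-domain>, split-horizon
        return f"{name}.{domain}"
    if tier == "offsite":  # <service>.askari.<boma-domain>, public
        return f"{name}.askari.{domain}"
    raise ValueError(f"unknown tier: {tier!r}")


def is_public_by_default(tier: str) -> bool:
    """Home services are mesh/LAN-only by default; only the off-site tier is public."""
    return tier == "offsite"
```

With `domain="wingu.me"`, `fqdn("offsite", "netbird", ...)` yields the dashboard hostname used later in M4.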
### M2 · `askari` provisioned + under Ansible
Provision the Hetzner VPS **as IaC with Terraform** (Helsinki / Debian 13, behind a
TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap it.
**Shipped as cx23/x86** (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
x86, cheaper). Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
module, `offsite` stack). Token via `TF_VAR_hcloud_token` from `vault.hetzner.token`.
- **Proves:** the `offsite_hosts` pattern, the TF→Ansible handoff for a non-Proxmox host
(`tf_to_inventory.py` extended), bootstrap of a non-cluster host. Closes review finding
O6 (`offsite_hosts` missing from `hosts.yml`).
- **Amends:** ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud
Firewall = perimeter), ADR-007/016 (`askari` TF-provisioned, not "added manually").
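The TF→Ansible handoff for a non-Proxmox host might look roughly like the sketch below. It is not the actual `tf_to_inventory.py`: the Terraform output name (`offsite_hosts`), its shape (a map of hostname to IPv4 address), and the sample address are assumptions. Only the `{"value": ...}` wrapper of `terraform output -json` and the Ansible YAML inventory layout are standard.

```python
"""Sketch of a Terraform-output to Ansible-inventory step (hypothetical, not tf_to_inventory.py)."""
import json


def tf_outputs_to_inventory(tf_output_json: str, group: str = "offsite_hosts") -> dict:
    """Build an Ansible inventory dict from `terraform output -json` text."""
    outputs = json.loads(tf_output_json)
    # `terraform output -json` wraps each output as {"value": ..., "type": ..., "sensitive": ...}.
    hosts = outputs[group]["value"]  # assumed shape: {hostname: ipv4}
    return {
        "all": {
            "children": {
                group: {
                    "hosts": {
                        name: {"ansible_host": addr}
                        for name, addr in sorted(hosts.items())
                    }
                }
            }
        }
    }


# Sample input; 203.0.113.10 is a documentation address, not askari's real IP.
SAMPLE = json.dumps(
    {
        "offsite_hosts": {
            "sensitive": False,
            "type": ["map", "string"],
            "value": {"askari": "203.0.113.10"},
        }
    }
)
```

Dumping the returned dict as YAML would give a file Ansible can read as an inventory source.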
### M3 · `base` matured to a "remote-access-sufficient" subset — ✅ DONE
Added the `hardening` concern to `base` (sshd drop-in key-only + `PermitRootLogin no`;
fail2ban sshd jail 5/1h; ADR-002) and **applied it to askari** by tag
(`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`) — SSH still works, fail2ban
active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO 15).
- **NetBird agent → M4** (deferred from M3: it enrolls against the coordinator, which
doesn't exist until M4 — ADR-016's coordinator-first bootstrap order).
- **Host firewall on askari + ubongo hardening → M5** (applying default-deny pre-mesh
would lock out SSH; the Hetzner Cloud Firewall is askari's perimeter until then).
- **Spec/plan:** `docs/superpowers/{specs,plans}/2026-06-14-base-ssh-fail2ban-m3*`.
- **Maps to:** ADR-002 (security baseline), ADR-020 (firewall — built, not yet applied),
TODO 15 (the rest of hardening → Phase 2).
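A VERIFY-style check for the sshd part of the `hardening` concern could be sketched as follows. It assumes that "key-only" is enforced with `PasswordAuthentication no`; the roadmap does not name the exact directives in the drop-in. The parser is deliberately simplified: it ignores `Match` blocks and `Include` handling, and relies on sshd's rule that the first value seen for a keyword wins.

```python
"""Sketch of an sshd hardening check (hypothetical; directive choice is an assumption)."""

# Assumed policy: key-only login and no root login.
REQUIRED = {"passwordauthentication": "no", "permitrootlogin": "no"}


def effective_sshd_options(config_text: str) -> dict:
    """Return keyword -> value, keeping the first occurrence of each keyword."""
    options = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        # Keywords are case-insensitive in sshd_config.
        options.setdefault(parts[0].lower(), parts[1].strip())
    return options


def hardening_violations(config_text: str) -> list:
    """List human-readable violations of the assumed policy (empty list = pass)."""
    opts = effective_sshd_options(config_text)
    return [
        f"{key} should be {want!r}, found {opts.get(key)!r}"
        for key, want in REQUIRED.items()
        if opts.get(key, "").lower() != want
    ]
```

Feeding it the concatenated drop-in and main config in the order sshd reads them approximates the effective settings; `sshd -T` on the host remains the authoritative source.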
### M4 · NetBird control plane on `askari` — first real service role
Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's standard
**Caddy** reverse proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
valid Let's Encrypt cert (HTTP-01; the Gandi **DNS-01** path is now built + proven —
2026-06-15, see ADR-024 — for mesh/LAN-only cluster services).
Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
`…2026-06-14-m4a-docker-caddy.md` / `…2026-06-14-m4b-netbird.md`.
**M4b — ✅ DONE (2026-06-16):** the `netbird_coordinator` service role, deployed to askari.
Reality differed from the original plan (captured fresh per ADR-014): NetBird **v0.72.4**
ships a **single combined `netbird-server`** container (management + signal + relay + STUN
+ **embedded Dex** IdP at `/oauth2`) plus `dashboard:v2.39.0` — **no separate signal/relay
container and no Coturn**. Fronted by the M4a Caddy via gRPC-h2c + WebSocket + path routing.
Dashboard live at `https://netbird.askari.wingu.me` (valid LE cert); `/api` auth-gated.
**M5 (enrol peers) is next** — incl. the first-boot `/setup` admin + setup keys.
- **First exercise of:** the service-role conventions (`SECURITY.md` / `VERIFY.md` /
`ACCESS.md` / `BACKUP.md`), public **TLS / ACME**, and the **backup contract**:
NetBird's management datastore is *stateful*, so it gets encrypted off-host backup
(ADR-016 §recovery, ADR-022).
- **Design choice (settled in M4a):** a minimal ACME-terminating reverse proxy
(Caddy, ADR-024) fronts NetBird on `askari`, rather than leaning on NetBird's bundled setup.
- **Maps to:** ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access),
ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).
### M5 · Enroll peers → goal reached — ✅ DONE (2026-06-17)
The `base` `mesh` concern enrolled **`ubongo` (`100.99.146.14`) + `askari`
(`100.99.226.39`)** as NetBird peers — both Management+Signal Connected, the ubongo↔askari
mesh link ping-verified. NetBird ships a default **Allow-All** peer policy, so any enrolled
peer reaches `ubongo` over `wt0`. The road-warrior clients (**`mamba` + the work laptop**)
are enrolled (operator, via `docs/runbooks/netbird-client.md`) → **`ubongo` is reachable
from anywhere. ← the mobile-access goal is met; Phase 1 is complete.**
- **Deferred to a "mesh-hardening" follow-on** (was folded into M5; split out as the
lockout-risky part): apply `base` nftables **default-deny** to `ubongo` + set
`base__firewall_control_addr` (ADR-021 `ssh-from-control`, built/dormant); tighten the
NetBird ACL off Allow-All to scoped policies; move `askari`'s SSH onto `wt0` (retiring
the Hetzner-firewall WAN allow). Safe to do now that the `wt0` path exists.
- **Maps to:** ADR-016, ADR-021 (SSH ladder: `wt0` + ssh-from-control), ADR-020.
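The difference between the default Allow-All policy and the scoped policies planned for mesh-hardening can be illustrated with a toy model. This is not NetBird's API or policy format: the group names, the rule shape, and the two example rules are all hypothetical.

```python
"""Toy model of Allow-All vs scoped mesh policies (hypothetical, not NetBird's API)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    src_group: str
    dst_group: str
    port: int


# Hypothetical group membership for the four peers named in M5.
GROUPS = {
    "road-warriors": {"mamba", "work-laptop"},
    "control": {"ubongo"},
    "offsite": {"askari"},
}

# Hypothetical scoped policy: laptops reach ubongo over SSH, ubongo reaches askari over SSH.
SCOPED = [
    Rule("road-warriors", "control", 22),
    Rule("control", "offsite", 22),
]


def allowed(src: str, dst: str, port: int, rules) -> bool:
    """`rules=None` models Allow-All; otherwise a connection needs a matching rule."""
    if rules is None:
        return True
    return any(
        src in GROUPS[r.src_group] and dst in GROUPS[r.dst_group] and port == r.port
        for r in rules
    )
```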
---
## Gate — Procurement decision
Run `/capacity-review` (intent-based) to size the cluster, **then procure the Proxmox
hardware**. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access)
has by now been rehearsed on two cheap hosts, so the spend happens once and informed.
- **Maps to:** ADR-012 (hardware & capacity), `/capacity-review`.
---
## Phase 2 — Cluster (gated on procurement; coarse until M5 is near)
Canonical dependency order:
1. **Terraform provisioning:** `terraform init`/apply the Proxmox VM module; regenerate
inventory via `make tf-inventory` (ADR-006, ADR-009).
2. **`base` full** — CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the
VM disk layout for CIS L2 is decided **before** provisioning (ADR-002, TODO 15).
3. **`docker_host`** — real Docker engine + Compose, daemon hardening, `nftables.d`
container rules (currently a scaffold; ADR-004, ADR-020).
4. **`dns` role** — render the internal zone from inventory (ADR-007).
5. **Auth + reverse proxy** — Authentik + **Caddy** (ADR-024): the foundation every
service sits behind with authentication (ADR-002).
6. **Monitoring** — Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters +
Uptime Kuma; decide which alerts live where (TODO 3.6).
7. **Service roles** — PhotoPrism, email, indexers, … (`docs/CAPABILITIES.md`); each
clears `docs/security/service-checklist.md` and carries its standard files.
8. **`backup` role + `fisi` pull node** — restic Model A, pCloud + USB air-gap (ADR-022).
9. **Forgejo Actions CI** — runner + workflows (ADR-003/010, TODO 1).
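Step 4 above (rendering the internal zone from inventory) can be sketched in a few lines. The real `dns` role is not built yet, so everything here is an assumption: the inventory shape, the record format, and the sample addresses, which come from a documentation range. The `<host>.boma.<boma-domain>` naming follows M1.

```python
"""Sketch of rendering internal A records from inventory (hypothetical, not the dns role)."""


def render_zone_records(inventory_hosts: dict, domain: str) -> list:
    """Return one `<host>.boma.<domain>. IN A <addr>` line per host, sorted by name."""
    lines = []
    for host in sorted(inventory_hosts):
        addr = inventory_hosts[host]["ansible_host"]
        lines.append(f"{host}.boma.{domain}. IN A {addr}")
    return lines


# Sample inventory; 192.0.2.x is a documentation range, not boma's LAN.
SAMPLE_HOSTS = {
    "ubongo": {"ansible_host": "192.0.2.10"},
    "dns1": {"ansible_host": "192.0.2.53"},
}
```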
---
## Underneath both — Cross-cutting / ongoing
- **Accept ADR-011** (update management) — resolve its 6 open questions before the first
scheduled patch run (TODO 16).
- **Kaizen `/retro`** + keep appending to `docs/FRICTION.md` (TODO 11); **`/security-review`**
skill (TODO 8.5); **`/review-repo` fortnightly cron** + headless email (TODO 8.1);
`scheduled_jobs` role (TODO 8.3).
- **User-notification function** — ntfy / matrix / email so tools + AI can reach the
operator (TODO 9; ties to ADR-011 control channel).
### Parked decisions — decide when they bite, not before
- Split-horizon FQDN with or without `nyumbani` (TODO 4): settled in M1 (`nyumbani` dropped).
- Central database server vs per-app databases (TODO 3.9) — at the service phase.
- Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
- Keep the custom Molecule base-image method as testing matures (TODO 3.10).
---
## Next step
**Phase 1 complete (M1-M5); mesh-hardening: ubongo (2/3) DONE 2026-06-19, askari redesign DONE 2026-06-20.**
Both hosts now run INPUT-only nftables default-deny (`base__firewall_input_only`), live reboot-validated.
askari's redesign (spec/plan `docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-askari-redesign*`)
applied INPUT-only default-deny + `wt0`-primary SSH + a permanent WAN break-glass + a geo-disabled
coordinator; a real reboot recovered unattended. Remaining mesh-hardening sub-projects:
1. ~~`ubongo` nftables default-deny + `ssh-from-control`~~ **DONE (2026-06-19).**
2. ~~**redesign** `askari`'s SSH → `wt0`~~ **DONE (2026-06-20):** boot-race, coordinator-bootstrap
chicken-egg, and Docker-nat-flush all resolved + live reboot-validated.
3. ~~**askari relay-SPOF reduction**~~ **DONE (2026-06-20):** assessed + **accepted** as a
documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).

**Then** the Procurement gate (`/capacity-review` → buy Proxmox hardware) opens Phase 2.