# ROADMAP — boma build order

High-level **build order** for the project. Almost everything in `docs/decisions/` (the ADRs) is *designed, not built* — this file sequences that backlog into milestones and records *why* the order is what it is.

- **What is built vs planned:** `STATUS.md` (ground truth — always check there first).
- **The backlog of decisions:** `docs/TODO.md` (this roadmap sequences it).
- **The design rationale:** `docs/decisions/` (ADRs).

This is a **living document**: update it as milestones land (move them to `STATUS.md`), as ordering changes, or as new milestones appear. Each milestone gets its own spec → plan → implementation cycle (`docs/superpowers/specs/` then `…/plans/`) when it comes up; this file stays high-level.

_Last updated: 2026-06-11._

---

## Strategy — "remote-access first" (Approach A)

One focused track now (**Off-site / Remote-access**), a **procurement gate**, then the **Cluster** track. Cross-cutting/ongoing work runs underneath both.

**Why this order.** The only physical machine that exists today is `ubongo` (the control node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal — reach `ubongo` from `mamba` / a work laptop while on the move — needs only things already available or cheap to spin up (`askari` at Hetzner, the laptops). Doing the remote-access track first:

1. **delivers the mobile-access goal in the first phase**, and
2. **doubles as the proving ground** for boma's core machinery — the first real *service role* (NetBird), the `base` role on a *real, internet-facing* host, the `offsite_hosts` pattern, public DNS + ACME, the backup contract, and `rbw`/vault in anger — all on two cheap, low-stakes hosts **before** spending on the cluster.

Cluster hardware is then procured *after* those patterns are proven and a `/capacity-review` informs the sizing — so the spend happens once, with knowledge.

Rejected alternatives: **B — procure now, build strictly bottom-up** (mobile access lands late; spend precedes any proven pattern). **C — two parallel tracks** (for a solo operator this collapses into interleaving with extra context-switching cost).

---

## Phase 1 — Off-site / Remote-access

Delivers mobile access to `ubongo`; proves the machinery. Ordered by *real* dependencies.

### M1 · boma's DNS home — a new domain at Gandi, managed as code

Register a **new Swahili-themed domain at Gandi** for boma and manage its records **as code (IaC)**. Greenfield, not a migration: investigating the existing domains ruled them out as boma's home — `baobab.band` is the **live legacy homelab** (Cloudflare; vaultwarden / nextcloud / matrix in daily use), and `ziethen.dk` is the **family's primary email** (Fastmail); moving either's authoritative DNS risks breaking production. A fresh domain is zero-risk and *born at Gandi*.

- **Driver:** values/sovereignty (Gandi) + a clean, decoupled home so boma builds without endangering anything live. `baobab.band`'s Cloudflare exit / V4 decommission is a **separate, later track**, not part of this build. `ziethen.dk` is untouched.
- **IaC approach:** follow boma's grain — internal DNS is already Ansible-rendered and Terraform owns *no* DNS (CLAUDE.md), so **public DNS is Ansible-managed too** (Gandi LiveDNS via an Ansible module — exact module pinned in M1's spec, verified per ADR-014).
- **Naming scheme (decided):** three tiers (on boma's new domain, ``) — `.boma.` (infra, internal-only) · `.` (home/cluster services, split-horizon) · `.askari.` (off-site/VPS, public). **`nyumbani` dropped.** Home services are **mesh/LAN-only by default** (no public record; reached over LAN or the NetBird mesh), with public Gandi records only for deliberate exceptions. The NetBird mesh carries the `` match-domain to road-warriors (resolver = dns1/dns2 over `wt0`); a `*.` ACME **DNS-01** wildcard cert (Gandi API) gives even unexposed services real TLS.
  Resolves TODO 4 and review finding O12.
- **Records as a new/updated ADR:** amends ADR-007 — boma's public zone is `` at Gandi LiveDNS managed as code; the three-tier naming scheme; `nyumbani` removed; mesh/LAN-only default; `baobab.band` (legacy, Cloudflare) is out of scope.
- **Maps to:** ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (**resolved here**).

### M2 · `askari` provisioned + under Ansible

Provision the Hetzner VPS **as IaC with Terraform** (CAX11 ARM / Helsinki / Debian 13, behind a TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap it. Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.

- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm` module, `offsite` stack). Token via `TF_VAR_hcloud_token` from `vault.hetzner.token`.
- **Proves:** the `offsite_hosts` pattern, the TF→Ansible handoff for a non-Proxmox host (`tf_to_inventory.py` extended), bootstrap of a non-cluster host. Closes review finding O6 (`offsite_hosts` missing from `hosts.yml`).
- **Amends:** ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud Firewall = perimeter), ADR-007/016 (`askari` TF-provisioned, not "added manually").

### M3 · `base` matured to a "remote-access-sufficient" subset — ✅ DONE

Added the `hardening` concern to `base` (sshd drop-in key-only + `PermitRootLogin no`; fail2ban sshd jail 5/1h; ADR-002) and **applied it to askari** by tag (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`) — SSH still works, fail2ban active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO 15).

- **NetBird agent → M4** (deferred from M3: it enrolls against the coordinator, which doesn't exist until M4 — ADR-016's coordinator-first bootstrap order).
- **Host firewall on askari + ubongo hardening → M5** (applying default-deny pre-mesh would lock out SSH; the Hetzner Cloud Firewall is askari's perimeter until then).
- **Spec/plan:** `docs/superpowers/{specs,plans}/2026-06-14-base-ssh-fail2ban-m3*`.
- **Maps to:** ADR-002 (security baseline), ADR-020 (firewall — built, not yet applied), TODO 15 (the rest of hardening → Phase 2).

### M4 · NetBird control plane on `askari` — first real service role

Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the **embedded IdP** (ADR-016 — no Authentik dependency).

- **First exercise of:** the service-role conventions (`SECURITY.md` / `VERIFY.md` / `ACCESS.md` / `BACKUP.md`), public **TLS / ACME**, and the **backup contract** — NetBird's management datastore is *stateful*, so it gets encrypted off-host backup (ADR-016 §recovery, ADR-022).
- **Open design choice (decide in M4's spec):** a minimal ACME-terminating reverse proxy (e.g. Caddy) just for NetBird on `askari`, vs leaning on NetBird's bundled setup.
- **Maps to:** ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access), ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).

### M5 · Enroll peers → goal reached

NetBird agent on `ubongo` (the `wt0` path appears), then NetBird **clients on `mamba` + the work laptop** → `ubongo` is reachable from anywhere. **← the mobile-access goal lands here.**

- **Critical ordering:** NetBird-on-`ubongo` **before** applying `base` default-deny to `ubongo`. Hardening first would lock out SSH (no mesh path yet). Once the mesh `wt0` path exists, apply default-deny and set `base__firewall_control_addr` for the LAN fallback (ADR-021's `ssh-from-control`, already built/dormant).
- **Maps to:** ADR-016, ADR-021 (SSH ladder: `wt0` + ssh-from-control), ADR-020.

---

## Gate — Procurement decision

Run `/capacity-review` (intent-based) to size the cluster, **then procure the Proxmox hardware**.
Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access) has by now been rehearsed on two cheap hosts, so the spend happens once and is informed.

- **Maps to:** ADR-012 (hardware & capacity), `/capacity-review`.

---

## Phase 2 — Cluster (gated on procurement; coarse until M5 is near)

Canonical dependency order:

1. **Terraform provisioning** — `terraform init`/apply the Proxmox VM module; regenerate inventory via `make tf-inventory` (ADR-006, ADR-009).
2. **`base` full** — CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the VM disk layout for CIS L2 is decided **before** provisioning (ADR-002, TODO 15).
3. **`docker_host`** — real Docker engine + Compose, daemon hardening, `nftables.d` container rules (currently a scaffold; ADR-004, ADR-020).
4. **`dns` role** — render the internal zone from inventory (ADR-007).
5. **Auth + reverse proxy** — Authentik + Traefik: the foundation every service sits behind with authentication (ADR-002).
6. **Monitoring** — Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters + Uptime Kuma; decide which alerts live where (TODO 3.6).
7. **Service roles** — PhotoPrism, email, indexers, … (`docs/CAPABILITIES.md`); each clears `docs/security/service-checklist.md` and carries its standard files.
8. **`backup` role + `fisi` pull node** — restic Model A, pCloud + USB air-gap (ADR-022).
9. **Forgejo Actions CI** — runner + workflows (ADR-003/010, TODO 1).

---

## Underneath both — Cross-cutting / ongoing

- **Accept ADR-011** (update management) — resolve its 6 open questions before the first scheduled patch run (TODO 16).
- **Kaizen `/retro`** + keep appending to `docs/FRICTION.md` (TODO 11); **`/security-review`** skill (TODO 8.5); **`/review-repo` fortnightly cron** + headless email (TODO 8.1); `scheduled_jobs` role (TODO 8.3).
- **User-notification function** — ntfy / matrix / email so tools + AI can reach the operator (TODO 9; ties to ADR-011 control channel).

### Parked decisions — decide when they bite, not before

- Split-horizon FQDN with or without `nyumbani` (TODO 4) — settled in M1: `nyumbani` dropped.
- Central database server vs per-app databases (TODO 3.9) — at the service phase.
- Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
- Whether to keep the custom Molecule base-image method as testing matures (TODO 3.10).

---

## Next step

**M1 (Gandi DNS migration, IaC)** design is written — `docs/superpowers/specs/2026-06-11-public-dns-gandi-migration-design.md`. Next: user review → implementation plan.
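
As an illustration only, the Phase 1 ordering constraints described in this roadmap can be written as a small dependency graph and sorted with Python's standard-library `graphlib`. The step names, the `PHASE1_DEPS` mapping and the `build_order` helper are hypothetical and not part of boma's tooling. The edge set is one reading of the milestone notes; in particular, making M4 wait on M1 (public DNS for ACME) is an assumption.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each maps to the steps it must wait for.
# Edges are an interpretation of the Phase 1 notes, not boma tooling.
PHASE1_DEPS = {
    "M1 public DNS": set(),
    "M2 askari provisioned": set(),
    # Hardening was applied to askari, so askari must exist first.
    "M3 hardening on askari": {"M2 askari provisioned"},
    # Assumption: public TLS/ACME for NetBird needs the M1 zone in place.
    "M4 NetBird control plane": {"M1 public DNS", "M2 askari provisioned"},
    # The agent enrolls against the coordinator, so M4 comes first.
    "M5 NetBird agent on ubongo": {"M4 NetBird control plane"},
    # Default-deny only once the wt0 mesh path exists, or SSH is locked out.
    "M5 default-deny on ubongo": {"M5 NetBird agent on ubongo"},
}


def build_order(deps):
    """Return one valid ordering; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step in build_order(PHASE1_DEPS):
        print(step)
```

Several orderings satisfy these edges, so the printed sequence is one valid build order rather than the only one. The useful property is that a contradictory constraint, such as hardening `ubongo` before its mesh path exists while also requiring the reverse, would surface as a `CycleError`.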