ROADMAP — boma build order
High-level build order for the project. Almost everything in docs/decisions/
(the ADRs) is designed, not built — this file sequences that backlog into milestones
and records why the order is what it is.
- What is built vs planned: STATUS.md (ground truth; always check there first).
- The backlog of decisions: docs/TODO.md (this roadmap sequences it).
- The design rationale: docs/decisions/ (ADRs).
This is a living document: update it as milestones land (move them to STATUS.md),
as ordering changes, or as new milestones appear. Each milestone gets its own
spec → plan → implementation cycle (docs/superpowers/specs/ then …/plans/) when it
comes up; this file stays high-level.
Last updated: 2026-06-11.
Strategy — "remote-access first" (Approach A)
One focused track now (Off-site / Remote-access), a procurement gate, then the Cluster track. Cross-cutting/ongoing work runs underneath both.
Why this order. The only physical machine that exists today is ubongo (the control
node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal
— reach ubongo from mamba / a work laptop while on the move — needs only things
already available or cheap to spin up (askari at Hetzner, the laptops). Doing the
remote-access track first:
- delivers the mobile-access goal in the first phase, and
- doubles as the proving ground for boma's core machinery: the first real service role (NetBird), the base role on a real, internet-facing host, the offsite_hosts pattern, public DNS + ACME, the backup contract, and rbw/vault in anger, all on two cheap, low-stakes hosts before spending on the cluster.
Cluster hardware is then procured after those patterns are proven and a
/capacity-review informs the sizing — so the spend happens once, with knowledge.
Rejected alternatives: B — procure now, build strictly bottom-up (mobile access lands late; spend precedes any proven pattern). C — two parallel tracks (for a solo operator this collapses into interleaving with extra context-switching cost).
Phase 1 — Off-site / Remote-access
Delivers mobile access to ubongo; proves the machinery. Ordered by real dependencies.
M1 · boma's DNS home — a new domain at Gandi, managed as code
Register a new Swahili-themed domain at Gandi for boma and manage its records as
code (IaC). Greenfield, not a migration: investigating the existing domains ruled them
out as boma's home — baobab.band is the live legacy homelab (Cloudflare; vaultwarden
/ nextcloud / matrix in daily use), and ziethen.dk is the family's primary email
(Fastmail); moving either's authoritative DNS risks breaking production. A fresh domain is
zero-risk and born at Gandi.
- Driver: values/sovereignty (Gandi) + a clean, decoupled home so boma builds without endangering anything live. baobab.band's Cloudflare exit / V4 decommission is a separate, later track, not part of this build. ziethen.dk is untouched.
- IaC approach: follow boma's grain. Internal DNS is already Ansible-rendered and Terraform owns no DNS (CLAUDE.md), so public DNS is Ansible-managed too (Gandi LiveDNS via an Ansible module; exact module pinned in M1's spec, verified per ADR-014).
- Naming scheme (decided): three tiers on boma's new domain, <boma-domain>: <host>.boma.<boma-domain> (infra, internal-only) · <service>.<boma-domain> (home/cluster services, split-horizon) · <service>.askari.<boma-domain> (off-site/VPS, public). nyumbani dropped. Home services are mesh/LAN-only by default (no public record; reached over LAN or the NetBird mesh), with public Gandi records only for deliberate exceptions. The NetBird mesh carries the <boma-domain> match-domain to road-warriors (resolver = dns1/dns2 over wt0); a *.<boma-domain> ACME DNS-01 wildcard cert (Gandi API) gives even unexposed services real TLS. Resolves TODO 4 and review finding O12.
- Recorded as a new/updated ADR: amends ADR-007. boma's public zone is <boma-domain> at Gandi LiveDNS managed as code; the three-tier naming scheme; nyumbani removed; mesh/LAN-only default; baobab.band (legacy, Cloudflare) is out of scope.
- Maps to: ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (resolved here).
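
The three-tier scheme above can be sketched as a small helper. This is an illustration only, not code from the repo: the names Tier, fqdn and covered_by_wildcard are hypothetical, and example.org stands in for <boma-domain>. The second function encodes the general TLS rule that a wildcard matches exactly one left-most label.

```python
from enum import Enum


class Tier(Enum):
    """The three naming tiers from the roadmap (enum itself is hypothetical)."""
    INFRA = "infra"      # <host>.boma.<boma-domain>, internal-only
    HOME = "home"        # <service>.<boma-domain>, split-horizon
    OFFSITE = "offsite"  # <service>.askari.<boma-domain>, public


def fqdn(name: str, tier: Tier, domain: str) -> str:
    """Build the FQDN for a host or service under the three-tier scheme."""
    if tier is Tier.INFRA:
        return f"{name}.boma.{domain}"
    if tier is Tier.HOME:
        return f"{name}.{domain}"
    return f"{name}.askari.{domain}"


def covered_by_wildcard(name: str, domain: str) -> bool:
    """True if `name` matches *.<domain>; a wildcard spans exactly one label."""
    suffix = f".{domain}"
    if not name.endswith(suffix):
        return False
    label = name[: -len(suffix)]
    return bool(label) and "." not in label
```

For example, fqdn("vaultwarden", Tier.HOME, "example.org") gives a name that a *.example.org certificate covers, while names in the boma. and askari. tiers sit one label deeper and would need their own certificate or wildcard.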
M2 · askari provisioned + under Ansible
Provision the Hetzner VPS as IaC with Terraform (Helsinki / Debian 13, behind a
TF-managed Hetzner Cloud Firewall), bring it into offsite_hosts, and bootstrap it.
Shipped as cx23/x86 (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
x86, cheaper). Design: docs/superpowers/specs/2026-06-14-askari-provisioning-design.md.
- Decided: Terraform owns askari's existence. This generalizes ADR-006 from "Proxmox VM existence" to Proxmox + Hetzner (new hetznercloud/hcloud provider, hetzner_vm module, offsite stack). Token via TF_VAR_hcloud_token from vault.hetzner.token.
- Proves: the offsite_hosts pattern, the TF→Ansible handoff for a non-Proxmox host (tf_to_inventory.py extended), bootstrap of a non-cluster host. Closes review finding O6 (offsite_hosts missing from hosts.yml).
- Amends: ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud Firewall = perimeter), ADR-007/016 (askari TF-provisioned, not "added manually").
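
A minimal sketch of what a TF→Ansible handoff for an off-site host could look like. It does not reproduce tf_to_inventory.py: the function name, the Terraform output name offsite_hosts, and its shape (a map of hostname to address) are all assumptions. The only fixed points are that `terraform output -json` wraps each output in an object with a "value" key, and that the roadmap's target group is offsite_hosts.

```python
import json


def offsite_inventory(tf_output_json: str, output_name: str = "offsite_hosts") -> dict:
    """Turn `terraform output -json` text into an Ansible-style inventory dict.

    Assumes a (hypothetical) Terraform output named `output_name` whose value
    maps hostname -> address, e.g. {"askari": "203.0.113.10"}.
    """
    outputs = json.loads(tf_output_json)
    hosts = outputs.get(output_name, {}).get("value", {})
    return {
        "all": {
            "children": {
                "offsite_hosts": {
                    "hosts": {
                        # Sorted so regenerated inventories diff cleanly.
                        name: {"ansible_host": addr}
                        for name, addr in sorted(hosts.items())
                    }
                }
            }
        }
    }
```

Keeping the generated group separate from hand-maintained entries is one way to avoid the generator overwriting manual hosts; how the real script handles that is not shown here.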
M3 · base matured to a "remote-access-sufficient" subset — ✅ DONE
Added the hardening concern to base (sshd drop-in key-only + PermitRootLogin no;
fail2ban sshd jail 5/1h; ADR-002) and applied it to askari by tag
(make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening) — SSH still works, fail2ban
active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO 15).
- NetBird agent → M4 (deferred from M3: it enrolls against the coordinator, which doesn't exist until M4 — ADR-016's coordinator-first bootstrap order).
- Host firewall on askari + ubongo hardening → M5 (applying default-deny pre-mesh would lock out SSH; the Hetzner Cloud Firewall is askari's perimeter until then).
- Spec/plan: docs/superpowers/{specs,plans}/2026-06-14-base-ssh-fail2ban-m3*.
- Maps to: ADR-002 (security baseline), ADR-020 (firewall: built, not yet applied), TODO 15 (the rest of hardening → Phase 2).
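
The "SSH still works, key-only, no root login" outcome can be checked mechanically. The sketch below is an assumption about how such a check might look, not the repo's VERIFY tooling: it parses the text printed by `sshd -T` (the effective configuration, one lowercase keyword and value per line) and reports mismatches. Reading "key-only" as passwordauthentication no is an interpretation; the function and constant names are hypothetical.

```python
# Expected effective settings; "key-only" is read here as password auth off.
EXPECTED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
}


def sshd_violations(sshd_t_output: str, expected: dict[str, str]) -> list[str]:
    """Compare `sshd -T` output against expected keyword/value pairs."""
    effective: dict[str, str] = {}
    for line in sshd_t_output.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            # Keep the first occurrence of each keyword.
            effective.setdefault(key.lower(), value.strip())
    return [
        f"{key}: expected {want!r}, got {effective.get(key)!r}"
        for key, want in expected.items()
        if effective.get(key) != want
    ]
```

An empty list means the drop-in took effect; any entry names the keyword that drifted.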
M4 · NetBird control plane on askari — first real service role
Built in two phases. M4a (platform) — ✅ DONE: Docker on askari + boma's standard
Caddy reverse proxy (ADR-024), proven by https://test.askari.wingu.me serving a
valid Let's Encrypt cert (HTTP-01 — DNS-01 deferred to Phase 2, see ADR-024/FRICTION).
Firewall opened 80/443/3478. Spec/plan: …2026-06-14-netbird-coordinator-m4-design.md /
…2026-06-14-m4a-docker-caddy.md. M4b (next): the netbird_coordinator service
role — read NetBird's current self-host compose then.
Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the embedded IdP (ADR-016 — no Authentik dependency), fronted by the now-proven Caddy.
- First exercise of: the service-role conventions (SECURITY.md / VERIFY.md / ACCESS.md / BACKUP.md), public TLS / ACME, and the backup contract. NetBird's management datastore is stateful, so it gets encrypted off-host backup (ADR-016 §recovery, ADR-022).
- Open design choice (decide in M4's spec): a minimal ACME-terminating reverse proxy (e.g. Caddy) just for NetBird on askari, vs leaning on NetBird's bundled setup.
- Maps to: ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access), ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).
M5 · Enroll peers → goal reached
NetBird agent on ubongo (the wt0 path appears), then NetBird clients on mamba +
the work laptop → ubongo is reachable from anywhere. ← the mobile-access goal lands
here.
- Critical ordering: NetBird-on-ubongo before applying base default-deny to ubongo. Hardening first would lock out SSH (no mesh path yet). Once the mesh wt0 path exists, apply default-deny and set base__firewall_control_addr for the LAN fallback (ADR-021's ssh-from-control, already built/dormant).
- Maps to: ADR-016, ADR-021 (SSH ladder: wt0 + ssh-from-control), ADR-020.
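
The ordering constraint above lends itself to a preflight guard: refuse to apply default-deny until the mesh interface exists and is up. The following is a sketch under assumptions, not part of the base role. It parses the one-line-per-interface output of `ip -o link show`; the function name is hypothetical, and an UP interface is only a necessary condition, since it does not prove SSH is reachable over the mesh.

```python
def interface_is_up(ip_link_output: str, name: str = "wt0") -> bool:
    """Return True if `name` appears in `ip -o link show` output with the UP flag."""
    for line in ip_link_output.splitlines():
        # Format: "<index>: <ifname>[@peer]: <FLAG,FLAG,...> mtu ..."
        parts = line.split(": ", 2)
        if len(parts) < 3:
            continue
        ifname = parts[1].split("@", 1)[0]
        flags = parts[2].partition("<")[2].partition(">")[0].split(",")
        if ifname == name and "UP" in flags:
            return True
    return False
```

A deploy wrapper could call this on ubongo and abort the hardening run when it returns False, leaving SSH untouched until the mesh path is in place.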
Gate — Procurement decision
Run /capacity-review (intent-based) to size the cluster, then procure the Proxmox
hardware. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access)
has by now been rehearsed on two cheap hosts, so the spend happens once and informed.
- Maps to: ADR-012 (hardware & capacity), /capacity-review.
Phase 2 — Cluster (gated on procurement; coarse until M5 is near)
Canonical dependency order:
- Terraform provisioning: terraform init/apply the Proxmox VM module; regenerate inventory via make tf-inventory (ADR-006, ADR-009).
- base full: CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the VM disk layout for CIS L2 is decided before provisioning (ADR-002, TODO 15).
- docker_host: real Docker engine + Compose, daemon hardening, nftables.d container rules (currently a scaffold; ADR-004, ADR-020).
- dns role: render the internal zone from inventory (ADR-007).
- Auth + reverse proxy: Authentik + Caddy (ADR-024), the foundation every service sits behind with authentication (ADR-002).
- Monitoring: Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters + Uptime Kuma; decide which alerts live where (TODO 3.6).
- Service roles: PhotoPrism, email, indexers, … (docs/CAPABILITIES.md); each clears docs/security/service-checklist.md and carries its standard files.
- backup role + fisi pull node: restic Model A, pCloud + USB air-gap (ADR-022).
- Forgejo Actions CI: runner + workflows (ADR-003/010, TODO 1).
Underneath both — Cross-cutting / ongoing
- Accept ADR-011 (update management) — resolve its 6 open questions before the first scheduled patch run (TODO 16).
- Kaizen: /retro + keep appending to docs/FRICTION.md (TODO 11); /security-review skill (TODO 8.5); /review-repo fortnightly cron + headless email (TODO 8.1); scheduled_jobs role (TODO 8.3).
- User-notification function: ntfy / matrix / email so tools + AI can reach the operator (TODO 9; ties to ADR-011 control channel).
Parked decisions — decide when they bite, not before
- Split-horizon FQDN with or without nyumbani (TODO 4): settled by M1's naming scheme (nyumbani dropped).
- Central database server vs per-app databases (TODO 3.9): at the service phase.
- Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
- Keep the custom Molecule base-image method as testing matures (TODO 3.10).
Next step
M1 (Gandi DNS migration, IaC) design is written —
docs/superpowers/specs/2026-06-11-public-dns-gandi-migration-design.md. Next: user
review → implementation plan.