boma/docs/ROADMAP.md
sjat be2679cc66 docs(roadmap): record decided DNS naming scheme in M1
Three-tier scheme: <host>.boma.baobab.band (infra, internal) /
<service>.baobab.band (home, split-horizon, mesh/LAN-only default) /
<service>.askari.baobab.band (off-site, public). nyumbani dropped; mesh carries
the baobab.band match-domain to road-warriors; *.baobab.band DNS-01 wildcard
certs via Gandi API. Resolves TODO 4 and review finding O12.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 22:17:28 +02:00

ROADMAP — boma build order

High-level build order for the project. Almost everything in docs/decisions/ (the ADRs) is designed, not built — this file sequences that backlog into milestones and records why the order is what it is.

  • What is built vs planned: STATUS.md (ground truth — always check there first).
  • The backlog of decisions: docs/TODO.md (this roadmap sequences it).
  • The design rationale: docs/decisions/ (ADRs).

This is a living document: update it as milestones land (move them to STATUS.md), as ordering changes, or as new milestones appear. Each milestone gets its own spec → plan → implementation cycle (docs/superpowers/specs/ then …/plans/) when it comes up; this file stays high-level.

Last updated: 2026-06-11.


Strategy — "remote-access first" (Approach A)

One focused track now (Off-site / Remote-access), a procurement gate, then the Cluster track. Cross-cutting/ongoing work runs underneath both.

Why this order. The only physical machine that exists today is ubongo (the control node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal — reach ubongo from mamba / a work laptop while on the move — needs only things already available or cheap to spin up (askari at Hetzner, the laptops). Doing the remote-access track first:

  1. delivers the mobile-access goal in the first phase, and
  2. doubles as the proving ground for boma's core machinery — the first real service role (NetBird), the base role on a real, internet-facing host, the offsite_hosts pattern, public DNS + ACME, the backup contract, and rbw/vault in anger — all on two cheap, low-stakes hosts before spending on the cluster.

Cluster hardware is then procured after those patterns are proven and a /capacity-review informs the sizing — so the spend happens once, with knowledge.

Rejected alternatives: B — procure now, build strictly bottom-up (mobile access lands late; spend precedes any proven pattern). C — two parallel tracks (for a solo operator this collapses into interleaving with extra context-switching cost).


Phase 1 — Off-site / Remote-access

Delivers mobile access to ubongo; proves the machinery. Ordered by real dependencies.

M1 · Gandi DNS migration — managed as code

Move baobab.band authoritative DNS (and registrar) off Cloudflare to Gandi, with records managed as code (IaC), not hand-edited in a panel.

  • Driver: values/sovereignty (Gandi over Cloudflare) — not a NetBird technical prerequisite. Sequenced first anyway, so askari's records are born at Gandi and Cloudflare is never touched again.
  • IaC approach: follow boma's grain — internal DNS is already Ansible-rendered and Terraform owns no DNS (CLAUDE.md), so public DNS is Ansible-managed too (Gandi LiveDNS via an Ansible module — exact module pinned in M1's spec, verified per ADR-014).
  • Naming scheme (decided): three tiers — <host>.boma.baobab.band (infra, internal-only) · <service>.baobab.band (home/cluster services, split-horizon) · <service>.askari.baobab.band (off-site/VPS, public). nyumbani dropped. Home services are mesh/LAN-only by default (no public record; reached over LAN or the NetBird mesh), with public Gandi records only for deliberate exceptions. The NetBird mesh carries the baobab.band match-domain to road-warriors (resolver = dns1/dns2 over wt0); a *.baobab.band ACME DNS-01 wildcard cert (Gandi API) gives even unexposed services real TLS. Resolves TODO 4 and review finding O12.
  • Care: the live record forgejo.nyumbani.baobab.band (the git origin / Forgejo remote, :7577) becomes forgejo.baobab.band — cutover must update the remote + CI without breaking pushes.
  • Records as a new/updated ADR: amends ADR-007 — public DNS provider → Gandi LiveDNS managed as code; the three-tier naming scheme; nyumbani removed; mesh/LAN-only default.
  • Maps to: ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (resolved here).
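
The three-tier scheme above can be made concrete with a small classifier. The Python sketch below is illustrative only and not part of boma: the function names `classify` and `wildcard_covers` are hypothetical, and hostnames such as `netbird.askari.baobab.band` are made-up examples. It also illustrates a general property of wildcard certificates: a `*.baobab.band` wildcard matches exactly one label, so it covers `<service>.baobab.band` names but not names one level deeper under `boma.` or `askari.`.

```python
ZONE = "baobab.band"
RESERVED = ("boma", "askari")  # labels that introduce the infra / off-site tiers


def classify(fqdn: str) -> str:
    """Return the tier of an FQDN under the three-tier naming scheme."""
    name = fqdn.rstrip(".").lower()
    if name == ZONE:
        return "unassigned"  # the zone apex belongs to no tier
    if not name.endswith("." + ZONE):
        return "outside-zone"
    labels = name[: -len(ZONE) - 1].split(".")
    if len(labels) == 1 and labels[0] not in RESERVED:
        return "home"        # <service>.baobab.band
    if len(labels) == 2 and labels[1] == "boma":
        return "infra"       # <host>.boma.baobab.band
    if len(labels) == 2 and labels[1] == "askari":
        return "offsite"     # <service>.askari.baobab.band
    return "unassigned"      # e.g. anything still under the dropped nyumbani label


def wildcard_covers(fqdn: str, wildcard: str = "*." + ZONE) -> bool:
    """True if a single-label wildcard such as *.baobab.band matches fqdn."""
    name = fqdn.rstrip(".").lower()
    suffix = wildcard[1:]    # ".baobab.band"
    if not name.endswith(suffix):
        return False
    left = name[: -len(suffix)]
    return left != "" and "." not in left
```

Under these assumptions, `forgejo.baobab.band` classifies as `home` and is covered by the wildcard, while the old `forgejo.nyumbani.baobab.band` falls out as `unassigned`. A check like this could serve as a lint over rendered records, if M1's spec wants one.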

M2 · askari provisioned + under Ansible

Spin up the Hetzner VPS; bring it under Ansible in the offsite_hosts group; bootstrap it.

  • Proves: the offsite_hosts pattern, bootstrap of a non-cluster host, rbw/vault against a brand-new host. Regenerates the inventory stubs (closes review finding O6 — offsite_hosts missing from hosts.yml).
  • Maps to: ADR-007 (askari role), ADR-009 (provisioning handoff), ADR-015/016, TODO 5 (control-node-style bootstrap, reused).

M3 · base matured to a "remote-access-sufficient" subset

Today base is firewall-only. Add the subset a real, internet-facing host needs: SSH hardening + fail2ban + the NetBird agent task. Full CIS L1/L2, auditd, AppArmor, AIDE are deferred to Phase 2.

  • Why a subset: askari is public (Hetzner), so it must be SSH-hardened and firewalled given its exposure, but the full hardening standard is not on the critical path to mobile access.
  • Maps to: ADR-002 (security baseline), ADR-016 (agent enrollment lives in base), ADR-020 (firewall — already built), TODO 15 (the rest of hardening → Phase 2).

M4 · NetBird control plane on askari — first real service role

Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the embedded IdP (ADR-016 — no Authentik dependency).

  • First exercise of: the service-role conventions (SECURITY.md / VERIFY.md / ACCESS.md / BACKUP.md), public TLS / ACME, and the backup contract — NetBird's management datastore is stateful, so it gets encrypted off-host backup (ADR-016 §recovery, ADR-022).
  • Open design choice (decide in M4's spec): a minimal ACME-terminating reverse proxy (e.g. Caddy) just for NetBird on askari, vs leaning on NetBird's bundled setup.
  • Maps to: ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access), ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).

M5 · Enroll peers → goal reached

NetBird agent on ubongo (the wt0 path appears), then NetBird clients on mamba + the work laptop → ubongo is reachable from anywhere. ← the mobile-access goal lands here.

  • Critical ordering: NetBird-on-ubongo before applying base default-deny to ubongo. Hardening first would lock out SSH (no mesh path yet). Once the mesh wt0 path exists, apply default-deny and set base__firewall_control_addr for the LAN fallback (ADR-021's ssh-from-control, already built/dormant).
  • Maps to: ADR-016, ADR-021 (SSH ladder: wt0 + ssh-from-control), ADR-020.
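
The ordering constraint above amounts to a precondition check: default-deny may be applied to a host only after a mesh path to it exists. A minimal Python sketch of such a guard follows. The step names and the `check_order` function are hypothetical and do not correspond to actual boma tasks or playbooks.

```python
MESH_STEP = "netbird_agent"          # hypothetical name: the wt0 path comes up
LOCKDOWN_STEP = "base_default_deny"  # hypothetical name: firewall default-deny


def check_order(steps: list) -> None:
    """Raise ValueError if default-deny would run before the mesh path exists."""
    if LOCKDOWN_STEP not in steps:
        return  # no lockdown planned, nothing to guard
    lockdown_at = steps.index(LOCKDOWN_STEP)
    if MESH_STEP not in steps[:lockdown_at]:
        raise ValueError(
            f"{LOCKDOWN_STEP} is scheduled before {MESH_STEP}: "
            "SSH would be locked out with no mesh path"
        )
```

For example, `check_order(["netbird_agent", "base_default_deny"])` passes silently, while the reversed list raises `ValueError`.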

Gate — Procurement decision

Run /capacity-review (intent-based) to size the cluster, then procure the Proxmox hardware. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access) has by now been rehearsed on two cheap hosts, so the spend happens once and informed.

  • Maps to: ADR-012 (hardware & capacity), /capacity-review.

Phase 2 — Cluster (gated on procurement; coarse until M5 is near)

Canonical dependency order:

  1. Terraform provisioning: terraform init/apply the Proxmox VM module; regenerate inventory via make tf-inventory (ADR-006, ADR-009).
  2. base full — CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the VM disk layout for CIS L2 is decided before provisioning (ADR-002, TODO 15).
  3. docker_host — real Docker engine + Compose, daemon hardening, nftables.d container rules (currently a scaffold; ADR-004, ADR-020).
  4. dns role — render the internal zone from inventory (ADR-007).
  5. Auth + reverse proxy — Authentik + Traefik: the foundation every service sits behind with authentication (ADR-002).
  6. Monitoring — Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters + Uptime Kuma; decide which alerts live where (TODO 3.6).
  7. Service roles — PhotoPrism, email, indexers, … (docs/CAPABILITIES.md); each clears docs/security/service-checklist.md and carries its standard files.
  8. backup role + fisi pull node — restic Model A, pCloud + USB air-gap (ADR-022).
  9. Forgejo Actions CI — runner + workflows (ADR-003/010, TODO 1).
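
Because the list above is a strict sequence, the next step can be computed from what is already built. The sketch below is a hypothetical helper with short step identifiers invented for illustration. It treats the order as a simple linear chain, which is an assumption: the real dependency graph may allow some steps to overlap.

```python
from typing import Optional, Set

# Hypothetical identifiers for the nine Phase 2 steps, in canonical order.
PHASE2_ORDER = [
    "terraform_provisioning",
    "base_full",
    "docker_host",
    "dns",
    "auth_reverse_proxy",
    "monitoring",
    "service_roles",
    "backup",
    "forgejo_actions_ci",
]


def next_step(done: Set[str]) -> Optional[str]:
    """Return the first unbuilt step, or None when every step is done.

    Raises ValueError if a later step is marked done while an earlier
    one is not, since that would violate the canonical order.
    """
    for i, step in enumerate(PHASE2_ORDER):
        if step not in done:
            skipped_ahead = done & set(PHASE2_ORDER[i + 1:])
            if skipped_ahead:
                raise ValueError(
                    f"{sorted(skipped_ahead)} marked done before {step}"
                )
            return step
    return None
```

With nothing built, `next_step(set())` returns `"terraform_provisioning"`. Marking `docker_host` done without the two steps before it raises `ValueError`.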

Underneath both — Cross-cutting / ongoing

  • Accept ADR-011 (update management) — resolve its 6 open questions before the first scheduled patch run (TODO 16).
  • Kaizen /retro + keep appending to docs/FRICTION.md (TODO 11); /security-review skill (TODO 8.5); /review-repo fortnightly cron + headless email (TODO 8.1); scheduled_jobs role (TODO 8.3).
  • User-notification function — ntfy / matrix / email so tools + AI can reach the operator (TODO 9; ties to ADR-011 control channel).

Parked decisions — decide when they bite, not before

  • Split-horizon FQDN with or without nyumbani (TODO 4): settled in M1 (nyumbani dropped; see the naming scheme there).
  • Central database server vs per-app databases (TODO 3.9) — at the service phase.
  • Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
  • Whether to keep the custom Molecule base-image method as testing matures (TODO 3.10).

Next step

Brainstorm M1 (Gandi DNS migration, IaC) as its own sub-project → spec → plan.