docs(roadmap): add ROADMAP.md — remote-access-first build order
High-level build order for the project (Approach A): one Off-site/Remote-access
track first (Gandi DNS-as-code -> askari -> NetBird control plane -> enroll ubongo +
road-warrior laptops -> harden), a procurement gate sized by /capacity-review, then
the Cluster track. Sequences the docs/TODO.md backlog into milestones and records
why the order is what it is.

Decisions captured this session:
- Gandi over Cloudflare is values-driven and independent of NetBird (sequenced
  first so records are born at Gandi)
- public DNS managed as code (Ansible, consistent with internal DNS +
  Terraform-owns-no-DNS)
- NetBird-on-ubongo before base default-deny (chicken-and-egg)
- cluster procurement gated on patterns proven on two cheap hosts

Wire ROADMAP.md into CLAUDE.md's Further-reading index and point TODO.md at it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 03d33f83dd
commit 3cfcb1c2e9
3 changed files with 175 additions and 0 deletions
@@ -205,6 +205,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Topic | File |
|------------------------|---------------------------------------|
| Architecture overview | `docs/decisions/001-architecture.md` |
| Build order / roadmap | `docs/ROADMAP.md` |
| Capabilities overview (what boma does) | `docs/CAPABILITIES.md` |
| Security baseline & strategy | `docs/decisions/002-security.md` |
| Accepted security risks | `docs/security/accepted-risks.md` |
171
docs/ROADMAP.md
Normal file
|
|
@@ -0,0 +1,171 @@
|
|||
# ROADMAP — boma build order
|
||||
|
||||
High-level **build order** for the project. Almost everything in `docs/decisions/`
|
||||
(the ADRs) is *designed, not built* — this file sequences that backlog into milestones
|
||||
and records *why* the order is what it is.
|
||||
|
||||
- **What is built vs planned:** `STATUS.md` (ground truth — always check there first).
|
||||
- **The backlog of decisions:** `docs/TODO.md` (this roadmap sequences it).
|
||||
- **The design rationale:** `docs/decisions/` (ADRs).
|
||||
|
||||
This is a **living document**: update it as milestones land (move them to `STATUS.md`),
|
||||
as ordering changes, or as new milestones appear. Each milestone gets its own
|
||||
spec → plan → implementation cycle (`docs/superpowers/specs/` then `…/plans/`) when it
|
||||
comes up; this file stays high-level.
|
||||
|
||||
_Last updated: 2026-06-11._
|
||||
|
||||
---
|
||||
|
||||
## Strategy — "remote-access first" (Approach A)
|
||||
|
||||
One focused track now (**Off-site / Remote-access**), a **procurement gate**, then the
|
||||
**Cluster** track. Cross-cutting/ongoing work runs underneath both.
|
||||
|
||||
**Why this order.** The only physical machine that exists today is `ubongo` (the control
|
||||
node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal
|
||||
— reach `ubongo` from `mamba` / a work laptop while on the move — needs only things
|
||||
already available or cheap to spin up (`askari` at Hetzner, the laptops). Doing the
|
||||
remote-access track first:
|
||||
|
||||
1. **delivers the mobile-access goal in the first phase**, and
|
||||
2. **doubles as the proving ground** for boma's core machinery — the first real *service
|
||||
role* (NetBird), the `base` role on a *real, internet-facing* host, the `offsite_hosts`
|
||||
pattern, public DNS + ACME, the backup contract, and `rbw`/vault in anger — all on two
|
||||
cheap, low-stakes hosts **before** spending on the cluster.
|
||||
|
||||
Cluster hardware is then procured *after* those patterns are proven and a
|
||||
`/capacity-review` informs the sizing — so the spend happens once, with knowledge.
|
||||
|
||||
Rejected alternatives: **B — procure now, build strictly bottom-up** (mobile access lands
|
||||
late; spend precedes any proven pattern). **C — two parallel tracks** (for a solo operator
|
||||
this collapses into interleaving with extra context-switching cost).
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Off-site / Remote-access
|
||||
|
||||
Delivers mobile access to `ubongo`; proves the machinery. Ordered by *real* dependencies.
|
||||
|
||||
### M1 · Gandi DNS migration — managed as code
|
||||
|
||||
Move `baobab.band` authoritative DNS (and registrar) off Cloudflare to **Gandi**, with
|
||||
records **managed as code (IaC)**, not hand-edited in a panel.
|
||||
|
||||
- **Driver:** values/sovereignty (Gandi over Cloudflare) — *not* a NetBird technical
|
||||
prerequisite. Sequenced **first** anyway, so `askari`'s records are born at Gandi and
|
||||
Cloudflare is never touched again.
|
||||
- **IaC approach:** follow boma's grain — internal DNS is already Ansible-rendered and
|
||||
Terraform owns *no* DNS (CLAUDE.md), so **public DNS is Ansible-managed too** (Gandi
|
||||
LiveDNS via an Ansible module). Exact module/role shape is M1's spec decision.
|
||||
- **Care:** the live record `forgejo.nyumbani.baobab.band` (the git `origin` / Forgejo
|
||||
remote) must not break during the cutover.
|
||||
- **Records as a new/updated ADR:** amends ADR-007's "served by external DNS (Cloudflare
|
||||
or equivalent)" line to "Gandi LiveDNS, managed as code."
|
||||
- **Maps to:** ADR-007 (network/DNS), TODO 4 (split-horizon FQDN — decide w/ or w/o
|
||||
`nyumbani` here or defer).
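
A minimal sketch of what "records managed as code" could look like before any provider call is made: compare the records declared in the repo with what is currently live, and refuse any plan that would delete the Forgejo record. Everything here is an assumption for illustration: the `Record` shape, the `plan_changes` helper, the `askari.baobab.band` name and the documentation-range IPs are hypothetical, and the real module/role shape remains M1's spec decision.

```python
from dataclasses import dataclass


# Hypothetical record shape; the real layout is left to M1's spec.
@dataclass(frozen=True)
class Record:
    name: str   # fully qualified name
    rtype: str  # e.g. "A", "AAAA", "CNAME"
    value: str
    ttl: int = 300


# The live record the roadmap says must not break during the cutover.
PROTECTED = {"forgejo.nyumbani.baobab.band"}


def plan_changes(desired, live):
    """Diff desired (in-repo) records against live ones and return a plan."""
    want = {(r.name, r.rtype): r for r in desired}
    have = {(r.name, r.rtype): r for r in live}

    create = [want[k] for k in sorted(want.keys() - have.keys())]
    update = [want[k] for k in sorted(want.keys() & have.keys())
              if want[k] != have[k]]
    delete = [have[k] for k in sorted(have.keys() - want.keys())]

    # Guard: a plan that would drop a protected record is rejected outright.
    blocked = [r.name for r in delete if r.name in PROTECTED]
    if blocked:
        raise ValueError(f"refusing to delete protected records: {blocked}")
    return {"create": create, "update": update, "delete": delete}


# Illustrative data only (192.0.2.0/24 is a documentation range).
live = [Record("forgejo.nyumbani.baobab.band", "A", "192.0.2.10")]
desired = live + [Record("askari.baobab.band", "A", "192.0.2.20")]
plan = plan_changes(desired, live)
```

The point of the guard is that the cutover risk is encoded in the tooling rather than remembered by the operator; whether that check lives in an Ansible role, a pre-task, or a script is open.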
|
||||
|
||||
### M2 · `askari` provisioned + under Ansible
|
||||
|
||||
Spin up the Hetzner VPS; bring it under Ansible in the `offsite_hosts` group; bootstrap it.
|
||||
|
||||
- **Proves:** the `offsite_hosts` pattern, bootstrap of a non-cluster host, rbw/vault
|
||||
against a brand-new host. Regenerates the inventory stubs (closes review finding O6 —
|
||||
`offsite_hosts` missing from `hosts.yml`).
|
||||
- **Maps to:** ADR-007 (`askari` role), ADR-009 (provisioning handoff), ADR-015/016,
|
||||
TODO 5 (control-node-style bootstrap, reused).
|
||||
|
||||
### M3 · `base` matured to a "remote-access-sufficient" subset
|
||||
|
||||
Today `base` is firewall-only. Add the subset a real, internet-facing host needs:
|
||||
**SSH hardening + fail2ban + the NetBird agent task**. Full CIS L1/L2, auditd, AppArmor,
|
||||
AIDE are deferred to Phase 2.
|
||||
|
||||
- **Why a subset:** `askari` is public (Hetzner) — it must be SSH-hardened and firewalled
|
||||
*with* exposure, but the full hardening standard is not on the critical path to mobile
|
||||
access.
|
||||
- **Maps to:** ADR-002 (security baseline), ADR-016 (agent enrollment lives in `base`),
|
||||
ADR-020 (firewall — already built), TODO 15 (the rest of hardening → Phase 2).
|
||||
|
||||
### M4 · NetBird control plane on `askari` — first real service role
|
||||
|
||||
Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the
|
||||
**embedded IdP** (ADR-016 — no Authentik dependency).
|
||||
|
||||
- **First exercise of:** the service-role conventions (`SECURITY.md` / `VERIFY.md` /
|
||||
`ACCESS.md` / `BACKUP.md`), public **TLS / ACME**, and the **backup contract** —
|
||||
NetBird's management datastore is *stateful*, so it gets encrypted off-host backup
|
||||
(ADR-016 §recovery, ADR-022).
|
||||
- **Open design choice (decide in M4's spec):** a minimal ACME-terminating reverse proxy
|
||||
(e.g. Caddy) just for NetBird on `askari`, vs leaning on NetBird's bundled setup.
|
||||
- **Maps to:** ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access),
|
||||
ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).
|
||||
|
||||
### M5 · Enroll peers → goal reached
|
||||
|
||||
NetBird agent on `ubongo` (the `wt0` path appears), then NetBird **clients on `mamba` +
|
||||
the work laptop** → `ubongo` is reachable from anywhere. **← the mobile-access goal lands
|
||||
here.**
|
||||
|
||||
- **Critical ordering:** NetBird-on-`ubongo` **before** applying `base` default-deny to
|
||||
`ubongo`. Hardening first would lock out SSH (no mesh path yet). Once the mesh `wt0`
|
||||
path exists, apply default-deny and set `base__firewall_control_addr` for the LAN
|
||||
fallback (ADR-021's `ssh-from-control`, already built/dormant).
|
||||
- **Maps to:** ADR-016, ADR-021 (SSH ladder: `wt0` + ssh-from-control), ADR-020.
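
The critical ordering above can be sketched as a preflight check that runs before default-deny is applied to `ubongo`. This is an illustrative sketch, not part of the `base` role: `HostState` and `default_deny_blockers` are hypothetical names, and treating an unset `base__firewall_control_addr` as a blocker is an assumption about how strictly the LAN fallback should be enforced.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostState:
    # True once the wt0 interface exists and SSH over the mesh has been tested.
    mesh_path_up: bool
    # Value intended for base__firewall_control_addr (the LAN fallback).
    control_addr: Optional[str]


def default_deny_blockers(state: HostState) -> List[str]:
    """Return the reasons default-deny must not be applied yet (empty = safe)."""
    blockers = []
    if not state.mesh_path_up:
        blockers.append("mesh path (wt0) is not up; default-deny would lock out SSH")
    if not state.control_addr:
        blockers.append("base__firewall_control_addr is unset; no LAN fallback")
    return blockers


# Before enrollment: both conditions fail, so hardening has to wait.
before = default_deny_blockers(HostState(mesh_path_up=False, control_addr=None))

# After enrollment, with a placeholder documentation-range address set.
after = default_deny_blockers(HostState(mesh_path_up=True, control_addr="192.0.2.5"))
```

An empty list means both SSH paths from ADR-021's ladder are in place, so applying default-deny cannot strand the operator.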
|
||||
|
||||
---
|
||||
|
||||
## Gate — Procurement decision
|
||||
|
||||
Run `/capacity-review` (intent-based) to size the cluster, **then procure the Proxmox
|
||||
hardware**. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access)
|
||||
has by now been rehearsed on two cheap hosts, so the spend happens once and is informed.
|
||||
|
||||
- **Maps to:** ADR-012 (hardware & capacity), `/capacity-review`.
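
The sequencing up to this gate can be written down as a small dependency graph, which makes it easy to check what is allowed to start next. The edges below are an assumed, strictly linear reading of the roadmap (each milestone depending on the one before it); the real dependencies may be looser, and the `build_order` and `ready_to_start` helpers are hypothetical.

```python
from graphlib import TopologicalSorter

# Assumed edges: each key depends on every milestone in its value set.
DEPENDS_ON = {
    "M1": set(),        # Gandi DNS migration
    "M2": {"M1"},       # askari provisioned + under Ansible
    "M3": {"M2"},       # base matured to the remote-access subset
    "M4": {"M3"},       # NetBird control plane on askari
    "M5": {"M4"},       # enroll peers
    "Gate": {"M5"},     # procurement decision
    "Phase 2": {"Gate"},
}


def build_order(depends_on):
    """Return one valid build order; graphlib raises CycleError on a cycle."""
    return list(TopologicalSorter(depends_on).static_order())


def ready_to_start(depends_on, done):
    """Milestones not yet done whose dependencies are all done."""
    done = set(done)
    return sorted(m for m, deps in depends_on.items()
                  if m not in done and deps <= done)


order = build_order(DEPENDS_ON)
next_up = ready_to_start(DEPENDS_ON, {"M1", "M2"})
```

If the milestones turn out to be less linear than assumed (for example, M3 work starting alongside M2), only the `DEPENDS_ON` mapping needs to change.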
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Cluster (gated on procurement; coarse until M5 is near)
|
||||
|
||||
Canonical dependency order:
|
||||
|
||||
1. **Terraform provisioning** — `terraform init`/apply the Proxmox VM module; regenerate
|
||||
inventory via `make tf-inventory` (ADR-006, ADR-009).
|
||||
2. **`base` full** — CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the
|
||||
VM disk layout for CIS L2 is decided **before** provisioning (ADR-002, TODO 15).
|
||||
3. **`docker_host`** — real Docker engine + Compose, daemon hardening, `nftables.d`
|
||||
container rules (currently a scaffold; ADR-004, ADR-020).
|
||||
4. **`dns` role** — render the internal zone from inventory (ADR-007).
|
||||
5. **Auth + reverse proxy** — Authentik + Traefik: the foundation every service sits
|
||||
behind with authentication (ADR-002).
|
||||
6. **Monitoring** — Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters +
|
||||
Uptime Kuma; decide which alerts live where (TODO 3.6).
|
||||
7. **Service roles** — PhotoPrism, email, indexers, … (`docs/CAPABILITIES.md`); each
|
||||
clears `docs/security/service-checklist.md` and carries its standard files.
|
||||
8. **`backup` role + `fisi` pull node** — restic Model A, pCloud + USB air-gap (ADR-022).
|
||||
9. **Forgejo Actions CI** — runner + workflows (ADR-003/010, TODO 1).
|
||||
|
||||
---
|
||||
|
||||
## Underneath both — Cross-cutting / ongoing
|
||||
|
||||
- **Accept ADR-011** (update management) — resolve its 6 open questions before the first
|
||||
scheduled patch run (TODO 16).
|
||||
- **Kaizen `/retro`** + keep appending to `docs/FRICTION.md` (TODO 11); **`/security-review`**
|
||||
skill (TODO 8.5); **`/review-repo` fortnightly cron** + headless email (TODO 8.1);
|
||||
`scheduled_jobs` role (TODO 8.3).
|
||||
- **User-notification function** — ntfy / matrix / email so tools + AI can reach the
|
||||
operator (TODO 9; ties to ADR-011 control channel).
|
||||
|
||||
### Parked decisions — decide when they bite, not before
|
||||
|
||||
- Split-horizon FQDN with or without `nyumbani` (TODO 4) — likely settled in M1.
|
||||
- Central database server vs per-app databases (TODO 3.9) — at the service phase.
|
||||
- Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
|
||||
- Whether to keep the custom Molecule base-image method as testing matures (TODO 3.10).
|
||||
|
||||
---
|
||||
|
||||
## Next step
|
||||
|
||||
Brainstorm **M1 (Gandi DNS migration, IaC)** as its own sub-project → spec → plan.
|
||||
|
|
@@ -1,5 +1,8 @@
|
|||
# ToDo
|
||||
|
||||
> **Build order lives in `docs/ROADMAP.md`** — that sequences this backlog into
|
||||
> milestones. This file is the decision backlog; the roadmap is the order in which we build them.
|
||||
|
||||
1. **Forgejo CI** — what CI work remains after ADR-010 (which workflows, runner
|
||||
setup, etc. still need to be built)?
|
||||
|
||||
|
|
|
|||