# ROADMAP — boma build order

High-level **build order** for the project. Almost everything in `docs/decisions/`
(the ADRs) is *designed, not built* — this file sequences that backlog into milestones
and records *why* the order is what it is.

- **What is built vs planned:** `STATUS.md` (ground truth — always check there first).
- **The backlog of decisions:** `docs/TODO.md` (this roadmap sequences it).
- **The design rationale:** `docs/decisions/` (ADRs).

This is a **living document**: update it as milestones land (move them to `STATUS.md`),
as ordering changes, or as new milestones appear. Each milestone gets its own
spec → plan → implementation cycle (`docs/superpowers/specs/` then `…/plans/`) when it
comes up; this file stays high-level.

_Last updated: 2026-06-20._

---

## Strategy — "remote-access first" (Approach A)

One focused track now (**Off-site / Remote-access**), a **procurement gate**, then the
**Cluster** track. Cross-cutting/ongoing work runs underneath both.

**Why this order.** The only physical machine that exists today is `ubongo` (the control
node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal
— reach `ubongo` from `mamba` / a work laptop while on the move — needs only things
already available or cheap to spin up (`askari` at Hetzner, the laptops). Doing the
remote-access track first:

1. **delivers the mobile-access goal in the first phase**, and
2. **doubles as the proving ground** for boma's core machinery — the first real *service
   role* (NetBird), the `base` role on a *real, internet-facing* host, the `offsite_hosts`
   pattern, public DNS + ACME, the backup contract, and `rbw`/vault in anger — all on two
   cheap, low-stakes hosts **before** spending on the cluster.

Cluster hardware is then procured *after* those patterns are proven and a
`/capacity-review` informs the sizing — so the spend happens once, with knowledge.

Rejected alternatives: **B — procure now, build strictly bottom-up** (mobile access lands
late; spend precedes any proven pattern). **C — two parallel tracks** (for a solo operator
this collapses into interleaving with extra context-switching cost).

---

## Phase 1 — Off-site / Remote-access — ✅ COMPLETE (2026-06-17)

Delivers mobile access to `ubongo`; proves the machinery. Ordered by *real* dependencies.
All milestones (M1–M5) done; the mobile-access goal is met. Next: the Procurement gate.

### M1 · boma's DNS home — a new domain at Gandi, managed as code

Register a **new Swahili-themed domain at Gandi** for boma and manage its records **as
code (IaC)**. Greenfield, not a migration: investigating the existing domains ruled them
out as boma's home — `baobab.band` is the **live legacy homelab** (Cloudflare; vaultwarden
/ nextcloud / matrix in daily use), and `ziethen.dk` is the **family's primary email**
(Fastmail); moving the authoritative DNS of either risks breaking production. A fresh
domain is zero-risk and *born at Gandi*.

- **Driver:** values/sovereignty (Gandi) + a clean, decoupled home so boma builds without
  endangering anything live. `baobab.band`'s Cloudflare exit / V4 decommission is a
  **separate, later track**, not part of this build. `ziethen.dk` is untouched.
- **IaC approach:** follow boma's grain — internal DNS is already Ansible-rendered and
  Terraform owns *no* DNS (CLAUDE.md), so **public DNS is Ansible-managed too** (Gandi
  LiveDNS via an Ansible module — exact module pinned in M1's spec, verified per ADR-014).
- **Naming scheme (decided):** three tiers (on boma's new domain, `<boma-domain>`) —
  `<host>.boma.<boma-domain>` (infra, internal-only) · `<service>.<boma-domain>`
  (home/cluster services, split-horizon) · `<service>.askari.<boma-domain>` (off-site/VPS,
  public). **`nyumbani` dropped.** Home services are **mesh/LAN-only by default** (no
  public record; reached over LAN or the NetBird mesh), with public Gandi records only for
  deliberate exceptions. The NetBird mesh carries the `<boma-domain>` match-domain to
  road-warriors (resolver = dns1/dns2 over `wt0`); a `*.<boma-domain>` ACME **DNS-01**
  wildcard cert (Gandi API) gives even unexposed services real TLS. Resolves TODO 4 and
  review finding O12.
- **Recorded as an ADR update:** amends ADR-007 — boma's public zone is
  `<boma-domain>` at Gandi LiveDNS managed as code; the three-tier naming scheme;
  `nyumbani` removed; mesh/LAN-only default; `baobab.band` (legacy, Cloudflare) is out of
  scope.
- **Maps to:** ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (**resolved here**).
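
The three-tier naming scheme above can be sketched as a small helper. This is only an illustration, not code from the boma repository: the function names, the tier labels (`infra`, `service`, `offsite`), and the placeholder domain `example.org` (standing in for `<boma-domain>`) are all assumptions made for the example.

```python
# Hypothetical helper illustrating the three naming tiers; not part of boma.
# "example.org" is a placeholder for <boma-domain>.
TIERS = {
    "infra": "{name}.boma.{domain}",      # hosts, internal-only
    "service": "{name}.{domain}",         # home/cluster services, split-horizon
    "offsite": "{name}.askari.{domain}",  # off-site/VPS services, public
}


def fqdn(name: str, tier: str, domain: str = "example.org") -> str:
    """Return the FQDN for a host or service in the given tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return TIERS[tier].format(name=name, domain=domain)


def is_public_by_default(tier: str) -> bool:
    """Only the off-site tier gets a public record unless an exception is made."""
    return tier == "offsite"


print(fqdn("ubongo", "infra"))
print(fqdn("netbird", "offsite"))
```

Keeping the tier in an explicit table makes the "mesh/LAN-only by default" rule a single check, so a public record for a home service has to be a deliberate exception rather than an accident of naming.
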

### M2 · `askari` provisioned + under Ansible

Provision the Hetzner VPS **as IaC with Terraform** (Helsinki / Debian 13, behind a
TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap it.
**Shipped as cx23/x86** (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
x86, cheaper). Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.

- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
  existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
  module, `offsite` stack). Token via `TF_VAR_hcloud_token` from `vault.hetzner.token`.
- **Proves:** the `offsite_hosts` pattern, the TF→Ansible handoff for a non-Proxmox host
  (`tf_to_inventory.py` extended), bootstrap of a non-cluster host. Closes review finding
  O6 (`offsite_hosts` missing from `hosts.yml`).
- **Amends:** ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud
  Firewall = perimeter), ADR-007/016 (`askari` TF-provisioned, not "added manually").
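
The TF→Ansible handoff mentioned above could look roughly like the following. The real `tf_to_inventory.py` is not shown in this roadmap, so this is a sketch under assumptions: the Terraform output name `offsite_hosts`, its `ipv4` attribute, and the sample address (from the documentation range 203.0.113.0/24) are hypothetical. What it relies on is the general shape of `terraform output -json`, where each output is an object with a `value` key, and Ansible's nested `children` / `hosts` inventory layout.

```python
import json

# Trimmed sample of `terraform output -json`. The output name and the
# "ipv4" attribute are assumptions made for this sketch.
SAMPLE_TF_OUTPUT = json.loads("""
{
  "offsite_hosts": {
    "sensitive": false,
    "value": {"askari": {"ipv4": "203.0.113.10"}}
  }
}
""")


def to_inventory(tf_output: dict, group: str = "offsite_hosts") -> dict:
    """Build an Ansible-style inventory dict for one group from TF outputs."""
    hosts = tf_output.get(group, {}).get("value", {})
    return {
        "all": {
            "children": {
                group: {
                    "hosts": {
                        name: {"ansible_host": attrs["ipv4"]}
                        for name, attrs in sorted(hosts.items())
                    }
                }
            }
        }
    }


inventory = to_inventory(SAMPLE_TF_OUTPUT)
print(json.dumps(inventory, indent=2))
```

An empty or missing output yields an empty group instead of an error, so the inventory can be regenerated before the off-site stack has been applied.
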

### M3 · `base` matured to a "remote-access-sufficient" subset — ✅ DONE

Added the `hardening` concern to `base` (sshd drop-in key-only + `PermitRootLogin no`;
fail2ban sshd jail 5/1h; ADR-002) and **applied it to askari** by tag
(`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`) — SSH still works, fail2ban
active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO 15).

- **NetBird agent → M4** (deferred from M3: it enrolls against the coordinator, which
  doesn't exist until M4 — ADR-016's coordinator-first bootstrap order).
- **Host firewall on askari + ubongo hardening → M5** (applying default-deny pre-mesh
  would lock out SSH; the Hetzner Cloud Firewall is askari's perimeter until then).
- **Spec/plan:** `docs/superpowers/{specs,plans}/2026-06-14-base-ssh-fail2ban-m3*`.
- **Maps to:** ADR-002 (security baseline), ADR-020 (firewall — built, not yet applied),
  TODO 15 (the rest of hardening → Phase 2).

### M4 · NetBird control plane on `askari` — first real service role

Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's standard
**Caddy** reverse proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
valid Let's Encrypt cert (HTTP-01; the Gandi **DNS-01** path is now built + proven —
2026-06-15, see ADR-024 — for mesh/LAN-only cluster services).
Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
`…2026-06-14-m4a-docker-caddy.md` / `…2026-06-14-m4b-netbird.md`.

**M4b — ✅ DONE (2026-06-16):** the `netbird_coordinator` service role, deployed to askari.
Reality differed from the original plan (captured fresh per ADR-014): NetBird **v0.72.4**
ships a **single combined `netbird-server`** container (management + signal + relay + STUN
+ **embedded Dex** IdP at `/oauth2`) plus `dashboard:v2.39.0` — **no separate signal/relay
container and no Coturn**. Fronted by the M4a Caddy via gRPC-h2c + WebSocket + path routing.
Dashboard live at `https://netbird.askari.wingu.me` (valid LE cert); `/api` auth-gated.
**M5 (enroll peers) followed**, incl. the first-boot `/setup` admin + setup keys.

- **First exercise of:** the service-role conventions (`SECURITY.md` / `VERIFY.md` /
  `ACCESS.md` / `BACKUP.md`), public **TLS / ACME**, and the **backup contract** —
  NetBird's management datastore is *stateful*, so it gets encrypted off-host backup
  (ADR-016 §recovery, ADR-022).
- **Design choice (was open; settled in M4a):** an ACME-terminating **Caddy** reverse
  proxy on `askari` (ADR-024) rather than leaning on NetBird's bundled setup.
- **Maps to:** ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access),
  ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).

### M5 · Enroll peers → goal reached — ✅ DONE (2026-06-17)

The `base` `mesh` concern enrolled **`ubongo` (`100.99.146.14`) + `askari`
(`100.99.226.39`)** as NetBird peers — both Management+Signal Connected, the ubongo↔askari
mesh link ping-verified. NetBird ships a default **Allow-All** peer policy, so any enrolled
peer reaches `ubongo` over `wt0`. The road-warrior clients (**`mamba` + the work laptop**)
are enrolled (operator, via `docs/runbooks/netbird-client.md`) → **`ubongo` is reachable
from anywhere. ← the mobile-access goal is met; Phase 1 is complete.**

- **Deferred to a "mesh-hardening" follow-on** (was folded into M5; split out as the
  lockout-risky part): apply `base` nftables **default-deny** to `ubongo` + set
  `base__firewall_control_addr` (ADR-021 `ssh-from-control`, built/dormant); tighten the
  NetBird ACL off Allow-All to scoped policies; move `askari`'s SSH onto `wt0` (retiring
  the Hetzner-firewall WAN allow). Safe to do now that the `wt0` path exists.
- **Maps to:** ADR-016, ADR-021 (SSH ladder: `wt0` + ssh-from-control), ADR-020.
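
To make the deferred firewall step concrete, here is a rough sketch of what a default-deny input ruleset with an SSH-over-`wt0` path and an `ssh-from-control` allowance might render to. It is not the `base` role's actual template: the function name, the table and chain names, and the exact rule set are assumptions, and the control address is a documentation-range placeholder. Only the nftables syntax itself and the idea of a `policy drop` input chain are standard.

```python
import ipaddress
from typing import Optional


def render_input_ruleset(control_addr: Optional[str] = None,
                         mesh_iface: str = "wt0",
                         ssh_port: int = 22) -> str:
    """Render an nftables table holding a single default-deny input chain."""
    rules = [
        "ct state established,related accept",  # replies to traffic we started
        'iif "lo" accept',                      # loopback
        f'iifname "{mesh_iface}" tcp dport {ssh_port} accept',  # SSH over the mesh
    ]
    if control_addr is not None:
        # Hypothetical stand-in for base__firewall_control_addr.
        addr = ipaddress.ip_address(control_addr)  # ValueError if malformed
        family = "ip" if addr.version == 4 else "ip6"
        rules.append(f"{family} saddr {addr} tcp dport {ssh_port} accept")
    body = "\n".join(f"    {rule}" for rule in rules)
    return (
        "table inet filter {\n"
        "  chain input {\n"
        "    type filter hook input priority 0; policy drop;\n"
        f"{body}\n"
        "  }\n"
        "}\n"
    )


# Only an input chain is rendered; forward and output are left untouched.
ruleset = render_input_ruleset(control_addr="192.0.2.10")
print(ruleset)
```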

---

## Gate — Procurement decision

Run `/capacity-review` (intent-based) to size the cluster, **then procure the Proxmox
hardware**. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access)
has by now been rehearsed on two cheap hosts, so the spend happens once and informed.

- **Maps to:** ADR-012 (hardware & capacity), `/capacity-review`.

---

## Phase 2 — Cluster (gated on procurement; coarse until M5 is near)

Canonical dependency order:

1. **Terraform provisioning** — `terraform init`/apply the Proxmox VM module; regenerate
   inventory via `make tf-inventory` (ADR-006, ADR-009).
2. **`base` full** — CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the
   VM disk layout for CIS L2 is decided **before** provisioning (ADR-002, TODO 15).
3. **`docker_host`** — real Docker engine + Compose, daemon hardening, `nftables.d`
   container rules (currently a scaffold; ADR-004, ADR-020).
4. **`dns` role** — render the internal zone from inventory (ADR-007).
5. **Auth + reverse proxy** — Authentik + **Caddy** (ADR-024): the foundation every
   service sits behind with authentication (ADR-002).
6. **Monitoring** — Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters +
   Uptime Kuma; decide which alerts live where (TODO 3.6).
7. **Service roles** — PhotoPrism, email, indexers, … (`docs/CAPABILITIES.md`); each
   clears `docs/security/service-checklist.md` and carries its standard files.
8. **`backup` role + `fisi` pull node** — restic Model A, pCloud + USB air-gap (ADR-022).
9. **Forgejo Actions CI** — runner + workflows (ADR-003/010, TODO 1).
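
The list above gives only a linear order, not the underlying dependency edges. As a sketch of how such an order could be checked mechanically, the example below uses Python's standard `graphlib`. The step keys and every edge in `DEPENDS_ON` are an assumed reading of the list, not something the ADRs are known to state.

```python
from graphlib import TopologicalSorter

# The roadmap order as listed, using short hypothetical keys.
ROADMAP_ORDER = [
    "terraform", "base", "docker_host", "dns", "auth_proxy",
    "monitoring", "services", "backup", "ci",
]

# ASSUMED prerequisite edges: step -> set of steps that must come first.
DEPENDS_ON = {
    "terraform": set(),
    "base": {"terraform"},
    "docker_host": {"base"},
    "dns": {"base"},
    "auth_proxy": {"docker_host", "dns"},
    "monitoring": {"docker_host"},
    "services": {"auth_proxy", "monitoring"},
    "backup": {"services"},
    "ci": {"docker_host"},
}


def respects_dependencies(order, depends_on):
    """True if every step appears after all of its prerequisites."""
    position = {step: i for i, step in enumerate(order)}
    return all(
        position[dep] < position[step]
        for step, deps in depends_on.items()
        for dep in deps
    )


# static_order() raises graphlib.CycleError if the assumed edges form a cycle.
one_valid_order = list(TopologicalSorter(DEPENDS_ON).static_order())

print(respects_dependencies(ROADMAP_ORDER, DEPENDS_ON))
```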

---

## Underneath both — Cross-cutting / ongoing

- **Accept ADR-011** (update management) — resolve its 6 open questions before the first
  scheduled patch run (TODO 16).
- **Kaizen `/retro`** + keep appending to `docs/FRICTION.md` (TODO 11); **`/security-review`**
  skill (TODO 8.5); **`/review-repo` fortnightly cron** + headless email (TODO 8.1);
  `scheduled_jobs` role (TODO 8.3).
- **User-notification function** — ntfy / matrix / email so tools + AI can reach the
  operator (TODO 9; ties to ADR-011 control channel).

### Parked decisions — decide when they bite, not before

- Split-horizon FQDN with or without `nyumbani` (TODO 4) — settled in M1 (`nyumbani` dropped).
- Central database server vs per-app databases (TODO 3.9) — at the service phase.
- Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
- Whether to keep the custom Molecule base-image method as testing matures (TODO 3.10).

---

## Next step

**Phase 1 complete (M1–M5); mesh-hardening: ubongo (2/3) DONE 2026-06-19, askari redesign DONE 2026-06-20.**
Both hosts now run INPUT-only nftables default-deny (`base__firewall_input_only`), live reboot-validated.
askari's redesign (spec/plan `docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-askari-redesign*`)
applied INPUT-only default-deny + `wt0`-primary SSH + a permanent WAN break-glass + a geo-disabled
coordinator; a real reboot recovered unattended. Remaining mesh-hardening sub-projects:

1. ~~`ubongo` nftables default-deny + `ssh-from-control`~~ → **DONE (2026-06-19).**
2. ~~**redesign** `askari`'s SSH → `wt0`~~ → **DONE (2026-06-20)** — boot-race, coordinator-bootstrap
   chicken-egg, and Docker-nat-flush all resolved + live reboot-validated.
3. ~~**askari relay-SPOF reduction**~~ → **DONE (2026-06-20)** — assessed + **accepted** as a
   documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
   narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
   second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
   DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
   BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).

**Then** the Procurement gate (`/capacity-review` → buy Proxmox hardware) opens Phase 2.