ROADMAP — boma build order
High-level build order for the project. Almost everything in docs/decisions/
(the ADRs) is designed, not built — this file sequences that backlog into milestones
and records why the order is what it is.
- What is built vs planned: STATUS.md (ground truth; always check there first).
- The backlog of decisions: docs/TODO.md (this roadmap sequences it).
- The design rationale: docs/decisions/ (ADRs).
This is a living document: update it as milestones land (move them to STATUS.md),
as ordering changes, or as new milestones appear. Each milestone gets its own
spec → plan → implementation cycle (docs/superpowers/specs/ then …/plans/) when it
comes up; this file stays high-level.
Last updated: 2026-06-19.
Strategy — "remote-access first" (Approach A)
One focused track now (Off-site / Remote-access), a procurement gate, then the Cluster track. Cross-cutting/ongoing work runs underneath both.
Why this order. The only physical machine that exists today is ubongo (the control
node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal
— reach ubongo from mamba / a work laptop while on the move — needs only things
already available or cheap to spin up (askari at Hetzner, the laptops). Doing the
remote-access track first:
- delivers the mobile-access goal in the first phase, and
- doubles as the proving ground for boma's core machinery: the first real service
role (NetBird), the base role on a real, internet-facing host, the offsite_hosts
pattern, public DNS + ACME, the backup contract, and rbw/vault in anger, all on two
cheap, low-stakes hosts before spending on the cluster.
Cluster hardware is then procured after those patterns are proven and a
/capacity-review informs the sizing — so the spend happens once, with knowledge.
Rejected alternatives: B — procure now, build strictly bottom-up (mobile access lands late; spend precedes any proven pattern). C — two parallel tracks (for a solo operator this collapses into interleaving with extra context-switching cost).
Phase 1 — Off-site / Remote-access — ✅ COMPLETE (2026-06-17)
Delivers mobile access to ubongo; proves the machinery. Ordered by real dependencies.
All milestones (M1–M5) done; the mobile-access goal is met. Next: the Procurement gate.
M1 · boma's DNS home — a new domain at Gandi, managed as code
Register a new Swahili-themed domain at Gandi for boma and manage its records as
code (IaC). Greenfield, not a migration: investigating the existing domains ruled them
out as boma's home — baobab.band is the live legacy homelab (Cloudflare; vaultwarden
/ nextcloud / matrix in daily use), and ziethen.dk is the family's primary email
(Fastmail); moving either's authoritative DNS risks breaking production. A fresh domain is
zero-risk and born at Gandi.
- Driver: values/sovereignty (Gandi) + a clean, decoupled home so boma builds without
endangering anything live. baobab.band's Cloudflare exit / V4 decommission is a
separate, later track, not part of this build. ziethen.dk is untouched.
- IaC approach: follow boma's grain: internal DNS is already Ansible-rendered and
Terraform owns no DNS (CLAUDE.md), so public DNS is Ansible-managed too (Gandi LiveDNS
via an Ansible module; exact module pinned in M1's spec, verified per ADR-014).
- Naming scheme (decided): three tiers (on boma's new domain, <boma-domain>):
<host>.boma.<boma-domain> (infra, internal-only) · <service>.<boma-domain>
(home/cluster services, split-horizon) · <service>.askari.<boma-domain> (off-site/VPS,
public). nyumbani dropped. Home services are mesh/LAN-only by default (no public
record; reached over LAN or the NetBird mesh), with public Gandi records only for
deliberate exceptions. The NetBird mesh carries the <boma-domain> match-domain to
road-warriors (resolver = dns1/dns2 over wt0); a *.<boma-domain> ACME DNS-01 wildcard
cert (Gandi API) gives even unexposed services real TLS. Resolves TODO 4 and review
finding O12.
- Records as a new/updated ADR: amends ADR-007: boma's public zone is <boma-domain> at
Gandi LiveDNS managed as code; the three-tier naming scheme; nyumbani removed;
mesh/LAN-only default; baobab.band (legacy, Cloudflare) is out of scope.
- Maps to: ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (resolved here).
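The three-tier naming scheme above can be sketched as a small helper. This is an illustration only: the names Tier, fqdn and is_public_by_default, and the placeholder domain example.org, are assumptions made for this sketch and are not part of boma.

```python
from enum import Enum


class Tier(Enum):
    INFRA = "infra"      # <host>.boma.<boma-domain>, internal-only
    HOME = "home"        # <service>.<boma-domain>, split-horizon
    OFFSITE = "offsite"  # <service>.askari.<boma-domain>, public


def fqdn(name: str, tier: Tier, boma_domain: str) -> str:
    """Build the fully qualified name for one of the three tiers."""
    if tier is Tier.INFRA:
        return f"{name}.boma.{boma_domain}"
    if tier is Tier.HOME:
        return f"{name}.{boma_domain}"
    if tier is Tier.OFFSITE:
        return f"{name}.askari.{boma_domain}"
    raise ValueError(f"unknown tier: {tier!r}")


def is_public_by_default(tier: Tier) -> bool:
    """Only the off-site tier gets a public record without a deliberate exception."""
    return tier is Tier.OFFSITE
```

For example, fqdn("netbird", Tier.OFFSITE, "example.org") yields netbird.askari.example.org, while a home service such as fqdn("photos", Tier.HOME, "example.org") stays mesh/LAN-only unless an exception is recorded.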
M2 · askari provisioned + under Ansible
Provision the Hetzner VPS as IaC with Terraform (Helsinki / Debian 13, behind a
TF-managed Hetzner Cloud Firewall), bring it into offsite_hosts, and bootstrap it.
Shipped as cx23/x86 (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
x86, cheaper). Design: docs/superpowers/specs/2026-06-14-askari-provisioning-design.md.
- Decided: Terraform owns askari's existence, which generalizes ADR-006 from "Proxmox VM
existence" to Proxmox + Hetzner (new hetznercloud/hcloud provider, hetzner_vm module,
offsite stack). Token via TF_VAR_hcloud_token from vault.hetzner.token.
- Proves: the offsite_hosts pattern, the TF→Ansible handoff for a non-Proxmox host
(tf_to_inventory.py extended), bootstrap of a non-cluster host. Closes review finding
O6 (offsite_hosts missing from hosts.yml).
- Amends: ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud
Firewall = perimeter), ADR-007/016 (askari TF-provisioned, not "added manually").
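One way to picture the TF→Ansible handoff for an off-site host: read the JSON printed by terraform output -json and emit an inventory fragment with an offsite_hosts group. This is a sketch under stated assumptions, not the actual tf_to_inventory.py: the Terraform output name offsite_hosts, its shape (a map of hostname to IPv4 address), the function name and the sample address are all hypothetical.

```python
import json


def offsite_inventory(tf_output_json: str) -> dict:
    """Turn `terraform output -json` text into an Ansible inventory fragment.

    Assumes a Terraform output named "offsite_hosts" whose value maps
    hostname -> public IPv4 address (hypothetical shape).
    """
    outputs = json.loads(tf_output_json)
    # Each output is wrapped as {"sensitive": ..., "type": ..., "value": ...}.
    hosts = outputs.get("offsite_hosts", {}).get("value", {})
    return {
        "offsite_hosts": {
            "hosts": {
                name: {"ansible_host": ip} for name, ip in sorted(hosts.items())
            }
        }
    }


if __name__ == "__main__":
    sample = json.dumps(
        {
            "offsite_hosts": {
                "sensitive": False,
                "type": ["map", "string"],
                "value": {"askari": "203.0.113.10"},  # documentation address
            }
        }
    )
    print(json.dumps(offsite_inventory(sample), indent=2))
```

The fragment is printed as JSON to keep the sketch stdlib-only; writing it into a YAML inventory such as hosts.yml would need a YAML serializer.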
M3 · base matured to a "remote-access-sufficient" subset — ✅ DONE
Added the hardening concern to base (sshd drop-in key-only + PermitRootLogin no;
fail2ban sshd jail 5/1h; ADR-002) and applied it to askari by tag
(make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening) — SSH still works, fail2ban
active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO 15).
- NetBird agent → M4 (deferred from M3: it enrolls against the coordinator, which doesn't exist until M4 — ADR-016's coordinator-first bootstrap order).
- Host firewall on askari + ubongo hardening → M5 (applying default-deny pre-mesh would lock out SSH; the Hetzner Cloud Firewall is askari's perimeter until then).
- Spec/plan: docs/superpowers/{specs,plans}/2026-06-14-base-ssh-fail2ban-m3*.
- Maps to: ADR-002 (security baseline), ADR-020 (firewall: built, not yet applied),
TODO 15 (the rest of hardening → Phase 2).
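The sshd part of the hardening concern can be spot-checked with a few lines of Python. A minimal sketch, assuming "key-only" is expressed as PasswordAuthentication no (the actual drop-in may set further directives); the function names are made up for this example, and Match blocks are not handled.

```python
def sshd_directives(config_text: str) -> dict:
    """Parse sshd_config-style text into {lowercased keyword: value}.

    sshd uses the first value it obtains for a keyword, so later
    duplicates are ignored here as well.
    """
    found = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        if len(parts) == 2:
            found.setdefault(parts[0].lower(), parts[1].strip())
    return found


def is_hardened(config_text: str) -> bool:
    """True if root login and password login are both switched off."""
    d = sshd_directives(config_text)
    return (
        d.get("permitrootlogin", "").lower() == "no"
        and d.get("passwordauthentication", "").lower() == "no"
    )
```

Feeding it the rendered drop-in text gives a quick yes/no; it does not replace checking the effective configuration on the host.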
M4 · NetBird control plane on askari — first real service role
Built in two phases. M4a (platform) — ✅ DONE: Docker on askari + boma's standard
Caddy reverse proxy (ADR-024), proven by https://test.askari.wingu.me serving a
valid Let's Encrypt cert (HTTP-01; the Gandi DNS-01 path is now built + proven —
2026-06-15, see ADR-024 — for mesh/LAN-only cluster services).
Firewall opened 80/443/3478. Spec/plan: …2026-06-14-netbird-coordinator-m4-design.md /
…2026-06-14-m4a-docker-caddy.md / …2026-06-14-m4b-netbird.md.
M4b — ✅ DONE (2026-06-16): the netbird_coordinator service role, deployed to askari.
Reality differed from the original plan (captured fresh per ADR-014): NetBird v0.72.4
ships a single combined netbird-server container (management + signal + relay + STUN
+ embedded Dex IdP at /oauth2) plus dashboard:v2.39.0, with no separate signal/relay
container and no Coturn. Fronted by the M4a Caddy via gRPC-h2c + WebSocket + path
routing. Dashboard live at https://netbird.askari.wingu.me (valid LE cert); /api
auth-gated. M5 (enrol peers) is next, incl. the first-boot /setup admin + setup keys.
- First exercise of: the service-role conventions (SECURITY.md / VERIFY.md / ACCESS.md
/ BACKUP.md), public TLS / ACME, and the backup contract: NetBird's management
datastore is stateful, so it gets encrypted off-host backup (ADR-016 §recovery,
ADR-022).
- Open design choice (decide in M4's spec): a minimal ACME-terminating reverse proxy
(e.g. Caddy) just for NetBird on askari, vs leaning on NetBird's bundled setup.
- Maps to: ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access), ADR-022
(backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).
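The service-role conventions lend themselves to a mechanical check. The sketch below lists which of the four standard files a role directory lacks; the function name and the idea of running this as a lint step are assumptions, and only the four file names come from this roadmap.

```python
from pathlib import Path

# The four standard files named by the service-role conventions.
STANDARD_FILES = ("SECURITY.md", "VERIFY.md", "ACCESS.md", "BACKUP.md")


def missing_standard_files(role_dir: Path) -> list[str]:
    """Return the standard files absent from role_dir, in canonical order."""
    return [name for name in STANDARD_FILES if not (role_dir / name).is_file()]
```

An empty list means the role carries all four files; anything else names what still has to be written before the role counts as complete.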
M5 · Enroll peers → goal reached — ✅ DONE (2026-06-17)
The base mesh concern enrolled ubongo (100.99.146.14) + askari
(100.99.226.39) as NetBird peers — both Management+Signal Connected, the ubongo↔askari
mesh link ping-verified. NetBird ships a default Allow-All peer policy, so any enrolled
peer reaches ubongo over wt0. The road-warrior clients (mamba + the work laptop)
are enrolled (operator, via docs/runbooks/netbird-client.md) → ubongo is reachable
from anywhere. ← the mobile-access goal is met; Phase 1 is complete.
- Deferred to a "mesh-hardening" follow-on (was folded into M5; split out as the
lockout-risky part): apply base nftables default-deny to ubongo + set
base__firewall_control_addr (ADR-021 ssh-from-control, built/dormant); tighten the
NetBird ACL off Allow-All to scoped policies; move askari's SSH onto wt0 (retiring the
Hetzner-firewall WAN allow). Safe to do now that the wt0 path exists.
- Maps to: ADR-016, ADR-021 (SSH ladder: wt0 + ssh-from-control), ADR-020.
Gate — Procurement decision
Run /capacity-review (intent-based) to size the cluster, then procure the Proxmox
hardware. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access)
has by now been rehearsed on two cheap hosts, so the spend happens once and informed.
- Maps to: ADR-012 (hardware & capacity), /capacity-review.
Phase 2 — Cluster (gated on procurement; coarse until M5 is near)
Canonical dependency order:
- Terraform provisioning: terraform init/apply the Proxmox VM module; regenerate
inventory via make tf-inventory (ADR-006, ADR-009).
- base full: CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the VM disk
layout for CIS L2 is decided before provisioning (ADR-002, TODO 15).
- docker_host: real Docker engine + Compose, daemon hardening, nftables.d container
rules (currently a scaffold; ADR-004, ADR-020).
- dns role: render the internal zone from inventory (ADR-007).
- Auth + reverse proxy: Authentik + Caddy (ADR-024), the foundation every service sits
behind with authentication (ADR-002).
- Monitoring: Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters + Uptime
Kuma; decide which alerts live where (TODO 3.6).
- Service roles: PhotoPrism, email, indexers, … (docs/CAPABILITIES.md); each clears
docs/security/service-checklist.md and carries its standard files.
- backup role + fisi pull node: restic Model A, pCloud + USB air-gap (ADR-022).
- Forgejo Actions CI: runner + workflows (ADR-003/010, TODO 1).
Underneath both — Cross-cutting / ongoing
- Accept ADR-011 (update management) — resolve its 6 open questions before the first scheduled patch run (TODO 16).
- Kaizen /retro + keep appending to docs/FRICTION.md (TODO 11); /security-review skill
(TODO 8.5); /review-repo fortnightly cron + headless email (TODO 8.1); scheduled_jobs
role (TODO 8.3).
- User-notification function: ntfy / matrix / email so tools + AI can reach the
operator (TODO 9; ties to ADR-011 control channel).
Parked decisions — decide when they bite, not before
- Split-horizon FQDN with or without nyumbani (TODO 4): settled in M1 (nyumbani dropped).
- Central database server vs per-app databases (TODO 3.9), to be decided at the service phase.
- Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
- Keep the custom Molecule base-image method as testing matures (TODO 3.10).
Next step
Phase 1 complete (M1–M5); mesh-hardening 2/3 (ubongo default-deny) DONE (2026-06-19) —
INPUT-only nftables default-deny applied + live-verified on ubongo (base__firewall_input_only;
spec/plan docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-ubongo-default-deny*;
real-host reboot validation pending, low-risk — lockout-safe via the permanent console).
Remaining mesh-hardening sub-projects, each its own spec → plan → implementation cycle:
- ubongo nftables default-deny + ssh-from-control → DONE (2026-06-19).
- tighten the NetBird ACL off Allow-All to scoped policies (open mechanism question:
no headless API path).
- redesign askari's SSH → wt0 (the 2026-06-17 attempt was backed out; the redesign must
resolve the boot-race, the coordinator-bootstrap chicken-egg, and the Docker-nat-flush
that the flush ruleset causes on a Docker host).
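The Docker-nat-flush hazard can at least be detected after the fact. A minimal sketch, assuming Docker's rules appear in the output of nft list ruleset as a chain literally named DOCKER (common with the iptables-nft backend, but an assumption about these hosts); the function names are hypothetical.

```python
import re

# Assumption: Docker's rules show up as a chain literally named DOCKER.
_DOCKER_CHAIN = re.compile(r"^\s*chain\s+DOCKER\s*\{", re.MULTILINE)


def docker_chains_present(ruleset_text: str) -> bool:
    """Return True if the ruleset text still contains a DOCKER chain."""
    return bool(_DOCKER_CHAIN.search(ruleset_text))


def needs_docker_restart(ruleset_text: str, docker_installed: bool) -> bool:
    """On a Docker host, a ruleset without DOCKER chains suggests a flush."""
    return docker_installed and not docker_chains_present(ruleset_text)
```

The ruleset text could come from subprocess.run(["nft", "list", "ruleset"], capture_output=True, text=True, check=True).stdout, run as root. A True result only says Docker's chains are gone; the redesign still has to decide how they get restored.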
Then the Procurement gate (/capacity-review → buy Proxmox hardware) opens Phase 2.