sjat/boma

sjat f10fe8bb60 docs(status): mesh-hardening askari redesign applied + live reboot-validated (2026-06-20)

Live cutover complete: base INPUT-only default-deny + wt0-primary SSH + permanent WAN break-glass on askari, netbird_coordinator geo-disabled. A real reboot recovered unattended — firewall persisted, Docker forwarding + public services up, coordinator geo-disabled (no FATAL), mesh + both SSH paths back. ROADMAP sub-project 3 (askari redesign) marked DONE; next = relay-SPOF reduction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-20 09:22:20 +02:00


ROADMAP — boma build order

High-level build order for the project. Almost everything in docs/decisions/ (the ADRs) is designed, not built — this file sequences that backlog into milestones and records why the order is what it is.

What is built vs planned: STATUS.md (ground truth — always check there first).
The backlog of decisions: docs/TODO.md (this roadmap sequences it).
The design rationale: docs/decisions/ (ADRs).

This is a living document: update it as milestones land (move them to STATUS.md), as ordering changes, or as new milestones appear. Each milestone gets its own spec → plan → implementation cycle (docs/superpowers/specs/ then …/plans/) when it comes up; this file stays high-level.

Last updated: 2026-06-20.

Strategy — "remote-access first" (Approach A)

One focused track now (Off-site / Remote-access), a procurement gate, then the Cluster track. Cross-cutting/ongoing work runs underneath both.

Why this order. The only physical machine that exists today is ubongo (the control node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal — reach ubongo from mamba / a work laptop while on the move — needs only things already available or cheap to spin up (askari at Hetzner, the laptops). Doing the remote-access track first:

delivers the mobile-access goal in the first phase, and
doubles as the proving ground for boma's core machinery — the first real service role (NetBird), the base role on a real, internet-facing host, the offsite_hosts pattern, public DNS + ACME, the backup contract, and rbw/vault in anger — all on two cheap, low-stakes hosts before spending on the cluster.

Cluster hardware is then procured after those patterns are proven and a /capacity-review informs the sizing — so the spend happens once, with knowledge.

Rejected alternatives: B — procure now, build strictly bottom-up (mobile access lands late; spend precedes any proven pattern). C — two parallel tracks (for a solo operator this collapses into interleaving with extra context-switching cost).

Phase 1 — Off-site / Remote-access — ✅ COMPLETE (2026-06-17)

Delivers mobile access to ubongo; proves the machinery. Ordered by real dependencies. All milestones (M1–M5) done; the mobile-access goal is met. Next: the Procurement gate.

M1 · boma's DNS home — a new domain at Gandi, managed as code

Register a new Swahili-themed domain at Gandi for boma and manage its records as code (IaC). Greenfield, not a migration: investigating the existing domains ruled them out as boma's home — baobab.band is the live legacy homelab (Cloudflare; vaultwarden / nextcloud / matrix in daily use), and ziethen.dk is the family's primary email (Fastmail); moving either's authoritative DNS risks breaking production. A fresh domain is zero-risk and born at Gandi.

Driver: values/sovereignty (Gandi) + a clean, decoupled home so boma builds without endangering anything live. baobab.band's Cloudflare exit / V4 decommission is a separate, later track, not part of this build. ziethen.dk is untouched.
IaC approach: follow boma's grain — internal DNS is already Ansible-rendered and Terraform owns no DNS (CLAUDE.md), so public DNS is Ansible-managed too (Gandi LiveDNS via an Ansible module — exact module pinned in M1's spec, verified per ADR-014).
Naming scheme (decided): three tiers (on boma's new domain, <boma-domain>) — <host>.boma.<boma-domain> (infra, internal-only) · <service>.<boma-domain> (home/cluster services, split-horizon) · <service>.askari.<boma-domain> (off-site/VPS, public). nyumbani dropped. Home services are mesh/LAN-only by default (no public record; reached over LAN or the NetBird mesh), with public Gandi records only for deliberate exceptions. The NetBird mesh carries the <boma-domain> match-domain to road-warriors (resolver = dns1/dns2 over wt0); a *.<boma-domain> ACME DNS-01 wildcard cert (Gandi API) gives even unexposed services real TLS. Resolves TODO 4 and review finding O12.
Records as a new/updated ADR: amends ADR-007 — boma's public zone is <boma-domain> at Gandi LiveDNS managed as code; the three-tier naming scheme; nyumbani removed; mesh/LAN-only default; baobab.band (legacy, Cloudflare) is out of scope.
Maps to: ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (resolved here).
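The three-tier naming scheme above can be sketched as a small classifier. This is a minimal illustration, not boma's code: the function name, the tier labels (`infra`, `service`, `offsite`) and the `example.org` stand-in for `<boma-domain>` are all assumptions made for the example.

```python
def classify_fqdn(fqdn: str, boma_domain: str) -> str:
    """Return the naming tier of `fqdn` under `boma_domain`.

    Tier labels are hypothetical names for the three tiers in the roadmap:
      infra   -> <host>.boma.<boma-domain>      (internal-only)
      service -> <service>.<boma-domain>        (split-horizon)
      offsite -> <service>.askari.<boma-domain> (public)
    """
    name = fqdn.rstrip(".").lower()
    suffix = "." + boma_domain.rstrip(".").lower()
    if not name.endswith(suffix):
        raise ValueError(f"{fqdn!r} is not under {boma_domain!r}")

    # Labels to the left of the zone apex, e.g. ["ubongo", "boma"].
    labels = name[: -len(suffix)].split(".")
    if not all(labels):
        raise ValueError(f"{fqdn!r} contains an empty label")

    if len(labels) == 1:
        # "boma" and "askari" are tier markers, not service names.
        if labels[0] in ("boma", "askari"):
            raise ValueError(f"{fqdn!r} is a tier marker, not a service")
        return "service"
    if len(labels) == 2 and labels[1] == "boma":
        return "infra"
    if len(labels) == 2 and labels[1] == "askari":
        return "offsite"
    raise ValueError(f"{fqdn!r} does not match any naming tier")


if __name__ == "__main__":
    for host in ("ubongo.boma.example.org",
                 "photos.example.org",
                 "netbird.askari.example.org"):
        print(host, "->", classify_fqdn(host, "example.org"))
```

A check like this could run in CI against rendered DNS records so that a name outside the three tiers fails early. Whether boma wants such a gate is a design choice this sketch does not settle.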

M2 · `askari` provisioned + under Ansible

Provision the Hetzner VPS as IaC with Terraform (Helsinki / Debian 13, behind a TF-managed Hetzner Cloud Firewall), bring it into offsite_hosts, and bootstrap it. Shipped as cx23/x86 (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec x86, cheaper). Design: docs/superpowers/specs/2026-06-14-askari-provisioning-design.md.

Decided: Terraform owns askari's existence — generalizes ADR-006 from "Proxmox VM existence" to Proxmox + Hetzner (new hetznercloud/hcloud provider, hetzner_vm module, offsite stack). Token via TF_VAR_hcloud_token from vault.hetzner.token.
Proves: the offsite_hosts pattern, the TF→Ansible handoff for a non-Proxmox host (tf_to_inventory.py extended), bootstrap of a non-cluster host. Closes review finding O6 (offsite_hosts missing from hosts.yml).
Amends: ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud Firewall = perimeter), ADR-007/016 (askari TF-provisioned, not "added manually").
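The TF→Ansible handoff for a non-Proxmox host can be sketched as follows. This is not the actual tf_to_inventory.py, whose contents this document does not show. It assumes a Terraform output named `offsite_hosts` (hypothetical) whose value maps hostname to IPv4 address, and it relies only on the documented shape of `terraform output -json`, where each output is an object with a `value` key. The address used is from the 203.0.113.0/24 documentation range.

```python
import json


def offsite_inventory(tf_output_json: str,
                      output_name: str = "offsite_hosts") -> dict:
    """Build an Ansible inventory fragment from `terraform output -json`.

    `output_name` is a hypothetical Terraform output whose value is a
    {hostname: ipv4} mapping; the real stack may name or shape it differently.
    """
    outputs = json.loads(tf_output_json)
    hosts = outputs[output_name]["value"]
    return {
        "all": {
            "children": {
                "offsite_hosts": {
                    "hosts": {
                        name: {"ansible_host": addr}
                        for name, addr in sorted(hosts.items())
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Sample input in the shape `terraform output -json` produces.
    sample = json.dumps({
        "offsite_hosts": {
            "sensitive": False,
            "type": ["map", "string"],
            "value": {"askari": "203.0.113.10"},
        }
    })
    # JSON is also valid YAML, so this can be written to an inventory file.
    print(json.dumps(offsite_inventory(sample), indent=2))
```

Keeping the conversion a pure function of the Terraform output makes it easy to test without a live Hetzner project.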

M3 · `base` matured to a "remote-access-sufficient" subset — ✅ DONE

Added the hardening concern to base (sshd drop-in key-only + PermitRootLogin no; fail2ban sshd jail 5/1h; ADR-002) and applied it to askari by tag (make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening) — SSH still works, fail2ban active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO 15).

NetBird agent → M4 (deferred from M3: it enrolls against the coordinator, which doesn't exist until M4 — ADR-016's coordinator-first bootstrap order).
Host firewall on askari + ubongo hardening → M5 (applying default-deny pre-mesh would lock out SSH; the Hetzner Cloud Firewall is askari's perimeter until then).
Spec/plan: docs/superpowers/{specs,plans}/2026-06-14-base-ssh-fail2ban-m3*.
Maps to: ADR-002 (security baseline), ADR-020 (firewall — built, not yet applied), TODO 15 (the rest of hardening → Phase 2).

M4 · NetBird control plane on `askari` — first real service role

Built in two phases. M4a (platform) — ✅ DONE: Docker on askari + boma's standard Caddy reverse proxy (ADR-024), proven by https://test.askari.wingu.me serving a valid Let's Encrypt cert (HTTP-01; the Gandi DNS-01 path is now built + proven — 2026-06-15, see ADR-024 — for mesh/LAN-only cluster services). Firewall opened 80/443/3478. Spec/plan: …2026-06-14-netbird-coordinator-m4-design.md / …2026-06-14-m4a-docker-caddy.md / …2026-06-14-m4b-netbird.md.

M4b — ✅ DONE (2026-06-16): the netbird_coordinator service role, deployed to askari. Reality differed from the original plan (captured fresh per ADR-014): NetBird v0.72.4 ships a single combined netbird-server container (management + signal + relay + STUN + embedded Dex IdP at /oauth2) plus dashboard:v2.39.0, with no separate signal/relay container and no Coturn. Fronted by the M4a Caddy via gRPC-h2c + WebSocket + path routing. Dashboard live at https://netbird.askari.wingu.me (valid LE cert); /api auth-gated. M5 (enrol peers) is next, including the first-boot /setup admin + setup keys.

First exercise of: the service-role conventions (SECURITY.md / VERIFY.md / ACCESS.md / BACKUP.md), public TLS / ACME, and the backup contract — NetBird's management datastore is stateful, so it gets encrypted off-host backup (ADR-016 §recovery, ADR-022).
Design choice (settled in M4a: NetBird sits behind the Caddy reverse proxy): a minimal ACME-terminating reverse proxy (e.g. Caddy) just for NetBird on askari, vs leaning on NetBird's bundled setup.
Maps to: ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access), ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).

M5 · Enroll peers → goal reached — ✅ DONE (2026-06-17)

The base mesh concern enrolled ubongo (100.99.146.14) + askari (100.99.226.39) as NetBird peers — both Management+Signal Connected, the ubongo↔askari mesh link ping-verified. NetBird ships a default Allow-All peer policy, so any enrolled peer reaches ubongo over wt0. The road-warrior clients (mamba + the work laptop) are enrolled (operator, via docs/runbooks/netbird-client.md) → ubongo is reachable from anywhere. ← the mobile-access goal is met; Phase 1 is complete.

Deferred to a "mesh-hardening" follow-on (was folded into M5; split out as the lockout-risky part): apply base nftables default-deny to ubongo + set base__firewall_control_addr (ADR-021 ssh-from-control, built/dormant); tighten the NetBird ACL off Allow-All to scoped policies; move askari's SSH onto wt0 (retiring the Hetzner-firewall WAN allow). Safe to do now that the wt0 path exists.
Maps to: ADR-016, ADR-021 (SSH ladder: wt0 + ssh-from-control), ADR-020.
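Both peer addresses listed above sit inside 100.64.0.0/10, the shared address space defined by RFC 6598. A quick sanity check on enrolled peers can be sketched with the standard library. Treating that whole block as "the mesh range" is an assumption for this example, not something the roadmap states; the function name is hypothetical.

```python
import ipaddress

# RFC 6598 shared address space; assumed here to be the mesh range.
MESH_RANGE = ipaddress.ip_network("100.64.0.0/10")

# Peer addresses as recorded in M5.
PEERS = {
    "ubongo": "100.99.146.14",
    "askari": "100.99.226.39",
}


def peers_outside_mesh(peers: dict, network=MESH_RANGE) -> list:
    """Return the names of peers whose address is not inside `network`."""
    return [
        name
        for name, addr in peers.items()
        if ipaddress.ip_address(addr) not in network
    ]


if __name__ == "__main__":
    stray = peers_outside_mesh(PEERS)
    print("all peers in mesh range" if not stray else f"outside range: {stray}")
```

A check of this kind is useful before writing firewall rules keyed on mesh addresses, since a rule scoped to the wrong range is one way to cause the lockout that the mesh-hardening follow-on is careful about.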

Gate — Procurement decision

Run /capacity-review (intent-based) to size the cluster, then procure the Proxmox hardware. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access) has by now been rehearsed on two cheap hosts, so the spend happens once and informed.

Maps to: ADR-012 (hardware & capacity), /capacity-review.

Phase 2 — Cluster (gated on procurement; coarse until M5 is near)

Canonical dependency order:

Terraform provisioning — terraform init/apply the Proxmox VM module; regenerate inventory via make tf-inventory (ADR-006, ADR-009).
base full — CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the VM disk layout for CIS L2 is decided before provisioning (ADR-002, TODO 15).
docker_host — real Docker engine + Compose, daemon hardening, nftables.d container rules (currently a scaffold; ADR-004, ADR-020).
dns role — render the internal zone from inventory (ADR-007).
Auth + reverse proxy — Authentik + Caddy (ADR-024): the foundation every service sits behind with authentication (ADR-002).
Monitoring — Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters + Uptime Kuma; decide which alerts live where (TODO 3.6).
Service roles — PhotoPrism, email, indexers, … (docs/CAPABILITIES.md); each clears docs/security/service-checklist.md and carries its standard files.
backup role + fisi pull node — restic Model A, pCloud + USB air-gap (ADR-022).
Forgejo Actions CI — runner + workflows (ADR-003/010, TODO 1).
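The build order above can be sketched as data plus a small guard that reports the next step and rejects out-of-order progress. The short step identifiers are hypothetical, and reading the list as strictly linear is an assumption: the roadmap calls it a canonical dependency order but does not spell out which steps could run in parallel.

```python
from typing import Iterable, Optional

# Hypothetical short names for the nine Phase 2 steps, in roadmap order.
PHASE2_ORDER = [
    "terraform", "base_full", "docker_host", "dns", "auth_proxy",
    "monitoring", "services", "backup", "ci",
]


def next_step(done: Iterable[str]) -> Optional[str]:
    """Return the first step not yet done, or None when all are done.

    Raises ValueError for an unknown step name, or when a later step is
    marked done while an earlier one is not.
    """
    done_set = set(done)
    unknown = done_set - set(PHASE2_ORDER)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")

    for index, step in enumerate(PHASE2_ORDER):
        if step not in done_set:
            ahead = [s for s in PHASE2_ORDER[index + 1:] if s in done_set]
            if ahead:
                raise ValueError(f"{ahead} done before {step!r}")
            return step
    return None


if __name__ == "__main__":
    print(next_step(["terraform", "base_full"]))
```

If some steps later turn out to be independent, the list could be replaced with an explicit dependency graph and a topological sort without changing callers.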

Underneath both — Cross-cutting / ongoing

Accept ADR-011 (update management) — resolve its 6 open questions before the first scheduled patch run (TODO 16).
Kaizen /retro + keep appending to docs/FRICTION.md (TODO 11); /security-review skill (TODO 8.5); /review-repo fortnightly cron + headless email (TODO 8.1); scheduled_jobs role (TODO 8.3).
User-notification function — ntfy / matrix / email so tools + AI can reach the operator (TODO 9; ties to ADR-011 control channel).

Parked decisions — decide when they bite, not before

Split-horizon FQDN with or without nyumbani (TODO 4) — settled in M1 (nyumbani dropped).
Central database server vs per-app databases (TODO 3.9) — at the service phase.
Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
Keep the custom Molecule base-image method as testing matures (TODO 3.10).

Next step

Phase 1 is complete (M1–M5). Mesh-hardening: ubongo (2/3) DONE 2026-06-19, askari redesign DONE 2026-06-20. Both hosts now run INPUT-only nftables default-deny (base__firewall_input_only), live reboot-validated. askari's redesign (spec/plan docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-askari-redesign*) applied INPUT-only default-deny + wt0-primary SSH + a permanent WAN break-glass + a geo-disabled coordinator; a real reboot recovered unattended. Mesh-hardening sub-projects, done and remaining:

ubongo nftables default-deny + ssh-from-control → DONE (2026-06-19).
redesign askari's SSH → wt0 → DONE (2026-06-20) — boot-race, coordinator-bootstrap chicken-egg, and Docker-nat-flush all resolved + live reboot-validated.
askari relay-SPOF reduction (next) — ubongo→askari is currently Relayed through askari's own relay, so askari is a single point of failure for relayed mesh traffic; reduce it (second relay / direct P2P).
tighten the NetBird ACL off Allow-All to scoped policies (open mechanism question — no headless API path).

Then the Procurement gate (/capacity-review → buy Proxmox hardware) opens Phase 2.
