# Project status: what's real vs planned
This repo is partly aspirational: the ADRs in docs/decisions/ describe the
intended design, and some of it is not built yet. This file is the ground
truth. Before relying on a role, provider, or pipeline existing, check here.
If something is listed as "designed, not built", do not assume it works.
Last reviewed: 2026-06-14.
## Real and working today
| Thing | State |
|---|---|
| `playbooks/bootstrap.yml` | Works: self-contained (installs Python, creates the ansible user + sudoers) |
| `scripts/tf_to_inventory.py` | Works: stdlib only; terraform output -json → hosts.yml |
| `.docker/molecule-debian13/Dockerfile` | Present: custom Molecule test image (ADR-008) |
| `docs/decisions/*`, `docs/runbooks/*` | Current and mutually reconciled |
| `Makefile`, lint config (`.ansible-lint`, `.yamllint`), `.gitignore` | Present and used |
| git | Initialized, trunk-based on main, pushed to origin (forgejo.nyumbani.baobab.band:7577). |
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with pre-commit install after make setup. |
| Vault password client | scripts/vault-pass-client.sh fetches the master password from Vaultwarden via rbw (wired as vault_password_file). Requires rbw installed + rbw unlock. |
| `/review-repo` | Repo audit: scripts/repo-scan.py (Phase 0) + .claude/commands/review-repo.md, reports to docs/reviews/. On-demand only; cron + email deferred (docs/TODO.md). |
| `/kaizen` | Curate docs/FRICTION.md Open signals → decisions ledger (scripts/friction-scan.py Phase 0, unit-tested, + .claude/commands/kaizen.md). Interactive, on-demand; --nudge (recurrence/age/backlog) surfaces in /review-repo. Headless/cron deferred (TODO 11.3). |
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs). The Proxmox envs have never been run; see "Designed but not built". The offsite env is written and applied (it provisioned askari; see the askari row below). |
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present: reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works: on-demand capacity evaluation → docs/hardware/reviews/. Intent-based (no live usage yet) |
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present: threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
| Service-role standard + per-service `SECURITY.md` convention | Defined (ADR-004 + docs/security/service-security-template.md); applied so far only in roles/netbird_coordinator/, the first service role |
| Tag standard + enforcement (ADR-019) | Works — tests/tags.yml (closed vocabulary) + scripts/check-tags.py (run by make lint, unit-tested): enforces the tag vocabulary and that each role import in a play's roles: block carries its role-name tag. Governs mostly-unbuilt roles, but the linter is live now. Proxmox VM tag convention (<env>, group, managed-by=terraform) is in the Terraform HCL but unprovisioned. |
| `roles/dev_env/`: interactive developer environment | Built + applied. zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. Applied to ubongo for users sjat + claude (verified: zsh login shells, stow-symlinked .zshrc/.tmux.conf + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via playbooks/workstation.yml against the control group (no dedicated workstations group yet). |
| `make check` / `make deploy PLAYBOOK=<name>` | Works. First end-to-end run (applying dev_env) surfaced + fixed latent bugs: Makefile PLAYBOOK var collision (binary path vs playbook-name arg) meant the targets never ran; ansible.cfg referenced uninstalled community.general callbacks (now built-in default + ansible.posix.profile_tasks); acl package added so Ansible can become_user an unprivileged user. The make targets now function, though site/base/docker_host content is still incomplete (see below). |
| `roles/public_dns/` + `playbooks/dns.yml` | Built + applied. Manages wingu.me at Gandi LiveDNS as code (community.general.gandi_livedns, PAT from vault.gandi.pat); record data, anti-spoof baseline (SPF -all + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (tests/test_public_dns.py). Applied to wingu.me (2026-06-14): purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects 0 .); the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
| ubongo: physical control / AI-worker host (ADR-015) | Built (partial). Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption, an accepted risk). Full toolchain installed + pinned to fisi (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via make setup/make collections). Repo cloned under a dedicated claude user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory control group at 10.20.10.151. dev_env now applied here (zsh/tmux/nvim for sjat + claude, via playbooks/workstation.yml). Managed as the operator account sjat (group_vars/control sets ansible_user: sjat), not the ansible service user group_vars/all assumes; ubongo has no bootstrapped ansible user. NetBird mesh-enrolled (M5, 2026-06-17): wt0 up at 100.99.146.14 via the base mesh concern; agent management now works because claude's SSH key was added to sjat's authorized_keys and sjat was granted NOPASSWD sudo (/etc/sudoers.d/sjat-ansible), the interim until the proper ansible-user bootstrap. Pending: full base hardening (only firewall exists, NOT applied here; default-deny is the deferred mesh-hardening step now that wt0 exists); proper ansible-user bootstrap (currently managed as sjat); OPNsense DHCP reservation for 10.20.10.151 (MAC 88:a4:c2:e0:ee:da); Terraform state backup (now relevant: the offsite tfstate exists). |
| askari: off-site Hetzner VPS (ADR-007/016, M2) | Built + applied. Provisioned by Terraform (environments/offsite, hetznercloud/hcloud) as cx23 / hel1 / Debian 13.5 (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the ansible user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (91.226.145.80). Reachable from ubongo (ansible offsite_hosts -m ping ✓), in the offsite_hosts inventory (generated offsite.yml), published at askari.wingu.me → 77.42.120.136. SSH-hardened + fail2ban (M3). Docker + Caddy reverse proxy (M4a): docker_host + reverse_proxy (vanilla Caddy, HTTP-01) applied; https://test.askari.wingu.me serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). NetBird coordinator (M4b): netbird_coordinator deployed; dashboard live at https://netbird.askari.wingu.me (valid LE cert), management API behind embedded Dex (401 unauth), STUN on 3478/udp. NetBird peer (M5, 2026-06-17): also enrolled as a mesh agent (base mesh concern); wt0 at 100.99.226.39, Management+Signal Connected; the agent coexists with the coordinator. Pending: host firewall + moving askari's SSH onto wt0 (deferred mesh-hardening; the Hetzner Cloud Firewall is its perimeter until then), offsite tfstate backup (ADR-022). |
| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | Built + applied (askari, M4a). docker_host installs Docker CE + compose; reverse_proxy is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from reverse_proxy__routes). DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15): custom caddy-gandi image (.docker/caddy-gandi/, make caddy-image, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via reverse_proxy__acme_dns_provider: gandi + reverse_proxy__image. Verified end-to-end: a real wildcard cert issued via LE staging + Gandi DNS-01 with vault.gandi.pat. M4a's deferral (version skew + Hetzner-IP build) is closed; image pending registry push (make caddy-image-push needs docker login). The reverse_proxy Caddyfile is bind-mounted as a directory (./caddy → /etc/caddy) so atomic re-renders are visible in-container and caddy reload actually applies new routes (a single-file mount pinned the stale inode). |
| `roles/netbird_coordinator/`: NetBird control plane (ADR-016, M4b) | Built + applied (askari, 2026-06-16). boma's FIRST real service role. Self-hosted NetBird v0.72.4: a single combined netbird-server container (management + signal + relay + STUN + embedded Dex IdP at /oauth2) + dashboard:v2.39.0, on the shared boma network behind the M4a Caddy via gRPC-h2c + WebSocket + path routing (reverse_proxy__routes gained a raw-caddy route type). Secrets vault.netbird.{auth_secret,datastore_key} (self-generated). Carries the full service-role file set (SECURITY/VERIFY/ACCESS/BACKUP); first stateful role (backup__state: true; encrypted SQLite at /var/lib/netbird, off-site backup pending fisi/ADR-022). Verified live: dashboard 200 + valid LE cert, /api 401 (auth-gated, routes OK), STUN up. Not yet configured: first-boot /setup admin + peer enrolment = M5. |
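The ADR-019 row above describes two checks: tags must come from a closed vocabulary, and each role import in a play's roles: block must carry its own role-name tag. The sketch below illustrates that rule only; it is not the contents of scripts/check-tags.py. The vocabulary set, the `check_play` name, and the assumption that plays arrive as already-parsed dicts are all hypothetical.

```python
# Illustrative vocabulary only; the real one lives in tests/tags.yml.
ALLOWED_TAGS = {"base", "docker_host", "reverse_proxy", "hardening", "firewall"}


def check_play(play):
    """Return a list of tag violations for one already-parsed play (a dict)."""
    errors = []
    for entry in play.get("roles", []):
        # A bare string entry ("- base") has nowhere to carry tags.
        if isinstance(entry, str):
            name, tags = entry, []
        else:
            name = entry.get("role") or entry.get("name")
            tags = entry.get("tags", [])
        if isinstance(tags, str):  # Ansible accepts a single tag as a string
            tags = [tags]
        if name not in tags:
            errors.append(f"{name}: missing role-name tag")
        for tag in tags:
            if tag not in ALLOWED_TAGS:
                errors.append(f"{name}: tag '{tag}' not in vocabulary")
    return errors


# Hypothetical play: one compliant import, one mis-tagged, one untagged.
play = {
    "hosts": "docker_hosts",
    "roles": [
        {"role": "base", "tags": ["base", "hardening"]},
        {"role": "docker_host", "tags": "docker"},
        "reverse_proxy",
    ],
}
for problem in check_play(play):
    print(problem)
```

For the sample play this reports three findings: docker_host lacks its role-name tag and uses an out-of-vocabulary tag, and reverse_proxy carries no tags at all.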
## Scaffolded but empty (NOT implemented)
| Thing | State |
|---|---|
| `roles/base/` | Partially built. Concerns built: firewall (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020), hardening (M3: sshd drop-in key-only + PermitRootLogin no, fail2ban sshd jail 5/1h; ADR-002), and mesh (M5: opt-in base__mesh_enabled NetBird agent enrolment; ADR-016). The firewall and hardening concerns are both pytest/Molecule-tested. The hardening concern is applied to askari (make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening); the mesh concern is applied to askari + ubongo. The firewall concern is built but not yet applied to any host: it was mesh-gated to avoid lockout, and with M5 enrolment done its default-deny rollout is the deferred mesh-hardening step. Not built: auditd, packages, users (Phase 2 / TODO 15). |
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (hosts: {}); regenerated by make tf-inventory once Terraform has hosts |
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
(roles/docker_host/ is no longer scaffold-only — it installs the Docker engine + Compose
and is built + applied to askari; see "Real and working today". Its deferred scope —
daemon hardening + nftables.d container rules, ADR-004/ADR-020 — is still pending.)
A make deploy PLAYBOOK=site run now applies real content: base (its firewall and
hardening concerns) plus a functional docker_host (Docker engine) on docker hosts.
In practice it is still limited. The production cluster has no docker hosts yet, and
base's firewall concern stays unapplied until the deferred mesh-hardening step, so a
full cluster site run does not yet exist. (The make check/deploy machinery itself
works: it was first proven by applying dev_env via playbooks/workstation.yml, then
base/docker_host/reverse_proxy on askari.)
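The inventories/*/hosts.yml row above says the stubs are regenerated from Terraform outputs (scripts/tf_to_inventory.py, via make tf-inventory). As a rough illustration of that kind of mapping, here is a minimal stdlib-only sketch. It does not reproduce the real script: the output name `hosts`, the per-host `ip` and `groups` fields, and the `build_inventory` helper are assumptions made for the example, and serialising the result to YAML is left out.

```python
import json


def build_inventory(tf_output, output_name="hosts"):
    """Map parsed `terraform output -json` data to an Ansible inventory dict."""
    # Terraform wraps each output in an object whose "value" key holds the data.
    hosts = tf_output.get(output_name, {}).get("value", {})
    children = {}
    for name in sorted(hosts):
        attrs = hosts[name]
        # Hypothetical schema: each host lists the inventory groups it joins.
        for group in attrs.get("groups", ["ungrouped"]):
            group_hosts = children.setdefault(group, {"hosts": {}})["hosts"]
            group_hosts[name] = {"ansible_host": attrs["ip"]}
    return {"all": {"children": children}}


# Hypothetical Terraform output with a single host on a documentation address.
raw = """
{"hosts": {"sensitive": false,
           "value": {"example-vm": {"ip": "192.0.2.10",
                                    "groups": ["docker_hosts"]}}}}
"""
inventory = build_inventory(json.loads(raw))
print(json.dumps(inventory, indent=2))
```

With no Terraform hosts the function returns an inventory with no groups, which corresponds to the empty stubs described above.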
## Designed but not built
| Thing | Designed in | Notes |
|---|---|---|
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
| Terraform actually provisioning (Proxmox) | ADR-006 / ADR-009 | `terraform init` has never been run: no .terraform.lock.hcl, no state, no real local.vms entries |
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented |
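| Level 2 / 3 testing (staging, askari smoke) | ADR-008 | Not built. Level 2 depends on real staging VMs, which don't exist yet; askari now exists (see "Real and working today"), but the Level 3 smoke tests against it are not built |
| Per-service roles | ADR-004 | Model defined; only netbird_coordinator (NetBird control plane) is built so far |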
| Forgejo Actions CI | ADR-003 / ADR-008 | Remote is live (pushed); Actions/act_runner pipeline not yet built |
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | gather_usage() stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | To be implemented in the base/docker_host roles (both partially built, without CIS content so far); brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/auditd/fail2ban alerting into the monitoring stack; not built |
| NetBird mesh: coordinator on askari | ADR-016 | BUILT + applied (M4b, 2026-06-16); moved up to "Real and working today" (roles/netbird_coordinator/). Self-hosted control plane on askari; replaces ADR-007 WireGuard. Mesh peer enrolment = M5 (next row). |
| NetBird agent enrollment in `base` | ADR-016 | BUILT + applied (M5, 2026-06-17). The base mesh concern (opt-in base__mesh_enabled) installs the pinned NetBird agent + runs netbird up with the reusable scoped key from vault.netbird.setup_key. Applied to askari (100.99.226.39) + ubongo (100.99.146.14); both Management+Signal Connected; ubongo↔askari mesh ping verified. Enrollment is additive: the "SSH only on wt0" firewall lockdown is the deferred mesh-hardening follow-on, NOT applied. Road-warrior clients (laptops) are operator-enrolled. |
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | Design RESOLVED (ADR-017 + spec + plan); resolves ADR-015 deferred #2. /verify-service skill + VERIFY.md template + standards are authorable and present. Build pending: running needs ubongo + playwright plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | Design RESOLVED (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. Build pending: Alloy in base, loki/grafana service roles, OPNsense syslog — none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
| Operational-access doctrine (ADR-021) | ADR-021 | Design RESOLVED (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, access__* model, ACCESS.md record, /check-access. Reconciles ADR-016/020 SSH. |
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | Built (dormant). base__firewall_control_addr knob + nftables rule + Molecule assertion landed; empty default = no rule until ubongo's LAN address is set in group_vars. |
| `/check-access` verifier | ADR-021 | Design RESOLVED (.claude/commands/check-access.md authored). Build pending: running needs ubongo + live/staging hosts + vault. Access analogue of /verify-service (ADR-017). |
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
| Backup: `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (fisi), restic repo, rclone→pCloud, USB air-gap; Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; first declared by netbird_coordinator (backup__state: true), but inert until the backup role exists to act on it. |
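The ssh-from-control row above is dormant by default: an empty base__firewall_control_addr produces no rule. The sketch below shows that "empty knob means no rule" behaviour in isolation. The real rule comes from the role's nftables template, which is not shown in this file; the function name and the exact rule text are assumptions for illustration.

```python
import ipaddress


def ssh_from_control_rule(control_addr=""):
    """Return an nftables rule line, or None when the knob is left empty."""
    addr_text = control_addr.strip()
    if not addr_text:
        return None  # empty default: render no rule at all
    addr = ipaddress.ip_address(addr_text)  # raises ValueError on a malformed address
    family = "ip6" if addr.version == 6 else "ip"
    # Assumed rule shape: allow SSH only from the control node's address.
    return f"{family} saddr {addr} tcp dport 22 accept"


print(ssh_from_control_rule())                # None
print(ssh_from_control_rule("10.20.10.151"))  # ip saddr 10.20.10.151 tcp dport 22 accept
```

Validating the address before rendering means a typo in group_vars fails loudly instead of producing a rule that silently never matches.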
## Keeping this honest
Update this file whenever you build, stub, or remove something. It is the first place an AI tool or new contributor should look to learn what they can actually rely on. When a row moves from "designed" to "working", move it up — don't leave stale optimism here.