Bare 'nft list ruleset' has no leading flush, so the timer's 'nft -f rollback'
was a no-op on first apply (empty file) and errored ('table exists') on later
applies — the auto-rollback silently did nothing, defeating the askari lockout
safeguard. Prepend 'flush ruleset' so the revert is atomic + self-contained.
Verified the snapshot->lockout->revert round-trip in an isolated netns.
Also fix stale STATUS prose (base is partially built, not absent).
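
The snapshot fix described above can be sketched roughly as follows. This is a minimal illustration, not the role's actual implementation: the function names `build_rollback_snapshot` and `snapshot_current_ruleset` are hypothetical, and only the `nft list ruleset` command and the `flush ruleset` statement come from the commit message.

```python
import subprocess

FLUSH = "flush ruleset"


def build_rollback_snapshot(ruleset_text: str) -> str:
    """Return rollback-file content: a leading flush, then the saved rules.

    With the flush first, loading the file via `nft -f` replaces whatever is
    live instead of colliding with tables that already exist.
    """
    body = ruleset_text.strip()
    # An empty ruleset still yields a non-empty file, so the revert flushes
    # the live rules rather than doing nothing.
    return f"{FLUSH}\n{body}\n" if body else f"{FLUSH}\n"


def snapshot_current_ruleset() -> str:
    """Capture the live ruleset (assumes nft is installed and we run as root)."""
    result = subprocess.run(
        ["nft", "list", "ruleset"],
        check=True,
        capture_output=True,
        text=True,
    )
    return build_rollback_snapshot(result.stdout)
```

Keeping the string handling separate from the `nft` call means the snapshot format can be checked without root or a network namespace.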
Project status — what's real vs planned
This repo is partly aspirational: the ADRs in docs/decisions/ describe the
intended design, and some of it is not built yet. This file is the ground
truth. Before relying on a role, provider, or pipeline existing, check here.
If something is listed as "designed, not built", do not assume it works.
Last reviewed: 2026-06-06.
Real and working today
| Thing | State |
|---|---|
| `playbooks/bootstrap.yml` | Works: self-contained (installs Python, creates the ansible user + sudoers) |
| `scripts/tf_to_inventory.py` | Works: stdlib only; `terraform output -json` → `hosts.yml` |
| `.docker/molecule-debian13/Dockerfile` | Present: custom Molecule test image (ADR-008) |
| `docs/decisions/*`, `docs/runbooks/*` | Current and mutually reconciled |
| `Makefile`, lint config (`.ansible-lint`, `.yamllint`), `.gitignore` | Present and used |
| git | Initialized, trunk-based on main, pushed to origin (forgejo.nyumbani.baobab.band:7577). |
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. |
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
| Terraform HCL (`terraform/`) | Written (Proxmox VM module + envs) but never run; see below |
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present: reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works: on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present: threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
| Service-role standard + per-service `SECURITY.md` convention | Defined (ADR-004 + `docs/security/service-security-template.md`); not yet applied, since no service roles exist |
| Tag standard + enforcement (ADR-019) | Works: `tests/tags.yml` (closed vocabulary) + `scripts/check-tags.py` (run by `make lint`, unit-tested) enforces the tag vocabulary and that each role import in a play's `roles:` block carries its role-name tag. Governs mostly-unbuilt roles, but the linter is live now. Proxmox VM tag convention (`<env>`, group, `managed-by=terraform`) is in the Terraform HCL but unprovisioned. |
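
The two rules in the tag-standard row (closed vocabulary, and each role import carrying its own role-name tag) can be pictured with a short sketch. This is an illustration under assumptions, not the contents of `scripts/check-tags.py`: the function name `check_play_tags`, the error wording, and the example vocabulary are hypothetical, and the play is assumed to be already parsed into Python dicts.

```python
def check_play_tags(play: dict, allowed: set[str]) -> list[str]:
    """Return human-readable violations for one parsed play."""
    errors = []
    for entry in play.get("roles", []):
        if isinstance(entry, str):
            # A bare role name carries no tags at all.
            role, tags = entry, []
        else:
            role = entry.get("role") or entry.get("name", "")
            tags = entry.get("tags", [])
            if isinstance(tags, str):
                # Ansible accepts a single tag as a plain string.
                tags = [tags]
        # Rule 1: the import must carry a tag equal to the role name.
        if role not in tags:
            errors.append(f"role '{role}' is missing its role-name tag")
        # Rule 2: every tag must come from the closed vocabulary.
        for tag in tags:
            if tag not in allowed:
                errors.append(f"role '{role}' uses unknown tag '{tag}'")
    return errors


# Hypothetical vocabulary and play, for illustration only.
vocabulary = {"base", "firewall", "docker_host"}
play = {
    "roles": [
        {"role": "base", "tags": ["base", "firewall"]},
        {"role": "docker_host", "tags": ["docker"]},
    ]
}
for problem in check_play_tags(play, vocabulary):
    print(problem)
```

Returning a list of messages rather than raising on the first violation lets a lint run report every problem in one pass.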
Scaffolded but empty — NOT implemented
| Thing | State |
|---|---|
| `roles/base/` | Partially built. The firewall concern is implemented (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) with pytest + Molecule render/syntax tests. Other concerns (SSH hardening, fail2ban, auditd, packages, users) are not built yet, so `make deploy PLAYBOOK=site` is still incomplete. |
| `roles/docker_host/` | Not in git; the role does not exist yet. |
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
So `make deploy PLAYBOOK=site` is still incomplete: base is partially built (firewall
concern only) and the docker_host role does not exist yet.
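
For orientation, the `terraform output -json` → `hosts.yml` step that will eventually fill the empty host maps can be sketched as below. This is a stdlib-only illustration under assumptions, not the code in `scripts/tf_to_inventory.py`: the function name `outputs_to_inventory`, the output name `vms`, and the per-host `ip` and `group` fields are all hypothetical. Only the `{"<name>": {"value": ...}}` envelope is standard `terraform output -json` behaviour.

```python
import json


def outputs_to_inventory(tf_output_json: str, output_name: str = "vms") -> dict:
    """Map a Terraform output (host name -> attributes) to an inventory dict."""
    outputs = json.loads(tf_output_json)
    # Each Terraform output is wrapped as {"value": ..., "type": ..., ...}.
    vms = outputs.get(output_name, {}).get("value", {})
    inventory = {"all": {"children": {}}}
    for hostname, attrs in sorted(vms.items()):
        # Assumed fields: "group" names the inventory group, "ip" the address.
        group = inventory["all"]["children"].setdefault(attrs["group"], {"hosts": {}})
        group["hosts"][hostname] = {"ansible_host": attrs["ip"]}
    return inventory


# Hypothetical Terraform output, for illustration only.
sample = json.dumps(
    {"vms": {"value": {"web1": {"ip": "10.0.0.5", "group": "docker_hosts"}}}}
)
print(json.dumps(outputs_to_inventory(sample), indent=2))
```

With no hosts in the Terraform output the function returns an inventory with no groups, which corresponds to the stub state described in the table above.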
Designed but not built
| Thing | Designed in | Notes |
|---|---|---|
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
| Terraform actually provisioning | ADR-006 / ADR-009 | `terraform init` has never been run: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; remote is live (pushed), but the Actions/act_runner pipeline is not yet built |
| Level 2 / 3 testing (staging, askari smoke) | ADR-008 | Depends on real VMs / askari, which don't exist yet |
| Per-service roles | ADR-004 | Model defined; no service roles built |
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | To be implemented by the base/docker_host roles (base has only its firewall concern so far; docker_host does not exist); brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/auditd/fail2ban alerting into the monitoring stack; not built |
| ubongo: physical control / AI-worker host | ADR-015 | Design RESOLVED (ADR-015 + spec + plan). Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Build pending: box not yet acquired/installed, not in inventory. |
| NetBird mesh: coordinator on askari | ADR-016 | Design RESOLVED (ADR-016 + spec + plan); resolves ADR-015 deferred #1. Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Build pending: not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in base | ADR-016 | Design RESOLVED (ADR-016). Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Build pending: base has only its firewall concern so far; enrollment is not built. |
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | Design RESOLVED (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. Build pending: running needs ubongo + playwright plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | Design RESOLVED (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. Build pending: Alloy in base, loki/grafana service roles, OPNsense syslog; none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
Keeping this honest
Update this file whenever you build, stub, or remove something. It is the first place an AI tool or new contributor should look to learn what they can actually rely on. When a row moves from "designed" to "working", move it up — don't leave stale optimism here.