# Project status — what's real vs planned
This repo is partly aspirational: the ADRs in `docs/decisions/` describe the
*intended* design, and some of it is **not built yet**. This file is the ground
truth. **Before relying on a role, provider, or pipeline existing, check here.**
If something is listed as "designed, not built", do not assume it works.

_Last reviewed: 2026-06-11._
## Real and working today
| Thing | State |
|---|---|
| `playbooks/bootstrap.yml` | Works — self-contained (installs Python, creates the `ansible` user + sudoers) |
| `scripts/tf_to_inventory.py` | Works — stdlib only; `terraform output -json` → `hosts.yml` |
| `.docker/molecule-debian13/Dockerfile` | Present — custom Molecule test image (ADR-008) |
| `docs/decisions/*`, `docs/runbooks/*` | Current and mutually reconciled |
| `Makefile`, lint config (`.ansible-lint`, `.yamllint`), `.gitignore` | Present and used |
| `git` | Initialized, trunk-based on `main`, pushed to `origin` (`forgejo.nyumbani.baobab.band:7577`). |
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. |
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
| Service-role standard + per-service `SECURITY.md` convention | Defined (ADR-004 + `docs/security/service-security-template.md`); not yet applied — no service roles exist |
| Tag standard + enforcement (ADR-019) | Works — `tests/tags.yml` (closed vocabulary) + `scripts/check-tags.py` (run by `make lint`, unit-tested): enforces the tag vocabulary and that each role import in a play's `roles:` block carries its role-name tag. Governs mostly-unbuilt roles, but the linter is live now. Proxmox VM tag convention (`<env>`, group, `managed-by=terraform`) is in the Terraform HCL but unprovisioned. |
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **Pending:** NetBird mesh enrollment (until then, SSH is reachable over the LAN only); full `base` hardening (only the `firewall` concern exists, and it is NOT applied here — applying default-deny with no mesh would lock out inbound SSH on the physical NIC); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (no TF state yet). |
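
The tag enforcement row above (ADR-019) names two rules: every tag must come from a closed vocabulary, and each role import in a play's `roles:` block must carry its role-name tag. The following is a minimal sketch of those two rules only, not the contents of `scripts/check-tags.py`. It assumes the play has already been parsed into a Python dict; `check_play` and the sample vocabulary are hypothetical names chosen for illustration.

```python
from typing import Any


def _as_list(value: Any) -> list[str]:
    """Normalise an Ansible `tags:` value (missing, string, or list) to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def check_play(play: dict[str, Any], vocabulary: set[str]) -> list[str]:
    """Return one error string per violation found in the play's `roles:` block."""
    errors: list[str] = []
    for entry in play.get("roles", []):
        # A role import is either a bare name or a mapping with `role:` and `tags:`.
        if isinstance(entry, str):
            name, tags = entry, []
        else:
            name = entry.get("role", "")
            tags = _as_list(entry.get("tags"))
        # Rule 1: closed vocabulary.
        for tag in tags:
            if tag not in vocabulary:
                errors.append(f"{name}: tag '{tag}' is not in the vocabulary")
        # Rule 2: the role-name tag must be present.
        if name not in tags:
            errors.append(f"{name}: missing role-name tag '{name}'")
    return errors


if __name__ == "__main__":
    # Hypothetical play and vocabulary, for illustration only.
    play = {
        "hosts": "all",
        "roles": [
            {"role": "base", "tags": ["base", "firewall"]},
            {"role": "docker_host", "tags": ["docker"]},
        ],
    }
    for problem in check_play(play, {"base", "firewall", "docker_host"}):
        print(problem)
```

Returning a list of messages rather than raising on the first failure lets a lint run report every violation in one pass.
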
## Scaffolded but empty — NOT implemented
| Thing | State |
|---|---|
| `roles/base/` | **Partially built.** The `firewall` concern is implemented (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) with pytest + Molecule render/syntax tests. Other concerns (SSH hardening, fail2ban, auditd, packages, users) are **not** built yet, so `make deploy PLAYBOOK=site` is still incomplete. |
| `roles/docker_host/` | Not in git; the role does not exist yet. |
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |

So `make deploy PLAYBOOK=site` is still incomplete: only the `firewall` concern of
`base` is built, and the `docker_host` role does not exist yet.
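
The inventory stubs are meant to be regenerated from `terraform output -json` once Terraform has hosts. As a rough illustration of that conversion using only the standard library, here is a sketch; it is not the code in `scripts/tf_to_inventory.py`. The output name `vms` and the per-host `ip` and `groups` fields are assumptions made for the example, since the real output shape is defined by the Terraform HCL.

```python
import json


def build_inventory(tf_output: dict) -> dict:
    """Turn parsed `terraform output -json` data into an Ansible inventory dict."""
    # Terraform wraps each output as {"value": ..., "sensitive": ..., "type": ...}.
    # "vms" is a hypothetical output name used only in this sketch.
    vms = tf_output.get("vms", {}).get("value", {})
    inventory: dict = {"all": {"children": {}}}
    children = inventory["all"]["children"]
    for host, attrs in sorted(vms.items()):
        for group in attrs.get("groups", []):
            hosts = children.setdefault(group, {"hosts": {}})["hosts"]
            hosts[host] = {"ansible_host": attrs["ip"]}
    return inventory


def to_yaml(node: dict, indent: int = 0) -> str:
    """Minimal YAML emitter for nested dicts with scalar leaves (no PyYAML)."""
    pad = " " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict) and value:
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        elif isinstance(value, dict):
            # Empty maps are written in flow style, like the `hosts: {}` stubs.
            lines.append(f"{pad}{key}: {{}}")
        else:
            # JSON scalars (quoted strings, numbers, booleans) are valid YAML.
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative input; 192.0.2.10 is a documentation-only address.
    raw = (
        '{"vms": {"sensitive": false, "value": '
        '{"web1": {"ip": "192.0.2.10", "groups": ["docker_hosts"]}}}}'
    )
    print(to_yaml(build_inventory(json.loads(raw))))
```

With no Terraform outputs at all, the sketch emits an inventory with an empty `children: {}` map, which mirrors the empty-stub state described above.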
## Designed but not built
| Thing | Designed in | Notes |
|---|---|---|
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
| Terraform actually provisioning | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; remote is live (pushed), but the Actions/`act_runner` pipeline is not yet built |
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
| Per-service roles | ADR-004 | Model defined; no service roles built |
| Live usage stats for `/capacity-review` | ADR-012 / TODO 8.4 | `gather_usage()` stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | To be implemented by the `base`/`docker_host` roles (`base` is only partially built; `docker_host` does not exist); brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
| NetBird mesh — coordinator on `askari` | ADR-016 | **Design RESOLVED** (ADR-016 + spec + plan); resolves ADR-015 deferred #1. Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. **Build pending:** not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in `base` | ADR-016 | **Design RESOLVED** (ADR-016). Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. **Build pending:** enrollment is not yet part of `base` (only the `firewall` concern is built). |
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
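
The `ssh-from-control` row above describes a dormant knob: an empty `base__firewall_control_addr` produces no rule, and a rule appears only once an address is set. The following sketch shows that behaviour in isolation. It is not the role's actual template; the function name, the default port, and the exact rule text are assumptions for illustration.

```python
import ipaddress


def render_ssh_from_control(control_addr: str, ssh_port: int = 22) -> list[str]:
    """Return nftables rule lines for SSH from the control host, or none if unset."""
    addr_text = control_addr.strip()
    if not addr_text:
        # Empty default: the knob is dormant and no rule is rendered.
        return []
    # Raises ValueError on a malformed address, so a typo fails early.
    addr = ipaddress.ip_address(addr_text)
    family = "ip6" if addr.version == 6 else "ip"
    return [
        f'{family} saddr {addr} tcp dport {ssh_port} accept comment "ssh-from-control"'
    ]


if __name__ == "__main__":
    print(render_ssh_from_control(""))              # dormant: no rule lines
    print(render_ssh_from_control("10.20.10.151"))  # one accept rule for the control host
```

Validating the address before rendering means a mistyped value fails at render time instead of reaching the loaded ruleset.
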
## Keeping this honest
Update this file whenever you build, stub, or remove something. It is the first
place an AI tool or new contributor should look to learn what they can actually
rely on. When a row moves from "designed" to "working", move it up — don't leave
stale optimism here.
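
For tools that want to check this file before relying on something, one possible approach is to index the first column of each table by its `##` section. The sketch below is not an existing script in this repo. The names `index_status` and `where_is` are hypothetical, and it assumes rows never contain a literal `|` inside the first cell.

```python
def index_status(markdown: str) -> dict[str, list[str]]:
    """Map each `## ` section title to the first-column entries of its tables."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("|") and current is not None:
            first = line.strip("|").split("|")[0].strip()
            # Skip the header row and the |---|---| separator row.
            if first == "Thing" or set(first) <= set("-: "):
                continue
            sections[current].append(first)
    return sections


def where_is(markdown: str, needle: str) -> list[tuple[str, str]]:
    """Return (section, thing) pairs whose first cell mentions `needle`."""
    return [
        (section, thing)
        for section, things in index_status(markdown).items()
        for thing in things
        if needle in thing
    ]


if __name__ == "__main__":
    with open("STATUS.md", encoding="utf-8") as handle:
        for section, thing in where_is(handle.read(), "dns"):
            print(f"{thing}: {section}")
```

A lookup that lands in any section other than "Real and working today" should be treated as "do not rely on this yet".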