boma/STATUS.md

Project status — what's real vs planned

This repo is partly aspirational: the ADRs in docs/decisions/ describe the intended design, and some of it is not built yet. This file is the ground truth. Before relying on a role, provider, or pipeline existing, check here. If something is listed as "designed, not built", do not assume it works.

Last reviewed: 2026-06-14.

Real and working today

| Thing | State |
| --- | --- |
| playbooks/bootstrap.yml | Works: self-contained (installs Python, creates the ansible user + sudoers) |
| scripts/tf_to_inventory.py | Works: stdlib only; terraform output -json → hosts.yml |
| .docker/molecule-debian13/Dockerfile | Present: custom Molecule test image (ADR-008) |
| docs/decisions/*, docs/runbooks/* | Current and mutually reconciled |
| Makefile, lint config (.ansible-lint, .yamllint), .gitignore | Present and used |
| git | Initialized, trunk-based on main, pushed to origin (forgejo.nyumbani.baobab.band:7577). |
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with pre-commit install after make setup. |
| Vault password client | scripts/vault-pass-client.sh fetches the master password from Vaultwarden via rbw (wired as vault_password_file). Requires rbw installed + rbw unlock. |
| /review-repo | Repo audit: scripts/repo-scan.py (Phase 0) + .claude/commands/review-repo.md, reports to docs/reviews/. On-demand only; cron + email deferred (docs/TODO.md). |
| Terraform HCL (terraform/) | Written (proxmox VM module + envs), but the Proxmox side has never been run; see "Designed but not built". The offsite env is written and applied (see the askari row below). |
| docs/hardware/reference.md + scripts/capacity-scan.py | Present: reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| /capacity-review | Works: on-demand capacity evaluation → docs/hardware/reviews/. Intent-based (no live usage yet) |
| ADR-002 security strategy + docs/security/{accepted-risks,service-checklist}.md | Present: threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
| Service-role standard + per-service SECURITY.md convention | Defined (ADR-004 + docs/security/service-security-template.md); not yet applied, because no service roles exist |
| Tag standard + enforcement (ADR-019) | Works: tests/tags.yml (closed vocabulary) + scripts/check-tags.py (run by make lint, unit-tested) enforce the tag vocabulary and that each role import in a play's roles: block carries its role-name tag. Governs mostly-unbuilt roles, but the linter is live now. Proxmox VM tag convention (<env>, group, managed-by=terraform) is in the Terraform HCL but unprovisioned. |
| roles/dev_env/ (interactive developer environment) | Built + applied. zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. Applied to ubongo for users sjat + claude (verified: zsh login shells, stow-symlinked .zshrc/.tmux.conf + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via playbooks/workstation.yml against the control group (no dedicated workstations group yet). |
| make check / make deploy PLAYBOOK=<name> | Works. First end-to-end run (applying dev_env) surfaced + fixed latent bugs: a Makefile PLAYBOOK var collision (binary path vs playbook-name arg) meant the targets never ran; ansible.cfg referenced uninstalled community.general callbacks (now built-in default + ansible.posix.profile_tasks); acl package added so Ansible can become_user an unprivileged user. The make targets now function, though site/base/docker_host content is still incomplete (see below). |
| roles/public_dns/ + playbooks/dns.yml | Built + applied. Manages wingu.me at Gandi LiveDNS as code (community.general.gandi_livedns, PAT from vault.gandi.pat); record data, anti-spoof baseline (SPF -all + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (tests/test_public_dns.py). Applied to wingu.me (2026-06-14): purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects 0 .); the MX is removed instead, so no MX + no apex A = no mail. M1 of the roadmap. |
| ubongo: physical control / AI-worker host (ADR-015) | Built (partial). Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption, an accepted risk). Full toolchain installed + pinned to fisi (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via make setup/make collections). Repo cloned under a dedicated claude user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory control group at 10.20.10.151. dev_env now applied here (zsh/tmux/nvim for sjat + claude, via playbooks/workstation.yml). Managed as the operator account sjat (group_vars/control sets ansible_user: sjat), not the ansible service user group_vars/all assumes; ubongo has no bootstrapped ansible user. Pending: NetBird mesh enrollment (so SSH is LAN-only for now); full base hardening (only the firewall concern exists, and it is NOT applied here: applying default-deny with no mesh would lock out inbound SSH on the physical NIC); proper ansible-user bootstrap (currently managed as sjat); OPNsense DHCP reservation for 10.20.10.151 (MAC 88:a4:c2:e0:ee:da); Terraform state backup (now relevant: the offsite tfstate exists). |
| askari: off-site Hetzner VPS (ADR-007/016, M2) | Built + applied. Provisioned by Terraform (environments/offsite, hetznercloud/hcloud) as cx23 / hel1 / Debian 13.5 (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the ansible user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (91.226.145.80). Reachable from ubongo (ansible offsite_hosts -m ping ✓), in the offsite_hosts inventory (generated offsite.yml), published at askari.wingu.me → 77.42.120.136. SSH-hardened + fail2ban (M3). Docker + Caddy reverse proxy (M4a): docker_host + reverse_proxy (vanilla Caddy, HTTP-01) applied; https://test.askari.wingu.me serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). Pending: NetBird coordinator (M4b), host firewall + mesh enrollment (M5), offsite tfstate backup (ADR-022). |
| roles/docker_host/ (Docker engine) + roles/reverse_proxy/ (Caddy, ADR-024) | Built + applied (askari, M4a). docker_host installs Docker CE + compose; reverse_proxy is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from reverse_proxy__routes). DNS-01 for cluster mesh/LAN-only services is deferred to Phase 2 (caddy-dns/gandi unresolved; see FRICTION). |
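
The ADR-019 rule in the table above (a closed tag vocabulary, plus every role import in a play's roles: block carrying its own role-name tag) can be sketched roughly as follows. This is a minimal illustration and not the actual scripts/check-tags.py: the function name, the sample vocabulary, and the assumption that plays arrive already parsed into Python dicts are all hypothetical.

```python
"""Minimal sketch of a role-tag check; NOT the real scripts/check-tags.py."""

# Hypothetical closed vocabulary; the real one lives in tests/tags.yml.
ALLOWED_TAGS = {"base", "docker_host", "reverse_proxy", "hardening", "firewall"}


def check_play(play, allowed=ALLOWED_TAGS):
    """Return a list of human-readable problems for one parsed play dict."""
    problems = []
    for entry in play.get("roles", []):
        # A roles: entry is either a bare name or a dict with role/tags keys.
        if isinstance(entry, str):
            name, tags = entry, []
        else:
            name = entry.get("role") or entry.get("name", "")
            tags = entry.get("tags", [])
        if isinstance(tags, str):  # a single tag may be written as a string
            tags = [tags]
        if name not in tags:
            problems.append(f"role '{name}' is missing its role-name tag")
        for tag in tags:
            if tag not in allowed:
                problems.append(f"role '{name}' uses unknown tag '{tag}'")
    return problems


if __name__ == "__main__":
    play = {
        "hosts": "all",
        "roles": [
            {"role": "base", "tags": ["base", "hardening"]},
            {"role": "docker_host", "tags": ["docker"]},
        ],
    }
    for problem in check_play(play):
        print(problem)
```

In this sample the base entry passes, while docker_host is flagged twice: once for lacking its role-name tag and once for using a tag outside the vocabulary.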

Scaffolded but empty — NOT implemented

| Thing | State |
| --- | --- |
| roles/base/ | Partially built. Concerns built: firewall (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and hardening (M3: sshd drop-in key-only + PermitRootLogin no, fail2ban sshd jail 5/1h; ADR-002), both pytest/Molecule-tested. The hardening concern is applied to askari (make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening). The firewall concern is built but not yet applied to any host (mesh-gated until M5 to avoid lockout). Not built: auditd, packages, users (Phase 2 / TODO 15). |
| inventories/*/hosts.yml | Structured stubs with empty host maps (hosts: {}); regenerated by make tf-inventory once Terraform has hosts |
| inventories/production/group_vars/{docker_hosts,proxmox_hosts}/ | Empty dirs |
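
As a rough illustration of how a stdlib-only generator could turn terraform output -json into a hosts.yml, here is a minimal sketch. It is not the real scripts/tf_to_inventory.py: the output name hosts, its per-host shape (ip, group), the emitted layout, and both function names are assumptions made for the example. The only format relied on is Terraform's documented JSON envelope, where each output is an object with a value key.

```python
"""Sketch of a stdlib-only `terraform output -json` -> inventory conversion.

NOT the real scripts/tf_to_inventory.py. The `hosts` output name and its
shape (hostname -> {"ip": ..., "group": ...}) are assumptions.
"""
import json


def build_inventory(tf_output):
    """Group hosts from a parsed `terraform output -json` document."""
    hosts = tf_output.get("hosts", {}).get("value", {})
    groups = {}
    for name, attrs in sorted(hosts.items()):
        group = groups.setdefault(attrs["group"], {})
        group[name] = {"ansible_host": attrs["ip"]}
    return groups


def to_yaml(groups):
    """Emit a small Ansible-style YAML inventory without PyYAML."""
    if not groups:
        return "all:\n  children: {}\n"  # empty stub, nothing provisioned yet
    lines = ["all:", "  children:"]
    for group, members in sorted(groups.items()):
        lines.append(f"    {group}:")
        lines.append("      hosts:")
        for host, hostvars in members.items():
            lines.append(f"        {host}:")
            for key, value in hostvars.items():
                # json.dumps yields a double-quoted scalar, which is valid YAML.
                lines.append(f"          {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        "hosts": {
            "sensitive": False,
            "value": {"web1": {"ip": "192.0.2.10", "group": "docker_hosts"}},
        }
    }
    print(to_yaml(build_inventory(sample)), end="")
```

Because the emitter handles the empty case explicitly, running it before Terraform has any hosts still produces a well-formed stub instead of a broken file.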

(roles/docker_host/ is no longer scaffold-only — it installs the Docker engine + Compose and is built + applied to askari; see "Real and working today". Its deferred scope — daemon hardening + nftables.d container rules, ADR-004/ADR-020 — is still pending.)

A make deploy PLAYBOOK=site run now applies real content — base (its firewall + hardening concerns) plus a functional docker_host (Docker engine) on docker hosts — but in practice it is still limited: the production cluster has no docker hosts yet, and base's firewall concern is mesh-gated until M5, so a full cluster site run does not yet exist. (The make check/deploy machinery itself works — first proven by applying dev_env via playbooks/workstation.yml, then base/docker_host/reverse_proxy on askari.)

Designed but not built

| Thing | Designed in | Notes |
| --- | --- | --- |
| dns role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
| Terraform actually provisioning (Proxmox) | ADR-006 / ADR-009 | terraform init has never been run: no .terraform.lock.hcl, no state, no real local.vms entries |
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented. Remote is live (pushed); Actions/act_runner pipeline not yet built |
| Level 2 / 3 testing (staging, askari smoke) | ADR-008 | Depends on real VMs, which don't exist yet (askari itself is now provisioned; see "Real and working today") |
| Per-service roles | ADR-004 | Model defined; no service roles built |
| Live usage stats for /capacity-review | ADR-012 / TODO 8.4 | gather_usage() stubbed; source undecided (Proxmox RRD vs PLG stack); needs the cluster |
| /security-review skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Planned for the base/docker_host roles, which are only partially built; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/auditd/fail2ban alerting into the monitoring stack; not built |
| NetBird mesh: coordinator on askari | ADR-016 | Design RESOLVED (ADR-016 + spec + plan); resolves ADR-015 deferred #1. Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Build pending: not deployed (M4b; askari is provisioned, but the service-role machinery is not built). |
| NetBird agent enrollment in base | ADR-016 | Design RESOLVED (ADR-016). Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on wt0. Build pending: base is only partially built and mesh enrollment (M5) is not part of it yet. |
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | Design RESOLVED (ADR-017 + spec + plan); resolves ADR-015 deferred #2. /verify-service skill + VERIFY.md template + standards are authorable and present. Build pending: running needs ubongo + playwright plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | Design RESOLVED (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. Build pending: Alloy in base, loki/grafana service roles, OPNsense syslog; none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
| Operational-access doctrine (ADR-021) | ADR-021 | Design RESOLVED (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, access__* model, ACCESS.md record, /check-access. Reconciles ADR-016/020 SSH. |
| ssh-from-control firewall source | ADR-021 / ADR-020 | Built (dormant). base__firewall_control_addr knob + nftables rule + Molecule assertion landed; empty default = no rule until ubongo's LAN address is set in group_vars. |
| /check-access verifier | ADR-021 | Design RESOLVED (.claude/commands/check-access.md authored). Build pending: running needs ubongo + live/staging hosts + vault. Access analogue of /verify-service (ADR-017). |
| Per-service ACCESS.md records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
| Backup: backup role + backup_hosts group | ADR-022 | Does not exist. Pull node (fisi), restic repo, rclone→pCloud, USB air-gap (Plan 2). |
| Per-service backup__* contract + BACKUP.md | ADR-022 | Convention defined; inert until service roles exist to declare against. |
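
The "empty default = no rule" behaviour of the dormant ssh-from-control knob in the table above can be illustrated with a small sketch. Only the variable name base__firewall_control_addr comes from this file; the function name and the exact rule text are illustrative assumptions, not the role's actual nftables template.

```python
"""Sketch of the "empty default = no rule" behaviour of a firewall knob.

The rendered rule text and the function name are illustrative assumptions,
not the base role's actual nftables template.
"""
import ipaddress


def ssh_from_control_rule(control_addr=""):
    """Return an nftables rule line, or None when the knob is left empty."""
    if not control_addr:
        return None  # dormant: nothing is added to the ruleset
    # Validate early so a typo fails at render time, not at firewall load.
    addr = ipaddress.ip_address(control_addr)
    family = "ip" if addr.version == 4 else "ip6"
    return f"{family} saddr {addr} tcp dport 22 accept"


if __name__ == "__main__":
    print(ssh_from_control_rule())                # knob unset
    print(ssh_from_control_rule("10.20.10.151"))  # knob set to a LAN address
```

Returning None for the empty default keeps the dormant state explicit: a caller has to opt in by setting an address before any allow rule exists.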

Keeping this honest

Update this file whenever you build, stub, or remove something. It is the first place an AI tool or new contributor should look to learn what they can actually rely on. When a row moves from "designed" to "working", move it up — don't leave stale optimism here.