docs: reconcile 2026-06-14 review findings (O1-O7,O15,O18,O22)

- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file
  table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope
  ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)

O9 (hosts.yml header) left open: the file is generator-owned (hook-protected);
fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-14 19:06:33 +02:00
parent cb8f924d4b
commit 175777e36a
14 changed files with 87 additions and 53 deletions


@ -38,14 +38,20 @@ _Last reviewed: 2026-06-14._
| Thing | State |
|---|---|
| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is built but **not yet applied** to any host (mesh-gated to avoid lockout — M5). Not built: auditd, packages, users (Phase 2 / TODO 15). |
| `roles/docker_host/` | **Scaffolded, no tasks.** In git (meta/README/molecule filled), wired into `playbooks/site.yml` so the standard state is expressed end-to-end and `make lint` covers it, but it has no tasks yet — applying it is a no-op. Planned scope (Docker engine + Compose, daemon hardening, `nftables.d` container rules) in ADR-004/ADR-020. |
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
So `make deploy PLAYBOOK=site` has no real content to apply — `base` is only partially
built (its `firewall` concern only) and the `docker_host` role is scaffolded but has no
tasks yet. (The `make check`/`deploy` machinery itself now works — first proven by
applying `dev_env` via `playbooks/workstation.yml`.)
(`roles/docker_host/` is no longer scaffold-only — it installs the Docker engine + Compose
and is built + applied to askari; see "Real and working today". Its deferred scope —
daemon hardening + `nftables.d` container rules, ADR-004/ADR-020 — is still pending.)
A `make deploy PLAYBOOK=site` run now applies real content — `base` (its `firewall` +
`hardening` concerns) plus a functional `docker_host` (Docker engine) on docker hosts —
but in practice it is still limited: the production cluster has no docker hosts yet, and
`base`'s `firewall` concern is mesh-gated until M5, so a full cluster `site` run does not
yet exist. (The `make check`/`deploy` machinery itself works — first proven by applying
`dev_env` via `playbooks/workstation.yml`, then `base`/`docker_host`/`reverse_proxy` on
askari.)
## Designed but not built


@ -148,8 +148,11 @@ AI/LLM, a game server (Minecraft), generic static-site hosting. Plausible someda
none are committed.
**Confirmed exclusions (V4 had them; boma deliberately does not).** V4 mixed in a lot
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, nvim/kitty/tmux,
LibreOffice, antivirus, remote desktop. boma is **server-only**, so these are correctly
absent. Likewise the removed Knowledge domain (Discourse, Snipe-IT, MRBS booking) and
V4-specific project websites — out of boma's scope by design. The narrower surface is
intentional, not an oversight.
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, LibreOffice,
antivirus, remote desktop. boma's **managed cluster/server hosts** stay server-only, so
these are correctly absent. (One scoped exception: the control / AI-worker host `ubongo`
runs an interactive `dev_env` — zsh/tmux/neovim — per ADR-015; that is the developer
environment of an infrastructure worker host, not a personal desktop, and does not apply
to managed service hosts.) Likewise the removed Knowledge domain (Discourse, Snipe-IT,
MRBS booking) and V4-specific project websites — out of boma's scope by design. The
narrower surface is intentional, not an oversight.


@ -122,7 +122,7 @@
retro consumes them.
12. **Spin-up / build order** — what is the right order of operations when spinning up
from scratch (OS, DNS, Authentik, Traefik, …)?
from scratch (OS, DNS, Authentik, Caddy, …)?
13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the README files, but is that enough? Also, how do we re-challenge decisions and how they interact over time? E.g. we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to one of these services has emerged, and it is actually better.


@ -79,7 +79,8 @@ time. Each heading tags the threat(s) it primarily serves.
### Updates — *opportunistic*
- `unattended-upgrades` enabled for **security patches only**
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
- Full system upgrades triggered deliberately via Ansible (planned — a dedicated upgrade
playbook per ADR-011; not yet built, no `upgrade.yml` exists today)
- No automatic reboots — reboots are a conscious operational decision
### Minimal attack surface — *opportunistic, blast radius*


@ -47,6 +47,8 @@ below). Each service role contains a standard set of files:
| `README.md` | Purpose, variables, usage (role convention) |
| `SECURITY.md` | Per-service security record — see ADR-002 and `docs/security/service-security-template.md` |
| `VERIFY.md` | Per-service UI acceptance spec — see ADR-008 Level 4 / ADR-017 and `docs/testing/service-verify-template.md` |
| `ACCESS.md` | Per-service operational-access record — see ADR-021 and `docs/access/service-access-template.md` |
| `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
| `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |
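
The file convention in the table above could be checked mechanically. The sketch below is a hypothetical helper, not something the repo is known to ship: `REQUIRED_ROLE_FILES` and `missing_role_files` are illustrative names, and the list covers only the rows visible in this table.

```python
from pathlib import Path

# Entries taken from the table rows above. The constant and function names are
# hypothetical; the repo may enforce this convention differently (or not at all).
REQUIRED_ROLE_FILES = (
    "README.md",
    "SECURITY.md",
    "VERIFY.md",
    "ACCESS.md",
    "BACKUP.md",
    "meta/main.yml",
    "molecule/default",
)


def missing_role_files(role_dir: Path) -> list[str]:
    """Return the required entries that do not exist under role_dir, in table order."""
    return [name for name in REQUIRED_ROLE_FILES if not (role_dir / name).exists()]
```

Calling `missing_role_files(Path("roles/<service>"))` returns an empty list when the role carries every listed file, which makes it easy to wrap in a pytest assertion.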
### Standard deploy mechanics
@ -102,7 +104,9 @@ Managed by the `docker_host` role. Key settings:
- Bind mounts preferred over named volumes for data that must be backed up
- All bind mount paths are under `/opt/services/<name>/data/`
- Backup strategy is defined separately (not in scope of this repo)
- Backup strategy is defined in **ADR-022** — the bind mounts under
`/opt/services/<name>/data/` are exactly the unit ADR-022's per-service `backup__*`
contract (and `BACKUP.md`) captures
## Decision
@ -128,5 +132,6 @@ Drawn from the trade-offs and deferred items this ADR already states:
- Bare `latest` is acceptable only on the stateless tier; the stateful tier is always
pinned `tag@digest`, and image updates are a deliberate operation (per Image management;
ADR-011).
- Backup strategy is stated as defined separately, not in scope of this ADR (per Persistent
data).
- Backup strategy is defined in ADR-022 (not in this ADR); the persistent bind mounts
under `/opt/services/<name>/data/` are the unit ADR-022's per-service `backup__*`
contract captures (per Persistent data).
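
The image-pinning rule above (bare `latest` only on the stateless tier, `tag@digest` on the stateful tier) can be expressed as a small check. This is a minimal sketch under stated assumptions: `is_pinned` and `image_allowed` are hypothetical names, the digest is assumed to be `sha256`, and the regular expression is a simplification of the full image-reference grammar.

```python
import re

# Simplified pattern for "<name>:<tag>@sha256:<64 hex chars>".
# Assumption: sha256 digests only; this is not the complete reference grammar.
_PINNED = re.compile(r"[^\s@]+:\w[\w.-]{0,127}@sha256:[0-9a-f]{64}")


def is_pinned(image: str) -> bool:
    """True when the reference carries both a tag and a sha256 digest."""
    return _PINNED.fullmatch(image) is not None


def image_allowed(image: str, stateful: bool) -> bool:
    """Apply the tier rule: stateful roles must be pinned, stateless roles may float."""
    if not stateful:
        return True
    return is_pinned(image)
```

A role's `<role>__stateful` flag would supply the `stateful` argument, so a stateful role with `postgres:latest` fails the check while a stateless role with the same reference passes.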


@ -164,15 +164,21 @@ IoT devices cannot initiate connections to `srv`.
### DNS zones and split-horizon
**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
**Internal zone**: `boma.baobab.band` **today** (the `dns` role is unbuilt) — served by
`dns1` and `dns2`. **Target:** it is renamed to `boma.wingu.me` in Phase 2 when the `dns`
role lands. Until then `boma.baobab.band` is the authoritative internal name **everywhere
it appears** (the naming table above, split-horizon below, the OPNsense forwarder, and
ADR-009/016). This is the single source for that transition; other references use the
current name and inherit this caveat.
The zone is rendered by the Ansible `dns` role: host A records come from the
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
and service/alias/split-horizon records are explicit zone data in `group_vars`.
Terraform itself writes no DNS records — see ADR-009.
**Public zone**: `wingu.me` — Gandi LiveDNS, **managed as code** by the `public_dns`
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal),
services `<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal — the
Phase-2 target; currently `boma.baobab.band`, see *Internal zone* above), services
`<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
`nyumbani` is retired. **Mesh/LAN-only by default**: home services have no public record
(reached over LAN or the NetBird mesh); only deliberate exceptions are published. The
project is `boma`; the domain is `wingu.me`. The legacy `baobab.band` zone (Cloudflare)


@ -67,7 +67,7 @@ configuration issues invisible to Ansible check mode.
A Claude-driven exploratory check of a service's **application UI**, run as
`/verify-service <name>` on `ubongo` (ADR-017). Claude drives Chromium via the
`playwright` plugin against a **staging** deploy, authenticates through the real
Traefik + Authentik SSO flow using a test user in the staging `test` group, then
Caddy (ADR-024) + Authentik SSO flow using a test user in the staging `test` group, then
executes the service's `roles/<service>/VERIFY.md` acceptance journeys *and*
free-explores — judging pass/fail, screenshotting key states. It writes a dated report
to `docs/testing/reviews/` and hands the operator a manual-test checklist for anything


@ -21,7 +21,7 @@ Each container role declares its class, e.g. `<role>__stateful: true|false` (def
`false`). The split is the load-bearing classification for the whole policy.
- **Stateless** — no durable data of its own; losing the container loses nothing.
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Caddy,
reverse proxies, FlareSolverr.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
@ -56,7 +56,7 @@ per host, in strict order with a verification gate between every phase:
5. **Verify** again; alert on failure.
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
**before** the rest follow — so a DNS/Caddy failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
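
The host-ordering rule can be illustrated as a sort: DNS hosts first, then the reverse proxy, then everything else. The function name and the `"dns"` / `"proxy"` role labels below are assumptions made for illustration; the actual update playbook may group hosts differently.

```python
def update_order(hosts: dict[str, str]) -> list[str]:
    """Order hosts for updates: DNS, then reverse proxy, then the rest.

    `hosts` maps a host name to a role label. The labels "dns" and "proxy"
    are hypothetical; any other label falls into the final group. Names are
    the tie-breaker so the order is deterministic.
    """
    priority = {"dns": 0, "proxy": 1}
    return sorted(hosts, key=lambda name: (priority.get(hosts[name], 2), name))
```

Each host in the returned list would be updated and verified before the next one starts, so the fleet is never rebooted at once.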
### 4. Snapshot-before is the rollback mechanism


@ -1,5 +1,11 @@
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
## Status
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## Context
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
@ -89,12 +95,6 @@ allocated for it.
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
## Status
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## What was ruled out
| Option | Reason |


@ -1,5 +1,11 @@
# ADR-017 — Service-UI acceptance verification (Level 4)
## Status
Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
## Context
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
@ -24,7 +30,7 @@ A Claude-driven exploratory service-UI verification harness — **Level 4** —
(incl. destructive flows) against a *staging* deploy; the rebuildable sandbox
resolves safety.
4. **Test users in Authentik (central IdP), real SSO flow** — authenticates through
Traefik + Authentik as a real user would.
Caddy (ADR-024) + Authentik as a real user would.
5. **Per-service `VERIFY.md` backbone + free exploration** — each service role ships an
acceptance spec of critical journeys; Claude executes it and explores beyond it.
@ -63,12 +69,6 @@ them.
- **No secrets leaked** — the git-ignored screenshot dir is the safety boundary;
avoid capturing credential screens.
## Status
Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
## Dependencies
- `ubongo` (ADR-015) — runs the browser. Designed, not built.
@ -85,7 +85,7 @@ template, the `/verify-service` skill, the convention/checklist/Further-reading
| Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
| Free-form, no per-service spec | Non-repeatable, can miss a critical flow; `VERIFY.md` gives a backbone. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Traefik+Authentik path; central test users are faithful. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Caddy+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |
See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),


@ -1,5 +1,12 @@
# ADR-018 — Logging and log integrity
## Status
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
@ -70,13 +77,6 @@ ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a moni
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
## Status
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +


@ -49,7 +49,7 @@ slice on its own, and (c) doesn't overlap confusingly with another.
| `monitoring` | metric exporters / health checks |
| `config` | render templated config/compose files to disk — **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
| `proxy` | reverse-proxy + TLS registration (Caddy routes, Authentik) |
The `config`/`deploy` split lets you re-render and diff configuration (`--tags
config`) without bouncing services, then restart deliberately (`--tags deploy`).
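
The two-step flow (render with `config`, then restart with `deploy`) might be scripted as below. This is a sketch only: it assumes the Makefile accepts `PLAYBOOK`, `LIMIT`, and `TAGS` for these tags the same way STATUS.md shows for `TAGS=hardening`, and the function names are hypothetical.

```python
def make_deploy(playbook: str, limit: str, tags: str) -> list[str]:
    """Build the argv for one `make deploy` run restricted to a tag slice."""
    return ["make", "deploy", f"PLAYBOOK={playbook}", f"LIMIT={limit}", f"TAGS={tags}"]


def two_step_rollout(playbook: str, limit: str) -> list[list[str]]:
    """Render configuration first, then restart services as a separate step."""
    return [
        make_deploy(playbook, limit, "config"),  # re-render and diff, no restart
        make_deploy(playbook, limit, "deploy"),  # deliberate restart
    ]
```

Each argv could be passed to `subprocess.run(..., check=True)`, with a manual review of the rendered diff between the two runs.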


@ -57,10 +57,14 @@ boma's reverse proxy is **Caddy**.
5. `forward_auth` to Authentik is a first-class Caddy directive — the planned
Authentik auth story (ADR-002) is preserved without Traefik as the middleman.
### 2. Custom image
### 2. Custom image (DNS-01 path only — Phase 2)
> Applies only to the **DNS-01** path, which is **deferred to Phase 2** (see the Status
> note). M4a ships **vanilla `caddy:2`** on askari (HTTP-01) — no custom image.
Caddy's official Docker image does not include third-party DNS plugins. The `caddy-dns/gandi`
plugin must be compiled in via `xcaddy`. boma builds a custom image:
plugin must be compiled in via `xcaddy`. When the cluster's mesh/LAN-only services need
DNS-01, boma builds a custom image:
```
FROM caddy:builder AS builder
@ -70,14 +74,16 @@ FROM caddy:latest
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
```
This image is maintained as a boma artifact (Forgejo registry, pinned digest in the
Compose template). It is the cost of the Gandi DNS-01 path — unavoidable regardless of
proxy choice.
That image would be maintained as a boma artifact (Forgejo registry, pinned digest in the
Compose template) — the cost of the Gandi DNS-01 path. (On askari this approach hit two
blockers, so DNS-01 is deferred; see the Status note.)
### 3. Deployment scope
The first Caddy instance fronts the NetBird stack on `askari` (M4). The pattern
generalises to the Proxmox cluster in Phase 2 when services multiply.
The first Caddy instance runs on `askari` (M4a), serving a test vhost over HTTP-01 to
prove the proxy + ACME path. It fronts the NetBird stack in **M4b** (when the
`netbird` coordinator role is built). The pattern generalises to the Proxmox cluster in
Phase 2 when services multiply.
### 4. Authentik integration (deferred)
@ -90,8 +96,9 @@ middleware migration is required.
- **Roadmap Phase-2 step 5** is updated from "Authentik + Traefik" to "Authentik +
Caddy (ADR-024)".
- **ADR-017 prose** that mentioned Traefik is updated to read "Caddy (ADR-024)".
- A custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must be built, pushed to the
Forgejo registry, and kept current (plugin + base image updates).
- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. **If/when**
the Phase-2 DNS-01 path lands, a custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must
be built, pushed to the Forgejo registry, and kept current (plugin + base image updates).
- Caddyfile config is rendered by Ansible from `group_vars` — consistent with ADR-004
and easier to review than distributed container labels.
- `forward_auth` to Authentik is available when Authentik is deployed; no extra


@ -118,8 +118,14 @@ Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
machine outside the cluster — not a Proxmox guest. It is the **one** host
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
> **Current state (STATUS.md):** `ubongo` is today managed as the operator account
> `sjat` (`group_vars/control` sets `ansible_user: sjat`); it has **no** dedicated
> `ansible` service user yet. The dedicated-`ansible`-user bootstrap (step 2) is a
> **pending** item. Steps below describe the intended end state.
1. Install Debian 13 on the physical box by hand (no template to clone).
2. Create the `ansible` user and install its SSH public key.
2. Create the `ansible` user and install its SSH public key. *(Pending for `ubongo`,
which is currently managed as `sjat`; see the note above.)*
3. Set up the Ansible environment on it:
```bash
git clone <repo> ~/ansible