docs: reconcile 2026-06-14 review findings (O1-O7,O18,O22)
- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)

O9 (hosts.yml header) left open: the file is generator-owned (hook-protected); fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent cb8f924d4b
commit 175777e36a
14 changed files with 87 additions and 53 deletions
16
STATUS.md
@@ -38,14 +38,20 @@ _Last reviewed: 2026-06-14._
 | Thing | State |
 |---|---|
 | `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is built but **not yet applied** to any host (mesh-gated to avoid lockout — M5). Not built: auditd, packages, users (Phase 2 / TODO 15). |
-| `roles/docker_host/` | **Scaffolded, no tasks.** In git (meta/README/molecule filled), wired into `playbooks/site.yml` so the standard state is expressed end-to-end and `make lint` covers it, but it has no tasks yet — applying it is a no-op. Planned scope (Docker engine + Compose, daemon hardening, `nftables.d` container rules) in ADR-004/ADR-020. |
 | `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
 | `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |

-So `make deploy PLAYBOOK=site` has no real content to apply — `base` is only partially
-built (its `firewall` concern only) and the `docker_host` role is scaffolded but has no
-tasks yet. (The `make check`/`deploy` machinery itself now works — first proven by
-applying `dev_env` via `playbooks/workstation.yml`.)
+(`roles/docker_host/` is no longer scaffold-only — it installs the Docker engine + Compose
+and is built + applied to askari; see "Real and working today". Its deferred scope —
+daemon hardening + `nftables.d` container rules, ADR-004/ADR-020 — is still pending.)
+
+A `make deploy PLAYBOOK=site` run now applies real content — `base` (its `firewall` +
+`hardening` concerns) plus a functional `docker_host` (Docker engine) on docker hosts —
+but in practice it is still limited: the production cluster has no docker hosts yet, and
+`base`'s `firewall` concern is mesh-gated until M5, so a full cluster `site` run does not
+yet exist. (The `make check`/`deploy` machinery itself works — first proven by applying
+`dev_env` via `playbooks/workstation.yml`, then `base`/`docker_host`/`reverse_proxy` on
+askari.)

 ## Designed but not built

@@ -148,8 +148,11 @@ AI/LLM, a game server (Minecraft), generic static-site hosting. Plausible someda
 none are committed.

 **Confirmed exclusions (V4 had them; boma deliberately does not).** V4 mixed in a lot
-of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, nvim/kitty/tmux,
-LibreOffice, antivirus, remote desktop. boma is **server-only**, so these are correctly
-absent. Likewise the removed Knowledge domain (Discourse, Snipe-IT, MRBS booking) and
-V4-specific project websites — out of boma's scope by design. The narrower surface is
-intentional, not an oversight.
+of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, LibreOffice,
+antivirus, remote desktop. boma's **managed cluster/server hosts** stay server-only, so
+these are correctly absent. (One scoped exception: the control / AI-worker host `ubongo`
+runs an interactive `dev_env` — zsh/tmux/neovim — per ADR-015; that is the developer
+environment of an infrastructure worker host, not a personal desktop, and does not apply
+to managed service hosts.) Likewise the removed Knowledge domain (Discourse, Snipe-IT,
+MRBS booking) and V4-specific project websites — out of boma's scope by design. The
+narrower surface is intentional, not an oversight.
@@ -122,7 +122,7 @@
 retro consumes them.

 12. **Spin-up / build order** — what is the right order of operations when spinning up
-from scratch (OS, DNS, Authentik, Traefik, …)?
+from scratch (OS, DNS, Authentik, Caddy, …)?

 13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the readme files but is that enough? Also, how do we rechallange desisions and how they interact over time. I.e. We have these two services running, but extending one a little bit could make the other redundant so we could remove it. Or an alternative to this services has emerged, and it is actually better.

@@ -79,7 +79,8 @@ time. Each heading tags the threat(s) it primarily serves.
 ### Updates — *opportunistic*

 - `unattended-upgrades` enabled for **security patches only**
-- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
+- Full system upgrades triggered deliberately via Ansible (planned — a dedicated upgrade
+playbook per ADR-011; not yet built, no `upgrade.yml` exists today)
 - No automatic reboots — reboots are a conscious operational decision

 ### Minimal attack surface — *opportunistic, blast radius*
@@ -47,6 +47,8 @@ below). Each service role contains a standard set of files:
 | `README.md` | Purpose, variables, usage (role convention) |
 | `SECURITY.md` | Per-service security record — see ADR-002 and `docs/security/service-security-template.md` |
 | `VERIFY.md` | Per-service UI acceptance spec — see ADR-008 Level 4 / ADR-017 and `docs/testing/service-verify-template.md` |
+| `ACCESS.md` | Per-service operational-access record — see ADR-021 and `docs/access/service-access-template.md` |
+| `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
 | `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |

 ### Standard deploy mechanics
@@ -102,7 +104,9 @@ Managed by the `docker_host` role. Key settings:

 - Bind mounts preferred over named volumes for data that must be backed up
 - All bind mount paths are under `/opt/services/<name>/data/`
-- Backup strategy is defined separately (not in scope of this repo)
+- Backup strategy is defined in **ADR-022** — the bind mounts under
+`/opt/services/<name>/data/` are exactly the unit ADR-022's per-service `backup__*`
+contract (and `BACKUP.md`) captures

 ## Decision

@@ -128,5 +132,6 @@ Drawn from the trade-offs and deferred items this ADR already states:
 - Bare `latest` is acceptable only on the stateless tier; the stateful tier is always
 pinned `tag@digest`, and image updates are a deliberate operation (per Image management;
 ADR-011).
-- Backup strategy is stated as defined separately, not in scope of this ADR (per Persistent
-data).
+- Backup strategy is defined in ADR-022 (not in this ADR); the persistent bind mounts
+under `/opt/services/<name>/data/` are the unit ADR-022's per-service `backup__*`
+contract captures (per Persistent data).
@@ -164,15 +164,21 @@ IoT devices cannot initiate connections to `srv`.

 ### DNS zones and split-horizon

-**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
+**Internal zone**: `boma.baobab.band` **today** (the `dns` role is unbuilt) — served by
+`dns1` and `dns2`. **Target:** it is renamed to `boma.wingu.me` in Phase 2 when the `dns`
+role lands. Until then `boma.baobab.band` is the authoritative internal name **everywhere
+it appears** (the naming table above, split-horizon below, the OPNsense forwarder, and
+ADR-009/016). This is the single source for that transition; other references use the
+current name and inherit this caveat.
 The zone is rendered by the Ansible `dns` role: host A records come from the
 inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
 and service/alias/split-horizon records are explicit zone data in `group_vars`.
 Terraform itself writes no DNS records — see ADR-009.

 **Public zone**: `wingu.me` — Gandi LiveDNS, **managed as code** by the `public_dns`
-role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal),
-services `<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
+role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal — the
+Phase-2 target; currently `boma.baobab.band`, see *Internal zone* above), services
+`<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
 `nyumbani` is retired. **Mesh/LAN-only by default**: home services have no public record
 (reached over LAN or the NetBird mesh); only deliberate exceptions are published. The
 project is `boma`; the domain is `wingu.me`. The legacy `baobab.band` zone (Cloudflare)
@@ -67,7 +67,7 @@ configuration issues invisible to Ansible check mode.
 A Claude-driven exploratory check of a service's **application UI**, run as
 `/verify-service <name>` on `ubongo` (ADR-017). Claude drives Chromium via the
 `playwright` plugin against a **staging** deploy, authenticates through the real
-Traefik + Authentik SSO flow using a test user in the staging `test` group, then
+Caddy (ADR-024) + Authentik SSO flow using a test user in the staging `test` group, then
 executes the service's `roles/<service>/VERIFY.md` acceptance journeys *and*
 free-explores — judging pass/fail, screenshotting key states. It writes a dated report
 to `docs/testing/reviews/` and hands the operator a manual-test checklist for anything
@@ -21,7 +21,7 @@ Each container role declares its class, e.g. `<role>__stateful: true|false` (def
 `false`). The split is the load-bearing classification for the whole policy.

 - **Stateless** — no durable data of its own; losing the container loses nothing.
-Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
+Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Caddy,
 reverse proxies, FlareSolverr.
 - **Stateful** — owns data, schema, or migrations: databases, and apps with their own
 store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
@@ -56,7 +56,7 @@ per host, in strict order with a verification gate between every phase:
 5. **Verify** again; alert on failure.

 **Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
-**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
+**before** the rest follow — so a DNS/Caddy failure doesn't make every host look
 broken at once and hide the real cause. Never reboot the whole fleet simultaneously.

 ### 4. Snapshot-before is the rollback mechanism
@@ -1,5 +1,11 @@
 # ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)

+## Status
+
+Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
+(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
+`base` exists.
+
 ## Context

 `ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
@@ -89,12 +95,6 @@ allocated for it.
 version-pinned (ADR-011). boma's `dns` role stays authoritative for
 `boma.baobab.band`; NetBird built-in DNS scoped/off.

-## Status
-
-Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
-(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
-`base` exists.
-
 ## What was ruled out

 | Option | Reason |
@@ -1,5 +1,11 @@
 # ADR-017 — Service-UI acceptance verification (Level 4)

+## Status
+
+Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
+template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
+`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
+
 ## Context

 ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
@@ -24,7 +30,7 @@ A Claude-driven exploratory service-UI verification harness — **Level 4** —
 (incl. destructive flows) against a *staging* deploy; the rebuildable sandbox
 resolves safety.
 4. **Test users in Authentik (central IdP), real SSO flow** — authenticates through
-Traefik + Authentik as a real user would.
+Caddy (ADR-024) + Authentik as a real user would.
 5. **Per-service `VERIFY.md` backbone + free exploration** — each service role ships an
 acceptance spec of critical journeys; Claude executes it and explores beyond it.

@@ -63,12 +69,6 @@ them.
 - **No secrets leaked** — the git-ignored screenshot dir is the safety boundary;
 avoid capturing credential screens.

-## Status
-
-Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
-template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
-`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
-
 ## Dependencies

 - `ubongo` (ADR-015) — runs the browser. Designed, not built.
@@ -85,7 +85,7 @@ template, the `/verify-service` skill, the convention/checklist/Further-reading
 | Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
 | Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
 | Free-form, no per-service spec | Non-repeatable, can miss a critical flow; `VERIFY.md` gives a backbone. |
-| Staging bypasses SSO / per-app users | Wouldn't exercise the real Traefik+Authentik path; central test users are faithful. |
+| Staging bypasses SSO / per-app users | Wouldn't exercise the real Caddy+Authentik path; central test users are faithful. |
 | Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |

 See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
@@ -1,5 +1,12 @@
 # ADR-018 — Logging and log integrity

+## Status
+
+Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
+accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
+the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
+and the live pipeline.
+
 ## Context

 boma wants all logs in one queryable store for troubleshooting, spotting issues over
@@ -70,13 +77,6 @@ ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a moni
 metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
 tracked allocation in `docs/hardware/reference.md` (ADR-012).

-## Status
-
-Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
-accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
-the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
-and the live pipeline.
-
 ## Dependencies

 `base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
@@ -49,7 +49,7 @@ slice on its own, and (c) doesn't overlap confusingly with another.
 | `monitoring` | metric exporters / health checks |
 | `config` | render templated config/compose files to disk — **no restart** |
 | `deploy` | bring services up / restart (`compose up -d`) |
-| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
+| `proxy` | reverse-proxy + TLS registration (Caddy routes, Authentik) |

 The `config`/`deploy` split lets you re-render and diff configuration (`--tags
 config`) without bouncing services, then restart deliberately (`--tags deploy`).
@@ -57,10 +57,14 @@ boma's reverse proxy is **Caddy**.
 5. `forward_auth` to Authentik is a first-class Caddy directive — the planned
 Authentik auth story (ADR-002) is preserved without Traefik as the middleman.

-### 2. Custom image
+### 2. Custom image (DNS-01 path only — Phase 2)

+> Applies only to the **DNS-01** path, which is **deferred to Phase 2** (see the Status
+> note). M4a ships **vanilla `caddy:2`** on askari (HTTP-01) — no custom image.
+
 Caddy's official Docker image does not include third-party DNS plugins. The `caddy-dns/gandi`
-plugin must be compiled in via `xcaddy`. boma builds a custom image:
+plugin must be compiled in via `xcaddy`. When the cluster's mesh/LAN-only services need
+DNS-01, boma builds a custom image:

 ```
 FROM caddy:builder AS builder
@@ -70,14 +74,16 @@ FROM caddy:latest
 COPY --from=builder /usr/bin/caddy /usr/bin/caddy
 ```

-This image is maintained as a boma artifact (Forgejo registry, pinned digest in the
-Compose template). It is the cost of the Gandi DNS-01 path — unavoidable regardless of
-proxy choice.
+That image would be maintained as a boma artifact (Forgejo registry, pinned digest in the
+Compose template) — the cost of the Gandi DNS-01 path. (On askari this approach hit two
+blockers, so DNS-01 is deferred; see the Status note.)

 ### 3. Deployment scope

-The first Caddy instance fronts the NetBird stack on `askari` (M4). The pattern
-generalises to the Proxmox cluster in Phase 2 when services multiply.
+The first Caddy instance runs on `askari` (M4a), serving a test vhost over HTTP-01 to
+prove the proxy + ACME path. It fronts the NetBird stack in **M4b** (when the
+`netbird` coordinator role is built). The pattern generalises to the Proxmox cluster in
+Phase 2 when services multiply.

 ### 4. Authentik integration (deferred)

@@ -90,8 +96,9 @@ middleware migration is required.
 - **Roadmap Phase-2 step 5** is updated from "Authentik + Traefik" to "Authentik +
 Caddy (ADR-024)".
 - **ADR-017 prose** that mentioned Traefik is updated to read "Caddy (ADR-024)".
-- A custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must be built, pushed to the
-Forgejo registry, and kept current (plugin + base image updates).
+- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. **If/when**
+the Phase-2 DNS-01 path lands, a custom Caddy image (`xcaddy` + `caddy-dns/gandi`) must
+be built, pushed to the Forgejo registry, and kept current (plugin + base image updates).
 - Caddyfile config is rendered by Ansible from `group_vars` — consistent with ADR-004
 and easier to review than distributed container labels.
 - `forward_auth` to Authentik is available when Authentik is deployed; no extra
@@ -118,8 +118,14 @@ Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
 machine outside the cluster — not a Proxmox guest. It is the **one** host
 provisioned manually. Rationale, hardware target, and recovery model: ADR-015.

+> **Current state (STATUS.md):** `ubongo` is today managed as the operator account
+> `sjat` (`group_vars/control` sets `ansible_user: sjat`); it has **no** dedicated
+> `ansible` service user yet. The dedicated-`ansible`-user bootstrap (step 2) is a
+> **pending** item. Steps below describe the intended end state.
+
 1. Install Debian 13 on the physical box by hand (no template to clone).
-2. Create the `ansible` user and install its SSH public key.
+2. Create the `ansible` user and install its SSH public key. *(Pending for `ubongo` —
+currently managed as `sjat`; see the note above.)*
 3. Set up the Ansible environment on it:
 ```bash
 git clone <repo> ~/ansible