boma/docs/superpowers/specs/2026-06-05-service-ui-verification-design.md
sjat 2bd11b5aa9 Add design spec for service-UI verification (ADR-008 Level 4)
Resolves ADR-015 deferred item #2 + TODO 2.2/2.3: a Claude-driven exploratory
browser harness (/verify-service) that exercises staging service UIs through
real SSO, backed by a per-service VERIFY.md, with test users in staging
Authentik and a manual-test handoff. Basis for ADR-017.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 13:05:11 +02:00

# Design — Service-UI acceptance verification (ADR-008 Level 4)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** ADR-015 deferred item #2 (browser-E2E verification harness); TODO 2.2
(browser portion) + TODO 2.3 (test users + manual-test instruction)
- **Expands:** ADR-008 Level 4 (currently a stub)
- **Becomes:** ADR-017 (this design is the basis for that ADR)
---
## Problem
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
**Level 4 stub**: "Claude drives a headless browser from `ubongo` against a deployed
service: loads the rendered UI, creates test users, exercises features, and hands the
operator a manual test script." Nothing below Level 4 actually exercises a service's
**application UI** — Molecule tests the role in a container, Level 2 confirms the stack
converges, Level 3 confirms public endpoints respond. None answer "does PhotoPrism
actually let me log in, upload a photo, and see a thumbnail?" (TODO 8.2).

The operator's original ask: *"Claude could spin up a browser and actually see the
generated service web-UIs to verify various things. Perhaps even generate test users
and test features and instruct me on tests as well."* That is TODO 2.2 (headless
browsing) + TODO 2.3 (test-user generation + manual-test instruction).
Today Claude "sees" a browser only **passively** — the `/screenshot` skill fetches
screenshots the operator took on `mamba`. This harness is the **active** counterpart:
Claude drives the browser itself.
## Decisions (the settled forks)
1. **Nature — Claude-driven exploratory.** Claude navigates the live UI with judgment
(look, click, reason about whether it works, notice anything off), not deterministic
scripts. This is the distinctive value; a scripted Playwright regression suite is
explicitly *not* built here.
2. **Mode — interactive, Claude-in-the-loop.** Follows from #1: exploratory judgment
can't be a headless cron gate. Scheduled smoke-testing stays out of scope (that job
needs determinism and belongs to health checks / Uptime Kuma later).
3. **Environment — staging, full exercise.** Claude creates test users and exercises
features (including destructive flows) against a *staging* deploy. Staging is a
rebuildable sandbox, so this resolves safety: no production-data risk, no prod
pollution.
4. **Auth — test users in Authentik (central IdP), real SSO flow.** Claude's browser
authenticates through Traefik + Authentik exactly as a real user would, faithfully
testing the real access path.
5. **Structure — per-service `VERIFY.md` backbone + free exploration.** Each service
role ships an acceptance spec of critical user journeys; Claude executes it *and*
explores beyond it. Repeatable + intent-capturing, without losing exploratory value.
## Scope
In scope: the **browser/UI** verification harness (TODO 2.2 browser portion) + the
**test-user** and **manual-test-instruction** standards (TODO 2.3) = ADR-008 **Level 4**.
Out of scope (siblings, noted not built): the other TODO-2.2 "live testing" methods —
API calls, `curl` pulls, log review. They share the spirit but are not browser work.

Also out: a scripted/CI regression suite; scheduled headless smoke checks.

---
## Architecture, mechanism, and workflow placement
**Mechanism.** Claude drives a real Chromium on `ubongo` via the **`playwright` Claude
Code plugin** (already earmarked in `claude-code-setup.md`, enabled when this lands).
No bespoke browser code — Claude calls the Playwright MCP tools (navigate, click, type,
screenshot, read DOM) and reasons over what it sees. Active counterpart to the passive
`/screenshot`-from-`mamba` pattern.
**Orchestration.** A boma skill/command — **`/verify-service <name>`** — run
interactively on `ubongo`. It:
1. Reads the service's `roles/<name>/VERIFY.md` acceptance spec.
2. Provisions/uses a test user in the **staging** Authentik.
3. Drives the browser through the real SSO flow into the staging service.
4. Executes the listed journeys exploratorily (judging pass/fail, screenshotting key
states) and free-explores.
5. Writes a dated verification report with linked screenshots.
6. Emits a manual-test checklist for anything it couldn't do.

**Pipeline placement.** Level 4 runs after Level 2 (staging deploy) and before
production promotion:
`build role → molecule (L1) → staging deploy (L2) → /verify-service (L4) → promote`.
It reaches the staging service over the LAN from `ubongo` (services on `srv`; resolved
via boma DNS), through Traefik + Authentik as a real user would.

**Boundaries (one unit, clear interface):** the skill *orchestrates*; `VERIFY.md`
*declares intent* (per service); Authentik *provides identity*; the report *captures
results*. Each is independently understandable and swappable.

---
## The `VERIFY.md` standard
Every service role ships a populated `roles/<service>/VERIFY.md`, copied from a new
template `docs/testing/service-verify-template.md` — parallel to how each role ships
`SECURITY.md` from `service-security-template.md`. It becomes a **role convention**
(every *service* role must have a populated `VERIFY.md`).

Contents:
- **Critical user journeys** — the acceptance criteria that define "working" for this
service (e.g. PhotoPrism: *SSO login → library loads → upload a test image →
thumbnail generates → search finds it*).
- **What good looks like** — states/screenshots to confirm.
- **Not browser-verifiable** — items to route to the manual-test handoff (hardware,
paid/external flows, subjective quality).

`/verify-service` reads `roles/<name>/VERIFY.md`, executes those journeys, and explores
beyond them.
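
A reader for this layout could be sketched as follows. This is a minimal illustration, not the skill's actual logic. It assumes the template uses one `## ` heading per content item listed above, with heading text matching the bullet titles. The template does not exist yet, so those headings, the sample text, and the names `parse_verify_md` and `missing_sections` are all hypothetical.

```python
import re

# Hypothetical section titles, taken from the three content items in the spec.
SECTIONS = (
    "Critical user journeys",
    "What good looks like",
    "Not browser-verifiable",
)


def parse_verify_md(text: str) -> dict:
    """Collect the list items found under each known `## ` heading."""
    result = {name: [] for name in SECTIONS}
    current = None
    for line in text.splitlines():
        heading = re.match(r"^##\s+(.*\S)\s*$", line)
        if heading:
            title = heading.group(1)
            # Unknown headings switch collection off until a known one appears.
            current = title if title in result else None
            continue
        item = re.match(r"^\s*(?:[-*]|\d+\.)\s+(.*\S)\s*$", line)
        if item and current is not None:
            result[current].append(item.group(1))
    return result


def missing_sections(parsed: dict) -> list:
    """Return the sections with no items, i.e. an unpopulated VERIFY.md."""
    return [name for name in SECTIONS if not parsed[name]]


# Illustrative input only; real VERIFY.md content is written per service.
EXAMPLE = """\
# VERIFY: photoprism

## Critical user journeys
1. SSO login
2. Library loads
3. Upload a test image

## What good looks like
- Thumbnail appears in the library grid

## Not browser-verifiable
"""

parsed = parse_verify_md(EXAMPLE)
```

A check like `missing_sections` is one way the "populated" part of the role convention could be enforced before a run starts; whether the skill should refuse or merely warn on an empty section is left open here.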
## Test-user generation standard (TODO 2.3)
Test identities are provisioned in the **staging** Authentik (never the production IdP
— test accounts must not exist in prod):
- **Convention:** a dedicated `test` group / naming prefix (e.g. `test-<service>@…`) so
accounts are identifiable and bulk-removable.
- **Credentials:** ephemeral, generated per run (staging is rebuildable); held only for
the run. No test creds in `vault.yml`.
- **Idempotent:** reuse-or-create.
- **Teardown:** primary teardown is the staging rebuild (sandbox); the skill also
offers explicit cleanup of the `test` group.
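
The reuse-or-create and cleanup behaviour above can be sketched against an in-memory stand-in. This does not use or imitate the Authentik API: `InMemoryDirectory`, `StagingTestUser`, and `username_for` are hypothetical names, the mail domain is a caller-supplied placeholder, and rotating the password on reuse is an assumption about what "generated per run" means.

```python
import secrets
from dataclasses import dataclass

TEST_GROUP = "test"  # group name from the convention above


@dataclass
class StagingTestUser:
    username: str
    group: str
    password: str  # held in memory for the run only, never written to vault.yml


def username_for(service: str, domain: str) -> str:
    """Build the `test-<service>@...` name; the domain is supplied by the caller."""
    return f"test-{service}@{domain}"


class InMemoryDirectory:
    """Stand-in for the staging IdP, used only to illustrate the semantics."""

    def __init__(self) -> None:
        self.users = {}

    def ensure_user(self, service: str, domain: str) -> StagingTestUser:
        """Reuse-or-create: one account per service, fresh credential per run."""
        name = username_for(service, domain)
        password = secrets.token_urlsafe(24)
        user = self.users.get(name)
        if user is None:
            user = StagingTestUser(name, TEST_GROUP, password)
            self.users[name] = user
        else:
            # Assumption: an existing account gets a new ephemeral password.
            user.password = password
        return user

    def cleanup(self) -> int:
        """Remove every account in the test group; return how many were removed."""
        doomed = [n for n, u in self.users.items() if u.group == TEST_GROUP]
        for name in doomed:
            del self.users[name]
        return len(doomed)


directory = InMemoryDirectory()
first = directory.ensure_user("photoprism", "example.invalid")
second = directory.ensure_user("photoprism", "example.invalid")
```

Calling `ensure_user` twice yields the same account rather than a duplicate, which is what makes repeated runs safe between staging rebuilds.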
## Reporting & manual-test handoff
- **Report:** `/verify-service` writes `docs/testing/reviews/YYYY-MM-DD-<service>.md`
(plus `latest.md`), mirroring `/review-repo` → `docs/reviews/` and
`/capacity-review` → `docs/hardware/reviews/`. It contains pass/fail per `VERIFY.md`
journey, observations, the test-user/env used, a verdict, and the manual-test
checklist. The committed markdown is the durable artifact.
- **Screenshots:** saved to a **git-ignored** dir on `ubongo` (PNGs would bloat the
repo); the report links them and inlines only a few key evidence shots.
- **Manual-test handoff (TODO 2.3):** anything Claude can't do — physical device,
paid/external flow, subjective judgment — becomes a **structured checklist** in the
report (numbered steps, expected result, why handed off). The operator runs them and
reports back. This is the "instruct me on tests" half of the vision, as a first-class
output.
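
The report naming and the checklist structure can be sketched as below. It is an illustration under assumptions: `latest.md` is taken to live in the same reviews directory, the checkbox layout is one possible rendering of "numbered steps, expected result, why handed off", and `report_paths`, `ManualTest`, and `render_checklist` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath

REVIEWS_DIR = PurePosixPath("docs/testing/reviews")


def report_paths(service: str, run_date: date):
    """Return the dated report path and the `latest.md` path."""
    dated = REVIEWS_DIR / f"{run_date.isoformat()}-{service}.md"
    # Assumption: latest.md sits next to the dated reports.
    return dated, REVIEWS_DIR / "latest.md"


@dataclass
class ManualTest:
    steps: list     # what the operator should do, in order
    expected: str   # the result that counts as a pass
    reason: str     # why Claude could not do this itself


def render_checklist(items: list) -> str:
    """Render handed-off tests as a numbered Markdown checklist."""
    lines = []
    for number, item in enumerate(items, start=1):
        lines.append(f"{number}. [ ] {'; '.join(item.steps)}")
        lines.append(f"   - Expected: {item.expected}")
        lines.append(f"   - Handed off because: {item.reason}")
    return "\n".join(lines)


dated, latest = report_paths("photoprism", date(2026, 6, 5))
checklist = render_checklist([
    ManualTest(
        steps=["Open the mobile app", "Scan the pairing code"],
        expected="Device appears in the account",
        reason="needs a physical device",
    )
])
```

Keeping the hand-off reason on every item lets the operator see at a glance which entries might become browser-verifiable later.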
## Safety
Even though staging is a sandbox:
- **Staging-only guard.** The skill refuses to run against production (verifies it is
pointed at the staging environment/inventory before acting) — an ADR-002-aligned hard
stop, since exploratory clicking is destructive by nature.
- **Confined blast radius.** Test users live only in the staging `test` group; the run
sticks to the target service.
- **No secrets leaked.** Screenshots can capture on-screen tokens/credentials, so the
git-ignored screenshot dir is also the safety boundary (evidence isn't committed by
default), and the skill avoids capturing credential screens.
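
One way to express the staging-only guard is a fail-closed allowlist check that runs before any browser action. The spec does not say how the skill learns which environment or inventory it is pointed at, so the parameters `environment`, `target_host`, and `staging_hosts`, the exception name, and the example hostnames are all assumptions made for this sketch.

```python
class NotStagingError(RuntimeError):
    """Raised when a run is not provably pointed at staging."""


def require_staging(environment: str, target_host: str, staging_hosts: frozenset) -> None:
    """Hard stop unless both the environment and the host are known staging.

    The check is an allowlist: anything not explicitly recognised as staging
    is refused, so a typo or a missing value fails closed rather than open.
    """
    if environment != "staging":
        raise NotStagingError(f"refusing to run: environment is {environment!r}")
    if target_host not in staging_hosts:
        raise NotStagingError(f"refusing to run: {target_host!r} is not a known staging host")


# Placeholder hostnames for illustration only.
KNOWN_STAGING = frozenset({"photoprism.staging.example.invalid"})


def is_allowed(environment: str, target_host: str) -> bool:
    """Convenience wrapper that reports the guard's decision as a boolean."""
    try:
        require_staging(environment, target_host, KNOWN_STAGING)
    except NotStagingError:
        return False
    return True
```

Checking the host as well as the environment label guards against the case where the label says staging but the URL resolves to a production service.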
---
## Documentation & implementation changes
This is a substantial capability → its own ADR-017, with reconciliations:
| Doc / artifact | Change |
|---|---|
| ADR-017 (new) | Home of record: harness, the five settled forks, `VERIFY.md` standard, test-user + manual-handoff standards, safety. |
| ADR-008 (testing) | Expand the Level 4 stub into the full definition; link ADR-017. |
| `docs/testing/service-verify-template.md` (new) | The `VERIFY.md` template (parallels `service-security-template.md`). |
| `.claude/commands/verify-service.md` (new) | The `/verify-service <name>` orchestrating skill. |
| `CLAUDE.md` | Role conventions: every *service* role must ship a populated `VERIFY.md`. Further reading: ADR-017. |
| `docs/security/service-checklist.md` | Add "passed Level 4 (`/verify-service`)" to the pre-production service-clearance gate. |
| `.gitignore` + `docs/testing/reviews/` | Ignore the screenshot dir; create the reviews dir (README/`.gitkeep`). |
| `STATUS.md` | Row: Level 4 verification — skill + template authorable; *running* deferred. |
| `docs/TODO.md` | Mark 2.2 (browser portion) + 2.3 addressed by ADR-017; note API/`curl`/log siblings remain. |
| `make new-role` scaffold | Scaffold `VERIFY.md` into new service roles (when that scaffold is next touched). |

**Buildable now** (no `ubongo`/Authentik/staging needed): ADR-017, the ADR-008
expansion, the `VERIFY.md` template, the `/verify-service` skill logic, the convention +
checklist + Further-reading edits, `.gitignore`/dir, STATUS/TODO. This spec yields real
working artifacts immediately — the skill and standards exist and are reviewable; only
the *live run* waits on the stack.

**Deferred** (needs the stack): actually running it (`ubongo` + `playwright` plugin +
Authentik + a staging deploy); the Authentik test-user provisioning automation;
per-service `VERIFY.md` files (need the service roles, which don't exist yet).

---
## Dependencies
- `ubongo` (ADR-015) — the host that runs the browser. Designed, not built.
- `playwright` Claude Code plugin — enabled when this lands (`claude-code-setup.md`).
- Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
- A staging environment with the service deployed (ADR-008 Level 2) — staging is
currently empty stubs.
---
## What was ruled out
| Option | Reason |
|---|---|
| Scripted Playwright regression suite | The operator wants exploratory judgment, not deterministic scripts; scripts add authoring/maintenance burden. A scripted layer could come later but is not this. |
| Scheduled headless smoke gate (cron) | Needs determinism, which the exploratory nature excludes; that role belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. Production gets non-destructive checks elsewhere, not here. |
| Free-form exploration with no per-service spec | Flexible but non-repeatable and can miss a service's critical flow; `VERIFY.md` gives a backbone while keeping free exploration. |
| Staging bypasses SSO / per-app local users | Wouldn't exercise the real Traefik+Authentik access path; central test users in Authentik are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`, markdown report committed. |

See also: ADR-008 (testing — expanded), ADR-015 (control host — runs the browser),
ADR-002 (security), ADR-004 (one service = one role — `VERIFY.md` parallels
`SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).