boma/docs/decisions/017-service-ui-verification.md

ADR-017 — Service-UI acceptance verification (Level 4)

Context

ADR-008 defines testing Levels 1-3 (Molecule, staging deploy, external smoke) and a Level 4 stub. Nothing below Level 4 exercises a service's application UI; none of those levels answers "does PhotoPrism actually let me log in, upload a photo, and see a thumbnail?" (TODO 8.2). The operator's ask (TODO 2.2 headless browsing + TODO 2.3 test users + manual-test instruction): Claude spins up a browser, sees the service UI, exercises it, generates test users, and instructs the operator on manual tests. Today Claude sees a browser only passively (/screenshot fetches operator-taken shots from mamba); this harness is the active counterpart.

Decision

A Claude-driven exploratory service-UI verification harness — Level 4 — invoked as /verify-service <name> on ubongo. Five settled forks:

  1. Claude-driven exploratory — Claude navigates with judgment, not deterministic scripts. A scripted regression suite is explicitly not built here.
  2. Interactive, Claude-in-the-loop — exploratory judgment can't be a headless cron gate; scheduled smoke is a determinism job for health checks / Uptime Kuma later.
  3. Staging, full exercise — Claude creates test users and exercises features (incl. destructive flows) against a staging deploy; the rebuildable sandbox resolves safety.
  4. Test users in Authentik (central IdP), real SSO flow — authenticates through Traefik + Authentik as a real user would.
  5. Per-service VERIFY.md backbone + free exploration — each service role ships an acceptance spec of critical journeys; Claude executes it and explores beyond it.

VERIFY.md standard

Every service role ships a populated roles/<service>/VERIFY.md, copied from docs/testing/service-verify-template.md — parallel to SECURITY.md from service-security-template.md. A new role convention. It lists the service's critical user journeys (what "working" means), what good looks like, and what is not browser-verifiable (→ manual handoff). It also joins the pre-production gate in docs/security/service-checklist.md.
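
As an illustration of how the pre-production gate could enforce this convention, the sketch below lists roles that lack a populated VERIFY.md. It is a minimal sketch, not part of the ADR: the function name `roles_missing_verify` is hypothetical, and "populated" is assumed to mean "exists and is not blank".

```python
from pathlib import Path


def roles_missing_verify(roles_dir: Path) -> list[str]:
    """Return the names of role directories without a non-blank VERIFY.md.

    Assumption: every direct subdirectory of roles_dir is a service role.
    """
    missing = []
    for role in sorted(p for p in roles_dir.iterdir() if p.is_dir()):
        spec = role / "VERIFY.md"
        # A missing file and a whitespace-only file both count as unpopulated.
        if not spec.is_file() or not spec.read_text(encoding="utf-8").strip():
            missing.append(role.name)
    return missing
```

A checklist step could call `roles_missing_verify(Path("roles"))` and fail when the returned list is non-empty.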

Test-user standard (TODO 2.3)

Test identities live only in the staging Authentik (never production): a dedicated test group / naming prefix; ephemeral per-run credentials (staging is rebuildable, so nothing persisted, none in vault.yml); reuse-or-create; teardown via staging rebuild or explicit test-group cleanup.
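
The reuse-or-create rule with ephemeral credentials can be sketched as follows. Everything named here is an assumption for illustration: the prefix `verify-test-`, the group `verify-test-users`, and the in-memory `directory` dict, which stands in for the staging Authentik user store. No Authentik API call is shown.

```python
import secrets
from dataclasses import dataclass

TEST_PREFIX = "verify-test-"      # hypothetical naming prefix
TEST_GROUP = "verify-test-users"  # hypothetical dedicated test group


@dataclass(frozen=True)
class EphemeralUser:
    username: str
    password: str
    group: str


def is_test_user(username: str) -> bool:
    """True if the name carries the test prefix (used for cleanup filters)."""
    return username.startswith(TEST_PREFIX)


def reuse_or_create(service: str, directory: dict[str, EphemeralUser]) -> EphemeralUser:
    """Return the existing test user for a service, or create one.

    The password is generated per run and only held in memory, so nothing
    needs to be written to vault.yml.
    """
    username = f"{TEST_PREFIX}{service}"
    if username in directory:
        return directory[username]
    user = EphemeralUser(username, secrets.token_urlsafe(24), TEST_GROUP)
    directory[username] = user
    return user
```

Teardown by explicit cleanup would then remove every entry for which `is_test_user` returns true; a staging rebuild makes that step unnecessary.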

Reporting & manual handoff

/verify-service writes docs/testing/reviews/YYYY-MM-DD-<service>.md (+ latest.md), mirroring /review-repo and /capacity-review: pass/fail per VERIFY.md journey, observations, the test-user/env used, a verdict, and a structured manual-test checklist for anything Claude can't do (physical device, paid/external flow, subjective judgment) — the "instruct me on tests" output. Screenshots are saved to a git-ignored working dir on ubongo (PNG bloat + secret-leak risk); the report links them.
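
To make the report layout concrete, here is a minimal sketch of how the paths and body might be assembled. The helper names are hypothetical, `latest.md` is assumed to sit in the same directory as the dated report, and the verdict rule (pass only if every journey passed) is an assumption rather than something this ADR fixes.

```python
from datetime import date
from pathlib import Path

REVIEWS_DIR = Path("docs/testing/reviews")


def report_paths(service: str, run_date: date) -> tuple[Path, Path]:
    """Return (dated report path, latest.md path) for one run."""
    dated = REVIEWS_DIR / f"{run_date.isoformat()}-{service}.md"
    return dated, REVIEWS_DIR / "latest.md"


def render_report(service: str, run_date: date,
                  journeys: dict[str, bool], manual_checks: list[str]) -> str:
    """Render per-journey results, a verdict, and the manual-test checklist."""
    # Assumed rule: an empty journey list or any failed journey means FAIL.
    verdict = "PASS" if journeys and all(journeys.values()) else "FAIL"
    lines = [f"Service-UI verification: {service} ({run_date.isoformat()})", ""]
    lines += [f"- {'PASS' if ok else 'FAIL'}: {name}" for name, ok in journeys.items()]
    lines += ["", f"Verdict: {verdict}", "", "Manual tests:"]
    lines += [f"- [ ] {item}" for item in manual_checks]
    return "\n".join(lines) + "\n"
```

The observations, test-user/environment details, and screenshot links mentioned above would be further sections of the same body; they are left out to keep the sketch short.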

Safety

  • Staging-only guard — the skill refuses to run against production (exploratory clicking is destructive); ADR-002-aligned hard stop.
  • Confined blast radius — test users only in the staging test group; the run sticks to the target service.
  • No secrets leaked — the git-ignored screenshot dir is the safety boundary; avoid capturing credential screens.
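
The staging-only guard could look something like the sketch below. It assumes staging hosts share a recognizable DNS suffix; the suffix `.staging.example.internal` and the names `require_staging` and `ProductionTargetError` are hypothetical, and the real skill may identify staging differently (for example by inventory group).

```python
from urllib.parse import urlsplit

STAGING_SUFFIX = ".staging.example.internal"  # hypothetical staging domain


class ProductionTargetError(RuntimeError):
    """Raised when a run is pointed at anything that is not staging."""


def require_staging(url: str) -> str:
    """Return the target host if it is a staging host, else raise.

    Fails closed: a URL with no parseable hostname is rejected too.
    """
    host = urlsplit(url).hostname or ""
    if not host.endswith(STAGING_SUFFIX):
        raise ProductionTargetError(f"refusing to run against non-staging target: {url!r}")
    return host
```

Calling the guard once, before any browser is launched, keeps the hard stop in a single place.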

Status

Designed. Authorable now: this ADR, the ADR-008 Level 4 expansion, the VERIFY.md template, the /verify-service skill, the convention/checklist/Further-reading edits, .gitignore/dir, STATUS/TODO. Running the harness is deferred until the dependencies below are in place.

Dependencies

  • ubongo (ADR-015) — runs the browser. Designed, not built.
  • playwright Claude Code plugin — enabled when this lands (claude-code-setup.md).
  • Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
  • A staging deploy of the service (ADR-008 Level 2) — staging is currently empty stubs.
  • make new-role scaffolding VERIFY.md — deferred to when that scaffold is next touched.

What was ruled out

| Option | Reason |
| --- | --- |
| Scripted Playwright regression suite | Operator wants exploratory judgment; scripts add maintenance burden. Could be a later layer, not this. |
| Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
| Free-form, no per-service spec | Non-repeatable, can miss a critical flow; VERIFY.md gives a backbone. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Traefik+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on ubongo. |

See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security), ADR-004 (VERIFY.md parallels SECURITY.md), ADR-013/014 (heritage / knowledge sourcing).