boma/docs/superpowers/specs/2026-06-05-service-ui-verification-design.md
sjat 2bd11b5aa9 Add design spec for service-UI verification (ADR-008 Level 4)
Resolves ADR-015 deferred item #2 + TODO 2.2/2.3: a Claude-driven exploratory
browser harness (/verify-service) that exercises staging service UIs through
real SSO, backed by a per-service VERIFY.md, with test users in staging
Authentik and a manual-test handoff. Basis for ADR-017.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 13:05:11 +02:00


Design — Service-UI acceptance verification (ADR-008 Level 4)

  • Date: 2026-06-05
  • Status: Approved design — pending implementation plan
  • Resolves: ADR-015 deferred item #2 (browser-E2E verification harness); TODO 2.2 (browser portion) + TODO 2.3 (test users + manual-test instruction)
  • Expands: ADR-008 Level 4 (currently a stub)
  • Becomes: ADR-017 (this design is the basis for that ADR)

Problem

ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a Level 4 stub: "Claude drives a headless browser from ubongo against a deployed service: loads the rendered UI, creates test users, exercises features, and hands the operator a manual test script." Nothing below Level 4 actually exercises a service's application UI: Molecule tests the role in a container, Level 2 confirms the stack converges, and Level 3 confirms public endpoints respond. None of them answers "does PhotoPrism actually let me log in, upload a photo, and see a thumbnail?" (TODO 8.2).

The operator's original ask: "Claude could spin up a browser and actually see the generated service web-UIs to verify various things. Perhaps even generate test users and test features and instruct me on tests as well." That is TODO 2.2 (headless browsing) + TODO 2.3 (test-user generation + manual-test instruction).

Today Claude "sees" a browser only passively — the /screenshot skill fetches screenshots the operator took on mamba. This harness is the active counterpart: Claude drives the browser itself.

Decisions (the settled forks)

  1. Nature — Claude-driven exploratory. Claude navigates the live UI with judgment (look, click, reason about whether it works, notice anything off), not deterministic scripts. This is the distinctive value; a scripted Playwright regression suite is explicitly not built here.
  2. Mode — interactive, Claude-in-the-loop. Follows from #1: exploratory judgment can't be a headless cron gate. Scheduled smoke-testing stays out of scope (that is a determinism job for health checks / Uptime Kuma later).
  3. Environment — staging, full exercise. Claude creates test users and exercises features (including destructive flows) against a staging deploy. Staging is a rebuildable sandbox, so this resolves safety: no production-data risk, no prod pollution.
  4. Auth — test users in Authentik (central IdP), real SSO flow. Claude's browser authenticates through Traefik + Authentik exactly as a real user would, faithfully testing the real access path.
  5. Structure — per-service VERIFY.md backbone + free exploration. Each service role ships an acceptance spec of critical user journeys; Claude executes it and explores beyond it. Repeatable + intent-capturing, without losing exploratory value.

Scope

In scope: the browser/UI verification harness (TODO 2.2 browser portion) + the test-user and manual-test-instruction standards (TODO 2.3) = ADR-008 Level 4.

Out of scope (siblings, noted not built): the other TODO-2.2 "live testing" methods — API calls, curl pulls, log review. They share the spirit but are not browser work. Also out: a scripted/CI regression suite; scheduled headless smoke checks.


Architecture, mechanism, and workflow placement

Mechanism. Claude drives a real Chromium on ubongo via the playwright Claude Code plugin (already earmarked in claude-code-setup.md, enabled when this lands). No bespoke browser code — Claude calls the Playwright MCP tools (navigate, click, type, screenshot, read DOM) and reasons over what it sees. Active counterpart to the passive /screenshot-from-mamba pattern.

Orchestration. A boma skill/command — /verify-service <name> — run interactively on ubongo. It:

  1. Reads the service's roles/<name>/VERIFY.md acceptance spec.
  2. Provisions/uses a test user in the staging Authentik.
  3. Drives the browser through the real SSO flow into the staging service.
  4. Executes the listed journeys exploratorily (judging pass/fail, screenshotting key states) and free-explores.
  5. Writes a dated verification report with linked screenshots.
  6. Emits a manual-test checklist for anything it couldn't do.
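
The control flow of these steps can be sketched in a few lines of Python. This is only an illustration of the shape of a run, not the skill itself: `verify_service`, `VerificationRun`, `JourneyResult`, and the `run_journey` callback are hypothetical names, and the callback stands in for Claude's exploratory judgment in the browser (step 4), which no function can really replace.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class JourneyResult:
    journey: str
    passed: bool
    note: str = ""


@dataclass
class VerificationRun:
    service: str
    results: list[JourneyResult] = field(default_factory=list)
    manual_checks: list[str] = field(default_factory=list)

    @property
    def verdict(self) -> str:
        # An empty run is not a pass: nothing was actually verified.
        if self.results and all(r.passed for r in self.results):
            return "pass"
        return "fail"


def verify_service(
    service: str,
    journeys: list[str],
    not_browser_verifiable: list[str],
    run_journey: Callable[[str], tuple[bool, str]],
) -> VerificationRun:
    """Run each journey via the (hypothetical) callback and collect results."""
    run = VerificationRun(service=service)
    for journey in journeys:
        passed, note = run_journey(journey)
        run.results.append(JourneyResult(journey, passed, note))
    # Anything the browser cannot cover goes to the manual-test checklist.
    run.manual_checks.extend(not_browser_verifiable)
    return run
```

Treating an empty run as a failure is a design assumption of this sketch: a service with no executed journeys has not been verified, so it should not pass by default.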

Pipeline placement. Level 4 runs after Level 2 (staging deploy) and before production promotion: build role → molecule (L1) → staging deploy (L2) → /verify-service (L4) → promote. It reaches the staging service over the LAN from ubongo (services on srv; resolved via boma DNS), through Traefik + Authentik as a real user would.

Boundaries (one unit, clear interface): the skill orchestrates; VERIFY.md declares intent (per service); Authentik provides identity; the report captures results. Each is independently understandable and swappable.


The VERIFY.md standard

Every service role ships a populated roles/<service>/VERIFY.md, copied from a new template docs/testing/service-verify-template.md — parallel to how each role ships SECURITY.md from service-security-template.md. It becomes a role convention (every service role must have a populated VERIFY.md).

Contents:

  • Critical user journeys — the acceptance criteria that define "working" for this service (e.g. PhotoPrism: SSO login → library loads → upload a test image → thumbnail generates → search finds it).
  • What good looks like — states/screenshots to confirm.
  • Not browser-verifiable — items to route to the manual-test handoff (hardware, paid/external flows, subjective quality).

/verify-service reads roles/<name>/VERIFY.md, executes those journeys, and explores beyond them.
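
As an illustration of how the skill might read such a file, here is a minimal parsing sketch. The on-disk layout is an assumption: this spec does not fix the template's heading levels or list syntax, so the `##` headings and `- ` bullets below are hypothetical, as are `parse_verify_md` and the sample content (the journeys are taken from the PhotoPrism example above; the other entries are placeholders).

```python
import re

# Hypothetical VERIFY.md layout; the real template may differ.
SAMPLE_VERIFY_MD = """\
# VERIFY: photoprism

## Critical user journeys
- SSO login
- Library loads
- Upload a test image
- Thumbnail generates
- Search finds it

## What good looks like
- Library grid shows the uploaded image

## Not browser-verifiable
- Subjective thumbnail quality
"""


def parse_verify_md(text: str) -> dict[str, list[str]]:
    """Map each '## ' section heading to its list of '- ' bullet items."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        heading = re.match(r"^##\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None and line.startswith("- "):
            sections[current].append(line[2:].strip())
    return sections


spec = parse_verify_md(SAMPLE_VERIFY_MD)
journeys = spec["Critical user journeys"]
manual_items = spec["Not browser-verifiable"]
```

Keeping the three sections machine-readable means the "Not browser-verifiable" items can flow straight into the manual-test checklist without Claude having to re-derive them.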

Test-user generation standard (TODO 2.3)

Test identities are provisioned in the staging Authentik (never the production IdP — test accounts must not exist in prod):

  • Convention: a dedicated test group / naming prefix (e.g. test-<service>@…) so accounts are identifiable and bulk-removable.
  • Credentials: ephemeral, generated per run (staging is rebuildable); held only for the run. No test creds in vault.yml.
  • Idempotent: reuse-or-create.
  • Teardown: primary teardown is the staging rebuild (sandbox); the skill also offers explicit cleanup of the test group.
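
The reuse-or-create and ephemeral-credential rules can be sketched as follows. This deliberately uses an in-memory stand-in rather than the Authentik API, whose calls are not specified here; `StagingDirectory`, `username_for`, and the group name `test-users` are hypothetical. Rotating the password when an existing account is reused is one possible reading of "generated per run", not a settled decision, and the mail domain in the naming convention is left out because the spec elides it.

```python
import secrets

TEST_GROUP = "test-users"  # hypothetical group name


def username_for(service: str) -> str:
    # Follows the test-<service> prefix convention; the domain is not specified.
    return f"test-{service}"


class StagingDirectory:
    """In-memory stand-in for the staging IdP. Not the Authentik API."""

    def __init__(self) -> None:
        self.users: dict[str, dict[str, str]] = {}

    def reuse_or_create(self, service: str) -> tuple[str, str, bool]:
        """Return (username, fresh password, created?) for this run."""
        name = username_for(service)
        created = name not in self.users
        # Assumption: a new random credential every run, even on reuse,
        # so nothing long-lived ever needs to be stored in vault.yml.
        password = secrets.token_urlsafe(24)
        self.users[name] = {"group": TEST_GROUP, "password": password}
        return name, password, created

    def cleanup_group(self) -> int:
        """Explicit teardown: remove every account in the test group."""
        doomed = [n for n, u in self.users.items() if u["group"] == TEST_GROUP]
        for name in doomed:
            del self.users[name]
        return len(doomed)
```

Calling `reuse_or_create` twice for the same service leaves exactly one account, which is the idempotency property the standard asks for; `cleanup_group` covers the explicit teardown path when a full staging rebuild is not wanted.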

Reporting & manual-test handoff

  • Report: /verify-service writes docs/testing/reviews/YYYY-MM-DD-<service>.md (plus latest.md), mirroring /review-repo → docs/reviews/ and /capacity-review → docs/hardware/reviews/. It contains pass/fail per VERIFY.md journey, observations, the test-user/env used, a verdict, and the manual-test checklist. The committed markdown is the durable artifact.
  • Screenshots: saved to a git-ignored dir on ubongo (PNGs would bloat the repo); the report links them and inlines only a few key evidence shots.
  • Manual-test handoff (TODO 2.3): anything Claude can't do — physical device, paid/external flow, subjective judgment — becomes a structured checklist in the report (numbered steps, expected result, why handed off). The operator runs them and reports back. This is the "instruct me on tests" half of the vision, as a first-class output.
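
A small sketch of the report naming and layout described above. The function names and the exact section headings are hypothetical, and placing latest.md in the same reviews directory is an assumption; only the dated filename pattern and the report's required contents come from this design.

```python
from datetime import date
from pathlib import PurePosixPath

REVIEWS_DIR = PurePosixPath("docs/testing/reviews")


def report_paths(service: str, day: date) -> tuple[PurePosixPath, PurePosixPath]:
    """Return the dated report path and the (assumed) latest.md path."""
    dated = REVIEWS_DIR / f"{day.isoformat()}-{service}.md"
    return dated, REVIEWS_DIR / "latest.md"


def render_report(
    service: str,
    day: date,
    results: list[tuple[str, bool, str]],  # (journey, passed, observation)
    manual: list[tuple[str, str, str]],    # (step, expected result, why handed off)
) -> str:
    """Render a minimal markdown report: verdict, journeys, manual checklist."""
    verdict = "PASS" if results and all(ok for _, ok, _ in results) else "FAIL"
    lines = [
        f"# Verification: {service} ({day.isoformat()})",
        "",
        f"Verdict: {verdict}",
        "",
        "## Journeys",
    ]
    for journey, ok, note in results:
        lines.append(f"- [{'pass' if ok else 'fail'}] {journey}: {note}")
    lines += ["", "## Manual-test checklist"]
    for i, (step, expected, reason) in enumerate(manual, start=1):
        lines.append(f"{i}. {step} (expected: {expected}; handed off because: {reason})")
    return "\n".join(lines) + "\n"
```

For example, `report_paths("photoprism", date(2026, 6, 5))` yields `docs/testing/reviews/2026-06-05-photoprism.md` alongside `docs/testing/reviews/latest.md`.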

Safety

Even though staging is a sandbox:

  • Staging-only guard. The skill refuses to run against production (verifies it is pointed at the staging environment/inventory before acting) — an ADR-002-aligned hard stop, since exploratory clicking is destructive by nature.
  • Confined blast radius. Test users live only in the staging test group; the run sticks to the target service.
  • No secrets leaked. Screenshots can capture on-screen tokens/credentials, so the git-ignored screenshot dir is also the safety boundary (evidence isn't committed by default), and the skill avoids capturing credential screens.
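
The staging-only guard could look roughly like this. How the skill actually identifies staging is not decided in this spec, so both checks are assumptions: that the inventory path contains a `staging` component, and that the target host appears in an explicit allowlist. `assert_staging`, `NotStagingError`, and the hostnames used with it are hypothetical.

```python
from pathlib import PurePosixPath


class NotStagingError(RuntimeError):
    """Raised before any browser action if the target is not provably staging."""


def assert_staging(
    inventory_path: str,
    target_host: str,
    staging_hosts: frozenset[str],
) -> None:
    """Fail closed: both checks must pass, otherwise refuse to run."""
    # Assumption: staging inventories live under a path component named "staging".
    if "staging" not in PurePosixPath(inventory_path).parts:
        raise NotStagingError(
            f"inventory {inventory_path!r} is not a staging inventory"
        )
    # Assumption: staging hosts are enumerated explicitly, never pattern-matched.
    if target_host not in staging_hosts:
        raise NotStagingError(
            f"host {target_host!r} is not in the staging allowlist"
        )
```

Matching on a whole path component rather than a substring means a directory such as `not-staging-prod` is rejected, and an allowlist avoids guessing from hostname patterns; both choices err toward refusing to run.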

Documentation & implementation changes

This is a substantial capability → its own ADR-017, with reconciliations:

| Doc / artifact | Change |
| --- | --- |
| ADR-017 (new) | Home of record: harness, the five settled forks, VERIFY.md standard, test-user + manual-handoff standards, safety. |
| ADR-008 (testing) | Expand the Level 4 stub into the full definition; link ADR-017. |
| docs/testing/service-verify-template.md (new) | The VERIFY.md template (parallels service-security-template.md). |
| .claude/commands/verify-service.md (new) | The /verify-service <name> orchestrating skill. |
| CLAUDE.md | Role conventions: every service role must ship a populated VERIFY.md. Further reading: ADR-017. |
| docs/security/service-checklist.md | Add "passed Level 4 (/verify-service)" to the pre-production service-clearance gate. |
| .gitignore + docs/testing/reviews/ | Ignore the screenshot dir; create the reviews dir (README/.gitkeep). |
| STATUS.md | Row: Level 4 verification (skill + template authorable; running deferred). |
| docs/TODO.md | Mark 2.2 (browser portion) + 2.3 addressed by ADR-017; note API/curl/log siblings remain. |
| make new-role scaffold | Scaffold VERIFY.md into new service roles (when that scaffold is next touched). |

Buildable now (no ubongo/Authentik/staging needed): ADR-017, the ADR-008 expansion, the VERIFY.md template, the /verify-service skill logic, the convention + checklist + Further-reading edits, .gitignore/dir, STATUS/TODO. This spec yields real working artifacts immediately — the skill and standards exist and are reviewable; only the live run waits on the stack.

Deferred (needs the stack): actually running it (ubongo + playwright plugin + Authentik + a staging deploy); the Authentik test-user provisioning automation; per-service VERIFY.md files (need the service roles, which don't exist yet).


Dependencies

  • ubongo (ADR-015) — the host that runs the browser. Designed, not built.
  • playwright Claude Code plugin — enabled when this lands (claude-code-setup.md).
  • Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
  • A staging environment with the service deployed (ADR-008 Level 2) — staging is currently empty stubs.

What was ruled out

| Option | Reason |
| --- | --- |
| Scripted Playwright regression suite | The operator wants exploratory judgment, not deterministic scripts; scripts add authoring/maintenance burden. A scripted layer could come later but is not this. |
| Scheduled headless smoke gate (cron) | Needs determinism, which the exploratory nature excludes; that role belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. Production gets non-destructive checks elsewhere, not here. |
| Free-form exploration with no per-service spec | Flexible but non-repeatable and can miss a service's critical flow; VERIFY.md gives a backbone while keeping free exploration. |
| Staging bypasses SSO / per-app local users | Wouldn't exercise the real Traefik+Authentik access path; central test users in Authentik are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on ubongo, markdown report committed. |

See also: ADR-008 (testing — expanded), ADR-015 (control host — runs the browser), ADR-002 (security), ADR-004 (one service = one role — VERIFY.md parallels SECURITY.md), ADR-013/014 (heritage / knowledge sourcing).