Resolves ADR-015 deferred item #2 + TODO 2.2/2.3: a Claude-driven exploratory browser harness (/verify-service) that exercises staging service UIs through real SSO, backed by a per-service VERIFY.md, with test users in staging Authentik and a manual-test handoff. Basis for ADR-017. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Design — Service-UI acceptance verification (ADR-008 Level 4)
- Date: 2026-06-05
- Status: Approved design — pending implementation plan
- Resolves: ADR-015 deferred item #2 (browser-E2E verification harness); TODO 2.2 (browser portion) + TODO 2.3 (test users + manual-test instruction)
- Expands: ADR-008 Level 4 (currently a stub)
- Becomes: ADR-017 (this design is the basis for that ADR)
Problem
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
Level 4 stub: "Claude drives a headless browser from ubongo against a deployed
service: loads the rendered UI, creates test users, exercises features, and hands the
operator a manual test script." Nothing below Level 4 actually exercises a service's
application UI — Molecule tests the role in a container, Level 2 confirms the stack
converges, Level 3 confirms public endpoints respond. None answer "does PhotoPrism
actually let me log in, upload a photo, and see a thumbnail?" (TODO 8.2).
The operator's original ask: "Claude could spin up a browser and actually see the generated service web-UIs to verify various things. Perhaps even generate test users and test features and instruct me on tests as well." That is TODO 2.2 (headless browsing) + TODO 2.3 (test-user generation + manual-test instruction).
Today Claude "sees" a browser only passively — the /screenshot skill fetches
screenshots the operator took on mamba. This harness is the active counterpart:
Claude drives the browser itself.
Decisions (the settled forks)
- Nature: Claude-driven exploratory. Claude navigates the live UI with judgment (look, click, reason about whether it works, notice anything off) rather than running deterministic scripts. This is the distinctive value; a scripted Playwright regression suite is explicitly not built here.
- Mode: interactive, Claude-in-the-loop. Follows from the Nature decision: exploratory judgment can't be a headless cron gate. Scheduled smoke-testing stays out of scope (that is a determinism job for health checks / Uptime Kuma later).
- Environment: staging, full exercise. Claude creates test users and exercises features (including destructive flows) against a staging deploy. Staging is a rebuildable sandbox, which settles the safety question: no production-data risk, no prod pollution.
- Auth: test users in Authentik (central IdP), real SSO flow. Claude's browser authenticates through Traefik + Authentik exactly as a real user would, faithfully testing the real access path.
- Structure: per-service VERIFY.md backbone + free exploration. Each service role ships an acceptance spec of critical user journeys; Claude executes it and explores beyond it. Repeatable + intent-capturing, without losing exploratory value.
Scope
In scope: the browser/UI verification harness (TODO 2.2 browser portion) + the test-user and manual-test-instruction standards (TODO 2.3) = ADR-008 Level 4.
Out of scope (siblings, noted not built): the other TODO-2.2 "live testing" methods —
API calls, curl pulls, log review. They share the spirit but are not browser work.
Also out: a scripted/CI regression suite; scheduled headless smoke checks.
Architecture, mechanism, and workflow placement
Mechanism. Claude drives a real Chromium on ubongo via the playwright Claude
Code plugin (already earmarked in claude-code-setup.md, enabled when this lands).
No bespoke browser code — Claude calls the Playwright MCP tools (navigate, click, type,
screenshot, read DOM) and reasons over what it sees. Active counterpart to the passive
/screenshot-from-mamba pattern.
Orchestration. A boma skill/command — /verify-service <name> — run
interactively on ubongo. It:
- Reads the service's roles/<name>/VERIFY.md acceptance spec.
- Provisions/uses a test user in the staging Authentik.
- Drives the browser through the real SSO flow into the staging service.
- Executes the listed journeys exploratorily (judging pass/fail, screenshotting key states) and free-explores.
- Writes a dated verification report with linked screenshots.
- Emits a manual-test checklist for anything it couldn't do.
Pipeline placement. Level 4 runs after Level 2 (staging deploy) and before
production promotion:
build role → molecule (L1) → staging deploy (L2) → /verify-service (L4) → promote.
It reaches the staging service over the LAN from ubongo (services on srv; resolved
via boma DNS), through Traefik + Authentik as a real user would.
Boundaries (one unit, clear interface): the skill orchestrates; VERIFY.md
declares intent (per service); Authentik provides identity; the report captures
results. Each is independently understandable and swappable.
The VERIFY.md standard
Every service role ships a populated roles/<service>/VERIFY.md, copied from a new
template docs/testing/service-verify-template.md — parallel to how each role ships
SECURITY.md from service-security-template.md. It becomes a role convention
(every service role must have a populated VERIFY.md).
Contents:
- Critical user journeys — the acceptance criteria that define "working" for this service (e.g. PhotoPrism: SSO login → library loads → upload a test image → thumbnail generates → search finds it).
- What good looks like — states/screenshots to confirm.
- Not browser-verifiable — items to route to the manual-test handoff (hardware, paid/external flows, subjective quality).
/verify-service reads roles/<name>/VERIFY.md, executes those journeys, and explores
beyond them.
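As a rough illustration of how a skill might consume such a spec, the sketch below groups the bullet items of a VERIFY.md under their section headings. The `## ` heading layout, the use of the three content titles as section names, the sample text, and the `parse_verify_md` helper are all assumptions made for illustration; the template itself is not defined in this design.

```python
def parse_verify_md(text: str) -> dict[str, list[str]]:
    """Group '- ' bullet items under the nearest preceding '## ' heading.

    Hypothetical layout: this design does not fix the template's markup.
    """
    sections: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections.setdefault(current, [])  # keep empty sections visible
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].strip())
    return sections


# Illustrative sample only; not a real service spec.
SAMPLE = """\
# VERIFY: photoprism
## Critical user journeys
- SSO login
- Library loads
- Upload a test image
## What good looks like
- Thumbnail is visible after upload
## Not browser-verifiable
- Subjective image quality
"""

spec = parse_verify_md(SAMPLE)
```

Under these assumptions, the items under "Not browser-verifiable" would feed straight into the manual-test handoff, while the journeys drive the browser run.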
Test-user generation standard (TODO 2.3)
Test identities are provisioned in the staging Authentik (never the production IdP — test accounts must not exist in prod):
- Convention: a dedicated test group / naming prefix (e.g. test-<service>@…) so accounts are identifiable and bulk-removable.
- Credentials: ephemeral, generated per run (staging is rebuildable); held only for the run. No test creds in vault.yml.
- Idempotent: reuse-or-create.
- Teardown: primary teardown is the staging rebuild (sandbox); the skill also offers explicit cleanup of the test group.
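The naming, ephemeral-credential, and reuse-or-create rules above can be sketched in a few lines. This is a minimal sketch, not the provisioning automation: the `StagingIdentity` type and both helpers are hypothetical, the in-memory `existing` dict merely stands in for a lookup against the staging Authentik, and the address part after `@` is left out because the design elides it.

```python
import secrets
from dataclasses import dataclass

TEST_PREFIX = "test-"  # naming prefix from the convention above


@dataclass(frozen=True)
class StagingIdentity:
    username: str
    password: str  # ephemeral: held in memory for the run, never written to vault.yml


def make_test_identity(service: str) -> StagingIdentity:
    """Build a test-<service> identity with a freshly generated password."""
    if not service or not service.replace("-", "").isalnum():
        raise ValueError(f"unexpected service name: {service!r}")
    return StagingIdentity(
        username=f"{TEST_PREFIX}{service}",
        password=secrets.token_urlsafe(24),
    )


def reuse_or_create(existing: dict[str, StagingIdentity], service: str) -> StagingIdentity:
    """Idempotent: return the identity already known for this run, else create it.

    `existing` is a stand-in for querying the staging IdP (assumption).
    """
    username = f"{TEST_PREFIX}{service}"
    if username not in existing:
        existing[username] = make_test_identity(service)
    return existing[username]
```

Calling `reuse_or_create` twice with the same store and service returns the same object, which is the idempotency property the standard asks for; dropping the store at the end of the run discards the credentials.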
Reporting & manual-test handoff
- Report: /verify-service writes docs/testing/reviews/YYYY-MM-DD-<service>.md (plus latest.md), mirroring /review-repo → docs/reviews/ and /capacity-review → docs/hardware/reviews/. It contains pass/fail per VERIFY.md journey, observations, the test-user/env used, a verdict, and the manual-test checklist. The committed markdown is the durable artifact.
- Screenshots: saved to a git-ignored dir on ubongo (PNGs would bloat the repo); the report links them and inlines only a few key evidence shots.
- Manual-test handoff (TODO 2.3): anything Claude can't do (physical device, paid/external flow, subjective judgment) becomes a structured checklist in the report (numbered steps, expected result, why handed off). The operator runs them and reports back. This is the "instruct me on tests" half of the vision, as a first-class output.
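To make the report naming and the checklist shape concrete, here is a small sketch. The helper names, the `ManualTest` fields, and the exact markdown layout of the checklist are assumptions; placing latest.md in the same directory as the dated report is also an assumption, since the design only says "plus latest.md".

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath

REVIEWS_DIR = PurePosixPath("docs/testing/reviews")


def report_paths(service: str, run_date: date) -> tuple[PurePosixPath, PurePosixPath]:
    """Return (dated report, latest.md) following YYYY-MM-DD-<service>.md."""
    dated = REVIEWS_DIR / f"{run_date.isoformat()}-{service}.md"
    return dated, REVIEWS_DIR / "latest.md"  # latest.md location is assumed


@dataclass
class ManualTest:
    steps: list[str]
    expected: str
    why_handed_off: str


def render_checklist(items: list[ManualTest]) -> str:
    """Render handed-off tests as a checklist with numbered steps."""
    lines: list[str] = []
    for index, item in enumerate(items, start=1):
        lines.append(f"- [ ] Manual test {index} (handed off: {item.why_handed_off})")
        for number, step in enumerate(item.steps, start=1):
            lines.append(f"  {number}. {step}")
        lines.append(f"  Expected: {item.expected}")
    return "\n".join(lines)
```

For example, `report_paths("photoprism", date(2026, 6, 5))` yields `docs/testing/reviews/2026-06-05-photoprism.md` alongside `docs/testing/reviews/latest.md`.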
Safety
Even though staging is a sandbox:
- Staging-only guard. The skill refuses to run against production (verifies it is pointed at the staging environment/inventory before acting) — an ADR-002-aligned hard stop, since exploratory clicking is destructive by nature.
- Confined blast radius. Test users live only in the staging test group; the run sticks to the target service.
- No secrets leaked. Screenshots can capture on-screen tokens/credentials, so the git-ignored screenshot dir is also the safety boundary (evidence isn't committed by default), and the skill avoids capturing credential screens.
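One way the staging-only guard could look is a fail-closed check run before any browser action. This is a sketch under stated assumptions: the design does not say how staging is identified, so the `staging` inventory directory, the hostname suffix, the example paths and URLs, and the `require_staging` name are all hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


class NotStagingError(RuntimeError):
    """Raised when the run is not clearly pointed at staging."""


def require_staging(inventory_path: str, target_url: str, staging_suffix: str) -> None:
    """Hard stop unless both the inventory and the target host look like staging.

    Fails closed: anything ambiguous is treated as not-staging.
    """
    if not staging_suffix:
        raise NotStagingError("no staging hostname suffix configured")
    # Assumption: staging inventories live under a directory named 'staging'.
    if "staging" not in PurePosixPath(inventory_path).parts:
        raise NotStagingError(f"inventory is not staging: {inventory_path}")
    # Assumption: staging hostnames share a recognizable suffix.
    host = urlparse(target_url).hostname or ""
    if not host.endswith(staging_suffix):
        raise NotStagingError(f"target host is not staging: {host or target_url}")


# Passes silently when both signals agree (illustrative values only).
require_staging(
    "inventories/staging/hosts.yml",
    "https://photoprism.staging.example/",
    ".staging.example",
)
```

Requiring two independent signals (inventory and hostname) means a single misconfigured value is not enough to let an exploratory run reach production.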
Documentation & implementation changes
This is a substantial capability → its own ADR-017, with reconciliations:
| Doc / artifact | Change |
|---|---|
| ADR-017 (new) | Home of record: harness, the five settled forks, VERIFY.md standard, test-user + manual-handoff standards, safety. |
| ADR-008 (testing) | Expand the Level 4 stub into the full definition; link ADR-017. |
| docs/testing/service-verify-template.md (new) | The VERIFY.md template (parallels service-security-template.md). |
| .claude/commands/verify-service.md (new) | The /verify-service <name> orchestrating skill. |
| CLAUDE.md | Role conventions: every service role must ship a populated VERIFY.md. Further reading: ADR-017. |
| docs/security/service-checklist.md | Add "passed Level 4 (/verify-service)" to the pre-production service-clearance gate. |
| .gitignore + docs/testing/reviews/ | Ignore the screenshot dir; create the reviews dir (README/.gitkeep). |
| STATUS.md | Row: Level 4 verification (skill + template authorable; running deferred). |
| docs/TODO.md | Mark 2.2 (browser portion) + 2.3 addressed by ADR-017; note API/curl/log siblings remain. |
| make new-role scaffold | Scaffold VERIFY.md into new service roles (when that scaffold is next touched). |
Buildable now (no ubongo/Authentik/staging needed): ADR-017, the ADR-008
expansion, the VERIFY.md template, the /verify-service skill logic, the convention +
checklist + Further-reading edits, .gitignore/dir, STATUS/TODO. This spec yields real
working artifacts immediately — the skill and standards exist and are reviewable; only
the live run waits on the stack.
Deferred (needs the stack): actually running it (ubongo + playwright plugin +
Authentik + a staging deploy); the Authentik test-user provisioning automation;
per-service VERIFY.md files (need the service roles, which don't exist yet).
Dependencies
- ubongo (ADR-015): the host that runs the browser. Designed, not built.
- playwright Claude Code plugin: enabled when this lands (claude-code-setup.md).
- Authentik (CAPABILITIES §2, planned): central IdP for test users + SSO.
- A staging environment with the service deployed (ADR-008 Level 2): staging is currently empty stubs.
What was ruled out
| Option | Reason |
|---|---|
| Scripted Playwright regression suite | The operator wants exploratory judgment, not deterministic scripts; scripts add authoring/maintenance burden. A scripted layer could come later but is not this. |
| Scheduled headless smoke gate (cron) | Needs determinism, which the exploratory nature excludes; that role belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. Production gets non-destructive checks elsewhere, not here. |
| Free-form exploration with no per-service spec | Flexible but non-repeatable and can miss a service's critical flow; VERIFY.md gives a backbone while keeping free exploration. |
| Staging bypasses SSO / per-app local users | Wouldn't exercise the real Traefik+Authentik access path; central test users in Authentik are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on ubongo, markdown report committed. |
See also: ADR-008 (testing — expanded), ADR-015 (control host — runs the browser),
ADR-002 (security), ADR-004 (one service = one role — VERIFY.md parallels
SECURITY.md), ADR-013/014 (heritage / knowledge sourcing).