# Design: Service-UI acceptance verification (ADR-008 Level 4)

- **Date:** 2026-06-05
- **Status:** Approved design, pending implementation plan
- **Resolves:** ADR-015 deferred item #2 (browser-E2E verification harness); TODO 2.2
  (browser portion) + TODO 2.3 (test users + manual-test instruction)
- **Expands:** ADR-008 Level 4 (currently a stub)
- **Becomes:** ADR-017 (this design is the basis for that ADR)

---

## Problem

ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
**Level 4 stub**: "Claude drives a headless browser from `ubongo` against a deployed
service: loads the rendered UI, creates test users, exercises features, and hands the
operator a manual test script." Nothing below Level 4 actually exercises a service's
**application UI**: Molecule tests the role in a container, Level 2 confirms the stack
converges, and Level 3 confirms public endpoints respond. None of them answers "does
PhotoPrism actually let me log in, upload a photo, and see a thumbnail?" (TODO 8.2).

The operator's original ask: *"Claude could spin up a browser and actually see the
generated service web-UIs to verify various things. Perhaps even generate test users
and test features and instruct me on tests as well."* That is TODO 2.2 (headless
browsing) + TODO 2.3 (test-user generation + manual-test instruction).

Today Claude "sees" a browser only **passively**, through the `/screenshot` skill, which
fetches screenshots the operator took on `mamba`. This harness is the **active**
counterpart: Claude drives the browser itself.

## Decisions (the settled forks)

1. **Nature: Claude-driven exploratory.** Claude navigates the live UI with judgment
   (look, click, reason about whether it works, notice anything off), not deterministic
   scripts. This is the distinctive value; a scripted Playwright regression suite is
   explicitly *not* built here.
2. **Mode: interactive, Claude-in-the-loop.** Follows from #1: exploratory judgment
   can't be a headless cron gate. Scheduled smoke-testing stays out of scope (it needs
   determinism, which is a job for health checks / Uptime Kuma later).
3. **Environment: staging, full exercise.** Claude creates test users and exercises
   features (including destructive flows) against a *staging* deploy. Staging is a
   rebuildable sandbox, so this resolves safety: no production-data risk, no prod
   pollution.
4. **Auth: test users in Authentik (central IdP), real SSO flow.** Claude's browser
   authenticates through Traefik + Authentik exactly as a real user would, faithfully
   testing the real access path.
5. **Structure: per-service `VERIFY.md` backbone + free exploration.** Each service
   role ships an acceptance spec of critical user journeys; Claude executes it *and*
   explores beyond it. This keeps runs repeatable and intent-capturing without losing
   exploratory value.

## Scope

In scope: the **browser/UI** verification harness (TODO 2.2 browser portion) + the
**test-user** and **manual-test-instruction** standards (TODO 2.3) = ADR-008 **Level 4**.

Out of scope (siblings, noted but not built): the other TODO-2.2 "live testing" methods,
namely API calls, `curl` pulls, and log review. They share the spirit but are not browser
work. Also out: a scripted/CI regression suite; scheduled headless smoke checks.

---

## Architecture, mechanism, and workflow placement

**Mechanism.** Claude drives a real Chromium on `ubongo` via the **`playwright` Claude
Code plugin** (already earmarked in `claude-code-setup.md`, enabled when this lands).
No bespoke browser code: Claude calls the Playwright MCP tools (navigate, click, type,
screenshot, read DOM) and reasons over what it sees. This is the active counterpart to
the passive `/screenshot`-from-`mamba` pattern.

**Orchestration.** A boma skill/command, **`/verify-service <name>`**, run
interactively on `ubongo`. It:

1. Reads the service's `roles/<name>/VERIFY.md` acceptance spec.
2. Provisions or reuses a test user in the **staging** Authentik.
3. Drives the browser through the real SSO flow into the staging service.
4. Executes the listed journeys exploratorily (judging pass/fail, screenshotting key
   states) and free-explores.
5. Writes a dated verification report with linked screenshots.
6. Emits a manual-test checklist for anything it couldn't do.

**Pipeline placement.** Level 4 runs after Level 2 (staging deploy) and before
production promotion:
`build role → molecule (L1) → staging deploy (L2) → /verify-service (L4) → promote`.
It reaches the staging service over the LAN from `ubongo` (services on `srv`; resolved
via boma DNS), through Traefik + Authentik as a real user would.

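
The ordering in the pipeline line can be read as a simple gate: a stage may run only once every stage before it has completed, so promotion is unreachable without a Level 4 run. The following is a minimal sketch of that reading, not part of the design itself; the function names `next_stage` and `can_promote` are hypothetical, and only the stage labels come from the text above.

```python
from typing import List, Optional

# Stage labels copied from the pipeline line in this section.
PIPELINE: List[str] = [
    "build role",
    "molecule (L1)",
    "staging deploy (L2)",
    "/verify-service (L4)",
    "promote",
]


def next_stage(completed: List[str]) -> Optional[str]:
    """Return the next stage to run, or None when the pipeline is finished.

    `completed` must be a prefix of PIPELINE: stages cannot be skipped or reordered.
    """
    if completed != PIPELINE[: len(completed)]:
        raise ValueError("stages must complete in pipeline order")
    if len(completed) == len(PIPELINE):
        return None
    return PIPELINE[len(completed)]


def can_promote(completed: List[str]) -> bool:
    """Promotion is allowed only when 'promote' is the next stage in order."""
    try:
        return next_stage(completed) == "promote"
    except ValueError:
        return False


# After the staging deploy, the next step is the Level 4 run, not promotion.
print(next_stage(PIPELINE[:3]))
print(can_promote(PIPELINE[:3]))
```
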
**Boundaries (one unit, clear interface):** the skill *orchestrates*; `VERIFY.md`
*declares intent* (per service); Authentik *provides identity*; the report *captures
results*. Each is independently understandable and swappable.

---

## The `VERIFY.md` standard

Every service role ships a populated `roles/<service>/VERIFY.md`, copied from a new
template `docs/testing/service-verify-template.md`, parallel to how each role ships
`SECURITY.md` from `service-security-template.md`. It becomes a **role convention**
(every *service* role must have a populated `VERIFY.md`).

Contents:

- **Critical user journeys:** the acceptance criteria that define "working" for this
  service (e.g. PhotoPrism: *SSO login → library loads → upload a test image →
  thumbnail generates → search finds it*).
- **What good looks like:** states/screenshots to confirm.
- **Not browser-verifiable:** items to route to the manual-test handoff (hardware,
  paid/external flows, subjective quality).

`/verify-service` reads `roles/<name>/VERIFY.md`, executes those journeys, and explores
beyond them.

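
To make the three content areas concrete, here is a rough sketch of how a spec could be split into its sections before a run. The template itself is not defined yet, so the layout used here (one `## ` heading per content area with `- ` bullets beneath) is an assumption, as are the function name `parse_verify_md` and the sample bullets other than the PhotoPrism journey quoted above.

```python
from typing import Dict, List, Optional

# Hypothetical VERIFY.md layout: one "## " heading per content area, "- " bullets beneath.
SAMPLE_VERIFY_MD = """\
# VERIFY: photoprism

## Critical user journeys
- SSO login
- Library loads
- Upload a test image
- Thumbnail generates
- Search finds it

## What good looks like
- Library grid shows the uploaded image

## Not browser-verifiable
- Subjective thumbnail quality
"""


def parse_verify_md(text: str) -> Dict[str, List[str]]:
    """Group bullet items under the nearest preceding '## ' heading."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            current = stripped[3:].strip()
            sections.setdefault(current, [])  # keep empty sections visible
        elif stripped.startswith("- ") and current is not None:
            sections[current].append(stripped[2:].strip())
    return sections


spec = parse_verify_md(SAMPLE_VERIFY_MD)
journeys = spec["Critical user journeys"]      # executed in the browser
handoff = spec["Not browser-verifiable"]       # routed to the manual-test checklist
print(journeys)
print(handoff)
```

In the design as written, Claude reads the spec directly rather than through a parser; the sketch only illustrates which parts feed the browser run and which feed the manual-test handoff.
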
## Test-user generation standard (TODO 2.3)

Test identities are provisioned in the **staging** Authentik, never the production IdP
(test accounts must not exist in prod):

- **Convention:** a dedicated `test` group / naming prefix (e.g. `test-<service>@…`) so
  accounts are identifiable and bulk-removable.
- **Credentials:** ephemeral, generated per run (staging is rebuildable); held only for
  the run. No test creds in `vault.yml`.
- **Idempotent:** reuse-or-create.
- **Teardown:** primary teardown is the staging rebuild (sandbox); the skill also
  offers explicit cleanup of the `test` group.

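
The four rules above fit together roughly as follows. This is a sketch under stated assumptions, not the provisioning automation (which is deferred): an in-memory dictionary stands in for the staging Authentik, the names `EphemeralUser`, `reuse_or_create`, and `cleanup_test_group` are hypothetical, and rotating the password when an account is reused is one possible way to reconcile "reuse-or-create" with "generated per run". The domain part of the naming example is elided in the convention, so the sketch uses only the `test-<service>` prefix.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Tuple

TEST_GROUP = "test"  # dedicated group from the convention


@dataclass
class EphemeralUser:
    username: str
    password: str          # held in memory for the run only; never written to vault.yml
    group: str = TEST_GROUP


def reuse_or_create(directory: Dict[str, EphemeralUser], service: str) -> Tuple[EphemeralUser, bool]:
    """Return (user, created). `directory` is a stand-in for the staging IdP."""
    name = f"test-{service}"
    fresh_password = secrets.token_urlsafe(24)  # new credential every run
    if name in directory:
        user = directory[name]
        user.password = fresh_password          # assumption: rotate on reuse
        return user, False
    user = EphemeralUser(username=name, password=fresh_password)
    directory[name] = user
    return user, True


def cleanup_test_group(directory: Dict[str, EphemeralUser]) -> int:
    """Explicit teardown: remove every account in the test group, return the count."""
    doomed = [name for name, user in directory.items() if user.group == TEST_GROUP]
    for name in doomed:
        del directory[name]
    return len(doomed)


staging_idp: Dict[str, EphemeralUser] = {}
first, created_first = reuse_or_create(staging_idp, "photoprism")
second, created_second = reuse_or_create(staging_idp, "photoprism")
print(first.username, created_first, created_second)
```
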
## Reporting & manual-test handoff

- **Report:** `/verify-service` writes `docs/testing/reviews/YYYY-MM-DD-<service>.md`
  (plus `latest.md`), mirroring `/review-repo`→`docs/reviews/` and
  `/capacity-review`→`docs/hardware/reviews/`. It contains pass/fail per `VERIFY.md`
  journey, observations, the test-user/env used, a verdict, and the manual-test
  checklist. The committed markdown is the durable artifact.
- **Screenshots:** saved to a **git-ignored** dir on `ubongo` (PNGs would bloat the
  repo); the report links them and inlines only a few key evidence shots.
- **Manual-test handoff (TODO 2.3):** anything Claude can't do (physical device,
  paid/external flow, subjective judgment) becomes a **structured checklist** in the
  report (numbered steps, expected result, why handed off). The operator runs them and
  reports back. This is the "instruct me on tests" half of the vision, as a first-class
  output.

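
As an illustration of the report shape described above, the sketch below builds the dated file path and renders a per-journey table plus the manual-test checklist. Several details are assumptions rather than settled design: `latest.md` is placed in the same reviews directory, the verdict is PASS only when every journey passes, and the names `JourneyResult`, `ManualStep`, `report_paths`, and `render_report` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath
from typing import List, Tuple

REVIEWS_DIR = PurePosixPath("docs/testing/reviews")


@dataclass
class JourneyResult:
    journey: str
    passed: bool
    note: str = ""


@dataclass
class ManualStep:
    steps: str
    expected: str
    why_handed_off: str


def report_paths(service: str, run_date: date) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return (dated report path, latest.md path); latest.md location is assumed."""
    dated = REVIEWS_DIR / f"{run_date.isoformat()}-{service}.md"
    return dated, REVIEWS_DIR / "latest.md"


def render_report(service: str, run_date: date,
                  results: List[JourneyResult], manual: List[ManualStep]) -> str:
    # Assumed verdict rule: every listed journey must pass.
    verdict = "PASS" if all(r.passed for r in results) else "FAIL"
    lines = [
        f"# Verification: {service} ({run_date.isoformat()})",
        "",
        f"**Verdict:** {verdict}",
        "",
        "| Journey | Result | Observations |",
        "|---|---|---|",
    ]
    for r in results:
        lines.append(f"| {r.journey} | {'pass' if r.passed else 'fail'} | {r.note} |")
    lines += ["", "## Manual-test checklist", ""]
    for i, m in enumerate(manual, start=1):
        lines.append(
            f"{i}. {m.steps} Expected: {m.expected} Handed off because: {m.why_handed_off}"
        )
    return "\n".join(lines) + "\n"


dated_path, latest_path = report_paths("photoprism", date(2026, 6, 5))
print(dated_path)
```
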
## Safety

Even though staging is a sandbox:

- **Staging-only guard.** The skill refuses to run against production: it verifies it is
  pointed at the staging environment/inventory before acting. This is an ADR-002-aligned
  hard stop, since exploratory clicking is destructive by nature.
- **Confined blast radius.** Test users live only in the staging `test` group; the run
  sticks to the target service.
- **No secrets leaked.** Screenshots can capture on-screen tokens/credentials, so the
  git-ignored screenshot dir is also the safety boundary (evidence isn't committed by
  default), and the skill avoids capturing credential screens.

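
The staging-only guard could be as small as a check that fails closed before any browser action. The sketch below is one possible form, assuming the target is identified by an inventory path that contains a `staging` directory component; the inventory layout, the `NotStagingError` class, and the `require_staging` function are all hypothetical.

```python
from pathlib import PurePosixPath


class NotStagingError(RuntimeError):
    """Raised when the run is not provably pointed at staging."""


def require_staging(inventory: str) -> None:
    """Hard stop unless a path component is exactly 'staging'.

    Fails closed: anything that is not clearly staging (production, a typo,
    'staging-old') is refused rather than guessed at.
    """
    parts = PurePosixPath(inventory).parts
    if "staging" not in parts:
        raise NotStagingError(f"refusing to run: {inventory!r} is not a staging inventory")


# Hypothetical inventory paths, for illustration only.
require_staging("inventories/staging/hosts.yml")  # allowed, returns None

try:
    require_staging("inventories/production/hosts.yml")
except NotStagingError as exc:
    print(exc)
```
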
---

## Documentation & implementation changes

This is a substantial capability, so it gets its own ADR-017, with these reconciliations:

| Doc / artifact | Change |
|---|---|
| ADR-017 (new) | Home of record: harness, the five settled forks, `VERIFY.md` standard, test-user + manual-handoff standards, safety. |
| ADR-008 (testing) | Expand the Level 4 stub into the full definition; link ADR-017. |
| `docs/testing/service-verify-template.md` (new) | The `VERIFY.md` template (parallels `service-security-template.md`). |
| `.claude/commands/verify-service.md` (new) | The `/verify-service <name>` orchestrating skill. |
| `CLAUDE.md` | Role conventions: every *service* role must ship a populated `VERIFY.md`. Further reading: ADR-017. |
| `docs/security/service-checklist.md` | Add "passed Level 4 (`/verify-service`)" to the pre-production service-clearance gate. |
| `.gitignore` + `docs/testing/reviews/` | Ignore the screenshot dir; create the reviews dir (README/`.gitkeep`). |
| `STATUS.md` | Row: Level 4 verification (skill + template authorable; *running* deferred). |
| `docs/TODO.md` | Mark 2.2 (browser portion) + 2.3 addressed by ADR-017; note API/`curl`/log siblings remain. |
| `make new-role` scaffold | Scaffold `VERIFY.md` into new service roles (when that scaffold is next touched). |

**Buildable now** (no `ubongo`/Authentik/staging needed): ADR-017, the ADR-008
expansion, the `VERIFY.md` template, the `/verify-service` skill logic, the convention +
checklist + Further-reading edits, `.gitignore`/dir, STATUS/TODO. This spec yields real
working artifacts immediately: the skill and standards exist and are reviewable; only
the *live run* waits on the stack.

**Deferred** (needs the stack): actually running it (`ubongo` + `playwright` plugin +
Authentik + a staging deploy); the Authentik test-user provisioning automation;
per-service `VERIFY.md` files (need the service roles, which don't exist yet).

---

## Dependencies

- `ubongo` (ADR-015): the host that runs the browser. Designed, not built.
- `playwright` Claude Code plugin: enabled when this lands (`claude-code-setup.md`).
- Authentik (CAPABILITIES §2, planned): central IdP for test users + SSO.
- A staging environment with the service deployed (ADR-008 Level 2): staging is
  currently empty stubs.

---

## What was ruled out

| Option | Reason |
|---|---|
| Scripted Playwright regression suite | The operator wants exploratory judgment, not deterministic scripts; scripts add authoring/maintenance burden. A scripted layer could come later but is not this. |
| Scheduled headless smoke gate (cron) | Needs determinism, which the exploratory nature excludes; that role belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; the staging sandbox is used instead. Production gets non-destructive checks elsewhere, not here. |
| Free-form exploration with no per-service spec | Flexible but non-repeatable and can miss a service's critical flow; `VERIFY.md` gives a backbone while keeping free exploration. |
| Staging bypasses SSO / per-app local users | Wouldn't exercise the real Traefik+Authentik access path; central test users in Authentik are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; screenshots stay git-ignored on `ubongo` and only the markdown report is committed. |

See also: ADR-008 (testing, expanded), ADR-015 (control host, runs the browser),
ADR-002 (security), ADR-004 (one service = one role; `VERIFY.md` parallels
`SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).