Purpose
Public evaluation should show whether a model-backed workflow is dependable enough for enterprise review, not merely whether it can answer a clever prompt. The workbench packages config-driven datasets, adapters, testable failure categories, operational metrics, and reviewable reports.
The first release provides a benign synthetic readiness suite and a deterministic adapter that validates the pipeline shape. Provider-backed comparison runs are the next reporting milestone.
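As an illustration of what a deterministic adapter buys you, here is a minimal sketch. The names `Case`, `MockAdapter`, `complete`, and `run_case`, along with the failure labels, are hypothetical and not taken from the workbench. The only point is that a fixed prompt always yields the same response, so the pipeline shape can be validated without calling a provider.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One synthetic test case (hypothetical schema, for illustration only)."""
    case_id: str
    prompt: str
    required_keys: tuple


class MockAdapter:
    """Deterministic stand-in for a provider-backed adapter (assumed interface)."""

    name = "mock"

    def complete(self, prompt: str) -> str:
        # Derive a stable token from the prompt so repeated runs are identical.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]
        return json.dumps({"answer": f"stub-{digest}", "needs_human_review": True})


def run_case(adapter, case: Case) -> dict:
    """Run one case and check that the response is a JSON object with the required keys."""
    raw = adapter.complete(case.prompt)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"case_id": case.case_id, "passed": False, "failure": "invalid_json"}
    if not isinstance(parsed, dict):
        return {"case_id": case.case_id, "passed": False, "failure": "schema_violation"}
    missing = [key for key in case.required_keys if key not in parsed]
    if missing:
        return {"case_id": case.case_id, "passed": False, "failure": "schema_violation"}
    return {"case_id": case.case_id, "passed": True, "failure": None}
```

Because the response depends only on the prompt, two runs over the same dataset should produce identical results, which makes any difference between runs attributable to the pipeline rather than to the model.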
Evaluation dimensions
- Capability and reliability: task completion, schema following, and instruction stability.
- Governance and security reasoning: appropriate human-review routing, prompt-injection handling, confidentiality, and access control.
- Groundedness and operations: evidence-bounded answers, latency, and estimated cost.
Artifacts
- A runnable regulated-enterprise readiness demo command and dataset.
- Markdown and JSON output with pass rate, cost, latency, and explicit failure categories.
- Published methodology, limitations, and provider-policy notes.
- A static leaderboard surface for versioned provider-backed comparisons.
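The Markdown and JSON output listed above can be pictured as an aggregation over per-case results. The following is a sketch under stated assumptions, not the workbench's report code: the field names (`passed`, `failure`, `latency_s`, `cost_usd`) and both functions are hypothetical, and cost is assumed to be summed across cases.

```python
import json
from collections import Counter


def summarize(results):
    """Aggregate per-case results into a report summary (hypothetical field names)."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    failures = Counter(r["failure"] for r in results if not r["passed"])
    return {
        "cases": total,
        "pass_rate": passed / total if total else 0.0,
        "avg_latency_s": sum(r["latency_s"] for r in results) / total if total else 0.0,
        # Assumption: cost is reported as a total over all cases.
        "total_cost_usd": sum(r["cost_usd"] for r in results),
        "failure_categories": dict(failures),
    }


def to_markdown_row(adapter, suite, date, summary):
    """Render one summary as a Markdown table row."""
    return (
        f"| {adapter} | {suite} | {date} | {summary['cases']} | "
        f"{summary['pass_rate']:.0%} | {summary['avg_latency_s']:.2f}s avg | "
        f"${summary['total_cost_usd']:.5f} |"
    )


def to_json(summary):
    """Serialize the same summary for machine consumption."""
    return json.dumps(summary, indent=2, sort_keys=True)
```

Keeping one summary dictionary as the source for both formats means the Markdown and JSON reports cannot drift apart.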
Initial public surface
The single row below comes from the deterministic mock adapter. It validates the pipeline and is not a frontier-model ranking.
| Adapter | Suite | Date | Cases | Pass rate | Avg latency | Cost (USD) |
|---|---|---|---|---|---|---|
| mock | regulated readiness demo | 2026-05-25 | 8 | 100% | 0.34 s | $0.00476 |
Public safety posture
The methodology draws on PAEF atomic contract-compliance evaluation, which breaks complex governance tasks into checkable requirements, exposes failure modes, and produces evidence suitable for review.
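To make the idea of checkable requirements concrete, here is a minimal sketch. It is not the PAEF implementation: the `Requirement` type, the three example requirements, and the response fields (`needs_human_review`, `evidence`, `answer`) are all hypothetical. It only shows how one governance task can be split into independent checks whose individual outcomes are kept for review.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Requirement:
    """One atomic, independently checkable requirement (hypothetical type)."""
    req_id: str
    description: str
    check: Callable[[dict], bool]


# Illustrative requirements only; a real suite would define its own.
REQUIREMENTS = [
    Requirement("R1", "Response routes the request to human review",
                lambda r: r.get("needs_human_review") is True),
    Requirement("R2", "Response cites at least one piece of evidence",
                lambda r: len(r.get("evidence", [])) > 0),
    Requirement("R3", "Response does not mention a restricted field",
                lambda r: "ssn" not in r.get("answer", "").lower()),
]


def evaluate(response, requirements=REQUIREMENTS):
    """Check every requirement and keep per-requirement outcomes as evidence."""
    outcomes = {req.req_id: bool(req.check(response)) for req in requirements}
    return {
        "passed": all(outcomes.values()),
        "failed_requirements": [rid for rid, ok in outcomes.items() if not ok],
        "outcomes": outcomes,
    }
```

A case passes only when every requirement holds, and the list of failed requirement IDs gives a reviewer the specific failure modes rather than a single opaque score.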