Research question
Can atomic policy-level evaluation improve the auditability and diagnostic quality of LLM contract-compliance review compared with a single monolithic auditing pass?
The study evaluates microagent atomic policy checks against monolithic auditing across a labeled service-contract corpus and multiple compact models.
Study design
Corpus: 193 service contracts and 7,913 labeled policy checks.
Models: gpt-4.1-nano, gpt-4o-mini, and gpt-5-nano.
Comparison: microagent-based atomic checks versus a monolithic LLM auditor.
The published study reports that microagent evaluation outperformed monolithic auditing across all three evaluated models.
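As a rough illustration of the comparison above, atomic evaluation can be sketched as one independent call per policy, run in parallel, with one labeled result per check. This is a minimal sketch under stated assumptions, not the study's implementation: `CheckResult`, `run_atomic_checks`, and `keyword_stub` are hypothetical names, and the stub is a keyword match standing in for a model call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CheckResult:
    policy_id: str
    label: str  # "Non" (non-compliant) or "Com" (compliant)


def run_atomic_checks(
    contract_text: str,
    policies: Dict[str, str],
    check_fn: Callable[[str, str], str],
    max_workers: int = 4,
) -> List[CheckResult]:
    """Evaluate each policy independently against one contract.

    check_fn(contract_text, policy_text) is a hypothetical per-policy
    evaluator; in practice it would wrap a single model call.
    """
    items = list(policies.items())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so labels line up with items.
        labels = list(pool.map(lambda kv: check_fn(contract_text, kv[1]), items))
    return [CheckResult(pid, label) for (pid, _), label in zip(items, labels)]


def keyword_stub(contract_text: str, policy_text: str) -> str:
    # Placeholder evaluator (assumption): a real check would query a model.
    return "Com" if policy_text.lower() in contract_text.lower() else "Non"


contract = "Either party may terminate with 30 days notice. Data is encrypted at rest."
policies = {"P1": "30 days notice", "P2": "annual audit"}
results = run_atomic_checks(contract, policies, keyword_stub)
```

Because each result carries its own policy identifier, a reviewer can trace a disagreement to a single check instead of re-reading one long monolithic audit response.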
Token-level margin analysis
The method extracts a margin between non-compliance and compliance token likelihoods:
Delta = l(Non) - l(Com) and P(Non) = sigmoid(Delta)
This supports confidence analysis, disagreement review, salvage decisions, and audit-ready diagnostics rather than relying only on a final label.
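The margin computation can be sketched in a few lines. This is a hedged example, assuming l(Non) and l(Com) are the model's logits (or log-likelihoods) for the two label tokens; the function names and the review threshold of 1.0 are hypothetical and do not come from the study.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function for large positive or negative x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def margin_to_probability(l_non: float, l_com: float) -> tuple:
    """Return (Delta, P(Non)) where Delta = l(Non) - l(Com)."""
    delta = l_non - l_com
    return delta, sigmoid(delta)


def needs_review(delta: float, threshold: float = 1.0) -> bool:
    # Hypothetical rule: a small absolute margin marks a low-confidence check.
    return abs(delta) < threshold


delta, p_non = margin_to_probability(2.0, 0.0)
```

A margin near zero gives P(Non) close to 0.5, so checks with small |Delta| are natural candidates for disagreement review.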
Connection to model readiness
The Parallelized Atomic Evaluation Framework (PAEF) provides the evaluation pattern behind the LLM Evaluation Workbench: decompose governance-heavy tasks into atomic checks, compare models and evaluation strategies, track uncertainty, and create reviewable artifacts.
Publication
Parallelized Atomic Evaluation Framework (PAEF) for Contract Compliance: A Multi-Contract, Multi-Model Study with Token-Level Margin Analysis.
Champion, Cody; O'Kane, Alan; Prunty, Peter. Zenodo, 2026. DOI: 10.5281/zenodo.19848867.