Concordance with Human Preferences Across Capability Gaps
Submitted to DMLR (Data-centric Machine Learning Research), 2026
A panel of 3 AI judges agrees with human evaluators 88% of the time when comparing models of different quality levels, and 76% when comparing closely matched top-tier models, which is on par with how often human evaluators agree with each other. All of this comes at a cost of ~$0.14 per evaluation.
Agent Clash evaluates candidate models through a 6-stage pipeline designed for blind, reproducible, multi-judge evaluation.
We tested on two well-known benchmarks: MT-Bench (a standard dataset where models vary widely in quality) and Chatbot Arena (real-world user conversations where top models are very close in ability). When models differ clearly, our AI judges agree with human evaluators 88.0% of the time — better than a single GPT-4 judge (85%) and even better than how often two human evaluators agree with each other (81%). When the task is harder (top-tier models that are nearly equal), agreement is 76.0%, which falls right in the range where even human experts disagree (72-83%).
Cohen's κ (kappa) measures agreement beyond random chance and is the standard way to compare raters in research. Our κ = 0.760 on models with large quality gaps falls in the "substantial agreement" band of the commonly used Landis-Koch scale. On closely matched top models, κ = 0.520 is "moderate" agreement, which is expected since even human experts disagree more on close calls.
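For reference, κ can be computed directly from two raters' paired verdicts. A minimal sketch (the verdict lists here are hypothetical, not the study's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label independently,
    # estimated from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical "A wins" / "B wins" verdicts from the panel and from humans.
panel  = ["A", "A", "B", "A", "B", "B", "A", "B", "A", "A"]
humans = ["A", "A", "B", "B", "B", "B", "A", "A", "A", "A"]
print(round(cohens_kappa(panel, humans), 3))
```

With 80% raw agreement and balanced labels, κ comes out noticeably lower than the raw rate, which is exactly the chance correction doing its job.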
We ran the exact same evaluation twice independently; the results matched 91.0% of the time, showing the system is highly reproducible rather than noisy. A third run on 60 cases let us break down the remaining variation: 72% of cases were consistently correct, 19% were consistent disagreements with humans (genuinely ambiguous cases), and 9% were random noise.
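One way to produce the breakdown above from repeated runs is to classify each case by its consistency across runs and against the human verdict. A hypothetical sketch (the classification names are illustrative):

```python
def classify_case(run_verdicts, human_verdict):
    """Classify one evaluation case from repeated independent runs.

    - 'consistent-correct':  all runs agree with each other and with humans
    - 'consistent-disagree': all runs agree with each other but not with humans
                             (a genuinely ambiguous case, not noise)
    - 'noise':               the runs disagree among themselves
    """
    if len(set(run_verdicts)) > 1:
        return "noise"
    return ("consistent-correct" if run_verdicts[0] == human_verdict
            else "consistent-disagree")

print(classify_case(["A", "A", "A"], "A"))  # consistent-correct
print(classify_case(["B", "B", "B"], "A"))  # consistent-disagree
print(classify_case(["A", "B", "A"], "A"))  # noise
```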
Across all 239 evaluations, we compared what happens when all 3 AI judges agree (unanimous) vs. when they split 2-to-1.
The 21-point accuracy gap between unanimous and split decisions gives you a built-in confidence score that a single judge can never provide. In practice: when all 3 judges agree, you can trust the result. When they disagree, it flags the evaluation for human review — saving you time by only involving humans where it actually matters.
No bias toward "better" models: A common concern is that AI judges might systematically favor well-known or higher-ranked models. We tested this: when models are closely matched, AI judges picked the supposedly "stronger" model only 45.8% of the time in disagreements — essentially random. They judge based on actual response quality, not reputation.
No self-favoritism: Can GPT fairly judge GPT's own output? We tested 252 cases where an AI model was judging its own responses (without knowing it). Models ranked themselves first only 52.8% of the time vs. 59.9% expected — actually showing a slight anti-self-bias. Bottom line: under blind conditions, AI judges don't favor themselves.
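A self-favoritism check like the one above boils down to measuring how often a judge ranks its own (unlabeled) response first, then comparing that to the rate expected under no bias. A minimal sketch with made-up data:

```python
def self_first_rate(cases):
    """Fraction of blind cases where a model ranked its own response first.

    cases: list of (judge_model, top_ranked_model) pairs, where the judge
    did not know which response was its own.
    """
    self_firsts = sum(judge == top for judge, top in cases)
    return self_firsts / len(cases)

# Illustrative data only: 4 blind cases judged by "gpt".
cases = [("gpt", "gpt"), ("gpt", "claude"), ("gpt", "gpt"), ("gpt", "gemini")]
print(self_first_rate(cases))  # 0.5
```

In the study, the observed rate (52.8%) is then compared against the rate a model's overall ranking would predict (59.9%); a rate at or below expectation indicates no self-favoritism.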
Using a panel of 3 different AI judges (GPT, Claude, Gemini) matches the accuracy of the single best judge — without you having to guess which judge is best for your specific task:
| Judge | MT-Bench | Arena | Pooled |
|---|---|---|---|
| GPT-5.2-pro | 88.9% | 74.2% | 81.8% |
| Claude Opus 4.5 | 87.9% | 76.3% | 82.3% |
| Gemini 2.5 Pro | 83.8% | 69.9% | 77.1% |
| Panel (2-of-3) | 88.9% | 75.3% | 82.3% |
Key insight: The best individual judge changes depending on the task (GPT wins on one dataset, Claude on another). You can't know in advance which one to trust. The 3-judge panel always matches the best one — and protects you from accidentally relying on the worst (+5.2 points of safety margin).
| Metric | Different quality levels | Top-tier models only |
|---|---|---|
| Models tested | 6 (large quality gaps) | 25 (best models, very close) |
| Total evaluations | 100 | 260 (3 independent runs) |
| Agreement with humans | 88.0% | 76.0% |
| Cohen's κ (agreement score) | 0.760 (Substantial) | 0.520 (Moderate) |
| Human-human agreement | 81% (expert) | 72-83% (crowd) |
| Cost/eval | $0.129 | $0.149 |
| Test-retest | — | 91.0% |
The 12-point drop from 88% to 76% is expected and statistically confirmed: it's harder to pick a winner when all models are very good. This mirrors exactly what happens with human evaluators — they also agree less on close calls.
Agent Clash is free and open. Bring your own API key, run your own benchmarks.
Launch Agent Clash · View on GitHub