
Multi-Agent Judging for LLM Evaluation

Concordance with Human Preferences Across Capability Gaps

Anthony Boisbouvier — agent-clash.ai

Submitted to DMLR (Data-centric Machine Learning Research), 2026

A panel of 3 AI judges agrees with human evaluators 88% of the time when comparing models of different quality levels, and 76% when comparing top-tier models that are closely matched — on par with how often human evaluators agree with each other. All of this at a cost of ~$0.14 per evaluation.

  • 88.0% agreement with humans on models with large quality gaps
  • 76.0% agreement on top-tier models, where even humans struggle to agree
  • 91.0% reproducibility: same results when you run it again
  • $0.14 per evaluation: 360 evaluations for ~$51

How It Works

Agent Clash evaluates candidate models through a 6-stage pipeline designed for blind, reproducible, multi-judge evaluation:

1. User submits a prompt
2. Generate K responses via OpenRouter
3. Anonymize — strip IDs, shuffle, label A0…AK
4. Dynamic Criteria — 3-5 task-specific rubrics
5. Parallel Judging — GPT-5.2-pro, Claude Opus 4.5, Gemini 2.5 Pro
6. Aggregate via Borda Count → Winner + Confidence
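
As a concrete illustration of stages 3 and 6, the sketch below shows one way the anonymization and Borda-count aggregation could be written in Python. The function names, data shapes, and the vote-based confidence heuristic are assumptions for illustration, not the production code.

```python
import random
from collections import defaultdict

def anonymize(responses):
    """Stage 3: strip model IDs, shuffle, and relabel responses as A0, A1, ..."""
    shuffled = random.sample(responses, k=len(responses))      # responses: [(model_id, text), ...]
    key = {f"A{i}": model_id for i, (model_id, _) in enumerate(shuffled)}
    anonymized = [(f"A{i}", text) for i, (_, text) in enumerate(shuffled)]
    return anonymized, key                                     # key is kept aside for de-anonymization

def borda_aggregate(rankings):
    """Stage 6: Borda count over per-judge rankings (each a best-to-worst list of labels)."""
    scores = defaultdict(int)
    for ranking in rankings:
        for position, label in enumerate(ranking):
            scores[label] += len(ranking) - 1 - position       # top spot earns K-1 points, last earns 0
    winner = max(scores, key=scores.get)
    first_place_votes = sum(r[0] == winner for r in rankings)  # 3-0 vs 2-1 acts as a confidence signal
    return winner, first_place_votes / len(rankings), dict(scores)

# Illustrative run: three hypothetical judge rankings over three anonymized responses.
judge_rankings = [["A1", "A0", "A2"], ["A1", "A2", "A0"], ["A0", "A1", "A2"]]
winner, confidence, scores = borda_aggregate(judge_rankings)
print(winner, confidence, scores)  # A1 0.666... {'A1': 5, 'A0': 3, 'A2': 1}
```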

Four Contributions

1. AI Judges Are More Accurate When the Gap Between Models Is Larger

We tested on two well-known benchmarks: MT-Bench (a standard dataset where models vary widely in quality) and Chatbot Arena (real-world user conversations where top models are very close in ability). When models differ clearly, our AI judges agree with human evaluators 88.0% of the time — better than a single GPT-4 judge (85%) and even better than how often two human evaluators agree with each other (81%). When the task is harder (top-tier models that are nearly equal), agreement is 76.0%, which falls within the 72-83% range at which human crowd and expert raters agree with each other on the same kind of close calls.

Comparison with Prior Work

Comparison                 Agreement
Human vs Human             81%
GPT-4 single judge         85%
Agent Clash (MT-Bench)     88%
Crowd-Expert (Arena)       72-83%
Agent Clash (Arena)        76%

Cohen's κ — A Standard Measure of Agreement

Slight           < 0.20
Fair             0.21-0.40
Moderate         0.41-0.60
Substantial      0.61-0.80
Almost Perfect   0.81-1.00

Cohen's κ (kappa) measures agreement beyond random chance — it's the gold standard in research for comparing raters. Our κ = 0.760 on models with large quality gaps falls in Substantial Agreement. On closely-matched top models, κ = 0.520 is Moderate — expected since even human experts disagree more on close calls.
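
As a reference for how this number is computed, here is a small, self-contained sketch of Cohen's κ for two raters over the same set of pairwise verdicts; the example labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters would pick the same label independently.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Illustrative verdicts on 10 pairwise comparisons: panel winner vs human-preferred winner.
panel  = ["A", "B", "A", "A", "B", "A", "B", "B", "A", "A"]
humans = ["A", "B", "A", "B", "B", "A", "B", "A", "A", "A"]
print(round(cohens_kappa(panel, humans), 3))  # 0.583
```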

2. Run It Twice, Get the Same Answer (91% of the Time)

We ran the exact same evaluation twice independently. The results matched 91.0% of the time — meaning the system is highly reproducible, not random. A third run on 60 cases confirmed:

  • 95% of persistent "errors" (cases where AI judges disagree with humans) are consistent across all runs — they're genuine hard cases, not random mistakes
  • Only 9% of evaluations are truly unpredictable (like a coin flip)
  • Running the evaluation 3 times and taking a majority vote does not improve accuracy — confirming that the remaining disagreements are inherently subjective, not fixable with more runs

Breakdown: 72% consistently correct, 19% consistent disagreements with humans (genuinely ambiguous cases), 9% random noise.
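
This breakdown can be reproduced mechanically from per-case verdicts. Below is a sketch of the bucketing logic, assuming each run produces one winner per case and a human reference label is available; variable names and the example data are illustrative.

```python
def classify_cases(run_verdicts, human_labels):
    """Bucket each case as consistently correct, consistently wrong, or noisy across runs.

    run_verdicts: one list of per-case winners per run (all in the same case order).
    human_labels: the human-preferred winner for each case.
    """
    correct = wrong = noisy = 0
    for case_idx, human in enumerate(human_labels):
        verdicts = {run[case_idx] for run in run_verdicts}
        if len(verdicts) > 1:
            noisy += 1        # runs disagree with each other: random noise
        elif human in verdicts:
            correct += 1      # every run agrees, and matches the human label
        else:
            wrong += 1        # every run agrees on the same non-human verdict
    n = len(human_labels)
    return correct / n, wrong / n, noisy / n

# Illustrative: 5 cases evaluated in 3 independent runs.
runs   = [["A", "B", "A", "B", "A"],
          ["A", "B", "A", "A", "A"],
          ["A", "B", "A", "B", "A"]]
humans =  ["A", "A", "A", "B", "A"]
print(classify_cases(runs, humans))  # (0.6, 0.2, 0.2)
```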

3. When All 3 Judges Agree, They're Almost Always Right

Across all 239 evaluations, we compared what happens when all 3 AI judges agree (unanimous) vs. when they split 2-to-1:

  • Unanimous (3-0): 84.9% agreement with humans (192 evals · 80% of cases)
  • Split (2-1): 63.8% agreement with humans (47 evals · 20% of cases)

The 21-point accuracy gap between unanimous and split decisions gives you a built-in confidence score that a single judge can never provide. In practice: when all 3 judges agree, you can trust the result. When they disagree, it flags the evaluation for human review — saving you time by only involving humans where it actually matters.
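
In code, that triage rule is a few lines. The sketch below assumes each judge emits a single winner per comparison; the routing labels are illustrative, not part of the Agent Clash API.

```python
from collections import Counter

def triage(judge_verdicts):
    """Route an evaluation by judge consensus: unanimous -> accept, split -> human review."""
    winner, votes = Counter(judge_verdicts).most_common(1)[0]
    if votes == len(judge_verdicts):
        return winner, "accept"             # 3-0 verdicts agreed with humans ~85% of the time
    return winner, "needs_human_review"     # 2-1 verdicts agreed only ~64% of the time

print(triage(["A1", "A1", "A1"]))  # ('A1', 'accept')
print(triage(["A1", "A0", "A1"]))  # ('A1', 'needs_human_review')
```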

4. AI Judges Don't Cheat — No Bias, No Self-Favoritism

No bias toward "better" models: A common concern is that AI judges might systematically favor well-known or higher-ranked models. We tested this: when models are closely matched, AI judges picked the supposedly "stronger" model only 45.8% of the time in disagreements — essentially random. They judge based on actual response quality, not reputation.

No self-favoritism: Can GPT fairly judge GPT's own output? We tested 252 cases where an AI model was judging its own responses (without knowing it). Models ranked themselves first only 52.8% of the time vs. 59.9% expected — actually showing a slight anti-self-bias. Bottom line: under blind conditions, AI judges don't favor themselves.
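
One way to run this check is to measure how often a judge's top pick coincides with the anonymized label hiding its own model. The sketch below computes only the observed rate; the expected baseline (59.9% in our data) would come from how often other judges rank that same model first. Field names and the example records are illustrative.

```python
def self_preference_rate(cases):
    """Fraction of blind cases where a judge's top pick is the label hiding its own model."""
    hits = sum(case["top_pick"] == case["self_label"] for case in cases)
    return hits / len(cases)

# Illustrative records: which anonymized label the judge ranked first, and which
# label (unknown to the judge) actually belonged to the judge's own model.
cases = [
    {"top_pick": "A0", "self_label": "A0"},
    {"top_pick": "A1", "self_label": "A0"},
    {"top_pick": "A2", "self_label": "A2"},
]
observed = self_preference_rate(cases)
print(round(observed, 3))  # 0.667 -- compare against the expected baseline (0.599 in our data)
```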

Why 3 Judges Instead of 1?

Using a panel of 3 different AI judges (GPT, Claude, Gemini) matches the accuracy of the single best judge — without you having to guess which judge is best for your specific task:

Judge              MT-Bench   Arena    Pooled
GPT-5.2-pro        88.9%      74.2%    81.8%
Claude Opus 4.5    87.9%      76.3%    82.3%
Gemini 2.5 Pro     83.8%      69.9%    77.1%
Panel (2-of-3)     88.9%      75.3%    82.3%

Key insight: The best individual judge changes depending on the task (GPT wins on one dataset, Claude on another). You can't know in advance which one to trust. The 3-judge panel always matches the best one — and protects you from accidentally relying on the worst (+5.2 points of safety margin).

Cross-Experiment Summary

Metric                         Different quality levels       Top-tier models only
Models tested                  6 (large quality gaps)         25 (best models, very close)
Total evaluations              100                            260 (3 independent runs)
Agreement with humans          88.0%                          76.0%
Cohen's κ (agreement score)    0.760 (Substantial)            0.520 (Moderate)
Human-human agreement          81% (expert)                   72-83% (crowd)
Cost/eval                      $0.129                         $0.149
Test-retest                                                   91.0%

The 12-point drop from 88% to 76% is expected and statistically confirmed: it's harder to pick a winner when all models are very good. This mirrors exactly what happens with human evaluators — they also agree less on close calls.
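
For readers who want to check that the 12-point gap is larger than sampling noise, a standard two-proportion z-test on the reported counts gives a z-score around 2.5 (p < 0.05, two-sided). The sketch below is our illustration of that check, not necessarily the exact test used internally.

```python
from math import sqrt

def two_proportion_z(p1, n1, p2, n2):
    """z-score for the difference between two independent proportions (pooled standard error)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Reported agreement rates: 88.0% over 100 evals (large gaps) vs 76.0% over 260 evals (top-tier).
z = two_proportion_z(0.88, 100, 0.76, 260)
print(round(z, 2))  # 2.52 -- above the 1.96 cutoff, so the drop is unlikely to be sampling noise
```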

Try It Yourself

Agent Clash is free and open. Bring your own API key, run your own benchmarks.

Launch Agent Clash · View on GitHub

Resources