What is Agent Clash?

Agent Clash is a free, web-based AI model evaluation platform. Users submit a prompt to multiple large language models simultaneously, including GPT (OpenAI), Claude (Anthropic), Gemini (Google), Llama (Meta), Mistral, and Command (Cohere). For each prompt, an AI model dynamically identifies the most relevant evaluation criteria (e.g. code correctness, reasoning depth, factual accuracy); independent AI judges then evaluate each response blindly, rating it on those criteria from 1 to 5 and justifying low scores.

The platform supports multiple evaluation runs and parallel datasets to produce statistically robust, reproducible rankings. Agent Clash uses the Borda Count aggregation method — a ranked voting system resistant to outlier bias — to determine fair, balanced results.

Agent Clash is designed for developers, researchers, prompt engineers, product teams, and AI enthusiasts who want a transparent and data-driven way to benchmark AI model performance.

How Does Agent Clash Work?

  1. Prompt dispatch: Your prompt is sent simultaneously to all selected AI models via the OpenRouter API. Each model generates its response independently with the same settings.
  2. Dynamic criteria generation: An AI model analyzes your prompt and identifies 3-5 evaluation criteria most relevant to the task (e.g. code correctness, clarity of explanation, factual accuracy). Steps 1-2 are sketched in code after this list.
  3. Blind evaluation by AI judges: Independent AI judges evaluate each response without knowing which model produced it. They rank responses and rate each one on every criterion (1-5 scale), with justifications for scores of 3 or below.
  4. Supreme Court review (optional): Three elite judges (GPT-5.2, Claude Opus 4.5, Gemini 3.0 Pro) perform a second blind evaluation with 2x weighted votes, acting as a quality anchor.
  5. Borda Count aggregation: All judge rankings are aggregated using the Borda Count method. First place earns N points, second place N-1, and so on. The highest-scoring model wins.
  6. Reports and analysis: Results include full scoring matrices, per-criterion score breakdowns (1-5) with judge justifications, response times, costs, win rates, standard deviations, and consistency metrics across runs and datasets.
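
For the technically curious, here is a minimal sketch of steps 1-2 in TypeScript. The endpoint and request shape follow OpenRouter's OpenAI-compatible chat completions API; everything else (the function names, the criteria prompt, and the choice of openai/gpt-4o as the criteria model) is an illustrative assumption, not Agent Clash's actual implementation.

```typescript
// Assumes Node 18+ (global fetch). API key comes from the user.
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

// One OpenAI-compatible chat completion call against OpenRouter.
async function complete(apiKey: string, model: string, prompt: string): Promise<string> {
  const res = await fetch(OPENROUTER_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// Step 1: send the same prompt to every selected model, in parallel.
async function dispatch(apiKey: string, models: string[], prompt: string) {
  const texts = await Promise.all(models.map((m) => complete(apiKey, m, prompt)));
  return models.map((model, i) => ({ model, text: texts[i] }));
}

// Step 2: ask one model for 3-5 task-specific criteria (prompt wording assumed).
async function generateCriteria(apiKey: string, prompt: string): Promise<string[]> {
  const ask =
    "List 3 to 5 criteria most relevant to judging answers to the prompt below, " +
    "one per line.\n\nPrompt:\n" + prompt;
  const raw = await complete(apiKey, "openai/gpt-4o", ask);
  return raw.split("\n").map((l) => l.trim()).filter(Boolean).slice(0, 5);
}
```

Promise.all keeps the model calls concurrent, so total dispatch latency is roughly that of the slowest model rather than the sum of all of them.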

Supported AI Models

  • OpenAI: GPT-4o, GPT-4o mini, GPT-4.1, o3, o4-mini
  • Anthropic: Claude Opus 4, Claude Sonnet 4, Claude 3.5 Sonnet, Claude 3.5 Haiku
  • Google: Gemini 3.0 Pro, Gemini 2.5 Flash, Gemini 2.0 Flash
  • Meta: Llama 4 Maverick, Llama 4 Scout, Llama 3.3 70B
  • Mistral AI: Mistral Large, Mistral Medium, Mistral Small, Codestral
  • Cohere: Command R, Command R+, Command A

New models are added regularly as they become available on OpenRouter.

Key Features

  • Side-by-side AI model comparison on custom prompts
  • Independent AI judges for impartial evaluation
  • Blind evaluation prevents model-name bias
  • Multi-run evaluations for statistical reliability and consistency analysis
  • Parallel dataset testing for data sensitivity analysis
  • Supreme Court mode with elite judge panel (2x vote weight)
  • Borda Count ranked voting aggregation, resistant to outlier bias
  • Dynamic evaluation criteria tailored to each prompt (AI-generated, 3-5 criteria)
  • Per-criterion scoring (1-5) with judge justifications for low scores
  • Comprehensive reports: scoring matrices, win rates, standard deviations, cost analysis
  • Free and transparent — users provide their own OpenRouter API key

Frequently Asked Questions

Is Agent Clash free?
Yes. The platform itself charges nothing; you provide your own OpenRouter API key and pay only the model inference costs, typically a few cents per evaluation.
What is blind evaluation?
AI judges do not know which model produced which response. Responses are labeled anonymously (Response A, B, C, etc.) so that scores reflect actual quality, not model reputation.
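As a rough illustration, anonymization can be as simple as shuffling the responses and relabeling them before judges see them. This is a minimal sketch with assumed names and shapes, not Agent Clash's internal code:

```typescript
interface ModelResponse { model: string; text: string; }

// Fisher-Yates shuffle, so label order carries no information about models.
function shuffle<T>(items: T[]): T[] {
  const a = [...items];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Judges receive only `labeled`; the label-to-model mapping stays private
// until scoring is finished, so scores can be traced back afterwards.
function blindLabel(responses: ModelResponse[]) {
  const shuffled = shuffle(responses);
  const labeled = shuffled.map((r, i) => ({
    label: `Response ${String.fromCharCode(65 + i)}`, // A, B, C, ...
    text: r.text,
  }));
  const mapping = new Map<string, string>();
  labeled.forEach((l, i) => mapping.set(l.label, shuffled[i].model));
  return { labeled, mapping };
}
```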
What is the Supreme Court feature?
Supreme Court mode adds three elite judges (GPT-5.2, Claude Opus 4.5, Gemini 3.0 Pro) whose votes carry double weight, providing an extra layer of evaluation rigor.
What is the Borda Count?
A ranked voting method where each position earns points. First place gets N points, second N-1, etc. The model with the highest total wins. This method is resistant to outlier votes.
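A minimal sketch of this aggregation, with an optional per-judge weight to model the Supreme Court's 2x votes (the names, shapes, and example rankings are illustrative):

```typescript
interface JudgeRanking { order: string[]; weight?: number; } // best first

// With N candidates: 1st place earns N points, 2nd N-1, ..., last 1.
// `weight` defaults to 1; a Supreme Court judge would pass 2.
function bordaCount(rankings: JudgeRanking[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const { order, weight = 1 } of rankings) {
    order.forEach((model, pos) => {
      totals.set(model, (totals.get(model) ?? 0) + weight * (order.length - pos));
    });
  }
  return totals;
}

// Three regular judges plus one double-weight judge, three models:
const totals = bordaCount([
  { order: ["claude", "gpt", "gemini"] },
  { order: ["gpt", "claude", "gemini"] },
  { order: ["claude", "gpt", "gemini"] },
  { order: ["gpt", "claude", "gemini"], weight: 2 },
]);
// claude: 3+2+3+4 = 12, gpt: 2+3+2+6 = 13, gemini: 1+1+1+2 = 5 → gpt wins
```

Because every position contributes points, one judge who ranks a strong model last shifts its total far less than it would shift an average of raw scores, which is what makes the method resistant to outlier votes.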
Why run multiple evaluation rounds?
AI responses vary due to temperature randomness. Multiple rounds reveal whether a model wins consistently. Combined with parallel datasets, multi-run evaluations produce robust statistics with standard deviations and win rates.
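A sketch of the per-model summary such a report could compute across runs; the function and field names are assumptions:

```typescript
// Per-model summary across evaluation runs.
function runStats(scores: number[], wins: number, runs: number) {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((s, x) => s + (x - mean) ** 2, 0) / (scores.length - 1);
  return {
    mean,
    stdDev: Math.sqrt(variance), // sample standard deviation across runs
    winRate: wins / runs,        // fraction of runs this model ranked first
  };
}

// e.g. Borda totals of 12, 10, 13 across three runs, winning two of them:
console.log(runStats([12, 10, 13], 2, 3));
// → { mean: 11.67, stdDev: 1.53, winRate: 0.67 } (rounded)
```

A low standard deviation alongside a high win rate indicates a model that wins consistently rather than by occasional lucky samples.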
Is my API key stored?
Without an account, your API key stays in your browser only. With an account, it is stored encrypted on our servers and can be deleted at any time.