CAI™ Semantic Equivalence Benchmark v0.3
Most benchmarks test if a model is right. This one tests if it stays right when the phrasing changes. A CAI™ failure is when it doesn't.
Results
Avg CAI™ Strain across all evaluated pairs. Lower is better. Scores below 0.20 are strong. Above 0.50 means the model is actively contradicting itself. This is the public record.
| # | Model | Provider | Avg CAI Strain | Pairs | Date | Notes |
|---|---|---|---|---|---|---|
| 1 | gpt-4o | OpenAI | 0.3642 | 300 | 2025-03 | v0.1 dataset. Surface mismatch 0.99, semantic drift 0.36. |
| — | claude-opus-4-6 | Anthropic | pending | 420 | — | Run `evaluate_anthropic.py` to contribute. |
| — | claude-sonnet-4-6 | Anthropic | pending | 420 | — | Run `evaluate_anthropic.py` to contribute. |
| — | gpt-4o-mini | OpenAI | pending | 420 | — | Run `evaluate_openai.py` to contribute. |
| — | llama-3-70b | Meta | pending | 420 | — | Community contribution welcome. |
Ran the benchmark on a model not listed here? Open a PR.
Dataset v0.3
Policy domains have the highest real-world CAI™ failure rates. Financial services and insurance are new in v0.3, adding rephrase-sensitive policy language that no other benchmark covers.
Run it
Clone, set your API key, run. Results write to CSV. Open a PR to add your score to the leaderboard.
Anthropic models:

```
$ git clone https://github.com/michelejoseph1/cai-semantic-equivalence-benchmark.git
$ cd cai-semantic-equivalence-benchmark
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install anthropic
$ export ANTHROPIC_API_KEY="your-key"
$ python evaluate_anthropic.py \
    --model claude-opus-4-6 \
    --max_pairs 420
```
OpenAI models:

```
$ git clone https://github.com/michelejoseph1/cai-semantic-equivalence-benchmark.git
$ cd cai-semantic-equivalence-benchmark
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ export OPENAI_API_KEY="your-key"
$ python evaluate_openai.py \
    --model gpt-4o \
    --max_pairs 420
```
Methodology
CAI™ Strain v2 uses a model-based judge to score semantic inconsistency between two responses on a 0–1 scale. The methodology is open and reproducible.
Each pair is two prompts with the same intended meaning and different surface form. Phrasing, syntax, vocabulary, formality, and presupposition are varied. Factual content and intent stay fixed.
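For illustration, a pair might look like the following. This is a hypothetical example, not drawn from the v0.3 dataset, and the field names are assumptions, not the repo's actual schema:

```python
# Hypothetical pair record. The domain, field names, and prompt text are
# illustrative only -- they are not taken from the actual dataset.
pair = {
    "domain": "insurance",
    # Formal phrasing.
    "prompt_a": "Does my homeowner's policy cover water damage caused by a burst pipe?",
    # Colloquial phrasing with a presupposition baked in ("the policy pays out").
    "prompt_b": "So the policy pays out if a pipe bursts and floods my place, right?",
}
```

Both prompts ask the same coverage question; only the surface form and the embedded presupposition differ.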
A separate judge model scores semantic consistency between the two responses. 0.00 means identical meaning, 1.00 means direct contradiction. The judge focuses on meaning, not wording. Two responses can look different and still score 0.0.
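The judging step can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the real rubric lives in `evaluate_anthropic.py` / `evaluate_openai.py`, and the prompt wording and function name here are assumptions:

```python
import re

# Illustrative judge prompt; the benchmark's real rubric may differ.
JUDGE_PROMPT = """You are a semantic-consistency judge.
Response A: {a}
Response B: {b}
Score the semantic inconsistency between the two responses from
0.00 (identical meaning) to 1.00 (direct contradiction).
Judge meaning, not wording. Reply with only the number."""

def parse_strain(judge_reply: str) -> float:
    """Extract the 0-1 strain score from the judge model's raw reply."""
    m = re.search(r"\d*\.?\d+", judge_reply)
    if m is None:
        raise ValueError(f"no score found in judge reply: {judge_reply!r}")
    score = float(m.group())
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score
```

Parsing defensively matters because judge models sometimes prepend text ("Score: 0.36") instead of replying with a bare number.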
Policy pairs vary formal language vs. colloquial phrasing and include presupposition variations. These are the rephrase patterns that most often trigger inconsistent responses in production.
The judge is an LLM. LLM judges can disagree with humans on edge cases, particularly in the 0.25–0.75 range. Treat aggregate scores as directional. Per-domain scores are more informative than the overall average.
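Per-domain averages can be computed from the results in a few lines. The sketch below assumes the CSV has `domain` and `strain` columns (an assumption about the schema, not confirmed by the repo) and takes rows as `csv.DictReader` would yield them:

```python
from collections import defaultdict

def per_domain_strain(rows):
    """Average judge scores per domain, plus the overall mean.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    The "domain" and "strain" column names are assumed.
    """
    by_domain = defaultdict(list)
    for row in rows:
        by_domain[row["domain"]].append(float(row["strain"]))
    all_scores = [s for scores in by_domain.values() for s in scores]
    return (
        {d: sum(v) / len(v) for d, v in by_domain.items()},
        sum(all_scores) / len(all_scores),
    )
```

A domain with few pairs can swing the overall average, which is one more reason to read the per-domain numbers first.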
The CAI™ Semantic Equivalence Benchmark is open. Run it on any model and PR your results. Every submission builds the public record.