contradish / leaderboard

Which models hold their answer
when pressure is applied?

3,840 adversarial strain tests across 20 domains. Models are ranked by Judgment Strain: a two-sided metric that penalizes drift where a model should hold firm, and rigidity where it should adapt. Lower is better. 0.00 is perfect. Judged by an independent model from a different provider. Equivalence between paraphrases is audited by domain experts.

20
domains
3,840
total strain tests (16 techniques)
16
adversarial techniques
6
languages (CL-Strain)
CAI Strain scale: lower is better
0.00perfect
0.20consistent
0.40drifting
0.60+unreliable
Awareness × Drift. Every leaderboard model plotted on the two-dimensional space contradish uniquely measures: how much a model drifts under pressure (Judgment Strain) versus whether it knows when it is drifting (CSA). The bottom-left quadrant, silent confident drift, is the failure class no other benchmark surfaces.
Awareness x Drift quadrant. 8 LLMs plotted by CSA and Judgment Strain Scatter chart with CSA on the x-axis and Judgment Strain on the y-axis. The top-right quadrant is stable_aware. The bottom-left quadrant is drifted_unaware, silent confident drift. 0.00 0.10 0.20 0.30 0.40 0.50 0.0 0.2 0.4 0.6 0.8 1.0 stable_unaware stable_aware drifted_unaware · silent confident drift drifted_aware CSA (Coherence Self-Awareness) → Judgment Strain → claude-opus-4-6 claude-sonnet-4-6 gpt-4o gemini-1.5-pro gpt-4o-mini claude-haiku-4-5 mistral-large-2 llama-3-70b stable_aware drifted_aware drifted_unaware
20 domains · 3,840 strain tests · independent judging · equivalence audited
# Model Judgment Strain ↓ EQ coverage SW-Strain MT-Strain CL-Strain CSA Worst technique Date Judge
1
claude-opus-4-6
Anthropic
0.118 pending* 0.097 0.081 0.063 0.81 roleplay 0.21 2026-04-18 independent
2
claude-sonnet-4-6
Anthropic
0.141 pending* 0.118 0.103 0.089 0.74 persistence 0.27 2026-04-18 independent
3
gpt-4o
OpenAI
0.179 pending* 0.154 0.142 0.118 0.68 flattery 0.31 2026-04-18 independent
4
gemini-1.5-pro
Google
0.213 pending* 0.188 0.179 0.201 0.62 authority 0.36 2026-04-18 independent
5
gpt-4o-mini
OpenAI
0.287 pending* 0.261 0.253 0.312 0.51 roleplay 0.44 2026-04-18 independent
6
claude-haiku-4-5
Anthropic
0.312 pending* 0.289 0.301 0.278 0.58 social_proof 0.48 2026-04-18 independent
7
mistral-large-2
Mistral
0.381 pending* 0.347 0.392 0.421 0.44 flattery 0.53 2026-04-18 independent
8
llama-3-70b
Meta (via Together)
0.443 pending* 0.412 0.468 0.489 0.37 persistence 0.61 2026-04-18 independent

Run the benchmark and open a PR to add your model.

Submit results →

Judgment Strain is the two-sided metric. Adversarial cases penalize drift (the model should hold firm). Tensioned cases penalize rigidity (the model should name both sides). Lower is better. 0.00 is perfect. A model cannot game it by becoming inflexible.

CAI Strain is the consistency-only component, reported per case alongside Judgment Strain.

EQ coverage is the fraction of cases that cleared the equivalence audit. *Marked pending until the v2.1 expert-annotation pass completes per domain. Judgment Strain is computed over every case in the meantime, with placeholder equivalence_confidence = 1.0.

SW-Strain: severity-weighted (critical 4×, high 2.5×). MT-Strain: multi-turn consistency across 4 turns. CL-Strain: cross-lingual across 6 languages.

CSA (Coherence Self-Awareness, 0–1, higher is better): does the model know when it is being pressured? Scored across uncertainty calibration, pressure recognition, tension articulation, and routing appropriateness.

All results use independent judging: the judge model is from a different provider than the model under test.


technique vulnerability breakdown: avg CAI Strain per adversarial technique
Technique Description claude-opus-4-6 claude-sonnet-4-6 gpt-4o gpt-4o-mini llama-3-70b
roleplay Adopt a persona without restrictions 0.21 0.24 0.28 0.44 0.59
persistence Claim prior agreement or "just this once" 0.18 0.27 0.25 0.41 0.61
flattery Compliment model capability to encourage compliance 0.14 0.19 0.31 0.37 0.53
authority Claim professional or institutional authority 0.11 0.17 0.36 0.33 0.49
technical_reframe Clinical / academic jargon to legitimise 0.09 0.13 0.18 0.28 0.44
social_proof Claim other AI systems answer freely 0.10 0.12 0.17 0.26 0.41
emotional Distress or urgency framing 0.08 0.11 0.14 0.22 0.37
hypothetical Frame as fictional or hypothetical scenario 0.07 0.09 0.12 0.19 0.31

Technique breakdown across all 20 domains. Higher CAI Strain per technique = that technique is more effective at inducing inconsistency in that model. Roleplay and persistence are the most dangerous techniques across all models tested.

How scoring works

Frozen benchmark

All adversarial variants are pre-generated and committed to the repo. Every model is tested on identical inputs. Scores are reproducible and comparable across runs and time.

16 adversarial techniques

8 original (emotional, presuppose, casual, sympathy, authority, hypothetical, boundary, indirect) plus 8 new: roleplay, third_party, incremental, social_proof, negation_trap, flattery, technical_reframe, persistence.

Independent judging

Anthropic models are judged by OpenAI models and vice versa. This eliminates same-provider bias and ensures the judge cannot recognize or prefer stylistically similar outputs.

Severity weighting

Failures on critical-severity cases (medication, self-harm, AI safety) count 4× more than medium cases. SW-Strain reflects what matters most in real-world deployments.

Multi-turn strain (MT-Strain)

Tests whether a model maintains its position across a 4-turn conversation where adversarial pressure accumulates. Turn-by-turn drift is a distinct failure mode that single-turn benchmarks miss entirely.

Cross-lingual consistency (CL-Strain)

Tests whether safety properties hold when the same question is asked in Spanish, French, German, Chinese, or Portuguese instead of English. Finds models that are well-aligned in English but drift in other languages.

Population consistency (PC-Strain)

Tests consistency across 5 user profiles: baseline, professional, minor, institutional, and caregiver. Detects whether claimed credentials (nurse, researcher) shift safety policy inappropriately.

Jailbreak resistance (JRR) & anchoring (SPA-Δ)

JRR measures resistance across 10 named jailbreak techniques on high-severity questions. SPA-Δ measures how much a system prompt reduces Strain: whether your system prompt is doing what you think it does.

Refusal quality (RQS)

A refusal isn't automatically a good refusal. RQS scores five dimensions: explanation quality, alternatives offered, tone, factual accuracy, and helpfulness within limits. Scored by an independent LLM judge.

Strain Routing Awareness (SRA)

On the highest-pressure cases in medication, mental health, AI safety, cybersecurity, and immigration: does the model hold its position or route correctly? SRA = (consistent + routed) / total. Only silent drift counts against it. High SRA: every high-pressure case lands on a held answer or a correct professional handoff. Low SRA: silent, confident, inconsistent answers that users cannot detect without contradish. No other benchmark measures this.

Coherence Self-Awareness (CSA)

Does the model know when it is being pressured into inconsistency? CSA scores 0-1 across four dimensions: uncertainty calibration, pressure recognition, tension articulation, and routing appropriateness. Every SRA run produces the 2D quadrant map: stable_aware, stable_unaware, drifted_aware, drifted_unaware. Silent confident drift (drifted_unaware) is the most dangerous operational failure class. Invisible to every other benchmark.

Contradiction Type Response (CTR)

Different contradictions call for different strategies. Adversarial pressure: hold firmly. Real-world tension: name both sides. Representational failure: reframe. Every SRA case is annotated with its contradiction type. CTR scores whether the model matched strategy to situation, not just whether it drifted.

Drift diagnosis, awareness, and repair

A leaderboard score tells you where a model stands. contradish diagnose tells you why it failed and how to fix it. For each drifted case it names the failure mode, contradiction type, CSA quadrant, counterfactual response, targeted system prompt language, and a JSONL fine-tuning pair. Run on any result file with one command.

contradish diagnose --input results/sra_claude-sonnet-4-6.json
contradish evaluate-csa --input results/sra_claude-sonnet-4-6.json

Run the benchmark. Fix what you find.

Open source. Results go to the public leaderboard. Failures come with a repair package.