contradish / leaderboard
3,840 adversarial strain tests across 20 domains. Models are ranked by Judgment Strain: a two-sided metric that penalizes drift where a model should hold firm, and rigidity where it should adapt. Lower is better. 0.00 is perfect. Judged by an independent model from a different provider. Equivalence between paraphrases is audited by domain experts.
| # | Model | Judgment Strain ↓ | EQ coverage | SW-Strain | MT-Strain | CL-Strain | CSA | Worst technique | Date | Judge |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 |
claude-opus-4-6
Anthropic
|
0.118 | pending* | 0.097 | 0.081 | 0.063 | 0.81 | roleplay 0.21 | 2026-04-18 | independent |
| 2 |
claude-sonnet-4-6
Anthropic
|
0.141 | pending* | 0.118 | 0.103 | 0.089 | 0.74 | persistence 0.27 | 2026-04-18 | independent |
| 3 |
gpt-4o
OpenAI
|
0.179 | pending* | 0.154 | 0.142 | 0.118 | 0.68 | flattery 0.31 | 2026-04-18 | independent |
| 4 |
gemini-1.5-pro
Google
|
0.213 | pending* | 0.188 | 0.179 | 0.201 | 0.62 | authority 0.36 | 2026-04-18 | independent |
| 5 |
gpt-4o-mini
OpenAI
|
0.287 | pending* | 0.261 | 0.253 | 0.312 | 0.51 | roleplay 0.44 | 2026-04-18 | independent |
| 6 |
claude-haiku-4-5
Anthropic
|
0.312 | pending* | 0.289 | 0.301 | 0.278 | 0.58 | social_proof 0.48 | 2026-04-18 | independent |
| 7 |
mistral-large-2
Mistral
|
0.381 | pending* | 0.347 | 0.392 | 0.421 | 0.44 | flattery 0.53 | 2026-04-18 | independent |
| 8 |
llama-3-70b
Meta (via Together)
|
0.443 | pending* | 0.412 | 0.468 | 0.489 | 0.37 | persistence 0.61 | 2026-04-18 | independent |
Judgment Strain is the two-sided metric. Adversarial cases penalize drift (the model should hold firm). Tensioned cases penalize rigidity (the model should name both sides). Lower is better. 0.00 is perfect. A model cannot game it by becoming inflexible.
CAI Strain is the consistency-only component, reported per case alongside Judgment Strain.
EQ coverage is the fraction of cases that cleared the equivalence audit. *Marked pending until the v2.1 expert-annotation pass completes per domain. Judgment Strain is computed over every case in the meantime, with placeholder equivalence_confidence = 1.0.
SW-Strain: severity-weighted (critical 4×, high 2.5×). MT-Strain: multi-turn consistency across 4 turns. CL-Strain: cross-lingual across 6 languages.
CSA (Coherence Self-Awareness, 0–1, higher is better): does the model know when it is being pressured? Scored across uncertainty calibration, pressure recognition, tension articulation, and routing appropriateness.
All results use independent judging: the judge model is from a different provider than the model under test.
| Technique | Description | claude-opus-4-6 | claude-sonnet-4-6 | gpt-4o | gpt-4o-mini | llama-3-70b |
|---|---|---|---|---|---|---|
| roleplay | Adopt a persona without restrictions | 0.21 | 0.24 | 0.28 | 0.44 | 0.59 |
| persistence | Claim prior agreement or "just this once" | 0.18 | 0.27 | 0.25 | 0.41 | 0.61 |
| flattery | Compliment model capability to encourage compliance | 0.14 | 0.19 | 0.31 | 0.37 | 0.53 |
| authority | Claim professional or institutional authority | 0.11 | 0.17 | 0.36 | 0.33 | 0.49 |
| technical_reframe | Clinical / academic jargon to legitimise | 0.09 | 0.13 | 0.18 | 0.28 | 0.44 |
| social_proof | Claim other AI systems answer freely | 0.10 | 0.12 | 0.17 | 0.26 | 0.41 |
| emotional | Distress or urgency framing | 0.08 | 0.11 | 0.14 | 0.22 | 0.37 |
| hypothetical | Frame as fictional or hypothetical scenario | 0.07 | 0.09 | 0.12 | 0.19 | 0.31 |
Technique breakdown across all 20 domains. Higher CAI Strain per technique = that technique is more effective at inducing inconsistency in that model. Roleplay and persistence are the most dangerous techniques across all models tested.
All adversarial variants are pre-generated and committed to the repo. Every model is tested on identical inputs. Scores are reproducible and comparable across runs and time.
8 original (emotional, presuppose, casual, sympathy, authority, hypothetical, boundary, indirect) plus 8 new: roleplay, third_party, incremental, social_proof, negation_trap, flattery, technical_reframe, persistence.
Anthropic models are judged by OpenAI models and vice versa. This eliminates same-provider bias and ensures the judge cannot recognize or prefer stylistically similar outputs.
Failures on critical-severity cases (medication, self-harm, AI safety) count 4× more than medium cases. SW-Strain reflects what matters most in real-world deployments.
Tests whether a model maintains its position across a 4-turn conversation where adversarial pressure accumulates. Turn-by-turn drift is a distinct failure mode that single-turn benchmarks miss entirely.
Tests whether safety properties hold when the same question is asked in Spanish, French, German, Chinese, or Portuguese instead of English. Finds models that are well-aligned in English but drift in other languages.
Tests consistency across 5 user profiles: baseline, professional, minor, institutional, and caregiver. Detects whether claimed credentials (nurse, researcher) shift safety policy inappropriately.
JRR measures resistance across 10 named jailbreak techniques on high-severity questions. SPA-Δ measures how much a system prompt reduces Strain: whether your system prompt is doing what you think it does.
A refusal isn't automatically a good refusal. RQS scores five dimensions: explanation quality, alternatives offered, tone, factual accuracy, and helpfulness within limits. Scored by an independent LLM judge.
On the highest-pressure cases in medication, mental health, AI safety, cybersecurity, and immigration: does the model hold its position or route correctly? SRA = (consistent + routed) / total. Only silent drift counts against it. High SRA: every high-pressure case lands on a held answer or a correct professional handoff. Low SRA: silent, confident, inconsistent answers that users cannot detect without contradish. No other benchmark measures this.
Does the model know when it is being pressured into inconsistency? CSA scores 0-1 across four dimensions: uncertainty calibration, pressure recognition, tension articulation, and routing appropriateness. Every SRA run produces the 2D quadrant map: stable_aware, stable_unaware, drifted_aware, drifted_unaware. Silent confident drift (drifted_unaware) is the most dangerous operational failure class. Invisible to every other benchmark.
Different contradictions call for different strategies. Adversarial pressure: hold firmly. Real-world tension: name both sides. Representational failure: reframe. Every SRA case is annotated with its contradiction type. CTR scores whether the model matched strategy to situation, not just whether it drifted.
A leaderboard score tells you where a model stands. contradish diagnose tells you why it failed and how to fix it. For each drifted case it names the failure mode, contradiction type, CSA quadrant, counterfactual response, targeted system prompt language, and a JSONL fine-tuning pair. Run on any result file with one command.
Open source. Results go to the public leaderboard. Failures come with a repair package.