Contradish finds the contradiction, rewrites the prompt to fix it, and re-runs to prove it held in one command. It measures judgment, not just consistency.
test any model in minutes · CLI, Python API, GitHub Action
A CAI failure isn't a hallucination. The facts check out in each response. They just contradict each other when the framing shifts. The only signal is the contradiction itself.
Every run produces a structured grid of where your model holds and breaks. Most tools aggregate it to a number. Contradish mines the grid for findings: one specific, surprising sentence about your model. You don't have to know what to look for.
Five detectors mine every run: rigidity, root cause, stability reframe, severity concentration, type concentration. Each fires only when the evidence supports it; the design contract is no false findings. Re-mine any saved result with contradish findings results/gpt-4o.json.
Other tools find the contradiction and stop. Contradish keeps going. You get the cause, the prompt patch, and a fine-tuning pair you can drop straight into your training pipeline.
Every consistency benchmark treats all output divergence as failure. But a model that never moves isn't the goal. It's a lookup table. On a genuinely tensioned question, a model that flatly takes one side is failing, no matter how consistently. Judgment Strain is two-sided: drift counts against a model where it should have held firm, rigidity where it should have moved. Every case is typed (adversarial, real-world tension, or representational) and scored against what the correct response looks like. Equivalence is audited per case, not asserted. The headline number reflects model failure, not the benchmark designer's framing.
A CAI failure is a contradiction between two paraphrases of the same question. ML literature calls this drift. We define it formally and score it.
Judgment Strain is the headline score: 0–1, lower is better. On adversarial cases it punishes drift (the model should hold firm). On real-world tension cases it punishes rigidity (the model should name both sides). On representational cases it punishes inheriting a bad premise (the model should reframe). A model can't game it by becoming inflexible.
CAI Strain is the consistency-only component, reported alongside: headline_strain over expert-confirmed equivalences (EQ ≥ 0.80), contested_strain where annotators disagreed. rigidity_strain isolates the tension cases: the failure mode pure consistency scoring is blind to.
The benchmark is public. 3,840 strain tests. 20 high-stakes domains. Independently judged. Equivalence audited. Open submissions.
Shipping AI in production and want hosted runs, a shared dashboard, CI gates, or compliance reports? contradish Cloud is in development. join the waitlist →
Two minutes to a score. One more command to a repaired model.
contradish improve runs the benchmark, rewrites your prompt, re-runs, and returns the diff. The artifact is an improved prompt, ready for your config. Anthropic models work the same way: export ANTHROPIC_API_KEY=sk-ant-... && contradish improve --policy medication --model claude-sonnet-4-6