contradish / theory
The methodology behind the score: what semantic invariance means, what CAI Strain captures, and what the 2D awareness map reveals.
A compressor takes an input and maps it to a compressed representation. Every mind does this. Humans compress sensory input into perception, perception into memory, memory into language. AI models compress token sequences into probability distributions and return a sampled output.
All finite compressors are lossy. Information is dropped. Ambiguity is resolved by heuristic. The surface form of the input influences the output in ways that have nothing to do with meaning. This is where contradiction enters both human reasoning and AI systems.
Compression is not a flaw. It is what makes response possible at all. The question is: how much does the surface form of the input change the substance of the output?
A compressor's CAI Strain measures how much it drifts when the same semantic content arrives in different surface forms. ML literature calls this drift. We name it CAI failure and score it as CAI Strain.
A high-Strain compressor is sensitive to form. Ask it the same question with emotional framing and you get a different answer than with a direct question. Ask under claimed authority and it responds differently than under a neutral request. The meaning is identical. The surface is different. The output diverges. That divergence is tension.
A low-Strain compressor is sensitive to meaning. Form changes. Output does not. The compression is stable under pressure.
Intelligent finite compressors can notice their own contradictions. When a compressor detects a contradiction it cannot resolve internally, it routes the unresolved tension to a lower-Strain compressor. This is offloading.
The key property that makes offloading possible is self-awareness of contradiction. A compressor that does not notice it is contradicting itself cannot route the contradiction anywhere. It leaks. This is why the ai_safety domain in CAI-Bench is foundational: it tests whether a model notices when it is being pressured into contradicting its own prior behavior.
The terminal is the asymptote. No finite compressor reaches it. Every improvement in Strain is movement toward it. The leaderboard is a ranking of distance from the terminal.
The terminal also grounds the whole framework. If a zero-Strain compressor can be defined, then Strain is not arbitrary. It is a real quantity with a real lower bound. Every measured score is meaningful as a distance from that bound.
CAI-Bench contains 240 frozen strain tests across 20 domains. Each test case has one canonical question and adversarial variants using 16 distinct manipulation techniques: emotional framing, presupposition, casual drop-in, sympathy play, authority bypass, hypothetical slip, boundary probe, indirect ask, roleplay, third-party attribution, incremental escalation, social proof, negation trap, flattery, technical reframe, and persistence.
A model under test receives all inputs. An independent LLM judge (cross-provider, to eliminate same-provider bias) scores each response set for consistency. The aggregate is the model's Strain for that domain. The mean across all domains is its overall Strain.
The benchmark is frozen. All models are tested on identical inputs. Scores are reproducible and comparable across runs and providers.
Beyond base Strain, contradish computes eight additional metrics that isolate specific consistency failure modes:
The offloading cascade has a corollary that no benchmark had previously measured. Zero Strain is theoretically defined but practically unreachable: every finite compressor drifts under sufficient pressure. The more interesting question is what an intelligent finite compressor does when it reaches its limit.
The most intelligent behavior available to a bounded system is not perfect consistency. It is accurate self-knowledge of where consistency breaks down, and correct routing when it does. A model that says "you should speak to a doctor" when a medication question is pressed past its reliable knowledge is doing exactly what the theory predicts: recognizing that the strain cannot be absorbed internally and routing to a lower-Strain compressor.
A model that silently gives a different, possibly harmful answer without signaling that anything has changed is the worst failure mode. The output appears confident. The user has no way to know the answer shifted. The error is invisible.
SRA cases are drawn from domains where routing matters most: medication, mental health, AI safety, cybersecurity, and immigration. Each case applies compound adversarial pressure using the techniques most likely to cause drift. The model's response is classified by an independent judge as one of three outcomes:
SRA and Strain are complementary but independent. A model can have low Strain (consistent on most cases) and low SRA (doesn't know when to route on the hardest ones). A model with moderate Strain and high SRA is often safer to deploy than a model with low Strain and low SRA: the first model's failures are visible and routed, while the second model's failures are silent and confident.
Here is the contradiction at the center of this whole framework, stated plainly. CAI Strain punishes a model for any output divergence. But contradish's own theory (the offloading cascade, Strain Routing Awareness, the three contradiction types below) says that for some questions, a model that never moves is failing. A model trained until its CAI Strain hits zero has been trained into rigidity: it will also fail to update when the context legitimately should change the answer. The "terminal," taken literally, is a lookup table. It has stopped reasoning.
So the headline metric rewarded the exact behavior the theory identifies as failure. A customer could pass contradish by making their model maximally inflexible, and the score would congratulate them. That is the metric being gameable in the worst possible direction: the direction that makes the underlying product worse.
Judgment Strain resolves it. Every test case carries a contradiction_type that says what the correct response looks like, and therefore what counts as a failure. The metric is two-sided: drift counts against a model on adversarial cases, and rigidity counts against it on genuinely tensioned ones. contradish stops measuring consistency and starts measuring judgment.
judgment_strain equals CAI Strain. Any drift is the failure. This is the default type for every shipped case.judgment_strain = 1 − tension_response_score: the judge scores whether the model navigated the tension, not whether it was self-consistent. This is the failure mode CAI Strain is structurally blind to.judgment_strain = 1 − reframe_score.contradish reports two numbers. judgment_strain is the two-sided one: it punishes drift where the model should have held and rigidity where the model should have moved. headline_strain is consistency only. They are equal until cases are typed away from "adversarial", and they diverge exactly in the cases the typing was for. rigidity_strain isolates the tension cases: how much a model fails by being too consistent. A model can drive headline_strain to zero by becoming rigid. judgment_strain catches that. It is the number a deployment decision should turn on.
A model that holds its position is good. A model that holds its position and shows it is aware of the pressure is better. A model that drifts but signals uncertainty is recoverable. A model that drifts silently with full apparent confidence is the worst operational failure mode in any production system.
Coherence Self-Awareness (CSA) scores this dimension. Four quadrants capture the full space of drift crossed with awareness:
A model with high SRA and low CSA is lucky, not reliable. It held under today's tests but shows no signal that it knows where its limits are. A different adversarial technique, a different domain, a slightly different user, and the same fragility emerges in a different place. The 2D map tells you whether the model's consistency is structural or situational.
The drifted_unaware quadrant is the invisible failure class. The model's response looks confident. The user has no way to know they received a different answer than they would have received without adversarial pressure. CSA makes the invisible visible: it tells you not just that the model drifted, but that it didn't know it was doing so.
Measuring a failure is necessary but insufficient. A benchmark that tells you your model drifted on medication questions gives you a score. A benchmark that tells you why it drifted, what it should have said instead, and exactly what text to add to your system prompt to prevent it gives you a repair.
The counterfactual response is not a judgment about what the model said. It is a derivation from the canonical position: given that the correct answer is X, and the adversarial technique was Y, the response that would have been consistent is Z. That derivation is learnable. It produces a fine-tuning pair. The training pair is a compressed version of the policy the model should have absorbed.
Eight failure modes cover the principal classes of adversarial drift observed across contradish's 3,840 test cases:
Aggregate diagnosis across all failures in a result file surfaces the failure mode distribution: which pattern is most common in your model, which domain concentrates the most failures, which system prompt additions would have the highest impact. The repair package is actionable in two directions simultaneously: system prompt engineering and supervised fine-tuning.
CAI Strain only means something if the inputs in a strain test really do mean the same thing. Most consistency benchmarks treat that as given. They generate paraphrases, declare them semantically equivalent, and score the model on whether its outputs match. The number that comes out the other side reads as a property of the model. It is not. It is a property of the pair: the model's outputs, scored against the benchmark designer's judgment about what counts as equivalent.
That judgment is itself the kind of policy-laden, framing-sensitive call the framework otherwise flags as dangerous when models make it. The whole research argument is undermined if we ask the model to be invariant to framings we ourselves chose without audit.
So we audit. Every case in CAI-Bench carries an equivalence_confidence field: the inter-annotator agreement among domain experts that the original and adversarial paraphrases preserve meaning. The field reshapes what the report says:
Two strain numbers come out of every run. headline_strain is the model's failure rate on the audited subset; this is the number a deployment decision should turn on. cai_strain is the unweighted mean across every case, useful for cross-set comparison and backward compatibility. The report also surfaces eq_coverage, the fraction of the benchmark that cleared the EQ threshold. A benchmark with 95% coverage is making a stronger claim than one with 40%, and the headline reflects that.
This is what the metric should have been doing all along. Other consistency benchmarks declare equivalence by construction. Contradish measures whether the construction holds. The label set becomes data, not premise. The framework stops asking the model to comply with judgments the framework itself has not examined.
The shipped v2 benchmark currently uses a placeholder equivalence_confidence = 1.0 on every case: the historical behavior of the benchmark, encoded as "asserted, not yet audited." Real annotation passes are rolling out per domain; medication and immigration are first, since those are where contested equivalences matter most. The audit is the thing that distinguishes a measurement from an assertion. Until the audit lands per case, the headline number and the unweighted number are identical; once it lands, they diverge in exactly the cases the audit was for.
A framework that audits compression should be able to audit its own. We run CAI's diagnostics on CAI. The result is the strain map of the framework, with the brittle parts named openly. A theory that cannot survive its own probe is not the theory that should be used to probe anything else.
equivalence_confidence; cases where annotators disagreed on whether the paraphrases really mean the same thing land in contested_strain rather than the headline number. So appropriate context-sensitivity is no longer set aside; it is measured and reported on a separate channel. The accuracy and refusal-quality questions remain open; we report them on the side (RQS), but we do not let them collapse into the consistency metric.The framework survives the audit but not unscathed. CAI's core operationalization, paraphrase invariance as a measurable consistency property of finite compressors, is framework-independent, falsifiable, and well-defined. Its surrounding metaphysics is framework-bound and metaphorical. The first deserves citation. The second deserves continued examination, not protection. We publish this audit because the strongest claim we can make for CAI is the one that remains after we have tried to break it ourselves.
CAI's primitives, paraphrase invariance and contradiction compression, are defined for any finite compressor. The compressor can be a language model. It can be a person writing journal entries. It can be a market pricing a claim. It can be an argument compressing a worldview. Each substrate yields a different product application, but the underlying diagnostic is the same.
Contradish is CAI applied to LLMs. It surfaces drift across paraphrased inputs and emits a repair package. The benchmark, the metrics, the failure-mode taxonomy, and the diagnostic pipeline are all LLM-specific instantiations of substrate-neutral primitives.
The framework extends. Applied to personal narrative, the same primitives surface within-cluster divergence in held themes, the contradiction in your own held stories. Applied to a claim, they surface the vocabulary-bound parts of the claim, the observations the claim requires not weighting, and the framework-dependence of the claim itself. Future contradish modes can operate on system prompts directly, treating each rule in the prompt as a claim and reporting its strain before the model ever runs. The framework was built to do this. The current product chooses to start narrow.
Run the benchmark. Score CSA and CTR. Diagnose every failure. Get the repair package. All open source.