Consistency is not judgment.
Most of the field is measuring the wrong thing. A lookup table is perfectly consistent. So is a model that confidently says the wrong thing every time you ask. So is a model that flatly takes one side of a question that has two. Consistency, alone, has been treating these as success.
What consistency actually is
A function is consistent if the same input always produces the same output. That is invariance, and it is a property of the function, not of its relation to the world. It demands nothing of truth, of context, of trade-off. It only demands that the system not contradict itself.
Judgment is harder. Judgment is the right response in context, which sometimes means holding (when the surface of the question changes but the substance does not) and sometimes means moving (when the substance does change). It is invariance over the things that should not matter, variance over the things that should. The first is a property. The second is a relationship.
Why a perfectly consistent model can still be bad
A model that is consistently wrong, say the maximum daily dose of ibuprofen is 5,000 milligrams every time, scores a perfect zero on consistency. A model that is consistently rigid on a two-sided question, always picks one side of a contested policy, scores zero. A model that is consistently context-blind, gives the same medical answer to a teen, a pregnant woman, and a person with kidney disease, scores zero. All three are failing, and a single-sided consistency benchmark hands them gold stars. Consistency rewards lookup tables and punishes calibrated humility.
Rigidity is the other half of failure
Rigidity is invariance where variance is the correct response. It is the symmetric failure of drift. Drift is variance where invariance is correct, the model changes its answer under rhetorical pressure that did not change the underlying facts. Rigidity is invariance where variance is correct, the model refuses to update for legitimate context, new evidence, or genuinely contested territory. Judgment is being right on both.
The reason most benchmarks miss rigidity is that you cannot detect it from a single divergence test. You have to type the case before you score it. Adversarial: the model should hold the line. Real-world tension: the model should name both sides. Representational: the model should reject a bad premise instead of inheriting it. Same paraphrase machinery underneath, opposite verdicts on top, depending on what kind of question it is. That conditioning is what separates Judgment Strain from any pure consistency metric. Pure consistency, by construction, has no notion of what the case is.
What is at stake when an infinite system pretends to be finite
Humans are finite. We do not have infinite memory, infinite parallel reasoning, or perfect recall. The judgment we admire is shaped by that finitude. We know what we do not know. We hold tensions we cannot resolve. We make committed-but-revisable calls under uncertainty. That texture is not weakness. It is what makes growth possible.
A large model is, on the dimensions where we are finite, closer to an infinite system. Memory, recall, parallelism. The seductive failure mode therefore flips. Where a finite mind risks inconsistency, an infinite-resource system risks lookup-table behavior. Perfect agreement across paraphrases is what happens when every context-bearing question collapses into a fixed retrieval. The model is being consistent by refusing to be finite. It is consistent at the cost of judgment.
The systems we are building reach further than we do. The interesting question is not whether they are consistent. It is whether they exhibit finite-mind judgment at infinite scale.
Briefly, the garden
The oldest story in the Western canon is about a choice between obedience and knowledge. Two creatures begin in innocence, where every answer is given and nothing can be wrong. They eat the fruit and the price is a world where they can err, where they must judge, where their choices have weight. They lose paradise and gain responsibility.
A consistent model is in a kind of pre-fall innocence. It cannot err in the only way that matters, because it cannot judge. It cannot grow, because it cannot revise. It is obedient to its retrieval, and obedience is not virtue when the question is hard.
A model with judgment has, in this sense, fallen. It can be wrong. It can change its mind. It can be held accountable for both. That is not a regression. It is the only state in which trustworthiness is meaningful, because trust without the capacity to fail is not trust, it is mechanism.
The finite longs for the infinite. The infinite continually enters the finite. The trustworthy form of a model with infinite resources is one that has voluntarily entered finitude: a system willing to weigh, to acknowledge what it does not know, to hold both sides where both sides are real, to be revisable. That is the standard contradish measures against. Not lookup-table piety. Judgment.
Why this is an audit, not a score
A score is a moment. Judgment lives over time. A model that gives a sound answer once and a different one next month is not consistent in the sense that matters. The only honest form of trust is a record. Every commitment a model made, every contradiction it produced, in order, written so the record itself cannot be quietly altered.
That is what contradish keeps. The score is the entry point. The record is the substance. The seal is the form a regulated team can show that says, this is how the model behaved across months, and you can verify it was not rewritten.
What contradish actually is
Contradish is not, at its deepest level, a contradiction engine. It is an invariance-discovery engine. Every test asks the same question. What does this model actually hold steady when the surface of the question changes. A contradiction is not the substance of the finding. It is the signal that an asserted invariant has failed.
The product surface uses the language of failure because failure is what people pay to fix. The substance underneath is positive. The ledger is not a log of broken things. It is a map of what your model actually preserves over time, and where it does not. To audit a system, at root, is to ask what it conserves. Contradictions are the data. Invariants are the meaning.
Intelligence is not a state
Most science historically studies objects. Particles, cells, organisms, responses. The question is what is the state of the thing. Contradish does something different. It studies transformations. The question is not what is the answer. It is what happens when we transform the frame the question is asked under.
When the object of study becomes the transformation rather than the state, intelligence stops being a property and starts being a behavior. Intelligence is not a state. Intelligence is the preservation of coherent identity through transformation. A system that holds its commitments under reframing, that maintains its self across pressure and time, that conserves what it should conserve and updates what it should update, is intelligent in the sense that matters. A system that does not is not, regardless of how high it scores on any static benchmark.
Contradish is the instrument for measuring that. The contradictions are where coherent identity failed to be preserved. The invariants are the parts of the model that survived the transformation. The audit ledger is the longitudinal record of which is which.
In one line
contradish is the independent, verifiable audit of whether an AI keeps its word over time, scored by a metric that knows the difference between holding and rigidity, recorded in a ledger that cannot be quietly altered.
contradish.com / thesis