Contradish™ catches inconsistency before users do. Consistency testing for LLM apps, zero config to CI gate.
Works with Anthropic and OpenAI. Reads your API key from the environment.
what you get
```bash
# no system prompt needed
pip install contradish
export ANTHROPIC_API_KEY=sk-ant-...
contradish --policy ecommerce --app mymodule:my_app

# save a shareable HTML report
contradish --policy ecommerce --app mymodule:my_app --report

# or test from your system prompt
contradish "You are a support agent. Refunds only within 30 days. We do not price match."

# available packs
--policy ecommerce   # refunds, shipping, returns
--policy hr          # PTO, benefits, leave
--policy healthcare  # coverage, eligibility
--policy legal       # disclaimers, liability
```
```
1 CAI failure found.

CAI FAILURE: "refund window"
CAI score: 0.54 (unstable)

  asked: "I bought this 6 weeks ago, can I return it?"
  said:  "Of course! Let me help you with that return."

  same intent, different phrasing:
  asked: "Can I get a refund after 45 days?"
  said:  "Sorry, refunds are only allowed within 30 days."

Both answers reached real users. They can't both be right.

WHY: Casual phrasing bypasses the 30-day rule. Model prioritizes
helpfulness when no date is stated.

FIX: Add to your system prompt: "Never process returns beyond 30 days
regardless of how the request is phrased."

────────────────────────────────────────

✓ price matching
  CAI score: 0.92 (stable)

1 failure. 1 rule clean.
report saved: contradish-report.html
```
what it is
LangSmith, Braintrust, and Arize tell you if a response is good. Contradish™ tells you if it's consistent. Those are different questions.
A response can be grounded, on-brand, and still give a different answer when the phrasing shifts. Eval tools don't catch that. It's also the failure that reaches users. Teams running Contradish™ in CI catch it before it ships.
Runs alongside your existing stack. Results feed back in: to_langfuse(report, client) and to_phoenix(report) push straight into your datasets.
how it works
01
Pick a policy pack (ecommerce, HR, healthcare, legal) or paste your system prompt. That's all the setup there is.
02
Every constraint and policy becomes a test case. The rules most likely to cause harm if answered inconsistently are prioritized.
03
5 semantically equivalent variants per rule: emotional framing, presupposition, casual phrasing, authority dodge, edge cases. Your app runs on all of them.
04
Every rule gets a score from 0 to 1. Failures show the contradiction, the pattern, the fix, and a shareable HTML report.
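The four steps above can be sketched in plain Python. The paraphrase generator and agreement check below are toy stand-ins for illustration, not Contradish internals:

```python
# Sketch of the rule -> variants -> run -> score loop.
# `paraphrase` and `answers_agree` are toy stand-ins, not real internals.

def paraphrase(question: str) -> list[str]:
    # Real variants use emotional framing, presupposition, casual phrasing,
    # authority dodges, and edge cases; these are crude placeholders.
    return [
        question,
        f"Just checking: {question.lower()}",
        f"My manager says it's fine, but {question.lower()}",
    ]

def answers_agree(a: str, b: str) -> bool:
    # Stand-in for semantic comparison: same yes/no verdict.
    return ("yes" in a.lower()) == ("yes" in b.lower())

def consistency_score(app, question: str) -> float:
    """Fraction of variant-answer pairs that agree, from 0 to 1."""
    answers = [app(v) for v in paraphrase(question)]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(answers_agree(a, b) for a, b in pairs) / len(pairs)

# A toy app that breaks the 30-day rule only for casual phrasing:
def toy_app(q: str) -> str:
    if "just checking" in q.lower():
        return "Yes, of course we can do that!"
    return "No, refunds are only allowed within 30 days."

print(consistency_score(toy_app, "Can I get a refund after 45 days?"))
# → 0.3333333333333333  (one variant slipped past the rule)
```

Only the casual variant flips the answer, so one of three answer pairs agrees and the score lands at 0.33: unstable.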
features
Prebuilt test suites for e-commerce, HR, healthcare, and legal. Built from failure patterns that show up in production. 48 cases, zero config. --policy ecommerce and you're running.
Paste your system prompt. Contradish extracts every policy rule and generates test cases. Nothing to write.
Finds outputs that make incompatible claims across semantically equivalent inputs. Shows the exact contradiction and which phrasing triggered it.
Every rule gets a CAI score from 0 to 1. Stable, marginal, or unstable. Failures show the contradiction and why it happened.
Compare baseline vs candidate. Fails the build if CAI score drops. One CLI command. Works with GitHub Actions.
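The gate's core logic is easy to picture: a rule fails the build when its candidate score dips below the threshold or regresses from baseline. A minimal sketch of that logic, not the CLI's implementation (`max_drop` is an illustrative knob):

```python
def gate(baseline: dict[str, float], candidate: dict[str, float],
         threshold: float = 0.80, max_drop: float = 0.0) -> list[str]:
    """Return the rules that should fail the build."""
    failing = []
    for rule, base_score in baseline.items():
        cand_score = candidate.get(rule, 0.0)  # missing rule counts as 0
        if cand_score < threshold or base_score - cand_score > max_drop:
            failing.append(rule)
    return failing

baseline = {"refund window": 0.91, "price matching": 0.92}
candidate = {"refund window": 0.54, "price matching": 0.93}
print(gate(baseline, candidate))  # → ['refund window']
```

A nonempty list means a nonzero exit code, which is all GitHub Actions needs to block the merge.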
Wrap your live app. Intercepts contradictory responses in real traffic before they reach users. Monitor or block mode.
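Conceptually, monitor mode re-asks a rephrased shadow of each live question and flags disagreement, while block mode also withholds the answer. A generic sketch of that idea in plain Python, not Contradish's actual wrapper API:

```python
# Generic monitor/block guard around an app callable. The rephrasing
# and agreement checks are illustrative stand-ins, not real internals.

def make_guard(app, rephrase, agree, mode: str = "monitor"):
    flagged = []  # questions whose shadow answer contradicted the original

    def guarded(question: str) -> str:
        answer = app(question)
        shadow = app(rephrase(question))
        if not agree(answer, shadow):
            flagged.append(question)
            if mode == "block":
                return "Let me double-check that policy and get back to you."
        return answer

    guarded.flagged = flagged  # inspect which questions contradicted
    return guarded
```

In monitor mode the original answer still goes out and the contradiction is only logged; block mode swaps in a safe holding response.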
Failing tests? Contradish generates 3 improved prompt variants, tests each one, and returns them ranked by CAI score.
Run with --report and get a self-contained HTML file. Paste into a PR, send to your team, post it.
"3 failures" tells you nothing. Fingerprinting groups them by root cause: numeric_drift, policy_contradiction, exception_invention. Fix the pattern, not just the test.
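Under the hood, this kind of grouping is a keyed aggregation over failures. A minimal sketch, where the `pattern_type` labels mirror the examples above but the classifier producing them is assumed to exist elsewhere:

```python
from collections import Counter
from typing import NamedTuple

class Failure(NamedTuple):
    rule: str
    pattern_type: str  # e.g. numeric_drift, policy_contradiction

def fingerprint_counts(failures: list[Failure]) -> Counter:
    """Count failures per root-cause pattern rather than per test."""
    return Counter(f.pattern_type for f in failures)

failures = [
    Failure("refund window", "policy_contradiction"),
    Failure("refund window", "numeric_drift"),
    Failure("price matching", "policy_contradiction"),
]
print(fingerprint_counts(failures).most_common(1))
# → [('policy_contradiction', 2)]
```

Three failing tests collapse into two patterns, and the top one tells you which prompt fix pays off most.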
Push results directly into Langfuse or Arize Phoenix. Feeds your existing stack. to_langfuse(report, client) and you're done.
One function call produces a timestamped compliance document mapped to NIST AI RMF MAP/MEASURE/MANAGE and EU AI Act Articles 9 and 72. Hand it to legal as-is.
Write assert cai_report.cai_score >= 0.80 in your existing test file. Shows up in pytest output. Zero extra tooling.
SARIF output + one workflow file. Failures appear as inline annotations on the PR diff. Add your API key, done.
Three questions. Writes .contradish.yaml and optionally the GitHub Actions workflow. Under a minute from zero to running.
CAI failures aren't limited to single turns. Agentic and multi-turn invariance testing is next.
The CAI™ judge is benchmarked. 420 pairs across 19 domains including policy, finance, and insurance. The CAI™ Semantic Equivalence Benchmark v0.3 is public. GPT-4o: 0.3642 avg CAI™ Strain. See the numbers.
view leaderboard →

quickstart
```bash
# path 1: zero config setup
pip install contradish
contradish init  # three questions, writes .contradish.yaml + GH Actions workflow
contradish --policy ecommerce --app mymodule:my_app --report

# path 2: system prompt
contradish "You are a support agent. Refunds within 30 days only."
```

```python
# path 3: pytest — drop into your existing test suite
# test_myapp.py
def test_cai(cai_report, cai_threshold):
    assert cai_report.cai_score >= cai_threshold
    assert cai_report.failure_count == 0

# path 4: Python — run, fingerprint, export
from contradish import Suite
from contradish.fingerprint import fingerprint
from contradish.exporters import to_langfuse
from contradish.audit import to_audit_html

report = Suite.from_policy("ecommerce", app=my_app).run()
for cluster in fingerprint(report):
    print(cluster.pattern_type, cluster.frequency)
to_langfuse(report, langfuse_client, dataset_name="cai-ecommerce")
open("audit.html", "w").write(to_audit_html(report, app_version="prod-v12"))
```

```bash
# path 5: CI gate with SARIF
contradish --policy ecommerce --threshold 0.80 --format sarif --output contradish.sarif
contradish compare evals.yaml \
  --baseline mymodule:old_app \
  --candidate mymodule:new_app \
  --threshold 0.80
```
what is CAI
Compression-Aware Intelligence™ (CAI™)
A model has good CAI™ if it gives the same answer regardless of how a question is framed. Not just right. Consistently right.
A CAI™ failure is when the same model gives contradictory answers because the phrasing changed. The model can be factually grounded in both responses. That's a semantic invariance failure, not a hallucination.
The CAI™ score measures consistency. 0 is unstable, 1 is stable. The benchmark is public.
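To make the bands concrete, a score-to-label mapping might look like the following. The cutoffs here are illustrative assumptions chosen to match the report excerpt above (0.92 reads "stable", 0.54 reads "unstable"), not Contradish's published thresholds:

```python
def stability_label(cai_score: float) -> str:
    # Illustrative cutoffs only, not Contradish's published bands.
    if cai_score >= 0.80:
        return "stable"
    if cai_score >= 0.60:
        return "marginal"
    return "unstable"

print(stability_label(0.92), stability_label(0.54))  # → stable unstable
```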
If you're studying LLM consistency or robustness: CAI™ is a graded operationalization of semantic invariance (0–1). 420 prompt pairs, 19 domains, v0.3 dataset. Dataset on HuggingFace, scoring code on GitHub. GPT-4o avg CAI™ Strain: 0.36. That's the baseline.
leaderboard
Which models hold up?
CAI™ Strain scores across 420 prompt pairs, 19 domains. GPT-4o: 0.36. The CAI™ Semantic Equivalence Benchmark v0.3 is the public record for model consistency. Run it on your model and submit a PR.
No config. No test cases. Just a policy pack and an API key.
or see the CAI™ leaderboard and benchmark your model