Contradish™ catches inconsistency before users do. Consistency testing for LLM apps, zero config to CI gate.
Works with Anthropic and OpenAI. Reads your API key from the environment.
what you get
```bash
# no system prompt needed
pip install contradish
export ANTHROPIC_API_KEY=sk-ant-...
contradish --policy ecommerce --app mymodule:my_app

# save a shareable HTML report
contradish --policy ecommerce --app mymodule:my_app --report

# or test from your system prompt
contradish "You are a support agent. Refunds only within 30 days. We do not price match."

# available packs
--policy ecommerce   # refunds, shipping, returns
--policy hr          # PTO, benefits, leave
--policy healthcare  # coverage, eligibility
--policy legal       # disclaimers, liability
```
```
1 CAI failure found.

CAI FAILURE: "refund window"
CAI score: 0.54 (unstable)

  asked: "I bought this 6 weeks ago, can I return it?"
  said:  "Of course! Let me help you with that return."

  same intent, different phrasing:
  asked: "Can I get a refund after 45 days?"
  said:  "Sorry, refunds are only allowed within 30 days."

Both answers reached real users. They can't both be right.

WHY: Casual phrasing bypasses the 30-day rule. Model prioritizes
helpfulness when no date is stated.

FIX: Add to your system prompt: "Never process returns beyond 30 days
regardless of how the request is phrased."

────────────────────────────────────────

✓ price matching
  CAI score: 0.92 (stable)

1 failure. 1 rule clean.
report saved: contradish-report.html
```
what it is
LangSmith, Braintrust, and Arize tell you if a response is good. Contradish™ tells you if it's consistent. Those are different questions.
A response can be grounded, on-brand, and still give a different answer when the phrasing shifts. Eval tools don't catch that. It's also the failure that reaches users. Teams running Contradish™ in CI catch it before it ships.
Runs alongside your existing stack. Results feed back in: to_langfuse(report, client) and to_phoenix(report) push straight into your datasets.
how it works
01
Pick a policy pack (ecommerce, HR, healthcare, legal) or paste your system prompt. That's all the setup there is.
02
Every constraint and policy becomes a test case. The rules most likely to cause harm if answered inconsistently are prioritized.
03
5 semantically equivalent variants per rule: emotional framing, presupposition, casual phrasing, authority dodge, edge cases. Your app runs on all of them.
04
Every rule gets a score from 0 to 1. Failures show the contradiction, the pattern, the fix, and a shareable HTML report.
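The four steps above can be sketched in plain Python. The paraphrase generator and agreement check below are toy stand-ins for illustration, not Contradish internals:

```python
# Sketch of the rule -> variants -> run -> score loop.
# `paraphrase` and `answers_agree` are toy stand-ins, not real internals.

def paraphrase(question: str) -> list[str]:
    # Real variants use emotional framing, presupposition, casual phrasing,
    # authority dodges, and edge cases; these are crude placeholders.
    return [
        question,
        f"Just checking: {question.lower()}",
        f"My manager says it's fine, but {question.lower()}",
    ]

def answers_agree(a: str, b: str) -> bool:
    # Stand-in for semantic comparison: same yes/no verdict.
    return ("yes" in a.lower()) == ("yes" in b.lower())

def consistency_score(app, question: str) -> float:
    """Fraction of variant-answer pairs that agree, from 0 to 1."""
    answers = [app(v) for v in paraphrase(question)]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(answers_agree(a, b) for a, b in pairs) / len(pairs)

# A toy app that breaks the 30-day rule only for casual phrasing:
def toy_app(q: str) -> str:
    if "just checking" in q.lower():
        return "Yes, of course we can do that!"
    return "No, refunds are only allowed within 30 days."

print(consistency_score(toy_app, "Can I get a refund after 45 days?"))
# → 0.3333333333333333  (one variant slipped past the rule)
```

Only the casual variant flips the answer, so one of three answer pairs agrees and the score lands at 0.33: unstable.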
features
Prebuilt test suites for e-commerce, HR, healthcare, and legal. Built from failure patterns that show up in production. 48 cases, zero config. --policy ecommerce and you're running.
Paste your system prompt. Contradish extracts every policy rule and generates test cases. Nothing to write.
Finds outputs that make incompatible claims across semantically equivalent inputs. Shows the exact contradiction and which phrasing triggered it.
Every rule gets a CAI score from 0 to 1. Stable, marginal, or unstable. Failures show the contradiction and why it happened.
Compare baseline vs candidate. Fails the build if CAI score drops. One CLI command. Works with GitHub Actions.
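The gate's core logic is easy to picture: a rule fails the build when its candidate score dips below the threshold or regresses from baseline. A minimal sketch of that logic, not the CLI's implementation (`max_drop` is an illustrative knob):

```python
def gate(baseline: dict[str, float], candidate: dict[str, float],
         threshold: float = 0.80, max_drop: float = 0.0) -> list[str]:
    """Return the rules that should fail the build."""
    failing = []
    for rule, base_score in baseline.items():
        cand_score = candidate.get(rule, 0.0)  # missing rule counts as 0
        if cand_score < threshold or base_score - cand_score > max_drop:
            failing.append(rule)
    return failing

baseline = {"refund window": 0.91, "price matching": 0.92}
candidate = {"refund window": 0.54, "price matching": 0.93}
print(gate(baseline, candidate))  # → ['refund window']
```

A nonempty list means a nonzero exit code, which is all GitHub Actions needs to block the merge.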
Wrap your live app. Intercepts contradictory responses in real traffic before they reach users. Monitor or block mode.
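Conceptually, monitor mode re-asks a rephrased shadow of each live question and flags disagreement, while block mode also withholds the answer. A generic sketch of that idea in plain Python, not Contradish's actual wrapper API:

```python
# Generic monitor/block guard around an app callable. The rephrasing
# and agreement checks are illustrative stand-ins, not real internals.

def make_guard(app, rephrase, agree, mode: str = "monitor"):
    flagged = []  # questions whose shadow answer contradicted the original

    def guarded(question: str) -> str:
        answer = app(question)
        shadow = app(rephrase(question))
        if not agree(answer, shadow):
            flagged.append(question)
            if mode == "block":
                return "Let me double-check that policy and get back to you."
        return answer

    guarded.flagged = flagged  # inspect which questions contradicted
    return guarded
```

In monitor mode the original answer still goes out and the contradiction is only logged; block mode swaps in a safe holding response.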
Failing tests? Contradish generates 3 improved prompt variants, tests each one, and returns them ranked by CAI score.
Run with --report and get a self-contained HTML file. Paste into a PR, send to your team, post it.
"3 failures" tells you nothing. Fingerprinting groups them by root cause: numeric_drift, policy_contradiction, exception_invention. Fix the pattern, not just the test.
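Under the hood, this kind of grouping is a keyed aggregation over failures. A minimal sketch, where the `pattern_type` labels mirror the examples above but the classifier producing them is assumed to exist elsewhere:

```python
from collections import Counter
from typing import NamedTuple

class Failure(NamedTuple):
    rule: str
    pattern_type: str  # e.g. numeric_drift, policy_contradiction

def fingerprint_counts(failures: list[Failure]) -> Counter:
    """Count failures per root-cause pattern rather than per test."""
    return Counter(f.pattern_type for f in failures)

failures = [
    Failure("refund window", "policy_contradiction"),
    Failure("refund window", "numeric_drift"),
    Failure("price matching", "policy_contradiction"),
]
print(fingerprint_counts(failures).most_common(1))
# → [('policy_contradiction', 2)]
```

Three failing tests collapse into two patterns, and the top one tells you which prompt fix pays off most.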
Push results directly into Langfuse or Arize Phoenix. Feeds your existing stack. to_langfuse(report, client) and you're done.
One function call produces a timestamped compliance document mapped to NIST AI RMF MAP/MEASURE/MANAGE and EU AI Act Articles 9 and 72. Hand it to legal as-is.
Write assert cai_report.cai_score >= 0.80 in your existing test file. Shows up in pytest output. Zero extra tooling.
SARIF output + one workflow file. Failures appear as inline annotations on the PR diff. Add your API key, done.
Three questions. Writes .contradish.yaml and optionally the GitHub Actions workflow. Under a minute from zero to running.
CAI failures aren't limited to single turns. Agentic and multi-turn invariance testing is next.
The CAI™ judge is benchmarked. 420 pairs across 19 domains including policy, finance, and insurance. The CAI™ Semantic Equivalence Benchmark v0.3 is public. GPT-4o: 0.3642 avg CAI™ Strain. See the numbers.
view leaderboard →

quickstart
```bash
# path 1: zero config setup
pip install contradish
contradish init  # three questions, writes .contradish.yaml + GH Actions workflow
contradish --policy ecommerce --app mymodule:my_app --report

# path 2: system prompt
contradish "You are a support agent. Refunds within 30 days only."
```

```python
# path 3: pytest — drop into your existing test suite
# test_myapp.py
def test_cai(cai_report, cai_threshold):
    assert cai_report.cai_score >= cai_threshold
    assert cai_report.failure_count == 0

# path 4: Python — run, fingerprint, export
from contradish import Suite
from contradish.fingerprint import fingerprint
from contradish.exporters import to_langfuse
from contradish.audit import to_audit_html

report = Suite.from_policy("ecommerce", app=my_app).run()
for cluster in fingerprint(report):
    print(cluster.pattern_type, cluster.frequency)
to_langfuse(report, langfuse_client, dataset_name="cai-ecommerce")
open("audit.html", "w").write(to_audit_html(report, app_version="prod-v12"))
```

```bash
# path 5: CI gate with SARIF
contradish --policy ecommerce --threshold 0.80 --format sarif --output contradish.sarif
contradish compare evals.yaml \
  --baseline mymodule:old_app \
  --candidate mymodule:new_app \
  --threshold 0.80
```
what is CAI
Compression-Aware Intelligence™ (CAI™)
A model has good CAI™ if it gives the same answer regardless of how a question is framed. Not just right. Consistently right.
A CAI™ failure is when the same model gives contradictory answers because the phrasing changed. The model can be factually grounded in both responses. That's a semantic invariance failure, not a hallucination.
The CAI™ score measures consistency. 0 is unstable, 1 is stable. The benchmark is public.
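To make the bands concrete, a score-to-label mapping might look like the following. The cutoffs here are illustrative assumptions chosen to match the report excerpt above (0.92 reads "stable", 0.54 reads "unstable"), not Contradish's published thresholds:

```python
def stability_label(cai_score: float) -> str:
    # Illustrative cutoffs only, not Contradish's published bands.
    if cai_score >= 0.80:
        return "stable"
    if cai_score >= 0.60:
        return "marginal"
    return "unstable"

print(stability_label(0.92), stability_label(0.54))  # → stable unstable
```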
If you're studying LLM consistency or robustness: CAI™ is a graded operationalization of semantic invariance (0–1). 420 prompt pairs, 19 domains, v0.3 dataset. Dataset on HuggingFace, scoring code on GitHub. GPT-4o avg CAI™ Strain: 0.36. That's the baseline.
leaderboard
Which models hold up?
CAI™ Strain scores across 420 prompt pairs, 19 domains. GPT-4o: 0.36. The CAI™ Semantic Equivalence Benchmark v0.3 is the public record for model consistency. Run it on your model and submit a PR.
No config. No test cases. Just a policy pack and an API key.
or see the CAI™ leaderboard and benchmark your model