gpt-4o-mini or claude-3-haiku) is “good enough” for a specific task, vibes aren’t enough. You need hard data.
ShadowTest allows you to define multiple models (tiers), run prompts against them in parallel, and score the candidate outputs against your current production output.
Quick Start
Scoring Strategies
ShadowTest supports four built-in scoring strategies to evaluate candidate outputs:
1. exact-match
Strict string comparison against an expected output. Best for highly deterministic tasks like classification or rigid data extraction.
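To make "strict" concrete, here is a minimal standalone sketch of the idea. It is not ShadowTest's implementation; the function name `exact_match` and the choice not to trim whitespace or normalize case are assumptions made for illustration.

```python
def exact_match(candidate: str, expected: str) -> bool:
    """Return True only when the two strings are identical, character for character."""
    return candidate == expected


# A classification label either matches or it does not.
print(exact_match("positive", "positive"))  # True
print(exact_match("Positive", "positive"))  # False: case differs
```

Because any difference in case or whitespace counts as a failure, this style of check tends to suit tasks where the model is instructed to return a fixed label or value.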
2. format-check
Validates that the output adheres to a specific format (currently supports json).
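Conceptually, a JSON format check boils down to "does the output parse?". The sketch below shows that idea with Python's standard `json` module. The helper name `is_valid_json` is hypothetical, and ShadowTest's own check may be stricter or looser than this.

```python
import json


def is_valid_json(output: str) -> bool:
    """Return True if the output parses as JSON, False otherwise."""
    try:
        json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# A clean JSON object passes; JSON wrapped in chatty prose does not.
print(is_valid_json('{"label": "spam"}'))
print(is_valid_json('Sure! Here is the JSON: {"label": "spam"}'))
```

Note that a bare scalar such as `42` is also valid JSON, so a check like this one only confirms parseability, not that the output has the shape your application expects.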
3. similarity
Uses vector embeddings to ensure candidate answers remain semantically close to the reference answer. Ideal for summarization or open-ended generation where exact wording doesn’t matter.
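The underlying comparison is typically cosine similarity between two vectors, checked against a threshold. The sketch below illustrates that calculation only. The `embed` function is a toy word-count stand-in for a real embedding model, and the `0.8` threshold and all function names are assumptions, not ShadowTest defaults.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_similarity(candidate: str, reference: str, threshold: float = 0.8) -> bool:
    """Pass when the candidate is at least `threshold` similar to the reference."""
    return cosine_similarity(embed(candidate), embed(reference)) >= threshold
```

With a real embedding model, paraphrases that share few words can still score highly, which is what makes this approach useful for summarization. The word-count stand-in here cannot capture that and is only meant to show the arithmetic.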
4. llm-judge
Uses an LLM to evaluate the candidate output against a custom rubric, using the production output as a reference.
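One common way to structure an LLM judge is to assemble a grading prompt from the rubric, the reference, and the candidate, then parse a one-word verdict. The sketch below shows that pattern under stated assumptions: the prompt wording, the PASS/FAIL protocol, and the names `build_judge_prompt`, `judge`, and `ask_llm` are all hypothetical, and `ask_llm` is any callable you supply that sends a prompt to a model and returns its text reply.

```python
from typing import Callable


def build_judge_prompt(rubric: str, reference: str, candidate: str) -> str:
    """Assemble the grading prompt shown to the judge model."""
    return (
        "You are grading a candidate answer against a reference answer.\n"
        f"Rubric: {rubric}\n\n"
        f"Reference (production output):\n{reference}\n\n"
        f"Candidate output:\n{candidate}\n\n"
        "Reply with exactly one word: PASS or FAIL."
    )


def judge(
    rubric: str,
    reference: str,
    candidate: str,
    ask_llm: Callable[[str], str],
) -> bool:
    """Return True when the judge model's reply starts with PASS."""
    verdict = ask_llm(build_judge_prompt(rubric, reference, candidate))
    return verdict.strip().upper().startswith("PASS")
```

Passing the model call in as a function keeps the sketch independent of any particular provider SDK and makes it easy to test with a stub, for example `judge(rubric, ref, cand, lambda prompt: "PASS")`.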
Reading the Report
The ShadowTestReport gives you objective metrics to make a routing decision:
- Pass Rate: What percentage of candidate responses passed the scoring strategy?
- Avg Cost: The average cost per call for each tier.
- Avg Latency: The average latency per call for each tier.
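To show how these three numbers relate to the raw per-call data, here is a small standalone sketch that aggregates them per tier. The `CallResult` record, its field names, and the `summarize` function are assumptions for illustration and do not reflect ShadowTest's internal data model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallResult:
    tier: str          # model tier that produced the response
    passed: bool       # did the response pass the scoring strategy?
    cost_usd: float    # cost of this single call
    latency_ms: float  # wall-clock latency of this single call


def summarize(results: list[CallResult]) -> dict[str, dict[str, float]]:
    """Group results by tier and compute pass rate (%), average cost, and average latency."""
    by_tier: dict[str, list[CallResult]] = {}
    for r in results:
        by_tier.setdefault(r.tier, []).append(r)
    return {
        tier: {
            "pass_rate": 100.0 * sum(r.passed for r in rows) / len(rows),
            "avg_cost_usd": mean(r.cost_usd for r in rows),
            "avg_latency_ms": mean(r.latency_ms for r in rows),
        }
        for tier, rows in by_tier.items()
    }
```

Reading the three metrics side by side is what supports a routing decision: a cheaper tier is a candidate for the task only if its pass rate holds up while cost and latency drop.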
Use report.to_json() or report.to_markdown() to save the results for your CI/CD pipelines or pull request comments.