Your product, not the SAT
Open source · MIT license · Runs locally or in CI
Test language models against the prompts you actually run in production, and see which one is fastest, cheapest, and most accurate — on one comparison dashboard you can hand to your team.
Free & open source · No vendor lock-in · Bring your own keys
How it works · prompt to decision
If you can write a prompt, you can run an evaluation.
Write test cases from the actual prompts you use in production. Attach validation rules — required terms, minimum length, structured item counts — so grading reflects what the output is for.
EvalPulse sends each prompt to every model, runs multiple passes to measure consistency, and automatically grades every output on completeness, accuracy, format, relevance, and clarity.
Open the dashboard for a ranked leaderboard, dimension-by-dimension breakdowns, reliability scores, and side-by-side comparisons across runs over time.
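To make the first step concrete, here is a minimal sketch of a test case carrying the three kinds of validation rule mentioned above. It is an illustration only, not EvalPulse's actual configuration format: the `EvalCase` class, its field names, and the `validate` helper are hypothetical, the length rule is assumed to count characters, and a "structured item" is assumed to mean a bulleted or numbered line.

```python
import re
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """Hypothetical test case: a production prompt plus validation rules."""
    prompt: str
    required_terms: list[str] = field(default_factory=list)
    min_length: int = 0  # minimum output length in characters (assumed unit)
    min_items: int = 0   # minimum number of list items in the output


# Assumption: a "structured item" is a line that starts with -, *, or "1." / "1)".
ITEM_PATTERN = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+", re.MULTILINE)


def validate(case: EvalCase, output: str) -> list[str]:
    """Return the rule violations; an empty list means the output passes."""
    failures = []
    lowered = output.lower()
    for term in case.required_terms:
        if term.lower() not in lowered:
            failures.append(f"missing required term: {term}")
    if len(output) < case.min_length:
        failures.append(f"too short: {len(output)} < {case.min_length} characters")
    items = len(ITEM_PATTERN.findall(output))
    if items < case.min_items:
        failures.append(f"too few list items: {items} < {case.min_items}")
    return failures


case = EvalCase(
    prompt="Summarize our data retention policy for auditors.",
    required_terms=["retention", "audit"],
    min_length=20,
    min_items=2,
)
output = "Retention policy:\n- Keep audit logs for 7 years\n- Review quarterly"
print(validate(case, output))  # prints [] because every rule is satisfied
```

Returning a list of violations rather than a single pass/fail flag keeps the reason for a failed check visible, which is what a per-output breakdown would need.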
What sets it apart
It answers one question: which model should I actually be paying for?
Real prompts
Evaluation suites run on the exact prompts from your application. A compliance-doc generator is judged on compliance prompts, not academic trivia.
Two judges
Every output is graded by two independent models, never one. Their scores are averaged so no single provider's bias can skew the leaderboard.
Weighted dimensions
Each output is scored on completeness, accuracy, format, relevance, and clarity — with weights you set. A safety document and a chatbot are held to different standards.
Always current
EvalPulse watches for new models that fit your budget and flags them for evaluation. When a cheaper or faster option appears, you will know.
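As a rough illustration of weighted dimensions, the sketch below computes a weighted mean over the five dimensions named above. The `weighted_score` function and both weight profiles are assumptions made for this example; the numbers are invented and are not EvalPulse defaults.

```python
DIMENSIONS = ("completeness", "accuracy", "format", "relevance", "clarity")


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores; weights are normalized by their sum."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total = sum(weights[d] for d in DIMENSIONS)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total


# Illustrative weight profiles (made-up numbers).
SAFETY_DOC = {"completeness": 3, "accuracy": 4, "format": 1, "relevance": 1, "clarity": 1}
CHATBOT = {"completeness": 1, "accuracy": 2, "format": 1, "relevance": 3, "clarity": 3}

# The same graded output, scored 0-10 on each dimension.
scores = {"completeness": 6.0, "accuracy": 9.0, "format": 8.0, "relevance": 7.0, "clarity": 5.0}

print(round(weighted_score(scores, SAFETY_DOC), 2))  # 7.4
print(round(weighted_score(scores, CHATBOT), 2))     # 6.8
```

The same output lands at 7.4 under the safety-document profile and 6.8 under the chatbot profile, which is the sense in which the two are held to different standards.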
The method · how a score is made
Every response is graded across five dimensions by two independent models. The scores are averaged, the spread is recorded, and a run only counts as reliable when the judges agree. No single number, no single opinion.
judge_a 8.1 | judge_b 8.0 | Δ 0.10
weighted average — 8.05 / 10 over 3 passes
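The arithmetic behind the example above can be sketched as follows. This is one plausible reading, not the tool's confirmed logic: the `score_run` function is hypothetical, the `max_spread` agreement threshold is an assumption (the page does not state one), and taking the largest per-pass gap as the run's spread is likewise a choice made for illustration.

```python
from statistics import mean


def score_run(passes: list[tuple[float, float]], max_spread: float = 1.0) -> dict:
    """Combine (judge_a, judge_b) score pairs from several passes.

    max_spread is a hypothetical agreement threshold, not a documented value.
    """
    if not passes:
        raise ValueError("at least one pass is required")
    per_pass = [(a + b) / 2 for a, b in passes]  # average the two judges
    spreads = [abs(a - b) for a, b in passes]    # judge disagreement per pass
    worst = max(spreads)
    return {
        "score": mean(per_pass),         # averaged over all passes
        "spread": worst,                 # largest disagreement seen
        "reliable": worst <= max_spread  # judges agree closely enough
    }


# Three passes in which judge_a gives 8.1 and judge_b gives 8.0 each time.
result = score_run([(8.1, 8.0), (8.1, 8.0), (8.1, 8.0)])
print(f"{result['score']:.2f} / 10, spread {result['spread']:.2f}, reliable={result['reliable']}")
```

With these inputs the score rounds to 8.05 and the spread to 0.10, matching the figures shown above; a pair such as 9.0 and 5.0 would average to 7.0 but be marked unreliable under the assumed threshold.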
Get started · about 60 seconds
Requires Python 3.10+ and an OpenRouter API key.
Click through an interactive dashboard with example results across text generation, text classification, and a vision safety gate. No install required.