Test AI models against the prompts you actually use and see which one is fastest, cheapest, and most accurate, with a comparison dashboard you can share with your team.
Free & open source · Runs locally or in CI · No vendor lock-in
No data scientists required. If you can write a prompt, you can run an eval.
Write test cases using the actual prompts you use in production. Add validation rules — required terms, minimum length, structured item counts.
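As a sketch, a test case with validation rules might look like the following. The field names here are illustrative assumptions, not EvalPulse's actual schema; see the configuration reference on GitHub for the real format.

```python
# Hypothetical suite definition; every field name below is illustrative,
# not EvalPulse's actual configuration schema.
suite = {
    "name": "compliance-doc-generator",
    "cases": [
        {
            # A real production prompt, not a synthetic benchmark question.
            "prompt": "Draft a GDPR data-retention policy for a B2B SaaS product.",
            "checks": {
                "required_terms": ["retention period", "data subject", "erasure"],
                "min_length": 600,  # minimum output length in characters
                "min_items": 5,     # e.g. at least five numbered sections
            },
        },
    ],
}
```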
EvalPulse sends your prompts to each model, runs multiple passes to check consistency, and automatically grades every output on quality dimensions such as accuracy, completeness, and format.
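Conceptually, that run loop reduces to the minimal Python sketch below. The `complete` callables stand in for model API calls, and the toy grader just scores the validation rules from the suite above; none of this is EvalPulse's actual implementation.

```python
import statistics

def grade(output: str, checks: dict) -> float:
    """Toy rule-based grader: the fraction of validation rules
    the output satisfies, scaled to 0-100."""
    passed = [
        all(term.lower() in output.lower() for term in checks["required_terms"]),
        len(output) >= checks["min_length"],
    ]
    return 100.0 * sum(passed) / len(passed)

def evaluate(models: dict, case: dict, passes: int = 3) -> dict:
    """Send the same prompt to each model several times, grade every
    output, and track the score spread across passes as a consistency signal."""
    results = {}
    for name, complete in models.items():  # complete: prompt text -> output text
        outputs = [complete(case["prompt"]) for _ in range(passes)]
        scores = [grade(out, case["checks"]) for out in outputs]
        results[name] = {
            "mean_score": statistics.mean(scores),
            "spread": statistics.pstdev(scores),  # low spread = consistent model
        }
    return results
```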
Open the dashboard to see a ranked leaderboard, dimension-by-dimension breakdowns, reliability scores, and side-by-side comparisons across evaluation runs.
Answer the question: which model should I actually be paying for?
Your evaluation suites use the exact prompts from your product. A compliance doc generator is evaluated on compliance prompts, not SAT questions.
Every output is graded by two independent AI models, not just one. Their scores are averaged so no single provider's bias skews the results.
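In code, that cross-grading step amounts to something like this sketch, where each judge is a callable backed by a different provider's model; the interface is an assumption, not the real API.

```python
def cross_grade(output: str, judges: list) -> dict:
    """Grade one output with two independent judge models and average
    their 0-100 scores; a large gap between the judges is a useful
    flag for human review. Illustrative sketch, not EvalPulse's API."""
    scores = [judge(output) for judge in judges]
    return {
        "score": sum(scores) / len(scores),
        "judge_gap": max(scores) - min(scores),  # big gap = the judges disagree
    }
```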
Each output is scored on completeness, accuracy, format, relevance, and clarity — with weights you control. A safety document and a chatbot need different standards.
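The weighting itself is just a normalized weighted average over those five dimensions. Here is a minimal sketch with made-up default weights; the real defaults live in the configuration reference.

```python
DEFAULT_WEIGHTS = {  # made-up defaults, for illustration only
    "completeness": 0.25,
    "accuracy":     0.30,
    "format":       0.15,
    "relevance":    0.15,
    "clarity":      0.15,
}

def weighted_score(dimension_scores: dict[str, float],
                   weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Fold per-dimension grades (0-100) into one number. Weights are
    normalized by their sum, so they need not add up to exactly 1."""
    total = sum(weights.values())
    return sum(dimension_scores.get(dim, 0.0) * w
               for dim, w in weights.items()) / total
```

For a safety document you might push the accuracy weight toward 0.5; for a chatbot, clarity and relevance can dominate instead.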
EvalPulse watches for new models that fit your budget and flags them for evaluation. When a cheaper or faster option appears, you'll know.
Click through an interactive dashboard with example results across six models and two evaluation suites. No install required.
Requires Python 3.10+ and an OpenRouter API key.
Full documentation and configuration reference on GitHub →