Open source · MIT license

Stop guessing which
LLM works best for your task.

Test AI models against the prompts you actually use, see which one is fastest, cheapest, and most accurate — with a comparison dashboard you can share with your team.

See Live Demo View on GitHub

Free & open source  ·  Runs locally or in CI  ·  No vendor lock-in

How it works

Three steps from prompt to answer

No data scientists required. If you can write a prompt, you can run an eval.

1

Define your test cases

Write test cases using the actual prompts you use in production. Add validation rules — required terms, minimum length, structured item counts.
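To make the idea concrete, here is a minimal sketch of what a test case with validation rules might look like. The field names (`required_terms`, `min_length`, `min_items`) and the helper function are illustrative assumptions, not EvalPulse's actual schema.

```python
# Hypothetical shape of a test case; field names are illustrative,
# not EvalPulse's real configuration format.
test_case = {
    "id": "compliance_summary_01",
    "prompt": "Summarize the key obligations in the attached GDPR clause.",
    "validation": {
        "required_terms": ["data subject", "controller"],
        "min_length": 200,   # minimum output length in characters
        "min_items": 3,      # bullet points expected in the answer
    },
}

def passes_validation(output: str, rules: dict) -> bool:
    """Check a model output against the validation rules above."""
    if len(output) < rules.get("min_length", 0):
        return False
    if any(term.lower() not in output.lower()
           for term in rules.get("required_terms", [])):
        return False
    bullets = [line for line in output.splitlines()
               if line.strip().startswith("-")]
    return len(bullets) >= rules.get("min_items", 0)
```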

2

Run the evaluation

EvalPulse sends your prompts to each model, runs multiple passes to check consistency, and grades every output on quality, accuracy, and format — automatically.
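One way to turn "multiple passes" into a consistency figure is to average the per-pass scores and penalize their spread. The formula below is a sketch of the idea, assuming a 10-point scale; it is not EvalPulse's actual reliability calculation.

```python
from statistics import mean, pstdev

def reliability(scores: list[float]) -> float:
    """Illustrative 0-1 reliability figure: penalize the spread of
    repeated-pass scores relative to a 10-point scale. An assumption,
    not EvalPulse's real formula."""
    return max(0.0, 1.0 - pstdev(scores) / 10.0)

def grade_passes(scores: list[float]) -> tuple[float, float]:
    """Average quality across passes, plus how consistent the model was."""
    return mean(scores), reliability(scores)
```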

3

Compare the results

Open the dashboard to see a ranked leaderboard, dimension-by-dimension breakdowns, reliability scores, and side-by-side comparisons across evaluation runs.

Features

Built for real decisions, not toy benchmarks

Answer the question: which model should I actually be paying for?

🎯

Real prompts, not academic benchmarks

Your evaluation suites use the exact prompts from your product. A compliance doc generator is evaluated on compliance prompts, not SAT questions.

⚖️

Scores you can actually trust

Every output is graded by two independent AI models, not just one. Their scores are averaged so no single provider's bias skews the results.
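The two-judge idea reduces to simple arithmetic. A sketch, with a hypothetical `max_gap` threshold for flagging disagreements that EvalPulse may or may not expose:

```python
def consensus(judge_a: float, judge_b: float, max_gap: float = 2.0):
    """Average two independent judge scores so neither provider's bias
    dominates; flag large disagreements for review. Illustrative only."""
    score = (judge_a + judge_b) / 2
    needs_review = abs(judge_a - judge_b) > max_gap
    return score, needs_review
```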

📊

Graded on what matters to you

Each output is scored on completeness, accuracy, format, relevance, and clarity — with weights you control. A safety document and a chatbot need different standards.
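Weighted scoring over those five dimensions can be sketched as a normalized weighted average. The weight values below are examples you would tune per suite, not defaults shipped with EvalPulse.

```python
DIMENSIONS = ["completeness", "accuracy", "format", "relevance", "clarity"]

def weighted_score(scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) using user-controlled weights.
    Normalizes by the weight total, so weights need not sum to 1."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total

# Example: a safety document might weight accuracy far above format.
safety_weights = {"completeness": 2.0, "accuracy": 4.0, "format": 0.5,
                  "relevance": 2.0, "clarity": 1.5}
```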

🔍

New models, automatically tested

EvalPulse watches for new models that fit your budget and flags them for evaluation. When a cheaper or faster option appears, you'll know.

See a real evaluation result

Click through an interactive dashboard with example results across six models and two evaluation suites. No install required.

Regulatory Compliance Suite  ·  regulatory_compliance  ·  6 models compared

GPT-5.4 Mini: 7.8/10
GPT-5.4 Nano: 7.3/10
Qwen 3.5 Plus: 7.0/10
Gemini 3.1 Flash Lite: 6.5/10
Open Live Dashboard →
Get started

Up and running in 60 seconds

Requires Python 3.10+ and an OpenRouter API key.

# 1. Clone and install
git clone https://github.com/aristidesnakos/model-evals-framework && cd model-evals-framework
pip install -r requirements.txt
 
# 2. Add your API key
cp .env.example .env  # then add OPENROUTER_API_KEY
 
# 3. Create your first evaluation suite
python evalpulse.py init
 
# 4. Run a dry-run to verify (no cost), then evaluate
python evalpulse.py --dry-run --suite getting_started
python evalpulse.py --run-eval --suite getting_started --dashboard