Open source · MIT license

Stop guessing which
LLM works best for your task.

Test AI models against the prompts you actually use, see which one is fastest, cheapest, and most accurate — with a comparison dashboard you can share with your team.

See Live Demo View on GitHub

Free & open source  ·  Runs locally or in CI  ·  No vendor lock-in

How it works

Three steps from prompt to answer

No data scientists required. If you can write a prompt, you can run an eval.

1

Define your test cases

Write test cases using the actual prompts you use in production. Add validation rules — required terms, minimum length, structured item counts.
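To make the idea concrete, here is a minimal sketch of what a test case with validation rules might look like. The field names (`required_terms`, `min_length`, `min_items`) and the helper function are illustrative assumptions, not EvalPulse's actual schema.

```python
# Hypothetical shape of a test case; field names are illustrative,
# not EvalPulse's real configuration format.
test_case = {
    "id": "compliance_summary_01",
    "prompt": "Summarize the key obligations in the attached GDPR clause.",
    "validation": {
        "required_terms": ["data subject", "controller"],
        "min_length": 200,   # minimum output length in characters
        "min_items": 3,      # bullet points expected in the answer
    },
}

def passes_validation(output: str, rules: dict) -> bool:
    """Check a model output against the validation rules above."""
    if len(output) < rules.get("min_length", 0):
        return False
    if any(term.lower() not in output.lower()
           for term in rules.get("required_terms", [])):
        return False
    bullets = [line for line in output.splitlines()
               if line.strip().startswith("-")]
    return len(bullets) >= rules.get("min_items", 0)
```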

2

Run the evaluation

EvalPulse sends your prompts to each model, runs multiple passes to check consistency, and grades every output on quality, accuracy, and format — automatically.
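One way to turn "multiple passes" into a consistency figure is to average the per-pass scores and penalize their spread. The formula below is a sketch of the idea, assuming a 10-point scale; it is not EvalPulse's actual reliability calculation.

```python
from statistics import mean, pstdev

def reliability(scores: list[float]) -> float:
    """Illustrative 0-1 reliability figure: penalize the spread of
    repeated-pass scores relative to a 10-point scale. An assumption,
    not EvalPulse's real formula."""
    return max(0.0, 1.0 - pstdev(scores) / 10.0)

def grade_passes(scores: list[float]) -> tuple[float, float]:
    """Average quality across passes, plus how consistent the model was."""
    return mean(scores), reliability(scores)
```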

3

Compare the results

Open the dashboard to see a ranked leaderboard, dimension-by-dimension breakdowns, reliability scores, and side-by-side comparisons across evaluation runs.

Features

Built for real decisions, not toy benchmarks

Answer the question: which model should I actually be paying for?

🎯

Real prompts, not academic benchmarks

Your evaluation suites use the exact prompts from your product. A compliance doc generator is evaluated on compliance prompts, not SAT questions.

⚖️

Scores you can actually trust

Every output is graded by two independent AI models, not just one. Their scores are averaged so no single provider's bias skews the results.
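The two-judge idea reduces to simple arithmetic. A sketch, with a hypothetical `max_gap` threshold for flagging disagreements that EvalPulse may or may not expose:

```python
def consensus(judge_a: float, judge_b: float, max_gap: float = 2.0):
    """Average two independent judge scores so neither provider's bias
    dominates; flag large disagreements for review. Illustrative only."""
    score = (judge_a + judge_b) / 2
    needs_review = abs(judge_a - judge_b) > max_gap
    return score, needs_review
```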

📊

Graded on what matters to you

Each output is scored on completeness, accuracy, format, relevance, and clarity — with weights you control. A safety document and a chatbot need different standards.
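Weighted scoring over those five dimensions can be sketched as a normalized weighted average. The weight values below are examples you would tune per suite, not defaults shipped with EvalPulse.

```python
DIMENSIONS = ["completeness", "accuracy", "format", "relevance", "clarity"]

def weighted_score(scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) using user-controlled weights.
    Normalizes by the weight total, so weights need not sum to 1."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total

# Example: a safety document might weight accuracy far above format.
safety_weights = {"completeness": 2.0, "accuracy": 4.0, "format": 0.5,
                  "relevance": 2.0, "clarity": 1.5}
```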

🔍

New models, automatically tested

EvalPulse watches for new models that fit your budget and flags them for evaluation. When a cheaper or faster option appears, you'll know.

See a real evaluation result

Click through an interactive dashboard with example results across six models and two evaluation suites. No install required.

Regulatory Compliance Suite  ·  regulatory_compliance  ·  6 models compared

GPT-5.4 Mini: 7.8/10
GPT-5.4 Nano: 7.3/10
Qwen 3.5 Plus: 7.0/10
Gemini 3.1 Flash Lite: 6.5/10
Open Live Dashboard →
Get started

Up and running in 60 seconds

Requires Python 3.10+ and an OpenRouter API key.

# 1. Clone and install
git clone https://github.com/aristidesnakos/model-evals-framework && cd model-evals-framework
pip install -r requirements.txt
 
# 2. Add your API key
cp .env.example .env  # then add OPENROUTER_API_KEY
 
# 3. Create your first evaluation suite
python evalpulse.py init
 
# 4. Run a dry-run to verify (no cost), then evaluate
python evalpulse.py --dry-run --suite getting_started
python evalpulse.py --run-eval --suite getting_started --dashboard