EvalPulse LLM evaluation

Stop guessing which LLM works best for your task.

Test language models against the prompts you actually run in production, and see which one is fastest, cheapest, and most accurate — on one comparison dashboard you can hand to your team.

Free & open source · No vendor lock-in · Bring your own keys

Regulatory Compliance Suite
01 · GPT-5.4 Mini · production viable · 7.8 / 10
02 · GPT-5.4 Nano · production viable · 7.3 / 10
03 · Qwen 3.5 Plus · production viable · 7.0 / 10
04 · Gemini 3.1 Flash Lite · batch only · 6.5 / 10
Fig. 1: A real suite, scored by two independent judges. Averaged across multiple passes.

Three steps, no data scientists required.

If you can write a prompt, you can run an evaluation.

1

Define your test cases

Write test cases from the actual prompts you use in production. Attach validation rules — required terms, minimum length, structured item counts — so grading reflects what the output is for.
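As a rough illustration of what such validation rules could look like in code, here is a minimal Python sketch. The `ValidationRules` class, its field names, and the `validate` helper are hypothetical and are not taken from EvalPulse's actual schema; they only show how required terms, a minimum length, and a structured item count might be checked against one output.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationRules:
    """Hypothetical rule set; field names are illustrative, not EvalPulse's schema."""
    required_terms: list[str] = field(default_factory=list)
    min_length: int = 0       # minimum number of characters in the output
    min_list_items: int = 0   # minimum number of bulleted or numbered lines


def count_list_items(text: str) -> int:
    """Count lines that look like bullet ("- ", "* ") or numbered ("1.") items."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("- ", "* ")):
            count += 1
        else:
            head, sep, _ = stripped.partition(".")
            if sep and head.isdigit():
                count += 1
    return count


def validate(output: str, rules: ValidationRules) -> list[str]:
    """Return human-readable failures; an empty list means the output passed."""
    failures = []
    lowered = output.lower()
    for term in rules.required_terms:
        if term.lower() not in lowered:
            failures.append(f"missing required term: {term}")
    if len(output) < rules.min_length:
        failures.append(f"too short: {len(output)} < {rules.min_length} characters")
    items = count_list_items(output)
    if items < rules.min_list_items:
        failures.append(f"too few list items: {items} < {rules.min_list_items}")
    return failures


# Illustrative test case for a compliance-style prompt (made-up data).
rules = ValidationRules(required_terms=["GDPR", "retention"], min_length=40, min_list_items=2)
sample = "GDPR checklist:\n1. Define the retention period\n2. Record the lawful basis"
print(validate(sample, rules))
```

Returning a list of failures rather than a single boolean keeps the reason for a failed check visible, which tends to matter when the grading is meant to reflect what the output is for.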

2

Run the evaluation

EvalPulse sends each prompt to every model, runs multiple passes to measure consistency, and grades every output on quality, accuracy, and format — automatically.
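The page does not say how consistency across passes is measured. One plausible approach, sketched below in Python, is to summarize the per-pass scores for one model on one test case with a mean, a standard deviation, and a range. The function name and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def consistency_summary(pass_scores: list[float]) -> dict[str, float]:
    """Summarize the scores a model earned on repeated passes of one test case.

    A low standard deviation and a narrow range suggest the model behaves
    consistently; the metrics chosen here are assumptions, not EvalPulse's own.
    """
    if not pass_scores:
        raise ValueError("need at least one pass")
    return {
        "mean": mean(pass_scores),
        "stdev": pstdev(pass_scores),  # population standard deviation
        "range": max(pass_scores) - min(pass_scores),
    }


# Three illustrative passes of the same prompt (made-up scores).
print(consistency_summary([7.5, 8.0, 8.5]))
```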

3

Compare the results

Open the dashboard for a ranked leaderboard, dimension-by-dimension breakdowns, reliability scores, and side-by-side comparisons across runs over time.

Built for real decisions, not toy benchmarks.

It answers one question: which model should I actually be paying for?

Your product, not the SAT

Evaluation suites run on the exact prompts from your application. A compliance-doc generator is judged on compliance prompts, not academic trivia.

Scores you can trust

Every output is graded by two independent models, never one. Their scores are averaged so no single provider's bias can skew the leaderboard.

Graded on what matters

Each output is scored on completeness, accuracy, format, relevance, and clarity — with weights you set. A safety document and a chatbot are held to different standards.
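To make the effect of weights concrete, here is a small Python sketch of a weighted average over the five dimensions named above. The function, the weight values, and the sample scores are illustrative assumptions; the source does not specify the exact formula EvalPulse uses.

```python
DIMENSIONS = ("completeness", "accuracy", "format", "relevance", "clarity")


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, normalized by the total weight."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Made-up scores: strong everywhere except accuracy.
scores = {"completeness": 8, "accuracy": 6, "format": 8, "relevance": 8, "clarity": 8}

# Hypothetical weight profiles: a chatbot weighs everything equally,
# a safety document weighs accuracy three times as heavily.
chatbot_weights = {d: 1.0 for d in DIMENSIONS}
safety_weights = {**chatbot_weights, "accuracy": 3.0}

print(round(weighted_score(scores, chatbot_weights), 2))
print(round(weighted_score(scores, safety_weights), 2))
```

With these numbers the same output scores lower under the safety profile than under the chatbot profile, because the weak accuracy score counts for more.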

New models, flagged automatically

EvalPulse watches for new models that fit your budget and flags them for evaluation. When a cheaper or faster option appears, you will know.
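How the watching works is not described here. As a loose sketch under stated assumptions, flagging could amount to filtering a model catalog by price ceilings and skipping models that were already evaluated. The `ModelListing` type, the price fields, and the model IDs below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelListing:
    """Hypothetical catalog entry; prices are USD per million tokens."""
    model_id: str
    input_price_per_mtok: float
    output_price_per_mtok: float


def flag_new_models(
    catalog: list[ModelListing],
    already_evaluated: set[str],
    max_input_price: float,
    max_output_price: float,
) -> list[ModelListing]:
    """Return models not yet evaluated whose prices fit within the budget."""
    return [
        m
        for m in catalog
        if m.model_id not in already_evaluated
        and m.input_price_per_mtok <= max_input_price
        and m.output_price_per_mtok <= max_output_price
    ]


# Made-up catalog and budget for illustration only.
catalog = [
    ModelListing("example/small-v2", 0.20, 0.80),
    ModelListing("example/large-v2", 3.00, 12.00),
    ModelListing("example/small-v1", 0.15, 0.60),
]
flagged = flag_new_models(
    catalog,
    already_evaluated={"example/small-v1"},
    max_input_price=0.50,
    max_output_price=1.50,
)
print([m.model_id for m in flagged])
```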

One output. Two judges. Five dimensions.

Every response is graded across five dimensions by two independent models. The scores are averaged, the spread is recorded, and a run only counts as reliable when the judges agree. No single number, no single opinion.

See it on a live run
Scorecard · one test case · judges agree
Completeness  8.2
Accuracy      7.9
Format        9.0
Relevance     7.5
Clarity       8.4

judge_a 8.1  |  judge_b 8.0  |  Δ 0.10
weighted average — 8.05 / 10 over 3 passes
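The scorecard figures are consistent with a plain mean of the two judge scores and an absolute difference for the spread. The Python sketch below reproduces them under that reading. The `max_spread` agreement threshold is an assumption, since the page does not state how close the judges must be for a run to count as reliable.

```python
def combine_judges(judge_a: float, judge_b: float, max_spread: float = 1.0) -> dict:
    """Average two judge scores and record how far apart they are.

    max_spread is a hypothetical agreement threshold, not a documented
    EvalPulse setting.
    """
    spread = abs(judge_a - judge_b)
    return {
        "average": round((judge_a + judge_b) / 2, 2),
        "spread": round(spread, 2),
        "judges_agree": spread <= max_spread,
    }


# Judge scores taken from the scorecard above.
result = combine_judges(8.1, 8.0)
print(result)
```

Recording the spread alongside the average means a run where the judges disagree sharply can be set aside instead of quietly averaging into the leaderboard.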

Up and running before the kettle boils.

Requires Python 3.10+ and an OpenRouter API key.

# 1. Clone and install
git clone https://github.com/aristidesnakos/model-evals-framework && cd model-evals-framework
pip install -r requirements.txt
 
# 2. Add your API key
cp .env.example .env  # then add OPENROUTER_API_KEY
 
# 3. Create your first evaluation suite
python evalpulse.py init
 
# 4. Dry-run to verify (no cost), then evaluate
python evalpulse.py --dry-run --suite getting_started
python evalpulse.py --run-eval --suite getting_started --dashboard

See a real evaluation result.

Click through an interactive dashboard with example results across text generation, text classification, and a vision safety gate. No install required.