Open source · MIT license · Runs locally or in CI

Stop guessing which LLM works best for your task.

Test language models against the prompts you actually run in production, and compare their speed, cost, and accuracy on one dashboard you can hand to your team.

See the live dashboard · View on GitHub

Free & open source · No vendor lock-in · Bring your own keys

Regulatory Compliance Suite · regulatory_compliance · 6 models compared

01 · GPT-5.4 Mini · production viable · 7.8 / 10

02 · GPT-5.4 Nano · production viable · 7.3 / 10

03 · Qwen 3.5 Plus · production viable · 7.0 / 10

04 · Gemini 3.1 Flash Lite · batch only · 6.5 / 10

Fig. 1: A real suite, scored by two independent judges · Averaged across multiple passes

How it works · prompt to decision

Three steps, no data scientists required.

If you can write a prompt, you can run an evaluation.

Define your test cases

Write test cases from the actual prompts you use in production. Attach validation rules — required terms, minimum length, structured item counts — so grading reflects what the output is for.
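To make the three rule types concrete, here is a minimal sketch of how such checks could work. The `ValidationRules` class, its field names, and the `validate` function are hypothetical illustrations, not EvalPulse's actual schema; see the GitHub documentation for the real suite format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ValidationRules:
    # Hypothetical rule container; field names are assumptions.
    required_terms: list[str] = field(default_factory=list)
    min_length: int = 0   # minimum number of characters
    min_items: int = 0    # minimum number of bulleted or numbered items


def validate(output: str, rules: ValidationRules) -> list[str]:
    """Return human-readable failures; an empty list means the output passed."""
    failures = []
    lowered = output.lower()
    for term in rules.required_terms:
        if term.lower() not in lowered:
            failures.append(f"missing required term: {term}")
    if len(output) < rules.min_length:
        failures.append(f"too short: {len(output)} < {rules.min_length} characters")
    # Count lines that start with "-", "*", "1." or "1)" followed by text.
    items = re.findall(r"^\s*(?:[-*]|\d+[.)])\s+\S", output, flags=re.MULTILINE)
    if len(items) < rules.min_items:
        failures.append(f"too few list items: {len(items)} < {rules.min_items}")
    return failures


rules = ValidationRules(required_terms=["GDPR"], min_length=10, min_items=2)
print(validate("1. GDPR applies\n2. Retain records", rules))  # []
```

Rules like these are cheap to run before any judge model is called, so an output that is obviously off-format can be caught early.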

Run the evaluation

EvalPulse sends each prompt to every model, runs multiple passes to measure consistency, and grades every output on quality, accuracy, and format — automatically.
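One simple way to read "consistency" is the spread of a model's scores across passes. The sketch below assumes the population standard deviation as the measure; the function name and the choice of statistic are illustrative assumptions, not a description of EvalPulse's internals.

```python
from statistics import mean, pstdev


def summarize_passes(scores: list[float]) -> dict[str, float]:
    """Summarize one model's scores on one test case across repeated passes."""
    return {
        "mean": round(mean(scores), 2),
        # A lower spread means the model answers more consistently.
        "spread": round(pstdev(scores), 2),
    }


print(summarize_passes([7.0, 9.0]))  # {'mean': 8.0, 'spread': 1.0}
```

Two models with the same mean can differ sharply in spread, which is why a single pass is rarely enough to choose between them.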

Compare the results

Open the dashboard for a ranked leaderboard, dimension-by-dimension breakdowns, reliability scores, and side-by-side comparisons across runs over time.

What sets it apart

Built for real decisions, not toy benchmarks.

It answers one question: which model should I actually be paying for?

Real prompts

Your product, not the SAT

Evaluation suites run on the exact prompts from your application. A compliance-doc generator is judged on compliance prompts, not academic trivia.

Two judges

Scores you can trust

Every output is graded by two independent models, never one. Their scores are averaged, so no single provider's bias dominates the leaderboard.

Weighted dimensions

Graded on what matters

Each output is scored on completeness, accuracy, format, relevance, and clarity — with weights you set. A safety document and a chatbot are held to different standards.
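As a rough illustration of how weights change the outcome, the sketch below computes a weighted average over the five dimensions. The scores, the weight values, and the `weighted_score` helper are made up for the example; the weights EvalPulse ships with may differ.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0-10 scale)."""
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


# Hypothetical scores for one output.
scores = {"completeness": 8.0, "accuracy": 9.0, "format": 6.0,
          "relevance": 7.0, "clarity": 8.0}

# Hypothetical weight profiles: a safety document leans on accuracy,
# a chatbot leans on clarity and relevance.
safety_weights = {"completeness": 2, "accuracy": 3, "format": 1,
                  "relevance": 1, "clarity": 1}
chatbot_weights = {"completeness": 1, "accuracy": 1, "format": 1,
                   "relevance": 2, "clarity": 3}

print(weighted_score(scores, safety_weights))   # 8.0
print(weighted_score(scores, chatbot_weights))  # 7.625
```

The same output lands at 8.0 under one profile and 7.625 under the other, which is the point of setting weights per use case.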

Always current

New models, flagged automatically

EvalPulse watches for new models that fit your budget and flags them for evaluation. When a cheaper or faster option appears, you will know.
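Conceptually, flagging comes down to filtering a model catalog by budget and by what has already been evaluated. The sketch below assumes a hypothetical catalog of dicts with `id` and `price_per_mtok` fields; it calls no real provider API and is not EvalPulse's actual discovery logic.

```python
def flag_new_models(catalog: list[dict], evaluated: set[str],
                    max_price_per_mtok: float) -> list[str]:
    """Return ids of unevaluated models within budget, cheapest first."""
    candidates = [
        model for model in catalog
        if model["id"] not in evaluated
        and model["price_per_mtok"] <= max_price_per_mtok
    ]
    candidates.sort(key=lambda model: model["price_per_mtok"])
    return [model["id"] for model in candidates]


# Hypothetical catalog entries; ids and prices are placeholders.
catalog = [
    {"id": "model-a", "price_per_mtok": 0.5},
    {"id": "model-b", "price_per_mtok": 3.0},
    {"id": "model-c", "price_per_mtok": 0.2},
]
print(flag_new_models(catalog, evaluated={"model-a"}, max_price_per_mtok=1.0))
# ['model-c']
```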

The method · how a score is made

One output. Two judges. Five dimensions.

Every response is graded across five dimensions by two independent models. The scores are averaged, the spread is recorded, and a run only counts as reliable when the judges agree. No single number, no single opinion.

See it on a live run

Scorecard · one test case · judges agree

Completeness 8.2

Accuracy 7.9

Format 9.0

Relevance 7.5

Clarity 8.4

judge_a 8.1 | judge_b 8.0 | Δ 0.10
weighted average — 8.05 / 10 over 3 passes
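The two-judge step in the scorecard can be sketched as follows, using the judge_a and judge_b values shown above. The `max_spread` agreement threshold of 1.0 is an assumption chosen for illustration; the page does not state the cutoff EvalPulse uses, and the names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JudgedScore:
    average: float   # mean of the two judges' scores
    spread: float    # absolute disagreement between the judges
    reliable: bool   # True when the judges agree within the threshold


def combine_judges(judge_a: float, judge_b: float,
                   max_spread: float = 1.0) -> JudgedScore:
    """Average two judge scores and record how far apart they are."""
    spread = abs(judge_a - judge_b)
    return JudgedScore(
        average=(judge_a + judge_b) / 2,
        spread=spread,
        reliable=spread <= max_spread,  # assumed agreement rule
    )


result = combine_judges(8.1, 8.0)
print(round(result.average, 2), round(result.spread, 2), result.reliable)
# 8.05 0.1 True
```

Keeping the spread next to the average means a disputed score is visible instead of being hidden inside a single number.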

Get started · about 60 seconds

Up and running before the kettle boils.

Requires Python 3.10+ and an OpenRouter API key.

# 1. Clone and install
git clone https://github.com/aristidesnakos/model-evals-framework && cd model-evals-framework
pip install -r requirements.txt
 
# 2. Add your API key
cp .env.example .env  # then add OPENROUTER_API_KEY
 
# 3. Create your first evaluation suite
python evalpulse.py init
 
# 4. Dry-run to verify (no cost), then evaluate
python evalpulse.py --dry-run --suite getting_started
python evalpulse.py --run-eval --suite getting_started --dashboard

Full documentation on GitHub

See a real evaluation result.

Click through an interactive dashboard with example results across text generation, text classification, and a vision safety gate. No install required.

Open the live dashboard · View on GitHub