The LLM Benchmark Repository
One-stop destination for raw LLM benchmark data. Compare 20+ industry benchmarks — MMLU-Pro, GPQA, SWE-bench, Aider Polyglot, BFCL, Chatbot Arena ELO, and more — across every major frontier and open-weights model. Updated daily, with full source attribution on every score. No opaque composites, no vanity rankings — just the numbers.
Why raw benchmarks beat composite indexes
Almost every LLM leaderboard on the public internet publishes a composite intelligence index — a single 0–100 number computed by weighted-averaging whatever benchmarks happened to be available for each model. These composites look clean and sortable, but they have three fundamental problems that make them actively misleading for real decisions.
Problem 1: composites hide sparsity. Imagine two models. Model A scored 80 on GPQA and nothing else. Model B scored 80 on GPQA, 79 on MMLU-Pro, 78 on SWE-bench, 82 on MATH, and 77 on BBH. A composite "intelligence index" rates them almost identically: Model A's lone 80 averages to 80, and Model B's five scores average to 79.2. You cannot tell them apart from the composite alone, even though Model B's number is backed by 5× the evidence and Model A's is a single data point that could easily be wrong.
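The sparsity problem is easy to demonstrate. A minimal sketch, using the illustrative numbers above (the score dictionaries are hypothetical, not real leaderboard data):

```python
from statistics import mean

# Hypothetical per-model benchmark scores from the example above.
model_a = {"GPQA": 80}                                    # one data point
model_b = {"GPQA": 80, "MMLU-Pro": 79, "SWE-bench": 78,
           "MATH": 82, "BBH": 77}                         # five data points

def composite(scores: dict) -> float:
    """Naive composite: unweighted mean of whatever benchmarks exist."""
    return mean(scores.values())

a = composite(model_a)   # 80
b = composite(model_b)   # 79.2
# The composites are nearly identical, but only Model B's is backed by
# five independent measurements. The composite throws that evidence away.
```

Raw benchmarks keep the evidence count visible: an empty cell is information, not something to average over.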
Problem 2: composites mix incomparable scales. Chatbot Arena ELO ranges from ~1000 to ~1600 and measures subjective human preference. MMLU-Pro is a 0–100% accuracy score on multiple-choice questions. SWE-bench Verified is the percentage of GitHub issues a model resolved that passed the actual test suite. These three things measure completely different phenomena and cannot be meaningfully averaged — but composites do it anyway.
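To see why the scales cannot be averaged, consider a naive composite over one hypothetical model's scores (the specific numbers here are illustrative):

```python
from statistics import mean

# Three scores for one hypothetical model, on three incomparable scales.
arena_elo = 1350.0   # Chatbot Arena ELO (~1000-1600, pairwise human preference)
mmlu_pro = 78.4      # multiple-choice accuracy, 0-100%
swe_bench = 41.2     # % of GitHub issues resolved, 0-100%

naive_composite = mean([arena_elo, mmlu_pro, swe_bench])
# ~489.9 -- the ELO term dominates the mean, and the result has no
# interpretable unit: it is neither a rating nor a percentage.
```

Rescaling (e.g. min-max normalizing each benchmark to 0–1) makes the arithmetic legal but not meaningful: preference ratings and test-suite pass rates still measure different phenomena.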
Problem 3: composites hide what was weighted. Every composite has a proprietary formula. Is Intelligence weighted 30% or 40%? Does Coding include SWE-bench Live or only SWE-bench Verified? Does the "Reasoning" index include AIME, or not? You have to read a methodology page to find out — if one even exists. Raw benchmarks need no explanation: the number means exactly what the benchmark page says it means.
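The weighting problem is not cosmetic: the choice of weights can flip the ranking outright. A small sketch with two hypothetical models and two plausible weighting schemes (all numbers invented for illustration):

```python
# Two hypothetical models with different strengths.
scores = {
    "model_x": {"intelligence": 85, "coding": 60},
    "model_y": {"intelligence": 70, "coding": 80},
}

def composite(s: dict, w_int: float, w_code: float) -> float:
    """Weighted composite over two category scores."""
    return s["intelligence"] * w_int + s["coding"] * w_code

def leader(w_int: float, w_code: float) -> str:
    return max(scores, key=lambda m: composite(scores[m], w_int, w_code))

leader(0.7, 0.3)  # intelligence-heavy weighting: model_x wins (77.5 vs 73.0)
leader(0.3, 0.7)  # coding-heavy weighting:       model_y wins (77.0 vs 67.5)
```

Same raw scores, opposite "winner". Unless the methodology page discloses the weights, the composite's ranking is an editorial choice, not a measurement.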
Which benchmarks actually matter in 2026?
Many of the benchmarks from the GPT-3 era are now fully saturated — top models routinely score 95%+ on MMLU, HumanEval, and HellaSwag, making them useless for distinguishing frontier capability. The benchmarks that do matter in 2026 share three properties:
- Contamination-resistant. LiveBench rotates new questions monthly from recently-published research. FrontierMath sources from unpublished research papers. LiveCodeBench uses post-cutoff competitive programming problems.
- Verifiable. SWE-bench runs the PR against the repo’s actual test suite. Aider Polyglot scores against Exercism unit tests. BFCL uses Abstract Syntax Tree comparison rather than LLM-as-judge.
- Long-horizon. SWE-bench Pro averages 107 lines of changes across 4.1 files per task. Terminal-Bench requires multi-step shell execution. These measure real-world capability, not pattern-matching.
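The "verifiable" property is worth making concrete. BFCL's AST comparison, for instance, checks whether a predicted function call is structurally identical to the reference call, independent of formatting. A minimal sketch of the idea using Python's standard `ast` module (BFCL's actual harness is more involved; `calls_match` and the example calls are illustrative):

```python
import ast

def calls_match(predicted: str, expected: str) -> bool:
    """Compare two function-call strings structurally, ignoring whitespace
    and formatting, by comparing their parsed abstract syntax trees."""
    return ast.dump(ast.parse(predicted)) == ast.dump(ast.parse(expected))

calls_match("get_weather(city='Paris', unit='C')",
            "get_weather( city = 'Paris', unit = 'C' )")   # True: same AST
calls_match("get_weather(city='Paris')",
            "get_weather(city='London')")                  # False: argument differs
```

Because the check is deterministic, there is no LLM-as-judge variance to argue about: a score of 72% on an AST-graded eval means 72% of calls were structurally correct, full stop.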
Running the best benchmark winners privately
The Intelligence and Coding champions on this leaderboard are increasingly open-weight models — Llama 4, DeepSeek V3.2, Qwen 3.6, Kimi K2 Thinking, MiniMax M2.7 — that match or beat GPT-5 on specific evals at 1/20th the cost. The challenge for enterprises is running them privately: fully air-gapped, with no data leaving the device.
Iternal's AirgapAI platform runs any of these open-weights models locally on Intel Core Ultra chipsets, with zero cloud dependency and GPU-class performance on modern laptops. Use this benchmark repository to pick the right model, then deploy it without handing your data to a hyperscaler.
Data freshness & sources
The repository refreshes every 24 hours via a Cloudflare Cron Trigger. On each run, a worker fetches from four live sources in parallel: OpenRouter for the model catalog, pricing, and context windows; SWE-bench for software engineering scores; the LMArena community mirror for crowdsourced Chatbot Arena ELO; and the Hugging Face Open LLM Leaderboard v2 for MMLU-Pro, GPQA, BBH, IFEval, MATH Lvl 5, and MuSR. Benchmarks the live sources don’t cover — FrontierMath, HLE, LiveCodeBench, Aider Polyglot, BFCL V4, HumanEval, Terminal-Bench, AIME — are backfilled from a curated seed compiled from vendor announcements and public research papers. The health of every source is shown in the summary strip above the table, and any single source failure simply falls through to the next without breaking the snapshot. Every cell in the detail drawer shows exactly which source provided the score and when it was fetched.
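The per-score fall-through described above can be sketched as a priority chain: try each live source in order, skip any that fails, and fall back to the curated seed, recording attribution at every step. This is a hedged illustration of the pattern, not the worker's real code; the source names, fetcher signatures, and `resolve_score` helper are all hypothetical:

```python
from datetime import datetime, timezone

def resolve_score(model: str, benchmark: str, sources: list, seed: dict):
    """Return (score, source_name, fetched_at) from the first live source
    that answers; fall back to the curated seed if none do.

    `sources` is a priority-ordered list of (name, fetch_fn) pairs, where
    fetch_fn(model, benchmark) returns a score or None, or raises on failure.
    """
    for name, fetch in sources:
        try:
            score = fetch(model, benchmark)
            if score is not None:
                return score, name, datetime.now(timezone.utc)
        except Exception:
            continue  # a failed source falls through without breaking the snapshot
    return seed.get((model, benchmark)), "curated-seed", datetime.now(timezone.utc)
```

Keeping the source name and fetch timestamp alongside the score is what lets every cell in the detail drawer show its provenance.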
Deploy the top-ranked open-weight model privately
Pick your model here. Run it on your own hardware with AirgapAI — no cloud, no data egress, no per-token bills. Match GPU-class performance on Intel Core Ultra laptops.