The LLM Benchmark Repository

One-stop destination for raw LLM benchmark data. Compare 20+ industry benchmarks — MMLU-Pro, GPQA, SWE-bench, Aider Polyglot, BFCL, Chatbot Arena ELO, and more — across every major frontier and open-weights model. Updated daily, with full source attribution on every score. No opaque composites, no vanity rankings — just the numbers.

LLM Benchmarks 2026 · Intelligence Index · SWE-bench Verified · GPQA Diamond · Agentic Evaluation · Chatbot Arena ELO
Models Tracked · 20+ Benchmarks · 4 Live Sources · Daily Refresh Cadence
Need help choosing? Our LLM Selection Guide provides a decision framework for matching models to your use case.
Read the Selection Guide

Download the Full Benchmark CSV

Get the complete snapshot of all tracked models, their raw benchmark scores, and source attribution. We'll email you when new benchmarks or sources go live.

Why raw benchmarks beat composite indexes

Almost every LLM leaderboard on the public internet publishes a composite intelligence index — a single 0–100 number computed by weighted-averaging whatever benchmarks happened to be available for each model. These composites look clean and sortable, but they have three fundamental problems that make them actively misleading for real decisions.

Problem 1: composites hide sparsity. Imagine two models. Model A scored 80 on GPQA and nothing else. Model B scored 80 on GPQA, 79 on MMLU-Pro, 78 on SWE-bench, 82 on MATH, and 77 on BBH. A composite "intelligence index" would rate them nearly identically: the average of a single 80 is 80, and the average of (80, 79, 78, 82, 77) is 79.2. You cannot tell them apart from the composite alone, even though Model B's number is backed by five times the evidence and Model A's is a single data point that could easily be wrong.
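The sparsity problem above can be shown in a few lines. This is an illustrative sketch, not the formula any real leaderboard uses; the scores are the hypothetical Model A and Model B numbers from the text.

```javascript
// Illustrative only: how a naive composite hides benchmark sparsity.
const average = (scores) => scores.reduce((sum, s) => sum + s, 0) / scores.length;

const modelA = [80];                  // one benchmark result
const modelB = [80, 79, 78, 82, 77]; // five benchmark results

console.log(average(modelA)); // 80
console.log(average(modelB)); // 79.2
// Nearly identical composites, but Model B's is backed by 5x the evidence.
```

The composite alone gives no way to recover how many benchmarks went into it, which is exactly the information a buyer needs.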

Problem 2: composites mix incomparable scales. Chatbot Arena ELO ranges from ~1000 to ~1600 and measures subjective human preference. MMLU-Pro is a 0–100% accuracy score on multiple-choice questions. SWE-bench Verified is the percentage of real GitHub issues a model resolves, as verified by the repository's actual test suite. These three things measure completely different phenomena and cannot be meaningfully averaged, yet composites do it anyway.

Problem 3: composites hide what was weighted. Every composite has a proprietary formula. Is Intelligence weighted 30% or 40%? Does Coding include SWE-bench Live or only SWE-bench Verified? Does the "Reasoning" index include AIME, or not? You have to read a methodology page to find out — if one even exists. Raw benchmarks need no explanation: the number means exactly what the benchmark page says it means.

Our approach: This repository shows only raw benchmark scores, direct from each source, with a clear source link on every cell. No hidden weights. No normalized scaling. No averaging of incomparable things. You add the columns you care about, you see the actual numbers, and you judge comparability yourself.

Which benchmarks actually matter in 2026?

Many of the benchmarks from the GPT-3 era are now fully saturated — top models routinely score 95%+ on MMLU, HumanEval, and HellaSwag, making them useless for distinguishing frontier capability. The benchmarks that do matter in 2026 share three properties:

  1. Contamination-resistant. LiveBench rotates new questions monthly from recently-published research. FrontierMath sources from unpublished research papers. LiveCodeBench uses post-cutoff competitive programming problems.
  2. Verifiable. SWE-bench runs the model's patch against the repo's actual test suite. Aider Polyglot scores against Exercism unit tests. BFCL uses Abstract Syntax Tree comparison rather than LLM-as-judge.
  3. Long-horizon. SWE-bench Pro averages 107 lines of changes across 4.1 files per task. Terminal-Bench requires multi-step shell execution. These measure real-world capability, not pattern-matching.

Running the benchmark winners privately

The Intelligence and Coding champions on this leaderboard are increasingly open-weight models — Llama 4, DeepSeek V3.2, Qwen 3.6, Kimi K2 Thinking, MiniMax M2.7 — that match or beat GPT-5 on specific evals at 1/20th the cost. The challenge for enterprises is running them privately: fully air-gapped, with no data leaving the device.

Iternal's AirgapAI platform runs any of these open-weights models locally on Intel Core Ultra chipsets, with zero cloud dependency and GPU-class performance on modern laptops. Use this benchmark repository to pick the right model, then deploy it without giving your data to a hyperscaler.

Data freshness & sources

The repository refreshes every 24 hours via a Cloudflare Cron Trigger. On each run, a worker fetches from four live sources in parallel: OpenRouter for the model catalog, pricing, and context windows; SWE-bench for software engineering scores; the LMArena community mirror for crowdsourced Chatbot Arena ELO; and the Hugging Face Open LLM Leaderboard v2 for MMLU-Pro, GPQA, BBH, IFEval, MATH Lvl 5, and MuSR. Benchmarks the live sources don’t cover — FrontierMath, HLE, LiveCodeBench, Aider Polyglot, BFCL V4, HumanEval, Terminal-Bench, AIME — are backfilled from a curated seed compiled from vendor announcements and public research papers. The health of every source is shown in the summary strip above the table, and any single source failure simply falls through to the next without breaking the snapshot. Every cell in the detail drawer shows exactly which source provided the score and when it was fetched.

Frequently Asked Questions

Why raw scores instead of a composite index?
Composite "intelligence indexes" look tidy but hide what actually matters. A model whose composite of 80 rests on a single GPQA score looks identical to one that averages 80 across 15 benchmarks, even though the two numbers carry wildly different levels of confidence. By showing only raw benchmarks, you can see exactly which evaluations a model was tested on, which it skipped, and judge comparability yourself. No hidden weights, no normalized fudging, no apples-to-oranges combinations.
Where does the benchmark data come from?
Data is pulled daily via Cloudflare Cron from four live sources: OpenRouter (model catalog, pricing, context windows), SWE-bench (software engineering), the LMArena community mirror (Chatbot Arena ELO), and the Hugging Face Open LLM Leaderboard v2 (MMLU-Pro, GPQA, BBH, IFEval, MATH Lvl 5, MuSR). Benchmarks the live sources don't cover — FrontierMath, HLE, LiveCodeBench, Aider Polyglot, BFCL V4, HumanEval, Terminal-Bench, AIME — are backfilled from a curated seed compiled from public research and vendor announcements. Every cell in the detail drawer shows its source and timestamp.
How often is the data updated?
The repository is refreshed once every 24 hours via a Cloudflare Cron Trigger at 06:00 UTC. The sync fetches all sources in parallel with Promise.allSettled so a single source outage cannot corrupt the snapshot. Each source's freshness and health status is shown in the summary strip above the table.
Why do different models have different numbers of benchmarks?
LLM providers and evaluation labs publish on different benchmarks. OpenAI might publish GPQA and SWE-bench for a new model but skip Aider Polyglot; an open-weights lab might run Hugging Face's Open LLM Leaderboard suite but not the proprietary Scale SEAL evals. This is exactly why raw scores matter: a gap in a column is honest — it tells you "we don't know" — whereas a composite would silently invent a number from whatever happened to be available.
Why is Chatbot Arena ELO shown differently?
Chatbot Arena ELO is a crowdsourced human-preference rating on a ~1000–1600 scale, not a 0–100 accuracy measurement. Mixing it with objective benchmarks is misleading — you can't directly compare "1504 ELO" to "82% on MMLU-Pro". Arena ELO is shown in its own "Human Preference (Subjective)" category with the raw ELO displayed, clearly tagged PREF, and visually styled with a dashed amber border so you never accidentally confuse it with an objective score.
How do I compare specific models?
Add the benchmark columns you care about from the sidebar picker (or use the "Add Benchmark Columns" quick-select dropdown at the top). Sort by any column. Pin up to 5 models using the thumbtack icon and toggle "Pinned Only" for a head-to-head view. Copy the "Share" link to send the exact filtered view to a colleague — all filters, pins, columns, and sort order are encoded in the URL.
Can I export the data?
Yes. The "Export CSV" button downloads your current filtered view with every raw benchmark score, the source that provided it, and the fetch timestamp. We ask for an email before the first export so we can notify you when new benchmark sources go live.
Which benchmarks actually matter in 2026?
Most GPT-3 era benchmarks (MMLU, HumanEval, HellaSwag) are now saturated — top models routinely score 95%+ which makes them useless for distinguishing frontier capability. The benchmarks that matter in 2026 are contamination-resistant (LiveBench rotates questions monthly, FrontierMath uses unpublished research, LiveCodeBench uses post-cutoff problems), verifiable (SWE-bench runs actual test suites, Aider Polyglot uses Exercism unit tests, BFCL uses AST comparison), and long-horizon (SWE-bench Pro averages 107 lines of changes across 4.1 files). Click any column header to sort by it — focus on GPQA Diamond, SWE-bench Verified, and FrontierMath for the sharpest frontier signal.
Which models should I actually deploy in production?
Public benchmarks tell you which models are capable on standardized tests, not which are right for your specific use case or how they will behave on your data. Use this repository to narrow the field to 2–3 candidates based on the benchmarks closest to your workload, then run them against your own private evaluation set. Iternal's AirgapAI platform can run any open-weights model listed here in a fully private, on-premise deployment.

Deploy the top-ranked open-weight model privately

Pick your model here. Run it on your own hardware with AirgapAI: no cloud, no data egress, no per-token bills, and GPU-class performance on Intel Core Ultra laptops.