AI Model Benchmarks
Standard: 36 · HF: 22
About these benchmarks
Benchmarks are standardised tests that score how well AI models handle reasoning, knowledge, math and coding. Use them to compare models objectively and pick the right one for your task.
📊 Standard: the four most-cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), taken from each model's announcement page; our Score weights them into a single number.
🤗 HF Open LLM Leaderboard: six tasks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) measured uniformly for open-source models; ranked by the Average.
HF Open LLM Leaderboard v2
IFEval · BBH · MATH · GPQA · MuSR · MMLU-Pro - open-source models on standardised tasks.
| # | Model | Provider | IFEval | BBH | MATH | GPQA | MuSR | MMLU-Pro | Average |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | Meta | 90.0 | 56.6 | 48.3 | 10.5 | 15.6 | 48.1 | 44.9 |
| 2 | | Meta | 86.7 | 55.9 | 38.1 | 14.2 | 17.7 | 47.9 | 43.4 |
| 3 | | Meta | 86.7 | 55.9 | 38.1 | 14.2 | 17.7 | 47.9 | 43.4 |
| 4 | | Meta | 73.9 | 24.1 | 17.7 | 3.8 | 1.4 | 24.4 | 24.2 |
| 5 | | Meta | 73.9 | 24.1 | 17.7 | 3.8 | 1.4 | 24.4 | 24.2 |
| 6 | | Meta | 50.6 | 29.2 | 15.5 | 9.5 | 8.5 | 30.9 | 24.0 |
| 7 | | Meta | 58.1 | 8.3 | 8.2 | 2.4 | 1.9 | 8.2 | 14.5 |
| 8 | | Meta | 24.0 | 18.4 | 3.9 | 2.1 | 19.9 | 17.8 | 14.3 |
Average = mean of IFEval, BBH, MATH, GPQA, MuSR and MMLU-Pro · All scores in % · higher is better
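As a rough check, the Average column can be reproduced from the six task scores. The sketch below assumes the Average is a plain unweighted mean, which is consistent with the rows shown here but is not stated explicitly on the page. The names `TASKS` and `leaderboard_average` are illustrative only and do not belong to any leaderboard API.

```python
from statistics import mean

# The six leaderboard tasks, in the order used by the table columns.
TASKS = ("IFEval", "BBH", "MATH", "GPQA", "MuSR", "MMLU-Pro")


def leaderboard_average(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the six task scores (in %).

    Assumption: every task counts equally. Raises ValueError if a task
    score is missing, rather than silently averaging fewer tasks.
    """
    missing = [task for task in TASKS if task not in scores]
    if missing:
        raise ValueError(f"missing task scores: {missing}")
    return mean(scores[task] for task in TASKS)


# The six task scores of the rank 1 row in the table.
row_1 = dict(zip(TASKS, (90.0, 56.6, 48.3, 10.5, 15.6, 48.1)))
row_1_average = leaderboard_average(row_1)

# The table lists 44.9 for this row; the published task scores are
# themselves rounded, so expect agreement only to within rounding.
print(f"{row_1_average:.2f}")
```

Because the task scores are published to one decimal place, a recomputed mean can differ from the listed Average in the last digit.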

