AI Model Benchmarks
Standard: 36 · HF: 22
About these benchmarks
Benchmarks are standardised tests that score how well AI models handle reasoning, knowledge, math and coding. Use them to compare models objectively and pick the right one for your task.
📊 Standard: the four most-cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), with results taken from each model's announcement page; our Score weights them into a single number.
🤗 HF Open LLM Leaderboard: six tasks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) from the Hugging Face (HF) leaderboard, measured uniformly for open-source models; ranked by the Average.
📚 MMLU: 57 academic subjects
🔬 GPQA Diamond: PhD-level science questions
💻 HumanEval: Python code generation
🔧 SWE-Bench: real GitHub engineering tasks
| # | Model | Provider | MMLU | GPQA | HumanEval | SWE-Bench | Score |
|---|---|---|---|---|---|---|---|
| 1 | | Anthropic | 90.1 | 74.3 | 92.1 | 72.5 | 81.5 |
| 2 | | Moonshot AI | 83.0 | - | 77.0 | - | 79.7 |
| 3 | | Zhipu AI | 83.0 | - | 76.0 | - | 79.1 |
| 4 | | Baidu | 82.0 | - | 76.0 | - | 78.7 |
| 5 | | ByteDance | 82.0 | - | 75.0 | - | 78.1 |
| 6 | | Tencent | 81.0 | - | 74.0 | - | 77.1 |
| 7 | | OpenAI | 92.3 | 77.3 | 92.4 | 48.9 | 77.0 |
| 8 | | DeepSeek | 90.8 | 71.5 | 92.6 | 49.2 | 75.1 |
| 9 | | Yandex | 78.0 | - | 72.0 | - | 74.7 |
| 10 | | Anthropic | 88.3 | 65.0 | 92.0 | 57.0 | 74.4 |
| 11 | MiniMax Text-01 | MiniMax | 88.0 | 56.0 | 84.0 | - | 73.9 |
| 12 | | OpenAI | 86.9 | 67.4 | 91.7 | 49.3 | 72.9 |
| 13 | | Baidu | 88.0 | 55.0 | 82.0 | - | 72.8 |
| 14 | | Sber | 74.0 | - | 68.0 | - | 70.7 |
| 15 | | DeepSeek | 88.5 | 59.1 | 89.4 | 42.0 | 68.3 |
| 16 | | Yandex | 72.0 | - | 65.0 | - | 68.1 |
| 17 | | Alibaba | 89.0 | 59.0 | 87.0 | 37.0 | 66.5 |
| 18 | | OpenAI | 88.7 | 53.6 | 90.2 | 38.3 | 65.9 |
| 19 | HyperCLOVA X | Naver | 79.0 | - | 55.0 | - | 65.7 |
| 20 | | Yandex | 69.0 | - | 62.0 | - | 65.1 |
| 21 | | Meta | 88.6 | 50.7 | 89.0 | 34.1 | 63.7 |
| 22 | | Sber | 68.0 | - | 60.0 | - | 63.6 |
| 23 | | Mistral AI | 84.0 | 47.2 | 92.1 | 32.6 | 62.1 |
| 24 | | Anthropic | 82.9 | 43.0 | 88.3 | 33.2 | 59.9 |
| 25 | | Alibaba/Qwen | 86.1 | 49.0 | 86.5 | 23.7 | 59.5 |
| 26 | | Alibaba/Qwen | 80.0 | 42.0 | 92.3 | 30.0 | 59.2 |
| 27 | | | 85.9 | 46.2 | 84.1 | 26.9 | 58.8 |
| 28 | | | 83.0 | 45.0 | 85.0 | 22.0 | 56.9 |
| 29 | | Meta | 83.6 | 46.7 | 80.5 | 21.8 | 56.3 |
| 30 | | OpenAI | 82.0 | 40.1 | 87.1 | 22.8 | 55.9 |
| 31 | | | 78.9 | 37.0 | 78.9 | 16.2 | 50.7 |
| 32 | | | 75.2 | 38.4 | 74.0 | 14.7 | 48.7 |
| 33 | | Cohere | 80.4 | 30.1 | 74.2 | 16.8 | 47.9 |
| 34 | | Mistral AI | 68.0 | 32.0 | 73.4 | 15.0 | 45.3 |
| 35 | | Meta | 63.4 | 24.7 | 58.3 | 9.5 | 37.0 |
| 36 | | Anthropic | - | - | - | 74.5 | - |
Score = MMLU×20% + GPQA×30% + HumanEval×25% + SWE-Bench×25%
All scores in % · higher is better
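The Score formula can be sketched in Python as follows. The four weights come from the formula above. The handling of missing results ("-") is an inference from the table, not something the page states: rows with gaps appear consistent with dividing by the sum of the weights that are present (for row 2, (83.0×0.20 + 77.0×0.25) / 0.45 ≈ 79.7). The function name `weighted_score` and the `min_benchmarks` threshold are hypothetical; the threshold is only a guess at why row 36, which has a single result, shows no Score.

```python
from typing import Optional

# Weights taken from the Score formula in the table footer.
WEIGHTS = {"MMLU": 0.20, "GPQA": 0.30, "HumanEval": 0.25, "SWE-Bench": 0.25}


def weighted_score(
    results: dict[str, Optional[float]],
    min_benchmarks: int = 2,  # assumption: not documented by the source
) -> Optional[float]:
    """Combine benchmark results (in %) into one score, rounded to 1 decimal.

    Missing results are passed as None. The weights of the available
    benchmarks are renormalised so they sum to 1 (inferred behaviour).
    """
    available = {
        name: value
        for name, value in results.items()
        if name in WEIGHTS and value is not None
    }
    if len(available) < min_benchmarks:
        return None  # too few results to produce a score

    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return round(weighted_sum / total_weight, 1)


# Row 1: all four benchmarks reported.
full = {"MMLU": 90.1, "GPQA": 74.3, "HumanEval": 92.1, "SWE-Bench": 72.5}
# Row 2: GPQA and SWE-Bench missing.
partial = {"MMLU": 83.0, "GPQA": None, "HumanEval": 77.0, "SWE-Bench": None}

print(weighted_score(full))     # 81.5
print(weighted_score(partial))  # 79.7
```

One consequence of renormalising, if the inference holds: a model that reports only MMLU and HumanEval can outrank a model with lower SWE-Bench or GPQA results, so scores for rows with gaps are not directly comparable with fully reported rows.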

