TECHAGENT - MY AI LIFE

AI Model Benchmarks

Standard: 36 · HF: 22

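About Benchmarks

Benchmarks are standardized tests that evaluate how AI models perform at reasoning, knowledge, mathematics, and programming. They make it possible to compare models objectively and to pick a suitable model for a given task.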

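📊 Standard. The four most frequently cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), taken from each model's release page; our score combines them into a single number.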

🤗 HF Open LLM Leaderboard. Six tasks measured uniformly across open-source models (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro), ranked by average score.

📚 MMLU · 57 academic subjects · o1 (OpenAI) · 92.3%
🔬 GPQA Diamond · PhD-level science questions · o1 (OpenAI) · 77.3%
💻 HumanEval · Python code generation · DeepSeek R1 (DeepSeek) · 92.6%
🔧 SWE-Bench · Real GitHub engineering tasks · Claude Opus 4.1 (Anthropic) · 74.5%

| # | Model | Provider | MMLU | GPQA | HumanEval | SWE-Bench | Score |
|---|---|---|---|---|---|---|---|
| 1 | Llama 3.2 3B | Meta | 63.4 | 24.7 | 58.3 | 9.5 | 37.0 |
| 2 | Mistral NeMo | Mistral AI | 68.0 | 32.0 | 73.4 | 15.0 | 45.3 |
| 3 | Command R+ | Cohere | 80.4 | 30.1 | 74.2 | 16.8 | 47.9 |
| 4 | Gemma 2 27B | Google | 75.2 | 38.4 | 74.0 | 14.7 | 48.7 |
| 5 | Gemini 1.5 Flash | Google | 78.9 | 37.0 | 78.9 | 16.2 | 50.7 |
| 6 | GPT-4o mini | OpenAI | 82.0 | 40.1 | 87.1 | 22.8 | 55.9 |
| 7 | Llama 3.1 70B | Meta | 83.6 | 46.7 | 80.5 | 21.8 | 56.3 |
| 8 | Gemini 2.0 Flash | Google | 83.0 | 45.0 | 85.0 | 22.0 | 56.9 |
| 9 | Gemini 1.5 Pro | Google | 85.9 | 46.2 | 84.1 | 26.9 | 58.8 |
| 10 | Qwen 2.5 Coder 32B | Alibaba/Qwen | 80.0 | 42.0 | 92.3 | 30.0 | 59.2 |
| 11 | Qwen 2.5 72B | Alibaba/Qwen | 86.1 | 49.0 | 86.5 | 23.7 | 59.5 |
| 12 | Claude Haiku 4.5 | Anthropic | 82.9 | 43.0 | 88.3 | 33.2 | 59.9 |
| 13 | Mistral Large 2 | Mistral AI | 84.0 | 47.2 | 92.1 | 32.6 | 62.1 |
| 14 | GigaChat Pro | Sber | 68.0 | - | 60.0 | - | 63.6 |
| 15 | Llama 3.1 405B | Meta | 88.6 | 50.7 | 89.0 | 34.1 | 63.7 |
| 16 | YandexGPT 4 Lite | Yandex | 69.0 | - | 62.0 | - | 65.1 |
| 17 | HyperCLOVA X | Naver | 79.0 | - | 55.0 | - | 65.7 |
| 18 | GPT-4o | OpenAI | 88.7 | 53.6 | 90.2 | 38.3 | 65.9 |
| 19 | Qwen Max | Alibaba | 89.0 | 59.0 | 87.0 | 37.0 | 66.5 |
| 20 | YandexGPT 3 Pro | Yandex | 72.0 | - | 65.0 | - | 68.1 |
| 21 | DeepSeek V3 | DeepSeek | 88.5 | 59.1 | 89.4 | 42.0 | 68.3 |
| 22 | GigaChat Max | Sber | 74.0 | - | 68.0 | - | 70.7 |
| 23 | ERNIE 4.5 | Baidu | 88.0 | 55.0 | 82.0 | - | 72.8 |
| 24 | o3-mini | OpenAI | 86.9 | 67.4 | 91.7 | 49.3 | 72.9 |
| 25 | MiniMax Text-01 | MiniMax | 88.0 | 56.0 | 84.0 | - | 73.9 |
| 26 | Claude Sonnet 4.6 | Anthropic | 88.3 | 65.0 | 92.0 | 57.0 | 74.4 |
| 27 | YandexGPT 4 Pro | Yandex | 78.0 | - | 72.0 | - | 74.7 |
| 28 | DeepSeek R1 | DeepSeek | 90.8 | 71.5 | 92.6 | 49.2 | 75.1 |
| 29 | o1 | OpenAI | 92.3 | 77.3 | 92.4 | 48.9 | 77.0 |
| 30 | Hunyuan Pro | Tencent | 81.0 | - | 74.0 | - | 77.1 |
| 31 | Doubao Pro 32K | ByteDance | 82.0 | - | 75.0 | - | 78.1 |
| 32 | ERNIE 4.0 Turbo | Baidu | 82.0 | - | 76.0 | - | 78.7 |
| 33 | GLM-4 Plus | Zhipu AI | 83.0 | - | 76.0 | - | 79.1 |
| 34 | Moonshot v1 128K | Moonshot AI | 83.0 | - | 77.0 | - | 79.7 |
| 35 | Claude Opus 4.7 | Anthropic | 90.1 | 74.3 | 92.1 | 72.5 | 81.5 |
| 36 | Claude Opus 4.1 | Anthropic | - | - | - | 74.5 | - |
Score = MMLU × 20% + GPQA × 30% + HumanEval × 25% + SWE-Bench × 25%

All scores are percentages · higher is better

→ Full model table with pricing and speed
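
The score formula above can be sketched in a few lines of Python. The weights come straight from the formula. The handling of missing values is an assumption: the page does not say how a "-" is treated, but the published scores for rows with gaps (for example GigaChat Pro at 63.6) are consistent with dropping the missing benchmarks and renormalizing the remaining weights. The function name `composite_score` and the dictionary layout are illustrative, not taken from the site.

```python
from typing import Optional

# Weights taken from the formula stated on the page.
WEIGHTS = {"MMLU": 0.20, "GPQA": 0.30, "HumanEval": 0.25, "SWE-Bench": 0.25}


def composite_score(results: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted average over the benchmarks that have a value.

    Assumption (not stated on the page): a missing benchmark ("-", here None)
    is dropped and the remaining weights are renormalized to sum to 1.
    """
    available = {
        name: value
        for name, value in results.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        return None
    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return round(weighted_sum / total_weight, 1)


# Two rows from the table: one complete, one with missing benchmarks.
llama_3_2_3b = {"MMLU": 63.4, "GPQA": 24.7, "HumanEval": 58.3, "SWE-Bench": 9.5}
gigachat_pro = {"MMLU": 68.0, "GPQA": None, "HumanEval": 60.0, "SWE-Bench": None}

print(composite_score(llama_3_2_3b))  # 37.0, matching the table
print(composite_score(gigachat_pro))  # 63.6, matching the table
```

Because of the renormalization, a model reporting only two benchmarks is averaged over those two alone, which is worth keeping in mind when comparing its score against models that report all four.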