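AI Model Benchmarks
Standard: 36 · HF: 22
About benchmarks
Benchmarks are standardized tests that evaluate how AI models perform on reasoning, knowledge, math, and coding. They make it possible to compare models objectively and to pick a suitable model for a given task.
📊 Standard - the four most frequently cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), taken from each model's release page; our score aggregates them into a single number.
🤗 HF Open LLM Leaderboard - six tasks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) measured the same way across open-source models, ranked by average score.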
| # | Model | Provider | IFEval | BBH | MATH | GPQA | MuSR | MMLU-Pro | Average |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | Alibaba/Qwen | 86.4 | 61.9 | 59.8 | 16.7 | 11.7 | 51.4 | 48.0 |
| 2 | | Alibaba/Qwen | 86.4 | 61.9 | 59.8 | 16.7 | 11.7 | 51.4 | 48.0 |
| 3 | | Meta | 90.0 | 56.6 | 48.3 | 10.5 | 15.6 | 48.1 | 44.9 |
| 4 | | Meta | 86.7 | 55.9 | 38.1 | 14.2 | 17.7 | 47.9 | 43.4 |
| 5 | | Meta | 86.7 | 55.9 | 38.1 | 14.2 | 17.7 | 47.9 | 43.4 |
| 6 | | Microsoft | 68.8 | 55.3 | 50.0 | 11.5 | 10.1 | 48.6 | 40.7 |
| 7 | | Alibaba/Qwen | 72.7 | 52.3 | 49.5 | 13.2 | 13.7 | 37.9 | 39.9 |
| 8 | | Alibaba/Qwen | 72.7 | 52.3 | 49.5 | 13.2 | 13.7 | 37.9 | 39.9 |
| 9 | Hermes 3 70B Instruct (OSS) | Nousresearch | 76.6 | 53.8 | 21.0 | 14.9 | 23.4 | 41.4 | 38.5 |
| 10 | | | 79.8 | 49.3 | 23.9 | 16.7 | 9.1 | 38.4 | 36.2 |
| 11 | | Alibaba/Qwen | 75.8 | 34.9 | 50.0 | 5.5 | 8.4 | 36.5 | 35.2 |
| 12 | | Microsoft | 52.7 | 48.6 | 25.0 | 17.6 | 14.5 | 40.0 | 33.1 |
| 13 | | Microsoft | 73.8 | 38.7 | 17.0 | 7.9 | 6.5 | 32.6 | 29.4 |
| 14 | | DeepSeek | 43.4 | 35.8 | 30.7 | 2.0 | 13.3 | 41.6 | 27.8 |
| 15 | | Meta | 73.9 | 24.1 | 17.7 | 3.8 | 1.4 | 24.4 | 24.2 |
| 16 | | Meta | 73.9 | 24.1 | 17.7 | 3.8 | 1.4 | 24.4 | 24.2 |
| 17 | | Meta | 50.6 | 29.2 | 15.5 | 9.5 | 8.5 | 30.9 | 24.0 |
| 18 | | DeepSeek | 41.9 | 17.1 | 17.1 | 4.6 | 16.1 | 41.0 | 23.0 |
| 19 | Hermes 2 Pro - Llama-3 8B (OSS) | Nousresearch | 53.6 | 30.7 | 8.4 | 5.7 | 11.3 | 22.8 | 22.1 |
| 20 | | Meta | 58.1 | 8.3 | 8.2 | 2.4 | 1.9 | 8.2 | 14.5 |
| 21 | | Meta | 24.0 | 18.4 | 3.9 | 2.1 | 19.9 | 17.8 | 14.3 |
| 22 | | Mistral AI | 44.9 | 7.7 | 2.3 | 0.0 | 6.1 | 15.7 | 12.8 |
Average = mean of IFEval, BBH, MATH, GPQA, MuSR, and MMLU-Pro. All scores are percentages; higher is better.
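As a quick illustration of how the Average column relates to the six task scores, the sketch below recomputes it for the two rows whose model names appear in the table. It assumes the average is an unweighted arithmetic mean rounded to one decimal, which matches the displayed values to within rounding; the helper name `leaderboard_average` and the `TASKS` tuple are hypothetical and not part of any leaderboard API.

```python
from __future__ import annotations

from statistics import mean

# The six tasks that make up the average, in table column order.
TASKS = ("IFEval", "BBH", "MATH", "GPQA", "MuSR", "MMLU-Pro")


def leaderboard_average(scores: dict[str, float]) -> float:
    """Unweighted mean of the six task scores, rounded to one decimal.

    Assumption: equal weights per task; this is inferred from the table,
    not taken from a published formula.
    """
    return round(mean(scores[task] for task in TASKS), 1)


# Task scores copied from rows 9 and 19 of the table.
rows = {
    "Hermes 3 70B Instruct": dict(zip(TASKS, (76.6, 53.8, 21.0, 14.9, 23.4, 41.4))),
    "Hermes 2 Pro - Llama-3 8B": dict(zip(TASKS, (53.6, 30.7, 8.4, 5.7, 11.3, 22.8))),
}

averages = {name: leaderboard_average(scores) for name, scores in rows.items()}

# Rank by average, highest first, as the table does.
ranking = sorted(averages, key=averages.get, reverse=True)

for name in ranking:
    print(f"{name}: {averages[name]}")
```

Because the published task scores are themselves rounded, a recomputed average can differ from the displayed one by 0.1 for some rows.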

