TECHAGENT - MY AI LIFE

AI Model Benchmarks

Standard: 36 · HF: 22

About these benchmarks

Benchmarks are standardised tests that score how well AI models handle reasoning, knowledge, math and coding. Use them to compare models objectively and pick the right one for your task.

📊 Standard: the four most-cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), taken from each model's announcement page; our Score weights them into a single number.

🤗 HF Open LLM Leaderboard: six tasks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) measured uniformly for open-source models; ranked by the Average.

HF Open LLM Leaderboard v2

IFEval · BBH · MATH · GPQA · MuSR · MMLU-Pro - open-source models on standardised tasks.

| # | Model | Provider | IFEval | BBH | MATH | GPQA | MuSR | MMLU-Pro | Average |
|---|-------|----------|--------|-----|------|------|------|----------|---------|
| 1 | Qwen2.5 72B Instruct | Alibaba/Qwen | 86.4 | 61.9 | 59.8 | 16.7 | 11.7 | 51.4 | 48.0 |
| 2 | Qwen 2.5 72B | Alibaba/Qwen | 86.4 | 61.9 | 59.8 | 16.7 | 11.7 | 51.4 | 48.0 |
| 3 | Llama 3.3 70B Instruct (free) | Meta | 90.0 | 56.6 | 48.3 | 10.5 | 15.6 | 48.1 | 44.9 |
| 4 | Llama 3.1 70B Instruct | Meta | 86.7 | 55.9 | 38.1 | 14.2 | 17.7 | 47.9 | 43.4 |
| 5 | Llama 3.1 70B | Meta | 86.7 | 55.9 | 38.1 | 14.2 | 17.7 | 47.9 | 43.4 |
| 6 | Phi 4 | Microsoft | 68.8 | 55.3 | 50.0 | 11.5 | 10.1 | 48.6 | 40.7 |
| 7 | Qwen2.5 Coder 32B Instruct | Alibaba/Qwen | 72.7 | 52.3 | 49.5 | 13.2 | 13.7 | 37.9 | 39.9 |
| 8 | Qwen 2.5 Coder 32B | Alibaba/Qwen | 72.7 | 52.3 | 49.5 | 13.2 | 13.7 | 37.9 | 39.9 |
| 9 | Hermes 3 70B Instruct | Nousresearch | 76.6 | 53.8 | 21.0 | 14.9 | 23.4 | 41.4 | 38.5 |
| 10 | Gemma 2 27B | Google | 79.8 | 49.3 | 23.9 | 16.7 | 9.1 | 38.4 | 36.2 |
| 11 | Qwen2.5 7B Instruct | Alibaba/Qwen | 75.8 | 34.9 | 50.0 | 5.5 | 8.4 | 36.5 | 35.2 |
| 12 | WizardLM-2 8x22B | Microsoft | 52.7 | 48.6 | 25.0 | 17.6 | 14.5 | 40.0 | 33.1 |
| 13 | Phi 4 Mini Instruct | Microsoft | 73.8 | 38.7 | 17.0 | 7.9 | 6.5 | 32.6 | 29.4 |
| 14 | R1 Distill Llama 70B | DeepSeek | 43.4 | 35.8 | 30.7 | 2.0 | 13.3 | 41.6 | 27.8 |
| 15 | Llama 3.2 3B Instruct (free) | Meta | 73.9 | 24.1 | 17.7 | 3.8 | 1.4 | 24.4 | 24.2 |
| 16 | Llama 3.2 3B | Meta | 73.9 | 24.1 | 17.7 | 3.8 | 1.4 | 24.4 | 24.2 |
| 17 | Llama 3.1 8B Instruct | Meta | 50.6 | 29.2 | 15.5 | 9.5 | 8.5 | 30.9 | 24.0 |
| 18 | R1 Distill Qwen 32B | DeepSeek | 41.9 | 17.1 | 17.1 | 4.6 | 16.1 | 41.0 | 23.0 |
| 19 | Hermes 2 Pro - Llama-3 8B | Nousresearch | 53.6 | 30.7 | 8.4 | 5.7 | 11.3 | 22.8 | 22.1 |
| 20 | Llama 3.2 1B Instruct | Meta | 58.1 | 8.3 | 8.2 | 2.4 | 1.9 | 8.2 | 14.5 |
| 21 | Llama 3 8B Instruct | Meta | 24.0 | 18.4 | 3.9 | 2.1 | 19.9 | 17.8 | 14.3 |
| 22 | Mistral AI | Mistral 7B Instruct v0.1 | 44.9 | 7.7 | 2.3 | 0.0 | 6.1 | 15.7 | 12.8 |
Average = mean of IFEval, BBH, MATH, GPQA, MuSR and MMLU-Pro
All scores in % · higher is better
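
The Average values in the table appear consistent with an unweighted mean of the six task scores, rounded to one decimal place. The sketch below checks that for three rows copied from the table; the function name `leaderboard_average` and the rounding step are assumptions made for illustration, not the leaderboard's published method.

```python
from statistics import mean

# Six task scores copied from three rows of the table above,
# in the order IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro.
ROWS = {
    "Qwen2.5 72B Instruct": [86.4, 61.9, 59.8, 16.7, 11.7, 51.4],
    "Phi 4": [68.8, 55.3, 50.0, 11.5, 10.1, 48.6],
    "Mistral 7B Instruct v0.1": [44.9, 7.7, 2.3, 0.0, 6.1, 15.7],
}


def leaderboard_average(scores):
    """Unweighted mean of six task scores (hypothetical helper).

    Rounding to one decimal is an assumption that matches how the
    table displays its values.
    """
    if len(scores) != 6:
        raise ValueError("expected exactly six task scores")
    return round(mean(scores), 1)


if __name__ == "__main__":
    for model, scores in ROWS.items():
        print(f"{model}: {leaderboard_average(scores)}")
```

Because the mean is unweighted, a weak result on one task (GPQA and MuSR are low for every model here) pulls the Average down as much as a strong IFEval result pulls it up, so it is worth reading the individual columns alongside the ranking.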