TECHAGENT - MY AI LIFE

AI Model Benchmarks

Standard: 36 · HF: 22

About these benchmarks

Benchmarks are standardised tests that score how well AI models handle reasoning, knowledge, math and coding. Use them to compare models objectively and pick the right one for your task.

📊 Standard: the four most-cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), taken from each model's announcement page; our Score weights them into a single number.

🤗 HF Open LLM Leaderboard: six tasks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) measured uniformly for open-source models; ranked by the Average.

🤗 HF Open LLM Leaderboard v2

IFEval · BBH · MATH · GPQA · MuSR · MMLU-Pro - open-source models on standardised tasks.

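| # | Model | Provider | IFEval | BBH | MATH | GPQA | MuSR | MMLU-Pro | Average |
|---|-------|----------|--------|-----|------|------|------|----------|---------|
| 1 | Phi 4 (OSS) | Microsoft | 68.8 | 55.3 | 50.0 | 11.5 | 10.1 | 48.6 | 40.7 |
| 2 | WizardLM-2 8x22B (OSS) | Microsoft | 52.7 | 48.6 | 25.0 | 17.6 | 14.5 | 40.0 | 33.1 |
| 3 | Phi 4 Mini Instruct (OSS) | Microsoft | 73.8 | 38.7 | 17.0 | 7.9 | 6.5 | 32.6 | 29.4 |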
Average = mean of IFEval, BBH, MATH, GPQA, MuSR and MMLU-Pro · All scores in % · higher is better
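
As a quick check on how the Average column relates to the six task scores, here is a minimal Python sketch. It assumes the Average is the unweighted arithmetic mean rounded to one decimal, which is consistent with the three rows listed here but is not stated explicitly on the page. The names `scores` and `leaderboard_average` are illustrative only.

```python
from statistics import mean

# Per-task scores (%) copied from the table, in the order
# IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro.
scores = {
    "Phi 4": [68.8, 55.3, 50.0, 11.5, 10.1, 48.6],
    "WizardLM-2 8x22B": [52.7, 48.6, 25.0, 17.6, 14.5, 40.0],
    "Phi 4 Mini Instruct": [73.8, 38.7, 17.0, 7.9, 6.5, 32.6],
}


def leaderboard_average(task_scores):
    """Unweighted mean of the task scores, rounded to one decimal.

    Assumption: equal weighting across all six tasks.
    """
    return round(mean(task_scores), 1)


# Compute each model's average, then rank from highest to lowest.
averages = {model: leaderboard_average(s) for model, s in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for position, model in enumerate(ranking, start=1):
    print(position, model, averages[model])
```

Because every task counts equally under this assumption, a model that is strong on IFEval but weak on GPQA and MuSR (such as Phi 4 Mini Instruct) can still rank below a model with more balanced scores.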