AI Model Benchmarks

Standard: 36 · HF: 22

About these benchmarks

Benchmarks are standardised tests that score how well AI models handle reasoning, knowledge, math and coding. Use them to compare models objectively and pick the right one for your task.

📊 Standard: the four most-cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), taken from each model's announcement page; our Score weights them into a single number.

🤗 HF Open LLM Leaderboard: six tasks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) measured uniformly for open-source models; ranked by the Average.

HF Open LLM Leaderboard v2

IFEval · BBH · MATH · GPQA · MuSR · MMLU-Pro - open-source models on standardised tasks.

All Open Source · Meta (8) · Alibaba/Qwen (5) · Microsoft (3)

#	Model	Provider	IFEval	BBH	MATH	GPQA	MuSR	MMLU-Pro	Average
1	Phi 4 (OSS)	Microsoft	68.8	55.3	50.0	11.5	10.1	48.6	40.7
2	WizardLM-2 8x22B (OSS)	Microsoft	52.7	48.6	25.0	17.6	14.5	40.0	33.1
3	Phi 4 Mini Instruct (OSS)	Microsoft	73.8	38.7	17.0	7.9	6.5	32.6	29.4

Average = unweighted mean of IFEval, BBH, MATH, GPQA, MuSR and MMLU-Pro · All scores in % · higher is better
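
The Average values in the table are consistent with a plain unweighted mean of the six task scores, rounded to one decimal. The Python sketch below recomputes them from the rows shown above. The function name `leaderboard_average` and the `ROWS` structure are illustrative assumptions, not part of any leaderboard tooling.

```python
# Task order: IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro.
# Scores and listed averages are copied from the table above.
ROWS = {
    "Phi 4": ([68.8, 55.3, 50.0, 11.5, 10.1, 48.6], 40.7),
    "WizardLM-2 8x22B": ([52.7, 48.6, 25.0, 17.6, 14.5, 40.0], 33.1),
    "Phi 4 Mini Instruct": ([73.8, 38.7, 17.0, 7.9, 6.5, 32.6], 29.4),
}


def leaderboard_average(scores):
    """Unweighted mean of the task scores, rounded to one decimal (assumed rule)."""
    return round(sum(scores) / len(scores), 1)


if __name__ == "__main__":
    for model, (scores, listed) in ROWS.items():
        computed = leaderboard_average(scores)
        print(f"{model}: computed {computed}, listed {listed}")
```

Because every task carries equal weight, a low score on a hard task such as GPQA or MuSR pulls the Average down as much as a low score on any other task would.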