AI Model Benchmarks

Standard: 36 · HF: 22

About these benchmarks

Benchmarks are standardised tests that score how well AI models handle reasoning, knowledge, math and coding. Use them to compare models objectively and pick the right one for your task.

📊 Standard - the four most-cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), taken from each model's announcement page; our Score combines them into a single weighted number.
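The page does not publish the weights behind the Score, so the following is only a minimal sketch of how a weighted composite of this kind could be computed. The equal weights, the function name `composite_score`, and the rule of renormalising over whichever benchmarks a model reports are all assumptions made for illustration, not the site's documented method.

```python
# Hypothetical weights: the real ones are not stated on the page.
WEIGHTS = {"MMLU": 0.25, "GPQA": 0.25, "HumanEval": 0.25, "SWE-Bench": 0.25}


def composite_score(results, weights=WEIGHTS):
    """Weighted mean over the benchmarks a model actually reports.

    Benchmarks that are missing (or None) are skipped and the remaining
    weights are renormalised. This is an assumption, not documented behaviour.
    """
    available = {
        name: score
        for name, score in results.items()
        if name in weights and score is not None
    }
    if not available:
        return None  # nothing to score
    total_weight = sum(weights[name] for name in available)
    weighted_sum = sum(weights[name] * score for name, score in available.items())
    return weighted_sum / total_weight


# Illustrative inputs only; these are not real model results.
example = {"MMLU": 80.0, "GPQA": 40.0, "HumanEval": 90.0, "SWE-Bench": 30.0}
print(composite_score(example))
```

Renormalising keeps scores comparable when an announcement page omits a benchmark, but it also means a model can look stronger simply by not reporting its weakest result, so a real scoring scheme might penalise gaps instead.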

🤗 HF Open LLM Leaderboard - six tasks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) measured uniformly for open-source models; ranked by the Average.


HF Open LLM Leaderboard v2

IFEval · BBH · MATH · GPQA · MuSR · MMLU-Pro - open-source models on standardised tasks.


All Open Source · Meta (8) · Alibaba/Qwen (5) · Microsoft (3)

#	Model	Provider	IFEval	BBH	MATH	GPQA	MuSR	MMLU-Pro	Average
1	Llama 3.3 70B Instruct (free) [OSS]	Meta	90.0	56.6	48.3	10.5	15.6	48.1	44.9
2	Llama 3.1 70B Instruct [OSS]	Meta	86.7	55.9	38.1	14.2	17.7	47.9	43.4
3	Llama 3.1 70B [OSS]	Meta	86.7	55.9	38.1	14.2	17.7	47.9	43.4
4	Llama 3.2 3B Instruct (free) [OSS]	Meta	73.9	24.1	17.7	3.8	1.4	24.4	24.2
5	Llama 3.2 3B [OSS]	Meta	73.9	24.1	17.7	3.8	1.4	24.4	24.2
6	Llama 3.1 8B Instruct [OSS]	Meta	50.6	29.2	15.5	9.5	8.5	30.9	24.0
7	Llama 3.2 1B Instruct [OSS]	Meta	58.1	8.3	8.2	2.4	1.9	8.2	14.5
8	Llama 3 8B Instruct [OSS]	Meta	24.0	18.4	3.9	2.1	19.9	17.8	14.3

Average = unweighted mean of IFEval, BBH, MATH, GPQA, MuSR and MMLU-Pro.
All scores in % · higher is better
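As a quick check, the Average column appears to be the plain arithmetic mean of the six task scores. The sketch below recomputes it for three rows copied from the table above and sorts them the way the leaderboard does; the names `ROWS`, `average` and `ranking` are illustrative and not part of any leaderboard API.

```python
from statistics import mean

TASKS = ("IFEval", "BBH", "MATH", "GPQA", "MuSR", "MMLU-Pro")

# Scores in %, copied from the table above, in TASKS order.
ROWS = {
    "Llama 3.1 8B Instruct": (50.6, 29.2, 15.5, 9.5, 8.5, 30.9),
    "Llama 3.2 1B Instruct": (58.1, 8.3, 8.2, 2.4, 1.9, 8.2),
    "Llama 3.1 70B Instruct": (86.7, 55.9, 38.1, 14.2, 17.7, 47.9),
}


def average(scores):
    """Unweighted mean of the six task scores, rounded to one decimal."""
    if len(scores) != len(TASKS):
        raise ValueError(f"expected {len(TASKS)} scores, got {len(scores)}")
    return round(mean(scores), 1)


averages = {model: average(scores) for model, scores in ROWS.items()}

# Rank by Average, highest first, as the leaderboard does.
ranking = sorted(averages, key=averages.get, reverse=True)

for position, model in enumerate(ranking, start=1):
    print(position, model, averages[model])
```

The published averages are presumably computed from unrounded task scores, so a row whose mean falls exactly on a rounding boundary may differ from a recomputation like this one by 0.1.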