TECHAGENT - MY AI LIFE

AI Model Benchmarks

Standard: 36 · HF: 22

About these benchmarks

Benchmarks are standardised tests that score how well AI models handle reasoning, knowledge, math and coding. Use them to compare models objectively and pick the right one for your task.

📊 Standard: the four most-cited public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench), with results taken from each model's announcement page. Our Score combines them into a single weighted number.

🤗 HF Open LLM Leaderboard: six tasks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) measured uniformly for open-source models, which are ranked by their Average across the six.

📚 MMLU (57 academic subjects): o1, OpenAI, 92.3%
🔬 GPQA Diamond (PhD-level science questions): o1, OpenAI, 77.3%
💻 HumanEval (Python code generation): DeepSeek R1, DeepSeek, 92.6%
🔧 SWE-Bench (Real GitHub engineering tasks): Claude Opus 4.1, Anthropic, 74.5%
#  Model           Provider  MMLU  GPQA  HumanEval  SWE-Bench  Score
1  Llama 3.2 3B    Meta      63.4  24.7  58.3       9.5        37.0
2  Llama 3.1 70B   Meta      83.6  46.7  80.5       21.8       56.3
3  Llama 3.1 405B  Meta      88.6  50.7  89.0       34.1       63.7
Score = MMLU×20% + GPQA×30% + HumanEval×25% + SWE-Bench×25%
All scores in % · higher is better
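
The Score formula above can be checked against the table rows with a few lines of Python. This is a minimal sketch, not the site's actual code: the names `WEIGHTS`, `ROWS` and `composite_score` are made up for illustration, and rounding to one decimal place is an assumption based on how the Score column is displayed.

```python
# Weights copied from the Score formula on this page.
WEIGHTS = {"MMLU": 0.20, "GPQA": 0.30, "HumanEval": 0.25, "SWE-Bench": 0.25}

# Benchmark results (in %) copied from the three table rows above.
ROWS = {
    "Llama 3.2 3B": {"MMLU": 63.4, "GPQA": 24.7, "HumanEval": 58.3, "SWE-Bench": 9.5},
    "Llama 3.1 70B": {"MMLU": 83.6, "GPQA": 46.7, "HumanEval": 80.5, "SWE-Bench": 21.8},
    "Llama 3.1 405B": {"MMLU": 88.6, "GPQA": 50.7, "HumanEval": 89.0, "SWE-Bench": 34.1},
}


def composite_score(results: dict[str, float]) -> float:
    """Weighted sum of the four benchmark percentages.

    Rounding to one decimal is an assumption that mirrors the table display.
    """
    return round(sum(results[name] * weight for name, weight in WEIGHTS.items()), 1)


if __name__ == "__main__":
    for model, results in ROWS.items():
        print(f"{model}: {composite_score(results)}")
    # Llama 3.2 3B: 37.0
    # Llama 3.1 70B: 56.3
    # Llama 3.1 405B: 63.7
```

The computed values match the Score column for all three rows. Because GPQA carries the largest weight (30%), a model's GPQA result moves its Score more than an equal change on any other benchmark.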