Benchmark Guide
Complete guide to all 60+ evaluation tasks in lm-evaluation-harness, what they measure, and how to interpret results.
Overview
The lm-evaluation-harness includes 60+ benchmarks spanning:
- Language understanding (MMLU, GLUE)
- Mathematical reasoning (GSM8K, MATH)
- Code generation (HumanEval, MBPP)
- Instruction following (IFEval, AlpacaEval)
- Long-context understanding (LongBench)
- Multilingual capabilities (AfroBench, NorEval)
- Reasoning (BBH, ARC)
- Truthfulness (TruthfulQA)
List all tasks:
lm_eval --tasks list
Major Benchmarks
MMLU (Massive Multitask Language Understanding)
What it measures: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law).
Task variants:
- mmlu: Original 57-subject benchmark
- mmlu_pro: More challenging version with reasoning-focused questions
- mmlu_prox: Multilingual extension
Format: Multiple choice (4 options)
Example:
Question: What is the capital of France?
A. Berlin
B. Paris
C. London
D. Madrid
Answer: B
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--num_fewshot 5
Interpretation:
- Random: 25% (chance)
- GPT-3 (175B): 43.9%
- GPT-4: 86.4%
- Human expert: ~90%
Good for: Assessing general knowledge and domain expertise.
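Under the hood, the harness scores multiple-choice tasks like MMLU by computing the model's log-likelihood for each answer option and picking the highest. A minimal sketch of that selection step (the scores below are made-up illustrative numbers, not real model outputs):

```python
def pick_choice(option_loglikelihoods: dict) -> str:
    """Return the option letter whose continuation the model finds most likely."""
    return max(option_loglikelihoods, key=option_loglikelihoods.get)

# Hypothetical per-option log-likelihoods for the capital-of-France example:
scores = {"A": -4.1, "B": -0.6, "C": -3.8, "D": -4.5}
answer = pick_choice(scores)  # "B"
```

Accuracy is then the fraction of questions where the selected option matches the gold label.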
GSM8K (Grade School Math 8K)
What it measures: Mathematical reasoning on grade-school level word problems.
Task variants:
- gsm8k: Base task
- gsm8k_cot: With chain-of-thought prompting
- gsm_plus: Adversarial variant with perturbations
Format: Free-form generation; the final numerical answer is extracted from the model's output
Example:
Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left?
Answer: 60
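(The worked arithmetic: 3/5 of 200 = 120 sold in the morning, leaving 80; 1/4 of 80 = 20 sold in the afternoon, leaving 60.) Since GSM8K is scored by pulling the last number out of a free-form completion, a simple regex-based extraction sketch looks like this (the completion string is a hypothetical model output, and real harness answer-extraction may differ in detail):

```python
import re

def extract_final_number(completion: str):
    """Return the last integer/decimal in a generated answer, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(matches[-1]) if matches else None

completion = ("He sold 3/5 of 200 = 120 in the morning, leaving 80. "
              "In the afternoon he sold 1/4 of 80 = 20, so 80 - 20 = 60 left. "
              "The answer is 60.")
extract_final_number(completion)  # 60.0
```

Exact-match against the gold answer (60) then determines correctness.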
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks gsm8k \
--num_fewshot 5
Interpretation:
- Random: ~0%
- GPT-3 (175B): 17.0%
- GPT-4: 92.0%
- Llama 2 70B: 56.8%
Good for: Testing multi-step reasoning and arithmetic.
HumanEval
What it measures: Python code generation from docstrings (functional correctness).
Task variants:
- humaneval: Standard benchmark
- humaneval_instruct: For instruction-tuned models
Format: Code generation, execution-based evaluation
Example:
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each other than
given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""
Command:
lm_eval --model hf \
--model_args pretrained=codellama/CodeLlama-7b-hf \
--tasks humaneval \
--batch_size 1
Interpretation:
- Random: 0%
- GPT-3 (175B): 0%
- Codex: 28.8%
- GPT-4: 67.0%
- Code Llama 34B: 53.7%
Good for: Evaluating code generation capabilities.
BBH (BIG-Bench Hard)
What it measures: 23 challenging BIG-Bench tasks on which earlier language models had failed to match average human-rater performance.
Categories:
- Logical reasoning
- Math word problems
- Social understanding
- Algorithmic reasoning
Format: Multiple choice and free-form
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks bbh \
--num_fewshot 3
Interpretation:
- Random: ~25%
- GPT-3 (175B): 33.9%
- PaLM 540B: 58.3%
- GPT-4: 86.7%
Good for: Testing advanced reasoning capabilities.
IFEval (Instruction-Following Evaluation)
What it measures: Ability to follow specific, verifiable instructions.
Instruction types:
- Format constraints (e.g., "answer in 3 sentences")
- Length constraints (e.g., "use at least 100 words")
- Content constraints (e.g., "include the word 'banana'")
- Structural constraints (e.g., "use bullet points")
Format: Free-form generation with rule-based verification
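Because every IFEval instruction is verifiable, checks can be implemented as simple deterministic rules. A sketch of two such checks (the helper names and example response are illustrative, not the harness's actual implementation):

```python
def check_min_words(text: str, n: int) -> bool:
    """Length constraint: at least n whitespace-separated words."""
    return len(text.split()) >= n

def check_includes_keyword(text: str, keyword: str) -> bool:
    """Content constraint: the keyword must appear (case-insensitive)."""
    return keyword.lower() in text.lower()

response = "I ate a banana and an apple for breakfast today."
check_min_words(response, 5)                # True
check_includes_keyword(response, "banana")  # True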
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
--tasks ifeval \
--batch_size auto
Interpretation:
- Measures: Instruction adherence (not quality)
- GPT-4: 86% instruction following
- Claude 2: 84%
Good for: Evaluating chat/instruct models.
GLUE (General Language Understanding Evaluation)
What it measures: Natural language understanding across 9 tasks.
Tasks:
- cola: Grammatical acceptability
- sst2: Sentiment analysis
- mrpc: Paraphrase detection
- qqp: Duplicate question detection (Quora Question Pairs)
- stsb: Semantic similarity
- mnli: Natural language inference
- qnli: Question answering NLI
- rte: Recognizing textual entailment
- wnli: Winograd schemas
Command:
lm_eval --model hf \
--model_args pretrained=bert-base-uncased \
--tasks glue \
--num_fewshot 0
Interpretation:
- BERT Base: 78.3 (GLUE score)
- RoBERTa Large: 88.5
- Human baseline: 87.1
Good for: Encoder-only models, fine-tuning baselines.
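The headline GLUE score is (to a first approximation) an unweighted macro-average of the nine per-task metrics. A sketch, using illustrative per-task numbers rather than any published result:

```python
def glue_score(task_metrics: dict) -> float:
    """Unweighted macro-average over the per-task GLUE scores."""
    return sum(task_metrics.values()) / len(task_metrics)

# Hypothetical per-task scores for a BERT-Base-sized model:
scores = {"cola": 60.5, "sst2": 93.5, "mrpc": 88.9, "qqp": 71.2,
          "stsb": 87.1, "mnli": 84.6, "qnli": 90.5, "rte": 66.4, "wnli": 65.1}
round(glue_score(scores), 1)  # 78.6
```

Note that some GLUE tasks report two metrics (e.g. accuracy and F1 for mrpc/qqp), which the official leaderboard averages within the task first.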
LongBench
What it measures: Long-context understanding (4K-32K tokens).
21 tasks covering:
- Single-document QA
- Multi-document QA
- Summarization
- Few-shot learning
- Code completion
- Synthetic tasks
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks longbench \
--batch_size 1
Interpretation:
- Tests context utilization
- Many models struggle beyond 4K tokens
- GPT-4 Turbo: 54.3%
Good for: Evaluating long-context models.
Additional Benchmarks
TruthfulQA
What it measures: Model's propensity to be truthful vs. generate plausible-sounding falsehoods.
Format: Multiple choice with 4-5 options
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks truthfulqa_mc2 \
--batch_size auto
Interpretation:
- Larger models often score worse (more convincing lies)
- GPT-3: 58.8%
- GPT-4: 59.0%
- Human: ~94%
ARC (AI2 Reasoning Challenge)
What it measures: Grade-school science questions.
Variants:
- arc_easy: Easier questions
- arc_challenge: Harder questions requiring reasoning
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks arc_challenge \
--num_fewshot 25
Interpretation:
- ARC-Easy: Most models >80%
- ARC-Challenge random: 25%
- GPT-4: 96.3%
HellaSwag
What it measures: Commonsense reasoning about everyday situations.
Format: Choose most plausible continuation
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks hellaswag \
--num_fewshot 10
Interpretation:
- Random: 25%
- GPT-3: 78.9%
- Llama 2 70B: 85.3%
WinoGrande
What it measures: Commonsense reasoning via pronoun resolution.
Example:
The trophy doesn't fit in the brown suitcase because _ is too large.
A. the trophy
B. the suitcase
Answer: A
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks winogrande \
--num_fewshot 5
PIQA
What it measures: Physical commonsense reasoning.
Example: "To clean a keyboard, use compressed air or..."
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks piqa
Multilingual Benchmarks
AfroBench
What it measures: Performance across 64 African languages.
15 tasks: NLU, text generation, knowledge, QA, math reasoning
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks afrobench
NorEval
What it measures: Norwegian language understanding (9 task categories).
Command:
lm_eval --model hf \
--model_args pretrained=NbAiLab/nb-gpt-j-6B \
--tasks noreval
Domain-Specific Benchmarks
MATH
What it measures: High-school competition math problems.
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks math \
--num_fewshot 4
Interpretation:
- Very challenging
- GPT-4: 42.5%
- Minerva 540B: 33.6%
MBPP (Mostly Basic Python Problems)
What it measures: Python programming from natural language descriptions.
Command:
lm_eval --model hf \
--model_args pretrained=codellama/CodeLlama-7b-hf \
--tasks mbpp \
--batch_size 1
DROP
What it measures: Reading comprehension requiring discrete reasoning.
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks drop
Benchmark Selection Guide
For General Purpose Models
Run this suite:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \
--num_fewshot 5
For Code Models
lm_eval --model hf \
--model_args pretrained=codellama/CodeLlama-7b-hf \
--tasks humaneval,mbpp \
--batch_size 1
For Chat/Instruct Models
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
--tasks ifeval,mmlu,gsm8k_cot \
--batch_size auto
For Long Context Models
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B \
--tasks longbench \
--batch_size 1
Interpreting Results
Understanding Metrics
Accuracy: Percentage of correct answers (most common)
Exact Match (EM): Requires exact string match (strict)
F1 Score: Balances precision and recall
BLEU/ROUGE: Text generation similarity
Pass@k: Percentage passing when generating k samples
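Pass@k is typically computed with the unbiased estimator introduced with HumanEval: generate n samples per problem, count the c that pass the tests, and estimate 1 - C(n-c, k) / C(n, k), averaged over problems. A direct sketch of the per-problem estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(10, 3, 1)  # ~0.3: with 3/10 samples passing, pass@1 is about 30%
```

Averaging this quantity across all benchmark problems gives the reported pass@k.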
Typical Score Ranges
| Model Size | MMLU | GSM8K | HumanEval | HellaSwag |
|---|---|---|---|---|
| 7B | 40-50% | 10-20% | 5-15% | 70-80% |
| 13B | 45-55% | 20-35% | 15-25% | 75-82% |
| 70B | 60-70% | 50-65% | 35-50% | 82-87% |
| GPT-4 | 86% | 92% | 67% | 95% |
Red Flags
- All tasks at random chance: Model not trained properly
- Exact 0% on generation tasks: Likely format/parsing issue
- Huge variance across runs: Check seed/sampling settings
- Better than GPT-4 on everything: Likely contamination
Best Practices
- Always report few-shot setting: 0-shot, 5-shot, etc.
- Run multiple seeds: Report mean ± std
- Check for data contamination: Search training data for benchmark examples
- Compare to published baselines: Validate your setup
- Report all hyperparameters: Model, batch size, max tokens, temperature
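For the mean ± std recommendation, Python's statistics module suffices (the seed scores below are hypothetical accuracies from three runs, not real results):

```python
from statistics import mean, stdev

# Hypothetical GSM8K accuracy from three runs with different seeds:
seed_scores = [56.2, 57.1, 55.8]
report = f"{mean(seed_scores):.1f} ± {stdev(seed_scores):.1f}"  # "56.4 ± 0.7"
```

Note that stdev computes the sample standard deviation (n - 1 denominator), which is the appropriate choice for a small number of runs.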
References
- Task list: lm_eval --tasks list
- Task README: lm_eval/tasks/README.md
- Papers: See individual benchmark papers