Benchmark Guide

Complete guide to all 60+ evaluation tasks in lm-evaluation-harness, what they measure, and how to interpret results.

Overview

The lm-evaluation-harness includes 60+ benchmarks spanning:

  • Language understanding (MMLU, GLUE)
  • Mathematical reasoning (GSM8K, MATH)
  • Code generation (HumanEval, MBPP)
  • Instruction following (IFEval, AlpacaEval)
  • Long-context understanding (LongBench)
  • Multilingual capabilities (AfroBench, NorEval)
  • Reasoning (BBH, ARC)
  • Truthfulness (TruthfulQA)

List all tasks:

lm_eval --tasks list

Major Benchmarks

MMLU (Massive Multitask Language Understanding)

What it measures: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law).

Task variants:

  • mmlu: Original 57-subject benchmark
  • mmlu_pro: More challenging version with reasoning-focused questions
  • mmlu_prox: Multilingual extension

Format: Multiple choice (4 options)

Example:

Question: What is the capital of France?
A. Berlin
B. Paris
C. London
D. Madrid
Answer: B

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --num_fewshot 5

Interpretation:

  • Random: 25% (chance)
  • GPT-3 (175B): 43.9%
  • GPT-4: 86.4%
  • Human expert: ~90%

Good for: Assessing general knowledge and domain expertise.
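
To judge whether a score is meaningfully above the 25% chance level, it helps to attach a standard error to it. The following is a minimal sketch, assuming accuracy is a simple binomial proportion; the function names, the question count and the `z` threshold are illustrative assumptions, not part of lm-evaluation-harness.

```python
import math


def accuracy_stderr(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def clearly_above_chance(accuracy: float, n_questions: int,
                         n_options: int = 4, z: float = 1.96) -> bool:
    """True if accuracy exceeds chance by more than z standard errors.

    The standard error is computed under the null hypothesis that the
    model guesses uniformly among n_options (an assumption of this sketch).
    """
    chance = 1.0 / n_options
    return accuracy > chance + z * accuracy_stderr(chance, n_questions)


# Hypothetical evaluation on 1000 four-option questions.
print(clearly_above_chance(0.27, 1000))  # within noise of 25% chance
print(clearly_above_chance(0.30, 1000))  # distinguishable from chance
```

With only a few hundred questions per subject, per-subject MMLU scores carry much wider error bars than the aggregate.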

GSM8K (Grade School Math 8K)

What it measures: Mathematical reasoning on grade-school level word problems.

Task variants:

  • gsm8k: Base task
  • gsm8k_cot: With chain-of-thought prompting
  • gsm_plus: Adversarial variant with perturbations

Format: Free-form generation; the final numerical answer is extracted from the output and compared with the reference

Example:

Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left?
Answer: 60
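
The arithmetic in this example, and the answer-extraction step, can be sketched as follows. This is an illustrative last-number heuristic, not the harness's own filter; `extract_final_number` and the sample generation are hypothetical.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(generation: str) -> Optional[str]:
    """Return the last number in the text with commas removed, or None."""
    matches = _NUMBER.findall(generation)
    if not matches:
        return None
    return matches[-1].replace(",", "")


# Worked arithmetic for the cookie problem.
remaining_after_morning = 200 - 200 * 3 // 5                        # 200 - 120 = 80
left = remaining_after_morning - remaining_after_morning // 4       # 80 - 20 = 60

# Hypothetical model output ending with the final answer.
generation = (
    "He sold 3/5 * 200 = 120 in the morning, leaving 80. "
    "He sold 1/4 * 80 = 20 in the afternoon, so 80 - 20 = 60 cookies are left."
)
print(extract_final_number(generation) == str(left))  # True
```

A heuristic like this is sensitive to output format, which is one reason a generation task can report 0% even when the model's reasoning is correct.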

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks gsm8k \
  --num_fewshot 5

Interpretation:

  • Random: ~0%
  • GPT-3 (175B): 17.0%
  • GPT-4: 92.0%
  • Llama 2 70B: 56.8%

Good for: Testing multi-step reasoning and arithmetic.

HumanEval

What it measures: Python code generation from docstrings (functional correctness).

Task variants:

  • humaneval: Standard benchmark
  • humaneval_instruct: For instruction-tuned models

Format: Code generation, execution-based evaluation

Example:

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

Command:

lm_eval --model hf \
  --model_args pretrained=codellama/CodeLlama-7b-hf \
  --tasks humaneval \
  --batch_size 1

Interpretation:

  • Random: 0%
  • GPT-3 (175B): 0%
  • Codex: 28.8%
  • GPT-4: 67.0%
  • Code Llama - Python 34B: 53.7%

Good for: Evaluating code generation capabilities.

BBH (BIG-Bench Hard)

What it measures: 23 challenging BIG-Bench tasks on which earlier language models did not outperform the average human rater.

Categories:

  • Logical reasoning
  • Math word problems
  • Social understanding
  • Algorithmic reasoning

Format: Multiple choice and free-form

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks bbh \
  --num_fewshot 3

Interpretation:

  • Random: ~25%
  • GPT-3 (175B): 33.9%
  • PaLM 540B: 58.3%
  • GPT-4: 86.7%

Good for: Testing advanced reasoning capabilities.

IFEval (Instruction-Following Evaluation)

What it measures: Ability to follow specific, verifiable instructions.

Instruction types:

  • Format constraints (e.g., "answer in 3 sentences")
  • Length constraints (e.g., "use at least 100 words")
  • Content constraints (e.g., "include the word 'banana'")
  • Structural constraints (e.g., "use bullet points")

Format: Free-form generation with rule-based verification
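
Rule-based verification means each instruction maps to a deterministic check on the response text. The checkers below are a simplified sketch of the four instruction types listed above; the function names and the sentence-splitting rule are assumptions of this example, not IFEval's actual implementation.

```python
import re


def has_n_sentences(response: str, n: int) -> bool:
    """Format constraint: exactly n sentences, split on ., ! or ?."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) == n


def has_min_words(response: str, min_words: int) -> bool:
    """Length constraint: at least min_words whitespace-separated words."""
    return len(response.split()) >= min_words


def includes_word(response: str, word: str) -> bool:
    """Content constraint: the word appears as a whole word, any case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, response, re.IGNORECASE) is not None


def uses_bullet_points(response: str) -> bool:
    """Structural constraint: at least one line starts with a bullet marker."""
    markers = ("- ", "* ", "• ")
    return any(line.lstrip().startswith(markers) for line in response.splitlines())


response = "I like fruit. My favorite is the banana. It is yellow."
print(has_n_sentences(response, 3))       # True
print(has_min_words(response, 100))       # False
print(includes_word(response, "banana"))  # True
print(uses_bullet_points(response))       # False
```

Checks like these verify adherence only; a response can satisfy every rule and still be unhelpful.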

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
  --tasks ifeval \
  --batch_size auto

Interpretation:

  • Measures: Instruction adherence (not quality)
  • GPT-4: 86% instruction following
  • Claude 2: 84%

Good for: Evaluating chat/instruct models.

GLUE (General Language Understanding Evaluation)

What it measures: Natural language understanding across 9 tasks.

Tasks:

  • cola: Grammatical acceptability
  • sst2: Sentiment analysis
  • mrpc: Paraphrase detection
  • qqp: Question pairs
  • stsb: Semantic similarity
  • mnli: Natural language inference
  • qnli: Question answering NLI
  • rte: Recognizing textual entailment
  • wnli: Winograd schemas

Command:

lm_eval --model hf \
  --model_args pretrained=bert-base-uncased \
  --tasks glue \
  --num_fewshot 0

Interpretation:

  • BERT Base: 78.3 (GLUE score)
  • RoBERTa Large: 88.5
  • Human baseline: 87.1

Good for: Encoder-only models, fine-tuning baselines.

LongBench

What it measures: Long-context understanding (4K-32K tokens).

21 tasks covering:

  • Single-document QA
  • Multi-document QA
  • Summarization
  • Few-shot learning
  • Code completion
  • Synthetic tasks

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks longbench \
  --batch_size 1

Interpretation:

  • Tests context utilization
  • Many models struggle beyond 4K tokens
  • GPT-4 Turbo: 54.3%

Good for: Evaluating long-context models.

Additional Benchmarks

TruthfulQA

What it measures: Model's propensity to be truthful vs. generate plausible-sounding falsehoods.

Format: Multiple choice with 4-5 options

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks truthfulqa_mc2 \
  --batch_size auto

Interpretation:

  • Larger models can score worse, because they reproduce common human misconceptions more fluently
  • GPT-3: 58.8%
  • GPT-4: 59.0%
  • Human: ~94%

ARC (AI2 Reasoning Challenge)

What it measures: Grade-school science questions.

Variants:

  • arc_easy: Easier questions
  • arc_challenge: Harder questions requiring reasoning

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks arc_challenge \
  --num_fewshot 25

Interpretation:

  • ARC-Easy: Most models >80%
  • ARC-Challenge random: 25%
  • GPT-4: 96.3%

HellaSwag

What it measures: Commonsense reasoning about everyday situations.

Format: Choose most plausible continuation
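
Multiple-choice tasks of this kind are typically scored by comparing the model's log-likelihood for each candidate continuation, optionally normalized by length so that longer endings are not penalized. A minimal sketch with made-up log-likelihoods; the function name and the use of character length for normalization are assumptions of this example.

```python
from typing import Sequence


def pick_continuation(log_likelihoods: Sequence[float],
                      continuations: Sequence[str],
                      normalize: bool = False) -> int:
    """Return the index of the highest-scoring continuation.

    With normalize=True, each log-likelihood is divided by the
    continuation's character length before comparison.
    """
    scores = []
    for ll, text in zip(log_likelihoods, continuations):
        scores.append(ll / len(text) if normalize else ll)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical endings and log-likelihoods for one context.
endings = ["falls.", "carefully rinses the soap off the dog with a hose."]
lls = [-6.0, -20.0]

print(pick_continuation(lls, endings))                  # 0: short ending wins on raw score
print(pick_continuation(lls, endings, normalize=True))  # 1: long ending wins per character
```

The two selections can disagree, so it is worth stating which variant of accuracy a reported number uses.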

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks hellaswag \
  --num_fewshot 10

Interpretation:

  • Random: 25%
  • GPT-3: 78.9%
  • Llama 2 70B: 85.3%

WinoGrande

What it measures: Commonsense reasoning via pronoun resolution.

Example:

The trophy doesn't fit in the brown suitcase because _ is too large.
A. the trophy
B. the suitcase
Answer: A

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks winogrande \
  --num_fewshot 5

PIQA

What it measures: Physical commonsense reasoning.

Example: "To clean a keyboard, use compressed air or..."

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks piqa

Multilingual Benchmarks

AfroBench

What it measures: Performance across 64 African languages.

15 tasks: NLU, text generation, knowledge, QA, math reasoning

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks afrobench

NorEval

What it measures: Norwegian language understanding (9 task categories).

Command:

lm_eval --model hf \
  --model_args pretrained=NbAiLab/nb-gpt-j-6B \
  --tasks noreval

Domain-Specific Benchmarks

MATH

What it measures: High-school competition math problems.

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks math \
  --num_fewshot 4

Interpretation:

  • Very challenging
  • GPT-4: 42.5%
  • Minerva 540B: 33.6%

MBPP (Mostly Basic Python Problems)

What it measures: Python programming from natural language descriptions.

Command:

lm_eval --model hf \
  --model_args pretrained=codellama/CodeLlama-7b-hf \
  --tasks mbpp \
  --batch_size 1

DROP

What it measures: Reading comprehension requiring discrete reasoning.

Command:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks drop

Benchmark Selection Guide

For General Purpose Models

Run this suite:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \
  --num_fewshot 5

For Code Models

lm_eval --model hf \
  --model_args pretrained=codellama/CodeLlama-7b-hf \
  --tasks humaneval,mbpp \
  --batch_size 1

For Chat/Instruct Models

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
  --tasks ifeval,mmlu,gsm8k_cot \
  --batch_size auto

For Long Context Models

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B \
  --tasks longbench \
  --batch_size 1
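
The four suites above can be kept in one place and turned into command lines programmatically. This is a sketch that only assembles the argument list using the flags shown in this guide; `SUITES` and `build_command` are hypothetical helpers, and nothing is executed.

```python
import shlex
from typing import Dict, List, Optional, Union

# Task suites copied from the selection guide above.
SUITES: Dict[str, List[str]] = {
    "general": ["mmlu", "gsm8k", "hellaswag", "arc_challenge", "truthfulqa_mc2"],
    "code": ["humaneval", "mbpp"],
    "chat": ["ifeval", "mmlu", "gsm8k_cot"],
    "long_context": ["longbench"],
}


def build_command(model_type: str, pretrained: str,
                  num_fewshot: Optional[int] = None,
                  batch_size: Optional[Union[int, str]] = None) -> List[str]:
    """Assemble the lm_eval argument list for one suite (does not run it)."""
    argv = [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(SUITES[model_type]),
    ]
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    if batch_size is not None:
        argv += ["--batch_size", str(batch_size)]
    return argv


print(shlex.join(build_command("code", "codellama/CodeLlama-7b-hf", batch_size=1)))
# lm_eval --model hf --model_args pretrained=codellama/CodeLlama-7b-hf --tasks humaneval,mbpp --batch_size 1
```

The returned list can be passed to `subprocess.run` once the model and tasks are confirmed to be available locally.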

Interpreting Results

Understanding Metrics

Accuracy: Percentage of correct answers (most common)

Exact Match (EM): Requires exact string match (strict)

F1 Score: Balances precision and recall

BLEU/ROUGE: Text generation similarity

Pass@k: Fraction of problems for which at least one of k generated samples passes the tests
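
Three of these metrics are short enough to sketch directly. The versions below are simplified illustrations: `exact_match` only strips surrounding whitespace, `token_f1` uses lowercase whitespace tokens, and `pass_at_k` is the unbiased estimator 1 - C(n-c, k) / C(n, k) from the Codex paper, where n samples were generated and c of them passed. Individual tasks in the harness may normalize text differently.

```python
from collections import Counter
from math import comb


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings are identical after stripping whitespace, else 0.0."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes,
    given that c of the n samples passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(exact_match("60 ", "60"))                                   # 1.0
print(round(token_f1("the cat sat", "the cat sat down"), 3))      # 0.857
print(round(pass_at_k(n=10, c=3, k=1), 3))                        # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))                        # 0.917
```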

Typical Score Ranges

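| Model Size | MMLU | GSM8K | HumanEval | HellaSwag |
|------------|--------|--------|-----------|-----------|
| 7B | 40-50% | 10-20% | 5-15% | 70-80% |
| 13B | 45-55% | 20-35% | 15-25% | 75-82% |
| 70B | 60-70% | 50-65% | 35-50% | 82-87% |
| GPT-4 | 86% | 92% | 67% | 95% |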

Red Flags

  • All tasks at random chance: Model not trained properly, or the checkpoint or tokenizer did not load correctly
  • Exact 0% on generation tasks: Likely format/parsing issue
  • Huge variance across runs: Check seed/sampling settings
  • Better than GPT-4 on everything: Likely contamination
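
The first two checks can be automated over a results dictionary. A minimal sketch, assuming scores are fractions in [0, 1] keyed by task name; the chance levels come from the sections above, while the tolerance and the helper name `red_flags` are assumptions of this example.

```python
from typing import Dict, List

# Chance accuracy for four-option tasks, as stated earlier in this guide.
CHANCE = {"mmlu": 0.25, "arc_challenge": 0.25, "hellaswag": 0.25}
# Tasks scored on generated text rather than answer choices.
GENERATION_TASKS = {"gsm8k", "humaneval"}


def red_flags(scores: Dict[str, float], tolerance: float = 0.02) -> List[str]:
    """Return human-readable warnings for suspicious score patterns."""
    flags = []
    multiple_choice = [t for t in scores if t in CHANCE]
    if multiple_choice and all(
        abs(scores[t] - CHANCE[t]) <= tolerance for t in multiple_choice
    ):
        flags.append("all multiple-choice tasks at random chance")
    for task in sorted(scores):
        if task in GENERATION_TASKS and scores[task] == 0.0:
            flags.append(f"{task}: exact 0%, check answer format/parsing")
    return flags


# Hypothetical results from a broken run.
print(red_flags({"mmlu": 0.26, "hellaswag": 0.25, "gsm8k": 0.0}))
```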

Best Practices

  1. Always report few-shot setting: 0-shot, 5-shot, etc.
  2. Run multiple seeds: Report mean ± std
  3. Check for data contamination: Search training data for benchmark examples
  4. Compare to published baselines: Validate your setup
  5. Report all hyperparameters: Model, batch size, max tokens, temperature
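
For practice 2, the mean and sample standard deviation across seeds can be computed with the standard library. The scores below are hypothetical, and `summarize_runs` is an illustrative helper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


# Hypothetical accuracies from three seeds of the same configuration.
runs = [0.452, 0.447, 0.458]
m, s = summarize_runs(runs)
print(f"{m:.3f} ± {s:.3f}")  # 0.452 ± 0.006
```

`stdev` needs at least two runs; with a single run, report the score alone together with the seed.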

References

  • Task list: lm_eval --tasks list
  • Task README: lm_eval/tasks/README.md
  • Papers: See individual benchmark papers