DSPy Optimizers (Teleprompters)
Complete guide to DSPy's optimization algorithms for improving prompts and model weights.
What are Optimizers?
DSPy optimizers (called "teleprompters") automatically improve your modules by:
- Synthesizing few-shot examples from training data
- Proposing better instructions through search
- Fine-tuning model weights (optional)
Key idea: Instead of manually tuning prompts, define a metric and let DSPy optimize.
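In practice the loop is always the same: define a metric, pick an optimizer, and call compile. A minimal sketch (the trainset and metric below are placeholders, and it assumes an LM has already been configured with dspy.settings.configure):
import dspy
from dspy.teleprompt import BootstrapFewShot
# Placeholder data and metric -- swap in your own task
trainset = [dspy.Example(question="What is 2+2?", answer="4").with_inputs("question")]
def exact(example, pred, trace=None):
    return example.answer == pred.answer
program = dspy.Predict("question -> answer")  # any DSPy module works here
optimized = BootstrapFewShot(metric=exact).compile(program, trainset=trainset)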
Optimizer Selection Guide
| Optimizer | Best For | Speed | Quality | Data Needed |
|---|---|---|---|---|
| BootstrapFewShot | General purpose | Fast | Good | 10-50 examples |
| MIPRO | Instruction tuning | Medium | Excellent | 50-200 examples |
| BootstrapFinetune | Fine-tuning | Slow | Excellent | 100+ examples |
| COPRO | Prompt optimization | Medium | Good | 20-100 examples |
| KNNFewShot | Quick baseline | Very fast | Fair | 10+ examples |
Core Optimizers
BootstrapFewShot
Most popular optimizer - Generates few-shot demonstrations from training data.
How it works:
- Takes your training examples
- Uses your module to generate predictions
- Selects high-quality predictions (based on metric)
- Uses these as few-shot examples in future prompts
Parameters:
- `metric`: Function that scores predictions (required)
- `max_bootstrapped_demos`: Max demonstrations to generate (default: 4)
- `max_labeled_demos`: Max labeled examples to use (default: 16)
- `max_rounds`: Optimization iterations (default: 1)
- `metric_threshold`: Minimum score to accept (optional)
import dspy
from dspy.teleprompt import BootstrapFewShot
# Define metric
def validate_answer(example, pred, trace=None):
    """Return True if prediction matches gold answer."""
    return example.answer.lower() == pred.answer.lower()
# Training data
trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is 3+5?", answer="8").with_inputs("question"),
    dspy.Example(question="What is 10-3?", answer="7").with_inputs("question"),
]
# Create module
qa = dspy.ChainOfThought("question -> answer")
# Optimize
optimizer = BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=3,
    max_rounds=2
)
optimized_qa = optimizer.compile(qa, trainset=trainset)
# Now optimized_qa has learned few-shot examples!
result = optimized_qa(question="What is 5+7?")
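To confirm what was learned, you can inspect the demonstrations attached to the module's predictors (a sketch relying on DSPy's current internals, where each predictor exposes a demos list; attribute names may differ across versions):
# Peek at the bootstrapped demonstrations (internals may vary by DSPy version)
for predictor in optimized_qa.predictors():
    for demo in predictor.demos:
        print(demo.question, "->", demo.answer)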
Best practices:
- Start with 10-50 training examples
- Use diverse examples covering edge cases
- Set `max_bootstrapped_demos=3-5` for most tasks
- Increase `max_rounds=2-3` for better quality
When to use:
- First optimizer to try
- You have 10+ labeled examples
- Want quick improvements
- General-purpose tasks
MIPRO (Multiprompt Instruction PRoposal Optimizer)
State-of-the-art optimizer - Iteratively searches for better instructions.
How it works:
- Generates candidate instructions
- Tests each on validation set
- Selects best-performing instructions
- Iterates to refine further
Parameters:
- `metric`: Evaluation metric (required)
- `num_candidates`: Instructions to try per iteration (default: 10)
- `init_temperature`: Sampling temperature (default: 1.0)
- `verbose`: Show progress (default: False)
from dspy.teleprompt import MIPRO
# Define metric with more nuance
def answer_quality(example, pred, trace=None):
    """Score answer quality 0-1."""
    if example.answer.lower() in pred.answer.lower():
        return 1.0
    # Partial credit for similar answers
    return 0.5 if len(set(example.answer.split()) & set(pred.answer.split())) > 0 else 0.0
# Larger training set (MIPRO benefits from more data)
trainset = [...] # 50-200 examples
valset = [...] # 20-50 examples
# Create module
qa = dspy.ChainOfThought("question -> answer")
# Optimize with MIPRO
optimizer = MIPRO(
    metric=answer_quality,
    num_candidates=10,
    init_temperature=1.0,
    verbose=True
)
optimized_qa = optimizer.compile(
    student=qa,
    trainset=trainset,
    valset=valset,  # MIPRO uses a separate validation set
    num_trials=100  # More trials = better quality
)
Best practices:
- Use 50-200 training examples
- Separate validation set (20-50 examples)
- Run 100-200 trials for best results
- Takes 10-30 minutes typically
When to use:
- You have 50+ labeled examples
- Want state-of-the-art performance
- Willing to wait for optimization
- Complex reasoning tasks
BootstrapFinetune
Fine-tune model weights - Creates training dataset for fine-tuning.
How it works:
- Generates synthetic training data
- Exports data in fine-tuning format
- You fine-tune model separately
- Load fine-tuned model back
Parameters:
- `metric`: Evaluation metric (required)
- `max_bootstrapped_demos`: Demonstrations to generate (default: 4)
- `max_rounds`: Data generation rounds (default: 1)
from dspy.teleprompt import BootstrapFinetune
# Training data
trainset = [...] # 100+ examples recommended
# Define metric
def validate(example, pred, trace=None):
    return example.answer == pred.answer
# Create module
qa = dspy.ChainOfThought("question -> answer")
# Generate fine-tuning data
optimizer = BootstrapFinetune(metric=validate)
optimized_qa = optimizer.compile(qa, trainset=trainset)
# Exports training data to file
# You then fine-tune using your LM provider's API
# After fine-tuning, load your model:
finetuned_lm = dspy.OpenAI(model="ft:gpt-3.5-turbo:your-model-id")
dspy.settings.configure(lm=finetuned_lm)
Best practices:
- Use 100+ training examples
- Validate on held-out test set
- Monitor for overfitting
- Compare with prompt-based methods first
When to use:
- You have 100+ examples
- Latency is critical (fine-tuned models faster)
- Task is narrow and well-defined
- Prompt optimization isn't enough
COPRO (Coordinate Prompt Optimization)
Optimize prompts via gradient-free search.
How it works:
- Generates prompt variants
- Evaluates each variant
- Selects best prompts
- Iterates to refine
from dspy.teleprompt import COPRO
# Training data
trainset = [...]
# Define metric
def metric(example, pred, trace=None):
    return example.answer == pred.answer
# Create module
qa = dspy.ChainOfThought("question -> answer")
# Optimize with COPRO
optimizer = COPRO(
    metric=metric,
    breadth=10,  # Candidates per iteration
    depth=3      # Optimization rounds
)
optimized_qa = optimizer.compile(qa, trainset=trainset)
When to use:
- Want prompt optimization
- Have 20-100 examples
- MIPRO too slow
KNNFewShot
Simple k-nearest neighbors - Selects similar examples for each query.
How it works:
- Embeds all training examples
- For each query, finds k most similar examples
- Uses these as few-shot demonstrations
from dspy.teleprompt import KNNFewShot
trainset = [...]
# No metric needed - just selects similar examples
optimizer = KNNFewShot(k=3)
optimized_qa = optimizer.compile(qa, trainset=trainset)
# For each query, uses 3 most similar examples from trainset
When to use:
- Quick baseline
- Have diverse training examples
- Similarity is good proxy for helpfulness
Writing Metrics
Metrics are functions that score predictions. They're critical for optimization.
Binary Metrics
def exact_match(example, pred, trace=None):
    """Return True if prediction exactly matches gold."""
    return example.answer == pred.answer

def contains_answer(example, pred, trace=None):
    """Return True if prediction contains gold answer."""
    return example.answer.lower() in pred.answer.lower()
Continuous Metrics
def f1_score(example, pred, trace=None):
    """Token-level F1 between prediction and gold."""
    pred_tokens = set(pred.answer.lower().split())
    gold_tokens = set(example.answer.lower().split())
    if not pred_tokens or not gold_tokens:
        return 0.0
    precision = len(pred_tokens & gold_tokens) / len(pred_tokens)
    recall = len(pred_tokens & gold_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
def semantic_similarity(example, pred, trace=None):
    """Embedding similarity between prediction and gold."""
    from sentence_transformers import SentenceTransformer, util
    model = SentenceTransformer('all-MiniLM-L6-v2')  # for speed, load once at module level instead
    emb1 = model.encode(example.answer)
    emb2 = model.encode(pred.answer)
    return float(util.cos_sim(emb1, emb2))
Multi-Factor Metrics
def comprehensive_metric(example, pred, trace=None):
    """Combine multiple factors."""
    score = 0.0
    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5
    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25
    # Citation (25%)
    if "source:" in pred.answer.lower():
        score += 0.25
    return score
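Before spending an optimization run on a new metric, it can help to sanity-check it with Evaluate on a handful of examples (a quick sketch; qa and trainset are the module and data from the earlier examples):
from dspy.evaluate import Evaluate
# Score the unoptimized module with the new metric on a few examples
sanity_check = Evaluate(devset=trainset[:10], metric=comprehensive_metric, display_progress=True)
print(sanity_check(qa))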
Using Trace for Debugging
def metric_with_trace(example, pred, trace=None):
    """Metric that uses trace for debugging."""
    is_correct = example.answer == pred.answer
    if trace is not None and not is_correct:
        # Log failures for analysis
        print(f"Failed on: {example.question}")
        print(f"Expected: {example.answer}")
        print(f"Got: {pred.answer}")
    return is_correct
Evaluation Best Practices
Train/Val/Test Split
# Split data
trainset = data[:100] # 70%
valset = data[100:120] # 15%
testset = data[120:] # 15%
# Optimize on train
optimized = optimizer.compile(module, trainset=trainset)
# Validate during optimization (for MIPRO)
optimized = optimizer.compile(module, trainset=trainset, valset=valset)
# Evaluate on test
from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=testset, metric=metric)
score = evaluator(optimized)
Cross-Validation
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)
scores = []
for train_idx, val_idx in kfold.split(data):
    trainset = [data[i] for i in train_idx]
    valset = [data[i] for i in val_idx]
    optimized = optimizer.compile(module, trainset=trainset)
    score = evaluator(optimized, devset=valset)
    scores.append(score)
print(f"Average score: {sum(scores) / len(scores):.2f}")
Comparing Optimizers
results = {}
for opt_name, optimizer in [
    ("baseline", None),
    ("fewshot", BootstrapFewShot(metric=metric)),
    ("mipro", MIPRO(metric=metric)),
]:
    if optimizer is None:
        module_opt = module
    else:
        module_opt = optimizer.compile(module, trainset=trainset)
    score = evaluator(module_opt, devset=testset)
    results[opt_name] = score
print(results)
# {'baseline': 0.65, 'fewshot': 0.78, 'mipro': 0.85}
Advanced Patterns
Custom Optimizer
from dspy.teleprompt import Teleprompter
class CustomOptimizer(Teleprompter):
    def __init__(self, metric):
        self.metric = metric

    def compile(self, student, trainset, **kwargs):
        # Your optimization logic here
        # Return optimized student module
        return student
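As a concrete illustration, here is a toy compile implementation that tries a few demonstration counts (via DSPy's LabeledFewShot) and keeps the best-scoring program. It assumes Evaluate returns a plain float score, which may differ in newer DSPy versions:
from dspy.teleprompt import LabeledFewShot
from dspy.evaluate import Evaluate

class BestOfKFewShot(Teleprompter):
    """Toy optimizer: try several demo counts, keep the best-scoring program."""
    def __init__(self, metric, candidate_ks=(2, 4, 8)):
        self.metric = metric
        self.candidate_ks = candidate_ks

    def compile(self, student, trainset, valset=None, **kwargs):
        valset = valset or trainset
        evaluator = Evaluate(devset=valset, metric=self.metric, display_progress=False)
        best_score, best_program = float("-inf"), student
        for k in self.candidate_ks:
            # LabeledFewShot simply attaches k labeled examples as demonstrations
            candidate = LabeledFewShot(k=k).compile(student, trainset=trainset)
            score = evaluator(candidate)
            if score > best_score:
                best_score, best_program = score, candidate
        return best_program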
Multi-Stage Optimization
# Stage 1: Bootstrap few-shot
stage1 = BootstrapFewShot(metric=metric, max_bootstrapped_demos=3)
optimized1 = stage1.compile(module, trainset=trainset)
# Stage 2: Instruction tuning
stage2 = MIPRO(metric=metric, num_candidates=10)
optimized2 = stage2.compile(optimized1, trainset=trainset, valset=valset)
# Final optimized module
final_module = optimized2
Ensemble Optimization
class EnsembleModule(dspy.Module):
    def __init__(self, modules):
        super().__init__()
        self.modules = modules

    def forward(self, question):
        predictions = [m(question=question).answer for m in self.modules]
        # Vote or average
        return dspy.Prediction(answer=max(set(predictions), key=predictions.count))
# Optimize multiple modules
opt1 = BootstrapFewShot(metric=metric).compile(module, trainset=trainset)
opt2 = MIPRO(metric=metric).compile(module, trainset=trainset)
opt3 = COPRO(metric=metric).compile(module, trainset=trainset)
# Ensemble
ensemble = EnsembleModule([opt1, opt2, opt3])
Optimization Workflow
1. Start with Baseline
# No optimization
baseline = dspy.ChainOfThought("question -> answer")
baseline_score = evaluator(baseline, devset=testset)
print(f"Baseline: {baseline_score}")
2. Try BootstrapFewShot
# Quick optimization
fewshot = BootstrapFewShot(metric=metric, max_bootstrapped_demos=3)
optimized = fewshot.compile(baseline, trainset=trainset)
fewshot_score = evaluator(optimized, devset=testset)
print(f"Few-shot: {fewshot_score} (+{fewshot_score - baseline_score:.2f})")
3. If More Data Available, Try MIPRO
# State-of-the-art optimization
mipro = MIPRO(metric=metric, num_candidates=10)
optimized_mipro = mipro.compile(baseline, trainset=trainset, valset=valset)
mipro_score = evaluator(optimized_mipro, devset=testset)
print(f"MIPRO: {mipro_score} (+{mipro_score - baseline_score:.2f})")
4. Save Best Model
if mipro_score > fewshot_score:
    optimized_mipro.save("models/best_model.json")
else:
    optimized.save("models/best_model.json")
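To reuse the optimized program later, rebuild a module with the same structure and load the saved state (the standard DSPy save/load pattern):
# Reconstruct the module and load the optimized state
loaded_qa = dspy.ChainOfThought("question -> answer")
loaded_qa.load("models/best_model.json")
print(loaded_qa(question="What is 9+6?"))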
Common Pitfalls
1. Overfitting to Training Data
# ❌ Bad: Too many demos
optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=20)  # Overfits!
# ✅ Good: Moderate demos
optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)  # 3-5 demos is usually enough
2. Metric Doesn't Match Task
# ❌ Bad: Binary metric for nuanced task
def bad_metric(example, pred, trace=None):
    return example.answer == pred.answer  # Too strict!

# ✅ Good: Graded metric
def good_metric(example, pred, trace=None):
    return f1_score(example, pred)  # Partial credit via the token-level F1 metric defined above
3. Insufficient Training Data
# ❌ Bad: Too little data
trainset = data[:5] # Not enough!
# ✅ Good: Sufficient data
trainset = data[:50] # Better
4. No Validation Set
# ❌ Bad: Optimizing on test set
optimizer.compile(module, trainset=testset) # Cheating!
# ✅ Good: Proper splits
optimizer.compile(module, trainset=trainset, valset=valset)
evaluator(optimized, devset=testset)
Performance Tips
- Start simple: BootstrapFewShot first
- Use representative data: Cover edge cases
- Monitor overfitting: Validate on held-out set
- Iterate metrics: Refine based on failures
- Save checkpoints: Don't lose progress
- Compare to baseline: Measure improvement
- Test multiple optimizers: Find best fit
Resources
- Paper: "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines"
- GitHub: https://github.com/stanfordnlp/dspy
- Discord: https://discord.gg/XCGy2WDCQB