
Comprehensive Hyperparameter Sweeps Guide

Complete guide to hyperparameter optimization with W&B Sweeps.

Table of Contents

  • Sweep Configuration
  • Search Strategies
  • Parameter Distributions
  • Early Termination
  • Training Function
  • Parallel Execution
  • Advanced Patterns
  • Real-World Examples
  • Best Practices

Sweep Configuration

Basic Sweep Config

import wandb

sweep_config = {
    'method': 'bayes',  # Search strategy
    'metric': {
        'name': 'val/accuracy',
        'goal': 'maximize'  # or 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        }
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")

Complete Config Example

sweep_config = {
    # Required: Search method
    'method': 'bayes',

    # Required: Optimization metric
    'metric': {
        'name': 'val/f1_score',
        'goal': 'maximize'
    },

    # Required: Parameters to search
    'parameters': {
        # Continuous parameter
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },

        # Discrete values
        'batch_size': {
            'values': [16, 32, 64, 128]
        },

        # Categorical
        'optimizer': {
            'values': ['adam', 'sgd', 'rmsprop', 'adamw']
        },

        # Uniform distribution
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        },

        # Integer range
        'num_layers': {
            'distribution': 'int_uniform',
            'min': 2,
            'max': 10
        },

        # Fixed value (constant across runs)
        'epochs': {
            'value': 50
        }
    },

    # Optional: Early termination
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5,
        's': 2,
        'eta': 3,
        'max_iter': 27
    }
}

Search Strategies

Grid Search

Exhaustively search all combinations.

sweep_config = {
    'method': 'grid',
    'parameters': {
        'learning_rate': {
            'values': [0.001, 0.01, 0.1]
        },
        'batch_size': {
            'values': [16, 32, 64]
        },
        'optimizer': {
            'values': ['adam', 'sgd']
        }
    }
}

# Total runs: 3 × 3 × 2 = 18 runs
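The run count is just the product of the value lists; a quick sketch that enumerates the grid directly (plain Python, independent of W&B):

```python
from itertools import product

# The same discrete values as the grid config above
param_values = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'optimizer': ['adam', 'sgd'],
}

# Every combination the grid sweep will schedule, in order
combos = [dict(zip(param_values, vals))
          for vals in product(*param_values.values())]
print(len(combos))  # 18
```

Doubling the values in any one list doubles the total, which is why grid search blows up quickly.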

Pros:

  • Comprehensive search
  • Reproducible results
  • No randomness

Cons:

  • Exponential growth with parameters
  • Inefficient for continuous parameters
  • Not scalable beyond 3-4 parameters

When to use:

  • Few parameters (< 4)
  • All discrete values
  • Need complete coverage

Random Search

Randomly sample parameter combinations.

sweep_config = {
    'method': 'random',
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128, 256]
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'num_layers': {
            'distribution': 'int_uniform',
            'min': 2,
            'max': 8
        }
    }
}

# Run 100 random trials
wandb.agent(sweep_id, function=train, count=100)

Pros:

  • Scales to many parameters
  • Can run indefinitely
  • Often finds good solutions quickly

Cons:

  • No learning from previous runs
  • May miss optimal region
  • Results vary with random seed

When to use:

  • Many parameters (> 4)
  • Quick exploration
  • Limited budget
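Each random-search trial is just an independent draw from every distribution; a minimal sketch of a single draw mirroring the config above (plain Python, not W&B internals):

```python
import math
import random

random.seed(0)  # only for reproducibility of this sketch

# One random-search draw
config = {
    # log-uniform: sample uniformly in log space, then exponentiate
    'learning_rate': math.exp(random.uniform(math.log(1e-5), math.log(1e-1))),
    'batch_size': random.choice([16, 32, 64, 128, 256]),
    'dropout': random.uniform(0.0, 0.5),
    'num_layers': random.randint(2, 8),
}
print(config)
```

Because draws are independent, agents can sample forever without coordination — which is what makes random search trivially parallel.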

Bayesian Search

Learn from previous trials to sample promising regions.

sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val/loss',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-2
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        },
        'num_layers': {
            'values': [2, 3, 4, 5, 6]
        }
    }
}

Pros:

  • Most sample-efficient
  • Learns from past trials
  • Focuses on promising regions

Cons:

  • Initial random exploration phase
  • May get stuck in local optima
  • Slower per iteration

When to use:

  • Expensive training runs
  • Need best performance
  • Limited compute budget

Parameter Distributions

Continuous Distributions

# Log-uniform: Good for learning rates, regularization
'learning_rate': {
    'distribution': 'log_uniform_values',
    'min': 1e-6,
    'max': 1e-1
}

# Uniform: Good for dropout, momentum
'dropout': {
    'distribution': 'uniform',
    'min': 0.0,
    'max': 0.5
}

# Normal distribution
'parameter': {
    'distribution': 'normal',
    'mu': 0.5,
    'sigma': 0.1
}

# Log-normal distribution
'parameter': {
    'distribution': 'log_normal',
    'mu': 0.0,
    'sigma': 1.0
}

Discrete Distributions

# Fixed values
'batch_size': {
    'values': [16, 32, 64, 128, 256]
}

# Integer uniform
'num_layers': {
    'distribution': 'int_uniform',
    'min': 2,
    'max': 10
}

# Quantized uniform (step size)
'layer_size': {
    'distribution': 'q_uniform',
    'min': 32,
    'max': 512,
    'q': 32  # Step by 32: 32, 64, 96, 128...
}

# Quantized log-uniform
'hidden_size': {
    'distribution': 'q_log_uniform_values',
    'min': 32,
    'max': 1024,
    'q': 32
}
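The quantized distributions snap a continuous draw to a step size; a sketch of the rounding (assumption: nearest multiple of `q`, matching the `q_uniform` behavior described above):

```python
def quantize(x, q):
    # Snap a continuous sample to the nearest multiple of q
    return round(x / q) * q

print(quantize(97.3, 32))   # 96
print(quantize(300.0, 32))  # 288
```

This is useful for layer widths and hidden sizes, where only multiples of some hardware-friendly step are worth trying.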

Categorical Parameters

# Optimizers
'optimizer': {
    'values': ['adam', 'sgd', 'rmsprop', 'adamw']
}

# Model architectures
'model': {
    'values': ['resnet18', 'resnet34', 'resnet50', 'efficientnet_b0']
}

# Activation functions
'activation': {
    'values': ['relu', 'gelu', 'silu', 'leaky_relu']
}

Early Termination

Stop underperforming runs early to save compute.

Hyperband

sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {...},

    # Hyperband early termination
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 3,      # Minimum iterations before termination
        's': 2,             # Bracket count
        'eta': 3,           # Downsampling rate
        'max_iter': 27      # Maximum iterations
    }
}

How it works:

  • Runs trials in brackets
  • Keeps top 1/eta performers each round
  • Eliminates bottom performers early
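With the config above (min_iter=3, eta=3, max_iter=27), the comparison points follow a geometric schedule; a sketch under the standard successive-halving assumption (not W&B internals):

```python
def hyperband_rungs(min_iter, eta, max_iter):
    # Iterations at which runs are compared; at each rung, roughly
    # the bottom (1 - 1/eta) fraction of runs is terminated
    rungs = []
    it = min_iter
    while it <= max_iter:
        rungs.append(it)
        it *= eta
    return rungs

print(hyperband_rungs(3, 3, 27))  # [3, 9, 27]
```

So a run must be in the top third at iteration 3 to reach iteration 9, and in the top third again to reach iteration 27.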

Custom Termination

def train():
    run = wandb.init()

    best_acc = 0.0
    epochs_without_improvement = 0

    for epoch in range(MAX_EPOCHS):
        loss = train_epoch()
        val_acc = validate()

        wandb.log({'val/accuracy': val_acc, 'epoch': epoch})

        # Custom early stopping: abandon clearly poor runs
        if epoch > 5 and val_acc < 0.5:
            print("Early stop: Poor performance")
            break

        # Track the best accuracy seen so far
        if val_acc > best_acc + 1e-3:
            best_acc = val_acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1

        # Stop once accuracy has plateaued
        if epochs_without_improvement >= 5:
            print("Early stop: No improvement")
            break

Training Function

Basic Template

def train():
    # Initialize W&B run
    run = wandb.init()

    # Get hyperparameters
    config = wandb.config

    # Build model with config
    model = build_model(
        hidden_size=config.hidden_size,
        num_layers=config.num_layers,
        dropout=config.dropout
    )

    # Create optimizer
    optimizer = create_optimizer(
        model.parameters(),
        name=config.optimizer,
        lr=config.learning_rate,
        weight_decay=config.weight_decay
    )

    # Training loop
    for epoch in range(config.epochs):
        # Train
        train_loss, train_acc = train_epoch(
            model, optimizer, train_loader, config.batch_size
        )

        # Validate
        val_loss, val_acc = validate(model, val_loader)

        # Log metrics
        wandb.log({
            'train/loss': train_loss,
            'train/accuracy': train_acc,
            'val/loss': val_loss,
            'val/accuracy': val_acc,
            'epoch': epoch
        })

    # Log final model
    torch.save(model.state_dict(), 'model.pth')
    wandb.save('model.pth')

    # Finish run
    wandb.finish()

With PyTorch

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import wandb

def train():
    run = wandb.init()
    config = wandb.config

    # Data
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True
    )

    # Model
    model = ResNet(
        num_classes=config.num_classes,
        dropout=config.dropout
    ).to(device)

    # Optimizer
    if config.optimizer == 'adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate,
            weight_decay=config.weight_decay
        )
    elif config.optimizer == 'sgd':
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=config.learning_rate,
            momentum=config.momentum,
            weight_decay=config.weight_decay
        )

    # Scheduler
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=config.epochs
    )

    # Training
    for epoch in range(config.epochs):
        model.train()
        train_loss = 0.0

        for data, target in train_loader:
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = nn.CrossEntropyLoss()(output, target)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        # Validation
        model.eval()
        val_loss, val_acc = validate(model, val_loader)

        # Step scheduler
        scheduler.step()

        # Log
        wandb.log({
            'train/loss': train_loss / len(train_loader),
            'val/loss': val_loss,
            'val/accuracy': val_acc,
            'learning_rate': scheduler.get_last_lr()[0],
            'epoch': epoch
        })

Parallel Execution

Multiple Agents

Run sweep agents in parallel to speed up search.

# Initialize sweep once
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Run multiple agents in parallel
# Agent 1 (Terminal 1)
wandb.agent(sweep_id, function=train, count=20)

# Agent 2 (Terminal 2)
wandb.agent(sweep_id, function=train, count=20)

# Agent 3 (Terminal 3)
wandb.agent(sweep_id, function=train, count=20)

# Total: 60 runs across 3 agents

Multi-GPU Execution

def train():
    run = wandb.init()
    config = wandb.config

    # CUDA_VISIBLE_DEVICES restricts this process to a single GPU,
    # which PyTorch always sees as cuda:0 regardless of its physical ID
    device = torch.device('cuda')

    model = build_model(
        hidden_size=config.hidden_size,
        num_layers=config.num_layers,
        dropout=config.dropout
    ).to(device)

    # ... rest of training ...

# Run agents on different GPUs
# Terminal 1
# CUDA_VISIBLE_DEVICES=0 wandb agent sweep_id

# Terminal 2
# CUDA_VISIBLE_DEVICES=1 wandb agent sweep_id

# Terminal 3
# CUDA_VISIBLE_DEVICES=2 wandb agent sweep_id

Advanced Patterns

Nested Parameters

sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {
        'model': {
            'parameters': {
                'type': {
                    'values': ['resnet', 'efficientnet']
                },
                'size': {
                    'values': ['small', 'medium', 'large']
                }
            }
        },
        'optimizer': {
            'parameters': {
                'type': {
                    'values': ['adam', 'sgd']
                },
                'lr': {
                    'distribution': 'log_uniform_values',
                    'min': 1e-5,
                    'max': 1e-1
                }
            }
        }
    }
}

# Access nested config (nested parameters come back as dicts)
def train():
    run = wandb.init()
    model_type = wandb.config.model['type']
    model_size = wandb.config.model['size']
    opt_type = wandb.config.optimizer['type']
    lr = wandb.config.optimizer['lr']
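Nested values can also be flattened into dotted keys for logging or lookup; a small sketch (the `flatten` helper is hypothetical, not part of wandb):

```python
def flatten(cfg, prefix=''):
    # Recursively flatten nested dicts into dotted keys
    flat = {}
    for key, value in cfg.items():
        name = f'{prefix}{key}'
        if isinstance(value, dict):
            flat.update(flatten(value, name + '.'))
        else:
            flat[name] = value
    return flat

nested = {'model': {'type': 'resnet', 'size': 'small'},
          'optimizer': {'type': 'adam', 'lr': 3e-4}}
print(flatten(nested))
# {'model.type': 'resnet', 'model.size': 'small',
#  'optimizer.type': 'adam', 'optimizer.lr': 0.0003}
```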

Conditional Parameters

sweep_config = {
    'method': 'bayes',
    'parameters': {
        'optimizer': {
            'values': ['adam', 'sgd']
        },
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        # Only used if optimizer == 'sgd'
        'momentum': {
            'distribution': 'uniform',
            'min': 0.5,
            'max': 0.99
        }
    }
}

def train():
    run = wandb.init()
    config = wandb.config

    if config.optimizer == 'adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate
        )
    elif config.optimizer == 'sgd':
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=config.learning_rate,
            momentum=config.momentum  # Conditional parameter
        )

Real-World Examples

Image Classification

sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val/top1_accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        # Model
        'architecture': {
            'values': ['resnet50', 'resnet101', 'efficientnet_b0', 'efficientnet_b3']
        },
        'pretrained': {
            'values': [True, False]
        },

        # Training
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-2
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        'optimizer': {
            'values': ['adam', 'sgd', 'adamw']
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-2
        },

        # Regularization
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'label_smoothing': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.2
        },

        # Data augmentation
        'mixup_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        },
        'cutmix_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        }
    },
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5
    }
}

NLP Fine-Tuning

sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/f1', 'goal': 'maximize'},
    'parameters': {
        # Model
        'model_name': {
            'values': ['bert-base-uncased', 'roberta-base', 'distilbert-base-uncased']
        },

        # Training
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-4
        },
        'per_device_train_batch_size': {
            'values': [8, 16, 32]
        },
        'num_train_epochs': {
            'values': [3, 4, 5]
        },
        'warmup_ratio': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.1
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-4,
            'max': 1e-1
        },

        # Optimizer
        'adam_beta1': {
            'distribution': 'uniform',
            'min': 0.8,
            'max': 0.95
        },
        'adam_beta2': {
            'distribution': 'uniform',
            'min': 0.95,
            'max': 0.999
        }
    }
}

Best Practices

1. Start Small

# Initial exploration: Random search, 20 runs
sweep_config_v1 = {
    'method': 'random',
    'parameters': {...}
}
wandb.agent(sweep_id_v1, train, count=20)

# Refined search: Bayes, narrow ranges around the best v1 results
sweep_config_v2 = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 5e-5,  # Narrowed from 1e-6 to 1e-4
            'max': 1e-4
        }
    }
}

2. Use Log Scales

# ✅ Good: Log scale for learning rate
'learning_rate': {
    'distribution': 'log_uniform_values',
    'min': 1e-6,
    'max': 1e-2
}

# ❌ Bad: Linear scale
'learning_rate': {
    'distribution': 'uniform',
    'min': 0.000001,
    'max': 0.01
}
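The difference is easy to see from the medians: a log-uniform draw centers on the geometric mean of the range, so every order of magnitude gets equal weight, while a linear-uniform draw almost never visits the small values (plain-math sketch):

```python
import math

lo, hi = 1e-6, 1e-2

# Median of a log-uniform draw: the geometric mean of the range
log_median = math.exp((math.log(lo) + math.log(hi)) / 2)
# Median of a linear-uniform draw: the arithmetic mean
linear_median = (lo + hi) / 2

print(log_median)     # 1e-4: half the samples land below 1e-4
print(linear_median)  # ~5e-3: almost all samples land above 1e-3
```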

3. Set Reasonable Ranges

# Base ranges on prior knowledge
'learning_rate': {'distribution': 'log_uniform_values',
                  'min': 1e-5, 'max': 1e-3},    # Typical for Adam
'batch_size': {'values': [16, 32, 64]},         # GPU memory limits
'dropout': {'distribution': 'uniform',
            'min': 0.1, 'max': 0.5}             # Too high hurts training

4. Monitor Resource Usage

def train():
    run = wandb.init()

    # Log system metrics
    wandb.log({
        'system/gpu_memory_allocated': torch.cuda.memory_allocated(),
        'system/gpu_memory_reserved': torch.cuda.memory_reserved()
    })

5. Save Best Models

def train():
    run = wandb.init()
    best_acc = 0.0

    for epoch in range(config.epochs):
        val_acc = validate(model)

        if val_acc > best_acc:
            best_acc = val_acc
            # Save best checkpoint
            torch.save(model.state_dict(), 'best_model.pth')
            wandb.save('best_model.pth')

Resources