hermes-config/skills/mlops/simpo/references/hyperparameters.md

# Hyperparameters

Complete guide to SimPO hyperparameter selection and tuning.

## Overview

Key hyperparameters in SimPO:
1. **Learning Rate** - Most critical
2. **Beta (β)** - Reward scaling
3. **Gamma-Beta Ratio (γ/β)** - Target margin
4. **SFT Weight** - Regularization strength

## Learning Rate

### Recommended Ranges

**By model size**:
| Model Size | Learning Rate | Notes |
|------------|---------------|-------|
| 1B-3B | 5e-7 to 1e-6 | Higher end safe |
| 7B-8B | 3e-7 to 5e-7 | **Standard** |
| 13B-30B | 1e-7 to 3e-7 | Lower for stability |
| 70B+ | 5e-8 to 1e-7 | Very conservative |

**By task type**:
| Task | Learning Rate | Reason |
|------|---------------|--------|
| General chat | 5e-7 | Standard |
| Code generation | 3e-7 | **Precise reasoning** |
| Math reasoning | 3e-7 | **Careful optimization** |
| Creative writing | 1e-6 | More aggressive OK |

### Why Learning Rate Matters

**Too high** (> 1e-6 for 7B):
- Loss divergence
- Catastrophic forgetting
- Unstable training

**Too low** (< 1e-7 for 7B):
- Very slow convergence
- May not finish in time
- Undertraining

**Optimal** (3e-7 to 5e-7 for 7B):
- Stable convergence
- Good final performance
- Efficient training

### Config Examples

**Mistral 7B (general)**:
```yaml
learning_rate: 5e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine
```

**Llama 3 8B (reasoning)**:
```yaml
learning_rate: 3e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine
```

**Gemma 2 9B (creative)**:
```yaml
learning_rate: 1e-6
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: linear
```

## Beta (β)

### Recommended Values

**Range**: 2.0 to 10.0 (much higher than DPO's 0.01-0.1)

**By preference strength**:
| Beta | Preference Strength | Use Case |
|------|-------------------|----------|
| 1.0-2.0 | Weak | Subtle preferences |
| 2.0-5.0 | **Standard** | General alignment |
| 5.0-10.0 | Strong | Clear preferences |

**Default**: 2.0 to 2.5

### Why Beta Matters

**Low beta** (< 2.0):
- Weak reward signal
- Slow preference learning
- May underfit

**High beta** (> 10.0):
- Very strong reward signal
- Risk of overfitting
- May ignore weak preferences

**Optimal** (2.0-5.0):
- Balanced reward scaling
- Stable training
- Good generalization

### Interaction with Gamma

**Beta and gamma together**:
```
Target margin in reward space = gamma
Target margin in logit space = gamma / beta
```

**Example**:
```yaml
beta: 2.0
gamma_beta_ratio: 0.5
# Effective gamma = 2.0 * 0.5 = 1.0
```

### Config Examples

**Weak preferences**:
```yaml
beta: 2.0
gamma_beta_ratio: 0.3  # Small margin
```

**Standard**:
```yaml
beta: 2.5
gamma_beta_ratio: 0.5  # Default
```

**Strong preferences**:
```yaml
beta: 5.0
gamma_beta_ratio: 0.7  # Larger margin
```

## Gamma-Beta Ratio (γ/β)

### Recommended Values

**Range**: 0.0 to 1.0

**By scenario**:
| Ratio | Margin | Use Case |
|-------|--------|----------|
| 0.0-0.3 | Small | Weak preference data |
| 0.4-0.6 | **Standard** | General use |
| 0.7-1.0 | Large | Very clear preferences |

**Default**: 0.5

### Why Gamma Matters

**Low gamma** (< 0.3):
- Small target margin
- Less aggressive alignment
- More conservative

**High gamma** (> 0.7):
- Large target margin
- Stronger alignment
- More aggressive

**Optimal** (0.4-0.6):
- Balanced margin
- Stable training
- Good alignment

### Mathematical Meaning

**In loss function**:
```python
logits = pi_logratios - gamma_beta_ratio
loss = -log(sigmoid(beta * logits))
```

**Interpretation**:
- gamma_beta_ratio shifts the decision boundary
- Higher ratio = requires larger log prob difference
- Controls how "clear" preferences must be

### Config Examples

**Noisy preferences**:
```yaml
gamma_beta_ratio: 0.3  # Smaller margin, more tolerant
```

**Standard**:
```yaml
gamma_beta_ratio: 0.5  # Default
```

**High-quality preferences**:
```yaml
gamma_beta_ratio: 0.8  # Larger margin, stricter
```

## SFT Weight

### Recommended Values

**Range**: 0.0 to 1.0

**By model type**:
| Model Type | SFT Weight | Reason |
|------------|-----------|--------|
| Base model | 0.0 | No prior capabilities |
| **Instruct model** | 0.05-0.1 | Preserve instruction following |
| Chat model | 0.1-0.2 | Preserve conversational skills |

**Default**: 0.0 (no SFT regularization)

### Why SFT Weight Matters

**Zero SFT** (0.0):
- Pure preference optimization
- May forget capabilities
- Standard for base models

**Low SFT** (0.05-0.1):
- Balanced approach
- **Recommended for instruct models**
- Slight capability preservation

**High SFT** (> 0.2):
- Strong capability preservation
- Weaker preference alignment
- May reduce alignment gains

### Trade-off

```
Total Loss = SimPO Loss + (sft_weight * SFT Loss)
```

**Example**:
```yaml
sft_weight: 0.1
# 90% preference optimization + 10% capability preservation
```

### Config Examples

**Base model (no SFT)**:
```yaml
model_name_or_path: mistralai/Mistral-7B-v0.1
sft_weight: 0.0
```

**Instruct model (light SFT)**:
```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
sft_weight: 0.1
```

**Chat model (moderate SFT)**:
```yaml
model_name_or_path: HuggingFaceH4/zephyr-7b-beta
sft_weight: 0.2
```

## Model-Size-Specific Recommendations

### 7B Models (Mistral, Llama 3)

**Standard config**:
```yaml
learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0  # 0.1 if instruct model
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
```

### 8B-13B Models

**Standard config**:
```yaml
learning_rate: 3e-7
beta: 2.5
gamma_beta_ratio: 0.5
sft_weight: 0.1  # If instruct
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
```

### 70B Models

**Standard config**:
```yaml
learning_rate: 1e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.05
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
```

## Batch Size & Gradient Accumulation

### Effective Batch Size

```
Effective Batch Size = per_device_batch_size * num_gpus * grad_accum_steps
```

**Recommended effective batch sizes**:
- 7B: 128-256
- 13B: 64-128
- 70B: 32-64

### Config Examples

**Single GPU (A100 40GB)**:
```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 128  # Effective batch = 128
```

**4 GPUs (A100 40GB)**:
```yaml
per_device_train_batch_size: 2
gradient_accumulation_steps: 16  # Effective batch = 2*4*16 = 128
```

**8 GPUs (A100 80GB)**:
```yaml
per_device_train_batch_size: 2
gradient_accumulation_steps: 8  # Effective batch = 2*8*8 = 128
```

## Loss Type

### Sigmoid vs Hinge

**Sigmoid** (default, recommended):
```yaml
loss_type: sigmoid
label_smoothing: 0.0
```

**Hinge** (experimental):
```yaml
loss_type: hinge
# No label smoothing for hinge
```

**When to use hinge**:
- Margin-based tasks
- SVM-style optimization
- Experimental purposes

**Generally**: Stick with sigmoid

## Tuning Guide

### Step 1: Start with Defaults

```yaml
learning_rate: 5e-7  # For 7B
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0  # 0.1 if instruct
loss_type: sigmoid
```

### Step 2: Monitor Training

**Check every 100 steps**:
- Loss curve (should decrease smoothly)
- Reward margin (should increase)
- Chosen/rejected logps (should separate)

### Step 3: Adjust if Needed

**If loss diverges**:
```yaml
learning_rate: 3e-7  # Reduce from 5e-7
beta: 1.0           # Reduce from 2.0
```

**If loss plateaus early**:
```yaml
learning_rate: 1e-6  # Increase from 5e-7
beta: 5.0           # Increase from 2.0
```

**If model forgets**:
```yaml
sft_weight: 0.2  # Increase from 0.0
```

## Complete Example Configs

### Mistral 7B Base (Standard)

```yaml
model_name_or_path: mistralai/Mistral-7B-v0.1
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0

learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
loss_type: sigmoid
sft_weight: 0.0

num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
warmup_ratio: 0.1
lr_scheduler_type: cosine

bf16: true
gradient_checkpointing: true
```

### Llama 3 8B Instruct (Reasoning)

```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
  argilla/distilabel-math-preference-dpo: 1.0

learning_rate: 3e-7
beta: 5.0
gamma_beta_ratio: 0.7
loss_type: sigmoid
sft_weight: 0.1

num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
warmup_ratio: 0.1
lr_scheduler_type: cosine
```

## References

- SimPO paper: https://arxiv.org/abs/2405.14734
- Alignment Handbook: https://github.com/huggingface/alignment-handbook