# DPO Variants
Complete guide to Direct Preference Optimization loss variants in TRL.
## Overview
DPO optimizes models using preference data (chosen/rejected pairs). TRL supports 10+ loss variants for different scenarios.
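To show where these variants plug in, here is a minimal end-to-end sketch. The model and dataset names are illustrative placeholders (any preference dataset with `chosen`/`rejected` columns works), and passing the tokenizer as `processing_class` assumes a recent TRL version:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Illustrative choices; swap in your own model and preference dataset.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(loss_type="sigmoid", beta=0.1, output_dir="dpo-output")
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```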
## Loss Types
### 1. Sigmoid (Standard DPO)

**Formula:** `-log(sigmoid(β * logits))`

**When to use:** Default choice for general preference alignment.

**Config** (the remaining examples assume the same import):

```python
from trl import DPOConfig

DPOConfig(
    loss_type="sigmoid",
    beta=0.1,  # KL penalty strength
    per_device_train_batch_size=64,
    learning_rate=1e-6,
)
```
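To make the formula concrete, here is a toy plain-PyTorch sketch of how the sigmoid loss is computed from per-sequence log-probabilities; `logits` is the policy log-ratio minus the reference log-ratio:

```python
import torch
import torch.nn.functional as F

beta = 0.1
# Toy per-example summed log-probs under the policy and the frozen reference.
policy_chosen, policy_rejected = torch.tensor([-45.0]), torch.tensor([-52.0])
ref_chosen, ref_rejected = torch.tensor([-47.0]), torch.tensor([-50.0])

# "logits" = policy log-ratio minus reference log-ratio.
logits = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
loss = -F.logsigmoid(beta * logits)  # -log(sigmoid(β * logits))
print(loss)  # below log 2: the policy prefers chosen more than the reference does
```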
### 2. IPO (Identity Preference Optimization)

**Formula:** `(logits - 1/(2β))²`

**When to use:** When DPO overfits the preference data; IPO has a stronger theoretical grounding and regresses logits toward a fixed margin instead of pushing them apart indefinitely.

**Config:**

```python
DPOConfig(
    loss_type="ipo",
    beta=0.1,
    per_device_train_batch_size=90,
    learning_rate=1e-2,
)
```
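A quick numeric check of the IPO objective on toy values: the loss vanishes exactly when the log-ratio gap hits the target margin `1/(2β)`:

```python
import torch

beta = 0.1  # target margin 1/(2β) = 5.0
logits = torch.tensor([2.0, 5.0, 8.0])  # toy log-ratio gaps
ipo_loss = (logits - 1 / (2 * beta)) ** 2
print(ipo_loss)  # tensor([9., 0., 9.]) — zero only at the target margin
```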
### 3. Hinge (SLiC)

**Formula:** `ReLU(1 - β * logits)`

**When to use:** When a hard margin-based objective is preferred over a smooth log-loss.

**Config:**

```python
DPOConfig(
    loss_type="hinge",
    beta=0.1,
    per_device_train_batch_size=512,
    learning_rate=1e-4,
)
```
### 4. Robust DPO

**Formula:** Sigmoid loss with label smoothing for robustness to label noise.

**When to use:** Noisy preference labels.

**Config:**

```python
DPOConfig(
    loss_type="robust",
    beta=0.01,
    label_smoothing=0.1,  # estimated probability of a flipped label
    per_device_train_batch_size=16,
    learning_rate=1e-3,
    max_prompt_length=128,
    max_length=512,
)
```
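For intuition, a hedged sketch of the label-smoothed estimator, following the Provably Robust DPO formulation (the exact TRL internals may differ slightly):

```python
import torch
import torch.nn.functional as F

beta, eps = 0.01, 0.1  # eps: assumed probability that a preference label is flipped
logits = torch.tensor([3.0])  # toy policy-vs-reference log-ratio gap

# Debiased loss: down-weight the observed label, subtract the flipped-label
# loss, and rescale by 1/(1 - 2ε).
loss = (
    -F.logsigmoid(beta * logits) * (1 - eps)
    + F.logsigmoid(-beta * logits) * eps
) / (1 - 2 * eps)
print(loss)
```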
### 5. BCO Pair (Binary Classification)

**Formula:** Trains a binary classifier (chosen → 1, rejected → 0) whose logit serves as the reward.

**When to use:** Pairwise preference data.

**Config:**

```python
DPOConfig(
    loss_type="bco_pair",
    beta=0.01,
    per_device_train_batch_size=128,
    learning_rate=5e-7,
    max_prompt_length=1536,
    max_completion_length=512,
)
```
### 6. SPPO Hard

**Formula:** Push the chosen reward toward +1/2 and the rejected reward toward −1/2.

**When to use:** Approximating a Nash equilibrium; sparse preference data.

**Config:**

```python
DPOConfig(
    loss_type="sppo_hard",
    beta=0.1,
)
```
### 7. DiscoPOP

**Formula:** Log-Ratio Modulated Loss (LRML).

**When to use:** As a general-purpose alternative to sigmoid; this loss was itself discovered via automated loss search.

**Config:**

```python
DPOConfig(
    loss_type="discopop",
    beta=0.05,
    discopop_tau=0.05,
    per_device_train_batch_size=64,
    learning_rate=5e-7,
)
```
### 8. APO Zero

**Formula:** Increase the likelihood of chosen outputs while decreasing the likelihood of rejected ones.

**When to use:** When the model is worse than the winning (chosen) outputs.

**Config:**

```python
DPOConfig(
    loss_type="apo_zero",
    beta=0.1,
    per_device_train_batch_size=64,
    learning_rate=2e-7,
    max_prompt_length=512,
    max_completion_length=512,
)
```
### 9. APO Down

**Formula:** Decrease the likelihood of both outputs, with a stronger push down on rejected ones.

**When to use:** When the model is already better than the winning outputs.

**Config:**

```python
DPOConfig(
    loss_type="apo_down",
    beta=0.1,
    # same remaining hyperparameters as apo_zero
)
```
### 10. AOT & AOT Pair

**Formula:** Distributional alignment via stochastic dominance.

**When to use:**

- `aot_pair`: paired preference data
- `aot`: unpaired data

**Config:**

```python
DPOConfig(
    loss_type="aot_pair",  # or "aot"
    beta=0.1,
    label_smoothing=0.0,
)
```
## Multi-Loss Training

Combine multiple loss types; TRL applies them as a weighted sum (sketched below):

```python
DPOConfig(
    loss_type=["sigmoid", "ipo"],
    loss_weights=[0.7, 0.3],  # weighted combination
    beta=0.1,
)
```
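Conceptually, the weights just scale each variant's per-example loss before summing; a toy sketch (not TRL internals):

```python
import torch
import torch.nn.functional as F

beta = 0.1
logits = torch.tensor([2.0, -1.0])  # toy log-ratio gaps

sigmoid_loss = -F.logsigmoid(beta * logits)     # standard DPO term
ipo_loss = (logits - 1 / (2 * beta)) ** 2       # IPO term
combined = 0.7 * sigmoid_loss + 0.3 * ipo_loss  # matches loss_weights=[0.7, 0.3]
print(combined.mean())
```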
## Key Parameters

### Beta (β)

Controls how far the policy may deviate from the reference model (see the sketch after this list):

- Higher (0.5): more conservative, stays close to the reference
- Lower (0.01): more aggressive alignment
- Default: 0.1
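A toy illustration of why higher β is more conservative: for the same policy-vs-reference log-ratio gap, a larger β drives the loss toward zero much sooner, so the policy does not need to move far from the reference:

```python
import torch
import torch.nn.functional as F

gap = torch.tensor([5.0])  # fixed policy-vs-reference log-ratio gap
for beta in (0.01, 0.1, 0.5):
    loss = -F.logsigmoid(beta * gap)
    print(f"beta={beta}: loss={loss.item():.3f}")
# beta=0.01: loss≈0.668 — nearly log 2; the policy must drift far to reduce it
# beta=0.5:  loss≈0.079 — already near zero; a small gap from the reference suffices
```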
### Label Smoothing

For robust DPO:
- 0.0: No smoothing (default)
- 0.1-0.3: Moderate noise robustness
- 0.5: Maximum noise tolerance
### Max Lengths

- `max_prompt_length`: 128–1536
- `max_completion_length`: 128–512
- `max_length`: total sequence length (1024–2048)
## Comparison Table
| Loss | Speed | Stability | Best For |
|---|---|---|---|
| Sigmoid | Fast | Good | General use |
| IPO | Fast | Better | Overfitting issues |
| Hinge | Fast | Good | Margin objectives |
| Robust | Fast | Best | Noisy data |
| BCO | Medium | Good | Binary classification |
| DiscoPOP | Fast | Good | New architectures |
| APO | Fast | Good | Model quality matching |
## References
- DPO paper: https://arxiv.org/abs/2305.18290
- IPO paper: https://arxiv.org/abs/2310.12036
- TRL docs: https://huggingface.co/docs/trl/dpo_trainer