# DPO Variants

Complete guide to Direct Preference Optimization (DPO) loss variants in TRL.

## Overview

DPO optimizes models directly on preference data (chosen/rejected pairs), with no separately trained reward model. TRL supports 10+ loss variants for different scenarios, selected via `loss_type` in `DPOConfig`.

## Loss Types

### 1. Sigmoid (Standard DPO)

Formula: `-log(sigmoid(β * logits))`, where `logits` is the policy's chosen-over-rejected log-ratio minus the same log-ratio under the reference model.

When to use: the default choice for general preference alignment.

Config:

```python
from trl import DPOConfig

DPOConfig(
    loss_type="sigmoid",
    beta=0.1,  # KL penalty strength
    per_device_train_batch_size=64,
    learning_rate=1e-6,
)
```
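For intuition, the sigmoid loss can be computed directly from the four log-probabilities. A minimal pure-Python sketch (function and argument names are illustrative, not TRL API):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # logits = policy log-ratio minus reference log-ratio
    logits = (policy_chosen_logp - policy_rejected_logp) \
        - (ref_chosen_logp - ref_rejected_logp)
    # -log(sigmoid(beta * logits)), written via log1p for numerical stability
    return math.log1p(math.exp(-beta * logits))
```

When policy and reference agree (`logits = 0`) the loss is `log(2)`; it falls below that as soon as the policy prefers the chosen response more strongly than the reference does.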

### 2. IPO (Identity Policy Optimization)

Formula: `(logits - 1/(2β))²`

When to use: a stronger theoretical foundation than sigmoid DPO; helps reduce overfitting to the preference data.

Config:

```python
DPOConfig(
    loss_type="ipo",
    beta=0.1,
    per_device_train_batch_size=90,
    learning_rate=1e-2,
)
```
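The IPO objective is a plain squared regression of the log-ratio difference toward the target `1/(2β)`, which bounds how far the policy is pushed. A minimal sketch (names illustrative):

```python
def ipo_loss(logits, beta=0.1):
    # Squared distance between the log-ratio difference and its target 1/(2*beta)
    target = 1.0 / (2.0 * beta)
    return (logits - target) ** 2
```

Unlike the sigmoid loss, the gradient vanishes once `logits` reaches the target, rather than always rewarding a larger margin.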

### 3. Hinge (SLiC)

Formula: `ReLU(1 - β * logits)`

When to use: when a margin-based objective is preferred; the loss is zero once the scaled margin exceeds 1.

Config:

```python
DPOConfig(
    loss_type="hinge",
    beta=0.1,
    per_device_train_batch_size=512,
    learning_rate=1e-4,
)
```
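The hinge loss is a one-liner; examples that already satisfy the margin contribute nothing to the gradient (names illustrative):

```python
def hinge_loss(logits, beta=0.1):
    # ReLU(1 - beta * logits): zero loss once the scaled margin exceeds 1
    return max(0.0, 1.0 - beta * logits)
```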

### 4. Robust DPO

Formula: the sigmoid loss with label smoothing, for robustness to label noise.

When to use: noisy or unreliable preference labels.

Config:

```python
DPOConfig(
    loss_type="robust",
    beta=0.01,
    label_smoothing=0.1,  # assumed probability that a label is flipped
    per_device_train_batch_size=16,
    learning_rate=1e-3,
    max_prompt_length=128,
    max_length=512,
)
```
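The idea is to mix the loss for the given label with the loss for the flipped label, weighted by the assumed flip probability ε. A sketch of that mixing (TRL's exact normalization may differ; names illustrative):

```python
import math

def smoothed_sigmoid_loss(logits, beta=0.01, label_smoothing=0.1):
    # With probability (1 - eps) the label is correct, with probability eps flipped
    pos = math.log1p(math.exp(-beta * logits))  # -log(sigmoid(beta * logits))
    neg = math.log1p(math.exp(beta * logits))   # -log(sigmoid(-beta * logits))
    return (1 - label_smoothing) * pos + label_smoothing * neg
```

At `label_smoothing=0` this reduces to the standard sigmoid loss; at 0.5 the loss becomes symmetric in the label and carries no preference signal.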

### 5. BCO Pair (Binary Classification)

Formula: trains an implicit binary classifier (chosen=1, rejected=0).

When to use: pairwise preference data framed as binary classification.

Config:

```python
DPOConfig(
    loss_type="bco_pair",
    beta=0.01,
    per_device_train_batch_size=128,
    learning_rate=5e-7,
    max_prompt_length=1536,
    max_completion_length=512,
)
```
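Conceptually, chosen rewards are classified as positive and rejected rewards as negative, relative to a reference point. TRL maintains that reference point (`delta`) as a running mean of rewards; the sketch below fixes it at 0 for illustration (names illustrative):

```python
import math

def bco_pair_loss(chosen_reward, rejected_reward, delta=0.0):
    # Chosen is the positive class, rejected the negative class;
    # delta is a running reward mean in TRL, fixed here for illustration
    loss_chosen = math.log1p(math.exp(-(chosen_reward - delta)))   # -log(sigmoid(r_c - delta))
    loss_rejected = math.log1p(math.exp(rejected_reward - delta))  # -log(sigmoid(-(r_r - delta)))
    return loss_chosen + loss_rejected
```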

### 6. SPPO Hard

Formula: pushes the chosen reward toward 0.5 and the rejected reward toward -0.5.

When to use: Nash-equilibrium-style self-play objectives; sparse preference data.

Config:

```python
DPOConfig(
    loss_type="sppo_hard",
    beta=0.1,
)
```
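Reading the description above literally, this is a squared regression of the two implicit rewards toward hard targets of ±0.5. A sketch of that reading, not TRL's exact implementation (which works on β-scaled log-ratios; names illustrative):

```python
def sppo_hard_loss(chosen_reward, rejected_reward):
    # Regress the chosen reward toward 0.5 and the rejected reward toward -0.5
    return (chosen_reward - 0.5) ** 2 + (rejected_reward + 0.5) ** 2
```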

### 7. DiscoPOP

Formula: Log-Ratio Modulated Loss (LRML), a blend of logistic and exponential components.

When to use: a loss discovered via automated loss-function search; try it as a drop-in alternative to sigmoid DPO.

Config:

```python
DPOConfig(
    loss_type="discopop",
    beta=0.05,
    discopop_tau=0.05,  # temperature of the modulation
    per_device_train_batch_size=64,
    learning_rate=5e-7,
)
```
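A sketch of the log-ratio modulation idea, assuming the published LRML form (a logistic and an exponential component blended by a sigmoid of the scaled logits; names illustrative):

```python
import math

def discopop_loss(logits, beta=0.05, tau=0.05):
    z = beta * logits
    logistic = math.log1p(math.exp(-z))  # standard DPO (sigmoid) component
    exponential = math.exp(-z)           # exponential component
    # Modulation weight: which component dominates depends on the log ratio itself
    m = 1.0 / (1.0 + math.exp(-z / tau))
    return (1 - m) * logistic + m * exponential
```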

### 8. APO Zero

Formula: increases the likelihood of chosen outputs while decreasing the likelihood of rejected outputs.

When to use: when the model is generally worse than its winning (chosen) outputs.

Config:

```python
DPOConfig(
    loss_type="apo_zero",
    beta=0.1,
    per_device_train_batch_size=64,
    learning_rate=2e-7,
    max_prompt_length=512,
    max_completion_length=512,
)
```
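A sketch of the two-sided APO-zero objective, assuming the common formulation in which each side gets its own sigmoid term (names illustrative):

```python
import math

def apo_zero_loss(chosen_logratio, rejected_logratio, beta=0.1):
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    # Push chosen likelihood up (loss -> 0 as sigmoid -> 1)
    loss_chosen = 1.0 - sigmoid(beta * chosen_logratio)
    # Push rejected likelihood down (loss -> 0 as sigmoid -> 0)
    loss_rejected = sigmoid(beta * rejected_logratio)
    return loss_chosen + loss_rejected
```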

### 9. APO Down

Formula: decreases the likelihood of both outputs, with stronger pressure on rejected.

When to use: when the model is already better than its winning (chosen) outputs.

Config:

```python
DPOConfig(
    loss_type="apo_down",
    beta=0.1,
    # Same remaining hyperparameters as apo_zero
)
```

### 10. AOT & AOT Pair

Formula: distributional alignment via stochastic dominance.

When to use:

- `aot_pair`: paired preference data
- `aot`: unpaired data

Config:

```python
DPOConfig(
    loss_type="aot_pair",  # or "aot" for unpaired data
    beta=0.1,
    label_smoothing=0.0,
)
```
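The distributional idea can be sketched as comparing sorted rewards quantile-against-quantile rather than example-against-example; this is an illustration of the concept under that assumption, not TRL's exact implementation (names illustrative):

```python
import math

def aot_pair_loss(chosen_rewards, rejected_rewards):
    # Sort both reward lists so quantiles are compared against quantiles,
    # then apply the standard sigmoid loss to each sorted difference
    chosen_sorted = sorted(chosen_rewards, reverse=True)
    rejected_sorted = sorted(rejected_rewards, reverse=True)
    diffs = [c - r for c, r in zip(chosen_sorted, rejected_sorted)]
    return sum(math.log1p(math.exp(-d)) for d in diffs) / len(diffs)
```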

## Multi-Loss Training

Combine multiple losses with per-loss weights:

```python
DPOConfig(
    loss_type=["sigmoid", "ipo"],
    loss_weights=[0.7, 0.3],  # weighted combination
    beta=0.1,
)
```
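Conceptually, the weights simply blend the per-example losses. A sketch combining the sigmoid and IPO formulas from above (names illustrative):

```python
import math

def combined_loss(logits, weights=(0.7, 0.3), beta=0.1):
    sigmoid_part = math.log1p(math.exp(-beta * logits))  # sigmoid loss
    ipo_part = (logits - 1.0 / (2.0 * beta)) ** 2        # IPO loss
    return weights[0] * sigmoid_part + weights[1] * ipo_part
```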

## Key Parameters

### Beta (β)

Controls how far the policy may deviate from the reference model:

- Higher (e.g. 0.5): more conservative, stays close to the reference
- Lower (e.g. 0.01): more aggressive alignment
- Default: 0.1

### Label Smoothing

For robust DPO:

- 0.0: no smoothing (default)
- 0.1-0.3: moderate noise robustness
- Approaching 0.5: maximum noise tolerance (must stay strictly below 0.5, where labels would carry no information)

### Max Lengths

- `max_prompt_length`: 128-1536 tokens, depending on the task
- `max_completion_length`: 128-512 tokens
- `max_length`: total sequence length (1024-2048 tokens)

## Comparison Table

| Loss     | Speed  | Stability | Best For               |
|----------|--------|-----------|------------------------|
| Sigmoid  | Fast   | Good      | General use            |
| IPO      | Fast   | Better    | Overfitting issues     |
| Hinge    | Fast   | Good      | Margin objectives      |
| Robust   | Fast   | Best      | Noisy data             |
| BCO      | Medium | Good      | Binary classification  |
| DiscoPOP | Fast   | Good      | New architectures      |
| APO      | Fast   | Good      | Model quality matching |
