Files
hermes-config/skills/mlops/trl-fine-tuning/references/reward-modeling.md
Alexander Whitestone 11cc14d707 init: Hermes config, skills, memories, cron
Sovereign backup of all Hermes Agent configuration and data.
Excludes: secrets, auth tokens, sessions, caches, code (separate repo).

Tracked:
- config.yaml (model, fallback chain, toolsets, display prefs)
- SOUL.md (Timmy personality charter)
- memories/ (persistent MEMORY.md + USER.md)
- skills/ (371 files — full skill library)
- cron/jobs.json (scheduled tasks)
- channel_directory.json (platform channels)
- hooks/ (custom hooks)
2026-03-14 14:42:33 -04:00

2.5 KiB

Reward Modeling

Guide to training reward models with TRL for RLHF pipelines.

Overview

Reward models score completions based on human preferences. Used in:

  • PPO training (RL feedback)
  • GRPO online RL
  • Completion ranking

Basic Training

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

# Load model (num_labels=1 for single reward score)
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Load preference dataset (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure
config = RewardConfig(
    output_dir="Qwen2.5-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)

# Train
trainer = RewardTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()

Dataset Format

Required fields:

{
  "prompt": "Question or instruction",
  "chosen": "Better response",
  "rejected": "Worse response"
}

Bradley-Terry Loss

Default loss function:

loss = -log(sigmoid(reward_chosen - reward_rejected))

Learns to score chosen > rejected.

Using Reward Models

Inference

from transformers import pipeline

# Load trained reward model
reward_pipe = pipeline("text-classification", model="Qwen2.5-Reward")

# Score completions
texts = ["Good answer", "Bad answer"]
scores = reward_pipe(texts)
print(scores)  # Higher score = better

In PPO

from trl import PPOTrainer, PPOConfig

config = PPOConfig(
    reward_model_path="Qwen2.5-Reward"  # Use trained reward model
)

trainer = PPOTrainer(
    model=policy_model,
    config=config,
    # Reward model loaded automatically
)

Hyperparameters

Model Size Learning Rate Batch Size Epochs
<1B 2e-5 4-8 1-2
1-7B 1e-5 2-4 1
7-13B 5e-6 1-2 1

Evaluation

Check reward separation:

# Chosen should score higher than rejected
chosen_rewards = model(**chosen_inputs).logits
rejected_rewards = model(**rejected_inputs).logits

accuracy = (chosen_rewards > rejected_rewards).float().mean()
print(f"Accuracy: {accuracy:.2%}")  # Target: >80%

References