# timmy-config/playbooks/dpo-trainer.yaml
# 2026-03-25 19:35:58 -04:00
name: dpo-trainer
description: >
  Executes a Direct Preference Optimization (DPO) training cycle
  using native mlx_lm and autolora eval utilities.
  Ensures model alignment with SOUL.md constraints.
model:
  preferred: claude-opus-4-6
  fallback: claude-sonnet-4-20250514
max_turns: 20
temperature: 0.1
tools:
  - terminal
  - file
  - search_files
trigger:
  issue_label: training
  manual: true
repos:
  - Timmy_Foundation/timmy-config
  - Timmy_Foundation/autolora
steps:
  - read_issue
  - clone_repo
  - run_pre_eval
  - execute_mlx_dpo
  - fuse_adapters
  - run_post_eval
  - verify_metrics
  - comment_on_issue
output: training_report
timeout_minutes: 120
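# The `preference_pairs.jsonl` dataset consumed by the DPO step is assumed to
# follow the common DPO preference-pair record shape (one JSON object per line).
# These field names are an assumption for illustration, not confirmed by this
# playbook:
#
#   {"prompt": "...", "chosen": "preferred completion", "rejected": "dispreferred completion"}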
system_prompt: |
  You are the automated training orchestrator for Timmy's brain.
  YOUR ISSUE: #{{issue_number}} — {{issue_title}}

  APPROACH (zero-code native):
  1. Run the baseline eval: `python autolora/eval/run_cycle.py --model {base_model} --test-set autolora/data/prompts_vibes.yaml`
  2. Execute DPO training: `python -m mlx_lm.lora --config timmy-config/training/configs/dpo_X.yaml` against `preference_pairs.jsonl`
  3. Fuse the adapter weights into the base model using `mlx_lm.fuse`.
  4. Run the post-eval exactly as in step 1, but against the fused model.
  5. Use `autolora/eval/compare.py` to verify that the Faith/Crisis constraints from SOUL.md were preserved or improved.

  RULES:
  - Do not write wrapper Python or Bash scripts. Use the CLIs natively.
  - If the post-eval degrades on 'crisis' or 'pastoral_care', REJECT the adapter and fail the issue.
  - Always include the pre/post comparison metrics in the issue comment.
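# A minimal sketch of the training config passed to `mlx_lm.lora` in step 2 of
# the system prompt. Key names follow mlx_lm's published lora config
# conventions; the model path and all hyperparameter values below are
# illustrative assumptions only, not taken from this repo:
#
#   model: <base-model-path>            # placeholder; set to the actual base model
#   train: true
#   data: timmy-config/training/data    # directory holding the JSONL training data
#   fine_tune_type: lora
#   batch_size: 4
#   iters: 1000
#   learning_rate: 1e-5
#   adapter_path: adapters/             # where trained adapter weights are written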