# timmy-config/playbooks/dpo-trainer.yaml
# 2026-03-25 19:35:58 -04:00
name: dpo-trainer
description: >
  Executes a Direct Preference Optimization (DPO) training cycle
  using native mlx_lm and autolora eval utilities.
  Ensures model alignment with SOUL.md constraints.
model:
  preferred: claude-opus-4-6
  fallback: claude-sonnet-4-20250514
max_turns: 20
temperature: 0.1
tools:
  - terminal
  - file
  - search_files
trigger:
  issue_label: training
  manual: true
repos:
  - Timmy_Foundation/timmy-config
  - Timmy_Foundation/autolora
steps:
  - read_issue
  - clone_repo
  - run_pre_eval
  - execute_mlx_dpo
  - fuse_adapters
  - run_post_eval
  - verify_metrics
  - comment_on_issue
output: training_report
timeout_minutes: 120
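# The `preference_pairs.jsonl` dataset consumed by the DPO step is assumed to
# follow the common DPO preference-pair record shape (one JSON object per line).
# These field names are an assumption for illustration, not confirmed by this
# playbook:
#
#   {"prompt": "...", "chosen": "preferred completion", "rejected": "dispreferred completion"}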
system_prompt: |
  You are the automated training orchestrator for Timmy's brain.
  YOUR ISSUE: #{{issue_number}} — {{issue_title}}

  APPROACH (zero-code native):
  1. Run the baseline eval: `python autolora/eval/run_cycle.py --model {base_model} --test-set autolora/data/prompts_vibes.yaml`
  2. Execute DPO training: `python -m mlx_lm.lora --config timmy-config/training/configs/dpo_X.yaml` against `preference_pairs.jsonl`
  3. Fuse the adapter weights into the base model using `mlx_lm.fuse`.
  4. Run the post-eval exactly as in step 1, but against the fused model.
  5. Use `autolora/eval/compare.py` to verify that the Faith/Crisis constraints from SOUL.md were preserved or improved.

  RULES:
  - Do not write wrapper Python or Bash scripts. Use the CLIs natively.
  - If the post-eval degrades on 'crisis' or 'pastoral_care', REJECT the adapter and fail the issue.
  - Always include the pre/post comparison metrics in the issue comment.
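# A minimal sketch of the training config passed to `mlx_lm.lora` in step 2 of
# the system prompt. Key names follow mlx_lm's published lora config
# conventions; the model path and all hyperparameter values below are
# illustrative assumptions only, not taken from this repo:
#
#   model: <base-model-path>            # placeholder; set to the actual base model
#   train: true
#   data: timmy-config/training/data    # directory holding the JSONL training data
#   fine_tune_type: lora
#   batch_size: 4
#   iters: 1000
#   learning_rate: 1e-5
#   adapter_path: adapters/             # where trained adapter weights are written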