feat: Phase 3.5 — DPO training pair generation from Deep Dive pipeline #1347
Summary
Wires the arXiv relevance filter in the Deep Dive pipeline to output DPO training pairs directly, closing the loop between research synthesis and overnight training data.
Changes
New:
- `intelligence/deepdive/dpo_generator.py`: new `DPOPairGenerator` class with 3 pair strategies:
  - summarize: paper → fleet-grounded analysis (chosen) vs generic (rejected)
  - relevance: "what matters to Hermes?" → scored context vs vague
  - implication: "what should we do?" → actionable insight vs platitude
- Extracts synthesis excerpts matched to each ranked item
- Outputs to `~/.timmy/training-data/dpo-pairs/deepdive_{timestamp}.jsonl`
- Pair format: `{prompt, chosen, rejected, task_type, evidence_ids, source_session, safety_flags, metadata}`

Modified:

- `intelligence/deepdive/pipeline.py`: imports `DPOPairGenerator` with graceful degradation (`HAS_DPO_GENERATOR` flag), initializes it from the `deepdive.training.dpo` config section, runs it as Phase 3.5 between synthesis and audio, and includes its results in the pipeline return dict under `result["dpo"]`; the phase is wrapped in try/except so a DPO failure never blocks delivery
- `intelligence/deepdive/config.yaml`: new `deepdive.training.dpo` section with `enabled`, `output_dir`, `min_score`, `max_pairs_per_run`, `pair_types`

Testing
Integration tested with mock data: 2 ranked items × 3 pair types = 6 valid JSONL pairs. Chosen responses were consistently richer than rejected ones (assert-verified).
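The mock-data check can be sketched roughly as follows. Note that `make_pairs`, the mock items, and the pair contents are hypothetical stand-ins; the real `DPOPairGenerator` is not reproduced here, only the 2 × 3 = 6 JSONL shape and the chosen-richer-than-rejected assertion:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

PAIR_TYPES = ["summarize", "relevance", "implication"]

def make_pairs(item: dict) -> list[dict]:
    """Hypothetical stand-in for DPOPairGenerator: one pair per strategy."""
    pairs = []
    for task_type in PAIR_TYPES:
        pairs.append({
            "prompt": f"[{task_type}] {item['title']}",
            "chosen": f"Fleet-grounded analysis of {item['title']}: {item['excerpt']}",
            "rejected": "This paper discusses some interesting ideas.",
            "task_type": task_type,
            "evidence_ids": item["evidence_ids"],
            "source_session": "mock-session",
            "safety_flags": [],
            "metadata": {"score": item["score"]},
        })
    return pairs

mock_items = [
    {"title": "Paper A", "excerpt": "relevant to routing", "evidence_ids": ["a1"], "score": 0.9},
    {"title": "Paper B", "excerpt": "relevant to scheduling", "evidence_ids": ["b1"], "score": 0.8},
]

with TemporaryDirectory() as tmp:
    out = Path(tmp) / "deepdive_mock.jsonl"
    with out.open("w") as f:
        for item in mock_items:
            for pair in make_pairs(item):
                f.write(json.dumps(pair) + "\n")
    lines = out.read_text().splitlines()
    assert len(lines) == 6  # 2 ranked items x 3 pair types
    assert all(len(json.loads(l)["chosen"]) > len(json.loads(l)["rejected"]) for l in lines)
```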
Pipeline Flow
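Roughly, the new phase slots in like this. This is a sketch, not the actual `pipeline.py`; the stub `run_pipeline` and the config values are illustrative assumptions, while the config keys mirror the new `deepdive.training.dpo` section:

```python
# Config keys mirror deepdive.training.dpo in config.yaml (values illustrative).
DPO_CONFIG = {
    "enabled": True,
    "output_dir": "~/.timmy/training-data/dpo-pairs/",
    "min_score": 0.5,
    "max_pairs_per_run": 50,
    "pair_types": ["summarize", "relevance", "implication"],
}

try:
    from intelligence.deepdive.dpo_generator import DPOPairGenerator
    HAS_DPO_GENERATOR = True
except ImportError:  # graceful degradation: pipeline still runs without DPO
    HAS_DPO_GENERATOR = False

def run_pipeline(ranked_items, synthesis):
    result = {"synthesis": synthesis, "dpo": None}
    # Phase 3.5: DPO pair generation, between synthesis and audio.
    if HAS_DPO_GENERATOR and DPO_CONFIG["enabled"]:
        try:
            gen = DPOPairGenerator(DPO_CONFIG)
            result["dpo"] = gen.generate(ranked_items, synthesis)
        except Exception as exc:
            # A DPO failure is recorded but never blocks delivery.
            result["dpo"] = {"error": str(exc)}
    result["audio"] = "rendered"  # Phase 4 (audio) always runs
    return result
```

The point of the try/except and the `HAS_DPO_GENERATOR` flag is that the deep dive is delivered (audio included) whether or not pair generation succeeds.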
Overnight Training Integration
The DPO pairs land in `~/.timmy/training-data/dpo-pairs/`, where the overnight R&D task (timmy-config #503) picks them up for the tightening loop.
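The pickup side could look something like this. A sketch only: `load_dpo_pairs` is a hypothetical helper, not code from timmy-config #503; it assumes the generator's filename pattern and the `safety_flags` field described above:

```python
import json
from pathlib import Path

def load_dpo_pairs(pairs_dir: str) -> list[dict]:
    """Collect Deep Dive DPO pairs for an overnight training run."""
    pairs = []
    # Matches the deepdive_{timestamp}.jsonl naming used by the generator.
    for path in sorted(Path(pairs_dir).expanduser().glob("deepdive_*.jsonl")):
        with path.open() as f:
            for line in f:
                record = json.loads(line)
                # Skip anything the generator flagged at creation time.
                if record.get("safety_flags"):
                    continue
                pairs.append(record)
    return pairs
```

Because `glob` on a missing directory yields nothing, the overnight task degrades to an empty pair list rather than failing when no deep dive ran that day.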