feat: Phase 3.5 — DPO training pair generation from Deep Dive pipeline

Wire arXiv relevance filter output directly into DPO pair generation,
closing the loop between research synthesis and overnight training data.

New module: dpo_generator.py
- DPOPairGenerator class with 3 pair strategies:
  * summarize: paper → fleet-grounded analysis (chosen) vs generic (rejected)
  * relevance: 'what matters to Hermes?' → scored context vs vague
  * implication: 'what should we do?' → actionable insight vs platitude
- Extracts synthesis excerpts matched to each ranked item
- Outputs to ~/.timmy/training-data/dpo-pairs/deepdive_{timestamp}.jsonl
- Format: {prompt, chosen, rejected, task_type, evidence_ids,
  source_session, safety_flags, metadata}

Pipeline changes (pipeline.py):
- Import DPOPairGenerator with graceful degradation
- Initialize from config deepdive.training.dpo section
- Execute as Phase 3.5 between synthesis and audio
- DPO results included in pipeline return dict
- Wrapped in try/except: DPO failure never blocks delivery

Config changes (config.yaml):
- New deepdive.training.dpo section with: enabled, output_dir,
  min_score, max_pairs_per_run, pair_types

Integration tested: 2 mock items × 3 pair types = 6 valid JSONL pairs.
Chosen responses consistently richer than rejected (assert-verified).
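
For readers without dpo_generator.py at hand, the record format and the "failure never blocks delivery" rule described above can be sketched roughly as follows. This is not the module's actual code. Only the eight field names and the deepdive_{timestamp}.jsonl file name pattern come from the commit message. The function names (build_pair, write_pairs, run_dpo_phase), the timestamp layout, the types of safety_flags and metadata, and the shape of the returned status dict are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_pair(prompt, chosen, rejected, task_type, evidence_ids, source_session):
    """Assemble one DPO record with the eight fields named in the commit message."""
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "task_type": task_type,            # e.g. "summarize", "relevance", "implication"
        "evidence_ids": list(evidence_ids),
        "source_session": source_session,
        "safety_flags": [],                # assumed to be a list; the real type is not stated
        "metadata": {},                    # assumed to be a free-form dict
    }


def write_pairs(pairs, output_dir):
    """Write one JSON object per line to deepdive_<timestamp>.jsonl and return the path."""
    out_dir = Path(output_dir).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)
    # The timestamp layout is an assumption; the commit only shows {timestamp}.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = out_dir / f"deepdive_{stamp}.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for pair in pairs:
            fh.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return path


def run_dpo_phase(pairs, output_dir):
    """Best-effort wrapper: report failure in the result instead of raising."""
    try:
        path = write_pairs(pairs, output_dir)
        return {"status": "ok", "pairs": len(pairs), "path": str(path)}
    except Exception as exc:  # deliberately broad so later phases still run
        return {"status": "failed", "error": str(exc)}
```

Returning a status dict rather than raising lets the caller fold the DPO outcome into the pipeline's return value and continue to the audio phase either way.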
2026-04-13 02:24:04 +00:00
parent 61c24c390b
commit c3b455bd9c
3 changed files with 426 additions and 2 deletions
--- a/intelligence/deepdive/config.yaml
+++ b/intelligence/deepdive/config.yaml
@@ -88,6 +88,18 @@ deepdive:
    speed: 1.0
    output_format: "mp3"  # piper outputs WAV, convert for Telegram

+  # Phase 3.5: DPO Training Pair Generation
+  training:
+    dpo:
+      enabled: true
+      output_dir: "~/.timmy/training-data/dpo-pairs"
+      min_score: 0.5        # Only generate pairs from items above this relevance score
+      max_pairs_per_run: 30  # Cap pairs per pipeline execution
+      pair_types:            # Which pair strategies to use
+        - "summarize"        # Paper summary → fleet-grounded analysis
+        - "relevance"        # Relevance analysis → scored fleet context
+        - "implication"      # Implications → actionable insight
+
  # Phase 0: Fleet Context Grounding
  fleet_context:
    enabled: true
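
As a rough illustration of how the deepdive.training.dpo settings in the hunk above might interact, the sketch below mirrors the section as a plain dict and plans (item, pair type) combinations from it. The plan_pairs helper and the (item_id, score) tuple shape are hypothetical and not taken from dpo_generator.py. The commit also does not say whether min_score is inclusive, so the strict comparison here follows the "above this relevance score" comment and is an assumption.

```python
# Mirrors the deepdive.training.dpo section as a plain dict (no YAML parser needed).
DPO_CONFIG = {
    "enabled": True,
    "output_dir": "~/.timmy/training-data/dpo-pairs",
    "min_score": 0.5,
    "max_pairs_per_run": 30,
    "pair_types": ["summarize", "relevance", "implication"],
}


def plan_pairs(ranked_items, config):
    """Return the (item_id, pair_type) combinations the config allows.

    ranked_items is a hypothetical list of (item_id, relevance_score) tuples.
    """
    if not config.get("enabled", False):
        return []
    # Keep only items strictly above the threshold (assumed reading of "above").
    eligible = [item_id for item_id, score in ranked_items if score > config["min_score"]]
    # One candidate pair per eligible item and configured strategy.
    planned = [(item_id, pair_type)
               for item_id in eligible
               for pair_type in config["pair_types"]]
    # Cap the total for this pipeline run.
    return planned[: config["max_pairs_per_run"]]
```

With two eligible items and the three configured pair types this plans six pairs, matching the 2 × 3 = 6 count reported in the commit message's integration test.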