feat: DPO pair quality validator — gate before overnight training
Some checks failed
CI / test (pull_request) Failing after 20s
CI / validate (pull_request) Failing after 16s
Review Approval Gate / verify-review (pull_request) Failing after 2s

Add DPOQualityValidator that catches bad training pairs before they
enter the tightening loop. Wired into DPOPairGenerator between
generate() and export() as an automatic quality gate.

New module: dpo_quality.py
- 5 single-pair quality checks:
  1. Field length minimums (prompt ≥40, chosen ≥80, rejected ≥30 chars)
  2. Chosen/rejected length ratio (chosen must be ≥1.3x longer)
  3. Chosen≈rejected similarity (Jaccard ≤0.70 — catches low-contrast)
  4. Vocabulary diversity in chosen (unique word ratio ≥0.30)
  5. Substance markers in chosen (≥2 fleet/training/action terms)
- 2 cross-pair quality checks:
  6. Near-duplicate prompts within batch (Jaccard ≤0.85)
  7. Cross-run dedup against recent JSONL history files
- Two modes: 'drop' (filter out bad pairs) or 'flag' (export with warning)
- BatchReport with per-pair diagnostics, pass rates, and warnings
- Standalone CLI: python3 dpo_quality.py <file.jsonl> [--strict] [--json]

Modified: dpo_generator.py
- Imports DPOQualityValidator with graceful degradation
- Initializes from config validation section (enabled by default)
- Validates between generate() and export() in run()
- Quality report included in pipeline result dict
- Validator failure never blocks — falls back to unvalidated export

Modified: config.yaml
- New deepdive.training.dpo.validation section with all tunable knobs:
  enabled, flagged_pair_action, similarity thresholds, length minimums,
  dedup_history_files

Integration tested — 6 test cases covering:
  ✓ Good pairs pass (3/3 accepted)
  ✓ Bad pairs caught: too-short, high-similarity, inverted signal (0/3)
  ✓ Near-duplicate prompt detection (1/2 deduped)
  ✓ Flag mode preserves pairs with warnings (3/3 flagged)
  ✓ Cross-run deduplication against history (1 dupe caught)
  ✓ Full generator→validator→export pipeline (6/6 validated)
This commit is contained in:
perplexity
2026-04-13 02:46:50 +00:00
parent c19000de03
commit bb4922adeb
3 changed files with 565 additions and 3 deletions

View File

@@ -99,6 +99,16 @@ deepdive:
- "summarize" # Paper summary → fleet-grounded analysis
- "relevance" # Relevance analysis → scored fleet context
- "implication" # Implications → actionable insight
validation:
enabled: true
flagged_pair_action: "drop" # "drop" = remove bad pairs, "flag" = export with warning
min_prompt_chars: 40 # Minimum prompt length
min_chosen_chars: 80 # Minimum chosen response length
min_rejected_chars: 30 # Minimum rejected response length
min_chosen_rejected_ratio: 1.3 # Chosen must be ≥1.3x longer than rejected
max_chosen_rejected_similarity: 0.70 # Max Jaccard overlap between chosen/rejected
max_prompt_prompt_similarity: 0.85 # Max Jaccard overlap between prompts (dedup)
dedup_history_files: 5 # How many recent JSONL files to scan for cross-run dedup
# Phase 0: Fleet Context Grounding
fleet_context: