feat: full-history persistent dedup index for DPO training pairs

Replace the 5-file sliding window cross-run dedup with a persistent hash
index that covers ALL historical training data. Overfitting risk compounds
across the full dataset: a 5-file window lets old duplicates leak back into
training after enough overnight runs.

New module: dedup_index.py (DedupIndex)
- Persistent JSON index (.dpo_dedup_index.json) alongside JSONL files
- Append-on-export: new prompt hashes registered after each successful
  export, so no full rescan is needed for normal operations
- Incremental sync: on load, detects JSONL files not yet indexed and
  ingests them automatically (handles files from other tools)
- Full rebuild: rebuild() scans ALL deepdive_*.jsonl + pairs_*.jsonl to
  reconstruct from scratch (first run, corruption recovery)
- Atomic writes (write-to-tmp + rename) to prevent index corruption
- Standalone CLI: python3 dedup_index.py <dir> --rebuild --stats

Modified: dpo_quality.py
- Imports DedupIndex with graceful degradation
- Replaces _load_history_hashes() with persistent index lookup
- Fallback: if index unavailable, scans ALL files in-memory (not just 5)
- New register_exported_hashes() method called after export
- Config key: dedup_full_history (replaces dedup_history_files)

Modified: dpo_generator.py
- Calls validator.register_exported_hashes() after successful export to
  keep the persistent index current without rescanning

Modified: config.yaml
- Replaced dedup_history_files: 5 with dedup_full_history: true

Tested: 7 integration tests
✓ Fresh index build from empty directory
✓ Build from 3 existing JSONL files (15 unique hashes)
✓ Incremental sync when new file appears between runs
✓ Append after export + persistence across reloads
✓ Rebuild from scratch (recovers from corruption)
✓ Validator catches day-1 dupe from 20-day history (5-file window miss)
✓ Full pipeline: generate → validate → export → register → re-run detects
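
For readers without dedup_index.py at hand, the load / incremental sync / rebuild / atomic-write cycle described above could look roughly like the sketch below. It is not the module's real API. The class name DedupIndexSketch, the method names register() and contains(), the index layout ({"files": [...], "hashes": [...]}), the "prompt" field in each JSONL record, and the SHA-256 hashing of a normalized prompt are all assumptions made for illustration. Only the index file name and the two glob patterns come from the commit message.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path

INDEX_NAME = ".dpo_dedup_index.json"  # file name from the commit message
PATTERNS = ("deepdive_*.jsonl", "pairs_*.jsonl")  # globs from the commit message


def prompt_hash(prompt):
    """Hash a case- and whitespace-normalized prompt (normalization is assumed)."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DedupIndexSketch:
    """Illustrative stand-in for DedupIndex; names and layout are hypothetical."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.path = self.directory / INDEX_NAME
        self.hashes, self.files = set(), set()
        self._load()
        self.sync()  # pick up JSONL files written since the last save

    def _load(self):
        try:
            data = json.loads(self.path.read_text(encoding="utf-8"))
            self.hashes, self.files = set(data["hashes"]), set(data["files"])
        except (OSError, ValueError, KeyError, TypeError):
            # Missing or corrupt index: start empty, sync() re-ingests everything.
            self.hashes, self.files = set(), set()

    def _jsonl_files(self):
        found = []
        for pattern in PATTERNS:
            found.extend(self.directory.glob(pattern))
        return sorted(found)

    def _ingest(self, jsonl_path):
        with open(jsonl_path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except ValueError:
                    continue  # skip blank or malformed lines
                if isinstance(record, dict) and isinstance(record.get("prompt"), str):
                    self.hashes.add(prompt_hash(record["prompt"]))
        self.files.add(jsonl_path.name)

    def sync(self):
        """Ingest only files not yet recorded in the index; return how many."""
        new_files = [p for p in self._jsonl_files() if p.name not in self.files]
        for path in new_files:
            self._ingest(path)
        if new_files:
            self._save()
        return len(new_files)

    def rebuild(self):
        """Discard the in-memory state and rescan every matching file."""
        self.hashes, self.files = set(), set()
        for path in self._jsonl_files():
            self._ingest(path)
        self._save()

    def register(self, prompts, filename):
        """Append-on-export: record new prompts without rescanning anything."""
        self.hashes.update(prompt_hash(p) for p in prompts)
        self.files.add(filename)
        self._save()

    def contains(self, prompt):
        return prompt_hash(prompt) in self.hashes

    def _save(self):
        """Write to a temp file in the same directory, then rename over the index."""
        payload = {"files": sorted(self.files), "hashes": sorted(self.hashes)}
        fd, tmp = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(payload, fh)
                fh.flush()
                os.fsync(fh.fileno())
            os.replace(tmp, self.path)  # atomic when both paths share a filesystem
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
```

Keeping the temp file in the same directory as the index matters here: the rename only stays atomic within one filesystem, so a reader sees either the old index or the new one, never a partial write.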
@@ -108,7 +108,7 @@ deepdive:
 min_chosen_rejected_ratio: 1.3 # Chosen must be ≥1.3x longer than rejected
 max_chosen_rejected_similarity: 0.70 # Max Jaccard overlap between chosen/rejected
 max_prompt_prompt_similarity: 0.85 # Max Jaccard overlap between prompts (dedup)
-dedup_history_files: 5 # How many recent JSONL files to scan for cross-run dedup
+dedup_full_history: true # Persistent index covers ALL historical JSONL (no sliding window)
 
 # Phase 0: Fleet Context Grounding
 fleet_context: