feat: full-history persistent dedup index for DPO training pairs · 4b15cf8283 - the-nexus

Some checks failed

CI / test (pull_request) Failing after 16s

CI / validate (pull_request) Failing after 14s

Review Approval Gate / verify-review (pull_request) Failing after 3s

Replace the 5-file sliding-window cross-run dedup with a persistent
hash index that covers ALL historical training data. Overfitting risk
compounds across the full dataset: once a file ages out of the 5-file
window its prompts are no longer checked, so old duplicates can leak
back into training after enough overnight runs.
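To make the leak concrete, here is a minimal, hypothetical simulation (not code from this commit). It assumes one file per nightly run and a prompt hash taken over the lowercased, whitespace-collapsed prompt; the helper names and prompt strings are invented for illustration.

```python
# Hypothetical simulation: one set of prompt hashes per nightly run.
import hashlib


def prompt_hash(prompt: str) -> str:
    # Assumption: SHA-256 over the lowercased, whitespace-collapsed prompt.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def seen_in_history(history, h, window=None):
    """True if hash h appears in the last `window` runs (all runs if None)."""
    runs = history if window is None else history[-window:]
    return any(h in run for run in runs)


# 20 nightly runs, each contributing one unique prompt.
history = [{prompt_hash(f"prompt from day {day}")} for day in range(1, 21)]

# The day-1 prompt comes back on day 21 with different spacing and case.
day1_dupe = prompt_hash("Prompt  from DAY 1")

missed_by_window = not seen_in_history(history, day1_dupe, window=5)
caught_by_full_history = seen_in_history(history, day1_dupe)
```

With a window of 5, only days 16 to 20 are consulted, so the day-1 duplicate passes; checking the full history catches it.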

New module: dedup_index.py (DedupIndex)
- Persistent JSON index (.dpo_dedup_index.json) alongside JSONL files
- Append-on-export: new prompt hashes registered after each successful
  export — no full rescan needed for normal operations
- Incremental sync: on load, detects JSONL files not yet indexed and
  ingests them automatically (handles files from other tools)
- Full rebuild: rebuild() scans ALL deepdive_*.jsonl + pairs_*.jsonl
  to reconstruct from scratch (first run, corruption recovery)
- Atomic writes (write-to-tmp + rename) to prevent index corruption
- Standalone CLI: python3 dedup_index.py <dir> --rebuild --stats
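The behaviours listed above could be sketched roughly as follows. This is an illustrative outline, not the contents of dedup_index.py: the class name DedupIndexSketch, the "prompt" record key, the hash function, and the index layout ({"hashes": [...], "files": [...]}) are all assumptions. Only the index filename and the two glob patterns come from this commit message.

```python
import glob
import hashlib
import json
import os
import tempfile

INDEX_NAME = ".dpo_dedup_index.json"               # from the commit message
PATTERNS = ("deepdive_*.jsonl", "pairs_*.jsonl")   # from the commit message


def prompt_hash(prompt):
    # Assumption: SHA-256 over the lowercased, whitespace-collapsed prompt.
    return hashlib.sha256(" ".join(prompt.lower().split()).encode("utf-8")).hexdigest()


class DedupIndexSketch:
    def __init__(self, directory):
        self.directory = directory
        self.path = os.path.join(directory, INDEX_NAME)
        self.hashes, self.files = set(), set()
        self._load()
        self.sync()  # pick up JSONL files written since the last save

    def _load(self):
        try:
            with open(self.path, encoding="utf-8") as fh:
                data = json.load(fh)
            self.hashes, self.files = set(data["hashes"]), set(data["files"])
        except (OSError, ValueError, KeyError, TypeError):
            # Missing or corrupt index: start empty; sync() re-ingests every file.
            self.hashes, self.files = set(), set()

    def _jsonl_files(self):
        found = []
        for pattern in PATTERNS:
            found.extend(glob.glob(os.path.join(self.directory, pattern)))
        return sorted(found)

    def _ingest(self, path):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except ValueError:
                    continue  # skip blank or malformed lines
                # The "prompt" key is an assumed record field.
                if isinstance(record, dict) and isinstance(record.get("prompt"), str):
                    self.hashes.add(prompt_hash(record["prompt"]))
        self.files.add(os.path.basename(path))

    def sync(self):
        """Incremental sync: ingest only files not yet recorded in the index."""
        new = [p for p in self._jsonl_files() if os.path.basename(p) not in self.files]
        for path in new:
            self._ingest(path)
        if new:
            self.save()
        return len(new)

    def rebuild(self):
        """Full rebuild: discard the index and rescan every matching file."""
        self.hashes, self.files = set(), set()
        for path in self._jsonl_files():
            self._ingest(path)
        self.save()

    def add(self, hashes):
        """Append-on-export: register new hashes without rescanning."""
        self.hashes.update(hashes)
        self.save()

    def contains(self, prompt):
        return prompt_hash(prompt) in self.hashes

    def save(self):
        # Atomic write: temp file in the same directory, then os.replace().
        fd, tmp = tempfile.mkstemp(dir=self.directory, prefix=INDEX_NAME, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump({"hashes": sorted(self.hashes), "files": sorted(self.files)}, fh)
            os.replace(tmp, self.path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
```

Writing the temp file in the same directory as the index keeps the rename on one filesystem, which is what makes os.replace() atomic. One consequence of this particular sketch: hashes registered through add() that never appear in a JSONL file are lost on rebuild().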

Modified: dpo_quality.py
- Imports DedupIndex with graceful degradation
- Replaces _load_history_hashes() with persistent index lookup
- Fallback: if index unavailable, scans ALL files in-memory (not just 5)
- New register_exported_hashes() method called after export
- Config key: dedup_full_history (replaces dedup_history_files)
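A rough sketch of the graceful-degradation path described above, under stated assumptions: the function names load_history_hashes and scan_all_hashes, the index_cls parameter, the .hashes attribute, and the "prompt" record key are hypothetical and are not taken from dpo_quality.py. Only the module name, class name, and glob patterns come from this commit message.

```python
import glob
import hashlib
import json
import os

try:
    from dedup_index import DedupIndex  # module added by this commit
except ImportError:
    DedupIndex = None  # degrade gracefully: use the in-memory scan instead


def _hash_prompt(prompt):
    # Assumption: SHA-256 over the lowercased, whitespace-collapsed prompt.
    return hashlib.sha256(" ".join(prompt.lower().split()).encode("utf-8")).hexdigest()


def scan_all_hashes(directory):
    """Fallback: hash prompts from EVERY matching JSONL file, not just the last 5."""
    hashes = set()
    for pattern in ("deepdive_*.jsonl", "pairs_*.jsonl"):
        for path in glob.glob(os.path.join(directory, pattern)):
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    try:
                        record = json.loads(line)
                    except ValueError:
                        continue  # skip blank or malformed lines
                    # The "prompt" key is an assumed record field.
                    if isinstance(record, dict) and isinstance(record.get("prompt"), str):
                        hashes.add(_hash_prompt(record["prompt"]))
    return hashes


def load_history_hashes(directory, index_cls=DedupIndex):
    """Prefer the persistent index; fall back to a full scan if it is unavailable."""
    if index_cls is not None:
        try:
            # Assumption: the index exposes its hashes as an iterable attribute.
            return set(index_cls(directory).hashes)
        except Exception:
            pass  # index unusable: fall through to the scan
    return scan_all_hashes(directory)
```

The fallback costs a full read of the history on every run, so it trades speed for the same coverage the index provides.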

Modified: dpo_generator.py
- Calls validator.register_exported_hashes() after successful export
  to keep the persistent index current without rescanning
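The ordering matters here: hashes should be registered only once the export has actually succeeded. A minimal sketch of that flow follows, assuming a hypothetical export_pairs helper and a stub validator. Only the method name register_exported_hashes comes from this commit; its real signature is not shown, so the single-argument form below is an assumption.

```python
import hashlib
import json
import os
import tempfile


def _hash_prompt(prompt):
    # Assumption: SHA-256 over the lowercased, whitespace-collapsed prompt.
    return hashlib.sha256(" ".join(prompt.lower().split()).encode("utf-8")).hexdigest()


class StubValidator:
    """Stand-in for the validator in dpo_quality.py (hypothetical shape)."""

    def __init__(self):
        self.registered = set()

    def register_exported_hashes(self, hashes):
        self.registered.update(hashes)


def export_pairs(pairs, out_path, validator):
    """Write pairs as JSONL, then register their prompt hashes on success."""
    directory = os.path.dirname(os.path.abspath(out_path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for pair in pairs:
                fh.write(json.dumps(pair, ensure_ascii=False) + "\n")
        os.replace(tmp, out_path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise  # export failed: nothing is registered
    # Reached only after the file is fully written and in place.
    validator.register_exported_hashes({_hash_prompt(p["prompt"]) for p in pairs})
    return len(pairs)
```

If registration ran before the write and the write then failed, the index would mark prompts as seen that never reached a training file, and later runs would reject them as duplicates.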

Modified: config.yaml
- Replaced dedup_history_files: 5 with dedup_full_history: true

Tested — 7 integration tests:
  ✓ Fresh index build from empty directory
  ✓ Build from 3 existing JSONL files (15 unique hashes)
  ✓ Incremental sync when new file appears between runs
  ✓ Append after export + persistence across reloads
  ✓ Rebuild from scratch (recovers from corruption)
  ✓ Validator catches day-1 dupe from 20-day history (5-file window miss)
  ✓ Full pipeline: generate → validate → export → register → re-run detects

perplexity

2026-04-13 03:11:10 +00:00

parent c00e1caa26

commit 4b15cf8283

4 changed files with 472 additions and 46 deletions

									
intelligence/deepdive/config.yaml (2 lines changed)

@@ -108,7 +108,7 @@ deepdive:
         min_chosen_rejected_ratio: 1.3   # Chosen must be ≥1.3x longer than rejected
         max_chosen_rejected_similarity: 0.70  # Max Jaccard overlap between chosen/rejected
         max_prompt_prompt_similarity: 0.85    # Max Jaccard overlap between prompts (dedup)
-        dedup_history_files: 5           # How many recent JSONL files to scan for cross-run dedup
+        dedup_full_history: true          # Persistent index covers ALL historical JSONL (no sliding window)
   # Phase 0: Fleet Context Grounding
   fleet_context:
