Replace the 5-file sliding-window cross-run dedup with a persistent
hash index that covers ALL historical training data. Overfitting risk
compounds across the full dataset: a 5-file window lets old duplicates
leak back into training after enough overnight runs.
New module: dedup_index.py (DedupIndex)
- Persistent JSON index (.dpo_dedup_index.json) alongside JSONL files
- Append-on-export: new prompt hashes registered after each successful
  export; no full rescan needed for normal operations
- Incremental sync: on load, detects JSONL files not yet indexed and
  ingests them automatically (handles files from other tools)
- Full rebuild: rebuild() scans ALL deepdive_*.jsonl + pairs_*.jsonl
  to reconstruct from scratch (first run, corruption recovery)
- Atomic writes (write-to-tmp + rename) to prevent index corruption
- Standalone CLI: python3 dedup_index.py <dir> --rebuild --stats
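A minimal sketch of how the index behaviour listed above could fit together. This is not the actual DedupIndex code: the class name DedupIndexSketch, the on-disk layout ("hashes" and "files" keys), the SHA-256 hash over a lowercased, whitespace-normalized prompt, and the "prompt" field name are all assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path

INDEX_NAME = ".dpo_dedup_index.json"
PATTERNS = ("deepdive_*.jsonl", "pairs_*.jsonl")


def prompt_hash(prompt):
    # Assumption: hash the lowercased, whitespace-normalized prompt text.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DedupIndexSketch:
    """Hypothetical stand-in for DedupIndex; names and layout are assumed."""

    def __init__(self, directory):
        self.dir = Path(directory)
        self.path = self.dir / INDEX_NAME
        self.hashes = set()
        self.files = set()
        self._load()
        self.sync()  # incremental sync on load

    def _load(self):
        try:
            data = json.loads(self.path.read_text(encoding="utf-8"))
            self.hashes = set(data["hashes"])
            self.files = set(data["files"])
        except (OSError, ValueError, KeyError, TypeError):
            # Missing or corrupt index: start empty, sync() re-ingests all files.
            self.hashes, self.files = set(), set()

    def _save(self):
        # Atomic write: dump to a temp file in the same directory, then rename.
        payload = {"hashes": sorted(self.hashes), "files": sorted(self.files)}
        fd, tmp = tempfile.mkstemp(dir=self.dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(payload, fh)
            os.replace(tmp, self.path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise

    def _ingest(self, path):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except ValueError:
                    continue  # skip blank or malformed lines
                if isinstance(record, dict) and "prompt" in record:
                    self.hashes.add(prompt_hash(str(record["prompt"])))
        self.files.add(path.name)

    def sync(self):
        # Ingest only JSONL files that the index has not seen yet.
        new = [p for pat in PATTERNS for p in sorted(self.dir.glob(pat))
               if p.name not in self.files]
        for path in new:
            self._ingest(path)
        if new:
            self._save()
        return len(new)

    def rebuild(self):
        # Drop everything and rescan all matching files from scratch.
        self.hashes, self.files = set(), set()
        if self.sync() == 0:
            self._save()  # persist the empty index as well

    def register(self, prompts):
        # Append-on-export: record new prompt hashes without rescanning.
        self.hashes.update(prompt_hash(p) for p in prompts)
        self._save()

    def contains(self, prompt):
        return prompt_hash(prompt) in self.hashes
```

The temp file is created in the same directory as the index so that os.replace() stays on one filesystem, which is what makes the rename atomic.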
Modified: dpo_quality.py
- Imports DedupIndex with graceful degradation
- Replaces _load_history_hashes() with persistent index lookup
- Fallback: if the index is unavailable, scans ALL history files in memory
  (not just the most recent 5)
- New register_exported_hashes() method called after export
- Config key: dedup_full_history (replaces dedup_history_files)
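The import-with-degradation and full-scan fallback described above might look roughly like the sketch below. It is illustrative only: load_history_hashes(), scan_all_hashes(), the index_cls parameter, the .hashes attribute on the index, and the SHA-256 prompt hash are assumed names and choices, not the actual dpo_quality.py interface.

```python
import glob
import hashlib
import json
import os

try:
    from dedup_index import DedupIndex  # module added in this commit
except ImportError:
    DedupIndex = None  # degrade gracefully: fall back to in-memory scan

PATTERNS = ("deepdive_*.jsonl", "pairs_*.jsonl")


def _hash(prompt):
    # Assumption: same normalization the index would use.
    return hashlib.sha256(" ".join(prompt.lower().split()).encode("utf-8")).hexdigest()


def scan_all_hashes(directory):
    """Fallback path: read every history file, not a 5-file window."""
    hashes = set()
    for pattern in PATTERNS:
        for path in sorted(glob.glob(os.path.join(directory, pattern))):
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    try:
                        record = json.loads(line)
                    except ValueError:
                        continue  # skip blank or malformed lines
                    if isinstance(record, dict) and "prompt" in record:
                        hashes.add(_hash(str(record["prompt"])))
    return hashes


def load_history_hashes(directory, index_cls=DedupIndex):
    # Prefer the persistent index; any failure drops to the full scan.
    if index_cls is not None:
        try:
            # Hypothetical: assumes the index exposes a `.hashes` set.
            return set(index_cls(directory).hashes)
        except Exception:
            pass
    return scan_all_hashes(directory)
```

Catching a broad Exception here is deliberate in the sketch: a broken index should cost a slower scan, never a failed validation run.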
Modified: dpo_generator.py
- Calls validator.register_exported_hashes() after successful export
  to keep the persistent index current without rescanning
Modified: config.yaml
- Replaced dedup_history_files: 5 with dedup_full_history: true
Tested — 7 integration tests:
✓ Fresh index build from empty directory
✓ Build from 3 existing JSONL files (15 unique hashes)
✓ Incremental sync when new file appears between runs
✓ Append after export + persistence across reloads
✓ Rebuild from scratch (recovers from corruption)
✓ Validator catches day-1 dupe from 20-day history (missed by the 5-file window)
✓ Full pipeline: generate → validate → export → register → re-run detects dupes
Add DPOQualityValidator that catches bad training pairs before they
enter the tightening loop. Wired into DPOPairGenerator between
generate() and export() as an automatic quality gate.
New module: dpo_quality.py
- 5 single-pair quality checks:
  1. Field length minimums (prompt ≥40, chosen ≥80, rejected ≥30 chars)
  2. Chosen/rejected length ratio (chosen must be ≥1.3x the length of rejected)
  3. Chosen≈rejected similarity (Jaccard ≤0.70; catches low-contrast pairs)
  4. Vocabulary diversity in chosen (unique word ratio ≥0.30)
  5. Substance markers in chosen (≥2 fleet/training/action terms)
- 2 cross-pair quality checks:
  6. Near-duplicate prompts within batch (Jaccard ≤0.85)
  7. Cross-run dedup against recent JSONL history files
- Two modes: 'drop' (filter out bad pairs) or 'flag' (export with warning)
- BatchReport with per-pair diagnostics, pass rates, and warnings
- Standalone CLI: python3 dpo_quality.py <file.jsonl> [--strict] [--json]
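For reference, checks 1-4 and 6 can be sketched as below, using the thresholds listed above. The word-level tokenization, the function names (tokens, jaccard, check_pair, dedup_batch), the problem labels, and the reading of check 2 as len(chosen) ≥ 1.3 × len(rejected) are assumptions; check 5 is left out because the substance-term list is not given here.

```python
import re


def tokens(text):
    # Assumption: Jaccard is computed over lowercased word sets.
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts count as identical
    return len(ta & tb) / len(ta | tb)


def check_pair(pair):
    """Return a list of problem labels; an empty list means the pair passes."""
    prompt, chosen, rejected = pair["prompt"], pair["chosen"], pair["rejected"]
    problems = []
    # 1. Field length minimums
    if len(prompt) < 40 or len(chosen) < 80 or len(rejected) < 30:
        problems.append("too_short")
    # 2. Chosen must be at least 1.3x the length of rejected
    if len(chosen) < 1.3 * len(rejected):
        problems.append("length_ratio")
    # 3. Chosen and rejected must not be near-identical
    if jaccard(chosen, rejected) > 0.70:
        problems.append("low_contrast")
    # 4. Vocabulary diversity: unique words / total words in chosen
    words = re.findall(r"\w+", chosen.lower())
    if words and len(set(words)) / len(words) < 0.30:
        problems.append("low_diversity")
    return problems


def dedup_batch(pairs, threshold=0.85):
    # 6. Keep a pair only if its prompt is not a near-duplicate of a kept one.
    kept = []
    for pair in pairs:
        if all(jaccard(pair["prompt"], k["prompt"]) <= threshold for k in kept):
            kept.append(pair)
    return kept
```

In 'drop' mode a pair with a non-empty problem list would be filtered out; in 'flag' mode the same list would travel with the pair as a warning.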
Modified: dpo_generator.py
- Imports DPOQualityValidator with graceful degradation
- Initializes from config validation section (enabled by default)
- Validates between generate() and export() in run()
- Quality report included in pipeline result dict
- Validator failure never blocks the pipeline; it falls back to unvalidated export
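The gate wiring amounts to something like the following sketch. The function run_with_quality_gate(), its callable parameters, and the result-dict keys are hypothetical; the real run() in dpo_generator.py may be structured differently.

```python
def run_with_quality_gate(generate, export, validator=None):
    """Validate between generate() and export(); never block on validator errors.

    All three arguments are hypothetical callables used for illustration:
    generate() -> list of pairs, export(pairs) -> number exported,
    validator(pairs) -> (accepted_pairs, report).
    """
    pairs = generate()
    result = {"generated": len(pairs), "quality_report": None}
    to_export = pairs
    if validator is not None:
        try:
            to_export, report = validator(pairs)
            result["quality_report"] = report
        except Exception as exc:
            # Validator failure: fall back to exporting the unvalidated pairs.
            to_export = pairs
            result["validation_error"] = str(exc)
    result["exported"] = export(to_export)
    return result
```

Passing validator=None models the case where the validator import failed and the generator runs without a quality gate.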
Modified: config.yaml
- New deepdive.training.dpo.validation section with all tunable knobs:
  enabled, flagged_pair_action, similarity thresholds, length minimums,
  dedup_history_files
Integration tested — 6 test cases covering:
✓ Good pairs pass (3/3 accepted)
✓ Bad pairs caught: too-short, high-similarity, inverted signal (0/3 accepted)
✓ Near-duplicate prompt detection (1/2 deduped)
✓ Flag mode preserves pairs with warnings (3/3 flagged)
✓ Cross-run deduplication against history (1 dupe caught)
✓ Full generator→validator→export pipeline (6/6 validated)
- lazarus-registry.yaml: replace big_brain/RunPod with local ollama/gemma4:12b
- fleet-routing.json: assign ollama:gemma4:12b to carnice, bilbobagginshire, substratum
- intelligence/deepdive/config.yaml: local model -> gemma4:12b