Add DPOQualityValidator that catches bad training pairs before they
enter the tightening loop. Wired into DPOPairGenerator between
generate() and export() as an automatic quality gate.
New module: dpo_quality.py
- 5 single-pair quality checks:
  1. Field length minimums (prompt ≥40, chosen ≥80, rejected ≥30 chars)
  2. Chosen/rejected length ratio (chosen must be at least 1.3x as long as rejected)
  3. Chosen≈rejected similarity (Jaccard ≤0.70; catches low-contrast pairs)
  4. Vocabulary diversity in chosen (unique word ratio ≥0.30)
  5. Substance markers in chosen (≥2 fleet/training/action terms)
- 2 cross-pair quality checks:
  6. Near-duplicate prompts within batch (Jaccard ≤0.85)
  7. Cross-run dedup against recent JSONL history files
- Two modes: 'drop' (filter out bad pairs) or 'flag' (export with warning)
- BatchReport with per-pair diagnostics, pass rates, and warnings
- Standalone CLI: python3 dpo_quality.py <file.jsonl> [--strict] [--json]
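For illustration, checks 1-4 above could look roughly like the sketch below. This is not the code in dpo_quality.py: the function names, warning labels, and word-level tokenization are assumptions, and the length ratio is read as len(chosen) / len(rejected) ≥ 1.3. Check 5 is left out because the substance-term list is not given here.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A & B| / |A | B| (tokenization is an assumption)."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa and not wb:
        return 1.0  # two empty texts are treated as identical
    return len(wa & wb) / len(wa | wb)


def check_pair(prompt: str, chosen: str, rejected: str) -> list:
    """Return hypothetical warning labels for one pair; an empty list means it passes."""
    warnings = []
    # Check 1: field length minimums
    if len(prompt) < 40:
        warnings.append("prompt_too_short")
    if len(chosen) < 80:
        warnings.append("chosen_too_short")
    if len(rejected) < 30:
        warnings.append("rejected_too_short")
    # Check 2: chosen should be at least 1.3x as long as rejected (assumed reading)
    if len(chosen) < 1.3 * len(rejected):
        warnings.append("length_ratio")
    # Check 3: chosen and rejected must not be near-identical
    if jaccard(chosen, rejected) > 0.70:
        warnings.append("low_contrast")
    # Check 4: unique-word ratio in chosen
    words = re.findall(r"\w+", chosen.lower())
    if words and len(set(words)) / len(words) < 0.30:
        warnings.append("low_diversity")
    return warnings
```

In 'drop' mode a pair with any warning would be filtered out; in 'flag' mode the warnings would travel with the exported pair.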
Modified: dpo_generator.py
- Imports DPOQualityValidator with graceful degradation
- Initializes from config validation section (enabled by default)
- Validates between generate() and export() in run()
- Quality report included in pipeline result dict
- Validator failure never blocks the pipeline; it falls back to unvalidated export
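The graceful-degradation behaviour described above might be wired roughly as follows. This is a sketch, not the actual dpo_generator.py code: run_pipeline and its callable arguments are hypothetical stand-ins, and the validator is modelled as a plain callable returning (pairs, report) rather than the real DPOQualityValidator interface.

```python
try:
    # Optional dependency: the pipeline must still run if the module is missing.
    from dpo_quality import DPOQualityValidator
except ImportError:
    DPOQualityValidator = None


def run_pipeline(generate, export, validator=None):
    """Generate pairs, optionally validate them, then export.

    `validator` is a hypothetical callable taking a list of pairs and
    returning (accepted_pairs, report). Any validator failure is recorded
    and the unvalidated pairs are exported instead.
    """
    pairs = generate()
    report = None
    if validator is not None:
        try:
            pairs, report = validator(pairs)
        except Exception as exc:  # never let validation block the export
            report = {"error": str(exc)}
    exported = export(pairs)
    return {"exported": exported, "quality_report": report}
```

Catching a broad Exception here is deliberate in the sketch: the stated goal is that a broken quality gate degrades to the old behaviour instead of stopping the run.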
Modified: config.yaml
- New deepdive.training.dpo.validation section with all tunable knobs:
  enabled, flagged_pair_action, similarity thresholds, length minimums,
  dedup_history_files
Integration tested with 6 test cases covering:
  ✓ Good pairs pass (3/3 accepted)
  ✓ Bad pairs caught: too-short, high-similarity, inverted signal (0/3 accepted)
  ✓ Near-duplicate prompt detection (1/2 deduped)
  ✓ Flag mode preserves pairs with warnings (3/3 flagged)
  ✓ Cross-run deduplication against history (1 dupe caught)
  ✓ Full generator→validator→export pipeline (6/6 validated)
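The near-duplicate and cross-run cases in the list above can be pictured with a sketch like the one below. It is illustrative only: the function names are hypothetical, and it assumes each JSONL history record is an object with a "prompt" field, which this message does not confirm.

```python
import json
import re
from pathlib import Path


def _tokens(text):
    """Lowercased word set used for Jaccard comparison (tokenization is an assumption)."""
    return frozenset(re.findall(r"\w+", text.lower()))


def _jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def load_history_prompts(paths):
    """Collect prompts from earlier JSONL exports; missing files are skipped."""
    prompts = []
    for path in paths:
        p = Path(path)
        if not p.exists():
            continue
        with p.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    prompts.append(json.loads(line).get("prompt", ""))
    return prompts


def dedupe(pairs, history_prompts=(), threshold=0.85):
    """Keep a pair only if its prompt is not too similar to history or to an earlier kept pair."""
    seen = [_tokens(p) for p in history_prompts]
    kept = []
    for pair in pairs:
        toks = _tokens(pair["prompt"])
        if any(_jaccard(toks, s) > threshold for s in seen):
            continue  # near-duplicate of something already seen
        seen.append(toks)
        kept.append(pair)
    return kept
```

The pairwise comparison is quadratic in batch size, which is usually acceptable for small nightly batches; a larger corpus would likely want hashing or an index instead.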
- Add EdgeTTSAdapter to bin/deepdive_tts.py (provider key: "edge-tts");
  default voice: en-US-GuyNeural, no API key required
- Add EdgeTTS class to intelligence/deepdive/tts_engine.py
- Update HybridTTS to try edge-tts as fallback between piper and elevenlabs
- Add --voice-memo flag to bin/night_watch.py for spoken nightly reports
- Add edge-tts>=6.1.9 to requirements.txt
- Create docs/voice-output.md documenting all providers and fallback chain
- Add tests/test_edge_tts.py with 17 unit tests (all mocked, no network)
Fixes #1126
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- lazarus-registry.yaml: replace big_brain/RunPod with local ollama/gemma4:12b
- fleet-routing.json: assign ollama:gemma4:12b to carnice, bilbobagginshire, substratum
- intelligence/deepdive/config.yaml: local model -> gemma4:12b
- Add Dockerfile for production containerized pipeline
- Add docker-compose.yml for full stack deployment
- Add .dockerignore for clean builds
- Add deploy.sh: one-command build, test, and systemd timer install
This provides a sovereign, reproducible deployment path for the
Deep Dive daily briefing pipeline.
- Add GEMINI_HANDOFF.md with codebase map, secrets inventory,
  production checklist, and recommended next steps
- Continuity from Ezra scaffold to Gemini production-hardening
Executable Phase 4 component: PiperTTS, ElevenLabsTTS, HybridTTS
classes with chunking, concatenation, error handling.
Ready for integration with Phase 3 synthesizer.
Burn mode artifact by Ezra.
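As a rough picture of the chunking step mentioned above, long text can be split on sentence boundaries before synthesis. This sketch is an assumption about the approach, not the code in the TTS classes; chunk_text and the max_chars default are hypothetical.

```python
import re


def chunk_text(text: str, max_chars: int = 500) -> list:
    """Split text into chunks of at most max_chars, breaking on sentence ends.

    A single sentence longer than max_chars is kept whole as its own
    oversize chunk rather than being cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio segments concatenated in order.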