Add DPOQualityValidator that catches bad training pairs before they enter the tightening loop. Wired into DPOPairGenerator between generate() and export() as an automatic quality gate.

New module: dpo_quality.py
- 5 single-pair quality checks:
  1. Field length minimums (prompt ≥40, chosen ≥80, rejected ≥30 chars)
  2. Chosen/rejected length ratio (chosen must be ≥1.3x longer)
  3. Chosen≈rejected similarity (Jaccard ≤0.70; catches low-contrast)
  4. Vocabulary diversity in chosen (unique word ratio ≥0.30)
  5. Substance markers in chosen (≥2 fleet/training/action terms)
- 2 cross-pair quality checks:
  6. Near-duplicate prompts within batch (Jaccard ≤0.85)
  7. Cross-run dedup against recent JSONL history files
- Two modes: 'drop' (filter out bad pairs) or 'flag' (export with warning)
- BatchReport with per-pair diagnostics, pass rates, and warnings
- Standalone CLI: python3 dpo_quality.py <file.jsonl> [--strict] [--json]

Modified: dpo_generator.py
- Imports DPOQualityValidator with graceful degradation
- Initializes from config validation section (enabled by default)
- Validates between generate() and export() in run()
- Quality report included in pipeline result dict
- Validator failure never blocks; falls back to unvalidated export

Modified: config.yaml
- New deepdive.training.dpo.validation section with all tunable knobs: enabled, flagged_pair_action, similarity thresholds, length minimums, dedup_history_files

Integration tested, 6 test cases covering:
- ✓ Good pairs pass (3/3 accepted)
- ✓ Bad pairs caught: too-short, high-similarity, inverted signal (0/3)
- ✓ Near-duplicate prompt detection (1/2 deduped)
- ✓ Flag mode preserves pairs with warnings (3/3 flagged)
- ✓ Cross-run deduplication against history (1 dupe caught)
- ✓ Full generator→validator→export pipeline (6/6 validated)
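As a rough illustration of the quality gate described above, the sketch below applies the five single-pair checks and the within-batch near-duplicate check in 'drop' mode, using the thresholds quoted in the commit message. It is not the code in dpo_quality.py: the function names, the dict keys `prompt`/`chosen`/`rejected`, the word tokenizer, and the `SUBSTANCE_TERMS` list are all assumptions made for this example.

```python
import re

# Thresholds quoted in the commit message.
MIN_PROMPT, MIN_CHOSEN, MIN_REJECTED = 40, 80, 30
MIN_LENGTH_RATIO = 1.3
MAX_PAIR_SIMILARITY = 0.70
MIN_UNIQUE_RATIO = 0.30
MIN_SUBSTANCE_TERMS = 2
MAX_PROMPT_SIMILARITY = 0.85

# Hypothetical marker list; the real term list is not shown in the source.
SUBSTANCE_TERMS = {"fleet", "training", "deploy", "pipeline", "review"}


def tokens(text):
    """Lowercase word tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(a, b):
    """Jaccard similarity of the two texts' word sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def check_pair(pair):
    """Return a list of failure labels; an empty list means the pair passes."""
    prompt, chosen, rejected = pair["prompt"], pair["chosen"], pair["rejected"]
    problems = []
    if (len(prompt) < MIN_PROMPT or len(chosen) < MIN_CHOSEN
            or len(rejected) < MIN_REJECTED):
        problems.append("too_short")
    if len(chosen) < MIN_LENGTH_RATIO * len(rejected):
        problems.append("length_ratio")
    if jaccard(chosen, rejected) > MAX_PAIR_SIMILARITY:
        problems.append("low_contrast")
    words = tokens(chosen)
    if not words or len(set(words)) / len(words) < MIN_UNIQUE_RATIO:
        problems.append("low_diversity")
    if len(SUBSTANCE_TERMS & set(words)) < MIN_SUBSTANCE_TERMS:
        problems.append("no_substance")
    return problems


def filter_batch(pairs):
    """'drop' mode: keep pairs that pass every single-pair check and whose
    prompt is not a near-duplicate of an earlier kept prompt."""
    kept = []
    for pair in pairs:
        if check_pair(pair):
            continue
        if any(jaccard(pair["prompt"], k["prompt"]) > MAX_PROMPT_SIMILARITY
               for k in kept):
            continue
        kept.append(pair)
    return kept
```

A 'flag' mode would return every pair together with its `check_pair` labels instead of dropping the failures.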
Deep Dive: Automated Intelligence Briefing System
Sovereign, automated daily intelligence pipeline for the Timmy Foundation fleet.
Vision
Zero-manual-input daily AI-generated podcast briefing covering:
- arXiv (cs.AI, cs.CL, cs.LG)
- OpenAI, Anthropic, DeepMind research blogs
- AI newsletters (Import AI, TLDR AI)
Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Phase 1     │───▶│     Phase 2     │───▶│     Phase 3     │
│   Aggregation   │    │    Relevance    │    │    Synthesis    │
│   (RSS/Feeds)   │    │  (Embeddings)   │    │ (LLM Briefing)  │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                              ┌────────────────────────┘
                              ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │     Phase 4     │───▶│     Phase 5     │
                       │   Audio (TTS)   │    │    Delivery     │
                       │     (Piper)     │    │   (Telegram)    │
                       └─────────────────┘    └─────────────────┘
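The five-phase flow in the diagram can be sketched as a chain of coroutines, which matches the asyncio orchestration that pipeline.py is described as providing. This is a minimal sketch only: every function name, argument, and return shape below is hypothetical, and each phase is a stub standing in for the real RSS, embedding, LLM, TTS, and Telegram work.

```python
import asyncio


async def aggregate(feed_urls):
    # Phase 1 stand-in: a real implementation would poll RSS feeds.
    return [{"title": f"item from {url}", "source": url} for url in feed_urls]


async def rank(items, top_n):
    # Phase 2 stand-in: a real implementation would score items with embeddings.
    return sorted(items, key=lambda item: item["title"])[:top_n]


async def synthesize(items):
    # Phase 3 stand-in for the LLM briefing step.
    return "Briefing: " + "; ".join(item["title"] for item in items)


async def render_audio(briefing):
    # Phase 4 stand-in for TTS; returns placeholder bytes instead of audio.
    return briefing.encode("utf-8")


async def deliver(briefing, audio):
    # Phase 5 stand-in for delivery; reports what would be sent.
    return {"text": briefing, "audio_bytes": len(audio)}


async def run_pipeline(feed_urls, top_n=2):
    """Run the five phases in order, passing each result to the next."""
    items = await aggregate(feed_urls)
    ranked = await rank(items, top_n)
    briefing = await synthesize(ranked)
    audio = await render_audio(briefing)
    return await deliver(briefing, audio)


result = asyncio.run(run_pipeline(["feed-a", "feed-b", "feed-c"]))
```

Keeping each phase as a separate coroutine lets a dry run swap any single stage for a stub without touching the others.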
Status: IMPLEMENTATION COMPLETE
This is no longer a reference scaffold — it is a production-ready executable pipeline.
| Component | Status | File |
|---|---|---|
| Phase 1: Aggregation | ✅ Complete | pipeline.py — RSS fetcher with caching |
| Phase 2: Relevance | ✅ Complete | pipeline.py — sentence-transformers ranking |
| Phase 3: Synthesis | ✅ Complete | pipeline.py — LLM briefing generation |
| Phase 4: Audio | ✅ Complete | tts_engine.py — Piper + ElevenLabs hybrid |
| Phase 5: Delivery | ✅ Complete | pipeline.py — Telegram text + voice |
| Orchestrator | ✅ Complete | pipeline.py — asyncio CLI + Python API |
| Tests | ✅ Complete | tests/test_e2e.py — dry-run validation |
| Systemd Timer | ✅ Complete | systemd/deepdive.timer — 06:00 daily |
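To make the Phase 2 ranking step concrete, here is a minimal sketch of ranking items by cosine similarity to a query vector. It assumes the embeddings have already been computed locally (for example by a sentence-transformers model). The toy three-dimensional vectors and the names `cosine` and `rank_by_relevance` are illustrative and are not taken from pipeline.py.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_relevance(query_vec, items, top_n=3):
    """items is a list of (title, vector); returns titles, most similar first."""
    scored = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [title for title, _ in scored[:top_n]]


# Toy vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 0.0]
items = [
    ("agent tooling paper", [0.9, 0.1, 0.0]),
    ("unrelated hardware news", [0.0, 0.0, 1.0]),
    ("LLM training blog post", [0.6, 0.8, 0.0]),
]
top = rank_by_relevance(query, items, top_n=2)
```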
Quick Start
See QUICKSTART.md for exact commands to run the pipeline.
Sovereignty Compliance
| Component | Implementation | Non-Negotiable |
|---|---|---|
| Aggregation | Local RSS polling | No third-party APIs |
| Relevance | sentence-transformers local | No cloud embeddings |
| Synthesis | Gemma 4 via Hermes llama-server | No OpenAI/Anthropic API |
| TTS | Piper TTS local | No ElevenLabs |
| Delivery | Hermes Telegram gateway | Existing infra |
Files
- pipeline.py: Main orchestrator (production implementation)
- tts_engine.py: Phase 4 TTS engine (Piper + ElevenLabs fallback)
- config.yaml: Configuration template
- Makefile: Build automation (make test-e2e, make install-systemd)
- tests/: pytest suite including end-to-end dry-run test
- systemd/: Daily timer for 06:00 execution
- QUICKSTART.md: Step-by-step execution guide
- architecture.md: Full technical specification
- telegram_command.py: Hermes /deepdive command handler
Issue
#830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing