Files
the-nexus/intelligence/deepdive
perplexity 77cfa48707 feat: DPO pair quality validator — gate before overnight training
Add DPOQualityValidator that catches bad training pairs before they
enter the tightening loop. Wired into DPOPairGenerator between
generate() and export() as an automatic quality gate.

New module: dpo_quality.py
- 5 single-pair quality checks:
  1. Field length minimums (prompt ≥40, chosen ≥80, rejected ≥30 chars)
  2. Chosen/rejected length ratio (chosen must be ≥1.3x longer)
  3. Chosen≈rejected similarity (Jaccard ≤0.70 — catches low-contrast)
  4. Vocabulary diversity in chosen (unique word ratio ≥0.30)
  5. Substance markers in chosen (≥2 fleet/training/action terms)
- 2 cross-pair quality checks:
  6. Near-duplicate prompts within batch (Jaccard ≤0.85)
  7. Cross-run dedup against recent JSONL history files
- Two modes: 'drop' (filter out bad pairs) or 'flag' (export with warning)
- BatchReport with per-pair diagnostics, pass rates, and warnings
- Standalone CLI: python3 dpo_quality.py <file.jsonl> [--strict] [--json]

Modified: dpo_generator.py
- Imports DPOQualityValidator with graceful degradation
- Initializes from config validation section (enabled by default)
- Validates between generate() and export() in run()
- Quality report included in pipeline result dict
- Validator failure never blocks — falls back to unvalidated export

Modified: config.yaml
- New deepdive.training.dpo.validation section with all tunable knobs:
  enabled, flagged_pair_action, similarity thresholds, length minimums,
  dedup_history_files

Integration tested — 6 test cases covering:
  ✓ Good pairs pass (3/3 accepted)
  ✓ Bad pairs caught: too-short, high-similarity, inverted signal (0/3)
  ✓ Near-duplicate prompt detection (1/2 deduped)
  ✓ Flag mode preserves pairs with warnings (3/3 flagged)
  ✓ Cross-run deduplication against history (1 dupe caught)
  ✓ Full generator→validator→export pipeline (6/6 validated)
2026-04-18 15:19:56 -04:00
..
2026-04-18 15:19:55 -04:00

Deep Dive: Automated Intelligence Briefing System

Sovereign, automated daily intelligence pipeline for the Timmy Foundation fleet.

Vision

Zero-manual-input daily AI-generated podcast briefing covering:

  • arXiv (cs.AI, cs.CL, cs.LG)
  • OpenAI, Anthropic, DeepMind research blogs
  • AI newsletters (Import AI, TLDR AI)

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Phase 1        │───▶│  Phase 2        │───▶│  Phase 3        │
│  Aggregation    │    │  Relevance      │    │  Synthesis      │
│  (RSS/Feeds)    │    │  (Embeddings)   │    │  (LLM Briefing) │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                              ┌────────────────────────┘
                              ▼
                    ┌─────────────────┐    ┌─────────────────┐
                    │  Phase 4        │───▶│  Phase 5        │
                    │  Audio (TTS)    │    │  Delivery       │
                    │  (Piper)        │    │  (Telegram)     │
                    └─────────────────┘    └─────────────────┘

Status: IMPLEMENTATION COMPLETE

This is no longer a reference scaffold — it is a production-ready executable pipeline.

Component Status File
Phase 1: Aggregation Complete pipeline.py — RSS fetcher with caching
Phase 2: Relevance Complete pipeline.py — sentence-transformers ranking
Phase 3: Synthesis Complete pipeline.py — LLM briefing generation
Phase 4: Audio Complete tts_engine.py — Piper + ElevenLabs hybrid
Phase 5: Delivery Complete pipeline.py — Telegram text + voice
Orchestrator Complete pipeline.py — asyncio CLI + Python API
Tests Complete tests/test_e2e.py — dry-run validation
Systemd Timer Complete systemd/deepdive.timer — 06:00 daily

Quick Start

See QUICKSTART.md for exact commands to run the pipeline.

Sovereignty Compliance

Component Implementation Non-Negotiable
Aggregation Local RSS polling No third-party APIs
Relevance sentence-transformers local No cloud embeddings
Synthesis Gemma 4 via Hermes llama-server No OpenAI/Anthropic API
TTS Piper TTS local No ElevenLabs
Delivery Hermes Telegram gateway Existing infra

Files

  • pipeline.py — Main orchestrator (production implementation)
  • tts_engine.py — Phase 4 TTS engine (Piper + ElevenLabs fallback)
  • config.yaml — Configuration template
  • Makefile — Build automation (make test-e2e, make install-systemd)
  • tests/ — pytest suite including end-to-end dry-run test
  • systemd/ — Daily timer for 06:00 execution
  • QUICKSTART.md — Step-by-step execution guide
  • architecture.md — Full technical specification
  • telegram_command.py — Hermes /deepdive command handler

Issue

#830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing