perplexity 9f90392a93 feat: full-history persistent dedup index for DPO training pairs
Replace the 5-file sliding window cross-run dedup with a persistent
hash index that covers ALL historical training data. Overfitting risk
compounds across the full dataset — a 5-file window lets old duplicates
leak back into training after enough overnight runs.

New module: dedup_index.py (DedupIndex)
- Persistent JSON index (.dpo_dedup_index.json) alongside JSONL files
- Append-on-export: new prompt hashes registered after each successful
  export — no full rescan needed for normal operations
- Incremental sync: on load, detects JSONL files not yet indexed and
  ingests them automatically (handles files from other tools)
- Full rebuild: rebuild() scans ALL deepdive_*.jsonl + pairs_*.jsonl
  to reconstruct from scratch (first run, corruption recovery)
- Atomic writes (write-to-tmp + rename) to prevent index corruption
- Standalone CLI: python3 dedup_index.py <dir> --rebuild --stats
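
The bullets above can be sketched as a minimal persistent index. This is an illustrative sketch only — the method names, on-disk JSON shape, and whitespace-normalized SHA-256 prompt hashing are assumptions, not the actual dedup_index.py internals; only the index filename and the atomic write-to-tmp + rename behavior come from the commit message.

```python
import hashlib
import json
import os
import tempfile

class DedupIndex:
    """Sketch of a persistent prompt-hash index (internals are assumed)."""

    INDEX_NAME = ".dpo_dedup_index.json"

    def __init__(self, directory):
        self.path = os.path.join(directory, self.INDEX_NAME)
        self.hashes = set()
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.hashes = set(json.load(f).get("hashes", []))

    @staticmethod
    def hash_prompt(prompt):
        # Normalize whitespace so trivial reformatting doesn't defeat dedup.
        return hashlib.sha256(" ".join(prompt.split()).encode()).hexdigest()

    def contains(self, prompt):
        return self.hash_prompt(prompt) in self.hashes

    def register(self, prompts):
        # Append-on-export: add the new hashes, then persist immediately.
        self.hashes.update(self.hash_prompt(p) for p in prompts)
        self._save()

    def _save(self):
        # Atomic write: dump to a temp file in the same directory, then
        # rename over the index so a crash never leaves a torn file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path))
        with os.fdopen(fd, "w") as f:
            json.dump({"hashes": sorted(self.hashes)}, f)
        os.replace(tmp, self.path)
```

`os.replace` is atomic on POSIX filesystems when source and destination are on the same volume, which is why the temp file is created alongside the index rather than in the system temp directory.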

Modified: dpo_quality.py
- Imports DedupIndex with graceful degradation
- Replaces _load_history_hashes() with persistent index lookup
- Fallback: if index unavailable, scans ALL files in-memory (not just 5)
- New register_exported_hashes() method called after export
- Config key: dedup_full_history (replaces dedup_history_files)
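
The graceful-degradation import and the all-files fallback might look roughly like this. The function name, the `prompt_hash` field name, and the record layout are illustrative assumptions; the JSONL glob patterns come from the commit message.

```python
import glob
import json
import os

# Graceful degradation: if the new module is missing, fall back to an
# in-memory scan of ALL historical JSONL files (not just the last 5).
try:
    from dedup_index import DedupIndex
except ImportError:
    DedupIndex = None

def load_history_hashes(directory):
    """Return the set of previously exported prompt hashes."""
    if DedupIndex is not None:
        return set(DedupIndex(directory).hashes)  # fast path: persistent index
    hashes = set()
    for pattern in ("deepdive_*.jsonl", "pairs_*.jsonl"):
        for path in glob.glob(os.path.join(directory, pattern)):
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    if "prompt_hash" in record:
                        hashes.add(record["prompt_hash"])
    return hashes
```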

Modified: dpo_generator.py
- Calls validator.register_exported_hashes() after successful export
  to keep the persistent index current without rescanning

Modified: config.yaml
- Replaced dedup_history_files: 5 with dedup_full_history: true
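
As a config fragment, the change is a one-line swap (only the two keys shown are from the commit; any surrounding structure is assumed):

```yaml
# Before: sliding window over the 5 most recent files
# dedup_history_files: 5

# After: persistent index covering all historical exports
dedup_full_history: true
```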

Tested — 7 integration tests:
  ✓ Fresh index build from empty directory
  ✓ Build from 3 existing JSONL files (15 unique hashes)
  ✓ Incremental sync when new file appears between runs
  ✓ Append after export + persistence across reloads
  ✓ Rebuild from scratch (recovers from corruption)
  ✓ Validator catches a day-1 duplicate across a 20-day history (a case the old 5-file window would miss)
  ✓ Full pipeline: generate → validate → export → register → re-run detects
2026-04-15 21:24:01 -04:00

Deep Dive: Automated Intelligence Briefing System

Sovereign, automated daily intelligence pipeline for the Timmy Foundation fleet.

Vision

Zero-manual-input daily AI-generated podcast briefing covering:

  • arXiv (cs.AI, cs.CL, cs.LG)
  • OpenAI, Anthropic, DeepMind research blogs
  • AI newsletters (Import AI, TLDR AI)

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Phase 1        │───▶│  Phase 2        │───▶│  Phase 3        │
│  Aggregation    │    │  Relevance      │    │  Synthesis      │
│  (RSS/Feeds)    │    │  (Embeddings)   │    │  (LLM Briefing) │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                              ┌────────────────────────┘
                              ▼
                    ┌─────────────────┐    ┌─────────────────┐
                    │  Phase 4        │───▶│  Phase 5        │
                    │  Audio (TTS)    │    │  Delivery       │
                    │  (Piper)        │    │  (Telegram)     │
                    └─────────────────┘    └─────────────────┘
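
The five phases above chain into a simple asyncio pipeline. The phase functions below are placeholders that only illustrate the data flow; the real pipeline.py API is not shown here.

```python
import asyncio

# Placeholder phases — names and return shapes are illustrative only.
async def aggregate():             # Phase 1: pull RSS/Atom feeds
    return [{"title": "Example paper", "summary": "..."}]

async def rank(items):             # Phase 2: embedding relevance ranking
    return items[:10]

async def synthesize(items):       # Phase 3: LLM briefing text
    return "Briefing: " + "; ".join(i["title"] for i in items)

async def to_audio(briefing):      # Phase 4: TTS to a wav path
    return "/tmp/briefing.wav"

async def deliver(briefing, wav):  # Phase 5: Telegram text + voice
    return {"text": briefing, "voice": wav}

async def run_pipeline():
    # Strictly sequential: each phase consumes the previous phase's output.
    items = await aggregate()
    ranked = await rank(items)
    briefing = await synthesize(ranked)
    wav = await to_audio(briefing)
    return await deliver(briefing, wav)
```

Run with `asyncio.run(run_pipeline())`; keeping the phases as separate coroutines makes it easy to stub any one of them for a dry-run test.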

Status: IMPLEMENTATION COMPLETE

This is no longer a reference scaffold — it is a production-ready executable pipeline.

Component             Status    File
Phase 1: Aggregation  Complete  pipeline.py — RSS fetcher with caching
Phase 2: Relevance    Complete  pipeline.py — sentence-transformers ranking
Phase 3: Synthesis    Complete  pipeline.py — LLM briefing generation
Phase 4: Audio        Complete  tts_engine.py — Piper + ElevenLabs hybrid
Phase 5: Delivery     Complete  pipeline.py — Telegram text + voice
Orchestrator          Complete  pipeline.py — asyncio CLI + Python API
Tests                 Complete  tests/test_e2e.py — dry-run validation
Systemd Timer         Complete  systemd/deepdive.timer — 06:00 daily

Quick Start

See QUICKSTART.md for exact commands to run the pipeline.

Sovereignty Compliance

Component    Implementation                    Non-Negotiable
Aggregation  Local RSS polling                 No third-party APIs
Relevance    sentence-transformers (local)     No cloud embeddings
Synthesis    Gemma 4 via Hermes llama-server   No OpenAI/Anthropic API
TTS          Piper TTS (local)                 No ElevenLabs
Delivery     Hermes Telegram gateway           Existing infra

Files

  • pipeline.py — Main orchestrator (production implementation)
  • tts_engine.py — Phase 4 TTS engine (Piper + ElevenLabs fallback)
  • config.yaml — Configuration template
  • Makefile — Build automation (make test-e2e, make install-systemd)
  • tests/ — pytest suite including end-to-end dry-run test
  • systemd/ — Daily timer for 06:00 execution
  • QUICKSTART.md — Step-by-step execution guide
  • architecture.md — Full technical specification
  • telegram_command.py — Hermes /deepdive command handler

Issue

#830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing