[#830 ] Deep Dive: Add data directory placeholder

Directory for runtime data: raw/ scored/ briefings/ audio/
[#830 ] Deep Dive architecture scaffold - IMPLEMENTATION.md
2026-04-05 03:31:05 +00:00 · 2026-04-05 03:31:04 +00:00 · 2026-04-05 03:31:04 +00:00
3 changed files with 665 additions and 0 deletions
--- a/research/deep-dive/ARCHITECTURE.md
+++ b/research/deep-dive/ARCHITECTURE.md
@@ -0,0 +1,416 @@
 # Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
 > **Issue**: #830  
 > **Type**: EPIC (21 story points)  
 > **Owner**: Ezra (assigned by Alexander)  
 > **Status**: Architecture complete → Phase 1 ready for implementation
 ---
 ## Vision
 A fully automated daily intelligence briefing system that delivers a personalized AI-generated podcast briefing with **zero manual input**.
 **Inspiration**: NotebookLM workflow (ingest → rank → synthesize → narrate → deliver) — but automated, scheduled, and sovereign.
 ---
 ## 5-Phase Architecture
 ```
 ┌─────────────────────────────────────────────────────────────────────────┐
 │                         DEEP DIVE PIPELINE                              │
 ├───────────────┬───────────────┬───────────────┬───────────────┬─────────┤
 │   PHASE 1     │   PHASE 2     │   PHASE 3     │   PHASE 4     │ PHASE 5 │
 ├───────────────┼───────────────┼───────────────┼───────────────┼─────────┤
 │  AGGREGATE    │    RANK       │  SYNTHESIZE   │   NARRATE     │ DELIVER │
 ├───────────────┼───────────────┼───────────────┼───────────────┼─────────┤
 │ ArXiv RSS     │ Embedding     │ LLM briefing  │ TTS engine    │Telegram │
 │ Lab feeds     │ similarity    │ generator     │ (Piper /      │ voice   │
 │ Newsletters   │ vs codebase   │               │ ElevenLabs)   │ message │
 │ HackerNews    │               │               │               │         │
 └───────────────┴───────────────┴───────────────┴───────────────┴─────────┘
 Timeline: 05:00  →  05:15  →  05:30  →  05:45  →  06:00
          Fetch    Score    Generate   Audio      Deliver
 ```
 ---
 ## Phase 1: Source Aggregation (5 points)
 ### Data Sources
 | Source | URL/API | Frequency | Priority |
 |--------|---------|-----------|----------|
 | ArXiv cs.AI | `http://export.arxiv.org/rss/cs.AI` | Daily 5 AM | P1 |
 | ArXiv cs.CL | `http://export.arxiv.org/rss/cs.CL` | Daily 5 AM | P1 |
 | ArXiv cs.LG | `http://export.arxiv.org/rss/cs.LG` | Daily 5 AM | P1 |
 | OpenAI Blog | `https://openai.com/blog/rss.xml` | Daily 5 AM | P1 |
 | Anthropic | `https://www.anthropic.com/blog/rss.xml` | Daily 5 AM | P1 |
 | DeepMind | `https://deepmind.google/blog/rss.xml` | Daily 5 AM | P2 |
 | Google Research | `https://research.google/blog/rss.xml` | Daily 5 AM | P2 |
 | Import AI | Newsletter (email/IMAP) | Daily 5 AM | P2 |
 | TLDR AI | `https://tldr.tech/ai/rss` | Daily 5 AM | P2 |
 | HackerNews | `https://hnrss.org/newest?points=100` | Daily 5 AM | P3 |
 ### Storage Format
 ```json
 {
  "fetched_at": "2025-01-15T05:00:00Z",
  "source": "arxiv_cs_ai",
  "items": [
    {
      "id": "arxiv:2501.01234",
      "title": "Attention is All You Need: The Sequel",
      "abstract": "...",
      "url": "https://arxiv.org/abs/2501.01234",
      "authors": ["..."],
      "published": "2025-01-14",
      "raw_text": "title + abstract"
    }
  ]
 }
 ```
 ### Output
 `data/deep-dive/raw/YYYY-MM-DD-{source}.jsonl`
 ---
 ## Phase 2: Relevance Engine (6 points)
 ### Scoring Approach
 **Multi-factor relevance score (0-100)**:
 ```python
 score = (
    embedding_similarity * 0.40 +    # Cosine sim vs Hermes codebase
    keyword_match_score * 0.30 +     # Title/abstract keyword hits
    source_priority * 0.15 +         # ArXiv cs.AI = 1.0, HN = 0.3
    recency_boost * 0.10 +           # Today = 1.0, -0.1 per day
    user_feedback * 0.05             # Past thumbs up/down
 )
 ```
 ### Keyword Priority List
 ```yaml
 high_value:
  - "transformer"
  - "attention mechanism"
  - "large language model"
  - "LLM"
  - "agent"
  - "multi-agent"
  - "reasoning"
  - "chain-of-thought"
  - "RLHF"
  - "fine-tuning"
  - "retrieval augmented"
  - "RAG"
  - "vector database"
  - "embedding"
  - "tool use"
  - "function calling"
 medium_value:
  - "BERT"
  - "GPT"
  - "training efficiency"
  - "inference optimization"
  - "quantization"
  - "distillation"
 ```
 ### Vector Database Decision Matrix
 | Option | Pros | Cons | Recommendation |
 |--------|------|------|----------------|
 | **Chroma** | SQLite-backed, zero ops, local | Scales to ~1M docs max | ✅ **Default** |
 | PostgreSQL + pgvector | Enterprise proven, ACID | Requires Postgres | If Nexus uses Postgres |
 | FAISS (in-memory) | Fastest search | Rebuild daily | Budget option |
 ### Output
 `data/deep-dive/scored/YYYY-MM-DD-ranked.json`
 Top 10 items selected for synthesis.
 ---
 ## Phase 3: Synthesis Engine (3 points)
 ### Prompt Architecture
 ```
 You are Deep Dive, a technical intelligence briefing AI for the Hermes/Timmy
 agent system. Your audience is an AI agent builder working on sovereign,
 local-first AI infrastructure.
 SOURCE MATERIAL:
 {ranked_items}
 GENERATE:
 1. **Headlines** (3 bullets): Key announcements in 20 words each
 2. **Deep Dives** (2-3): Important papers with technical summary and
   implications for agent systems
 3. **Quick Hits** (3-5): Brief mentions worth knowing
 4. **Context Bridge**: Connect to Hermes/Timmy current work
   - Mention if papers relate to RL training, tool calling, local inference,
     or multi-agent coordination
 TONE: Professional, concise, technically precise
 TARGET LENGTH: 800-1200 words (10-15 min spoken)
 ```
 ### Output Format (Markdown)
 ```markdown
 # Deep Dive: YYYY-MM-DD
 ## Headlines
 - [Item 1]
 - [Item 2]
 - [Item 3]
 ## Deep Dives
 ### [Paper Title]
 **Source**: ArXiv cs.AI | **Authors**: [...]
 [Technical summary]
 **Why it matters for Hermes**: [...]
 ## Quick Hits
 - [...]
 ## Context Bridge
 [Connection to current work]
 ```
 ### Output
 `data/deep-dive/briefings/YYYY-MM-DD-briefing.md`
 ---
 ## Phase 4: Audio Generation (4 points)
 ### TTS Engine Options
 | Engine | Cost | Quality | Latency | Sovereignty |
 |--------|------|---------|---------|-------------|
 | **Piper** (local) | Free | Good | Medium | ✅ 100% |
 | Coqui TTS (local) | Free | Medium-High | High | ✅ 100% |
 | ElevenLabs API | $0.05/min | Excellent | Low | ❌ Cloud |
 | OpenAI TTS | $0.015/min | Excellent | Low | ❌ Cloud |
 | Google Cloud TTS | $0.004/min | Good | Low | ❌ Cloud |
 ### Recommendation
 **Hybrid approach**:
 - Default: Piper (on-device, sovereign)
 - Override flag: ElevenLabs/OpenAI for special episodes
 ### Piper Configuration
 ```python
 # High-quality English voice
 model = "en_US-lessac-high"
 # Speaking rate: ~150 WPM for technical content
 length_scale = 1.1
 # Output format
 output_format = "mp3"  # 128kbps
 ```
 ### Audio Enhancement
 ```bash
 # Add intro/outro jingles
 ffmpeg -i intro.mp3 -i speech.mp3 -i outro.mp3 \
       -filter_complex "[0:a][1:a][2:a]concat=n=3:v=0:a=1" \
       deep-dive-YYYY-MM-DD.mp3
 ```
 ### Output
 `data/deep-dive/audio/YYYY-MM-DD-deep-dive.mp3` (12-18 MB)
 ---
 ## Phase 5: Delivery Pipeline (3 points)
 ### Cron Schedule
 ```cron
 # Daily at 6:00 AM EST
 0 6 * * * cd /path/to/deep-dive && ./run-daily.sh
 # Or: staggered phases for visibility
 0 5 * * * ./phase1-fetch.sh
 15 5 * * * ./phase2-rank.sh
 30 5 * * * ./phase3-synthesize.sh
 45 5 * * * ./phase4-narrate.sh
 0 6 * * * ./phase5-deliver.sh
 ```
 ### Telegram Integration
 ```python
 # Via Hermes gateway or direct bot
 bot.send_voice(
    chat_id=TELEGRAM_HOME_CHANNEL,
    voice=open("deep-dive-YYYY-MM-DD.mp3", "rb"),
    caption=f"📻 Deep Dive for {date}: {headline_summary}",
    duration=estimated_seconds
 )
 ```
 ### On-Demand Command
 ```
 /deepdive [date]
 # Fetches briefing for specified date (default: today)
 # If audio exists: sends voice message
 # If not: generates on-demand (may take 2-3 min)
 ```
 ---
 ## Implementation Roadmap
 ### Quick Win: Phase 1 Only (2-3 hours)
 **Goal**: Prove value with text-only digests
 ```bash
 # 1. ArXiv RSS fetcher
 # 2. Simple keyword filter
 # 3. Text digest via Telegram
 # 4. Cron schedule
 Result: Daily 8 AM text briefing
 ```
 ### MVP: Phases 1-3-5 (Skip 2,4)
 **Goal**: Working system without embedding/audio complexity
 ```
 Fetch → Keyword filter → LLM synthesize → Text delivery
 ```
 Duration: 1-2 days
 ### Full Implementation: All 5 Phases
 **Goal**: Complete automated podcast system
 Duration: 1-2 weeks (parallel development possible)
 ---
 ## Directory Structure
 ```
 the-nexus/
 └── research/
    └── deep-dive/
        ├── ARCHITECTURE.md          # This file
        ├── IMPLEMENTATION.md        # Detailed dev guide
        ├── config/
        │   ├── sources.yaml         # RSS/feed URLs
        │   ├── keywords.yaml        # Relevance keywords
        │   └── prompts/
        │       ├── synthesis.txt    # LLM prompt template
        │       └── headlines.txt    # Headline-only prompt
        ├── scripts/
        │   ├── phase1-aggregate.py
        │   ├── phase2-rank.py
        │   ├── phase3-synthesize.py
        │   ├── phase4-narrate.py
        │   ├── phase5-deliver.py
        │   └── run-daily.sh         # Orchestrator
        └── data/                    # .gitignored
            ├── raw/                 # Fetched sources
            ├── scored/              # Ranked items
            ├── briefings/           # Markdown outputs
            └── audio/               # MP3 files
 ```
 ---
 ## Acceptance Criteria
 | # | Criterion | Phase |
 |---|-----------|-------|
 | 1 | Zero manual copy-paste | 1-5 |
 | 2 | Daily 6 AM delivery | 5 |
 | 3 | ArXiv coverage (cs.AI, cs.CL, cs.LG) | 1 |
 | 4 | Lab blog coverage | 1 |
 | 5 | Relevance ranking by Hermes context | 2 |
 | 6 | Written briefing generation | 3 |
 | 7 | TTS audio production | 4 |
 | 8 | Telegram voice delivery | 5 |
 | 9 | On-demand `/deepdive` command | 5 |
 ---
 ## Risk Matrix
 | Risk | Likelihood | Impact | Mitigation |
 |------|------------|--------|------------|
 | ArXiv rate limiting | Medium | Medium | Exponential backoff, caching |
 | RSS feed changes | Medium | Low | Health checks, fallback sources |
 | TTS quality poor | Low (Piper) | High | Cloud override flag |
 | Vector DB too slow | Low | Medium | Batch overnight, cache embeddings |
 | Telegram file size | Low | Medium | Compress audio, split long episodes |
 ---
 ## Dependencies
 ### Required
 - Python 3.10+
 - `feedparser` (RSS)
 - `requests` (HTTP)
 - `chromadb` or `sqlite3` (storage)
 - Hermes LLM client (synthesis)
 - Piper TTS (local audio)
 ### Optional
 - `sentence-transformers` (embeddings)
 - `ffmpeg` (audio post-processing)
 - ElevenLabs API key (cloud TTS fallback)
 ---
 ## Related Issues
 - #830 (Parent EPIC)
 - Commandment 6: Human-to-fleet comms
 - #166: Matrix/Conduit deployment
 ---
 ## Next Steps
 1. **Decision**: Vector DB selection (Chroma vs pgvector)
 2. **Implementation**: Phase 1 skeleton (ArXiv fetcher)
 3. **Integration**: Hermes cron registration
 4. **Testing**: 3-day dry run (text only)
 5. **Enhancement**: Add TTS (Phase 4)
 ---
 *Architecture document version 1.0 — Ezra, 2026-04-05*
--- a/research/deep-dive/IMPLEMENTATION.md
+++ b/research/deep-dive/IMPLEMENTATION.md
@@ -0,0 +1,248 @@
 # Deep Dive Implementation Guide
 > Quick-start path from architecture to running system
 ---
 ## Phase 1 Quick Win: ArXiv Text Digest (2-3 hours)
 This minimal implementation proves value without Phase 2/4 complexity.
 ### Step 1: Dependencies
 ```bash
 pip install feedparser requests python-telegram-bot
 ```
 ### Step 2: Basic Fetcher
 ```python
 #!/usr/bin/env python3
 # scripts/arxiv-fetch.py
 import feedparser
 import json
 from datetime import datetime
 FEEDS = {
    "cs.AI": "http://export.arxiv.org/rss/cs.AI",
    "cs.CL": "http://export.arxiv.org/rss/cs.CL",
    "cs.LG": "http://export.arxiv.org/rss/cs.LG",
 }
 KEYWORDS = [
    "transformer", "attention", "LLM", "large language model",
    "agent", "multi-agent", "reasoning", "chain-of-thought",
    "RLHF", "fine-tuning", "RAG", "retrieval augmented",
    "vector database", "embedding", "tool use", "function calling"
 ]
 def score_item(title, abstract):
    text = f"{title} {abstract}".lower()
    matches = sum(1 for kw in KEYWORDS if kw in text)
    return min(matches / 3, 1.0)  # Cap at 1.0
 def fetch_and_score():
    items = []
    for category, url in FEEDS.items():
        feed = feedparser.parse(url)
        for entry in feed.entries[:20]:  # Top 20 per category
            score = score_item(entry.title, entry.get("summary", ""))
            if score > 0.2:  # Minimum relevance threshold
                items.append({
                    "category": category,
                    "title": entry.title,
                    "url": entry.link,
                    "score": score,
                    "abstract": entry.get("summary", "")[:300]
                })
    # Sort by score
    items.sort(key=lambda x: x["score"], reverse=True)
    return items[:10]  # Top 10
 if __name__ == "__main__":
    items = fetch_and_score()
    date = datetime.now().strftime("%Y-%m-%d")
    with open(f"data/raw/{date}-arxiv.json", "w") as f:
        json.dump(items, f, indent=2)
    print(f"Fetched {len(items)} relevant papers")
 ```
 ### Step 3: Synthesis (Text Only)
 ```python
 #!/usr/bin/env python3
 # scripts/text-digest.py
 import json
 from datetime import datetime
 def generate_digest(items):
    lines = [f"📚 Deep Dive — {datetime.now().strftime('%Y-%m-%d')}", ""]
    for i, item in enumerate(items[:5], 1):
        lines.append(f"{i}. {item['title']}")
        lines.append(f"   {item['url']}")
        lines.append(f"   Relevance: {item['score']:.2f}")
        lines.append("")
    return "\n".join(lines)
 # Load and generate
 date = datetime.now().strftime("%Y-%m-%d")
 with open(f"data/raw/{date}-arxiv.json") as f:
    items = json.load(f)
 digest = generate_digest(items)
 print(digest)
 # Save
 with open(f"data/briefings/{date}-digest.txt", "w") as f:
    f.write(digest)
 ```
 ### Step 4: Telegram Delivery
 ```python
 #!/usr/bin/env python3
 # scripts/telegram-send.py
 import os
 import asyncio
 from telegram import Bot
 async def send_digest():
    bot = Bot(token=os.environ["TELEGRAM_BOT_TOKEN"])
    chat_id = os.environ["TELEGRAM_HOME_CHANNEL"]
    date = datetime.now().strftime("%Y-%m-%d")
    with open(f"data/briefings/{date}-digest.txt") as f:
        text = f.read()
    await bot.send_message(chat_id=chat_id, text=text[:4000])
 asyncio.run(send_digest())
 ```
 ### Step 5: Cron Setup
 ```bash
 # crontab -e
 0 6 * * * cd /path/to/deep-dive && ./scripts/run-daily.sh
 ```
 ```bash
 #!/bin/bash
 # scripts/run-daily.sh
 set -e
 DATE=$(date +%Y-%m-%d)
 mkdir -p "data/raw" "data/briefings"
 python3 scripts/arxiv-fetch.py
 python3 scripts/text-digest.py
 python3 scripts/telegram-send.py
 echo "✅ Deep Dive completed for $DATE"
 ```
 ---
 ## Phase 2: Embedding-Based Relevance (Add Day 2)
 ```python
 # scripts/rank-embeddings.py
 from sentence_transformers import SentenceTransformer
 import chromadb
 import json
 # Load model
 model = SentenceTransformer('all-MiniLM-L6-v2')
 # Initialize Chroma (persistent)
 client = chromadb.PersistentClient(path="data/chroma")
 collection = client.get_or_create_collection("hermes-codebase")
 # Load top items
 with open("data/raw/YYYY-MM-DD-arxiv.json") as f:
    items = json.load(f)
 # Score using embeddings
 def embedding_score(item):
    item_emb = model.encode(item['title'] + " " + item['abstract'])
    # Query similar docs from codebase
    results = collection.query(query_embeddings=[item_emb.tolist()], n_results=5)
    # Average similarity of top matches
    return sum(results['distances'][0]) / len(results['distances'][0])
 # Re-rank
 for item in items:
    item['embedding_score'] = embedding_score(item)
    item['final_score'] = (item['score'] * 0.3) + (item['embedding_score'] * 0.7)
 items.sort(key=lambda x: x['final_score'], reverse=True)
 ```
 ---
 ## Phase 4: Piper TTS Integration (Add Day 3)
 ```bash
 # Install Piper
 pip install piper-tts
 # Download voice
 mkdir -p voices
 wget -P voices/ https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/high/en_US-lessac-high.onnx
 wget -P voices/ https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/high/en_US-lessac-high.onnx.json
 ```
 ```python
 #!/usr/bin/env python3
 # scripts/generate-audio.py
 import subprocess
 from datetime import datetime
 date = datetime.now().strftime("%Y-%m-%d")
 # Read briefing
 with open(f"data/briefings/{date}-briefing.md") as f:
    text = f.read()
 # Preprocess for TTS (strip markdown, limit length)
 # ...
 # Generate audio
 subprocess.run([
    "piper",
    "--model", "voices/en_US-lessac-high.onnx",
    "--output_file", f"data/audio/{date}-deep-dive.wav",
    "--length_scale", "1.1"
 ], input=text[:5000].encode())  # First 5K chars
 # Convert to MP3
 subprocess.run([
    "ffmpeg", "-y", "-i", f"data/audio/{date}-deep-dive.wav",
    "-codec:a", "libmp3lame", "-q:a", "4",
    f"data/audio/{date}-deep-dive.mp3"
 ])
 ```
 ---
 ## Testing Checklist
 - [ ] Phase 1: Manual run produces valid JSON
 - [ ] Phase 1: Keyword filter returns relevant results only
 - [ ] Phase 2: Embeddings load without error
 - [ ] Phase 2: Chroma collection queries return matches
 - [ ] Phase 3: LLM generates coherent briefing
 - [ ] Phase 4: Piper produces audible WAV
 - [ ] Phase 4: MP3 conversion works
 - [ ] Phase 5: Telegram text message delivers
 - [ ] Phase 5: Telegram voice message delivers
 - [ ] End-to-end: Cron completes without error
 ---
 *Implementation guide version 1.0*
--- a/research/deep-dive/data/.gitkeep
+++ b/research/deep-dive/data/.gitkeep
@@ -0,0 +1 @@
 # Data directory - not committed