[#830] Deep Dive architecture scaffold - ARCHITECTURE.md

Full system design for automated daily AI intelligence briefing:
- 5-phase pipeline: Aggregate → Rank → Synthesize → Narrate → Deliver
- Source coverage: ArXiv, lab blogs, newsletters
- TTS options: Piper (sovereign) / ElevenLabs (cloud)
- Story points: 21 (broken down by phase)
2026-04-05 03:31:04 +00:00
parent 75fa66344d
commit 6aaf04dc04
1 changed files with 416 additions and 0 deletions
--- a/research/deep-dive/ARCHITECTURE.md
+++ b/research/deep-dive/ARCHITECTURE.md
@@ -0,0 +1,416 @@
+# Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
+
+> **Issue**: #830  
+> **Type**: EPIC (21 story points)  
+> **Owner**: Ezra (assigned by Alexander)  
+> **Status**: Architecture complete → Phase 1 ready for implementation
+
+---
+
+## Vision
+
+A fully automated daily intelligence briefing system that delivers a personalized AI-generated podcast briefing with **zero manual input**.
+
+**Inspiration**: NotebookLM workflow (ingest → rank → synthesize → narrate → deliver) — but automated, scheduled, and sovereign.
+
+---
+
+## 5-Phase Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         DEEP DIVE PIPELINE                              │
+├───────────────┬───────────────┬───────────────┬───────────────┬─────────┤
+│   PHASE 1     │   PHASE 2     │   PHASE 3     │   PHASE 4     │ PHASE 5 │
+├───────────────┼───────────────┼───────────────┼───────────────┼─────────┤
+│  AGGREGATE    │    RANK       │  SYNTHESIZE   │   NARRATE     │ DELIVER │
+├───────────────┼───────────────┼───────────────┼───────────────┼─────────┤
+│ ArXiv RSS     │ Embedding     │ LLM briefing  │ TTS engine    │Telegram │
+│ Lab feeds     │ similarity    │ generator     │ (Piper /      │ voice   │
+│ Newsletters   │ vs codebase   │               │ ElevenLabs)   │ message │
+│ HackerNews    │               │               │               │         │
+└───────────────┴───────────────┴───────────────┴───────────────┴─────────┘
+
+Timeline: 05:00  →  05:15  →  05:30  →  05:45  →  06:00
+          Fetch    Score    Generate   Audio      Deliver
+```
+
+---
+
+## Phase 1: Source Aggregation (5 points)
+
+### Data Sources
+
+| Source | URL/API | Frequency | Priority |
+|--------|---------|-----------|----------|
+| ArXiv cs.AI | `http://export.arxiv.org/rss/cs.AI` | Daily 5 AM | P1 |
+| ArXiv cs.CL | `http://export.arxiv.org/rss/cs.CL` | Daily 5 AM | P1 |
+| ArXiv cs.LG | `http://export.arxiv.org/rss/cs.LG` | Daily 5 AM | P1 |
+| OpenAI Blog | `https://openai.com/blog/rss.xml` | Daily 5 AM | P1 |
+| Anthropic | `https://www.anthropic.com/blog/rss.xml` | Daily 5 AM | P1 |
+| DeepMind | `https://deepmind.google/blog/rss.xml` | Daily 5 AM | P2 |
+| Google Research | `https://research.google/blog/rss.xml` | Daily 5 AM | P2 |
+| Import AI | Newsletter (email/IMAP) | Daily 5 AM | P2 |
+| TLDR AI | `https://tldr.tech/ai/rss` | Daily 5 AM | P2 |
+| HackerNews | `https://hnrss.org/newest?points=100` | Daily 5 AM | P3 |
+
+### Storage Format
+
+```json
+{
+  "fetched_at": "2025-01-15T05:00:00Z",
+  "source": "arxiv_cs_ai",
+  "items": [
+    {
+      "id": "arxiv:2501.01234",
+      "title": "Attention is All You Need: The Sequel",
+      "abstract": "...",
+      "url": "https://arxiv.org/abs/2501.01234",
+      "authors": ["..."],
+      "published": "2025-01-14",
+      "raw_text": "title + abstract"
+    }
+  ]
+}
+```
+
+### Output
+
+`data/deep-dive/raw/YYYY-MM-DD-{source}.jsonl`
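+
+A minimal sketch of the Phase 1 normalization step, not the project's actual fetcher. It uses the standard library's `xml.etree.ElementTree` in place of `feedparser` to stay dependency-free, assumes plain RSS 2.0 `<item>` elements, and assumes each line of the `.jsonl` file holds one item that carries the `fetched_at` and `source` fields. The names `parse_rss_items` and `write_jsonl` are hypothetical.
+
+```python
+import json
+import xml.etree.ElementTree as ET
+from datetime import datetime, timezone
+from pathlib import Path
+
+
+def parse_rss_items(rss_xml: str, id_prefix: str) -> list[dict]:
+    """Map RSS 2.0 <item> elements onto the storage format shown above."""
+    root = ET.fromstring(rss_xml)
+    items = []
+    for node in root.iter("item"):
+        title = (node.findtext("title") or "").strip()
+        abstract = (node.findtext("description") or "").strip()
+        link = (node.findtext("link") or "").strip()
+        items.append({
+            # Assumption: the last URL segment is a usable item identifier.
+            "id": f"{id_prefix}:{link.rsplit('/', 1)[-1]}",
+            "title": title,
+            "abstract": abstract,
+            "url": link,
+            "raw_text": f"{title}\n{abstract}",
+        })
+    return items
+
+
+def write_jsonl(items: list[dict], source: str, out_dir: Path, day: str) -> Path:
+    """Write one item per line to <out_dir>/<day>-<source>.jsonl."""
+    out_dir.mkdir(parents=True, exist_ok=True)
+    path = out_dir / f"{day}-{source}.jsonl"
+    fetched_at = datetime.now(timezone.utc).isoformat()
+    with path.open("w", encoding="utf-8") as fh:
+        for item in items:
+            record = {"fetched_at": fetched_at, "source": source, **item}
+            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
+    return path
+```
+
+Writing one record per line keeps the raw files appendable and lets Phase 2 stream them without loading a whole feed into memory.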
+
+---
+
+## Phase 2: Relevance Engine (6 points)
+
+### Scoring Approach
+
+**Multi-factor relevance score (0-100)**:
+
+```python
+# Each factor is normalized to 0.0-1.0 and the weights sum to 1.0,
+# so multiplying by 100 yields the 0-100 score.
+score = 100 * (
+    embedding_similarity * 0.40 +    # Cosine sim vs Hermes codebase
+    keyword_match_score * 0.30 +     # Title/abstract keyword hits
+    source_priority * 0.15 +         # ArXiv cs.AI = 1.0, HN = 0.3
+    recency_boost * 0.10 +           # Today = 1.0, -0.1 per day
+    user_feedback * 0.05             # Past thumbs up/down
+)
+```
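+
+The weighted sum can be sketched as a small runnable function. This is an illustration under stated assumptions: every factor is taken to lie in 0.0-1.0, out-of-range values are clamped, and the result is scaled to match the stated 0-100 range. The names `WEIGHTS`, `recency_boost`, and `relevance_score` are hypothetical.
+
+```python
+from datetime import date
+
+# Weights copied from the scoring formula; they sum to 1.0.
+WEIGHTS = {
+    "embedding_similarity": 0.40,
+    "keyword_match_score": 0.30,
+    "source_priority": 0.15,
+    "recency_boost": 0.10,
+    "user_feedback": 0.05,
+}
+
+
+def recency_boost(published: date, today: date) -> float:
+    """1.0 for today, minus 0.1 per day of age, floored at 0.0."""
+    age_days = max((today - published).days, 0)
+    return max(1.0 - 0.1 * age_days, 0.0)
+
+
+def relevance_score(factors: dict[str, float]) -> float:
+    """Weighted sum of factors (each clamped to 0.0-1.0), scaled to 0-100."""
+    total = 0.0
+    for name, weight in WEIGHTS.items():
+        value = min(max(factors.get(name, 0.0), 0.0), 1.0)
+        total += weight * value
+    return round(100 * total, 2)
+```
+
+Missing factors default to 0.0, so an item with no feedback history is scored on the remaining 95% of the weight rather than rejected.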
+
+### Keyword Priority List
+
+```yaml
+high_value:
+  - "transformer"
+  - "attention mechanism"
+  - "large language model"
+  - "LLM"
+  - "agent"
+  - "multi-agent"
+  - "reasoning"
+  - "chain-of-thought"
+  - "RLHF"
+  - "fine-tuning"
+  - "retrieval augmented"
+  - "RAG"
+  - "vector database"
+  - "embedding"
+  - "tool use"
+  - "function calling"
+
+medium_value:
+  - "BERT"
+  - "GPT"
+  - "training efficiency"
+  - "inference optimization"
+  - "quantization"
+  - "distillation"
+```
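+
+One way the keyword list could feed `keyword_match_score` is sketched here. The tier weights (1.0 and 0.5) and the saturation point of 3.0 weighted hits are assumptions chosen for illustration, and only a subset of the keywords is inlined to keep the example short.
+
+```python
+import re
+
+# Subset of the keyword list above; a real run would load keywords.yaml.
+KEYWORDS = {
+    "high_value": ["transformer", "LLM", "RAG", "agent", "tool use"],
+    "medium_value": ["BERT", "GPT", "quantization"],
+}
+TIER_WEIGHTS = {"high_value": 1.0, "medium_value": 0.5}  # assumed weights
+SATURATION = 3.0  # assumed: weighted hits at which the score caps at 1.0
+
+
+def keyword_match_score(text: str) -> float:
+    """Return a 0.0-1.0 score from whole-word, case-insensitive keyword hits."""
+    weighted_hits = 0.0
+    for tier, words in KEYWORDS.items():
+        for word in words:
+            # Word boundaries stop "RAG" from matching inside "fragment".
+            pattern = rf"\b{re.escape(word)}\b"
+            if re.search(pattern, text, flags=re.IGNORECASE):
+                weighted_hits += TIER_WEIGHTS[tier]
+    return min(weighted_hits / SATURATION, 1.0)
+```
+
+Each keyword counts once per item regardless of how often it appears, which keeps long abstracts from dominating the score.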
+
+### Vector Database Decision Matrix
+
+| Option | Pros | Cons | Recommendation |
+|--------|------|------|----------------|
+| **Chroma** | SQLite-backed, zero ops, local | Scales to ~1M docs max | ✅ **Default** |
+| PostgreSQL + pgvector | Enterprise proven, ACID | Requires Postgres | If Nexus uses Postgres |
+| FAISS (in-memory) | Fastest search | Rebuild daily | Budget option |
+
+### Output
+
+`data/deep-dive/scored/YYYY-MM-DD-ranked.json`
+
+Top 10 items selected for synthesis.
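+
+The embedding comparison and the top-10 cut can be sketched in plain Python. This assumes the embedding vectors are produced elsewhere (for example by `sentence-transformers`) and that each scored item is a dict with a `score` key; both function names are hypothetical. Cosine similarity ranges from -1 to 1, so negative values would need clamping to 0 before entering the weighted score.
+
+```python
+import heapq
+import math
+
+
+def cosine_similarity(a: list[float], b: list[float]) -> float:
+    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
+    if len(a) != len(b):
+        raise ValueError("vectors must have the same length")
+    dot = sum(x * y for x, y in zip(a, b))
+    norm_a = math.sqrt(sum(x * x for x in a))
+    norm_b = math.sqrt(sum(y * y for y in b))
+    if norm_a == 0.0 or norm_b == 0.0:
+        return 0.0
+    return dot / (norm_a * norm_b)
+
+
+def top_items(scored: list[dict], limit: int = 10) -> list[dict]:
+    """Return the `limit` highest-scoring items, best first."""
+    return heapq.nlargest(limit, scored, key=lambda item: item["score"])
+```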
+
+---
+
+## Phase 3: Synthesis Engine (3 points)
+
+### Prompt Architecture
+
+```
+You are Deep Dive, a technical intelligence briefing AI for the Hermes/Timmy
+agent system. Your audience is an AI agent builder working on sovereign,
+local-first AI infrastructure.
+
+SOURCE MATERIAL:
+{ranked_items}
+
+GENERATE:
+1. **Headlines** (3 bullets): Key announcements in 20 words each
+2. **Deep Dives** (2-3): Important papers with technical summary and
+   implications for agent systems
+3. **Quick Hits** (3-5): Brief mentions worth knowing
+4. **Context Bridge**: Connect to Hermes/Timmy current work
+   - Mention if papers relate to RL training, tool calling, local inference,
+     or multi-agent coordination
+
+TONE: Professional, concise, technically precise
+TARGET LENGTH: 1500-2250 words (10-15 min spoken at ~150 WPM)
+```
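+
+Filling the `{ranked_items}` placeholder might look like the sketch below. The template string is abbreviated, the item fields (`title`, `url`, `abstract`, `score`) follow the storage format plus an assumed `score` key, and the 600-character abstract cap is an arbitrary illustration value.
+
+```python
+# Abbreviated stand-in for config/prompts/synthesis.txt.
+PROMPT_TEMPLATE = "SOURCE MATERIAL:\n{ranked_items}\n\nGENERATE: ..."
+
+
+def format_ranked_items(items: list[dict], abstract_chars: int = 600) -> str:
+    """Render ranked items as a numbered plain-text list for the prompt."""
+    blocks = []
+    for rank, item in enumerate(items, start=1):
+        abstract = item.get("abstract", "")[:abstract_chars]
+        blocks.append(
+            f"[{rank}] {item['title']} (score {item['score']:.0f})\n"
+            f"URL: {item['url']}\n"
+            f"{abstract}"
+        )
+    return "\n\n".join(blocks)
+
+
+def build_prompt(items: list[dict]) -> str:
+    """Substitute the rendered items into the {ranked_items} placeholder."""
+    return PROMPT_TEMPLATE.format(ranked_items=format_ranked_items(items))
+```
+
+Because `str.format` only interprets braces in the template itself, abstracts containing `{` or `}` pass through unchanged; the template, however, must not contain other unescaped braces.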
+
+### Output Format (Markdown)
+
+```markdown
+# Deep Dive: YYYY-MM-DD
+
+## Headlines
+- [Item 1]
+- [Item 2]
+- [Item 3]
+
+## Deep Dives
+
+### [Paper Title]
+**Source**: ArXiv cs.AI | **Authors**: [...]
+
+[Technical summary]
+
+**Why it matters for Hermes**: [...]
+
+## Quick Hits
+- [...]
+
+## Context Bridge
+[Connection to current work]
+```
+
+### Output
+
+`data/deep-dive/briefings/YYYY-MM-DD-briefing.md`
+
+---
+
+## Phase 4: Audio Generation (4 points)
+
+### TTS Engine Options
+
+| Engine | Cost | Quality | Latency | Sovereignty |
+|--------|------|---------|---------|-------------|
+| **Piper** (local) | Free | Good | Medium | ✅ 100% |
+| Coqui TTS (local) | Free | Medium-High | High | ✅ 100% |
+| ElevenLabs API | $0.05/min | Excellent | Low | ❌ Cloud |
+| OpenAI TTS | $0.015/min | Excellent | Low | ❌ Cloud |
+| Google Cloud TTS | $0.004/min | Good | Low | ❌ Cloud |
+
+### Recommendation
+
+**Hybrid approach**:
+- Default: Piper (on-device, sovereign)
+- Override flag: ElevenLabs/OpenAI for special episodes
+
+### Piper Configuration
+
+```python
+# High-quality English voice
+model = "en_US-lessac-high"
+
+# Speaking rate: ~150 WPM for technical content
+length_scale = 1.1
+
+# Output format: Piper writes WAV, so transcode to MP3 with ffmpeg
+output_format = "mp3"  # 128 kbps
+```
+
+### Audio Enhancement
+
+```bash
+# Add intro/outro jingles
+ffmpeg -i intro.mp3 -i speech.mp3 -i outro.mp3 \
+       -filter_complex "[0:a][1:a][2:a]concat=n=3:v=0:a=1[out]" \
+       -map "[out]" -b:a 128k YYYY-MM-DD-deep-dive.mp3
+```
+
+### Output
+
+`data/deep-dive/audio/YYYY-MM-DD-deep-dive.mp3` (12-18 MB)
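+
+The file-size estimate follows from simple arithmetic: at a constant 128 kbps, one minute of audio is 0.96 MB, so 12-18 MB corresponds to roughly 12.5-18.75 minutes including jingles. A small sketch of that calculation, with hypothetical function names and the ~150 WPM speaking rate taken from the Piper configuration:
+
+```python
+def spoken_minutes(word_count: int, words_per_minute: int = 150) -> float:
+    """Estimated narration time for a script of `word_count` words."""
+    return word_count / words_per_minute
+
+
+def mp3_size_mb(minutes: float, bitrate_kbps: int = 128) -> float:
+    """Constant-bitrate file size in megabytes (1 MB = 1,000,000 bytes)."""
+    bytes_total = bitrate_kbps * 1000 / 8 * minutes * 60
+    return bytes_total / 1_000_000
+```
+
+For example, `mp3_size_mb(spoken_minutes(1500))` estimates the size of a 1500-word script before intro and outro audio are added.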
+
+---
+
+## Phase 5: Delivery Pipeline (3 points)
+
+### Cron Schedule
+
+```cron
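+# Daily at 6:00 AM EST (cron uses the host's time zone, so set it accordingly)
+0 6 * * * cd /path/to/deep-dive && ./scripts/run-daily.sh
+
+# Or: staggered phases for visibility (script names as in Directory Structure)
+0 5 * * * cd /path/to/deep-dive && python3 scripts/phase1-aggregate.py
+15 5 * * * cd /path/to/deep-dive && python3 scripts/phase2-rank.py
+30 5 * * * cd /path/to/deep-dive && python3 scripts/phase3-synthesize.py
+45 5 * * * cd /path/to/deep-dive && python3 scripts/phase4-narrate.py
+0 6 * * * cd /path/to/deep-dive && python3 scripts/phase5-deliver.py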
+```
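+
+Staggered cron entries fire on the clock whether or not the previous phase succeeded, so a single orchestrator that stops at the first failure is a safer default. The sketch below shows that idea in Python; `run_phases` and the `PHASES` list are hypothetical, and the script paths are taken from the directory layout in this document.
+
+```python
+import subprocess
+import sys
+
+# Hypothetical phase commands, run from the deep-dive directory.
+PHASES = [
+    [sys.executable, "scripts/phase1-aggregate.py"],
+    [sys.executable, "scripts/phase2-rank.py"],
+    [sys.executable, "scripts/phase3-synthesize.py"],
+    [sys.executable, "scripts/phase4-narrate.py"],
+    [sys.executable, "scripts/phase5-deliver.py"],
+]
+
+
+def run_phases(phases: list[list[str]]) -> int:
+    """Run each command in order; return 0 or the first non-zero exit code."""
+    for number, command in enumerate(phases, start=1):
+        result = subprocess.run(command)
+        if result.returncode != 0:
+            print(
+                f"phase {number} failed with exit code {result.returncode}",
+                file=sys.stderr,
+            )
+            return result.returncode
+    return 0
+```
+
+A wrapper script would call `sys.exit(run_phases(PHASES))` so that cron sees the failing exit code.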
+
+### Telegram Integration
+
+```python
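+# Via Hermes gateway or direct bot
+# Open the file in a context manager so the handle is closed after upload.
+with open(f"data/deep-dive/audio/{date}-deep-dive.mp3", "rb") as audio:
+    bot.send_voice(
+        chat_id=TELEGRAM_HOME_CHANNEL,
+        voice=audio,
+        caption=f"📻 Deep Dive for {date}: {headline_summary}",
+        duration=estimated_seconds,
+    )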
+```
+
+### On-Demand Command
+
+```
+/deepdive [date]
+
+# Fetches briefing for specified date (default: today)
+# If audio exists: sends voice message
+# If not: generates on-demand (may take 2-3 min)
+```
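+
+The command handling described above can be sketched without any bot framework. The function names are hypothetical, the date argument is assumed to be ISO formatted (`YYYY-MM-DD`), and the audio path follows the Phase 4 output naming.
+
+```python
+from datetime import date
+from pathlib import Path
+
+AUDIO_DIR = Path("data/deep-dive/audio")
+
+
+def parse_deepdive_command(text: str, today: date) -> date:
+    """Parse '/deepdive [YYYY-MM-DD]'; the date defaults to `today`."""
+    parts = text.split()
+    if not parts or parts[0] != "/deepdive":
+        raise ValueError("not a /deepdive command")
+    if len(parts) == 1:
+        return today
+    if len(parts) > 2:
+        raise ValueError("usage: /deepdive [YYYY-MM-DD]")
+    return date.fromisoformat(parts[1])  # raises ValueError on a malformed date
+
+
+def audio_path(day: date, audio_dir: Path = AUDIO_DIR) -> Path:
+    """Location of the episode for `day`, following the Phase 4 naming."""
+    return audio_dir / f"{day.isoformat()}-deep-dive.mp3"
+
+
+def plan_response(text: str, today: date, audio_dir: Path = AUDIO_DIR) -> tuple[str, Path]:
+    """Return ('send', path) if the audio exists, else ('generate', path)."""
+    path = audio_path(parse_deepdive_command(text, today), audio_dir)
+    return ("send" if path.exists() else "generate", path)
+```
+
+Keeping the parsing separate from the bot callback makes the date handling testable without a Telegram connection.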
+
+---
+
+## Implementation Roadmap
+
+### Quick Win: Phase 1 Only (2-3 hours)
+
+**Goal**: Prove value with text-only digests
+
+```bash
+# 1. ArXiv RSS fetcher
+# 2. Simple keyword filter
+# 3. Text digest via Telegram
+# 4. Cron schedule
+
+# Result: Daily 8 AM text briefing
+```
+
+### MVP: Phases 1, 3, and 5 (skip 2 and 4)
+
+**Goal**: Working system without embedding/audio complexity
+
+```
+Fetch → Keyword filter → LLM synthesize → Text delivery
+```
+
+Duration: 1-2 days
+
+### Full Implementation: All 5 Phases
+
+**Goal**: Complete automated podcast system
+
+Duration: 1-2 weeks (parallel development possible)
+
+---
+
+## Directory Structure
+
+```
+the-nexus/
+└── research/
+    └── deep-dive/
+        ├── ARCHITECTURE.md          # This file
+        ├── IMPLEMENTATION.md        # Detailed dev guide
+        ├── config/
+        │   ├── sources.yaml         # RSS/feed URLs
+        │   ├── keywords.yaml        # Relevance keywords
+        │   └── prompts/
+        │       ├── synthesis.txt    # LLM prompt template
+        │       └── headlines.txt    # Headline-only prompt
+        ├── scripts/
+        │   ├── phase1-aggregate.py
+        │   ├── phase2-rank.py
+        │   ├── phase3-synthesize.py
+        │   ├── phase4-narrate.py
+        │   ├── phase5-deliver.py
+        │   └── run-daily.sh         # Orchestrator
+        └── data/                    # .gitignored
+            ├── raw/                 # Fetched sources
+            ├── scored/              # Ranked items
+            ├── briefings/           # Markdown outputs
+            └── audio/               # MP3 files
+```
+
+---
+
+## Acceptance Criteria
+
+| # | Criterion | Phase |
+|---|-----------|-------|
+| 1 | Zero manual copy-paste | 1-5 |
+| 2 | Daily 6 AM delivery | 5 |
+| 3 | ArXiv coverage (cs.AI, cs.CL, cs.LG) | 1 |
+| 4 | Lab blog coverage | 1 |
+| 5 | Relevance ranking by Hermes context | 2 |
+| 6 | Written briefing generation | 3 |
+| 7 | TTS audio production | 4 |
+| 8 | Telegram voice delivery | 5 |
+| 9 | On-demand `/deepdive` command | 5 |
+
+---
+
+## Risk Matrix
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|------------|--------|------------|
+| ArXiv rate limiting | Medium | Medium | Exponential backoff, caching |
+| RSS feed changes | Medium | Low | Health checks, fallback sources |
+| TTS quality poor | Low (Piper) | High | Cloud override flag |
+| Vector DB too slow | Low | Medium | Batch overnight, cache embeddings |
+| Telegram file size | Low | Medium | Compress audio, split long episodes |
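+
+The exponential backoff mitigation could be sketched as a small retry wrapper. The defaults (5 attempts, 1 s base delay, 60 s cap, up to 10% jitter) are illustrative assumptions, and `with_backoff` is a hypothetical helper. It retries on `OSError`, which is the base class of both `urllib.error.URLError` and `requests.RequestException`.
+
+```python
+import random
+import time
+from typing import Callable, TypeVar
+
+T = TypeVar("T")
+
+
+def with_backoff(
+    fetch: Callable[[], T],
+    max_attempts: int = 5,
+    base_delay: float = 1.0,
+    max_delay: float = 60.0,
+    sleep: Callable[[float], None] = time.sleep,
+) -> T:
+    """Call `fetch` until it succeeds, doubling the delay after each failure."""
+    for attempt in range(max_attempts):
+        try:
+            return fetch()
+        except OSError:
+            if attempt == max_attempts - 1:
+                raise  # out of attempts: surface the last error
+            delay = min(base_delay * 2 ** attempt, max_delay)
+            sleep(delay + random.uniform(0, delay * 0.1))  # small jitter
+    raise ValueError("max_attempts must be at least 1")
+```
+
+The `sleep` parameter is injectable so the retry schedule can be exercised in tests without real waiting.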
+
+---
+
+## Dependencies
+
+### Required
+
+- Python 3.10+
+- `feedparser` (RSS)
+- `requests` (HTTP)
+- `chromadb` or `sqlite3` (storage)
+- Hermes LLM client (synthesis)
+- Piper TTS (local audio)
+
+### Optional
+
+- `sentence-transformers` (embeddings)
+- `ffmpeg` (audio post-processing)
+- ElevenLabs API key (cloud TTS fallback)
+
+---
+
+## Related Issues
+
+- #830 (Parent EPIC)
+- Commandment 6: Human-to-fleet comms
+- #166: Matrix/Conduit deployment
+
+---
+
+## Next Steps
+
+1. **Decision**: Vector DB selection (Chroma vs pgvector)
+2. **Implementation**: Phase 1 skeleton (ArXiv fetcher)
+3. **Integration**: Hermes cron registration
+4. **Testing**: 3-day dry run (text only)
+5. **Enhancement**: Add TTS (Phase 4)
+
+---
+
+*Architecture document version 1.0 — Ezra, 2026-04-05*