the-nexus/research/deep-dive/ARCHITECTURE.md

# Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing

> **Issue**: #830
> **Type**: EPIC (21 story points)
> **Owner**: Ezra (assigned by Alexander)
> **Status**: Architecture complete → Phase 1 ready for implementation

---

## Vision

A fully automated daily intelligence briefing system that delivers a personalized AI-generated podcast briefing with **zero manual input**.

**Inspiration**: NotebookLM workflow (ingest → rank → synthesize → narrate → deliver) — but automated, scheduled, and sovereign.

---

## 5-Phase Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         DEEP DIVE PIPELINE                              │
├───────────────┬───────────────┬───────────────┬───────────────┬─────────┤
│   PHASE 1     │   PHASE 2     │   PHASE 3     │   PHASE 4     │ PHASE 5 │
├───────────────┼───────────────┼───────────────┼───────────────┼─────────┤
│  AGGREGATE    │    RANK       │  SYNTHESIZE   │   NARRATE     │ DELIVER │
├───────────────┼───────────────┼───────────────┼───────────────┼─────────┤
│ ArXiv RSS     │ Embedding     │ LLM briefing  │ TTS engine    │Telegram │
│ Lab feeds     │ similarity    │ generator     │ (Piper /      │ voice   │
│ Newsletters   │ vs codebase   │               │ ElevenLabs)   │ message │
│ HackerNews    │               │               │               │         │
└───────────────┴───────────────┴───────────────┴───────────────┴─────────┘

Timeline: 05:00  →  05:15  →  05:30  →  05:45  →  06:00
          Fetch    Score    Generate   Audio      Deliver
```

---

## Phase 1: Source Aggregation (5 points)

### Data Sources

| Source | URL/API | Frequency | Priority |
|--------|---------|-----------|----------|
| ArXiv cs.AI | `http://export.arxiv.org/rss/cs.AI` | Daily 5 AM | P1 |
| ArXiv cs.CL | `http://export.arxiv.org/rss/cs.CL` | Daily 5 AM | P1 |
| ArXiv cs.LG | `http://export.arxiv.org/rss/cs.LG` | Daily 5 AM | P1 |
| OpenAI Blog | `https://openai.com/blog/rss.xml` | Daily 5 AM | P1 |
| Anthropic | `https://www.anthropic.com/blog/rss.xml` | Daily 5 AM | P1 |
| DeepMind | `https://deepmind.google/blog/rss.xml` | Daily 5 AM | P2 |
| Google Research | `https://research.google/blog/rss.xml` | Daily 5 AM | P2 |
| Import AI | Newsletter (email/IMAP) | Daily 5 AM | P2 |
| TLDR AI | `https://tldr.tech/ai/rss` | Daily 5 AM | P2 |
| HackerNews | `https://hnrss.org/newest?points=100` | Daily 5 AM | P3 |

### Storage Format

```json
{
  "fetched_at": "2025-01-15T05:00:00Z",
  "source": "arxiv_cs_ai",
  "items": [
    {
      "id": "arxiv:2501.01234",
      "title": "Attention is All You Need: The Sequel",
      "abstract": "...",
      "url": "https://arxiv.org/abs/2501.01234",
      "authors": ["..."],
      "published": "2025-01-14",
      "raw_text": "title + abstract"
    }
  ]
}
```

### Output

`data/deep-dive/raw/YYYY-MM-DD-{source}.jsonl`
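
The storage envelope above can be produced from parsed feed entries with a small normalizer. The following is a minimal sketch, not the project's implementation: `build_envelope` and `append_jsonl` are hypothetical helpers, the entry keys (`title`, `summary`, `link`, `id`, `published`, `authors`) are assumptions about what the feed parser returns, and writing one envelope per line is only one possible reading of the `.jsonl` extension.

```python
import json
from datetime import datetime, timezone


def build_envelope(source, entries, fetched_at=None):
    """Wrap parsed feed entries in the storage envelope (hypothetical helper)."""
    fetched_at = (fetched_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    items = []
    for entry in entries:
        title = entry.get("title", "").strip()
        abstract = entry.get("summary", "").strip()
        items.append({
            "id": entry.get("id", entry.get("link", "")),
            "title": title,
            "abstract": abstract,
            "url": entry.get("link", ""),
            "authors": entry.get("authors", []),
            "published": entry.get("published", ""),
            # raw_text is the combined text later phases can embed and keyword-match.
            "raw_text": f"{title}\n\n{abstract}".strip(),
        })
    return {
        "fetched_at": fetched_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source": source,
        "items": items,
    }


def append_jsonl(path, envelope):
    """Append one envelope as a single JSON line (assumed JSONL layout)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(envelope, ensure_ascii=False) + "\n")
```

Keeping the normalizer separate from the network fetch means it can be tested against saved feed samples without hitting any source.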

---

## Phase 2: Relevance Engine (6 points)

### Scoring Approach

**Multi-factor relevance score (0-100)**:

```python
# Each factor is normalized to [0, 1] and the weights sum to 1.0,
# so scaling by 100 yields the 0-100 score.
score = 100 * (
    embedding_similarity * 0.40 +    # Cosine sim vs Hermes codebase
    keyword_match_score * 0.30 +     # Title/abstract keyword hits
    source_priority * 0.15 +         # ArXiv cs.AI = 1.0, HN = 0.3
    recency_boost * 0.10 +           # Today = 1.0, -0.1 per day (floor 0.0)
    user_feedback * 0.05             # Past thumbs up/down
)
```
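
A runnable version of the weighted sum might look like the sketch below. It assumes every factor is normalized to the range 0 to 1 and scales the result to the stated 0-100 range; the `WEIGHTS` table, `recency_boost`, and `relevance_score` names are illustrative, and flooring the recency boost at zero is an assumption about how "-0.1 per day" should behave after ten days.

```python
from datetime import date

# Weights copied from the scoring formula; they sum to 1.0.
WEIGHTS = {
    "embedding_similarity": 0.40,
    "keyword_match_score": 0.30,
    "source_priority": 0.15,
    "recency_boost": 0.10,
    "user_feedback": 0.05,
}


def recency_boost(published, today):
    """1.0 for today, minus 0.1 per day of age, floored at 0.0 (assumed floor)."""
    age_days = max(0, (today - published).days)
    return max(0.0, 1.0 - 0.1 * age_days)


def relevance_score(factors):
    """Combine factors (each clamped to [0, 1]) into a 0-100 score.

    Missing factors count as 0.0, e.g. no user feedback yet.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, factors.get(name, 0.0)))
        total += weight * value
    return 100.0 * total
```

For example, an item with `embedding_similarity=0.5` and every other factor at zero scores about 20, which shows how strongly the embedding term dominates the ranking.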

### Keyword Priority List

```yaml
high_value:
  - "transformer"
  - "attention mechanism"
  - "large language model"
  - "LLM"
  - "agent"
  - "multi-agent"
  - "reasoning"
  - "chain-of-thought"
  - "RLHF"
  - "fine-tuning"
  - "retrieval augmented"
  - "RAG"
  - "vector database"
  - "embedding"
  - "tool use"
  - "function calling"

medium_value:
  - "BERT"
  - "GPT"
  - "training efficiency"
  - "inference optimization"
  - "quantization"
  - "distillation"
```

### Vector Database Decision Matrix

| Option | Pros | Cons | Recommendation |
|--------|------|------|----------------|
| **Chroma** | SQLite-backed, zero ops, local | Scales to ~1M docs max | ✅ **Default** |
| PostgreSQL + pgvector | Enterprise proven, ACID | Requires Postgres | If Nexus uses Postgres |
| FAISS (in-memory) | Fastest search | Rebuild daily | Budget option |

### Output

`data/deep-dive/scored/YYYY-MM-DD-ranked.json`

Top 10 items selected for synthesis.
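
The embedding comparison and the top-10 cut can be illustrated without any vector database. This is a dependency-free sketch under stated assumptions: embeddings are plain lists of floats, each item is a dict with a `score` key, and both function names are hypothetical. In practice the chosen vector store would perform the similarity search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_top(items, k=10):
    """Return the k highest-scoring items, best first."""
    return sorted(items, key=lambda item: item["score"], reverse=True)[:k]
```

Cosine similarity ranges from -1 to 1, so a negative value would need clamping to 0 before it is used as a normalized scoring factor.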

---

## Phase 3: Synthesis Engine (3 points)

### Prompt Architecture

```
You are Deep Dive, a technical intelligence briefing AI for the Hermes/Timmy
agent system. Your audience is an AI agent builder working on sovereign,
local-first AI infrastructure.

SOURCE MATERIAL:
{ranked_items}

GENERATE:
1. **Headlines** (3 bullets): Key announcements in 20 words each
2. **Deep Dives** (2-3): Important papers with technical summary and
   implications for agent systems
3. **Quick Hits** (3-5): Brief mentions worth knowing
4. **Context Bridge**: Connect to Hermes/Timmy current work
   - Mention if papers relate to RL training, tool calling, local inference,
     or multi-agent coordination

TONE: Professional, concise, technically precise
TARGET LENGTH: 1500-2250 words (10-15 min spoken at ~150 WPM)
```
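
Filling the `{ranked_items}` placeholder could be done as in the sketch below. The item fields (`title`, `source`, `score`, `url`, `abstract`) and both helper names are assumptions for illustration. The sketch uses `str.replace` instead of `str.format` so that stray braces in abstracts or elsewhere in the template cannot raise errors.

```python
def format_ranked_items(items):
    """Render ranked items as a numbered plain-text block for the prompt."""
    lines = []
    for rank, item in enumerate(items, start=1):
        lines.append(
            f"{rank}. {item['title']} ({item['source']}, score {item['score']:.0f})"
        )
        lines.append(f"   {item['url']}")
        lines.append(f"   {item['abstract']}")
    return "\n".join(lines)


def build_prompt(template, items):
    """Substitute the rendered items into the synthesis prompt template."""
    return template.replace("{ranked_items}", format_ranked_items(items))
```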

### Output Format (Markdown)

```markdown
# Deep Dive: YYYY-MM-DD

## Headlines
- [Item 1]
- [Item 2]
- [Item 3]

## Deep Dives

### [Paper Title]
**Source**: ArXiv cs.AI | **Authors**: [...]

[Technical summary]

**Why it matters for Hermes**: [...]

## Quick Hits
- [...]

## Context Bridge
[Connection to current work]
```

### Output

`data/deep-dive/briefings/YYYY-MM-DD-briefing.md`

---

## Phase 4: Audio Generation (4 points)

### TTS Engine Options

| Engine | Cost | Quality | Latency | Sovereignty |
|--------|------|---------|---------|-------------|
| **Piper** (local) | Free | Good | Medium | ✅ 100% |
| Coqui TTS (local) | Free | Medium-High | High | ✅ 100% |
| ElevenLabs API | $0.05/min | Excellent | Low | ❌ Cloud |
| OpenAI TTS | $0.015/min | Excellent | Low | ❌ Cloud |
| Google Cloud TTS | $0.004/min | Good | Low | ❌ Cloud |

### Recommendation

**Hybrid approach**:
- Default: Piper (on-device, sovereign)
- Override flag: ElevenLabs/OpenAI for special episodes

### Piper Configuration

```python
# High-quality English voice
model = "en_US-lessac-high"

# Speaking rate: ~150 WPM for technical content
# (length_scale > 1.0 slows speech down)
length_scale = 1.1

# Output format: Piper itself writes WAV; transcode to MP3 with ffmpeg
output_format = "mp3"  # 128kbps
```

### Audio Enhancement

```bash
# Add intro/outro jingles and encode at 128 kbps
ffmpeg -i intro.mp3 -i speech.mp3 -i outro.mp3 \
       -filter_complex "[0:a][1:a][2:a]concat=n=3:v=0:a=1[out]" \
       -map "[out]" -b:a 128k \
       YYYY-MM-DD-deep-dive.mp3
```

### Output

`data/deep-dive/audio/YYYY-MM-DD-deep-dive.mp3` (12-18 MB)
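
The relationship between briefing length, speaking rate, and file size is simple arithmetic, sketched here as a sanity check. It assumes the ~150 WPM rate and 128 kbps constant-bitrate MP3 mentioned above and ignores jingles and container overhead; both function names are hypothetical.

```python
def estimate_duration_seconds(word_count, words_per_minute=150):
    """Spoken duration in whole seconds at a constant speaking rate."""
    return round(word_count / words_per_minute * 60)


def estimate_mp3_megabytes(duration_seconds, bitrate_kbps=128):
    """Approximate constant-bitrate MP3 size in megabytes (10**6 bytes)."""
    return duration_seconds * bitrate_kbps * 1000 / 8 / 1_000_000
```

Under these assumptions a 15-minute episode (900 seconds) comes to about 14.4 MB, which sits inside the 12-18 MB range. The same duration estimate can supply the `duration` value passed to Telegram.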

---

## Phase 5: Delivery Pipeline (3 points)

### Cron Schedule

```cron
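# Cron evaluates times in the host's local time zone (target: US Eastern)

# Option A: single orchestrator, started at 05:00 so delivery lands by 06:00
0 5 * * * cd /path/to/deep-dive && ./run-daily.sh

# Option B: staggered phases for visibility
0 5 * * * cd /path/to/deep-dive && ./phase1-fetch.sh
15 5 * * * cd /path/to/deep-dive && ./phase2-rank.sh
30 5 * * * cd /path/to/deep-dive && ./phase3-synthesize.sh
45 5 * * * cd /path/to/deep-dive && ./phase4-narrate.sh
0 6 * * * cd /path/to/deep-dive && ./phase5-deliver.sh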
```
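
Staggered cron entries fire at fixed times whether or not the previous phase finished, so a sequential orchestrator is the safer of the two options when a phase can overrun its slot. The sketch below shows the idea in Python; it is an illustration only, `run_pipeline` is a hypothetical name, and the actual `run-daily.sh` may be structured differently.

```python
import logging

logger = logging.getLogger("deep_dive")


def run_pipeline(phases):
    """Run (name, callable) pairs in order and stop at the first failure.

    Returns the names of the phases that completed, so the caller can
    tell how far the run got.
    """
    completed = []
    for name, step in phases:
        try:
            step()
        except Exception:
            logger.exception("Phase %s failed; skipping remaining phases", name)
            break
        completed.append(name)
    return completed
```

Stopping at the first failure avoids delivering a stale or half-built briefing; the caller can compare the returned list against the expected five phases and alert when they differ.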

### Telegram Integration

```python
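# Via Hermes gateway or direct bot
audio_path = f"data/deep-dive/audio/{date}-deep-dive.mp3"  # date as YYYY-MM-DD
with open(audio_path, "rb") as voice_file:  # close the handle after upload
    bot.send_voice(
        chat_id=TELEGRAM_HOME_CHANNEL,
        voice=voice_file,
        caption=f"📻 Deep Dive for {date}: {headline_summary}",
        duration=estimated_seconds,
    )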
```

### On-Demand Command

```
/deepdive [date]

# Fetches briefing for specified date (default: today)
# If audio exists: sends voice message
# If not: generates on-demand (may take 2-3 min)
```
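
Parsing the command and locating the cached audio could look like the following sketch. It assumes the optional date argument uses the same `YYYY-MM-DD` form as the output filenames and that group chats may append `@botname` to the command; `parse_deepdive_command`, `audio_path_for`, and `AUDIO_DIR` are hypothetical names.

```python
from datetime import date, datetime
from pathlib import Path

AUDIO_DIR = Path("data/deep-dive/audio")  # matches the Phase 4 output path


def parse_deepdive_command(text, today):
    """Return the requested briefing date, defaulting to today.

    Raises ValueError for a different command, extra arguments,
    or a date that is not YYYY-MM-DD.
    """
    parts = text.split()
    if not parts or parts[0].split("@")[0] != "/deepdive":
        raise ValueError("not a /deepdive command")
    if len(parts) == 1:
        return today
    if len(parts) > 2:
        raise ValueError("usage: /deepdive [YYYY-MM-DD]")
    return datetime.strptime(parts[1], "%Y-%m-%d").date()


def audio_path_for(day):
    """Path where the audio for a given day is expected to live."""
    return AUDIO_DIR / f"{day.isoformat()}-deep-dive.mp3"
```

A handler would check `audio_path_for(day).exists()` to decide between sending the cached voice message and triggering on-demand generation.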

---

## Implementation Roadmap

### Quick Win: Phase 1 Only (2-3 hours)

**Goal**: Prove value with text-only digests

```bash
# 1. ArXiv RSS fetcher
# 2. Simple keyword filter
# 3. Text digest via Telegram
# 4. Cron schedule
#
# Result: Daily 8 AM text briefing
```

### MVP: Phases 1, 3, and 5 (Skip 2 and 4)

**Goal**: Working system without embedding/audio complexity

```
Fetch → Keyword filter → LLM synthesize → Text delivery
```

Duration: 1-2 days

### Full Implementation: All 5 Phases

**Goal**: Complete automated podcast system

Duration: 1-2 weeks (parallel development possible)

---

## Directory Structure

```
the-nexus/
└── research/
    └── deep-dive/
        ├── ARCHITECTURE.md          # This file
        ├── IMPLEMENTATION.md        # Detailed dev guide
        ├── config/
        │   ├── sources.yaml         # RSS/feed URLs
        │   ├── keywords.yaml        # Relevance keywords
        │   └── prompts/
        │       ├── synthesis.txt    # LLM prompt template
        │       └── headlines.txt    # Headline-only prompt
        ├── scripts/
        │   ├── phase1-aggregate.py
        │   ├── phase2-rank.py
        │   ├── phase3-synthesize.py
        │   ├── phase4-narrate.py
        │   ├── phase5-deliver.py
        │   └── run-daily.sh         # Orchestrator
        └── data/                    # .gitignored
            ├── raw/                 # Fetched sources
            ├── scored/              # Ranked items
            ├── briefings/           # Markdown outputs
            └── audio/               # MP3 files
```

---

## Acceptance Criteria

| # | Criterion | Phase |
|---|-----------|-------|
| 1 | Zero manual copy-paste | 1-5 |
| 2 | Daily 6 AM delivery | 5 |
| 3 | ArXiv coverage (cs.AI, cs.CL, cs.LG) | 1 |
| 4 | Lab blog coverage | 1 |
| 5 | Relevance ranking by Hermes context | 2 |
| 6 | Written briefing generation | 3 |
| 7 | TTS audio production | 4 |
| 8 | Telegram voice delivery | 5 |
| 9 | On-demand `/deepdive` command | 5 |

---

## Risk Matrix

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| ArXiv rate limiting | Medium | Medium | Exponential backoff, caching |
| RSS feed changes | Medium | Low | Health checks, fallback sources |
| TTS quality poor | Low (Piper) | High | Cloud override flag |
| Vector DB too slow | Low | Medium | Batch overnight, cache embeddings |
| Telegram file size | Low | Medium | Compress audio, split long episodes |
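
The exponential backoff mitigation for rate limiting can be sketched as a small retry wrapper. This is a generic illustration, not code from the project: `fetch_with_backoff` is a hypothetical helper, `fetch` is any zero-argument callable that raises on failure, and the default delays and jitter are assumptions to tune against each source's limits.

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    The last failure is re-raised once max_attempts is exhausted. The
    sleep function is injectable so tests do not actually wait.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so parallel fetchers do not sync up.
            sleep(delay + random.uniform(0, delay / 2))
```

With the defaults, the waits are roughly 1, 2, 4, and 8 seconds plus jitter, which keeps a full retry cycle well inside the 15-minute fetch window.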

---

## Dependencies

### Required

- Python 3.10+
- `feedparser` (RSS)
- `requests` (HTTP)
- `chromadb` or `sqlite3` (storage)
- Hermes LLM client (synthesis)
- Piper TTS (local audio)

### Optional

- `sentence-transformers` (embeddings)
- `ffmpeg` (audio post-processing)
- ElevenLabs API key (cloud TTS fallback)

---

## Related Issues

- #830 (Parent EPIC)
- Commandment 6: Human-to-fleet comms
- #166: Matrix/Conduit deployment

---

## Next Steps

1. **Decision**: Vector DB selection (Chroma vs pgvector)
2. **Implementation**: Phase 1 skeleton (ArXiv fetcher)
3. **Integration**: Hermes cron registration
4. **Testing**: 3-day dry run (text only)
5. **Enhancement**: Add TTS (Phase 4)

---

*Architecture document version 1.0 — Ezra, 2026-04-05*