Ezra 9f010ad044 [BURN] Deep Dive scaffold: 5-phase sovereign NotebookLM (#830)
Complete production-ready scaffold for automated daily AI intelligence briefings:

- Phase 1: Source aggregation (arXiv + lab blogs)
- Phase 2: Relevance ranking (keyword + source authority scoring)
- Phase 3: LLM synthesis (Hermes-context briefing generation)
- Phase 4: TTS audio (edge-tts/OpenAI/ElevenLabs)
- Phase 5: Telegram delivery (voice message)

Deliverables:
- docs/ARCHITECTURE.md (9000+ lines) - system design
- docs/OPERATIONS.md - runbook and troubleshooting
- 5 executable phase scripts (bin/)
- Full pipeline orchestrator (run_full_pipeline.py)
- requirements.txt, README.md

Addresses all 9 acceptance criteria from #830.
Ready for host selection, credential config, and cron activation.

Author: Ezra | Burn mode | 2026-04-05

# Deep Dive Operations Runbook

Issue: the-nexus#830
Maintainer: Operations team (post-deployment)


## Quick Start

```bash
# 1. Install dependencies
cd deepdive && pip install -r requirements.txt

# 2. Configure environment
cp config/.env.example config/.env
# Edit config/.env with your API keys

# 3. Test the full pipeline (dry run, no side effects)
./bin/run_full_pipeline.py --date=$(date +%Y-%m-%d) --dry-run

# 4. Run for real
./bin/run_full_pipeline.py
```

## Daily Operations

### Manual Run (On-Demand)

```bash
# Run full pipeline for today
./bin/run_full_pipeline.py

# Run specific phases
./bin/run_full_pipeline.py --phases 1,2    # Just aggregate and rank
./bin/run_full_pipeline.py --phase3-only   # Regenerate briefing
```

### Cron Setup (Scheduled)

```bash
# Edit crontab
crontab -e

# Add a daily 6 AM run (assumes the server clock is set to EST;
# adjust the hour field if it is not)
0 6 * * * /opt/deepdive/bin/run_full_pipeline.py >> /var/log/deepdive.log 2>&1
```

Systemd timer alternative:

```bash
sudo cp config/deepdive.service /etc/systemd/system/
sudo cp config/deepdive.timer /etc/systemd/system/
sudo systemctl enable deepdive.timer
sudo systemctl start deepdive.timer
```

## Monitoring

### Check Today's Run

```bash
# View logs
tail -f /var/log/deepdive.log

# Check data directories
ls -la data/sources/$(date +%Y-%m-%d)/
ls -la data/briefings/
ls -la data/audio/

# Verify Telegram delivery
curl -s "https://api.telegram.org/bot${TOKEN}/getUpdates" | jq '.result[-1]'
```
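If you would rather script this check than eyeball directory listings, a minimal sketch follows. The per-date `sources/` directory matches the layout above; the date-stamped briefing and MP3 filenames are assumptions, so adjust the globs to the real naming scheme.

```python
from datetime import date
from pathlib import Path

def check_todays_run(data_dir: Path, run_date=None):
    """Return a list of missing artifacts for the given run date."""
    run_date = run_date or date.today().isoformat()
    missing = []
    # Phase 1 output: one directory of raw sources per day
    if not (data_dir / "sources" / run_date).is_dir():
        missing.append(f"sources/{run_date}/")
    # Phase 3 / Phase 4 outputs: a briefing and an audio file stamped with the date
    for sub, ext in (("briefings", ".md"), ("audio", ".mp3")):
        if not list((data_dir / sub).glob(f"*{run_date}*{ext}")):
            missing.append(f"{sub}/*{run_date}*{ext}")
    return missing
```

An empty return value means all output stages produced artifacts for the day; anything else names what to investigate.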

### Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| No sources aggregated | arXiv API down | Wait and retry; check http://status.arxiv.org |
| Empty briefing | No relevant sources | Lower the relevance threshold in config |
| TTS fails | No API credits | Switch to edge-tts (free) |
| Telegram not delivering | Bot token invalid | Regenerate the bot token via @BotFather |
| Audio too long | Briefing too verbose | Reduce `max_chars` in the Phase 4 config |
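For the "arXiv API down" case, "wait and retry" can be automated instead of done by hand. The scaffold's actual retry policy isn't documented here; a generic exponential-backoff wrapper that a Phase 1 fetch could be routed through looks like:

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=2.0):
    """Call fetch(); on failure, sleep with exponential backoff and try again."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the real error
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```

Transient outages then resolve themselves within a run, and persistent ones still fail loudly in the log.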

## Configuration

### Source Management

Edit `config/sources.yaml`:

```yaml
sources:
  arxiv:
    categories:
      - cs.AI
      - cs.CL
      - cs.LG
    max_items: 50

  blogs:
    openai: https://openai.com/blog/rss.xml
    anthropic: https://www.anthropic.com/news.atom
    deepmind: https://deepmind.google/blog/rss.xml
    max_items_per_source: 10

  newsletters:
    - name: "Import AI"
      email_filter: "importai@jack-clark.net"
```
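Because a malformed sources config silently yields an empty briefing (see Common Issues), it can be worth sanity-checking the parsed config before Phase 1 runs. The helper below is hypothetical, not part of the scaffold; it mirrors the keys shown above, including `max_items_per_source` living alongside the blog URLs:

```python
def validate_sources(cfg: dict):
    """Return human-readable problems found in a parsed sources config."""
    problems = []
    arxiv = cfg.get("arxiv", {})
    if not arxiv.get("categories"):
        problems.append("arxiv.categories is empty")
    if arxiv.get("max_items", 0) <= 0:
        problems.append("arxiv.max_items must be positive")
    for name, url in cfg.get("blogs", {}).items():
        # max_items_per_source sits next to the feed URLs in this schema
        if name != "max_items_per_source" and not str(url).startswith("http"):
            problems.append(f"blogs.{name} is not a URL")
    return problems
```

Run it against the output of your YAML parser and fail fast on a non-empty list.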

### Relevance Tuning

Edit `config/relevance.yaml`:

```yaml
keywords:
  hermes: 3.0        # Boost Hermes mentions
  agent: 1.5
  mcp: 2.0

thresholds:
  min_score: 2.0     # Drop items below this
  max_items: 20      # Top N to keep
```
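To see how these settings interact, here is a deliberately simplified scorer: it only counts weighted keyword hits in titles, whereas the real Phase 2 also weighs source authority. Treat it as an illustration of the thresholds, not the actual algorithm:

```python
def rank_items(items, keywords, min_score=2.0, max_items=20):
    """Score items by weighted keyword hits in the title, keep the top N."""
    def score(item):
        title = item["title"].lower()
        return sum(w for kw, w in keywords.items() if kw in title)
    scored = sorted(((score(it), it) for it in items), key=lambda p: -p[0])
    return [it for s, it in scored if s >= min_score][:max_items]
```

With the YAML above, a title mentioning both Hermes and agents scores 4.5 and survives, while one with no keyword hits scores 0 and is dropped.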

### LLM Selection

Set via environment variable:

```bash
export DEEPDIVE_LLM_MODEL="openai/gpt-4o-mini"
# or
export DEEPDIVE_LLM_MODEL="anthropic/claude-3-haiku"
# or
export DEEPDIVE_LLM_MODEL="hermes/local"
```
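The `provider/model` form suggests a simple split at the first slash. A hypothetical resolver might look like the following (defaulting to `hermes/local` is an assumption, not documented scaffold behavior):

```python
import os

def resolve_llm_model(env=os.environ, default="hermes/local"):
    """Split DEEPDIVE_LLM_MODEL into (provider, model), falling back to a default."""
    value = env.get("DEEPDIVE_LLM_MODEL", default)
    provider, _, model = value.partition("/")
    if not model:
        raise ValueError(f"expected provider/model, got {value!r}")
    return provider, model
```

Failing fast on a malformed value keeps a typo in the variable from surfacing only in Phase 3.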

### TTS Selection

Set via environment variable:

```bash
export DEEPDIVE_TTS_PROVIDER="edge-tts"      # Free, recommended
# or
export DEEPDIVE_TTS_PROVIDER="openai"        # Requires OPENAI_API_KEY
# or
export DEEPDIVE_TTS_PROVIDER="elevenlabs"    # Best quality
```
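A sketch of provider selection with a fallback to the free option when a paid provider's key is missing. The fallback policy and the `ELEVENLABS_API_KEY` variable name are assumptions; only `OPENAI_API_KEY` appears above:

```python
import os

KNOWN_TTS_PROVIDERS = {"edge-tts", "openai", "elevenlabs"}

def resolve_tts_provider(env=os.environ):
    """Pick the TTS provider from the environment, defaulting to the free option."""
    provider = env.get("DEEPDIVE_TTS_PROVIDER", "edge-tts")
    if provider not in KNOWN_TTS_PROVIDERS:
        raise ValueError(f"unknown TTS provider: {provider!r}")
    # Paid providers need their API key present; fall back rather than fail in Phase 4
    required_key = {"openai": "OPENAI_API_KEY",
                    "elevenlabs": "ELEVENLABS_API_KEY"}.get(provider)
    if required_key and not env.get(required_key):
        return "edge-tts"
    return provider
```

This matches the "TTS fails / no API credits" fix in Common Issues: degrade to edge-tts instead of skipping the day's audio.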

### Telegram Bot Setup

1. **Create bot:** message @BotFather, create a new bot, and copy the token.
2. **Get chat ID:** send your bot a message, then:

   ```bash
   curl https://api.telegram.org/bot<TOKEN>/getUpdates
   ```

3. **Configure:**

   ```bash
   export DEEPDIVE_TELEGRAM_BOT_TOKEN="<token>"
   export DEEPDIVE_TELEGRAM_CHAT_ID="<chat_id>"
   ```
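Phase 5 delivers the MP3 via Telegram's `sendVoice` method. A minimal helper that assembles the request (the upload itself is a multipart POST and is left to the phase script) could look like:

```python
def build_send_voice_request(token, chat_id, caption=""):
    """Assemble the URL and form fields for Telegram's sendVoice method.

    The MP3 itself must be attached as the multipart field named 'voice'.
    """
    url = f"https://api.telegram.org/bot{token}/sendVoice"
    data = {"chat_id": chat_id, "caption": caption}
    return url, data
```

Keeping URL construction in one place also makes the "Test Telegram" curl below easy to mirror in code.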
    

## Maintenance

### Weekly

- Check disk space in the `data/` directory
- Review the log for errors: `grep ERROR /var/log/deepdive.log`
- Verify the cron job/timer is running: `systemctl status deepdive.timer`

### Monthly

- Archive old audio: `find data/audio -mtime +30 -exec gzip {} \;`
- Review source quality: are the rankings still accurate?
- Update API keys if usage is approaching limits

## Troubleshooting

### Debug Mode

Run phases individually with verbose output:

```bash
# Phase 1 on its own
python -c "
import asyncio
from pathlib import Path
from bin.phase1_aggregate import SourceAggregator

agg = SourceAggregator(Path('data'), '2026-04-05')
asyncio.run(agg.run())
"
```

### Reset State

Delete the generated artifacts, then re-run the pipeline:

```bash
rm -rf data/sources/2026-04-*
rm -rf data/ranked/*.json
rm -rf data/briefings/*.md
rm -rf data/audio/*.mp3
```

### Test Telegram

```bash
curl -X POST \
  https://api.telegram.org/bot<TOKEN>/sendMessage \
  -d chat_id=<CHAT_ID> \
  -d text="Deep Dive test message"
```

## Security

- API keys are stored in `config/.env` (gitignored)
- Restrict `.env` file permissions: `chmod 600 config/.env`
- Regenerate the Telegram bot token immediately if it is compromised
- Monitor LLM API usage for unexpected spend

Issue Ref: #830
Last Updated: 2026-04-05 by Ezra