ezra-environment/the-nexus/deepdive/docs/OPERATIONS.md
Ezra 9f010ad044 [BURN] Deep Dive scaffold: 5-phase sovereign NotebookLM (#830)
Complete production-ready scaffold for automated daily AI intelligence briefings:

- Phase 1: Source aggregation (arXiv + lab blogs)
- Phase 2: Relevance ranking (keyword + source authority scoring)
- Phase 3: LLM synthesis (Hermes-context briefing generation)
- Phase 4: TTS audio (edge-tts/OpenAI/ElevenLabs)
- Phase 5: Telegram delivery (voice message)

Deliverables:
- docs/ARCHITECTURE.md (9000+ lines) - system design
- docs/OPERATIONS.md - runbook and troubleshooting
- 5 executable phase scripts (bin/)
- Full pipeline orchestrator (run_full_pipeline.py)
- requirements.txt, README.md

Addresses all 9 acceptance criteria from #830.
Ready for host selection, credential config, and cron activation.

Author: Ezra | Burn mode | 2026-04-05

# Deep Dive Operations Runbook
**Issue**: the-nexus#830
**Maintainer**: Operations team (post-deployment)
---
## Quick Start
```bash
# 1. Install dependencies
cd deepdive && pip install -r requirements.txt
# 2. Configure environment
cp config/.env.example config/.env
# Edit config/.env with your API keys
# 3. Test full pipeline
./bin/run_full_pipeline.py --date=$(date +%Y-%m-%d) --dry-run
# 4. Run for real
./bin/run_full_pipeline.py
```
---
## Daily Operations
### Manual Run (On-Demand)
```bash
# Run full pipeline for today
./bin/run_full_pipeline.py
# Run specific phases
./bin/run_full_pipeline.py --phases 1,2 # Just aggregate and rank
./bin/run_full_pipeline.py --phase3-only # Regenerate briefing
```
### Cron Setup (Scheduled)
```bash
# Edit crontab
crontab -e
# Run daily at 6 AM (assumes the server clock is set to EST)
0 6 * * * /opt/deepdive/bin/run_full_pipeline.py >> /var/log/deepdive.log 2>&1
```
Systemd timer alternative:
```bash
sudo cp config/deepdive.service /etc/systemd/system/
sudo cp config/deepdive.timer /etc/systemd/system/
sudo systemctl enable deepdive.timer
sudo systemctl start deepdive.timer
```
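The shipped `config/deepdive.service` and `config/deepdive.timer` files are not reproduced in this runbook; as a reference point, a minimal timer unit for the same daily 6 AM schedule might look like this (sketch only, unit contents assumed):

```ini
# deepdive.timer (sketch) -- pairs with a deepdive.service that runs
# /opt/deepdive/bin/run_full_pipeline.py
[Unit]
Description=Daily Deep Dive briefing

[Timer]
OnCalendar=*-*-* 06:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

`Persistent=true` tells systemd to fire a missed run at the next boot, which plain cron does not do.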
---
## Monitoring
### Check Today's Run
```bash
# View logs
tail -f /var/log/deepdive.log
# Check data directories
ls -la data/sources/$(date +%Y-%m-%d)/
ls -la data/briefings/
ls -la data/audio/
# Verify Telegram delivery
curl -s "https://api.telegram.org/bot${TOKEN}/getUpdates" | jq '.result[-1]'
```
### Common Issues
| Issue | Cause | Fix |
|-------|-------|-----|
| No sources aggregated | arXiv API down | Wait and retry; check https://status.arxiv.org |
| Empty briefing | No relevant sources | Lower relevance threshold in config |
| TTS fails | No API credits | Switch to `edge-tts` (free) |
| Telegram not delivering | Bot token invalid | Regenerate bot token via @BotFather |
| Audio too long | Briefing too verbose | Reduce `max_chars` in the phase-4 config |
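
Transient failures (the "arXiv API down" row above) are worth retrying automatically rather than losing a day's run. A generic exponential-backoff sketch, not the pipeline's actual code:

```python
import time

def fetch_with_retry(fetch, attempts=4, base_delay=2.0):
    """Call `fetch` (a zero-arg callable), retrying with exponential
    backoff on failure; re-raises the last error if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:  # narrow to e.g. urllib.error.URLError in real use
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
```

Wrapping the phase-1 HTTP calls this way turns a brief arXiv outage into a delay instead of an empty `data/sources/` directory.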
---
## Configuration
### Source Management
Edit `config/sources.yaml`:
```yaml
sources:
  arxiv:
    categories:
      - cs.AI
      - cs.CL
      - cs.LG
    max_items: 50
  blogs:
    openai: https://openai.com/blog/rss.xml
    anthropic: https://www.anthropic.com/news.atom
    deepmind: https://deepmind.google/blog/rss.xml
    max_items_per_source: 10
  newsletters:
    - name: "Import AI"
      email_filter: "importai@jack-clark.net"
```
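For the arXiv entry, the aggregator presumably turns `categories` and `max_items` into an arXiv API query. A stdlib-only sketch of that URL construction (the real phase-1 code may differ; `arxiv_query_url` is a hypothetical helper):

```python
from urllib.parse import urlencode

def arxiv_query_url(categories, max_items):
    """Build an arXiv API query URL from the sources.yaml fields above."""
    # OR the categories together, newest submissions first.
    search = " OR ".join(f"cat:{c}" for c in categories)
    params = urlencode({
        "search_query": search,
        "max_results": max_items,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    })
    return f"http://export.arxiv.org/api/query?{params}"
```

For example, `arxiv_query_url(["cs.AI", "cs.CL", "cs.LG"], 50)` yields a single request covering all three configured categories.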
### Relevance Tuning
Edit `config/relevance.yaml`:
```yaml
keywords:
  hermes: 3.0   # Boost Hermes mentions
  agent: 1.5
  mcp: 2.0
thresholds:
  min_score: 2.0   # Drop items below this
  max_items: 20    # Top N to keep
```
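The scoring these settings drive can be sketched as keyword-weight summation plus a threshold cut. This mirrors `relevance.yaml` but is illustrative only; the real ranker also factors in source authority:

```python
def relevance_score(text, keywords):
    """Sum the weights of every configured keyword found in the text."""
    lowered = text.lower()
    return sum(w for kw, w in keywords.items() if kw in lowered)

def rank(items, keywords, min_score=2.0, max_items=20):
    """Score items, drop those below min_score, keep the top max_items."""
    scored = sorted(
        ((relevance_score(t, keywords), t) for t in items),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [t for s, t in scored if s >= min_score][:max_items]
```

Raising `min_score` trims marginal items; raising a keyword weight pulls matching items toward the top of the briefing.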
### LLM Selection
Environment variable:
```bash
export DEEPDIVE_LLM_MODEL="openai/gpt-4o-mini"
# or
export DEEPDIVE_LLM_MODEL="anthropic/claude-3-haiku"
# or
export DEEPDIVE_LLM_MODEL="hermes/local"
```
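The `provider/model` convention is inferred from the examples above; splitting it for dispatch might look like this (`llm_selection` is a hypothetical helper, not a scaffold API):

```python
import os

def llm_selection(default="openai/gpt-4o-mini"):
    """Split DEEPDIVE_LLM_MODEL ("provider/model") into its two parts."""
    value = os.environ.get("DEEPDIVE_LLM_MODEL", default)
    provider, _, model = value.partition("/")
    return provider, model
```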
### TTS Selection
Environment variable:
```bash
export DEEPDIVE_TTS_PROVIDER="edge-tts" # Free, recommended
# or
export DEEPDIVE_TTS_PROVIDER="openai" # Requires OPENAI_API_KEY
# or
export DEEPDIVE_TTS_PROVIDER="elevenlabs" # Best quality
```
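A reasonable phase-4 behavior is to honor `DEEPDIVE_TTS_PROVIDER` when its credentials work, and otherwise fall back down the quality/cost ladder to the free `edge-tts`. A sketch (hypothetical helper; the shipped script may behave differently):

```python
import os

# Descending preference; edge-tts is the free fallback of last resort.
_TTS_FALLBACK = ["elevenlabs", "openai", "edge-tts"]

def pick_tts_provider(available):
    """Return the configured provider if usable, else the best available
    fallback. `available` is the set of providers with working credentials."""
    configured = os.environ.get("DEEPDIVE_TTS_PROVIDER", "edge-tts")
    if configured in available:
        return configured
    for provider in _TTS_FALLBACK:
        if provider in available:
            return provider
    raise RuntimeError("no TTS provider available")
```

This matches the "TTS fails / no API credits" row in the troubleshooting table: the run degrades to `edge-tts` instead of failing outright.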
---
## Telegram Bot Setup
1. **Create Bot**: Message @BotFather, create a new bot, and save the token it returns
2. **Get Chat ID**: Message bot, then:
```bash
curl https://api.telegram.org/bot<TOKEN>/getUpdates
```
3. **Configure**:
```bash
export DEEPDIVE_TELEGRAM_BOT_TOKEN="<token>"
export DEEPDIVE_TELEGRAM_CHAT_ID="<chat_id>"
```
---
## Maintenance
### Weekly
- [ ] Check disk space in `data/` directory
- [ ] Review log for errors: `grep ERROR /var/log/deepdive.log`
- [ ] Verify cron/timer is running: `systemctl status deepdive.timer`
### Monthly
- [ ] Archive old audio: `find data/audio -mtime +30 -exec gzip {} \;`
- [ ] Review source quality: are rankings accurate?
- [ ] Update API keys if approaching limits
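
If you prefer to drive the audio-archiving step from Python (e.g. inside a maintenance script) rather than the `find` one-liner above, the same effect (gzip files older than 30 days, then delete the originals) looks like:

```python
import gzip
import shutil
import time
from pathlib import Path

def archive_old_audio(audio_dir, max_age_days=30):
    """Gzip .mp3 files older than max_age_days and remove the originals."""
    cutoff = time.time() - max_age_days * 86400
    for mp3 in Path(audio_dir).glob("*.mp3"):
        if mp3.stat().st_mtime < cutoff:
            with mp3.open("rb") as src, gzip.open(f"{mp3}.gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            mp3.unlink()  # keep only the compressed copy
```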
---
## Troubleshooting
### Debug Mode
Run phases individually with verbose output:
```bash
# Phase 1 with verbose
python -c "
import asyncio
from bin.phase1_aggregate import SourceAggregator
from pathlib import Path
agg = SourceAggregator(Path('data'), '2026-04-05')
asyncio.run(agg.run())
"
```
### Reset State
Delete and regenerate:
```bash
rm -rf data/sources/2026-04-*
rm -rf data/ranked/*.json
rm -rf data/briefings/*.md
rm -rf data/audio/*.mp3
```
### Test Telegram
```bash
curl -X POST \
https://api.telegram.org/bot<TOKEN>/sendMessage \
-d chat_id=<CHAT_ID> \
-d text="Deep Dive test message"
```
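The same check from Python, stdlib only, using the Bot API's `sendMessage` method (`build_send_message` is an illustrative helper, not part of the scaffold):

```python
import json
from urllib import parse, request

def build_send_message(token, chat_id, text):
    """Build the sendMessage request; returns (url, form-encoded body)."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return url, body

def send(token, chat_id, text):
    """POST the message and return Telegram's JSON response."""
    url, body = build_send_message(token, chat_id, text)
    with request.urlopen(request.Request(url, data=body)) as resp:
        return json.load(resp)
```

A successful `send(...)` returns a JSON object with `"ok": true`; anything else points at the token or chat ID.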
---
## Security
- API keys stored in `config/.env` (gitignored)
- `.env` file permissions: `chmod 600 config/.env`
- Telegram bot token: regenerate if compromised
- LLM API usage: monitor for unexpected spend
---
**Issue Ref**: #830
**Last Updated**: 2026-04-05 by Ezra