# Deep Dive Architecture Specification
## Phase 1: Source Aggregation Layer
### Data Sources
| Source | URL | Format | Frequency |
|--------|-----|--------|-----------|
| arXiv cs.AI | http://export.arxiv.org/rss/cs.AI | RSS | Daily |
| arXiv cs.CL | http://export.arxiv.org/rss/cs.CL | RSS | Daily |
| arXiv cs.LG | http://export.arxiv.org/rss/cs.LG | RSS | Daily |
| OpenAI Blog | https://openai.com/blog/rss.xml | RSS | On-update |
| Anthropic | https://www.anthropic.com/blog/rss.xml | RSS | On-update |
| DeepMind | https://deepmind.google/blog/rss.xml | RSS | On-update |
| Import AI | https://importai.substack.com/feed | RSS | Daily |
| TLDR AI | https://tldr.tech/ai/rss | RSS | Daily |
### Implementation
```python
# aggregator.py
from datetime import datetime
from pathlib import Path
from typing import List

class RSSAggregator:
    def __init__(self, sources: List[SourceConfig]):
        self.sources = sources
        # expanduser() resolves "~" to the real home directory
        self.cache_dir = Path("~/.cache/deepdive/feeds").expanduser()

    async def fetch_all(self, since: datetime) -> List[FeedItem]:
        # Parallel RSS fetch with ETag support.
        # Returns normalized items with title, summary, url, published.
        ...
```
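As a stdlib-only sketch of the normalization step, the snippet below parses RSS 2.0 items with `xml.etree.ElementTree`; a plain dict stands in for the project's `FeedItem` type, and the sample feed is invented for illustration.

```python
# Minimal RSS 2.0 item extraction; dicts stand in for FeedItem.
import xml.etree.ElementTree as ET

def parse_rss(xml_text: str) -> list[dict]:
    """Extract title, summary, url, published from each <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "summary": item.findtext("description", default=""),
            "url": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items

# Hypothetical sample feed for demonstration only
sample = """<rss version="2.0"><channel>
<item><title>GRPO at scale</title><description>RL training notes</description>
<link>https://example.org/grpo</link>
<pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
</channel></rss>"""
print(parse_rss(sample)[0]["title"])  # GRPO at scale
```

A production fetcher would likely use `feedparser` instead, which also handles Atom and malformed feeds.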
## Phase 2: Relevance Engine
### Scoring Algorithm
```python
# relevance.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class RelevanceScorer:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.keywords = [
            "LLM agent", "agent architecture", "tool use",
            "reinforcement learning", "RLHF", "GRPO",
            "transformer", "attention mechanism",
            "Hermes", "local LLM", "llama.cpp"
        ]
        # Pre-compute keyword embeddings once at startup
        self.keyword_emb = self.model.encode(self.keywords)

    def score(self, item: FeedItem) -> float:
        title_emb = self.model.encode(item.title)
        summary_emb = self.model.encode(item.summary)
        # Mean cosine similarity of title and summary to the keyword set
        title_sim = cosine_similarity([title_emb], self.keyword_emb).mean()
        summary_sim = cosine_similarity([summary_emb], self.keyword_emb).mean()
        keyword_sim = (title_sim + summary_sim) / 2
        # Boost for agent/LLM architecture terms in the title
        boost = 1.0
        if any(k in item.title.lower() for k in ["agent", "llm", "transformer"]):
            boost = 1.5
        return keyword_sim * boost
```
### Ranking
- Fetch all items from the last 24 hours
- Score each with `RelevanceScorer`
- Discard items below `min_score` (default: 0.3), then select the top N (default: 10) for the briefing
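The selection step above reduces to a filter and a sort; a minimal sketch, assuming items have already been scored, with `min_score` and `top_n` mirroring the defaults in `config.yaml`:

```python
# Rank pre-scored items: drop low scores, keep the top N by score.
def rank(scored: list[tuple[str, float]], min_score: float = 0.3,
         top_n: int = 10) -> list[str]:
    kept = [(title, s) for title, s in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in kept[:top_n]]

# Hypothetical scores for illustration
scored = [("Agent survey", 0.62), ("Quantum blog", 0.12), ("RLHF update", 0.48)]
print(rank(scored))  # ['Agent survey', 'RLHF update']
```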
## Phase 3: Synthesis Engine
### LLM Prompt
```jinja2
You are an intelligence analyst for the Timmy Foundation fleet.
Produce a concise daily briefing from the following sources.
CONTEXT: We build Hermes (local AI agent framework) and operate
a distributed fleet of AI agents. Focus on developments relevant
to: LLM architecture, agent systems, RL training, local inference.
SOURCES:
{% for item in sources %}
- {{ item.title }} ({{ item.source }})
{{ item.summary }}
{% endfor %}
OUTPUT FORMAT:
## Daily Intelligence Briefing - {{ date }}
### Headlines
- [Source] Key development in one sentence
### Deep Dive: {{ most_relevant.title }}
Why this matters for our work:
[2-3 sentences connecting to Hermes/Timmy context]
### Action Items
- [ ] Any immediate implications
Keep total briefing under 800 words. Tight, professional tone.
```
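The `SOURCES` loop in the template expands to one title line plus an indented summary per item. A dependency-free sketch of that rendering (the real pipeline would use Jinja2; plain string building just shows the shape, with a hypothetical item):

```python
# Render the SOURCES section of the prompt without a template engine.
def render_sources(sources: list[dict]) -> str:
    lines = []
    for item in sources:
        lines.append(f"- {item['title']} ({item['source']})")
        lines.append(f"  {item['summary']}")
    return "\n".join(lines)

sources = [{"title": "GRPO at scale", "source": "arXiv cs.LG",
            "summary": "RL training notes"}]
print(render_sources(sources))
```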
## Phase 4: Audio Generation
### TTS Pipeline
```python
# tts.py
import subprocess
from pathlib import Path

class PiperTTS:
    def __init__(self, model_path: str, voice: str = "en_US-amy-medium"):
        self.model = Path(model_path) / f"{voice}.onnx"
        self.config = Path(model_path) / f"{voice}.onnx.json"

    def generate(self, text: str, output_path: Path) -> Path:
        # Piper reads text on stdin and writes a WAV file
        cmd = [
            "piper",
            "--model", str(self.model),
            "--config", str(self.config),
            "--output_file", str(output_path),
        ]
        # check=True raises CalledProcessError if piper exits non-zero
        subprocess.run(cmd, input=text.encode(), check=True)
        return output_path
```
### Voice Selection
- Base: `en_US-amy-medium` (clear, professional)
- Alternative: `en_GB-southern_english_female-medium`
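The argument layout of the piper invocation can be sanity-checked without the binary installed; a sketch that builds the same command list as `PiperTTS.generate` (the model directory is a hypothetical path):

```python
# Build the piper command line as a pure function for inspection.
from pathlib import Path

def piper_cmd(model_dir: str, voice: str, out: str) -> list[str]:
    model = Path(model_dir) / f"{voice}.onnx"
    return [
        "piper",
        "--model", str(model),
        "--config", str(model) + ".json",
        "--output_file", out,
    ]

print(piper_cmd("/opt/piper/voices", "en_US-amy-medium", "brief.wav"))
```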
## Phase 5: Delivery Pipeline
### Cron Scheduler
```yaml
# cron entry (runs 5:30 AM daily)
deepdive-daily:
  schedule: "30 5 * * *"
  command: "/opt/deepdive/run-pipeline.sh --deliver"
  timezone: "America/New_York"
```
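For reference, the same schedule expressed as a plain crontab entry (using `CRON_TZ`, which is supported by cronie but not every cron implementation):

```
CRON_TZ=America/New_York
30 5 * * * /opt/deepdive/run-pipeline.sh --deliver
```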
### Delivery Integration
```python
# delivery.py
from pathlib import Path

from hermes.gateway import TelegramGateway

class TelegramDelivery:
    def __init__(self, bot_token: str, chat_id: str):
        self.gateway = TelegramGateway(bot_token, chat_id)

    async def deliver(self, audio_path: Path, briefing_text: str):
        # Send the voice message first...
        await self.gateway.send_voice(audio_path)
        # ...then the text summary, truncated below Telegram's 4096-char limit
        await self.gateway.send_message(briefing_text[:4000])
```
### On-Demand Command
```
/deepdive [optional: date or topic filter]
```
Triggers the pipeline immediately, bypassing the cron schedule.
## Data Flow
```
RSS Feeds
┌───────────┐    ┌───────────┐    ┌───────────┐
│ Raw Items │───▶│  Scored   │───▶│  Top 10   │
│ (100-500) │    │ (ranked)  │    │ Selected  │
└───────────┘    └───────────┘    └─────┬─────┘
      ┌─────────────────────────────────┘
      ▼
┌───────────┐    ┌───────────┐    ┌───────────┐
│ Synthesis │───▶│ Briefing  │───▶│  TTS Gen  │
│   (LLM)   │    │   Text    │    │  (Piper)  │
└───────────┘    └───────────┘    └─────┬─────┘
                                ┌───────┴───────┐
                                ▼               ▼
                         Telegram Voice   Telegram Text
```
## Configuration
```yaml
# config.yaml
deepdive:
  schedule:
    daily_time: "06:00"
    timezone: "America/New_York"
  aggregation:
    sources:
      - name: "arxiv_ai"
        url: "http://export.arxiv.org/rss/cs.AI"
        fetch_window_hours: 24
      - name: "openai_blog"
        url: "https://openai.com/blog/rss.xml"
        limit: 5  # max items per source
  relevance:
    model: "all-MiniLM-L6-v2"
    top_n: 10
    min_score: 0.3
    keywords:
      - "LLM agent"
      - "agent architecture"
      - "reinforcement learning"
  synthesis:
    llm_model: "gemma-4-it"  # local via llama-server
    max_summary_length: 800
  tts:
    engine: "piper"
    voice: "en_US-amy-medium"
    speed: 1.0
  delivery:
    method: "telegram"
    channel_id: "-1003664764329"
    send_text_summary: true
```
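A sketch of post-load validation for the config above, assuming the YAML has already been parsed into a dict (e.g. with PyYAML's `safe_load`); the checks and error strings are illustrative, not a fixed schema:

```python
# Validate a parsed deepdive config dict; returns a list of error strings.
def validate(cfg: dict) -> list[str]:
    errors = []
    rel = cfg.get("relevance", {})
    if not 0.0 <= rel.get("min_score", 0.3) <= 1.0:
        errors.append("relevance.min_score must be in [0, 1]")
    if rel.get("top_n", 10) < 1:
        errors.append("relevance.top_n must be >= 1")
    if cfg.get("tts", {}).get("engine") not in ("piper",):
        errors.append("tts.engine must be 'piper'")
    return errors

cfg = {"relevance": {"min_score": 0.3, "top_n": 10}, "tts": {"engine": "piper"}}
print(validate(cfg))  # []
```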
## Implementation Phases
| Phase | Est. Effort | Dependencies | Owner |
|-------|-------------|--------------|-------|
| 1: Aggregation | 3 pts | None | Any agent |
| 2: Relevance | 4 pts | Phase 1 | @gemini |
| 3: Synthesis | 4 pts | Phase 2 | @gemini |
| 4: Audio | 4 pts | Phase 3 | @ezra |
| 5: Delivery | 4 pts | Phase 4 | @ezra |
## API Surface (Tentative)
```python
# deepdive/__init__.py
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional

class DeepDivePipeline:
    async def run(
        self,
        since: Optional[datetime] = None,
        deliver: bool = True,
    ) -> BriefingResult:
        ...

@dataclass
class BriefingResult:
    sources_considered: int
    sources_selected: int
    briefing_text: str
    audio_path: Optional[Path]
    delivered: bool
```
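A hypothetical call-site sketch for this API; the `BriefingResult` dataclass is reproduced so the example runs standalone, and the field values are invented:

```python
# Construct and inspect a BriefingResult as a pipeline caller would.
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class BriefingResult:
    sources_considered: int
    sources_selected: int
    briefing_text: str
    audio_path: Optional[Path]
    delivered: bool

result = BriefingResult(
    sources_considered=240,
    sources_selected=10,
    briefing_text="## Daily Intelligence Briefing ...",
    audio_path=Path("briefing.wav"),
    delivered=True,
)
print(f"{result.sources_selected}/{result.sources_considered} items used")
```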
## Success Metrics
- [ ] Daily delivery within 30 min of scheduled time
- [ ] < 5 minute audio length
- [ ] Relevance precision > 80% (manual audit)
- [ ] Zero API dependencies (full local stack)