# Deep Dive Architecture Specification

## Phase 1: Source Aggregation Layer

### Data Sources

| Source | URL | Format | Frequency |
|--------|-----|--------|-----------|
| arXiv cs.AI | http://export.arxiv.org/rss/cs.AI | RSS | Daily |
| arXiv cs.CL | http://export.arxiv.org/rss/cs.CL | RSS | Daily |
| arXiv cs.LG | http://export.arxiv.org/rss/cs.LG | RSS | Daily |
| OpenAI Blog | https://openai.com/blog/rss.xml | RSS | On-update |
| Anthropic | https://www.anthropic.com/blog/rss.xml | RSS | On-update |
| DeepMind | https://deepmind.google/blog/rss.xml | RSS | On-update |
| Import AI | https://importai.substack.com/feed | RSS | Daily |
| TLDR AI | https://tldr.tech/ai/rss | RSS | Daily |

### Implementation

```python
# aggregator.py
from datetime import datetime
from pathlib import Path
from typing import List


class RSSAggregator:
    def __init__(self, sources: List[SourceConfig]):
        self.sources = sources
        # expanduser() so the "~" actually resolves to $HOME
        self.cache_dir = Path("~/.cache/deepdive/feeds").expanduser()
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    async def fetch_all(self, since: datetime) -> List[FeedItem]:
        # Parallel RSS fetch with ETag support; drops items older than
        # `since`. Returns normalized items with title, summary, url,
        # published.
        ...
```

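The `fetch_all` stub promises normalized items; a minimal sketch of that normalization step, assuming feedparser-style entry dicts (the key names `title`, `summary`, `link`, `published_ts` are illustrative, not a confirmed schema):

```python
# Sketch of the normalization fetch_all() is expected to perform.
# The entry keys are assumptions modeled on feedparser output.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FeedItem:
    title: str
    summary: str
    url: str
    published: datetime


def normalize(entry: dict) -> FeedItem:
    # Strip whitespace and fall back to empty strings for missing fields
    return FeedItem(
        title=entry.get("title", "").strip(),
        summary=entry.get("summary", "").strip(),
        url=entry.get("link", ""),
        published=datetime.fromtimestamp(
            entry.get("published_ts", 0), tz=timezone.utc
        ),
    )
```
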
## Phase 2: Relevance Engine

### Scoring Algorithm

```python
# relevance.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


class RelevanceScorer:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.keywords = [
            "LLM agent", "agent architecture", "tool use",
            "reinforcement learning", "RLHF", "GRPO",
            "transformer", "attention mechanism",
            "Hermes", "local LLM", "llama.cpp"
        ]
        # Pre-compute keyword embeddings
        self.keyword_emb = self.model.encode(self.keywords)

    def score(self, item: FeedItem) -> float:
        title_emb = self.model.encode(item.title)
        summary_emb = self.model.encode(item.summary)

        # Mean cosine similarity to the keyword set; titles weighted
        # above summaries, which tend to be noisier (weights illustrative)
        title_sim = cosine_similarity([title_emb], self.keyword_emb).mean()
        summary_sim = cosine_similarity([summary_emb], self.keyword_emb).mean()
        keyword_sim = 0.7 * title_sim + 0.3 * summary_sim

        # Boost for agent/LLM architecture terms
        boost = 1.0
        if any(k in item.title.lower() for k in ["agent", "llm", "transformer"]):
            boost = 1.5

        return float(keyword_sim * boost)
```

### Ranking

- Fetch all items from last 24h
- Score each with RelevanceScorer
- Select top N (default: 10) for briefing

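The three steps above can be sketched as one function. `min_score` mirrors the `relevance.min_score` config value; `scorer` is anything with a `score(item)` method:

```python
# Sketch of the ranking step: score, filter by threshold, keep top N.
# `scorer` is any object with a score(item) -> float method.
def rank(items, scorer, top_n=10, min_score=0.3):
    scored = [(scorer.score(item), item) for item in items]
    kept = [(s, item) for s, item in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in kept[:top_n]]
```
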
## Phase 3: Synthesis Engine

### LLM Prompt

```jinja2
You are an intelligence analyst for the Timmy Foundation fleet.
Produce a concise daily briefing from the following sources.

CONTEXT: We build Hermes (local AI agent framework) and operate
a distributed fleet of AI agents. Focus on developments relevant
to: LLM architecture, agent systems, RL training, local inference.

SOURCES:
{% for item in sources %}
- {{ item.title }} ({{ item.source }})
  {{ item.summary }}
{% endfor %}

OUTPUT FORMAT:
## Daily Intelligence Briefing - {{ date }}

### Headlines
- [Source] Key development in one sentence

### Deep Dive: {{ most_relevant.title }}
Why this matters for our work:
[2-3 sentences connecting to Hermes/Timmy context]

### Action Items
- [ ] Any immediate implications

Keep total briefing under 800 words. Tight, professional tone.
```

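Rendering the template is a standard Jinja2 call; a minimal sketch with a cut-down template string standing in for the full template above (in practice the template would be loaded from a file):

```python
# Minimal sketch of rendering the briefing prompt with Jinja2.
# template_str is a cut-down stand-in for the full template above.
from datetime import date

from jinja2 import Template

template_str = (
    "## Daily Intelligence Briefing - {{ date }}\n"
    "{% for item in sources %}- {{ item.title }} ({{ item.source }})\n"
    "{% endfor %}"
)
prompt = Template(template_str).render(
    date=date.today().isoformat(),
    sources=[{"title": "Example paper", "source": "arXiv cs.AI"}],
)
```
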
## Phase 4: Audio Generation

### TTS Pipeline

```python
# tts.py
import subprocess
from pathlib import Path


class PiperTTS:
    def __init__(self, model_path: str, voice: str = "en_US-amy-medium"):
        self.model = Path(model_path) / f"{voice}.onnx"
        self.config = Path(model_path) / f"{voice}.onnx.json"

    def generate(self, text: str, output_path: Path) -> Path:
        # Piper reads text from stdin and writes a WAV file
        cmd = [
            "piper",
            "--model", str(self.model),
            "--config", str(self.config),
            "--output_file", str(output_path),
        ]
        # check=True surfaces Piper failures instead of silently
        # returning a missing/empty file
        subprocess.run(cmd, input=text.encode("utf-8"), check=True)
        return output_path
```

### Voice Selection

- Base: `en_US-amy-medium` (clear, professional)
- Alternative: `en_GB-southern_english_female-medium`

## Phase 5: Delivery Pipeline

### Cron Scheduler

```yaml
# cron entry (runs 5:30 AM daily)
deepdive-daily:
  schedule: "30 5 * * *"
  command: "/opt/deepdive/run-pipeline.sh --deliver"
  timezone: "America/New_York"
```

### Delivery Integration

```python
# delivery.py
from pathlib import Path

from hermes.gateway import TelegramGateway


class TelegramDelivery:
    def __init__(self, bot_token: str, chat_id: str):
        self.gateway = TelegramGateway(bot_token, chat_id)

    async def deliver(self, audio_path: Path, briefing_text: str):
        # Send voice message first, then the text summary
        await self.gateway.send_voice(audio_path)
        # Telegram messages cap at 4096 chars; stay safely under
        await self.gateway.send_message(briefing_text[:4000])
```

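Truncating at 4000 characters loses content when a briefing runs long. A splitter like the following (a sketch, not part of the gateway API) keeps every chunk under Telegram's 4096-character message cap while breaking on paragraph boundaries:

```python
# Sketch: split a long briefing into Telegram-sized chunks,
# preferring paragraph boundaries, hard-wrapping oversized paragraphs.
def split_message(text: str, limit: int = 4000) -> list[str]:
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Hard-wrap any single paragraph that exceeds the limit
        while len(para) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:limit])
            para = para[limit:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```
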
### On-Demand Command

```
/deepdive [optional: date or topic filter]
```

Triggers the pipeline immediately, bypassing the cron schedule.

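A sketch of parsing the optional argument (the function name and the `YYYY-MM-DD` date format are assumptions; anything that does not parse as a date is treated as a topic filter):

```python
# Hypothetical parser for /deepdive's optional argument: a YYYY-MM-DD
# date filter, a free-text topic filter, or nothing at all.
from datetime import datetime
from typing import Optional, Tuple


def parse_deepdive_args(text: str) -> Tuple[Optional[datetime], Optional[str]]:
    arg = text.removeprefix("/deepdive").strip()
    if not arg:
        return None, None
    try:
        return datetime.strptime(arg, "%Y-%m-%d"), None
    except ValueError:
        return None, arg
```
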
## Data Flow

```
RSS Feeds
    │
    ▼
┌───────────┐    ┌───────────┐    ┌───────────┐
│ Raw Items │───▶│  Scored   │───▶│  Top 10   │
│ (100-500) │    │ (ranked)  │    │ Selected  │
└───────────┘    └───────────┘    └─────┬─────┘
                                        │
      ┌─────────────────────────────────┘
      ▼
┌───────────┐    ┌───────────┐    ┌───────────┐
│ Synthesis │───▶│ Briefing  │───▶│  TTS Gen  │
│   (LLM)   │    │   Text    │    │  (Piper)  │
└───────────┘    └───────────┘    └─────┬─────┘
                                        │
                                ┌───────┴───────┐
                                ▼               ▼
                         Telegram Voice   Telegram Text
```

## Configuration

```yaml
# config.yaml
deepdive:
  schedule:
    daily_time: "06:00"
    timezone: "America/New_York"

  aggregation:
    sources:
      - name: "arxiv_ai"
        url: "http://export.arxiv.org/rss/cs.AI"
        fetch_window_hours: 24
      - name: "openai_blog"
        url: "https://openai.com/blog/rss.xml"
        limit: 5  # max items per source

  relevance:
    model: "all-MiniLM-L6-v2"
    top_n: 10
    min_score: 0.3
    keywords:
      - "LLM agent"
      - "agent architecture"
      - "reinforcement learning"

  synthesis:
    llm_model: "gemma-4-it"  # local via llama-server
    max_summary_length: 800

  tts:
    engine: "piper"
    voice: "en_US-amy-medium"
    speed: 1.0

  delivery:
    method: "telegram"
    channel_id: "-1003664764329"
    send_text_summary: true
```

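One way to bind the `relevance` block to a typed object, assuming the YAML has already been parsed into a dict (the class and function names are illustrative; field defaults mirror the values above):

```python
# Sketch: bind the `relevance` section of config.yaml (already parsed
# into a dict) to a typed config object. Names are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelevanceConfig:
    model: str = "all-MiniLM-L6-v2"
    top_n: int = 10
    min_score: float = 0.3
    keywords: List[str] = field(default_factory=list)


def load_relevance(deepdive_cfg: dict) -> RelevanceConfig:
    # Missing keys fall back to the dataclass defaults above
    return RelevanceConfig(**deepdive_cfg.get("relevance", {}))
```
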
## Implementation Phases

| Phase | Est. Effort | Dependencies | Owner |
|-------|-------------|--------------|-------|
| 1: Aggregation | 3 pts | None | Any agent |
| 2: Relevance | 4 pts | Phase 1 | @gemini |
| 3: Synthesis | 4 pts | Phase 2 | @gemini |
| 4: Audio | 4 pts | Phase 3 | @ezra |
| 5: Delivery | 4 pts | Phase 4 | @ezra |

## API Surface (Tentative)

```python
# deepdive/__init__.py
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional


@dataclass
class BriefingResult:
    sources_considered: int
    sources_selected: int
    briefing_text: str
    audio_path: Optional[Path]
    delivered: bool


class DeepDivePipeline:
    async def run(
        self,
        since: Optional[datetime] = None,
        deliver: bool = True
    ) -> BriefingResult:
        ...
```

## Success Metrics

- [ ] Daily delivery within 30 min of scheduled time
- [ ] < 5 minute audio length
- [ ] Relevance precision > 80% (manual audit)
- [ ] Zero API dependencies (full local stack)