# Deep Dive Architecture Specification

## Phase 1: Source Aggregation Layer

### Data Sources

| Source | URL | Format | Frequency |
|--------|-----|--------|-----------|
| arXiv cs.AI | http://export.arxiv.org/rss/cs.AI | RSS | Daily |
| arXiv cs.CL | http://export.arxiv.org/rss/cs.CL | RSS | Daily |
| arXiv cs.LG | http://export.arxiv.org/rss/cs.LG | RSS | Daily |
| OpenAI Blog | https://openai.com/blog/rss.xml | RSS | On-update |
| Anthropic | https://www.anthropic.com/blog/rss.xml | RSS | On-update |
| DeepMind | https://deepmind.google/blog/rss.xml | RSS | On-update |
| Import AI | https://importai.substack.com/feed | RSS | Daily |
| TLDR AI | https://tldr.tech/ai/rss | RSS | Daily |

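The `SourceConfig` type consumed by the aggregator is not defined in this spec; a minimal sketch, with field names borrowed from the configuration file later in this document (`fetch_window_hours`, `limit` are the assumed per-source options):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SourceConfig:
    name: str
    url: str
    fetch_window_hours: int = 24   # how far back to look on each run
    limit: Optional[int] = None    # max items per source; None = unlimited

# Two entries from the table above, expressed as configs
SOURCES = [
    SourceConfig("arxiv_ai", "http://export.arxiv.org/rss/cs.AI"),
    SourceConfig("openai_blog", "https://openai.com/blog/rss.xml", limit=5),
]
```
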
### Implementation

```python
# aggregator.py
from datetime import datetime
from pathlib import Path
from typing import List

class RSSAggregator:
    def __init__(self, sources: List[SourceConfig]):
        self.sources = sources
        # expanduser() resolves "~" to the real home directory
        self.cache_dir = Path("~/.cache/deepdive/feeds").expanduser()

    async def fetch_all(self, since: datetime) -> List[FeedItem]:
        # Parallel RSS fetch with etag support.
        # Returns normalized items with title, summary, url, published.
        pass
```
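`FeedItem` is likewise undefined in this spec; a sketch of the normalization step using only the standard library (a production fetcher would likely use `feedparser`, but the field mapping is the same), with the dataclass shape assumed from the comment above:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import List, Optional

@dataclass
class FeedItem:
    title: str
    summary: str
    url: str
    published: Optional[datetime]
    source: str = ""

def parse_rss(xml_text: str, source: str = "") -> List[FeedItem]:
    """Normalize the <item> entries of an RSS 2.0 document into FeedItems."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        pub = node.findtext("pubDate")
        items.append(FeedItem(
            title=(node.findtext("title") or "").strip(),
            summary=(node.findtext("description") or "").strip(),
            url=(node.findtext("link") or "").strip(),
            published=parsedate_to_datetime(pub) if pub else None,
            source=source,
        ))
    return items
```
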

## Phase 2: Relevance Engine

### Scoring Algorithm

```python
# relevance.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class RelevanceScorer:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.keywords = [
            "LLM agent", "agent architecture", "tool use",
            "reinforcement learning", "RLHF", "GRPO",
            "transformer", "attention mechanism",
            "Hermes", "local LLM", "llama.cpp"
        ]
        # Pre-compute keyword embeddings
        self.keyword_emb = self.model.encode(self.keywords)

    def score(self, item: FeedItem) -> float:
        title_emb = self.model.encode(item.title)
        summary_emb = self.model.encode(item.summary)

        # Mean cosine similarity to the keyword embeddings,
        # with the title weighted higher than the summary
        title_sim = cosine_similarity([title_emb], self.keyword_emb).mean()
        summary_sim = cosine_similarity([summary_emb], self.keyword_emb).mean()
        keyword_sim = 0.7 * title_sim + 0.3 * summary_sim

        # Boost for agent/LLM architecture terms
        boost = 1.0
        if any(k in item.title.lower() for k in ["agent", "llm", "transformer"]):
            boost = 1.5

        return keyword_sim * boost
```

### Ranking

- Fetch all items from the last 24 h
- Score each with `RelevanceScorer`
- Select the top N (default: 10) for the briefing

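The selection step can be sketched independently of the embedding model; `select_top` and the toy scorer in the test are illustrative, while the `top_n` and `min_score` defaults mirror the configuration later in this document:

```python
from typing import Callable, List, Tuple

def select_top(items: List[str], score: Callable[[str], float],
               top_n: int = 10, min_score: float = 0.3) -> List[Tuple[str, float]]:
    """Rank items by score, drop those below min_score, keep the best top_n."""
    scored = [(item, score(item)) for item in items]
    scored = [pair for pair in scored if pair[1] >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```
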
## Phase 3: Synthesis Engine

### LLM Prompt

```jinja2
You are an intelligence analyst for the Timmy Foundation fleet.
Produce a concise daily briefing from the following sources.

CONTEXT: We build Hermes (local AI agent framework) and operate
a distributed fleet of AI agents. Focus on developments relevant
to: LLM architecture, agent systems, RL training, local inference.

SOURCES:
{% for item in sources %}
- {{ item.title }} ({{ item.source }})
  {{ item.summary }}
{% endfor %}

OUTPUT FORMAT:
## Daily Intelligence Briefing - {{ date }}

### Headlines
- [Source] Key development in one sentence

### Deep Dive: {{ most_relevant.title }}
Why this matters for our work:
[2-3 sentences connecting to Hermes/Timmy context]

### Action Items
- [ ] Any immediate implications

Keep total briefing under 800 words. Tight, professional tone.
```
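For environments without a template engine, the assembly can be sketched in plain Python; this mirrors the `SOURCES` loop and briefing header of the jinja2 template above (the `dict` shape for each source follows `FeedItem`):

```python
from typing import List

def build_prompt(sources: List[dict], today: str) -> str:
    """Assemble the briefing prompt; a plain-Python mirror of the template."""
    lines = [
        "You are an intelligence analyst for the Timmy Foundation fleet.",
        "Produce a concise daily briefing from the following sources.",
        "",
        "SOURCES:",
    ]
    for item in sources:
        lines.append(f"- {item['title']} ({item['source']})")
        lines.append(f"  {item['summary']}")
    lines.append("")
    lines.append(f"## Daily Intelligence Briefing - {today}")
    return "\n".join(lines)
```
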

## Phase 4: Audio Generation

### TTS Pipeline

```python
# tts.py
import subprocess
from pathlib import Path

class PiperTTS:
    def __init__(self, model_path: str, voice: str = "en_US-amy-medium"):
        self.model = Path(model_path) / f"{voice}.onnx"
        self.config = Path(model_path) / f"{voice}.onnx.json"

    def generate(self, text: str, output_path: Path) -> Path:
        # Piper produces WAV from stdin text
        cmd = [
            "piper",
            "--model", str(self.model),
            "--config", str(self.config),
            "--output_file", str(output_path)
        ]
        # check=True raises if piper exits non-zero
        subprocess.run(cmd, input=text.encode(), check=True)
        return output_path
```

### Voice Selection

- Base: `en_US-amy-medium` (clear, professional)
- Alternative: `en_GB-southern_english_female-medium`

## Phase 5: Delivery Pipeline

### Cron Scheduler

```yaml
# cron entry (runs 5:30 AM daily)
deepdive-daily:
  schedule: "30 5 * * *"
  command: "/opt/deepdive/run-pipeline.sh --deliver"
  timezone: "America/New_York"
```

### Delivery Integration

```python
# delivery.py
from pathlib import Path

from hermes.gateway import TelegramGateway

class TelegramDelivery:
    def __init__(self, bot_token: str, chat_id: str):
        self.gateway = TelegramGateway(bot_token, chat_id)

    async def deliver(self, audio_path: Path, briefing_text: str):
        # Send voice message
        await self.gateway.send_voice(audio_path)
        # Send text summary as follow-up (Telegram caps messages at 4096 chars)
        await self.gateway.send_message(briefing_text[:4000])
```
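The `[:4000]` slice silently drops anything past the cutoff; longer briefings could instead be split at paragraph boundaries and sent as several messages. A sketch (the function name is illustrative; 4096 is Telegram's per-message character cap):

```python
from typing import List

def chunk_message(text: str, limit: int = 4096) -> List[str]:
    """Split text into chunks under `limit`, preferring paragraph breaks."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized paragraph is hard-split as a fallback
            while len(para) > limit:
                chunks.append(para[:limit])
                para = para[limit:]
            current = para
    if current:
        chunks.append(current)
    return chunks
```
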

### On-Demand Command

```
/deepdive [optional: date or topic filter]
```

Triggers the pipeline immediately, bypassing cron.

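The optional argument can be disambiguated by trying a date parse first and falling back to a topic filter; a sketch, assuming ISO dates (the spec does not fix a date format):

```python
from datetime import date, datetime
from typing import Optional, Tuple

def parse_deepdive_args(arg: str) -> Tuple[Optional[date], Optional[str]]:
    """Interpret the optional argument as an ISO date, else a topic filter."""
    arg = arg.strip()
    if not arg:
        return None, None
    try:
        return datetime.strptime(arg, "%Y-%m-%d").date(), None
    except ValueError:
        return None, arg
```
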
## Data Flow

```
RSS Feeds
    │
    ▼
┌───────────┐    ┌───────────┐    ┌───────────┐
│ Raw Items │───▶│  Scored   │───▶│  Top 10   │
│ (100-500) │    │ (ranked)  │    │ Selected  │
└───────────┘    └───────────┘    └─────┬─────┘
                                        │
      ┌─────────────────────────────────┘
      ▼
┌───────────┐    ┌───────────┐    ┌───────────┐
│ Synthesis │───▶│ Briefing  │───▶│  TTS Gen  │
│   (LLM)   │    │   Text    │    │  (Piper)  │
└───────────┘    └───────────┘    └─────┬─────┘
                                        │
                                ┌───────┴───────┐
                                ▼               ▼
                          Telegram Voice  Telegram Text
```

## Configuration

```yaml
# config.yaml
deepdive:
  schedule:
    daily_time: "06:00"
    timezone: "America/New_York"

  aggregation:
    sources:
      - name: "arxiv_ai"
        url: "http://export.arxiv.org/rss/cs.AI"
        fetch_window_hours: 24
      - name: "openai_blog"
        url: "https://openai.com/blog/rss.xml"
        limit: 5  # max items per source

  relevance:
    model: "all-MiniLM-L6-v2"
    top_n: 10
    min_score: 0.3
    keywords:
      - "LLM agent"
      - "agent architecture"
      - "reinforcement learning"

  synthesis:
    llm_model: "gemma-4-it"  # local via llama-server
    max_summary_length: 800

  tts:
    engine: "piper"
    voice: "en_US-amy-medium"
    speed: 1.0

  delivery:
    method: "telegram"
    channel_id: "-1003664764329"
    send_text_summary: true
```
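Once parsed (e.g. with PyYAML, omitted here), the relevance section can be validated with a small dataclass; the class and function names are illustrative, while the defaults come from the YAML above:

```python
from dataclasses import dataclass

@dataclass
class RelevanceConfig:
    model: str = "all-MiniLM-L6-v2"
    top_n: int = 10
    min_score: float = 0.3

    def __post_init__(self):
        # Fail fast on values the pipeline cannot use
        if self.top_n < 1:
            raise ValueError("top_n must be >= 1")
        if not 0.0 <= self.min_score <= 1.0:
            raise ValueError("min_score must be in [0, 1]")

def relevance_from_dict(raw: dict) -> RelevanceConfig:
    """Build a RelevanceConfig from the parsed config.yaml dict."""
    section = raw.get("deepdive", {}).get("relevance", {})
    return RelevanceConfig(
        model=section.get("model", "all-MiniLM-L6-v2"),
        top_n=section.get("top_n", 10),
        min_score=section.get("min_score", 0.3),
    )
```
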

## Implementation Phases

| Phase | Est. Effort | Dependencies | Owner |
|-------|-------------|--------------|-------|
| 1: Aggregation | 3 pts | None | Any agent |
| 2: Relevance | 4 pts | Phase 1 | @gemini |
| 3: Synthesis | 4 pts | Phase 2 | @gemini |
| 4: Audio | 4 pts | Phase 3 | @ezra |
| 5: Delivery | 4 pts | Phase 4 | @ezra |

## API Surface (Tentative)

```python
# deepdive/__init__.py
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional

class DeepDivePipeline:
    async def run(
        self,
        since: Optional[datetime] = None,
        deliver: bool = True
    ) -> "BriefingResult":
        ...

@dataclass
class BriefingResult:
    sources_considered: int
    sources_selected: int
    briefing_text: str
    audio_path: Optional[Path]
    delivered: bool
```

## Success Metrics

- [ ] Daily delivery within 30 minutes of the scheduled time
- [ ] Audio length under 5 minutes
- [ ] Relevance precision > 80% (manual audit)
- [ ] Zero API dependencies (full local stack)
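
The audio-length metric can be checked mechanically on Piper's WAV output; a sketch using the standard `wave` module (function names are illustrative; 300 s encodes the 5-minute budget):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def within_length_budget(path: str, max_seconds: float = 300.0) -> bool:
    """True if the briefing audio meets the < 5 minute success metric."""
    return wav_duration_seconds(path) <= max_seconds
```
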