Compare commits
1 Commits
ci/fix-sec
...
research/r
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
1806ab6c42 |
63
research/03-rag-vs-context-framework.md
Normal file
63
research/03-rag-vs-context-framework.md
Normal file
@@ -0,0 +1,63 @@
|
||||
# Research: Long Context vs RAG Decision Framework
|
||||
|
||||
**Date**: 2026-04-13
|
||||
**Research Backlog Item**: 4.3 (Impact: 4, Effort: 1, Ratio: 4.0)
|
||||
**Status**: Complete
|
||||
|
||||
## Current State of the Fleet
|
||||
|
||||
### Context Windows by Model/Provider
|
||||
| Model | Context Window | Our Usage |
|
||||
|-------|---------------|-----------|
|
||||
| xiaomi/mimo-v2-pro (Nous) | 128K | Primary workhorse (Hermes) |
|
||||
| gpt-4o (OpenAI) | 128K | Fallback, complex reasoning |
|
||||
| claude-3.5-sonnet (Anthropic) | 200K | Heavy analysis tasks |
|
||||
| gemma-3 (local/Ollama) | 8K | Local inference |
|
||||
| gemma-3-27b (RunPod) | 128K | Sovereign inference |
|
||||
|
||||
### How We Currently Inject Context
|
||||
1. **Hermes Agent**: System prompt (~2K tokens) + memory injection + skill docs + session history. We're doing **hybrid** — system prompt is stuffed, but past sessions are selectively searched via `session_search`.
|
||||
2. **Memory System**: holographic fact_store with SQLite FTS5 — pure keyword search, no embeddings. Effectively RAG without the vector part.
|
||||
3. **Skill Loading**: Skills are loaded on demand based on task relevance — this IS a form of RAG.
|
||||
4. **Session Search**: FTS5-backed keyword search across session transcripts.
|
||||
|
||||
### Analysis: Are We Over-Retrieving?
|
||||
|
||||
**YES for some workloads.** Our models support 128K+ context, but:
|
||||
- Session transcripts are typically 2-8K tokens each
|
||||
- Memory entries are <500 chars each
|
||||
- Skills are 1-3K tokens each
|
||||
- Total typical context: ~8-15K tokens
|
||||
|
||||
We could fit 6-16x more context before needing RAG. But stuffing everything in:
|
||||
- Increases cost (input tokens are billed)
|
||||
- Increases latency
|
||||
- Can actually hurt quality (lost in the middle effect)
|
||||
|
||||
### Decision Framework
|
||||
|
||||
```
|
||||
IF task requires factual accuracy from specific sources:
|
||||
→ Use RAG (retrieve exact docs, cite sources)
|
||||
ELIF total relevant context < 32K tokens:
|
||||
→ Stuff it all (simplest, best quality)
|
||||
ELIF 32K < context < model_limit * 0.5:
|
||||
→ Hybrid: key docs in context, RAG for rest
|
||||
ELIF context > model_limit * 0.5:
|
||||
→ Pure RAG with reranking
|
||||
```
|
||||
|
||||
### Key Insight: We're Mostly Fine
|
||||
Our current approach is actually reasonable:
|
||||
- **Hermes**: System prompt stuffed + selective skill loading + session search = hybrid approach. OK
|
||||
- **Memory**: FTS5 keyword search works but lacks semantic understanding. Upgrade candidate.
|
||||
- **Session recall**: Keyword search is limiting. Embedding-based would find semantically similar sessions.
|
||||
|
||||
### Recommendations (Priority Order)
|
||||
1. **Keep current hybrid approach** — it's working well for 90% of tasks
|
||||
2. **Add semantic search to memory** — replace pure FTS5 with sqlite-vss or similar for the fact_store
|
||||
3. **Don't stuff sessions** — continue using selective retrieval for session history (saves cost)
|
||||
4. **Add context budget tracking** — log how many tokens each context injection uses
|
||||
|
||||
### Conclusion
|
||||
We are NOT over-retrieving in most cases. The main improvement opportunity is upgrading memory from keyword search to semantic search, not changing the overall RAG vs stuffing strategy.
|
||||
Reference in New Issue
Block a user