Compare commits

...

1 Commits

Author SHA1 Message Date
Alexander Whitestone
1806ab6c42 research: Long Context vs RAG Decision Framework (backlog #4.3)
Some checks failed
Smoke Test / smoke (pull_request) Failing after 5s
2026-04-13 04:37:15 -04:00

View File

@@ -0,0 +1,63 @@
# Research: Long Context vs RAG Decision Framework
**Date**: 2026-04-13
**Research Backlog Item**: 4.3 (Impact: 4, Effort: 1, Ratio: 4.0)
**Status**: Complete
## Current State of the Fleet
### Context Windows by Model/Provider
| Model | Context Window | Our Usage |
|-------|---------------|-----------|
| xiaomi/mimo-v2-pro (Nous) | 128K | Primary workhorse (Hermes) |
| gpt-4o (OpenAI) | 128K | Fallback, complex reasoning |
| claude-3.5-sonnet (Anthropic) | 200K | Heavy analysis tasks |
| gemma-3 (local/Ollama) | 8K | Local inference |
| gemma-3-27b (RunPod) | 128K | Sovereign inference |
### How We Currently Inject Context
1. **Hermes Agent**: System prompt (~2K tokens) + memory injection + skill docs + session history. We're doing **hybrid** — system prompt is stuffed, but past sessions are selectively searched via `session_search`.
2. **Memory System**: holographic fact_store with SQLite FTS5 — pure keyword search, no embeddings. Effectively RAG without the vector part.
3. **Skill Loading**: Skills are loaded on demand based on task relevance — this IS a form of RAG.
4. **Session Search**: FTS5-backed keyword search across session transcripts.
### Analysis: Are We Over-Retrieving?
**YES for some workloads.** Our models support 128K+ context, but:
- Session transcripts are typically 2-8K tokens each
- Memory entries are <500 chars each
- Skills are 1-3K tokens each
- Total typical context: ~8-15K tokens
We could fit 6-16x more context before needing RAG. But stuffing everything in:
- Increases cost (input tokens are billed)
- Increases latency
- Can actually hurt quality (lost in the middle effect)
### Decision Framework
```
IF task requires factual accuracy from specific sources:
→ Use RAG (retrieve exact docs, cite sources)
ELIF total relevant context < 32K tokens:
→ Stuff it all (simplest, best quality)
ELIF 32K < context < model_limit * 0.5:
→ Hybrid: key docs in context, RAG for rest
ELIF context > model_limit * 0.5:
→ Pure RAG with reranking
```
### Key Insight: We're Mostly Fine
Our current approach is actually reasonable:
- **Hermes**: System prompt stuffed + selective skill loading + session search = hybrid approach. OK
- **Memory**: FTS5 keyword search works but lacks semantic understanding. Upgrade candidate.
- **Session recall**: Keyword search is limiting. Embedding-based would find semantically similar sessions.
### Recommendations (Priority Order)
1. **Keep current hybrid approach** — it's working well for 90% of tasks
2. **Add semantic search to memory** — replace pure FTS5 with sqlite-vss or similar for the fact_store
3. **Don't stuff sessions** — continue using selective retrieval for session history (saves cost)
4. **Add context budget tracking** — log how many tokens each context injection uses
### Conclusion
We are NOT over-retrieving in most cases. The main improvement opportunity is upgrading memory from keyword search to semantic search, not changing the overall RAG vs stuffing strategy.