timmy-home/research/03-rag-vs-context-framework.md
Alexander Whitestone 1806ab6c42
research: Long Context vs RAG Decision Framework (backlog #4.3)
2026-04-13 04:37:15 -04:00


Research: Long Context vs RAG Decision Framework

Date: 2026-04-13
Research Backlog Item: 4.3 (Impact: 4, Effort: 1, Ratio: 4.0)
Status: Complete

Current State of the Fleet

Context Windows by Model/Provider

| Model | Context Window | Our Usage |
|---|---|---|
| xiaomi/mimo-v2-pro (Nous) | 128K | Primary workhorse (Hermes) |
| gpt-4o (OpenAI) | 128K | Fallback, complex reasoning |
| claude-3.5-sonnet (Anthropic) | 200K | Heavy analysis tasks |
| gemma-3 (local/Ollama) | 8K | Local inference |
| gemma-3-27b (RunPod) | 128K | Sovereign inference |

How We Currently Inject Context

  1. Hermes Agent: System prompt (~2K tokens) + memory injection + skill docs + session history. We're doing hybrid — system prompt is stuffed, but past sessions are selectively searched via session_search.
  2. Memory System: holographic fact_store with SQLite FTS5 — pure keyword search, no embeddings. Effectively RAG without the vector part.
  3. Skill Loading: Skills are loaded on demand based on task relevance — this IS a form of RAG.
  4. Session Search: FTS5-backed keyword search across session transcripts.
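The keyword-only recall in items 2 and 4 can be sketched with a few lines of stdlib Python. This assumes the bundled SQLite was compiled with the FTS5 extension; table and column names here are illustrative, not the real fact_store schema.

```python
import sqlite3

# Minimal keyword search over a toy fact store using SQLite FTS5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE facts USING fts5(content)")
conn.executemany(
    "INSERT INTO facts (content) VALUES (?)",
    [("Hermes runs on mimo-v2-pro with a 128K window",),
     ("gemma-3 handles local inference via Ollama",)],
)

# Pure keyword match: finds the literal term "Ollama", but a query like
# "local model" would miss that row entirely -- the gap semantic search
# would close.
rows = conn.execute(
    "SELECT content FROM facts WHERE facts MATCH ?", ("Ollama",)
).fetchall()
print(rows[0][0])
```

This is exactly "RAG without the vector part": retrieval happens, but relevance is lexical rather than semantic.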

Analysis: Are We Over-Retrieving?

YES for some workloads. Our models support 128K+ context, but:

  • Session transcripts are typically 2-8K tokens each
  • Memory entries are <500 chars each
  • Skills are 1-3K tokens each
  • Total typical context: ~8-15K tokens

We could fit 6-16x more context before needing RAG. But stuffing everything in:

  • Increases cost (input tokens are billed)
  • Increases latency
  • Can actually hurt quality (the "lost in the middle" effect, where models recall facts buried mid-context poorly)
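A quick arithmetic check of the headroom figure, using the numbers above (128K window, 8-15K typical usage; the quoted 6x low end presumably leaves extra margin for output tokens):

```python
# Back-of-envelope headroom: typical injected context vs a 128K window.
window = 128_000
for used in (8_000, 15_000):
    print(f"{used:>6} tokens used -> {window // used}x headroom")
```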

Decision Framework

```
IF task requires factual accuracy from specific sources:
    → Use RAG (retrieve exact docs, cite sources)
ELIF total relevant context < 32K tokens:
    → Stuff it all (simplest, best quality)
ELIF 32K < context < model_limit * 0.5:
    → Hybrid: key docs in context, RAG for rest
ELIF context > model_limit * 0.5:
    → Pure RAG with reranking
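The framework above can be sketched as a small function. The thresholds are the ones from this note, not tuned constants, and the function name is illustrative:

```python
def choose_strategy(context_tokens: int, model_limit: int,
                    needs_citations: bool) -> str:
    """Pick a context strategy per the decision framework above."""
    if needs_citations:
        return "rag"           # retrieve exact docs, cite sources
    if context_tokens < 32_000:
        return "stuff"         # simplest, best quality
    if context_tokens < model_limit * 0.5:
        return "hybrid"        # key docs in context, RAG for the rest
    return "rag+rerank"        # pure RAG with reranking

print(choose_strategy(12_000, 128_000, needs_citations=False))
```

With our typical ~8-15K token context and 128K windows, this lands squarely in the "stuff it all" branch, which supports the conclusion below.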

Key Insight: We're Mostly Fine

Our current approach is actually reasonable:

  • Hermes: Stuffed system prompt + selective skill loading + session search = a hybrid approach. Keep as-is.
  • Memory: FTS5 keyword search works but lacks semantic understanding. Upgrade candidate.
  • Session recall: Keyword search is limiting. Embedding-based would find semantically similar sessions.

Recommendations (Priority Order)

  1. Keep current hybrid approach — it's working well for 90% of tasks
  2. Add semantic search to memory — replace pure FTS5 with sqlite-vss or similar for the fact_store
  3. Don't stuff sessions — continue using selective retrieval for session history (saves cost)
  4. Add context budget tracking — log how many tokens each context injection uses
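Recommendation 4 could start as something this simple. This is a hypothetical sketch, not existing code: it uses a crude chars/4 token estimate rather than a real tokenizer, and the class and source names are made up for illustration.

```python
from collections import defaultdict

class ContextBudget:
    """Track estimated token usage per context-injection source."""

    def __init__(self):
        self.usage = defaultdict(int)

    def log(self, source: str, text: str) -> None:
        # Rough heuristic: ~4 chars per token for English text.
        self.usage[source] += max(1, len(text) // 4)

    def report(self) -> dict:
        return dict(self.usage, total=sum(self.usage.values()))

budget = ContextBudget()
budget.log("system_prompt", "x" * 8_000)  # ~2K tokens
budget.log("memory", "y" * 2_000)         # ~500 tokens
print(budget.report())
```

Even this crude logging would tell us whether the ~8-15K estimate above holds in practice, and which injection source to trim first.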

Conclusion

We are NOT over-retrieving in most cases. The main improvement opportunity is upgrading memory from keyword search to semantic search, not changing the overall RAG vs stuffing strategy.