timmy-home/research/03-rag-vs-context-framework.md
Alexander Whitestone 1806ab6c42
research: Long Context vs RAG Decision Framework (backlog #4.3)
2026-04-13 04:37:15 -04:00


Research: Long Context vs RAG Decision Framework

Date: 2026-04-13
Research Backlog Item: 4.3 (Impact: 4, Effort: 1, Ratio: 4.0)
Status: Complete

Current State of the Fleet

Context Windows by Model/Provider

| Model | Context Window | Our Usage |
|---|---|---|
| xiaomi/mimo-v2-pro (Nous) | 128K | Primary workhorse (Hermes) |
| gpt-4o (OpenAI) | 128K | Fallback, complex reasoning |
| claude-3.5-sonnet (Anthropic) | 200K | Heavy analysis tasks |
| gemma-3 (local/Ollama) | 8K | Local inference |
| gemma-3-27b (RunPod) | 128K | Sovereign inference |

How We Currently Inject Context

  1. Hermes Agent: System prompt (~2K tokens) + memory injection + skill docs + session history. We're doing hybrid — system prompt is stuffed, but past sessions are selectively searched via session_search.
  2. Memory System: holographic fact_store with SQLite FTS5 — pure keyword search, no embeddings. Effectively RAG without the vector part.
  3. Skill Loading: Skills are loaded on demand based on task relevance — this IS a form of RAG.
  4. Session Search: FTS5-backed keyword search across session transcripts.
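The keyword-only recall in items 2 and 4 can be sketched with a few lines of stdlib Python. This assumes the bundled SQLite was compiled with the FTS5 extension; table and column names here are illustrative, not the real fact_store schema.

```python
import sqlite3

# Minimal keyword search over a toy fact store using SQLite FTS5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE facts USING fts5(content)")
conn.executemany(
    "INSERT INTO facts (content) VALUES (?)",
    [("Hermes runs on mimo-v2-pro with a 128K window",),
     ("gemma-3 handles local inference via Ollama",)],
)

# Pure keyword match: finds the literal term "Ollama", but a query like
# "local model" would miss that row entirely -- the gap semantic search
# would close.
rows = conn.execute(
    "SELECT content FROM facts WHERE facts MATCH ?", ("Ollama",)
).fetchall()
print(rows[0][0])
```

This is exactly "RAG without the vector part": retrieval happens, but relevance is lexical rather than semantic.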

Analysis: Are We Over-Retrieving?

YES for some workloads. Our models support 128K+ context, but:

  • Session transcripts are typically 2-8K tokens each
  • Memory entries are <500 chars each
  • Skills are 1-3K tokens each
  • Total typical context: ~8-15K tokens

We could fit 6-16x more context before needing RAG. But stuffing everything in:

  • Increases cost (input tokens are billed)
  • Increases latency
  • Can actually hurt quality (the "lost in the middle" effect, where models recall facts buried mid-context poorly)
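A quick arithmetic check of the headroom figure, using the numbers above (128K window, 8-15K typical usage; the quoted 6x low end presumably leaves extra margin for output tokens):

```python
# Back-of-envelope headroom: typical injected context vs a 128K window.
window = 128_000
for used in (8_000, 15_000):
    print(f"{used:>6} tokens used -> {window // used}x headroom")
```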

Decision Framework

```
IF task requires factual accuracy from specific sources:
    → Use RAG (retrieve exact docs, cite sources)
ELIF total relevant context < 32K tokens:
    → Stuff it all (simplest, best quality)
ELIF 32K < context < model_limit * 0.5:
    → Hybrid: key docs in context, RAG for rest
ELIF context > model_limit * 0.5:
    → Pure RAG with reranking
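The framework above can be sketched as a small function. The thresholds are the ones from this note, not tuned constants, and the function name is illustrative:

```python
def choose_strategy(context_tokens: int, model_limit: int,
                    needs_citations: bool) -> str:
    """Pick a context strategy per the decision framework above."""
    if needs_citations:
        return "rag"           # retrieve exact docs, cite sources
    if context_tokens < 32_000:
        return "stuff"         # simplest, best quality
    if context_tokens < model_limit * 0.5:
        return "hybrid"        # key docs in context, RAG for the rest
    return "rag+rerank"        # pure RAG with reranking

print(choose_strategy(12_000, 128_000, needs_citations=False))
```

With our typical ~8-15K token context and 128K windows, this lands squarely in the "stuff it all" branch, which supports the conclusion below.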

Key Insight: We're Mostly Fine

Our current approach is actually reasonable:

  • Hermes: Stuffed system prompt + selective skill loading + session search = a hybrid approach. Keep as-is.
  • Memory: FTS5 keyword search works but lacks semantic understanding. Upgrade candidate.
  • Session recall: Keyword search is limiting. Embedding-based would find semantically similar sessions.

Recommendations (Priority Order)

  1. Keep current hybrid approach — it's working well for 90% of tasks
  2. Add semantic search to memory — replace pure FTS5 with sqlite-vss or similar for the fact_store
  3. Don't stuff sessions — continue using selective retrieval for session history (saves cost)
  4. Add context budget tracking — log how many tokens each context injection uses
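Recommendation 4 could start as something this simple. This is a hypothetical sketch, not existing code: it uses a crude chars/4 token estimate rather than a real tokenizer, and the class and source names are made up for illustration.

```python
from collections import defaultdict

class ContextBudget:
    """Track estimated token usage per context-injection source."""

    def __init__(self):
        self.usage = defaultdict(int)

    def log(self, source: str, text: str) -> None:
        # Rough heuristic: ~4 chars per token for English text.
        self.usage[source] += max(1, len(text) // 4)

    def report(self) -> dict:
        return dict(self.usage, total=sum(self.usage.values()))

budget = ContextBudget()
budget.log("system_prompt", "x" * 8_000)  # ~2K tokens
budget.log("memory", "y" * 2_000)         # ~500 tokens
print(budget.report())
```

Even this crude logging would tell us whether the ~8-15K estimate above holds in practice, and which injection source to trim first.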

Conclusion

We are NOT over-retrieving in most cases. The main improvement opportunity is upgrading memory from keyword search to semantic search, not changing the overall RAG vs stuffing strategy.