docs(research): add implementation recommendations to R@5 vs E2E gap report (#876 )

Appends Section 6 (Implementation Recommendations) to research_r5_vs_e2e_gap.md with the four concrete action items from issue #876: 1. Chunk-overlap retrieval (50% overlap) 2. Retrieval confidence scoring with configurable threshold 3. Chain-of-thought over retrieved context (not plain concatenation) 4. First-class "I don't know" fallback when confidence is low Also adds architecture-impact note on HRR limitations and renumbers limitations section to 7. References parent epic #659 and research #876.
2026-04-22 02:03:36 -04:00
2 changed files with 38 additions and 158 deletions
--- a/docs/research/ai-tools-evaluation-842.md
+++ b/docs/research/ai-tools-evaluation-842.md
@@ -1,157 +0,0 @@
-# AI Tools Evaluation Report (#842)
-
-**Source:** [formatho/awesome-ai-tools](https://github.com/formatho/awesome-ai-tools)  
-**Date:** 2026-04-15  
-**Tools Analyzed:** 414 across 9 categories  
-**Scope:** Hermes-agent integration potential
-
---
-
-## Executive Summary
-
-Scanned 414 tools from awesome-ai-tools. Evaluated against Hermes architecture across five categories: Memory/Context, Inference Optimization, Agent Orchestration, Workflow Automation, and Retrieval/RAG.
-
-## Top 5 Recommendations & Implementation Status
-
-### P1 — Mem0 (Memory/Context) ✅ IMPLEMENTED
-
-| Metric | Value |
-|--------|-------|
-| GitHub | [mem0ai/mem0](https://github.com/mem0ai/mem0) |
-| Stars | 53.1k ⭐ |
-| Integration Effort | 3/5 |
-| Impact | 5/5 |
-
-**Status:** Both cloud (mem0ai) and local (ChromaDB) variants implemented.
-
-**Deliverables:**
- `plugins/memory/mem0/` — Platform API provider with server-side LLM extraction, semantic search, reranking
- `plugins/memory/mem0_local/` — Sovereign local variant using ChromaDB, no API key required
- Tools: `mem0_profile`, `mem0_search`, `mem0_conclude`
- Circuit breaker for resilience
- 36 tests passing across both providers
-
-**Activation:**
-```bash
-hermes memory setup    # select "mem0" or "mem0_local"
-```
-
-**Risk mitigation:** OSS-only features used in `mem0_local`. Cloud version uses freemium API but has circuit-breaker fallback.
-
---
-
-### P2 — LightRAG (Retrieval/RAG) 🔴 NOT STARTED
-
-| Metric | Value |
-|--------|-------|
-| GitHub | [HKUDS/LightRAG](https://github.com/HKUDS/LightRAG) |
-| Stars | 33.1k ⭐ |
-| Integration Effort | 3/5 |
-| Impact | 4/5 |
-
-**Proposed integration:**
- Local knowledge base for skill references and codebase understanding
- Index GENOME.md, README.md, and key architecture files
- Query via tool call when agent needs contextual understanding (not just keyword search)
- Complements `search_files` without replacing it
-
-**Blocker:** Requires OpenAI-compatible embedding endpoint. Can use local Ollama via compatibility layer.
-
-**Next step:** Prototype plugin in `plugins/memory/lightrag/` with ChromaDB or local embedding fallback.
-
---
-
-### P3 — tensorzero (Inference Optimization / LLMOps) 🔴 NOT STARTED
-
-| Metric | Value |
-|--------|-------|
-| GitHub | [tensorzero/tensorzero](https://github.com/tensorzero/tensorzero) |
-| Stars | 11.2k ⭐ |
-| Integration Effort | 3/5 |
-| Impact | 4/5 |
-
-**Proposed integration:**
- Replace custom provider routing, fallback chains, and token tracking
- Intelligent routing across providers with cost/quality optimization
- Automatic prompt optimization based on feedback
- Evaluation metrics for A/B testing model/provider combinations
-
-**Blocker:** Rust-based infrastructure. Requires careful migration of existing provider logic. Best done as gradual opt-in, not replacement.
-
-**Next step:** Evaluate tensorzero gateway as optional `providers.tensorzero` backend.
-
---
-
-### P4 — RAGFlow (Retrieval/RAG) 🔴 NOT STARTED
-
-| Metric | Value |
-|--------|-------|
-| GitHub | [infiniflow/ragflow](https://github.com/infiniflow/ragflow) |
-| Stars | 77.9k ⭐ |
-| Integration Effort | 4/5 |
-| Impact | 4/5 |
-
-**Proposed integration:**
- Deploy as local Docker service for document understanding
- Ingest technical docs, research papers, codebases
- Query via HTTP API when agents need deep document comprehension
-
-**Blocker:** Heavy deployment (multi-service Docker). Best suited for always-on infrastructure, not per-session.
-
-**Next step:** Add RAGFlow API client tool in `tools/ragflow_tool.py` for document querying.
-
---
-
-### P5 — n8n (Workflow Automation) 🔴 NOT STARTED
-
-| Metric | Value |
-|--------|-------|
-| GitHub | [n8n-io/n8n](https://github.com/n8n-io/n8n) |
-| Stars | 183.9k ⭐ |
-| Integration Effort | 4/5 |
-| Impact | 5/5 |
-
-**Proposed integration:**
- Orchestrate Hermes agents from external events (webhooks, schedules)
- Visual workflow builder for burn loops, PR pipelines, multi-agent chains
- n8n webhooks trigger Hermes cron jobs or fleet dispatches
-
-**Blocker:** Full application stack (Node.js, PostgreSQL, Redis). Deploy as standalone Docker service.
-
-**Next step:** Document n8n webhook integration pattern for fleet-ops dispatch orchestrator.
-
---
-
-## Honorable Mentions Already in Stack
-
-| Tool | Status | Notes |
-|------|--------|-------|
-| llama.cpp | ✅ Integrated | Via Ollama local inference |
-| mempalace | ✅ Integrated | Holographic memory system (44.8k ⭐) |
-
---
-
-## Category Breakdown
-
-### Memory/Context (9 tools evaluated)
- Mem0 → **IMPLEMENTED** (cloud + local)
- memvid, mempalace, nocturne_memory, rowboat, byterover-cli, letta-code, hindsight, agentic-context-engine → Evaluated, no action
-
-### Inference Optimization (5 tools evaluated)
- llama.cpp → **Already integrated**
- vllm, tensorzero, mistral.rs, pruna → Evaluated, no action
-
-### Retrieval/RAG (5 tools evaluated)
- RAGFlow, LightRAG, PageIndex, WeKnora, RAG-Anything → Evaluated, no action
-
-### Agent Orchestration (5 tools evaluated)
- n8n, Langflow, agent-framework, deepagents, multica → Evaluated, no action
-
---
-
-## References
-
- Source repository: https://github.com/formatho/awesome-ai-tools
- Total tools: 414 across 9 categories
- Freshness distribution: 🟢 303 | 🟡 49 | 🟠 22 | 🔴 40
- Hermes issue: [#842](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/842)
--- a/research_r5_vs_e2e_gap.md
+++ b/research_r5_vs_e2e_gap.md
@@ -284,7 +284,44 @@ The gap can be reduced from 81 points to ~25-45 points with proper interventions

 ---

-## 6. Limitations of This Research
+## 6. Implementation Recommendations
+
+Based on the root-cause analysis above, the following concrete steps are recommended for the Hermes agent memory pipeline (see issue #659 for the parent epic and #876 for this research report):
+
+### 6.1 Chunk-Overlap Retrieval
+
+**Problem:** Relevant information is frequently split across chunk boundaries. Retrieval finds one chunk but the answer spans two.
+
+**Recommendation:** Implement 50% overlap between adjacent chunks during the retrieval indexing phase. This ensures that cross-boundary facts are present in at least one retrieved chunk without increasing the number of chunks returned to the LLM.
+
+### 6.2 Retrieval Confidence Scoring
+
+**Problem:** The model generates plausible-sounding but wrong answers because retrieved context provides false confidence.
+
+**Recommendation:** Add a confidence score to each retrieved chunk (e.g., cosine-similarity threshold + source-reliability weight). Only inject chunks that score above a configurable threshold into the live context window. Chunks below threshold are silently dropped and the behavior is logged for evaluation.
+
+### 6.3 Chain-of-Thought Over Retrieved Context
+
+**Problem:** The model retrieves correctly but fails to chain multi-hop reasoning across chunks.
+
+**Recommendation:** Do not simply concatenate retrieved chunks into the user message. Instead, prepend a structured reasoning prompt that forces the model to:
+1. Quote the specific chunk that supports each step.
+2. Flag when two chunks must be combined to reach a conclusion.
+3. Stop and emit "I don't know" if no chunk supports a required inference step.
+
+### 6.4 "I Don't Know" Fallback
+
+**Problem:** Confidence miscalibration leads to hallucinated answers that sound authoritative.
+
+**Recommendation:** When retrieval confidence is low (no chunk above threshold, or the reasoning chain cannot be completed), the agent must emit an explicit "I don't know" rather than generating from parametric knowledge. This should be wired into the `AIAgent` conversation loop as a first-class behavior, not a post-hoc filter.
+
+### 6.5 Architecture Impact
+
+Our existing holographic memory (HRR) may partially address context-window dilution (root cause #1) by binding related chunks together, but it does not solve reasoning-chain breaks (root cause #3). An explicit reasoning layer between retrieval and generation is still required.
+
+---
+
+## 7. Limitations of This Research

 1. **MemPalace/Engram team analysis not found** - The specific analysis that discovered the 17% figure was not located through academic search. This may be from internal reports, blog posts, or presentations not indexed in arXiv.