# SOTA Research: Structured Memory Systems for AI Agents

**Date**: 2026-04-14
**Purpose**: Inform MemPalace integration for Hermes Agent

---

## 1. Landscape Overview

| System | Type | License | Retrieval Method | Storage |
|--------|------|---------|------------------|---------|
| **MemPalace** | Local verbatim store | Open Source | ChromaDB vector + metadata filtering (wings/rooms) | ChromaDB + filesystem |
| **Mem0** | Managed memory layer | Apache 2.0 | Vector DB + LLM extraction/consolidation | Qdrant/Chroma/Pinecone + graph |
| **MemGPT/Letta** | OS-inspired memory tiers | MIT | Hierarchical recall (core/recall/archival) | In-context + DB archival |
| **Zep** | Context engineering platform | Commercial | Temporal knowledge graph (Graphiti) + vector | Graph DB + vector |
| **LangMem** | Memory toolkit (LangChain) | MIT | LangGraph store (semantic search) | Postgres/in-memory store |
| **Engram** | CLI binary (Rust) | MIT | Hybrid Gemini Embed + FTS5 + RRF | SQLite FTS5 + embeddings |

---

## 2. Benchmark Comparison (LongMemEval)

LongMemEval is the primary academic benchmark for long-term memory retrieval: 500 questions, with ~96% of the candidate content serving as distractors.

| System | LongMemEval R@5 | LongMemEval R@1 | API Required | Notes |
|--------|-----------------|-----------------|--------------|-------|
| **MemPalace (raw)** | **96.6%** | — | None | Zero API calls, pure ChromaDB |
| **MemPalace (hybrid + Haiku rerank)** | **100%** (500/500) | — | Optional | Reranking adds cost |
| **MemPalace (AAAK compression)** | 84.2% | — | None | Lossy; 12.4pt regression vs raw |
| **Engram (hybrid)** | 99.0% | 91.0% | Gemini API | 0.6pt above the 98.4% the Engram team measured for MemPalace (see §7) |
| **Engram (+Cohere rerank)** | 98.0% | 93.0% | Gemini + Cohere | First 100 questions only |
| **Mem0** | ~85% | — | Yes | LOCOMO, not LongMemEval (see §7) |
| **Zep** | ~85% | — | Yes | Cloud service |
| **Mastra** | 94.87% | — | Yes (GPT) | — |
| **Supermemory ASMR** | ~99% | — | Yes | — |

### LOCOMO Benchmark (Mem0's paper, arXiv:2504.19413)

| Method | Accuracy | Median Search Latency | p95 Search Latency | End-to-End p95 | Tokens/Convo |
|--------|----------|-----------------------|--------------------|----------------|--------------|
| **Full Context** | 72.9% | — | — | 17.12s | ~26,000 |
| **Standard RAG** | 61.0% | 0.26s | 0.70s | — | — |
| **OpenAI Memory** | 52.9% | — | — | — | — |
| **Mem0** | 66.9% | 0.15s | 0.20s | 1.44s | ~1,800 |
| **Mem0ᵍ (graph)** | 68.4% | 0.48s | 0.66s | 2.59s | — |

**Key Mem0 claims**: +26% accuracy over OpenAI Memory (66.9% vs 52.9%), 91% lower p95 latency than full-context (1.44s vs 17.12s), and ~90% token savings (~1,800 vs ~26,000 tokens per conversation).

---

## 3. Retrieval Latency

| System | Reported Latency | Notes |
|--------|------------------|-------|
| **Mem0** | 0.15s median search, 0.71s median end-to-end | LOCOMO benchmark |
| **Zep** | <200ms claimed | Cloud service, sub-200ms SLA |
| **MemPalace** | ~seconds for ChromaDB search | Local; depends on corpus size; raw mode is fast |
| **Engram** | Fast (Rust binary) | No published latency numbers |
| **LangMem** | Depends on underlying store | In-memory fast, Postgres slower |
| **MemGPT/Letta** | Variable by tier | Core (in-context) is instant; archival has DB latency |

**Target for Hermes**: <100ms is achievable with local ChromaDB plus a small embedding model (all-MiniLM-L6-v2, ~50MB); the timing sketch below illustrates the setup.
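To make the <100ms target concrete, here is a minimal sketch of raw-mode storage and scoped retrieval with a local ChromaDB collection and all-MiniLM-L6-v2. It assumes the `chromadb` and `sentence-transformers` packages; the collection name, sample document, and wing/room metadata values are illustrative, not the plugin's actual schema.

```python
"""Timing sanity check for local, zero-API retrieval (the §3 target)."""
import time

import chromadb
from chromadb.utils import embedding_functions

# all-MiniLM-L6-v2 (~50MB) runs fully offline once downloaded.
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

client = chromadb.PersistentClient(path="./memory_store")
collection = client.get_or_create_collection(
    name="memories", embedding_function=embed_fn
)

# Raw mode: store the turn verbatim (no compression, no LLM extraction),
# tagged with wing/room metadata in the MemPalace style.
collection.add(
    ids=["turn-0001"],
    documents=["User prefers pytest over unittest for all new test files."],
    metadatas=[{"wing": "workspace", "room": "explicit"}],
)

# Scoped semantic query; once the model is warm this is typically
# well under 100ms on recent hardware.
start = time.perf_counter()
results = collection.query(
    query_texts=["which test framework does the user want?"],
    n_results=5,
    where={"wing": "workspace"},  # workspace isolation via metadata filter
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"top hit: {results['documents'][0][0]!r}")
print(f"query latency: {elapsed_ms:.1f} ms")
```

The first query pays the model-load cost; a steady-state measurement should discard it, which is why pre-indexing appears in the §8 approach column.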
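The hybrid retrieval in the Engram rows above fuses a dense (embedding) ranking with a sparse (SQLite FTS5 keyword) ranking via reciprocal rank fusion (RRF). A minimal sketch of the fusion step; `k = 60` is a common default in the RRF literature, and the value Engram actually uses is not stated here.

```python
"""Reciprocal rank fusion: score(d) = sum over rankers of 1 / (k + rank_d)."""
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; documents near the top of any list win."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)


dense = ["d3", "d1", "d7", "d2"]   # e.g. vector-similarity order
sparse = ["d1", "d9", "d3"]        # e.g. FTS5 BM25 order
print(rrf_merge([dense, sparse]))  # ['d1', 'd3', 'd9', 'd7', 'd2']
```

RRF needs only rank positions, not comparable scores, which is why it is the standard way to merge BM25-style and embedding-style result lists.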
---

## 4. Compression Techniques

| System | Technique | Compression Ratio | Fidelity Impact |
|--------|-----------|-------------------|-----------------|
| **MemPalace AAAK** | Lossy abbreviation dialect (entity codes, truncation) | Claimed ~30x (disputed) | 12.4pt R@5 regression (96.6% → 84.2%) |
| **Mem0** | LLM extraction → structured facts | ~14x token reduction (26K → 1.8K) | 6pt accuracy loss vs full-context |
| **MemGPT** | Hierarchical summarization + eviction | Variable | Depends on tier management |
| **Zep** | Graph compression + temporal invalidation | N/A | Maintains temporal accuracy |
| **Engram** | None (stores raw) | 1x | No loss |
| **LangMem** | Background consolidation via LLM | Variable | Depends on LLM quality |

**Key insight**: MemPalace's raw mode (no compression) achieves the best retrieval scores. Compression trades fidelity for token density. For Hermes, raw storage + semantic search is the safest starting point.

---

## 5. Architecture Patterns

### MemPalace (recommended for Hermes integration)
- **Hierarchical**: Wings (scope: global/workspace) → Rooms (priority: explicit/implicit)
- **Dual-store**: SQLite for canonical data, ChromaDB for vector search
- **Verbatim storage**: No LLM extraction; raw conversation storage
- **Explicit-first ranking**: User instructions always surface above auto-extracted context
- **Workspace isolation**: Memories scoped per project

### Mem0 (graph-enhanced)
- **Two-phase pipeline**: Extraction → Update
- **LLM-driven**: Uses an LLM to extract candidate memories and decide ADD/UPDATE/DELETE/NOOP
- **Graph variant (Mem0ᵍ)**: Entity extraction → relationship graph → conflict detection → temporal updates
- **Multi-level**: User, session, and agent state

### Letta/MemGPT (OS-inspired)
- **Memory tiers**: Core (in-context), Recall (searchable), Archival (deep storage)
- **Self-editing**: Agent manages its own memory via function calls
- **Interrupts**: Control flow between agent and user

### Zep (knowledge graph)
- **Temporal knowledge graph**: Facts carry valid_at/invalid_at timestamps (see the fact-record sketch at the end of §6)
- **Graph RAG**: Relationship-aware retrieval
- **Powered by Graphiti**: Open-source temporal KG framework

---

## 6. Integration Patterns for Hermes

### Current Hermes Memory (memory_tool.py)
- File-backed: MEMORY.md + USER.md
- Delimiter-based entries (§)
- Frozen snapshot in system prompt
- No semantic search

### MemPalace Plugin (hermes_memorypalace)
- Implements the `MemoryProvider` ABC
- ChromaDB + SQLite dual-store
- Lifecycle hooks: initialize, system_prompt_block, prefetch, sync_turn
- Tools: mempalace_remember_explicit, mempalace_store_implicit, mempalace_recall
- Local embedding model (all-MiniLM-L6-v2)

### Recommended Integration Approach
1. **Keep MEMORY.md/USER.md** as L0 (always-loaded baseline)
2. **Add MemPalace** as L1 (semantic search layer)
3. **Prefetch on each turn**: Run vector search before response generation (see the lifecycle sketch below)
4. **Background sync**: Store conversation turns as implicit context
5. **Workspace scoping**: Isolate memories per project
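The lifecycle sketch referenced in step 3: how a provider with these hooks could slot into the turn loop. The hook names (initialize, system_prompt_block, prefetch, sync_turn) and the `MemoryProvider` ABC come from the plugin description above, but the signatures shown and the `generate()` stand-in are assumptions for illustration, not the plugin's actual interface.

```python
"""Prefetch-then-sync turn loop over a MemoryProvider-style interface."""
from abc import ABC, abstractmethod


class MemoryProvider(ABC):
    """Hypothetical shape of the MemoryProvider ABC; signatures assumed."""

    @abstractmethod
    def initialize(self, workspace: str) -> None: ...

    @abstractmethod
    def system_prompt_block(self) -> str: ...

    @abstractmethod
    def prefetch(self, user_message: str, n_results: int = 5) -> list[str]: ...

    @abstractmethod
    def sync_turn(self, user_message: str, assistant_reply: str) -> None: ...


def generate(system: str, context: str, user: str) -> str:
    """Stand-in for the actual model call."""
    return f"[reply to {user!r} using {len(context.splitlines())} recalled memories]"


def run_turn(provider: MemoryProvider, user_message: str) -> str:
    # L1: run vector search before response generation; L0
    # (MEMORY.md/USER.md) stays in the frozen system prompt block.
    recalled = provider.prefetch(user_message)
    context = "\n".join(f"- {m}" for m in recalled)
    reply = generate(provider.system_prompt_block(), context, user_message)
    # Background sync: store the completed turn as implicit context.
    provider.sync_turn(user_message, reply)
    return reply
```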
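And the fact-record sketch promised in §5: Zep-style temporal invalidation closes a contradicted fact's validity interval instead of deleting it, so point-in-time queries stay answerable. The field names and `update()` helper below are illustrative, not Graphiti's actual API.

```python
"""Temporal fact records with valid_at/invalid_at, Zep/Graphiti style."""
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime
    invalid_at: datetime | None = None  # None means still believed true


def update(facts: list[Fact], new: Fact) -> None:
    """Close out conflicting facts rather than overwriting them."""
    for f in facts:
        if (f.subject, f.predicate) == (new.subject, new.predicate) and f.invalid_at is None:
            f.invalid_at = new.valid_at  # invalidated, but history is preserved
    facts.append(new)


kb: list[Fact] = []
update(kb, Fact("user", "prefers_editor", "vim", valid_at=datetime.now(timezone.utc)))
update(kb, Fact("user", "prefers_editor", "helix", valid_at=datetime.now(timezone.utc)))

# The current view filters on open intervals; the old fact remains queryable.
print([f.obj for f in kb if f.invalid_at is None])  # ['helix']
```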
---

## 7. Critical Caveats

1. **Retrieval ≠ answer accuracy**: The Engram team showed that retrieving the right context 98.4% of the time (their R@5 measurement for MemPalace) can still yield only 17% correct answers once an LLM actually tries to answer from it. The retrieval-to-accuracy gap is the real bottleneck.
2. **MemPalace's 96.6% is retrieval only**: Not end-to-end QA accuracy. End-to-end numbers are much lower (~17-40%, depending on question difficulty).
3. **AAAK compression is lossy**: A 12.4pt R@5 regression. Use raw mode for accuracy-critical work.
4. **Mem0's LOCOMO numbers are from a different benchmark**: Not directly comparable to LongMemEval scores.
5. **Latency depends heavily on corpus size and hardware**: Local ChromaDB on an M2 Ultra runs fast; older hardware may not meet <100ms targets.

---

## 8. Recommendations for Hermes MemPalace Integration

| Metric | Target | Achievable? | Approach |
|--------|--------|-------------|----------|
| Retrieval latency | <100ms | Yes | Local ChromaDB + small model, pre-indexed |
| Retrieval accuracy (R@5) | >95% | Yes | Raw verbatim mode, no compression |
| Token efficiency | <2,000 tokens/convo | Yes | Selective retrieval, not full context |
| Workspace isolation | Per-project | Yes | Wing-based scoping |
| Zero cloud dependency | 100% local | Yes | all-MiniLM-L6-v2 runs offline |

**Priority**: Integrate the existing hermes_memorypalace plugin in raw mode. Defer AAAK compression. Focus on retrieval latency and explicit-first ranking.

---

## Sources

- Mem0 paper: arXiv:2504.19413
- MemGPT paper: arXiv:2310.08560
- MemPalace repo: github.com/MemPalace/mempalace
- Engram benchmarks: github.com/199-biotechnologies/engram-2
- Hermes MemPalace plugin: github.com/neilharding/hermes_memorypalace
- LOCOMO benchmark results: mem0.ai/research
- LongMemEval dataset: huggingface.co/datasets/xiaowu0162/longmemeval-cleaned