SOTA Research: Structured Memory Systems for AI Agents
Date: 2026-04-14
Purpose: Inform MemPalace integration for Hermes Agent
1. Landscape Overview
| System | Type | License | Retrieval Method | Storage |
|---|---|---|---|---|
| MemPalace | Local verbatim store | Open Source | ChromaDB vector + metadata filtering (wings/rooms) | ChromaDB + filesystem |
| Mem0 | Managed memory layer | Apache 2.0 | Vector DB + LLM extraction/consolidation | Qdrant/Chroma/Pinecone + graph |
| MemGPT/Letta | OS-inspired memory tiers | MIT | Hierarchical recall (core/recall/archival) | In-context + DB archival |
| Zep | Context engineering platform | Commercial | Temporal knowledge graph (Graphiti) + vector | Graph DB + vector |
| LangMem | Memory toolkit (LangChain) | MIT | LangGraph store (semantic search) | Postgres/in-memory store |
| Engram | CLI binary (Rust) | MIT | Hybrid Gemini Embed + FTS5 + RRF | SQLite FTS5 + embeddings |
2. Benchmark Comparison (LongMemEval)
LongMemEval is the primary academic benchmark for long-term memory retrieval: 500 questions, with 96% of the retrieval corpus consisting of distractor content.
| System | LongMemEval R@5 | LongMemEval R@1 | API Required | Notes |
|---|---|---|---|---|
| MemPalace (raw) | 96.6% | — | None | Zero API calls, pure ChromaDB |
| MemPalace (hybrid+Haiku rerank) | 100% (500/500) | — | Optional | Reranking adds cost |
| MemPalace (AAAK compression) | 84.2% | — | None | Lossy, 12.4pt regression vs raw |
| Engram (hybrid) | 99.0% | 91.0% | Gemini API | R@5 is 0.6pt above the 98.4% MemPalace figure the Engram team cites (see section 7) |
| Engram (+Cohere rerank) | 98.0% | 93.0% | Gemini+Cohere | First 100 Qs only |
| Mem0 | ~85% | — | Yes | On LOCOMO benchmark |
| Zep | ~85% | — | Yes | Cloud service |
| Mastra | 94.87% | — | Yes (GPT) | — |
| Supermemory ASMR | ~99% | — | Yes | — |
LOCOMO Benchmark (Mem0's paper, arXiv:2504.19413)
| Method | Accuracy | Median Search Latency | p95 Search Latency | End-to-End p95 | Tokens/Convo |
|---|---|---|---|---|---|
| Full Context | 72.9% | — | — | 17.12s | ~26,000 |
| Standard RAG | 61.0% | 0.26s | 0.70s | — | — |
| OpenAI Memory | 52.9% | — | — | — | — |
| Mem0 | 66.9% | 0.15s | 0.20s | 1.44s | ~1,800 |
| Mem0ᵍ (graph) | 68.4% | 0.48s | 0.66s | 2.59s | — |
Key Mem0 claims: +26% accuracy over OpenAI Memory, 91% lower p95 latency vs full-context, 90% token savings.
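Checking those claims against the table: (17.12 − 1.44) / 17.12 ≈ 91.6% lower p95 latency, and 1 − 1,800 / 26,000 ≈ 93% fewer tokens, so both claims are consistent with (indeed slightly conservative relative to) the reported numbers.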
3. Retrieval Latency
| System | Reported Latency | Notes |
|---|---|---|
| Mem0 | 0.15s median search, 0.71s median end-to-end | LOCOMO benchmark |
| Zep | <200ms claimed | Cloud service, sub-200ms SLA |
| MemPalace | ~seconds for ChromaDB search | Local, depends on corpus size; raw mode is fast |
| Engram | Fast (Rust binary) | No published latency numbers |
| LangMem | Depends on underlying store | In-memory fast, Postgres slower |
| MemGPT/Letta | Variable by tier | Core (in-context) is instant; archival has DB latency |
Target for Hermes: <100ms is achievable with local ChromaDB + small embedding model (all-MiniLM-L6-v2, ~50MB).
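As a quick local sanity check for that target, a minimal sketch assuming chromadb's default embedding function (an ONNX build of all-MiniLM-L6-v2, matching the model proposed above); the collection name and corpus are illustrative, and the first query is excluded from timing because it loads the model:

```python
# Latency sanity check for local vector search. chromadb's default
# embedding function is an ONNX all-MiniLM-L6-v2, so this exercises
# the same stack proposed for Hermes. Names/corpus are illustrative.
import time
import chromadb

client = chromadb.PersistentClient(path="./mem_bench")  # fully local, no cloud
col = client.get_or_create_collection("hermes_memories")

# Index a small corpus; real numbers depend on corpus size and hardware.
docs = [f"conversation turn {i}: user discussed topic {i % 50}" for i in range(5000)]
col.add(documents=docs, ids=[f"t{i}" for i in range(5000)])

col.query(query_texts=["warmup"], n_results=1)  # load the model before timing
start = time.perf_counter()
col.query(query_texts=["what topics came up recently?"], n_results=5)
print(f"top-5 query: {(time.perf_counter() - start) * 1000:.1f} ms")  # target: <100 ms
```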
4. Compression Techniques
| System | Technique | Compression Ratio | Fidelity Impact |
|---|---|---|---|
| MemPalace AAAK | Lossy abbreviation dialect (entity codes, truncation) | Claimed ~30x (disputed) | 12.4pt R@5 regression (96.6% → 84.2%) |
| Mem0 | LLM extraction → structured facts | ~14x token reduction (26K → 1.8K) | 6pt accuracy loss vs full-context |
| MemGPT | Hierarchical summarization + eviction | Variable | Depends on tier management |
| Zep | Graph compression + temporal invalidation | N/A | Maintains temporal accuracy |
| Engram | None (stores raw) | 1x | No loss |
| LangMem | Background consolidation via LLM | Variable | Depends on LLM quality |
Key insight: MemPalace's raw mode (no compression) achieves the best retrieval scores. Compression trades fidelity for token density. For Hermes, raw storage + semantic search is the safest starting point.
5. Architecture Patterns
MemPalace (recommended for Hermes integration)
- Hierarchical: Wings (scope: global/workspace) → Rooms (priority: explicit/implicit)
- Dual-store: SQLite for canonical data, ChromaDB for vector search
- Verbatim storage: No LLM extraction, raw conversation storage
- Explicit-first ranking: User instructions always surface above auto-extracted context
- Workspace isolation: Memories scoped per project
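A sketch of how this wing/room hierarchy could map onto ChromaDB metadata filters; the field names and values are assumptions for illustration, not MemPalace's actual schema:

```python
# Hypothetical wing/room scoping via ChromaDB metadata filters.
# The "wing"/"room" fields are assumptions, not MemPalace's schema.
import chromadb

col = chromadb.PersistentClient(path="./palace").get_or_create_collection("memories")
col.add(
    documents=["Always run tests before committing."],
    ids=["m1"],
    metadatas=[{"wing": "workspace:hermes", "room": "explicit"}],
)

# Explicit-first ranking: query the explicit room first, then fall back
# to the implicit room scoped to the same workspace wing.
explicit = col.query(
    query_texts=["commit policy"],
    n_results=3,
    where={"$and": [{"wing": "workspace:hermes"}, {"room": "explicit"}]},
)
```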
Mem0 (graph-enhanced)
- Two-phase pipeline: Extraction → Update
- LLM-driven: Uses LLM to extract candidate memories, decide ADD/UPDATE/DELETE/NOOP
- Graph variant (Mem0ᵍ): Entity extraction → relationship graph → conflict detection → temporal updates
- Multi-level: User, Session, Agent state
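A schematic of that two-phase loop; the prompt wording and the `llm` callable are placeholders for illustration, not Mem0's actual API:

```python
# Schematic of the extraction -> update pipeline described above.
# The prompt and `llm` callable are placeholders, not Mem0's real API.
from typing import Callable

def update_memory(candidate: str, existing: list[str], llm: Callable[[str], str]) -> str:
    """Phase 2: decide how a candidate fact changes the memory store."""
    prompt = (
        "Stored memories:\n" + "\n".join(f"- {m}" for m in existing) +
        f"\n\nCandidate fact: {candidate}\n"
        "Answer with exactly one of: ADD, UPDATE, DELETE, NOOP."
    )
    op = llm(prompt).strip().upper()
    return op if op in {"ADD", "UPDATE", "DELETE", "NOOP"} else "NOOP"
```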
Letta/MemGPT (OS-inspired)
- Memory tiers: Core (in-context), Recall (searchable), Archival (deep storage)
- Self-editing: Agent manages its own memory via function calls
- Interrupts: Control flow between agent and user
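A toy rendering of the tiered layout (tier names follow the MemGPT paper's core/recall/archival split; the eviction policy and sizes are illustrative assumptions):

```python
# Toy three-tier memory in the MemGPT style. Eviction policy and
# limits are illustrative assumptions, not Letta's implementation.
from collections import deque

class TieredMemory:
    def __init__(self, core_limit: int = 10):
        self.core: deque[str] = deque(maxlen=core_limit)  # in-context
        self.recall: list[str] = []                       # searchable log
        self.archival: list[str] = []                     # deep storage

    def observe(self, turn: str) -> None:
        if len(self.core) == self.core.maxlen:
            # Evicting from core demotes the oldest turn to recall.
            self.recall.append(self.core[0])
        self.core.append(turn)

    def archive(self, fact: str) -> None:
        # Self-editing: the agent calls this as a tool to persist facts.
        self.archival.append(fact)
```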
Zep (knowledge graph)
- Temporal knowledge graph: Facts have valid_at/invalid_at timestamps
- Graph RAG: Relationship-aware retrieval
- Powered by Graphiti: Open-source temporal KG framework
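A minimal sketch of temporal invalidation in this style, assuming the valid_at/invalid_at fields described above; the triple representation is illustrative:

```python
# Temporal fact invalidation in the Graphiti style: facts are never
# deleted, only closed out, so point-in-time queries stay answerable.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime
    invalid_at: datetime | None = None  # None = still current

def assert_fact(graph: list[Fact], new: Fact) -> None:
    for f in graph:
        if (f.subject, f.predicate) == (new.subject, new.predicate) and f.invalid_at is None:
            f.invalid_at = new.valid_at  # supersede, don't delete
    graph.append(new)
```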
6. Integration Patterns for Hermes
Current Hermes Memory (memory_tool.py)
- File-backed: MEMORY.md + USER.md
- Delimiter-based entries (§)
- Frozen snapshot in system prompt
- No semantic search
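For reference, a sketch of how such a file might be parsed; beyond the § delimiter named above, the entry format is an assumption:

```python
# Parse §-delimited entries out of MEMORY.md. Only the § delimiter is
# documented above; treating each delimited chunk as one entry is an
# assumption for illustration.
from pathlib import Path

def load_entries(path: str = "MEMORY.md") -> list[str]:
    text = Path(path).read_text(encoding="utf-8")
    return [chunk.strip() for chunk in text.split("§") if chunk.strip()]
```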
MemPalace Plugin (hermes_memorypalace)
- Implements MemoryProviderABC
- ChromaDB + SQLite dual-store
- Lifecycle hooks: initialize, system_prompt_block, prefetch, sync_turn
- Tools: mempalace_remember_explicit, mempalace_store_implicit, mempalace_recall
- Local embedding model (all-MiniLM-L6-v2)
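A skeleton of the plugin surface implied by those hooks; MemoryProviderABC's real signatures are not shown in this report, so the ones below are assumptions:

```python
# Skeleton matching the lifecycle hooks named above. MemoryProviderABC's
# actual signatures are not documented here, so these are assumed.
class MemPalaceProvider:  # would subclass MemoryProviderABC
    def initialize(self, workspace: str) -> None:
        """Open SQLite + ChromaDB stores scoped to the workspace wing."""

    def system_prompt_block(self) -> str:
        """Return the memory block injected into the system prompt."""
        return ""

    def prefetch(self, user_message: str) -> list[str]:
        """Vector-search relevant memories before response generation."""
        return []

    def sync_turn(self, user_message: str, assistant_message: str) -> None:
        """Store the completed turn as implicit context in the background."""
```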
Recommended Integration Approach
- Keep MEMORY.md/USER.md as L0 (always-loaded baseline)
- Add MemPalace as L1 (semantic search layer)
- Prefetch on each turn: Run vector search before response generation
- Background sync: Store conversation turns as implicit context
- Workspace scoping: Isolate memories per project
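Putting the layers together, a hedged sketch of per-turn context assembly that composes the load_entries and MemPalaceProvider sketches above; the ordering mirrors MemPalace's explicit-first ranking rule:

```python
# Per-turn context assembly for the L0/L1 layering above. Composes the
# load_entries() and MemPalaceProvider sketches from earlier sections.
def build_context(provider: "MemPalaceProvider", user_message: str) -> str:
    l0 = load_entries("MEMORY.md") + load_entries("USER.md")  # always-loaded baseline
    l1 = provider.prefetch(user_message)                      # semantic search layer
    # Explicit-first: the L0 baseline precedes retrieved implicit context.
    return "\n".join(l0 + l1)
```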
7. Critical Caveats
- Retrieval ≠ answer accuracy: The Engram team showed that an R@5 of 98.4% (MemPalace) can yield only 17% correct answers once an LLM actually attempts the questions. The retrieval-to-accuracy gap is the real bottleneck.
- MemPalace's 96.6% is retrieval-only: It is not end-to-end QA accuracy; end-to-end numbers are much lower (~17-40%, depending on question difficulty).
- AAAK compression is lossy: 12.4pt R@5 regression. Use raw mode for accuracy-critical work.
- Mem0's LOCOMO numbers are from a different benchmark: They are not directly comparable to LongMemEval scores.
- Latency depends heavily on corpus size and hardware: Local ChromaDB on an M2 Ultra runs fast; older hardware may not meet <100ms targets.
8. Recommendations for Hermes MemPalace Integration
| Metric | Target | Achievable? | Approach |
|---|---|---|---|
| Retrieval latency | <100ms | Yes | Local ChromaDB + small model, pre-indexed |
| Retrieval accuracy (R@5) | >95% | Yes | Raw verbatim mode, no compression |
| Token efficiency | <2000 tokens/convo | Yes | Selective retrieval, not full-context |
| Workspace isolation | Per-project | Yes | Wing-based scoping |
| Zero cloud dependency | 100% local | Yes | all-MiniLM-L6-v2 runs offline |
Priority: Integrate the existing hermes_memorypalace plugin in raw mode. Defer AAAK compression. Focus on retrieval latency and explicit-first ranking.
Sources
- Mem0 paper: arXiv:2504.19413
- MemGPT paper: arXiv:2310.08560
- MemPalace repo: github.com/MemPalace/mempalace
- Engram benchmarks: github.com/199-biotechnologies/engram-2
- Hermes MemPalace plugin: github.com/neilharding/hermes_memorypalace
- LOCOMO benchmark results from mem0.ai/research
- LongMemEval: huggingface.co/datasets/xiaowu0162/longmemeval-cleaned