# SOTA Research: Structured Memory Systems for AI Agents

**Date**: 2026-04-14
**Purpose**: Inform MemPalace integration for Hermes Agent

---

## 1. Landscape Overview

| System | Type | License | Retrieval Method | Storage |
|--------|------|---------|------------------|---------|
| **MemPalace** | Local verbatim store | Open Source | ChromaDB vector + metadata filtering (wings/rooms) | ChromaDB + filesystem |
| **Mem0** | Managed memory layer | Apache 2.0 | Vector DB + LLM extraction/consolidation | Qdrant/Chroma/Pinecone + graph |
| **MemGPT/Letta** | OS-inspired memory tiers | MIT | Hierarchical recall (core/recall/archival) | In-context + DB archival |
| **Zep** | Context engineering platform | Commercial | Temporal knowledge graph (Graphiti) + vector | Graph DB + vector |
| **LangMem** | Memory toolkit (LangChain) | MIT | LangGraph store (semantic search) | Postgres/in-memory store |
| **Engram** | CLI binary (Rust) | MIT | Hybrid Gemini Embed + FTS5 + RRF | SQLite FTS5 + embeddings |

---

## 2. Benchmark Comparison (LongMemEval)

LongMemEval is the primary academic benchmark for long-term memory retrieval: 500 questions, with ~96% of the candidate content serving as distractors.

| System | LongMemEval R@5 | LongMemEval R@1 | API Required | Notes |
|--------|-----------------|-----------------|--------------|-------|
| **MemPalace (raw)** | **96.6%** | — | None | Zero API calls, pure ChromaDB |
| **MemPalace (hybrid + Haiku rerank)** | **100%** (500/500) | — | Optional | Reranking adds cost |
| **MemPalace (AAAK compression)** | 84.2% | — | None | Lossy; 12.4pt regression vs raw |
| **Engram (hybrid)** | 99.0% | 91.0% | Gemini API | 0.6pt above the 98.4% the Engram team measured for MemPalace (see §7) |
| **Engram (+Cohere rerank)** | 98.0% | 93.0% | Gemini + Cohere | First 100 questions only |
| **Mem0** | ~85% | — | Yes | LOCOMO, not LongMemEval (see §7) |
| **Zep** | ~85% | — | Yes | Cloud service |
| **Mastra** | 94.87% | — | Yes (GPT) | — |
| **Supermemory ASMR** | ~99% | — | Yes | — |

### LOCOMO Benchmark (Mem0's paper, arXiv:2504.19413)

| Method | Accuracy | Median Search Latency | p95 Search Latency | End-to-End p95 | Tokens/Convo |
|--------|----------|-----------------------|--------------------|----------------|--------------|
| **Full Context** | 72.9% | — | — | 17.12s | ~26,000 |
| **Standard RAG** | 61.0% | 0.26s | 0.70s | — | — |
| **OpenAI Memory** | 52.9% | — | — | — | — |
| **Mem0** | 66.9% | 0.15s | 0.20s | 1.44s | ~1,800 |
| **Mem0ᵍ (graph)** | 68.4% | 0.48s | 0.66s | 2.59s | — |

**Key Mem0 claims**: +26% accuracy over OpenAI Memory (66.9% vs 52.9%), 91% lower p95 latency than full-context (1.44s vs 17.12s), and ~90% token savings (~1,800 vs ~26,000 tokens per conversation).

---

## 3. Retrieval Latency

| System | Reported Latency | Notes |
|--------|------------------|-------|
| **Mem0** | 0.15s median search, 0.71s median end-to-end | LOCOMO benchmark |
| **Zep** | <200ms claimed | Cloud service, sub-200ms SLA |
| **MemPalace** | ~seconds for ChromaDB search | Local; depends on corpus size; raw mode is fast |
| **Engram** | Fast (Rust binary) | No published latency numbers |
| **LangMem** | Depends on underlying store | In-memory fast, Postgres slower |
| **MemGPT/Letta** | Variable by tier | Core (in-context) is instant; archival has DB latency |

**Target for Hermes**: <100ms is achievable with local ChromaDB plus a small embedding model (all-MiniLM-L6-v2, ~50MB); the timing sketch below illustrates the setup.
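To make the <100ms target concrete, here is a minimal sketch of raw-mode storage and scoped retrieval with a local ChromaDB collection and all-MiniLM-L6-v2. It assumes the `chromadb` and `sentence-transformers` packages; the collection name, sample document, and wing/room metadata values are illustrative, not the plugin's actual schema.

```python
"""Timing sanity check for local, zero-API retrieval (the §3 target)."""
import time

import chromadb
from chromadb.utils import embedding_functions

# all-MiniLM-L6-v2 (~50MB) runs fully offline once downloaded.
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

client = chromadb.PersistentClient(path="./memory_store")
collection = client.get_or_create_collection(
    name="memories", embedding_function=embed_fn
)

# Raw mode: store the turn verbatim (no compression, no LLM extraction),
# tagged with wing/room metadata in the MemPalace style.
collection.add(
    ids=["turn-0001"],
    documents=["User prefers pytest over unittest for all new test files."],
    metadatas=[{"wing": "workspace", "room": "explicit"}],
)

# Scoped semantic query; once the model is warm this is typically
# well under 100ms on recent hardware.
start = time.perf_counter()
results = collection.query(
    query_texts=["which test framework does the user want?"],
    n_results=5,
    where={"wing": "workspace"},  # workspace isolation via metadata filter
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"top hit: {results['documents'][0][0]!r}")
print(f"query latency: {elapsed_ms:.1f} ms")
```

The first query pays the model-load cost; a steady-state measurement should discard it, which is why pre-indexing appears in the §8 approach column.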
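The hybrid retrieval in the Engram rows above fuses a dense (embedding) ranking with a sparse (SQLite FTS5 keyword) ranking via reciprocal rank fusion (RRF). A minimal sketch of the fusion step; `k = 60` is a common default in the RRF literature, and the value Engram actually uses is not stated here.

```python
"""Reciprocal rank fusion: score(d) = sum over rankers of 1 / (k + rank_d)."""
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; documents near the top of any list win."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)


dense = ["d3", "d1", "d7", "d2"]   # e.g. vector-similarity order
sparse = ["d1", "d9", "d3"]        # e.g. FTS5 BM25 order
print(rrf_merge([dense, sparse]))  # ['d1', 'd3', 'd9', 'd7', 'd2']
```

RRF needs only rank positions, not comparable scores, which is why it is the standard way to merge BM25-style and embedding-style result lists.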
---

## 4. Compression Techniques

| System | Technique | Compression Ratio | Fidelity Impact |
|--------|-----------|-------------------|-----------------|
| **MemPalace AAAK** | Lossy abbreviation dialect (entity codes, truncation) | Claimed ~30x (disputed) | 12.4pt R@5 regression (96.6% → 84.2%) |
| **Mem0** | LLM extraction → structured facts | ~14x token reduction (26K → 1.8K) | 6pt accuracy loss vs full-context |
| **MemGPT** | Hierarchical summarization + eviction | Variable | Depends on tier management |
| **Zep** | Graph compression + temporal invalidation | N/A | Maintains temporal accuracy |
| **Engram** | None (stores raw) | 1x | No loss |
| **LangMem** | Background consolidation via LLM | Variable | Depends on LLM quality |

**Key insight**: MemPalace's raw mode (no compression) achieves the best retrieval scores. Compression trades fidelity for token density. For Hermes, raw storage + semantic search is the safest starting point.

---

## 5. Architecture Patterns

### MemPalace (recommended for Hermes integration)
- **Hierarchical**: Wings (scope: global/workspace) → Rooms (priority: explicit/implicit)
- **Dual-store**: SQLite for canonical data, ChromaDB for vector search
- **Verbatim storage**: No LLM extraction; raw conversation storage
- **Explicit-first ranking**: User instructions always surface above auto-extracted context
- **Workspace isolation**: Memories scoped per project

### Mem0 (graph-enhanced)
- **Two-phase pipeline**: Extraction → Update
- **LLM-driven**: Uses an LLM to extract candidate memories and decide ADD/UPDATE/DELETE/NOOP
- **Graph variant (Mem0ᵍ)**: Entity extraction → relationship graph → conflict detection → temporal updates
- **Multi-level**: User, session, and agent state

### Letta/MemGPT (OS-inspired)
- **Memory tiers**: Core (in-context), Recall (searchable), Archival (deep storage)
- **Self-editing**: Agent manages its own memory via function calls
- **Interrupts**: Control flow between agent and user

### Zep (knowledge graph)
- **Temporal knowledge graph**: Facts carry valid_at/invalid_at timestamps (see the fact-record sketch at the end of §6)
- **Graph RAG**: Relationship-aware retrieval
- **Powered by Graphiti**: Open-source temporal KG framework

---

## 6. Integration Patterns for Hermes

### Current Hermes Memory (memory_tool.py)
- File-backed: MEMORY.md + USER.md
- Delimiter-based entries (§)
- Frozen snapshot in system prompt
- No semantic search

### MemPalace Plugin (hermes_memorypalace)
- Implements the `MemoryProvider` ABC
- ChromaDB + SQLite dual-store
- Lifecycle hooks: initialize, system_prompt_block, prefetch, sync_turn
- Tools: mempalace_remember_explicit, mempalace_store_implicit, mempalace_recall
- Local embedding model (all-MiniLM-L6-v2)

### Recommended Integration Approach
1. **Keep MEMORY.md/USER.md** as L0 (always-loaded baseline)
2. **Add MemPalace** as L1 (semantic search layer)
3. **Prefetch on each turn**: Run vector search before response generation (see the lifecycle sketch below)
4. **Background sync**: Store conversation turns as implicit context
5. **Workspace scoping**: Isolate memories per project
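The lifecycle sketch referenced in step 3: how a provider with these hooks could slot into the turn loop. The hook names (initialize, system_prompt_block, prefetch, sync_turn) and the `MemoryProvider` ABC come from the plugin description above, but the signatures shown and the `generate()` stand-in are assumptions for illustration, not the plugin's actual interface.

```python
"""Prefetch-then-sync turn loop over a MemoryProvider-style interface."""
from abc import ABC, abstractmethod


class MemoryProvider(ABC):
    """Hypothetical shape of the MemoryProvider ABC; signatures assumed."""

    @abstractmethod
    def initialize(self, workspace: str) -> None: ...

    @abstractmethod
    def system_prompt_block(self) -> str: ...

    @abstractmethod
    def prefetch(self, user_message: str, n_results: int = 5) -> list[str]: ...

    @abstractmethod
    def sync_turn(self, user_message: str, assistant_reply: str) -> None: ...


def generate(system: str, context: str, user: str) -> str:
    """Stand-in for the actual model call."""
    return f"[reply to {user!r} using {len(context.splitlines())} recalled memories]"


def run_turn(provider: MemoryProvider, user_message: str) -> str:
    # L1: run vector search before response generation; L0
    # (MEMORY.md/USER.md) stays in the frozen system prompt block.
    recalled = provider.prefetch(user_message)
    context = "\n".join(f"- {m}" for m in recalled)
    reply = generate(provider.system_prompt_block(), context, user_message)
    # Background sync: store the completed turn as implicit context.
    provider.sync_turn(user_message, reply)
    return reply
```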
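And the fact-record sketch promised in §5: Zep-style temporal invalidation closes a contradicted fact's validity interval instead of deleting it, so point-in-time queries stay answerable. The field names and `update()` helper below are illustrative, not Graphiti's actual API.

```python
"""Temporal fact records with valid_at/invalid_at, Zep/Graphiti style."""
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime
    invalid_at: datetime | None = None  # None means still believed true


def update(facts: list[Fact], new: Fact) -> None:
    """Close out conflicting facts rather than overwriting them."""
    for f in facts:
        if (f.subject, f.predicate) == (new.subject, new.predicate) and f.invalid_at is None:
            f.invalid_at = new.valid_at  # invalidated, but history is preserved
    facts.append(new)


kb: list[Fact] = []
update(kb, Fact("user", "prefers_editor", "vim", valid_at=datetime.now(timezone.utc)))
update(kb, Fact("user", "prefers_editor", "helix", valid_at=datetime.now(timezone.utc)))

# The current view filters on open intervals; the old fact remains queryable.
print([f.obj for f in kb if f.invalid_at is None])  # ['helix']
```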
---

## 7. Critical Caveats

1. **Retrieval ≠ answer accuracy**: The Engram team showed that retrieving the right context 98.4% of the time (their R@5 measurement for MemPalace) can still yield only 17% correct answers once an LLM actually tries to answer from it. The retrieval-to-accuracy gap is the real bottleneck.
2. **MemPalace's 96.6% is retrieval only**: Not end-to-end QA accuracy. End-to-end numbers are much lower (~17-40%, depending on question difficulty).
3. **AAAK compression is lossy**: A 12.4pt R@5 regression. Use raw mode for accuracy-critical work.
4. **Mem0's LOCOMO numbers are from a different benchmark**: Not directly comparable to LongMemEval scores.
5. **Latency depends heavily on corpus size and hardware**: Local ChromaDB on an M2 Ultra runs fast; older hardware may not meet <100ms targets.

---

## 8. Recommendations for Hermes MemPalace Integration

| Metric | Target | Achievable? | Approach |
|--------|--------|-------------|----------|
| Retrieval latency | <100ms | Yes | Local ChromaDB + small model, pre-indexed |
| Retrieval accuracy (R@5) | >95% | Yes | Raw verbatim mode, no compression |
| Token efficiency | <2,000 tokens/convo | Yes | Selective retrieval, not full context |
| Workspace isolation | Per-project | Yes | Wing-based scoping |
| Zero cloud dependency | 100% local | Yes | all-MiniLM-L6-v2 runs offline |

**Priority**: Integrate the existing hermes_memorypalace plugin in raw mode. Defer AAAK compression. Focus on retrieval latency and explicit-first ranking.

---

## Sources

- Mem0 paper: arXiv:2504.19413
- MemGPT paper: arXiv:2310.08560
- MemPalace repo: github.com/MemPalace/mempalace
- Engram benchmarks: github.com/199-biotechnologies/engram-2
- Hermes MemPalace plugin: github.com/neilharding/hermes_memorypalace
- LOCOMO benchmark results: mem0.ai/research
- LongMemEval dataset: huggingface.co/datasets/xiaowu0162/longmemeval-cleaned