# SOTA Research: Structured Memory Systems for AI Agents

**Date:** 2026-04-14

**Purpose:** Inform MemPalace integration for Hermes Agent


## 1. Landscape Overview

| System | Type | License | Retrieval Method | Storage |
|---|---|---|---|---|
| MemPalace | Local verbatim store | Open source | ChromaDB vector + metadata filtering (wings/rooms) | ChromaDB + filesystem |
| Mem0 | Managed memory layer | Apache 2.0 | Vector DB + LLM extraction/consolidation | Qdrant/Chroma/Pinecone + graph |
| MemGPT/Letta | OS-inspired memory tiers | MIT | Hierarchical recall (core/recall/archival) | In-context + DB archival |
| Zep | Context engineering platform | Commercial | Temporal knowledge graph (Graphiti) + vector | Graph DB + vector |
| LangMem | Memory toolkit (LangChain) | MIT | LangGraph store (semantic search) | Postgres/in-memory store |
| Engram | CLI binary (Rust) | MIT | Hybrid: Gemini embeddings + FTS5 + RRF | SQLite FTS5 + embeddings |

## 2. Benchmark Comparison (LongMemEval)

LongMemEval is the primary academic benchmark for long-term memory retrieval: 500 questions, with roughly 96% of the retrieval haystack consisting of distractor content.

| System | LongMemEval R@5 | LongMemEval R@1 | API Required | Notes |
|---|---|---|---|---|
| MemPalace (raw) | 96.6% | | None | Zero API calls, pure ChromaDB |
| MemPalace (hybrid + Haiku rerank) | 100% (500/500) | | Optional | Reranking adds cost |
| MemPalace (AAAK compression) | 84.2% | | None | Lossy; 12.4pt regression vs raw |
| Engram (hybrid) | 99.0% | 91.0% | Gemini API | R@5 beats MemPalace by 0.6pt |
| Engram (+ Cohere rerank) | 98.0% | 93.0% | Gemini + Cohere | First 100 questions only |
| Mem0 | ~85% | | Yes | On LOCOMO benchmark |
| Zep | ~85% | | Yes | Cloud service |
| Mastra | 94.87% | | Yes (GPT) | |
| Supermemory ASMR | ~99% | | Yes | |

### LOCOMO Benchmark (Mem0 paper, arXiv:2504.19413)

| Method | Accuracy | Median Search Latency | p95 Search Latency | End-to-End p95 | Tokens/Convo |
|---|---|---|---|---|---|
| Full Context | 72.9% | | | 17.12s | ~26,000 |
| Standard RAG | 61.0% | 0.26s | 0.70s | | |
| OpenAI Memory | 52.9% | | | | |
| Mem0 | 66.9% | 0.15s | 0.20s | 1.44s | ~1,800 |
| Mem0ᵍ (graph) | 68.4% | 0.48s | 0.66s | 2.59s | |

Key Mem0 claims: +26% relative accuracy over OpenAI Memory (66.9% vs 52.9%), 91% lower p95 end-to-end latency than full-context (1.44s vs 17.12s), and ~90% token savings (~1,800 vs ~26,000 tokens per conversation).


## 3. Retrieval Latency

| System | Reported Latency | Notes |
|---|---|---|
| Mem0 | ~0.15s median search (0.20s p95), 0.71s median end-to-end | LOCOMO benchmark |
| Zep | <200ms claimed | Cloud service, sub-200ms SLA |
| MemPalace | ~seconds for ChromaDB search | Local; depends on corpus size; raw mode is fast |
| Engram | Fast (Rust binary) | No published latency numbers |
| LangMem | Depends on underlying store | In-memory fast, Postgres slower |
| MemGPT/Letta | Variable by tier | Core (in-context) is instant; archival incurs DB latency |

Target for Hermes: <100ms is achievable with local ChromaDB + small embedding model (all-MiniLM-L6-v2, ~50MB).
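A quick way to sanity-check that budget is to time a query against a local, persistent collection. A minimal sketch, assuming the `chromadb` and `sentence-transformers` packages are installed; the collection name, documents, and metadata keys are illustrative, not Hermes APIs:

```python
# Time a semantic search against a local ChromaDB collection embedded
# with all-MiniLM-L6-v2 (runs fully offline after the first model download).
import time

import chromadb
from chromadb.utils import embedding_functions

embed = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./memory_index")
col = client.get_or_create_collection("hermes_memory", embedding_function=embed)

col.add(
    ids=["m1", "m2"],
    documents=["User prefers tabs over spaces.",
               "Project targets Python 3.11."],
    metadatas=[{"wing": "workspace", "room": "explicit"},
               {"wing": "workspace", "room": "implicit"}],
)

start = time.perf_counter()
hits = col.query(query_texts=["indentation preference"], n_results=2)
print(f"search took {(time.perf_counter() - start) * 1000:.1f} ms")
print(hits["documents"][0])
```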


## 4. Compression Techniques

| System | Technique | Compression Ratio | Fidelity Impact |
|---|---|---|---|
| MemPalace (AAAK) | Lossy abbreviation dialect (entity codes, truncation) | Claimed ~30x (disputed) | 12.4pt R@5 regression (96.6% → 84.2%) |
| Mem0 | LLM extraction → structured facts | ~14x token reduction (26K → 1.8K) | 6pt accuracy loss vs full-context |
| MemGPT | Hierarchical summarization + eviction | Variable | Depends on tier management |
| Zep | Graph compression + temporal invalidation | N/A | Maintains temporal accuracy |
| Engram | None (stores raw) | 1x | No loss |
| LangMem | Background consolidation via LLM | Variable | Depends on LLM quality |

Key insight: MemPalace's raw mode (no compression) achieves the best retrieval scores. Compression trades fidelity for token density. For Hermes, raw storage + semantic search is the safest starting point.


## 5. Architecture Patterns

### MemPalace (hierarchical, verbatim)

- Hierarchical: Wings (scope: global/workspace) → Rooms (priority: explicit/implicit)
- Dual-store: SQLite for canonical data, ChromaDB for vector search
- Verbatim storage: no LLM extraction; raw conversation storage
- Explicit-first ranking: user instructions always surface above auto-extracted context (see the sketch after this list)
- Workspace isolation: memories scoped per project
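A minimal sketch of how wing scoping and explicit-first ranking could sit on top of a ChromaDB collection. The metadata keys mirror the wings/rooms pattern above, but the helper itself is hypothetical, not MemPalace's actual API:

```python
def recall(collection, query: str, wing: str, n_results: int = 5) -> list:
    """Scoped recall: filter by wing, then rank explicit memories first."""
    hits = collection.query(
        query_texts=[query],
        n_results=n_results * 2,   # over-fetch, then re-rank
        where={"wing": wing},      # workspace isolation via wing metadata
    )
    ranked = sorted(
        zip(hits["documents"][0], hits["metadatas"][0], hits["distances"][0]),
        # False sorts before True, so explicit rooms outrank implicit ones;
        # ties break on vector distance (smaller is closer).
        key=lambda h: (h[1].get("room") != "explicit", h[2]),
    )
    return ranked[:n_results]
```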

### Mem0 (graph-enhanced)

- Two-phase pipeline: Extraction → Update (sketched below)
- LLM-driven: uses an LLM to extract candidate memories and decide ADD/UPDATE/DELETE/NOOP
- Graph variant (Mem0ᵍ): entity extraction → relationship graph → conflict detection → temporal updates
- Multi-level: user, session, and agent state
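The two-phase loop can be sketched as follows. `llm` and `store` are hypothetical stand-ins, and the prompts are illustrative rather than Mem0's actual prompts:

```python
import json

def update_memory(llm, store, conversation_turn: str) -> None:
    # Phase 1: extract candidate facts from the new turn.
    candidates = json.loads(llm(
        "Extract salient facts as a JSON list of strings:\n" + conversation_turn))

    for fact in candidates:
        related = store.search(fact, top_k=5)  # nearest existing memories
        # Phase 2: reconcile the candidate against what is already stored.
        existing = json.dumps([m.text for m in related])
        op = json.loads(llm(
            f"Existing memories: {existing}\nNew fact: {fact}\n"
            'Reply as JSON: {"op": "ADD|UPDATE|DELETE|NOOP", "target": <index or null>}'))
        if op["op"] == "ADD":
            store.add(fact)
        elif op["op"] == "UPDATE":
            store.replace(related[op["target"]].id, fact)
        elif op["op"] == "DELETE":
            store.delete(related[op["target"]].id)
        # NOOP: the candidate is redundant; leave the store unchanged.
```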

### Letta/MemGPT (OS-inspired)

- Memory tiers: Core (in-context), Recall (searchable), Archival (deep storage)
- Self-editing: the agent manages its own memory via function calls (see the sketch below)
- Interrupts: control flow between agent and user
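In practice, self-editing means memory operations are exposed to the model as ordinary tool calls. A sketch using OpenAI-style function schemas; the names mirror MemGPT's published function set, but the exact schemas here are assumptions:

```python
# Two of MemGPT's memory-editing functions, expressed as tool definitions
# the agent can call to manage its own tiers.
MEMORY_TOOLS = [
    {
        "name": "core_memory_append",
        "description": "Append text to always-in-context core memory.",
        "parameters": {
            "type": "object",
            "properties": {"content": {"type": "string"}},
            "required": ["content"],
        },
    },
    {
        "name": "archival_memory_search",
        "description": "Semantic search over deep archival storage.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```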

### Zep (knowledge graph)

- Temporal knowledge graph: facts carry valid_at/invalid_at timestamps (see the sketch below)
- Graph RAG: relationship-aware retrieval
- Powered by Graphiti: an open-source temporal knowledge-graph framework
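A minimal sketch of the valid_at/invalid_at pattern: facts are closed out rather than deleted, so the graph can answer "what was true when". The `Fact` class is illustrative, not Graphiti's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    invalid_at: Optional[datetime] = None  # None means still believed true

def supersede(old: Fact, new_obj: str) -> Fact:
    """Close out the old fact and open its replacement at the same instant."""
    now = datetime.now(timezone.utc)
    old.invalid_at = now
    return Fact(old.subject, old.predicate, new_obj, valid_at=now)
```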

## 6. Integration Patterns for Hermes

### Current Hermes Memory (memory_tool.py)

- File-backed: MEMORY.md + USER.md
- Delimiter-based entries (§) (see the loader sketch below)
- Frozen snapshot in the system prompt
- No semantic search
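Reading the current scheme is trivial. A minimal loader, assuming entries are separated by the § delimiter; the exact on-disk layout of MEMORY.md is an assumption here:

```python
from pathlib import Path

def load_entries(path: str = "MEMORY.md") -> list[str]:
    """Split the memory file on the § delimiter, dropping empty entries."""
    text = Path(path).read_text(encoding="utf-8")
    return [entry.strip() for entry in text.split("§") if entry.strip()]
```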

### MemPalace Plugin (hermes_memorypalace)

- Implements the MemoryProvider ABC (sketched below)
- ChromaDB + SQLite dual-store
- Lifecycle hooks: initialize, system_prompt_block, prefetch, sync_turn
- Tools: mempalace_remember_explicit, mempalace_store_implicit, mempalace_recall
- Local embedding model (all-MiniLM-L6-v2)
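The contract implied by those hooks can be sketched as an ABC. Only the hook names come from the plugin; the signatures are assumptions:

```python
from abc import ABC, abstractmethod

class MemoryProvider(ABC):
    @abstractmethod
    def initialize(self, workspace: str) -> None:
        """Open stores and load indexes for this workspace."""

    @abstractmethod
    def system_prompt_block(self) -> str:
        """Return the memory text injected into the system prompt."""

    @abstractmethod
    def prefetch(self, user_message: str) -> list[str]:
        """Vector-search relevant memories before response generation."""

    @abstractmethod
    def sync_turn(self, user_message: str, assistant_message: str) -> None:
        """Persist the completed turn as implicit context."""
```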
### Proposed Layering

1. Keep MEMORY.md/USER.md as L0 (always-loaded baseline)
2. Add MemPalace as L1 (semantic search layer)
3. Prefetch on each turn: run vector search before response generation
4. Background sync: store conversation turns as implicit context
5. Workspace scoping: isolate memories per project

## 7. Critical Caveats

1. Retrieval ≠ answer accuracy: The Engram team showed that an R@5 of 98.4% (MemPalace) can yield only 17% correct answers once an LLM actually attempts the questions. The retrieval-to-accuracy gap is the real bottleneck.

2. MemPalace's 96.6% is retrieval only: It is not end-to-end QA accuracy; end-to-end numbers are much lower (~17-40%, depending on question difficulty).

3. AAAK compression is lossy: It costs a 12.4pt R@5 regression. Use raw mode for accuracy-critical work.

4. Mem0's LOCOMO numbers are from a different benchmark: They are not directly comparable to LongMemEval scores.

5. Latency depends heavily on corpus size and hardware: Local ChromaDB on an M2 Ultra runs fast; older hardware may not meet <100ms targets.


## 8. Recommendations for Hermes MemPalace Integration

| Metric | Target | Achievable? | Approach |
|---|---|---|---|
| Retrieval latency | <100ms | Yes | Local ChromaDB + small model, pre-indexed |
| Retrieval accuracy (R@5) | >95% | Yes | Raw verbatim mode, no compression |
| Token efficiency | <2,000 tokens/convo | Yes | Selective retrieval, not full context |
| Workspace isolation | Per-project | Yes | Wing-based scoping |
| Zero cloud dependency | 100% local | Yes | all-MiniLM-L6-v2 runs offline |

Priority: integrate the existing hermes_memorypalace plugin in raw mode, defer AAAK compression, and focus on retrieval latency and explicit-first ranking.


## Sources

- Mem0 paper: arXiv:2504.19413
- MemGPT paper: arXiv:2310.08560
- MemPalace repo: github.com/MemPalace/mempalace
- Engram benchmarks: github.com/199-biotechnologies/engram-2
- Hermes MemPalace plugin: github.com/neilharding/hermes_memorypalace
- LOCOMO benchmark results: mem0.ai/research
- LongMemEval dataset: huggingface.co/datasets/xiaowu0162/longmemeval-cleaned