Author SHA1 Message Date
Alexander Whitestone
b8862bbad1 docs: MemPalace v3.0.0 integration — before/after evaluation (#568) (closes #764)
Some checks failed
Agent PR Gate / gate (pull_request) Failing after 27s
Self-Healing Smoke / self-healing-smoke (pull_request) Failing after 20s
Smoke Test / smoke (pull_request) Failing after 22s
Agent PR Gate / report (pull_request) Has been cancelled
2026-04-16 00:41:58 -04:00

@@ -0,0 +1,116 @@
# MemPalace v3.0.0 Integration — Before/After Evaluation
**Closes:** #764
**Issue:** #568
**Date:** 2026-04-16
**Status:** ✅ Complete — Recommendation: integrate as primary memory layer
---
## Executive Summary
This report formalizes the evaluation of **MemPalace v3.0.0** (`github.com/milla-jovovich/mempalace`) as a memory layer for the Timmy/Hermes agent stack.
| Property | Value |
|---|---|
| Version | 3.0.0 |
| Backend | ChromaDB (local) |
| Cloud dependencies | **Zero** |
| API calls required | **Zero** (baseline) |
| MCP compatible | Yes |
| Recommendation | **Integrate as primary memory layer** |
---
## Key Findings
| Metric | Value | Notes |
|---|---|---|
| LongMemEval R@5 | **96.6%** | Raw ChromaDB, zero API calls |
| Palace structure boost | **+34%** | Wing + room filtering vs flat retrieval |
| Wake-up context size | **210 tokens** | L0 identity + L1 compressed project state |
| Hybrid R@5 (optional) | 100% | With Haiku rerank (optional API) |
---
## Benchmark Results
| Benchmark | Mode | Score | API calls required |
|---|---|---:|---|
| LongMemEval R@5 | Raw ChromaDB only | **96.6%** | None |
| LongMemEval R@5 | Hybrid + Haiku rerank | 100% | Optional (Haiku) |
| LoCoMo R@10 | Raw, session level | 60.3% | None |
| Personal palace R@10 | Heuristic bench | 85% | None |
| Palace structure impact | Wing + room filtering | **+34%** R@10 | None |
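
The R@5 and R@10 figures above are recall-at-k scores. As a minimal sketch of how such a score can be computed, the function below uses the "at least one relevant item in the top k" definition. That definition is an assumption, since the report does not state which variant each benchmark uses. The function name `recall_at_k` and the toy rankings are hypothetical; only the file names come from the synthetic project.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of labelled queries with at least one relevant item in the top k."""
    if not relevant:
        raise ValueError("no labelled queries")
    hits = 0
    for query, gold in relevant.items():
        top_k = ranked.get(query, [])[:k]
        if gold.intersection(top_k):
            hits += 1
    return hits / len(relevant)


# Hypothetical rankings and labels, for illustration only.
ranked_results = {
    "authentication": ["auth.md", "main.py", "README.md"],
    "docker nginx SSL": ["README.md", "main.py", "deployment.md"],
}
gold_labels = {
    "authentication": {"auth.md"},
    "docker nginx SSL": {"deployment.md"},
}

r_at_1 = recall_at_k(ranked_results, gold_labels, k=1)
r_at_3 = recall_at_k(ranked_results, gold_labels, k=3)
print(r_at_1, r_at_3)
```

In the toy data the second query only finds its relevant file at rank 3, so the score rises as k grows.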
---
## Before vs After (Synthetic Evaluation)
### Test Setup
- 4-file synthetic project: `README.md`, `auth.md`, `deployment.md`, `main.py`
- Mined into MemPalace palace
- 4 standard queries executed
### Before (Keyword/BM25 Baseline)
| Query | Returns | Limitations |
|---|---|---|
| `authentication` | `auth.md` only | Exact match; misses implementation context |
| `docker nginx SSL` | `deployment.md` | Requires manual keyword logic |
| `keycloak OAuth` | `auth.md` | No semantic cross-reference |
| `postgresql database` | `README.md` (not guaranteed) | Hit depends on how the index was built |
**Problems:** no semantic ranking, exact-match bias, no durable conversation memory, no palace structure, no wake-up context.
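
To make the exact-match limitation concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the baseline implementation used in this evaluation, which the report does not describe. The one-line file contents, the whitespace tokenizer, and the `k1`/`b` defaults are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer (assumption); real baselines usually do more.
    return text.lower().split()


def bm25_scores(
    query: str, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75
) -> Dict[str, float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    df: Counter = Counter()
    for toks in tokenized.values():
        df.update(set(toks))

    scores: Dict[str, float] = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


# Hypothetical one-line stand-ins for the four synthetic files.
corpus = {
    "README.md": "project overview and postgresql database decision",
    "auth.md": "authentication via keycloak oauth and jwt tokens",
    "deployment.md": "docker compose with nginx and ssl certificates",
    "main.py": "verify_token checks the jwt before opening a db session",
}

auth_scores = bm25_scores("authentication", corpus)
print(auth_scores)
```

In this toy corpus `main.py` scores zero for `authentication` even though its text is about token verification, because the literal query term never appears in it.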
### After (MemPalace)
| Query | Results | Score | Notes |
|---|---|---:|---|
| `authentication` | `auth.md`, `main.py` | -0.139 | Finds auth discussion + implementation |
| `docker nginx SSL` | `deployment.md`, `auth.md` | 0.447 | Deployment hit + related JWT context |
| `keycloak OAuth` | `auth.md`, `main.py` | -0.029 | Conceptual + implementation evidence |
| `postgresql database` | `README.md`, `main.py` | 0.025 | Decision + implementation |
**Improvements:** semantic ranking, cross-file references, palace-structured retrieval, wake-up context artifact.
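
Palace-structured retrieval can be pictured as a metadata filter applied before similarity ranking. The sketch below is a toy under stated assumptions and does not use MemPalace or ChromaDB: term-count cosine similarity stands in for real embeddings, and the `Memory` class, its `wing` and `room` fields, the `search` function, and the sample memories are all hypothetical names borrowed from the report's terminology.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Memory:
    text: str
    wing: str  # assumed: top-level grouping, such as one project
    room: str  # assumed: topic within a wing


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(
    memories: List[Memory],
    query: str,
    wing: Optional[str] = None,
    room: Optional[str] = None,
    top_k: int = 3,
) -> List[Memory]:
    """Filter by wing/room first, then rank the survivors by similarity."""
    q = Counter(query.lower().split())
    candidates = [
        m
        for m in memories
        if (wing is None or m.wing == wing) and (room is None or m.room == room)
    ]
    ranked = sorted(
        candidates,
        key=lambda m: cosine(q, Counter(m.text.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical memories, for illustration only.
memories = [
    Memory("rotate the jwt signing key every month", wing="project-a", room="auth"),
    Memory("rotate the nginx ssl certificate every month", wing="project-a", room="deployment"),
    Memory("rotate the jwt signing key every week", wing="project-b", room="auth"),
]

flat = search(memories, "rotate the signing key")
filtered = search(memories, "rotate the signing key", wing="project-b", room="auth")
print([m.text for m in filtered])
```

Flat search returns near-identical memories from both wings, while the filtered call keeps only the one from the requested wing and room.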
---
## Wake-up Context
- ~210 tokens total
- L0 identity placeholder
- L1 compressed project state
- Enables cold-start agent bootstrapping without re-reading full corpus
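
One way to think about a fixed-size wake-up artifact is as a budgeted concatenation: the L0 identity goes in first, then L1 state lines are added until the token budget is reached. The sketch below is an assumption about the general idea, not MemPalace's actual procedure. The function names, the sample text, and the whitespace-based token estimate are hypothetical, and real tokenizers will count differently.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words (assumption)."""
    return len(text.split())


def build_wakeup_context(identity: str, state_lines: List[str], budget: int = 210) -> str:
    """Emit the L0 identity, then as many L1 state lines as fit in the budget."""
    parts = [identity]
    used = approx_tokens(identity)
    for line in state_lines:
        cost = approx_tokens(line)
        if used + cost > budget:
            break  # stop at the first line that would exceed the budget
        parts.append(line)
        used += cost
    return "\n".join(parts)


# Hypothetical L0/L1 content, for illustration only.
identity = "You are an agent working on the synthetic demo project."
state = [
    "Auth: Keycloak OAuth, documented in auth.md.",
    "Deploy: Docker behind nginx with SSL, documented in deployment.md.",
    "Data: PostgreSQL, decision recorded in README.md.",
]

context = build_wakeup_context(identity, state, budget=25)
print(context)
```

With the small demo budget the last state line is dropped, which shows why the L1 state has to be compressed and ordered by importance.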
---
## Integration Recommendation
**Verdict: Integrate MemPalace v3.0.0 as the primary memory layer for Timmy/Hermes.**
Rationale:
1. **96.6% R@5 with zero API calls** — production-grade retrieval without cloud dependency
2. **+34% retrieval boost from palace structure** — structured memory outperforms flat search
3. **210-token wake-up context** — enables fast cold-start agent initialization
4. **Fully local** — aligns with sovereignty requirements
5. **MCP compatible** — integrates with existing Hermes agent infrastructure
### Next Steps
- [ ] Deploy MemPalace on Ezra's Hermes home (see `docs/MEMPALACE_EZRA_INTEGRATION.md`)
- [ ] Run live operational benchmarks on real Timmy corpus
- [ ] Post live metrics back to this evaluation
- [ ] Compare against Engram direction before final fleet default decision
### Scope Boundary
This evaluation covers synthetic benchmarks and paper-level metrics. Live operational testing on production data is pending and should be tracked separately.
---
## Related
- Issue #568 — Original evaluation request
- Issue #764 — This formalized report
- PR #569 — Original draft
- `docs/MEMPALACE_EZRA_INTEGRATION.md` — Ezra integration packet
- `reports/evaluations/2026-04-06-mempalace-evaluation.md` — Earlier evaluation draft