Agentic Memory Benchmark: How We Evaluate Memory Platforms

When we started building MemroOS, we looked for a public benchmark comparing agentic memory platforms. There wasn't one. There were product comparisons written by vendors, analyst overviews that didn't go deep on technical capabilities, and a few academic papers on specific memory retrieval techniques — but nothing that systematically evaluated production-grade agentic memory systems on the criteria that matter for deployment.

So we built one: the Marketplace Agentic Memory Benchmark.

This is the methodology behind it, the criteria we chose, and the findings.

Why This Benchmark Exists

Teams choosing a memory layer for their agent infrastructure need answers to specific questions:

How well does the platform retrieve relevant context versus irrelevant noise?
Does it support multiple memory types or just vector search?
What governance controls does it provide for enterprise deployment?
Does it integrate with orchestration systems, or is it standalone?
Can it be self-hosted, or is it cloud-only?
What does the operator experience look like?
Which agent frameworks does it work with?

None of the existing comparisons answered these questions systematically. The benchmark was designed to.

The 8 Evaluation Criteria

We scored each platform on eight dimensions, each on a 5-point scale with defined rubrics.

1. Recall Quality (Weight: High)

Measures how accurately the platform retrieves relevant memories when queried. A platform that retrieves everything scores low on precision; one that retrieves nothing relevant scores low on recall.

Evaluation approach: we built synthetic memory stores with known content, ran both ambiguous and precise queries against them, and measured precision-at-K and recall-at-K.
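To make the two metrics concrete, here is a minimal sketch of precision-at-K and recall-at-K using their standard definitions. It is not the benchmark harness itself; the function names, memory IDs, and relevance labels are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled with a relevant item."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    # Dividing by k (not by the number returned) penalizes short result lists.
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical data: memory IDs returned by a platform, ranked best-first,
# and the IDs we planted in the synthetic store as the known-correct answers.
retrieved_ids = ["m7", "m2", "m9", "m4", "m1"]
relevant_ids = {"m2", "m4", "m8"}

p3 = precision_at_k(retrieved_ids, relevant_ids, k=3)
r3 = recall_at_k(retrieved_ids, relevant_ids, k=3)
p5 = precision_at_k(retrieved_ids, relevant_ids, k=5)
r5 = recall_at_k(retrieved_ids, relevant_ids, k=5)
```

Because the synthetic store has known content, the relevant set is fixed ahead of time, which is what makes both numbers computable.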

2. Memory Tier Depth (Weight: High)

Measures whether the platform supports distinct memory types — episodic, semantic, procedural, declarative, working — or treats all memory as a flat store.

Why it matters: different memory types have different retrieval patterns, decay rates, and appropriate use cases. A platform that flattens all memory into a single tier loses the ability to differentiate between "this happened last session" and "this is always true."
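One way to picture tier-specific decay is the sketch below. It assumes an exponential half-life model, and the half-life values are made up for illustration; neither is taken from any evaluated platform.

```python
import enum
from typing import Dict, Optional


class MemoryTier(enum.Enum):
    EPISODIC = "episodic"        # "this happened last session"
    SEMANTIC = "semantic"        # "this is always true"
    PROCEDURAL = "procedural"
    DECLARATIVE = "declarative"
    WORKING = "working"


# Hypothetical half-lives in hours. None means the tier does not decay.
HALF_LIFE_HOURS: Dict[MemoryTier, Optional[float]] = {
    MemoryTier.WORKING: 1.0,
    MemoryTier.EPISODIC: 72.0,
    MemoryTier.DECLARATIVE: 24.0 * 30,
    MemoryTier.PROCEDURAL: None,
    MemoryTier.SEMANTIC: None,
}


def recency_weight(tier: MemoryTier, age_hours: float) -> float:
    """Return a multiplier in (0, 1] that shrinks as a memory ages."""
    half_life = HALF_LIFE_HOURS[tier]
    if half_life is None:
        return 1.0  # non-decaying tier: age is ignored
    return 0.5 ** (age_hours / half_life)


episodic_weight = recency_weight(MemoryTier.EPISODIC, age_hours=72.0)
semantic_weight = recency_weight(MemoryTier.SEMANTIC, age_hours=72.0)
```

A flat store has only one such curve, so a three-day-old session detail and a standing fact would be ranked by the same rule.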

3. Governed Write Paths (Weight: High)

Measures per-agent access control and operator review capabilities. Specifically: can you grant different write permissions to different agents? Can writes to high-trust tiers require approval before committing?
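The two questions above can be sketched as a small in-memory model. Everything here is hypothetical, including the class name, the tier names treated as high-trust, and the status strings; it shows the behavior we looked for, not any platform's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Assumption: writes to these tiers need operator approval before committing.
HIGH_TRUST_TIERS = {"semantic", "procedural"}

Record = Tuple[str, str, str]  # (agent_id, tier, content)


@dataclass
class GovernedStore:
    write_grants: Dict[str, Set[str]]  # agent_id -> tiers it may write to
    committed: List[Record] = field(default_factory=list)
    pending: List[Record] = field(default_factory=list)

    def write(self, agent_id: str, tier: str, content: str) -> str:
        # Per-agent access control: reject tiers the agent was never granted.
        if tier not in self.write_grants.get(agent_id, set()):
            return "denied"
        record = (agent_id, tier, content)
        # Operator review: high-trust writes wait in a queue.
        if tier in HIGH_TRUST_TIERS:
            self.pending.append(record)
            return "pending_review"
        self.committed.append(record)
        return "committed"

    def approve(self, index: int) -> None:
        """Operator action: move one pending write into the committed store."""
        self.committed.append(self.pending.pop(index))


store = GovernedStore(write_grants={
    "planner": {"episodic", "semantic"},
    "scraper": {"episodic"},
})
denied = store.write("scraper", "semantic", "The API limit is 100 rps")
queued = store.write("planner", "semantic", "The API limit is 100 rps")
direct = store.write("planner", "episodic", "User asked about rate limits")
store.approve(0)
```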

4. Orchestration Integration (Weight: Medium-High)

Measures how well the platform integrates with workflow orchestration — specifically, whether memory context is available to orchestrated agents and whether checkpoint/rollback is supported at the memory level.
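As a rough illustration of memory-level checkpoint/rollback, the sketch below snapshots a key-value store and restores it. The class and method names are hypothetical, and full-copy snapshots are an assumption made for brevity; a production system would more likely store deltas.

```python
import copy
from typing import Dict, List, Optional


class CheckpointedMemory:
    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._checkpoints: List[Dict[str, str]] = []

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def checkpoint(self) -> int:
        """Snapshot current memory and return the checkpoint's ID."""
        self._checkpoints.append(copy.deepcopy(self._data))
        return len(self._checkpoints) - 1

    def rollback(self, checkpoint_id: int) -> None:
        """Restore memory to a snapshot and drop any later checkpoints."""
        self._data = copy.deepcopy(self._checkpoints[checkpoint_id])
        del self._checkpoints[checkpoint_id + 1:]


memory = CheckpointedMemory()
memory.put("customer_tier", "standard")
before_step = memory.checkpoint()          # orchestrator marks a workflow step

memory.put("customer_tier", "enterprise")  # the step writes to memory...
memory.put("discount", "20%")
memory.rollback(before_step)               # ...then fails and is rolled back
```

After the rollback, the failed step's writes are gone, so a retried step starts from the same memory state as the first attempt.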

5. Audit Trail (Weight: Medium-High)

Measures whether every memory mutation is logged with agent identity, timestamp, and source. Full audit lineage scores 5; no logging scores 1.
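A minimal sketch of what a full-lineage record could look like follows. The three required fields (agent identity, timestamp, source) come from the criterion above; the extra fields, class names, and JSON Lines export are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)  # frozen: an entry cannot be edited after it is written
class AuditEntry:
    agent_id: str    # which agent made the change
    timestamp: str   # when, as ISO 8601 in UTC
    source: str      # where the content came from
    operation: str   # hypothetical extra field: "create", "update", "delete"
    memory_id: str   # hypothetical extra field: which memory was touched


class AuditLog:
    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def record(self, agent_id: str, source: str,
               operation: str, memory_id: str) -> AuditEntry:
        entry = AuditEntry(
            agent_id=agent_id,
            timestamp=datetime.now(timezone.utc).isoformat(),
            source=source,
            operation=operation,
            memory_id=memory_id,
        )
        self._entries.append(entry)
        return entry

    def lineage(self, memory_id: str) -> List[AuditEntry]:
        """Every mutation of one memory, oldest first."""
        return [e for e in self._entries if e.memory_id == memory_id]

    def to_jsonl(self) -> str:
        """Export as JSON Lines, one entry per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self._entries)


log = AuditLog()
log.record("planner", "user_message", "create", "mem-42")
log.record("summarizer", "session_digest", "update", "mem-42")
log.record("planner", "tool_output", "create", "mem-43")
history = log.lineage("mem-42")
```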

6. Deployment Flexibility (Weight: Medium)

Measures self-hosting capability, data residency controls, and on-premises deployment support. Cloud-only platforms score lower because enterprise deployments frequently cannot accept external data egress.

7. Observability (Weight: Medium)

Measures operator visibility: can operators see memory health metrics, active agent counts, write/read rates, and anomaly alerts? A NOC-style console scores higher than CLI-only tooling.

8. Framework Breadth (Weight: Lower)

Measures integration support across agent frameworks: Claude Code (MCP), LangGraph, CrewAI, AutoGen, Google ADK, and REST-compatible agents. Platforms with more integrations score higher.

Scoring Methodology

Each criterion is scored 1–5 by two independent reviewers based on public documentation, source code review (where available), and hands-on testing. The two reviewers' scores are averaged per criterion, then combined using the criterion weights into a total reported on a 100-point scale.

The weighting reflects what enterprise and developer teams consistently tell us they care most about: recall quality, governance, and orchestration integration come first. Deployment flexibility matters significantly for regulated environments. Framework breadth matters less because most teams use one or two frameworks.
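The arithmetic can be sketched as follows. The numeric weights are hypothetical stand-ins for the High, Medium-High, Medium, and Lower labels, and the normalization (weighted mean divided by 5, times 100) is an assumption about how a 1–5 rubric maps onto a 100-point total.

```python
from typing import Dict

# Hypothetical numeric weights: High=3, Medium-High=2.5, Medium=2, Lower=1.
WEIGHTS: Dict[str, float] = {
    "recall_quality": 3.0,
    "memory_tier_depth": 3.0,
    "governed_write_paths": 3.0,
    "orchestration_integration": 2.5,
    "audit_trail": 2.5,
    "deployment_flexibility": 2.0,
    "observability": 2.0,
    "framework_breadth": 1.0,
}


def weighted_score(reviewer_a: Dict[str, float],
                   reviewer_b: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Average two reviewers per criterion, weight, and scale to 0-100."""
    total_weight = sum(weights.values())
    weighted_sum = 0.0
    for criterion, weight in weights.items():
        mean = (reviewer_a[criterion] + reviewer_b[criterion]) / 2
        weighted_sum += weight * mean
    # Assumed normalization: a perfect 5 on every criterion maps to 100.
    return round(weighted_sum / total_weight / 5 * 100, 2)


# Made-up reviewer scores for an imaginary platform.
reviewer_a = {name: 4 for name in WEIGHTS}
reviewer_b = {name: 4 for name in WEIGHTS}
reviewer_b["framework_breadth"] = 2

score = weighted_score(reviewer_a, reviewer_b)

# Re-weighting for a team where data residency is non-negotiable.
residency_first = dict(WEIGHTS, deployment_flexibility=5.0)
custom_score = weighted_score(reviewer_a, reviewer_b, residency_first)
```

Passing a different weights dictionary is all it takes to re-rank platforms against your own priorities.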

Scores are based on publicly available information and hands-on evaluation as of the benchmark date. They reflect our interpretation and may differ from vendor self-assessments.

The Platforms Evaluated

We evaluated platforms that either specifically target agentic memory or are commonly used as memory layers for agent workflows: MemroOS, Letta, Mem0, Zep, Midbrain, gBrain, EverMemos, AXME, AgenticMemory, Tytan, and WorldFlow.

Mem0 is included as a memory-engine baseline, with the potential conflict called out explicitly: MemroOS can use Mem0 as one component of its stack. Midbrain is included as a research-led retrieval and continual-learning competitor; the metrics in its SmartSearch paper are not treated as equivalent to MemroOS's architecture/governance score.

Key Findings

MemroOS scored 84.06/100 — the highest among evaluated platforms. Strongest on governed write paths, audit trail, orchestration integration, observability, and operator proof surfaces. Improvement areas: external retrieval benchmarks and comparative recall harnesses should be added alongside the existing operational evals.

Letta scored 70.58/100. Strong on stateful agent persistence and memory-first agent architecture. Weaker on governed enterprise memory as an operator control plane.

Mem0 scored 70.44/100. Strong managed memory baseline and relevant to the ecosystem; less differentiated on orchestration and governance as a complete operator control plane.

Zep scored 68.64/100. Strong on temporal knowledge graph memory and fact invalidation. Weaker on orchestration and enterprise governance as the central product surface.

Midbrain scored 65.21/100. Strongest on retrieval research and continual-learning direction through SmartSearch. Weaker on visible deployment controls, audit lineage, dispatch, and operator governance.

Remaining platforms scored below Midbrain, primarily limited by narrow memory scope, cloud-only deployment, absent governance features, or limited public proof.

The pattern across all evaluated platforms: recall quality and basic storage are table stakes. The differentiators are governance, orchestration integration, and deployment flexibility — exactly the capabilities that matter most when moving from prototype to production.

How to Use This Benchmark

The benchmark is a starting point, not a verdict. Your specific requirements may weight criteria differently:

If data residency is non-negotiable, give deployment flexibility a higher weight.
If you're building a consumer product, governance matters less than recall quality.
If you're deploying a single specialized agent, framework breadth matters less than it does for a heterogeneous agent fleet.

Use the criteria as a checklist for your evaluation. The benchmark scores tell you where each platform is strong and where it isn't — your job is to decide which dimensions matter most for your deployment.

The full benchmark data, including per-criterion scores, is available via the MemroOS platform evaluation framework.
