Agentic Memory Architecture: How to Build Memory That Scales
A single vector store is not agent memory. It is a similarity search index. Conflating the two is the most common architectural mistake teams make when building their first memory layer — and it produces systems that work in demos and degrade in production.
This guide is for engineers designing memory infrastructure for multi-agent systems. It covers why flat vector stores fail at scale, the five-tier memory model that production systems converge toward, write governance, decay policies, retrieval pipeline design, and operator observability requirements.
Why a Single Vector Store Is Not Enough
Vector databases are excellent at one thing: given a query embedding, return the stored vectors most similar to it. They are the right tool for semantic and episodic recall where "what does this remind me of?" is the question.
They are the wrong tool — or at least the incomplete tool — for:
Relational memory. "What are all the architectural decisions that affect the payments module?" is a graph traversal, not a nearest-neighbor search. The relationships between memories matter as much as the memories themselves. A flat vector store cannot represent these relationships without denormalizing data in ways that break at scale.
Structured fact storage. User preferences, business rules, and system configurations are declarative facts. They should be retrieved by exact match or structured query, not by semantic similarity. Storing them in a vector index means retrieval depends on how well the query text happens to embed near the stored fact — a fragile dependency.
Procedural knowledge. Skills and how-to procedures are better indexed by task type and precondition than by content embedding. A procedure for handling payment failures should surface when an agent encounters a payment failure, not when the current task description happens to be semantically close to the procedure's description text.
Temporal reasoning. "What happened most recently?" and "What has changed since last week?" are time-range queries. Vector stores with timestamp metadata can approximate this, but it is not their native query model and performance degrades.
The architectural answer is a multi-tier memory system where each tier is optimized for the memory type it stores.
The Five-Tier Memory Model
Tier 1: Vector Store (Semantic and Episodic Memory)
The vector tier handles embedding-based retrieval for unstructured and semi-structured memory: episodic event records, extracted semantic facts, and any memory where the primary retrieval signal is conceptual similarity.
Design considerations at this tier:
- Embedding model selection — domain-specific fine-tunes outperform general models for vertical use cases. Budget for re-embedding when models are updated.
- Index partitioning — per-agent or per-team namespacing prevents memory bleed across agents. A customer support agent should not be retrieving a sales agent's call summaries.
- Hybrid scoring — combine cosine similarity with a recency decay factor. A memory from yesterday is usually more relevant than a semantically similar one from two years ago, all else being equal.
Tier 2: Graph Store (Relational Memory)
The graph tier stores relationships between entities: which users are associated with which accounts, how architectural components depend on each other, which memories reference the same underlying concept.
This tier is what makes multi-hop recall possible. "What decisions did the team make about authentication, and which incidents have those decisions caused?" requires traversing a graph of memories, not retrieving a single nearest-neighbor.
Use a property graph model (Neo4j, Kuzu, or equivalent) rather than a triple store unless you have a specific RDF requirement. Property graphs are easier to query with complex traversal patterns and handle schema evolution better.
Tier 3: Episodic Store (Timestamped Event Log)
Episodic memory deserves its own tier separate from the vector index because temporal ordering is a first-class query dimension. Events need to be retrieved by time range, by actor, by session, and by outcome — not just by semantic similarity.
Store episodic records as structured objects with: timestamp, agent ID, session ID, action taken, observation, outcome, and a free-text summary that feeds into the vector index for semantic recall. The structured fields enable SQL-style queries; the summary embedding enables semantic recall.
Tier 4: Knowledge Store (Declarative Facts and Rules)
Declarative memory — policies, user preferences, business rules, system configurations — lives in a structured store with explicit schema. This tier is queried by exact match and structured predicate, not by similarity.
The governance requirements at this tier are strictest. Writes should require explicit authorization. Reads should be logged. The data here encodes ground truth that agents act on; incorrect data in this tier has the highest blast radius.
Tier 5: Skill Store (Procedural Memory)
Skills are executable or quasi-executable procedures: the steps to complete a deployment, the negotiation sequence that works for a specific customer segment, the debugging checklist for a class of errors.
Index skills by: task category, precondition predicates, success rate, last-used timestamp, and owning agent. Retrieval is primarily by task match, with success rate as a ranking signal. Skills with a high success rate on similar past tasks should surface above skills that have produced failures.
Write Paths and Governance
The write path is where most memory architectures are underspecified. Who can write what, when, and with what authorization?
A production-grade write governance model requires:
Per-agent write permissions. Agent roles map to memory tier write access. A retrieval-only agent should have no write access. A learning agent should write only to episodic memory and the vector tier. A configuration agent may write to the knowledge store but requires a human approval gate.
Write mode classification. Every memory write should be classified as one of: observation (automatic, no gate), extraction (derived from observations, may require confidence threshold), user-stated (explicit input, high confidence), or inferred (agent-derived, requires human confirmation for high-stakes tiers).
Optimistic writes with rollback capability. Writes that pass permission checks are committed optimistically but remain reversible for a configurable window. The rollback mechanism requires audit lineage: every write records its prior state.
Conflict resolution. When two agents write contradictory facts to the same memory slot, the system needs a resolution policy: last-write-wins, highest-confidence-wins, or human arbitration. The policy should be configurable per tier.
Decay Policies
Memory that is no longer accurate is actively harmful — it is worse than no memory because the agent treats it as ground truth.
Decay policies should be configured per tier:
- Vector and episodic tiers: apply a confidence decay factor over time. A memory written six months ago starts with full confidence and decays toward a threshold below which it is flagged for review rather than automatically surfaced.
- Knowledge tier: declarative facts do not decay automatically, but they do need explicit validity periods. A pricing policy has an expiration date. System configurations change. Build explicit TTL support and expiration review workflows.
- Skill tier: skills decay by non-use and by failure rate. A skill that has not been successfully invoked in 90 days should be flagged. A skill whose success rate drops below a threshold should be quarantined pending human review.
Retrieval Pipeline Design
Context assembly — the process of pulling the right memories into working context before an agent run — is a pipeline, not a single query.
A well-designed retrieval pipeline has four stages:
- Intent extraction: parse the current task to identify entities, actions, and relevant prior context anchors. This is often a lightweight LLM call over the task description.
- Multi-tier retrieval: issue parallel queries to each relevant memory tier using the extracted anchors. Collect candidates with scores and tier provenance.
- Re-ranking and deduplication: merge candidates from all tiers, deduplicate overlapping facts, apply recency and confidence weighting, and rank by relevance.
- Context packing: assemble the ranked candidates into a structured context pack that respects the available token budget. Higher-ranked memories go in first; lower-ranked ones are truncated if budget is exhausted.
The pipeline should be deterministic and auditable. Engineers need to be able to inspect what went into a context pack for a given run and understand why specific memories ranked where they did.
Operator Observability
A memory system without observability is a black box that degrades silently. Operators need:
- Memory health dashboards: tier fill rates, write volume by agent and tier, query latency percentiles, cache hit rates.
- Retrieval trace logs: for each agent run, a record of what was retrieved, from which tier, at what score, and what made it into the context pack.
- Write audit logs: immutable, queryable record of every memory mutation with actor, timestamp, tier, and diff.
- Anomaly detection: write rate spikes, retrieval latency degradation, unusual access patterns — these are early signals of agent misbehavior or memory corruption.
MemroOS implements the full five-tier model with per-agent write permissions, decay policies, a structured retrieval pipeline, and an operator NOC console — deployable on your own infrastructure with no cloud dependency. Source available at github.com/lac5q/memroos.