AI Development · Methodology · 16 min read · Published May 24, 2026

Memory architecture is now an economics decision — not just a capability one.

Agent Memory in 2026: Dreaming, Memory Bank, and the Long-Context Shift

Anthropic shipped “Dreaming” on May 6, 2026: an async hippocampal-replay process that reorganizes agent memory between sessions. Google launched Memory Bank at I/O 2026 (May 19). Claude Opus 4.7's flat-priced 1M-token context window has made long-context a legitimate memory architecture for small fleets. This is the post-Code with Claude (CWC) London update layer on top of the April 14 Mem0/Letta/Zep/Graph-RAG deep-dive.

Digital Applied Team
Senior strategists · Published May 24, 2026
Sources: 16
  • Dreaming launch: May 6, 2026 (Anthropic Managed Agents)
  • Harvey task lift: 6x, vendor-reported (after enabling Dreaming)
  • Memory Bank: I/O Day 1, May 19, 2026 (Google, identity-scoped)
  • Opus 4.7 context: 1M tokens, flat-priced ($5 in / $25 out per Mtok)

AI agent memory architecture changed materially in May 2026. Anthropic shipped “Dreaming” — an async hippocampal-consolidation process for Managed Agents — on May 6, and deepened coverage at Code with Claude London on May 19-20. Google launched Memory Bank, an identity-scoped persistence primitive, at I/O 2026. For the original Mem0/Letta/Zep/Graph-RAG framework comparison, see our April 14 deep dive on agent memory architectures. This post is the May 2026 update layer.

The stakes have shifted. Three months ago, the memory architecture decision was primarily a capability question: which framework handles multi-hop reasoning, temporal recall, or cross-session coherence best? In May 2026, it is also an economics question. Claude Opus 4.7's 1M-token context at flat pricing has made “just stuff it in context” operationally cheaper than a Mem0 + Pinecone stack for agents with fewer than roughly 500K tokens of accumulated history. That crossover point reshapes the decision tree for single-user and small-fleet deployments.

This guide covers: what just shipped (Dreaming, Memory Tool, Memory Bank, Mem0 ADK integration), how Anthropic's three memory surfaces differ, the vector DB 2026 landscape, episodic and knowledge-graph patterns from Letta and Zep, the long-context economics framing, a proprietary 10-scenario decision matrix, and the LangChain migration every 2026 agent build needs to complete. For the broader agentic AI context, see our agent architecture patterns taxonomy.

Key takeaways
  1. Dreaming is the biggest memory pattern shift of Q2 2026. Anthropic's Dreaming primitive (shipped May 6, 2026) runs asynchronously between agent sessions, reviewing session transcripts and memory stores, extracting patterns, merging duplicates, and surfacing new insights. It is explicitly modeled on hippocampal memory consolidation. Harvey, the legal AI firm, reportedly saw a 6x jump in task completion rates after enabling it (vendor-reported; treat as indicative). No prior memory post in this index documents async consolidation as a production-ready primitive.
  2. Three vendors now have three distinct default memory architectures. Anthropic's Memory Tool is filesystem-mounted at /mnt/memory/ inside the agent container. Google's Memory Bank is an identity-scoped database primitive in the Gemini Enterprise Agent Platform. OpenAI's file_search is a vector-store-backed retrieval tool in the Responses API. The API you build on now influences your default memory model; these are not interchangeable patterns.
  3. Long-context is now a legitimate memory architecture for small fleets. Claude Opus 4.7's 1M-token context window is flat-priced at $5 input / $25 output per Mtok, with no surcharge above 200K, unlike prior generations. For single-user agents with fewer than ~500K tokens of accumulated history and fewer than 10 sessions, the operational cost of long-context can undercut maintaining a Mem0 + Pinecone stack. This is an economics decision, not a capability gap.
  4. LangChain memory is officially deprecated; migrate to LangGraph. As of 2026, BufferMemory, ConversationBufferMemory, ConversationSummaryBufferMemory, and VectorStoreRetrieverMemory from langchain.memory are deprecated. LangGraph's checkpointer-based short_term + long_term memory pattern is the only officially supported approach. Many published tutorials still reference the deprecated APIs, which is a build-phase footgun for any team following older guides.
  5. Mem0 is now the cross-vendor memory layer across all three major stacks. Mem0 (41K GitHub stars, 14M downloads as of May 2026 per published reports) has published integrations with the Anthropic SDK, OpenAI Agents SDK, and Google ADK. This positions Mem0 as a portable abstraction layer: build once, swap the underlying agent runtime without rewriting your memory layer. For teams uncertain which vendor to standardize on, Mem0 offers a hedge.

01 · May 2026 Update · What just shipped: the post-CWC London memory layer.

Four distinct memory-related developments landed in a 31-day window between April 23 and May 24, 2026 — more memory infrastructure shipped in that period than in the preceding six months combined. This section covers all four, with precise dates and primary sources. The April 14 comparison of Mem0/Letta/Zep/Graph-RAG remains the authoritative evergreen reference; what follows is the update layer.

April 23, 2026 — Anthropic adds persistent memory to Claude Managed Agents as a public beta. This was the infrastructure move that preceded everything else: Anthropic made cross-session state persistence available for Managed Agents builds without requiring developers to implement their own external memory stores. Source: EdTech Innovation Hub, retrieved May 24, 2026.

May 6, 2026 — Anthropic ships Dreaming. The Dreaming feature for Claude Managed Agents launched — an async, between-session process that reviews agent session transcripts, pulls patterns out, merges duplicates, replaces stale entries, and writes new memory entries that future sessions can use. Dreaming is explicitly modeled on hippocampal memory consolidation (see section 02 for the technical deep-dive).

May 19, 2026 — Google launches Memory Bank at I/O Day 1. Google's Memory Bank launched as part of the Gemini Enterprise Agent Platform alongside ADK 2.0 (Agent Development Kit, GA the same day). Memory Bank provides identity-scoped persistence — an agent can remember a user's preferences, history, and key details across multiple sessions. See section 03 for the architecture breakdown.

May 19-20, 2026 — Code with Claude London deepens memory coverage. Anthropic ran sessions on multi-agent orchestration, outcomes, memory, Dreaming, advisor strategy, and Claude Managed Agents at Code with Claude London — the same event where self-hosted sandboxes and MCP tunnels launched. The memory sessions added implementation depth to the May 6 feature announcements.

Mem0 publishes Google ADK persistence integration. Mem0 confirmed that its persistent-memory layer now integrates with all three major agent stacks — Anthropic SDK, OpenAI Agents SDK, and Google ADK. This cross-vendor positioning is the defining business-model move for Mem0 in 2026: it is no longer tied to a single vendor's memory model.

  • Persistent memory beta (Anthropic Managed Agents) · Apr 23, 2026 · Infrastructure foundation. Cross-session state persistence added as a public beta to Claude Managed Agents. Preceded the Dreaming feature and CWC London memory sessions by weeks.
  • Dreaming launched (async consolidation goes live) · May 6, 2026 · Biggest Q2 2026 memory shift. Async hippocampal-replay process for Managed Agents. Runs between sessions, reorganizes memory stores, surfaces new insights. Harvey reportedly saw 6x task completion lift (vendor-reported).
  • Google Memory Bank (I/O Day 1 launch) · May 19, 2026 · Gemini Enterprise Agent Platform. Identity-scoped memory persistence in the Gemini Enterprise Agent Platform, launching alongside ADK 2.0 GA. Distinct from the Interactions API's previous_interaction_id session continuity.
  • Mem0 ADK integration (Anthropic + OpenAI + Google ADK) · 3 stacks · Cross-vendor memory layer. Mem0 now integrates across all three major agent runtimes, confirming its positioning as the cross-vendor persistent-memory layer. 41K stars / 14M downloads as of May 2026 per published reports.

02 · Anthropic Dreaming · Hippocampal consolidation for Claude agents: how Dreaming works.

The Dreaming feature is the single most architecturally significant memory development of Q2 2026 — and the one with the least coverage depth in the mainstream AI press. Most write-ups note that “Claude agents can now dream” and move on. The implementation details are where the architectural implications live.

Anthropic describes a “dream” as a read of an existing memory store plus past session transcripts, producing a new reorganized memory store. Per The New Stack's coverage: it is “a scheduled process that runs between agent sessions, reviews everything an agent did in its last job, pulls patterns out of those sessions, and writes new memory entries that the next session can use.”

The biological framing is precise, not metaphorical. SiliconANGLE reported: “Anthropic compares it to hippocampal memory consolidation, the way a human brain replays the day's events during sleep and decides what to keep.” The analogy maps onto the neuroscience: the hippocampus replays waking experiences during sleep to transfer important episodic memories to long-term cortical storage. Dreaming does the same for agent session data — the between-session window is when consolidation happens.
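For teams that are not on Managed Agents, the consolidation step can be approximated in a scheduled job. The sketch below is a deterministic stand-in, not Anthropic's implementation: the real feature presumably uses a model to extract patterns, whereas this version only merges duplicates and replaces stale entries by key. `MemoryEntry`, `normalize`, and `consolidate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    key: str         # hypothetical: the topic a note is about
    text: str        # the note itself
    updated_at: int  # logical timestamp; higher means newer


def normalize(text: str) -> str:
    """Case- and whitespace-insensitive form used to spot duplicates."""
    return " ".join(text.lower().split())


def consolidate(store, session_notes):
    """Return (new_store, stats) without mutating either input.

    Same key + same normalized text -> counted as a merged duplicate.
    Same key + different text       -> the newer note replaces the stale one.
    """
    merged = {}
    stats = {"duplicates_merged": 0, "stale_replaced": 0}
    for entry in sorted([*store, *session_notes], key=lambda e: e.updated_at):
        previous = merged.get(entry.key)
        if previous is not None:
            if normalize(previous.text) == normalize(entry.text):
                stats["duplicates_merged"] += 1
            else:
                stats["stale_replaced"] += 1
        merged[entry.key] = entry  # the newest entry for a key always wins
    return sorted(merged.values(), key=lambda e: e.key), stats


store = [
    MemoryEntry("deploy-day", "Deploys go out on Friday", 1),
    MemoryEntry("indent", "Use tabs", 1),
]
session_notes = [
    MemoryEntry("deploy-day", "Deploys go out on Tuesday", 2),
    MemoryEntry("indent", "use  tabs", 3),
]
new_store, stats = consolidate(store, session_notes)
```

The function reads an existing store plus new session notes and returns a reorganized store, which mirrors the "read store and transcripts, write a new store" shape described above. Returning a new store instead of mutating the old one keeps each run auditable.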

Operational characteristics: Dreams run asynchronously, typically taking minutes to tens of minutes depending on input size. The agent writes learnings as plain-text notes and structured “playbooks” that future sessions can reference. Every step is observable and auditable: writes appear in the session event stream. Dreaming is one of three Anthropic memory surfaces that developers need to distinguish:

  • Anthropic Memory Tool — filesystem-based, mounted at /mnt/memory/ inside the agent container. Claude uses bash and code-execution tools to read and write memory files. Storage is client-controlled (local FS, S3, etc.). Beta header: context-management-2025-06-27. Every write creates an immutable version for audit and rollback. Per Anthropic's Memory Tool docs: “Claude can create, read, update, and delete files that persist between sessions, allowing it to build knowledge over time without keeping everything in the context window.”
  • Anthropic persistent memory beta (April 23, 2026) — the platform-level cross-session state feature in Managed Agents, distinct from both the Memory Tool and from Dreaming.
  • Dreaming — the async consolidation layer that operates on top of the persistent memory and session transcripts. Managed Agents only.

Harvey's reported 6x task-completion lift after enabling Dreaming on production legal agents is the headline outcome metric, but it carries a critical caveat: this is vendor-reported data from a single production deployment, not an independent third-party reproduction. Treat it as an indicative signal, not a guaranteed performance benchmark. Source: The New Stack, retrieved May 24, 2026.

Letta — Agent Memory blog, 2026

“Agents do not passively receive context — they explicitly call memory management functions to move information between tiers. This makes agents active participants in their own memory management, not passive recipients of injected context.” — Letta blog, Agent Memory, retrieved May 24, 2026. The Dreaming primitive extends this principle: consolidation is an explicit, scheduled agent action, not a passive process.

03 · Google Memory Bank · Google's identity-scoped memory at I/O 2026.

Google's Memory Bank launched at I/O 2026 Day 1 (May 19) as part of the Gemini Enterprise Agent Platform, alongside ADK 2.0 GA. The architectural position is distinct from every other memory primitive in Google's stack — understanding the difference matters for teams building on Google Managed Agents.

Per Google Cloud's Memory Bank docs: “For long-term memory across sessions, Google provides additional tools like Memory Bank which scopes memories to a specific identity, allowing an agent to remember a user's preferences, history, and key details across multiple sessions.”

Where it fits in the Google Managed Agents memory stack:

  • Session State: scoped to a single session. Persists files and state across calls within that session via the Interactions API's previous_interaction_id. Ephemeral once the session ends.
  • Memory Bank: identity-scoped, cross-session persistence. A user's preferences, facts, and history are stored against that user's identity and retrieved on future sessions. Available via a dedicated ADK quickstart.

The practical implication for the Google Managed Agents API: Memory Bank is the right tool for multi-session, multi-user agents (e.g., a customer service agent that remembers previous interactions). Session State is the right tool for within-session continuity. Mixing them up is a common build error.

Mem0 has also published an ADK integration pattern that layers Mem0's memory model on top of Google ADK agents — giving teams who prefer Mem0's graph-enhanced memory model an escape hatch from the native Memory Bank primitive. This cross-vendor integration confirms that Mem0 is positioning itself as the persistent-memory middleware for all three major agent stacks, not just a framework-specific tool.

04 · Vendor Architecture Map · Three vendors, three default memory models: the coordinate system.

The most under-appreciated architectural shift of May 2026 is that the choice of which agent API you build on now shapes your default memory model. This is not a theoretical concern — it affects how data is stored, who controls it, how it is retrieved, and what security surface you are responsible for.

Anthropic
Filesystem-mounted at /mnt/memory/
Memory Tool · client-controlled storage

Memory files live inside the agent container at /mnt/memory/. Claude reads and writes via bash + code-execution tools. Storage backend is client-controlled (local FS, S3, etc.). Beta header: context-management-2025-06-27. Add Dreaming on top for async consolidation between sessions.

Filesystem · audit trail · Dreaming layer
Google
Identity-scoped Memory Bank database
Gemini Enterprise Agent Platform · ADK 2.0

Memory Bank scopes facts to a user identity across sessions. Session State handles within-session continuity via previous_interaction_id. Backed by Google Cloud infrastructure. Distinct primitives: don't conflate them. Mem0 ADK integration available as an alternative.

Identity-scoped · session vs cross-session
OpenAI
Vector-store via file_search
Responses API · previous_response_id

OpenAI's Responses API uses previous_response_id for session persistence — architecturally similar to Google's previous_interaction_id. The file_search built-in tool is a vector-store-backed retrieval surface, meaning long-term memory goes through vector similarity by default.

Vector-store default · Responses API

The practical implication: a team choosing between these three APIs is also implicitly choosing between a filesystem memory model, an identity-database memory model, and a vector-store memory model as their default starting point. Switching later requires rebuilding the memory layer, not just swapping an API key. Teams uncertain about vendor lock-in should evaluate Mem0 as an abstraction layer above all three.

For a comprehensive view of how memory fits into the broader agent architecture, see our complete guide to AI agent memory systems.

05 · Vector Retrieval · Vector retrieval in May 2026: six vendors, eight decision columns.

Vector retrieval remains the dominant pattern for agent long-term memory at scale — no May 2026 development changes that. What has changed is the pricing landscape and the scale thresholds at which each option is economical. The table below is a proprietary six-vendor operational matrix as of May 2026. For the 8-vendor head-to-head with performance benchmarks, see our vector DB comparison guide; for the RAG-context selection guide, see vector databases for RAG applications.

Key pricing notes as of May 24, 2026 (verify before committing — vector DB pricing changes frequently):

  • Pinecone Serverless: ~$70/month at 10M vectors of 1,536 dimensions; $0.33/GB/month storage; Read Units $16/M on Standard, $24/M on Enterprise. HNSW indexing adds ~1.5x storage overhead.
  • Weaviate Cloud: The old $25/month Serverless tier was retired (October 2025 restructure). Flex starts at $45/month + $0.095 per million vector dimensions.
  • Qdrant Cloud: ~$65/month at 10M vectors. Above 60-80M queries/month, self-hosted Qdrant on a fixed-cost VPS consistently undercuts Pinecone Serverless by 3-10x according to LeanOps' 2026 cost comparison, retrieved May 24, 2026.

One common mix-up worth flagging: OpenAI's text-embedding-3-large has a native dimension of 3072, not 1536. Shortening it to 1536 (Matryoshka-style truncation) is the common pgvector pairing, but that is an explicit choice, not the native default. Cohere embed-v4's default is 1536.
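The dimension choice matters because it scales storage linearly. As a rough worked example, the sketch below estimates the storage-only part of a bill using the Pinecone figures quoted above ($0.33/GB/month, ~1.5x index overhead). It assumes float32 values (4 bytes each) and decimal gigabytes, and it ignores read units and metadata, so it will not reproduce the ~$70/month total.

```python
def raw_vector_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw embedding payload in decimal GB (assumes float32 by default)."""
    return num_vectors * dims * bytes_per_value / 1e9


def monthly_storage_cost(num_vectors: int, dims: int,
                         price_per_gb: float = 0.33,
                         index_overhead: float = 1.5) -> float:
    """Storage-only estimate: raw size x index overhead x price per GB-month."""
    return raw_vector_storage_gb(num_vectors, dims) * index_overhead * price_per_gb


cost_1536 = monthly_storage_cost(10_000_000, 1536)  # about $30.41/month
cost_3072 = monthly_storage_cost(10_000_000, 3072)  # about $60.83/month
```

At 10M vectors, keeping the native 3072 dimensions doubles the storage line item compared with 1536 under these assumptions.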

Vector DB estimated monthly cost at 10M vectors — May 2026

Source: LeanOps vector DB cost comparison 2026 + vendor docs, retrieved 2026-05-24. Verify pricing before committing — changes frequently.
  • Pinecone Serverless: ~$70/mo at 10M vectors (1,536-dim). Managed only · HNSW-like + sparse · RU-based pricing · ~1.5x storage overhead
  • Qdrant Cloud: ~$65/mo at 10M vectors (Flex tier). OSS (Rust) + Cloud · HNSW + quantization + sparse · fast payload filtering
  • Weaviate Cloud Flex: from $45/mo + $0.095/M dims. OSS + Cloud · HNSW + dynamic · multi-modal · GraphQL-native
  • Chroma: free (OSS embedded) / Cloud beta. Dev-friendly · in-process SQLite · embedded mode · ≤1M vectors
  • Milvus: self-host / Zilliz Cloud (large-scale, GPU). OSS + Zilliz · GPU-CAGRA · DiskANN · 100M+ scale · distributed
  • pgvector: Postgres infra cost only. OSS Postgres ext. · HNSW default · ACID · Supabase-native · 16K dim limit

HNSW vs IVF: don't conflate them. HNSW (Hierarchical Navigable Small World) is a graph-based ANN index with roughly logarithmic search cost, the dominant choice for ≤100M vectors at interactive latency. It is tunable via M (max connections per node), ef_construction (build-time quality), and ef_search (query-time quality). IVF (Inverted File) is a partition-based index tuned via nlist (number of partitions) and nprobe (partitions scanned per query); it is useful for batched workloads but is a different algorithm entirely. DiskANN (Microsoft) targets disk-resident billion-scale collections.
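To make the nlist / nprobe trade-off concrete, here is a toy IVF index in plain Python. It is a teaching sketch under simplifying assumptions: centroids are sampled vectors with no k-means refinement, distance is Euclidean, and search returns a single nearest neighbor. Production libraries differ in all of these details, and the function names are hypothetical.

```python
import math
import random


def nearest_index(point, candidates):
    """Index of the candidate closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def build_ivf(vectors, nlist, seed=0):
    """Partition vectors into nlist cells around sampled centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    cells = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        cells[nearest_index(vec, centroids)].append(idx)
    return centroids, cells


def ivf_search(query, vectors, centroids, cells, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query.

    Returns (best_vector_index, number_of_vectors_scanned).
    """
    by_distance = sorted(range(len(centroids)),
                         key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in by_distance[:nprobe] for i in cells[c]]
    best = min(candidates, key=lambda i: math.dist(query, vectors[i]))
    return best, len(candidates)


rng = random.Random(42)
vectors = [tuple(rng.random() for _ in range(4)) for _ in range(500)]
query = (0.5, 0.5, 0.5, 0.5)
centroids, cells = build_ivf(vectors, nlist=10)
approx, scanned_few = ivf_search(query, vectors, centroids, cells, nprobe=2)
exact, scanned_all = ivf_search(query, vectors, centroids, cells, nprobe=10)
```

With nprobe equal to nlist the search scans every vector and is exact. Lower nprobe scans fewer vectors and may miss the true nearest neighbor, which is the recall-for-speed trade that nprobe controls.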

Hybrid search (BM25 + vector) is now standard across Pinecone, Weaviate, Qdrant, Milvus, and pgvector. Pure vector search misses exact-match terms (product SKUs, error codes); pure BM25 misses semantic synonyms. For a deeper implementation guide, see self-hosted RAG on Postgres + pgvector.
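One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is illustrative only: the vendors listed above may use different fusion methods, the document IDs are made up, and k=60 is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["sku-123", "doc-a", "doc-b"]    # exact-match term ranks first
vector_hits = ["doc-a", "doc-c", "sku-123"]  # semantic neighbors rank first
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF uses only ranks, so it avoids having to normalize BM25 scores against cosine similarities, which live on different scales.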

06 · Knowledge Graphs & Episodic Memory · Neo4j, Graphiti, and the Letta episodic model: when to use each.

Vector retrieval handles semantic similarity well. It does not handle multi-hop relational reasoning (“find all suppliers of suppliers of company X who are also customers of company Y”) or temporal recall (“last week we tried approach A and it failed — why?”). Those are knowledge-graph and episodic memory problems respectively.

Knowledge graphs. Neo4j (Cypher query language) is the commercial baseline — graph-native storage, ACID-compliant, with a Vector Index (HNSW) added for hybrid retrieval-augmented graph traversal. For agents that need multi-hop reasoning over structured relationships, Neo4j + vector index is the combination that production architectures converge on. Other viable options: Memgraph (in-memory, Cypher-compatible), AWS Neptune (Gremlin + SPARQL + openCypher), ArangoDB (multi-model: document + graph + key-value).
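The supplier query from the start of this section shows why this is a traversal problem and not a similarity problem. The sketch below runs it over an in-memory edge list in plain Python instead of a graph database. The company names and the single "supplies" relation are made up for illustration, and a real deployment would express the same traversal in its graph query language.

```python
from collections import defaultdict


def build_index(supplies_edges):
    """Index (supplier, customer) edges in both directions."""
    suppliers_of = defaultdict(set)
    customers_of = defaultdict(set)
    for supplier, customer in supplies_edges:
        suppliers_of[customer].add(supplier)
        customers_of[supplier].add(customer)
    return suppliers_of, customers_of


def second_tier_suppliers_also_customers(supplies_edges, x, y):
    """Suppliers of suppliers of x that are also customers of y."""
    suppliers_of, customers_of = build_index(supplies_edges)
    second_tier = set()
    for first_tier in suppliers_of[x]:          # hop 1: who supplies x
        second_tier |= suppliers_of[first_tier]  # hop 2: who supplies them
    return second_tier & customers_of[y]         # filter: y also supplies them


edges = [
    ("Acme", "X"),     # Acme supplies X
    ("Bolt", "Acme"),  # Bolt supplies Acme
    ("Cog", "Acme"),   # Cog supplies Acme
    ("Y", "Bolt"),     # Bolt is a customer of Y
    ("Y", "Zed"),      # Zed is a customer of Y but not in X's supply chain
]
result = second_tier_suppliers_also_customers(edges, "X", "Y")
```

Each hop follows explicit edges, so no embedding similarity is involved. That is the part of the query a vector index cannot answer on its own.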

Zep's Graphiti engine is a temporal knowledge graph with timestamped node and edge updates, purpose-built for agent memory where “yesterday vs today” matters. Per vendor-published benchmark data (treat as indicative): Zep scores approximately 63.8% on LongMemEval (GPT-4o), around 15 points higher than Mem0's 49.0%, largely attributed to Graphiti's temporal reasoning capability. Source: Atlan's 2026 memory framework comparison, retrieved May 24, 2026.

Episodic memory via Letta (the MemGPT lineage). MemGPT is the research lineage (Berkeley 2023); Letta is the commercial spin-out that commercialized those patterns. The MemGPT/Letta architecture uses explicit memory management functions: core_memory_append, core_memory_replace, archival_memory_search, and conversation_search. Letta's three-tier model is deliberately inspired by OS memory management — core memory (always in-context, like RAM), archival memory (external searchable vector store, like disk), and recall memory (conversation history).
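The three tiers are easier to reason about with a toy model. The class below borrows the function names mentioned above but is not Letta's API: the signatures, the character limit, the `archival_memory_insert` and `record_message` helpers, and the keyword-overlap search (standing in for vector search) are all assumptions made for illustration.

```python
class ToyTieredMemory:
    """Toy three-tier memory: core (in-context), archival, and recall."""

    def __init__(self, core_limit_chars=200):
        self.core = {"persona": "", "human": ""}  # always in context, size-capped
        self.core_limit_chars = core_limit_chars
        self.archival = []  # external long-term notes
        self.recall = []    # conversation history as (role, text)

    def core_memory_append(self, block, text):
        updated = (self.core[block] + "\n" + text).strip()
        if len(updated) > self.core_limit_chars:
            raise ValueError("core block full; move detail to archival memory")
        self.core[block] = updated

    def core_memory_replace(self, block, old, new):
        if old not in self.core[block]:
            raise ValueError("text to replace not found in core block")
        self.core[block] = self.core[block].replace(old, new)

    def archival_memory_insert(self, text):
        self.archival.append(text)

    def archival_memory_search(self, query):
        # Keyword overlap stands in for vector similarity in this sketch.
        terms = set(query.lower().split())
        return [t for t in self.archival if terms & set(t.lower().split())]

    def record_message(self, role, text):
        self.recall.append((role, text))

    def conversation_search(self, query):
        return [(r, t) for r, t in self.recall if query.lower() in t.lower()]


mem = ToyTieredMemory()
mem.core_memory_append("human", "Name: Sam")
mem.core_memory_replace("human", "Sam", "Samira")
mem.archival_memory_insert("Approach A failed on staging")
mem.archival_memory_insert("Customer prefers email")
mem.record_message("user", "Can we retry on staging?")
```

The size cap on core memory is the point of the design: the agent has to decide what stays in context and what moves to archival storage.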

Letta scores approximately 83.2% on the LoCoMo long-conversation memory evaluation according to vendor-published benchmarks — strong performance on “agent remembers we tried X yesterday and it failed” tasks. This is vendor-published data; treat it as directionally informative rather than independently verified. For the full Mem0/Letta/Zep comparison, the April 14 deep-dive on agent memory architectures remains the definitive framework reference.

The long-context vs explicit memory decision has crossed over from a capability question to an economics question. For agents with fewer than 500K tokens of accumulated history, Opus 4.7's flat 1M context may be operationally cheaper than running Mem0 plus a managed vector store. (Digital Applied analysis, May 24, 2026)

07 · Long-Context Architecture · “Just stuff it in context” is now a defensible architecture for small fleets.

The argument for long-context as a memory architecture has always existed in theory. What changed in May 2026 is the economics. Two developments collapsed the cost case for explicit episodic memory at small scale:

Claude Opus 4.7 (April 16, 2026) introduced a 1M token context window at flat pricing: $5 input / $25 output per Mtok. Unlike prior generations, there is no surcharge above 200K tokens. This makes long-context retrieval, where you simply include all accumulated session history in every call, roughly cost-competitive with maintaining a Mem0 + Pinecone stack for agents with fewer than ~500K tokens of accumulated history. For context on how Opus 4.7's context window performs at scale, see our long-context vs explicit retrieval analysis.

Gemini 3.5 Flash (GA May 19, 2026) offers the cheap-1M-context alternative: $1.50 input / $9.00 output per Mtok, with $0.15/Mtok for cached input. It is the default backing model for Google's Managed Agents API. At those price points, single-user agents with infrequent calls may find Gemini 3.5 Flash + full context cheaper than any managed vector store option.

GPT-5.5's 1.05M context window carries a long-context surcharge above 272K input tokens (2x input, 1.5x output for the full session) — so do not treat all 1M+ context models as equivalent in economics. The flat-pricing advantage is specific to Opus 4.7 and Gemini 3.5 Flash.

The crossover analysis from the Opus 4.7 cost strategy guide: for agents with fewer than ~500K tokens of accumulated history and fewer than 10 sessions, long-context can undercut Mem0 + Pinecone on total cost. Above ~10M history tokens or 100+ sessions, explicit memory becomes mandatory regardless of model. This is synthesis from public pricing data — verify with your specific usage patterns before architectural decisions.
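A back-of-the-envelope version of the crossover can be written in a few lines. This is a sketch under loud assumptions: it counts input tokens only, ignores output tokens, prompt caching, and the tokens a retrieval stack would still send per call, and it uses the ~$70/month Pinecone figure from section 05 as a stand-in for the whole fixed stack cost (Mem0 fees are not included).

```python
def long_context_monthly_cost(history_tokens, calls_per_month, input_price_per_mtok):
    """Input cost of resending the full history on every call."""
    return history_tokens / 1_000_000 * input_price_per_mtok * calls_per_month


def breakeven_calls(history_tokens, input_price_per_mtok, fixed_stack_cost_per_month):
    """Calls per month at which full-context input cost equals the fixed stack cost."""
    per_call = history_tokens / 1_000_000 * input_price_per_mtok
    return fixed_stack_cost_per_month / per_call


opus_breakeven = breakeven_calls(500_000, 5.00, 70.0)   # 28 calls/month
flash_breakeven = breakeven_calls(500_000, 1.50, 70.0)  # about 93 calls/month
```

Under these assumptions, a 500K-token history costs $2.50 of input per Opus 4.7 call, so long-context stays cheaper only below roughly 28 full-context calls a month. Call volume matters as much as history size, which is why the thresholds above are stated in sessions as well as tokens.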

For the broader context-window competition, the context-window arms race guide covers how long-context competes as a product surface.

08 · Decision Matrix · The 10-scenario memory architecture decision matrix.

The following matrix maps agent use-case dimensions — data type, query pattern, scale, latency, and budget — to specific recommended architectures. It is the first published operational decision matrix with May 2026 pricing crossovers included. For the agentic RAG retrieval patterns that complement this matrix, see our agentic RAG patterns guide. For common retrieval failure modes to avoid, see RAG anti-patterns and failure modes.

Dev / prototype
Unstructured docs · semantic search · 1K-100K scale

Data type: unstructured docs. Query pattern: semantic search. Scale: 1K-100K vectors. Latency: interactive (<1s). Budget: low. Recommended: Chroma embedded mode or pgvector + Mem0. Rationale: Chroma has near-zero setup overhead in embedded mode (SQLite-backed, in-process); pgvector works if you're already on Postgres/Supabase. Mem0 handles the memory abstraction layer above whichever you pick.

Chroma or pgvector + Mem0

Mid-scale SaaS
Unstructured docs · semantic search · 1M-10M scale

Data type: unstructured docs. Query pattern: semantic search. Scale: 1M-10M vectors. Latency: interactive (<1s). Budget: mid. Recommended: pgvector (Supabase) or Qdrant Cloud + Mem0. Rationale: pgvector HNSW handles up to ~10M vectors well if you're already on Postgres. Qdrant Cloud Flex is competitive in cost and adds payload filtering and quantization.

pgvector or Qdrant Cloud + Mem0

Production SaaS
Semantic search · 10M-100M scale · real-time

Data type: unstructured docs. Query pattern: semantic search. Scale: 10M-100M vectors. Latency: real-time (<100ms). Budget: high. Recommended: Pinecone Serverless or Qdrant Cloud Dedicated + Mem0. Rationale: Pinecone Serverless is hands-off managed at this scale. Qdrant Dedicated is cheaper at high query volume (managed breakeven: 60-80M queries/month). Mem0 adds the cross-session abstraction.

Pinecone Serverless or Qdrant Dedicated + Mem0

Enterprise batch
Semantic search · 100M+ scale · GPU budget

Data type: unstructured docs. Query pattern: semantic search. Scale: 100M+ vectors. Latency: batch (>5s OK). Budget: high. Recommended: Milvus (GPU-CAGRA) or Vespa. Rationale: Milvus is the only OSS vector DB with native GPU acceleration (CAGRA, GPU-IVF) and distributed horizontal scaling. Vespa is the alternative for teams that want native ranking + vector in one deployment.

Milvus (GPU-CAGRA) or Vespa

Relational / multi-hop
Structured relationships · multi-hop reasoning

Data type: structured relationships. Query pattern: multi-hop reasoning (find all suppliers of suppliers of X who are also customers of Y). Scale: any. Latency: interactive. Budget: mid+. Recommended: Neo4j + vector index for hybrid retrieval. Rationale: vector retrieval is single-hop semantic similarity; knowledge-graph traversal is N-hop relational. Neo4j's HNSW vector index enables hybrid retrieval-augmented graph traversal.

Neo4j + vector index hybrid

Temporal recall
Time-stamped events · “last week we tried...”

Data type: time-stamped events. Query pattern: temporal recall (yesterday we tried X, it failed — why?). Scale: <10M. Latency: interactive. Budget: mid. Recommended: Zep (Graphiti) or Letta. Rationale: Zep's temporal knowledge graph stores timestamped node/edge updates; Letta's episodic model has explicit conversation_search with temporal indexing. Both outperform pure vector search on temporal queries (Zep 63.8% vs Mem0 49.0% on LongMemEval, vendor-reported).

Zep (Graphiti) or Letta

Single-user small fleet
Any data · <500K tokens history, <10 sessions

Data type: any. Query pattern: any. Scale: <500K tokens accumulated history. Latency: interactive. Budget: mid. Recommended: long-context model (Claude Opus 4.7 1M flat-priced or Gemini 3.5 Flash 1M). No explicit memory layer required. Rationale: at <500K tokens, the per-Mtok cost of full-context inclusion is competitive with or cheaper than the monthly fixed cost of Mem0 + managed vector store. Simplest possible architecture — no retrieval layer to maintain.

Opus 4.7 or Gemini 3.5 Flash, no explicit memory

Multi-tenant identity
Identity-scoped facts · per-user 10K-1M history

Data type: per-user facts, preferences, history. Query pattern: identity-scoped retrieval. Scale: 10K-1M per user. Latency: interactive. Budget: mid. Recommended: Mem0 + Postgres OR Google Memory Bank (if building on Gemini). Rationale: Mem0 provides cross-session user memory with a graph-enhanced model; Google Memory Bank is purpose-built for this pattern in the Gemini stack. If vendor-agnostic, Mem0 + Postgres gives portability.

Mem0 + Postgres or Google Memory Bank

Self-improving agent
Cross-session pattern extraction, async OK

Data type: session transcripts + existing memory store. Query pattern: cross-session pattern extraction and consolidation. Scale: any. Latency: async OK (minutes to hours). Budget: mid+. Recommended: Anthropic Dreaming (Managed Agents) or a homegrown reflection loop. Rationale: Dreaming is the only production-grade async consolidation primitive shipping today. For teams not on Managed Agents, a scheduled reflection loop (agent reads past transcripts, writes new memory entries) replicates the pattern.

Anthropic Dreaming or reflection loop

Hybrid at scale
Mixed text + relationships · hybrid semantic + structural

Data type: mixed (unstructured text + structured relationships). Query pattern: hybrid semantic + structural (find relevant documents AND reason about their relationships). Scale: 1M-100M. Latency: interactive. Budget: mid+. Recommended: Qdrant + Neo4j hybrid. Rationale: Qdrant handles high-performance semantic vector retrieval; Neo4j handles multi-hop relational traversal. The combination is the production hybrid memory pattern for knowledge-intensive agents.

Qdrant + Neo4j hybrid
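The ten scenarios can also be read as a rule cascade. The sketch below encodes them as one function, with several assumptions that are not in the matrix itself: the order in which rules are checked, folding the uncovered 100K-1M range into the mid-scale row, treating "multi-hop at 1M+ vectors" as the hybrid case, and ignoring latency and budget. `Workload` and `recommend` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    vectors: int = 0                      # approximate vector count
    history_tokens: Optional[int] = None  # accumulated history, if known
    sessions: Optional[int] = None        # number of sessions, if known
    multi_hop: bool = False
    temporal: bool = False
    identity_scoped: bool = False
    async_consolidation: bool = False


def recommend(w: Workload) -> str:
    """Map a workload to a matrix row; rule order is an assumption."""
    if w.async_consolidation:
        return "Anthropic Dreaming or reflection loop"
    if w.multi_hop:
        if w.vectors >= 1_000_000:
            return "Qdrant + Neo4j hybrid"
        return "Neo4j + vector index hybrid"
    if w.temporal:
        return "Zep (Graphiti) or Letta"
    if w.identity_scoped:
        return "Mem0 + Postgres or Google Memory Bank"
    small_history = w.history_tokens is not None and w.history_tokens < 500_000
    few_sessions = w.sessions is not None and w.sessions < 10
    if small_history and few_sessions:
        return "Long-context model (Opus 4.7 or Gemini 3.5 Flash), no explicit memory"
    if w.vectors <= 100_000:
        return "Chroma or pgvector + Mem0"
    if w.vectors <= 10_000_000:
        return "pgvector or Qdrant Cloud + Mem0"
    if w.vectors <= 100_000_000:
        return "Pinecone Serverless or Qdrant Dedicated + Mem0"
    return "Milvus (GPU-CAGRA) or Vespa"
```

For example, `recommend(Workload(history_tokens=300_000, sessions=5))` lands on the long-context row, while `recommend(Workload(vectors=5_000_000))` lands on the mid-scale row. Treat the output as a starting point for discussion, not a substitute for checking current pricing.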

09 · Security & Migration · Memory poisoning, OWASP LLM04, and the LangChain migration every 2026 build needs.

Two maintenance items affect every production agent memory implementation in 2026: the security surface introduced by writable memory, and the LangChain deprecation that has left thousands of published guides pointing at dead APIs.

Memory poisoning. OWASP's Top 10 for LLM Applications 2025 lists LLM04 Data and Model Poisoning and LLM08 Vector and Embedding Weaknesses as the primary memory-relevant attack vectors. Memory poisoning includes three categories: prompt injection via retrieved document (a malicious document in the vector store causes the agent to execute unintended instructions); embedding-collision attacks (adversarially crafted inputs that appear semantically similar to legitimate queries but retrieve poisoned content); and direct memory-store tampering for agents with explicit write tools like Letta's core_memory_append or Anthropic's filesystem Memory Tool.

Practical mitigations: validate and sanitize all retrieved content before injection into the agent context; implement read-only memory paths where write access is not required; use the Memory Tool's immutable versioning for audit trails; monitor the session event stream for unexpected memory write patterns. For common RAG failure modes that overlap with memory security, see the RAG anti-patterns guide.
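For filesystem-style memory, the read-only mitigation can be enforced with a small path guard in front of every memory operation. The sketch below is a minimal example, assuming the /mnt/memory/ root described earlier and a hypothetical read-only "policies" subdirectory. The check is purely lexical, so it does not resolve symlinks; a real deployment would also need to handle those.

```python
import posixpath

MEMORY_ROOT = "/mnt/memory"
READ_ONLY_DIRS = ("/mnt/memory/policies",)  # hypothetical read-only area


def _inside(path, directory):
    """True if path is the directory itself or lies beneath it."""
    return path == directory or path.startswith(directory + "/")


def resolve_memory_path(requested, write=False):
    """Return the normalized absolute path, or raise PermissionError.

    Rejects paths that escape the memory root (for example via '..' or an
    absolute path) and rejects writes into read-only directories.
    """
    full = posixpath.normpath(posixpath.join(MEMORY_ROOT, requested))
    if not _inside(full, MEMORY_ROOT):
        raise PermissionError(f"{requested!r} escapes the memory root")
    if write and any(_inside(full, d) for d in READ_ONLY_DIRS):
        raise PermissionError(f"{requested!r} is in a read-only memory path")
    return full
```

Calling `resolve_memory_path("notes/todo.md", write=True)` returns the normalized path, while `resolve_memory_path("../../etc/passwd")` raises. Rejected attempts are worth logging alongside the session event stream, since they are a direct signal of a tampering attempt.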

LangChain memory is deprecated. As of 2026, BufferMemory, ConversationBufferMemory, ConversationSummaryBufferMemory, and VectorStoreRetrieverMemory from langchain.memory are the old API. Per db0.ai's breakdown: “A significant change in 2026 is that LangGraph is now the only officially supported way to do memory” in the LangChain ecosystem. The LangChain memory overview docs confirm the migration path. Many published tutorials — including results on the first page of Google for “LangChain memory” — still reference BufferMemory as current. If your team is following any guide written before mid-2025 for LangChain memory, it is almost certainly pointing at deprecated code.

The LangGraph memory model uses a clean thread/store separation: short-term memory is thread-scoped (agent state persisted via a checkpointer to a database, fully resumable across interruptions); long-term memory is store-scoped (vector or graph backend, queryable across threads). The thread/store distinction maps directly onto session-scoped vs cross-session memory, the same architectural separation that Anthropic's three-surface model and Google's Session State/Memory Bank primitives encode.
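The thread/store split is easy to see in a toy model. The class below is a plain-Python illustration of the concept, not LangGraph's API: the method names, the tuple namespaces, and the in-memory dictionaries are all assumptions made for the example.

```python
class ToyAgentMemory:
    """Short-term state keyed by thread; long-term facts keyed by namespace."""

    def __init__(self):
        self._threads = {}  # thread_id -> list of messages (short-term)
        self._store = {}    # (namespace, key) -> value (long-term)

    def append_message(self, thread_id, message):
        self._threads.setdefault(thread_id, []).append(message)

    def thread_history(self, thread_id):
        # Return a copy so callers cannot mutate stored state.
        return list(self._threads.get(thread_id, []))

    def put(self, namespace, key, value):
        self._store[(namespace, key)] = value

    def search(self, namespace):
        return {k: v for (ns, k), v in self._store.items() if ns == namespace}


memory = ToyAgentMemory()
user_ns = ("users", "u-42")  # hypothetical per-user namespace

# Session 1: a conversation plus one durable fact about the user.
memory.append_message("thread-1", "I prefer metric units")
memory.put(user_ns, "units", "metric")

# Session 2: a new thread starts empty but can still read the stored fact.
memory.append_message("thread-2", "What's the weather?")
```

Here `memory.thread_history("thread-2")` contains only the second session's message, while `memory.search(user_ns)` still returns the unit preference. That is the same session-scoped versus cross-session split described above.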

For teams using LlamaIndex, the memory modules available include ChatMemoryBuffer, ChatSummaryMemoryBuffer, VectorMemory, and SimpleComposableMemory for chaining — none of these carry the deprecation risk of the LangChain legacy APIs.

For the full glossary of agentic memory terms, see our agentic AI glossary. For AI transformation advisory that includes agent memory architecture reviews, see our AI transformation services.

Conclusion

Memory architecture in May 2026: Dreaming, Memory Bank, and the economics crossover.

The four developments that shipped between April 23 and May 24, 2026 — Anthropic's persistent memory beta, Dreaming, Google Memory Bank, and Mem0's cross-vendor ADK integration — collectively moved agent memory from a solved-in-framework problem to a platform-level primitive. The choice of which cloud API you build on now determines your default memory model. That is a significant architectural constraint that did not exist in January 2026.

The economics crossover is the second structural shift. Opus 4.7's flat 1M context and Gemini 3.5 Flash's sub-$2 per Mtok pricing have made “no explicit memory” a defensible choice for small fleets. The decision tree in section 08 encodes the scale and session thresholds at which explicit memory becomes necessary — those are engineering decisions, not opinion.

For the Mem0/Letta/Zep/Graph-RAG framework comparison that underpins the patterns above, the April 14 deep-dive on agent memory architectures remains the canonical reference for that layer. This post is the May 2026 update: Dreaming, Memory Bank, long-context economics, and the LangChain migration. The two posts together cover the full production memory architecture landscape as of May 24, 2026.

FAQ · AI Agent Memory Architecture 2026

The questions teams ask about AI agent memory architecture in 2026.

Anthropic Dreaming shipped on May 6, 2026 for Claude Managed Agents. It is an asynchronous, between-session process that reviews an agent's session transcripts and existing memory stores, extracts patterns, merges duplicates, replaces stale entries, and writes reorganized memory entries that future sessions can use. Anthropic explicitly models it on hippocampal memory consolidation — the neuroscientific process by which a human brain replays waking experiences during sleep to decide what to retain in long-term memory. Dreams run asynchronously (typically minutes to tens of minutes), produce plain-text notes and structured playbooks, and are observable via the session event stream. Dreaming is distinct from Anthropic's Memory Tool (filesystem-based, /mnt/memory/) and from the April 23, 2026 persistent memory public beta. It is available in Claude Managed Agents only. Harvey, the legal AI firm, reportedly saw 6x task completion improvement after enabling it — this is vendor-reported data, not an independently reproduced benchmark.