Mark Lubin


Papers Worth Reading If You're Thinking About Agent Memory

February 27, 2026

I've spent the last few months reading papers on agent memory—how LLM-based agents store, retrieve, update, and forget things across conversations. The space is moving fast. Three memory benchmarks in eight weeks, multiple "memory OS" papers, and ICLR 2026 has a dedicated workshop on the topic. These are the ones that changed how I think about the problem.


The systems

Generative Agents — Park et al., 2023 (UIST). The paper that started it. Memory stream + reflection + three-factor retrieval scoring (recency, importance, relevance). Elegant machinery, surprisingly rich emergent behavior. Also the first clear demonstration of what breaks: unbounded memory growth, no forgetting, no observability.
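
The three-factor scoring is simple enough to sketch. A minimal Python version with illustrative weights and decay rate (the paper normalizes each factor and uses its own constants, so treat these numbers as placeholders):

```python
def retrieval_score(memory, query_relevance, now_hours,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0,
                    decay=0.995):
    """Combine recency, importance, and relevance into one score.

    `memory` is assumed to carry `last_accessed_hours` and an
    `importance` in [0, 1]; `query_relevance` is a [0, 1] similarity
    between the query and the memory's embedding.
    """
    # Recency decays exponentially with hours since last access.
    recency = decay ** (now_hours - memory["last_accessed_hours"])
    return (w_recency * recency
            + w_importance * memory["importance"]
            + w_relevance * query_relevance)
```

Top-k memories by this score go into the prompt; the elegance is that one linear combination covers "what just happened," "what mattered," and "what's relevant now."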

MemGPT — Packer et al., 2023. Takes the OS metaphor seriously—context window as main memory, archival store as disk, agent-directed paging between tiers via function calls. Introduced tiered memory to the agent conversation. The paper itself notes that self-directed paging can be fragile. Spawned Letta.
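
The paging mechanic reduces to a bounded buffer plus an unbounded archive, with the LLM deciding when to move things between them via tool calls. A toy sketch of the shape (my names and eviction policy, not MemGPT's actual API):

```python
class TieredContext:
    """Toy two-tier memory: bounded 'main context' plus archival store.

    When the main context overflows, the oldest entries are paged out
    to the archive; a search call pages matches back in. This sketch
    evicts automatically; in MemGPT the model itself issues the calls.
    """
    def __init__(self, capacity=3):
        self.main = []       # what the model actually sees
        self.archive = []    # unbounded, disk-like tier
        self.capacity = capacity

    def append(self, message):
        self.main.append(message)
        while len(self.main) > self.capacity:
            self.archive.append(self.main.pop(0))  # page out oldest

    def archival_search(self, keyword):
        return [m for m in self.archive if keyword in m]
```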

Hindsight — Latimer et al., 2025. Four epistemically-typed memory networks (world facts, experiences, observations, confidence-scored opinions). Four-way parallel retrieval fused via RRF. A 20B model with Hindsight hits 83.6% on LongMemEval vs 60.2% for full-context GPT-4o—architecture matters more than model size.
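
Reciprocal Rank Fusion, the fusion step above, is only a few lines: each retriever contributes 1/(k + rank) per document, and the sums are re-sorted. A generic sketch (k = 60 is the conventional default from the original RRF paper, not necessarily Hindsight's setting):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists into one.

    Each list orders document ids best-first. A document's fused score
    is the sum over lists of 1 / (k + rank), so items ranking well in
    several lists rise to the top without any score calibration.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal for four heterogeneous memory networks is exactly that RRF needs no comparable scores across retrievers, only ranks.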

A-MEM — Xu et al., 2025 (NeurIPS). Zettelkasten for agents—atomic notes with metadata that autonomously link and evolve as new information arrives. 85-93% token reduction, linear O(N) at 1M entries. Memory as a self-organizing graph of evolving notes is a different primitive than a fact store.

Memory-R1 — Yan et al., 2025. Focuses on the write path—learning when and what to write via RL. +28% F1 from a policy trained on just 152 examples that generalizes zero-shot. Suggests write-side memory management has been dramatically underinvested relative to retrieval.


The benchmarks

LongMemEval — Wu et al., 2024 (ICLR 2025). Five abilities tested: extraction, multi-session reasoning, temporal reasoning, knowledge updates, abstention. The temporal reasoning gap is stark: roughly 25% for models against 92.6% for humans. Current memory systems fundamentally cannot track how things change over time.

LoCoMo — Maharana et al., 2024 (ACL). Fifty naturalistic conversations, 9K tokens across 35 sessions, multi-hop QA. Where LongMemEval tests specific abilities in controlled settings, LoCoMo tests whether memory holds up over realistic, messy, long-horizon interactions. Most new papers evaluate against both.


The constraint underneath

LLMs Do Not Have Human-Like Working Memory — Huang et al., 2025. GPT-4o failed all 200 trials of a yes/no game requiring internal state. LLMs are stateless text processors. Memory infrastructure isn't a nice-to-have—it's a load-bearing structural requirement. The memory layer is the agent's ability to be anything other than stateless.


Why hierarchies and lifecycles

Complementary Learning Systems — McClelland et al., 1995. The neuroscience case for why you need two memory systems: fast episodic (hippocampus) and slow consolidated (neocortex). A single system can't do both—fast specifics interfere with slow patterns. The theoretical reason tiered agent memory works, and the same reason the brain does it.

Generational Garbage Collection — Lieberman & Hewitt, 1983; Ungar, 1984. Not AI papers—programming language runtime design. Most objects die young, so partition by age and collect young generations frequently (cheap) and old generations rarely (expensive). The parallel to agent memory is precise: most observations are transient, a few persist. Forty years of engineering wisdom about managing memory hierarchies that nobody in agent memory is drawing on.
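
To make the parallel concrete, here is a toy two-generation store. This is my sketch, not any published system: young entries are swept often and cheaply, and only entries that prove useful during their young phase get promoted to the rarely-swept old generation.

```python
class GenerationalMemory:
    """Toy two-generation memory store, mirroring generational GC's bet
    that most entries die young.

    Young memories are swept frequently; anything accessed enough times
    while young is promoted to the old generation, which is swept rarely.
    """
    def __init__(self, promotion_hits=2):
        self.young = {}   # id -> {"text": ..., "hits": n}
        self.old = {}
        self.promotion_hits = promotion_hits

    def write(self, mid, text):
        self.young[mid] = {"text": text, "hits": 0}

    def access(self, mid):
        for gen in (self.young, self.old):
            if mid in gen:
                gen[mid]["hits"] += 1
                return gen[mid]["text"]
        return None

    def minor_sweep(self):
        """Cheap, frequent: promote hot young entries, drop the rest."""
        for mid, entry in list(self.young.items()):
            del self.young[mid]
            if entry["hits"] >= self.promotion_hits:
                self.old[mid] = entry
```

A real system would promote on importance or semantic durability rather than raw hit counts, but the age-partitioned sweep schedule is the transferable idea.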

The Actor Model — Hewitt et al., 1973; Ray — Moritz et al., 2018. If memory tiers operate at different timescales, how do you run them concurrently? Independent processes, message passing, no shared mutable state. Ray implements exactly this—heterogeneous actors at different cadences. Most agent memory systems are synchronous pipelines. I'm curious why.
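
The actor discipline applied to memory tiers fits in a few lines of stdlib Python, with a thread and a queue standing in for Ray actors: the fast tier emits observations and never blocks on the slow consolidation tier, and the tiers share no mutable state, only messages.

```python
import queue
import threading

# Each tier is an independent worker with its own inbox; tiers
# communicate only by messages (the actor-model discipline).
inbox = queue.Queue()
consolidated = []

def consolidator():
    # Slow tier: batches raw observations into summaries at its own cadence.
    batch = []
    while True:
        msg = inbox.get()
        if msg is None:          # shutdown sentinel
            break
        batch.append(msg)
        if len(batch) == 3:
            consolidated.append(" | ".join(batch))
            batch = []

worker = threading.Thread(target=consolidator)
worker.start()
for obs in ["saw A", "saw B", "saw C", "saw D"]:
    inbox.put(obs)               # fast tier never waits for the slow one
inbox.put(None)
worker.join()
```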

Matrix — Meta AI, 2025. Ray-native multi-agent framework. Stateless agents as Ray actors pulling from distributed queues, 2-15x throughput over centralized approaches. Validates that the runtime infrastructure for concurrent memory actors already exists and performs.

A History of Erlang — Joe Armstrong, 2007 (HOPL III). The design story behind Erlang/OTP. Process isolation, supervision trees, "let it crash"—and the core insight that restarting a process from clean state is cheaper and more reliable than trying to recover corrupted state. Erlang arrived at "periodic rebuild beats incremental repair" by way of telecom reliability requirements. The strongest prior art for why long-lived memory components should have finite lifespans and be rebuilt, not patched.


Emergence and adaptive dynamics

Emergence of Hierarchies in Multi-Agent Self-Organizing Systems — August 2025. Hierarchies emerge dynamically in MARL systems from joint objectives, quantified via gradient analysis. Memory tiers might not need to be hand-designed—set up the right forces (access frequency, semantic drift, information pressure) and tiers form naturally from usage patterns.

Emergence of Hybrid Computational Dynamics Through RL — October 2025. RL agents spontaneously develop hybrid attractor architectures: stable fixed-point attractors for maintaining decisions, quasi-periodic attractors for flexible evidence integration. Different memory tiers developing different dynamics might be a natural consequence of optimization, not a design choice.

Active Inference for Multi-Agent Coordination — September 2025. Principled framework for balancing information gain against coordination costs. Relevant because deciding what to consolidate vs. forget is an information-economic tradeoff, not just a retrieval optimization problem.

BRAIN: Bayesian Reasoning via Active Inference — February 2025. Continuous Bayesian belief updating that handles distribution shifts without retraining. Fast reactive updating at the working level, periodic consolidation at the identity level—a more natural model than extract-store-retrieve.


Information theory: compression, forgetting, and what to keep

MemFly — February 2026. Formulates agent long-term memory as an Information Bottleneck optimization: preserve maximum relevant information at minimum storage cost. The paper I wish more memory system builders would read. It shows you can formalize the compression-utility tradeoff instead of just hoping the LLM summarizes well.
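
For reference, the standard Information Bottleneck objective, written here in generic notation rather than MemFly's:

```latex
\min_{p(m \mid x)} \; I(X; M) \;-\; \beta\, I(M; Y)
```

Read X as the raw conversation history, M as the stored memory, and Y as the future queries the memory must serve; β is the exchange rate between storage cost and downstream utility. The point of the formalism is that "summarize well" becomes a tunable tradeoff instead of a hope.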

Rate-Distortion Framework for Summarization — Arda & Yener, 2025 (ISIT). Defines the summarizer rate-distortion function as a fundamental lower bound. Aimed at text summarization, but the math transfers directly to agent memory consolidation. Are current systems anywhere near the optimal curve? I haven't seen anyone measure this.

The Data Processing Inequality. The most basic and most relevant result from information theory: every transformation is lossy or lossless, never constructive. Once you compress conversations into extracted facts, the lost information is gone. Every pipeline stage (chunking, extraction, summarization, embedding) is a one-way door. This is why raw storage underneath a compression layer matters—not to query often, but to re-derive when you discover the compression was wrong.
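
Formally: if the pipeline forms a Markov chain from raw history X through compressed memory M to a downstream answer A, the data processing inequality bounds what any later stage can recover:

```latex
X \to M \to A \quad \Longrightarrow \quad I(X; A) \le I(X; M)
```

No stage after compression can add back information about X that M discarded, which is the whole argument for keeping raw storage underneath.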

Mutual Information Surprise — August 2025. Redefines surprise as epistemic growth, not anomaly. MIS quantifies how much a new observation restructures what the agent knows. Offers a principled trigger for memory formation: store the things that change what you know the most, not just what's recent or labeled "important."
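
A toy version of a belief-shift write trigger, using KL divergence as a crude stand-in for the paper's mutual-information surprise measure (the threshold and the discrete-belief framing are mine, for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def should_store(belief_before, belief_after, threshold=0.1):
    """Write to memory only if the observation moved the agent's beliefs.

    The 'surprise' of an observation is how far it shifted the belief
    distribution, not how recent it was or how 'important' it was labeled.
    """
    return kl_divergence(belief_after, belief_before) >= threshold
```

An observation that nudges beliefs from (0.5, 0.5) to (0.51, 0.49) is not worth a write; one that shifts them to (0.95, 0.05) is.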

AGM Postulates — Alchourrón, Gärdenfors & Makinson, 1985; Darwiche & Pearl (iterated belief revision). Formal theory of rational belief change. The key insight: revision should operate on epistemic states (what you believe and why), not just belief sets. Without tracking provenance—where beliefs came from and what depends on them—you can't do correct revision when contradictions arrive.

Forgetful but Faithful — December 2025. MaRS architecture with typed, provenance-tracked memory nodes and six formal forgetting policies with differential privacy guarantees. Implements the derivation-chain tracking that the AGM framework says you need for responsible belief revision and forgetting.


The new arrivals: memory operating systems

EverMemOS — January 2026. Engram-inspired lifecycle: episodic trace formation, semantic consolidation into thematic MemScenes, reconstructive recollection. SOTA on LoCoMo and LongMemEval. The lifecycle framing—formed, consolidated, reconstructed—is closer to cognitive science than extract-and-index.

TiMem — January 2026. Temporal Memory Tree with progressive consolidation—recent conversations in detail, older ones as patterns, oldest as high-level identity. 52% context reduction. The right concept: compression increases with age.

MemOS — MemTensor, July 2025. Three-layer memory OS with MemCube abstraction (plaintext + activations + parameters), lifecycle management, multi-level access control. The most explicitly infrastructure-oriented memory paper I've found. Treats memory as a managed system resource, not an agent feature.

MemoryGraft — December 2025. Memory security paper. Poisoned experiences injected into a memory store get retrieved and imitated, causing persistent behavioral drift. Validates from the adversarial direction that versioning, provenance, and integrity checks are security requirements, not nice-to-haves.


What I'm still looking for

Memory regression testing — how do you know an update to extraction or consolidation didn't make agent behavior worse? Closest thing is Evo-Memory (DeepMind), but that's a benchmark, not a testing framework.

Memory observability — tracing a bad agent decision back to a specific memory retrieval. The debugging story is nonexistent.

Concurrent runtime + memory lifecycle — the pieces exist in different literatures, but I haven't found a paper that composes them.

If you know of work on any of these, I'd be interested to hear about it.


Reading list current as of February 2026. I have no affiliation with any of the research groups listed.