AI's memory wall: why compute outpaces memory and how to fix it
Nate B Jones of AI News & Strategy Daily explains the structural reasons AI memory systems fail and presents eight principles for building better ones.
Summary
Nate B Jones presents a detailed analysis of what he calls the "memory wall" — a growing gap between AI compute capabilities, which have improved roughly 60,000-fold, and memory capabilities, which have improved only about 100-fold. He argues that this is not merely a hardware problem but a fundamental architectural one, and that current vendor solutions — larger context windows, passive memory accumulation, and proprietary memory layers — fail to address the root causes. Jones identifies six root causes of persistent AI memory failure, including the relevance problem, the persistence-precision trade-off, and the passive accumulation fallacy. He then presents eight principles for building memory systems that actually work, applicable whether you are an individual power user or an enterprise systems designer. His central argument is that memory requires active architecture, not passive features, and that users who build proper memory structures now will have a compounding advantage as AI systems mature.
FULL TRANSCRIPT
The Memory Wall: Hardware and Architecture
Nate B Jones: Memory is perhaps the biggest unsolved problem in AI, and it is one of the only problems in AI that is getting worse, not better. As we get better and better at intelligence, we get worse at memory, relatively speaking.
In fact, there's a name for it in the model-maker community. It's called the memory wall. We are not improving the hardware chip capabilities of our memory systems nearly as fast as we are improving the ability of those chips to infer or compute words — to do LLM inference. That generates a growing gap between our intelligence capabilities and our memory capabilities.
Don't worry, we won't stay at the hardware level for long. I want to go through with you the core issues that we see as builders, as users of AI, as designers of AI systems. What is the root of the memory problems we experience? If we're at a systems design level, if we're at a usage level, if we are even using ChatGPT, why are memory problems so sticky and hard to untangle? Why have we not seen better solutions in the market? I think there are good reasons for that. And then once we go through those root causes, how can we start to think about solving them? How can we think about solving them as users? How can we think about solving them as builders?
I'm going to go through six root causes, and then we're going to flip the script and I'm going to go through eight principles for building a solution, because I want you to walk away from this feeling empowered to actually design better memory systems. I don't want you to wait around for someone in Silicon Valley to make a pitch and get funded for this. You can design your own solution here.
Statelessness by Design
The key thing to keep in mind through this whole conversation is that AI systems are stateless by design, but useful intelligence requires state. Every conversation is stateless, meaning it starts from zero. The model has parametric knowledge — the weights we talk about in a model — but it doesn't have episodic memory. It does not remember what happened to you. And the ten or eleven sentences, or the very lossy memory that ChatGPT has right now, or the ability to search conversations that Claude has right now, is not good enough for that. You have to reconstruct your context every single time.
This is not a bug, actually. It is an intentional architecture. It is a design for statelessness, because the model makers want the model to be maximally useful at solving the next problem — the problem in front of you. And they cannot presume that state matters. It doesn't always matter.
So the promise of memory features is that vendors are going to be able to magically solve this by making the system stateful in ways that are useful to you. But this creates a whole host of new problems, because statefulness is not the same for all of us. What should it remember? Is it passive accumulation? Is it active curation? How long should it remember? Is it persistent forever? Is it ever stale? Does it drop off after thirty days? When do you retrieve it? Do you retrieve it when it's relevant, sort of like Claude does? Do you retrieve it all the time and potentially have it be noisy in the context window? How do you update it? This is one of the biggest problems with LLMs. People tell me they'll put their wiki into a retrieval-augmented generation system and I'm like, when was the last time you updated your wiki? If it's not updated, how do you overwrite it? How do you append data to it? How do you change data?
These are not implementation details. They are fundamental questions about what memory is and its purpose when we do work. Memory matters because we humans are able to quickly and fluidly negotiate between stateless brainstorming — things that are wide open and where we don't need to use a lot of our past memory — and very stateful work. LLMs are not good at that. Loading that context is very hard right now.
Root Cause One: The Relevance Problem
So why is this so persistent? We've talked a little about how the promise is hard to fulfill, but what are some of the root causes that make it hard for vendors to do this?
Number one, the relevance problem is one of the gnarliest unsolved problems out there. What's relevant actually changes based on the task that you're doing. Are you planning? Are you executing? The phase of your work — are you just exploring, or are you refining? The scope you're in — is it personal or is it a project?
I know someone who is in the healthcare industry, and they have to be very careful because if they were to ever ask for health advice, the memory retrieval within ChatGPT would pull up work stuff. And they are afraid that in the same context, if they pull up a work thing, their personal health data will leak in — because it will all look like health data. So the scope matters. What has changed since the last time you talked? We would call that the state delta. If you come back and say this is a new version, does the AI really understand that it's a new version?
Semantic similarity — which is what retrieval-augmented generation depends on — is just a proxy. It is a proxy for relevance. It is not a true solution. Finding similar documents works until you need to find the document where we decided X, and that's very specific. Or: ignore everything about Client A right now but pay attention to Clients B, C, and D. Or: please only pay attention to what we've decided since October 12th. These are all things that we humans can understand and execute on when we go and manually retrieve information. But the AI using semantic search — it's just not the right tool for that job.
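The requests in this passage (ignore Client A, only decisions since October 12th) are exact filters rather than similarity questions. A minimal sketch of applying such filters before any similarity ranking follows. It assumes each note carries hypothetical `client`, `kind`, and `created` metadata fields, uses a toy word-overlap score as a stand-in for embedding similarity, and picks an arbitrary year for the dates.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Note:
    text: str
    client: str      # hypothetical metadata field
    kind: str        # "decision" or "discussion" (assumed labels)
    created: date


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of words."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve(notes, query, *, exclude_clients=(), since=None, kind=None, k=3):
    """Apply exact metadata filters first, then rank the survivors by similarity."""
    candidates = [
        n for n in notes
        if n.client not in exclude_clients
        and (since is None or n.created >= since)
        and (kind is None or n.kind == kind)
    ]
    ranked = sorted(candidates, key=lambda n: overlap_score(query, n.text), reverse=True)
    return ranked[:k]


# Illustrative data only; the year is arbitrary.
notes = [
    Note("pricing decision: move to annual billing", "A", "decision", date(2025, 10, 20)),
    Note("pricing decision: keep monthly billing", "B", "decision", date(2025, 10, 15)),
    Note("pricing discussion about billing options", "B", "discussion", date(2025, 10, 14)),
    Note("pricing decision: discount for pilots", "C", "decision", date(2025, 9, 30)),
]

hits = retrieve(
    notes,
    "pricing decision billing",
    exclude_clients={"A"},
    since=date(2025, 10, 12),
    kind="decision",
)
```

All four notes look similar to the query, so similarity alone cannot separate them. The metadata filters are what narrow the result to the single Client B decision.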
There's no general algorithm for relevance. There's no magic relevance solve that the AI can depend on. You need to use human judgment about task context. And that means requiring very complicated architectures to accomplish a specific memory task — not just better embeddings in a RAG memory system. And that, by the way, is one of the big reasons why one-stop-shop vendors often struggle with real implementations.
Root Cause Two: The Persistence-Precision Trade-Off
Number two, the persistence-precision trade-off is a massive issue with memory systems. If you store everything, retrieval becomes very noisy and very expensive — you jam up your context window. If you store selectively, you're going to lose information that you need later. If you let the system decide what to keep, it optimizes for something you didn't ask it to. Maybe it optimizes for recency. Maybe it optimizes for frequency. Maybe it optimizes for statistical saliency versus actual importance.
If you wonder what statistical saliency is — have you ever tried having an argument with ChatGPT or Claude or Gemini about the fact that it's emphasizing the wrong thing in something it's writing? That is saliency. That's a saliency defect.
Human memory is actually, funnily enough, very good at this through the technology of forgetting. We use incredibly lossy compression with emotional and importance weighting. Studies on human memory show that you can, with practice, get better and better at recalling specific things. But if you choose not to recall something that happened to you, you're just going to lose it. And what's interesting is that it seems to be a database-keys issue for us — I realize someone in the comments is going to be a neuroscientist and rightly take me to town, but my understanding of the reading is that you have to be able to remember the equivalent of a database key to retrieve the memory. If you can do that, the memory becomes accessible again. But your short-term memory, so to speak, is very lossy. And so you lose the database keys if you can't persist them with intent — if you don't intend to remember them.
That is why your childhood memories can be very accessible, but what happened last Thursday? You're sitting there thinking: did we eat out or not? Which day did we go to the movies? It's not because you have a profound issue with memory. It's because your brain is desperately compressing information to make it useful to you and has dumped those database keys. And when you go to the effort of remembering, you're literally retrieving the database keys to get the memory back. Forgetting is a useful technology for us.
AI systems don't have any of that. They either accumulate or they purge, but they do not decay. And what I'm describing — did I go to the movies? Oh, yeah, it was that movie, who was that character? Oh, now I'm recovering the key and I'm able to get it back — the memory has decayed into a lossy approximation in the memory key, but I can recover it if I put effort into it. We have nothing like that in AI. That is a uniquely human technology, and it's funny but we have to think about forgetting as a technology when we talk about memory.
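One way to picture decay with recoverable keys is a strength score that fades over time but is refreshed by deliberate recall. This is a minimal sketch, not a model of human memory and not a feature of any existing AI product. The `Trace` record, the half-life, and the threshold are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    key: str
    content: str
    importance: float          # assumed 0..1 weight set when the memory is stored
    last_recalled_day: float


HALF_LIFE_DAYS = 7.0  # hypothetical tuning value


def strength(trace: Trace, today: float) -> float:
    """Importance-weighted exponential decay since the last recall."""
    age = today - trace.last_recalled_day
    return trace.importance * math.pow(0.5, age / HALF_LIFE_DAYS)


def surfaced(traces, today, threshold=0.2):
    """Memories still strong enough to be offered without being asked for."""
    return [t for t in traces if strength(t, today) >= threshold]


def recall(traces, key, today):
    """Deliberate recall by key: works for faded traces too, and refreshes them."""
    for t in traces:
        if t.key == key:
            t.last_recalled_day = today
            return t
    return None


traces = [
    Trace("childhood-home", "the house with the red door", 1.0, last_recalled_day=0),
    Trace("last-thursday-dinner", "ate out at the noodle place", 0.3, last_recalled_day=0),
]

before = [t.key for t in surfaced(traces, today=14)]
recall(traces, "last-thursday-dinner", today=14)
after = [t.key for t in surfaced(traces, today=14)]
```

The low-importance trace drops out of `surfaced` after two half-lives, but it is never deleted. Recalling it by key brings it back, which is the accumulate-or-purge gap the passage describes.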
Root Cause Three: The Single Context Window Assumption
Number three, the single context window assumption. Vendors often try to solve memory by making context windows bigger. But volume is not the issue. Structure is the problem. A million-token context window is not a usable million-token context window if it's full of unsorted context. That is worse than a tightly curated ten-thousand-token window. The model still has to find what matters, parse the relevance, and ignore the noise. You have not solved the problem by expanding the context window. You have simply made your problem more expensive — sometimes substantially more expensive.
I know people who make API calls and don't budget them, and they're like, "Why is my API bill so high?" And I'm like, your API bill is high because you're stuffing the context window and throwing queries against it. It does not work well, and it is also very expensive.
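A back-of-envelope calculation shows why stuffing the window gets expensive. The price used here is a hypothetical figure chosen for easy arithmetic, not any vendor's actual rate, and the sketch ignores output tokens and any prompt caching discounts.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical USD price, not a real quote


def input_cost(context_tokens: int, queries: int) -> float:
    """Input-side cost when the full context is resent with every query."""
    return context_tokens * queries * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000


stuffed = input_cost(context_tokens=1_000_000, queries=200)  # 600.0
curated = input_cost(context_tokens=10_000, queries=200)     # 6.0
```

Under these assumptions the same 200 queries cost 100 times more against the stuffed window, before counting any loss in answer quality.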
The real solution requires multiple context streams with different life cycles and retrieval patterns. It is hard. You have to design it. It breaks the mental model of "just talk to the AI." That is why there is no one-size-fits-all solution.
Root Cause Four: The Portability Problem
Issue number four is the portability problem. Every single vendor builds proprietary memory layers because they think in their pitch deck that memory is a moat. I get it — it makes sense on a pitch deck. ChatGPT memory, Claude recall, Cursor memory banks — these are not inherently interoperable. Users invest time building up memory in a given system, and the model makers like that because it makes the switching cost real. You can't port what ChatGPT knows about you to Claude, and your memory is locked in.
The problem here is a problem of the commons. This behavior from vendors and model makers and tool builders encourages users to leave memory to the tool rather than building a proper context library. I get it from a product design perspective — how many users are really going to build a context library? But if we reframe it and say portability is a first-class problem, users should inherently be able to be multi-model. From a consumer standpoint, you might not care because ChatGPT has 800 million users and it dwarfs everything else. But Gemini is closing in on half a billion now. And from a business perspective, you have to be multi-model. It is a liability to be single-model.
So if you're building business memory systems, you must solve the portability problem. And the issue is that any given vendor is not incentivized to make memory truly portable — they want to make it proprietary to them. Then you have the same bottleneck, but now you're on a vendor who may not be as well-funded as the model maker. And so it becomes a house of cards.
Root Cause Five: The Passive Accumulation Fallacy
Number five, the passive accumulation fallacy. Most memory features assume you just use your AI normally and it will figure out what to remember. That is the default mental model of users, and so that's the assumption that memory features are built around. But this fails because the system cannot distinguish a preference from a fact. It cannot easily tell project-specific from evergreen context — I've often seen that mixed up. It doesn't automatically know when old information is stale. If you've ever wondered why ChatGPT or Claude or Perplexity comes back and talks about old AI models as if they are active today, that is the same issue. They can't tell when old information is stale, and the system optimizes for continuity, not for correctness. This is the "keep the conversation going" issue.
Useful memory fundamentally requires active curation. You have to decide what to keep, what to update, and what to discard. And that is work. Vendors promise passive solutions because active curation, they are told, does not scale as a product. I think we have to start by framing that problem better, because it turns out passive accumulation doesn't solve for it either. And this is still a big enough problem that it costs us billions of dollars at the enterprise level and is extremely frustrating for users both personally and professionally. The answer cannot be "there is no answer" or "we'll fake the answer."
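Active curation can be made concrete by recording, for every entry, what kind of thing it is, what scope it belongs to, and when it was last verified. The sketch below is one possible shape under those assumptions. The field names, the labels, and the review windows are hypothetical, and the final keep, update, or discard call still belongs to a person.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Entry:
    text: str
    kind: str                          # "preference" or "fact" (assumed labels)
    scope: str                         # "evergreen" or a project name
    verified_on: date
    review_after_days: Optional[int]   # None: never scheduled for review


def is_stale(entry: Entry, today: date) -> bool:
    """An entry is stale once its review window has passed."""
    if entry.review_after_days is None:
        return False
    return today - entry.verified_on > timedelta(days=entry.review_after_days)


def curate(entries, today, active_projects):
    """Sort entries into keep, review, and discard piles for a human to confirm."""
    piles = {"keep": [], "review": [], "discard": []}
    for e in entries:
        if e.scope != "evergreen" and e.scope not in active_projects:
            piles["discard"].append(e)   # the project is over
        elif is_stale(e, today):
            piles["review"].append(e)    # may be outdated, so re-verify
        else:
            piles["keep"].append(e)
    return piles


# Illustrative entries only.
entries = [
    Entry("Likes bulleted summaries", "preference", "evergreen", date(2024, 1, 5), None),
    Entry("Newest model is version X", "fact", "evergreen", date(2024, 1, 5), 90),
    Entry("Client wants a blue logo", "fact", "old-rebrand", date(2024, 6, 1), 180),
]
piles = curate(entries, today=date(2025, 1, 1), active_projects={"website-relaunch"})
```

The preference is kept, the dated claim about the newest model is flagged for review, and the fact from a closed project is proposed for discard.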
Root Cause Six: Memory Is Multiple Problems
Finally, number six on the root cause side — and then we're going to get to solutions, which will feel better. Memory is actually multiple problems. And that's part of why it's so hard.
When people say "AI memory," what they really mean is any number of things. Preferences — how I like things done. That could be a key-value pair that's persistent. Facts — what's true about particular things or entities, which can be structured and might need updates. Knowledge — domain expertise, which can be parametric, embedded in weights, but it might not be right, and then what do you do? Episodic memory — conversational, temporal, ephemeral knowledge. And procedural memory — have we solved this before? If episodic memory is what we've discussed in the past, procedural memory is how we solved this problem in the past. Those are also different things. You have exemplars there, successes and failures in procedural memory.
Every single memory type needs different system design to handle storage, retrieval, and update patterns. And if you feel like you're getting a headache here, you're not alone. This is why we don't have a good solution. Treating this problem as one problem guarantees you are going to solve none of the real problems well. And that is why memory remains a persistent issue, and in fact a steadily worsening one, in the AI community.
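The five memory types named above can be written down as a small design table, one row per type, with its own storage, retrieval, and update pattern. The type names come from the talk. The specific choices in each row are illustrative assumptions, not recommendations from the speaker.

```python
from enum import Enum


class MemoryType(Enum):
    PREFERENCE = "preference"   # how I like things done
    FACT = "fact"               # what is true about an entity
    KNOWLEDGE = "knowledge"     # domain expertise
    EPISODIC = "episodic"       # what we discussed
    PROCEDURAL = "procedural"   # how we solved it before


# One row per type. The storage, retrieval, and update choices are assumptions.
DESIGN = {
    MemoryType.PREFERENCE: {
        "store": "key-value", "retrieve": "always load", "update": "overwrite"},
    MemoryType.FACT: {
        "store": "structured records", "retrieve": "exact lookup", "update": "versioned edit"},
    MemoryType.KNOWLEDGE: {
        "store": "model weights plus reference notes", "retrieve": "semantic search",
        "update": "replace source"},
    MemoryType.EPISODIC: {
        "store": "timestamped log", "retrieve": "time-range query", "update": "append only"},
    MemoryType.PROCEDURAL: {
        "store": "tagged exemplars", "retrieve": "tag match",
        "update": "add success or failure"},
}


def design_for(memory_type: MemoryType) -> dict:
    """Look up the storage, retrieval, and update pattern for one memory type."""
    return DESIGN[memory_type]
```

No two rows are identical, which is the point of the passage: a single storage pattern cannot serve all five.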
Vendors are fundamentally treating this as an infrastructure solve, not an architecture solve. Bigger windows and better embeddings and cross-chat search scale, but they don't solve structurally. And users keep expecting passive solutions because they're sold passive solutions. "Just remember what matters" is not something you can expect to work — but we're told it will work. So if memory requires architecture and users want magic, the gap between what's promised, what's delivered, and what's needed has never been bigger. We have a memory wall of our own, beyond the chip level, in how we design our systems. And it won't get solved if we're solving the wrong problem.
Eight Principles for Solving Memory
So let's say you've gone through all of this and you want to solve memory correctly. I'm going to give you principles that work whether you are a power user at home who wants to build something yourself — and this absolutely works for that — or whether you are designing larger systems. It turns out that the principles for memory are fractal, because the problem is fractal. We have the same kinds of memory issues when we are individual power users in a chat as we do when we are designing agentic systems.
Principle One: Memory Is an Architecture
Nate B Jones: Number one — and there are going to be eight of these, so settle in. Memory is an architecture. It is not a feature. You cannot wait for vendors to solve this. Every tool will have memory capabilities, but if you leave it to tools, they will solve different slices. You need principles that work across all of them. And you need to architect memory as a standalone that works across your whole toolset.
Principle Two: Separate by Life Cycle, Not by Convenience
Principle two: you should separate by life cycle, not by convenience. As an example, you need to separate personal preferences — which can be permanent — from project facts — which can be temporary — and those should be separated from session state or conversation state, which can be ephemeral. Mixing different life cycle states — mixing permanent with temporary with ephemeral — just breaks memory. The discipline lies in keeping these apart cleanly.
And again, this works if you're in chat. It works if you're designing agentic systems. If you have a permanent personal preference, it is as simple as a very disciplined system prompt update: you go into the system rules and the system prompt for ChatGPT and say, "This is what you need to know about me. These are my personal preferences." Model makers are starting to expose that setting more prominently because they want users to fill it in. But they don't tell you how to use it properly. And when I observe how people actually use that "tell me about yourself" prompt, it is absolutely a mix of personal preferences and ephemeral stuff and project facts, because no one has taught them to use it better.
If you're designing agentic systems, it gets more complex, but it's the same separation of concerns. You have to separate out what are the permanent facts in the situation, what are project-specific facts, and what is session state.
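The separation described in this principle can be sketched as three stores with three different life cycles. This is a minimal illustration, assuming a hypothetical `LifecycleMemory` class and an arbitrary expiry window for project facts.

```python
from datetime import datetime, timedelta


class LifecycleMemory:
    """Three stores with three life cycles: permanent, project, session."""

    def __init__(self):
        self.preferences = []      # permanent until the user edits them
        self.project_facts = {}    # project name -> list of (text, expires_at)
        self.session = []          # cleared when the conversation ends

    def add_preference(self, text):
        self.preferences.append(text)

    def add_project_fact(self, project, text, now, ttl_days=90):
        # ttl_days is a hypothetical default, not a recommended value
        expires_at = now + timedelta(days=ttl_days)
        self.project_facts.setdefault(project, []).append((text, expires_at))

    def add_session_note(self, text):
        self.session.append(text)

    def end_session(self):
        self.session.clear()

    def context_for(self, project, now):
        """Assemble context from each tier, skipping expired project facts."""
        live = [t for t, exp in self.project_facts.get(project, []) if exp > now]
        return {
            "preferences": list(self.preferences),
            "project": live,
            "session": list(self.session),
        }


now = datetime(2025, 1, 1)
mem = LifecycleMemory()
mem.add_preference("Prefers concise answers")
mem.add_project_fact("relaunch", "Deadline is March 1", now, ttl_days=30)
mem.add_session_note("Currently editing the homepage headline")

during = mem.context_for("relaunch", now + timedelta(days=1))
mem.end_session()
later = mem.context_for("relaunch", now + timedelta(days=60))
```

Because the tiers never mix, ending the session or letting a project fact expire cannot touch the permanent preferences.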
Principle Three: Match Storage to Query Pattern
Principle number three: you need to match storage to query pattern. That means you're going to need multiple stores, because different questions require different retrieval.
In the chat situation I described, ChatGPT can retrieve the memory if it's in a system prompt — it just calls it into the context window and it's super simple, and most people would never think of it as memory, but that's what it is. If you're designing an agentic system, it is understanding the difference between, for example: what is my style, which could be a key-value pair; what is the client ID, which should be structured or relational data; what similar work have we done, which could be semantic or vector storage data; and what did we do last time, which should be event logs. Those are four different types of data — key-value, structured, semantic, event logs. Trying to do all of these in one storage pattern is going to fail.
And that is why when people say, "We have our data lake and it's going to be a RAG," I'm like — why? Why is it going to be a RAG? Have you heard the word RAG repeated a hundred times like a magic spell for memory? It does not work that way. You need to match storage to the query pattern. Otherwise, you just have a very expensive data dump.
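The four query patterns named in this principle map onto four different stores. The sketch below uses only the Python standard library: a dict for key-value data, an in-memory SQLite table for structured data, a toy word-overlap ranking standing in for vector search, and a list of dated events for the log. The table, the sample records, and the function names are all hypothetical.

```python
import sqlite3

# Key-value store: "what is my style?"
preferences = {"style": "short sentences, no jargon"}

# Structured store: "what is the client ID?"
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
db.executemany(
    "INSERT INTO clients (id, name) VALUES (?, ?)",
    [(101, "Acme"), (102, "Globex")],
)

# Semantic store: "what similar work have we done?"
past_work = [
    "onboarding email sequence for a fintech app",
    "quarterly board report for a retailer",
    "pricing page copy for a fintech startup",
]

# Event log: "what did we do last time?" (ISO dates sort chronologically)
events = [("2025-01-10", "drafted proposal"), ("2025-01-17", "sent revised proposal")]


def get_style():
    return preferences["style"]


def client_id(name):
    row = db.execute("SELECT id FROM clients WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None


def similar_work(query, k=1):
    """Toy ranking by word overlap; a real system would use embeddings."""
    q = set(query.lower().split())

    def score(doc):
        d = set(doc.lower().split())
        return len(q & d) / len(q | d)

    return sorted(past_work, key=score, reverse=True)[:k]


def last_event():
    return max(events)[1]
```

Each question goes to the store that can answer it exactly: `client_id("Acme")` is a lookup, not a similarity search, and only `similar_work` tolerates fuzzy matching.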
Principle Four: Mode-Aware Context Beats Volume
Principle number four: mode-aware context beats volume, hands down. More context is not better context. Planning conversations need breadth — space for alternatives, space for comparables. Brainstorming conversations are similar; you need to be able to range. Execution conversations and execution workflows in agentic situations need precision — precise constraints. Retrieval strategy needs to match your task type.
You cannot just sit there and think, "I'm going to have a brainstorming conversation and it's going to be incredibly precise," and just hope that it works. This is why I talk about prompting so much. Effectively, what prompting is doing is giving context that is mode-aware to an AI so that it can be in the right mode. That's super effective for chat users. But if you're designing agentic systems, it is your responsibility to architect mode awareness into the system so that it knows this is an execution environment, that precision matters, and that it is audited and evaluated on precision.
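For builders, one simple way to architect mode awareness is to attach a retrieval profile to each mode, so that breadth or precision is a setting rather than a hope. This is a sketch under assumptions: the mode names follow the talk, while the `RetrievalProfile` fields and the numeric thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalProfile:
    max_items: int     # how much context to admit
    min_score: float   # how relevant an item must be to qualify


# Hypothetical settings: broad for planning and brainstorming, tight for execution.
PROFILES = {
    "planning": RetrievalProfile(max_items=12, min_score=0.2),
    "brainstorming": RetrievalProfile(max_items=12, min_score=0.1),
    "execution": RetrievalProfile(max_items=4, min_score=0.7),
}


def select_context(scored_items, mode):
    """Keep the items that fit the mode's relevance bar and size limit."""
    profile = PROFILES[mode]
    kept = [(score, text) for score, text in scored_items if score >= profile.min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[: profile.max_items]]


# Illustrative relevance scores, as if produced by an upstream retriever.
scored = [
    (0.90, "constraint A"),
    (0.75, "constraint B"),
    (0.50, "comparable project"),
    (0.30, "alternative approach"),
    (0.15, "loosely related idea"),
]

for_execution = select_context(scored, "execution")
for_planning = select_context(scored, "planning")
for_brainstorming = select_context(scored, "brainstorming")
```

The same retrieved pool yields two precise constraints for execution and a wider spread of comparables and alternatives for planning and brainstorming.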
Principle Five: Build Portable as a First-Class Object
Principle number five: you need to build portable as a first-class object — portable and not platform-dependent. Your memory layer needs to survive vendor changes, tool changes, and model changes. If ChatGPT changes their pricing, if Claude adds a feature, your context library should be retrievable regardless. And that is something that almost nobody can say right now. The people who are doing it tend to be designing very large-scale agentic AI systems at the enterprise level. But this is a lesson we all need to take with us.
I think it is a best practice — it's sort of like keeping a go-bag next to the door in case something happens to your house. You need to have something portable that carries relevant memory and that you can use to have productive conversations with another AI.
I fully admit there is not an out-of-the-box solution for this. There are power users who configure Obsidian as a note-taking app and tie it into AI, making it a portable, platform-independent way of handling memory. There are people who use Notion for this. The common trait is that they are obsessed with making sure the memory is configured correctly for them, and the AI has to be queried or called correctly to engage with the piece of memory that matters — whether that is a key-value piece like "what's my style" or a semantic search like "what similar work have we done together." A good data structure accounts for that.
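A portable memory layer can be as plain as a file in an open format plus a way to render it as text for whichever assistant is in use. The sketch below assumes a hypothetical two-section memory layout and writes to a temporary directory so it runs anywhere. It is not a description of how Obsidian or Notion store data.

```python
import json
import tempfile
from pathlib import Path


def export_memory(memory: dict, path: Path) -> None:
    """Write memory as plain JSON so any tool or model can read it."""
    path.write_text(json.dumps(memory, indent=2, sort_keys=True), encoding="utf-8")


def load_memory(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


def render_go_bag(memory: dict) -> str:
    """Render memory as plain text that can be pasted into any assistant."""
    lines = ["About me and my work:"]
    for section in ("preferences", "projects"):
        lines.append(f"{section.title()}:")
        for key, value in memory.get(section, {}).items():
            lines.append(f"- {key}: {value}")
    return "\n".join(lines)


# Illustrative content only.
memory = {
    "preferences": {"style": "plain language, short paragraphs"},
    "projects": {"website-relaunch": "launch planned for Q3, copy in review"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "memory.json"
    export_memory(memory, path)
    restored = load_memory(path)

go_bag = render_go_bag(restored)
```

Nothing in the file depends on a particular vendor, so the same `go_bag` text works as an opening message in any chat tool or as a system prompt in any API.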
Principle Six: Compression Is Curation
Principle number six: compression is curation. Do not upload forty pages hoping the AI extracts what matters. I see people do this when they overload the context window and ask for an analysis of a report. You need to do the compression work. Whether in a separate LLM call or in your own work, you need to write the brief, identify the key facts that matter, and state the constraints. This is where judgment lives. And if you don't delegate that judgment, you will be happier with the precision and context awareness of the response.
Memory is bound up in how we humans touch the work. There are ways to use AI to amplify and expand your judgment — you can use a precise prompt to extract information in a structured way from forty pages of data, and then in a separate piece of work figure out what to do with that data. But it remains on you to make sure that the facts are correct, that the constraints are real, and that the precision work you're asking AI to do with that data is the correct precision work. The judgment in compression is human judgment. It may be human judgment that you amplify with AI, but it remains human judgment.
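The brief described in this principle has three parts: a goal, the key facts, and the constraints. A minimal sketch of holding those parts in a structure and rendering them into a compact prompt follows. The `Brief` class and the sample contents are hypothetical, and deciding which facts and constraints belong in it remains the human step.

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    goal: str
    key_facts: list = field(default_factory=list)
    constraints: list = field(default_factory=list)

    def render(self) -> str:
        """Render the brief as a short block of text for the model."""
        parts = [f"Goal: {self.goal}", "Key facts:"]
        parts += [f"- {fact}" for fact in self.key_facts]
        parts.append("Constraints:")
        parts += [f"- {constraint}" for constraint in self.constraints]
        return "\n".join(parts)


def word_count(text: str) -> int:
    return len(text.split())


# Illustrative contents, standing in for what a person distilled from a long report.
brief = Brief(
    goal="Recommend whether to renew the vendor contract",
    key_facts=["Contract ends June 30", "Costs rose 12% year over year"],
    constraints=["Budget is fixed", "Decision needed in one page"],
)
prompt = brief.render()
```

The rendered brief here is a few dozen words, which is what the model sees in place of the forty-page upload.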
Principle Seven: Retrieval Needs Verification
Principle number seven: retrieval needs verification. Semantic search will recall topics and themes well, but fail on specifics. You need to pair fuzzy retrieval techniques like RAG search with exact verification where facts must be correct. You should have a two-stage retrieval path — recall candidates, and then verify against some kind of ground truth.
This is especially important in situations involving policy, financial facts, or legal facts that you need to validate. Something like this is exactly why there was a very prominent fine levied against a major consulting firm in the last two weeks — the fine came to close to half a million dollars because they could not verify facts around court cases in a document they prepared. They hallucinated them and didn't catch them. Retrieval failed. And because the LLM is designed to keep the conversation going, it just inserted something plausible and nobody caught it.
You need to be able to verify retrieval against ground truth. If it's a small task, that might be the human at the other end of the chat — it just is a step that needs doing. If it's a large agentic system, it is the exact same fractal principle, but you need to do it in an automatic way using an AI agent for evaluations.
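The two-stage path described here, recall candidates and then verify against ground truth, can be sketched in a few lines. The registry of policy statements is hypothetical, the stage-one recall uses toy word overlap in place of semantic search, and the stage-two check is a strict exact match. A real system would choose its own source of truth and matching rule.

```python
# Ground truth: a hypothetical registry of verified policy statements.
REGISTRY = {
    "POL-7": "Refunds are allowed within 30 days of purchase.",
    "POL-9": "Refunds for annual plans are prorated.",
    "POL-12": "Support hours are 9am to 5pm on weekdays.",
}


def recall_candidates(question, k=2):
    """Stage 1, fuzzy recall: toy word overlap stands in for semantic search."""
    q = set(question.lower().split())

    def overlap(item):
        return len(q & set(item[1].lower().split()))

    ranked = sorted(REGISTRY.items(), key=overlap, reverse=True)
    return [record_id for record_id, _ in ranked[:k]]


def verify(record_id, claimed_text):
    """Stage 2, exact check: the claim must match the registry word for word."""
    return REGISTRY.get(record_id) == claimed_text


def split_claims(draft_claims):
    """Separate claims that pass verification from those needing human review."""
    verified, flagged = [], []
    for record_id, text in draft_claims:
        (verified if verify(record_id, text) else flagged).append((record_id, text))
    return verified, flagged


candidates = recall_candidates("refunds within how many days")

draft = [
    ("POL-7", "Refunds are allowed within 30 days of purchase."),
    ("POL-9", "Refunds for annual plans are always paid in full."),  # plausible but wrong
    ("POL-99", "Refunds are unlimited for enterprise clients."),      # no such record
]
verified, flagged = split_claims(draft)
```

Only the first claim survives. The altered statement and the citation to a record that does not exist are both flagged before they reach a finished document.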
Principle Eight: Memory Compounds Through Structure
Principle number eight: memory compounds through structure. Random accumulation does not compound — it just creates noise. Just adding stuff doesn't compound. If we added memories randomly the way we experience them in life, with no lossiness and no forgetting ability, we would not be able to function as people. Forgetting is a technology for us. In the same way that forgetting is a technology for us, structured memory is a technology for LLM systems.
Evergreen context goes one place, versioned prompts go another place, tagged exemplars go another place. At a small scale, yes, you can do this — people are doing this with Obsidian, with Notion, with other systems as individuals. And yes, you can scale this as a business. Same principle. You let each interaction build without degradation if you have structured memory. Otherwise, you just have random accumulation. Otherwise, you have the pile of transcripts you never got to, and you're thinking, "Well, this is data, we're logging it, it's probably good." It's just going to be random accumulation. It creates noise. You're not going to have structured memory.
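The three places named in this principle (evergreen context, versioned prompts, tagged exemplars) can be sketched as one small library with a separate structure for each. The `Library` class and its method names are hypothetical. The same layout could equally be folders in a notes app.

```python
from dataclasses import dataclass, field


@dataclass
class Library:
    evergreen: dict = field(default_factory=dict)   # stable context, keyed by topic
    prompts: dict = field(default_factory=dict)     # prompt name -> list of versions
    exemplars: list = field(default_factory=list)   # (tags, text) pairs

    def add_prompt_version(self, name, text):
        """Append a new version and return its 1-based version number."""
        versions = self.prompts.setdefault(name, [])
        versions.append(text)
        return len(versions)

    def latest_prompt(self, name):
        return self.prompts[name][-1]

    def add_exemplar(self, text, tags):
        self.exemplars.append((frozenset(tags), text))

    def exemplars_tagged(self, *tags):
        """Return exemplars that carry every requested tag."""
        wanted = set(tags)
        return [text for item_tags, text in self.exemplars if wanted <= item_tags]


# Illustrative contents only.
lib = Library()
lib.evergreen["company"] = "B2B software, 40 employees"
lib.add_prompt_version("weekly-report", "Summarize the week in five bullets.")
version = lib.add_prompt_version("weekly-report", "Summarize the week in three bullets.")
lib.add_exemplar("Q2 report that the board praised", tags={"report", "success"})
lib.add_exemplar("Q1 report that buried the key risk", tags={"report", "failure"})

good_reports = lib.exemplars_tagged("report", "success")
```

Each new interaction lands in a known place, so earlier prompt versions stay available and exemplars can be pulled by tag rather than searched for in a pile of transcripts.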
Why This Matters Now
These are the principles that work. They work whether you are a power user with ChatGPT or a developer building agentic systems. They are guideposts for evaluating vendors in the memory space. These are tool-agnostic principles, designed to scale with complexity and to give you keys that solve the memory problem — because they make context persist reliably without the brittleness we see with current AI systems.
My challenge to you as we wrap up: we've gone through root causes, we've gone through why memory is a hard problem, and we've gone through eight principles for how to solve it. Please take memory seriously. The reason it matters now is that if you solve memory now, you have an agentic AI edge. These systems are going to get cheaper and more powerful, but you can't assume they're magically going to solve for memory. As I said at the beginning, there's a chip-level issue here. It is a hard problem.
If you take responsibility for memory and build it yourself in the way that works for you, you are starting the timer earlier than everybody else on getting memory that is functional across a long-term engagement with AI. We're in year two of the AI revolution. Wouldn't it be great to have memory that goes back to year two when you are working with AI systems in ten years, in fifteen years, in twenty years? Everybody else is going to have memory that started much later, and they're going to lose that discipline, that acceleration, that ability to manage deep work over time that AI is going to be capable of with proper memory structures.
So there is a moment here for you to think about and put in place a memory structure that works. Don't lose the opportunity. This is a complex one, but it's on you and me and all of us together to build memory systems that handle our own needs — whether that's personal needs or professional needs. Drop in the comments how you're doing it, because I think we should all crowdsource this.