podProse

Podcast transcripts, polished for reading

4 AI Labs Built the Same System Without Talking to Each Other (And Nobody's Discussing Why) | AI News & Strategy Daily | Nate B Jones Transcript

Four AI labs independently converged on the same multi-agent architecture for long-horizon work

Nate B Jones of AI News & Strategy Daily argues that the "jagged frontier" model of AI capability is no longer a useful frame for understanding AI in the workplace.

Summary

Nate B Jones presents a single extended argument: the widely accepted idea that AI has a "jagged" capability profile — brilliant at some things, terrible at others — was never an inherent property of AI intelligence, but rather an artifact of how AI was being deployed. He contends that as multi-agent harnesses, inference-time compute, and organizational scaffolding have matured, the jaggedness has smoothed out, particularly for the kinds of work that actually occur in professional settings. As evidence, he points to Cursor's coding agent solving a novel research-grade mathematics problem it was never designed for, using the same harness it uses to write code. He further notes that Anthropic, Google DeepMind, OpenAI, and Cursor independently converged on structurally similar multi-agent architectures — decompose, parallelize, verify, iterate — without coordinating with one another. Jones argues this convergence signals that the underlying problem of how to organize agent work is now essentially solved, and that the critical skill going forward is not execution but evaluation: the ability to sniff-check whether agent output is correct.

Key Takeaways

The jagged frontier was an artifact of deployment, not intelligence. Asking a model for one answer in one turn with no tools, no memory, and no ability to retry is like asking a professional to solve every problem in 30 seconds with no notes and no colleagues. The limitations that resulted were framed as properties of AI when they were actually properties of the interaction structure.

Inference-time compute and agent harnesses have changed the picture. Models that can think, self-correct, and operate within structured scaffolding — with memory files, task lists, planner-worker-judge hierarchies — produce qualitatively different results. The jaggedness that defined early AI strategy is smoothing out across the board.

Cursor's math result is the clearest proof point. A coding agent, using a harness built to write software, solved an unpublished research-grade mathematics problem from Stanford, MIT, and Berkeley academics — and improved on the human-written solution. It ran for four days with zero human guidance. Cursor is not a mathematics company. The generalization of the harness beyond its intended domain is the significant finding.

Four labs built the same architecture without talking to each other. Anthropic, Google DeepMind, OpenAI, and Cursor all independently arrived at the same structural pattern: decompose the problem, parallelize execution across agents, verify outputs, and iterate to completion. This convergence strongly suggests the architecture is a genuine solution to the underlying problem of how to get useful long-horizon work from agents with finite context and finite per-step reliability.

The architecture mirrors how human organizations already work. Planner-worker-judge structures, parallel exploration, handoffs, verification loops, and clean restarts are not AI-specific innovations — they are management principles that have organized human professional work for decades. The fact that the same structures work for agents suggests organizational intelligence generalizes across both human and artificial systems.

The relevant question for knowledge workers has shifted. The question is no longer "can AI do a specific task in my job family?" It is "can my work be decomposed into verifiable sub-problems?" Jones argues the answer is yes far more often than most people are comfortable acknowledging, and that this applies across engineering, legal, marketing, customer success, financial modeling, clinical research, and product management.

Evaluation competency now sits above execution competency in value. As agent harnesses take over execution, the skill that survives is the ability to sniff-check: to recognize whether architecture is maintainable, whether a solution is fragile, whether tests cover the important cases, whether a product strategy is sound. These meta-skills become more valuable as execution gets cheaper, not less.

The transition requires active engagement, not passive observation. Jones argues that organizations and individuals who proactively map their domains, identify what can be delegated, and build agent infrastructure will be well positioned. Those who wait for the shift to happen to them will be in a significantly worse position. The transition itself will take a large number of people, because agent infrastructure is not easy to install.

FULL TRANSCRIPT

The Jagged Frontier as an Organizing Frame

Nate B Jones: What if AI isn't jagged anymore? That is what I cannot stop thinking about. It's keeping me up at night. Everyone has assumed that AI capabilities have a jagged pattern — that they're incredible at some things and terrible at others. Experts talk about this. We see this in our daily lives. It seems like a truism. Why would I challenge it?

The jaggedness has become an organizing frame for just about everything. We think about how we install AI that way. We think about how we teach AI that way. But there is something we got wrong about jaggedness. The jagged frontier was never an inherent property of AI intelligence. I want to suggest it was an artifact of how we were asking AI to work, and that we are starting to figure that out as AI gets better. Let me dive into what I mean, and then you decide whether I'm right or whether I'm out to lunch.

What Single-Turn Interaction Actually Does

Here's what happens when you ask a model a typical question. We're going to start with the basics. You see that jaggedness just the way we've all seen it in chats. I've seen it too. When you ask a model for one answer in one turn (here's a question, I want an answer), all of the variance in task difficulty shows up as jaggedness in outcomes. That's not necessarily because the intelligence is jagged. It's because no organizational structure was being applied to that work.

We have been asking a capable analyst to solve every problem in 30 seconds with no notes, no colleagues, no ability to try something, and no ability to retry.

Now, that mental model isn't fully correct anymore. That was initially true in 2022, but I want to walk you forward. Now we're looking at inference-time compute, where the AI takes time to think and has a budget of tokens to work with. This is what you get with ChatGPT o3 thinking and o3 Pro. It can think, it can maybe try some tokens that don't work, it can correct its mistakes, and it can come back. This produces higher quality results. We see better performance. But most of our conversation has focused on the fact that we see better performance, and maybe we haven't noticed that the jaggedness has started to smooth out. You don't have issues with counting the Rs in "strawberry" anymore, do you?

Why the Mental Model Needs to Change

The mental model that shaped three years of AI strategy needs to change. It needs to change because the last 30 days have convinced me that jaggedness is no longer the right guiding paradigm for how AI works in the workplace. It is certainly true that there are extraordinary capabilities for AI and there are capabilities for AI that are just very good. Now, that is a kind of jaggedness — but it is not a super relevant jaggedness, because the last time I solved an international Olympiad math problem at work was, oh, never. It just doesn't happen at work.

So I'm interested in the practical. In that world — the world of PRDs, the world of code, the world of customer service tickets — AI is not jagged anymore. And we need to stop pretending that it is. And I know why. And knowing why is going to help you understand how to do your work differently.

So let's go back and consider what single-turn, single-agent interaction actually means. You present a problem. The model produces a response. If it contains an error midway through, the error propagates through every single thing that follows. If the first approach is wrong, there's no mechanism to detect that and try something else. If the task requires more information than fits in a context window, it cannot accumulate that information incrementally very well. Every problem needs to be solved in a single shot.

This is the most primitive version of a chatbot. It's close to what we experienced with ChatGPT when it initially launched. And this is not how any competent human professional works. We know that, right? It's not how a lawyer researches a case. It's not how an engineer designs a system. It's not how a scientist runs an experiment. All of these involve trying things, recognizing when they're not working, adjusting, accumulating information over time, getting feedback at intermediate stages, and revising.

And yet all of these organizational structures that we've built around professional work — the review processes, the sprint cycles, the peer feedback loops, the draft-revise-publish pipeline — they all exist because we have a hard time solving one-shot cognition problems too. And we seem to have forgotten that AI might be able to use that help as well.

So we deployed AI originally into a paradigm that removes so many of those structures that help us to think, and then we described the resulting limitations as a property of AI.

The Learning Curve We Haven't Been Tracking

Now I am very aware I'm simplifying the story, so I'm going to walk you forward so that you understand what's going on here. That was 2022. We learned a lot. We got inference-time compute, which I described earlier. It helps AI make fewer mistakes. We got some tools for AI that help it a lot as well. We also realized that we need to be better at describing our tasks. That helps AI too, and that's what we call prompting. So we have been working on our side to provide tools, and at the same time AI has been getting smarter because we've been scaling intelligence, partly through inference-time compute, which I described, and partly through reinforcement learning, the tried-and-true method that we've been using since the beginning of LLMs.

And so what we see is a trend line where intelligence has been climbing, but our fluency at using the tool has been getting better too. And we haven't been tracking that curve. We've been talking about the intelligence curve. We have not been talking about the curve that allows us to actually use this tool — the ability to learn to put agents into harnesses, the ability to learn to use tools in a loop to do practical work.

And what we really haven't recognized is that we are in a learning trend line that matters more than the intelligence curve at this point, at least for practical work. Because figuring out the scale at which we can operate intelligence now is a function of our ability to use tools with agents, our ability to use harnesses with agents.

A harness is the state around the agent — the scaffolding around the agent, the thing the agent operates within that allows it to do work. Maybe it's a markdown file for tasks. Maybe it's a spot to put its memory. All of it comes together. It's a harness. It allows it to do meaningful work. We've forgotten that part as valuable. We've forgotten that if we do that well, maybe we will address the jaggedness.

And so when the first couple of months of the year arrived, we were surprised to see the jaggedness start to disappear all at once. Video gets better. Text gets better. Mathematics gets better. Science gets better. I'm talking about specific advances that have been happening in the last 60 days. We are not seeing jagged improvements anymore. We are seeing a pattern of improvements where everything is getting better at once.

The frontier of AI is smoothing, and we are seeing much more smoothing if we look at the smaller bubble that is work — because work is inside the frontier at this point. The last time I solved a math problem like an international Olympiad athlete was never. We don't do it at work. Only a few of us are doing complicated science at work. Very few of us are actually doing super complicated engineering problems at work. We may be putting that effort in one time, but then we are executing against that to build out the product. For most of our work, this is a smooth product. It is not jagged. And we have got to recognize how big a deal that is, because that changes all of our assumptions about where we should expect AI to work and where we should deploy.

The Cursor Proof Point

I hear you saying, "Nate, what's the proof here? You've given vague generalities." The proof arrived on March 3rd, when Cursor CEO Michael Truell announced that Cursor had discovered a novel solution to a research-grade mathematics problem drawn from the unpublished work of Stanford, MIT, and Berkeley academics. In other words, you can't reinforcement-learn on it — it's unpublished. It didn't just solve it. It improved on the official human-written solution: stronger bounds, better coverage.

And they did it using the exact same coding harness that six weeks earlier had built a web browser from scratch. The harness ran for four days on this math problem with zero hints, zero human nudges, and zero mid-course guidance. And then it solved it.

Here's why I'm saying smoothing matters. Cursor did not build a system to solve math problems. It's one thing if Google says, "We put this special math model together and it did a special math thing." Or if OpenAI says the same. But in this case it matters more because Cursor is not a mathematics-solving company. Cursor is a coding company. A system designed to write code looked at a problem in spectral graph theory — you tell me what that is, I have no idea — and produced mathematics that the problem's own authors hadn't found.

This is a huge deal. And I think Michael Truell put it well. What he said was: this suggests that our technique for scaling agent coordination might generalize beyond coding. I will go farther. I will say it suggests that the way we put agents into harnesses to do long-running work looks like it will work for any domain that is even reasonably verifiable — in other words, that we can reasonably determine a correct answer to.

That opens up a lot. That's not just math. That's not just code. That opens up legal. That opens up many customer service use cases, because there's a verifiable correct answer. There are a surprising number of verifiable or near-verifiable problems in the business world where we know what's correct and what's not. This is a situation where the architecture of the agent marched up and just ate the problem set, and along the way changed the way we should think about AI, jaggedness, and our daily work.

What's Inside the Cursor Agent Harness

You might be wondering what's in the box on this Cursor agent. Is it something special? Is it secret sauce they're not going to share? No, they're going to share. In fact, they did share. In January, Wilson Lin published a Cursor blog post on scaling long-running autonomous coding.

The first attempt was flat coordination — agents shared a single file, they used locks to avoid collision, and it failed very badly. Agents became risk-averse. They avoided difficult tasks and optimized for small, safe changes. You got lots of activity but you did not get much progress.

The breakthrough came from hierarchy and specialization. Planners explore the codebase and create tasks, spawning sub-planners recursively — so there are two layers here. Workers pick up individual tasks and grind until done, ignoring everything else. A judge — an LLM acting as judge — determines whether to continue, and the next iteration begins fresh. The judge's ability to restart cleanly, bringing in a new agent with fresh context, turned out to be one of the system's most important properties, because it got around the problem of the context window.
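As an illustration of the planner-worker-judge loop just described, here is a minimal Python sketch. It is not Cursor's code: the names `plan`, `work`, `judge`, and `run` are hypothetical, the "agents" are plain functions rather than model calls, and the clean restart is modeled simply by rebuilding each iteration from the task list alone.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    done: bool = False


def plan(goal: str, depth: int = 0, max_depth: int = 1) -> list[Task]:
    """Hypothetical planner: split a goal in two, recursing to mimic sub-planners."""
    parts = [f"{goal}/part{i}" for i in range(2)]
    if depth < max_depth:
        # Each part is handed to a sub-planner that returns finer-grained tasks.
        return [task for part in parts for task in plan(part, depth + 1, max_depth)]
    return [Task(part) for part in parts]


def work(task: Task) -> Task:
    """Hypothetical worker: sees only its own task and finishes it."""
    return Task(task.name, done=True)


def judge(tasks: list[Task]) -> bool:
    """Hypothetical judge: True means another iteration is needed."""
    return not all(task.done for task in tasks)


def run(goal: str, max_iterations: int = 5) -> list[Task]:
    tasks = plan(goal)
    for _ in range(max_iterations):
        # Every iteration starts from the task list only, standing in for
        # "a new agent with fresh context".
        tasks = [task if task.done else work(task) for task in tasks]
        if not judge(tasks):
            break
    return tasks


results = run("browser")
print([task.name for task in results])
```

The point of the sketch is the shape: planning is recursive, workers never see each other's state, and the only thing carried between iterations is the task list.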

The test case was building a web browser from scratch in Rust. The agents ran for a week and wrote a million lines of code. Cursor ran the same harness on a Solid to React migration and got that to work. They ran it on a Java language server. They ran it on a Windows 7 emulator — 1.2 million lines — and an Excel clone — 1.6 million lines. I think the Cursor team is having fun.

Two lessons emerged. First, model choice matters a lot for long-horizon tasks. They found that GPT o3 consistently outperforms Claude Opus, which tends to stop earlier and take shortcuts. Second, and more counterintuitively, many of the improvements they made came from removing complexity in the agentic system rather than adding to it. The actual improvement came from stripping out a lot of the complicated coordination machinery, adding hierarchy, and letting agents work in very clean isolation.

It is probably not an accident that that harness looks very similar to the Codex harness that you can set up if you download and use the Codex app, where you have agents running in isolation in sandboxes.

And the deepest observation is this: the system's behavior is disproportionately determined by the design of the prompt. Yes, prompting is still going to matter in the future. I've been saying it for a long time. If you can prompt with all of the information — the complete solution, what the model needs to do to be correct — and you set up your model harness correctly, it will run for a long time.

So Cursor got excited. They got experimental and they pointed the harness at this math problem. The Cursor system found an approach involving the Marcus-Spielman-Srivastava interlacing polynomial method. Don't ask me what that is, and don't try to say it five times fast. But the point is it solved it, and it went beyond what humans did.

This should wake you up. If you are thinking that a coding agent does code, if you are thinking that an LLM is a narrow thing, this should wake you up. It is not a narrow thing. These LLMs, especially in agents, are generalizing broadly. And this goes back to what I was saying earlier. We have assumed that jagged responses from LLMs are a function of intelligence. But the lesson that's been in plain sight over the last few years is that it's actually been at least as much a function of the harness we put the agent in.

Four Labs, One Architecture

Now at this point, four organizations — Anthropic, Google DeepMind, OpenAI, and Cursor — have independently built very large multi-agent coordination systems designed to do long-horizon work. None have coordinated. All four exhibit a similar structural pattern. And to my mind, this hasn't been clearly articulated. So hear me now.

This is not as different as it sounds. There are some differences in their patterns that are related to the models they use, but the underlying architectures are similar. One: decompose the work. Two: parallelize the execution. Three: verify outputs. Four: iterate toward completion.
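The four steps can be sketched on a toy problem. This is an illustrative assumption, not any lab's implementation: the task is sorting chunks of a list, the `worker` is deliberately unreliable on its first attempt for some chunks, `verify` is a cheap independent check, and only the chunks that fail verification are retried. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data: list[int], size: int) -> list[list[int]]:
    """Step 1: split the work into independent chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(chunk: list[int], attempt: int) -> list[int]:
    """Hypothetical unreliable worker: its first attempt on odd-sum chunks skips the sort."""
    if attempt == 0 and sum(chunk) % 2 == 1:
        return list(chunk)
    return sorted(chunk)


def verify(result: list[int]) -> bool:
    """Step 3: a cheap check that does not trust the worker."""
    return all(a <= b for a, b in zip(result, result[1:]))


def solve(data: list[int], size: int = 3, max_rounds: int = 3):
    chunks = decompose(data, size)
    results = [None] * len(chunks)
    pending = list(range(len(chunks)))
    rounds = 0
    for attempt in range(max_rounds):
        if not pending:
            break
        rounds += 1
        # Step 2: run all pending chunks in parallel.
        with ThreadPoolExecutor() as pool:
            outputs = list(pool.map(lambda i: worker(chunks[i], attempt), pending))
        # Step 4: keep verified results, retry only the failures.
        failed = []
        for index, output in zip(pending, outputs):
            if verify(output):
                results[index] = output
            else:
                failed.append(index)
        pending = failed
    return results, rounds


sorted_chunks, rounds_used = solve([9, 2, 7, 4, 1, 8, 6, 3, 5])
print(sorted_chunks, rounds_used)
```

In this run the middle chunk fails verification once and is redone in a second round, while the other two chunks are accepted on the first pass.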

Anthropic's approach uses an initializer agent that sets up an environment state and a progress file. A coding agent then makes incremental progress and leaves structured artifacts that the next session can read. Without this structure, the failure modes are vivid: the agent might try to one-shot the whole implementation, might run out of context mid-build, might leave things worse than they started, or might mark features complete without testing them.
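A progress file of this kind can be sketched in a few lines. The file name `progress.json`, its fields, and the functions `initialize` and `run_session` are assumptions made for illustration, not Anthropic's actual format. Each call to `run_session` stands in for a fresh agent session that knows nothing except what the file tells it.

```python
import json
import tempfile
from pathlib import Path


def initialize(path: Path, features: list[str]) -> None:
    """Hypothetical initializer: record every feature as not yet done."""
    state = {"features": {name: "todo" for name in features}, "notes": []}
    path.write_text(json.dumps(state, indent=2))


def run_session(path: Path) -> bool:
    """One fresh session: read state, finish one feature, write state back.

    Returns False when there is nothing left to do.
    """
    state = json.loads(path.read_text())
    todo = [name for name, status in state["features"].items() if status == "todo"]
    if not todo:
        return False
    name = todo[0]
    tests_passed = True  # placeholder for actually running the feature's tests
    # A feature is only marked done after its tests pass.
    state["features"][name] = "done" if tests_passed else "todo"
    state["notes"].append(f"{name}: tests {'passed' if tests_passed else 'failed'}")
    path.write_text(json.dumps(state, indent=2))
    return True


with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "progress.json"
    initialize(progress, ["login", "search", "export"])
    sessions = 0
    while run_session(progress):
        sessions += 1
    final_state = json.loads(progress.read_text())

print(sessions, final_state["features"])
```

Because each session does one increment and records it, the sketch avoids the failure modes listed above: no one-shot attempt at everything, no state lost when a session ends, and no feature marked complete without a test step.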

Google DeepMind's approach — especially with the AlphaProof mathematics model — separates generation, verification, and revision into very distinct roles. The same principle underlies code review, legal adversarial proceedings, and scientific peer review. Do you see how this shows up in many fields?

OpenAI's Codex runs tasks in parallel sandbox environments. And Cursor's planner-worker-judge approach is structurally similar to how software teams with engineers and a PM actually operate. Yes, I hope it's not lost on you — this is also how people work.

This convergence is not a coincidence. It is a solution to a real problem: how do you get useful work from units of intelligence with finite context, finite per-step reliability, and no persistent memory? We also are units of intelligence. We have finite context. We have finite reliability. We make mistakes, and some days we don't have as good a memory as others.

The answer, for both the human organization and the agent, turns out to be organizational. You create roles, you create handoffs, you create verification, you create restart procedures. These are not AI-specific insights. They're management insights that generalize to autonomous agents as naturally as they generalize to human teams. In other words, we figured out how to generalize our intelligence by working collectively. And we seem to have forgotten those lessons and replicated them without realizing it. Because it turns out that when you do the exact same thing we have been doing to organize human work and you apply it to agents in a harness, that is also a good way to solve for agents doing meaningful work that they could not do individually.

That is a big deal. We humans have figured out a form of organizational intelligence, and now we are giving it to agents, and it turns out it scales. We should pay more attention to this than we are.

The Cost Question

Look, there's an obvious critique at this point, and I can hear people coughing in the back. Multi-agent harnesses are extremely expensive. And I want to engage with this honestly because it's not entirely incorrect. At the most fundamental level, multi-agent systems generate a ton of tokens that would not be generated by a single-turn interaction. So the cost is real. You have to be ready to enable token burn if you go with this kind of system.

But multi-agent systems give you an organizational strength you can't get any other way for the hardest problems. They give you structural diversity. Parallel workers can explore different decompositions of the problem. Dead-end results can inform the next planning cycle without contaminating another worker's context. The planner can spawn sub-planners to go deep on specific problems. Partial progress can accumulate across context windows rather than resetting.

This is about organization design. A brilliant individual with unlimited time can in principle solve almost anything. But certain problem classes are structurally inaccessible to serial cognition — not because the individual lacks the capability, but because the problem requires too many exploratory paths to hold in working memory simultaneously. So we don't structure organizations as one very smart person trying to do everything. We structure them with roles and handoffs and verification, because otherwise we don't get consistent progress regardless of individual talent.

Now, I've said before that I think teams of one have a special role in the world of AI, because teams of one are really teams of more than one. If you are a team of one and you can manage this kind of multi-agent system, you can be a team of a hundred and you're just you. But I want you to keep in mind that meaningful progress in human terms has almost always involved a team. Even the moments we celebrate in science — like Einstein's breakthrough — were both a function of individual genius and also a function of the scientific community around that individual. As Isaac Newton said, we stand on the shoulders of giants.

What This Means for Work

So let's say you believe me. Let's say you're like, okay, Nate, I get it. We are looking at a world where things are smoother than they appear. I haven't been paying attention to harnesses. I promise to. I haven't been paying attention to tooling. What can I look for at work that matters here?

I want to suggest that you need to start thinking about two tiers of domain verifiability.

The first is the simplest — the one that's easiest to get a hold of, the one we tend to think of with AI these days. It's machine-checkable. The code compiles or it doesn't. The tests pass or fail.

The second tier is expert-checkable with clear criteria: mathematical proofs, engineering designs, legal briefs. I think there are actually many categories of work inside the knowledge-based economy that run like this, because in almost every field, experts will look at something and say this is correct or incorrect and come to near consensus. I would say that's true even for things we would not traditionally consider a tier-two problem.

Let me give you an example. If you were constructing a product strategy for a particular company and you brought that product strategy to three or four different product leaders each with 15 or 20 years of experience, I am willing to bet you a lunch that their assessment of that product strategy will be remarkably consistent. There is a set of patterns that they have internalized that they are able to apply in that particular situation. In other words, that kind of work — which we have traditionally called soft work, very hard to verify — is more verifiable than we think. And if it's more verifiable than we think, we really should be thinking about a lot more of our work as something AI can access and be helpful with.

The Sniff-Check Skill

So what does this mean for work? The Anthropic 2026 Agentic Coding Trends report describes engineers delegating tasks where they can easily sniff-check for correctness. I love that, because I think it's not just engineers. The story of Cursor — the story of this moment where Cursor's coding-specific harness was able to generalize — is suggesting to me that we are on the cusp of being able to assign a whole lot of work inside the organization to agentic workflows, as long as we can easily sniff-check for correctness.

How much work can we sniff-check for correctness? Every single department has a lot of work in that category. Marketing has work where they can sniff-check a campaign design for correctness. Customer success has work where they can sniff-check an email template schema for correctness.

And it's tempting to look at this and say, well, engineers are delegating all the work, so what will engineers do? And first, we have to be honest: smoothing means delegating the hard work too. It's not just delegating the easy stuff. I saw so many organizations in 2024 and 2025 thinking that way, saying "we'll just delegate the easy stuff, the simple stuff." The best organizations are delegating the hard stuff, as long as they can actually sniff-check the work.

And so the skill that survives this transition isn't "I can do the work." It's "I can sniff-check. I can tell if the work is correct or not. I can tell if this is what we should be doing." In other words, everything at work is moving to meta-skills. Meta-skills like knowing whether architecture is maintainable, like recognizing a solution that is fragile, like understanding when tests cover all the important cases and you don't need more. These kinds of meta-skills get more valuable as harnesses improve, not less.

So the question outside engineering is pretty simple. What does sniff-checking look like for us — for financial modelers, for legal researchers, for clinical trial designers, for product managers, for customer success folks? In each case, an evaluation competency sits above execution competency in value and becomes even more valuable as execution gets cheaper with agents. The people who develop the ability to do those sniff checks are the ones who will find themselves really well positioned as harnesses come for their domains. And I'm here to tell you it's coming fast.

The Underlying Structure Is Now Solved

So let me come back to the four companies that independently built the same structure. Anthropic, Google, Cursor, OpenAI — they all built some version of: decompose a problem, parallelize the problem for agents, verify that problem, and then keep iterating until you get it done.

What that implies is that there is an underlying structure for how we solve problems at work that is now solved. It is solved by agents. It is just done. They have figured it out. If we can set up the right agentic harness, we should be able in principle to tackle any problem at work that we can decompose, parallelize, verify, and iterate on. Any problem at all. We are not limited. The surface is smooth.

I want to be very clear at this point. The relevant question for you and for me and for anyone doing knowledge work right now is shifting very quickly from "can AI do a specific task in my job family?" — which I hear all the time — to "can my work be decomposed into verifiable sub-problems?" — which is a mouthful but is much more relevant. And I'm here to tell you that the answer is yes far more often than most of us are comfortable with. And I think we're uncomfortable because it requires us to level up. It requires us to think about how we evaluate. It requires us to do a sniff check.

The organizations that figure out how to make this a productive shift — how to structure teams so that they think about delegation to agents, they know what tasks they can delegate, and they have support for that — they're going to have to work very differently. We're going to have to have much more talent thinking about agent infrastructure and building it. We're going to have to have much more talent and training around how we think about taste, and understanding if something is correct, and decomposing problems. There's a massive migration to do. That's going to take a lot of people.

Did you notice that I said a lot of people? It is not actually easy to install this stuff. It's hard. And I know that the easy thing to do is to say, "Wow, Cursor built a browser, what are we going to do?" The leverage that these agents provide is leverage that we can use to extend our impact. It's also a challenge to how we do our daily work. And the latter is the thing I find most scary for folks: that we cannot continue our current habits. And I cannot promise you that you can. Wherever you sit in the workplace today, your work is going to have to change.

And the only thing you can control is whether you understand that and get ahead of it by being proactive — start to map out your domain and ask: what can I delegate? How can I be a sniff-checker, a tastemaker, an agent infrastructure builder? How can I shift into a mode where I am bringing the agents into the space on my terms to extend my leverage? Or you're just going to sit there passively and it's going to happen anyway, and that's a much worse place to be.

The Lesson

The lesson I want to call out here is that AI is smooth for work. Capabilities are not jagged at work. That is a misunderstanding. The reason why is really cool. The reason why is that we essentially scaled all of our organizational intelligence — the learning tools for people — into AI over the last two or three years. And it finally hit a tipping point. And now it's basically solved anything at work that we can decompose, that we can parallelize, that we can verify, and that we can iterate. And that turns out to be a ton of work.

That is a huge piece of the story. I think it is much more important than whatever benchmark score drops tomorrow. Think about that. Remember that the world is not as jagged as it seems. And don't worry if you can't solve an Olympiad math problem. That is not what our future depends on.

Polished transcript of AI News & Strategy Daily | Nate B Jones. All views are those of the original speakers.

Published by @maverick
