
Tobi Lütke Made a 20-Year-Old Codebase 53% Faster Overnight. Here's How. | AI News & Strategy Daily | Nate B Jones Transcript

Polished transcript · AI News & Strategy Daily | Nate B Jones · 25 Mar 2026 · 29m · @maverick

Four types of AI agents explained for real-world production use cases

Nate B Jones of AI News & Strategy Daily breaks down the four distinct types of AI agents being used in production environments in 2026.

Summary

Nate B Jones argues that the term "agent" is too broadly applied and that conflating four fundamentally different agent architectures leads to poor implementation decisions. He identifies the four types as: coding harnesses (single or multi-agent systems where a human remains the quality gate), dark factories (fully autonomous pipelines that run from specification to eval with minimal human involvement in the middle), auto research (LLM-driven metric optimization descended from classical machine learning), and orchestration frameworks (multi-role agent pipelines managing specialized handoffs). Jones draws on real examples from Tobi Lütke's optimization of Shopify's Liquid framework, Andrej Karpathy's auto research work on LLM tuning, Cursor's multi-agent browser and compiler projects, and Peter Steinberger's use of multiple Codex agents to build OpenClaw. His central argument is that choosing the wrong agent type for a given problem is one of the most common and costly mistakes practitioners make today.

Key Takeaways

  • "Agent" is not a single thing — Jones argues that all four types share the same underlying structure (LLM + tools + loop), but their configurations, goals, and human involvement levels are so different that treating them as interchangeable leads directly to failed implementations.
  • Coding harnesses are the simplest and most widely used agent type, built around a single developer's judgment as the quality gate. Andrej Karpathy's agents running 16 hours a day and Peter Steinberger's multiple parallel Codex agents building OpenClaw are both examples of this model — the human decomposes the work, the agent executes it.
  • Project-scale coding requires a different architecture — Cursor's approach of using a planner agent directing short-running executor agents (rather than one long-running agent) demonstrates that scaling agentic coding to team-sized projects requires moving from human-as-manager to agent-as-manager, with simplicity being a critical design principle. Cursor tried three levels of management hierarchy and found it didn't work.
  • Dark factories remove humans from the middle of the process entirely, running from specification through to eval without human checkpoints. Jones notes that Amazon learned the hard way that fully trusting AI-generated code in production without senior engineering review creates real incident risk, and recommends retaining human judgment at the eval stage even in otherwise autonomous pipelines.
  • Auto research is metric optimization, not software production — Tobi Lütke's 53% performance improvement to Shopify's Liquid codebase and Karpathy's GPT-2-scale LLM tuning experiments are both examples of using LLMs to hill-climb toward a measurable target. Jones emphasizes that if you don't have a metric, you are not doing auto research.
  • Orchestration is the most complex type to set up and is only worth the investment at sufficient scale. Platforms like LangGraph and Crew AI manage specialized agent roles and handoffs, but the coordination overhead means they only make economic sense for high-volume workflows: on the order of 10,000 tickets or more, not hundreds.
  • The most common mistake Jones observes is people applying auto research to software-building problems, or trying to use long-running coding harnesses for tasks that are actually orchestration or human-creative problems. Matching agent type to problem shape is the core skill.

Full Transcript

    Why "agent" is too vague a term to be useful

    Nate B Jones: We want agents, but we don't know what we really want. When we say agents, it is too simplistic to say agents are just an AI plus tools in a loop. That's true, but we are missing the point. We are missing the fact that sophisticated agents diverge into at least four different types. Most of us don't understand what those types are, and we confuse them. So this video is about laying out how agents are really working in production use cases across these four subtypes, explaining why they're different, and then getting into how you use them and how you pick a given agent for a given use case.

    We're not going to be talking about individual models. If you think I'm going to talk about Claude or ChatGPT, that's not what this video is about. It's actually about the layer above that. You can plug any LLM into an agentic system and get results. Maybe not the results you want, but you can get results. The point is that you need to understand how these agent systems work. Because when we say agent, we really mean an LLM and tools and a loop where the agent comes back and gets feedback. The way we construct that is really, really important. And the details of that construction effectively give us what I'm calling agent species.
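The "LLM and tools and a loop" structure described above can be sketched in a few lines. This is a minimal illustration and not code from the talk: `scripted_model` is a hypothetical stand-in for a real LLM call, and the tool names are invented for the example.

```python
from typing import Callable

# Hypothetical tool registry: plain functions standing in for real tools.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"<contents of {path}>",
    "search": lambda query: f"<results for {query}>",
}


def scripted_model(history: list[str]) -> dict:
    """Stand-in for an LLM call: asks for one tool, then finishes."""
    if not any(line.startswith("tool:") for line in history):
        return {"action": "read_file", "input": "README.md"}
    return {"action": "finish", "input": "summary based on " + history[-1]}


def run_agent(goal: str, model=scripted_model, max_steps: int = 5) -> str:
    """The loop: ask the model, run the tool it picks, feed the result back."""
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        decision = model(history)
        if decision["action"] == "finish":
            return decision["input"]
        result = TOOLS[decision["action"]](decision["input"])
        history.append(f"tool:{decision['action']} -> {result}")
    return "stopped: step budget exhausted"
```

All four agent types discussed in this talk share this skeleton. What differs is what wraps it: who sets the goal, who checks the result, and how many loops run at once.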

    So you're like, "What are species, Nate?" We've got coding harnesses. These are often starting out for individual contributors as a single LLM agent. It is working with your files and running with tools that you give it to accomplish coding work. When Andrej Karpathy talks about the kinds of agents he works with to do his coding projects, it's often this sort of coding harness idea. When individual developers talk about their work, it's a coding harness idea. There is an extension of this for larger projects that involves multiple agents that we'll also discuss — it's like a separate cousin species.

    Dark factories — that's another species of agent. These are fully autonomous systems. You put the spec in and the software comes out. The trick is you have to be really, really good at all the steps in between. You have to give the agent all of the support and all of the scaffolding and all of the evals or tests at the end to make sure that what comes out is actually effective. The way you develop this often depends on your ability to specify really excellent nonfunctional requirements — which is a fancy way of saying really excellent rules of the road for these agents in ways that are enforceable. We'll get into that.

    Another kind of agent is auto research. These are frameworks that descend from classical machine learning. All you're doing is automating the process of letting an AI agent optimize for something. Maybe it's optimizing for conversion rate on your landing page. Maybe it's tuning a particular coding framework. Whatever it's doing, it has to have a metric to optimize against. That's why it's called auto research. The whole goal is what machine learning scientists call hill climbing — you want to climb the hill and get to a more effective, optimized metric. If you don't have a metric, you're not doing auto research.

    And then we have what we would call orchestration frameworks. This is something we often see in big companies where you have multiple LLMs lined up in a row and you have an orchestration framework over the top that hands work over: writer to editor, or researcher to drafter.

    All of these different types of agents — whether they're coding harnesses, orchestrators, dark factories, or auto researchers — have in common that they're using an LLM with tools. You can call all of them agents. That's okay. But if you don't understand why they're different and why that matters, you're going to use the wrong kind for the wrong kind of work and you're going to get into big trouble. I see this happen a lot. I have seen people take what I would describe as a single agent designed to do a single productive task and try to say, "We're going to make a dark factory out of that. We're going to make that into something that is a full multi-agent coding harness for multiple big projects." It's not going to work. That's not how that works. Agents have different needs.

    Why one agent can't do everything

    Now, you might ask: why is this so complicated? Why can't we have one agent to rule them all? The answer is that these are just tools that depend on the context around them to do effective work. If you want to do bigger pieces of work, you have to get more interesting and sophisticated in the way you put these agents together. Notice I didn't say more complicated. The art of building good agents is often the art of finding different simple configurations that enable the agent to do the particular work you have in front of you.

    When we talk about orchestration, you might envision a super complicated framework. It doesn't actually have to be complicated. The key to orchestrating is just recognizing you have multiple distinct jobs you need done that aren't well suited to having one long-running dark factory, and you need to find a way to negotiate those handoffs. Whereas with dark factories, you're usually optimizing toward an eval really relentlessly and you want to make sure you construct the pipeline so that the system gets you to that point. You have to look at the goal you're trying to accomplish and then ask yourself: what kind of agent do I need to get that goal done?

    Coding harnesses: the simplest agent type

    I want to get into the details on each of these four species because the more viscerally you understand them, the less you're going to be surprised when we talk about the differences and why they matter. Let's talk about coding harnesses first.

    Coding harnesses are in many ways the simplest kind of agentic harness. They are the kind you get when you pull up a terminal and use Claude Code or Codex. All they are is essentially an agent taking the place of a developer in an engineering process. The agent has many of the tools a developer would have. The agent can write code, call files, put files together, read files, write to files, and use tools like search. When you put all of that into the agent's context so the agent understands what it can do, the agent is able to do effective work.

    There are some slight variances. Codex tends to prefer to put these in a virtual machine, which is more secure — it's not touching your local laptop. And then there's Claude, which tends to like to work on your local laptop. There are pros and cons to each. But the point is that these are very similar overall approaches to the development problem. You should think of them as: you have a human, the human is now doing a managerial function, and the agent is doing the coding. If you do that well, you can give these agents — even if they're single-threaded, just one agent — a fairly long-running task to accomplish, and it will go and work. Andrej Karpathy talks about his agents running 16 hours a day. That's not unusual anymore in 2026. A lot of developers have that experience.

    I start with that because it is in many ways the simplest use case. It's really a single-threaded approach to agents. Think of the agent as a stand-in for the person — an engineer — and you'll get the idea.

    You can of course run multiple agents at once, and some developers do. Peter Steinberger, when he was building OpenClaw, famously described having multiple agents running at a time. In his case it was Codex, and they would get a particular task done — it would take about 20 minutes — and they'd check back in with him. A lot of his day as a manager of agents was essentially managing these agents that were all doing their own single-threaded tasks. So just because I talk about it as a single agent doesn't mean that developers view their work streams that way. Developers may view their work streams as: I'm managing all of these single-threaded agents all day.

    Decomposition as the key to scaling coding harnesses

    If you're wondering what makes this work or doesn't work, I'm going to give you a hint: decomposition. If you can get the work decomposed well, you can give that work to a bunch of single-threaded agents and you're going to get real far. A lot of developers like that. They like the challenge. They like to take a big problem that's kind of gnarly and rip it apart and say, "Okay, this chunk is really well defined — I'm going to give it to this agent. This chunk I'm going to give to this agent," and so on. That's how a lot of work gets done in 2026. You have the developer look at the overall shape of the project, maybe with an LLM as a planner assistant. The developer confirms the breakout the LLM planner agent may propose, and then the developer basically says, "Okay, let's start to break out this work." Then you start to break it out into individual agent tasks.

    When you are doing that already, notice something: you are already past the "spin up an agent in the chat and just talk to the chat to make it happen" stage. You may be working on different versions of your code or different sections of your code at once. You may be using a git worktree approach. Fundamentally, this is about task-scale projects.
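One way to picture the decomposition workflow described in this section is a developer-approved list of chunks handed to several single-threaded agents at once. The sketch below assumes a hypothetical `run_single_agent` function in place of a real coding agent, and the chunk names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def run_single_agent(chunk: str) -> str:
    """Stand-in for one single-threaded coding agent working on one chunk."""
    return f"done: {chunk}"


def run_decomposed(chunks: list[str], approved: bool, max_agents: int = 4) -> list[str]:
    """Fan well-defined chunks out to parallel agents once a human approves."""
    if not approved:
        # The developer's judgment is the gate: nothing runs without sign-off.
        raise ValueError("developer has not confirmed the breakdown")
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        # pool.map returns results in the same order as the input chunks.
        return list(pool.map(run_single_agent, chunks))


results = run_decomposed(["auth module", "billing API", "docs"], approved=True)
```

The human still decomposes the work and reviews each result; the parallelism only covers execution.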

    Project-scale coding: moving from human-as-manager to agent-as-manager

    What happens when the work gets bigger? That's where we talk about a more complex variant of this coding harness that is really designed for projects. It's important to understand what that looks like because so often when we want to do big work at companies, we tend to think of big work as linearly tied to the number of engineers that can hold bits of the project in their head. But increasingly that's actually incorrect. What you want to do is look first at the agent side of things and basically say the agent has to be able to understand this work, figure out the right path forward, and we have to support the agent in getting that done.

    Cursor has done a lot of work in public — writing it up, helping us understand how to do that well. What you really need is a different way of handling a large set of agents and coordinating their work. Effectively, you're moving from a world where the human is the manager to a world where the agent is the manager. In that scenario — and this is real, Cursor has done this across multiple real projects from browsers to compilers, coding millions of lines of code — what you have is an agent that plays the manager, an agent that acts as the planner for the work, and then a system of sub-agents that come in to grind on particular tasks as ordered by the planner agent.

    So instead of thinking of it as Cursor got some individual agent to code for weeks and that's how you got a browser, that's not how it actually worked. What you actually have is short-running grunt agents — execution coding agents — that were spun up by a planner agent to hit exactly one problem, solve it, and get that particular part of the job done. How that works successfully is by making sure the planner can make notes. The planner agent has to be able to track tasks, keep things in memory, and understand whether a particular piece of work by an executor agent was done well or not.
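A rough sketch of that planner and executor split follows. It is not Cursor's implementation, which the talk does not show; the class names, the retry limit, and the stub executor are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    status: str = "pending"  # pending -> done | failed
    attempts: int = 0


@dataclass
class Planner:
    tasks: list[Task]
    notes: list[str] = field(default_factory=list)

    def run(self, executor, check, max_attempts: int = 2) -> None:
        """Give each task to a short-lived executor and record how it went."""
        for task in self.tasks:
            while task.status == "pending":
                task.attempts += 1
                output = executor(task.name)  # one problem per executor call
                if check(task.name, output):
                    task.status = "done"
                elif task.attempts >= max_attempts:
                    task.status = "failed"
                self.notes.append(f"{task.name}: attempt {task.attempts} -> {task.status}")


def make_executor():
    """Hypothetical worker: fails its first try at 'parser', succeeds otherwise."""
    tried = set()

    def executor(task_name: str) -> str:
        first_try = task_name not in tried
        tried.add(task_name)
        if task_name == "parser" and first_try:
            return f"{task_name}: tests failing"
        return f"{task_name}: ok"

    return executor


def check(task_name: str, output: str) -> bool:
    """Stand-in for the planner judging whether the work was done well."""
    return output.endswith("ok")


planner = Planner([Task("lexer"), Task("parser")])
planner.run(make_executor(), check)
```

The planner holds the only long-lived state (the task list and notes), while each executor call starts fresh, which is the property the talk attributes to the short-running agents.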

    Now you might think: wow, this is complicated. Cursor actually tried to make it more complicated. They tried to add three levels of management and it didn't work well. One of the things the Cursor team explicitly noted is that simple scales well with agents. You want to keep your harness — this whole system of making the agent work well — pretty simple so it can scale effectively. I'm describing it as simply as possible precisely for that reason. Because if you don't understand how it works and it's a mystery to you conceptually, you're not really going to understand where to apply it or where to go and dig in more if you think this is right for you.

    The key to understanding the difference between individual coding harnesses like the ones Andrej Karpathy is talking about versus the big long-running ones like the one Cursor is doing: you need to recognize that individual coding harnesses are built for the mind of an individual developer. If you have a team of eight or 16 or 20 developers working on something, you have too much complexity in the room to not have a coding harness like Cursor used. You should be looking at project-level coding architectures rather than individual-level.

    That is one of the biggest unlocks, and it's very counterintuitive. I see a lot of people who tell me, "We've had so much speed-up with AI. We have AI assistants. We have individual engineers working with four or five coding assistants at a time. It's incredible how much we get done." But if I surface this simple idea — that maybe instead of framing everything around the human at the center, we should frame it around how we can make it easy for the agent to do the work since we're asking the agent to do all this work anyway — sometimes people look at me like I'm going crazy. They say, "What? Why would we do that? We see so much speed-up with them as individual assistants. Isn't that great?"

    It is great. That's great progress. But from a project perspective, all you're doing is speeding up the human work, and you still have all of the bottlenecks you had before. Only now it might be more complicated because you have a lot more code review to do than you did before. The humans are much busier because they're trying to figure out how to manage four different things at once, and they used to be individual contributors. So maybe with this much complexity, and the fact that it's really hard to parallelize all of this work across lots of developers in a big project, maybe we should actually try to build something at team scale. That's really how you understand you need to be at a level like Cursor's, where they're architecting larger multi-agent harnesses designed to do big work.

    Dark factories: removing humans from the middle

    This brings me to dark factories, and I fully admit there's some blur in these definitions. There are some architectures for large projects that are effectively dark factories. But if you want to know the difference: when you are doing a dark factory approach, you have almost no human involvement from the point you put a specification in to the point where the system says it has passed an eval and is done.

    The reason you do that is that people have found, as they go farther on this agentic coding journey, that it is often easier and simpler to get the human out of the middle of the process altogether. Once you walk into this process, you want the human to be heavily involved at the top: doing some of the design, making sure this is what the customer wants, making sure the spec is really good, making sure intent is communicated clearly. And you want the human at the end, making sure that what was built actually matters and that it passes the evals. But the less the human is involved in the middle, the less strain you have on the humans and on the whole process, because agents tend to push things through so fast that humans have trouble keeping up and become bottlenecks in the middle.

    Dark factories are designed to get around that. They are designed as entire complete systems that hit eval at the end and iterate back automatically until the software passes the evaluation. That's really the heart of it. You put an evaluation or a test that the software has to pass before it can be launched.
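The "iterate back automatically until the software passes the evaluation" idea can be sketched as a loop with no human checkpoint between build and eval. Everything here is hypothetical: `stub_build` and `stub_eval` stand in for an agent pipeline and a real test suite, and the round limit is an added safety assumption.

```python
SPEC = {"features": ["login", "logout", "reset_password"]}


def stub_build(spec: dict, feedback: list[str]) -> set[str]:
    """Stand-in builder: ships the first feature plus whatever the eval flagged."""
    return set(spec["features"][:1]) | set(feedback)


def stub_eval(artifact: set[str]) -> list[str]:
    """Stand-in eval: returns the spec features still missing (empty = pass)."""
    return [feature for feature in SPEC["features"] if feature not in artifact]


def dark_factory(spec, build, evaluate, max_rounds: int = 10):
    """Spec in, artifact out: rebuild with eval feedback until the eval passes."""
    feedback: list[str] = []
    for round_number in range(1, max_rounds + 1):
        artifact = build(spec, feedback)
        failures = evaluate(artifact)
        if not failures:
            return artifact, round_number
        feedback = failures  # failures flow straight back into the next build
    raise RuntimeError(f"eval still failing after {max_rounds} rounds")


artifact, rounds = dark_factory(SPEC, stub_build, stub_eval)
```

In this toy run the first build fails the eval and the second passes. A human-review step after the loop returns, before anything ships, would be the hybrid arrangement recommended later in the talk.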

    Now, if you're really bold — and dark factories are often bold plays — people will launch to production from there without having a human look at the code. The companies I look at tend to have an awareness of risk that is calibrated to actual production realities, and most of them are rightly uncomfortable with just trusting the agent and saying, "Yeah, we'll throw it into production. We hope it works well." If you're an enterprise, you're typically having a human look at the code just to make sure there's some accountability there. It's actually something Amazon learned the hard way recently when they called a bunch of their senior engineers and principal engineers into Seattle to talk about recent AI-generated incidents in production caused by junior engineers and what they were going to do about it. It makes sense to have a sophisticated engineering mind looking at the code at the end to make sure you're confident you got it right.

    That being said, you should understand that dark factories are essentially all about pulling the human out of the middle so humans aren't stressed and bottlenecked in the middle of a fast-flowing agentic process. You're just trying to get the evals done and get the software out the door. It's like a dark factory — the famous dark factories in China are the ones where the lights are off. It's literally dark and you're making stuff with automated robots all the way through. That's the vision. That's the metaphor we're using when we talk about agents in the system.

    You can see how that's so different from individuals using agents. If you give your agent a task and go make coffee for 20 minutes, that's not a dark factory. Similarly, if you're investing heavily in a coding harness and you're putting a multi-agent project together and you're checking on it obsessively all the way through and giving it ongoing guidance if it doesn't go right, you're probably closer to a larger project-scale harness, but you have a fair bit of human involvement. I admit it's a little bit of a blurry line. If you get your project harness to the point where it's very stable and you can do large runs of the code and you don't have to look at the code in between until it passes the eval, you are getting very close to a dark factory layout.

    Think of these as steps along a path where human involvement concentrates more and more at the beginning and end of the software process. If you're an individual, that can look like task-level autonomy for the agent. If you're an organization building project-level agentic engineering, it can look like the human being involved mostly at the beginning and mostly at the end with some guidance in the middle. And then if you're really sophisticated and you feel really good, you can have project-level engineering focused on those evals or tests, where you have human involvement from engineers and product at the top and then human involvement from engineers at the end.

    Auto research: optimizing for a metric, not producing software

    What about auto research? Auto research is kind of a different bug. If you look at those three steps to coding, they're all about producing code and working software. Auto research is not. Auto research is about optimizing for a metric. It's actually a descendant of classical machine learning techniques. In machine learning, when you teach a machine something, all you're doing is trying to get it to be better and better at optimizing for a target. When I was teaching a machine learning system how to move titles around in video, we were optimizing for the ability to cut letters out and reliably shift them. That sounds silly, but it was actually necessary to resize title artwork.

    Now, if you were optimizing for auto research in the age of LLMs, you might be optimizing for different metrics. Tobi Lütke optimized his Liquid presentation framework that powers millions of Shopify shops. There's a codebase to optimize against and you're basically optimizing for a better runtime experience — optimizing for the code to run more smoothly in production. That's a metric you're using. Or you could be optimizing for something like how you tune models in production — are you tuning the weights of the models appropriately? That's something we actually got from Andrej Karpathy. He's the one who came out with auto research just a couple of weeks ago, and he used it on his own settings in his quest to auto-optimize his way toward effectively a GPT-2-level scale.

    Now you might think: GPT-2, who cares? It's GPT-5.4 right now. Well, what he's trying to do is demonstrate as an independent thinker that it is possible to auto research your way through an LLM development chain, which is a really important piece of research. You can use that same technology on any metric you want to optimize, as long as you have sufficient data points. I've given you an example from Tobi and Shopify around running code. I've given you an example for the deep LLM science nerds around optimizing your tunings. But if you're not any of those things, you can also use it to optimize conversion rates. Anything you can give it a metric for, in principle, you can auto research against.

    Here's the difference. Yes, this is an agentic process. The LLM is essentially climbing a mountain by relentlessly experimenting. You can think of it as trying to reach the most optimal condition possible. Many experiments will be failures. Some will be successes. Humans will probably need to review the ones that are successes to ensure they're scalable. But this doesn't work in the same way that the software process does. This is not about producing working software. This is about using the power of LLMs to optimize for a particular metric.
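The hill-climbing loop described here can be sketched as follows. This is not Karpathy's package or Lütke's setup: the metric, the parameter names, and the random `propose` step (standing in for an LLM suggesting the next experiment) are all invented for illustration.

```python
import random


def measure(params: dict[str, float]) -> float:
    """Hypothetical metric, higher is better; peaks at cache_size=64, batch=8."""
    return -((params["cache_size"] - 64) ** 2) - ((params["batch"] - 8) ** 2)


def propose(params: dict[str, float], rng: random.Random) -> dict[str, float]:
    """Stand-in for the LLM proposing an experiment: nudge one parameter."""
    candidate = dict(params)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.choice([-4, -1, 1, 4])
    return candidate


def auto_research(start: dict[str, float], experiments: int = 200, seed: int = 0):
    """Run many small experiments and keep only the ones that improve the metric."""
    rng = random.Random(seed)
    best, best_score = dict(start), measure(start)
    log: list[bool] = []  # True where an experiment was kept
    for _ in range(experiments):
        candidate = propose(best, rng)
        score = measure(candidate)
        kept = score > best_score
        log.append(kept)
        if kept:
            best, best_score = candidate, score
    return best, best_score, log


start = {"cache_size": 16.0, "batch": 1.0}
best, best_score, log = auto_research(start)
```

Most entries in `log` are typically failures, which matches the point above: the loop only works because `measure` gives an unambiguous answer about whether each experiment helped. Without such a metric there is nothing to climb.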

    You have to be able to understand: is my problem software-shaped, or is my problem metric-shaped? Those are super different things. If you can't figure out the difference between the two, you need to sit with your problem until you understand that either it's a rate I can optimize in some measure, or it's a piece of software I need to build. Those are usually pretty intuitive. Once I put it that way, people usually say, "Aha, I know what it is. It's one or the other. It's not both."

    Orchestration: the most complex agent type

    Now we come to orchestration. I've saved orchestration for the end because it's probably the most complicated one to set up and manage. That's one of the reasons there are lots of startups in the space — they're basically trying to optimize away that complexity for you. LangGraph is an example of an orchestrator. If you have a bunch of different jobs you want an agent to do — okay, this agent needs to pick up the ticket, this agent needs to go research for the ticket, this agent over here needs to go do something else, and then we have to close the ticket and comment on it along the way — you're basically handing off a bunch of things to agents. That's a customer success example, but you can imagine other kinds. If you're researching and then you're writing, those are two different things. So you're looking at orchestration. Orchestration is just a fancy way of saying handing off from A to B.
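The ticket example above, "handing off from A to B", can be sketched as a chain of role functions that each receive the ticket, add their piece, and pass it on. This does not use LangGraph or any other framework's API; the roles and field names are hypothetical.

```python
from typing import Callable

Ticket = dict[str, str]
Role = Callable[[Ticket], Ticket]


def triage(ticket: Ticket) -> Ticket:
    """Pick up the ticket and label it."""
    category = "billing" if "refund" in ticket["text"].lower() else "general"
    return {**ticket, "category": category}


def research(ticket: Ticket) -> Ticket:
    """Gather background for the labelled ticket."""
    return {**ticket, "findings": f"policy notes for {ticket['category']}"}


def respond(ticket: Ticket) -> Ticket:
    """Write the reply and close the ticket."""
    reply = f"Re: {ticket['text']} ({ticket['findings']})"
    return {**ticket, "reply": reply, "status": "closed"}


PIPELINE: list[tuple[str, Role]] = [
    ("triage", triage),
    ("research", research),
    ("respond", respond),
]


def orchestrate(ticket: Ticket, pipeline=PIPELINE) -> Ticket:
    """Hand the ticket from role to role, recording each handoff."""
    trail = []
    for role_name, role in pipeline:
        ticket = role(ticket)
        trail.append(role_name)
    return {**ticket, "handled_by": " -> ".join(trail)}


closed = orchestrate({"text": "I need a refund"})
```

Each role depends on fields the previous role wrote, so the handoff contract (what is passed, in what shape) is where most of the design effort goes.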

    I want to be careful here because if you're listening along and you say, "Hey, the Cursor example felt a lot like this — isn't the planner agent handing off work to the executor?" Yes, that's true. But keep in mind, in Cursor's case this is toward one unified goal. They're trying to build a piece of code and the multi-agent approach is just the most effective way to do that over a long period of time. In orchestration, you're actually giving these agents really specialized roles. If you're the person who got excited about giving agents different roles, you're really excited about orchestration, which is a small subset of what agents can do. You're saying, "I want a really good marketing agent, and then a really good copywriting agent, and then a really good finance agent." That's orchestration, and orchestration takes a lot of work from people. You have to be thoughtful about how you hand off — what do you hand off, what is the context, what are the protocols. You're essentially optimizing all of these individual LLM bits in the chain so that you can effectively manage the handoffs along the way.

    In my experience, when you start to talk about agentic systems, what you're really doing is talking about the bits of work where you can trust an agent to do something a human doesn't have to look at. In the orchestration example, there's actually a lot of joints in the process that a human has to look at. That is one of the things that makes a lot of the orchestration approaches right now feel somewhat heavy. You have to do a lot of human involvement. That doesn't mean they're not valuable — there are some tasks where you do need those specialized roles right now, and so it makes sense to have orchestration platforms like LangGraph for that task. The question is really whether the work you're doing on coordination matches the scale of the problem.

    If you're tackling 10,000 customer success tickets, it clearly is worth it to spend some time to get this right — let alone if it's millions or tens of millions. If you're only going to do this for a thousand tickets or a hundred tickets, it might not be worth it. When people talk about orchestration, I often ask about scale, because is it really worth going after? Are you going to put the work into all of the prompts and all of the context management and just not get the scale back? Or is it worth the value you're putting into it?
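The scale question can be put as simple break-even arithmetic. The talk gives no cost figures, so the setup cost and per-ticket saving below are invented purely to show the shape of the calculation.

```python
import math


def break_even_volume(setup_cost: float, saving_per_ticket: float) -> int:
    """Smallest ticket count at which the orchestration setup pays for itself."""
    if saving_per_ticket <= 0:
        raise ValueError("saving_per_ticket must be positive")
    return math.ceil(setup_cost / saving_per_ticket)


def net_value(volume: int, setup_cost: float, saving_per_ticket: float) -> float:
    """Savings across all tickets minus the one-time setup cost."""
    return volume * saving_per_ticket - setup_cost


# Hypothetical numbers: 20,000 of setup work, 4 saved per ticket handled.
SETUP_COST = 20_000
SAVING = 4

threshold = break_even_volume(SETUP_COST, SAVING)
at_hundred = net_value(100, SETUP_COST, SAVING)
at_ten_thousand = net_value(10_000, SETUP_COST, SAVING)
```

Under these assumed numbers the setup breaks even at 5,000 tickets, loses 19,600 at a hundred tickets, and returns 20,000 at ten thousand. The real inputs will differ, but the comparison is the same one the talk asks for.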

    Cheat sheet: which agent type to use

    Let me close by giving you a cheat sheet so you know which of these different kinds of agents to go after.

    If you are optimizing for just what is in front of you, you should be using a coding harness. Your judgment is really the gold standard here. This is what Peter Steinberger did when he used multiple Codex agents to code OpenClaw. His judgment was the gate. That's a coding harness — the classical approach. That's what Andrej does too. In that sense, it's the simplest approach, which is the one we started with in this video.

    At project scale, your judgment can still be the quality gate. It just looks a little bit more like Cursor's approach — having planner agents and executor agents working against an eval, but ultimately a human is still judging.

    If you go even further and your judgment is no longer the key thing to keep in mind because you trust the agents (they've been tuned so well, the evals are so good, you're confident they can hit production, or maybe your standards are lower, and sometimes it's both), then you might be doing dark factory. All you're doing is making sure the intent and the specification are good, making sure the agents pass the test honestly, and then going to production. You're putting a lot of work into monitoring and making sure that the work being done in production is legitimate and the quality is there. That is dark factory work. It's really the story of optimizing not against a task but against specifications. It is possible to hybridize those: you can do mostly dark factory and have a human check the evals at the end. I often recommend that, because you can get a lot of the value out of the middle part that's a dark factory and still get a human judgment at the end in a place that's really important.

    If you're optimizing against a rate or a metric, that's auto research. You're trying to figure out how to automatically use LLMs to run little mini experiments on code, on LLM tunings, or maybe on conversion rates, to figure out how to make that metric better. The sky is the limit. If you have a lot of data and you have a rate of some sort, in theory you can apply auto research. We're just at the beginning of using this. Andrej released the package a couple of weeks ago, but that's the principle and you're going to see a lot more like it. I've already seen forks that make this very generally applicable and let you ask a question in plain English. It's coming.

    Last but not least, if you're optimizing for workflow routing, you're really talking about orchestration — something like Crew AI, something like LangGraph. What you want to do at that point is make sure it is worth it to do all of those handoffs.
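The cheat sheet above amounts to a lookup from "what are you optimizing for" to an agent type. The sketch below encodes it directly; the key strings are shorthand chosen for this example rather than terms from the talk, and the project-scale harness is listed separately because the talk treats it as a cousin of the coding harness.

```python
from enum import Enum


class AgentType(Enum):
    CODING_HARNESS = "coding harness"
    PROJECT_HARNESS = "project-scale coding harness"
    DARK_FACTORY = "dark factory"
    AUTO_RESEARCH = "auto research"
    ORCHESTRATION = "orchestration"


# Shorthand keys (assumed for this sketch) for the five cheat-sheet cases.
CHEAT_SHEET = {
    "task": AgentType.CODING_HARNESS,          # your judgment is the gate
    "project": AgentType.PROJECT_HARNESS,      # planner + executors, human still judges
    "specification": AgentType.DARK_FACTORY,   # spec in, eval-gated software out
    "metric": AgentType.AUTO_RESEARCH,         # a rate or number to improve
    "workflow routing": AgentType.ORCHESTRATION,  # specialized roles and handoffs
}


def pick_agent_type(optimizing_for: str) -> AgentType:
    """Map the shape of the problem to the agent type from the cheat sheet."""
    try:
        return CHEAT_SHEET[optimizing_for]
    except KeyError:
        raise ValueError(f"unknown problem shape: {optimizing_for!r}") from None
```

A problem that fits none of the keys, such as writing a novel, raises an error here, which echoes the closing point that some work does not belong to any of these agent types.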

    Closing: don't mix up the species

    There you go, that's my safari tour. Those are the four species of agents doing real work in the enterprise. Please do not confuse them. I see people proposing using auto research to build software. Don't do that. I see people using long-running coding harnesses and saying, "This is the way I want to build and write a novel." No, that would be an orchestration problem, or really probably a human should do it.

    There are lots and lots of ways to get agents right. But part of the challenge is that we are now sophisticated enough that we have to be really specific about what agents do and do not do well, and how you configure this supposedly simple idea of tools and a loop and an LLM into actual work configurations. That's why I made this video. I want you to walk away and really understand that there are at least four different types of agents in the wild in implementation today. Do not mix them up. Understand what you're building for.


    Polished transcript of AI News & Strategy Daily | Nate B Jones. All views are those of the original speakers.
    Published by @maverick