Drago Anguelov of Waymo lectures on autonomous driving perception, simulation, and the long-tail problem at MIT
A guest lecture from Waymo's principal scientist on the core AI challenges of building a fully driverless vehicle system.
Summary
Drago Anguelov, principal scientist at Waymo, delivers a guest lecture for MIT's deep learning for self-driving cars course (6.S094). He outlines Waymo's journey from its founding ten years prior — including the world's first fully autonomous public ride and the launch of a commercial driverless service in Phoenix — to the technical challenges of scaling autonomous driving. The central argument of the talk is that the "long tail" of rare, unusual driving scenarios is the defining challenge of the field, and that machine learning, while essential, must be complemented by expert-designed systems to handle cases where data is scarce. Anguelov presents Waymo's approach across three pillars: a scalable ML factory (data collection, labeling, and model training infrastructure), hybrid perception and planning systems that remain robust when ML confidence is low, and a simulation framework running the equivalent of seven billion miles to test edge cases. He also presents research on end-to-end imitation learning for agent behavior and trajectory optimization agents, arguing that realistic simulated agents are critical to testing the long tail at scale.
FULL TRANSCRIPT
Introduction
Lex Fridman: Welcome back to 6.S094, Deep Learning for Self-Driving Cars. Today we have Drago Anguelov, principal scientist at Waymo. Aside from having the coolest name in autonomous driving, Drago has done a lot of excellent work developing and applying machine learning methods to autonomous vehicle perception, and more generally in computer vision and robotics. He's now helping Waymo lead the world in autonomous driving — ten-plus million miles achieved autonomously to date, which is an incredible accomplishment. So it's exciting to have Drago here with us to speak. Please give him a big hand.
Waymo's History and the Long-Tail Challenge
Drago Anguelov: Hi, thanks for having me. I'll tell you a bit about our work, the exciting nature of self-driving, the problem, and our solutions. My talk is called "Taming the Long Tail of Autonomous Driving Challenges."
My background is in perception and robotics. I did my PhD at Stanford with Daphne Koller and worked closely with one of the pioneers in the space, Professor Sebastian Thrun. I spent eight years at Google doing research on perception, also worked on Street View developing deep models for detection and neural net architectures. I was briefly at Zoox, heading the 3D perception team, where we built another perception system for autonomous driving. And I've been leading the research team at Waymo most recently.
I want to tell you a little bit about Waymo. This month actually marks ten years since the project started, when Sebastian Thrun convinced Google leadership to try an exciting new moonshot. The goal they set for themselves was to drive ten different segments that were 100 miles long, and later that year they succeeded and drove an order of magnitude more than anyone had ever driven. In 2015 we brought a car to the road that was built from the ground up as a study in what fully driverless mobility would be like. That same year we put this vehicle in Austin and it completed the world's first fully autonomous ride on public roads. The person inside the car is a fan of the project who is blind; we did not want this to be just a demo, we wanted it to be a fully driverless experience. We worked hard, and in 2017 we launched a fleet of fully self-driving vehicles on the streets of the Phoenix metro area, and we have been doing fully driverless operations ever since.
We continued, and last year we launched our first commercial service in the Phoenix metro area. People can call a Waymo on their phone, it comes to pick them up, and helps them with errands or going to school. We've already been learning a lot from these customers and we're looking to grow and expand the service and bring it to more people.
In the process of running the service, we have driven ten million miles on public roads, as mentioned, both driverlessly and with human drivers to collect data. We've driven all kinds of scenarios across cities, capturing a diverse set of conditions and situations in which we develop our systems.
I want to tell you about the long tail of events — all the things we need to handle to enable a truly driverless future — and offer some solutions and show you how Waymo has been thinking about these issues.
The Long Tail: Rare Scenarios That Must Be Handled
Drago Anguelov: Even after driving ten million miles, we still find new scenarios we have not seen before, and we keep collecting them. When you think about self-driving vehicles, they need to have the following properties. First, a vehicle needs to be capable: it needs to be able to handle the entire task of driving. You cannot handle just a subset and still remove the human operator from the vehicle. Second, all of these tasks obviously need to be done well and safely. That is the requirement for achieving driving at scale.
The question then is: how many capabilities and how many scenarios do you really need to handle well? It turns out the world is quite diverse and complicated, and there are a lot of rare situations, and all of them need to be handled well. This is what we call the long tail — the long tail of situations. It's one type of effort to get self-driving working for the common cases, and then it's another effort to tame the rest, and they really, really matter.
I'll show you some examples. Here we are driving in the street — let's see if you can tell what is unusual in this video. There's a bicyclist, and he is carrying a stop sign. I don't know where he picked it up, but it's certainly not a stop sign we need to stop for, unlike others. You need to understand that.
Here's another scenario — our vehicle stops and a big pile of poles comes our way. You need to potentially understand that and learn to avoid it. Generally, different types of objects can fall on the road; it's not just poles.
Here's another interesting scenario: construction. Someone has changed the lane markings, put a bunch of cones, and this is our vehicle correctly identifying where it's supposed to be driving between all of these cones and successfully executing it. That happens fairly often if you drive a lot.
Another case: we hear a siren. We have the ability to understand sirens from special vehicles. You can see we hear it and stop, while some other drivers react much later than we do, braking at the last moment to let the emergency vehicle pass.
Here's another scenario: we are stopped at a light that has turned green, about to go, and someone runs the red light at high speed without any remorse. We successfully stop and prevent issues. Sometimes you have the right of way and people don't always abide by the rules, and you don't want to go directly in front of that person even if they're breaking the law.
Hopefully with this I've convinced you that the situations that can occur are diverse and challenging, and there are quite a few of them.
Perception, Prediction, and Planning
Drago Anguelov: I want to take you on a tour of what makes this challenging and then tell you some ways in which we think about it and how we're handling it. To do this, we're going to delve into the main tasks for self-driving, which are perception, prediction, and planning.
Perception is a mapping from sensory inputs and potentially prior knowledge of the environment to a scene representation. That representation can contain objects, semantics, a constructed map, learned object relationships, and so on. The space of things you need to handle in perception is fairly large, and it's a complex mapping. You have sensors: pixels come in, LiDAR points come in, radar scans come in. You have multiple axes of variability in the environment.
There are a lot of objects with different types, appearances, and poses — there's a bunch of people dressed as dinosaurs in one case; people generally are fairly creative in how they dress. Vehicles can also be different types. People come in different poses and we have seen it all. There are different environments in which these objects appear: times of day, seasons, day and night, highway environments, suburban streets, and so on. And then there's another variability axis — different objects can appear in different configurations and have different relationships. Things like occlusion, a guy carrying a big board, reflections, people riding on horses, and so on.
I'm showing you this because I want to show you the space. In most cases you care about most objects in most environments in most reasonable configurations. That's the space you need to map from sensor inputs to a representation that makes sense, and you need to learn this mapping function or represent it somehow.
The next step is prediction. Apart from just understanding what's happening in the world, you need to be able to anticipate and predict what some of the actors in the world are going to do. The actors are mostly people, and people are honestly what makes driving quite challenging. The vehicle needs to be out there and be a full-fledged traffic scene participant. This anticipation of agent behavior sometimes needs to be fairly long-term — from one second to maybe ten seconds or more.
What goes into anticipating the future? You can watch past behavior — someone going this way will probably continue. You can use high-level semantics — I'm in a presentation room sitting at the front giving a talk, I'll probably stay here and continue. And of course there are subtle appearance cues: if a person is watching our vehicle and moving toward us, we can be fairly confident they're paying attention and not going to do anything particularly dangerous. If someone is distracted, or there is a person in a car waving at us, or a vehicle's blinkers are on — these are all signals we need to understand in order to behave well. Last but not least, agents are also affected by the other agents in the environment, so everyone can affect everyone else.
Here's an example: our Waymo vehicle is driving and there are two bicyclists going around a parked car. We correctly anticipate that as they bike, they will go around the car, and we slow down and let them pass. We are reasoning that they will interact with the parked car — that's our most likely prediction for the rear bicyclist. We anticipate that they will do this and we correctly handle it.
Planning is our decision-making machine. It produces vehicle behavior, typically ending in control commands to the vehicle — accelerate, slow down, steer. It needs to generate behavior that is safe (safety comes first), comfortable for the passengers, sends the right signals to other traffic participants, and makes progress to deliver the passengers. You need to trade all of these in a reasonable way, and it can be fairly sophisticated reasoning in complex environments.
Here's a complex scene: a school gathering, with bicyclists trailing us, vehicles very close, a bunch of pedestrians, and we need to make progress. We're driving reasonably well in crowded scenes, and that is part of the prerequisite of bringing this technology to urban environments.
Machine Learning as the Core Tool
Drago Anguelov: How are we going to do it? I'm a machine learning person. I think when you have complicated models and systems, machine learning is a really great tool to model complex actions, complex mapping functions, and features. We're going to learn our system, and we've been doing this. Machine learning is now permeating all parts of the Waymo stack — all of these systems I'm talking about. It helps us perceive the world, it helps us make decisions about what others are going to do, it helps us make our own decisions, and machine learning is a tool to handle the long tail.
I have an allegory about machine learning that I like to think about. There is a classical system and there is a machine learning system. A classical system — and I've been there, I've done early machine learning systems too — is like being an artisan. You're the expert, you have your tools, you need to build this product, you have your craft. You can fairly quickly get something reasonable, but then it's harder to change, harder to evolve. If you learn new things, you need to go back, and maybe the tools don't quite fit. As the product becomes more complicated, it becomes harder and harder to maintain.
Machine learning — modern machine learning — is like a factory. You build the factory, which is the machine learning infrastructure, and then you feed data into this factory and get good models to solve your problems. Infrastructure is at the heart of this new paradigm. Once you build the factory, you can iterate. It's scalable — just keep feeding the right data and the machine keeps giving you good models.
The ML Factory: Infrastructure, Data, and Models
Drago Anguelov: What is the ML factory for self-driving models? Roughly it goes like this: we have a software release, we put it on the vehicle, we drive, we collect data, we store it, and then we select some parts of this data and send it to labelers. The labelers annotate parts of the data that we find interesting — that's the knowledge we want to extract. These are the labels, the annotations, the results we want for our models. Then we train machine learning models on this data. After we have the models, we do testing and validation to confirm they're good to put on our vehicles. Once they're good, we go and collect more data, and the process starts again. You collect more data, you select new data you haven't selected before, you add it to your dataset, you keep training the model, and you iterate. It's a nice scalable setup. Of course this needs to be automated and scalable itself — it's a game of infrastructure.
At Waymo we have the beautiful advantage of being really well set up with regard to machine learning infrastructure. Let me tell you about its ingredients.
Ingredient one: computing and software infrastructure. We're part of Alphabet and Google. We can leverage TensorFlow, the deep learning framework. We have access to experts who know it in depth. We have data centers to run large-scale parallel compute and train models. We have specialized hardware for training models, which makes it cheaper, more affordable, and faster, so you can iterate better.
Ingredient two: high-quality labeled data. We have the scale to collect and store hundreds of thousands and millions of miles. But just collecting and storing a lot of miles is not necessarily the best thing you can do, because there is a decreasing utility to the data — most of the data comes from common scenarios you may already be good at, and that's where the long tail comes in. So it's really important how you select the data.
While you're running a release on the vehicle, you have a bunch of models and a bunch of understanding about the world, and you can annotate the data as you go and use this knowledge to decide what data is interesting, how to store it, and which data you can potentially ignore. Then you need to be very careful how to select data — you want to select data that captures the long-tail cases you may not be doing so well on. For this, we have active learning and data mining pipelines: given exemplars, find the rare examples; look for parts of your system which are uncertain or inconsistent over time; and go label those cases.
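As an illustration that is not from the lecture, the idea of looking for parts of the system that are uncertain can be sketched as ranking logged frames by the entropy of the model's class probabilities and sending the most uncertain ones to labelers. The frame ids, probabilities, and function names below are hypothetical, and entropy is only one possible uncertainty measure; nothing here describes Waymo's actual pipeline.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` frames with the highest entropy.

    `predictions` maps a hypothetical frame id to the class probabilities
    that an on-vehicle model produced for that frame.
    """
    ranked = sorted(predictions, key=lambda fid: entropy(predictions[fid]), reverse=True)
    return ranked[:budget]


# Made-up example data: one confident frame and two less certain ones.
predictions = {
    "frame_a": [0.98, 0.01, 0.01],  # confident: likely a common case
    "frame_b": [0.40, 0.35, 0.25],  # uncertain: candidate long-tail case
    "frame_c": [0.70, 0.20, 0.10],
}
selected = select_for_labeling(predictions, budget=2)
print(selected)
```

With a fixed labeling budget, a ranking like this spends annotation effort on the frames the current model finds hardest rather than on more examples of what it already handles.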
Last but not least, we also produce auto-labels. When you collect data, you also see the future for many of the objects — what they did. Because of that, knowing the past and the future, you can annotate your data better, and then go back to your model that does not know the future and try to replicate that.
Ingredient three: high-quality models. We're part of Alphabet, alongside Google and DeepMind, and generally Alphabet is the leader in AI. When I was at Google, we were very early in the deep learning revolution, around 2013, when a lot of things were not yet understood. Through that we had the opportunity to develop some important neural net architectures. The team I managed invented the Inception architecture, which became popular later. We also invented SSD, which at the time was the state-of-the-art fast object detector. We won ImageNet 2014. Now if you go to conferences, Google and DeepMind are leaders in perception, reinforcement learning, smart agents, semantic segmentation, pose estimation, and object detection. We collaborate with Google and DeepMind on projects improving our models.
AutoML for Neural Architecture Search
Drago Anguelov: I want to tell you about something that captures all of these ideas — infrastructure, data, and models — in one project. This is work we did recently and put online on our blog today: automatic machine learning for tuning and adjusting architectures of neural networks.
There is a team at Google working on AutoML. Neural networks themselves are complex architectures crafted by practitioners, artisans of networks in some way. We have very strict latency constraints on our models, some compute constraints, and the networks are specialized. It often takes people months to find the right architecture that is most performant and low-latency. There's a way to offload this work to machines: you can have machines themselves go and find a good network architecture that is both low-latency and high-performance.
As we keep collecting data and finding new cities or new examples, the architectures may change and we want to keep evolving them without too much effort. So we worked with the Google researchers, who had developed a system that searched the space of architectures and found a set of components — a small sub-network called a cell — that you can replicate in the network to build a larger network. They discovered this on a small vision dataset called CIFAR-10, which was very popular in the early days of deep learning and allows you to quickly train models and explore a large search space.
The first thing we did at Waymo was explore several hundred cell combinations to see what performs better on our tasks — one of them being LiDAR segmentation, where you have a map representation and some LiDAR points and you segment them: this point is part of a vehicle, that point is part of vegetation, and so on. We found one of two things: either models with similar quality but much lower latency and less compute, or models of a bit higher quality at the same latency. We essentially found better models than the human engineers did. Similar results were obtained for lane detection as well.
Beyond this transfer learning approach, you can also do a full architecture search from scratch, since there's no reason why what was found on CIFAR-10 is best suited for our more specialized problems. So we went about this more from the ground up. Our networks are trained on quite a lot of data and take quite a while to converge, so we defined a proxy task, a smaller, simplified task that correlates with the larger task. Once we established the proxy task, we executed the search algorithms developed by the Google researchers. We trained up to 10,000 architectures with different topologies and capacities, and once we found the top hundred models, we trained the large networks on those all the way and picked the best ones.
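As an illustration that is not from the lecture, the two-stage shape of this search (score many candidates cheaply on a proxy, then keep only the best for full training) can be sketched with plain random search. The search space, the `proxy_score` formula, and all names are hypothetical stand-ins: a real proxy score would come from actually training each candidate on the simplified task, and the search algorithms used in the talk are not specified here.

```python
import random


def sample_architecture(rng):
    """Draw one candidate from a toy, made-up search space."""
    return {"depth": rng.randint(2, 12), "width": rng.choice([32, 64, 128, 256])}


def proxy_score(arch):
    """Toy stand-in for a proxy-task result: more capacity helps, latency hurts."""
    quality = 1.0 - 1.0 / (arch["depth"] * arch["width"] ** 0.5)
    latency = arch["depth"] * arch["width"] / 3072.0  # normalized to at most 1.0
    return quality - 0.5 * latency


def search(num_candidates, top_k, seed=0):
    """Score every sampled candidate on the proxy and return the top_k best."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(num_candidates)]
    candidates.sort(key=proxy_score, reverse=True)
    return candidates[:top_k]


# Only these finalists would then be trained to convergence on the full task.
finalists = search(num_candidates=1000, top_k=10)
```

The point of the split is cost: the expensive full training runs only on a small shortlist, which works to the extent that the proxy ranking correlates with the full-task ranking.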
On the left you can see 4,000 different models spanning the scale of latency and quality. In red was the transfer model — after the first round of search we actually did not produce a better model than the transfer, which already leveraged their insight. So we took the learnings and the best models from this search and did a second round, shown in yellow, which allowed us to beat it. Third, we also executed a reinforcement learning algorithm developed by their researchers on 6,000 different architectures, and that one was able to significantly improve on the red dot, which itself significantly improves on the in-house algorithm. That's one example where infrastructure, data, and models combine and shows how you can keep automating the factory.
Robustness When ML Is Uncertain
Drago Anguelov: That is all good, but we keep finding new examples in the world, and for some situations we have fairly few examples. There are cases where the models are uncertain or can potentially make mistakes, and you need to be robust to those. You cannot put the product out and say our network just doesn't handle some case. So we have designed the system to be robust even when ML is not particularly confident.
One part is redundant and complementary sensors. We have a 360-degree field of view on our vehicles in camera, LiDAR, and radar. They are complementary modalities — an object is seen in all of them, and they all have different strengths and different modes of failure. Whenever one of them tends to fail, the others usually work fine, and that helps a lot to make sure we do not miss anything.
We also design our system to be a hybrid system. Some of these mapping problems or problems with neural network models are very complicated — they're high-dimensional, the image has a lot of pixels, LiDAR has a lot of points, the networks can end up pretty big, and it may not be easy to train with very few examples given the current state of the art. The state of the art keeps improving — there is zero-shot and one-shot learning — but we can also leverage expert domain knowledge. Humans can help develop the right input representations, put in expert bias that constrains the representation to fewer parameters that already describe the task, and with that bias it is easier to learn models with fewer examples. Experts can also put their knowledge into the design of the algorithm itself.
Our system is this hybrid. An example of what that looks like for perception: even if the machine learning system is not confident, we still have tracks and obstacles from LiDAR and radar scans, and we make sure we drive relative to those safely. In prediction and planning, if we're not confident in our predictions, we can drive more conservatively. Over time, as the factory is running and our models become more powerful and we get more data on all the cases, the scope of ML grows, and the set of cases you can handle with it increases.
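As an illustration that is not from the lecture, driving more conservatively when predictions are uncertain could be sketched as a rule that scales the planner's nominal speed by prediction confidence. The function name, the threshold, and the linear scaling are all assumptions made for this sketch, not a description of Waymo's planner.

```python
def target_speed(nominal_speed, prediction_confidence, threshold=0.8, min_factor=0.5):
    """Return a speed that is reduced when prediction confidence is low.

    At or above `threshold` the nominal speed is kept. Below it, the speed is
    scaled linearly from `min_factor` (confidence 0) up to 1.0 (at threshold).
    """
    if not 0.0 <= prediction_confidence <= 1.0:
        raise ValueError("prediction_confidence must be in [0, 1]")
    if prediction_confidence >= threshold:
        return nominal_speed
    factor = min_factor + (1.0 - min_factor) * prediction_confidence / threshold
    return nominal_speed * factor


print(target_speed(10.0, 0.9))  # confident: keep nominal speed
print(target_speed(10.0, 0.4))  # uncertain: slow down
```

A hand-written rule like this is the kind of expert-designed safeguard that stays in place while the learned components improve.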
So there are two ways to attack the tail: you both protect against it and you keep growing ML and making the system more performant.
Large-Scale Testing and Simulation
Drago Anguelov: I'm going to tell you now how we deal with large-scale testing, which is another key problem in the pipeline and in getting vehicles on the road.
How do you normally develop a self-driving algorithm? Ideally you make a change and put it on the vehicle, drive a bunch, and say it looks great. The problem is that some conditions and situations occur very, very rarely, and if you do this you're going to wait a long time. Furthermore, you don't just want to take your code and put it on a vehicle — you need to test it even before that. You don't want untested code on public streets.
You can do structured testing. We have a 90-acre former Air Force base where we can test very important situations and situations that occur rarely. You can select and deliberately stage conditions safely. But you cannot do this for all situations.
So what do you do? A simulator. How much do we need to simulate? We simulate a lot. We simulate the equivalent of 25,000 virtual cars driving ten million miles a day — over seven billion miles simulated. It's a key part of our release process.
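As a quick check of scale, the figures quoted above can be put side by side. The only inputs are the three numbers from the talk; the constant daily rate is an assumption made here for the arithmetic and is not stated in the lecture.

```python
virtual_cars = 25_000
miles_per_day = 10_000_000      # simulated miles per day, from the talk
total_miles = 7_000_000_000     # total simulated miles, from the talk

# Average daily distance each virtual car would cover.
miles_per_car_per_day = miles_per_day / virtual_cars

# Days needed to reach the total if the daily rate were constant (assumption).
days_at_this_rate = total_miles / miles_per_day

print(miles_per_car_per_day, days_at_this_rate)
```

Under that assumption each virtual car covers 400 miles a day, and the total corresponds to about 700 days of simulation at the quoted rate.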
Why do you need to simulate this much? The variety of cases to worry about is enormous, and changes can propagate through the system in unexpected ways. If you change perception slightly — different segmentation or detection — the changes can go through the system and the results can change significantly. You need to test all the way through.
What to simulate? One thing you can do is create scenarios from scratch, working with safety experts and analyzing the conditions that typically lead to accidents. You can also leverage your driving data — you have all your logs with a bunch of situations already in them. You can pick interesting situations from your logs, and furthermore you can take all these situations and create variations to get even more scenarios.
Here's an example of log simulation. In the real world, we mostly stayed in the middle lane and stopped. In simulation, our algorithm decided this time to merge into the left lane and stop, and everything was fine.
What can go wrong in simulation from logs? In another scenario, our vehicle in the real world was where the green vehicle is. In simulation, we drove differently and ended up where the blue vehicle is. We're driving, and then, what happened? There is a purple agent who in the real world saw that we had passed them safely, so it was safe for them to go. But in simulation it's no longer safe, because we changed what we did while the logged agent simply replays its original behavior. The insight is: in simulation, our actions affect the environment and that needs to be accounted for.
So if you want to have effective simulations on a large scale, you need to simulate realistic driver and pedestrian behavior. You could think of a simple model — a brake-and-swerve model, where there's some normal way reactions happen, a reaction time and braking profile. If an agent sees someone in front of them, they just apply this algorithm. Hopefully I've convinced you that behavior can be fairly complicated and this will not always produce a believable reaction, especially in complex interactive cases such as merges, lane changes, and intersections.
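As an illustration that is not from the lecture, the braking half of such a simple reactive model can be sketched with standard kinematics: the agent travels at constant speed during its reaction time, then decelerates at a constant rate. The default reaction time and deceleration are assumed values, and the swerve part of the model is not covered.

```python
def stopping_distance(speed_mps, reaction_time_s=1.5, decel_mps2=6.0):
    """Distance travelled before stopping: reaction phase plus braking phase."""
    reaction_distance = speed_mps * reaction_time_s
    braking_distance = speed_mps ** 2 / (2.0 * decel_mps2)
    return reaction_distance + braking_distance


def must_brake(gap_m, speed_mps, reaction_time_s=1.5, decel_mps2=6.0):
    """True if the gap to the vehicle ahead is no larger than the stopping distance."""
    return gap_m <= stopping_distance(speed_mps, reaction_time_s, decel_mps2)


# At 12 m/s: 18 m covered while reacting plus 12 m while braking.
print(stopping_distance(12.0))
print(must_brake(gap_m=25.0, speed_mps=12.0))
```

A rule like this has only two parameters and no notion of the other agents' intent, which is why it tends to break down in merges, lane changes, and intersections.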
What could you do? You could learn an agent from real demonstrations. You went and collected all this data in the world — you have a bunch of information about how vehicles and pedestrians behave. You can learn the model and use that.
Learning Agent Behavior for Simulation
Drago Anguelov: What is an agent? An agent receives information — maybe context about the environment — and develops a policy, a reaction. That's the driver agent. It applies acceleration and steering, then gets new sensor information and map information, and continues. If it's our own vehicle, you also have a router — an explicit intent generator — which says the passenger wants to go over there, so try to make a right turn now. This is an agent, whether in simulation or in the real world.
This is an end-to-end agent. End-to-end learning is popular, and to a first approximation, if you learn a good policy this way, you can apply it and get very believable agent reactions.
I'll tell you about work we did in this direction. We put a paper on arXiv about a month ago. We took 60 hours of footage of driving and tried to see how well we can imitate it using a deep neural network.
One option is to do exactly the same end-to-end policy, but we wanted to make the task easier. We have a good perception system at Waymo, so why not use its products for the agent? That simplifies the input representation. Controllers are well understood — we can use an existing controller, so no need to worry about acceleration and arcs. We can generate trajectories.
To understand the representation in a little more detail: we have our agent vehicle at the center and render an image with it there. We can augment it with some rotation to avoid over-biasing the orientation. It's an 80-by-80 box, so we see roughly 60 meters in front and 40 meters to the side. We render a road map in this box — which lanes you're allowed to drive on, traffic lights, and at intersections, which lanes are permitted or not permitted by the traffic lights. We render speed limits, the objects from the perception system, the current vehicle position, the pose history for the last few steps, and the intent — where you want to go. Conditioned on this intent and this input, you want to predict the future waypoints for this vehicle. That's the task, and you can frame it as a supervised learning problem — learn a policy with this network that approximates what you've seen in the world with 60 hours of data.
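As an illustration that is not from the lecture, a top-down input of this kind can be sketched by rasterizing points in the agent's frame into a multi-channel grid. The sketch assumes the 80-by-80 figure is in meters (consistent with 60 meters ahead and 40 meters to each side), uses a made-up resolution of one meter per pixel, and keeps only three of the channels described above; the function names and example points are hypothetical.

```python
import numpy as np

SIZE_M = 80.0      # side of the rendered box (assumed to be meters)
RES = 1.0          # meters per pixel: an assumption for this sketch
AHEAD_M = 60.0     # visible distance in front of the agent
SIDE_M = 40.0      # visible distance to each side
N = int(SIZE_M / RES)


def to_pixel(x_forward, y_left):
    """Map agent-frame meters to (row, col), or None if outside the box."""
    row = int(np.floor((AHEAD_M - x_forward) / RES))
    col = int(np.floor((SIDE_M - y_left) / RES))
    if 0 <= row < N and 0 <= col < N:
        return row, col
    return None


def render(road_points, object_points, pose_history):
    """Rasterize three point sets into channels 0 (road), 1 (objects), 2 (past poses)."""
    image = np.zeros((3, N, N), dtype=np.float32)
    for channel, points in enumerate((road_points, object_points, pose_history)):
        for x, y in points:
            pixel = to_pixel(x, y)
            if pixel is not None:
                image[channel][pixel] = 1.0
    return image


road = [(float(x), 0.0) for x in range(0, 60)]     # lane centerline ahead
objects = [(20.0, 3.0)]                            # one object ahead, to the left
history = [(0.0, 0.0), (-1.0, 0.0), (-2.0, 0.0)]   # current and past poses
image = render(road, objects, history)
print(image.shape)
```

Stacking everything into one image-like tensor is what lets a standard convolutional network consume the map, the perceived objects, and the agent's own history together.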
Of course, learning agents has a well-known problem, identified in the DAgger paper by Stéphane Ross (who is actually at Waymo now) and Andrew Bagnell: small errors compound over time. Even if you make a relatively good estimate at each step, if you string ten steps together you can end up somewhere very different from where the demonstrated agents have been before. There are techniques to handle this. One thing we did is synthesize perturbations: you take a trajectory, synthesize a deviation from it, and force the vehicle to learn to come back to the middle of the lane.
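As an illustration that is not from the lecture, one way to synthesize such a deviation is to push the middle of a logged trajectory sideways with a smooth bump that vanishes at both ends, so the perturbed path leaves the lane center and then returns to it. The sine-squared bump, the function name, and the offset range are assumptions for this sketch; the paper's own perturbation procedure may differ.

```python
import numpy as np


def perturb_trajectory(traj, max_offset_m, rng):
    """Shift a trajectory laterally by a random offset that fades to zero at the ends.

    `traj` is an (n, 2) array of x, y positions with n >= 2 and no repeated points.
    """
    traj = np.asarray(traj, dtype=float)
    n = len(traj)
    # Unit tangents from finite differences, then rotate 90 degrees for normals.
    tangents = np.gradient(traj, axis=0)
    tangents /= np.linalg.norm(tangents, axis=1, keepdims=True)
    normals = np.stack([-tangents[:, 1], tangents[:, 0]], axis=1)
    # Smooth weight: zero at the first and last point, one in the middle.
    bump = np.sin(np.linspace(0.0, np.pi, n)) ** 2
    offset = rng.uniform(-max_offset_m, max_offset_m)
    return traj + offset * bump[:, None] * normals


straight = np.stack([np.linspace(0.0, 20.0, 21), np.zeros(21)], axis=1)
perturbed = perturb_trajectory(straight, max_offset_m=0.5, rng=np.random.default_rng(0))
print(np.abs(perturbed[:, 1]).max())
```

Training on pairs like this exposes the model to states slightly off the demonstrated path together with a target that recovers from them.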
But direct imitation-based supervision on its own was not enough: when the agent tried to pass a parked vehicle in the street, it stopped and never continued. When we added perturbations, it kind of ran through the vehicle. So that's not enough either.
In addition to having this agent RNN — which takes the past, creates memory of its past decisions, and keeps iterating predicting multiple points in the future — we also learn about collisions and staying on the road. We augmented the network to also predict a mask for the road. Now we have a road mask loss: if you generate motions that take you outside the road, that's probably not good. And a collision loss: we take the other objects and predict their motions, predict our own motion, and try to make sure there are no collisions and that we stay on the road. You add this structural loss that adds a lot more constraints to the system as it trains, so it's not just limited to what it's explicitly seen — it allows it to reason about things it has not explicitly seen.
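As an illustration that is not from the lecture, these two extra losses can be sketched as overlap penalties on top-down grids: the predicted agent occupancy is penalized wherever it falls off the road or on top of other objects. The grid sizes, weights, and function names are assumptions for this sketch, and the exact loss definitions in the paper may differ.

```python
import numpy as np


def road_mask_loss(agent_occupancy, road_mask):
    """Mean predicted occupancy that lies outside the drivable area."""
    return float(np.mean(agent_occupancy * (1.0 - road_mask)))


def collision_loss(agent_occupancy, others_occupancy):
    """Mean overlap between the agent's predicted occupancy and other objects."""
    return float(np.mean(agent_occupancy * others_occupancy))


def total_loss(imitation_loss, agent_occupancy, road_mask, others_occupancy,
               w_road=1.0, w_collision=1.0):
    """Imitation loss plus the two weighted structural penalties."""
    return (imitation_loss
            + w_road * road_mask_loss(agent_occupancy, road_mask)
            + w_collision * collision_loss(agent_occupancy, others_occupancy))


# Toy 4x4 scene: column 0 is off-road, another object sits in cell (1, 1).
road = np.ones((4, 4))
road[:, 0] = 0.0
others = np.zeros((4, 4))
others[1, 1] = 1.0

agent_hits_object = np.zeros((4, 4))
agent_hits_object[1, 1] = 1.0
agent_off_road = np.zeros((4, 4))
agent_off_road[1, 0] = 1.0

print(collision_loss(agent_hits_object, others), road_mask_loss(agent_off_road, road))
```

Because these penalties depend only on the predicted motion and the scene, they apply even in situations that have no matching demonstration in the training data.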
Here's an example of us driving with this network. You can see we're predicting the future with the yellow boxes and driving safely through intersections and complex scenarios. It handles a lot of scenarios very well. If you're interested, I welcome you to go read the paper. It handles most simple situations fine. The passing-a-parked-car scenario — one of the earlier approaches stops every time, another hits the car — this one actually handles it fine. And beyond that, it can stop at a stop sign, shown by the red line, and does all of these operations.
We took the system trained on imitation data and actually drove our real Waymo car with it. We took it to the Castle Air Force Base staging grounds, and here it is driving a road it's never seen before and stopping at stop signs.
But it has some issues. Here it is driving, and then it was driving too fast — because our range is limited, it didn't know it had to make a turn and it drove off the road. Another time, yellow is what we did in the real world and green is what we do in simulation — we're trying to execute a complex maneuver, a U-turn, and we're sitting there and we don't quite do it, though at least we end up in the driveway. And in really complex interactive situations, this network also does not do too well.
What does that tell us? The long tail came again in testing. You can learn the policy for a lot of the common situations, but in testing, some of the things you really care about are the long tail — the corner cases, the scenarios where someone is adversarial, where something unusual is happening.
The Long Tail in Agent Modeling
Drago Anguelov: One way to think of it: there is a distribution of human behavior across multiple axes — aggressive versus conservative, expert versus inexperienced, and so on. Our end-to-end model is a fairly general approximation — in theory it can learn any policy if it sees everything it needs to know about the environment. But it's complex. The input is images that are 80-by-80 with multiple channels — a large input space. The model can have tens of millions of parameters. If you have a case where you have two or three examples in your whole 60 hours of driving, there's no guarantee that your ten-million-parameter model will learn it well. It's really good when you have a lot of examples, but then you have the long tail.
What do you do? You can improve the representation and improve the model. There is a lot of room to keep evolving this, and that direction will keep expanding. There are a lot of interesting questions about how to do that, and we're working on many of them.
Something else you can do — similar to what I said about the hybrid system — is use a simpler, biased, expert-designed input distribution that is much easier to learn with few examples. You can also use expert-designed models. In this case you still produce something reasonable by inputting human knowledge, and you could have many models — not just one — tuned to various aspects of this distribution. You can have little models for all the aspects you care about and mix and match them.
Trajectory Optimization Agents
Drago Anguelov: Let me tell you about one such model: the trajectory optimization agent. We take inspiration from motion control theory and want to plan a good trajectory for the agent vehicle that satisfies a bunch of constraints and preferences.
One insight is that we already know what the agent did in the environment last time, so you have a fairly strong idea about the intent, and that helps when you specify the preferences. You can say: give me a trajectory that minimizes some set of costs, which are preferences on the trajectory — typically called potentials. At different parts of the trajectory you can add an attractor potential: try to go where you used to be before. And you can have repeller potentials: don't hit things, don't run into other vehicles. To first approximation, that's roughly what it looks like.
Where is the learning? There is still a machine learning model. These potentials have parameters (the steepness of the curve, sometimes multi-dimensional), typically a few dozen or fewer. You can learn them using a technique called inverse reinforcement learning: find the parameters that produce trajectories close to the ones you've observed in the real world. If you pick a bunch of trajectories that represent a certain type of behavior you want to model, you tune the parameters to reproduce that behavior and then generate reasonable, continuous, feasible trajectories that match it. You can solve this optimization and tune these agents.
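To make the attractor and repeller potentials concrete, here is a minimal sketch in plain Python. It is not Waymo's implementation: the quadratic form of the potentials, the smoothness term, the finite-difference gradient descent, and every name (`trajectory_cost`, `optimize`, `w_attract`, `w_repel`, `radius`) are assumptions made for illustration. The weights are set by hand here; in the approach described in the talk they would be fitted to observed trajectories with inverse reinforcement learning, which this sketch does not attempt.

```python
import math

def trajectory_cost(traj, logged, obstacles, w_attract, w_repel,
                    radius=1.5, w_smooth=0.1):
    """Total cost of a trajectory given as a list of [x, y] waypoints."""
    cost = 0.0
    for (x, y), (lx, ly) in zip(traj, logged):
        # Attractor potential: stay close to where the agent was in the log.
        cost += w_attract * ((x - lx) ** 2 + (y - ly) ** 2)
        # Repeller potential: penalize waypoints inside an obstacle's radius.
        for ox, oy in obstacles:
            d = math.hypot(x - ox, y - oy)
            if d < radius:
                cost += w_repel * (radius - d) ** 2
    # Smoothness term (an added assumption): penalize sharp changes in motion.
    for i in range(1, len(traj) - 1):
        for k in (0, 1):
            accel = traj[i + 1][k] - 2 * traj[i][k] + traj[i - 1][k]
            cost += w_smooth * accel ** 2
    return cost

def optimize(logged, obstacles, w_attract, w_repel,
             steps=300, lr=0.05, eps=1e-5):
    """Gradient descent on the cost, using finite-difference gradients."""
    traj = [list(p) for p in logged]  # start from the logged trajectory
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in traj]
        for i in range(len(traj)):
            for k in (0, 1):
                original = traj[i][k]
                traj[i][k] = original + eps
                up = trajectory_cost(traj, logged, obstacles, w_attract, w_repel)
                traj[i][k] = original - eps
                down = trajectory_cost(traj, logged, obstacles, w_attract, w_repel)
                traj[i][k] = original
                grad[i][k] = (up - down) / (2 * eps)
        for i in range(len(traj)):
            for k in (0, 1):
                traj[i][k] -= lr * grad[i][k]
    return traj

def min_clearance(traj, obstacles):
    """Smallest distance between any waypoint and any obstacle."""
    return min(math.hypot(x - ox, y - oy)
               for x, y in traj for ox, oy in obstacles)

# Toy scene: the logged path runs along y = 0 and passes close to an obstacle.
logged = [(float(i), 0.0) for i in range(11)]
obstacles = [(5.0, 0.5)]
conservative = optimize(logged, obstacles, w_attract=1.0, w_repel=5.0)
aggressive = optimize(logged, obstacles, w_attract=1.0, w_repel=0.5)
print(min_clearance(conservative, obstacles), min_clearance(aggressive, obstacles))
```

Changing only the repeller weight yields two agents from the same log: the one with the larger weight keeps more distance from the obstacle, which is one simple way to get a conservative and an aggressive variant of the same behavior.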
Here's a complex interactive scenario. On the left is the conservative driver, on the right is the aggressive driver, and blue is the agent, red is our vehicle being tested in simulation. The aggressive guy went in and passed us, pushing us further into that lane. In the other case, with the conservative driver, we are in front of them, they're not pressuring us, and we execute a much cleaner switch into the right lane where we want to go.
Now you have different scenarios depending on what agent you put in. I'll show you a little more. We can do things like merging from one side of the highway to the other. This type of agent can generate fairly reasonable behaviors: it slows down for a slow vehicle in front, lets vehicles on the side pass, and still completes the mission. You can generate multiple futures with this agent: the aggressive guy finds a gap between two vehicles and just goes for it, while the conservative person waits. You can test your stack this way.
One more: an aggressive motorcycle driving, weaving in the lane. You can have an agent that tests your reaction to that.
Takeaways on Testing and Agent Modeling
Drago Anguelov: What's my takeaway from this story about testing in the long tail? You need a menagerie of agents. Learning from demonstration is key: you can encode some simple models by hand, but ultimately the task of modeling agent behavior is complex, and it's much better learned.
Here's the space of models: you can have no learning and just replay the log. You can have designed trajectories for agents — for this reaction, do this; for that reaction, do that. Then you can have the brake-and-swerve model. Then trajectory optimization, which I just showed. Then our mid-level model. And potentially an end-to-end top-down model — top-down meaning you have a top view of the environment. There are many other representations possible. This is a very interesting space. These models have different utility and require different numbers of examples to train.
Smart agents are critical for testing at scale. This is something I truly believe working in this space, and this direction is exciting. There is still a lot of interesting progress to be made. Accurate models of the behavior of human drivers and pedestrians help achieve several things. First, you will make better decisions when you drive yourself, because you'll be better able to anticipate what others will do. Second, you can develop a robust simulation environment with those insights. Third, our vehicle is also one more agent in the environment. It's an agent we have more control over than the others, but a lot of these insights apply to it as well.
Scaling to New Cities
Drago Anguelov: I wanted to finish the talk with a mental exercise. When you think of a system tackling a complex AI challenge like self-driving, what are the good properties of the system to have, and how do you think about a scalable system?
We want to grow and handle more environments, more cities. How do you scale to dozens or hundreds of cities? As we talked about, the long tail means each new environment can bring new challenges — complex intersections, cities like Paris, Lombard Street in San Francisco, narrow streets in European towns. In Pittsburgh, people drive the famous Pittsburgh left — they take different precedence than usual. The local customs of driving and behaving all need to be accounted for as you expand. This makes the system potentially harder to tune to all environments, but it's important because ultimately that's the only way you can scale.
What should the scalable process do? Let's say you have a very good self-driving system. This very much parallels the factory analogy. You take your vehicles, put a bunch of Waymo cars in a new environment, and drive a long time in that environment with drivers — maybe 30 days, maybe more. You collect all the data. Then your system should be able to improve a lot on the data you've collected. You want to train it actively on the data you've collected in that environment.
It's very important for a system to be able to quantify or elicit from itself whether it is incorrect or not confident, because then you can take action. This is an important property that I think people should think about when they design systems. You can ask questions to raters — that's fairly typical active learning, usually based on some amount of low confidence or surprise. Those are the examples you want to send.
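As a rough illustration of confidence-based active learning, the sketch below ranks examples by the entropy of the model's predicted class probabilities and sends the most uncertain ones to raters. Entropy is only one possible uncertainty score, and the function names, the example ids, and the probabilities are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector; higher means less confident."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` examples the model is least sure about.

    `predictions` maps an example id to the model's class probabilities.
    """
    ranked = sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)
    return ranked[:budget]

# Hypothetical model outputs for three logged examples.
predictions = {
    "clip_a": [0.98, 0.01, 0.01],  # confident
    "clip_b": [0.40, 0.35, 0.25],  # very uncertain
    "clip_c": [0.70, 0.20, 0.10],  # somewhat uncertain
}
to_label = select_for_labeling(predictions, budget=2)
print(to_label)  # ['clip_b', 'clip_c']
```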
Even better, the system could potentially directly update itself. This is an interesting question: how do systems update themselves in light of new knowledge? One way is to check and enforce consistency of beliefs and look for explanations of the world that are consistent. If you have a mechanism in the system that can do this, it allows the system to improve itself without necessarily being fed purely labeled data — it can improve from just collected data. I think it's interesting to think about systems where you can do reasoning and the representations that these models need to have.
Last but not least, we need scalable training and testing infrastructure. This is part of the factory I was talking about. I'm very lucky at Waymo to have wonderful infrastructure, and it allows this virtuous cycle to happen. Thank you.
Q&A
Lex Fridman: Thank you so much for the talk, really appreciate it. So if you were to train off of synthetic image and LiDAR data, would you weight the synthetic data differently than real-world data when training your models?
Drago Anguelov: There's actually a lot of interesting research in the field. There are people who train on simulator data but also train adaptation models that make simulator data look like real data. You're essentially trying to build consistency: you learn a mapping from simulator scenes to real scenes, and you could potentially train on simulator data that has been transformed by those other models. There are many ways to do this. Ultimately, achieving realism in the simulator is an open research problem.
Lex Fridman: I assume there are a lot of rules you have to put into a system to be able to trust it. How do you find the balance between automatic models — where you're not quite sure what they'll do — and rules, where you know what it does but it's not scalable?
Drago Anguelov: Through lots and lots of testing and analysis. You keep tracking the performance of your models and see where they fall short. Those are the areas where you most need expert-designed components to complement them. But the balance can change over time; it's a natural process of evolution. Generally, as ML grows and the capabilities and datasets grow, the balance shifts.
Audience member: You stressed at the end of both halves of your talk the importance of quantifying uncertainty in the predictions your models are making. Have you developed techniques for doing that with neural nets, or are you using probabilistic graphical models?
Drago Anguelov: There are many ways to capture this in neural nets. I'll give a general answer without commenting specifically on what Waymo is doing. There are techniques that let neural nets predict their own uncertainty fairly well: directly regressing uncertainty for certain outputs, using ensembles of networks, or using dropout and similar techniques that also provide a measure of uncertainty. Another way to get at uncertainty is to leverage constraints in the environment. If you have temporal sequences, you don't want objects to appear or disappear, or generally unreasonable changes in the environment. Inconsistent predictions in your models are good places to look.
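Two of the ideas in this answer can be sketched in a few lines of Python. The first function treats the spread of predictions from several networks (or several dropout passes) as an uncertainty estimate. The second flags tracked objects that vanish and then reappear across frames, a simple temporal inconsistency. Both are generic illustrations with hypothetical names and data, not a description of what Waymo does.

```python
from statistics import mean, pstdev

def ensemble_estimate(predictions):
    """Combine predictions of the same quantity from several networks.

    Returns (mean, spread); a larger spread means the networks disagree more.
    """
    return mean(predictions), pstdev(predictions)

def flickering_tracks(frames):
    """Return track ids that disappear and later reappear.

    `frames` is a list of sets, each holding the track ids detected in one frame.
    """
    seen, gone, flagged = set(), set(), set()
    for ids in frames:
        flagged |= ids & gone   # present now, but was missing earlier
        gone |= seen - ids      # seen before, missing in this frame
        seen |= ids
    return flagged

# Hypothetical distance estimates (meters) from three networks.
agree_mean, agree_spread = ensemble_estimate([10.0, 10.2, 9.8])
_, disagree_spread = ensemble_estimate([6.0, 10.0, 14.0])

# Track 2 is missing in the second frame and comes back in the third.
suspicious = flickering_tracks([{1, 2}, {1}, {1, 2}, {1, 3}])
print(agree_spread < disagree_spread, suspicious)  # True {2}
```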
Audience member: Do you train and deploy different models depending on where the car is driving — what city — or do you train and deploy a single model that adapts to most scenarios?
Drago Anguelov: Ideally you would have a single model that adapts to most scenarios, and then you complement it where it does not.
Audience member: First off, thanks for your talk. I find the simulator work really exciting. I was wondering if you could talk more about simulating pedestrians, because as a pedestrian myself I feel like my behavior is a lot less constrained than a vehicle's. And I imagine you're sensing from a vehicle, so you know your sensors are from a first-person vehicle perspective, but not from a pedestrian's perspective.
Drago Anguelov: That's correct. If you want to simulate pedestrians far away in an environment at very high resolution, and you've collected log data, you may not have detailed data on that pedestrian. At the same time, the subtle cues for that pedestrian matter less at that distance as well, because it's not like you observed them or reacted to them in the first place. There is an interesting question about at what fidelity you need to simulate things. There are levels of realism in simulation that at some level need to parallel what your models are paying attention to.
Audience member: Thank you for the talk, it was very interesting. Since you titled your talk around the long tail, it makes me wonder: is the bulk of the problem solved? Do you think we'll have this figured out within the next couple of years and there can be self-driving cars everywhere, or do you think it could be decades before we've really worked out everything necessary?
Drago Anguelov: That's a good question. It's a bit hard to give a prognosis. One thing I would say is it will take a while for self-driving cars to roll out at scale. This is not a technology that you just crank and it appears everywhere. There are logistics and algorithms and all this tuning and testing needed to make sure it's really safe in the various environments. So it will take some time.
Audience member: When you were talking about prediction, you mentioned looking at context and saying that if a person is looking at us, we can assume they will behave differently than if they're not paying attention. Is that something you're actively doing — taking into consideration whether pedestrians or other traffic participants are paying attention to your vehicle?
Drago Anguelov: I can't comment on our model designs too much, but I think these are generally cues one needs to pay attention to. They're very significant. Even when people drive, there's someone sitting in the vehicle next to you waving "keep going" — these natural interactions in the environment are something you need to think about.
Audience member: Thank you, it's a really cool talk. In one of your last slides you talked about resolving certain uncertainties by establishing a set of beliefs and checking to see if they were consistent. I feel that the concept of reasoning is underexplored in deep learning. If you read Kahneman — Type 1, Type 2 reasoning — we're really good at the instinctive mapping type of tasks, so likely some of the lower to maybe higher-level perception. But the reasoning part with neural networks is a bit less explored. I think long-term it's fruitful. Could you elaborate on that concept in connection with the models you're working with?
Drago Anguelov: To give an example from current work: there's a lot of work on weakly supervised learning, which has been a big topic in 2018, with a lot of really strong papers, including from Google Brain and other teams. If you have read the books on 3D reconstruction and geometry, there are a bunch of rules you can encode as geometric expectations about the world. When you have video and 3D outputs in your models, there is a certain amount of consistency. One example is ego motion versus depth estimation: there is a very strong constraint that if you predict the depth and the ego motion correctly, then you can reproject certain things and they will look good. That's a very strong consistency constraint about the expected environment, and it can help train your model. More of this type of reasoning may be interesting.
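The depth and ego-motion constraint can be illustrated with a pinhole camera model. The sketch below back-projects a pixel using a predicted depth, applies a predicted ego motion, and projects the point into the next frame; the gap between that prediction and where the point is actually observed serves as the consistency error. Real systems typically compare image appearance after warping rather than a known pixel match, so the correspondence here is a stand-in, and the intrinsics, the yaw-only rotation, and all names are assumptions for illustration.

```python
import math

def reproject(u, v, depth, focal, cx, cy, yaw, translation):
    """Predict where a pixel from frame t lands in frame t+1.

    (yaw, translation) describe how point coordinates change between the two
    camera frames: p_next = R_y(yaw) * p + translation.
    """
    # Back-project the pixel to a 3D point using the predicted depth.
    x = (u - cx) / focal * depth
    y = (v - cy) / focal * depth
    z = depth
    # Apply the predicted ego motion (rotation about the vertical axis).
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    x2 = c * x + s * z + tx
    y2 = y + ty
    z2 = -s * x + c * z + tz
    # Project back into the image with the same intrinsics.
    return focal * x2 / z2 + cx, focal * y2 / z2 + cy

def consistency_error(pixel_t, pixel_next, depth, focal, cx, cy, yaw, translation):
    """Pixel distance between the reprojected point and its observed position."""
    u2, v2 = reproject(pixel_t[0], pixel_t[1], depth, focal, cx, cy, yaw, translation)
    return math.hypot(u2 - pixel_next[0], v2 - pixel_next[1])

# Toy setup: a point 10 m ahead and 2 m to the right; the car drives 1 m forward,
# so points move by (0, 0, -1) in camera coordinates.
focal, cx, cy = 500.0, 320.0, 240.0
pixel_t = (420.0, 240.0)
pixel_next = (cx + focal * 2.0 / 9.0, 240.0)  # where the point truly appears next
motion = (0.0, 0.0, -1.0)

error_correct_depth = consistency_error(pixel_t, pixel_next, 10.0, focal, cx, cy, 0.0, motion)
error_wrong_depth = consistency_error(pixel_t, pixel_next, 5.0, focal, cx, cy, 0.0, motion)
print(error_correct_depth, error_wrong_depth)
```

With the correct depth the reprojection lands on the observed pixel, while a wrong depth leaves a visible gap. That gap is the kind of signal that can train a depth or ego-motion network without labels.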
Audience member: You mentioned expert-designed algorithms. From your perspective, how important are non-machine-learning approaches to tackling the challenges of autonomous driving? How important is expert design outside of the field of machine learning?
Drago Anguelov: Generally, you want to be safe in the environment, and that means you don't want to make errors in perception, prediction, and planning. The state of machine learning is not at the point where it never makes errors, given the scope we're currently addressing. So throughout, starting with the current state of machine learning, it needs to be complemented. We've carefully done that. As machine learning improves, there'll be less and less need to do it. It's somewhat effort-intensive to have a hybrid system, especially in an evolving system. But right now, I think this is the main thing that keeps you able to do complex behaviors in cases where it's very hard to collect data and you still need to handle them. The way I view it as a machine learning person: I like doing better and better. We're not religious about it — it should not be about ideology. We just need to solve the problem, and right now the right mix is a hybrid system. That's my belief.