Sacha Arnoud, Director of Engineering, Waymo - MIT Self-Driving Cars | Lex Fridman Transcript

Polished transcript · Lex Fridman · 16 Feb 2018 · 1h 13m · @martymcfly

Waymo's Director of Engineering explains how self-driving cars work at MIT

Sacha Arnoud, Director of Engineering and Head of Perception at Waymo, delivers a lecture at MIT on the technology, history, and engineering challenges behind autonomous vehicles.

Summary

Sacha Arnoud presents to an MIT audience on Waymo's journey from a 2009 Google project to a fully driverless commercial operation, covering the company's milestone of removing safety drivers from vehicles in Phoenix, Arizona. He traces the parallel history of deep learning at Google — including early breakthroughs in street number detection from Street View imagery — and shows how those techniques directly informed Waymo's onboard perception systems. The lecture's central argument is that getting a self-driving system to 90% capability takes only 10% of the effort, and the remaining 90% of the work involves scaling sensors, teams, labeling operations, simulation infrastructure, and testing programs to an industrial level. Arnoud details specific deep learning techniques used in the car — including convolutional networks, embeddings, and recurrent neural networks — and explains how Waymo runs the equivalent of 2.5 billion simulated miles per year to validate its software.

Key Takeaways

  • Removing the safety driver was a landmark milestone. In November 2017, Waymo became confident enough in its system to operate fully driverless vehicles in Chandler, Arizona — a threshold that requires an exceptionally high bar of validated safety, since there is no human fallback.
  • Deep learning at Google traces back to Street View. The first production deep learning system across Alphabet was a street number detector built from Street View imagery, deployed as early as 2012. This work directly seeded the perception techniques now used in Waymo's cars, illustrating how consumer mapping products and autonomous driving share a common research lineage.
  • The "90/10 rule" defines the real challenge of autonomous driving. Arnoud argues that when you feel 90% done, you still have 90% of the work ahead, which requires a 10x improvement in sensors, team size, data quality, and testing infrastructure. This recasts self-driving as primarily an industrial engineering problem, not just an algorithmic one.
  • Sensors are deliberately complementary, not redundant. Cameras provide dense semantic information but struggle with depth; lidar provides excellent depth but lacks semantic richness; radar adds further coverage. Waymo designs its sensor suite so that different sensors make different mistakes, providing systemic robustness against any single failure mode — including adversarial or reflective interference.
  • Perception must go far beyond object detection. Safely navigating a scene with a pulled-over police car, a cyclist, traffic cones, and an open car door requires understanding flashing lights, parked versus moving vehicles, implied behavior predictions, and the likely trajectories of all agents — a depth of semantic understanding that goes well beyond collision avoidance.
  • Labeling at scale is as important as the algorithms themselves. Many of Waymo's core deep learning models are supervised, requiring millions to billions of labeled examples. Waymo uses active learning and human-in-the-loop correction pipelines to make labeling tractable, but the scale of the operation — comparable to ImageNet and beyond — is itself a major engineering challenge.
  • Simulation multiplies real-world miles by roughly three orders of magnitude. Waymo has driven four million real miles total, but runs 2.5 billion simulated miles per year using Google's data center infrastructure, including a tool called CarCraft that lets engineers modify the parameters of real recorded drives to generate new test scenarios.
  • A physical test facility handles rare long-tail scenarios. On a 90-acre former Air Force Base in California, Waymo recreates city environments with traffic lights and railroad crossings to test specific edge cases — such as a car reversing unexpectedly or objects being dropped in the road — that may not appear frequently enough in real-world driving to train against naturally.

Full Transcript

    Introduction and Waymo's Mission

    Sacha Arnoud: Thanks a lot, Lex, for the introduction. It's a pretty packed house — thanks a lot. I'm really excited, and thanks for giving me the opportunity to come and share my passion for self-driving cars and share with you all the great work we've been doing at Waymo over the last ten years, and give you more details on the recent milestones we've reached.

    As you'll see, we'll cover a lot of different topics — some more technical, some more about context. But whatever the content, I have three main objectives I'd like to convey today, so keep that in mind as we go through the presentation.

    My first objective is to give you some background around the self-driving space — what's happening there, what it takes to build self-driving cars — but also to give you some behind-the-scenes views and tidbits on the history of machine learning and deep learning, and how it all came together within the broader Alphabet family, from Google to Waymo. Another objective is to give you some technical substance around the techniques that are working today on our self-driving cars. During this class you've heard a lot about different deep learning techniques, models, architectures, and algorithms, and I want to put those in a coherent whole so you can see how those pieces fit together to build the system we have today. And as Lex mentioned, it takes a lot more than algorithms to build a sophisticated system like our self-driving cars. It fundamentally takes a full industrial project to make that happen, and I'll try to give you some color — hopefully different from what you've heard during the week — on what it takes to actually execute such an industrial project in real life and productionize machine learning.

    The Self-Driving Opportunity

    We hear a lot about self-driving cars. It's a very hot topic, and for very good reasons. I can tell you that 2017 has been a great year for Waymo. Only a year ago, in January 2017, Waymo became its own company. That was a major milestone and a testament to the robustness of the technology, showing that we could move to a productization phase. What you see in the picture here is our latest-generation self-driving vehicle, based on the Chrysler Pacifica. You can already see a bunch of sensors; I'll come back to those and give you more insight into what they do and how they operate.

    Self-driving draws a lot of attention, and for very good reason. I personally believe — and I think you'll agree — that self-driving really has the potential to deeply change the way we think about mobility and the way we move people and things around. To cover just a few aspects: safety is one of the main motivations. 94% of US crashes today involve human error, and a lot of those errors are around distraction and things that could be avoided. Disability and access to mobility is also a big motivation. The self-driving technology has the potential to make mobility much more available and cheaper for more people. And last but not least is collective efficiency. Not only do we spend a lot of time in our cars during long commutes — I personally spend a lot of time in commute hours — but that time we spend in traffic could probably be better spent doing something else. Beyond traffic, the self-driving technology has the potential to deeply change the way we think about parking, urban environments, and city design. That's why it's such an exciting topic, and that's why our mission at Waymo is fundamentally to make it safe and easy to move people and things around.

    The Origins: Google's Chauffeur Project

    We've been on that mission for a very long time. The whole adventure started close to ten years ago in 2009, under the umbrella of a Google project you may have heard of called Chauffeur. Back in those days — remember, we were before the deep learning days, at least in the industry — the first objective of the project was to try to assemble a vehicle, take off-the-shelf sensors, put them together, and try to decide if self-driving was even a possibility. It's one thing to have a prototype somewhere, but is it worth pursuing? That's a very common way for Google to tackle problems.

    The genesis of that work was to come up with a pretty aggressive objective. The first milestone for the team was to assemble ten 100-mile loops in Northern California around Mountain View, a total of 1,000 miles, and see if they could build a first system that would be able to drive those loops autonomously. The team was not afraid. Those loops went through some very aggressive patterns. Some went through the Santa Cruz Mountains, which have very small roads, two-way traffic, cliffs, and negative obstacles. Some went on highways, including some of the busiest in the area. Some went around Lake Tahoe in the Sierras, where you can encounter different kinds of weather and road conditions. Some routes went over bridges, and the Bay Area has quite a few. Some even went through dense urban areas, including San Francisco and Monterey, which bring truly dense urban challenges.

    So here you can see some pictures of the driving — and it's kind of working. You can see the roads in the Santa Cruz Mountains, driving at night with animals crossing, freeway driving, going through Palo Alto, the famous Lombard Street in San Francisco with its fog, slopes, and sharp turns. That was all the way back in 2010. Those ten loops were successfully completed 100% autonomously back in 2010 — more than eight years ago.

    On the heels of that success, the team and Google decided that self-driving was worth pursuing and moved forward with the development of the technology and testing. We've been at it for all those years, working very hard. Historically, Waymo and other companies have been relying on what we call safety drivers — someone who still sits behind the wheel even if the car is driving autonomously, able to take over at any time. We've been committed to developing the system through many iterations across all those years.

    Removing the Safety Driver

    We reached a major milestone, as Lex mentioned, back in November, when for the first time we reached a level of confidence and maturity in the system at which we felt safe removing the safety driver. As you can imagine, that's a major milestone because it takes a very high level of confidence to not have that backup solution of a safety driver who can take over if something arises.

    Here I'm going to show you a quick capture of that event. The video is from one of the first times we did that. Since then, we've been continuously operating driverless cars in the Phoenix area in Arizona to expand our testing. You can see our Chrysler Pacifica, with members of the team acting as passengers and getting in the back seat. There is no driver in the driver's seat. The passenger simply presses a button, the application knows where they want to go, and the car goes, with no one in the driver's seat.

    We started with a fairly constrained geographical area in Chandler, close to Phoenix, Arizona, and we've been working hard to expand testing and the scope of our operating area since then. That goes well beyond a single car on a single day — we have a growing fleet of self-driving cars that we are deploying there, looking toward a product launch pretty soon.

    The 90/10 Rule: What It Really Takes

    I've talked about 2010, and we are in 2018, and we're getting there — but it took quite a bit of time. One of the key ideas I'd like to convey today, and that I'll come back to during this presentation, is how much work it takes to truly take a demo or something working in a lab into something you feel safe to put on the roads, and get all the way to that depth of understanding and depth of perfection in your technology that allows you to operate safely.

    One way to say that is: when you are 90% done, you still have 90% to go. The first 90% of the technology takes only 10% of the time. In other words, you need to 10x — you need to 10x the capabilities of your technology, 10x your team size and find ways for more engineers and researchers to collaborate together, 10x the capabilities of your sensors, and fundamentally 10x the overall quality of the system, your testing practices, and many other aspects of the program. That's what we've been working on.

    Deep Learning at Google: A Behind-the-Scenes View

    Beyond the context of self-driving cars, I want to spend a little time giving you an inside view of the rise of deep learning. As I mentioned, back in 2009 and 2010, deep learning was not yet fully available in the industry, and over those years it took a lot of breakthroughs to reach the stage we're at now.

    Google committed itself to machine learning and deep learning very early on. You may have heard of what we call internally the Google Brain team — a team fundamentally hard at work leading the bleeding edge of research, but also leading the development of tools and infrastructure for the whole machine learning ecosystem at Google, enabling many teams to develop machine learning at scale all the way to successful products. They've been pushing deep learning in many directions — from computer vision to speech understanding to NLP — and you can see the impact of deep learning in all those areas in Google products today, whether you're talking about Google Assistant, Google Photos, speech recognition, or even Google Maps.

    Many years ago, I myself was part of the Street View team, leading an internal project we called Street Smart. The goal of Street Smart was to use deep learning and machine learning techniques to analyze Street View imagery — a very large and varied corpus — so that we could extract elements that are core to our mapping strategy and build a better Google Maps.

    For instance, in a panorama from Street View imagery, there are a lot of pieces that, if you could find and properly localize them, would drastically help you build better maps. Street numbers are really useful for mapping addresses. Street names, combined with similar techniques from other views, help you properly draw all the routes and give them names. Those two combined allow you to do very high-quality address lookup, which is a common query on Google Maps. General text — and more specifically text on business facades — allows you to localize business listings to actual physical locations, or even build those local listings directly from scratch. And traffic-oriented patterns — traffic lights, traffic signs — can be used for ETA predictions and navigation.

    One of the hot problems was mapping addresses at scale. You can imagine the breakthrough when we were first able to properly find street numbers out of Street View imagery. Solving that problem actually requires a lot of pieces: not only do you need to find where the street number is on the facade — which is a fairly hard semantic problem, distinguishing a street number from another kind of number or other text — but you also need to read it, because there's no point having pixels if you can't understand the number on the facade, and then you need to properly geo-localize it so you can put it on Google Maps.

    That street number detector was the first deep learning application that succeeded in production, all the way back in 2012, and it was really the first breakthrough we had across Alphabet in our ability to properly understand real scenes. Here I'll show you a video that sums it up. Every one of those segments is a view starting from the car and going to the physical location of a house number we've been able to detect and transcribe. Here in São Paulo, when all that data is put together, it gives you a very consistent view of the addressing scheme. A similar example is in Paris, where we have more imagery: more views of those physical numbers that, when you triangulate, allow you to localize them very accurately and have very accurate maps. And the last example is in Cape Town, South Africa, where the impact of that deep learning work has been huge in terms of quality. Many countries today actually have more than 95% of addresses mapped that way.

    You can see a lot of parallelism between that work on Street View imagery and doing the same thing on the car in real time — but doing it on the car is even harder, because you need to do it in real time with low latency, and you also need to do it in an embedded system. The cars have to be entirely autonomous. You cannot rely on a connection to a Google data center — first because you don't have the latency budget to bring data back and forth, but also because you cannot rely on a connection for the safe operation of your system. All the processing has to happen within the car.

    There's a paper from 2014 where, for the first time, by using slightly different techniques, we were able to put deep learning to work inside that constrained real-time environment and start to have impact — in that case around pedestrian detection. As I said, there are a lot of analogies: to properly drive a scene, you need to see the traffic light, understand if it's red or green, detect cyclists moving through the scene, and handle night driving — unlike Street View, where you can choose ideal conditions, driving requires you to take conditions as they are. There has been a lot of cross-pollination between the Street View work and the work on the cars, and that collaboration between Google Research and Waymo has always been very strong and continues to enable us to stay on the bleeding edge.

    The Perception System: What the Car Needs to Understand

    Now I want to go into more detail on what's going on in the cars today and how deep learning is actually impacting our current system. During the week you've probably heard about the major pieces you need to master to make a self-driving car: mapping, localization — putting the car within those maps and understanding where you are with good accuracy — perception and scene understanding, which is a higher-level semantic understanding of what's going on in the scene, predicting what the agents around you are going to do so you can do better motion planning, and the whole robotics aspect — at the end of the day, the car in many ways acts like a robot, whether it's around sensor data or the control interfaces to the car. And for anyone who has worked with robotics, you'll agree that it's not a perfect world and you need to deal with errors. Other pieces include simulation and validation of whatever system you put together.

    For the next section I'm going to focus more on the perception piece, which is a core element of what the self-driving car needs to do.

    What is perception? Fundamentally, perception is a system in the car that needs to build an understanding of the world around it, and it does that using two major inputs. The first is prior knowledge about the scene. For instance, it would be a little silly to have to recompute the actual location of the road or the actual connectivity of every intersection each time you arrive at a scene, because those things you can pre-compute in advance and save your onboard computing for tasks that are more critical. That's often referred to as the mapping exercise, but really it's about reducing the computation you'll have to do on the car once it's driving. The other big input is what the sensors give you once you're on the spot — sensor data is the signal that tells you what is different from what you mapped: is the traffic light red or green, where are the pedestrians, where are the cars, what are they doing?

    Sensors: Complementary by Design

    As we saw in the initial picture, we have quite a set of sensors on our self-driving cars — vision systems, radar, and lidar are the three big families. One point to note is that they are designed to be complementary. They are complementary first in their placement on the car — we don't put them in the same spot because blind spots are a major issue and you want good coverage of the field of view. They are also complementary in their capabilities. Cameras, for instance, are very good at giving you a dense representation — a very dense set of information containing a lot of semantic detail — but they are not really good at giving you depth, or it's much harder and computationally expensive to get depth information out of camera systems. A lidar system, on the other hand, will give you very good depth estimation when it hits objects, but it's going to lack a lot of the semantic information you find in camera systems. All those sensors are designed to be complementary in terms of their capabilities.

    It goes without saying that the better your sensors are, the better your perception system is going to be. That's why at Waymo we took the path of designing our own sensors in-house and enhancing what's available off the shelf, because it's important to go all the way and build a self-driving system we can believe in.

    Deep Semantic Understanding: The Police Car Example

    So what does perception do? It takes those two inputs and builds a representation of the scene. The nature of that work is what really differentiates what you need to do in a safe driving system as opposed to a lower-level driving assistance system. In many cases, for speed control or lower-level driver assistance, a lot of the strategies can be around not bumping into things: if you see things moving around you, you segment them into blocks of moving things and you don't hit them, and that's good enough in most cases. When you don't have a driver in the driver's seat, the challenge totally changes scale.

    To give you an example: if you're in a lane and you see a bicyclist going slowly on the right side of your lane, and there's a car next to you, you need to understand that there's a chance that car is going to want to avoid the bicyclist and swerve, and you need to anticipate that behavior so you can decide whether to slow down and give space, or speed up and have the car go behind you. Those are the kinds of behaviors that go well beyond not bumping into things and require much deeper understanding of the world around you.

    Let me put it in a picture. Here is a typical scene we encountered: a police car pulled over, probably having pulled someone over; a cyclist on the road moving forward; and we need to drive through that situation. The first thing you have to do is the basics — out of your sensor data, understand that a set of point clouds and pixels belong to the cyclist, find that you have two cars on the scene, understand the policeman as a pedestrian. But you need to go deeper in your semantics. If you understand that the flashing lights are on, you understand that the police car is an active emergency vehicle performing something on the scene. If you understand that the other car is parked, that's a very important piece of information that tells you whether you can pass it or not. Something you may not have noticed is that there are cones on the scene that would prevent you from taking a certain path. Next level: if you understand that the police car has an open door, you can start to expect behavior where someone is going to get out of that car, and the way someone getting out of that car would impact the trajectory of the cyclist is something you need to understand in order to safely drive. Only when you have that depth of understanding can you start to come up with realistic behavior predictions and trajectory predictions for all those agents, so that you can come up with a proper strategy for planning and control.

    Handling Imperfect Sensor Data

    How is deep learning playing into that whole space? Remember when I said when you're 90% done you still have 90% to go — that applies here too. I also talked about how robotics and having sensors in real life is not a perfect world, and that's a big piece of the puzzle. I wish sensors would give us perfect data all the time, but unfortunately that's not how it works.

    Here for instance you see an example where you have a pickup truck with smoke coming out of the exhaust, and that exhaust is triggering laser points from the lidar — not very relevant for any behavior prediction or driving behavior. Those points are safe to ignore in terms of scene understanding. Filtering the data coming off your sensors is a very important task because it reduces the computation you're going to have to do to operate safely.

    A more subtle but important one is around reflections. We're driving a scene, there's a car here in the camera picture, and the car is reflected in a bus. If you just do naive detection, especially if the bus moves along with you, which is very typical, you can suddenly appear to have two cars on the scene, and if you take that reflected car too seriously, all the way to impacting your behavior, you're going to make mistakes. I showed you an example of reflections in the visual range, but obviously that affects all sensors in slightly different ways. You could have the same effect with lidar data: for instance, when you drive on a freeway and a road sign above the freeway reflects in the back window of the car in front of you, a reflected sign appears on the road. You'd better understand that the thing you see on the road is a reflection and not try to swerve around it at 65 miles per hour.

    A lot of the signal processing is actually already using machine learning and deep learning, because as you can see in the reflection case, at some point you're going to have to have a higher level of understanding of the scene to realize it's not possible that the car is hiding behind the bus given your field of view.

    Convolutional Networks and Sensor Projections

    Assuming you have filtered sensor data, the very next thing you typically want to do is apply some kind of convolutional layers on top of that imagery. If you're not familiar with convolutional layers, that's a very popular way to do computer vision because it relies on connecting neurons with kernels that run across the imagery and learn, layer after layer, features of the imagery. Those kernels typically work locally on a region of the image and can understand lines, contours, and as you build up layers, higher and higher levels of feature representations that ultimately tell you what's happening in the image. That's a very common technique and much more efficient than fully connected layers, for instance.
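
    Editor's sketch (not part of the talk): to make the kernel idea concrete, here is a minimal pure-Python 2D convolution. The function name conv2d, the toy image, and the hand-written edge kernel are illustrative assumptions; a real network learns its kernel weights and stacks many such layers.

```python
from typing import List

Grid = List[List[float]]


def conv2d(image: Grid, kernel: Grid) -> Grid:
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map.

    Strictly this is cross-correlation (the kernel is not flipped), which is
    what deep learning libraries conventionally call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out: Grid = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            # Weighted sum over the local region under the kernel.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Toy 4x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written vertical-edge kernel (an assumption; trained nets learn these).
edge_kernel = [[-1, 1],
               [-1, 1]]

response = conv2d(image, edge_kernel)
for row in response:
    print(row)
```

    The response is nonzero only where the window straddles the dark-to-bright boundary, which is the sense in which a small local kernel can pick out a line or contour.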

    Unfortunately, a lot of the state of the art is in 2D convolutions, which have been developed on imagery and typically require a fairly dense input. For imagery that's great, because pixels are very dense: every pixel has a neighbor right next to it. But if you were to apply plain convolutions on a very sparse laser point cloud, you would have a lot of holes, and the convolutions don't work nearly as well. So typically what we do is first project sensor data into 2D planes and do processing on those.

    Two very typical views we use: the first is a top-down bird's-eye view, which gives you a Google Maps kind of view of the scene — great for mapping cars and objects moving along the scene, though it's harder to incorporate camera imagery pixels into those top-down views. The other common one is the driver view — a projection onto the plane from the driver's perspective — which is much better at utilizing imagery because that's essentially how imagery was captured. If your sensors are properly registered, you can use both lidar and imagery signals together to better understand the scene.
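
    Editor's sketch (not part of the talk): one simple way to picture the top-down projection is to rasterize lidar points into a grid and keep the maximum height per cell. The axis convention, the grid ranges, the cell size, and the function name birds_eye_grid are all assumptions made for illustration, not Waymo's actual representation.

```python
from typing import List, Tuple

# Assumed convention: (x forward, y left, z up), all in meters.
Point = Tuple[float, float, float]


def birds_eye_grid(
    points: List[Point],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cell: float,
) -> List[List[float]]:
    """Rasterize a sparse point cloud into a dense top-down height map.

    Each cell stores the largest z seen in it. Here 0.0 also means
    "no lidar return", and negative heights are clipped to 0.0; a real
    system would track occupancy separately.
    """
    rows = int((x_range[1] - x_range[0]) / cell)
    cols = int((y_range[1] - y_range[0]) / cell)
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y, z in points:
        # Drop points outside the area covered by the grid.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        r = min(int((x - x_range[0]) / cell), rows - 1)
        c = min(int((y - y_range[0]) / cell), cols - 1)
        grid[r][c] = max(grid[r][c], z)
    return grid


# Four made-up lidar returns; the last one falls outside the grid.
points = [(0.5, -1.5, 0.2), (0.7, -1.2, 1.4), (2.3, 0.4, 0.9), (9.0, 0.0, 1.0)]
grid = birds_eye_grid(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell=1.0)
for row in grid:
    print(row)
```

    The result is a dense 2D array, so ordinary image-style convolutions can run on it even though the original points were sparse.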

    Segmentation and Object Detection Techniques

    The first kind of processing you can do is segmentation — once you have pixels or laser points, you need to group them together into objects that you can then use for better understanding and processing. Unfortunately, a lot of the objects you encounter while driving don't have a predefined shape. Snow, vegetation, trash bags — you can't come up with a prior understanding of how they're going to look, so you have to be ready for any shape.

    One technique that works pretty well is to build a smaller convolutional network that you slide across the projection of your sensor data — the sliding window approach. If you have a pixel-accurate snow detector that you slide across the image, you'll be able to build a representation of those patches of snow and navigate appropriately around them. That works pretty well, but as you can imagine it's a little expensive computationally — it's like a dot matrix printer that has to go point by point across the page. It works, but it's pretty slow, so you need to be very conscious about which areas of the scene you apply it to in order to stay efficient.
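
    Editor's sketch (not part of the talk): the sliding-window idea and its cost can be shown with a stand-in classifier. Here patch_score is a hypothetical placeholder for the small convolutional network, and the toy image, window size, stride, and threshold are assumptions chosen only to make the evaluation count visible.

```python
from typing import List, Tuple

Grid = List[List[float]]


def patch_score(patch: Grid) -> float:
    """Placeholder for a small conv net: mean brightness of the patch."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


def sliding_window(
    image: Grid, size: int, stride: int, threshold: float
) -> Tuple[List[Tuple[int, int]], int]:
    """Run the classifier on every window; return hit positions and call count."""
    hits = []
    evaluations = 0
    for r in range(0, len(image) - size + 1, stride):
        for c in range(0, len(image[0]) - size + 1, stride):
            patch = [row[c:c + size] for row in image[r:r + size]]
            evaluations += 1  # one full classifier run per window
            if patch_score(patch) >= threshold:
                hits.append((r, c))
    return hits, evaluations


# Toy 6x6 projection with a 2x2 "snow" patch at rows 2-3, columns 3-4.
image = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        image[r][c] = 1.0

hits, evaluations = sliding_window(image, size=2, stride=1, threshold=1.0)
coarse_hits, coarse_evaluations = sliding_window(image, size=2, stride=2, threshold=1.0)
print(hits, evaluations)
print(coarse_hits, coarse_evaluations)
```

    Even on this tiny image the dense scan needs 25 classifier runs to find one patch, while the coarser stride cuts that to 9 but misses the patch entirely. That is the trade-off behind being selective about where the window is applied.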

    Fortunately, many of the objects you need to care about have predefined shape priors. If you take a car from the bird's-eye view, it's going to be a rectangle. Driving lanes are going to go in similar directions. You can use those priors to do more efficient deep learning — in the literature this is known as single-shot multibox detection. Here you would start with convolutional towers but do only one pass of convolution — like the difference between a dot matrix printer and a printing press that prints a page at once. You train a deep net that directly takes the whole projection of sensor data and outputs boxes that encode the priors you have. Here for instance I can show you how such a thing would work for cone detection — we don't have all the fidelity of per-pixel cone detection, but we don't really care about that; we just need to know there is a cone somewhere. And since it's computationally much cheaper, you can run that across a pretty wide range of space and still be very efficient.
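
    Editor's sketch (not part of the talk): in a single-pass detector, the network emits one grid of predictions for the whole input, and boxes are read off that grid directly. The sketch below shows only that decoding step on a hand-filled output grid. The per-cell layout (confidence, dx, dy, width, height), the cell size, and the name decode_boxes are assumptions; published single-shot detectors also use several default boxes per cell, multiple feature maps, and non-maximum suppression, all omitted here.

```python
from typing import List, Tuple

# Assumed per-cell output: (confidence, dx, dy, width, height).
# dx and dy are offsets from the cell center, in units of one cell.
Prediction = Tuple[float, float, float, float, float]
# Decoded box: (center_x, center_y, width, height, confidence), in meters.
Box = Tuple[float, float, float, float, float]


def decode_boxes(output: List[List[Prediction]], cell: float, min_conf: float) -> List[Box]:
    """Turn one grid of per-cell predictions into a list of boxes."""
    boxes = []
    for r, row in enumerate(output):
        for c, (conf, dx, dy, w, h) in enumerate(row):
            if conf < min_conf:
                continue  # the network says there is nothing in this cell
            center_x = (c + 0.5 + dx) * cell
            center_y = (r + 0.5 + dy) * cell
            boxes.append((center_x, center_y, w, h, conf))
    return boxes


# Hand-filled stand-in for the network output on a 2x3 grid of 2 m cells:
# a single confident "cone" prediction in the bottom-right cell.
empty: Prediction = (0.0, 0.0, 0.0, 0.0, 0.0)
output = [
    [empty, empty, empty],
    [empty, empty, (0.9, 0.25, -0.25, 0.5, 0.5)],
]

boxes = decode_boxes(output, cell=2.0, min_conf=0.5)
print(boxes)
```

    The box is coarse compared with a per-pixel mask, but it comes from one pass over the input rather than one classifier run per window.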

    Embeddings and Emergency Vehicle Classification

    We talked about the flashing lights on top of the police car. Even if you properly detect and segment cars on the road, many cars have very special semantics. There are many examples of emergency vehicles that you need to visually understand — first, that it is an emergency vehicle, and second, whether it's active or not. School buses are not actually emergency vehicles, but whether the bus has lights on or has a stop sign open on the side carries heavy semantics that you need to understand.

    How do you deal with that? One thing you could do is take that patch, build a new convolutional tower, and put a classifier on top — build a school bus classifier, a school bus with lights on classifier, a school bus with stop sign open classifier. I'm pretty sure that would work pretty well, but it would be a lot of work and pretty expensive to run on the car, since convolutional layers are typically the most expensive pieces of a neural net.

    A better thing to do is to use embeddings. If you're not familiar with them, embeddings are vector representations of objects that you can learn with deep nets, and they carry some semantic meaning of those objects. Given a vehicle, you can build a vector that carries the information that the vehicle is a school bus, whether the lights are on, whether the stop sign is open — and then you're back in a vector space that's much smaller and much more efficient to operate in for further processing.
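
    Editor's sketch (not part of the talk): the efficiency argument is that the expensive convolutional work runs once to produce a vector, and each extra question about the vehicle becomes a cheap operation on that vector. The three-number embedding, the head weights, and the attribute names below are made-up values for illustration; real embeddings are learned and much larger.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def head(embedding: List[float], weights: List[float], bias: float) -> float:
    """A tiny linear classifier: one dot product plus a bias, squashed to (0, 1)."""
    return sigmoid(sum(e * w for e, w in zip(embedding, weights)) + bias)


# Hypothetical embedding, computed once per vehicle by the shared conv tower.
embedding = [0.9, 0.8, -0.7]

# Hypothetical per-attribute heads: (weights, bias). Each costs a few
# multiplications instead of a whole new convolutional tower.
HEADS: Dict[str, Tuple[List[float], float]] = {
    "is_school_bus": ([4.0, 0.0, 0.0], -2.0),
    "lights_on": ([0.0, 4.0, 0.0], -2.0),
    "stop_sign_open": ([0.0, 0.0, 4.0], -2.0),
}

attributes = {
    name: head(embedding, weights, bias) > 0.5
    for name, (weights, bias) in HEADS.items()
}
print(attributes)
```

    Adding a fourth attribute here means adding one more small head, not retraining or rerunning the convolutional layers.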

    Embeddings have historically been more closely associated with word embeddings. In a typical text, if you were able to build vectors out of every word in a piece of text, and then look at the sequence of those vectors and operate in the vector space, you start to understand the semantics of those sentences. One of the early projects you can look at is called Word2Vec, done in the Google Brain group, where they were able to build such things and discovered that the embedding space actually carried interesting vector space properties — for instance, if you took the vector for "king" minus the vector for "man" plus the vector for "woman," you ended up with a vector whose closest word would be "queen." That shows how those vector representations can be very powerful in the amount of information they can contain.
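
    Editor's sketch (not part of the talk): the "king minus man plus woman" arithmetic can be reproduced on toy vectors. The two-dimensional vectors below are hand-made, with one axis loosely standing for royalty and one for gender; they are not trained Word2Vec embeddings, and the helper names are assumptions.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a: str, b: str, c: str, vectors: Dict[str, List[float]]) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Hand-made toy vectors: [royalty, gender].
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "princess": [0.5, -1.0],
    "apple": [-1.0, 0.0],
}

nearest = analogy("king", "man", "woman", vectors)
print(nearest)
```

    With trained embeddings the same nearest-neighbor lookup runs over a vocabulary of many thousands of words in a few hundred dimensions, but the arithmetic is the same.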

    Pedestrian Detection and Behavior Prediction

    Let's talk about pedestrians. We talked about semantic image segmentation, the ability to go pixel by pixel for things that don't really have a shape, and we talked about using shape priors. But pedestrians actually combine the complexity of both approaches, for many reasons. They are deformable and come in many shapes and poses: someone on a skateboard crouching, or other more unusual poses that you need to understand. The recall you need to have on pedestrians is very high, and pedestrians show up in many different situations. For instance, when a pedestrian is getting out of a car, there's a good chance that person is going to step into the road, and you need to be ready for that. And predicting the behavior of pedestrians is really hard, because they can move in any direction. With a car moving in a given direction, you can safely bet it is not going to make a drastic change of angle at a moment's notice, but children are more complicated: they may not pay attention and may jump in any direction.

    So it's harder in terms of shape prior, harder in terms of recall, and harder in terms of prediction. You need a fine understanding of the semantics. Another example: you get to an intersection and you have a visually impaired person jaywalking — you obviously need to understand all of that to know that you need to yield to that person.

    Here's another example: there is something that really looks like a pedestrian — lying on the bed of a pickup truck. Obviously you shouldn't yield to that person, because yielding to a pedestrian at 35 miles per hour means hitting the brakes pretty hard, with the risk of causing an accident. You need to understand that that person is traveling with the truck and is not actually on the road, and it's okay not to yield to them. Those are examples of the rich range of semantics you need to understand.

    Recurrent Neural Networks and Temporal Understanding

    One way to do that is to start understanding the behavior of things over time. Everything we talked about up until now in how we use deep learning to solve some of these problems was on a pure frame basis. But understanding that a person is moving with the truck versus a jaywalker in the middle of an intersection — that kind of information you can get to if you observe behavior over time.

    Back to embeddings: if you have vector representations of those objects, you can start tracking them over time. A common technique to get there is to use recurrent neural networks — networks that build a state that gets better and better as they receive more sequential observations of a pattern. Coming back to the word example: you see one word, you see its vector representation; another word, the sentence starts to make sense; third word, fourth word — at the end of the sentence you have a good understanding and you can start to translate. It's a similar idea for driving: if you have a semantic representation encoded in an embedding for the pedestrian and the car, and you track that over time and build a state that gets more and more meaning as time goes by, you're going to get closer and closer to a good understanding of what's going on in the scene. Vector representations combined with recurrent neural networks is a common technique that can help you figure that out.
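
    As a rough sketch of that idea, here is a two-unit vanilla recurrent cell that folds a sequence of per-frame embeddings into a running state. The weights are hand-picked rather than learned, the toy encoding of the embeddings is an assumption, and nothing here reflects the models actually used in the car:

```python
import math

# Assumed, hand-picked weights for a 2-unit vanilla RNN cell. A real system
# would learn these from data.
W_IN = [[1.5, 0.0], [0.0, 1.5]]     # input embedding -> state
W_STATE = [[0.8, 0.0], [0.0, 0.8]]  # previous state -> state


def rnn_step(state, embedding):
    """One update: new_state = tanh(W_STATE @ state + W_IN @ embedding)."""
    new_state = []
    for i in range(len(state)):
        z = sum(W_STATE[i][j] * state[j] for j in range(len(state)))
        z += sum(W_IN[i][j] * embedding[j] for j in range(len(embedding)))
        new_state.append(math.tanh(z))
    return new_state


def run(sequence):
    """Feed the embeddings in order and record the state after each frame."""
    state = [0.0, 0.0]
    history = []
    for embedding in sequence:
        state = rnn_step(state, embedding)
        history.append(state)
    return history


# Hypothetical per-frame embeddings for one tracked person. In this toy
# encoding, a positive first component means "moves together with a vehicle".
riding_in_truck = [[0.2, 0.0], [0.3, 0.1], [0.2, -0.1], [0.3, 0.0]]
history = run(riding_in_truck)
```

    Each frame on its own gives only weak evidence, but because the state carries over from frame to frame, the first component of the state grows as consistent observations accumulate.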

    Productionizing Machine Learning at Scale

    Back to the point: when you're 90% done, you still have 90% to go. To get to the last leg of my talk, I want to give you some appreciation for what it truly takes to build a machine learning system at scale and productionize it.

    Up until now we talked a lot about algorithms. Algorithms have been a breakthrough, and the efficiency of those algorithms has been a breakthrough for us to succeed at the self-driving task. But it takes a lot more than algorithms to actually get there.

    Labeling at Scale

    The first piece you need to 10x is around labeling efforts. A lot of the algorithms we talked about are supervised — meaning that even if you have a strong network architecture, you need to come up with a representative, high-quality set of labeled data that maps some input to the output you want the network to predict. That's a pedestrian, that's a car — and the network will learn in a supervised way how to build the right representations.

    The unsupervised space is a very active domain of research, both at Waymo and in collaboration with Google teams, but today a lot of it is still supervised. To give you orders of magnitude: you may be familiar with ImageNet, which is in the 15 million label range. Back in the early days of the street number problem, it took us a multi-billion label dataset to actually teach the network. Today we do a lot better — not only do we require less data, but we can generate those datasets much more efficiently. You can use machine learning itself to come up with labels and use operators to fix discrepancies or mistakes, rather than labeling the whole thing from scratch. That's the whole space of active learning. Combining those techniques together, you can get to completion faster. It's still very common to need datasets in the millions range to train a robust solution.
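
    The model-assisted part of that workflow can be sketched as a simple routing step: confident machine labels are accepted, and the rest go to a human operator. The Proposal type, the 0.9 threshold, and the frame identifiers below are assumptions for illustration only:

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    item_id: str
    label: str
    confidence: float  # the model's own confidence, in [0, 1]


def split_for_review(proposals, threshold=0.9):
    """Accept confident machine labels; queue the rest for a human operator."""
    accepted, needs_review = [], []
    for proposal in proposals:
        if proposal.confidence >= threshold:
            accepted.append(proposal)
        else:
            needs_review.append(proposal)
    return accepted, needs_review


proposals = [
    Proposal("frame_001", "pedestrian", 0.98),
    Proposal("frame_002", "car", 0.95),
    Proposal("frame_003", "pedestrian", 0.55),
    Proposal("frame_004", "cyclist", 0.72),
]
accepted, needs_review = split_for_review(proposals)
```

    In this toy batch only two of the four items need an operator. In practice the threshold would be tuned against audited samples, since a model can be confidently wrong.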

    Compute and Infrastructure

    Another piece is around computing power. Here's some historical data around the street number models: a detection model and a transcription model. If you look at the number of neurons or the number of connections per neuron, you start to be competitive in some cases with what the brain can do in certain domains. The main point is that you need a lot of computation, both to train those models and to run inference with them in real time on the car, and that requires a lot of very robust engineering and infrastructure development to get to those scales. Google is pretty good at that, and at Waymo we have access to the Google infrastructure and tools to get there.

    The way it's happening at Google is around TensorFlow. Maybe you've heard of it as a programming language to encode network architectures, but TensorFlow is actually also the whole ecosystem that combines all those pieces together to do machine learning at scale. It's a language that allows teams to collaborate and work together, a data representation in which you can represent your labeled datasets or training batches, and a runtime that you can deploy on Google data centers. Another piece is hardware accelerators. Back in the early days we had CPUs to train models at scale, which is less efficient. Over time GPUs came into the mix, and Google has been proactive in developing very advanced hardware accelerators, including Tensor Processing Units, which are proprietary chips deployed in Google's data centers to train deep learning models and run inference more efficiently. TensorFlow is the glue that allows you to deploy at scale across all those pieces.

    The Three-Legged Testing Program

    So you're smart, you build a smart algorithm, you collect enough data to train it — great. But a self-driving system is pretty sophisticated, and it requires extensive testing. The last leg you need to cover to do machine learning at scale with a high safety bar is around your testing program. We have three legs that we use to make sure our machine learning is ready for production: real-world driving, simulation, and structured testing.

    Real-world driving: Obviously there is no way around it. If you want to encounter situations and understand how you behave, you need to drive. As you can see, the driving at Waymo has been accelerating over time. We crossed three million miles driven back in May 2017, and only six months later, in November, we reached four million. Not every mile is equal, and what you care about are the miles that carry new and important situations. Those miles were acquired across 20 cities, many weather conditions, and many environments. To give you another order of magnitude: four million miles is about 160 times around the globe, and more importantly, it's probably around 300 years of human driving equivalent. So in that dataset you potentially have 300 years of experience that your machine learning can tap into to learn what to do.

    Simulation: Even more importantly, the software changes regularly. If for each new revision of the software you need to go and re-drive four million miles, that's not very practical. The ability to have good enough simulation that you can replay all those miles in any new iteration of the software is key for deciding if the new version is ready or not. Even more important is your ability to make those miles more efficient and tweak them. Here is a screenshot of an internal tool we call CarCraft, which gives us the ability to fast-forward or change the parameters of the actual scenes we've driven. What if the cars were going at a slightly different speed? What if there was an extra car on the scene? What if a pedestrian crossed in front of the car? You can use the actual live miles as a base and augment them into new situations to test your self-driving system against. That's a very powerful way to drastically multiply the impact of any mile you drive.
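
    A minimal sketch of that kind of "what if" augmentation, assuming a logged scene can be reduced to a list of agents with speeds. The Agent and Scenario types and both helper functions are hypothetical and say nothing about how the internal tool is built:

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Agent:
    kind: str          # "car", "pedestrian", ...
    speed_mph: float


@dataclass(frozen=True)
class Scenario:
    name: str
    agents: Tuple[Agent, ...]


def speed_variants(base: Scenario, factors: List[float]) -> List[Scenario]:
    """Replay the same scene with every agent's speed scaled by each factor."""
    variants = []
    for factor in factors:
        agents = tuple(replace(agent, speed_mph=agent.speed_mph * factor)
                       for agent in base.agents)
        variants.append(Scenario(f"{base.name}_speed_x{factor}", agents))
    return variants


def with_extra_agent(base: Scenario, agent: Agent) -> Scenario:
    """Add one agent that was not present in the logged scene."""
    return Scenario(f"{base.name}_plus_{agent.kind}", base.agents + (agent,))


# One logged scene (made-up values) expanded into four test scenarios.
logged = Scenario("left_turn_log", (Agent("car", 30.0), Agent("car", 25.0)))
variants = speed_variants(logged, [0.8, 1.0, 1.2])
variants.append(with_extra_agent(logged, Agent("pedestrian", 3.0)))
```

    The logged scenario is immutable, so every variant is a new object and the original drive stays available as the baseline to compare against.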

    Simulation is another of those massive-scale projects. Using Google's infrastructure, we have the ability to run a virtual fleet of 25,000 cars 24/7 in data centers — software stacks that simulate the driving across either real miles we've driven or modified miles that help us understand the behavior of the software. Last year alone we drove 2.5 billion of those miles in data centers. Remember: four million driven miles total, all the way to 2.5 billion simulated — that's three orders of magnitude of expansion in your ability to truly understand how the system behaves.
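
    The figures above can be checked with a few lines of arithmetic. The only assumption added here is that the virtual fleet runs every day of a 365-day year:

```python
import math

REAL_MILES = 4_000_000              # total on-road miles quoted in the talk
SIM_MILES_PER_YEAR = 2_500_000_000  # simulated miles quoted for one year
VIRTUAL_CARS = 25_000               # size of the virtual fleet

# How many simulated miles there are for every real mile driven.
ratio = SIM_MILES_PER_YEAR / REAL_MILES
orders_of_magnitude = math.log10(ratio)

# Assumes the fleet runs every day of a 365-day year.
miles_per_car_per_day = SIM_MILES_PER_YEAR / (VIRTUAL_CARS * 365)
average_speed_mph = miles_per_car_per_day / 24
```

    The ratio comes out at 625, or about 2.8 orders of magnitude, which rounds to the three quoted. Each virtual car covers roughly 274 miles a day, an average of a bit over 11 miles per hour around the clock.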

    Structured testing: There's still a long tail of situations that will happen very rarely. The way we decided to tackle those is to set up our own testing facility that is a mock-up of a city and its driving situations. We do that in a 90-acre testing facility on a former Air Force base in California, set up with traffic lights and railroad crossings, truly trying to reproduce real-life situations. There we set up very specific scenarios that we haven't necessarily encountered during our regular driving but that we want to test. We then feed those back into the simulation, augment them using the same simulation strategies, and inject them into our 2.5 billion simulated miles.

    Here I'm going to show you two quick examples of such tests. The first: just having a cab back up as the self-driving car gets close, and seeing what happens — using all that sensor data and injecting it into simulation. Another example is around people dropping boxes in the road — try to imagine the kind of segmentation and semantic understanding you need to do to understand what's happening there. And to make it even more interesting, note that a car has been placed on the other side, so swerving is not an option without hitting the car. Driving complex situations that go from perception to motion planning — the whole stack — and making sure we are robust even in those long-tail examples.

    Future Directions

    It looks like a lot of work, and it is — but we still have a lot of very interesting work coming. I don't have much time to go into too many details, but I'll give you two directions.

    The first is around growing what we call our ODD, our Operational Design Domain. That means extending our fleet of self-driving cars not only geographically, going into urban cores and deploying into different weather conditions, but also in terms of the environments we handle. Just yesterday morning, for instance, we announced that we're going to grow testing of Waymo cars in San Francisco, which brings urban environments, slopes, and fog. That's obviously a very important direction where machine learning is going to keep playing a very important role.

    Another area is around semantic understanding. In case you haven't noticed, I'm from France — that's the famous roundabout at the Arc de Triomphe in Paris, which seems pretty chaotic. I've driven it many times without any issues — touch wood — but I know that it took a lot of semantics and understanding for me to do it safely. I had a lot of expectations about what people would do, a lot of visual communication and gestures to get through that safely. Those require a lot of deeper semantic understanding of the scene for a self-driving system to get through.

    So back to my objectives: I hope I covered many of those. First, context — the context of the space, the history at Google and Waymo, and how deep the roots go back in time. Second, tying in some of the technical and algorithmic solutions you may have talked about during the class into the practical cases we need to solve in the production system. And third, emphasizing the scale and the engineering infrastructure work that needs to happen to truly take such a project into a production system.

    Q&A

    Audience member: It tends to fail at the intersection between perception and planning — your planner might assume something about a perfect world that perception cannot deliver. I was wondering if you use the simulation environment also to induce these perception failures, or whether that's really specific to the scenarios you're testing, and whether you have other validation arguments for the perception side.

    Sacha Arnoud: Very good question. One thing I didn't mention is that the simulator obviously enables you to simulate many different layers in the stack, and one of the hardcore engineering problems is to actually properly design your stack so that you can isolate and test independently — like any good piece of software, you need to have good APIs and layers. We have such a layer in our system between perception and planning. The way we test perception is more by measuring the performance of your perception system across the real miles and tweaking the output of the perception system with its mistakes — having a good understanding of the mistakes it makes and reproducing those mistakes realistically in the new scenarios you come up with as part of your simulator, to realistically test your planning side at scale.
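
    One way to picture the approach described in this answer is a small error-injection layer between ground truth and the planner. This is only a sketch under stated assumptions: the Detection type, the miss rate, and the Gaussian position noise are placeholders for whatever error statistics are actually measured on real miles:

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    kind: str
    x_m: float  # position along the road, in meters
    y_m: float  # lateral offset, in meters


def inject_perception_errors(truth: List[Detection], miss_rate: float,
                             position_sigma_m: float,
                             rng: random.Random) -> List[Detection]:
    """Degrade ground-truth objects using an assumed perception error model."""
    noisy = []
    for obj in truth:
        if rng.random() < miss_rate:
            continue  # simulate a missed detection
        noisy.append(Detection(obj.kind,
                               obj.x_m + rng.gauss(0.0, position_sigma_m),
                               obj.y_m + rng.gauss(0.0, position_sigma_m)))
    return noisy


truth = [Detection("car", 20.0, 0.0), Detection("pedestrian", 12.0, 3.5)]
rng = random.Random(0)  # fixed seed so a failing scenario can be replayed
planner_input = inject_perception_errors(truth, miss_rate=0.05,
                                         position_sigma_m=0.2, rng=rng)
```

    The planner under test would then consume planner_input instead of the perfect truth list, so it is exercised against realistic mistakes without running the full perception stack.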

    Audience member: Do you have a systematic way of creating the architectures of the embedded system? You have so many choices for sensors and algorithms, and each problem you showed has many different solutions that create different interfaces between each element. How do you choose which architecture you put in a car?

    Sacha Arnoud: That's true for any complex software stack. There's a combination of different things. The first thing, which I didn't talk too much about here, is the vast amount of research we do at Waymo and in collaboration with Google teams to understand what building blocks we have at our disposal to even play with and come up with those production systems. The other piece is obviously deciding which ones to take all the way to production. Two big elements, I would say. That search actually takes a lot of people to get right, so the first and main element, frankly, and part of the second 90%, is your ability to grow your team and essentially grow the number of people who can productively participate in your engineering project. That's where the robustness you need to bring into your development environment and your testing is really key to being able to grow that team at scale, explore all those paths, and come up with the best one. At the end of the day, the robustness of testing is the judge: that's what tells you whether an approach works. It's not a philosophical debate.

    Audience member: Thank you for your talk. The car is making a decision at every single time step on direction and speed, and part of the reason you have the simulation is so that you can test those decisions in every possible scenario. Once self-driving cars become production-ready and out on the streets, do you expect that the decisions will be made based on prior understanding of every single situation that is possible, or can the car make a new decision in real time based on its scene understanding and everything around it?

    Sacha Arnoud: At the end of the day, the goal of the system is not to build a library of events that you can reproduce one by one and make sure you encode the right response. The analogy in machine learning would be overfitting — if you encountered five situations, I'm pretty sure you can hard-code the perfect thing to do in those five situations, but when the sixth one happens, if you don't generalize, it's going to fall through. The real complexity of what you need to do is extract the core principles that make you drive safely and have the algorithms learn those principles rather than the specifics of any situation, because as you said, the parameter space of a real scene is infinite. We try to stress that a little bit with the simulator — what if the cars went a little faster or slower — but the goal is not to enumerate all possibilities and make sure we handle those. The goal is to bring more diversity to the learning of those general principles that will be run by the system for the car to behave properly and generalize when a new situation occurs.

    Audience member: Fantastic talk. One of the questions I had was: you mentioned the difficulty of identifying snow because it can come in many different shapes. One thing I immediately thought of was the urban legend about the Inuit having 150 different words for snow. You mentioned embeddings of objects — do you think one possible approach might be to create a much wider array of object embeddings for things like snow? If you're dealing with many different types of snow, they could actually have pretty different impacts on driving, whether it's just a flurry or a really heavy blizzard like we just had.

    Sacha Arnoud: I think from an algorithmic point of view that may make sense, but something I'd like to emphasize is the very hard line you have to walk on what's computationally feasible in the car. Two points on your remark. If you had the computing power to process every point down to a very low level of understanding, maybe that would be an approach, but it would be very expensive and hard to do. Even more importantly, it wouldn't make sense to have a behavior prediction for every snowflake you see on the side of the road. You need to group what you see, and that's the whole point of segmentation, into semantic objects that are likely to exhibit behavior as a whole, and reason at that level of abstraction to get the meaningful semantic understanding you need to drive.

    Audience member: Thanks for the talk. If you're using perception for your scene understanding, are you worried about adversarial examples, or things that have been demonstrated — do you believe that's a real-world attack that could be used for perception-based systems?

    Sacha Arnoud: Generally speaking, yes. Even beyond adversarial attacks, errors can happen in every mode. A prime example of that, which is not adversarial, is the reflection case — you could as well have put a sticker on the bus and said "you're confused, you think it's a car, it's not a car." But you don't need to put a sticker on the bus — real life already brings a lot of those examples. The way out is, first, to have sensors that complement each other. Different sensors or different systems are not going to make the same mistakes, so they're going to complement each other, and that's a very important piece of redundancy built into the system. The other one — even in the reflection case — is the semantic understanding. The way you as a human wouldn't be fooled is because you understand it's not possible that a car is hiding behind the bus given your field of view. That level of semantic understanding is what's going to tell you what is true and what is a mistake or error in your stack. Similar patterns apply.
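
    The redundancy argument can be illustrated with a deliberately simple voting rule. It assumes detections from different sensors have already been matched to shared object identifiers, which is itself a hard problem, and the function name, sensor names, and two-sensor threshold are hypothetical. Real sensor fusion is far richer than this:

```python
def confirmed_objects(detections_by_sensor, min_agreeing=2):
    """Keep an object only if enough independent sensors report it."""
    votes = {}
    for sensor, object_ids in detections_by_sensor.items():
        for object_id in set(object_ids):
            votes.setdefault(object_id, set()).add(sensor)
    return {object_id for object_id, sensors in votes.items()
            if len(sensors) >= min_agreeing}


# Hypothetical frame: the camera sees a "ghost" car reflected on a bus,
# but lidar and radar, which fail in different ways, do not.
frame = {
    "camera": ["bus_1", "ghost_car"],
    "lidar": ["bus_1"],
    "radar": ["bus_1"],
}
confirmed = confirmed_objects(frame)
```

    Only the bus survives the vote. A strict rule like this trades false positives for possible misses, which is one reason the semantic reasoning described in the answer is needed on top of redundancy.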


    Polished transcript of Lex Fridman. All views are those of the original speakers.