Karl Iagnemma and Oscar Beijbom of Aptiv Autonomous Mobility lecture at MIT on self-driving car technology
A guest lecture at MIT's 6.S094 Deep Learning for Self-Driving Cars course, featuring Karl Iagnemma and Oscar Beijbom from Aptiv Autonomous Mobility.
Summary
Karl Iagnemma, President of Aptiv Autonomous Mobility, and Oscar Beijbom, the division's machine learning lead, deliver a guest lecture at MIT's deep learning for self-driving cars course. Iagnemma opens with a history of autonomous vehicle development — from the 2007 DARPA Urban Challenge through nuTonomy's founding and its 2017 acquisition by Aptiv — and then addresses the central challenge facing the industry: how to validate and trust neural network-based systems in safety-critical applications. He argues that the field has moved away from end-to-end pixel-to-actuator black-box architectures, primarily because of the difficulty of proving safety, and instead advocates for "caging" learned components within rigorously verifiable safety architectures. Beijbom then presents two concrete technical contributions from Aptiv's machine learning team: PointPillars, a novel and fast point cloud encoder for 3D object detection that outperforms published methods on the KITTI benchmark while running at over 60 Hz, and nuScenes, a large-scale annotated autonomous driving dataset released to the research community to advance the field.
FULL TRANSCRIPT
Introduction and Welcome
Lex Fridman: Welcome back to 6.S094, Deep Learning for Self-Driving Cars. Today we have Karl Iagnemma and Oscar Beijbom from Aptiv. Karl is the President of Aptiv Autonomous Mobility, and Oscar is the machine learning lead. Karl founded nuTonomy, as many of you know, in 2013 — a Boston-based autonomous vehicle company. nuTonomy was acquired by Aptiv in 2017 and is now part of Aptiv. Karl and his team are one of the leaders in autonomous vehicle development and deployment, with cars on roads all over the United States and several other sites. But most importantly, Karl is MIT through and through — as some of you may know, he got his PhD here and led a robotics group here as a research scientist for many years. It's a real pleasure to have both Karl and Oscar with us today. Please give them a warm welcome.
Karl Iagnemma: Background and Introduction to Aptiv
Karl Iagnemma: Thanks, Lex. Very glad to be back at MIT. I'm very impressed that you guys are here during IAP. My course load during IAP was usually ice skating, and sometimes there was a wine tasting course — this is now almost twenty years ago — and that was pretty much it. That's where the academic work stopped. So you guys are here to learn something, and I'm going to do my best and try something radical.
As president of Aptiv Autonomous Mobility, I'm not really allowed to talk about anything technical or interesting, but I'm going to flout that a little bit and raise some topics that we think about: questions to keep in the back of your mind as you're thinking about deep learning and autonomous driving. I'll raise some of those questions, and then Oscar will present some real-life technology and some of the work that he and his outstanding team have been doing on machine learning-based detectors.
Let me first introduce Aptiv a little bit, because people usually ask me what Aptiv is when I say I work for them. Aptiv has actually been around for a long time, but in a different form. Aptiv was previously Delphi Technologies, which was previously part of General Motors. Everybody's heard of General Motors; some of you may have heard of Delphi. Aptiv spun off from Delphi about fourteen months ago. Aptiv is a tier-one supplier, an automotive company that industrialises technology. Essentially, they take software and hardware, industrialise it, and put it on cars so it can run for many hundreds of thousands of miles without failing, which is a useful thing when we think about autonomous driving.
The themes for Aptiv are what they describe as safer, greener, and more connected solutions. Safer means safety systems — active safety and autonomous driving systems of the type that we're building. Greener means systems to enable electrification and green vehicles. And more connected means connectivity solutions, both within the vehicle for transmitting data around the vehicle, and externally for wireless communication. All of these things, as you can imagine, feed very nicely into the future transportation systems that software will only be a part of. So Aptiv is in a really interesting spot when you think about the future of autonomous driving.
To give you a sense of scale — and this still kind of amazes me — the biggest my research group ever was at MIT was about eighteen people. Aptiv is a hundred and fifty-six thousand employees. It's a significant-sized organisation, about a thirteen-billion-dollar company by revenue, operating in about fifty countries around the world. My group is about seven hundred people, of which Oscar is one very important person. We're about seven hundred people working on autonomous driving. We've got about a hundred and twenty cars on the road in different countries.
A History of Autonomous Vehicle Development
I'll show you some examples of that, but first let me take a trip down memory lane and show you a couple of snapshots about where we were not too long ago — as a community, but also personally. This will either inspire or horrify you; I'm not sure which.
In 2007, there were groups driving around with cars running blade servers in the trunk that were generating so much heat you had to install another air conditioner, which was then drawing so much power you had to add another alternator — and then kind of rinse and repeat. So it wasn't a great situation. But people did enough algorithmically and computationally to enable these cars — this is the DARPA Urban Challenge, for those who may be familiar — to do something useful and interesting on a closed course. It kind of convinced enough people that, given enough devotion of thought and resources, this might actually become a real thing someday. I was one of those people who got convinced.
In 2010 — and I'm going to borrow from my co-founder Emilio, who was a former MIT faculty member in AeroAstro — Emilio started up an operation in Singapore through SMART. This is some folks from SMART. That's James, who looks really young in that picture. He was one of Emilio's students, and he was basically taking a golf cart and turning it into an autonomous shuttle. It turned out to work pretty well, and it got people in Singapore excited, which in turn got us further excited.
In 2014, they did a demo where they let people in Singapore ride around in these carts in a garden, and that worked great over the course of a weekend. Around this time we had started nuTonomy — we'd actually started a commercial enterprise and stepped at least partly away from MIT.
By 2015, we had cars on the road. This is a Mitsubishi i-MiEV electric vehicle. When we had all our equipment in it, the front seat was pushed forward so far that I — I'm about six foot three — actually couldn't sit in the front seat, so I couldn't accompany people on rides. It wasn't very practical. We ended up switching to a Renault Zoé platform, which is the one you see here, which had a little more legroom. At that point we were giving open-to-the-public rides in our cars in Singapore in the part of the city we were allowed to operate in.
It was a quick transition. As you can see, even just visually, the evolution of these systems has come a long way in a short time. We're just one point example of this phenomenon, which is broadly speaking similar across the industry.
In 2017, we joined Aptiv, and we were excited by that because we, as primarily scientists and technologists, didn't have a great idea of how we were going to industrialise this technology, actually bring it to market, make it reliable and robust, and make it safe — which is what I'm going to talk about here today. So we joined Aptiv with its global footprint.
Aptiv's Current Operations
Today we're primarily in Pittsburgh, Boston, Singapore, and Las Vegas, and we've got connectivity to Aptiv's other sites in Shanghai and Wolfsburg.
Let me tell you a little bit about what's happening in Vegas. I think Luc Vincent from Lyft probably talked a little bit about Vegas when he was here. Vegas is really an interesting place for us. We've got a big operation there: a 130,000 square-foot garage, about seventy-five cars, and thirty of those cars on the Lyft network. So it's Aptiv technology, but connecting to the customer through Lyft. If you go to Vegas and open your Lyft app, it'll ask you whether you want to take a ride in an autonomous car. You can opt in or opt out; it's up to you. If you opt in, there's a reasonable chance one of our cars will pick you up when you call for a ride. Anybody can do this: competitors, innocent bystanders, it's totally up to you. We have nothing to hide. Our cars are on the road twenty hours a day, seven days a week.
When you get out of the car, just like any Lyft ride, you give us a star rating from one to five. That star rating is actually really interesting to us. It's a scalar — it's not too rich — but that rating says something about the ride quality: the comfort of the trip, the safety you felt, and the efficiency of getting to where you wanted to go. Our star rating today is 4.95, which is pretty good.
Key numbers: we've given, at this point, over 30,000 rides to more than 50,000 passengers. We've driven over a million miles in Vegas. And as I mentioned, the rating is 4.95.
Video Demonstration
What does it look like on the road? I'll show just one video today — Oscar has a few more. This one is actually in Singapore, but it's all morally equivalent. You'll see a slightly sped-up view of a run from about six or seven months ago on the road in Singapore, and it's got some interesting stuff in a fairly typical run.
Some of you may recognise these roads. We're on the wrong side of the road, remember, because we're in Singapore. But to give you an example of the types of problems we have to solve on a daily basis: the car is cruising down the road and has to avoid obstacles, sometimes in the face of oncoming traffic. We've got to deal with situations where other road users are maybe not perfectly following the rules. We've got to manage that in a natural way. Construction in Singapore, like everywhere else, is pretty ubiquitous, so you have to navigate through these less structured environments. There are people who are sometimes doing things or indicating some future action that you have to make inferences about, which can be tricky to navigate.
So a typical route that any one of us as humans would drive through without batting an eye actually presents some really complex problems for autonomous vehicles. But these are the table stakes these days — these are the things you have to do if you want to be on the road, and certainly if you want to drive millions of miles with very few accidents, which is what we're doing.
The Challenge of Learning and Safety in Autonomous Driving
So that's an introduction to Aptiv and a little bit of background. Let me talk about learning and how we think about it in the context of autonomous driving.
There was a period a few years ago where, I think, as a community, people thought we would be able to go from pixels to actuator commands with a single learned architecture — a single black box. Generally speaking, we no longer believe that's true. I should include myself in that — I didn't believe it was ever true — but some of us maybe thought it was. And I'll tell you part of the reason why.
A big part of it comes down to safety. The question of safety — convincing ourselves that that black box, even if we could train it to accurately approximate this massively complex underlying function, is actually safe — is very, very hard to answer affirmatively. This is not to say that learning methods are not incredibly useful for autonomous driving, because they absolutely are, and Oscar will show you examples of that. But this safety dimension is tricky.
There are actually two axes here. One is the actual technical safety of the system — can we build a system that is safe, that is provably safe in some sense, that we can validate, that we can convince ourselves achieves the intended functionality in our operational design domain, that adheres to whatever regulatory requirements might be imposed in the jurisdictions we're operating in? There's a whole longer list related to technical safety, but these are primarily technical problems.
But there's another dimension, which you might call perceived safety — when you ride in a car, even if it's safe, do you believe that it's safe, and therefore will you want to take another trip? That sounds kind of squishy, and as engineers we're typically uncomfortable with that kind of thing. But it turns out to be really important, and probably harder to solve because it is a little bit squishy. Quite obviously, we've got to be in the upper right-hand corner — we need not only a very safe car from a technical perspective, but one that feels safe, that inspires confidence in riders, in regulators, and in everybody else.
So how do we get there in the context of elements of this system that may be black boxes? What's required is trust. How do we get to the point where we can trust neural networks in the context of safety-critical systems, which is what an autonomous vehicle is? It really comes down to this question: how do we convince ourselves that we can validate these systems — ensuring they can meet the operational requirements in the domain of interest?
Three Dimensions of Validation
There are three dimensions to this key question of understanding how to validate, and I'm going to briefly introduce some questions around each of them.
The first is trusting the data. Do we actually have confidence about what goes into the algorithm? Everybody knows garbage in, garbage out. There are various ways we can make this garbage: we can have data that insufficiently covers our domain, that is not representative of the domain, or that is poorly annotated by our third-party labelling partners. So do we trust the data going into the algorithm?
The second is trusting the implementation. You've got a beautiful algorithm — super descriptive, super robust, not brittle at all, well-trained — and we're running it on poor hardware. We've coded it poorly. We've got buffer overruns left and right. Do we trust the implementation to actually execute in a safe manner?
And third, do we trust the algorithm itself? Generally speaking, we're trying to approximate really complicated functions. This is a gnarly, nasty function, and the cases of critical interest are very rare; in fact, those rare cases are the only ones of interest. These are events that happen very infrequently that we absolutely have to get right. It's a hard problem to convince ourselves that the algorithm is going to perform properly in these unexpected and rare situations.
These are the sorts of things we think about and have to answer in an intelligent way to convince ourselves that we have a validated neural network-based system.
Why Validation Is Hard
Let me step through each of these topics quickly.
On the topic of validation — why is it hard? There are a number of dimensions. The first is that we don't have insight into the nature of the function we're trying to approximate. The underlying phenomenon is really complicated. If it weren't, we'd probably be modelling it using different techniques — we'd write a closed-form equation to describe it.
Second, the actual crashes on the road are rare. Luckily, they're very rare. But that makes the statistical argument around being able to avoid these accidents really, really difficult. If you believe RAND, and they're pretty smart folks, they say you've got to drive 275 million miles without a fatality to claim a lower fatality rate than a human with 95% confidence. How are we going to do that? Can we use some correlated incident that may be more frequent, maybe some kind of close call, as a proxy for accidents and back into it that way? There are a lot of questions here that I won't say we have no answers to, but they're hard questions without obvious answers.
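The 275 million mile figure can be reproduced with a short calculation. This is a minimal sketch, not RAND's full analysis: it assumes fatal events are independent with a constant per-mile rate, and it assumes a human benchmark of about 1.09 fatalities per 100 million miles. The function name and constant are illustrative. If the true rate is R, the chance of driving n miles with no fatality is (1 - R)^n, so we solve (1 - R)^n = 1 - C for n.

```python
import math


def miles_without_failure(rate_per_mile: float, confidence: float) -> float:
    """Failure-free miles needed to claim, at the given confidence,
    that the true failure rate is below rate_per_mile.

    Assumes independent failures with a constant per-mile rate, so
    P(no failure in n miles) = (1 - rate_per_mile) ** n.
    """
    # log1p(-r) computes log(1 - r) accurately for very small r.
    return math.log(1.0 - confidence) / math.log1p(-rate_per_mile)


# Assumed human benchmark: about 1.09 fatalities per 100 million miles.
HUMAN_FATALITY_RATE = 1.09 / 100_000_000

miles = miles_without_failure(HUMAN_FATALITY_RATE, 0.95)
print(f"{miles / 1e6:.0f} million miles")  # prints "275 million miles"
```

The same formula shows why a more frequent proxy event is attractive: the required mileage scales inversely with the event rate, so an event that is 100 times more common needs roughly 100 times fewer miles to bound.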
The regulatory dimension is one of these known unknowns. How do we evaluate a system if the requirements that may be imposed on us from outside regulatory bodies are still to be written? There's a lack of consensus on what the safety target should be for these systems. This is obviously evolving, and smart people are thinking about it, but today it's not at all clear — if you're driving in Las Vegas, in Singapore, in San Francisco, or in between — what this target needs to be.
And then lastly, this is a really interesting one: we can get through a validation process for a build of code. Let's assume we can do that. Well, what happens when we update the code? Because obviously we will. Does that mean we have to start that validation process again from scratch, which will unavoidably be expensive and lengthy? What if we only change a little bit of the code? What if we only change one line — but what if that one line is the most important line of code in the whole codebase? This question of revalidation keeps a lot of people up at night.
And even keeping the codebase fixed — what if we move from one city to the next, and that city is quite similar to the previous one but not exactly the same? How do we think about validation in the context of new environments? This continuous development issue is a real challenge.
Trusting the Data
Let me move on to talking about the data. There are probably people in this room doing active research in this area, because it's a really interesting one.
There are a couple of questions we think about when we think about data. We can have a great algorithm, and if we're training it on poor data, we won't have a great output. One thing we think about is the sufficiency, completeness, and bias inherent in the data for our operational domain. If we want to operate twenty-four hours a day and we only train on data collected during daytime, we're probably going to have an issue.
Annotating the data is another dimension. We can collect raw data that sufficiently covers our space, but when we annotate it — when we hand it off to a third party, because it's typically a third party that marks up the interesting aspects — we provide them some specifications, but we put a lot of trust in that third party. We trust that they're going to do a good job annotating the interesting parts and not the uninteresting parts, that they're going to catch all the interesting parts we've asked them to catch, and so on. This annotation part, which seems very mundane and easy to manage, is in fact another key aspect of ensuring that we can trust the data.
Trusting the Algorithm
Moving on from the data to the actual algorithm — how do we convince ourselves that an algorithm trained on a training set is going to do well on some unknown test set? There are a couple of properties of the algorithm we can interrogate to convince ourselves it will perform well.
One is invariance, and the other is stability. If we make small perturbations to this function, does it behave well? Given a bounded input, do we see a bounded output, or do we see some wild response? I'm sure you've all heard of examples of adversarial images that can confuse learning-based classifiers. You show it a turtle, and it says "that's a turtle." Then you show it a turtle that's been fuzzed with a little bit of noise that the human eye can't perceive — it still looks like a turtle — and it tells you it's a machine gun. In the driving domain, we want a stop sign to be correctly identified as a stop sign a hundred percent of the time. We don't want that stop sign, if somebody goes up and puts a piece of duct tape in the lower right-hand corner, to be interpreted as a yield sign.
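To make the stability question concrete, here is a toy sketch of how a small bounded change to every input can flip a decision. It uses a hypothetical linear classifier with made-up weights, not a real sign detector, and the perturbation is the worst-case sign-based shift for a linear model (the same idea that gradient-sign attacks exploit). The point is only that many tiny per-input changes can add up.

```python
def score(weights, inputs):
    """Linear classifier score; a positive score means 'stop sign'."""
    return sum(w * v for w, v in zip(weights, inputs))


def worst_case_shift(weights, inputs, eps):
    """Move every input by exactly eps in the direction that lowers the score."""
    return [v - eps if w > 0 else v + eps for w, v in zip(weights, inputs)]


# Hypothetical toy model with 1,000 inputs scaled to the range 0..1.
n = 1000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(n)]
inputs = [0.51 if i % 2 == 0 else 0.49 for i in range(n)]

eps = 0.02  # each input changes by only 2% of its full range
perturbed = worst_case_shift(weights, inputs, eps)

clean_score = score(weights, inputs)        # about +0.1: classified as a stop sign
attacked_score = score(weights, perturbed)  # about -0.1: the decision flips
largest_change = max(abs(a - b) for a, b in zip(inputs, perturbed))

print(clean_score > 0, attacked_score > 0)  # True False
```

The shift in score is eps times the sum of the absolute weights, so the more inputs a model has, the smaller the per-input change needed to cross the decision boundary.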
And then lastly, there's the notion of interpretability — understanding why an algorithm made the decision it made. This may not be a nice-to-have; it may actually be a requirement, and it's likely to be a requirement from regulatory groups. Imagine the case of a crash where the system governing the trajectory generator was a data-driven, deep learning-based model. You may need to explain to someone exactly why that particular trajectory was generated at that particular moment. This may be a hard thing to do if the generator was a data-driven model. There are people doing active research into interpretable learning methods, but it's a thorny topic, and it's not at all clear to me when and if we'll get to the stage where we can explain — even to a technical audience, let alone to a lay jury — why algorithm X made decision Y.
Safety Architecture: Caging the Learning
With all that in mind, let me talk a little bit about safety. All of that may sound pretty bleak — you might think, "Well, I've been taking this course with Lex and we're never really going to use this stuff." But in fact we can, and will as a community. There are a lot of tools we can bring to bear to think about neural networks, generally speaking within the context of a broader safety argument. That's the key. We tend not to think about using a neural network as a holistic system to drive a car, but rather as a sub-module that we can build other systems around — systems about which we can make more rigorous claims about their performance and underlying properties, and therefore make a convincing holistic safety argument that the end-to-end system is safe.
We have tools. Functional safety is maybe familiar to some of you — it's something we think about a lot in the automotive domain. And SOTIF, which stands for Safety of the Intended Functionality, is basically asking ourselves: is this overall function doing what it's intended to do? Is it operating safely? Is it meeting its specifications? There's an analogy here to validation and verification, and we have to answer these questions around functional safety and SOTIF affirmatively, even when we have neural network-based elements, in order to eventually put this car on the road.
I mentioned that we need to do some embedding. This is an example of what it might look like. We sometimes call this "caging the learning" — we put the learning in a box. It's a powerful animal we want to control. In this case, it's up there at the top in red — that might be the trajectory proposer I was talking about. Let's say we've got a powerful trajectory proposer. We want to use this thing. We've got it on what we call our performance compute — our high-powered compute. It's maybe not automotive grade; it's got some potential failure modes, but it generally has good performance. We've got our neural network-based generator on it, about which we can say some things but maybe not everything we'd like to.
The argument is that if we can surround it — cage it, underpin it — with a safety system about which we can say very rigorous things, then generally speaking we may be okay. There may be a path to using neural networks on autonomous vehicles if we can wrap them in a safety architecture that we can say a lot of good things about. And this is exactly what this represents.
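The caging idea can be sketched in a few lines. This is an illustration under stated assumptions, not Aptiv's architecture: the trajectory fields, the limits, the proposer functions, and the fallback manoeuvre are all hypothetical. What it shows is the pattern of letting a learned component propose while a small set of auditable rules decides whether the proposal is used.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# All names and limits below are hypothetical, chosen only for illustration.
MAX_SPEED_MPS = 15.0      # assumed speed limit for the operating domain
MIN_CLEARANCE_M = 1.0     # assumed minimum distance to any obstacle


@dataclass(frozen=True)
class Trajectory:
    speeds_mps: Tuple[float, ...]     # planned speed at each time step
    clearances_m: Tuple[float, ...]   # predicted obstacle clearance at each step


def passes_safety_checks(traj: Trajectory) -> bool:
    """Simple, auditable rules that do not depend on the learned model."""
    speeds_ok = all(0.0 <= v <= MAX_SPEED_MPS for v in traj.speeds_mps)
    clearance_ok = all(c >= MIN_CLEARANCE_M for c in traj.clearances_m)
    return speeds_ok and clearance_ok


def caged_plan(propose: Callable[[dict], Trajectory],
               fallback: Trajectory,
               scene: dict) -> Trajectory:
    """Use the learned proposal only if it passes the checks; otherwise fall back."""
    proposal = propose(scene)
    return proposal if passes_safety_checks(proposal) else fallback


# Stand-ins for a neural network proposer and a conservative fallback manoeuvre.
def aggressive_proposer(scene: dict) -> Trajectory:
    return Trajectory(speeds_mps=(12.0, 18.0), clearances_m=(3.0, 0.4))


def cautious_proposer(scene: dict) -> Trajectory:
    return Trajectory(speeds_mps=(8.0, 9.0), clearances_m=(3.0, 2.5))


slow_to_stop = Trajectory(speeds_mps=(4.0, 0.0), clearances_m=(3.0, 3.0))

print(caged_plan(aggressive_proposer, slow_to_stop, scene={}) is slow_to_stop)  # True
print(caged_plan(cautious_proposer, slow_to_stop, scene={}).speeds_mps)         # (8.0, 9.0)
```

The design choice is that the safety argument rests on `passes_safety_checks` and the fallback, which are small enough to verify rigorously, rather than on the proposer itself.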
I'm going to conclude my part of the talk here and hand it over to Oscar with a quote and assertion that one of my engineers insisted I include today. The argument is the following: engineering is inching closer to the natural sciences. We're creating things that we don't fully understand, and then we're investigating the properties of our creation. We're not writing down closed-form functions — that would be too easy. We're generating these immensely complex function approximators and then just poking at them, asking: what does this thing do under these situations? And I'll leave you with one image, which I'll present without comment, and then hand it over to Oscar.
Oscar Beijbom: Introduction and the Deep Learning Revolution
Oscar Beijbom: Thanks, Karl. Thanks, Lex, for the invite. My name is Oscar. I run the machine learning team at Aptiv Autonomy.
The first slide was quite literally a joke — this is an actual comic. I won't ask if you've seen it before. I was doing my PhD in the era where building a bird classifier was like a PhD project. It's funny because it's true.
And then, of course, as you well know, the deep learning revolution happened. I want to draw a straight line from what I consider the breakthrough paper by Krizhevsky et al. to the work I'll be talking about today. There were sort of three key papers. First, you had end-to-end deep learning for image classification by Krizhevsky and Hinton; that paper has been cited 35,000 times, I checked yesterday. Then in 2014, Ross Girshick at Berkeley basically showed how to repurpose the deep learning architecture to do detection in images. That was when the computer vision community really started to take notice. Classification is more general: you can classify anything, an image, an audio signal, whatever. But detection in images was very intimate to the computer vision community; we thought we were the best in the world at it. So when that paper came out, that was the final argument: we all need to do deep learning now. And then in 2016, the Single Shot MultiBox Detector paper came out, which I think is a great paper. If you haven't read it, by all means read it carefully.
The result is that performance is no longer a joke. This is a network we developed in my group — it's an image joint classification and segmentation network. We can run this at 200 Hz on a single GPU. In this video rendering, there is no tracking applied, there is no temporal smoothing — every single frame is analysed independently from the others. You can see that we can model several different classes, both bounding boxes and surfaces, at the same time.
The Perception Pipeline and the Case for Deep Learning
Here's my cartoon drawing of a perception system for an autonomous vehicle. You have three different main sensor modalities. You typically have some module that does detection and tracking — there are tons of variations of this, of course. You have some sort of sensor pipelines, and then in the end you have a tracking and fusion step. What I showed you in the previous video is basically the camera-to-detection part.
When I started — I come strictly from the computer science and machine learning community — when I looked at this pipeline, I thought: why are there so many steps? Why aren't we optimising things end to end? Obviously there is a real temptation to just wrap everything in a single kernel with a very well-defined input-output function that, as Karl alluded to, can be verified quite well, assuming you have the right data. I'm not going to be talking about that. I am going to talk about building a deep learning kernel for the LiDAR pipeline.
The LiDAR pipeline is arguably the backbone of the perception system for most autonomous driving systems. The goal here is: we're going to have a point cloud as input, and we're going to have a neural network that takes that in and generates 3D bounding boxes in the world coordinate system — twenty metres that way, two metres wide, so long, this rotation, this orientation, and so on. That's what this talk is about.
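As a rough picture of what the network's output looks like, here is a minimal sketch of a 3D box record. The field names and the corner ordering are assumptions made for illustration; real detectors and datasets each define their own conventions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box3D:
    # Hypothetical field layout: centre position, size, heading, class label.
    x: float        # metres, world frame
    y: float
    z: float
    length: float   # metres, along the heading direction
    width: float
    height: float
    yaw: float      # rotation about the vertical axis, radians
    label: str

    def footprint(self) -> List[Tuple[float, float]]:
        """Corners of the box seen from above, in world coordinates."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        half_l, half_w = self.length / 2, self.width / 2
        offsets = ((half_l, half_w), (half_l, -half_w),
                   (-half_l, -half_w), (-half_l, half_w))
        # Rotate each corner offset by yaw, then translate to the box centre.
        return [(self.x + dx * c - dy * s, self.y + dx * s + dy * c)
                for dx, dy in offsets]


# A car twenty metres ahead, two metres wide, four metres long, facing forward.
car = Box3D(x=20.0, y=0.0, z=0.8, length=4.0, width=2.0, height=1.6,
            yaw=0.0, label="car")
print(car.footprint())
```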
PointPillars: A Novel Point Cloud Encoder
I'm going to talk about PointPillars, which is a new method we developed for this, and nuScenes, which is a benchmark dataset we released.
PointPillars is a novel point cloud encoder. What we do is learn a representation that is suitable for downstream detection. The main innovation is the translation from a point cloud to a canvas that can then be processed by a similar architecture to what you would use for images. We show it outperforms all published methods on KITTI by a large margin, especially with respect to inference speed. There's a preprint out and some code available if you want to play around with it.
The architecture looks something like this — and I should say most papers in this space use this architecture, so it's a natural design. You have the point cloud at the top, you have this encoder — that's where we introduce the point pillars — and then that feeds into a backbone, which is a standard 2D convolutional backbone. You have a detection head, and you may or may not have a segmentation head. The key point is that after the encoder, everything looks just like YOLO or SSD — very similar to the SSD architecture.
Let me go into a little more detail. The range — say you want to model forty metres in each direction — you have a certain resolution of your bins and a number of output channels. The input is a set of pillars. A pillar here is a vertical column. You have N by M of those that are non-empty in this space. A pillar P contains all the points — each point has X, Y, Z, and intensity — and there are N sub M points in each pillar. The number of points varies: it could be a single point at a particular location, or it could be two hundred points. The goal is to produce a tensor of fixed size: height by width by C, where C is the number of channels. In an image, C would be three. We call it a pseudo-image, but it's the same thing — a fixed number of channels that the backbone can operate on.
Here's the same thing without the math. You have a lot of points and you have this space, which you grid up into these pillars. Some are empty, some are not.
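The gridding step can be sketched as follows. This is a simplified illustration rather than the released implementation: the function name, the square range, and the one-metre cell size are assumptions. It groups points by their x-y cell and keeps only the non-empty pillars, each holding a variable number of points.

```python
from collections import defaultdict


def group_into_pillars(points, cell_size, max_range):
    """Group (x, y, z, intensity) points into vertical pillars on an x-y grid.

    Returns a dict mapping (row, col) to the list of points in that pillar.
    Only non-empty pillars appear; points outside the range are dropped.
    """
    pillars = defaultdict(list)
    for x, y, z, intensity in points:
        if not (-max_range <= x < max_range and -max_range <= y < max_range):
            continue
        col = int((x + max_range) // cell_size)
        row = int((y + max_range) // cell_size)
        pillars[(row, col)].append((x, y, z, intensity))
    return dict(pillars)


# Hypothetical sweep: forty metres in each direction, one-metre cells.
sweep = [
    (0.2, 0.3, -1.0, 0.5),
    (0.7, 0.1, 0.4, 0.9),     # falls in the same pillar as the first point
    (-3.5, 10.2, 0.0, 0.1),
    (100.0, 0.0, 0.0, 0.3),   # out of range, dropped
]
pillars = group_into_pillars(sweep, cell_size=1.0, max_range=40.0)

for cell, pts in sorted(pillars.items()):
    print(cell, len(pts))
```

With these settings the grid is 80 by 80 cells, but only two of them are non-empty, which is the sparsity the encoder exploits.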
Literature Review: From Feature Engineering to VoxelNet
Let me give a brief literature review. People have tended to take each pillar and divide it into voxels — a 3D box grid — and then extract some sort of features for each box. For example: how many points are in this voxel? What is the maximum intensity of all the points in this voxel? Then you extract a feature for the whole pillar. All of these are hand-engineered functions that generate a fixed-length output. You can concatenate them, and the output is a tensor of X by Y by C.
Then VoxelNet came around, maybe a year or so ago. The first step is similar, in that you divide each pillar into voxels, but they got rid of the feature engineering. They said: we'll map from a voxel to features using a PointNet. I won't go into the details of PointNet, but it's basically a network architecture that allows you to take a point cloud and map it to a fixed-length representation using a series of 1D convolutions and max-pooling layers. It's a very neat paper. They apply that to each voxel, but now you end up with an awkward four-dimensional tensor, because you still have X, Y, Z from the voxels and then the C-dimensional output from the PointNet. So they have to consolidate this Z dimension through a 3D convolution, and only then do you get your X by Y by C tensor.
It's very nice in the sense that it's an end-to-end method and shows good performance, but at the time it was very slow — around five Hz runtime. The culprit is that last step: the 3D convolution is much, much slower than a standard 2D convolution.
The PointPillars Innovation
Here's what we did. We basically said: let's just forget about voxels. We'll take all the points in the pillar and put them straight through a PointNet. That single change gave a ten-to-one-hundred-fold speed-up over VoxelNet. Then we simplified the PointNet — instead of having several layers and several modules, we simplified it to a single 1D convolution and max-pooling layer. We showed you can get a really fast implementation by taking all your non-empty pillars, stacking them together into a nice dense tensor with a little bit of padding, and running the forward pass as a 2D convolution with a one-by-one kernel. The final encoder runtime is 1.3 milliseconds, which is really, really fast.
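The simplified encoder can be sketched in plain Python. This is an illustration of the idea under stated assumptions, not the published implementation: the weights are hand-picked rather than learned, normalisation and any extra per-point features are omitted, and the loops stand in for the dense, batched one-by-one convolution described above. Each point passes through the same linear layer and ReLU, a channel-wise max over the points gives a fixed-length vector per pillar, and that vector is written into its grid cell.

```python
def encode_pillar(points, weights, bias):
    """Shared linear layer + ReLU per point, then channel-wise max over points."""
    per_point = [
        [max(0.0, sum(w * v for w, v in zip(row, point)) + b)
         for row, b in zip(weights, bias)]
        for point in points
    ]
    return [max(feat[c] for feat in per_point) for c in range(len(bias))]


def build_pseudo_image(pillars, weights, bias, height, width):
    """Scatter encoded pillars into a height x width x C canvas of zeros."""
    channels = len(bias)
    canvas = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for (row, col), points in pillars.items():
        canvas[row][col] = encode_pillar(points, weights, bias)
    return canvas


# Hypothetical non-empty pillars, keyed by (row, col); points are (x, y, z, intensity).
pillars = {
    (0, 1): [(1.0, 2.0, 0.5, 0.1), (1.2, 2.1, 0.7, 0.9)],
    (2, 0): [(-1.0, 0.0, 0.2, 0.3)],
}

# Hand-picked weights for C = 2 channels: channel 0 reads x, channel 1 reads intensity.
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]

canvas = build_pseudo_image(pillars, weights, bias, height=3, width=2)
print(canvas[0][1])  # [1.2, 0.9]
print(canvas[2][0])  # [0.0, 0.3]
```

Because the max is taken over however many points a pillar holds, one point or two hundred both yield a vector of length C, which is what lets a standard 2D backbone consume the result.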
The full method looks like this: you have the point cloud, you have this pillar feature net which is the encoder, and that feeds straight into the backbone and your detection heads. It's still a multi-stage architecture, but the key is that all the steps are fully parameterised and we can back-propagate through the whole thing and learn it end to end.
Results on the KITTI Benchmark
Putting these things together, these were the results we got on the KITTI benchmark. If you look at the car class, we actually got the highest performance — this is the bird's-eye view metric — and we even outperformed methods that relied on LiDAR and image fusion. We did that running at just over 60 Hz. We can also measure the 3D benchmark and get very similar performance. Cars did well, cyclists did well, pedestrians did well. There were one or two fusion methods that did a little bit better, but in aggregate we ended up on top. I should note there's a small asterisk here — this is compared to published methods at the time of submission. Things are moving so quickly that there are tons of anonymous submissions on the KITTI leaderboard where we don't even know what the input was or what they did, so we only compared to published methods.
Here are some qualitative results. You can project the detections into the image — the grey boxes are the ground truth and the coloured ones are the predictions. There are some challenging cases. We have, for example, a person with a little stand that gets interpreted as a bicycle. We have a man on a ladder, which is an actual annotation error — we detected it as a person, but it wasn't annotated in the data. And here's a young child on a bicycle that didn't get detected — that's a bummer.
Deployment on Aptiv Vehicles
That's KITTI. I also wanted to show you that we can run this on our vehicle. This is a rendering where we deploy the network at two Hz on the full 360-degree sensor suite. The input is live LiDAR sweeps, just projected into the images for visualisation. Again, no tracking or smoothing applied — every single frame is analysed independently. You can see those arrows sticking out — that's the velocity estimate. We actually show how you can accumulate multiple point clouds into this method and start reasoning about velocity as well.
nuScenes: A New Benchmark Dataset
The second part I want to talk about is nuScenes, which is a dataset we have published.
What is nuScenes? It's one thousand scenes, each twenty seconds long, collected with our development platforms, the same platform Karl showed earlier. It's a full automotive sensor suite. The data is registered and synced in a 360-degree view, and it's fully annotated with 3D bounding boxes. There are over one million 3D bounding boxes, and we make this freely available for research. You can go to nuscenes.org right now and download a teaser release of one hundred scenes. The full release will be in about a month.
The motivation is straightforward. The whole field is driven by benchmarks. Without ImageNet, I don't think any of us would be here; it might never have been possible to write that first paper and start this whole thing going. Looking at 3D, I looked at the KITTI benchmark, which is truly groundbreaking, and I don't want to take anything away from it, but it was becoming outdated. It doesn't have a full 360-degree view, and it doesn't have any radar. I think nuScenes offers the opportunity to push the field forward a little.
Just as a comparison: the most similar benchmark, and really the only one you can compare to, is KITTI. There are other datasets that have maybe LiDAR only, or tons of datasets with images only, but nuScenes is quite a big step up.
Some details: you can see the layout with the radars along the edge, all the cameras on the roof, the top LiDAR, and the respective fields of view — all of this is on the website. The taxonomy models several different sub-categories of pedestrians, several types of vehicles, some static objects, barriers, cones, and in addition a bunch of attributes on the vehicles and on the pedestrians.
Without further ado, let's look at some data. This is one of the thousand scenes. All I'm showing here is just playing the frames one by one from all the images. The annotations live in the world coordinate system — they're full 3D boxes, and I've just projected them into the image. What's neat is that we're not really annotating the LiDAR or the camera or the radar — we're annotating the actual objects and putting them in a world coordinate system, and giving all the transformations so you can play around with it however you like.
To show that: because everything is registered, I can take the LiDAR sweep and project it into the images at the same time. Here I'm showing it coloured by distance, so now you have a sparse depth measurement, a distance measurement, on the images. That's all I wanted to show.
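The projection described above can be sketched with a pinhole camera model. This is a simplified illustration, not the nuScenes tooling: a single `cam_from_lidar` transform stands in for whatever chain of transformations the dataset provides, and the intrinsics, axis conventions, and image size below are assumed values.

```python
import numpy as np

def project_to_image(points_lidar, cam_from_lidar, intrinsics, image_size):
    """Project (N, 3) LiDAR points into pixel coordinates.

    cam_from_lidar: (4, 4) rigid transform from the LiDAR frame to the
                    camera frame (camera looks along +z).
    intrinsics:     (3, 3) pinhole camera matrix.
    image_size:     (width, height) in pixels.
    Returns (M, 2) pixel coordinates and (M,) depths for the points that
    are in front of the camera and land inside the image.
    """
    homo = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    cam = (cam_from_lidar @ homo.T).T[:, :3]
    depth = cam[:, 2]
    in_front = depth > 0.1
    cam, depth = cam[in_front], depth[in_front]
    pix = (intrinsics @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:3]          # perspective divide
    w, h = image_size
    visible = (pix[:, 0] >= 0) & (pix[:, 0] < w) & (pix[:, 1] >= 0) & (pix[:, 1] < h)
    return pix[visible], depth[visible]

# Assumed conventions: LiDAR x forward, y left, z up; camera x right, y down, z forward.
cam_from_lidar = np.eye(4)
cam_from_lidar[:3, :3] = np.array([[0.0, -1.0, 0.0],
                                   [0.0, 0.0, -1.0],
                                   [1.0, 0.0, 0.0]])
intrinsics = np.array([[1000.0, 0.0, 800.0],
                       [0.0, 1000.0, 450.0],
                       [0.0, 0.0, 1.0]])
points = np.array([[10.0, 0.0, 0.0],    # straight ahead
                   [10.0, 2.0, 1.0],    # ahead, to the left, above
                   [-5.0, 0.0, 0.0]])   # behind the camera, dropped
pixels, depths = project_to_image(points, cam_from_lidar, intrinsics, (1600, 900))
```

The returned `depths` are what would be mapped to a colour scale to produce the distance-coloured overlay.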
Q&A
Audience member: I was really interested in your discussion around validation and particularly continuous development. My question is: is the nuScenes dataset enough to guarantee that your model is going to generalise to unseen data and not, say, hit pedestrians? Or do you have other validation that you need to do?
Oscar Beijbom: No — nuScenes is purely an academic effort. We want to share our data with the academic community to drive the field forward. We're not making any claims that this is somehow a sufficient dataset for training a safe system.
Karl Iagnemma: I would say that my background is in the academic world, and one of the hardest things was always collecting data because it's difficult and expensive. A dataset like this was expensive to collect and annotate, and we thought we would make it available because we hoped it would spark academic interest and get smart people, like the people in this room, coming up with new and better algorithms, which could benefit the whole community. And then maybe some of them would want to come work with us at Aptiv, so there's a little bit of self-interest there too. It wasn't intended to be for validation; it was more for research.
To give you a sense of the scale of validation: there was one quote saying you've got to drive 275 million miles or more, depending on the level of certainty you want to impose. To date, as an industry, we've driven about twelve to fourteen million miles in total across all participants in autonomous mode, under hundreds of different builds of code and in many different environments. So you're supposed to drive hundreds of millions of miles in a particular environment on a single build of code on a single platform. Obviously, we're probably not going to do that. What we'll end up doing is supplementing the driving with quite a lot of simulation and other methodologies to convince ourselves we can make a statistical argument for safety. There'll be use of datasets like this, lots of regression testing on supersized versions of datasets or morally equivalent versions to test different parts of the system — not just classification, but motion planning, decision-making, localisation, all aspects of the system — and then augmenting that with on-road driving and simulation. The safety case is really quite a bit broader than any single dataset would allow you to speak to.
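For a sense of where a number like 275 million miles can come from, here is a small worked calculation. The talk does not state the assumptions behind the quote; the benchmark rate (about 1.09 fatalities per 100 million miles) and the 95% confidence level used below are assumptions chosen because they reproduce the figure under a standard zero-failure argument.

```python
import math

def failure_free_miles(max_rate_per_mile, confidence):
    """Miles that must be driven with zero failures to claim, at the given
    confidence, that the true failure rate is below max_rate_per_mile.

    Assumes failures follow a Poisson process, so the probability of zero
    failures in n miles is exp(-rate * n). Solving
    exp(-rate * n) <= 1 - confidence for n gives the bound.
    """
    return -math.log(1.0 - confidence) / max_rate_per_mile

# Assumed benchmark: about 1.09 fatalities per 100 million miles, 95% confidence.
miles = failure_free_miles(1.09e-8, 0.95)
print(f"{miles / 1e6:.0f} million miles")  # prints "275 million miles"
```

The bound scales inversely with the target rate, which is why the requirement grows so quickly as the level of certainty or the safety target is tightened.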
Audience member: From an industrial perspective, what do you think 5G can offer for autonomous vehicles?
Karl Iagnemma: It's an interesting one. These vehicles are connected — that's a requirement, certainly when you think about operating them as a fleet. When the day comes when you have an autonomous vehicle that is personally owned, it may or may not be connected, but when you have a fleet of vehicles and you want to coordinate the activity of that fleet to maximise the efficiency of the transportation network, they're certainly connected.
The requirements of that connectivity are fairly relaxed if you're talking about just passing back and forth the position of the car and maybe some status indicators — are you in autonomous mode or manual mode, are all systems go, do you have a fault code and what is it? There are some more stringent requirements if you think about what we call teleoperation and remote operation of the car — the case where if the car encounters a situation it doesn't recognise, can't figure out, gets stuck or confused, you might phone a human operator sitting remotely to intervene. In that case, the human operator will want situational awareness, and there may be a demand for high-bandwidth, low-latency, high-reliability connectivity of the sort that 5G is better suited to than 4G or LTE.
Broadly speaking, we see it as very nice to have, but like any infrastructure, we understand it's going to arrive on a timeline of its own and be maintained by someone who's not us. So it's very much outside our control, and for that reason we design the system such that we don't rely on the coming 5G wave — but we'll certainly welcome it when it arrives.
Audience member: You said you have a presence in about fifty countries. Did you observe any interesting patterns from that — like, was your same self-driving car model deployed in Vegas able to perform equally well in Singapore?
Karl Iagnemma: To speak to your question about country-to-country variation: driving in Singapore and driving in Vegas is pretty different. You're on the other side of the road for starters, but there are different traffic rules, and it's somewhat underappreciated that people drive differently — there are slightly different traffic norms.
One of the things we've done — and if anyone was in this class last year, my co-founder Emilio gave a talk about something we call rule books — is a structure we've designed around what we call the driving policy or decision-making engine. It tries to admit, in a general and fairly flexible way, the ability to reprioritise rules, reassign rules, and change weights on rules to enable us to drive in one community and then another in a fairly seamless manner.
To give you an example: imagine you're an autonomy engineer tasked with writing the decision-making engine. You decide to do a finite-state architecture, write down some transition rules by hand, and do it for right-hand driving. Then your boss comes in and says, "Oh yeah, next Monday we're going to be doing left-hand driving." If you've done it manually, that could be a huge pain, and it's generally very difficult to validate to ensure the outputs are correct across the entire spectrum of possibilities. So we wanted to avoid that, and we actually quite carefully designed the system such that we can scale to different cities and countries.
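The contrast with a hand-written state machine can be illustrated with a toy sketch in which rules are data and the priority order is a list. The talk does not describe how rule books are implemented, so everything here is hypothetical: the `Trajectory` fields, the rule functions, and the choice of a strict priority (lexicographic) comparison are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Trajectory:
    """Hypothetical summary of one candidate plan."""
    name: str
    clearance_m: float        # distance to the nearest obstacle
    lateral_offset_m: float   # positive = right of the road centre line
    speed_mps: float

# A rule returns a violation cost; 0.0 means the rule is satisfied.
Rule = Callable[[Trajectory], float]

def no_collision(t: Trajectory) -> float:
    return max(0.0, 1.0 - t.clearance_m)

def keep_right(t: Trajectory) -> float:
    return max(0.0, -t.lateral_offset_m)

def keep_left(t: Trajectory) -> float:
    return max(0.0, t.lateral_offset_m)

def make_progress(t: Trajectory) -> float:
    return max(0.0, 10.0 - t.speed_mps)

def choose(candidates: List[Trajectory], rulebook: List[Rule]) -> Trajectory:
    """Pick the candidate whose violations are smallest in priority order."""
    return min(candidates, key=lambda t: tuple(rule(t) for rule in rulebook))

# Switching driving side means swapping one rule, not rewriting transitions.
RIGHT_HAND = [no_collision, keep_right, make_progress]
LEFT_HAND = [no_collision, keep_left, make_progress]

candidates = [
    Trajectory("right_lane", clearance_m=2.0, lateral_offset_m=1.5, speed_mps=10.0),
    Trajectory("left_lane", clearance_m=2.0, lateral_offset_m=-1.5, speed_mps=10.0),
    Trajectory("squeeze_past", clearance_m=0.3, lateral_offset_m=1.5, speed_mps=12.0),
]
pick_right = choose(candidates, RIGHT_HAND).name
pick_left = choose(candidates, LEFT_HAND).name
```

With these made-up candidates, `pick_right` is `"right_lane"` and `pick_left` is `"left_lane"`, while the low-clearance option loses under both orderings because the collision rule comes first.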
The four cities I mentioned (Boston, Pittsburgh, Vegas, and Singapore) span a wide spectrum of driving conditions. Everybody knows Boston, which is pretty bad. Vegas is warm weather, mid-density urban, but it's Vegas, so all kinds of stuff. And then Singapore is interesting: perfect infrastructure, good weather, flat, and people generally obey the rules. So it's kind of close to the ideal case. Exposure to that spectrum of data is pretty valuable. I'll speak for Oscar on that, and I know it's quite valuable for other parts of the development team too.
Oscar Beijbom: Singapore is ideal, except for the constant construction zones. Every time you drive out there's a new construction zone, so we've been forced to do a lot of work in construction zone detection in Singapore. And the torrential rain. And the jaywalkers — people don't really obey the rules there. So other than that, it's perfect.
Audience member: Which country is fully equipped? Which is the ideal market?
Karl Iagnemma: That's a really good question. It's interesting because there are other dimensions. When we look at which countries are interesting to us as a market, there's the infrastructure conditions, the driving patterns and properties, the density — is it Times Square at rush hour or is it Dubuque, Iowa? There's the regulatory environment, which is incredibly important. You may have a perfectly well-suited city from a technical perspective and they may not allow you to drive there. So it's really all of these things put together. We have a matrix where we analyse which cities check these boxes, assign them scores, and then try to understand the economics of that market — does that city check all these boxes but nobody there is using mobility services? Is there no opportunity to actually generate revenue from the service? You factor in all of those things.
Oscar Beijbom: And I think one thing to keep in mind, and this is always the first thing I tell candidates when I interview them, is that there's a huge advantage to the business model we're proposing. Having a service means we can choose, even if we commit to some city, to select the routes we feel comfortable with and roll it out piece by piece. We can say, "We don't feel comfortable driving at night in this city yet, so we just won't accept any rides at night." So there's that decision space as well.
Audience member: Thank you very much for the talk. I was comparing your PointPillars approach to the earlier voxel-based approach. In VoxelNet you had a four-dimensional tensor, and in PointPillars you only have three dimensions — you're throwing away the Z, as I understood it. When you do that, are you concerned that you're losing information about potential occlusions or semi-occlusions?
Oscar Beijbom: I may have been a little bit sloppy there. We're certainly not throwing away the Z. What we're saying is that we're learning the embedding in the C dimension jointly with everything else. VoxelNet, if you want, felt the need to spoon-feed the network a little bit — to say, let's learn everything stratified in this high dimension, and then we'll have a second step where we learn to consolidate that into a single vector. We just said: why don't we learn those things together?
Audience member: Thanks for the talk. I have a question for Karl. You mentioned that if people make a change to the code, we might need another validation. I work in the nuclear power industry — we do nuclear power simulations, and when we make any change to our simulation code to commercialise it, we need to submit a request to the NRC, the Nuclear Regulatory Commission. In your opinion, do you think for self-driving we need a third-party validation body, or should it be self-certified?
Karl Iagnemma: That's a really good question, and I don't know the answer. I would not be surprised either way if the automotive industry ended up with third-party regulatory oversight or if it didn't. There are great precedents for what you just described — nuclear, aerospace — where there are external bodies with deep technical competence who can come in, do investigations, impose or advise on strict regulation, and define requirements for certification of various types. The automotive industry has largely been self-certifying. There's an argument, which is certainly not unreasonable, that you have a real alignment of incentive within the industry and with the public to be as safe as possible — simply put, the cost of a crash is enormous, economically, socially, and in every other way.
Whether it continues along that path, I couldn't tell you. It's an interesting space because the federal government is actually moving very quickly, but carefully, I would say, trying not to overstep or impose too much regulation on an industry that has never generated a dollar of revenue and is still quite nascent. If you had told me a few years ago that there would be very thoughtfully defined draft regulatory guidelines around this industry, I probably wouldn't have believed you. But in fact they exist: there's a third version that was released this summer by the Department of Transportation. So there's intense interest on the regulatory side. As for how far the process goes towards forming an external body, I think that really remains to be seen.
Audience member: Thanks for the insightful talk. Looking at this slide, I'm wondering how easy and effective your trained models are to transfer across different LiDARs, and whether you need specific training for specific LiDARs to work effectively — for example, if it is snowing.
Oscar Beijbom: The same rules apply to this method as to any other machine learning-based method: you want to have support in your training data for the situation you want to deploy in. If you have no snow in your training data, I wouldn't go and deploy this in snow.
One thing I like about working with LiDAR, having worked so much with vision, is that the LiDAR point cloud is really easy to augment and play around with. For example, if you want to be robust to some really rare events — let's say there's a piano on the road and you really want to detect that, but it's hard because you have very few examples of pianos on the road — if you think about augmenting your visual dataset with that data, it's actually quite tricky to have a photorealistic piano in your training data. But it is quite easy to do that in your LiDAR data. You have a 3D model of the piano, you have the model for your LiDAR, and you can get a pretty accurate, fairly realistic point cloud return from that.
So I like that part about working with LiDAR: you can augment and play around with it. In fact, one of the things we do when we train this model is copy and paste samples. We take a car that we saw yesterday, take the points that fell on that car, and just paste them into the current LiDAR sweep. You have to be a little bit careful, and this was actually proposed by a previous paper, but we found it was really useful. It sounds absurd, but it actually works, and it speaks to how easy it is to manipulate LiDAR data.
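The copy-and-paste augmentation can be sketched as follows. This is a simplified illustration rather than the training pipeline from the talk: objects are approximated by circles in the ground plane, and the overlap check is one assumed interpretation of "being a little bit careful". The function name and all values are hypothetical.

```python
import numpy as np

def paste_object(sweep, boxes, object_points, new_box):
    """Paste a stored object's points into a sweep if the spot is free.

    sweep:         (N, 3) points of the current LiDAR sweep.
    boxes:         (K, 3) existing labelled objects as (x, y, radius).
    object_points: (M, 3) points cropped from a previously seen object,
                   expressed relative to that object's centre.
    new_box:       (3,) placement as (x, y, radius).
    Returns the (possibly augmented) sweep and box list; both are returned
    unchanged if the new object would overlap an existing one.
    """
    gap = np.linalg.norm(boxes[:, :2] - new_box[:2], axis=1)
    if np.any(gap < boxes[:, 2] + new_box[2]):
        return sweep, boxes                      # overlap, skip this sample
    offset = np.array([new_box[0], new_box[1], 0.0])
    return np.vstack([sweep, object_points + offset]), np.vstack([boxes, new_box])

rng = np.random.default_rng(0)
sweep = rng.uniform(-20.0, 20.0, size=(200, 3))
boxes = np.array([[5.0, 0.0, 2.5]])              # one labelled car already present
car_points = rng.normal(scale=0.5, size=(30, 3)) # points saved from an earlier sweep

# Far enough from the existing car: the sample is pasted.
aug_sweep, aug_boxes = paste_object(sweep, boxes, car_points, np.array([10.0, 8.0, 2.5]))
# Too close to the existing car: the sample is rejected.
same_sweep, same_boxes = paste_object(sweep, boxes, car_points, np.array([6.0, 1.0, 2.5]))
```

The pasted points and the new label are added together, so the detector sees a consistent extra training example.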