podProse

Podcast transcripts, polished for reading

MIT Self-Driving Cars (2018) | Lex Fridman Transcript

MIT lecture on autonomous vehicle technology, approaches, and AI challenges

Lex Fridman delivers a 2018 MIT lecture on deep learning for self-driving cars.

Summary

Lex Fridman delivers a lecture for his MIT course on deep learning for self-driving cars, covering the societal promise and risks of autonomous vehicles, the technical taxonomy of autonomy levels, sensor technologies, and the role of AI in solving key perception and control problems. He argues that the field is divided into two fundamental approaches — human-centered autonomy and full autonomy — and makes the case that human-centered systems are underappreciated and more immediately viable. He presents data from his own research group's instrumented fleet of 25 vehicles, including 21 Tesla Autopilot cars, which has collected over 300,000 miles and five billion video frames of real-world driving data. He cites Rodney Brooks's prediction that a fully driverless taxi service in a major US city will not arrive before 2032, and argues that full autonomy, while the goal, remains decades away for the most challenging real-world scenarios.

Key Takeaways

Human-centered autonomy is undervalued. Fridman argues that while most researchers and guest speakers in the field focus on full autonomy, the human-centered approach — where AI and human share control — is more immediately deployable and may be more effective than commonly assumed, provided the human-robot interaction problem is solved correctly.

Real-world data challenges the "automation complacency" narrative. Decades of literature from aviation and robotics predict that drivers will become dangerously disengaged when automation takes over. Fridman's own dataset of 300,000+ miles and five billion video frames from instrumented Tesla Autopilot vehicles shows that driver glance behavior does not significantly change between manual and autonomous driving — a finding that challenges the dominant view.

Full autonomy is much further away than public claims suggest. Fridman endorses Rodney Brooks's prediction of no earlier than 2032 for a fully driverless taxi service in a major US city, and argues that any company promising full autonomy more than one year out should be treated with significant skepticism. Waymo's driverless Phoenix ride in November 2017 is acknowledged as a genuine milestone, but is noted to be heavily constrained.

The SAE levels of autonomy are useful for policy but not for engineering. Fridman argues that the six-level SAE taxonomy obscures the real engineering distinction, which is simply between systems where a human is in the loop and systems where the AI is fully responsible. The difference matters enormously for liability, safety design, and what "failure" means.

Camera sensors are the most promising for deep learning, but lidar dominates current full-autonomy systems. Cameras are cheap, high-resolution, and generate the kind of rich data that deep learning thrives on. Lidar is expensive but highly accurate. The question of which sensor paradigm will win — Tesla's camera-first approach versus Waymo's lidar-first approach — remains open, with solid-state lidar potentially changing the calculus.

Driver state monitoring is a critical and underappreciated problem. For human-centered autonomy to work safely, the AI must perceive everything about the driver — glance direction, drowsiness, cognitive load, emotional state, hand position — and communicate clearly when it needs the human to re-engage. Fridman's group uses computer vision and deep learning on driver-facing cameras to extract these signals at scale.

The ethical and security challenges of autonomous vehicles are real but often poorly framed. A vehicle controlled by software can be hacked; an AI making life-or-death decisions operates as a black box; and the objective function driving a fully autonomous vehicle must encode human values that are not yet well defined. These concerns are not imminent crises but are important long-term design problems.

Technology adoption is accelerating. Historical data on adoption curves for electricity, cars, radio, and the telephone shows that the time from introduction to widespread adoption is shrinking with each successive technology — meaning a breakthrough solution, if it arrives, could propagate through society far faster than past predictions would suggest.

FULL TRANSCRIPT

Introduction: The Utopian and Dystopian Views of Autonomous Vehicles

Lex Fridman: Welcome back to 6.S094, Deep Learning for Self-Driving Cars. Today we will talk about autonomous vehicles, also referred to as driverless cars, autonomous cars, and robocars.

First, the utopian view. For many people, autonomous vehicles have the opportunity to transform our society in a positive direction. 1.3 million people die every year in automobile crashes globally. Thirty-five to forty thousand die every year in the United States. The opportunity that's huge — one of the biggest focuses for us here at MIT, for people who truly care about this — is to design autonomous systems, artificial intelligence systems, that save lives. Those systems help work with, deal with, or take away what NHTSA calls the four D's of human folly: drunk, drugged, distracted, and drowsy driving. Autonomous vehicles have the ability to eliminate drunk driving, distracted driving, drowsy driving, and drugged driving.

Eliminating car ownership — taking shared mobility to another level — is another opportunity. From the business side, it's the opportunity to save people money and increase mobility and access. Making vehicles available without ownership makes them more accessible because the cost of getting from point A to point B drops by an order of magnitude. The insertion of software and intelligence into vehicles makes the idea of transportation, the way we see moving from point A to point B, a totally different experience. Much like with our smartphones, it makes it a personalized, efficient, and reliable experience.

Now for the negative view, the dystopian view. Any technology throughout the history of human civilization has always created fear that jobs relying on the prior technology will be lost. This is a huge fear, especially in trucking, because so many people in the United States and across the world work in the transportation sector. The possibility that AI will remove those jobs has potentially catastrophic consequences.

The idea we have to struggle with in the 21st century is the role of intelligent systems that aren't human beings being further and further integrated into our lives. Even if failures of an autonomous vehicle are much rarer, even if the vehicles are much safer overall, there remains the possibility that an AI algorithm designed by probably one of the engineers in this room will kill a person who would not have died if they had been in control of the vehicle. The idea of an intelligent system in direct interaction with a human being killing that human being is one we have to struggle with at a philosophical, ethical, and technological level.

Artificial intelligence systems, both in popular culture and in our engineering work today, may not be ethically grounded at this time. Much of the focus of building these systems, as we'll talk about today and throughout this course, is on the technology: how do we make these things work? But of course, years or decades out, the ethical concerns start arising. For Rodney Brooks, one of the seminal figures from MIT, those ethical concerns will not be an issue for another several decades, at least five, but they're still important. They continue the same thought: what is the role of AI in our society when that car gets to make a decision about human life? What is it making that decision based on, especially when it's a black box? What is the ethical grounding of that system? Does it conform with our social norms, or does it go against them?

Security is definitely a big concern. A car that's not even AI-based — a car that's software-based — is becoming more and more reliant on software. Most of the cars on the road today are run by millions of lines of source code. The idea that those lines of source code, written again by some of the engineers in this room, get to decide the life of a human being means that a hacker from outside of the car can manipulate that code to also decide the fate of that human being. That's a huge concern from the engineering perspective.

The truth is somewhere in the middle. We want to find the best positive way we can build these systems to transform our society and improve the quality of life of everyone among us. But there's a grain of salt to the hype of autonomous vehicles. We have to remember, as we discussed in the previous lecture and as it will come up again and again, our intuition about what is difficult and what is easy for deep learning, for autonomous systems, is flawed. If we use ourselves as the example — human beings are extremely good at driving — our intuition has to be grounded in an understanding of what is the source of data, what is the annotation, what is the approach, what is the algorithm. You have to be careful about using our intuition, extending it decades out, and making predictions, whether toward the utopian or dystopian view.

As we'll talk about when discussing some of the advancements of companies working in this space today, you have to take what people say in the media, what the companies say, and what some of the speakers coming to this class say about their plans for the future and their current capabilities with a degree of skepticism. The guide I can provide is this: when there's a promise of a future technology or future vehicles that are two years out or more, that is a very doubtful prediction. One that is within a year still deserves skepticism. The real proof comes in actual testing on public roads, or most impressively, when it's available for consumer purchase.

I would like to use Rodney Brooks as the voice here, so it doesn't come from my mouth — but I happen to agree. His prediction is no earlier than 2032 for a driverless taxi service in a major US city that will provide arbitrary pickup and dropoff locations, fully autonomously. That's 14 years away. And by 2045, it will do so in multiple cities across the United States. Think about that. A lot of the engineers working in this space, a lot of folks actually building these systems, agree with this idea. That is the earliest I believe this will happen, and the earliest Rodney believes — but as all technophobes have been wrong, we could be wrong too.

Technology Adoption Curves

Lex Fridman: This is a map — a plot — with time on the x-axis throughout the 20th century, and the adoption rate on the y-axis from zero to 100%, of various technologies: from electricity to cars, to radio, the telephone, and so on. As we get closer to today, the technology adoption rate — the number of years it takes to go from zero to a hundred percent adoption — is getting shorter and shorter. As a society, we're better at throwing away the technology of old and accepting the technology of new. So if a brilliant idea to solve some of the problems we're discussing comes along, it could change everything overnight.

Levels of Autonomy

Lex Fridman: Let's talk about different approaches to autonomy. We'll talk about sensors afterwards, then companies and players in this space, and then AI and the actual algorithms and how they can help solve some of the problems of autonomous vehicles.

Here's a useful taxonomization of levels of autonomy — useful for initial discussion, for legal discussion, for policy making, and for blog posts and media reports. But it's not useful, I would argue, for the design and engineering of the underlying intelligence and the system viewed from a holistic perspective — creating an experience that is safe and enjoyable.

Let's go over those levels. This is presented by SAE report J3016, the most widely accepted taxonomization of autonomy. Level zero is no automation. Levels one and two are increasing levels of automation. Level one is cruise control. Level two is adaptive cruise control and lane keeping. Level three — I don't know what level three is. There are a lot of people who will explain that level three is conditional automation, meaning it's constrained to certain geographical locations. From an engineering perspective, I'm personally a little bit confused about where that stands. I'll try to redefine how we should view automation. Level four and level five are high and full automation. Level four is when the vehicle can drive itself fully for part of the time — there are certain areas in which it can take care of everything, no human interaction or input is required. Level five automation is when the car does everything.

I would argue that those levels aren't useful for designing systems that actually work in the real world. I would argue that there are two systems, but first a starting point: every system to some degree involves a human. It starts with manual control — a human getting in the car and a human electing to do something. That's the manual control. What we're talking about when the human engages the system, when the system is first available and the human chooses to turn it on, is when we have two AI systems: human-centered autonomy, when the human is needed and involved, and full autonomy, when AI is fully responsible for everything.

From the legal perspective, full autonomy means the car designer, the AI system, is liable and responsible. For human-centered autonomy, the human is responsible.

Human-Centered vs. Full Autonomy

Lex Fridman: What does this practically mean for human-centered autonomy? When human interaction is necessary, the question becomes: how often is the system available? Is it available only in certain traffic conditions, such as bumper-to-bumper traffic? Is it available on the highway? Is it sensor-based, as in the Tesla vehicle, meaning it is available whenever the visual characteristics of the scene make the vehicle confident enough to make control decisions?

The other factor — poorly discussed, and I think imprecisely discussed when it is — is the number of seconds given to the driver to take over. In the Tesla vehicle, in all vehicles on the road today, that time is zero. Zero seconds are guaranteed, zero seconds are provided. There is sometimes some room — sometimes it's hundreds of milliseconds, sometimes it's multiple seconds — but there's no standard for how many seconds you get to say "wake up, take control."

Teleoperation — something some of the companies are playing with — is when a human being remotely controls the vehicle when the on-board system is not able to handle the situation. That's a very interesting idea to explore. But for human-centered autonomy, all of those features are not required, not guaranteed. The human driver inside the car is always responsible at the end of the day. They must pay attention to a degree that allows them to take over when the system fails. And no matter what, under this level of autonomy, the system will fail at some point. This is a collaboration between human and robot — the system will fail, and the human has to catch it when it does.

Full autonomy means AI is fully responsible. Now, as some companies in their marketing material and PR side of things might present, there are significant degrees of autonomy. If you're talking about L3 or L4 or L5, you have to read between the lines. You're not allowed to call it full autonomy if a human is remotely operating the vehicle — a human is still in the loop, it's still a human-centered autonomy system. You don't get credit from the ten-second rule — just because you give the driver ten seconds to take control doesn't somehow remove liability. If you say "as an AI system I can't resolve this situation and you have ten seconds to take over," that's not good enough. The driver might be sleeping. The driver may have had a heart attack. Full autonomous systems must find safe harbor. They must get you from point A to point B. That point B might be your desired destination or might be a safe parking lot, but it has to bring you to a safe location.

This is a clear definition of the two systems. As far as our current conception of artificial intelligence in cars today, a human always overrides the AI system. The human gets to choose to take control — the AI can't prevent that — except when danger is imminent, meaning sudden crashes. We're not yet ready as a society for AI systems to say "no, you're drunk, you can't drive."

Beyond the traditional levels from zero to five: the starting point is level zero, no automation. Levels one, two, and three I would argue fall into human-centered autonomy systems, because they involve some degree of a human. Then L4 and L5, to some degree with some crossover, fall into full autonomy — even though with L4, with Waymo, Uber, GM Cruise, and others playing in the space, there's very often a human driver involved.
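
Illustrative sketch (not from the lecture): the regrouping described above can be written as a small lookup from SAE level to the two categories. The names `Autonomy` and `regroup_sae_level` are hypothetical, chosen only for this example, and the mapping ignores the crossover noted for L4, where a human safety driver is often still present.

```python
from enum import Enum


class Autonomy(Enum):
    NONE = "no automation"
    HUMAN_CENTERED = "human-centered autonomy"
    FULL = "full autonomy"


def regroup_sae_level(level: int) -> Autonomy:
    """Map an SAE J3016 level (0-5) onto the lecture's two-way split."""
    if level not in range(6):
        raise ValueError(f"SAE levels run from 0 to 5, got {level}")
    if level == 0:
        return Autonomy.NONE
    if level <= 3:
        # Levels 1-3 still involve a human to some degree.
        return Autonomy.HUMAN_CENTERED
    # Levels 4-5: the AI is treated as fully responsible.
    return Autonomy.FULL
```

Under this grouping, a level-two system such as adaptive cruise control with lane keeping maps to `Autonomy.HUMAN_CENTERED`, and the engineering question becomes whether a human is in the loop, not which of the six levels applies.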

One of the huge accomplishments of Waymo over the past month, an incredible accomplishment, is that in Phoenix, Arizona, they drove without a safety driver. There was no engineer or staff member there to catch the car. A human being who doesn't work for Google or Waymo got into that car and got from point A to point B without a safety driver. That particular trip was a fully autonomous trip. There's no human to catch the car. That is full autonomy.

So the two paths for autonomous systems: on the left in blue is human-centered autonomy, on the right is full autonomy. Blue represents what is easier from the artificial intelligence perspective, and red represents what is harder. Easier meaning we do not have to achieve 100% accuracy. Harder means everything that falls short of 100% accuracy, no matter how small, has the potential of costing human lives and huge amounts of money for companies.

Sensors: Radar, Lidar, Ultrasonic, and Camera

Lex Fridman: Let's discuss the sensors — the sources of raw data we'll get to work with. There are cameras — image sensors, RGB, infrared, visual data — there's radar and ultrasonic, and there's lidar. Let's discuss the strengths and weaknesses of each and how they can be integrated together through sensor fusion.

Radar is the old trusted friend, the sensor commonly available in most vehicles that have any degree of autonomy. It's cheap. Both radar, which works with electromagnetic waves, and ultrasonic, which works with sound waves, send a wave, let it bounce off obstacles, and knowing the speed of that wave, calculate the distance to the obstacle. Radar does extremely well in challenging weather — rain, snow. The downside is low resolution compared to the other sensors. But it is the most reliable and most used in the automotive industry today, and in sensor fusion it's always there.
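
Illustrative sketch (not from the lecture): the echo-ranging idea described above reduces to one line of arithmetic, since the wave travels to the obstacle and back. The function name and the speed-of-sound value (dry air at roughly 20 °C) are assumptions made for this example.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # radar: electromagnetic wave
SPEED_OF_SOUND_M_S = 343.0          # ultrasonic: assumed air at about 20 °C


def echo_distance_m(round_trip_s: float, wave_speed_m_s: float) -> float:
    """Distance to an obstacle from the echo's round-trip time.

    The wave travels out and back, so the one-way distance is
    half of speed * time.
    """
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    return wave_speed_m_s * round_trip_s / 2.0


# An ultrasonic echo after 10 ms and a radar echo after 1 microsecond.
ultrasonic_m = echo_distance_m(0.010, SPEED_OF_SOUND_M_S)
radar_m = echo_distance_m(1e-6, SPEED_OF_LIGHT_M_S)
```

The two example calls give about 1.7 m for the ultrasonic echo and about 150 m for the radar echo, which is in line with the short-range and under-200-meter roles the lecture assigns to those sensors.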

Lidar produces an extremely accurate depth map and a high-resolution map of the environment with 360-degree visibility. It has some of the big strengths of radar in terms of reliability but with much higher resolution and accuracy. The downside is cost. Lidar has been the successful source of ground truth — the reliable sensor relied upon by vehicles that don't care about cost.

Camera is the thing that most people here should be passionate about, because machine learning and deep learning have the most ability to have a significant impact there. Why? First, it's cheap, so it's everywhere. Second, it's the highest resolution, meaning the most densely packed amount of information — which means there's more information that can be learned and inferred to interpret the external scene. That's why it's the best source of data for understanding the scene. The other reason it's great for deep learning is the enormous amount of data involved. There are many orders of magnitude more data available for driving in camera visible light or infrared than there is in lidar. Our world is designed for visible light. Our eyes work in similar ways to cameras, at least crudely. The lane markings, traffic signs, traffic lights, other vehicles, pedestrians — all operate with each other in this RGB space in terms of visual characteristics. The downside is that cameras are bad at depth estimation. It's noisy and difficult even with stereo vision cameras to estimate depth relative to lidar. They're not good in extreme weather, and visible light cameras are not good at night.

Comparing the ranges: on the x-axis is range in meters, and on the y-axis is acuity, with ultrasonic, lidar, radar, and camera plotted. The range of cameras is the greatest. This is for clear, well-lit conditions — during the day, no rain, no fog. Lidar and radar have a smaller range, under 200 meters. Ultrasonic sensors, used mostly for parking assistance and blind spot warning, have terrible range — they're designed for extremely close, high-resolution distance estimation at very short distances.

Looking at clear dark conditions — a clear night, no rain — and heavy rain, snow, or fog: vision falls apart in terms of range and accuracy under dark conditions and in rain, snow, or fog. Radar, our old trusted friend, stays strong — the same range, just under 200 meters, at the same acuity. Same with sonar. Lidar works well at night but does not do well with rain, fog, or snow. That's one of the biggest downsides of lidar, other than cost.

Here's another interesting way to visualize this — a radar chart for each sensor, where the greater the radius of the blue, the more successful that sensor is at accomplishing that feature. For lidar: range is pretty good, not great; resolution is also pretty good; it works in the dark; it works in bright light; but it falls apart in snow; it does not provide color, texture, or contrast information; it's able to detect speed; but the sensor size is huge and the sensor cost is extremely expensive; and it doesn't do well in proximity, where ultrasonic shines.

Ultrasonic: does well in proximity detection; it's the cheapest sensor; the sensor size can be tiny; it works in snow, fog, and rain; but its resolution is terrible, its range is non-existent, and it's not able to detect speed.

Radar: able to detect speed; also cheap; also small; but the resolution is very low; and just like lidar, it's not able to provide texture, color, or contrast information.

Camera: sensor cost is cheap; sensor size is small; not good for close proximity; range is the longest of all; resolution is the best of all; it doesn't work in the dark; it works in bright light but not always — one of the biggest downfalls of camera sensors is sensitivity to lighting variation; it doesn't work in snow, fog, or rain, suffering much like lidar from that; but it provides rich, interesting visual information — the very kind that deep learning needs to make sense of this world.
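
Illustrative sketch (not from the lecture): the qualitative comparison above can be captured as a small lookup table and queried. The trait names and the helper `sensors_with` are hypothetical. Values are `True` or `False` where the lecture states them and `None` where it does not say.

```python
# True/False where the lecture states a trait; None where it does not say.
SENSOR_TRAITS = {
    "lidar": {
        "cheap": False, "works_in_dark": True, "works_in_snow_fog_rain": False,
        "detects_speed": True, "color_texture": False,
    },
    "ultrasonic": {
        "cheap": True, "works_in_dark": True, "works_in_snow_fog_rain": True,
        "detects_speed": False, "color_texture": None,
    },
    "radar": {
        "cheap": True, "works_in_dark": True, "works_in_snow_fog_rain": True,
        "detects_speed": True, "color_texture": False,
    },
    "camera": {
        "cheap": True, "works_in_dark": False, "works_in_snow_fog_rain": False,
        "detects_speed": None, "color_texture": True,
    },
}


def sensors_with(**required):
    """Names of sensors whose recorded traits match every requirement."""
    return sorted(
        name
        for name, traits in SENSOR_TRAITS.items()
        if all(traits.get(key) == value for key, value in required.items())
    )
```

For example, `sensors_with(works_in_snow_fog_rain=True, detects_speed=True)` returns only radar, while `sensors_with(color_texture=True)` returns only the camera. No single sensor covers every trait, which is the motivation for fusing them.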

So let's look at the cheap sensors — ultrasonic, radar, and cameras — which is one approach: putting a bunch of those in a car and fusing them together. The cost is low. When they're fused together, they complement each other's strengths. The question is whether the camera or lidar will win out for partial autonomy or full autonomy. At least under these considerations, the fusion of the cheap sensors can do as well as lidar. The open question is whether lidar in the future can become cheap and its range can increase, because then lidar could win out. Solid-state lidar and a lot of developments with startup lidar companies are promising to decrease the cost and increase the range of these sensors. But for now, we plow along with dedication on the camera front. The annotated driving data grows exponentially, more and more people are beginning to annotate and study the particular driving perception and control problems, and the very algorithms for supervised, semi-supervised, and generative networks that we use to work with this data are improving. It's a race, and of course radar and ultrasonic are there to help.
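
Illustrative sketch (not from the lecture): the lecture does not say how the cheap sensors are fused. As a stand-in, the example below uses inverse-variance weighting, a textbook way to combine independent, unbiased estimates of the same quantity so that the more certain sensor counts for more. The function name and the numbers are made up for illustration.

```python
def fuse_estimates(measurements):
    """Combine independent (value, variance) estimates of one quantity.

    Each estimate is weighted by 1 / variance, so a noisy sensor
    contributes less than a precise one.
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    if any(variance <= 0 for _, variance in measurements):
        raise ValueError("variances must be positive")
    weights = [1.0 / variance for _, variance in measurements]
    total = sum(weights)
    fused_value = sum(w * v for w, (v, _) in zip(weights, measurements)) / total
    fused_variance = 1.0 / total
    return fused_value, fused_variance


# Hypothetical distance-to-lead-vehicle readings in meters:
# radar is precise in range, camera depth is noisier.
radar = (50.0, 1.0)
camera = (53.0, 4.0)
distance_m, variance_m2 = fuse_estimates([radar, camera])
```

With these made-up readings the fused distance is 50.6 m with variance 0.8, closer to the radar value and more certain than either sensor alone. That is the sense in which fused sensors complement each other's strengths.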

Companies in the Autonomous Vehicle Space

Lex Fridman: Now for the companies playing in this space. Waymo, in April 2017, exited their extensive and impressive testing process and allowed the first public rider in Phoenix. In November 2017, they drove with no safety driver: the car truly achieved full autonomy, under a lot of constraints, but it's full autonomy. It's an incredible accomplishment for a company and for an artificial intelligence system, and an amazing step in the direction toward full autonomy, much sooner than people would otherwise predict. Four million miles driven autonomously by November 2017, and growing quickly. I say "full autonomous driving" cautiously, because most of those miles have a safety driver, so I would argue it's not full autonomy in the strictest sense. But however they define it, four million miles driven is incredible.

Uber, in terms of miles, is second on that list. They had driven two million miles autonomously by December 2017. The quiet player here — in terms of not making any declarations of being fully autonomous, just quietly driving in a human-centered way — is Tesla, with over one billion miles in Autopilot. Over 300,000 vehicles today are equipped with Autopilot technology, with the ability to control the car laterally and longitudinally. And if anyone believes the CEO of Tesla, there will be over one million such vehicles by the end of 2018. But no matter what, 300,000 is an incredible number, and one billion miles is an incredible number.

Autopilot was first released in September 2014, one of the first systems on the road to do so. In October 2016, Tesla decided to let go of the incredible work done by Mobileye, now Intel, who were designing their perception and control system, and I count myself as one of the skeptics of that decision. Tesla decided to let go of it completely and start from scratch using mostly deep learning methods, the Drive PX 2 system from Nvidia, and eight cameras. That's the kind of boldness, the kind of risk-taking that can come with naivety, but in this case it worked.

The Audi A8 system is going to be released at the end of 2018, and it's promising to be one of the first vehicles offering what they call L3.

The definition of L3 according to Thorsten Leonhard, the head of automated driving at Audi, applies when the function operates as intended. This L3 system is designed only for traffic jams, bumper-to-bumper traffic under 60 kilometers an hour. If the customer turns the traffic jam pilot on and uses it as intended, and the car was in control at the time of the accident, the driver goes to the insurance company and the insurance company will compensate the victims of the accident. Afterwards they come to Audi, and Audi will pay them. So that means the car is liable.

The problem is, under the definition of L2 or L3, there may be some truth to this being an L3 system. The important thing here is it's nevertheless deeply and fundamentally human-centered, because even as you see in the demonstration video, the car for a poorly understood reason transfers control to the driver — says "I can't take care of the situation, you take control." How much time do you have in terms of seconds before you really need to take over? Well, this is the new thing about level three. With level three, the system allows the driver time to take over vehicle control again — in this case up to ten seconds. So if the traffic jam situation clears up, or any failure in the system occurs, the system still needs to be able to drive automatically because the driver has this time to take over.

You might ask what's new about this. Why is Audi saying this is the first level three system worldwide on the market? When talking about these levels of automation, there's a classification which starts at level zero — the driver is doing everything, no assistance, nothing — and then it gradually becomes partly automated. When we're talking about assistance functions like lane-keeping and distance-keeping, we're talking about level two, which means the driver is obliged to permanently monitor the traffic situation, keep hands on the wheel even though there's support and assistance, and intervene immediately if anything is not quite right. You know that from lane assistance systems — when the steering is not perfectly in the right lane, you have to intervene and correct immediately. That is the main difference. Now we got a takeover request.

So let's talk about what that means. This is still a human-centered system. It still must solve the human-robot interaction problem. And there are many others playing in the space. On the full autonomy side: Waymo, Uber, GM Cruise, nuTonomy (the CTO of which will speak here on Tuesday), Optimus Ride, Renovo, Voyage (the CEO of which will speak here next Thursday), and Aurora (not listed, the founder of which will speak here next Friday). On the human-centered autonomy side, Tesla Autopilot has for several years been doing incredible work. The reason I am speaking about our work so much today is that we don't have any speakers representing that side, so I'm the speaker. We are also working with Volvo Pilot Assist. There are a lot of different approaches. Also interesting is the Audi traffic jam assist, the A8 being released at the end of this year. The Mercedes Drive Pilot in the E-Class is an interesting vehicle I got to drive quite a bit. The Cadillac Super Cruise in the CT6 is very much constrained geographically to highway driving. And the loudest, proudest of them all: George Hotz of comma.ai, with openpilot. Let's just leave that there.

Where AI Can Help: Key Problem Areas

Lex Fridman: So where can AI help? We'll get into the details in coming lectures on each individual component. I'd like to give some examples — the key problem spaces where we can use machine learning to solve from data.

The first is localization and mapping — being able to localize yourself in space, the very first question a robot needs to answer: where am I? The second is scene understanding — taking the scene in and interpreting it, detecting all the entities in the scene, detecting the class of those entities, in order to then do movement planning to move around those entities. And finally, driver state — an essential element for the human-robot interaction — perceiving everything about the driver, and everything about the pedestrians, cyclists, and cars outside: the human element of those, the human perception side.

Localization: Visual Odometry

Lex Fridman: First, the "where am I" question. Visual odometry uses a camera to localize yourself, to answer that question. The vision sensor is also the one most amenable to learning-based approaches, which is where deep learning comes in.

The traditional approaches use SLAM: detecting features in the scene and tracking them through time, from frame to frame. From the movement of those features, you're able to estimate the location and orientation of the vehicle or the camera. With stereo vision, those methods first take two camera streams and undistort them, then compute a disparity map from the different perspectives of the two cameras by matching between the two. Next comes feature detection, using non-deep-learning methods to extract strong, detectable features that can be tracked from frame to frame. Finally, those features are tracked and the trajectory and orientation of the camera are estimated. That's the traditional approach to visual odometry.
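
Illustrative sketch (not from the lecture): the last step of the pipeline above, turning frame-to-frame motion estimates into a trajectory, amounts to chaining relative motions. The example is simplified to a planar pose `(x, y, heading)`, whereas a real visual odometry system estimates full 3D position and orientation. The function names are hypothetical.

```python
import math


def compose(pose, motion):
    """Apply one frame-to-frame motion to a global pose.

    pose   = (x, y, theta) in the world frame.
    motion = (dx, dy, dtheta) expressed in the camera's current frame.
    """
    x, y, theta = pose
    dx, dy, dtheta = motion
    # Rotate the local displacement into the world frame, then add it.
    world_x = x + dx * math.cos(theta) - dy * math.sin(theta)
    world_y = y + dx * math.sin(theta) + dy * math.cos(theta)
    return (world_x, world_y, theta + dtheta)


def integrate(motions, start=(0.0, 0.0, 0.0)):
    """Chain relative motions into a list of global poses."""
    trajectory = [start]
    for motion in motions:
        trajectory.append(compose(trajectory[-1], motion))
    return trajectory


# Drive 1 m forward while turning left 90 degrees, then 1 m forward again.
path = integrate([(1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)])
```

Because each pose is built on the previous one, any error in a single frame-to-frame estimate carries forward into every later pose. This is why the quality of feature tracking matters so much in this kind of pipeline.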

In recent years, since 2015, but with most success in the last year, there have been end-to-end deep learning approaches using either stereo or monocular cameras. DeepVO is one of the most successful. The end-to-end method takes a sequence of images, extracts with a CNN the central features from each image, and then uses an RNN — a recurrent neural network — to track over time the trajectory and pose of the camera. Image to pose, end to end. Here's the visualization on the KITTI dataset using DeepVO — taking the video as input and estimating the position of the vehicle. In red is the estimate based on the CNN and RNN end-to-end approach; in blue is the ground truth in the KITTI dataset. This removes a lot of the modular parts of SLAM and visual odometry and allows it to be end-to-end, which means it's learnable, which means it gets better with data. That's huge.

Vision alone is one of the exciting opportunities for people working in AI: the ability to use a single sensor, perhaps the most inspiring because it is similar to our own eyes, as the primary sensor to control a vehicle. The fact that visible light is the modality most amenable to deep learning approaches makes this particularly exciting for deep learning research.

Scene Understanding

Lex Fridman: Scene understanding — of course one could do a thousand slides on this. Traditionally, object detection for pedestrians and vehicles used a bunch of different types of classifiers and feature extractions, Haar-like features and so on. Deep learning has basically taken over and dominated every aspect of scene interpretation, perception, understanding, tracking, recognition, classification, and detection problems.

And don't forget audio. We can use audio as a source of information — whether that's detecting honks, or in this case using the audio of the tires, with microphones on the tires, to determine road surface conditions. There's a spectrogram of the audio coming in — for those of you with a particularly tuned ear, you can listen to the different audio of wet road versus dry road after rain. There's no rain, but the road is nevertheless wet. Detecting that is extremely important for vehicles because they still have poor traction control, poor control in terms of the tire-road surface connection. Being able to detect that from just audio is a very interesting approach.
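A spectrogram like the one described is a sequence of short-time frequency spectra. The sketch below builds one with a naive discrete Fourier transform and computes a single hand-made feature, the share of energy above a chosen frequency bin. The feature and the idea of separating wet from dry roads with it are assumptions for illustration only; the lecture does not say which features or model were used, and the function names are hypothetical.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum of one audio frame (naive DFT, bins 0..n/2)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_size, hop):
    """List of magnitude spectra, one per frame of frame_size samples."""
    return [dft_magnitudes(signal[i:i + frame_size])
            for i in range(0, len(signal) - frame_size + 1, hop)]

def high_band_ratio(magnitudes, split_bin):
    """Fraction of spectral energy at or above split_bin (illustrative feature)."""
    energy = [m * m for m in magnitudes]
    total = sum(energy)
    return sum(energy[split_bin:]) / total if total else 0.0
```

In practice a fast Fourier transform from a numerical library would replace the naive DFT, and a learned model would replace the single hand-picked ratio.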

Movement Planning

Lex Fridman: For the perception and control side, movement planning — getting from point A to point B. Traditional approaches use optimization-based methods: determine the optimal control, formalize the problem in a way that's amenable to optimization, make the necessary assumptions, and then generate thousands or millions of possible trajectories with an objective function to determine which trajectory to take. Here's a race car optimizing how to take a turn at high speed.
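The generate-and-score idea can be sketched as follows. This is a toy version under stated assumptions: candidate paths are smooth lateral shifts over a fixed horizon, the objective penalizes only proximity to obstacles and deviation from the lane center, and only 13 candidates are sampled rather than thousands or millions. All names and weights are hypothetical.

```python
import math

def candidate_path(lateral_offset, length=20.0, steps=20):
    """Path that shifts smoothly from y = 0 to y = lateral_offset over `length` meters."""
    path = []
    for i in range(steps + 1):
        s = i / steps
        blend = 3 * s * s - 2 * s ** 3  # smoothstep: zero slope at both ends
        path.append((s * length, lateral_offset * blend))
    return path

def path_cost(path, obstacles, safety_radius=1.5,
              offset_weight=1.0, collision_weight=100.0):
    """Objective: stay near the lane center, keep clear of obstacles."""
    cost = offset_weight * abs(path[-1][1])
    for px, py in path:
        for ox, oy in obstacles:
            dist = math.hypot(px - ox, py - oy)
            if dist < safety_radius:
                cost += collision_weight * (safety_radius - dist)
    return cost

def plan(obstacles, offsets=None):
    """Score every candidate path and return the cheapest one."""
    if offsets is None:
        offsets = [i * 0.5 for i in range(-6, 7)]
    candidates = [candidate_path(d) for d in offsets]
    return min(candidates, key=lambda p: path_cost(p, obstacles))
```

With no obstacles the straight path wins. Placing an obstacle in the lane makes the planner pick the smallest lateral shift that clears the safety radius.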

With deep learning, reinforcement learning — the application of neural networks to reinforcement learning — is particularly exciting for both the control and the planning side. That's where two of the competitions we're doing in this class come into play: the simplistic two-dimensional world of DeepTraffic, and the high-speed, high-risk world of DeepCrash. We'll explore those in tomorrow's lecture on deep reinforcement learning.
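To make the reinforcement learning idea concrete, here is a minimal tabular Q-learning sketch. It is a deliberate simplification: deep reinforcement learning replaces the table with a neural network, and the five-cell corridor world below is hypothetical and unrelated to the DeepTraffic or DeepCrash environments.

```python
import random

def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                  epsilon=0.2, seed=0):
    """Learn action values for a corridor: start at cell 0, reward at the last cell."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Explore with probability epsilon, or when the values are tied.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            future = 0.0 if next_state == goal else max(q[next_state])
            # Q-learning update toward reward plus discounted best future value.
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
            if state == goal:
                break
    return q

def greedy_policy(q):
    """Best action per non-goal state (0 = left, 1 = right)."""
    return [0 if left > right else 1 for left, right in q[:-1]]
```

After training, the greedy policy moves right in every cell, because the discounted reward makes cells closer to the goal more valuable.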

Driver State Monitoring

Lex Fridman: Finally, driver state — detecting everything about the driver and then interacting with them. On the left in green are the easier problems; on the right in red are the harder problems, in terms of how amenable they are to deep learning methods.

Body pose estimation is a very well-studied problem. We have extremely good detectors for estimating the pose — the hands, elbows, shoulders, every visible aspect of the body. Head pose — the orientation of the head — we're extremely good at that. As we get smaller and smaller in terms of the region of interest, blink rate, blink duration, eye pose, and blink dynamics start getting more and more difficult. All of these metrics are extremely important for detecting things like drowsiness, or as components of detecting emotion, or where people are looking.
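Once an eye-state detector exists, metrics such as blink rate and blink duration are simple to derive. The sketch below assumes a hypothetical upstream detector that outputs one eye-openness value per video frame (1.0 fully open, 0.0 closed); the threshold and the function name are illustrative, not taken from the lecture.

```python
def blink_metrics(eye_openness, fps, closed_threshold=0.2):
    """Return (blinks per minute, mean blink duration in seconds)."""
    durations = []
    run = 0  # consecutive frames with the eye considered closed
    for value in eye_openness:
        if value < closed_threshold:
            run += 1
        elif run:
            durations.append(run / fps)
            run = 0
    if run:  # sequence ended while the eye was still closed
        durations.append(run / fps)
    minutes = len(eye_openness) / fps / 60.0
    rate = len(durations) / minutes if minutes else 0.0
    mean_duration = sum(durations) / len(durations) if durations else 0.0
    return rate, mean_duration
```

The hard part is the detector itself: the eye region is small, and its appearance changes sharply with real-world lighting.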

In driving, where your head is turned is not necessarily where you're looking. In regular life, when you look somewhere you usually turn your head to look with your eyes. In driving, your head often stays still or moves very subtly — your eyes do a lot more moving. It's the kind of effect we describe as the lizard-owl effect. Some fraction of people — a small fraction — are "owls," meaning they move their head a lot. Most people are "lizards," moving their eyes to allocate their attention. The problem with eyes is that from the computer vision perspective, they're much harder to detect under real-world lighting variation. That's where deep learning steps up and really helps with real-world data.

Cognitive load — we'll discuss estimating the cognitive load of the driver as well.

To give a quick clip: driver glance classification. The most important problem on the driver state side is determining whether they're looking on-road or off-road. It's the simplest but most important aspect: are they in the seat and looking at the road, or are they not? That's driver glance classification. Not estimating the X, Y, Z geometric orientation of where they're looking, but actually binary classification: on-road or off-road.

Body pose estimation — determining if the hands are on the wheel or not, determining if the body alignment is standard and good for seatbelt safety. This is one of the important things for autonomous vehicles: if there's imminent danger, the driver should be asked to return to a position that is safe for them in case of a crash.

Driver emotion: on the top is a satisfied driver, on the bottom is a frustrated driver. They self-reported these states. This is with a voice-based navigation system. One of the biggest sources of frustration for people in cars is voice-based navigation, trying to tell an AI system using your voice alone where you would like to go. One of the interesting things in our large dataset, from the affective computing perspective, is determining which features are most commonly associated with frustrated voice-based interaction. The answer is the smile. That's the counter-intuitive notion that emotion in the car is very context-dependent. Smiling is not necessarily a sign of happiness, and the stoic, bored look of the driver on top is not necessarily a reflection of unhappiness. He is indeed a ten out of ten in terms of satisfaction with the experience. He happens to be Dan Brown, one of the amazing engineers in our team.

Cognitive load: we estimate it from sequences of images of the eye region using 3D convolutional neural networks, looking at the blink dynamics and eye position to determine the cognitive load on a scale from zero to two, meaning how deep in thought you are.

The Case for Human-Centered Autonomy

Lex Fridman: Two paths to the autonomous future. I would like to, maybe for the last time but probably not, argue for the one on the left, because our brilliant, much smarter than me guest speakers will argue for the one on the right. The human-centered approach lets us work with 99% accuracy in localization, scene understanding, and movement planning. Those are the problems we're taking on in this class: the scene segmentation we'll talk about on Thursday, the control we'll talk about tomorrow, and the driver state we'll talk about next Wednesday. These problems can be solved with deep learning today.

The problems on the right — solving them to close to 100% accuracy — are extremely difficult and may be decades away. Because for full autonomy to be here, we have to solve situations like this. I've shown this many times — we have to solve this situation. A busy crosswalk where no autonomous vehicle will ever have a hope of getting through unless it asserts itself. There are a couple of vehicles here that nudge themselves through, or at least when they have the right of way don't hesitate when a pedestrian is present. An ambulance flying by — even if you use a trajectory and pedestrian intent modeling algorithm to predict the momentum of the pedestrian and estimate where they can possibly go, an autonomous vehicle would stop. But these vehicles don't stop. They assert themselves. They move forward.

Now for a full autonomy system, because it's taking full control and following a reward function, an objective function, all of the problems — the ethical and the AI problems — that arise, like the Coast Runner problem, will arise. We have to solve those problems. We have to design that objective function correctly.
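The Coast Runner problem refers to a boat-racing game agent that learned to circle and hit scoring targets instead of finishing the race. A toy calculation shows how a slightly wrong objective function produces that outcome. All reward values and action names below are hypothetical, chosen only to illustrate the gap between a proxy reward and the intended one.

```python
def total_reward(actions, reward_fn):
    """Sum of per-action rewards over a fixed-length episode."""
    return sum(reward_fn(action) for action in actions)

def proxy_reward(action):
    # What the system is actually optimized for: points on the scoreboard.
    return {"hit_target": 10, "advance": 1, "finish": 50}[action]

def intended_reward(action):
    # What the designer really wanted: finish the race.
    return {"hit_target": 0, "advance": 1, "finish": 100}[action]

# Two candidate behaviors over the same 21-step episode.
finish_race = ["advance"] * 20 + ["finish"]
loop_on_targets = ["hit_target"] * 21

for name, behavior in [("finish_race", finish_race),
                       ("loop_on_targets", loop_on_targets)]:
    print(name,
          "proxy:", total_reward(behavior, proxy_reward),
          "intended:", total_reward(behavior, intended_reward))
```

Under the proxy reward, looping on targets scores higher than finishing, so an optimizer will prefer it even though it is worthless under the intended reward.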

With that, I'd like to thank you and encourage you to come tomorrow, because you'll get a chance to participate in DeepTraffic, a deep reinforcement learning competition. Thank you very much.

Polished transcript of Lex Fridman. All views are those of the original speakers.

Published by @martymcfly
