Now Musk is touting yet another set of data that purports to prove that Autopilot doesn't prevent drivers from maintaining "a high degree of functional vigilance," seemingly invalidating the concerns articulated by the NTSB's investigation into Josh Brown's fatal crash and by other critics of Autopilot's safety. Once again, the latest "proof" of Autopilot's safety comes from a seemingly unimpeachable source, MIT researcher Lex Fridman, but, as in the past, a critical examination reveals that this latest study falls well short of proving that safety concerns about Autopilot are unfounded. If anything, it only proves how problematic an over-reliance on machine learning and an avoidance of the tricky human factors at the heart of human-automation interaction problems can be.

Before diving into the substance of Fridman's latest paper [PDF], which Musk recently quoted extremely selectively on Twitter as evidence of Autopilot's safety, a quick caveat is in order. The MIT "Autonomous Vehicle Technology Study" [PDF], which Fridman leads and which he drew on for his findings about Autopilot, does attempt to introduce new data into an important and complex area of study, and I would not want my criticisms of it to be read as totally invalidating the relevance of its data. Indeed, Fridman and his co-authors are very much upfront about some of its limitations, and it's not their fault that Musk quoted it so selectively and failed to mention that their own clear caveats caution against drawing sweeping conclusions from it about Autopilot's overall safety. Nor is it, strictly speaking, the fault of the MIT team that Musk has so regularly pushed misleading and scientifically unsound data in an attempt to prove the safety of Autopilot features that Tesla sold as "convenience features" prior to Josh Brown's death, although given Fridman's persistent focus on Tesla and Autopilot, this eventuality should have been easily predictable.

It's also important to acknowledge any factors that may affect my own perception of Fridman and his work, particularly in light of the fact that he has not been given space in this piece to respond (I do welcome any response he wishes to provide and would be happy to publish it). For one thing, he blocked me on Twitter some time ago in apparent response to my (admittedly superficial) criticism of what I saw as a simplistic and technocentric approach called "Deep Traffic" [PDF]. More bafflingly, Fridman also blocked NVIDIA's Director of Research Anima Anandkumar for suggesting that he pursue peer review, and Alex Roy for asking pointed questions about his research, all while soliciting notably pro-Tesla forums for their recommendations of "objective" journalists to cover this latest paper. Though I believe it is important to look at the substance of Fridman's work itself, these factors and other data points suggesting that he is biased toward pro-Tesla outcomes have undoubtedly affected my perspective on his work.

These concerns carry over into even the most high-level assessments of Fridman's work, starting with the questions he seeks to answer in his research. "We launched the MIT Autonomous Vehicle Technology (MIT-AVT) study," Fridman says, "to understand, through large-scale real-world driving data collection and large-scale deep learning based parsing of that data, how human-AI interaction in driving can be safe and enjoyable" [emphasis added]. This is echoed in Fridman's introduction to the Tesla Autopilot survey, in which he describes himself as "part of a team of engineers at MIT who are looking to understand and build semi-autonomous vehicles that save lives while creating an enjoyable driving experience." This goal fundamentally undercuts the study's pretensions of objectivity, particularly when seen in the context of Fridman's obvious fondness for Tesla and the problematic nature of the study's data collection.

At first glance this data collection regime seems as objective as can be: a set of cameras captures driver behavior and syncs it with data from the vehicle, allowing Fridman and his team to correlate conditions on the road with the driver's state and the Autopilot system's functions. Through this data collection, the authors sought to measure what they call "functional vigilance," which they describe as "incorporat[ing] both the ability of the driver to detect critical signals and the implicit ability of the driver to self-regulate when and where to switch from the role of manually performing the task to the role of supervising the automation as it performs the task." By comparing a driver's decision to take over control from Autopilot with road conditions, specifically what they call "tricky situations" that could theoretically result in a crash, the authors claim that their dataset (which notably had no crashes in it) demonstrated that "(1) drivers elect to use Autopilot for a significant percent of their driven miles and (2) drivers do not appear to over-trust the system to a degree that results in significant functional vigilance degradation in their supervisory role of system operation."

These data collection techniques themselves seem quite promising, given that driver awareness can be fairly well quantified via machine learning and the relative risk of road conditions can be more or less accurately assessed by human labelers. But the data they produce are undercut by unanswered questions about the pool of drivers they draw from. Short-term participants used MIT-owned vehicles, were screened by a criminal background search and a review of their driving record, and were prepared with more extensive training than a typical consumer receives when they purchase a car with Autopilot. This extensive screening and grooming of the MIT-AVT short-term study participants tilts data from the short-term portion of the study toward the outcome found in the Tesla-specific study, to an extent that the authors do not fully explore in their caveats.

Long-term participants used Tesla vehicles that they owned, and according to the authors were recruited "through flyers, social networks, forums, online referrals, and word of mouth." Both the participants in the instrumented study and the validating survey data were drawn in a decidedly non-random fashion from forums like Tesla Motors Club, which is notorious for having an intensely pro-Tesla culture and an active Tesla investor forum. This approach makes the MIT-AVT subject pool anything but a random sampling, especially given that a high percentage of Tesla owners (particularly those most likely to respond to solicitations in Tesla-centric online forums) also own stock in the company. Whether they own stock or are simply fans of the company, self-selected subjects solicited from this pool would be highly likely to "perform" high levels of safety for a study that they saw as validating one of Tesla's key competitive features. Again, the authors do not appear to have considered these factors in either the results of the Autopilot "functional vigilance" study or the MIT-AVT study more broadly.

These problems make it highly unlikely that the MIT-AVT study's data set is broadly representative of how Autopilot is used in the real world, undercutting the paper's claim that "this work takes an objective, data-driven approach toward evaluating functional vigilance in real-world AI-assisted driving by analyzing naturalistic driving data in Tesla vehicles" as well as its findings. On top of these specific experimental design issues, which are not directly addressed in any of the resulting papers, the authors do (to their credit) bring up a variety of other issues which further undercut the value of the study as a validation of Autopilot safety. "Subject sample characteristics and demographic generalizability" concerns are raised by the authors, but only in terms of the "tech savvy" nature of Tesla owners and the lack of "high risk demographics such as teenage drivers" in the study. There is also an acknowledgement that safety issues may creep in over time periods longer than those studied, as well as issues with the annotation of "tricky situations," the way "functional vigilance" was measured and this concept's applicability to actual safety.

Even more significantly, the authors admit that the imperfection of Autopilot's performance and experimentation by drivers outside of proven operational design domains (ODDs) may provide a level of familiarity with the system's shortcomings that contributes to the high level of "functional vigilance" seen in the study. Between over-the-air updates that improve system performance and the natural increase in user familiarity over time, it stands to reason that "functional vigilance" would decline over a longer time horizon, particularly among a more random sample of users who received no special screening or training. Such a decline in "functional vigilance" would be more in line with peer-reviewed research into human monitoring of automation, the real-world findings of autonomous vehicle developers like Waymo, the results of the NTSB's investigation into the Josh Brown crash, and even Elon Musk's own admission that