We ourselves have been equally guilty of speculation disguised as explanation. In [72], JS writes that "the high dimensionality and abundance of irrelevant features... give the attacker more room to construct attacks", without conducting any experiments to measure the effect of dimensionality on attackability. And in [71], JS introduces the intuitive notion of coverage without defining it, and uses it as a form of explanation, e.g.: "Recall that one symptom of a lack of coverage is poor estimates of uncertainty and the inability to generate high precision predictions." Looking back, we desired to communicate insufficiently fleshed out intuitions that were material to the work described in the paper, and we were reticent to label a core part of our argument as speculative.

In contrast to the above examples, [69] separates speculation from fact. While this paper, which introduced dropout regularization, speculates at length on connections between dropout and sexual reproduction, a designated “Motivation” section clearly quarantines this discussion. This practice avoids confusing readers while allowing authors to express informal ideas.

In another positive example, [3] presents practical guidelines for training neural networks. Here, the authors carefully convey uncertainty. Instead of presenting the guidelines as authoritative, the paper states: "Although such recommendations come... from years of experimentation and to some extent mathematical justification, they should be challenged. They constitute a good starting point... but very often have not been formally validated, leaving open many questions that can be answered either by theoretical analysis or by solid comparative experimental work".

3.2 Failure to Identify the Sources of Empirical Gains

The machine learning peer review process places a premium on technical novelty. Perhaps to satisfy reviewers, many papers emphasize both complex models (addressed here) and fancy mathematics (see §3.3). While complex models are sometimes justified, empirical advances often come about in other ways: through clever problem formulations, scientific experiments, optimization heuristics, data preprocessing techniques, extensive hyper-parameter tuning, or by applying existing methods to interesting new tasks. Sometimes a number of proposed techniques together achieve a significant empirical result. In these cases, it serves the reader to elucidate which techniques are necessary to realize the reported gains.

Too frequently, authors propose many tweaks absent proper ablation studies, obscuring the source of empirical gains. Sometimes just one of the changes is actually responsible for the improved results. This can give the false impression that the authors did more work (by proposing several improvements), when in fact they did not do enough (by not performing proper ablations). Moreover, this practice misleads readers to believe that all of the proposed changes are necessary.
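
To make the practice concrete, here is a minimal leave-one-out ablation sketch in Python. The component names and the train_and_eval stand-in are hypothetical illustrations, not any particular paper's pipeline; in real use, train_and_eval would train the model with only the enabled tweaks and return a validation metric.

```python
# Minimal leave-one-out ablation sketch. COMPONENTS and train_and_eval are
# hypothetical stand-ins for a paper's proposed tweaks and its full
# training-and-evaluation pipeline.
COMPONENTS = ["new_attention", "aux_loss", "data_augmentation"]

def train_and_eval(enabled):
    # Toy scoring function for illustration only: here we pretend that just
    # one of the three proposed tweaks ("aux_loss") carries the gain.
    return 0.80 + (0.05 if "aux_loss" in enabled else 0.0)

def leave_one_out_ablation():
    full = train_and_eval(set(COMPONENTS))
    print(f"full model: {full:.3f}")
    for component in COMPONENTS:
        # Disable one proposed tweak at a time; the drop relative to the
        # full model attributes (or fails to attribute) the gain to it.
        score = train_and_eval(set(COMPONENTS) - {component})
        print(f"without {component}: {score:.3f} (delta {full - score:+.3f})")

leave_one_out_ablation()
```

Even this crude loop makes the attribution explicit: in the toy scenario above, only removing aux_loss moves the metric, revealing that the other two "improvements" are inert.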

Recently, Melis et al. [54] demonstrated that a series of published improvements, originally attributed to complex innovations in network architectures, were actually due to better hyper-parameter tuning. On equal footing, vanilla LSTMs, hardly modified since 1997 [32], topped the leaderboard. The community may have benefited more by learning the details of the hyper-parameter tuning without the distractions. Similar evaluation issues have been observed for deep reinforcement learning [30] and generative adversarial networks [51]. See [68] for more discussion of lapses in empirical rigor and resulting consequences.
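
One reading of [54] is procedural: before attributing a gain to a new architecture, give the baseline the same hyper-parameter search budget. Below is a minimal sketch of such an equal-budget comparison via random search; the search space and the user-supplied train_and_eval function are assumptions for illustration, not the protocol of [54].

```python
import random

# Illustrative search space; the real ranges are task-specific, and these
# three hyper-parameters are assumptions for the sketch.
SEARCH_SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-4, -1),
    "dropout":       lambda: random.uniform(0.0, 0.7),
    "hidden_size":   lambda: random.choice([256, 512, 1024]),
}

def random_search(train_and_eval, budget=50, seed=0):
    """Sample `budget` configurations and return (best_score, best_config).
    train_and_eval is user-supplied: it trains a model under one
    configuration and returns a validation score."""
    random.seed(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(budget):
        config = {name: sample() for name, sample in SEARCH_SPACE.items()}
        score = train_and_eval(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

# The key discipline is symmetry: the baseline and the proposed model get
# the same search budget (train_baseline/train_proposed are hypothetical).
# baseline_best = random_search(train_baseline, budget=50)
# proposed_best = random_search(train_proposed, budget=50)
```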

In contrast, many papers perform good ablation analyses [41, 45, 77, 82], and even retrospective attempts to isolate the source of gains can lead to new discoveries [10, 65]. That said, ablation is neither necessary nor sufficient for understanding a method, and can even be impractical given computational constraints. Understanding can also come from robustness checks (as in [15], which discovers that existing language models handle inflectional morphology poorly) as well as from qualitative error analysis [40].
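
A robustness check in this spirit needs no new algorithm: evaluate the same model on systematically perturbed inputs and compare. The sketch below assumes a generic model.predict interface and a user-supplied, label-preserving perturbation (for a morphology probe in the spirit of [15], one could imagine re-inflecting words); both names are assumptions for illustration.

```python
def accuracy(model, examples):
    """Fraction of (input, label) pairs the model gets right."""
    correct = sum(model.predict(x) == y for x, y in examples)
    return correct / len(examples)

def robustness_check(model, examples, perturb):
    """Compare accuracy on original vs. perturbed inputs. `perturb` is a
    label-preserving transformation of the input (assumed here); a large
    gap suggests the model relies on surface patterns rather than the
    capability the benchmark is meant to test."""
    original = accuracy(model, examples)
    perturbed = accuracy(model, [(perturb(x), y) for x, y in examples])
    return original, perturbed
```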

Empirical study aimed at understanding can be illuminating even absent a new algorithm. For instance, probing the behavior of neural networks led to identifying their susceptibility to adversarial perturbations [74]. Careful study also often reveals limitations of challenge datasets while yielding stronger baselines. [11] studies a task designed for reading comprehension of news passages and finds that 73% of the questions can be answered by looking at a single sentence, while only 2% require looking at multiple sentences (the remaining 25% of examples were either ambiguous or contained coreference errors). In addition, simpler neural networks and linear classifiers outperformed complicated neural architectures that had previously been evaluated on this task. In the same spirit, [80] analyzes and constructs a strong baseline for the Visual Genome Scene Graphs dataset.