Identifying valid and invalid problem-solving behaviors

In this section, we investigate several showcase examples that demonstrate how explanation methods such as LRP and SpRAy can be used to understand and validate the behavior of a learned model.

First, we provide an example where the learning machine exploits an unexpected spurious correlation in the data to exhibit what humans would refer to as “cheating”. The first learning machine is a model based on Fisher vectors (FV)31,32 trained on the PASCAL VOC 2007 image dataset33 (see Supplementary Note 5). The model and its competitor, a pretrained deep neural network (DNN) that we fine-tune on PASCAL VOC, both show excellent state-of-the-art test set accuracy on categories of this benchmark such as ‘person’, ‘train’, ‘car’, or ‘horse’ (see Supplementary Table 3). Inspecting the basis of the decisions with LRP, however, reveals substantial divergence for certain images: the heatmaps exposing the reasons for the respective classification could not be more different. Clearly, the DNN’s heatmap points at the horse and rider as the most relevant features (see Supplementary Figure 11). In contrast, the FV’s heatmap is focused mostly on the lower left corner of the image, which contains a source tag. A closer inspection of the dataset (9963 samples33), which humans typically never look through exhaustively, shows that such source tags appear distinctively on horse images; a striking artifact of the dataset that had so far gone unnoticed34. The FV model has thus ‘overfitted’ the PASCAL VOC dataset by relying mainly on the easily identifiable source tag, which incidentally correlates with the true features, a clear case of ‘Clever Hans’ behavior. This is confirmed by observing that artificially removing the source tag from horse images significantly weakens the FV model’s decision, while the decision of the DNN stays virtually unchanged (see Supplementary Figure 11). If we instead take a correctly classified image of a Ferrari and add a source tag to it, the FV’s prediction swiftly changes from ‘car’ to ‘horse’ (cf. Fig. 2a), a clearly invalid decision (see Supplementary Note 5 and Supplementary Figures 12–17 for further examples and analyses).
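For readers who wish to see the mechanics behind such heatmaps, the following minimal sketch illustrates the LRP-ε redistribution rule for a simple fully connected ReLU network in plain NumPy. The function name, the `weights`/`biases` lists, and all parameter values are illustrative assumptions; the models analyzed in this study (FV classifiers and convolutional DNNs) require correspondingly adapted propagation rules.

```python
# A minimal sketch of the LRP-epsilon rule for a fully connected ReLU network.
# Names and parameters are illustrative; this is not the paper's actual code.
import numpy as np

def lrp_epsilon(weights, biases, x, target, eps=1e-6):
    """Redistribute the score of class `target` back onto the input pixels."""
    # Forward pass, keeping the activations of every layer.
    activations = [x]
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, W @ x + b)          # ReLU layer
        activations.append(x)

    # Start with all relevance on the output neuron of the class of interest.
    R = np.zeros_like(activations[-1])
    R[target] = activations[-1][target]

    # Backward pass: redistribute relevance proportionally to each input's
    # contribution z_ij = a_i * w_ij to the pre-activation z_j.
    for W, b, a in zip(reversed(weights), reversed(biases),
                       reversed(activations[:-1])):
        z = W @ a + b                            # pre-activations z_j
        z += eps * np.where(z >= 0, 1.0, -1.0)   # epsilon stabilizer
        s = R / z                                # relevance per unit activation
        R = a * (W.T @ s)                        # R_i = a_i * sum_j w_ij * s_j
    return R                                     # one relevance score per pixel
```

Reshaped to the input image dimensions, the returned relevance scores yield heatmaps of the kind shown in Supplementary Figure 11, with positive relevance marking pixels that speak for the predicted class.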

The second showcase example studies neural network models (see Supplementary Figure 2 for the network architecture) trained to play Atari games, here Pinball. As shown in ref. 5, the DNN achieves excellent results beyond human performance. As for the previous example, we construct LRP heatmaps to visualize the DNN’s decision behavior in terms of pixels of the pinball game. Interestingly, after extensive training, the heatmaps become focused on a few pixels representing high-scoring switches and lose track of the flippers. A subsequent inspection of the games in which these particular LRP heatmaps occur reveals that the DNN agent first moves the ball into the vicinity of a high-scoring switch without using the flippers at all, and then “nudges” the virtual pinball table such that the ball triggers the switch indefinitely by passing over it back and forth, without causing a tilt of the pinball table (see Fig. 2b and Supplementary Figure 3 for the heatmaps showing this point, and also Supplementary Movie 1). Here, the model has learned to abuse the “nudging” threshold implemented through the tilting mechanism in the Atari Pinball software. From a pure game scoring perspective, it is indeed a rational choice to exploit any game mechanism that is available. In a real pinball game, however, the player would likely go bust, since the pinball machinery is programmed to tilt after a few strong movements of the whole physical machine.

The above cases exemplify our point that even though the test set error may be very low (or game scores very high), the reason for this may be what humans would consider cheating rather than valid problem-solving behavior. Such performance may not carry over to a real-world environment, or to an evaluation that incorporates other criteria (e.g. social norms which penalize such behavior35). The explanations computed by LRP have been instrumental in identifying this fine difference.

Let us consider a third example, where we can observe the learning of strategic behavior particularly well: a DNN playing the Atari game of Breakout5 (see Supplementary Table 2 for the investigated network architectures). We analyze the learning progress and inspect the heatmaps of a sequence of DNN models in Fig. 2c. The heatmaps reveal conspicuous structural changes during the learning process. In the first learning phase the DNN focuses on ball control; subsequently the paddle becomes salient as the DNN learns to target the ball; and in the final learning phase the DNN focuses on the corners of the playing field (see Fig. 2c). At this stage, the machine has learned to dig tunnels at the corners (also observed in ref. 5), a very efficient strategy also used by human players. Detailed analyses of the heatmaps over the course of a single game, and a comparison of LRP to sensitivity analysis explanations, can be found in Supplementary Figures 4–10 and in Supplementary Movie 2. Here, this objectively measurable advancement clearly indicates the unfolding of strategic behavior.

Overall, while in each scenario reward maximization, together with a certain degree of incorporated prior knowledge, has done the essential part of inducing complex behavior, our analysis has made explicit that (1) some of these behaviors incorporate strategy, (2) some of these behaviors may or may not be human-like, and (3) in some cases, the behaviors could even be considered deficient and unacceptable when considering how they will perform once deployed. Specifically, the FV-based image classifier is likely to fail to detect horses in real-world data, and the Atari Pinball AI might perform well for some time, until the game is updated to prevent excessive nudging.

All insights about classifier behavior obtained up to this point in the study require the analysis of individual heatmaps by human experts, a laborious and costly process that does not scale well.

Whole-dataset analysis of classification behavior

Our next experiment uses SpRAy to comprehend the prediction behavior of the classifier on large datasets in a semi-automated manner. Figure 3a displays the results of the SpRAy analysis when applied to the horse images of the PASCAL VOC dataset (see also Supplementary Figures 19 and 20). Four different strategies can be identified for classifying images as “horse”: (1) detect a horse and rider (Fig. 3b), (2) detect a source tag in portrait-oriented images (Fig. 3c), (3) detect wooden hurdles and other contextual elements of horseback riding (Fig. 3d), and (4) detect a source tag in landscape-oriented images (Fig. 3e). Thus, without any human interaction, SpRAy provides a summary of which strategies the classifier actually implements to classify horse images; a minimal sketch of the pipeline is given below. An overview of the FV and DNN strategies for the other classes and for the Atari Pinball and Breakout games can be found in Supplementary Figures 23–25 and 30–32, respectively.
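The following sketch outlines the three SpRAy steps used above: preprocessing of the relevance maps, spectral clustering, and t-SNE visualization. The concrete parameter values, the downsampling resolution, and the use of SciPy/scikit-learn are illustrative assumptions rather than the exact experimental setup.

```python
# A minimal sketch of the SpRAy pipeline under stated assumptions: `heatmaps`
# is an (n_images, H, W) array of LRP relevance maps for one class.
import numpy as np
from scipy.ndimage import zoom
from sklearn.cluster import SpectralClustering
from sklearn.manifold import TSNE

def spray(heatmaps, side=16, n_clusters=4, n_neighbors=10):
    # Step 1: bring all relevance maps to a common low resolution and flatten,
    # so that the maps become comparable feature vectors.
    n, h, w = heatmaps.shape
    feats = np.stack([zoom(m, (side / h, side / w)) for m in heatmaps])
    feats = feats.reshape(n, -1)

    # Step 2: spectral clustering on a k-nearest-neighbor affinity graph
    # groups images whose prediction strategies (heatmaps) resemble each other.
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors",
        n_neighbors=n_neighbors, random_state=0,
    ).fit_predict(feats)

    # Step 3: a 2-D t-SNE embedding of the same features supports the visual
    # inspection of the cluster structure (cf. Fig. 3a).
    embedding = TSNE(n_components=2, random_state=0).fit_transform(feats)
    return labels, embedding
```

Inspecting a few representative heatmaps per cluster label then suffices to name each strategy, as done for the clusters in Fig. 3b–e above.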

Fig. 3 The workflow of spectral relevance analysis. a First, relevance maps are computed for data samples and object classes of interest, which requires a forward and an LRP backward pass through the model (here a Fisher vector classifier). Then, an eigenvalue-based spectral cluster analysis is performed to identify different prediction strategies within the analyzed data. Visualizations of the clustered relevance maps and cluster groupings supported by t-SNE inform about the valid or anomalous nature of the prediction strategies. This information can be used to improve the model or the dataset. Four different prediction strategies can be identified for classifying images as “horse”: b detect a horse (and rider), c detect a source tag in portrait-oriented images, d detect wooden hurdles and other contextual elements of horseback riding, and e detect a source tag in landscape-oriented images.

The SpRAy analysis furthermore revealed another ‘Clever Hans’-type behavior in our fine-tuned DNN model, which had gone unnoticed in the previous manual analysis of the relevance maps. The large eigengaps in the eigenvalue spectrum of the DNN heatmaps for the class “aeroplane” indicate that the model uses very distinct strategies for classifying aeroplane images (see Supplementary Figure 23); a sketch of this eigengap computation is given below. A t-SNE visualization (Supplementary Figure 25) further highlights this cluster structure. One unexpected strategy we could discover with the help of SpRAy is to identify aeroplane images by looking at the artificial padding pattern at the image borders, which for aeroplane images predominantly consists of uniform, structureless blue background. Note that padding is typically introduced for technical reasons (the DNN model only accepts square-shaped inputs), but unexpectedly (and unwantedly) the padding pattern became part of the model’s strategy to classify aeroplane images. Subsequently, we observe that changing the manner in which padding is performed has a strong effect on the output of the DNN classifier (see Supplementary Figures 26–29).
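The eigengap heuristic referred to above can be sketched as follows. Unusually large gaps between consecutive eigenvalues of the graph Laplacian indicate well-separated clusters, i.e. distinct prediction strategies. The Laplacian normalization and the number of inspected eigenvalues are assumptions chosen for illustration.

```python
# A hedged sketch of the eigengap heuristic: `affinity` is assumed to be a
# symmetric k-NN affinity matrix over the (flattened) heatmap features.
import numpy as np

def eigengaps(affinity, k=10):
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Symmetric normalized graph Laplacian: L = I - D^(-1/2) A D^(-1/2)
    L = np.eye(len(d)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)[:k]   # k smallest eigenvalues (ascending)
    return np.diff(eigvals)               # gaps lambda_{i+1} - lambda_i
```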

We note that while recent methods (e.g. ref. 36) have characterized whole-dataset classification behavior based on decision similarity (e.g. cross-validation-based AP scores or recall), the SpRAy method can pinpoint divergent classifier behavior even when the predictions look the same. The distinctive strength of SpRAy over previous approaches is thus its ability to ground predictions in input features, so that classification behavior can be characterized more finely. A comparison of both approaches is given in Supplementary Note 6 and Supplementary Figures 21 and 22.