Two key benchmarks were developed and found to predict performance on the out-of-distribution test set. The first benchmark manually translates the validation-set spectra, which approximates the effects of calibration differences and chemical shift. The second adds and removes noise from the data by modeling the noise distribution with combinations of principal components and measuring the resulting change in accuracy.

Three types of neural network architectures are explored in this study. The dense network has 11,000 weights, the convolutional and dense network has 1,100 weights, and the fully convolutional architecture has 650 weights. The dense neural network is similar to the dense network used in Gallagher and Deacon's study1. More recent advances in neural networks were added to improve performance: neuron dropout to prevent overfitting10, batch normalization between hidden layers for activation normalization11, rectified linear unit (ReLU) activations for the hidden units, and a softmax activation function for the output neurons. The second architecture is a convolutional neural network feature extractor attached to a densely connected neural network, similar to the one used by Liu et al.6. The convolutional architecture was inspired by Liu et al.6, and the choice of the number of convolutional filters and layers was inspired by the feature extractors used in the ImageNet competition, such as the well-established AlexNet and VGG16 architectures12,13. The third architecture has the same convolutional feature extractor as previously described, but it is attached to a custom classification architecture inspired by the MobileNet and SqueezeNet architectures14,15.

A full description of the neural architectures can be found in the Methods section.

The graph of the fully convolutional neural network architecture can be found in Fig. 1. The feature extraction architecture is the same as that used in the convolutional and dense network. The output of each operation, visualized for the fully trained network on the training set, can be found in Fig. 1b–d.

Figure 1 (a) Schematic diagram of the neural network graph of the fully convolutional neural network. Bracketed quantities are the length of the spectrum after pooling. The red window is the kernel size of the convolution (see Methods for window sizes). (b–d) Activation outputs per feature extraction block and classification block averaged over all training examples of Mn2+, Mn3+, and Mn4+ respectively. The input spectra shown are the average of all spectra in each valence state. The output shown for the Classification block is the output after the final length-1 depth-wise 1D convolution operation (but before global average pooling).

In Fig. 1a, the input spectrum is passed through the 5 successive feature extraction blocks where each block contains a convolution, batch-normalization and down-sampling (1D average pooling). The output of the 5th block (orange) is considered to be a representation of the original input data that is optimized for discrimination (i.e. discriminative features). These features are then used to classify the valence of the inputted spectrum using the single classification block which contains: dropout, convolution (kernel size, 1), global average pooling, and softmax. A description of every operation and relevant parameters can be found in the Methods section.
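The feature-extraction block described above (convolution, batch normalization, ReLU, and 1D average pooling) can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the kernel size, filter count, and the single-example normalization used in place of true batch normalization are all assumptions for the sketch.

```python
import numpy as np

def conv1d(x, kernels):
    """'Valid' 1D convolution. x: (length, in_ch); kernels: (k, in_ch, out_ch)."""
    k = kernels.shape[0]
    n = x.shape[0] - k + 1
    out = np.empty((n, kernels.shape[2]))
    for i in range(n):
        # Contract over the window and input-channel axes.
        out[i] = np.tensordot(x[i:i + k], kernels, axes=([0, 1], [0, 1]))
    return out

def normalize(x, eps=1e-5):
    """Per-channel normalization (single-example stand-in for batch norm)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def avg_pool(x, size=2):
    """1D average pooling (down-sampling) along the length axis."""
    n = (x.shape[0] // size) * size
    return x[:n].reshape(-1, size, x.shape[1]).mean(axis=1)

def feature_block(x, kernels):
    """One feature-extraction block: convolution, normalization, ReLU, pooling."""
    return avg_pool(np.maximum(normalize(conv1d(x, kernels)), 0.0))

rng = np.random.default_rng(0)
spectrum = rng.normal(size=(300, 1))     # stand-in for a 300-bin input spectrum
block_out = feature_block(spectrum, rng.normal(size=(5, 1, 8)))
print(block_out.shape)                   # length shrinks, channel count grows
```

Stacking five such blocks reproduces the progressive down-sampling shown by the bracketed lengths in Fig. 1a.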

Data Collection

A total of 2001 electron energy-loss spectra of Mn2+, Mn3+, and Mn4+ were acquired using an FEI Titan transmission electron microscope under a variety of conditions. This includes 448 Mn2+ spectra, 765 Mn3+ spectra, and 788 Mn4+ spectra cropped between 635 and 665 eV (300 bins at a dispersion of 0.1 eV/bin). The microscope parameters and spectrum image preprocessing steps can be found in the Methods section.

To obtain spectra from a wide variety of instruments and resolutions and so demonstrate generalizability, reference spectra were digitized from three studies by Garvie et al., Zhang et al. and Tan et al.8,9,16. These spectra were not used for model training and instead serve as a withheld test set representing signals well outside of the distribution of the acquired data. They are of significantly higher resolution, and the differences in instrument calibration are clear, with the peak onsets differing as shown in Fig. 2. The energy range of the digitized spectra is between 635 and 658 eV because many of them were cropped for their respective publications.

Figure 2 (a) 12 digitized reference spectra of various Mn2+ compounds taken from published articles8,9,16 (black). 12 randomly selected MnO (Mn2+) spectra acquired for this study. (b) 10 digitized reference spectra of various Mn3+ compounds taken from published articles (black). 10 randomly selected Mn2O3 (Mn3+) spectra acquired for this study. (c) 9 digitized reference spectra of various Mn4+ compounds taken from published articles (black). 10 randomly selected MnO2 (Mn4+) spectra acquired for this study. All digitized spectra shown are taken from digitizing the figures in Zhang et al. (reprinted from American Mineralogist 95, 1741–1746 (2010), with permission from the Mineralogical Society of America), Tan et al. (reprinted from Ultramicroscopy 116, 24–33 (2012), with permission from Elsevier) and Garvie and Craven (reprinted from Phys. Chem. Miner. 21, 191–206 (1994), with permission from Springer Nature)8,9,16.

A qualitative inspection of Fig. 2 highlights that the Mn2+ edge is narrower than both the Mn3+ and Mn4+ edges. In addition, Mn4+ can be differentiated from Mn3+ by a small shoulder located on the low-energy side of the larger peak.

A flow-chart describing the pipe-line of going from acquired data to a functioning model can be found in the Supporting Information (Figure S1).

Model Training

Stratified 10-fold cross-validation was used for model validation and to estimate the validation-set error. The acquired Mn dataset was divided into 10 roughly equal folds, each stratified by class. Nine folds (i.e. 90% of the data) were used for finding model parameters while the last fold was used to calculate withheld-set validation accuracy. This was repeated for every fold to produce 10 models trained on different partitions of the data.
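The stratified fold assignment can be sketched as follows; this is a minimal NumPy stand-in (in practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used), with the class sizes taken from the acquired dataset:

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each example to one of n_folds so that every fold has
    roughly the same class proportions as the full dataset."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold

# Class sizes from the acquired dataset: 448 Mn2+, 765 Mn3+, 788 Mn4+.
labels = np.repeat([2, 3, 4], [448, 765, 788])
fold = stratified_folds(labels)
val_mask = fold == 0        # one fold held out for validation
train_mask = ~val_mask      # the remaining 9 folds (~90%) used for training
```

Repeating this with each of the 10 folds held out in turn yields the 10 models described above.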

In batches of 128 randomly selected augmented training spectra at a time, the spectra are non-linearly transformed through the nodes and edges of the directed graph to produce predicted class probabilities \(p({y}_{k}^{(i)}|x,\theta )\) for all possible classes. The cost function being minimized, C(θ), is the categorical cross-entropy loss between the true one-hot encoded label \({y}_{k}^{(i)}\) and the predicted class probability \(p({y}_{k}^{(i)}|x,\theta )\), calculated over all N training examples in the batch and all K possible classes.

$$C(\theta )=-\sum _{i=1}^{N}\sum _{k=1}^{K}\,{y}_{k}^{(i)}\,\log \,p({y}_{k}^{(i)}|x,\theta )$$
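The cost function above can be computed directly; the batch below is a toy example with three one-hot labels and made-up predicted probabilities:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    """C = -sum_i sum_k y_k^(i) * log p(y_k^(i) | x, theta), summed over the batch."""
    return float(-np.sum(y_true * np.log(y_pred)))

# Toy batch: one-hot labels for N = 3 spectra over K = 3 valence classes,
# with illustrative predicted probabilities (each row sums to 1).
y_true = np.eye(3)
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.1, 0.8]])
loss = categorical_cross_entropy(y_true, y_pred)
```

Because the labels are one-hot, only the log-probability assigned to the true class of each spectrum contributes to the sum.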

The error from the cost function is back-propagated through each weight and bias in the network using the chain rule, quantifying how much each parameter contributed to the classification error. With this scheme, the weights and biases can be nudged in a direction that minimizes the error using stochastic gradient descent (SGD). The specific SGD variant used was adaptive moment estimation (Adam), an optimizer commonly used for training neural networks (default parameters)17.

Model Validation – Effect of translation

Depending on the calibration of the spectrometer, the peaks in the measured spectrum may be translated. In electron spectroscopy, the electronic environment of the atom being measured can also cause translational shifts (i.e. different compounds with the same oxidation state can be shifted). Because of this, it is crucial that an electron spectroscopy classifier understand the shape of the spectra rather than simply memorize the absolute onset of the peak.

Translation-invariance was measured by cropping each validation example so as to move the Mn ionization edge to different positions within the 300-length input vector. A 5 eV shift in this context is equivalent to moving the peak 50 bins out of the total 300. To test translation-invariance, the validation-set spectra for each cross-validation fold were shifted left and right by up to 5 eV and the resulting validation accuracy was measured. To further probe whether translation-invariance can be learned, data augmentation was used to randomly translate training examples between −5 and +5 eV; for each architecture, this was performed at various scalar multiples of the training data, ranging from 1x to 10x the original amount. The effect of translation on validation accuracy with respect to neural network architecture and scalar multiple of training data can be found in Fig. 3.
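The shift operation can be sketched as follows, assuming the 0.1 eV/bin dispersion stated earlier so that a 5 eV shift moves the edge 50 bins. Zero-filling the vacated bins is an assumption of this sketch; the paper's exact edge handling is described in its Methods.

```python
import numpy as np

def shift_spectrum(spectrum, shift_ev, dispersion=0.1):
    """Translate a 1D spectrum by shift_ev; at 0.1 eV/bin, 5 eV = 50 bins.
    Vacated bins are filled with zeros (an assumption of this sketch)."""
    bins = int(round(shift_ev / dispersion))
    out = np.zeros_like(spectrum)
    if bins > 0:
        out[bins:] = spectrum[:-bins]
    elif bins < 0:
        out[:bins] = spectrum[-bins:]
    else:
        out = spectrum.copy()
    return out

def augment_translate(spectrum, rng, max_ev=5.0):
    """Training-data augmentation: apply a random shift in [-max_ev, +max_ev]."""
    return shift_spectrum(spectrum, rng.uniform(-max_ev, max_ev))

x = np.zeros(300)
x[150] = 1.0                       # synthetic peak at bin 150
shifted = shift_spectrum(x, 2.0)   # +2 eV shift = +20 bins
```

Sweeping `shift_ev` from −5 to +5 eV over the validation spectra reproduces the x-axis of Fig. 3, while `augment_translate` generates the randomly shifted training copies.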

Figure 3 A comparison of how the validation-accuracy changes with respect to shifting the validation spectra for each neural network architecture and with varying degrees of training-set data augmentation. Training examples were randomly translated (−5 to +5 eV) in different scalar multiples of the training data and also compared against no augmentation at all. The upper row of plots contains a zoomed-in view from 85–100% validation accuracy while the lower row of plots contains the full range of accuracy for each respective style of neural network.

As the amount of training data augmentation increases for the fully convolutional network, the performance increases, albeit subtly (Fig. 3). This can be observed in the gradient of colour from purple to red. However, this is not the case for the other two networks, which show no smooth increase in validation accuracy as the amount of augmentation increases.

Without applying translation data augmentation to the training set (dataset identified as "None", dark purple colour in Fig. 3) and without shifting the validation data (i.e. zero shift), all three networks exceed 99% 10-fold cross-validation validation-set accuracy. When the spectra are shifted, however, even without training data augmentation, the fully convolutional network shows only a small decrease in validation accuracy, whereas the other two architectures show a sharp decrease in accuracy (to below 60%, approaching that of a random guess).

When translation data augmentation is applied to the training set, the dense-layer-containing networks show a sharp increase in validation accuracy under translation of the validation spectra, suggesting that translation-invariance can be learned by brute force. When trained on the randomly shifted training data ("1x" size, with random shifts), the dense-layer-containing networks are still outperformed by the fully convolutional network without any data augmentation (dataset labeled "None"); this was measured by integrating the validation accuracy with respect to translation. Only when large scalar multiples of training data are applied does the Conv. + Dense network outperform the non-augmented fully convolutional architecture. However, when the 10x-trained fully convolutional network is compared to the 10x-trained Conv. + Dense network, the fully convolutional network remains superior.

It is worth noting that the fully convolutional network is superior under these conditions despite containing only 60% as many weights as the convolutional and dense network, and only 6% as many weights as the fully dense network. The fully convolutional neural network is also much more constrained, using global average pooling instead of dense connections, which may aid translation invariance. However, Azulay and Weiss18 showed that popular pre-trained ImageNet classifiers (e.g. Inception19, VGGNet13) are not completely translation-invariant even when using global average pooling.

Model Validation – Effect of noise

To test the robustness of convolutional neural networks to the noise frequently found in electron energy-loss spectra acquired with transmission electron microscopes, a noise test was implemented for the fully convolutional neural network using principal component analysis (PCA). PCA has recently grown increasingly popular for denoising EELS spectra20,21,22,23,24. Removing low-variance principal components eliminates some types of noise frequently detected in EELS with minimal loss of signal (typically <10−2% of the signal is removed). Using these low-variance principal components as a noise distribution, they were added to each input spectrum in scalar multiples ranging from zero to five times the baseline noise level. In this context, a scalar multiple of zero corresponds to PCA-cleaned data (low-variance components removed), and a scalar multiple of 1 is the original signal.

For each cross-validation fold, PCA was performed on the validation set, and the low-variance principal components were added back to the validation data in different scalar multiples. This measures how well the classifier can predict the oxidation state in the presence of extreme noise that a trained human spectroscopist would have great difficulty classifying. As a qualitative example of the resulting signal-to-noise ratio, the effect of different scalar multiples of low-variance principal components on the spectra is shown in Fig. 4b.
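The noise test can be sketched as follows: each spectrum is split into a low-rank "clean" part plus a low-variance residual treated as the noise distribution, and scalar multiples of that residual are added back. The number of retained components (5) and the random stand-in data are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def pca_noise_augment(spectra, n_keep=5, scale=1.0):
    """Reconstruct spectra from the top n_keep principal components (the
    'clean' part), treat the remainder as noise, and return
    clean + scale * noise.  scale=0 gives PCA-denoised spectra;
    scale=1 returns the originals; scale>1 amplifies the noise."""
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # SVD-based PCA: rows of Vt are the principal components.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    clean = centered @ Vt[:n_keep].T @ Vt[:n_keep]
    noise = centered - clean
    return mean + clean + scale * noise

rng = np.random.default_rng(0)
spectra = rng.normal(size=(100, 300))              # stand-in for 300-bin spectra
restored = pca_noise_augment(spectra, scale=1.0)   # 1x: the original signal
denoised = pca_noise_augment(spectra, scale=0.0)   # 0x: PCA-cleaned spectra
noisy_5x = pca_noise_augment(spectra, scale=5.0)   # 5x the baseline noise level
```

Sweeping `scale` from 0 to 5 over the validation folds reproduces the x-axis of Fig. 4a.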

Figure 4 (a) A comparison of the effect of data augmentation on how the fully convolutional network performs when multiples of low-variance principal components are added. The x-axis refers to multiples of noise added to the validation-set. "With" and "without" data augmentation refers to adding noise to the training data. (b) An example showing the effect of scalar multiples of low-variance principal components added to spectra.

An additional test was performed to see whether the model could be made robust to high levels of noise by deliberately adding noise to the training data. This is called training data augmentation. PCA was performed on every training fold during cross-validation and low-variance principal components were added to each training example. The neural network was then evaluated on the noise-augmented validation-set folds to measure how the validation accuracy changes with increasing validation-set noise. The comparison between the fully convolutional classifier trained with and without training data augmentation is shown in Fig. 4a.

This test shows that, in the presence of noise at a 5x scalar multiple of the low-variance principal components, the data-augmented classifier exceeds 93% validation-set accuracy. Without data augmentation, the accuracy decreases moderately as noise is added, reaching 78% validation-set accuracy at 5x the baseline noise level. Note that a 1x scalar multiple of the low-variance principal components is simply the original signal, while a 0x multiple is a PCA-cleaned spectrum with the low-variance components removed.

Model Testing - Digitized Reference Spectra

The three neural networks were tested against the 31 digitized spectra taken from the publications by Zhang et al., Tan et al., and Garvie et al.8,9,16 to probe generalizability in the presence of different instruments, calibration, and resolution. This dataset contains 12 Mn2+ spectra (tetrahedral, octahedral, and dodecahedral coordination), 10 Mn3+ spectra (octahedral), and 9 Mn4+ spectra (octahedral). These spectra are also shifted significantly (~3 eV) compared to the acquired spectra, likely the result of different instrument calibration, and have different levels of noise (Fig. 2).

The three architectures examined in this study were tested against the 31 digitized reference spectra. The test accuracies on the digitized reference spectra dataset are shown in Table 1.

Table 1 Performance (test accuracy) of each neural network architecture on the digitized reference spectra dataset. Each network is compared when trained with translation data augmentation in different scalar multiples and without any augmentation.

In Table 1, the fully convolutional neural network proves extremely successful on the digitized reference spectra dataset even without data augmentation. In contrast, the dense-layer-containing architectures fail to generalize to data outside their training distribution and misclassify Mn2+ and Mn4+ when data augmentation is not used. These results agree with the translation-invariance test in Fig. 3: since the digitized reference spectra are shifted significantly (Fig. 2), the dense-layer-containing networks have difficulty classifying data too dissimilar from their training data. Only with data augmentation do the dense-layer-containing networks become accurate.

Neural networks are apparently capable of classifying Mn compounds whose coordination (tetrahedral or dodecahedral) differs from that of the acquired training data (octahedral). As demonstrated by Garvie et al. and Tan et al., this is because there are strong similarities in the fine-structure shape of the core-loss edges between various Mn, Fe and V compounds of the same valence9,16 across different coordinations, at least at the energy resolution used for these experiments.

The activations for each layer of the fully convolutional neural network on the digitized reference spectra dataset can be found in the Supporting Information.

Visualizing the feature space using t-SNE

To help demonstrate why the fully convolutional network succeeds on the digitized reference spectra test-set in the absence of training data augmentation, the 300-length preprocessed spectra (both acquired and reference) were projected onto a 2D plane using t-distributed stochastic neighbor embedding (t-SNE). t-SNE, developed by van der Maaten and Hinton25, is a popular non-linear dimensionality reduction technique that attempts to preserve the cluster structure of the original high-dimensional space when projecting the data onto a 2D plane for visualization purposes. Note that the analysis presented here is entirely qualitative, to aid exploratory visualization. The visualizations shown in Fig. 5 were initialized with PCA, use a perplexity of 100, and were run for 10,000 iterations. A variety of visualizations with perplexities between 20 and 100 can be found in the Supporting Information.

Figure 5 2D t-SNE visualization of the 300-dimensional input space (a) and of the 72-dimensional feature space of the trained fully convolutional network (b). The manifold of the feature space brings the digitized reference spectra (triangles) closer to the clusters of the acquired data than in the original input space.

The features produced by the final feature extraction layer of the fully convolutional network (prior to global average pooling) consist of 12 vectors of length 6; these were flattened into a 72-length vector and also visualized using t-SNE.
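The flattening and embedding step can be sketched with scikit-learn as follows. The feature values and sample count here are synthetic stand-ins, and a smaller perplexity is used so the toy example is valid (the paper uses perplexity 100, PCA initialization, and 10,000 iterations on the full dataset).

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the final feature maps: 60 examples, 12 filters x length 6.
features = rng.normal(size=(60, 12, 6))
flat = features.reshape(len(features), -1)   # flatten to 72-length vectors

# PCA initialization, with a perplexity below the sample count.
embedding = TSNE(n_components=2, init="pca", perplexity=20,
                 random_state=0).fit_transform(flat)
```

Running the same embedding once on the input spectra and once on these flattened feature vectors yields the two panels of Fig. 5.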

The classes are easily differentiated in the feature space shown in Fig. 5b, where the reference spectra (the triangles in Fig. 5) fall within the clusters of acquired spectra. This is in contrast to the t-SNE visualization of the input space (Fig. 5a), where all reference Mn4+ spectra fall within the acquired Mn3+ cluster and would thus be misclassified.

The digitized reference spectra dataset differs markedly from the acquired dataset (cleaner signal and shifted ~3 eV). It is evident from the t-SNE visualization of the full input space that the two datasets are drawn from different distributions: qualitatively, the reference Mn4+ spectra are more similar to the acquired Mn3+ spectra than to the acquired Mn4+ spectra.