Comparing the different methods for plotting the loss landscape together with the optimization trajectory, we see that choosing random directions results in plots that convey little to no information: the trajectory barely moves away from its initial position. The PCA directions seem to offer a good choice for plotting the trajectory, but PCA chooses its directions in such a way that one always obtains essentially the same figure of the trajectory. Therefore this method offers little useful information about the training procedure. Choosing Hessian eigenvectors, on the other hand, yields interesting directions in which both minima and trajectories are visible; here the trajectories always look different and represent the "true" path taken by the optimizer.

Looking at the eigenvalues along the interpolation between two different minima in Figure 8, one can observe that the region between the two minima is relatively flat: most positive eigenvalues are pushed toward zero, while the negative ones change little compared to their values at the minima.

Regarding scalability, the visualization method is highly parallelizable. This is to be expected, since the grid can be split and each worker can compute its loss values independently. The stochastic Lanczos quadrature algorithm is also highly parallelizable, again because the different initializations in the outer loop are independent of each other. Using data parallelism within this algorithm, however, scales much worse. One reason could be that each time the dataset has been processed, all GPUs have to wait for the algorithm to finish the remaining computations of that iteration. Still, if one attempts to scale to hundreds of GPUs, the best approach would be to mix both strategies, since the number of independent iterations in the stochastic Lanczos quadrature algorithm is on the order of 10^1.
Using more GPUs than there are independent iterations would therefore be pointless; the remaining GPUs could instead compute samples in a data-parallel fashion.
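To make the grid-splitting argument concrete, the following is a minimal sketch of evaluating the visualization grid in parallel. The loss function here is a hypothetical 2-D surface standing in for the network loss along the two plot directions, and the names `loss_at` and `loss_grid` are our own; a real run would evaluate the model at each offset, with each worker (or GPU) owning an independent slice of the grid.

```python
from multiprocessing.pool import ThreadPool
import itertools

# Hypothetical stand-in for the network loss L(theta* + a*d1 + b*d2);
# a real run would evaluate the model at offsets (a, b) along the
# chosen plot directions d1, d2.
def loss_at(point):
    a, b = point
    return (a**2 - 1.0)**2 + 0.5 * b**2

def loss_grid(alphas, betas, workers=4):
    # Every grid point is independent of the others, so the grid can
    # be split arbitrarily across workers (GPUs, in practice).
    points = list(itertools.product(alphas, betas))
    # ThreadPool keeps the sketch portable; a GPU cluster would hand
    # each worker its own chunk of points instead.
    with ThreadPool(workers) as pool:
        values = pool.map(loss_at, points)
    # Reassemble the flat result into a (len(alphas), len(betas)) grid.
    n_b = len(betas)
    return [values[i * n_b:(i + 1) * n_b] for i in range(len(alphas))]
```

Because no grid point depends on another, the speedup is limited only by the number of workers and the cost of gathering the results.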