
There are many questions here; I'll try to address each in turn.

How is t-SNE* better than just taking a random (probably stratified) sample of the data?

If your goal is to provide a visual overview of the data then a stratified sample alone is not going to do that -- the sampled points still live in high dimensional space, so they are no easier to visualise, and if you are interested in how the data relate to each other then looking through a sample item by item does not necessarily help you build a reasonable mental model of those relationships. Sampling and visualisation with t-SNE (or similar approaches) both attack the same underlying question, "What does my data look like?", but they provide very different views on it and highlight different aspects, so I would say they are entirely complementary. Why not both?
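As a rough sketch of the "why not both" idea, here is one way to draw a stratified sample that could then be handed to a t-SNE implementation. The helper name `stratified_sample` and its parameters are my own choices for illustration, not part of any library, and it uses only the Python standard library.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Return sorted indices of a sample keeping roughly `fraction` of each class.

    Hypothetical helper for illustration: `labels` is a list with one class
    label per data point.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the row indices by class label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    chosen = []
    for indices in by_label.values():
        # Keep at least one item per class so rare classes stay visible.
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels: 80 points of class "a" and 20 of class "b".
toy_labels = ["a"] * 80 + ["b"] * 20
sample_indices = stratified_sample(toy_labels, fraction=0.1)
```

The rows selected by `sample_indices` keep the class proportions of the full dataset. Embedding just those rows (for example with scikit-learn's `sklearn.manifold.TSNE`) gives the visual overview that the sample on its own cannot.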

How is t-SNE* better than just fitting a neural network with a 2-neuron bottleneck to the data and then taking the (normalized) value of the 2 neurons for an embedding?

This is harder, because I can't point you to explicit studies demonstrating it, but t-SNE generally does a better job of providing a visual representation that is meaningful to users. It is worth noting that t-SNE was co-created by Geoff Hinton (with Laurens van der Maaten), who I am sure knows a great deal about neural networks and their potential uses and benefits. If Geoff felt t-SNE was worth using then one can reasonably presume it has some merits over other NN approaches. With that in mind, one can probably interpret t-SNE as a NN with a 2-neuron bottleneck; I haven't tried to write that out in detail, so don't quote me on it. Finally, NNs aren't necessarily the answer to all problems: t-SNE is a manifold learning algorithm specifically designed for low dimensional embedding, whereas an autoencoder with a suitable bottleneck may happen to give similar results but is not specifically designed for that task.

Does t-SNE* give any guarantees?

Guarantees of what? Providing a visual representation of the data? Yes, it will do that. Guarantee that the Kullback-Leibler divergence between the inferred distributions for the high dimensional and low dimensional representations has been minimised? Up to the quality of the optimisation, yes (the objective is non-convex, so different runs can settle in different local minima). Guarantee that local structure is preserved? Up to a suitable choice of perplexity for the dataset, yes. Guarantee that the low dimensional representation is a fully faithful representation of the global structure of the data? No, no guarantees there. It really depends on what you're looking for.
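To make the Kullback-Leibler objective a little more concrete, here is a small pure-Python sketch that scores a given embedding. It is deliberately simplified: it uses a single fixed Gaussian bandwidth `sigma` for every point, whereas real t-SNE tunes a per-point bandwidth to match the chosen perplexity. The function names are my own, and this only evaluates the objective; it does not optimise it.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_affinities(points, sigma=1.0):
    """Symmetrised high dimensional affinities p_ij.

    Simplifying assumption: one fixed `sigma` for all points.
    """
    n = len(points)
    cond = []
    for i in range(n):
        weights = [
            0.0 if i == j
            else math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])  # conditional p_{j|i}
    # Symmetrise so that the whole matrix sums to 1.
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]


def student_t_affinities(embedding):
    """Low dimensional affinities q_ij from a heavy-tailed Student-t kernel."""
    n = len(embedding)
    weights = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def kl_divergence(p, q):
    """KL(P || Q), summed over all pairs with p_ij > 0."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0.0
    )


# Two tight clusters in 3-D, and two candidate 2-D layouts of the same points.
data = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
faithful = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
scrambled = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

p = gaussian_affinities(data)
kl_faithful = kl_divergence(p, student_t_affinities(faithful))
kl_scrambled = kl_divergence(p, student_t_affinities(scrambled))
```

The layout that keeps each point next to its true neighbour gets a much lower divergence than the one that swaps points between clusters. That is the sense in which local structure is what the objective rewards: a low value says neighbours stayed neighbours, not that the distances between clusters mean anything.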

Is t-SNE* good for the construction of classifiers? I mean: If you already have a classifier which is much better than random / guessing the most frequent class, does t-SNE help you to make better classifiers? How?

Not really -- not any more than you would expect from any dimensionality reduction technique, and as you note, forcing things to 2 dimensions constrains that a lot. In principle, since t-SNE focuses on preserving local structure and sacrifices global structure to do so, if your classification relies heavily on local structure then t-SNE could perform better than other dimension reduction techniques that seek to preserve global structure and thus represent local structure less well. In the case of unsupervised (density) clustering it is certainly true that local structure is more important than global structure, so t-SNE can potentially help. However, compression to 2 dimensions may be rather extreme.

There are many dimensionality reduction algorithms. How does one compare the non-linear ones? When is one algorithm better than others? Especially: Are they better than bottleneck features of neural networks?

If your interest is in visualisation (as is the case with t-SNE) then unfortunately comparisons are inevitably fairly subjective. This is much like the case of clustering. Certainly there are many different measures of clustering quality, but often that comes down to "what do you mean by a cluster?", and for each measure there is a clustering algorithm that optimises for that measure. Likewise there are measures for how well an embedding has succeeded, but ultimately it depends on what you mean by a successful embedding (is preserving large scale relationships critical? is preserving local structure more important? how do you weigh one against the other?), and for any given measure there is an algorithm that optimises that measure. For a subjective evaluation the standard approach seems to be to embed labelled data and then view the result coloured by label -- no, that is never going to clearly demonstrate superiority, but it can hint at what seems to be working. In practice t-SNE seems to fare well at this (hence its popularity). Are any of these better than bottlenecked autoencoders? I'm not sure why one would assume that a bottlenecked autoencoder is the superior option by default, but in my experience, whenever I've tried several options on data I've found t-SNE to provide better intuition about the dataset. Anecdotes are not data of course, so by all means try the options yourself.
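As one example of a measure that favours local structure, here is a minimal sketch of a k-nearest-neighbour preservation score: the fraction of each point's k nearest neighbours in the original space that are still among its k nearest neighbours in the embedding. The name `neighbour_preservation` is my own, and this is just one of many possible measures; scikit-learn's `sklearn.manifold.trustworthiness` is a more refined relative of the same idea.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbours(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: sq_dist(points[i], points[j]))
    return set(others[:k])


def neighbour_preservation(high_dim, low_dim, k):
    """Average fraction of k-nearest neighbours shared by both spaces.

    1.0 means every local neighbourhood survived the embedding; 0.0 means
    none did. It says nothing about large scale relationships.
    """
    n = len(high_dim)
    kept = sum(
        len(nearest_neighbours(high_dim, i, k) & nearest_neighbours(low_dim, i, k))
        for i in range(n)
    )
    return kept / (n * k)


# Two tight clusters in 3-D, and two candidate 2-D layouts of the same points.
data = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
faithful = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
scrambled = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

score_faithful = neighbour_preservation(data, faithful, k=1)
score_scrambled = neighbour_preservation(data, scrambled, k=1)
```

An embedding that scores well here could still place the clusters in misleading positions relative to one another, which is exactly the "what do you mean by a successful embedding?" problem: a measure built around global distances could rank the same two embeddings differently.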

I hope this covers most of it, although perhaps these were not quite the sort of answers you wanted.