One of the basic steps in data analysis is outlier detection, i.e. the detection of samples that deviate from the general data distribution. During this step, abnormalities in the dataset are identified. The process is often driven by some kind of unsupervised model, since labeled ground truth is usually not available up front. However, these models are unstable, which makes relying on a single unsupervised model risky. That is why data scientists usually build a number of models and combine or further analyze their outputs. In essence, that is how outlier ensemble methods were developed. This approach has several flaws, such as poor scalability and high computational cost. Apart from that, many of the underlying algorithms (kNN, Local Outlier Factor, Local Outlier Probabilities) rely on distances in Euclidean space, which means that high dimensionality is a problem as well.
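To make the ensemble idea concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the detector (distance to the k-th nearest neighbor), the rank normalization, the averaging rule, and the helper names `knn_scores`, `to_ranks` and `ensemble_scores` are all assumptions chosen for illustration.

```python
import math
import random

def knn_scores(points, k):
    """Outlier score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def to_ranks(scores):
    """Map raw scores to ranks in [0, 1] so different detectors are comparable."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (len(scores) - 1)
    return ranks

def ensemble_scores(points, ks=(3, 5, 10)):
    """Average the rank-normalized scores of several kNN detectors (assumed rule)."""
    per_model = [to_ranks(knn_scores(points, k)) for k in ks]
    return [sum(col) / len(col) for col in zip(*per_model)]

# Toy data: 50 points around the origin plus one planted outlier.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
data.append((8.0, 8.0))

combined = ensemble_scores(data)
top = max(range(len(data)), key=combined.__getitem__)
print("most anomalous index:", top)
```

Even in this toy version every detector computes all pairwise distances, so the cost grows with both the number of models and the number of dimensions, which is the scalability problem described above.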

That is why the authors of this paper propose SUOD, a three-module acceleration framework that speeds up training and prediction with a large number of unsupervised models. First, the framework generates a random low-dimensional subspace for each unsupervised model, and the model is then trained on that subspace. Second, a balanced parallel scheduling heuristic is proposed to increase efficiency in distributed systems: SUOD predicts the running time of each model and distributes the workload among workers based on that estimate. Finally, the third module of SUOD uses lower-cost supervised regressors to approximate the unsupervised models. Supervised models are typically faster at prediction time and easier to interpret. This is comparable to the knowledge distillation techniques used for neural networks. The whole algorithm of SUOD goes like this: