Validation Set Over-fitting because of Random Initialization & Selection Bias

Does “Early-Stopping” really ensure that an AI model generalizes?

The logo of hawk:AI, where I work as Chief Data Scientist, shown in thanks for the constant support of my colleagues.

Deep Learning models are called universal function approximators. Their strength comes from their great ability to model the relationship between a given input and output. However, this is also their primary weakness when it comes to finding generalizable solutions to a problem, and it is what makes them so prone to over-fitting (memorizing) the training set and failing on new data. The standard method of ensuring the generalization of Deep Learning models has simply been to use a validation set to decide how many iterations (epochs) a model should be trained for; in other words, early stopping. The Data Scientist then tests the trained model on a blind test set to make sure that none of the training hyper-parameters are over-fitting either. This method has worked well so far for problems where you can have such separate datasets, each with a similar distribution. However, this is not the case for several problems in Finance or Healthcare (in both of which I am deeply experienced), and certainly not for Reinforcement Learning problems, where the new data depends on both the environment and the actions taken.
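For readers who have not implemented it, the early-stopping rule described above can be sketched as follows. This is a minimal illustration, not a specific library's API: the function name `early_stopping_epoch`, the `patience` parameter and the loss values are all assumptions made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Assumption: training stops once the validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    # Never ran out of patience: stop at the last epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 3 6
```

Note that the chosen epoch depends entirely on one validation curve from one random initialization, which is exactly the dependence questioned in the rest of this article.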

Example Critical Scenarios where Early-Stopping often Fails!

In Healthcare, it is extremely critical for the deployed model to generalize to each unseen patient, who would be present in neither the training nor the test set. For that reason, it is common practice to validate the model with leave-one-subject-out cross-validation, where a different subject is left out in each fold, and to decide on an early-stopping epoch based on the statistics across folds; this can be okay with a large population. In Finance, the situation is even worse, as one needs to generalize to future data, whilst the distribution of stock-market data always changes and it is almost impossible to have homogeneous train, validation and test sets. Many Quants train AI models for trading that do not perform well on real data, although they looked great in the back-tests during their experiments. Finally, most Reinforcement Learning methods have reproducibility problems (especially when a different random seed is employed), and gradient-free methods like Augmented Random Search have been shown to work better on them. It is also common for ensembles, even of the same model, to produce better results.

Good Validation Performance is not an Indicator of Generalization!

Continual learning of Deep Learning models, which are iterative machine-learning models, is also a nightmare and requires the same validation. In fact, one can never ensure that a Deep Learning model will converge to the same or a better solution in terms of generalization if it is re-trained from scratch on a new dataset of the same size for the same number of epochs. It is a total mess! After years of experience in training models, I have identified a primary cause of this: validation set over-fitting. Since such a model is randomly initialized, it can converge to a local minimum that performs great on the validation set in use, and if you early-stop your training at that point, the model will generalize worse than the optimal solution. For the problems that I have given as examples, there is no way of ensuring this doesn’t happen by checking the validation set. A good method that I have discovered for detecting over-fitting using only the training set is to check the model performance across a lot of mini-batches. Since the model loses its generalization capability during over-fitting, it starts to perform worse on certain mini-batches while trying to learn others, and therefore the change in the loss starts to fluctuate across them during training.
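One possible way to turn the mini-batch check above into a number is sketched below. It is only one reading of the idea: the helper names, the use of the standard deviation of per-batch loss changes, the `factor` threshold and the loss values are all assumptions for illustration, not the exact procedure used in my experiments.

```python
from statistics import mean, pstdev


def loss_change_dispersion(batch_losses):
    """batch_losses[e][b] is the loss of mini-batch b measured at epoch e.

    Returns, for each epoch after the first, the standard deviation across
    mini-batches of the per-batch loss change since the previous epoch.
    """
    dispersion = []
    for prev, curr in zip(batch_losses, batch_losses[1:]):
        deltas = [c - p for p, c in zip(prev, curr)]
        dispersion.append(pstdev(deltas))
    return dispersion


def first_unstable_epoch(dispersion, factor=3.0):
    """Return the first epoch whose dispersion exceeds `factor` times the
    mean dispersion of all earlier epochs, or None if there is none."""
    for i in range(1, len(dispersion)):
        baseline = mean(dispersion[:i])
        if dispersion[i] > factor * baseline:
            return i + 1  # dispersion[i] describes the change into epoch i + 1
    return None


# Hypothetical losses: 4 epochs x 4 mini-batches. In the last epoch some
# batches improve sharply while others get worse.
batch_losses = [
    [1.00, 1.00, 1.00, 1.00],
    [0.80, 0.82, 0.79, 0.81],
    [0.65, 0.66, 0.64, 0.66],
    [0.40, 0.80, 0.45, 0.78],
]
dispersion = loss_change_dispersion(batch_losses)
flagged = first_unstable_epoch(dispersion)
print(flagged)  # 3
```

While all mini-batches improve together, the dispersion stays small; once the model starts trading some batches off against others, it jumps, and that jump is the warning sign.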

The “Population-based” Constant Rebalanced Portfolio Selection!

Recently, I faced this problem while working on a Portfolio Selection (Optimization) problem, and I finally discovered a solution, which is potentially applicable to other domains. I have invented a method that I have named ‘Population-based Constant Rebalanced Portfolio Selection’. For this method, I employed the Autograd functionality in PyTorch to optimize 8192 portfolios simultaneously on the GPU. Then, I used the mean weights of the top 50% of portfolios for checking the validation set performance, rather than selecting a single candidate from the portfolio weights being trained. (The method converges to a similar result when all of them are used for taking the mean; using the top 50% just accelerates the training process.) To sum up, it turned out that I was able to obtain smooth training and validation curves, which I could use for deciding on an early-stopping epoch without a random-initialization bias. Another interesting possible use of this method could be concurrently evolving adversarial examples for attacking Deep Learning models. The main difference from the existing evolutionary optimization methods is that each candidate is optimized independently in the proposed method. Further details on this method are in the following:

A Presentation about the proposed “Population-based” Constant Rebalanced Portfolio Selection Method.
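The core loop of the population-based idea (many independently optimized candidates, with the mean of the top half evaluated on validation data) can be sketched in a few lines. This is a toy stand-in, not the actual implementation: it uses plain Python with a hand-written gradient instead of PyTorch Autograd, a simple quadratic reward instead of a simulated trading reward, 64 candidates instead of 8192, and it does not enforce that weights sum to one. All names and numbers are assumptions for illustration.

```python
import random


def reward(w, target):
    # Toy reward: negative squared distance to a target weight vector.
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def grad(w, target):
    # Hand-derived gradient of the toy reward (stands in for Autograd).
    return [-2.0 * (wi - ti) for wi, ti in zip(w, target)]


def population_step(population, target, lr=0.1):
    """One gradient-ascent step per candidate; candidates never interact."""
    return [
        [wi + lr * gi for wi, gi in zip(w, grad(w, target))]
        for w in population
    ]


def mean_of_top_half(population, target):
    """Average the weights of the best 50% of candidates by training reward."""
    ranked = sorted(population, key=lambda w: reward(w, target), reverse=True)
    top = ranked[: max(1, len(ranked) // 2)]
    return [sum(w[d] for w in top) / len(top) for d in range(len(top[0]))]


random.seed(0)
train_target = [0.50, 0.30, 0.20]  # hypothetical optimum on training data
val_target = [0.45, 0.35, 0.20]    # hypothetical optimum on validation data
population = [[random.random() for _ in range(3)] for _ in range(64)]

val_curve = []
for epoch in range(30):
    population = population_step(population, train_target)
    consensus = mean_of_top_half(population, train_target)
    # The validation curve is computed on the averaged weights, not on any
    # single randomly initialized candidate.
    val_curve.append(reward(consensus, val_target))
```

Because `val_curve` is driven by an average over many random initializations, no single lucky or unlucky start decides where the curve peaks, which is the property I rely on when picking the early-stopping epoch.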

Existing Methods of Portfolio Selection in Literature are Outdated!

The previous portfolio selection methods only address the selection of an optimal buy & hold portfolio and do not help to select a constant rebalanced portfolio. Yet a minimum-volatility constant rebalanced portfolio, which has no positive return when simply held, can still generate profits due to mean-reversion. Therefore, I have instead tried to select an optimal portfolio for a given trading policy (such as UCRP) and risk-adjusted reward, under transaction costs (1%). The divergence threshold, which is used for deciding when to rebalance back to the selected constant portfolio, is also optimized along with the portfolio weights. To the best of my knowledge, this is the first Portfolio Selection method that optimizes a portfolio together with the parameters of a defined trading policy by simulating the strategy. The reward function acts as an alternative to Efficient Frontier optimization.
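To make the mean-reversion argument concrete, here is a minimal simulation of a constant rebalanced portfolio with a divergence threshold and proportional transaction costs. It is a sketch under stated assumptions: the function `simulate_crp`, the way the cost is charged (on turnover computed before the cost is deducted) and the two-asset price path are all made up for illustration and are not the shared strategy code.

```python
def simulate_crp(price_relatives, target, threshold=0.05, cost=0.01):
    """Simulate a threshold-rebalanced constant portfolio.

    price_relatives: one list per period with price_t / price_{t-1} per asset.
    target: target portfolio weights, summing to 1.
    threshold: rebalance when any weight drifts further than this from target.
    cost: proportional transaction cost on the traded value (assumption:
          charged on turnover computed before the cost is deducted).
    Returns (final_wealth, number_of_rebalances) for a starting wealth of 1.
    """
    holdings = list(target)  # value held in each asset
    rebalances = 0
    for relatives in price_relatives:
        holdings = [h * r for h, r in zip(holdings, relatives)]
        wealth = sum(holdings)
        weights = [h / wealth for h in holdings]
        drift = max(abs(w - t) for w, t in zip(weights, target))
        if drift > threshold:
            traded = sum(abs(t * wealth - h) for h, t in zip(holdings, target))
            wealth -= cost * traded
            holdings = [t * wealth for t in target]
            rebalances += 1
    return sum(holdings), rebalances


# Hypothetical path: asset A doubles and then halves, asset B stays flat,
# so simply holding the 50/50 portfolio ends exactly where it started.
path = [[2.0, 1.0], [0.5, 1.0]]
target = [0.5, 0.5]

hold = simulate_crp(path, target, threshold=1.0, cost=0.01)   # never rebalances
free = simulate_crp(path, target, threshold=0.05, cost=0.0)   # no costs
real = simulate_crp(path, target, threshold=0.05, cost=0.01)  # 1% costs
print(hold, free)  # (1.0, 0) (1.125, 2)
```

In this toy path the held portfolio earns nothing, while the rebalanced one ends at 1.125 without costs and still above 1.11 with 1% costs. Both `target` and `threshold` are the kind of parameters that the proposed method optimizes jointly by simulating the strategy.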

Source Code of Strategy with 80% Win-Rate & 20x Profit-Loss Ratio

To conclude, this method enabled me to find generalizable portfolio weights and parameters for a given trading strategy (in this case a constant rebalanced portfolio) and a defined risk-sensitive reward function. In fact, I have been able to optimize a trading strategy that achieved an 80% win-rate and a 20x profit-loss ratio on an out-of-sample back-test done with QuantConnect. To demonstrate the strength of the proposed method, I am sharing the source code of the discovered strategy in the link below, so that anyone can reproduce the results for themselves. Please note that finding such a strategy would be very difficult even on the training set. Lastly, I am looking forward to other researchers developing more solutions for the problem that I have defined in this article, and I hope that it can serve as a trigger for more high-quality research to tackle this severe problem in AI.

Thanks for Reading! A Short Autobiography of the Author (me):

I am an ex-academic (T-Labs, Microsoft Research) and entrepreneur (OTA Expert, LivingRooms), who revolutionized Parkinson’s disease treatment at work (ConnectedLife) while hacking the stock market at home. I previously worked in reputable research institutes, including the Socio-Digital Systems (Human Experience and Design) Group in the Computer-Mediated Living Laboratory of Microsoft Research Cambridge and the Quality & Usability Group of Deutsche Telekom Innovation Laboratories (T-Labs). I led several research projects on Deep Learning, Machine Learning, Pattern Recognition, Data Mining, HCI, Information Retrieval, Artificial Intelligence, Computer Vision and Computer Graphics, and co-authored 35 publications in conferences & journals. I currently work in the FinTech domain at hawk:AI as Chief Data Scientist.