Abstract Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined. Moreover, we observed that their computational requirements are considerably greater than those of statistical methods. The paper discusses the results, explains why the accuracy of ML models is below that of statistical ones and proposes some possible ways forward. The empirical results found in our research stress the need for objective and unbiased ways to test the performance of forecasting methods that can be achieved through sizable and open competitions allowing meaningful comparisons and definite conclusions.

Citation: Makridakis S, Spiliotis E, Assimakopoulos V (2018) Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS ONE 13(3): e0194889. https://doi.org/10.1371/journal.pone.0194889 Editor: Alejandro Raul Hernandez Montoya, Universidad Veracruzana, MEXICO Received: December 9, 2017; Accepted: March 12, 2018; Published: March 27, 2018 Copyright: © 2018 Makridakis et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: All data are available online at https://forecasters.org/resources/time-series-data/m3-competition/. Funding: The author(s) received no specific funding for this work. Competing interests: The authors have declared that no competing interests exist.

1 Introduction Artificial Intelligence (AI) has gained considerable prominence over the last decade fueled by a number of high profile applications in Autonomous Vehicles (AV), intelligent robots, image and speech recognition, automatic translations, medical and law usage as well as beating champions in games like chess, Jeopardy, GO and poker [1]. The successes of AI are based on the utilization of algorithms capable of learning by trial and error and improving their performance over time, not just by step-by-step coding instructions based on logic, if-then rules and decision trees, which is the sphere of traditional programming. In light of the above, AI found applications in the field of forecasting and a considerable amount of research has been conducted on how a special class of it, utilizing Machine Learning methods (ML) and especially Neural Networks (NNs), can be exploited to improve time series predictions. Literally hundreds of papers propose new ML algorithms, suggesting methodological advances and accuracy improvements [2–8]. Yet, limited objective evidence is available regarding their relative performance as a standard forecasting tool [9–12]. Their superiority claims are characterized by the following three major limitations:

Their conclusions are based on a few, or even a single time series, raising questions about the statistical significance of the results and their generalization.

The methods are evaluated for short-term forecasting horizons, often one-step-ahead, not considering medium and long-term ones.

No benchmarks are used to compare the accuracy of ML methods versus alternative ones.

The objective of ML methods is the same as that of statistical ones. They both aim at improving forecasting accuracy by minimizing some loss function, typically the sum of squared errors. They differ in how this minimization is done: ML methods utilize non-linear algorithms, while statistical ones rely on linear processes. ML methods are computationally more demanding than statistical ones, requiring greater dependence on computer science to be implemented, placing them at the intersection of statistics and computer science. The importance of objectively evaluating the relative performance of the ML methods in forecasting is obvious but has not been achieved so far, raising questions about their practical value to improve forecasting accuracy and advance the field of forecasting. Simply being new, or based on AI, is not enough to persuade users of their practical advantages over alternative methods. A similar situation has been reported by [13] for data mining methods, suggesting among others that novel approaches should be properly tested through a wide range of diverse datasets and comparisons with benchmarks. As mentioned by [14], it should become clear that ML methods are not a panacea that would automatically improve forecasting accuracy. “Their capabilities can easily generate implausible solutions, leading to exaggerated claims of their potentials” and must be carefully investigated before any claims can be accepted.

This paper consists of three sections. The first briefly reviews published empirical studies and investigates the performance of ML methods in comparison to statistical ones, also deliberating some major issues related to forecasting accuracy. 
The second part uses a subset of 1045 monthly series (the same ones used by [15]) from the 3003 of the M3 Competition [16] to calculate the performance of eight traditional statistical methods and eight popular ML ones, the same as those used by [15], plus two more that have become popular during recent years [17]. The forecasting model was developed using the first n − 18 observations, where n is the length of the series. Then, 18 forecasts were produced and their accuracy was evaluated against the actual values not used in developing the forecasting model. In addition, the computational complexity of the methods used was recorded as well as the accuracy of fitting the model to the n − 18 historical data (Model Fit). The third section discusses the outcome of the comparisons and attempts to explain why the forecasting accuracy of ML models was lower than most statistical ones, while also proposing possible ways to improve it. A critical question being asked is whether ML methods can actually be made to “learn” more efficiently using more information about the future and its unknown errors, rather than past ones. The motivation for writing this paper was an article [18] published in Neural Networks in June 2017. The aim of the article was to improve the forecasting accuracy of stock price fluctuations and claimed that “the empirical results show that the proposed model indeed display a good performance in forecasting stock market fluctuations”. In our view, the results seemed extremely accurate for stock market series that are essentially close to random walks, so we wanted to replicate the results of the article and emailed the corresponding author asking for information to be able to do so. We got no answer and we, therefore, emailed the Editor-in-Chief of the Journal asking for his help. He suggested contacting the other author to get the required information. We consequently emailed this author but we never got a reply. 
Not being able to replicate the result of [18] and not finding research studies comparing ML methods with alternative ones we decided to start the research leading to this paper.
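
As an illustration of the evaluation protocol outlined above (fit on the first n − 18 observations, forecast the 18 held-out ones, score the forecasts against the actual values), the following is a minimal Python sketch. It assumes the commonly used M3-style sMAPE definition, since the formula is not stated in this section, and it uses a hypothetical seasonal-naive forecaster on synthetic data as a stand-in. It is not the paper's Naive 2 benchmark, nor any of the methods actually evaluated.

```python
import numpy as np

HORIZON = 18  # number of held-out observations, as in the paper


def smape(actual, forecast):
    """Symmetric MAPE in percent (assumed M3-style definition)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    ratio = 2.0 * np.abs(actual - forecast) / (np.abs(actual) + np.abs(forecast))
    return float(np.mean(ratio) * 100.0)


def seasonal_naive(train, horizon, period=12):
    """Hypothetical stand-in forecaster: repeat the last observed seasonal cycle."""
    train = np.asarray(train, dtype=float)
    last_cycle = train[-period:]
    repeats = int(np.ceil(horizon / period))
    return np.tile(last_cycle, repeats)[:horizon]


def holdout_evaluate(series, forecaster, horizon=HORIZON):
    """Fit on the first n - horizon points and score on the final horizon points."""
    series = np.asarray(series, dtype=float)
    train, test = series[:-horizon], series[-horizon:]
    return smape(test, forecaster(train, horizon))


# Synthetic monthly series (illustrative only): trend plus a 12-month cycle.
t = np.arange(120)
synthetic = 100.0 + 0.5 * t + 10.0 * np.sin(2.0 * np.pi * t / 12.0)
score = holdout_evaluate(synthetic, seasonal_naive)
print(f"post-sample sMAPE of the stand-in forecaster: {score:.2f}%")
```

Any function with the signature `forecaster(train, horizon)` can be passed to `holdout_evaluate`, which keeps the held-out observations strictly separate from model fitting.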

2 The accuracy of ML methods: A brief review and discussion The first application of NNs (as ML methods were called at that time and also sometimes today) in forecasting dates back to 1964 but did not achieve much follow-up until the technique of backpropagation was introduced almost 20 years later [19]. Since then, there have been numerous studies utilizing NNs and some of them comparing their accuracy to traditional, statistical ones. A good number of these studies, going back to 1995, are summarized in the work of [15] who concluded: “The outcome of all of these studies has been somewhat mixed”. A similar conclusion was reached by [9] who evaluated 48 NN studies and stated that their accuracy in comparison to statistical methods provided mixed results. What characterized all these studies, however, was the limited number of series employed in the comparisons. The first large scale study, using 3003 time series, dates back to the M3 Competition published in 2000 by [16] that included an Automated Artificial NN (AANN) method which, accuracy-wise, did average in comparison to the traditional statistical ones included in the Competition and below the most accurate ones (see Table 1). Eleven years later, Crone, Hibon and Nikolopoulos (C-H-N) published the results of a specialized NN competition, using a subset of the M3 monthly data [12]. In this competition they compared 22 NN and CI (Computational Intelligence) methods, in addition to 11 statistical ones. Their conclusion was that no ML method outperformed the Theta method [20], the most accurate one in the M3 Competition, and that only one [21] was more accurate than Damped trend exponential smoothing [22] when the symmetric Mean Absolute Percentage Error (sMAPE) for the average of all 18 forecasting horizons was used. However, four NNs did better than the AANN of the M3 Competition denoting improvements in the accuracy of newer ML methods. 
Overall, however, the accuracy of the NNs was not exceptional, vis-à-vis those of the M3 Competition, or the 11 statistical methods that were included in the (C-H-N) study (see Table 2).

Table 1. sMAPE across the 3003 time series of the M3 competition. https://doi.org/10.1371/journal.pone.0194889.t001

Table 2. sMAPE and ranks of errors on the complete dataset of the C-H-N study. https://doi.org/10.1371/journal.pone.0194889.t002

ML methods have been gaining prominence over time as interest in AI has been rising. They are used to predict financial series [18, 23], the direction of the stock market [24], macroeconomic variables [25], accounting balance sheet information [26] and a good number of other applications, covering a wide range of areas [27]. A major purpose of this study is to determine, empirically, if their performance exceeds that of statistical methods and how their advantages could be exploited to improve forecasting accuracy. What seems certain is that Chatfield’s prediction of NNs becoming a “breakthrough or passing fad” will not be realized [10]. Their performance cannot be classified yet as a breakthrough but at the same time they are still used while there are indications that such usage will increase over time as newer ML methods are introduced and more ways are being devised to improve their accuracy [15, 28] and computational efficiency.

4 The accuracy, the goodness of fit and the computational complexity of the ML methods Fig 2 shows the overall sMAPE for all the statistical and ML methods included in this paper as well as the ML accuracies reported by Ahmed and colleagues for performing one-step-ahead forecasts. As seen, the six most accurate methods are statistical, confirming their dominance over the ML ones. Even Naive 2 (a seasonal Random Walk (RW) benchmark) is more accurate than half of the ML methods. The most interesting question and greatest challenge is to find the reasons for their poor performance with the objective of improving their accuracy and exploiting their huge potential. AI learning algorithms have revolutionized a wide range of applications in diverse fields and there is no reason that the same cannot be achieved with the ML methods in forecasting. Thus, we must find how they should be applied in order to forecast more accurately.

Fig 2. Forecasting performance (sMAPE) of the ML and statistical methods included in the study. The results are reported for one-step-ahead forecasts having applied the most appropriate preprocessing alternative. https://doi.org/10.1371/journal.pone.0194889.g002

ML models are nonlinear functions connecting the inputs and outputs of neurons. The goal of the network is to “learn” by solving an optimization problem in order to choose a set of parameters, or weights, that minimize an error function, typically the sum of squared errors. However, the same type of optimization is done in ARIMA (or regression) models. There is no obvious reason, therefore, to justify the more than 1.24% higher sMAPE of MLP, one of the best ML methods, in comparison to that of ARIMA, or that the sMAPE of this MLP is only 0.19% more accurate than Naive 2, the seasonally adjusted random walk model. Similarly, one would expect RNN and LSTM, which are more advanced types of NNs, to be far more accurate than the ARIMA and the rest of the statistical methods utilized. Clearly, if there was any form of learning, the accuracy of ML methods should have exceeded that of ARIMA and greatly outperformed the Naive 2. Thus, it is imperative to investigate the reasons that this is not happening, e.g. by comparing the accuracy of ML and statistical methods series by series, explaining the differences observed and identifying the reasons involved. The more serious issue, simply put, is how ML methods can be made to learn about the unknown future rather than how well a model fits past data. For this to be done, the ML methods must have access to information about the future and their objective must be to minimize future errors rather than those of fitting a model to available data. 
Until a later time when more advanced ML methods become available and in order to simplify things, we suggest that the data is deseasonalized before some ML model is utilized, as research [11] has shown little to no differences between the post-sample accuracy of models applied to original and seasonally adjusted data. A practical way to allow learning about the unknown future errors is by dividing the n − 18 data into two parts, with the first one containing the 1/3 of the n − 18 data and the second the remaining 2/3. If the data is first deseasonalized, a much simpler model can be developed using the first (n − 18)/3 data and then trained to learn how to best predict the next 18 observations. Then, the first (n − 18)/3 + 1 data can be used to let the method learn how to best predict the next 18 observations and continue using the first (n − 18)/3 + 2, the first (n − 18)/3 + 3 and so on, until having used all the observations available. Clearly, such a sliding simulation, attempting to predict future values based on post-sample accuracy optimization, will probably be a step in the right direction even though its performance needs to be empirically tested. Another possibility is to provide ML methods with alternative forecasts (e.g. the ones produced by the best statistical methods) and ask them to learn to select the most accurate one (or their combination) for each forecasting horizon and series in such a way as to minimize post-sample errors. This may require clustering the data into various categories (micro, macro, demographic etc.) or types of series (seasonal/non-seasonal, trended/non-trended, of high, medium or low randomness etc.) and develop different models for each category/type. In Table 6 of [15], for instance, accuracy varies significantly depending on the category of the series with the best one being in demographic and macro data, the worst in micro and industry time series, and finance in between. 
This may indicate that ML methods could under-perform, among other reasons, because they are confused when attempting to optimize specific or heterogeneous data patterns. An additional concern could be the extent of randomness in the series and the ability of ML models to distinguish the patterns from the noise of the data, avoiding over-fitting. This can be a challenging problem since, in contrast to linear statistical methods, where over-fitting can be directly controlled by some information criteria (e.g., the AIC [68]) taking into account the number of parameters utilized, ML methods are nonlinear and training is performed dynamically, meaning that different forecasts may arise according e.g. to the maximum iterations considered, even if the complexity of the network’s architecture is identical. Since the importance of possible over-fitting by ML methods is critical, the topic will be covered in detail on its own in section 4.1 below. A final concern with ML methods could be the need for preprocessing that requires individual attention to select the most appropriate transformation, possible deseasonalization, as well as trend removal. Effective ML methods must, however, be able to learn and decide on their own the most appropriate preprocessing as there are few possibilities available. If, for example, the Box-Cox criterion can be used to determine the most appropriate transformation for statistical methods, it makes no sense that something similar cannot be applied by ML methods to automate preprocessing, simplify the modeling process and probably improve accuracy by doing so.

4.1 Over-fitting Tables 4, 5 and 6 report, among others, the goodness of fit, indicating how well the trained model fitted the n − 18 observations available for each series. Yet, model fit is not a good predictor of post-sample forecasting accuracy, meaning that methods with low fitting errors might result in higher post-sample ones and vice versa. 
One would expect for instance that the MLP method, displaying a model fitting error of 2.11%, would forecast more accurately than the ARIMA whose corresponding error is higher (2.59%). However, this is not the case as the post-sample sMAPE of the two methods are 8.39% and 7.19%, respectively. Moreover, RBF, GRNN and CART, which have the best model fitting, are some of the worst performing methods. A possible reason for the improved accuracy of the ARIMA models is that their parameterization is done through the minimization of the AIC criterion, which avoids over-fitting by considering both goodness of fit and model complexity. In contrast, the MLP method specifies its complexity (input nodes) through cross-validation, but no additional criteria are applied for mitigating over-fitting e.g. by specifying when training should stop. The maximum number of iterations defined serves that purpose, yet there is no global optimum: in some time series, over-fit might occur after a few iterations, while in others after many hundreds. Fig 3 shows the sMAPE (vertical axis) and the accuracy of model fit (horizontal axis). It is clear from this figure that the old belief that minimizing the model fit errors would guarantee more accurate post-sample predictions does not hold, and that some criteria similar to the AIC or other successful techniques [69] would be required to indicate to ML methods when to stop the optimization process and avoid considering as pattern a part of the noise of the data. In our view, considerable improvements can result from such an action.
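
One way to make "when to stop the optimization process" concrete is validation-based early stopping: training halts once the error on held-out data stops improving, instead of running for a fixed maximum number of iterations. The sketch below is only an assumption about what such a criterion could look like, not the technique of [69] and not the setup used in this study. To stay self-contained it trains a linear autoregressive model by gradient descent on synthetic data as a stand-in for an MLP; the function names, learning rate and `patience` value are all hypothetical.

```python
import numpy as np


def make_lagged(series, n_lags):
    """Build a lag matrix X and target y so that row j predicts series[j + n_lags]."""
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - n_lags
    X = np.column_stack([series[i:i + n_rows] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y


def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.01, max_iter=5000, patience=20):
    """Gradient descent on squared error; stop after `patience` iterations
    without improvement in validation error and return the best weights seen."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_iter = w.copy(), np.inf, 0
    for it in range(1, max_iter + 1):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        val_loss = float(np.mean((X_val @ w - y_val) ** 2))
        if val_loss < best_val - 1e-12:
            best_w, best_val, best_iter = w.copy(), val_loss, it
        elif it - best_iter >= patience:
            break  # validation error has stalled: further fitting risks chasing noise
    return best_w, best_val, best_iter


# Synthetic AR(1) series (illustrative only), standardized for stable step sizes.
rng = np.random.default_rng(0)
noise = rng.normal(size=300)
series = np.zeros(300)
for t in range(1, 300):
    series[t] = 0.8 * series[t - 1] + noise[t]
series = (series - series.mean()) / series.std()

X, y = make_lagged(series, n_lags=3)
split = int(0.7 * len(y))  # earlier part for training, later part for validation
weights, val_mse, stop_iter = train_with_early_stopping(
    X[:split], y[:split], X[split:], y[split:]
)
print(f"best validation MSE {val_mse:.4f} reached at iteration {stop_iter}")
```

The validation block is taken from the end of the in-sample data so that the stopping rule is driven by errors on observations that come after those used for fitting.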

Fig 3. Forecasting performance (sMAPE) versus model fitting. The results are reported for one-step-ahead forecasts having applied the most appropriate preprocessing alternative. https://doi.org/10.1371/journal.pone.0194889.g003

4.2 Computational complexity As forecasting methods are used in various applications, the computational time required to forecast becomes critical. It would be impractical for example to utilize the ML GRNN method (the most computationally demanding) to predict the demand for hundreds of thousands of inventory items, even though computers are becoming faster and cheaper. Memory and CPU usage optimization might serve in that direction but again, computational intensity remains an important issue. For instance, despite exploiting such optimization processes in our study, reducing the computational time of the ML methods by more than 30%, the complexity reported is still much greater compared to the statistical ones. For this reason, the information provided in Fig 4 is of value, as it confirms the low computational requirements of statistical methods, lying in the lower left part of the figure, and additionally shows that superior accuracy can be achieved with less computational effort. In particular, the six inside the square box (Damped, Comb, Theta, SES, Holt and ETS) are not only some of the most accurate but also, apart from ETS, the least computationally demanding.

Fig 4. Forecasting performance (sMAPE) versus computational complexity. The results are reported for one-step-ahead forecasts having applied the most appropriate preprocessing alternative. https://doi.org/10.1371/journal.pone.0194889.g004

For practical reasons, if ML methods are to be applied by business and non-profit organizations, their computational requirements must be reduced considerably. This can be done by deseasonalizing the data first, utilizing simpler models, limiting the number of training iterations or choosing the initial values of the weights, not in a completely arbitrary manner, but through a guided search that would provide values not too far from the optimal ones. Alternatively, the speed of convergence towards the optimum could be increased in order to reduce the computational time required to reach it. These improvements would require testing to determine the trade-offs between lesser accuracy, resulting from the reduction in computational time, versus the savings from such a reduction.
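
The sliding simulation proposed earlier in this section (start from the first third of the n − 18 in-sample observations, predict the next 18, then move the origin forward one observation at a time) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: origins are restricted so that all 18 targets stay inside the in-sample data, which is one reading of "until having used all the observations available", and the last-value forecaster and the function names are hypothetical placeholders for whatever ML method is being trained.

```python
import numpy as np


def sliding_origins(n, horizon=18):
    """Forecast origins for a series of length n, leaving the final
    `horizon` observations untouched for the real post-sample test."""
    n_fit = n - horizon       # in-sample length (n - 18 in the paper)
    first = n_fit // 3        # start from the first third of the in-sample data
    last = n_fit - horizon    # assumed cut-off: targets must lie in-sample
    return list(range(first, last + 1))


def last_value(history, horizon):
    """Placeholder forecaster: repeat the most recent observation."""
    return np.repeat(history[-1], horizon)


def sliding_simulation(series, forecaster, horizon=18):
    """Mean absolute error per forecasting horizon over all sliding origins."""
    series = np.asarray(series, dtype=float)
    in_sample = series[:-horizon]
    origins = sliding_origins(len(series), horizon)
    if not origins:
        raise ValueError("series too short for a sliding simulation")
    errors = []
    for origin in origins:
        history = in_sample[:origin]
        target = in_sample[origin:origin + horizon]
        forecast = np.asarray(forecaster(history, horizon), dtype=float)
        errors.append(np.abs(target - forecast))
    return np.mean(np.vstack(errors), axis=0)


# Illustrative run on a purely linear series of the longest M3 monthly length.
per_horizon_mae = sliding_simulation(np.arange(126.0), last_value)
print(per_horizon_mae)
```

The per-horizon errors collected this way are the kind of "future" information a method could be asked to minimize, as opposed to the one-step-ahead fitting errors on past data.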

5 Conclusions: The state of the art and ways forward The field of statistical forecasting has progressed a great deal since the early dates when [70] used exponential smoothing, in the late 1940s, for predicting the inventory demand for many thousands of items in navy shipyards. The introduction of the Box-Jenkins methodology to ARIMA models [71] brought academic respectability to a field dominated until then by practitioners, while the extensive use of regression and econometric models [72] further enlarged the field. Finally, multivariate GARCH models were also made available [73, 74] broadening the coverage of the field (for an excellent survey of the latest developments see Special Issue on “Simple Versus Complex Forecasting” [75]). A major innovation that has distinguished forecasting from other fields has been the good number of empirical studies aimed at both the academic community as well as the practitioners interested in utilizing the most accurate methods for their various applications and reducing cost or maximizing benefits by doing so. These studies contributed to establishing two major changes in the attitudes towards forecasting: First, it was established that methods or models, that best fitted available data, did not necessarily result in more accurate post sample predictions (a common belief until then). Second, the post-sample predictions of simple statistical methods were found to be at least as accurate as the sophisticated ones. This finding was furiously objected to by theoretical statisticians [76], who claimed that a simple method being a special case of e.g. ARIMA models, could not be more accurate than the ARIMA one, refusing to accept the empirical evidence proving the opposite. 
These two findings have fundamentally changed the field of forecasting and are also evident in this paper both in Fig 3, showing post-sample versus in-sample accuracy, as well as in Fig 2, displaying the accuracy level of various statistical and ML methods, with the latter being much more sophisticated and computationally demanding than the former. Knowing that a certain sophisticated method is not as accurate as a much simpler one is upsetting from a scientific point of view as the former requires a great deal of academic expertise and ample computer time to be applied. At the same time, understanding the reasons of their underperformance is the only way to improve them. This has certainly been the case with ARIMA models whose accuracy with monthly data (not the same as those used in this study) in the 1982 M Competition was 17.9% and has decreased to 11.28% in the present study, tying with the accuracy of the damped exponential smoothing, one of the most accurate methods of the M Competitions. ARIMA’s improved performance is mainly due to the utilization of the AIC criterion and other optimization processes, enabling effective automatic model selection and parameterization, while avoiding or minimizing over-fitting. Another interesting example could be the case of LSTM that compared to simpler NNs like RNN and MLP, report better model fitting but worse forecasting accuracy. ML theorists working on forecasting applications need to do something to improve the accuracy of their methods. For instance, the only thing exponential smoothing methods do is smoothen the most recent errors exponentially and then extrapolate the latest pattern in order to forecast. Given their ability to learn, ML methods should do better than simple benchmarks, like exponential smoothing. 
Accepting the problem is the first step in devising workable solutions and we hope that those in the field of AI and ML will accept the empirical findings and work to improve the forecasting accuracy of their methods. A problem with the academic ML forecasting literature is that the majority of published studies provide forecasts and claim satisfactory accuracies without comparing them with simple statistical methods or even naive benchmarks. Doing so raises expectations that ML methods provide accurate predictions, but without any empirical proof that this is the case. In our view, this situation is the same with what was happening in statistical literature in the late 1970s and 1980s. At that time, it was thought that forecasting methods were of superior accuracy simply because of their sophistication and their mathematical elegance. Now it is obvious that their value must be empirically proven in an objective, indisputable manner through large scale competitions. Thus, when it comes to papers proposing new ML methods, or effective ways to use them, academic journals must demand comparisons with alternative methods or at least benchmarks and require that the data of the articles being published be made available for those who want to replicate the results. In our experience, this has not been the case at present, making replications practically impossible and allowing conclusions that may not hold. In addition to empirical testing, research work is needed to help users understand how the forecasts of ML methods are generated (this is the same problem with all AI models whose output cannot be explained). Obtaining numbers from a black box is not acceptable to practitioners who need to know how forecasts arise and how they can be influenced or adjusted to arrive at workable predictions. 
A final, equally important concern is that in addition to point forecasts, ML methods must also be capable of specifying the uncertainty around them, or alternatively providing confidence intervals. At present, the issue of uncertainty has not been included in the research agenda of the ML field, leaving a huge vacuum that must be filled as estimating the uncertainty in future predictions is as important as the forecasts themselves. To overcome this issue, many researchers propose simulating the intervals by iteratively generating multiple future sample paths. Yet, even in that case, the forecast distribution of the methods is empirically and not analytically derived, raising many doubts about its quality. To summarize, according to the results of this study, ML methods need to become more accurate, require less computer time, and be less of a black box. A major contribution of this paper is in showing that traditional statistical methods are more accurate than ML ones and pointing out the need to discover the reasons involved, as well as devising ways to reverse the situation. However, in the comparisons of the statistical and ML methods reported in this paper, it must be made clear that the results may be related to the specific data set being used. The 3003 time series of M3 come mainly from the business and economic world that seem to be represented truthfully by this data [77], characterized by considerable seasonality, some trend and a fair amount of randomness [78]. The frequency of close to half of the series is monthly, followed by quarterly and yearly ones of about the same percentage. The length of the series varies from 14 for yearly data to 126 for monthly ones, with the majority being in the Micro and Macro domain. 
The characteristics of the series as well as their length may be a critical factor determining the accuracy of the various methods reported in this paper, requiring additional research, using other data sets, to verify that similar results will hold true if different sets of data are used and, most importantly, the series are of much longer length so the ML methods can train their weights more optimally. For instance, in [78], the authors use a set of six features to analyze the M3 data, visualizing them in a 2-dimensional space and examining the strengths and weaknesses of different forecasting methods using the new classification. Their results show that the particularities of the dataset may affect the conclusions drawn, indicating that different ones could have emerged if another sample of time series had been selected instead. The relation between forecasting accuracy and time series characteristics is also reported by [79] who claim that there are indeed “Horses for Courses” in demand forecasting. In this regard, even though M3 might be representative of the reality when it comes to business applications, the findings may be different if nonlinear components are present, or if the data is being dominated by other factors. In such cases, the highly flexible ML methods could offer significant advantage over statistical ones. Furthermore, the length of business series, which is relatively limited compared to those of other applications in which ML methods are typically utilized (e.g., energy forecasting), may also affect their performance as proper training may be difficult or even impossible when short series are involved. Thus, future studies are needed before definite conclusions can be drawn. 
At this point, the following suggestions/speculations, that must be empirically verified, can be made about the way forward regarding the ML methods, while these can be enriched by future research topics proposed in relevant surveys [80]:

Obtain more information about the unknown future values of the data rather than their past ones and base the optimization/learning on such future values as much as possible.

Deseasonalize the data before using ML methods. This will result in a simpler series, reducing the computational time required to arrive at optimal weights and, therefore, learn faster.

Use a sliding simulation approach to gain as much information as possible about future values and the resulting uncertainty and learn more effectively how to minimize them.

Cluster the series into various homogeneous categories and/or types of data and develop ML methods that optimally extrapolate them.

Avoid over-fitting as it is not clear if ML models can correctly distinguish the noise from the pattern of the data.

Automate preprocessing and avoid the extra decisions required from the part of the user.

Allow the estimation of uncertainty for the point forecasts and provide information for the construction of confidence intervals around such forecasts.

Although the conclusion of our paper that the forecasting accuracy of ML models is lower than that of statistical methods may seem disappointing, we are extremely positive about the great potential of ML ones for forecasting applications. Clearly, more work is needed to improve such methods but the same has been the case with all new techniques, including the complex forecasting methods that have improved their accuracy considerably over time. Who could have believed even ten years ago that we would have AVs, personal assistance on our mobile phones understanding and speaking in natural languages, automatic translations in Skype, AlphaGo beating the world GO champion or facial expression recognition algorithms [81]? There is no reason that the same type of breakthroughs cannot be achieved with ML methods applied to forecasting. Even so, we must realize that applying AI to forecasting is quite different than doing so in games or in image and speech recognition and may require different, specialized algorithms to be successful. In contrast to other applications, the future is never identical to the past and training of AI methods cannot exclusively depend on it. Table 10 is our attempt to show that not all applications can be modeled equally well using AI algorithms. Games are the easiest as the rules are known and do not change, the environment is also known and stable, the predictions cannot influence the future and there is no uncertainty. The exact opposite is true for forecasting applications where not only the rules are not known but can also change, there are structural instabilities in the data, while there is plenty of uncertainty and noise, that can confuse the search for the optimal weights. 
Moreover, in certain applications, the forecasts themselves can influence, or even change the future creating self-fulfilling or self-defeating prophesies, expanding the level of noise and increasing the level of uncertainty. It may be necessary, therefore, to adapt the algorithms to these conditions and make sure that there is no over-fitting. Judging from the results of this paper, it may be necessary that ML algorithms applied to forecasting may require additional research to experiment with innovative ideas and come up with adjustments in order to achieve more accurate predictions. PPT PowerPoint slide

Table 10. Features of various Artificial Intelligence (AI) applications. https://doi.org/10.1371/journal.pone.0194889.t010

As the first step towards providing a more diverse, large and representative dataset for evaluating the performance of forecasting methods, establishing reliable benchmarks and promoting future research, we have started the M4 Competition (see https://www.m4.unic.ac.cy/), which seeks to identify the most accurate forecasting method(s) for different types of predictions. It aims to compare all major time series methods and identify the most appropriate methods for each case. M4 utilizes 100,000 real-life series and has attracted great interest from both academic researchers and practitioners, providing objective evidence of the most appropriate way of forecasting various variables of interest. The new M4 Competition will extend and replicate the results of the previous three, while also avoiding the possible problems of M3. Furthermore, it will increase the number of series to 100,000, include additional frequencies, and considerably augment the length of the series. Given the great number of series, it will be possible to utilize advanced data analytics and related technologies to determine the influence of the various factors on forecasting accuracy, as well as to determine the most appropriate methods for different forecasting applications.

Supporting information

S1 Appendix. Tables A1 and A2, presenting the analytical results of the forecasting models used in the present study. The accuracy is evaluated per forecasting horizon, first according to sMAPE and then according to MASE. https://doi.org/10.1371/journal.pone.0194889.s001 (PDF)
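
For readers who wish to reproduce this kind of evaluation, the two accuracy measures named above can be sketched as follows. This is a minimal illustration under the commonly used definitions (sMAPE with the factor of 2 in the numerator; MASE scaled by the in-sample mean absolute error of a naive forecast), and not necessarily the exact code used in the study. The function names `smape` and `mase`, and the default lag `m=1`, are assumptions made for this example.

```python
from typing import Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE in percent over the forecast horizon (common definition)."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    total = 0.0
    for y, f in zip(actual, forecast):
        denom = abs(y) + abs(f)
        # Assumption: a 0/0 term (both values zero) contributes zero error.
        if denom != 0:
            total += 2.0 * abs(y - f) / denom
    return 100.0 * total / len(actual)


def mase(
    insample: Sequence[float],
    actual: Sequence[float],
    forecast: Sequence[float],
    m: int = 1,
) -> float:
    """Mean absolute scaled error.

    The out-of-sample mean absolute error is divided by the in-sample mean
    absolute error of the naive forecast that repeats the value from m
    periods earlier (m=1 is an assumption of this sketch).
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if len(insample) <= m:
        raise ValueError("insample must contain more than m observations")
    scale = sum(
        abs(insample[t] - insample[t - m]) for t in range(m, len(insample))
    ) / (len(insample) - m)
    if scale == 0:
        raise ValueError("in-sample naive error is zero; MASE is undefined")
    mae = sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)
    return mae / scale
```

To obtain a per-horizon figure, one would call these functions on the first h out-of-sample observations of each series and average the results across series.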