My colleagues and I have devised a mathematical model that can predict which films will become blockbusters or flops at the box office – up to a month before the movie is released.

Our model is based on an analysis of the activity on Wikipedia pages about American films released in 2009 and 2010. We examined 312 movies, taking into account the number of page views for the movie’s article, the number of human editors contributing to the article, the number of edits made and the diversity of online users. From these measures we could produce good estimates of a movie’s prospective popularity at the box office. The results obtained using this model correlated strongly with the actual figures published in the Internet Movie Database (IMDb).
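To make the idea concrete, here is a minimal sketch of how four activity measures could be combined into a revenue estimate. It assumes an ordinary least-squares linear fit, which is an assumption for illustration and not necessarily the method used in the study. The function names (`fit_least_squares`, `predict`) and every number are hypothetical; the "revenues" are generated from known coefficients purely so that the fit can be checked.

```python
# Illustrative sketch only. Ordinary least squares is an assumption, and all
# numbers are synthetic; none of them come from the study.

def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augment the matrix so each row operation also updates the right-hand side.
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitute from the last unknown upwards.
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution


def fit_least_squares(features, targets):
    """Return [intercept, coefficient_1, ...] minimising the squared error."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 gives the intercept
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, feature_row):
    """Apply the fitted linear model to one film's activity measures."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], feature_row))


# Synthetic films: (page views, editors, edits, user diversity), arbitrarily rescaled.
films = [
    (1.2, 1.5, 0.4, 0.8),
    (0.3, 0.5, 0.12, 0.4),
    (2.0, 4.0, 1.5, 0.9),
    (0.1, 0.2, 0.03, 0.2),
    (0.75, 2.0, 0.35, 0.6),
    (0.5, 0.8, 0.6, 0.5),
]
# Synthetic "revenue" generated from known coefficients so the fit can be checked.
true_coefficients = [1.0, 2.0, 0.5, 1.5, 3.0]
revenues = [predict(true_coefficients, film) for film in films]

coefficients = fit_least_squares(films, revenues)
new_film = (0.9, 1.2, 0.5, 0.7)  # hypothetical unreleased film
print(round(predict(coefficients, new_film), 3))
```

Because the synthetic revenues follow an exact linear rule, the fitted coefficients should match `true_coefficients` up to rounding. Real Wikipedia data would be noisy, and the quality of the fit would have to be judged against actual box office figures.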

Actual first weekend box office revenue in the United States against its predicted value based on Wikipedia data 30 days before the release. The green line, indicating the perfect prediction, is drawn for comparison. Each dot represents a movie from the sample and the size of the dot indicates the amount of the error in the prediction. Predictions for more successful movies are more accurate.

Our mathematical algorithm has allowed us to predict box office revenues with an overall accuracy of around 77 per cent. This level of accuracy is higher than that of the best existing predictive models applied by marketing firms (which we estimate to be around 57 per cent). We could predict the box office takings of six out of 312 films with 99 per cent accuracy, meaning the predicted value was within one per cent of the real value. Some 23 movies were predicted with 90 per cent accuracy and 70 movies with an accuracy of 70 per cent and above.
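The accuracy bands above can be illustrated with a short sketch. It assumes that accuracy means one minus the relative error between predicted and actual revenue, which matches the "within one per cent" description but is otherwise an assumption. The helper names and the revenue pairs are hypothetical.

```python
# Illustrative sketch only: the accuracy definition is an assumption and the
# revenue figures are invented for demonstration.

def prediction_accuracy(predicted, actual):
    """Return 1 minus the relative error, floored at 0."""
    if actual <= 0:
        raise ValueError("actual revenue must be positive")
    return max(0.0, 1.0 - abs(predicted - actual) / actual)


def count_at_least(pairs, threshold):
    """Count films whose prediction accuracy reaches the threshold."""
    return sum(1 for p, a in pairs if prediction_accuracy(p, a) >= threshold)


# Hypothetical (predicted, actual) opening-weekend revenues, in millions of dollars.
toy_pairs = [(99.5, 100.0), (46.0, 50.0), (12.0, 20.0), (30.0, 10.0)]

for threshold in (0.99, 0.90, 0.70):
    print(threshold, count_at_least(toy_pairs, threshold))
```

Under this definition a film counted in the 99 per cent band is also counted in the 90 and 70 per cent bands, so the counts are cumulative.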

The more successful the film, the more accurately we were able to predict its box office takings. This is possibly due to the larger amount of online data generated by films that turn out to be successful. The model correctly forecast the commercial success of Iron Man 2, Alice in Wonderland, Toy Story 3 and Inception, but failed to accurately forecast the financial return on less successful movies such as Never Let Me Go and Animal Kingdom.

These results can be of great value to marketing firms but, more importantly for us, we were able to demonstrate how socially generated online data can be used to predict a lot about future human behaviour.

We have demonstrated for the first time that Wikipedia edit statistics provide us with another tool to predict social events. We studied the problem of predicting the financial success of movies and concluded that, in some respects, forecasting based on Wikipedia outperforms forecasting based on tweets, because Wikipedia activity has a longer timescale, which enables earlier predictions.

The efficiency of the predictions might be improved by applying more sophisticated statistical methods and by including additional measures, such as the controversy measure of an article.

Taha Yasseri is a Big Data Research Officer at the Oxford Internet Institute. Before joining the Oxford Internet Institute, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread.

This research has been published in PLoS ONE: Mestyán, M., Yasseri, T., and Kertész, J. (2013) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. PLoS ONE 8 (8) e71226.

(Image Credit: Brett Sayer)