The director starpower value is computed analogously to the starpower for a single actor.

The genre uniqueness is a measure of how unique a movie’s combination of genre categories is relative to all movies in my data set.

Taking the “−log” here produces a more normally distributed quantity while ensuring that rarer genre combinations receive larger, positive values.
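The exact formula isn’t shown here, but a minimal sketch of this idea — assuming the uniqueness is defined as −log of the relative frequency of a movie’s genre combination in the data set — looks like this (the helper name `genre_uniqueness` and the example genres are mine):

```python
import numpy as np
import pandas as pd

def genre_uniqueness(genre_combos: pd.Series) -> pd.Series:
    """Map each movie's genre combination to -log of its relative frequency.

    `genre_combos` holds one hashable genre combination per movie,
    e.g. frozenset({"Action", "Comedy"}).
    """
    freq = genre_combos.map(genre_combos.value_counts(normalize=True))
    return -np.log(freq)

combos = pd.Series([
    frozenset({"Drama"}),
    frozenset({"Drama"}),
    frozenset({"Drama"}),
    frozenset({"Horror", "Musical"}),  # rare combination -> large uniqueness
])
u = genre_uniqueness(combos)
```

A common combination (three of four movies are pure dramas) gets a small value, −log(0.75) ≈ 0.29, while the rare Horror/Musical combination gets a large one, −log(0.25) ≈ 1.39.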

In the end, I have 38 features, most of them categorical and one-hot encoded. The process of selecting and engineering features is laborious but crucial since the success of any model depends heavily on the quantity and quality of the input data (recall: “garbage in, garbage out!”).
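As a small illustration of the one-hot encoding step (the column names below are hypothetical — the actual 38 features differ), `pandas.get_dummies` expands each categorical column into one indicator column per category:

```python
import pandas as pd

# Hypothetical categorical columns standing in for the real features.
df = pd.DataFrame({
    "mpaa_rating": ["PG", "R", "PG-13"],
    "release_season": ["summer", "fall", "summer"],
})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["mpaa_rating", "release_season"])
```

Here three ratings and two seasons expand into five indicator columns.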

Building Models to Predict Movie Profitability

Here I use profitability as the metric of success for a film and define profitability as the return on investment (ROI). The ROI is simply the profit expressed as a fraction of the budget (i.e., ROI = Profit/Budget). Since extreme values of ROI are fairly common for movies (both massive successes and major flops) and the range is large, the target variable that I aim to predict is log(ROI + 1).
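The target transform is a one-liner with NumPy’s `log1p` (the budget and revenue figures below are made up for illustration); note that log(ROI + 1) is the same as log(Revenue/Budget), since ROI + 1 = (Profit + Budget)/Budget:

```python
import numpy as np

budget = np.array([100e6, 20e6, 50e6])   # production budgets (hypothetical)
revenue = np.array([350e6, 5e6, 50e6])   # box-office gross (hypothetical)

roi = (revenue - budget) / budget        # ROI = Profit / Budget
target = np.log1p(roi)                   # log(ROI + 1) = log(Revenue / Budget)
```

A movie that exactly breaks even (ROI = 0) maps to a target of 0, a big hit to a positive value, and a flop to a negative one.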

I used XGBoost as my regression model because I found that it slightly outperforms random forest regression when evaluated with the root-mean-square error (RMSE). I used 5-fold cross-validation to tune several hyperparameters of the XGBoost model, including the number of trees, the maximum depth of each tree, and the learning rate. Below is the result of a single XGBoost model trained on 80% of the data and tested on the unseen held-out 20%.

Scatterplot of the predicted ROI vs. the true ROI for the hold-out test set. The solid line shows the y=x line for comparison. The model performs decently well, but there is a lot of scatter, particularly in the extremes of the distribution.

The scatterplot shows that predicting the success of movies is indeed hard! A single ROI prediction from a single model would not be very trustworthy. But creating a distribution of ROI predictions using random subsamples of the training data can give a sense of the variability in the prediction as a proxy for the risk involved in funding a movie (essentially the idea behind jackknife and bootstrap resampling). Given my full training set of N samples, I generated 500 subsamples, each of size N/2 and each randomly drawn from the full set of N. The values 500 and N/2 are somewhat arbitrary but were chosen to obtain a smooth distribution of ROI values and to balance the desire for sufficient variability in the predictions against the need to maintain a large enough training set for each model. I trained 500 models on these 500 random subsamples and built a distribution of ROI values from which I can extract summary statistics such as the median and 95% confidence interval. A schematic diagram of my modeling process is shown below.

Diagram outlining the modeling process behind ReelRisk.

ReelRisk: A Risk Assessment Tool for Movie Production

Below is a screenshot of the input page for ReelRisk, the web app I developed that helps studio executives and producers assess the risk involved in funding a proposed movie project.