This is the fourth part of my series on watching Love Island through the lens of a data scientist; for the first part, and more of an introduction to the programme itself, see here.

For the final part of my series on Love Island via data analytics, I am going to create models that aim to predict the number of days an islander is expected to stay on the show, as well as which couple is going to win.

We will use the data from Series 4 as our training set and apply the models to the new data from Series 5 to predict outcomes; for the full pipeline of scraping Twitter data, as well as the text preprocessing, please see my previous posts. For this analysis we are going to use a polynomial regression model of order 2, because we want to incorporate the interaction effects between the features we use. One point to note upfront: in Series 4 a number of islanders shared the same first name, and since we only use first names to identify islanders in each tweet, this will introduce some noise into our data. In every case I focus on the most relevant islander per name, so this shouldn't matter too much, aside from Jack Fincham and Jack Fowler, who both had an impact on the series.

Predicting the Number of Days an Islander Lasts

For the first part of this analysis we are going to try to determine whether we can predict the number of days an islander will last. There are various ways islanders are "dumped" from the show: being voted off by other islanders, being voted off by the public, or not being in a couple, for example. Given that the public plays a relatively small part in the process (even when the public votes a contestant off, the incumbent islanders usually have the final say), the dataset we will be using to train the model is very incomplete.

Preprocessing

My steps for preprocessing the data before applying the model are as follows:

1. Find the mean sentiment score of all tweets about each islander per day, and the percentage of tweets per day that include a direct reference to each islander.

2. Filter to only include the days the islander was in the villa (if an islander left, they may have received a lot of tweets in reaction to their actions on the outside, but this isn't relevant for predicting anything about their time in the villa).

3. Find the mean sentiment score per islander across the days they were in the villa, and the mean percentage of tweets about them across those days. We want to be "fair" to all islanders: if we simply took the percentage of tweets out of all tweets, we would be heavily biased towards islanders who had been there longer, so by taking the average per day, then the average of these, we normalize for islanders who have been on the show for different lengths of time.

4. Add some metadata to use downstream (age, sex, etc.).

5. Calculate the proportion of days an islander lasted, out of the total number of days they could have stayed. If an islander entered on day 30 and lasted until day 60, this is only half as many days as someone who entered on day 1 and lasted until day 60; however, their "available days" is also half as many, so both of these islanders did the best they could and we should account for this. To calculate our dependent variable we use:

p = (day no. the islander left the villa - day no. islander entered the villa) / (59 - day no. islander entered the villa)

6. Normalize the independent variables. We know the sentiment score is capped between -1 and 1, while the mean percentage of tweets is uncapped and has a much larger range, so normalizing will lead to much more interpretable coefficients.
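The preprocessing steps above can be sketched in pandas. Everything below is hypothetical: the toy tweets, the column names, and the villa-day ranges are all invented for illustration, and the real pipeline from the earlier posts will differ.

```python
import pandas as pd

# Toy tweet-level data: one row per tweet (hypothetical names and values).
tweets = pd.DataFrame({
    "islander": ["dani", "dani", "alex", "dani", "alex"],
    "day": [1, 1, 1, 2, 2],
    "sentiment": [0.6, 0.2, -0.3, 0.8, 0.1],
})

# Step 1: per-day mean sentiment, and each islander's share of that day's tweets.
daily_total = tweets.groupby("day")["islander"].transform("size")
tweets["tweet_share"] = 1 / daily_total  # summing these per islander-day gives count/total
daily = (tweets.groupby(["islander", "day"])
               .agg(mean_sentiment=("sentiment", "mean"),
                    tweet_pct=("tweet_share", "sum"))
               .reset_index())

# Step 2: keep only days the islander was in the villa (hypothetical entry/exit days).
villa = {"dani": (1, 59), "alex": (1, 40)}
daily = daily[daily.apply(
    lambda r: villa[r.islander][0] <= r.day <= villa[r.islander][1], axis=1)]

# Step 3: average the per-day values so long-stayers aren't over-weighted.
features = daily.groupby("islander")[["mean_sentiment", "tweet_pct"]].mean()

# Step 5: dependent variable - proportion of available days survived.
features["p"] = [
    (villa[i][1] - villa[i][0]) / (59 - villa[i][0]) for i in features.index]

# Step 6: z-score normalize the independent variables.
for col in ["mean_sentiment", "tweet_pct"]:
    features[col] = (features[col] - features[col].mean()) / features[col].std()
```

Taking the mean of per-day values (rather than pooling all tweets) is what keeps step 3 fair between islanders with very different stay lengths.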

We will then fit a polynomial regression curve to these features, aiming to predict the proportion of days an islander will stay on the show. My choice to fit a polynomial stems from my assumption that there will be a large interaction between the percentage of tweets an islander receives and the sentiment score behind them; my thinking is that a mid-to-large percentage of tweets with a positive score will be a better predictor than, for example, a huge number of tweets with a score close to zero.

Training

Firstly, concentrating on the regression intercept: with 0 tweets, and therefore a score of 0, we would expect an islander to last slightly less than 20% of the days that they could. This makes sense, as when an islander enters the villa there is usually a period of 5-6 days with no dumping, so if an islander entered halfway through the series and got dumped at the first opportunity, this would be roughly 20%, as predicted! For an islander to last more than 20% they need to grab the public's attention, which in turn generates tweets.

Focusing on the coefficients, we see a positive linear relationship between the percentage of tweets an islander receives per day and the number of days they stay. However, the squared term is negative, with nearly triple the magnitude of the positive linear term. This suggests the relationship is a parabola: positive up to a point, beyond which, over a certain percentage of tweets, the expected number of days on the show begins to fall. This could be evidence of users tweeting more at negative events than positive ones. Perhaps counterintuitively, the relationship between mean sentiment score (both linear and squared terms) and proportion of days stayed is negative. This could, however, be down to other factors, such as the way the sentiment analyser deals with tweets where the author is feeling sorry for an islander. These terms may also be evidence of a bias towards the finalists: we know that in this series Alex got very far and received a lot of mixed reactions, while Dani and Jack were rarely spoken about online except to be called boring (i.e. negative sentiment), yet got very far (they ended up winning).

Considering the interaction term in our model (the coefficient for x0*x1), we see it is the largest coefficient and so has the greatest impact on the model. It is also positive, which suggests that if an islander receives lots of positive tweets, their expected days increase. This matches expectations and validates the choice of a non-linear model!

An R-squared of 0.454 is slightly less than half of the maximum value (R-squared ranges between 0 and 1). Given that we are predicting human behaviour, this is not too bad; over 45% of the variance within the data is explained by the features of our model. We know there will be a lot of noise in the data (due to the nature of sentiment analysis, misspellings on Twitter, etc.), so this score is relatively good.

Plotting the predicted proportion of days on the show against the actual proportion, we get the following graph. There is a clearly positive relationship between the two, and without any further statistical tests one would say the model has a pretty good fit. We do observe a cluster of islanders who stayed for nearly the maximum number of days yet are predicted to last only 50-70% of their maximum: all the islanders in this cluster are the finalists. The majority of them were on the show from the first episode, so they have a much larger dataset than everyone else; whilst we have tried to normalize for this, it does mean they had a much greater chance of having days with a lot of tweets, or with a very low or high sentiment score, simply because they were on our screens for so many more days.
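A quick way to produce such a predicted-vs-actual plot with matplotlib; the values here are illustrative stand-ins, not the real model output.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical actual vs predicted proportions of available days.
actual = np.array([1.0, 0.98, 0.6, 0.4, 0.25, 0.95])
predicted = np.array([0.65, 0.6, 0.55, 0.35, 0.3, 0.7])

fig, ax = plt.subplots()
ax.scatter(actual, predicted)
ax.plot([0, 1], [0, 1], linestyle="--")  # the line of perfect prediction
ax.set_xlabel("actual proportion of days")
ax.set_ylabel("predicted proportion of days")
# fig.savefig("pred_vs_actual.png")  # or plt.show() in a notebook
```

Points below the dashed line are islanders the model under-predicts, which is where the finalist cluster described above would sit.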

Testing

Given my aim is to predict the winner of Love Island 2019, I want to compare the two series and determine whether using last year's series as a training set is suitable. To that end, we can use the same model to predict the proportion of potential days an islander will stay in Series 5; given that the majority have already left, we have plenty of validation data and can determine how good the fit is.
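Scoring the Series 4 model on Series 5 might look like the sketch below (again assuming scikit-learn, with invented numbers). The key point is that the polynomial expansion is fitted on the training data and only reused, not refitted, on the new series.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical Series 4 (train) and Series 5 (test) feature matrices,
# columns [tweet_pct, mean_sentiment], already normalized.
X_train = np.array([[0.5, 0.4], [-0.2, 0.9], [1.3, -0.1],
                    [-0.8, -0.5], [0.1, 0.2], [0.9, 0.6]])
y_train = np.array([0.9, 0.6, 0.5, 0.2, 0.4, 0.95])
X_test = np.array([[0.3, 0.5], [-0.6, -0.2], [1.0, 0.1]])
y_test = np.array([0.7, 0.3, 0.6])

poly = PolynomialFeatures(degree=2, include_bias=False)
model = LinearRegression().fit(poly.fit_transform(X_train), y_train)

# Score the Series 4 model on the unseen Series 5 islanders.
y_pred = model.predict(poly.transform(X_test))
print("out-of-sample R^2:", round(r2_score(y_test, y_pred), 3))
```

With real data this out-of-sample R-squared is the number compared against the training score in the next paragraph.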

Focusing on the R-squared term, we actually see an increase; this is unexpected, but very good news! The similarity of the R-squared values suggests the two datasets behave very similarly, which is encouraging for the second part of this work.

Plotting the relationship again, we can see why: there are a number of points very close to the line (which represents a perfect correlation).

In conclusion to this first part on predicting the number of potential days each islander lasts: we can produce a fairly good model using the mean percentage of tweets that mention each islander per day, combined with the mean sentiment score per day. We have also seen that there is not a huge difference between Series 4 and 5. To improve this model we could use more features about the individuals, such as sex, ethnicity and age; however, since my goal was to test whether these features would let us use Series 4 data to compare couples, we wouldn't be able to use such individual-level features, due to the added complexity and a very small sample size.

Predicting the Winner

The second part of this work focuses on predicting the winner of Love Island 2019. Instead of predicting the number of days, we shall be predicting the vote share; ITV published the vote share per couple for each public vote in Series 4, so we shall use this as the labelled training data (it can be found here). Instead of working at an individual level we shall aggregate by couple (including all the couples in each vote), and instead of aggregating per day we shall combine all the days between each vote into a single grouping. The features are the same as in the first part (number of tweets and sentiment score). Given we are now focusing on couples, the fact that some islanders shared a name matters much less, as we only count tweets that explicitly reference both parties.
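One way to sketch the couple-level aggregation, using hypothetical tweet text and vote dates; only tweets naming both halves of the couple count towards its features.

```python
import pandas as pd

# Hypothetical tweets with precomputed sentiment scores.
tweets = pd.DataFrame({
    "day": [1, 2, 3, 8, 9],
    "text": ["jack and dani are so cute", "dani deserves better",
             "love jack and dani", "wes and megan omg", "jack dani winners"],
    "sentiment": [0.8, -0.2, 0.9, 0.4, 0.7],
})

# Keep only tweets that mention BOTH halves of the couple.
def mentions_couple(text, a, b):
    return a in text and b in text

couple_tweets = tweets[tweets["text"].apply(
    lambda t: mentions_couple(t, "jack", "dani"))]

# Bucket days into voting windows (hypothetical vote days 7 and 14),
# so all days between two votes form one grouping.
vote_days = [7, 14]
couple_tweets = couple_tweets.assign(
    vote=pd.cut(couple_tweets["day"], bins=[0] + vote_days, labels=vote_days))

# Aggregate the two features per vote window.
features = couple_tweets.groupby("vote", observed=True).agg(
    n_tweets=("text", "size"), mean_sentiment=("sentiment", "mean"))
print(features)
```

The simple substring match here is only a stand-in for whatever name matching the real pipeline uses.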

Training

The results of the model look slightly different to the first part: here we see a positive linear relationship for both the number of tweets and the sentiment score. We also see an intercept very close to 0, which makes sense in this context; we are predicting the vote percentage, so one would expect a couple that generated no activity online to get no votes in the real world. Both of the non-linear terms are negative; we have discussed earlier why this could be, but we also know that the voting throughout this series was unprecedentedly in favour of Jack and Dani (they received over 50% of the vote at every opportunity). Finally, we see a negative R-squared score, which doesn't fill us with much confidence moving forward!

Whilst we do care about the magnitude of the percentage vote received by each couple, what we really want to know is the relative order in which the couples are placed (it's a popularity contest: to win, you only need one more person to like you than anyone else!). Looking at the accuracy of the model we again aren't filled with confidence; we only got couples in the correct order in 43% of cases.
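That ordering accuracy can be measured pairwise: for every pair of couples in a vote, check whether the model ranks them the same way the public did. A small sketch with invented vote shares:

```python
from itertools import combinations

# Hypothetical predicted vs actual vote shares for one public vote.
predicted = {"jack_dani": 0.45, "wes_megan": 0.20, "adam_zara": 0.10}
actual    = {"jack_dani": 0.55, "wes_megan": 0.12, "adam_zara": 0.15}

# For each pair of couples, did the model order them as the real vote did?
pairs = list(combinations(predicted, 2))
correct = sum(
    (predicted[a] > predicted[b]) == (actual[a] > actual[b])
    for a, b in pairs)
print(f"{correct}/{len(pairs)} pairs ordered correctly")
```

Averaging this pairwise score across all votes gives a single ordering-accuracy figure like the 43% quoted above.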

Looking at the graph we see a strongly positive relationship; however, the model is heavily influenced by the couples with a very high percentage of votes (above 50%), even though this isn't representative of the majority of points. By eyeballing this graph one might suggest the model looks good, but there is a lot of variation between the actual vote share and the predicted scores, which suggests the features we chose are not as valuable here as they were for part 1. The negative R-squared term suggests that the points with a very high vote share have such a large influence over the model that they are negatively impacting the fit for all other points.

Testing

Applying the model to Series 5, we see a really bad R-squared score. We would expect this score to decrease out of sample, but confidence that this model will work for Series 5 is now at an all-time low!

We have no validation data for Series 5, as ITV haven't released the vote splits yet. What we do know is which couples were in the bottom two or three each time the public voted. Ordering couples by our predicted percentages and comparing this with the order revealed on the show (i.e. whether they were among the couples at risk of being dumped, or safe), we got this wrong 9 times out of 36. Given the room for error is now much larger, this may be masking the fact that the model doesn't fit well.
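A sketch of that bottom-couples check, with invented predicted shares; the real comparison uses the couples actually revealed as at risk on the show.

```python
# Hypothetical predicted vote shares for one Series 5 public vote, and the
# couples the show revealed as the bottom two for that vote.
predicted = {"curtis_amy": 0.30, "anna_jordan": 0.05,
             "tommy_molly": 0.40, "anton_belle": 0.08}
actual_bottom_two = {"anna_jordan", "anton_belle"}

# Order couples by predicted share (ascending) and take the lowest two.
predicted_bottom_two = set(sorted(predicted, key=predicted.get)[:2])

matches = predicted_bottom_two == actual_bottom_two
print("bottom two predicted correctly:", matches)
```

Because only set membership is checked, not the exact order, this test is much more forgiving than the pairwise one, which is why it may mask a poor fit.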

Looking at which couples we got wrong, a number of the votes/couples can be explained, as they are slightly atypical. Our model is based on Series 4, where no couple was unanimously disliked for as long as Anna and Jordan were, and the model consistently gets them wrong. The n_tweets_perc variable for Anna and Jordan on 23rd July is over double the second largest in the data frame; given this is a normalized column, it shows the extent to which people tweeted about them. Another example is Molly and Tommy: the public perception was that Molly didn't like Tommy, so tweets mentioning the couple have a lower score than their popularity would suggest, and the model fails to place them in the correct order. This series was also a lot more about the individuals than the couples; for example, Maura and Anton took a while to "settle" into a strong couple yet were favourites with the public, and since our features are based on tweets about the couples, the model was blind to this.

Predicting

Finally, what everyone is really reading this for: predicting the winner! We only went and got it right (although we did get 2nd and 3rd place the wrong way round). When interpreting these results it is important to remember that the predicted vote share is not capped, and therefore the total can (and clearly does here) add up to more than 100%. Given it's relative, we see that Amber and Greg have over double the predicted share of the next largest couple; given what we know about the model coefficients, this suggests a huge number of very positive tweets in comparison to the other couples.

I am aware that I am publishing this after the final, and therefore it might have been easy for me to "adapt" the model to prove my case; however, I placed a bet on the prediction beforehand to prove that I backed it!

And that's it… nearly! I am going to do one final piece, but it will be a run-through of the code I used rather than a write-up of any insight (i.e. a lot more technical than the previous posts). I shall also publish the code in full, so you can use it to win a whole £24 next year!