Last week, I wrote an article about a logistic regression model I fit that was able to predict outcomes with roughly 85% accuracy. Within the first two hours, I received a ton of feedback. Among the core points, excluding the typical toxic r/LeagueOfLegends drivel, were the lack of K-fold validation, my choice of a logistic regression model, and the fact that I introduced a bias by creating my win-rate matrix on the whole dataset. Keeping in mind that I'm a student, an ever-growing one at that, I made it this week's mission to look into these criticisms and grow from what I discovered.

An Unintended Bias

I don’t know a single scientific soul who goes into a project thinking, “Oh baby, let’s skew some data!” It seems others don’t share the same faith in humanity that I do at times, because the number of people who thought I intended to skew my accuracy was frightening. Despite their approach to the issue, they were right: because my win-rate matrix was built from the whole dataset, test-set outcomes leaked into the features, which allowed my classification model to “know” the outcome of a game without being told the outcome of that game. As a quick fix, I took the bias out of the win-rate matrix and amended my previous article with the proper score of 73.52%.
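The fix boils down to building the win-rate features from the training split alone, then applying them to the held-out games. Here is a minimal sketch of that idea; the toy data, the per-champion lookup, and all variable names are my assumptions for illustration, not the original pipeline:

```python
# Sketch: build win-rate features from the training split ONLY,
# so held-out outcomes can never leak into the features.
# Toy data and names are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
champs = rng.integers(0, 10, size=500)   # champion id per game (toy)
wins = rng.integers(0, 2, size=500)      # game outcome: 1 = win, 0 = loss

train_c, test_c, train_w, test_w = train_test_split(
    champs, wins, test_size=0.2, random_state=42)

# Win rates computed only from training games; unseen champions
# fall back to a neutral 0.5 prior.
rates = np.full(10, 0.5)
for c in range(10):
    mask = train_c == c
    if mask.any():
        rates[c] = train_w[mask].mean()

X_train = rates[train_c].reshape(-1, 1)  # leakage-free training features
X_test = rates[test_c].reshape(-1, 1)    # same lookup applied to test games
```

The key design point is that `rates` is fit on `train_c`/`train_w` only; the test split is transformed with that frozen lookup, never refit.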

While this was one of the more glaring issues, it was only the beginning of improving upon last week’s work. As a student, I like to create projects that can evolve with my knowledge, and one of the largest gaps at the time was the cross-validation process.

K-Fold

Some ML hobbyists and data science aficionados in the Reddit comments were kind enough to recommend K-fold validation as a means of checking my previous 85% accuracy claim. This introduced me to the world of cross-validation, and the knowledge I gained through hours of reading and experimenting helped me shift my thought processes to avoid this issue in future studies and projects. So, in short, thank you.

Through 10-fold cross-validation, in which every fold generated its own win-rate matrix from its training split, I was able to validate the model at an accuracy of 60.4%.
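The per-fold procedure above can be sketched as follows. This is a hedged illustration with toy data, not the original code; the champion ids, the win-rate helper, and the single-feature model are all my assumptions. What matters is that the win-rate lookup is rebuilt inside each fold from that fold's training indices:

```python
# Sketch: 10-fold cross-validation where the win-rate lookup is
# rebuilt from each fold's training split, so no test-fold outcomes
# leak into the features. Toy data; names are assumptions.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_champs = 20
champs = rng.integers(0, n_champs, size=1000)  # champion id per game (toy)
wins = rng.integers(0, 2, size=1000)           # 1 = win, 0 = loss

def win_rates(champ_ids, outcomes):
    """Per-champion win rate computed from the given games only."""
    rates = np.full(n_champs, 0.5)             # neutral prior for unseen champions
    for c in range(n_champs):
        mask = champ_ids == c
        if mask.any():
            rates[c] = outcomes[mask].mean()
    return rates

scores = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(champs):
    rates = win_rates(champs[train_idx], wins[train_idx])  # train fold only
    X_train = rates[champs[train_idx]].reshape(-1, 1)
    X_test = rates[champs[test_idx]].reshape(-1, 1)
    model = LogisticRegression().fit(X_train, wins[train_idx])
    scores.append(model.score(X_test, wins[test_idx]))

print(f"mean accuracy across 10 folds: {np.mean(scores):.3f}")
```

Averaging the per-fold accuracies gives a far more honest estimate than a single split, since every game serves as held-out data exactly once.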