3. Evaluation

How will we evaluate the performance of our model?

The gold standard here is the train-validation-test split.

When we create a train-validation-test split by random sampling, we rely on an implicit assumption that rarely holds: that the data is IID (independent and identically distributed).

In simple terms, the assumption that each data point is independent of the others and comes from the same distribution is faulty at best, if not downright incorrect.

For an internet company, a data point from 2007 is very different from one that comes in 2019. They don't come from the same distribution because of many factors, internet speed being the foremost.

If you have a cat vs. dog prediction problem, random sampling works just fine. But in most machine learning systems, the task is to predict the future.

In such cases, think about splitting your data on the time variable rather than sampling randomly. For example, for the click-prediction problem, you can use all past data up to last month as training data and the data from last month as validation.
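Here is a minimal sketch of such a time-based split with pandas, assuming a hypothetical click-log DataFrame with a `timestamp` column (the file name and column names are illustrative):

```python
import pandas as pd

# Hypothetical click-log data with a "timestamp" column.
df = pd.read_csv("clicks.csv", parse_dates=["timestamp"])

# Everything before the cutoff is training data;
# the most recent month is held out for validation.
cutoff = df["timestamp"].max() - pd.DateOffset(months=1)

train = df[df["timestamp"] < cutoff]
valid = df[df["timestamp"] >= cutoff]
```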

The next thing you will need to think about is the baseline model.

Let us say we use RMSE as the evaluation metric for our time series model. We evaluate the model on the test set, and the RMSE comes out to be 4.8.

Is that a good RMSE? How do we know? We need a baseline RMSE. It could come from a model currently employed for the same task, or from some simple model. For a time series model, a common baseline to beat is the last-day prediction, i.e., predicting today's number as yesterday's number.
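As a rough sketch of what that baseline looks like in code (the toy daily series below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy daily series of the quantity we want to predict.
daily = pd.Series([120.0, 130.0, 128.0, 140.0, 135.0, 150.0, 149.0])

# Last-day baseline: yesterday's value is the prediction for today.
baseline_pred = daily.shift(1)

# RMSE of the baseline, skipping the first day (no "yesterday" available).
rmse = np.sqrt(np.mean((daily[1:] - baseline_pred[1:]) ** 2))
print(f"Baseline RMSE: {rmse:.2f}")
```

If your model's RMSE can't beat this number, it isn't adding much value.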

For NLP classification models, I usually set the baseline to be the evaluation metric (accuracy, F1, log loss) of a logistic regression model on CountVectorizer (bag-of-words) features.
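A minimal version of that baseline with scikit-learn might look like this (the tiny corpus and labels are placeholders for your real training and validation data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for a real labelled corpus.
texts_train = ["great product", "terrible service", "loved it", "awful experience"]
y_train = [1, 0, 1, 0]
texts_valid = ["really great", "truly awful"]
y_valid = [1, 0]

# Bag-of-words features + logistic regression: the baseline a fancier model must beat.
baseline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts_train, y_train)

print("Baseline F1:", f1_score(y_valid, baseline.predict(texts_valid)))
```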

You should also think about how you will break down the evaluation across multiple groups so that your model doesn't induce unnecessary biases.

Last year, Amazon was in the news for a secret AI recruiting tool that showed bias against women. To save our machine learning model from such inconsistencies, we need to evaluate it on different groups. Maybe our model is not as accurate for women as it is for men because there are far fewer women in the training data.

Or maybe a model predicting whether a product will be bought given a view works well for one product category but not for others.
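One simple way to catch this is to report the metric per group rather than a single global number. A minimal sketch, assuming a validation DataFrame with hypothetical `group`, `y_true`, and `y_pred` columns:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical validation results: true labels, model predictions,
# and the group each row belongs to (e.g. gender or product category).
results = pd.DataFrame({
    "group":  ["A", "A", "B", "B", "B"],
    "y_true": [1, 0, 1, 1, 0],
    "y_pred": [1, 0, 0, 1, 1],
})

# Report the metric for each group instead of one global number.
for group, subset in results.groupby("group"):
    acc = accuracy_score(subset["y_true"], subset["y_pred"])
    print(f"group={group}: accuracy={acc:.2f}, n={len(subset)}")
```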

Keeping such things in mind beforehand, and thinking precisely about what could go wrong with a particular evaluation approach, goes a long way toward designing a good ML system.