Lessons learned deploying my first ML project

Five takeaways after launching my first production-ready ML project

Background

I work as a Software Engineer for Novelist, crafting software for public libraries. Readers' Advisory, a.k.a. book recommendation, is our bread and butter. Think of Netflix, but for books! We have two tiers of recommendations: Tier A, where our in-house professionals (librarians and other book experts) hand-craft recommendations for the most popular titles, and Tier B, recommendations generated by a recommender system using our proprietary metadata about books.

Our proprietary metadata is a treasure trove: hundreds of thousands of books with appeal terms, genres, and theme labels. Our in-house catalogers have spent years cataloging books and adding metadata. But this is also a bottleneck, because they have to manually read book summaries and reviews in order to add the appropriate metadata.

Examples of metadata: appealpacing.fast-paced, appealstory.issue-oriented, appealcharacter.lgbtqia_diverse, appealtone.thought-provoking, etc.

My project was simple: given our treasure trove of labelled book reviews, train a multi-label classifier that predicts appeal terms. This would expedite our cataloging process and eliminate our bottleneck.

Five Takeaways

1. Importance of understanding your data

My machine learning education consists of two courses taken during my undergrad three years ago, coupled with self-study: watching Andrew Ng and reading articles from Medium/Towards Data Science. In a nutshell, no formal ML education. Being a beginner machine learning practitioner (I do not dare call myself a machine learning engineer), I did what beginners do: code first, think later.

I started researching (read: Googling) various multi-label text classification solutions and implementing them one by one, comparing the results and getting frustrated at the lack of accuracy. After trying out my sixth or seventh classifier, I pumped the brakes and took a step back. Then I did what I should have done first: understand the landscape of my data. After some basic data exploration, I found that my data set was heavily imbalanced. Out of 158 possible labels, only 20 had a decent number of samples. Also, 158 unique labels!
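The exploration step I skipped is only a few lines of pandas. A minimal sketch of the label-frequency check, using made-up review rows and an arbitrary sample threshold (not our real data):

```python
# Sketch of the label-frequency check I should have run first.
# The reviews and labels below are illustrative, not our real data set.
import pandas as pd

# Each row: one review with its multi-label appeal terms.
reviews = pd.DataFrame({
    "review_id": [1, 2, 3, 4],
    "labels": [
        ["appealpacing.fast-paced", "appealtone.thought-provoking"],
        ["appealpacing.fast-paced"],
        ["appealstory.issue-oriented", "appealpacing.fast-paced"],
        ["appealtone.thought-provoking"],
    ],
})

# Explode the label lists and count samples per label.
label_counts = reviews.explode("labels")["labels"].value_counts()
print(label_counts)

# How many labels have a usable number of samples? (threshold is arbitrary)
usable = label_counts[label_counts >= 2]
print(f"{len(usable)} of {label_counts.size} labels have >= 2 samples")
```

On our real data, the same two lines of counting would have shown the 20-out-of-158 imbalance immediately.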

I got the best results (performance + speed of training) using an adapted version of k-nearest neighbors (MLkNN) after segmenting the data set into separate groups, e.g. appealtone, appealwritingstyle, etc. Divide and conquer! The next step is to train a model using BERT.
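The segmentation idea can be sketched with plain scikit-learn. MLkNN itself lives in the scikit-multilearn package (skmultilearn.adapt.MLkNN); as a stand-in here, scikit-learn's KNeighborsClassifier also accepts multi-label indicator targets. The texts, labels, and group names below are toy assumptions:

```python
# Hedged sketch of the divide-and-conquer setup: one kNN model per appeal
# group, each trained only on that group's labels. Toy data throughout.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data: review text plus labels, pre-segmented by appeal group.
groups = {
    "appealtone": (
        ["a dark brooding tale", "a light funny romp", "grim and moody story"],
        [["dark"], ["funny"], ["dark"]],
    ),
    "appealpacing": (
        ["breathless thriller", "slow meditative novel", "fast twisty plot"],
        [["fast-paced"], ["leisurely"], ["fast-paced"]],
    ),
}

models = {}
for group, (texts, labels) in groups.items():
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)  # multi-label indicator matrix
    clf = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
    clf.fit(texts, y)
    models[group] = (clf, mlb)

# Predict appealpacing labels for a new review.
clf, mlb = models["appealpacing"]
pred = mlb.inverse_transform(clf.predict(["a fast thriller plot"]))
print(pred)
```

Each per-group model only has to discriminate among a handful of related labels instead of all 158, which is what made this approach tractable for us.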

2. Creating the model is only half the work

So you created the model and got that sweet optimal performance coupled with great accuracy. Now what? The Jupyter notebook you were following did not talk about life after that. Here are the things my team and I had to do to complete the project:

- Create a Flask API endpoint, so that we can send a POST request with a new book review and retrieve a list of predicted labels.
- Deploy the API to AWS so we can consume it.
- Save the model artifacts to AWS S3. Create a CloudFormation template to automate spinning up instances.
- Write automated tests for the API.
- Create a React tester app so that our catalogers can play with the classifier and do manual QA on its effectiveness.
- Deploy the React application.
- Create an automated deployment process (CI/CD) for everything.

Protip: Use Cortex for automating CI/CD for your model.

Creating the model/classifier is half the work. Maybe a quarter. It depends on your project, to be honest. When you plan, make sure you account for all this ancillary work.
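To make the first of those tasks concrete, here is a rough sketch of what such a Flask prediction endpoint can look like. The route name, payload shape, and the stub predictor are all assumptions for illustration, not our actual code; in production the real model artifact would be loaded from S3 instead of the stub:

```python
# Minimal sketch of a prediction endpoint. The predict_labels stub stands
# in for the real classifier loaded from S3.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_labels(review_text):
    # Stand-in for model.predict(); returns a list of appeal terms.
    return ["appealpacing.fast-paced"] if "fast" in review_text else []

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    review = payload.get("review", "")
    return jsonify({"labels": predict_labels(review)})

# Run locally with: flask --app <this module> run
```

A POST to /predict with {"review": "..."} then returns {"labels": [...]}, which is exactly the contract the React tester app consumes.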

Protip: Write your ML code in Python scripts instead of directly in your Jupyter notebook. Reviewing code changes in notebooks is next to impossible. Also, this makes testing easier.

3. You do not need to be an ML specialist

Let me start this section by saying that if you are a specialist, then it's better. Obviously. But if you are not, it does not mean you cannot create an ML-powered product! As I said before, I do not have a formal ML education. Granted, the learning curve is stupendously steep, but it's not insurmountable. I feel like an imposter every day, but when I see our catalogers use our Appeal Predictor, I feel a little better.

The ML ecosystem is amazing! Libraries like pandas, scikit-learn, pytorch, huggingface etc. have democratized machine learning. You can create your own model in less than 20 lines of code! There are gazillions of tutorials that will hold your hand through the whole process. If you have the perseverance and determination, you can easily create your own ML-powered product. It might not be the most cutting-edge model out there, but it's your model! And you learn as you build. Nothing beats that.
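The "under 20 lines" claim is not an exaggeration. Here is one way it can look with scikit-learn, using toy reviews and labels as stand-ins for real training data:

```python
# A complete text classifier in a handful of lines: TF-IDF features piped
# into logistic regression. The training data here is a toy stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["gripping fast thriller", "slow quiet drama",
         "fast paced action", "quiet character study"]
labels = ["fast-paced", "leisurely", "fast-paced", "leisurely"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["a fast gripping read"]))
```

It will not win any benchmarks, but it is a real, trainable model, and swapping in better features or a stronger estimator is the same few lines.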

Remember, as long as you are solving a problem and there are people who benefit, there is value. The end users do not care whether your model has an F1 score of 88% or 90%.

4. The importance of agile development

Software engineering is problem solving: you have a business problem that needs to be solved. In my case, we wanted to reduce our cataloging bottleneck and expedite the process, saving time and resources. Always let the problem guide you to the solution, not the other way around.

You want to get practical feedback on your product as soon as possible and then use that feedback to iterate. For us, that meant getting something out to our catalogers so that they could use their domain expertise to verify the predictor's results. We then used their feedback to tune the model.

5. Data is King

Your model is only as good as your data. There is no way around it. One of the hardest things in ML, in my opinion, is curating a well-rounded data set. Fortunately, I had that ready at my fingertips, which expedited the whole process and made my life easier. Moreover, our data was labelled by domain experts, meaning the labels were highly accurate, which in turn enabled us to create a decently accurate model.

Before starting your own project, make sure your data is clean. Understand the landscape of your data. Don't do what I did: do some data exploration first. See if you can rebalance or augment it using techniques like SMOTE. And remember, even if your model is not as accurate as you would like it to be, you get to learn a bunch of new things along the way. That pays dividends in the long run.
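SMOTE (from the imbalanced-learn package) synthesizes new minority-class samples by interpolating between neighbors. As a simpler illustration of the same rebalancing idea, plain random oversampling with scikit-learn's resample utility, on toy data, looks like this:

```python
# Random oversampling as a simple stand-in for SMOTE: duplicate minority
# samples (with replacement) until classes are balanced. Toy data below.
from collections import Counter
from sklearn.utils import resample

# Toy imbalanced data: 5 majority ("common") vs 2 minority ("rare") samples.
samples = [("some review text", label) for label in
           ["common"] * 5 + ["rare"] * 2]
minority = [s for s in samples if s[1] == "rare"]
majority = [s for s in samples if s[1] == "common"]

# Upsample the minority class to match the majority count.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = majority + minority_up
print(Counter(label for _, label in balanced))
```

One caveat from our experience: resample or SMOTE only on the training split, never on the evaluation data, or your metrics will flatter the model.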