Software engineering 👀

Project structure

text_classifier
│
├───notebooks
│   ├───classifier-a
│   ├───classifier-b
│   └───classifier-c
│       ├───data (common sample training data)
│       ├───preparation
│       ├───modelling
│       ├───evaluation
│       └───final
│
├───tc (acronym of text_classifier; contains core modules)
│   │   queries.py
│   │   base.py
│   │   preparation.py
│   │   models.py
│   │   train.py
│   │   postprocessing.py
│   │   predict.py
│   │   document_processing.py
│   │
│   ├───config
│   │       input_data_config.py
│   │       splunklog_config.py
│   │       training_config.py
│   │
│   ├───nlp
│   │       embeddings.py
│   │       preprocessors.py
│   │
│   └───utilities
│           helpers.py
│           loggers.py
│           metrics.py
│           plotting.py
│           preprocessors.py
│           aws.py
│           db_connectors.py
│           html.py
│
├───data
│   ├───classifier-a
│   ├───classifier-b
│   └───classifier-c
│
├───api
│       router.py
│       flask_app.py
│       request_handlers.py
│       inference.py
│
├───env
│       base.yaml
│       cpu.yaml
│       gpu.yaml
│       build.yaml
│
├───deployment
│   └───terraform
│
├───persistence
│
├───scripts
│
└───tests

It is really important to settle the project structure at the beginning so that the code can evolve in a structured way. We took considerable time and had many discussions before we converged on this layout. Have a look at this for a basic scaffold to start with.

This is how we train models on AWS EC2 or locally and back up code, data, models and reports to AWS S3. The directory structure below is created automatically by the preparation and train classes; a rough sketch of such a backup helper follows the tree.

data
└───region (we have models trained on many regions)
    ├───model-a (model for predicting a)
    ├───model-b (model for predicting b)
    └───model-c (model for predicting c)
        └───2019-08-01 (model version as per date)
            ├───code.zip (codebase backup)
            ├───raw (fetched data)
            ├───processed (training data)
            └───models
                └───1 (NN architecture type)
                    ├───model.h5
                    ├───encoders.pkl
                    └───reports
                        ├───train_report.csv
                        ├───test_report.csv
                        ├───keras_train_history.csv
                        └───keras_test_history.csv
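As a rough illustration only, a helper along these lines can build the date-versioned directory tree and mirror it to S3 with boto3. The function names, paths and bucket are hypothetical, not our exact code, and AWS credentials are assumed to be configured in the environment.

import datetime
from pathlib import Path

import boto3  # assumes AWS credentials are already configured


def build_run_dirs(root: str, region: str, model: str) -> Path:
    """Create the date-versioned local directory tree for one training run."""
    version = datetime.date.today().isoformat()  # e.g. 2019-08-01
    run_dir = Path(root) / region / model / version
    for sub in ("raw", "processed", "models"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir


def backup_to_s3(run_dir: Path, bucket: str, prefix: str = "data") -> None:
    """Mirror the local run directory to S3, keeping the same layout."""
    s3 = boto3.client("s3")
    for path in run_dir.rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(run_dir.parents[2])}"
            s3.upload_file(str(path), bucket, key)


# usage (hypothetical names)
run_dir = build_run_dirs("data", region="eu-west", model="model-a")
# ... training writes model.h5, encoders.pkl and reports/ under run_dir ...
backup_to_s3(run_dir, bucket="text-classifier-backups")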

While doing the POCs, we have no idea which modules will end up in the final solution, so we pay less attention to modularity or reusability. But as soon as the POC is done, we consolidate the final code into notebooks and keep them in /notebooks/final. We had one notebook for the preparation steps and another for modelling.

notebooks
└───classifier-a
    ├───data
    ├───preparation
    ├───modelling
    ├───evaluation
    └───final
        ├───preparation.ipynb
        └───modelling.ipynb

These notebooks also became our presentation material.

Inheritance/imports ⏬

We wrote the training classes so that they can be reused by the predict classes. That way, whenever we change the preprocessing or encoding steps, we only have to change them in the training class.

Class imports
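Sketched with hypothetical class and method names (our real classes live in tc/train.py and tc/predict.py), the pattern looks roughly like this:

# tc/train.py (simplified sketch, hypothetical names)
class Trainer:
    def preprocess(self, texts):
        # cleaning, tokenisation, etc. -- the single place these steps live
        return [t.strip().lower() for t in texts]

    def encode(self, texts):
        # turn preprocessed text into model inputs
        ...

    def train(self, texts, labels):
        x = self.encode(self.preprocess(texts))
        ...


# tc/predict.py (simplified sketch, hypothetical names)
from tc.train import Trainer


class Predictor(Trainer):
    """Reuses Trainer's preprocess/encode so training and inference never diverge."""

    def __init__(self, model):
        self.model = model

    def predict(self, texts):
        x = self.encode(self.preprocess(texts))
        return self.model.predict(x)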

Inference class

Our inference modules use the predict class, along with checks on the data for failure cases such as empty strings. We also save the inferences to a central PostgreSQL inference database.
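A minimal sketch of such a wrapper; the predictor interface and the save_inference helper are assumptions for illustration, not our exact API.

# api/inference.py (sketch, hypothetical names)
from tc.predict import Predictor
from tc.utilities.db_connectors import save_inference  # assumed helper, sketched further below


class Inference:
    def __init__(self, predictor: Predictor, model_type: str, model_version: str):
        self.predictor = predictor
        self.model_type = model_type
        self.model_version = model_version

    def run(self, text: str) -> dict:
        # guard against failure cases such as empty strings
        if not text or not text.strip():
            return {"error": "empty input"}

        # assumed to return a (label, probability) pair for the single input
        label, probability = self.predictor.predict([text])[0]

        # persist the inference to the central PostgreSQL database
        save_inference(
            input_value=text,
            predicted_value=label,
            probability=probability,
            model_type=self.model_type,
            model_version=self.model_version,
        )
        return {"label": label, "probability": probability}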

Our router is a simple Flask router with endpoints for the different models. All the important exceptions are caught and returned with appropriate messages.
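A stripped-down version of what such a router could look like; the endpoint name, handlers and the way the Inference instance is built are illustrative, not our production code.

# api/flask_app.py (sketch)
from flask import Flask, jsonify, request

from api.inference import Inference   # hypothetical, see the sketch above
from tc.predict import Predictor      # hypothetical

app = Flask(__name__)

# one Inference instance per deployed model, built once at startup
# (None stands in for a loaded Keras model in this sketch)
inference_a = Inference(Predictor(model=None), model_type="classifier-a", model_version="2019-08-01")


@app.route("/predict/classifier-a", methods=["POST"])
def predict_classifier_a():
    try:
        payload = request.get_json(force=True)
        result = inference_a.run(payload.get("text", ""))
        return jsonify(result), 200
    except KeyError as err:
        return jsonify({"error": f"missing field: {err}"}), 400
    except Exception as err:  # catch-all so the API never returns a bare 500
        return jsonify({"error": str(err)}), 500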

Inference database

We save all the inferences, including input values, predicted values, model version, model type, probability and so on, so that we can analyse the models in production.
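The row we store could look roughly like this psycopg2-style sketch; the table name, column names, types and DSN are illustrative and follow the fields listed above.

# sketch of how an inference row could be stored (illustrative table and columns)
import psycopg2

INSERT_SQL = """
    INSERT INTO inferences
        (input_value, predicted_value, probability, model_type, model_version, created_at)
    VALUES (%s, %s, %s, %s, %s, NOW())
"""


def save_inference(input_value, predicted_value, probability, model_type, model_version,
                   dsn="postgresql://user:password@host/inference_db"):  # placeholder DSN
    # psycopg2 connections commit on clean exit from the with-block
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(INSERT_SQL, (input_value, predicted_value, probability,
                                 model_type, model_version))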

One of our next steps is to build APIs for generating reports on ML performance.

Design patterns 🐗

Singleton pattern to initialize the embeddings once and share the same object across different models. This reduces memory usage on the EC2 instance.
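Something along these lines keeps a single embedding matrix in memory no matter how many models ask for it; the class name and file path are illustrative.

# tc/nlp/embeddings.py (singleton sketch, illustrative names)
import numpy as np


class Embeddings:
    _instance = None

    def __new__(cls, path="embeddings.npy"):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            # the expensive load happens exactly once per process
            cls._instance.vectors = np.load(path)
        return cls._instance


# both models share the same object, so the vectors are loaded only once
emb_a = Embeddings()
emb_b = Embeddings()
assert emb_a is emb_b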

Factory pattern to initialize model training classes with different configs.
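In the same spirit, a small factory can map a config to the right training class. The class names below are hypothetical and build on the Trainer sketch above.

# factory sketch (hypothetical class names)
from tc.train import Trainer


class ClassifierATrainer(Trainer): ...
class ClassifierBTrainer(Trainer): ...
class ClassifierCTrainer(Trainer): ...


def make_trainer(config: dict) -> Trainer:
    """Pick and initialize the right training class from a config dict."""
    trainers = {
        "classifier-a": ClassifierATrainer,
        "classifier-b": ClassifierBTrainer,
        "classifier-c": ClassifierCTrainer,
    }
    try:
        trainer_cls = trainers[config["model_type"]]
    except KeyError:
        raise ValueError(f"unknown model_type: {config['model_type']}")
    return trainer_cls(**config.get("hyperparameters", {}))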

Decorator pattern

A decorator to time functions, so we know which ones take the most time. A decorator to retry DB queries if they fail, which makes data fetching more robust and keeps a transient failure from breaking the training pipeline. A decorator for Splunk logging of the start and end of function execution. We send logs to Splunk as well as AWS CloudWatch.
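Rough sketches of the three; the Splunk and CloudWatch handlers are assumed to be configured elsewhere (for example in tc/utilities/loggers.py), and the retry counts and delays are placeholders.

import functools
import logging
import time

logger = logging.getLogger(__name__)  # Splunk / CloudWatch handlers assumed configured elsewhere


def timed(func):
    """Log how long a function takes, to spot the slow steps."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        logger.info("%s took %.2fs", func.__name__, time.time() - start)
        return result
    return wrapper


def retry(times=3, delay=5):
    """Retry flaky DB queries so a transient failure doesn't kill the training pipeline."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise  # out of retries, let the caller see the error
                    logger.warning("%s failed (attempt %d/%d), retrying",
                                   func.__name__, attempt, times)
                    time.sleep(delay)
        return wrapper
    return decorator


def log_start_end(func):
    """Log the start and end of a function execution."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("start %s", func.__name__)
        result = func(*args, **kwargs)
        logger.info("end %s", func.__name__)
        return result
    return wrapper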

Scalability 🌀

From the beginning, we wanted the codebase to be reusable for different data. So we parameterised everything through configs for the input data and the model hyper-parameters.
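For instance, the input-data and training configs can be plain dataclasses; the field names and defaults below are illustrative, the real configs live in tc/config/.

# tc/config/training_config.py (sketch, illustrative fields)
from dataclasses import dataclass, field


@dataclass
class InputDataConfig:
    region: str = "eu-west"
    text_column: str = "description"
    label_column: str = "category"


@dataclass
class TrainingConfig:
    model_type: str = "classifier-a"
    max_sequence_length: int = 300
    embedding_dim: int = 300
    batch_size: int = 64
    epochs: int = 10
    hyperparameters: dict = field(default_factory=dict)


# the same training code runs on different data just by swapping configs
config = TrainingConfig(model_type="classifier-b", epochs=20)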