This summer I have taken on the formidable task of creating a ‘good’ data ingest and machine learning pipeline for Pythia. If you didn’t know, aside from effectively transporting water and sludge, a pipeline can also link the different pieces of a project into one cohesive framework. As I quickly discovered, there is more to building a pipeline than meets the eye. After lots of tinkering, slogging, and construction, the Pythia pipeline was born.

“You don’t have to be a genius or a visionary or even a college graduate to be successful. You just need a framework and a dream.” — Michael Dell

I cannot say I know the guy, but I have heard he made computers a long time ago. I have also found that a framework is a critical component to the success of a project (next Michael Dell perhaps?).

First, some background on Pythia. Many people spend much of their day reading articles, trying to find the important ones or perhaps some new nugget of information. A majority of these articles may rehash the same information, so being able to eliminate “duplicate” articles would vastly improve these people’s jobs. The Pythia project is exploring a variety of machine learning and deep learning techniques in order to detect novelty in text. We are defining novelty as a document that has something new to say about a topic, when compared to a corpus of already processed documents about that topic. For example, if “I had a sandwich for lunch” is the content of the corpus (“lunch” being the topic), a new document containing “I ate a good sandwich” has low novelty, whereas a new document containing “After my sandwich, I took a nap” has high novelty. To detect novelty, Pythia uses a wide range of features and algorithms, and each of these can interact with one another in distinct ways.

As I began working on Pythia, it quickly became apparent that there would be a lot of trial and error required to find the best features and algorithms to detect novelty for any given corpus, and that “hard coding” them into Pythia would soon become cumbersome. I realized that it was vital to develop a flexible framework that allows our project to experiment with different combinations of features and algorithms, and one that is easily extensible to incorporate new techniques.

After building our pipeline, I have learned that a well-designed framework offers major advantages in three key parts of the data science workflow: development, experimentation, and deployment.

Development

As far as I can tell, the main reason people don’t spend time on their framework is the urge to get started. In the moment, it seems easier simply to hard code the specific functionality desired than to spend time planning and developing a general structure. For truly one-off projects, this may be the right approach. However, for larger or more sustained efforts, a solid framework is essential to the project team’s productivity.

With a well-structured pipeline, implementing and integrating a new feature or algorithm becomes straightforward. Instead of having to consider how to work in the new technique, the sole focus can be on the technique itself because the code will easily slot into the larger framework.

The basic structure of Pythia’s pipeline is the following:

Data parsing — reads in JSON files formatted in the specified structure and stores the data internally, splitting it into training and testing sets
Preprocessing — builds any “meta-data” from the training set that is necessary for the specified features
Features — generates each observation from the data and calculates the specified features
Algorithms — trains the specified classifier(s)
Prediction — uses the classifier(s) to make predictions on the testing data
Results — calculates a variety of metrics scoring the predictions
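
To make the flow concrete, here is a toy end-to-end run of the same stage sequence, using scikit-learn stand-ins (a bag-of-words vocabulary plus logistic regression). The documents and labels are made up for illustration, and the real Pythia modules are considerably more involved.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Data parsing: pretend these documents and labels came out of the JSON reader
docs = ["I had a sandwich for lunch", "I ate a good sandwich",
        "After my sandwich, I took a nap", "Lunch today was a salad"]
labels = [0, 0, 1, 1]  # 0 = duplicate, 1 = novel
train_docs, test_docs, train_y, test_y = train_test_split(
    docs, labels, test_size=0.5, random_state=41, stratify=labels)

# Preprocessing + features: build a vocabulary, then bag-of-words vectors
vectorizer = CountVectorizer().fit(train_docs)
train_X = vectorizer.transform(train_docs)
test_X = vectorizer.transform(test_docs)

# Algorithms, prediction, and results
clf = LogisticRegression().fit(train_X, train_y)
print(accuracy_score(test_y, clf.predict(test_X)))
```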

Each step of the pipeline is built to allow a broad array of functionality, facilitating the flexibility desired in each stage of the project’s workflow. For example, the preprocessing module collects any information and parameter settings from the data that the given features need. In Pythia, we needed a vocabulary, an encoder and decoder, or a topic model, depending on the feature, and the pipeline allows us to incorporate and potentially change any of these items.

While all the steps in the workflow benefit from a flexible pipeline, feature extraction and algorithm development benefit the most. As the pipeline moves to feature extraction, it is not unusual for a project to apply multiple feature extraction techniques, each producing a vector. Each vector is a different way to represent a piece of text, storing different information. The Pythia team is currently looking at bag of words vectors, skip-thought vectors, and latent Dirichlet allocation (LDA) vectors.

If you are unfamiliar with these features, here is a brief overview. Bag of words vectors are sparse vectors commonly used in NLP that record the counts of all (or most) of the words in a document. Skip-thought vectors are produced by an encoder trained to predict a sentence’s surrounding sentences, so they capture both semantics and syntax at the sentence level. LDA vectors capture how relevant each of a set of topics is to the text, rather than tracking specific words, which allows them to be much shorter.
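
As a tiny illustration, the first and third of these representations can be built with a few lines of scikit-learn; skip-thought vectors need a pretrained encoder, so they are left out, and Pythia’s actual implementation may differ in its details.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["I had a sandwich for lunch",
          "After my sandwich, I took a nap"]

# Bag of words: sparse counts, one column per vocabulary word
bow = CountVectorizer()
bow_vectors = bow.fit_transform(corpus)

# LDA: dense topic proportions, one column per topic
lda = LatentDirichletAllocation(n_components=2, random_state=41)
lda_vectors = lda.fit_transform(bow_vectors)

# The bag-of-words matrix is much wider than the topic matrix
print(bow_vectors.shape, lda_vectors.shape)
```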

For each feature, we calculate the vector representation of the document we are analyzing and the vector representation of the corpus to which it is being compared. There are many ways to compare these two vectors, and the pipeline is set up to implement any combination of them. For example, the most classic comparison of two vectors is cosine similarity, which measures the cosine of the angle between them. While this gives a single figure as a comparison point, some of the algorithms do better with more information, which is why the feature module can also calculate the difference and product of the document vector and the corpus vector, or merely concatenate the two. The feature module then joins all the specified features into one vector for each data point, so it is ready to be fed into the classifiers.
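
Here is a rough sketch of what those comparisons look like in code. The function name and its options are illustrative, not Pythia’s actual feature module:

```python
import numpy as np

def compare(doc_vec, corpus_vec, operations=("cosine", "difference", "product", "concat")):
    """Combine a document vector and a corpus vector into one feature vector."""
    pieces = []
    if "cosine" in operations:   # single score: cosine of the angle between the vectors
        cos = doc_vec.dot(corpus_vec) / (np.linalg.norm(doc_vec) * np.linalg.norm(corpus_vec))
        pieces.append(np.array([cos]))
    if "difference" in operations:
        pieces.append(doc_vec - corpus_vec)   # element-wise difference
    if "product" in operations:
        pieces.append(doc_vec * corpus_vec)   # element-wise product
    if "concat" in operations:
        pieces.append(np.concatenate([doc_vec, corpus_vec]))
    return np.concatenate(pieces)

doc_vec = np.array([1.0, 0.0, 2.0])
corpus_vec = np.array([0.5, 1.0, 1.0])
print(compare(doc_vec, corpus_vec).shape)  # 1 + 3 + 3 + 6 = (13,)
```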

Setting up the pipeline with this kind of flexibility helps speed up development and eliminate errors because the generation of observations and the comparison of the vectors are abstracted away from the individual feature calculation. To add a new feature to the pipeline, only a few if statements are required. This allows focus on the implementation of the feature itself, accelerates integration into the pipeline, and requires less code, which in turn reduces the chances for errors. As a proof point, adding skip-thoughts as a feature to our pipeline was less than an hour’s work.

The three machine learning algorithms we currently have implemented are logistic regression, support vector machine (SVM) and Extreme Gradient Boosting (XGBoost). Each of these algorithms takes in the training data in the form of two lists: one with the feature vectors and the second with their corresponding labels (duplicate or novel), and returns a classifier. By standardizing the input and output, any algorithm can run on any combination of features, simplifying algorithm implementation while allowing maximum flexibility.
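
A minimal sketch of that standardized contract, using scikit-learn and xgboost, might look like the following; the wrapper function and its option names are assumptions for illustration rather than Pythia’s real code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier  # requires the xgboost package

def train_classifier(train_vectors, train_labels, algorithm="log_reg", **hyperparams):
    """Every algorithm takes the same two lists and hands back a fitted classifier."""
    models = {
        "log_reg": LogisticRegression,
        "svm": SVC,
        "xgboost": XGBClassifier,
    }
    clf = models[algorithm](**hyperparams)
    clf.fit(train_vectors, train_labels)   # labels: 0 = duplicate, 1 = novel
    return clf

# Because the contract is uniform, swapping algorithms is a one-word change, e.g.:
# clf = train_classifier(train_X, train_y, algorithm="svm", kernel="rbf")
```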

As we move forward, we are working to implement additional deep learning techniques, such as one-hot convolutional neural networks and dynamic memory networks. These techniques can be easily added to the framework, as they follow the same workflow of generating features then building the classifier.

Experimentation

Another key advantage of setting up a strong framework is the smooth transition to the experimentation stage of the project. There are a few aspects of the pipeline that enable easy, repeatable experimentation, both with different combinations of features and algorithms and with the hyper-parameters of those features and algorithms.

The framework’s parsing module reads JSON files full of text in an established format and stores them internally, randomly splitting all the document information into training and testing data given a configurable seed. The raw data of a given dataset must therefore be converted into this JSON format separately, but once it complies, it can be run through the pipeline. This allows for easy experimentation on any number of datasets with little overhead.
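
A minimal sketch of a parsing step along these lines is below. The directory layout and the one-document-per-file assumption are mine, not necessarily Pythia’s actual schema:

```python
import json
import random
from pathlib import Path

def parse_data(json_dir, test_fraction=0.2, seed=41):
    """Read every JSON file in a directory and split it reproducibly."""
    documents = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path) as f:
            documents.append(json.load(f))       # one formatted document per file
    random.Random(seed).shuffle(documents)        # same seed -> same split every run
    cut = int(len(documents) * (1 - test_fraction))
    return documents[:cut], documents[cut:]       # training set, testing set
```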

In addition, since all of the modules (features, algorithms, and other necessary information) are already connected, running experiments all the way through to the end is easy. The pipeline allows for any combination of features, and even any combination of different comparisons within a feature, as well as any combination of algorithms. In Pythia, we exposed all of the important hyper-parameters for our features and algorithms, such as vocabulary size for bag of words and kernel type for SVM, so that experimenting is even easier. Therefore, the same structure that helped us develop the pipeline also allows for immediate experimentation with absolute flexibility.
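
For a sense of what that flexibility looks like in practice, an experiment specification might boil down to a single structure like the one below; the exact keys and option names are made up for illustration:

```python
experiment = {
    "features": {
        "bag_of_words": {"vocab_size": 10000, "comparisons": ["cosine", "difference"]},
        "lda":          {"topics": 50,        "comparisons": ["cosine"]},
    },
    "algorithms": {
        "svm":     {"kernel": "rbf", "C": 1.0},
        "xgboost": {"max_depth": 4, "n_estimators": 200},
    },
    "seed": 41,
}
```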

Deployment

In a project like Pythia, any given dataset may require a unique combination of available NLP techniques and hyper-parameters in order to detect text novelty. Since we might have multiple clients who want to detect text novelty, and we don’t have access to the specific data our clients will be applying it to, it is important that our project be easily adjustable to each client’s specific use case. Because the pipeline wraps up all of Pythia’s functionality, a client can easily fiddle with parameters without any reconstruction or complications. All of the adjustments are centralized in one location, and the rest of the pipeline runs accordingly based on those specifications.