From Research to Production

Deploying models trained in your research environment is not always a simple task. Your research environment, your production programming language, and the interplay between them may affect the ease of introducing new statistical models in production.

In this blog post, I’ll demonstrate the complete flow: training a Random Forest model in R, exporting it to a PMML file, and finally scoring with the model in production-oriented languages using Scoruby and Goscore.

PMML stands for Predictive Model Markup Language. It represents models from research environments as XML files that can later be loaded and run in production.

Scoruby and Goscore are packages I wrote that consume PMML files of various models and execute them in Ruby and Go, respectively, under production memory and speed constraints.

Random Forest Model

To train a Random Forest model we need training data. The Kaggle website provides some nice datasets; I chose the Titanic dataset, with features describing the Titanic passengers and labels indicating whether they survived or not (1 or 0). Our model will receive a passenger’s set of features and return that passenger’s survival probability.

The input to Random Forest training is a dataset of labeled feature vectors. In the Titanic dataset, each row represents a passenger by numerical features such as age and fare, and factor features such as sex and port of embarkation.

Each row also has a Survived value of 0 or 1, which serves as its label. When we train our Random Forest model, the output is an ensemble of decision trees, each trying to classify whether a passenger survived (1) or not (0) from the passenger’s features.

When scoring with a Random Forest model, we traverse all of the model’s decision trees with our input features. The score for a label is the ratio between the number of trees that predicted that label and the total number of trees: if we have 300 trees and 180 predicted “1”, then our model’s score for “1” is 0.6 (180 / 300).
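The voting rule described above can be sketched in a few lines of Go. This is only a toy illustration, not how Goscore is implemented: the Tree type, ForestScore and toyForest are names I made up for the sketch, and the trees here always return a fixed label instead of inspecting the features.

```go
package main

import "fmt"

// Tree is a stand-in for one trained decision tree: it maps a feature
// vector to a predicted label. (Hypothetical type for this sketch.)
type Tree func(features map[string]float64) string

// ForestScore returns the fraction of trees that predict the given label.
func ForestScore(trees []Tree, features map[string]float64, label string) float64 {
	if len(trees) == 0 {
		return 0 // no trees, no votes
	}
	votes := 0
	for _, tree := range trees {
		if tree(features) == label {
			votes++
		}
	}
	return float64(votes) / float64(len(trees))
}

// constantTree builds a toy tree that always predicts the same label.
func constantTree(label string) Tree {
	return func(map[string]float64) string { return label }
}

// toyForest builds a forest in which `ones` trees predict "1" and the rest "0".
func toyForest(total, ones int) []Tree {
	trees := make([]Tree, 0, total)
	for i := 0; i < total; i++ {
		if i < ones {
			trees = append(trees, constantTree("1"))
		} else {
			trees = append(trees, constantTree("0"))
		}
	}
	return trees
}

func main() {
	// 300 trees, 180 of which vote "1", as in the example above.
	forest := toyForest(300, 180)
	score := ForestScore(forest, map[string]float64{"Age": 30, "Fare": 7.25}, "1")
	fmt.Printf("score for \"1\": %.2f\n", score)
}
```

In a real forest each Tree would walk its nodes using the passenger’s features; only the counting step stays the same.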

Train Random Forest Model in R

The Titanic training set can be downloaded as a CSV file after you sign up to Kaggle for free. We read the data from the CSV file into the titanic.train dataset (line 4) and convert the numerical Survived values to factors so that training treats them as labels (line 5).

Then we train the model using the randomForest function. We tell the training function that the Survived column holds our labels, and that it should exclude fields such as Name, Cabin and Ticket, since their values won’t affect our model (and training only allows a limited number of distinct values for factor features) (line 6). Next we specify the input dataset (line 7) and the NA handling policy (line 8).

Now we have a Random Forest model object in R which can predict values when receiving features as input —

After exporting your model to a PMML file and importing it into another environment, it is still the same model, so its scores shouldn’t change when it evaluates the same features in different environments.

Notice that if you follow my example and train your own model, it won’t be identical to the model I’ve trained, because Random Forest training is randomized. Since we trained on the same data, our scores for the same features should be similar, but not the same.
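One way to check this in practice is to record a few scores in research and compare them with the scores the exported model produces in production. Here is a minimal sketch, assuming you have collected both sets of scores yourself. ScorePair and Mismatches are hypothetical names, and the numbers in main are placeholders, not real model output.

```go
package main

import (
	"fmt"
	"math"
)

// ScorePair holds the score one input received in research and in production.
type ScorePair struct {
	Input      string
	Research   float64
	Production float64
}

// Mismatches returns the inputs whose two scores differ by more than tol.
func Mismatches(pairs []ScorePair, tol float64) []string {
	var bad []string
	for _, p := range pairs {
		if math.Abs(p.Research-p.Production) > tol {
			bad = append(bad, p.Input)
		}
	}
	return bad
}

func main() {
	// Placeholder numbers for illustration only.
	pairs := []ScorePair{
		{"passenger A", 0.6, 0.6},
		{"passenger B", 0.25, 0.31},
	}
	fmt.Println("inputs with diverging scores:", Mismatches(pairs, 1e-9))
}
```

A small tolerance is used instead of strict equality so that floating point formatting differences between environments don’t raise false alarms.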

Our model object can be saved to an RDS file that can be transferred between R environments, but how can we run this model in production?

Export to PMML File

For scoring with this model in other environments we can export our R model to a PMML file. All we need to do is to import the “pmml” package and call the pmml and saveXML commands with our trained model —

Now we have a Random Forest PMML file which consists of decision trees. We can look at one of the nodes in one of the trees to better understand how it works.

We enter this node if Fare ≤ ~8.08. Its tree then scores 1 (Survived) if Parch (number of parents and children) ≥ 1, and 0 otherwise.
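To make the node logic concrete, here is a rough Go sketch that parses a small PMML-style fragment with the standard encoding/xml package and walks it. The fragment is hand-written for illustration (with the Fare threshold rounded to 8.08) and is not copied from the exported file, and Node, SimplePredicate, Decide and parseNode are names I chose for the sketch. It handles only numeric SimplePredicate elements, while a real export also contains other predicate types.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strconv"
)

// SimplePredicate mirrors a PMML <SimplePredicate> element.
type SimplePredicate struct {
	Field    string `xml:"field,attr"`
	Operator string `xml:"operator,attr"`
	Value    string `xml:"value,attr"`
}

// Node mirrors a PMML tree <Node>: a guarding predicate, an optional
// score, and child nodes.
type Node struct {
	Score     string           `xml:"score,attr"`
	Predicate *SimplePredicate `xml:"SimplePredicate"`
	Children  []Node           `xml:"Node"`
}

// Hand-written fragment for illustration, not taken from the real file.
const fragment = `
<Node>
  <SimplePredicate field="Fare" operator="lessOrEqual" value="8.08"/>
  <Node score="1">
    <SimplePredicate field="Parch" operator="greaterOrEqual" value="1"/>
  </Node>
  <Node score="0">
    <SimplePredicate field="Parch" operator="lessThan" value="1"/>
  </Node>
</Node>`

// matches reports whether the features satisfy the predicate.
func (p *SimplePredicate) matches(features map[string]float64) bool {
	if p == nil {
		return true // a node without a predicate is always entered
	}
	value, ok := features[p.Field]
	threshold, err := strconv.ParseFloat(p.Value, 64)
	if !ok || err != nil {
		return false
	}
	switch p.Operator {
	case "lessThan":
		return value < threshold
	case "lessOrEqual":
		return value <= threshold
	case "greaterThan":
		return value > threshold
	case "greaterOrEqual":
		return value >= threshold
	case "equal":
		return value == threshold
	}
	return false
}

// Decide walks down from n and returns the score of the leaf it reaches.
// The boolean is false when the features never reach a leaf.
func Decide(n Node, features map[string]float64) (string, bool) {
	if !n.Predicate.matches(features) {
		return "", false
	}
	if len(n.Children) == 0 {
		return n.Score, true
	}
	for _, child := range n.Children {
		if score, ok := Decide(child, features); ok {
			return score, true
		}
	}
	return "", false
}

// parseNode decodes a single <Node> element.
func parseNode(doc string) (Node, error) {
	var n Node
	err := xml.Unmarshal([]byte(doc), &n)
	return n, err
}

func main() {
	node, err := parseNode(fragment)
	if err != nil {
		panic(err)
	}
	score, ok := Decide(node, map[string]float64{"Fare": 7.25, "Parch": 2})
	fmt.Println(score, ok)
}
```

A full tree is just this structure nested many levels deep, and a forest repeats it once per tree.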

Score with Ruby and Go

At Riskified we use the JPMML Java packages for our production scoring. They work great for us, but they require a Java environment, are not free of charge, and are less fun to play with (they’re written in Java) :)

To make production scoring more accessible for Rubyists and Gophers, I’ve started two MIT-licensed packages, Scoruby and Goscore, which currently support Random Forest, Naive Bayes, Decision Tree and Gradient Boosted models, and will support more models on request.

Usage of both packages is quite straightforward: we load the model from the PMML file we created, then call the prediction function with the features we’ve set. You can run it and see that both packages return the same score as in research.

Goscore —

Scoruby —

Conclusion

Exporting a trained model from research to production can be done easily using PMML files and packages. If you are interested in implementing new models supported by PMML, contact me and I’ll be happy to assist and add support for new models. The PMML website lists many supported statistical models.

If you’re interested in practicing what you’ve learned here, you can download other datasets from Kaggle, train your own model and export it to a PMML file. Then you can run your model’s predictions in your favorite ecosystem.

Please send feedback, bug reports, and model requests by opening issues at the relevant repositories, or by mailing me at aschers@gmail.com