With the growing interest in data science, an increasing number of companies are competing for a piece of the pie. One interesting approach to making life easier for data scientists is Domino.

In this context, a data scientist is a person who develops models, for example better customer segmentations or strategies to increase conversion in individualised marketing campaigns. On its own, a model is not particularly sexy: it is essentially a large matrix of numbers (parameters) that can be applied to inputs (e.g. customer click data). But when a good model is used to drive actions, it can yield rather satisfying results.

The process of coming up with a good model is what data scientists should be spending their time on, but invariably they waste time moving data and code between environments. This is where Domino wants to provide a solution. Its software simplifies some of the complexities of deploying models to machines, keeps track of different versions of inputs, code, and outputs, and lets users share and collaborate. Think of it as a clever combination of AWS, Docker, and GitHub.

To give Domino a test ride, I deployed a model for the Expedia Kaggle competition in R, using an underlying "gradient boosting" algorithm. The algorithm can make use of several cores at once, even with the open source version of R. The other obvious choice of programming language would be Python, but Domino only supports v2 of Python at the moment. Julia has also been gaining in popularity recently, but I did not get a chance to test it on Domino.

I found that the standard Domino R runtime environment does not come with a package that transforms categorical variables into so-called dummies, but it's easy to install a package on the fly like this:
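As a sketch of what that looks like (the `dummies` package and the column name are my illustrative choices, not necessarily what the original run used): installing the package at the top of the script is enough, since Domino runs the whole script inside the freshly provisioned container.

```r
# Install a one-hot encoding package at the start of the run; the
# package and column names here are illustrative placeholders.
install.packages("dummies", repos = "https://cran.r-project.org")
library(dummies)

# Toy data frame standing in for the Kaggle training set
train <- data.frame(price = c(100, 120),
                    hotel_continent = factor(c("EU", "AS")))

# Expand the categorical column into 0/1 indicator ("dummy") columns
train <- dummy.data.frame(train, names = "hotel_continent")
```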

Another nice touch is that resource-usage statistics are stored for each "run". In my case, model training consumed about 30 GB of RAM and a little more than 8 cores. I suspect the model itself cannot scale beyond 8 cores and the rest is overhead; my machine had 32 cores, and I had tried to use all of them by specifying 32 threads in the model parameters.
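For illustration, here is roughly what that thread setting looks like with `xgboost` (my assumption for the gradient-boosting implementation; the data and other parameters are placeholders), where `nthread` is the parameter in question:

```r
library(xgboost)

# Placeholder feature matrix and labels standing in for the real data
x <- matrix(runif(1000 * 10), nrow = 1000)
y <- sample(0:1, 1000, replace = TRUE)

# nthread = 32 asks for all cores; in my run the algorithm only
# made effective use of about 8 of them
model <- xgboost(data = x, label = y, nrounds = 50,
                 objective = "binary:logistic", nthread = 32)
```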

The "hassle" of procuring a server was taken away: Domino took care of deploying the model, spinning up the container, and tearing it down at the end. The console output was visible during the run and saved for later inspection, and any files modified or written by the model are version controlled as well.

If you want to use distributed processing, such as Hadoop or Spark, you need to create your own cluster, so one of Domino's added values no longer applies. Perhaps this will be added later?

Underlining the "non-visual" mode of operation, another consumption mechanism for the model is to expose it as an API: in other words, to make the model available as a service on the intranet, e.g. to an application server. Domino makes this really easy. Pricing is then based on the number of API calls per month.
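As a sketch of what publishing looks like: you point Domino at a file and a function in it, and, as I understand the convention, Domino calls that function with the parameters of each request and returns its value to the caller as JSON. The file name, function name, and features below are hypothetical:

```r
# predict_api.R -- the endpoint's entry point (hypothetical names).
# Load the trained model once, when the endpoint starts up.
model <- readRDS("model.rds")

# The platform invokes this function for each API call, passing the
# request parameters as arguments and serialising the return value.
score <- function(user_location, search_month) {
  features <- data.frame(user_location = user_location,
                         search_month = search_month)
  predict(model, features)
}
```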

In summary, Domino tries to bring continuous delivery (think "DevOps") to the data science world. First impressions are good; check it out if you already have some models to play with.

Antuit provides analytics consulting services and is vendor independent.