Deploying machine learning solutions on Microsoft SQL Server using RevoScaleR

A quick guide to doing data science on your database

It is no secret that data science projects tend to be time consuming and resource intensive. Depending on the size of your data and the complexity of the underlying machine learning algorithms, the training process can take a considerable amount of time. Factor in the iterative process of building the model, which includes exploratory analysis, testing, feature engineering and hyper-parameter tuning and it is no mystery that the path from an idea to a production level application can be a long and winding one.

A typical walk-through of a data science solution entails copying the data set to a local machine, load it in memory and then performing the required analysis and building a model. This exposes your data to unnecessary movement and security risks in addition to having to work in separate computation environments for development and production.

To avoid all the unnecessary confusion and extra costs associated with setting up, maintaining and potentially paying for an additional computation environment for your data science project, consider Microsoft SQL Server’s Machine Learning Services.

As of SQL Server 2016 (Enterprise Edition) you can implement machine learning models using R (support for Python has been added on SQL Server 2017, all editions). There are several benefits to using an in-database predictive analytics service such as: less data movement, model persistent storage, real-time scoring, and being able to use T-SQL to pre-process data before training. And the best part, it comes at no extra cost.

The most common complaint of data scientists with respect to using R as your analytics engine is that it becomes problematic when it comes to scale and performance. Only data sets that fit into the available memory of the local computer can be used due to the fact that the runtime for R is single-threaded. To circumvent this, SQL Server machine learning services comes pre-loaded with RevoScaleR. This R package contains implementations of the most widely used R functions with parallelism and scalability in mind. In addition, it contains functions which can switch the compute context from your local machine to SQL Server without you having to leave the comfort of you favorite R IDE. This way, you can explore data, build, tune and deploy production quality models directly from one local workstation without ever moving the data from its original source or having to reevaluate your local scripts on a remote instance.

Once the code for model training has been finalized and tuned for performance, it can be included in a SQL stored procedure and integrated with the rest of regular ETL management tasks for scheduled re-training. RevoScaleR also provides functionality for converting the model to a binary variable allowing it to be stored in the database and be called as a parameter to execute predictions.