In this post, we’ll implement a basic deep neural network with one hidden layer (and ReLU and softmax activation functions) purely in SQL. The end-to-end steps of neural network training, including the forward pass and back-propagation, will be implemented as a single SQL query on BigQuery. Since it runs on BigQuery, we are in effect performing distributed neural network training on hundreds to thousands of servers. Sounds cool, right?

That said, please note that this is a fun project to test out the limits of SQL and BigQuery and to look at neural network training from a declarative data transformation perspective. This was done without any practical application in mind, though I’ll be discussing some practical research implications towards the end. Let’s get started.

We’ll start with a simple neural network-based classifier. It has an input dimension of 2 and a binary output. The hidden layer has a dimension of 2 and uses the ReLU activation function. The output layer produces a two-dimensional output that is passed through a softmax function. The steps we’ll follow in implementing the network are an SQL-based version of the Python example shown in Karpathy’s CS231n tutorial.

Model

The model has the following parameters:

Input-to-hidden

W: 2x2 weight matrix (elements: w_00, w_01, w_10, w_11)

B: 2x1 bias vector (elements: b_0, b_1)

Hidden-to-output

W2: 2x2 weight matrix (elements: w2_00, w2_01, w2_10, w2_11)

B2: 2x1 bias vector (elements: b2_0, b2_1)

The training data is stored in a BigQuery table, with columns x1 and x2 holding the input and column y holding the output, as shown below (table name: example_project.example_dataset.example_table).

As mentioned earlier, we’ll implement the entire training as a single SQL query. The query will return the values of the parameters after training is completed. As you might have guessed, this will be a heavily nested query. We’ll construct it step by step, starting with the innermost sub-query and adding the nested outer layers one by one.

Forward Pass

Initially, we’ll assign random normal values to the weight parameters W and W2 and zero values to the bias parameters B and B2. The random values for W and W2 could be generated within SQL itself, but for simplicity we’ll generate them externally and use them within the SQL query. The inner sub-query for initializing the parameters is:
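A sketch of what this sub-query can look like (the specific numeric values below are just placeholders for the externally generated random normals, scaled by 0.01 as in the CS231n example):

```sql
SELECT
  *,
  -- weights W (random normal placeholders) and biases B (zeros)
  -0.00569693 AS w_00,
   0.00186517 AS w_01,
   0.00414431 AS w_10,
   0.01051010 AS w_11,
   0.0 AS b_0,
   0.0 AS b_1,
  -- weights W2 (random normal placeholders) and biases B2 (zeros)
  -0.01312284 AS w2_00,
  -0.01269512 AS w2_01,
   0.00379152 AS w2_10,
  -0.01218354 AS w2_11,
   0.0 AS b2_0,
   0.0 AS b2_1
FROM
  `example_project.example_dataset.example_table`
```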

Please note that the table example_project.example_dataset.example_table already contains the columns x1 , x2 and y . The model parameters will be added as additional columns in the result of the above query.

Next, we’ll compute the hidden layer. We’ll denote the hidden layer by a vector D with elements d0 and d1. We need to perform the matrix operation D = np.maximum(0, np.dot(X, W) + B), where X denotes the input vector (elements x1 and x2). This operation first multiplies X with the weight matrix W, then adds the bias vector B, and finally passes the result through the non-linear ReLU activation function, which simply sets negative values to 0. The equivalent query in SQL is:
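A sketch of this query, with the initialization sub-query from the previous step nested inside. I’m assuming the row-vector convention of np.dot(X, W), so w_00 and w_10 feed into d0:

```sql
SELECT
  *,
  -- D = max(0, X.W + B), element-wise ReLU via GREATEST
  GREATEST(0, x1 * w_00 + x2 * w_10 + b_0) AS d0,
  GREATEST(0, x1 * w_01 + x2 * w_11 + b_1) AS d1
FROM (
  SELECT
    *,
    -0.00569693 AS w_00,  0.00186517 AS w_01,
     0.00414431 AS w_10,  0.01051010 AS w_11,
     0.0 AS b_0,  0.0 AS b_1,
    -0.01312284 AS w2_00, -0.01269512 AS w2_01,
     0.00379152 AS w2_10, -0.01218354 AS w2_11,
     0.0 AS b2_0, 0.0 AS b2_1
  FROM
    `example_project.example_dataset.example_table`
)
```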

The above query adds two new columns d0 and d1 to the results of the previous inner sub-query. The output of the above query is shown below.

This completes the input-to-hidden layer transformation. Now we’ll perform the hidden layer-to-output layer transformation.

First, we’ll compute the scores for the output layer. The formula is: scores = np.dot(D, W2) + B2 . Then we’ll apply a softmax function on the scores to obtain the predicted probability of each class. The equivalent inner sub-query in SQL is:
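A sketch of these two nested sub-queries (the names score_0, score_1, prob_0 and prob_1 are naming assumptions; the placeholder comment stands for the hidden-layer sub-query built above):

```sql
SELECT
  *,
  -- softmax over the two class scores
  EXP(score_0) / (EXP(score_0) + EXP(score_1)) AS prob_0,
  EXP(score_1) / (EXP(score_0) + EXP(score_1)) AS prob_1
FROM (
  SELECT
    *,
    -- scores = D.W2 + B2
    d0 * w2_00 + d1 * w2_10 + b2_0 AS score_0,
    d0 * w2_01 + d1 * w2_11 + b2_1 AS score_1
  FROM (
    -- hidden-layer sub-query from the previous step goes here
  )
)
```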

This completes the forward pass of the neural network. Next, we’ll do back-propagation to adjust the model parameters based on the comparison of the predicted output ( probs ) with the expected output ( Y ).

First, we’ll compute the aggregate loss resulting from the current prediction, using the cross-entropy loss function. For each example, we take the negative log of the predicted probability of the correct class; the cross-entropy loss is simply the average of these values across all the instances in X and Y. Since the natural log is an increasing function, defining the loss as the negative log of the predicted probability of the correct class is intuitive: if the predicted probability of the correct class is high, the loss is low, and if it is low, the loss is high.

To reduce the chance of over-fitting, we’ll also add L2 regularization. We’ll include 0.5*reg*np.sum(W*W) + 0.5*reg*np.sum(W2*W2) in the overall loss, where reg is a hyper-parameter. Including this term in the loss penalizes high-magnitude values in the weight matrices.

In the query, we’ll also count the number of training examples ( num_examples ). This will be useful later when computing averages. The query in SQL to compute the overall loss is:
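A sketch of this step. The per-example values are kept on every row, so the sums are expressed as window aggregates (SUM(...) OVER ()); the reg value of 0.001 is an assumed setting for the hyper-parameter, and the placeholder comment stands for the forward-pass sub-query built above:

```sql
SELECT
  *,
  -- cross-entropy data loss (averaged over all examples) plus L2 regularization
  (SUM(correct_logprobs) OVER ()) / num_examples
    + 0.5 * reg * (w_00 * w_00 + w_01 * w_01 + w_10 * w_10 + w_11 * w_11)
    + 0.5 * reg * (w2_00 * w2_00 + w2_01 * w2_01 + w2_10 * w2_10 + w2_11 * w2_11)
    AS loss
FROM (
  SELECT
    *,
    -- negative log probability of the correct class for each example
    CASE WHEN y = 0 THEN -LN(prob_0) ELSE -LN(prob_1) END AS correct_logprobs,
    COUNT(*) OVER () AS num_examples,
    0.001 AS reg  -- regularization strength (hyper-parameter, assumed value)
  FROM (
    -- forward-pass sub-query from the previous step goes here
  )
)
```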

Back-propagation

Next, for back-propagation, we’ll compute the partial derivative of the loss with respect to each of the parameters. We’ll use the chain rule to compute this layer by layer, starting with the last layer. First, we’ll compute the gradients of the scores, using the derivatives of the cross-entropy and softmax functions. The query corresponding to this is:
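A sketch: dscores is just probs with 1 subtracted at the correct class, divided by num_examples (dscore_0 and dscore_1 are naming assumptions):

```sql
SELECT
  *,
  -- dscores = (probs - one_hot(y)) / num_examples
  (prob_0 - IF(y = 0, 1, 0)) / num_examples AS dscore_0,
  (prob_1 - IF(y = 1, 1, 0)) / num_examples AS dscore_1
FROM (
  -- loss sub-query from the previous step goes here
)
```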

Recall that we computed the scores using scores = np.dot(D, W2) + B2 . Hence, based on the derivatives of the scores (termed dscores ), we can compute the gradients of the hidden layer D and the model parameters W2 and B2 . The corresponding query is:
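A sketch of this sub-query. The sums over all rows are again window aggregates, and the gradient flowing into the hidden layer is masked by the ReLU (zero wherever the corresponding hidden unit was inactive):

```sql
SELECT
  *,
  -- dW2 = D^T . dscores and dB2 = column sums of dscores
  SUM(d0 * dscore_0) OVER () AS dw2_00,
  SUM(d0 * dscore_1) OVER () AS dw2_01,
  SUM(d1 * dscore_0) OVER () AS dw2_10,
  SUM(d1 * dscore_1) OVER () AS dw2_11,
  SUM(dscore_0) OVER () AS db2_0,
  SUM(dscore_1) OVER () AS db2_1,
  -- gradient w.r.t. the hidden layer, with the ReLU mask applied
  IF(d0 > 0, dscore_0 * w2_00 + dscore_1 * w2_01, 0) AS dd0,
  IF(d1 > 0, dscore_0 * w2_10 + dscore_1 * w2_11, 0) AS dd1
FROM (
  -- dscores sub-query from the previous step goes here
)
```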

Proceeding in a similar way, we know that D = np.maximum(0, np.dot(X, W) + B). Thus, using the derivative with respect to D, we can compute the derivatives of W and B. There is no point in computing the derivative of X, since it is not a model parameter and is not computed from any model parameter. The query to compute the derivatives of W and B is:
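A sketch, following the same pattern of window aggregates over all training examples:

```sql
SELECT
  *,
  -- dW = X^T . dD and dB = column sums of dD
  SUM(x1 * dd0) OVER () AS dw_00,
  SUM(x1 * dd1) OVER () AS dw_01,
  SUM(x2 * dd0) OVER () AS dw_10,
  SUM(x2 * dd1) OVER () AS dw_11,
  SUM(dd0) OVER () AS db_0,
  SUM(dd1) OVER () AS db_1
FROM (
  -- hidden-layer-gradient sub-query from the previous step goes here
)
```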

Finally, we’ll update the model parameters W, B, W2 and B2 using their respective gradients. Each update is of the form param -= learning_rate * d_param, where learning_rate is a hyper-parameter. An additional term reg*weight is added to dW and dW2 to incorporate L2 regularization into the gradient computation. We’ll also drop the temporary columns like dw_00, correct_logprobs etc. that we created in the inner sub-queries and retain only the training data (columns x1, x2 and y) and the model parameters (weights and biases). The corresponding query is:
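A sketch of the outermost update query. The learning rate is written inline as the literal 2.0 (the value used later in this post), and reg is the column carried through from the loss step:

```sql
SELECT
  x1,
  x2,
  y,
  -- W and W2 updates include the L2 regularization term reg * weight
  w_00 - 2.0 * (dw_00 + reg * w_00) AS w_00,
  w_01 - 2.0 * (dw_01 + reg * w_01) AS w_01,
  w_10 - 2.0 * (dw_10 + reg * w_10) AS w_10,
  w_11 - 2.0 * (dw_11 + reg * w_11) AS w_11,
  b_0 - 2.0 * db_0 AS b_0,
  b_1 - 2.0 * db_1 AS b_1,
  w2_00 - 2.0 * (dw2_00 + reg * w2_00) AS w2_00,
  w2_01 - 2.0 * (dw2_01 + reg * w2_01) AS w2_01,
  w2_10 - 2.0 * (dw2_10 + reg * w2_10) AS w2_10,
  w2_11 - 2.0 * (dw2_11 + reg * w2_11) AS w2_11,
  b2_0 - 2.0 * db2_0 AS b2_0,
  b2_1 - 2.0 * db2_1 AS b2_1
FROM (
  -- weight-gradient sub-query from the previous step goes here
)
```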

This completes one iteration of the forward pass and back-propagation. The above query returns the updated values of the weights and biases. A sample result is shown below:

To perform more training iterations, we’ll apply all the above steps recursively, with the result of one iteration feeding the next as its innermost sub-query. The nested query can be generated using a simple Python function. The code is available in link. As we add more iterations, the query gets heavily nested. The resulting query for performing 10 training iterations is available in link.

Because of the heavy nesting and the complexity of the query, I ran into multiple resource limits while trying to execute it in BigQuery. BigQuery’s Standard SQL dialect scales better than its legacy SQL dialect, but even with Standard SQL, it is tough to perform more than 10 iterations on a dataset with 100k instances. Because of these resource limits, we’ll evaluate the model on a problem with a simple decision boundary, so that we can get decent accuracy with a small number of iterations.

We’ll use a simple dataset with inputs x1 and x2 sampled from a normal distribution with mean 0 and variance 1. The binary output y simply indicates whether x1 + x2 is greater than zero. To train faster within 10 iterations, we’ll use a high learning rate of 2.0 (note: such a high value is not recommended in practice, as the learning could diverge).
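One possible way to generate such a dataset directly in BigQuery is a Box-Muller transform over RAND(); this is just an illustrative sketch and not necessarily how the dataset was actually produced (the table name and row count are assumptions):

```sql
-- Rough sketch: build the toy dataset in BigQuery using the Box-Muller
-- transform to get approximately standard-normal samples.
CREATE TABLE `example_project.example_dataset.example_table` AS
SELECT
  x1,
  x2,
  IF(x1 + x2 > 0, 1, 0) AS y
FROM (
  SELECT
    SQRT(-2 * LN(1 - RAND())) * COS(2 * ACOS(-1) * RAND()) AS x1,
    SQRT(-2 * LN(1 - RAND())) * SIN(2 * ACOS(-1) * RAND()) AS x2
  FROM
    UNNEST(GENERATE_ARRAY(1, 100000)) AS row_id
)
```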

Applying the above training query with 10 iterations gives the learned model parameters, which we’ll store into a new table using BigQuery’s ‘save to table’ functionality. We can now check the accuracy on the training data by performing only the forward pass and comparing the predicted and expected results. The query snippet for this is in link. We’re able to get an accuracy of 93% with 10 iterations (accuracy is similar on a separate test dataset).
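A sketch of what such an accuracy query can look like, run against the saved table (assumed here to be example_project.example_dataset.example_model, with the learned parameters stored alongside every training row). Since softmax is monotonic, comparing the raw scores is enough to pick the predicted class:

```sql
SELECT
  SUM(IF(predicted = y, 1, 0)) / COUNT(*) AS accuracy
FROM (
  SELECT
    y,
    -- forward pass with the learned parameters; the class with the
    -- higher score is the prediction
    IF(d0 * w2_01 + d1 * w2_11 + b2_1 > d0 * w2_00 + d1 * w2_10 + b2_0, 1, 0)
      AS predicted
  FROM (
    SELECT
      *,
      GREATEST(0, x1 * w_00 + x2 * w_10 + b_0) AS d0,
      GREATEST(0, x1 * w_01 + x2 * w_11 + b_1) AS d1
    FROM
      `example_project.example_dataset.example_model`  -- saved result table (name is an assumption)
  )
)
```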

If we could go up to ~100 iterations, we would get an accuracy of more than 99%.

Optimizations

This concludes the fun project of implementing a deep neural network using pure SQL in BigQuery. Where do we go from here? As we saw, resource limitation is the factor that limits the size of the dataset and the number of training iterations we can perform. Other than hoping that Google will relax the resource limits, we could apply several optimizations to address this.

We can create intermediate tables and multiple SQL queries to perform more iterations. For example, the result of the first 10 iterations can be stored in an intermediate table. The same training query can now be applied on this intermediate table for performing the next 10 iterations. Thus in effect, we’ve performed 20 training iterations. This can be repeated multiple times to perform a large number of iterations.

Instead of adding outer queries at each step, we could use functions of functions whenever possible. For example, we can compute both scores and probs in a single sub-query instead of 2 nested sub-queries.

In the above example, I’ve retained all the intermediate columns till the last outer query. Some of them, like correct_logprobs, can be removed earlier (though the SQL engine might perform this optimization automatically).

Application of user-defined functions (UDFs) can be explored. If interested, you can check out a project where a BigQuery UDF is used for model serving (however, training is not performed using SQL or UDFs).

Implications

Now let us look at the deeper implications of a distributed SQL engine in the context of deep learning. One limitation of warehouse SQL engines like BigQuery and Presto is that the query processing is performed using CPUs instead of GPUs. It would be interesting to check out the results with GPU-accelerated SQL databases like blazingdb and mapd. One straightforward approach to check out would be to perform query and data distribution using a distributed SQL engine and to perform the local computations using a GPU accelerated database.

Taking a step back, we can see that performing distributed deep learning is hard today. A large body of research spread over decades has gone into distributed SQL engines, resulting in techniques for query planning, data partitioning, operator placement, check-pointing, multi-query scheduling etc. Some of this could be incorporated into distributed deep learning. If you’re interested in this direction, please have a look at this paper for a general research discussion on distributed databases and deep learning.

Hope you had as much fun as I did! Please share your comments and thoughts below. I’ll be happy to respond.