So with a linear model, the reason it was made linear in the first place was as a simplification, so the math could actually be done. But nowadays we have more powerful computers, and most things in life don't follow a linear relationship; they follow a non-linear one.

Now, when I say non-linear, that could mean different things depending on how technical you want to get, but imagine you had a cloud of points on an X and Y-axis. Instead of fitting a straight line through those points, you fit a step function. Maybe for the first segment it's about a third of the way up the Y-axis; for the second segment, the line goes two-thirds of the way up; and for the last segment it drops back down to the bottom of the Y-axis. That's a simple step function, and it's non-linear. It doesn't fit a nice, straight line; it doesn't even fit a curvy linear line; it fits a step function. That's roughly the idea behind a tree.
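Here's a minimal sketch of that picture, assuming numpy and scikit-learn are available. The data, thresholds, and segment heights are made up purely to mirror the three-segment example above; a shallow decision tree stands in for the step function.

```python
# Sketch: fitting a step function vs. a straight line to a cloud of points.
# The data here is invented for illustration, not from any real dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(200, 1))                   # points spread along the x-axis
# Three segments: roughly 1/3 up, then 2/3 up, then back near the bottom
y = np.select([X[:, 0] < 1, X[:, 0] < 2], [0.33, 0.66], default=0.05)
y = y + rng.normal(0, 0.03, size=200)                  # a little noise

line = LinearRegression().fit(X, y)                    # one straight line through everything
steps = DecisionTreeRegressor(max_depth=2).fit(X, y)   # piecewise-constant (step) fit

# The tree predicts a constant per segment -- the step function described above.
print(steps.predict([[0.5], [1.5], [2.5]]))            # roughly [0.33, 0.66, 0.05]
print(line.predict([[0.5], [1.5], [2.5]]))             # one sloped line can't match all three
```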

The ability to capture these non-linear relationships, regardless of the method, lets us model reality much better. That's why trees are really great and have high predictive power, and why random forests and boosted trees do too. That's also why deep learning is powerful: it has a lot of non-linearities in it.

When you're going from your input to your first hidden layer, and then on to subsequent hidden layers, there are two steps. The first is a matrix multiplication of the inputs by their weights (or coefficients), and that's linear. If you only did that, a deep learning model would just be a linear model. You could stack many more layers, but if all you did were these multiplications by the weights, it would just be a series of linear models, which collapses into one large linear model. Then you essentially still have a straight line or a curvy linear line.
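To make that collapse concrete, here's a small sketch with numpy. The layer sizes and random weights are arbitrary; the point is just that three stacked multiplications equal one multiplication by the product of the weight matrices.

```python
# Sketch: stacked weight multiplications with no activations stay linear.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 8))   # input (4 features) -> first "hidden layer" (8 units)
W2 = rng.normal(size=(8, 8))   # second layer
W3 = rng.normal(size=(8, 1))   # output layer

x = rng.normal(size=(1, 4))

# Three stacked matrix multiplications, no activation functions in between...
deep_output = x @ W1 @ W2 @ W3

# ...are exactly one matrix multiplication by the collapsed weights.
W_collapsed = W1 @ W2 @ W3
single_output = x @ W_collapsed

print(np.allclose(deep_output, single_output))  # True: it was one linear model all along
```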

But it's that next step at each layer, the activation function, that changes things. That is a non-linear function you're applying. Whether it's a tanh, a ReLU, or a sigmoid (which is just a fancy word for the inverse logit), regardless of which one you use, you're doing a non-linear transformation, and that puts a non-linearity in your model, which allows you to capture more complex relationships. If you do more layers, you have more non-linearities, so you can capture really interesting separations in your data.
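Here's a minimal sketch of that second step, again with arbitrary weights: the same kind of matrix multiplication as before, but with a ReLU and a sigmoid applied after the linear steps, which is what keeps the whole thing from collapsing into one linear model.

```python
# Sketch: a two-layer forward pass with non-linear activations between the linear steps.
import numpy as np

def relu(z):
    return np.maximum(z, 0)          # the non-linear "kink" that breaks linearity

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # the inverse logit mentioned above

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

x = rng.normal(size=(1, 4))

h = relu(x @ W1 + b1)                # linear step, then non-linear activation
out = sigmoid(h @ W2 + b2)           # another linear step, then another non-linearity

print(out)  # without relu/sigmoid this pass would collapse to a single linear model
```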