$\begingroup$

The simple way to explain it is that regularization helps you avoid fitting the noise; it doesn't do much to determine the shape of the signal. If you think of deep learning as a giant, glorious function approximator, then you realize that it needs a lot of data to define the shape of a complex signal.

If there were no noise, then increasing the complexity of the NN would produce a better approximation. There would be no penalty for the size of the NN; bigger would be better in every case. Consider a Taylor approximation: for a non-polynomial function, adding more terms eventually gives a better approximation within the series' radius of convergence (ignoring numerical precision issues).

This breaks down in the presence of noise, because you start fitting the noise. This is where regularization comes in: it may reduce fitting to the noise, thus allowing us to build a bigger NN to fit nonlinear problems.

The following discussion is not essential to my answer, but I added it in part to answer some comments and to motivate the main body of the answer above. Basically, the rest of my answer is like the French fries that come with a burger meal: you can skip it.

(Ir)relevant Case: Polynomial regression

Let's look at a toy example of polynomial regression, which is also a pretty good approximator for many functions. We'll look at the $\sin(x)$ function in the region $x\in(-3,3)$. As you can see from its Taylor series below, the 7th order expansion is already a pretty good fit, so we can expect that a polynomial of order 7 or higher should be a very good fit too:
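To put rough numbers on the Taylor claim, here is a minimal sketch that measures the worst-case error of the truncated series on $[-3,3]$. The helper name `sin_taylor` and the 601-point grid are my own choices for illustration, not part of the original plots.

```python
import math
import numpy as np

def sin_taylor(x, order):
    """Taylor polynomial of sin(x) about 0, keeping terms up to x**order."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for k in range(1, order + 1, 2):  # sin(x) has only odd-power terms
        sign = (-1) ** ((k - 1) // 2)
        total += sign * x**k / math.factorial(k)
    return total

# Dense grid over the region of interest (grid size is an arbitrary choice).
x = np.linspace(-3, 3, 601)

# Worst-case absolute error of each truncated series against sin(x).
max_err = {
    order: float(np.max(np.abs(sin_taylor(x, order) - np.sin(x))))
    for order in (1, 3, 5, 7, 9, 11)
}

for order, err in max_err.items():
    print(f"order {order:2d}: max |error| on [-3, 3] = {err:.2e}")
```

On this interval the worst-case error shrinks with every added pair of orders, which is the noise-free "more terms is better" behaviour the answer relies on.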

Next, we're going to fit polynomials of progressively higher order to a small, very noisy data set with 7 observations:

We can observe what we've been told about polynomials by many people in the know: they're unstable, and start to oscillate wildly as the order of the polynomial increases.

However, the problem is not the polynomials themselves. The problem is the noise. When we fit polynomials to noisy data, part of the fit is to the noise, not to the signal. Here are the exact same polynomials fit to the same data set, but with the noise completely removed. The fits are great!

Notice the visually perfect fit for order 6. This shouldn't be surprising, since 7 observations are all we need to uniquely identify an order 6 polynomial, and we saw from the Taylor approximation plot above that order 6 is already a very good approximation to $\sin(x)$ in our data range.
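The 7-observation experiment can be sketched roughly as follows. I'm assuming equally spaced observations, Gaussian noise with standard deviation 0.5, and an arbitrary seed; none of these are taken from the original plots, so treat the printed numbers as illustrative only.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
x_obs = np.linspace(-3, 3, 7)   # assumption: equally spaced observations
y_clean = np.sin(x_obs)
y_noisy = y_clean + rng.normal(scale=0.5, size=x_obs.size)  # assumed noise level

x_grid = np.linspace(-3, 3, 601)  # dense grid for comparing against the signal

def max_error_vs_signal(y_obs, order):
    """Least-squares polynomial fit, scored against the true sin(x)."""
    fit = Polynomial.fit(x_obs, y_obs, deg=order)
    return float(np.max(np.abs(fit(x_grid) - np.sin(x_grid))))

err_clean = {order: max_error_vs_signal(y_clean, order) for order in range(1, 7)}
err_noisy = {order: max_error_vs_signal(y_noisy, order) for order in range(1, 7)}

# With 7 points, an order 6 polynomial passes through every observation.
fit6 = Polynomial.fit(x_obs, y_clean, deg=6)
node_residual = float(np.max(np.abs(fit6(x_obs) - y_clean)))

for order in range(1, 7):
    print(f"order {order}: clean {err_clean[order]:.3f}, noisy {err_noisy[order]:.3f}")
print(f"order 6 residual at the observations: {node_residual:.1e}")
```

The order 6 fit interpolates whatever it is given: with clean data that means it tracks $\sin(x)$ closely, and with noisy data it reproduces the noise at every observation.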

Also notice that higher order polynomials do not fit as well as order 6, because there are not enough observations to define them. So, let's look at what happens with 100 observations. In the chart below you can see how a larger data set allowed us to fit higher order polynomials, thus accomplishing a better fit!

Great, but the problem is that we usually deal with noisy data. Look at what happens if you fit the same polynomials to 100 observations of very noisy data; see the chart below. We're back to square one: higher order polynomials produce horrible oscillating fits. So, increasing the size of the data set didn't help that much in increasing the complexity of the model to better explain the data. This is, again, because a complex model fits better not only to the shape of the signal, but to the shape of the noise too.
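Here is a rough sketch of the 100-observation comparison, clean versus noisy. As before, the equally spaced grid, the noise level (standard deviation 0.5), the seed, and the orders tried are my assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)   # arbitrary seed
x_obs = np.linspace(-3, 3, 100)  # assumption: equally spaced observations
y_clean = np.sin(x_obs)
y_noisy = y_clean + rng.normal(scale=0.5, size=x_obs.size)  # assumed noise level

x_grid = np.linspace(-3, 3, 601)

def fit_report(y_obs, order):
    """Return (in-sample MSE, max error against the true signal)."""
    fit = Polynomial.fit(x_obs, y_obs, deg=order)
    train_mse = float(np.mean((fit(x_obs) - y_obs) ** 2))
    signal_err = float(np.max(np.abs(fit(x_grid) - np.sin(x_grid))))
    return train_mse, signal_err

orders = (3, 9, 15)
clean = {o: fit_report(y_clean, o) for o in orders}
noisy = {o: fit_report(y_noisy, o) for o in orders}

for o in orders:
    print(f"order {o:2d}: clean signal error {clean[o][1]:.2e}, "
          f"noisy train MSE {noisy[o][0]:.3f}, noisy signal error {noisy[o][1]:.3f}")
```

On the noisy data the in-sample error keeps falling as the order grows, because the extra terms are spent on the noise, while the error against the true signal stays far above what the same order achieves on clean data.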

Finally, let's try some lame regularization on this problem. The chart below shows regularization (with different penalties) applied to an order 9 polynomial regression. Compare this to the order (power) 9 polynomial fit above: at an appropriate level of regularization, it is possible to fit higher order polynomials to noisy data.
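The answer doesn't say which penalty was used, so as one possible illustration here is a sketch with a plain ridge (L2) penalty on all coefficients of an order 9 polynomial. The helper `ridge_fit`, the penalty values, the rescaling of $x$ to $[-1,1]$, the noise level, and the seed are all my assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed
x_obs = np.linspace(-3, 3, 100)  # assumption: 100 equally spaced observations
y_noisy = np.sin(x_obs) + rng.normal(scale=0.5, size=x_obs.size)

order = 9
# Rescale x to [-1, 1] so the powers stay well conditioned.
A = np.vander(x_obs / 3.0, order + 1, increasing=True)

def ridge_fit(A, y, penalty):
    """Minimize ||A w - y||^2 + penalty * ||w||^2 via an augmented least-squares system."""
    n_coef = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(penalty) * np.eye(n_coef)])
    y_aug = np.concatenate([y, np.zeros(n_coef)])
    w, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
    return w

penalties = (0.0, 0.1, 1.0, 10.0)  # illustrative values; 0.0 is ordinary least squares
coef_norm = {}
train_mse = {}
for lam in penalties:
    w = ridge_fit(A, y_noisy, lam)
    coef_norm[lam] = float(np.linalg.norm(w))
    train_mse[lam] = float(np.mean((A @ w - y_noisy) ** 2))
    print(f"penalty {lam:5.1f}: ||w|| = {coef_norm[lam]:.3f}, train MSE = {train_mse[lam]:.3f}")
```

Larger penalties shrink the coefficients and give up some in-sample fit, which is how the penalty damps the oscillations; choosing the "appropriate level" would normally be done by something like cross-validation, which this sketch does not attempt.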

Just in case it wasn't clear: I'm not suggesting that you use polynomial regression this way. Polynomials are good for local fits, so a piece-wise polynomial can be a good choice. Fitting the entire domain with them is often a bad idea, because they are indeed sensitive to noise, as should be evident from the plots above. Whether the noise is numerical or from some other source is not that important in this context. Noise is noise, and polynomials will react to it passionately.