$\begingroup$

Cybenko's result is fairly intuitive, as I hope to convey below; what makes things more tricky is he was aiming both for generality, as well as a minimal number of hidden layers. Kolmogorov's result (mentioned by vzn) in fact achieves a stronger guarantee, but is somewhat less relevant to machine learning (in particular, it does not build a standard neural net, since the nodes are heterogeneous); this result in turn is daunting since on the surface it is just 3 pages recording some limits and continuous functions, but in reality it is constructing a set of fractals. While Cybenko's result is unusual and very interesting due to the exact techniques he uses, results of that flavor are very widely used in machine learning (and I can point you to others).

Here is a high-level summary of why Cybenko's result should hold.

A continuous function on a compact set can be approximated by a piecewise constant function.

A piecewise constant function can be represented as a neural net as follows. For each region where the function is constant, use a neural net as an indicator function for that region. Then build a final layer with a single node, whose input linear combination is the sum of all the indicators, with a weight equal to the constant value of the corresponding region in the original piecewise constant function.

Regarding the first point above, this can be taken as the statement "a continuous function over a compact set is uniformly continuous". What this means to us is you can take your continuous function over $[0,1]^d$, and some target error $\epsilon>0$, then you can grid $[0,1]^d$ at scale $\tau>0$ (ending up with roughly $(1/\tau)^d$ subcubes) so that a function which is constant over each subcube is within $\epsilon$ of the target function.

Now, a neural net can not precisely represent an indicator, but you can get very close. Suppose the "transfer function" is a sigmoid. (Transfer function is the continuous function you apply to a linear combination of inputs in order to get the value of the neural net node.) Then by making the weights huge, you output something close to 0 or close to 1 for more inputs. This is consistent with Cybenko's development: notice he needs the functions involved to equal 0 or 1 in the limit: by definition of limit, you get exactly what I'm saying, meaning you push things arbitrarily close to 0 or 1.

(I ignored the transfer function in the final layer; if it's there, and it's continuous, then we can fit anything mapping to $[0,1]$ by replacing the constant weights with the something in the inverse image of that constant according to the transfer function.)

Notice that the above may seem to take a couple layers: say, 2 to build the indicators on cubes, and then a final output layer. Cybenko was trying for two points of generality: minimal number of hidden layers, and flexibility in the choice of transfer function. I've already described how he works out flexibility in transfer function.

To get the minimum number of layers, he avoids the construction above, and instead uses functional analysis to develop a contradiction. Here's a sketch of the argument.

The final node computes a linear combination of the elements of the layer below it, and applies a transfer function to it. This linear combination is a linear combination of functions, and as such, is itself a function, a function within some subspace of functions, spanned by the possible nodes in the hidden layer.

A subspace of functions is just like an ordinary finite-dimensional subspace, with the main difference that it is potentially not a closed set; that's why cybenko's arguments all take the closure of that subspace. We are trying to prove that this closure contains all continuous functions; that will mean we are arbitrarily close to all continuous functions.

If the function space were simple (a Hilbert space), we could argue as follows. Pick some target continuous function which is contradictorily supposed to not lie in the subspace, and project it onto the orthogonal complement of the subspace. This residual must be nonzero. But since our subspace can represent things like those little cubes above, we can find some region of this residual, fit a little cube to it (as above), and thereby move closer to our target function. This is a contradiction since projections choose minimal elements. (Note, I am leaving something out here: Cybenko's argument doesn't build any little cubes, he handles this in generality too; this is where he uses a form of the Riesz representation theorem, and properties of the transfer functions (if I remember correctly, there is a separate lemma for this step, and it is longer than the main theorem).)

We aren't in a Hilbert space, but we can use the Hahn-Banach theorem to replace the projection step above (note, proving Hahn-Banach uses the axiom of choice).

Now I'd like to say a few things about Kolmogorov's result. While this result does not apparently need the sort of background of Cybenko's, I personally think it is much more intimidating.

Here is why. Cybenko's result is an approximation guarantee: it does not say we can exactly represent anything. On the other hand, Kolmogorov's result is provides an equality. More ridiculously, it says the size of the net: you need just $\mathcal O(d^2)$ nodes. To achieve this strengthening, there is a catch of course, the one I mentioned above: the network is heteregeneous, by which I mean all the transfer functions are not the same.

Okay, so with all that, how can this thing possible work?!

Let's go back to our cubes above. Notice that we had to bake in a level of precision: for every $\epsilon>0$, we have to go back and pick a more refined $\tau >0$. Since we are working with (finite) linear combinations of indicators, we are never exactly representing anything. (things only get worse if you include the approximating effects of sigmoids.)

So what's the solution? Well, how about we handle all scales simultaneously? I'm not making this up: Kolmogorov's proof is effectively constructing the hidden layer as a set of fractals. Said another way, they are basically space filling curves which map $[0,1]$ to $[0,1]^d$; this way, even though we have a combination of univariate functions, we can fit any multivariate function. In fact, you can heuristically reason that $\mathcal O(d^2)$ is "correct" via a ridiculous counting argument: we are writing a continuous function from $\mathbb{R}^d$ to $\mathbb R$ via univariate continuous functions, and therefore, to capture all inter-coordinate interactions, we need $\mathcal O(d^2)$ functions...

Note that Cybenko's result, due to using only one type of transfer function, is more relevant to machine learning. Theorems of this type are very common in machine learning (vzn suggested this in his answer, however he referred to Kolmogorov's result, which is less applicable due to the custom transfer functions; this is weakened in some more fancy versions of Kolmogorov's result (produced by other authors), but those still involve fractals, and at least two transfer functions).

I have some slides on these topics, which I could post if you are interested (hopefully less rambly than the above, and have some pictures; I wrote them before I was adept with Hahn-Banach, however). I think both proofs are very, very nice. (Also, I have another answer here on these topics, but I wrote it before I had grokked Kolmogorov's result.)