My explanation was probably needlessly complicated. Let me try again:

There are a lot of different ways you can decompose the problem of driving. You can also choose not to decompose it at all, in which case you have a monolithic system. For a driving system built as a neural network, that would be a single monolithic NN, usually with sensor data as inputs and actuator commands as outputs. To train such a network you supply examples of sensor inputs paired with the actuation a good driver produced. This works really well for demonstrations and is quite easy to take to the point of showing some useful functionality. But this approach has numerous drawbacks. One is that it’s very data intensive for the level of functionality it provides. Another is that, by design, it exposes no interfaces to other systems that would allow for regulation, inspection, and so forth. These limitations are a big enough problem that nobody is currently developing commercial products using this method.
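
As a concrete sketch, here is roughly what that looks like in PyTorch. All of the sizes, layers, and names below are invented for illustration; a real system would be vastly larger, but the shape of the setup is the same: one network, sensors in, actuation out, trained on recorded driving.

    import torch
    import torch.nn as nn

    SENSOR_DIM = 512     # hypothetical flattened sensor reading
    ACTUATOR_DIM = 2     # hypothetical outputs, e.g. steering and throttle

    # One monolithic network: sensors in, actuation out. Nothing between
    # the two ends is a defined interface anyone outside training can use.
    policy = nn.Sequential(
        nn.Linear(SENSOR_DIM, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, ACTUATOR_DIM),
    )

    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

    def train_step(sensors, actions):
        """One supervised step on a batch of (sensor, actuation) examples."""
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(policy(sensors), actions)
        loss.backward()
        optimizer.step()
        return loss.item()

    # Random tensors standing in for logged human driving data.
    print(train_step(torch.randn(32, SENSOR_DIM), torch.randn(32, ACTUATOR_DIM)))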

If you decompose the problem into multiple blocks you can often get better results on each block than a monolithic system would. This requires that you do a good job of defining the blocks and their interfaces, which is not a trivial problem, but one that has been studied in depth and is relatively well understood for some arrangements. Some blocks might be best built as NNs and others with other techniques. Today this is the most common approach for developing commercial products, mainly because, given appropriate labor and capital, it yields better performance than current monolithic training can. Organizations that take this approach are deciding to invest additional resources to get better performance sooner.
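
To make the contrast concrete, here is a sketch of a decomposed pipeline. The three blocks and the interfaces between them are invented for illustration; real pipelines have many more of both. The point is that every boundary is a defined, human-readable data structure, so each block can be developed, inspected, and validated on its own, and each can be an NN or not.

    from dataclasses import dataclass
    from typing import List, Tuple

    # Hand-defined interfaces between blocks. Because these are fixed,
    # documented structures, they can be logged, inspected, and regulated.
    @dataclass
    class Obstacle:
        x: float        # meters ahead of the vehicle
        y: float        # meters left of lane center
        speed: float    # m/s

    @dataclass
    class Trajectory:
        waypoints: List[Tuple[float, float]]  # (x, y) points to follow

    @dataclass
    class Actuation:
        steering: float
        throttle: float

    def perceive(sensor_frame) -> List[Obstacle]:
        """Perception block; in practice often an NN."""
        raise NotImplementedError  # stub

    def plan(obstacles: List[Obstacle]) -> Trajectory:
        """Planning block; might be search or optimization, not an NN."""
        raise NotImplementedError  # stub

    def control(trajectory: Trajectory) -> Actuation:
        """Control block; often classical, e.g. PID or MPC."""
        raise NotImplementedError  # stub

    def drive_one_tick(sensor_frame) -> Actuation:
        # Each arrow in perception -> planning -> control is an interface
        # that people, tests, and regulators can look at directly.
        return control(plan(perceive(sensor_frame)))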

As the underlying NN technology improves, and as more data and more computational resources become available, any given level of performance can be achieved with less labor and capital using fewer, bigger blocks. There’s a good chance this ultimately leads to systems with very few blocks, maybe only one, trained in a fairly simple and general way, though they may require immense computational resources and data. Going this way also removes the limitations that hand-defined block interfaces impose on the final system, because the training process is free to define whatever internal interfaces, representations, and restrictions produce the best score on the objective function.
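
Here is a sketch of what “the training process defines the interface” means, with the same invented sizes as before: instead of a hand-specified structure like an obstacle list, the connection between the blocks is just a latent vector, and the gradients flowing through it decide what it represents.

    import torch
    import torch.nn as nn

    SENSOR_DIM, LATENT_DIM, ACTUATOR_DIM = 512, 64, 2   # hypothetical sizes

    # Two "blocks", but their interface is an unconstrained latent vector
    # rather than a hand-defined data structure.
    perception = nn.Sequential(nn.Linear(SENSOR_DIM, 256), nn.ReLU(),
                               nn.Linear(256, LATENT_DIM))
    control = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                            nn.Linear(128, ACTUATOR_DIM))

    optimizer = torch.optim.Adam(
        list(perception.parameters()) + list(control.parameters()), lr=1e-4)

    def train_step(sensors, actions):
        optimizer.zero_grad()
        latent = perception(sensors)      # the "interface": whatever 64 numbers
        loss = nn.functional.mse_loss(control(latent), actions)
        loss.backward()                   # gradients cross the boundary freely
        optimizer.step()
        return loss.item()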

So this whole topic of end-to-end versus not is complicated because there are a lot of options and the tradeoffs are not simple. A lot of people, myself included and apparently Karpathy as well, expect that more and more blocks will become NNs. When two adjacent blocks are NNs they can be merged, and over time the great majority of the system becomes a monolithic NN. So ‘end-to-end’ is a conceptual description of a simple and powerful technique that is not currently capable of producing the best products. We may never get to true end-to-end, but we will probably get fairly close, because over time those will be the best performing systems and the ones that take the least human labor and capital to construct.
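
Mechanically, that merge step is trivial, which is part of why people expect it to happen. A sketch with the same invented sizes as above: two adjacent NN blocks compose into one network, and fine-tuning the composite lets the old boundary dissolve into an ordinary hidden layer.

    import torch
    import torch.nn as nn

    SENSOR_DIM, LATENT_DIM, ACTUATOR_DIM = 512, 64, 2   # hypothetical sizes

    # Two adjacent blocks that happen to both be NNs (e.g. each trained
    # separately against a hand-defined interface of size LATENT_DIM).
    perception = nn.Sequential(nn.Linear(SENSOR_DIM, 256), nn.ReLU(),
                               nn.Linear(256, LATENT_DIM))
    control = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                            nn.Linear(128, ACTUATOR_DIM))

    # Merging them is just composition: the former interface becomes an
    # ordinary hidden layer of one monolithic network.
    merged = nn.Sequential(perception, control)

    # Fine-tune the composite end to end; the old boundary is now free to
    # drift to whatever representation best serves the objective function.
    optimizer = torch.optim.Adam(merged.parameters(), lr=1e-5)
    sensors = torch.randn(32, SENSOR_DIM)
    actions = torch.randn(32, ACTUATOR_DIM)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(merged(sensors), actions)
    loss.backward()
    optimizer.step()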