I learned so much about Tesla’s NN design from this. For instance, I hadn’t understood why the network seemed so monolithic given that it generates so many different outputs. The architecture of the current camera nets has a large Inception-style ‘backbone’, but the heads that feed the various outputs are pretty small: basically they just deconvolve, refactor, or minimally interpret what seems to be a single massive representation generated by the backbone. That shouldn’t be possible, or at least it shouldn’t be very efficient, for outputs that are very different.
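To make that shape concrete, here’s a minimal sketch of the layout in PyTorch: one shared feature extractor feeding several small heads that each only lightly reinterpret the same representation. The layer sizes, head names, and single-deconvolution heads are illustrative assumptions on my part, not Tesla’s actual design.

```python
# A minimal sketch of the shared-backbone / small-heads layout described
# above, assuming PyTorch. Layer sizes and head names are made up.
import torch
import torch.nn as nn

class SharedBackboneNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the large Inception-style backbone: one shared
        # feature extractor producing a single dense representation.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Small heads: each one just deconvolves and minimally
        # reinterprets the shared representation.
        self.heads = nn.ModuleDict({
            "lanes": self._make_head(out_channels=4),
            "objects": self._make_head(out_channels=10),
            "drivable_space": self._make_head(out_channels=2),
        })

    @staticmethod
    def _make_head(out_channels):
        return nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, out_channels, kernel_size=1),
        )

    def forward(self, x):
        features = self.backbone(x)  # one shared representation
        return {name: head(features) for name, head in self.heads.items()}
```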

So the answer seems to be that, for training purposes, the network is actually tree-shaped, with large sections of the higher layers devoted to particular outputs or groups of outputs, but for inference purposes they preserve the functionally monolithic nature of the backbone because it’s computationally efficient. To pull this off, they need to perform backprop from each output only to the neurons that feed that output, leaving the other branches unchanged. If you do that while managing the total neuron count in each layer, you get the benefit of a network that at inference time performs like a single big backbone while minimizing weight conflict in the higher layers.
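Here’s a sketch of what that per-output backprop could look like, reusing the SharedBackboneNet above with dummy data standing in for real per-task dataloaders (again, my assumptions, not Tesla’s setup). Because only the selected task’s loss is backpropagated on each step, autograd visits just that head’s branch plus the shared trunk beneath it; the other branches’ weights are left unchanged.

```python
# A sketch of per-output backprop, assuming the SharedBackboneNet defined
# above. The round-robin task schedule, dummy data, and MSE losses are
# illustrative stand-ins.
import itertools
import torch
import torch.nn.functional as F

model = SharedBackboneNet()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

task_channels = {"lanes": 4, "objects": 10, "drivable_space": 2}

def dummy_batch(out_channels):
    # Hypothetical stand-in for a real per-task dataloader.
    images = torch.randn(2, 3, 64, 64)
    targets = torch.randn(2, out_channels, 32, 32)
    return images, targets

for step, task in zip(range(30), itertools.cycle(task_channels)):
    images, targets = dummy_batch(task_channels[task])
    optimizer.zero_grad(set_to_none=True)

    # Forward pass through the shared trunk and *only* this task's head.
    features = model.backbone(images)
    pred = model.heads[task](features)

    # Backprop touches just this head's branch and the shared backbone;
    # the other heads are not in this graph, so their weights (and
    # gradients) are left unchanged.
    loss = F.mse_loss(pred, targets)
    loss.backward()

    # Sanity check: heads not trained on this step received no gradient.
    for name, head in model.heads.items():
        if name != task:
            assert all(p.grad is None for p in head.parameters())

    optimizer.step()
```

In the scheme described above the per-output branches would include large sections of the upper layers rather than just small heads, and at inference those branches could run as channel slices of shared layers so the whole thing still executes as one monolithic backbone; splitting them into separate modules here just makes the gradient isolation explicit.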