Conditional Training on unreasonably large networks

One of the big problems in Artificial Intelligence is the enormous number of GPUs (or computers) needed to train large networks.

The training time of a neural network grows roughly quadratically (think squared) as a function of its size. This is due to how the network is trained: for each example, the entire network is activated and updated, even though some parts might contribute nothing to processing that particular example. And since a larger network also needs proportionally more data to exploit its extra capacity, the cost scales with model size times dataset size, which is where the quadratic growth comes from.
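A quick back-of-the-envelope sketch of that scaling, assuming training cost is roughly proportional to parameters times examples (the numbers below are made up purely for illustration):

```python
# Rough cost proxy: every example touches every parameter,
# so cost ~ parameters * examples.
def training_cost(num_params: int, num_examples: int) -> int:
    return num_params * num_examples

base = training_cost(num_params=10**8, num_examples=10**9)
# Double the model AND the data it needs to be trained on:
doubled = training_cost(num_params=2 * 10**8, num_examples=2 * 10**9)

print(doubled / base)  # 4.0 -- doubling both quadruples the cost
```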

However, the memory of a network is directly dependent on its size: the larger the network, the more patterns it can learn and remember. Therefore, we have to build giant neural networks to process the enormous amounts of data that corporations like Google & Microsoft have.

Well, that was the case until Google released their paper on the Sparsely-Gated Mixture-of-Experts layer ("Outrageously Large Neural Networks", Shazeer et al., 2017).
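To make the core idea concrete, here is a minimal sketch of sparsely-gated routing in plain numpy: a gating network scores all experts for each example, but only the top-k experts actually run. All names and sizes here are illustrative, and this deliberately omits details from the paper such as gate noise and the load-balancing loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not from the paper.
num_experts, d_in, d_out, top_k = 8, 16, 16, 2

# Each expert is a simple linear layer; the gate is another linear layer.
experts = [rng.standard_normal((d_in, d_out)) * 0.1 for _ in range(num_experts)]
gate_w = rng.standard_normal((d_in, num_experts)) * 0.1

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Forward one example through only its top-k experts."""
    scores = x @ gate_w                   # one gate score per expert
    top = np.argsort(scores)[-top_k:]     # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the selected experts only
    # Only k of the num_experts experts are evaluated -- this is the saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_in))
print(y.shape)  # (16,)
```

The point of the trick: per example, the compute is proportional to k experts rather than all of them, so total capacity (parameters across all experts) can grow enormously without a matching growth in per-example training cost.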