I watched the whole presentation. Dojo was mentioned only in passing, basically in response to a question about whether HW3 was inference-only or could also be used for training. Dojo is meant for training (hence the name "Dojo"), and my recollection is that it is specifically meant for training neural networks on video.

Currently, most of the neural networks take individual images as input. A 30fps video lasting many seconds would be a far larger input if fed in as a single input. If an image is N bytes, then a ~30-second video at 30fps is 900 frames, or about 900N bytes. That is roughly 1000 times, or about three orders of magnitude, larger.

They'll probably use different neural net architectures for this, though, likely some combination of their current image network and LSTMs. It may also be necessary to do the depth mapping that Karpathy talked about. Here is the paper
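A quick back-of-the-envelope check of the video-size estimate, as a Python sketch. It assumes every frame is stored uncompressed at the same size. The 1280x960 RGB frame size is only a hypothetical example; the ratio between clip and single image does not depend on it.

```python
def video_input_size(frame_bytes: int, seconds: float, fps: int) -> int:
    """Bytes needed to feed a whole clip as one input, assuming uncompressed frames of equal size."""
    frames = round(seconds * fps)
    return frames * frame_bytes

# Hypothetical frame size: 1280x960 pixels, RGB, 1 byte per channel.
FRAME_BYTES = 1280 * 960 * 3

clip_bytes = video_input_size(FRAME_BYTES, seconds=30, fps=30)
ratio = clip_bytes // FRAME_BYTES

print(f"one frame: {FRAME_BYTES} bytes")
print(f"30 s clip: {clip_bytes} bytes ({ratio}x a single frame)")
```

A recurrent model such as an LSTM sidesteps this by consuming one frame per time step, so each step's input stays at roughly N bytes while the hidden state carries information across frames.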