Deep learning has evolved not linearly but through a series of step functions: sudden, unexpected outbreaks of capability that fundamentally changed the envelope of what computers can do. At TwentyBN, we are betting that the next outbreak of capability will come from video understanding. We have built spatio-temporal video models, video infrastructure, and a data operation that has allowed us to create many hundreds of thousands of labeled videos showing everyday common-sense scenes and situations, many of them designed to be extremely subtle and hard to distinguish. This has allowed us to successfully train neural networks end-to-end on a wide range of action understanding tasks that neither hand-engineering nor neural networks appeared anywhere near solving just a few months ago. I will show how these recognition tasks now drive commercial value at TwentyBN, and how they advance our long-term AI agenda, which represents another, longer-term bet: learning common-sense world knowledge through video.