Image frames and human speech from the same video locations are often semantically aligned. The alignment is non-exhaustive and sometimes noisy, which we hope to mitigate by pretraining on larger datasets. For the left example, the ASR output is, “Keep rolling tight and squeeze the air out to its side and you can kind of pull a little bit.”, where the actions are captured by speech but the objects are not. For the right example, the ASR output is, “This is where you need to be patient patient patient,” which is not related to the visual content at all.