The system studies captioned videos and learns to link words to objects and actions by determining the accuracy of a description. It turns the potential meanings of a caption into logical mathematical expressions, then picks the expression that most closely represents what it thinks is going on. While the AI may start with a vast range of potential meanings and little idea of what it's seeing, it gradually whittles down the possibilities. Annotations can help speed the process, but the technology doesn't need them to learn.
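To make the narrowing-down idea concrete, here is a toy sketch in Python. It is not MIT's parser: it assumes each video has already been reduced to a hand-written set of symbols (names like `WOMAN` and `PICK_UP` are invented for illustration), and it swaps the real system's logical expressions for plain set intersection and an overlap score. The function names `learn_word_meanings` and `best_expression` are likewise hypothetical.

```python
def learn_word_meanings(examples):
    """Toy cross-situational narrowing (an assumption, not MIT's method).

    Every word starts out able to mean anything in the first scene it
    appears with. Each later scene removes the symbols that are absent.
    """
    candidates = {}
    for caption, scene_symbols in examples:
        for word in caption.lower().split():
            if word not in candidates:
                candidates[word] = set(scene_symbols)
            else:
                candidates[word] &= scene_symbols
    return candidates


def best_expression(expressions, scene_symbols):
    """Pick the candidate expression that best overlaps the scene.

    Expressions are modeled as sets of symbols and scored by Jaccard
    similarity, a stand-in for the real system's scoring.
    """
    def score(expr):
        return len(expr & scene_symbols) / len(expr | scene_symbols)

    return max(expressions, key=score)


# Hypothetical captioned "videos": (caption, symbols visible in the scene).
examples = [
    ("woman picks apple", {"WOMAN", "PICK_UP", "APPLE"}),
    ("woman picks cup", {"WOMAN", "PICK_UP", "CUP"}),
    ("man picks apple", {"MAN", "PICK_UP", "APPLE"}),
    ("woman drops cup", {"WOMAN", "DROP", "CUP"}),
]

meanings = learn_word_meanings(examples)
for word, symbols in sorted(meanings.items()):
    print(word, sorted(symbols))
```

With only four captions, some words (such as "woman" and "picks") settle on a single symbol while others stay ambiguous until more examples arrive, which mirrors the gradual whittling the article describes.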

Crucially, the approach is flexible. Because the system learns by observing its environment, it can pick up how people actually speak, not just formal language. MIT envisions robots that could adapt to the linguistic habits of the people around them, even when those people use sentence fragments and other signs of informal dialog.

The childlike method could speed up the learning process and produce AI that can handle uncommon languages, which rarely get AI-friendly annotations. We'd add that it could lead to more robots that need relatively little hand-holding. It could even help researchers understand how children learn about the world. At this stage, the main challenge is to teach robots to learn this way through interactions, not just by watching.