So, here’s BabblePi’s software: CMU Sphinx running in phoneme detection mode, i.e. it does not recognise text or words, only phonemes (transcriptions of speech sounds). It then speaks this phoneme sequence back using espeak, which is also running in phoneme mode.
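
To make the hand-over between the two programs a bit more concrete, here is a minimal sketch of the glue step in Python. It assumes the recogniser emits ARPAbet-style phone labels (the phone set used by the CMU pronouncing dictionary) and that espeak is fed phoneme mnemonics wrapped in double square brackets. The table `ARPABET_TO_ESPEAK` and the function `to_espeak_phonemes` are hypothetical names, and the mapping is my own rough approximation, not necessarily what BabblePi does.

```python
# Rough, illustrative mapping from ARPAbet phone labels to espeak's English
# phoneme mnemonics. This table is an assumption made for this sketch.
ARPABET_TO_ESPEAK = {
    # vowels
    "AA": "A:", "AE": "a", "AH": "V", "AO": "O:", "AW": "aU", "AY": "aI",
    "EH": "E", "ER": "3:", "EY": "eI", "IH": "I", "IY": "i:", "OW": "oU",
    "OY": "OI", "UH": "U", "UW": "u:",
    # consonants
    "B": "b", "CH": "tS", "D": "d", "DH": "D", "F": "f", "G": "g",
    "HH": "h", "JH": "dZ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "N", "P": "p", "R": "r", "S": "s", "SH": "S", "T": "t",
    "TH": "T", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "Z",
}


def to_espeak_phonemes(phones):
    """Turn a list of recognised phone labels into an espeak phoneme string.

    Labels that are not in the table (for example silence markers such as
    "SIL") are simply dropped. No stress marks are added.
    """
    mnemonics = []
    for phone in phones:
        mnemonic = ARPABET_TO_ESPEAK.get(phone.upper())
        if mnemonic is not None:
            mnemonics.append(mnemonic)
    # espeak treats text inside [[ ... ]] as phoneme mnemonics, not spelling.
    return "[[" + "".join(mnemonics) + "]]"


if __name__ == "__main__":
    # Hypothetical recogniser output for something like "hello".
    print(to_espeak_phonemes(["SIL", "HH", "AH", "L", "OW", "SIL"]))
```

The resulting string could then be passed to espeak on the command line, e.g. `espeak "[[hVloU]]"`. Note that nothing in this step knows where one word ends and the next begins: it is sounds in, sounds out.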

So when we look at how BabblePi listens to and repeats words and sentences, we see that this kind of “deprived input” leads to funny repetitions. This is because the system has no concept of words as distinct units of speech, nor does it have any kind of mental dictionary, i.e. a list of words with their linguistic features, such as knowledge about semantics (what does the word mean?) or syntax (how do I use a word in the context of other words to construct a grammatically correct sentence?).

We can see an example of “deprived input” here, where Jimmy Fallon and Brie Larson play the whisper game. The receiver only watches the sender’s lips without hearing the sounds, which is a sort of “visual speech”. In this case, the human certainly has knowledge of semantics and syntax; however, as you can see, this knowledge is not much help in reconstructing the original input.

I’m pretty intrigued by the similarities between BabblePi and these human performances, because both outputs sound very similar, and you can even see that the humans sometimes repeat sound sequences that don’t correspond to anything in the mental dictionary.

Surely this must be similar to how babies learn to speak by listening. And I’m pretty sure that lip reading, as part of multimodal language acquisition, plays an important role.