A couple of notes:

We use three different speakers for the training data. While we've had success using just one speaker (not shown), I would guess that using more speakers, particularly ones with different pitches, makes the method more robust.

For the second example, where the speaker speaks faster, we forced the window to be 2x smaller. While this isn't necessary, it certainly makes the problem easier and the result cleaner. In practice, you may have to use phonetic analysis to identify exact windows.
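To make the window change concrete, here is a minimal sketch of halving a fixed window for faster speech. It is only an illustration, not the code used for the examples: the helper `split_into_windows`, the base window of 400 samples, and the dummy signal are all assumptions made up for this sketch.

```python
import numpy as np


def split_into_windows(signal, window_size):
    """Split a 1-D signal into consecutive, non-overlapping windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    signal = np.asarray(signal)
    n_windows = len(signal) // window_size
    return signal[: n_windows * window_size].reshape(n_windows, window_size)


# Hypothetical values, chosen only for illustration.
base_window = 400        # window length (in samples) for normal-speed speech
fast_speech_scale = 2    # the faster speaker gets a window 2x smaller

signal = np.arange(1600)  # dummy stand-in for an audio signal

normal_windows = split_into_windows(signal, base_window)
fast_windows = split_into_windows(signal, base_window // fast_speech_scale)

print(normal_windows.shape)
print(fast_windows.shape)
```

With the same signal, halving the window doubles the number of windows, so each window covers roughly the same amount of speech content when the speaker talks about twice as fast.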

Thanks for reading!!

Please forward comments, suggestions, and questions to crawles/gmail