The model combines a vision CNN with a language-generating RNN so it can take in an image and generate a fitting natural-language caption.