CNNs can encode abstract features from images, which can then be used for classification, object detection, segmentation, and a litany of other tasks[63]. Returning to the notion of contemporaneous successes in 2014, Vinyals et al. (2014)[64] successfully used a sequence-to-sequence model in which the typical encoder LSTM[65] was replaced by a CNN. In their paper, “Show and Tell: A Neural Image Caption Generator”, the CNN takes an input image and generates a feature representation, which is then fed to the decoder LSTM to generate the output sentence (see fig. 13).

Figure 13: CNN encoder to LSTM decoder

Note: The image is encoded into a context vector by a CNN, which can then be passed to an RNN decoder. Different neural blocks of computation are combined in new ways to handle different tasks.

Source: The M Tank
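To make this wiring concrete, here is a minimal sketch in PyTorch, assuming a torchvision ResNet as a stand-in for the paper's encoder; `vocab_size`, `embed_dim` and `hidden_dim` are illustrative hyperparameters, not the paper's values.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Toy CNN-encoder / LSTM-decoder, in the spirit of Show and Tell."""

    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)            # stand-in encoder
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)   # softmax over words

    def forward(self, images, captions):
        # Encode the image into a single feature vector ("context vector").
        feats = self.encoder(images).unsqueeze(1)      # (B, 1, embed_dim)
        # Feed the image feature as the first "word", then the caption tokens.
        tokens = self.embed(captions)                  # (B, T, embed_dim)
        inputs = torch.cat([feats, tokens], dim=1)     # (B, T+1, embed_dim)
        hidden, _ = self.decoder(inputs)
        return self.out(hidden)                        # logits per time step

model = CaptionModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```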

A few more specifics on how the sentence is generated: at every step, the RNN outputs a probability distribution over the next word using a softmax. A slightly naive approach would be to greedily take the word with the highest probability at each step. Beam search is a better approach for sentence construction: by keeping several candidate word combinations in play and expanding each into different possible outputs, it constructs whole sentences without relying too heavily on any individual word the RNN may favour at a specific time step. Beam search can therefore rank many different sentences according to their collective, or holistic, probability.

Figure 14: Beam search example

Source: Geeky is Awesome (2016)[66]

For example, choosing a lower-probability word at the first prediction step may yield a higher-probability sentence overall. A deeper explanation of beam search for sentence generation, i.e. as it relates to the decoder portion of our example above, may be found here[67].
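A minimal sketch of beam search itself, assuming a hypothetical `step` function that returns the log-probability of each candidate next word given a partial sentence (the vocabulary, probabilities and beam width `k` are all illustrative):

```python
import math

def beam_search(step, start, end, k=3, max_len=20):
    """Keep the k highest-scoring partial sentences at every step.

    `step(seq)` must return a dict mapping each candidate next word
    to its log-probability given the partial sequence `seq`.
    """
    beams = [([start], 0.0)]                     # (sequence, total log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, logp in step(seq).items():
                candidates.append((seq + [word], score + logp))
        # Rank whole partial sentences by their collective log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:                            # every surviving beam ended
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy next-word model: ending the sentence grows more likely as it lengthens.
def step(seq):
    p_end = min(0.9, 0.2 * len(seq))
    return {"a": math.log((1 - p_end) * 0.6),
            "dog": math.log((1 - p_end) * 0.4),
            "</s>": math.log(p_end)}

print(beam_search(step, "<s>", "</s>", k=2))
```

Because scores are summed in log space, ranking beams by their total score is equivalent to ranking sentences by their joint probability, which is exactly the "collective" ranking described above.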

Further contemporaneous work

Around the time Show and Tell appeared, a similar but distinct approach was presented by Donahue et al. (2014): Long-term Recurrent Convolutional Networks for Visual Recognition and Description[68]. Instead of an encoder LSTM producing the vector, as is typical in sequence-to-sequence models, the feature representation is output by a CNN, in this case VGGNet[69], and presented to the decoder LSTM. This work was also successfully applied to video captioning, a natural extension of image captioning.

The main contribution of this work was not only the new connection between the CNN encoder and LSTM decoder[70], but also an extensive set of experiments which stacked LSTMs to try different connection patterns, as sketched below. The team also assessed beam search against their own random sampling method, and compared using a CNN trained on ImageNet with further fine-tuning the pre-trained network on the specific dataset used[71].
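A rough sketch of one such connection pattern, assuming PyTorch: the CNN feature can be concatenated to the input of the first LSTM layer, or only to the second, loosely mirroring the paper's "factored" variants (the names and dimensions here are illustrative, not the paper's exact configuration).

```python
import torch
import torch.nn as nn

class TwoLayerDecoder(nn.Module):
    """Stacked LSTM decoder where the image feature enters at a chosen layer."""

    def __init__(self, embed_dim=256, feat_dim=256, hidden_dim=512,
                 inject_at=2):
        super().__init__()
        self.inject_at = inject_at
        in1 = embed_dim + (feat_dim if inject_at == 1 else 0)
        in2 = hidden_dim + (feat_dim if inject_at == 2 else 0)
        self.lstm1 = nn.LSTM(in1, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(in2, hidden_dim, batch_first=True)

    def forward(self, tokens, feats):
        # `feats` is the CNN feature, repeated across all time steps.
        feats = feats.unsqueeze(1).expand(-1, tokens.size(1), -1)
        x = tokens
        if self.inject_at == 1:
            x = torch.cat([x, feats], dim=-1)
        h, _ = self.lstm1(x)
        if self.inject_at == 2:
            h = torch.cat([h, feats], dim=-1)
        h, _ = self.lstm2(h)
        return h

dec = TwoLayerDecoder(inject_at=2)
out = dec(torch.randn(2, 12, 256), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 12, 512])
```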

Delving deeper into the multimodal approach

From Captions to Visual Concepts and Back, by Fang et al. (2014)[72], is useful for explaining the multi-modality of the 2014 breakthroughs. Although distinct from the approaches of Vinyals et al. (2014)[73] and Donahue et al. (2014)[74], the paper represents an effective combination of some of these ideas[75]. For readers, walking through the captioning process may bring a new appreciation of the modularity of these approaches.

Figure 15: Creating captions from visual concepts

Source: Fang et al. (2014)

(I) Detect words

To begin, it is possible to squeeze more easily interpretable information out of the CNN. Consider how humans would complete the task: they would notice the important objects, parts and semantics of an image and relate them within its global context, all before attempting to put words into a coherent sentence. Similarly, instead of “just” using the encoded vector representation of the whole image, we can achieve better results by combining information contained in several regions of the image.

Using a word detection CNN, which generates bounding boxes much as an object detection CNN does, the different regions of the image receive scores for individual objects, scenes or characteristics corresponding to words in a predefined dictionary of about 1,000 words, as sketched below.
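As a toy illustration of combining per-region word scores into image-level detections (plain NumPy; the noisy-OR pooling used here is one common multiple-instance choice, and the dictionary is a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["horse", "running", "field", "man"]     # hypothetical tiny dictionary

# Per-region word probabilities from a (hypothetical) word-detection CNN:
# rows are image regions, columns are dictionary words.
region_probs = rng.uniform(0.0, 0.05, size=(10, len(vocab)))
region_probs[3, 0] = 0.9                         # one region looks like "horse"

# Noisy-OR pooling: a word is present if at least one region fires on it.
image_probs = 1.0 - np.prod(1.0 - region_probs, axis=0)

detected = [w for w, p in zip(vocab, image_probs) if p > 0.5]
print(dict(zip(vocab, image_probs.round(2))))
print(detected)                                  # ['horse']
```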

(II) Generate sentences

Next, the likelihood of the matched image descriptors (detections) is analysed according to a statistical language model. For example, if a region of the image is classified as “horse”, this information can be used as a prior that gives a higher likelihood to the action “running” than to “talking” in the captioning output. Combined with beam search, this produces a set of output sentences that are then re-ranked with a Deep Multimodal Similarity Model (DMSM)[76].
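A toy illustration of using detections as a prior over next words, with a hypothetical bigram table standing in for the paper's trained language model:

```python
import math

detections = {"horse", "field"}                  # words found in step (I)

# Hypothetical bigram log-probabilities from a language model.
bigram_logp = {
    ("horse", "running"): math.log(0.30),
    ("horse", "talking"): math.log(0.05),
    ("a", "horse"):       math.log(0.20),
    ("a", "man"):         math.log(0.20),
}

def next_word_logp(prev, word, bonus=math.log(2.0)):
    """Language-model score, boosted when `word` was visually detected."""
    logp = bigram_logp.get((prev, word), math.log(1e-4))   # unseen fallback
    return logp + (bonus if word in detections else 0.0)

print(next_word_logp("a", "horse"))   # detected word -> boosted score
print(next_word_logp("a", "man"))     # undetected word -> plain LM score
```

A `step`-style function built this way could plug straight into the beam search sketch shown earlier, which is how the detections and the language model combine into candidate sentences.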

(III) Re-ranking sentences

This is where the multi-modality comes into play. The DMSM uses two independent networks: a CNN for retrieving a vector representation of the image (VGG), and a second CNN built for an explicitly linguistic purpose. The image encoding network is based on the trained word detection network from the previous section, with the addition of a set of fully connected layers trained for this re-ranking task. The second CNN is designed to extract, from a given natural language sentence, a vector representation of the same size as the one generated by the image encoding CNN. This effectively enables the mapping of language and images into the same feature space.

Since the image and the encoded sentence are both represented as vectors of the same size, the two networks can be trained jointly to maximise the cosine similarity between an image and its ground-truth captions, while decreasing the similarity to a set of irrelevant captions.
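A minimal sketch of such an objective in PyTorch, assuming batched outputs from the two encoders; the softmax-over-negatives form with in-batch irrelevant captions is illustrative rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def dmsm_loss(img_vecs, cap_vecs, temperature=0.1):
    """Pull each image towards its own caption, push away the others.

    `img_vecs` and `cap_vecs` are (B, D) outputs of the image and
    sentence encoders; row i of each is a matched image/caption pair,
    so the other rows in the batch serve as the irrelevant captions.
    """
    img = F.normalize(img_vecs, dim=-1)
    cap = F.normalize(cap_vecs, dim=-1)
    sims = img @ cap.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(img.size(0))
    # Softmax over captions: maximise the matched pair's similarity.
    return F.cross_entropy(sims, targets)

loss = dmsm_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```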

During the inference phase, the set of output sentences generated by the language model with beam search is re-ranked by the DMSM: each candidate caption is encoded and compared against the image, and the caption with the highest cosine similarity is selected as the final prediction.
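Continuing that sketch, inference-time re-ranking then reduces to a cosine-similarity comparison (the `rerank` helper and the encodings below are hypothetical stand-ins for the outputs of the two trained encoders):

```python
import torch
import torch.nn.functional as F

def rerank(img_vec, cap_vecs, captions):
    """Return candidate captions sorted by cosine similarity to the image."""
    sims = F.cosine_similarity(img_vec.unsqueeze(0), cap_vecs, dim=-1)
    order = sims.argsort(descending=True).tolist()
    return [(captions[i], sims[i].item()) for i in order]

# Stand-in encodings for one image and three beam-search candidates.
img_vec = torch.randn(128)
cap_vecs = torch.randn(3, 128)
candidates = ["a horse running in a field",
              "a man talking to a horse",
              "a field with a fence"]
best, score = rerank(img_vec, cap_vecs, candidates)[0]
print(best, round(score, 3))
```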