One of the goals of multimodal embedding is to make it easier to extend machine learning models into new contexts. Deep architectures are typically trained on large, diverse datasets, because without a significant quantity of labeled data pairs (on the order of tens of thousands to millions), it is difficult to produce robust results. Models typically require dozens if not hundreds of exemplars to learn a new class, and even after learning on thousands of images, their architecture still only allows them to make predictions over the original label set. That means that if a network was trained on “plane” but not “airplane” or “747”, there is no straightforward way to teach it those labels without re-training the entire network on new samples.

Speaking to this problem, the recent paper “Return of Frustratingly Easy Domain Adaptation”, by Baochen Sun et al., noted:

Unlike human learning, machine learning often fails to handle changes between training (source) and test (target) input distributions. Such domain shifts, common in practical scenarios, severely damage the performance of conventional machine learning methods.

If we can find a way to flexibly align features from different modalities, it opens up the possibility for networks that generalize to new sources.

A Brief Intro to Attalos

Here at the Lab, the Attalos challenge has been chipping away at multimodal embedding. We take in labeled datasets of images and their associated tags, extract features from the images with off-the-shelf deep convolutional networks and from the tags with word embedding models, and explore intelligent ways to project both encodings into the same vector space. From that point (no pun intended), it is much easier to assess similarity and find related content.

Image Embedding

Over the past few years, we’ve seen a proliferation of neural learning architectures—of the convolutional flavor, to be exact—with better performance on image tasks coming from increasingly deep networks. Instead of reinventing the wheel, we realized that it was possible to use these models pre-trained on image classification tasks to extract a representation of images. This basically means stepping in the middle of the network and extracting a dense, feature-rich encoding of our images.

Inception Network, one of the CNNs we used to extract visual features
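For instance, here’s a rough sketch of pulling such a representation out of a pre-trained Inception network with Keras (the pooling choice and image size here are one common option, not necessarily our exact setup):

```python
# Minimal sketch: extract a dense image representation from a pre-trained CNN.
# include_top=False drops the classification head; pooling='avg' averages the
# last convolutional block into a fixed-length feature vector (2048-d here).
import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing import image

extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def image_features(path):
    img = image.load_img(path, target_size=(299, 299))  # InceptionV3's input size
    batch = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(batch)[0]                   # shape: (2048,)
```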

Text Embedding

We’ve posted on text embedding methods before (here, here, and here), so I won’t go into too much detail, but suffice it to say that there are a few options available, all of which perform well. Many authors use word2vec, a neural embedding model devised at Google in 2013, as the semantic embedding model. Others have taken advantage of embeddings pre-trained using Global Vectors for Word Representation (GloVe). For our purposes, though, any semantic embedding will do, so long as it preserves the property that semantically similar words sit closer to each other than to dissimilar ones.
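To get a feel for what these embeddings buy you, here’s a tiny sketch with gensim (the vector file name is just a placeholder for whatever pre-trained GloVe or word2vec vectors you have on hand):

```python
# Minimal sketch: load pre-trained word vectors and check that semantically
# similar words land near each other in the embedding space.
from gensim.models import KeyedVectors

# Hypothetical path; GloVe vectors need converting to word2vec text format first.
vectors = KeyedVectors.load_word2vec_format("glove.6B.200d.w2v.txt")

print(vectors.similarity("ocean", "sea"))     # high: the words are related
print(vectors.similarity("ocean", "poster"))  # low: the words are unrelated
print(vectors.most_similar("beach", topn=5))  # nearest neighbors in vector space
```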

Joint Embedding

The last (and most important) step in the process is translating our encoded image and text features into the same vector space. This is where we need to do the most heavy lifting, since there is no standard method for integrating multiple modalities in an extensible way. Luckily, there’s a decent amount of literature on different areas where joint embedding can be helpful. Researchers have successfully done joint embedding of shapes with images, sounds with transcripts, two different languages, and even videos with images and summaries. So long as you can create an embedding of both modalities, a vector space model can reveal how each modality relates to the other.

Example of an ideal embedding of word vectors and images, from Socher et al. 2013

The heart of the challenge has been figuring out the best way to meaningfully project these image and text features into the same vector space. While there are a number of feature extraction architectures that all work fairly well, the difference between a good joint embedding method and a bad one is often pretty stark. Here’s one method that we’ve found promising!

Fast Zero Tag

Earlier this year at CVPR, researchers from the University of Central Florida presented a new approach to zero-shot learning that achieved surprising performance through a relatively simple method. In their paper, “Fast Zero-Shot Image Tagging”, they approach the issue of image-tag embedding from a new angle.

From Zhang et al. 2016

The key insight is this: what we want to know is which visual features make a difference to the meaning of the image. In order to translate image features into text features, the algorithm finds the direction in our semantic space that ranks tags in order of how appropriate they are to the image. Then we can simply take that ranking for a given image and keep the top-n entries to find its best tags. Or in reverse, we can take a group of tags and find the images whose directions map closest to that group.
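In code, that ranking is just a dot product between the predicted direction and each tag’s word vector. Here’s a small sketch with NumPy, where `tag_vectors`, `tag_names`, and `image_dirs` stand in for whatever embedding matrix and model outputs you have:

```python
# Minimal sketch of both query directions. Assumes:
#   tag_vectors: (num_tags, d) array of word embeddings, one row per tag
#   tag_names:   list of the corresponding tag strings
#   image_dirs:  (num_images, d) array of directions predicted by the network
import numpy as np

def top_tags(image_dir, tag_vectors, tag_names, n=5):
    # Rank every tag by how well it aligns with this image's direction.
    scores = tag_vectors @ image_dir
    return [tag_names[i] for i in np.argsort(-scores)[:n]]

def top_images(query_tags, tag_vectors, tag_names, image_dirs, n=5):
    # Average the query tags' vectors, then rank images by alignment with that query.
    idx = [tag_names.index(t) for t in query_tags]
    query = tag_vectors[idx].mean(axis=0)
    scores = image_dirs @ query
    return np.argsort(-scores)[:n]  # indices of the best-matching images
```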

If you’re curious about the details, the authors of the method, which they call “Fast0Tag”, defined a cost function that optimizes for the best directions. In it, M is the set of images in the dataset or batch, P and N are the sets of correct and incorrect tags for image m, and x is the direction the network outputs for image m.
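Putting that together, the cost is roughly the following pairwise ranking objective (my paraphrase of the paper’s formulation in the notation above, not a verbatim copy of their equations):

$$ \ell = \sum_{m \in M} \; \sum_{p \in P_m} \sum_{n \in N_m} \log\!\left(1 + \exp\left(v_n^\top x_m - v_p^\top x_m\right)\right) $$

where v_p and v_n are the word vectors of a positive and a negative tag for image m.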

In (slightly) plainer English, the cost is lowest when the network outputs a direction that’s aligned with the positive examples and away from the negative ones. With this, you can build a sequential model of fully connected layers (they used Theano + Keras, but TensorFlow is just as good) using the above equation as the loss function. And that’s it! After it trains, you should have a model that can accommodate multi-tag search, semantic clustering, and other multimodal tasks.
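Here’s a minimal sketch of what that could look like with tf.keras (layer sizes, file names, and variable names are illustrative, not our exact implementation):

```python
# Minimal sketch: a fully connected network that maps image features to a
# direction in word-vector space, trained with a pairwise ranking loss.
# Assumes a precomputed (num_tags, 200) matrix of tag word vectors.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

word_matrix = np.load("tag_vectors.npy").astype("float32")  # hypothetical file
W = tf.constant(word_matrix)                                 # (num_tags, 200)

def ranking_loss(y_true, y_pred):
    # y_true: binary indicators of each image's positive tags, shape (batch, num_tags)
    # y_pred: predicted direction in word-vector space, shape (batch, 200)
    scores = tf.matmul(y_pred, W, transpose_b=True)          # (batch, num_tags)
    pos, neg = y_true, 1.0 - y_true
    # diff[b, p, n] = score of negative tag n minus score of positive tag p
    diff = scores[:, None, :] - scores[:, :, None]
    pair_mask = pos[:, :, None] * neg[:, None, :]            # keep only (positive, negative) pairs
    # softplus(diff) = log(1 + exp(diff)) penalizes negatives that outrank positives.
    # Note: this scores every tag pair, which is fine for small vocabularies but
    # blows up for large ones (see the Roadblocks section below).
    pairwise = tf.nn.softplus(diff) * pair_mask
    return tf.reduce_sum(pairwise) / (tf.reduce_sum(pair_mask) + 1e-8)

model = models.Sequential([
    layers.Dense(1024, activation="relu", input_shape=(2048,)),  # Inception features in
    layers.Dense(1024, activation="relu"),
    layers.Dense(200),                                           # linear layer into GloVe space
])
model.compile(optimizer="adam", loss=ranking_loss)
# model.fit(image_features, tag_indicators, ...)  # illustrative training call
```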

Things Ventured, Things Gained

Results

We trained the Fast0Tag model on some of the datasets we’ve been working with for the Attalos Challenge. The network architecture itself is straightforward: three layers, two with ReLU activations and one linear. It takes in the output from our pre-trained Inception image feature extractor and outputs a vector in the GloVe 200-dimensional word embedding space. Here are a few examples from testing the model on the IAPR TC-12 dataset after training it on the ESP Game dataset and vice versa:

Many images are tagged reasonably . . .

. . . while the model has some difficulty on a few images.

Alternatively, you can input a set of tags and find images related to those tags:

Result of searching for “fire” + “night”

Result of searching for “man” + “fish”

Result of searching for “drink” + “food”

Remember: not only are these images from a dataset the network had never seen before, but the labels are a different set from the tags it had trained on.

For example, the tag “ocean” isn’t part of the ESP Game dataset, but using our Fast0Tag model, we can use it as a query and examine the top results:

Overall, we found that Fast0Tag was able to get upwards of 40% overall precision, recall, and F1 scores within a single corpus and 20% between corpora. It turns out that the model tends to perform a bit worse when trained on the ESP Game dataset, likely because those images carry a number of labels that are mostly unrelated to their visual features. Tags like “poster” and “game” are difficult to deduce from the extracted features alone, especially when we’ve trained our model on a set of tags that mostly annotate outdoor scenes. This really shows how important training on a broad tag set is to good model generalization.
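For reference, one straightforward way to compute such scores is to compare each image’s top-k predicted tags against its ground-truth tags (a sketch; k = 5 is illustrative and may not match our exact protocol):

```python
# Minimal sketch: precision, recall, and F1 for one image's top-k predicted tags.
def prf_at_k(predicted_tags, true_tags, k=5):
    predicted = set(predicted_tags[:k])
    truth = set(true_tags)
    hits = len(predicted & truth)
    precision = hits / max(len(predicted), 1)
    recall = hits / max(len(truth), 1)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```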

Roadblocks

Although the results of Fast0Tag are promising, the method is not without flaws. For us, there’s a major bottleneck in the original implementation. In the paper, the authors computed the distances in the ranking loss from the network’s output for every tag in the dataset. Since they were training on the tiny NUS-WIDE training set (91 tags), that wasn’t an issue, but it becomes a major headache when dealing with even slightly larger datasets such as IAPR TC-12 (291 tags), to say nothing of the giant tag sets from Visual Genome or Yahoo!-Flickr 100M (100,000+ tags).

Instead of optimizing over all positive and negative tags for every image, we’d like to choose a handful of negative tags for each image to accompany the positive tags from the training data. This would let us approximate the cost function using negative sampling, without the overhead of redundant computations. For instance, if we feed in an image of a forest, there’s no reason to compare against both “marine” and “reef”, especially since those two will already be close in word2vec space. It’s much better to randomly select one of them as a negative tag example every time we see that image. We’re working on a TensorFlow implementation of this negative sampling method.
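As a sketch of the basic idea (names are illustrative, and this leaves out any similarity-aware grouping of near-duplicate tags):

```python
# Minimal sketch: draw a small random set of negative tags for one image,
# rather than scoring every tag in the vocabulary at every step.
import random

def sample_negatives(positive_tags, vocabulary, k=10):
    positives = set(positive_tags)
    candidates = [t for t in vocabulary if t not in positives]
    return random.sample(candidates, min(k, len(candidates)))

# e.g. for a forest image tagged {"tree", "forest", "green"}, each training step
# might draw negatives like "reef" or "poster" instead of using the whole tag set.
```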

Futures

So what does Fast0Tag mean for our quest to build a cross-modal embedding of features?

First, I think it provides yet another proof of concept that deep/wide networks can perform well at transferring knowledge between domains. It also makes me hopeful for developments in zero/one-shot learning over the next few years. Finding a successful joint embedding method would mean being able to swap test corpora and still get quality results, which is critical in the real world since it allows predictions to be applied dynamically in new contexts without having to train another network to map between the two domains. To run your embedding in a new context, just apply the same feature extractor and . . . voilà! Automatic tagging and image retrieval.

Humans have a remarkably strong capacity for abstraction. From a handful of examples we can pull out high-level regularities that we can extend to classify or produce novel examples. Moreover, we can often generalize to new classes with an incomplete data point or even zero examples from that class. By extracting general features from examples and translating them into a new domain, we can transfer existing knowledge to new problem sets.

Ardi, an Ardipithecus ramidus skeleton, considered relatively complete

Take early hominid fossils as an example. Digs rarely uncover a full head-to-toe skeleton; anthropologists are lucky if they can find even partial remains. However, they’re able to use prior knowledge about hominid physiology to extrapolate from important features (dental structure, femur size, etc.) to a general understanding of how the individual looked and behaved.

With the flurry of literature on multimodal embeddings and their usefulness, I think that new models may draw from both mathematics and neuroscience to figure out how to formalize our insights about how humans integrate data from multiple modalities. We’re already taking a look at whether we can improve performance through localization (i.e. attention)! With a bit more refinement, networks that can learn to generalize to new cases could be on our horizon.