(This piece covers the work my friends and I did as our undergrad thesis, which was accepted to the 2018 PNC Annual Conference and Joint Meetings and is now available on IEEE Xplore.)

The future of computing lies in the confluence of Artificial Intelligence and Human-Computer Interaction. Real-time systems that respond intelligently to human commands have a number of use cases in robotics and other aspects of daily life. This work aims to solve one such problem in the field.



The task at hand is Open Vocabulary Image Retrieval: given a natural language query, create a mapping that selects the best among a set of candidate images. The power of sight comes very naturally to human beings, and computer vision has advanced greatly in the last decade to be almost on par with humans at object recognition. But most of that work deals with assigning one of a number of preset categories to an image. Another ability inherent in human beings is communicating by processing language. Unfortunately, replicating this in machines is a tedious task, one that has recently come to the fore with IoT systems, chatbots and other advances in natural language understanding.

When describing an object, a human will often use complex adjectives and additional contextual information, in no particular format, which is why handling free-language queries is an important direction for the future of object detection. For example, to describe the adjacent image of Cap’n Crunch cereal, humans rarely name the object with a single base-level noun like “box”; instead they use richer language like “red cereal box”. An image retrieval system that simply groups objects into categories therefore wouldn’t suffice; a more robust method is needed, one that can recognize details like “red” and “cereal” in the query and return the most relevant results. It is also a computer vision problem, since those details need to be recognized in the image itself.



So it’s an information retrieval problem where we try to improve the recall of the system as a whole. The following figure shows the overall methodology. Given the user query, we need to select the most appropriate image from among the various distractor images that we provide. Most often, the best image is matched by correctly identifying the nouns, adjectives and verbs in the query and finding them in the candidate images. To do this, five projections are employed, each of which is either category based or instance based. Category based projections aim to place the images into a set of categories on which they have been trained, whereas instance based projections look for specifics within the image itself which can’t be broadly grouped. As you would expect, the category based projections are deep convolutional neural networks. We use the AlexNet model as one because it covers a large number of object categories and is a well-trained industry standard. Additionally, we chose CIFAR-100 as a projection because it provides coarse as well as fine labels, such as “vehicle” and “car”, which is extremely useful as different people may describe the image using the hypernym (vehicle) or the hyponym (car) depending on their vocabulary. Finally, we make use of Caltech-256, which contains difficult, rotated and obscured images, to provide another perspective to our model. So our category based projections give us a comprehensive map from the nouns in the query to the objects in the candidate image. But that alone isn’t enough, is it?
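To make the category based projections concrete, here is a minimal sketch of how one might be applied, assuming a pretrained ImageNet AlexNet from torchvision; the function name and the top-k choice are illustrative and not the thesis code itself.

```python
# Sketch of a category based projection: a pretrained CNN (AlexNet here)
# assigns class probabilities to a candidate image, and the top labels are
# kept for matching against the nouns in the query.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

weights = models.AlexNet_Weights.DEFAULT
model = models.alexnet(weights=weights).eval()
class_names = weights.meta["categories"]  # ImageNet label strings

def category_projection(image_path, top_k=5):
    """Return the top-k (label, probability) pairs for one candidate image."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1).squeeze(0)
    top = torch.topk(probs, top_k)
    return [(class_names[i], p.item()) for p, i in zip(top.values, top.indices)]
```

The other category based projections (CIFAR-100, Caltech-256) would slot in the same way, each contributing its own label vocabulary.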



An image captioning model trained on the MS-COCO dataset was developed, along with reverse image search, as our instance projections. Image captioning provides descriptions of the actions going on in the image, which is a good representation of the verbs in the query. GISS was chosen to obtain information such as brands and labels: details that form an important part of human conversation but cannot be identified effectively by any other means.
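As a rough sketch of how an instance projection’s output feeds the matching step, the snippet below takes a caption string (of the kind an MS-COCO-trained captioning model would produce; the model itself is not shown) and keeps the nouns, verbs and adjectives that get compared against the query. NLTK’s tagger is an illustrative choice here, not necessarily what the thesis used.

```python
# Extract the content words (nouns, verbs, adjectives) from a generated
# caption so they can be matched against the same parts of speech in the query.
import nltk

for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)  # names vary across NLTK versions

def content_terms(text):
    """Return the noun/verb/adjective tokens of a caption or query."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    return [word for word, tag in tagged if tag.startswith(("NN", "VB", "JJ"))]

caption = "a group of people snowboarding down a snowy hill"
print(content_terms(caption))  # e.g. ['group', 'people', 'snowboarding', 'snowy', 'hill']
```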



All this information is mapped to vector form, and each representation is then compared to the query. The closest match by Pearson correlation is returned as the best-match image. The system was evaluated on the RefCOCO dataset as an IR problem and was found to perform better than existing benchmarks. The fact that its comparative performance improves further as more distractors are added adds to its utility.
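The final selection step reduces to comparing vectors. Below is a minimal sketch of that step, under the assumption that the query and each candidate’s combined projection output have already been embedded as equal-length numeric vectors; the function name is mine, not the paper’s.

```python
# Rank candidate images by Pearson correlation between the query vector and
# each candidate's combined projection vector; the highest correlation wins.
import numpy as np
from scipy.stats import pearsonr

def best_match(query_vec, candidate_vecs):
    """Return (index of best candidate, list of correlation scores)."""
    scores = [pearsonr(query_vec, vec)[0] for vec in candidate_vecs]
    return int(np.argmax(scores)), scores

# Tiny usage example with made-up vectors.
idx, scores = best_match(np.array([1.0, 0.0, 2.0, 1.0]),
                         [np.array([0.9, 0.1, 1.8, 1.2]),
                          np.array([0.0, 2.0, 0.1, 0.0])])
print(idx)  # 0 -- the first candidate correlates best with the query
```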



A small demo application was also developed to visualise the methodology in practice. When the description “chefs in a kitchen” is given, the first image is selected, whereas “people snowboarding” and “people playing a sport” correctly yield image 2. To demonstrate further utility, providing the description “two people talking” correctly selects the first image again. Each projection’s usefulness is visible here: “kitchen” is obtained from the category based projections, “snowboarding”/“talking” from the MS-COCO actions, and the coarse label “playing a sport” from CIFAR-100.
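For a toy, end-to-end flavour of the demo, the snippet below stands in each candidate image with a textual description (as if it were the pooled output of the projections), embeds the query and candidates as bag-of-words vectors, and ranks them by Pearson correlation. The candidate descriptions, vocabulary scheme and function names are all illustrative; they do not correspond to the demo’s actual images or code, which operate on images through the five projections.

```python
# Toy illustration of the demo behaviour: candidate "projection outputs" are
# plain text here, and both they and the query are embedded as bag-of-words
# vectors before Pearson-correlation ranking.
import numpy as np
from scipy.stats import pearsonr

candidates = {
    "kitchen_img": "chefs cooking food in a restaurant kitchen",
    "snowboard_img": "a group of people snowboarding down a slope playing a sport",
    "talking_img": "two people sitting at a table talking",
}

def bag_of_words(text, vocab):
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(query):
    vocab = sorted({w for t in candidates.values() for w in t.lower().split()}
                   | set(query.lower().split()))
    q = bag_of_words(query, vocab)
    scores = {name: pearsonr(q, bag_of_words(text, vocab))[0]
              for name, text in candidates.items()}
    return max(scores, key=scores.get)

for query in ["chefs in a kitchen", "people snowboarding", "two people talking"]:
    print(query, "->", retrieve(query))
```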



Going forward there are numerous applications of this work, and it is a first step towards enabling people to speak with their computer systems in conversational language itself, which would be a great leap for the field. Let your mind dream of a world where you have a personal robot that can fetch your breakfast cereal from the kitchen or pick out the clothes you want. That future is just a few years away.

