Most problems in computer vision involve a dimensionality reduction step, where you go down from a high-dimensional feature space to a low-dimensional one. Two approaches from NLP have found great application in CV:

1) Bag of Words 2) Probabilistic Latent Semantic Analysis

While bag of words is typically used as a model for supervised learning, probabilistic latent semantic analysis is its counterpart for unsupervised learning.

Bag of Words

Bag of words has been used effectively in object and action recognition. Unlike in language, you don't have an existing dictionary of words that can be used directly in the bag of words approach. So usually local features like SIFT, HOG, HOF (Histogram of Optical Flow), SURF etc. are clustered to get a dictionary. This dictionary of "visual words" can then be used in the bag of words model.
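The dictionary-building step above can be sketched as follows. This is a minimal, numpy-only illustration: the random matrix stands in for real local descriptors (e.g. 128-D SIFT vectors pooled across training images), and the tiny k-means loop stands in for whatever clustering library you would actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for local descriptors (e.g. 128-D SIFT) pooled from many
# training images; in practice these come from a feature detector.
descriptors = rng.normal(size=(500, 128))

def build_vocabulary(desc, k=10, iters=20):
    """Cluster descriptors with k-means; the centroids become 'visual words'."""
    centers = desc[rng.choice(len(desc), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(desc[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = desc[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(desc, vocab):
    """Quantise one image's descriptors against the vocabulary and count."""
    dists = np.linalg.norm(desc[:, None] - vocab[None], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()

vocab = build_vocabulary(descriptors, k=10)
hist = bow_histogram(descriptors[:50], vocab)  # one image's BoW representation
```

Each image is thus reduced to a fixed-length histogram over the vocabulary, regardless of how many raw descriptors it produced.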

There are some subtle details that are left to experimentation, like the number of "visual words" in the dictionary, the distance metric to be used while clustering, etc. Euclidean distance, though commonly used, is not a great metric for CV. For example, if you are clustering regions of similar colour, violet is perceptually closer to red than to green. This can be dealt with by converting into another colour space which models human vision better; comparisons of the commonly used colour spaces are worth consulting before choosing one.
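As an illustration of switching colour spaces before clustering, here is a self-contained sRGB-to-CIELAB conversion (CIELAB is designed so that Euclidean distance roughly tracks perceived colour difference). The conversion formulas are standard; the specific RGB triples used for "violet", "red", and "green" below are illustrative choices, not from the original text.

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert sRGB values in [0, 1] to CIELAB (D65 white point)."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB gamma curve
    lin = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ (standard sRGB/D65 matrix)
    M = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = lin @ M.T
    # Normalise by the D65 reference white
    xyz = xyz / np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > (6 / 29) ** 3,
                 np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

violet = srgb_to_lab([0.54, 0.17, 0.89])
red    = srgb_to_lab([1.00, 0.00, 0.00])
green  = srgb_to_lab([0.00, 1.00, 0.00])

d_red   = np.linalg.norm(violet - red)    # Lab distance violet <-> red
d_green = np.linalg.norm(violet - green)  # Lab distance violet <-> green
```

Clustering pixels on these Lab coordinates (instead of raw RGB) makes the Euclidean metric inside k-means much better aligned with human colour perception.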

Once clustering is done, one can train an SVM on the occurrences of these "visual words" in other images/videos to get various classifiers that can be used for object and action recognition.
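The classification step can be sketched like this. To keep the example dependency-free, a minimal linear SVM is trained by stochastic subgradient descent on the hinge loss (in practice you would reach for an SVM library); the two groups of synthetic histograms stand in for images of two object categories.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bag-of-words histograms over 10 visual words: class +1 concentrates
# mass on the first words, class -1 on the last (two object categories).
pos = rng.dirichlet(np.r_[np.full(5, 5.0), np.full(5, 1.0)], size=40)
neg = rng.dirichlet(np.r_[np.full(5, 1.0), np.full(5, 5.0)], size=40)
X = np.vstack([pos, neg])
y = np.r_[np.ones(40), -np.ones(40)]

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Minimal linear SVM: stochastic subgradient descent on the hinge loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)                   # decaying step size
            if y[i] * (X[i] @ w + b) < 1:           # margin violated
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:                                   # only shrink (regularise)
                w = (1 - eta * lam) * w
    return w, b

w, b = train_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w + b) == y)
```

For multiple object categories one typically trains one such classifier per category (one-vs-rest) on the visual-word histograms.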

Probabilistic Latent Semantic Analysis

pLSA is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low dimensional representation of the observed variables in terms of their affinity to certain hidden variables.[Wikipedia]

Instead of documents, the algorithm operates on images; the words are substituted by visual words, and each topic corresponds to an object category. The initial step of getting "visual words" is similar to that described in the bag of words model.

pLSA models each image as a mixture over topics: P(w|i) = Σz P(w|z) P(z|i), where z is an object category, i an image, and w a visual word. The training process yields the probabilities P(w|z) and P(z|i). An image i is then classified as containing object k if P(z=k|i) is the maximum of P(z|i) among all object categories.
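A minimal sketch of this pipeline, assuming the standard EM updates for pLSA: the E-step computes the posterior P(z|i,w), the M-step re-estimates P(w|z) and P(z|i) from the visual-word counts, and each image is finally labelled with its dominant topic. The toy count matrix stands in for real image-by-visual-word histograms.

```python
import numpy as np

rng = np.random.default_rng(2)

def plsa(counts, n_topics, iters=50):
    """Fit pLSA by EM on an image-by-visual-word count matrix.

    Returns P(z|i) (per-image topic mixtures) and P(w|z) (topic-word dists).
    """
    n_imgs, n_words = counts.shape
    p_z_i = rng.dirichlet(np.ones(n_topics), size=n_imgs)   # P(z|i), rows sum to 1
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)  # P(w|z), rows sum to 1
    for _ in range(iters):
        # E-step: P(z|i,w) ∝ P(w|z) P(z|i); shape (imgs, words, topics)
        post = p_z_i[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: reweight the posterior by the observed counts n(i, w)
        weighted = counts[:, :, None] * post
        p_z_i = weighted.sum(axis=1)
        p_z_i /= p_z_i.sum(axis=1, keepdims=True)
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_i, p_w_z

# Toy corpus: images of one category mostly use the first visual words,
# images of the other mostly use the last ones.
docs0 = rng.multinomial(100, [0.4, 0.3, 0.2, 0.05, 0.03, 0.02], size=10)
docs1 = rng.multinomial(100, [0.02, 0.03, 0.05, 0.2, 0.3, 0.4], size=10)
counts = np.vstack([docs0, docs1]).astype(float)

p_z_i, p_w_z = plsa(counts, n_topics=2)
labels = p_z_i.argmax(axis=1)  # classify each image by its dominant topic
```

Note that, like any EM procedure, this only finds a local maximum of the likelihood, and the topic indices are arbitrary: which topic ends up corresponding to which object category depends on the random initialisation.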