Hello World, and welcome to my first blog post. In this and upcoming posts, I will share short summaries of some of the most popular research papers in the area of NLP. Suggestions and constructive criticism are most welcome.

The Problem:

The fundamental problem in probabilistic language modeling is that modeling the joint distribution of a large number of discrete variables requires an exponentially large number of free parameters. This is the 'curse of dimensionality'. It calls for modeling with continuous variables, where generalization comes more easily: the learned function is locally smooth, so every point (an n-gram sequence) carries significant information about a combinatorial number of neighboring points.
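To see how quickly the parameter count blows up, here is a small back-of-the-envelope calculation. The vocabulary size and context length below are illustrative choices of mine, not specific figures from the paper:

```python
# Rough count of free parameters in a full joint table over discrete
# word variables. The numbers below are illustrative only.
vocab_size = 100_000   # |V|: number of distinct words
context_len = 10       # number of consecutive words modeled jointly

# A full table over every possible sequence of this length has
# |V|^n entries; all but one are free parameters, since the
# probabilities must sum to 1.
free_parameters = vocab_size ** context_len - 1
print(f"free parameters in the full table: {float(free_parameters):.2e}")
```

Any continuous, smooth parameterization that shares parameters across similar sequences avoids having to fill in that table cell by cell.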

The Solution:

The paper presents an effective and computationally efficient probabilistic modeling approach that overcomes the curse of dimensionality. It also handles sequences that never appear in the training data. A neural network model is developed whose parameter set contains both the vector representation of each word and the parameters of the probability function. The objective is to find the parameters that minimize the perplexity of the training dataset. The model thus simultaneously learns a distributed representation for each word and the probability function of a sequence, expressed in terms of those representations. The network has a hidden layer with tanh activation and a softmax output layer. Given the indices of the (n-1) previous words as input, the model outputs the probabilities of all |V| words in the vocabulary.
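As a concrete illustration, here is a minimal NumPy sketch of that forward pass. The dimensions and variable names are my own toy choices, not the values used in the paper, and the paper's optional direct input-to-output connections are omitted:

```python
import numpy as np

# Toy dimensions; the actual sizes used in the paper differ.
V = 1000      # vocabulary size
m = 60        # dimension of each word's distributed representation
n = 4         # n-gram order: the model sees the (n-1) previous words
h = 50        # number of tanh hidden units

rng = np.random.default_rng(0)

# Parameters learned jointly: the word feature matrix C and the
# weights/biases of the probability function.
C = rng.normal(scale=0.1, size=(V, m))            # word vectors
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))  # hidden layer weights
d = np.zeros(h)                                   # hidden layer bias
U = rng.normal(scale=0.1, size=(V, h))            # hidden-to-output weights
b = np.zeros(V)                                   # output bias

def next_word_probs(prev_word_indices):
    """Map the (n-1) previous word indices to a probability
    distribution over all |V| words in the vocabulary."""
    x = C[prev_word_indices].reshape(-1)   # concatenate the word vectors
    a = np.tanh(H @ x + d)                 # tanh hidden layer
    y = U @ a + b                          # unnormalized scores
    e = np.exp(y - y.max())                # numerically stable softmax
    return e / e.sum()

probs = next_word_probs([12, 7, 301])      # any (n-1) word indices
print(probs.shape, probs.sum())            # (1000,) and ~1.0
```

Training would adjust C, H, d, U and b together by gradient ascent on the log-likelihood of the corpus, which is what ties the word representations to the probability function.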

src: Yoshua Bengio et al., A Neural Probabilistic Language Model

The Significance:

This model is capable of taking advantage of longer contexts. Traditional n-gram based models partly mitigate the problem of unseen sequences by gluing together overlapping shorter sequences, but they can only account for short contexts. With a continuous representation, where each word has its own vector, it becomes possible to estimate the probability of a sequence never seen in the training corpus. The number of parameters of the probability function grows only linearly with the size of the vocabulary and linearly with the dimension of the vector representation, so the curse of dimensionality is avoided: no exponential number of free parameters is needed. An extension of this work presents an architecture that outputs an energy function instead of probabilities and also handles out-of-vocabulary words.
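For contrast with the exponential table above, a rough count of the neural model's parameters (using the same toy sizes as in the earlier sketch and again ignoring the optional direct connections) shows the linear growth:

```python
def nplm_parameter_count(V, m, n, h):
    """Approximate number of free parameters in the neural model,
    ignoring the optional direct input-to-output connections."""
    word_vectors = V * m              # feature matrix C
    hidden = h * (n - 1) * m + h      # H and its bias d
    output = V * h + V                # U and its bias b
    return word_vectors + hidden + output

# Doubling the vocabulary roughly doubles the count instead of
# raising it to a huge power.
print(nplm_parameter_count(V=1000, m=60, n=4, h=50))
print(nplm_parameter_count(V=2000, m=60, n=4, h=50))
```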

Experimentation and Results:

The two corpora selected are standard and reasonably large. The Brown corpus is a collection of English text drawn from a wide variety of sources, while the AP News corpus contains news articles from 1995 and 1996. The baselines are smoothed and back-off n-gram models, which perform better than plain n-gram models. The neural network improved test perplexity by about 24% on the Brown corpus and about 8% on the AP News corpus. The best performance was observed with 10 hidden nodes in the neural model's MLP.

src: Yoshua Bengio et al., A Neural Probabilistic Language Model

My Take:

This paper takes the best ideas from different lines of work (learning a statistical model, exploiting word similarities, using a distributed vector representation for each word, and using neural networks) and puts them together into an elegant solution to the problem of statistical language modeling. Besides presenting an elegant model, the paper also shows how to take advantage of present-day computational resources to train quickly and efficiently, describing both data-parallel and parameter-parallel implementations. The idea of a mixture of models, combining this model with a trigram model, is remarkable, as is the author's explanation that the gain comes from the neural model and the trigram model making errors in different places. The part where using direct connections to the output layer implies that the number of hidden layers becomes two is not clearly explained. The work also lays out clear future directions: understanding the learned word representations, introducing prior knowledge, and representing the conditional probability as a tree structure.
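As a side note on the mixture idea, a minimal interpolation sketch might look like the following; the fixed weight is purely illustrative and not the weighting scheme used in the paper:

```python
def mixture_prob(p_neural, p_trigram, alpha=0.5):
    """Interpolate the neural model and the trigram model.
    alpha is an illustrative fixed weight, not the paper's scheme."""
    return alpha * p_neural + (1 - alpha) * p_trigram

print(mixture_prob(0.02, 0.005))  # e.g. per-word next-word probabilities
```

Because the two models tend to fail on different kinds of contexts, even a simple interpolation like this can outperform either model alone.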