There are various techniques for handling text data in machine learning. In this article, we'll look at one such technique: working with word embeddings in Keras. For a deeper introduction to Keras, refer to this tutorial:

We'll use the IMDB Reviews dataset for this tutorial. In an earlier tutorial, we applied other techniques, such as the bag-of-words model, to the same dataset. Here, we'll use a technique known as word embedding.

Word embedding is a technique used to represent documents with a dense vector representation. The vocabulary in these documents is mapped to real number vectors. Words that are semantically similar are mapped close to each other in the vector space. There are word embedding models that are ready for us to use, such as Word2Vec and GloVe. However, in this tutorial, we’re going to use Keras to train our own word embedding model.
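To make the idea concrete, here is a toy sketch of what an embedding amounts to: a lookup table from word indices to dense vectors. The vectors and vocabulary below are made up for illustration; a trained embedding learns these values from data.

```python
import numpy as np

# A toy embedding matrix: each row is the dense vector for one word index.
# (The numbers are made up; a trained model learns them from data.)
embedding_matrix = np.array([
    [0.10, 0.30],   # index 0 -> "good"
    [0.12, 0.28],   # index 1 -> "great"  (close to "good" in vector space)
    [0.90, -0.70],  # index 2 -> "bad"    (far from the other two)
])

sentence = [1, 0]                     # "great good" as word indices
vectors = embedding_matrix[sentence]  # look up the dense vector for each word
print(vectors.shape)                  # (2, 2): two words, two dimensions each
```

The Embedding layer we'll build with Keras works the same way, except its lookup table is learned during training.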

Let's get the ball rolling by importing our dataset and checking its head.
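A minimal sketch of this step, using a tiny stand-in for the IMDB Reviews data (the column names `review` and `sentiment` are assumptions; in practice you would load your copy of the dataset with `pd.read_csv` and use its actual columns):

```python
import pandas as pd

# Stand-in for the IMDB Reviews dataset; in practice this would be
# something like: df = pd.read_csv("...")  (path/columns are assumptions)
df = pd.DataFrame({
    "review": [
        "A brilliant film with superb acting",
        "Dull plot and terrible pacing",
    ],
    "sentiment": [1, 0],  # 1 = positive, 0 = negative
})

print(df.head())
```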

Next, we’ll import a couple of modules that we’ll need for this exercise.

- array from NumPy to convert the dataset to NumPy arrays
- one_hot to encode the words into lists of integers
- pad_sequences to pad the sentence sequences to the same length
- Sequential to initialize the neural network
- Dense to facilitate adding layers to the neural network
- Flatten to reshape the arrays
- Embedding to implement the embedding layer
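The imports above can be sketched as follows. The exact import paths depend on your Keras/TensorFlow version; this assumes the `tensorflow.keras` layout, and the `vocab_size` in the sanity check is an arbitrary illustrative value.

```python
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding

# Quick sanity check of the text helpers: encode two short sentences as
# integer lists (vocab_size is an assumption) and pad both to length 4.
vocab_size = 50
encoded = [one_hot(text, vocab_size) for text in ["good movie", "very bad movie"]]
padded = pad_sequences(encoded, maxlen=4, padding="post")
print(padded.shape)  # (2, 4)
```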

Next, we create variables with the reviews and the labels.
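A sketch of this step, again using a small stand-in DataFrame (the column names are assumptions carried over from the loading step):

```python
import pandas as pd

# Stand-in for the loaded dataset (column names are assumptions).
df = pd.DataFrame({
    "review": ["Great movie", "Awful movie"],
    "sentiment": [1, 0],
})

reviews = df["review"].values    # the raw text of each review
labels = df["sentiment"].values  # 1 = positive, 0 = negative
```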

We'll use Scikit-learn to split our dataset into a training set and a test set. We'll train the word embedding on 80% of the data and test it on the remaining 20%.
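The split can be sketched with `train_test_split`; the review and label lists below are hypothetical stand-ins for the real arrays, and `random_state` is just for reproducibility of the sketch.

```python
from sklearn.model_selection import train_test_split

# Hypothetical review/label arrays standing in for the real dataset.
reviews = ["Great movie", "Awful movie", "Loved it", "Hated it", "Fine"]
labels = [1, 0, 1, 0, 1]

# Hold out 20% of the data for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 4 1
```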

Let's now look at one of the reviews. We'll compare this sentence with its transformed versions as we move through the tutorial.
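A sketch of peeking at a single review; the training list here is a hypothetical stand-in for the `X_train` produced by the split above.

```python
# Hypothetical training reviews standing in for the split above.
X_train = [
    "A brilliant film with superb acting",
    "Dull plot and terrible pacing",
]

# Print the raw text of the first review so we can compare it with its
# encoded and padded forms later in the tutorial.
print(X_train[0])
```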