What is Tensorflow Text Classification all About?

Text Classification is the task of assigning the right label to a given piece of text. This text can be a phrase, a sentence, or even a paragraph. Our aim is to take in some text as input and assign a label to it. Since we will be using Tensorflow's deep learning library, we can call this a Tensorflow text classification system. Seems simple, doesn't it? Well, not so much.

This task involves training a neural network with lots of data indicating what a piece of text represents. I am sure you have heard of the term "Sentiment Analysis". Sentiment analysis is a text classification task, but it is restricted to identifying the sentiment of the person saying something. For example, the sentence "The food was amazing" has a positive sentiment. On the other hand, "the movie was horrible" has a negative sentiment, while the sentence "the sun rises in the east" has a neutral sentiment.

For sentiment analysis, the labels are most often positive, negative, and neutral. But this is just one use of text classification. If you are building other text-based applications like a chatbot or a document parsing algorithm, you might want to know which category a particular sentence belongs to. For example, "Hello! How are you?" can have the label "Greeting" attached to it, and the sentence "It was a pleasure meeting you" can have the label "Farewell" attached to it.

What are you going to learn?

You could build a text classifier that classifies a given sentence to one of the many labels that the classifier is trained for. In this tutorial, we do just that. We will go through how you can build your own text-based classifier with loads of classes or labels.

The article Tensorflow text classification will be divided into multiple sections: first, the text pre-processing steps and the creation and usage of the bag of words technique; second, training the text classifier; and finally, testing and using the classifier.

If you don’t know what Tensorflow is, then you can read this article What is Tensorflow first.

Some NLP terminologies before we begin

Natural Language Processing (NLP) is used heavily in our text classification task. So, before we begin, I want to cover a few terms and concepts that we will be using. This will help you understand why a particular function or process is being called, or at the very least clear up any confusion you might have.

I) Stemming – Stemming is a process applied to a single word to derive its root. Many words used in a sentence are inflected or derived forms. To standardize our process, we would like to stem such words and end up with only root words. For example, a stemmer will convert the words "walking", "walked", and "walker" to their root word "walk".
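To make the idea concrete, here is a toy suffix-stripping stemmer written from scratch. Note that this is only an illustrative sketch of the concept; the tutorial itself uses NLTK's LancasterStemmer, which is far more sophisticated.

```python
# A toy suffix-stripping stemmer, just to illustrate the idea of stemming.
# Real stemmers (like NLTK's LancasterStemmer used later in this tutorial)
# apply many more rules and handle irregular forms.
def simple_stem(word):
    # try to strip a small set of common suffixes
    for suffix in ("ing", "ed", "er", "s"):
        # only strip when enough of the word remains to be a plausible root
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([simple_stem(w) for w in ["walking", "walked", "walker"]])
# all three reduce to "walk"
```

Even this crude rule-based approach maps "walking", "walked", and "walker" to the same root, which is exactly the standardization we want before building features.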

II) Tokenization – Tokens are basically words. Tokenization is the process of taking in a piece of text and finding all the words in it. The output is the list of tokens (words) in the text.

For example, for the sentence "Python NLP is just going great" we have the token list ["Python", "NLP", "is", "just", "going", "great"]. So, as you can see, tokenization involves breaking up the text into words.
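A minimal tokenizer can be sketched with a regular expression. This is only an approximation of the idea; the tutorial's actual code uses nltk.word_tokenize, which handles punctuation and contractions much better.

```python
import re

def tokenize(text):
    # keep runs of letters, digits, and apostrophes as tokens;
    # everything else (spaces, punctuation) acts as a separator
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("Python NLP is just going great"))
# ['Python', 'NLP', 'is', 'just', 'going', 'great']
```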

III) Bag of Words – The Bag of Words model in Text Processing is the process of creating a unique list of words. This model is used as a tool for feature generation.

Eg: consider two sentences:

Star Wars is better than Star Trek. Star Trek isn’t as good as Star Wars.

For the above two sentences, the bag of words will be: [“Star”, “Wars”, “Trek”, “better”, “good”, “isn’t”, “is”, “as”].

The position of each word in the list is hence fixed. Now, to construct a feature for classification from a sentence, we use a binary array ( an array where each element can either be 1 or 0).

For example, a new sentence, "Wars is good", will be represented as [0,1,0,0,1,0,1,0]. As you can see in the array, position 2 is set to 1 because the word in position 2 of the bag of words is "Wars", which is also present in our example sentence. The same holds for the other words "is" and "good". You can read more about the Bag of Words model here.
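The encoding described above can be sketched in a few lines. The vocabulary below is hard-coded in the same fixed order as the article's example; the real pipeline later builds it from the stemmed training data.

```python
# Vocabulary in the same fixed order as the article's example.
bag = ["Star", "Wars", "Trek", "better", "good", "isn't", "is", "as"]

def encode(sentence, vocab):
    # lowercase both sides so "Wars" matches "wars"
    tokens = [t.lower() for t in sentence.split()]
    # 1 if the vocabulary word appears in the sentence, else 0
    return [1 if word.lower() in tokens else 0 for word in vocab]

print(encode("Wars is good", bag))
# [0, 1, 0, 0, 1, 0, 1, 0]
```

Because the position of each word in the vocabulary is fixed, every sentence maps to a binary array of the same length, which is exactly the feature vector we will feed to the network.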

NOTE: The code below is presented to explain the procedure and is not complete. You can find the full working code in my Github Repository (link is given at the end of the article).

Step 1: Data Preparation

Before we train a model that can classify a given text to a particular category, we have to first prepare the data. We can create a simple JSON file that will hold the required data for training.

Following is a sample file that I have created, containing 5 categories. You can create as many categories as you want.

```json
{
  "time": ["what time is it?", "how long has it been since we started?", "that's a long time ago", "I spoke to you last week", "I saw you yesterday"],
  "sorry": ["I'm extremely sorry", "did he apologize to you?", "I shouldn't have been rude"],
  "greeting": ["Hello there!", "Hey man! How are you?", "hi"],
  "farewell": ["It was a pleasure meeting you", "Good Bye.", "see you soon", "I gotta go now."],
  "age": ["what's your age?", "How old are you?", "I'm a couple of years older than her", "You look aged!"]
}
```

In the above structure, we have a simple JSON with 5 categories ( time, sorry, greeting, farewell, and age). For each category, we have a set of sentences which we can use to train our model.

Given this data, we have to classify any given sentence into one of these 5 categories.

Step 2: Data Load and Pre-processing

```python
import sys
import json
import unicodedata
import nltk
from nltk.stem.lancaster import LancasterStemmer

# a table structure to hold the different punctuation used
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))

# method to remove punctuations from sentences.
def remove_punctuation(text):
    return text.translate(tbl)

# initialize the stemmer
stemmer = LancasterStemmer()

# read the json file and load the training data
with open('data.json') as json_data:
    data = json.load(json_data)
print(data)

# get a list of all categories to train for
categories = list(data.keys())

words = []
# a list of tuples with words in the sentence and category name
docs = []

for each_category in data.keys():
    for each_sentence in data[each_category]:
        # remove any punctuation from the sentence
        each_sentence = remove_punctuation(each_sentence)
        print(each_sentence)
        # extract words from each sentence and append to the word list
        w = nltk.word_tokenize(each_sentence)
        print("tokenized words: ", w)
        words.extend(w)
        docs.append((w, each_category))

# stem and lower each word and remove duplicates
words = [stemmer.stem(w.lower()) for w in words]
words = sorted(list(set(words)))

print(words)
print(docs)
```

In the code above, we create multiple lists. One list, "words", will hold all the unique stemmed words from all the sentences provided for training. Another list, "categories", holds all the different categories.

The output of this step is the "docs" list, which contains the words from each sentence and the category the sentence belongs to. An example document is (["whats", "your", "age"], "age").

Step 3: Convert the data to Tensorflow Specification

From the previous step, we have documents, but they are still in text form. Tensorflow, being a math library, accepts data in numeric form. So, before we begin the tensorflow text classification, we apply the bag of words model to convert each sentence into a numeric binary array. We store the labels/categories in the same way, as numeric binary arrays.

```python
import random
import numpy as np

# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(categories)

for doc in docs:
    # initialize our bag of words (bow) for each document in the list
    bow = []
    # list of tokenized words for the pattern
    token_words = doc[0]
    # stem each word
    token_words = [stemmer.stem(word.lower()) for word in token_words]
    # create our bag of words array
    for w in words:
        bow.append(1 if w in token_words else 0)

    output_row = list(output_empty)
    output_row[categories.index(doc[1])] = 1

    # our training set will contain the bag of words model and the output row
    # that tells which category that bow belongs to.
    training.append([bow, output_row])

# shuffle our features and turn into np.array as tensorflow takes in numpy arrays
random.shuffle(training)
training = np.array(training)

# train_x contains the bag of words and train_y contains the label/category
train_x = list(training[:, 0])
train_y = list(training[:, 1])
```

Step 4: Initiate Tensorflow Text Classification

With the documents in the right form, we can now begin the Tensorflow text classification. In this step, we build a simple Deep Neural Network and use that for training our model.

```python
import tensorflow as tf
import tflearn

# reset underlying graph data
tf.reset_default_graph()

# Build neural network
net = tflearn.input_data(shape=[None, len(train_x[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')
net = tflearn.regression(net)

# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')

# Start training (apply gradient descent algorithm)
model.fit(train_x, train_y, n_epoch=1000, batch_size=8, show_metric=True)
model.save('model.tflearn')
```

The code above runs for 1,000 epochs. I ran it for 10,000 epochs, which took 30,000 steps and around 2 minutes to finish training.

I have an Nvidia Geforce 940MX GPU. The size of the data and the type of GPU heavily determine the time taken for training.

During training, I achieved almost 100% training accuracy.

Step 5: Testing the Tensorflow Text Classification Model

We can now test the neural network text classification model using the code below.

```python
# let's test the model on a few sentences:
# the first two sentences are present in the training data,
# the last two sentences are not.
sent_1 = "what time is it?"
sent_2 = "I gotta go now"
sent_3 = "do you know the time now?"
sent_4 = "you must be a couple of years older than her!"

# a method that takes in a sentence and the list of all words
# and returns the data in a form that can be fed to tensorflow
def get_tf_record(sentence):
    global words
    # tokenize the pattern
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    # bag of words
    bow = [0] * len(words)
    for s in sentence_words:
        for i, w in enumerate(words):
            if w == s:
                bow[i] = 1
    return np.array(bow)

# predict the category for each of the 4 sentences
print(categories[np.argmax(model.predict([get_tf_record(sent_1)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_2)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_3)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_4)]))])
```

Just with this training, the model was able to correctly classify all four sentences. There will definitely be sentences that fail to be classified correctly.

This is only because the amount of training data is small. With more and more data, you can be assured the model will be more confident.

Conclusion and Next Steps

This is how you can perform tensorflow text classification. You can use this approach and scale it to perform many different classification tasks. You can use it to build chatbots as well. If you are interested in learning the concepts further, the following are links to some of the best courses for deep learning and python. I have learned a lot from all of these courses, and I would highly recommend them to anyone interested in machine learning, deep learning, or advanced python in general.

Note: If you wish to enroll in any of the courses below, click on the link you want, then log in or sign up to Udemy, and you will get a 90% discount on the courses listed below.

Complete Python 3 Bootcamp – For those of you who want to master Python programming.

Python for Data Science and Machine Learning – For those who want to learn and master Machine learning.

Deep Learning A-Z: Hands-On Artificial Neural Networks – If you already know the basics of ML and want to learn deep learning, then this one is the best course for you.

This is just the beginning! You can use this concept as a base for advanced applications and scale it up.

You can find the complete working code for neural network text classification python in my Git Repository for tensorflow.