



PyText is an open-source Natural Language Processing (NLP) tool recently developed by the Facebook AI team. Although the tool has quite a few applications, in this example we will look at how it can be used to build an AI assistant/chatbot.

Before diving into the example, it is worth noting that the tool suits both Machine Learning (ML) experts and newcomers to the field: you can start with the basics and extend the application as your ML expertise grows. This article is a simple introduction to PyText: it shows the main concepts with minimal coding.

Any AI assistant/chatbot is built on intent detection and slot filling. To understand spoken language, we need to automatically identify the intent the user expresses in natural language and extract the associated arguments, or slots, needed to achieve the user's goal.

For this example, we will use e-commerce as the domain for creating an AI assistant. Take a look at some possible dialogue examples:

| Intent | Text | Slots |
| --- | --- | --- |
| product | Show me black t-shirts with size M. | product=t-shirts, size=M, color=black |
| add | I will take the first one. | position=first |
| cost | How much will it cost to deliver to 221B Baker Str London? | location=221B Baker Str London |

For each sentence, the intent needs to be identified in order to work out what the customer is asking for. To gather the details and understand the context, the slots (labeled words) need to be found. For example, when a customer asks for a black t-shirt, we understand that they are looking for a specific kind of product, and we know what kind of product it is (t-shirt) and what color it should be (black).

Introduction to PyText

To start using PyText, you need to have Python 3 and pip installed in order to run the following command that will install PyText:

pip install pytext-nlp

PyText provides simple and extensive interfaces and abstractions for model components. Through the following example, you will be able to see how easy it could be to train a model and start using it right away in production.

The pytext command line tool will automatically be installed and made available on your PATH. It is the fastest way to start training a model.

Data Preparation

Data preparation is the most critical part of any machine learning project. Unfortunately, there are no openly available datasets for this kind of assistant, so we need to generate our own simple dataset. If you would like to save some time, feel free to skip this part and use my dataset from https://github.com/kiril-me/assistance/tree/master/data

One of the main problems is that the training dataset is required to have a particular format. It is, however, rather easy to make a tool that would transform the markup text to the PyText format. For the sake of simplicity, try using my assistance_data.py script.

First, generate your chat samples and store them in chat.txt. Take a look at a sample of the training dataset we will be using:

product	Show (black)[color] (t-shirts)[product] size (m)[size]
add	I will take the (first)[position] one.
cost	How much will it cost to deliver to (221B Baker Str London)[location]?

Each row contains two tab-separated columns: the first is the intent and the second is the text in the markup format. The slot value is inside round brackets, and the slot name is inside square brackets. Remember, however, that three samples are not sufficient for training an ML model, so many more training samples need to be added. In addition, more intents and slots can be added to increase complexity.
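To make the markup concrete, here is a minimal sketch of how one such row could be parsed into an intent, plain text, and slot positions. It is not the actual assistance_data.py script: the function name parse_markup and the (start, end, slot) character-offset output are assumptions chosen for illustration, and the real PyText input format may differ.

```python
import re

# Matches "(value)[slot]" as used in the chat.txt markup described above.
SLOT_PATTERN = re.compile(r"\(([^)]*)\)\[([^\]]*)\]")


def parse_markup(line):
    """Split a 'intent<TAB>marked-up text' row into (intent, text, slots).

    slots is a list of (start, end, slot_name) character offsets into the
    plain text. This output shape is a hypothetical choice for illustration.
    """
    intent, marked_up = line.rstrip("\n").split("\t", 1)
    text_parts = []
    slots = []
    cursor = 0   # current position in the marked-up string
    length = 0   # length of the plain text built so far
    for match in SLOT_PATTERN.finditer(marked_up):
        before = marked_up[cursor:match.start()]
        text_parts.append(before)
        length += len(before)
        value, name = match.group(1), match.group(2)
        slots.append((length, length + len(value), name))
        text_parts.append(value)
        length += len(value)
        cursor = match.end()
    text_parts.append(marked_up[cursor:])
    return intent, "".join(text_parts), slots


intent, text, slots = parse_markup(
    "product\tShow (black)[color] (t-shirts)[product] size (m)[size]"
)
print(intent, "|", text, "|", slots)
```

Keeping character offsets rather than word indices means the result does not depend on how the text is tokenized later.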

To convert the data into the PyText format and split it into training, validation, and test sets, run the assistance_data.py script with the following parameters:

python assistance_data.py -t chat.txt -o ./data

The script will generate three files inside the data directory.
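How the script divides the samples is not shown here. One common approach, sketched below, is to shuffle the rows with a fixed seed and slice them. The function name split_dataset and the 80/10/10 proportions are assumptions for illustration, not the behavior of assistance_data.py.

```python
import random


def split_dataset(rows, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and split them into train, validation, test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_val = int(len(rows) * val_fraction)
    n_test = int(len(rows) * test_fraction)
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

With very small datasets the validation and test sets can end up empty, which is one more reason to collect many samples before training.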

Configuring the Model

Training a PyText model on a dataset is primarily about the configuration parameters. We have already made our training dataset, and now we need to configure our Deep Learning network and model parameters.

PyText configurations are in JSON format. Create the joint-model.json file with the following content:

{
  "config": {
    "task": {
      "JointTextTask": {
        "model": {
          "representation": {
            "BiLSTMDocSlotAttention": {
              "lstm": {
                "dropout": 0.1,
                "lstm_dim": 180,
                "num_layers": 2,
                "bidirectional": true
              },
              "pooling": {
                "SelfAttention": {
                  "attn_dimension": 64
                }
              }
            }
          },
          "output_layer": {
            "doc_output": {
              "loss": {
                "CrossEntropyLoss": {}
              }
            },
            "word_output": {
              "CRFOutputLayer": {}
            }
          }
        },
        "features": {
          "word_feat": {
            "embed_dim": 50,
            "embedding_init_strategy": "zero",
            "export_input_names": ["tokens_vals"],
            "pretrained_embeddings_path": "data/glove.6B.50d.txt",
            "vocab_from_train_data": true,
            "vocab_from_all_data": false,
            "lowercase_tokens": true
          }
        },
        "optimizer": {
          "type": "adam",
          "lr": "0.001",
          "weight_decay": 0
        },
        "trainer": {
          "epochs": 10
        },
        "featurizer": {
          "SimpleFeaturizer": {}
        },
        "labels": [
          { "DocLabelConfig": {} },
          { "WordLabelConfig": {} }
        ],
        "data_handler": {
          "columns_to_read": ["doc_label", "word_label", "text"],
          "train_batch_size": 86,
          "eval_batch_size": 128,
          "test_batch_size": 128,
          "max_seq_len": 20,
          "train_path": "data/train.csv",
          "eval_path": "data/val.csv",
          "test_path": "data/test.csv"
        },
        "exporter": {}
      }
    },
    "save_snapshot_path": "/tmp/joint_model.pt",
    "export_caffe2_path": "/tmp/joint_model.c2"
  }
}

Let's take a look at the data_handler configuration. It sets the locations of all three datasets and defines our data format: columns_to_read lists the columns of each row in order, first the document label (the intent), then the word labels (the slots), and finally the text. This is a pretty standard text format for PyText.
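Since most experimentation happens in this file, it can be handy to derive variants of a configuration programmatically. The sketch below uses only the standard json and copy modules. The helper name with_option and the dotted-path convention are assumptions for illustration, not part of PyText, and the inline configuration is a trimmed-down stand-in for joint-model.json.

```python
import copy
import json


def with_option(config, dotted_path, value):
    """Return a copy of config with the existing nested key set to value."""
    updated = copy.deepcopy(config)
    *parents, leaf = dotted_path.split(".")
    node = updated
    for key in parents:
        node = node[key]  # KeyError here means the path is wrong
    if leaf not in node:
        raise KeyError(f"unknown option: {dotted_path}")
    node[leaf] = value
    return updated


# Trimmed-down stand-in for joint-model.json (same nesting, fewer keys).
base = json.loads("""
{"config": {"task": {"JointTextTask": {
    "trainer": {"epochs": 10},
    "data_handler": {"train_batch_size": 86, "max_seq_len": 20}
}}}}
""")

longer = with_option(base, "config.task.JointTextTask.trainer.epochs", 20)
print(json.dumps(longer["config"]["task"]["JointTextTask"]["trainer"]))
```

Refusing to create new keys catches typos in option names early instead of silently writing a setting that nothing reads.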

To train a model, we need to convert our text to a machine representation (vectorize it). For our model, we use GloVe pre-trained word embeddings. You can download them from the Stanford website https://nlp.stanford.edu/projects/glove/ and take the smallest file, glove.6B.50d.txt. It was trained on about 6 billion tokens and has a vocabulary of about 400,000 words, each represented by a 50-dimensional vector. PyText will automatically convert our sentences to vectors. In general, the more dimensions, the better the accuracy, but the slower the training and the prediction. To see why, calculate how many values represent a sentence of 20 words with 50 dimensions each: 20 * 50 = 1,000 values. Using 300-dimensional vectors gives 20 * 300 = 6,000 values, six times as much input, so as a rule of thumb performance will drop roughly 6 times.
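The arithmetic above can be illustrated with a small sketch that reads vectors in the GloVe text layout (a word followed by its space-separated values on each line) and turns a sentence into a fixed-size matrix. This only illustrates the idea and is not how PyText implements it internally. The function names, the zero vector for unknown words, and the tiny 3-dimensional sample vectors are all made up for the example.

```python
import io


def load_glove(lines):
    """Parse GloVe-style text lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def embed_sentence(sentence, vectors, dim, max_seq_len=20):
    """Return max_seq_len rows of dim values; unknown words and padding are zeros."""
    tokens = sentence.lower().split()[:max_seq_len]
    rows = [list(vectors.get(token, [0.0] * dim)) for token in tokens]
    rows += [[0.0] * dim for _ in range(max_seq_len - len(rows))]
    return rows


# Made-up 3-dimensional vectors standing in for glove.6B.50d.txt.
sample = io.StringIO("show 0.1 0.2 0.3\njeans 0.4 0.5 0.6\n")
vectors = load_glove(sample)
matrix = embed_sentence("Show jeans size medium", vectors, dim=3)

print(len(matrix), "rows x", len(matrix[0]), "values")
print("50-d input size:", 20 * 50, "| 300-d input size:", 20 * 300)
```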

The model configuration describes a deep learning network. You can learn more about the approach in the paper on the joint model of intent determination and slot filling [https://www.ijcai.org/Proceedings/16/Papers/425.pdf]. It uses a bidirectional long short-term memory (LSTM) network. Because the dataset is so small, we set the dropout to 0.1.

Starting the Training

To start the training, we will use the PyText train command. It takes the joint-model.json configuration file, initializes the model, and begins the training process. At the end, it saves the best model to the snapshot path set in the configuration (save_snapshot_path, here /tmp/joint_model.pt). Our configuration sets epochs to 10, which means that we make 10 passes over the training data and keep the best model. The model snapshot can then be used in production.

pytext train < joint-model.json

You will get the final F1 score for each intent and slot.
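For readers unfamiliar with the metric, the following sketch computes a standard per-label F1 score (the harmonic mean of precision and recall) from gold and predicted labels. It is a generic illustration with made-up labels. PyText's own report may aggregate or average the scores differently.

```python
from collections import Counter


def f1_per_label(gold, predicted):
    """Return {label: F1} computed from two equally long label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in set(gold) | set(predicted):
        precision_den = tp[label] + fp[label]
        recall_den = tp[label] + fn[label]
        precision = tp[label] / precision_den if precision_den else 0.0
        recall = tp[label] / recall_den if recall_den else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Made-up intent labels for four utterances.
gold = ["product", "product", "add", "cost"]
predicted = ["product", "add", "add", "cost"]
for label, score in sorted(f1_per_label(gold, predicted).items()):
    print(f"{label}: {score:.2f}")
```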

Executing the Model

Before we use the model in production, we want to make sure that it satisfies our needs, that is, that it has good accuracy. Let's test it:

pytext test < joint-model.json

Because our dataset is too small, we cannot get a good F1 score.

Let's try to make a prediction for one sample:

pytext --config-file joint-model.json \
  predict --exported-model /tmp/joint_model.c2 <<< '{"raw_text": "Show jeans size medium"}'

This command will print the scores for every intent and slot:

{'doc_scores:add': array([-15.131375], dtype=float32),
 'doc_scores:cost': array([-12.765977], dtype=float32),
 'doc_scores:product': array([-3.0994463e-06], dtype=float32),
 'word_scores:NoLabel': array([[-1.2382780e-03], [-1.0783006e+01], [-2.0213978e-04], [-8.3832254e+00]], dtype=float32),
 'word_scores:PAD_LABEL': array([[-13.03071], [-14.443796], [-15.797092], [-11.960256]], dtype=float32),
 'word_scores:color': array([[-6.7855487], [-8.868696], [-12.415806], [-10.454444]], dtype=float32),
 'word_scores:location': array([[-11.353401], [-14.120435], [-14.33404], [-12.914462]], dtype=float32),
 'word_scores:position': array([[-10.097542], [-14.370812], [-12.048518], [-9.770829]], dtype=float32),
 'word_scores:product': array([[-1.1044090e+01], [-1.8771265e-04], [-1.1694180e+01], [-9.5313435e+00]], dtype=float32),
 'word_scores:size': array([[-1.0219145e+01], [-1.0625681e+01], [-8.6051016e+00], [-3.9603206e-04]], dtype=float32)}

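This shows that the customer is most likely asking about a product: the product intent has the highest score, the one closest to zero (the values behave like log-probabilities, so 0 is the best possible score). Among the slots, product scores highest for the second word (jeans) and size for the fourth word (medium). To get a better F1 score we need to create a better training dataset and potentially change the model parameters.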
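The reading of the output can be reproduced with a few lines of Python. The sketch below copies the scores printed above into plain dictionaries and picks the highest-scoring intent and the highest-scoring label for each word. It assumes that the four score rows correspond to the four whitespace-separated words of the input and that the scores are log-probabilities. The helper names are hypothetical.

```python
import math

tokens = ["Show", "jeans", "size", "medium"]  # assumed whitespace tokenization

doc_scores = {"add": -15.131375, "cost": -12.765977, "product": -3.0994463e-06}

# One score per token for every word label, copied from the output above.
word_scores = {
    "NoLabel": [-1.2382780e-03, -1.0783006e+01, -2.0213978e-04, -8.3832254e+00],
    "PAD_LABEL": [-13.03071, -14.443796, -15.797092, -11.960256],
    "color": [-6.7855487, -8.868696, -12.415806, -10.454444],
    "location": [-11.353401, -14.120435, -14.33404, -12.914462],
    "position": [-10.097542, -14.370812, -12.048518, -9.770829],
    "product": [-1.1044090e+01, -1.8771265e-04, -1.1694180e+01, -9.5313435e+00],
    "size": [-1.0219145e+01, -1.0625681e+01, -8.6051016e+00, -3.9603206e-04],
}


def best_intent(scores):
    """Return the intent with the highest (closest to zero) score."""
    return max(scores, key=scores.get)


def label_tokens(tokens, scores):
    """Pair every token with its highest-scoring word label."""
    return [
        (token, max(scores, key=lambda label: scores[label][i]))
        for i, token in enumerate(tokens)
    ]


def extract_slots(labeled):
    """Collect the words of each real slot, ignoring NoLabel and PAD_LABEL."""
    slots = {}
    for token, label in labeled:
        if label not in ("NoLabel", "PAD_LABEL"):
            slots.setdefault(label, []).append(token)
    return {label: " ".join(words) for label, words in slots.items()}


intent = best_intent(doc_scores)
print(intent, round(math.exp(doc_scores[intent]), 4))
print(extract_slots(label_tokens(tokens, word_scores)))
```

Taking the per-word maximum independently is a simplification. The model's word output layer is a CRF, which scores whole label sequences, so a proper decoder could in principle choose differently.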

Starting to Use the Model in Your Project

Before we start using the model, we need to export it. To do so, make the following call:

pytext export --output-path joint_model.c2 < joint-model.json

The joint_model.c2 will contain all the information needed to run our model.

For a simple web application, we will use Flask:

from flask import Flask, request, jsonify
import pytext

config_file = "joint-model.json"
model_file = "joint_model.c2"

# Load the configuration and wrap the exported model in a predictor
config = pytext.load_config(config_file)
predictor = pytext.create_predictor(config, model_file)

app = Flask(__name__)

# Slots scoring within this margin of the best slot are reported
label_threshold = 0.1


@app.route("/chat", methods=["GET", "POST"])
def chat():
    # get_data() returns the raw body whatever the request's content type is
    message = request.get_data(as_text=True)
    result = predictor({"raw_text": message})

    # Intent: the document label with the highest score, prefix stripped
    best_doc_label = max(
        (label for label in result if label.startswith("doc_scores:")),
        key=lambda label: result[label][0],
    )[len("doc_scores:"):]

    # Slots: rate each slot label by its best-scoring word, skipping NoLabel
    slot_scores = {
        label[len("word_scores:"):]: float(result[label].max())
        for label in result
        if label.startswith("word_scores:") and label != "word_scores:NoLabel"
    }
    best_label_score = max(slot_scores.values())
    labels = [
        label
        for label, score in slot_scores.items()
        if best_label_score - score <= label_threshold
    ]

    if best_doc_label == "product":
        return jsonify({"answer": f"Are you asking about {best_doc_label} {labels}?"})
    elif best_doc_label == "add":
        return jsonify({"answer": "Product was added to your cart"})
    elif best_doc_label == "cost":
        return jsonify({"answer": "We calculate price for you"})
    return jsonify({"answer": "Sorry could you repeat please."})


if __name__ == "__main__":
    # Port 8080 matches the curl example
    app.run(port=8080)

With the application running, send it a message using the following curl command:

curl http://localhost:8080/chat -H "Content-Type: text/plain" -d "Show jeans size medium color blue"

{"answer": "Are you asking about product ['product', 'size', 'color']?"}
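The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names build_chat_request and ask are hypothetical, and it assumes the service is listening on localhost:8080 as in the curl example and replies with a JSON object containing an "answer" field.

```python
import json
import urllib.request


def build_chat_request(message, base_url="http://localhost:8080"):
    """Create a POST request carrying the message as a plain-text body."""
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def ask(message, base_url="http://localhost:8080"):
    """Send the message to the chat endpoint and return the answer text."""
    with urllib.request.urlopen(build_chat_request(message, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]


# Requires the Flask application to be running:
# print(ask("Show jeans size medium color blue"))
```

Setting the content type to text/plain explicitly matters because urllib otherwise marks the body as form data, which a server reading the raw request body may not treat as a message.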

For demonstration purposes, our code only displays labels and no actual words. You can see how to get actual words in the PyText demo project [https://github.com/facebookresearch/pytext/blob/master/demo/flask_server/atis.py].

As you can see, the code is short, and you can play around with your model, tweak its parameters, or even change the machine learning algorithm. Data engineering is the hardest problem in creating accurate machine learning models, but the dataset, too, is easy to experiment with and improve.