Implementation of the RNN that Can Be Used for Our Goal

TensorFlow includes an implementation of an RNN that is used to train a translation model on English/French sentence pairs. We will use it to train our chatbot.

One might ask: "why the hell are we looking at a translation model if we are writing a chatbot?" But this is confusing only at the beginning. Think for a moment: what is translation? Translation can be represented as a two-step process:

1. Produce a language-independent representation of the input message.
2. Map the information produced during the first step to the target language that we need to translate to.
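To make this view concrete, here is a toy Python sketch of the idea. It is purely illustrative: in the real model the encoder and decoder are learned RNNs, and the function names below are hypothetical stand-ins.

def translate(sentence, encode, decode):
    # Step 1: produce a language-independent representation ("thought vector").
    thought = encode(sentence)
    # Step 2: map that representation into the target language.
    return decode(thought)

# With an English encoder and a French decoder this is a translator; swap the
# French decoder for an English "reply" decoder and it becomes a chatbot.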

Now think for a moment: what if we train the same RNN model, but instead of Eng/Fre phrase pairs we feed it Eng/Eng dialog lines from movies? In theory, we should be able to produce a chatbot capable of responding to one-liner questions (without the ability to remember the context of a dialog). But this should be sufficient for our first chatbot. Plus, the approach is very simple. In the future, we should be able to iterate on it and make it more intelligent.

Later on we will learn how to train more complex networks that are better suited for chatbots (for example, retrieval-based models).

For now, here is a small example of a conversation with the bot after only 50,000 training iterations:

As you can see, the bot is capable of giving more or less informative answers to some of the questions. The quality of the bot improves with the number of training iterations. For example, here is how stupid it was after only the first 200 iterations:

I will keep updating the post as I continue training the model further and further.

Such a simple approach also allows us to create bots with different characters. For example, one might train it on the dialogue from the Star Wars saga, or from "The Lord of the Rings". Even more: given a big enough corpus of dialogue from a single character (for example, all of Chandler's lines from the TV show "Friends"), one can create a bot of that particular character.

Prepare Data that Can Be Used for Training

For training our first bot we will use the "Cornell Movie Dialogs Corpus". In order to prepare the data for training, we need a special converter script that is capable of converting the data from the corpus into the format required to train our RNN.
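To give you a feel for what this conversion does, here is my own naive sketch of the idea. It is not the actual converter.py, which is more careful (for example, about conversation boundaries); the corpus line format is taken from the Cornell corpus documentation.

# Naive sketch: extract the utterance text from each corpus line and pair
# every line with the line that follows it, as (question, answer).
# Corpus lines look like:
#   L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
texts = []
with open("movie_lines.txt") as f:
    for line in f:
        fields = line.split(" +++$+++ ")
        texts.append(fields[-1].strip())  # the utterance text is the last field

with open("train.a", "w") as a, open("train.b", "w") as b:
    for question, answer in zip(texts, texts[1:]):
        a.write(question + "\n")
        b.write(answer + "\n")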

I would strongly encourage you to read the README file of the script in order to understand more about the corpus of dialogs and what the script is actually doing, and only then continue reading. However, if you just need commands that you can blindly copy, paste and execute in order to prepare the training data, here they are (please pay attention to the fact that you need to use the branch "simaple_input_generator", since master might include an updated version of the code!):



tmp# git clone https://github.com/b0noI/dialog_converter.git
Cloning into 'dialog_converter'...
remote: Counting objects: 59, done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 59 (delta 33), reused 20 (delta 9), pack-reused 0
Unpacking objects: 100% (59/59), done.
Checking connectivity... done.
tmp# cd dialog_converter
dialog_converter git:(master)# git checkout simaple_input_generator
Branch simaple_input_generator set up to track remote branch simaple_input_generator from origin.
Switched to a new branch 'simaple_input_generator'
dialog_converter git:(master)# python converter.py
dialog_converter git:(master)# ls
LICENSE README.md converter.py movie_lines.txt train.a train.b

By the end of the execution you will have two files that will be used for the training:

train.a

train.b
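To get a quick feel for what was produced, you can peek at the first few pairs. This small snippet (mine, not part of the converter) assumes, as described above, that line N of train.a is the prompt and line N of train.b is the reply:

# Print the first three (prompt, reply) pairs from the generated files.
with open("train.a") as a, open("train.b") as b:
    for i, (prompt, reply) in enumerate(zip(a, b)):
        print("%s -> %s" % (prompt.strip(), reply.strip()))
        if i == 2:
            break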

Train the Model

This is the most spectacular part, since in order to train the model we will need to:

1. Find a machine with a powerful video card that is, very importantly, supported by TensorFlow (read: NVIDIA).
2. Modify the original "translate" script that is used to train the Eng/Fre translation model.
3. Prepare the machine for the training.
4. Initiate the training.
5. Wait. Wait. Wait. I'm serious... Wait.
6. Now you can start chatting.

Find the Machine

In order to make this process as simple as possible, I will use a pre-built AMI on AWS: the "Bitfusion TensorFlow AMI". It has TensorFlow pre-installed and built with GPU support. At the time of writing this article, the Bitfusion AMI includes TensorFlow version 0.11.

The process of creating an instance should be straightforward and is out of the scope of this article. Two pieces of information are relevant here: the type of the instance and the size of the SSD. For the instance type I would recommend p2.xlarge; it is the cheapest type that has an NVIDIA GPU on board and sufficient video memory (12 GB) to train our model. For the size of the SSD I would recommend allocating at least 100 GB.

Modify the Original “translate” Script

At this point I hope I can assume that you have SSH access to the machine where you will train the model.

Firstly, let's discuss why we need to modify the original script at all. The thing is, the original script does not allow you to override the source of the data that it uses to train the model. I have created a feature request: to add the ability to use your own data when training the translation RNN with the translate.py script. The pull request is already waiting for review, but for now you can participate by adding +1 to the issue (or by helping with the review ;) ).

Do not be afraid: the modification is really simple. And in order to make life easier for you, I have created a repository that contains the modified version of the code. So, here is what you need to do to put everything together:

Rename the files "train.a" and "train.b" to "train.en" and "train.fr" respectively. This is needed because the training script thinks that it is training a translation from English to French.

Both files need to be uploaded to the remote host. This can be done with the rsync command:

➜ train# REMOTE_IP=...
➜ train# ls
train.en train.fr
➜ train# rsync -r . ubuntu@$REMOTE_IP:/home/ubuntu/train

Now let's connect to the remote host and start a tmux session (for example, tmux new -s training, so that you can later reattach with tmux attach -t training if the connection drops). If you don't know what tmux is, you can just connect via plain ssh.

➜ train ssh ubuntu@$REMOTE_IP

Welcome to Bitfusion Ubuntu 14 Tensorflow - Ubuntu 14.04 LTS (GNU/Linux 3.13.0-101-generic x86_64)

This AMI is brought to you by Bitfusion.io (http://www.bitfusion.io)
Please email all feedback and support requests to support@bitfusion.io
Please review the README located at /home/ubuntu/README for more details on how to use this AMI

ubuntu@tf:~$ cd train/
ubuntu@tf:~/train$ ls
train.en train.fr

Let's verify that TensorFlow is installed and that it is using the GPU:

ubuntu@tf:~/train$ python

Python 2.7.6 (default, Jun 22 2015, 17:58:13)

[GCC 4.8.2] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.7.5 locally

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5 locally

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.7.5 locally

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.7.5 locally

>>> print(tf.__version__)

0.11.0

As can be seen, TF 0.11 is installed and it is using the CUDA libraries. Now let's clone the training script:



ubuntu@tf:~$ mkdir src/
ubuntu@tf:~$ cd src/
ubuntu@tf:~/src$ git clone https://github.com/b0noI/tensorflow.git
Cloning into 'tensorflow'...
remote: Counting objects: 117802, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 117802 (delta 0), reused 0 (delta 0), pack-reused 117792
Receiving objects: 100% (117802/117802), 83.51 MiB | 19.32 MiB/s, done.
Resolving deltas: 100% (88565/88565), done.
Checking connectivity... done.
ubuntu@tf:~/src$ cd tensorflow/
ubuntu@tf:~/src/tensorflow$ git checkout -b r0.11 origin/r0.11
Branch r0.11 set up to track remote branch r0.11 from origin.
Switched to a new branch 'r0.11'

Keep in mind that we need branch r0.11. First of all, the branch is consistent with the version of the locally installed TensorFlow. Second, I have not cherry-picked my changes to any other branches, so if you need a different branch you will have to cherry-pick the commit yourself.

At this point we should be ready to start the training process.

ubuntu@tf:~/src/tensorflow$ cd tensorflow/models/rnn/translate/

ubuntu@tf:~/src/tensorflow/tensorflow/models/rnn/translate$ python ./translate.py --en_vocab_size=40000 --fr_vocab_size=40000 --data_dir=/home/ubuntu/train --train_dir=/home/ubuntu/train

...

Tokenizing data in /home/ubuntu/train/train.en

tokenizing line 100000

...

global step 200 learning rate 0.5000 step-time 0.72 perplexity 31051.66

eval: bucket 0 perplexity 173.09

eval: bucket 1 perplexity 181.45

eval: bucket 2 perplexity 398.51

eval: bucket 3 perplexity 547.47

Let’s discuss some flags that we are using here:

en_vocab_size — how many unique words will be learned by the model. If the number of unique words in the input data exceeds the size of the vocabulary, all words that are not in the vocabulary will be marked as "UNK" (code: 3). The vocabulary should not be bigger than actually needed, but it should not be smaller either (a quick way to estimate a good size is shown after this list).

fr_vocab_size — the same, but for the other part of the data.

data_dir — the directory with the input data. The script will look for the files "train.en" and "train.fr" there.

train_dir — the directory where the script will store the results and checkpoints.
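As promised, here is a quick way to estimate how big the vocabulary actually needs to be. This is my own snippet, not part of the tutorial scripts, and it is only approximate: the training script uses its own tokenizer that also splits off punctuation, so a naive whitespace split undercounts somewhat.

# Rough vocabulary-size sanity check: count unique whitespace-separated
# tokens in the training data (the real tokenizer also splits punctuation).
from collections import Counter

counts = Counter()
with open("/home/ubuntu/train/train.en") as f:
    for line in f:
        counts.update(line.split())

print("unique tokens: %d" % len(counts))
print("most frequent: %s" % counts.most_common(10))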

Verify that the Training Is in Progress and Everything Is Going Correctly

Congratulations! At this point, you have successfully started the training. However, let's confirm that the process is in progress and everything is fine. We do not want to end up in a situation where, after 6 hours, we figure out that the process was started incorrectly.

First of all, we can confirm that the process has actually allocated the GPU memory:

$ watch -n 0.5 nvidia-smi

As can be seen, almost all the memory on the GPU has been allocated by our python process. This is a good sign. Also, you do not need to be afraid that your process is almost dead due to an OutOfMemory error; the thing is, TF allocates all the GPU memory during the initial start.
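If you would rather TF not reserve all GPU memory up front (for example, to share the GPU with another process), a TensorFlow session can be configured to grow its memory usage on demand. The translate.py script creates its own session internally, so using this would require a small edit to the script; the snippet below just shows the relevant configuration:

import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of reserving it all at start.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)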

Then you can check the "train" folder; it should now include some new files.

~$ cd train

~/train$ ls

train.fr train.ids40000.fr dev.ids40000.en dev.ids40000.fr train.en train.ids40000.en vocab40000.en vocab40000.fr

What is important here is to check the files vocab40000.* and train.ids40000.*. Let's see how they look from the inside:

~/train$ less vocab40000.en

_PAD

_GO

_EOS

_UNK

.

'

,

I

?

you

the

to

s

a

t

it

of

You

!

that

...

Each line in the file is a unique word that was found in the input data. Each unique word in the input data will be represented by a number: the position of that word in this file, counting from zero. As can be seen, there are some technical tokens: PAD(0), GO(1), EOS(2), UNK(3). We are interested in "UNK", since the amount of words marked with code 3 gives us some clue about whether the size of our dictionary is sufficient.

Now let’s look into the train.ids40000.en:

~/train$ less train.ids40000.en

1181 21483 4 4 4 1726 22480 4 7 251 9 5 61 88 7765 7151 8

7 5 27 11 125 10 24950 41 10 2206 4081 11 10 1663

84 7 4444 9 6 562 6 7 30 85 2435 11 2277 10289 4 275

107 475 155 223 12428 4 79 38 30 110 3799 16 13 767 3 7248 2055 6 142 62 4

1643 4 145 46 19218 19 40 999 35578 17507 11 132 21483 2235 21 4112 4

144 9 64 83 257 37 788 21 296 8

84 19 72 4 59 72 115 1521 315 66 22 4

16856 32 9963 348 4 68 5 12 77 1375 218 7831 4 275

11947 8

84 6 40 2135 46 5011 6 93 9 359 6370 6 139 31044 4 42 5 49 125 13 131 350 4

371 4 38279 6 11 22 316 4

3055 6 323 19212 6 562 21166 208 23 3 4 63 9666 14410 89 69 59 13262 69 4

59 155 3799 16 1527 4079 30 123 89 10 2706 16 10 2938 3 6 66 21386 4

116 8

...

Basically, this is the data from train.en, mapped via the dictionary file to numbers.
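If you want to double-check the mapping yourself, here is a small snippet (my own, not part of the tutorial) that decodes the first line of the id-encoded file back into words:

# Decode the first line of the id-encoded data back into words. A word's id is
# its zero-based line number in the vocabulary file.
with open("vocab40000.en") as f:
    vocab = [line.rstrip("\n") for line in f]

with open("train.ids40000.en") as f:
    ids = f.readline().split()

print(" ".join(vocab[int(i)] for i in ids))

Now we can check how many words were marked as "unknown" (UNK/3):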

~/train$ grep -o ' 3 ' train.ids40000.en | wc -l

7977
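Note that this grep pattern can miss an id at the very beginning or end of a line (and overlapping matches such as "3 3"), so if you want an exact count, a few lines of Python will do it:

# Exact count of UNK (id 3) tokens: compare whole tokens, not substrings.
unk = 0
with open("train.ids40000.en") as f:
    for line in f:
        unk += line.split().count("3")

print(unk)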

We can probably consider increasing the size of the dictionary from 40k to 45k, or maybe even 50k. But first, let's finish the current training process.

6 Hours Later… Now You Can Start Chatting

After waiting a sufficient amount of time, you can stop the process by simply interrupting it. Do not worry: the process dumps the result every given number of steps (by default this happens every 200 steps; in the version of the script I used this is controlled by the steps_per_checkpoint flag), so it is safe to interrupt it. I would recommend either picking the amount of time that you are willing to wait (this might be based on the amount of money that you are willing to pay for renting the machine), or the number of steps that you want the training process to finish.

In order to start the chat, we just need to add one simple flag to the same command that was used for the training:

~/src/tensorflow/tensorflow/models/rnn/translate$ python ./translate.py --en_vocab_size=40000 --fr_vocab_size=40000 --data_dir=/home/ubuntu/train --train_dir=/home/ubuntu/train --decode

...

Reading model parameters from /home/ubuntu/data/translate.ckpt-54400

> Hello!

Hello .

> Hi are you?

Now , okay .

> What is your name?

My name is Sir Sir .

> Really?

Yeah .

> what about the real name?

N . . . real real .

> are you a live?

Yes .

> where are you?

I ' m here .

> where is here?

I don ' t know .

> can I help you to get here?

Yeah , to I ve ' t feeling nothing to me .

Conclusion and Ideas for the Script Improvements

I hope you enjoyed the journey and now you have your own working bot. But this is just the beginning. Later on, we will see how to train networks that are better suited for chatbots and that can maintain the topic of a conversation. But if you want to play more with the current solution, here is what you can do:

implement logic that finds the optimal size for the en/fr dictionaries;

publish train.en/train.fr somewhere, so that the next people will not need to generate them;

train a Master Yoda (or Darth Vader) bot;

train a bot that speaks like characters from the Lord of the Rings universe;

train a bot that speaks like characters from the Star Wars universe.