In this tutorial, we’ll build a Recurrent Neural Network (RNN) in PyTorch that will classify people’s names by their languages. We assume that the reader has a basic understanding of PyTorch and machine learning in Python.

By the end of this tutorial, we’ll be able to predict the language a name comes from based on its spelling. The dataset of names used in this tutorial can be downloaded here. This tutorial has been adapted from PyTorch’s official docs; check out those docs for more about the implementation.

Plan of Attack

1. Data Pre-processing
2. Turning the Names into PyTorch Tensors
3. Building the RNN
4. Testing the RNN
5. Training the RNN
6. Plotting the Results
7. Evaluating the Results
8. Predicting on New Names
9. Conclusion

Data Pre-processing

As is the case with any machine learning task, we’ll kick off by loading and preparing our dataset. Upon downloading the dataset, we notice that there’s a folder called names inside the data folder. It contains text files with surnames in eighteen different languages.

In order to load all the files in one go, we’ll use a Python module known as glob. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. Results are returned in an arbitrary order. We’ll use it to load all the files in the folder that end with .txt.
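A minimal sketch of this step is given here. The helper name find_files and the path data/names/*.txt are assumptions based on the folder layout described above, so adjust the path to wherever the dataset was extracted. Because glob returns matches in arbitrary order, the sketch sorts them so that repeated runs list the files in the same order.

```python
import glob


def find_files(pattern):
    """Return every path matching a shell-style pattern, sorted for reproducibility."""
    return sorted(glob.glob(pattern))


# Assumed location of the extracted dataset; change it if your layout differs.
name_files = find_files("data/names/*.txt")
print(len(name_files), "files found")
```

If the count printed is zero, the pattern most likely points at the wrong folder.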

At this point the names contain Unicode characters, including accented letters. We have to convert them to plain ASCII, which strips the diacritics from the letters. For example, the French name Béringer will be converted to Beringer.
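One common way to do this conversion, and the approach we assume here, is to decompose each character with unicodedata.normalize and then drop the combining marks. The names ALL_LETTERS and unicode_to_ascii are our own choices for this sketch, as is the small set of punctuation characters we decide to keep.

```python
import string
import unicodedata

# Characters we choose to keep: ASCII letters plus a little punctuation
# that can appear in surnames (an assumption of this sketch).
ALL_LETTERS = string.ascii_letters + " .,;'"


def unicode_to_ascii(text):
    """Strip diacritics and drop any character outside ALL_LETTERS."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        char
        for char in decomposed
        if unicodedata.category(char) != "Mn"  # "Mn" = combining (nonspacing) mark
        and char in ALL_LETTERS
    )


print(unicode_to_ascii("Béringer"))  # prints "Beringer"
```

Note that characters with no ASCII base letter are dropped entirely rather than transliterated.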

In the next step, we create a dictionary with a list of names for each language.
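The sketch below shows one way to build that dictionary. It assumes each file is named after its language (for example French.txt), holds one name per line, and is UTF-8 encoded. The names read_names, load_names and names_by_language are hypothetical, and the ASCII helper is repeated so the block runs on its own.

```python
import glob
import os
import string
import unicodedata

ALL_LETTERS = string.ascii_letters + " .,;'"


def unicode_to_ascii(text):
    """Strip diacritics and drop any character outside ALL_LETTERS."""
    return "".join(
        char
        for char in unicodedata.normalize("NFD", text)
        if unicodedata.category(char) != "Mn" and char in ALL_LETTERS
    )


def read_names(path):
    """Read one name per line, skipping blank lines, and convert each to ASCII."""
    with open(path, encoding="utf-8") as handle:
        return [unicode_to_ascii(line.strip()) for line in handle if line.strip()]


def load_names(pattern):
    """Map each language (taken from the file name) to its list of names."""
    names_by_language = {}
    for path in glob.glob(pattern):
        language = os.path.splitext(os.path.basename(path))[0]
        names_by_language[language] = read_names(path)
    return names_by_language


# Assumed dataset location; the dictionary is empty if the path does not match.
names_by_language = load_names("data/names/*.txt")
print(sorted(names_by_language))
```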

We can then view the first fifteen names stored under the French key of this dictionary.
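Viewing those names comes down to a dictionary lookup followed by a list slice. In the sketch below, names_by_language is a tiny stand-in dictionary with placeholder entries, not the real dataset contents, and first_names is a hypothetical helper. With the real dictionary, the key is assumed to match the file name French.txt.

```python
def first_names(names_by_language, language, count=15):
    """Return up to `count` names for a language (fewer if the list is shorter)."""
    return names_by_language[language][:count]


# Stand-in data for illustration only; the real dictionary comes from the dataset files.
names_by_language = {
    "French": ["Abel", "Beringer", "Dupont"],
    "Italian": ["Rossi", "Bianchi"],
}

print(first_names(names_by_language, "French"))
```

Slicing never raises an error when the list has fewer than fifteen entries; it simply returns what is available.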