Introduction

Dataset

A few days ago I got to wondering whether there is an app for generating names (no, not a startup idea). You know, something along the lines of “generate 10 Indian names”. I didn’t know the answer, but it got me thinking about names. Before the flame of this random musing could die down, I found a dataset of names, thanks to mbejda. The dataset consists of Hispanic, Indian, Caucasian, and African American names. I had the dataset, so something had to be done.

RNNs

Now perhaps you have read Karpathy’s fantastic blog on RNNs and what they do.

Here is one of the quotes from the blog:

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”

Yes, it’s pretty mind-blowing. The blog goes into the details of how it’s done. Here’s a short summary of what RNNs do: RNNs can take a bunch of text, say T, and learn to generate new text that would seem to be taken from T. For example, we can train an RNN on all the works of William Shakespeare, and the RNN can, in turn, generate new text that would seem to be written by Shakespeare. The blog I’ve linked to contains this and many other interesting examples. The way RNNs do this is by learning to “predict the next character” in the sequence, given the characters seen so far.

Putting it together, I had:

i) A model that can learn to generate new samples from a piece of text.
ii) A corpus of names.

So it will not come as a surprise that we can use RNNs to generate new names. Yes, we can, and yes, I did. I got an “Indian Name Generator” generating Deepaks and Nehas, a “Caucasian Name Generator” generating Michaels and Jennifers, and so on. It was something, but as I said, not very interesting.

Cross Seeding

Now, I didn’t feel like throwing all these RNNs away, so I wondered what would happen if I fed, say, the “Indian Name Generator” the first few characters of a Caucasian name and let it generate the rest. Would it try to create a name that sounds Indian? So I ran a bunch of these experiments, and I present the somewhat more interesting results in this post. I also used seeds from unconventional names, like those of Pokémon and wrestlers. It was fun to see all these different RNNs take a stab at creating a name that sounds like it belongs to their domain:

| name | seed | african_american | caucasian | hispanic | indian | all_races |
|---|---|---|---|---|---|---|
| undertaker | underta | undertall nix# | undertan starlir# | underta romero# | undertala# | undertayshawn king# |
| aman madaan | aman mad | aman madadenis# | aman madich# | aman madro l gonzalez# | aman madhkaran# | aman madha# |
| jose luis | jose l | jose l graham# | jose l ramirez# | jose l morales# | jose lal sharma# | jose l rodriguez# |
| hideyoshi | hideyo | hideyon u bennett# | hideyo g morio# | hideyordo rodriguez# | hideyohar sharma# | hideyon d brown# |
| dan fineman | dan f | dan f briggs# | dan f witharr# | dan flekrez# | dan farjat saini# | dan francersiii# |
| hulk hogan | hulk ho | hulk hornes# | hulk howstie# | hulk hoelles.maldonado# | hulk holoo chand singh# | hulk holu# |





This post has six sections, and the introduction is now over. We’ll next take a quick look at the data, followed by a discussion of how the input is converted to a representation that can be used for training these name generators. We’ll then look at some details of the model, how the predictions are made and the name styles transferred, and finally present the results.

The code, with cleaned + processed dataset, and notes on how to run the training and scoring processes, is located here.

Looking at the Data

Although there is no limit to the number of different analyses we can run, we’ll present only two here in the interest of space (and the attention span): i) the most popular names and ii) the name length distributions. Looking at the most popular names will give us a feel for the dataset, and the name length distributions were added to add more plots and make this post look fancier (and they’re used somewhere down the line, too).

Top 5 Most Popular Names

The following tables list the top 5 most popular names for each of the races. Please note that the first and the last names are listed separately (e.g., Latoya Williams is not the most popular African-American female name; Latoya is the most popular first name for African-American females, and Williams is the most popular last name for African-American females).

African American

| Female First Names | Female Last Names | Male First Names | Male Last Names |
|---|---|---|---|
| latoya | williams | michael | johnson |
| ashley | johnson | james | brown |
| patricia | brown | anthony | jones |
| angela | smith | willie | jackson |
| mary | jackson | robert | davis |

Caucasian

| Female First Names | Female Last Names | Male First Names | Male Last Names |
|---|---|---|---|
| jennifer | smith | michael | johnson |
| amanda | brown | james | rodriguez |
| kimberly | williams | robert | davis |
| jessica | miller | david | jones |
| ashley | johnson | john | brown |

Hispanic

| Female First Names | Female Last Names | Male First Names | Male Last Names |
|---|---|---|---|
| maria | rodriguez | jose | rodriguez |
| melissa | gonzalez | juan | garcia |
| jennifer | rivera | luis | martinez |
| gloria | perez | carlos | rivera |
| elizabeth | garcia | jorge | hernandez |

Indian

| Female First Names | Female Last Names | Male First Names | Male Last Names |
|---|---|---|---|
| smt | devi | deepak | kumar |
| pooja | pooja | rahul | singh |
| smt. | kumari | amit | sharma |
| jyoti | jyoti | ram | lal |
| kumari | bai | sanjay | ram |

Name Length Distributions

The name length distributions for each of the races are plotted next. It seems like short names (perhaps without a surname) are popular among Indians, giving rise to the minor mode in the distribution. Hispanic names tend to be longer, as indicated by the fat tail following the mean. 15-ish seems to be the most popular name length across the races (here’s one thing you can take away from the post).
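For readers who want to reproduce the numbers behind such a plot, here is a minimal sketch of how a length distribution can be computed. The function name length_distribution and the tiny sample list are assumptions made for illustration; they are not taken from the post’s repository.

```python
from collections import Counter


def length_distribution(names):
    """Map each name length (in characters) to the number of names with that length."""
    return Counter(len(name) for name in names)


# Hypothetical sample; the real dataset is far larger.
sample = ["pooja", "deepak kumar", "maria rodriguez", "james rodriguez"]
dist = length_distribution(sample)

# A crude text histogram, one row per observed length.
for length, count in sorted(dist.items()):
    print(length, "#" * count)

print("most common length:", dist.most_common(1)[0][0])
```

Running the same function once per race gives the per-race distributions that the plots summarize.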

Input Representation

In this section, we will spend some time looking at how we take a bunch of these names and convert them into a form that can be fed to an RNN. We will get to that representation in three steps: encoding, standardization, and embeddings.

a) Encoding

We convert each character in a (string) name to a number using the following mapping:

| Character | Encoding |
|---|---|
| a-z | 0-25 |
| " " (space) | 26 |
| "#" (end of name) | 27 |
| "." (invalid character) | 28 |

All the names are converted to lowercase English letters, with a space separating the different components of a name. Every name ends with a special end-of-name character (“#”). Every other character is mapped to “.”, the invalid-character substitute. For example, “joe” would be converted to [9 14 4 27].
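The mapping above can be sketched in a few lines of Python. This is only an illustration of the encoding rules as described, not the repository’s CharCodec; the names CHAR_TO_CLASS and encode are assumptions made for the example.

```python
import string

NAME_END = "#"
INVALID_CHAR = "."

# a-z -> 0-25, space -> 26, "#" -> 27, "." -> 28
CHAR_TO_CLASS = {c: i for i, c in enumerate(string.ascii_lowercase)}
CHAR_TO_CLASS[" "] = 26
CHAR_TO_CLASS[NAME_END] = 27
CHAR_TO_CLASS[INVALID_CHAR] = 28


def encode(name):
    """Lowercase the name, append the end marker, and map each character to its class."""
    chars = name.lower() + NAME_END
    # Anything outside the vocabulary falls back to the invalid-character class.
    return [CHAR_TO_CLASS.get(c, CHAR_TO_CLASS[INVALID_CHAR]) for c in chars]


print(encode("joe"))  # [9, 14, 4, 27]
print(encode("a-b"))  # [0, 28, 1, 27]
```

The second call shows the fallback: the hyphen is not in the vocabulary, so it becomes class 28.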

b) Standardization

Name lengths are anything but a constant, as seen from the length distributions. However, we are using “unrolled” RNNs, and thus we need to fix on a maximum name length. Names longer than the maximum name length will be truncated, and names shorter than the maximum name length will be padded with “.” (invalid character). As discussed, we assume that every name ends with a “#,” the name end character. All of this happens in the following piece of code:

```python
import numpy as np


def encode_and_standardize(name):
    # Add the end-of-name symbol to every name, then encode it to integers.
    name = name + CharCodec.NAME_END
    name = CharCodec.encode(name)
    name_len = len(name)  # length including the end-of-name symbol
    if name_len >= CharCodec.max_name_length:
        # Truncate, but always re-attach the end-of-name symbol.
        truncated_name = name[:(CharCodec.max_name_length - 1)]
        truncated_name.append(CharCodec.char_to_class[CharCodec.NAME_END])
        return np.array(truncated_name, dtype=np.int32)
    else:
        # Pad short names with the invalid-character class.
        padded_name = np.empty(CharCodec.max_name_length, dtype=np.int32)
        padded_name.fill(CharCodec.INVALID_CHAR_CLASS)
        padded_name[:name_len] = name
        return padded_name
```

Note that we retain the name end character (#) even after the truncation.

So how do we arrive at the max_name_length? A simple way to do that is to fix on a large number, like 100. However, that would mean that our network will be wider than we perhaps want (most of the names will be shorter than 100 characters). This will lead to lots of wasted computation and slower training times (give it a try!). Or, we could plot a distribution of the name lengths and pick something simple. We’ve already done that, and as you can see, it seems like 25 will cover most of the cases, and we add one slot to accommodate the end-of-name marker, “#”. Thus, all the names are standardized to length 26. At this point, the string name “joe” has become an array of 26 numbers, [9 14 4 27 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28].
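One way to make “covers most of the cases” concrete is to measure what fraction of names fit under a candidate cut-off. The sketch below is an illustration under assumptions: the coverage helper and the sample names are hypothetical, and the post itself picked 25 by eyeballing the plots.

```python
import numpy as np


def coverage(names, max_len):
    """Fraction of names that fit in max_len characters without truncation."""
    lengths = np.array([len(name) for name in names])
    return float(np.mean(lengths <= max_len))


# Hypothetical sample; the real dataset is far larger.
sample = ["joe", "amy lee", "deepak kumar", "maria rodriguez", "jose luis garcia hernandez"]

# Compare a few candidate cut-offs; the end-of-name marker needs one extra slot on top.
for candidate in (10, 15, 25, 100):
    print(candidate, coverage(sample, candidate))
```

A cut-off is a reasonable choice once raising it further barely improves the coverage, since every extra position widens the unrolled network for all names.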

c) Embeddings

So far, we have standardized each name to a fixed length, added a character to mark the end of the name, and encoded the name from an array of chars to an array of integers.

If you don’t know what embeddings are at all, I recommend checking out this or this link.

The tl;dr of the technique is that each of the characters is mapped to an array of numbers. The array is called an embedding, and the length of the array is the dimensionality of the embedding. For example, if we choose to use 5-dimensional embeddings, it just means that every character is mapped to a 5-dimensional real vector. We can start with pre-trained embeddings, or learn them as part of the training process, which is what our model does. In our setting, that means that we will hopefully learn embeddings that make it easier for us to predict the next character in the name. With 5-dimensional embeddings, “joe” will now be a matrix of 26 rows and five columns.

The following three lines of code are all that’s needed to add embeddings. embeddings is a matrix that has one row for each character in our vocabulary. Concretely, if we used 5-dimensional embeddings, the embeddings matrix would have dimensions 29 x 5; each character would have a corresponding array of length 5. tf.nn.embedding_lookup takes in the input, which is batch_size x max_name_length, and maps (looks up) each character in the standardized name to an embedding, to yield an input with dimensions batch_size x max_name_length x n_embeddings.

```python
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

# max_name_length, vocab_size and n_embeddings are defined elsewhere in the code.
names = tf.placeholder(tf.int32, shape=(None, max_name_length), name="input")
embeddings = tf.Variable(tf.random_uniform([vocab_size, n_embeddings], -1.0, 1.0))
embedded_names = tf.nn.embedding_lookup(embeddings, names)  # (?, max_name_length, n_embeddings)
```
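To see the shapes without building a TensorFlow graph, the same lookup can be mimicked in NumPy, where indexing a matrix with an integer array plays the role of the embedding lookup. This is a sketch for intuition only; the random matrix stands in for the learned embeddings, and the sizes are the ones used as examples in the text (29 characters, 5 dimensions, length 26).

```python
import numpy as np

vocab_size, n_embeddings, max_name_length = 29, 5, 26

rng = np.random.default_rng(0)
# One row per character class, initialised uniformly in [-1, 1).
embeddings = rng.uniform(-1.0, 1.0, size=(vocab_size, n_embeddings))

# A batch with one standardized name: "joe#" followed by padding (class 28).
joe = np.full(max_name_length, 28, dtype=np.int32)
joe[:4] = [9, 14, 4, 27]
names = joe.reshape(1, max_name_length)

# Integer-array indexing picks one embedding row per character.
embedded_names = embeddings[names]
print(embedded_names.shape)  # (1, 26, 5)
```

Every padded position maps to the same row (class 28), which is why the padding looks identical at each step after the end of the name.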

TensorBoard can help visualize embeddings using PCA and t-SNE. The t-SNE visualization of the embedding matrix from the Indian name generator is as follows. As you can see, the vowels are all close to each other, which hints that the learned embeddings capture some linguistic properties of the names pertaining to the race, and are perhaps useful in generating new ones.

Embeddings learned by the Indian Name Generator visualized using Tensorboard

Model

The particular form of RNN that we use for this exercise is an LSTM. This book has a pretty good explanation of LSTMs, and this neat blog post is another standard reference for the topic. A high-level overview of the model follows. Each character in the normalized name is converted to the corresponding embedding vector, and the entire input name becomes a matrix of name embeddings. The name embeddings are then fed to the first LSTM in the stack. The output from this first LSTM is fed to a second LSTM. The second LSTM is connected to a dense layer, which emits a logits vector of length 29 at each step. The logits are used to calculate the loss, and an argmax over them picks the predicted character during generation. A diagram of the computation graph generated by TensorBoard is as follows:

The overall model setup generated by TensorBoard. The input embeddings are fed to stacked LSTMs, which in turn feed a dense layer.
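To illustrate what happens to the logits coming out of the dense layer, here is a small NumPy sketch. It assumes the usual softmax cross-entropy loss for next-character prediction; the post does not spell out the exact loss, so treat that choice, the function name, and the made-up logits as assumptions.

```python
import numpy as np


def softmax_cross_entropy(logits, targets):
    """Mean cross-entropy between per-step logits (steps x classes) and integer targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 29))  # made-up logits for 3 time steps, 29 classes
targets = np.array([12, 24, 27])   # "m", "y", "#" as the expected next characters

loss = softmax_cross_entropy(logits, targets)  # drives training
predicted = logits.argmax(axis=1)              # picks a character during generation
print(loss, predicted)
```

The argmax itself is not differentiable, which is why training works on the logits while generation can simply take the highest-scoring class.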

An instance of the model with made up numbers is shown below. It’s a replica of this diagram from this blog I’ve already linked to.

A sample instance of the model. The input name is “Amy#” (# being the end of the name character). At each step, the network is expected to predict the next character. Thus, at step 1, the expected output is “M”, the 2nd character. At step 2, the expected output becomes the 3rd character, Y. As explained in the encoding section, the characters after the “#”, “.”, are invalid characters added for padding.
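The “predict the next character” setup in the caption amounts to pairing each character with the one that follows it. A minimal sketch, with next_char_pairs being a hypothetical helper named for this example:

```python
def next_char_pairs(name, end="#"):
    """Return (input character, expected next character) pairs for a name."""
    chars = list(name.lower()) + [end]
    return list(zip(chars[:-1], chars[1:]))


print(next_char_pairs("amy"))  # [('a', 'm'), ('m', 'y'), ('y', '#')]
```

In other words, the expected output is just the input shifted left by one position.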

Transferring Name Styles

Since we have discussed a lot, let’s quickly recap before moving ahead. Some of the following may seem to be a repeat from the introduction because it sort of is.

Dataset: We have a dataset of names from different races.

Generator: We have discussed a model that can be trained to predict the next character given a sequence of characters. Let’s call this model the generator.

We train one generator per dataset. Thus, we have a model that has learned to predict the next few characters in an Indian name given the first few, and so on. We then seed each of these generators with a few characters from a name, say “Der” from “Derek,” and compare the results.

The prediction process is illustrated in the figure below, followed by the code.

Code:

```python
res = seed
initial_seed_offset = len(seed)
for i in range(CharCodec.max_name_length - len(seed)):
    # Re-encode everything generated so far and run it through the model.
    feats = CharCodec.encode_and_standardize(res).reshape(1, CharCodec.max_name_length)
    prediction = sess.run(model.prediction, feed_dict={names: feats})
    # Append the character predicted at the last filled position.
    res += " ".join(CharCodec.decode(prediction[0])[i + initial_seed_offset - 1])
    # Stop as soon as the model emits the end-of-name character.
    if res[-1] == CharCodec.NAME_END:
        break
```
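The same seed-and-extend loop can be tried without TensorFlow by swapping the LSTM for a much simpler stand-in. The sketch below uses a toy bigram table (which character most often follows the current one) instead of the trained generators, so the names it produces are not comparable to the results in this post; train_bigram, generate, and the tiny name lists are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

NAME_END = "#"
MAX_NAME_LENGTH = 26


def train_bigram(names):
    """Count, for every character, which characters follow it in the training names."""
    follows = defaultdict(Counter)
    for name in names:
        chars = name + NAME_END
        for cur, nxt in zip(chars[:-1], chars[1:]):
            follows[cur][nxt] += 1
    return follows


def generate(follows, seed):
    """Greedily extend a non-empty seed until the end marker or the length cap."""
    res = seed
    while len(res) < MAX_NAME_LENGTH:
        candidates = follows.get(res[-1])
        if not candidates:
            break  # the toy model has never seen this character
        res += candidates.most_common(1)[0][0]
        if res[-1] == NAME_END:
            break
    return res


# Hypothetical, tiny "datasets" standing in for two of the name corpora.
generators = {
    "indian": train_bigram(["deepak", "neha", "pooja"]),
    "caucasian": train_bigram(["michael", "jennifer", "jessica"]),
}
for race, model in generators.items():
    print(race, generate(model, "der"))
```

The structure mirrors the loop above: one generator per dataset, the same seed for all of them, and generation that stops at the end-of-name character or the maximum length.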

Results

The results are compiled in the following table. The first column is the name, the second the seed used as an initial input to the model. The subsequent columns list the names generated by the different generators using the given seed.

| name | seed | african_american | caucasian | hispanic | indian | all_races |
|---|---|---|---|---|---|---|
| zhang wei | zhan | zhankhea l stencor# | zhane a nelson# | zhanole s estrada# | zhanna sankar# | zhanson e martin# |

In the above example, the seed used is “zhan,” from the Chinese name “zhang wei.” All the networks take this tricky seed and generate a name that ends up looking like a name from the given model’s own dataset. For example, the African-American generator yields “zhankhea l stencor” (the “#” is the end-of-name marker), the Caucasian generator yields “zhane a nelson”, and so on.

Results from this test file are as follows. I hope the post was useful. Please feel free to contact me with any questions or comments at amn.madaan@gmail.com.

Thanks!