I recently posted Names that Switch Genders, which highlights 303 baby names that switched genders from 1879-2016. This project has been on my to-do list for several years, and now that I have finished it, I want to share the process and code I used to complete it.

For those interested in the names that switch genders dataset, here are some links to download:

Downloading and Loading the Data

The original dataset comes from the Social Security Administration’s Baby Names from Social Security Card Applications-National Level Data, which includes the baby names, genders, and total births for each year from 1879-2016. The downloaded data is a zip file that contains a readme file and 138 comma-separated .txt files. The readme file provides basic information about the dataset, while the .txt files are split by year, and each includes names, genders, and number of births per name.

The code below opens the .txt files, grabs the name, gender, and births, extracts the year from each .txt file name, and combines everything into a data frame.

    library(dplyr)     # %>%, rename, mutate
    library(stringr)   # str_extract
    library(lubridate) # ymd

    # Read one yearly file and add a year column taken from the file name
    readCsvAddYear <- function(filename) {
      read.csv(filename, header = FALSE, stringsAsFactors = FALSE) %>%
        rename(name = V1, gender = V2, births = V3) %>%
        mutate(year = ymd(paste0(str_extract(basename(filename), "[0-9]+"), "-12-31")))
    }

    # Read every .txt file in the folder and stack them into one data frame
    files <- list.files(path = "Data/names", pattern = "\\.txt$", full.names = TRUE)
    ssnNames <- do.call(rbind, lapply(files, readCsvAddYear))
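To see how the year column is built, here is a minimal sketch for a single file name. The path `Data/names/yob1880.txt` is an assumption about how the yearly files are named, so adjust the pattern if yours differ.

```r
library(stringr)
library(lubridate)

# Hypothetical path; assumes the yearly files are named like yob1880.txt
filename <- "Data/names/yob1880.txt"

# Pull the first run of digits out of the file name
yearText <- str_extract(basename(filename), "[0-9]+")

# Pin the year to December 31 so it can be stored as a Date
yearDate <- ymd(paste0(yearText, "-12-31"))

print(yearDate)
```

Storing the year as a full date rather than an integer is a convenience for tools that expect a date field; the choice of December 31 is arbitrary.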

With everything loaded, here is a quick look at the data.

| Data | Value |
| --- | --- |
| total female births | 170,639,571 |
| total male births | 173,894,326 |
| total births | 344,533,897 |
| total records | 1,891,894 |
| total unique names | 96,174 |
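Summary figures like these can be computed with a couple of dplyr calls. The sketch below is one way to do it, run on a tiny made-up data frame (`toyNames`, a hypothetical stand-in for `ssnNames`) so it works without the SSA files.

```r
library(dplyr)

# Toy stand-in for ssnNames; the names and counts are made up
toyNames <- data.frame(
  name   = c("Avery", "Avery", "Mary", "John"),
  gender = c("F", "M", "F", "M"),
  births = c(30, 20, 100, 90),
  stringsAsFactors = FALSE
)

# Total births per gender
birthsByGender <- toyNames %>%
  group_by(gender) %>%
  summarise(births = sum(births))

# Overall totals: births, records (rows), and distinct names
totals <- toyNames %>%
  summarise(
    totalBirths  = sum(births),
    totalRecords = n(),
    uniqueNames  = n_distinct(name)
  )

print(birthsByGender)
print(totals)
```

Swapping `toyNames` for `ssnNames` should give the same kind of breakdown for the full dataset.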

Prepare Data

With the data loaded, here is the process I used to find gender switching names:

1. Spread each row into name, year, and female/male births.
2. Get the percentage of female and male births for each name/year pair.
3. For each name, find the maximum female percentage and the maximum male percentage across all years.
4. Keep only those names where both maximums are greater than or equal to .5, meaning the name was at least half female in some year and at least half male in some year.
5. Last, limit the entire dataset to the switched names.

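    library(tidyr) # spread

    # One row per name/year with female and male births and their shares
    spreadNames <- ssnNames %>%
      spread(gender, births, fill = 0) %>%
      rename(female = F, male = M) %>%
      mutate(`female perc` = female / (female + male),
             `male perc` = male / (female + male))

    # Names that were at least half female in some year and at least half male in some year
    switchNames <- spreadNames %>%
      group_by(name) %>%
      summarise(`female max` = max(`female perc`), `male max` = max(`male perc`)) %>%
      filter(`female max` >= .5 & `male max` >= .5)

    # Limit the full dataset to the switched names
    genderSwitchNames <- ssnNames %>% filter(name %in% switchNames$name)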
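To make the filter concrete, here is a small sketch that runs the same steps on a made-up data frame. The counts are invented for illustration: "Leslie" is set up to move from mostly male to mostly female, while "John" stays mostly male throughout.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for ssnNames; the counts are made up for illustration
toyNames <- data.frame(
  name   = c("Leslie", "Leslie", "Leslie", "Leslie", "John", "John", "John"),
  gender = c("M", "F", "M", "F", "M", "M", "F"),
  births = c(80, 20, 30, 70, 100, 95, 5),
  year   = c(1900, 1900, 1950, 1950, 1900, 1950, 1950),
  stringsAsFactors = FALSE
)

# One row per name/year with female and male shares
toySpread <- toyNames %>%
  spread(gender, births, fill = 0) %>%
  rename(female = F, male = M) %>%
  mutate(femalePerc = female / (female + male),
         malePerc   = male / (female + male))

# Keep names that were at least half female in some year
# and at least half male in some year
toySwitch <- toySpread %>%
  group_by(name) %>%
  summarise(femaleMax = max(femalePerc), maleMax = max(malePerc)) %>%
  filter(femaleMax >= .5 & maleMax >= .5)

print(toySwitch)
```

Because both comparisons use `>=`, a name that is split exactly 50/50 in a single year would also pass the filter, even if it never had a clear majority either way.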

Here is a breakdown of the switched names dataset. One of the most interesting things that pops out is how many more females there are. In part, this can be explained by a few names that became popular female names. For example, the names Allison, Ashley, and Madison all started out as male names, but once females were given these names, they quickly outnumbered the males.

| Data | Value |
| --- | --- |
| total female births | 20,026,633 |
| total male births | 8,524,544 |
| total births | 28,551,177 |
| total unique names | 5,298 |

Getting back to the totals, 5,298 unique names is still a fairly large dataset. The histogram shows that most of the names occur only twice: once in a year with more females and once in a year with more males. This makes sense because the SSA only publishes a name when there are at least 5 female or male births in a given year. Some names are so rare that they were only published twice. While these names do switch genders, they don’t really show a trend.

The heatmap provides another insight. Instead of counting names, it looks at the total births per name and the number of years in which a name occurred. Like the histogram, it shows that many names occur in only a few years and that most names have a low number of births.
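The numbers behind charts like these come down to a per-name summary. The sketch below is one possible way to build it, not necessarily how the original charts were made; `toySwitched` is a made-up stand-in for `genderSwitchNames`, and the names and counts are invented.

```r
library(dplyr)

# Toy stand-in for genderSwitchNames; names and counts are made up
toySwitched <- data.frame(
  name   = c("Aaa", "Aaa", "Bbb", "Bbb", "Bbb", "Bbb", "Bbb"),
  gender = c("F", "M", "F", "M", "F", "M", "F"),
  births = c(5, 6, 40, 60, 70, 30, 50),
  year   = c(1990, 1995, 1950, 1950, 1980, 1980, 2000),
  stringsAsFactors = FALSE
)

# Per name: number of distinct years it appears in and its total births
nameSummary <- toySwitched %>%
  group_by(name) %>%
  summarise(years = n_distinct(year), births = sum(births))

# Histogram input: how many names appear in each number of years
yearsCount <- nameSummary %>%
  count(years, name = "names")

print(nameSummary)
print(yearsCount)
```

Here "Aaa" plays the role of a rare name that shows up in only two years, once as female and once as male, which is the pattern that dominates the full list.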





Narrowing the List to a Manageable Level

With these two insights, I took an easy approach and limited the switch genders dataset to only those names that had at least 10,000 total births. This limits the total number of names that switch genders to 303, which makes it possible to review all the names in one sitting.

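    # Names with at least 10,000 total births across all years and both genders
    keepers <- genderSwitchNames %>%
      group_by(name) %>%
      summarize(births = sum(births)) %>%
      filter(births >= 10000)

    # Limit the switched names dataset to the keepers
    finalNames <- genderSwitchNames %>% filter(name %in% keepers$name)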

With the final names dataset, the last step was putting everything into Tableau and publishing it.