Introduction

I thought an easy project for learning machine learning would be guessing the gender of a name from characteristics of the name. After playing around with different features built by encoding the characters of the name, I discovered that you only need THREE features to reach 80% accuracy, which is pretty impressive. I am by no means an expert at machine learning, so if you see any errors, feel free to point them out.

Example:

Name      Actual  Classified
shea      F       F
lucero    F       M
damiyah   F       F
nitya     F       F
sloan     M       M
porter    F       M
jalaya    F       F
aubry     F       F
mamie     F       F
jair      M       M

(Source: IPython Notebook)

Dataset

The names come from the SSA's (Social Security Administration) baby names dataset for the year 2014.

https://www.ssa.gov/oact/babynames/names.zip

Methodology

I kept only the (name, gender) entries from the dataset with a count of at least 20, since I found that the least-used entries are low quality (for example, there are a few guys named Amy born in 2014).
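To make the filtering step concrete, here is a minimal sketch in plain Python. It assumes the name,gender,count line layout that the loading code reads; the rows in SAMPLE are made up for illustration and are not real SSA counts, and load_rows is a hypothetical helper rather than part of the original notebook.

```python
import csv
import io

# Made-up rows in the "name,gender,count" layout; not real SSA counts.
SAMPLE = """Amy,F,2000
Amy,M,7
Sloan,M,150
Nitya,F,45
Zzyx,F,5
"""

MIN_COUNT = 20  # threshold described above


def load_rows(handle, min_count=MIN_COUNT):
    """Return (name, gender, count) tuples whose count is at least min_count."""
    rows = []
    for name, gender, count in csv.reader(handle):
        count = int(count)
        if count >= min_count:
            rows.append((name.lower(), gender, count))
    return rows


kept = load_rows(io.StringIO(SAMPLE))
for row in kept:
    print(row)
```

Note that the threshold applies to each (name, gender) row separately, so a name can survive for one gender and be dropped for the other.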

Loading

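Code for loading data from the dataset into numpy arrays ready for machine learning:

    import numpy as np
    # sklearn.cross_validation was removed in newer scikit-learn releases;
    # the same helpers live in sklearn.model_selection.
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import svm

    # Each row is name,gender,count; names are lowercased while loading.
    my_data = np.genfromtxt('names/yob2014.txt',
                            delimiter=',',
                            dtype=[('name', 'S50'), ('gender', 'S1'), ('count', 'i4')],
                            converters={0: lambda s: s.lower()})

    # Keep only rows with a count of at least 20.
    my_data = my_data[my_data['count'] >= 20]

    # name_count must already be defined; it maps one name to an array of features.
    name_map = np.vectorize(name_count, otypes=[np.ndarray])
    Xlist = name_map(my_data['name'])
    X = np.array(Xlist.tolist())
    y = my_data['gender']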

X is an np.array of shape N × M, where N is the number of names and M is the number of features.

y holds the label for each name: M or F.

name_map is a function that converts a name (string) to an array of features.
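As a rough illustration of how a per-name feature function turns into the N × M matrix, the sketch below uses a hypothetical toy_features function (name length and a last-letter-is-a-vowel flag) as a stand-in for the real feature function. It passes object as the output type so that each call can return a whole list of features.

```python
import numpy as np


def toy_features(name):
    """Hypothetical stand-in feature function: [length, ends-in-vowel flag]."""
    return [len(name), int(name[-1] in "aeiou")]


# otypes=[object] tells np.vectorize that each call returns a Python object
# (here a list), so the result is a 1-D object array with one list per name.
name_map = np.vectorize(toy_features, otypes=[object])

names = np.array(["shea", "sloan", "jalaya"])
Xlist = name_map(names)        # shape (3,), dtype=object
X = np.array(Xlist.tolist())   # shape (3, 2), numeric
print(X.shape)
```

Going through tolist() is what turns the object array of per-name lists into a regular 2-D numeric array that a classifier can consume.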

Fitting and Validation

We will repeatedly split the data at random into training and testing sets to validate the model, and use a random forest (RandomForestClassifier) for classification since it generally performs well at classifying data.

    # Repeat a random 67/33 train/test split five times and print the test accuracy.
    for _ in range(5):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
        clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
        clf.fit(Xtr, ytr)
        print(np.mean(clf.predict(Xte) == yte))

By default, RandomForestClassifier sets max_features (the number of features to consider when looking for the best split) to sqrt(n_features), which is the recommended setting for classification problems ( http://scikit-learn.org/stable/modules/ensemble.html#parameters ). We will use n_estimators (the number of trees) of 100 and min_samples_split (the minimum number of samples required to split an internal node) of 2, both of which we will tune once we have determined a good feature set.
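For reference, here is a small sketch of the same classifier settings evaluated with cross_val_score, which is imported above but not used in the loop. The data is synthetic (random numbers with a label derived from the first feature), not the names dataset, and the random_state values are only there to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the names dataset): 200 samples, 3 features,
# with the label determined by the first feature.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = np.where(X[:, 0] > 0.5, "M", "F")

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    min_samples_split=2,   # minimum samples needed to split a node
    max_features="sqrt",   # features considered at each split
    random_state=0,
)

# 5-fold cross-validation: each sample is used for testing exactly once.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Compared with repeating a random split, k-fold cross-validation guarantees that every sample ends up in a test fold once, which can make the averaged score a little more stable on small datasets.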

Picking Features

Character Frequency

My first attempt at features was the frequency of each character: