As a math person, I try to quantify everything in my daily life, so when I see a data set with lots of qualitative variables, my mind naturally tries to quantify them. Luckily, there’s a nice, neat function that can help us do that!

As someone who is new to the data science world, I found the discovery of pandas pretty life-changing. Between pandas and scikit-learn, I think anyone could conquer the world (or at least the data science world). Pandas has a function that can turn a categorical variable into a series of zeros and ones, which makes it a lot easier to quantify and compare.

I started by loading my data, which I got from “http://data.princeton.edu/wws509/datasets/#salary”. This is a very small data set consisting of salary data for 52 professors at a small college, categorized by gender, professor rank, highest degree, and years of service, paired with salary. I used this data set for this example because it’s short and has a few categorical variables.

The column names are abbreviated: sx = sex, rk = rank, yr = years in current rank, dg = degree, yd = years since earning highest degree, sl = salary.
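As a rough sketch of the loading step, the snippet below reads a few rows laid out with the same column names. The rows are made up for illustration (they are not values from the real data set), and the whitespace-delimited layout with a header row is an assumption about the file format.

```python
import io

import pandas as pd

# Made-up rows that only mimic the column layout (sx, rk, yr, dg, yd, sl).
# These are NOT values from the actual salary data set.
raw = """sx rk yr dg yd sl
male full 20 doctorate 30 30000
female assistant 3 masters 5 17000
male associate 10 doctorate 14 24000
female full 12 doctorate 20 29000
"""

# Assumption: the file is whitespace-delimited with a header row.
# For a real file, pass its path or URL instead of the StringIO object.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df)
```

The text columns (sx, rk, dg) come through as strings and the rest as integers, which is why the categorical ones need converting before they can be compared numerically.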

Since I loaded the data in using pandas, I used the pandas function pd.get_dummies on my first categorical variable, sex. This variable has only two answer choices, male and female (not the most progressive data set, but it is from 1985). pd.get_dummies creates a new dataframe consisting of zeros and ones, with one column per category: each row has a one in the column that matches that professor’s sex and a zero in the other.
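Here is a minimal sketch of that step on a tiny made-up dataframe (the values are illustrative, not from the real data set). The names df and sex_dummies are my own choices. Recent pandas versions return True/False columns by default, so the sketch passes dtype=int to get literal zeros and ones.

```python
import pandas as pd

# Tiny illustrative dataframe; values are made up, not from the real data.
df = pd.DataFrame({
    "sx": ["male", "female", "female", "male"],
    "sl": [30000, 25000, 28000, 22000],
})

# One new column per category in sx. dtype=int forces 0/1 output,
# since newer pandas versions default to boolean columns.
sex_dummies = pd.get_dummies(df["sx"], dtype=int)
print(sex_dummies)
```

The result has a female column and a male column, and every row has exactly one 1 across the two.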

Since we’ve created a whole new dataframe, in order to compare it to our original dataframe, we’re going to need to either merge or concatenate the two. In creating dummy variables, we essentially created new columns for our original dataset. The old and new datasets don’t have any columns in common, so it makes the most sense to concatenate them (although I’m going to go through both ways).

I chose to put my dummy variables on the right side of my dataframe, so when I call pd.concat (the concatenation function) I pass my original dataframe first and then the dummy dataframe I created. Since the dummies are new columns rather than new rows, I concatenate on axis=1.
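A short sketch of the concatenation, again on made-up values and with hypothetical variable names (df, sex_dummies, combined):

```python
import pandas as pd

# Illustrative data only; not the real salary values.
df = pd.DataFrame({
    "sx": ["male", "female", "female", "male"],
    "sl": [30000, 25000, 28000, 22000],
})
sex_dummies = pd.get_dummies(df["sx"], dtype=int)

# Original dataframe first, dummies second, so the dummy columns land on the right.
# axis=1 places the frames side by side, lining rows up by index.
combined = pd.concat([df, sex_dummies], axis=1)
print(combined)
```

Because both dataframes share the same row index, each professor’s dummy values line up with that professor’s original row.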

Merging these dataframes is slightly more difficult as there are no overlapping columns. However, it can be done!

To merge on the index (the row labels shown at the far left of the dataframe), all we have to do is set left_index=True and right_index=True!
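The merge version might look like the sketch below. As before, the data is made up and the variable names are hypothetical.

```python
import pandas as pd

# Illustrative data only; not the real salary values.
df = pd.DataFrame({
    "sx": ["male", "female", "female", "male"],
    "sl": [30000, 25000, 28000, 22000],
})
sex_dummies = pd.get_dummies(df["sx"], dtype=int)

# With no shared columns to join on, match rows by their index on both sides.
merged = pd.merge(df, sex_dummies, left_index=True, right_index=True)
print(merged)
```

On this example the output has the same columns and values as the pd.concat approach, which is why concatenation is the simpler choice when the two dataframes already share an index.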

With just two lines of code, we can now compare our sex variable to our other numerical columns!