The Wall Street Journal published an article on May 31, 2017 titled “Though Outnumbered, Female CEOs Earn More Than Male Chiefs.” The Journal’s analysis showed that female CEOs repeatedly outearn their male counterparts. Last year, 21 female CEOs received a median compensation package of $13.8 million, compared with the $11.6 million median for 382 male CEOs.

I was intrigued and a bit skeptical. When I looked closer, I noticed they reported the median earnings. The median is more robust than the mean, but I was curious about where in the distribution of compensation the female CEOs sat and wanted to know if that made a difference. Since there are only 23 women CEOs at S&P 500 companies this year, it could very well be that these women simply happened to be at the right companies to make this statistic true.

For example, I took ten random samples of 23 CEOs each (4.6% of the population, matching the 23 female CEOs out of 500) from the data. The sample medians averaged $13.97 million USD with a standard deviation of $0.71 million USD, and the sample means averaged $18.0 million USD with a standard deviation of $1.48 million USD. From this example we can see that the median is more robust than the mean. However, there is still a margin of error wide enough that sampling alone could produce a difference in medians like the one in the WSJ result.
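The resampling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `compensation` array is a synthetic, right-skewed stand-in generated from a log-normal distribution, not the AFL-CIO data, and the helper name `sample_stats` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for 500 compensation values (millions USD).
# ASSUMPTION: a log-normal shape is used purely for illustration;
# it is not the AFL-CIO data set.
compensation = rng.lognormal(mean=2.6, sigma=0.5, size=500)


def sample_stats(values, sample_size=23, n_samples=10, rng=rng):
    """Draw n_samples random subsets (without replacement) and
    return the median and mean of each subset."""
    medians, means = [], []
    for _ in range(n_samples):
        sample = rng.choice(values, size=sample_size, replace=False)
        medians.append(np.median(sample))
        means.append(np.mean(sample))
    return np.array(medians), np.array(means)


medians, means = sample_stats(compensation)

# Average and sample-to-sample spread of each statistic.
print(f"median: {medians.mean():.2f} +/- {medians.std(ddof=1):.2f}")
print(f"mean:   {means.mean():.2f} +/- {means.std(ddof=1):.2f}")
```

With the real data, `compensation` would be replaced by the compensation column of the top 500 companies. Drawing more than ten samples would give a steadier estimate of the spread.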

If I could build a model for CEO compensation, would the model be able to accurately predict the compensation for the chief executive using the rank of the company as the input?

To test my hypothesis, I needed some data. I found some data from the AFL-CIO page on CEO compensation and scraped it using ImportIO to save time. The AFL-CIO data is a bit of a hodgepodge, containing compensation numbers from 2015, 2016 and a couple from 2017, so I wasn’t able to directly replicate the WSJ analysis, but I figured it would be a decent approximation. All of the code used (and the data!) is available from my GitHub repo.

A Quick Look at the Data

A representation of the compensations of our top 500 CEOs. We can see there’s a rapid drop off after Sundar Pichai leading to a much lower compensation package for Thomas L Carter. This representation is every 16th CEO in the data, which gives us 32 CEOs. (Interesting side note: Rex Tillerson happened to make it into this figure.)

At first glance, I was surprised to see that the data set contained 3031 entries even though I was looking at the S&P 500 data. After limiting the data to the top 500 companies, it was time to take a peek at the data. Looking at the distribution of compensation on the left, the first thing I thought of was the Arrhenius equation from chemistry or physics (probably because of my background in biophysical chemistry).

The Arrhenius equation is an exponential function that allows you to predict the rate constant of a reaction at a given temperature (top equation). This equation can be linearized, making it a great candidate for ordinary least squares (OLS) regression (bottom equation).
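For reference, here are the two forms in standard textbook notation, where k is the rate constant, A the pre-exponential factor, E_a the activation energy, R the gas constant, and T the absolute temperature. Taking the natural log of the first line gives the second, which is a straight line in 1/T with slope -E_a/R and intercept ln A:

```latex
k = A \, e^{-E_a / (R T)}

\ln k = \ln A - \frac{E_a}{R} \cdot \frac{1}{T}
```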

Preparing the Data

Before considering the model further, I needed to do some data cleaning. To separate the CEOs by gender, I first had to find out who the female CEOs were. I used Wikipedia and Catalyst to obtain a list of names for the female CEOs. I combined the two lists because they spell the names differently, e.g., Mary Barra vs. Mary T. Barra or Patti Poppe vs. Patricia K. Poppe, and using both maximizes the chance of a match when masking the pandas DataFrame to find the rows for our female CEOs. This method only gave me 16 of the 23 CEOs I was looking for. To figure out where I was missing data, I used Fortune to find which companies had female CEOs, which allowed me to find the data for 4 more companies, for a total of 20. Ultimately the data seemed to be missing for Guardian Life Ins. Co. of America, Graybar Electric, and CH2M Hill.
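A minimal sketch of that masking step might look like the following. The column names, the tiny DataFrame, the placeholder compensation values, and the split of name variants between the two lists are all assumptions made up for illustration; the real table and lists are much longer.

```python
import pandas as pd

# Hypothetical miniature of the scraped table.
# ASSUMPTION: column names and compensation values are placeholders.
ceos = pd.DataFrame({
    "ceo": ["Mary T. Barra", "Sundar Pichai",
            "Patricia K. Poppe", "Indra K. Nooyi"],
    "compensation": [1.0, 2.0, 3.0, 4.0],
})

# Two name lists with different spellings of the same people
# (illustrative; which source uses which spelling is assumed).
wikipedia_names = {"Mary Barra", "Patti Poppe", "Indra Nooyi"}
catalyst_names = {"Mary T. Barra", "Patricia K. Poppe", "Indra K. Nooyi"}

# The union keeps every spelling, so either variant can match.
female_names = wikipedia_names | catalyst_names

# Boolean mask: True where the CEO name appears in the combined set.
is_female = ceos["ceo"].isin(female_names)
female_ceos = ceos[is_female]
male_ceos = ceos[~is_female]

print(female_ceos)
```

Exact string matching like this misses any spelling that appears in neither list, which is consistent with having to fall back to a company-based lookup for the remaining CEOs.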

The CEOs included in the female group in our data are: Mark V. Hurd, V.M. Rometty, Indra K. Nooyi, Mary T. Barra, Phebe N. Novakovic, Marillyn A. Hewson, Irene Rosenfeld, Ursula M. Burns, Lynn J. Good, Susan M. Cameron, Vicki Hollub, Denise M. Morrison, Barbara Rentler, A. Greig Woodring, Debra L. Reed, Ilene S. Gordon, Mary A. Laschinger, Sheri McCoy, Kimberly S. Lubel, and John P. Tague. Notably, some of these CEOs are men. I make the assumption that the male CEOs’ compensation is close enough that we’re able to make a direct substitution.

Describing the Data Set

How do the compensation package values compare? Using the pandas describe() method, we see that the male CEO data contains 480 members with a mean compensation of $16.46 million USD, a median compensation of $13.47 million USD, and minimum and maximum values of $9.75 million USD and $100.63 million USD, respectively. The female CEO data contains 20 members with a mean compensation of $16.05 million USD, a median compensation of $13.22 million USD, and minimum and maximum values of $9.46 million USD and $41.12 million USD, respectively.
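As a rough sketch, a summary like this can be produced with a grouped describe(). The DataFrame below is a made-up toy example with assumed column names (`compensation`, `gender`) and placeholder values, not the actual data.

```python
import pandas as pd

# Toy data for illustration only.
# ASSUMPTION: column names and values are placeholders (millions USD).
ceos = pd.DataFrame({
    "compensation": [12.0, 15.5, 30.0, 11.0, 14.0, 20.0],
    "gender": ["male", "male", "male", "female", "female", "male"],
})

# describe() returns count, mean, std, min, quartiles, and max per group;
# the "50%" column is the median.
summary = ceos.groupby("gender")["compensation"].describe()

print(summary[["count", "mean", "50%", "min", "max"]])
```

Grouping on a single labeled column keeps the male and female summaries side by side, which makes the comparison of counts, medians, and ranges easier to read.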

Plotting sample distributions and histograms of the data gives us this nice graphic:

The disparity between the histograms of male and female CEOs is one of the most shocking features. The bars for female CEOs are barely visible when viewing the two graphs at the same scale, dramatically showing how few female CEOs there are at top companies.