If you live in Silicon Valley (or San Francisco) as I do, you hear about inequality issues all the time, especially about the gender wage gap.

When it comes to social injustice, I'm just a theorist. Born white and male, I have never experienced any kind of discrimination. I always thought of myself as progressive and open towards all genders, races, and ages. I prided myself on treating everyone the same (or at least I tried) and was stubbornly convinced that I was not part of the problem.

That changed once I took a better look at my current company. We're a team of 10 people, of which only 2 are women, and only one of them is a developer. Facing this unpleasant truth, I started looking for excuses. "No women applied to any of the positions. It isn't my fault that they're not interested." It took me a while to realize that I had never done anything to encourage women to apply. Why would they work at a company that looks like any other company?

I usually spend a couple of minutes every day skimming a few sites for good candidates. My favorite of all is definitely Angellist. It has a great mixture of people who are less risk-averse and may be more willing to join an early startup. I can filter them by location; there is no gender filter (understandably).

As I was scrolling through profiles, I noticed a pattern. Women with similar levels of experience were asking for less than men. "Am I only imagining it?" There was no way to tell. I had read studies claiming that women price their work lower than men, but I had never noticed it on my own. I wasn't sure whether I was being fooled by my own head or whether it was true, so I decided to confront the data.

The data

This part is a bit technical and may not be interesting to you. If you're only here for the results, just skip ahead to them.

The first step was getting the data out of Angellist. This should have been a piece of cake, as they offer an API that would have taken me just a couple of minutes to hook into and start scraping. But …

Well, ok. Since I've been using Angellist for a couple of years now, I know that they use background calls to load data seamlessly while users scroll through the page. I quickly found the endpoint they were calling, but then discovered that the result is not a nicely structured JSON object but rendered HTML. "Fantastic." Fortunately, I write HTML parsers for a living, so it took me around 2 hours to write my final template (the HTML they generate is far from helpful).
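To give a sense of what parsing rendered HTML looks like, here is a minimal sketch using Python's standard-library HTMLParser. The class names ("name", "salary") and the markup are invented for illustration; the real Angellist markup is different and, as noted above, far messier.

```python
from html.parser import HTMLParser

class ProfileParser(HTMLParser):
    """Hypothetical sketch: pull the text of elements with made-up class names."""

    def __init__(self):
        super().__init__()
        self._field = None   # field currently being captured
        self.profile = {}    # extracted field -> text

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "name" in classes:
            self._field = "name"
        elif "salary" in classes:
            self._field = "salary"

    def handle_data(self, data):
        if self._field:
            self.profile[self._field] = data.strip()
            self._field = None

parser = ProfileParser()
parser.feed('<div class="name">Jane Doe</div><span class="salary">$120k</span>')
print(parser.profile)  # {'name': 'Jane Doe', 'salary': '$120k'}
```

In practice, messy generated markup usually means matching on whatever stable attributes survive minification rather than on tidy semantic classes like these.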

I decided to extract as much information as possible so I could later tweak my models. I settled on these:

user name

real name

user location

current position

external links to linkedin and other sites

school they attended

work experience (in months)

skills

what type of employment they are looking for [full time, internship, …]

desired salary

Before I turned on my crawler I decided on two things:

I was going to use my own account instead of a dummy one: I wanted the Angellist team to be able to track me down so we could talk about what I was up to, instead of them freaking out about somebody scraping their users' personal data.

I was going to use only a single connection (i.e., synchronous downloading). It quickly became obvious to me that the call I was making was probably pretty taxing on their server(s), so I wanted to be as respectful as possible.
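The single-connection decision above can be sketched as a plain synchronous loop with a delay between requests. The base URL and the page parameter are placeholders, not the real endpoint; the `fetch` argument exists so the loop can be exercised without any network access.

```python
import time
import urllib.request

BASE = "https://example.com/profiles"  # placeholder; the real endpoint isn't public

def page_url(page):
    """Build the URL for one page of results (parameter name is assumed)."""
    return f"{BASE}?page={page}"

def crawl(pages, delay=1.0, fetch=None):
    """Download pages one at a time, sleeping between requests.

    A single synchronous loop guarantees at most one open connection,
    which is the whole point of being polite to the server.
    """
    fetch = fetch or (lambda url: urllib.request.urlopen(url).read())
    results = []
    for page in range(1, pages + 1):
        results.append(fetch(page_url(page)))
        time.sleep(delay)
    return results
```

A real crawl would also want retry logic and to persist each page to disk as it arrives, so a block from the server doesn't lose everything downloaded so far.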

Over the next couple of hours I downloaded and parsed over 550k profiles before being blocked by Angellist (I expected no less). I did not try to restart the crawl, though; I figured that this was almost all the profiles on Angellist anyway.

Now it was time to check what the data looked like. First, I wanted to know how many people had actually filled in all the fields I was interested in.

Only 191,524.

Good enough. For the purposes of my "study" I decided to include only people currently residing in the Bay Area. Not only is this the area where I live, but I also have a decent understanding of the salaries here.

That narrowed my options down to 11,168. Not the biggest number, but good enough.

Since Angellist doesn't report gender, I had two choices: use Mechanical Turk (or, as my colleague calls it, artificial-artificial intelligence) or write my own face recognition pipeline. I decided on the latter, as I had already been looking for an excuse to play with TensorFlow. I trained my model and took it for a spin.

The results surprised me. I managed to classify 8,579 faces with a confidence over 0.5 (50%). I checked over 500 of them manually to make sure there were no false positives. Now it was time to identify the missing 2,589. Of those, 1,344 profiles did not contain any picture. Fine, I would deal with those later.
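The confidence cutoff described above amounts to a simple post-processing step. This is a sketch under the assumption that the classifier emits (profile_id, label, confidence) tuples; everything here, including the function name, is invented for illustration.

```python
def split_by_confidence(predictions, cutoff=0.5):
    """Keep predictions above the cutoff; queue the rest for manual review."""
    accepted, unresolved = {}, []
    for profile_id, label, confidence in predictions:
        if confidence > cutoff:
            accepted[profile_id] = label
        else:
            unresolved.append(profile_id)
    return accepted, unresolved

preds = [(1, "woman", 0.92), (2, "man", 0.48), (3, "man", 0.77)]
accepted, unresolved = split_by_confidence(preds)
print(accepted)    # {1: 'woman', 3: 'man'}
print(unresolved)  # [2]
```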

I built a quick website that I intended to use to collect correct answers from Mechanical Turk, but as I was testing it out, it became pretty engaging (big mistake). Whenever I wasn't sure, I followed the external links provided by the user to look for clues (e.g. praise from others on Linkedin). Within a couple of minutes I had recruited my girlfriend to help me out. It took us roughly 90 minutes to go through the dataset (two episodes of a TV show we're currently binge-watching).

Thanks to this stupid idea I became seriously sick of seeing pictures of the Golden Gate Bridge. It seems like every second person who moves to SF has to take a picture with it and, of course, uses it as their profile picture everywhere.

Now it was time to figure out what to do with the profiles without pictures. After writing another script that followed the external links to extract more profile pictures (which yielded another 428 profiles), I decided to do the rest manually again. This time it took me almost as long as the previous dataset. I was left with 17 profiles whose gender I couldn't determine.

Finally, I could check the rest of the data. I used my scientific Docker image (containing Jupyter, SciPy, NumPy, Matplotlib, etc.). Plotting the first data, I realized that something was wrong with the salaries: 561 users claimed that their desired salary was in the hundreds of millions of dollars. I checked the rest of the dataset, and over 28k people had desired salaries in the hundreds of millions. That's over 16% of the users on Angellist who posted a desired salary. This seems to be the result of a confusing design, and Angellist should do something about it. Anyway, I removed the 3 zeros to normalize those entries.
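The normalization described above can be sketched in one line: salaries above some absurdity cutoff are assumed to be off by a factor of 1,000 (three extra zeros) and are scaled back down. The cutoff value here is my own guess, not one stated in the text.

```python
def normalize_salary(salary, cutoff=10_000_000):
    """Assumed fix for the three-extra-zeros entry bug: divide by 1,000."""
    return salary // 1000 if salary >= cutoff else salary

salaries = [120_000, 150_000_000, 95_000]
print([normalize_salary(s) for s in salaries])  # [120000, 150000, 95000]
```

One caveat of any cutoff-based fix: a genuinely huge ask just under the cutoff survives untouched, while a legitimate one just above it gets scaled, so eyeballing the distribution before and after is worthwhile.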

Then I took a look at education. I added a rank to each school I could look up in the World University Rankings registry. Once I started working on the statistical models, I realized that my dataset was small enough even without education: splitting by it shrank the groups to single digits for each combination of occupation, gender, and work experience, so I removed it from the final dataset.

What made it in? I only selected profiles that:

were looking for a full-time job

had at least one month of work experience

posted their desired salary

had a gender I could determine

had an occupation listed

currently resided in the Bay Area
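The filter above translates into a straightforward predicate over each profile. The field names and values below are invented for illustration; only the criteria themselves come from the list.

```python
def keep(profile):
    """True if a profile satisfies all inclusion criteria (field names assumed)."""
    return (
        profile.get("employment") == "full-time"
        and profile.get("experience_months", 0) >= 1
        and profile.get("desired_salary") is not None
        and profile.get("gender") is not None
        and profile.get("occupation") is not None
        and profile.get("location") == "Bay Area"
    )

sample = {
    "employment": "full-time",
    "experience_months": 18,
    "desired_salary": 120_000,
    "gender": "woman",
    "occupation": "developer",
    "location": "Bay Area",
}
print(keep(sample))                              # True
print(keep({**sample, "desired_salary": None}))  # False
```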

The final dataset shrank to 10,079.

The results

Before I present the results, please understand that the data is self-reported by the users themselves. I did not tamper with it beyond the initial cleaning. The years worked is an overall number and doesn't represent experience in the current occupation, so a person claiming to be a data scientist may have been on the job for only a week.

You may notice that salaries fluctuate a lot. The leading theory, suggested by a colleague of mine, is that the lows represent job changes, especially jumps between two fields.

What you see below is a series of graphs, one per occupation, showing the median (not the average!) salary for each gender as a function of work experience (in years). Each graph shares the same x-axis (representing 10 years of experience), but the y-axis is unique to each occupation.
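The points in each graph can be computed by grouping salaries by gender and (whole) years of experience and taking the median of each group. This is a sketch of that aggregation, assuming rows of (gender, experience in months, salary); it is not the author's actual analysis code.

```python
from collections import defaultdict
from statistics import median

def median_salaries(rows):
    """Median salary per (gender, years-of-experience) group."""
    groups = defaultdict(list)
    for gender, months, salary in rows:
        groups[(gender, months // 12)].append(salary)
    return {key: median(values) for key, values in groups.items()}

rows = [
    ("woman", 24, 110_000),
    ("woman", 30, 120_000),
    ("man", 26, 130_000),
]
print(median_salaries(rows))
# {('woman', 2): 115000, ('man', 2): 130000}
```

The median is the right choice here precisely because of the mis-entered salaries discussed earlier: a few leftover outliers would drag an average far more than a median.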

Operations