Mining Indeed: What Is in Data Scientist Job Postings?

This is how a data scientist conducts a job search.

For anyone getting into big data, one question always comes to mind: what programming tools should I use? Hundreds of articles and blog posts compare one tool with another, and they always reach the same conclusion: whatever works. But if your goal is to land a data scientist job, why not just inspect the job market?

Indeed is one of the most popular websites for job seekers. Its search results link directly to each company’s job posting, which specifies all the requirements for the position. Using Scrapy, an easy-to-use web crawling framework, I wrote a spider that traverses Indeed and the linked job postings and retrieves the full content of each job description in HTML format. Next, I constructed a set of predefined keywords and checked whether each keyword appears in the job description. For each keyword, I also counted its abbreviation (Natural Language Processing vs. NLP) and plural form (Decision Tree vs. Decision Trees). I scraped roughly 9,700 job postings across major cities and regions in the United States, and here are the results.
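The keyword check described above can be sketched roughly as follows. This is a minimal illustration, not the actual routine: the `KEYWORDS` table, the function names, and the sample sentence are all hypothetical, and the real keyword list was much longer.

```python
import re

# Hypothetical keyword table: canonical name -> accepted surface forms
# (full name, abbreviation, plural). Not the actual list used for the post.
KEYWORDS = {
    "Natural Language Processing": ["natural language processing", "nlp"],
    "Decision Tree": ["decision tree", "decision trees"],
    "Python": ["python"],
}

def compile_patterns(keywords):
    """Build one case-insensitive regex per canonical keyword."""
    patterns = {}
    for name, variants in keywords.items():
        # Longest variants first so "decision trees" is tried before "decision tree".
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        # The lookarounds keep the match from starting or ending inside a longer word.
        patterns[name] = re.compile(
            r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE
        )
    return patterns

def find_keywords(text, patterns):
    """Return {canonical keyword: True/False} for one job description."""
    return {name: bool(p.search(text)) for name, p in patterns.items()}

PATTERNS = compile_patterns(KEYWORDS)
flags = find_keywords("We use NLP and decision trees daily.", PATTERNS)
```

Each job description then becomes one row of True/False flags, which is all the later statistics need.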

Pandas and Seaborn are good enough for all the EDA work. I apologize for my imperfect data visualization (and maybe my English). The histogram below shows the number of job postings in each city. Please note that this is NOT the actual total number of job postings. Rather, the numbers only represent the amount of data I collected from Indeed, since I excluded all bad HTTP responses and timeouts. For example, San Francisco should outnumber San Jose in data scientist jobs, but the data I used shows the opposite.

Number of JD for each location

Although Los Angeles and Chicago are two major metropolitan areas in the U.S., they are not the biggest cities for data science.

Now let’s answer the question: what is the most popular toolkit for data scientist jobs? The chart below gives the answer. (R = 0.2881 means that the standalone word R appears in 28.81% of all data scientist job descriptions in the dataset.)
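To make the ratio concrete, here is a small sketch of how such a number can be computed from per-JD keyword flags. The function name and the four toy rows are made up for illustration; they are not the scraped data.

```python
def mention_ratios(flag_rows):
    """flag_rows: list of dicts (keyword -> bool), one dict per job description.

    Returns keyword -> share of job descriptions that mention it.
    """
    total = len(flag_rows)
    if total == 0:
        return {}
    counts = {}
    for row in flag_rows:
        for keyword, present in row.items():
            counts[keyword] = counts.get(keyword, 0) + (1 if present else 0)
    return {keyword: count / total for keyword, count in counts.items()}

# Toy data: four job descriptions, two keywords.
rows = [
    {"R": True, "Python": True},
    {"R": False, "Python": True},
    {"R": False, "Python": False},
    {"R": True, "Python": True},
]
ratios = mention_ratios(rows)
```

With these toy rows, R is mentioned in 2 of 4 descriptions and Python in 3 of 4.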

Ratio of Tools in JD

R and Python dominate but, surprisingly, Excel still ranks in the top 3. I have to admit that R and C might be overestimated, since the single characters “R” and “C” might appear elsewhere in the HTML content. Although I cleaned the text with BeautifulSoup, I cannot guarantee it is 100% clean.
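One way to reduce this kind of overcounting is to match single letters only as standalone, case-sensitive tokens on the already-cleaned text. The sketch below is an assumption about how that could be done, not the matching rule actually used for the chart; the excluded neighbor characters are a guess and will still let some false positives through.

```python
import re

# Characters that, when adjacent to the letter, suggest it is part of
# something else (e.g. "R&D", "C++", "C#", "C-suite"). This list is a guess.
_EXCLUDE_BEFORE = r"(?<![\w&+#.\-])"
_EXCLUDE_AFTER = r"(?![\w&+#\-])"

def standalone_letter_pattern(letter):
    """Case-sensitive pattern for a single letter used as its own token."""
    return re.compile(_EXCLUDE_BEFORE + re.escape(letter) + _EXCLUDE_AFTER)

R_PATTERN = standalone_letter_pattern("R")
C_PATTERN = standalone_letter_pattern("C")

def mentions(pattern, text):
    """True if the pattern occurs anywhere in the cleaned text."""
    return pattern.search(text) is not None
```

For example, `mentions(R_PATTERN, "Our R&D team reports to the C-suite.")` is False, while `mentions(R_PATTERN, "Proficiency in C or R.")` is True.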

Next, we go through the skills.

Ratio of Skills in JD

Machine Learning is the most frequently mentioned buzzword. But since Machine Learning is the umbrella concept in data science, that is not surprising at all.

Let’s see who is the largest recruiter in our dataset.

Top 20 data scientist recruiters

Note that the numbers only represent our dataset, not the actual job market, and some job postings on Indeed are actually out of date. Here, Amazon Corporate LLC (241) is the largest recruiter, followed by Ball Aerospace (139) and Walmart eCommerce (102). Amazon and Walmart are both giant corporations in America, while Ball Aerospace provides 137 out of 380 data scientist jobs in Colorado. Anyway, is any company on the list going to hire me?

Does education matter? Absolutely YES.

Ratio of Educational Requirement on JD

How do toolkits and skills vary with geographic location? Let’s split our data by location and see how it goes.

Ratio of all tools in each location

Ratio of all skills in each location

It seems that companies in the Bay Area use Python more, whereas R is much more in demand on the East Coast. Excel is popular in New York and Los Angeles. If you are an expert in Deep Learning or Computer Vision, go to the Bay Area to get a decent job!

We can also calculate the “conditional frequency” of a programming skill given that certain words are mentioned in the JD. I used Deep Learning, Data Mining, and Recommendation Systems as examples.
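In other words, we restrict the data to the JDs that mention the given skill and recompute the ratios on that subset. A minimal sketch, with a made-up function name and toy rows rather than the real data:

```python
def conditional_frequency(rows, given, targets):
    """Share of JDs mentioning each target among JDs that mention `given`.

    rows: list of dicts (keyword -> bool), one per job description.
    Returns None for every target if no JD mentions `given`.
    """
    subset = [row for row in rows if row.get(given)]
    if not subset:
        return {target: None for target in targets}
    return {
        target: sum(1 for row in subset if row.get(target)) / len(subset)
        for target in targets
    }

# Toy data: five job descriptions.
rows = [
    {"Deep Learning": True, "Python": True, "Excel": False},
    {"Deep Learning": True, "Python": True, "Excel": False},
    {"Deep Learning": False, "Python": False, "Excel": True},
    {"Deep Learning": True, "Python": False, "Excel": False},
    {"Deep Learning": False, "Python": True, "Excel": True},
]
result = conditional_frequency(rows, "Deep Learning", ["Python", "Excel"])
```

Here three JDs mention Deep Learning; two of them also mention Python and none mention Excel.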

Deep Learning vs. All

Data Mining vs. All

Recommendation Systems vs. All

Excel appears to be largely irrelevant to these skills. Further, I drew a heatmap to visualize the correlations between different tools and skills.

Correlation Heatmap

The labels on both axes are sorted by popularity. Though Excel is relatively popular, it is uncorrelated with higher-level skills. Statistical tools like SAS and SPSS are not very relevant to these skills, given their slightly negative correlations with most of the selected skills. On the other hand, TensorFlow, Theano, Torch and Caffe are strongly associated with Deep Learning and Computer Vision, although they are not frequently mentioned across all the JDs.
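For readers who want to reproduce this kind of heatmap, the sketch below computes a correlation matrix over 0/1 mention indicators. It assumes the Pearson coefficient, which is an assumption on my side for illustration; the function names and toy columns are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns: dict name -> list of 0/1 flags. Returns nested dict of correlations."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }

# Toy indicator columns for six job descriptions.
columns = {
    "TensorFlow": [1, 1, 0, 0, 0, 0],
    "Deep Learning": [1, 1, 1, 0, 0, 0],
    "Excel": [0, 0, 0, 1, 1, 0],
}
matrix = correlation_matrix(columns)
```

In this toy example TensorFlow correlates positively with Deep Learning while Excel correlates negatively with it, which mirrors the pattern described above in miniature.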

Next, we can look at the company level. Who are the major recruiters in each city? Which companies are hiring Deep Learning people? Do companies have any preference among the Deep Learning tools (say, TensorFlow, Theano, Torch and Caffe)?

Top 5 recruiters in each city

Besides the big tech names, there are still a lot of opportunities in educational and medical institutions. You will be extremely competitive in the job market if you have deep knowledge of both data analytics and medical science.

Top 3 Deep Learning recruiters in each city

NaN is definitely NOT a company; it just means NONE.

Recruiters who highlight TensorFlow

Recruiters who highlight Theano

Recruiters who highlight Torch

Recruiters who highlight Caffe

This is exciting! If I follow the hiring information periodically, I might gain more insight into each company’s business strategy and the overall industry trends, and I can do something big!

Further, to look for clusters among the job postings, I applied Principal Component Analysis (PCA) to the data. The intuition behind PCA is simple: reduce the dimensionality. I transformed our data from high dimension (dummy variables for the tools and skills mentioned) to 3D and plotted a human-readable visualization. Here I used the Sparse PCA implementation in scikit-learn.

PhD vs. Not Mentioned

I wondered about the added value of a PhD degree, so I split the data into two classes: JDs that mention a PhD vs. those that do not. Apparently, there is some separation between the two groups.

FLAG vs. Non-FLAG

I love this chart the most! FLAG companies (namely Facebook, LinkedIn, Amazon and Google) are looking for a diverse range of talent. On the other hand, JDs from non-FLAG companies are quite similar to one another. If you want to go FLAG, you should be special! Well, let’s see how special you should be and what gets you a FLAG offer.

So far, we have 405 FLAG job postings out of 9,700 in total. We can build a FLAG classifier, and that model should tell you how to prepare for a FLAG job! To balance the size of the two groups, I randomly chose 405 non-FLAG JDs from the rest of the data. Next, I trained a logistic regression model with an L1 penalty and 10-fold cross-validation. The explanatory power is acceptable, with an R-squared of 0.8025. Here are all the coefficients:
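The balancing step can be sketched as simple random undersampling of the larger group. The function name, the `is_flag` key, the seed, and the toy rows below are all hypothetical; model fitting is left out here.

```python
import random

def balance_by_undersampling(rows, label_key, seed=0):
    """Randomly shrink the larger class so both classes have the same size.

    rows: list of dicts, each holding a boolean under `label_key`.
    seed: fixed only so the illustration is reproducible.
    """
    positives = [row for row in rows if row[label_key]]
    negatives = [row for row in rows if not row[label_key]]
    rng = random.Random(seed)
    # Whichever group is smaller stays whole; the other is sampled down to match.
    minority, majority = sorted([positives, negatives], key=len)
    sampled = rng.sample(majority, len(minority))
    balanced = minority + sampled
    rng.shuffle(balanced)
    return balanced

# Toy data: 5 FLAG postings and 50 non-FLAG postings.
rows = [{"id": i, "is_flag": i < 5} for i in range(55)]
balanced = balance_by_undersampling(rows, "is_flag")
```

The balanced sample keeps every FLAG posting and an equally sized random subset of the non-FLAG ones, which is the same idea as drawing 405 non-FLAG JDs to match the 405 FLAG ones.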

Coefficients: What makes you get FLAG offers

Awesome! Go for Neural Network and TensorFlow!

That’s all the research I have done so far on this dataset. The magnitudes of the numbers merely imply correlation or popularity. The numbers are not associated with expected salaries. There is still a lot of work I can do if I combine this with other data (e.g. the link between skills and compensation package, or further labeling our data with industry categories). I truly hope this article gives you some insight into the data scientist job market and is helpful for your education or career decisions.

All data scraped on Oct. 23rd, 2016. Special thanks to Will Fu, who constantly gives me feedback and helps me brainstorm. He needs a job and a girlfriend!