My LinkedIn profile describes me as a Software Engineer and Data Scientist. Based on my job history, the first half of that pair is probably more accurate, as I’ve only ever secured short-term contract gigs in data science. Having voluntarily moved out of an earlier career as a healthcare statistician, I was becoming vexed with my attempts to land a full-time data scientist position in Singapore, where I’m based. I’ve seen some acquaintances with only a Bachelor’s degree readily secure positions, while my Master’s in Medical Statistics and General Assembly certificate in Web Development did not seem to land the knockout blow that I had hoped for (two out of three in the Conway Venn diagram, or so I thought). My patience was also wearing thin with the self-congratulatory “How I landed such-and-such role” advice running rampant online, each piece of which, after all, had a sample size of just one.

What was dawning on me was that I had conflated the practice of data science with the strategy for getting into it. To my surprise, these turned out not to be the same thing. Like most novices, I was putting together information from a haphazard mix of blog posts, the requirements sections of data science job postings, and hearsay from people in the field. The skills-heavy focus of these sources, not to mention their castigating and often moralizing insistence that data scientists can and should learn a whole bunch of things, can ironically trap beginners in a never-ending cycle of chasing the latest skills, when perhaps the most efficient strategy would be to quickly land an adjacent data-related position first and then learn the skills on the job.

Daniel Kahneman would call this an example of falling victim to the availability heuristic. I thought that I needed to acquire 10 impossible skills before breakfast because that was what I had read about what a data scientist looks like, without pausing to consider that there are probably thousands of data scientists out there who have already been successfully hired, and most of them (by definition) are not superstars. What I needed was not another navel-gazing post on the top skills required of a data scientist but actual data on people who have successfully made the transition into data science. What were they doing before?


Data on Data Scientists

While there are some publicly available, large-scale surveys that have been conducted on who is a data scientist, I saw several problems with such data:

Self-selection bias. Because these surveys are affiliated with certain kinds of organizations and are completely voluntary, a certain profile of respondents might be over-represented in the sample. I saw a particular problem with over-enthusiastic TensorFlow practitioners dominating the Kaggle Data Science Survey, which might be very different from how data science is actually practiced in business.

Respondent bias. Being completely voluntary and having no feedback on the respondent (you don’t suffer any consequence from misrepresenting yourself), individual respondents might have fewer disincentives to inflate their titles or education or other kinds of data.

Market representation. My main motivation was to find out the profiles of people who have actually been successfully hired as data scientists in my target market (Singapore). From what I’ve seen, the survey data is inundated with data science aspirants (mainly students), and specific data on data scientists based in Singapore was limited.

There was no question in my mind that LinkedIn was where I needed to get the data from. While there might still be some selection bias (LinkedIn’s algorithms might not be showing me a truly random sample of data scientists¹), I saw its widespread adoption by jobseekers and the recruitment industry alike as an inbuilt check to minimize respondent bias and ensure the truthfulness of its profiles. LinkedIn profiles are subject, as it were, to the coercions of the actual job market.

In addition, LinkedIn allowed me to specify in my search query the geography of the profiles I wished to analyze, limiting it to Singapore if so desired. There was only one problem: getting the data itself.

Scraping the data: don’t say I didn’t warn you

There has been some controversy surrounding the legality of scraping LinkedIn data. While recent precedent establishes that such information is public and therefore amenable to extraction by anyone, the legal status is far from settled. In any case, there are several roadblocks you will encounter when you try to scrape LinkedIn data:

You will be in violation of LinkedIn’s User Agreement. While the enforceability of such contracts remains murky, you run the risk of having your account suspended for breaching the terms of service.

LinkedIn sets an upper limit on the number of profiles you can click on within the free tier, which your little Selenium bot will quickly hit (especially if you’re spending a lot of time just debugging the scraper).

LinkedIn has been quietly and frequently changing their HTML tags such that scraping based on any current set of tag attributes has a rather short shelf life.

Suffice it to say that the scraper I wrote remained useful for long enough to acquire a decently sized dataset (1027 LinkedIn profiles) before the tags were replaced and the code became outdated. (If you’d like to find out more about the code nonetheless, feel free to reach out to me².)

Using the search query “Data Scientist AND Singapore”, I extracted as many profiles as I could from the People section of LinkedIn. There were really only three data elements that I considered relevant: Current Position (job title and name of employer), Education (most recent institution and field of study) and Experience (position, organization, and duration of previous roles). Limiting myself to these three elements not only saved time in writing and debugging the scraper but was also my attempt at minimizing the scope of potential liabilities from not adhering to LinkedIn’s terms of service.

After filtering out data science aspirants, students, and profiles with insufficient information, I was left with 869 data scientist profiles. Now I could go about asking: what common traits do currently employed data scientists have?
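The filtering step described above might look something like the following. This is only a minimal sketch and not the code I actually ran: the `Profile` fields, the `keep_profile` helper, and the exclusion words are all hypothetical stand-ins, and the sample records are made up purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical exclusion words; the exact criteria used are an assumption.
EXCLUDE_WORDS = {"aspiring", "student", "intern", "seeking"}


@dataclass
class Profile:
    # Hypothetical field names, loosely based on the scraped elements.
    job_title: Optional[str]
    employer: Optional[str]
    education: Optional[str]


def keep_profile(profile: Profile) -> bool:
    """Return True if the profile looks like a currently employed data scientist."""
    # Drop profiles with insufficient information.
    if not (profile.job_title and profile.employer and profile.education):
        return False
    title = profile.job_title.lower()
    # Match whole words so that e.g. "internal" is not mistaken for "intern".
    words = set(re.findall(r"[a-z]+", title))
    if words & EXCLUDE_WORDS:
        return False
    return "data scientist" in title


# Made-up records for illustration only; not taken from the scraped dataset.
sample = [
    Profile("Senior Data Scientist", "Example Pte Ltd", "MSc Statistics"),
    Profile("Aspiring Data Scientist", "Example Pte Ltd", "BSc Physics"),
    Profile("Data Scientist", None, "PhD Computer Science"),
    Profile("Student", "Example University", "BSc Mathematics"),
]
kept = [p for p in sample if keep_profile(p)]
print(len(kept), "of", len(sample), "sample profiles kept")
```

Keyword rules like these are crude, so in practice some manual review of borderline titles would probably still be needed.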

Finding 1: most data scientists have postgraduate degrees

The most striking finding from the data, and one that has been corroborated elsewhere, is that most (73%) currently employed data scientists have degrees beyond a Bachelor’s. A plurality (44%) hold a Master’s degree, while Ph.D. holders outnumber those with only a Bachelor’s degree, 29% to 21%. Only 6% of data scientists reported some form of MOOC, bootcamp or non-traditional certification as their primary qualification. This suggests that prospective employers treat an advanced degree as a signal that a candidate can meet the complex requirements of the data scientist position. It also puts paid to the notion that data science bootcamps or other non-traditional certification programs are an adequate substitute for such degrees.
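As a quick sanity check on the arithmetic, the shares quoted above can be tallied as follows. This is just a sketch using the percentages reported in this post; the dictionary keys are my own shorthand labels.

```python
# Shares of primary qualification as reported in the text, in percent.
shares = {
    "Master's": 44,
    "Ph.D.": 29,
    "Bachelor's": 21,
    "MOOC/bootcamp/other": 6,
}

# Postgraduate share is Master's plus Ph.D.
postgraduate = shares["Master's"] + shares["Ph.D."]
total = sum(shares.values())

print(f"Postgraduate share: {postgraduate}%")
print(f"All categories sum to: {total}%")
```

The Master’s and Ph.D. shares add up to the 73% headline figure, and the four categories together account for the whole sample.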