This article covers job titles for data scientists, including details about the simple but powerful classifier used to categorize them. The analysis provides a breakdown per job category, granular reports that you can download for free (job titles broken down by company, category, and level), and NLP (natural language processing) source code. It is based on connections from multiple LinkedIn profiles, totaling more than 10,000 professionals. The first study was published in June 2013.

1. Summary

The table below shows the top job titles in the business analytics category. The full list has 700+ job titles shared by at least two practitioners, across the following 11 categories:

Recruiter

Engineering

Developer

Data Plumbing

Data Science

Statistician

Research

Business Analytics

Consultant

Trainer

Student

The full table can be downloaded here (Excel spreadsheet). If you include job titles shared by only one person, we have 7,000+ job titles: this is another example of a system governed by a Zipf distribution, with a very long tail. A spreadsheet with full details (job title, job category, level, and company name) is available exclusively to DSC members. If you are not yet a member, you can sign up here to access the spreadsheet.
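As a rough illustration of what a Zipf-like distribution means here, the sketch below ranks titles by frequency and prints frequency times rank, which stays roughly constant under an ideal Zipf law with exponent 1. The counts are made-up values chosen to fit that law exactly; they are not taken from the study.

```perl
use strict;
use warnings;

# Hypothetical title counts, for illustration only (not from the study).
my %count = (
    "data scientist"   => 120,
    "business analyst" => 60,
    "statistician"     => 40,
    "data engineer"    => 30,
    "research analyst" => 24,
);

# Rank titles from most to least frequent (ties broken alphabetically).
my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

# Under an ideal Zipf law with exponent 1, frequency * rank is constant.
my @products;
for my $rank (1 .. scalar @ranked) {
    my $title = $ranked[$rank - 1];
    push @products, $count{$title} * $rank;
    printf "%d\t%-18s\t%d\t%d\n", $rank, $title, $count{$title}, $products[-1];
}
```

On real data the product drifts, and the long tail shows up as a large number of titles with a count of 1.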

Figure 1-a: Top job titles in the business analytics category

Figure 1-b: Top job titles in the data science category

2. Methodology

We analyzed the LinkedIn data (connections with job title and company, from well-connected data scientists), cleaned the job title field, and created three extra fields:

Cleaned job title

Level (Executive / Manager / Consultant / Analyst / Professor / Student)

Category (see section 1)

To identify job categories and levels, we first created a data dictionary of all one-token and two-token keywords found in job titles, ranked by frequency, after filtering out tokens that carry no meaning on their own (such as vice, which always appears with president, as in vice president).

The top 2-token words are displayed in Figure 2:

Figure 2: top 2-token words found in job titles

The full list, including both one- and two-token words (totaling 15,000 words), can be downloaded here (Excel spreadsheet).
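The dictionary step can be sketched as follows, assuming the job titles are already cleaned. The sample titles and the `%skip` stop list are hypothetical; the article does not say how the filtering of tokens such as vice was implemented, so the stop list is only one possible approach.

```perl
use strict;
use warnings;

# Hypothetical cleaned job titles, for illustration only.
my @titles = (
    "data scientist",
    "senior data scientist",
    "data analyst",
    "vice president analytics",
);

# Assumed stop list: one-token keywords that mean nothing on their own.
my %skip = (vice => 1);

my (%token1, %token2);
for my $title (@titles) {
    my @tok = split ' ', $title;
    for my $k (0 .. $#tok) {
        # Count one-token keywords, except those in the stop list.
        $token1{ $tok[$k] }++ unless $skip{ $tok[$k] };
        # Count two-token keywords (pairs of adjacent tokens).
        if ($k > 0) {
            my $pair = join ' ', $tok[$k - 1], $tok[$k];
            $token2{$pair}++;
        }
    }
}

# Rank by decreasing frequency, ties broken alphabetically.
my @top1 = sort { $token1{$b} <=> $token1{$a} || $a cmp $b } keys %token1;
my @top2 = sort { $token2{$b} <=> $token2{$a} || $a cmp $b } keys %token2;

print "One Term\t$_\t$token1{$_}\n"  for @top1;
print "Two Terms\t$_\t$token2{$_}\n" for @top2;
```

With these sample titles, data and data scientist come out on top of their respective lists, while vice is kept only as part of the two-token keyword vice president.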

The job categories, levels, and cleaned job titles were computed with the Perl script in section 3. While this is a clustering problem (creating a taxonomy of job titles for data scientists), our simple and scalable approach makes it look, from a computational point of view, more like an indexing problem than pure clustering.
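To show why this behaves like indexing rather than clustering, here is a minimal sketch: each title is labeled on its own by keyword lookup, in a single pass, with no comparison between titles. The four rules are a subset copied from the script in section 3; the `categorize` helper, the `%index` hash, and the sample titles are hypothetical names and values introduced for illustration.

```perl
use strict;
use warnings;

# A few keyword rules from the category step of the script.
# Later rules override earlier ones, as in the script.
my @rules = (
    [ engineer  => "Engineering"  ],
    [ software  => "Developer"    ],
    [ scientist => "Data Science" ],
    [ research  => "Research"     ],
);

# Label one title; substring search is equivalent to the script's
# regex match for these plain keywords.
sub categorize {
    my ($job) = @_;
    my $category = "Other";
    for my $rule (@rules) {
        my ($keyword, $label) = @$rule;
        $category = $label if index($job, $keyword) >= 0;
    }
    return $category;
}

# One pass over the titles builds the index: category -> list of titles.
my %index;
for my $job ("software engineer", "research scientist", "data scientist", "ceo") {
    push @{ $index{ categorize($job) } }, $job;
}

for my $category (sort keys %index) {
    print "$category: ", join(", ", @{ $index{$category} }), "\n";
}
```

The cost grows linearly with the number of titles, which is what makes the approach scalable.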

3. Source Code

The idea was to quickly write a script that produces the results in less than two hours of work, from start to finish. The input file jobs.txt contains the raw job title and company, as entered by LinkedIn connections. The first step uses regular expressions to clean the job titles. If you are unfamiliar with this type of code, read our data science cheat sheet first. Note that the many "if" statements in the code are in hierarchical order: each matching statement overwrites the result of the previous ones, so you cannot reorder them without changing the output.
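The effect of ordering can be seen in a small sketch that applies two of the script's level rules (lead and consultant) in both orders. The `level_for` helper is a hypothetical name; it mimics the chain of "if" statements, where the last matching rule wins.

```perl
use strict;
use warnings;

# Apply keyword rules in the given order; the last matching rule wins,
# as with the chain of "if" statements in the script.
sub level_for {
    my ($job, @rules) = @_;
    my $level = "Other";
    for my $rule (@rules) {
        my ($keyword, $label) = @$rule;
        $level = $label if $job =~ /\Q$keyword\E/;
    }
    return $level;
}

# Same two rules, in the script's order and in reverse order.
my @original = ( [ lead => "Manager" ], [ consultant => "Consultant" ] );
my @swapped  = reverse @original;

my $job            = "lead consultant";
my $level_original = level_for($job, @original);
my $level_swapped  = level_for($job, @swapped);

print "original order: $level_original\n";   # Consultant
print "swapped order:  $level_swapped\n";    # Manager
```

A title that matches a single keyword is unaffected by the order; only titles matching several keywords change label.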

The main job categories and levels were created by looking at top entries (with highest frequencies) in the data dictionary: see Step 3 in the code below.

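open(IN,"<jobs.txt") or die "Cannot open jobs.txt: $!"; # input: company, tab, raw job title

open(OUT,">jobsTable.txt") or die "Cannot write jobsTable.txt: $!";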

while ($i=<IN>) {

$i=~s/\n//g; # remove the trailing newline

@aux=split(/\t/,$i);

$company=$aux[0];

$job_raw=$aux[1];

$job=$aux[1]; # job title

$job=lc($job); # put in lowercase

$job=~s/ of / /g; # clean job title

$job=~s/ and / /g; # more cleaning

$job=~s/[\/,\\,\&,\,,\-,\.]/ /g; # more cleaning

$job=~s/ +/ /g; # collapse repeated spaces left by the previous substitutions

$ljob{$job}++;

#---- Step 1: creating job level

$level="Other";

if ($job =~ "vice president") { $level="Executive"; }

if ($job =~ "vp ") { $level="Executive"; }

if ($job =~ "ceo") { $level="Executive"; }

if ($job =~ "executive") { $level="Executive"; }

if ($job =~ "officer") { $level="Executive"; }

if ($job =~ "chief") { $level="Executive"; }

if ($job =~ "partner") { $level="Executive"; }

if ($job =~ "president") { $level="Executive"; }

if ($job =~ "director") { $level="Manager"; }

if ($job =~ "manager") { $level="Manager"; }

if ($job =~ "lead") { $level="Manager"; }

if ($job =~ "consultant") { $level="Consultant"; }

if ($job =~ "principal") { $level="Consultant"; }

if ($job =~ "professor") { $level="Professor"; }

if ($job =~ "analyst") { $level="Analyst"; }

if ($job =~ "student") { $level="Student"; } # Student is a level here; categories are set in Step 2



$ljob_level{$job}=$level;

#---- Step 2: creating category

$category="Other";

if ($job =~ "recruit") { $category="Recruiter"; }

if ($job =~ "talent") { $category="Recruiter"; }

if ($job =~ "engineer") { $category="Engineering"; }

if ($job =~ "software") { $category="Developer"; }

if ($job =~ "develop") { $category="Developer"; }

if ($job =~ "architect") { $category="Data Plumbing"; }

if ($job =~ "scientist") { $category="Data Science"; }

if ($job =~ "science") { $category="Data Science"; }

if ($job =~ "stat") { $category="Statistician"; }

if ($job =~ "research") { $category="Research"; }

if ($job =~ "marketing") { $category="Business Analytics"; }

if ($job =~ "analytics") { $category="Business Analytics"; }

if ($job =~ "business") { $category="Business Analytics"; }

if ($job =~ "operations") { $category="Business Analytics"; }

if ($job =~ "consultant") { $category="Consultant"; }

if ($job =~ "training") { $category="Trainer"; }

if ($job =~ "lecturer") { $category="Trainer"; }

if ($job =~ "professor") { $category="Trainer"; }

if ($job =~ "student") { $category="Student"; }

$ljob_category{$job}=$category;

print OUT "$company\t$job_raw\t$category\t$level\t$job\n";

#---- Step 3: create data dictionary

# ltoken1 is list (hash table) of one-term words found

# ltoken2 is list (hash table) of two-term words found

@aux=split(' ',$job);

$ntokens=$#aux+1;

$token=$aux[0];

$ltoken1{$token}++;

for ($k=1; $k< $ntokens; $k++) {

$token_A=$aux[$k-1];

$token_B=$aux[$k];

$ltoken1{$token_B}++;

$ltoken2{"$token_A $token_B"}++;

}

}

close(OUT);

close(IN);

#---- more output

open(OUT,">jobs_dictionary.txt");

foreach $token1 (keys(%ltoken1)) {

print OUT "One Term\t$token1\t$ltoken1{$token1}\n";

}

foreach $token2 (keys(%ltoken2)) {

print OUT "Two Terms\t$token2\t$ltoken2{$token2}\n";

}

close(OUT);



open(OUT,">jobs_summary.txt");

foreach $job (keys(%ljob)) {

if ($ljob{$job} > 1) { # keep only jobs with 2+ entries

print OUT "$job\t$ljob{$job}\t$ljob_level{$job}\t$ljob_category{$job}\n";

}

}

close(OUT);
