Part 1 and Part 2 both compared data scientists to data analysts. But I’ve been neglecting the unsung heroes of the data world: data engineers. I’m not too familiar with the life of a data engineer. I imagine there’s some overlap with data scientists (Python, Hadoop, etc), but with a stronger emphasis on data infastructure (Spark, AWS, etc.). Coming from a position of complete ignorance, let’s see if we can use NLP to identify the skills that are specific to data engineers. As always, the full code can be found on github.

Data Collection

Similar to Part 1, we’ll extract all data engineer, data scientist and data analyst jobs in London from the Indeed API and then filter out all junior/senior positions and plot the advertised salaries for each job type.

## if you haven't already installed jobbR # devtools::install_github("dashee87/jobbR") ## loading the packages we'll need require ( jobbR ) # searching indeed API require ( dplyr ) # data frame filtering/manipulation require ( rvest ) # web scraping require ( stringr ) # counting patterns within job descriptions require ( plotly ) # interactive plots require ( ggplot2 ) # vanilla plots require ( tm ) # text mining

## job_type num_jobs ## 1 Data Scientist 210 ## 2 Data Analyst 158 ## 3 Data Engineer 103

The first thing to note is there are about half as many data engineers posts as there are data scientist posts. Data engineers appear to be paid more than data scientists (though the former is a small sample), with the lowly data analyst bringing up the rear. We’ll now turn our focus to the job description. Repeating the work in Part 2, we’ll plot the proportion of job descriptions that contain specific predefined skills.

Apologies for small text on the x-axis, click here for a better version.

tf-idf

In this post, we’ll attempt to isolate the skills that are more strongly associated with data engineers than data scientists/analysts. We want words that feature frequently in data engineer job descriptions, but rarely with other job types (called term frequency-inverse document frequency, or tf-idf for short).

Firstly, we’ll scrape the job descriptions. I’ve added a few gsub commands to filter out unwanted punctuation features (e.g. bullet points), which may not be detected by the filters within the tm package.

# scrape job description webpages ds_job_descripts <- unlist ( lapply ( dataScientists $ results.url , function ( x ){ read_html ( x ) %>% html_nodes ( "#job_summary" ) %>% html_text () %>% tolower () %>% gsub ( "

|/" , " " , . ) %>% gsub ( "'|'" , "" , . ) %>% gsub ( "[^[:alnum:]///' ]" , "" , . )})) da_job_descripts <- unlist ( lapply ( dataAnalysts $ results.url , function ( x ){ read_html ( x ) %>% html_nodes ( "#job_summary" ) %>% html_text () %>% tolower () %>% gsub ( "

|/" , " " , . ) %>% gsub ( "'|'" , "" , . ) %>% gsub ( "[^[:alnum:]///' ]" , "" , . )})) de_job_descripts <- unlist ( lapply ( dataEngineers $ results.url , function ( x ){ read_html ( x ) %>% html_nodes ( "#job_summary" ) %>% html_text () %>% tolower () %>% gsub ( "

|/" , " " , . ) %>% gsub ( "'|'" , "" , . ) %>% gsub ( "[^[:alnum:]///' ]" , "" , . )}))

Our task consists of two parts:

Idenitfy words that commonly occur in data engineer job descriptions Identify words that commonly occur in data engineer/scientist/analyst job descriptions.

Words that appear highly in the first group but lowly within the second represent skills and themes specific to data engineers. To quantify word frequency, we must convert the job description vectors into a text corpus (large structured set of texts). We remove common words (called stop words) that provide little informative power (e.g. ‘and’, ‘the’, ‘are’). We’ll actually build two seperate corpuses: one for the data engineer jobs descriptions alone (to calculate tf ) and another for all of the job descriptions (to calculate idf )).

de_corpus <- Corpus ( VectorSource ( de_job_descripts )) %>% tm_map ( function ( x ){ removePunctuation ( x , preserve_intra_word_dashes = TRUE )}) %>% tm_map ( stripWhitespace ) %>% tm_map ( removeWords , stopwords ( "english" )) %>% tm_map ( PlainTextDocument ) all_corpus <- Corpus ( VectorSource ( c ( de_job_descripts , da_job_descripts , ds_job_descripts ))) %>% tm_map ( function ( x ){ removePunctuation ( x , preserve_intra_word_dashes = TRUE )}) %>% tm_map ( stripWhitespace ) %>% tm_map ( removeWords , stopwords ( "english" )) %>% tm_map ( PlainTextDocument )

Remember that we’re interested in the frequency of each term within the corpuses. We can easily convert the corpuses to term document matrices, where each row corresponds to an individual term and each column refers to a different job description and the value is simply the number of the times the term appeared in that job description (which is then converted to a binary).

de_tdm <- TermDocumentMatrix ( de_corpus ) all_tdm <- TermDocumentMatrix ( all_corpus ) de_df <- data.frame ( word = row.names ( de_tdm ), tf = rowSums ( ifelse ( as.matrix ( de_tdm ) > 0 , 1 , 0 )), row.names = NULL , stringsAsFactors = FALSE ) all_df <- data.frame ( word = row.names ( all_tdm ), tf = rowSums ( ifelse ( as.matrix ( all_tdm ) > 0 , 1 , 0 )), row.names = NULL , stringsAsFactors = FALSE )

# data engineer common words de_df %>% arrange ( - tf ) %>% head

## word tf ## 1 data 103 ## 2 experience 90 ## 3 engineer 85 ## 4 will 83 ## 5 working 80 ## 6 team 71

# all jobs common words all_df %>% arrange ( - tf ) %>% head

## word tf ## 1 data 469 ## 2 experience 412 ## 3 will 386 ## 4 skills 346 ## 5 team 342 ## 6 work 314

Taking the term frequency ( tf ) alone, unsurprisingly, we see that ‘data’ and ‘engineer’ are two of the three most common words in data engineer job descriptions. The remaining terms are more generic, illustrated by their high ranking among all jobs. This demonstrates the importance of the inverse document frequency ( idf ) component. It will penalise terms such as ‘skills’, ‘team’ and ‘work’, as they’re not strongly associated with data engineers exclusively. We’ll normalise the tf score (divide by the max) and calculate the idf . The tf_idf is simply the product of the tf and idf .

de_df $ tf = de_df $ tf / max ( de_df $ tf ) de_idf <- data.frame ( word = row.names ( all_tdm ), idf = log2 ( length ( all_corpus ) / rowSums ( ifelse ( as.matrix ( all_tdm ) > 0 , 1 , 0 ))), row.names = NULL , stringsAsFactors = FALSE ) de_df $ tf_idf = de_df $ tf * de_idf [ match ( de_df $ word , de_idf $ word ),] $ idf knitr :: kable ( de_df %>% inner_join ( de_idf , by = c ( "word" = "word" )) %>% arrange ( - tf_idf ) %>% mutate ( rank = row_number ()) %>% select ( rank , word , tf , idf , tf_idf ) %>% head ( 40 ), digits = 3 )

rank word tf idf tf_idf 1 engineer 0.825 2.372 1.957 2 etl 0.330 3.388 1.118 3 engineers 0.330 3.236 1.068 4 spark 0.544 1.949 1.060 5 java 0.379 2.690 1.018 6 aws 0.311 3.265 1.014 7 pipelines 0.262 3.750 0.983 8 engineering 0.485 1.997 0.969 9 hadoop 0.495 1.925 0.953 10 scala 0.330 2.835 0.936 11 platform 0.359 2.576 0.925 12 design 0.515 1.782 0.917 13 architecture 0.252 3.558 0.898 14 technologies 0.427 2.034 0.869 15 big 0.534 1.603 0.856 16 software 0.369 2.295 0.847 17 linux 0.204 4.022 0.820 18 infrastructure 0.223 3.632 0.811 19 redshift 0.194 4.125 0.801 20 systems 0.437 1.824 0.797 21 technical 0.437 1.792 0.783 22 nosql 0.243 3.125 0.758 23 hands 0.204 3.710 0.756 24 years 0.388 1.937 0.752 25 web 0.262 2.835 0.743 26 kafka 0.165 4.487 0.741 27 cloud 0.204 3.632 0.740 28 applications 0.291 2.487 0.724 29 building 0.340 2.111 0.717 30 environments 0.194 3.632 0.705 31 databases 0.262 2.690 0.705 32 date 0.233 2.973 0.693 33 languages 0.262 2.632 0.690 34 mapreduce 0.165 4.179 0.690 35 pig 0.175 3.925 0.686 36 hive 0.252 2.710 0.684 37 scripting 0.194 3.522 0.684 38 production 0.214 3.152 0.673 39 processes 0.262 2.487 0.652 40 build 0.408 1.576 0.643

It’s a good sanity check that ‘engineer’ returned the highest tf_idf score, as we’d expect that to be relatively specific to data engineer job descriptions. Also, it’s reassuring that the generic terms that previously scored well (e.g. ‘data’, ‘team’, ‘will’) are not in the table. The table provides some interesting insights. Take the example of ‘spark’: it has a relatively high tf , but is penalised by a low idf (spark is also a key skill among data scientists). ‘etl’, on the other hand, has a considerably lower tf , but outranks spark due to its higher idf (etl is a term more uniquely associated with data engineers).

It’s important to note that there is no strict defintion of either tf or idf . If you wish, you can attach more importance to either by applying a particular variant (a few examples here). I suppose it depends whether you think terms like ‘spark’ (high tf ; low idf ) should rank more highly than terms like ‘etl’ (low tf ; high idf ).

Summary

After some exploratory analysis, we used term frequency-inverse document frequency to idenitfy words and skills that are uniquely associated with data engineers. Think of the output as potential conversation starters with your engineer counterparts. “So… how about that etl?”