Contributed by Sharan Naribole. He is currently undertaking the part-time online bootcamp organized by NYC Data Science Academy (Dec 2016- April 2017). This blog is based on his bootcamp project - R Exploratory Data Analysis.

Abstract

The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process. The application data is publicly available for in-depth longitudinal research and analysis. This data provides key insights into the prevailing wages for job titles being sponsored by US employers under the H-1B visa category. In particular, I use the 2011-2016 H-1B petition disclosure data to analyze the employers with the most applications, data science related job positions, and the relationship between salaries offered and the cost of living index.

H-1B Visa Data Introduction

The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States. For a foreign national to apply for an H-1B visa, a US employer must offer a job and petition for an H-1B visa with the US immigration department. This is the most common visa status applied for and held by international students once they complete college or higher education (Masters, PhD) and work in a full-time position.

The Office of Foreign Labor Certification (OFLC) publishes program data on the immigration programs, including the H-1B visa. The disclosure data, updated annually, is available at https://www.foreignlaborcert.doleta.gov/performancedata.cfm

In this project, I analyze over 3 million records of H-1B petitions in the period 2011-2016.

Data Wrangling

In this section, I discuss a few key data transformations performed on the raw dataset before the data analysis. The complete code for the data transformation can be found at my GitHub.

Dataset Description

First, I describe the key elements of the data. The data set includes 40 columns in each year's records and the column names completely changed after 2015. My first step was to rename the columns in older records for the relevant columns to match with the newer records. The relevant columns include:

1) EMPLOYER_NAME: Name of the employer submitting the H-1B application. Used in comparing salaries and number of applications of various employers.

2) JOB_TITLE: Title of the job, which we can use to filter specific job positions, e.g., Data Scientist, Data Engineer etc.

3) PREVAILING_WAGE: The prevailing wage for a job position is defined as the average wage paid to similarly employed workers in the requested occupation in the area of intended employment. The prevailing wage is based on the employer’s minimum requirements for the position. (Source). This column will be one of the key metrics of the data analysis.

4) WORKSITE_CITY, WORKSITE_STATE: The foreign worker’s intended area of employment. We will explore the relationship between prevailing wage for Data Scientist position across different locations.

5) CASE_STATUS: Status associated with the last significant event or decision. Valid values include “Certified,” “Certified-Withdrawn,” “Denied,” and “Withdrawn”. This feature will help us analyze what share of the H-1B visas is taken by different employers and job positions.

Other important columns include Unit of Pay and whether the Job position is a Full Time position or a Part-Time position.

Data Transformations

The main data transformations I performed are as follows:

Wage Unit of Pay

While 92% of the records provide Wage at the Year scale, 7.73% provide the information at Hour scale. As only 0.02% of the records have missing information, I remove such records from further analysis. For the remaining records, I convert them to the Year scale.
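The conversion to the Year scale can be sketched as follows. This is a minimal Python illustration rather than the project's R code, and the multipliers (for example 2080 working hours per year) and the unit labels are assumptions, not values taken from the dataset documentation.

```python
# Multipliers for converting a wage to a yearly figure.
# Assumed values for illustration: 2080 = 40 hours x 52 weeks.
YEARLY_MULTIPLIER = {
    "Year": 1,
    "Month": 12,
    "Bi-Weekly": 26,
    "Week": 52,
    "Hour": 2080,
}


def to_yearly_wage(wage, unit):
    """Return the wage on a yearly scale, or None if the unit is missing or unknown."""
    if unit not in YEARLY_MULTIPLIER:
        return None
    return wage * YEARLY_MULTIPLIER[unit]


# Made-up sample records; the last one has a missing unit and is dropped.
records = [
    {"wage": 60000.0, "unit": "Year"},
    {"wage": 30.0, "unit": "Hour"},
    {"wage": 55000.0, "unit": None},
]

yearly = []
for record in records:
    converted = to_yearly_wage(record["wage"], record["unit"])
    if converted is not None:
        yearly.append(converted)

print(yearly)
```

Records whose unit is missing return None and are skipped, which mirrors the removal of the records with missing information.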

Imputing Full-Time Position

Interestingly, 21.6% of the records have missing values regarding the Full Time Position. For filling the missing values, I analyze the relationship of the Prevailing Wage with Full Time Position across the years.

Figure 1. Missing values for Full-Time Position

Observations from Figure 1:

100% of the records from 2016 have missing values. As expected, the median wage for Full-Time positions is higher than for Part-Time positions.

Based on the 75th percentile value for Part-Time positions, I select 70000 as the Prevailing Wage cut-off. Records with a missing Full-Time value are then labeled Full-Time or Part-Time depending on which side of this cut-off their wage falls.
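A minimal Python sketch of this imputation rule is shown below. The "Y"/"N" encoding of the Full-Time flag and the treatment of a wage exactly equal to the cut-off are assumptions made for illustration.

```python
FULL_TIME_WAGE_CUTOFF = 70000  # cut-off quoted in the text


def impute_full_time(full_time, yearly_wage, cutoff=FULL_TIME_WAGE_CUTOFF):
    """Keep a known flag; otherwise infer it from the yearly prevailing wage.

    Assumption: the flag is encoded as "Y" or "N", and a wage equal to the
    cut-off counts as Full-Time.
    """
    if full_time in ("Y", "N"):
        return full_time
    return "Y" if yearly_wage >= cutoff else "N"


print(impute_full_time(None, 85000))  # missing flag, wage above the cut-off
print(impute_full_time(None, 42000))  # missing flag, wage below the cut-off
print(impute_full_time("N", 90000))   # known flag is left untouched
```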

Work Site Spelling Corrector

I observed that many of the Worksite values had spelling errors. For example, New York was misspelled New Yrok 16 times, San Francisco was misspelled San Fransisco 82 times and Sunnyvale was misspelled Suunyvale 11 times. These are just a few examples. To correct the spellings systematically, I implemented a Spell Corrector that uses a probabilistic model described here.

Briefly, this spell corrector generates every candidate that is within an edit distance of 1 from a given word. The edits, each applied at a single position in the word, are: deleting a letter, swapping two adjacent letters, inserting a new letter, and replacing a letter with another letter of the English alphabet. Once these candidates are obtained, the one with the highest occurrence in the list of work sites in our dataset is selected as the correct spelling. The code for this spelling corrector can be found on my GitHub. This code uses the hashmap package to map every worksite to its frequency of occurrence in the dataset.

As Houston is present in Texas, California and a few other states, it would be erroneous to consider only the Worksite city for this spelling correction. Therefore, I include both the worksite city and the worksite state when computing the frequencies of occurrence before performing the spelling correction.
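The idea can be sketched in Python as follows. This is an illustration of the 1-edit approach keyed on (city, state) pairs, not the R implementation from the project; the function names and the frequency counts for cities other than the misspellings quoted above are made up.

```python
from collections import Counter
import string

# Lowercase letters plus a space, so multi-word city names can be edited too.
LETTERS = string.ascii_lowercase + " "


def edits1(word):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct_city(city, state, freq):
    """Pick the most frequent known spelling within one edit, for the same state."""
    candidates = {c for c in edits1(city) | {city} if (c, state) in freq}
    if not candidates:
        return city  # nothing known nearby: leave the value unchanged
    return max(candidates, key=lambda c: freq[(c, state)])


# Hypothetical frequency table keyed on (city, state).
freq = Counter({
    ("new york", "NY"): 5000,
    ("new yrok", "NY"): 16,
    ("san francisco", "CA"): 3000,
    ("san fransisco", "CA"): 82,
    ("houston", "TX"): 2000,
})

print(correct_city("new yrok", "NY", freq))
print(correct_city("san fransisco", "CA", freq))
```

Because the lookup key includes the state, a city name is only corrected towards spellings that occur in the same state.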

Geocoding

I look up the latitude and longitude of the work sites. This information helps in creating map plots for the metrics considered in the data analysis.

The ggmap package provides a convenient way of finding the geocode of a location given in string format. However, there is a limit of 2500 requests per day. Therefore, I find the geocode only for the top 2500 work sites based on the number of H-1B applications observed in our dataset. These top 2500 work sites cover 96.47% of the records in our dataset, which is sufficient for the data analysis.
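Selecting the most frequent work sites and checking how many records they cover can be sketched as below. This is a Python illustration with made-up data; the function name and the "CITY, STATE" string format are assumptions.

```python
from collections import Counter


def top_worksites(worksites, n):
    """Return the n most frequent worksites and the fraction of records they cover."""
    if not worksites:
        return [], 0.0
    top = Counter(worksites).most_common(n)
    covered = sum(count for _, count in top)
    return [site for site, _ in top], covered / len(worksites)


# Made-up sample: one worksite string per record.
sample = ["NEW YORK, NY"] * 5 + ["HOUSTON, TX"] * 3 + ["AUSTIN, TX"] * 2

sites, coverage = top_worksites(sample, 2)
print(sites, coverage)
```

Only the returned sites would then be sent to the geocoding service, keeping the number of requests within the daily limit.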

Scraping Cost of Living Index

Lastly, I used the Scrapy package in Python to collect the Cost of Living plus Rent index for top cities in the US. I expect that the wages offered for the same job position may vary significantly with the cost of living. This will be another component of my data analysis.

The data was scraped from here and comprises 119 cities. The GitHub code for the Scrapy spider can be found here.

With this, I complete the data transformations and next begin the data analysis.

Data Analysis

I begin the data analysis by focusing on the employers with the highest number of applications in the dataset and compare the salaries they offer with those of other popular employers. Next, I focus on the applications related to Data Science job positions. Last, I explore the distribution of Data Science related jobs across the US and the relationship with cost of living. Let's begin!

High Applicant Employers

The questions I will be answering through data include:

1) Which Employers submit the most H-1B visa applications?

2) What is the Percentage share out of the 85,000 visa cap for the Employers with most applications?

3) What are the most common Job Titles applied for by the high applicant employers?

4) How do the salaries compare for Software jobs between high applicant employers and other popular employers?

Figure 2. Total Applications in 2011-2016 by the top 10 Employers with most applications

Observations from Figure 2:

Infosys leads the pack by a huge margin with over 30000 applications in 2013 and 2015. The Top 10 list is dominated by the Indian IT companies. In 2016, we observe a slight dip in the number of applications from Infosys, Wipro, Tata Consultancy, IBM India and HCL America. This might be because of the increased use of automation in the IT industry. According to this article, the Indian IT firms have been preparing for a reduced number of H-1B visas for nearly a decade through increased focus on automation, cloud computing and artificial intelligence.

Figure 3. Percentage share for the employers out of the H-1B visa cap

In Figure 3, I assume each certified H-1B application corresponds to a unique H-1B visa. Accordingly, if 8500 of an employer's H-1B visa applications were certified, then its percentage share of the 85,000 visa cap is 10%. I use the CASE_STATUS column in the records to find out whether an H-1B visa petition has been certified.
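The share computation amounts to counting certified applications per employer and dividing by the cap. A minimal Python sketch is given below; it assumes the records passed in belong to a single year, that only the exact status "Certified" counts (compared case-insensitively), and the employer names are made up.

```python
from collections import Counter

VISA_CAP = 85000  # annual cap quoted in the text


def cap_share(records, cap=VISA_CAP):
    """Percentage of the visa cap per employer, counting certified cases only.

    Assumption: `records` holds one year's petitions and each certified
    petition corresponds to one visa.
    """
    certified = Counter(
        r["EMPLOYER_NAME"] for r in records
        if r["CASE_STATUS"].upper() == "CERTIFIED"
    )
    return {employer: 100.0 * n / cap for employer, n in certified.items()}


# Made-up sample: 8500 certified and 500 denied petitions for one employer.
sample = (
    [{"EMPLOYER_NAME": "EMPLOYER A", "CASE_STATUS": "Certified"}] * 8500
    + [{"EMPLOYER_NAME": "EMPLOYER A", "CASE_STATUS": "Denied"}] * 500
)

shares = cap_share(sample)
print(shares)
```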

Observations from Figure 3:

1. Over 90% of the certified H-1B visa applications belong to the employers

2. Infosys takes the biggest share, more than double that of most of the remaining top 10 high-applicant employers.

Figure 4. Most common Job Positions applied for by high-applicant employers

Observations from Figure 4:

Technology related jobs fill the majority of the positions, with the top 3 jobs being Technology Lead, Technology Analyst and Computer Programmer. Consultant and Manager related jobs fill the remaining spots.

Figure 5. Wages for most common Job Positions applied for by high-applicant employers

Observations from Figure 5:

1. As expected, the Manager level jobs and Lead Consultant job titles have the highest wages.

2. The Software Engineering jobs including Programmer analyst, Computer Programmer, Computer Systems Engineer have wages close to 60000 USD per annum.

3. Test Analyst and Systems Engineer have the lowest wages with the median slightly above 50000 USD.

Based on this data, it will be interesting to find out how the wages offered for Software related job titles by the high-applicant employers compare with those of top Software companies like Google, Amazon, Facebook etc. For this purpose, I filter job titles containing the terms Programmer, Computer, Software, Systems or Developer from the dataset and consider these positions to be software jobs. Next, I compare the wages offered by 5 high-applicant companies (IBM, Infosys, Wipro, Tata Consultancy Services and Deloitte) with those of Google, Amazon, Microsoft and Facebook.
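The keyword filter and the per-employer comparison can be sketched as follows. This is a Python illustration with made-up records; case-insensitive substring matching is an assumption about how the terms are applied.

```python
from statistics import median

# Terms listed in the text; matching is assumed to be case-insensitive.
SOFTWARE_TERMS = ("PROGRAMMER", "COMPUTER", "SOFTWARE", "SYSTEMS", "DEVELOPER")


def is_software_job(job_title):
    """True if the job title contains any of the software-related terms."""
    title = job_title.upper()
    return any(term in title for term in SOFTWARE_TERMS)


def median_software_wage(records):
    """Median prevailing wage of software jobs, grouped by employer."""
    wages = {}
    for r in records:
        if is_software_job(r["JOB_TITLE"]):
            wages.setdefault(r["EMPLOYER_NAME"], []).append(r["PREVAILING_WAGE"])
    return {employer: median(values) for employer, values in wages.items()}


# Made-up sample records.
sample = [
    {"EMPLOYER_NAME": "EMPLOYER A", "JOB_TITLE": "Software Engineer", "PREVAILING_WAGE": 100000},
    {"EMPLOYER_NAME": "EMPLOYER A", "JOB_TITLE": "Web Developer", "PREVAILING_WAGE": 90000},
    {"EMPLOYER_NAME": "EMPLOYER B", "JOB_TITLE": "Computer Programmer", "PREVAILING_WAGE": 60000},
    {"EMPLOYER_NAME": "EMPLOYER B", "JOB_TITLE": "Accountant", "PREVAILING_WAGE": 55000},
]

medians = median_software_wage(sample)
print(medians)
```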

Figure 6. Wage Comparison of Employers for Software Jobs

Observations from Figure 6:

1. Clearly, the high-applicant employers with the most H-1B visas have significantly lower wages for Software job positions.

2. The median wage for the IT companies is lower than 70000 USD whereas for the top software companies the median wage is above 85000 USD.

3. Facebook and Google have a median above 100000 USD.

Next, I focus on the Data Science related job positions.

Data Science Jobs

Figure 7. H-1B Visa Applications for Data Science jobs

Observations from Figure 7:

1. Data Scientist and Data Engineer positions have seen exponential growth in the last 6 years.

2. Job Titles with Machine Learning explicitly in them are still few in number (< 75 in any year).

3. In 2016, Data Scientist position broke the 1000 barrier on the number of H-1B Visa applications.

Figure 8. Wages for Data Science jobs

Observations from Figure 8:

1. Machine Learning jobs have the highest median wage, although the number of Job Titles with Machine Learning explicitly in them is less than 75 in any year.

2. Median wage for Data Engineer jobs is consistently increasing.

3. Median wage for Data Scientist positions has decreased slightly since 2012, although this is the position that has seen the most growth in the last 6 years.

Location Distribution of Data Science jobs

Figure 9. H-1B Visa applications for Data Science jobs per State

In Figure 9, I keep only the states with at least 50 Data Science related jobs in the last 6 years. Hence, the figure doesn't display all the US states. By Data Science job, I mean a job whose title contains Data Scientist, Data Engineer or Machine Learning.
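A minimal Python sketch of this filter is shown below. The case-insensitive matching, the function names and the sample records are assumptions made for illustration.

```python
from collections import Counter

# Terms from the definition above; matching is assumed to be case-insensitive.
DATA_SCIENCE_TERMS = ("DATA SCIENTIST", "DATA ENGINEER", "MACHINE LEARNING")


def is_data_science_job(job_title):
    """True if the job title contains one of the Data Science terms."""
    title = job_title.upper()
    return any(term in title for term in DATA_SCIENCE_TERMS)


def states_with_min_jobs(records, min_jobs=50):
    """Count Data Science applications per state and keep states at or above the threshold."""
    counts = Counter(
        r["WORKSITE_STATE"] for r in records if is_data_science_job(r["JOB_TITLE"])
    )
    return {state: n for state, n in counts.items() if n >= min_jobs}


# Made-up sample records.
sample = (
    [{"JOB_TITLE": "Data Scientist", "WORKSITE_STATE": "CA"}] * 60
    + [{"JOB_TITLE": "Senior Data Engineer", "WORKSITE_STATE": "NY"}] * 50
    + [{"JOB_TITLE": "Machine Learning Engineer", "WORKSITE_STATE": "OH"}] * 10
    + [{"JOB_TITLE": "Accountant", "WORKSITE_STATE": "OH"}] * 100
)

print(states_with_min_jobs(sample))
```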

Observations from Figure 9:

1. California leads the pack by a huge margin with over 2000 applications.

2. New York, Washington, Massachusetts and Texas form the remaining top 5 positions.

This result is expected as these states are hubs of technology innovation, with California housing Silicon Valley, New York housing finance and media corporations, and Washington housing technology giants including Microsoft and Amazon.

3. Surprisingly, only 11 states passed the barrier of 50 H-1B applications related to Data Science in the last 6 years.

Figure 10. Mean Wage for Data Science jobs per State

Observations from Figure 10:

1. California has not only the most jobs but also the highest wages. This might be due to the higher cost of living, as I will analyze later.

2. There is significant variation in the mean wage across the states.

3. I excluded Massachusetts as it had an anomalous mean wage of 1,500,000 USD per annum.

Next, I dive deeper into analyzing data science positions at the granularity of Worksite city.

Figure 11. Hot spots for Data Science Jobs

Observations from Figure 11:

1. San Francisco leads the chart with the most jobs.

2. Inside California, the jobs are not uniformly distributed. Instead, they are mainly clustered near San Francisco.

Cost of Living

Last, I explore the relationship between cost of living and the wage offered for Data Science jobs.

Figure 12. Cost of Living vs Data Science job wage

Observations:

1. A general increase in the Wage is observed with the cost of living although there are slight dips across the curve.

2. The standard deviation decreases as we move towards locations with higher cost of living index.

Conclusion & Future Work

To conclude, in this project, I performed exploratory data analysis on the H-1B visa petition disclosure data for the period 2011-2016. I found that the employers with the most H-1B visa applications pay significantly lower wages than other employers for similar job positions. I also found that the Data Scientist position has experienced exponential growth in terms of H-1B visa applications. Interestingly, the Data Scientist jobs are clustered in a few hotspots, with the San Francisco region having the highest number.

I expanded this project to build a Shiny app which can be accessed at https://sharan-naribole.shinyapps.io/h_1b/. I will discuss the functionality of this Shiny app in more detail in my next blog post.

The GitHub code for this project can be found here. Thanks for reading!