What is a Data Janitor?

From Wikipedia: “A data janitor is a person who works to take big data and condense it into useful amounts of information. Also known as a ‘data wrangler,’ a data janitor sifts through data for companies in the information technology industry. A multitude of start-ups rely on large amounts of data, so a data janitor works to help these businesses with this basic, but difficult process of interpreting data.”

Search LinkedIn or AngelList and you won’t see a single job posting for a data janitor. Why is that? A data scientist’s time is valuable, and estimates put 50-80% of that time toward tedious data cleaning. I am surprised companies are not capitalizing on the trend of people wanting to become data scientists by hiring data janitors into entry-level positions.

Janitorial Cleaning Cart

When I first started using R, data manipulation was daunting, and I avoided using R packages. At first I would do the cleaning manually in Excel and then import the data back into R for charting and analysis. Then I found sqldf, which let me run SQL SELECT statements against R data frames. As I grew more comfortable writing SQL queries to filter and join, this was my tool of choice for a while. But sqldf() is not always the most efficient, especially with larger data frames.

#SQLDF Example
library(sqldf)
scoredist <- sqldf("SELECT boro, score, COUNT(score) AS count
                    FROM df
                    WHERE boro <> 'Missing'
                    GROUP BY boro, score")

Hands-on dplyr tutorial for faster data manipulation in R

dplyr describes itself as “a grammar of data manipulation.” I had heard of dplyr and plyr for a while before I finally gave the package a shot, and now I wish I had started earlier.
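For comparison, here is a sketch of the sqldf query above rewritten with dplyr verbs, assuming the same df data frame with boro and score columns:

```r
library(dplyr)

# Filter out missing boroughs, then count scores per borough
scoredist <- df %>%
  filter(boro != "Missing") %>%
  group_by(boro, score) %>%
  summarise(count = n())
```

The verbs read in the order they execute, which is a large part of dplyr’s appeal over nested SQL strings.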

Pandas

When using Python, the library of choice for data cleaning is pandas. Here are some basic pandas commands.

import pandas as pd

# View the head and summary statistics
df.head()
df.describe()

# SQL-like merge/join
df = pd.merge(yelp_df, Inspection_df, how='inner', on='Phone')

# Basic string contains: flag rows where the From column matches
df.loc[df['From'].str.contains("MyEmail@gmail.com"), 'Me'] = 1

# Drop rows with any missing data
df.dropna(how='any')

# Get rows with Rating above 3
yelp_df[yelp_df.Rating > 3]

What am I actually cleaning for?

A clean room is not hard to imagine: sparkling floors and an organized desk. But what does clean data look like? In a tidy data set, each variable is saved in its own column and each observation in its own row.
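As a concrete sketch of that rule, a table with one column per month is untidy; pandas’ melt reshapes it so each observation gets its own row (the column names here are invented for illustration):

```python
import pandas as pd

# Untidy: each month's sales lives in its own column
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan_sales": [100, 80],
    "feb_sales": [120, 90],
})

# Tidy: one row per (store, month) observation
tidy = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(tidy)
```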

Standardize formats: dates should follow the same syntax (e.g. YYYY-MM-DD); columns should use the correct data types (characters or factors?); currency should be consistent and labeled
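In pandas, date standardization and type correction look roughly like this (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "inspected": ["2015-1-5", "2015-02-03", "2015-11-30"],  # inconsistent padding
    "boro": ["Queens", "Bronx", "Queens"],
})

# Parse the strings and emit one consistent YYYY-MM-DD syntax
df["inspected"] = pd.to_datetime(df["inspected"]).dt.strftime("%Y-%m-%d")

# Store the repeated text column as a categorical (R's "factor")
df["boro"] = df["boro"].astype("category")
print(df.dtypes)
```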

Trim extraneous spaces
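Trimming is a one-liner with the pandas string accessor; the sample values below are invented:

```python
import pandas as pd

df = pd.DataFrame({"name": ["  Joe's Pizza ", "Shake Shack", " Katz's  "]})

# str.strip removes leading and trailing whitespace from every value
df["name"] = df["name"].str.strip()
print(df["name"].tolist())
```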

Label columns to remove ambiguity
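Renaming is cheap insurance against misreading a column later; the cryptic headers here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"dt": ["2015-01-02"], "amt": [9.99]})

# Cryptic headers invite mistakes; rename makes the meaning explicit
df = df.rename(columns={"dt": "inspection_date", "amt": "fine_usd"})
print(df.columns.tolist())
```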

Remove duplicate records
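In pandas this is drop_duplicates, which keeps the first copy of each repeated record (sample data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Phone": ["7185551234", "7185551234", "2125559876"],
    "Rating": [4, 4, 3],
})

# Keep one copy of each fully identical row
deduped = df.drop_duplicates()
print(len(deduped))
```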

Sanity-check the aggregation type (i.e. are you looking for a count or a sum?)
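The two aggregations answer different questions, as a toy groupby shows (data invented): count answers “how many records?”, sum answers “how much in total?”.

```python
import pandas as pd

df = pd.DataFrame({"boro": ["Queens", "Queens", "Bronx"],
                   "score": [10, 30, 20]})

counts = df.groupby("boro")["score"].count()  # how many inspections per borough
totals = df.groupby("boro")["score"].sum()    # total points per borough
print(counts)
print(totals)
```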

Check the length of variables (e.g. a zip code should have a length of 5)
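A length check also catches zip codes that lost a leading zero, a common casualty of a round trip through Excel (values invented):

```python
import pandas as pd

df = pd.DataFrame({"zip": ["10027", "7030", "11201"]})

# Flag entries whose length is not 5
bad = df[df["zip"].str.len() != 5]
print(bad)

# One fix: left-pad with zeros back to 5 characters
df["zip"] = df["zip"].str.zfill(5)
```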

Sort the data and check for outliers (flag them for removal)
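One simple screen, assuming roughly well-behaved data: flag values more than two standard deviations from the mean for manual review (the series below is invented):

```python
import pandas as pd

scores = pd.Series([12, 15, 14, 13, 16, 14, 98])  # 98 looks suspicious

# Flag values far from the mean; eyeball them before deleting anything
flagged = scores[(scores - scores.mean()).abs() > 2 * scores.std()]
print(flagged)
```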

Missing data: is the data complete (why does one month have a third as many data points as another?), and are there NaN or blank rows and columns?
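Both questions are quick to ask in pandas; the sample frame and dates here are invented:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "score": [13.0, np.nan, 27.0, np.nan],
    "boro": ["Queens", "Bronx", None, "Queens"],
})

# Count the holes per column before deciding what to do about them
print(df.isna().sum())

# Completeness check: how many records land in each month?
dates = pd.to_datetime(["2015-01-02", "2015-01-20", "2015-02-11"])
monthly = pd.Series(dates).dt.month.value_counts()
print(monthly)
```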



Don’t forget to automate when possible. I noticed an account manager spent an hour each week cleaning data in Excel before uploading it to the UI. Using Python, I was able to write a script that automated the steps and saved her that hour each week. Data cleaning is not the most fun, so if the cost-benefit math works out, why not automate?
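A minimal sketch of that kind of weekly job might look like this; the file names and cleaning steps are invented stand-ins for whatever the manual Excel routine did:

```python
import pandas as pd

def clean_weekly_report(path_in: str, path_out: str) -> pd.DataFrame:
    """Replay the manual Excel steps: tidy headers, dedupe, drop empties, save."""
    df = pd.read_csv(path_in)
    df.columns = [c.strip().lower() for c in df.columns]  # tidy headers
    df = df.drop_duplicates().dropna(how="all")           # remove junk rows
    df.to_csv(path_out, index=False)
    return df
```

Schedule it with cron or Task Scheduler and the hour a week becomes seconds.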

And no package that I know of can replace a fresh set of eyes looking at the data to see whether the values are reasonable. Sure, it would be great if you had a million page views in a day, and that would make sense for some blogs, but you need to know your baselines. Garbage in, garbage out; as a data janitor, one should make sure to take the trash out.