Getting Your Data

The data grocery store

The current technology landscape of data formats and stores is vast, to say the least. From good ol’ comma-separated values (CSVs) stored on Amazon S3 to more complex data types like MongoDB’s BSON, your data can—and does—live everywhere.

While dealing with your own localized data store may be familiar enough, incorporating new data from external sources (or even the department down the hall) means ensuring that the new data is ready for your processing tools. The act of simply seeing your data means overcoming two major potential barriers:

how the data is stored (the format)

where the data is stored (the storage)

Large enterprises and individuals alike have to jump through the authentication and compatibility hoops that come with how their data is stored. Often, gaining access means getting the right permissions for the organization’s data lake or converting between file formats with external packages. And that is just to start looking at the data!
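
As a rough illustration of the format hurdle, here is a minimal sketch of converting CSV into JSON using only Python's standard library. The sample data, the column names, and the helper names `csv_to_records` and `records_to_json` are all made up for this example; real data would more likely arrive from a file or an object store, with its own authentication step in front of it.

```python
import csv
import io
import json

# Hypothetical CSV content standing in for a file pulled from storage.
RAW_CSV = """emp_no,name,department
1001,Ada,Engineering
1002,Grace,Research
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def records_to_json(records):
    """Serialize the records as a JSON array (one possible target format)."""
    return json.dumps(records, indent=2)


records = csv_to_records(RAW_CSV)
as_json = records_to_json(records)
print(as_json)
```

Note that every value comes back as a string. CSV carries no type information, which is one reason a format conversion is rarely the end of the story.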

There are many ways to centralize where and how your data is stored, which is why knowing where to start your data “shopping” is the essential, and obvious, first step of any data science process.

Knowing Your Data

The data aisles

Once you can actually get your hands on the data, you need to make sense of it. Column headers are for humans, right? I can just read them, right? Hopefully, yes. However, anyone who has seen enough column headers has surely run into titles that are less than helpful.

no empanadas?

Who’s to say whether a column entitled “emp_no” contains employee_numbers or instead signifies empanadas_none?

The only way to really understand what’s going on in the data is to look beyond the schema and quickly analyze the kinds of values that make up your data points. Joining the Name column of a Businesses table to a column also called Name in a grocery store’s Vegetable Inventory table will not be fruitful, even though the headers match. Data scientists are sharp, so for a small table a quick glance-over is usually enough to establish what the data actually contains.
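
To make that glance-over concrete, here is a small sketch in plain Python. It profiles the values in a column and measures how many join keys two tables actually share. The tables, their values, and the helper names `guess_type`, `profile_column`, and `key_overlap` are hypothetical, chosen only to mirror the Businesses and Vegetable Inventory example.

```python
from collections import Counter

# Hypothetical rows from two unrelated tables that both have a "Name" column.
businesses = [
    {"Name": "Acme Corp", "emp_no": "1200"},
    {"Name": "Globex", "emp_no": "85"},
    {"Name": "Initech", "emp_no": ""},
]
vegetables = [
    {"Name": "Carrot", "stock": "40"},
    {"Name": "Kale", "stock": "12"},
]


def guess_type(value):
    """Classify a raw string as 'empty', 'int', 'float', or 'text'."""
    if value.strip() == "":
        return "empty"
    for caster, label in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return label
        except ValueError:
            pass
    return "text"


def profile_column(rows, column):
    """Summarize the kinds of values in a column and show a few samples."""
    values = [row[column] for row in rows]
    return {
        "types": dict(Counter(guess_type(v) for v in values)),
        "distinct": len(set(values)),
        "samples": values[:3],
    }


def key_overlap(left, right, column):
    """Fraction of distinct left-hand keys that also appear on the right."""
    left_keys = {row[column] for row in left}
    right_keys = {row[column] for row in right}
    return len(left_keys & right_keys) / len(left_keys) if left_keys else 0.0


print(profile_column(businesses, "emp_no"))
print(key_overlap(businesses, vegetables, "Name"))
```

A profile full of integers suggests that emp_no is a number of some kind rather than a flag. An overlap of zero is a strong hint that two columns share a header and nothing else.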

At scale, however, as databases grow in both schema and contents, it quickly becomes unwieldy to understand the fine details of every database. You need to know what’s in your data because it informs how you can actually start performing data analysis.

Transforming Your Data

Is your data edible?

Once you can access and understand your data, you can finally start brainstorming the types of questions and insights that could be baked from it. Yet often you’re still not ready for the data analysis process.

Depending on how in-depth your data analysis goals are, from simple visualizations with out-of-the-box tooling to customized machine learning models, your data may need to be transformed so that your data analysis tools can actually ingest it. Transformations range from quality checks, like cleaning out invalid values, to filtering down to only the columns that interest you, or even limiting the amount of data you work with because of how computationally (not to mention financially) expensive it may be. Making sure the data isn’t rotten by filling in empty slots, and getting just the right amount of data by looking at only the interesting columns, is crucial to getting your data ready for cooking.
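
Here is one way those transformations might look in plain Python. It drops rows with invalid values, fills empty slots, keeps only the interesting columns, and caps how many rows are processed. The rows, the rule that "n/a" counts as invalid, the default of 0 for missing stock, and the helper names `to_float` and `prepare` are all assumptions made for this sketch.

```python
# Hypothetical raw rows; the empty strings and "n/a" stand in for bad values.
raw_rows = [
    {"name": "Carrot", "price": "0.50", "stock": "40", "notes": "organic"},
    {"name": "Kale", "price": "n/a", "stock": "12", "notes": ""},
    {"name": "Leek", "price": "1.20", "stock": "", "notes": "seasonal"},
    {"name": "Beet", "price": "0.80", "stock": "7", "notes": ""},
]


def to_float(value):
    """Return the value as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except ValueError:
        return None


def prepare(rows, columns, default_stock=0, limit=None):
    """Clean, fill, project, and optionally truncate the rows."""
    prepared = []
    for row in rows:
        price = to_float(row["price"])
        if price is None:
            continue  # quality check: drop rows whose price is invalid
        # Fill empty stock slots with a default instead of dropping the row.
        stock = int(row["stock"]) if row["stock"].strip() else default_stock
        cleaned = dict(row, price=price, stock=stock)
        # Keep only the columns the analysis cares about.
        prepared.append({col: cleaned[col] for col in columns})
        if limit is not None and len(prepared) >= limit:
            break  # cap the amount of data to keep processing cheap
    return prepared


result = prepare(raw_rows, columns=["name", "price", "stock"], limit=2)
print(result)
```

Whether to drop a bad row or fill it with a default is a judgment call that depends on the analysis. The sketch does one of each simply to show both options.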

What Are You Prepping For?

Finally, food puns aside, this three-step process sets you up for the fun stuff:

Data science!

Everything that gets you from raw data to being able to start your data analytics process falls under these steps, all while wearing your data preparation hat:

Accessing the data… getting to the store

Understanding what’s in it… going to the correct aisles

Making it usable for your data analytics tools… ensuring edibility

All of the above are key processes in setting your data table. The ingredients that set you up for your tooling and actionable insights are crucial, even if gathering them seems tedious. Understanding why data preparation matters and how it works is key to making the best use of your data scientists, understanding what your data is telling you, and, most importantly, getting to data dining.

Make data scientists chefs, not grocery shoppers.

It’s time to eat better, faster, smarter. 👩‍🍳 🍳

All prepped up and ready to go

Interested in getting to cooking faster? Message me or check out our website to see how we can work together on your data science recipes.
