We take it for granted that there is a deluge of data, and that’s not wrong. A report from Domo estimates that 90 percent of the data recorded in human history was generated in the last two years. Organizations are now collecting data at an unprecedented rate, and business leaders are working to leverage this mountain of data into actionable business insights and intelligent data-driven products. And to really do that, you need a Data Scientist.

However, a near-universal problem for Data Scientists is that although data is being collected at an unprecedented rate, obtaining access to that data and working with it can be anywhere from difficult to impossible.

In a recent post, we talked about building a data science roadmap: a set of ideas framed in the language and structure of data science and machine learning that a newly hired Data Scientist can work on. In this post, we will dive deeper into what it takes to ensure that the Data Scientist can actually obtain the data required to begin working on this roadmap.

What Data Does a Data Scientist Need, Anyway?

A typical data science project is centered around the idea of building a model. A model is a mathematical system that in some way mimics the process that is generating your data. If you have a reasonably good model, you can examine it to learn about the process that you’re dealing with, or you can use the model to predict how that process will behave in the future. Say you run an eCommerce site and you wish to build a model to predict how likely a customer is to purchase the product they’re looking at. This is the type of project that a Data Scientist would be responsible for.

Data Scientists build models by showing a computer many previous examples of the outcome in question. To build a model to predict whether a customer will buy a certain product, we need many examples of customers who looked at a product, and we then need to know whether they actually purchased it. This last point has a key subtlety that occasionally trips up data science newbies: We need examples of both customers purchasing products and not purchasing products. If the data contains only purchases, the model has nothing to contrast them with and cannot learn what separates a buyer from a browser. Having both outcomes is the only way to build a truly predictive model.
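To make the point about needing both outcomes concrete, here is a minimal sketch in Python. The records and the returning-visitor feature are invented for illustration, and a simple frequency count stands in for a real model.

```python
from collections import defaultdict

# Hypothetical product-view records: (returning_visitor, purchased).
# purchased is 1 if the view ended in a sale, 0 if it did not.
views = [
    (True, 1), (True, 1), (True, 0), (True, 1),
    (False, 0), (False, 0), (False, 1), (False, 0),
]

def purchase_rate_by_group(records):
    """Estimate P(purchase | returning_visitor) as a simple frequency."""
    counts = defaultdict(lambda: [0, 0])  # group -> [purchases, views]
    for returning, purchased in records:
        counts[returning][0] += purchased
        counts[returning][1] += 1
    return {group: bought / total for group, (bought, total) in counts.items()}

# With both outcomes recorded, the two groups look different.
both_outcomes = purchase_rate_by_group(views)

# With purchases only, every group appears to buy 100 percent of the time.
purchases_only = purchase_rate_by_group([r for r in views if r[1] == 1])

print(both_outcomes)
print(purchases_only)
```

When the non-purchases are dropped, both groups show a purchase rate of 1.0, so the data no longer says anything about who is likely to buy.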

So, for the Data Scientist to work on this problem, she will need a record of visitors’ product browsing history, and an indication of whether each product viewed was purchased or not. Additionally, the Data Scientist requires data about the visitor and their history: demographic data, behavioral data, purchasing history, and any other pieces of data that can be instrumental in building this type of model.
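As a rough sketch of what assembling that data might look like, the Python below combines three invented extracts (product views, purchases, and visitor profiles) into one table of labeled rows. All field names and values are hypothetical; in practice each extract would come from a different system.

```python
# Hypothetical extracts from three separate systems.
product_views = [
    {"visitor_id": 1, "product_id": "A"},
    {"visitor_id": 1, "product_id": "B"},
    {"visitor_id": 2, "product_id": "A"},
]
purchases = {(1, "B")}  # (visitor_id, product_id) pairs that ended in a sale
visitors = {
    1: {"country": "CA", "past_orders": 3},
    2: {"country": "US", "past_orders": 0},
}

def build_training_rows(views, purchases, visitors):
    """Attach visitor data and a purchased/not-purchased label to each view."""
    rows = []
    for view in views:
        key = (view["visitor_id"], view["product_id"])
        profile = visitors.get(view["visitor_id"], {})  # empty if unknown visitor
        rows.append({**view, **profile, "purchased": int(key in purchases)})
    return rows

training_rows = build_training_rows(product_views, purchases, visitors)
for row in training_rows:
    print(row)
```

Each row describes one product view, carries what is known about the visitor, and records whether the view ended in a purchase, which is the shape most modeling tools expect.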

Before setting your Data Scientist off to work on this type of model, it’s a worthwhile exercise to go through this brainstorming process and identify all of the data that will be necessary for her to complete the project.

You Know the Data Exists, but Where Is It?

Although we are collecting data rapidly, most of this collection is incidental, and not for the purpose of data science or machine learning. Data within a single organization is collected through invoices, ledgers, content management systems, customer relationship systems, analytics services, email inboxes, spreadsheets, and so on. The data is messy, inaccessible, and difficult to understand. For each of these data sources, credentials must be created and permissions granted, which may come with non-disclosure agreements and privacy restrictions, creating a range of challenges.

While Data Scientists are increasingly trained to move data from all of these sources and transform it into the formats required by modern machine learning tools, it’s a time-consuming process they can rarely do alone. For this reason, there is a growing job market for Data Engineers: Software Engineers trained to move data reliably from where it was collected to where it needs to be. If your data architecture is particularly complex, a dedicated Data Engineer can be an invaluable asset to the team, helping the Data Scientist build models more efficiently.

Think About Data Engineering

Your Data Scientist might be interested in dealing with data engineering on her own, but even if that’s the case, it’s important to consider the challenges she will face. For the eCommerce site, the Data Scientist requires access to a host of services, and she will be able to hit the ground running if access credentials, privacy considerations, data residency issues, and whatever other hurdles might exist are taken care of before she even begins. This is another thing that managers can do, without any special knowledge of data science or machine learning, to ensure a smooth transition into becoming a data-driven organization.