To Data Science and Beyond! — The Role of a Data Scientist.

Editor — Ishmael Njie

“Data Science combines different fields of work in statistics and computation in order to interpret data for the purpose of decision making.” [1]

Data Scientists are employed to make sense of raw data and aim to make the data increasingly valuable for a company. The services of a Data Scientist are key in the retail sector, where they give businesses a better understanding of their audience and how to target their products. Data Scientists are also key in the following industries: finance, entertainment and healthcare.

Data Scientists call on techniques from statistics to make informed decisions about the data in their possession. We have published a story on “The Role of Mathematics in Data Science”, which gives a brief overview of the techniques in Mathematics that are applicable to the field of Data Science.

Having employed a Data Scientist, an employer is looking to understand the following:

How can the data be turned into profit?

Can efficiency be increased through the analysis of particular sets of data?

How can the company use this data to grow?

To answer all of the above, a Data Scientist needs to work through the following stages:

Collection of data

Understanding & Cleaning the data

Visualising the data

Modelling the data

Contextual Analysis of the data

Collection of data

We need data from somewhere…

(For those that are looking for a dataset for particular projects, KDnuggets has a great page, listing many data repositories)

A Data Scientist is likely to find data from external sources (perhaps from the repositories above) and/or internal sources such as databases. Common data formats include:

Comma-separated values (.csv) files: text files in which the fields of each record are separated by a delimiter (usually “,”, although other characters such as “;” or a tab are also used). Each line is a new record.

JavaScript Object Notation (.json): a human-readable text format that is widely used for transmitting data.
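To make the two formats concrete, here is a minimal sketch that reads each of them with Python's standard library. The sample records and the helper names (read_csv_records, read_json_record) are made up for illustration; real data would normally be read from files rather than from strings.

```python
import csv
import io
import json

# Hypothetical sample data; in practice this text would come from files on disk.
CSV_TEXT = "customer_id,region,spend\n1,North,120.50\n2,South,89.99\n"
JSON_TEXT = '{"customer_id": 3, "region": "East", "spend": 42.0}'


def read_csv_records(text):
    """Parse CSV text into a list of dicts, one per line after the header."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_record(text):
    """Parse a single JSON object into a Python dict."""
    return json.loads(text)


csv_records = read_csv_records(CSV_TEXT)
json_record = read_json_record(JSON_TEXT)

# Every value read from CSV is a string; JSON keeps numbers as numbers.
print(type(csv_records[0]["spend"]).__name__)
print(type(json_record["spend"]).__name__)
```

One practical difference shows up straight away: CSV carries no type information, so numeric columns need converting after loading, whereas JSON distinguishes numbers from strings.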

Tools:

Database management skills: Use of the Structured Query Language (SQL), a language designed for storing, querying and manipulating data in databases.

Frameworks: Apache Hadoop, Apache Spark. Specifically, Hadoop’s core element is the Hadoop Distributed File System (HDFS), a system that stores data across computer clusters. Spark, on the other hand, works with multisets of data that are distributed over the cluster, known as resilient distributed datasets (RDDs).
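As a small illustration of the SQL skills mentioned above, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The sales table and its values are invented for the example; a company database would have its own schema and would usually be a server-based system rather than SQLite.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 30.0)],
)

# Aggregate the raw rows into one total per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)
```

The GROUP BY query is the kind of step that turns individual transactions into a summary a business can act on.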

Understanding & Cleaning the data

This stage is arguably the most important. To limit the inaccuracy of possible models and conclusions drawn from the data, Data Scientists need to understand the data that is in front of them and need to look into cleaning it where possible.

What do I mean by cleaning?

We need a clean up on Row 7!

Essentially, a Data Scientist will look for errors, missing values or incorrect records in the dataset. They will then omit or amend those records so that the contextual analysis of the data with respect to the problem is as accurate as possible.

Tools:

Programming languages such as Python & R. Python has a library called pandas, which allows for the cleaning and manipulation of data in the form of DataFrames. The packages dplyr and tidyr are available in R to perform Data Wrangling (the transforming and mapping of data).
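The cleaning steps described above (removing duplicates, handling missing values, dropping impossible records) might look something like this in pandas. The column names, the sample rows and the 0 to 120 age range are assumptions made for the example; which records to omit or amend always depends on the dataset and the problem.

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing age and an impossible age.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4],
        "age": [34, 29, 29, None, -5],
        "spend": [120.5, 89.99, 89.99, 42.0, 15.0],
    }
)


def clean(df):
    """Return a cleaned copy of df; the input DataFrame is left untouched."""
    df = df.drop_duplicates()            # remove exact duplicate records
    df = df.dropna(subset=["age"])       # omit records with a missing age
    df = df[df["age"].between(0, 120)]   # omit ages outside an assumed valid range
    return df.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Here the faulty records are simply omitted. Amending them instead (for example, filling a missing age with the median) is an equally valid choice in some settings.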

Visualising the data

Here, the objective is to find patterns within the dataset. Data Scientists then use statistics to identify the features of the dataset that are statistically significant, while keeping the visuals understandable to the viewer. In other words, if the p-value of a variable is less than 0.05 (a conventional significance threshold), it will be mentioned!
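To show where a p-value can come from, here is a rough sketch of a permutation test for the correlation between two variables, written with the standard library only. This is one of several ways to obtain a p-value and is not necessarily what a given Data Scientist would use. The ad_spend and sales figures are invented, and the function names are hypothetical.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate how often shuffled data looks at least as correlated as the real data."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical figures, for illustration only.
ad_spend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sales = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2]

p_value = permutation_p_value(ad_spend, sales)
print("worth mentioning" if p_value < 0.05 else "not significant at 0.05")
```

The idea is that if shuffling one variable rarely produces a correlation as strong as the one observed, the relationship is unlikely to be due to chance alone.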

Tools:

An understanding of Statistics here is incredibly important in the process of using visuals to represent the data. Tableau is an example of a great visualisation tool.

Useful libraries in Python include: NumPy, Matplotlib, pandas, Seaborn

Useful packages in R include: ggplot2 and ggvis

Modelling the data

Predictive Analysis is a business’ Fortune Teller

Machine Learning is a powerful tool for any business. After cleaning the data, predictive analysis can be used to aid in making decisions to solve problems or grow areas within the company. As mentioned in our first story, Regression is a tool that can be used to predict future instances. A business could use predictive analysis to forecast its profits for the next year and then tailor its decisions based on the output of the predictive model.

Tools:

Machine Learning concepts such as Regression through Supervised Learning and/or K-Means Clustering through Unsupervised Learning.

Python has scikit-learn, a machine learning library that features regression and clustering algorithms. R has caret, the Classification And REgression Training package.
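As a sketch of the profit-prediction idea, the snippet below fits a simple linear regression with scikit-learn and predicts the next year's figure. The yearly profits are made-up numbers chosen to lie on a straight line. Real data would be noisier and would call for validation before anyone relied on the forecast.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical data: year index and profit (in thousands); not real figures.
years = [[1], [2], [3], [4], [5]]   # scikit-learn expects a 2D feature array
profits = [110.0, 120.0, 130.0, 140.0, 150.0]

# Supervised learning: fit the model on known (year, profit) pairs.
model = LinearRegression()
model.fit(years, profits)

# Predict the profit for the year after the last observed one.
next_year_profit = model.predict([[6]])[0]
print(round(next_year_profit, 1))
```

The fitted slope (model.coef_) is the estimated change in profit per year, which is often as useful to the business as the prediction itself.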

Contextual Analysis of the data

Where does the company go from here? What do we set our prices at? Is something wrong?

These are all questions that the business will ask a Data Scientist, and it is after the computation above that the Data Scientist makes recommendations. Data Scientists need to understand the problem at hand, and they also need to keep the visuals simple, as this is an essential part of communicating with a non-technical audience. Their aim is to help the audience, not to confuse them. The model gives Data Scientists evidence for how the business should proceed.

Tools:

Knowledge of the company

General business knowledge

Communication skills

Data Visualisation

Conclusion

Understand the problem given; obtain the data; understand and clean the data to prevent disastrous outcomes in analysis; present the data with visuals that are simple but informative for the audience; model the data (if needed) to predict instances that will give the company an idea of what is to be expected; and finally, propose a solution to the problem. Data Scientists also need to look at updating the models where possible. As time goes on, new variables/features will arise and need to be considered in the predictive analysis. The previous model may no longer be a good representation of the data, given the new features. New features will bring new solutions, which then need to be relayed.

Remember: Businesses aim to make sense of the data that pours through their company to make decisions which will solve existing problems. The job of a Data Scientist is to analyse that data, present that data and make a recommendation to the company on how to solve such problems.