You Need Real Data Science With Data Visualization

Adequate data analysis generates clear and concise data visualizations.

Photo by @firmbee on Unsplash

In the last decade, our data has multiplied exponentially. Enterprises celebrated the Big Data movement as business managers, data scientists, and technologists eagerly looked for insights in the data they acquired. Each day, companies, governments, and private individuals race to acquire even more data. With machine learning and artificial intelligence, acquiring the right data for the right purpose has only become more urgent. As our data multiplies, we are caught in a race to generate insights. In this race, we reach for Data Visualization as an easier way to arrive at insights faster and cheaper. We have tools that generate visualizations across dimensions, add interactivity, and paper over the complexities. Ultimately, to generate reliable insights we can use for our businesses, we need visualizations that are simple, true, and valid.

The Problem of Clean Data

In the last few years, data scientists and technologists championed the movement towards “clean” data. Data cleaning, scrubbing, and normalizing came to occupy over 70% of a data scientist’s job. Companies hired business specialists to work alongside data scientists, applying business rules to the data wherever possible to obtain “clean” data.
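To make that cleaning step concrete, here is a minimal pandas sketch of what “clean” can mean in practice. The table, the column names, and the rule that negative order totals are invalid are all hypothetical, standing in for whatever business rules your specialists would supply.

```python
import pandas as pd

# Hypothetical raw sales export; the column names and the rule that negative
# order totals are invalid stand in for real business rules.
raw = pd.DataFrame({
    "region": ["East", "east ", "West", "West", None],
    "order_total": [120.0, 120.0, -5.0, 89.5, 42.0],
})

clean = (
    raw
    .assign(region=lambda d: d["region"].str.strip().str.title())  # normalize casing and whitespace
    .drop_duplicates()                                              # remove duplicate rows
    .dropna(subset=["region"])                                      # drop rows missing a required field
    .query("order_total >= 0")                                      # business rule: negative totals are invalid
)
print(clean)
```

Even this toy example hints at why the step is iterative: each rule you apply can expose new inconsistencies that need another pass.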

Data Scientists who understood the business were paid premiums to extract business insights. Yet, even with good data scientists, these projects are costly, and often generate few usable insights.

For a while, Data Visualization software such as Tableau seemed to cut down the time it takes to generate insights. Under this kind of pressure, business managers reached for visualization software to bypass data scientists and tell the story of the data in record time. However, that story is often complex and littered with biases.

The truth is that the Big Data problem is complicated. It requires an inordinate amount of iterative effort to clean, analyze, test, and update the data to generate usable insights. Often, massive numbers of features have to be eliminated with care to answer a single business question. Algorithms applied to the data have to make sense in the context of the data as well as be accurate, interpretable, and generalizable.
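As a sketch of what one round of that careful elimination might look like, the snippet below drops near-constant features and then one of each highly correlated pair. The feature matrix is simulated and the thresholds are arbitrary; in a real project each drop would be checked against the business question being asked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated feature matrix: some columns are nearly constant, some are redundant.
X = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)])
X["f6"] = 0.001 * rng.normal(size=500)   # near-constant feature
X["f7"] = X["f0"] * 1.01 + 0.01          # nearly a copy of f0

# Round 1: drop near-constant features (the variance threshold is arbitrary).
low_var = [c for c in X.columns if X[c].var() < 1e-4]
X = X.drop(columns=low_var)

# Round 2: drop one of each highly correlated pair (the 0.95 cutoff is arbitrary).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
X = X.drop(columns=redundant)

print("Remaining features:", list(X.columns))
```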

Ideally, insights should be generated at the end of our iterative analysis, when we are comfortable that we’ve accounted for most of the data issues present. Only then can these insights be displayed with the simplicity and validity that make up a good Data Visualization.

Below are some common Data Visualization issues that appear when the analysis is inadequate. By spotting these issues, we can appreciate the iterative data science effort it takes to make sense of the data.

1) Too Many Data Points and Clutter

At the beginning of the Data Science process, there are usually too many features and dimensions. When many features cannot be eliminated, the Data Visualization often contains too many data points and too much clutter. This is probably an “intermediate” Data Visualization that offers some insights, but those insights need further analysis to extract the “real” story.

From CBOE Website: https://www.cboe.com/blogs/options-hub/2016/09/21/new-heat-map-shows-less-downside-bxmd-put-indexes-blog-1-wilshire-paper
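One common way to tame an “intermediate” visualization like this is to aggregate the raw points before plotting. The sketch below uses a simulated transaction table and a monthly average purely for illustration; the right aggregation depends on the question you are asking.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Simulated raw data: tens of thousands of individual transactions.
df = pd.DataFrame({
    "date": pd.to_datetime("2023-01-01") + pd.to_timedelta(rng.integers(0, 365, 20000), unit="D"),
    "amount": rng.gamma(2.0, 50.0, 20000),
})

# Plotting every transaction produces clutter; aggregating to a monthly summary
# is one way to surface the underlying pattern instead.
monthly = df.set_index("date").resample("MS")["amount"].agg(["mean", "count"])
monthly["mean"].plot(title="Average transaction amount per month")
plt.show()
```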

2) Comparing Apples to Oranges

In Data Science, one of the most easily overlooked issues is measurement. For two things to be comparable, they have to be measured on the same scale. When comparing completely dissimilar variables such as Pandora’s Net Loss in Millions versus Apple’s Stock Price, you are comparing different measurements, and the comparison does not make sense.
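One simple remedy is to re-express both series on a common, unitless scale before putting them on the same chart, for example by indexing each to its own starting value. The numbers below are made up and are not actual Pandora or Apple figures.

```python
import pandas as pd

# Illustrative numbers only, not actual Pandora or Apple figures.
quarters = ["Q1", "Q2", "Q3", "Q4"]
net_loss_millions = pd.Series([90, 85, 110, 100], index=quarters)  # millions of USD
stock_price = pd.Series([150, 155, 148, 160], index=quarters)      # USD per share

# The raw values share neither unit nor scale, so plotting them on one axis is
# misleading. Indexing each series to its own starting value (= 100) puts both
# on a comparable, unitless scale.
comparable = pd.DataFrame({
    "net_loss_indexed": 100 * net_loss_millions / net_loss_millions.iloc[0],
    "stock_price_indexed": 100 * stock_price / stock_price.iloc[0],
})
print(comparable)
```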

3) Incomplete Data

Clean data does not equal complete data. Complete data refers to all of the relevant data, and what counts as relevant is driven by the scope of your Data Science project. Sometimes, incomplete data introduces biases large enough to render your conclusions useless. For instance, suppose you are working on a sales report for a company with 6 different departments. If an entire department’s data for a given month is unavailable and you still reach conclusions that you generalize across all departments, those conclusions are likely to be biased.
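A small completeness check before drawing conclusions can catch this kind of gap. The sketch below uses a hypothetical sales table and simply lists which departments are missing which months.

```python
import pandas as pd

# Hypothetical monthly sales report; department and month names are illustrative.
sales = pd.DataFrame({
    "department": ["A", "A", "B", "B", "C"],           # department C never reported February
    "month":      ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "revenue":    [100, 120, 80, 95, 60],
})

expected_months = {"Jan", "Feb"}
reported = sales.groupby("department")["month"].agg(set)
missing = {dept: expected_months - months
           for dept, months in reported.items()
           if expected_months - months}

# Flag or fill these gaps before generalizing conclusions across departments.
print("Departments with missing months:", missing)
```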

4) Lack of Accuracy in Data Due to Data Augmentation During Deep Learning

Data Augmentation is a strategy that data scientists use to increase the diversity of data available for training models, most often in Deep Learning. Techniques such as cropping, padding, and horizontal flipping are commonly used to train neural networks. One issue with Data Augmentation is that it can unintentionally change the label of the data. Changing the label injects new meaning into your dataset. Will that new meaning increase the accuracy of your model, or will it bias your dataset? When you bias the training data through Data Augmentation, the training data becomes less accurate.

Below is an example of such a label change: when an image of the digit 9 is flipped vertically and then horizontally (a 180-degree rotation), the image’s meaning changes from 9 to 6, while its label still says 9.
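Here is a minimal NumPy sketch of that label problem using a toy bitmap instead of a real image dataset: the flipped array reads as a 6, but the label attached to it still says 9.

```python
import numpy as np

# Toy 5x3 bitmap of the digit 9 (1 = ink, 0 = background); purely illustrative.
nine = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
])
label = 9

# A vertical flip followed by a horizontal flip is a 180-degree rotation.
augmented = np.fliplr(np.flipud(nine))
print(augmented)

# The rotated bitmap now reads as a 6, but the label attached to it still says 9,
# so this augmentation has silently injected a mislabeled example.
print("label still attached to the augmented image:", label)
```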

5) Spurious Correlation

Big data is notorious for introducing spurious correlations. When we use Data Visualization software to help us reduce dimensionality, we often see series “rising” and “falling” together and assume that there must be a correlation. When there are minor business dependencies between the variables, we are especially inclined to think that we have arrived at an insight.

Here, it’s tempting to think that visits to Universal Studios Orlando correlate with the marriage rate in Michigan, but the story is much more complicated. You would need to know how many people from Michigan went to Universal Studios Orlando, when they went, the male/female ratio, whether or not they are single, and so on, before you could answer that question. In other words, there are many variables you are not accounting for that could affect the apparent correlation between these two series.
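A quick way to see how easily this happens is to simulate two series that share nothing but an upward trend. The data below is entirely made up (it is not real attendance or marriage-rate data), yet the raw correlation looks impressive until the shared trend is removed by differencing.

```python
import numpy as np

rng = np.random.default_rng(7)
years = np.arange(2005, 2020)

# Simulated series, not real attendance or marriage-rate data. Both share an
# upward trend but are generated independently of each other.
park_visitors = 5.0 + 0.3 * (years - 2005) + rng.normal(0, 0.3, len(years))
marriage_rate = 6.0 + 0.2 * (years - 2005) + rng.normal(0, 0.3, len(years))

raw_corr = np.corrcoef(park_visitors, marriage_rate)[0, 1]

# Differencing removes the shared trend; a genuine relationship would have to
# show up in the remaining year-to-year changes.
diff_corr = np.corrcoef(np.diff(park_visitors), np.diff(marriage_rate))[0, 1]

print(f"correlation of raw series:             {raw_corr:.2f}")   # typically high
print(f"correlation of year-over-year changes: {diff_corr:.2f}")  # typically much lower
```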

6) No Storyline or Context Accompanying the Visualization

Data must have a context. Some data scientists will tell you to present only Data Visualizations that are simple and explainable. That is one side of the story. But what questions are you asking of the data? Under what conditions did you collect the data to answer those questions?

When you show a graph of emergency visits to a hospital, is that graph representative of “normal activity” inside the hospital’s ER? For your story to generalize across periods, were the data collected during “down times”, “busy times”, “holiday periods”, or “normal periods”?

If you are showing a graph of heavy foot traffic in the ER leading up to “holiday periods”, state that and explain why “holiday periods” matter in the ER. Then explain that this is not the “normal activity” of the ER during the rest of the year.
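A simple way to keep that context attached to the data is to label each record with the period it came from before you aggregate or plot. The dates, visit counts, and holiday list below are hypothetical.

```python
import pandas as pd

# Hypothetical daily ER visit counts; dates, counts, and the holiday list are illustrative.
visits = pd.DataFrame({
    "date": pd.date_range("2023-12-20", periods=10, freq="D"),
    "er_visits": [140, 150, 155, 170, 210, 230, 220, 160, 150, 145],
})
holidays = pd.to_datetime(["2023-12-24", "2023-12-25", "2023-12-26"])

# Tag each day with its context so the chart can say which period it describes.
visits["period"] = visits["date"].isin(holidays).map({True: "holiday", False: "normal"})

# Reporting the holiday average as if it were "normal activity" would overstate typical load.
print(visits.groupby("period")["er_visits"].mean())
```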

The context often skews the data and what it is used for. With any data science problem, context drives the questions that we ask of the data, and it’s those questions that, in turn, drive the analysis and its outcomes.