One of the most important steps in creating data visualizations is selecting which aspects, features or dimensions of the data to present; in other words, letting the data dictate the visualization. Unlike school assignments, the projects that data scientists and other professionals receive rarely come with the clear guidance they had as students. There is no longer a teacher who assigns a bar chart; instead, data scientists are expected to find insights that will enlighten managers and colleagues.

Data science techniques can help anyone who wants to build an effective visualization. This article shows how, by focusing on one key question every data scientist needs to ask: What level of detail should I show in my visualization?

To demonstrate the importance of this question, consider the following scenario. A researcher conducting an experiment records the date, the time and a measurement at 6 a.m., 2 p.m. and 8 p.m. every day for a month. How can this data set be visualized?
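Before choosing a chart, it can help to see how much data this scenario produces. The sketch below builds only the sampling schedule, not the measurements. The start date and the 30-day month are assumptions made for illustration, since the scenario gives neither, and `build_schedule` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Assumptions (not from the article): the study starts on 2024-01-01
# and the month has 30 days.
START = datetime(2024, 1, 1)
DAYS = 30
READING_HOURS = (6, 14, 20)  # 6 a.m., 2 p.m. and 8 p.m.


def build_schedule(start, days, hours):
    """Return one timestamp per reading, in chronological order."""
    return [
        start + timedelta(days=day, hours=hour)
        for day in range(days)
        for hour in hours
    ]


schedule = build_schedule(START, DAYS, READING_HOURS)
print(len(schedule))  # 90 readings: three per day for 30 days
print(schedule[0], "to", schedule[-1])
```

Three readings a day over a month gives roughly 90 values, which is the scale any chart of the raw data has to cope with.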

Step 1: Creating visual representations

The most direct way to present the data is to plot each data point. In Figure 1, each measurement recorded over the course of the study is plotted against the date and time using a bar chart.

Figure 1: Bar chart of each measurement recorded over the course of the study.

Bar charts can seem simple and easy to use, but selecting the wrong data to plot can undermine the effectiveness of any visualization. Figure 1 includes close to 100 data points, and showing every one of them makes it difficult to gain significant insight without further analysis.

If plotting each data point doesn’t provide meaningful insight, consider using summary statistics, both to gather information and as a starting point for finding useful patterns in the data set. In certain cases, visualizing summary statistics may be sufficient for presenting information. For example, a chart showing the average temperature for each month can be an effective presentation of the seasonal weather changes for a geographic region.

Step 2: Digging into the data

In the previous step, Figure 1 fell short of presenting usable insights. To get better insights, you can use summary statistics to analyze the data points directly, or you can evaluate the visualization itself. Either approach allows data scientists to explore potential patterns in any data set, as shown below.

For data scientists who prefer to work directly with the data set, daily or weekly averages can present an effective overview by splitting up the data set into different levels. Figure 2a shows the daily average for the first seven days and the difference between the daily averages and the weekly average. The table shows that the difference between the daily average and the weekly average stands out on the sixth day, when the daily average is significantly higher than the weekly average. With the discovery of unusual behavior in the first week, it’s easy to check whether the pattern is consistent during the other weeks of the study.

Figure 2a: Measurements for the first seven days, including the daily average and the difference between the daily average and the weekly average.
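The comparison of daily averages against the weekly average can be sketched in a few lines. The readings below are invented placeholders, not the values behind Figure 2a, and day 6 is deliberately set higher so that the deviation is visible.

```python
from statistics import mean

# Placeholder readings (6 a.m., 2 p.m., 8 p.m.) for seven days.
# Invented for illustration; these are NOT the article's measurements.
first_week = [
    (20, 24, 22),
    (21, 25, 23),
    (19, 23, 21),
    (20, 26, 23),
    (22, 24, 23),
    (30, 36, 33),  # day 6 is deliberately higher
    (21, 23, 22),
]

daily_avg = [mean(day) for day in first_week]
# Every day has three readings, so the mean of the daily averages
# equals the mean of all 21 readings.
weekly_avg = mean(daily_avg)
diffs = [avg - weekly_avg for avg in daily_avg]

for day, (avg, diff) in enumerate(zip(daily_avg, diffs), start=1):
    print(f"day {day}: daily avg {avg:.1f}, diff from weekly avg {diff:+.1f}")

# The day whose average lies farthest from the weekly average.
standout_day = max(range(1, 8), key=lambda d: abs(diffs[d - 1]))
print("largest deviation on day", standout_day)
```

With these placeholder values, the largest deviation falls on day 6, which is the kind of signal the table is meant to surface.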

For data scientists who prefer to work with visualizations, the bar chart in Figure 1 can serve as a valuable source of insights. Figure 2b shows the measurements for the first seven days of the study, with the average for this period represented by the horizontal red line. As in the table in Figure 2a, the values for the sixth day are significantly different from the values for the other days shown.

Figure 2b: Measurements recorded during the first 7 days. The red line represents the average measurement recorded.

Step 3: Revising the chart

Since both the data set and the original visualization revealed that the data peaked on the sixth day, Figure 1 can be revised to determine whether this pattern is consistent throughout the study. Specifically, in Figure 3, the three measurements recorded each day are represented as one averaged daily value, and the chart shows that the measurement values peak each week on the same weekday.

Figure 3: Bar chart of the average measurement recorded daily.
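One way to check the weekly pattern numerically is to collapse each day to its average and then see whether the unusually high days share a position in the week. The sketch below uses synthetic readings with a peak every seventh day; the data, the `synthetic_readings` helper and the 25% threshold are all assumptions for illustration, not the article's method.

```python
from statistics import mean


def synthetic_readings(days=28):
    """Invented stand-in for the study: 3 readings per day, higher every 7th day."""
    readings = {}
    for day in range(1, days + 1):
        base = 32 if day % 7 == 6 else 22
        readings[day] = (base - 2, base + 2, base)  # 6 a.m., 2 p.m., 8 p.m.
    return readings


readings = synthetic_readings()

# Collapse the three readings per day into a single daily average.
daily_avg = {day: mean(values) for day, values in readings.items()}
overall_avg = mean(list(daily_avg.values()))

# Hypothetical rule of thumb: a day counts as a peak if its average is
# more than 25% above the overall average.
peak_days = [day for day, avg in daily_avg.items() if avg > 1.25 * overall_avg]

print("peak days:", peak_days)
print("positions within the week:", {day % 7 for day in peak_days})
```

If every peak day has the same remainder when divided by 7, the peaks fall on the same weekday, which is what the revised chart shows visually.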

Apply with caution

While averages can be useful for data mining, using this approach too liberally can inadvertently result in hiding valuable information. By replacing daily averages with weekly averages, Figure 4a no longer shows the peaks that occur on days 6, 13, 20 and 27—and the measurements are so close that the chart suggests there is very little variability in the data.

Figure 4a: Bar chart of the average measurement recorded weekly.
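The flattening effect of weekly averages is easy to reproduce. The sketch below compares the spread of daily averages with the spread of weekly averages, using invented numbers in which days 6, 13, 20 and 27 are higher than the rest; the values are placeholders, not the study's data.

```python
from statistics import mean

# Invented daily averages for 28 days: a baseline of 22 with a higher
# value on days 6, 13, 20 and 27. Placeholder data for illustration only.
peaks = {6: 30, 13: 32, 20: 34, 27: 31}
daily_avg = [peaks.get(day, 22) for day in range(1, 29)]

# Group the 28 days into four consecutive 7-day weeks and average each.
weeks = [daily_avg[i:i + 7] for i in range(0, 28, 7)]
weekly_avg = [mean(week) for week in weeks]

daily_spread = max(daily_avg) - min(daily_avg)
weekly_spread = max(weekly_avg) - min(weekly_avg)

print("weekly averages:", [round(avg, 2) for avg in weekly_avg])
print("spread of daily averages:", daily_spread)
print("spread of weekly averages:", round(weekly_spread, 2))
```

In this sketch the daily averages differ by as much as 12, while the weekly averages differ by less than 1, so a chart of the weekly values would look almost flat.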

Conceptually, calculating the average of a set of numbers is similar to redistributing the amounts across these values until each one is equal. For instance, finding the average of 8 and 12 can be thought of as taking 2 from 12 and moving it to 8 so that the two values both equal 10, which is the average of the two numbers. Hence, if a set of numbers includes extreme values, averaging these terms can result in the loss of vital information.
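The redistribution idea can be written out directly. The sketch below is a minimal illustration using a hypothetical `redistribute` helper: it replaces every value with the shared average, which keeps the total unchanged but erases any extreme value. The second list is an invented example.

```python
from statistics import mean


def redistribute(values):
    """Replace every value with the shared average; the total is preserved."""
    avg = mean(values)
    return [avg] * len(values)


pair = [8, 12]
with_extreme = [10, 10, 10, 40]  # invented example with one extreme value

for values in (pair, with_extreme):
    levelled = redistribute(values)
    print(values, "->", levelled, "| total before and after:",
          sum(values), sum(levelled))
```

After redistribution, nothing in the second list hints that one of the original values was four times larger than the others.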

Remember that using a “one-size-fits-all” approach can increase the chances of hiding or missing important insights. Creating alternative visualizations of the measurements by time, as in Figure 4b, will minimize this risk and open up the possibility of finding new patterns.

Figure 4b: Line chart of the measurements by time.
