It’s quite easy to contrast and compare these statistical measures for the different types of wine samples. Notice the stark difference in some of the attributes. We will emphasize those in some of our visualizations later on.

Univariate Analysis

Univariate analysis is the simplest form of data analysis or visualization: we analyze and visualize a single data attribute or variable (one dimension).

Visualizing data in One Dimension (1-D)

One of the quickest and most effective ways to visualize all numeric data and their distributions is to leverage histograms using pandas.
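
As a minimal sketch of this idea, the snippet below calls `DataFrame.hist` on a small synthetic frame. The `wines` frame, its values and the styling choices are assumptions made for illustration, not the original wine dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (values are made up for illustration).
rng = np.random.default_rng(42)
wines = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, 500),
    "sulphates": rng.lognormal(-0.7, 0.3, 500),
    "residual sugar": rng.exponential(5.0, 500),
    "pH": rng.normal(3.2, 0.15, 500),
})

# One histogram per numeric column, laid out on a grid of subplots.
axes = wines.hist(bins=15, color="steelblue", edgecolor="black",
                  grid=False, figsize=(8, 6))
fig = axes.flat[0].get_figure()
fig.tight_layout()
```

In a notebook the figure renders on its own; in a script, a call to `matplotlib.pyplot.show()` would display it.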

Visualizing attributes as one-dimensional data

The plots above give a good idea about the basic data distribution of any of the attributes.

Let’s drill down to visualizing one of the continuous, numeric attributes. Essentially a histogram or a density plot works quite well in understanding how the data is distributed for that attribute.

Visualizing one-dimensional continuous, numeric data

It is quite evident from the above plot that there is a definite right skew in the distribution for wine sulphates.

Visualizing a discrete, categorical data attribute is slightly different, and bar plots are one of the most effective ways to do so. You can also use pie charts, but in general try to avoid them, especially when the number of distinct categories is more than three.
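
A bar plot of category frequencies could look like the sketch below. The `quality` ratings and their probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic quality ratings (the probabilities are an assumption).
rng = np.random.default_rng(42)
quality = pd.Series(
    rng.choice([3, 4, 5, 6, 7, 8], size=500,
               p=[0.02, 0.05, 0.33, 0.42, 0.15, 0.03]),
    name="quality",
)

# Count each distinct category and keep the categories in rating order.
counts = quality.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(counts.index.astype(str), counts.values,
       color="steelblue", edgecolor="black")
ax.set(xlabel="Quality", ylabel="Frequency", title="Wine quality frequency")
```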

Visualizing one-dimensional discrete, categorical data

Let’s move on to looking at higher dimensional data now.

Multivariate Analysis

Multivariate analysis is where the fun as well as the complexity begins. Here we analyze multiple data dimensions or attributes (two or more). Multivariate analysis involves not just checking distributions but also looking for potential relationships, patterns and correlations among these attributes. Depending on the problem at hand, you can also leverage inferential statistics and hypothesis testing to check statistical significance for different attributes, groups and so on.

Visualizing data in Two Dimensions (2-D)

One of the best ways to check out potential relationships or correlations amongst the different data attributes is to leverage a pair-wise correlation matrix and depict it as a heatmap.
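
The following sketch computes a pair-wise correlation matrix with `DataFrame.corr` and draws it with `seaborn.heatmap`. The data is synthetic: `density` is deliberately constructed to be negatively correlated with `alcohol` so that the heatmap has something to show.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in data; density is built to depend on alcohol.
rng = np.random.default_rng(42)
n = 500
alcohol = rng.normal(10.5, 1.2, n)
wines = pd.DataFrame({
    "alcohol": alcohol,
    "density": 1.0 - 0.002 * alcohol + rng.normal(0, 0.001, n),
    "residual sugar": rng.exponential(5.0, n),
    "sulphates": rng.lognormal(-0.7, 0.3, n),
})

# Pair-wise Pearson correlations between all numeric columns.
corr = wines.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr.round(2), annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, linewidths=0.5, ax=ax)
ax.set_title("Wine attributes correlation heatmap")
```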

Visualizing two-dimensional data with a correlation heatmap

The gradients in the heatmap vary with the strength of the correlation, which makes it easy to spot attributes that are strongly correlated with each other. Another way to visualize the same is to use pair-wise scatter plots amongst attributes of interest.
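
One way to sketch pair-wise scatter plots is `pandas.plotting.scatter_matrix`, shown here on synthetic data. The original figure may well have been produced with a different function; the columns and values below are assumptions.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in data for three attributes of interest.
rng = np.random.default_rng(42)
n = 300
alcohol = rng.normal(10.5, 1.2, n)
wines = pd.DataFrame({
    "alcohol": alcohol,
    "density": 1.0 - 0.002 * alcohol + rng.normal(0, 0.001, n),
    "sulphates": rng.lognormal(-0.7, 0.3, n),
})

# Grid of scatter plots for every attribute pair, histograms on the diagonal.
axes = scatter_matrix(wines, figsize=(7, 7), diagonal="hist", alpha=0.4)
```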

Visualizing two-dimensional data with pair-wise scatter plots

Based on the above plot, you can see that scatter plots are also a decent way of observing potential relationships or patterns in two-dimensions for data attributes.

An important point to note about pairwise scatter plots is that the grid of plots is symmetric. The scatter plot for any pair of attributes (X, Y) looks different from the one for (Y, X) only because the horizontal and vertical axes are swapped. It does not contain any new information.

Another way of visualizing multivariate data for multiple attributes together is to use parallel coordinates.
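
A possible sketch uses `pandas.plotting.parallel_coordinates`. The data is synthetic, with group means chosen by assumption, and the features are standardized first so that attributes on very different scales can share one vertical axis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Synthetic red/white samples (the group means are assumptions).
rng = np.random.default_rng(42)
n = 60
wine_type = np.repeat(["red", "white"], n // 2)
is_red = wine_type == "red"
wines = pd.DataFrame({
    "fixed acidity": np.where(is_red, rng.normal(8.3, 1.0, n), rng.normal(6.9, 0.8, n)),
    "residual sugar": np.where(is_red, rng.normal(2.5, 0.8, n), rng.normal(6.4, 2.0, n)),
    "total sulfur dioxide": np.where(is_red, rng.normal(46, 15, n), rng.normal(138, 30, n)),
    "wine_type": wine_type,
})

# Standardize each feature to zero mean and unit variance.
features = [c for c in wines.columns if c != "wine_type"]
scaled = wines.copy()
scaled[features] = (wines[features] - wines[features].mean()) / wines[features].std()

# One poly-line per sample, one vertical axis per attribute, colored by class.
fig, ax = plt.subplots(figsize=(9, 4))
parallel_coordinates(scaled, "wine_type", ax=ax,
                     color=("#FF9999", "#FFE888"), alpha=0.6)
```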

Parallel coordinates to visualize multi-dimensional data

In this visualization, each data point is represented as a set of connected line segments, and each vertical line represents one data attribute. One complete set of connected line segments across all the attributes represents one data point, so points that tend to cluster will appear closer together. Just by looking at it, we can see that density is slightly higher for red wines than for white wines. Residual sugar and total sulfur dioxide are higher for white wines than for red, while fixed acidity is higher for red wines. Check the summary statistics table we derived earlier to validate these observations!

Let’s look at some ways in which we can visualize two continuous, numeric attributes. Scatter plots and joint plots in particular are good ways not only to check for patterns and relationships but also to see the individual distributions of the attributes.

Visualizing two-dimensional continuous, numeric data using scatter plots and joint plots

The scatter plot is depicted on the left side and the joint plot on the right in the above figure. Like we mentioned, you can check out correlations, relationships as well as individual distributions in the joint plot.

How about visualizing two discrete, categorical attributes? One way is to leverage separate plots (subplots) or facets for one of the categorical dimensions.

Visualizing two-dimensional discrete, categorical data using bar plots and subplots (facets)

While this is a good way to visualize categorical data, leveraging matplotlib directly has meant writing a lot of code. Another good way is to use stacked bars or multiple bars for the different attributes in a single plot, which seaborn makes easy.
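
A single grouped bar chart can be sketched with `seaborn.countplot` and its `hue` parameter. The `wines` frame, the category probabilities and the palette are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic categorical data: wine type and quality rating.
rng = np.random.default_rng(42)
wines = pd.DataFrame({
    "wine_type": rng.choice(["red", "white"], size=600, p=[0.25, 0.75]),
    "quality": rng.choice([4, 5, 6, 7, 8], size=600,
                          p=[0.05, 0.30, 0.45, 0.15, 0.05]),
})

# The same frequencies as a table, handy for checking the chart.
counts = pd.crosstab(wines["quality"], wines["wine_type"])

# One bar per (quality, wine_type) pair, grouped side by side.
fig, ax = plt.subplots(figsize=(7, 4))
sns.countplot(x="quality", hue="wine_type", data=wines,
              palette={"red": "#FF9999", "white": "#FFE888"}, ax=ax)
```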

Visualizing two-dimensional discrete, categorical data in a single bar chart

This definitely looks cleaner and you can also effectively compare the different categories easily from this single plot.

Let’s look at visualizing mixed attributes in two dimensions (essentially numeric and categorical together). One way is to use faceting/subplots along with generic histograms or density plots.

Visualizing mixed attributes in two-dimensions leveraging facets and histograms/density plots

While this is good, once again we have a lot of boilerplate code which we can avoid by leveraging seaborn and even depict the plots in one single chart.

Leveraging multiple histograms for mixed attributes in two-dimensions

The plot generated above is clear and concise, and we can easily compare the distributions. Besides this, box plots are another effective way of depicting groups of numeric data based on the different values of a categorical attribute. Box plots are a good way to see the quartile values in the data as well as potential outliers.
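
As a hedged sketch, the snippet below draws grouped box plots with `DataFrame.boxplot` and also computes the quartiles that the boxes encode. The data is synthetic, and the upward shift in alcohol per quality class is built in by assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data: alcohol content shifted upward for better quality classes.
rng = np.random.default_rng(42)
labels = ["low", "medium", "high"]
quality_label = rng.choice(labels, size=300, p=[0.35, 0.50, 0.15])
shift = pd.Series(quality_label).map({"low": 0.0, "medium": 0.6, "high": 1.5}).to_numpy()
wines = pd.DataFrame({
    "quality_label": pd.Categorical(quality_label, categories=labels, ordered=True),
    "alcohol": rng.normal(9.8, 0.9, 300) + shift,
})

# One box per category: the box spans Q1..Q3, the inner line is the median.
fig, ax = plt.subplots(figsize=(6, 4))
wines.boxplot(column="alcohol", by="quality_label", ax=ax, grid=False)
fig.suptitle("")
ax.set(title="Alcohol by quality class", xlabel="quality_label", ylabel="alcohol")

# The quartiles behind each box, as a table.
quartiles = (wines.groupby("quality_label", observed=True)["alcohol"]
             .quantile([0.25, 0.5, 0.75]).unstack())
```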

Box Plots as an effective representation of two-dimensional mixed attributes

A similar visualization is the violin plot, another effective way to visualize grouped numeric data, which uses kernel density plots to depict the probability density of the data at different values.
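
A minimal violin plot sketch follows. It uses matplotlib's own `Axes.violinplot` on synthetic data, which is a swapped-in choice for illustration; seaborn's `violinplot` is an equally reasonable option.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic sulphates values per quality class (parameters are assumptions).
rng = np.random.default_rng(42)
labels = ["low", "medium", "high"]
wines = pd.DataFrame({
    "quality_label": np.repeat(labels, 100),
    "sulphates": np.concatenate([
        rng.lognormal(-0.75, 0.30, 100),
        rng.lognormal(-0.65, 0.25, 100),
        rng.lognormal(-0.55, 0.25, 100),
    ]),
})

# One array of values per category, in display order.
groups = [wines.loc[wines["quality_label"] == label, "sulphates"].to_numpy()
          for label in labels]

# Each violin is a mirrored kernel density estimate; the bar marks the median.
fig, ax = plt.subplots(figsize=(6, 4))
parts = ax.violinplot(groups, showmedians=True)
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set(xlabel="quality_label", ylabel="sulphates")
```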

Violin Plots as an effective representation of two-dimensional mixed attributes

You can clearly see the density plots above for the different wine quality categories for wine sulphates.

Visualizing data up to two dimensions is pretty straightforward, but it becomes complex as the number of dimensions (attributes) increases. The reason is that we are bound by the two dimensions of our display mediums and our environment. For three-dimensional data, we can introduce a fake notion of depth by adding a z-axis to our chart or by leveraging subplots and facets. For data with more than three dimensions, visualization becomes even more difficult. The best way to go beyond three dimensions is to use plot facets, color, shapes, sizes, depth and so on.

Visualizing data in Three Dimensions (3-D)

Considering three attributes or dimensions in the data, we can visualize them by considering a pair-wise scatter plot and introducing the notion of color or hue to separate out values in a categorical dimension.

Visualizing three-dimensional data with scatter plots and hue (color)

The above plot enables you to check out correlations and patterns and also to compare across wine groups. For example, we can clearly see that total sulfur dioxide and residual sugar are higher for white wine than for red.

Let’s look at strategies for visualizing three continuous, numeric attributes. One way would be to have two dimensions represented as the regular length (x-axis) and breadth (y-axis) and to use the notion of depth (z-axis) for the third dimension.
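
A depth-based plot can be sketched with matplotlib's 3-D projection. The three arrays are synthetic stand-ins for the wine attributes, and the choice of which attribute goes on which axis is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers "3d" on older matplotlib)

# Synthetic stand-ins for three continuous attributes.
rng = np.random.default_rng(42)
residual_sugar = rng.exponential(5.0, 200)
fixed_acidity = rng.normal(7.2, 1.3, 200)
alcohol = rng.normal(10.5, 1.2, 200)

# x and y are the regular axes; z adds the notion of depth.
fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(residual_sugar, fixed_acidity, alcohol, s=20, alpha=0.6, edgecolors="w")
ax.set_xlabel("residual sugar")
ax.set_ylabel("fixed acidity")
ax.set_zlabel("alcohol")
```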

Visualizing three-dimensional numeric data by introducing the notion of depth

But is this effective? Not really! We can, however, leverage the regular 2-D axes for representing two continuous variables (scatter plot) and introduce the third continuous variable as a categorical variable by binning its values, either into fixed-width bins or, more commonly, into quantiles. Based on these quantiles (or bins), we can use size or even hue to represent the third variable, making the plot 3-D.
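
The binning step can be sketched with `pandas.qcut`, which splits a continuous column into quantile-based bins. The data, the number of bins, the bin labels and the size mapping below are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data for three continuous attributes.
rng = np.random.default_rng(42)
wines = pd.DataFrame({
    "fixed acidity": rng.normal(7.2, 1.3, 400),
    "alcohol": rng.normal(10.5, 1.2, 400),
    "residual sugar": rng.exponential(5.0, 400),
})

# Bin the third variable into quartiles, so each bin holds about 25% of rows.
wines["sugar_quartile"] = pd.qcut(wines["residual sugar"], q=4,
                                  labels=["Q1", "Q2", "Q3", "Q4"])
counts = wines["sugar_quartile"].value_counts()

# Map each quartile to a marker size: larger marker, higher residual sugar.
sizes = (wines["sugar_quartile"].cat.codes.astype(int) + 1) * 20

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(wines["fixed acidity"], wines["alcohol"], s=sizes,
           alpha=0.4, edgecolors="black")
ax.set(xlabel="fixed acidity", ylabel="alcohol")
```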

Using the notion of size or hue for representing continuous data in 3-D

A better option would be to use the notion of faceting as the third dimension (essentially subplots) where each subplot indicates a specific bin from our third variable (dimension). Do remember you need to create your bins manually if you are using the scatterplot functionality from matplotlib as opposed to seaborn (depicted in the following example).
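
A rough matplotlib-only sketch of this approach is given below, with the bins created manually through `pandas.cut`. The bin edges, labels and data are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data for three continuous attributes.
rng = np.random.default_rng(42)
wines = pd.DataFrame({
    "fixed acidity": rng.normal(7.2, 1.3, 300),
    "alcohol": rng.normal(10.5, 1.2, 300),
    "residual sugar": rng.exponential(5.0, 300),
})

# Manually chosen bin edges for the third variable (an assumption).
wines["sugar_level"] = pd.cut(wines["residual sugar"],
                              bins=[0, 2, 6, np.inf],
                              labels=["low", "medium", "high"])

# One subplot (facet) per bin, sharing axes so the panels are comparable.
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for ax, (label, group) in zip(axes, wines.groupby("sugar_level", observed=True)):
    ax.scatter(group["alcohol"], group["fixed acidity"], alpha=0.5, edgecolors="w")
    ax.set_title(f"residual sugar: {label}")
    ax.set_xlabel("alcohol")
axes[0].set_ylabel("fixed acidity")
```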

Using the notion of facets for representing continuous data in 3-D

The above plot clearly tells us that the higher the residual_sugar levels and the alcohol content, the lower the fixed_acidity in the wine samples.

Visualizing three-dimensional categorical data by introducing the notion of hue and facets

The chart above clearly shows the frequency pertaining to each of the dimensions and you can see how easy and effective this can be in understanding relevant insights.

Considering visualization for three mixed attributes, we can use the notion of hue for separating our groups in one of the categorical attributes while using conventional visualizations like scatter plots for visualizing two dimensions for numeric attributes.

Visualizing mixed attributes in three-dimensions leveraging scatter plots and the concept of hue

Thus hue acts as a good separator for the categories or groups. While there is little or no correlation in the plot above, we can still see that sulphates are slightly higher for red wines than for white. Instead of a scatter plot, you can also use a kernel density plot to understand the data in three dimensions.

Visualizing mixed attributes in three-dimensions leveraging kernel density plots and the concept of hue

It is quite evident and expected that red wine samples have higher sulphate levels as compared to white wines. You can also see the density concentrations based on the hue intensity.

In case we are dealing with more than one categorical attribute in the three dimensions, we can use hue and one of the regular axes for visualizing data and use visualizations like box plots or violin plots to visualize the different groups of data.

Visualizing mixed attributes in three-dimensions leveraging split violin plots and the concept of hue

In the figure above, the right-hand plot is the 3-D visualization: wine quality is on the x-axis and wine_type is the hue. We can clearly see some interesting insights, for example that volatile acidity is higher for red wines than for white wines.

You can also consider using box plots for representing mixed attributes with more than one categorical variable in a similar way.

Visualizing mixed attributes in three-dimensions leveraging box plots and the concept of hue

We can see that for both the quality and quality_label attributes, the wine alcohol content increases with better quality. Red wines also tend to have a slightly higher median alcohol content than white wines based on the quality class. However, if we check the quality ratings, we can see that for lower rated wines (3 & 4), the white wine median alcohol content is greater than that of the red wine samples. Otherwise, red wines seem to have a slightly higher median alcohol content in general.

Visualizing data in Four Dimensions (4-D)

Based on our discussion earlier, we leverage various components of the charts to visualize multiple dimensions. One way to visualize data in four dimensions is to use depth and hue as specific data dimensions in a conventional plot like a scatter plot.

Visualizing data in four-dimensions leveraging scatter plots and the concept of hue and depth

The wine_type attribute is denoted by the hue, which is quite evident from the above plot. Although these visualizations get harder to interpret as the plots become more complex, you can still gather insights: fixed acidity is higher for red wines and residual sugar is higher for white wines. Of course, if there were some association between alcohol and fixed acidity, we might have seen a gradually increasing or decreasing plane of data points showing some trend.

Is this effective? Again, not really! One strategy to make this better is to keep a 2-D plot but use hue and data point size as data dimensions. Typically this would be a bubble chart, similar to what we visualized earlier.
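
A bubble chart along these lines might be sketched as follows. The data is synthetic, and the palette as well as the factor used to scale residual sugar into marker areas are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic red/white samples (the group parameters are assumptions).
rng = np.random.default_rng(42)
n = 200
wine_type = rng.choice(["red", "white"], size=n, p=[0.3, 0.7])
is_red = wine_type == "red"
wines = pd.DataFrame({
    "wine_type": wine_type,
    "fixed acidity": np.where(is_red, rng.normal(8.3, 1.2, n), rng.normal(6.9, 0.8, n)),
    "alcohol": rng.normal(10.5, 1.2, n),
    "residual sugar": np.where(is_red, rng.exponential(2.5, n), rng.exponential(6.4, n)),
})

# x and y carry two numeric attributes, hue the category, size the fourth attribute.
palette = {"red": "#FF9999", "white": "#FFE888"}
fig, ax = plt.subplots(figsize=(7, 5))
for name, color in palette.items():
    subset = wines[wines["wine_type"] == name]
    ax.scatter(subset["fixed acidity"], subset["alcohol"],
               s=subset["residual sugar"].to_numpy() * 25,
               color=color, alpha=0.4, edgecolors="gray", label=name)
ax.set(xlabel="fixed acidity", ylabel="alcohol")
ax.legend(title="wine_type")
```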

Visualizing data in four-dimensions leveraging bubble charts and the concept of hue and size

We use hue to represent wine_type and the data point size to represent residual sugar. We see patterns similar to those in the previous chart, and the bubbles are generally larger for white wine, indicating that residual sugar values are higher for white wine than for red.

Now this might be better than the previous 4-D plot, but honestly it’s just alright in my opinion. Yes, the hue helps us see which wines have higher or lower fixed acidity, but I don’t quite like the notion of size since it is often hard to interpret. Can we do better? Yes we can! Let’s use facets instead, as depicted in the following plot.

Visualizing data in four-dimensions leveraging scatter plots and the concept of hue and facets

Look at that! Clear and concise visuals tell us that fixed acidity is lower for white wine than for red wine and that residual sugar is much higher for white wine than for red wine samples. Also, the higher the alcohol level, the lower the fixed acidity.

If we have more than two categorical attributes to represent, we can reuse our concept of leveraging hue and facets to depict these attributes and regular plots like scatter plots to represent the numeric attributes. Let’s look at a couple of examples.

Visualizing data in four-dimensions leveraging scatter plots and the concept of hue and facets

The effectiveness of this visualization is verified by the fact that we can easily spot multiple patterns. The volatile acidity levels for white wines are lower, and high quality wines also have lower acidity levels. Based on the white wine samples, high quality wines have higher levels of alcohol and low quality wines have the lowest levels of alcohol!

Let’s take up a similar example with some other attributes and build a visualization in four dimensions.

Visualizing data in four-dimensions leveraging scatter plots and the concept of hue and facets

We clearly see that high quality wines have lower content of total sulfur dioxide which is quite relevant if you also have the necessary domain knowledge about wine composition. We also see that total sulfur dioxide levels for red wine are lower than white wine. The volatile acidity levels are however higher for red wines in several data points.

Visualizing data in Five Dimensions (5-D)

Once again following a similar strategy as we followed in the previous section, to visualize data in five dimensions, we leverage various plotting components. Let’s use depth, hue and size to represent three of the data dimensions besides regular axes representing the other two dimensions. Since we use the notion of size, we will be basically plotting a three dimensional bubble chart.

Visualizing data in five-dimensions leveraging bubble charts and the concept of hue, depth and size

This chart depicts the same patterns and insights that we talked about in the previous section. However, we can also see from the point sizes, which represent total sulfur dioxide, that white wines have higher total sulfur dioxide levels than red wines.

Instead of depth, we can also use facets along with hue to represent more than one categorical attribute in these five data dimensions. One of the attributes representing size can be numerical (continuous) or even categorical (but we might need to represent it with numbers for data point sizes). While we don’t depict that here due to the lack of categorical attributes, feel free to try it out on your own datasets.

Visualizing data in five-dimensions leveraging bubble charts and the concept of hue, facets and size

This is basically an alternative approach to visualizing the same plot which we plotted previously for five dimensions. However, considering the difficulty in interpreting size which we observed previously, you can convert one of the variables, if continuous, to discrete categorical using binning and then use that as an additional faceting parameter as depicted below!