Lloyd’s note: I’d like to thank my tasters, and especially Blake at BourbonR.com and Bryan for collecting and crunching the numbers. Incredible job. These were all double blind and several samples were re-sent and re-tasted. It should be noted that when I did this with another group informally and not blind, scores were a full point (0-5 scale) higher. This blind tasting had tasters ranging from well-read bloggers to national whiskey authors, respected reviewers and lovers of liquid. I’m not a math wizard but have never seen such a great analysis of a tasting before.

This post is courtesy of Bryan (Twitter handle @Elenaran).

Lloyd sent out 13 samples of his “favorites and legends” bourbons to reviewers in January, 16 of whom received the samples and completed (most of) the tasting. He instructed the reviewers to rate them on a 0–5 scale, as follows:

0: wouldn’t give this to my Dogs

1: Dog Whiskey

2: Not bad, not great but I liked it

3: Pretty good, I’d buy a bottle for me and my friends to enjoy.

4: This is really good. I’d buy as many as I could afford, and only my best friends could have a little.

5: I must be in Whiskey heaven, wow, just WOW. One of my favorites or my favorite ever.

Based on the results he provided (203 total reviews; five reviews were missing from various reviewers), this graph shows how many of each score were given by all reviewers combined:

Since the reviewers were tasting Lloyd’s “favorites and legends,” we shouldn’t be surprised that the reviews skewed heavily toward higher ratings. In fact, only 12 of the 203 total reviews rated a whiskey as “dog whiskey” or below.

Here are the whiskeys, as revealed by Lloyd on Twitter, in tasting order and with the label shown on each sample:

#1 Willett Wheated 20 Year Gift Shop

#2 Pappy 20 Year

#3 Four Roses 2013 Small Batch

#4 Elijah Craig 12 Barrel Proof 134.2

#5 Four Roses Small Batch 2012

#6 Michter’s 10 Year KBD bottling of Stitzel Weller Barrel 18 Years Old (Last 10 year that was Stitzel-Weller)

#7 Elmer T Lee Private Barrel

#8 Willett Private 19 Year Wheated

#9 Jefferson 18 batch 14 Early Handwritten Label

#10 Parker Heritage 3rd Edition Golden Anniversary

#11 Very Very Old Fitzgerald 1955-67 (was sealed in a bar for 40yrs, condition is due to cigarette smoke)

#12 Willett 12 Year Private Hot Chocolate Fudge Bomb

#13/M Blend of 60% Bernheim Wheat Whiskey & 40% Wild Turkey Rare Breed (I wanted sweet wheat with spicy rye)

Here’s the collection all together:

I’ve taken the graph above showing the distributions of ratings, and broken it down into individual graphs for each of the 13 whiskeys:

While the distributions have some differences, it’s easy to see that, for the most part, they center around that 3-4 mark, much like the combined distribution.

Descriptive Statistics

First, we’ll look at descriptive statistics for the scores for each whiskey. From here on out, we’ll look at these scores on a 100-point scale, as that is how the data was stored on Bourbonr. It’s the same data, just on a different scale:

0–1 becomes 0–20

1–2 becomes 20–40

2–3 becomes 40–60

3–4 becomes 60–80

4–5 becomes 80–100
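Since both scales are linear, the conversion is just a multiplication by 20. As a trivial sketch (the function name here is mine, not Bourbonr’s):

```python
def to_100_scale(score):
    """Convert a rating on the 0-5 scale to the 0-100 scale used on Bourbonr."""
    return score * 20

print(to_100_scale(3.5))  # a "pretty good to really good" pour -> 70.0
```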

The following chart shows a box plot of each whiskey’s scores. Box plots show both the median (middle) score (sort of like an average, but less influenced by extreme scores, which we call “outliers”) and the distribution of scores around that median. The solid line in the middle of each box represents the median (defined as the 50th percentile score), while the top and bottom of each box represent the 75th and 25th percentiles, respectively. The mean is shown as a red “X,” while the open circles represent the outliers mentioned above.
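For readers who want to see the mechanics, here is a minimal sketch of how those box-plot statistics are computed, using made-up scores for a single whiskey (not the actual tasting data). It uses the common 1.5×IQR convention for flagging outliers, which is what matplotlib’s boxplot defaults to:

```python
import numpy as np

# Made-up 0-100 scores for one whiskey, with a single very low outlier
scores = np.array([20, 60, 70, 70, 75, 80, 80, 85, 90])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond 1.5 * IQR from the box edges are flagged as outliers
outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]

print(f"median={median}, mean={scores.mean():.1f}, outliers={list(outliers)}")
# The single low score drags the mean (70.0) below the median (75.0)
```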

Outliers tend to represent unlikely scores, so it’s important to look at the distribution of scores, rather than just the mean score. For example, the second whiskey, Elijah Craig Barrel Proof (ECBP), has a mean (average) much lower than its median, because one reviewer rated ECBP very low, which pulls the mean down.

The following table gives another view of the same data, in order of median, showing the minimum and maximum scores each received, along with the mean and median score, and the standard deviation of each score (which gives an idea of how much the scores varied):

Keep in mind that the median means half of the reviews are above and half are below that mark, so half of the reviewers scored the Willett 12-year as 4 out of 5 or better. It’s also interesting to note that five of the whiskeys had identical median scores.

As mentioned above, the standard deviation gives you an idea of how much variability there is in the scores for each whiskey. However, like the mean, the standard deviation is also very sensitive to outliers, so whiskeys like ECBP appear to have much higher variability than one would expect.

To mitigate the effects these extreme values have, I also tried calculating the descriptive statistics with the outliers removed (the open circles on the boxplot above). This table shows the same statistics, but calculated without the outliers:

For this table, I sorted by mean instead of median, since the mean should no longer be skewed. You’ll also notice reduced standard deviations in some cases, especially for ECBP.
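As a hedged sketch of why trimming helps, here is the same kind of 1.5×IQR filter applied to made-up scores (not the actual data); note how much the sample standard deviation shrinks once the single extreme score is dropped:

```python
import numpy as np

# Made-up scores with one extreme low value
scores = np.array([20, 60, 70, 70, 75, 80, 80, 85, 90])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
trimmed = scores[(scores >= lo) & (scores <= hi)]  # drops the 20

# Sample standard deviation (ddof=1), with and without the outlier
print(f"full: {scores.std(ddof=1):.1f}, trimmed: {trimmed.std(ddof=1):.1f}")
```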

Another thing I thought might be interesting to look at was how each individual reviewer scored the whiskeys:

This table is like the table above, but we’ve grouped scores by reviewer instead of by whiskey. Although there is quite a bit of variation from one reviewer to the next, it is interesting to note that 7 of the 16 reviewers had an identical median score of 70, which would be between “pretty good” and “really good” on the 5-point scale.

Another interesting measure is to look at the number of times each whiskey was picked as a reviewer’s favorite out of the bunch:

Just like in the medians calculated without outliers, Willett 12-year and ECBP tied for the top—I’d say they’re the clear favorites in this tasting (at least, for this group of reviewers).

Relationships between Whiskey Scores

Since the ratings are paired with the reviewers who gave them, we can look for similarities in ratings between pairs of whiskeys, measured by correlation. If the scores for whiskey A are positively correlated with those for whiskey B, then reviewers who liked A tended to like B, and reviewers who didn’t like A tended not to like B. Conversely, if the scores for whiskeys C and D are anticorrelated, reviewers who liked C tended not to like D, and vice versa. Correlation ranges from -1 (strong anticorrelation) to 1 (strong correlation). Each correlation also has a p-value, which indicates its statistical significance. For this study, we consider a p-value less than 0.05 significant (that is, unlikely to be just a random fluke):
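The write-up doesn’t say which correlation measure was used; a Pearson correlation with its p-value, as computed below on made-up paired scores, is the most common choice:

```python
from scipy.stats import pearsonr

# Made-up scores from the same eight reviewers for two whiskeys (not the actual data)
whiskey_a = [60, 70, 80, 80, 90, 75, 85, 65]
whiskey_b = [55, 65, 85, 75, 95, 70, 90, 60]

r, p = pearsonr(whiskey_a, whiskey_b)
print(f"r={r:.2f}, p={p:.4f}")  # r near +1: reviewers who liked A also liked B
```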

The strongest and most significant relationship was a positive one between the Bernheim blend and the Willett 12-year. That means that people who liked the Bernheim blend also tended to like the Willett 12-year, and people who didn’t like the Bernheim tended not to like the Willett. The numbers don’t tell us why this is the case, though.

Conversely, there was a negative relationship between the Elijah Craig Barrel Proof (ECBP) and the Four Roses 2012 Limited Edition (FR2012), which means that people who liked ECBP tended not to like FR2012 as much, and vice versa.

Relationships between Reviewers’ Scores

Just like we can look at relationships in scores between whiskeys, we can do the same thing for users. Looking for correlations between users should show us whether two reviewers are likely to rate the same whiskey similarly, or dissimilarly:

Again, a lower p-value indicates better statistical significance, while a higher correlation (in terms of absolute value) indicates a stronger relationship. That means that spd0925 and joethebluesman scored things very similarly, in general, while risenc tended to score things differently than kbr127. All of the other relationships between reviewers not shown here were not significant (p>0.05), meaning that scores weren’t sufficiently similar or dissimilar for us to confidently say there is a relationship between two users’ scores.

Relationship between ABV and Rating

I also had an idea that alcohol by volume (ABV) may have had an impact on the ratings, so I looked at that relationship as well:

So, yes, there was a significant positive relationship between rating and ABV, albeit not very strong. In general, this means that while people tended to give higher ABV whiskeys higher scores, other factors likely contributed more strongly to variation between scores. This scatterplot shows the relationship between ABV and rating, but because the correlation is weak, it doesn’t look like much:

Differences between Whiskeys

One thing that Lloyd was probably looking for was to highlight the differences in scores between more well-known whiskeys like Pappy and lesser-known (or less popular) whiskeys. In order to look at differences in this type of data, statistically, I ran what is called a Repeated Measures Analysis of Variance (ANOVA) test. The details and outputs are more complex than what most people reading this will care about, but, to summarize the results, there are no significant differences between the scores of the whiskeys.
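The repeated-measures ANOVA accounts for the fact that the same 16 reviewers rated every whiskey. As a simplified, hedged sketch of the underlying idea, here is a plain one-way ANOVA with scipy’s `f_oneway` on made-up scores (it ignores the repeated-measures pairing, so it only shows the basic mechanics, not the actual method used):

```python
from scipy.stats import f_oneway

# Made-up scores for three whiskeys (not the actual tasting data)
willett = [75, 80, 70, 85, 78, 72]
pappy   = [62, 70, 58, 75, 65, 68]
ecbp    = [72, 90, 30, 85, 80, 76]

# ANOVA asks: are the group means farther apart than the within-group
# spread would lead us to expect by chance?
f_stat, p_value = f_oneway(willett, pappy, ecbp)
print(f"F={f_stat:.2f}, p={p_value:.3f}")
# With this much overlap between groups, p stays well above 0.05
```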

You might say, “Hey, Willett 12-year had an average score of 75, while Pappy was way down at 62,” but if you look at the average scores overall, they’re all pretty close together, right in that 60–80 range. To see this visually, look back at the box plot. There’s a lot of overlap from box to box, and especially from whisker to whisker.

What does this mean? Well, it could mean that the whiskeys aren’t really that different in terms of ratings, or that reviewers’ opinions differed too much to get consistent ratings. Or it could just be that, statistically speaking, this study’s sample size was too small to show consistent differences between the whiskeys. That’s not Lloyd’s fault: getting a big enough sample size would have required a lot of time and money, particularly considering the cost of these whiskeys. And you can only split one bottle into so many samples that are big enough to rate; if you have to buy multiple bottles of the same whiskey, the study could be affected by variations in batch, barrel, bottle age, or bottle storage.

The chart below shows the mean score for each whiskey along with the margins of error (the whiskers), which are related to the sample size (16) and the standard deviation of each set of scores:

So, while you can see there are differences in the means, you can also clearly see that pretty much all of the means are within each other’s margin of error—that is, the whiskeys have a lot of overlap—which visually shows why the ANOVA didn’t find significant differences.
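Whiskers like these are standard confidence-interval margins: the t critical value times the standard error. A sketch with made-up scores for one whiskey (a 95% interval is assumed here; the post doesn’t state which confidence level was used):

```python
import math
import statistics
from scipy import stats

# Made-up scores for one whiskey from 16 reviewers (not the actual data)
scores = [70, 75, 80, 65, 72, 78, 85, 60, 74, 76, 68, 82, 71, 77, 73, 79]

n = len(scores)
mean = sum(scores) / n
sd = statistics.stdev(scores)           # sample standard deviation (ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
margin = t_crit * sd / math.sqrt(n)     # half-width of the whisker

print(f"{mean:.1f} +/- {margin:.1f}")
```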

Other non-significant findings

I also looked at a few other interesting variables, but they didn’t produce significant results.

The first was based on a discussion some of the reviewers had on Twitter involving samples with white caps versus samples with black caps. Some reviewers thought that the black-cap bottles had somehow tainted the whiskeys, giving them off flavors. I checked this using a t-test and found no significant difference in ratings between black-cap and white-cap whiskeys. However, as noted before, that could be due to the small sample size or to the fact that all of the whiskeys were fairly close in score. And some reviewers actively tried to ignore the off flavor from the black-cap bottles, in which case we wouldn’t expect to see a difference between the two groups.
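A two-sample t-test like the one described can be sketched with scipy on made-up data (the cap assignment and scores below are invented for illustration):

```python
from scipy.stats import ttest_ind

# Made-up per-whiskey mean ratings, grouped by cap color (not the actual data)
black_cap = [70, 65, 72, 68, 74, 66]
white_cap = [71, 69, 73, 67, 75, 70]

# The t-test asks whether the two group means differ by more than
# the within-group spread would explain by chance
t_stat, p_value = ttest_ind(black_cap, white_cap)
print(f"t={t_stat:.2f}, p={p_value:.3f}")
# p > 0.05 here: no significant difference between the cap colors
```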

I ran a similar test on wheated versus non-wheated whiskeys, but again found no significant difference between those groups. Either this group of reviewers isn’t picky about whether a whiskey has wheat or rye in the mashbill, or other factors were more important, or the sample size was just too small.

Finally, I wondered if maybe higher alcohol-content whiskeys were more polarizing in terms of ratings—maybe some people love high alcohol whiskeys and some people hate them. To test this, I went back to my analysis of the relationship between ABV and rating and ran a test for non-constant variance. The results showed there was no significant change in variance of rating by ABV-level, so higher-alcohol whiskeys don’t appear to be more polarizing.
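The post doesn’t name the specific test; in a regression setting the Breusch-Pagan test is a common choice. As a simpler, hedged stand-in, scipy’s Levene test comparing score variance between a lower-ABV and a higher-ABV group (made-up data) illustrates the idea of checking whether spread, rather than level, differs:

```python
from scipy.stats import levene

# Made-up scores split into lower- and higher-ABV groups (not the actual data)
low_abv  = [65, 70, 72, 68, 74, 61, 79, 66]
high_abv = [60, 80, 75, 66, 78, 62, 82, 70]

# Levene's test compares the variances of the groups; a small p-value
# would suggest one group's scores are more spread out (more polarizing)
stat, p_value = levene(low_abv, high_abv)
print(f"W={stat:.2f}, p={p_value:.3f}")
```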

Summary

To summarize, the results show that with data like this it’s important not to just take an average, post those numbers, and call it done. That approach ignores things like skew, outliers, and sample size. While there weren’t significant differences between the whiskeys, there were some interesting findings, such as relationships between scores of different whiskeys, relationships between scores of different reviewers, a relationship between ABV and rating, and distributional pictures of each whiskey.

While a sample size of 16 is too small to make any real statements about these whiskeys in terms of global taste preferences, I think it’s clear that the Willett 12-year was the favorite of these reviewers. It had the No. 1 rating in both mean and median score, for the datasets both with and without outliers. After the Willett, things are less clear, with distinctions appearing only in certain circumstances (e.g., “This whiskey was the 2nd favorite in terms of median score after outliers were removed…”). However, it’s important to keep in mind that even the last-place finishers had mean and median scores in the 3–4 range, just like the rest of the whiskeys.

Something that might help improve results in future tastings would be to randomize the sample order for each taster, or to have the tasters cover all the whiskeys in a single sitting. The current system was set up as a social-network activity, so it made sense to have everyone rating the same whiskey at the same time, but that can lead to bias: if someone reads three or four other reviewers saying that #7 is the best whiskey they’ve ever had, they are likely to have high expectations going into that tasting, potentially skewing the results.

Randomizing the order also prevents assumptions such as Lloyd saving the best whiskey for last. Additionally, reviewers may have made changes to their tasting procedures or rating methods as they went along. You can imagine that maybe one taster started out rating the whiskeys too high or low, and changed as he or she went along. Or maybe palates varied a lot from night to night, or the scores changed based on a taster’s mood; everyone has their off nights.

Also, it would probably be best not to have any of the whiskeys specially marked. In the current tasting, the Bernheim blend was marked as “M” and was the last one to be tasted, which may have led people to give it inflated or deflated scores: reviewers may have thought it was special because of the label (the rest were numbered), and it was also saved for last. Again, randomizing the order would help with this.