When one of us (Bart) wanted to buy a car seat for his new son, he went to Amazon.com, entered some search terms, and sorted results by the average star rating. He narrowed his choice to two options: One was substantially more expensive but had a higher average rating, 4.6 to 3.8. In the end, he went with this option, reasoning that the additional quality was worth the money given the importance of the purchase.

Sounds reasonable, right? After all, online user ratings and reviews are now one of the most important sources of product quality information. Consumers love them because they are free, widely available, easy to access, and ostensibly objective. The advent of online reviews has led some to argue that the power of brands and traditional marketing tactics is waning and that consumers are making more informed and rational decisions.

But according to our recent research, Bart may have been misled in his search for a car seat. The punchline is that the trust we place in star ratings reflects an illusion of validity; we trust them much more than we should. Online ratings may not reflect a product’s quality at all.

There are a whole host of issues with user ratings — assuming they are even authentic. These can be divided into three categories: statistical, sampling, and evaluation.

Statistical issues stem from the fact that we only observe review scores from a subset of product users. The average rating from this sample does not perfectly coincide with the average rating we would have obtained if all product users had left a review. We can be more confident in an average star rating if the sample size is large and if the variability of the distribution of ratings is smaller (i.e. if different reviewers tend to agree). Unfortunately, sample sizes are often not large enough for statistical comfort. Variability also tends to be high for multiple reasons, including random noise. A reviewer may rate the wrong product or leave a low rating due to a complaint about shipping, for instance, which has little to do with the product itself.

Sampling issues stem from the fact that the subset of users that leaves a review is not randomly sampled from those who have purchased the product. Consumers with extreme opinions are more likely to post reviews, which is referred to as a “brag-and-moan” bias. As a consequence, many rating distributions are J-shaped with mostly 5-star ratings, some 1-star ratings, and hardly any ratings in between. Positive ratings also increase the likelihood of later positive ratings.

Evaluation issues stem from the fact that accurately evaluating product performance requires a scientific approach. Alternatives need to be tested side by side under the same conditions, and objective performance measured with sophisticated and often expensive instruments. Users who post reviews do not have the knowledge, equipment, or time to assess product performance in this way. Consider the car seat example above. Many dimensions of performance (safety, reliability) cannot readily be assessed by an ordinary user, and most users only ever experience a single product, as opposed to using and comparing a variety of car seats. Moreover, it is well known that consumers’ quality evaluations are heavily biased by variables other than objective product performance, such as brand image, price, and physical appearance.
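To make the statistical and sampling points concrete, here is a minimal simulation sketch in Python. The J-shaped distribution of ratings is assumed and purely illustrative (the probabilities are not taken from our data); the sketch simply shows how far the average from a small sample of reviewers can drift from the average we would get if every buyer rated the product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed J-shaped distribution of ratings: mostly 5s, some 1s, few in between.
# These probabilities are illustrative only, not estimates from our data.
stars = np.array([1, 2, 3, 4, 5])
probs = np.array([0.15, 0.03, 0.04, 0.08, 0.70])

true_mean = (stars * probs).sum()  # the average if every buyer left a rating
print(f"True average: {true_mean:.2f}")

for n in (25, 50, 200, 1000):
    # Draw many samples of n reviews and see how much the sample average varies.
    samples = rng.choice(stars, size=(10_000, n), p=probs)
    sample_means = samples.mean(axis=1)
    lo, hi = np.percentile(sample_means, [2.5, 97.5])
    print(f"n={n:4d}: 95% of sample averages fall between {lo:.2f} and {hi:.2f}")
```

Under these assumptions, the average computed from only 25 reviews can easily miss the "true" average by half a star, exactly the kind of gap that feels decisive when comparing two products.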

In light of these issues, we undertook our research project to answer two questions. First, is the average star rating a good indicator of product quality? We analyzed 1,272 products across 120 product categories, restricting our study to product categories in which objective performance can be clearly defined and measured (such as car seats, bike helmets, sunblock, refrigerators, and televisions).

Second, how much do consumers trust the average star rating as an indicator of quality? To answer this question, we ran a series of lab studies where we asked participants to judge product quality after inspecting products’ web pages on Amazon.com. We then assessed the extent to which their quality judgments depended on the average star rating compared to other cues they might have used, such as price. Here are some of our key findings:

The average star rating has surprisingly low correspondence to established quality metrics. We examined the extent to which average review scores from Amazon.com correspond with scores from Consumer Reports, an organization that specializes in scientific product testing. The correspondence was quite low — in fact, the product with the higher star rating on Amazon.com received a higher score from Consumer Reports only 57% of the time, which is just slightly better than flipping a coin.

The solid line in the chart below plots correspondence as a function of the difference in average star ratings. When the difference between two product options is smaller than 0.4 stars (which is the case for about half of comparisons in our dataset), correspondence is at chance (50%). Correspondence increases as the difference in user rating grows larger, but the increase is modest and it never exceeds 70%.
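To be concrete about what we mean by correspondence: for every pair of comparable products, check whether the one with the higher average star rating also has the higher Consumer Reports score, and compute the share of pairs where the two sources agree. Here is a minimal sketch in Python, using made-up numbers rather than our actual data:

```python
from itertools import combinations

# Hypothetical (made-up) products in one category:
# (name, average Amazon star rating, Consumer Reports score)
products = [
    ("Seat A", 4.6, 71),
    ("Seat B", 3.8, 74),
    ("Seat C", 4.2, 68),
    ("Seat D", 4.4, 80),
]

agree, total = 0, 0
for (name1, stars1, cr1), (name2, stars2, cr2) in combinations(products, 2):
    if stars1 == stars2 or cr1 == cr2:
        continue  # skip tied pairs
    total += 1
    # Does the product with the higher star rating also score higher with Consumer Reports?
    if (stars1 > stars2) == (cr1 > cr2):
        agree += 1

print(f"Correspondence: {agree}/{total} = {agree/total:.0%}")
```

In our data, this kind of pairwise correspondence came out at 57% overall.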

Another traditional metric of quality is resale value. Products with better reliability and performance retain more of their value over time and thus, if average user ratings reflect objective quality, they should correlate positively with resale values. We collected resale values from camelcamelcamel.com, an online price tracking website, and from usedprice.com, a proprietary service that uses dealer surveys and other sources to estimate used prices. In both cases, average star ratings bore essentially no relationship to used prices. In contrast, Consumer Reports scores did predict resale values.

Average star ratings are often based on insufficient sample sizes — but consumers trust them anyway. Because the difference in average user ratings for two products is smaller than 0.4 stars about half of the time, large sample sizes are needed to ensure that comparisons (say, a 4.5 versus a 4.1) are statistically meaningful. Unfortunately, sample sizes are often too small to conclude much: about 50% of products have fewer than 50 ratings.

Consider a product with an average rating of 4 stars. If 25 users rated the product, a prospective buyer can only be 95% confident that the true average lies somewhere between 3.5 and 4.5; in other words, with more data the average could turn out to be as low as 3.5 or as high as 4.5, ratings that consumers perceive very differently. The 95% confidence interval narrows as sample size increases: it ranges from 3.6 to 4.4 if 50 consumers rated the product and from 3.7 to 4.3 if 100 consumers rated the product. However, even with 200 ratings, the interval still ranges from 3.8 to 4.2, which is too wide to conclude much when two products differ by only 0.4 stars.
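The intervals above follow from the standard normal-approximation formula for a sample mean: the average plus or minus 1.96 times the standard deviation divided by the square root of n. Here is a minimal sketch in Python; the standard deviation of roughly 1.28 stars is not reported above but is the value implied by the 3.5-to-4.5 interval at 25 ratings.

```python
import math

mean = 4.0   # observed average star rating
sd = 1.28    # standard deviation of individual ratings (implied by the example above)

for n in (25, 50, 100, 200):
    # 95% confidence interval for the average: mean +/- 1.96 * sd / sqrt(n)
    margin = 1.96 * sd / math.sqrt(n)
    print(f"n={n:3d}: {mean - margin:.2f} to {mean + margin:.2f}")
```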

Despite this, consumers almost completely neglect sample size when judging quality based on star ratings. Our studies show shoppers rely as much on an average rating from 25 users as on one from 200 users.

Holding quality constant, more expensive products and brands with better reputations get better ratings. Star ratings are biased upward for expensive products and for those from premium brands. User ratings are thus heavily influenced by good old marketing tactics, such as advertising and price signaling. We found that for two products with the same score from Consumer Reports, moving from a brand at the 10th percentile of brand reputation (e.g., a Casio digital camera) to one at the 90th percentile (e.g., a Sony digital camera) is worth about 0.4 stars. Going from a price at the 10th percentile to one at the 90th percentile is associated with a difference of about 0.2 stars. Star ratings are more strongly related to brand image and price than they are to Consumer Reports scores.
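For readers curious about the shape of this analysis: effects like these can be estimated by regressing average star ratings on price and a brand-reputation measure while controlling for the Consumer Reports score, and then comparing predictions at the 10th and 90th percentiles of each variable. The sketch below is a simplified stand-in using made-up data and ordinary least squares, not our actual model or dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data for illustration: 200 products.
n = 200
cr_score = rng.uniform(40, 90, n)      # Consumer Reports score
log_price = rng.normal(4.0, 0.6, n)    # log of price
brand_rep = rng.normal(0.0, 1.0, n)    # brand-reputation index

# Assume star ratings load on brand and price even after controlling for quality.
stars = (2.0 + 0.02 * cr_score + 0.15 * brand_rep + 0.10 * log_price
         + rng.normal(0, 0.3, n)).clip(1, 5)

# Ordinary least squares: stars ~ intercept + cr_score + brand_rep + log_price
X = np.column_stack([np.ones(n), cr_score, brand_rep, log_price])
coef, *_ = np.linalg.lstsq(X, stars, rcond=None)

# Holding the Consumer Reports score constant, how many stars is moving from
# the 10th to the 90th percentile of brand reputation (or price) worth?
for name, x, b in [("brand reputation", brand_rep, coef[2]),
                   ("log price", log_price, coef[3])]:
    p10, p90 = np.percentile(x, [10, 90])
    print(f"{name}: {(p90 - p10) * b:+.2f} stars from 10th to 90th percentile")
```

Including the Consumer Reports score as a control is what makes the comparison one between products of the same tested quality, mirroring the "same score from Consumer Reports" condition described above.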

As we said above, this result is consistent with many years of research on how consumers form quality perceptions, so from that perspective it is not so surprising. Indeed, Nate Silver found a similar result when he analyzed Yelp ratings of New York City restaurants: Controlling for number of Michelin stars, more expensive restaurants are rated higher on Yelp. However, we also find that consumers do not anticipate these biasing effects. In fact, most participants in our studies have precisely the wrong intuition, at least when it comes to price: they think that reviewers penalize products for being more expensive when the opposite is true.

In the end, we love star ratings because they seem even-handed, even though they're not. When a salesperson recommends we buy a more expensive car seat because it's safer, many of us would consider the possibility that the salesperson is motivated by the commission. Similarly, when Roger Federer or Tiger Woods recommends Gillette for a better shave, we assume he is a paid endorser of the brand. Consumers do not blindly accept all the information they receive about products and brands, especially when they suspect an underlying persuasion motive. Although persuasion motives are highly accessible when we talk to a salesperson or watch an ad, we let our guard down when absorbing information from other users. This is a mistake.