Consider the two statements:

There is a universal standard for beauty.

Beauty is in the eye of the beholder.

Most people would agree that there's some truth to each of these statements. At Thing of Things, Ozy wrote:

As for the beauty thing… well, yeah, everyone’s beautiful in the sense that everyone is sexually attractive to someone, and that human bodies in general are pretty cool-looking. But conventional attractiveness is still a thing. While I’m fairly conventionally attractive (thin, white, clear skin, symmetrical features), I doubt hairy legs, bound chests, and haircuts that make one look like a teenage boy are going to be all the rage at Cosmo any time soon.

This post explores the question of the extent to which each of the two statements is true, using data from a study of speed dating events conducted by Raymond Fisman and Sheena Iyengar.

The basic facts that I describe here are:

Attractiveness as defined by group consensus can be modeled well using a normal distribution.

The group consensus on somebody's attractiveness accounted for roughly 60% of the variance in people's perceptions of the person's relative attractiveness.

The distribution of people's perceptions of the relative attractiveness of a fixed person can be modeled well using a normal distribution. Moreover, the standard deviations of these distributions tend to be quite close to one another (across different people), so that it's often possible to approximate the entire distribution of perceptions of somebody's relative attractiveness using only the mean of the distribution, which is just the group consensus on the person's attractiveness.

There's much more to say about how to interpret the group consensus and its implications, which I'll go into in a later post.

Each event involved ~15 men and ~15 women, and everybody of a given gender went on speed dates with everyone of the opposite gender. Each participant on each date rated his or her partner on a number of dimensions, including attractiveness, on a scale from 1 to 10. For the purpose of this post, I focused on how attractive raters found a ratee relative to other ratees. For this reason, I scaled each rater's ratings so that the averages are the same for all raters of a given gender.
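The rescaling step can be sketched as follows. The post doesn't say whether the scaling was an additive shift or a multiplicative stretch, so this sketch assumes a simple additive shift; the function name, the tuple layout, and the example numbers are all hypothetical.

```python
from collections import defaultdict

def center_by_rater(ratings, target_mean):
    """Shift each rater's scores so every rater's average equals target_mean.

    ratings: list of (rater_id, ratee_id, score) tuples (assumed layout).
    An additive shift is an assumption; the post only says "scaled".
    """
    by_rater = defaultdict(list)
    for rater, _, score in ratings:
        by_rater[rater].append(score)
    rater_mean = {r: sum(s) / len(s) for r, s in by_rater.items()}
    return [
        (rater, ratee, score - rater_mean[rater] + target_mean)
        for rater, ratee, score in ratings
    ]

# Made-up example: one generous rater (mean 7) and one harsh rater (mean 4).
ratings = [
    ("m1", "w1", 8), ("m1", "w2", 6),
    ("m2", "w1", 5), ("m2", "w2", 3),
]
adjusted = center_by_rater(ratings, target_mean=6.5)
```

After the shift, both raters place w1 one point above the common average and w2 one point below it, even though their raw scores differed by three points.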

Gender differences

One sees essentially the same phenomena when the raters are men and the ratees are women as one does when the genders are reversed. There is, however, one very important difference: the average of the ratings that men gave women was ~6.5, and the average of the ratings that women gave men was ~5.9. Interestingly, the standard deviations were the same in both cases, and in terms of standard deviations, women were rated 0.5 SD higher than men were. This fact may have profound ramifications. I've pictured the distributions of average attractiveness ratings of men and of women below:

The main difference between the distributions is that the one for women is shifted to the right relative to the one for men. The shapes of the distributions are also a little bit different, but one can verify that the difference is within the range of what one would expect by chance.

Hierarchical modeling

We're interested in what the average ratings would be if a sufficiently large number of raters rated a given ratee.

The ratees who are rated highest and lowest are also the ratees whose ratings are most likely to be unrepresentative of the entire population's consensus on their attractiveness: there's regression to the mean.

A methodology that allows us to correct for this is Bayesian hierarchical modeling, which involves simultaneously estimating the "true" distribution of average attractiveness ratings of all hypothetical ratees together with the true average attractiveness ratings of the particular ratees in the dataset. The default assumption in Bayesian hierarchical modeling is that the true distribution is a normal distribution with mean and standard deviation to be determined. The histograms above suggest that this is close to being true in our setting.
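To make the idea of "refined estimates" concrete, here is a minimal sketch of shrinkage toward the group mean. It uses an empirical-Bayes normal-normal approximation, a deliberately simpler stand-in for the full Bayesian hierarchical fit described above. The function name and the example data are hypothetical.

```python
import statistics

def shrink_means(ratings_by_ratee):
    """Empirical-Bayes shrinkage of per-ratee means (normal-normal model).

    ratings_by_ratee: dict mapping ratee id -> list of at least 2 ratings.
    This approximates, and is not identical to, a full hierarchical fit.
    """
    raw = {k: statistics.mean(v) for k, v in ratings_by_ratee.items()}
    n = {k: len(v) for k, v in ratings_by_ratee.items()}
    # Pooled within-ratee variance: how noisy a single rating is.
    within = statistics.mean(
        [statistics.variance(v) for v in ratings_by_ratee.values()]
    )
    grand = statistics.mean(list(raw.values()))
    # Spread of the true means: observed spread minus average sampling noise.
    noise = statistics.mean([within / n[k] for k in raw])
    between = max(statistics.variance(list(raw.values())) - noise, 0.0)
    shrunk = {}
    for k in raw:
        # Weight near 1 means trust the raw mean; near 0 means trust the group.
        weight = between / (between + within / n[k]) if between > 0 else 0.0
        shrunk[k] = grand + weight * (raw[k] - grand)
    return raw, shrunk

# Made-up ratings for three ratees, four raters each.
example = {"a": [9, 8, 10, 9], "b": [6, 7, 5, 6], "c": [3, 4, 2, 3]}
raw, shrunk = shrink_means(example)
```

The highest and lowest raw averages are pulled toward the overall mean, and the pull is stronger when a ratee has fewer ratings or when individual ratings are noisier. This is the correction for regression to the mean.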

If we use Bayesian hierarchical modeling to generate refined estimates for the averages, we get distributions that look something like the following:

Note that, in contrast with the actual averages, the refined estimates are never below 4.5 or above 8: the participants weren't rated by enough people for us to be confident that any participant is that far away from average.

The standard deviations of the distributions are nearly identical: 0.6 points on the 10 point scale.

The distribution of ratings for a fixed person

The image below shows the ratings of 18 women by 17 men.

The columns correspond to ratees and the first 17 rows correspond to raters.

Blue corresponds to "below average in the eyes of the rater" and red corresponds to "above average in the eyes of the rater."

The numbers in the side bar correspond to the number of points that a rating is above or below average.

The final three rows give the median, minimum and maximum ratings of a ratee.

One sees that, with the exception of the ratees in columns 10 and 16, every ratee had at least one rater who perceived her attractiveness to be noticeably above average and at least one rater who perceived her attractiveness to be noticeably below average.
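The summary rows of such a grid are simple to compute. The sketch below assumes a small rater-by-ratee matrix of deviations from each rater's average; the numbers are made up and are not the study's data.

```python
import statistics

# Rows are raters, columns are ratees. Entries are points above (+) or
# below (-) the rater's own average. Illustrative values only.
deviations = [
    [1.5, -0.5, 0.0],
    [0.5, -1.5, 1.0],
    [-1.0, -0.5, 2.0],
]

def summarize_ratees(matrix):
    """Return (median, minimum, maximum) for each column of the matrix."""
    columns = list(zip(*matrix))
    return [(statistics.median(col), min(col), max(col)) for col in columns]

summary = summarize_ratees(deviations)
# A ratee is "mixed" if at least one rater put them above average
# and at least one rater put them below average.
mixed = [lowest < 0 < highest for _, lowest, highest in summary]
```

In this toy grid only the first ratee is "mixed"; in the actual image, all but two of the 18 ratees are.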

The graph below shows the median rating (black), maximum rating (red) and minimum rating (blue) for all ratees in the study, together with best fit curves:

Here too, one sees that there are very few people who are consistently rated as being above average or below average.

This is consistent with the fact that the standard deviation of the ratings that an individual was given was roughly the same as the standard deviation of average ratings of the population of ratees. I've plotted the standard deviations for individual ratees below:

We see that the standard deviations have a strong central tendency, with mean equal to ~0.7 points.

The average standard deviation being 0.7 points overstates the variability in perceptions of an individual's attractiveness. Some reasons for this are:

Ratings on a 10 point scale are imprecise: for example, raters were not allowed to give numerical ratings of 6.5.

Individual raters may inaccurately convey their perceptions of the person's attractiveness on account of not devoting their full attention to the task of reporting on it.

In order to estimate the true standard deviation of the distribution of perceptions of a given person's attractiveness, I examined the relative predictive power of:

(i) Our refined estimate of the group consensus on ratees' attractiveness

(ii) The extent to which a rater's rating deviates from this estimate

in the context of predicting a rater's decisions as to whether or not to see a ratee again.

I found that 60% of the predictive power comes from the group consensus and 40% of the predictive power comes from deviations from the group consensus, suggesting that the standard deviation of perceptions of a given ratee's attractiveness is about 2/3 of the standard deviation of the group consensus across ratees. In terms of points on a 10 point scale, this is about 0.45 points.

To be continued...

In subsequent posts, I'll describe how the data bears on the following questions: