Ordinal scales are everywhere in psychology. From mood ratings to pain scales, they are among the most prevalent tools in the field. They also appear frequently in other domains, such as medicine and education, and most often take the form of Likert scales. These scales ask participants to give a score along an increasing scale in response to a question or series of questions, e.g. “How afraid are you of spiders?”. The number of options typically ranges from 5 to 11. Scales of this kind produce ordinal data, as opposed to metric data.

It is common practice to analyse ordinal data as though it were metric (Liddell & Kruschke, 2018), i.e. using tests that assume the data are interval or ratio and therefore have equal-sized gaps between units (t-tests, correlations, etc.). These metric models also assume normally distributed residual noise. When the data are treated as ordinal, a different noise assumption is used instead: a thresholded cumulative normal distribution.

Mo models, mo problems

These two sets of assumptions treat the data points in distinct ways. The metric model describes a data point’s probability as the normal density at a given value, whereas the ordinal model describes a datum’s probability as the cumulative normal probability between two thresholds on an underlying construct. The graphs below show these two models.

For the metric model, the probability of a given ordinal response is just the probability density at the corresponding metric score. But for the thresholded cumulative normal model (also called an ordered probit model), the ordinal levels are created by dividing the normal distribution curve of an underlying continuous value into chunks. For example, if you asked someone to rate “How afraid of spiders are you?”, there is an assumption that the underlying fear is continuous and is divided at certain thresholds to create discrete values. The third graph above shows an assumed underlying continuous value with a normal distribution for the density of scores. The dashed lines show the divides along the continuous scale (the threshold values) which make up the ordinal response options. The areas under the normal distribution curve and between the dashed lines make up the probability of each ordinal response. The bar charts just above this graph represent the cumulative probability.
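
A minimal sketch of this computation in Python, assuming a 5-point scale with illustrative latent mean, SD, and threshold values (none of these numbers come from the paper):

```python
import numpy as np
from scipy.stats import norm

def probit_probs(mu, sigma, thresholds):
    """Ordinal response probabilities under a thresholded cumulative
    normal (ordered probit) model: the area under the latent normal
    curve between adjacent thresholds."""
    # The outer thresholds are fixed at -inf and +inf, so the areas sum to 1
    cuts = np.concatenate(([-np.inf], np.asarray(thresholds), [np.inf]))
    return np.diff(norm.cdf((cuts - mu) / sigma))

# Illustrative 5-point scale: latent mean 3, latent SD 1, cut-points 1.5-4.5
p = probit_probs(mu=3.0, sigma=1.0, thresholds=[1.5, 2.5, 3.5, 4.5])
print(p)  # five probabilities, summing to 1
```

With a symmetric latent distribution and evenly spaced cut-points the middle response is the most probable; shifting the inner thresholds (as the model does when it estimates them from data) reshapes these probabilities without touching the latent distribution.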

Even when the two models share the same underlying distribution, the response probabilities they assign are not the same. This is because of the threshold values between the intervals: the outer thresholds are fixed, but all the inner thresholds are estimated from the data, which is why the threshold between ‘2’ and ‘3’ on the graph sits closer to ‘2’. For a metric model to be appropriate, the data need equal-sized gaps between units (which ordinal data lack) and a normal distribution. Ordinal data are frequently skewed or multi-modal, violating the normality assumption (Ghosh et al., 2018), so they are not appropriate for analysis as metric data.

What does this mean for your results?

These discrepancies wouldn’t be a problem if the metric model produced robust results, and some studies have found exactly that, e.g. Heeren and D’Agostino (1987) and Bevan et al. (1974). But they share a crucial weakness: they didn’t compare the results to a non-parametric model, so we don’t know how well the metric model performed relative to a model theoretically better equipped to analyse that kind of data. When research does compare the two, the metric model falls far short of its competitor.

Nanna and Sawilowsky (1998) used real-world data to compare how well a parametric t-test performed against a non-parametric Wilcoxon rank-sum test. The non-parametric test had greater power, and therefore a better detection rate, for almost every sample size. The disparity in power grew as the sample sizes increased, running counter to the prevailing logic that once the sample size reaches 30 you can use a parametric test because the Central Limit Theorem lets you assume a normal distribution (Jolliffe, 1995).
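
This kind of power comparison can be sketched with a toy simulation: two groups drawn from skewed 5-point response distributions that genuinely differ, each analysed with both tests (scipy’s `mannwhitneyu` is the Wilcoxon rank-sum test under another name). The response distributions below are invented for illustration, not taken from Nanna and Sawilowsky:

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)
levels = np.arange(1, 6)
# Invented skewed Likert distributions with a real group difference
p_a = [0.50, 0.25, 0.15, 0.07, 0.03]
p_b = [0.35, 0.30, 0.20, 0.10, 0.05]

def power(test, n=30, reps=2000, alpha=0.05):
    """Proportion of simulated experiments in which `test` rejects."""
    hits = 0
    for _ in range(reps):
        a = rng.choice(levels, size=n, p=p_a)
        b = rng.choice(levels, size=n, p=p_b)
        if test(a, b).pvalue < alpha:
            hits += 1
    return hits / reps

t_power = power(ttest_ind)       # parametric
w_power = power(mannwhitneyu)    # rank-based
print(t_power, w_power)
```

The exact numbers depend on the invented distributions and seed; the point is that power for ordinal data can be estimated empirically rather than assumed.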

Liddell & Kruschke (2018) simulated a data set using an ordered probit model to see how well the two models explained the data. If it is acceptable to use a metric model when analysing ordinal data, the models should be roughly equally accurate. But that is not what they found.

The ordered probit model (unsurprisingly) accurately captured both the effect size for the group difference (for true effect sizes of both 0 and 0.66) and the distribution of the responses. The metric model wasn’t even close: it estimated an effect size of 0.49 when the true effect size was 0 (a Type I error) and -0.01 when the true effect size was 0.66 (a Type II error), and it estimated the distribution very poorly. This pattern of results (the metric model failing to capture the distribution and effect size of the data whilst the ordered probit model succeeded) also held for real-world data.
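
The Type I failure can be reproduced in miniature: give two groups the same latent mean but different latent SDs, discretise the latent values through thresholds, and run a t-test on the resulting ordinal scores. All numbers below (latent means, SDs, thresholds, sample size) are illustrative choices, not the paper’s simulation settings:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
thresholds = np.array([1.5, 2.5, 3.5, 4.5])  # illustrative cut-points

def ordinal_sample(mu, sigma, n):
    """Draw latent normal values and discretise them into 1-5 responses."""
    latent = rng.normal(mu, sigma, size=n)
    return np.searchsorted(thresholds, latent) + 1

reps, hits = 1000, 0
for _ in range(reps):
    # Same latent mean (no real effect); only the latent SD differs
    a = ordinal_sample(mu=4.0, sigma=0.5, n=100)
    b = ordinal_sample(mu=4.0, sigma=2.0, n=100)
    if ttest_ind(a, b, equal_var=False).pvalue < 0.05:
        hits += 1
false_positive_rate = hits / reps
print(false_positive_rate)  # well above the nominal 0.05
```

Because the scale is bounded, the wider latent distribution piles responses into the end categories and drags the mean ordinal score toward the middle, so the t-test “detects” a difference that does not exist on the latent scale.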

But why?

These errors arise from discrepancies between the mean values the two models produce. The ordered probit model correctly captured the underlying distributions and effect sizes (and therefore means) of the groups. For the metric model to do the same, the mean ordinal value would need to depend only on the latent mean, regardless of the standard deviation (i.e. the width) of the latent distribution. But looking at the graph below, we can clearly see it doesn’t.

The underlying (latent) mean from the ordered probit model is on the x-axis and the metric-model mean is on the y-axis. The different lines represent different SDs, which correspond to different distribution widths (the larger the SD, the wider the distribution). When the underlying mean is the same between groups but the SDs differ, the mean ordinal values for the two groups differ (points A and B on the graph): the metric model reports a difference between means when there isn’t one, a Type I error. Conversely, when the latent means differ but the mean ordinal values are the same (points B and D), the metric model misses a real difference, a Type II error.
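
This SD-dependence is easy to verify numerically: hold the latent mean fixed, vary the latent SD, and compute the expected ordinal response under the thresholded model (thresholds again illustrative):

```python
import numpy as np
from scipy.stats import norm

def mean_ordinal(mu, sigma, thresholds=(1.5, 2.5, 3.5, 4.5)):
    """Expected 1-5 ordinal response under an ordered probit model."""
    cuts = np.concatenate(([-np.inf], np.asarray(thresholds), [np.inf]))
    probs = np.diff(norm.cdf((cuts - mu) / sigma))
    return float(np.dot(np.arange(1, 6), probs))

# Same latent mean, different latent SDs
m_narrow = mean_ordinal(mu=4.0, sigma=0.5)
m_wide = mean_ordinal(mu=4.0, sigma=2.0)
print(m_narrow, m_wide)  # the wider distribution has a lower ordinal mean
```

This is exactly the situation at points A and B on the graph: equal latent means, unequal SDs, unequal mean ordinal values.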

Conclusion

Many of us (myself included) were taught to analyse ordinal data with a metric model, but the evidence suggests this is a bad idea: it reduces power and greatly inflates the chances of a false positive or a false negative. Ordered probit models are much better suited to the task, not least because they allow for unequal variances across groups. The best thing to do, it seems, is to analyse ordinal data as ordinal data.

Notes

Thank you to Alex Etz for the explanation drawing of the normal distribution of noise for a linear model.

Author feedback

John Kruschke recommended a few clarifications and developing the conclusion. Torrin Liddell stated he didn’t have anything to add to John’s comments and recommended this paper by Selker, Lee & Iyer (2017).

References

Bevan, M. F., Denton, J. Q., & Myers, J. L. (1974). The robustness of the F test to violations of continuity and form of treatment populations. British Journal of Mathematical and Statistical Psychology, 27, 199–204.

Ghosh, S. K., Burns, C. B., Prager, Zhang, D. L., & Hui, L. (2018). On nonparametric estimation of the latent distribution for ordinal data. Computational Statistics and Data Analysis, 119, 86–98.

Heeren, T., & D’Agostino, R. (1987). Robustness of the two independent samples t-test when applied to ordinal scaled data. Statistics in Medicine, 6, 79–90.

Jolliffe, I. T. (1995). Sample sizes and the Central Limit Theorem: The Poisson distribution as an illustration. The American Statistician, 49(3), 269. DOI: 10.1080/00031305.1995.10476161

Liddell, T., & Kruschke, J. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348. Available at: https://osf.io/9h3et/

Nanna, M. J., & Sawilowsky, S. S. (1998). Analysis of Likert scale data in disability and medical rehabilitation research. Psychological Methods, 3(1), 55–67. DOI: 10.1037//1082-989X.3.1.55

Selker, R., Lee, M. D., & Iyer, R. (2017). Thurstonian cognitive models for aggregating top-n lists. Decision, 4(2), 87–101.