A couple of months ago, I wrote a summary of a recent paper (Liddell & Kruschke, 2018) arguing you shouldn’t analyse ordinal data as if it were interval or ratio data. If you do, you risk inflated Type I and Type II error rates (the latter meaning reduced power). In response, Helen Wauck wrote a comment asking about the relevance of a paper by Norman (2010). She had also been taught to use metric models for ordinal data, with this paper used as justification. Norman argues you can analyse ordinal data like interval data because parametric and non-parametric tests produce similar results. This is because the parametric tests are robust to assumption violations, such as the data being non-metric and not being normally distributed.

Norman frames the paper by outlining common criticisms of statistical analyses given during peer review. The three are: ‘You can’t use parametric tests in this study because the sample size is too small’; ‘You can’t use t tests and ANOVA because the data are not normally distributed’; and ‘You can’t use parametric tests like ANOVA and Pearson correlations… because the data are ordinal and you can’t assume normality’. In this analysis, I’m only going to focus on the third criticism, for which Norman gives three answers.

1st answer: a history of robustness

Tests of central tendency, e.g. ANOVA and t tests, have previously demonstrated robustness. These studies analysed data sets using both parametric and non-parametric tests and compared the results. If the two tests agreed, i.e. both returned significant (or non-significant) results, the parametric test was said to be robust. Various studies have been performed over the years with different distributions and sample sizes as low as 4 per group. In the vast majority of comparisons reported by Norman (2010), both tests were statistically significant. However, this focus on merely retaining or rejecting a two-sided null hypothesis ignores a lot of valuable information. As Richard Morey argues, it is a ‘shallow way of thinking’. The reduction of a statistical test to a simple binary, whilst common, tells us very little (for a more detailed explanation, see Meehl, 1967).
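The design of these robustness studies is easy to sketch in code. The simulation below is a hypothetical miniature (the sample sizes, exponential distribution, location shift, and 0.05 threshold are all my illustrative choices, not taken from the original studies): it draws small, skewed samples and checks how often a parametric t test and a non-parametric Mann–Whitney U test reach the same verdict about significance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# How often do a parametric and a non-parametric test agree about
# significance on small, skewed samples? (All settings illustrative.)
agree, n_sims = 0, 1000
for _ in range(n_sims):
    # Two small groups from a skewed (exponential) distribution,
    # with a genuine shift in location between them
    a = rng.exponential(1.0, size=8)
    b = rng.exponential(1.0, size=8) + 1.5
    p_t = stats.ttest_ind(a, b)[1]        # parametric: t test
    p_u = stats.mannwhitneyu(a, b)[1]     # non-parametric: Mann-Whitney U
    agree += (p_t < 0.05) == (p_u < 0.05)

agreement_rate = agree / n_sims
print(agreement_rate)
```

In runs like this the two tests usually agree, which is the pattern the robustness literature reports — but, as noted above, agreement on a binary verdict is a weak standard of evidence.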

But this robustness also held true for correlations and went beyond retention or rejection of a two-sided null hypothesis (Havlicek & Peterson, 1976). In addition, Norman replicated the finding with real data from Fletcher et al. (2010). The data started as a series of 10-point ordinal scales (taken at 2 time points), which he collapsed into 5-point scales to make “extremely ordinal data sets”. He calculated the Pearson and Spearman correlation between Time 1 and Time 2 for each scale, then calculated the correlation between the pairs of correlations. For the original scales, the correlation between the Pearson and Spearman coefficients was 0.99. For the transformed 5-point scales, the correlation was 0.987. There was a near perfect correlation between the parametric and non-parametric measures. Whilst this is an impressive correlation, it only pertains to one type of data set. There is no evidence this extends to other kinds of data, with varying amounts of skew and different distributions.
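Norman’s correlation-of-correlations comparison can be sketched with simulated data. I don’t have the Fletcher et al. data, so everything below — the number of scales, sample size, and the range of latent correlations — is an illustrative assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulate_scale(n=100, points=5, rho=0.5):
    """Paired ordinal ratings (Time 1 vs Time 2): discretise a
    correlated bivariate normal latent variable into `points` categories."""
    cov = [[1.0, rho], [rho, 1.0]]
    latent = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    cuts = np.quantile(latent, np.linspace(0, 1, points + 1)[1:-1])
    return np.digitize(latent[:, 0], cuts), np.digitize(latent[:, 1], cuts)

pearsons, spearmans = [], []
for _ in range(50):  # 50 hypothetical scales with varying reliability
    t1, t2 = simulate_scale(rho=rng.uniform(0.2, 0.8))
    pearsons.append(stats.pearsonr(t1, t2)[0])
    spearmans.append(stats.spearmanr(t1, t2)[0])

# Norman's comparison: correlate the paired correlation coefficients
agreement = stats.pearsonr(pearsons, spearmans)[0]
print(round(agreement, 3))
```

On data like this the agreement is typically near 1, mirroring the 0.99 Norman reports — though, as argued above, that says little about data with shapes this simulation doesn’t cover.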

2nd answer: converting ordinal to interval

The next answer advocates using metric tests because, whilst individual Likert questions are ordinal, Likert scales (which involve summing items) are interval (Norman, 2010). The same argument has been put forward by other authors, such as Carifio and Perla (2008). But, as Saskia Homer explains, labelling the ordinal responses with integers doesn’t turn them into interval measurements. They are still based on ordinal data, with unknown gap sizes between the rankings. The argument also mistakes the level of measurement for the shape of the variable’s distribution (Williams, 2019). It is true the sum of Likert items will look more like a normal distribution, but that doesn’t convert the units into interval ones. Likert himself argued this wasn’t a problem, as respondents typically view the response scale as a set of evenly-spaced points along a continuum (as reported in Johns, 2010). But this doesn’t overcome the issue of the unknown sizes of the gaps, and the assumption has been cautioned against. Further, Liddell and Kruschke (2018) provide empirical evidence that summing Likert items and analysing them using metric models inflates both the Type I and Type II error rates, as the distributions are likely to be non-normal.
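A quick simulation illustrates the distinction drawn above: summing items changes the shape of the distribution without changing the level of measurement. The response probabilities and the 10-item scale below are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 5-point Likert item where most respondents agree:
# the response probabilities below are invented for illustration.
options = [1, 2, 3, 4, 5]
probs = [0.05, 0.10, 0.15, 0.30, 0.40]

single_item = rng.choice(options, size=10_000, p=probs)

# A 10-item scale score: the sum of ten such items per respondent
scale_score = rng.choice(options, size=(10_000, 10), p=probs).sum(axis=1)

# The summed scale is far less skewed (more 'normal-looking') than any
# single item -- but its units are still sums of ordinal category labels.
print(round(stats.skew(single_item), 2), round(stats.skew(scale_score), 2))
```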

3rd answer: the numbers don’t know

Norman’s last piece of evidence is conceptual. Even if the numbers are drawn from a Likert scale, so we cannot theoretically guarantee the distances between the numbers are equal, they don’t have a magical property which means they can’t be analysed as metric data. The numbers aren’t aware of where they were drawn from, and they behave accordingly. As long as the numbers are reasonably distributed, we can make inferences from the data. However, this rests on a key assumption against which there is strong evidence. Liddell and Kruschke (2018) provide comprehensive theoretical evidence as to why you can’t assume a normal distribution with ordinal data for the tests Norman argues for. Doing so can inflate both the false positive and false negative rates, among other things.
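One of the mechanisms Liddell and Kruschke describe can be sketched as follows: two groups with the same latent mean but different spread, pushed through uneven category thresholds, produce ordinal ‘mean differences’ that a t test flags far more often than the nominal 5%, even though the latent means are identical. The thresholds, sample size, and spreads below are my own illustrative choices, not theirs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def ordinalise(latent, thresholds=(-1.0, 0.0, 1.0, 2.0)):
    """Map latent scores to 5 ordered categories via fixed,
    deliberately uneven cut points (illustrative values)."""
    return np.digitize(latent, thresholds) + 1

n_sims, n, alpha = 500, 200, 0.05
false_positives = 0
for _ in range(n_sims):
    # Both groups share the same latent mean (0); only the spread differs
    a = ordinalise(rng.normal(0.0, 1.0, n))
    b = ordinalise(rng.normal(0.0, 3.0, n))
    if stats.ttest_ind(a, b)[1] < alpha:
        false_positives += 1

rate = false_positives / n_sims
print(rate)
```

The t test rejects the null far more often than 5% here, despite equal latent means: the unequal spread leaks into the ordinal category means through the uneven thresholds.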

Is it ever acceptable to analyse ordinal data using metric models?

Under certain conditions, using metric models for ordinal data may be suitable. Beauducel and Herzberg (2006) ran simulations for confirmatory factor analysis, comparing the results from maximum likelihood estimation (a metric model) to weighted least squares means and variance adjusted (WLSMV) estimation (a non-parametric model). They did this with a variety of sample sizes, numbers of factors, and numbers of categories for the ordinal data. The most relevant result concerned the differences across the number of categories: when the data were divided into 5 or more categories, the metric model performed as well as the non-metric model.

However, these results used polychoric correlations. Polychoric correlations are estimates of the linear relationship between two continuous variables when you only have ordinal data. There is a significant amount of empirical work demonstrating that using maximum likelihood estimation with polychoric correlations is inappropriate, as it often produces biased parameter estimates and standard errors (Dolan, 1994; Rigdon & Ferguson, 1991).
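To make the idea of a polychoric correlation concrete, here is a minimal two-step estimator, sketched with illustrative numbers: estimate thresholds from each variable’s margins, then pick the latent correlation that best explains the observed contingency table under a bivariate normal model. This is a common textbook construction, not the exact procedure of any paper discussed here.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(3)

# Simulate two 4-category ordinal variables by discretising a correlated
# bivariate normal latent pair (latent correlation 0.5; values illustrative)
n, rho_true = 2000, 0.5
latent = rng.multivariate_normal([0, 0], [[1, rho_true], [rho_true, 1]], n)
cuts = np.array([-0.8, 0.3, 1.1])
a, b = np.digitize(latent[:, 0], cuts), np.digitize(latent[:, 1], cuts)

# Step 1: estimate each variable's thresholds from its marginal proportions
def thresholds(v, k=4):
    cum = np.cumsum([np.mean(v == i) for i in range(k - 1)])
    return stats.norm.ppf(cum)

LIM = 8.0  # stands in for +/- infinity on the standard normal scale
edges_a = np.concatenate([[-LIM], thresholds(a), [LIM]])
edges_b = np.concatenate([[-LIM], thresholds(b), [LIM]])
counts = np.zeros((4, 4))
for i, j in zip(a, b):
    counts[i, j] += 1

# Step 2: choose the latent correlation that maximises the likelihood
# of the observed 4x4 contingency table
def neg_log_lik(rho):
    mvn = stats.multivariate_normal([0, 0], [[1, rho], [rho, 1]])
    ll = 0.0
    for i in range(4):
        for j in range(4):
            # cell probability via inclusion-exclusion on the bivariate CDF
            p = (mvn.cdf([edges_a[i + 1], edges_b[j + 1]])
                 - mvn.cdf([edges_a[i], edges_b[j + 1]])
                 - mvn.cdf([edges_a[i + 1], edges_b[j]])
                 + mvn.cdf([edges_a[i], edges_b[j]]))
            ll += counts[i, j] * np.log(max(p, 1e-12))
    return -ll

rho_hat = optimize.minimize_scalar(neg_log_lik, bounds=(-0.99, 0.99),
                                   method="bounded").x
print(round(rho_hat, 2))
```

Unlike the raw Pearson correlation of the category codes, which is attenuated by the discretisation, this estimate targets the latent correlation itself.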

Despite the arguments against it, there had been little direct comparison of the methods under various conditions. Rhemtulla, Brosseau-Liard, and Savalei (2012) compared the results from metric models to non-metric models under a wide range of factors, e.g. number of categories, underlying distribution, and sample size. Non-parametric models were superior when the data had fewer than 5 categories. With 5 or more categories, both metric and non-parametric methods gave acceptable performance. The choice as to which is more appropriate depends on other aspects of the data, e.g. the symmetry of the observed distribution and the likely underlying distribution of the constructs being measured.

Why the Norman paper might pose problems

Whilst it seems that most of the time it is inappropriate to analyse ordinal data as if it were metric, there are instances where it is justifiable. However, the papers demonstrating its suitability are precise about the limiting conditions, i.e. when it is and is not appropriate to use these methods. Norman (2010) paints in much broader strokes as to when it is acceptable to analyse ordinal data like metric data, so the paper may be used to justify any use of metric analysis, even when it is suboptimal. A better approach would be to use ordinal models, with Bürkner and Vuorre (2018) providing a valuable tutorial on how to do so.
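To give a flavour of what an ordinal model looks like (Bürkner and Vuorre’s tutorial uses Bayesian models in R; the hand-rolled Python sketch below is my own minimal stand-in), here is a cumulative-link (‘ordered logit’) model fitted by maximum likelihood on simulated data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(7)

# Simulated data: a latent response driven by one predictor, cut into
# 4 ordered categories (effect size and cut points are illustrative)
n = 500
x = rng.normal(size=n)
latent = 1.2 * x + rng.logistic(size=n)
y = np.digitize(latent, [-1.0, 0.0, 1.5])  # categories 0..3

def neg_log_lik(params):
    """Negative log-likelihood of a cumulative-link (ordered logit) model."""
    beta, raw = params[0], params[1:]
    # Keep the cut points ordered: first is free, gaps are exponentiated
    cuts = np.cumsum(np.concatenate(([raw[0]], np.exp(raw[1:]))))
    # P(Y <= k | x) under a logistic link, padded with 0 and 1
    cum = expit(cuts[None, :] - beta * x[:, None])
    cum = np.hstack([np.zeros((n, 1)), cum, np.ones((n, 1))])
    probs = cum[np.arange(n), y + 1] - cum[np.arange(n), y]
    return -np.sum(np.log(np.clip(probs, 1e-12, None)))

fit = minimize(neg_log_lik, x0=np.zeros(4), method="BFGS")
beta_hat = fit.x[0]
print(round(beta_hat, 2))  # should land near the true slope of 1.2
```

This hand-rolled version is only for intuition; in practice you would reach for a dedicated implementation, such as the models in Bürkner and Vuorre’s tutorial or an off-the-shelf ordered regression routine.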

Whether you can analyse ordinal data like interval or ratio data greatly depends on the structure of your data and the types of questions you want to ask. For analyses like ANOVA and t tests (which can be framed as special cases of structural equation modelling, SEM), there is good evidence you should be cautious about doing so. But, under certain conditions, there is reason to believe analysing ordinal data using metric models is acceptable for other kinds of SEM, like confirmatory factor analysis with 5 or more response categories.

References

Beauducel, A., & Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling: A Multidisciplinary Journal, 13(2), 186–203. https://doi.org/10.1207/s15328007sem1302_2

Bürkner, P.-C., & Vuorre, M. (2018). Ordinal regression models in psychology: A tutorial. https://osf.io/qp3t7/

Carifio, J., & Perla, R. (2008). Resolving the 50-year debate around using and misusing Likert scales. Medical Education, 42(12), 1150–1152. https://doi.org/10.1111/j.1365-2923.2008.03172.x

Dolan, C. (1994). Factor analysis of variables with 2, 3, 5 and 7 response categories: A comparison of categorical variable estimators using simulated data. British Journal of Mathematical and Statistical Psychology, 47, 309–326. https://doi.org/10.1111/j.2044-8317.1994.tb01039.x

Fletcher, K. E., French, C. T., Irwin, R. S., Corapi, K. M., & Norman, G. R. (2010). A prospective global measure, the Punum Ladder, provides more valid assessments of quality of life than a retrospective transition measure. Journal of Clinical Epidemiology, 63(10), 1123–1131. https://doi.org/10.1016/j.jclinepi.2009.09.015

Flora, D., & Curran, P. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9(4), 466–491. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3153362/

Havlicek, L. L., & Peterson, N. L. (1976). Robustness of the Pearson correlation against violations of assumptions. Perceptual and Motor Skills, 43(3), 1319–1334. https://doi.org/10.2466/pms.1976.43.3f.1319

Johns, R. (2010). Likert items and scales. https://ukdataservice.ac.uk/media/262829/discover_likertfactsheet.pdf

Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348. https://doi.org/10.1016/j.jesp.2018.08.009

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103–115. https://doi.org/10.1086/288135

Norman, G. (2010). Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education: Theory and Practice, 15(5), 625–632. https://doi.org/10.1007/s10459-010-9222-y

Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17(3), 354–373. https://doi.org/10.1037/a0029315

Rigdon, E., & Ferguson, C. (1991). The performance of the polychoric correlation coefficient and selected fitting functions in confirmatory factor analysis with ordinal data. Journal of Marketing Research, 28(4), 491. https://search-proquest-com.libproxy.ucl.ac.uk/docview/235231338/fulltextPDF/720F2812EF914E5CPQ/1?accountid=14511

Williams, M. (2019). Scales of measurement and statistical analyses [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/c5278