One of the first things all psychology students are taught is levels of measurement. Every student must wrap their heads around the four different forms data can take: nominal, ordinal, interval, or ratio. These are the bedrock of a lot of students’ understanding of measurement, including mine. I didn’t realise there were questions about their validity and utility until recently. Should we still use these levels of measurement? Do they aid our understanding of measurement? Or do they need to retire?

On the level

The levels of measurement we all know and love were proposed by Stanley Smith Stevens in the 1940s. These levels are based on whether the meaning of the data changes when different classes of transformation are applied. For example, nominal data categories can be completely arbitrary. You can use numbers, letters, names, pretty much anything, and the information in the data is the same. Thus, they can be transformed in many ways. But you can’t do the same for ratio data; you must use transformations that preserve the information. If you are weighing something in kilograms and you change your ratio scale from 0-20 to 47.5-10,006 (whilst keeping the same number of units in the scale), then you no longer have the same information; the ‘fixed’ zero point is no longer fixed and the gaps between the units are no longer the same.

There is a ‘hierarchy of data scales based on invariance of their meaning under different classes of transformations’. Nominal data are the lowest on the hierarchy, then ordinal, then interval, and ratio is the highest. The transformations your data can tolerate therefore tell you where they sit on this hierarchy. If the meaning is preserved after many different transformations, the data are at the lowest level; the fewer transformations they can tolerate, the higher up the hierarchy they are. There is a corresponding positive relationship between where the data are on the hierarchy and the number of meaningful calculations: the higher the data are on the hierarchy, the more calculations can be performed. For nominal data, you can only count the number of cases and therefore calculate the mode. But for ratio data, you can calculate the mean (as well as the median and mode), the standard deviation, correlations, etc.
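To make this concrete, here is a minimal sketch in Python (the data values are invented for illustration): the only meaningful summary of nominal labels is a count-based statistic like the mode, while ratio data supports the full toolbox.

```python
from statistics import mean, mode, stdev

# Nominal data: only counting cases is meaningful, so the mode is the
# only admissible "average".
pets = ["cat", "dog", "cat", "bird"]
print(mode(pets))  # → cat

# Ratio data (e.g. weights in kg): means, standard deviations,
# correlations, etc. are all meaningful.
weights_kg = [3.5, 4.0, 5.5, 7.0]
print(mean(weights_kg))  # → 5.0
print(stdev(weights_kg))
```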

Do as I command!

This hierarchy is the foundation of one of the main prescriptions of Stevens’ theory of scales of measurement: your choice of statistical test should be guided by the level of measurement used, such that truth statements based on the statistical analyses should remain valid under ‘admissible’ transformations of the data. The ‘admissible’ transformations for nominal data are one-to-one transformations, like replacing one label with another. For ordinal data, order-preserving transformations are acceptable, e.g. moving from a 0-6 scale to a 15-27 scale. For interval data, positive linear transformations are admissible, and for ratio data, only multiplication by a positive constant is allowed. Therefore, the higher the level of measurement, the fewer transformations are admissible.
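These four classes of admissible transformation can be sketched in a few lines of Python (all values below are made up for illustration):

```python
# Nominal: any one-to-one relabelling preserves the information.
colours = ["red", "blue", "red"]
relabel = {"red": "A", "blue": "B"}
relabelled = [relabel[c] for c in colours]
assert relabelled.count("A") == colours.count("red")

# Ordinal: any order-preserving (monotone) map is admissible.
ratings = [0, 2, 6]
stretched = [2 * r + 15 for r in ratings]  # a 0-6 scale mapped onto 15-27
assert sorted(stretched) == stretched      # the order survives

# Interval: positive linear maps (a*x + b, a > 0) preserve ratios of differences.
interval = [10, 20, 35]
linear = [2 * x + 5 for x in interval]
assert (linear[2] - linear[1]) / (linear[1] - linear[0]) == \
       (interval[2] - interval[1]) / (interval[1] - interval[0])

# Ratio: only multiplication by a positive constant is admissible
# (adding a constant would move the fixed zero point).
ratio = [2, 4]
scaled = [3 * x for x in ratio]
assert scaled[1] / scaled[0] == ratio[1] / ratio[0]
print("all four levels behave as described")
```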

The transformations the data can tolerate tell us which statistical tests to use. For example, ordinal data from two independent groups should be statistically analysed using tests that are invariant to changing the scale whilst preserving the order. Squaring each data point is such a monotonically increasing transformation, assuming positive numbers. Analysing ordinal results with a statistical test that converts the scores into ranks is insensitive to this admissible transformation. Whether you perform this order preserving transformation or not, the truth statements based on this statistical analysis remain the same. Thus, this test is appropriate. But theoretically you shouldn’t analyse the same data using a test that can’t tolerate this transformation e.g. Welch’s or Student’s t-test. These tests use the mean of the scores, so squaring each data point will affect this calculation. Thus, the truth statement of the result has likely changed despite the admissible transformation. Therefore, this statistical test is not appropriate.
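A small simulation illustrates the point (the group scores are invented, and the rank-sum statistic below is the core of the Wilcoxon/Mann-Whitney family of tests): squaring positive scores leaves a rank-based statistic untouched but changes a mean-based one.

```python
def rank_sum(group_a, group_b):
    """Sum of group_a's ranks in the pooled sample (no ties assumed)."""
    pooled = sorted(group_a + group_b)
    return sum(pooled.index(x) + 1 for x in group_a)

def mean_diff(group_a, group_b):
    """Difference of group means, the quantity t-tests are built on."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

a = [1, 3, 5, 8]
b = [2, 4, 9, 11]
a_sq, b_sq = [x**2 for x in a], [x**2 for x in b]  # an admissible ordinal transformation

# The rank-based statistic is invariant under the monotone transformation...
print(rank_sum(a, b), rank_sum(a_sq, b_sq))    # → 15 15

# ...but the mean-based statistic is not.
print(mean_diff(a, b), mean_diff(a_sq, b_sq))  # → -2.25 -30.75
```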

Lord have mercy, have mercy

As stated above, one of the purposes of these levels of measurement was to determine what kinds of statistical tests could be permitted. The level of measurement of the data should therefore be a very strong signal as to which statistical analyses you should conduct. However, one of the main criticisms of these levels is that ‘They do not describe the attributes of real data that are essential to good statistical analysis’ (Velleman & Wilkinson, 1993). Thus, many argue that the levels of measurement don’t capture the important information that guides how you should statistically analyse your data.

One of the most famous examples of this problem comes from Lord (1953). In it, Lord presents a thought experiment in which a retired professor sells the numbers on the backs of American Football jerseys (called ‘football numbers’) to university players. One team suspects they were sold lower numbers than another team and complains to the professor. To see whether they were sold lower numbers by chance, the professor enlists the help of a statistician. The statistician begins by calculating the mean of the football numbers, much to the despair of the professor. After performing more forbidden mathematics, the statistician concludes that the numbers were not a random sample. We are then presented with a conundrum: how did putatively invalid calculations produce a meaningful result? As Zand Scholten and Borsboom (2009) explain, scale type, as defined by Stevens, is not a fundamental attribute of the data. Instead, the measurement level depends on the questions you intend to ask of the data and any additional information you have. Many have argued this shows the lack of utility of Stevens’ levels of measurement.

However, the picture is more complex.

Ghost in the machine

On the surface, it seems the professor in Lord’s 1953 paper asked a question about the nominal numbers and whether they were randomly distributed. But, as Zand Scholten and Borsboom (2009) argue, what the professor actually asked, and drew a conclusion about, was the machine: was it in its original state (randomly shuffled) when the numbers were drawn? As such, no inference about the uniqueness of the numbers was made. The reference class was a set of possible states for the machine (fair or biased). Therefore, whether a nominal scale can be analysed as metric wasn’t explored.

So what does it mean to say the inference regarded the state of the machine, not the player numbers? The test was to see whether the machine was in one of two states: fair or biased. But the machine could have been tampered with in a multitude of ways, so its state doesn’t fall into simple nominal categories. The machine could have had all numbers below 20 removed, producing a higher mean value. Alternatively, all the numbers greater than 35 could have been removed, creating a heavily biased low mean. Thus, there is a huge range of potential bias in the machine. Moreover, these amounts of bias can be ordered (one hypothetical machine can be more or less biased than another), so bias can be treated as ordinal.

Not only that, it can be shown that the amount of bias possesses a quantitative structure that can be represented by the population mean, up to linear transformations. To show this quantitative structure, it is sufficient to show that amounts of bias can be concatenated and that the result of this concatenation has the correct properties. The authors give the analogy of concatenating temperatures in volumes of liquid. If you have two equal volumes of water, one at 10°C and the other at 20°C, and mix them together, the resulting temperature will be 15°C. This is the equivalent of taking the mean of the individual temperatures. The same process can conceptually be applied to the amount of bias in two machines. The bias in the machines can be “added” by concatenating the numbers drawn from each machine into one random pile of numbers. This is the same as finding the mean of the bias of the two machines. The fact that these linear transformations represent the bias equally well demonstrates that the scale of bias is interval.
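The concatenation step can be checked numerically (the draws below are hypothetical): with equal-sized draws, pooling the numbers from two machines and taking the mean is exactly the same as averaging the two machines’ means, just as with mixing equal volumes of water.

```python
machine_a = [12, 18, 25, 33]  # hypothetical draws from one machine
machine_b = [40, 44, 51, 57]  # hypothetical draws from a more biased machine

mean_a = sum(machine_a) / len(machine_a)
mean_b = sum(machine_b) / len(machine_b)

pooled = machine_a + machine_b             # "concatenate" the two piles
pooled_mean = sum(pooled) / len(pooled)

# The mean of the pooled pile equals the mean of the two means.
print(pooled_mean, (mean_a + mean_b) / 2)  # → 35.0 35.0
```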

Take care

This analysis relies on the assumption that a single set of numbers can be represented in different ways. This was one of the criticisms levelled at Stevens’ typology: that the level of measurement isn’t a feature of the data itself. But, as Zand Scholten and Borsboom argue, being able to represent the numbers in different ways, depending on the questions you are asking, isn’t a flaw in the levels of measurement. It does undermine the rules presented by Stevens regarding admissible tests, but not necessarily the concept of levels of measurement itself.

Whilst Lord’s paper may not be the devastating critique many think it is, what it does show is that you must be careful when thinking about how to analyse your data. You cannot blindly assume one form of analysis is correct, as what is most appropriate depends on a range of things. Lord himself, in a comment on his 1953 paper, stated that ‘utmost care must be exercised in interpreting the results of arithmetic operations upon nominal and ordinal numbers’. He gives an example where an ordinal data set is best analysed using an ordinal statistic (the median), because analysing it with a metric model would require some likely unjustified assumptions. What Lord was advocating for, as evidenced by his follow-up publication, was not the complete rejection of Stevens’ typology. What he, along with many other authors, was arguing for was an appreciation of the shades of grey when running statistical analyses. Rather than thoughtlessly following the levels of measurement, ask: what is the distribution of the data? How does your model treat that distribution, and is it appropriate? These are some of the factors that matter more than the level of measurement.

Time to retire?

Given the repeated calls for greater consideration when analysing data, would retiring Stevens’ levels of measurement help achieve this end? Would retiring them completely encourage people to think more about their data, rather than just thoughtlessly applying the prescribed typology? If they were no longer taught, what should people use instead? Some have argued the levels do more to confuse than help, so we should only think about whether variables are discrete or continuous. I find this argument the most persuasive, though I worry the discussion is moot. The levels are largely ignored when it is convenient for the researcher, and removing them entirely is unlikely to improve the situation without a greater appreciation of nuance. If retiring the typology leads to a majority critically thinking about how to analyse their data, then I am all for it. But without an overhaul of how many of us approach statistical analysis, it feels like rearranging deck chairs on the Titanic.

References

Chrisman, N. R. (1998). Rethinking Levels of Measurement for Cartography. Cartography and Geographic Information Systems, 25(4), 231–242. https://doi.org/10.1559/152304098782383043

Krantz, D. H., Suppes, P., Luce, R. D., & Tversky, A. (2009). Additive and Polynomial Representations: 1. Dover Publications Inc.

Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348. https://doi.org/10.1016/j.jesp.2018.08.009

Lord, F. M. (1953). On the Statistical Treatment of Football Numbers. American Psychologist, 8(12), 750–751. https://doi.org/10.1037/h0063675

Lord, F. M. (1954). Further Comment on “Football Numbers.” American Psychologist, 9(6), 264–265.

Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In Handbook of experimental psychology (pp. 1–49). Wiley.

Velleman, P. F., & Wilkinson, L. (1993). Nominal, Ordinal, Interval, and Ratio Typologies are Misleading. The American Statistician, 47(1), 65–72. https://doi.org/10.1080/00031305.1993.10475938

Williams, M. (2019). Scales of measurement and statistical analyses [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/c5278

Zand Scholten, A., & Borsboom, D. (2009). A reanalysis of Lord’s statistical treatment of football numbers. Journal of Mathematical Psychology, 53(2), 69–75. https://doi.org/10.1016/j.jmp.2009.01.002

Zimmerman, D. W. (1995). Increasing the Power of the ANOVA F Test for Outlier-Prone Distributions by Modified Ranking Methods. The Journal of General Psychology, 122(1), 83–94. https://doi.org/10.1080/00221309.1995.9921224

Zumbo, B. D., & Kroc, E. (2019). A Measurement Is a Choice and Stevens’ Scales of Measurement Do Not Help Make It: A Response to Chalmers. Educational and Psychological Measurement, 79(6), 1184–1197. https://doi.org/10.1177/0013164419844305