The longer the test, the more reliable it is – up to a point. Howard Wainer and Richard Feinberg expose the costs and hours lost in pursuit of marginal gains and worthless subscores

Standardised tests – whether to evaluate student performance, to choose among college applicants or to license candidates for various professions – are often marathons. Tests designed to evaluate knowledge of coursework typically use the canonical hour; admissions tests usually take 2–3 hours; and licensing exams can take days. Why are they as long as they are?

The first answer that springs to mind is the inexorable relationship between a test's length and its reliability – a standardised measure of the stability and consistency of test results, ranging from a low of 0 (the score is essentially a random number) to a high of 1 (the score does not fluctuate at all). Although a test score always gets more reliable as the test generating it gets longer, the law of diminishing returns sets in very quickly. In Figure 1 we show the reliability of a typical professionally prepared test as a function of its length. It shows that the marginal gain of moving from a 30-item test to a 60- or even 90-item one is not worth the trouble unless such small additional increments in reliability are required.

Figure 1: Spearman–Brown function showing the reliability of a test as a function of its length, if a one-item test has a reliability of 0.15.

But perhaps there are other uses for the information gathered by the test that require this additional length and accuracy. Here, the US Census provides an example. The last decennial population count cost $13 billion, or approximately $42 per person, to estimate the number of people in the country at 308 745 538, give or take 31 000. If all the Census gave us was that single number, it would be a colossal waste of taxpayer money. However, the constitutionally mandated purpose of the Census is far broader than that. It must also provide small-area estimates and answers to questions such as "How many households with two parents and three or more children live in the Bushwick section of Brooklyn, New York?". Such small-area estimates are crucial for the allocation of social services and for other purposes.

In tests, the equivalent of the small-area estimate is usually called a subscore. On a high school mathematics test there might be subscores on algebra, arithmetic, geometry and trigonometry. For a licensing exam in veterinary medicine there might be subscores on the pulmonary system, the skeletal system, the renal system, and so on. Thus the production of meaningful subscores would be a justification for tests that contain more items than would be required merely for an accurate enough estimate of the total score.

But what is a meaningful subscore? It is one that is reliable enough for its prospective use, and one that contains information not adequately captured by the total test score. There are at least two prospective uses for such subscores: to aid examinees in assessing their strengths and weaknesses, often with an eye towards remediating the latter; and to aid individuals and institutions (e.g. teachers and schools) in assessing the effectiveness of their instruction, again with an eye towards remediating weaknesses. In the first case, helping examinees, the subscores need to be reliable enough that attempts to address shortcomings do not become the futile pursuit of noise.
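Reliability's dependence on length is captured by the Spearman–Brown prophecy formula, which underlies Figure 1: a test of k parallel items has reliability k·rho1 / (1 + (k − 1)·rho1), where rho1 is the reliability of a single item. A minimal sketch in Python, taking rho1 = 0.15 as in the Figure 1 caption, makes the diminishing returns plain:

```python
def spearman_brown(k, rho_1=0.15):
    """Reliability of a k-item test, given the reliability of one item."""
    return k * rho_1 / (1 + (k - 1) * rho_1)

for k in (10, 30, 60, 90, 300):
    print(f"{k:3d} items: reliability = {spearman_brown(k):.3f}")
# 10 items: 0.638   30 items: 0.841   60 items: 0.914
# 90 items: 0.941   300 items: 0.981
```

Tripling the test from 30 to 90 items buys less than 0.10 of additional reliability, while a 10-item subscore, by the same arithmetic, musters only about 0.64.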
Reliability alone is not enough: obviously, the subscore must also contain information that is not available from the total score. Let us designate these two characteristics of a worthwhile subscore as "reliability" and "orthogonality".

Subscore reliability is governed by the same inexorable rules as the overall score – as test length decreases, so too does reliability. Thus if we need reliable subscores we must have enough items for the purpose, which means that the overall test would have to be longer than would be necessary for merely a single score.

For the second use, helping institutions, the test length would not have to increase, for the reliability would be calculated over the number of individuals from that institution who took the items of interest. If that number were large enough, the estimate could achieve high reliability.

So it would seem that one key justification for what appears at first to be the excessive length of most common tests is to provide feedback to examinees in subscores calculated from subsets of the tests. But how successful are test developers in providing such subscores? Not particularly, for such scores are typically based on few items and hence are not very reliable (see box).

Substandard subscores

The problem of reliability in subscores led to the development of empirical Bayes methods that allow a weakly measured subscore to borrow strength from the other items on the test, yielding an increase in reliability. This methodology was proposed by Wainer et al.1 in 2000 and elaborated the following year by Thissen and Wainer.2 It often increased the reliability of subscores substantially, but at the same time the influence of items from the rest of the test reduced the orthogonality of those subscores to the rest of the test. Empirical Bayes giveth, but it also taketh away. What was needed was a way to measure the value of an augmented subscore that weighed the delicate balance between increased reliability and decreased orthogonality. Until such a measure became available, the instigating question – "How successful are test developers in providing useful subscores?" – would remain unanswered.

Happily, the ability to answer this important question improved markedly in 2008 with the publication of Shelby Haberman's powerful new statistic, which combines both reliability and orthogonality.3 Using this tool, Sandip Sinharay searched high and low for subscores that had added value over the total score, but came up empty.4 Sinharay's empirical results were validated in simulations that matched the structure commonly encountered in different kinds of testing situations. These results were enriched and expanded by Richard Feinberg,5 and again the paucity of worthwhile subscores was confirmed. The same finding – of subscores adding no marginal value over the total score – was reconfirmed by Sinharay for tests whose goal was classification.6 While it is too early to say that no reported subscore is ever worth having, it seems sensible to conclude that, unless tests are massively redesigned, such subscores are likely to be rare.
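Haberman's criterion3 can be put compactly: report a subscore only if the observed subscore predicts the examinee's true subscore better than the observed total score does, as measured by the proportional reduction in mean squared error (PRMSE). Under classical test theory the first PRMSE is simply the subscore's reliability, and the second is the total's reliability multiplied by the squared correlation between the true subscore and the true total. The sketch below is a simplified rendering of that comparison, with all numerical values hypothetical:

```python
def prmse_from_subscore(rho_s):
    """PRMSE when the observed subscore predicts the true subscore:
    under classical test theory, this is the subscore's reliability."""
    return rho_s

def prmse_from_total(rho_x, r_true):
    """PRMSE when the observed total predicts the true subscore: the
    total's reliability times the squared true-score correlation."""
    return rho_x * r_true ** 2

# Hypothetical values: a short subscore (reliability 0.64) embedded in a
# long test (reliability 0.95) whose true score correlates 0.90 with the
# true subscore.
rho_s, rho_x, r_true = 0.64, 0.95, 0.90

print(prmse_from_subscore(rho_s))       # 0.64
print(prmse_from_total(rho_x, r_true))  # 0.77 -- the total wins
```

On these numbers the subscore has no added value: it would need a reliability above 0.77, or a true-score correlation below about 0.82, to beat the total – which is precisely the pattern Sinharay found to be pervasive.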

Justification

Where does this leave us? Unless we can find a viable purpose for which unreliable and non-orthogonal subscores have marginal value over the total test score, we are being wasteful (and perhaps unethical) in continuing to administer tests that take more examinee time than is justified by the information yielded.

One such purpose might be the use of the test as a prod to motivate students to study all aspects of the curriculum, and teachers to teach it. If tests were much shorter, fewer aspects of the curriculum would be well represented. But this is easily circumvented in a number of ways. If the curriculum is sampled cleverly, neither the teachers nor the examinees will know exactly what will be on the test, and so must include all of it in their preparation. Another approach is that taken by the US National Assessment of Educational Progress: use a balanced incomplete block design in which all sectors of the curriculum are well covered but no individual examinee gets all parts (see the sketch below). This allows sub-area mastery to be estimated in the aggregate and, through the statistical magic of score equating, still allows all examinee scores to rest on a common metric. We should also keep in mind the result, shown repeatedly with adaptive tests, that we can give a test of half its usual length with no loss of accuracy. Certainly there are fewer items of each sort, but examinees must still study all aspects, because on a shorter test each item "counts" more towards the final score. So, unless evidence can be gathered that shows a radical change in teaching and studying behaviour with shorter tests, we believe that we can reject motivation as a reason for setting overly long tests.
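To make the balanced incomplete block idea concrete, here is a minimal sketch – not NAEP's actual design – in which a curriculum is split into seven content blocks and assembled into seven short booklets of three blocks each, so that every block appears in three booklets and every pair of blocks appears together exactly once:

```python
from itertools import combinations
from collections import Counter

# A (7, 3, 1) balanced incomplete block design: cyclic shifts of the
# difference set {0, 1, 3} mod 7.
blocks = range(7)                               # 7 curriculum areas
booklets = [{s % 7, (s + 1) % 7, (s + 3) % 7} for s in range(7)]

# Every content block appears in exactly 3 booklets...
appearances = Counter(b for bk in booklets for b in bk)
assert all(appearances[b] == 3 for b in blocks)

# ...and every pair of blocks shares exactly 1 booklet, which is what
# lets sub-area mastery be estimated in the aggregate.
pairs = Counter(p for bk in booklets for p in combinations(sorted(bk), 2))
assert all(pairs[p] == 1 for p in combinations(blocks, 2))

for i, bk in enumerate(booklets):
    print(f"booklet {i}: blocks {sorted(bk)}")
```

Each examinee sits only three of the seven content blocks, yet across examinees every block – and every pairing of blocks – receives equal coverage, so no part of the curriculum can safely be skipped.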

Every little bit helps?

Another justification for the apparent excessive length of tests is that the small increase in reliability is of practical importance. Of course, the legitimacy of such a claim would need to be examined on a case-by-case basis, but perhaps we can gain some insight through a careful study of one artificial situation that has similarities to a number of serious tests.

Let us consider a prototypical licensing examination that has, say, 300 items, takes 8 hours to administer and has a reliability of 0.95. Such characteristics show a marked similarity to a number of professional licensing exams (it might be for attorneys or veterinarians or physicians or nurses or certified public accountants). The purpose of such an exam is to make a pass–fail decision, so let us assume that the pass score is 63%.

To make this demonstration dramatic, let us see what happens to the accuracy of our decisions if we make draconian reductions in test length. To begin with, let us eliminate 75% of the test items, reducing the test to just 75 items. Because of the gradual slope of the reliability curve shown in Figure 1, this kind of reduction would only shrink the reliability to 0.83. Is this still high enough to safeguard the welfare of future clients? The metric of reliability is not one that is close to our intuitions, so let us shift to something easier to understand: how many wrong pass–fail decisions would be made?

With the original test, 3.13% of the decisions would be incorrect, divided between false positives (passing candidates who should have failed) of 1.18% and false negatives (failing candidates who should have passed) of 1.95%. How does our shrunken test do? The overall error rate grows to 6.06%, almost double what the longer test yielded; this breaks down into a false positive rate of 2.26% and a false negative rate of 3.80%.

Is this inevitable diminution of accuracy sufficiently large to justify the fourfold increase in test length? That, of course, is a value judgement, but before making it we must realise that the cost in accuracy can be eased. The false positive rate for this test is the most important one, for it measures the proportion of incompetent practising professionals who are incorrectly licensed. Happily, we can control the false positive rate simply by raising the pass score. If we increase the pass score to 65% instead of 63%, the false positive rate drops back to the same 1.18% we had with the full test. Of course, by doing this, the false negative rate grows to 6.6%, but this is a venial sin that can be ameliorated easily by administering additional items to those candidates who only barely failed. Note that the same law of diminishing returns that worked to our advantage in total score (shown in Figure 1) also holds in determining the marginal value of adding more items to decrease the false negative rate. The parallel to the Spearman–Brown curve is shown in Figure 2.
Figure 2: The improvement in the false negative rate yielded through the lengthening of the test for those who only marginally failed.

The function in Figure 2 shows that by adding only 40 items for those examinees just below the cut-off score (those whose scores range from just below the minimal pass score of 65% down to about 62%) we can attain false negative rates acceptably close to those obtained with the 300-item test. This extension can be accomplished seamlessly if a computer administers the test. Thus, for most examinees, the test would take only a quarter of the time it previously required; and even the small number of examinees who had to answer extra items would face few enough of them that they, too, would come out far ahead in time.

The connections between test length, reliability and error rates are summarised in Figure 3.

Figure 3: Illustration of how increasing test length (number of items) increases score reliability and decreases error rates, though with diminishing returns.

Table 1 summarises these results, along with parallel results for an even more dramatically reduced test form of only 40 items.

Table 1. Summary of passing statistics

Test length   Pass score (%)   Reliability   Total error rate (%)   False positive rate (%)   False negative rate (%)
300           63               0.95          3.13                   1.18                      1.95
75            63               0.83          6.06                   2.26                      3.80
75            65               0.83          7.78                   1.18                      6.60
40            63               0.72          8.23                   2.59                      5.64
40            66               0.72          11.62                  1.14                      10.48

Thus, at least for simple pass–fail decisions, it seems that we can dismiss accuracy as a reason for the excessive test lengths used; for within plausible limits we can obtain comparable error rates with much shorter tests, albeit with an adaptive stopping rule.
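Numbers like those in Table 1 can be approximated with a short simulation. The sketch below assumes normally distributed true scores and classical measurement error sized to a given reliability; the true-score mean of 70 and standard deviation of 8 are illustrative guesses rather than the model behind the table, so the printed rates will only roughly echo it. Note that competence is always judged against the true standard of 63%, even when the operational pass score is raised:

```python
import numpy as np

rng = np.random.default_rng(0)

def error_rates(reliability, op_cut, true_cut=63.0, n=1_000_000,
                true_mean=70.0, true_sd=8.0):
    """Simulate pass-fail decisions under classical test theory.

    Observed = true + normal error, with the error variance chosen so
    that var(true) / var(observed) equals the reliability. Returns the
    false positive and false negative percentages, judged against the
    true competence cut-off.
    """
    true = rng.normal(true_mean, true_sd, n)
    err_sd = true_sd * np.sqrt((1 - reliability) / reliability)
    observed = true + rng.normal(0.0, err_sd, n)

    should_pass = true >= true_cut
    did_pass = observed >= op_cut
    fp = 100 * np.mean(~should_pass & did_pass)   # licensed, but shouldn't be
    fn = 100 * np.mean(should_pass & ~did_pass)   # failed, but shouldn't be
    return fp, fn

for rel, cut in [(0.95, 63), (0.83, 63), (0.83, 65), (0.72, 63), (0.72, 66)]:
    fp, fn = error_rates(rel, cut)
    print(f"reliability {rel:.2f}, pass score {cut}%: "
          f"FP {fp:.2f}%, FN {fn:.2f}%, total {fp + fn:.2f}%")
```

Raising the operational cut trades false positives for false negatives, exactly as in Table 1; the adaptive fix of Figure 2 – extra items only for near-failures – then claws back the false negatives without lengthening everyone's test.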

Excess costs

The costs associated with lengthy tests can be measured in various ways and, of course, they accrue differentially to different portions of the interested populations. The cost to users of test scores is nil, since neither their time nor their money is used to gather the scores. The cost to the testing organisation is likely to be substantial, since a single operational test question/item typically costs more than $2500 to produce. Add to this the cost of "seat time" paid to whoever administers the exam, grading costs, and so on, and it adds up to a considerable sum. Fixed costs being what they are, giving a new test a quarter of the length of an older one does not mean a quarter of the cost, but it does portend worthwhile savings.

We are also well aware of concerns that many tests, at their current lengths, do not allow enough time for some examinees. This worry could be eased by shrinking the time allowed for the test, though not quite as far as the reduced number of items would suggest.

Which brings us to the examinees. Their costs are of two kinds: the actual fees paid to the testing organisation, which could be reduced if the costs to that organisation were dramatically reduced; and the opportunity costs of time. If the current test takes 8 hours, then a shortened form only a quarter as long might take only 2 hours, a saving of 6 hours per examinee. Multiplied by perhaps 100 000 examinees who annually seek licensure, this would yield a time saving of 600 000 hours. Keeping in mind that the examinees taking a licensing exam are (or shortly will be) professionals for hire, what could be accomplished with 600 000 extra hours of their time? If those being licensed were attorneys and this was a bar exam, consider how much good 600 000 annual hours of pro bono legal aid could do; or a like amount of effort from professional accountants at tax time; or engineers; or veterinarians. It is hardly an exaggeration to suggest that this amount of spare time from pre- or just-licensed professionals could accelerate the progress of our civilisation.

Multiple choice

The facts presented here leave us with two possibilities: to shorten our tests to the minimal length required to yield acceptable accuracy for the total score, and thence choose more profitable activities for the time freed up; or to re-engineer our tests so that the subscores we calculate have the properties we require. Though the time savings offered by the first option would be of benefit (if only to the examinees and their work–life balance), users of test scores almost universally desire more information than a simple pass–fail result. Thus the second option – a redesign – is the one we find most attractive.

Sinharay, Haberman and Wainer have shown that the shortcomings found in the subscores calculated on so many of our large-scale tests are due to flaws in the tests' designs.7 A redesign is needed because we cannot retrieve information from our tests if the capacity to gather that information was not built in to begin with. Happily, a blueprint for how to do this was carefully laid out more than a decade ago, when Mislevy, Steinberg and Almond provided the principles and procedures for what they dubbed "evidence-centred design".8 It seems worth a shot to try it. In the meantime, we ought to stop wasting resources giving tests that are longer than the information they yield is worth.

References

1. Wainer, H., Sheehan, K. and Wang, X. (2000) Some paths toward making Praxis scores more useful. Journal of Educational Measurement, 37, 113–140.
2. Thissen, D. and Wainer, H. (2001) Test Scoring. Hillsdale, NJ: Lawrence Erlbaum Associates.
3. Haberman, S. (2008) When can subscores have value? Journal of Educational and Behavioral Statistics, 33(2), 204–229.
4. Sinharay, S. (2010) How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47(2), 150–174.
5. Feinberg, R. A. (2012) A simulation study of the situations in which reporting subscores can add value to licensure examinations. Ph.D. dissertation, University of Delaware. Retrieved 31 October 2012 from ProQuest Digital Dissertations database (Publication No. 3526412).
6. Sinharay, S. (2014) Analysis of added value of subscores with respect to classification. Journal of Educational Measurement, 51(2), 212–222.
7. Sinharay, S., Haberman, S. J. and Wainer, H. (2011) Do adjusted subscores lack validity? Don't blame the messenger. Educational and Psychological Measurement, 71(5), 789–797.
8. Mislevy, R. J., Steinberg, L. S. and Almond, R. G. (2003) On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–67.

Further reading

9. Haberman, S. J., Sinharay, S. and Puhan, G. (2009) Reporting subscores for institutions. British Journal of Mathematical and Statistical Psychology, 62, 79–95.