



As an example, I used the College Board's SAT benchmarks (pdf), in which a test taken during the high school years is used to predict first-year college grades. The benchmark study is interesting because it is one of the few examples of test-makers who actually check the accuracy of their instruments and report that information publicly. You can find my first thoughts on this in "SAT Error Rates." The source material mainly consists of Table 1a on page 3 of the College Board report:













We can use this to see the power of the SAT to predict first-year college grades at any cut-off score on the table. If we pick 1200, for example, we can see that 73% of the students we admit will have a first-year grade average at 2.7 or above. In other words, a 73% true positive rate and a 27% false positive rate. Because we are helpfully given the number of samples in each bin (the N column), we can also calculate the false negative and true negative rates for the test. Just multiply N by the percentage of students with FGPA > 2.7 to find the number of students in that bin who were successful in their first year (by that definition), and subtract that from N to get the number who were not. The graph below shows this visually.
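The bin arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the rows in BINS are made-up numbers, not the figures from the College Board's Table 1a, and the function names are my own.

```python
# Hypothetical rows: (SAT score bin, N, fraction of the bin with FGPA > 2.7).
# These are invented for illustration and are NOT the College Board's data.
BINS = [
    (800, 1000, 0.30),
    (1000, 2000, 0.50),
    (1200, 1500, 0.70),
    (1400, 500, 0.90),
]


def split_counts(bins):
    """Turn each (score, N, fraction) row into (score, successes, failures)."""
    rows = []
    for score, n, frac in bins:
        successes = round(n * frac)  # N times the success percentage
        rows.append((score, successes, n - successes))  # the rest were not successful
    return rows


def confusion_at_cutoff(bins, cutoff):
    """Count true/false positives and negatives when admitting scores >= cutoff."""
    tp = fp = fn = tn = 0
    for score, successes, failures in split_counts(bins):
        if score >= cutoff:
            tp += successes  # admitted and successful
            fp += failures   # admitted but not successful
        else:
            fn += successes  # rejected but would have been successful
            tn += failures   # rejected and would not have been successful
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


if __name__ == "__main__":
    counts = confusion_at_cutoff(BINS, 1200)
    print(counts)
    # Share of admitted students who were successful at this cut-off.
    print(counts["tp"] / (counts["tp"] + counts["fp"]))
```

With the real table in place of BINS, the same two functions would give the counts behind each point on the graph.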





The two graphs look roughly like normal distributions with means about 150 SAT points apart. This is all quite interesting, but for my purposes here I just want to pull one number from this: the total percentage of students with FGPA >2.7, which we can get by summing up all the heights on the blue line and dividing by the total of all samples. This turns out to be 59%.
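The overall success rate is just a weighted average over the bins. Here is a minimal sketch of that sum, again using invented rows rather than the real table, so the result is not the 59% quoted above.

```python
# Hypothetical rows: (SAT score bin, N, fraction of the bin with FGPA > 2.7).
# Invented for illustration; NOT the College Board's data.
BINS = [
    (800, 1000, 0.30),
    (1000, 2000, 0.50),
    (1200, 1500, 0.70),
    (1400, 500, 0.90),
]


def overall_success_rate(bins):
    """Sum the successful students over all bins and divide by the total N."""
    total = sum(n for _, n, _ in bins)
    successes = sum(n * frac for _, n, frac in bins)
    return successes / total


if __name__ == "__main__":
    print(round(overall_success_rate(BINS), 3))
```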







The College Board's benchmark has 65% accuracy. In other words:

If a student's SAT score exceeds the benchmark, there is a 65% chance they will have FGPA > 2.7.

Of all students, 59% will have FGPA > 2.7.

The difference between these numbers is not large: .65 - .59 = .06, or six percentage points. Using the benchmark to select "winners", we can do six percentage points better than just randomly sampling. If all we care about is the percentage of "good" students we get, that's the end of the story. But there's another dimension: the rate of unfair rejections, or false negatives.

If we randomly sample whom we accept, then 59% of those we reject would have had FGPA > 2.7 (assuming this is the rate for the whole population). Since it's unfair to reject qualified candidates, we might call 1-.59 = 41% the fairness of the method of selection. Another name for fairness is the true negative rate. I plotted it against the accuracy (true positive rate) in the previous article. Here it is again.





The blue line is accuracy, and the red line is fairness. They meet at 65%. So we can see that although using the SAT benchmark is only six percentage points more accurate than random sampling, it is .65 - .41 = 24 percentage points more fair. How do we make sense of how good this is?



One overall measure of test predictive power is the average rate of correct predictions, taking into account both true positives and true negatives. We might call that the "correctness rate" of the cut-off benchmark. Where the lines cross in the graph above, the rates for true positives and true negatives are both 65%, so the correctness rate is also 65%. In general, the formula for the correctness rate c at a given cut-off benchmark is:

c = (number of actual positives that meet the benchmark + number of actual negatives that do not meet the benchmark) / (total number of all observations)

Below is a graph that adds the correctness rate to the accuracy and fairness plots.





The correctness rate potentially solves the problem of considering accuracy and fairness separately. It does not, however, give us an absolute measure to compare the quality of tests with. This is because the fraction of actual positives in the population can vary, making detection easier or more difficult. If we are interested in comparing different tests over different kinds of detection environments, we need something different. In the next section we will derive an index to try to address this problem.
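To see why the correctness rate alone is not an absolute measure, here is a small sketch. It computes c from counts, then applies it to a hypothetical "test" that carries no information and simply labels everyone positive. The function names and the population size of 1000 are assumptions made for the example.

```python
def correctness_rate(tp, tn, total):
    """Share of all observations that the cut-off classifies correctly."""
    return (tp + tn) / total


def always_positive_correctness(positive_rate, total=1000):
    """Correctness of a no-information test that calls every case positive.

    Every actual positive is a true positive, and there are no true negatives.
    """
    tp = round(positive_rate * total)
    return correctness_rate(tp, 0, total)


if __name__ == "__main__":
    for rate in (0.50, 0.59, 0.90):
        print(rate, always_positive_correctness(rate))
```

The no-information test scores as high as the fraction of positives in the population, so the same correctness rate can mean very different things in different detection environments.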



A Comparative Index



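In general, there is not a good way to turn results about predictability into results about complexity (see "Randomness and Prediction"). However, using ideas from computational complexity, I stumbled upon a transformation that gives us another way to think about the predictive power of a test.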



In order to proceed, imagine an even better version of the test. In this fantasy, a proportion p of the test benchmark results come back marked with an asterisk. Imagine that this notation means that the result is known to be true. The unmarked ones have no guarantee--some will be correct and some not. In this way we imagine separating out the good and useful work of the test into the p group, whereas the rest is just random guessing.



It's just like a multiple choice test. Some answers you know you know, and others you guess at. By working backwards we can find that "known true" fraction:

Correctness rate = (fraction known correct) + (fraction not known correct)*(rate of correct responses with random sampling)

Using the numbers from the SAT benchmark in the previous section gives us:

.65 = p + (1 - p) * .59

p = (.65 - .59)/(1 - .59)

The fraction that would have to be "known true" is p = 14.6%. The advantage of this transformation is that we have a single number that is easy to visualize, and takes the context into account. If you wanted to explain it to someone, it would go like this:

The SAT benchmark prediction is like having a perfect understanding of 14.6% of test-takers and guessing at the rest.

The graphs below show the linear relationship between average test accuracy, the larger of the percent of positives or negatives in the population (the "guess rate"), and the index p--the equivalent proportion of "perfect understanding" outcomes.







As an example to illustrate the graph above, if actual positives and negatives are evenly split, so that r = 50%, then a test that can predict with 80% correctness has the equivalent "perfect understanding" index of 60%. But if the proportion of positives is 70% instead of 50%, so that r = 70%, the index drops to 33%. It's reasonable to say that even though the correctness rate is the same, the first test is almost twice as good as the second one.



Note that if the guess rate equals the test's correctness rate (r = c), the index is p = 0: the test explains exactly nothing, which is as it should be.
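The thought experiment with the marked results can also be checked numerically. The sketch below is my own illustration, not part of the original analysis: it simulates a test in which a fraction p of results is known to be right and the rest are right only at the guess rate r, then reports the overall correctness. The function name, the trial count, and the seed are all arbitrary choices.

```python
import random


def simulate_marked_test(p, r, trials=200_000, seed=0):
    """Simulated correctness rate when a fraction p of results is 'known true'.

    Marked results are always correct; unmarked results are correct with
    probability r, the guess rate.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        if rng.random() < p:
            correct += 1  # marked with an asterisk: known to be right
        elif rng.random() < r:
            correct += 1  # unmarked: right only as often as a guess would be
    return correct / trials


if __name__ == "__main__":
    # SAT example: p = .146 and r = .59 should land near the 65% correctness rate.
    print(simulate_marked_test(0.146, 0.59))
    # With p = 0 the result should land near the guess rate itself.
    print(simulate_marked_test(0.0, 0.59))
```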



Here's a general formula for computing the index p, which is the proportion of "perfect understanding" test results. The other two variables are c = the test's average correct classification rate, and r = the larger of the proportions of actual negative or positive outcomes. In the SAT example, 59% were successful according to the FGPA criterion, so r = .59. If it had been 45% successful, then we'd use r = 1 - .45 = .55. Given these inputs, we have a simple formula for the index p:

p = (c - r)/(1 - r)

On the last graph, p is the height of the line, c is the bottom axis, and four values of r (guess rate) are given, one for each curve as noted on the legend.
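For anyone who wants to try the index on their own numbers, here is a minimal Python sketch of the formula. The function names are mine, and the error raised for a population with only one kind of outcome is an assumption about how to handle a case the formula leaves undefined.

```python
def guess_rate(positive_rate):
    """The larger of the proportions of actual positives or negatives."""
    return max(positive_rate, 1.0 - positive_rate)


def perfect_understanding_index(c, positive_rate):
    """Index p = (c - r)/(1 - r) for correctness rate c and guess rate r."""
    r = guess_rate(positive_rate)
    if r >= 1.0:
        # Assumption: with only one kind of outcome there is nothing to predict.
        raise ValueError("index is undefined when every outcome is the same")
    return (c - r) / (1.0 - r)


if __name__ == "__main__":
    print(round(perfect_understanding_index(0.65, 0.59), 3))  # SAT benchmark
    print(round(perfect_understanding_index(0.80, 0.50), 3))  # even split
    print(round(perfect_understanding_index(0.80, 0.70), 3))  # 70% positives
```

The three calls correspond to the SAT benchmark and the two worked examples above, and should give roughly 14.6%, 60%, and 33%.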



The "guess rate" is just the bigger of the fraction of negatives or positives in the population. If there are more positives, then without more information, you would guess that any randomly chosen outcome would be positive. If there are more negatives, the best guess (without any other information) is that the outcome would be negative. In formulas, we will call this guess rate "r." For the SAT example, the real positive rate is 59%, so r = .59. If the real positive rate had been 45%, we'd use r = 1 - .45 = .55.

(Note: edited 1/20/2012 for clarity)

This post is an overdue follow-up to "Randomness and Prediction," which takes up the question of how we should judge the quality of a test. There are many kinds of tests, but for the moment I'm only interested in ones that are supposed to predict future performance. Since education is in the preparation business, the measure of success should be "did we prepare the student?" If that question can be answered satisfactorily with a yes or no, this feedback can be used to determine the accuracy of tests that are supposed to predict this outcome.