For this week’s Study of the Week I want to look at standardized tests, the concept of validity, and how best – and worst – to criticize exams like the SAT and ACT. To begin, let’s consider what exactly it means to call such exams valid.

What is validity?

Validity is a multi-faceted concept that’s seen as a core aspect of test development. Like many subjects in psychometrics and stats, it tends to be used casually and referred to as something fairly simple, when in fact the concept is notoriously complex. Accepting that any one-sentence definition of validity is thus a distortion, generally we say that validity refers to the degree to which a test measures that which it purports to measure. A test is more or less valid depending on how well it actually captures the underlying traits we are interested in investigating. No test can ever be fully or perfectly validated; we can only gather evidence that it is more valid or less. Validity is a vector, not a destination.

Validity is so complex, and so interesting, in part because it sits at the nexus of both quantitative and philosophical concerns. Concepts that we want to test may appear superficially simple but are often filled with hidden complexity. As I wrote in a past Study of the Week, talking about the related issues of construct and operationalization,

If we want to test reading ability, how would we go about doing that? A simple way might be to have a test subject read a book out loud. We might then decide if the subject can be put into the CAN READ or CAN’T READ pile. But of course that’s quite lacking in granularity and leaves us with a lot of questions. If a reader mispronounces a word but understands its meaning, does that mean they can’t read that word? How many words can a reader fail to read correctly in a given text before we sort them into the CAN’T READ pile? Clearly, reading isn’t really a binary activity. Some people are better or worse readers and some people can read harder or easier texts. What we need is a scale and a test to assign readers to it. What form should that scale take? How many questions is best? Should the test involve reading passages or reading sentences? Fill in the blank or multiple choice? Is the ability to spot grammatical errors in a text an aspect of reading, or is that a different construct? Is vocabulary knowledge a part of the construct of reading ability or a separate construct?

Questions such as these are endemic to test development, and frequently we are forced to make subjective decisions about how best to measure complex constructs of interest. As is common in the quantitative social sciences, this subjective, theoretical side of validity is often written out of our conception of the topic, as we want to speak with the certainty of numbers and the authority of the “harder” sciences. But theory is inextricable from empiricism, and the more that we wish to hide it, the more subject we are to distortions that arise from failing to fully think through our theories and what they mean. Good empiricists know theory comes first; without it, the numbers are meaningless.

Validity has been subdivided into a large number of types, which reflect different goals and values within the test development process. Some examples include:

Predictive Validity: The ability of a test’s results to predict that which it should be able to predict if the test is in fact valid. If a reading test predicts whether students can in fact read texts of a given complexity or reading level, that would provide evidence of predictive validity. The SAT’s ability to predict the grades of college freshmen is a classic example.

Concurrent Validity: If a test’s results are strongly correlated with those of a test that measures similar constructs and which has itself been sufficiently validated, that provides evidence of concurrent validity. Of course, you have to be careful – two invalid tests might provide similar results but not tell us much of actual worth. Still, a test of quantitative reasoning and a test of math would be expected to be imperfectly yet moderately-to-strongly correlated if each is itself a valid test of the given construct.

Curricular Validity: As the name implies, curricular validity reflects the degree to which a test matches with a given curriculum. If a test of biology closely matches the content in the syllabus of that biology course, we would argue for high curricular validity. This is important because we can easily imagine a scenario where general ability in biology could be measured effectively by a test that lacked curricular validity – students who are strong in biology might score well on a test, and students who are poor would likely score poorly, even if that test didn’t closely match the curriculum. But that test would still not be a particularly valid measure of biology as learned in that class, so curricular validity would be low. This is often expressed as a matter of ethics.

Ecological Validity: Heading in a “softer” direction, ecological validity is often discussed to refer to the degree to which a test or similar assessment instrument matches the real-life contexts in which its consequences will be enacted. Take writing assessment. In previous generations, it was common for student writing ability to be tested through multiple choice tests on grammar and sentence combining. These tests were argued to be valid because their results tend to be highly correlated with the scores that students receive on written essay exams. But writing teachers objected, quite reasonably, that we should test student writing by having them write, even if those correlations are strong. This is an invocation of ecological validity and reflects a broader (and to me positive) effort to not reduce validity to narrowly numerical terms.

I could go on!
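For the more quantitative types above, predictive and concurrent validity in particular, the evidence usually takes the form of a correlation coefficient between test scores and some criterion. What follows is a minimal sketch of that calculation, not a reconstruction of any testing company's procedure. The function name `pearson_r` and the six score/GPA pairs are hypothetical, invented purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: entrance exam scores and later first-year GPAs.
test_scores = [18, 21, 24, 27, 30, 33]
first_year_gpa = [2.4, 2.9, 2.6, 3.3, 3.1, 3.8]

validity_coefficient = pearson_r(test_scores, first_year_gpa)
print(f"validity coefficient: {validity_coefficient:.2f}")
```

With only six invented data points the result means nothing in itself; the point is only that a "validity coefficient" is an ordinary correlation between a test and a criterion. Swap the GPA list for scores on a second, already-validated test and the same calculation becomes evidence of concurrent rather than predictive validity.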

When we talk about entrance examinations like the SAT or GRE, we often fixate on predictive validity, for obvious reasons. If we’re using test scores as criteria for entry into selective institutions, we are making a set of claims about the relationship between those scores and the eventual performance of those students. Most importantly, we’re saying that the tests help us to know that students can complete a given college curriculum, that we’re not setting them up to fail by admitting them to a school where they are not academically prepared to thrive. This is, ostensibly, the first responsibility of the college admissions process. Ostensibly.

Of course, there are ceiling effects here, and a whole host of social and ethical concerns that predictive validity can’t address. I can’t find a link now, but a while back a Harvard admissions officer admitted that something like 90% of the applicants have the academic ability to succeed at the school, and that much of the screening process had little to do with actual academic preparedness. This is a big subject that’s outside of the bounds of this week’s study.

The ACT: Still Predictively Valid

Today’s study, by Paul A. Westrick, Huy Le, Steven B. Robbins, Justine M. R. Radunzel, and Frank L. Schmidt, is a large-n (189,612) study about the predictive validity of the ACT, with analysis of the role of socioeconomic status (SES) and high school grades in retention and college grades. The researchers examined the outcomes of students who took the ACT and went on to enroll in 4-year institutions from 2000 to 2006.

The nut:

After corrections for range restriction, the estimated mean correlation between ACT scores and 1st-year GPA was .51, and the estimated mean correlation between high school GPA and 1st-year GPA was .58. In addition, the validity coefficients for ACT Composite score and high school GPA were found to be somewhat variable across institutions, with 90% of the coefficients estimated to fall between .43 and .60, and between .49 and .68, respectively (as indicated by the 90% credibility intervals). In contrast, after correcting for artifacts, the estimated mean correlation between SES and 1st-year GPA was only .24 and did not vary across institutions….

…1st-year GPA, the most proximal predictor of 2nd-year retention, had the strongest relationship (.41). ACT Composite scores (.19) and high school GPA (.21) were similar in the strength of their relationships with 2nd-year retention, and SES had the weakest relationship with 2nd-year retention (.10).
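A note on “corrections for range restriction” in the passage above: colleges only observe the grades of students they admitted, and admitted students span a narrower band of ACT scores than the full applicant pool. That narrowing shrinks the observed correlation. The sketch below uses the textbook univariate correction for direct range restriction (often called Thorndike Case II). It is an assumption that this conveys the general idea; the study itself may well have used a different or more elaborate procedure. The function name and the example inputs (an observed correlation of .40 and an SD ratio of 1.5) are hypothetical.

```python
import math


def correct_range_restriction(r_observed, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    r_observed: correlation measured in the restricted (admitted) group.
    sd_ratio:   SD of the predictor in the unrestricted group divided by
                its SD in the restricted group (1.0 means no restriction).
    """
    if not -1.0 <= r_observed <= 1.0:
        raise ValueError("r_observed must be between -1 and 1")
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    numerator = r_observed * sd_ratio
    denominator = math.sqrt(1.0 + r_observed ** 2 * (sd_ratio ** 2 - 1.0))
    return numerator / denominator


# Hypothetical inputs, not taken from the study.
corrected = correct_range_restriction(0.40, 1.5)
print(f"observed r = 0.40, corrected r = {corrected:.2f}")
```

When the admitted group is less variable than the applicant pool (a ratio above 1), the corrected coefficient is larger in magnitude than the observed one; with a ratio of exactly 1 the formula returns the observed value unchanged.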

The results should be familiar to anyone who has taken a good look at the literature on these tests, and to anyone who has been a regular reader of this blog. The ACT is in fact a pretty strong predictor of GPA, though far from a perfect one at .51. Context is key here; in the world of social sciences and education, .51 is an impressive degree of predictive validity for the criterion of interest. But there’s lots of wiggle! And I think that’s ultimately a good thing; it permits us to recognize that there are a variety of ways to effectively navigate the challenges of the college experience… and to fail to do so. (As the Study of the Week post linked to above notes, GPA is strongly influenced by Conscientiousness, the part of the Five Factor Model associated with persistence and delaying gratification.) We live in a world of variability, and no test can ever make perfectly accurate predictions about who will succeed or fail. Exceptions abound. Proponents of these tests will say, though, that they are probably much more valid predictors of college grades and dropout rates than more subjective criteria like essays and extracurricular activities. And they have a point.
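To put rough numbers on that wiggle, here is a small sketch. It squares the correlation to get the share of variance in GPA associated with ACT scores, and it asks what fraction of above-median scorers would end up with below-median grades. The second calculation assumes the two variables are bivariate normal, which is a simplification of mine and not a claim from the study; the function names are hypothetical.

```python
import math


def variance_explained(r):
    """Share of variance in one variable associated with the other (r squared)."""
    return r * r


def share_below_median_outcome(r):
    """Fraction of above-median scorers expected to land below the median
    on the outcome, ASSUMING the two variables are bivariate normal."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    return 0.5 - math.asin(r) / math.pi


act_gpa_r = 0.51  # corrected ACT / 1st-year GPA correlation quoted above

print(f"variance explained: {variance_explained(act_gpa_r):.2f}")
print(f"above-median ACT, below-median GPA: "
      f"{share_below_median_outcome(act_gpa_r):.2f}")
```

Under those assumptions, a correlation of .51 corresponds to about a quarter of the variance in first-year GPA, and roughly one in three above-median scorers would still earn below-median grades. With no correlation at all that figure would be one in two, so the test is doing real work, but the exceptions are hardly rare.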

Does the fact that SES correlates “only” at .24 with college GPA mean SES doesn’t matter? Of course not. That level of correlation for a variable that is truly construct-irrelevant and which has such obvious social justice dimensions is notable even if it’s less powerful than some would suspect. It simply means that we should take care not to exaggerate that relationship, or the relationship between SES and performance on tests like the ACT and SAT, which is similar at about .25 in the best data known to me. Again: clearly that is a relevant relationship, and clearly it does not support the notion that these tests only reflect differences in SES.

Ultimately, every reading I make of the extant evidence indicates that tests like the SAT and ACT are moderately to highly effective at predicting which students will succeed in terms of college GPA and retention rates. They are not perfect and should not be treated as such, so we should use other types of evidence such as high school grades and other, “soft” factors in our college admissions procedures – in other words, what we already do – if we’re primarily concerned with screening for prerequisite ability. Does that mean I have no objections to these tests or their use? Not at all. It just means that I want to make the right kinds of criticisms.

Don’t Criticize Strength, Criticize Weakness

A theme that I will return to again and again in this space is that we need to consider education and its place in society from a high enough level to think coherently. Critics of the SAT and ACT tend to pitch their criticisms at a level that does them no good.

So take this piece in Slate from a couple of enthusiastic SAT (and IQ) proponents. In it, they take several liberal academics to task for making inaccurate claims about the SAT, in particular the idea that the SAT only measures how well you take the SAT. As the authors say, the evidence against this is overwhelming; the SAT, like the ACT, is and has always been an effective predictor of college grades and retention rates, which is precisely what the test is meant to predict. The big testing companies invest a great deal of money and effort in making them predictively valid. (And a great deal of test taker time and effort, too, given that one section out of each given exam is “experimental,” unscored and used for the production of future tests.) When you attack the predictive validity of these tests – their ability to make meaningful predictions about who will succeed and who will fail at college – you are attacking them at their strongest point. It’s like their critics are deliberately making the weakest critique possible.

“These tests are only proxies for socioeconomic status” is a factually incorrect attempt to make a criticism of how our educational system replicates received advantage. It fails because it does not operate at the right level of perspective. Here’s a better version, my version: “these tests are part of an educational system that reflects a narrow definition of student success that is based on the needs of capitalism, rather than a fuller, more humanistic definition of what it means to be a good student.”

These tests do indeed tell us how well students are likely to do in college and in turn provide some evidence of how well they will do in the working world. But college, like our educational system as a whole, has been tuned to attend to the needs of the market rather than to the broader needs of humanity. The former privileges the kind of abstract processing and brute reasoning skills that tests are good at measuring and which makes one a good Facebook or Boeing employee. The latter would include things like ethical responsibility, aesthetic appreciation, elegance of expression, and dedication to equality, among other things, which tests are not well suited to measuring. A more egalitarian society would of course also have need for, and value, the raw processing power that we can test for effectively, but that strength would be correctly seen as just one value among many. To get there, though, we have to make much broader critiques and reforms of contemporary society than the “SAT just measures how well you take the SAT” crowd tend to engage in.

What I am asking for, in other words, is that we focus on telling the whole story rather than distorting what we know about part of the story. There is so much to criticize in our system and how it doles out rewards, so let’s attack weakness, not strength.