Submitted on November 17, 2014

So-called achievement gaps – the differences in average test performance among student subgroups, usually defined in terms of ethnicity or income – are important measures. They demonstrate persistent inequality of educational outcomes and economic opportunities between different members of our society.

So long as these gaps remain, historically lower-performing subgroups (e.g., low-income students or ethnic minorities) are less likely to gain access to higher education, good jobs, and political voice. We should monitor these gaps; try to identify all the factors that affect them, for good and for ill; and endeavor to narrow them using every appropriate policy lever – both inside and outside of the educational system.

Achievement gaps have also, however, taken on a very different role over the past 10 or so years. The sizes of gaps, and extent of “gap closing,” are routinely used by reporters and advocates to judge the performance of schools, school districts, and states. In addition, gaps and gap trends are employed directly in formal accountability systems (e.g., states’ school grading systems), in which they are conceptualized as performance measures.

Although simple measures of the magnitude of or changes in achievement gaps are potentially very useful in several different contexts, they are poor gauges of school performance, and shouldn’t be the basis for high-stakes rewards and punishments in any accountability system.

Let’s take a quick look at four problems with using gaps in the latter context. The first two are “technical” (and not particularly original), but still important. The final two focus specifically on what achievement gaps mean and how they might be used in a formal accountability system. I will then conclude by discussing an alternative approach (one that some states and districts have begun to adopt).

One – Subgroup samples are often too small for precise estimates. This is an old point, of course, since sample size plagues all test-based accountability systems (e.g., Kane and Staiger 2002). That is, schools are small, and so estimates are imprecise. But such error becomes magnified in simple gap measures because you’re focusing on even smaller sub-samples. In many urban districts, at least one of those subgroups (e.g., students not eligible for subsidized lunch) is often prohibitively small, and the margin of error for those estimates is often comparable to the magnitude of the gap itself. Again, this is not a new point, nor is it one that doesn’t also apply to most accountability measures, but it does bear mentioning.
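To give a rough sense of the magnitudes involved, here is a minimal sketch of the margin of error for a simple gap, assuming the textbook formula for the standard error of a difference between two independent means. The enrollment counts and the 40-point standard deviation are hypothetical, not drawn from any actual school.

```python
import math


def gap_margin_of_error(sd_a, n_a, sd_b, n_b, z=1.96):
    """Approximate 95% margin of error for the difference between two subgroup means.

    Assumes independent subgroups and a normal approximation (z = 1.96).
    """
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return z * standard_error


# Hypothetical urban school: 280 lunch-eligible students, only 20 non-eligible.
small_subgroup = gap_margin_of_error(sd_a=40, n_a=280, sd_b=40, n_b=20)

# Hypothetical school with two subgroups of 150 students each.
balanced = gap_margin_of_error(sd_a=40, n_a=150, sd_b=40, n_b=150)

print(round(small_subgroup, 1), round(balanced, 1))
```

With only 20 students in the smaller subgroup, the margin of error is about 18 scale points, which could easily rival the gap being measured; with two subgroups of 150, it falls to about 9.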

Two – Gaps based on proficiency rates are not fit for use in most any context. When you measure gaps using proficiency rates, you are basically converting test scores into “yes or no” outcomes for two groups (in the case of income, groups also defined by conversion into “yes/no” outcomes), and then comparing those groups. There is no need to get into the technical details (see, for example, Ho 2008 for details), but the problem here is even worse than it is for overall, non-gap measures such as schoolwide proficiency rates. Proficiency-based gaps, particularly over time, are so prone to distortion that they are virtually guaranteed to be misleading, and should be used with extreme caution in any context, to say nothing of using them in a formal, high-stakes accountability system. There are alternative ways to measure gaps using proficiency- and other cutpoint-based rates (e.g., Ho and Reardon 2012), but I haven’t yet seen any states using them.
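One form this distortion can take is easy to sketch. The example below assumes normally distributed scores with made-up means (in standard deviation units) and a fixed proficiency cutpoint; it is an illustration of the general problem, not a reproduction of any published analysis.

```python
from statistics import NormalDist


def proficiency_rate(mean, cut, sd=1.0):
    """Share of a normally distributed group scoring at or above the cutpoint."""
    return 1 - NormalDist(mean, sd).cdf(cut)


def proficiency_gap(mean_high, mean_low, cut, sd=1.0):
    """Difference in proficiency rates between two groups."""
    return proficiency_rate(mean_high, cut, sd) - proficiency_rate(mean_low, cut, sd)


CUT = 0.0  # hypothetical proficiency cutpoint

# Year 1: the groups' average scores are 1.0 SD apart.
gap_year1 = proficiency_gap(mean_high=0.5, mean_low=-0.5, cut=CUT)

# Year 2: both groups improve by exactly 1.0 SD, so the score gap is unchanged.
gap_year2 = proficiency_gap(mean_high=1.5, mean_low=0.5, cut=CUT)

print(round(gap_year1, 3), round(gap_year2, 3))
```

In this made-up case the score gap is one standard deviation in both years, yet the proficiency gap falls from about 38 to about 24 percentage points, purely because of where the cutpoint sits relative to the two distributions.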

Three – The sizes of between-group gaps within schools don't tell you much about their performance. To begin with, what does a large gap actually tell us about student performance? Consider a hypothetical school or district in which achievement gaps are completely closed. What this means is that the two subgroups – say, low- and higher-income students, as measured by subsidized lunch eligibility – scored at roughly equal levels on tests. But, what if the higher-income students are relatively low-performing compared with their peers elsewhere? In this case, both subgroups perform at similarly low levels, but we’re calling it a success simply because there’s no difference between them. Or, similarly, we might call a school with a huge gap a failure, even though both subgroups may score at high levels. Gaps are relative measures, and their magnitude means different things in different circumstances about student performance.

More importantly for our purposes here, simple gaps don't provide much of a signal about the performance of the school (which, in an accountability context, is the key question). This is mostly because gaps rely on the comparison of subgroups' absolute performance measures (how highly students score), and student subgroups often enter their schools with large gaps between them. Schools may certainly contribute to the size of the gaps between subgroups of the students they serve, but it’s impossible, using simple gap measures, to separate the school’s impact from that of where the students started out.

Four – Simple “gap closing” trends often mask highly undesirable outcomes. Even if we assume a more “growth-based” perspective, and look at trends in gaps instead of their size in any given year, there’s an obvious problem. There are four short-term (e.g., year-to-year) scenarios in which achievement gaps, say between high- and low-income students, might narrow:

1. Both groups make progress, but with more rapid growth in the scores of lower-income students;
2. Both groups decline in performance, but the scores of higher-income students decline more rapidly;
3. The scores of lower-income students increase, while those of higher-income students remain stable;
4. The scores of lower-income students remain stable, while those of higher-income students decline.

All four of these scenarios are quite common, particularly at the school level, but also within states (see this post on New Jersey’s gaps). But only one of them – the first one – could be considered a genuine success. Why would we reward or punish schools based on a criterion that is likely to conceal undesirable outcomes? In addition, even if schools are effective in “closing gaps,” they might nonetheless exhibit relatively large aggregate discrepancies between subgroups, since, every year, a cohort of students with a large pre-existing gap may enter the school at the lowest grade while a cohort (at the highest grade) will exit (see this post for an illustration of how this works).
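The four narrowing scenarios can only be told apart by looking at each group's change separately, which a simple gap trend never does. Here is a rough sketch of that bookkeeping; the function name, the 0.5-point tolerance for calling a score "stable," and the example numbers are all my own assumptions for illustration.

```python
def classify_gap_change(delta_low, delta_high, tol=0.5):
    """Label a year-to-year change, given each subgroup's change in average score.

    delta_low / delta_high: score change for lower- and higher-income students.
    tol: changes within +/- tol points are treated as "stable" (an assumption).
    """

    def direction(delta):
        if delta > tol:
            return "up"
        if delta < -tol:
            return "down"
        return "stable"

    # The gap narrows only if lower-income scores gain ground on higher-income scores.
    if delta_low - delta_high <= 0:
        return "gap did not narrow"

    scenarios = {
        ("up", "up"): "1: both groups improve, lower-income faster",
        ("down", "down"): "2: both groups decline, higher-income faster",
        ("up", "stable"): "3: lower-income improves, higher-income stable",
        ("stable", "down"): "4: lower-income stable, higher-income declines",
    }
    return scenarios.get((direction(delta_low), direction(delta_high)),
                         "other narrowing pattern")


# Hypothetical changes in average scale scores (lower-income, higher-income).
for deltas in [(6, 2), (-1, -5), (4, 0), (0, -4)]:
    print(deltas, "->", classify_gap_change(*deltas))
```

All four hypothetical cases narrow the gap by the same four points, but only the first reflects improvement for both groups.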

In general, there is only one reason to measure gaps or gap-closing within a school for accountability purposes: The possibility that the school is educating some groups of students effectively and not others. Otherwise, we could just measure overall performance without breaking things down by subgroup.

From this perspective, what’s the theory behind identifying schools in which gaps are not closing (again, putting aside the massive measurement problems) for the purposes of high-stakes punitive decisions? The rationale seems to be that these schools are somehow neglecting one subgroup in favor of another. But how likely is this to have occurred systematically and intentionally? Are these schools providing inferior instruction to some students, based on observable characteristics such as income or ethnicity? Probably not.

This is not to say that achievement gaps cannot be useful in policy decisions. Quite the contrary. They can help identify schools in which historically lower-performing subgroups are further behind their peers, and these schools can be given additional resources and support to address these discrepancies. This type of intervention has proven effective in the past (Harris and Herrington 2006). But punishing schools based on these gaps is essentially the opposite approach (and, again, within-school gaps may not be particularly useful for this purpose).

In short, achievement gaps are a good example of a measure that is very important, but not well-suited for use in formal or informal accountability systems (at least those that are used for high-stakes punishments and rewards). There is obviously a strong case for paying special attention to the testing performance of historically low-performing groups, but using the gaps directly in accountability systems may not be the way to do it. There are alternatives.

For instance, one simple way to focus on lower-performing subgroups would be to adopt a high-quality school-level value-added measure and calculate separate estimates for subgroups, such as those defined in terms of income or race (or, alternatively, the lowest-scoring students without reference to other characteristics). These estimates could be given extra weight or importance in any accountability system. An approach such as this wouldn’t solve the small sample problem (unless the subgroup was defined in terms of testing performance, as suggested above), but it would at least reflect the goal of boosting performance among lower-scoring subgroups, and provide an incentive for focusing on these students (though, again, I’m not sure that these incentives are productive, and using gaps to target schools for additional resources may be a more effective route).

Finally, we should also stop using gaps and gap trends in our public discourse about school performance per se. They are measures of student performance (and, when measured within schools, limited ones at that). The goal should be to provide educational opportunity for all, not try clumsily to ensure equal outcomes by rewarding and punishing schools based on the degree to which they exhibit those equal outcomes. In an accountability context, there is a crucial difference.

- Matt Di Carlo