DIANE RAVITCH, the education scholar famous for having long championed charter schools and centralised assessment of teachers only to turn against both reforms in the past few years, has a post at the New York Review of Books railing against New York's new testing standards. The state moved last week to conduct a centralised assessment of all public-school teachers on the basis of whether they have improved student performance year to year, and to fire teachers who fail to do so. New York had to make this move to meet criteria for the $700m it will receive under the federal Race to the Top programme. Ms Ravitch contends it's a disaster because it's actually based on standardised test scores, rather than "other measures, such as classroom observations by principals, independent evaluators, and peers, plus feedback from students and parents", even though those latter measures are supposed to count for 60% of a teacher's rating.

[O]ne sentence in the agreement shows what matters most: “Teachers rated ineffective on student performance based on objective assessments must be rated ineffective overall.” What this means is that a teacher who does not raise test scores will be found ineffective overall, no matter how well he or she does with the remaining sixty percent. In other words, the 40 percent allocated to student performance actually counts for 100 percent. Two years of ineffective ratings and the teacher is fired. ...This is madness. The tests have some value in measuring basic skills and rote learning, but their overuse distorts education. No standardized test can accurately measure the quality of education.

My instinctive reaction is to agree, on a personal-experience level; I'll get to the data later. Basically, I've always been very good at taking standardised tests, and my experience has been that my scores on such tests are very imperfectly correlated with how much I've learned in a class or how good my teacher was. It now seems that this talent has been passed to the next generation: on her most recent report card, my daughter came home with decent grades in each subject, and straight A's on the standardised tests that the government uses to judge student progress and teacher performance.

What do those high test scores tell me about the quality of my daughter's teacher? Next to nothing. Given the top-flight standardised scores and the okay grades, I think he's probably underperforming, failing to get her to develop to her potential; she's scatterbrained (that also runs in the family) and forgets to study for her geography tests. But what do they tell her teacher? Simple. If he's being evaluated on his ability to improve his class's performance on standardised tests, then his number-one priority has nothing to do with improving his teaching. It is to make sure he gets my daughter (and the other kids like her in her cohort) into his classroom next year.

This is where that data Ms Ravitch cites comes in. It shows that using standardised progress metrics to judge teacher performance is grossly unreliable. Four education researchers (Linda Darling-Hammond and Edward Haertel of Stanford, Audrey Amrein-Beardsley of Arizona State, and Jesse Rothstein of UC Berkeley) found that teachers' "value added" in improving students' standardised test scores could vary dramatically from year to year based on the tests used and the composition of their classes. Here's an especially interesting finding:

One study that found considerable instability in teachers' value-added scores from class to class and year to year examined changes in student characteristics associated with the changes in teacher ratings. After controlling for prior test scores of students and student characteristics, the study still found significant correlations between teachers' ratings and their students' race/ethnicity, income, language background, and parent education. Figure 2 illustrates this finding for an experienced English teacher in the study whose rating went from the very lowest category in one year to the very highest category the next year (a jump from the 1st to the 10th decile). In the second year, this teacher had many fewer English learners, Hispanic students, and low-income students, and more students with well-educated parents, than in the first year.

It's possible to exaggerate these sorts of findings. Obviously, standardised tests can help identify poor teachers who ought to be fired. Earlier in the paper, the researchers note that in a study of performance in 2001 and 2002, of the lowest-scoring 20% of teachers in the first year, just a quarter were still in the lowest 20% the following year. But if anything, this ought to strengthen the case for getting rid of teachers who stay in that lowest-scoring group year after year.* At 75% mobility, assuming the movement in and out of the lowest group is random, then after two years of evaluation just 5% of teachers (20% times 25%) will have been in the lowest 20% both years; after three years, just over 1% of teachers will have failed to move out of the group. This isn't sufficient reason to fire them (roll a die and you'll get similar results), but assuming there is some connection between teacher quality and failure to improve standardised scores, several years of scores should serve as a good guide to which teachers deserve close scrutiny.
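
The arithmetic here can be worked out in a few lines. This is just a sketch using the two figures from the study quoted above (a 20% bottom-quintile base rate and 25% year-to-year persistence), plus the die-roll comparison:

```python
# Share of ALL teachers still in the lowest-scoring quintile after
# successive annual evaluations, using the study's figures.
base_rate = 0.20    # 20% of teachers are in the bottom quintile in year one
stay_rate = 0.25    # only a quarter of those are still there a year later

two_years = base_rate * stay_rate        # 0.05  -> 5% of all teachers
three_years = base_rate * stay_rate**2   # 0.0125 -> "just over 1%"

# The die-roll comparison: pure chance produces similarly small
# persistent groups, which is why a small group proves nothing by itself.
die_twice = (1 / 6) ** 2    # about 2.8% roll the same face twice
die_thrice = (1 / 6) ** 3   # about 0.46%, less than half a percent
```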

Two years, however, seems clearly insufficient to serve as a hard-and-fast reason to fire someone. And this is where Ms Ravitch's conclusion is important: "Of course, teachers should be evaluated. They should be evaluated by experienced principals and peers... Those who can't teach and can't improve should be fired." If the current trend continues, she writes,

No student will be left untested. Every teacher will be judged by his or her students' scores. Cheating scandals will proliferate. Many teachers will be fired. Many will leave teaching, discouraged by the loss of their professional autonomy. Who will take their place? Will we ever break free of our national addiction to data?

It is possible to encourage excellence in an organisation, even a large one such as a public-school system, without relying on statistical performance metrics. It is a matter of culture and vitality. But that sort of excellence is hard to measure; it's hard to know whether you have it, how to get it, or what exactly you're missing. In "Seeing Like a State", James C. Scott writes of how modern governments are driven to implement rational, uniform metrics by the sovereign's need for "legibility" of the territory it governs, and of how these metrics then distort and sometimes destroy the societies they measure. In America, the people are the sovereign, and it is our own fury at our underperforming school systems, and our inability to understand what we're doing wrong, that is driving our increasing desire to compile stats on our teachers and fire the ones who don't make the cut-off.

* Wait. No it shouldn't! It occurs to me that I've made an error in statistical thinking that Daniel Kahneman would probably slap me for: the fact that a large number of people move in and out of a group tells us nothing about the characteristics of those who remain in the group. It just tells us that there won't be very many of them. As I say five seconds later, if you have people roll a die three times you'll find that less than 0.5% of them get a 6 each time. That doesn't tell you anything about the people who do. What you need to look at is the predictive power of being in the group once for being in the group again, and for that you need to run the experiment for more than two years so you can do some regressions.
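
The footnote's corrected reasoning can be made concrete with a toy simulation (a hypothetical null model, not the study's data): even when measured performance is pure noise, about one in five bottom-quintile teachers will land in the bottom quintile again the next year by chance alone. That 20% chance baseline is what the study's observed 25% persistence should be judged against:

```python
import random

random.seed(0)
n = 100_000

# Null model: a teacher's measured "value added" is pure noise,
# redrawn independently each year, so one year's ranking carries
# no information about the next.
year1 = [random.random() for _ in range(n)]
year2 = [random.random() for _ in range(n)]

def in_bottom_quintile(scores):
    cutoff = sorted(scores)[len(scores) // 5]
    return [s < cutoff for s in scores]

b1 = in_bottom_quintile(year1)
b2 = in_bottom_quintile(year2)

# Share of ALL teachers in the bottom quintile both years: ~4% by chance.
both_years = sum(a and b for a, b in zip(b1, b2)) / n

# The number the footnote says actually matters: given a teacher was in
# the bottom quintile in year one, how likely is a repeat appearance?
# Under pure noise this is ~20%, the baseline for the observed 25%.
repeat_rate = sum(a and b for a, b in zip(b1, b2)) / sum(b1)
```

Under this null model the repeat rate is roughly 20%, so the 25% persistence in the 2001-02 study is only modestly above what chance alone would produce, which is exactly why multi-year data and regressions are needed before firing anyone.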