Follow me on Twitter at @AnthonyCody

Last week I wrote about the fairy dust of multiple measures that the Department of Education has been sprinkling on worthless Value Added Models, under the mistaken belief that this somehow renders them golden. Dept of Ed press secretary Justin Hamilton quoted Arne Duncan, who said, "here in the US teacher evaluation is all too often tied only to test scores which makes no sense." I replied "WHO uses test scores only? Can you name one district that evaluates this way?" The answer came last week, as newspapers in New York published the value added ratings of 18,000 teachers, and made teacher evaluation a public sport.

This news was accompanied by something rather strange - Bill Gates and Michelle Rhee both criticized the action, suggesting that these scores alone do not give a full picture of teacher performance. What we have here seems to be a classic case of good cop/bad cop, where Rupert Murdoch's New York Post plays the abusive bad cop, publishing the names of teachers and singling out the city's "worst" teachers for public humiliation. Then Gates, Rhee and the Department of Education ride to the rescue, offering the sweeter but nonetheless damaging "multiple measures" evaluation models.

Bill Gates wrote an opinion piece in the New York Times that said,

Value-added ratings are one important piece of a complete personnel system. But student test scores alone aren't a sensitive enough measure to gauge effective teaching, nor are they diagnostic enough to identify areas of improvement. Teaching is multifaceted, complex work. A reliable evaluation system must incorporate other measures of effectiveness, like students' feedback about their teachers and classroom observations by highly trained peer evaluators and principals.

Here we have the good cop's case for "multiple measures," very neatly made against a counterpoint, the bad cop's use of test scores only.

But there is a big problem that remains. The evaluation model we are being offered is driven by several false assumptions.

Assumption One: Schools in our nation are saddled with a significant number of crummy teachers, and achievement will rise dramatically if we can bring in new evaluation systems to reliably identify and weed out these teachers. Bill Gates, appearing on Oprah in 2010, asserted that if only we could get rid of the bad teachers, our schools would shoot from the bottom of international rankings to the top.

First of all, the debate over our international standings has become a meaningless exercise in political grandstanding, with little attention paid to the underlying data the rankings are drawn from.

Second, the idea that we can fire our way to better schools has a fatal flaw. It assumes there are fresh teachers ready to take the place of those we fire. Given that our high-poverty schools already have turnover rates in the neighborhood of 20% a year, and about 50% of beginning teachers wash out in their first five years, the idea that we will improve our schools by firing even more is hard to believe. Where are the high-quality teachers going to come from to replace those we fire? School improvement is much more complex than this, and its foundation has to be building the profession.

Assumption Two: Test scores, and a host of secondary indicators found to be correlated with higher test scores, are the means by which we determine teacher quality. This assumption takes a very keen eye to detect, because Mr. Gates and the researchers he sponsors are not fools. They know that test scores have been somewhat discredited as a result of NCLB's single-minded focus on them. But the sophisticated measures that Gates offers are, unfortunately, mostly tied back to test scores. Those student surveys he mentions? As Melinda Gates explained last fall, the questions were checked to see which ones correlate with higher student test scores. The training that peer observers and principals receive can make sure they are watching for teacher behaviors associated with better student achievement (test scores).

If you visit the Measures of Effective Teaching (MET) project, which is the flagship of the Gates Foundation work in this arena, you find that teacher quality is defined in terms of the ability of a teacher to produce gains in student achievement. Then the project seeks different ways to measure this ability. There is the direct measurement - the scores. And then there are a host of other indicators - observations, student surveys and so on. But the core idea of teacher "effectiveness" is tied to the "effect," which is defined as student achievement. And student achievement is always some sort of test performance.

Assumption Three: Teacher quality will increase with detailed and specific feedback. This WOULD be true, if we had a solid working definition of the qualities we are after. Unfortunately, we are working with a circular definition of teacher quality, one that equates quality with the ability to raise student performance on tests, and then seeks to reinforce teacher behaviors associated with this ability.

What we need to build strong teachers are more diverse indicators of student learning. As Rog Lucido explained here a few days ago, student learning cannot be adequately reflected by a score, or even a set of scores using multiple measures. Student learning can be described, and it can be exemplified, through the use of authentic evidence - student writing, projects, presentations, research and other products. We need to work with a much richer set of student outcomes than we get by simply focusing on student achievement as it is currently defined. And we will not get there by the circular methods being promoted by the Gates Foundation and the Department of Education.

Is there an appropriate use for test score data in teacher evaluations? Absolutely! In the report A Quality Teacher in Every Classroom, issued back in 2010 by Accomplished California Teachers, we took a close look at all the possible sources of information that might be brought to bear in a high-quality teacher evaluation process. We wrote:

They should be evaluated with tools that assess professional standards of practice in the classroom, augmented with evidence of student outcomes. Beyond standardized test scores, those outcomes should include performance on authentic tasks that demonstrate learning of content; presentation of evidence from formative classroom assessments that show patterns of student improvement; the development of habits that lead to improved academic success (personal responsibility, homework completion, willingness and ability to revise work to meet standards), along with contributing indicators like attendance, enrollment and success in advanced courses, graduation rates, pursuit of higher education, and work place success.

In practice, that means we should oppose the formulaic use of Value Added Models, which research has shown to be highly unstable at the level of individual teachers. This method should not be used for any significant part of an evaluation, even if other measures are included. It is simply not ready for prime time. Of course we must also oppose the publication of teacher ratings based on these models. If you are reading this and thinking "How much harm could it do to use these ratings as one of several indicators of teacher quality?" PLEASE take a look at Gary Rubinstein's just-published analysis of the New York City VAM data.

This data is garbage, and it has no business being any part of a professional teacher's evaluation.

The Initial Findings of the MET Project exploring teacher evaluation brings us the good cop/bad cop dichotomy quite clearly:

The public discussion usually portrays only two options: the status quo (where there is no meaningful feedback for teachers) and a seemingly extreme world in which test scores alone determine a teacher's fate. Our results suggest that's a false choice. It is possible to combine measures from different sources to get a more complete picture of teaching practice. The measures should allow a school leader to both discern a teacher's ability to produce results and offer specific diagnostic feedback. Value-added scores alone, while important, do not recommend specific ways for teachers to improve.

So here we have it, teachers. You can be pilloried in public based on test scores alone, or you can have the magic of multiple measures to soften the blows. We can do this the easy way, or the hard way. How do you want it?

What do you think? Are you ready to accept the wonder that is multiple measures evaluation now?

