I may be the only one in New Jersey who had a twisted enough view of today’s news stories to pick up on this connection. Seemingly irrelevant to my blog, today, the Governor of New Jersey vetoed a bill that would have approved online gambling. At the same time, the Governor’s teacher effectiveness task force released its long-awaited report. And it did not disappoint. Well, I guess that’s a matter of expectations. I had very low expectations to begin with – fully expecting a poorly written, ill-conceived rant about how to connect teacher evaluations to test scores – growth scores – and how it is imperative that a large share of teacher evaluation be based on growth scores. And I got all of that and more!!!!!

I have written about this topic on multiple occasions.

For the full series on this topic, see: https://schoolfinance101.wordpress.com/category/race-to-the-top/value-added-teacher-evaluation/

And for my presentation slides on this topic, including summaries of the relevant research, see: https://schoolfinance101.files.wordpress.com/2010/10/teacher-evaluation_general.pdf

When it comes to critiquing the Task Force Report, I’m not even sure where to begin. In short, the report proposes the most ill-informed toxic brew of policy recommendations that one can imagine. The centerpiece, of course, is heavy… very heavy reliance on statewide student testing measures yet to be developed… yet to be evaluated for their statistical reliability … or their meaningfulness of any sort (including predictive validity of future student success). As Howard Wainer explains here, even the best available testing measures are not up to the task of identifying more and less effective teachers: http://www.njspotlight.com/ets_video2/

But who cares what the testing and measurement experts think anyway? This is about the kids… and we must fix our dreadful system and do it now… we can’t wait! The children can’t wait!

So then, what does this have to do with the online gambling veto? Well, it struck me as interesting that, on the one hand, the Governor vetoes a bill that would approve online gambling, but the Governor’s Task Force proposes a teacher evaluation plan that would make teachers’ year to year job security and teacher evaluations largely a game of chance. Yes, a roll of the dice. Roll a 6 and you’re fired! Damn hard to get 3 in a row (positive evaluations) to get tenure. Exponentially easier to get 2 in a row (bad evals) and get fired. No online gambling for sure, but gambling on the livelihood of teachers? That’s absolutely fine!

Interestingly, one of the only external sources even cited (outside of citing the comparably problematic Washington DC IMPACT contract, and think tanky schlock like the New Teacher Project’s “Teacher Evaluation 2.0”) was the Gates Foundation’s Measures of Effective Teaching (MET) Project. Of course, the task force report fails to mention that the Gates Foundation MET project report does not make a very compelling statistical case that using test scores as a major factor for evaluating teachers is a good idea. Actually, the report fails to mention anything substantive about the MET reports. I wrote about the MET report here. And economist Jesse Rothstein took a closer look at the Gates MET findings here! Rothstein concluded:

In particular, the correlations between value-added scores on state and alternative assessments are so small that they cast serious doubt on the entire value-added enterprise. The data suggest that more than 20% of teachers in the bottom quarter of the state test math distribution (and more than 30% of those in the bottom quarter for ELA) are in the top half of the alternative assessment distribution. Furthermore, these are “disattenuated” estimates that assume away the impact of measurement error. More than 40% of those whose actually available state exam scores place them in the bottom quarter are in the top half on the alternative assessment.

In other words, teacher evaluations based on observed state test outcomes are only slightly better than coin tosses at identifying teachers whose students perform unusually well or badly on assessments of conceptual understanding.

Yep that’s right. It’s little more than a coin toss or a roll of the dice! Online gambling (personally, I don’t care one way or the other about it), not okay. Gambling on teachers’ livelihoods with statistical error? Absolutely fine. After all, it’s those damn teachers that have sucked the economy dry with their high salaries and gold-plated benefits packages! And after all, it is the only profession in the world where you can do a really crappy job year after year after year… and you’re totally protected, right? Of course it’s that way. Say it loud enough and enough times, over and over again, and it must be true.

Here are a few random thoughts I have about the report:

So… as I understand it, they want to base 45% of a teacher’s evaluation on measures that have a 35% chance of misclassifying an average teacher as ineffective – and these are measures that only apply to about 15 to 20% of the teacher workforce? That doesn’t sound very well thought out to me.

Forcing reading and math teachers to be evaluated by measures over which they have limited control, and measures that jump around significantly from year to year and disadvantage teachers in more difficult settings isn’t likely to make New Jersey’s best and brightest jump at the chance to teach in Newark, Camden or Jersey City.

Even if the current system of teacher evaluation is less than ideal, it doesn’t mean that we should jump to adopt metrics that are as problematic as these. Promoters of these options would have the public believe that it’s either the status quo – which is necessarily bad – or test-score based evaluation – which is obviously good. This is untrue at many levels. First, New Jersey’s status quo is pretty good. Second, New Jersey’s best public and private schools don’t use test scores as a primary or major source of teacher evaluation. Yet somehow, they are still pretty darn good. So, using or not using test scores to hire and fire teachers is not likely the problem nor is it the solution. It’s an absurd false dichotomy.

Authors of the report might argue that they are putting only 45% of the weight of evaluations on these measures. The rest will include a mix of other objective and subjective measures. The reality of an evaluation that includes a single large, or even significant weight, placed on a single quantified factor is that that specific factor necessarily becomes the tipping point, or trigger mechanism. It may be 45% of the evaluation weight, but it becomes 100% of the decision, because it’s a fixed, clearly defined (though poorly estimated) metric.
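One way to see the point about weights is to ask how much of the spread in a composite rating comes from each component. The sketch below is purely illustrative. It assumes a two-part composite (45% test-based, 55% everything else), assumes the two parts are uncorrelated, and uses made-up standard deviations. None of these numbers, and none of the function or parameter names, come from the Task Force report.

```python
def variance_share(weight, sd_scores, sd_other):
    """Share of composite-rating variance contributed by the test-based part.

    Assumes composite = weight * scores + (1 - weight) * other, with the two
    components uncorrelated. All inputs are hypothetical.
    """
    from_scores = (weight * sd_scores) ** 2
    from_other = ((1 - weight) * sd_other) ** 2
    return from_scores / (from_scores + from_other)


if __name__ == "__main__":
    # Equal spread in both components: the 45% weight acts like a minority share.
    print(round(variance_share(0.45, 1.0, 1.0), 2))
    # The other measures bunch together (hypothetical sd of 0.2): the
    # test-based part now drives most of the differences between teachers.
    print(round(variance_share(0.45, 1.0, 0.2), 2))
```

Under these assumptions, when the other measures rate nearly everyone about the same, the test-based component accounts for well over 90% of the differences between teachers even though its nominal weight is 45%.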

Here’s a quick run-down on some of the issues associated with using student test scores to evaluate teachers:

[from a forthcoming article on legal issues associated with using test scores to evaluate and dismiss teachers]

Most VAM teacher ratings attempt to predict the influence of the teacher on the student’s end-of-year test score, given the student’s prior test score and descriptive characteristics – for example, whether the student is poor, has a disability, or is limited in her English language proficiency.[1] These statistical controls are designed to account for the differences that teachers face in serving different student populations. However, there are many problems associated with using VAM to determine whether teachers are effective. The remainder of this section details many of those problems.

Instability of Teacher Ratings

The assumption in value-added modeling for estimating teacher “effectiveness” is that if one uses data on enough students passing through a given teacher each year, one can generate a stable estimate of the contribution of that teacher to those children’s achievement gains.[2] However, this assumption is problematic because of the concept of inter-temporal instability: that is, the same teacher is highly likely to get a very different value-added rating from one year to the next. Tim Sass notes that the year-to-year correlation for a teacher’s value-added rating is only about 0.2 or 0.3 – at best a very modest correlation. Sass also notes that:

About one quarter to one third of the teachers in the bottom and top quintiles stay in the same quintile from one year to the next while roughly 10 to 15 percent of teachers move all the way from the bottom quintile to the top and an equal proportion fall from the top quintile to the lowest quintile in the next year.[3] Further, most of the change or difference in the teacher’s value-added rating from one year to the next is unexplainable – not by differences in observed student characteristics, peer characteristics or school characteristics.[4]
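To get a feel for what a year-to-year correlation of 0.2 or 0.3 implies for movement between quintiles, here is a small simulation. It is a sketch under stated assumptions, not a reproduction of Sass's analysis: ratings in the two years are drawn from a bivariate normal distribution, and the function name, sample size and seed are all hypothetical.

```python
import math
import random


def quintile_transitions(corr, n_teachers=100_000, seed=0):
    """Simulate two years of ratings with a given year-to-year correlation.

    Ratings are bivariate normal (an assumption made for illustration).
    Returns, for teachers in the bottom quintile in year 1, the share who
    stay in the bottom quintile in year 2 and the share who jump to the top.
    """
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        a = rng.gauss(0, 1)
        b = rng.gauss(0, 1)
        year1.append(a)
        # Mix in independent noise so that corr(year1, year2) equals corr.
        year2.append(corr * a + math.sqrt(1 - corr ** 2) * b)

    def cutoffs(values):
        ordered = sorted(values)
        return ordered[len(ordered) // 5], ordered[4 * len(ordered) // 5]

    low1, _ = cutoffs(year1)
    low2, high2 = cutoffs(year2)
    bottom = [y2 for y1, y2 in zip(year1, year2) if y1 < low1]
    stay = sum(y2 < low2 for y2 in bottom) / len(bottom)
    jump = sum(y2 >= high2 for y2 in bottom) / len(bottom)
    return stay, jump


if __name__ == "__main__":
    for corr in (0.2, 0.3):
        stay, jump = quintile_transitions(corr)
        print(f"corr={corr}: stay in bottom {stay:.0%}, jump to top {jump:.0%}")
```

With no correlation at all, 20% would stay and 20% would jump by pure chance. Under the normality assumption, correlations of 0.2 to 0.3 move those shares only modestly away from 20%, in the same general range as the figures quoted above.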

Similarly, preliminary analyses from the Measures of Effective Teaching Project, funded by the Bill and Melinda Gates Foundation found:

When the between-section or between-year correlation in teacher value-added is below .5, the implication is that more than half of the observed variation is due to transitory effects rather than stable differences between teachers. That is the case for all of the measures of value-added we calculated.[5]
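The reasoning behind the 0.5 threshold can be sketched with a simple decomposition. This is a textbook simplification, not the MET project's own model. Assume teacher j's measured value-added in year t is a stable component plus transitory noise, with the noise uncorrelated across years, uncorrelated with the stable component, and of equal variance each year. The symbols below are my own notation.

```latex
\begin{align*}
\hat{v}_{jt} &= \mu_j + \varepsilon_{jt}, \\
\operatorname{Corr}(\hat{v}_{j1}, \hat{v}_{j2})
  &= \frac{\operatorname{Cov}(\mu_j + \varepsilon_{j1},\; \mu_j + \varepsilon_{j2})}
          {\operatorname{Var}(\hat{v}_{jt})}
   = \frac{\sigma^2_{\mu}}{\sigma^2_{\mu} + \sigma^2_{\varepsilon}}.
\end{align*}
```

Under these assumptions the between-year correlation equals the stable share of the observed variance, so a correlation below 0.5 means the transitory share is more than half. With a correlation of 0.3, for example, about 70% of the observed variation would be transitory.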

While some statistical corrections and multi-year analysis might help, it is hard to guarantee or even be reasonably sure that a teacher would not be dismissed simply as a function of unexplainable low performance for two or three years in a row.

Classification & Model Prediction Error

Another technical problem of VAM teacher evaluation systems is classification and/or model prediction error. Researchers at Mathematica Policy Research, in a study funded by the U.S. Department of Education, carried out a series of statistical tests and reviews of existing studies to determine the identification “error” rates for ineffective teachers when using typical value-added modeling methods.[6] The report found:

Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively.[7]

Type I error refers to the probability that, based on a certain number of years of data, the model will find that a truly average teacher performed significantly worse than average.[8] So there is about a 25% chance (using three years of data) or a 35% chance (using one year of data) that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired. Of particular concern is the likelihood that a “good teacher” is falsely identified as a “bad” teacher, in this case a “false positive” identification. According to the study, this occurs one in ten times (given three years of data) and two in ten times (given only one year of data).
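To put those rates in concrete terms, the arithmetic below applies the 35% (one year) and 25% (three years) Type I error rates quoted above to a hypothetical group of 1,000 truly average teachers. The group size is made up for illustration. The second function also assumes that successive evaluations err independently of one another, which is my simplification and not a claim from the Mathematica report.

```python
def expected_misclassified(n_average_teachers, type1_rate):
    """Expected number of truly average teachers flagged as below average."""
    return n_average_teachers * type1_rate


def flagged_at_least_once(type1_rate, n_evaluations):
    """Chance an average teacher is wrongly flagged in at least one of several
    evaluations, ASSUMING the errors are independent across evaluations."""
    return 1 - (1 - type1_rate) ** n_evaluations


if __name__ == "__main__":
    print(expected_misclassified(1000, 0.35))  # one year of data
    print(expected_misclassified(1000, 0.25))  # three years of data
    print(round(flagged_at_least_once(0.35, 2), 4))  # two one-year evaluations
```

By this reckoning, roughly 350 of 1,000 average teachers would be flagged with one year of data and roughly 250 with three years. If the independence assumption held, more than half of average teachers would be flagged at least once across two separate one-year evaluations.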

Same Teachers, Different Tests, Different Results

Determining whether a teacher is effective may vary depending on the assessment used for a specific subject area and not whether that teacher is a generally effective teacher in that subject area. For example, Houston uses two standardized tests each year to measure student achievement: the state Texas Assessment of Knowledge and Skills (TAKS) and the nationally-normed Stanford Achievement Test.[9] Corcoran and colleagues used Houston Independent School District (HISD) data from each test to calculate separate value-added measures for fourth and fifth grade teachers.[10] The authors found that a teacher’s value-added can vary considerably depending on which test is used.[11] Specifically:

among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.[12]

Similar issues apply to tests on different scales – different possible ranges of scores, or different statistical modification or treatment of raw scores, for example, whether student test scores are first converted into standardized scores relative to an average score, or expressed on some other scale such as percentile rank (which is done in some cases but would generally be considered inappropriate). For instance, if a teacher is typically assigned higher performing students and the scaling of a test is such that it becomes very difficult for students with high starting scores to improve over time, that teacher will be at a disadvantage. But another test of the same content, or simply one with different scaling of scores (so that smaller gains are adjusted to reflect the relative difficulty of achieving those gains), may produce an entirely different rating for that teacher.

Difficulty in Isolating Any One Teacher’s Influence on Student Achievement

It is difficult if not entirely infeasible to isolate one specific teacher’s contribution to students’ learning, leading to situations where a teacher might be identified as a bad teacher simply because her colleagues are ineffective. This is called a spillover effect.[13] For students who have more than one teacher across subjects (and/or teaching aides/assistants), each teacher’s value-added measures may be influenced by the other teachers serving the same students. Kirabo Jackson and Elias Bruegmann, for example, found in a study of North Carolina teachers that students perform better, on average, when their teachers have more effective colleagues.[14] Cory Koedel found that reading achievement in high school is influenced by both English and math teachers.[15] These spillover effects mean that teachers assigned to weaker teams of teachers might be disadvantaged, through no fault of their own.

Non-Random Assignment of Students Across Teachers, Schools And Districts

The fact that teacher value-added ratings cannot be disentangled from patterns of student assignment across schools and districts leads to the likelihood that teachers serving larger shares of one population versus another are more likely to be identified as effective or ineffective, through no fault of their own. Non-random assignment, like inter-temporal instability, is a seemingly complicated statistical issue. The non-random assignment problem relates not to the error in the measurement (test scores) but to the complications of applying a statistical model to real world conditions. The fairest comparisons between teachers would occur in a case where teachers could be randomly assigned to comparable classrooms with comparable resources, and where exactly the same number of students could be randomly assigned to those teachers, so that each teacher would have the same numbers of children and children of similar family backgrounds, prior performance, personal motivation and other characteristics. Obviously, this does not happen in reality.

Students are not sorted randomly across schools, across districts, or across teachers within schools. And teachers are not randomly assigned across school settings, with equal resources. It is certainly likely that one fourth grade teacher in a school is assigned more difficult students year-after-year than another. This may occur by choice of that teacher – a desire to try to help out these students – or other factors including the desire of a principal to make a teacher’s work more difficult. While most value-added models contain some crude indicators of poverty status, language proficiency and disability classification, few if any sufficiently mitigate the bias that occurs from non-random student assignment. That bias occurs from such apparently subtle forces as the influence of peers on one another, and the inability of value-added models to sufficiently isolate the teacher effect from the peer effect, both of which occur at the same level of the system – the classroom.[16]

Jesse Rothstein notes that “[r]esults indicate that even the best feasible value-added models may be substantially biased, with the magnitude of the bias depending on the amount of information available for use in classroom assignments.”[17]
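A toy simulation can illustrate how sorting on something the model does not observe shows up as an apparent teacher effect. Everything here is hypothetical: two teachers with identical true effects, a made-up unobserved “motivation” trait, class sizes chosen for convenience, and a naive value-added estimate equal to the mean student gain. It is a sketch of the general mechanism, not a reproduction of Rothstein’s analysis or of any actual value-added model.

```python
import random


def naive_value_added(true_effect, mean_motivation, n_students, rng):
    """Mean test-score gain for one teacher's students.

    Each student's gain = teacher effect + unobserved motivation + noise.
    The estimate cannot separate the teacher from the students assigned.
    """
    gains = []
    for _ in range(n_students):
        motivation = rng.gauss(mean_motivation, 1.0)  # not seen by the model
        noise = rng.gauss(0.0, 1.0)
        gains.append(true_effect + motivation + noise)
    return sum(gains) / len(gains)


def compare_teachers(seed=0, n_students=500):
    rng = random.Random(seed)
    # Both teachers are equally effective (true effect = 0), but teacher A is
    # routinely assigned more motivated students than teacher B.
    teacher_a = naive_value_added(0.0, 0.3, n_students, rng)
    teacher_b = naive_value_added(0.0, -0.3, n_students, rng)
    return teacher_a, teacher_b


if __name__ == "__main__":
    a, b = compare_teachers()
    print(f"Teacher A: {a:+.2f}  Teacher B: {b:+.2f}")
```

In this setup the gap between the two estimates reflects only who was assigned to each teacher. Adding more students shrinks the noise but not the gap, which is the sense in which bias differs from error.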

Value-added modeling has more recently been at the center of public debate after the Los Angeles Times contracted RAND Corporation economist Richard Buddin to estimate value-added scores for Los Angeles teachers, and the Times reporters then posted the names of individual teachers classified as effective or ineffective on their web site.[18] The model used by the Los Angeles Times, estimated by Buddin, was a fairly typical one, and the technical documentation proved rich with evidence of the types of model bias described by Rothstein and others. For example:

97% of children in the lowest performing schools are poor, and 55% in higher performing schools are poor;

The number of gifted children a teacher has affects their value-added estimate positively – The more gifted children the teacher has, the higher the effectiveness rating;

Black teachers have lower value-added scores for both English Language Arts and Math than white teachers, and these are some of the largest negative correlates with effectiveness ratings provided in the report – especially for MATH;

Having more black students in your class is negatively associated with teacher’s value-added scores, though this effect is relatively small;

Asian teachers have higher value-added scores than white teachers for Math, with the positive association between being Asian and math teaching effectiveness being as strong as the negative association for black teachers.

Some of these associations above are explained by related research by Hanushek and Rivkin, which shows measurable effects of the racial composition of peer groups on individual students’ outcomes and explains the difficulty in distilling these effects from teacher effects.[19] Note that it is also likely that associations with teacher race above are entangled with student race, where black teachers are more likely to be in classrooms with larger shares of black students.[20]

All value-added comparisons are relative. They can be used for comparing one teacher to another in a school, teachers in one school to teachers in another school, or in one district to other districts. The reference group becomes critically important when determining the potential for disparate impact of negative teacher ratings, resulting from model bias. For example, if one were to employ a district-wide performance-based dismissal (or retention) policy in Los Angeles using the Los Angeles Times model, one would likely lay off disproportionate numbers of teachers in poor schools and black teachers of black students, while disproportionately retaining Asian teachers. But, if one adopted the layoff policy relative to within-school rather than district-wide norms, because children are largely segregated by neighborhoods and schools, the disparate effect might be lessened. The policy may be neither fairer nor better in terms of educational improvement, but racially disparate dismissals might be reduced.

Finally, because teacher value-added ratings cannot be disentangled entirely from patterns of student assignment across teachers within schools, principals may manipulate assignment of difficult and/or unmotivated students in order to compromise a teacher’s value-added ratings, increasing the principal’s ability to dismiss that teacher. This concern might be mitigated by requirements for lottery-based student assignment and teacher assignments. However, such requirements could create cumbersome student assignment processes and processes that interfere with achieving the best teacher match for each child.

Whereas the problems of stability rates and error rates above are issues of “statistical error,” the problem of non-random assignment is one of “model bias.” Many value-added ratings of “teacher effectiveness” suffer from both large degrees of error, and severe levels of model bias. The two are cumulative problems, not overlapping. In fact, the extent of error in the measures may partially mask the full extent of bias. In other words, we might not even know how prodigious the bias is.

In The Best Possible Case, About 20% of Contracted Certified Teachers in a District Might Have Value-Added Scores

Setting aside the substantial concerns above over “measurement error” and “model bias,” which severely compromise the reliability and validity of value-added ratings of teachers, in most public school districts, fewer than 20% of certified teaching staff could be assigned any type of value-added assessment score. Existing standardized assessments typically focus on reading or language arts, and math performance between grades three and eight. Because baseline scores are required, and ideally multiple prior scores to limit model bias, it becomes difficult to fairly rate third grade teachers. By middle school or junior high, students are interacting with many more teachers and it becomes more difficult to assign value-added scores to any one teacher. When considering the various support staff roles, specialist teachers, teachers of elective and/or advanced secondary courses, value-added measures are generally applicable to only a small minority of teachers in any school district (<20%). Thus, in order to make value-added measures a defined element of teacher evaluation in teacher contracts, one must have separately negotiated contracts for those teachers to whom these measures apply. This is administratively cumbersome and potentially expensive for districts in these difficult economic times.

Washington DC’s IMPACT teacher evaluation system is one example that differentiates classes of teachers by whether or not they have value-added measures.[21] While contractually feasible, this approach creates separate classes of teachers in schools and may have unintended consequences for educational practices, including increasing tensions between value-added-rated teachers and non-value-added-rated teachers who wish to pull their students out of class for special projects or activities.

[1] Value-added ratings of teachers are generally not based on a simple subtraction of each student’s spring test score and previous fall test score for a specific subject. Such an approach would clearly disadvantage teachers who happen to serve less motivated groups of students, or students with more difficult home lives and/or fewer family resources to support their academic progress through the year. It would be even more problematic to simply use the spring test score from the prior year as the baseline score, and the spring of the current year to evaluate the current year teacher, because the teacher had little control over any learning gain or loss that may have occurred during the prior summer. And these gains and losses tend to be different for students from higher and lower socio-economic status. See Karl L. Alexander et al., Schools, Achievement, and Inequality: A Seasonal Perspective, 23 Educ. Eval. and Pol’y Analysis 171 (2001). Recent findings from a study funded by the Bill and Melinda Gates Foundation confirm these “seasonal” effects: “The norm sample results imply that students improve their reading comprehension scores just as much (or more) between April and October as between October and April in the following grade. Scores may be rising as kids mature and get more practice outside of school.” Bill & Melinda Gates Foundation, Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project 8, available at http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf.

[2] Tim R. Sass, The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy, Urban Institute (2008), available at http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf. See also Daniel F. McCaffrey et al., The Intertemporal Variability of Teacher Effect Estimates, 4 Educ. Fin. & Pol’y 572 (2009).

[3] Sass, supra note 2.

[4] Id.

[5] Bill & Melinda Gates Foundation, supra note 1.

[6] Peter Z. Schochet & Hanley S. Chiang, Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education (2010).

[7] Id.

[8] Id. at 12.

[9] Sean P. Corcoran, Jennifer L. Jennings & Andrew A. Beveridge, Teacher Effectiveness on High- and Low-Stakes Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI (2010).

[10] Id.

[11] Id.

[12] Id.

[13] Cory Koedel, An Empirical Analysis of Teacher Spillover Effects in Secondary School, 28 Econ. of Educ. Rev. 682 (2009).

[14] C. Kirabo Jackson & Elias Bruegmann, Teaching Students and Teaching Each Other: The Importance of Peer Learning for Teachers, 1 Am. Econ. J.: Applied Econ. 85 (2009).

[15] Koedel, supra note 13.

[16] There exist at least two different approaches to control for peer group composition. One approach, used by Caroline Hoxby and Gretchen Weingarth, involves constructing measures of the average entry level of performance for all other students in the class. C. Hoxby & G. Weingarth, Taking Race Out of the Equation: School Reassignment and the Structure of Peer Effects, available at http://www.hks.harvard.edu/inequality/Seminar/Papers/Hoxby06.pdf. Another involves constructing measures of the average racial and socioeconomic characteristics of classmates, as done by Eric Hanushek and Steven Rivkin. E. Hanushek & S. Rivkin, School Quality and the Black-White Achievement Gap, available at http://www.nber.org/papers/w12651.pdf?new_window=1.

[17] Jesse Rothstein, Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement, 25 Q. J. Econ. (2008). See also Jesse Rothstein, Student Sorting and Bias in Value Added Estimation: Selection on Observables and Unobservables, available at http://gsppi.berkeley.edu/faculty/jrothstein/published/rothstein_vam2.pdf. Many advocates of value-added approaches point to a piece by Thomas Kane and Douglas Staiger as downplaying Rothstein’s concerns. Thomas Kane & Douglas Staiger, Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation, available at http://www.nber.org/papers/w14607.pdf?new_window=1. However, Eric Hanushek and Steve Rivkin explain, regarding the Kane and Staiger analysis: “the possible uniqueness of the sample and the limitations of the specification test suggest care in interpretation of the results.” Eric A. Hanushek & Steve G. Rivkin, Generalizations about Using Value-Added Measures of Teacher Quality 8, available at http://www.utdallas.edu/research/tsp-erc/pdf/jrnl_hanushek_rivkin_2010_teacher_quality.pdf.

[18] Richard Buddin, How Effective Are Los Angeles Elementary Teachers and Schools?, available at http://www.latimes.com/media/acrobat/2010-08/55538493.pdf.

[19] Eric Hanushek & Steve Rivkin, School Quality and the Black-White Achievement Gap, Educ. Working Paper Archive, Univ. of Ark., Dep’t of Educ. Reform (2007).

[20] Charles T. Clotfelter et al., Who Teaches Whom? Race and the Distribution of Novice Teachers, 24 Econ. of Educ. Rev. 377 (2005).

[21] See generally, IMPACT Guidebooks, available at http://dcps.dc.gov/portal/site/DCPS/menuitem.06de50edb2b17a932c69621014f62010/?vgnextoid=b00b64505ddc3210VgnVCM1000007e6f0201RCRD.