I have written much on this blog about problems with the use of value-added estimates of teacher effect (used loosely) on student test score gains. I have addressed problems with both the reliability and validity of VAM estimates, and I have pointed out how SGP-based estimates of student growth are invalid on their face for determining teacher effectiveness.

But, I keep hearing two common refrains from the uber-reformy (those completely oblivious to the statistics and research of VAM while also lacking any depth of understanding of the complexities of the social systems [schools] into which they propose to implement VAM as a de-selection tool) crowd. Sadly, these are the people who seem to be drafting policies these days.

Here are the persistent misrepresentations:

Misrepresentation #1: That this reliability and error stuff only makes it hard for us to distinguish among all those teachers clustered in the middle of the distribution. BUT… we can certainly be confident about those at the extremes of the distribution. We know who the really good and really bad teachers are based on their VAM estimates.

WRONG!

This would possibly be a reasonable assertion if reliability and error rates were the only problem. But this statement ignores entirely the issue of omitted variables bias (other stuff that affects teacher effect estimates that may have been missed in the model), and just how much those observations in the tails jump around when we tweak the VAM by adding or removing variables, or rescaling measures.

A recent paper by Dale Ballou & colleagues illustrates this problem:

“In this paper, we consider the impact of omitted variables on teachers’ value-added estimates, and whether commonly used single-equation or two-stage estimates are preferable when possibly important covariates are not available for inclusion in the value-added model. The findings indicate that these modeling choices can significantly influence outcomes for individual teachers, particularly those in the tails of the performance distribution who are most likely to be targeted by high-stakes policies.” (Ballou et al., 2012) [emphasis added]

The problem is that we can never know when we’ve got that model specification just right. Further, while we might be able to run checks as to whether the model estimates display bias with respect to measurable external factors, we can’t know if there is bias with respect to stuff we can’t measure, nor can we always tell if there are clusters of teachers in our model whose effectiveness estimates are biased in one direction and other clusters in another direction (also in relation to stuff unmeasured). That is, we can only test this omitted variables bias stuff when we can add in and take out measures that we have. We simply don’t know how much bias remains due to all sorts of other unmeasured stuff, nor do we know just how much that bias may affect many of those observations in the tails!
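To make the tail problem concrete, here is a toy simulation. It is a sketch of my own and not the Ballou et al. analysis. Every number in it is an assumption chosen for illustration: the spread of true teacher effects, the size of the classroom-level factor, the class size, and the convenient fiction that we know the factor's true coefficient (`beta`). It flags the bottom 10% of teachers twice, once with the classroom factor left out and once with it controlled, and reports how many of the flagged teachers change.

```python
import random

def simulate_tail_churn(n_teachers=1000, class_size=25, beta=0.5, seed=0):
    """Share of 'bottom 10%' teachers who leave the bottom 10% once a
    previously omitted classroom-level factor is controlled for."""
    rng = random.Random(seed)
    with_factor, without_factor = [], []
    for _ in range(n_teachers):
        true_effect = rng.gauss(0, 0.15)  # assumed spread of real teacher effects
        classroom = rng.gauss(0, 0.3)     # assumed classroom factor (e.g. sorting)
        gains = [true_effect + beta * classroom + rng.gauss(0, 1)
                 for _ in range(class_size)]
        mean_gain = sum(gains) / class_size
        without_factor.append(mean_gain)                  # factor omitted
        with_factor.append(mean_gain - beta * classroom)  # factor controlled (beta assumed known)

    k = n_teachers // 10

    def bottom(estimates):
        # Indices of the k teachers with the lowest estimates.
        order = sorted(range(n_teachers), key=lambda i: estimates[i])
        return set(order[:k])

    flagged_before = bottom(without_factor)
    flagged_after = bottom(with_factor)
    return 1 - len(flagged_before & flagged_after) / k

churn = simulate_tail_churn()
print(f"Bottom-decile teachers who change when the factor is added: {churn:.0%}")
```

Keep in mind that this only works because the simulation hands us the classroom factor. With real data, the factors we never measured can't be added back in, so the churn they cause stays invisible.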

Misrepresentation #2: We may be having difficulty in these early stages of estimating and using VAM models to determine teacher effectiveness, but these are just early development problems that will be cleared up with better models, better data and better tests.

WRONG AGAIN!

Quite possibly, what we are seeing now is as good as it gets. Keep in mind that many of the often-cited papers applying the value-added methodology date back to the mid-1990s. Yeah… we’ve been at this for a while and we’ve got what we’ve got!

Consider the sources of the problems with the reliability and validity of VAM estimates, or in other words:

The sources of random error and/or noise in VAM estimates

Random error in testing data can be a function of undetected and uncorrected poorly designed test items, such as items with no correct response or more than one correct response, testing conditions/disruptions, and kids being kids – making goofy errors such as filling in the wrong bubble (or toggling the wrong box in computerized testing) or simply having a brain fart on stuff they probably otherwise knew quite well. We’re talking about large groups of 8 and 9 year old kids in some cases, in physically uncomfortable settings, under stress, with numerous potential distractions.

Do we really think all of these sources of noise are going to go away? Substantively improve over time? Testing technology gains only have a small chance at marginally improving some of these. I hope to see those improvements. But it’s a drop in the bucket when it comes to the usefulness, reliability and validity of VAM estimates.
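Here is a rough way to see why shaving a bit of test noise doesn't buy much. This is a sketch under made-up assumptions, not an estimate from any real data: true teacher effects with a standard deviation of 0.15 test-score SDs, student-level noise with a standard deviation of 1.0, classes of 25, and the same true effect in both years. The function names are mine. The first function gives the share of variance in a one-year estimate that is real signal; the second simulates two years and checks the year-to-year correlation.

```python
import random

def reliability(effect_sd, noise_sd, class_size):
    """Signal share of variance in a one-year class-average estimate."""
    signal = effect_sd ** 2
    noise = noise_sd ** 2 / class_size  # averaging over the class shrinks noise
    return signal / (signal + noise)

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def year_to_year(n_teachers=2000, class_size=25, effect_sd=0.15,
                 noise_sd=1.0, seed=1):
    """Correlation between two years of noisy estimates of the same teachers."""
    rng = random.Random(seed)
    class_noise_sd = noise_sd / class_size ** 0.5
    true = [rng.gauss(0, effect_sd) for _ in range(n_teachers)]
    year1 = [t + rng.gauss(0, class_noise_sd) for t in true]
    year2 = [t + rng.gauss(0, class_noise_sd) for t in true]
    return pearson(year1, year2)

print("analytic reliability:", round(reliability(0.15, 1.0, 25), 2))
print("simulated year-to-year correlation:", round(year_to_year(), 2))
print("reliability with 10% less test noise:", round(reliability(0.15, 0.9, 25), 2))
```

Under these assumed numbers the analytic reliability is about 0.36, and cutting student-level noise by 10% only nudges it upward. Different assumptions give different numbers, but the pattern is the point: small gains in test quality are a small lever.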

The factors other than the teacher which may influence the average test score gain of students linked to that teacher

First and foremost, kids simply aren’t randomly sorted across teachers and the various ways in which kids aren’t randomly sorted (by socioeconomic status, by disability status, by parental and/or child motivation level) substantively influence VAM estimates. As mentioned above, we can never know how much the unmeasured stuff influences the VAM estimates. Why? It’s unmeasured!

Second, teachers aren’t randomly sorted among teaching peers and VAM studies have shown what appear to be spillover effects – where teachers seem to get higher VAM estimates when other teachers serving the same students get higher VAM estimates. Teacher aides, class sizes, lighting/heating/cooling aren’t randomly distributed and all of this stuff may matter.

And you know what? This stuff isn’t going to change in the near future. In fact, the more time we waste obsessing on the future of VAM-based de-selection policies instead of equitably and adequately financing our school systems, the more that equity of schooling conditions is going to erode across children, teachers, schools and districts – in ways that are very much non-random [uh… that means certain kids will get more screwed than others]. So perhaps our time would be much better spent trying to improve the equity of those conditions across children: providing more parity in teacher compensation and working conditions, and better integrating/distributing student populations.

Look – if we were trying to set up an experiment or a program evaluation in which we wanted our VAM estimates to be most useful – least likely to be biased by unmeasured stuff – we would take whatever steps we could to achieve the “all else equal” requirement. Translated to the non-experimental setting – applied in the real world – this all else equal requirement means that we actually have to concern ourselves with equality of teaching conditions – equality of the distribution of students by race, SES and other factors. Yeah… that actually means equitable access to financial resources – equitable access to all sorts of stuff (including peer group).

In other words, we’d be required to exercise more care in establishing equality of conditions, or explaining why we couldn’t, if we were simply comparing program effectiveness for academic publication than the current reformy crowd is willing to exercise when deciding which teachers to fire. [Then again, the problem is that they don’t seem to know the difference. Heck, some of them are still hanging their hopes on measures that aren’t even designed for the purpose!]

But this conversation is completely out-of-sight, out-of-mind for the uber-reformy crowd. That’s perhaps the most ludicrous part of all of this reformy VAM-pocrisy! They ignore the substantive changes to the education system that could actually improve the validity of VAM estimates, asserting instead that VAM estimates alone will do the job, which they couldn’t possibly do if we continue to ignore all this stuff!

Finally, one more reason why VAM estimates are unlikely to become more valid or more useful over time? Once we start using these models with high stakes attached, the tendency for the data to become more corrupted and less valid escalates exponentially!

By the way, VAM estimates don’t seem to be very useful for evaluating a) the effectiveness of teacher preparation programs [due to the non-random geographic distributions of graduates] or b) principals either! More on this at another point.

Note on VAM-based de-selection: Yeah… the uber-reformy types will argue that no-one is saying that VAM should be used 100% for teacher de-selection, and further that no-one is really even arguing for de-selection. WRONG! AGAIN! As I discussed in a previous post, the standard reformy legislation template includes three basic features which essentially amount to using VAM (or even worse SGPs) as the primary basis for teacher de-selection – yes, de-selection. First, use of VAM estimates in a parallel weighting system with other components requires that VAM be considered even in the presence of a likely false positive. NY legislation prohibits a teacher from being rated highly if their test-based effectiveness estimate is low. Further, where VAM estimates vary more than other components, they will quite often be the tipping point – nearly 100% of the decision even if only 20% of the weight – and even where most of that variation is NOISE or BIAS… not even “real” effect (effect on test score growth). Second, the reformy template often requires (as does the TEACHNJ bill in NJ) that teachers be de-selected (or at least have their tenure revoked) after any two years in a row of falling on the wrong side of an arbitrary cut point rammed through these noisy data.
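The “20% of the weight, nearly 100% of the decision” point is just arithmetic about variances, and a small sketch shows it. The inputs are hypothetical and not taken from any state's actual system: a VAM component with a standard deviation of 1.0 at 20% weight, and everything else (observations and so on) with a standard deviation of 0.1 at 80% weight, with the two components assumed uncorrelated. The function names are mine.

```python
import random

def vam_share_of_variance(w_vam, sd_vam, sd_other):
    """Share of composite-score variance contributed by the VAM component,
    assuming the two components are uncorrelated."""
    v = (w_vam * sd_vam) ** 2
    o = ((1 - w_vam) * sd_other) ** 2
    return v / (v + o)

def bottom_decile_overlap(w_vam=0.2, sd_vam=1.0, sd_other=0.1, n=5000, seed=2):
    """Fraction of teachers in the bottom 10% on VAM alone who are also in
    the bottom 10% on the weighted composite."""
    rng = random.Random(seed)
    vam = [rng.gauss(0, sd_vam) for _ in range(n)]
    other = [rng.gauss(0, sd_other) for _ in range(n)]
    composite = [w_vam * v + (1 - w_vam) * o for v, o in zip(vam, other)]
    k = n // 10
    low_vam = set(sorted(range(n), key=lambda i: vam[i])[:k])
    low_composite = set(sorted(range(n), key=lambda i: composite[i])[:k])
    return len(low_vam & low_composite) / k

print("VAM share of composite variance:",
      round(vam_share_of_variance(0.2, 1.0, 0.1), 2))
print("Bottom-decile overlap, VAM vs. composite:",
      round(bottom_decile_overlap(), 2))
```

With these assumed spreads, the 20% component accounts for roughly 86% of the variation in the composite. If the other components varied as much as VAM does, the share would drop to about 6%, so the whole result hinges on how compressed the non-VAM ratings are.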

Finally, don’t give me the “anything is better than the status quo” crap!