Why did so many people fail the July 2014 bar exam? Among graduates of ABA-accredited law schools who took the exam for the first time last summer, just 78% passed. A year earlier, in July 2013, 82% passed. What explains a four-point drop in a single year?

The ExamSoft debacle looked like an obvious culprit. Time wasted, increased anxiety, and loss of sleep could have affected the performance of some test takers. For those examinees, even a few points might have spelled the difference between success and failure.

Thoughtful analyses, however, pointed out that pass rates fell even in states that did not use ExamSoft. What, then, explains such a large performance drop across so many states? After looking closely at the way in which NCBE and states grade the bar exam, I’ve concluded that ExamSoft probably was the major culprit. Let me explain why–including the impact on test takers in states that didn’t use ExamSoft–by walking you through the process step by step. Here’s how it could have happened:

Tuesday, July 29, 2014

Bar exam takers in about forty states finished the essay portion of the exam and attempted to upload their answers through ExamSoft. But for some number of them, the essays wouldn’t upload. We don’t know the exact number of affected exam takers, but it seems to have been quite large. ExamSoft admitted to a “six-hour backlog” and at least sixteen states ultimately extended their submission deadlines.

Meanwhile, these exam takers were trying to upload their exams, calling customer service, and worrying about the issue (wouldn’t you, if failure to upload meant bar failure?) instead of eating dinner, reviewing their notes for the next day’s MBE, and getting to sleep.

Wednesday, July 30, 2014

Test takers in every state but Louisiana took the multiple choice MBE. In some states, no one had been affected by the upload problem. In others, lots of people were. They were tired, stressed, and had spent less time reviewing. Let’s suppose that, due to these issues, the ExamSoft victims performed somewhat less well than they would have performed under normal conditions. Instead of answering 129 questions correctly (a typical raw score for the July MBE), they answered just 125 questions correctly.

August 2014: Equating

The National Conference of Bar Examiners (NCBE) received all of the MBE answers and began to process them. The raw scores for ExamSoft victims were lower than those for typical July examinees, and those scores affected the mean for the entire pool. Most important, mean scores were lower for both the “control questions” and other questions. “Control questions” is my own shorthand for a key group of questions: questions that appeared on previous bar exams and are repeated on the current one. By comparing scores on the control questions (past and present) with scores on the new questions, NCBE can tell whether one group of exam takers is more or less able than an earlier group. For a more detailed explanation of the process, see this article.

These control questions serve an important function; they allow NCBE to “equate” exam difficulty over time. What if the Evidence questions one year are harder than those for the previous year? Pass rates would fall because of an unfairly hard exam, not because of any difference in the exam takers’ ability. By analyzing responses to the control questions (compared to previous years) and the new questions, NCBE can detect changes in exam difficulty and adjust raw scores to account for them.

Conversely, these analyses can confirm that lower scores on an exam stem from the examinees’ lower ability rather than any change in the exam difficulty. Weak performance on control questions will signal that the examinees are “less able” than previous groups of examinees.

But here’s the rub: NCBE can’t tell from this general analysis why a group of examinees is less able than an earlier group. Most of the time, we would assume that “less able” means less innately talented, less well prepared, or less motivated. But “less able” can also mean distracted, stressed, and tired because of a massive software crash the night before. Anything that affects performance of a large number of test takers, even if the individual impact is relatively small, will make the group appear “less able” in the equating process that NCBE performs.
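To make that point concrete, here is a toy Python sketch with invented numbers. NCBE’s actual equating method is far more sophisticated (it rests on item response theory, not simple averages), but the core signal is the same: a cohort’s performance on repeated control questions, compared with an earlier cohort’s performance on those same questions.

```python
# Toy illustration of equating (invented numbers; NCBE's real method
# uses item response theory and is far more sophisticated).
# Each value is the fraction of examinees answering a control question
# correctly. The questions are identical across the two administrations.

control_2013 = [0.72, 0.65, 0.80, 0.58, 0.70]  # July 2013 cohort
control_2014 = [0.69, 0.62, 0.77, 0.55, 0.67]  # July 2014 cohort

mean_2013 = sum(control_2013) / len(control_2013)
mean_2014 = sum(control_2014) / len(control_2014)

# A negative shift makes the 2014 group look "less able" -- whether the
# cause was weaker preparation or a sleepless night fighting ExamSoft.
ability_shift = mean_2014 - mean_2013
print(f"shift on control questions: {ability_shift:+.3f}")
```

Note that nothing in this comparison can distinguish *why* the 2014 numbers are lower; the equating machinery sees only the scores.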

That’s step one of my theory: struggling with ExamSoft made a large number of July 2014 examinees perform somewhat below their real ability level. Those lower scores, in turn, lowered the overall performance level of the group–especially when compared, through the control questions, to earlier groups of examinees. If thousands of examinees went out partying the night before the July 2014 MBE, no one would be surprised if the group as a whole produced a lower mean score. That’s what happened here–except that the examinees were frantically trying to upload essay questions rather than partying.

August 2014: Scaling

Once NCBE determines the ability level of a group of examinees, as well as the relative difficulty of the test, it adjusts the raw scores to account for these factors. The adjustment process is called “scaling,” and it consists of adding points to the examinees’ raw scores. In a year with an easy test or “less able” examinees, the scaling process adds just a few points to each examinee’s raw score. Groups who faced a harder test, or who were “more able,” get more points. [Note that the process is a little more complicated than this; each examinee doesn’t get exactly the same point addition. The general process, however, works in this way–and affects the score of every single examinee. See this article for more.]
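As a rough sketch of the effect described above, the following Python snippet adds a hypothetical group-wide increment to raw MBE scores. The increment sizes are invented, and (as noted in the bracketed caveat) the real process is not a flat addition; this only illustrates the direction of the effect.

```python
# Hypothetical sketch of scaling: every raw MBE score receives a boost
# whose size depends on the group-level equating result. The increments
# here are invented; real scaling is not a flat, uniform addition.

def scale_scores(raw_scores, increment):
    """Apply a group-wide scaling increment to each raw score."""
    return [r + increment for r in raw_scores]

raw = [129, 125, 140]                    # hypothetical raw MBE scores
normal_year = scale_scores(raw, 16.0)    # typical-year boost (invented)
examsoft_year = scale_scores(raw, 13.0)  # smaller boost for a "less able" group

print(normal_year)    # [145.0, 141.0, 156.0]
print(examsoft_year)  # [142.0, 138.0, 153.0]
```

Every examinee in the pool gets the smaller increment, including examinees in states untouched by the upload failure.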

This is the point at which the ExamSoft crisis started to affect all examinees. NCBE doesn’t scale scores just for test takers who seem less able than others; it scales scores for the entire group. The mean scaled score for the July 2014 MBE was 141.5, almost three points lower than the mean scaled score in July 2013 (which was 144.3). This was also the lowest scaled score in ten years. See this report (p. 35) for a table reporting those scores.

It’s essential to remember that the scaling process affects every examinee in every state that uses the MBE. Test takers in states unaffected by ExamSoft got raw scores that reflected their ability, but they got a smaller scaling increment than they would have received without ExamSoft depressing outcomes in other states. The direct ExamSoft victims, of course, suffered a double whammy: they obtained a lower raw score than they might have otherwise achieved, plus a lower scaling boost to that score.

Fall 2014: Essay Scoring

After NCBE finished calculating and scaling MBE scores, the action moved to the states. States (except for Louisiana, which doesn’t use the MBE) incorporated the artificially depressed MBE scores into their bar score formulas. Remember that those MBE scores were lower for every exam taker than they would have been without the ExamSoft effect.

The damage, though, didn’t stop there. Many (perhaps most) states scale the raw scores from their essay exams to MBE scores. Here’s an article that explains the process in fairly simple terms, and I’ll attempt to sum it up here.

Scaling takes raw essay scores and arranges them on a skeleton provided by that state’s scaled MBE results. When the process is done, the mean essay score will be the same as the mean scaled MBE score for that state. The standard deviations for both will also be the same.

What does that mean in everyday English? It means that your state’s scaled MBE scores determine the grading curve for the essays. If test takers in your state bombed the MBE, they will all get lower scores on the essays as well. If they aced the MBE, they’ll get higher essay scores.

Note that this scaling process is a group-wide one, not an individual one. An individual who bombed the MBE won’t necessarily flunk the essays as well. Scaling uses indicators of group performance to adjust essay scores for the group as a whole. The exam taker who wrote the best set of essays in a state will still get the highest essay score in that state; her scaled score just won’t be as high as it would have been if her fellow test takers had done better on the MBE.
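A minimal version of that mean-and-standard-deviation matching can be sketched in Python. The state’s essay scores and MBE statistics below are invented, and I use the population standard deviation for simplicity; this is just the linear transformation the article describes, not any state’s actual formula.

```python
import statistics

def scale_to_mbe(raw_essays, mbe_mean, mbe_sd):
    """Linearly rescale raw essay scores so their mean and standard
    deviation match the state's scaled MBE distribution."""
    essay_mean = statistics.mean(raw_essays)
    essay_sd = statistics.pstdev(raw_essays)
    return [mbe_mean + mbe_sd * (e - essay_mean) / essay_sd
            for e in raw_essays]

raw_essays = [55, 60, 65, 70, 75]  # one state's raw essay scores (invented)
strong_year = scale_to_mbe(raw_essays, mbe_mean=144.3, mbe_sd=15.0)
weak_year = scale_to_mbe(raw_essays, mbe_mean=141.5, mbe_sd=15.0)

# Identical essays, ranked identically, but every scaled score drops by
# the difference in the MBE means (here, 2.8 points).
print(round(weak_year[0] - strong_year[0], 1))  # -2.8
```

The transformation preserves each examinee’s rank (the best essay writer still gets the state’s top essay score), but the whole distribution slides down when the state’s scaled MBE mean is lower.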

Scaling raw essay scores, like scaling the raw MBE scores, produces good results in most years. If one year’s graders have a snit and give everyone low scores on the essay part of the exam, the scaling process will say, “Wait a minute, the MBE scores show that this group of test takers is just as good as last year’s. We need to pull up the essay scores to mirror performance on the MBE.” Conversely, if the graders are too generous (or the essay questions were too easy), the scaling process will say, “Uh-oh. The MBE scores show us that this year’s group is no better than last year’s. We need to pull down your scores to keep them in line with what previous graders have done.”

The scaled MBE scores in July 2014 told the states: “Your test takers weren’t as good this year as last year. Pull down those essay scores.” Once again, this scaling process affected everyone who took the bar exam in a state that uses the MBE and scales essays to the MBE. I don’t know how many states are in the latter camp, but NCBE strongly encourages states to scale their essay scores.

Fall 2014: MPT Scoring

You guessed it. States also scale MPT scores to the MBE. Once again, MBE scores told them that this group of exam takers was “less able” than earlier groups so they should scale down MPT scores. That would have happened in every state that uses both the MBE and MPT, and scales the latter scores to the former.

Conclusion

So there you have it: this is how poor performance by ExamSoft victims could have depressed scores for exam takers nationwide. For every exam taker (except those in Louisiana) there was at least one hit: a lower scaled MBE score. For many exam takers there were three hits: a lower scaled MBE score, a lower scaled essay score, and a lower scaled MPT score. For some direct victims of the ExamSoft crisis, there was yet a fourth hit: a lower raw score on the MBE. But, as I hope I’ve shown here, those raw scores were also pebbles that set off much larger ripples in the pond of bar results. If you throw enough pebbles into a pond all at once, you trigger a pretty big wave.

Erica Moeser, the NCBE President, has defended the July 2014 exam results on the ground that test takers were “less able” than earlier groups of test takers. She’s correct in the limited sense that the national group of test takers performed less well, on average, on the MBE than the national group did in previous years. But, unless NCBE has done more sophisticated analyses of state-by-state raw scores, that doesn’t tell us why the exam takers performed less “ably.”

Law deans like Brooklyn’s Nick Allard are clearly right that we need a more thorough investigation of the July 2014 bar results. It’s too late to make whole the 2,300 or so test takers who may have unfairly failed the exam. They’ve already grappled with a profound sense of failure, lost jobs, studied for and taken the February exam, or given up on a career practicing law. There may, though, be some way to offer them redress–at least the knowledge that they were subject to an unfair process. We need to unravel the mystery of July 2014, both to make any possible amends and to protect law graduates in the future.

I plan to post some more thoughts on this, including some suggestions about how NCBE (or a neutral outsider) could further examine the July 2014 results. Meanwhile, please let me know if you have thoughts on my analysis. I’m not a bar exam insider, although I studied some of these issues once before. This is complicated stuff, and I welcome any comments or corrections.

Updated on September 21, 2015, to correct reported pass rates.