Making the Cut

Which surgeon you get matters — a lot. But how do we know who the good ones are?

“You can think of surgery as not really that different than golf.” Peter Scardino is the chief of surgery at Memorial Sloan Kettering Cancer Center (MSK). He has performed more than 4,000 open radical prostatectomies. “Very good athletes and intelligent people can be wildly different in their ability to drive or chip or putt. I think the same thing’s true in the operating room.”

The difference is that golfers keep score. Andrew Vickers, a biostatistician at MSK, would hear cancer surgeons at the hospital having heated debates about, say, how often they took out a patient’s whole kidney versus just a part of it. “Wait a minute,” he remembers thinking. “Don’t you know this?”

“How come they didn’t know this already?”

In the summer of 2009, he and Scardino teamed up to begin work on a software project, called Amplio (from the Latin for “to improve”), to give surgeons detailed feedback about their performance. The program — still in its early stages but already starting to be shared with other hospitals — started with a simple premise: the only way a surgeon is going to get better is if he knows where he stands.

Vickers likes to put it this way. His brother-in-law is a bond salesman, and you can ask him, How’d you do last week?, and he’ll tell you not just his own numbers, but the numbers for his whole group.

Why should it be any different when lives are in the balance?

Andrew Vickers

The central technique of Amplio — using outcome data to determine which surgeons are more successful, and why — takes on a powerful taboo. Perhaps the longest-standing impediment to research into surgical outcomes — the reason that surgeons, unlike bond salesmen (or pilots or athletes), are so much in the dark about their own performance — is the surgeons themselves.

“Surgeons basically deeply believe that if I’m a well-trained surgeon, if I’ve gone through a good residency program, a fellowship program, and I’m board-certified, I can do an operation just as well as you can,” Scardino says. “And the difference between our results is really because I’m willing to take on the challenging patients.”

It is, maybe, a vestige of the old myth that anyone ordained to cut into healthy flesh is thereby made a minor god. It’s the belief that there are no differences in skill, and that even if there were differences, surgery is so complicated and multifaceted, and so much determined by the patient you happen to be operating on, that no one would ever be able to tell.

Vickers told me that after several years of hearing this, he became so frustrated that he sat down with his ten-year-old daughter and conducted a little experiment. He searched YouTube for “radical prostatectomy” and found two clips, one from a highly respected surgeon and one from a surgeon who was rumored to be less skilled. He showed his daughter a 15-second clip of each and asked, “Which one is better?”

“That one,” she replied right away.

When Vickers asked her why, “She looked at me, like, can’t you tell the difference? You can just see.”

Would you want to be cut by this surgeon?

Or this one?

A remarkable paper published last year in the New England Journal of Medicine showed that maybe Vickers’s daughter was onto something.

In the study, run by John Birkmeyer, a surgeon who at the time was at the University of Michigan, bariatric surgeons from around the state of Michigan were recruited to submit videos of themselves performing a gastric bypass operation. The videos were sent to another pool of bariatric surgeons, who gave each a series of 1-to-5 ratings on factors such as “respect for tissue,” “time and motion,” “economy of movement,” and “flow of operation.”

The study’s key finding was that not only could you reliably determine a surgeon’s skill by watching them on video — skill was nowhere near as nebulous as had been assumed — but that those ratings were highly correlated with outcomes: “As compared with patients treated by surgeons with high skill ratings, patients treated by surgeons with low skill ratings were at least twice as likely to die, have complications, undergo reoperation, and be readmitted after hospital discharge,” Birkmeyer and his colleagues wrote in the paper.

You can actually watch a couple of these videos yourself [see above]. Along with the overall study results, Birkmeyer published two short clips: one from a highly rated surgeon and one from a low-rated surgeon. The difference is astonishing.

You see the higher-rated surgeon first. It’s what you always imagined surgery might look like. The metal hands move with purpose — quick, deliberate strokes. There’s no wasted motion. When they grip or sew or staple tissue, it’s with a mix of command and gentle respect. The surgeon seems to know exactly what to do next. The way they’ve set things up makes it feel roomy in there, and tidy.

Watching the lower-rated surgeon, by contrast, is like watching the hidden camera footage of a nanny hitting your kid: it looks like abuse. The surgeon’s view is all muddled, they’re groping aimlessly at flesh, desperate to find purchase somewhere, or an orientation, as if their instruments are being thrashed around in the undertow of the patient’s guts. It’s like watching middle schoolers play soccer: the game seems to make no sense, to have no plot or direction or purpose or boundary. It’s not, in other words, like, “This one’s hands are a bit shaky,” it’s more like, “Does this one have any clue what they’re doing?”

It’s funny: in other disciplines we reserve the word “surgical” for feats that took a special poise, a kind of deftness under pressure. But the thing we maybe forget is that not all surgery is worthy of the name.

Vickers is best known for showing exactly how much variation there is, plotting, in 2007, the so-called “learning curve” for surgery: a graph that tracks, on one axis, the number of cases a surgeon has under his belt, and on the other, his recurrence rates (the rate at which his patients’ cancer comes back).

As surgeons get more experience, their patients do better. This “learning curve” shows patients’ five-year cancer-free rates rising with procedure volume.

He showed that in cases of prostate cancer that hasn’t spread beyond the prostate — so-called “organ-confined” cases — the recurrence rate for a novice surgeon was 10 to 15%. For an experienced surgeon, it was less than 1%. With recurrence so rare for the most experienced surgeons, Vickers was able to conclude that in organ-confined cases, the only reason a patient’s cancer would come back is “because the surgeon screwed up.”

There’s a large literature, going back to a famous paper in 1979, finding that hospitals with higher volumes of a given surgical procedure have better outcomes. In the ’79 study it was reported that for some kinds of surgery, hospitals that saw 200 or more cases per year had death rates that were 25% to 41% lower than hospitals with lower volumes. If every case were treated at a high-volume hospital, you would avoid more than a third of the deaths associated with the procedure.

But what wasn’t clear was why higher volumes led to better outcomes. And for decades, researchers penned more than 300 studies restating the same basic relationship, without getting any closer to explaining it. Did low-volume hospitals end up with the riskiest patients? Did high-volume hospitals have fancier equipment? Or better operating room teams? A better overall staff? An editorial as late as 2003 summarized the literature with the title, “The Volume–Outcome Conundrum.”

A 2003 paper by Birkmeyer, “Surgeon volume and operative mortality in the United States,” was the first to offer definitive evidence that the biggest factor determining the outcome of many surgical procedures — the hidden element that explained most of the variation among hospitals — was the procedure volume not of the hospital, but of the individual surgeons.

“In general I don’t think anyone was surprised that there was a learning curve,” Vickers says. “I think they were surprised at what a big difference it made.” Surprised, maybe, but not moved to action. “You may think that everyone would drop what they were doing,” he says, “and try and work out what it is that some surgeons are doing that the other ones aren’t… But things move a lot more slowly than that.”

Tired of waiting, Vickers started sharing some initial ideas with Scardino about the program that would become Amplio. It would give surgeons detailed feedback about their performance. It would show you not just your own results, but the results for everyone in your service. If another surgeon was doing particularly well, you could find out what accounted for the difference; if your own numbers dropped, you’d know to make an adjustment. Vickers explains that they wanted to “stop doing studies showing surgeons had different outcomes.”

“Let’s do something about it,” he told Scardino.

Dr. Scardino

The first time I heard about Amplio was on the third floor of the Chrysler Building, in a room they called the Innovation Lab — the very room you’d point to if the Martians ever asked you what a 125-year-old bureaucracy looks like. As I arrived, the receptionist was trying to straighten up a small mess of papers, Post-its, cookies, and coffee stirrers. “The last crowd had a wild time,” she said. Every surface in the room was gray or off-white, the color of questionable eggs. It smelled like hospital-grade hand soap.

The people who filed in, though, and introduced themselves to each other (this was a summit of sorts, a “Collaboration Meeting” where different research groups from around MSK shared their works in progress) looked straight out of a well-funded biotech startup. There was a Fulbright scholar; a double-major in biology and philosophy; a couple of epidemiologists; a mathematician; a master’s in biostats and predictive analytics. There were Harvards, Cals, and Columbias, bright-eyed and sharply dressed.

Vickers was one of the speakers. He’s in his forties but he looks younger, less like an academic than a seasoned ski instructor, a consequence, maybe, of the long wavy hair, or the well-worn smile lines around his eyes, or this expression he has that’s like a mix of relaxed and impish. He leans back when he talks, and he talks well, and you get the sense that he knows he talks well. He’s British, from north London, educated first at Cambridge and then, for his PhD in clinical medicine, at Oxford.

The first big task with Amplio, he said, was to get the data. In order for surgeons to improve, they have to know how well they’re doing. In order to know how well they’re doing, they have to know how well their patients are doing. And this turns out to be trickier than you’d think. You need an apparatus that not only keeps meticulous records, but keeps them consistently, and throughout the entire life cycle of the patient.

That is, you need data on the patient before the operation: How old are they? What medications are they allergic to? Have they been in surgery before? You need data on what happened during the operation: Where did you make your incisions? How much blood was lost? How long did it take?

And finally, you need data on what happened to the patient after the operation — in some cases years after. In many hospitals, followup is sporadic at best. So before the Amplio team did anything fancy, they had to devise a better way to collect data from patients. They had to work out, for instance, whether it was better to give the patient a survey before or after a consultation with their surgeon. What kinds of questions worked best? And who was supposed to take the iPad when the patient was done?

Only when all these questions were answered, and a stream of regular data was being saved for every procedure, could Amplio start presenting something for surgeons to use.

A screen in Amplio shows how a surgeon’s patients are doing against their colleagues’

After years of setup, Amplio is now in a state where it can begin to affect practice. The way it works is that a surgeon logs in to a screen that shows where they stand on a series of plots. On each plot there’s a single red dot sitting amid a scatter of blue dots. The red dot shows your outcomes; the blue dots show the outcomes of each of the other surgeons in your group.

You can slice and dice different things you’re interested in to make different kinds of plots. One plot might show the average amount of blood lost during the operation against the average length of the hospital stay after it. Another plot might show a prostate patient’s recurrence rates against his continence or erectile function.
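The comparison behind one of those plots is simple enough to sketch in a few lines. Amplio’s actual code and data model aren’t public, so this is a toy illustration: the surgeons, case records, and field names are all invented.

```python
# Toy sketch of an Amplio-style comparison. Every name and number
# here is invented for illustration; this is not Amplio's code.
from statistics import mean

# Each record: (surgeon, blood loss in ml, length of stay in days)
cases = [
    ("A", 350, 2), ("A", 420, 3), ("A", 300, 2),
    ("B", 650, 5), ("B", 700, 4), ("B", 600, 6),
    ("C", 400, 3), ("C", 450, 3), ("C", 380, 2),
]

def per_surgeon_averages(cases):
    """Average blood loss and length of stay for each surgeon."""
    by_surgeon = {}
    for surgeon, blood, stay in cases:
        by_surgeon.setdefault(surgeon, []).append((blood, stay))
    return {
        s: (mean(b for b, _ in recs), mean(d for _, d in recs))
        for s, recs in by_surgeon.items()
    }

averages = per_surgeon_averages(cases)
# "You" are surgeon B: your red dot (high blood loss, long stays)
# sits far from the blue cluster of your colleagues.
you = averages["B"]
peers = [v for s, v in averages.items() if s != "B"]
print(you, peers)
```

Plotting `you` in red against `peers` in blue gives exactly the kind of picture the surgeons log in to see.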

There’s something powerful about having outcomes graphed so starkly. Vickers says that there was a surgeon who saw that they were so far into the wrong corner of that plot — patients weren’t recovering well, and the cancer was coming back — that they decided to stop doing the procedure. The men spared poor outcomes by this decision will never know that Amplio saved them.

It’s like an analytics dashboard, or a leaderboard, or a report card, or… well, it’s like a lot of things that have existed in a lot of other fields for a long time. And it kind of makes you wonder, why has it taken so long for a tool like this to come to surgeons?

The answer is that Amplio has cleverly avoided the pitfalls of some previous efforts. For instance, in 1989, New York state began publicly reporting the mortality rates of cardiovascular surgeons. Because the data was “risk-adjusted” — an unfavorable outcome would be considered less bad, or not counted at all, if the patient was at risk to begin with — surgeons started pretending their patients were a lot worse off than they were. In some cases, they avoided patients who looked like goners. “The sickest patients weren’t being treated,” Vickers says. One investigation into why mortality in New York had dropped for a certain procedure, the coronary artery bypass graft, concluded that it was just because New York hospitals were sending the highest-risk patients to Ohio.

Vickers wanted to resist such gaming. But the answer is not to quit adjusting for patient risk. After all, if a report says that your patients have 60% fewer complications than mine, does that mean you’re a 60% better surgeon? It depends on the patients we see. It turns out that maybe the best way to prevent gaming is simply to keep the results confidential. That sounds counter to patients’ interests, but it’s been shown that patients actually make little use of objective outcome data when it’s available; in fact, they’re much more likely to choose a surgeon or hospital based on reputation or raw proximity.
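One common way to adjust for patient risk is an observed-to-expected (O/E) ratio: count a surgeon’s actual complications, then divide by the number a risk model would have predicted for that particular mix of patients. This is a standard textbook technique, not necessarily what Amplio or the New York registry used, and the risk scores below are invented.

```python
# Minimal sketch of risk adjustment via an observed-to-expected
# (O/E) ratio. A standard technique in outcomes research; the
# patients and risk scores here are invented for illustration.

def oe_ratio(outcomes, predicted_risks):
    """Observed complications divided by the number a risk model
    predicted for this particular mix of patients.
    O/E = 1 means you did as expected; > 1 means worse."""
    observed = sum(outcomes)          # 1 = complication, 0 = none
    expected = sum(predicted_risks)   # model's per-patient probability
    return observed / expected

# Surgeon X takes hard cases (high predicted risks): two
# complications is exactly what the model expected.
x_outcomes = [1, 0, 1, 0, 0]
x_risks    = [0.50, 0.25, 0.50, 0.25, 0.50]   # expected = 2.0

# Surgeon Y's single complication among low-risk patients is
# twice what the model expected.
y_outcomes = [0, 1, 0, 0, 0]
y_risks    = [0.10, 0.10, 0.10, 0.10, 0.10]   # expected = 0.5

print(round(oe_ratio(x_outcomes, x_risks), 2))  # 1.0
print(round(oe_ratio(y_outcomes, y_risks), 2))  # 2.0
```

The gaming problem is visible right in the formula: exaggerate your patients’ predicted risks and `expected` inflates, pulling your O/E ratio down without your outcomes changing at all.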