Picture the following situation: You are taking a freshman-level philosophy class in college, and your professor has just asked you to imagine a runaway trolley barreling down a track toward a group of five people. The only way to save them from being killed, the professor says, is to hit a switch that will turn the trolley onto an alternate set of tracks where it will kill one person instead of five. Now you must decide: Would mulling over this dilemma enlighten you in any way?

I ask because the trolley-problem thought experiment described above—and its standard culminating question, Would it be morally permissible for you to hit the switch?—has in recent years become a mainstay of research in a subfield of psychology. Scientists use versions of the kill-one-to-save-five hypothetical, reworded and reframed for added nuance, as a standard way to probe the workings of the moral mind. The corpus of “trolleyology” data they’ve produced hints that men are more likely than women to sacrifice a life for the sake of several others, for example, and that younger people are inclined to do the same. (Argh, millennials and their consequentialist moral paradigm!) Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.

For all this method’s enduring popularity, few have bothered to examine how it might relate to real-life moral judgments. Would your answers to a set of trolley hypotheticals correspond with what you’d do if, say, a deadly train were really coming down the tracks, and you really did have the means to change its course? In November 2016, though, Dries Bostyn, a graduate student in social psychology at the University of Ghent, ran what may have been the first-ever real-life version of a trolley-problem study in the lab. In place of railroad tracks and human victims, he used an electroshock machine and a colony of mice—and the question was no longer hypothetical: Would students press a button to zap a living, breathing mouse, so as to spare five other living, breathing mice from feeling pain?

“I think almost everyone within this field has considered running this experiment in real life, but for some reason no one ever got around to it,” Bostyn says. He published his own results last month: People’s thoughts about imaginary trolleys and other sacrificial hypotheticals did not predict their actions with the mice, he found.

It’s a discomfiting result, and one that seems—at least at first—to throw a boulder into the path of this research. Scientists have been using a set of cheap-and-easy mental probes (Would you hit the railroad switch?) to capture moral judgment. But if the answers to those questions don’t connect to real behavior, then where, exactly, have these trolley problems taken us?

Let’s start with where the field originated. In its modern form, the railway thought experiment started in the philosophy of ethics with Philippa Foot: In 1967, she asked her readers to imagine the driver of a “runaway tram” who could steer away from five potential victims, killing one instead. Another ethicist, Judith Jarvis Thomson, followed up with a more expansive set of hypotheticals, now presented to the reader in the second person, with two competing versions of the trolley story: In the switch narrative, the thought-experimenter must decide if he or she would intervene to send the trolley down a different track. (That’s the version given at the top of this piece.) Thomson contrasted this to the footbridge scenario, in which you’re instructed to imagine that you’re standing on a walkway up above the trolley tracks where the five people will soon be struck and killed. There’s a large stranger standing on the bridge beside you, and if you push him to his death (a feat you have the physical capacity to accomplish), the trolley will be derailed before it hits the others. Is it morally permissible to shove the big guy to his certain death?

By laying out these two scenarios side by side, philosophers aimed to disentangle two competing moral frameworks—one that focuses on promotion of the greatest good, and the other on following rules for avoiding harm. The juxtaposition proved illuminating because most people’s moral intuitions seem to flip when the harm becomes more personal: They say that it’s OK to hit the switch but not to shove the person off the bridge, even though the trade-off in human lives remains the same. Ethicists used this flip, and other intuitions drawn from trolley problems, in their arguments over how a person ought to make a moral judgment in real life.

What had started out as rhetoric for philosophical debate ended up as fodder for experiments. In the early 2000s, a Princeton graduate student named Joshua Greene placed people in an fMRI brain scanner and confronted them with switch- and footbridge-type moral dilemmas, to see how moral thinking played out in the brain. On the basis of this and other studies, he and his colleagues argued that the two varieties of moral reasoning arise from different anatomical structures. Greene proposed that a slow, rational decision-making process leads people to arrive at a greatest-good conclusion (and say they’d flip the switch), while a quicker emotion-based process leads them to avoid inflicting harm on principle (and say they’d never push the person off the bridge).

Greene’s first paper on this topic has since been cited several thousand times. With that, a new laboratory model had been established, and trolley problems came to serve as a means for tapping separate circuits in the brain, or at least separate ways of thinking. Follow-up research leveraged people’s responses to hypothetical dilemmas, especially trolley problems and their derivatives, to understand the nature of their mental processes, and how their moral judgments might be formed or influenced. “People sometimes ask me why I bother with these bizarre hypothetical dilemmas,” Greene wrote in 2009, by which time he’d joined the psychology faculty at Harvard. “To me, these dilemmas are like a geneticist’s fruit flies. They’re manageable enough to play around with in the lab but complex enough to capture something interesting about the wider and wilder world outside.”

But even from the get-go, people worried that these dilemmas might not fly outside the lab. “Even philosophers had complained about philosophers thinking too much about trolley cases,” said Greene in a recent interview. For one thing, the imaginary setups were on their face absurd. (Why can’t you just yell at the people to get off the tracks? How do we know this fat guy’s body will be enough to stop the trolley? Does anyone still ride trolleys, anyway? Et cetera.) It also seemed a little off that trolley problems were often posed in funny, entertaining ways, while real-life moral dilemmas are unfunny as a rule. And it wasn’t clear how they related to reality. Fruit flies succeeded as model organisms in part because they offered an inexpensive, flexible, and reproducible means of running experiments and gathering data. While the same could be said of trolley problems, these features only make them halfway useful in the lab. Fruit-fly researchers know that more than half the insects’ genes have human analogues; to borrow Greene’s phrase, it’s clear that fruit-fly biology “captures something interesting about the wider and wilder world outside” the lab. Can the same be said of trolley-switch dilemmas?

That’s what Bostyn was attempting to suss out in Belgium. At first he thought of using real-life people as potential victims in a Milgram-esque dilemma: Would you let these five people receive electric shocks, or press a switch to zap another guy instead? But Bostyn figured the subjects in his study would realize that, rules-based ethics procedures being what they are, everyone involved had given their consent, and that understanding would make the stakes too low. So he went with animals instead. “Everyone always asks why we didn’t use puppies or kittens instead of mice,” he said. After all, lots of people kill mice at home in the absence of a moral meltdown. Puppies or kittens would’ve been far more expensive, though. Lab mice are ubiquitous at research universities, and serve as the default animals for many kinds of research. (You might say that they’re like the trolley problems of biomedicine.) For practical reasons, then, Bostyn ended up using one well-established laboratory model to investigate another.

He worked as quickly as he could, so rumors wouldn’t spread about the research he was doing. In the end, he brought several hundred people to the lab in one week. Each experiment began with 10 trolleyology dilemmas, including the classic story of the stranger on the footbridge. Then some participants were asked to consider one more hypothetical: “Imagine the following situation,” this one read,

You are participating in an experiment as part of a course in Social Psychology. Previously, you were asked to respond to several moral dilemmas, much like the ones you have answered. You are guided to the lab, the door opens and you see two cages with mice: one cage containing a single mouse, one cage containing five mice. An electroshock machine is hooked up to both cages. The experimenter tells you that after a 20-second timer, an electrical shock will be administered to the cage with the five mice but that you can push a button to redirect this shock to the cage containing the single mouse. The shocks are very painful but nonlethal. Would you press the button?

Two-thirds of Bostyn’s subjects said yes, they would indeed press the button in this scenario.

The rest were tested on the real-life version of the mouse dilemma. In the lab were two cages with red plastic lids and mice inside, an electroshock machine, and a laptop that showed the 20-second timer. When the timer got to zero, the experiment was over. No shocks were ever administered to the animals, but the laptop recorded whether (and when) each participant had pressed the button. Participants would see, in the end, that the button had no effect—but at that point they’d already made their choice.

About five-sixths of these subjects pressed the actual button, suggesting they were more inclined to make that choice in real life than their fellow subjects were in hypotheticals. Moreover, people’s responses to the 10 trolleyology dilemmas they were given at the start of the experiment—whether they imagined that they’d push the fat man off the bridge and all that—did not meaningfully predict their choices with live mice. Those who had seemed to be more focused on the greater good in the hypotheticals did seem to press the real-life button more quickly, though, and they described themselves as being more comfortable with their decision afterward.

At least one of Bostyn’s findings—that when presented with a more realistic scenario, people are more inclined to sacrifice an individual for the benefit of the group—falls in line with earlier research. In the fall of 2016, just before he ran his experiment, a team of psychologists at the University of Plymouth led by Kathryn Francis published a trolley study that compared people’s responses to text-based hypotheticals with their behaviors in a virtual-reality environment. Subjects either made their judgments on the written version of the footbridge scenario, or else they watched the same vignette unfold in an Oculus Rift headset. At one point in the VR version, subjects heard a voice: Hey I am too far away but if you want to save the people you could push the large person on to the tracks and derail the train, it said. If you’re going to push him, do it now, but it is your choice. The sample size was small, but Francis and her colleagues found that people were more likely to push the simulated stranger off the footbridge with a flick of a joystick than they were to say they’d push a stranger in the thought experiment.

So in both the mouse study and the VR experiment, more life-like settings seemed to make subjects more pragmatic in their moral judgments. Bostyn wonders if people who are presented with standard trolley hypotheticals give biased answers because they’re worried about their reputations. They might think that if they told the experimenter they’d flip the switch or push the stranger off the bridge, it would make them seem cold and calculating. To avoid that outcome, they tilt their responses in the opposite direction. But when they’re confronted with a real-life version of the same dilemma, and one with real-life stakes, they might ignore that social anxiety and enact their truer, more utilitarian moral judgment.

If people’s answers to a trolley-type dilemma don’t match up exactly with their behaviors in a real-life (or realistic) version of the same, does that mean trolleyology itself has been derailed? The answer to that question depends on how you understood the purpose of those hypotheticals to begin with. Sure, they might not predict real-world actions. But perhaps they’re still useful for understanding real-world reactions. After all, the laboratory game mirrors a common experience: one in which we hear or read about a thing that someone did—a policy that she enacted, perhaps, or a crime that she committed—and then decide whether her behavior was ethical. If trolley problems can illuminate the mental process behind reading a narrative and then making a moral judgment, then perhaps we shouldn’t care so much about what happened when this guy in Belgium pretended to be electrocuting mice.

Or perhaps the trolley problems don’t even have to model any real-life situations whatsoever. They could provide insight into how people judge the way they ought to act, or the way they’d like to act, even if they would or could not act that way in practice.

“I don’t deny the straightforward implications of [Bostyn’s] research,” says Greene. “You can’t just ask people a hypothetical question, especially when it involves unfamiliar situations and relatively high stakes, and assume that what they say in response is what they would actually do. That’s important and worth knowing.” At the same time, he says, Bostyn’s data aren’t grounds for saying that responses to trolley hypotheticals are useless or inane. After all, the mouse study did find that people’s answers to the hypotheticals predicted their actual levels of discomfort. Even if someone’s feeling of discomfort may not always translate to real-world behavior, that doesn’t mean that it’s irrelevant to moral judgment. “The more sensible conclusion,” Greene added over email, “is that we are looking at several weakly connected dots in a complex chain with multiple factors at work.”

If that’s the case, then trolley hypotheticals could be a useful way of teasing out or even amplifying aspects of cognition that are hard to see in real-life settings. Indeed, Greene insists that the dilemmas were never meant to serve as “cheap surrogates” for how people would respond to actual conundrums: “That was never the goal from my point of view.” Rather, the dilemmas were more like highly tailored artificial stimuli. He compares them to the flashing checkerboards used by vision scientists to drive neural responses in the retina and cerebral cortex. We may not see a lot of flashing checkerboards in daily life, but these stimuli can still activate the brain in reliable and telling ways. The same goes for trolley problems, Greene argues: Even if they turn out to have little bearing on reality, they can still be useful as a tool of basic science.

Bostyn’s mice aside, there are other reasons to be wary of the trolley hypotheticals. For one thing, a recent international project to reproduce 40 major studies in the field of experimental philosophy included stabs at two of Greene’s highly cited trolley-problem studies. Both failed to replicate. Then there’s the fact that trolley-type dilemmas are often interpreted as though they were a valid measure of a person’s or a population’s tendency toward utilitarian decision-making. (That’s how researchers concluded that men are more utilitarian than women and that millennials are more utilitarian than Gen-Xers.) But recent research finds these hypotheticals measure only one component of utilitarian moral judgment—namely, the willingness to inflict sacrificial harm. That leaves out another basic element of this ethical framework: one’s commitment to the greater good, and positive investment in the well-being of strangers. That explains the awkward fact that trolley studies tend to label psychopaths as utilitarians despite their moral shortcomings. (Psychopaths, it turns out, tend to be quite willing to endorse pushing strangers off of footbridges.)

In the end, the value of the trolley problem as a research tool may not depend on whether people have the same response in real-world situations. But given its ubiquity—and its well-established imperfections—it would have been a happier hypothetical for trolleyologists if Bostyn’s study had come out the other way.