When I first took the implicit association test a few years ago, I was happy with my results: The test found that I had no automatic preference for either white or black people. According to this test, I was a person free of racism, even at the subconscious level.

I took the IAT again a few days later. This time, I wasn’t so happy with my results: It turns out I had a slight automatic preference for white people. According to this, I was a little racist at the subconscious level — against black people.

Then I took the test again later on. This time, my results genuinely surprised me: It found once again that I had a slight automatic preference — only now it was in favor of black people. I was racist, but against white people, according to the test.

At this point, I was at a loss as to what this test was telling me. Should I consider the average of my three results, essentially showing I had no bias at all? Or should I have used the latest result? Was this test even worth taking seriously, or was it bullshit? I felt like I had gotten no real answers about my bias from this test. (I recently retook the test a few times — and, again, it was all over the place.)

It made me wonder: What would have happened if I had taken the test just once and walked away from it? Would I smugly conclude I wasn’t racist at all? What if I had gotten one of my other results that first time, potentially leading me to conclude that I am racist against either white or black people? What would I think of myself if I had just taken this test at face value? After all, it was managed by a group of respected researchers at Harvard University.

But here’s the thing: It turns out the IAT might not tell individuals much about their individual bias. According to a growing body of research and the researchers who created the test and maintain it at the Project Implicit website, the IAT is not good for predicting individual biases based on just one test. It requires a collection, or aggregate, of tests before it can really support any sort of conclusion.

“It can predict things in the aggregate, but it cannot predict behavior at the level of an individual” who took the test once, Calvin Lai, a postdoctoral fellow at Harvard University and director of research at Project Implicit, told me.

For individuals, this means they would have to take the test many times — maybe dozens of times — and average out the results to get a clear indication of their bias and potentially how that bias guides behavior. For a broader population, it’s similar: You’d have to collect the results of individuals in the population and average those out to get an idea of the overall population’s bias and potential behavior.
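To see why averaging helps, here is a rough simulation, not anything Project Implicit publishes. It assumes every simulated person has the same underlying score of zero and that each sitting adds independent random noise; the noise level (0.4) and the function names are made up for illustration. Under those assumptions, the spread of averaged scores shrinks as the number of sittings grows.

```python
import random
import statistics

def simulate_average(true_score, noise_sd, n_tests, rng):
    """Average n_tests noisy readings of one person's underlying score."""
    readings = [rng.gauss(true_score, noise_sd) for _ in range(n_tests)]
    return statistics.fmean(readings)

def spread_of_averages(true_score, noise_sd, n_tests, n_people=2000, seed=0):
    """Standard deviation of the averaged score across many simulated
    people who all share the same underlying score."""
    rng = random.Random(seed)
    averages = [simulate_average(true_score, noise_sd, n_tests, rng)
                for _ in range(n_people)]
    return statistics.pstdev(averages)

# Hypothetical settings: true score 0.0, per-sitting noise 0.4.
for n in (1, 10, 40):
    print(n, "sittings -> spread", round(spread_of_averages(0.0, 0.4, n), 3))
```

The catch is the independence assumption: if the noise is not random from sitting to sitting (say, a test taker is consistently slow with the keys), averaging will not wash it out.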

This isn’t how the test was sold in books like Malcolm Gladwell’s Blink or the pages of news organizations like the New York Times. The assumption seemed to be that you could take the test once and come away with a clear picture of your bias. And that served a real-world purpose: In a society that no longer tolerates explicit racism nearly as much as it used to, uncovering people’s subconscious implicit biases seemed like the way to show people that they really can be and are still racist.

Yet no researcher — not even the test’s creators — defends the one-off use.

Tony Greenwald, a University of Washington researcher who co-created the test with Mahzarin Banaji at Harvard, conceded this point, telling me that the IAT is only “good for predicting individual behavior in the aggregate, and the correlations are small.”

So the test, particularly when it came to individual results, really may not have told me anything valuable. The psychological tool that so many have trumpeted for measuring racism may not work as well as originally thought.

The problem the IAT sought to fix

The IAT tries to solve a very tricky problem we’ve seen in social science over the past few years: Measures of explicit racism (for example, directly asking whether a person thinks white people are superior to black people) have appeared to show a decline. But how much of that actually shows that racism is diminishing? Is it possible that people are lying when they answer those questions, fearing that telling the truth would make them look racist? And even if people don’t report explicit biases, is it possible they have implicit — meaning subconscious — ones?

Researchers created a test that they hoped would work around these questions. Then they made it public — through Project Implicit — in hopes that they could draw on a massive pool of test takers to flesh out their research, while raising awareness about implicit biases and how racism and other kinds of prejudice still exist within American society today.

The IAT tries to get at this by digging into people’s initial reflexes — and, hopefully, their subconscious mind — to gauge their real views. The race-based IAT works by asking you to first use two buttons (“E” or “I”) on your keyboard to identify a series of faces that flash on your screen as black or white and a series of words that flash on your screen as good or bad.

Where the test gets trickier is when it mixes up these categories. In the following rounds, both faces and words will flash on your screen, but you’ll still be limited to “E” or “I” — only “E” could now mean “black or good” while “I” will mean “white or bad” in one round and later be reversed so “E” means “black or bad” and “I” means “white or good” in the next round. The idea is that if you have a slower reaction to selecting “good” when “black” is linked to it or “bad” when “white” is linked to it, you probably have a bias against black people or bias in favor of white people. (You can take the test to better understand how it works.)
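The reaction-time comparison described above can be sketched in a few lines. This is a simplified illustration, not the scoring the Project Implicit site actually runs: the function name, the choice to scale the gap by the spread of all trials, and the reaction times are all assumptions made for the example.

```python
import statistics

def latency_gap_score(white_good_ms, black_good_ms):
    """Gap in mean reaction time between the two mixed rounds,
    scaled by the standard deviation of all trials combined.

    white_good_ms: times from the round where "white" shares a key with "good"
    black_good_ms: times from the round where "black" shares a key with "good"

    Positive: slower when "black" was paired with "good", the pattern the
    test reads as a preference for white people. Negative: the reverse.
    Near zero: little difference between the rounds.
    """
    gap = statistics.fmean(black_good_ms) - statistics.fmean(white_good_ms)
    spread = statistics.stdev(list(white_good_ms) + list(black_good_ms))
    return gap / spread

# Made-up reaction times in milliseconds, for illustration only.
white_good_round = [640, 710, 655, 690, 720, 675]
black_good_round = [700, 745, 690, 760, 735, 715]

print(round(latency_gap_score(white_good_round, black_good_round), 2))
```

Because the score rests on gaps of a few dozen milliseconds, a handful of fumbled key presses in one round can move it noticeably.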

After several rounds of this, the test tells you if you have an “automatic preference” toward black or white people. (Greenwald emphasized to me that although a lot of people interpret this “automatic preference” as evidence of racism, his team doesn’t describe the results in that way. “I and my colleagues and collaborators do not call the IAT results a measure of implicit prejudice [or] implicit racism,” he said. “Racism and prejudice are explicit attitudes with components of hostility or negative animus toward a group. The IAT doesn’t even begin to measure something like that.”)

For the individual, the motivation to take the test is obvious: People would like to know if they have some deep, underlying bias against others of certain races. And the IAT, based on how it presents the results, at least appears to give some answers.

The IAT can’t really do what it’s supposed to: predict your bias

Only the IAT doesn’t predict subconscious racial biases, at least based on one test. So one time with the IAT might not tell you much, if anything, about your actual individual views and behavior.

As Lai told me, it’s not clear if the test even predicts biased behavior better than explicit measures: “What we don’t know is … whether or not the IAT and measures like the IAT can predict behavior over and above corresponding questionnaires of what we would call explicit measures or explicit attitudes.”

The big problem with the test is it doesn’t only pick up subconscious biases.

“The IAT is impacted by explicit attitudes, not just implicit attitudes,” James Jaccard, a New York University researcher who’s criticized the IAT, told me. “It is impacted by people’s ability to process information quickly on a general level. It is impacted by desires to want to create a good impression. It is impacted by the mood people are in. If the measure is an amalgamation of many things (one of which is purportedly implicit bias), how can we know which of those things is responsible for a (weak) correlation with behavior?”

I felt one of those variables when I took the test: I often pressed the wrong button or took a little longer pressing a button because I genuinely blanked out on what the buttons were for. This happened fewer times as I took the test more often, but it still happened. And this is coming from someone who plays a lot of video games in his free time, so I’m probably making fewer mistakes to begin with than most people are.

The IAT, however, does little to make its flaws clear to test takers. When you finish the test, you get a big message on your screen proclaiming you either have an “automatic preference,” with different measures like “slight” or “strong” to gauge the level of your preference, or “no automatic preference.” There’s no clear disclaimer about how one shot at the test is likely meaningless for predicting an individual’s bias and behavior.

Lai copped to this: “One thing that has come to our attention over the years is that we are not yet doing enough to clear up the misconceptions either way.” So Lai and other researchers know that people are coming away from one IAT session thinking that it gave a definitive measure of their bias, yet there’s no clear, concise disclaimer on the website that tells people the IAT is not very good for measuring individual bias after one test. But changes, Lai said, will be coming in the next few weeks or months.

“When you compare our website to other websites and social media, we’re already very wordy, very jargony,” he said. “You know, we’re scientists. So there’s always going to be a trade-off in how much we tell participants upfront.”

Regardless, as it stands, there seems to be wide agreement that the test is not good for predicting individual bias and behaviors after just one sitting.

The test might be good for measuring bias in the aggregate

Where the debate over the IAT gets much more contested is whether the IAT is good for predicting aggregate behavior — meaning behavior in an overall group.

As the IAT’s supporters admit, the IAT may not tell you much about the biases or behavior of an individual who took the test once. But once you take the results of a much larger population of test takers or an individual who took the test multiple times, supporters argue, you can say with some certainty whether that broader group or that individual is implicitly biased based on the average of all the tests.

Not everyone agrees. This debate, in fact, is very heated, unusually so for the academic world. When I reached out to several researchers who have criticized the IAT, they told me they wanted no more part in this discussion, instead pointing me to a piece by Jesse Singal at New York magazine for their side.

“I appreciate your interest but I’m mostly trying to extricate myself from that debate — it’s genuinely unpleasant,” Hart Blanton, a University of Connecticut researcher who has criticized the IAT, told me.

So here’s what Blanton told Singal: “If you’re not willing to say what the positive [IAT score] means at the individual level, you have no idea what it means at the aggregate level. … If I’m willing to give 100 kids an IQ test, and not willing to say what an individual kid’s score means, how can I then say 75 percent of them are geniuses, or are learning disabled?”

In other words, if the test can’t predict individual behavior after one session, how can we be so sure that it can really tell us anything through an aggregate of those individual tests?

The IAT’s creators and facilitators, however, pushed back on that argument. “Many things are not good at predicting individual behavior,” Lai said, “but we still find valuable in the aggregate.”

Greenwald, the co-creator of the IAT, pointed to blood pressure tests as an example of another measure that isn’t totally accurate at the individual level after just one test but is accurate in the aggregate. Almost anyone who has gone to the doctor or tried one of those blood pressure machines at grocery stores can probably attest to this: Your blood pressure can vary from day to day based on a lot of factors — whether the test was applied correctly, whether you just exercised, whether you’re stressed, and so on.

Yet “a person who in repeated tests of blood measure has high blood pressure is indeed properly described as having high blood pressure,” Greenwald said. “A person who on repeated taking of the race-based IAT shows a strong automatic preference for one race or the other can be concluded as indeed having the automatic associations that the test is designed to measure.”

Lai also noted that measures of explicit bias are similarly flawed for individuals but valuable in the aggregate. For example, in questionnaires that ask people about their explicit prejudices, a lot of people might lie — because they don’t want to look racist — making it hard to gauge whether an individual is explicitly biased. But if you have a group in which, for instance, 40 percent admitted to explicit bias and another group in which 80 percent did, you would still expect the 80 percent group to be more biased in their overall behaviors — even if some of the respondents in both groups were dishonest.

“What we say our prejudices are tells us something about an individual,” Lai said. “It might not tell us much about what they do in everyday life, but it tells us something.”

The research so far comes down somewhere in the middle of the debate. The IAT does seem to predict some of the variance in discriminatory behaviors, but its predictive power is quite small: Depending on the study, the estimated share of variance explained ranges from less than 1 percent to 5.5 percent. With percentages so small, it’s questionable just how useful the IAT really is for predicting biased behavior, even in the aggregate.

Still, the low number, Lai argued, can be deceiving: “In general, behavioral prediction is poor with almost any psychological variable. This is because any individual behavior is influenced by so many things — e.g., our attitudes, our personalities, social norms, how tired we are, how much money we have in our wallet, laws, what our parents and friends told us to do, what our job says we need to do, and so on. As one example, even a tried-and-trued personality trait like conscientiousness was correlated at r = .13 (1.7% of variance explained) with behaviors that are related to conscientiousness (e.g., not being late).”
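The arithmetic behind Lai's figure is simple: the share of variance explained is the correlation squared. The short sketch below (function names are my own) checks his conscientiousness example and runs the conversion the other way for the 1 percent to 5.5 percent range reported for the IAT, which works out to correlations of roughly 0.10 to 0.23.

```python
import math

def variance_explained(r):
    """Share of variance explained by a correlation r (that is, r squared)."""
    return r ** 2

def correlation_from_variance(share):
    """Size of the correlation implied by a share of variance explained."""
    return math.sqrt(share)

# Lai's example: r = .13 for conscientiousness.
print(f"{variance_explained(0.13):.1%}")

# The range of estimates reported for the IAT.
for share in (0.01, 0.055):
    print(f"{share:.1%} of variance -> r of about {correlation_from_variance(share):.2f}")
```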

Given this, it may be that the IAT is still the best tool for measuring subconscious bias. “The IAT, even though it is by many standards a bad [measure], is still the best measure of a bad family of measures,” Lai said.

The IAT’s mishaps don’t mean that racism isn’t real

Regardless of whether the IAT is good at the aggregate level or not, there’s really little doubt of racism’s prevalence in America. The researchers I spoke to on both sides conceded that there is a large body of scientific evidence for racial bias in the US.

Take some of the research that directly measures people’s behaviors: In a 2003 study, researchers sent out almost identical résumés, except some had stereotypically white names and others had stereotypically black names; the résumés with white names were 50 percent more likely to get callbacks for interviews. In a 2015 study, researchers tested participants on the associations they make with “black-sounding names,” like DeShawn and Jamal, and “white-sounding names,” like Connor and Garrett, finding that participants tended to associate the black-sounding names with larger, more violent people.

Clearly, these kinds of studies — and there are many more — show that racism still plays a big role in America: Although we now live in a world where it’s not as acceptable to take part in explicit racism, it seems like people are, quietly but surely, engaging in other kinds of racial prejudice.

The question, then, is whether the IAT accurately measures that racism and whether, in fact, implicit biases are really the big force behind this racism.

As Jaccard, one of the IAT critics, told me, “I personally think structural and individual racism are serious problems and are something we need to address as a society. I worry that an obsession by some with implicit bias, given its overall empirical track record, may potentially divert attention and resources away from us addressing factors that are far more influential and important in shaping discriminatory behavior and that create the unjust ethnic disparities we sorely need to do something about.”

Again, the measures of explicit racism do show steady drops over time. Many researchers have interpreted this to suggest that a lot of people have simply shifted their racial biases from the conscious to the subconscious — hence the need for an IAT in the first place. But it’s equally plausible that a lot of people responding to surveys on explicit racial bias are simply lying about their explicit biases because they don’t want to look racist.

After all, while subconscious bias may explain why an employer rejects résumés with stereotypically black names, it’s also possible that explicit bias is behind it. Maybe an employer holds explicitly racist beliefs about a black employee’s ability, even if he doesn’t voice those feelings to those around him. In this way, the social stigmatization of racism that’s occurred in America since the 1960s may have simply forced racists to be quiet, not pushed their racism from conscious to subconscious levels.

Implicit bias may not be the right target for fighting racism

In fact, some recent research has questioned whether targeting implicit bias as a strategy for combating racism can even work.

A meta-analysis that Lai co-authored, which is still under peer review and undergoing changes, concluded that implicit bias (as measured by the IAT and other similar tests) is correlated with explicit bias and behavior, and implicit bias can be successfully mitigated. But, it found, changes in implicit bias don’t seem to lead to changes in explicit bias or behavior. This suggests that strategies that mitigate implicit bias aren’t going to have real-world outcomes.

“If you try to target just implicit bias,” Lai said, “it’s probably not going to affect the outcomes that you’re really interested in.”

Lai suggested that targeting racial bias in general may not be the correct approach. He pointed to an experiment recently run with the Las Vegas Police Department.

There, researcher Phillip Atiba Goff was tasked with helping the police find a way to reduce their use of force, which disproportionately targets minority residents. Goff found that a lot of these uses of force were often the result of foot pursuits.

With this finding, the police established a foot pursuit policy that said the officer who was giving chase should not be the first person to put his or her hands on the suspect, with coordinated backup instead arriving on the scene and taking on that role. The idea is that foot pursuits often end in excessive use of force; after all, they are high-adrenaline chases in which the officer and the suspect can get really angry really fast. So by keeping chasing officers, when possible, from putting their hands on the suspect, Goff figured you could limit use of force.

The change appeared to work: There was a 23 percent reduction in total use of force and an 11 percent reduction in officer injury over several years, on top of reducing racial disparities, according to Goff.

As Goff previously told me, “I didn’t have to talk about race to reduce a disparity that has racial components to it. I had to change the fundamental situation where police are chronically engaging with suspects. And that’s the kind of example that I’m talking about how you interrupt the biases of life.”

This, Lai argued, is the kind of work that researchers need to consider if strategies that target implicit bias or other kinds of racial biases prove unworkable or ineffective.

Greenwald, co-creator of the IAT, agreed: “Don’t go for cures or remedies that claim to be eliminating implicit bias or eradicating automatic racial preferences or gender stereotypes in people’s heads. There’s no evidence that anything like that works. Those cures are of the snake oil variety. Go for the cures that involve redesigning procedures so that implicit bias, which can be assumed to be present in many people, just does not have a chance to operate.”

In this way, the IAT may not amount to much. It might tell us some important things about individuals who repeatedly take the test and broader populations, but the reality is that confronting systemic racism in America will require tackling a lot more than whatever it is that the test’s results are picking up.