Personality tests are both incredibly popular and largely bogus. BuzzFeed made its name in part by publishing quizzes telling readers which ’90s kid they are, which Friends character they are, which Disney princess they are, and…well…which Disney princess they are, really. None of these have any scientific basis. Then there’s the somewhat more reputable Myers-Briggs test, inspired by Jungian theories about personality types. Some 2.5 million people take it every year, and 88% of Fortune 500 companies use it. Despite its reputation, however, the Myers-Briggs has poor scientific validity.

There is one personality test that is far and away more scientifically valid than any of the others: the “Big Five.”

The Big Five evaluates personality by measuring, as the name suggests, five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism, each on a continuous scale. Studies have shown that it effectively predicts behavior, and the test is often used in academic psychological research on personality. People who score higher in conscientiousness tend to work harder, for example, while more neurotic personalities are more prone to anxiety and depression.

Despite its scientific validity, and even with the contemporary fascination with personality tests, the Big Five is relatively unpopular outside of academia. A recent FiveThirtyEight article on the subject suggested that personality scientists haven’t effectively marketed the one credible personality test.

But there are serious concerns not just with the marketing of the test, but with how it’s presented to a public audience. Despite the scientific rigor around the Big Five in academia, many online versions of the test are designed to give sexist results.

The origins of the Big Five

In the 1970s, two separate research teams found ways of evaluating personality according to the Big Five traits, and created their own tests. Paul Costa and Robert McCrae at the National Institutes of Health created the NEO Personality Inventory, while Lewis Goldberg at the Oregon Research Institute created the IPIP-NEO inventory. (Both have been refined and updated in the years since.) In 1998, Oliver John from Berkeley Personality Lab and Verónica Benet-Martinez, psychology professor at University of California at Davis, created the 44-item “Big Five Inventory” (BFI). These three scales are all scientifically validated and widely used in academic research into personality. The Big Five Inventory and the longer, more nuanced IPIP-NEO are both also freely available online.

On the sites linked above, test-takers are asked to enter their gender before getting the results, and the response significantly affects the interpretation of those results. Depending on whether you say you are “male” or “female,” the exact same answers produce very different personality assessments. Crucially, women are told they’re significantly more disagreeable than men who answer questions identically.

My results taking the test as a woman from the BFI site:

My results taking the test as a man:

I took the online IPIP-NEO assessment twice, without varying my answers—and was rated 55 on agreeableness as a woman, and 66 as a man. I also saw significant differences in results when I took the NEO Five-Factor Inventory (a shorter version of the NEO Personality Inventory, also created by Costa and McCrae), which isn’t freely available online. PAR, a publisher of psychological assessment products, sent me tests for both a man and a woman and, once I’d completed them, sent me the results. As a man, I was told that I’m “compassionate, good-natured, and eager to cooperate and avoid conflict.” As a woman: “Generally warm, trusting, and agreeable, but you can sometimes be stubborn and competitive.”

It’s a scientifically reinforced version of the sexism that pervades society: Women who are straightforward and opinionated are told they’re difficult and argumentative, while men with the exact same character traits are seen as charismatic leaders.

The sexist feedback isn’t due to an inherent flaw in the personality test or to malicious intent. It’s because of how the psychologists behind the inventories present the results. Rather than giving an absolute score in each of the Big Five categories, they tell you your percentile in comparison to others within your gender.

“Your results are based on comparing you to all the other humans who have taken the test,” wrote Maggie Koerth-Baker in the FiveThirtyEight article. That seems to be true for the site on which Koerth-Baker took her test, which uses the BFI and is run by Christopher Soto, a psychology professor at Colby College. (I took the test as a man and a woman on this site, and got the same results each time.)

But the other tests taken by Quartz compared women to other women, and men to other men. As women tend to be more agreeable than men, this means that a slightly disagreeable woman will be relatively more disagreeable when compared only to other women. Women’s agreeable nature is not inherently biological: Social conditions pressure women to behave in this way. And this sexist trope is neatly reinforced in how the results from these personality tests are presented.
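To see why the comparison group matters, here is a minimal sketch of gender-normed percentile scoring. It is not the method any of these sites actually uses: the norm-group means and standard deviations are invented for illustration, the raw score is hypothetical, and the sketch assumes scores in each group are roughly normally distributed.

```python
from statistics import NormalDist

# Hypothetical norm groups. These means and standard deviations are
# invented for illustration and do not come from any real inventory.
# The only property that matters is that the women's average is higher.
NORMS = {
    "female": NormalDist(mu=3.9, sigma=0.6),
    "male": NormalDist(mu=3.6, sigma=0.6),
}


def percentile(raw_score: float, group: str) -> float:
    """Return the percentile of raw_score within the given norm group."""
    return 100 * NORMS[group].cdf(raw_score)


# Identical answers produce an identical raw agreeableness score
# (a hypothetical value on a 1-to-5 scale).
raw = 3.8

as_woman = percentile(raw, "female")
as_man = percentile(raw, "male")

print(f"Scored against women: {as_woman:.0f}th percentile")
print(f"Scored against men:   {as_man:.0f}th percentile")
```

Under these assumed numbers, the same raw score lands below the midpoint when ranked against women and above it when ranked against men, which is the pattern described above. A single, pooled comparison group would give both test-takers the same percentile.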

Crucially, those who take the tests aren’t told that this is how their results are calculated. The Big Five Inventory site acknowledges that “percentile scores are relative to our particular sample of people. Thus, your percentile scores may differ if you were compared to another sample (e.g., elderly British people).” The site does not mention the gender-based comparison groups. The IPIP-NEO site says that if you mark yourself as a woman, your score is calculated as compared to other women, but does not explain how this influences results.

The psychologists I spoke to seemed unconcerned about the implications. Laura Naumann, a psychology professor at Nevada State College, notes that women inherently evaluate their personalities as compared to other women: “The Big Five says, ‘Do you see yourself as someone who is considerate and kind to others?’ In that is an inherent comparison. As a woman, you’re thinking, ‘I think I’m considerate and kind.’ There’s research in social psychology to say women are implicitly comparing themselves to their own gender.” Naumann pointed me to the 1950s work of Leon Festinger, who found that people tend to evaluate their own performance relative to others they consider comparable, and the 2000s work of Monica Biernat, a psychology professor who found that gender stereotypes shape how people judge one another. However, Quartz was not able to find research-based evidence that women evaluate their personalities compared only to other women.

John Johnson, the Penn State University psychologist who created the IPIP-NEO website, says the psychology of personality reflects everyday perceptions. “[P]sychologists borrow the concept of personality traits from ordinary language, which reflects the way ordinary people think about each other’s thoughts, feelings, and behaviors,” he writes in an email. “It is all rather subjective, although people who grow up together in the same culture will tend to agree in their assessments of who is more or less agreeable because of their shared social standards.”

These standards, he says, are affected by age and gender. Just as a six-year-old who displays lackadaisical behavior might still be perceived as “conscientious” while a middle-aged person with the same behaviors will not, men and women are judged differently: “Because women are, on average, more agreeable than men, people often have a different standard for assessing whether a woman is ‘average’ in agreeableness or ‘highly agreeable,’” Johnson says.

There’s another flaw when it comes to taking Big Five personality tests online. The online versions of the tests inform people of negative character traits without explaining that whether a characteristic is positive or negative depends on context.

Costa believes that the Big Five’s willingness to point out negative traits makes the test more accurate: Myers-Briggs avoids “anything that could be negative. And that’s a great big marketing thing,” he says. But each potentially negative Big Five character trait is informed by the situation. “They’re only negative in certain contexts,” he says.

For example, Costa explains, agreeable people are great for a blind date, but tend to be overly dependent. Disagreeable people, meanwhile, aren’t good at smoothing over arguments. But they’re also less likely to obediently follow immoral orders—such as those demonstrated by the Milgram experiment, wherein participants are asked to administer increasingly intense electric shocks to a victim. (The longer IPIP-NEO test briefly acknowledged the importance of context, noting, “agreeableness is not useful in situations that require tough or absolute objective decisions,” but the Big Five Inventory website offered no such explanations.)

Personality is messier than star signs or Myers-Briggs tests or any clear-cut personality diagnosis likes to pretend. Contrary to the popular idea that we have some inherent true self, our personality is best scientifically evaluated simply according to how we—and those around us—see ourselves.

In an academic setting, psychologists often also ask family, coworkers, and friends to take a Big Five questionnaire to evaluate someone else’s personality. There’s no right answer: colleagues may well view someone as highly conscientious, while neighbors still waiting for a pot to be returned will consider them less so. Both of these views are meaningful, and collectively they contribute to a picture of someone’s personality.

It’s inherently sexist, though, to view straightforward women as hostile or rude while approving of men who behave the same way. In an academic setting, the Big Five personality tests acknowledge the nuances of personality, often considering multiple personality inventories of the same person, taken by both themselves and those around them. If only the online tests were used to reflect such subtleties.