Reddit AmItheAsshole is nicer to women than to men — a SQL proof?

When redditors ask “am I the asshole” while talking about women, they have a higher change of being judged as the asshole. Let’s check out these metrics — with BigQuery, dbt, and Data Studio

Update 2020–02–20

Make sure to not take anything I wrote here as the absolute truth. Several people on Twitter noted problems and added corrections to the analysis I offered. Reading this post as originally presented — and the reactions — can be a great way for you to learn as much as I did while reading the responses. You can find many of their unfiltered thoughts by following this Twitter thread.

Context

/r/amItheAsshole has grown to be the 4th most active subreddit — by number of comments. People come to this subreddit to tell their stories, and they ask other redditors “am I the asshole here?”. Turns out most people are judged as “not the asshole”, as seen in this chart:

Most people are judged to be “not the asshole”

My tweet with these results got a lot of attention:

Including the question — is reddit nicer to women or to men?

Deciding gender

By looking at the title or content of a post, you might have a hard time deciding if “I” is a man or a woman — but it’s pretty easy to count the number of “she/he/her/his/girlfriend/boyfriend” present in the story.

Let’s look at some random posts, and the count for each of these pronouns and gendered words:

Some random posts, and the count of each of these pronouns and gendered words.

We can see that the count of gendered pronouns and words in the example matches who the story is about. These stories are about a male customer, a female girlfriend, a male neighbor, a male son, and a female teenage daughter.

With these numbers, we can now set an arbitrary rule: If there’s more than double the number of male pronouns than female, that post is about a man. We can use the opposite rule to say that the post is about a woman. If the numbers are too close or null, we will call the post “neutral”.

Another rule we can set to simplify the analysis:

If the judgment is ‘not the a-hole’ or ‘no a-holes here’ then we can say ‘the poster is not an asshole’.

an asshole’. If the judgment is ‘asshole’ or ‘everyone sucks’ then we can say ‘the poster is an asshole’.

If we aggregate all these posts, we get to the numbers:

Most people are not the asshole. But if you talk about women, there’s a higher probability of being one.

When I first presented these results, I was told “these numbers are too close, they could be a statistical error”.

Statistical significance?

How can we tell the numbers are not merely statistical error? Let’s see the trend month by month — is it stable?

Monthly judgements — when the poster talks about women, the poster always has a higher chance of being the asshole.

Yes! The trend varies month by month, but there’s a clear higher chance of being an asshole when talking about women than when talking about men. If the small difference was just a statistical fluke, we would expect the trend to jump wildly instead.

And please note these results are very specific, as this tweet notes:

To which I replied

How-to

This time I’m using dbt for the first time, and I left all my code on GitHub. Thanks Claire Carroll for your help getting started with this awesome tool!

To extract all the /r/AmItheAsshole posts in BigQuery to a new table, you can do:

CREATE TABLE temp.data AS SELECT *

FROM `fh-bigquery.reddit_posts.20*'

WHERE subreddit = 'AmItheAsshole'

AND _table_suffix > '19_'

Then the gender and judgement for each post can be determined with a query like:

WITH data AS (

SELECT *

, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhe\b')) hes

, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bshe\b')) shes

, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bher\b')) hers

, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhis\b')) hiss

, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bthey\b')) theys

, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bgirlfriend\b')) gfs

, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bboyfriend\b')) bfs

FROM {{ref('aita_posts')}}

WHERE link_flair_text IS NOT NULL

)

, gendered_data AS (

SELECT *

, CASE

WHEN males > 2+females*2 THEN 'to_male'

WHEN females > 2+males*2 THEN 'to_female'

ELSE 'neutral'

END to_gender

FROM (

SELECT *, hes+shes+hers+hiss+theys+gfs+bfs totalgender, hes+hiss+bfs males, shes+hers+gfs females

FROM data

)

)





SELECT CASE link_flair_text

WHEN 'not the a-hole' THEN 'no asshole'

WHEN 'no a-holes here' THEN 'no asshole'

WHEN 'everyone sucks' THEN 'asshole'

WHEN 'asshole' THEN 'asshole'

END judgement

, *

FROM gendered_data

WHERE link_flair_text IS NOT NULL

And finally the statistics presented here:

SELECT *, c/total_gender AS perc

FROM (

SELECT *, SUM(c) OVER(PARTITION BY to_gender, MONTH ) total_gender, SUM(c) OVER(PARTITION BY judgement, MONTH) total_judgement

FROM (

SELECT to_gender, judgement, CONCAT(to_gender, ': ', judgement) to_gender_judgement, month, COUNT(*) c, ARRAY_AGG(STRUCT(title, selftext) ORDER BY RAND() LIMIT 1)[OFFSET(0)] sample_title

FROM {{ref('aita_posts_gendered')}}

WHERE judgement IS NOT null

AND to_gender != 'neutral'

GROUP BY 1,2,3,4

)

)

WHERE c/total_gender > 0.01

AND total_judgement > 10

ORDER BY to_gender, perc DESC

Discussion

You’ll find plenty of insightful and entertaining replies on the twitter thread for this post:

Feel free to join the discussion (and tell me if I’m wrong?). Remember to be nice to each other — most people are not the asshole anyways.

Want more?

I only covered until August 2019 as that’s when the current full reddit archive in BigQuery stops — until future expected updates. Check my previous post for more details on collecting live data from pushshift.io. Thanks Jason Baumgartner for the constant supply!

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.