Most of us wish we could be anonymous on the internet, and few people wish it more than Redditors. Reddit is one of the few social networks where users hope they DON’T encounter their friends. A lot of people guard their Reddit usernames like the PIN to their debit card, praying that no one ever figures them out.

But even on a site where most usernames are concoctions like HotButteryCopPorn and POOPFEAST420, you can tell a surprising amount about a user just from the username.

I scraped Reddit to get a list of 138k unique usernames. Of those usernames, 56k (~41%) contain a surname, and 30k (~22%) contain a first name. 6k (~4.5%) contain a date of birth. 307 even have a zipcode…
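To make the name matching concrete, here is a minimal sketch of one way to look for known names inside a username. It is not the code I actually ran: the tiny `FIRST_NAMES` and `SURNAMES` sets and the helper names are made up for illustration, and a real run would need large name lists.

```python
import re

# Hypothetical, tiny name lists for illustration only.
FIRST_NAMES = {"matt", "zoey"}
SURNAMES = {"miller", "chang"}

def split_username(username):
    """Split a username into lowercase runs of letters and runs of digits."""
    return re.findall(r"[a-z]+|\d+", username.lower())

def find_names(username, names):
    """Return every known name that appears inside a letter run of the username."""
    hits = set()
    for part in split_username(username):
        if part.isdigit():
            continue  # digit runs cannot contain a name
        hits.update(name for name in names if name in part)
    return sorted(hits)

print(find_names("dudematt0412", FIRST_NAMES))
print(find_names("millerj2740", SURNAMES))
```

Plain substring matching like this will overcount (a "matt" also hides in "mattress"), so treat the hits as candidates rather than confirmed names.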

Combining these bits of data is where things get pretty interesting, though. The 30k accounts with first names can be mapped to gender pretty accurately. If your Reddit username is dudematt0412, there’s a good chance your name is Matt. If your name is Matt, there’s a good chance you’re a dude.
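The name-to-gender step can be sketched as a simple lookup. The `NAME_GENDER` table below is a hypothetical stand-in with two entries; a real table would have to come from something like baby-name statistics, and it would only ever give you a likely gender, not a certain one.

```python
# Hypothetical lookup table for illustration only.
NAME_GENDER = {"matt": "male", "zoey": "female"}

def guess_gender(username):
    """Guess a gender from the longest known first name found in the username."""
    lowered = username.lower()
    matches = [name for name in NAME_GENDER if name in lowered]
    if not matches:
        return None  # no known first name, so no guess
    # Prefer the longest match so a short name inside a longer one doesn't win.
    return NAME_GENDER[max(matches, key=len)]

print(guess_gender("dudematt0412"))
print(guess_gender("PrincessZoey89"))
```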

And once you have names mapped to genders, you can start to identify other words like “balls”, “dick”, “destroyer”, and “sexy” that correlate strongly with male users, and words like “cat”, “kitty”, and “princess” that correlate strongly with female users. Underscores, and especially hyphens, are also more prevalent in male usernames than in female ones, while female usernames are more often CamelCased, like PrincessZoey89.
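Here is a rough sketch of how those word and formatting signals could be measured once some usernames carry a gender label. The function names, the CamelCase regex, and the three labeled usernames are all assumptions made up for this example, not my real data or code.

```python
import re
from collections import Counter

def style_features(username):
    """Flag the formatting quirks discussed above."""
    return {
        "underscore": "_" in username,
        "hyphen": "-" in username,
        # Assumption: CamelCase means two capitalized words back to back.
        "camel_case": re.search(r"[A-Z][a-z]+[A-Z][a-z]+", username) is not None,
    }

def male_share_by_word(labeled, words):
    """For each word, the share of usernames containing it that are labeled male."""
    male, total = Counter(), Counter()
    for username, gender in labeled:
        lowered = username.lower()
        for word in words:
            if word in lowered:
                total[word] += 1
                if gender == "male":
                    male[word] += 1
    return {word: male[word] / total[word] for word in words if total[word]}

# Toy labeled sample, invented for illustration.
sample = [("dudematt0412", "male"), ("PrincessZoey89", "female"), ("princess_matt", "male")]
print(style_features("PrincessZoey89"))
print(male_share_by_word(sample, ["princess", "dude"]))
```

A share near 1.0 or 0.0 only means something when the word shows up in plenty of usernames, so any real version should also keep an eye on the counts.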

With surnames, you can start to guess at where users are from. A user named millerj2740 is likely to be of European descent. A user named qlchang57 is likely to be of Asian descent. And just as you can find words and patterns associated with a particular gender, you can find ones associated with particular ethnicities.

I’m interested to see how accurate this can get. My data was far from perfect. Firstly, a sample of 138k usernames is too small for a website with 300M+ users. Secondly, the sample wasn’t random. I crawled the most popular threads on the front page. The users had to post something to get crawled, and it had to be a popular comment. This could skew the sample toward younger or older users, toward males or females, toward English speakers, and so on.

In my data set (according to my algorithm), ~78% of users were male. A Reddit demographic study found that number to be ~64%. According to the same study, the average age on Reddit was 23. Among the usernames where I found a date of birth (excluding ones containing 69), the average age came out to 28.
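For the age estimate, a minimal sketch might read a trailing two- or four-digit number as a birth year. Everything here is an assumption for illustration: the function names, the 13-to-80 plausible age window, and the `reference_year` you pass in (the example uses 2020 arbitrarily). The only rule taken from the text above is that 69 gets thrown out.

```python
import re

def guess_birth_year(username, reference_year):
    """Guess a birth year from trailing digits, or return None."""
    match = re.search(r"(\d+)$", username)
    if not match:
        return None
    digits = match.group(1)
    if digits == "69":
        return None  # excluded: rarely a birth year
    if len(digits) == 4:
        year = int(digits)
    elif len(digits) == 2:
        # Assumption: pick the century that puts the year in the past.
        yy = int(digits)
        year = 2000 + yy if 2000 + yy <= reference_year else 1900 + yy
    else:
        return None
    # Assumption: only accept ages between 13 and 80.
    if reference_year - 80 <= year <= reference_year - 13:
        return year
    return None

def mean_age(usernames, reference_year):
    """Average age over the usernames that yield a plausible birth year."""
    years = [guess_birth_year(u, reference_year) for u in usernames]
    ages = [reference_year - y for y in years if y is not None]
    return sum(ages) / len(ages) if ages else None

print(guess_birth_year("PrincessZoey89", 2020))
print(mean_age(["PrincessZoey89", "qlchang57", "dudematt0412"], 2020))
```

Note that dudematt0412 is skipped: 0412 looks more like a month and day than a year, and this sketch makes no attempt to handle that case.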

Some random things I found to be pretty funny: