The wrapper comes with a method that returns an array of submissions, each with its own comments. From there I could loop through each submission, extract the individual words from each comment, and analyze them. Easy, right? Or so I thought.

As it turns out, the comments are actually a giant array of individual characters! So now I needed to find a way to convert a giant character array into individual words.

Extracting individual words

The solution I came up with was simple: every time I came across a space character, it meant the current word had ended. So I would concatenate every character until I reached a space, then store the resulting string in an array.
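Here is a minimal sketch of that idea. The function name split_into_words is my own for illustration, and unlike a first naive attempt it already skips runs of consecutive spaces so they don't produce empty strings.

```python
def split_into_words(chars):
    """Group a sequence of characters into words, breaking on whitespace."""
    words = []
    current = ""
    for ch in chars:
        if ch.isspace():
            # Whitespace ends the current word; ignore repeated spaces.
            if current:
                words.append(current)
                current = ""
        else:
            current += ch
    # Flush the final word, since no trailing space triggers it.
    if current:
        words.append(current)
    return words


print(split_into_words("Hello, world! It's  great."))
# ['Hello,', 'world!', "It's", 'great.']
```

Worth noting: if the comment is available as a plain string, Python's built-in str.split() with no arguments does the same whitespace splitting in one call.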

The resulting array looked like this:

Initial attempt to extract words

Making progress, but as you can see there are non-letter characters that could possibly skew the data later on, so I needed a way to get rid of them. Luckily, Python has a neat built-in string method (isalnum) to test for alphanumeric characters, so I could just check each character and strip out the ones that fail. I also made sure empty words didn’t go through.
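A rough sketch of that cleanup step, assuming the words arrive as a list of strings. The helper names clean_word and clean_words are hypothetical, and lowercasing is an extra assumption of mine (so "The" and "the" count as the same word), not something described above.

```python
def clean_word(word):
    """Keep only the alphanumeric characters of a word."""
    stripped = "".join(ch for ch in word if ch.isalnum())
    # Assumption: lowercase so capitalization doesn't split the counts.
    return stripped.lower()


def clean_words(words):
    """Clean every word and drop the ones that end up empty."""
    cleaned = (clean_word(w) for w in words)
    return [w for w in cleaned if w]


print(clean_words(["Hello,", "world!", "--", "It's"]))
# ['hello', 'world', 'its']
```

A token made only of punctuation, like "--", becomes an empty string after stripping, which is why the empty-word check matters.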

Alphanumeric Check

Awesome. The last major thing I needed to do was map each word to its count, so I used a dictionary to keep track of the number of occurrences. Then I sorted it to get the top occurrences.
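Something along these lines would do the counting and sorting. The function names are mine, and this is a sketch rather than the exact code from the project.

```python
def count_words(words):
    """Map each word to the number of times it appears."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def sort_by_count(counts):
    """Return (word, count) pairs, most frequent first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


words = ["cat", "dog", "cat", "bird", "cat", "dog"]
print(sort_by_count(count_words(words)))
# [('cat', 3), ('dog', 2), ('bird', 1)]
```

The standard library's collections.Counter covers the same ground: Counter(words).most_common(10) returns the ten most frequent words with their counts.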

Mapping words to occurrences

When I first tested it, I found that, naturally, the top results were common words such as “the”, “this”, “that”, etc., so I wanted to ignore those and find words more unique to each subreddit. The not-so-elegant-yet-efficient solution was to create a set of common words and, before adding a word to the list, check whether it is in the set and skip it if so. I used a set instead of a regular list for faster lookups: O(1) on average versus O(n).
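A sketch of how the filter might slot into the counting loop. The contents of COMMON_WORDS here are only a tiny illustrative sample, not the actual set, and count_uncommon_words is a name I made up.

```python
# Illustrative sample only; a real stop-word set would be much longer.
COMMON_WORDS = {"the", "this", "that", "a", "an", "and", "of", "to", "is", "it"}


def count_uncommon_words(words, common_words=COMMON_WORDS):
    """Count occurrences of every word that is not in common_words."""
    counts = {}
    for word in words:
        # Set membership is a hash lookup, O(1) on average.
        if word in common_words:
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts


print(count_uncommon_words(["the", "cat", "and", "the", "cat", "hat"]))
# {'cat': 2, 'hat': 1}
```

With a list, every membership check would scan the elements one by one, and that adds up quickly when it runs once per word across thousands of comments.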

Yikes, I’m still adding new words to this set

The modification to check for the common words

Now I have an ordered mapping of the words unique to a subreddit, so I just have to display it. Python has a neat library called Matplotlib that can represent data beautifully. I used its pie chart to display the data and picked the top 10 words.
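A minimal sketch of the plotting step, assuming the counts live in a dictionary of word to occurrences. The function name plot_top_words, the title format, and the percentage labels are my own choices for illustration.

```python
import matplotlib.pyplot as plt


def plot_top_words(counts, subreddit, n=10):
    """Draw a pie chart of the n most frequent words and return the figure."""
    top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]
    labels = [word for word, _ in top]
    sizes = [count for _, count in top]

    fig, ax = plt.subplots()
    # autopct prints each slice's share of the plotted total.
    ax.pie(sizes, labels=labels, autopct="%1.1f%%")
    ax.set_title(f"Top {len(top)} words in r/{subreddit}")
    return fig
```

To see the chart, call plot_top_words(counts, "python") and then plt.show(); the subreddit name there is just an example. Keep in mind the percentages are relative to the top n words shown, not to every word in the sample.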

Using matplotlib

The results were glorious. I tested it out on a few subreddits, each with a sample size of 10,000 comment posts. Here’s what the subreddits had to say:

Warning, lots of foul language ahead! (it is Reddit after all)