Characteristics of Top Reddit Submissions

By: Jeff Clark Date: Mon, 18 Sep 2006

Most of you have probably seen the website Reddit. It allows people to submit links, which are then voted up or down by others. Based on these votes, Reddit determines the most popular or most controversial links over different periods of time and lets people browse this information in different ways on the website. There are other features of the site as well, but I want to focus on link popularity. Are there some common characteristics of the top-ranked links on Reddit?

I started this analysis by getting two sets of links from the site: the top 100 links of all time and, to use as a control, the links numbered 500-599 from the 'new - today' list. Both lists were taken as of September 13th, 21:45 EST. By choosing fairly low-ranked links from the 'new - today' list, the second set should contain items that are average to low in popularity relative to all the links submitted. I was hoping that by contrasting these two sets I might discover something interesting about top-rated links.

1. Images and Videos

Most of my analysis has been based on the text of the linked items. Before I could proceed to do this I had to exclude all the linked items that are primarily non-textual in nature - in other words, videos and images. This little graph shows how many links were 'not found' or 'videos and images' out of the 100. It clearly shows that there are many more photos and videos in the top rated links than in the control set.

The quantities measured in the next 3 analyses don't depend on the absolute number of documents in the sets so all 65 of the top rated list and all 84 of the control list were used.

2. Number of Words / Link

After excluding the links mentioned above I gathered the text for each remaining link. An automated tool was used to convert the HTML to text and then I manually removed any text related to website navigation, feedback or comments. My goal was to analyze the primary content of the linked web page.
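The automated HTML-to-text step could look something like the Python sketch below. This is not the tool I used; it is a minimal illustration built on the standard library's html.parser, and the names TextExtractor and html_to_text are made up for this example. It only drops script and style content, so navigation and comment text would still need to be removed by hand as described above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces, then collapse runs of whitespace.
        # Caveat: a word split by inline tags would come out as two words.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```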

I wrote a simple tool to count the number of words for each link in both sets. The average number of words/link was 882 for the control set and 3151 for the top rated links. The top rated links have many more words per item than the control set.
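A word counter of this kind can be sketched in a few lines of Python. This is an illustration, not the original tool: the function names are hypothetical, and the definition of a "word" (a run of letters, digits, or apostrophes) is an assumption that would shift the totals slightly if changed.

```python
import re


def count_words(text):
    """Count word tokens: runs of letters, digits, or apostrophes (assumed definition)."""
    return len(re.findall(r"[A-Za-z0-9']+", text))


def average_words_per_link(documents):
    """Mean word count across a list of document texts; 0.0 for an empty list."""
    if not documents:
        return 0.0
    return sum(count_words(doc) for doc in documents) / len(documents)
```

Calling average_words_per_link on each set's list of texts gives one average per set to compare.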

3. Average Word Size

I also calculated the average word length in the two sets of documents. The top rated links had an average word size of 3.77 characters/word and the control set average was slightly longer at 4.02 characters/word. There does not appear to be a significant difference in the average size of words between the two sets.
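For illustration, the average word length could be computed as below. This sketch pools every word in a set and divides total characters by total words; whether the original figures were pooled this way or averaged per document is an assumption on my part, and the function name is hypothetical.

```python
import re


def average_word_length(documents):
    """Characters per word, pooled over all documents in a set (assumed method)."""
    words = [w for doc in documents for w in re.findall(r"[A-Za-z0-9]+", doc)]
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)
```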

4. Relative Word Frequency

Which words appeared much more often in the top rated links than they did in the control set? What about the reverse? The table below shows the top 30 words for each set that are relatively more frequent. Only words that appeared in both sets are shown. So, for example, the word 'programming', which appears at #10 in the 'Top Rated' column, was present much more frequently in the 'Top Rated' stories than in the control set. This also takes into account the fact that there were more words in total in the 'Top Rated' set - it's a relative measure.

Number  Top Rated      Control Set
1       org*           nick
2       permalink*     prayer
3       html*          patients
4       wake           stages
5       sleep          hypnosis
6       aug            rep
7       alarm          networks*
8       www*           hezbollah
9       numbers*       5000
10      programming*   asia
11      est            marketplace
12      bed            trend*
13      wikipedia*     virginia
14      http*          previously
15      tired          webmasters*
16      url*           haired
17      asked          tokens
18      voting         diabetes
19      reply          patient
20      learning       creators
21      flowers        peru
22      reddit*        sean
23      patterns*      lebanon
24      buried         beta*
25      hole           damages
26      morning        empire
27      plant          fascist
28      loans          genes
29      stupid         perceived
30      confidential   turkey

The words marked with an asterisk (*) are what I call 'nerd words'. It's a pretty subjective measure but is interesting nevertheless. Note that there are 11 in the 'top rated' set but only 4 in the control set. It appears that top rated links have more terminology related to computing and technical subjects than the links in the control set.
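One way to compute a relative measure like this is sketched below in Python. It is an assumption about the method, not the exact calculation behind the table: each word's count is divided by the total number of words in its set, and words are ranked by the ratio of those two shares. Only words present in both sets are kept, as in the table. The function names are hypothetical.

```python
import re
from collections import Counter


def word_counts(documents):
    """Lower-cased word counts pooled over a list of document texts."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return counts


def relative_frequency_ranking(set_a, set_b, top_n=30):
    """Rank words shared by both sets by (share in set_a) / (share in set_b)."""
    counts_a = word_counts(set_a)
    counts_b = word_counts(set_b)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    shared = counts_a.keys() & counts_b.keys()  # words appearing in both sets
    ratios = {
        word: (counts_a[word] / total_a) / (counts_b[word] / total_b)
        for word in shared
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Swapping the two arguments gives the ranking for the other column. Dividing by each set's total is what keeps the larger set from dominating simply because it has more words.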

5. Topic Breakdown

This analysis depends on there being an equivalent number of links in each set so only the top 64 were used. I have taken the text for all the items in both sets and run them through my simple text categorizer. The results are shown below in a Multi-Level Pie Chart. See this previous entry for a description of this type of graphic.

The greener areas have a higher proportion of items from the 'top rated' set. This chart shows that the Technology topic, especially the engineering and software subtopics, has a higher proportion of articles from the top rated set. Other 3rd level topics with high representation from 'top rated' links are: Employment, Law, Interpersonal Relationships, Astronomy, Mathematics, Physics, and Music. Those with low (dark red) proportions are: Investment, Finance, Services, Food, House, Medicine, History, Psychology, Animation, Television, and Computer Games. This last one seems out of place since so many of the other associated topics are common in the top rated set. It may be that with a larger sample this anomaly would disappear. Or it may be that the text categorizer isn't working well in this domain - it is fairly simplistic after all.
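The proportion behind the colouring can be illustrated with a short Python sketch. It assumes each link has already been assigned a single topic label by some categorizer (mine is not shown here), and the function name and label lists are hypothetical. Because both sets have the same number of links, a share above 0.5 means a topic leans towards the 'top rated' set.

```python
from collections import Counter


def top_rated_share_by_topic(top_rated_topics, control_topics):
    """For each topic, the fraction of its items that come from the top rated set."""
    top = Counter(top_rated_topics)
    control = Counter(control_topics)
    shares = {}
    for topic in top.keys() | control.keys():
        total = top[topic] + control[topic]  # always >= 1 for a listed topic
        shares[topic] = top[topic] / total
    return shares
```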

Conclusion

It appears that the top rated items on Reddit are quantifiably different in several ways from those typically submitted to the site. The apparent popularity of technology-related topics in the Reddit community suggests that the site is still used primarily by early technology adopters.