Discussion

We have shown that backbone networks can be used to automatically map massive interest networks in social media based solely on user behavior. By viewing the big world of reddit as a hierarchical map, users can now explore related interests without providing any prior information about their own interests. Additionally, these maps could provide a dynamic view of interests on the social network, owing to the fact that they are constructed from actual user behavior in the social network. Future applications of this method may also facilitate navigation of other popular social network platforms such as Facebook and Twitter. Furthermore, such an interest map could allow social media users to self-organize into more specific interest forums, thus reducing potential preferential attachment to large, general interest forums and alleviating the issues that arise in overcrowded social network forums (Gilbert, 2013). Given previous work that suggests network properties such as small-worldness and even modularity can result solely from network growth processes (Hintze & Adami, 2010), it would be interesting in future work to observe what processes govern network growth when users have access to an interests map like those shown in Figs. 1 and 2, and what network properties emerge from these growth processes.

Additionally, we explored the network properties of the backbone reddit interest network that we composed from the posting behavior of over 850,000 active reddit users. In this analysis, we found that the reddit interest network has a scale-free, small-world, and modular community structure, corroborating findings in many other online social networks (Ahn et al., 2007; Mislove et al., 2007). Uniquely, reddit potentially enforces a scale-free network structure on its users by automatically subscribing all new users to the same set of subreddits (Anonymous, 2011).
Exploring the effect of automatically subscribing users to a fixed set of interest-specific forums on social interest network structure could be another interesting avenue of future work. To expedite future analyses of the reddit interest network, we have made the raw, anonymized data set available for download online (Olson, 2013b).

Interestingly, our findings corroborate earlier analyses of the Del.icio.us tag network focusing on collaborative tagging systems (Golder & Huberman, 2006; Halpin, Robu & Shepherd, 2007). We show here that reddit’s subreddit degree distribution also follows a power law, as predicted for collaborative tagging systems. Further, whereas Halpin, Robu & Shepherd (2007) suggested that the long tail of infrequently-used tags could likely be ignored, we demonstrated that entire interest meta-communities exist in that long tail that would not otherwise be discovered, including the sports, programming, and LGBT meta-communities. Finally, while Halpin, Robu & Shepherd (2007) were limited to visualizing only subsets of the collaborative tagging network due to excess edges, the mapping method we present here suffers no such limitation because the backbone extraction method removes those excess edges.

It is important to note that the sample of user behavior we have taken is cross-sectional, reflecting users’ reddit posts and thus the relationships among reddit interests at a fixed point in time in mid-2013. However, as users’ interests evolve, so too do the relationships among them (Banerjee et al., 2009). In some cases, highly specialized and related subreddits may fuse into a single subreddit, while in other cases a general subreddit may split into multiple more specialized ones. Thus, such an interest map would require periodic (or, ideally, real-time) updating to accurately reflect dominant interests in the social network and their relationships to one another.
Further, given that the network meta-communities are likely to change over time, it is not feasible to manually annotate the meta-communities as we did in Fig. 1. In future work, it would be beneficial to improve this mapping method by implementing a programmatic annotation algorithm using automatic content analysis of the conversations in the subreddits.

Methods

To acquire the data for this study, we mined user posting behavior data from reddit by first gathering the user names of 876,961 active users that post to 15,122 distinct subreddits (see Fig. 4 for more detail). reddit provides an open source API for anyone to freely mine data from the web site (Anonymous, 2015c), and only requests that published compilations of reddit data be anonymized to protect the privacy of its users. We note that this data set represents a complete sample of all active users who posted one or more times on reddit between January 2013 and August 2013. For each of the users, we gathered their 1,000 most recent link submissions and comments, counted how many times they posted to each subreddit, and registered them as interested in a subreddit only if they posted there at least 10 times. We applied this threshold of at least 10 posts to filter out users that are not active in a particular subreddit. Due to storage space limitations, the resulting list of user-to-subreddit interests is the rawest form of data we were able to store long-term for this study.

Figure 4: Edge distribution in the bipartite (user-to-subreddit) network. Note that the x-axes are log transformed to better display the distribution.

From these data, we defined a bipartite network X, where X_ij = 1 if user i is an active poster in subreddit j and X_ij = 0 otherwise. We then projected this as a weighted unipartite network Y = X′X, where Y_ij is the number of users that post in both subreddits i and j. This resulted in 4,520,054 non-zero, symmetric edges between the subreddits. Details of the raw weighted subreddit network are shown in Fig. 5.

Figure 5: Edge distribution in the raw and pruned reddit user interest network. Note that the x-axes are log transformed to better display the distribution.
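The thresholding and projection steps described above can be illustrated with a minimal sketch. It is not the code used in the study: the user names, subreddit names, post counts, and the function names `active_subreddits` and `project_interests` are hypothetical, and only the Python standard library is assumed. For each user it keeps the subreddits with at least 10 posts, then counts, for every pair of subreddits, how many users are active in both, which is the quantity stored in each off-diagonal entry Y_ij.

```python
from collections import Counter
from itertools import combinations


def active_subreddits(post_counts, min_posts=10):
    """Return the subreddits in which one user posted at least min_posts times."""
    return {sub for sub, n in post_counts.items() if n >= min_posts}


def project_interests(users, min_posts=10):
    """Count, for every pair of subreddits, the users active in both."""
    shared = Counter()
    for post_counts in users.values():
        # Sorting gives each unordered pair one canonical (a, b) key.
        active = sorted(active_subreddits(post_counts, min_posts))
        for a, b in combinations(active, 2):
            shared[(a, b)] += 1
    return shared


# Toy data (hypothetical): posts per subreddit for four anonymized users.
users = {
    "user_1": {"python": 12, "programming": 15},
    "user_2": {"python": 30, "programming": 10, "nba": 2},
    "user_3": {"programming": 11, "nba": 25},
    "user_4": {"python": 9, "nba": 40},
}

edges = project_interests(users)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Storing only pairs with at least one shared user keeps the projection sparse, which matters when the full network has millions of non-zero edges.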
Due to the challenges associated with analyzing large weighted networks, we reduced the number of edges in the weighted subreddit network using a backbone extraction algorithm (Serrano, Boguñá & Vespignani, 2009) that has previously been used to reduce bipartite projections (Ahn et al., 2011). This algorithm begins by replacing symmetric valued edges (S_ij) with asymmetric weighted edges (A_ij and A_ji), where A_ij = S_ij / (i’s degree) and A_ji = S_ij / (j’s degree). It then preserves edges whose weight is statistically incompatible, at a given level of significance α, with a null model in which edge weights are distributed uniformly at random. In our resulting backbone network, two subreddits are linked if the number of users who post in both of them is statistically significantly larger than expected in a null model, from the perspective of both subreddits. To recombine the directed edges between each pair of nodes, we replaced the two directed edges with a single undirected edge whose weight is the average of the two directed edges. Thus, this technique defines a binary network of subreddit pathways along which there is a high probability users might traverse if they navigate reddit by following the posts of other users. Adjusting the α parameter allows the backbone network to include more (e.g., when α is larger) or fewer (e.g., when α is smaller) such pathways. Figure 3 summarizes the topological properties of backbones extracted using a range of α parameter values; in the findings and discussion we focus on a backbone extracted using α = 0.05. Our choice of α = 0.05 is arbitrary, but because the backbone extraction technique we use is rooted in probability theory, it nonetheless offers a precise interpretation: an edge is retained in the backbone if there is less than a 5% chance that an edge with the same weight or greater would appear under a null model in which all edge weights were conditionally (on node degree) random.
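As a rough illustration of the backbone extraction step, the sketch below applies the closed-form significance test of the disparity filter (Serrano, Boguñá & Vespignani, 2009) to a toy edge list. It rests on several assumptions that should be checked against the original algorithm description: "degree" in the normalization is read as weighted degree (the sum of a node's incident edge weights), the p-value for an edge with normalized weight p at a node with k neighbors is taken to be (1 - p)^(k - 1), nodes with a single neighbor are treated as never significant, and an edge is kept only when it is significant from both endpoints, as stated above. The function name `disparity_backbone` and the toy weights are hypothetical.

```python
from collections import defaultdict


def disparity_backbone(edges, alpha=0.05):
    """Keep undirected edges that are significant from both endpoints.

    edges maps (u, v) pairs to symmetric weights S_uv. The returned dict
    maps retained pairs to the average of the two normalized weights.
    """
    strength = defaultdict(float)  # sum of incident weights per node
    degree = defaultdict(int)      # number of neighbors per node
    for (u, v), w in edges.items():
        for node in (u, v):
            strength[node] += w
            degree[node] += 1

    def share_and_p(node, w):
        # Normalized weight of this edge from the node's perspective.
        share = w / strength[node]
        k = degree[node]
        # Assumption: a node with one neighbor gives no evidence (p = 1).
        p_value = (1.0 - share) ** (k - 1) if k > 1 else 1.0
        return share, p_value

    backbone = {}
    for (u, v), w in edges.items():
        share_u, p_u = share_and_p(u, w)
        share_v, p_v = share_and_p(v, w)
        if p_u < alpha and p_v < alpha:
            backbone[(u, v)] = (share_u + share_v) / 2.0
    return backbone


# Toy network (hypothetical): one heavy edge among many light ones.
toy_edges = {
    ("a", "b"): 100, ("a", "c"): 1, ("a", "d"): 1, ("a", "e"): 1,
    ("b", "f"): 1, ("b", "g"): 1, ("b", "h"): 1,
}
backbone = disparity_backbone(toy_edges, alpha=0.05)
print(backbone)
```

In this toy case only the heavy edge between "a" and "b" survives, because it carries a disproportionate share of the weight at both of its endpoints; raising `alpha` would admit more edges, as described above.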
Future work on extracting the reddit backbone may benefit from exploring the use of more computationally complex algorithms designed specifically for bipartite projections, including the fixed-degree (Zweig & Kaufmann, 2011) and stochastic-degree (Neal, 2014) sequence models.

We used Python’s PRAW package (Python reddit API Wrapper: https://github.com/praw-dev/praw) to gather the data and Python’s NetworkX package (Hagberg, Schult & Swart, 2008) to compute all network statistics except for the power law exponent, which was computed using the PARETOFIT module in Stata (Jenkins & Van Kerm, 2015). In the backbone graph, we focus only on the largest connected component. We detected network communities using the Louvain method (Blondel et al., 2008), which aims to partition the nodes into mutually exclusive sets that maximize the graph’s modularity. Other community detection algorithms exist and may yield slightly different partitionings, but all aim to achieve the same goal in principle (Fortunato, 2010). We visualized the backbone network and detected communities using the OpenOrd node layout. Both the community detection and node layout routines are implemented in Gephi (Bastian, Heymann & Jacomy, 2009). Meta-communities identified by the community detection algorithm, e.g., “sports” and “programming,” were annotated manually using domain knowledge.
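Two of the steps above, restricting the analysis to the largest connected component and scoring a partition by its modularity, can be sketched in a few lines. This is an illustration only: it does not implement the community detection algorithm itself, it treats the backbone as an unweighted graph, and the node labels, partition, and function names `largest_component` and `modularity` are hypothetical. The modularity of a partition is computed here as the sum over communities of (internal edges / m) - (total degree / 2m)^2, where m is the number of edges.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = defaultdict(int)      # edges with both ends in a community
    total_degree = defaultdict(int)  # summed node degree per community
    for u, v in edges:
        total_degree[community_of[u]] += 1
        total_degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (total_degree[c] / (2 * m)) ** 2
               for c in total_degree)


# Toy backbone (hypothetical): two triangles joined by one edge, plus a stray pair.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("x", "y")]
component = largest_component(toy_edges)
kept_edges = [(u, v) for u, v in toy_edges if u in component]
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q_split = modularity(kept_edges, partition)
q_single = modularity(kept_edges, {node: 0 for node in component})
print(sorted(component), q_split, q_single)
```

Splitting the two triangles into separate communities scores higher than placing all nodes in one community, which is the kind of improvement a modularity-maximizing algorithm searches for.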