Note how we’re using foldLeft, a variant of fold, to process each post and increment the count for the corresponding subreddit in the map. You may recall that fold is used to aggregate all values in a list down to a single value. In this case, the final value is a Map[String, Int] that associates each subreddit with its count of posts.

A quick refresher on fold functions: we call foldLeft with an initial value for the aggregate and a folding function. foldLeft calls the folding function once for every element in the list, passing it the current value of the aggregate along with the element. The folding function returns an updated aggregate that incorporates that element. After every element has been processed, foldLeft returns the final value of the aggregate.
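As a minimal sketch of the counting fold described above: the Post case class, its field names, and the three sample posts below are assumptions made up for illustration, not the actual class or data used in this post.

```scala
// Hypothetical stand-in for the Post class; the field names are assumptions.
final case class Post(subreddit: String, author: String, title: String, score: Int)

object CountBySubreddit {
  // A tiny made-up sample, not the real Reddit data.
  val posts: List[Post] = List(
    Post("scala", "alice", "Why I like foldLeft", 10),
    Post("AskReddit", "bob", "What is your favorite fold?", 3),
    Post("scala", "carol", "Tuples vs. case classes", 7)
  )

  // Start from an empty map. For each post, return a new map in which
  // that post's subreddit count is one higher than before.
  val subredditCount: Map[String, Int] =
    posts.foldLeft(Map.empty[String, Int]) { (counts, post) =>
      counts.updated(post.subreddit, counts.getOrElse(post.subreddit, 0) + 1)
    }

  def main(args: Array[String]): Unit =
    subredditCount.foreach(println)
}
```

The initial value is the empty map, and the folding function never mutates it; each call returns an updated copy, which becomes the aggregate passed to the next call.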

Spend some time reviewing this code to see if you can reason through how we’re computing the number of posts for each subreddit. As an exercise, can you modify this example code to instead count the number of posts for each user?

While the current results are nice, we’re more interested in knowing the results for the top subreddits; i.e., those with the most posts in this sample of Reddit posts. To that end, we’ll need a way to order the subreddits by the number of posts so that we can select the top few subreddits to show. In computer science terminology, such a process is referred to as sorting.

You can add the following code to the preceding example to sort the subreddits by post count and then show the top 10 in this sample of Reddit posts.

There are a few new things going on here. First, we’re converting subredditCount from Map[String, Int] to List[(String, Int)] using the Map.toList method. This introduces a new concept, tuples: (String, Int) is the type of a length-two tuple whose first element is a string and whose second element is an integer.

Tuples are a general data type in Scala that can be used to represent fixed-length collections of elements in which each position has a fixed type. For example, (String, String, String, Int) is a length-four tuple. We could’ve used this tuple type instead of the class Post to represent the data in a single Reddit post.

In general, classes are a more legible way to group together related elements. Tuples can be useful in some cases, particularly when we want to write generic algorithms that use placeholder types. That is the case here: we want a general method to convert a Map[K, V] into a list of associated pairs, List[(K, V)].
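To make tuples a little more concrete, here is a small sketch. The object name TupleDemo is made up for illustration; the values are borrowed from the results in this post.

```scala
object TupleDemo {
  // A length-two tuple of type (String, Int).
  val pair: (String, Int) = ("AskReddit", 254)

  // Positional access: _1 is the first element, _2 is the second.
  val name: String = pair._1
  val count: Int = pair._2

  // Map.toList turns a Map[K, V] into a List[(K, V)] of key/value tuples.
  val asList: List[(String, Int)] = Map("videos" -> 64).toList

  def main(args: Array[String]): Unit = {
    println(name)
    println(count)
    println(asList)
  }
}
```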

Next, we’re using the method sortBy to sort our list of tuples. The method takes a function that computes a ranking score for each element of the list, and it returns a new list with the elements ordered by that score. You can see that our ranking function just fetches the count for each subreddit by accessing the second element of the tuple, t._2.
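The conversion and sort described above can be sketched roughly as follows. The counts here are made up, and because sortBy orders elements from smallest to largest score, this sketch negates the count so that the biggest subreddits come first. Negating (rather than, say, reversing the sorted list) is an assumption of this sketch, not necessarily what the original code does.

```scala
object TopSubreddits {
  // Made-up counts standing in for the real subredditCount map.
  val subredditCount: Map[String, Int] = Map("a" -> 3, "b" -> 10, "c" -> 7)

  val top: List[(String, Int)] =
    subredditCount.toList   // Map[String, Int] => List[(String, Int)]
      .sortBy(t => -t._2)   // negate the count so larger counts sort first
      .take(2)              // keep only the top entries (10 in the post)

  def main(args: Array[String]): Unit =
    top.foreach(println)
}
```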

The results for our sample of Reddit posts are as follows.

(AskReddit,254)

(AutoNewspaper,214)

(The_Donald,84)

(CryptoCurrency,71)

(SteamTradingCards,69)

(RocketLeagueExchange,65)

(newsbotbot,65)

(videos,64)

(GlobalOffensiveTrade,59)

(PewdiepieSubmissions,58)

In thinking about the numbers, we should remember that this is a small random sample of all Reddit posts, so the counts are going to be much smaller than the full number of posts. Our sample is a 0.1% sample of all Reddit posts in October 2018, so we could multiply these numbers by 1000 to estimate the total number of posts for each subreddit for this month.
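As a quick sketch of that scaling, using the top two counts from the results above (the object and value names are made up for illustration):

```scala
object ScaleEstimate {
  // A 0.1% sample means each sampled post stands in for roughly 1000 posts.
  val scale: Int = 1000

  val sampleCounts: List[(String, Int)] =
    List("AskReddit" -> 254, "AutoNewspaper" -> 214)

  // Rough monthly estimates; these inherit the sampling noise of the counts.
  val estimates: List[(String, Int)] =
    sampleCounts.map { case (name, n) => (name, n * scale) }

  def main(args: Array[String]): Unit =
    estimates.foreach(println)
}
```

These are only ballpark figures: the smaller a subreddit's count in the sample, the noisier its scaled estimate will be.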

With my passing familiarity with Reddit, I’d say these results seem consistent with my intuition about popular subreddits like “AskReddit”. What do you think?

Can you modify your earlier exercise code that computes the number of posts per author so that the results are sorted? Who are the top authors in this sample of Reddit posts?

Next, let’s see if our sample includes any posts with Scala in the title. You can add the following snippet to the previous ScalaFiddle widget to answer this question.
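A minimal sketch of such a search might look like the following. The Post case class, its field names, and the sample posts are assumptions made up for illustration. The check is a plain, case-sensitive substring match on the title.

```scala
// Hypothetical stand-in for the Post class; the field names are assumptions.
final case class Post(subreddit: String, author: String, title: String, score: Int)

object ScalaTitles {
  // Made-up posts, not the real Reddit sample.
  val posts: List[Post] = List(
    Post("scala", "alice", "Learning Scala with Reddit data", 5),
    Post("opera", "bob", "A night at La Scala", 2),
    Post("videos", "carol", "Cat plays piano", 9)
  )

  // Keep only the posts whose title contains the substring "Scala".
  val matches: List[Post] = posts.filter(_.title.contains("Scala"))

  def main(args: Array[String]): Unit =
    matches.foreach(p => println(s"${p.subreddit}: ${p.title}"))
}
```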

What do you think about these posts? Is every one of them about Scala or is there a deficiency in using this heuristic to programmatically identify relevant posts? We’ll consider more sophisticated ways to analyze posts soon.

What other words are interesting to you? Modify the code as you’d like to look for other posts that have certain keywords. In many ways, we’re building a simple, custom search engine to find posts relevant to our interests across this small sample.

Note, I myself have discovered a non-trivial amount of obscene language. As an exercise, you could write some Scala code to find posts that contain swear words. I’m not including example code for this because I don’t want to have a list of curse words on my blog. :)

In general, we’d be interested in computing the frequency of different words in post titles across each subreddit. Here’s some moderately sophisticated code that accomplishes such an analysis.