Introducing scalaps: Scala-inspired data structures for Python

A functional, object-oriented approach for working with sequences and collections. Also similar to Spark RDDs and Java Streams. Hope you find they simplify your code by providing a plethora of common algorithms for working with sequences and collections.

An example of using scalps to analyze Reddit posts

I’ve found that working on collections of elements by applying functions through well-defined algorithms (e.g., map , filter , and reduce ) to greatly simplify my code and remove many sources of errors. Therefore I was delighted to discover that Scala really pushes this to the next level by introducing a plethora of built-in algorithms on data structures. These concepts share some similarities to Spark RDDs and Java Streams, but I find the Scala approach simpler and more elegant.

As I return to data analysis and machine learning with Python, I’ve found it helpful to port these concepts to Python in a new library, scalaps. You can find the code at github.com/matthagy/scalaps.

In this article, we’ll walk through a few examples of how scalaps can simplify our code. Let’s start with this basic, contrived example.

ScSeq is a wrapper around any sequence and it provides numerous methods for operating on its input sequence. Many of these methods return another ScSeq instance.

Rather than analyze the contrived example, let’s walk through a more realistic example of analyzing Reddit posts using scalaps. For background, we have a sample of Reddit posts in the following CSV format.

We start by accessing the data and parsing it.

You can see the following ScSeq methods used in this example.

map(func): map a function across the current sequence to create another sequence

map a function across the current sequence to create another sequence to_frozen_list() : create a frozen, realized list of the current sequence as implemented in ScFrozenList .

: create a frozen, realized list of the current sequence as implemented in . take(n) : return a sequence that will have at most the first n elements of the current sequence

: return a sequence that will have at most the first elements of the current sequence for_each(func): call the function on every element of the sequence in order

Here’s the equivalent conventional Python for these operations.

Which is perfectly reasonable Python and we haven’t yet seen the strength of scalaps.

Next, let’s look at counting the number of elements that match a criteria.

This introduces two more ScSeq methods.

filter(func) : select elements that match a criteria

: select elements that match a criteria count(): count the number of elements in the sequence

Note that filter is lazy. It doesn’t evaluate to a realized collection but instead, is a lazily computed sequence. The same is true for other methods such as map . In fact, an entire ScSeq can be a lazy sequence when sourced from an appropriate lazy source. E.g., readings lines from a file.

In contrast, count is a sink. It realizes each element of the sequence through a chain of operations starting from the source. count is a constant memory sink and can, therefore, consume massive lazy sequences. Other sinks include to_fozen_list , which realizes the sequence into an immutable list of type ScFrozenList .

Once a ScSeq has been realized, it cannot run again. Instead, we can reconstruct the sequence to realize it again. It can be useful to have functions that build a sequence from passed-in source(s) so that we can easily reconstruct the sequence as needed. E.g.,

Returning to the Reddit post example, let’s compute the most popular subreddits in this sample of posts. This is accomplished with the following code.

This introduces a few new scalaps concepts. First, we’re passing a string to map . This is interpreted as “select the attribute of that name for each element”. Similarly, integers are interpreted as integer item lookups in a collection.

Next, we use the method value_counts() . This sink computes an ScDict in which each key is an element from the sequence and the value is the number of times the key occurred. ScDict is an augmented Python dictionary that includes functionality such as returning ScSeq s for keys() , values() , and items() . In the example, we use the items() method to generate a sequence of key/value tuples.

The sequence is then sorted into a ScList using sort_by(key) . Note, we’re using the integer 1 as the key so as to select the second element, the count, of each tuple. Hence, the list is now sorted by the number of posts.

reverse() is used to generate a ScSeq that is in the reversed order so that the posts are ordered by descending score. take(n) is used with n=5 to select the first five posts. Lastly, they’re printed through forach(print) .

I find this to be a more elegant description of this algorithm than the comparable Python. Do you agree? If not, in your opinion, what would the comparable Python be and why is it more elegant? Let me know in the comment section below.

Lastly, let me leave you with a more sophisticated use of scalaps that computes the frequency of title words in each subreddit.

I won’t explain the full example. Instead, see if you can reason through what the code is doing based upon the naming of the methods and the names of the functions used with them.

I will point out two interesting methods.

flat_map(func) : takes a function that returns sequence for each element in the original sequence. Each returned sequence is expanded, in order, within a returned ScSeq .

: takes a function that returns sequence for each element in the original sequence. Each returned sequence is expanded, in order, within a returned . group_by(func): construct an ScDict where the keys are computed by func and each value is an ScList with all the elements that have that same key.

In reading through this example, what are you’re thoughts on the legibility of the Python code with scalaps? I personally find this approach easier to reason through relative to conventional Python. Further, I’ve used such a style in Scala, Java, and (Py)Spark so as to structure my code as applying functions to collections using built-in algorithms. I’ve come to find this approach simpler, more legible, and less error prone relative to conventional imperative programming.