Introducing Pandas-Sets: Set-oriented Operations in Pandas

I frequently find myself storing standard Python set objects in DataFrame columns. This usually happens when I have a tags or labels column for each observation. It can also be the output of a groupby operation where the end result needs to be a list-like (or set-like) object before it's aggregated. Set operations (union, intersection, etc.) come in handy in such cases.

To tackle those scenarios, however, I end up writing code like df.tags.map(lambda x: set(x) | {elem}), which, apart from being ugly, also doesn't allow for pandas-like immutable-based compositions (aka one-liners).
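For reference, here is a minimal sketch of that plain-pandas approach. The DataFrame contents and the value of elem are made up for illustration. Note that set.add mutates in place and returns None, so the sketch builds new sets with the | operator instead.

```python
import pandas as pd

# Illustrative data; column names and values are assumptions.
df = pd.DataFrame({
    'post': [1, 2, 3],
    'tags': [{'python', 'pandas'}, {'philosophy'}, {'pandas'}],
})
elem = 'data'

# Add an element to every set, producing new sets instead of mutating.
with_elem = df.tags.map(lambda s: s | {elem})

# Build a boolean mask by testing membership row by row, then filter.
has_pandas = df.tags.map(lambda s: 'pandas' in s)
pandas_posts = df[has_pandas]
```

Every operation needs its own lambda, which is the boilerplate a dedicated accessor would remove.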

Ideally, I would like to be able to treat the tags column as a set-like one, so I could write code like df.tags.set.add(elem), filter like df[df.tags.set.contains(elem)], or combine like df.tags.set.union({'t1', 't2', 't3'}).

To achieve this, I wrote pandas-sets, a Pandas extension that adds set-like properties to existing Series objects, provided that they already store set objects.

You can check out the code on GitHub.

The pandas_sets package adds a .set accessor to any pandas Series object; it's like .dt for datetime or .str for string, but for set.

It exposes all public methods available in the standard set.
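To give a feel for how such an accessor can be wired up, here is a rough sketch using pandas' public register_series_accessor hook. This is not the actual pandas-sets implementation: the accessor name toyset, the class ToySetAccessor, and the three methods shown are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical accessor name 'toyset', picked so it does not clash
# with the '.set' accessor that pandas_sets registers.
@pd.api.extensions.register_series_accessor('toyset')
class ToySetAccessor:
    def __init__(self, series):
        # pandas passes in the Series the accessor is attached to.
        self._series = series

    def contains(self, elem):
        # Element-wise membership test; returns a boolean Series.
        return self._series.map(lambda s: elem in s).astype(bool)

    def union(self, other):
        # Element-wise union; returns a Series of new sets.
        return self._series.map(lambda s: s | set(other))

    def len(self):
        # Element-wise size of each set.
        return self._series.map(len)

tags = pd.Series([{'python', 'pandas'}, {'strategy'}, {'pandas'}])
mask = tags.toyset.contains('pandas')
merged = tags.toyset.union({'data'})
```

Each method simply maps a set operation over the underlying Series, so the results compose like any other pandas expression, e.g. tags[tags.toyset.contains('pandas')].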

Using it is pretty simple. First install with pip.

pip install pandas-sets

Then, just import the pandas_sets package and it will register a .set accessor on every Series object.

import pandas_sets
import pandas as pd

df = pd.DataFrame({
    'post': [1, 2, 3, 4],
    'tags': [{'python', 'pandas'}, {'philosophy', 'strategy'}, {'scikit-learn'}, {'pandas'}]
})

# Keep only the posts tagged 'pandas'
pandas_posts = df[df.tags.set.contains('pandas')]

# Call set methods element-wise through the accessor
pandas_posts.tags.set.add('data')
pandas_posts.tags.set.update({'data', 'analysis'})

The implementation is very primitive for now and draws heavily from pandas' core StringMethods implementation.

Next steps include: further testing with edge-case scenarios, adding detailed docstrings and more fine-grained NA handling.
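As an illustration of what NA handling could look like, here is one possible approach built on Series.map with na_action='ignore', which skips missing entries instead of passing them to the function. This is only a sketch of the idea, not what pandas-sets currently does; the data and the helper name union_ignoring_na are made up.

```python
import pandas as pd

def union_ignoring_na(series, other):
    # Hypothetical helper: union each set with `other`,
    # leaving missing entries untouched.
    return series.map(lambda s: s | set(other), na_action='ignore')

# A tags column where one observation has no tags recorded.
tags = pd.Series([{'pandas'}, None, {'python'}])
result = union_ignoring_na(tags, {'data'})
```

Without na_action='ignore', the lambda would be called on None and raise a TypeError.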

Some day it may be incorporated into pandas core itself.
