
While writing this blog article, I took a break from working with lots of time series data in pandas. In recent weeks, I performed many aggregation and feature engineering tasks on top of a credit card transaction dataset. I want to share some of my insights about useful operations for performing exploratory data analysis or preparing a time series dataset for machine learning tasks.

In this blog post you will learn:

How to load time series data from a CSV

What the rolling operation on a pandas dataframe is and what it is useful for

How to combine the group by operation and the rolling operation on a pandas dataframe

Some examples of transformations using the two operations above that will be useful for you in practice

Some hints on how to parallelize these operations to use all your CPU cores

(Hint: you can find a Jupyter notebook containing all the code and the toy data mentioned in this blog post here.)

Let us start by loading the data.

Loading Time Series data

Loading time series data from a CSV is straightforward in pandas. We simply use the read_csv function, define the Datetime column as the index column, and give pandas the hint that it should parse the Datetime column as a datetime field.

import pandas as pd

df = pd.read_csv("time_series_example.csv",index_col="Datetime",parse_dates=["Datetime"])

df = df.sort_index()

df
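If you do not have the CSV file at hand, here is a minimal sketch of a toy dataset that lets you run all the snippets below. The column names (Datetime, Card ID, Amount) match the ones used throughout this post, but the values are made up purely for illustration:

import pandas as pd

# hypothetical toy data: two cards, a handful of transactions each
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2020-01-01 10:00", "2020-01-02 11:30", "2020-01-05 09:15",
        "2020-01-01 14:00", "2020-01-06 16:45", "2020-01-09 12:00",
    ]),
    "Card ID": ["A", "A", "A", "B", "B", "B"],
    "Amount": [20.0, 35.5, 10.0, 99.9, 15.0, 42.0],
})

df = toy.set_index("Datetime").sort_index()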

We can now see that we successfully loaded our dataset. Let us take a brief look at it. For all time series operations it is critical that pandas loaded the index correctly as a DatetimeIndex; you can validate this by typing df.index and checking for the correct index type (see below). In addition to the Datetime index column, which refers to the timestamp of a credit card purchase (transaction), we have a Card ID column referring to the ID of a credit card and an Amount column that, well, indicates the amount in dollars spent with the card at the specified time.

df.index
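If this does not print a DatetimeIndex, the parsing step was most likely skipped. Here is a quick explicit safeguard plus a fallback conversion (this check is my addition, not part of the original workflow):

# fails loudly if the index was not parsed as datetimes
assert isinstance(df.index, pd.DatetimeIndex)

# fallback: convert the index explicitly
# df.index = pd.to_datetime(df.index)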

Rolling Windows on Time Series with Pandas

The first thing we’re interested in is: “What is the 7-day rolling mean of the credit card transaction amounts?” In this simple example, this means that for every single transaction we look 7 days back, collect all transactions that fall into this range, and take the average of the Amount column. Luckily, this is very easy to achieve with pandas:

df.rolling('7D').Amount.mean()

This information might be quite interesting in some use cases. For credit card transactions, we are usually interested in quantities like the average revenue, the number of transactions, etc., per customer (Card ID) in some time window.

Remark: to perform this operation, our dataframe needs to be sorted by the DatetimeIndex. You can achieve this as follows:

df = df.sort_index()

Combining grouping and rolling window time series aggregations with pandas

We can achieve this by grouping our dataframe by the column Card ID and then performing the rolling operation on every group individually. This is how we get the number of transactions in the last 7 days for any transaction, for every credit card separately. (Hint: we store the result in a dataframe to later merge it back into the original df, giving us one comprehensive dataframe with all the relevant data.)

_grouped = df.groupby("Card ID").rolling('7D').Amount.count()

df_7d_count = pd.DataFrame(_grouped)

df_7d_count = df_7d_count.rename(columns={"Amount":"Transaction Count 7D"})

df_7d_count
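Note that the grouped result carries a two-level index, the Card ID plus the original Datetime, which will matter for the merges below. A quick way to confirm this:

df_7d_count.index.names # expected: ['Card ID', 'Datetime']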

And we might also be interested in the average transaction volume per credit card:

df_7d_mean_amount = pd.DataFrame(df.groupby("Card ID").rolling('7D').Amount.mean())

df_7d_mean_amount = df_7d_mean_amount.rename(columns={"Amount":"Mean Amount 7D"})

df_7d_mean_amount

Merging the result into one comprehensive dataframe

To get an overview of which columns/features we created, we can now simply merge the two created dataframes into a copy of the original dataframe.

result_df = df.copy()

result_df = result_df.merge(df_7d_mean_amount,left_index=True, right_index=True)

result_df = result_df.merge(df_7d_count,left_index=True, right_index=True)

result_df

This already looks quite good; let us add one more feature to get the average number of transactions in 7 days by card. For that, we simply group our dataframe by Card ID again and then take the average of the Transaction Count 7D column.

df_7d_mean_count = pd.DataFrame(result_df["Transaction Count 7D"].groupby("Card ID").mean())

df_7d_mean_count = df_7d_mean_count.rename(columns={"Transaction Count 7D":"Mean 7D Transaction Count"})

df_7d_mean_count

We now have to join two dataframes with different indices (a multi-level index vs. a single-level index); we can use an inner join for that. In this case, pandas picks, based on the index names, which levels to use to join the two dataframes. We can see that after the operation we have a new column Mean 7D Transaction Count. We could add further columns to the dataset in the same way, e.g. the maximum or minimum of the 7-day rolling amount.

result_df = result_df.join(df_7d_mean_count, how='inner')

result_df
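If you want to see why this name-based matching works, it helps to inspect the index names on both sides of the join (the exact representation may vary with your pandas version):

result_df.index.names # e.g. ['Card ID', 'Datetime'] after the merges above

df_7d_mean_count.index.name # 'Card ID'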

What I find very useful: we can now compute the difference between the current 7-day window and the mean of all windows, which can be useful for finding fraudulent credit card transactions. E.g.:

result_df['Transaction Count 7D'] - result_df['Mean 7D Transaction Count']
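As a sketch of how this difference could feed a simple fraud heuristic, we can flag transactions whose current 7-day count is far above the card’s own average. The column name Count Deviation 7D and the factor of 2 below are arbitrary illustrative choices, not tuned values:

result_df["Count Deviation 7D"] = result_df["Transaction Count 7D"] - result_df["Mean 7D Transaction Count"]

suspicious = result_df[result_df["Transaction Count 7D"] > 2 * result_df["Mean 7D Transaction Count"]]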

Parallelize Group By Rolling Aggregation Operations

For datasets with lots of different cards (or any other grouping criterion) and lots of transactions (or any other time series events), these operations can become computationally very expensive. The obvious first step is to scale up the operations on your local machine, i.e. to use all available CPU cores instead of pandas’ default of a single core. I find the little library pandarallel (https://github.com/nalepae/pandarallel) very useful for this. I recently fixed a bug there so that it now also works on time series grouped-by and rolling dataframes. Let us install it and try it out.
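The library is available on PyPI, so a standard pip setup should be enough:

pip install pandarallel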

Here is a small example of how to use the library to parallelize one operation:

from pandarallel import pandarallel

import numpy as np

pandarallel.initialize(nb_workers=2)

df.groupby("Card ID").rolling('7D').Amount.parallel_apply(np.mean,raw=True)

Pandarallel provides a new function, parallel_apply, which takes a function as input. This function is then “applied” to each group and each rolling window, and these operations are executed in parallel across your CPU cores.
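As an optional sanity check, you can compare the parallel result against the single-core equivalent; up to floating point noise, both should produce the same values:

serial = df.groupby("Card ID").rolling('7D').Amount.mean()

parallel = df.groupby("Card ID").rolling('7D').Amount.parallel_apply(np.mean, raw=True)

assert (serial - parallel).abs().max() < 1e-9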

Conclusion

To sum up, in this blog post we covered some methods to aggregate (group by, rolling aggregations) and transform (merging the data back together) time series data, either to understand a dataset better or to prepare it for machine learning tasks. We also showed how to parallelize certain operations across all your CPU cores to save time.

I hope that this blog helped you to improve your workflow for time-series data in pandas.