The other day, I was using pandas to clean some messy Excel data that included several thousand rows of inconsistently formatted currency values. When I tried to clean it up, I realized that it was a little more complicated than I first thought. Coincidentally, a couple of days later, I followed a Twitter thread that shed some light on the issue I was experiencing. This article summarizes my experience and describes how to clean up messy currency fields and convert them into numeric values for further analysis. The concepts illustrated here can also apply to other types of pandas data cleanup tasks.

The Data

Here is a simple view of the messy Excel data:

In this example, the data is a mixture of currency-labeled and non-currency-labeled values. For a small example like this, you might want to clean it up in the source file. However, when you have a large data set (with manually entered data), you will have no choice but to start with the messy data and clean it in pandas.

Before going further, it may be helpful to review my prior article on data types. In fact, working on this article drove me to modify my original article to clarify the types of data stored in object columns.

Let’s read in the data:

import pandas as pd

df_orig = pd.read_excel('sales_cleanup.xlsx')
df = df_orig.copy()

         Customer      Sales
0  Jones Brothers        500
1       Beta Corp  $1,000.00
2     Globex Corp      300.1
3            Acme    $750.01
4         Initech        300
5           Hooli        250

I’ve read in the data and made a copy of it in order to preserve the original.
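As an aside, here is a minimal sketch (my own example, not from the article) of why that copy matters: any cleanup applied to df leaves df_orig untouched, so you can always reset to a clean starting point.

df = df_orig.copy()
df['Sales'] = 0              # clobber the working copy
df_orig['Sales'].head()      # the original values are still intact
df = df_orig.copy()          # reset and start over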

One of the first things I do when loading data is to check the types:

df.dtypes

Customer    object
Sales       object
dtype: object

Not surprisingly, the Sales column is stored as an object. The ‘$’ and ‘,’ are dead giveaways that the Sales column is not a numeric column. More than likely we want to do some math on the column, so let’s try to convert it to a float.

In a real-world data set, you may not be so quick to see that there are non-numeric values in the column. In my data set, my first approach was to try to use astype():

df['Sales'].astype('float')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-50-547a9c970d4a> in <module>
----> 1 df['Sales'].astype('float')
.....
ValueError: could not convert string to float: '$1,000.00'

The traceback includes a ValueError and shows that it could not convert the $1,000.00 string to a float. Ok. That should be easy to clean up.

Let’s try removing the ‘$’ and ‘,’ using str.replace():

df['Sales'] = df['Sales'].str.replace(',', '')
df['Sales'] = df['Sales'].str.replace('$', '')
df['Sales']

0        NaN
1    1000.00
2        NaN
3     750.01
4        NaN
5        NaN
Name: Sales, dtype: object

Hmm. That was not what I expected. For some reason, the string values were cleaned up but the other values were turned into NaN. That’s a big problem.

To be honest, this is exactly what happened to me and I spent way more time than I should have trying to figure out what was going wrong. I eventually figured it out and will walk through the issue here so you can learn from my struggles!

The Twitter thread from Ted Petrou and a comment from Matt Harrison summarized my issue and identified some useful pandas snippets that I will describe below.

Basically, I assumed that an object column contained all strings. In reality, an object column can contain a mixture of multiple types, and the .str accessor only operates on actual strings, silently returning NaN for everything else.
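Here is a minimal sketch (my own example, not from the article) that reproduces the behavior on a tiny mixed Series. I pass regex=False so the ‘$’ is treated as a literal character:

import pandas as pd

# an object Series holding an int, a string and a float
mixed = pd.Series([500, '$1,000.00', 300.1])
print(mixed.dtype)    # object, even though only one value is actually a string

# .str only operates on real strings; the int and float come back as NaN
print(mixed.str.replace('$', '', regex=False))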

Let’s look at the types in this data set.

df = df_orig.copy()
df['Sales'].apply(type)

0      <class 'int'>
1      <class 'str'>
2    <class 'float'>
3      <class 'str'>
4      <class 'int'>
5      <class 'int'>
Name: Sales, dtype: object

Ahhh. This nicely shows the issue. The apply(type) code runs the type function on each value in the column. As you can see, some of the values are floats, some are integers and some are strings. Overall, the column dtype is an object.

Here are two helpful tips I’m adding to my toolbox (thanks to Ted and Matt) to spot these issues earlier in my analysis process.

First, we can add a formatted column that shows each type:

df['Sales_Type'] = df['Sales'].apply(lambda x: type(x).__name__)
df

         Customer      Sales Sales_Type
0  Jones Brothers        500        int
1       Beta Corp  $1,000.00        str
2     Globex Corp      300.1      float
3            Acme    $750.01        str
4         Initech        300        int
5           Hooli        250        int

Or, here is a more compact way to check the types of data in a column using value_counts():

df['Sales'].apply(type).value_counts()

<class 'int'>      3
<class 'str'>      2
<class 'float'>    1
Name: Sales, dtype: int64

I will definitely be using this in my day-to-day analysis when dealing with mixed data types.
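Along the same lines, pandas ships an infer_dtype helper that summarizes what an object column actually holds. This is a minimal sketch of my own (not from the thread), and the exact label it reports depends on the mix of types in the column:

from pandas.api.types import infer_dtype

# reports what pandas thinks is really stored in the column,
# e.g. a descriptive string such as 'mixed-integer' for data like this
infer_dtype(df['Sales'])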