ON PANDAS

The pandas library is the most popular data manipulation library for Python. It provides an easy way to manipulate data through its data-frame API, inspired by R’s data frames.

Photo by Damian Patkowski on Unsplash

Understanding The pandas library

One of the keys to getting a good understanding of pandas, is to understand that pandas are mostly a wrapper around a series of other python libraries. The main ones being Numpy, SQL alchemy, Matplot lib, and openpyxl.

The core internal model of the data frame is a series of NumPy arrays, and pandas functions, such as the now deprecated “as_matrix” function, which return results in NumPy’s internal representation.

Pandas leverages other libraries to get data in and out of data-frames. SQL Alchemy, for instance, is used through the read_sql, and to_sql functions while openpyxl and xlsx writer are used for read_excel and to_excel functions.

Matplotlib and Seaborn, in turn, are used to provide an easy interface, to plot information available within a data frame, using a command such as df.plot()

Numpy’s Panda — Efficient pandas

One of the complaints that you often hear is that Python is slow or that it is difficult to handle large amounts of data. Most often than not, this is due to the poor efficiency of the code written. Native Python code indeed tends to be slower than compiled code. Still, libraries like Pandas provides a python interface to compiled code and knowing how to use this interface properly, let us get the best out of pandas/python.

APPLY & VECTORIZED OPERATIONS

Pandas, like its underlying library Numpy, performs vectorized operations more efficiently than performing loops. These efficiencies are due to vectorized operations are performed through C compiled code, rather than native python code. Another factor is the ability of vectorized operations to operate on entire datasets rather than just a sub-portion at the time.

The apply interface allows to gain some of the efficiency by using a CPython interface to do the looping:

df.apply(lambda x: x['col_a'] * x['col_b'], axis=1)

But most of the performance gain would be obtained from the use of vectorized operation themselves, be it directly in pandas or by calling its’ internal Numpy arrays directly.

As you can see from the picture above, the difference in performance can be drastic, between processing it with a vectorized operation (3.53ms) and looping with apply to do an addition (27.8s). Additional efficiencies can be obtained by directly invoking the NumPy’s arrays and API, e.g.:

Swifter: swifter is a python library that makes it easy to vectorize different types of operations on a data frame, its API is somewhat similar to that of the apply function

EFFICIENT DATA STORING THROUGH DTYPES

When loading a data-frame into memory, be it through read_csv, or read_excel or some other data-frame read function, pandas makes type inference, which might prove to be inefficient. These APIs allow you to specify the types of each column explicitly leveraging dtypes. Specifying dtypes allows for a more efficient storage of data in memory.

df.astype({'testColumn': str, 'testCountCol': float})

Dtypes are native objects from Numpy, which allows you to define the exact type and number of bits used to store certain pieces of information.

Numpy’s type np.dtype(‘int32’) would, for instance, represent a 32 bits long integer. Pandas default to 64 bits integer, we could save half the space by using 32 bits:

memory_usage() shows the number of bytes used by each of the columns, since there is only one entry (row) per column, the size of each int64 column is 8bytes and of int32 4bytes.

Pandas also introduces the categorical dtype, that allows for efficient memory utilization for frequently occurring values. In the example below, we can see a 28x decrease in memory utilization for the field posting_date when we converted it to a categorical value.

In our example, the overall size of the data-frame drops by more than 3X by just changing this data type:

Not only using the right dtypes allows you to handle larger datasets in memory, but it also makes some computations more efficient. In the example below, we can see that using the categorical type brought a 3X speed improvement for the groupby / sum operation.

Within pandas, you can define the dtypes, either during the data load (read_ ) or as a type conversion (astype).

CyberPandas: Cyber pandas is one of the different library extension that enables a wider variety of data types by supporting ipv4 and ipv6 data types and storing them efficiently.

HANDLING LARGE DATASETS WITH CHUNKS

Pandas allows for the loading of data in a data-frame by chunks. It is, therefore, possible to process data-frames as iterators and be able to handle data-frames larger than the available memory.

The combination of defining a chunk size when reading a data source and the get_chunk method allows pandas to process data as an iterator, such as in the example shown above, where the data frame is read two rows at the time. We can then iterate through these chunks:

i = 0

for a in df_iter:

# do some processing chunk = df_iter.get_chunk()

i += 1

new_chunk = chunk.apply(lambda x: do_something(x), axis=1)

new_chunk.to_csv("chunk_output_%i.csv" % i )

The output of which can then be fed to a CSV file, pickled, exported to a database, etc.…

Setting up operator by chunks also allows certain operations to be performed through multi-processing.

Dask: is a framework built on top of Pandas and build with multi-processing and distributed processing in mind. It makes use of collections of chunks of pandas data-frames both in memory and on disk.

SQL Alchemy’s Pandas — Database Pandas

Pandas also is built up on top of SQL Alchemy to interface with databases. As such, it can download datasets from diverse SQL types of databases as well as push records to it. Using the SQL Alchemy interface rather than the Pandas’ API directly allows us to do certain operations not natively supported within pandas such as transactions or upserts:

SQL TRANSACTIONS

Pandas can also make use of SQL transactions, handling commits and rollbacks. Pedro Capelastegui explained in one of his blog posts, notably, how pandas could take advantage of transactions through a SQL alchemy context manager.

with engine.begin() as conn:

df.to_sql(

tableName,

con=conn,

...

)

The advantage of using a SQL transaction, is the fact that the transaction would roll back should the data load fail.

SQL extension

PandaSQL

Pandas has a few SQL extension such as pandasql a library that allows to perform SQL queries on top of data-frames. Through pandasql, the data-frame object can be queried as if they were database tables.

SQL UPSERTs

Pandas doesn’t natively support upsert exports to SQL on databases supporting this function. Patches to pandas exist to allow this feature.

MatplotLib/Seaborn — Visual Pandas

Some of the data frame API already integrate Matplotlib and Seaborn, such as the .plot command. There is a reasonably comprehensive documentation as to how the interface works, on pandas’ website.

Extensions: Different extensions exist, such as Bokeh and Plotly, to provide interactive visualization within Jupyter notebooks, while it is also possible to extend matplotlib to handle 3D graphs.

Other Extensions

Quite a few other extensions for pandas exist, which are there to handle no-core functionalities. One of them is tqdm, which provides a progress bar functionality for certain operations; another is pretty pandas, which allows us to format data frames and add summary informations.

tqdm

tqdm is a progress bar extension in Python that interacts with pandas. It will enable the user to see the progress of maps and apply operations on pandas data frame when using the relevant function (progress_map and progress_apply):

PrettyPandas

PrettyPandas is a library that provides an easy way to format data-frames and to add table summaries to them: