Databricks Koalas: Python Pandas for Spark

Easily scale existing Python Pandas code by integrating it with Spark

Databricks announced yet another exciting feature at this year's Spark + AI Summit. The promise is that we can copy and paste existing Python Pandas code, replace the Pandas import with Koalas, and everything should work fine (or at least most of the time).

Koalas are here!

Python is a widely used programming language when it comes to data science workloads, and its wealth of libraries backs this fact. Most data scientists are familiar with Python and Pandas. But the main issue with Pandas is that it works great on small and medium datasets, but not so well on big-data workloads. The challenge then becomes converting existing Pandas code to PySpark code. This is just not straightforward, and there can be significant performance hits if Python UDFs are used without much care.

Koalas tries to address the first problem, i.e., it lessens the friction of learning a different API to port existing Pandas code to PySpark. With Koalas, we can just directly replace existing Pandas code with Koalas. As far as performance goes, there are no numbers yet, as the project is still in its initial phase. But it definitely looks promising.

To get started:

You can get started just by running:

pip install koalas

But when I tried with Python 3.7, I wasn't lucky enough to get it working the first time. The install might fail with a Cython or pyarrow dependency error. In any case, it is safe to try it out with Python 3.6, where it works without any issues. I will update the article if I get it working with Python 3.7.

Even before we delve too much into it, I just want to call out that Koalas is currently in beta; Databricks and the community are still adding features, and it is a work in progress. There can be some missing functions. They are planning to do weekly releases, and v0.1.0 was released recently.

The current version of the library combines some cool features of Pandas with some features of Spark, making it easy to move existing code to and from Pandas.

Some code examples of current features:

To start using Koalas, all you have to do is import it:

import databricks.koalas as koalas

To read data from a CSV or Parquet file:

koalas_csv_df = koalas.read_csv('data/Characters.csv')

koalas_parquet_df = koalas.read_parquet('data/Characters.parquet')

Koalas DataFrame Object:

This is the logical equivalent of a Pandas DataFrame, but it is a Spark DataFrame internally.

To create a Koalas dataframe:

koalas_df = koalas.DataFrame({'Francrr': [1], 'Spain': [1], 'Brazil': [5]})

Rename the columns of the dataframe:

koalas_df.columns = ['France', 'Spain', 'Brazil']

We can rename the existing columns (here, fixing the typo in the data we just created) by assigning to the columns attribute.

Some basic data manipulation:

koalas_df['France2018'] = koalas_df['France']+1

Print the contents of the dataframe:

print(koalas_df.head(3))

<databricks.koalas.frame.DataFrame object at 0x10fed6f60>

The head(n) method is supposed to return the first n rows, but currently it returns an object reference. This will most likely be fixed in upcoming releases. head() works fine in a Jupyter Notebook, though.

A quick workaround is to convert the Koalas DataFrame to a Pandas DataFrame to view the data.

print(koalas_df.toPandas())

   France  Spain  Brazil  France2018
0       1      1       5           2

Series Object

This is the Pandas logical equivalent of Series but is a Spark Column internally.

s = koalas.Series([11, 3, 5, 6, 8])

print(s.max())

print(s.min())

print(s.kurtosis())

print(s.std())

11
3
-1.0008671522719388
3.0495901363953815
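Since Koalas mirrors the Pandas Series API, the same statistics can be sanity-checked against plain Pandas. One caveat: the kurtosis value will generally not match, because Spark computes population excess kurtosis while Pandas uses a bias-corrected estimator. A small sketch:

```python
import pandas as pd

# Same data as the Koalas example above, but in plain Pandas.
s = pd.Series([11, 3, 5, 6, 8])

print(s.max())            # 11
print(s.min())            # 3
print(round(s.std(), 4))  # 3.0496 (sample standard deviation)
```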

Pandas and Koalas interoperation:

Let’s say you have a Pandas function that reads a file from the internet.

import pandas as pd

def getPandasDf():
    pandas_df = pd.read_csv('http://data.princeton.edu/wws509/datasets/salary.dat', delim_whitespace=True)
    print(pandas_df.head(5))
    return pandas_df.head(5)

Now, you want to use this data in your Spark application. We can do this easily by using the from_pandas() method.

def enrichUsingKoalas():
    koalas_df = koalas.from_pandas(getPandasDf())
    koalas_df.columns = ['Sex', 'rank', 'year', 'degree', 'years_since_earning_highest_degree', 'salary']
    dummy = koalas.get_dummies(koalas_df['Sex'])
    print(dummy.toPandas())

Also, note that pretty handy Pandas functions such as get_dummies() are available in Koalas.
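Because Koalas mirrors the Pandas signature of get_dummies(), the plain-Pandas version illustrates what the one-hot-encoded output looks like (the sample values here are illustrative, not from the salary dataset):

```python
import pandas as pd

# get_dummies() one-hot encodes a categorical column:
# one indicator column per distinct value.
sex = pd.Series(['male', 'female', 'male'])
dummies = pd.get_dummies(sex)

print(list(dummies.columns))       # ['female', 'male']
print(int(dummies['male'].sum()))  # 2 of the 3 rows are 'male'
```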

Few Gotchas:

Handling Nulls: The way Pandas DataFrames handle missing values may differ from how Koalas does. We use NaN in Pandas to indicate missing values, whereas Koalas (like Spark) has a special flag on each value to indicate whether it is missing.
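The Pandas side of this can be seen directly: NaN is a float, so a missing value silently changes a column's dtype. A small sketch in plain Pandas:

```python
import pandas as pd

# In Pandas, a missing value becomes NaN (a float), so an integer
# column containing one is upcast to float64.
df = pd.DataFrame({'a': [1, 2, None]})

print(df['a'].dtype)              # float64
print(int(df['a'].isna().sum()))  # 1 missing value
print(df['a'].sum())              # 3.0 -- aggregations skip NaN by default
```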

Lazy vs. Eager Evaluation: Pandas is inherently eagerly evaluated, but Koalas uses lazy evaluation, i.e., all of the computations are done only when an action such as count() or collect() is called. You might want to keep that in mind when working with Koalas.
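The distinction can be illustrated in plain Python, using a generator as a stand-in for Spark's deferred execution (this is only an analogy, not how Koalas is implemented):

```python
# Eager (Pandas-style): the result is computed immediately.
nums = [1, 2, 3, 4]
eager = [x * 2 for x in nums]   # computed right here

# Lazy (Koalas/Spark-style): only a plan is built; nothing runs yet.
lazy = (x * 2 for x in nums)    # no computation has happened

# Computation happens only when we "act" on the plan, much like
# calling count() or collect() on a Koalas dataframe.
result = list(lazy)

print(eager)   # [2, 4, 6, 8]
print(result)  # [2, 4, 6, 8]
```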

Overall, this is a very interesting step towards bridging the gap between the Python Pandas ecosystem and Apache Spark, and Koalas seems very promising. It is still in its early phase, so we will have to wait and watch how it evolves.

Thanks for reading! Please do share the article, if you liked it. Any comments or suggestions are welcome! Check out my other articles here.

Some useful links: