I’ve been playing around with Apache Zeppelin for a few months now (though early on it was less playing and more frustration just getting it to work). After using it consistently for a while, I find it incredibly useful for data visualization and business intelligence purposes.

Apache Zeppelin describes itself as “a web-based notebook that enables interactive data analytics”. Think of it as an IPython notebook for interactive visualizations, but supporting more languages than just Python for munging your data. After getting PySpark working on it, I find it incredibly useful for displaying business data and analytics. Right now it only has a handful of graph options: bar graphs, line graphs, pie charts, and scatter plots. It’s also open-sourced and currently in incubation at Apache!

On a business and company level, I’ve found it is probably the best way to introduce a new visualization tool when the interpreter language can be Spark SQL. SQL queries come naturally to analysts and most product managers, so if the data is loaded and formatted accordingly, this can potentially let everyone create their own data visualizations. Anyone who knows SQL can now play around with visualizations with a lot more ease.

Ultimately it looks and functions a bit like Tableau, minus the thousands of dollars for a Tableau license. And hopefully, as the data scales, the speed and functionality of Zeppelin scale with it when running on a cluster with Spark. I believe the end goal is to run huge amounts of data through it and potentially visualize billions of data points, with Spark doing most of the heavy lifting.

Well, how does it work?! I’ll show a quick demo of the install and then some initial code. Here’s the link to the current Zeppelin GitHub repo.

Installing Zeppelin and getting it set up and working is a bit of a headache, but let’s try it.

Installing Spark and Pyspark

If you care about getting PySpark working on Zeppelin, you’ll have to download and install Spark (which ships with PySpark) manually. My friend Alex created a pretty good tutorial on how to install Spark here. I’ll go over the initial part quickly.

First we have to download the latest version of Spark into /usr/local/bin.

Then we’ll change into that directory and decompress the file.

cd /usr/local/bin
sudo tar -zxf spark-1.5.2-bin-hadoop2.6.tgz -C /usr/local/bin

Awesome, now that Spark is downloaded and installed, we can set and export the SPARK_HOME and PYTHONPATH variables in bash.

export SPARK_HOME=/usr/local/bin/spark-1.5.2-bin-hadoop2.6
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

These variables need to be exported and set on each new session that you run, so it’s worth adding the export lines to your ~/.bashrc. This caused me a bit of a headache over the past few months until I realized the only reason things weren’t working was that PYTHONPATH wasn’t set correctly.
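As a quick sanity check before importing pyspark, a tiny helper can report which variables are missing. This is just my own sketch, not part of Spark or Zeppelin:

```python
import os

# Variables the PySpark setup above depends on.
REQUIRED_VARS = ["SPARK_HOME", "PYTHONPATH"]

def missing_spark_vars(env=None):
    """Return the required variables absent from the given environment."""
    if env is None:
        env = os.environ
    return [v for v in REQUIRED_VARS if v not in env]

# Example against a fake environment with only SPARK_HOME set:
print(missing_spark_vars({"SPARK_HOME": "/usr/local/bin/spark-1.5.2-bin-hadoop2.6"}))
# → ['PYTHONPATH']
```

If the list comes back non-empty, fix your exports before going any further.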

Installing Zeppelin

Before installing Zeppelin make sure you have the requirements necessary to install it.

sudo apt-get update
sudo apt-get install git
sudo apt-get install openjdk-7-jdk
sudo apt-get install npm
sudo apt-get install libfontconfig

# install maven
wget http://www.eu.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
sudo tar -zxf apache-maven-3.3.3-bin.tar.gz -C /usr/local/
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/local/bin/mvn

pip install py4j  # necessary for pyspark

Navigate to the directory you want and clone the repository. Then cd into the folder and build with Maven. There are a couple of build profiles to choose from, but I just used the Spark 1.5 profile for local use.

git clone https://github.com/apache/incubator-zeppelin/
cd incubator-zeppelin/
mvn clean package -Pspark-1.5 -DskipTests
./bin/zeppelin-daemon.sh start

Awesome, now go to http://localhost:8080/ and you should see the Zeppelin homepage.

If you happen to get an error when installing Zeppelin, you can email me or tweet at me @racketracer, though I may well not have the answer, so checking the user community will help too. Just the other day I tried setting it up on another server and it didn’t work, so at this point I almost think the install is a crapshoot. This stackoverflow question also helped me a lot.

Getting Started with Apache Zeppelin Notebook

With everything set up correctly, we can open a new notebook and start writing some code. I grabbed the Airbnb dataset from the website Inside Airbnb: Adding Data to the Debate. It has a plethora of information on Airbnb listings from cities all across the world. We’re going to explore the San Francisco data to see if we can visualize any trends or interesting factoids from all of the listings and prices. Each row in the dataset is an Airbnb listing in San Francisco with numerous features like neighbourhood, price, review score, etc. Get the data here.

In the first cell, write %pyspark to tell the interpreter that the code will be written in PySpark. Scala, Java, and other languages also work for ingesting the data.

%pyspark
from pyspark.sql.types import *
from pyspark.sql import functions
import pandas as pd

path = "/path_to_dataset/sf_listings_zeppelin.csv"
data = pd.read_csv(path)
df = sqlContext.createDataFrame(data)
df.registerTempTable("airbnb")

I’m reading the data with pandas for ease, then creating a Spark SQL DataFrame and registering it as a temporary table in memory. Note that Spark is really meant for datasets too large to fit in pandas’ in-memory RAM on your laptop, so while this isn’t best practice, it works for our situation since the dataset is only around 7,000 rows. Also, while in standalone Spark you would have to declare the sqlContext yourself from the Spark context, in Zeppelin this isn’t necessary.

airbnb = sqlContext.sql("""
SELECT
    CAST(id AS INT) AS id,
    CAST(host_id AS INT) AS host_id,
    CAST(host_response_rate AS INT) AS host_response_rate,
    CAST(host_acceptance_rate AS INT) AS host_acceptance_rate,
    CAST(host_listings_count AS INT) AS host_listings_count,
    CAST(host_total_listings_count AS INT) AS host_total_listings_count,
    CAST(latitude AS FLOAT) AS latitude,
    CAST(longitude AS FLOAT) AS longitude,
    CAST(accommodates AS INT) AS accommodates,
    CAST(bathrooms AS FLOAT) AS bathrooms,
    CAST(beds AS INT) AS beds,
    CAST(minimum_nights AS INT) AS minimum_nights,
    CAST(maximum_nights AS INT) AS maximum_nights,
    CAST(availability_365 AS INT) AS availability_365,
    CAST(number_of_reviews AS INT) AS number_of_reviews,
    CAST(first_review AS DATE) AS first_review,
    CAST(last_review AS DATE) AS last_review,
    CAST(review_scores_rating AS INT) AS review_scores_rating,
    CAST(review_scores_accuracy AS INT) AS review_scores_accuracy,
    CAST(review_scores_cleanliness AS INT) AS review_scores_cleanliness,
    CAST(review_scores_checkin AS INT) AS review_scores_checkin,
    CAST(review_scores_communication AS INT) AS review_scores_communication,
    CAST(review_scores_location AS INT) AS review_scores_location,
    CAST(review_scores_value AS INT) AS review_scores_value,
    CAST(reviews_per_month AS FLOAT) AS reviews_per_month,
    CAST(price AS INT) AS price,
    CAST(weekly_price AS INT) AS weekly_price,
    CAST(monthly_price AS INT) AS monthly_price,
    CAST(security_deposit AS INT) AS security_deposit,
    CAST(cleaning_fee AS INT) AS cleaning_fee,
    CAST(guests_included AS INT) AS guests_included,
    CAST(extra_people AS INT) AS extra_people,
    host_response_time,
    host_is_superhost,
    host_neighbourhood,
    host_identity_verified,
    neighbourhood,
    property_type,
    room_type,
    bed_type,
    instant_bookable,
    cancellation_policy,
    require_guest_profile_picture,
    require_guest_phone_verification
FROM airbnb
""")
airbnb.registerTempTable("sf")
airbnb.show(1)

Here I’m creating a new SQL query, casting all the numeric columns to integers and floats just to be sure, and grabbing the relevant columns out of the dataset. sqlContext.sql lets us write SQL, modify the table, and assign the result to a variable, in this case called “airbnb”. After that, I register the table as a temporary SQL table again so that we can query it in the next cells.
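The same defensive casting can also be done on the pandas side before handing the frame to Spark. Here’s a minimal sketch using a couple of the column names above on a toy frame, with pandas’ `to_numeric`:

```python
import pandas as pd

# Toy frame standing in for the raw CSV: numeric columns often arrive as strings,
# sometimes with unparseable junk mixed in.
raw = pd.DataFrame({
    "price": ["150", "90", "bad"],
    "review_scores_rating": ["95", "88", "100"],
})

# errors="coerce" turns unparseable values into NaN instead of raising,
# which mirrors what the SQL CASTs do with bad rows.
for col in ["price", "review_scores_rating"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

print(raw["price"].tolist())  # → [150.0, 90.0, nan]
```

Doing this before `createDataFrame` means Spark infers clean numeric types from the start.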

Visualizations with SQL

Now it’s time to write SQL to pull some interesting visualizations out of the dataset. First we create a new cell and annotate it with %sql at the top, like we did with %pyspark in the first cell. This tells Zeppelin that the interpreter for the current cell is Spark SQL. The SQL syntax follows the same conventions as Hive.

Let’s try a query to get the average acceptance rates of Airbnb listings by the respective SF neighbourhoods.

%sql
SELECT AVG(host_acceptance_rate) avg_acceptance, neighbourhood
FROM sf
GROUP BY neighbourhood
ORDER BY 1 DESC

Press SHIFT+ENTER and the cell will run and save the notebook at the same time. You’ll see a navigation bar pop up with graphs and options. If you mess around with each graph type, you can generally picture which one fits the data you have. Since we’re representing neighbourhoods and their numeric average acceptance rates, I tried the stacked line graph. You can also drag and drop the fields from the SELECT query into three different tabs.

Essentially, the “Keys” tab sets the x-axis of the graph, so I dragged and dropped the neighbourhood field into it. “Groups” separates the different values in a field into different colors and groups. “Values” displays numeric values on the y-axis. Because the aggregation already happened in the SQL query, it doesn’t really matter whether the avg_acceptance value is set to SUM, COUNT, etc.; as long as every value appears once in the graph, no further aggregation is necessary. We could also redo this exact visualization by selecting the raw rows without AVG(host_acceptance_rate) and then selecting AVG on the light blue button. However, since Zeppelin limits each query to 1000 rows, we wouldn’t get all of the data.
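If you want to sanity-check what the %sql cell produced, the same group-and-average can be reproduced outside Zeppelin with pandas. A sketch with made-up toy values:

```python
import pandas as pd

# Toy stand-in for the sf table (hypothetical values).
listings = pd.DataFrame({
    "neighbourhood": ["Mission", "Mission", "Castro"],
    "host_acceptance_rate": [80, 100, 70],
})

# GROUP BY neighbourhood, AVG(host_acceptance_rate), ORDER BY avg DESC
avg_acceptance = (
    listings.groupby("neighbourhood")["host_acceptance_rate"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_acceptance)
```

Each resulting row corresponds to one point on the x-axis, which is why no extra aggregation is needed in the UI.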

Zeppelin Input Boxes

Another example to try is feeding input values into the SQL through text boxes. For example, note the query below and the bracketed ${} expressions.

%sql
-- This query grabs average prices grouped by neighbourhood and number of beds
WITH subquery AS (
    SELECT AVG(price) PRICE, neighbourhood, beds, COUNT(1) AS cnt
    FROM sf
    WHERE beds <= ${max_accommodations=4} AND price < 1000
    GROUP BY neighbourhood, beds
)
SELECT *
FROM subquery q2
INNER JOIN (
    SELECT DISTINCT(neighbourhood)
    FROM subquery q1
    WHERE q1.cnt > ${min_count_of_listings=50}
) AS distinct_neigh
ON distinct_neigh.neighbourhood = q2.neighbourhood
WHERE cnt > 15

This query grabs the average price per number of beds in each neighbourhood in SF. However, because I don’t want listings without enough data to skew the results, I set a “minimum count of listings” input box, defaulted to 50, in the second subquery.
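The “drop small groups before averaging” logic is worth internalizing, so here’s the same idea sketched in pandas on toy data (my own example, not from the dataset):

```python
import pandas as pd

# Toy listings: Mission has enough 1-bed listings to trust; Castro has only one.
listings = pd.DataFrame({
    "neighbourhood": ["Mission", "Mission", "Mission", "Castro"],
    "beds": [1, 1, 2, 1],
    "price": [100, 120, 200, 500],
})

# Average price and listing count per (neighbourhood, beds) group.
grouped = (
    listings.groupby(["neighbourhood", "beds"])["price"]
    .agg(["mean", "count"])
    .reset_index()
)

# Keep only groups with enough listings to trust the average.
MIN_COUNT = 2
trusted = grouped[grouped["count"] >= MIN_COUNT]
print(trusted)
```

Without the count filter, the single $500 Castro listing would look like a neighbourhood-wide average.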

Using ${variable_name=50} lets me render an input box named “min_count_of_listings” whose default value of 50 can be changed dynamically.
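To make the substitution concrete, here’s a toy re-implementation of the ${name=default} placeholder in plain Python. This is just an illustration of the idea; Zeppelin’s own dynamic-form parsing is more featureful:

```python
import re

def fill_form(query, values=None):
    """Replace ${name=default} placeholders with supplied values or defaults."""
    values = values or {}

    def sub(match):
        name, default = match.group(1), match.group(2)
        return str(values.get(name, default))

    return re.sub(r"\$\{(\w+)=([^}]*)\}", sub, query)

q = "SELECT * FROM sf WHERE cnt > ${min_count_of_listings=50}"
print(fill_form(q))                                  # uses the default: ... > 50
print(fill_form(q, {"min_count_of_listings": 100}))  # user override:   ... > 100
```

In Zeppelin the text box plays the role of the `values` dict: whatever you type replaces the default before the query runs.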

The last thing I’ll try is a scatter plot. Zeppelin currently limits the amount of data that can be visualized to 1000 data points, so scatter plots aren’t as useful as they could be.

%sql
SELECT review_scores_rating, price
FROM sf
WHERE review_scores_rating > "${min_review_score=0}"
  AND price < "${price_max=1000}"

We can see the general trend that better-reviewed listings cost more, as common sense suggests. Graphs like these also help identify features in the data that we could use later when running machine learning models to predict the cost of each individual Airbnb listing.
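Before reaching for a model, you can put a number on such a trend. A quick sketch of the score–price correlation on toy values (made up for illustration):

```python
import pandas as pd

# Toy listings: do higher review scores go with higher prices?
df = pd.DataFrame({
    "review_scores_rating": [70, 80, 90, 100],
    "price": [80, 100, 150, 200],
})

# Pearson correlation: close to 1 means a strong positive linear trend.
r = df["review_scores_rating"].corr(df["price"])
print(round(r, 2))
```

On the real dataset, a feature with a correlation like this would be a natural candidate predictor for a pricing model.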

Tips and Tricks

Here are a few bullet-pointed tips I’ll add.

Optimize queries for visualizations: Think about what you want to visualize before you write your query. For example, if you want the average price by number of bathrooms and neighbourhood, it’s smarter to group by those fields in your query than to use the group function in the UI. Imagine building the graphs from the data in your table the way you would in Excel: the SQL results should be well segmented and selected, so avoid doing a SELECT *.

Keep all data wrangling in one cell: Doing all of the data cleaning and formatting before writing SQL in other cells helps with the overall organization of the notebook.

Sort your data so that the first row isn’t null: I’ve noticed a bug in Zeppelin that currently nullifies your entire column if the first row in that column is null. Sort your data so that each column has a value in the first row to get around this bug.
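The null-first-row workaround is easy to apply before registering the table. A pandas sketch (hypothetical column name):

```python
import pandas as pd

# A column whose first value is null would be nullified entirely by the bug.
df = pd.DataFrame({"price": [None, 120, 90]})

# na_position="last" pushes nulls to the bottom, so row one holds a real value.
df = df.sort_values("price", na_position="last").reset_index(drop=True)
print(df["price"].iloc[0])  # → 90.0
```

Do this per problem column before calling createDataFrame and the first row Zeppelin sees will be non-null.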

Conclusion and Business Case

By segmenting data out like this on a notebook-style visualization layer, Zeppelin is shaping up to offer an awesome user experience for all kinds of business needs. If you throw a notebook up on a server, multiple people can edit it and append their own visualizations based on existing dataframes. In a BI sense, I’ve found Zeppelin especially useful for visualizing time series data, since new data can be continuously fed in through cron jobs and SQL pulls.

Try Zeppelin out today and tell me what you think.

As always, connect with me on LinkedIn, Twitter, Email or leave a comment below!