Having no choice is not good, but life gets complicated when there are too many choices (especially when we have no idea how to decide). As a lifelong student of data science and technology in general, I often struggle with which tool to use and fall in love with. That's why I'm writing this post: to help learners like myself. I'm not going to talk about commercial technologies (such as Adatao); I will focus only on open source alternatives. Why? I like free stuff.

Let me clarify upfront that there are many other "notebooks" out there besides Zeppelin and Jupyter. Some interesting alternatives are Rodeo and Beaker Notebook. The reason they are not in this comparison is simple: most are still early in maturity, and I ended up not using them enough to compare. Rodeo is like RStudio for Python users and data scientists, while Beaker Notebook is like a better UI for Jupyter but with significantly less functionality. The only advantages I see so far for Beaker Notebook are the client binary (i.e. you can run it without opening a browser) and its ability to mix different languages (R with Python with Scala...) in the same notebook. But then, I like Zeppelin better for its out-of-the-box Spark integration.

Another important note: if you and your organization are strictly R users, please stick with RStudio! There is nothing that Zeppelin or Jupyter can provide that RStudio can't for R users. But if you want to use Python, Hive and even Shell in the same notebook, read on.

If you take away only one thing from this post, remember this: please don't overcomplicate the sh!t out of everything. Choose whatever you are comfortable with and believe in.

Installation

Zeppelin is very hard to install. It has been around for almost four years, yet it has had a slow start, which is no surprise considering how cumbersome it is to get working. I found that the best instructions are actually from Jay Feng. If you download the latest Zeppelin 0.5.5 binary as well as the latest Spark 1.5.2 with Hadoop 2.6, it won't work (due to version incompatibility). The best way to get Zeppelin working is to build it from source yourself. This means you have to set up your OS properly with Maven and other libraries. I also found that if you build Zeppelin behind a firewall, you will run into many issues and errors during the process. It's very painful and is not a one-day task for a non-engineer.

Jupyter installation is trivial. All you need to do is pull from their git repo:

https://github.com/jupyter/notebook

Then cd into the notebook folder and run jupyter notebook in the terminal. That's it. It will even open the browser automatically for you.

Winner: Jupyter

Configuration

Zeppelin has a separate "Interpreter" configuration page where you can set the parameters for each language. For example, you can change the Spark home directory as well as the Spark master string. It creates the Spark context automatically, so you don't need to deal with it in each notebook. All you have to do is type %pyspark at the beginning and get on with your code.

Jupyter's user experience for configuration is a bit clunky. Yes, it allows you to do everything, but I found it easier to edit the individual config files in your Jupyter folder. However, once it works, you will feel very satisfied. One issue with Jupyter is its documentation for configuration. I think it's a bit outdated in terms of how things are done (i.e. where to find the variables and the entries...).
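
To make the "where to find the variables and the entries" part a bit more concrete, here is a minimal sketch that locates the config files. It assumes Jupyter's documented default of ~/.jupyter, which can be overridden with the JUPYTER_CONFIG_DIR environment variable. The helper names jupyter_config_dir and list_config_files are made up for this example and are not part of Jupyter.

```python
import os
from pathlib import Path


def jupyter_config_dir():
    """Return the directory Jupyter is assumed to read its config files from.

    Assumption: $JUPYTER_CONFIG_DIR if set, otherwise ~/.jupyter.
    """
    override = os.environ.get("JUPYTER_CONFIG_DIR")
    return Path(override) if override else Path.home() / ".jupyter"


def list_config_files(config_dir=None):
    """List the .py and .json config files found in the config directory."""
    config_dir = Path(config_dir) if config_dir else jupyter_config_dir()
    if not config_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in config_dir.iterdir()
        if p.is_file() and p.suffix in {".py", ".json"}
    )


print("Config directory:", jupyter_config_dir())
for name in list_config_files():
    print("  ", name)

# A typical entry inside jupyter_notebook_config.py looks like:
#   c.NotebookApp.port = 8889
```

Once you know which file is in play, changing a setting is a one-line edit in that file, which is the approach I ended up preferring.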

Winner: Zeppelin

Interface

Zeppelin is the obvious winner here because it builds on common UI frameworks, Bootstrap and Angular.js. Not only does it look more modern, but the code is also easy to understand and customize if you must. For example, you can create a "form" within a Zeppelin notebook to customize a chart by providing its inputs. This "form" is basically a JavaScript form using Angular.js that "talks" to the Python code (or R, Scala, etc.). If you are familiar with R's Shiny server, you can do the same thing in Zeppelin easily, out of the box.

Jupyter is, again, very outdated from a frontend standpoint. I have spent some time looking into its frontend technology and I must say: someone had better rewrite this because it's very hard to maintain (lol). They still use the same old jQuery approach with a complex and confusing architecture. The stylesheet is compiled via LESS (which I am not a fan of). If you want to add interactive graphing ability (similar to Databricks or Zeppelin), it's an almost impossible task.

Winner: Zeppelin

Supported Languages

Zeppelin supports only 11 interpreters by default, but those are the most important ones:

Cassandra, Flink, Geode, Hive, Spark, Ignite, Lens, Markdown, Shell, PostgreSQL/HAWQ, Tajo

You can also write your own Zeppelin interpreter, and a common one is R. But it looks like there are two R interpreters:

https://github.com/datalayer/zeppelin-R

and

https://github.com/elbamos/Zeppelin-With-R

To be honest, I am still confused about which one to use.

Jupyter has a huge list of about 60 kernels currently supported. So it's obvious that if you want a wide range of languages to use in your organization, Jupyter is the way to go. Installing a kernel is also easy. One small disadvantage is that you cannot write multiple languages in the same notebook. Say you want to use R and Python on the same dataset: you have to write them in two separate notebooks. That means you have to export the data from one notebook and re-access it from another. It's a bit clunky but not a deal breaker.
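
The export-and-reload workaround can be sketched as follows. This is only an illustration using a CSV file in the temp directory as the hand-off point; the file name, the columns and the helper names export_rows and import_rows are all made up for this example.

```python
import csv
import os
import tempfile


def export_rows(path, header, rows):
    """Write a header plus data rows to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def import_rows(path):
    """Read a CSV file back as a list of dicts keyed by the header."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# "Notebook A" (Python kernel) writes its result to a shared location.
shared_path = os.path.join(tempfile.gettempdir(), "handoff_example.csv")
export_rows(shared_path, ["name", "score"], [["alice", 0.9], ["bob", 0.7]])

# "Notebook B" picks the file up again. With an R kernel this step would be
# something like read.csv(shared_path); here we read it back in Python.
rows = import_rows(shared_path)
print(rows)
```

Note that CSV carries no type information, so the numbers come back as strings on the Python side and need converting. That is part of what makes the round trip feel clunky.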

Winner: Jupyter

Plotting Flexibility

Zeppelin is great for mixing and matching different interpreters in the same notebook (one big miss in Jupyter, as mentioned earlier). You also have the freedom to play around with various charts. Zeppelin uses NVD3 out of the box, with the default charting options on top. However, those are very minor things IMHO compared to other "things" missing from Zeppelin, such as multi-user support (discussed below) and R support.

Jupyter has no charting options by default, but you can always use existing charting libraries from R or Python. One of my recent discoveries is plot.ly for Python.

Overall, I think I can give up some of Zeppelin's features and stay with Jupyter just to be safe.

Winner: Jupyter

Multi-users

This is one disappointing area for Zeppelin early adopters. I was hoping Zeppelin could fix this problem for the Jupyter community. However, it seems that NFLabs is trying to commercialize its Zeppelin Hub and make it the Databricks for Zeppelin users. For freeloaders like myself, there are two options:

PullRequest-53 (Shiro security) is one good option and seems to be kept up to date (for now) by the developer. This method uses a single Zeppelin instance, and all users can access all interpreters.

Reverse proxy + Zeppelin on Docker, leveraging https://github.com/NFLabs/z-manager/tree/master/multitenancy, uses multiple Zeppelin instances, with each instance assigned to one user. Obviously this method takes more resources (RAM and CPU) but provides better multi-tenancy support.

JupyterHub is the only option for multi-user support in Jupyter. It's fairly simple to set up because it leverages Linux users and groups to provide authentication. This is quite similar to how RStudio Server does it. The advantage is that users cannot "touch or see" each other's work because access is restricted at the OS layer. However, you have to keep JupyterHub on a single server, and I don't see a way to scale this to thousands or hundreds of thousands of users at the moment. If you have an internal Jupyter deployment for a group of 20 data scientists, this can work perfectly. And at least JupyterHub is something you can download and install internally (without having to pay a third-party company for this feature).

Winner: Jupyter

Community

Zeppelin is still in the incubation stage at Apache and got a bit of attention when it started. However, after more than a year as open source, I still think it is progressing very slowly relative to where Jupyter is today. If you search on Stack Overflow, there are only about 300 questions and answers:

http://stackoverflow.com/search?q=zeppelin

However, since Jupyter has a long history as IPython, you can see tens of thousands of questions from the community, which shows its wide support:

http://stackoverflow.com/search?q=jupyter

http://stackoverflow.com/search?q=ipython

I use Stack Overflow purely as an indicator of community support for various technologies. You can access the Zeppelin mailing list, but it's not ideal for searching previous questions. I found that many Zeppelin questions are currently about installation because it's still a huge barrier to entry.
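
If you want to repeat this kind of count without clicking through search pages, something like the sketch below could work. It assumes the public Stack Exchange API (version 2.3), its search/advanced endpoint and its built-in "total" filter; please check these against the current API documentation before relying on them. The function names are made up for this example. The API counts questions only, so the numbers will not match the site search (which also includes answers), and unauthenticated requests are rate limited.

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the public Stack Exchange API, version 2.3.
API_URL = "https://api.stackexchange.com/2.3/search/advanced"


def build_search_url(query, site="stackoverflow"):
    """Build a URL that asks only for the total number of matching questions."""
    params = {"q": query, "site": site, "filter": "total"}
    return API_URL + "?" + urlencode(params)


def parse_total(raw_body):
    """Extract the 'total' field from a JSON response body (bytes or str)."""
    return int(json.loads(raw_body)["total"])


def fetch_total(query):
    """Fetch the question count for a search term (needs network access)."""
    with urlopen(build_search_url(query)) as response:
        body = response.read()
        # The API compresses its responses; decompress when gzip is signalled.
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return parse_total(body)


# Example usage (left commented out because it needs network access):
# for term in ("zeppelin", "jupyter", "ipython"):
#     print(term, fetch_total(term))
```

Running the commented-out loop would give one number per search term, which is enough for the rough comparison I'm making here.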

Winner: Jupyter

Conclusion

I love Zeppelin for being innovative and for its modern architecture. However, I don't think it's good enough to convince me to switch from the more familiar (if outdated) Jupyter. I will continue to follow the Apache community to understand its direction and, hopefully, multi-tenancy will be incorporated by default in Zeppelin. That would be a huge win. I also want to see more documentation and more options for theming and plotting (besides just NVD3). At this point, I will put Zeppelin aside and continue modifying Jupyter to make it fit my workflow.