Google Cloud Dataproc and the 17 minute train challenge

My work commute

My commute to and from work on the train is on average 17 minutes. It’s the usual uneventful affair, where the majority of people pass the time by surfing their mobile devices, catching a few Zs, or by reading a book. I’m one of those people who like to check in with family & friends on my phone, and see what they have been up to back home in Europe, while I’ve been snug as a bug in my bed.

Stay with me here folks.

But aside from getting up to speed with the latest events from back home, I also like to catch up on the latest tech news, and in particular what’s been happening in the rapidly evolving cloud area. And this week, one news item in my AppyGeek feed immediately jumped off the screen at me. Google have launched yet another game-changing product into their cloud platform big data suite.

It’s called Cloud Dataproc.

Setting the challenge

Google’s party line for this brand spanking new beta tool is simple: with just a few simple clicks Dataproc will spin up and hand you over a Hadoop (2.7.1) and Spark (1.5) cluster on a silver platter, which is available for immediate use and fully managed. Oh yes, and I forgot – all in about 90 seconds. It sounded intriguing.

It also sounded like a very bold claim indeed, and one which I thought would be cool to test out and make sure wasn’t just one of those too-good-to-be-true pieces of marketing hype. I mean, seriously, who doesn’t have 90 seconds to spare to spin up a cluster in the cloud with Hadoop & Spark installed out of the box?!

So, I decided that on the way home from work one evening I’d put Google’s bold claim to the test. Before my train arrived back home, and by using just a laptop and tethering my phone, could I really create a Dataproc cluster in the cloud with Hadoop and Spark installed and ready to roll?

Game on

However, just spinning up the cluster seemed all too easy. And anyway, if Google’s claim was true, and it could be done in under 90 seconds, then I’d still have over 15 minutes to kill on my commute home.

I wasn’t too keen on scrolling through my Facebook feed again, and seeing another friend knocking back cocktails on a beautiful beach somewhere enjoying their holidays. So, to make it even more challenging and much more interesting, I laid down the gauntlet:

Board homeward bound train

Fire up laptop and tether phone

Create Dataproc cluster in under 90 seconds

SSH into master and check Hadoop & Spark installations

Run the 2 examples provided (Spark & PySpark)

Delete cluster

Disembark and enjoy pint at local pub

I was extremely proud of just how scientific this test I had concocted was.

Disclaimer 1: I also had a justifiable reason for trying out Dataproc. We actually do a lot of work on the Google Cloud Platform for our clients, and in particular using BigQuery and Dataflow to build big data solutions for them. Dataproc sounded intriguing, and I thought it could be something that would be very useful to us.

Disclaimer 2: Some of the screenshots coming up may be hard to read, especially the ones that are taken of the console. I hadn’t taken this into consideration when doing the challenge, and forgot to zoom in on my browser before embarking on my epic journey. However, a quick tip – if it’s hard to read you can click on the image and it will open the original sized one in another tab.

Let’s get this party started.

Board homeward bound train

I left work at the usual(ish) time, and headed for Flinders Street station in Melbourne CBD. My train is on the Sandringham line, and it was due to leave at 5:23pm. However, I’d made a schoolboy error very early on. I’d completely forgotten that downstairs in the station the network coverage is very, very flaky. I was worried that as soon as the train pulled away I’d struggle to find internet on my phone for at least a minute or two until it pulled out into the clear.

Bah!

Fire up laptop and tether phone

With the worry of network issues dogging my über-scientific challenge, my train departed at 5:23pm sharp. It was due to arrive at my station at 5:40pm. That gave me 17 minutes, give or take. As soon as the doors slammed shut, I flipped open my laptop and tethered my phone. After all, what good is working in the cloud if you’ve no internet? I tested that I had some internet juice by going to the Google Developers Console (GDC) in my browser. It was painfully slow to load, but at least some data was flowing.

I got cracking.

Create Dataproc cluster in under 90 seconds

The first step was to spin up the Dataproc cluster with 1 master and 4 worker nodes. This turned out to be really simple, and I didn’t need to refer to the Dataproc docs much at this point. In the GDC, I simply clicked:

Big Data -> Cloud Dataproc -> Clusters -> Create Cluster

Giving the cluster a very appropriate name as you can see above, I set the zone to us-central1-a simply because I like that area in the US, and also bumped the number of worker nodes up to 4 from the default 2.

Finally, I left everything else as default, e.g. the default instance type n1-standard-4 (4 vCPU & 15GB RAM), and primary disk size of 500GB. That configuration resulted in my Dataproc cluster weighing in at 16 YARN cores, and 48GB of YARN memory.
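Those resource figures can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: the node count and machine specs come from the numbers above, while the idea that only worker nodes count towards YARN, and the resulting memory share, are inferred from those numbers rather than taken from any documented Dataproc setting.

```python
# Back-of-the-envelope check of the cluster size quoted in the post.
WORKERS = 4            # worker nodes chosen in the console
VCPUS_PER_NODE = 4     # n1-standard-4
RAM_GB_PER_NODE = 15   # n1-standard-4
YARN_MEMORY_GB = 48    # total YARN memory reported for the cluster


def yarn_cores(workers: int, vcpus_per_node: int) -> int:
    """Assumption: only worker nodes are counted, which matches the 16 cores reported."""
    return workers * vcpus_per_node


def yarn_memory_share(yarn_gb: float, workers: int, ram_gb_per_node: float) -> float:
    """Fraction of the workers' combined RAM that shows up as YARN memory."""
    return yarn_gb / (workers * ram_gb_per_node)


cores = yarn_cores(WORKERS, VCPUS_PER_NODE)
share = yarn_memory_share(YARN_MEMORY_GB, WORKERS, RAM_GB_PER_NODE)
print(f"YARN cores: {cores}")
print(f"YARN memory per worker: {YARN_MEMORY_GB / WORKERS:.0f} GB ({share:.0%} of RAM)")
```

In other words, each worker appears to hand roughly 12GB of its 15GB to YARN, with the remainder presumably left for the operating system and daemons.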

That sounded good to me, and so I clicked Create and started my Apimac timer. Tick tock, tick tock…! I waited. The first stop of my commute, Richmond station, was coming up fast. I waited some more.

Boom. It took 64 seconds. Yes, you read that right, just 64 seconds. Let that soak in for a bit. For it to come up that quickly, Google must be using some seriously optimized images. The cluster I created wasn’t particularly big (one of our Dataflow clusters scales up to 64 instances), but it was incredibly impressive nevertheless.

Being able to get infrastructure up this quickly is a different ball game altogether. In most cases, big Hadoop/Spark clusters are provisioned, and simply left running when they are not being used because they are such a pain and expensive to get up in the first place. With this type of speed, developers will now be able to spin up clusters on an ad-hoc basis, and shut them down when they are finished. That’s bloody awesome.
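That ad-hoc pattern (create, run, delete) is easy to script. Below is a minimal sketch that only builds and prints the gcloud commands unless told otherwise. The cluster name, zone, worker count and Spark job mirror the ones used in this post; the function names are hypothetical, and the --zone and --num-workers flags are assumed to behave as in the gcloud release I used (newer releases may also want a --region flag).

```python
import shlex
import subprocess


def lifecycle_commands(cluster, zone="us-central1-a", workers=4):
    """Build the create -> submit -> delete commands for a throwaway cluster."""
    base = ["gcloud", "beta", "dataproc"]
    return [
        base + ["clusters", "create", cluster,
                "--zone", zone, "--num-workers", str(workers)],
        base + ["jobs", "submit", "spark", "--cluster", cluster,
                "--class", "org.apache.spark.examples.SparkPi",
                "--jars", "file:///usr/lib/spark/lib/spark-examples.jar",
                "1000"],
        base + ["clusters", "delete", cluster],
    ]


def run_all(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence if any step fails
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(lifecycle_commands("dataproc-train-challenge"))
```

Keeping dry_run on by default means nothing is created (or billed) by accident; a real script would probably also want to make sure the delete step runs even if the job fails.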

Ok, it was back to business. That was just the GDC reporting that the cluster was running and ready, but I wanted to be 100% sure it really was. It was time to get SSH’ing.

SSH in and check Hadoop & Spark installations

I clicked on VM Instances in the GDC, and lo and behold I could see all 5 of my cluster instances listed there. The master is easily identified as the instance with the m suffix, and the rest are the workers with the w suffix. Looking at them in the GDC, all 5 instances looked healthy:

Minor Gripe: I found having to click on VM Instances to see all the nodes in the cluster a little annoying. It would be a far better user experience if they were all listed when clicking on each cluster itself.

If you look closely at the screenshot above, you’ll notice the conveniently placed SSH link right there next to each instance. I wouldn’t have to waste any time tabbing between my Terminal and the browser. Sweet! So I clicked SSH next to the master, and the connection was established – albeit pretty slowly. Then I ran the following trivial command to check the Hadoop installation:

hadoop version

Hadoop installation looked good. Next up, I wanted to check that Spark was also installed alongside successfully. Using the same SSH session I had just used to check the Hadoop installation, I ran the following command to fire up the Scala REPL for Spark:

spark-shell

FYI: There is also a Python REPL for Spark. Simply run the pyspark command instead of spark-shell.

As you can see from my screenshot above, the interactive shell came up without any issues at all. It also confirmed the Spark version of 1.5 right there and then! No extra packages to download and install. No config files to change. No hacking around. No – it just worked out of the box like it should. I like when technology does that.

Ok, so while the Spark shell was fired up, I took the opportunity to quickly test it by running the following command which should return the Spark version:

sc.version

Run the 2 examples provided (Spark & PySpark)

The Dataproc docs provide a few examples to run and test your cluster. You can either run them via the command line, or directly in the web UI. I thought it would be interesting to try both ways:

Run the PySpark & Spark examples through the command line

Run the same Spark example in the web UI

It was time to get Spark’ing.

I was just about to tab over to my terminal, when I noticed there is a really convenient new feature in the GDC. It’s called the Google Cloud Shell, and it initiates a session right there inside your browser on a temporary Compute Engine virtual machine instance running a Debian-based Linux OS. Outstanding!

First up was the PySpark example, which is just a simple Hello World:

Download the Python script from Cloud Storage:

gsutil cp gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py .

Run the PySpark job:

gcloud beta dataproc jobs submit pyspark --cluster dataproc-train-challenge hello-world.py

As you can see from the screenshot above, that went off without a hitch. So far so good I thought.

Next up, I ran the Spark job. The file:/// notation you see below denotes a Hadoop LocalFileSystem scheme; Cloud Dataproc installs /usr/lib/spark/lib/spark-examples.jar on the cluster’s master node when it creates the cluster:

Run the Spark job, passing the cluster name as an argument:

gcloud beta dataproc jobs submit spark --cluster dataproc-train-challenge --class org.apache.spark.examples.SparkPi --jars file:///usr/lib/spark/lib/spark-examples.jar 1000
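For anyone who hasn’t met the SparkPi example before, it estimates Pi by throwing random darts at a square and counting how many land inside the inscribed circle. The snippet below is a plain-Python sketch of that idea, not Spark’s actual source: the function name and seed are my own, and as far as I understand it the trailing 1000 in the command above is the number of slices the sampling is split across, which this single-process version simply ignores.

```python
import random


def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of Pi from `samples` random points."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(samples):
        # Pick a point uniformly in the square [-1, 1] x [-1, 1]
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count it if it falls inside the unit circle
        if x * x + y * y <= 1.0:
            inside += 1
    # circle area / square area = Pi / 4
    return 4.0 * inside / samples


print(estimate_pi(200_000))
```

Because the points are random, the answer is only ever approximate, and more samples narrow the error. On a cluster the sampling is spread across the workers and the counts are added up at the end.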

Turns out Pi is “roughly” 3.141756. Who would have thought? Anyway, I looked outside, and all of a sudden I’d reached Elsternwick station. Time had gotten away from me. There were just two more stops to go, with approximately 4 minutes left. I still wanted to run one more job via the UI, and I also had to delete the cluster!

It was going to be damn tight.

I flicked over to the web UI, and quickly clicked on Submit a job. I then proceeded to fill in all the details needed as quickly as humanly possible (e.g. cluster name, main class etc.) to run the SparkPi example job once again:

It was so easy to fill in all the details needed. I like how Dataproc has abstracted so much of the complexity away from users when submitting jobs to run on the cluster. I hit Submit, and the job started executing. This time it was possible to track the job’s progress directly inside the GDC:

I could see there were no errors, and I was visibly happy. But I was also stressing at the time. At this point I’d like to share the 2nd rookie mistake that I’d made on my challenge: I’d grossly underestimated how much time all these beautiful screenshots you see here before you would take. Yes, of course there is a keyboard shortcut, but when you’re travelling on a train which is moving around at high speed whilst you are trying to line everything up, let me assure you that it ain’t no walk in the park. I estimate that overall, the screenshots cost me at least 2 minutes.

The final job completed, and I checked the results of all of the 3 jobs that I had run by going to:

Cloud Dataproc -> Jobs

As you can see from above, all three jobs were reported as successful. There were 2 Spark jobs (1 run via the command line, and 1 via the web UI), and one PySpark job (run via the command line). The last job took 33 seconds, and it finished at 5:39:54pm!! Remember – my train was due to arrive at 5:40pm!

I looked outside, and I then realised that we’d just departed the last station before mine. Luckily the train was running about 1.5 minutes behind schedule. But, it was getting far too close for comfort.
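To put numbers on just how tight it was, here is a quick sketch of the time budget. The times are the ones quoted in this post; treating the delay as exactly 90 seconds is an assumption, since the train was only “about” 1.5 minutes late.

```python
from datetime import timedelta

# Times of day expressed as offsets from midnight
departure = timedelta(hours=17, minutes=23)
scheduled_arrival = timedelta(hours=17, minutes=40)
last_job_done = timedelta(hours=17, minutes=39, seconds=54)
delay = timedelta(seconds=90)  # assumption: "about 1.5 minutes" taken as exactly 90s


def seconds_left(done: timedelta, scheduled: timedelta, late_by: timedelta) -> int:
    """Seconds between the last job finishing and the (delayed) arrival."""
    return int((scheduled + late_by - done).total_seconds())


elapsed = last_job_done - departure
print(f"Time used when the last job finished: {elapsed}")
print(f"Seconds left to delete the cluster: {seconds_left(last_job_done, scheduled_arrival, delay)}")
```

Without the delay there would have been just 6 seconds left on the clock.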

Delete cluster

There are multiple ways that Google provide to delete the cluster, i.e. from the command line, the REST API or the web UI. I only had enough time left for the last of these, so I selected the cluster I had created (dataproc-train-challenge), and finally clicked Delete:



I waited. I could see my stop coming up. And I waited. About 60 seconds left. It was still showing ‘Deleting’. Dang, I’m not going to make it, I thought.

I waited some more. I could see the platform for my stop, and the train started slowing down. Still ‘Deleting’. Rats! I kept waiting. The train stopped, doors opened and I had to get off. It was still ‘Deleting’.

Game over.

Disembark and enjoy pint at local pub (and conclusion)

Technically, I hadn’t completed the Dataproc train challenge as the cluster didn’t delete before I stepped off my train. But I thought it was still a mighty big achievement nevertheless, and well deserving of a pint.

However, while I was sipping on my creamy Guinness at my local pub, I did check the GDC to see if the cluster had eventually been torn down, and deleted. It had. I’d probably just overshot it by a few seconds. Garrh!

Google’s new managed service is something that a lot of big data developers should be interested in. ‘Want to spin up a fully managed Hadoop/Spark cluster in under 90 seconds?’ No problemo compadre. ‘Want to easily submit jobs to your cluster?’ Well, look no further. ‘Want to see the future of working with big data?’ Well folks, here it is.

Google’s strategy is as clear as day. They want to take the burden off us, as developers, by handling all the heavy lifting and infrastructure for us. In other words, lower the barriers to entry, and entice more developers into their cloud. Dataproc is a managed service after all. And that means we’re freed up to concentrate on the code, and actually answering the questions that are being asked of us by our stakeholders – in a timely fashion.

It’s all about simplicity, and abstracting away the gnarly complexities that come with managing big clusters for processing massive datasets. This is clearly evident through all of the products sitting in Google’s big data suite, e.g. BigQuery, Dataflow, and now Dataproc too. Not everyone will like this though, preferring instead to have complete control of their environment (e.g. for privacy or legislative reasons), and having the ability to tweak and tune. But if you can, in my opinion, you should adopt this new approach.

The ludicrously fast speed & beautiful simplicity of Dataproc aside, I also think it’s a great thing that, instead of competing against open source technologies like Hadoop and Spark, Google are instead embracing and integrating them into their cloud ecosystem. This is especially useful for those who have already developed on the Hadoop/Spark stack, because migrating their existing jobs over to Dataproc should be a breeze.

I’d also like to take my tests further in the not so distant future:

I’m curious to conduct the same scientific train challenge with Elastic MapReduce (EMR) on AWS to see how it stacks up against Dataproc.

Spin up a much larger cluster, and see how long that takes.

Knock up some Spark jobs that handle billions of rows and TBs of data from BigQuery using the connector that is deployed with Dataproc, and see how the cluster performs when pushed.

Finally, although I didn’t quite manage to fully complete the Dataproc 17 minute train challenge, I did however manage to finish my pint well before the nice barman brought the next pint over to me. Cheers!