In this blog post I'll take a look at launching a 50-node Dataproc cluster and see if I can achieve query times that approach those seen using Google's BigQuery.

50-Node Cluster Up & Running

To start I requested two quota increases for my Google Cloud account: the first raised my CPU limit to 200 cores and the second raised my 'in use addresses' limit to 50.
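The quota increases themselves go through the Cloud Console, but you can confirm your current limits from the command line. Below is a quick sketch that filters the region's quota listing; I'm assuming gcloud's default YAML output, where each metric line sits between its limit and usage.

$ gcloud compute regions describe europe-west1 \
    | grep -B1 -A1 -E 'metric: (CPUS|IN_USE_ADDRESSES)$'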

The cluster will use 50 machines in total: 1 master node, 2 primary workers which will also act as data nodes, and 47 preemptible workers. Preemptible workers are discounted by up to 70% off the regular instance price.

Note that the --worker-boot-disk-size of 500 GB will only apply to the two primary worker nodes; the 47 preemptible secondary workers will have 100 GB of space each.
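If 100 GB per preemptible worker isn't enough, there is a separate flag for their boot disks. This is a sketch rather than something I've benchmarked here, and it assumes the flag is still called --preemptible-worker-boot-disk-size:

$ gcloud dataproc clusters create trips \
    --zone europe-west1-b \
    --num-preemptible-workers 47 \
    --preemptible-worker-boot-disk-size 500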

$ gcloud dataproc clusters create trips \
    --zone europe-west1-b \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 500 \
    --num-preemptible-workers 47 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 500 \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --project taxis-1273 \
    --initialization-actions 'gs://taxi-trips/presto.sh'
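While the cluster is provisioning you can poll its state; the status should transition to RUNNING once every node is ready. The cluster name below matches the one created above.

$ gcloud dataproc clusters describe trips \
    --project taxis-1273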

For notes on the bootstrap script and other parameters used please see my Billion Taxi Rides on Google's Dataproc running Presto blog post.

After executing the above command the cluster was up and running within two minutes. I was then able to SSH into the master node.
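Dataproc names the master node after the cluster with an -m suffix, so in this case it's trips-m. The SSH step looks like the following:

$ gcloud compute ssh trips-m \
    --zone europe-west1-b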