Internals of the join operation in spark

Broadcast Hash Join

When 1 of the dataframe is small enough to fit in the memory, it is broadcasted over to all the executors where the larger dataset resides and a hash join is performed.

This has 2 phase,

broadcast-> the smaller dataset is broadcasted across the executors in the cluster where the larger table is located.

hash join-> A standard hash join is performed on each executor.

There is no shuffling involved in this and can be much quicker.

spark.sql.autoBroadcastJoinThreshold

This can be configured to set the Maximum size in bytes for a dataframe to be broadcasted.

-1 will disable broadcast join

Default is 10485760 ie 10MB

scala> spark.conf.get(“spark.sql.autoBroadcastJoinThreshold”)

res0: String = 10485760

spark.broadcast.compress can be used to configure whether to compress the data before sending it. It uses the compression specified in the spark.io.compression.codec config and the default is lz4. We can use other compression codecs such as lz4,lzf, snappy, ZStandard.

spark.broadcast.compress is true by default.

Run explain to see if this type of join is used.

scala> customer.join(payment,Seq(“customerId”)).queryExecution.executedPlan

res13: org.apache.spark.sql.execution.SparkPlan =

*(5) Project [customerId#6, name#7, paymentId#18, amount#20]

+- *(5) SortMergeJoin [customerId#6], [customerId#19], Inner

:- *(2) Sort [customerId#6 ASC NULLS FIRST], false, 0

: +- Exchange hashpartitioning(customerId#6, 200)

: +- *(1) Project [_1#3 AS customerId#6, _2#4 AS name#7]

: +- *(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS _2#4]

: +- Scan[obj#2]

+- *(4) Sort [customerId#19 ASC NULLS FIRST], false, 0

+- Exchange hashpartitioning(customerId#19, 200)

+- *(3) Project [_1#14 AS paymentId#18, _2#15 …

You can notice here that, even though my Dataframes are small in size sometimes spark doesn’t recognize that the size of the dataframe is < 10 MB. To enforce this we can use the broadcast hint .

customer.join(broadcast(payment),Seq(“customerId”)) scala> customer.join(broadcast(payment),Seq(“customerId”)).queryExecution.executedPlan

res12: org.apache.spark.sql.execution.SparkPlan =

*(2) Project [customerId#6, name#7, paymentId#18, amount#20]

+- *(2) BroadcastHashJoin [customerId#6], [customerId#19], Inner, BuildRight

:- *(2) Project [_1#3 AS customerId#6, _2#4 AS name#7]

: +- *(2) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS _2#4]

: +- Scan[obj#2]

+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, false] as bigint)))

+- *(1) Project [_1#14 AS paymentId#18, _2#15 AS customerId#19, _3#16 AS amount#20]

+- *(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple3, true])._1 AS _1#14, …

If a broadcast hint is specified, the join side with the hint will be broadcasted irrespective of autoBroadcastJoinThreshold. If both sides of the join have the broadcast hints, the side with a smaller estimated physical size will be broadcasted. If there is no hint and if the estimated physical size of the table < autoBroadcastJoinThreshold then that table is broadcasted to all the executor nodes.

Spark has a BitTorrent-like implementation to perform broadcast. This is to avoid the driver being the bottleneck when sending data to multiple executors. In this, the broadcasting table is serialized and divided into small chunks by the driver and is stored in the BlockManager of the driver. In each executor, the executor first tries to locate the object from its BlockManager, remote fetch smaller chunks of the object from the driver and/or others executors if they are available in the other executors’ BlockManager.

Once an executor has the chunk, it puts the chunks in its BlockManager for other executors.

Normally, BHJ can perform faster than other join algorithms when the broadcast side is small enough. However, broadcasting tables is a network-intensive operation and it can cause OOM sometimes or even perform worse than other algorithms when the table broadcasted is big enough.

BHJ is not supported for a full outer join. For right outer join, the only left side table can be broadcasted and for other left joins only right table can be broadcasted.

Shuffle Hash Join

Shuffle hash join has 2 phases,

1) A shuffle phase, where the data from the join tables are partitioned based on the join key. What this phase does shuffles data across the partitions. The idea here is that if 2 tables have the same keys, they end up in the same partition so that the data required for joins is available in the same partition.

2) Hash join phase: The data on each partition performs a classic single node hash join algorithm.

What Shuffle hash join does is that breaks apart 1 big join of 2 tables into smaller chunks of the localized relatively smaller branch. Shuffle is a very expensive operation as it involves a lot of network-intensive movement of the data across the nodes in the cluster. Also, note that the creation of Hash tables is also an expensive operation and is memory bound. This is not well suited for joins that wouldn’t fit in memory.

Also, note that the performance of this is based on the distribution of keys in the dataset. The greater number of unique join keys the better data distribution we get. The maximum amount of parallelism that we can achieve is proportional to the number of unique keys .

Say we are joining 2 datasets based on something which would be unique like empId would be a good candidate over something like DepartmentName which wouldn’t have a lot of unique keys and would limit the maximum parallelism that we could achieve.

Gotchas:

We need to be cognizant about the situation in which 1 single partition(a small subset of all the partitions) getting way too much of data over the other partition. This can happen when there is a skew in our dataset. The distribution of the data should be uniform across the cluster.

Say we are joining Stock data on companies stock ticker and each partition gets data based on the company’s name. But there might not be a whole lot of data in the partition that house stocks of Z but there might be way too much of data on the partition that would house A such as Apple and Amazon. We would have to do something called as salting in these kinds of situations. Salting is a technique of adding some randomness in non-unique keys. Say we have 1 Million Amazon stocks in our dataset and Zillow stocks are 100. We selectively add a random integer to the Amazon stock so that the hash of the Amazon data is further uniformly distributed among different partitions.

The main thing to note here is that Shuffle Hash join will be used as the join strategy only when spark.sql.join.preferSortMergeJoin is set to false

and the cost to build a hash map is less than sorting the data. By default, sort merge join is preffered over Shuffle Hash Join.

ShuffledHashJoin is still useful when :

1) any partition of the build side could fit in memory

2)one table is much smaller than the other one, then cost to build a hash table on a smaller table is smaller than sorting the larger table.

Sort merge join

Sort merge join is the default join strategy if the matching join keys are sortable and not eligible for broadcast join or shuffle hash join. It is a very scalable approach and performs better than other joins most of the times. It has its traits from the legendary map-reduce programs. What makes it scalable is that it can spill the data to the disk and doesn’t require the entire data to fit inside the memory.

It has 3 phases:

1)Shuffle Phase: The 2 large tables are repartitioned as per the join keys across the partitions in the cluster.

2)Sort Phase: Sort the data within each partition parallelly.

3)Merge Phase: Join the 2 sorted + partitioned data. This is basically merging of the dataset by iterating over the elements and joining the rows having the same value for the join key.

Also, note that sometimes spark by default chooses Merge Sort Join which might not be always ideal.

Let's say we have a customer Dataframe and a payment dataframe.

scala> println(SizeEstimator.estimate(payment)/1000000.00)

40.329272 scala> println(SizeEstimator.estimate(customer)/1000000.00)

40.260272

Both of the data frames are almost 40MB. By default, spark chooses Sort-Merge Join but sometimes a Broadcast join can do better.

scala>customer.join(payment,"customerId").queryExecution.executedPlan

res28: org.apache.spark.sql.execution.SparkPlan =

*(5) Project [customerId#49, name#50, paymentId#37, amount#39]

+- *(5) SortMergeJoin [customerId#49], [customerId#38], Inner scala> spark.time(customer.join(payment,”customerId”).show)

Time taken: 693 ms

When we provide the broadcast hint,

scala>customer.join(broadcast(payment),"customerId").queryExecution.executedPlan

res29: org.apache.spark.sql.execution.SparkPlan =

*(2) Project [customerId#49, name#50, paymentId#37, amount#39]

+- *(2) BroadcastHashJoin [customerId#49], [customerId#38], Inner, BuildRight scala>spark.time(customer.join(broadcast(payment),"customerId").show)

Time taken: 253 ms

We have to remember that there is no single right answer to a problem. The right answer is always it depends . We have to utilize spark’s features generously as per our data.

Note: SizeEstimator.estimate gives an estimate of how much space the object takes up on the JVM heap. This will often be higher than the typical serialized size of the object.

Gotchas:

For Ideal performance of Sort-Merge join, it is important that all rows having the same value for the join key are available in the same partition. This warrants for the infamous partition exchange(shuffle) between executors.

Collocated partitions can avoid unnecessary data shuffle. Data needs to be evenly distributed n the join keys. The number of join keys is unique enough so that they can be equally distributed across the cluster to achieve the max parallelism from the available partitions.

Cartesian Join

When no join keys are specified and the join type is an inner join, CartesianProduct is chosen. This is an inherently very expensive operation and should not be chosen unless necessary. More on this here.

BroadcastNestedLoopJoin

This type of join is chosen when no joining keys are specified and either has a broadcast hint or if 1 side of the join could be broadcasted and is less than spark.sql.autoBroadcastJoinThreshold.

This type of join is also the final fall back option if no join keys are specified, and is not an inner join. But this could be very slow and can cause OutOfMemoryExceptions! if the broadcasted dataset is large enough.

Key Takeaways:

Sort-Merge join is the default join and performs well in most of the scenarios. And for cases, if you are confident enough that Shuffle Hash join is better than Sort-Merge join, disable Sort-Merge join for those scenarios. However, when the build size is smaller than the stream size, Shuffle Hash join will do better than Sort-Merge join. Tune the spark.sql.autoBroadcastJoinThreshold accordingly if deemed necessary. Try to use Broadcast joins wherever possible and filter out the irrelevant rows to the join key before the join to avoid unnecessary data shuffling. Joins without unique join keys or no join keys can often be very expensive and should be avoided.

The Part 1, and Part 2 are available here which covers the basics of Join operation. Thanks for reading! Please do share the article, if you liked it. Any comments or suggestions are welcome! Check out my other articles, Intro to Spark and Spark 2.4.