Short introduction

This is the first part of the Spark gotchas series, covering some of the limitations I've run into during my 9-to-5 gig. I will try to keep these posts short and sweet, so let's get going! Note: I'm using Spark 2.4.0.

Apache Spark is awesome. It provides a very useful set of APIs, but the abstractions behind them tend to leak really badly from time to time. Well, every 24 hours on average in my case.

Here's a short brain teaser: what is going to blow up when joining two datasets this way?

scala> foos
res0: org.apache.spark.sql.Dataset[Foo] = [someId: string, otherId: string]

scala> bars
res1: org.apache.spark.sql.Dataset[Bar] = [someId: string, otherId: string]

scala> foos.joinWith(bars, foos("someId") === bars("someId") || foos("otherId") === bars("otherId"), "left_outer")
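In case you want to play along at home: here's a minimal spark-shell setup that reproduces the two datasets above. The case classes match the printed schemas, but the sample rows are made up for illustration.

// Paste into spark-shell, which already provides the `spark` session.
case class Foo(someId: String, otherId: String)
case class Bar(someId: String, otherId: String)

import spark.implicits._

// Tiny in-memory datasets; the real inputs would come from tables or files.
val foos = Seq(Foo("a1", "b1"), Foo("a2", "b2")).toDS()
val bars = Seq(Bar("a1", "b9"), Bar("a9", "b2")).toDS()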

Have you managed to solve the riddle by yourself? Essentially, Spark will try to broadcast the entire bars dataset, which means the whole collection is going to be serialized and sent to the driver. I don't know if you've ever written Spark production code, but this is a big no-no when the broadcasted dataset is of unknown size. Let's look at the plan.

scala> foos.joinWith(bars, foos("someId") === bars("someId") || foos("otherId") === bars("otherId"), "left_outer").explain()
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftOuter, ((_1#35.someId = _2#36.someId) || (_1#35.otherId = _2#36.otherId))
:- LocalTableScan [_1#35]
+- BroadcastExchange IdentityBroadcastMode
   +- LocalTableScan [_2#36]

BroadcastNestedLoopJoin will, at some point, send the entire dataset to the driver, and we'll hit the exceeding spark.driver.maxResultSize exception as soon as we run this join on anything other than a hello-world happy-path dataset.

The maxResultSize option caps how much data our executors can send to the driver in "one shot", and here that amount is directly proportional to the size of our bars collection. So increasing this value is like playing Russian roulette: you never know when it's going to blow up, and when it does, it blows up straight in your face.
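For reference, this is where the knob lives. The 2g below is purely illustrative; the default is 1g, and 0 means unlimited, which is even more dangerous.

import org.apache.spark.sql.SparkSession

// Raising the cap buys headroom, not safety: the driver still has to
// hold the whole broadcast collection in memory at some point.
// (In spark-shell you'd pass this as --conf on the command line instead.)
val spark = SparkSession.builder()
  .config("spark.driver.maxResultSize", "2g")
  .getOrCreate()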

Spark is very smart about optimizing queries, so maybe your first thought is to make it a little bit goofier. There's an option, spark.sql.autoBroadcastJoinThreshold, which determines whether Spark may broadcast a table based on its estimated size. But here's the bummer: the same plan is generated even if we tell Spark not to use broadcasts at all by setting this value to -1.
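You can verify this in a couple of lines: the threshold goes to -1, yet explain() still prints a BroadcastNestedLoopJoin.

// Disable size-based auto-broadcast entirely...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// ...and the disjunctive join is still planned as a BroadcastNestedLoopJoin.
foos.joinWith(bars,
  foos("someId") === bars("someId") || foos("otherId") === bars("otherId"),
  "left_outer").explain()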

To make it more complicated, the plan is generated without broadcasting if we join using the conjunction operator (&&) instead.

scala> foos.joinWith(bars, foos("someId") === bars("someId") && foos("otherId") === bars("otherId"), "left_outer").explain()
== Physical Plan ==
SortMergeJoin [_1#47.someId, _1#47.otherId], [_2#48.someId, _2#48.otherId], LeftOuter
:- *(1) Sort [_1#47.someId ASC NULLS FIRST, _1#47.otherId ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(_1#47.someId, _1#47.otherId, 200)
:     +- LocalTableScan [_1#47]
+- *(2) Sort [_2#48.someId ASC NULLS FIRST, _2#48.otherId ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(_2#48.someId, _2#48.otherId, 200)
      +- LocalTableScan [_2#48]

What’s the solution?

If you think about it a little, it's no surprise. I don't know of any relational SQL engine that can optimize a join on a disjunctive predicate into either a sort merge join or a hash join, so the natural fallback for Spark is a nested loop join, which requires broadcasting the entire dataset. To work around the issue, we can split the logic into two joins: first join on someId, then take the rows that didn't find a match and join those on otherId. This produces a plan without any broadcast gizmos.

val joinedOnSomeId =
  foos.joinWith(bars, foos("someId") === bars("someId"), "left_outer")

// Rows that found a match on someId...
val successfullyJoined = joinedOnSomeId.filter(_._2 != null)

// ...and the ones that didn't: strip the null side and retry on otherId.
val notSuccessfullyJoined = joinedOnSomeId.filter(_._2 == null).map(_._1)

val joinedOnOtherId = notSuccessfullyJoined.joinWith(
  bars, notSuccessfullyJoined("otherId") === bars("otherId"), "left_outer")

successfullyJoined.union(joinedOnOtherId).explain()

And the Catalyst plan.

Union
:- SortMergeJoin [_1#10.someId], [_2#11.someId], LeftOuter
:  :- *(1) Sort [_1#10.someId ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(_1#10.someId, 200)
:  :     +- LocalTableScan [_1#10]
:  +- *(2) Sort [_2#11.someId ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(_2#11.someId, 200)
:        +- LocalTableScan [_2#11]
+- SortMergeJoin [_1#24.otherId], [_2#25.otherId], LeftOuter
   :- *(6) Sort [_1#24.otherId ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(_1#24.otherId, 200)
   :     +- *(5) Project [named_struct(someId, someId#21, otherId, otherId#22) AS _1#24]
   :        +- *(5) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line15.$read$$iw$$iw$Foo, true]).someId, true, false) AS someId#21, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line15.$read$$iw$$iw$Foo, true]).otherId, true, false) AS otherId#22]
   :           +- *(5) MapElements <function1>, obj#20: $line15.$read$$iw$$iw$Foo
   :              +- *(5) Filter <function1>.apply
   :                 +- *(5) DeserializeToObject newInstance(class scala.Tuple2), obj#19: scala.Tuple2
   :                    +- SortMergeJoin [_1#10.someId], [_2#11.someId], LeftOuter
   :                       :- *(3) Sort [_1#10.someId ASC NULLS FIRST], false, 0
   :                       :  +- ReusedExchange [_1#10], Exchange hashpartitioning(_1#10.someId, 200)
   :                       +- *(4) Sort [_2#11.someId ASC NULLS FIRST], false, 0
   :                          +- ReusedExchange [_2#11], Exchange hashpartitioning(_2#11.someId, 200)
   +- *(7) Sort [_2#25.otherId ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(_2#25.otherId, 200)
         +- LocalTableScan [_2#25]
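Since the workaround relies entirely on the shape of the generated plan, it can silently regress after a Spark upgrade or a change to the join condition. A cheap guard in a test catches that early; here's a minimal sketch (the helper is my own, not a Spark API):

import org.apache.spark.sql.Dataset

// Fail fast if Catalyst has picked a broadcast nested loop join again.
def assertNoBroadcastNestedLoopJoin(ds: Dataset[_]): Unit = {
  val plan = ds.queryExecution.executedPlan.toString
  require(!plan.contains("BroadcastNestedLoopJoin"),
    s"Unexpected BroadcastNestedLoopJoin in plan:\n$plan")
}

assertNoBroadcastNestedLoopJoin(successfullyJoined.union(joinedOnOtherId))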

Summary

The idea behind Spark's original RDD API was noble: you operate on distributed collections as if they were regular Scala collections and don't care how Spark handles the computations internally. However, some time ago my colleagues and I were competing over who could learn more about Spark's internals just by solving the cryptic errors we encountered during our day-to-day job. That's how leaky this abstraction is. It's still better than the original MapReduce, though.

If you can’t wait to read more about Spark gotchas, follow me on Twitter or subscribe to the mailing list.