This article aims to help experienced developers deal with some of the bottlenecks they face when processing extreme volumes of data with limited resources. It is not about the fundamentals or the theoretical optimization techniques that are already widely discussed. The suggested solutions (or optimization tricks) are based on inferences drawn from practical problems faced while optimizing Apache Spark.

Long Lineage

Lazy evaluation in Spark means that actual execution does not happen until an action is triggered. The commands available in Spark can be divided into two types:

Actions (e.g. head(), show(), write(), count())

Transformations (e.g. map(), filter(), groupBy(), select())

Every transformation run in Spark is added to the lineage after a syntax check; actual execution happens only when an action is run.
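To see this in action, here is a minimal PySpark sketch (the file name and columns are made up for illustration). The two transformations only build up the lineage; nothing executes until the count() action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.read.csv("events.csv", header=True)    # hypothetical input file
filtered = df.filter(df.status == "ACTIVE")       # transformation: nothing runs yet
projected = filtered.select("user_id", "status")  # transformation: still nothing runs

projected.count()                                 # action: the lineage above executes now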

Optimization Trick: It is not advisable to chain a lot of transformations in one lineage, especially when you would like to process a huge volume of data with minimum resources. Rather, break the lineage by writing intermediate results into HDFS (HDFS is preferable if you have storage available, as writing to S3 can be slower).
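Continuing with the session from the sketch above, breaking the lineage can look like this (the HDFS path is just an example): write the intermediate result out, then read it back so the new DataFrame starts with a short, fresh lineage.

# 'intermediate' is some DataFrame partway through a long chain of transformations
intermediate = projected

# Persist it to HDFS to truncate the lineage built so far
intermediate.write.mode("overwrite").parquet("hdfs:///tmp/intermediate_result")

# Reading it back gives a DataFrame whose lineage starts at this file
intermediate = spark.read.parquet("hdfs:///tmp/intermediate_result")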

File System Preferences

The files we deal with can be divided into two types:

Splittable (e.g. LZO, Bzip2)

Non-splittable (e.g. Gzip, Zip)

For the purpose of this discussion, splittable files are those that can be processed in parallel in a distributed fashion, whereas non-splittable files must be processed on a single machine.

Optimization Trick: If you have a huge file (say, 10 GB and zipped) and you try to load it into Spark, it might get processed by just one node (or executor) if it is not splittable, which can be a serious bottleneck. If you come across such a case and the big file is in S3, it is a good idea to use s3cmd to move it from S3 into HDFS and unzip it there. If it is already in HDFS, unzip it before you load it into Spark.
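A quick way to confirm the symptom (the paths below are illustrative, and spark is the session from earlier) is to check how many partitions Spark created right after the read; a non-splittable gzip file typically lands in a single partition.

gz_df = spark.read.csv("hdfs:///data/huge_file.csv.gz", header=True)
print(gz_df.rdd.getNumPartitions())     # often 1: one executor does all the work

plain_df = spark.read.csv("hdfs:///data/huge_file.csv", header=True)
print(plain_df.rdd.getNumPartitions())  # the unzipped file splits into many partitions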

Note: We will discuss columnar file formats in the PPD section below.

Writing Queries and/or Transformations

The biggest mistake people make in big data systems is trying to “optimize queries” when in fact they should be “optimizing data”. “Simplicity is the key”, and this applies to all distributed systems, including Spark. In practice, this means you should not write complex queries in Spark; rather, break them down into steps that are as simple as possible. People assume that a higher number of steps increases processing time, but that is not the case: Spark may internally combine several steps and perform them at once.

Optimization Trick: Always try to break your queries (or transformations) into granular steps instead of writing one big query. Operations chained in Spark are separate steps, not a single big query or transformation; a sketch follows below.
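As a hedged illustration (the table and column names are invented), the same logic reads much better as small named steps; Spark's optimizer may still combine them into a single physical plan, so the extra steps cost nothing.

orders = spark.read.parquet("hdfs:///data/orders")

# Each step is one simple, readable transformation
recent = orders.filter(orders.order_date >= "2019-01-01")
paid = recent.filter(recent.status == "PAID")
by_customer = paid.groupBy("customer_id").count()

by_customer.show()  # action: triggers the combined plan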

Predicate Push Down(PPD)

PPD, in simple terms, is the process of selecting only the data required for processing when querying a huge table. For example, if you have a table of 100 columns and you are querying only 10 of them, with PPD the data for only those 10 columns is selected for further processing. Another example: if a query has a filter clause (e.g. a where clause), the filter is applied first to reduce the number of records picked up for processing. This significantly improves performance by cutting the number of records read and written, and hence the amount of input/output.

Columnar file formats give us a great way to harness the power of PPD, as they inherently enable it. Examples of columnar file formats include Parquet, RC (Record Columnar), and ORC (Optimized Row Columnar).
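Here is a minimal sketch of PPD at work (the path and column names are assumptions): with a Parquet source, the physical plan printed by explain() typically shows a ReadSchema limited to the selected columns and a PushedFilters entry for the where condition.

users = spark.read.parquet("hdfs:///data/users")

result = users.select("id", "country").filter(users.country == "IN")

# For Parquet sources the plan usually lists PushedFilters and a
# ReadSchema containing only the selected columns
result.explain()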

Optimization Trick: There are two important notes to make here.

Use the Parquet format wherever feasible when reading and writing files to HDFS or S3, as Parquet performs very well with Spark. In particular, use it for all the intermediate steps where you write data to HDFS to break the lineage (as mentioned under the optimization trick in the Long Lineage section).

Always try to identify the “filters” in your data processing pipeline and move them as early as you can (see the sketch after this list).
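As a rough example of moving filters up (table and column names invented), filtering both sides before a join means far less data gets shuffled across the cluster than filtering the joined result afterwards.

orders = spark.read.parquet("hdfs:///data/orders")
customers = spark.read.parquet("hdfs:///data/customers")

# Filter first, join later: only the surviving rows are shuffled
recent_orders = orders.filter(orders.order_date >= "2019-01-01")
active_customers = customers.filter(customers.active == True)

joined = recent_orders.join(active_customers, "customer_id")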

Data Skew Checks

The performance of a distributed system depends heavily on how well distributed the data is. One way to check the distribution is to look at the number of partitions of an RDD or a DataFrame.

Optimization Trick: Check the number of partitions of your DataFrames or RDDs just before you carry out any complex operation. If you find the number of partitions is too low, it is a good idea to repartition to increase it. You can use the line of code below to check the number of partitions in PySpark.

df.rdd.getNumPartitions()
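If the count comes back too low, repartitioning before the heavy operation spreads the data across more executors. The threshold and target of 200 below are illustrative values, to be tuned for your cluster.

if df.rdd.getNumPartitions() < 200:   # 200 is an illustrative threshold, not a rule
    df = df.repartition(200)          # shuffles data into more, evenly sized partitions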

Conclusions

In big data systems, it is advisable to optimize the data first, before thinking about optimizing queries.

The second part of this story is available at the link below. Kindly give it a read and share your feedback.

https://medium.com/@brajendragouda/5-key-factors-to-keep-in-mind-while-optimising-apache-spark-in-aws-part-2-c0197276623c