Repartitioning this way has one advantage: it distributes the data evenly across the partitions, so each partition holds roughly the same number of objects. The downside is that it requires a shuffle and the creation of an entirely new set of partitions. And even though the data ends up evenly partitioned, that doesn't mean you won't have to shuffle again on your next operation.

Repartition is the nuclear option, but it is often necessary. In our case, there really isn't a good alternative: if we know we want partitions by fruit type, this is what needs to happen. If we didn't care, we could do a pure repartition and Spark would create evenly distributed partitions.
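To make the two options concrete, here is a plain-Python sketch of the difference between repartitioning by a key and a pure, evenly distributed repartition. The fruit records and partition count are hypothetical stand-ins, not the book's actual dataset, and the functions only model what Spark does behind the scenes.

```python
def repartition_by_key(records, num_partitions, key):
    """Hash-partition: every record with the same key value lands in the
    same partition, the way a repartition-by-column does in Spark."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[hash(record[key]) % num_partitions].append(record)
    return partitions

def repartition_round_robin(records, num_partitions):
    """Pure repartition: records are spread evenly, ignoring any key."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

# Hypothetical sales records.
sales = [{"fruit": f, "qty": q} for f, q in
         [("apple", 3), ("pear", 1), ("apple", 2),
          ("plum", 5), ("pear", 4), ("plum", 2)]]

by_fruit = repartition_by_key(sales, 3, "fruit")   # all apples together
even = repartition_round_robin(sales, 3)           # two records per partition
```

Note the trade-off: the round-robin version guarantees even sizes, while the keyed version guarantees co-location of each fruit but may leave partitions unbalanced.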

Coalesce

Coalesce is like repartition, except that it does not require a shuffle. You can think of it as removing one of the partitions in our example and merging its data into the others. You may not end up with evenly distributed partitions, but in a number of scenarios that trade-off makes sense.
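Conceptually, coalesce folds existing partitions into fewer buckets by moving each partition whole, rather than rehashing every individual record. Here is a minimal plain-Python sketch of that idea; the partition contents are hypothetical.

```python
def coalesce(partitions, num_partitions):
    """Merge partitions into num_partitions buckets. Each original
    partition travels intact, so no per-record shuffle is needed."""
    merged = [[] for _ in range(num_partitions)]
    for i, partition in enumerate(partitions):
        merged[i % num_partitions].extend(partition)
    return merged

# Four unevenly sized partitions, coalesced down to two.
original = [["apple", "apple"], ["pear"], ["plum", "plum", "plum"], ["apple"]]
fewer = coalesce(original, 2)
# Because whole partitions are concatenated, the result is generally
# NOT evenly sized -- which matches coalesce's behavior described above.
```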

Consider my previous example of Map Partitions to write data to MySQL. You'll notice I used Coalesce there and reduced the partition count to the number of machines I had available. I did this because MySQL, in that instance, preferred one connection per host. Since I had 16 cores on the workers, I could not open 16 connections without running into trouble. I could have shared a connection across the 16 cores, but I preferred to just use 4 machines and chunk the data that way for that example.
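The write pattern described above can be sketched in plain Python: coalesce the data down to four partitions so that only one (simulated) database connection is opened per partition, instead of one per core. The `write_partition` helper and the data are stand-ins for the MySQL code, not the original example.

```python
def coalesce(partitions, num_partitions):
    """Fold partitions into fewer buckets, moving each one whole."""
    merged = [[] for _ in range(num_partitions)]
    for i, partition in enumerate(partitions):
        merged[i % num_partitions].extend(partition)
    return merged

connections_opened = []

def write_partition(rows):
    # Stand-in for opening one MySQL connection on a host and
    # bulk-inserting this partition's rows through it.
    connections_opened.append(len(rows))

# One tiny partition per core: 16 cores' worth of hypothetical rows.
sixteen = [[("apple", i)] for i in range(16)]

# Coalesce to 4 chunks -- one per machine -- before writing,
# so only 4 connections are opened instead of 16.
for chunk in coalesce(sixteen, 4):
    write_partition(chunk)
```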

You have the freedom to make this decision; however, it is not a free operation. There is some cost to doing this, but it is much less than the cost of repartitioning.

Windowing

Sometimes the questions you have about your data call for more than a simple count. Suppose you want to know the average number of fruit sold, or the average sold by day or by hour. You could create a report that looks something like this: