To keep it simple, let's assume you have this loop:
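The original snippet is not included in this copy, so here is a plausible stand-in: a CPU-bound loop over a large array. The Math.sin call is an illustrative placeholder for the real per-element work, and the array size matches the 8388608 elements used in the benchmark later on:

```java
// Illustrative stand-in: some non-trivial, CPU-bound work per element.
double[] array = new double[8_388_608];
for (int i = 0; i < array.length; i++) {
    array[i] = Math.sin(i); // placeholder for the real computation
}
```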

In order to parallelize it, we first wrap it into a method to be able to execute partitions of it:
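A minimal sketch of such a wrapper, assuming the same illustrative Math.sin work as above — the method takes a half-open index range [start, end) so that different callers can process disjoint partitions:

```java
public class Partitioned {
    // Process only the index range [start, end) so partitions can run independently.
    static void processPartition(double[] array, int start, int end) {
        for (int i = start; i < end; i++) {
            array[i] = Math.sin(i); // placeholder for the real computation
        }
    }
}
```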

The simplest approach is to create threads, where each thread executes one partition. In order to keep the whole CPU busy, we execute the last partition on the main thread, so it doesn't just sit there and wait for the worker threads to finish:
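A sketch of this first attempt, assuming the partition method from above (names like attempt1 and the lambda syntax are my own; the original was likely written with anonymous classes in the Java 7 era):

```java
import java.util.ArrayList;
import java.util.List;

public class Attempt1 {

    static void processPartition(double[] array, int start, int end) {
        for (int i = start; i < end; i++) {
            array[i] = Math.sin(i); // placeholder for the real computation
        }
    }

    static void attempt1(double[] array, int numThreads) {
        int partitionSize = array.length / numThreads;
        List<Thread> threads = new ArrayList<>();
        // Start numThreads - 1 workers, one partition each.
        for (int t = 0; t < numThreads - 1; t++) {
            final int start = t * partitionSize;
            Thread thread = new Thread(() -> processPartition(array, start, start + partitionSize));
            threads.add(thread);
            thread.start();
        }
        // The main thread handles the last partition itself instead of idling.
        processPartition(array, (numThreads - 1) * partitionSize, array.length);
        // Block until every worker is done; otherwise attempt1() could return too early.
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```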

The last loop looks a bit ugly. We have to call join() on every thread so that the main thread blocks until all worker threads complete; otherwise attempt1() might return before the work is done. We can avoid this last loop by using an ExecutorService instead, which provides convenience methods to wait for completion. It also manages a pool of threads more effectively than our hand-rolled version. So let's rewrite our attempt using ExecutorService:
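A sketch of the ExecutorService version, again under the same assumptions as before. invokeAll() submits all tasks and blocks until every one has completed, which replaces the manual join loop:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Attempt2 {

    static void processPartition(double[] array, int start, int end) {
        for (int i = start; i < end; i++) {
            array[i] = Math.sin(i); // placeholder for the real computation
        }
    }

    static void attempt2(double[] array, int numThreads) {
        ExecutorService executor = Executors.newFixedThreadPool(numThreads);
        int partitionSize = array.length / numThreads;
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            final int start = t * partitionSize;
            // The last partition also covers any remainder of the division.
            final int end = (t == numThreads - 1) ? array.length : start + partitionSize;
            tasks.add(() -> {
                processPartition(array, start, end);
                return null;
            });
        }
        try {
            executor.invokeAll(tasks); // blocks until every task has completed
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```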

Looks good. This is a solid way to parallelize loops. But there is still room for optimization. What happens if one thread is faster than the others, or one thread is particularly slow? In the end, we always have to wait until all threads complete before we know the work is done. For example, if there are 10 threads and one is slower than the rest, 9 threads just sit there doing nothing until their slowest sibling completes. This is where the idea of work stealing comes in. Instead of splitting the work into as many segments as there are available threads, we split it into many small tasks, and each thread starts with an equal share of them. If one thread finishes its tasks faster than the others, it steals unfinished tasks from the slower threads. That way all threads stay busy until completion, even if some are slower than others.

With Java 7 a new ExecutorService was shipped that implements work stealing: ForkJoinPool. Its usage differs a bit from our first two approaches. Instead of threads, you define recursive actions, which split up the work until it matches the desired task size. Applied to our example, it looks like this:
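A sketch of the fork/join version. The RecursiveAction keeps splitting its range in half until a chunk is at most TASK_SIZE elements, then processes it directly; the threshold value here is illustrative, not the author's:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class Attempt3 {

    static final int TASK_SIZE = 10_000; // illustrative threshold; tune for your workload

    static class PartitionAction extends RecursiveAction {
        final double[] array;
        final int start, end;

        PartitionAction(double[] array, int start, int end) {
            this.array = array;
            this.start = start;
            this.end = end;
        }

        @Override
        protected void compute() {
            if (end - start <= TASK_SIZE) {
                // Small enough: do the work directly.
                for (int i = start; i < end; i++) {
                    array[i] = Math.sin(i); // placeholder for the real computation
                }
            } else {
                // Too big: split the range in half and process both halves as subtasks.
                int mid = (start + end) >>> 1;
                invokeAll(new PartitionAction(array, start, mid),
                          new PartitionAction(array, mid, end));
            }
        }
    }

    static void attempt3(double[] array) {
        new ForkJoinPool().invoke(new PartitionAction(array, 0, array.length));
    }
}
```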

Running these three attempts on my Intel i7-2620M CPU @ 2.7 GHz laptop with 4 cores and an array size of 8388608 results in these average running times (10 iterations each):

sequential avg: 468 ms
attempt 1 avg:  143 ms
attempt 2 avg:  138 ms
attempt 3 avg:  107 ms

Wow. That's an awesome speed-up! Java 7 rocks! Thanks, Doug Lea!

How big you set your task size is a trade-off. If it's too small, the overhead of scheduling and switching between tasks might outweigh the benefit of parallelization. If it's too big, you run into the problem of our first two attempts again: one task takes long to finish, and if it's the last one, all other threads have to wait for it. The optimal value depends on the kind of work you do, so I recommend playing with the task size until you get the best results.

Note that ForkJoinPool is called fork/join because you can not only fork() tasks but also join() them. This is useful if you depend on the results of your forked tasks, as in merge sort and similar divide-and-conquer problems.
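As a hypothetical illustration of that pattern (not from the original post), here is a RecursiveTask that sums an array: it forks the left half, computes the right half in the current thread, and joins the forked result to combine them:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: a fork/join task that returns a result.
public class SumTask extends RecursiveTask<Long> {

    static final int TASK_SIZE = 10_000; // illustrative threshold

    final long[] array;
    final int start, end;

    SumTask(long[] array, int start, int end) {
        this.array = array;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= TASK_SIZE) {
            long sum = 0;
            for (int i = start; i < end; i++) {
                sum += array[i];
            }
            return sum;
        }
        int mid = (start + end) >>> 1;
        SumTask left = new SumTask(array, start, mid);
        SumTask right = new SumTask(array, mid, end);
        left.fork();                     // run the left half asynchronously
        long rightSum = right.compute(); // compute the right half in this thread
        return rightSum + left.join();   // wait for the forked half and combine
    }

    static long sum(long[] array) {
        return new ForkJoinPool().invoke(new SumTask(array, 0, array.length));
    }
}
```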

I've put the full source code here, including how I calculated the numbers. Enjoy!