How many threads do I need?

tl;dr: it depends on your application.

But for those who want some insight into how to squeeze the most out of all those expensive cores you have purchased for your production site: bear with me, and I will shed some light on the mysteries surrounding multi-threaded Java applications.

The content is “optimized” towards the most typical Java EE application: one with a web frontend that allows end users to initiate a lot of small transactions, where a significant part of each transaction is spent waiting for some external resource, such as a query to return from the database or from any other integrated data source. But most of the content is also relevant for other applications, such as computation-heavy modeling applications or data-chugging batch processes.

But let’s start with the basics. In the type of application we are describing, you tend to have lots of users interacting with your application. Be it tens of simultaneously active users or tens of thousands, all those users expect the application to respond to them in a timely manner. And this is where you feel grateful to the operating system designers. Those guys had figured out this kind of need way before anybody had even dreamt about HTTP.

The solution they came up with helps in situations where you create more threads in your software than the underlying hardware can execute simultaneously. You also have threads at the hardware level: the cores of your CPU, or the logical processors that technologies such as Intel’s Hyper-Threading expose on top of each physical core. In any case, our application at hand can easily have spawned way more software threads than the underlying hardware can support directly. What your OS does in this case is similar to simple round-robin scheduling, during which each software thread gets its turn, called a time slice, to run on the actual hardware.
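To see how many hardware threads your own JVM has to work with, a minimal sketch along these lines may help. The class name HardwareThreads is made up for illustration. Note that Runtime.availableProcessors() reports the logical processors available to the JVM, which can be fewer than the machine has when CPU limits apply (for example in a container).

```java
final class HardwareThreads {

    /** Number of hardware threads (logical processors) available to this JVM. */
    static int count() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        int hardware = count();
        // Thread.activeCount() is only an estimate of the live threads
        // in the current thread group and its subgroups.
        int software = Thread.activeCount();
        System.out.println("Hardware threads available to the JVM: " + hardware);
        System.out.println("Live software threads (estimate):      " + software);
    }
}
```

Whenever the second number is larger than the first, the OS scheduler has to time-slice.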

Time slicing allows all threads to progress. Otherwise it is easy to imagine a situation where one of the users has initiated a truly expensive task and all other threads serving other users are starved.

So we have this amazing time slicing going on. Wouldn’t it then be feasible to set the number of threads to some LARGE_NUMBER and be done with it? Unfortunately not. There is overhead involved, in fact several types of overhead. So in order to make an educated decision while tuning your threads, let’s go one by one through the problems caused by having a LARGE_NUMBER of threads.

Register state saving/restoring. Processor registers hold a lot of state, which gets saved to memory each time the scheduler moves on to the next task and is restored when the task’s turn comes again. Luckily the time slices allocated by schedulers are relatively large, so the overhead of saving and restoring registers will most often not be the meanest of our enemies in multithreaded environments.

Locks. When a lock-holding thread uses up its time slice before releasing the lock, all other threads waiting for this particular lock must keep waiting until the lock holder gets another slice and another chance to release the lock. So if you have a lot of synchronization going on, check your threads’ behavior under heavy load. There is a chance that your synchronization code is causing a lot more context switching to take place because of the lock-holding threads. Analyzing thread dumps would be a good place to start investigating this peril.
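A thread dump can also be taken from inside the JVM. The sketch below uses the standard java.lang.management API to list the threads that are currently BLOCKED on a monitor lock, together with the lock owner. The class BlockedThreadReport and its demo() method are hypothetical names for illustration; a single snapshot like this only hints at contention, so in practice you would sample repeatedly under load.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class BlockedThreadReport {

    /** Returns one line per thread that is currently BLOCKED waiting for a monitor lock. */
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        // (false, false): skip locked-monitor and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    /** Demo with artificial contention: the caller holds a lock while another thread blocks on it. */
    static List<String> demo() {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                // Nothing to do; the point is only to compete for the lock.
            }
        }, "demo-waiter");
        List<String> report;
        synchronized (lock) {
            waiter.start();
            // Wait until the other thread is really stuck on our lock.
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.yield();
            }
            report = blockedThreads();
        }
        try {
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return report;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```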

Thrashing virtual memory. Modern operating systems take advantage of virtual memory by swapping data out to external storage, typically moving the least recently used (LRU) data in memory to a disk drive when the need arises. Which is good. But if you are running your applications with limited memory and a lot of threads fighting to fit their stacks and private data into memory, you might run into problems.

In each time-slicing round you might have threads swapping data in and out of external storage, which will significantly decrease your application’s performance. The problem is particularly nasty for Java applications: once your heap starts swapping, each Full GC run is going to take forever. Some gurus out there go as far as recommending turning off swapping at the OS level. On Linux distros you can achieve this via swapoff -a.

But the good news is that this problem has been significantly reduced in recent years, both by widespread 64-bit OS deployments allowing larger RAM and by SSDs replacing traditional spinning disks all around the world. But be aware of the enemy and, when in doubt, check the page-in/page-out rates for your processes.

Last but not least: thread cache state. All modern processors have caches built next to the cores, enabling operations on cached data to complete up to 100x faster than on data residing in RAM. Which is definitely cool. But what is uncool is when your threads start fighting for this extremely limited space. Then the LRU algorithm in charge of the cache starts evicting older data to make room for new data. And the evicted data could well be what the previous thread loaded into the cache during its time slice. So your threads can end up evicting each other’s data from the caches, again creating a thrashing problem.

If you are running on Intel architecture, a tool which might help you out in this case is Intel’s VTune Performance Analyzer.

So maybe throwing LARGE_NUMBER of threads into your application configuration would not be the wisest thing to do. But what hints could be given when configuring the number of threads?

First, certain applications can be configured to run with the number of threads equal to the number of underlying hardware threads. This might not be the case for the typical web application out there, but there are definitely good cases supporting this strategy. Note that when your threads are waiting for an external resource such as a relational database, those threads are removed from the round-robin schedule. So in a typical Java EE application it is not uncommon to have a lot more threads than hardware threads and still run without lock contention or other problems.

Next, it would be wise to segment your threads into different groups used for different purposes. A typical case would be separating computing threads from I/O threads. Computing threads tend to be busy most of the time, so it is important to keep their count below the underlying hardware capacity. I/O threads, such as those performing database round-trips, on the other hand spend most of their time waiting and thus do not join the fight for resources too often. So it is safe to have the number of I/O threads (way) higher than the number of hardware threads supporting your application.
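As a rough sketch of such a segmentation, the snippet below creates one pool for computing tasks, capped at the hardware thread count, and a larger one for I/O tasks. The sizing formula for the I/O pool (hardware threads multiplied by 1 + wait time / compute time) is a widely quoted rule of thumb and not something this article derives. The class name and the 90 ms / 10 ms workload are assumptions for illustration, so treat the result as a starting point for measurements.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SegmentedPools {

    /** Computing pool: never more threads than the hardware can run at once. */
    static int computePoolSize(int hardwareThreads) {
        return Math.max(1, hardwareThreads);
    }

    /**
     * I/O pool, sized by a common rule of thumb (an assumption, not from the article):
     * threads = hardwareThreads * (1 + waitTime / computeTime).
     */
    static int ioPoolSize(int hardwareThreads, double waitMillis, double computeMillis) {
        double estimate = hardwareThreads * (1 + waitMillis / computeMillis);
        return (int) Math.max(1, Math.round(estimate));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        ExecutorService computePool = Executors.newFixedThreadPool(computePoolSize(cores));
        // Assumed workload: each request waits ~90 ms for the database per ~10 ms of CPU work.
        ExecutorService ioPool = Executors.newFixedThreadPool(ioPoolSize(cores, 90, 10));

        System.out.println("compute threads: " + computePoolSize(cores));
        System.out.println("I/O threads:     " + ioPoolSize(cores, 90, 10));

        computePool.shutdown();
        ioPool.shutdown();
    }
}
```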

Then you should minimize thread creation and destruction. As those tend to be expensive operations, look into pooling solutions. You could be using Java EE infrastructure with thread pools already built in, or you can take a look at java.util.concurrent.ThreadPoolExecutor and the like for a solution. But you also should not be too shy when, on occasion, you need to increase or decrease the number of threads. Just avoid creating and removing them on events as predictable as the next HTTP request or JDBC connection.
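For the do-it-yourself route, a ThreadPoolExecutor could be set up roughly as follows. The class name RequestPool and all the numbers are illustrative assumptions. The core threads are created once and reused, extra threads appear only when the queue is full, and they retire after being idle for a minute.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class RequestPool {

    static ThreadPoolExecutor create(int coreThreads, int maxThreads, int queueCapacity) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                coreThreads,
                maxThreads,
                60L, TimeUnit.SECONDS,                       // idle threads above the core size retire after 60 s
                new ArrayBlockingQueue<>(queueCapacity),     // bounded queue, so overload becomes visible
                new ThreadPoolExecutor.CallerRunsPolicy());  // when saturated, the submitting thread runs the task
        pool.prestartAllCoreThreads();                       // pay the thread creation cost up front
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create(4, 16, 100);
        pool.execute(() -> System.out.println("handled by " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```

If you do need to resize on occasion, ThreadPoolExecutor also offers setCorePoolSize and setMaximumPoolSize, so the pool can be adjusted at runtime without being recreated.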

And as the last piece of advice we are handing out the most important one. Measure. Tweak the sizes of your thread pools and run your application under load. Measure both throughput and latency. Then optimize to achieve your goals. And then measure again. Rinse and repeat. Until you are satisfied with the result.
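To give an idea of what such a measurement loop might look like, here is a toy harness that runs the same batch of tasks against different pool sizes and reports throughput and mean latency. Everything in it is an assumption for illustration: the class name, the pool sizes, and above all the Thread.sleep call standing in for a database round-trip. It is no replacement for load-testing the real application, but the shape of the experiment is the same.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

final class PoolSizeBenchmark {

    /** Returns {throughput in tasks per second, mean latency in milliseconds}. */
    static double[] run(int poolSize, int tasks, long simulatedWaitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch done = new CountDownLatch(tasks);
        AtomicLong totalLatencyNanos = new AtomicLong();

        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            long submitted = System.nanoTime();
            pool.execute(() -> {
                try {
                    Thread.sleep(simulatedWaitMillis); // stand-in for a database round-trip
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    // Latency includes the time the task spent queued.
                    totalLatencyNanos.addAndGet(System.nanoTime() - submitted);
                    done.countDown();
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long elapsedNanos = System.nanoTime() - start;
        pool.shutdown();

        double throughput = tasks / (elapsedNanos / 1e9);
        double meanLatencyMillis = totalLatencyNanos.get() / (double) tasks / 1e6;
        return new double[] {throughput, meanLatencyMillis};
    }

    public static void main(String[] args) {
        for (int size : new int[] {1, 2, 4, 8, 16, 32}) {
            double[] result = run(size, 200, 20);
            System.out.printf("pool=%d throughput=%.1f tasks/s mean latency=%.1f ms%n",
                    size, result[0], result[1]);
        }
    }
}
```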

Don’t make any assumptions about how CPUs will perform. The amount of magic going on in CPUs these days is enormous. Note also that virtualization and JIT runtime optimizations add further layers of complexity. But those will be subjects for other talks, about which you will be notified in time if you subscribe to our Twitter feed.

While writing the article, the following resources were used as a source of inspiration:

—-

And yes. This article is the first hint about our research in other problem domains besides memory leaks. But we cannot yet predict if and when we are going to ship a solution for all the locking and cache contention problems out there. But there is definitely hope.