Update February 2015: this blog post has been superseded by the Server Optimization Guide. Please read that guide instead.

Phusion Passenger turns Apache and Nginx into a full-featured application server for Ruby and Python web apps. It has a strong focus on ease of use, stability and performance. Phusion Passenger is built on top of tried-and-true, battle-hardened Unix technologies, yet at the same time introduces innovations not found in most traditional Unix servers. Since mid-2012, it aims to be the ultimate polyglot application server.

We recently received a support inquiry from a Phusion Passenger Enterprise customer regarding excessive process creation activity. During peak times, Phusion Passenger would suddenly create a lot of processes, making the server slow or unresponsive for a period of time. This is because Phusion Passenger spawns and shuts down application processes according to traffic, but they apparently had irregular traffic patterns during peak times. Since their servers were dedicated for 1 application only, the solution was to make the number of processes constant regardless of traffic. This could be done by setting PassengerMinInstances to a value equal to PassengerMaxPoolSize .

The customer then raised the question: what is the best value for PassengerMaxPoolSize ? This is a non-trivial question, and the answer encompasses more than just PassengerMaxPoolSize . In this article we’re going to shed more light on this topic.

For simplicity reasons, we assume that your server only hosts 1 web application. Things become more complicated when more web applications are involved, but you can use the principles in this article to apply to multi-application server environments.

Aspects of concurrency tuning

The goal of tuning is usually to maximize throughput. Increasing the number of processes or threads increases the maximum throughput and concurrency, but there are several factors that should be kept in mind.

Memory. More processes implies a higher memory usage. If too much memory is used then the machine will hit swap, which slows everything down. You should only have as many processes as memory limits comfortably allow. Threads use less memory, so prefer threads when possible. You can create tens of threads in place of one process.

More processes implies a higher memory usage. If too much memory is used then the machine will hit swap, which slows everything down. You should only have as many processes as memory limits comfortably allow. Threads use less memory, so prefer threads when possible. You can create tens of threads in place of one process. Number of CPUs. True (hardware) concurrency cannot be higher than the number of CPUs. In theory, if all processes/threads on your system use the CPUs constantly, then: You can increase throughput up to NUMBER_OF_CPUS processes/threads. Increasing the number of processes/threads after that point will increase virtual (software) concurrency, but will not increase true (hardware) concurrency and will not increase maximum throughput. Having more processes than CPUs may decrease total throughput a little thanks to context switching overhead, but the difference is not big because OSes are good at context switching these days. On the other hand, if your CPUs are not used constantly, e.g. because they’re often blocked on I/O, then the above does not apply and increasing the number of processes/threads does increase concurrency and throughput, at least until the CPUs are saturated.

True (hardware) concurrency cannot be higher than the number of CPUs. In theory, if all processes/threads on your system use the CPUs constantly, then: Blocking I/O. This covers all blocking I/O, including hard disk access latencies, database call latencies, web API calls, etc. Handling input from the client and output to the client does not count as blocking I/O, because Phusion Passenger has buffering layers that relief the application from worrying about this. The more blocking I/O calls your application process/thread makes, the more time it spends on waiting for external components. While it’s waiting it does not use the CPU, so that’s when another process/thread should get the chance to use the CPU. If no other process/thread needs CPU right now (e.g. all processes/threads are waiting for I/O) then CPU time is essentially wasted. Increasing the number processes or threads decreases the chance of CPU time being wasted. It also increases concurrency, so that clients do not have to wait for a previous I/O call to be completed before being served.

With these in mind, we give the following tuning recommendations. These recommendations assume that your machine is dedicated to Phusion Passenger. If your machine also hosts other software (e.g. a database) then you should look at the amount of RAM that you’re willing to reserve for Phusion Passenger and Phusion Passenger-served applications.

Tuning the application process and thread count

In our experience, a typical single-threaded Rails application process uses 100 MB of RAM on a 64-bit machine, and by contrast, a thread would only consume 10% as much. We use this fact in determining a proper formula.

Step 1: determine the system’s limits

First, let’s define the maximum number of (single-threaded) processes, or the number of threads, that you can comfortably have given the amount of RAM you have. This is a reasonable upper limit that you can reach without degrading system performance. We use the following formulas.

In purely single-threaded multi-process deployments, the formula is as follows:

max_app_processes = (TOTAL_RAM * 0.75) / RAM_PER_PROCESS

This formula is derived as follows:

(TOTAL_RAM * 0.75) : We can assume that there must be at least 25% of free RAM that the operating system can use for other things. The result of this calculation is the RAM that is freely available for applications.

: We can assume that there must be at least 25% of free RAM that the operating system can use for other things. The result of this calculation is the RAM that is freely available for applications. / RAM_PER_PROCESS : Each process consumes a roughly constant amount of RAM, so the maximum number of processes is a single devision between the aforementioned calculation and this constant.

In multithreaded deployments, the formula is as follows:

max_app_threads_per_process = ((TOTAL_RAM * 0.75) - (NUMBER_OF_PROCESSES * RAM_PER_PROCESS * 0.9)) / (RAM_PER_PROCESS / 10)

Here, NUMBER_OF_PROCESSES is the number of application process you want to use. In case of Ruby or Python, this should be equal to NUMBER_OF_CPUS. This is because both Ruby and Python have a Global Interpreter Lock so that they cannot utilize multicore no matter how many threads they’re using. By using multiple processes, you can utilize multicore. If you’re using a language runtime that does not have a Global Interpreter Lock, e.g. JRuby or Rubinius, then NUMBER_OF_PROCESSES can be 1.

This formula is derived as follows:

(TOTAL_RAM * 0.75) : The same as explained earlier.

: The same as explained earlier. - (NUMBER_OF_PROCESSES * RAM_PER_PROCESS) : In multithreaded deployments, the application processes consume a constant amount of memory, so we deduct this from the RAM that is available to applications. The result is the amount of RAM available to application threads.

: In multithreaded deployments, the application processes consume a constant amount of memory, so we deduct this from the RAM that is available to applications. The result is the amount of RAM available to application threads. / (RAM_PER_PROCESS / 10) : A thread consumes about 10% of the amount of memory a process would, so we divide the amount of RAM available to threads with this number. What we get is the number of threads that the system can handle.

On 32-bit systems, max_app_threads_per_process should not be higher than about 200. Assuming an 8 MB stack size per thread, you will run out of virtual address space if you go much further. On 64-bit systems you don’t have to worry about this problem.

Step 2: derive the applications’ needs

The earlier two formulas were not for calculating the number of processes or threads that application needs, but for calculating how much the system can handle without getting into trouble. Your application may not actually need that many processes or threads! If your application is CPU-bound, then you only need a small multiple of the number of CPUs you have. Only if your application performs a lot of blocking I/O (e.g. database calls that take tens of milliseconds to complete, or you call to Twitter) do you need a large number of processes or threads.

Armed with this knowledge, we derive the formulas for calculating how many processes or threads we actually need.

If your application performs a lot of blocking I/O then you should give it as many processes and threads as possible: # Use this formula for purely single-threaded multi-process deployments. desired_app_processes = max_app_processes # Use this formula for multithreaded deployments. desired_app_threads_per_process = max_app_threads_per_process

If your application doesn’t perform a lot of blocking I/O, then you should limit the number of processes or threads to a multiple of the number of CPUs to minimize context switching: # Use this formula for purely single-threaded multi-process deployments. desired_app_processes = min(max_app_processes, NUMBER_OF_CPUS) # Use this formula for multithreaded deployments. desired_app_threads_per_process = min(max_app_threads_per_process, 2 * NUMBER_OF_CPUS)

Step 3: configure Phusion Passenger

You should put the number for desired_app_processes into the PassengerMaxPoolSize option. Whether you want to make PassengerMinInstances equal to that number or not is up to you: doing so will make the number of processes static, regardless of traffic. If your application has very irregular traffic patterns, response times could drop while Passenger spins up new processes to handle peak traffic. Setting PassengerMinInstances as high as possible prevents this problem.

If desired_app_processes is 1, then you should set PassengerSpawnMethod conservative (on Phusion Passenger 3 or earlier) or PassengerSpawnMethod direct (on Phusion Passenger 4 or later). By using direct/conservative spawning instead of smart spawning, Phusion Passenger will not keep an ApplicationSpawner/Preloader process around. This is because an ApplicationSpawner/Preloader process is useless when there’s only 1 application process.

In order to use multiple threads you must use Phusion Passenger Enterprise 4. The open source version of Phusion Passenger does not support multithreading, and neither does version 3 of Phusion Passenger Enterprise. At the time of writing, Phusion Passenger Enterprise 4.0 is on its 4th Release Candidate. You can download it from the Customer Area.

You should put the number for desired_app_threads_per_process into the PassengerThreadCount option. If you do this, you also need to set PassengerConcurrencyModel thread in order to turn on multithreading support.

Possible step 4: configure Rails

Only if you’re on a multithreaded deployment do you need to configure Rails.

Rails is thread-safe since version 2.2, but you need to enable thread-safety by setting config.thread_safe! in config/environments/production.rb .

You should also increase the ActiveRecord pool size because it limits concurrency. You can configure it in config/database.yml . Set the pool size to the number of threads. But if you believe your database cannot handle that much concurrency, keep it at a low value.

Example 1: purely single-threaded multi-process deployment with lots of blocking I/O

Suppose you have 1 GB of RAM and lots of blocking I/O, and you’re on a purely single-threaded multi-process deployment.

# Use this formula for purely single-threaded multi-process deployments. max_app_processes = (1024 * 0.75) / 100 = 7.68 desired_app_processes = max_app_processes = 7.68

Conclusion: you should use 7 or 8 processes. Phusion Passenger should be configured as follows:

PassengerMaxPoolSize 7

However a concurrency of 7 or 8 is way too low if your application performs a lot of blocking I/O. You should use a multithreaded deployment instead, or you need to get more RAM so you can run more processes.

Example 2: multithreaded deployment with lots of blocking I/O

Consider the same machine and application (1 GB RAM, lots of blocking I/O), but this time you’re on a multithreaded deployment with 2 application processes. How many threads do you need per process?

Let’s assume that we’re using Ruby and that we have 4 CPUs. Then:

# Use this formula for multithreaded deployments. max_app_threads_per_process = ((1024 * 0.75) - (4 * 100)) / (100 / 10) = 368 / 10 = 36.8

Conclusion: you should use 4 processes, each with 36-37 threads, so that your system ends up with . Phusion Passenger Enterprise should be configured as follows:

PassengerMaxPoolSize 4 PassengerConcurrencyModel thread PassengerThreadCount 36

Configuring the web server

If you’re using Nginx then it does not need configuring. Nginx is evented and already supports a high concurrency out of the box.

If you’re using Apache, then prefer the worker MPM (which uses a combination of processes and threads) or the event MPM (which is similar to the worker MPM, but better) over the prefork MPM (which only uses processes) whenever possible. PHP requires prefork, but if you don’t use PHP then you can probably use one of the other MPMs. Make sure you set a low number of processes and a moderate to high number of threads.

Because Apache performs a lot of blocking I/O (namely HTTP handling), you should give it a lot of threads so that it has a lot of concurrency. The number of threads should be at least the number of concurrent clients that you’re willing to serve with Apache. A small website can get away with 1 process and 100 threads. A large website may want to have 8 processes and 200 threads per process (resulting in 1600 threads in total).

If you cannot use the event MPM, consider putting Apache behind an Nginx reverse proxy, with response buffering turned on on the Nginx side. This reliefs a lot of concurrency problems from Apache. If you can use the event MPM then adding Nginx to the mix does not provide many advantages.

Conclusion

If your application performs a lot of blocking I/O, use lots of processes/threads. You should move away from single-threaded multiprocessing in this case, and start using multithreading.

If your application is CPU-bound, use a small multiple of the number of CPUs.

Do not exceed the number of processes/threads your system can handle without swapping.

We at Phusion are regularly updating our products. Want to stay up to date? Fill in your name and email address below and sign up for our newsletter. We won’t spam you, we promise.