Why on earth does my memory consumption chart look like that? It’s a question I hear every week. To help answer that question, I wrote a Web server request simulator to model how Ruby uses memory over time, though it applies to other languages as well. We will use the output of that project to dissect why a web app’s memory would be expected to look like this:

Logistic function generated via Wolfram Alpha: Plot[100 / (1 + e^(-x/100)), {x, 0, 1000}]. The shape is asymptotic.

In this post, we’ll talk a little about what causes this shape of memory use over time. Then we will dig into what that memory behavior means in terms of optimizing your application.

Simulating One Request

Here is the output of simulating one thread handling one request:

Time runs along the horizontal axis, and memory is on the vertical. As time progresses, our thread processes the request and allocates objects, which require more memory. This behavior produces a diagonal line going up and to the right.

Once the request is over, the objects it allocated are no longer needed, and their slots can be recycled, so the amount of memory required goes back down to zero. On the graph, this drop to zero produces a “tooth” shape.
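The tooth shape can be sketched in a few lines of Ruby. This is a minimal illustration, not the actual simulator; the step counts and allocation sizes are made up:

```ruby
# One "request" allocates objects over several time steps, then releases
# them all when it finishes, producing a rising edge followed by a drop.
def simulate_one_request(steps: 5, alloc_per_step: 10)
  memory   = 0
  timeline = []
  steps.times do
    memory += alloc_per_step # allocations grow memory: the rising edge
    timeline << memory
  end
  timeline << 0              # request ends, objects collected: the drop
  timeline
end

simulate_one_request # => [10, 20, 30, 40, 50, 0]
```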

Ruby Tracks Max Memory: Multiple Requests with One Thread

Now that you understand the output format, let’s look at a few requests and add in another piece of data, the “Max total” memory:

This “max total” line at the top of the graph traces the total maximum amount of memory needed to run the application.

In this example, the first request needs a large amount of memory.

Ruby will allocate enough space to handle whatever task needs to be done. Then it will assume (correctly, in this case) that you’ll need that memory again in the future, so it holds onto it. While looking at these graphs, the top line roughly represents the memory requirements of your program that you would see in Heroku’s memory metrics dashboard, or from Activity Monitor locally (if you’re on a Mac).
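The “hold onto the max” behavior is just a running maximum. Here is a minimal sketch with made-up peak sizes showing how the reported footprint ratchets up and never comes back down:

```ruby
# The process footprint is a high-water mark: it rises to the largest
# request seen so far and stays there, even after smaller requests.
request_peaks = [20, 50, 35, 10] # peak memory units of successive requests
max_total = 0
footprint_over_time = request_peaks.map do |peak|
  max_total = [max_total, peak].max # only ever ratchets upward
  max_total
end

footprint_over_time # => [20, 50, 50, 50]
```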

The other important thing about this graph is that different requests allocate different quantities of objects. You can see this visually as some of the spikes are different shapes and sizes. These shapes might represent serving different endpoints or parameters such as /users?per_page=2 versus /users?per_page=42_000 .

Simulation with two threads - one request each

Your application is rarely serving only one request at a time. What does a server handling two concurrent workloads look like in terms of memory use?

When we simulate multiple requests, the high-water mark of the “max” memory needed to run the application is now the sum of all threads.

In this example, the first request needed a lot of memory, and while it was being prepared, the next request came in. You can see that when both threads are processing a request, the “Max Total” goes up proportional to the sum of all threads.

Thread two maxes out at 222 memory units. At this time, thread one is about 74 memory units. The “Max Total” for the whole system ends up being around 296 memory units.
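The arithmetic above can be checked directly. These numbers come straight from the example in the text:

```ruby
# The system-wide high-water mark is the sum of what each thread
# needs at the same moment in time.
thread_one_at_peak = 74  # memory units thread one holds at that moment
thread_two_peak    = 222 # thread two's maximum

max_total = thread_one_at_peak + thread_two_peak
max_total # => 296
```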

Simulation with two threads - ten requests each

Here’s another example with ten requests per thread:

Notice where the green “Max Total” line jumps above the other spikes: this is where the system is processing multiple requests at a time.

Simulation with two threads - 1,000 requests each

Here’s another example with 1000 requests:

It takes a while, but over time, memory use doubles. Threads one and two each max out at roughly 390 memory units, so overall memory use is 780 (390 * 2) memory units. This doubling happens because, eventually, two requests with the maximum memory requirements end up being served at the same time.

So what happens if we add a third thread? Do we expect it to use 1,170 memory units total?

Huh, it didn’t even come close to 1,170 memory units. In fact, it’s less memory than the two-thread example. Why? The total memory use depends not just on the number of threads, but also the distribution of requests we are getting.

What is the likelihood that the largest request will come in and hit all threads at the same time? In this case, it didn’t happen, but it doesn’t mean it won’t ever.
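That likelihood can be estimated with back-of-the-envelope math. In this sketch, the 1% traffic share is an assumption I picked for illustration; the point is only that the probability falls off exponentially with thread count:

```ruby
# If the largest endpoint makes up a fraction p of traffic, the chance
# that all N threads are serving it at the same instant is roughly p**N
# (assuming requests land on threads independently).
def chance_all_threads_at_max(p_largest, thread_count)
  p_largest**thread_count
end

chance_all_threads_at_max(0.01, 2)  # roughly 1e-4
chance_all_threads_at_max(0.01, 10) # roughly 1e-20: rare, but not impossible
```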

Simulating ten threads instead of just two

What happens if we move from two threads to ten? Would you expect our memory usage to be 10x? Let’s find out:

If memory use truly went up 10x, I would expect to see 3,900 (10 * 390) memory units being used.

This graph doesn’t show anywhere near that number, though. Why not? Our system still has the same theoretical maximum, but getting there means we would have to have several seemingly random events align perfectly. All ten threads would have to be serving the “largest” endpoint all at the same time.
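A tiny Monte Carlo sketch demonstrates this effect. The request-size distribution here is made up (uniform, which is cruder than real traffic), but it shows how far the observed total stays below the theoretical maximum:

```ruby
# Draw a random request size for each of ten threads, many times over,
# and track the worst combined total we ever observe.
MAX_REQUEST = 390
THREADS     = 10

srand(1) # fixed seed for reproducibility
worst_total = 1_000.times.map {
  THREADS.times.sum { rand(1..MAX_REQUEST) } # one sample: all threads busy
}.max

theoretical_max = THREADS * MAX_REQUEST # 3,900 memory units
worst_total < theoretical_max           # => true, and by a wide margin
```

Hitting `theoretical_max` would require all ten draws to land on 390 simultaneously, which has odds of roughly (1/390)^10 per sample.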

What does it all Mean?

Here are some conclusions that you can draw from these simulations:

Total memory use goes up as the number of threads is increased.

Memory use for an individual thread is a function of the largest possible request it will ever serve.

Memory use across all threads is based on a distribution: how likely is the maximum request to be hit simultaneously by all existing threads?

As your application executes over time, it is expected and natural that your memory requirements will increase until they hit a steady state.

Tune your application

If you want your application to use less memory, you need to move one of the factors we mentioned: number of threads, largest possible request, or the distribution of incoming requests.

You can decrease thread count to reduce your memory needs, but that might also lower your throughput.

You can add capacity by scaling out, such as adding additional dynos/servers. Adding capacity works because more servers/dynos spread out the requests, which reduces the chance that all threads on an individual machine are processing the largest request at the same time. This tactic might help when going from one server to two, but the returns diminish over time: the benefit of going from 99 servers to 100 won’t have a significant impact on the overall distribution of requests for individual machines.

In my experience, neither of these is a viable long-term solution. Reducing object allocation is the best path to reducing your overall memory needs.

The good news is that reducing object allocation in your largest requests also means your application runs faster. The bad news is that moving this allocation number requires active effort and an intermediate-to-advanced understanding of performance behavior.

If you want to start improving your application’s memory consumption, here are some additional resources:

When working on reducing your application’s memory footprint, focus on the largest endpoint. If you can reduce your largest request by a factor of two in the simulation, from 390 memory units to 195, then your maximum theoretical usage at ten threads becomes 1,950 units. Neat!
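The arithmetic from the paragraph above, written out:

```ruby
# Halving the largest request halves the theoretical maximum footprint
# across all threads. Numbers come from the earlier simulation.
threads         = 10
largest_request = 390 # memory units
halved          = largest_request / 2

threads * largest_request # => 3900, theoretical maximum before tuning
threads * halved          # => 1950, after halving the largest request
```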

In my experience, there are usually one or two endpoints that allocate an obscene amount of memory, maybe two to five times what other endpoints allocate. If I were tuning your memory use, I would start with those largest requests.

Also, since this has come up a few times: your memory problem (if you have one) is not from your web server, your framework, or even your language. The bulk of allocations typically comes directly from the business/application logic that you (or your team) wrote.

Caveats and Fine Print

The models I described above closely mimic the behavior and performance I’ve seen from real-world production applications over my last decade-plus working with Ruby. However, since these examples are based on simulation, it is useful to be explicit about what is simulated and what is excluded.

Threads versus processes: While I said “threads,” concurrency via processes will see the same memory behavior for processing requests. I specifically chose “threads” for this example because people generally don’t associate them with memory use or understand why memory goes up over time.

One difference in memory use between threads and processes is that a process will require a higher base-line amount of memory use than a thread. To understand more of the differences between the two concurrency constructs, you might want to check out my video and post WTF is a Thread.

Ruby behavior: This behavior is a very rough approximation of how Ruby (2.6 is the latest release at the time of writing) allocates memory. In reality, the garbage collector (responsible for allocating memory) is more nuanced than this simulation. There is a range of topics you need for the full picture: object slots, slot versus heap allocation, generational GC, incremental GC, compacting memory, fragmentation due to the malloc implementation, etc. But for now, this simplification is good enough.

Thread behavior: Ruby can only execute one thread at a time due to the GVL, but IO calls, such as database queries or network requests, can release the GVL. Ruby’s threads also use time-slicing: if you have two requests trying to execute at the same time and neither is doing IO, imagine that Ruby bounces back and forth between the two, working on each a little at a time. In reality, there are more considerations, and we could model those interactions, but they’re not necessary for now.
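Time-slicing can be observed with a small sketch. This example is illustrative only; the interleaving order varies run to run, but both threads always finish:

```ruby
# Two Ruby threads doing pure-Ruby work: the GVL means only one executes
# at any instant, yet the scheduler lets both make progress.
counts = Hash.new(0)
mutex  = Mutex.new

threads = 2.times.map do |i|
  Thread.new do
    5.times { mutex.synchronize { counts[i] += 1 } }
  end
end
threads.each(&:join)

counts # each thread completed all 5 of its iterations
```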

Active memory versus “base” memory: When you boot up your application but have not served any requests, there is a memory footprint. Think of this as the “base” size of the application. As a request comes in, imagine that your application pulls a user from the database, which requires allocating objects, renders a template that needs objects, and does a whole lot of internal object creation to deliver the request back. These objects are what I refer to as “active” memory. Active memory is not retained for long but is needed for the duration of the request. For this simulation, I only included “active” memory generated from requests.

Zero retained objects: This simulation also assumes that at the end of the request, all allocated objects can be garbage collected. In reality, this is not true. For example, in Ruby on Rails, there is a “prepared statement cache” that will grow in size as your application prepares and saves those statements. When people think of “a memory leak,” that’s typically what they’re imagining: first memory is allocated, then retained, and never allowed to be reclaimed by the garbage collector. The primary purpose of these simulations is to show how memory needs can increase over time even when there is no “leak” and no objects are retained after a request.
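The distinction can be sketched in code. The cache below is a stand-in I invented for illustration, not the actual Rails prepared statement cache API:

```ruby
# Contrast "active" memory (freed after each request) with retained
# memory (a cache that survives requests and can only grow).
PREPARED = {} # survives across requests: this is retained memory

def handle_request(sql)
  active = Array.new(100) { Object.new } # "active" memory for this request
  PREPARED[sql] ||= "prepared-#{sql}"    # retained: never freed
  active = nil                           # active objects become collectable
end

handle_request("SELECT * FROM users")
handle_request("SELECT * FROM users") # repeat query: cache stays the same size
handle_request("SELECT * FROM posts") # new query: retained memory grows

PREPARED.size # => 2
```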

What about background jobs: These examples are framed in terms of “web” requests. Still, they apply to any system that runs concurrent processes or threads, such as background job processing libraries like Sidekiq.

Play

If you want to play around with the code that generated these simulations, you can find my simulation and charting code on GitHub.