Use and Abuse of Garbage Collected Languages

The garbage collection vs. manual memory management debates ended years ago. As with the high-level vs. assembly language debates which came before them, it's hard to argue in favor of tedious bookkeeping when there's an automatic solution. Now we use Python, Ruby, Java, JavaScript, Erlang, and C#, and enjoy the productivity benefits of not having to formally request and release blocks of bytes.

But there's a slight, gentle nagging--not even a true worry--about this automatic memory handling layer: what if when my toy project grows to tens or hundreds of megabytes of data, it's no longer invisible? What if, despite the real-time-ness and concurrent-ness of the garbage collector, there's a 100 millisecond pause in the middle of my real-time application? What if there's a hitch in my sixty frames per second video game? What if that hitch lasts two full seconds? The real question here is "If this happens, then what can I possibly do about it?"

These concerns aren't theoretical. There are periodic reports from people for whom the garbage collector has switched from being a friendly convenience to the enemy. Maybe it's a super-sized heap? Or accidentally triggering worst-case behavior in the GC? Or simply using an environment where GC pauses didn't matter until recently?

Writing a concurrent garbage collector to handle gigabytes is a difficult engineering feat, but any student project GC will tear through a 100K heap fast enough to be worthy of a "soft real-time" label. While it should be obvious that keeping data sizes down is the first step in reducing garbage collection issues, it's something I haven't seen much focus on. In image processing code written in Erlang, I've used the atom transparent to represent pixels where the alpha value is zero (instead of a full tuple: {0,0,0,0}). Even better is to work with runs of transparent pixels (such as {transparent, Length}). Data-size optimization in dynamic languages is the new cycle counting.
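The same idea translates directly into other dynamic languages. Here's a sketch in Python (the original example is Erlang; the names and the assumption that alpha is the fourth tuple element are mine): collapse runs of fully transparent RGBA pixels into a single marker instead of storing one 4-tuple per pixel.

```python
def compress_row(pixels):
    """Collapse runs of fully transparent RGBA pixels into
    ("transparent", run_length) markers to shrink the heap."""
    out = []
    run = 0
    for p in pixels:
        if p[3] == 0:               # alpha == 0: fully transparent
            run += 1
        else:
            if run:
                out.append(("transparent", run))
                run = 0
            out.append(p)
    if run:
        out.append(("transparent", run))
    return out

row = [(0, 0, 0, 0)] * 3 + [(255, 0, 0, 255)] + [(0, 0, 0, 0)] * 2
compress_row(row)
# [("transparent", 3), (255, 0, 0, 255), ("transparent", 2)]
```

A mostly transparent row collapses from hundreds of tuples to a handful of markers, which means less for the collector to trace on every cycle.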

There's a more often recommended approach to solving garbage collection pauses, and while I don't want to flat-out say it's wrong, it should at least be viewed with suspicion. The theory is that more memory allocations means the garbage collector runs more frequently, therefore the goal is to reduce the number of allocations. So far, so good. The key technique is to preallocate pools of objects and reuse them instead of continually requesting memory from and returning it to the system.
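In case the technique is unfamiliar, a minimal sketch in Python (illustrative; the class and method names are my own): objects are allocated once up front and recycled, so steady-state use hands nothing new to the collector.

```python
class Pool:
    """Preallocate objects once and recycle them, instead of
    allocating fresh ones and letting the GC reclaim them."""

    def __init__(self, factory, size):
        self._factory = factory
        self._free = [factory() for _ in range(size)]

    def acquire(self):
        # Reuse a pooled object if one is free; fall back to a
        # fresh allocation only when the pool is exhausted.
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        self._free.append(obj)

pool = Pool(lambda: [0] * 16, size=8)
buf = pool.acquire()    # reuses a preallocated buffer
buf[0] = 42
pool.release(buf)       # returned to the pool, not to the GC
```

Note what the caller has taken on: every acquire must be paired with a release, and a forgotten release leaks, while a double release hands out aliased objects.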

Think about that for a minute. Manual memory management is too error prone, garbage collection abstracts that away, and now the solution to problems with garbage collection is to manually manage memory? This is like writing your own file buffering layer that sits on top of buffered file I/O routines. The whole point of GC is that you can say "Hey, I'd like a new [list/array/object]," and it's quick, and it goes away when no longer referenced. Memory is a lightweight entity. Need to build up an intermediate list and then discard it? Easy! No worries!

If this isn't the case, if memory allocations in a garbage collected language are still something to be calorie-counted, then maybe the memory management debates aren't over.

(If you liked this, you might enjoy Why Garbage Collection Paranoia is Still (sometimes) Justified.)

permalink April 21, 2012
