May 22, 2014, 9:39 pm

I got assigned an interesting bug to fix today at work: Performing a certain operation in our program caused an enormous memory leak, producing a FastMM report file that weighed in at over 150 MB, which represents a serious amount of leaked RAM. A bit of debugging made it obvious that a certain interfaced object was at the root of the problem, and it had a refcount of 1 when the program ended. I found the object that was holding a reference to it and went looking for what was keeping that one alive… and it turned out to have a refcount of over 4700 when the program ended!

And just to make things worse, it wasn’t just one place in the code where _AddRef was being called on this object; a bit of checking revealed multiple places that were calling it thousands of times. This is where some people would freak out and say that maybe the garbage collection advocates are onto something; there’s no way you can ever hope to trace through all of that and come up with anything useful!

I took a bit of a different view of it. If there’s no way you can ever hope to trace through all of that… then don’t; find a smarter way to do it. If you have a data set large enough that it can only be understood statistically, then gather some statistics!

Turns out one of my coworkers at De Novo Software had come up with a system for tracking down leaks like this by instrumenting your code to log adds and releases. It generates a logfile which you can load into his program for statistical analysis. He showed me how to set it up, which consisted of adding one unit to the project and a few lines of code to the class I was trying to instrument, and in just a few minutes I was building the log.
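To give a rough idea of what "instrumenting your code to log adds and releases" can look like, here is a minimal sketch. It is written in Python rather than Delphi purely for illustration, and it is not my coworker's tool: the names `RefLog`, `Tracked`, `balanced_user`, and `leaky_user` are all made up for the example. The idea is simply that every add and every release records the operation together with the call stack that triggered it.

```python
import traceback


class RefLog:
    """Collects one entry per add/release call, with the caller's stack."""

    def __init__(self):
        self.entries = []  # list of (operation, stack) tuples

    def record(self, operation, skip=2):
        # Drop the innermost frames: record() itself and the instrumented
        # method that called it, so the stack ends at the real caller.
        frames = traceback.extract_stack()[:-skip]
        stack = tuple(frame.name for frame in frames)
        self.entries.append((operation, stack))


class Tracked:
    """Stand-in for a reference-counted object (hypothetical)."""

    def __init__(self, log):
        self.refcount = 0
        self._log = log

    def add_ref(self):
        self.refcount += 1
        self._log.record("AddRef")
        return self.refcount

    def release(self):
        self.refcount -= 1
        self._log.record("Release")
        return self.refcount


def balanced_user(obj):
    obj.add_ref()
    obj.release()


def leaky_user(obj):
    obj.add_ref()  # takes a reference and never gives it back


log = RefLog()
obj = Tracked(log)
for _ in range(3):
    balanced_user(obj)
    leaky_user(obj)

print("final refcount:", obj.refcount)
print("log entries:", len(log.entries))
```

In a real program the entries would go to a file rather than a list, but the shape of the data is the same: an operation name plus the stack that caused it.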

When I loaded it into the viewer, I got a TTreeView full of data about where _AddRef and _Release were being called, and I could edit out irrelevant bits of data. It was pretty intuitive for the most part, and soon I found there were two main places where _AddRef was being called over 4700 times, and one where _Release was being called over 4700 times. The log file captured call stacks, so it was easy to see that one of the _AddRef groups matched the _Release group, which meant the problem was in the other _AddRef group. From there, I had a stack trace, and it took about 2 minutes to track the problem to a minor misunderstanding about how records, pointers-to-records, and record copying worked. So I fixed the code, rebuilt, and tested it, and the leak was gone.
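The statistical part of this is simple enough to sketch. The following is a Python illustration under stated assumptions, not the actual viewer: the log entries and the routine names in them (`LoadItems`, `CopyRecord`, and so on) are invented, and the counts are deliberately tiny. It groups entries by operation and call stack, counts each group, and reports the net number of references still held. Deciding which add group pairs with which release group is still left to a human reading the stacks.

```python
from collections import Counter

# Hypothetical log entries: (operation, call stack from outermost to innermost).
log = (
    [("AddRef", ("Main", "LoadItems", "CacheItem"))] * 5
    + [("AddRef", ("Main", "LoadItems", "CopyRecord"))] * 5
    + [("Release", ("Main", "UnloadItems", "EvictItem"))] * 5
    + [("AddRef", ("Main", "ShowDialog"))] * 1
    + [("Release", ("Main", "ShowDialog"))] * 1
)


def summarize(entries):
    """Count entries per (operation, stack) and compute the net refcount."""
    groups = Counter(entries)
    net = sum(
        count if operation == "AddRef" else -count
        for (operation, _stack), count in groups.items()
    )
    return groups, net


groups, net = summarize(log)

# Largest groups first: these are the call sites worth reading closely.
for (operation, stack), count in groups.most_common():
    print(f"{count:>5}  {operation:<8} {' > '.join(stack)}")
print("net references still held:", net)
```

With thousands of calls collapsed into a handful of groups, the unbalanced call site tends to stand out, which is what happened here.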

This is what I’ve been saying for years. Proper tooling makes the “problems” of real memory management trivial, without needing to sacrifice the performance benefits, which experience has shown can be considerable, especially in memory-constrained environments such as mobile devices!

I asked my coworker, and he said that his tool “is not production ready,” but if that changes, I’ll make sure to post a link to it, since it’s really quite useful.