Space verus Time in Java's ObjectOutputStream

10 Jan 2010

This past week I helped my colleague D. B. investigate a memory-use issue in a component he wrote. The component reads and sorts records (basically database rows). Because the number of rows can be large, it uses a disk-based mergesort once the count exceeds a user-defined threshold in order to keep memory use in check. Under the covers it uses Java's ObjectOutputStream to write each chunk of sorted input and ObjectInputStream to read those chunks while merging. Well, despite our best intention and design, the sorter component was using far more memory than we expected, resulting in the dreaded "java.lang.OutOfMemoryError: Java heap space" exception. As a first step I installed TPTP trying to localize the problem. It showed me tens of thousands of live char[] instances in use but refused to report for me the origin of those instances, so that tool wasn't much help. Finally he hit upon the answer, namely that ObjectOutputStream was not releasing references. We thought that using writeUnshared() would be enough but it was not. The javadoc hints at this behavior, but I guess does not state it quite as clearly as we would like.

The solution is to call the class's reset() method at suitable intervals. We experimented and found some significant performance differences based on how often we call that method. So I wrote a little code (see below) and gathered some numbers. The tables below show results from processing an input CSV file of 544,222 lines and size 86Mb, on a 2Ghz dual-core laptop running WinXP on an ordinary disk drive. All tests launched from Eclipse in debug mode with a Java breakpoint set on the last executable line. The interval column shows how many records were written between each call to the object output stream's reset() method. The program reports its time in milliseconds. Memory numbers were gathered from the windoze taskmgr "Mem Usage" column when the breakpoint was reached.

Using writeObject() Using writeUnshared() Interval Time/MS Memory/KB Interval Time/MS Memory/KB 1 22,781 10,460 1 23,093 10,464 5 18,579 10,468 5 18,531 10,480 10 17,875 10,496 10 17,907 10,540 50 17,484 10,524 50 17,375 10,516 100 17,453 11,128 100 17,500 10,712 1,000 19,781 14,692 1,000 18,782 14,540 10,000 20,329 21,368 10,000 19,844 21,812

Calling reset at every write yields minimal memory use but it's huge overkill that significantly increases the run time. On the other end, wait too long to call reset and the memory use blows up the JVM, and the run time starts increases also (which I really didn't expect).

A person on the Sun forum (see the link in the code) pointed that this undesired behavior pops up out when the object written out has references within it. In my first attempt at this assessment I simply wrote out String objects read from the CSV file. But in that scenario, writeUnshared works perfectly, does what I expect, and the program uses very little memory. The way to reproduce the excessive memory use problem in this little example is to split the lines and write out the resulting array of string. Then the numbers showed no difference between the writeObject() and writeUnshared() methods.

The conclusion: calling reset at some interval is essential, and there's a sweet spot at about every 100th write where the space/time tradeoff is reasonable. I didn't investigate whether the sweet spot is really at 50 or 150; here's the code if you want to go after that one.