Livermore Lab pioneers debugging tool

New tool enables researchers to pinpoint a needling problem in a haystack of processors

How do you find a bug in a program when that program is spread across 200,000 processors?

As incredible as that scenario might sound, it is becoming a routine problem for Lawrence Livermore National Laboratory, home of the 212,992-core BlueGene/L supercomputer. To help spot bugs, laboratory researchers, along with those from the University of Wisconsin, have developed a new software program, named Stack Trace Analysis Tool (STAT).

"What we are finding is that today's architectures require novel [debugging] techniques," said Lawrence Livermore researcher Gregory Lee, who presented a paper about the new software at the SC08 conference, held in Austin last month.

Such debugging may become more crucial in years to come, as the largest petascale systems might soon consist of at least 1 million cores.

Lee noted that many full-featured debuggers for parallel programs are already on the market, such as TotalView Technologies’ TotalView. But such tools do not scale well for programs that run across thousands of processors because they cannot complete their analyses within a reasonable amount of time. Their thoroughness slows them down when they work on too many processors, because the data structures they create grow too unwieldy.

"Even if your tool works with today's scales, if you take that same application and add one or two orders of magnitude, then some of the things you do now may not work well," Lee said.

STAT is not a full-featured debugger. Instead, it narrows down the problem area within a large parallel program so that developers can then use a more thorough commercial debugger to fix the problem.

"We wanted to develop lightweight tools that would help the heavyweight tools by identifying processes that behave in a similar fashion," Lee said.

STAT takes advantage of the fact that most parallel applications run similar processes across multiple nodes. Most debuggers show each process individually. With thousands of processors, a developer would find it too difficult to sort through all those processes, even if the debugger could generate all that information in a reasonable amount of time.

STAT works by collapsing identical processes into a single visual representation. The software gathers stack traces from all the running processes and then merges them into a single tree graph. It also offers the option of building a 3-D tree graph, which can show how the program behaves over a period of time. Both approaches are good at locating problems in unstable programs, such as deadlocks.
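The merging step can be illustrated with a short Python sketch. This is not STAT's actual code or interface. It is a minimal illustration that assumes each process reports its call stack as a list of function names, outermost call first. The names `Node`, `merge_traces` and `equivalence_classes`, and the sample traces, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One call frame in the merged tree, plus every rank whose stack passes through it."""
    ranks: set = field(default_factory=set)
    children: dict = field(default_factory=dict)


def merge_traces(traces):
    """Merge per-process stack traces (outermost frame first) into one prefix tree."""
    root = Node()
    for rank, frames in traces.items():
        node = root
        node.ranks.add(rank)
        for frame in frames:
            # Processes that share a call path share the same nodes.
            node = node.children.setdefault(frame, Node())
            node.ranks.add(rank)
    return root


def equivalence_classes(node, path=()):
    """Return {call path: sorted ranks} for the ranks whose stack ends at each node."""
    deeper = set().union(*(child.ranks for child in node.children.values()))
    ended_here = node.ranks - deeper
    classes = {path: sorted(ended_here)} if ended_here else {}
    for frame, child in node.children.items():
        classes.update(equivalence_classes(child, path + (frame,)))
    return classes


# Hypothetical traces: ranks 0-2 wait in a barrier while rank 3 is stuck in a receive.
traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "MPI_Barrier"],
    3: ["main", "exchange", "MPI_Recv"],
}
tree = merge_traces(traces)
classes = equivalence_classes(tree)
for call_path, ranks in classes.items():
    print(" > ".join(call_path), ranks)
```

In this toy case four processes collapse into two groups, and the odd one out (rank 3) is the natural place to attach a full-featured debugger. The sketch merges everything in one loop on a single machine, which is an illustrative simplification and says nothing about how STAT collects or combines traces at scale.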

In one test using BlueGene/L, the research team was able to merge all 212,992 processes of a program into a single tree graph in about a third of a second. "If you interpolate those results to a machine with 1 million cores, you're still talking about latencies that are tolerable," Lee said.

The Lawrence Livermore BlueGene/L support team has just installed STAT for production debugging use, Lee said. Users can deploy STAT alongside the laboratory’s copy of TotalView to locate and fix code bugs. "We ran it on a couple of real end-cases," Lee said.

STAT is open source and available to other agencies.