



I am glad to announce that we just released a new version of NDepend where the analysis phase duration has been divided by 4 and the memory consumption has been divided by 2.

An interesting question is: does such a massive performance gain mean that it was badly coded the first time?

Previous, slower versions met the several thousand functional requirements needed to properly analyze any .NET application. In this sense, they were nicely coded. Still, there was room for improvement, and my belief is that there is always room for better performance. Even in the current, much faster version we have identified several significant optimizations still to be done; the downside is that they will require a lot of work. So the first lesson learned is that there is always room for better performance. We can complete this rule with the observation that the amount of work needed to make a program run faster grows exponentially with the gain expected.

We quantified each source of performance gain and obtained this chart. As said, we have a 75% gain obtained thanks to algorithm optimization, micro-optimization and parallelization. The current analysis duration (25% of the original analysis duration) can be split between 17% of the time spent in our code and 8% of the time spent in third-party code. In our case, the third-party code is the code of Mono.Cecil, used to build object models of the assemblies analyzed.

The bulk of the performance gain was obtained with algorithm improvements, and this is not a surprise. IMHO the surprise comes from the relatively small gain obtained from parallelization. Indeed, we did some massive refactoring to parallelize the analysis with the parallel monologue idea in mind. Now each assembly is analyzed as a task that doesn't have any synchronization need with any other task. In other words, each task uses its own state, which is not visible from the other tasks. Unfortunately this is not enough to ensure proper scaling with the number of processors. While we get a 15% gain going from 1 to 2 processors, the gain is almost zero between 2 and 4 processors. We identified some potential IO contention and memory issues that will require more attention in the future. This leads to another lesson: don't expect that scaling on many processors will be easy, even if you don't share state and don't use synchronization.
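
To make the idea concrete, here is a minimal sketch of this kind of share-nothing parallelization using the .NET Task Parallel Library; AnalyzeAssembly and AssemblyReport are hypothetical names, not our actual code.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical per-assembly result: each task writes only to its own report.
class AssemblyReport { public string AssemblyPath; /* ... metrics ... */ }

static class ParallelAnalysis
{
    // Analyzes one assembly in isolation: no shared mutable state, no locks.
    static AssemblyReport AnalyzeAssembly(string assemblyPath)
    {
        var report = new AssemblyReport { AssemblyPath = assemblyPath };
        // ... build the object model and compute metrics for this assembly only ...
        return report;
    }

    static AssemblyReport[] AnalyzeAll(IEnumerable<string> assemblyPaths)
    {
        var reports = new ConcurrentBag<AssemblyReport>();
        // Each iteration is an independent task; the only synchronization left
        // is the thread-safe bag that collects the results.
        Parallel.ForEach(assemblyPaths, path => reports.Add(AnalyzeAssembly(path)));
        return reports.ToArray();
    }
}
```

Even with such a share-nothing design, the tasks still compete for the disk, the memory bus and the GC, which is precisely where our scaling beyond 2 processors broke down.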

When it comes to enhancing performance there is only one way to do things right: measure, and focus your work on the part of the code that really takes the bulk of the time, not the one that you think takes the bulk of the time. Here is a quick real-world illustration. Personally, I have always been amazed by the performance of the C# and VB.NET compilers. Imagine: they can compile in a few seconds thousands of source files that took dozens or even hundreds of man-years to write. At analysis time, NDepend also needs to parse source files, but it does so only partially, to obtain the comment and Cyclomatic Complexity metrics. By measuring file loading time we had a good surprise: loading source files into memory is much cheaper than expected. For example, loading the 1728 C# source files of NDepend takes less than 0.2 seconds out of the 13 seconds needed for a complete analysis of its own code. Now, knowing that it is almost free to load the source files into memory, I am a bit less impressed by the performance of the C# and VB.NET compilers. And it leads to an important lesson: NEVER anticipate the performance cost, ALWAYS measure it. And the only professional way to measure is to rely on performance profiler tools. Actually, we use both dotTrace from JetBrains and Red Gate ANTS Profiler and are happy with them.
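
For a quick ad-hoc check like the file-loading measurement above, even a plain Stopwatch is enough (the path below is just a placeholder); the profiler remains the right tool for finding where the rest of the time goes.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class SourceLoadTiming
{
    static void Main()
    {
        // Placeholder path: point it to your own source tree.
        string[] files = Directory.GetFiles(@"C:\src\MyProduct", "*.cs", SearchOption.AllDirectories);

        var watch = Stopwatch.StartNew();
        long totalChars = 0;
        foreach (string file in files)
            totalChars += File.ReadAllText(file).Length; // loads the whole file in memory
        watch.Stop();

        Console.WriteLine("{0} files ({1} chars) loaded in {2} ms",
                          files.Length, totalChars, watch.ElapsedMilliseconds);
    }
}
```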

Something important when measuring code performance is the percentage of time spent inside third-party code, things like DB or network access. As explained with the previous chart, in our case study 35% of the analysis time is spent inside the Mono.Cecil code. Since that portion is out of our hands, we can infer that if we want the analysis duration divided by 2 in the future, our own code will need to run 4.3 times faster! ((100 - 35) / (50 - 35) ≈ 4.3). This clearly shows the importance of another lesson: assess your limits by measuring the percentage of time spent in third-party code.
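
The arithmetic generalizes: if a fraction f of the time is spent in third-party code that you cannot speed up, then reaching an overall speedup factor S requires your own code to run (1 - f) / (1/S - f) times faster. A tiny helper, just to play with the numbers:

```csharp
using System;

static class SpeedupMath
{
    // f = fraction of time in third-party code, s = overall speedup factor wanted.
    // Returns how much faster our own code must run (the target is unreachable if 1/s <= f).
    static double RequiredOwnCodeSpeedup(double f, double s)
    {
        return (1.0 - f) / (1.0 / s - f);
    }

    static void Main()
    {
        // 35% of the time in Mono.Cecil, overall duration divided by 2:
        Console.WriteLine(RequiredOwnCodeSpeedup(0.35, 2.0)); // prints ~4.33
    }
}
```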

In the dissection of the performance gain, an interesting point is the 15% gain obtained from micro-optimization. I mean things like using the right kind of loop, using the right kind of collection, preferring to deal with primitive CLR types such as int and string (the POCO idea), choosing carefully between structures and classes, buffering property getter results into local variables, or even exposing non-private fields. I could summarize the lesson learned by saying that micro-optimizations are worth it, while premature optimization is the root of all evil (as Donald Knuth said a long time ago). The difference between micro-optimizations and premature optimizations is that the first are driven by measurement while the second are driven by guesses, intuitions and hunches.
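
As an illustration of the "buffer a property getter result into a local variable" item, here is a sketch with hypothetical types; like any micro-optimization it only pays off where a profiler shows the getter is actually hot.

```csharp
using System.Collections.Generic;

// Hypothetical types, for illustration only.
class MethodData { public int CyclomaticComplexity; }

class CodeModel
{
    readonly List<MethodData> m_Methods = new List<MethodData>();
    // Imagine this getter does more than returning a field (checks, lazy init...).
    public IList<MethodData> Methods { get { return m_Methods; } }
}

static class MicroOptimizations
{
    static long SumComplexities(CodeModel model)
    {
        // Buffer the getter results once, instead of re-evaluating
        // model.Methods and methods.Count at every loop iteration.
        IList<MethodData> methods = model.Methods;
        int count = methods.Count;

        long sum = 0;
        for (int i = 0; i < count; i++)
            sum += methods[i].CyclomaticComplexity;
        return sum;
    }
}
```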





Something not quantified in our performance gain is the fact that memory consumption has been roughly divided by 2. Less memory doesn't necessarily mean fewer managed objects, but in our particular case we do allocate fewer objects. The nice consequence is that fewer objects mean less time spent in the GC. Less memory also means fewer virtual memory page faults, which are IMHO the worst thing when it comes to polishing the performance of a program.
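
A rough way to observe the "fewer objects, less GC time" effect on a given code path is to compare Gen 0 collection counts around two variants of it; the snippet below is only a sketch of that kind of check, not our actual measurements.

```csharp
using System;
using System.Text;

static class GcPressure
{
    static void Report(string label, Action action)
    {
        int gen0Before = GC.CollectionCount(0);
        action();
        Console.WriteLine("{0}: {1} Gen0 collections", label, GC.CollectionCount(0) - gen0Before);
    }

    static void Main()
    {
        // Allocation-heavy variant: a new string instance per iteration.
        Report("string concat", () =>
        {
            string s = "";
            for (int i = 0; i < 20000; i++) s += "x";
        });

        // Allocation-light variant: one growing buffer, far fewer objects.
        Report("StringBuilder", () =>
        {
            var sb = new StringBuilder();
            for (int i = 0; i < 20000; i++) sb.Append("x");
        });
    }
}
```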

Finally, we are now experiencing a great side benefit of optimization: automated tests naturally run much faster, and that is a good motivation to run them more often!

I will end with an illustration of the tenet that there is always room for better performance: the Amiga demo scene, in which I participated in the early nineties. Every demo ran on the same constant hardware. It was the perfect incentive for demo developers to produce more and more optimized code. For several years, every month some record was beaten, like the number of 3D polygons, the number of sprites or the number of dots displayed simultaneously at a rate of 50 frames per second. As far as I can estimate, the performance factor obtained in a few years was something around x50! Imagine what that means: running in one second a computation that initially took a minute. And as far as I know this massive gain was the result of both better algorithms (with many pre-computations and delegation to sub-chips) and micro-optimizations at the assembly language level (better use of the chip registers, better use of the instruction set…).