In our look at scientific computing and the continued longevity of Fortran in science and engineering circles, one of the recurring themes in the discussion that followed was performance.

One of the big reasons that Fortran remains important is that it's fast: number crunching routines written in Fortran tend to be quicker than equivalent routines written in most other languages. The languages that compete with Fortran in this space—C and C++—are used because they can match that performance.

This raises the question: why? What is it about C++ and Fortran that makes them fast, and why do they outperform other popular languages, such as Java or Python?

Interpreting versus compiling

There are many ways to categorize and define programming languages, according to the style of programming they encourage and the features they offer. When looking at performance, the biggest single distinction is between interpreted languages and compiled ones.

The divide is not hard; rather, there's a spectrum. At one end, we have traditional compiled languages, a group that includes Fortran, C, and C++. In these languages, there is a discrete compilation stage that translates the source code of a program into an executable form that the processor can use.

This compilation process has several steps. The source code is analyzed and parsed. Basic coding mistakes such as typos and syntax errors can be detected at this point. The parsed code is used to generate an in-memory representation, which can also be used to detect mistakes—this time, semantic mistakes, such as calling functions that don't exist, or trying to perform arithmetic operations on strings of text.

This in-memory representation is then used to drive a code generator, the part that produces executable code. Code optimization, to improve the performance of the generated code, is performed at various times within this process: high-level optimizations can be performed on the code representation, and lower-level optimizations are used on the output of the code generator.

Actually executing the code happens later. The entire compilation process is simply used to create something that can be executed.

At the opposite end, we have interpreters. An interpreter includes a parsing stage similar to that of a compiler, but the parsed code is then used to drive direct execution, with the program being run immediately.

The simplest interpreter has within it executable code corresponding to the various features the language supports—so it will have functions for adding numbers, joining strings, whatever else a given language has. As it parses the code, it will look up the corresponding function and execute it. Variables created in the program will be kept in some kind of lookup table that maps their names to their data.

The most extreme example of the interpreter style is something like a batch file or shell script. In these languages, the executable code is often not even built into the interpreter itself, but rather separate, standalone programs.

So why does this make a difference to performance? In general, each layer of indirection reduces performance. For example, the fastest way to add two numbers is to have both of those numbers in registers in the processor, and to use the processor's add instruction. That's what compiled programs can do; they can put variables into registers and take advantage of processor instructions. But in interpreted programs, that same addition might require two lookups in a table of variables to fetch the values to add, then calling a function to perform the addition. That function may very well use the same processor instruction as the compiled program uses to perform the actual addition, but all the extra work before the instruction can actually be used makes things slower.

Blurring the lines

Between the extremes are a range of options. For example, many high performance interpreters will act a lot more like compilers: they'll perform the same steps as a compiler, including the generation of directly executable code, but they'll then execute that code immediately (instead of storing it on disk for execution later). While these interpreters might keep the executable code around for the duration of the program so they don't have to generate code for a particular function more than once, it'll generally be thrown away when the program finishes. If you run the program again, the interpreter will have to generate the code all over again.

This process of compiling when the program is run is called just-in-time (JIT) compilation, and the JavaScript engines of Internet Explorer, Firefox, and Chrome all use this technique to maximize their scripting performance.

JIT compilation is typically faster than traditional interpreting. However, it generally can't compete with conventional ahead-of-time compilation. AOT compilation can be slow, with compilers spending considerable time to optimize the code to the best of their ability. They can afford to do this because nobody is actually waiting for the compilation to take place. JIT compilation, however, happens at runtime, with a user waiting at the keyboard for the program to actually run. This limits the time that can be spent optimizing. Techniques such as performing additional optimization on a background thread and making use of modern multicore processors can go some way toward closing this gap.

In principle, JIT compilation can offer performance benefits over conventional compilation. A conventionally compiled program generally has to be quite conservative in some ways. Microsoft can't easily compile Windows to, for example, take advantage of the latest AVX instructions found in newer Intel and AMD processors, because Windows has to run on processors that don't support AVX. A JIT compiler, however, knows exactly the hardware it will be used on, and so can take maximal advantage of it.

Historically, JIT compilers haven't been very good at using the kind of complex instructions that modern processors offer. Indeed, making use of instructions like SSE and AVX is a challenge even for AOT compilers, in spite of their lack of time constraints. However, this is starting to change, with for example Oracle's HotSpot compiler for Java including some early support for these instructions.

Another common technique is the use of bytecode. Bytecode-based platforms, including Java and .NET, have a traditional compilation process, but instead of generating executable machine code, the compiler generates bytecode, a kind of machine code designed not for real hardware, but for an idealized virtual machine. This bytecode can then be interpreted or JIT compiled when the program is actually run.

Generally, the performance of these bytecode systems lies somewhere between that of interpreted languages and that of compiled ones. The bytecode is easier to JIT compile and optimize at runtime than raw source, giving an advantage over interpreters, but the runtime still can't afford the optimization effort of a traditional ahead-of-time compiler.

These various intermediate options in turn give rise to a range of intermediate performance options between the compiled and interpreted extremes.

Technically, the use of a compiler or an interpreter is not a property of the language itself. There are various projects that, for example, create interpreters for C, a language that's traditionally compiled. JavaScript has gone from simple interpreters to complex JIT compilers to get better performance.

However, mainstream languages don't tend to switch too often; C++ is essentially always going to be compiled ahead of time. So too is Fortran. C# and Java are almost always going to be compiled to bytecode and then JIT compiled at runtime. Python and Ruby are almost always going to be interpreted. This tends to create a performance hierarchy: C++ and Fortran are faster than Java and C#, which in turn are faster than Python and Ruby.

The languages themselves

Within each category there's still plenty of performance variation. Much of this is simply due to priorities. Consider Python and JavaScript, for example: two popular scripting languages that are both traditionally interpreted. In practice, JavaScript tends to be a lot faster than Python, not because of any feature of the languages themselves—they're broadly comparable in terms of expressiveness and capabilities—but because companies like Microsoft, Google, and Mozilla have invested heavily in making JavaScript faster.

Within similar kinds of language, this difference in investment (or development priorities) is probably the biggest single determinant in language performance. Making high-speed interpreters and compilers is a lot of work, and not every language actually needs all that work.

However, language design can account for some of the differences. The longevity of Fortran happens to be a good example of this. For a long time, equivalent programs in Fortran and C (or C++) would be faster in Fortran, because Fortran enabled better optimizations. This was true even when the C compiler used the same code generator and optimizer as the Fortran compiler. This difference wasn't because of a feature Fortran had. In fact, it was the reverse: it was because of a feature that Fortran didn't have.

Number crunching apps typically do their work on large arrays of numbers: numbers packed in memory representing points in 3D space or forces or similar. The computations generally iterate over these arrays, doing the same operation on each element of the array in turn. A function to add two arrays element-by-element, for example, will take three inputs: the two arrays to add, and a third array to put the answers in.

Element-wise addition like this can be done in any order, and that freedom allows all kinds of optimizations. For instance, it allows the use of vector instructions such as SSE and AVX: instead of adding each element one by one, four elements from one array can be added to four elements from the other array simultaneously, for an easy four-fold improvement in performance. This kind of function is also amenable to being made multithreaded: if you have four cores, one core can add the first quarter of the array, the next core the next quarter, and so on. These array-based functions afford some of the most powerful compiler optimizations, able to make the code run many times faster.

C doesn't really let functions have arrays as inputs (or, for that matter, outputs). Instead, it has a thing called pointers. Pointers more or less represent memory addresses, and C has built-in features for reading and writing values stored at these memory addresses. In many regards, C uses pointers and arrays interchangeably; for an array, the pointer just represents the address of the first element of the array. The other elements of the array just have the next, sequential, memory addresses. C has built-in features for performing arithmetic on the memory addresses to access the arrays.

Pointers are very flexible, and C developers use them to construct complex data structures, as well as tightly packed sequential arrays. But this flexibility can make them difficult for compilers to optimize. Consider again that function to add two arrays. In C, it would take not three arrays as its input, but three pointers: two for the inputs, one for the output.

Here's the problem. Those pointers could represent any memory address. More to the point, they could overlap. The memory address of the output array could be the same as one of the input arrays. They might even overlap partially. The output array could lie over half of one of the input arrays.

This is a problem for the optimizer, because it means that the assumptions the array-based optimizer could make no longer hold true. Specifically, the order in which elements are added now matters: if the output overlaps with one of the input arrays, the result of the calculation will be different depending on whether each element of the input array is read before it gets overwritten by the output, or after.

This means that those fancy optimizations—multithreading and vector instructions—aren't available. The compiler doesn't know if they're safe to apply, so it has to conservatively do what the source code instructs, in the order it instructs, and no more. The compiler no longer has the freedom to rearrange the program to make it go faster.

This problem is called aliasing, and it's something that traditional Fortran doesn't suffer from, because traditional Fortran doesn't have pointers; it just has non-overlapping arrays. For many years, it meant that Fortran compilers (and developers) could apply powerful optimizations that weren't available in C. This cemented Fortran's position as the fastest language for number crunching.

Obviously, for this kind of function, the flexibility that pointers offer isn't actually useful. If the arrays overlap then there's no right way to process the data, so it's just bad luck that the optimizations can't be used. In the 1999 update to the C specification, known as C99, C was given an answer to this problem: the restrict keyword, which lets the programmer declare that pointers do not overlap. When this is done, the compiler can make all the optimizations it wants, and with this feature, C99 (and C++, as most compiler vendors offer a comparable capability) became as optimizable as Fortran.

This aliasing problem shows how language features can be relevant to optimization, especially when it comes to making big, transformative optimizations that, for example, automatically convert single-threaded code to multithreaded. However, it also shows how such differences aren't necessarily permanent. Developers want to be able to use C and C++ for their number crunching code, and if it takes small changes to the language to make it as fast as Fortran, those changes will be made.