Parallelism and performance go hand-in-hand. But achieving maximum performance can be a balancing act, as Intel senior engineer James Reinders explains to Dr. Dobb's editor in chief Jonathan Erickson.

Dr. Dobb's: Sandia National Labs engineers say that adding cores to high-performance computers slows data-intensive apps. Your thoughts?

Reinders: It is an issue of "balance." It certainly isn't that data-intensive applications cannot use parallelism. Good performance requires a characteristic we call "balance," and one person's balance is another person's headache. Supercomputer systems, in order to get the "balance" they need, place a high premium on things like memory bandwidth, I/O bandwidth, and low-latency communications, to name a few. If we have a system which is tipping out of balance in a way where computing ability is not the bottleneck (say memory bandwidth is the limitation) -- then adding more computing ability (e.g., more cores) will not help. In fact, because it is likely to add more contention, it can slow things down.

The latest Intel architecture, which we code-named "Nehalem" and which is now available as processors known as Intel Core i7, puts enormous emphasis on adding features to maintain or add balance. If the new architecture had simply added cores, without the new features, many people would have found slowdown issues. The key features which add "balance" are the QuickPath Interconnect (QPI) and an integrated memory controller, resulting in an unprecedented amount of aggregate bandwidth.

Is it enough to get balance for everyone? It never is. Balance is a fundamental challenge in computer design, a classical example of a design goal in the face of numerous engineering tradeoffs. It will never end. That's why characterization of behavior, like the work done at Sandia, helps frame the problems so they can be solved.

Dr. Dobb's: Microsoft's Azure doesn't support multicore applications. Is there a place in the cloud for parallelization?

Reinders: Parallelism is very important to cloud computing. Azure is clearly designed to use lots of cores. The initial implementation of Azure chose to assume we will have lots and lots of cores over time, and it has a simple one-to-one correspondence of physical cores to VMs. This is because Azure's future is more about doing many things in parallel rather than doing a few things, each being parallelized. In this model, a multithreaded program can still run on a single physical core -- and can still get parallelism when that core has hardware threading (such as in Intel Core i7).

My best guess would be that as we get many more cores, Azure will be tempted to let a single VM own multiple cores. I do think they will resist letting a single core run multiple VMs. To my understanding, that is the main reason they started with the one-to-one correspondence of VMs and cores. I think that was a rather bold but important decision. It adds predictability to the system, and it bets on a future where cores will be even more numerous.

Cloud computing loves cores. On multiple levels, it is trying to avoid shared mutable state within applications. This is a good thing.

Dr. Dobb's: Shared-memory programming or message passing? Which is the better way to do parallel programming?

Reinders: If I have to pick -- the answer is message passing. If I were teaching parallel programming to new minds, with the goal of awakening them to "think parallel," I would start with something based on message passing.

However, if I'm giving advice to someone with an existing program that needs parallelism added with minimal fuss and on a schedule -- I'm unlikely to advocate message passing.

Message passing is a fundamental concept that is probably more important than OOP to the future of computing. I'm not sure we'll call it message passing for long.

That said, shared-memory programming isn't going away and will dominate programming for the foreseeable future. It is familiar and incremental. It is also the root of the key headache in parallel programming: "shared mutable state." Most programmers will use shared-memory programming, and they need to learn a lot about "shared mutable state" and how to deal with it. The issues it leads to, such as deadlocks and race conditions, haunt parallel programmers using shared-memory programming.

Message passing, on the other hand, avoids "shared mutable state," but it lacks the familiarity of shared-memory programming. New frameworks, similar to OOP in many ways, which use "actors" or "agents" seem to have an intuitive feel that may attract more programmers than the most widely used message passing interface, MPI.
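The contrast can be sketched in a few lines of Python using the standard queue module: a worker "actor" owns its state outright, and other threads influence it only by sending messages, so there is no shared mutable state to protect with locks. (This is an illustrative sketch, not any particular actor framework.)

```python
# A worker that owns its state; other threads can only send it messages.
import queue
import threading

def counter_actor(inbox: queue.Queue, outbox: queue.Queue) -> None:
    total = 0  # private state: only this thread ever touches it
    while True:
        msg = inbox.get()
        if msg == "stop":
            outbox.put(total)   # reply with the final state, then exit
            return
        total += msg

inbox: queue.Queue = queue.Queue()
outbox: queue.Queue = queue.Queue()
worker = threading.Thread(target=counter_actor, args=(inbox, outbox))
worker.start()

for i in range(5):
    inbox.put(i)        # communicate by sending messages, not by sharing memory
inbox.put("stop")
worker.join()
total = outbox.get()
print(total)            # 0+1+2+3+4 = 10
```

Because the counter is never visible to two threads at once, there is nothing to lock and no race to reason about; the queue does all the synchronization.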

Dr. Dobb's: Transactional memory. What is it and why is it important?

Reinders: Transactional memory tries to make "shared mutable state" in parallel programs manageable by providing an illusion of "all or nothing" for memory updates. The concept of a "transaction" for a database is well established -- it is taken for granted that you either update a whole record in a database all-at-once or not-at-all. Imagine changing your plane ticket from Tuesday 8:00 to Thursday 11:30, but the database update switched Tuesday -> Thursday and 8:00 -> 11:30 independently, with the possibility that one succeeds and the other fails. No one expects that with a database.

So transactional memory is this idea: "Hey! What if we could update memory like a database -- in transactions?" This would bring easier-to-understand programming as well as less contention between parallel threads, which would lead to more scaling/speed-up. Perfect in a parallel world. This is very attractive because "shared mutable state" is the root of many parallel programming problems. That said, we just have to figure out how to build it. Key questions include: What should be in software vs. hardware? How large a transaction needs to be handled? What are the exact semantics? What are the failure modes?
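The "all or nothing" semantics can be illustrated with a toy optimistic-commit scheme in Python: a transaction works on a private copy and publishes it only if no other transaction committed in the meantime, otherwise it retries. This is a conceptual sketch of STM-style semantics (the `VersionedBox` class and its version check are invented for illustration), not a real STM implementation.

```python
# Toy illustration of all-or-nothing updates in the spirit of software
# transactional memory: read a snapshot, mutate a private copy, and commit
# only if nothing changed underneath; otherwise abort and retry.
import threading

class VersionedBox:
    def __init__(self, value: dict):
        self._lock = threading.Lock()   # guards only the commit point
        self._version = 0
        self._value = value

    def read(self):
        with self._lock:
            return self._version, dict(self._value)   # snapshot copy

    def commit(self, expected_version: int, new_value: dict) -> bool:
        with self._lock:
            if self._version != expected_version:
                return False            # another transaction won: abort
            self._version += 1
            self._value = new_value
            return True

def atomically(box: VersionedBox, update) -> None:
    while True:                         # optimistic retry loop
        version, snapshot = box.read()
        update(snapshot)                # mutate the private copy freely
        if box.commit(version, snapshot):
            return

ticket = VersionedBox({"day": "Tuesday", "time": "8:00"})

def rebook(t: dict) -> None:
    t["day"] = "Thursday"
    t["time"] = "11:30"

atomically(ticket, rebook)
print(ticket.read()[1])   # both fields changed together, or neither
```

The point of the sketch is the shape: no observer can ever see "Thursday 8:00," because the half-updated state exists only in the transaction's private copy.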

It remains an incredibly important research topic because it seems to offer so much hope. There are a number of companies doing software and hardware implementations to continue exploring this topic. Our Software Transactional Memory (STM) Compiler is available at http://whatif.intel.com for experimenting in C/C++.

Dr. Dobb's: We keep hearing that multicore and parallelization will demand new skill sets on the part of developers. What kind of skills, for instance?

Reinders: I have to start with "Think Parallel." What did a developer need to know before parallelism? The fundamentals include data structures, algorithms, modularity, parsing, databases, and operating systems -- not to mention good design and change control. How much you know about each, and what exact skills you develop, is an individual thing.

Add parallelism, and we need to understand parallelism as part of all of these. Some of this has been going on for some time -- any database expert is going to have thought about parallelism in many ways for years now. Ditto for operating systems, at least at some levels. Other things like parallelism in algorithms and data structures are new to many developers.

Start with "Think Parallel." Understand where parallelism can be exploited in your application, and the pros and cons of different approaches.

In addition to "Think Parallel" the two key skills are:

Writing scalable code. The shift in mindset here is fundamental: a good program needs to scale much more than it needs to be efficient. This is a very difficult concept for most to embrace, and if you develop a skill at seeing how to write code that scales, you will have a skill that will be in demand. To survive, we all need to cope with the issue of scaling to some degree.

Dealing with shared information -- managing global state sharing. Do you share mutable state? If not, how do you update state globally? The desire to make global state visible, but not globally changeable, is really similar to the debates in OOP about public vs. private, with all the same issues.
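"Visible but not changeable" global state has a direct expression in Python via `types.MappingProxyType`: readers get a live, read-only view, while all updates go through a single owner path -- the same public-vs-private discipline from OOP, applied to shared state. (The `set_workers` function name here is an invented example.)

```python
# Globally visible, but not globally changeable: readers see a live read-only
# view; mutation is only possible through the owning module's update function.
from types import MappingProxyType

_config = {"workers": 4}                 # private, mutable master copy
CONFIG = MappingProxyType(_config)       # public, read-only view of it

def set_workers(n: int) -> None:
    _config["workers"] = n               # the single sanctioned mutation path

print(CONFIG["workers"])                 # 4
set_workers(8)
print(CONFIG["workers"])                 # 8: the view tracks the master copy
try:
    CONFIG["workers"] = 16               # readers cannot mutate the view
except TypeError:
    print("read-only")
```

In a parallel setting the same shape pays off: many readers can consult the view freely, and reasoning about updates reduces to reasoning about the one writer.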

Dr. Dobb's: Locks. Good, bad, or just ugly? Putting this another way, do you prefer locks or no-locks?

Reinders: Neither; I prefer working programs that scale.

Unfortunately, as ugly as locks are, they are here to stay. Nothing will make them disappear -- even if STM or lock-free algorithms catch on, locks will remain. So we need to understand them, even while we try to minimize their use. The key word being "minimize" -- as in Albert Einstein's quote, "Make everything as simple as possible, but not simpler."

Locks are not the real issue. The issue is highly contended locks. Grabbing a lock, and releasing it, is generally very clear programming to keep shared mutable state correct in a parallel program. A huge problem occurs when many parts of a program have to wait for a lock to be available because there is contention. This kills the scaling of a program (because waiting for locks adds a lot of overhead to being parallel).

Of course, the solution is eliminating the contention! As usual, the best approach is to question the need for the contention in the first place. Can you rewrite your program to not have to fight over this shared state?
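One common way to "question the need for the contention" is to restructure the shared state itself. A sketch in Python: instead of one hot counter behind one hot lock, stripe the counter across several shards so threads mostly take different locks, and sum the stripes to read. (The `ShardedCounter` class here is an invented illustration of the striping technique, not a standard library type.)

```python
# Reducing lock contention by sharding: each thread hashes to one of several
# stripes, so threads rarely wait on the same lock; reads sum the stripes.
import threading

class ShardedCounter:
    def __init__(self, shards: int = 8):
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [0] * shards

    def add(self, n: int = 1) -> None:
        i = threading.get_ident() % len(self._locks)  # pick this thread's stripe
        with self._locks[i]:                          # short, rarely contended hold
            self._counts[i] += n

    def value(self) -> int:
        return sum(self._counts)  # exact once all writers have finished

counter = ShardedCounter()
threads = [threading.Thread(target=lambda: [counter.add() for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value())   # 4 threads x 1000 increments = 4000
```

The locks are still there -- the rewrite simply arranges for threads not to fight over the same one, which is exactly the contention question Reinders suggests asking first.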

If you do need the contention over the shared mutable state, then question the need for the lock. Two promising approaches are lock-free algorithms and transactional memory. The first is hard and easy to get wrong, and the latter is still a research topic.

I love looking at lock-free algorithms/methods. Honestly, most lock-free proposals I see are wrong. The ones that are right are brilliant, but they can really make your head hurt as you puzzle out whether they are guaranteed to work in all cases.

Writing lock-free code is playing with fire. The rewards for getting it right are great, the challenge many. I've never encourage people to try to solve there lock problems by rewriting as a lock-free algorithm. Over time, we'll add lock-free algorithms and algorithms with low contention for locks, into our libraries and programming languages. So if you want to write lock-free code, do it for all of us, and then I can point to it as something everyone should use.
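The characteristic shape of a lock-free update is the optimistic retry loop around a compare-and-swap (CAS). Real lock-free code relies on a hardware CAS instruction; Python exposes no such primitive, so the sketch below simulates one (the `SimulatedCASCell` class is invented for illustration) purely to show the pattern: read, compute, attempt to publish, and retry if another thread got there first -- no thread ever sleeps holding the shared value.

```python
# The retry-loop shape of a lock-free increment, with a simulated CAS.
# In real lock-free code the compare_and_swap below would be a single
# atomic hardware instruction, not a lock-guarded method.
import threading

class SimulatedCASCell:
    def __init__(self, value: int):
        self._value = value
        self._guard = threading.Lock()  # stands in for the atomic instruction

    def load(self) -> int:
        return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        with self._guard:
            if self._value != expected:
                return False            # value moved under us: caller must retry
            self._value = new
            return True

def lock_free_increment(cell: SimulatedCASCell) -> None:
    while True:                         # optimistic retry loop
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return                      # published; otherwise another thread won

cell = SimulatedCASCell(0)
threads = [threading.Thread(target=lambda: [lock_free_increment(cell) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.load())   # 4 threads x 1000 increments = 4000
```

Even in this tiny example the subtlety Reinders warns about shows up: correctness depends on every update going through the CAS, and on convincing yourself the retry loop makes progress under all interleavings.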

So, when looking at an algorithm -- don't ask "is it lock-free?" Instead ask "how much contention will I get using it?"