What if I took to the pages of a major business magazine and made the bold recommendation that, because humans have run out of new places on Earth that we can migrate to, it is past time for us to make the collective leap to faster-than-light travel so that we can explore neighboring solar systems? Your reaction would probably be something like, "Yes, of course everyone would love to go faster than light, and if it were as easy as just deciding we all want to do it then it would be done already."

That's pretty much how I felt after reading a recent Forbes op-ed by NVIDIA's Bill Dally, in which he declares, "It is past time for the computing industry—and everyone who relies on it for continued improvements in productivity, economic growth and social progress—to take the leap into parallel processing." Obviously yes, we would all love to just magically jump right into parallel processing, and transform all of our existing serial workloads into parallel workloads. But there are two big problems: 1) nobody knows the percentage of existing serial workloads that can be usefully parallelized (but it's probably small), and 2) parallel programming is hard.

A lot of serial, not as much parallel

Note that in the preceding paragraph, I spoke of "workloads" and not "programs." That's because the problem isn't that existing software has been written one way and needs to be rewritten in some new way; it's that the tasks the software carries out are inherently serial. Of course, Dally is well aware of this distinction, but he conveniently ignores it because it doesn't help his point. However, the example Dally uses to illustrate the difference between serial and parallel is actually a very good demonstration of why we can't just rewrite serial software and make it parallel.

Here's Dally's analogy: "Reading this essay is a serial process—you read one word after another. But counting the number of words, for example, is a problem best solved using parallelism. Give each paragraph to a different person, and the work gets done far more quickly." Yep, the process of reading is definitely serial—there's no way to accomplish the task in parallel (believe me, I've tried), and no amount of programming wizardry will make it otherwise. Word counts, on the other hand, can be done either in serial or in parallel; but counting words is a much less interesting and useful undertaking than reading.
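To make that analogy concrete, here's a minimal Python sketch of the word-count half of it; the file name "essay.txt" is just a placeholder of my own. Each paragraph can be counted independently of all the others, which is exactly what makes the task parallelizable.

```python
from multiprocessing import Pool

def count_words(paragraph):
    # Each paragraph can be counted without looking at any other paragraph.
    return len(paragraph.split())

if __name__ == "__main__":
    with open("essay.txt") as f:                 # placeholder input file
        paragraphs = f.read().split("\n\n")
    with Pool() as pool:                         # one "person" per CPU core
        counts = pool.map(count_words, paragraphs)
    print(sum(counts))                           # combine the partial counts
```

Reading the essay, by contrast, has no such decomposition: you can't hand chapter three to a friend and have them understand it before you've read chapters one and two.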

As with reading vs. word counts, it has so far turned out that the bulk of ordinary computing tasks that are interesting and worthwhile are serial tasks; the parallel stuff, while critically important in a few key verticals, is niche. This is unfortunate for NVIDIA, because NVIDIA is in the parallel business. It could happen that the set of "interesting things we want to do with computers that are best done in parallel" will one day grow larger than the set of "interesting things we want to do with computers that can only be done serially," and if it does, that will be great for everyone (not just NVIDIA). But so far we appear to be on track for the opposite outcome.

Ultimately, NVIDIA's fundamental problem boils down to this simple fact: you can do parallel tasks in a serial manner, but you can't do serial tasks in a parallel manner. Because the industry started out building serial hardware, even parallel tasks have historically been done serially, since that was the hardware available. Some percentage of those tasks can be rethought to run in parallel, but, as I said above, so far that percentage has been disappointingly low.
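Here's a rough Python sketch of that asymmetry, using made-up data and a hash-chain example of my own rather than anything from Dally's piece. The first loop has no dependencies between iterations, so it could be split across as many cores as you like (or run happily on one); the second feeds each step's result into the next, so extra cores have nothing to do but wait.

```python
import hashlib

data = [3, 1, 4, 1, 5, 9, 2, 6]   # made-up input for illustration

# A parallel-friendly task done serially: each element is independent,
# so one core can simply walk the list, or many cores could split it up.
squares = [x * x for x in data]

# An inherently serial task: each link of the chain is a function of the
# previous link, so step N can't begin until step N-1 has finished,
# no matter how many cores you have.
chain = b"seed"
for x in data:
    chain = hashlib.sha256(chain + bytes([x])).digest()
```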

Are our programmers learning?

Dally talks a bit about practices and approaches, as if writing parallel software is mainly a matter of tools and training. Would that it were so.

There are some folks who honestly believe that if we gave computer science students the right tools for explicitly expressing parallelism and totally reformed the comp sci curriculum so that students are trained to use those tools from day one, we'd enter some sort of golden age of parallelism. But the number of people who think this way is shrinking, at least from what I've informally observed. This issue came up in an untranscribed portion of the conversation I had with Stanford president and RISC pioneer John Hennessy, and it has come up in many conversations I've had since with folks in the field: most humans just don't seem to be wired to learn parallel programming at the level that our processor hardware now demands. It's not that it can't be done; a few people really take to it and do it well. But, like the innate potential to become a chess grandmaster, the innate potential to become a real parallel-programming wizard is not evenly distributed in the population. Some professors and CS grad students I've talked to have observed the same thing I have: when you get to the point in a computer science curriculum where you introduce parallel programming, you lose too many students.

Right now, it's probably fair to say that the "tools and training will fix it" school of thought is still the mainstream. But as core counts increase and the gap widens between our hardware's peak theoretical performance and what real-world, ordinary programmers can actually get out of that hardware, we'll see more people concede that programmers just need to be able to write serial code that the machine then parallelizes for them. In other words, the programmer will get the board ready, and some software and/or hardware designed by grandmasters will take over and do the grandmaster-level chess playing.
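To sketch what that division of labor might look like, here's a hedged illustration of my own, not anything Dally proposes: the programmer writes an ordinary serial-looking map over independent items, and the parallel machinery, played here by Python's concurrent.futures standing in for whatever grandmaster-built tooling eventually does this transparently, fans the work out across cores. The classify function and the data are placeholders.

```python
from concurrent.futures import ProcessPoolExecutor

def classify(record):
    # Placeholder for per-item work with no dependencies between items.
    return record % 2 == 0

if __name__ == "__main__":
    records = range(100_000)

    # The serial version the programmer writes and reasons about:
    # results = list(map(classify, records))

    # The version the "grandmaster" machinery substitutes once it knows
    # the items are independent:
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(classify, records, chunksize=1_000))

    print(sum(results))   # number of records classified True
```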

Right about one thing: it will take time

Responding to this essay is ultimately like fighting a heavyweight boxer who has both hands tied behind his back. Dally is a smart guy with impeccable credentials, but this short, nonspecialist essay format appears to have led him into a trap that ensnares specialists of all types who aren't practiced at writing for a lay audience yet are pressed into doing so for commercial reasons: he got just enough rope to hang himself. I would love to read a longer technical essay in which he makes the points he intended to make with this piece, but on his own terms.

In the end, I can definitely agree with him on one thing: the industry has a long slog ahead of it before it's ready to start using Moore's Law to once again deliver the kinds of performance gains (as opposed to functionality and system-level power efficiency increases) that it gave us up until recently. That process will take not just time, but a whole lot more investment in fundamental computer science research.