Joe from scalability.org points to an interesting article. Apparently, ATI is moving into the field of stream processing. In this post, I will tell you a little about what stream processing is, what your graphics processor has to do with it and also what problems I see with it. But let me start by having a little fun with their announcement:

ATI has invited reporters to a Sept. 29 event in San Francisco at which it will reveal “a new class of processing known as Stream Computing.”

Wow. Looks like they invented a whole new paradigm. Or maybe not? I guess some smart guys at Stanford have been into stream processing for almost ten years now. And I bet they weren’t the first ones either.

OK, so stream processing is not entirely new, and I will stop giving their poor marketing people a hard time. ATI is expected to reveal a product called Firestream, which enables stream processing on their graphics cards. The concept behind that is called GPGPU, short for general-purpose computing on graphics processing units. And this is where it gets really exciting. In fact, it gets so exciting that a colleague of mine did a seminar (in German) on the topic with some very motivated students last semester, and this is basically where I learned everything I am going to tell you next. Therefore, please take everything I tell you here with a grain of salt, as I am certainly NOT an expert in this field.

You probably know that your graphics adapter has a lot of power (at least you may know that it eats a lot of power, but that’s a different, although related, story :P). Modern GPUs have a complexity (measured in transistors) that rivals, if not surpasses, the complexity of your CPU. And it gets even better: while your CPU has to be able to do a whole lot of things, the GPU specializes in a very narrow task: it takes a huge number of triangles and some textures, applies some relatively small operations to the triangles (in its so-called vertex shaders), splits them up into pixels, applies some more operations (this time in the pixel shaders) and spits out a whole lot of pixels for your viewing pleasure. I can almost hear you scream in pain if you know your way around graphics programming and have to listen to this very gross oversimplification of the graphics pipeline, but since I am not an expert in this field and I expect most of you aren’t either, I will leave it at that :D. Feel free to flame me in the comments section. The shader operations on one vertex or pixel do not depend on the results for any other, and therefore the GPU can have many shader units working in parallel (which is good, because that makes the operations described above really, really fast).

Now, how does stream computing fit into the picture? Let’s say you have a huge amount of data and want to apply the same operation to each element. Maybe in the game of life. Or if you want to simulate a waterfall. Your CPU is not really well suited for that task, because it works through the elements more or less one at a time. Your graphics adapter, on the other hand, is a perfect fit: just feed the data into the graphics pipeline (turning it into a stream of data, which is really just a one-dimensional array), program the shaders with the appropriate operations (called kernels in this context), and the GPU will happily crunch through the data, using all its data-parallel processing power. And because that’s what it can do best, it will do so fast. Really fast. We have seen a speedup by a factor of 30 in a simple matrix multiplication experiment done by one of our students, but of course your results may vary.
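
To make the idea of a kernel a little more concrete, here is a minimal sketch in plain Python. It is not GPU code, and the names life_kernel and run_kernel are made up for illustration. It only shows the shape of the computation: one small function that computes a single output element from the old data, and a driver that applies it to every element. I am assuming a grid that wraps around at the edges, just to keep the kernel short.

```python
def life_kernel(grid, row, col):
    """Kernel: compute the next state of ONE cell, reading only the old grid."""
    rows, cols = len(grid), len(grid[0])
    # Count the eight neighbours; indices wrap around at the edges (assumption).
    neighbours = sum(
        grid[(row + dr) % rows][(col + dc) % cols]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    if grid[row][col] == 1:
        return 1 if neighbours in (2, 3) else 0
    return 1 if neighbours == 3 else 0


def run_kernel(kernel, grid):
    """Apply the kernel to every cell and collect the results in a NEW grid.

    No call depends on the result of another call, so the order of the
    calls does not matter. Here they simply run one after the other.
    """
    return [
        [kernel(grid, r, c) for c in range(len(grid[0]))]
        for r in range(len(grid))
    ]


# A vertical "blinker" turns into a horizontal one after one generation.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
next_gen = run_kernel(life_kernel, blinker)
for line in next_gen:
    print(line)
```

The important bit is that the kernel never writes into the grid it reads from. That is what would allow a GPU to hand the individual cells to its shaders in any order it likes.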

Unfortunately, the story has downsides as well. If you think parallel programming is hard, then programming GPGPUs is hell :(. At least that was my personal conclusion after listening to some of our students’ experiences. Although there are some higher-level languages available (e.g. Sh or Brook), you basically need to be a graphics expert anyway to program in them. You will also trip over countless bugs, as well as limitations of your particular graphics adapter that you never had to think about before.

And it gets even worse: because all the fun is happening on the graphics adapter, your kernels have no way to access main memory directly! Therefore, all the additional data you may need has to be converted into textures somehow, which can then be read by your kernels. Very strange, if you ask me. Your data also must not have any dependencies inside the stream, because it is worked on in parallel by the different shaders on the GPU.
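
Here is a rough sketch of what that restriction looks like from the kernel's point of view, again in plain Python rather than real shader code. The names lookup_kernel, run_with_texture and palette are hypothetical, and the tuple is only my stand-in for a texture: a read-only table that the kernel may read from at any position but can never write to.

```python
def lookup_kernel(element, texture):
    """Kernel: map one stream element to one output value.

    The only extra data it can see is `texture`, a read-only table.
    Reading from any position in it is fine; writing is not possible.
    """
    return texture[element % len(texture)]


def run_with_texture(kernel, stream, side_data):
    """Hypothetical driver: pack the side data into a 'texture', then run the kernel."""
    texture = tuple(side_data)  # immutable, standing in for a texture bound for reading
    return [kernel(element, texture) for element in stream]


palette = [10, 20, 30, 40]  # side data the kernel needs
indices = [0, 3, 1, 1, 2]   # the stream itself
shaded = run_with_texture(lookup_kernel, indices, palette)
print(shaded)  # [10, 40, 20, 20, 30]
```

A running total, where element i needs the result for element i-1, does not fit this shape at all, because no kernel invocation gets to see what another one produced.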

And finally, expressing your algorithms in a streamy way is not an easy task when you look at the limitations sketched above. I will even go as far as to say that most algorithms and data structures known today are not usable at all for stream processing as they stand, and new ways to express them have to be found first. A very rewarding task for a research project, but not so good when you want to utilize it in a product and your boss is bugging you with a deadline.
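
As a small illustration of what "expressing it in a streamy way" can mean, take something as trivial as summing up a list. The usual loop with an accumulator has a dependency between every two iterations, so it does not fit. One common way around this, sketched below in plain Python with made-up names (pairwise_add_kernel, stream_sum), is to run several passes, where each pass adds up neighbouring pairs and produces a stream half as long. I am not claiming this is how any particular GPU library does it; it is just the general idea.

```python
def pairwise_add_kernel(stream, i):
    """Kernel: output element i is the sum of input elements 2i and 2i+1."""
    left = stream[2 * i]
    # With an odd number of inputs, the last element has no partner.
    right = stream[2 * i + 1] if 2 * i + 1 < len(stream) else 0
    return left + right


def stream_sum(values):
    """Sum a sequence using only independent per-element kernel calls.

    Returns the total and the number of passes that were needed.
    """
    stream = list(values)
    passes = 0
    while len(stream) > 1:
        out_len = (len(stream) + 1) // 2
        # Within one pass, no output element depends on another output.
        stream = [pairwise_add_kernel(stream, i) for i in range(out_len)]
        passes += 1
    total = stream[0] if stream else 0
    return total, passes


total, passes = stream_sum(range(1, 9))
print(total, passes)  # 36 3
```

Eight elements need three passes instead of one loop. Every pass is perfectly parallel, but you had to rethink even this tiny algorithm to get there.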

A very pessimistic outlook so far. But maybe ATI has something more than just hardware up its sleeve. Some kind of Stream Building Blocks would be nice (kind of like what the Intel Threading Building Blocks are trying to do for threads). Or a language that hides at least some of the hardware specifics and raises the level of abstraction considerably. And actually works. And if it’s not ATI, maybe somebody else will manage to make GPGPU programming at least achievable without losing all your hair beforehand ;).

I want to close this post with a quote from a commercial for a large tire manufacturer, which I find very fitting in this context:

Power is nothing without control.

I am hoping at least some people at ATI worry about control as much as about power.