The first conclusion I can draw from this experience is that CUDA can offer significantly higher performance for image processing than pixel shaders on the same hardware. Extrapolating, it looks like deinterlacing a 4:2:2 SD video frame with Yadif would take about 6-8ms/frame on this video card, which is much faster than I could manage with D3D9 and more than good enough for 60 fps. The second conclusion is that writing a kernel that achieves good performance takes a lot of work, definitely more than an equivalent pixel shader based solution. You definitely can't take an off-the-shelf routine, compile it for the GPU, and expect good performance, but if you spend a lot of time massaging the memory access patterns, you can indeed get good results.

In general, optimizing a CUDA kernel unfortunately doesn't appear to be any easier than optimizing a CPU kernel. As I just noted, avoiding uncoalesced global memory accesses is an absolute must, and the solution often involves ugly copies to and from the on-chip shared memory, which is limited in size (16K). Second, you have to manually tune the block size and register count until everything is juuuuust right. When I was writing an ELA deinterlacer, my initial attempt took ~12ms for a 640x480, 8-bit image; I eventually got it down to 3ms, but by then the kernel was ugly as sin. It also doesn't help that you can easily lock up the entire system if your kernel accidentally scribbles over adjacent video memory, an event which ate the first version of this blog posting. There's no memory protection to save you here.

Another problem with using CUDA for image processing is that it isn't well suited to byte data. There are no byte operations other than conversions and load/store ops, so all byte values need to be widened to 32-bit ints or floats. You also don't get access to the raster-op (ROP) units, so you have to write bytes directly, and there you get bitten by the conversion. The underlying PTX instruction set has a perfect instruction for this that does a float-to-uint8 conversion with saturation, but sadly the nvcc compiler doesn't seem to generate it. Memory access is also complicated: the obvious one-thread-per-byte setup leads to uncoalesced global accesses and lots of shared memory bank conflicts. The final insult is that while there are vector types, no vector operations are defined by default, and the underlying hardware is scalar anyway. The best way I've found so far is to manually unroll the kernel by four, which adds to the complexity.
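As a sketch of the workaround (my own helper names, not a standard API), the missing saturating conversion can be approximated with explicit clamping, and four results can be packed per 32-bit store to keep writes coalesced:

```cuda
// Hypothetical sketch: clamp a float result to [0, 255] and round, since nvcc
// won't emit cvt.sat.rni.u8.f32 for us. __float2int_rn rounds to nearest.
__device__ unsigned int to_u8(float v)
{
    return (unsigned int)__float2int_rn(fminf(fmaxf(v, 0.0f), 255.0f));
}

// Pack four byte results into one 32-bit word so the store is a single
// coalesced transaction instead of four separate byte writes.
__device__ void store4(unsigned int *dst, float a, float b, float c, float d)
{
    *dst = to_u8(a) | (to_u8(b) << 8) | (to_u8(c) << 16) | (to_u8(d) << 24);
}
```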

This version does an aligned move of 48 elements into shared memory before extracting the 18 of them needed for the filter operation, executing in 4.8ms (221Mpixels/sec) instead of 22.4ms. Notice the need for explicit synchronization -- this code is actually run on a 16x16 block of threads, and a barrier is needed since data is being trampolined between them through shared memory. If you think that tracking down race conditions on a four-core CPU is bad, well, here we have 256 threads.

The good news is that it's basically C; the bad news is that the memory performance is abysmal. It turns out that access to global memory is very tricky in CUDA due to alignment restrictions on most hardware. If all the threads access memory in just the right order and at just the right alignment, you get really fast access. Get anything wrong and every single thread makes its own uncached transaction on the bus, leading to an order of magnitude drop in performance. One of the restrictions is that the set of accesses from a group of threads (half-warp) has to start at a 64 byte boundary, which doesn't work when you're trying to access a contiguous group of elements from the same thread. Therefore, the straightforward formulation above is actually a really bad way to write a CUDA kernel, as it is completely bottlenecked on memory access. You can work around this with textures, which are cached, but the addressing is more annoying since only 1D textures can be addressed by integer index. The alternative is to copy data to temporary buffers with just the right block size and stride:
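In sketch form (a reconstruction with hypothetical names, not the original listing), each 16x16 block stages 48 elements per row into shared memory with three aligned, coalesced 16-element loads, and then each thread reads its three taps from the staged row:

```cuda
// Hypothetical reconstruction: stage a 48-element row segment into shared
// memory with aligned loads, then filter from on-chip memory.
__global__ void filter3_shared(const int *src, int *dst,
                               int width, int height, int pitch)
{
    __shared__ int t[16][48];

    const int tx = threadIdx.x;              // 0..15
    const int ty = threadIdx.y;              // 0..15
    const int x  = blockIdx.x * 16 + tx;
    const int y  = blockIdx.y * 16 + ty;

    if (y < height) {
        // Staging region covers [block_start - 16, block_start + 32),
        // so every 16-element load starts on an aligned boundary.
        const int row  = y * pitch;
        const int base = blockIdx.x * 16 - 16;
        for (int i = 0; i < 3; ++i) {
            int sx = base + i * 16 + tx;
            t[ty][i * 16 + tx] = (sx >= 0 && sx < width) ? src[row + sx] : 0;
        }
    }

    __syncthreads();   // all staging loads must finish before any thread filters

    if (x < width && y < height) {
        // Elements 15..32 of the staged row are the 18 actually used.
        int sum = t[ty][tx + 15] + 2 * t[ty][tx + 16] + t[ty][tx + 17];
        dst[y * pitch + x] = sum >> 2;
    }
}
```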

Now, 22ms for a simple filter operation is slow enough to be fairly useless. Initially I was willing to write this off as another casualty of my wonderfully fast video card, but fortunately that's not the case. The CUDA kernel for the filter operation looked like this:
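A sketch of the straightforward version (reconstructed; names and filter weights are my assumptions) is one thread per pixel, each thread reading its neighbors directly from global memory:

```cuda
// Hypothetical reconstruction of the naive 3-tap horizontal filter. The
// unaligned src[...x - 1] access is what breaks coalescing on this hardware.
__global__ void filter3_naive(const int *src, int *dst,
                              int width, int height, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        int t0 = (x > 0)         ? src[y * pitch + x - 1] : 0;
        int t1 =                   src[y * pitch + x];
        int t2 = (x < width - 1) ? src[y * pitch + x + 1] : 0;
        dst[y * pitch + x] = (t0 + 2 * t1 + t2) >> 2;
    }
}
```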

With asynchronous copies, the CPU is no longer stalled at all and all API calls are fast. This bypasses the biggest problem with Direct3D 9, which is the synchronous GetRenderTargetData() call. The CUDA API gives a clue as to why GRTD() stalls, which is that asynchronous copies are only allowed if the host buffer is allocated as page-locked memory, which D3D9 probably does not support. This is a shame, given the much improved numbers above.
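A minimal host-side sketch of the pattern (buffer names and sizes are mine): allocate the host buffer as page-locked memory, then queue the copies and the kernel on a stream so the CPU is only blocked when the result is actually needed:

```cuda
// Hypothetical sketch: pinned host memory is required for cudaMemcpyAsync
// to actually be asynchronous.
cudaStream_t stream;
cudaStreamCreate(&stream);

int *h_buf;                                  // page-locked host buffer
cudaMallocHost((void **)&h_buf, size);

int *d_src, *d_dst;                          // device buffers
cudaMalloc((void **)&d_src, size);
cudaMalloc((void **)&d_dst, size);

cudaMemcpyAsync(d_src, h_buf, size, cudaMemcpyHostToDevice, stream);
filter3_shared<<<grid, block, 0, stream>>>(d_src, d_dst, width, height, pitch);
cudaMemcpyAsync(h_buf, d_dst, size, cudaMemcpyDeviceToHost, stream);

// The CPU is free to do other work here; wait only when the result is needed.
cudaStreamSynchronize(stream);
```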

This is for a 3-tap horizontal filter on a 1024x1024 monochrome image, with 32-bit integer samples. The first set of numbers is for the GPU and the ones in parentheses are for the CPU. 22ms for the blur kinda sucks as it's 47 Mpixels/sec, but the upload and readback numbers are more interesting. 1.6ms for the upload corresponds to 2.6GB/sec, which is good; 9.2ms for the download is a decent and workable 450MB/sec. Clearly there is some overhead to getting data on and off the GPU, but you don't have to go to extreme measures to crunch data. On the CPU side, the launch time for the kernel is a pleasing 0.03ms, but the upload and readback times are disappointing. The reason is that these are synchronous copies. Switching to the asynchronous APIs gives a different result:

This special syntax looks just like a function call, with the added fields in angle brackets describing the grid and block dimensions for threading purposes. This is an asynchronous call, so it's easy to do the right thing of queuing up kernel calls on the device and only waiting when you need to do data transfers. Those are also easy, because CUDA has device analogs for malloc(), free(), memcpy(), and memset() operations. The API is also nicely partitioned, with the runtime API being suitable for writing CUDA applications directly, and the explicit driver API being better for integration into frameworks. You can even use PTX assembly language if you need to check the compiler's output or generate code for CUDA directly.
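The device analogs mentioned above look like their C counterparts with a `cuda` prefix; a small sketch (sizes hypothetical, `h_buf` assumed already allocated):

```cuda
// Sketch of the runtime API's device memory analogs of malloc/memset/memcpy/free.
float *d_buf;
size_t bytes = 1024 * 1024 * sizeof(float);
cudaMalloc((void **)&d_buf, bytes);                      // device malloc()
cudaMemset(d_buf, 0, bytes);                             // device memset()
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice); // memcpy() across the bus
cudaFree(d_buf);                                         // device free()
```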

The first thing I have to say about CUDA is that it's fairly easy to use. Kernels are written in C using a special nvcc pre-compiler that allows you to mix host and device code; the pre-compiler automatically splits the two apart, compiling the host code using VC++ or GCC and the device code separately, then linking the two together. This means that you can compile a CUDA application from the command line as easily as a standard Hello World application. Invoking a kernel on the device is also easy:
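For instance (a sketch with hypothetical names), launching a kernel over a 1024x1024 image with 16x16 thread blocks looks like this:

```cuda
// The fields in angle brackets give the grid and block dimensions.
dim3 grid(1024 / 16, 1024 / 16);
dim3 block(16, 16);
filter3_naive<<<grid, block>>>(d_src, d_dst, 1024, 1024, 1024);
```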

After experimenting with pixel shaders for video acceleration, I decided to give NVIDIA's CUDA a shot to see how it performed on the same hardware. As I've noted before, Direct3D 9 can be used for video acceleration, but it has some performance problems in specific areas. CUDA uses a more general programming model and also provides more direct access to the video driver, making it potentially more interesting for video acceleration.

Comments


> asynchronous copies are only allowed if the host buffer is allocated as page-unlocked memory

Hmm. There are so many terms for it I'm not 100% sure, but I think you meant page-locked. Alternative names: pinned (more a Unix term) or nonpaged (Windows term). In this case the more accurate (but not equivalent) term would probably be DMA-capable, though, depending on how you define it. That is a good idea anyway, since only then is direct DMA from the GPU possible. In the other case (for upload), data is first copied into a DMA-capable buffer in the driver and then transferred, i.e. it involves an extra memcpy of all the data (though this might partially be done in parallel with the DMA transfer).

Issues with that type of memory: on Vista, only a limited amount is available. Some people have claimed values as low as 100 MB (and now combine that with memory fragmentation...). Of course there's a registry hack for that, but that's not fun. Now, on XP and on Linux the issue is: Linux has an "official" way to get such memory, limited by ulimit, which NVidia does not use (while I can understand that, I don't have to like it). Both on Linux and XP you can allocate almost all memory as such page-locked memory, from the driver, outside all the OS's memory handling. A rather sure way to bring down any machine, regardless of free swap space. (Conclusion: never ever allow NVidia drivers on a true multi-user system or you've lost.)

Also a minor comment on the last filter kernel example you have: as far as I can tell you only have threadIdx.x 0 - 15, i.e. one warp. And every thread that reads t[tx+15], t[tx+16] and t[tx+17] is then in exactly the same warp as the one that wrote it. In conclusion, the __syncthreads(); should be pointless and possibly decreases speed a lot, since it makes interleaving of memory access and calculation harder. Also, while I haven't personally confirmed it, a few people have claimed that using the 2D layout (i.e. threadIdx.y) generally decreases performance over doing the calculations explicitly in the kernel, particularly for small kernels.

Oh, that makes me think of something else: you recalculate Ar/Dr etc. in each thread, which is definitely wasting a good deal of processing power with such a small kernel. Ideally you would use a small array and calculate them once for each warp (so you do not need syncthreads()), though just declaring them __shared__ and adding a syncthreads can be just as good in some cases. As so often with optimizations, only testing really works.

One last thing: are you aware of the measurement pitfalls of CUDA? I.e. you must call a synchronization function before readback, otherwise your readback timing will include part of the calculation. Your numbers look like it, but it's amazing how easy that is to forget, even by people who _should_ know better ;-)

Reimar - 07 06 09 - 09:54

OpenGL does support async texture copies with PBOs, though it's a bit limited (no way to query completion other than to block until complete).

Glenn Maynard - 07 06 09 - 14:53

@Reimar: Whoops, yes, that should be page-locked, the term that NVIDIA uses. I've corrected the post.

I'm still learning the traps and tricks of CUDA, but what I've found is that there is more compute power and less memory bandwidth than expected. As such, I haven't bothered much with trying to optimize address computations, as my major bottlenecks are usually in memory access. What I've found in practice is that by the time you add in the code to copy data in and out of shared memory efficiently and unroll the loop as necessary (if processing bytes), the address calcs are a minor issue. Then again, most kernels are probably smaller than my ELA kernel (29 registers, 25% occupancy). The profiler output is hazy, but it looks like the instruction throughput is very high even with the syncs.

As for the __syncthreads(), you're probably right, but I worry about trying to accommodate different warp sizes. Also, I'm not sure how you do this if you are targeting a full warp, as that means you necessarily have to fetch more than one warp's worth of data in order to do overlapping filters.

I didn't see a need for syncing around the readback, as you wouldn't want to do that in production code, and that's where I'm interested in performance. The timings show what you'd actually get, which is the kernel launching quickly and the readback stalling while waiting for it to finish. Besides, I'd exclusively use asynchronous calls in practice.

@Glenn Maynard: What's wrong with NV_fence or ARB_fence?

Phaeron - 07 06 09 - 15:24

> the bad news is that the memory performance is abysmal. It turns out that access to global memory is very tricky in CUDA due to alignment restrictions on most hardware

The GT200 series doesn't have these restrictions anymore, although coalesced memory accesses are always going to be faster. When access patterns are hard to optimize, it may make more sense to use texture memory instead. That also allows you to get the data type conversion for free.

> The underlying PTX instruction set has a perfect instruction for doing this that does a float-to-uint8 conversion with saturation, but sadly the nvcc compiler doesn't seem to generate this.

You can use the following trick to use inline PTX. The compiler front-end is Open64, so it uses the same syntax as GCC:

    __device__ uint float_to_u8(float value)
    {
        uint result;
        asm("cvt.sat.rni.u8.f32 %0, %1;" : "=r" (result) : "f" (value));
        return result;
    }

Note that this is completely unsupported, so use at your own risk.

castano (link) - 08 06 09 - 01:27

The reason the address calculations have been relevant for performance for some is that when you let the compiler do it, it usually uses 32 x 32 bit multiplies, which are quite slow. Some people IIRC have gained several percent performance just by using mul24 (or whatever that special function for 24 bit x 24 bit multiplication is called). Of course it depends, as always.

The problem with not syncing around readback is that your numbers are useless for performance tuning. You might then assume that optimizing readback might help improve performance, while the actual problem is that e.g. your kernel is so slow it actually takes 30 ms, not just 22.

Also, if you have no other reason to use the async API, an explicit sync is exactly what you will want for larger problems; e.g. that is what I used before readback after queueing a lot of operations, as I don't know of any better way so far:

> while (cudaStreamQuery(cuStream) == cudaErrorNotReady) usleep(1000);

I wouldn't worry about different warp sizes; NVidia would break most code if they changed that. At least I'd not care about smaller warp sizes, i.e. just something like

    #if WARP_SIZE < 16
    #error not supported
    #endif

It should not be too hard to keep the code correct for larger warp sizes; performance is a different question though.

Reimar - 08 06 09 - 01:34

> The problem with not syncing around readback is that your numbers are useless for performance tuning.

I'm using GPU events and sequencing async readbacks with kernels in the default stream, so I don't know why you would say this. As far as I can tell the kernel timings are not being polluted by readback.

Phaeron - 08 06 09 - 02:38

Is OpenCL addressing these issues at all? Or is it just an additional abstraction on top of the already quite arcane CUDA details?

sagacity - 08 06 09 - 07:56

> As I've noted before, Direct3D 9 can be used for video acceleration, but has some performance problems in specific areas.

Could you elaborate? What performance problems are you talking about?

I've done some GPGPU work in DirectX/OpenGL shaders and in CUDA, and in my experience CUDA is very difficult to optimize, and even when you do optimize it, it's only optimized for your particular graphics card. The only advantage of CUDA on the current generation of hardware I can see is the support for shared memory in the kernels, which can be important in implementing some algorithms.

Emil Dotchevski (link) - 08 06 09 - 13:50

I've never looked at NV_fence, since there seems to be no equivalent on ATI hardware. As far as I can tell, there's no ARB_fence...

Glenn Maynard - 08 06 09 - 15:11

No idea on OpenCL. The only part I looked into about it was how programs are specified (source).

Direct3D 9 does not allow you to do asynchronous readback on any hardware+driver combo that I know of. NVIDIA, ATI, Intel, you name it: all block in the driver until queued commands are executed. This makes it difficult to interleave readbacks with complex scene rendering, because you are forced to do an expensive stall whenever you do the readback, no matter how much you try to space things out. This alone gives CUDA a significant advantage over Direct3D 9. Some OpenGL implementations do support async readback, however, when PBOs are supported. NVIDIA has very good performance here.

Oops, I thought there was an ARB_fence. Guess not. Well, I guess the crazy hack to try would be to see if NV_occlusion_query could be used for the purpose, since apparently ATI does support that.

Phaeron - 08 06 09 - 22:57

Is branching expensive? I know that when working with shaders branching is terrible, so I'm curious to know if it would be faster to just allocate enough memory to not need "if (x < width && y < height)" anymore.

LeeN - 09 06 09 - 20:54

My problem with CUDA is exactly branching. If you are writing a shader, there are macro defines that can select the most optimal path in dynamically, client-compiled code. With CUDA being pure C (not even templates are there), there is no way to write "über-shaders", as the DX11 documentation calls them.

Gabest - 21 06 09 - 20:53

Hello, I'd like to have a CUDA-driven transcoder plugin that allows VirtualDub to transcode any video file format to another format, e.g. MPEG2 to MP4 or WMV to MPEG2. That would make VirtualDub one of the most useful tools ever since the PC was invented and shame some companies with expensive video editing applications that are unable to implement CUDA