AMD just open sourced the AMD Performance Library as Framewave, which at least from my perspective seems like a good thing. Not that I'm going to attempt to use it, but I perused the source out of curiosity, and it looks like there are some useful goodies in there.

One thing I wanted to look at was their 8x8 2D IDCT source. The 8x8 two-dimensional inverse discrete cosine transform (IDCT) is popular and used in a number of video compression formats. There are a million ways to implement it quickly, and although everyone's seen Intel's AP-922 SSE2 algorithm for it by now, I hadn't seen one by AMD before. So I grab the source, dig around in the JPEG module, and see a call to fwMalloc() for a small temporary buffer right in the middle of the routine.

Another ugliness I saw, which unfortunately isn't restricted to Framewave, is assembly language routines that have been translated to intrinsics. The result is a nasty C++ routine that has variables like "pedx" and "pesi," but has instruction names translated so that what used to be an understandable "paddw" is now "_mm_add_epi16." I know this was a hack job for portability, but the result sure is unreadable.

It looks like there are indeed a number of well-optimized SSE2 routines in the Framewave library, but after seeing things like the above a few times, I was left scratching my head a bit....

Comments


Well - their library seems to seriously lack optimization for static operations. Their resizers are unable to retain their coefficients, so there is no possibility of re-use.

A much bigger WTF is that their resizers split the image without overlap, so you will get incorrect interpolation where the library splits it. You might also enjoy knowing that they call 4 mallocs per thread per resize. 16-bit resizes convert to and from float in the inner loop. Wow. ;)

Maintaining this beast seems horrible. The 8-bit unsigned _cubic_ resizer alone is 1488 lines of code - yes, 1488. It seems they just gave up on Lanczos and implemented it in plain C++.

Colorspace conversions have no interpolation whatsoever. No interlaced 4:2:0 RGB - nice - you have to separate fields - but considering their lack of interpolation, that shouldn't be much of a concern.

These are the only ones I've looked at - and honestly - I would only use it for prototyping, since it's easy to use and (presumably) well tested.

Klaus Post (link) - 21 02 08 - 06:10

Is fwMalloc == malloc? Are they using a custom preallocated memory pool to ensure locality or something?

Nick S - 21 02 08 - 06:23

It would've been nice to actually explain what the problem is with calling malloc and what a better implementation would have been.

Calvin K - 21 02 08 - 07:11

Until we see what fwMalloc() actually does, there is no valid point in bashing it. If it's using some sort of fast pool allocator, I don't see anything wrong with it.

Bojan - 21 02 08 - 07:55

From the little I dug into it, it seems to be as follows:

void* STDCALL fwMalloc (int length) { return AlignedMalloc ( length ); }

where AlignedMalloc is platform-specific. On Windows it is:

return _aligned_malloc( length, 32 );

on OS X:

void *memptr = malloc ( length ); return memptr;

on Linux:

void *memptr; posix_memalign ( &memptr, 32, length ); return memptr;

and on Solaris:

void *memptr = memalign ( 32, length ); return memptr;

So it seems they all call memory allocators that return an aligned piece of memory. It does seem quite wasteful to call this in an otherwise tight routine, but I don't know anywhere near enough to say how else you can get aligned memory.

Jarkko Miettinen - 21 02 08 - 09:42

"how else you can get aligned memory"

=> Allocate an oversized buffer and use only the aligned middle of it.

Tobias Rieper - 21 02 08 - 10:03

Tobias: Presumably, that's exactly what the system calls do, too.

Jake McArthur (link) - 21 02 08 - 10:24

I've seen stuff like that pulled only in research code so far. Anything you're going to use often that needs temporary buffers should allocate them once, in advance, or at least be gracious enough to let you specify the temp buffer manually to avoid crap like that.

Stefan - 21 02 08 - 11:18

malloc() is really expensive, for many reasons. It's an external routine, it may have to lock, and it has to deal with requests for variable memory sizes. In a case like this, where you have a fixed-size allocation with local scope, you can handle the allocation privately much, much faster than any sort of general-purpose allocator.

I suppose it's not obvious to people who haven't written this kind of routine before, but for a relatively small buffer like this it's much faster to just allocate it on the stack as a local variable. There are a number of advantages: the allocation failure case is tightly controlled (generally impossible), it requires minimal code, the memory is likely already in L1 cache, you can control alignment against other buffers to reduce cache line address aliasing, and you don't need an extra register to point to it. C++ compilers generally provide a way to enforce alignment on local variables, such as __declspec(align(n)) in VC++, and in C# you can use stackalloc in an unsafe routine.

You can make the allocation call hurt less by implementing a fast allocator, but there's no way you'll beat stack allocation, even with a super-fast thread-local implementation or even a compacting garbage collector. The function call alone is significant in a routine as critical as an IDCT, which commonly executes in less than 2,000 clocks.

Phaeron - 21 02 08 - 22:53

I just took a look at the cubic and Lanczos resamplers that Klaus mentioned, and they're worse than he lets on -- in fact, they're implemented as direct 2D resamplers. For the cubic, this results in 16 multiplies / 16 adds per pixel, versus 8m7a for a row/column implementation. Memory bandwidth issues _might_ even that out. I have a hard time believing that might be faster for the Lanczos case, though, which uses a 6x6 kernel and where the difference is nearly 5:1 in ALU ops (64m64a vs. 12m11a).

Quality looks like it might be an issue as well. They're using pmullw to do the weighting, which means they had to use only 6 bits of fractional precision for the weights. (I don't remember what Avisynth uses, but VirtualDub uses 14 bits since it uses pmaddwd.) That's somewhat marginal for a 1D implementation, and really sketchy for a 2D kernel... and they're also converting the weights to fixed point using truncation instead of rounding. They do at least renormalize the weights _after_ conversion to integer, which is a common mistake otherwise.

The data flow is also a bit complex for the 8bpp case. Rather than using slightly larger pre-aligned kernels, they use a scalar loop to copy the pixels into a temporary buffer and then load that into SSE2 registers. That's a bunch of load/store traffic and a series of store forwarding stalls. They seem to be doing this in order to compute 8 pixels in parallel and not have to do a horizontal add, but I wonder if this is actually faster than using pmaddwd and then a four-way, 32-bit horizontal add.

I'd like to point out that it wasn't my intention to rip on the Framewave library in general, just on the particular IDCT routine. It does look like a number of routines in the library are underoptimized, however. I think one goal of the library is to be API compatible with the Intel Performance Primitives (IPP) library, though, so we should give leeway where the original Intel API lacks flexibility.
Does anyone know how well Intel IPP fares in comparison?

Phaeron - 22 02 08 - 00:37

For the record, Avisynth also uses 14-bit coefficients. I rewrote the vertical resizer as a dynamic SSSE3 version that uses pmulhrsw, so the accumulator fraction is only 6 bits. It falls back to MMX if there are more than 4 taps.

Klaus Post - 22 02 08 - 15:41

"Another ugliness I saw, which unfortunately isn't restricted to Framewave, is assembly language routines that have been translated to intrinsics. The result is a nasty C++ routine that has variables like "pedx" and "pesi," but has instruction names translated so that what used to be an understandable "paddw" is now "_mm_add_epi16." I know this was a hack job for portability, but the result sure is unreadable."

Though I can understand why this was needed: the x64 Visual C++ compiler does not support inline assembly.

Yuhong Bao - 27 02 08 - 20:41

"Does anyone know how well Intel IPP fares in comparison?"

I haven't tested those particular functions, but I hear there is a free trial of IPP :)