Fast Fourier transform in x86 assembly

I created this FFT library to assess the effort and speedup of a hand-written SIMD-vectorized implementation. The assembly implementation is under 150 lines of clear code; it achieves a speedup of about 2× on small and medium-sized inputs, but only a slight speedup on large inputs (probably because those are memory-bound).

The assembly code requires x86-64 AVX and FMA3, which are supported by Intel Haswell and later CPUs. The library uses the same equation as the implementation on the page Free small FFT in multiple languages; only the forward radix-2 FFT is provided, and the FFT is unscaled.
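Concretely, "unscaled" means the forward transform computes the plain DFT sum, with no 1/N (or 1/√N) normalization factor applied:

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i n k / N}, \qquad k = 0, 1, \ldots, N-1
```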

Source code

fft-test.c: Runnable main test program (portable)

fft.h: Exported function prototypes for all FFT implementations (portable)

fft-portable.c: FFT implementation for all CPU architectures

fft-x8664-avx.s: FFT core implementation in x86-64 AVX assembly

fft-x8664-avx-aux.c: Auxiliary functions in C to support the AVX implementation

fft-model-of-x8664-avx.c: Portable C code illustrating how the AVX code works

To compile, use one of these commands:

cc -O1 -o fft-test fft-test.c fft-portable.c -lm

cc -O1 -o fft-test fft-test.c fft-x8664-avx-aux.c fft-x8664-avx.s -lm

cc -O1 -o fft-test fft-test.c fft-x8664-avx-aux.c fft-model-of-x8664-avx.c -lm

License: MIT (open source)

Source code notes:

The x86-64 AVX assembly code only covers the FFT core. The table initialization and destruction are still implemented in C (fft-x8664-avx-aux.c) because they’re not time-critical.

The key idea behind the x86-64 AVX implementation is that all processing happens on vectors of four 64-bit floats rather than on individual scalar floats. In theory this could be 4× as fast as scalar code, but in practice it is limited by memory bandwidth, FPU execution resources, inherently serial loop-control code, etc. Also, the cosine/sine tables are interleaved in blocks of 4 elements, and the table values for each FFT size are packed together so that memory accesses are always sequential.
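The 4-wide processing described above can be sketched in portable C. This is a hypothetical sketch in the spirit of fft-model-of-x8664-avx.c, not the library's actual code; the function name and interface are illustrative:

```c
/* Hypothetical sketch of the 4-wide idea: each "vector" is a block of 4
 * consecutive doubles, standing in for one 256-bit AVX YMM register.
 * One radix-2 butterfly applied to two such blocks; each arithmetic line
 * inside the loop corresponds to one vector instruction (vmulpd,
 * vfmadd231pd, vaddpd, vsubpd) in the assembly version. */
static void butterfly4(double re0[4], double im0[4],
                       double re1[4], double im1[4],
                       const double cos4[4], const double sin4[4]) {
	for (int i = 0; i < 4; i++) {
		/* Twiddle multiply: (re1 + i*im1) * (cos - i*sin) */
		double tr =  re1[i] * cos4[i] + im1[i] * sin4[i];
		double ti = -re1[i] * sin4[i] + im1[i] * cos4[i];
		re1[i] = re0[i] - tr;  im1[i] = im0[i] - ti;
		re0[i] += tr;          im0[i] += ti;
	}
}
```

Because each of the four lanes carries an independent element, the scalar loop body maps one-to-one onto vector instructions with no shuffling, as long as the data and twiddle tables are laid out in matching blocks of 4.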

The portable C implementation is the same as the one found on my free FFT page, except that the table computation is separated out into an initialization phase (fft_init()).

As a bonus, the x86-64 AVX auxiliary C code includes a more accurate sine function, achieved by reducing the argument to a smaller domain (1/8 of a circle instead of half a circle).

The C model of the x86-64 AVX code exists to make it easier for the reader to understand what the AVX code is doing, in a more familiar notation and without having to look up instruction names in the Intel manuals.

Benchmark

For each implementation and FFT size, the table below shows the time in nanoseconds per FFT computation.

FFT size | Portable C, -O0 | Portable C, best | C model of AVX | AVX impl | Speedup (C model ÷ AVX)
4 | 83 | 43 | 25 | 30 | 0.83×
8 | 178 | 68 | 40 | 40 | 1.00×
16 | 401 | 115 | 79 | 57 | 1.39×
32 | 910 | 221 | 173 | 91 | 1.90×
64 | 2 086 | 457 | 385 | 185 | 2.08×
128 | 5 732 | 984 | 851 | 391 | 2.18×
256 | 10 661 | 2 187 | 1 902 | 879 | 2.16×
512 | 23 632 | 5 046 | 4 559 | 2 960 | 1.54×
1 024 | 51 861 | 11 362 | 10 576 | 6 411 | 1.65×
2 048 | 115 539 | 26 845 | 24 615 | 14 471 | 1.70×
4 096 | 254 768 | 69 851 | 64 313 | 38 166 | 1.69×
8 192 | 561 347 | 170 864 | 148 510 | 92 166 | 1.61×
16 384 | 1 266 715 | 368 223 | 343 670 | 209 489 | 1.64×
32 768 | 2 878 547 | 897 410 | 700 666 | 502 660 | 1.39×
65 536 | 6 441 139 | 2 143 479 | 1 528 429 | 1 083 482 | 1.41×
131 072 | 14 593 055 | 4 855 895 | 3 289 835 | 2 355 535 | 1.40×
262 144 | 33 355 806 | 12 773 822 | 7 373 814 | 5 541 606 | 1.33×
524 288 | 86 464 543 | 33 790 638 | 18 393 958 | 15 442 926 | 1.19×
1 048 576 | 220 892 337 | 109 162 034 | 55 214 393 | 48 812 958 | 1.13×
2 097 152 | 598 481 088 | 279 865 295 | 129 967 843 | 112 598 745 | 1.15×
4 194 304 | 1 352 574 930 | 657 329 787 | 291 134 128 | 246 925 032 | 1.18×
8 388 608 | 2 775 037 172 | 1 452 446 318 | 625 878 736 | 531 560 736 | 1.18×
16 777 216 | 5 936 787 976 | 3 158 039 436 | 1 325 008 402 | 1 113 777 707 | 1.19×
33 554 432 | 12 731 136 778 | 6 816 764 522 | 2 807 901 996 | 2 385 468 193 | 1.18×
67 108 864 | 27 154 169 335 | 14 967 892 000 | 5 908 496 753 | 5 029 359 709 | 1.17×
134 217 728 | 57 980 427 807 | 33 244 483 338 | 12 538 038 624 | 10 419 909 000 | 1.20×
268 435 456 | 121 103 066 499 | 71 678 823 683 | 25 447 116 592 | 21 696 563 979 | 1.17×
536 870 912 | 266 202 828 987 | 161 371 471 435 | 53 985 182 920 | 42 373 561 605 | 1.27×

The benchmark results are a mixed bag, but a few conclusions can be drawn from the numbers: