btarunr 33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.

OH NO MY GOD, btarunr, STOP THIS NONSENSE PLEASE.There's wiki out there that explains the difference between 4 operand and 3 operand. FMA4 and FMA3 they all just do one job, compute 'd=a*b+c'. The only difference is that FMA4 stores result 'd' in a new register which is specified in the instruction, while FMA3 stores it by overwriting one of the three input registers. THEY DID THE SAME THING, just different ways of handling the result.FMA4 has the advantage of programming flexibility, meaning there's more room for optimization, since the output and input do not interfere. But the room will never be anywhere near 33%. If you write x86 assembly code, you will understand. However FMA3 uses less transistors, easier to implement (means you can design it with less latency on silicon), so Intel jumped ship of FMA4 and chose FMA3.I don't write BLAS code, to be honest. But I do think well-optimized FMA3 doesn't have much disadvantage. Because if the flexibility is not well utilized, then the FMA4 processors will be troubled by its chunkier and slower units.Edit:I just come up with a great analogy of this. We can call x86's ADD as 'ADD2', ARM's ADD as 'ADD3'. If you write 'ADD A,B' in x86, then it stores the result in A, meaning A=A+B. If you write 'ADD A,B,C' in ARM, then it is good old 'A=B+C'.Sure ADD3 is more flexible, but I don't think ARM has 50% more throughput than x86.