I've started to implement an 8086/8088 with the goal of being cycle-exact. I understand the reasoning behind the clock-cycle counts for most instructions, but I'm quite puzzled by the Effective Address (EA) calculation time.

More specifically, why does computing BP + DI or BX + SI take 7 cycles, but computing BP + SI or BX + DI take 8 cycles?

I could just wait for a given number of cycles, but I'm really interested in knowing why there's this 1-cycle difference. More generally, why does any EA calculation take so many cycles, given that the EA is computed using the ALU and an ADD between registers takes just 3 cycles?
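
To make the "just wait" fallback concrete, here is a minimal sketch in C that looks up the documented EA clock count from the ModR/M byte. The function name `ea_cycles` and the table layout are illustrative assumptions, not taken from any existing emulator; the numbers come from Intel's published EA timing table, and the extra 2 clocks for a segment override prefix are deliberately not modeled.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns the documented EA calculation clocks
 * for a given ModR/M byte on the 8086/8088.
 * Assumption: segment override penalty (+2 clocks) is handled elsewhere.
 */
int ea_cycles(uint8_t modrm)
{
    /* Clocks for mod = 00 (no displacement), indexed by the r/m field:
     * 000 [BX+SI] = 7   001 [BX+DI] = 8
     * 010 [BP+SI] = 8   011 [BP+DI] = 7
     * 100 [SI]    = 5   101 [DI]    = 5
     * 110 [disp16] = 6 (direct address)
     * 111 [BX]    = 5
     */
    static const int no_disp[8] = { 7, 8, 8, 7, 5, 5, 6, 5 };

    int mod = modrm >> 6;  /* top two bits */
    int rm  = modrm & 7;   /* bottom three bits */

    if (mod == 3)
        return 0;          /* register operand: no EA calculation */

    if (mod == 0)
        return no_disp[rm];

    /* mod = 01 or 10: a displacement is added, costing 4 more clocks.
     * Here r/m = 110 means [BP+disp], so its base cost is 5, not 6. */
    int base = (rm == 6) ? 5 : no_disp[rm];
    return base + 4;
}
```

For example, `ea_cycles(0x00)` (that is, [BX+SI]) gives 7 and `ea_cycles(0x01)` ([BX+DI]) gives 8, which reproduces the 1-cycle difference without explaining it.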