Let's say we have a row-major 32×32 array of 4-byte words allocated in shared memory (as displayed below).

If we read across a row, there is no problem: each lane (a thread within a warp / wavefront) accesses a different bank, so the warp's request is serviced in parallel.

However, accessing the array in a column-major fashion, where each warp is assigned a column (so lane i reads element (i, c), and all 32 addresses fall in the same bank), creates a 32-way bank conflict, WHICH IS TERRIBLE !
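The two access patterns can be checked with a small simulation. It assumes the usual mapping of shared memory into 32 banks of 4-byte words, with bank = word address mod 32 (true on most NVIDIA GPUs, though some architectures also support an 8-byte bank mode):

```python
NUM_BANKS = 32        # banks of 4-byte words (assumed, typical for NVIDIA GPUs)
ROWS, COLS = 32, 32   # the shared-memory tile from the text

def bank(row, col, stride=COLS):
    # Bank holding element (row, col) of a row-major array with the given stride.
    return (row * stride + col) % NUM_BANKS

# Row access: lane i of the warp reads element (0, i).
row_banks = [bank(0, lane) for lane in range(32)]
print(len(set(row_banks)))   # 32 distinct banks -> fully parallel

# Column access: lane i reads element (i, 0).
col_banks = [bank(lane, 0) for lane in range(32)]
print(len(set(col_banks)))   # 1 bank -> 32-way conflict, fully serialized
```

Because the row stride (32 words) is a multiple of the bank count, every element of a column lands in the same bank, which is exactly why the column read serializes.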

This issue is commonly encountered when implementing matrix transpose on the GPU, and it can easily be tackled by adding one column of padding.

So now the shared-memory tile becomes a 32×33 array. By coloring the data entries according to the bank they reside in, we notice that whether we read across a row or down a column, we have no bank conflicts ! And all we paid for this was an extra 32 words of unused shared memory, which is not too big a deal.
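Re-running the same simulation with the padded stride of 33 shows why this works: 33 is coprime with the 32-bank count (again assuming the standard 4-byte, 32-bank mapping), so a column now sweeps through all 32 banks.

```python
NUM_BANKS = 32
PADDED_STRIDE = 33   # 32 data columns + 1 padding column

def bank(row, col, stride=PADDED_STRIDE):
    # Bank of element (row, col) in the padded row-major tile.
    return (row * stride + col) % NUM_BANKS

# Row read: lane i -> (0, i). Column read: lane i -> (i, 0).
row_banks = [bank(0, lane) for lane in range(32)]
col_banks = [bank(lane, 0) for lane in range(32)]
print(len(set(row_banks)))   # 32 distinct banks: conflict-free
print(len(set(col_banks)))   # 32 distinct banks: also conflict-free
```

Any stride coprime with the bank count would do; 33 is simply the cheapest choice, costing one extra column.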

Some experimental results from performing a matrix transpose on a 10000×15000 matrix: