There are two types of global memory loads: caching and non-caching.

With caching loads, we attempt to hit first in L1, then in L2, before going back to global memory; with non-caching loads, we skip L1 and attempt to hit only in L2 before heading to global memory.

The two modes have different load granularities:

- 128-byte lines for caching loads
- 32-byte segments for non-caching loads

Caching mode is the default at compilation; non-caching mode can be selected by passing NVCC the options -Xptxas -dlcm=cg.
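For example, assuming a source file named kernel.cu (a placeholder name), the two modes would be compiled like this:

```
# default: caching loads (L1 then L2), 128-byte granularity
nvcc -o app kernel.cu

# non-caching loads (L2 only), 32-byte granularity
nvcc -Xptxas -dlcm=cg -o app kernel.cu
```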

The motivation for bypassing L1 generally comes down to the overhead created by its large load granularity and to the bus utilization wasted in certain access patterns; the finer 32-byte granularity of non-caching loads reduces both effects.

Now let's talk about global memory access patterns at runtime, shall we? Considering only caching mode here, global memory is divided into aligned chunks of 128 bytes.

GPU memory operations are issued per warp. When a lane (a thread within a warp) accesses a global memory location, the address determines which cache line is needed to satisfy the request.

Therefore, a warp with scattered accesses leads to poor bus utilization and wasted bandwidth. As illustrated below, the 32 lanes access scattered 4-byte words spread over N = 4 different 128-byte chunks: the warp needs only 32 × 4 = 128 bytes, but N × 128 bytes move across the bus, giving a bus utilization of 128 / (N × 128) × 100 = 25% with N = 4.

Worst Scenario: Scattered Accesses
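As a hypothetical sketch (the kernel name and stride are mine), a stride-4 read reproduces this pattern: each lane reads one float every 16 bytes, so a warp spans 32 × 16 = 512 bytes, i.e. four 128-byte chunks, for only 128 useful bytes.

```cuda
// Sketch: strided (scattered) read. A warp's 32 lanes span four
// 128-byte lines but use only 32 * 4 = 128 bytes: 25% bus utilization.
__global__ void strided_read(const float* __restrict__ in,
                             float* __restrict__ out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = tid * 4;               // one 4-byte word every 16 bytes
    if (idx < n)
        out[tid] = in[idx];
}
```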

In contrast, coalescing these warp accesses leads to better throughput, since bus utilization increases significantly compared to the scattered case. When the 32 lanes access 32 consecutive 4-byte words, we load at most 2 chunks, giving a bus utilization of at least 128 / 256 × 100 = 50%, whatever the alignment.

Coalesced and Misaligned Accesses
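A minimal sketch of this misaligned case, assuming an offset of one float (the kernel name and the offset are illustrative):

```cuda
// Sketch: coalesced but misaligned read. Consecutive lanes read
// consecutive floats, but the one-float offset makes each warp's
// 128 bytes straddle two 128-byte lines: 256 bytes move across the
// bus for 128 useful bytes, i.e. 50% utilization.
__global__ void misaligned_read(const float* __restrict__ in,
                                float* __restrict__ out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = tid + 1;               // +4 bytes: breaks 128-byte alignment
    if (idx < n)
        out[tid] = in[idx];
}
```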

Finally, by carefully aligning the warp's requests with the 128-byte chunks, we can obtain a theoretical bus utilization of 100%! This is achieved by mapping the kernel grid adequately onto the data: for instance, choosing data blocks that are multiples of 128 bytes is a great way to get started.

In this case, only 128 bytes move across the bus on a miss, compared to 256 bytes with the misaligned version.

Best Scenario: Coalesced and Aligned Accesses
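Here is what the ideal pattern might look like, assuming the buffers come from cudaMalloc (which returns pointers aligned to at least 256 bytes):

```cuda
// Sketch: coalesced and aligned read. Each warp reads exactly one
// 128-byte line (32 consecutive, line-aligned floats), so on a miss
// only 128 bytes move across the bus: 100% utilization.
__global__ void aligned_read(const float* __restrict__ in,
                             float* __restrict__ out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid];          // lane i reads word i of the line
}
```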

The order in which the lanes request the words of a cache line does not matter: permuted or not, as long as the warp touches a single 128-byte line, you still get your 100%.
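For instance, reversing the lanes within each warp (a toy permutation; the names are mine, and n and blockDim.x are assumed to be multiples of 32) still touches a single 128-byte line per warp:

```cuda
// Sketch: lanes permuted within the 128-byte line (lane 0 reads word
// 31, lane 31 reads word 0). The warp still needs only that one line,
// so bus utilization remains 100%.
__global__ void permuted_read(const float* __restrict__ in,
                              float* __restrict__ out, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;               // lane id within the warp
    int idx  = (tid - lane) + (31 - lane);     // reverse within the warp
    if (tid < n && idx < n)
        out[tid] = in[idx];
}
```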

To wrap this up, here is an example leveraging caching in CUDA:

We want to weight the channels of an RGB image with coefficients (A, B, C): A for red, B for green, and C for blue.

Instead of going for a sequence of interleaved [R|G|B] structures where each one is processed by a single lane, we can reorganize the data into three separate planes (R, G, B), each a multiple of 128 bytes (most image formats follow this convention anyway), so that consecutive lanes stride through each plane together, pulling whole cache lines into L1/L2 for faster subsequent accesses by other lanes, as sketched below.
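A minimal sketch under these assumptions, with float channels r, g, b already split into planes on the device (all names and parameters here are illustrative):

```cuda
// Sketch: structure-of-arrays RGB weighting. Each channel is a
// separate, 128-byte-aligned plane, so consecutive lanes read and
// write consecutive floats: three fully coalesced streams.
__global__ void weight_rgb(float* __restrict__ r,
                           float* __restrict__ g,
                           float* __restrict__ b,
                           float A, float B, float C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        r[i] *= A;   // the warp sweeps the R plane one 128-byte line at a time,
        g[i] *= B;   // then the G plane,
        b[i] *= C;   // then the B plane
    }
}
```

This is the classic array-of-structures versus structure-of-arrays trade-off: the interleaved [R|G|B] layout would force each lane to fetch a strided 12-byte struct, while the planar layout lets every warp consume whole 128-byte lines.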