To illustrate these three bets in action, let's write some code and measure the results.
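
The original listing is not reproduced here, but a minimal sketch of the idea might look like the following. This is my own illustrative reconstruction, not the actual TestMemoryAccessPatterns source: the class name, array and page sizes, the PRIME_INC value, and the timing loop are all assumptions. Only the command-line contract (the access pattern passed as 1, 2, or 3) is taken from the perf invocations shown later.

import java.util.Arrays;

// Illustrative sketch: walk a large int[] with three access patterns of
// increasing randomness and report the average cost per access.
// Run with a heap large enough for the array, e.g. java -Xmx1g MemoryAccessSketch 3
public class MemoryAccessSketch
{
    private static final int PRIME_INC = 514229;              // assumed odd prime stride
    private static final int ARRAY_SIZE = 128 * 1024 * 1024;  // 128M ints = 512MB
    private static final int ARRAY_MASK = ARRAY_SIZE - 1;
    private static final int PAGE_SIZE = 1024 * 1024;         // "page" of ints for pattern 2
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private static final int[] memory = new int[ARRAY_SIZE];

    public static void main(final String[] args)
    {
        Arrays.fill(memory, 1);
        final int pattern = Integer.parseInt(args[0]); // 1=linear, 2=random page, 3=random heap

        for (int run = 0; run < 5; run++)
        {
            final long start = System.nanoTime();
            long checksum = 0;
            int pos = 0;

            for (int i = 0; i < ARRAY_SIZE; i++)
            {
                pos = next(pattern, pos, i);
                checksum += memory[pos];   // keep the loads observable to the JIT
            }

            final double nsPerAccess = (double)(System.nanoTime() - start) / ARRAY_SIZE;
            System.out.printf("%d - %.2fns per access (checksum %d)%n", run, nsPerAccess, checksum);
        }
    }

    // Choose the next index: linear, pseudo-random within the current page,
    // or pseudo-random over the whole array. An odd prime increment ANDed
    // with a power-of-two mask visits every slot in a hard-to-predict order.
    private static int next(final int pattern, final int pos, final int i)
    {
        switch (pattern)
        {
            case 1:  return i;                                                    // linear walk
            case 2:  return (i & ~PAGE_MASK) | ((pos + PRIME_INC) & PAGE_MASK);   // random page walk
            default: return (pos + PRIME_INC) & ARRAY_MASK;                       // random heap walk
        }
    }
}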

Results

Analysis



I ran the code on 3 different CPU architectures illustrating generational steps forward for Intel. It is clear from the results that each generation has become progressively better at hiding the latency to main-memory, based on the 3 bets described above, for a relatively small heap. This is because the size and sophistication of the various caches keep improving. However, as memory size increases they become less effective. For example, if the array is doubled to be 4GB in size, then the average latency increases from ~30ns to ~55ns for the i7-860 doing the random heap walk.



It seems that, for the linear walk case, memory latency does not exist. However, as we walk around memory in an ever more random pattern, the latency starts to become very apparent.

The random heap walk produced an interesting result. This is our worst-case scenario, and given the hardware specifications of these systems, we could be looking at 150ns, 65ns, and 75ns for the above tests respectively, based on memory controller and memory module latencies. For the Nehalem (i7-860) I can further subvert the cache sub-system by using a 4GB array, resulting in ~55ns on average per iteration. The i7-2760QM has larger load buffers and TLB caches, and Linux is running with transparent huge pages, all of which work to further hide the latency. By playing with different prime numbers for the stride, results can vary wildly depending on processor type, e.g. try PRIME_INC = 39916801 for Nehalem. I'd like to test this on a much larger heap with Sandy Bridge.

The main take-away is that the more predictable the pattern of access to memory, the better the cache sub-systems are at hiding main-memory latency. Let's look at these cache sub-systems in a little detail to try and understand the observed results.

Hardware Components

We have many layers of cache, plus the pre-fetchers, to consider when looking at how latency gets hidden. In this section I'll try to cover the major components our hardware and systems-software friends have put in place to hide latency. We will investigate these latency-hiding components and use the Linux perf and likwid (Lightweight Performance Counters) utilities to retrieve the performance counters from our CPUs, which tell us how effective these components are as our programs execute. Performance counters are CPU specific, and what I've used here is specific to Sandy Bridge.

Data Caches

Processors typically have 2 or 3 layers of data cache. Each layer as we move out is progressively larger, with increasing latency. The latest Intel processors have 3 layers (L1D, L2, and L3), with sizes of 32KB, 256KB, and 4MB-30MB, and latencies of ~1ns, ~4ns, and ~15ns respectively for a 3.0GHz CPU.

Data caches are effectively hardware hash tables with a fixed number of slots for each hash value. These slots are known as "ways". An 8-way associative cache will have 8 slots to hold values for addresses that hash to the same cache location. Within these slots the data caches do not store words; they store cache-lines of multiple words. For an Intel processor these cache-lines are typically 64 bytes, that is, 8 words on a 64-bit machine. This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or the fields of an object.

Data caches are typically evicted in an LRU manner. Caches work by using a write-back algorithm, whereby stores need only be propagated to main-memory when a modified cache-line is evicted. This gives rise to the interesting phenomenon that a load can cause a write-back to the outer cache layers, and eventually to main-memory.
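
To see the spatial bet at work directly, consider a small sketch of my own (sizes illustrative): summing every int versus one int per 64-byte cache-line. The stride-16 pass issues a sixteenth of the loads, yet takes far more than a sixteenth of the time, because both passes pull in every cache-line.

// Sketch: cost is dominated by cache-lines touched, not loads issued.
// Run with e.g. java -Xmx1g CacheLineCost (sizes are illustrative).
public class CacheLineCost
{
    private static final int INTS_PER_LINE = 16; // 64-byte line / 4-byte int
    private static final int[] data = new int[64 * 1024 * 1024]; // 256MB

    public static void main(final String[] args)
    {
        for (final int stride : new int[] {1, INTS_PER_LINE})
        {
            final long start = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < data.length; i += stride)
            {
                sum += data[i];
            }
            System.out.printf("stride %2d: %,dns total (sum %d)%n",
                stride, System.nanoTime() - start, sum);
        }
    }
}

The counters below show how the cache layers respond to the three access patterns.
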
perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,626,053 L1-dcache-loads
       274,255,164 L1-dcache-misses   #  18.32% of all L1-dcache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,537,057,965 L1-dcache-loads
     1,570,105,933 L1-dcache-misses   # 102.15% of all L1-dcache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,321,888,497 L1-dcache-loads
     1,780,223,433 L1-dcache-misses   #  41.19% of all L1-dcache hits

likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
| Event                 | core 2      |
+-----------------------+-------------+
| INSTR_RETIRED_ANY     | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
| L2_RQSTS_MISS         | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
| Metric          | core 2    |
+-----------------+-----------+
| Runtime [s]     | 2.15481   |
| CPI             | 0.867293  |
| L2 request rate | 0.18028   |
| L2 miss rate    | 0.0546988 |
| L2 miss ratio   | 0.303409  |
+-----------------+-----------+
+------------------------+-------------+
| Event                  | core 2      |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
| L3_LAT_CACHE_MISS      | 2.59059e+07 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
| Event                 | core 2      |
+-----------------------+-------------+
| INSTR_RETIRED_ANY     | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
| L2_RQSTS_MISS         | 1.5547e+09  |
+-----------------------+-------------+
+-----------------+----------+
| Metric          | core 2   |
+-----------------+----------+
| Runtime [s]     | 6.87876  |
| CPI             | 1.10714  |
| L2 request rate | 0.22925  |
| L2 miss rate    | 0.104502 |
| L2 miss ratio   | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
| Event                  | core 2      |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
| L3_LAT_CACHE_MISS      | 1.72918e+08 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
| Event                 | core 2      |
+-----------------------+-------------+
| INSTR_RETIRED_ANY     | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
| L2_RQSTS_MISS         | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
| Metric          | core 2   |
+-----------------+----------+
| Runtime [s]     | 17.474   |
| CPI             | 6.4418   |
| L2 request rate | 0.71973  |
| L2 miss rate    | 0.220838 |
| L2 miss ratio   | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
| Event                  | core 2      |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
| L3_LAT_CACHE_MISS      | 1.34832e+09 |
+------------------------+-------------+

Finding: the cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.

Translation Lookaside Buffers (TLBs)

Our programs deal with virtual memory addresses that need to be translated to physical memory addresses. Virtual memory systems do this by mapping pages. We need to know the offset of a given page, and its size, for any memory operation. Typically, page sizes are 4KB and are gradually moving to 2MB and greater. Linux introduced Transparent Huge Pages in the 2.6.38 kernel, giving us 2MB pages. The translation of virtual memory pages to physical pages is maintained by the page table. This translation can require multiple accesses to the page table, which is a huge performance penalty.

To accelerate this lookup, processors have a small hardware cache at each cache level called the TLB cache. A miss on the TLB cache can be hugely expensive because the page table may not be in a nearby data cache. By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.
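
A back-of-envelope calculation shows why larger pages help. The 64-entry figure below is purely illustrative, not a measured property of any of the processors tested:

// Sketch: address range a TLB can cover before translations must fall
// back to the page table. The entry count is an assumed, illustrative value.
public class TlbReach
{
    public static void main(final String[] args)
    {
        final long entries = 64;                                              // assumed dTLB entries
        System.out.printf("4KB pages: %,d KB of reach%n", entries * 4);          // 256 KB
        System.out.printf("2MB pages: %,d KB of reach%n", entries * 2 * 1024);   // 131,072 KB
    }
}

The dTLB counters for the three access patterns follow.
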
perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,128,634 dTLB-loads
           310,901 dTLB-misses   #  0.02% of all dTLB cache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,551,585,263 dTLB-loads
           340,230 dTLB-misses   #  0.02% of all dTLB cache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,031,344,537 dTLB-loads
     1,345,807,418 dTLB-misses   # 33.38% of all dTLB cache hits

Finding: with huge pages employed, we only incur significant TLB misses when randomly walking the whole heap.

Hardware Pre-Fetchers

Hardware will try to predict the next memory access our programs will make and speculatively load that memory into fill buffers. At its simplest level this is done by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride-based access patterns, typically less than 2KB in stride length. In the test below we measure the number of loads that hit a fill buffer from a hardware pre-fetch.
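
To get a feel for that stride-length limit, a sketch along these lines (strides and sizes are my own choices) compares an adjacent cache-line walk, a modest stride within the detector's range, and a stride beyond it:

// Sketch: constant strides the pre-fetcher can recognise keep data
// streaming ahead of the loads; larger strides expose more raw latency.
// Run with e.g. java -Xmx2g StrideDemo (1GB array).
public class StrideDemo
{
    private static final int[] data = new int[256 * 1024 * 1024];

    private static long walk(final int strideInts)
    {
        long sum = 0;
        for (int i = 0; i < data.length; i += strideInts)
        {
            sum += data[i];
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        for (final int strideBytes : new int[] {64, 512, 4096})
        {
            final int strideInts = strideBytes / 4;
            final long start = System.nanoTime();
            final long sum = walk(strideInts);
            final long accesses = data.length / strideInts;
            System.out.printf("stride %,5dB: ~%dns per access (sum %d)%n",
                strideBytes, (System.nanoTime() - start) / accesses, sum);
        }
    }
}
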
likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
| Event              | core 2      |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
| Event              | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+

java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
| Event              | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+

Finding: we have a significant success rate for load hits with the pre-fetcher on the linear walk.

Memory Controllers and Row Buffers

Beyond our last-level cache (LLC) sit the memory controllers that manage access to the SDRAM banks. Memory is organised into rows and columns. To access an address, first the row address must be selected (RAS), then the column address is selected (CAS) within that row to get the word. The row is typically a page in size and is loaded into a row buffer. Even at this stage the hardware is still helping to hide the latency. A queue of memory access requests is maintained and re-ordered so that multiple words can be fetched from the same row, if possible.

Non-Uniform Memory Access (NUMA)

Systems now have memory controllers on the CPU socket. This move to on-socket memory controllers gave an ~50ns latency reduction over the previous front-side bus (FSB) and external Northbridge memory controllers. Systems with multiple sockets employ memory interconnects, QPI from Intel, which are used when one CPU wants to access memory managed by another CPU socket. The presence of these interconnects gives rise to the non-uniform nature of server memory access. In a 2-socket system memory may be local or 1 hop away. On an 8-socket system memory can be up to 3 hops away, where each hop adds 20ns latency in each direction.
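
On a multi-socket machine, the cost of these extra hops can be made visible by pinning a process and its memory with numactl. The node numbers below are illustrative and depend on the box:

numactl --hardware                                                           # list nodes and relative distances
numactl --cpunodebind=0 --membind=0 java -Xmx4g TestMemoryAccessPatterns 3   # local memory
numactl --cpunodebind=0 --membind=1 java -Xmx4g TestMemoryAccessPatterns 3   # remote memory, 1 hop away
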
What does this mean for algorithms?

The difference between an L1D cache-hit and a full miss resulting in main-memory access is 2 orders of magnitude, i.e. <1ns vs. 65-100ns. If algorithms randomly walk around our ever-increasing address spaces, then we are less likely to benefit from the hardware support that hides this latency.

Is there anything we can do about this when designing algorithms and data structures? Yes, there is a lot we can do. If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster. For example, rather than using bucket-and-chain hash tables, like in the JDK, we can employ hash tables using open-addressing with linear probing, as sketched below. Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node.

Research is advancing on algorithmic approaches that work in harmony with cache sub-systems. One area I find fascinating is Cache Oblivious Algorithms. The name is a bit misleading, but there are some great concepts here for how to improve software performance and better execute in parallel. This article is a great illustration of the performance benefits that can be gained.
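
As a concrete illustration of the open-addressing idea, here is a minimal sketch of my own (not JDK code, and without resizing or deletion): an int set where a collision probes adjacent slots in the same backing array instead of chasing chain pointers, so the probe sequence walks consecutive cache-lines.

import java.util.Arrays;

// Sketch: open addressing with linear probing for non-negative int keys.
// Capacity is fixed and must comfortably exceed the number of keys added.
public class IntHashSet
{
    private static final int EMPTY = -1; // sentinel, hence non-negative keys only

    private final int[] slots;
    private final int mask;

    public IntHashSet(final int capacityPowerOfTwo)
    {
        slots = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(slots, EMPTY);
    }

    public boolean add(final int key)
    {
        int index = hash(key);
        while (slots[index] != EMPTY)
        {
            if (slots[index] == key)
            {
                return false;            // already present
            }
            index = (index + 1) & mask;  // linear probe: the adjacent slot
        }
        slots[index] = key;
        return true;
    }

    public boolean contains(final int key)
    {
        int index = hash(key);
        while (slots[index] != EMPTY)
        {
            if (slots[index] == key)
            {
                return true;
            }
            index = (index + 1) & mask;
        }
        return false;
    }

    private int hash(final int key)
    {
        return (key * 31) & mask;
    }
}
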
Conclusion

To achieve great performance it is important to have sympathy for the cache sub-systems. We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches. When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm. This is not what we were taught in algorithm theory when studying computer science. The last decade has seen some fundamental changes in technology. For me, the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.

One thing is certain: if we want software to execute faster and scale better, we need to make better use of the many cores in our CPUs, and pay attention to memory access patterns.

Addendum

Trying to design a random walk algorithm that behaves the same across all processors and memory sizes is tricky. If I use the algorithm below, then my Sandy Bridge processor is slower but the Nehalem is faster. The point is that performance will be unpredictable when you walk around memory in a random fashion. I've also included the L3 cache counters for more detail in all the tests.

private static final long LARGE_PRIME_INC = 70368760954879L;

RANDOM_HEAP_WALK
{
    public int next(final int pageOffset, final int wordOffset, final int pos)
    {
        return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK;
    }
};

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=========================================================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     9,444,928,682 dTLB-loads
     4,371,982,327 dTLB-misses      # 46.29% of all dTLB cache hits
     9,390,675,639 L1-dcache-loads
     1,471,647,016 L1-dcache-misses # 15.67% of all L1-dcache hits

+-----------------------+-------------+
| Event                 | core 2      |
+-----------------------+-------------+
| INSTR_RETIRED_ANY     | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09  |
| L2_RQSTS_MISS         | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
| Metric          | core 2   |
+-----------------+----------+
| Runtime [s]     | 55.0094  |
| CPI             | 17.0801  |
| L2 request rate | 1.10108  |
| L2 miss rate    | 0.362611 |
| L2 miss ratio   | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
| Event              | core 2      |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
| Event                  | core 2      |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS      | 2.62346e+07 |
+------------------------+-------------+
