A few weeks ago I had lunch with a co-worker. He was complaining about some process being slow. He estimated how many bytes of data were generated, how many processing passes were performed, and ultimately how many bytes of RAM needed to be accessed. He suggested that modern GPUs with upwards of 500 gigabytes per second of memory bandwidth could eat his problem for breakfast.

I thought his perspective was interesting. That's not how I think about problems.

I know about the processor-memory performance gap. I know how to write cache-friendly code. I know approximate latency numbers. But I don't know enough to ballpark memory throughput on a napkin.

Here's my thought experiment. Imagine you have a contiguous array of one billion 32-bit integers in memory. That's 4 gigabytes. How long will it take to iterate that array and sum the values? How many bytes of contiguous data can a CPU read from RAM per second? How many bytes of random access? How well can it be parallelized?

Now you may think these aren't useful questions. Real programs are too complicated for such a naive benchmark to be meaningful. You're not wrong! The real answer is "it depends".

That said, I think the question is worth a blog post's worth of exploration. I'm not trying to find the answer. But I do think we can identify some upper and lower bounds, some interesting points in the middle, and hopefully learn a few things along the way.

Numbers Every Programmer Should Know

If you read programming blogs, then at some point you've probably come across "numbers every programmer should know". It looks something like this.

L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Source: Jonas Bonér

This is a great list. I see it on HackerNews at least once a year. Every programmer should know these numbers.

But these numbers are focused on a different question. Latency and throughput are not the same.
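To see why the distinction matters, here's some napkin math of my own (illustrative figures, not measurements): if a CPU issued one cache-line read at a time and waited out the full latency for each, throughput would be tiny.

// One 64-byte cache line per 100 ns round trip, fully serialized:
constexpr double line_bytes  = 64.0;
constexpr double latency_s   = 100e-9;
constexpr double serial_gbps = line_bytes / latency_s / 1e9;  // 0.64 GB/s

// Real CPUs sustain tens of GB/s by keeping many requests in flight,
// so latency alone badly underestimates achievable throughput.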

Latency in 2020

That list is from 2012. This post comes from 2020, and times have changed. Here are figures for an Intel i7 via StackOverflow.

L1 CACHE hit                                 ~4 cycles (2.1 - 1.2 ns)
L2 CACHE hit                                ~10 cycles (5.3 - 3.0 ns)
L3 CACHE hit, line unshared                 ~40 cycles (21.4 - 12.0 ns)
L3 CACHE hit, shared line in another core   ~65 cycles (34.8 - 19.5 ns)
L3 CACHE hit, modified in another core      ~75 cycles (40.2 - 22.5 ns)
local RAM                                   ~60 ns
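The paired ns values per row presumably come from converting the cycle counts at different clock speeds; the arithmetic checks out (my reading of the table, not stated in the source):

// cycles / frequency = time: 4 cycles at 3.3 GHz and at 1.9 GHz
// bracket the "2.1 - 1.2 ns" range quoted for an L1 hit.
constexpr double l1_fast_ns = 4.0 / 3.3;  // ~1.2 ns
constexpr double l1_slow_ns = 4.0 / 1.9;  // ~2.1 ns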

Interesting! What's changed?

L1 is slower; 0.5 ns -> 1.5 ns

L2 is faster; 7 ns -> 4.2 ns

L1 vs L2 relative speed is way closer; 2.5x vs 14x 🤯

L3 cache is now standard; 12 ns to 40 ns

RAM is faster; 100 ns -> 60 ns

I don't want to draw too many conclusions from this. It's not clear how the original figures were calculated. We're not comparing apples to apples.

Here are some bandwidth and cache size figures for my CPU, courtesy of wikichip.

Memory Bandwidth: 39.74 gigabytes per second
L1 cache: 192 kilobytes (32 KB per core)
L2 cache: 1.5 megabytes (256 KB per core)
L3 cache: 12 megabytes (shared; 2 MB per core)
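These figures already give a napkin answer to the thought experiment, assuming the quoted peak bandwidth is actually achievable:

// Time to stream roughly 4 GB of ints at the quoted peak bandwidth:
constexpr double bytes     = 4e9;       // one billion 32-bit ints
constexpr double bandwidth = 39.74e9;   // bytes per second, per wikichip
constexpr double seconds   = bytes / bandwidth;  // ~0.10 s, a lower bound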

Here's what I'd like to know:

What is the upper limit of RAM performance?

What is the lower limit?

What are the limits for L1/L2/L3 cache?

Naive Benchmarking

Let's run some tests. I put together a naive C++ benchmark. It looks very roughly like this.

// Generate random elements
std::vector<int> nums;
for (size_t i = 0; i < 1024 * 1024 * 1024; ++i)  // one billion ints
    nums.push_back(rng() % 1024);                // small nums to prevent overflow

// Run test with 1 to 12 threads
for (int thread_count = 1; thread_count <= MAX_THREADS; ++thread_count) {
    auto slice_len = nums.size() / thread_count;
    std::vector<std::future<int64_t>> futures;

    // for-each thread
    for (int thread = 0; thread < thread_count; ++thread) {
        // partition data
        auto begin = nums.begin() + thread * slice_len;
        auto end = (thread == thread_count - 1)
                 ? nums.end() : begin + slice_len;

        // spawn thread to sum ints sequentially
        futures.push_back(std::async(std::launch::async, [begin, end] {
            int64_t sum = 0;
            for (auto ptr = begin; ptr < end; ++ptr)
                sum += *ptr;
            return sum;
        }));
    }

    // combine results
    int64_t sum = 0;
    for (auto& future : futures)
        sum += future.get();
}

I'm leaving out a few details. But you get the idea. Create a large, contiguous array of elements. Divide the array into non-overlapping chunks. Process each chunk on a different thread. Accumulate the results.

I also want to measure random access. This is tricky. I tried a few methods. Ultimately I chose to pre-compute and shuffle indices. Each index exists exactly once. The inner loop then iterates the indices and computes sum += nums[index].

std::vector<int> nums = /* ... */;
std::vector<uint32_t> indices = /* shuffled */;

// random access
int64_t sum = 0;
for (auto ptr = indices.begin(); ptr < indices.end(); ++ptr) {
    auto idx = *ptr;
    sum += nums[idx];
}
return sum;
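For reference, the shuffled indices can be built with std::iota and std::shuffle; a minimal sketch (not necessarily my exact benchmark code):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// One index per element, each appearing exactly once, in random order
std::vector<uint32_t> make_indices(size_t count) {
    std::vector<uint32_t> indices(count);
    std::iota(indices.begin(), indices.end(), 0u);      // 0, 1, 2, ...
    std::shuffle(indices.begin(), indices.end(),
                 std::mt19937{std::random_device{}()}); // random permutation
    return indices;
}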

I do not consider the memory of the indices array for my bandwidth calculations. I only count bytes that contribute to sum. I'm not benchmarking my hardware. I'm estimating my ability to work with data sets of different sizes and different access patterns.
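Concretely, each run's bandwidth figure is just the bytes that fed the sum divided by wall-clock time, something like this (hypothetical variable names):

// Count only the 4 bytes per element that contribute to the sum;
// traffic from reading the indices array is deliberately excluded.
double gb_per_sec = double(nums.size() * sizeof(int)) / elapsed_seconds / 1e9;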

I performed tests with three data types:

int - basic 32-bit integer

matrix4x4 - contains int[16]; fits on a 64-byte cache line

matrix4x4_simd - uses __m256i intrinsics (see the sketch below)
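As a flavor of the intrinsics involved, here's a minimal AVX2 sum loop (my own sketch, not the actual matrix4x4_simd implementation; it widens to 64-bit lanes so summing a billion elements can't overflow):

// Minimal AVX2 sum sketch; compile with -mavx2.
// Assumes len is a multiple of 4 for brevity.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

int64_t sum_avx2(const int* data, size_t len) {
    __m256i acc = _mm256_setzero_si256();  // four 64-bit accumulator lanes
    for (size_t i = 0; i < len; i += 4) {
        // load 4 ints, sign-extend each to 64 bits, accumulate
        __m128i v32 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm256_add_epi64(acc, _mm256_cvtepi32_epi64(v32));
    }
    // horizontal sum of the four 64-bit lanes
    alignas(32) int64_t lanes[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}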

Large Block