
This code review request follows up on my previous question, Computational verification of the Collatz conjecture. Unlike the previous program (which was designed for the CPU), this code is meant to run on modern GPUs. For this purpose, I chose OpenCL C as the programming language.

Prerequisites

Since the verification program needs to perform 128-bit arithmetic, the first thing I had to solve was the availability of a 128-bit integer type. So far, I rely on the __int128 compiler extension:

typedef unsigned __int128 uint128_t;

At this point, my first question arises: how efficient (in terms of performance) is this solution? I also suspect it may not be very portable.

For completeness, the macro

#define UINT128_MAX (~(uint128_t)0)

is the maximum value of an object of type uint128_t.

I also needed to calculate the number of trailing zeros (ctz), which I solved as follows:

size_t ctzl(unsigned long n)
{
	/* OpenCL's clz(0) returns the bit width (64), so ctzl(0) yields ~0UL */
	return 63 - clz(n & -n);
}

size_t ctzu128(uint128_t n)
{
	size_t a = ctzl(n);

	if (a == ~0UL) {
		/* low 64 bits are zero; count in the high half */
		return 64 + ctzl(n >> 64);
	}

	return a;
}

My concern here is that this could be implemented more simply.

The last building block I need is the n -th power of three:

uint128_t pow3(size_t n)
{
	uint128_t r = 1;
	uint128_t b = 3;

	while (n) {
		if (n & 1) {
			r *= b;
		}
		b *= b;
		n >>= 1;
	}

	return r;
}

All \$ 3^n \$ for \$ n < 81 \$ fit into the uint128_t type. Therefore, I have the following macro defined:

#define LUT_SIZE128 81

Code

My code verifies the convergence of the Collatz problem using this algorithm. The range of 2^task_size numbers starting at task_id * 2^task_size is evenly divided among the threads. The threads do not communicate with each other. When the calculation is complete, each thread stores its results (a partial checksum and the number of uint128_t overflows) in global memory. My current solution is as follows:

__kernel void worker(
	__global unsigned long *overflow_counter,
	__global unsigned long *checksum_alpha,
	unsigned long task_id,
	unsigned long task_size,
	unsigned long task_units)
{
	unsigned long private_overflow_counter = 0;
	unsigned long private_checksum_alpha = 0;
	size_t id = get_global_id(0);
	uint128_t lut[LUT_SIZE128];
	unsigned long i;

	/* each thread builds its own table of powers of three */
	for (i = 0; i < LUT_SIZE128; ++i) {
		lut[i] = pow3(i);
	}

	/* this thread's subrange: 2^(task_size - task_units) numbers */
	uint128_t n_     = ((uint128_t)task_id << task_size) + ((uint128_t)(id + 0) << (task_size - task_units)) + 3;
	uint128_t n_sup_ = ((uint128_t)task_id << task_size) + ((uint128_t)(id + 1) << (task_size - task_units)) + 3;

	/* only numbers congruent to 3 (mod 4) are tested */
	for (; n_ < n_sup_; n_ += 4) {
		uint128_t n = n_, n0 = n_;

		do {
			n++;
			size_t alpha = ctzu128(n);

			private_checksum_alpha += alpha;
			n >>= alpha;

			/* conservative overflow test: 3^alpha < 2^(2*alpha) */
			if (n > UINT128_MAX >> 2*alpha || alpha >= LUT_SIZE128) {
				private_overflow_counter++;
				break;
			}

			n *= lut[alpha];
			n--;
			n >>= ctzu128(n);

			/* stop once the trajectory drops below its starting value */
			if (n < n0) {
				break;
			}
		} while (1);
	}

	overflow_counter[id] = private_overflow_counter;
	checksum_alpha[id] = private_checksum_alpha;
}

I see a speedup of more than two orders of magnitude compared to the CPU implementation. My main performance concern is that adjacent threads do not follow the same code path (the control flow is not coherent across the threads of a processor). However, I am not sure how big this problem actually is.