Current compiler versions (GCC 8.2, clang 7.0.0) are not able to autovectorize the code. However, they do autovectorize finding the minimum value, i.e. a statement like return *std::min_element(v.begin(), v.end()).

The C++ standard library allows us to express the same algorithm in one line.
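A minimal sketch of such a one-liner is given below. The function name min_index_std and the choice of std::vector<int> are assumptions made for illustration; std::min_element returns an iterator to the first smallest element, and std::distance converts it to an index.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Index of the first minimum element; v must be non-empty.
// (min_index_std is a hypothetical name used only in this sketch.)
std::size_t min_index_std(const std::vector<int>& v) {
    return static_cast<std::size_t>(
        std::distance(v.begin(), std::min_element(v.begin(), v.end())));
}
```

When the minimum occurs more than once, the earliest position is reported, because std::min_element returns the first of the equal smallest elements.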

The goal is to find the first index of the minimum value in a non-empty sequence.
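For reference, a scalar baseline for this goal might look like the sketch below. The function name min_index_scalar and the 32-bit signed element type are assumptions; the strict comparison is what makes the result the first index when the minimum is repeated.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar baseline: index of the first minimum; requires size > 0.
// (min_index_scalar is a hypothetical name used only in this sketch.)
std::size_t min_index_scalar(const std::int32_t* input, std::size_t size) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < size; i++) {
        // strict '<' keeps the earliest index when the minimum repeats
        if (input[i] < input[best])
            best = i;
    }
    return best;
}
```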

SIMD approach

In a SIMD approach we keep three vectors: the minimum values, their corresponding indices, and the current scalar indices. When the main loop completes, we select the appropriate single index from these vectors.

The algorithm outline is shown below (for four-element vectors).

// 1. SIMD part
Vector indices    = [0, 1, 2, 3]            // basically [i + 0, i + 1, i + 2, i + 3]
Vector increment  = [4, 4, 4, 4]
                                            // sample values
Vector minvalues  = load_vector(input[0])   // [77, 33, 11, 44]
Vector minindices = [0, 1, 2, 3]            // [ 0,  1,  2,  3]

for (i = 4; i < input_size; i += 4) {
    // advance scalar indices
    indices += increment;                   // [4, 5, 6, 7]

    // compare
    Vector values = load_vector(input[i]);  // [55, 44, 22, 22]
    Mask less = compare(values, minvalues); // [55 < 77, 44 < 33, 22 < 11, 22 < 44]
                                            // [true, false, false, true]
                                            // two items will be updated
    // update
    minvalues  = blend(minvalues, values, less);   // [55, 33, 11, 22]
    minindices = blend(minindices, indices, less); // [ 4,  1,  2,  7]
}

// 2. scalar part
min_value = get_item(minvalues, 0);
min_index = get_item(minindices, 0);
for (int i = 1; i < 4; i++) {
    if (get_item(minvalues, i) < min_value) {
        min_value = get_item(minvalues, i);
        min_index = get_item(minindices, i);
    } else if (get_item(minvalues, i) == min_value) {
        // if some values are repeated, pick the smaller index
        min_index = min(min_index, get_item(minindices, i));
    }
}

return min_index;

Function compare yields a mask; it can be a byte-mask (SSE, AVX2) or a bit-mask (AVX512). In SSE it might be the instruction pcmpgtd (_mm_cmplt_epi32, which is pcmpgtd with swapped operands), in AVX512 it might be vpcmpd (_mm512_cmp_epi32_mask).
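To make the byte-mask concrete, here is a small SSE2 sketch, assuming an x86 target. The helper name less_mask_bits is hypothetical. The compare fills each 32-bit lane with all ones or all zeros, and _mm_movemask_ps is used here only to condense the top bit of every lane into an integer that is easy to inspect.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Lane-wise signed 32-bit test "values < minvalues".
// Returns one bit per lane: bit k is set when lane k compared true.
// (less_mask_bits is a hypothetical helper for this sketch.)
int less_mask_bits(__m128i values, __m128i minvalues) {
    // byte-mask: each lane becomes 0xFFFFFFFF (true) or 0x00000000 (false)
    const __m128i less = _mm_cmplt_epi32(values, minvalues);
    // gather the most significant bit of each 32-bit lane
    return _mm_movemask_ps(_mm_castsi128_ps(less));
}
```

For the sample vectors [55, 44, 22, 22] and [77, 33, 11, 44], lanes 0 and 3 compare true, so the returned bits are 0b1001.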

Function blend is a vector selection operator, i.e. mask[i] ? x[i] : y[i]; it stores items from either vector x or y based on the corresponding mask value. Many SIMD ISAs provide such an operation; for instance SSE has the instruction pblendvb (_mm_blendv_epi8).

Please note that in the case of SSE the blend instruction is relatively new, as it was introduced in SSE4.1. For really old CPUs the blend operator has to be expressed with bitwise operations: (x & mask) | (y & ~mask). Such an expression is compiled into a sequence of three instructions: and, and-not, or.
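Putting the pieces together, the outline could be written with SSE2 intrinsics roughly as follows. This is a sketch under stated assumptions, not a tuned implementation: it targets x86, handles signed 32-bit inputs, requires the size to be a non-zero multiple of four (there is no tail loop), and keeps indices in 32-bit lanes. The names blend_sse2 and min_index_sse2 are hypothetical, and blend_sse2 takes its arguments in the order of _mm_blendv_epi8, i.e. it picks the second vector where the mask is set.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// SSE2-only blend: take b where mask is all ones, a elsewhere
// (and, and-not, or). Hypothetical helper for this sketch.
static inline __m128i blend_sse2(__m128i a, __m128i b, __m128i mask) {
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

// Index of the first minimum.
// Assumptions: size > 0, size % 4 == 0, size fits in a signed 32-bit index.
std::size_t min_index_sse2(const std::int32_t* input, std::size_t size) {
    const __m128i increment = _mm_set1_epi32(4);
    __m128i indices    = _mm_setr_epi32(0, 1, 2, 3);
    __m128i minindices = indices;
    __m128i minvalues  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input));

    // 1. SIMD part: per-lane running minimum and its index
    for (std::size_t i = 4; i < size; i += 4) {
        indices = _mm_add_epi32(indices, increment);
        const __m128i values =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + i));
        const __m128i less = _mm_cmplt_epi32(values, minvalues);
        minvalues  = blend_sse2(minvalues, values, less);
        minindices = blend_sse2(minindices, indices, less);
    }

    // 2. scalar part: reduce the four lanes to a single index
    alignas(16) std::int32_t vals[4];
    alignas(16) std::int32_t idxs[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(vals), minvalues);
    _mm_store_si128(reinterpret_cast<__m128i*>(idxs), minindices);

    std::int32_t min_value = vals[0];
    std::int32_t min_index = idxs[0];
    for (int k = 1; k < 4; k++) {
        // on equal values prefer the smaller index
        if (vals[k] < min_value ||
            (vals[k] == min_value && idxs[k] < min_index)) {
            min_value = vals[k];
            min_index = idxs[k];
        }
    }
    return static_cast<std::size_t>(min_index);
}
```

On a CPU with SSE4.1, the body of blend_sse2 could presumably be replaced by a single _mm_blendv_epi8(a, b, mask) call without changing the rest of the function.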