Update: Release 0.3.2 supports vectorization of 64-bit arithmetic.

Julia is a new language for technical computing that combines interactive scripting convenience with high performance. Version 0.3 was released Aug. 20, 2014, and introduces experimental support for vectorizing loops, which can significantly improve performance of some kernels. This article explains how to use the vectorization feature effectively. This article is based on material that I first presented at JuliaCon 2014

What Is Vectorization?

"Vectorization" has two different meanings in Julia, both related to operating on chunks of data:

Writing your code in terms of operations that operate on whole arrays. For example, writing d=a+b-c where the variables all indicate array objects. See the Julia @devec package for more information about this style of code. Compiler transformations that improve performance by using SIMD (Single Instruction Multiple Data) instructions that operate on chunks of data. For example, hardware with Intel® Advanced Vector Extensions (Intel® AVX) can do eight 32-bit floating-point additions at once.

The first definition pertains to how you write your code. The second definition pertains to code generation. This article concerns the second definition.

Julia 0.3 has vectorization capabilities that can exploit SIMD instructions when executing loops, under the right conditions. Sometimes you have to give it a little nudging. Here is an example function and its invocation on two random arrays with 1003 32-bit values:

function axpy(a,x,y) @simd for i=1:length(x) @inbounds y[i] += a*x[i] end end n = 1003 x = rand(Float32,n) y = rand(Float32,n) axpy(1.414f0, x, y)

The @simd and @inbounds will be explained later. When the call to axpy happens, the Julia just-In-time (JIT) compiler generates code for an instance of axpy specialized to the arguments' types. For serial execution (that is, non-SIMD code) each iteration of the instance behaves as if the code were written like this:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1}) n=length(x) i = 1 @inbounds while i<=n t1 = x[i] t2 = y[i] t3 = a*t1[i] t4 = t2+t3 y[i] = t4 i += 1 end end

Here each assignment in the loop represents one machine instruction.

The @simd macro gives the compiler license to vectorize without checking whether it will change the program's visible behavior. The vectorized code will behave as if the code were written to operate on chunks of the arrays. For example, for hardware with 4-wide execution units, Julia might generate code as if the source were written as follows, using fictional Julia operations + and * that operate on tuples:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1}) n=length(x) i = 1 # Vectorized loop - four logical iterations per physical iteration @inbounds while i+3<=n t1 = (x[i],x[i+1],x[i+2],x[i+3]) # Load tuple t2 = (y[i],y[i+1],y[i+2],y[i+3]) # Load tuple t3 = a*t1 # Scalar times tuple t4 = t2+t3 # Tuple add (y[i],y[i+1],y[i+2],y[i+3]) = t4 # Tuple store i += 4 # Finished 4 logical iterations end # Scalar loop for remaining iterations @inbounds while i<=n t1 = x[i] t2 = y[i] t3 = a*t1 t4 = t2+t3 y[i] = t4 i += 1 end end

As long as the contents of array y do not overlap the contents of array x, the result will be the same as the serial code. However, if there is overlap, the results might differ, because @simd gives the compiler license to reorder operations. Here is a diagram showing the original order:

@simd transposes the order of operations for chunks of iterations, like this:

The horizontal arrows are academic here, since with 4-wide execution hardware, each row of operations happens simultaneously. However, @simd gives license to "vectorize" across chunks of iterations wider than the execution hardware. For example, the order above is also valid for 2-wide execution hardware too. In practice, the compiler often uses chunks that are wider than the execution hardware, so that multiple operations can be overlapped. So do not assume that the chunk size for transposition is the natural hardware size. Furthermore, @simd actually grants the compiler even more reordering latitude as I will discuss later.

Implicit vs. Explicit Vectorization

Vectorization is a program transform. For any program transform, a compiler has to ask three questions:

Is the transform possible ?

? Is the transform legal ?

? Is the transform profitable?

The part of a compiler that answers these questions for vectorization is called the vectorizer. The first question relates to the capabilities of the vectorizer and hardware. Some vectorizers can handle only very simple loops. Others can deal with complicated control flow. If the vectorizer determines that it does not know how to vectorize the loop, there is no point in asking the next two questions.

The meaning of "legal" depends on the language specification. Julia currently does not have a formal specification, so in practice "legal" means that the visible program behavior is identical to that of unvectorized code. Sometimes the behavior will be slightly different in a way that the programmer did not care about, but the vectorizer must nonetheless assume vectorization is not legal unless the programmer grants explicit license.

The meaning of "profitable" can depend on context. For purposes here, "profitable" means "runs faster", though in other contexts it might be "takes less code space in memory". Vectorization often, but not always, improves performance at the cost of increasing code space.

Vectorization can be either implicit or explicit. In implicit vectorization, the compiler proves that the transposition of operations is legal. For Julia, the hard part of the proof is proving that no output array overlaps with any input or output array. If the compiler can otherwise vectorize the code, but cannot prove the absence of overlap, it may generate the vector code anyway, with a run-time check that jumps to the remainder loop if overlap is detected. Here is our example with a run-time check:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1}) n=length(x) i = 1 if !overlap(x,y) # "overlap" is fictional function # Vectorized loop - four logical iterations per physical iteration @inbounds while i+3<=n t1 = (x[i],x[i+1],x[i+2],x[i+3]) # Load tuple ... (y[i],y[i+1],y[i+2],y[i+3]) = t4 # Tuple store i += 4 # Finished 4 logical iterations end end # Scalar loop for remaining iterations @inbounds while i<=n t1 = x[i] ... y[i] = t4 i += 1 end end

In this example, the run-time check is relatively cheap since only two arrays are involved. You won't see a noticeable performance difference by adding @simd to the example since all it would do is to remove the run-time check. However, the cost of the check can grow quadratically with the number of arrays referenced by the loop body. Given M output arrays and N input arrays, the cost is O(M*(M+N)). Furthermore, a run-time check is impractical in cases that involve tricky subscripting patterns, such as:

for i=1:n t = w[j[i]] # "gather" w[k[i]] = t # "scatter" end

Here, proving that transposition of chunks does not change the visible program behavior amounts to detailed inspection of j[i] and k[i] that, in the absence of special hardware support, can be slower than just executing the code serially.

In explicit vectorization, you as the programmer guarantee that vectorization is legal. In Julia, that's done be prefixing a for loop with @simd. Be careful using it. If you use @simd on a loop what was not legal to vectorize, your results may be wrong.

For some technical reasons, @simd actually grants more latitude than just permission to transpose evaluations. It also tells the compiler that all iterations are independent in both of the following senses:

No iteration reads or writes a location that is written by another iteration.

No iteration waits on another iteration.

The second sense is what distinguishes an @simd loop from multi-threaded loops found in some other languages. Currently, it's an academic point for Julia until it acquires shared memory multi-threading capabilities. To summarize, when you use @simd, you promise that iterations do not communicate with each other.

Reductions

Reductions are an exception to the no communication rule for @simd. Reduction operations will work as long as the vectorizer recognizes them as such. The rules for making them recognizable need to be formalized, but for now use +=, *=, &=, |=, or $=, or expand the op= into its equivalent syntactic form. For example, "s += expr" can be written as "s =s + expr". The reduction variable should be a local variable. Here is an example reduction:

function summation(x) s = zero(x[1]) @simd for i=1:length(x) @inbounds s += x[i] end s end

Integer min/max reductions also work. If the compiler fails to recognize your reduction, it will refuse to vectorize the code.

For 4-wide SIMD execution, the vectorized code acts as if it were written like this:

function summation(x::Array{Float32,1}) n = length(x) t = (0f0,0f0,0f0,0f0) # Initialize partial sums i=1 @inbounds while i+3<=n # Four logical iterations per physical iteration t += (x[i],x[i+1],x[i+2],x[i+3]) i += 4 end s = (t[1]+t[2]) + (t[3]+t[4]) # Merge partial sums @inbounds while i<=n s += x[i] i += 1 end s end

Vectorization of reductions not only reorders loads and stores, it also reorders the reduction operations. Here is a drawing of how data flows through a serial summation:

Here is how it flows through a vectorized summation:

Since flloating-point addition is commutative, but not associative, the reordering may cause a different result. Whether the result from the reordered summation is more or less accurate than from the original serial order depends on the addends being summed. In fact, given random addends, the vector order probably gives a slightly more accurate result. Summing rand(Float32,1000), the odds are better than 6 to 1 that the four-way vector summation gives a more accurate result than the original serial code.

Of course, under some conditions, the serial sum may be more accurate. For example, if you are sorting your addends by magnitude before adding them, and there are a few addends with relatively large magnitude which cancel each other, the vectorized summation will likely give a less accurate result. But that kind of tricky case tends to be easily identifiable by noticing the sort.

The Julia compiler does implicit vectorization of reductions only if the result is the same as the serial code, as it will be for integers. Otherwise you need to use @simd. The following table summarizes the situation.

Which Reductions Vectorize in Julia 0.3 Integer Float32 and Float64 Implicit +, *, &, |, $, min, max none Explicit (@simd) +, *, &, |, $, min, max +, *

Floating-point min/max reductions might work in some future version of Julia. The issue is that LLVM recognizes floating-point min reductions for min(x,y) defined as x<y?x:y, but Julia uses a trickier definition of min that conforms with the IEEE floating-point standard:

min{T<:FloatingPoint}(x::T, y::T) = ifelse((y < x) | (x != x), y, x)

The expression x!=x is true when x is an IEEE NaN (Not a Number) value.

Speedup Surprise

The reassociation allowed by @simd sometimes improves speed by more than you would think possible. For example, I've seen @simd speed up the summation example by 12x on an Intel 4th Generation processor, even though the hardware has only 8-wide execution units. This is possible because the reassociation lets the compiler do the summation with conceptual 32-wide vector operations, each synthesized from four 8-wide operations. The compiler then overlaps those 8-wide operations in a way that lets the hardware pipeline them efficiently in a way that was not possible in the serial code.

Inspecting Whether Code Vectorizes

There is currently no feedback from the compiler on whether code vectorized. Feedback might be possible in future versions of Julia since the most recent version of the LLVM vectorizer implements such feedback. In lieu of feedback, the best thing is to learn how to skim LLVM code to check if vectorization is happening as you expect. You don't have to be a compiler expert since the vectorizer leaves some footprints in the code.

To inspect the LLVM code for the earlier axpy example, invoke the macro @code_llvm like this:

julia> @code_llvm axpy(1.414f0, x, y)

where the arguments are the same as the earlier example. Since code generation pays attention only to the types, not the values of the arguments, you can also use:

julia> @code_llvm axpy(0.0f0, Float32[],Float32[])

The macro dumps the LLVM code for the top-level expression. There's also a similar function code_llvm that is invoked like this:

julia> code_llvm(axpy,(Float32,Array{Float32,1},Array{Float32,1}))

The function code_llvm takes a function as its first argument and a tuple of argument types as its second argument.

Because Julia uses a just-in-time Compiler (JIT), the LLVM output depends on your processor. Indeed, one of the benefits of a JIT is that you can get code tailored to your processor. What I get on an Intel 4th Generation Core i7 processor for @code_llvm axpy(1.414f0, x, y) is too long to include completely here. See the attached .txt file if you are interested in seeing it all. For skimming purposes, here are the relevant lines (... denotes deleted portions)

... vector.ph: ; preds = %if %broadcast.splatinsert12 = insertelement <8 x float> undef, float %0, i32 0 %broadcast.splat13 = shufflevector <8 x float> %broadcast.splatinsert12, <8 x float> undef, <8 x i32> zeroinitializer br label %vector.body vector.body: ; preds = %vector.body, %vector.ph %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ] ... %wide.load = load <8 x float>* %24, align 4 ... %wide.load9 = load <8 x float>* %26, align 4 ... %wide.load10 = load <8 x float>* %28, align 4 ... %wide.load11 = load <8 x float>* %30, align 4 %31 = fmul <8 x float> %wide.load10, %broadcast.splat13 %32 = fmul <8 x float> %wide.load11, %broadcast.splat13 %33 = fadd <8 x float> %wide.load, %31 %34 = fadd <8 x float> %wide.load9, %32 store <8 x float> %33, <8 x float>* %24, align 4 store <8 x float> %34, <8 x float>* %26, align 4 ... br i1 %35, label %middle.block, label %vector.body ...

The footprints to look for are the labels prefixed with vector, and types of the form <n x float>. Those labels are added by the vectorizer for the vectorized loop. The operations on <n x float> are the SIMD instructions. In this example, they correspond to AVX instructions on 8-wide vectors. Note that the vectorizer used a chunk size of 16 here, using pairs of 8-wide instructions for each conceptually 16-wide operation.

By the way, Julia has more features for inspecting code at different levels used by the Julia compiler. For example, you can use code_native to see the native code generated by compiler. Leah Hanson's blog is a good introduction to the various levels of "introspection".

Vectorization Recommendations for Julia 0.3

Earlier, I described how vectorization is a transform, and to decide whether to perform a transform, a compiler must answer three questions:

Is it possible?

Is it legal?

Is it profitable?

@simd provides a way to force the answer about legality. But you also have to make vectorization possible within the capabilities of the compiler, and profitable for the given hardware. So you have to learn how to cater to the limitations of the vectorizer. The following sections describe the catering rules and their rationale. A summary of those rules will be presented afterwards.

Trip Count Must Be Obvious

The trip count of a loop is the number of times the body is executed. The vectorized code needs to be able to calculate that trip count before commencing the loop, so that it knows how many chunk iterations to execute. Fortunately, Julia has concise for-loop syntax from which the trip count is obvious when the the loop is over a range object such as m:n. Stick with that form and the trip count should not be an issue.

The trip count is an issue when a for-loop is applied to iterable objects other than ranges, because a regular Julia for-loop is really just a glorified while-loop that marches an iterator across the object until the iterator says "no more". For sake of being able to compute a trip count, @simd causes a for-loop to executed differently. Given a loop of the form:

@simd for i=r … end

execution of @simd assumes that:

length (r) returns the trip count.

(r) returns the trip count. first (r) returns the first index value.

(r) returns the first index value. step(r) returns the stride between successive index values.

A regular for-loop does not assume such expressions are valid. See Section 2.11 of this paper for how regular for-loops are handled.

The Loop Body Should Be Straight-Line Code

The current vectorizer for Julia generally requires that the loop body not contain branches or function calls. But practically all operations in Julia are function calls, furthermore complicated by run-time dispatch (i.e. run-time overload resolution). Fortunately, the Julia compiler is good at eliminating calls to short functions by inlining them, as long as it can infer the argument types at compilation time. So the key is learning how to write type-stable code, which lets the inference mechanism work well. See John Myles White's article to learn about type-stable code.

Code with constructs that might throw exceptions also contains branches, and so will not vectorize. This is why the @inbounds notation is currently necessary. It turns off subscript checking that might throw an exception. Be sure that your subscripts are in bounds when using @inbounds, otherwise you can corrupt your Julia session.

Short conditional expressions involving &&, ||, or ?: will sometimes vectorize. Here is an example that vectorized with Julia 0.3 when I tried it on an Intel(R) 4th Generation Core i7 processor:

function clip(x, a, b) @simd for i=1:length(x) @inbounds x[i] = x[i]>b ? b : x[i]<a ? a : x[i] end end # Shows that code vectorizes for Float32 arrays. @code_llvm clip(Float32[],0.0f0,0.0f0)

Vectorization of p?q:r depends on the compilers ability to figure out whether it can legally/profitably replace it with the equivalent of Julia's ifelse(p,q.r). It's not always legally the same, because ifelse evaluates both expressions q and r regardless of the value of p. If you intend a loop body to be vectorizable, consider writing it with ifelse instead of ?:, like this:

function clip(x, a, b) @simd for i=1:length(x) @inbounds x[i] = ifelse(x[i]>b,b,ifelse(x[i]<a,a,x[i])) end end

The current Julia definitions of min and max use ifelse, so if a≤b this example could be written using min and max, like this:

function clip(x, a, b) @simd for i=1:length(x) @inbounds x[i] = min(b,max(a,x[i])) end end

The current Julia definition of clamp also uses ifelse, so a vectorizable alternative, with behavior like the original even when a>b, is:

function clip(x, a, b) @simd for i=1:length(x) @inbounds x[i] = clamp(x[i],a,b) end end

Subscripts Should Be Unit Stride

The amount that an array subscript changes between iterations is called its stride. The current vectorizer targets loops that have unit-stride access patterns. In practice, that means that for a @simd loop with index i, you want array subscripts to either be:

loop_invariant_value

i

i + loop_invariant_value

i - loop_invariant_value

The vectorizer will tolerate an occasional non-unit stride index, such as 2i, but be warned that the resulting code may be slow. For example, I tried this example, which has an access with stride 2:

function stride2(a, b, x, y) @simd for i=1:length(y) @inbounds y[i] = a * x[2i] + b end end @code_llvm stride2(0.0f,0.0f,Float32[],Float32[])

On an Intel 4th Generation Core i7 processor, the LLVM output showed that the compiler synthesized the load x[2i] from a bunch of scalar loads, so clumsily that removing the @simd actually speeds up the example by about 1.4x. I'm hoping futures versions of Julia will do better, but for now I recommend sticking with unit-stride subscripts when using @simd.

When working with nested loops on two-dimensional arrays, use @simd on the inner loop and make that loop index the leftmost subscript of arrays. This rule comes naturally to Fortran programmers, but rubs against habit for C/C++ programmers. Here is an example:

function updateV(irange, jrange, U, Vx, Vy, A) for j in jrange @simd for i in irange @inbounds begin Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-U[i,j]) Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-U[i,j]) end end end end # Shows that code vectorizes for Float32 R = 1:8 A = Float32[] @code_llvm updateV(R,R,A,A,A,A)

32-Bit Floating-Point Arithmetic Is Often Faster

You'll notice that all the examples that I've shown use Float32 instead of Float64. The reason is that when this article was originally written 64-bit arithmetic did not vectorize. That problem was fixed in Julia release 0.3.2. However, if 32-bit precision suffices for your purpose and speed is your goal, I recommend using 32-bit arithmetic nonetheless, because:

Each SIMD instruction can process twice as many 32-bit operands as 64-bit operands.

Loading/storing 32-bit operands requires half the memory bandwidth, which is often the limiting resource.

If you built Julia from source, you can see a demonstration of the performance difference on your platform by running the SIMD performance tests. Here are the steps:

cd julia/test/perf/simd/ ../../../julia perf.jl # Runs performance tests

Here is sample output:

julia,simd_axpy_32,0.016686,0.020729,0.017577,0.001768 julia,simd_axpy_64,0.030641,0.031022,0.030742,0.000160 julia,sum_reduction_32,0.008560,0.009099,0.008717,0.000218 julia,sum_reduction_64,0.014151,0.014587,0.014265,0.000183 julia,inner_32,0.011920,0.012366,0.012015,0.000196 julia,inner_64,0.020955,0.021324,0.021070,0.000147 julia,seismic_fdtd_32,0.748328,0.750636,0.749520,0.000830 julia,seismic_fdtd_64,1.085479,1.086907,1.085983,0.000564

The _32 or _64 indicates whether the test used 32-bit or 64-bit arithmetic. The first three numbers are times (lower is better): min, mean, max. The last number is the standard deviation.

Summary Recommendations for Effective Vectorization in Julia

For implicit or explicit vectorization:

No cross-iteration dependencies

Straight-line loop body only. Use ifelse for conditionals.

for conditionals. Use @inbounds

Make sure that all calls are inlined. Write type-stable code.

Use unit-stride subscripts

Reduction variables should be local variables.

32-bit arithmetic is often faster

For implicit vectorization, the additional constraints are:

Access no more than about 4 arrays inside the loop.

Do not use floating-point reductions.

Otherwise, use explicit vectorization by marking your loop with @simd.

Future Directions

@simd is a first step in adding vectorization to Julia. There is much room for improvement, particularly if Julia is to match the performance of statically typed languages. Some possible future improvements are:

Report to the user why a loop failed to vectorize.

Vectorize complex arithmetic.

Vectorize tuple math. This was prototyped, but took too much compilation time for Julia 0.3. With changes to LLVM it may be practical in the future.

Vectorize loops without @inbounds . The semantics of @simd are designed to allow doing subscript checking (and throwing an exception if necessary) before the loop really starts. Or perhaps the checks could be vectorized.

. The semantics of are designed to allow doing subscript checking (and throwing an exception if necessary) before the loop really starts. Or perhaps the checks could be vectorized. Vectorize loops bodies that have complicated control-flow. This may become more important and practical with instruction set extensions such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) that support masking.

Add support for limited forms of cross-iteration dependencies that are useful in practice. See my notes for more details.

LLVM hackers looking for projects: please take note!

Conclusion

There is a joke that vectorizers in the 1970's did not work well, but they taught programmers how to write code that could be vectorized by simple vectorizers. Such is the case with the current Julia implementation: Learn how to cater to the vectorizer and it can deliver. Remember too that Julia is an interactive "scripting" language. That such a language can often be at least half as fast as vectorized static languages such as C/C++/Fortran, without restricting that performance to a set of precompiled library routines, is amazing.

Acknowledgments

Jeff Bezanson pointed out ifelse as a way to avoid branches. Jim Cownie suggested corrections and improvements to an earlier draft. Elliot Saba, Patrick O'Leary, Jacob Quinn, and Gunnar Farnebäck corrected errors in earlier public drafts. Jacob pointed out the convenience of using @code_llvm instead of code_llvm. Gunnar pointed me to clamp. Thanks goes to the LLVM project for a framework that enables new languages to get going quickly. Thanks goes to the Julia people for starting a modern language for technical computing and making it open source.