In this section we'll code a generalized matrix-vector multiplication (gemv), one of the routines of the Basic Linear Algebra Subprograms (BLAS). For a vector \(\textbf{y}\) of size \(n\), we want to compute

\[ \textbf{y} \leftarrow \alpha{}\textbf{A}\textbf{x} + \beta{}\textbf{y}, \]

where \(\textbf{A}\) is an \(n\times{}m\) matrix, \(\textbf{x}\) is a vector of size \(m\), and \(\alpha\) and \(\beta\) are constants. A very simple way to write gemv in Julia is:

function naive_gemv(alpha, A, x, beta, y)
    (alpha * A * x) + (beta * y)
end

naive_gemv (generic function with 1 method)

The REPL output tells us that the function we defined has one method. Julia implements multiple dispatch, which in practice means you can implement several methods of the same function, provided each method has a different argument signature. For a given set of arguments, the most specific matching method is the one that gets called.
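As a quick illustration (the function describe below is hypothetical and not part of our gemv code), each method handles a different argument type, and Julia dispatches to the most specific match:

describe(x::Integer)       = "an integer"
describe(x::AbstractFloat) = "a floating-point number"
describe(x)                = "something else"

describe(1)     # "an integer" (the Integer method is the most specific match)
describe(1.0)   # "a floating-point number"
describe("hi")  # "something else" (falls back to the untyped method)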

Note that we did not specify types for the arguments of naive_gemv . It is possible to write functions like this in Julia, but it can make it harder to debug your code. We will later write more specialized functions.

This implementation is so simple that there is nothing to parallelize directly: most of the computation is abstracted away behind high-level operators, so we cannot identify sequential regions with potential for parallelization. In the next implementation, we expose the two loops used to compute gemv:

function sequential_gemv!(alpha::Float64, A::Array{Float64,2},
                          x::Array{Float64,1}, beta::Float64,
                          y::Array{Float64,1}, output::Array{Float64,1})
    # Accumulates into output, which is assumed to be zero-initialized.
    @inbounds for i in 1:size(A, 1)
        @inbounds for j in 1:size(x, 1)
            output[i] += A[i, j] * x[j]
        end
        output[i] *= alpha
        output[i] += beta * y[i]
    end
    output
end

sequential_gemv! (generic function with 1 method)

This time we specified all argument types, and we return the results in a pre-allocated output array. Note the use of the @inbounds macro. As in other languages, macros in Julia are metaprogramming utilities that enable the dynamic generation of code. The @inbounds macro used here tells the compiler that all array accesses inside the annotated loops are guaranteed to be within bounds, so it can skip bounds checking. This improves performance, but it can cause incorrect memory accesses if an index actually falls out of bounds, so it must be used with care.
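As a toy sketch of the risk (the helper unsafe_get is hypothetical), eliding the check turns an out-of-range access from a clean BoundsError into undefined behavior:

unsafe_get(v::Vector{Float64}, i::Int) = @inbounds v[i]  # bounds check elided

v = rand(3)
unsafe_get(v, 2)    # fine, the index is in bounds
# unsafe_get(v, 10) # may silently read garbage memory or crash the process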

The " ! " at the end of sequential_gemv! is not necessary, but it's a Julia convention that tells callers that a function modifies its arguments. We can use sequential_gemv! as a stepping stone to a parallel version, since the two for loops are natural candidates for parallelization. Note that this algorithm is also embarassingly parallel, because its iterations are completely independent and the only operations that could present race conditions are read-only, such as accesses to the arrays A , x , and y . To parallelize sequential_gemv! using native OS threads, we will simply annotate the two loops with more macros:

function threads_gemv!(alpha::Float64, A::Array{Float64,2},
                       x::Array{Float64,1}, beta::Float64,
                       y::Array{Float64,1}, output::Array{Float64,1})
    # Only the outer loop is threaded: each thread owns a disjoint set of
    # rows i, so the writes to output[i] never conflict.
    @inbounds Threads.@threads for i in 1:size(A, 1)
        @inbounds for j in 1:size(x, 1)
            output[i] += A[i, j] * x[j]
        end
        output[i] *= alpha
        output[i] += beta * y[i]
    end
    output
end

threads_gemv! (generic function with 1 method)

Similarly to OpenMP's #pragma omp parallel for, the Threads.@threads macro tells Julia to split the iterations of the annotated loop among threads. A barrier is implicitly placed at the end of the annotated region, and the code runs on the number of threads specified by the JULIA_NUM_THREADS environment variable, which should be exported in your shell before launching the Julia REPL. To learn more about configuring and using threads, check the documentation. You can check how many threads Julia is using by running:

Threads.nthreads()

6
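As a quick sanity check (the sizes below are arbitrary), we can compare the threaded kernel against naive_gemv; remember that output must start zeroed, since threads_gemv! accumulates into it:

n, m = 1_000, 800
alpha, beta = 2.0, 0.5
A = rand(n, m); x = rand(m); y = rand(n)
output = zeros(n)  # threads_gemv! accumulates, so start from zeros
threads_gemv!(alpha, A, x, beta, y, output)
@assert output ≈ naive_gemv(alpha, A, x, beta, y)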

Next, let's obtain a distributed version of our gemv computation by modifying sequential_gemv! to use Julia processes instead. As you might be expecting by now, this will also be done using macros:

using Distributed, SharedArrays

function processes_gemv!(alpha::Float64, A::Array{Float64,2},
                         x::Array{Float64,1}, beta::Float64,
                         y::Array{Float64,1}, output::SharedArray{Float64,1})
    # output is a SharedArray so every worker process can write its rows.
    @inbounds @sync @distributed for i in 1:size(A, 1)
        @inbounds for j in 1:size(x, 1)
            output[i] += A[i, j] * x[j]
        end
        output[i] *= alpha
        output[i] += beta * y[i]
    end
    output
end

processes_gemv! (generic function with 1 method)

The @distributed macro annotates a region to tell the compiler that it will run on several processes, which can be distributed over many machines. Similarly to Open MPI, Julia can configure processes using a machine file listing IPs and instance counts. The @sync macro places a barrier at the end of the region, forcing execution to wait until all workers are done. Julia processes share no memory, so we need a different strategy to collect results this time. I opted for the SharedArray type, from the standard library package SharedArrays, which takes care of the communication and memory synchronization between workers. We could instead implement the work sharing and memory synchronization explicitly using Channels, but we're not going to do that in this post.

This approach is not expected to produce the best results for our gemv implementation, simply because the complexity of the underlying system controlling processes calls for a more careful implementation. Take this example as a showcase of how easy it can be to write distributed code in Julia, but also as an example of how the overhead of careless parallelization can hurt performance.
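For reference, here is a minimal setup sketch (the worker count and problem sizes are arbitrary): we add local worker processes and allocate the shared output vector before calling processes_gemv!. Note that SharedArray only works for processes on the same machine:

using Distributed
addprocs(4)    # 4 local workers; alternatively, launch with julia -p 4

using SharedArrays
n, m = 1_000, 800
A = rand(n, m); x = rand(m); y = rand(n)
output = SharedArray{Float64}(n)  # shared memory, starts zero-filled
processes_gemv!(2.0, A, x, 0.5, y, output)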

We will use the gemv version implemented in the LinearAlgebra standard library package as a performance baseline. This version comes from Julia's BLAS bindings and calls into an extensively optimized native BLAS library (OpenBLAS by default). We can load the BLAS.gemv! function and check that it is available by running:

using LinearAlgebra
BLAS.gemv!

gemv! (generic function with 4 methods)
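The first argument of BLAS.gemv! is a character selecting whether to operate on \(\textbf{A}\) or its transpose ('N' means no transpose), and the result is written in place into y. A minimal call looks like this:

using LinearAlgebra

A = rand(4, 3); x = rand(3); y = rand(4)
BLAS.gemv!('N', 2.0, A, x, 0.5, y)  # overwrites y with 2.0*A*x + 0.5*y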