Posted on September 26, 2017 by Troels Henriksen

One of the core goals of Futhark has always been to make it usable in practice. However, as a purely functional high-performance language incapable of interacting with the outside world except through function parameters, Futhark is not suited for writing full applications. This post describes recent developments in making Futhark programming not merely fun, but perhaps even useful.

The Futhark compiler possesses two code generators: one that generates C code that uses the OpenCL library to orchestrate parallel execution, and another that generates Python code that does the same though PyOpenCL. Since most of the execution time will be spent inside OpenCL kernels that are identical anyway, the overhead of using Python is often negligible (see a previous blog post on the topic). Furthermore, we generate Python code that is easy to invoke as a library from other Python programs. This gives us the best of both worlds: we can write the performance-critical bits in Futhark, and use Python for high-level application logic and talking to the outside world (more on this here). This has allowed us to write a bunch of fancy demos to show off Futhark.

Unfortunately, Python code is not all that easy to call from programs not themselves written in Python. Our C code generator was until recently limited to generating executable programs that read data from standard input and write results on standard output. While this is fine for benchmarking, testing, and development, it is not exactly a low-overhead way to invoke Futhark code. To address this, we recently spent some time extending the C code generator with support for compiling a Futhark program to library code - C code that is meant to be linked (dynamically or statically) with other programs. While C may no longer be the preferred language for application development, most languages come with foreign function interfaces (FFIs) that allow C libraries to be called more or less conveniently. When we make Futhark code callable via a C interface, we thus enable interoperability with a wealth of languages.

First we will look at how to write a simple Futhark program (dot product) and how to compile it to a shared library ( .so ). We will look at the generated header file and will see how easy it is to call the generated code from an ordinary C program (if a little heavy on boilerplate). Then we will see how to use the FFI of Haskell to call that same generated code. Haskell merely serves as a stand-in for any modern language with an FFI, so the same principles should apply elsewhere. Finally, we will discuss how values of more complicated Futhark types, which have no direct analogue in C (like arrays of tuples), are mapped to C.

Warning: some of the details of library compilation may have changed or been updated since this blog post was written in 2017. Refer to the `library guide in the User’s Manual <https://futhark.readthedocs.io/en/stable/usage.html#compiling-to-library>`_ for updated details.

The following Futhark program computes the dot product of two integer vectors:

entry dotprod (xs: []i32) (ys: []i32): i32 = reduce (+) 0 (map2 (*) xs ys)

We define the dotprod function with entry to indicate that it should be externally visible. After saving this program as dotprod.fut , we can compile it with the futhark opencl compiler:

$ futhark opencl --library dotprod.fut

This produces two files in the current directory: dotprod.c and dotprod.h . We can compile dotprod.c to a shared library like this:

$ gcc dotprod.c -o libdotprod.so -fPIC -shared

We can now link to libdotprod.so the same way we link with any other shared library. But before we get that far, let’s take a look at (parts of) the generated dotprod.h file. We have written the code generator to produce as simple header files as possible, with no superfluous crud, in order to make them human-readable. This is particularly useful at the moment, since few explanatory comments are inserted in the header file.

The first declarations are related to initialisation, which is based on first constructing a configuration object, which can then be used to obtain a context. The context is used in all subsequent calls, and contains GPU state and the like. We elide most of the functions for setting configuration properties, as they are not very interesting:

/* * Initialisation */ struct futhark_context_config ; struct futhark_context_config *futhark_context_config_new(); void futhark_context_config_free(struct futhark_context_config *cfg); void futhark_context_config_set_device(struct futhark_context_config *cfg, const char *s); ... struct futhark_context ; struct futhark_context *futhark_context_new(struct futhark_context_config *cfg); void futhark_context_free(struct futhark_context *ctx); int futhark_context_sync(struct futhark_context *ctx);

The above demonstrates a pervasive design decision in the API: the use of pointers to opaque structs. The struct futhark_context is not given a definition, and the only way to construct it is via the function futhark_context_new() . This means that we cannot allocate it statically, which is contrary to how one would normally design a C library. The motivation behind this design is twofold:

It keeps the header file readable, as it elides implementation details like struct members. It is easier to use from FFIs. Most FFIs make it very easy to work with functions that only accept and produce pointers (and primitive types), but accessing and allocating structs is a little more involved.

The disadvantage is a little more boilerplate, and a little more dynamic allocation. However, relatively few objects of this kind are used, so the performance impact should be nil.

The next part of the header file concerns itself with arrays - how they are created and accessed:

/* * Arrays */ struct futhark_i32_1d ; struct futhark_i32_1d *futhark_new_i32_1d(struct futhark_context *ctx, int32_t *data, int dim0); int futhark_free_i32_1d(struct futhark_context *ctx, struct futhark_i32_1d *arr); int futhark_values_i32_1d(struct futhark_context *ctx, struct futhark_i32_1d *arr, int32_t *data); int64_t *futhark_shape_i32_1d(struct futhark_context *ctx, struct futhark_i32_1d *arr);

Again we see the use of pointers to opaque structs. We can use futhark_new_i32_1d to construct a Futhark array from a C array, and we can use futhark_values_i32_1d to read all elements from a Futhark array. The representation used by the Futhark array is intentionally hidden from us - we do not even know (or care) whether it is resident in CPU or GPU memory. The code generator automatically generates a struct and accessor functions for every distinct array type used in the entry points of the Futhark program.

The single entry point is declared like this:

int futhark_entry_dotprod(struct futhark_context *ctx, int32_t *out0, const struct futhark_i32_1d *in0, const struct futhark_i32_1d *in1);

As the original Futhark program accepted two parameters and returned one value, the corresponding C function takes one out parameter and two in parameters (as well as a context parameter).

We have now seen enough to write a small C program (with no error handling) that calls our generated library:

#include <stdio.h> #include "dotprod.h" int main() { int x[] = { 1, 2, 3, 4 }; int y[] = { 2, 3, 4, 1 }; struct futhark_context_config *cfg = futhark_context_config_new(); struct futhark_context *ctx = futhark_context_new(cfg); struct futhark_i32_1d *x_arr = futhark_new_i32_1d(ctx, x, 4); struct futhark_i32_1d *y_arr = futhark_new_i32_1d(ctx, y, 4); int res; futhark_entry_dotprod(ctx, &res, x_arr, y_arr); futhark_context_sync(ctx); printf("Result: %d

", res); futhark_free_i32_1d(ctx, x_arr); futhark_free_i32_1d(ctx, y_arr); futhark_context_free(ctx); futhark_context_config_free(cfg); }

We hard-code the input data here, but we could just as well have read it from somewhere. The call to futhark_context_new() is where the GPU is initialised (is applicable) and OpenCL kernel code is compiled and uploaded to the device. This call might be relatively slow. However, subsequent calls to entry point functions ( futhark_dotprod() ) will be efficient, as they re-use the already initialised context.

Note the use of futhark_context_sync() after calling the entry point: Futhark does not guarantee that the final results have been written until we synchronise explicitly. Note also that we free the two arrays x_arr and y_arr once we are done with them - memory management is entirely manual.

If we save this program as luser.c , we can compile and run it like this:

$ gcc luser.c -o luser -lOpenCL -lm -ldotprod $ ./luser Result: 24

You may need to set LD_LIBRARY_PATH=. before the dynamic linker can find libdotprod.so . Also, this program will only work if the default OpenCL device is usable on your system, since we did not request any specific device. For testing on a system that does not support OpenCL, simply use futhark c instead of futhark opencl . The generated API will be the same.

While C is no longer the favourite language of application programmers, surely Haskell is. Therefore, let’s look at how to call our Futhark library from Haskell. Haskell has a relatively lightweight FFI for calling C code, but it’s still rather verbose. First, some necessary imports:

import Data.Int import Foreign.Ptr import Foreign.Marshal.Alloc import Foreign.Marshal.Array import Foreign.Storable

Then we can define the foreign functions. For brevity, we omit the functions for freeing context and data:

data Futhark_Context_Config foreign import ccall "futhark_context_config_new" futhark_context_config_new :: IO (Ptr Futhark_Context_Config) data Futhark_Context foreign import ccall "futhark_context_new" futhark_context_new :: Ptr Futhark_Context_Config -> IO (Ptr Futhark_Context) data Futhark_i32_1d foreign import ccall "futhark_new_i32_1d" futhark_new_i32_1d :: Ptr Futhark_Context -> Ptr Int32 -> Int32 -> IO (Ptr Futhark_i32_1d) foreign import ccall "futhark_entry_dotprod" futhark_entry_dotprod :: Ptr Futhark_Context -> Ptr Int32 -> Ptr Futhark_i32_1d -> Ptr Futhark_i32_1d -> IO ()

We use empty data declarations to declare Haskell types corresponding to the C types. This is a nice trick for getting type-safe pointers, but ultimately just a convenience. Note how easily we are able to express the pointer-based C functions as Haskell functions. As the operations we perform are inherently effectful, we put them in the IO monad. This makes the interface somewhat awkward to use from most Haskell code, but a nicer interface can be built on top of this if desired. We can call the imported functions like this:

main :: IO () main = do cfg <- futhark_context_config_new ctx <- futhark_context_new cfg x <- newArray [1,2,3,4] y <- newArray [2,3,4,1] x_arr <- futhark_new_i32_1d ctx x 4 y_arr <- futhark_new_i32_1d ctx y 4 res <- alloca $ \res -> do futhark_entry_dotprod ctx res x_arr y_arr peek res putStrLn $ "Result: " ++ show res

The Haskell function newArray produces a C-level heap-allocated array, which we can pass to futhark_new_i32_1d . Memory management is still entirely manual (and since we skip the freeing, this program leaks memory), but we could easily wrap this in smart pointers with finalisers to automate it, if we wished.

Compiling and running this program is as straightforward as with C:

$ ghc luser.hs -ldotprod -lOpenCL $ ./luser Result: 24

Our dot product function uses only types that map easily to C: primitives and arrays of primitives. But what happens if we have an entry point that involves abstract types with hidden definitions, or types with no clear analogue in C, such as records or arrays of tuples? In this case, the generated API defines structs for opaque types that support very few operations.

(Some may argue that records are easily mapped to C structs, and arrays of tuples to arrays of structs. This is correct, but we don’t do that yet - it’s complicated by the fact that Futhark does not always represent values in the way indicated by their source language types, and for example stores an array of pairs by two separate arrays. We will probably improve the capabilities of the code generator in the future, but for now we’ll stick with these for our examples.)

Consider the following contrived program, pack.fut , which contains two entry points:

entry pack (xs: []i32) (ys: []i32): [](i32,i32) = zip xs ys entry unpack (zs: [](i32,i32)): ([]i32,[]i32) = unzip zs

The pack function turns two arrays into one array of pairs, and the unpack function reverses the operation. The generated header file contains the following definitions:

struct futhark_opaque_z31U814583239044437263 ; int futhark_free_opaque_z31U814583239044437263(struct futhark_context *ctx, struct futhark_opaque_z31U814583239044437263 *obj); int futhark_pack(struct futhark_context *ctx, struct futhark_opaque_z31U814583239044437263 **out0, struct futhark_i32_1d *in0, struct futhark_i32_1d *in1); int futhark_unpack(struct futhark_context *ctx, struct futhark_i32_1d **out0, struct futhark_i32_1d **out1, struct futhark_opaque_z31U814583239044437263 *in0);

The unfortunately named struct, futhark_opaque_z31U814583239044437263 , represents an array of tuples. There is nothing we can do with it except for freeing it, or passing it back to an entry point. Clearly we need to improve the rules by which we generate names for opaque Futhark types (currently it’s a hash of the internal representation), but the basic idea is sound.

Opaque values typically occur when you are writing a Futhark program that keeps some kind of state that you don’t want the user modifying or reading directly, but you need access to for each call to an entry point. Since Futhark programs are purely functional (and therefore stateless), having the user to manually pass back the state returned by the previous call is the only way to accomplish this.

The main missing piece in the generated code is proper error handling. Python has it easy: just throw an exception and let the automatic memory management deal with avoiding leaks. In C, we have to be careful to avoid leaking memory when we exit early from a function, and communicating what went wrong to the caller is not easy. While our generated code does make an attempt to return proper errors (most functions return zero on success), most errors will cause a message to be printed on standard error and the process to be aborted. This was fine when we were generating executables, but clearly not acceptable in library code.

It is also not a given that the current design of the API is convenient for all users. We are very interested in figuring out what kinds of things people may want to use Futhark to, and get some experience with the current limitations of the design, so we can improve it.