Posted on September 21, 2018 by John Ky

In the previous post, we’ve established that we want to use SIMD for speed.

We’d also like our CSV parser stream the data to avoid excessive memory usage so we’re going to have to read the CSV input in chunks.

Given that SIMD registers are currently up to 512-bits in size, the chunk size will need to be multiples of 64-bytes to work with arbitrary SIMD instructions.

This post will look at the chunk size Haskell’s bytestring library actually gives us and explore some ways we can get the required chunk size we need.

Lazy IO

Due to laziness, streaming in Haskell is straightforward. The following function lazily reads the entire contents of the input file and writes them into the output file.

For efficiency, the lazy bytestring is actually very similar in structure to a list of strict bytestrings. Each bytestring represents a chunk of the input file contents with a carefully chosen chunk size to maximise performance.

Evidence of this behaviour is observable by using the toChunks function to convert the lazy bytestring and inspecting their size.

The following command reads a file with lazy IO and counts the frequency of each chunk size:

Unfortunately for us, this chunk size is no good for using SIMD instructions that can use registers up to 512-bits or 64-bytes.

More interestingly, even had the defaultChunkSize been set to a more convenient size of being a multiple of 64-bytes, a 64-byte multiple chunk size is not guaranteed, as shown in this example where we read from standard input instead:

Rechunking & resegmenting

One strategy we could use to ensure our bytestrings are always chunked to 64-byte multiples is to rechunk the bytestrings into equal chunk sizes like the following:

|---------------a---------------|---------b----------|-c-|-------------d---------------| |-e--|-f--|-g--|-h--|-i--|-j--|=k==|-l--|-m--|-n--|=o==|=p==|-q--|-r--|-s--|-t--|-u--|v|

In the above, the chunks e - j , l - n , and q - v don’t require any byte copying because they are strict substrings of the chunks a , b and d respectively.

k , o , and p however do require copying because their bytes come from multipe source chunks.

The need for copying is denoted by using the = characters.

The above scheme may minimise the amount of byte copying, but it is still fairly expensive because many bytestring objects are created.

To reduce the number of bytestring objects, another approach is to resegment the data instead.

This process is shown below:

|---------------a---------------|---------b----------|-c-|-------------d---------------| |--------------w--------------|=k==|------x-------|=o==|=p==|-----------y------------|v|

Here, segments v , w and y are substrings of a , b and d respectively with a size that is the largest multiple of the chunk size allowed by the source segments. The segments are equivalent to the concatenation of the e - j , l - n , and q - v chunks in the rechunk example.

This gets us to the point where all except the last segment is a multiple of the chunk size.

For doing our SIMD operations, we’d like all the segments to be a multiple of the chunk size so resegmentPadded will pad the last segment to the chunk size with 0 bytes:

|---------------a---------------|---------b----------|-c-|-------------d---------------| |--------------w--------------|=k==|------x-------|=o==|=p==|-----------y------------|=z==|

This padded segment denoted by z will require byte copying from d and zero-filling the remaining buffer up to the chunk size.

For clarity, I provide the diagrams for each strategy side-by-side:

rechunk: |-e--|-f--|-g--|-h--|-i--|-j--|=k==|-l--|-m--|-n--|=o==|=p==|-q--|-r--|-s--|-t--|-u--|v| resegment: |--------------w--------------|=k==|------x-------|=o==|=p==|-----------y------------|v| resegmentPadded: |--------------w--------------|=k==|------x-------|=o==|=p==|-----------y------------|=z==|

Some benchmarking will give us some idea of how much rechunking and resegmenting costs us:

The results show the cost of using small chunks is drastic compared to the much more modest overhead of resegmenting.

Pre-chunked reading

An alternative to resegmenting the lazy bytestring is to read the bytes with the desired segment size in the first place.

The hGetContents function from the bytestring library is implemented in terms of hGetContentsN like this:

hGetContentsChunkedBy , a different version of the function which guarantees that every chunk is the same size (except the last) can be implemented by using hGetBuf and createAndTrim instead of hGetSome and keeping everything else the same:

Benchmarking this shows the performance is comparable to resegmenting.

This is likely because the chunks returned by hGetContents were already large enough that the extra effort to resegment the lazy bytestring is negligible.

Closing Remarks

This post looked at how we can resegment our lazy bytestring to make the chunk sizes compatible with SIMD instructions at a reasonable cost.

The next post will look at using FFI to call into C functions that use SIMD to do the heavy lifting.