I'm pleased to announce the availability of version 0.5 of text, a library that provides fast Unicode text handling for Haskell.

This version contains numerous changes compared to version 0.4, in three broad categories:

I made improvements to the performance of some common functions by, in many cases, more than 10x.

I have substantially refined the API.

Many bugs fixed, almost all of them subtle, thanks to the ever-astounding QuickCheck library.

Performance improvements In real-world code, many of the most frequently performed text manipulation operations involve searching for substrings. I've written a fast search function, based on Fredrik Lundh's very elegant adaptation of the Boyer-Moore-Horspool-Sunday algorithm, and rewritten all of the functions that perform searches internally to use it (without changing any of the APIs in question, of course). Retooling with this function has made a big difference to performance. Here are measurements of old-style searches, searches over the popular ByteString type, and new-style searches: For this data set, the new search code is 10.4 times faster than the old, and 24% faster than the ByteString search code. If we perform a search for a short snippet of DNA in a large data set, we see a similar pattern: Over the DNA data set, the new search code is 15.3 times faster than the old, and 85% faster than the ByteString search code. By the way, it was writing—and wanting to measure the performance of—this new search function that led me to create the criterion benchmarking package, which in turn begat the statistics package. What a month of hacking it's been!

API changes While the API of the text package still somewhat resembles the venerable Data.List API (which is now almost two decades old), I have modified it to be both narrower and more useful for text processing tasks. All of the integer-index-based search functions are now gone, since using integer offsets into Unicode strings is a bad idea (especially when the underlying representation uses a variable-length encoding). Instead, the search functions break up the original string. Consider the new find function, for instance: find :: Text -> Text -> (Text, [(Text, Text)])

This function returns all non-overlapping instances of needle in haystack . The first element of the returned pair is the prefix of haystack prior to any matches of needle . The second is a list of pairs. The first element of each pair in the list is a span from the beginning of a match to the beginning of the next match, while the second is a span from the beginning of the match to the end of the input. find "::" ""

==> ( "" , [])



find "/" "a/b/c/d"

==> ( "a" , [( "/b" , "/b/c/d" ), ( "/c" , "/c/d" ), ( "/d" , "/d" )])

Notice that no matter how much of the result list you consume, you can always reconstruct the entire original haystack from the elements you've already seen. More often, you won't want something nearly this general. Instead, you may want a function that simply splits a string on a separator: split :: Text -> Text -> [Text]

Here are a few examples of how the split function works: split "\r

" "a\r

b\r

d\r

e"

==> [ "a" , "b" , "d" , "e" ]



split "aaa" "aaaXaaaXaaaXaaa"

==> [ "" , "X" , "X" , "X" , "" ]



split "x" "x"

==> [ "" , "" ]

Some of the functions supplied by the list API are not very useful in the context of a text-oriented API. Consider break : break :: (a -> Bool ) -> [a] -> ([a],[a])

I use this name in the text package, but give the function a simpler and more useful type signature (its behaviour is otherwise the same): break :: Text -> Text -> (Text, Text)

If, for some reason, you still want the predicate-based functions, they're usually available, but (in most cases) with a " By " suffix added, to provide uniform naming: breakBy :: ( Char -> Bool ) -> Text -> (Text, Text)

