Haskell Platform Proposal: add the 'text' library

= Proposal: Add Data.Text to the Haskell Platform = Maintainer: Bryan O'Sullivan (submitted with his approval) == Introduction == This is a proposal for the 'text' package to be included in the next major release of the Haskell platform. An up to date copy of this text is kept at: http://trac.haskell.org/haskell-platform/wiki/Proposals/text Everyone is invited to review this proposal, following the standard procedure for proposing and reviewing packages. http://trac.haskell.org/haskell-platform/wiki/AddingPackages Review comments should be sent to the libraries mailing list by October 1 so that we have time to discuss and resolve issues before the final deadline on November 1. http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable == Credits == Proposal author and package maintainer: Bryan O'Sullivan, originally by Tom Harper, based on ByteString and Vector (fusion) packages. The following individuals contributed to the review process: Don Stewart, Johan Tibell == Abstract == The 'text' package provides an efficient packed, immutable Unicode text type (both strict and lazy), with a powerful loop fusion optimization framework. The 'Text' type represents Unicode character strings, in a time and space-efficient manner. This package provides text processing capabilities that are optimized for performance critical use, both in terms of large data quantities and high speed. The 'Text' type provides character-encoding, type-safe case conversion via whole-string case conversion functions. It also provides a range of functions for converting Text values to and from 'ByteStrings', using several standard encodings (see the 'text-icu' package for a much larger variety of encoding functions). Efficient locale-sensitive support for text IO is also supported. This module is intended to be imported qualified, to avoid name clashes with Prelude functions, e.g. import qualified Data.Text as T Documentation and tarball from the hackage page: http://hackage.haskell.org/package/text Development repo: darcs get http://code.haskell.org/text/ == Rationale == While Haskell's Char type is capable of reprenting Unicode code points, the String sequence of such Chars has some drawbacks that prevent is general use: 1. unicode-unaware case conversion (map toUpper is an unsafe case conversion) 2. the representation is space inefficient. 3. the data structure is element-level lazy, whereas a number of applications require either some level of additional strictness An intermediate solution to these was via 'Data.ByteString' (an efficient byte sequence type, that addresses points 2 and 3), which, when used in conjunction with utf8-string, provides very simple non-latin1 encoding support (though with significant drawbacks in terms of locale and encoding range). The 'text' package addresses these shortcomings in a number of way: 1. support whole-string case conversion (thus, type correct unicode transformations) 2. a space and time efficient representation, based on unboxed Word16 arrays 3. either fully strict, or chunk-level lazy data types (in the style of Data.ByteString) 4. full support for locale-sensitive, encoding-aware IO. The 'text' library has rapidly become popular for a number of applications, and is used by more than 50 other Hackage packages. As of Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most popular), in particular, in web programming. It is used by: * the blaze html pretty printing library * the hstringtemplate file templating library * *all* popular web frameworks: happstack, snap, salvia and yesod web frameworks * the hexpat and libxml xml parsers The design is based on experience from Data.Vector and Data.ByteString: * the underlying type is based on unpinned, packed arrays on the Haskell heap with an ST interface for memory effects. * pipelines of operations are optimized via converstion to and from the 'stream' abstraction[1] == The API == The API is broken into several logical pieces, which are self-explanatory: * combinators for operating on strict, abstract 'text's. http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text.html * an equivalent API for chunk-element lazy 'text's. http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy.html * encoding transformations, to and from bytestrings: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html * support for conversion to Ptr Word16: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Foreign.html * locale-aware IO layer: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-IO.html http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy-IO.html == Design decisions == * IO and pure combinators are in separate modules. * Both a fully strict, and partially-strict type are provided. * The underlying optimization framework is stream fusion, (not build/foldr), and is hidden from the user. * Unpinned arrays are used, to prevent fragmentation. * Large numbers of additional encodings are delegated to the text-icu package. * An 'IsString' instance is provided for String literals. * The implementation is OS and architecture neutral (portable). * The implementation uses a number of language extensions: CPP MagicHash UnboxedTuples BangPatterns Rank2Types RecordWildCards ScopedTypeVariables ExistentialQuantification DeriveDataTypeable * The implementation is entirely Haskell (no additional C code or libraries). * The package provides a QuickCheck/HUnit testsuite, and coverage data. * The package adds no new dependencies to the HP. * The package builds with the Simple cabal way. * There is no existing functionality for packed unicode text in the HP. * The package has complexity annotations. == Open issues == The text-icu package is not part of this propposal. == Notes == The implementation consists of 30 modules, and relies on cabal's package hiding mechanism to expose only 5 modules. The implementation is around 8000 lines of text total. The public modules expose none of these (?). The Python standard library provides both a string and a unicode sequence type. These are somewhat analogous to the ByteString/String/Text split. = References = [1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts, Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.