[Haskell] [ANN] Safe Lazy IO in Haskell

Hi folks,

We have good news (at least we hope) for all the lazy guys out there. Since their birth, lazy IOs have been a great way to modularly apply all the good things we have with *pure*, *lazy*, *Haskell* functions to the real world of files.

We are happy to present the safe-lazy-io package [1], which does exactly this and which is explained and motivated in the rest of this post.

=== The context ===

Although times have been hard for the Lazy/IO technique, some people continue to defend it, arguing that the problems discovered so far are not that harmful and that taking care is sufficient. Indeed, several issues have been discovered with Lazy/IO: some have been fixed in the underlying machinery, some have just been hidden, and some others are still around.

== An alternative ==

An alternative design has been proposed, and is still evolving. It is called "Iteratee" [2] and has been designed by Oleg Kiselyov. This new design has tons of advantages over standard imperative IOs, and shares some of the goals of Lazy/IOs.

Iteratee provides a way to do incremental processing in a high-level style. Indeed both the processed data (via enumerators) and the processing code (called iteratees) can be modularly composed. The handling of file-system resources is precise and safe. Errors can be caught precisely, and error handling can be interleaved with the processing.

In spite of all this, there is an important drawback: a lot of code has to be rewritten and thought out in another way. Processing becomes explicitly chunked, which is not always needed, and, even worse, exception handling also becomes very explicit. While this makes sense in a wide range of applications, it makes things less natural than the general case of pure functions.

We think that Iteratees have to be studied more, and we recommend them when you have to react incrementally to IO errors.

== Issues of Standard Lazy/IO ==

We think that we can save Lazy/IO cheaply, but before explaining how we solve each issue, let's first present Lazy/IO and its problems.

One of the main Lazy/IO functions is 'readFile': it takes a file path, opens the file, and returns the list of characters until the end of the file is reached. The characteristic of 'readFile' is that only the opening is done strictly, while the reading is performed lazily, as the output list is consumed. Cousins of 'readFile' are 'hGetContents', which takes a file handle, and 'getContents', which reads from the standard input.

This technique makes it possible to process a file as if it were completely stored in memory. Because it is read lazily, one knows that only the required part of the file will be read. Even better, if the input is consumed to produce a small output, or if the output is emitted incrementally, then the processing can be done in constant memory space.

Examples:

-- Prints the number of words read on stdin
> countWords = print . length . words =<< getContents

-- Prints the length of the longest line
> maxLineLen = print . maximum . map length . lines =<< getContents

-- Prints in lower case the text read on stdin
> lowerText = interact (map toLower)

-- Alternatively
> lowerText = putStr . map toLower =<< getContents

All these examples are pretty idiomatic Haskell code and make simple use of Lazy/IOs. Each of them runs in constant memory space even though they are written as if the whole contents were available at once.

By using stream fusion or 'ByteString's one can get even faster code while keeping the code almost the same. Here we will stay with the default list of 'Char's data type. However, one goal of our approach is to be trivially adaptable to those data types.

Using our library will be roughly a matter of a namespace switch plus a running function:

> lowerText = LI.run' (SIO.putStr . map toLower <$> LI.getContents)

However, we will introduce this library as we go along.

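To make "strict opening, lazy reading" concrete, here is a minimal sketch of how such a lazy reader could be written with 'unsafeInterleaveIO' from base. The names 'lazyGetContents' and 'countWordsIn' are hypothetical, chosen for this illustration; this is neither the actual implementation of 'hGetContents' nor part of safe-lazy-io, and it reads one character at a time, which is far slower than the buffered original.

```haskell
import System.IO
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical helper, for illustration only: returns the contents of a
-- handle as a lazy list. Each character is read only when the consumer
-- forces the corresponding list cell.
lazyGetContents :: Handle -> IO String
lazyGetContents h = unsafeInterleaveIO $ do
  eof <- hIsEOF h
  if eof
    then hClose h >> return []      -- closed only once the end is reached
    else do
      c  <- hGetChar h
      cs <- lazyGetContents h       -- deferred again by unsafeInterleaveIO
      return (c : cs)

-- Hypothetical example: counts the words of a file, reading it on demand.
countWordsIn :: FilePath -> IO Int
countWordsIn path = do
  h <- openFile path ReadMode       -- opening is strict
  s <- lazyGetContents h            -- no reading has happened yet
  return $! length (words s)        -- forcing the count drives the reads
```

Note that the handle is closed only if the consumer reaches the end of the list, which is exactly the behaviour that makes resource usage hard to predict.
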
Here is another example where Lazy/IO is still easy to use but no longer scales well. This program counts the lines of all the files given as arguments:

> countLines = print . length . lines . concat =<< mapM readFile =<< getArgs

Here the problem is the limit on the number of simultaneously open files. Indeed, all the files are opened at the beginning, so the limit is easily reached.

It's time to recall when the files are closed. With standard Lazy/IOs the handle is closed when you reach the end of the file, that is, when you've explored the whole list returned by 'readFile'. Note also that if you manually open the file and get a handle, then you can manually close the file. However, if by misfortune you close the file and then still consume the lazy list, you will get a truncated list, which reveals how much of the file had been read. This last point is due to the fact that 'readFile' treats a reading error as the end of the file.

One may try to fix this program by simply counting the number of lines of each file separately and then computing the sum to get the final result:

> countLines = print . sum =<< mapM (fmap (length . lines) . readFile) =<< getArgs

However, once again this program exhausts the handle resources. Trying to close the files will not save us either: one just risks getting truncated files. Indeed, the list of intermediate results is produced eagerly but each intermediate result is lazy, so each file is opened but not immediately closed, since the computation is delayed. Fortunately, adding a bit of strictness cures the problem:

> countLines = print . sum =<< mapM ((return' . length . lines =<<) . readFile) =<< getArgs
>   where return' x = x `seq` return x

So far, we have disclosed three problems:

* while reading is lazy, opening is strict, which leads to a less natural processing of multiple files
* the closing of files is hard to predict
* the errors during reading are silently discarded

A last problem is a bit trickier and has recently been exposed by Oleg Kiselyov [3]. It appears when one reads the contents of the same stream twice, or reads some kind of inter-dependent streams. Because reading is implicitly driven by the consumer, the interleaving of the reads may then depend on the reduction strategy employed. This is the case even if the consumer is a pure function. Basically, in this example one can observe different values when using one of these functions:

> f1 x y = x `seq` y `seq` x - y
> f2 x y = y `seq` x `seq` x - y

In this example one reads stdin twice and relies on the error handling to end one stream while keeping the other open. Moreover, there are other ways to achieve this, like the use of unix fifo files, or 'getChanContents' from the "Control.Concurrent.Chan" module.

=== Our solution ===

Here we will present another design, based on a very simple idea. Our goal is to provide IO processing in a style very similar to standard Lazy/IO, with the following differences:

- preservation of determinism;
- a simple control of exceptions;
- and a precise management of resources.

Our solution is made of three key ingredients: a bit of strictness, some predefined schemas to interleave inputs, and some scopes and abstract types to delimit lazy input operations from strict IO operations.

== Dealing with a single input ==

Let's present the first ingredient alone through a first example:

> mapHandleContents :: NFData sa => Handle -> (String -> sa) -> IO sa
> mapHandleContents h f = do
>   s <- hGetContents h
>   return' (f s) `finally` hClose h

> return' :: (Monad m, NFData sa) => sa -> m sa
> return' x = rnf x `seq` return x

It implements some combination of 'fmap' and 'hGetContents'. Actually some of our examples fit nicely in that model:

> countWords = print =<< mapHandleContents stdin (length . words)
> maxLineLen = print =<< mapHandleContents stdin (maximum . map length . lines)
> lowerText = putStr =<< mapHandleContents stdin (map toLower)

However, while the first two examples work well in this setting, the third one allocates the whole result in memory before printing it.

Here the ingredient that is used is strictness: the purpose of forcing the result is to be sure that all the needed input is read before the file is closed. So here we rely on 'NFData' instances to really perform deep forcing (this kind of assumption is a bit like the one made about 'Typeable' instances). In particular, functions must not be an instance of 'NFData': indeed, we have no way to force lazy values that are stored in the closure of a function. The same remark applies to the 'IO' monad, for at least three reasons: 'IO' is often represented by functions; lazy 'IORef's could be used to hide one input for later use; exceptions with lazy contents could also be used to make a lazy value escape.

Let's now add some more strictness into the meal: the 'SIO' monad!

== The 'SIO' monad ==

The 'SIO' monad is a thin layer over the 'IO' monad, populated only by strict 'IO' operations. In particular these operations are strict in the output, which means that once the output is produced we know that the given arguments cannot be further evaluated/forced.

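The strictness ingredient can be tried in isolation, without the safe-lazy-io or strict-io packages. The following is a minimal sketch of a file-based variant of 'mapHandleContents'; the name 'mapFileContents' is hypothetical, 'NFData' and 'rnf' are assumed to come from the deepseq package, and 'evaluate' is used instead of 'return'' to make the forcing step explicit.

```haskell
import Control.DeepSeq (NFData, rnf)
import Control.Exception (evaluate, finally)
import System.IO

-- Hypothetical variant of mapHandleContents that opens the file itself.
-- The result is deeply forced while the handle is still open, so no
-- unread lazy input can escape once the handle has been closed.
mapFileContents :: NFData b => FilePath -> (String -> b) -> IO b
mapFileContents path f = do
  h <- openFile path ReadMode
  (do s <- hGetContents h
      let r = f s
      evaluate (rnf r)              -- drives all the reads that r needs
      return r) `finally` hClose h  -- closed even if forcing throws

-- Length of the longest line of a file, computed in one scoped pass.
maxLineLenOf :: FilePath -> IO Int
maxLineLenOf path = mapFileContents path (maximum . map length . lines)
```

As in the text, this works well when the result is small; a result such as 'map toLower s' would be held entirely in memory before being returned.
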
Here is an example of strict IO using the 'SIO' monad:

> import qualified System.IO.Strict as SIO
> import System.IO.Strict (SIO)

> countWords = SIO.run (SIO.print . length . words =<< SIO.getContents)

Of course this function does not scale well, since it reads the whole contents into memory before processing it.

For now the strict-io [4] package contains wrappers for the functions in "System.IO", and strict 'IORef's.

One can now introduce a function along the lines of 'mapHandleContents':

> withHandleContents :: NFData sa => Handle -> (String -> SIO sa) -> IO sa
> withHandleContents h f = do
>   s <- hGetContents h
>   SIO.run (f s) `finally` hClose h

One can then rewrite 'lowerText' as follows:

> lowerText = withHandleContents stdin (SIO.putStr . map toLower)

So far one can deal quite nicely with processing that has a single input and many outputs. Currently the only requirement is to delimit a scope in which the resource is used to return a strict value. Dealing with multiple inputs will lead us to our final design of lazy inputs.

== Introducing 'LI', Lazy Inputs ==

We first introduce a type for these lazy inputs, namely 'LI'. This type is abstract, and we will progressively introduce functions to build, combine and run them.

The 'LI' type is a pointed functor, but not necessarily a monad nor an applicative functor. Therefore the 'pure' function is exported as 'pureLI'.

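To give an intuition of what an abstract, pointed-functor type of lazy inputs can look like, here is a toy model. It is only a sketch under our own assumptions and is NOT the implementation used by safe-lazy-io: the names 'ToyLI', 'pureToyLI', 'readFileToyLI' and 'runToyLI' are hypothetical, and 'force' is assumed to come from the deepseq package. A lazy input is modelled as an action that acquires a resource and returns a lazy value together with a finalizer.

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate, finally)
import System.IO

-- Toy model, not the package's type: acquiring yields a lazy value and
-- the action that releases the underlying resource.
newtype ToyLI a = ToyLI (IO (a, IO ()))

instance Functor ToyLI where
  fmap f (ToyLI open) = ToyLI $ do
    (x, close) <- open
    return (f x, close)             -- f is applied lazily, nothing is read

-- The "point" of the pointed functor: no resource at all.
pureToyLI :: a -> ToyLI a
pureToyLI x = ToyLI (return (x, return ()))

readFileToyLI :: FilePath -> ToyLI String
readFileToyLI path = ToyLI $ do
  h <- openFile path ReadMode
  s <- hGetContents h
  return (s, hClose h)

-- Running delimits the scope: open, deeply force the result, then close.
runToyLI :: NFData a => ToyLI a -> IO a
runToyLI (ToyLI open) = do
  (x, close) <- open
  evaluate (force x) `finally` close
```

In this model nothing is opened until 'runToyLI' is called, and the 'NFData' constraint on the result is what prevents unread input from escaping the scope.
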
Building primitives make it possible to read files or handles:

> LI.pureLI :: a -> LI a
> LI.hGetContents :: Handle -> LI String
> LI.getContents :: LI String
> LI.readFile :: FilePath -> LI String
> LI.getChanContents :: Chan a -> LI [a]

Being a functor is important: it allows one to transform the whole underlying input lazily, using standard list functions for instance:

> length <$> LI.readFile "foo"
> words <$> LI.readFile "foo"

Extracting the final value of a lazy input ('LI' type) is a matter of using:

> LI.run :: (NFData sa) => LI sa -> IO sa

Or

> LI.run' :: (NFData sa) => LI (SIO sa) -> IO sa

One can therefore rewrite our examples using lazy inputs:

> -- embedded printing
> countWords = LI.run' (SIO.print . length . words <$> LI.getContents)

> -- external printing
> maxLineLen = print =<< LI.run (maximum . map length . lines <$> LI.getContents)

> lowerText = LI.run' (SIO.putStr . map toLower <$> LI.getContents)

== Combining inputs ==

Finally we come to managing multiple inputs. To get both laziness and precise resource management we provide dedicated combinators.

The first one is as simple as appending.

> LI.append :: NFData sa => LI [sa] -> LI [sa] -> LI [sa]

This one produces a single output stream that sequences the two given streams. It also sequences the usage of resources: the first resource is closed and then the second one is opened. Note that this combinator is still quite general, since one can process each input beforehand:

> -- one can drop parts of the inputs
> (take 100 <$> i1) `LI.append` (drop 100 <$> i2)

> -- one can tag each input to know where they come from
> (map Left <$> i1) `LI.append` (map Right <$> i2)

The second one is 'LI.zipWith', which opens the two resources and joins the items into a single stream. Again, since 'LI' is a functor, one can join not only characters but also words, lines, chunks...

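As a small illustration of choosing the granularity before joining, here are pure list counterparts of what one would map over each input with '<$>' and then hand to a zipping combinator. The inputs are ordinary strings rather than 'LI' values, and the names 'sideBySide' and 'wordPairs' are hypothetical.

```haskell
-- Joins two texts line by line.
sideBySide :: String -> String -> [String]
sideBySide a b = zipWith (\l r -> l ++ " | " ++ r) (lines a) (lines b)

-- Joins two texts word by word.
wordPairs :: String -> String -> [(String, String)]
wordPairs a b = zip (words a) (words b)
```

For instance, 'sideBySide "a\nb\n" "1\n2\n3\n"' yields '["a | 1", "b | 2"]': as with any zip, the result is only as long as the shorter input.
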
A problem with zipping is that it stops at the end of the shorter input (losing a part of the other). Fortunately one can extend the inputs:

> zipMaybesWith :: (NFData sa, NFData sb) => (Maybe sa -> Maybe sb -> c) -> LI [sa] -> LI [sb] -> LI [c]
> zipMaybesWith f xs ys =
>   map (uncurry f) . takeWhile someJust <$> zip (prolongate <$> xs) (prolongate <$> ys)
>   where someJust (Nothing, Nothing) = False
>         someJust _                  = True
>         prolongate :: [a] -> [Maybe a]
>         prolongate zs = map Just zs ++ repeat Nothing

The last one is 'LI.interleave':

> LI.interleave :: (NFData sa) => LI [sa] -> LI [sa] -> LI [sa]

This function is currently left-biased; moreover, each resource is closed as soon as we reach its end. However, since the inputs are mixed up together, one generally prefers a tagged version, trivially built upon this one:

> interleaveEither :: (NFData sa, NFData sb) => LI [sa] -> LI [sb] -> LI [Either sa sb]
> interleaveEither a b = interleave (map Left <$> a) (map Right <$> b)

Here are some final programs that scale well with the number of files.

> -- number of words in the given files
> main = print =<< LI.run . fmap (length . words) . LI.concat . map LI.readFile =<< getArgs

> -- almost the same thing but counts words independently in each file
> main = print
>    =<< LI.run . fmap sum . LI.sequence . map (fmap (length . words) . LI.readFile)
>    =<< getArgs

> -- prints to stdout swap-cased concatenation of all input files
> main = LI.run' . (fmap (SIO.putStr . fmap swapCase) . LI.concat . map LI.readFile) =<< getArgs
>   where swapCase c | isUpper c = toLower c
>                    | otherwise = toUpper c

> -- sums character code points of inputs
> main = print =<< LI.run . fmap (sum . map (toInteger . ord)) . LI.concat . map LI.readFile =<< getArgs

Our solution is now available as a Hackage package named "safe-lazy-io" [1].

We hope you will freely enjoy using Lazy/IO again!

As usual, criticisms, comments, and help are welcome!

Finally, I would like to thank Benoit Montagu and Francois Pottier for helping me polish this work!

[1]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/safe-lazy-io
[2]: http://okmij.org/ftp/Streams.html
[3]: http://www.haskell.org/pipermail/haskell/2009-March/021064.html
[4]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/strict-io

-- 
Nicolas Pouillard