More LINQ with System.Interactive – Functional fun and taming side-effects Thursday, January 07, 2010 3:43 AM bart

With the recent release of the Reactive Extensions for .NET (Rx) on DevLabs, you’ll hear quite a bit about reactive programming, based on the IObservable<T> and IObserver<T> interfaces. A great amount of resources is available on Channel 9. In this series, I’ll focus on the dual of the System.Reactive assembly, which is System.Interactive, providing a bunch of extensions to the LINQ Standard Query Operators for IEnumerable<T>. In today’s installment we’ll talk about EnumerableEx’s facilities to tame side-effects in a functionally inspired manner:

To side effect or not to side effect?

Being rooted in query comprehensions as seen in various functional programming languages (including (the) pure one(s)), one would expect LINQ to have a very functional basis. Indeed it has, but being hosted in various not functionally pure languages like C# and Visual Basic, odds are off reasoning about side-effects in a meaningful and doable manner. As we’ve seen before, when talking about the Do and Run operators, it’s perfectly possible for a query to exhibit side-effects during iteration. You don’t even have to look that far, since every lambda passed to a query operator is an opportunity of introducing effects. The delayed execution nature of LINQ queries makes that those effects appear at the point of query execution. So far, nothing new.

So, the philosophical question ought to be whether or not we should embrace side-effects or go for absolute purity. While the latter would be preferable for various reasons, it’s not enforceable through the hosting languages for LINQ, so maybe we should exploit side-effects if we really want to do so. The flip side of this train of thought is that those side-effects could come and get us if we’re not careful, especially when queries get executed multiple times, potentially as part of a bigger query. In such a case, you’d likely not want effects to be duplicated. Below is a sample of such a problematic query expression:

var xrs = EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _); xrs.Zip(xrs, (l, r) => l + r).Take(10).Run( Console .WriteLine);

Using Generate, we generate a sequence of random numbers. Recall the first argument is the state of the anamorphic Generate operator, which we get passed in the lambdas following it: once to produce an output sequence (just a single random number in our case) and once to iterate (just keeping the same random number generator here). What’s more important is we’re relying on the side-effect of reading the random number generator which, as the name implies, provides random answers to the Next inquiry every time it gets called. In essence, the side-effect can (not) be seen by looking at the signature of Random.Next, which says it returns an int. In .NET this means the method may return the same int every time it gets called, but there are no guarantees whatsoever (as there would be in pure functional programming languages).

This side-effect, innocent and intentional as it may seem, comes and gets us if we perform a Zip on the sequence with itself. Since Zip iterates both sides, we’re really triggering separate enumeration (“GetEnumerator”) over the same sequence two times. Though it’s the same sequence object, each of its iterations will produce different results. As a result, the expected invariant of the Zip’s output being only even numbers (based on the assumption l and r would be the same as they’re produced by the same sequence) doesn’t hold:

52

114

112

103

41

135

78

114

59

137

While random number generation is a pretty innocent side-effect, not having it under control properly can lead to unexpected results as shown above. We can visualize this nicely using another side-effect introduced by Do:

var xrs = EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Do(xr => Console .WriteLine( "! -> " + xr)); xrs .Zip( xrs , (l, r) => l + r).Take(10).Run( Console .WriteLine);

This will print a message for every number flowing out of the random number generating sequence, as shown below:

! -> 97

! -> 78

175

! -> 11

! -> 6

17

! -> 40

! -> 17

57

! -> 92

! -> 63

155

! -> 70

! -> 13

83

! -> 41

! -> 1

42

! -> 64

! -> 76

140

! -> 30

! -> 71

101

! -> 1

! -> 81

82

! -> 65

! -> 45

110

If we look a bit further to the original query, we come to the conclusion we can’t apply any form of equational reasoning anymore: it seems that the common subexpression “xrs” is not “equal” (as in exposing the same results) in both use sites. The immediate reason in the case of LINQ is the delayed execution, which is a good thing as our Generate call produces an infinite sequence. More broadly, it’s the side-effect that lies at the heart of the problem as equational reasoning breaks down in such a setting. For that very reason, side-effect permitting languages have a much harder time carrying out optimizations to code and need to be very strict about specifying the order in which operations are performed (e.g. in C#, arguments to a method call – which is always “call-by-value” – are evaluated in a left-to-right order).

Moving Take(10) up doesn’t change the delayed characteristic either:

var xrs = EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Take(10) .Do(xr => Console .WriteLine( "! -> " + xr)); xrs.Zip(xrs, (l, r) => l + r).Run( Console .WriteLine);

What would help is forcing the common subexpression’s query to execute, persisting (= caching) its results in memory, before feeding them in to the expression using it multiple times:

var xrs = EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Take(10) .ToArray() .Do(xr => Console .WriteLine( "! -> " + xr)); xrs.Zip(xrs, (l, r) => l + r).Run( Console .WriteLine);

Don’t forget the Take(10) call though, as calling ToArray (or ToList) on an infinite sequence is not quite advised on today’s machines with finite amounts of memory. It’s clear such hacking is quite brittle and it breaks the delayed execution nature of the query expression. In other words, you can’t really hand out the resulting expression to a caller for it to call when it needs results (if it ever does). We’re too eager about evaluating (part of) the query, just to be able to tame the side-effect:

var xrs = EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Take(10) .ToArray() ; var randomEvens = xrs.Zip(xrs, (l, r) => l + r); // What if the consumer of randomEvens expects different results on each enumeration... Hard cheese! randomEvens.Run( Console .WriteLine); randomEvens.Run( Console .WriteLine);

It’s clear that we need some more tools in our toolbox to tame desired side-effects when needed. That’s exactly what this post focuses on.

Option 1: Do nothing with Let

A first way to approach side-effects is to embrace them as-is. We just allow multiple enumerations of the same sequence to yield different results (or more generally, replicate side-effects). However, we can provide a bit more syntactical convenience in writing queries that reuse the same common subexpression in multiple places. In the above, we had to introduce an intermediate variable to store the common expression in, ready for reuse further on:

var xrs = EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _); xrs.Zip(xrs, (l, r) => l + r).Take(10).Run( Console .WriteLine);

Can’t we somehow write this more fluently? The answer is yes, using the Let operator which passes its left-hand side to a lambda expression that can potentially use it multiple times:

EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Let(xrs => xrs.Zip(xrs, (l, r) => l + r)).Take(10).Run( Console .WriteLine);

You can guess the signature of Let just by looking at the use above, but let’s include it for completeness:

public static IEnumerable <TResult> Let<TSource, TResult>( this IEnumerable <TSource> source, Func < IEnumerable <TSource>, IEnumerable <TResult>> function);

Because of the call-by-value nature of the languages we’re talking about, the expression used for the source parameter will be fully evaluated (not the same as enumerated!) before Let gets called, so we can feed it (again in a call-by-value manner) to the function which then can refer to it multiple times by means of its lambda expression parameter (in the sample above this is “xrs”). Let comes from the world of functional languages where it takes the following form:

let x = y in z

means (in C#-ish syntax)

(x => z)(y)

In other words, there’s a hidden function x => z sitting in a let-expression and the “value” for x (which is y in the sample) gets passed to it, providing the result for the entire let-expression. In EnumerableEx.Let, the function is clear as the second parameter, and the role of “y” is fulfilled by the source parameter. One could create a Let-form for any object as follows (not recommended because of the unrestricted extension method):

public static R Let<T, R>( this T t, Func <T, R> f) { return f(t); }

With this, you can write things like this:

Console .WriteLine( DateTime .Now.Let(x => x - x).Ticks);

This will print 0 ticks for sure, since the same DateTime.Now is used for x on both sides of the subtraction. If we were to expand this expression by substituting DateTime.Now for x, we’d get something different due to the duplicate evaluation of DateTime.Now, exposing the side-effect of reading from the system clock:

Console .WriteLine(( DateTime .Now - DateTime .Now).Ticks);

(Pop quiz: What sign will the above Ticks result have? Is it possible for the above to return 0 sometimes?)

Option 2: Cache on demand a.k.a. MemoizeAll

As we’ve seen before, on way to get rid of the side-effect replication is by forcing eager evaluation of the sequence through operators like ToArray or ToList. However, those are a bit too eager in various ways:

They persist the whole sequence, which won’t work for infinite sequences.

They do so on the spot, i.e. the eagerness can’t be delayed till a later point (‘on demand”).

The last problem can be worked around using the Defer operator, but the first one is still a problem requiring another operator. Both those things are what MemoizeAll provides for, essentially persisting the sequence bit-by-bit upon consumption. This is achieved by exposing the enumerable while only maintaining a single enumerator to its source:

In the figure above, this is illustrated. Red indicates a fetch operation where the original source’s iterator makes progress as an element is requested that hasn’t been fetched before. Green indicates persisted (cached, memoized) objects. Gray indicates elements in the source that have been fetched and hence belong to the past from the (single) source-iterators point of view: MemoizeAll won’t ever request those again. Applying this operator to our running sample using Zip will produce results with the expected invariant:

var xrs = EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Do(xr => Console .WriteLine( "! -> " + xr)) .MemoizeAll(); xrs.Do(xr => Console .WriteLine( "L -> " + xr)).Zip( xrs.Do(xr => Console .WriteLine( "R -> " + xr)), (l, r) => l + r).Take(10).Run( Console .WriteLine);

Now we’ll see the xrs-triggered Do messages being printed only 10 times since the same element will be consumed by the two uses of xrs within Zip. The result looks as follows, showing how the right consumer of Zip never causes a fetch back to the random number generating source due to the internal caching by MemoizeAll:

! -> 71

L -> 71

R -> 71

142

! -> 18

L -> 18

R -> 18

36

! -> 12

L -> 12

R -> 12

24

! -> 96

L -> 96

R -> 96

192

! -> 1

L -> 1

R -> 1

2

! -> 54

L -> 54

R -> 54

108

! -> 9

L -> 9

R -> 9

18

! -> 87

L -> 87

R -> 87

174

! -> 18

L -> 18

R -> 18

36

! -> 12

L -> 12

R -> 12

24

What about lifetime of the source’s single enumerator? As soon as one of the consumers reaches the end of the underlying sequence, we got all elements cached and are prepared to any possible inquiry for elements on the output side of the MemoizeAll operator, hence it’s possible to dispose of the original enumerator. It should also be noted that memoization operators use materialization internally to capture the behavior of the sequence to expose to all consumers. This means exceptions are captured as Notification<T> so they’re repeatable to all consumers:

var xes = EnumerableEx .Throw< int >( new Exception ()).StartWith(1).MemoizeAll(); xes.Catch(( Exception _) => EnumerableEx .Return(42)).Run( Console .WriteLine); xes.Catch(( Exception _) => EnumerableEx .Return(42)).Run( Console .WriteLine);

The above will therefore print 1, 42 twice. In other words, the source blowing up during fetching by MemoizeAll doesn’t terminate other consumers that haven’t reached the faulty state yet (but if they iterate long enough, they’ll eventually see it exactly as the original consumer did).

Finally, what’s All about MemoizeAll? In short: the cache used by the operator can grow infinitely large. The difference compared to ToArray and ToList has been explained before, but it’s worth repeating it: MemoizeAll doesn’t fetch its source’s results on the spot but only makes progress through the source’s enumerator when one of the consumers requests an element that hasn’t been retrieved yet. Call it a piecemeal ToList if you want.

Option 3: Memoize, but less “conservative”

While MemoizeAll does the trick to avoid repetition of side-effects, it’s quite conservative in its caching as it never throws away elements it has retrieved. You never know whether someone – like a slow enumerator or a whole new enumerator over the memoized result – will request the data again, so a general-purpose Memoize can’t throw away a thing. However, if you know the behavior of the consumers of the memoized source, you can be more efficient about it and use Memoize specifying a buffer size. In our running sample of Zip we know that both uses of the source for the left and right inputs to Zip will be enumerated at the same pace, so it suffices to keep the last element in the cache in order for the right enumerator to be able to see the element the left enumerator just saw. Memoize with buffer size 1 does exactly that:

var xrs = EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Do(xr => Console .WriteLine( "! -> " + xr)) .Memoize(1); xrs.Do(xr => Console .WriteLine( "L -> " + xr)).Zip( xrs.Do(xr => Console .WriteLine( "R -> " + xr)), (l, r) => l + r).Take(10).Run( Console .WriteLine);

In pictures, this looks as follows:

Another valid buffer size – also the default – is zero. It’s left to the reader, as an exercise, to come up with a plausible theory for what that one’s behavior should be and to depict this case graphically.

(Question: Would it be possible to provide a “smart” memoization operator that knows exactly when it can abandon items in the front of its cache? Why (not)?)

Derived forms

The difference between Let and the Memoize operators is that the former feeds in a view on an IEnumerable<T> source to a function, allowing that one to refer to the source multiple times in the act of producing a source in return. Let is, as we saw, nothing but fancy function application in a “fluent” left-to-right dataflowy way. Derived forms of Memoize exist that have the same form where a function is fed a memoized data source:

Replay is Memoize on steroids

Publish is MemoizeAll on steroids

The following snippets show just what those operators do (modulo parameter checks):

public static IEnumerable <TResult> Publish<TSource, TResult>( this IEnumerable <TSource> source,

Func< IEnumerable <TSource>, IEnumerable <TResult>> function) { return function(source.MemoizeAll()); } public static IEnumerable <TResult> Publish<TSource, TResult>( this IEnumerable <TSource> source,

Func< IEnumerable <TSource>, IEnumerable <TResult>> function,

TSource initialValue) { return function(source.MemoizeAll().StartWith(initialValue)); } public static IEnumerable <TResult> Replay<TSource, TResult>( this IEnumerable <TSource> source,

Func< IEnumerable <TSource>, IEnumerable <TResult>> function) { return function(source.Memoize()); } public static IEnumerable <TResult> Replay<TSource, TResult>( this IEnumerable <TSource> source,

Func< IEnumerable <TSource>, IEnumerable <TResult>> function,

int bufferSize) { return function(source.Memoize(bufferSize)); }

So we could rewrite our Zip sample in a variety of ways, the following being the cleanest one-sized buffer variant:

EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Do(xr => Console .WriteLine( "! -> " + xr)) .Replay( xrs => xrs .Do(xr => Console .WriteLine( "L -> " + xr)).Zip( xrs .Do(xr => Console .WriteLine( "R -> " + xr)), (l, r) => l + r), 1) .Take(10).Run( Console .WriteLine);

Option 4: Fair (?) sharing with Share and Prune

The Share operator shares an IEnumerator<T> for any number of consumers of an IEnumerable<T>, hence avoiding duplication of side-effects. In addition, it also guarantees that no two consumers can see the same element, so in effect the Share operator has the potential of distributing elements across different consumers. Looking at it from another angle, one consumer can steal elements from the source, preventing another consumer from seeing it. Prune is derived from Share as follows:

public static IEnumerable <TResult> Prune<TSource, TResult>( this IEnumerable <TSource> source,

Func< IEnumerable <TSource>, IEnumerable <TResult>> function) { return function(source.Share()); }

The naming for Prune follows from the effect consumers inside the function have on the sequence being shared: each one consuming data effectively prunes elements from the head of the sequence, so that others cannot see those anymore. An example is shown below, showing another way a Zip could go wrong (practical scenarios for this operator would involve sharing) since the left and right consumers both advance the cursor of the same shared enumerator under the hood:

EnumerableEx .Generate( new Random (), rnd => EnumerableEx .Return(rnd.Next(100)), /* iterate */ _ => _) .Do(xr => Console .WriteLine( "! -> " + xr)) .Prune(xrs => xrs.Do(xr => Console .WriteLine( "L -> " + xr)).Zip( xrs.Do(xr => Console .WriteLine( "R -> " + xr)), (l, r) => l + r)) .Take(10).Run( Console .WriteLine);

The result of this is of interest since the logging will reveal the sharing characteristic. Looking at the first Do’s output we’ll see it gets triggered by any consumer on the inside of Prune:

! -> 37

L -> 37

! -> 51

R -> 51

88

! -> 98

L -> 98

! -> 89

R -> 89

187

! -> 4

L -> 4

! -> 71

R -> 71

75

! -> 43

L -> 43

! -> 30

R -> 30

73

! -> 18

L -> 18

! -> 24

R -> 24

42

! -> 17

L -> 17

! -> 41

R -> 41

58

! -> 45

L -> 45

! -> 68

R -> 68

113

! -> 83

L -> 83

! -> 53

R -> 53

136

! -> 64

L -> 64

! -> 69

R -> 69

133

! -> 0

L -> 0

! -> 22

R -> 22

22

In pictures, this looks as follows:

Exercise: Can you guess how Memoize(0) differs from Share?

Quiz: What should be the behavior of the following fragment? (Tip: you got to know what two from clauses result in and how they execute)

Enumerable .Range(0, 10)

.Prune(xs => from x in xs.Zip(xs, (l, r) => l + r) from y in xs select x + y) .Run( Console .WriteLine);

Next on More LINQ

A look at the Asynchronous and Remotable operators, dealing with some infrastructure-related concepts, wrapping up this series for now.