The phenomenon comes from the number of shared beta-reduction steps, which can be dramatically different in Haskell-style lazy evaluation (or usual call-by-value, which is not that far off in this respect) and in Vuillemin-Lévy-Lamping-Kathail-Asperti-Guerrini-(et al…) "optimal" evaluation. This is a general feature that is completely independent of the arithmetic formulas you could use in this particular example.

Sharing means having a representation of your lambda-term in which one "node" can describe several similar parts of the actual lambda-term you represent. For instance, you can represent the term

\x. x ((\y.y)a) ((\y.y)a)

using a (directed acyclic) graph in which there is only one occurrence of the subgraph representing (\y.y)a , and two edges targeting that subgraph. In Haskell terms, you have one thunk, that you evaluate only once, and two pointers to this thunk.
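As a rough illustration of "one thunk, two pointers", here is a minimal Python sketch. Python is not lazy, so the thunk is modelled explicitly; the `Thunk` class, its `evaluations` counter, and the names `shared` and `term` are all hypothetical helpers introduced for this example, not part of any real evaluator.

```python
class Thunk:
    """A delayed computation that runs at most once (hypothetical helper)."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None
        self.evaluations = 0  # how many times the body actually ran

    def force(self):
        if not self._done:
            self.evaluations += 1
            self._value = self._compute()
            self._compute = None  # the closure is no longer needed
            self._done = True
        return self._value


def identity(y):
    return y


a = "a"

# One node of the graph: the subterm (\y.y) a
shared = Thunk(lambda: identity(a))


# The term \x. x ((\y.y)a) ((\y.y)a): two edges target the same node.
def term(x):
    return x(shared)(shared)


# A consumer that forces both of its arguments.
result = term(lambda p: lambda q: (p.force(), q.force()))
print(result, shared.evaluations)
```

Both arguments are forced, yet the body of the thunk runs only once, because the second `force` finds the cached value.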

Haskell-style memoization implements sharing of complete subterms. This level of sharing can be represented by directed acyclic graphs. Optimal sharing does not have this restriction: it can also share "partial" subterms, which may imply cycles in the graph representation.

To see the difference between these two levels of sharing, consider the term

\x. (\z.z) ((\z.z) x)

If your sharing is restricted to complete subterms, as is the case in Haskell, you may have only one occurrence of \z.z , but the two beta-redexes here will be distinct: one is (\z.z) x and the other one is (\z.z) ((\z.z) x) , and since they are not equal terms they cannot be shared. If the sharing of partial subterms is allowed, then it becomes possible to share the partial term (\z.z) [] (that is, not just the function \z.z , but "the function \z.z applied to something"), which evaluates in one step to just that something, whatever the argument is. Hence you can have a graph in which only one node represents the two applications of \z.z to two distinct arguments, and in which these two applications can be reduced in just one step. Note that there is a cycle on this node, since the argument of the "first occurrence" is precisely the "second occurrence". Finally, with optimal sharing you can go from (a graph representing) \x. (\z.z) ((\z.z) x) to (a graph representing) the result \x.x in just one step of beta-reduction (plus some bookkeeping). This is basically what happens in your optimal evaluator (and the graph representation is also what prevents space explosion).
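The "complete subterms only" half of this argument can be checked with a small sketch. It assumes a toy term representation (hash-consed tuples) and a simple memoizing normalizer, both invented for this example; it is not a model of GHC, and it does not implement optimal reduction. Substitution is naive, which is enough here because no variable capture can occur in this term.

```python
_table = {}


def node(*parts):
    """Hash-consing: structurally equal terms become the same object."""
    return _table.setdefault(parts, parts)


def var(name):
    return node("var", name)


def lam(name, body):
    return node("lam", name, body)


def app(fun, arg):
    return node("app", fun, arg)


def subst(term, name, value):
    """Naive substitution (no capture avoidance; fine for this example)."""
    kind = term[0]
    if kind == "var":
        return value if term[1] == name else term
    if kind == "lam":
        if term[1] == name:  # the binder shadows `name`
            return term
        return lam(term[1], subst(term[2], name, value))
    return app(subst(term[1], name, value), subst(term[2], name, value))


def normalize(term, cache, counter):
    """Normalize `term`; `cache` shares work between *equal* subterms only."""
    if term in cache:
        return cache[term]
    kind = term[0]
    if kind == "var":
        result = term
    elif kind == "lam":
        result = lam(term[1], normalize(term[2], cache, counter))
    else:
        fun = normalize(term[1], cache, counter)
        if fun[0] == "lam":
            counter[0] += 1  # one beta step
            result = normalize(subst(fun[2], fun[1], term[2]), cache, counter)
        else:
            result = app(fun, normalize(term[2], cache, counter))
    cache[term] = result
    return result


ident = lam("z", var("z"))
inner = app(ident, var("x"))  # (\z.z) x
outer = app(ident, inner)     # (\z.z) ((\z.z) x)

steps = [0]
normal_form = normalize(lam("x", outer), {}, steps)
print(inner[1] is outer[1], inner is outer, steps[0])
```

The function \z.z is a single shared node, but the two redexes are different terms, so the cache never applies to them and two beta steps are counted. Sharing the partial term (\z.z) [] is exactly what this representation cannot express.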

For slightly extended explanations, you can look at the paper Weak Optimality, and the Meaning of Sharing (what you are interested in is the introduction and Section 4.1, and maybe some of the bibliographic pointers at the end).

Coming back to your example, the coding of arithmetic functions working on Church integers is one of the "well-known" mines of examples where optimal evaluators can perform better than mainstream languages (in this sentence, well-known actually means that a handful of specialists are aware of these examples). For more such examples, take a look at the paper Safe Operators: Brackets Closed Forever by Asperti and Chroboczek (and by the way, you will find there interesting lambda-terms that are not EAL-typeable; so I’m encouraging you to take a look at oracles, starting with this Asperti/Chroboczek paper).
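For readers who have not met the encoding, here is a minimal sketch of Church integers and a few arithmetic functions on them, written with Python lambdas. The names `zero`, `succ`, `add`, `mul`, `power` and `to_int` are chosen for this example; Python evaluates them call-by-value, so the sketch only shows what the terms are, not how an optimal evaluator reduces them.

```python
# A Church integer n is the function \f. \x. f (f (... (f x))) with n uses of f.
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Arithmetic is composition of these iterators.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))
mul = lambda m: lambda n: lambda f: m(n(f))
power = lambda base: lambda exp: exp(base)  # exponentiation is plain application


def to_int(n):
    """Decode a Church integer by iterating 'add one' starting from 0."""
    return n(lambda k: k + 1)(0)


two = succ(succ(zero))
three = succ(two)

print(to_int(add(two)(three)), to_int(mul(two)(three)), to_int(power(two)(three)))
```

Terms such as `power` duplicate their function argument many times during reduction, which is why the amount of sharing matters so much for this family of examples.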

As you said yourself, this kind of encoding is utterly impractical, but such encodings still offer a nice way of understanding what is going on. And let me conclude with a challenge for further investigation: will you be able to find an example for which optimal evaluation on these supposedly bad encodings is actually on par with traditional evaluation on a reasonable data representation? (As far as I know, this is a real open question.)