Here we describe well-known chaotic sequences, including new generalizations, with application to random number generation, highly non-linear auto-regressive models for times series, simulation, random permutations, and the use of big numbers (libraries available in programming languages to work with numbers with hundreds of decimals) as standard computer precision almost always produces completely erroneous results after a few iterations -- a fact rarely if ever mentioned in the scientific literature, but illustrated here, together with a solution. It is possible that all scientists who published on chaotic processes, used faulty numbers because of this issue.

This article is accessible to non-experts, even though we solve a special stochastic equation for the first time, providing an unexpected exact solution, for a new chaotic process that generalizes the logistic map. We also describe a general framework for continuous random number generators, and investigate the interesting auto-correlation structure associated with some of these sequences. References are provided, as well as fast source code to process big numbers accurately, and even an elegant mathematical proof in the last section.

While all the sequences investigated here are deterministic, we treat them as stochastic processes (after all, they are supposed to simulate chaos and randomness!) and we use probabilistic and statistical tools to handle them. To the best of my knowledge, this is the first time that continuous random number generators are described under a general unified framework, at least in the statistical literature. Similar methodology has been applied in several contexts pertaining to physics and mathematics. I hope that this article will jump-start some further research on this topic, and can be understood by a broad audience.

This article is also a useful read for participants in our analytical competition ($2,000 award) as it addresses a similar stochastic integral equation problem, also with exact solution, in the related context of self-correcting random walks - another kind of memory-less process.

The approach used here starts with traditional data science and simulations for exploratory analysis, with empirical results confirmed later by mathematical arguments in the last section. For simplicity, some advanced topics required to make this framework completely rigorous, are not discussed here, for instance the stochastic fixed-point theorem for non-linear integral operators (the fixed point being a statistical distribution, not a static value) including conditions of convergence and uniqueness of solutions.

Source for picture: click here.

1. General Framework

We are interested in the sequence of real numbers defined iteratively by

X(k) = g(X(k - 1)) for k > 0,

with s = X(0) in [0, 1] being called the seed, and g being a mapping (real-valued function, usually non invertible) from [0, 1] onto [0,1]. The successive values X(k) can be seen as deviates from a distribution X, used for instance to simulate random numbers.

While this concept of using a statistical distribution to model deterministic values is intuitive to non-statisticians, it is not for some statisticians. So let's discuss X in more details, as it is central to this article. For the statistician, the distribution of X is the limit of the empirical distribution computed on the X(k)'s, just as if the X(k)'s were observations of some experiment involving sampling. The distribution X may or may not exist (or can be singular) depending on the function g and on the seed.

Desirable Properties

Each X(k) has the same distribution as X on [0, 1] regardless of the seed (except possibly for rare "bad seeds".)

X(k) is always in [0, 1] and the successive values eventually fill the whole interval in some random order.

The X(k)'s may or may not be correlated. General conditions for convergence and uniqueness are beyond the scope of this article. Suffice it to say, for the initiated, that X is the fixed-point of a mathematical operator defined by g, and should not depend on the seed (but only on g) except possibly for a set S of "bad seeds" that yield periodicity or other non-random behavior, with the Lebesgue measure of this set S equal to 0. This is the case in all the examples discussed here.

The most simple case exhibiting the desired properties is the logistic map, defined by g(X) = 4 X(1 - X). It has been investigated thoroughly, used in many applications, generalized, and exact formulas for X(k) and the distribution of X are known: click here for details.

General Solution

In all cases, the distribution of X is solution of the following equation:

P(X < x) = P(g(X) < x).

While this sounds like a simple equation, it is actually a stochastic integral equation in disguise, and exact solutions are known only in very few cases. We will solve this equation later in this article, for a special case that at first glance seems more complicated than the logistic map. What makes this equation challenging is the fact that the function g, in all practical examples, is non-invertible on [0, 1].

Usage

Besides solving this functional equation, we are interested here in transforming the X(k)'s to obtain uniform deviates on [0, 1], for instance to produce random numbers on [0, 1]. While the X(k)'s and X also have [0, 1] as support domain, generally the distribution is not uniform, and thus a mapping is necessary to "uniformize" them. When no exact solution is known, an approximate solution obtained via interpolation on the X(k)'s empirical percentiles, can be used. Indeed that's what we did in the case that we solved (see next section), and that's how we discovered that it had indeed an exact solution.

Another example of self-replicating process (fractals)

2. Examples of Chaotic Sequences, and Big Number Programming

Because of the nature of the recursion involved to compute X(k), rounding errors propagate very fast, to the point that after 40 or so iterations, all computed values of X(k) are completely wrong. It is possible that all scientists who published on chaotic processes, used faulty numbers because of this issue. In some cases, it matters, and in some, it does not. See the examples below for details. Usually it is not a real issue, because it is like re-starting the sequence with a new seed every 40 iterations or so. So the fact that the conclusions made by prior authors are sound despite using totally faulty values, is an artifact of the process itself, which yields the same distribution X regardless of the seed.

Interestingly, I computed the values for a generalized logistic map (referred to as LM 0.5 below) both with big numbers -- that is, with extreme precision -- and with traditional arithmetic including Excel, and noticed the issue, with both sequences being totally different after 50 iterations. I then computed and used the empirical percentiles from both sequences (big numbers and Excel versions) theoretically representing the same exact process, to fit them with known distributions. I discovered that P(X < p x - q x^2) = x with p close to 2, and q close 1 in both cases. However, the faulty sequence produced p and q even closer to 1 and 2, so close and with such a great fit, that I decided to check whether indeed, it was the exact solution. Had I been working only with the big numbers, I might not have found that this was indeed the exact solution, as proved in the last section.

It is possible that re-seeding the process periodically, which is what the faulty sequence does, improves the overall estimation for a number of statistics, to the point that it outperforms using correct values computed on a single seed. The source code using big numbers is provided later in this article.

Examples of Sequences

Here are a few examples. I tested them all, mostly with the seed s = Pi / 11.

FRACT: Uses g(X) = mX - INT(mX) where m is the multiplier, and INT represents the integer part function. This is one of the most random, and thus best sequences, with all auto-correlations equal to 0, and distributed uniformly on [0, 1] without the need for any mapping. This is a consequence of the equidistribution theorem. Computations run very fast, even with big numbers. However, with traditional arithmetic, it gets stuck to 0 forever, after 10 iterations, if m = 32 This is not the case if m = 31: It then produces a faulty sequence, but one that appears random enough for practical purposes. If you avoid using a power of 2 for the multiplier m, and use a seed that is not too close to a simple rational number, you should be fine.

ABSIN: Uses g(X) = | sin(mX) | where m is a multiplier as in FRACT. My main comment is that with big numbers, it runs very slowly. Also, it is not free from auto-correlations.

LM 1: Uses g(X) = 4X (1 - X). This is the classic logistic map, with all auto-correlations equal to 0. An exact solution is known, Also there are infinitely many "bad seeds" to avoid, but these seeds are known and extremely rare. It runs fast even with big numbers. More about LM 1 can be found here.

LM 0.5: Uses g(X) = SQRT(4X (1 - X)). It has a curious auto-correlation structure, with exponential decay and alternate signs. The first auto-correlation (lag-1) is equal to -1/2. So auto-correlations are quickly vanishing, having no impact on convergence. The sequence can be modified to remove the auto-correlations, using Z(k) = sin(256 X(k)) instead of X(k). The implementation runs fast with big numbers, if you use the SQRT function rather than the power function to compute square roots. Despite the appearance, LM 0.5 is totally different from LM 1, and no smooth mapping exists to map one sequence onto the other. An exact solution for the distribution of X is available, see next section.

Discrete, Continuous Sequences and Generalizations

Finally, it would be interesting to see if a similar framework could be applied to discrete random number generators. Pretty much all random number generators in use today are discrete ones, in the sense that each X(k) generated is an integer. Most of them are periodic, though some have a very long period, see here. Continuous random number generators like those presented here are non-periodic, have nice properties, and might be easier to study. However, due to the finite amount of memory and data storage, for all practical purposes, because of machine precision, they become periodic when implemented, though the period can be immense.

All these sequences, whether made of integers or real numbers, are also stationary, in the sense that X(k) depends only on X(k-1) but not explicitly on k. A possible generalization consists of sequences of the form

X(k) = g(X(k-1), X(k-2))

with two seeds s = X(0) and t = X(1).

Also of interest is the sequence LM p defined by g(X) = {4X (1 - X)}^p where p is a positive parameter. The factor 4 that appears in the formula for g, guarantees that it maps [0, 1] onto [0, 1], as X(1-X) has a maximum value equal to 1/4. The parameter p must be in a certain range for the sequence to exhibit the desired randomness properties; p =1 and p = 0.5 work, but p = 0..2 or p = 1.1 do not. Low values of p result in significant auto-correlations, which eventually, when p is low enough, start impacting convergence. High values of p result in the sequence quickly converging to 0.

3. The Case LM 0.5, Auto-Regressive Time Series, Source Code

We investigate the case LM 0.5 defined by g(X) = SQRT(4X (1 - X)), that is, X(k) = SQRT( 4 X(k-1) (1 - X(k-1)) ). As discussed earlier, we have empirically obtained a solution, and proved that it was an exact solution, see next section for the proof. This solution is P(X < x) = 1 - SQRT(1 - x). As a result, Z(k) = 1 - SQRT(1 - X(k)) has a uniform distribution on [0, 1]. Whether there exists a simple exact formula for X(k), remains an open question. Note that for the case LM 1, an exact formula exists for X(k), see here.

Auto-correlation Structure

We have E[X] = 2/3, Var[X] = 4/45, E[g(X)] = E[X], Var[g(X)] = Var[X], and

Thus Covar[X, g(X)] = E[X g(X)] - 4/9 = -2/45, and the lag-1 autocorrelation is equal to Covar[X, g(X)] / Var[X] = -1/2.

A good approximation for the lag-n autocorrelation is (-1/2)^n. This might actually be the exact value, and it is indeed the case for n = 1 and n = 2. One might be able to find the exact value by deriving a recurrence formula to compute R(n) = E[ X g(g(g(...g(X)...))) ] where the symbol g appears n times inside the expectation operator E, and using the change of variable y = g(x) in the integral. The lag-n autocorrelation is equal to (R(n) - E^2[X]) / Var[X].

There are several ways to decorrelate time series, but here we simply used Z(k) = sin(256 X(k)) , as it fits our purpose of generating sequences with increased randomness (the opposite of what statisticians do in their daily life), and it works pretty well.

Auto-regressive, Non-linear Time Series

The LM 0.5 sequence can be used to simulate non-linear auto-regressive time series with an auto-correlation structure that is decaying exponentially fast. Indeed, it can be used as an alternate base model for auto-regressive processes. Each seed provides a different sequence, yet all sequences have the same distribution. Also, all sub-sequences of a given sequence have the same distribution. In short, re-seeding a sequence now and then, does not alter the distribution, as the sequence emulates a memory-less stochastic process or time series.

Related Sequences and Literature

I tried to find other articles about the LM 0.5 sequence, but could not find any. There is a lot of literature on generalized logistic maps, and my previous article includes references to 2-D logistic map and applications to image encryption and random number generators. See also the Wofram Encyclopedia entry on this subject.This is topic of active research.

Other articles worth reading:

Didier Gonze (2015) wrote an interesting tutorial on the subject, see here or here. It describes the same generalizations of the logistic map, the Lyapunov exponent which measures how chaotic the system is, the universal Feigenbaum chaos constant associated with these processes, and the case when dealing with time-continuous time series, that is, when k is a real number rather than an integer, resulting in a simple differential equation. The context is to model population dynamics. A more comprehensive reference on the subject is the textbook Chaotic Dynamical Systems (Robert L. Devaney, 2003) where the logistic map is referred to as the quadratic map. For a more recent textbook for beginners, read Non Linear Dynamics and Chaos (Steven H. Strogatz, 2015, 2nd Edition.)

Source: Attractors for various quadratic maps.

None of these authors mention LM 0.5 nor its exact solution. The fact that all computations, if based on standard arithmetic, are erroneous, is not mentioned either. It is not clear to me which conclusions are wrong if any, in these studies, because of this issue.

Source Code with Big Numbers, and Model Fitting with Excel

The source code below (in Perl) uses big numbers for LM 0.5. It is interesting to compare the results with the faulty sequence obtained with traditional arithmetic. Results are in this spreadsheet,which also features the data science technique used for model fitting, based on the empirical distribution and percentiles analysis.

Big number arithmetic also produces faulty sequences, but after many more iterations. This is not an issue, as long as X does not depend on the seed. The loss of accuracy with standard computer arithmetic becomes dramatic after 50 iterations or so. Think about this: For the case LM 1 for instance, X(k) is a polynomial (function of the seed s), of degree 2^(2^(k-1)). How do you expect to get a value correct with just one decimal, using standard arithmetic, for k as small as 50?

We encourage the reader to post a Python version (with big number arithmetic) in the comment section below. The Python libraries that I've found handle big integers, but not decimal numbers with hundreds of decimals. It might be possible to just use big integers, afterall the computations in the code below rely implicitly on big integers, via $z=int(0.5+$bigint*$z)/$bigint.

use strict;

use bignum;

my $pi=4*atan2(1,1);

my $bigint= 10**50;

my $seed=$pi/11;

my $k;

my $z=$seed;

my $zp;

open(OUT,">randgen.txt");

for ($k=1; $k<10000; $k++) {

$z=4*$z*(1-$z);

$z=sqrt($z);

# Note: $z=$z**0.5 is much slower than $z=sqrt($z)

$z=int(0.5+$bigint*$z)/$bigint;

$zp=sprintf("%.10f",$z); # get first 10 decimals

print OUT "$k\t$zp

";

print "$k -- $zp

";

select()->flush(); # you can delete this instruction

}

close(OUT);

The variable $bigint in the code is used to set the precision, here to 50 decimals. Without it, the code is incredibly slow and becomes much slower at each iteration, producing values with more and more decimals.

4. Proof of Main Results

We prove the following results, for the LM 0.5 sequence, with x in [0, 1]:

P(X < x) = 1 - SQRT(1 - x)

Let Y = 1 - SQRT(1 - X). Then Y has a uniform distribution on [0, 1].

P(X < 2x - x^2) = x.

To prove the first result, we plug X's distribution in the stochastic functional equation P(X < x) = P(g(X) < x), using the fact that g is a two-to-one convex function on [0, 1], and we check that the stochastic equation is satisfied. The second result follows immediately from the probability integral transform. The third result is left as an exercise. The proof of the first result proceeds as follows:

Thus we have:

and the stochastic equation to verify becomes, after successive squaring and re-arrangements:

As most terms cancel out in the right-hand side of the last equality, it becomes 1 - x = 1 - x, which shows that the stochastic equation is satisfied, thus completing the proof.

Note that there are automated tools for this type of tedious computations; they are usually referred to as symbolic math software. Some are available online.

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.

Follow @analyticbridge

DSC Resources

Popular Articles