We are interested here in factoring numbers that are a product of two very large primes. Such numbers are used by encryption algorithms such as RSA, and the prime factors represent the keys (public and private) of the encryption code. Here you will also learn how data science techniques are applied to big data, including visualization, to derive insights. This article is good reading for the data scientist in training, who might not necessarily have easy access to interesting data: here the dataset is the set of all real numbers -- not just the integers -- and it is readily available to anyone. Much of the analysis performed here is statistical in nature, and thus, of particular interest to data scientists.

Factoring numbers that are a product of two large primes allows you to test the strength (or weakness) of these encryption keys. It is believed that if the prime numbers in question are a few hundred binary digits long, factoring is nearly impossible: it would require years of computing power on distributed systems, to factor just one of these numbers.

While the vast majority of big numbers have some small factors and are thus easier to break, the integers that we are dealing with here are difficult to handle, by design. Factoring a product of two large primes is considered to be an intractable computer science problem. Here, we use techniques of big data, statistics, and machine learning - in short a data science approach - to hopefully discover new efficient factoring techniques for these massive numbers. Our innovative approach is quite different from that used in traditional algorithms.

We have investigated this problem in the past already (click here for references) but here we propose a different, more general framework, offering multiple opportunities to jump-start new research paths on this topic.

1. General Framework

Let z be a massive number that we want to factor. We use a non-standard version of the fixed-point algorithm to identify a factor. From now on, we assume that z is a product of two large prime numbers. Our generic version of the fixed point algorithm is as follows.

First, let's introduce the following notation:

F and G are strictly monotonically increasing, positive functions, with G(x) = 0 if and only if x = 0.

H is the inverse function of F, so F(H(x)) = H(F(x)) = x.

INT is the integer part function, for instance, INT(45.834297) = 45.

The modulo mod(z, x) is the remainder after division of z by x.

Fixed Point Algorithm

Step 1. Start with a seed p(0) = s.

Step 2. Iteratively compute p(n+1) as follows:

x = INT(p(n))

B = G(mod(z, x))

A = H(p(n))

p(n+1) = F(A + B)

Stop when one of the following occurs:

p(n+1) = p(n) or

p(n) > SQRT(z), or

after a pre-specified number of iterations.

Note that s is referred to as the starting point or seed. Several positive values for s, all far smaller than SQRT(n) but also far larger than 1, are eventually tested. Examples are provided in the next sections. A surprising fact here is that s does not need to be an integer: it can be a positive real number. Indeed, some solutions can only be found using a non-integer seed.

Fundamental Result

If p(n+1) = p(n) in the above algorithm, then convergence is reached, that is, p(n) = p(n+1) = p(n+2) = p(n+3) and so on. In addition, INT(p(n)) is a factor of z, so if z has only two factors (this is the case that we are concerned with) then factoring is complete. Also, p(n) is a strictly increasing sequence.

It is possible to design an even more general framework, and this could be the topic of some future research. For instance, we could use x = INT(Q(p(n))) where Q is some reasonably well behaved function, rather than x = INT(p(n)). In this case, when convergence occurs, the factor found is INT(Q(p(n))) rather than INT(p(n)). Also, we could release some requirements on F and G, resulting in p(n) no longer being a strictly increasing sequence, but rather an oscillating one.

The proof of the theorem is straightforward. The remarkable fact here is that convergence occurs in many cases (albeit rarely), sometimes in as little as 2 or 3 iterations, sometimes for a large number of starting points (the number s in step 1) despite the fact that we are dealing with highly chaotic structures that behave almost randomly - something very different from the classic fixed-point theorem. The remaining of the discussion is about identifying pockets of fast convergence, using ML-based discovery and pattern recognition techniques. We will start by narrowing to specific, elementary functions for F and G.

Simplified Fixed Point Algorithm

We consider G(x) = r x where r is a parameter (positive real number) and F(x) = x^a where a is another parameter (also a positive real number). Thus H(x) = x^{1/a} and therefore step 2 in the algorithm becomes

The brackets in the above formula represent the INT (integer part) function. Source code is provided in the last section. To test whether p(n+1) = p(n) in the algorithm (in which case convergence is reached and INT(p(n)) is a factor of z,) we test whether | p(n+1) - p(n) | is smaller than 0.0000001.

2. Case Study

Here we focus on z = 395,909,818,831 = 115,901 x 3,415,931, a product of two primes. We are investigating strategies that may apply to very large numbers, as well as putting some foundations for future research on this topic. We have performed over a trillion computations for this number alone, just to find patterns and learn about promising ideas that will, hopefully, lead to very efficient factoring systems. In this section, we summarize our findings, using the simplified fixed point algorithm described in the previous section, and testing it on a large number of parameter sets (a, r) as well as starting points s (referred to as seeds.)

Implementation details

We implemented the algorithm as follows.



We tested 100,000 seeds equally spaced between 50,000 and 60,000. So we tried seeds that were around SQRT(z) / 10. Remember that a seed is a starting point p(0) for the fixed point algorithm. We use the simplified version of this algorithm (see section 1) which has 2 parameters: r and a. For each seed, we tested 1,600 different values of a and r. The results are in the next subsection, and the source code in the last section.

For each seed and for each parameter (a, r) we allowed up to 20 iterations. If no fixed point was found after 20 iterations or when p(n) > SQRT(z) we stopped and moved to the next seed. It means that the seed in question did not provide convergence for us, given our constraints. In some cases, far more than 20 iterations are required for convergence, but we stopped at 20 for efficiency reasons, since our goal is to beat traditional algorithms, rather than studying the shape of the convergence zone in the parameter space (a very interesting problem in itself, but rather theoretical.).

Some values of (a, r) never provide convergence regardless of the seed. If you focus on 0.8 < a < 1 and 0.5 < r < .5, more than 60% of the parameters (a, r) have seeds that yield convergence. In the best cases, up to 0.2% of the seeds will yield convergence, for a specific (a, r), see Table 1, and convergence will occur in around 15 iterations. These "best cases" may not be too difficult to identify. This will be the subject of our next analysis. They were confined around our lower limit a = 0.8, with r mostly between 0.90 and 1.30. Identifying these "best case" is critical to get a high performing algorithm, and this goal can be achieved using data machine learning and pattern recognition.

Table 1: Values of a and r with great performance

Conclusion



The best cases found so far (those listed in table 1) provide a solution for factoring z = 395,909,818,831, in as little as 500 (one out of 500 seeds converging) x 20 (about 20 iterations used per tested seed) = 10,000 operations. Even doing the job in 30,000 operations is still good performance here, as there are about 50,000 prime numbers smaller than SQRT(z) and our algorithm does not use any table of pre-computed prime numbers. Such tables don't exist anyway for numbers with 200 digits.

Deeper Analysis

We start by introducing the concept of efficiency, for a parameter vector (a, r) that has converging seeds. The efficiency is defined as the number of operations required, for a specific (a, r), to find a factor. It involves time spent in many non-converging seeds until we find one that yields convergence. Remember, we limit the number of iterations, for a specific seed and parameter vector (a, r) to a threshold (here 20) to avoid wasting too much computer time on some slow-converging or non-converging seeds. Zones of high and low efficiency are illustrated in figure 2: and correspond to smaller values of r, or large r with small a.

Among the 160,000,000 configurations of seed and (a, r) tested, 8,557 (that is, one out of 18,698) yielded convergence, in an average of 9.27 iterations when converging. It is straightforward to improve performance by a factor 4, by avoiding zones of non-convergence in the parameter space. Figure 2 clearly shows that good parameter vectors (a, r) are not randomly distributed, and are actually clustered. So are the seeds yielding convergence. This fact can be exploited for instance to avoid testing more than 2 seeds within an interval of length 1 (however, beware that some solutions, for a given a and r, can only be found if you use a seed that is not an integer.)

So, pattern recognition techniques can be used here to further optimize the algorithm, and for instance, to identify the optimal threshold for the maximum number of iterations allowed (here, the number 20 is arbitrary.).

Figure 1: Parameter values yielding at least one converging seed (out of 100,000 tested seeds)

Figure 2: Testing 40 x 40 equally spaced parameter values (100,000 seeds for each)

Finally, a number of configurations yield convergence in just two iterations, even when starting with a seed as low as 50,000. They are typically associated with larger values for a and r (see Figure 3) and require processing a large number of seeds (for a given a and r) before finding one with such a fast convergence. As a result, these values of a and r are inefficient, in contrast with vectors (a, r) yielding convergence for a large number of seeds albeit more slowly.

Note that studying non-converging seeds is also intrinsically interesting, and might offer great insights to help localize the elusive converging seeds. Surprisingly, non-converging seeds are associated with sequences p(n) that exhibit strong patterns: after a number of iterations (sometimes as small as n = 3), the same values start to show up, regardless of the seed. This can be used to our advantage: once we hit a value p(n), at a certain n, known to be associated with a non-converging seed, we can just stop the iterating process for the seed in question, rather than wasting as much as 20 iterations.

Also, for small values of a and r (the parameter values with higher efficiency) convergence can be extremely slow. In this case, stopping at the second iteration whenever p(1) - p(0) is smaller than m, where m is some threshold to be determined for optimal performance, can significantly improve the overall efficiency.

Figure 3: Values of a and r with at least one seed converging in 2 iterations

3. Data, Code, Visualizations

Click here to download the summary data produced with the script in section 5: this is an Excel spreadsheet also containing Figure 1, 2, and 3. It consists of all the convergent seeds found for 40 x 40 parameters (a, r) as well as the list of parameter vectors that did not yield converging seeds. A number of high-level statistics are also computed in the various tabs in the spreadsheet, mostly summary statistics at the parameter vector level.

For the data scientist, finding converging seeds is a bit like finding oil when digging wells at random locations to detect the extent and shape of an oil field. Here, the shape of the convergence zone, defined by the efficiency curves, is highly irregular. We plotted the contour map using R, in Figure 4, to show how irregular it is.

Figure 4: Contour map showing efficiency levels for r (X-axis) and a (Y-axis)

Below is the R code used to produce the contour map in Figure 4.

data<-read.table("c:/vincentg/fr.txt",header=TRUE)

w<-data$ln_efficiency

z <- matrix(w, nrow = 40, ncol = 40, byrow = TRUE)

contour(z)

Click here to download the input text file fr.txt, consisting of 1,600 data points: one data point for each parameter (a, r), and for each parameter, its. efficiency. Note that this file is sorted by r then by a in order to work with the contour function in R..Also note that although the contour map is highly irregular, it is nevertheless not totally random. You can compare it with more typical contour maps in this paper. This contour map is also provided to teach you how to produce such charts with R.

Finally, Figure 5 below is provided with two purposes in mind:

Using a little sample to help you visualize how parameter vectors cluster in three categories.

Teach you how to produce scatter plots with multiple series, in Excel

You can download the smaller spreadsheet here. Can you test, using statistical techniques, whether these 100 points -- the data is in the spreadsheet -- are randomly distributed or not? (we know they are not.) To learn about how to produce scatter plots with multiple series in Excel, download our spreadsheet: that's where Figure 5 is coming from.

Figure 5: Testing 100 parameter vectors to find fast converging seeds

4. Improving the Algorithm

So far we have shown the potential of our methodology, which can yield a solution (factoring) in fewer iterations compared with other factoring algorithms. Below are strategies to be explored in the next wave of research, to further enhance -- hopefully dramatically -- the speed of the algorithm, and to more consistently discover good seeds, good parameter values, and stay in configurations of high efficiency.

Evidently the first step is to check how the methodology works to factor other large numbers, especially very big ones: so far we have spent a considerable amount of time, including more than a trillion computations across various tests, to factor a single, rather small integer z (eventually rather efficiently, but we can do much better.)

Strategies to improve efficiency

Using more complex F and G in the generic fixed point algorithm, releasing the assumptions on F and G, allowing p(n) to be an oscillating rather than a strictly increasing sequence.

Rather than using a power and a linear function for F and G, work with other functions that still produce strictly increasing sequences for p(n), but defined by more than two parameters, for better fine-tuning capabilities.

Do not stop the fixed point algorithm at SQRT(z). Allow for more iterations in hope of finding the largest divisor in case the algorithm failed to find the smaller one.

When reaching a value p(n) (for some n) know to be associated with non-converging seeds for the parameter values in question, stop the iteration process right away (rather than completing the 20 iterations) for the seed in question. This requires to pre-compute a table of popular, typical p(n) values associated with non-converging configurations, for the number z to be factored. It has the potential to dramatically improve efficiency, since most seeds do not yield convergence, yet end up with p(n) stuck in the same few values.

Rather than stopping after 20 iterations (an arbitrary number), stop after T iterations, where T is chosen to optimize efficiency. This number T will be a function of z, and for the number z studied here, the optimal T is likely to be smaller than 20. This might be the easiest way to significantly improve efficiency.

Stop the iterative process early for a seed yielding slow growth (say if p(1) - p(0) < m for some m.) This is an easy and powerful way to improve efficiency.

Space the test seeds optimally.

Consider lowering the lower bound for the parameter a, from a = 0.8 to a = 0.2. The reason for this is that the best performance obtained here was on the borderline, with parameter a very close to 0.8. By exploring smaller values of this parameter, along with the improvements suggested in this subsection, we might discover more efficient values for a.

Other articles by the same author

The selection below is a small subset of my publications. Here I only list articles that are related to number theory and probability theory. For a larger list, click here.

5. Source Code

Below is the source code in Perl, to test 100,000 seeds between 50,000 and 60,000, for 40 values of a and 40 values of r. The seed is represented by the variable $seed. The number to factor is z. The maximum number of iterations allowed is set to n < 20 in the While loop.

$z1=115901;

$z2=3415931;

$z=$z1*$z2;

open(OUT,">factor.txt");

$paramID=0;



$sqroot=int(sqrt($z));

$min=50000;

$max=60000;

$step=($max-$min)/100000;

for ($r=0.5; $r<4.5; $r+=0.1) {

for ($a=0.8; $a<1.0; $a+=0.005) {

print OUT "TESTING\t$z\t$paramID\t$r\t$a\t-\t-\t-

";

for ($seed=$min; $seed<=$max; $seed+=$step) {

$n=0;

$old_p=0;

$p=$seed; # $seed represents the seed

while (($p <= $sqroot)&&($p != $old_p)&&($n<20)) { ## $p <= $z1+1

$old_p=$p;

$p=exp($a*log( exp(log($p)/$a) + $r*($z % int($p)) ));

$n++;

$count++;

if ($count %10000 ==0) { print "$paramID | $count | $n | $seed

"; }

if (abs($p-$old_p) < 0.0000001) {

$p=$old_p;

print OUT "FACTOR FOUND\t$z\t$paramID\t$r\t$a\t$seed\t$n\t$p

";

print "$z: from $seed in $n iter to $p ($r)

";

}

} # endWhile

} # endFor $seed

$paramID++;

} # endFor $a

} # endFor $r

close(OUT);

Top DSC Resources

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge