$\begingroup$

These are primality examples, because they're common.

(1) Primality in SymPy. Issue 1789. SymPy used an incorrect test that had been posted on a well-known web site; the first failure doesn't show up until after 10^14. While the fix was correct, it was just patching holes rather than rethinking the issue.

(2) Primality in Perl 6. Perl 6 has added is-prime, which runs a number of M-R tests with fixed bases. There are known counterexamples, but they're quite large since the default number of tests is huge (basically hiding the real problem by degrading performance). This will be addressed soon.

(3) Primality in FLINT. n_isprime() returning true for composites, since fixed. Basically the same issue as SymPy. Using the Feitsma/Galway database of SPRP-2 pseudoprimes to 2^64 we can now test these.

(4) Perl's Math::Primality. is_aks_prime broken. This sequence seems similar to lots of AKS implementations -- lots of code that either worked by accident (e.g. got lost in step 1 and ended up doing the entire thing by trial division) or didn't work for larger examples. Unfortunately AKS is so slow that it is difficult to test.

(5) Pari's pre-2.2 is_prime. Math::Pari ticket. It used 10 random bases for M-R tests (with fixed seed on startup, rather than GMP's fixed seed every call). It will tell you 9 is prime about 1 out of every 1M calls. If you pick the right number you can get it to fail relatively often, but the numbers become sparser, so it doesn't show up much in practice. They have since changed the algorithm and API.
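The "1 out of every 1M calls" figure for 9 can be sanity-checked with a short count. This is only a sketch: it assumes each of the 10 bases is drawn uniformly from 1..n-1, which is an assumption made for illustration and not something taken from the Pari source. `is_sprp` is a hypothetical helper name.

```python
from fractions import Fraction

def is_sprp(n, a):
    """True if odd n > 2 passes a Miller-Rabin (strong probable prime) test to base a."""
    d, s = n - 1, 0
    while d % 2 == 0:          # write n - 1 = d * 2^s with d odd
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):
        x = x * x % n
        if x == n - 1:
            return True
    return False

n = 9
# Bases for which the composite 9 looks prime ("liars").
liars = [a for a in range(1, n) if is_sprp(n, a)]
per_round = Fraction(len(liars), n - 1)   # assumed: uniform base in 1..n-1
ten_rounds = per_round ** 10
print(liars, per_round, float(ten_rounds))
```

Under that assumption the only liars are 1 and 8, so each round passes with probability 1/4 and ten independent rounds pass with probability 4^-10, a bit under one in a million.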

This next one isn't a bug, but it's a classic question with probabilistic tests: how many rounds do you give, say, mpz_probab_prime_p? If we give it 5 rounds, it sure looks like it works well: numbers have to pass a base-210 Fermat test and then Miller-Rabin tests with 5 pre-selected bases. You won't find a counterexample until 3892757297131 (with GMP 5.0.1 or 6.0.0a), so you'd have to do a lot of testing to find it. But there are thousands of counterexamples under 2^64. So you keep raising the number. How far? Is there an adversary? How important is a correct answer? Are you confusing random bases with fixed bases? Do you know what input sizes you'll be given?
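To make the fixed-bases problem concrete, here is a minimal sketch of a Miller-Rabin test with pre-selected bases. It is not GMP's or Perl 6's code; the names `is_sprp` and `mr_fixed_bases` are hypothetical. The composite 3215031751 = 151 * 751 * 28351 is the well-known smallest number that passes bases 2, 3, 5 and 7.

```python
def is_sprp(n, a):
    """True if odd n > 2 passes a Miller-Rabin (strong probable prime) test to base a."""
    d, s = n - 1, 0
    while d % 2 == 0:          # write n - 1 = d * 2^s with d odd
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):
        x = x * x % n
        if x == n - 1:
            return True
    return False

def mr_fixed_bases(n, bases):
    """Probable-prime test using a fixed list of bases (illustrative only)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    # Skip bases that are multiples of n so tiny primes are not rejected.
    return all(is_sprp(n, a) for a in bases if a % n != 0)

n = 151 * 751 * 28351                          # 3215031751, clearly composite
print(mr_fixed_bases(n, (2, 3, 5, 7)))         # True: all four bases are fooled
print(mr_fixed_bases(n, (2, 3, 5, 7, 11)))     # False: base 11 catches it
```

Adding a base pushes the first failure further out, but it doesn't remove it, and an adversary who knows the bases can pick inputs accordingly.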

There is a related point: what is a big number? Many students seem to think 10,000 is a huge number. To many programmers, $10^{16}$ is a big number. To programmers working on cryptography, these are small, and big is, say, 4096 bits. To programmers working on computational number theory, these are all small, and big might be 10 to 100 thousand decimal digits. To some mathematicians all of these may count as "not big", considering that there are many more positive numbers larger than these examples than there are smaller. This is something a lot of people don't think about, but it makes a difference when thinking about correctness and performance.

These are quite difficult to test correctly. My strategy includes obvious unit tests, plus edge cases, plus examples of failures seen before or in other packages, plus tests against known databases where possible (e.g. if you do a single base-2 M-R test, then you've reduced the computationally infeasible task of testing 2^64 numbers to testing about 32 million numbers), and finally, lots of randomized tests using another package as a standard. The last point works for functions like primality where there is a fairly simple input and a known output, and quite a few tasks are like this. I have used this to find defects in my own development code as well as occasional problems in the comparison packages. But given the infinite input space, we can't test everything.
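The "another package as a standard" idea can be sketched in a few lines. Everything here is illustrative: `weak_is_prime` is a deliberately flawed candidate (a single base-2 M-R test), `trial_division` stands in for the trusted reference package, and `find_disagreements` is a hypothetical helper name.

```python
import random

def trial_division(n):
    """Slow but simple reference answer."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def weak_is_prime(n):
    """Candidate under test: one Miller-Rabin round to base 2 (known to be flawed)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    x = pow(2, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):
        x = x * x % n
        if x == n - 1:
            return True
    return False

def find_disagreements(candidate, reference, inputs):
    """Return every input on which the two implementations differ."""
    return [n for n in inputs if candidate(n) != reference(n)]

# Randomized comparison; the seed is fixed only so a failure can be replayed.
rng = random.Random(12345)
sample = [rng.randrange(2, 10**6) for _ in range(20000)]
print(find_disagreements(weak_is_prime, trial_division, sample))

# An exhaustive sweep of a small range is a useful complement.
print(find_disagreements(weak_is_prime, trial_division, range(2, 3000)))  # [2047]
```

A disagreement only says that one of the two is wrong, not which one, so each hit still needs a look by hand. Hits from the random sample will be rare, because base-2 strong pseudoprimes are sparse, and that sparseness is why this kind of bug survives casual testing.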

As for proving correctness, here is another primality example. The BLS75 methods and ECPP have the concept of a primality certificate. Basically, after they churn away doing searches to find values that work for their proofs, they can output those values in a known format. One can then write a verifier or have someone else write it. Verifiers run very fast compared to the creation, and if a wrong answer still gets through, then either (1) both pieces of code are incorrect (which is why you'd prefer other programmers for the verifiers), or (2) the math behind the proof idea is wrong. (2) is always possible, but these methods have typically been published and reviewed by multiple people (and in some cases are easy enough for you to walk through yourself).
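BLS75 and ECPP certificates are too long to show here, so the sketch below verifies a much simpler relative, the Pratt (Lucas) certificate: n is prime if some witness a satisfies a^(n-1) = 1 mod n and a^((n-1)/q) != 1 mod n for every prime q dividing n-1, with each q certified the same way. The certificate layout (a dict from prime to witness and factor list) and the name `verify_pratt` are made up for this example.

```python
def verify_pratt(n, cert):
    """Check a Pratt-style certificate.

    cert maps each odd prime p in the proof tree to (witness, prime factors of p - 1).
    The prime 2 is accepted as a base case.
    """
    if n == 2:
        return True
    if n < 2 or n not in cert:
        return False
    a, factors = cert[n]
    factors = set(factors)
    # The claimed factors must be sensible and must completely factor n - 1.
    m = n - 1
    for q in factors:
        if q < 2 or (n - 1) % q != 0:
            return False
        while m % q == 0:
            m //= q
    if m != 1:
        return False
    # The witness must have order exactly n - 1 modulo n.
    if pow(a, n - 1, n) != 1:
        return False
    for q in factors:
        if pow(a, (n - 1) // q, n) == 1:
            return False
        if not verify_pratt(q, cert):   # each factor needs its own proof
            return False
    return True

# 97 - 1 = 2^5 * 3 with witness 5; 3 - 1 = 2 with witness 2.
cert = {97: (5, [2, 3]), 3: (2, [2])}
print(verify_pratt(97, cert))                             # True
print(verify_pratt(97, {97: (4, [2, 3]), 3: (2, [2])}))   # False: 4 is not a valid witness
```

The verifier is a couple of dozen lines and does no searching, which is the point: it is far easier to review than the program that found the witness and the factorization.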

In comparison, methods like AKS, APR-CL, trial division, or the deterministic Rabin test, all produce no output other than "prime" or "composite." In the latter case we may have a factor hence can verify, but in the former case we're left with nothing other than this one bit of output. Did the program work correctly? Dunno.

It's important to test the software on more than just a few toy examples, and also to go through some examples at each step of the algorithm, asking "given this input, does it make sense that I am here with this state?"