

Simon Newcomb and "Natural Numbers" (Benford's Law)

Posted May 2009.

Tony Phillips

Stony Brook University

tony at math.sunysb.edu

Background

In 1881 Simon Newcomb (1835-1909), the Canadian-American astronomer and mathematician, published a "Note on the Frequency of Use of the Different Digits in Natural Numbers." For Newcomb, natural numbers were those occurring "in nature," i.e. the kind of numbers one would run into in the course of everyday life. He discovered, for example, that not all the digits (1, 2, ..., 9) occur with the same frequency in the first place of such a number; he formulated a law (see below) and gave a rough proof which I will attempt to present. This law was rediscovered by Frank Benford ("The law of anomalous numbers," 1938) and is now somewhat unfairly known as "Benford's Law." A mathematically sound and complete proof was published by Theodore Hill in 1995.

An experiment with The New York Times

To get some experimental feeling for the phenomenon, I looked at all the numbers given as numerals in the first 15 pages of a recent edition of The New York Times. I omitted dates and advertisements, and repeats (in the same context, in captions or in tables). For each of those 213 numbers I recorded the first digit, and tabulated the data as follows:

digit   occurrences   frequency
1       56            .26
2       48            .23
3       27            .13
4       20            .09
5       30            .14
6       11            .05
7       8             .04
8       9             .04
9       4             .02

For some of the flavor of Newcomb's "natural number" concept, here are the 8 numbers from this set with initial digit 7:

page   number       reference
A3     71           age of Jane Fonda
A10    70,000       illegal gambling proceeds, Wilkes-Barre, Pa.
A12    787 billion  U.S. economic stimulus package, 2/09
A12    7,365.67     Dow Jones Industrial Average, 2/20/09
A14    7.2          magnitude of hypothetical earthquake
A14    744,000      population of San Francisco
A14    71           age of Senator Ronald W. Burris
A15    70,000       low-end starting salary for butler, New York City

Newcomb's Law

Clearly the distribution is very unsymmetrical.
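A rough check on how lopsided this tally is: under the law stated in this section (all mantissae of logarithms equally probable), a number has first digit d exactly when its mantissa lies between log10(d) and log10(d+1), so digit d should lead with probability log10(1 + 1/d). The Python sketch below recomputes the observed frequencies from the 213 counts and prints them beside that prediction. The language and the function names are choices made for this illustration, not part of Newcomb's note.

```python
import math

# First-digit counts from the New York Times tally in the table above.
observed = {1: 56, 2: 48, 3: 27, 4: 20, 5: 30, 6: 11, 7: 8, 8: 9, 9: 4}
total = sum(observed.values())


def leading_digit_probability(d):
    """Length of the mantissa interval [log10(d), log10(d+1)).

    Under the uniform-mantissa law this length is the probability
    that a number begins with the digit d.
    """
    return math.log10(d + 1) - math.log10(d)


def comparison_rows():
    """Return (digit, count, observed frequency, predicted frequency) rows."""
    return [
        (d, observed[d], observed[d] / total, leading_digit_probability(d))
        for d in range(1, 10)
    ]


if __name__ == "__main__":
    print("digit  count  observed  log-law")
    for d, count, freq, p in comparison_rows():
        print(f"{d:>5}  {count:>5}  {freq:8.3f}  {p:7.3f}")
```

The law gives about .30 for a leading 1 and about .05 for a leading 9, against .26 and .02 in the tally; with only 213 numbers, loose agreement is all that can be expected.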
Newcomb tells us how he was led to his discovery: "That the ten digits do not occur with equal frequency must be evident to anyone making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones. The first significant figure is oftener 1 than any other digit, and the frequency diminishes up to 9." The place where he noticed the phenomenon gave him a clue to its explanation, which he formulated thus:

The law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable.

"Mantissae" probably seems as archaic to today's readers as a starter crank on the front of an automobile, but until 1960 or so every high-school science student was taught the lore of logarithms, and in particular how to use "common" (base-10) logarithmic tables in calculation. Their use involved the separation of a logarithm into two parts: its integer part (the characteristic) and its fractional part (the mantissa).

Here is an example: Suppose, before the days of hand-held calculators, you needed a rapid way to multiply four-digit numbers, and to divide that product by another four-digit number, with an answer accurate to three digits. Say $86.73 \times 1.265 \times 7607 / .3018$.

Procedure: You think of each of the numbers as a power of 10 times a number between 1 and 10:

$86.73 = 10^{1} \times 8.673$

$1.265 = 10^{0} \times 1.265$

$7607 = 10^{3} \times 7.607$

$.3018 = 10^{-1} \times 3.018$.

When you take logarithms, since $\log(ab) = \log a + \log b$,

$\log(86.73) = 1 + \log(8.673)$

$\log(1.265) = 0 + \log(1.265)$

$\log(7607) = 3 + \log(7.607)$

$\log(.3018) = -1 + \log(3.018)$.

The second term in each of the logs is a number between 0 and 1: this will be the mantissa; the leading term is the characteristic. To obtain the log of the answer that we want, $\log(86.73) + \log(1.265) + \log(7607) - \log(.3018)$, we make two calculations. First we add or subtract the characteristics; this is an integer calculation. Users of the "slide-rule" (an analogue device conveniently replacing the consultation of logarithmic tables, common through the first half of the twentieth century) would do this part in their heads. In this case the total is $1 + 0 + 3 - (-1) = 5$. Then you consult a four-place logarithmic table for the mantissae:

$\log(8.673) = .93817$

$\log(1.265) = .10209$

$\log(7.607) = .88121$

$\log(3.018) = .47972$.

The mantissae total (with signs) to $.93817 + .10209 + .88121 - .47972 = 1.44175$. You chop off the "1" and add it to the characteristic. The log table gives $\log(2.765) = .44170$ and $\log(2.766) = .44185$. Since you only expect 3 places of accuracy, you can take 2.765 as the mantissa contribution to the product, which you calculate as $10^{5+1} \times 2.765 = 2{,}765{,}000$. Feeding the numbers into a digital calculator gives an answer to nine places: 2,765,375.13; but if the factors have an indeterminacy in the fifth place, the fourth digit in the product is not reliable: the extra precision is illusory.

Newcomb's argument

Newcomb first argues that all his "natural numbers" are ratios. This makes sense because most natural numbers are given in units, and the number exhibited is the ratio of some measurement to the same measurement taken on some more or less arbitrary token, e.g. the standard kilogram, the solar year. Then he argues that the set of natural numbers must be closed under further formation of ratios, i.e. under multiplication and division. This implies that the set of logarithms of natural numbers is closed under addition and subtraction; and in particular that the set of mantissae of logarithms of natural numbers is closed under addition and subtraction modulo 1, since as in the example above, when a sum of mantissae is greater than 1 the integer part is moved over to the characteristic; and similarly when it is negative. In Newcomb's words: "Since these exponents [the mantissae] are formed by casting off all the integers from a series of numbers, we may suppose them arranged around a circle ..." where we can add and subtract them like angles, except modulo 1 instead of modulo $2\pi$.

Newcomb's leap

Next Newcomb asks the question (translated into our notation): Given a number of points on the circle distributed "according to any arbitrary law," choose $n$ of them at random, say $s_1, s_2, \ldots, s_n$, and form the sum $s_1 \pm s_2 \pm \cdots \pm s_n$ (modulo 1).
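The construction in this question is easy to try numerically. The Python sketch below is an illustration only, under assumptions that are not in Newcomb's note: the points are 50 values crowded into the arc from 0 to .3 (a hypothetical, deliberately uneven "arbitrary law"), each of the n points is drawn independently with replacement, and each sign after the first is a fair coin flip. It tallies the reduced sums over ten equal arcs of the circle for several values of n.

```python
import random


def signed_sum_mod1(points, n, rng):
    """Pick n points at random and return s_1 ± s_2 ± ... ± s_n modulo 1."""
    total = rng.choice(points)
    for _ in range(n - 1):
        # Assumption: each later term is added or subtracted with equal chance.
        total += rng.choice((-1, 1)) * rng.choice(points)
    return total % 1.0


def bin_counts(points, n, trials=20000, bins=10, seed=0):
    """Count how many reduced sums fall in each of `bins` equal arcs."""
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(trials):
        x = signed_sum_mod1(points, n, rng)
        # Clamp guards against a reduced sum that rounds up to exactly 1.0.
        counts[min(int(x * bins), bins - 1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical starting law: 50 points crowded into the arc [0, 0.3).
    setup = random.Random(1)
    points = [setup.uniform(0.0, 0.3) for _ in range(50)]
    for n in (1, 2, 5, 20, 50):
        print(n, bin_counts(points, n))
```

With these assumed inputs, the counts for n = 1 sit entirely in the first three arcs; how quickly they even out as n grows can be read off the printed rows.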
What is the probability that this sum will be contained in a given interval of length $ds$? And he answers: "It is evident that, whatever may be the original law of arrangement," the set of such sums "will approach to an equal distribution around the circle as n is increased," or, in other words, "the required probability will be equal to ds." That is:

The law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable.

This is not evident, but it is plausible. The following figure shows a small simulation of the phenomenon. Here just two "mantissae" $s$ and $t$, corresponding say to natural numbers $m$ and $n$, are chosen; the mantissae corresponding to the products $m^i n^j$ are plotted around the circle of numbers modulo 1, for $i$, $j$ running from 0 to 8. Comparison with the logarithms of numbers starting with 1, 2, etc. suggests an explanation for the distribution of these numbers among natural numbers: a number has first digit $d$ exactly when its mantissa lies between $\log d$ and $\log(d+1)$, and these arcs get shorter as $d$ increases.

a. An illustration of the equal distribution phenomenon Newcomb refers to. Here two numbers $s$ and $t$ are chosen on the circle of circumference 1 (I took numbers corresponding to angles 41° and 95°); the green angles correspond to all the numbers of the form $is + jt$ (modulo 1), for $i$ and $j$ integers between 0 and 8.

b. The mantissae corresponding to the integers 1, 2, ..., 9. This is the same display that occurs on a circular slide-rule (see below).

Part of a circular slide-rule designed by John W. Mauchly. Mauchly was one of the designers of the ENIAC, the first large-scale general-purpose electronic computer. There was presumably another, smaller, paper disc with similar gradations that could rotate on top of this one, and probably a rotating pointer for keeping track of locations. (Image courtesy of University of Pennsylvania Libraries.)

Recent history

It took more than a hundred years for a satisfactory explanation of Newcomb's observation to appear.
The main stumbling block was the lack of a precise mathematical concept corresponding to Newcomb's "natural numbers." Theodore Hill realized that base-invariance was the key property: the uniform distribution of mantissae of natural numbers in any base (not only in base 10); this had already been remarked by Newcomb. As Hill states it, "there is a unique countably-additive base-invariant probability measure on the positive reals."

References

Frank Benford, The law of anomalous numbers, Proceedings of the American Philosophical Society 78 (1938) 551-572

Theodore P. Hill, Base-invariance implies Benford's law, Proceedings of the A. M. S. 123 (1995) 887-895

Simon Newcomb, Note on the Frequency of Use of the Different Digits in Natural Numbers, American Journal of Mathematics 4 (1881) 39-40

Tony Phillips


