This PEP proposes the addition of a module for common statistics functions such as mean, median, variance and standard deviation to the Python standard library. See also http://bugs.python.org/issue18606

The proposed statistics module is motivated by the "batteries included" philosophy towards the Python standard library. Raymond Hettinger and other senior developers have requested a quality statistics library that falls somewhere in between high-end statistics libraries and ad hoc code. Statistical functions such as mean, standard deviation and others are obvious and useful batteries, familiar to any Secondary School student. Even cheap scientific calculators typically include multiple statistical functions such as:

- mean
- population and sample variance
- population and sample standard deviation
- linear regression
- correlation coefficient

Graphing calculators aimed at Secondary School students typically include all of the above, plus some or all of:

- median
- mode
- functions for calculating the probability of random variables from the normal, t, chi-squared, and F distributions
- inference on the mean

and others. Likewise, spreadsheet applications such as Microsoft Excel, LibreOffice and Gnumeric include rich collections of statistical functions.

In contrast, Python currently has no standard way to calculate even the simplest and most obvious statistical functions such as mean. For those who need statistical functions in Python, there are two obvious solutions:

1. install numpy and/or scipy; or
2. use a Do It Yourself solution.

Numpy is perhaps the most full-featured solution, but it has a few disadvantages:

It may be overkill for many purposes. The documentation for numpy even warns "It can be hard to know what functions are available in numpy. This is not a complete list, but it does cover most of them." and then goes on to list over 270 functions, only a small number of which are related to statistics.

Numpy is aimed at those doing heavy numerical work, and may be intimidating to those who don't have a background in computational mathematics and computer science. For example, numpy.mean takes four arguments:

mean(a, axis=None, dtype=None, out=None)

Fortunately for the beginner or casual numpy user, three of them are optional and numpy.mean does the right thing in simple cases:

>>> numpy.mean([1, 2, 3, 4])
2.5

For many people, installing numpy may be difficult or impossible. For example, people in corporate environments may have to go through a difficult, time-consuming process before being permitted to install third-party software. For the casual Python user, having to learn about installing third-party packages in order to average a list of numbers is unfortunate.

This leads to option number 2, DIY statistics functions. At first glance, this appears to be an attractive option, due to the apparent simplicity of common statistical functions. For example:

import math

def mean(data):
    return sum(data)/len(data)

def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2)/n
    return ss/(n-1)

def standard_deviation(data):
    return math.sqrt(variance(data))

The above appears to be correct with a casual test:

>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
7.5

But adding a constant to every data point should not change the variance:

>>> data = [x+1e12 for x in data]
>>> variance(data)
0.0

And variance should never be negative:

>>> variance(data*100)
-1239429440.1282566

By contrast, the proposed reference implementation gets the exactly correct answer 7.5 for the first two examples, and a reasonably close answer for the third: 6.012. numpy does no better.

Even simple statistical calculations contain traps for the unwary, starting with the Computational Formula itself. Despite the name, it is numerically unstable and can be extremely inaccurate, as can be seen above: it subtracts two large, nearly equal sums, cancelling away most of the significant digits. It is completely unsuitable for computation by computer. This problem plagues users of many programming languages, not just Python, as coders reinvent the same numerically inaccurate code over and over again, or advise others to do so.
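The usual textbook remedy is a two-pass algorithm that subtracts the mean before squaring, so the intermediate values stay small. The following is a minimal sketch of that idea (not the proposed module's actual implementation, and variance2 is a name chosen here for illustration):

def variance2(data):
    # Two-pass algorithm: compute the mean first, then sum squared
    # deviations from it.  Keeping the intermediate values small
    # avoids the catastrophic cancellation seen above.
    n = len(data)
    mu = sum(data)/n
    return sum((x - mu)**2 for x in data)/(n - 1)

With this version, the shifted data that broke the Computational Formula gives the exact answer:

>>> variance2([x + 1e12 for x in [1, 2, 4, 5, 8]])
7.5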

It isn't just the variance and standard deviation. Even the mean is not quite as straightforward as it might appear. The above implementation seems too simple to have problems, but it does:

- The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this "torture test":

  assert mean([1e30, 1, 3, -1e30]) == 1

  returning 0 instead of 1, a purely computational error of 100%.

- Using math.fsum inside mean will make it more accurate with float data, but it also has the side-effect of converting any arguments to float even when unnecessary. For example, we should expect the mean of a list of Fractions to be a Fraction, not a float. Both behaviours are demonstrated in the sketch after this list.
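The following fsum-based mean (fmean is a hypothetical helper used only for this illustration, not part of the proposal) passes the torture test but coerces exact types to float:

import math
from fractions import Fraction

def fmean(data):
    # math.fsum tracks partial sums exactly, so floats of wildly
    # differing magnitude no longer swamp one another -- but it
    # always returns a float, discarding exact numeric types.
    return math.fsum(data)/len(data)

>>> fmean([1e30, 1, 3, -1e30])
1.0
>>> fmean([Fraction(1, 4), Fraction(3, 4)])  # a float, not Fraction(1, 2)
0.5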

While the above mean implementation does not fail quite as catastrophically as the naive variance does, a standard library function can do much better than the DIY versions.

The example above involves an especially bad set of data, but even for more realistic data sets accuracy is important. The first step in interpreting variation in data (including dealing with ill-conditioned data) is often to standardize it to a series with variance 1 (and often mean 0). This standardization requires accurate computation of the mean and variance of the raw series. Naive computation of mean and variance can lose precision very quickly. Because precision bounds accuracy, it is important to use the most precise algorithms for computing mean and variance that are practical, or the results of standardization are themselves useless.
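As a concrete illustration, a standardization helper built on the proposed module's mean() and stdev() functions might look like this (a sketch against the proposed API, not the reference implementation itself):

import statistics

def standardize(data):
    # Rescale data to mean 0 and variance 1.  Any inaccuracy in
    # mean() or stdev() propagates into every standardized value.
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [(x - mu)/sigma for x in data]

>>> [round(x, 4) for x in standardize([1, 2, 4, 5, 8])]
[-1.0954, -0.7303, 0.0, 0.3651, 1.4606]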