In this data science article, emphasis is on science, not just on data. State-of-the art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications. It is based on mixture models. Mixtures have been studied and used in applications for a long time, and it is still a subject of active research. Yet you will find here plenty of new material.

Content

1. Introduction and Context

2. Approximations Using Mixture Models

The error term

Kernels and model parameters

Algorithms to find the optimum parameters

Convergence and uniqueness of solution

Find near-optimum with fast, black-box step-wise algorithm

3. Example

Data and source code

Results

4. Applications

Optimal binning

Predictive analytics

Test of hypothesis and confidence intervals

Deep learning: Bayesian decision trees

Clustering

5. Interesting problems

Gaussian mixtures uniquely characterize a broad class of distributions

Weighted sums fail to achieve what mixture models do

Stable mixtures

Nested mixtures and Hierarchical Bayesian Systems

Correlations

1. Introduction and Context

In a previous article (see here) I attempted to approximate a random variable representing real data, by a weighted sum of simple kernels such as uniformly and independently, identically distributed random variables. The purpose was to build Taylor-like series approximations to more complex models (each term in the series being a random variable), to

avoid over-fitting,

approximate any empirical distribution (the inverse of the percentiles function) attached to real data,

easily compute data-driven confidence intervals regardless of the underlying distribution,

derive simple tests of hypothesis,

perform model reduction,

optimize data binning to facilitate feature selection, and to improve visualizations of histograms

create perfect histograms,

build simple density estimators,

perform interpolations, extrapolations, or predictive analytics,

perform clustering and detect the number of clusters,

create deep learning Bayesian systems.

Why I've found very interesting properties about stable distributions during this research project, I could not come up with a solution to solve all these problems. The fact is that these weighed sums would usually converge (in distribution) to a normal distribution if the weights did not decay too fast -- a consequence of the central limit theorem. And even if using uniform kernels (as opposed to Gaussian ones) with fast-decaying weights, it would converge to an almost symmetrical, Gaussian-like distribution. In short, very few real-life data sets could be approximated by this type of model.

Source for picture: SAS blog

I also tried with independently but NOT identically distributed kernels, and again, failed to make any progress. By "not identically distributed kernels", I mean basic random variables from a same family, say with a uniform or Gaussian distribution, but with parameters (mean and variance) that are different for each term in the weighted sum. The reason being that sums of Gaussian's, even with different parameters, are still Gaussian, and sums of Uniform's end up being Gaussian too unless the weights decay fast enough. Details about why this is happening are provided in the last section.

Now, in this article, starting in the next section, I offer a full solution, using mixtures rather than sums. The possibilities are endless.

2. Approximations Using Mixture Models

The problem is specified as follows. You have an univariate random variable Y that represents any of your quantitative features in your data set, and you want to approximate or decompose it using a mixture of n elementary independent random variables called kernels and denoted as X(n, k) for k = 1, ..., n, with decreasing probability weights p(n, k) that converge to zero. The approximation of Y based on the first n kernels, is denoted as Y(n). By approximation, we mean that the data-generated empirical distribution of Y is well approximated by the known, theoretical distribution of Y(n) and that as n tends to infinity, both become identical (hopefully).

Moving forward, N denotes your sample size, that is the number of observations; N can be be very large, even infinite, but you want to keep n as small as possible. Generalizations to the multivariate case is possible but not covered in this article. The theoretical version of this consists in approximating any known statistical distribution (not just empirical distributions derived from data sets) by a small mixture of elementary (also called atomic) kernels.

In statistical notation, we have:

We also want Y(n) to converge to Y, in distribution, as n tends to infinity. This implies that for large n, the weights p(n, k) must tend to zero as k tends to infinity.

The error term

There are various ways to define the distance between two distributions, say between Y(n) and Y. See here for details; one of the most popular ones is the Kolmogorov-Smirnov metric. Or you can also use the distance between the inverse of the cumulative distributions, see here for details (read the section on testing for normality, in particular the percentile test.) Regardless of the metric used, the error term is denoted as E(n) = ||Y - Y(n)||. Of course, the problem, for a given value of n, is to minimize E(n). As n tends to infinity, by carefully choosing the parameters in the model (that is, the weights, as well the the means and variances of the kernels,) the error E(n) is supposed to converge to 0. Note that the kernels are independent random variables, but not identically distributed: a mix of kernels with different means and variances is not only allowed, but necessary to solve this optimization problem.

Kernels and model parameters

Besides the weights, the other parameters of the models are the parameters attached to each kernel X(n, k). Typically, each kernel X(n, k) is characterized by two parameters: a(n, k) and b(n, k). In the case of Gaussian kernels, a(n, k) is the mean and b(n, k) is the variance; b(n, k) is set to 1. In the case of Uniform kernels with Y taking on positive values, a(n, k) is the lower bound of the support interval, while b(n, k) is the upper bound; in this case, since we want the support domains to form a partition of the set of positive real numbers (the set of potential observations), we use, for any fixed value of n, a(n, 1) = 0 and b(n, k) = a(n, k+1).

Finally, the various kernels should be re-arranged (sorted) in such a way that X(n, 1) always has the highest weight attached to it, followed by X(n, 2), X(n, 3) and so on. The methodology can also be adapted to discrete observations and distributions, as we will discuss later in this article.

Algorithms to find the optimum parameters

The goal is to find optimum model parameters, for a specific n, to minimize the error E(n). And then try bigger and bigger values of n, until the error is small enough. This can be accomplished in various ways.

The solution consists in computing the derivatives of E(n) with respect to all the model parameters, and then finding the roots (parameter values that make the derivatives vanish, see for instance this article.) For a specific value of n, you will have to solve a non-linear system of m equations with m parameters. In the case of Gaussian kernels, m = 2n. For uniform kernels, m = 2n + 1 (n weights, n interval lower bounds, plus upper bound for the rightmost interval.) No exact solution can be found, so you need to use an iterative algorithm. Potential modern techniques used to solve this kind of problem include:

You can also use Monte-Carlo simulations, however here you face the curse of dimensionality, the dimension being the number m of parameters in your model. In short, even for n as small as n = 4 (that is, m = 8), you will need to test trillions of randomly sampled parameter values (m-dimensional vectors) to get a solution close enough to the optimum, assuming that you use raw Monte-Carlo techniques. The speed of convergence is an exponential function of m. Huge improvements to this method are discussed later in this section, using some kind of step-wise algorithm to find local optima, reducing it to a 2-dimensional problem. By contrast, speed of convergence is quadratic for gradient-based methods, if E(n) is convex in the parameter space. Note that here, E(n) may not always be convex though.

Convergence and uniqueness of solution

In theory, both convergence and the fact that there is only one global optimum, are guaranteed. It is easy to see that, under the constraints imposed here on the model parameters, two different mixture models must have two distinct distributions. In the case of Uniform kernels, this is because the support domains of the kernels form a partition, and are thus disjoint. In the case of Gaussian kernels, as long as each kernel has a different mean, no two mixtures can have the same distribution: the proof is left as an exercise. To put it differently, any relatively well behaved statistical distribution is uniquely characterized by its set of parameters associated with its mixture decomposition. When using Gaussian kernels, this is equivalent to the fact that any infinitely differentiable density function is uniquely characterized by its coefficients in its Taylor series expansion. This is discussed in the last section.

The fact that under certain conditions, some of the optimization algorithms described in the previous subsection, converge to the global optimum, is more difficult to establish. It is always the case with the highly inefficient Monte Carlo simulations. In that case, the proof is pretty simple and proceeds as follows

Consider the discrete case where Y takes only on positive integer values (for example, your observations consist of counts,) and use the discrete Uniform kernel.

In that case, the solution will converge to a mixture model where each kernel support domain is a set with one value, and its associate weight is the frequency of that value, in your observed data. This is actually the global optimum, with E(n) converging to 0 as n tends to infinity.

Continuous distributions can be approximated by discrete distributions after proper re-scaling. For instance, a Gaussian distribution can be perfectly approximated by sequences of increasingly granular binomial distributions. Thus, the convergence to a global optimum, can be derived from the convergence obtained for the discrete approximations.

The stopping rule, that is, deciding when n is large enough, is based on how fast E(n) continue to improve as n increases. Initially, for small but increasing values of n, E(n) will drop sharply, but for some value of n usually between n = 3 and n = 10, improvements will start to taper off, with E(n) slowing converging to 0. If you plot E(n) versus n, the curve will exhibit an elbow, and you can decide to stop at the elbow. See the elbow rule here.

Finally, let us denote as a(k) the limit of a(n, k) as n tends to infinity; b(k) and p(k) are defined in a similar manner. Keep in mind that the kernels must be ordered by decreasing value of their associated weights. In the continuous case, a theoretical question is whether or not these limits exist. With Uniform kernels, p(n, k), as well as b(n, k) - a(n, k), that is, the length of the k-th interval, should both converge to 0, regardless of k, as n tends to infinity. The limiting quotient represents the value of Y's density at the point covered by the interval in question. Also, the sum of p(n, k) over all k's, should still be equal to one, at the limit as n tends to infinity. In practice, we are only interested in small values of n, typically much smaller than 20.

Find near-optimum with fast step-wise algorithm

A near optimum may be obtained fairly quickly with small values of n, and in practice this is good enough. To further accelerate the convergence, one can use the following step-wise algorithm, with the Uniform kernel. At iteration n+1, modify only two adjacent kernels that were obtained at iteration n (that is, kernels with adjacent support domains) as follows:

Increase the upper bound of the left interval, and decrease the lower bound of the right interval accordingly. Or do the other way around. Note that the cumulative density within each interval, before or after modification, is always equal to 1, since we are using uniform kernels.

Adjust the two weights, but keep the sum of the two weights unchanged.

So in fact you are only modifying two parameters (degrees of freedom is 2.) Pick up the two adjacent intervals, as well as the new weights and lower/upper bounds, in such a way as to minimize E(n+1).

3. Example

Here, I illustrate some of the concepts explained earlier, with an example based on simulated data. The source code and the data is provided so that my experiment can be replicated, and the technical details understood. The 10,000 data points generated (representing Y) are deviates from a skewed, non-symmetrical negative binomial distribution, taking integer values between 0 and 110. Thus we are dealing with discrete observations and distributions. The kernels have discrete uniform distributions, for instance uniform on {5, 6, 7, 8, 9, 10, 11} or on {41, 42, 43, 44}. The choice of a non-symmetrical target distribution (for Y) is to illustrate the fact that the methodology also works for non-Gaussian target variables, unlike the classic central limit theorem framework applying to sums (rather than mixtures) and where convergence is always towards a Gaussian. Here instead, convergence is towards the simulated negative binomial target.

I tried to find online tools to generate deviates from any statistical distribution, but haven't found any interesting ones. Instead, I used R to generate the 10,000 deviates, with the following commands:

The first line of code generates the 10,000 deviates from a negative binomial distribution, the second line produces its histogram with 50 bins (see picture below, where the vertical axis represents frequency counts, and the horizontal axis represents values of Y.) The third line of code exports the data to an output file that will first be aggregated and then used as an input for the script that (1) computes the model parameters, and (2) computes and minimizes the error E(n).

Histogram for the 10,000 deviates (negative binomial distribution) used in our example

Data and source code

The input data set for the script that processes the data, can be found here. It consists of the 10,000 negative binomial deviates (generated with the above R code), and aggregated / sorted by value. For instance, the first entry (104) means that among the 10,000 deviates, 104 of them have a value equal to 0. The second entry (175) means that among the 10,000 deviates, 175 of them have a value equal to 1. And so on.

The script is written in Perl (you are invited to write a Python version) but it is very easy to read and well documented. It illustrates the raw Monte-Carlo simulations with 4 discrete uniform kernels. So it is very inefficient in terms of speed, but easy to understand, with few lines of code. You can find it here. It produced the distribution (mixture of 4 kernels) that best approximates the above histogram, see picture below.

Approximation of above histogram with mixture model, using 4 uniform kernels

Results

The chart below shows a contour plot for the error E(2), when using n = 2 discrete uniform kernels, that is two intervals, with lower bounds of the first interval displayed on the vertical axis, and upper bounds on the horizontal axis. The upper bound of the second (rightmost) interval was set to the maximum observed value, equal to 110. Ignore the curves above the diagonal; they are just a mirror of the contours below. Outside the kernel intervals, densities were kept to 0. Clearly the best kernel (discrete) intervals to approximate the distribution of Y, are visually around {1, 2, ... , 33} and {34, ..., 110} corresponding to a lower bound of 1, and an upper bound of 33 for the first interval; it yields an error E(2) less than 0.45.

The contour plot below was produced using the contour function in R, using this data set as input, and the following code:

The interesting thing is that the error function E(n), as a function of the mixture model parameters, exhibits large areas of convexity containing the optimum parameters, when n = 2. This means that gradient descent algorithms (adapted to the discrete space here) can be used to find the optimum parameters. These algorithms are far more efficient than Monte-Carlo simulations.



Contour plot showing the area where optimum parameters are located, minimizing E(1)

I haven't checked if the convexity property still holds in the continuous case, or when you include the weight parameters in the chart, or for higher values of n. It still might, if you use the fast step-wise optimization algorithm described earlier. This could be the best way to go numerically, taking advantage of gradient descent algorithms, and optimizing only a few parameters at a time.

Now I discuss the speed of convergence, and improvements obtained by increasing the number of kernels in the model. Here, optimization was carried out via very slow, raw Monte-Carlo simulations. The table below shows the interval lower bounds and weights associated with the discrete uniform kernel, for n = 4, obtained by running 2 million simulations. The upper bound of the rightmost interval was set to the maximum observed value, equal to 110. For any given n, only simulations performing better than all the previous ones, are displayed: in short, these are the records. Using n = 5 does not significantly improve the final error E(n). Low errors with n = 2, 3, and 4 were respectively 0.41, 0.31, and 0.24. They were obtained respectively at iterations 7,662 (n = 2), 96,821 (n = 3) and 1,190,575 (n=4). It shows how slow Monte-Carlo converges, and the fact that the number of required simulations grows exponentially with the dimension n. The Excel spreadsheet, featuring the same table for n = 2, 3, 4, and 5, can be found here.

4. Applications

The methodology proposed here has many potential applications in machine learning and statistical science. These applications were listed in the introduction. Here, I just describe a few of them in more details.

Optimal binning

These mixtures allow you to automatically create optimum binning of univariate data, with bins of different widths and different sizes. In addition, the optimum number of bins can be detected using the elbow rule described earlier. Optimum binning is useful in several contexts: visualization (to display meaningful histograms), in decision trees, and in feature selection procedures. Some machine learning algorithms, for instance this one (see also here,) rely on features that are not too granular and properly binned, to avoid over-fitting and improve accuracy and processing time. These mixture models are handy tools to help with this.

For more on optimal binning, read this article (2013) or check the relevant R package smbinning.

Predictive analytics

Since this methodology creates a simple model to fit with your data, you can use that model to predict frequencies, densities, (including perform full density estimation), intensities, or counts attached to unobserved data points, especially if using kernels with infinite support domains, such as Gaussian kernels. It can be used as a regression technique, or for interpolation or extrapolation, or for imputation (assigning a value to a missing data point), all of this without over-fitting. Generalizing this methodology to multivariate data will make it even more useful.

Test of hypothesis and confidence intervals

These mixtures help build data-driven intermediate models, something in-between a basic Gaussian or exponential or whatever fit (depending on the shape of the kernels) and non-parametric empirical distributions. It also comes with core parameters (the model parameters) automatically estimated. Confidence intervals and tests of hypothesis are easy to derive, using the approximate mixture model distribution to determine statistical significance, p-values, or confidence levels, the same way you would do with standard, traditional parametric distributions.

Clustering

Mixture models were invented long ago for clustering purposes, in particular under a Bayesian framework. This is also the case here, and even more so as this methodology gets extended to deal with multivariate data. One advantage is that it can automatically detect the optimum number of clusters thanks to its built-in stopping rule, known as the elbow rule. Taking advantage of convexity properties in the parameter space, to use gradient descent algorithm for optimization, the techniques described in this article could perform unsupervised clustering faster than classical algorithms, and be less computer intensive.

Deep learning: Bayesian decision trees

See the subsection on nested mixtures, in section 5, for details.

5. Interesting Problems

We discuss here, from a more theoretical point of view, two fundamental results mentioned earlier, as well as new topics of interest about mixtures, including stable, nested mixtures and potential use in deep learning. All mixtures here may be infinite, and the kernels (in the mixture model) can be correlated.

Gaussian mixtures uniquely characterize a broad class of distributions

Let us consider an infinite mixture model with Gaussian kernels, each with a different mean a(k), same variance equal to 1, and weights p(k) that are strictly decreasing. Then the density associated with this mixture is

Two different sets of (a(k), p(k)) will result in two different density functions, thus the representation uniquely characterizes a distribution. Also, the exponential functions in the sum can be expanded as Taylor series. Thus we have:

Density functions infinitely differentiable at y = 0, can be represented in this way. Convergence issues are beyond the scope of this article.

Weighted sums fail to achieve what mixture models do

It is not possible, using an infinite weighted sum of independent kernels of the same family, to represent any arbitrary distribution. This fact was established here in the case where all the kernels have the exact same distribution. It is mostly an application of the central limit theorem. Here we generalize this theorem to kernels from a same family of distributions, but not necessarily identical. By contrast, the opposite is true if you use mixtures instead of weighted sums.

With a weighted sum of Gaussian kernels of various means and variances, we always end up with a Gaussian distribution (see here for explanation.) With Uniform kernels (or any other kernel family) we can prove the result as follows:

Consider a sum of n kernels from a same family. Say n(1) of them have (almost) the same parameters, another n(2) of them have the same parameters but different from the first group, another n(3) of them have the same parameters but different from the first two groups, and so on, with n = n(1) + n(2) + ...

Let n tends to infinity, with n(1), n(2) and so on also tend to infinity. The weighted sum in each group will converge to Gaussian, by virtue of the central limit theorem.

The overall sum across all groups will tend to a sum of Gaussian, and thus must be Gaussian. This depends on how fast the weights are decaying. Details about the decaying rate, for the result to be correct, are provided in my previous article.

By contrast, a mixture or any number of Gaussian kernels with different means, is not Gaussian.

Stable mixtures

Just like the Gaussian family is stable with respect to weighted sums, in the sense that the weighted sum of independent Gaussian is Gaussian (it is indeed the only type of distribution with finite variance, stable under addition, see here), is it possible to find families of kernel distributions that are stable when mixed? In order to answer this question, it is enough to identify two different kernels X and Z belonging to a same family, such that the densities (regardless of the kernel parameters) satisfy

with Y also belonging to the same family, regardless of the weight p. Note that X and Z can be correlated here. Clearly, the Gaussian family is not stable under mixing.

However, there is actually a large number of stable kernels for mixture models. Let g and h be arbitrary density functions. Then we have the following result:

In short, for mixtures, we have an infinite class of stable kernel families, of all shapes. Interestingly, if you choose two Gaussian with different means for g and h, then the resulting kernel (a mixture itself), is stable under mixing. That is, it belongs to the same family. So a mixture of different Gaussian constitutes a stable family of distributions for mixtures, but not for weighted sums. Yet the Gaussian kernel itself is not stable for mixtures, while it is the only stable family for weighted sums.

Note that stable kernels are not limited to two components. It easily generalizes to n components.

Nested mixtures and Hierarchical Bayesian Systems

In the previous subsection on stable mixtures, we've seen that the components (kernels) of a mixture can be mixtures themselves. So you can recursively build a tree of nested mixtures, with as many nodes as you wish, and as deep as you wish. What's more, all the mixtures, at any level in the hierarchy, can share the same arbitrary family of distributions (with any number of parameters), each mixture with its own set of parameters. This is just a sandardized deep learning, Bayesian hierarchical system. For instance, with the notations used in the previous subsection, P(Y), P(Z | Y) and P(X | Y) have the same distribution, up to a change in parameters. This also works if the distributions are multivariate.

For a standard treatment of nested mixtures, see here (Deep Gaussian Mixture Models, paper submitted for publication in November 2017) and here (Hierarchical Mixture Models for Nested Data Structures, undated).

Correlations

A sum Y = X + Z of independent random variables is always correlated with each of its summands (unless Y is constant, which is not possible if X and Z are independent.) This is also true for mixtures. Using the same mixture (with two components) as in the subsection on stable mixtures, prove the following:

Here the Greek symbols rho and sigma are used to denote the correlation and standard deviation, respectively. How does this formula generalize to any number of kernels?

A consequence is that for kernels with identical variances (as in the theoretical model), ordered by decreasing weights, the successive correlations between a component (kernel) and the target distribution Y, are also decreasing. This is a bit like a principal component analysis, and it can also be used for data reduction. The difference here is that the components are created from scratch, using the algorithms described in section 2. In practice, unequal kernel variances are allowed: they boost the speed of convergence, but the price to pay, depending on the kernel family being used, is that two different sets of parameters can lead to the same target distribution Y. The solution may no longer be unique.

Note that if instead of a mixture, we consider the weighted sum Y = pX + qZ, with X and Z independent, the correlation formulas above (as well as the conclusions) are still valid; the only thing that changes is the formula for the variance of Y.

To not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on on LinkedIn, or visit my old web page here.

DSC Resources

Follow us: Twitter | Facebook