PyMix Tutorial

The aim of this tutorial is to provide examples and explanations for the models and methods implemented in the PyMix library. A certain familiarity with Python and mixture model theory is assumed, as the tutorial focuses on the implementation in PyMix. Details on the underlying theoretical concepts can be found in the PyMix publications. A comprehensive introduction to the Python programming language is available in the official Python tutorial.

For questions, comments or requests for additional explanations feel free to contact us via the mailing list.


Quick Start

In this section we are going to jump right in by constructing a mixture of two univariate Gaussians and use it to perform clustering of an array of data. If some step is unclear, have a look at the more detailed coverage in the subsequent sections.

Importing the module

It is assumed that you have successfully installed PyMix before starting with the tutorial. You can check your installation by importing the main PyMix module from your Python interpreter (in the following, >>> is the prompt of the Python interpreter):

>>> import mixture





There should be no error messages or warnings.

Preparing the data

Assume we have an array arr of continuous values which we want to cluster using a mixture of Gaussians.

The data looks something like this

>>> print arr
[-5.21769071 -2.47271105 2.58716869 3.51648016 -4.6036057 -4.6540166
 ... 2.14734122 3.5418215 -4.51690953 -2.59092045]





The first step is to convert the data into a PyMix data set. This is done by creating a new DataSet object and initializing it with the array arr.

>>> data = mixture.DataSet()
>>> data.fromArray(arr)





Now the data is ready to be analyzed by PyMix.

Creating a mixture of Gaussians

In this example we are going to perform clustering using a mixture of two univariate Gaussian distributions. The first step in building the model is to define a Gaussian distribution object for each component.

>>> n1 = mixture.NormalDistribution(-2,0.4)
>>> n2 = mixture.NormalDistribution(2,0.6)





This creates NormalDistribution objects with mean parameters -2 and 2 and standard deviation parameters 0.4 and 0.6 respectively.

The next step is to combine our two Gaussians in a mixture model.

>>> m = mixture.MixtureModel(2,[0.5,0.5], [n1,n2])





This creates a two component mixture model with uniform weights. We can look at the parameters in the mixture by printing it.

>>> print m
G = 2
p = 1
pi =[ 0.5 0.5]
compFix = [0, 0]
Component 0: ProductDist: Normal: [-2, 0.4]
Component 1: ProductDist: Normal: [2, 0.6]





Now that the model has been specified we can perform parameter estimation and clustering.

Performing clustering

We apply the Expectation Maximization (EM) algorithm to obtain maximum likelihood parameter estimates (with the usual EM caveats) from the data. The EM is called with a maximum number of steps of 40 and a convergence criterion (difference in log likelihood between subsequent iterations) of 0.1.

>>> m.EM(data,40,0.1)
Step 1: log likelihood: -9355.08264125 (diff=-9354.08264125)
Step 2: log likelihood: -1924.27837315 (diff=7430.8042681)
Step 3: log likelihood: -1924.03385331 (diff=0.244519838955)
Step 3: log likelihood: -1924.03375783 (diff=9.54820602601e-05)
Convergence reached with log_p -1924.03375783 after 3 steps.





The EM took three steps to converge and we can again print the model to look at the parameters which were learned.

>>> print m
G = 2
p = 1
pi =[ 0.42000304 0.57999696]
compFix = [0, 0]
Component 0: ProductDist: Normal: [-3.93487132706, 1.35292533261]
Component 1: ProductDist: Normal: [2.97564166051, 0.588475818767]





Now all that remains for the clustering is to assign each data point to one of the components. This is done by

>>> clust = m.classify(data)
classify loglikelihood: -1924.03375783.
** Clustering **
Cluster 0 , size 41
[1, 3, 5, 7, 13, ..., 92, 94, 96, 98]
Cluster 1 , size 59
[0, 2, 4, 6, 8, ..., 93, 95, 97, 99]
Unassigend due to entropy cutoff: []





Now clust is an array of cluster labels which can be used for subsequent analysis.
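Internally, this classification is just a maximum-posterior rule: each sample is assigned to the component k maximizing pi_k * N(x | mean_k, sd_k). A small numpy sketch of that rule (illustrative only, not PyMix internals; the parameter values are invented to resemble those learned above):

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def classify_points(x, params, weights):
    """Label each point by the component with the highest posterior.

    params is a list of (mean, sd) pairs; weights are the mixture weights pi.
    The posterior of component k is proportional to pi_k * N(x | mean_k, sd_k).
    """
    dens = np.array([w * normal_pdf(x, m, s)
                     for (m, s), w in zip(params, weights)])
    return np.argmax(dens, axis=0)

labels = classify_points(np.array([-4.0, -2.5, 3.0, 3.5]),
                         [(-3.9, 1.35), (3.0, 0.59)],
                         [0.42, 0.58])
```

Since the normalizing constant of the posterior is shared by all components, comparing the unnormalized values is enough for the argmax.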

Constructing mixture models

The central step in building mixture models in the PyMix framework is the specification of the component distributions. PyMix offers a variety of distributions which can be used to model data from different domains.

Univariate components

In the simplest case the data is one-dimensional and the component distributions are univariate distributions.

Data from a discrete alphabet of symbols can be modeled with a DiscreteDistribution object. For instance

>>> comp = mixture.DiscreteDistribution(4,[0.25,0.25,0.25,0.25])





would create a uniform discrete distribution over an alphabet of size four. Note that if no explicit alphabet is given, a default alphabet is used. Alternatively, we could also specify the symbols of the alphabet in the form of an Alphabet object. For instance, a distribution over the four DNA bases could be obtained as

>>> DNA = mixture.Alphabet(['A','C','G','T'])
>>> comp = mixture.DiscreteDistribution(4,[0.25,0.25,0.25,0.25],DNA)





When the data is continuous, we could model it with a Gaussian

>>> comp = mixture.NormalDistribution(0.0, 1.0)





or Exponential distribution

>>> comp = mixture.ExponentialDistribution(1.0)





Multivariate components

If the data is multi-dimensional, multivariate component distributions are required.

Multinomial distribution

For multivariate discrete data, the multinomial distribution is the distribution of choice. For instance, a uniform MultinomialDistribution object for DNA sequences of length 6 can be obtained by

>>> comp = mixture.MultinomialDistribution(6,4,[0.25,0.25,0.25,0.25],DNA)



Note that DNA is the same Alphabet object defined in the previous section.
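For intuition, the probability a multinomial model assigns to a sequence depends only on its symbol counts. A stand-alone sketch in plain Python (whether PyMix's density includes the multinomial coefficient is not shown here, so treat this as an illustrative assumption):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given symbol counts in n draws."""
    n = sum(counts)
    coeff = factorial(n)          # multinomial coefficient n! / (c1! c2! ...)
    for c in counts:
        coeff //= factorial(c)
    return coeff * prod(p ** c for p, c in zip(probs, counts))

# e.g. the counts of 'AACCGT' over the alphabet (A, C, G, T)
p = multinomial_pmf([2, 2, 1, 1], [0.25] * 4)
```

Under the uniform distribution every individual length-6 sequence has probability 0.25**6; the coefficient counts how many orderings share the same counts.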

Multivariate Gaussians

One of the most popular component distributions for continuous data is the multivariate Gaussian distribution. In PyMix a MultiNormalDistribution object is obtained by

>>> comp = mixture.MultiNormalDistribution(3,[0.5,-3.25,5.0], [[1.0,0.0,0.0],[0.0,2.0,0.0],[0.0,0.0,1.5]] )





The first argument gives the dimensionality, the second a list of mean parameters and the last the covariance matrix.
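To make the parametrization concrete, the density such an object evaluates can be sketched directly in numpy (a stand-alone illustrative computation, not PyMix code):

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, evaluated directly."""
    d = len(mean)
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    maha = diff @ np.linalg.solve(cov, diff)   # (x-mu)^T Sigma^-1 (x-mu)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# evaluated at the mean, only the normalization term remains
lp = mvn_logpdf([0.5, -3.25, 5.0],
                [0.5, -3.25, 5.0],
                [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.5]])
```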

Naive Bayes components

The naive Bayes model assumes independence between the different features of a data set. This means that the joint distribution over all features is given simply by the product of the distributions over each individual feature. One advantage of this formulation is that the features need not have the same distribution, that is, we can for instance combine continuous and discrete features in the same model. In PyMix such a distribution is represented by the ProductDistribution object. A ProductDistribution takes as input a list of ProbDistribution objects.

As an example, let's consider the case of a data set with two discrete features and one Gaussian feature (f1, f2, f3). In order to create a naive Bayes model we first initialize distributions for each individual feature.

>>> f1 = mixture.DiscreteDistribution(4, [0.25, 0.25, 0.25, 0.25])
>>> f2 = mixture.DiscreteDistribution(4, [0.1, 0.4, 0.1, 0.4])
>>> f3 = mixture.NormalDistribution(0.0, 1.0)





In the next step we combine the individual distributions in the ProductDistribution.

>>> comp = mixture.ProductDistribution([f1, f2, f3])
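The factorization a product distribution encodes can be sketched in a few lines of plain numpy: the joint density of a sample is simply the product of the per-feature densities. This is an illustrative sketch, not PyMix code, and the sample values are made up:

```python
import numpy as np

def discrete_pdf(sym, probs):
    """Probability of symbol index sym under a discrete distribution."""
    return probs[sym]

def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def product_density(sample, probs, mean, sd):
    """Naive Bayes joint density: product over the independent features."""
    sym, x = sample
    return discrete_pdf(sym, probs) * normal_pdf(x, mean, sd)

# one discrete feature (parameters of f2 above) and one Gaussian feature (f3)
p = product_density((1, 0.0), [0.1, 0.4, 0.1, 0.4], 0.0, 1.0)
```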





Putting together the mixture

Assume we have constructed two components of one of the model types shown above. A mixture model of these two components is then created by

>>> m = mixture.MixtureModel(2,[0.5,0.5], [comp1, comp2])



where the first argument gives the number of components, the second the mixture weights and the third a list of the components. Note that this syntax is the same for all the different types of components shown above. However, both components must be of the same type.

Reading a mixture from a file

PyMix includes file I/O for models in a simple flat file format. A mixture m can be saved to a file 'm.mix' by

>>> mixture.writeMixture(m, 'm.mix')





The model can be retrieved with

>>> m2 = mixture.readMixture('m.mix')





Formatting the data

Due to the generality of the framework PyMix can be adapted for a wide variety of different data sets. This makes dealing with possible input formats a bit tricky. PyMix deals with this by requiring the data to be loaded into a DataSet object prior to use. For most input data this is a very straightforward process.

Creating a DataSet

Typically, the first step is to create a new, empty DataSet object. A DataSet is initialized by

>>> data = mixture.DataSet()





The next step is to load the data into the DataSet. Depending on the format the data is given in, a different method is used.

Data in a list or array

If the data is already stored in a Python list or numpy array, we can easily load it into the DataSet. For example, a data set of DNA sequences of length three (i.e. a three dimensional, discrete data set) could look like this

>>> l = [['A', 'T', 'G'], ['C', 'A', 'G'], ['C', 'T', 'A']]





Note that the multivariate data is represented as a list of lists in Python. To load this data into the DataSet we use the DataSet.fromList() method

>>> data = mixture.DataSet()
>>> data.fromList(l)





Similarly, if the data is in an array, we use the DataSet.fromArray() method. Consider an example array for a three dimensional, continuous data set with three samples.

>>> import numpy
>>> data = mixture.DataSet()
>>> a = numpy.array( [[0.3, 0.2, -0.5], [1.6, 3.3, -5.3], [3.4, 8.2, -0.4]] )
>>> data.fromArray(a)





Reading data from a file

If the data is stored in a flat text file, the DataSet.fromFile() method can be used. By default the PyMix file format assumes the following:

the first line contains feature IDs

each line after the first contains one sample

the first value in each line is the sample ID

values are tab separated

However, DataSet.fromFile() allows a fair amount of customization to adapt it to specific input files. For instance, the separator symbol (tab by default) can be freely chosen.

An example input file 'data.txt' could look like this:

X_1, X_2, X_3
sample1, 'YES', 0.5, 0.6
sample2, 'NO', 1.4, 5.6
sample3, 'YES', 2.3, 1.1
sample4, 'NO', 2.1, 4.9
sample5, 'NO', 6.3, 3.7





This would give a data set with five samples (sample1-sample5) and three features (X_1, X_2, X_3). The first feature is discrete ('YES'/'NO'), the other two features continuous. In order to read in data.txt we use

>>> data = mixture.DataSet()
>>> data.fromFile('data.txt', sep=',')





Note that we have used the optional sep argument to change the separator symbol to a comma.

We can inspect the data set by printing it

>>> print data
Data set overview:
N = 5
p = 3
sampleIDs = ['sample5', 'sample4', 'sample1', 'sample3', 'sample2']
headers = ['X_1', ' X_2', ' X_3']





For data files which do not fit this pattern, generally the quickest route for use in PyMix is to write a custom function to parse the data into a list and then make use of the DataSet.fromList() method. Note that for biological sequences there are already some dedicated parsing functions available as part of the bioMixture module.
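As a sketch, such a custom parsing function might look like the following. A hypothetical whitespace-separated format is assumed here; adapt the splitting and type conversion to your actual file:

```python
def parse_custom(lines):
    """Parse whitespace-separated lines into a list of lists.

    Numeric fields are converted to float; everything else is kept
    as a string (discrete symbol). The result is suitable for
    DataSet.fromList().
    """
    data = []
    for line in lines:
        row = []
        for field in line.strip().split():
            try:
                row.append(float(field))
            except ValueError:
                row.append(field)   # keep discrete symbols as strings
        data.append(row)
    return data

rows = parse_custom(["A 0.5 0.6", "B 1.4 5.6"])
```

In practice the lines would come from `open(filename)`; any sample IDs or headers should be stripped before handing the list to `DataSet.fromList()`.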

Introducing data and model

Due to the variety of models supported by PyMix, a given data set can be modeled in many different ways. For instance, a data set with three continuous features

>>> a = numpy.array( [[0.3, 0.2, -0.5], [1.6, 3.3, -5.3], [3.4, 8.2, -0.4]] )
>>> data.fromArray(a)





could be represented by a single three-dimensional multivariate Gaussian,

>>> mixture.MultiNormalDistribution(3, [0.5,-3.25,5.0], [[1.0,0.1,0.2],[0.1,2.0,0.5],[0.2,0.5,1.5]] )





a naive Bayes model of three univariate Gaussians

>>> n1 = mixture.NormalDistribution(0.0, 1.0)
>>> n2 = mixture.NormalDistribution(2.0, 0.5)
>>> n3 = mixture.NormalDistribution(-3.0, 1.0)
>>> mixture.ProductDistribution([n1, n2, n3])





or a naive Bayes model of combinations of univariate Gaussians and Exponential distributions.

>>> e1 = mixture.ExponentialDistribution(1.0)
>>> n2 = mixture.NormalDistribution(2.0, 0.5)
>>> n3 = mixture.NormalDistribution(-3.0, 1.0)
>>> mixture.ProductDistribution([e1, n2, n3])





Since the internal flow of the data is somewhat different for each of these models, we have to adapt the DataSet before it can be used with a model. For a given MixtureModel m, this is simply done by

>>> data.internalInit(m)





This initializes the internal data handling structures of DataSet to fit the mixture m.

Sampling from a model

Mixtures are generative models. Sampling data from a given model with known parameters is a crucial instrument in testing and assessing the performance of a model. For a given MixtureModel m a DataSet of size 10 can be sampled by

>>> data = m.sampleDataSet(10)





The resulting DataSet is already initialized for the model m and can immediately be used.
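Conceptually, sampling from a mixture is a two-stage draw: first pick a component according to the weights pi, then draw from that component's distribution. A numpy sketch of this procedure for a univariate Gaussian mixture (an illustration of the idea, not PyMix's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, weights, params):
    """Draw n samples from a univariate Gaussian mixture.

    weights are the mixture weights pi; params is a list of
    (mean, sd) pairs, one per component.
    """
    # stage 1: choose a component index per sample according to pi
    comps = rng.choice(len(weights), size=n, p=weights)
    means = np.array([m for m, s in params])
    sds = np.array([s for m, s in params])
    # stage 2: draw from the chosen component's Gaussian
    return rng.normal(means[comps], sds[comps]), comps

x, comps = sample_mixture(1000, [0.5, 0.5], [(-2.0, 0.4), (2.0, 0.6)])
```

Keeping the component indices comps around is useful for testing, since they are the ground-truth cluster labels of the sampled data.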

Parameter estimation

The central task for clustering with mixture models is learning the model parameters from the data. PyMix employs the standard Expectation Maximization (EM) algorithm. For a mixture m and data set data EM parameter estimation is performed by

>>> m.modelInitialization(data)
>>> m.EM(data, 40, 0.1)
Step 1: log likelihood: -385.925892006 (diff=-384.925892006)
Step 2: log likelihood: -384.722461017 (diff=1.20343098874)
Step 3: log likelihood: -384.111608064 (diff=0.61085295358)
...
Step 14: log likelihood: -378.961590425 (diff=0.102952560783)
Step 14: log likelihood: -378.899700057 (diff=0.061890367485)
Convergence reached with log_p -378.899700057 after 14 steps.



The method MixtureModel.modelInitialization() randomizes the model parameters based on a random assignment of data points to components. In the MixtureModel.EM() method, the second argument gives the maximum number of iterations, the third the convergence criterion in the form of a cut-off on the difference in log likelihood between subsequent iterations.
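For intuition, the E- and M-steps that such an EM call performs can be sketched for a univariate two-component Gaussian mixture in plain numpy. This is a simplified illustration of the algorithm, not PyMix's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def em_two_gaussians(x, max_iter=40, tol=0.1):
    """Minimal EM for a two-component univariate Gaussian mixture."""
    # crude initialization: put the components at the data extremes
    mu = np.array([x.min(), x.max()])
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    prev = -np.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each point
        dens = pi[:, None] * np.exp(-0.5 * ((x - mu[:, None]) / sd[:, None]) ** 2) \
               / (sd[:, None] * np.sqrt(2 * np.pi))
        logp = np.log(dens.sum(axis=0)).sum()
        resp = dens / dens.sum(axis=0)
        # M-step: weighted maximum likelihood updates
        nk = resp.sum(axis=1)
        pi = nk / len(x)
        mu = (resp * x).sum(axis=1) / nk
        sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        # same convergence rule: difference in log likelihood below tol
        if logp - prev < tol:
            break
        prev = logp
    return pi, mu, sd

# toy data: two well-separated Gaussian clusters
x = np.concatenate([rng.normal(-4.0, 1.0, 200), rng.normal(3.0, 0.6, 300)])
pi, mu, sd = em_two_gaussians(x)
```

On well-separated data like this the estimated means and weights land close to the generating values (-4.0 and 3.0, with weights 0.4/0.6).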

One important issue with the EM algorithm is that convergence is only guaranteed to a local maximum of the likelihood. Which local maximum is reached depends on the initial parameters. In order to prevent the procedure from getting trapped in a bad local maximum, the standard approach is to run the EM multiple times from different starting points and retain the parameters with the highest likelihood. In PyMix this can be done by

>>> m.randMaxEM(data, 10, 40, 0.1)
...
Best model likelihood over 10 random initializations:
Model likelihoods: [-378.89970005741327, -379.08088789293652, ..., -382.92603676773638]
Average logp: -379.4309561 SD: 1.19503997444
Best logp: -378.804525502





where the second argument gives the number of EM runs to be performed. Each run is based on initial parameters obtained by a call to MixtureModel.modelInitialization().
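The best-of-several-restarts strategy behind this can be sketched generically. Here fit() is a hypothetical stand-in for one randomized initialization followed by an EM run, returning a log likelihood and the fitted parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def best_of_restarts(fit, n_runs):
    """Run a randomized fitting routine several times, keep the best score."""
    best_logp, best_params = -np.inf, None
    for _ in range(n_runs):
        logp, params = fit()
        if logp > best_logp:
            best_logp, best_params = logp, params
    return best_logp, best_params

# toy stand-in: each "run" lands in a random local optimum in (-385, -380]
logp, params = best_of_restarts(lambda: (float(-380 - rng.random() * 5), None), 10)
```

The returned value is the maximum over all runs, mirroring how the best log likelihood over the random initializations is retained.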

Bayesian mixture models

The Bayesian mixtures in PyMix allow the integration of prior knowledge in the form of a prior distribution over the model parameters. Based on such a prior, maximum a posteriori (MAP) estimation is performed for parameter learning.

The prior distributions used are the respective conjugate priors. For instance, in case of a mixture of naive Bayes components with three discrete distributions

>>> d1 = mixture.DiscreteDistribution(4, [0.25, 0.25, 0.25, 0.25])
>>> d2 = mixture.DiscreteDistribution(4, [0.1, 0.4, 0.1, 0.4])
>>> d3 = mixture.DiscreteDistribution(4, [0.3, 0.3, 0.2, 0.2])
>>> comp = mixture.ProductDistribution([d1, d2, d3])





the conjugate prior is a product of three Dirichlet distributions. The prior over the mixture weights pi is also Dirichlet. A MixtureModelPrior object for such a mixture is obtained by

>>> piPr = mixture.DirichletPrior(3,[1.0,1.0,1.0])
>>> dir1 = mixture.DirichletPrior(4,[1.0,1.0,1.0,1.0])
>>> dir2 = mixture.DirichletPrior(4,[1.0,1.0,1.0,1.0])
>>> dir3 = mixture.DirichletPrior(4,[1.0,1.0,1.0,1.0])
>>> prior = mixture.MixtureModelPrior(0.03, 0.03, piPr, [dir1, dir2, dir3])





The first two arguments of MixtureModelPrior are only relevant for the CSI structure learning. The prior over the mixture weights piPr is always a DirichletPrior. The final argument is a list of prior distributions appropriate for the distributions in the mixture components.

Now, assuming comp1 and comp2 are two further ProductDistribution objects of the same form as comp, an example BayesMixtureModel with three components could be initialized by

>>> bm = mixture.BayesMixtureModel(3,[0.4,0.3,0.3],[comp1, comp2, comp],prior)





and MAP parameter estimation can be performed by

>>> bm.mapEM(data, 40, 0.1)





Context-specific independence mixtures

The central idea of the CSI extension is to adapt the model complexity, i.e. the number of parameters in the model, to the degree of variability found in the data. This is done by allowing groups of components to share parameters feature-wise. PyMix implements CSI structure learning for naive Bayes mixtures in the Bayesian mixture framework described in the previous section.

In the following we are going to learn a CSI mixture from a data set sampled from a model with known parameters and structure. By comparing the learned and generating model, we can assess the performance of the structure learning.

Construct model for sampling

The first step is to construct the generating model. In the example we use a three component mixture of naive Bayes models over three univariate Gaussians. For simplicity, the standard deviation is set to 0.5 for all the distributions.

>>> n11 = mixture.NormalDistribution(0.0,0.5)
>>> n12 = mixture.NormalDistribution(2.0,0.5)
>>> n13 = mixture.NormalDistribution(-3.0,0.5)
>>> comp1 = mixture.ProductDistribution([n11,n12,n13])
>>> n21 = mixture.NormalDistribution(2.0,0.5)
>>> n22 = mixture.NormalDistribution(2.0,0.5)
>>> n23 = mixture.NormalDistribution(-3.0,0.5)
>>> comp2 = mixture.ProductDistribution([n21,n22,n23])
>>> n31 = mixture.NormalDistribution(4.0,0.5)
>>> n32 = mixture.NormalDistribution(4.0,0.5)
>>> n33 = mixture.NormalDistribution(-3.0,0.5)
>>> comp3 = mixture.ProductDistribution([n31,n32,n33])





It can be seen that the parameters of the naive Bayes components are not all distinct. Only for the first feature do all three components have different means. On the other hand, the parameters of the third Gaussian are identical in all three components (mean -3.0). Similarly, comp1 and comp2 have the same mean in the second Gaussian. This sharing of parameters amounts to an implicit CSI structure in the model. Our aim in the following is to sample a data set from this model and learn this CSI structure explicitly.

In order to construct the sampling model, we first build the parameter prior as described in the previous section. The conjugate prior for the Normal distribution is the Normal-Inverse-Gamma prior. The choices of hyper-parameters in the prior determine the impact of the prior and the strength of penalization for model complexity in the structure learning. Choosing these hyper-parameters is not straightforward and application dependent. In this example, we are going to employ heuristics to reset the hyper-parameters based on the data (next section). For the initialization we use dummy values of one.

>>> pipr = mixture.DirichletPrior(3,[1.0]*3)
>>> sp1 = mixture.NormalGammaPrior(1.0,1.0,1.0,1.0)
>>> sp2 = mixture.NormalGammaPrior(1.0,1.0,1.0,1.0)
>>> sp3 = mixture.NormalGammaPrior(1.0,1.0,1.0,1.0)
>>> prior = mixture.MixtureModelPrior(1.0,1.0, pipr,[sp1,sp2,sp3])





Now, the BayesMixtureModel can be built

>>> bm = mixture.BayesMixtureModel(3,[0.2,0.3,0.5],[comp1, comp2, comp3],prior, struct=1)





and the data set sampled

>>> data = bm.sampleDataSet(500)





Resetting parameters and hyper-parameters

In order to re-learn the parameters the data was generated from, we have to randomize the model parameters and update the hyper-parameters by using heuristics based on the data. The latter is done by

>>> sp1.setParams(data.getInternalFeature(0),3)
>>> sp2.setParams(data.getInternalFeature(1),3)
>>> sp3.setParams(data.getInternalFeature(2),3)
>>> prior.structPriorHeuristic(0.05, data.N)





For the former, we simply do

>>> bm.modelInitialization(data)





Learning the CSI structure

Now we can re-learn model parameters and structure by

>>> bm.bayesStructureEM(data,5,5,40,0.1)
...
Structural EM ( 5 runs over 5 random inits each):
logp: [-1711.6502700392894, ..., -1731.6034477963904]
Average logp: -1724.38225234 SD: 10.4878423071
Best logp: -1711.65025535





The resulting model can be inspected with

>>> bm.printStructure()
Feature 0: 0
Group 0: (0) Normal: [-0.0229761658954, 0.440506814236]
Group 1: (1) Normal: [1.95830489897, 0.521879559167]
Group 2: (2) Normal: [3.97626431379, 0.506150083619]
Feature 1: 1
Group 0: (0, 1) Normal: [1.985312505, 0.464407150932]
Group 1: (2) Normal: [4.03515788391, 0.499133189533]
Feature 2: 2
Group 0: (0, 1, 2) Normal: [-3.01707559568, 0.496683455398]





It can be seen that the learned parameters closely match the generating model and that the CSI structure was successfully learned.

Dependence tree mixtures

The naive Bayes model assumes independence between the features; this can be problematic for data sets where dependencies are present. On the other hand, the multivariate Gaussian distribution includes a full covariance structure. However, attempting to learn the covariance matrix for data sets with more than a few dimensions is computationally expensive and, more to the point, carries a high risk of overfitting.

The dependence tree (or conditional Gauss) distribution offers a compromise between the two extremes by learning the dependencies in the data as a tree structure. The tree is encoded as a dictionary of parent indices. For instance, for a three-dimensional data set a dependence structure could be defined as

>>> tree = {}
>>> tree[0] = -1
>>> tree[1] = 0
>>> tree[2] = 1





The -1 denotes the root of the tree. It can be seen that feature 1 is dependent on feature 0 and feature 2 on feature 1.

Another example would be

>>> tree2 = {}
>>> tree2[0] = -1
>>> tree2[1] = 0
>>> tree2[2] = 0





Here, both features 1 and 2 are dependent on the root feature 0.
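For intuition, such a tree factorizes the joint density into one conditional density per feature, each depending only on its parent. A numpy sketch of a tree-factorized Gaussian log-density (the linear-in-parent parametrization used here is an illustrative assumption, not necessarily PyMix's exact one):

```python
import numpy as np

def normal_logpdf(x, mean, sd):
    """Log-density of a univariate Gaussian at x."""
    return -0.5 * ((x - mean) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

def tree_logdensity(x, tree, mu, coef, sd):
    """Log-density of one sample under a tree-factorized Gaussian.

    Each non-root feature i is Gaussian with mean mu[i] + coef[i] * x[parent];
    the root (parent index -1) uses mu[i] alone. coef[root] is ignored.
    """
    logp = 0.0
    for i, parent in tree.items():
        m = mu[i] if parent == -1 else mu[i] + coef[i] * x[parent]
        logp += normal_logpdf(x[i], m, sd[i])
    return logp

# evaluate at the means, where only the normalization terms remain
lp = tree_logdensity([0.0, 0.0, 0.0], {0: -1, 1: 0, 2: 0},
                     [0.0, 0.0, 0.0], [0.0, 0.1, 0.1], [0.5, 0.5, 0.5])
```

The sum over the tree replaces the single multivariate normalization term of a full-covariance Gaussian, which is exactly what keeps the parameter count linear in the number of features.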

A mixture of conditional Gaussian distributions using these trees as dependence structures can be obtained by

>>> n1 = mixture.ConditionalGaussDistribution(3,[0, 1, 0], [0, -0.1, 0.1], [0.5,0.5,0.5],tree)
>>> n2 = mixture.ConditionalGaussDistribution(3,[-1, 0, 1], [0, 0.1, -0.1], [0.5,0.5,0.5],tree2)





The first argument is the number of dimensions, the second the vector of mean parameters, the third the standard deviations (diagonal entries of the covariance matrices), the fourth the covariances (off-diagonal entries of the covariance matrix) and the fifth and last the dependency structure.

After defining the ConditionalGaussDistribution objects, a mixture can be built and used as usual

>>> mixture.MixtureModel(2,[0.4,0.6],[n1,n2])





Semi-supervised learning with mixtures

The semi-supervised framework allows for the integration of prior knowledge on relations between samples into the clustering. The relations can either be positive (i.e. two samples should be in the same cluster) or negative (i.e. two samples should be in different clusters).

PyMix implements two variants for including such prior information. Prior information can be included either in the form of hard labels or soft pairwise-constraints between data points.

Labeled samples

In this variant, the cluster membership of a number of samples is assumed to be known a priori. The semi-supervised framework uses the ConstrainedDataSet object to hold the data and the prior knowledge.

Assuming we have data in a list l, a ConstrainedDataSet is constructed by

>>> data = mixture.ConstrainedDataSet()
>>> data.fromList(l)





The labels are given as a list of lists of sample indices belonging to each component. For instance

>>> labels = [ range(30,40), range(40,50) ]
>>> data.setConstrainedLabels(labels)




would assign samples 30-39 to component 0 and samples 40-49 to component 1.

After assigning the labels, a LabeledMixtureModel can be constructed and used in the usual manner, e.g.

>>> n1 = mixture.NormalDistribution(4.5,1.5)
>>> n2 = mixture.NormalDistribution(8.0,1.8)
>>> mpi = [0.5, 0.5]
>>> lm = mixture.LabeledMixtureModel(2,mpi,[n1,n2])
>>> lm.EM(data,10,0.1)





This will train a model with the assignment of the labeled samples fixed according to the labels.

Pairwise constraints

A more sophisticated formulation of prior knowledge are soft pairwise constraints on the cluster membership of samples. The constraints can be used to express a preference for clustering solutions consistent with the constraints. The strength of the preference is determined by penalty parameters for constraint violations.

Again, assuming a Python list l holding a 10 sample data set, we construct ConstrainedDataSet as previously

>>> data = mixture.ConstrainedDataSet()
>>> data.fromList(l)





The next step is to initialize the constraint matrices. These matrices are quadratic, symmetric, binary matrices where an entry of one denotes a constraint between the samples with the corresponding indices.

For instance the matrices

>>> pos_constr = numpy.zeros((10,10), dtype='Float64')
>>> pos_constr[0,2] = 1.0
>>> pos_constr[2,0] = 1.0
>>> neg_constr = numpy.zeros((10,10), dtype='Float64')
>>> neg_constr[0,6] = 1.0
>>> neg_constr[6,0] = 1.0
>>> neg_constr[2,6] = 1.0
>>> neg_constr[6,2] = 1.0





denote a positive constraint between samples 0 and 2 and negative constraints between samples (0,6) and (2,6).
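Since every constraint has to be entered symmetrically, a small helper avoids forgetting one of the two entries. This is a convenience sketch in plain numpy, not part of PyMix:

```python
import numpy as np

def add_constraint(mat, i, j):
    """Set a symmetric pairwise constraint between samples i and j."""
    mat[i, j] = 1.0
    mat[j, i] = 1.0

# same constraints as above, built via the helper
pos_constr = np.zeros((10, 10))
add_constraint(pos_constr, 0, 2)

neg_constr = np.zeros((10, 10))
for i, j in [(0, 6), (2, 6)]:
    add_constraint(neg_constr, i, j)
```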

The constraints are set into the ConstrainedDataSet by

>>> data.setPairwiseConstraints(pos_constr,neg_constr)





The ConstrainedMixtureModel is constructed in the usual manner

>>> n1 = mixture.NormalDistribution(4.5,1.5)
>>> n2 = mixture.NormalDistribution(6.6,1.8)
>>> mpi = [0.5, 0.5]
>>> cm = mixture.ConstrainedMixtureModel(2,mpi,[n1,n2])





When running parameter estimation, the EM takes as additional parameters the penalty for constraint violations, the posterior matrix based on the current model parameters as well as a flag denoting which type of constraints are in use.

>>> p = cm.modelInitialization(data)
>>> cm.EM(data,100,0.1,100,100,p,3)



