The figure above shows DPMM clustering results for Gaussian data (left) and categorical data (right). On the left, the ellipses (samples from the posterior mixture distribution) show the DPMM after 100 Gibbs sampling iterations. Initialized with 2 clusters and a concentration parameter alpha of 1, the DPMM learned the true number of clusters, K=5, and concentrated around the cluster centers.

The right panel shows clustering results for categorical data: here, a DPMM was applied to a collection of NIPS articles. It was initialized with 2 clusters and a concentration parameter alpha of 10. After several Gibbs sampling iterations, it discovered over 20 clusters, the first 4 of which are shown in the figure. The words within each cluster are semantically related, while the topics differ across clusters.
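To build some intuition for why a larger concentration parameter tends to produce more clusters, here is a minimal sketch that draws cluster assignments from the Chinese restaurant process, the prior over partitions induced by a DP. This is not the Gibbs sampler used for the figure; the function name `crp_table_sizes` and the choice of 1000 points are assumptions made for illustration only.

```python
import random

def crp_table_sizes(n_points, alpha, rng):
    """Seat n_points customers by the Chinese restaurant process.

    Returns the list of table (cluster) sizes.
    """
    sizes = []
    for i in range(n_points):
        # Customer i joins table k with probability sizes[k] / (i + alpha)
        # or opens a new table with probability alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        cumulative = 0.0
        for k, size in enumerate(sizes):
            cumulative += size
            if u < cumulative:
                sizes[k] += 1
                break
        else:
            # No existing table was chosen, so open a new one.
            sizes.append(1)
    return sizes

rng = random.Random(0)
sizes_alpha_1 = crp_table_sizes(1000, alpha=1.0, rng=rng)
sizes_alpha_10 = crp_table_sizes(1000, alpha=10.0, rng=rng)
print("clusters with alpha=1: ", len(sizes_alpha_1))
print("clusters with alpha=10:", len(sizes_alpha_10))
```

Under the prior alone, the expected number of clusters grows roughly like alpha * log(n), so alpha=10 should yield noticeably more clusters than alpha=1. In a DPMM the likelihood then prunes or merges clusters that the data do not support.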

Hierarchical Dirichlet process (HDP)

The hierarchical Dirichlet process (HDP) is an extension of the DP that models grouped data, especially when the groups share features. The power of hierarchical models comes from the assumption that the features of the groups are drawn from a shared distribution rather than being completely independent. Thus, with hierarchical models we can learn features that are common to all groups in addition to the individual group parameters.

In HDP, each observation within a group is a draw from a mixture model and mixture components are shared between groups. In each group, the number of components is learned from data using a DP prior. The HDP graphical model is summarized in the figure below [5]:

Focusing on the HDP formulation in the figure on the right, we have J groups, each sampled from a DP: Gj ~ DP(alpha, G0). G0 represents the parameters shared across all groups and is itself modeled as a DP: G0 ~ DP(gamma, H). Thus, we have a hierarchical structure for describing our data.
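The two-level structure can be made concrete with a truncated sketch of the generative process. The global weights of G0 are drawn by stick-breaking with concentration gamma, and each group's weights over the same shared atoms are drawn from a Dirichlet with parameters alpha * beta, which is the finite-dimensional form of Gj ~ DP(alpha, G0) on a truncated set of atoms. The truncation level, the Gaussian base measure H, and all function names below are assumptions chosen for illustration.

```python
import random

def stick_breaking(concentration, truncation, rng):
    """Truncated GEM(concentration) weights via stick-breaking."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # last stick absorbs the leftover mass
    return weights

def dirichlet(params, rng):
    """Dirichlet draw obtained by normalizing Gamma variates."""
    # The small floor guards against a zero shape parameter from underflow.
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
gamma, alpha = 1.0, 1.0          # top-level and group-level concentrations
truncation, n_groups = 10, 3     # assumed truncation level and number of groups

beta = stick_breaking(gamma, truncation, rng)                # weights of G0
atoms = [rng.gauss(0.0, 10.0) for _ in range(truncation)]    # phi_k ~ H, assumed N(0, 10^2)
group_weights = [dirichlet([alpha * b for b in beta], rng)   # weights of each Gj
                 for _ in range(n_groups)]

for j, pi_j in enumerate(group_weights):
    # Every group mixes over the same atoms, but with its own weights.
    components = rng.choices(range(truncation), weights=pi_j, k=5)
    samples = [round(rng.gauss(atoms[k], 1.0), 2) for k in components]
    print(f"group {j}: components {components}, samples {samples}")
```

Because all groups reuse the same list `atoms`, a component discovered in one group is available to every other group, which is the sharing property that motivates the HDP.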

There exist many ways to infer the parameters of hierarchical Dirichlet processes. One popular approach that works well in practice and is widely used in the topic modelling community is the online variational inference algorithm [6] implemented in gensim.

The figure above shows the first four topics (as word clouds) from an online variational HDP algorithm used to fit a topic model on the 20newsgroups dataset. The dataset consists of 11,314 documents and over 100K unique tokens. Standard text pre-processing was used, including tokenization, stop-word removal, and stemming. A compressed dictionary of 4K words was constructed by filtering out tokens that appear in fewer than 5 documents or in more than 50% of the corpus.
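The document-frequency filter described above can be sketched in a few lines of plain Python. This is an illustration rather than the code used for the experiment: the function name `build_vocabulary` and the rule of keeping the most frequent surviving tokens up to `max_size` are assumptions, and the toy corpus is made up.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_docs=5, max_doc_fraction=0.5, max_size=4000):
    """Keep tokens that are neither too rare nor too common across documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token at most once per document
    n_docs = len(tokenized_docs)
    kept = [
        (token, df) for token, df in doc_freq.items()
        if df >= min_docs and df <= max_doc_fraction * n_docs
    ]
    # Most frequent tokens first, ties broken alphabetically; cap the size.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [token for token, _ in kept[:max_size]]

# Toy corpus of 10 already-tokenized documents.
docs = (
    [["the", "engine", "car"]] * 5
    + [["the", "senate", "vote"]] * 4
    + [["the", "rare"]]
)
vocab = build_vocabulary(docs)
print(vocab)  # "the" is too common; "senate", "vote", "rare" are too rare
```

In this toy corpus only "car" and "engine" survive, since each appears in exactly 5 of the 10 documents.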

The top-level truncation was set to T=20 topics and the second-level truncation was set to K=8 topics. The concentration parameters were chosen as gamma=1.0 at the top level and alpha=0.1 at the group level to yield a broad range of shared topics that are concentrated at the group level. We can find topics about autos, politics, and for-sale items that correspond to the target labels of the 20newsgroups dataset.

HDP hidden Markov models

The hierarchical Dirichlet process (HDP) can be used to define a prior distribution on transition matrices over countably infinite state spaces. The resulting HDP-HMM is also known as the infinite hidden Markov model, in which the number of states is inferred automatically. The graphical model for the HDP-HMM is shown below:

In a nonparametric extension of the HMM, we consider a set of DPs, one for each value of the current state. In addition, the DPs must be linked because we want the same set of next states to be reachable from each of the current states. This relates directly to the HDP, where the atoms associated with the state-conditional DPs are shared.

The HDP-HMM parameters can be described as follows:
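In the standard formulation, as presented for example in [7], the model can be written as below. The symbols phi_k (emission parameters of state k), F (emission family), s_t (hidden state at time t), and y_t (observation at time t) are notation assumed here for illustration; gamma, alpha, and H play the same roles as in the HDP above.

```latex
\begin{aligned}
\beta \mid \gamma &\sim \mathrm{GEM}(\gamma) \\
\pi_k \mid \alpha, \beta &\sim \mathrm{DP}(\alpha, \beta), \quad k = 1, 2, \ldots \\
\phi_k \mid H &\sim H, \quad k = 1, 2, \ldots \\
s_t \mid s_{t-1} &\sim \mathrm{Multinomial}(\pi_{s_{t-1}}) \\
y_t \mid s_t &\sim F(\phi_{s_t})
\end{aligned}
```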

Here, GEM denotes the stick-breaking construction of the shared state weights. One popular algorithm for computing the posterior distribution of infinite HMMs is beam sampling, described in [7].
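To see how shared stick-breaking weights tie the rows of the transition matrix together, here is a minimal sketch that samples a state sequence from a truncated HDP-HMM prior. It does not implement beam sampling or any posterior inference; the truncation level, the use of the global weights as the initial state distribution, and the function names are assumptions made for illustration.

```python
import random

def stick_breaking(concentration, truncation, rng):
    """Truncated GEM(concentration) weights via stick-breaking."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # last stick absorbs the leftover mass
    return weights

def dirichlet(params, rng):
    """Dirichlet draw obtained by normalizing Gamma variates."""
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_state_sequence(gamma, alpha, truncation, length, rng):
    """Draw shared weights, a transition matrix, and a hidden state path."""
    beta = stick_breaking(gamma, truncation, rng)
    # Each row pi_k is centered on the same beta, so all rows favor
    # the same set of next states.
    transition = [dirichlet([alpha * b for b in beta], rng)
                  for _ in range(truncation)]
    # Assumption: the initial state is drawn from the global weights.
    states = rng.choices(range(truncation), weights=beta, k=1)
    for _ in range(length - 1):
        row = transition[states[-1]]
        states.append(rng.choices(range(truncation), weights=row, k=1)[0])
    return beta, transition, states

rng = random.Random(0)
beta, transition, states = sample_state_sequence(
    gamma=1.0, alpha=5.0, truncation=15, length=200, rng=rng
)
print("distinct states visited:", len(set(states)))
```

Although 15 states are available under the truncation, a sampled path typically visits only a handful of them, which reflects how the prior lets the effective number of states be determined by the data.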

Dependent Dirichlet process (DDP)

In many applications, we are interested in modelling distributions that evolve over time as seen in temporal and spatial processes. The Dirichlet process assumes that observations are exchangeable and therefore the data points have no inherent ordering that influences their labelling. This assumption is invalid for modelling temporal and spatial processes in which the order of data points plays a critical role in creating meaningful clusters.

The dependent Dirichlet process (DDP), originally formulated by MacEachern, provides a nonparametric prior over evolving mixture models. A construction of the DDP built on the Poisson process [8] led to the development of the DDP mixture model as shown below:

In the graphical model above, we see a temporal extension of the DP in which the DP at time t depends on the DP at time t-1. This time-varying DP prior is capable of describing and generating dynamic clusters whose means and covariances change over time.

Conclusion

In Bayesian nonparametric models, the number of parameters grows with the data. This flexibility enables better modeling and generation of data. We focused on the Dirichlet process (DP) and key applications such as DP K-means (DP-means), Dirichlet process mixture models (DPMMs), hierarchical Dirichlet processes (HDPs) applied to topic models and HMMs, and dependent Dirichlet processes (DDPs) applied to time-varying mixtures.

We looked at how to construct nonparametric models using stick-breaking and examined some experimental results. To better understand Bayesian nonparametric models, I encourage you to read the literature listed in the references and to experiment with the code linked throughout the article on challenging datasets!

References

[1] B. Kulis and M. Jordan, “Revisiting k-means: New Algorithms via Bayesian Nonparametrics”, ICML, 2012.

[2] E. Sudderth, “Graphical Models for Visual Object Recognition and Tracking”, PhD Thesis (Chp 2.5), 2006.

[3] A. Rochford, Dirichlet process Mixture Model in PyMC3.

[4] J. Sethuraman, “A constructive definition of Dirichlet priors”, Statistica Sinica, 1994.

[5] Y. Teh, M. Jordan, M. Beal and D. Blei, “Hierarchical Dirichlet Processes”, JASA, 2006.

[6] C. Wang, J. Paisley and D. Blei, “Online Variational Inference for the Hierarchical Dirichlet Process”, JMLR, 2011.

[7] J. Van Gael, Y. Saatci, Y. Teh and Z. Ghahramani, “Beam Sampling for the infinite Hidden Markov Model”, ICML, 2008.

[8] D. Lin, W. Grimson and J. W. Fisher III, “Construction of Dependent Dirichlet processes based on compound Poisson processes”, NIPS, 2010.
