Nonparametric Mixture Modeling on Constrained Spaces

2019-08-16T14:15:18Z (GMT) by Putu Ayu G Sudyanti
Mixture modeling is a classical unsupervised learning method with applications to clustering and density estimation. This dissertation studies two challenges in modeling data with mixture models. The first part addresses problems that arise when modeling observations lying on constrained spaces, such as the boundaries of a city or a landmass. It is often desirable to model such data through the use of mixture models, especially nonparametric mixture models. Specifying the component distributions and evaluating normalization constants raise modeling and computational challenges. In particular, the likelihood forms an intractable quantity, and Bayesian inference over the parameters of these models results in posterior distributions that are doubly-intractable. We address this problem via a model based on rejection sampling and an algorithm based on data augmentation. Our approach is to specify such models as restrictions of standard, unconstrained distributions to the constraint set, with measurements from the model simulated by a rejection sampling algorithm. Posterior inference proceeds by Markov chain Monte Carlo, first imputing the rejected samples given mixture parameters and then resampling parameters given all samples. We study two modeling approaches: mixtures of truncated Gaussians and truncated mixtures of Gaussians, along with Markov chain Monte Carlo sampling algorithms for both. We also discuss variations of the models, as well as approximations to improve mixing, reduce computational cost, and lower variance.

The second part of this dissertation explores the application of mixture models to estimate contamination rates in matched tumor and normal samples. Bulk sequencing of tumor samples are prone to contaminations from normal cells, which lead to difficulties and inaccuracies in determining the mutational landscape of the cancer genome. In such instances, a matched normal sample from the same patient can be used to act as a control for germline mutations. Probabilistic models are popularly used in this context due to their flexibility. We propose a hierarchical Bayesian model to denoise the contamination in such data and detect somatic mutations in tumor cell populations. We explore the use of a Dirichlet prior on the contamination level and extend this to a framework of Dirichlet processes. We discuss MCMC schemes to sample from the joint posterior distribution and evaluate its performance on both synthetic experiments and publicly available data.