On the Interplay Between Statistical Concepts and Computational Models in Omics Applications
thesis posted on 06.12.2019 by Emery T. Goossens
Technological advancements have led to the generation of enormous amounts of data. In order to capitalize on this trend, however, both computational and statistical challenges must be tackled. While computational efficiency is important, interpretability of models and algorithms is essential to ensuring the validity of any conclusions drawn. Nowhere is this clearer than in the case of biomedical data, where inferences drawn from large datasets are used to inform future directions of research, diagnose diseases, and generate leads for the development of new pharmaceuticals. This work examines the interplay between statistical concepts and computational models in three applications: quantifying protein expression in fluorescent images, classifying somatic mutations in cancer, and combining p-values computed from genomic summary statistics. Across these applications, there are three recurring themes: accounting for technical and biological variation in data processing, evaluating the performance of a model in its end use case, and integrating results with outside data. Within these applications and themes, many statistical concepts are employed, including Bayes' theorem and type I error rate control, alongside computational models such as convolutional neural networks and Monte Carlo sampling algorithms. The results of these investigations inform much broader application areas such as biomedical imaging, modeling genomic sequences, and hypothesis testing in high dimensions. Specific contributions in the application of convolutional neural networks include demonstrating their ability to replicate the quantification of protein expression images from various manually generated or deterministic label sets, as well as the creation of a modeling framework for sequencing-based cancer diagnostics and the prioritization of unvalidated somatic mutations.
In the area of hypothesis testing, novel algorithms are proposed that enable the use of a powerful and interpretable technique of combining p-values in the large-scale setting of genome-wide association studies.