Computational Methods for Population Genetics

2019-10-16T19:00:49Z (GMT) by Aritra Bose
The field of population genetics has seen unprecedented growth, driven by advances in sequencing technologies that have produced massive datasets. As a result, efficient computational methods backed by theoretical foundations are required to analyze and understand the intricate details of complex biological processes captured in the genetic code. To this end, we developed novel computational tools to address issues related to population structure, scalability of methods, models of evolution, and disease association.

The history of a population, in light of genomics, is reconstructed through a series of settlements, migrations, adaptations, demographic expansions, mixture, and similar events. To better understand one such theory of migration, concerning the Peloponnesean Greeks, we analyzed their sub-structure and disproved the theory that they were replaced by the Slavs in the medieval period. Ecological and environmental factors such as society, language, and geographical barriers, among others, can influence gene flow in populations, resulting in complex structure. We developed a computational framework called COGG (Correlation Optimization of Genetics and Geodemographics), which studies how these demographic factors contribute to shaping the genetic sub-structure of the Indian subcontinent.

Principal Component Analysis (PCA) has had a profound impact on the study of population structure, and a significant challenge is to build scalable software that implements PCA on tera-scale data. To address this issue, we built TeraPCA, an out-of-core, multi-threaded C++ implementation of the Randomized Subspace Iteration method, which provides a fast and accurate alternative to current state-of-the-art packages.
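To make the idea of randomized subspace iteration concrete, the following is a minimal in-memory NumPy sketch of the general technique. It is not TeraPCA's code: TeraPCA is an out-of-core, multi-threaded C++ implementation, whereas this illustration assumes the whole matrix fits in memory. The function name and the parameters `oversample` and `n_iter` are assumptions chosen for illustration.

```python
import numpy as np

def randomized_subspace_iteration(A, k, oversample=10, n_iter=5, seed=0):
    """Approximate the top-k singular triplets of A (individuals x markers).

    Illustrative sketch only; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Work with a slightly larger subspace than k to improve accuracy.
    ell = min(k + oversample, n_rows, n_cols)
    # Random Gaussian starting block, mapped into the column space of A.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n_cols, ell)))
    # Alternate multiplications by A^T and A, re-orthonormalizing each time
    # to keep the basis numerically stable.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)
    # Project A onto the small subspace and take an exact SVD there.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]

if __name__ == "__main__":
    # Synthetic genotypes (0/1/2 allele counts) for two hypothetical populations.
    rng = np.random.default_rng(42)
    freqs = np.vstack([rng.uniform(0.1, 0.9, 300), rng.uniform(0.1, 0.9, 300)])
    labels = np.repeat([0, 1], 50)
    G = rng.binomial(2, freqs[labels]).astype(float)
    G -= G.mean(axis=0)  # center each marker before PCA
    U, s, Vt = randomized_subspace_iteration(G, k=2)
    print("leading singular values:", s)
```

The rows of `U` scaled by `s` give the principal component scores of the individuals. The only operations that touch the full matrix are products with `A` and `A.T`, which is what makes this family of methods a natural fit for block-wise, out-of-core processing.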

Stochastic models of evolution provide a better abstraction of complex evolutionary processes by simulating generations of random populations, and they provide foundations for analyzing genetic variation among species. We developed the first algorithm that models selection at multiple loci with interacting polymorphic sites, in a package called sSimRA. We also provide the first comparison between backward and forward simulators that model the effect of natural selection at multiple loci.
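As a rough illustration of what a forward simulator with interacting selected sites does, the sketch below runs a haploid two-locus Wright-Fisher model in which fitness includes an interaction (epistasis) term. It is a toy under stated assumptions, not the sSimRA algorithm: the fitness form, the single recombination step, and all function names and parameter values are hypothetical.

```python
import random

def fitness(hap, s=(0.05, 0.05), epistasis=0.1):
    """Fitness of a haplotype (a, b) with alleles coded 0/1.

    Assumed form: additive effects plus an interaction term that applies
    only when both derived alleles are present.
    """
    a, b = hap
    return 1.0 + s[0] * a + s[1] * b + epistasis * a * b

def allele_frequencies(pop):
    """Frequency of the derived allele at each of the two loci."""
    n = len(pop)
    return (sum(h[0] for h in pop) / n, sum(h[1] for h in pop) / n)

def next_generation(pop, rng, s, epistasis, recomb):
    """Sample parents proportionally to fitness, then recombine."""
    weights = [fitness(h, s, epistasis) for h in pop]
    n = len(pop)
    first = rng.choices(pop, weights=weights, k=n)
    second = rng.choices(pop, weights=weights, k=n)
    children = []
    for p1, p2 in zip(first, second):
        if rng.random() < recomb:
            # Crossover between the loci: locus 1 from p1, locus 2 from p2.
            children.append((p1[0], p2[1]))
        else:
            children.append(p1)
    return children

def simulate(n=500, generations=100, p0=0.5, s=(0.05, 0.05),
             epistasis=0.1, recomb=0.1, seed=0):
    """Run the forward simulation; return the final population and the
    per-generation allele frequencies (including the initial state)."""
    rng = random.Random(seed)
    pop = [(int(rng.random() < p0), int(rng.random() < p0)) for _ in range(n)]
    history = [allele_frequencies(pop)]
    for _ in range(generations):
        pop = next_generation(pop, rng, s, epistasis, recomb)
        history.append(allele_frequencies(pop))
    return pop, history

if __name__ == "__main__":
    _, history = simulate()
    print("initial frequencies:", history[0])
    print("final frequencies:  ", history[-1])
```

A forward simulator like this tracks the whole population generation by generation, which makes selection straightforward to express. A backward (coalescent-style) simulator instead traces the ancestry of a sample back in time, which is typically cheaper but makes selection harder to model.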

Genome-Wide Association Studies (GWAS) are used ubiquitously to detect biomarkers that predict disorder traits and to estimate the heritability underlying phenotypic variation. One of the main challenges in GWAS is correcting for population structure in order to find the true positives. To address this confounding, we developed CluStrat, a structure-informed, clustering-based tool that outperforms the standard PCA-based stratification correction approaches. CluStrat corrects for complex population structure by performing agglomerative hierarchical clustering on the linkage disequilibrium (LD) induced distances between individuals, as captured in the Mahalanobis-distance-based Genetic Relationship Matrix (GRM). We further use CluStrat to outline a comprehensive guide to stratification and subsequent disorder trait prediction or estimation, utilizing the underlying LD structure of the genotypes.