Model-Based High-Dimensional Network Inference: Theory & Methods

2019-01-03T18:57:43Z (GMT) by Min Ren
Over the past several decades, high-throughput biotechnologies for genomic studies have provided appealing opportunities to understand the complex gene interactions inside biological systems, attracting much research on constructing gene regulatory networks (GRNs). Motivated by the promise of genetical genomics studies, our research group has recently focused on representing gene regulatory networks using structural equation models and further revealing system-wide gene regulation. This dissertation presents two recent works along this direction.

First, we conduct a thorough theoretical analysis of the recently proposed Two-Stage Penalized Least Squares (2SPLS) method for constructing large systems of structural equation models. We establish estimation and prediction error bounds for the results at both stages of 2SPLS, as well as its variable selection consistency. Specifically, a bounded eigenvalue assumption is imposed to ensure the consistency properties of the l2-penalized regressions at the first stage. At the second stage, the estimation and variable selection consistency of the l1-penalized regressions are obtained by assuming a restricted eigenvalue condition and a variant of the irrepresentable condition, both of which are commonly employed in the current literature. We show that the 2SPLS estimator works not only for fixed dimensions but also for diverging dimensions, which can grow to infinity with the sample size at an appropriate rate.
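The two-stage structure described above can be illustrated with a minimal NumPy sketch. It is not the dissertation's implementation: the function names, the `anchors` argument (the exogenous columns assumed to act directly on each node), the fixed tuning parameters, the plain lasso solved by coordinate descent, and the projection step that removes each node's own exogenous variables before the l1 fit are all assumptions made for illustration.

```python
import numpy as np

def ridge_fitted(X, y, lam):
    """Stage 1: l2-penalized regression of y on X; returns fitted values."""
    q = X.shape[1]
    coef = np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ y)
    return X @ coef

def residualize(A, W):
    """Remove from A its least-squares projection onto the columns of W."""
    coef, *_ = np.linalg.lstsq(W, A, rcond=None)
    return A - W @ coef

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(Z, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - Z b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = Z.shape
    b = np.zeros(p)
    col_sq = (Z ** 2).sum(axis=0) / n
    r = y.copy()                        # current residual y - Z b
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += Z[:, j] * b[j]         # drop coordinate j from the fit
            rho = Z[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= Z[:, j] * b[j]         # add the updated coordinate back
    return b

def two_stage_sketch(Y, X, anchors, lam_ridge, lam_lasso):
    """Return B_hat, where B_hat[j, k] is the estimated effect of node j on node k.

    Y: (n, p) endogenous variables; X: (n, q) exogenous variables;
    anchors[k]: list of columns of X assumed to act directly on node k.
    """
    n, p = Y.shape
    # Stage 1: predict every endogenous variable from the exogenous ones.
    Y_hat = np.column_stack(
        [ridge_fitted(X, Y[:, j], lam_ridge) for j in range(p)]
    )
    B_hat = np.zeros((p, p))
    for k in range(p):
        others = [j for j in range(p) if j != k]
        W = X[:, anchors[k]]
        # Stage 2: l1-penalized regression of node k on the stage-1
        # predictions of the other nodes, after projecting out W.
        y_k = residualize(Y[:, k], W)
        Z_k = residualize(Y_hat[:, others], W)
        B_hat[others, k] = lasso_cd(Z_k, y_k, lam_lasso)
    return B_hat
```

Each node's second-stage regression depends only on the shared stage-1 predictions, so the loop over `k` could be run in parallel. In practice the tuning parameters would be chosen by cross-validation or a similar criterion rather than fixed in advance.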

Second, we develop a novel statistical method to identify structural differences between two cognate networks characterized by structural equation models. We propose to reparameterize the model to separate the differential structures from the common structures, and then design an algorithm with calibration and construction stages to identify these differential structures directly. The calibration stage obtains consistent predictions by fitting an l2-regularized regression of each endogenous variable against pre-screened exogenous variables, correcting for potential endogeneity issues. The construction stage consistently selects and estimates both common and differential effects by fitting an l1-regularized regression of each endogenous variable against the predictions of the other endogenous variables as well as its anchoring exogenous variables. Our method allows for easy parallel computation. Theoretical results establish non-asymptotic error bounds for the predictions and estimates at both stages. Our studies on simulated data demonstrate that the proposed method performs much better than constructing the two networks independently. A real data set is analyzed to illustrate the applicability of our method.
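One way to picture the reparameterization that separates common from differential structures is the sketch below. The abstract does not spell out the parameterization, so the half-sum and half-difference form used here, along with all function names, is an assumption for illustration only. The point it shows is that a single stacked regression can carry the common and differential effects as separate coefficients, so a sparsity penalty can act on the differences directly.

```python
import numpy as np

def to_common_diff(beta1, beta2):
    """Map per-network effects to (common, differential) effects (assumed form)."""
    return (beta1 + beta2) / 2.0, (beta1 - beta2) / 2.0

def from_common_diff(common, diff):
    """Inverse map: recover the effects in network 1 and network 2."""
    return common + diff, common - diff

def stack_design(Z1, Z2):
    """Stacked design whose coefficient vector is [common; differential].

    Z1: (n1, p) regressors from network 1; Z2: (n2, p) from network 2.
    Rows from network 1 see common + diff, rows from network 2 see common - diff.
    """
    return np.block([[Z1, Z1], [Z2, -Z2]])
```

With this layout, `stack_design(Z1, Z2) @ np.concatenate([common, diff])` reproduces the two per-network linear predictors stacked on top of each other, and an effect shared by both networks gives a differential coefficient of exactly zero.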