Comparisons of Statistical Modeling for Constructing Gene Regulatory Networks by Xiaohui Chen B.Sc., Zhejiang University, 2006 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Bioinformatics) The University Of British Columbia (Vancouver) August, 2008 © Xiaohui Chen 2008 Abstract Genetic regulatory networks are of great importance in terms of scien tific interests and practical medical importance. Since a number of high- throughput measurement devices are available, such as microarrays and sequencing techniques, regulatory networks have been intensively studied over the last decade. Based on these high-throughput data sets, statisti cal interpretations of these billions of bits are crucial for biologist to ex tract meaningful results. In this thesis, we compare a variety of existing regression models and apply them to construct regulatory networks which span trancription factors and microRNAs. We also propose an extended algorithm to address the local optimum issue in finding the Maximum A Posterjorj estimator. An E. coli mRNA expression microarray data set with known bona fide interactions is used to evaluate our models and we show that our regression networks with a properly chosen prior can perform com parably to the state-of-the-art regulatory network construction algorithm. Finally, we apply our models on a p53-related data set, NCI-60 data. By further incorporating available prior structural information from sequencing data, we identify several significantly enriched interactions with cell prolif eration function. In both of the two data sets, we select specific examples to show that many regulatory interactions can be confirmed by previous studies or functional enrichment analysis. Through comparing statistical models, we conclude from the project that combining different models with over-representation analysis and prior structural information can improve the quality of prediction and facilitate biological interpretation. Keywords: regulatory network, variable selection, penalized maximum likelihood estimation, optimization, functional enrichment analysis 11 Table of Contents Abstract ii Table of Contents iii List of Tables v List of Figures vii Acknowledgements x Dedication xi 1 Introduction and literature review 1 1.1 MiroRNA and Transcription factors 2 1.2 Previous work 2 1.3 Project motivations and goals 4 2 Models and algorithms 6 2.1 Model: Linear regression 6 2.2 Variable selection: penalized likelihood 6 2.3 Optimization: finding penalized ML estimator 8 2.3.1 EM algorithm 8 2.3.2 MM algorithm 10 2.3.3 Connections between algorithms 16 3 Results on comparing across models 19 3.1 Simulation studies 19 3.2 E.coli data set 22 111 Table of Contents 3.2.1 Precision-recall curves 24 3.2.2 Visualizing and analyzing inferred networks 28 4 Comparing models coupled with prior structural informa tion 41 4.1 NCI-60 Data preparation 41 4.2 Results 44 4.2.1 Prediction of p53 related microRNAs and genes . . . 44 4.2.2 Functional enrichment analysis 49 5 Conclusions and discussions 56 Bibliography 58 Appendices A Methods 65 A. 1 Penalties 65 A.1.1 Information criteria 65 A.1.2 Hard thresholding 65 A.1.3 £ penalty 65 A.1.4 SCAD penalty 67 A.1.5 Bayesian linear regression 67 B Proofs 70 B.1 Lemmas 70 B.2 Proof of Proposition A.1.1 70 B.3 Proof of Theorem 2.3.1 71 C Supplementary Materials 73 iv List of Tables 3.1 Mean errors for linear models, averaged over 100 simulations. MSE is the mean square error. C is the number of correctly estimated zeros and I is the number of incorrectly estimated zeros. Boldfaced methods are best results 20 3.2 Characteristics of 60% and 80% precise networks inferred from models with top perfermance in E. coli 25 3.3 Number of targets regulated by transcription factors in the 60% precise network with p 5 predicted targets with top performance algorithms, in E.coli network 28 4.1 MSE and estimated coefficients for the specific interactions predicted from NCI-60 data set. Bold numbers represent the estimated interactions agreeing with the literatures 46 4.2 Estimated coefficients for the specific interactions predicted from NCI-60 data set, compared between with and without prior information. Coefficients with absolute values less than 10—8 are thresheld to 0. Bold numbers represent the esti mated interactions agreeing with the literatures 48 4.3 Identified significant p53/miRNAs that target known cell pro liferation genes from Li prior 50 A. 1 Penalizations/log(prior) and their first order derivatives eval uated at I/I. Notation: qj = /()2 + 1/3112 69 A.2 Properties of sparsity promoting priors. Source: Francois Caron 69 v List of Tables C.l E. coli Transcription factors in the 60% precise network with p 5 predicted operon targets by CLR algorithm 74 C.2 E.coli Transcription factors in the 60% precise network with p> 5 predicted operon targets by Li prior with MM3 algorithm. 74 C.3 E.coli Transcription factors in the 60% precise network with p 5 predicted operon targets by SCAD prior with MM3 algorithm 75 C.4 E. coli Transcription factors in the 60% precise network with p 5 predicted operon targets by Li prior with EM algorithm. 75 C.5 E. coli Transcription factors in the 60% precise network with p 5 predicted operon targets by NIG prior with EM algorithm. 76 C.6 Identified significant p53/miRNAs that target known cell pro liferation genes from NG prior 81 C.7 Identified significant p53/miRNAs that target known cell pro liferation genes from NIG prior 82 C.8 Identified significant p53/miRNAs that target known cell pro liferation genes from NJ prior 83 C.9 Identified significant p53/miRNAs that target known cell pro liferation genes from SCAD prior 86 vi List of Figures 1.1 An illustration of coordinated gene expression processes. TFs activate a set of genes in nucleus (01-On). After genes being transcribed, the resulting mRNAs are bound by RNA-binding proteins (RBP), spliced and subsequently exported to the cy toplasm. RBPs and miRNAs can affect the stability of tran scripts, activate or repress their translations. Source: RNA regulons: coordination of post-transcriptional events. Keene, J. 2007 Nature Reviews Genetics. [34] 3 2.1 Penalty functions and their corresponding quadratic surro gate functions around origin. Solid lines are original penalty and the red dashed lines are the upper bound surrogate func tions of solid lines with 0 = 0.5 and e = 10—8 11 2.2 Penalty functions and their corresponding surrogate functions around origin. Dotted lines (in blue) are original penalty, solid lines are the proposed perturbed penalty function, and dashed lines (in red) are the surrogate functions of solid lines with Oo = 0.5 and = 10—8 14 2.3 Convergence of unperturbed penalized likelihood function. Assuming r = 1. (O) is the part regarding the change of unperturbed penalized likelihood. S is the part measur ing the goodness of approximation between the derivative of perturbed penalty and original penalty 16 3.1 Simulation comparisons, initialized with all zeros start. Sim ulation averaged over 100 times 21 vii List of Figures 3.2 Simulation comparisons, initialized with MLE. Simulation av eraged over 100 times 23 3.3 The real regulatory network of E.coli presented in a binary matrix M. A bright spot at is a regulatory relationship from TF j to gene i 24 3.4 Precision-recall curves for various models and algorithms on E.coli data set. MI is the raw MI method. Z is CLR algorithm. 26 3.5 Scatter plots of the fitted expression values of target genes against their residuals 27 3.6 80% precise network for CLR algorithm. Red lines are cor rectly inferred edges and blue lines are false positives. Grey nodes are TFs 31 3.7 80% precise network for Li penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 32 3.8 80% precise network for SCAD penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 33 3.9 80% precise network for LASSO penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 34 3.10 80% precise network for NIG penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 35 3.11 60% precise network for CLR algorithm. Red lines are cor rectly inferred edges and blue lines are false positives. Grey nodes are TFs 36 3.12 60% precise network for Li penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 37 3.13 60% precise network for SCAD penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 38 viii List of Figures 3.14 60% precise network for LASSO penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 39 3.15 60% precise network for NIG penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 40 4.1 Flowchart of learning regulatory network on NCI-60 data set. 43 4.2 Fisher’s exact test p-values with the Li prior. The blue line is the simulated uniform background p-values under null hy pothesis and horizontal line is the threshold at p-value at 0.05. 52 C.1 Fisher’s exact test p-values with the NG prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05 77 C.2 Fisher’s exact test p-values with the NIG prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05 78 C.3 Fisher’s exact test p-values with the NJ prior. The blue line is the simulated uniform background p-values under null hy pothesis and horizontal line is the threshold at p-value at 0.05. 79 C.4 Fisher’s exact test p-values with the SCAD prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05 80 ix Acknowledgements First and foremost, I owe many thanks to my two supervisors: Prof. Raphael Gottardo and Prof. Kevin Murphy. This work would be impossible without their continuous efforts to keep my statistical ball rolling. Their many valu able and enlightened suggestions not only helped me from constantly study ing new things, but also taught me how to be an independent researcher. Thanks to both of you for being great mentors, both professionally and personally. I am also grateful to Dr.Mauricio Neira, who actually was my third unofficial supervisor. Thank you Mauricio for helping me so many times with valuable support and guidance on post-modeling bioinformatics analysis. Your never ending feedback on my analysis is highly appreciated. Besides, I would like to both thank you for helping prepare part of the data. Last but not least, I have to say thanks to Prof. Arnaud Doucet and Dr. François Caron, both of whom instructed and helped me with the statistical modeling. x To my parents Rui Chen and Yuping Ou xi Chapter 1 Introduction and literature review Gene regulatory networks have been intensively studied over the last decade because of their inherent scientific and medical importance (e.g., in the fight against cancer) as well as the development of high-throughput measurement devices. The data generated by these high-throughput experiments typically contain: • Gene microarray expression data: Each microarray is used to capture the snapshot of the transcriptional status of cells, and one ex periment usually includes several microarrays to reflect the change of status of a biological system over time (may contain repetitive mea surements). Although mRNA is a precursor to its downstream protein product which directly executes biological functions, a large amount of previous regulatory network studies used mRNA microarray data as a proxy for the amount of protein products [24, 46, 64]. • Sequence information: The discovery and characterization of regu latory controlling sequences have been greatly facilitated by sequencing projects. Generally, identification of regulatory elements in the genome is mainly based on comparative sequence analysis and some Chromatin ImmunoPrecipitation (ChIP) binding experiment data [40]. There are many secondary curated databases which consist of reliable, verified regulations, but this approach is not scalable for quick identification of regulatory relationships de novo, such as comparative genomics meth ods and algorithms based on sequence complementarity. 1 Chapter 1. Introduction and literature review 1.1 MiroRNA and Transcription factors At the molecular level, microRNA (miRNAs) and transcription factors (TFs) are two important categories of regulators that control thousands of mam malian genes. TFs are the proteins that can bind to specific parts of DNA (usually one gene(s)) using DNA binding domains and therefore help initi ate/repress transcription, while miRNAs are small non-coding RNAs that regulate gene expression at the post-transcriptional level by affecting the sta bility and translational processes of gene transcripts. For TFs, they could ei ther increase or decrease gene transcription, but generally for miRNAs, they mediate post-transcriptional gene silencing through degradation of mRNAs or inhibition of protein production [34]. In principle, miRNAs could help to explain discrepancies between mRNA and protein levels. For example, those discrepancies may seriously complicate the use of mRNA profiles to study chemoresistance [10]. Unfortunately, there still continues to be a question as to how well transcript levels predict translated protein levels; a graphi cal illustration of TFs and miRNAs coordinated gene expression processes is shown in Figure 1.1. Recently, [52] comprehensively studied the global architecture and network local motifs of human miRNA-TF regulatory net works. Based on the prediction datasets using sequence complementarity and conservation, they revealed that the combinatorial property of miRNA interactions and miRNA-TF cooperation for targeting genes is fundamental for the precise and complex nature of regulatory systems. 1.2 Previous work Most of previous work on discovering regulator-target association is branched into two fields, corresponding to the two major types of data mentioned at the beginning of this chapter. The former field is based on mining corre lated gene transcription patterns from mRNA microarray data. This in cludes clustering [41], Bayesian networks [24, 33, 46] and linear regression models [17, 64]. The latter one focuses on using sequence complementar ity and phylogenetic conservation; for example, see [7, 35, 39, 58]. These 2 Chapter 1. Introduction and literature review Cytoplasm miRNA Figure 1.1: An illustration of coordinated gene expression processes. TFs activate a set of genes in nucleus (G1-Gn). After genes being transcribed, the resulting mRNAs are bound by RNA-binding proteins (RBP), spliced and subsequently exported to the cytoplasm. RBPs and miRNAs can affect the stability of transcripts, activate or repress their translations. Source: RNA regulons: coordination of post-transcriptional events. Keene, J. 2007 Nature Reviews Genetics. [34] 3 Chapter 1. Introduction and literature review sequence based methods, however, have limited specificity due to imperfect matches between TF-TF Binding Sites (TFBS) and the miRNA-target. In the miRNA regulating transcripts domain, [31] proposed a Bayesian linear model to integrate the evidence from both sequence and expression data. The inferred regulatory network is a subset of the prior network given by a sequence-based target-finding program. To the same end, in the TF regu lating gene domain, [17, 48] combined mRNA expression data with positive protein binding data from ChIP experiments to improve the predictive abil ity. Their method further classified ChIP positives into functional and non functional TF targets using linear regression models. Currently, there is not much work on combining TFs and miRNAs to construct regulatory networks based on both microarray expression data and sequence-based structural in formation. Since more variety and volume of data sources are available, this joint modeling is expected to become popular in the near future. 1.3 Project motivations and goals As discussed in previous sections, the regulatory systems seldom function through oniy a few mechanisms. Hence, building a comprehensive regula tory network spanning TFs and miRNAs at different molecular levels and integrating both microarray expression data and prior structural informa tion from sequence based predictions are essential. Moreover, due to the sparseness property of regulatory networks, we formulate the problem as a variable selection problem. The goals of the project are summarized as below: 1. Compare various statistical models and algorithms without integrating structural information. 2. By incorporating available prior structural information, compare the results of different models. Comparisons are also made between com bining two information sources and merely using expression data to construct networks. 4 Chapter 1. Introduction and literature review 3. Combine functional enrichment analysis to interpret learned regula tory interactions. The rest of the thesis is organized as following: • Chapter 2 briefly presents the models we are using and the existing and our proposed algorithms to solve the models. Details of the modeling issues are put into the appendix. • Chapter 3 presents the results of a simulation study and a real data example (from E. coli) where the true regulatory network is known. The purpose of this chapter is to compare various models and algo rithms without prior structural information, from a statistical point of view. Model performance is measured by a precision-recall curve. Further, several hub genes are selected with common predictions from high performance algorithms and we show that at the functional level, the learned networks agree with current literature. • Chapter 4 presents the results on a p53-centered data set (NCI6O) and prior structural information is used to infer a network. The results are presented in several specific examples, which are of our biological interests, i.e. p53 tumor related genes/miRNAs. We combine the variable selection and functional over-representation analysis to detect bona fide interactions. • Chapter 5 concludes the thesis with a discussion. 5 Chapter 2 Models and algorithms 2.1 Model: Linear regression The model we are assuming is a multivariate linear regression model. y=X/3+e (2.1) where • X X>< is the design matrix consisting of expression values for all known candidate regulators, including TFs and microRNAs • y is the response vector of target gene expression values • /3 is the regulatory strength: /3 > 0 means up-regulation, /3 < 0 means down-regulation, and /3 = 0 represents no regulation • e is an error term assumed to be Gaussian with constant variance 2, i.e. N(0,u21). To capture the dependency between the regulators and target gene, we consecutively regress the expression levels of every target gene to X, while keeping X fixed for all. 2.2 Variable selection: penalized likelihood In many situations, the assumed linear model may be redundant in the sense that not all of its predictors have significant effects on the response. For example, there may be hundreds or even thousands candidate regula tors for one gene. Thus the full linear model is usually over-parameterized. 6 Chapter 2. Models and algorithms Intuitively, including more ifs, the prediction error would decrease and the fit would improve. However, the improvement is very marginal compared with the cost of estimating a much more complex model. Furthermore, the interpretation becomes unclear for separating the true predictors from the irrelevant variables. This is because the estimations yielded from mini mizing a least squares (LS) objective function typically do not equal zeros. Hence, selecting a proper subset of predictors from the full model is cru cial for high-dimensional statistical modeling. The basic idea is to penalize more complex models while preferring simpler models. In terms of math ematical formulation, many variable selection problems aim to maximize a penalized likelihood function, which is equivalent to minimizing a penalized negative-log-likelihood O(3, y) function Q(/3,AjX,y) = -(/3) +p/3) (2.2) where £ is the log-likelihood and p> (Ii3i) is the penalty term fOr each large /3 absolute value. Generally speaking, large penalties tend to shrink the /3 co efficients to zeros. Here in the dimensionality reduction context and the rest of the thesis, we use variable selection and model selection interchangeably. In the context of linear regression, i.e. when p = 0, the maximum likelihood (ML) and least squares approaches provide the same estimator /3MLE, which does not depend on g2 Thus cr2 can be absorbed in the penalty terms. That is to say given we can omit o2 and minimize the penalized LS objective function over /3 and O(/3, AjX,y) = II - X/3II + pA(I/3jI) (2.3) where II 2 is the Euclidean norm of a vector, and without loss of generality, we assume p takes the same functional form for all j = 1 : p, i.e. = p, in Eq.(2.3). Specifically, choosing p>, = AI/3I yields the £ penalty [22]. By avoiding the over-penalization of /3 with large absolute values, Fan and Li (2001) proposed a modified £‘ penalty [19], i.e the Smoothly Clipped 7 Chapter 2. Models and algorithms Absolute Deviation (SCAD) penalty. The problem becomes finding 3 such that Cã,) = argrninO(/3,AIX,y) (2.4) A can be determined on a linear grid by cross-validation (CV) giving the smallest prediction errors. 2.3 Optimization: finding penalized ML estimator It turns out the minimization of the objective function, Q(.), is not trivial. For p,\(.) = H (a.k.a. LASSO) [56], it can be efficiently solved by Osborne’s [44] and the Least Angle Regression (LARS) algorithm [16]. But these al gorithms are not generalized to other penalties (see Appendix A Methods). Caron and Doucet [12] proposed an Expectation-Maximization (EM) al gorithm to find the Maximum A Posteriori (MAP) estimator for general penalty functions. Further, for L’ and SCAD penalties (see Appendix A Methods), Hunter and Li [32] proposed a Majorize-Minimize (MM) algo rithm to optimize the same objective function and avoid the local optimum problem with the EM algorithm. The MM can be seen as a generalization of the EM algorithm. In this thesis, we extended the idea of the MM algorithm and propose a new MM algorithm to handle the general penalizations. The details will be addressed in the following paragraphs and also Appendix A Methods section. 2.3.1 EM algorithm [20], [27] and [12] proposed an EM algorithm [15] to find the posterior modes of Laplacian (Li), Normal-Gamma (NG), Normal-Jeifreys (NJ), and Normal-mv-Gaussian (NIG) models. In general, the marginal prior distri bution of regression coefficients in this family can be factored as a scale 8 Chapter 2. Models and algorithms mixture of Gaussian densities: = 0,T2)p(Td (2.5) Choosing the prior of to be Laplace, Gamma, Jeifreys and Inverse Gaus sian corresponds to the Li, NG, NJ and NIG models, respectively. In the EM framework, 3 is the parameter vector of interest, and T can be seen as missing data. Then, the posterior mode can be iteratively found by maximizing conditioned on the last estimate (c): (k+1) = argmaxQ(3,) (2.6) where Q(/3, (1c)) is defined as the expected complete-data log-likelihood: f log(p(X, y, r))p(rI, X, y) Then, in the next iteration, j(k+1) is given by maximizing the Q(3, )) function over j3. For linear regression models, the update expression of3(k+1) can be given in closed form: (k+1) (XTX + u())_1xTy (2.7) where 0 ... 0 (k) with (k) — p’(I/I)j — p/3j1 Here p’ (x) denotes the derivative of penalty function p(.) evaluated at x. 9 Chapter 2. Models and algorithms 2.3.2 MM algorithm Depending on the optimization context, the MM algorithm refers to the majorize-minimize or minorize-maximize algorithm, which is a generaliza tion of the famous EM algorithm. Because the MM algorithm is a very gen eral optimization method (for example iteratively reweighted least squares algorithm is also a special MM algorithm which uses tangent lines to surro gate the original objective function), we adopt the MM algorithm to solve penalized models with various penalties. In the problem where no missing data can be naturally introduced, MM requires the construction of a surro gate function for the objective function, which is equivalent to the E-step in the EM framework. The M-step is subsequently implemented to optimize the transfered surrogate function. These two steps are iteratively performed in turn until convergence. The MM algorithm has the monotonicity prop erty of the EM if the majorize/surrogate function constructed satisfies the following two properties: 1(O,Oo) p(O),’vReO (2.8) (9o,8o) = P(Oo) (2.9) where, 1(8, 8o) is the majorize function of the objective function p(O) at &o. Specifically, in the context of penalized linear regression, p(.) is the penalty. Then, it is obvious that by subtracting Eq.(2.9) from Eq.(2.8), we get 1(O,6o) — (6o,Oo) p(s) — p(Oo), which in turn guarantees the monotonically non-increasing property of original objective function: 1(O,Oo) 1(Oo,6o) i(8) P(8o) (2.10) Thus by choosing a more smooth surrogate function, instead of directly min imizing p(s), the optimization is transfered to minimizing the differentiable function (0, Oo) over 19. The penalty functions and corresponding upper bound surrogate functions are shown in Figure 2.1. 10 Chapter 2. Models and algorithms I (a) Li (Lipschitz continu- (b) SCAD (Lipschitz con ous) tinuous) (c) NJ (Infinite derivative) (d) NG (Infinite deriva- (e) NIG (Infinite deriva tive) tive) Figure 2.1: Penalty functions and their corresponding quadratic surrogate functions around origin. Solid lines are original penalty and the red dashed lines are the upper bound surrogate functions of solid lines with Oo = 0.5 and € = 108. 11 Chapter 2. Models and algorithms MM1 algorithm In the context of the linear regression, one can define the quadratic function (0, 0) pA(IOoI) + (0 - (2.11) it can be shown that ‘T(0, 0o) majorizes p(I0I) at ±10o1 (Proposition 3.1 [32]). Here, pjI0j+) denotes the limit of p(x) as x — 101 from above. By iteratively minimizing i(O,0o), the solution to this algorithm is identical to the solution of the M-step in the EM algorithm. Fan and Li called this algorithm Local Quadratic Approximation (LAQ) [19], but here we call it MM1 for simplicity. MM2 algorithm Note that in Eq.(2.11), c(9,8) is undefined at 0o = 0, and hence when a parameter is estimated to be zero or close enough to zero, then it is excluded from the subsequent sub-model and never re-enters the model again. To ease this problem, [32] proposed a modified version of LQA/MM1 with perturbed penalty function P2 p181 / PA2(10I) = PA(I0I) — -‘—-dt (2.12) 0 paired with its majorize function 2(0,0O) P2(I0OI) + (02 (2.13) Under certain regularity conditions (c.f. Proposition 3.2 [32]) and provided the space of 0 is compact, it can be shown as 0, 1PA2(O) — p(°)I —‘ 0 uniformly over 0. The iterative ridge-type solution can be given in closed form expression (k+1) = (XTX + vQ))_1xTy (2.14) 12 Chapter 2. Models and algorithms where 0 ... 0 V(k)= 0 (k) 0 with (k) = p(IPI+) 3 MM3 algorithm For NG, NJ and NIG models, the regularity conditions stated in [32] are not met. For example, in the NJ model, the first derivative of the penalty function is ik (see Table A.1), which is unbounded as O . 0. Also the penalty itself is unbounded below when 181 J, 0 (Figure 2.2(c)). Even worse, when is not integrable around the origin, it is not possible to use this functional form to derive the new penalty function. To handle this issue, we propose a perturbed penalty function, fix e > 0: PA3(IOI) = PA(IOI + e) - € f° p(t) dt (2.15) Define: 3(0, 0) = P3(I0Ol) + (82 - 0)p((oI+ e)+) (2.16) Similarly, by the above construction, we could prove the following theo rem. Theorem 2.3.1. Let p(.) be a function defined on [0,oo). Suppose pA(.) satisfies the regularity conditions: 1. differentiable, nondecreasing and concave on (0, oo) 2. continuous at the origin Ve> 0, define p3(IOj) and 4’3(8, Oo) as above. Then 1. 3(0,0O) majorizespA3(I0I) at ±IOoj, VO0ER. 13 Chapter 2. Models and algorithms (a) Li (Lipschitz continu- (b) SCAD (Lipschitz con- (c) NJ (Infinite derivative) ous) tinuous) Figure 2.2: Penalty functions and their corresponding surrogate functions around origin. Dotted lines (in blue) are original penalty, solid lines are the proposed perturbed penalty function, and dashed lines (in red) are the surrogate functions of solid lines with 8o = 0.5 and e = 10—8. 2. VC c e, the space of 0, where C is compact. As e 0, SUJ J3(0) — p(O) —÷ 0 (2.17) eec In particular, ifp,.(.) is Lipschitz continuous on [0,00), then sup90e IPA3(8) —(8)I — 0. The proof of this theorem is given in Appendix B. Theorem 2.3.1 tells us that under the new surrogate function, the condition PA (0+) <00 can even be abandoned. As we have seen, this is essential for many kinds of priors that have been used in variable selection, for example NJ and NG etc. Then iteratively update expression for the parameters j3 can also be analytically given by (k+1) = (XTX + W(k))XTy (2.18) where 0 ... 0 W(c)= 0 w ... o ... (k) 14 Chapter 2. Models and algorithms with — p((I/3I +e)+) j — The convergence criterion adopted here is different from the one used in [32]. We adjust e, hence also the surrogate function (if necessary), to ensure that, for /3 0, < T at the convergence, where r is a predefined effective-zero level. Algorithm 1 summarizes MM3 algorithm steps: Algorithm 1 MM3 algorithm Require: Design matrix X, response y, penalty name 1: Initialize e to be moderately small 2: repeat 3: Run MM3 algorithm 4: until Q((k)) — Q(/(k+1)) < T “(k+l)5: if /3 I<Tthen 6: Set (k+1) = 0 7: end if 8: Compute = max — ____ 8/3d 9/3j (k+1) 9: if then 10: Finish and exit 11: else 12: 13: Go to line 2 14: end if 15: return Estimated coefficients/Selected variables Remark: we do not adopt the procedure of presuming 13çk+1) = 0 if its sub-differential is greater than r, i.e. > r as in [32]. Since any coefficient close to zero has been removed in’ the last step, we argue that the remaining coefficients after the algorithm converges are all non-zero. In fact, these effective-zero coefficients are due to the perturbation introduced, so they should be exactly zero if e = 0. Also from the simulation studies, 15 Chapter 2. Models and algorithms 9 0 2 4 8 8 10 Figure 2.3: Convergence of unperturbed penalized likelihood function. As suming r = 1. (O) is the part regarding the change of unperturbed pe nalized likelihood. S is the part measuring the goodness of approximation between the derivative of perturbed penalty and original penalty. we find the estimates obtained by setting 13çk+1) = o if ôPj $(k+1) > T have quite bad performance under the NJ, NG and NIG priors3in terms of correctly selected predictors (results not shown in this thesis). Finally, under this scheme, when the coefficient is estimated to 0, it could re-enter the model after tuning smaller, because we allow zero estimations to compete with others as long as convergence has not been arrived. The tolerance is decomposed as shown in Figure 2.3. 2.3.3 Connections between algorithms The connection between the EM, MM2 and MM3 algorithms are summarized below: • The EM algorithm requires to compute the expected complete-data log-likelihood for the E-step, while the MM algorithmneeds a majorize function. The subsequent step of these algorithms are essentially the same: optimizing the transferred surrogate function. Hence, it is easy to see that the E-step is a special case of constructing an arbitrary sur 16 Chapter 2. Models and algorithms rogate function. In fact, the solution to the EM algorithm is identical to that of LQA/MM1 algorithms • If the marginal density of /3 is infinitely spiked around 0, then the EM and MM2 algorithms have the problem as forward/backward se lection procedures, i.e. variables discarded cannot be re-introduced into models in later stage. The proposed MM3 algorithm can avoid this local optimality by adding a small perturbation to the derivatives while retaining the monotonicity of the EM and MM2 algorithms. • The choice of €, and thus the majorize function, is adapted to the estimated parameters. Hence, the choice of can be controlled by the information from data and is robust to its initial choice. As c is dynamically tuning smaller, the MM3 usually takes a longer time to converge than the EM and the MM2 based on empirical experience. • It is not difficult to see that /3 = 0 is a local optimum for the penalized linear regression with the EM, MM2, and MM3 algorithms. The EM with all priors and the MM2 with NJ, NG, and NIG priors have to drop the variables once they are estimated to be zero, and thus will be stuck at the origin. However, a simulation study will demonstrate that the MM3 with all priors has the ability to jump from the zero nodal points and correctly identify the true variables. • It is well known that an Li penalty can be viewed as the following constrained optimization problem: 2minimize — X/3 subjected to I/3It The penalized regression is equivalent to the constrained regression problem, in that, given a ). 0, there exists a t> 0 such that these two problems share the same solution. Generally speaking, however, i7 Chapter 2. Models and algorithms the penalized formulation is not the Lagrangian function for the con strained formulation. However, for NG and NIG model, we can inter pret the MM solutions to penalized LS as log-barrier functions (see Table A.i). The idea of the MM algorithms here is the same as that of the log-barrier function solution to the Li penalized LS: they all prevent variables from becoming exactly 0 while remaining a good ap proximation to the original penalized functions. Specifically for Li, the log-barrier function can defined as g(x) = Iy_X +IjI -flogx When is sufficiently small, minimizers of g(x) corresponds to mini mizers of OC, AIX, y). For MM2 and MM3, the surrogate functions have the same role as g(x), and we can prove that as goes down suf ficiently close to 0, the original objective functions can be arbitrarily approximated by surrogate functions. 18 Chapter 3 Results on comparing across models 3.1 Simulation studies To compare the performances of models with different penalizations and al gorithmic solutions, we simulate 100 data sets, each with n = 100 points and o = 1. Following the setup of [32], the true coefficients are set to be /3 = (3, 0, 0, 0, 1.5, 0, 0, 0, 2, 0, 0, o)T, with p = 12. This is the scenario to simulate 3 true TFs out of 12 candidate TFs for one gene. Because genes are usually dependent, we introduce three correlation levels between covariates: p = 0.1,0.5,0.9. Data X is scaled to mean 0 and unit variance and y is centered. For both MM2 and MM3 algorithms, the initial e is set to 10—8 and effective-zero T = 10.8. The hyper-parameters of all models are set to be ci = 0.01 and c = 21:10 (see Appendix A Methods for the notations). All models and algorithms are initialized with all 0’s and the Ordinary Least Square (OLS) or the Maximum Likelihood Estimation (MLE). Optimal pa rameters are determined by 5-fold CV as used in [56]. The performances of various models are measured by the mean square error (MSE) (can be calculated in closed form by (_/3)Tcov(X)(_/3)), the number of correctly estimated zeros (i.e. equals 9), the number of correctly estimated non-zeros (i.e. equals 3) and the number of incorrectly estimated zeros. Note that we do not implement the SCAD prior with the EM algorithms. First, we look at the results with all models initialized at zeros at different correlation levels, see Figure 3.1. The true number of non-zero coefficients is three, and it is easy to observe that the MM3 uniformly dominates the 19 Chapter 3. Results on comparing across models Table 3.1: Mean errors for linear models, averaged over 100 simulations. MSE is the mean square error. C is the number of correctly estimated zeros and I is the number of incorrectly estimated zeros. Boldfaced methods are het rPii11R MSE Zeros MSE Zeros MSE Zeros Median C I Median C I Median C I Model p=O.l p=O.5 pO.9 NJ MM3 0.1104 8.54 0 0.1099 8.52 0 0.1254 8.51 0 NJ MM2 0.1104 8.54 0 0.1099 8.52 0 0.1255 8.51 0 NJ EM 0.1105 8.54 0 0.1099 8.52 0 0.1254 8.51 0 NG MM3 0.0992 8.75 0 0.0945 8.79 0 0.1280 8.77 0 NG MM2 0.1067 8.56 0 0.1055 8.54 0 0.1278 8.57 0 NG EM 0.0971 8.77 0 0.0975 8.75 0 0.1337 8.77 0 NIG MM3 0.0926 0 0 0.0872 0 0 0.1185 0 0 NIG MM2 0.0937 0.78 0 0.0846 0.9 0 0.1230 2 0 NIG EM 0.0905 0 0 0.0872 0 0 0.1256 0 0 Li MM3 0.1364 4.18 0 0.1457 4.60 0 0.1692 4.86 0 Li MM2 0.1533 5.36 0.32 0.1589 5.34 0.30 0.2275 6.06 0.46 Li EM 0.1462 4.36 0 0.1481 4.63 0 0.1751 4.97 0 SCAD MM3 0.1454 4.31 0 0.1387 4.84 0 0.1568 5.21 0 SCAD MM2 0.1204 6.35 0 0.1414 6.68 0 0.1758 7.32 0 EM and the MM2 in terms of the number of correctly identified non-zeros. Comparing across models, the EM is always stuck at 0’s for all models and the MM2 only selects variables in case of Li and SCAD. For MM3, starting from zeros has no significant effect on the selected variables. Hence, the local optimum issue can be well handled by the MM3 in this extreme situation. Secondly, we compare the results of algorithms starting from OLS es timations. From Table 3.1, we can see that the NJ and NG priors are performing among the best with all three algorithms, in terms of both cor rectly and incorrectly estimated zeros. The NIG has the lowest MSE but. its estimates are not a thresholding rule, i.e. the NIG prior is not a variable selection prior and can produce many coefficients with smallest absolute val ues (c.f. Appendix A Methods). However, one can read small coefficients as zeros and thus corresponding variables are not selected. Further, the NJ models for the EM, MM2, and MM3 algorithms are essentially the same and 20 Chapter 3. Results on comparing across models Figure 3.1: Simulation comparisons, initialized with all zeros start. Simula tion averaged over 100 times Numbers of non-zero correctly estimated Numbers of non-zero correctly estimated •EM •EM 0 MM2 0 MM2 MM3 MM3 jill] ii]]l LI NJ NO MG SCAD Li NJ NO NM SCAD (a)p=O.1 (b)pO.5 Numbers of non-zero correctly estimated • EM 0 MM2 • MM3 11111 Li NJ NO MG SOAD (c) p = 0-9 21 Chapter 3. Results on comparing across models the Li and SCAD priors have larger MSEs. Figure 3.2 shows that the num ber of zeros included in the NG and the NJ is higher than Li and SCAD. Therefore, the NG, NJ, and NIG models tend to produce sparser models than Li and SCAD. Finally, in general these variables selection models with the MM3 algorithm tend to have lower MSEs than those with the MM2 and the EM algorithms. In conclusion, local optimality is unlikely to occur if the starting points are not carefully chosen and the MM3 is better but the improvement is marginal in this case. 3.2 E.coli data set In [18], Faith et.al. showed how RegulonDB database [50] can serve as a ground truth of regulatory network for E. coli. RegulonDB contains 32i6 experimentally confirmed regulatory interactions among 1058 genes and 153 TFs. [18] assembled a compendium of 445 new and previously published E. coli K12 Affymetrix Aritisense2 microarray expression profiles collected under various conditions. Compared with the ground truth (see Figure 3.3), this compendium is an ideal real dataset that we can evaluate the performance of various models. This dataset can be downloaded at the Many Microbe Microarrays database (M3D) Web site (http://m3d.bu.edu/). We also compare our linear models with the state-of-the-art gene regu latory network construction algorithms, i.e. Context Likelihood Relatedness (CLR) algorithm [18]. CLR is a mutual information (MI)-based method. The CLR algorithm estimates the likelihood of the MI score for a particular pair of genes, i and j, by comparing the MI value for that pair of genes to a background distribution of MI values (the null model). There are three major steps in CLR algorithms: 1. Computing raw MI values for every pair of genes 2. For each gene, normalizing the raw MI values 3. Estimating the joint likelihood measure (i.e. a significance measure of MI values), Z-score, of MI between every pair of gene 22 Chapter 3. Results on comparing across models Figure 3.2: Simulation comparisons, initialized with MLE. Simulation aver aged over 100 times Numbers of zero correctly estimated Numbers of zero correctly estimated UEI EM 0 MM2 0 MM2 hNNi LI NJ NG NG SGAD LI NJ NG NIG SCO (a)p=O.1 (b)p=O.5 Numbers of zero correctly estimated • E 0 MM2 • MM3 LI NJ NG NIG SC,AD (c) p = 0.9 23 Chapter 3. Results on comparing across models a, C a, 0, Ground truth Figure 3.3: The real regulatory network of E.coli presented in a binary matrix M. A bright spot at is a regulatory relationship from TF j to gene i. 3.2.1 Precision-recall curves The performances of all methods are measured by the area under the precision- recall curve (PR curve). Precision is defined as the fraction of true positives out of the total predicted positives, and recall the fraction of true positives out of the actual total positives. By formulation, they can be expressed as following: TP precision = TP + PP TP recall = TP + FN (3.1) (3.2) Hence, the larger area under the PR curve, the better performance an algo rithm has. Faith [18] have applied various algorithms to genes on the whole E.coli 80 TF1153 24 Chapter 3. Results on comparing across models Table 3.2: Characteristics of 60% and 80% precise networks inferred from models with top perfermance in E. coli. 60% precise network 80% precise network Model TP FP Threshold TP FP Threshold Li MM3 132 88 0.4530(Coeff value) 39 ii 0.7531(Coeff value) SCAD MM3 137 93 0.4435(Coeff value) 39 ii 0.7610(Coeff value) Li EM 132 88 0.4396(Coeff value) 39 ii 0.7585(Coeff value) NIG EM 94 66 0.6446(Coeff value) 31 9 0.8954(Coeff value) CLR 147 103 5.7905(Z-score) 8 2 iO.658(Z-score) Antisense2 microarray data, but we only apply linear models and MI-based methods on the set of genes which have representation in RegulonDB database, i.e. 1058 targets + 153 TFs = 1211 genes. This is because [18] can biolog ically validate the predicted and presumably novel interactions which are not curated in RegulonDB, while we have no such advantages and our main focus in this thesis is on the statistical modeling side. Thus, our results from CLR are different from those in the original paper applied on E.coli [18]. From the PR curve in Figure 3.4, we can see that the Li penalty fitted with the EM and the MM3, the SCAD penalty with the MM3, and the NIG penalty with the EM, perform best in most algorithms and are comparable with the CLR algorithm. The NJ prior for all three algorithms are essentially overlapping each other. Performances of the NG prior for three algorithms are very close, although they are all inferior to the CLR algorithm. The network characteristics corresponding to models with top performance are summarized in Table 3.2. To access the goodness-of-fit of various models and algorithms, scatter plots of the fitted expression values of target genes against their residuals are shown in Figure 3.5. It displays that the correlation between the actual expression values and fitted values is quite high. Moreover, the correlations from the Li and SCAD models are better than the NIG model fitted using the EM algorithm. 25 Chapter 3. Results on comparing across models 0.9 0.8 07 06 05 04 03 02 0.1 0.9 0.8 07 06 0.5 04 03 0.2 0.1 0 002 004 006 004 01 012 014 016 (e) NIG Figure 3.4: Precision-recall curves for various models and algorithms on E.coli data set. MI is the raw MI method. Z is CLR algorithm. 0 0 (a) Li 004 006 0.04 01 012 014 (b) SCAD 06 (c) NJ (d) NG MI NIG EM NIG MM2 NIGMM3 26 Chapter 3. Results on comparing across models Figure 3.5: Scatter plots of the fitted against their residuals. 10 0 9 -.- -10 ‘F2 - - . - lledprioI (a) Li MM3 10 1 -15 -2 -10 - - - 6 6 Filled expression values expression values of target genes .0 . s. -N” -5 -10 12 -10 -6 -6 - -6 6 6 6 6 FOted expression valves (b) SCAD MM3 10 .15 (c) Li EM -10 -0 .5 -4 -2 0 2 0 5 Fin sO expreox on ye Ixes (d) NIG EM 27 Chapter 3. Results on comparing across models. Table 3.3: Number of targets regulated by transcription factors in the 60% precise network with p 5 predicted targets with top performance algo rithms, in E.coli network. TFs Targets # Targets # inferred in RegulonDB CLR Li MM3 Li EM SCAD MM3 Li NIG fliA 42 40 43 44 44 44 lexA i6 6 7 7 7 7 hycA 7 10 9 14 9 14 gatR 6 6 7 7 7 7 yhiE 5 11 10 10 11 10 3.2.2 Visualizing and analyzing inferred networks Networks are visualized through the Graphviz software, and the networks corresponding to Table 3.2 are shown in Figure 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14, and 3.15. Specifically, to visualize the actual learned net works, we first extract the 80% precise networks as [18] did from the CLR algorithm, see Figure 3.6, 3.9, and 3.10 (i.e. the network is extracted by thresholding which gives 80% precision.). At this precision level, the learned networks from the LASSO and the NIG penalties are much better than the CLR algorithm. This agrees with the result in the PR curve, since at 80% precision, CLR almost has a recall of 0, which means there are almost no TP edges inferred. In fact, by trading off between the true positive and false positive rate in selecting a threshold for identifying significant regulatory in teractions, a higher precision level leads to a small network with fewer false positives, while a lower threshold will include more false positive features. Hence, next we lower down the precision level to visualize the 60% precise networks, the resulting networks are shown in Figure 3.11, 3.12, 3.13, 3.14, and 3.15. The numbers in the nodes are the gene indexes in the expression matrix and each number can be uniquely mapped to one gene. We summarized the TFs in the 60% precise network with p 5 pre dicted operon targets with these top performance algorithms, and the re sults are presented in Appendix C Supplementary Materials C.1, C.2, C.3, and C.4. Here, we report 5 TFs with p 5 connectivities to their tar 28 Chapter 3. Results on comparing across models get genes supported by all 5 algorithms with high performance, measured by PR curve (see Table 3.3). Their gene names are: fiiA_b1922_at (fliA, gidx: 345), lexA.b4043.at (lexA, gidx: 615), hycA_b2725_at (hycA, gidx: 532), gatR2b2090_Lat (gatR, gidx: 413), and yhiEb35l2_at (yhiE, gidx: 1170). These genes are well documented in the literature. We suumarize their functions from the literature. • fliA: we observed that it is a hub gene in both 80% and 60% pre cise networks and contols the transcriptional activities of many down stream genes. Very recently, fliA is reported to be one of the two global regulators (rssB and fliA) in E. coli through genetic screen ex periments and the products of these two genes are involved in the regulation of major genetic networks [21]. It is known that even in the absence of identifiable exogenous stress, there remains a measur able, basal death frequency in E. coli populations. But the underlying mechanisms still remains unclear compared with those under stress conditions. [21] showed that mutant of the fliA gene affects the levels of different sigma factors within the cell and results reduced death fre quencies in E. coli populations. Specifically, the inactivation of the fliA gene encodes the fiagellar sigma factor. This results in the lack of ex pression of a number of genes involved in motility and chemotaxis, and consequently, non-motile cells. This loss of motility results in greater absolute availability of both RNA polymerase and energy for other processes within the cell, and consequently the cell may gain viabil ity through a number of potential mechanisms by losing the motility pathway. • lexA: a major regulator of DNA repair and DNA-binding transcrip tional repressor of SOS regulon, is known to have a single well-conserved DNA-binding motif. It is one of the best-perturbed regulators in the microarray compendium due to the compendium’s emphasis on DNA damaging conditions. 16 regulatory interactions were collected in the RegulonDB database. The target genes that are supported by these 5 algorithms are: dinF, recA, recN, sulA. Interestingly, all of these four 29 Chapter 3. Results on comparing across models genes are involved in the DNA repair and SOS response, which are important functions of lexA in E. coli. • hycA: a formate hydrogenlyase regulatory protein. The target genes that are supported by these 5 algorithms are: hycB, hycC, hycD, hycE, hycF, hycG, hycH, hydN. The products of these target genes (hycB H) are the subunits of hydrogenase, and hydN is involved in electron transport from formate to hydrogen. Thus, all genes have a common biological function: form and maturate the formate hydrogenlyase pro tein complex, which is an important molecule for energy production and conversion in E. coli. • gatR: annotated to be a pseudogene, repressor for gat operon in NCBI, The target genes that are predicted by these 5 algorithms are: gatA, gatB, gatC, gatD, gat Y, and gatZ. It is known that these genes form a gat operon with 7 ORFs in E.coii EC3132 and involve in galac titol metabolism. A mutation in gatR in the E.coii K12 strain implies constitutive expression of gatABCD YZ [8j. This is in fact a negative regulation effect of gatR on these genes. • yhiE: yhiE is a hypothetical protein, and currently there is no litera ture describing the functions of this gene with experimental support. The target genes that are commonly predicted by these 5 algorithms are: gadA, gadB, hdeA, hdeB, hdeD, sip, and yhiD. However, by com paring the functions of the genes targeted by yhiE annotated in NCBI, it might be that yhiE involves functions of regulating acid-resistance (hdeA -D) and/or glutamate decarboxylase (gadA -B). 30 Chapter 3. Results on comparing across models S2I t) o 487 1170 401 .180 Figure 3.6: 80% precise network for CLR algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 31 Chapter 3. Results on comparing across models O3 (41 (300) I Q6 164 43 J01 1106/b / 126 2Q0 997 615 .138 Figure 3.7: 80% precise network for Li penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 32 Chapter 3. Results on comparing across models 20O _ 73 6” (.) 10 oo 34’ 1 GC 12011104 114’ (1) E9D154N9 90011-14 Figure 3.8: 80% precise network for SCAD penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 33 Chapter 3. Results on comparing across models 1000 33 S2 óod ®d9 8 (4• c32 111 489 313 114-1 Figure 3.9: 80% precise network for LASSO penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 34 Chapter 3. Results on comparing across models 18 163 S58 357 P W 1 356 (4-) Figure 3.10: 80% precise network for NIG penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 35 Chapter 3. Results on comparing across models Figure 3.11: 6O9 precise network for CLR algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 36 Chapter 3. Results on comparing across models / ®® c) c5) Figure 3.12: 60% precise network for Li penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 37 Chapter 3. Results on comparing across models - 1 — - C ; —. c - := - - - - -- — — _ ,_ f _ - _ I J — - .) _ — I ‘ - -- -, - N / (._._-_ — - ( - — - - z - - Oc---- . - (-_ q. - S ( - - II) ( - L - -- Q - - - - -_ ä-. J’ -- - • — - - 0 0 - - - - — - — Figure 3.13: 60% precise network for SCAD penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 38 Chapter 3. Results on comparing across models - — --—- - -, 441 j - -i-- - -— ci - — —p - — — — -- — — — — —, ( — — (_ 3. —- III - . . ‘. — - - : ci --.4t; - -z- p--- -z-’ - I ____ — ___ — — —— 7 ci i.---- -‘ —_--- Q - - -- K-— — -—- i—- -- - -- — - -‘---. --— Il - :.‘ 1:I_1 ‘ S ,- KII -N -, L 1111 ,-, _Y —- ‘—_ II I4, •, ---- - —- —_—-, —- -, .- I1O — —- -: - - i- — -- —‘s— _ ‘‘1_’_)- -\jy 2 - - .i. N __- I 4 14, ) --_--K, - —-— - - - 2 -- z - ‘ Figure 3.14: 609”o precise network for LASSO penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 39 Chapter 3. Results on comparing across models Figure 3.15: 60% precise network for NIG penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. _ - I—s 40 Chapter 4 Comparing models coupled with prior structural information 4.1 NCI-60 Data preparation NCI-60 is a 60 human cancer cell line data set by Developmental Therapeu tics Program of the National Cancer Institute to screen more thn 100,000 chemical compounds since 1990, includeing leukemias, melanomas and can cers of ovarian etc [10, 53]. Until now, the NCI-60 cell lines have been characterized more extensively than any other set of cells in existence [53]. The purpose of our study for NCI-60 data set will be focused on the p53 hub, which is a central TF controlling the development of a variety of cancers in mammals. Hence, the ways we collected the data are including the set of genes which are targeted by p53 and potentially higher-level TFs regulating p53. Only transcript consistent re-mapped probe sets were used, i.e. probe sets that consistently match the same sets of transcripts. Further, as men tioned in the introduction to span the network across different molecular levels, we also collect a set of expression data of miRNAs in NCI-60. And we only consider the regulatory direction of microRNA—4gene (including TF). Finally, to incorporate the information from other sources (e.g. from sequence comparative analysis), binary structural data, i.e. TF—*gene and miRNA—gene, are also collected, respectively. On one hand for TF—gene, the selection was made initially of tar gets from a ChIP-PET (Chromatin ImmunoPrecipitation and sequencing 41 Chapter 4. Comparing models coupled with prior structural information of paired-end di-tags) experiment in colon cancer cells (treated 6hrs with 5- fluorouracil that elicits a p53 response), the most likely targets were selected from Table S4 in [59] and also from data deposited at UCSC genome browser table GIS ChIP-PET - Genome Institute of Singapore ChIP-PET. Poten tial targets were limited to genes downstream of ChIP binding sites. These tables select the most probable mappings and also looks for the presence of p53-TFBS. More targets are added from Ingenuity knowledge base [2] (both transcription activation and repression) and from known targets in the lit erature referred to [59]. Hence, the TFs chosen as potential targets of TP53 or regulators of TP53 are from three sources: collected from ChIP data, ingenuity and p53 knowledge base. On the other hand for miRNA—*gene, the potential targets selection was made from PITA version 5 database [1]. The prediction of PITA based on two levels. The stringent one is to choose top selections based on high conservation score across species (giving 2.24% positive interactions), which takes into account the structure around the seed sites. The other selection is much more noisy, which includes all predictions without filtering (giving 23.45% positive interactions). For our analysis, only the top predictions were used where conservation is taken into account. In summary, we have four input data sets: • X: a 16143 x 59 matrix containing mRNA expression levels for 16143 genes across 59 experiments • Z: a 278 x 59 matrix containing miRNA expression levels for 278 miR NAs across 59 experiments • D: a 16143 x 183 binary matrix containing TF—*gene prior structural information based sequence motif analysis • C: a 16143 x 278 binary matrix containing miRNA—*gene prior struc tural information based sequence motif analysis Note that there is one patient removed from the study because the measured expression values are constant over all treatments. To the same end to reduce computation as in [31], we fit various models within the positive interactions in prior binary data and select interactions 42 Chapter 4. Comparing models coupled with prior structural information Figure 4.1: Flowchart of learning regulatory network on NCI-60 data set. as predicted positives with supports from both prior structural information and the resulting sparse network. {(i,j) : LD = 1 fl = 1} {(k,l) : LCkI = 1 flCkj = 1} where LD and LC is the learned structures for TF—÷gene and miRNA—*gene by variable selection procedures, respectively. The whole process is illustrated in Figure 4.1. Remark: In current pipeline, we do not compare with the variable selec tion procedure without prior for every gene. We think of the prior structure information have taken account into most biological relevant facts and be- 43 Chapter 4. Comparing models coupled with prior structural information sides this information, there are also noises in the prior information, i.e. we assume that the probability of analogous type I error is small and under the positives in prior, use variable selection to denoise the irrelevances that are not supported by our expression data (remember we subset the learned net work from prior network). Thus, the methods with prior information should be better than those without prior, i.e. if the model can detect the signals contained in the prior, then the model should be also able to and easier to detect the same interaction in smaller range. Three specific examples are taken to compare between with and without prior information in the next section to justify this, see Table 4.2. 4.2 Results 4.2.1 Prediction of p53 related microRNAs and genes We took several specific examples that of particular interests and related to p53 (TP53) regulation [551. Some of them have biological support and the others are predicted from bioinformatics algorithms with high conservation scores (e.g. PITA) and suspected to have some biological functions that are interesting to us. 1. TP53—*GAS1. Growth arrest-specific 1 (GAS1) plays a role in growth suppression. GAS1 blocks entry to S phase and prevents cycling of normal and transformed cells. GAS1 is a putative tumor suppressor gene and TP53 as a protein has a domain important for the activity of GAS1 as a suppressor of the cell cycle, i.e. an anti-proliferative function [49]. 2. hsa-miR-34---GAS1. hsa-miR-34, conjectured to combine with p53, coregulates the transcription activities of GAS1 gene. hsa-miR-34---*GAS1 is predicted by PITA taking into consideration the structure around the seed sites, and one of the top predictions where conservation is taken into account. In our analysis, we took two members of its sub family, i.e. hsa-miR-34a and hsa-niiR-34c. 44 Chapter 4. Comparing models coupled with prior structural information 3. ETS1—TP53 and TP53—ETS1. There are some evidences that TP53 and ETS1 interact each other at protein level, regulating down stream target genes [38]. Two members of the ETS family of tran scription factors, ETS1 and ETS2, have been shown to bind to a palindromic ETS binding site and were able to trans-activate a het erologous promoter containing the binding site [13]. However, we are also interested in the direction TP53—*ETS1 because nearly all of our models claimed this interaction (see below). 4. TP53—RB1. p53 is known to suppress RB1 transcription through inhibition of the basal promoter activity [63]. Additionally, post transcriptionally, p53 might suppress RB1 further by inducing mir iO6a, a known suppressor of RB1 [55]. These predicted interactions are summarized in Table 4.1. Here, we do not include the CLR because the thresholding is not easy to be determined and justified without ground truth interactions known. By rows, we can see that Li and SCAD models can detect most of the interactions. NG and NJ models tend to produce sparser regulations. Although NIG can identify most non-zero elements, but the effects of these coefficients are very small and negligible. In fact, we can threshold out these small coefficients and read them as effective zeros. Hence, NIG is similar to the NG and NJ models. Moreover, the fitted models tend to have MSEs for Li and SCAD which are uniformly smaller over NG, NJ, and NIG models. This is not surprising as the Li and SCAD are less sparser than the others. By columns, we view the results for each specific interaction. For TP53—GAS1 and TP53—+ETS1, it seems most of the models can correctly identify the interaction (there are some exceptions such as NG MM3, NG EM, and Li MM2 etc). For hsa-miR-34---*GASi (including two family members), Li MM3, Li EM and SCAD MM3 select it as positive interaction, and the regulatory strength is negative due to the degradation nature of miRNAs. For TP53—*RBi is an interesting example, which can be correctly identified in some models with any solution (Li and SCAD), but in other models none of solutions do the right job (NIG, NG and NJ). As we mentioned above, p53 is known to have 45 Chapter 4. Comparing models coupled with prior structural information MSE__Coefficient .16 .23 .18 0 __________ .16 .23 .18 0 .18 1E-6 ____ .18 1E-6 .18 0 .17 .17 .18 0 .17 .21 .17 .17 .17 .21 .16 .23 .17 .36 TP53—*RB1 Interaction MSE Coefficient .059 .13 .049 .17 .056 .14 .086 0 .086 3E-7 .086 3E-7 .080 0 .080 0 .086 0 .079 0 .079 0 .079 0 .056 .14 .038 .22 hsa-miR-34c---*GAS 1 MSE Coefficient -.49 .18 0 .16 -.49 .18 0 .18 -1E-6 .18 -2E-6 .18 0 .17 0 .18 0 .17 0 .17 0 .17 0 .16 -.49 .17 0 TP53—*ETS1 MSE Coefficient .16 .37 .28 0 .16 .37 .20 .33 .20 .37 .21 .28 .18 .39 .17 .47 .19 .38 .17 .51 .17 .47 .17 .51 .16 .37 .097 .90 Table 4.1: MSE and estimated coefficients for the specific interactions pre dicted from NCI-60 data set. Bold numbers represent the estimated inter- actions agreeing with the literatures. . TP53—GAS1 hsa-miR-34a—*GAS1Interaction Li MM3 Li MM2 Li EM NIG MM3 NIG MM2 NIG EM NG MM3 NG MM2 NGEM NJ MM3 NJ MM2 NJEM SCAD MM3 SCAD MM2 MSE .16 .18 .16 .18 .18 .18 .18 .17 .18 .17 .17 .17 .16 .17 Coefficient -.18 0 -.18 0 -2E-6 -2E-6 0 0 0 0 0 0 -.18 0 Li MM3 Li MM2 Li EM NIG MM3 NIG MM2 NIG EM NG MM3 NG MM2 NGEM NJ MM3 NJ MM2 NJEM SCAD MM3 SCAD MM2 MSE .085 .20 .059 .18 .15 .18 .13 .11 .16 .11 .11 .11 .058 1 E-5 ETS1—TP53 Coefficient 0 0 <iE-8 0 0 1E-8 0 0 0 0 0 0 0 0 46 Chapter 4. Comparing models coupled with prior structural information a negative regulatory effect on RB 1 in inhibiting the basal promoter activity, but for cancer cell lines, there might be a positive effect as well, although currently there is no literature to support this claim. Finally, in the p53 knowledge base, ETS1 is documented to regulate TP53, but here none of our models detect this interaction. Surprisingly, nearly all of our models (ex cept Li MM2) claim there is an opposite regulatory strength TP53—ETSi. After we searched through the literature, there indeed are experimental ev idence in terms of apoptosis function to support this claim. For example, in embryonic stem (ES) cells, mouse ES cells lacking ETS1 are deficient in their ability to undergo UV-induced apoptosis, similar to p53 null ES cells. Chromatin immunoprecipitations demonstratedthat ETS1 was required for the formation of a stable p53-DNA complex under physiological conditions and activation of histone acety1tiansferase activity. These demonstrate that ETS1 is an essential component of a IJV-responsive p53 transcriptional ac tivation complex in ES cells and suggests that ETS1 may contribute to the specificity of p53-dependent gene transactivation in distinct cellular com partments [621. Furthermore, to see how the prior structural information can improve the prediction ability, we select GAS1 gene as the target gene and there are 463 possible regulators in our data set, including all TFs and miRNAs. The reason of not doing variable selection for each gene without prior information is mainly a computational concern as stated before. Therefore, we just selected a few examples to demonstrate the prediction improvement. The comparison result is summarized in Table 4.2. Essentially without prior information, all the three interested interactions are not selected among the 463 candidate regulators. We conjecture that there is so many variables that by including them all, it is like finding a needle in a haystack. In the work presented above, we started from a conjectured bona fide in teraction and compare the performance of various models. However, in most realities, we think of the problem differently: given a set of genes and pos sible regulatory networks, how can we infer the biological functions of those genes and networks? In the next section, we apply functional enrichment analysis to check whether or not our predictions are functionally significant. 47 Chapter 4. Comparing models coupled with prior structural information Table 4.2: Estimated coefficients for the specific interactions predicted from NCI-60 data set, compared between with and without prior information. Coefficients with absolute values less than 10—8 are thresheld to 0. Bold numbers represent the estimated interactions agreeing with the literatures. TP53—CAS1 hsa-miR-34a—*GAS1Interaction No_prior Li MM3 Li MM2 Li EM NIG MM3 NIG MM2 NIG EM NG MM3 NG MM2 NGEM NJ MM3 NJ MM2 NJ EM SCAD MM3 SCAD MM2 iE-07 0 0 0 0 0 0 0 0 0 0 0 3E-08 0 Prior .23 0 .23 0 1E-6 1E-6 0 .17 0 .21 .17 .21 .23 .36 No prior 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Prior -.18 0 -.18 0 -2E-6 -2E-6 0 0 0 0 0 0 -.18 0 hsa-miR-34c---GAS1 No prior Prior -iE-08 -.49 0 0 0 -.49 0 0 0 -1E-6 0 -2E-6 0 0 0 0 0 0 0 0 0 0 0 0 0 -.49 0 0 48 Chapter 4. Comparing models coupled with prior structural information 4.2.2 Functional enrichment analysis Functional over-representation analysis was performed to objectively iden tify biological processes potentially affected by p53 and miRNA target genes. Specifically, the learned network in previous section is cut into small sub groups. For each group, we test the significance of every possible function assumed in these sub-networks by counting the occurrence number of a func tion with a given gene ontology (GO) annotation. The background is set to be the domain which contains all the genes touched by the prior structure data. A significant p value (p < 0.05) associated with Fisher’s exact test indicates that the observed percentage of the target genes with a given anno tation could not likely occur by chance given the frequency of prior network with the same annotation. Note that we do not use genes covering the whole genome as the background to avoid potential underestimate, because genes filtered by prior network never have a chance to enter the model and thus will have no influence on the learned sub-network. Since apoptosis and growth arrest are common known consequences of p53 activation, we tested whether p53/miRNA tend to target apoptotic (a form of programmed cell death in multicellular organisms) and cell prolif eration related genes. [55] has shown that while cell cycle regulation was among the top in the p53/miRNA targets functional enrichment, apopto sis was not significant. Interestingly, this agrees with our results, i.e. in all scenarios, we found the apoptotic function annotation is not enriched by setting p-value cutoff at 0.05. Hence, we use p53/miRNA target gene functional enrichment analysis to identify specific p53/miRNA that target significant cell proliferation genes. The results for Li model are shown in Figure 4.2 and Table 4.3. The enrichment analysis results for other models are presented in Appendix C Supplementary Materials section: Figure C.1, C.2, C.3, C.4, Table C.6, C.7, C.8 and C.9. Obviously, genes with this enriched function that the EM algorithm has identified concentrate only on p53 regulator, and consequently EM can de tect many genes that have cell proliferation annotation. On the contrary, MM2 and MM3 have a different landscape: they identified many regulators 49 Chapter 4. Comparing models coupled with prior structural information Table 4.3: Identified significant p53/miRNAs that target known cell prolif ACADVL hsa-miR-125a hsa-miR- 155 PAX5 T hsa-miR-27b hsa-miR-296 hsa-mift-302b hsa-miR-372 ACADVL PSMG3 hsa-miR-202 hsa-miR-429 hsa-miR-507 hsa-miR-519l, ACADVL hsa-miR-125a hsa-miR-135a hsa-mifl-181a hsa_miR202* hsa-miR-221 hsa-xniR-27b hsa-miR-410 ACADVL PAX5 hsa-miR- 191 hsa-niiR- 206 hsa-miR-296 hsa-miR-378 hsa_miR524* MM3 hsa-miR-96 SF’l HDAC2 IRF1 MYCN hsa-miR- 186 hsa-miR-195 hsa-mik-202 hsamiR2o2* hsa-miR-31 hsa-mih-34r hsa-mik-429 hsa-miR-495 hsa-miR-499 hsa-miR-518e hsa-miR-7 3.5716-2 4.8986-2 4.3506-2 3.9106-2 3.0616-2 4.4376-2 4.4966-2 4.9606-2 2.1596-2 4.2476-2 4.4556-2 3.7066-2 1.6866-2 3.7036-2 1.2806-2 2.8066-2 1.0306-2 4.1516-2 2.4276-2 1.2296-2 3.8136-3 4.0756-3 1.2576-3 3. 1496-2 9.7236-3 4.9046-2 4.7726-2 3.0866-2 1.2446-2 3.5706-2 4.8086-2 1.1516-2 3.3316-2 2.6576-2 1.0236-2 2.2856-2 3.7546-2 2.4156-2 3.020E-2 9.240E-3 3 .3426-2 2.2396-2 3.4646-2 3. 1696-2 4.9116-2 1.3766-2 M M2 eration genes from Li prior. Method Regulator Total Ioi oredieted tareets cell proliferation p-value Gene names . genes EM TP53 1167 43 5.1526-3 11 68 108 30 9 124 18 161 125 5 6 111 198 60 88 15 88 180 111 66 145 159 128 26 34 98 128 15 36 251 133 41 20 52 90 383 60 99 66 135 116 211 166 153 53 115 2 3 4 6 2 3 2 6 4 2 2 4 5 3 5 3 4 6 7 4 6 6 8 5 7 3 3 2 3 6 3 5 3 8 5 8 4 4 4 7 3 5 5 4 3 5 UUK4rltjHI,rt b[N,1UFbl,k’AXa,ILIA, IL1B,ADRA1D,GAS6,PRL,BUB1B,CD1C6. CTF1 ,IcFBp4cYR61 ,AREG,EGF,FGF2, FGF7,FTH1,MDM4MK167,PIM1,PRKDS ,MAP21C1, RAflSTIL,TGFATXN,GULS,BUE1,EPS8,PLK1, E2F1,IFll6,REST,Tpx2,DLG7,PDGFG,OSM,E2F8, ZEE1,TNFSF13B GDK4,s MAP3IC11,MAPRE2,PESS PTEN,FGF2,SPOGIC1 BCAT1 1L7,FLT3LG,FGF17,CDKN1B,FLT3,HES1 LIF,SUZ12 DAZAPaSUZ12,PLEK9IK1 NR2E1,IGF2BP1 IGF1 GGND2,LIF,DAZAP2,SUZS2,IGF2BP1 NR2EI,GUL3,DAZAP2,SUz12 FTH1,HMOX1 RB1,MDM4 F’TEN,NDN,ST13,GPG3 PTEN,NDN,TIMP2,TOB1,FWAG4 BTG1,HDAG4,MNT PTEN,HDAC4,TSG 101. ADAMTS1,ST638 Gt11C4,CD1C6,Ezr RPS27,MAP3}C1 1,MAPRE2,PES1 DAB2,MNAT1,GDICSR1,EVIS,P1M2,MAPRE2 TGFBI,CFCSSE,0NA12,lRS2,UGHLS,ElThHB3,zAlC PTEN,EPS15,FGF2,GULS Rp54x,RPS27CYRSS,GNAI2.FRAT2,PDGFC RPS4X,EP815PRICD1,STJL,TRIB1,BHLHB3 COL4A3,RPS4X,RPS27,FGF2,PRKD1 ,IRS2,EPS8,HDGFRP3 BcL2,TGFB1,cxculo,ccr4D2,cDlcNlB 1L7FLT3LGFGfl7clD1cNlwFur3,HE5S,5TATSB lOFi R,HES 1 PEEPS IRS2,FOXA1 ,PBEFI NR2ES,IGF2BP1 9CLFS,CCND2,NTN1 FGA,IRS2,IJES 1,5uz12,Hoxclo,TNS3 PBEF1,MA821L2,STATSE TGFE1,PTGS2,EREG,HDAC4,ETS1 ILIE,TNF,TGPE1 IL1B,TNF,TGFE 1,PTGS2,EIF2AIC2JAFC2,OSM,JL1RN GLI3,RE1,TGFB1,TIMP2,LEPRE1 PTEN,E0x82SICAP2,SMAD4,ToE1,HDAc4,GIIERP,T062 PPM1D,GHERP,sEsNl,pnssE PTEN,NDNST13DLGAP2 PTEN,FGF2cULs,SMAD4 TIMP2,SRAP2,3A62,HDAG4,STK38,PDSSB,ETS1 BTG1,NOTGH2,ETS1 NDN,TIMP2TOB1,}4DAG4,SESNS PTEN,SSTR1 NDN,PAWR,ETS1 CULS,GUL1,PPM1D,F0x04 BMP2,MED17,PDS5E PPM1D,cNoT8,SEnSBs,ET51,GJ66 --- - 50 Chapter 4. Comparing models coupled with prior structural information including TFs and miRNAs targeting some genes which have cell prolifera tion annotation, but for each regulator the total number of genes which was known to be cell proliferation is much smaller than that of EM. However, this case only happens with the Li prior model and with the NIG model, only MM3 gets much more enriched regulators, see Table C.7. Again, comparing among priors, we found that results of NJ model for the three algorithms are essentially the same, which agrees with the simulation studies. Most of the models and algorithms have p53 regulator enriched, which makes sense since the cell proliferation is an important function of p53 for suppressing tumor genesis. 51 0Fi gu re 4. 2: Fi sh er ’s ex ac t te st p- va lu es w ith th e L i pr io r. Th e bl ue lin e is th e si m ul at ed u n ifo rm ba ck gr ou nd p- va lu es u n de r n u ll hy po th es is an d ho riz on ta ll in e is th e th re sh ol d a t p- va lu e a t 0. 05 . 0 4 , 4, > 0 (1 o _ 0 50 00 0 10 00 00 15 00 00 20 00 00 In de x 4, 0 a 0 0 O e. 00 2 4 ,0 4 44 + 04 6e +0 4 64 ,0 4 le + 05 In de c (b ) M M 2 (a) EM 4, 4, > a 0) 0 C ,’ t’ ) 0 4 0 0 20 00 0 40 00 0 60 00 0 80 00 0 In de x 12 00 00 (c) M M 3 Chapter 4. Comparing models coupled with prior structural information Figure 4.2 shows the negative logarithm of p-values in a descent order. The larger the log(p.-value), the more significant of corresponding enriched function in the subnetwork. Clearly in Li case, the p-values are skewed and not uniformly distributed, which means the learned associations are unlikely to happen by random chance. EM for Li has many p-values shifted to the non-significance level, and this again agrees with Table 4.3, which reflects that Li with EM is concentrated on identifying a few large sub- networks. For MM2 and MM3, the differences in the tails are not significant as EM. For other priors in general, the different shapes of log(p-value) the plots confirm the different landscapes listed in tables. For example, EM and MM2 are similar to each other with the NIG prior in terms of both the plots and enriched genes (see Figure C.2 and Table C.7), while MM3 significantly differs from those two. Note that we do not perform a multiple comparison correction procedure here. The reason is that we test the significance of each possible function by conditioning on its local sub-network and its annotation, this will introduce very complicated dependency structures among different testings. Thus, the independence assumption of multiple comparison fails here and it is very difficult to control the false positive error rate. For this reason no multiple testing is taken into account, the significance of the p-values should be taken with a grain of salt. Although we correctly inferred the TP53—*GAS1 regulatory relationship in most models, this interaction is not enriched in over-representation anal ysis here. On the other hand, by comparing to miRNAs which target cell prolif eration genes and were captured by [55], we predicted several regulatory interactions that have been reported in the literature as differentially regu lated in various cancerous tissues or cancer cell lines. For example, CKSiB, IRS2, and TGFBI, target genes of miR-181a, overlap the results found in [55]. We did a literature search and there are indeed very recent evidences to support the predicted cell proliferation function [45, 61, 66]. Speficially, overexpression of CKS1B is linked to a poor prognosis in multiple myeloma and contributes to increased p27Kipl turnover, cell proliferation, and a poor 53 Chapter 4. Comparing models coupled with prior structural information prognosis in many tumor types [661. [661 showed that CKS1B influences myeloma cell growth and survival through SKP2- and p27Kipl-dependent and -independent mechanisms and that therapeutic strategies aimed at abol ishing CKS1B function, which in our case is the possible degradation through miR-181a regulation, may hold promise for the treatment of high-risk dis ease for which effective therapies are currently lacking. Very recently, The protein levels of IRS2, insulin receptor substrate-2, has been showed that reduced IRS2 levels in the islets by high dosage chlorpromazine contributes to apoptosis of pancreatic b-cells and decreased proliferation [45]. TGFBI, transforming growth factor /3-induced, can reduce cell proliferation or be down-regulated in certain type of cancers, including bladder cancer [61]. Another example addressed in [55] is miR-195, which was also reported by our model. This miR-195 is important in that it targets genes that may function as both apoptosis and cell proliferation and these two functions are ranked very significantly from their model. Our models, for example Li, have also predicted this miR-195 and part of its target genes are overlapped with [551, i.e. PPMID and SESN1 genes. A recent study also implicated miR-195, along with miR-23a and miR-24, in cardiac hypertrophy and re ported that these miRNAs were regulated in response to stress signaling in the heart [571. GAS6, a gene targeted by p53 and enriched from Li prior with EM al gorithm in Table 4.3, like GAS1 is a member of the growth arrest sequences (GAS) family genes (known modulators of cell cycle progression and sur vival). Serial analysis of gene expression from aggressive mammary tumors derived from transplantable p53 null mouse mammary outgrowth lines re vealed significant up-regulation of GAS6 transcripts. The minimal region of amplification contained genes CUL4a, LAMP1, TFDP1, and GAS6, highly overexpressed in the p53 null mammary outgrowth lines at preneoplastic stages, and in all its derived tumors [3]. As we have shown many examples above, combining the variable selec tion and over-representation analysis can improve the quality of predicted functional interactions, and thus reduce the number of interactions without support from biological interpretations. We presented examples with pre 54 Chapter 4. Comparing models coupled with prior structural information dicted interactions without enriched functions, although these predictions have some direct/indirect biological supports. On the other hand, there are also many predicted interactions with enriched annotations, while those interactions are not found in the literature. Besides, comparing across various models can further enhance and pro vide clues for biological analysis. For example, using EM and sometimes MM2 (e.g. NIG prior) tend to focus on fewer regulators with a large popu lation of target genes, while using MM2 and MM3 can detect more enriched regulators (notably most of them are miRNAs). 55 Chapter 5 Conclusions and discussions In this thesis, our ultimate goal is integrating multiple data sources and us ing statistical methods to construct regulatory networks spanning different molecular levels. The regulatory interactions are modeled as multivariate linear regression models. However, linear model applied to high-dimensional data are usually over-parametrized, unclear for interpretation, and have high variance for estimated parameters. Motivated from this fact, we formulate the problem as fitting a sparse model to reduce its dimensionality. To achieve this, we used penalized regression models and look for the corresponding penalized MLEs (pMLEs). Because the penalization can be alternatively viewed as prior regularization, different functional forms of penalties can be mapped to priors on parameters and the pMLE is also the correspond ing MAP estimator. Based on this interpretation, various scaled mixture Gaussian prior distributions are proposed such that the marginal densities of coefficients/regulatory strengths can be given in closed form. The prac tical question is to determine the MAP estimation. Currently, there exists several optimization approaches. For example, EM and MM algorithm. Ap plying these two methods to a variable selection prior which is unbounded at origin could possibly entail the local optimum problem (although this case rarely happens). To address this issue, we propose a variant MM al gorithm to avoid this. The following simulation studies were performed to compare the performance of various models with the three algorithms. Then, these methods were applied to an E. coli data set with ground truth interactions believed to be known. The performances are measured in terms of precision-recall curve. The results showed that Li MM3, SCA1 MM3, Li EM, and NIG EM methods are performing best among all models, and their performances are comparable to the state-of-the-art regulatory net 56 Chapter 5. Conclusions and discussions work construction algorithm CLR. This shows that a simple linear regres sion network with a properly chosen prior can compete with the current best method. Further, five hub genes were selected with common predic tions from high performance algorithms and we showed that at functional level, the learned networks agree with current literature. Finally, to arrive at our ultimate goal, we coupled our models with collected prior structural information from various databases. A sub-network of the prior network was learned for each model, and we took several important interactions (possibly conjectured) to demonstrate the detecting ability of our models. To identify predicted interactions which are significantly enriched with cell proliferation function annotation, we subsequently did an over-representation analysis of our prediction results. Several interactions were taken as examples to show that some of predictions indeed exert cell proliferation function in abnormal cell lines, as confirmed by literature and agreeing with the other studies. On the variable selection side, beyond the models and methods used in this project there are many other ways to estimate both a compact sub- model and parameters. For parameter estimation post-model-selection, see [47]. Recently, there are also heavy literature volumes concerning simulta neously do model selection and parameter estimation [11, 26, 42, 56, 60, 68], both in frequencist and Bayesian contexts. The combinatorial modeling was discussed at the beginning of the thesis, and we did not implement this combined effects in this project. This is a computational concern. Grouped variable selection may be applied to reduce the optimization burden, but we have no extra time to extend on this topic. It is certainly a future topic worthy to study the combined effects of regulators. To conclude, by comparing different statistical models and integrating multiple data information can improve the reliability and facilitate interpre tation of constructed regulatory networks, in terms of prediction variability and biological functions. 57 Bibliography [1] http: I/genie. weizmann. ac. il/pubs/miro7/miro7_data.html. [2] Ingenuity systems. 2007. [3] Martin C. Abba, Victoria T. Fabrisi, Yuhui Hu, Frances S. Kittrell, Wei-Wen Cai, Lawrence A. Donehower, Aysegul Sahin, Daniel Medina, and C. Marcelo Aldaz. Identification of novel amplification gene targets in mouse and human breast cancer at a syntenic cluster mapping to mouse ch8al and human ch13q34. Cancer Res., 67(9):4104—12, 2007. [4] M. Abramowitz and I. Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Dover Publications, tenth edition, 1972. [5] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716—23, 1974. [6] A Antoniadis. Wavelets in statistics: a reivew (with discussion). J. Italian Statistical Society, 6:97—144, 1997. [7] K. Azra and et al. Combinatorial microrna target predictions. Nat. Genetics, 37:495—500, 2005. [8] Nobelmann B and Lengeler JW. Molecular analysis of the gat genes from Escherichia coli and of their roles in galactitol transport and metabolism. J Bacteriol., 178(23) :6790—5, 1996. [9) P. Billingsley. Probability and Measure. Jone Wiley & Sons, Inc., second edition, 1986. 58 Bibliography [10] Paul E. Blower and et al. Microrna expression profiles for the nci-60 cancer cell panel. Mol Cancer Ther, 6(5), 2007. [11] P. Brown, M. Vannucci, and T. Fearn. Multivariate bayesian variable selection and prediction. J. Royal. Statist. Soc B., 60:627—641, 1998. [12] F. Caron and A. Doucet. Sparse bayesian nonparametric regression. International Conference on Machine Learning(ICML), Accepted, 2008. [13] Reisman D and Logirig WT. Transcriptional regulation of the p53 tumor suppressor gene. Semin Cancer Biol., 8:317—324, 1998. [14] A. Dempster. Covariance selection. Biometrics, 28(1):157—175, 1972. [15] A. Dempster, N. Laird, and D. Robin. Maximum likelihood from in complete data via the em algorithm (with discussion). J. Roy. Statist. Soc. Ser. B, 39:1—38, 1977. [16] B. Efron, T. Hastie, 1. Johnstone, and R. Tibshirani. Least angle re gression. Annals of Statistics, 32(2) :407—499, 2004. [17] Gao. F, Foat. B, and Bussemaker. H. Defining transcriptional networks through integrative modeling of mrna expression and transcription fac tor binding data. BMC Bioinformaticss, 5(31), 2004. [18] J. Faith and et al. Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol., 5(1):e8, 2007. [19] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96(456):1348—1360, 2001. [20] M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1150— 1159, 2003. 59 Bibliography [21] Fanette Fontaine, Eric J.Stewart, Ariel B.Lindner, and François Tad dei. Mutations in two global regulators lower individual mortality in Escherichia coli. Mol Microbiol., 67(1):2—14, 2008. [22] I. Frank and J. Friedman. A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35:109—148, 1993. [231 J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, to be appeared, 2007. [24] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using bayesian network to analyze expression data. J. Computational Biology, 7:601— 620, 2000. [25] W. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397—416, 1998. [26] E. George and R. McCulloch. Variable selection via gibbs sampling. J. Amer. Statist. Assoc., 88(423):881—9, 1993. [27] J. Griffin and P. Brown. Alternative prior distributions for variable selection with very many more variables than observations. Technical Report, University of Kent, 2005. [28] J. Griffin and P. Brown. Bayesian adaptive lassos with non-convex penalization. Technical Report, University of Kent, 2007. [29] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford Unviersity Press, third edition, 2001. [30] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3) :55—67, 1970. [31] J. Huang, B. Morris, and J. Frey. Bayesian inference of microrna targets from sequence and expression data. J. Comp. Biol., 14(14):550—563, 2007. [32] D. Hunter and R. Li. Variable selection using mm algorithms. Annals of Statistics, 33:1617—1642, 2005. 60 Bibliography [33] Hartemink. J, Gifford D, Jaakkola. T, and Young. R. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Nature Reviews Genetics, 6:422—33, 2001. [34] Keene. J. Rna regulons: coordination of post-transcriptional events. Nature Reviews Genetics, 8:533—543, 2007. [35] B. John, A. Enright, A. Aravin, T. Tuschl, C. Sander, and et al. Human microrna targets. PLoS Biol., 2(11):e363, 2004. [36] K. Knight and W Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5):1356—78, 2000. [37] S. Konishi and G Kitagawa. Information Criteria and Statistical Mod eling. Springer, first edition, 2008. [38] Sebum Lee and Hriday K. Das. Inhibition of basal activity of c-jun nh2-terminal kinase (jnk) represses the expression of presenilin-1 by a p53-dependent mechanism. Brain Res., 1207:19—31, 2008. [39] B. Lewis, C. Burge, and D. Bartel. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microrna targets. Cell, 120:15—20, 2005. [40] Buck. M and Lieb. J. Chip-chip: considerations for the design, anal ysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83:349C360, 2004. [41] Eisen. M, Spellman. P, Browndagger. P, and Botstein. D. Cluster anal ysis and display of genome-wide expression patterns. Nature Reviews Genetics, 95:14863—8, 1998. [42] J. Mann and C. Robert. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer-Verlag, first edition, 2007. [43] N. Meinshausen and P. Buhlmann. High-dimensional graphs and vari able selection with the lasso. Annals of Statistics, 34(3):1436—62, 2006. 61 Bibliography [44] M. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319—337, 2000. [45] Sunmin Park, Sang Mee Hong, Ji Eun Lee, So Ra Sung, and Sung Hoon Kim. Abstract chiorpromazine attenuates pancreatic /3-cell function and mass through irs2 degradation, while exercise partially reverses the attenuation. J Psychopharmacol, 2008. [46] D. Pe’er, A. Regev, G. Elidan, , and Friedman N. Inferring subnet works from preturbed expression profiles. Bioinformatics, 17:S215— S224, 2001. [47] B. Potscher. Effects of model selection on inference. Econometric The ory, 7(2):163—85, 1991. [48] Jensen. 5, Chen. G, and Stoeckert. C. Bayesian variable selection and data integration for biological regulatory networks. Ann. Appi. Stat., 1(2):612—633, 2007. [49] G Del Sal, E M Ruaro, R Utrera, C N Cole, A J Levine, and C Schnei der. Gasi-induced growth suppression requires a transactivation independent p53 function. Mol Cell Biol, 15(12):7152—60, 1995. [50] H. Salgado and et al. Regulondb (version 5.0): Escherichia coli k-12 transcriptional regulatory network, operon organization, and growth conditions. Nuclear Acid Res., 34:D394—397, 2006. [51] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461—4, 1978. [52] R. Shalgi, D. Lieber, M. Oren, and Y. Pilpel. Global and local ar chitecture of the mammalian microrna-transcription factor regulatory network. PLoS Comput. Biol., 3(7):e131, 2007. [53] U. Shankavaram and et.al. Transcript and protein expression profiles of the nci-60 cancer cell panel: an integromic microarray study. Molecular Cancer Therapeutics, 6(3) :820—32, 2007. 62 Bibliography [54] J. Shao. Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88:486—94, 1983. [55] Amit U Sinha, Vivek Kaimal, Jing Chen, and Anil G Jegga. Dissecting microregulation of a master regulatory network. BMC Genomics, 9(88), 2008. [56] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58:267—288, 1996. [57] Eva van Rooij, Lillian B. Sutherland, Ning Liu, Andrew H. Williams, John McAnally, Robert D. Gerard, James A. Richardson, and Eric N. Olson. A signature pattern of stress-responsive micrornas that can evoke cardiac hypertrophy and heart failure. Proc Nati Acad Sci U S A, 103(48):18255—60, 2006. [58] W. Wasserman and A. Sandelin. Applied bioinformatics for the identi fication of regulatory elements. Nat. Genetics, 5:276—287, 2004. [59] C. Wei, Q. Wu, and et al. A global map of p53 transcription-factor binding sites in the human genome. Cell, 124(1):207—19, 2006. [60] W. Wilks, S. Richardson, and D. Spiegeihalter. Markov Chain Monte Carlo In Practice. London: Chapman & Hall, first edition, 1995. [61] Wang X, Colby JK, Rengel RC, Fischer SM, Clinton SK, and Klein RD. Overexpression of cyclooxygenase-2 (cox-2) in the mouse urinary bladder induces the expression of immune- and cell proliferation-related genes. Mol Carcinog, 2008. [62] Dakang Xu, Trevor J. Wilson, David Chan, Elisabetta De Lucal, Jiong Zhoul, Paul J. Hertzogl, and Ismail Kola. Etsi is required for p53 transcriptional activity in uv-induced apoptosis in embryonic stem cells. The EMBO Journal, 21:4081C409, 2002. [63] Shiio Y, Yamamoto T, and Yamaguchi N. Negative regulation of rb expression by the p53 gene product. Proc Natl Acad Sci, 89(12):5206— 5210, 1992. 63 Bibliography [64] Stephen Yeung, Jesper Tegnçr, and James Collins. Reverse engineering gene networks using singular value decomposition and robust regression. PNAS, 99(9):6163—8, 2002. [65] A. Zeliner and A. Siow. Posterior odds ratios for selected regression hypotheses. In Bayesian Statistics: Proceedings of the First Interna tional Meeting held in Valencia (Spain), Valencia: University Press, pages 585—603, 1980. [66] Fenghuang Zhan, Simona Colla, Xiaosong Wu, Bangzheng Chen, James P. Stewart, W. Michael Kuehl, Bart Barlogie, and Jr John D. Shaughnessy. Ckslb, overexpressed in aggressive disease, regulates multiple myeloma growth and survival through skp2- and p27kipl- dependent and -independent mechanisms. Blood, 109(11) :4995—5001, 2007. [67] P. Zhao and B. Yu. On model selection consistency of lasso. J. of Machine Learning Res., 7:2541—2563, 2006. [68] H. Zhou and T. Hastie. Regularization and variable selection via the elastic net. J. Royal. Statist. Soc B., 67(2):301—20, 2005. 64 Appendix A Methods A.1 Penalties A.1.1 Information criteria AIC and BIC are motivated from model misspeciflcation and model selec tion area [37]. AIC [5] and BIC [511 corresponds to the penalty function with form p>, = 0.52ll(,3 0), where = and ) = respectively. Thus, the penalization is based on the complexity of assumed model and discontinuous w.r.t. parameter /3. Moreover, it is well known that model se lection procedures based on AIC are inconsistent in terms of the probability of selecting an over-parameterized model is asymptotically positive [47]. A.1.2 Hard thresholding Let = 2 —(1/31— )2I(j/3 < )b). Then this is the hard thresholding penalty [6], which is a smoothed version of entropy penalization. A.1.3 D3 penalty Choosing the £ penalty, i.e. = AIi3V’, is the solution of the bridge re gression [22, 25], in linear regression context. Typically, p e [0,21. Moreover the penalized likelihood has variable selection property when p 1 and the optimization problem of O(.) function in Eq.(2.2) is convex only if p 1. Particularly, p = 0 corresponds to traditional model selection, p = 1 yields LASSO [56] and p = 2 ridge regression [30]. Under £° penalty, Shao [54] has shown that the cross-validated choice of the penalty parameter ) is con sistent for model selection under certain conditions on the size of the testing data set. But this oracle property does not hold for £‘ penalty. Under a 65 Appendix A. Methods set of assumptions (c.f. Assumption 1-6 in [43]), however, Meinshausen and Buhlmann [43] showed that penalty is a consistent (in probability) ap proximation of jointly modeling the dependency in the Gaussian graphical model context [14, 23]. Zhao and Yu [67] also showed the probability of LASSO choosing the true model goes to 1 at the exponential rate, provided the regularity conditions and Strong Irrepresentabie Condition holds. Fur thermore, Knight and Fu [361 showed that LASSO even retains the model estimation consistency and asympototic normality when ? = o(n), where f (n) = o(n) means 1im = 0. Based on these work, we can further extend the consistency of LASSO in a stronger sense, i.e. almost sure (a.s.) convergence. Before establishing the results, the definitions are presented below: Definition A.1.1. (Model Selection Consistency) 1. In probability: P(sgn(/3(n)) = sgn(/3)) — 1, as n —* oo, notated as sgn(/3(n)) sgn(3). 2. Almost sure: P(1im_ sgriã(n)) = sgn(f3))) = 1, notated as sgn(jJ(n)) sgn(/3). Proposition A.1.1. Fix p, under the regularity conditions as in /36]: as n —* cc - C>-O (A.1) — maxxTxj —* 0 (A.2’) fli=1:n where C >- 0, meaning C is a positive definite matrix, and if Strong Irrcp resentable Condition holds /67], then sgnãLl(n)) sgri(/3). The proof is given in Appendix B Proofs, and essentially an application of Borel-Cantelli lemma ([9] p53-57). It reveals the fact that if the incon sistency decreases at a rapid rate when the number of data points tends to infinity, then there could be only a finite number of incorrected estimated models, which implies the a.s. convergence. 66 Appendix A. Methods A.1.4 SCAD penalty Let the continous differentiable penalty function has the form p(O) = A{(6 < A) + O> A)} (A.3) with some a> 2 and 0> 0, then p is called the Smoothly Clipped Absolute Deviation (SCAD) penalty [19]. x is the positive part of x, i.e. x x if x 0; x = 0 if x <0. SCAD penalty is very similar to £‘ penalty in the region where the absolute values of estimated parameters are small. But SCAD penalty is constant for large absolute values of estimated parameters to avoid over-penalization or biasness. See Figure 2.2(b). A.1.5 Bayesian linear regression The penalization for likelihood function has Bayes interpretation. The penalty term can be alternatively viewed as the prior regularization of the sampling model. Therefore, to maximize the penalized likelihood function is equivalent to find the mode of corresponding posterior distribution. For example, LASSO estimator is the posterior mode resulted from Gaussian likelihood coupled with double-exponential prior. Griffin and Brown [27, 28] proposed a family of Bayesian linear regression models using a scale mixture of Gaussian prior on regression coefficients, which includes Normal-Gamma (NG) and Normal-Inverse Gaussian (NIG) model. This hierarchical model family can be extended to include Zeilner’s g-prior [65], the parameter-free Normal-Jeffreys (NJ) prior [20], and LASSO, which is a double-exponential mixture of Gaussian as a special case of NG model. The MAP estimator can be found using EM algorithm [12, 28]. Next we briefly review this hi erarchical Bayesian regression model family. In general, the marginal prior distribution of regression coefficients in this family can be factored as a scale mixture of Gaussian density: = fN(j; 0, r)p(r) dr (A.4) 67 Appendix A. Methods where .iV(x; , o) denotes the normal distribution of x, with mean i and variance .2 Depending on different prior distribution specified for Ti?, we can introduce several different models. It is easy to check that LASSO, NG, and NJ priors satisfy the conditions of Theorem 2.3.1 and they are also singular at the origin [19], i.e. I/3d’ +p(I/3I) is strictly positive and attains minimum at /3 =0. Normal-Jeifreys prior If choosing the Jeifreys’ non-informative prior for r, i.e. p(T?) o -, then we get the NJ prior. The resulting marginal p.d.f. for /3 is (A.5) Normal-Gamma prior If choosing rj where g(x; a, b) is the Gamma p.d.f. with mean and variance . Then the marginal p.d.f. for /3 is given by a 1 = F() () i() (A.6) where K, (/3) is the modified Bessel function of the second kind [4]. Note that if = 1, the NG prior reduces to double-exponential prior, and thus yeilds LASSO model with A And NJ prior is the limiting case when —. 0 and c —* 0. Normal-Inverse Gaussian prior If choosing r IG(-, c), where IG(-, c) is the Inverse-Gaussian p.d.f. Then the marginal p.d.f. for /3 is given by = -exp() ( + I/32) i + I/3I2) (A.7) 68 Appendix A. Methods Table A. 1: Penalizations/log(prior) and their first order derivatives evalu ated at 13i1. Notation: qj = + !/3j12 PA(I/3j1) p(I/3jI) LASSO AI/3’ NJ 1ogI/3I 1*1 3(VjjI) NG /1 a—)logj/3j—logCa ‘Q’/I/I)7? )Ca iQVIi3I) NIG 1ogq — log PC1 (cqj) C0j)q3ACi(cq) SCAD AI/3I(I/3I A) + 12f(& < I/3j1) A{(I/3I A) + + (11_122)] ip < > A)} Table A.2: Properties of sparsity promoting priors. Source: François Caron. Name Range Finite value at 0 Sparsity Convexity of p (i3) Laplace c> 0 yes yes Weakly convex for > 0 NJ None no yes Strictly concave Strictly convex for > 1 NG >0,c>0 a)1 c>0 a<l Weaklyconvexfor=1 Strictly concave for - < 1 NlGauss - > 0, c> 0 yes no Because of lim,o(I/3jI+p(I/3j)) = 0, the NIG prior does not satisfy the variable selection criterion. Finally, Table A. 1 summarizes these penalizations/log(prior) and their corresponding first order derivatives and table A.2 summarizes the proper ties of sparsity promoting priors (also including NIG prior). 69 Appendix B Proofs Before proving the theorems, we quote and present some useful lemmas. B.1 Lemmas Lemma B.1.1. (c.f. [29] p310) Let X, X be r.v. ‘s. Fix e> 0, let A(e)={IX—Xl >€} If A(e) <00, Ve, then X Q4 X. B.2 Proof of Proposition A.1.1 Proof. Define Ve> 0 A(e) = {sgn(/L1(n)) - sgn(/)I > e} (B.1) By Lemma B.1.1, hence suffices to show A(e) <cc. Under the hypoth esis, Zhao and Yu [67) showed for some 0 c < 1 P(sgn(&(n)) = sgn8)) = 1 — o(e_nC) (B.2) But it is straightforward to check that = (1 — P(sgn(A(n)) sgn(3))) = o(e) <cc Li 70 Appendix B. Proofs B.3 Proof of Theorem 2.3.1 Proof. For part (1), clearly 4)3(80,80) =p3(IOoI). P\3(IOI) and 4)3(8,80) are even functions on Let x > 0. Consider (x) = d[4)3(x,8o)—pIxD] - xp((I9oI - p(x + ) + p(x + f) — IOoI+e - p((IOoI+f)+) p(x+€) — 801+e Now for 0 e (0, oo), letting x 0, we have i.(0) = lim(x) xL0 - p((I0oI+E)+) p((0+E)+) — IO0I+E Note that under hypothesis, it is straightforward to conclude that Vf> 0, p (0+) . .is non-decreasing, positive of 0 > 0. So, L(O) 0 for 0 < 9 < 8o1; and z(0) > 0 for 8 > 1901. Hence, 4)3(x, 8) — p3(x) attains its minimum at 1801, and therefore at ±jOo. For part (2), following definition IP3(8)PA(O)I = pA(I0I+f)_p(I8I)_ft)dt 101 ‘ IPA(I8I+f)-PA(I9DI+f dt I0I Ip(IOI + f) - PA(IOI)I + Ep(€+) f dt = 71 Appendix B. Proofs By the piecewise differentiability assumption of p(.) on (0, cc), we have p+) < cc, Vf > 0. Further, by compactness assumption, p(.) is uni formly continuous on C. Hence, sending € j. 0 yields claimed result. 72 Appendix C Supplementary Materials 73 Appendix C. Supplementary Materials Table C.1: E.coli Transcription factors in the 60% precise network with p 5 predicted operon targets by CLR algorithm. Regulator Targets in RegulonDB Targets # inferred by CLR flhC_bi89Lat 30 20 fihD_b1892_at 46 17 fiiA_b1922_at 42 40 gatR2_b2090±at 6 6 glcC_b2980_at 5 10 hycA_b2725_at 7 10 lexA.b4043_at 16 6 rcsB_b2217_at 11 5 rhaR..b390&at 5 9 rhaS.b3905_at 5 7 tdcR_b3119_at 7 17 yhiE_b3512_at 5 ii yhiW.b3515.at 4 6 yhiX_b3516_at 13 8 Table C.2: E.coli Transcription factors in the 60% precise network with p 5 predicted operon targets by Li prior, with MM3 algorithm. Regulator Targets in RegulonDB Targets # inferred by Li MM3 cbl_b1987_at 9 14 fecl_b4293_at 6 28 fliA_bi922_at 42 43 gatR.2_b2090±at 6 7 hycA_b2725_at 7 9 lexA_b4043..at 16 7 nac..bi988_at 12 9 narL_bi22i_at 84 7 yhiE_b35i2.at 5 10 ylcA.bO57Lat 4 5 74 Appendix C. Supplementary Materials Table C.3: E. coli rIIanscription factors in the 60% precise network with p 5 predicted operon targets by SCAD prior with MM3 algorithm. Regulator Targets in RegulonDB Targets # inferred by SCAD MM3 araC.b006&at 8 6 cbl_b1987_at 9 14 fecl_b4293_at 6 28 fiiAb1922_at 42 44 gatR_2.b2090i_at 6 7 hycA_b2725_at 7 9 lexA_b4043at 16 7 nac_b1988_at 12 12 narL_bl22Lat 84 7 yhiE_b3512_at 5 11 ylcA.bO57Lat 4 5 Table C.4: E.coli Transcription factors in the 60% precise network with p 5 predicted operon targets by Li prior with EM algorithm. Regulator Targets in RegiilonDB Targets # inferred by Li EM araC.b0064.at 8 6 cbl_b1987_at 9 10 fecl_b4293_at 6 28 fliA.b1922_at 42 44 gatR.2.b2090.J_at 6 7 hycA_b2725_at 7 14 lexA_b4043.at 16 7 1rpb0889_at 61 5 nac_b1988_at 12 9 narLbl22Lat 84 5 yhiE_b3512at 5 10 ylcA_b0571_at 4 5 75 Appendix C. Supplementary Materials Table C.5: E.coli Transcription factors in the 60% precise network with p> 5 predicted operon targets by NIG prior with EM algorithm. Regulator Targets in RegulonDB Targets # inferred by NIG EM fecl_b4293_at 6 19 fliA_b1922_at 42 33 gatR_2±2090_f_at 6 6 hycAb2725_at 7 10 lexA_b4043_at 16 5 lrp_b0889_at 61 6 nac_b1988_at 12 15 yhiE_b3512_at 5 7 ylcA_b0571_at 4 5 76 - . 4 - 1 a, 0 Fi gu re C. 1: Fi sh er ’s e x a c t te st p- va lu es w ith th e N G pr io r. Th e bl ue lin e is th e si m ul at ed u n ifo rm ba ck gr ou nd p- va lu es u n de r n u ll hy po th es is an d ho riz on ta ll in e is th e th re sh ol d a t p- va lu e a t 0. 05 . > a 0) 0 V C > a 0, 0 C 50 00 10 00 0 In de x 0 (a ) EM 0 50 00 10 00 0 15 00 0 20 00 0 25 00 0 30 00 0 In de x I N V (b ) M M 2 0 50 00 10 00 0 15 00 0 20 00 0 25 00 0 In de x N - 0 (c ) M M 3 Fi gu re C. 2: Fi sh er ’s ex ac t te st p- va lu es w ith th e N IG pr io r. Th e bl ue lin e is th e si m ul at ed u n ifo rm ba ck gr ou nd p- va lu es u n de r n u ll hy po th es is an d ho riz on ta ll in e is th e th re sh ol d a t p- va lu e a t 0. 05 . 0 C 44 C! a a 0 04 0 C ! 4, 0 ) 0 C ! 0 0 0 50 00 0 10 00 00 15 00 00 20 00 00 0 50 00 0 10 00 00 15 00 00 20 00 00 25 00 00 In de x (a) EM go 40 a a 0 (4, In de x (b) M M 2 0 00 40 0 0 50 00 10 00 0 15 00 0 In de x (c) M M 3 Fi gu re C. 3: Fi sh er ’s ex ac t te st p- va lu es w ith th e N J pr io r. Th e bl ue lin e is th e si m ul at ed u n ifo rm ba ck gr ou nd p- va lu es u n de r n u ll hy po th es is an d ho riz on ta l lin e is th e th re sh ol d a t p- va lu e a t 0. 05 . 15 00 0 20 00 0 25 00 0 30 00 0 In de x N 4) 0 a 0 - I- 4)1 0 50 00 10 00 0 15 00 0 20 00 0 25 00 0 30 00 0 In de x (a) EM . 4) 4) > 0 a 0 0 0 0 0 0 10 00 0 15 00 0 20 00 0 25 00 0 30 00 0 CI ) I (0 In de x (b ) M M 2 — 4 cc C’) C N 0 0 50 00 10 00 0 (c) JV IM 3 Fi gu re C. 4: Fi sh er ’s ex ac t te st p- va lu es w ith th e SC A D pr io r. Th e bl ue lin e is th e si m ul at ed u n ifo rm ba ck gr ou nd p- va lu es u n de r n u ll hy po th es is an d ho riz on ta l lin e is th e th re sh ol d a t p- va lu e a t 0. 05 . a 0 0 C . 0 0. 01 0 o 20 00 0 80 00 0 10 00 00 In de x (a) M M 2 O e+ 00 20 +0 4 40 +0 4 6a +0 4 80 +0 4 10 ÷0 5 In de x (b ) M M 3 Appendix C. Supplementary Materials Method Table C.6: Identified significant p53/miRNAs that target known cell proliferation genes from NO prior. Regulator p-value Gene namesTotal # of Cell proliferation predicted targets genes TCFBI,ICFBP4,CYR61,AREG,EPS8 TP53 70 9 1.586E-04 1F116,TPX2,DLG7,PDGFC ACADVL t 1 3.571E-02 TYR hsa-miR-130b 16 3 7.496R-04 CUL5,PDGFC,ChGn hsa-miR-145 15 2 2.803E-02 FgCN1,ptlGFC hsa-miR-155 12 2 1.178E-02 FCF2,SPOCK1 hsa-miR-lSa 4 1 4.726E-02 FGF2 hsa-miR-181a 7 2 1.572E-02 ONAI2,UCHL1 hsa-miR-221 24 3 4.478E-03 CYR6t,FRAT2,PDGFC hsa-miR-339 3 1 2.212E-02 RLF4 hsa-miR-367 16 2 1.348E-02 HDGFRP3,BHLHB1 hsa-miR-429 9 2 8.134E-03 CYR61,IRS2 hsa-miR-485-Sp 3 1 4.950E-02 5P0C1C1 EM hsa-miR-513 18 2 3.284E-02 EP88,BHLHB3 hsa-miR-522 18 2 4.320E-02 CYR61,EP88 ACADVL 1 1 3.846E-02 5MARCA4 hsa-miR-1 13 2 3.358E-03 1R92,FOXA1 hsa-miR-30c 7 2 2.724E-03 GNAI2,PBEF1 hsa-miR-375 8 1 5.000E-02 LRPS hsa-miR-410 17 2 1.030E-02 IRS2,HOXCIO R2F1 6 2 2.293E-02 IL1B,TGFB1 MYCN 10 2 1.704E-02 TIMP2,LEPRE1 hsa-let-7i 3 1 8.772E-03 ETS1 hsa-miR-2025 2 1 4.293E-02 SMAD4 hsa-miR-33 8 1 3.309E-02 HDAC4 hsa-miR-506 18 2 2.643E-02 PCAF,ETg1 hsa-miR-518c5 6 1 3.877E-02 ETS1 hsa-miR-7 14 2 1.920E-02 CNOT8,ET31 TPS3 0 6 E-04 TG1Z’BI,IGFEP4,CYR6I,AREG,MK167, 1 5 10 8.1 7 EP58,1fl16,TPx2,DLG7,PDGFC ACADVL 1 1 3.571E-02 TYR hsa-miR-130b 24 2 3.632E-02 CULS,ChCn hsa-miR-145 19 2 4.388E-02 FgCN1,1RS2 hsa-miR-155 13 2 1.383E-02 FCF2,5POCK1 hsa-miR-181a 13 3 4.940E-03 GNAI2,IRSI,UCHL1 hsa-miR-221 31 4 5.859E-04 CYR61,CNAI2,FRAT2,PDGFC hsa-miR-27b 27 2 4.676E-02 TRIB1,BHLHB3 hsa-miR-320 16 2 1.532E-02 PTEN,BHLHR3 hsa-miR-339 4 1 2.963E-02 ELF4 hsa-miR-367 26 2 3.47tE-02 HLICFRP3,EHLI4B3 hsa-miR-429 22 2 4.640E-02 CYR61,IRS2 hsa-miR-488 10 2 4.237E-02 SPOCIC1,HDGFRP3 hsa-miR-5175 4 1 3.695E-02 CPC4 hsa-miR-518f 6 2 9.557E-03 PTEN,SPOCK1 ACADVL 1 1 3.846E-02 SMARCA4 hsa-miR-1 22 2 9.713E-03 1R82,FOXA1 MM2 hsa-miR-145 19 2 2.181E-02 1R82,PEEF1 hsa-miR-181a 13 2 4.727E-03 GNAI2,1R82 hsa-miR-206 18 2 6.SO1E-03 IRS2,PEEF1 hsa-miR-221 31 2 3.805E-02 GNAI2,PDGFC hsa-miR-30c 20 2 2.263E-02 GNAI2,PBRF1 hsa-miR-410 21 2 1.564E-02 1R82,HOXC1O MYCN 13 2 2.872E-02 TIMP2,LEPRE1 hsa-let-7i 12 1 3.509E-02 ETS1 hsa-miR-144 23 2 4.463E-02 TOBt,BTG3 hsa-miR-186 31 3 3.672E-03 PTEN,HOXB2,gKAP2 hsa-miR-219 5 1 3.546E-02 BMP2 hsa-miR-26a 23 3 1.682E-03 TIMP2,gKAP2,HDAC4 hsa-miR-26b 24 2 2.989E-02 PTEN,TIMP2 81 Appendix C. Supplementary Materials Table C.6: (continued) Method Regulator Total # of Cell proliferation p-value Gene names predicted targets genes hsa-miR-33 19 2 1.487E-03 NDN,HDAC4 hsa-miR-34c 20 2 8.619E-03 NOTCH2,ETSI hsa-miR-429 22 3 7.382E-04 NDN,TIMP2,HDAC4 hsa-miR-492 2 1 t.942E-02 ADAMTS1 hsa-miR-498 9 2 1.332E-03 PTEN,SESN1 hsa-miR-506 19 2 2.936E-02 PCAF,ETS1 hsa-miR-518c5 7 1 4.516E-02 ETS1 hsa-miR-7 17 2 2.805E-02 CNOT8,ETS1 TGFEI,IGFBP4,CYR61,AREG,MR167, TP53 83 10 1.138E-04 EPS8,1F116,TPX2,DLC7,PDGFC ACADVL 1 1 3.571E-02 TYR hsa-miR-130b 19 3 1.275E-03 CUL5,PDCFC,ChGn hsa-miR-145 16 2 3.172E-02 FSCN1,PDGFC hsa-miR-155 11 2 9.875E-03 FGF2,SPOCR1 hsa-miR-181a 6 2 1.142E-02 GNAI2,UCHL1 hsa-miR-221 28 4 3.867E-04 CYR61,GNAI2,FRAT2,PDGFC hsa-miR-339 3 1 2.222E-02 ELF4 hsa-miR-367 21 2 2.300E-02 HDGFRP3,BIILHB3 hsa-miR-429 14 2 1.965E-02 CYR61,IRS2 hsa-miR-485-5p 3 1 4.950E-02 SPQCIC1 hsa-miR-488 7 2 2.095E-02 SPOCE1,HDGFRP3 hsa-miR-513 22 2 4.805E-02 EPS8,EHLHB3 ACADVL 1 1 3.846E-02 SMARCA4 hsa-miR-145 16 2 1.559E-02 PEEFI,PDCPC hsa-miR-221 28 2 3.129E-02 GNAI2,PDCFC hsa-miR-23b 21 2 3.560E-02 IRS2,STATSB MM hsa-miR-30c 11 2 6.949E-03 GNAI2,PBEF1 hsa-miR-410 18 2 1.154E-02 IRS2,HQXC1O ACADVL 5 2 3271E-02 MDM4,TSC1O1 MYCN 13 2 2.872E-02 TIMP2,LEPRE1 hsa-let-7i 8 1 2.339E-02 ETS1 hsa-miR-186 21 2 2.012E-02 PTEN,HOXB2 hsa-miR-26a 16 2 1.361E-02 TIMP2,HDAC4 hsa-miR-33 15 2 9.134E-04 NDN,HDAC4 hsa-miR-429 14 3 1.803E-04 NDN,TIMP2,HDAC4 hsa-miR-492 1 1 9.709E-03 ADAMTS1 hsa-miR-506 17 2 2.365E-02 PCAF,ETS1 hsa-miR-7 14 2 1.920E-02 CNOT8,ETS1 Table C.7: Identified significant p53/miRNAs tbat taeget known cell proliferation genes from NIG prior. Method Regulator Total # of Cell proliferation p-value Gene names predicted targets genes CDE4,PTCH1,PTEN,TCFBI,PAX3,IL1A, IL1B,ADRA1D,GAS6,PRL,BUB1B,CDK6, CTF1,IGFBP4,CYR61,AREG,CKS1E,EGF, FGF2,FGF7,FTH1,MDM4,MKJ67,PIM1, TP53 1261 45 5.680E-03 PRKD1,MAP2K1,RAF1,STIL,TGFA,TXN, CUL1,BUB1,EPS8PLIC1,CSF1R,E2F1, IF’I16,REST,TPX2,DLG7,PDCFC,OSM, EM E2F8,ZEB1 ,TNFSF13B KIT,PAX3,IGF1,BCL2,TGFBI,ADRA1D, IGF1R,CDC6,CTF1,EGF,F2R,FGF7, TP53 1261 25 1 .886E-02 LIF,0DC1,ST8SIA1,TTK,CDKN1B,MYCN, 82 Appendix C. Supplementary Materials Table C.7: (continued) Cell proliferation p-value genes TGFBI,ICFBP4CYR6J ,AREC,MlC167, l.101E-03 EPS8,1F116,TPX2DLG7,PDCFC Gene namesMetbod Regulator Total # of predicted targets HES1,PURA,PDGFCITGB1,TNFSF13B, SERTAD1,BIftC6 CDK4,PTCH1,PTEN,TGFBI,PAX3,IL1R, ADRA1D,GAS6,PRL,BUB1B,CTF1 ,IGFBP4, CYR61,AREG,CKS1REGF,FGF7,FTH1, TP53 1181 41 4.859E-02 MDM4,MK167,F’RKD1MAP2K1RAF1STIL, TGFA,TXNCUL1,BIJB1,EPS8,PLK1,CSF1R, E2F1,1F116,REST,TPX2DLG7,PDGPC, OSM,E2F8,ZEB1,TNFSF13R TGFBI,IL2RA,IL1AIL1B,ADRA1DADRA1B MM2 - PRL,CTF1,0A32,AREG,EGF,FGF2, FOS 421 29 4.859E-02 GCG,GNB1MAP2K1,RAF1,TGFA,TGFR3, TXN,1R52,CIJICSR1 ,E2F1,MPL,NAB2, 51008,Ct1160,FZD3,OSM,SFRF2 KIT,FAX3,IGF1 BCL2,TCFR1,ADRA1D, CDC6,CTF1ECF,F2R,FCF7,LIF, TP53 1181 24 2. 149E-02 ODCI.ST8SIA1TTK,CDKNI6,MYCN, HES1,PURA,PDCFC,ITCBI,TNFSF13B, SERTAD1,BIRC6 TP53 53 6 4.340E-03 TGFBI,CYR61,AREG,TPX2,OLG7,PDGFC bsa-miR-135b 3 1 4.822E-02 0A52,DAR2 bsa-miR-155 8 2 5.123E-Q3 FGF2,SPOCR1FGF2,SFOCK1 bsa-miR-lSa 3 1 3.561S-02 FGF2,FGF2 hsa-miR-181a 3 2 2.404E-03 GNAI2,UCHL1 ,GNAI2.UCHL1 hsa-miR-221 19 3 2.214E-03 CYR6IFRATZPDCFC,CYR6I,FRAT2,PDGFC hsa-milt-302c 9 1 3.688E-02 GPC4,GFC4 hsa-mifl-367 9 2 4.18881-03 HDGFRP3BHLHB3,HDGFRP3,RHLHB3 hsa-miR-485-Sp 3 1 4.95081-02 SPOCIC1,SPOCFC1 hsa-miR-513 12 2 1.48481-02 EPS8,BHLHB3,EPS8,EHLHB3 hsa-miR-522 16 3 2.73081-03 DAB2,CYR61,EPS8,DAB2,CYR61,EPS8 ACADVL 1 1 3.84681-02 SMARCA4 hsa-miR-128a 8 1 3.43781-02 PDGFC MM3 bsa-miR-128b 9 1 3.86281-02 FDGFC hsa-miR-181a 3 1 2.72181-02 GNAI2 bsa-miR-191 4 1 4.46081-02 PBEF1 hsa-miR-375 5 1 3.12581-02 LRPS hsa-miR-384 3 1 3.00781-02 IGF2BF1 hsa-mift-410 10 2 3.50681-03 IRS2,HOXCIO bsa-mift-429 6 1 3.99481-02 IRS2 hea-let-7i 1 1 2.92481-03 EITS1 hsa-miR-106b 3 1 3.69081-02 TXNIP hsa-miR-181a 3 1 3.39381-02 NOTCH2 hsa-miR-181b 4 1 4.50681-02 ETS1 bsa-miR-222 4 1 2.87881-02 ETS1 bsa-miR-33 7 1 2.89881-02 HIDAC4 hsa-miR-506 9 2 6.58581-03 FCAF,ETS1 Table CS: Identified significant p53/miRNAs tbat target known cell proliferation genes from NJ prior. TP53 109 10 Method Regulator Total # of Cell proliferation p-value Gene names predicted targets genes 83 Appendix C. Supplementary Materials Method Regulator Total # of Table C.8: (continued) Cell proliferation p-value Gene names predieted targets genes —________________________________________________ EGR1 22 7 4.628E-02 1L18,IRS2,NAB2,PDGFC ACADVL 1 1 3.571E-02 TTh hsa-miR-130b 26 2 4.222E-02 CUL5,ChGn hsa-miR-155 13 2 1.383E-02 FGF2,SPOCK1 hsa-miR-181a 14 4 3.999E-04 GNAI2,IRS2,UCHL1,EI4LHB3 hsa-miR-181r 7 2 l.572E-02 UCHL1,BHLHB3 hsa-miR-221 31 4 5.859E-04 CYR6IGNAI2,F’RAT2PDGFC hsa-miR-339 5 1 3704E-02 ELF4 hsa-miR-367 27 2 3.730E-02 IILIGFRP1,BHLHB3 hsa-miR-488 10 2 4.237E-02 SPOCK1,HDGF’RP3 hsa-miR-5175 4 1 3.695E-02 GPC4 hsa-miR-517c 4 2 6.162E-03 AREG,ChGn hsa-miR-518f 6 2 9.557E-03 PTEN,SPOCK1 hsa-miR-522 38 4 4.423E-03 DAB2,CYR61,EPS8,PDGFC hsa-miR-526b 14 2 2.832E-02 ChGn,TXNDC1 ACADVL 1 1 3.846E-02 SMARCA4 hsa-miR-1 22 2 9.713E-03 IRS2,FOXA1 hsa-miR-145 21 2 2.645E-02 IRS2,PBEF1 hsa-miR-181a 14 2 5.498E-03 GNAI2TRS2 hsa-miR-206 19 2 7.247E-03 IRS2,PBEF1 hsa-miR-221 31 2 3.805E-02 GNAI2,PDGFC hsa-miR-30c 20 2 2.263E-02 CNAI2,PBEF1 hsa-miR-410 20 2 1.421E-02 IRS2,HOXC1O MYCN 13 2 2.872E-02 TIMP2,LEPRE1 hsa-lct-7i 14 1 4.094E-02 ETS1 hsa-miR-144 24 2 4.828S-02 TOB1,BTG3 hsa-miR-186 33 3 4.403E-03 PTEN,HOXB2,SKAP2 hsa-miR-219 5 1 3.546E-02 BMP2 hsa-miR-26a 24 3 1912E-03 TIMP2,SKAP2,HDAC4 hsa-miR-26b 24 2 2.989E-02 PTEN,TIMP2 hsa-miR-33 19 2 1.487E-03 NDN,HDAC4 hsa-miR-34r 19 2 7.772E-03 NOTCH2,ETS1 hsa-miR-429 23 3 8.455E-04 NDN,TIMP2,HDAC4 hsa-miR-492 2 1 1.942E-02 ADAMTS1 hsa-miR-498 8 2 1.036E-03 PTEN,SESN1 hsa-miR-506 19 2 2.936E-02 PCAF,ETS1 hsa-miR-518c5 7 1 4.51691-02 ETS1 hsa-miR-7 17 2 2.80591-02 CNOT8,ETS1 TGFBI,ICFBP4,CYR61,AREG,MK167, TPS3 109 10 1.10191-03 EPS8,1F116,TPX2,DLG7PDGFC EGRI 22 4 4.628E-02 IL1B,1R82,NAB2,PDGFC ACADVL 1 1 3.571E-02 TYR hsa-miR-130b 27 2 4.53091-02 CUL5,ChGn hsa-miR-155 13 2 1.38391-02 FGF2,SPOCK1 hsa-miR-181a 14 4 3.999E-04 GNAI2191S2,UCHL1,BNLHB3 hsa-miR-181c 7 2 1.57291-02 UCI-1L1,BHLHB3 hsa-miR-221 31 4 5.85991-04 CYR6I,GNAI2,FRAT2,PDGFC hsa-miR-339 5 1 3.704E-02 ELF4 hsa-miR-367 27 2 3.730E-02 HDGFRP3,BHLHU3 hsa-miR-429 22 2 4.640E-02 CYR61,IRS2 hsa-miR-488 10 2 4.237E-02 SPOCIC1,HDGFRP3 hsa-miR-5175 4 1 3.69591-02 GPC4 hsa-miR-517c 4 2 6.162E-03 AREG,ChCn hsa-miR-518f 6 2 9.55791-03 PTEN,SPOCK1 hsa-miR-522 38 4 4.423E-03 DAB2,CYR61,EPS8,PDGFC hsa-miR-526b 14 2 2.832E-02 ChGn,TXNDC1 ACADVL 1 1 3.846E-02 SMARCA4 MM2 hsa-miR-1 22 2 9.71391-03 1R82,FOXA1 hsa-miR-145 21 2 2.64591-02 IRS2,PBEF1 hsa-miR-181a 14 2 5.498E-03 GNAI2,1R82 84 Appendix C. Supplementary Materials Table 0.8: (continued) Method Regulator Total # of Cell proliferation p-value Gene names peedieted targets genes hsa-miR-206 19 2 7.247E-03 JRS2,PBEF1 haa-miR-221 31 2 3.805E-02 GNAI2,PDGFC haa-miR-30e 20 2 2.263E-02 GNAI2,PBEF1 haa-miR-410 20 2 1.421E-02 1R82,HOXC1O MYCN 13 2 2.872E-02 TIMP2,LEPRE1 hsa-let-7i 12 1 3.509E-02 ETS1 haa-miR-144 23 2 4.463E-02 TOB1,BTG3 haa-miR-186 35 3 5.217E-03 PTEN,IIOXB2,SI<AP2 haa-miR-219 5 1 3.546E-02 BMP2 haa-miR-26a 24 3 1.912E-03 TIMP2,SKAP2,HDAC4 hsa-miR-26b 25 2 3.230E-02 PTEN,TJMP2 hsa-miR-33 20 2 1.653E-03 NDN,HDAC4 haa-miR-34e 19 2 7.772E-03 NOTCH2,ETS1 hsa-miR-429 22 3 7.382E-04 NDN,TIMP2,HDAC4 hsa-miR-492 2 1 1.942E-02 ADAMTSI haa-miR-498 9 2 1.332E-03 PTEN,SESNI hsa-miR-506 20 2 3.241E-02 PCAF,ETS1 bsa-miR-518c5 7 1 4.516E-02 ETS1 hsa-miR-7 17 2 2.805E-02 CNOT8,ETS1 TP TGFBI,IGFBP4,CYR61,AREGM1C16753 109 10 1.1O1E-03 RPS8,1F116,TPX2,DLG7,PDGFC EGR1 22 4 4.628E-02 IL1B,IRS2,NAB2,PDGFC ACADVL 1 1 3.57191-02 TYR haa-miR-130b 27 2 4.53091-02 CULS,ChGn hsa-miR-155 13 2 1.383E-02 FGF2,SPOCK1 hsa-miR-181a 14 4 3.999E-04 GNAI2,IRS2,UCHL1,BHLHB3 hsa-miR-181e 7 2 1.57291-02 UCHL1,BHLHR3 haa-miR-221 31 4 5.85991-04 CYR61,GNAI2,FRAT2,PDGFC haa-miR-339 5 1 3.70491-02 ELF4 haa-miR-367 27 2 3.73091-02 HDGFRP3,BHLHB3 hsa-miR-429 22 2 4.64091-02 CYR61,IRS2 bsa-miR-488 10 2 4.23791-02 SPOCK1,HDGFRP3 hsa-miR-5175 4 1 3.69591-02 GPC4 hea-miR-517e 4 2 6.162E-03 AREG,ChGn haa-miR-518f 6 2 9.55791-03 PTRN,SPQCR1 hsa-miR-522 38 4 4.42391-03 DAB2,CYR61,EPS8,PDGFC hsa-miR-526b 14 2 2.83291-02 CbGn,TXNJDC1 ACADVL 1 1 3.84691-02 SMARCA4 MM3 hsa-miR-1 22 2 9.71391-03 IRS2,FOXA1hsa-miR-145 21 2 2.64591-02 JRS2,PBEF1 hsa-miR-181a 14 2 5.49891-03 GNAI2,IRS2 hsa-miR-206 19 2 7.24791-03 1R52,PBEF1 hsa-miR-221 31 2 3.80591-02 GNAI2,PDGFC hsa-miR-30e 20 2 2.26391-02 GNAI2,PBEF1 hsa-miR-410 20 2 1.42191-02 IRS2,HOXC1O MYCN 13 2 2.87291-02 TJMP2,LEPRE1 hsa-let-7i 12 1 3.50991-02 ETS1 hsa-miR-144 23 2 4.46391-02 TOB1,BTG3 hsa-miR-186 35 3 5.21791-03 PTEN,HOXB2,SKAP2 hsa-miR-219 5 1 3.54691-02 BMP2 hsa-miR-26a 24 3 1.91291-03 TIMP2,SSCAP2,HDAC4 hsa-miR-26b 25 2 3.23091-02 PTEN,TIMP2 hsa-miR-33 20 2 1.65391-03 NDN,HDAC4 hsa-miR-34c 19 2 7.77291-03 NOTCH2,ETS1 haa-miR-429 22 3 7.38291-04 NDN,TIMP2,HDAC4 haa-miR-492 2 1 1.94291-02 AL)AMTS1 haa-miR-498 9 2 1.33291-03 PTEN,SESN1 hsa-miR-506 20 2 3.24191-02 PCAFETS1 hsa-miR-518e5 7 1 4.51691-02 ETS1 85 Appendix C. Supplementary Materials Table CS: (continued) Method Regulator Total # of Cell proliferation p-value Gene names predicted targets genes hsa-miR-7 17 2 2.805E-02 CNOT8,RTS1 Table C.9: Identified significant p53/miRNAs that target known cell proliferation genes from SCAJD prior. FOXO1 ACADVL AC ADVL hsa-miR-135b hsa-miR-lSa hsa-miR-200c RB1 MM2 EGR1 PAX5 hsa-miR-106b hsa-miR-130b hsa-miR-181b hsa-miR- 186 hsa-miR-296 hsa-miR-301 hsa-miR-30b hsa-miR-330 hsa-miR-372 hsa-miR-452 hsa-miR-520a5 hsa-miR-522 hsa-miR-9 HES1 LITAF ACADVL ACADVL FOSS HDAC3 ACADVL IRF1 MYCN ACADVL hsa-miR-31 hsa-miR-34c hsa-miR-507 hsa-miR-519b ACADVL ACADVL hsa-miR-125a hsa-miR-135b hsa-miR-15S hsa-miR-181a PTCH1,TGFHI,PAX3,IL1BCTF1,IGFSF4, 2 CYR61,AREG,MK167,PIM1,PRICD1,STIL, 1 2.503 02 TXN,BUS1,EPS8,1F116,REST,TPX2, DLG7,PDGFC,ZEB1 4 2.638E-02 PTEN,FRL,IRS2,1F116 4 4.859E-02 PAX3,DAB2,ELF4,PIM1 PTEN,PAX3,ADRA1D,ADRA1B,CTF1, 9 2.0855-02 EGF,15G20,PIM1,CSF1R 6 3.9355-02 RPS4X,RP527,DAB2,MNAT1,CDKSR1,MAPRE2 4 2.095E-02 RPS4X,RP527FGF2,FZD3 5 1.7165-02 FTEN,RPS4X,RP527,RPS1S,MAFRE1 ICIT,IGF1,TGFB1,IGF1R,CDC6, 9 1.741 02 CCND2,CDKN1B,MYCN,5UZ12 9 1.9265-02 TGFB1,EGF,F2R,FLT1,LYN,TIMP1,1R52,PDGFC,CHRM1 7 1.5095-02 1L7,FLT3LGFGF17,CDICN1B,FLT3,HES1,STATSB 7 2.6115-02 BCL2,CCND2,LIF,AKAPS,POU3F2,DERL2,IGF2BP1 6 4.2125-02 IGF1,SCL2,CDKN1S,POU3F2,5UZ12,IGF2BP1 6 6.323E-03 BCL2,LIF,CUL3,AKAPS,POU3F2,DAZAP2 7 3.2945-02 IC4F1R,ITPR1,CDKN1B,AICAP5,P01J3F2,5UZ12,IGF2BP1 2 3.526E-02 NR2E1,IGF2BP1 7 2.6395-03 IGFI,BCL2,IGF1R,CDKN1B,5UZ12,PDGFCIGF2BP1 7 1.3855-02 GLDC,KLFS,1R52,5UZ12,GLI2,BIRC6,PLEICHK1 4 4.9695-02 MAB21L1,POU3F2,STATSB,IGF2BP1 S 5.4965-03 IGF1R,NR2E1,CIJL3,DAZAP2,5UZ12 3 3.7815-02 CCND2,GNAI2,1R52 4 4.5485-02 IGF1R,CCND2.5UZ12,IGF2RP1 S GLDC,IGF1,CCND2,GNAI2,CD1CN1B,9 1.486 -02 5U312,PDGFC,IGF2SP1,PLEFCHK1 6 2.4365-02 CCND2,NR2E1,FGF18CDRNIB,PURA,ANGEL1 3 2.329E-02 AGT,ILIB,TGFB1 2 3.448E-02 IL1A,TNF 4 3.295E-02 TNF,TGFB1,FGF2,FOXO4 3 2.7485-02 POU1F1,TNF,TGFB1 3 1.7705-02 IL1B,TNF,BMP2 3 4.5965-02 PTEN,TNF,GDF11 4 2.6385-02 APC,1L3,TGFB1,BMP2 8 1.3025-02 IL1B,TNF,TGFB1,FTGS2,EIF2AIC2,3A52,OSM,IL1RN 5 5.2485-03 GLI3,RB1,TGFB1,TIMP2,LEPRE1 S 3.488E-02 PTEN,IL1A,1L18,TNF,TGFB1 7 1.709E-02 FRK,TIMP2,JAIC2,HDAC4,5TK38,PDSSB,ETSI 3 4.3965-02 STG1,NOTCH2,ETS1 3 3.8945-02 BTG1,HDAC4,MNT 8 3.132E-02 FTEN,RB1,CULS,HDAC4,T5G101,ADAMTS1,5TK38,RFX3 2 4.2865-02 CDK4,TYR 3 1.357E-02 CDK4,CDK6,E2F1 4 9.8325-03 RP527,MAP3SC1 1,MAPRE2,PES1 S 4.667E-02 RPS4X,DAB2,MNAT1,FIM2,MAPRE2 4 4.655E-02 FTEN,FGF2,SFOCK1,BCAT1 7 2.308E-02 TGFBI,CKS1S,GNAI2,1R52,UCHL1 ,BHLHB3,ZAK TPS3 471 Method Regulator Total # of Cell proliferation p-value Gene names predicted targets genes 67 44 133 178 128 163 104 83 36 283 205 174 406 16 178 259 107 162 146 126 312 296 27 6 29 14 19 20 40 47 79 74 149 127 61 245 12 12 87 135 110 110 86 M 543 Appendix C. Supplementary Materials Table CR (continued) Method Regulator Total # of Cell proliferation p-value Gene names predicted targets genes hsa-miR-2025 61 4 9.055E-03 PTEN,EPS15,FCF2,CUL5 hsa-miR-221 147 6 4.146E-03 RPS4X,RPS27,CYR61,CNAT2,FRAT2,PDGFC hsa-miR-27b 152 6 3.146E-03 RPS4X,EPS1S,PRRD1,STIL,TRIB1,BRLWB3 hsa-miR-320 127 4 4.963E-02 PTEN,DAB2,STAT4,BHLHB3 hsa-miR-410 122 8 8.823E-04 COL4A3,RPS4X,RPS27,FGF2,PRKD1.IRS2.EPS8,HDGFRP3 bsa-miR-504 27 3 3525E-02 RPS27,GPC4NAB2 hsa-miR-518f 46 4 2.1OSR-02 PTEN,RPS4X,RP527SPOCK1 bsa-miR-522 223 8 3.862E-02 RPS4X,DAB2CYR61,GNAI2,MAP2K1,EPS8,PDGFC,BCAR1 ACADVL 24 5 2.1 13E-02 BCL2,TCFB1,CXCL1O,CCND2,CDRN1B PAXS 33 7 7.721E-03 1L7,FLT3LG,FCF17CDKN1B,FLT3,HES1,STAT5B hsa-miR-206 115 3 3.531E-02 IRS2,FOXA1,PBEFJ hsa-miR-296 18 2 4.496E-02 NR2E1,IGF2BP1 ACADVL 14 3 2.748E-02 POU1FI,TNF,TGFB1 SP1 42 5 1.295E-02 TGFB1,PTCS2,ERRG,HDAC4,ETS1 HDAC2 19 3 2.83252-02 IL1B,TNF,TCFB1 IRF1 49 8 1.754E-02 IL1B,TNF,TGFB1,PTGS2,EIF2AK2,JAK2,OSMIL1RN MYCN 85 5 7.63852-03 GLI3,RB1,TGFB1,TIMP2,LEPREI PSMC3 6 2 4.455E-02 TGFE1,MDM4 SMAD3 57 6 4.92352-02 JL1B,TCFB1EREC,IFNB1,TCPS3,SMAD4 hsa-miR-128b 140 5 2.90652-02 PHBFOXO4,MY016,PDS5B,INCS bsa-miR-144 239 7 3.864E-02 PRRRA,PCAF,JAR2,T081,HDAC4.STC3,ETS1 hsa-miR-186 375 8 1.997E-02 PTEN.HOXE2,SKAP2,SMAD4,T081,HDAC4,C}tERP,T052 hsa-miR-195 54 4 2.60052-02 PPM1D,CHERP,SESN1,PDSSB bsamiR202* 61 4 2.26552-02 PTENFCF2,CIJLS,SMAD4 hsa-miR-31 132 6 4.21552-02 FRK,TIMP2,SKAP2,JAK2,HLIAC4,STK38 hsa-miR-34c 109 3 2.76852-02 BTG1,NOTCH2ETS1 bsa-miR-429 187 5 1.30352-02 NDN,TIMP2,T081,HDAC4,SESNI bsa-miR-499 150 4 2.94452-02 CUL5,CUL1 PPM1D,FQXQ4 hsa-miR-518e 44 3 2.88652-02 BMP2,MED17,PDS5S hsa-miR-7 117 5 1.49352-02 PPM1D,CNQT8,SETDB1,ETS1,GJB6 87
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Comparisons of statistical modeling for constructing...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Comparisons of statistical modeling for constructing gene regulatory networks Chen, Xiaohui 2008
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Comparisons of statistical modeling for constructing gene regulatory networks |
Creator |
Chen, Xiaohui |
Publisher | University of British Columbia |
Date Issued | 2008 |
Description | Genetic regulatory networks are of great importance in terms of scientific interests and practical medical importance. Since a number of high-throughput measurement devices are available, such as microarrays and sequencing techniques, regulatory networks have been intensively studied over the last decade. Based on these high-throughput data sets, statistical interpretations of these billions of bits are crucial for biologist to extract meaningful results. In this thesis, we compare a variety of existing regression models and apply them to construct regulatory networks which span trancription factors and microRNAs. We also propose an extended algorithm to address the local optimum issue in finding the Maximum A Posterjorj estimator. An E. coli mRNA expression microarray data set with known bona fide interactions is used to evaluate our models and we show that our regression networks with a properly chosen prior can perform comparably to the state-of-the-art regulatory network construction algorithm. Finally, we apply our models on a p53-related data set, NCI-60 data. By further incorporating available prior structural information from sequencing data, we identify several significantly enriched interactions with cell proliferation function. In both of the two data sets, we select specific examples to show that many regulatory interactions can be confirmed by previous studies or functional enrichment analysis. Through comparing statistical models, we conclude from the project that combining different models with over-representation analysis and prior structural information can improve the quality of prediction and facilitate biological interpretation. Keywords: regulatory network, variable selection, penalized maximum likelihood estimation, optimization, functional enrichment analysis. |
Extent | 2223158 bytes |
Subject |
Regulatory network Variable selection Optimization Functional enrichment analysis |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-02-02 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0066924 |
URI | http://hdl.handle.net/2429/4068 |
Degree |
Master of Science - MSc |
Program |
Bioinformatics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
GraduationDate | 2008-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2008_fall_chen_xiaohui.pdf [ 2.12MB ]
- Metadata
- JSON: 24-1.0066924.json
- JSON-LD: 24-1.0066924-ld.json
- RDF/XML (Pretty): 24-1.0066924-rdf.xml
- RDF/JSON: 24-1.0066924-rdf.json
- Turtle: 24-1.0066924-turtle.txt
- N-Triples: 24-1.0066924-rdf-ntriples.txt
- Original Record: 24-1.0066924-source.json
- Full Text
- 24-1.0066924-fulltext.txt
- Citation
- 24-1.0066924.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0066924/manifest