Comparisons of Statistical Modeling for Constructing Gene Regulatory Networks by Xiaohui Chen B.Sc., Zhejiang University, 2006 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Bioinformatics) The University Of British Columbia (Vancouver) August, 2008 © Xiaohui Chen 2008 Abstract Genetic regulatory networks are of great importance in terms of scien tific interests and practical medical importance. Since a number of highthroughput measurement devices are available, such as microarrays and sequencing techniques, regulatory networks have been intensively studied over the last decade. Based on these high-throughput data sets, statisti cal interpretations of these billions of bits are crucial for biologist to ex tract meaningful results. In this thesis, we compare a variety of existing regression models and apply them to construct regulatory networks which span trancription factors and microRNAs. We also propose an extended algorithm to address the local optimum issue in finding the Maximum A Posterjorj estimator. An E. coli mRNA expression microarray data set with known bona fide interactions is used to evaluate our models and we show that our regression networks with a properly chosen prior can perform com parably to the state-of-the-art regulatory network construction algorithm. Finally, we apply our models on a p53-related data set, NCI-60 data. By further incorporating available prior structural information from sequencing data, we identify several significantly enriched interactions with cell prolif eration function. In both of the two data sets, we select specific examples to show that many regulatory interactions can be confirmed by previous studies or functional enrichment analysis. Through comparing statistical models, we conclude from the project that combining different models with over-representation analysis and prior structural information can improve the quality of prediction and facilitate biological interpretation. Keywords: regulatory network, variable selection, penalized maximum likelihood estimation, optimization, functional enrichment analysis 11 Table of Contents Abstract ii Table of Contents iii List of Tables v List of Figures vii Acknowledgements x Dedication xi 1 Introduction and literature review 1.1 MiroRNA and Transcription factors 1.2 Previous work 1.3 Project motivations and goals 1 2 2 4 2 Models and algorithms 2.1 Model: Linear regression 6 2.2 Variable selection: penalized likelihood 2.3 Optimization: finding penalized ML estimator 6 8 2.3.1 2.3.2 2.3.3 3 EM algorithm MM algorithm Connections between algorithms Results on comparing across models 3.1 Simulation studies 3.2 E.coli data set 6 8 10 16 19 19 22 111 Table of Contents 3.2.1 3.2.2 24 28 Precision-recall curves Visualizing and analyzing inferred networks 4 Comparing models coupled with prior structural informa tion 41 4.1 NCI-60 Data preparation 41 4.2 5 Results 4.2.1 Prediction of p53 related microRNAs and genes 4.2.2 Functional enrichment analysis Conclusions and discussions Bibliography 44 . . . 44 49 56 58 Appendices A Methods A. 1 Penalties A.1.1 Information criteria A.1.2 Hard thresholding A.1.3 £ penalty A.1.4 SCAD penalty A.1.5 Bayesian linear regression B Proofs B.1 Lemmas B.2 Proof of Proposition A.1.1 B.3 Proof of Theorem 2.3.1 C Supplementary Materials 65 65 65 65 65 67 67 70 70 70 71 73 iv List of Tables 3.1 3.2 3.3 Mean errors for linear models, averaged over 100 simulations. MSE is the mean square error. C is the number of correctly estimated zeros and I is the number of incorrectly estimated zeros. Boldfaced methods are best results Characteristics of 60% and 80% precise networks inferred from models with top perfermance in E. coli Number of targets regulated by transcription factors in the 60% precise network with p 5 predicted targets with top performance algorithms, in E.coli network 4.1 4.2 MSE and estimated coefficients for the specific interactions predicted from NCI-60 data set. Bold numbers represent the estimated interactions agreeing with the literatures Estimated coefficients for the specific interactions predicted from NCI-60 data set, compared between with and without prior information. Coefficients with absolute values less than 10—8 are thresheld to 0. Bold numbers represent the esti 4.3 mated interactions agreeing with the literatures Identified significant p53/miRNAs that target known cell pro liferation genes from Li prior A. 1 Penalizations/log(prior) and their first order derivatives eval uated at I/I. Notation: qj = /()2 + 1/3112 A.2 Properties of sparsity promoting priors. Source: Francois Caron 20 25 28 46 48 50 69 69 v List of Tables C.l E. coli Transcription factors in the 60% precise network with 74 p 5 predicted operon targets by CLR algorithm C.2 E.coli Transcription factors in the 60% precise network with p> 5 predicted operon targets by Li prior with MM3 algorithm. 74 C.3 E.coli Transcription factors in the 60% precise network with 5 predicted operon targets by SCAD prior with MM3 p algorithm 75 C.4 E. coli Transcription factors in the 60% precise network with p 5 predicted operon targets by Li prior with EM algorithm. 75 C.5 E. coli Transcription factors in the 60% precise network with p 5 predicted operon targets by NIG prior with EM algorithm. 76 C.6 Identified significant p53/miRNAs that target known cell pro liferation genes from NG prior C.7 Identified significant p53/miRNAs that target known cell pro liferation genes from NIG prior C.8 Identified significant p53/miRNAs that target known cell pro liferation genes from NJ prior C.9 Identified significant p53/miRNAs that target known cell pro liferation genes from SCAD prior 81 82 83 86 vi List of Figures 1.1 An illustration of coordinated gene expression processes. TFs activate a set of genes in nucleus (01-On). After genes being transcribed, the resulting mRNAs are bound by RNA-binding proteins (RBP), spliced and subsequently exported to the cy toplasm. RBPs and miRNAs can affect the stability of tran scripts, activate or repress their translations. Source: RNA regulons: coordination of post-transcriptional events. Keene, J. 2007 Nature Reviews Genetics. [34] 2.1 2.2 2.3 3.1 Penalty functions and their corresponding quadratic surro gate functions around origin. Solid lines are original penalty and the red dashed lines are the upper bound surrogate func tions of solid lines with 0 = 0.5 and e = 10—8 Penalty functions and their corresponding surrogate functions around origin. Dotted lines (in blue) are original penalty, solid lines are the proposed perturbed penalty function, and dashed lines (in red) are the surrogate functions of solid lines with Oo = 0.5 and = 10—8 Convergence of unperturbed penalized likelihood function. Assuming r = 1. (O) is the part regarding the change of unperturbed penalized likelihood. S is the part measur ing the goodness of approximation between the derivative of perturbed penalty and original penalty Simulation comparisons, initialized with all zeros start. Sim ulation averaged over 100 times 3 11 14 16 21 vii List of Figures 3.2 3.3 3.4 3.5 3.6 Simulation comparisons, initialized with MLE. Simulation av eraged over 100 times The real regulatory network of E.coli presented in a binary matrix M. A bright spot at is a regulatory relationship from TF j to gene i 24 Precision-recall curves for various models and algorithms on E.coli data set. MI is the raw MI method. Z is CLR algorithm. 26 Scatter plots of the fitted expression values of target genes against their residuals 27 80% precise network for CLR algorithm. Red lines are cor rectly inferred edges and blue lines are false positives. Grey nodes are TFs 3.7 3.8 3.9 23 80% precise network for Li penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 80% precise network for SCAD penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 80% precise network for LASSO penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 3.10 80% precise network for NIG penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 3.11 60% precise network for CLR algorithm. Red lines are cor rectly inferred edges and blue lines are false positives. Grey nodes are TFs 3.12 60% precise network for Li penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 3.13 60% precise network for SCAD penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 31 32 33 34 35 36 37 38 viii List of Figures 3.14 60% precise network for LASSO penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 3.15 60% precise network for NIG penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs 4.1 4.2 Flowchart of learning regulatory network on NCI-60 data set. Fisher’s exact test p-values with the Li prior. The blue line is the simulated uniform background p-values under null hy 39 40 43 pothesis and horizontal line is the threshold at p-value at 0.05. 52 C.1 Fisher’s exact test p-values with the NG prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05 C.2 Fisher’s exact test p-values with the NIG prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05 C.3 Fisher’s exact test p-values with the NJ prior. The blue line is the simulated uniform background p-values under null hy 77 78 pothesis and horizontal line is the threshold at p-value at 0.05. 79 C.4 Fisher’s exact test p-values with the SCAD prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05 80 ix Acknowledgements First and foremost, I owe many thanks to my two supervisors: Prof. Raphael Gottardo and Prof. Kevin Murphy. This work would be impossible without their continuous efforts to keep my statistical ball rolling. Their many valu able and enlightened suggestions not only helped me from constantly study ing new things, but also taught me how to be an independent researcher. Thanks to both of you for being great mentors, both professionally and personally. I am also grateful to Dr.Mauricio Neira, who actually was my third unofficial supervisor. Thank you Mauricio for helping me so many times with valuable support and guidance on post-modeling bioinformatics analysis. Your never ending feedback on my analysis is highly appreciated. Besides, I would like to both thank you for helping prepare part of the data. Last but not least, I have to say thanks to Prof. Arnaud Doucet and Dr. François Caron, both of whom instructed and helped me with the statistical modeling. x To my parents Rui Chen and Yuping Ou xi Chapter 1 Introduction and literature review Gene regulatory networks have been intensively studied over the last decade because of their inherent scientific and medical importance (e.g., in the fight against cancer) as well as the development of high-throughput measurement devices. The data generated by these high-throughput experiments typically contain: • Gene microarray expression data: Each microarray is used to capture the snapshot of the transcriptional status of cells, and one ex periment usually includes several microarrays to reflect the change of status of a biological system over time (may contain repetitive mea surements). Although mRNA is a precursor to its downstream protein product which directly executes biological functions, a large amount of previous regulatory network studies used mRNA microarray data as a proxy for the amount of protein products [24, 46, 64]. • Sequence information: The discovery and characterization of regu latory controlling sequences have been greatly facilitated by sequencing projects. Generally, identification of regulatory elements in the genome is mainly based on comparative sequence analysis and some Chromatin ImmunoPrecipitation (ChIP) binding experiment data [40]. There are many secondary curated databases which consist of reliable, verified regulations, but this approach is not scalable for quick identification of regulatory relationships de novo, such as comparative genomics meth ods and algorithms based on sequence complementarity. 1 Chapter 1. Introduction and literature review 1.1 MiroRNA and Transcription factors At the molecular level, microRNA (miRNAs) and transcription factors (TFs) are two important categories of regulators that control thousands of mam malian genes. TFs are the proteins that can bind to specific parts of DNA (usually one gene(s)) using DNA binding domains and therefore help initi ate/repress transcription, while miRNAs are small non-coding RNAs that regulate gene expression at the post-transcriptional level by affecting the sta bility and translational processes of gene transcripts. For TFs, they could ei ther increase or decrease gene transcription, but generally for miRNAs, they mediate post-transcriptional gene silencing through degradation of mRNAs or inhibition of protein production [34]. In principle, miRNAs could help to explain discrepancies between mRNA and protein levels. For example, those discrepancies may seriously complicate the use of mRNA profiles to study chemoresistance [10]. Unfortunately, there still continues to be a question as to how well transcript levels predict translated protein levels; a graphi cal illustration of TFs and miRNAs coordinated gene expression processes is shown in Figure 1.1. Recently, [52] comprehensively studied the global architecture and network local motifs of human miRNA-TF regulatory net works. Based on the prediction datasets using sequence complementarity and conservation, they revealed that the combinatorial property of miRNA interactions and miRNA-TF cooperation for targeting genes is fundamental for the precise and complex nature of regulatory systems. 1.2 Previous work Most of previous work on discovering regulator-target association is branched into two fields, corresponding to the two major types of data mentioned at the beginning of this chapter. The former field is based on mining corre lated gene transcription patterns from mRNA microarray data. This in cludes clustering [41], Bayesian networks [24, 33, 46] and linear regression models [17, 64]. The latter one focuses on using sequence complementar ity and phylogenetic conservation; for example, see [7, 35, 39, 58]. These 2 Chapter 1. Introduction and literature review Cytoplasm miRNA Figure 1.1: An illustration of coordinated gene expression processes. TFs activate a set of genes in nucleus (G1-Gn). After genes being transcribed, the resulting mRNAs are bound by RNA-binding proteins (RBP), spliced and subsequently exported to the cytoplasm. RBPs and miRNAs can affect the stability of transcripts, activate or repress their translations. Source: RNA regulons: coordination of post-transcriptional events. Keene, J. 2007 Nature Reviews Genetics. [34] 3 Chapter 1. Introduction and literature review sequence based methods, however, have limited specificity due to imperfect matches between TF-TF Binding Sites (TFBS) and the miRNA-target. In the miRNA regulating transcripts domain, [31] proposed a Bayesian linear model to integrate the evidence from both sequence and expression data. The inferred regulatory network is a subset of the prior network given by a sequence-based target-finding program. To the same end, in the TF regu lating gene domain, [17, 48] combined mRNA expression data with positive protein binding data from ChIP experiments to improve the predictive abil ity. Their method further classified ChIP positives into functional and non functional TF targets using linear regression models. Currently, there is not much work on combining TFs and miRNAs to construct regulatory networks based on both microarray expression data and sequence-based structural in formation. Since more variety and volume of data sources are available, this joint modeling is expected to become popular in the near future. 1.3 Project motivations and goals As discussed in previous sections, the regulatory systems seldom function through oniy a few mechanisms. Hence, building a comprehensive regula tory network spanning TFs and miRNAs at different molecular levels and integrating both microarray expression data and prior structural informa tion from sequence based predictions are essential. Moreover, due to the sparseness property of regulatory networks, we formulate the problem as a variable selection problem. The goals of the project are summarized as below: 1. Compare various statistical models and algorithms without integrating structural information. 2. By incorporating available prior structural information, compare the results of different models. Comparisons are also made between com bining two information sources and merely using expression data to construct networks. 4 Chapter 1. Introduction and literature review 3. Combine functional enrichment analysis to interpret learned regula tory interactions. The rest of the thesis is organized as following: • Chapter 2 briefly presents the models we are using and the existing and our proposed algorithms to solve the models. Details of the modeling issues are put into the appendix. • Chapter 3 presents the results of a simulation study and a real data example (from E. coli) where the true regulatory network is known. The purpose of this chapter is to compare various models and algo rithms without prior structural information, from a statistical point of view. Model performance is measured by a precision-recall curve. Further, several hub genes are selected with common predictions from high performance algorithms and we show that at the functional level, the learned networks agree with current literature. • Chapter 4 presents the results on a p53-centered data set (NCI6O) and prior structural information is used to infer a network. The results are presented in several specific examples, which are of our biological interests, i.e. p53 tumor related genes/miRNAs. We combine the variable selection and functional over-representation analysis to detect bona fide interactions. • Chapter 5 concludes the thesis with a discussion. 5 Chapter 2 Models and algorithms 2.1 Model: Linear regression The model we are assuming is a multivariate linear regression model. (2.1) y=X/3+e where • X X>< is the design matrix consisting of expression values for all known candidate regulators, including TFs and microRNAs • y is the response vector of target gene expression values • /3 is the regulatory strength: /3 down-regulation, and /3 = 0 means up-regulation, 0 represents no regulation > /3 < 0 means • e is an error term assumed to be Gaussian with constant variance i.e. N(0,u 1). 2 , 2 To capture the dependency between the regulators and target gene, we consecutively regress the expression levels of every target gene to X, while keeping X fixed for all. 2.2 Variable selection: penalized likelihood In many situations, the assumed linear model may be redundant in the sense that not all of its predictors have significant effects on the response. For example, there may be hundreds or even thousands candidate regula tors for one gene. Thus the full linear model is usually over-parameterized. 6 Chapter 2. Models and algorithms Intuitively, including more ifs, the prediction error would decrease and the fit would improve. However, the improvement is very marginal compared with the cost of estimating a much more complex model. Furthermore, the interpretation becomes unclear for separating the true predictors from the irrelevant variables. This is because the estimations yielded from mini mizing a least squares (LS) objective function typically do not equal zeros. Hence, selecting a proper subset of predictors from the full model is cru cial for high-dimensional statistical modeling. The basic idea is to penalize more complex models while preferring simpler models. In terms of math ematical formulation, many variable selection problems aim to maximize a penalized likelihood function, which is equivalent to minimizing a penalized negative-log-likelihood O(3, y) function Q(/3,AjX,y) = -(/3) +p/3) (2.2) where £ is the log-likelihood and p> (Ii i) is the penalty term fOr each large /3 3 absolute value. Generally speaking, large penalties tend to shrink the /3 co efficients to zeros. Here in the dimensionality reduction context and the rest of the thesis, we use variable selection and model selection interchangeably. In the context of linear regression, i.e. when p = 0, the maximum likelihood (ML) and least squares approaches provide the same estimator 2 Thus cr 2 can be absorbed in the MLE, which does not depend on g 3 / penalty terms. That is to say given we can omit o 2 and minimize the penalized LS objective function over /3 and O(/3, AjX,y) = II X/3II + pA(I/3jI) - (2.3) where II 2 is the Euclidean norm of a vector, and without loss of generality, we assume p takes the same functional form for all j = 1 : p, i.e. = p, in Eq.(2.3). Specifically, choosing p>, = AI/3I yields the £ penalty [22]. By avoiding the over-penalization of /3 with large absolute values, Fan and Li (2001) proposed a modified £‘ penalty [19], i.e the Smoothly Clipped 7 Chapter 2. Models and algorithms Absolute Deviation (SCAD) penalty. The problem becomes finding 3 such that Cã,) = argrninO(/3,AIX,y) (2.4) A can be determined on a linear grid by cross-validation (CV) giving the smallest prediction errors. 2.3 Optimization: finding penalized ML estimator It turns out the minimization of the objective function, Q(.), is not trivial. For p,\(.) = H (a.k.a. LASSO) [56], it can be efficiently solved by Osborne’s [44] and the Least Angle Regression (LARS) algorithm [16]. But these al gorithms are not generalized to other penalties (see Appendix A Methods). Caron and Doucet [12] proposed an Expectation-Maximization (EM) al gorithm to find the Maximum A Posteriori (MAP) estimator for general penalty functions. Further, for L’ and SCAD penalties (see Appendix A Methods), Hunter and Li [32] proposed a Majorize-Minimize (MM) algo rithm to optimize the same objective function and avoid the local optimum problem with the EM algorithm. The MM can be seen as a generalization of the EM algorithm. In this thesis, we extended the idea of the MM algorithm and propose a new MM algorithm to handle the general penalizations. The details will be addressed in the following paragraphs and also Appendix A Methods section. 2.3.1 EM algorithm [20], [27] and [12] proposed an EM algorithm [15] to find the posterior modes of Laplacian (Li), Normal-Gamma (NG), Normal-Jeifreys (NJ), and Normal-mv-Gaussian (NIG) models. In general, the marginal prior distri bution of regression coefficients in this family can be factored as a scale 8 Chapter 2. Models and algorithms mixture of Gaussian densities: 0, = )p(T T ) 2 dT (2.5) Choosing the prior of to be Laplace, Gamma, Jeifreys and Inverse Gaus sian corresponds to the Li, NG, NJ and NIG models, respectively. In the EM framework, 3 is the parameter vector of interest, and T can be seen as missing data. Then, the posterior mode can be iteratively found by maximizing conditioned on the last estimate (c): (k+1) = argmaxQ(3,) (2.6) where Q(/3, (1c)) is defined as the expected complete-data log-likelihood: f log(p(X, y, r))p(rI, X, y) Then, in the next iteration, j(k+1) is given by maximizing the Q(3, )) function over j3. (k+1) can be given For linear regression models, the update expression of 3 in closed form: (k+1) (XTX + u())_1xTy (2.7) where 0 ... 0 (k) with (k) — j — p’(I/I) 1 j 3 p/ Here p’ (x) denotes the derivative of penalty function p(.) evaluated at x. 9 Chapter 2. Models and algorithms 2.3.2 MM algorithm Depending on the optimization context, the MM algorithm refers to the majorize-minimize or minorize-maximize algorithm, which is a generaliza tion of the famous EM algorithm. Because the MM algorithm is a very gen eral optimization method (for example iteratively reweighted least squares algorithm is also a special MM algorithm which uses tangent lines to surro gate the original objective function), we adopt the MM algorithm to solve penalized models with various penalties. In the problem where no missing data can be naturally introduced, MM requires the construction of a surro gate function for the objective function, which is equivalent to the E-step in the EM framework. The M-step is subsequently implemented to optimize the transfered surrogate function. These two steps are iteratively performed in turn until convergence. The MM algorithm has the monotonicity prop erty of the EM if the majorize/surrogate function constructed satisfies the following two properties: 1(O,Oo) (9o,8o) = p(O),’vReO (2.8) P(Oo) (2.9) where, 1(8, 8o) is the majorize function of the objective function p(O) at &o. Specifically, in the context of penalized linear regression, p(.) is the penalty. Then, it is obvious that by subtracting Eq.(2.9) from Eq.(2.8), we get 1(O,6o) (6o,Oo) p(s) p(Oo), which in turn guarantees the monotonically non-increasing property of original objective function: — — 1(O,Oo) (Oo, 1 o 6 ) ) 8 i( o) 8 P( (2.10) Thus by choosing a more smooth surrogate function, instead of directly min imizing p(s), the optimization is transfered to minimizing the differentiable function (0, Oo) over 19. The penalty functions and corresponding upper bound surrogate functions are shown in Figure 2.1. 10 Chapter 2. Models and algorithms I (a) Li (Lipschitz continu- (b) SCAD (Lipschitz con (c) NJ (Infinite derivative) ous) tinuous) (d) NG (Infinite deriva- (e) NIG (Infinite deriva tive) tive) Figure 2.1: Penalty functions and their corresponding quadratic surrogate functions around origin. Solid lines are original penalty and the red dashed lines are the upper bound surrogate functions of solid lines with Oo = 0.5 and € = 108. 11 Chapter 2. Models and algorithms MM1 algorithm In the context of the linear regression, one can define the quadratic function (0, 0) pA(IOoI) + (0 (2.11) - it can be shown that ‘T(0, 0o) majorizes p(I I) at ±10o1 (Proposition 3.1 0 [32]). Here, pjI0j+) denotes the limit of p(x) as x — 101 from above. By iteratively minimizing i(O,0o), the solution to this algorithm is identical to the solution of the M-step in the EM algorithm. Fan and Li called this algorithm Local Quadratic Approximation (LAQ) [19], but here we call it MM1 for simplicity. MM2 algorithm Note that in Eq.(2.11), c(9,8) is undefined at 0o = 0, and hence when a parameter is estimated to be zero or close enough to zero, then it is excluded from the subsequent sub-model and never re-enters the model again. To ease this problem, [32] proposed a modified version of LQA/MM1 with perturbed penalty function P2 p181 I) 0 PA2(1 = / -‘—-dt PA(I0I) (2.12) 0 — paired with its majorize function + (02 2(0,0O) OI) 0 P2(I (2.13) Under certain regularity conditions (c.f. Proposition 3.2 [32]) and provided the space of 0 is compact, it can be shown as 0, 1PA2(O) — p(°)I —‘ 0 uniformly over 0. The iterative ridge-type solution can be given in closed form expression (k+1) = (XTX + vQ))_1xTy (2.14) 12 Chapter 2. Models and algorithms where 0 0 V(k)= ... (k) 0 0 with (k) = p(IPI+) 3 MM3 algorithm For NG, NJ and NIG models, the regularity conditions stated in [32] are not met. For example, in the NJ model, the first derivative of the penalty function is ik (see Table A.1), which is unbounded as O 0. Also the penalty itself is unbounded below when 181 J, 0 (Figure 2.2(c)). Even worse, . when is not integrable around the origin, it is not possible to use this functional form to derive the new penalty function. To handle this issue, we propose a perturbed penalty function, fix e > 0: PA3(IOI) = PA(IOI + e) - € f° p(t) dt (2.15) Define: 3(0, 0) + (82 = Ol) 0 P3(I - 0)p((oI+ e)+) (2.16) Similarly, by the above construction, we could prove the following theo rem. Theorem 2.3.1. Let p(.) be a function defined on [0,oo). Suppose pA(.) satisfies the regularity conditions: 1. differentiable, nondecreasing and concave on (0, oo) 2. continuous at the origin Ve> 0, define 1. 3(0,0O) p3(IOj) and 4’3(8, Oo) as above. Then majorizespA3(I0I) at ±IOoj, ER. 0 VO 13 Chapter 2. Models and algorithms (a) Li (Lipschitz continu- (b) SCAD (Lipschitz con- (c) NJ (Infinite derivative) ous) tinuous) Figure 2.2: Penalty functions and their corresponding surrogate functions around origin. Dotted lines (in blue) are original penalty, solid lines are the proposed perturbed penalty function, and dashed lines (in red) are the surrogate functions of solid lines with 8o = 0.5 and e = 10—8. 2. VC c e, the space of 0, where C is compact. As e SUJ eec J3(0) — p(O) —÷ 0 0, (2.17) In particular, ifp,.(.) is Lipschitz continuous on [0,00), then sup e IPA3( 90 ) 8 — The proof of this theorem is given in Appendix B. Theorem 2.3.1 tells us that under the new surrogate function, the condition PA (0+) <00 can even be abandoned. As we have seen, this is essential for many kinds of priors that have been used in variable selection, for example NJ and NG etc. Then iteratively update expression for the parameters j3 can also be analytically given by (k+1) = (XTX (2.18) + W(k))XTy where 0 W(c)= 0 w o ... 0 ... ... (k) 14 )I 8 ( — 0. Chapter 2. Models and algorithms with — p((I/3I +e)+) j — The convergence criterion adopted here is different from the one used in [32]. We adjust e, hence also the surrogate function (if necessary), to ensure that, for /3 0, < T at the convergence, where r is a predefined effective-zero level. Algorithm 1 summarizes MM3 algorithm steps: Algorithm 1 MM3 algorithm Require: Design matrix X, response y, penalty name 1: Initialize e to be moderately small 2: repeat 3: Run MM3 algorithm Q(/(k+1)) < T 4: until Q((k)) “(k+l) 5: if / I<Tthen 3 — (k+1) Set 7: end if 8: Compute 6: = 0 = 9: 10: 11: max — 8/3d j 3 / 9 (k+1) if then Finish and exit else 12: 13: 14: 15: Go to line 2 end if return Estimated coefficients/Selected variables Remark: we do not adopt the procedure of presuming sub-differential is greater than r, i.e. > çk+1) 13 = 0 if its r as in [32]. Since any coefficient close to zero has been removed in’ the last step, we argue that the remaining coefficients after the algorithm converges are all non-zero. In fact, these effective-zero coefficients are due to the perturbation introduced, so they should be exactly zero if e = 0. Also from the simulation studies, 15 Chapter 2. Models and algorithms 9 0 2 4 8 8 10 Figure 2.3: Convergence of unperturbed penalized likelihood function. As suming r = 1. (O) is the part regarding the change of unperturbed pe nalized likelihood. S is the part measuring the goodness of approximation between the derivative of perturbed penalty and original penalty. we find the estimates obtained by setting çk+1) 13 = o if ôPj $(k+1) > T have quite bad performance under the NJ, NG and NIG priors in terms 3 of correctly selected predictors (results not shown in this thesis). Finally, under this scheme, when the coefficient is estimated to 0, it could re-enter the model after tuning smaller, because we allow zero estimations to compete with others as long as convergence has not been arrived. The tolerance is decomposed as shown in Figure 2.3. 2.3.3 Connections between algorithms The connection between the EM, MM2 and MM3 algorithms are summarized below: • The EM algorithm requires to compute the expected complete-data log-likelihood for the E-step, while the MM algorithmneeds a majorize function. The subsequent step of these algorithms are essentially the same: optimizing the transferred surrogate function. Hence, it is easy to see that the E-step is a special case of constructing an arbitrary sur 16 Chapter 2. Models and algorithms rogate function. In fact, the solution to the EM algorithm is identical to that of LQA/MM1 algorithms • If the marginal density of /3 is infinitely spiked around 0, then the EM and MM2 algorithms have the problem as forward/backward se lection procedures, i.e. variables discarded cannot be re-introduced into models in later stage. The proposed MM3 algorithm can avoid this local optimality by adding a small perturbation to the derivatives while retaining the monotonicity of the EM and MM2 algorithms. • The choice of €, and thus the majorize function, is adapted to the estimated parameters. Hence, the choice of can be controlled by the information from data and is robust to its initial choice. As c is dynamically tuning smaller, the MM3 usually takes a longer time to converge than the EM and the MM2 based on empirical experience. • It is not difficult to see that /3 = 0 is a local optimum for the penalized linear regression with the EM, MM2, and MM3 algorithms. The EM with all priors and the MM2 with NJ, NG, and NIG priors have to drop the variables once they are estimated to be zero, and thus will be stuck at the origin. However, a simulation study will demonstrate that the MM3 with all priors has the ability to jump from the zero nodal points and correctly identify the true variables. • It is well known that an Li penalty can be viewed as the following constrained optimization problem: minimize — X/32 subjected to I/3It The penalized regression is equivalent to the constrained regression problem, in that, given a ). 0, there exists a t> 0 such that these two problems share the same solution. Generally speaking, however, i7 Chapter 2. Models and algorithms the penalized formulation is not the Lagrangian function for the con strained formulation. However, for NG and NIG model, we can inter pret the MM solutions to penalized LS as log-barrier functions (see Table A.i). The idea of the MM algorithms here is the same as that of the log-barrier function solution to the Li penalized LS: they all prevent variables from becoming exactly 0 while remaining a good ap proximation to the original penalized functions. Specifically for Li, the log-barrier function can defined as g(x) = Iy_X +IjI -flogx When is sufficiently small, minimizers of g(x) corresponds to mini mizers of OC, AIX, y). For MM2 and MM3, the surrogate functions have the same role as g(x), and we can prove that as goes down suf ficiently close to 0, the original objective functions can be arbitrarily approximated by surrogate functions. 18 Chapter 3 Results on comparing across models 3.1 Simulation studies To compare the performances of models with different penalizations and al gorithmic solutions, we simulate 100 data sets, each with n = 100 points and o = 1. Following the setup of [32], the true coefficients are set to be /3 = (3, 0, 0, 0, 1.5, 0, 0, 0, 2, 0, 0, o)T, with p = 12. This is the scenario to simulate 3 true TFs out of 12 candidate TFs for one gene. Because genes are usually dependent, we introduce three correlation levels between covariates: p = 0.1,0.5,0.9. Data X is scaled to mean 0 and unit variance and y is centered. For both MM2 and MM3 algorithms, the initial e is set to 10—8 and effective-zero T = 10.8. The hyper-parameters of all models are set to be ci = 0.01 and c = 21:10 (see Appendix A Methods for the notations). All models and algorithms are initialized with all 0’s and the Ordinary Least Square (OLS) or the Maximum Likelihood Estimation (MLE). Optimal pa rameters are determined by 5-fold CV as used in [56]. The performances of various models are measured by the mean square error (MSE) (can be calculated in closed form by (_/3)Tcov(X)(_/3)), the number of correctly estimated zeros (i.e. equals 9), the number of correctly estimated non-zeros (i.e. equals 3) and the number of incorrectly estimated zeros. Note that we do not implement the SCAD prior with the EM algorithms. First, we look at the results with all models initialized at zeros at different correlation levels, see Figure 3.1. The true number of non-zero coefficients is three, and it is easy to observe that the MM3 uniformly dominates the 19 Chapter 3. Results on comparing across models Table 3.1: Mean errors for linear models, averaged over 100 simulations. MSE is the mean square error. C is the number of correctly estimated zeros and I is the number of incorrectly estimated zeros. Boldfaced methods are het rPii11R Model NJ MM3 NJ MM2 NJ EM NG MM3 NG MM2 NG EM NIG MM3 NIG MM2 NIG EM Li MM3 Li MM2 Li EM SCAD MM3 SCAD MM2 MSE Zeros Median I C p=O.l 0.1104 8.54 0 0.1104 8.54 0 0.1105 8.54 0 0.0992 8.75 0 0.1067 8.56 0 0.0971 8.77 0 0.0926 0 0 0.0937 0.78 0 0.0905 0 0 0.1364 4.18 0 0.1533 5.36 0.32 0.1462 4.36 0 0.1454 4.31 0 0.1204 6.35 0 Zeros MSE Median I C p=O.5 0.1099 8.52 0 0.1099 8.52 0 0.1099 8.52 0 0.0945 8.79 0 0.1055 8.54 0 0.0975 8.75 0 0.0872 0 0 0.0846 0.9 0 0.0872 0 0 0.1457 4.60 0 0.1589 5.34 0.30 0.1481 4.63 0 0.1387 4.84 0 0.1414 6.68 0 Zeros MSE I Median C pO.9 0.1254 8.51 0 0.1255 8.51 0 0.1254 8.51 0 0.1280 8.77 0 0.1278 8.57 0 0.1337 8.77 0 0.1185 0 0 2 0.1230 0 0.1256 0 0 0.1692 4.86 0 0.2275 6.06 0.46 0.1751 4.97 0 0.1568 5.21 0 0.1758 7.32 0 EM and the MM2 in terms of the number of correctly identified non-zeros. Comparing across models, the EM is always stuck at 0’s for all models and the MM2 only selects variables in case of Li and SCAD. For MM3, starting from zeros has no significant effect on the selected variables. Hence, the local optimum issue can be well handled by the MM3 in this extreme situation. Secondly, we compare the results of algorithms starting from OLS es timations. From Table 3.1, we can see that the NJ and NG priors are performing among the best with all three algorithms, in terms of both cor rectly and incorrectly estimated zeros. The NIG has the lowest MSE but. its estimates are not a thresholding rule, i.e. the NIG prior is not a variable selection prior and can produce many coefficients with smallest absolute val ues (c.f. Appendix A Methods). However, one can read small coefficients as zeros and thus corresponding variables are not selected. Further, the NJ models for the EM, MM2, and MM3 algorithms are essentially the same and 20 Chapter 3. Results on comparing across models Figure 3.1: Simulation comparisons, initialized with all zeros start. Simula tion averaged over 100 times Numbers of non-zero correctly estimated Numbers of non-zero correctly estimated •EM 0 MM2 •EM 0 MM2 MM3 MM3 ii]]l jill] LI NJ NO MG SCAD Li NJ (a)p=O.1 NO NM SCAD (b)pO.5 Numbers of non-zero correctly estimated • EM 0 MM2 • MM3 11111 Li NJ (c) p NO = MG SOAD 0-9 21 Chapter 3. Results on comparing across models the Li and SCAD priors have larger MSEs. Figure 3.2 shows that the num ber of zeros included in the NG and the NJ is higher than Li and SCAD. Therefore, the NG, NJ, and NIG models tend to produce sparser models than Li and SCAD. Finally, in general these variables selection models with the MM3 algorithm tend to have lower MSEs than those with the MM2 and the EM algorithms. In conclusion, local optimality is unlikely to occur if the starting points are not carefully chosen and the MM3 is better but the improvement is marginal in this case. 3.2 E.coli data set In [18], Faith et.al. showed how RegulonDB database [50] can serve as a ground truth of regulatory network for E. coli. RegulonDB contains 32i6 experimentally confirmed regulatory interactions among 1058 genes and 153 TFs. [18] assembled a compendium of 445 new and previously published E. coli K12 Affymetrix Aritisense2 microarray expression profiles collected under various conditions. Compared with the ground truth (see Figure 3.3), this compendium is an ideal real dataset that we can evaluate the performance of various models. This dataset can be downloaded at the Many D) Web site (http://m3d.bu.edu/). 3 Microbe Microarrays database (M We also compare our linear models with the state-of-the-art gene regu latory network construction algorithms, i.e. Context Likelihood Relatedness (CLR) algorithm [18]. CLR is a mutual information (MI)-based method. The CLR algorithm estimates the likelihood of the MI score for a particular pair of genes, i and j, by comparing the MI value for that pair of genes to a background distribution of MI values (the null model). There are three major steps in CLR algorithms: 1. Computing raw MI values for every pair of genes 2. For each gene, normalizing the raw MI values 3. Estimating the joint likelihood measure (i.e. a significance measure of MI values), Z-score, of MI between every pair of gene 22 Chapter 3. Results on comparing across models Figure 3.2: Simulation comparisons, initialized with MLE. Simulation aver aged over 100 times Numbers of zero correctly estimated Numbers of zero correctly estimated UEI 0 MM2 EM 0 MM2 hNNi LI NJ NG NG SGAD LI NJ (a)p=O.1 NG NIG SCO (b)p=O.5 Numbers of zero correctly estimated • E 0 MM2 • MM3 LI NJ (c) p NIG NG = SC,AD 0.9 23 Chapter 3. Results on comparing across models Ground truth a, C a, 0, 80 TF1153 Figure 3.3: The real regulatory network of E.coli presented in a binary matrix M. A bright spot at is a regulatory relationship from TF j to gene i. 3.2.1 Precision-recall curves The performances of all methods are measured by the area under the precisionrecall curve (PR curve). Precision is defined as the fraction of true positives out of the total predicted positives, and recall the fraction of true positives out of the actual total positives. By formulation, they can be expressed as following: precision = recall = TP TP + PP TP TP + FN (3.1) (3.2) Hence, the larger area under the PR curve, the better performance an algo rithm has. Faith [18] have applied various algorithms to genes on the whole E.coli 24 Chapter 3. Results on comparing across models Table 3.2: Characteristics of 60% and 80% precise networks inferred from models with top perfermance in E. coli. 80% precise network 60% precise network Threshold Threshold TP FP Model TP FP 0.7531(Coeff value) Li MM3 132 88 0.4530(Coeff value) 39 ii ii 0.7610(Coeff value) SCAD MM3 137 93 0.4435(Coeff value) 39 Li EM 132 88 0.4396(Coeff value) 39 ii 0.7585(Coeff value) NIG EM 0.8954(Coeff value) 94 66 0.6446(Coeff value) 31 9 2 iO.658(Z-score) CLR 147 103 5.7905(Z-score) 8 Antisense2 microarray data, but we only apply linear models and MI-based methods on the set of genes which have representation in RegulonDB database, i.e. 1058 targets + 153 TFs = 1211 genes. This is because [18] can biolog ically validate the predicted and presumably novel interactions which are not curated in RegulonDB, while we have no such advantages and our main focus in this thesis is on the statistical modeling side. Thus, our results from CLR are different from those in the original paper applied on E.coli [18]. From the PR curve in Figure 3.4, we can see that the Li penalty fitted with the EM and the MM3, the SCAD penalty with the MM3, and the NIG penalty with the EM, perform best in most algorithms and are comparable with the CLR algorithm. The NJ prior for all three algorithms are essentially overlapping each other. Performances of the NG prior for three algorithms are very close, although they are all inferior to the CLR algorithm. The network characteristics corresponding to models with top performance are summarized in Table 3.2. To access the goodness-of-fit of various models and algorithms, scatter plots of the fitted expression values of target genes against their residuals are shown in Figure 3.5. It displays that the correlation between the actual expression values and fitted values is quite high. Moreover, the correlations from the Li and SCAD models are better than the NIG model fitted using the EM algorithm. 25 Chapter 3. Results on comparing across models Figure 3.4: Precision-recall curves for various models and algorithms on E.coli data set. MI is the raw MI method. Z is CLR algorithm. 0.9 0.8 07 06 0.5 04 03 0.2 0.1 0 004 0 006 0.04 01 (a) Li (b) SCAD (c) NJ (d) NG 012 014 06 0.9 NIG EM NIG MM2 NIGMM3 0.8 07 06 05 MI 04 03 02 0.1 0 002 004 006 004 01 012 014 016 (e) NIG 26 Chapter 3. Results on comparing across models Figure 3.5: Scatter plots of the fitted expression values of target genes against their residuals. 10 0 9 .0 -.- s. . -N” -5 -10 -10 2 ‘F - - . 12 - -6 -6 -10 lledprioI - -6 6 6 6 6 FOted expression valves (a) Li MM3 (b) SCAD MM3 10 10 1 -15 .15 -2 -10 - - - 6 Filled expression values (c) Li EM 6 -10 -0 .5 -4 -2 0 2 0 5 Fin sO expreox on ye Ixes (d) NIG EM 27 Chapter 3. Results on comparing across models. Table 3.3: Number of targets regulated by transcription factors in the 60% precise network with p 5 predicted targets with top performance algo rithms, in E.coli network. TFs Targets # Targets # inferred in RegulonDB CLR Li MM3 Li EM SCAD MM3 Li NIG fliA 42 40 43 44 44 44 lexA i6 6 7 7 7 7 hycA 7 10 14 14 9 9 gatR 6 6 7 7 7 7 yhiE 5 11 10 10 11 10 3.2.2 Visualizing and analyzing inferred networks Networks are visualized through the Graphviz software, and the networks corresponding to Table 3.2 are shown in Figure 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14, and 3.15. Specifically, to visualize the actual learned net works, we first extract the 80% precise networks as [18] did from the CLR algorithm, see Figure 3.6, 3.9, and 3.10 (i.e. the network is extracted by thresholding which gives 80% precision.). At this precision level, the learned networks from the LASSO and the NIG penalties are much better than the CLR algorithm. This agrees with the result in the PR curve, since at 80% precision, CLR almost has a recall of 0, which means there are almost no TP edges inferred. In fact, by trading off between the true positive and false positive rate in selecting a threshold for identifying significant regulatory in teractions, a higher precision level leads to a small network with fewer false positives, while a lower threshold will include more false positive features. Hence, next we lower down the precision level to visualize the 60% precise networks, the resulting networks are shown in Figure 3.11, 3.12, 3.13, 3.14, and 3.15. The numbers in the nodes are the gene indexes in the expression matrix and each number can be uniquely mapped to one gene. We summarized the TFs in the 60% precise network with p 5 pre dicted operon targets with these top performance algorithms, and the re sults are presented in Appendix C Supplementary Materials C.1, C.2, C.3, and C.4. Here, we report 5 TFs with p 5 connectivities to their tar 28 Chapter 3. Results on comparing across models get genes supported by all 5 algorithms with high performance, measured by PR curve (see Table 3.3). Their gene names are: fiiA_b1922_at (fliA, gidx: 345), lexA.b4043.at (lexA, gidx: 615), hycA_b2725_at (hycA, gidx: 532), gatR2b2090_Lat (gatR, gidx: 413), and yhiEb35l2_at (yhiE, gidx: 1170). These genes are well documented in the literature. We suumarize their functions from the literature. • fliA: we observed that it is a hub gene in both 80% and 60% pre cise networks and contols the transcriptional activities of many down stream genes. Very recently, fliA is reported to be one of the two global regulators (rssB and fliA) in E. coli through genetic screen ex periments and the products of these two genes are involved in the regulation of major genetic networks [21]. It is known that even in the absence of identifiable exogenous stress, there remains a measur able, basal death frequency in E. coli populations. But the underlying mechanisms still remains unclear compared with those under stress conditions. [21] showed that mutant of the fliA gene affects the levels of different sigma factors within the cell and results reduced death fre quencies in E. coli populations. Specifically, the inactivation of the fliA gene encodes the fiagellar sigma factor. This results in the lack of ex pression of a number of genes involved in motility and chemotaxis, and consequently, non-motile cells. This loss of motility results in greater absolute availability of both RNA polymerase and energy for other processes within the cell, and consequently the cell may gain viabil ity through a number of potential mechanisms by losing the motility pathway. • lexA: a major regulator of DNA repair and DNA-binding transcrip tional repressor of SOS regulon, is known to have a single well-conserved DNA-binding motif. It is one of the best-perturbed regulators in the microarray compendium due to the compendium’s emphasis on DNA damaging conditions. 16 regulatory interactions were collected in the RegulonDB database. The target genes that are supported by these 5 algorithms are: dinF, recA, recN, sulA. Interestingly, all of these four 29 Chapter 3. Results on comparing across models genes are involved in the DNA repair and SOS response, which are important functions of lexA in E. coli. • hycA: a formate hydrogenlyase regulatory protein. The target genes that are supported by these 5 algorithms are: hycB, hycC, hycD, hycE, hycF, hycG, hycH, hydN. The products of these target genes (hycB H) are the subunits of hydrogenase, and hydN is involved in electron transport from formate to hydrogen. Thus, all genes have a common biological function: form and maturate the formate hydrogenlyase pro tein complex, which is an important molecule for energy production and conversion in E. coli. • gatR: annotated to be a pseudogene, repressor for gat operon in NCBI, The target genes that are predicted by these 5 algorithms are: gatA, gatB, gatC, gatD, gat Y, and gatZ. It is known that these genes form a gat operon with 7 ORFs in E.coii EC3132 and involve in galac titol metabolism. A mutation in gatR in the E.coii K12 strain implies constitutive expression of gatABCD YZ [8j. This is in fact a negative regulation effect of gatR on these genes. • yhiE: yhiE is a hypothetical protein, and currently there is no litera ture describing the functions of this gene with experimental support. The target genes that are commonly predicted by these 5 algorithms are: gadA, gadB, hdeA, hdeB, hdeD, sip, and yhiD. However, by com paring the functions of the genes targeted by yhiE annotated in NCBI, it might be that yhiE involves functions of regulating acid-resistance (hdeA -D) and/or glutamate decarboxylase (gadA -B). 30 Chapter 3. Results on comparing across models S2I t) 487 1170 401 o .180 Figure 3.6: 80% precise network for CLR algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 31 Chapter 3. Results on comparing across models O3 (41 (300) I J01 1106/b 43 164 / 2Q0 Q6 126 997 615 .138 Figure 3.7: 80% precise network for Li penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 32 Chapter 3. Results on comparing across models O 20 73 6” (.) 10 oo 34’ 1 GC 1104 1201 (1) 114’ E9D154N9 11-14 900 Figure 3.8: 80% precise network for SCAD penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 33 Chapter 3. Results on comparing across models 1000 33 óod 2 S (4• 8 ®d9 2 c3 111 489 313 114-1 Figure 3.9: 80% precise network for LASSO penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 34 Chapter 3. Results on comparing across models 18 163 S58 357 W 1 P 356 (4-) Figure 3.10: 80% precise network for NIG penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 35 Chapter 3. Results on comparing across models Figure 3.11: 6O9 precise network for CLR algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 36 Chapter 3. Results on comparing across models / ®® c) c5) Figure 3.12: 60% precise network for Li penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 37 _ _ __ Chapter 3. Results on comparing across models - 1 — C - ; —. c := - - -- -- - — _ ,_ — J I — / N f I - .) —‘ - -- -, (._._-_ - — - ( - z —-- Oc---- - - - S - ( - (-_ ( J’ -- 0 - • II) - - L -- ä-. . q. Q -- - — - -_ 0 - - — - - — Figure 3.13: 60% precise network for SCAD penalty from MM3 algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 38 Chapter 3. Results on comparing across models --—- - — - ci - — — - 3. ci —, — —- --. t 4 ; III : — ( —p — -z- p--- — — . - -- (_ . - -— - — — -— -z-’ ‘. — — - I - — ci 441 j -, -i-- — 7 —— -‘ i.---- Q —_--- - - Il K-— -- -—- — i—- - -- — - -‘---. --— 1:I_1 :.‘ - -- ‘ ,S KII -N 1111 —- ‘—_ I4, II ---- •, - -: -, i- - — - \jy 2 - - - .i. N __- -- —‘s— 4 - z -- —_—-, —- — ‘‘1_’_)I —-— - —- I1O .—- -, _Y ,-, L _ - 14, - ) --_--K, - 2 ‘ Figure 3.14: 60 ”o precise network for LASSO penalty from EM algorithm. 9 Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 39 Chapter 3. Results on comparing across models - I—s Figure 3.15: 60% precise network for NIG penalty from EM algorithm. Red lines are correctly inferred edges and blue lines are false positives. Grey nodes are TFs. 40 Chapter 4 Comparing models coupled with prior structural information 4.1 NCI-60 Data preparation NCI-60 is a 60 human cancer cell line data set by Developmental Therapeu tics Program of the National Cancer Institute to screen more thn 100,000 chemical compounds since 1990, includeing leukemias, melanomas and can cers of ovarian etc [10, 53]. Until now, the NCI-60 cell lines have been characterized more extensively than any other set of cells in existence [53]. The purpose of our study for NCI-60 data set will be focused on the p53 hub, which is a central TF controlling the development of a variety of cancers in mammals. Hence, the ways we collected the data are including the set of genes which are targeted by p53 and potentially higher-level TFs regulating p53. Only transcript consistent re-mapped probe sets were used, i.e. probe sets that consistently match the same sets of transcripts. Further, as men tioned in the introduction to span the network across different molecular levels, we also collect a set of expression data of miRNAs in NCI-60. And we only consider the regulatory direction of microRNA—4gene (including TF). Finally, to incorporate the information from other sources (e.g. from sequence comparative analysis), binary structural data, i.e. TF—*gene and miRNA—gene, are also collected, respectively. On one hand for TF—gene, the selection was made initially of tar gets from a ChIP-PET (Chromatin ImmunoPrecipitation and sequencing 41 Chapter 4. Comparing models coupled with prior structural information of paired-end di-tags) experiment in colon cancer cells (treated 6hrs with 5fluorouracil that elicits a p53 response), the most likely targets were selected from Table S4 in [59] and also from data deposited at UCSC genome browser table GIS ChIP-PET Genome Institute of Singapore ChIP-PET. Poten tial targets were limited to genes downstream of ChIP binding sites. These tables select the most probable mappings and also looks for the presence of p53-TFBS. More targets are added from Ingenuity knowledge base [2] (both transcription activation and repression) and from known targets in the lit - erature referred to [59]. Hence, the TFs chosen as potential targets of TP53 or regulators of TP53 are from three sources: collected from ChIP data, ingenuity and p53 knowledge base. On the other hand for miRNA—*gene, the potential targets selection was made from PITA version 5 database [1]. The prediction of PITA based on two levels. The stringent one is to choose top selections based on high conservation score across species (giving 2.24% positive interactions), which takes into account the structure around the seed sites. The other selection is much more noisy, which includes all predictions without filtering (giving 23.45% positive interactions). For our analysis, only the top predictions were used where conservation is taken into account. In summary, we have four input data sets: • X: a 16143 x 59 matrix containing mRNA expression levels for 16143 genes across 59 experiments • Z: a 278 x 59 matrix containing miRNA expression levels for 278 miR NAs across 59 experiments • D: a 16143 x 183 binary matrix containing TF—*gene prior structural information based sequence motif analysis • C: a 16143 x 278 binary matrix containing miRNA—*gene prior struc tural information based sequence motif analysis Note that there is one patient removed from the study because the measured expression values are constant over all treatments. To the same end to reduce computation as in [31], we fit various models within the positive interactions in prior binary data and select interactions 42 Chapter 4. Comparing models coupled with prior structural information Figure 4.1: Flowchart of learning regulatory network on NCI-60 data set. as predicted positives with supports from both prior structural information and the resulting sparse network. {(i,j) : LD = 1 fl = 1} {(k,l) : LCkI = 1 flCkj = 1} where LD and LC is the learned structures for TF—÷gene and miRNA—*gene by variable selection procedures, respectively. The whole process is illustrated in Figure 4.1. Remark: In current pipeline, we do not compare with the variable selec tion procedure without prior for every gene. We think of the prior structure information have taken account into most biological relevant facts and be43 Chapter 4. Comparing models coupled with prior structural information sides this information, there are also noises in the prior information, i.e. we assume that the probability of analogous type I error is small and under the positives in prior, use variable selection to denoise the irrelevances that are not supported by our expression data (remember we subset the learned net work from prior network). Thus, the methods with prior information should be better than those without prior, i.e. if the model can detect the signals contained in the prior, then the model should be also able to and easier to detect the same interaction in smaller range. Three specific examples are taken to compare between with and without prior information in the next section to justify this, see Table 4.2. 4.2 4.2.1 Results Prediction of p53 related microRNAs and genes We took several specific examples that of particular interests and related to p53 (TP53) regulation [551. Some of them have biological support and the others are predicted from bioinformatics algorithms with high conservation scores (e.g. PITA) and suspected to have some biological functions that are interesting to us. 1. TP53—*GAS1. Growth arrest-specific 1 (GAS1) plays a role in growth suppression. GAS1 blocks entry to S phase and prevents cycling of normal and transformed cells. GAS1 is a putative tumor suppressor gene and TP53 as a protein has a domain important for the activity of GAS1 as a suppressor of the cell cycle, i.e. an anti-proliferative function [49]. 2. hsa-miR-34---GAS1. hsa-miR-34, conjectured to combine with p53, coregulates the transcription activities of GAS1 gene. hsa-miR-34---*GAS1 is predicted by PITA taking into consideration the structure around the seed sites, and one of the top predictions where conservation is taken into account. In our analysis, we took two members of its sub family, i.e. hsa-miR-34a and hsa-niiR-34c. 44 Chapter 4. Comparing models coupled with prior structural information 3. ETS1—TP53 and TP53—ETS1. There are some evidences that TP53 and ETS1 interact each other at protein level, regulating down stream target genes [38]. Two members of the ETS family of tran scription factors, ETS1 and ETS2, have been shown to bind to a palindromic ETS binding site and were able to trans-activate a het erologous promoter containing the binding site [13]. However, we are also interested in the direction TP53—*ETS1 because nearly all of our models claimed this interaction (see below). 4. TP53—RB1. p53 is known to suppress RB1 transcription through inhibition of the basal promoter activity [63]. Additionally, post transcriptionally, p53 might suppress RB1 further by inducing mir iO6a, a known suppressor of RB1 [55]. These predicted interactions are summarized in Table 4.1. Here, we do not include the CLR because the thresholding is not easy to be determined and justified without ground truth interactions known. By rows, we can see that Li and SCAD models can detect most of the interactions. NG and NJ models tend to produce sparser regulations. Although NIG can identify most non-zero elements, but the effects of these coefficients are very small and negligible. In fact, we can threshold out these small coefficients and read them as effective zeros. Hence, NIG is similar to the NG and NJ models. Moreover, the fitted models tend to have MSEs for Li and SCAD which are uniformly smaller over NG, NJ, and NIG models. This is not surprising as the Li and SCAD are less sparser than the others. By columns, we view the results for each specific interaction. For TP53—GAS1 and TP53—+ETS1, it seems most of the models can correctly identify the interaction (there are some exceptions such as NG MM3, NG EM, and Li MM2 etc). For hsa-miR-34---*GASi (including two family members), Li MM3, Li EM and SCAD MM3 select it as positive interaction, and the regulatory strength is negative due to the degradation nature of miRNAs. For TP53—*RBi is an interesting example, which can be correctly identified in some models with any solution (Li and SCAD), but in other models none of solutions do the right job (NIG, NG and NJ). As we mentioned above, p53 is known to have 45 Chapter 4. Comparing models coupled with prior structural information Table 4.1: MSE and estimated coefficients for the specific interactions pre dicted from NCI-60 data set. Bold numbers represent the estimated interactions agreeing with the literatures. hsa-miR-34a—*GAS1 hsa-miR-34c---*GAS 1 TP53—GAS1 Interaction MSE Coefficient Coefficient MSE__Coefficient MSE -.49 -.18 .23 Li MM3 .16 .16 .18 0 0 .18 Li MM2 .18 0 -.49 .16 -.18 .16 Li EM .23 .16 .18 0 0 .18 NIG MM3 .18 0 -1E-6 .18 -2E-6 1E-6 .18 NIG MM2 .18 -2E-6 .18 -2E-6 1E-6 .18 NIG EM .18 .18 0 .18 NG MM3 .18 0 0 .17 0 .17 0 .17 NG MM2 .17 .18 0 0 .18 NGEM 0 .18 .17 0 .17 NJ MM3 .21 0 .17 .17 0 .17 .17 0 .17 NJ MM2 .17 0 .21 NJEM .17 .17 0 -.49 .16 -.18 SCAD MM3 .16 .23 .16 .17 0 SCAD MM2 .36 .17 0 .17 TP53—*ETS1 TP53—*RB1 ETS1—TP53 Interaction MSE Coefficient MSE Coefficient MSE Coefficient .16 .37 Li MM3 .059 .13 0 .085 .28 0 Li MM2 .17 .049 0 .20 .16 <iE-8 .37 Li EM .14 .056 .059 .33 .20 NIG MM3 0 0 .18 .086 .20 .37 3E-7 .15 NIG MM2 0 .086 .28 .21 3E-7 1E-8 NIG EM .18 .086 .39 .18 0 .13 NG MM3 0 .080 .47 .17 .11 NG MM2 0 0 .080 .38 .19 0 .16 NGEM 0 .086 .51 .17 .11 0 NJ MM3 0 .079 .17 .47 0 .11 NJ MM2 0 .079 .51 .17 NJEM .11 0 .079 0 .16 .37 .14 0 SCAD MM3 .056 .058 .90 .097 1 E-5 .22 0 SCAD MM2 .038 . 46 Chapter 4. Comparing models coupled with prior structural information a negative regulatory effect on RB 1 in inhibiting the basal promoter activity, but for cancer cell lines, there might be a positive effect as well, although currently there is no literature to support this claim. Finally, in the p53 knowledge base, ETS1 is documented to regulate TP53, but here none of our models detect this interaction. Surprisingly, nearly all of our models (ex cept Li MM2) claim there is an opposite regulatory strength TP53—ETSi. After we searched through the literature, there indeed are experimental ev idence in terms of apoptosis function to support this claim. For example, in embryonic stem (ES) cells, mouse ES cells lacking ETS1 are deficient in their ability to undergo UV-induced apoptosis, similar to p53 null ES cells. Chromatin immunoprecipitations demonstratedthat ETS1 was required for the formation of a stable p53-DNA complex under physiological conditions and activation of histone acety1tiansferase activity. These demonstrate that ETS1 is an essential component of a IJV-responsive p53 transcriptional ac tivation complex in ES cells and suggests that ETS1 may contribute to the specificity of p53-dependent gene transactivation in distinct cellular com partments [621. Furthermore, to see how the prior structural information can improve the prediction ability, we select GAS1 gene as the target gene and there are 463 possible regulators in our data set, including all TFs and miRNAs. The reason of not doing variable selection for each gene without prior information is mainly a computational concern as stated before. Therefore, we just selected a few examples to demonstrate the prediction improvement. The comparison result is summarized in Table 4.2. Essentially without prior information, all the three interested interactions are not selected among the 463 candidate regulators. We conjecture that there is so many variables that by including them all, it is like finding a needle in a haystack. In the work presented above, we started from a conjectured bona fide in teraction and compare the performance of various models. However, in most realities, we think of the problem differently: given a set of genes and pos sible regulatory networks, how can we infer the biological functions of those genes and networks? In the next section, we apply functional enrichment analysis to check whether or not our predictions are functionally significant. 47 Chapter 4. Comparing models coupled with prior structural information Table 4.2: Estimated coefficients for the specific interactions predicted from NCI-60 data set, compared between with and without prior information. Coefficients with absolute values less than 10—8 are thresheld to 0. Bold numbers represent the estimated interactions agreeing with the literatures. hsa-miR-34a—*GAS1 hsa-miR-34c---GAS1 TP53—CAS1 Interaction Prior No prior Prior No_prior Prior No prior -.49 -iE-08 -.18 iE-07 Li MM3 0 .23 0 0 Li MM2 0 0 0 0 -.49 0 -.18 Li EM 0 0 .23 0 0 0 NIG MM3 0 0 0 -1E-6 -2E-6 NIG MM2 0 1E-6 0 0 -2E-6 0 -2E-6 NIG EM 1E-6 0 0 0 0 0 NG MM3 0 0 0 0 0 MM2 0 0 0 .17 NG 0 0 0 0 0 NGEM 0 0 0 0 0 0 .21 NJ MM3 0 0 0 0 .17 NJ MM2 0 0 0 0 .21 NJ EM 0 0 -.49 0 -.18 3E-08 SCAD MM3 0 .23 0 0 SCAD MM2 0 0 0 .36 48 Chapter 4. Comparing models coupled with prior structural information 4.2.2 Functional enrichment analysis Functional over-representation analysis was performed to objectively iden tify biological processes potentially affected by p53 and miRNA target genes. Specifically, the learned network in previous section is cut into small sub groups. For each group, we test the significance of every possible function assumed in these sub-networks by counting the occurrence number of a func tion with a given gene ontology (GO) annotation. The background is set to be the domain which contains all the genes touched by the prior structure data. A significant p value (p < 0.05) associated with Fisher’s exact test indicates that the observed percentage of the target genes with a given anno tation could not likely occur by chance given the frequency of prior network with the same annotation. Note that we do not use genes covering the whole genome as the background to avoid potential underestimate, because genes filtered by prior network never have a chance to enter the model and thus will have no influence on the learned sub-network. Since apoptosis and growth arrest are common known consequences of p53 activation, we tested whether p53/miRNA tend to target apoptotic (a form of programmed cell death in multicellular organisms) and cell prolif eration related genes. [55] has shown that while cell cycle regulation was among the top in the p53/miRNA targets functional enrichment, apopto sis was not significant. Interestingly, this agrees with our results, i.e. in all scenarios, we found the apoptotic function annotation is not enriched by setting p-value cutoff at 0.05. Hence, we use p53/miRNA target gene functional enrichment analysis to identify specific p53/miRNA that target significant cell proliferation genes. The results for Li model are shown in Figure 4.2 and Table 4.3. The enrichment analysis results for other models are presented in Appendix C Supplementary Materials section: Figure C.1, C.2, C.3, C.4, Table C.6, C.7, C.8 and C.9. Obviously, genes with this enriched function that the EM algorithm has identified concentrate only on p53 regulator, and consequently EM can de tect many genes that have cell proliferation annotation. On the contrary, MM2 and MM3 have a different landscape: they identified many regulators 49 Chapter 4. Comparing models coupled with prior structural information Table 4.3: Identified significant p53/miRNAs that target known cell prolif eration genes from Li prior. Method Regulator Total Ioi oredieted tareets . cell proliferation genes p-value Gene names UUK4rltjHI,rt b[N,1UFbl,k’AXa,ILIA, EM M M2 MM3 TP53 1167 43 5.1526-3 ACADVL hsa-miR-125a hsa-miR- 155 PAX5 T hsa-miR-27b hsa-miR-296 hsa-mift-302b hsa-miR-372 ACADVL PSMG3 hsa-miR-202 hsa-miR-429 hsa-miR-507 hsa-miR-519l, ACADVL hsa-miR-125a hsa-miR-135a hsa-mifl-181a hsa_miR202* hsa-miR-221 hsa-xniR-27b hsa-miR-410 ACADVL PAX5 hsa-miR- 191 hsa-niiR- 206 hsa-miR-296 hsa-miR-378 hsa_miR524* hsa-miR-96 SF’l HDAC2 IRF1 MYCN hsa-miR- 186 hsa-miR-195 hsa-mik-202 hsamiR2o2* hsa-miR-31 hsa-mih-34r hsa-mik-429 hsa-miR-495 hsa-miR-499 hsa-miR-518e hsa-miR-7 11 68 108 30 9 124 18 161 125 5 6 111 198 60 88 15 88 180 111 66 145 159 128 26 34 98 128 15 36 251 133 41 20 52 90 383 60 99 66 135 116 211 166 153 53 115 2 3 4 6 2 3 2 6 4 2 2 4 5 3 5 3 3.5716-2 4.8986-2 4.3506-2 3.9106-2 3.0616-2 4.4376-2 4.4966-2 4.9606-2 2.1596-2 4.2476-2 4.4556-2 3.7066-2 1.6866-2 3.7036-2 1.2806-2 2.8066-2 1.0306-2 4.1516-2 2.4276-2 1.2296-2 3.8136-3 4.0756-3 1.2576-3 3. 1496-2 9.7236-3 4.9046-2 4.7726-2 3.0866-2 1.2446-2 3.5706-2 4.8086-2 1.1516-2 3.3316-2 2.6576-2 1.0236-2 2.2856-2 3.7546-2 2.4156-2 3.020E-2 9.240E-3 3 .3426-2 2.2396-2 3.4646-2 3. 1696-2 4.9116-2 1.3766-2 4 6 7 4 6 6 8 5 7 3 3 2 3 6 3 5 3 8 5 8 4 4 4 7 3 5 5 4 3 5 IL1B,ADRA1D,GAS6,PRL,BUB1B,CD1C6. CTF1 ,IcFBp4cYR61 ,AREG,EGF,FGF2, FGF7,FTH1,MDM4MK167,PIM1,PRKDS ,MAP21C1, RAflSTIL,TGFATXN,GULS,BUE1,EPS8,PLK1, E2F1,IFll6,REST,Tpx2,DLG7,PDGFG,OSM,E2F8, ZEE1,TNFSF13B GDK4,s MAP3IC11,MAPRE2,PESS PTEN,FGF2,SPOGIC1 BCAT1 1L7,FLT3LG,FGF17,CDKN1B,FLT3,HES1 LIF,SUZ12 DAZAPaSUZ12,PLEK9IK1 NR2E1,IGF2BP1 IGF1 GGND2,LIF,DAZAP2,SUZS2,IGF2BP1 NR2EI,GUL3,DAZAP2,SUz12 FTH1,HMOX1 RB1,MDM4 F’TEN,NDN,ST13,GPG3 PTEN,NDN,TIMP2,TOB1,FWAG4 BTG1,HDAG4,MNT PTEN,HDAC4,TSG 101. ADAMTS1,ST638 Gt11C4,CD1C6,Ezr RPS27,MAP3}C1 1,MAPRE2,PES1 DAB2,MNAT1,GDICSR1,EVIS,P1M2,MAPRE2 TGFBI,CFCSSE,0NA12,lRS2,UGHLS,ElThHB3,zAlC PTEN,EPS15,FGF2,GULS Rp54x,RPS27CYRSS,GNAI2.FRAT2,PDGFC RPS4X,EP815PRICD1,STJL,TRIB1,BHLHB3 COL4A3,RPS4X,RPS27,FGF2,PRKD1 ,IRS2,EPS8,HDGFRP3 BcL2,TGFB1,cxculo,ccr4D2,cDlcNlB 1L7FLT3LGFGfl7clD1cNlwFur3,HE5S,5TATSB lOFi R,HES 1 PEEPS IRS2,FOXA1 ,PBEFI NR2ES,IGF2BP1 9CLFS,CCND2,NTN1 FGA,IRS2,IJES 1,5uz12,Hoxclo,TNS3 PBEF1,MA821L2,STATSE TGFE1,PTGS2,EREG,HDAC4,ETS1 ILIE,TNF,TGPE1 IL1B,TNF,TGFE 1,PTGS2,EIF2AIC2JAFC2,OSM,JL1RN GLI3,RE1,TGFB1,TIMP2,LEPRE1 PTEN,E0x82SICAP2,SMAD4,ToE1,HDAc4,GIIERP,T062 PPM1D,GHERP,sEsNl,pnssE PTEN,NDNST13DLGAP2 PTEN,FGF2cULs,SMAD4 TIMP2,SRAP2,3A62,HDAG4,STK38,PDSSB,ETS1 BTG1,NOTGH2,ETS1 NDN,TIMP2TOB1,}4DAG4,SESNS PTEN,SSTR1 NDN,PAWR,ETS1 CULS,GUL1,PPM1D,F0x04 BMP2,MED17,PDS5E PPM1D,cNoT8,SEnSBs,ET51,GJ66 50 Chapter 4. Comparing models coupled with prior structural information including TFs and miRNAs targeting some genes which have cell prolifera tion annotation, but for each regulator the total number of genes which was known to be cell proliferation is much smaller than that of EM. However, this case only happens with the Li prior model and with the NIG model, only MM3 gets much more enriched regulators, see Table C.7. Again, comparing among priors, we found that results of NJ model for the three algorithms are essentially the same, which agrees with the simulation studies. Most of the models and algorithms have p53 regulator enriched, which makes sense since the cell proliferation is an important function of p53 for suppressing tumor genesis. 51 C,’ t’) > 0 4, 4, 0 50000 0) 0 a > 4, 4, 0 0 (a) EM Index 100000 0 4 150000 20000 200000 0 0 Index 60000 0 a 4, (c) MM3 40000 o_ (1 0 0 80000 Oe.00 24,04 6e+04 Indec 120000 (b) MM2 44+04 64,04 le+05 Figure 4.2: Fisher’s exact test p-values with the Li prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05. Chapter 4. Comparing models coupled with prior structural information Figure 4.2 shows the negative logarithm of p-values in a descent order. The larger the log(p.-value), the more significant of corresponding enriched function in the subnetwork. Clearly in Li case, the p-values are skewed and not uniformly distributed, which means the learned associations are unlikely to happen by random chance. EM for Li has many p-values shifted to the non-significance level, and this again agrees with Table 4.3, which reflects that Li with EM is concentrated on identifying a few large subnetworks. For MM2 and MM3, the differences in the tails are not significant as EM. For other priors in general, the different shapes of log(p-value) the plots confirm the different landscapes listed in tables. For example, EM and MM2 are similar to each other with the NIG prior in terms of both the plots and enriched genes (see Figure C.2 and Table C.7), while MM3 significantly differs from those two. Note that we do not perform a multiple comparison correction procedure here. The reason is that we test the significance of each possible function by conditioning on its local sub-network and its annotation, this will introduce very complicated dependency structures among different testings. Thus, the independence assumption of multiple comparison fails here and it is very difficult to control the false positive error rate. For this reason no multiple testing is taken into account, the significance of the p-values should be taken with a grain of salt. Although we correctly inferred the TP53—*GAS1 regulatory relationship in most models, this interaction is not enriched in over-representation anal ysis here. On the other hand, by comparing to miRNAs which target cell prolif eration genes and were captured by [55], we predicted several regulatory interactions that have been reported in the literature as differentially regu lated in various cancerous tissues or cancer cell lines. For example, CKSiB, IRS2, and TGFBI, target genes of miR-181a, overlap the results found in [55]. We did a literature search and there are indeed very recent evidences to support the predicted cell proliferation function [45, 61, 66]. Speficially, overexpression of CKS1B is linked to a poor prognosis in multiple myeloma and contributes to increased p27Kipl turnover, cell proliferation, and a poor 53 Chapter 4. Comparing models coupled with prior structural information prognosis in many tumor types [661. [661 showed that CKS1B influences myeloma cell growth and survival through SKP2- and p27Kipl-dependent and -independent mechanisms and that therapeutic strategies aimed at abol ishing CKS1B function, which in our case is the possible degradation through miR-181a regulation, may hold promise for the treatment of high-risk dis ease for which effective therapies are currently lacking. Very recently, The protein levels of IRS2, insulin receptor substrate-2, has been showed that reduced IRS2 levels in the islets by high dosage chlorpromazine contributes to apoptosis of pancreatic b-cells and decreased proliferation [45]. TGFBI, transforming growth factor /3-induced, can reduce cell proliferation or be down-regulated in certain type of cancers, including bladder cancer [61]. Another example addressed in [55] is miR-195, which was also reported by our model. This miR-195 is important in that it targets genes that may function as both apoptosis and cell proliferation and these two functions are ranked very significantly from their model. Our models, for example Li, have also predicted this miR-195 and part of its target genes are overlapped with [551, i.e. PPMID and SESN1 genes. A recent study also implicated miR-195, along with miR-23a and miR-24, in cardiac hypertrophy and re ported that these miRNAs were regulated in response to stress signaling in the heart [571. GAS6, a gene targeted by p53 and enriched from Li prior with EM al gorithm in Table 4.3, like GAS1 is a member of the growth arrest sequences (GAS) family genes (known modulators of cell cycle progression and sur vival). Serial analysis of gene expression from aggressive mammary tumors derived from transplantable p53 null mouse mammary outgrowth lines re vealed significant up-regulation of GAS6 transcripts. The minimal region of amplification contained genes CUL4a, LAMP1, TFDP1, and GAS6, highly overexpressed in the p53 null mammary outgrowth lines at preneoplastic stages, and in all its derived tumors [3]. As we have shown many examples above, combining the variable selec tion and over-representation analysis can improve the quality of predicted functional interactions, and thus reduce the number of interactions without support from biological interpretations. We presented examples with pre 54 Chapter 4. Comparing models coupled with prior structural information dicted interactions without enriched functions, although these predictions have some direct/indirect biological supports. On the other hand, there are also many predicted interactions with enriched annotations, while those interactions are not found in the literature. Besides, comparing across various models can further enhance and pro vide clues for biological analysis. For example, using EM and sometimes MM2 (e.g. NIG prior) tend to focus on fewer regulators with a large popu lation of target genes, while using MM2 and MM3 can detect more enriched regulators (notably most of them are miRNAs). 55 Chapter 5 Conclusions and discussions In this thesis, our ultimate goal is integrating multiple data sources and us ing statistical methods to construct regulatory networks spanning different molecular levels. The regulatory interactions are modeled as multivariate linear regression models. However, linear model applied to high-dimensional data are usually over-parametrized, unclear for interpretation, and have high variance for estimated parameters. Motivated from this fact, we formulate the problem as fitting a sparse model to reduce its dimensionality. To achieve this, we used penalized regression models and look for the corresponding penalized MLEs (pMLEs). Because the penalization can be alternatively viewed as prior regularization, different functional forms of penalties can be mapped to priors on parameters and the pMLE is also the correspond ing MAP estimator. Based on this interpretation, various scaled mixture Gaussian prior distributions are proposed such that the marginal densities of coefficients/regulatory strengths can be given in closed form. The prac tical question is to determine the MAP estimation. Currently, there exists several optimization approaches. For example, EM and MM algorithm. Ap plying these two methods to a variable selection prior which is unbounded at origin could possibly entail the local optimum problem (although this case rarely happens). To address this issue, we propose a variant MM al gorithm to avoid this. The following simulation studies were performed to compare the performance of various models with the three algorithms. Then, these methods were applied to an E. coli data set with ground truth interactions believed to be known. The performances are measured in terms of precision-recall curve. The results showed that Li MM3, SCA1 MM3, Li EM, and NIG EM methods are performing best among all models, and their performances are comparable to the state-of-the-art regulatory net 56 Chapter 5. Conclusions and discussions work construction algorithm CLR. This shows that a simple linear regres sion network with a properly chosen prior can compete with the current best method. Further, five hub genes were selected with common predic tions from high performance algorithms and we showed that at functional level, the learned networks agree with current literature. Finally, to arrive at our ultimate goal, we coupled our models with collected prior structural information from various databases. A sub-network of the prior network was learned for each model, and we took several important interactions (possibly conjectured) to demonstrate the detecting ability of our models. To identify predicted interactions which are significantly enriched with cell proliferation function annotation, we subsequently did an over-representation analysis of our prediction results. Several interactions were taken as examples to show that some of predictions indeed exert cell proliferation function in abnormal cell lines, as confirmed by literature and agreeing with the other studies. On the variable selection side, beyond the models and methods used in this project there are many other ways to estimate both a compact submodel and parameters. For parameter estimation post-model-selection, see [47]. Recently, there are also heavy literature volumes concerning simulta neously do model selection and parameter estimation [11, 26, 42, 56, 60, 68], both in frequencist and Bayesian contexts. The combinatorial modeling was discussed at the beginning of the thesis, and we did not implement this combined effects in this project. This is a computational concern. Grouped variable selection may be applied to reduce the optimization burden, but we have no extra time to extend on this topic. It is certainly a future topic worthy to study the combined effects of regulators. To conclude, by comparing different statistical models and integrating multiple data information can improve the reliability and facilitate interpre tation of constructed regulatory networks, in terms of prediction variability and biological functions. 57 Bibliography [1] http: I/genie. weizmann. ac. il/pubs/miro7/miro7_data.html. [2] Ingenuity systems. 2007. [3] Martin C. Abba, Victoria T. Fabrisi, Yuhui Hu, Frances S. Kittrell, Wei-Wen Cai, Lawrence A. Donehower, Aysegul Sahin, Daniel Medina, and C. Marcelo Aldaz. Identification of novel amplification gene targets in mouse and human breast cancer at a syntenic cluster mapping to mouse ch8al and human ch13q34. Cancer Res., 67(9):4104—12, 2007. [4] M. Abramowitz and I. Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Dover Publications, tenth edition, 1972. [5] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716—23, 1974. [6] A Antoniadis. Wavelets in statistics: a reivew (with discussion). J. Italian Statistical Society, 6:97—144, 1997. [7] K. Azra and et al. Combinatorial microrna target predictions. Nat. Genetics, 37:495—500, 2005. [8] Nobelmann B and Lengeler JW. Molecular analysis of the gat genes from Escherichia coli and of their roles in galactitol transport and metabolism. J Bacteriol., 178(23) :6790—5, 1996. [9) P. Billingsley. Probability and Measure. Jone Wiley & Sons, Inc., second edition, 1986. 58 Bibliography [10] Paul E. Blower and et al. Microrna expression profiles for the nci-60 cancer cell panel. Mol Cancer Ther, 6(5), 2007. [11] P. Brown, M. Vannucci, and T. Fearn. Multivariate bayesian variable selection and prediction. J. Royal. Statist. Soc B., 60:627—641, 1998. [12] F. Caron and A. Doucet. Sparse bayesian nonparametric regression. International Conference on Machine Learning(ICML), Accepted, 2008. [13] Reisman D and Logirig WT. Transcriptional regulation of the p53 tumor suppressor gene. Semin Cancer Biol., 8:317—324, 1998. [14] A. Dempster. Covariance selection. Biometrics, 28(1):157—175, 1972. [15] A. Dempster, N. Laird, and D. Robin. Maximum likelihood from in complete data via the em algorithm (with discussion). J. Roy. Statist. Soc. Ser. B, 39:1—38, 1977. [16] B. Efron, T. Hastie, 1. Johnstone, and R. Tibshirani. Least angle re gression. Annals of Statistics, 32(2) :407—499, 2004. [17] Gao. F, Foat. B, and Bussemaker. H. Defining transcriptional networks through integrative modeling of mrna expression and transcription fac tor binding data. BMC Bioinformaticss, 5(31), 2004. [18] J. Faith and et al. Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol., 5(1):e8, 2007. [19] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96(456):1348—1360, 2001. [20] M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1150— 1159, 2003. 59 Bibliography [21] Fanette Fontaine, Eric J.Stewart, Ariel B.Lindner, and François Tad dei. Mutations in two global regulators lower individual mortality in Escherichia coli. Mol Microbiol., 67(1):2—14, 2008. [22] I. Frank and J. Friedman. A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35:109—148, 1993. [231 J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, to be appeared, 2007. [24] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using bayesian network to analyze expression data. J. Computational Biology, 7:601— 620, 2000. [25] W. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397—416, 1998. [26] E. George and R. McCulloch. Variable selection via gibbs sampling. J. Amer. Statist. Assoc., 88(423):881—9, 1993. [27] J. Griffin and P. Brown. Alternative prior distributions for variable selection with very many more variables than observations. Technical Report, University of Kent, 2005. [28] J. Griffin and P. Brown. Bayesian adaptive lassos with non-convex penalization. Technical Report, University of Kent, 2007. [29] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford Unviersity Press, third edition, 2001. [30] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3) :55—67, 1970. [31] J. Huang, B. Morris, and J. Frey. Bayesian inference of microrna targets from sequence and expression data. J. Comp. Biol., 14(14):550—563, 2007. [32] D. Hunter and R. Li. Variable selection using mm algorithms. Annals of Statistics, 33:1617—1642, 2005. 60 Bibliography [33] Hartemink. J, Gifford D, Jaakkola. T, and Young. R. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Nature Reviews Genetics, 6:422—33, 2001. [34] Keene. J. Rna regulons: coordination of post-transcriptional events. Nature Reviews Genetics, 8:533—543, 2007. [35] B. John, A. Enright, A. Aravin, T. Tuschl, C. Sander, and et al. Human microrna targets. PLoS Biol., 2(11):e363, 2004. [36] K. Knight and W Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5):1356—78, 2000. [37] S. Konishi and G Kitagawa. Information Criteria and Statistical Mod eling. Springer, first edition, 2008. [38] Sebum Lee and Hriday K. Das. Inhibition of basal activity of c-jun nh2-terminal kinase (jnk) represses the expression of presenilin-1 by a p53-dependent mechanism. Brain Res., 1207:19—31, 2008. [39] B. Lewis, C. Burge, and D. Bartel. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microrna targets. Cell, 120:15—20, 2005. [40] Buck. M and Lieb. J. Chip-chip: considerations for the design, anal ysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83:349C360, 2004. [41] Eisen. M, Spellman. P, Browndagger. P, and Botstein. D. Cluster anal ysis and display of genome-wide expression patterns. Nature Reviews Genetics, 95:14863—8, 1998. [42] J. Mann and C. Robert. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer-Verlag, first edition, 2007. [43] N. Meinshausen and P. Buhlmann. High-dimensional graphs and vari able selection with the lasso. Annals of Statistics, 34(3):1436—62, 2006. 61 Bibliography [44] M. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319—337, 2000. [45] Sunmin Park, Sang Mee Hong, Ji Eun Lee, So Ra Sung, and Sung Hoon Kim. Abstract chiorpromazine attenuates pancreatic /3-cell function and mass through irs2 degradation, while exercise partially reverses the attenuation. J Psychopharmacol, 2008. [46] D. Pe’er, A. Regev, G. Elidan, , and Friedman N. Inferring subnet works from preturbed expression profiles. Bioinformatics, 17:S215— S224, 2001. [47] B. Potscher. Effects of model selection on inference. Econometric The ory, 7(2):163—85, 1991. [48] Jensen. 5, Chen. G, and Stoeckert. C. Bayesian variable selection and data integration for biological regulatory networks. Ann. Appi. Stat., 1(2):612—633, 2007. [49] G Del Sal, E M Ruaro, R Utrera, C N Cole, A J Levine, and C Schnei der. Gasi-induced growth suppression requires a transactivation independent p53 function. Mol Cell Biol, 15(12):7152—60, 1995. [50] H. Salgado and et al. Regulondb (version 5.0): Escherichia coli k-12 transcriptional regulatory network, operon organization, and growth conditions. Nuclear Acid Res., 34:D394—397, 2006. [51] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461—4, 1978. [52] R. Shalgi, D. Lieber, M. Oren, and Y. Pilpel. Global and local ar chitecture of the mammalian microrna-transcription factor regulatory network. PLoS Comput. Biol., 3(7):e131, 2007. [53] U. Shankavaram and et.al. Transcript and protein expression profiles of the nci-60 cancer cell panel: an integromic microarray study. Molecular Cancer Therapeutics, 6(3) :820—32, 2007. 62 Bibliography [54] J. Shao. Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88:486—94, 1983. [55] Amit U Sinha, Vivek Kaimal, Jing Chen, and Anil G Jegga. Dissecting microregulation of a master regulatory network. BMC Genomics, 9(88), 2008. [56] R. Tibshirani. Regression shrinkage and selection via the lasso. Royal. Statist. Soc B., 58:267—288, 1996. J. [57] Eva van Rooij, Lillian B. Sutherland, Ning Liu, Andrew H. Williams, John McAnally, Robert D. Gerard, James A. Richardson, and Eric N. Olson. A signature pattern of stress-responsive micrornas that can evoke cardiac hypertrophy and heart failure. Proc Nati Acad Sci U S A, 103(48):18255—60, 2006. [58] W. Wasserman and A. Sandelin. Applied bioinformatics for the identi fication of regulatory elements. Nat. Genetics, 5:276—287, 2004. [59] C. Wei, Q. Wu, and et al. A global map of p53 transcription-factor binding sites in the human genome. Cell, 124(1):207—19, 2006. [60] W. Wilks, S. Richardson, and D. Spiegeihalter. Markov Chain Monte Carlo In Practice. London: Chapman & Hall, first edition, 1995. [61] Wang X, Colby JK, Rengel RC, Fischer SM, Clinton SK, and Klein RD. Overexpression of cyclooxygenase-2 (cox-2) in the mouse urinary bladder induces the expression of immune- and cell proliferation-related genes. Mol Carcinog, 2008. [62] Dakang Xu, Trevor J. Wilson, David Chan, Elisabetta De Lucal, Jiong Zhoul, Paul J. Hertzogl, and Ismail Kola. Etsi is required for p53 transcriptional activity in uv-induced apoptosis in embryonic stem cells. The EMBO Journal, 21:4081C409, 2002. [63] Shiio Y, Yamamoto T, and Yamaguchi N. Negative regulation of rb expression by the p53 gene product. Proc Natl Acad Sci, 89(12):5206— 5210, 1992. 63 Bibliography [64] Stephen Yeung, Jesper Tegnçr, and James Collins. Reverse engineering gene networks using singular value decomposition and robust regression. PNAS, 99(9):6163—8, 2002. [65] A. Zeliner and A. Siow. Posterior odds ratios for selected regression hypotheses. In Bayesian Statistics: Proceedings of the First Interna tional Meeting held in Valencia (Spain), Valencia: University Press, pages 585—603, 1980. [66] Fenghuang Zhan, Simona Colla, Xiaosong Wu, Bangzheng Chen, James P. Stewart, W. Michael Kuehl, Bart Barlogie, and Jr John D. Shaughnessy. Ckslb, overexpressed in aggressive disease, regulates multiple myeloma growth and survival through skp2- and p27kipldependent and -independent mechanisms. Blood, 109(11) :4995—5001, 2007. [67] P. Zhao and B. Yu. On model selection consistency of lasso. J. of Machine Learning Res., 7:2541—2563, 2006. [68] H. Zhou and T. Hastie. Regularization and variable selection via the elastic net. J. Royal. Statist. Soc B., 67(2):301—20, 2005. 64 Appendix A Methods Penalties A.1 Information criteria A.1.1 AIC and BIC are motivated from model misspeciflcation and model selec tion area [37]. AIC [5] and BIC [511 corresponds to the penalty function respectively. and ) = with form p>, = 0.52ll(,3 0), where = Thus, the penalization is based on the complexity of assumed model and discontinuous w.r.t. parameter /3. Moreover, it is well known that model se lection procedures based on AIC are inconsistent in terms of the probability of selecting an over-parameterized model is asymptotically positive [47]. Hard thresholding A.1.2 2 Then this is the hard thresholding penalty [6], which is a smoothed version of entropy penalization. Let A.1.3 = —(1/31— )2I(j/3 < )b). 3 penalty D Choosing the £ penalty, i.e. = AIi3V’, is the solution of the bridge re gression [22, 25], in linear regression context. Typically, p e [0,21. Moreover the penalized likelihood has variable selection property when p 1 and the 1. optimization problem of O(.) function in Eq.(2.2) is convex only if p Particularly, p = 0 corresponds to traditional model selection, p = 1 yields LASSO [56] and p = 2 ridge regression [30]. Under £° penalty, Shao [54] has shown that the cross-validated choice of the penalty parameter ) is con sistent for model selection under certain conditions on the size of the testing data set. But this oracle property does not hold for £‘ penalty. Under a 65 Appendix A. Methods set of assumptions (c.f. Assumption 1-6 in [43]), however, Meinshausen and penalty is a consistent (in probability) ap Buhlmann [43] showed that proximation of jointly modeling the dependency in the Gaussian graphical model context [14, 23]. Zhao and Yu [67] also showed the probability of LASSO choosing the true model goes to 1 at the exponential rate, provided the regularity conditions and Strong Irrepresentabie Condition holds. Fur thermore, Knight and Fu [361 showed that LASSO even retains the model estimation consistency and asympototic normality when ? = o(n), where = 0. f (n) = o(n) means 1im Based on these work, we can further extend the consistency of LASSO in a stronger sense, i.e. almost sure (a.s.) convergence. Before establishing the results, the definitions are presented below: Definition A.1.1. (Model Selection Consistency) 1. In probability: P(sgn(/3(n)) sgn(3). sgn(/3(n)) = sgn(/3)) 2. Almost sure: P(1im_ sgriã(n)) = — 1, as n sgn(f3))) = —* oo, notated as 1, notated as sgn(jJ(n)) sgn(/3). Proposition A.1.1. Fix p, under the regularity conditions as in /36]: as n —* cc — maxxTxj - C>-O (A.1) —* 0 (A.2’) fli=1:n where C >- 0, meaning C is a positive definite matrix, and if Strong Irrcp sgri(/3). resentable Condition holds /67], then sgnãLl(n)) The proof is given in Appendix B Proofs, and essentially an application of Borel-Cantelli lemma ([9] p53-57). It reveals the fact that if the incon sistency decreases at a rapid rate when the number of data points tends to infinity, then there could be only a finite number of incorrected estimated models, which implies the a.s. convergence. 66 Appendix A. Methods A.1.4 SCAD penalty Let the continous differentiable penalty function has the form p(O) = A{(6 < A) + O> A)} (A.3) with some a> 2 and 0> 0, then p is called the Smoothly Clipped Absolute x if Deviation (SCAD) penalty [19]. x is the positive part of x, i.e. x x 0; x = 0 if x <0. SCAD penalty is very similar to £‘ penalty in the region where the absolute values of estimated parameters are small. But SCAD penalty is constant for large absolute values of estimated parameters to avoid over-penalization or biasness. See Figure 2.2(b). A.1.5 Bayesian linear regression The penalization for likelihood function has Bayes interpretation. The penalty term can be alternatively viewed as the prior regularization of the sampling model. Therefore, to maximize the penalized likelihood function is equivalent to find the mode of corresponding posterior distribution. For example, LASSO estimator is the posterior mode resulted from Gaussian likelihood coupled with double-exponential prior. Griffin and Brown [27, 28] proposed a family of Bayesian linear regression models using a scale mixture of Gaussian prior on regression coefficients, which includes Normal-Gamma (NG) and Normal-Inverse Gaussian (NIG) model. This hierarchical model family can be extended to include Zeilner’s g-prior [65], the parameter-free Normal-Jeffreys (NJ) prior [20], and LASSO, which is a double-exponential mixture of Gaussian as a special case of NG model. The MAP estimator can be found using EM algorithm [12, 28]. Next we briefly review this hi erarchical Bayesian regression model family. In general, the marginal prior distribution of regression coefficients in this family can be factored as a scale mixture of Gaussian density: = fN(j; 0, r)p(r) dr (A.4) 67 Appendix A. Methods where .iV(x; o) denotes the normal distribution of x, with mean i and variance .2 Depending on different prior distribution specified for Ti?, we can introduce several different models. It is easy to check that LASSO, NG, and NJ priors satisfy the conditions of Theorem 2.3.1 and they are also singular at the origin [19], i.e. I/ d’ +p(I/3I) is strictly positive and attains 3 , minimum at / 3 =0. Normal-Jeifreys prior If choosing the Jeifreys’ non-informative prior for r, i.e. p(T?) o we get the NJ prior. The resulting marginal p.d.f. for /3 is -, then (A.5) Normal-Gamma prior If choosing rj and variance . where g(x; a, b) is the Gamma p.d.f. with mean Then the marginal p.d.f. for /3 is given by a = F() () 1 i() (A.6) where K, (/3) is the modified Bessel function of the second kind [4]. Note = 1, the NG prior reduces to double-exponential prior, and thus that if And NJ prior is the limiting case when yeilds LASSO model with A —. 0 and c —* 0. Normal-Inverse Gaussian prior IG(-, c), where IG(-, c) is the Inverse-Gaussian p.d.f. If choosing r Then the marginal p.d.f. for /3 is given by = -exp() ( + I/32) i + I/3I2) (A.7) 68 Appendix A. Methods Table A. 1: Penalizations/log(prior) and their first order derivatives evalu +2 ated at 1 j1 3 !/ i1. Notation: qj = 3 LASSO NJ /1 NG 1) PA(I/ j 3 AI/3’ p(I/3jI) I 3 1ogI/ 1*1 3(VjjI) a —)logj/3j—logCa ‘Q’/I/I) 7? log PC 1 (cqj) + f(& 12 A) 1ogq NIG AI/3I(I/3I SCAD + + ip iQVIi3I) C0j) ACi(cq) 3 q — (11_122)] )Ca < ) 1 j 3 I/ A{(I/3I A) > < A)} Table A.2: Properties of sparsity promoting priors. Source: François Caron. Finite value at 0 Sparsity Convexity of p (i3) Range Name Weakly convex for > 0 yes yes Laplace c> 0 Strictly concave no yes None NJ > 1 Strictly convex for a)1 a<l Weaklyconvexfor=1 c>0 NG >0,c>0 < 1 Strictly concave for no yes NlGauss > 0, c> 0 - - Because of lim,o(I/ jI +p(I/3j)) = 0, the NIG prior does not satisfy 3 the variable selection criterion. Finally, Table A. 1 summarizes these penalizations/log(prior) and their corresponding first order derivatives and table A.2 summarizes the proper of sparsity promoting priors (also including NIG prior). ties 69 Appendix B Proofs Before proving the theorems, we quote and present some useful lemmas. Lemmas B.1 Lemma B.1.1. (c.f. [29] p310) Let X, X be r.v. ‘s. Fix e> 0, let A(e)={IX—Xl Ve, then X Q4 >€} X. If A(e) B.2 Proof of Proposition A.1.1 <00, Proof. Define Ve> 0 A(e) = {sgn(/L1(n)) - By Lemma B.1.1, hence suffices to show esis, Zhao and Yu [67) showed for some 0 P(sgn(&(n)) But it is = sgn(/)I A(e) c< 1 sgn8)) = 1 — > (B.1) e} <cc. Under the hypoth o(e_nC) (B.2) straightforward to check that = (1 — P(sgn(A(n)) sgn(3))) = o(e) <cc Li 70 Appendix B. Proofs Proof of Theorem 2.3.1 B.3 Proof. For part (1), clearly 4)3(80,80) =p3(IOoI). P\3(IOI) and 4)3(8,80) are even functions on Let x 0. Consider > (x) (x,8o)—p d[4) ( 3 IxD] = xp((I9oI IOoI+e p((IOoI+f)+) - - — p(x + ) + p(x + f) p(x+€) - 801+e — Now for 0 e (0, oo), letting x i.(0) = 0, we have lim(x) xL0 - — p((0+E)+) p((I0oI+E)+) IO0I+E Note that under hypothesis, it is straightforward to conclude that Vf> 0, p (0+) is non-decreasing, positive of 0 > 0. So, L(O) 0 for 0 < 9 < 8 o1; and z(0) > 0 for 8 > 1901. Hence, 4)3(x, 8) p3(x) attains its minimum at 1801, and therefore at ±jOo. . . — For part (2), following definition )PA(O)I 8 IP3( = pA(I0I+f)_p(I8I)_ft)dt 101 IPA(I8I+f)-PA(I9DI+f Ip(IOI + f) PA(IOI)I + Ep(€+) - ‘ dt f I0I dt = 71 Appendix B. Proofs By the piecewise differentiability assumption of p(.) on (0, cc), we have p+) < cc, Vf > 0. Further, by compactness assumption, p(.) is uni formly continuous on C. Hence, sending € j. 0 yields claimed result. 72 Appendix C Supplementary Materials 73 Appendix C. Supplementary Materials Table C.1: E.coli Transcription factors in the 60% precise network with p 5 predicted operon targets by CLR algorithm. Targets in RegulonDB Targets # inferred by CLR Regulator 20 30 flhC_bi89Lat 17 46 fihD_b1892_at 40 42 fiiA_b1922_at 6 6 gatR2_b2090±at 10 glcC_b2980_at 5 10 7 hycA_b2725_at 16 lexA.b4043_at 6 11 rcsB_b2217_at 5 rhaR..b390&at 9 5 rhaS.b3905_at 7 5 17 tdcR_b3119_at 7 yhiE_b3512_at ii 5 4 yhiW.b3515.at 6 13 yhiX_b3516_at 8 Table C.2: E.coli Transcription factors in the 60% precise network with p 5 predicted operon targets by Li prior, with MM3 algorithm. Regulator Targets in RegulonDB Targets # inferred by Li MM3 14 cbl_b1987_at 9 28 fecl_b4293_at 6 43 42 fliA_bi922_at gatR.2_b2090±at 7 6 7 hycA_b2725_at 9 16 7 lexA_b4043..at 12 nac..bi988_at 9 84 narL_bi22i_at 7 10 yhiE_b35i2.at 5 4 ylcA.bO57Lat 5 74 Appendix C. Supplementary Materials Table C.3: E. coli rIIanscription factors in the 60% precise network with p 5 predicted operon targets by SCAD prior with MM3 algorithm. Regulator Targets in RegulonDB Targets # inferred by SCAD MM3 6 8 araC.b006&at 14 cbl_b1987_at 9 28 fecl_b4293_at 6 44 42 fiiAb1922_at 7 gatR_2.b2090i_at 6 9 hycA_b2725_at 7 7 16 lexA_b4043at 12 12 nac_b1988_at 84 7 narL_bl22Lat 11 yhiE_b3512_at 5 ylcA.bO57Lat 4 5 Table C.4: E.coli Transcription factors in the 60% precise network with p 5 predicted operon targets by Li prior with EM algorithm. Regulator Targets in RegiilonDB Targets # inferred by Li EM araC.b0064.at 6 8 cbl_b1987_at 10 9 fecl_b4293_at 28 6 44 fliA.b1922_at 42 7 gatR.2.b2090.J_at 6 14 hycA_b2725_at 7 lexA_b4043.at 16 7 1rpb0889_at 61 5 nac_b1988_at 12 9 84 narLbl22Lat 5 10 yhiE_b3512at 5 4 5 ylcA_b0571_at 75 Appendix C. Supplementary Materials Table C.5: E.coli Transcription factors in the 60% precise network with p> 5 predicted operon targets by NIG prior with EM algorithm. Regulator Targets in RegulonDB Targets # inferred by NIG EM fecl_b4293_at 19 6 fliA_b1922_at 42 33 gatR_2±2090_f_at 6 6 hycAb2725_at 10 7 lexA_b4043_at 16 5 lrp_b0889_at 61 6 nac_b1988_at 12 15 yhiE_b3512_at 5 7 ylcA_b0571_at 4 5 76 -.4 -1 10000 (a) 0 a, 0 N- V N EM Index 0 5000 10000 (c) 0 15000 0 C C MM3 Index 0, 0 5000 a 0) 0 > a > V 20000 5000 Index 15000 25000 (b) MM2 10000 20000 25000 30000 Figure C.1: Fisher’s exact test p-values with the NG prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05. I 00 0 a a 44 0 C! C 0 0 50000 0 a a 40 go 0 40 0 (a) EM 0 150000 Index 100000 200000 Index 0 10000 0 C! C! (c) MM3 5000 250000 0) 0 4, 0 04 Index 100000 (b) MM2 15000 50000 150000 200000 Figure C.2: Fisher’s exact test p-values with the NIG prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05. (4, cc —4 -I- 0 a 0 4) 4)1 0 5000 Index 15000 4) 0 a 4) > 0 0 N C C’) (a) EM. 10000 0 20000 5000 25000 Index 15000 (c) JVIM3 10000 30000 0 N 20000 0 25000 000 Index 15000 30000 (b) MM2 10000 20000 25000 30000 Figure C.3: Fisher’s exact test p-values with the NJ prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05. (0 I CI) 0 0 20000 80000 (a) MM2 Index 100000 0 0 o 01 0. C. a Oe+00 20+04 6a+04 Index (b) MM3 40+04 80+04 10÷05 Figure C.4: Fisher’s exact test p-values with the SCAD prior. The blue line is the simulated uniform background p-values under null hypothesis and horizontal line is the threshold at p-value at 0.05. Appendix C. Supplementary Materials Table C.6: Identified significant p53/miRNAs that target known cell proliferation genes from NO prior. Method EM MM2 Regulator Total # of predicted targets Cell proliferation genes p-value TP53 70 9 1.586E-04 ACADVL hsa-miR-130b hsa-miR-145 hsa-miR-155 hsa-miR-lSa hsa-miR-181a hsa-miR-221 hsa-miR-339 hsa-miR-367 hsa-miR-429 hsa-miR-485-Sp hsa-miR-513 hsa-miR-522 ACADVL hsa-miR-1 hsa-miR-30c hsa-miR-375 hsa-miR-410 R2F1 MYCN hsa-let-7i 5 hsa-miR-202 hsa-miR-33 hsa-miR-506 5 hsa-miR-518c hsa-miR-7 t 16 15 12 4 7 24 3 16 9 3 18 18 1 13 7 8 17 6 10 3 2 8 18 6 14 1 3 2 2 1 2 3 1 2 2 1 2 2 1 2 2 1 2 2 2 1 1 1 2 1 2 3.571E-02 7.496R-04 2.803E-02 1.178E-02 4.726E-02 1.572E-02 4.478E-03 2.212E-02 1.348E-02 8.134E-03 4.950E-02 3.284E-02 4.320E-02 3.846E-02 3.358E-03 2.724E-03 5.000E-02 1.030E-02 2.293E-02 1.704E-02 8.772E-03 4.293E-02 3.309E-02 2.643E-02 3.877E-02 1.920E-02 TPS3 1 05 10 8.1 6 7 E-04 ACADVL hsa-miR-130b hsa-miR-145 hsa-miR-155 hsa-miR-181a hsa-miR-221 hsa-miR-27b hsa-miR-320 hsa-miR-339 hsa-miR-367 hsa-miR-429 hsa-miR-488 5 hsa-miR-517 hsa-miR-518f ACADVL hsa-miR-1 hsa-miR-145 hsa-miR-181a hsa-miR-206 hsa-miR-221 hsa-miR-30c hsa-miR-410 MYCN hsa-let-7i hsa-miR-144 hsa-miR-186 hsa-miR-219 hsa-miR-26a hsa-miR-26b 1 24 19 13 13 31 27 16 4 26 22 10 4 6 1 22 19 13 18 31 20 21 13 12 23 31 5 23 24 1 2 2 2 3 4 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 3 1 3 2 3.571E-02 3.632E-02 4.388E-02 1.383E-02 4.940E-03 5.859E-04 4.676E-02 1.532E-02 2.963E-02 3.47tE-02 4.640E-02 4.237E-02 3.695E-02 9.557E-03 3.846E-02 9.713E-03 2.181E-02 4.727E-03 6.SO1E-03 3.805E-02 2.263E-02 1.564E-02 2.872E-02 3.509E-02 4.463E-02 3.672E-03 3.546E-02 1.682E-03 2.989E-02 Gene names TCFBI,ICFBP4,CYR61,AREG,EPS8 1F116,TPX2,DLG7,PDGFC TYR CUL5,PDGFC,ChGn FgCN1,ptlGFC FCF2,SPOCK1 FGF2 ONAI2,UCHL1 CYR6t,FRAT2,PDGFC RLF4 HDGFRP3,BHLHB1 CYR61,IRS2 5P0C1C1 EP88,BHLHB3 CYR61,EP88 5MARCA4 1R92,FOXA1 GNAI2,PBEF1 LRPS IRS2,HOXCIO IL1B,TGFB1 TIMP2,LEPRE1 ETS1 SMAD4 HDAC4 PCAF,ETg1 ETS1 CNOT8,ET31 TG1Z’BI,IGFEP4,CYR6I,AREG,MK167, EP58,1fl16,TPx2,DLG7,PDGFC TYR CULS,ChCn FgCN1,1RS2 FCF2,5POCK1 GNAI2,IRSI,UCHL1 CYR61,CNAI2,FRAT2,PDGFC TRIB1,BHLHB3 PTEN,BHLHR3 ELF4 HLICFRP3,EHLI4B3 CYR61,IRS2 SPOCIC1,HDGFRP3 CPC4 PTEN,SPOCK1 SMARCA4 1R82,FOXA1 1R82,PEEF1 GNAI2,1R82 IRS2,PEEF1 GNAI2,PDGFC GNAI2,PBRF1 1R82,HOXC1O TIMP2,LEPRE1 ETS1 TOBt,BTG3 PTEN,HOXB2,gKAP2 BMP2 TIMP2,gKAP2,HDAC4 PTEN,TIMP2 81 Appendix C. Supplementary Materials Table C.6: (continued) Method MM p-value hsa-miR-33 hsa-miR-34c hsa-miR-429 hsa-miR-492 hsa-miR-498 hsa-miR-506 5 hsa-miR-518c hsa-miR-7 Total # of predicted targets 19 20 22 2 9 19 7 17 Cell proliferation genes 2 2 3 1 2 2 1 2 TP53 83 10 1.138E-04 ACADVL hsa-miR-130b hsa-miR-145 hsa-miR-155 hsa-miR-181a hsa-miR-221 hsa-miR-339 hsa-miR-367 hsa-miR-429 hsa-miR-485-5p hsa-miR-488 hsa-miR-513 ACADVL hsa-miR-145 hsa-miR-221 hsa-miR-23b hsa-miR-30c hsa-miR-410 ACADVL MYCN hsa-let-7i hsa-miR-186 hsa-miR-26a hsa-miR-33 hsa-miR-429 hsa-miR-492 hsa-miR-506 hsa-miR-7 1 19 16 11 6 28 3 21 14 3 7 22 1 16 28 21 11 18 5 13 8 21 16 15 14 1 17 14 1 3 2 2 2 4 1 2 2 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 3 1 2 2 3.571E-02 1.275E-03 3.172E-02 9.875E-03 1.142E-02 3.867E-04 2.222E-02 2.300E-02 1.965E-02 4.950E-02 2.095E-02 4.805E-02 3.846E-02 1.559E-02 3.129E-02 3.560E-02 6.949E-03 1.154E-02 3271E-02 2.872E-02 2.339E-02 2.012E-02 1.361E-02 9.134E-04 1.803E-04 9.709E-03 2.365E-02 1.920E-02 Regulator 1.487E-03 8.619E-03 7.382E-04 t.942E-02 1.332E-03 2.936E-02 4.516E-02 2.805E-02 Gene names NDN,HDAC4 NOTCH2,ETSI NDN,TIMP2,HDAC4 ADAMTS1 PTEN,SESN1 PCAF,ETS1 ETS1 CNOT8,ETS1 TGFEI,IGFBP4,CYR61,AREG,MR167, EPS8,1F116,TPX2,DLC7,PDGFC TYR CUL5,PDCFC,ChGn FSCN1,PDGFC FGF2,SPOCR1 GNAI2,UCHL1 CYR61,GNAI2,FRAT2,PDGFC ELF4 HDGFRP3,BIILHB3 CYR61,IRS2 SPQCIC1 SPOCE1,HDGFRP3 EPS8,EHLHB3 SMARCA4 PEEFI,PDCPC GNAI2,PDCFC IRS2,STATSB GNAI2,PBEF1 IRS2,HQXC1O MDM4,TSC1O1 TIMP2,LEPRE1 ETS1 PTEN,HOXB2 TIMP2,HDAC4 NDN,HDAC4 NDN,TIMP2,HDAC4 ADAMTS1 PCAF,ETS1 CNOT8,ETS1 Table C.7: Identified significant p53/miRNAs tbat taeget known cell proliferation genes from NIG prior. Method Regulator TP53 Total # of predicted targets 1261 Cell proliferation genes 45 p-value 5.680E-03 EM TP53 1261 25 1 .886E-02 Gene names CDE4,PTCH1,PTEN,TCFBI,PAX3,IL1A, IL1B,ADRA1D,GAS6,PRL,BUB1B,CDK6, CTF1,IGFBP4,CYR61,AREG,CKS1E,EGF, FGF2,FGF7,FTH1,MDM4,MKJ67,PIM1, PRKD1,MAP2K1,RAF1,STIL,TGFA,TXN, CUL1,BUB1,EPS8PLIC1,CSF1R,E2F1, IF’I16,REST,TPX2,DLG7,PDCFC,OSM, E2F8,ZEB1 ,TNFSF13B KIT,PAX3,IGF1,BCL2,TGFBI,ADRA1D, IGF1R,CDC6,CTF1,EGF,F2R,FGF7, LIF,0DC1,ST8SIA1,TTK,CDKN1B,MYCN, 82 Appendix C. Supplementary Materials Table C.7: (continued) Metbod Regulator TP53 MM2 1181 Cell proliferation genes 41 p-value 4.859E-02 - FOS MM3 Total # of predicted targets 421 29 4.859E-02 TP53 1181 24 2. 149E-02 TP53 bsa-miR-135b bsa-miR-155 bsa-miR-lSa hsa-miR-181a hsa-miR-221 hsa-milt-302c hsa-mifl-367 hsa-miR-485-Sp hsa-miR-513 hsa-miR-522 ACADVL hsa-miR-128a bsa-miR-128b hsa-miR-181a bsa-miR-191 hsa-miR-375 hsa-miR-384 hsa-mift-410 bsa-mift-429 hea-let-7i hsa-miR-106b hsa-miR-181a hsa-miR-181b bsa-miR-222 bsa-miR-33 hsa-miR-506 53 3 8 3 3 19 9 9 3 12 16 1 8 9 3 4 5 3 10 6 1 3 3 4 4 7 9 6 1 2 1 2 3 1 2 1 2 3 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 4.340E-03 4.822E-02 5.123E-Q3 3.561S-02 2.404E-03 2.214E-03 3.688E-02 4.18881-03 4.95081-02 1.48481-02 2.73081-03 3.84681-02 3.43781-02 3.86281-02 2.72181-02 4.46081-02 3.12581-02 3.00781-02 3.50681-03 3.99481-02 2.92481-03 3.69081-02 3.39381-02 4.50681-02 2.87881-02 2.89881-02 6.58581-03 Gene names HES1,PURA,PDGFCITGB1,TNFSF13B, SERTAD1,BIftC6 CDK4,PTCH1,PTEN,TGFBI,PAX3,IL1R, ADRA1D,GAS6,PRL,BUB1B,CTF1 ,IGFBP4, CYR61,AREG,CKS1REGF,FGF7,FTH1, MDM4,MK167,F’RKD1MAP2K1RAF1STIL, TGFA,TXNCUL1,BIJB1,EPS8,PLK1,CSF1R, E2F1,1F116,REST,TPX2DLG7,PDGPC, OSM,E2F8,ZEB1,TNFSF13R TGFBI,IL2RA,IL1AIL1B,ADRA1DADRA1B PRL,CTF1,0A32,AREG,EGF,FGF2, GCG,GNB1MAP2K1,RAF1,TGFA,TGFR3, TXN,1R52,CIJICSR1 ,E2F1,MPL,NAB2, 51008,Ct1160,FZD3,OSM,SFRF2 KIT,FAX3,IGF1 BCL2,TCFR1,ADRA1D, CDC6,CTF1ECF,F2R,FCF7,LIF, ODCI.ST8SIA1TTK,CDKNI6,MYCN, HES1,PURA,PDCFC,ITCBI,TNFSF13B, SERTAD1,BIRC6 TGFBI,CYR61,AREG,TPX2,OLG7,PDGFC 0A52,DAR2 FGF2,SPOCR1FGF2,SFOCK1 FGF2,FGF2 GNAI2,UCHL1 ,GNAI2.UCHL1 CYR6IFRATZPDCFC,CYR6I,FRAT2,PDGFC GPC4,GFC4 HDGFRP3BHLHB3,HDGFRP3,RHLHB3 SPOCIC1,SPOCFC1 EPS8,BHLHB3,EPS8,EHLHB3 DAB2,CYR61,EPS8,DAB2,CYR61,EPS8 SMARCA4 PDGFC FDGFC GNAI2 PBEF1 LRPS IGF2BF1 IRS2,HOXCIO IRS2 EITS1 TXNIP NOTCH2 ETS1 ETS1 HIDAC4 FCAF,ETS1 Table CS: Identified significant p53/miRNAs tbat target known cell proliferation genes from NJ prior. Method Regulator Total # of predicted targets Cell proliferation genes p-value TP53 109 10 l.101E-03 Gene names TGFBI,ICFBP4CYR6J ,AREC,MlC167, EPS8,1F116,TPX2DLG7,PDCFC 83 Appendix C. Supplementary Materials Table C.8: (continued) Method MM2 Cell proliferation genes EGR1 ACADVL hsa-miR-130b hsa-miR-155 hsa-miR-181a hsa-miR-181r hsa-miR-221 hsa-miR-339 hsa-miR-367 hsa-miR-488 5 hsa-miR-517 hsa-miR-517c hsa-miR-518f hsa-miR-522 hsa-miR-526b ACADVL hsa-miR-1 hsa-miR-145 hsa-miR-181a hsa-miR-206 hsa-miR-221 hsa-miR-30c hsa-miR-410 MYCN hsa-lct-7i hsa-miR-144 hsa-miR-186 hsa-miR-219 hsa-miR-26a hsa-miR-26b hsa-miR-33 hsa-miR-34r hsa-miR-429 hsa-miR-492 hsa-miR-498 hsa-miR-506 5 hsa-miR-518c hsa-miR-7 Total # of predieted targets 22 1 26 13 14 7 31 5 27 10 4 4 6 38 14 1 22 21 14 19 31 20 20 13 14 24 33 5 24 24 19 19 23 2 8 19 7 17 TPS3 109 10 1.10191-03 EGRI ACADVL hsa-miR-130b hsa-miR-155 hsa-miR-181a hsa-miR-181c hsa-miR-221 hsa-miR-339 hsa-miR-367 hsa-miR-429 hsa-miR-488 5 hsa-miR-517 hsa-miR-517c hsa-miR-518f hsa-miR-522 hsa-miR-526b ACADVL hsa-miR-1 hsa-miR-145 hsa-miR-181a 22 1 27 13 14 7 31 5 27 22 10 4 4 6 38 14 1 22 21 14 4 1 2 2 4 2 4 1 2 2 2 1 2 2 4 2 1 2 2 2 4.628E-02 3.571E-02 4.53091-02 1.38391-02 3.999E-04 1.57291-02 5.85991-04 3.704E-02 3.730E-02 4.640E-02 4.237E-02 3.69591-02 6.162E-03 9.55791-03 4.423E-03 2.832E-02 3.846E-02 9.71391-03 2.64591-02 5.498E-03 Regulator 7 1 2 2 4 2 4 1 2 2 1 2 2 4 2 1 2 2 2 2 2 2 2 2 1 2 3 1 3 2 2 2 3 1 2 2 1 2 p-value Gene names —________________________________________________ 4.628E-02 3.571E-02 4.222E-02 1.383E-02 3.999E-04 l.572E-02 5.859E-04 3704E-02 3.730E-02 4.237E-02 3.695E-02 6.162E-03 9.557E-03 4.423E-03 2.832E-02 3.846E-02 9.713E-03 2.645E-02 5.498E-03 7.247E-03 3.805E-02 2.263E-02 1.421E-02 2.872E-02 4.094E-02 4.828S-02 4.403E-03 3.546E-02 1912E-03 2.989E-02 1.487E-03 7.772E-03 8.455E-04 1.942E-02 1.036E-03 2.936E-02 4.51691-02 2.80591-02 1L18,IRS2,NAB2,PDGFC TTh CUL5,ChGn FGF2,SPOCK1 GNAI2,IRS2,UCHL1,EI4LHB3 UCHL1,BHLHB3 CYR6IGNAI2,F’RAT2PDGFC ELF4 IILIGFRP1,BHLHB3 SPOCK1,HDGF’RP3 GPC4 AREG,ChGn PTEN,SPOCK1 DAB2,CYR61,EPS8,PDGFC ChGn,TXNDC1 SMARCA4 IRS2,FOXA1 IRS2,PBEF1 GNAI2TRS2 IRS2,PBEF1 GNAI2,PDGFC CNAI2,PBEF1 IRS2,HOXC1O TIMP2,LEPRE1 ETS1 TOB1,BTG3 PTEN,HOXB2,SKAP2 BMP2 TIMP2,SKAP2,HDAC4 PTEN,TIMP2 NDN,HDAC4 NOTCH2,ETS1 NDN,TIMP2,HDAC4 ADAMTS1 PTEN,SESN1 PCAF,ETS1 ETS1 CNOT8,ETS1 TGFBI,ICFBP4,CYR61,AREG,MK167, EPS8,1F116,TPX2,DLG7PDGFC IL1B,1R82,NAB2,PDGFC TYR CUL5,ChGn FGF2,SPOCK1 GNAI2191S2,UCHL1,BNLHB3 UCI-1L1,BHLHB3 CYR6I,GNAI2,FRAT2,PDGFC ELF4 HDGFRP3,BHLHU3 CYR61,IRS2 SPOCIC1,HDGFRP3 GPC4 AREG,ChCn PTEN,SPOCK1 DAB2,CYR61,EPS8,PDGFC ChGn,TXNDC1 SMARCA4 1R82,FOXA1 IRS2,PBEF1 GNAI2,1R82 84 Appendix C. Supplementary Materials Table 0.8: (continued) Method MM3 Regulator p-value hsa-miR-206 haa-miR-221 haa-miR-30e haa-miR-410 MYCN hsa-let-7i haa-miR-144 haa-miR-186 haa-miR-219 haa-miR-26a hsa-miR-26b hsa-miR-33 haa-miR-34e hsa-miR-429 hsa-miR-492 haa-miR-498 hsa-miR-506 5 bsa-miR-518c hsa-miR-7 Total # of peedieted targets 19 31 20 20 13 12 23 35 5 24 25 20 19 22 2 9 20 7 17 Cell proliferation genes 2 2 2 2 2 1 2 3 1 3 2 2 2 3 1 2 2 1 2 TP 53 109 10 1.1O1E-03 EGR1 ACADVL haa-miR-130b hsa-miR-155 hsa-miR-181a hsa-miR-181e haa-miR-221 haa-miR-339 haa-miR-367 hsa-miR-429 bsa-miR-488 5 hsa-miR-517 hea-miR-517e haa-miR-518f hsa-miR-522 hsa-miR-526b ACADVL hsa-miR-1 hsa-miR-145 hsa-miR-181a hsa-miR-206 hsa-miR-221 hsa-miR-30e hsa-miR-410 MYCN hsa-let-7i hsa-miR-144 hsa-miR-186 hsa-miR-219 hsa-miR-26a hsa-miR-26b hsa-miR-33 hsa-miR-34c haa-miR-429 haa-miR-492 haa-miR-498 hsa-miR-506 5 hsa-miR-518e 22 1 27 13 14 7 31 5 27 22 10 4 4 6 38 14 1 22 21 14 19 31 20 20 13 12 23 35 5 24 25 20 19 22 2 9 20 7 4 1 2 2 4 2 4 1 2 2 2 1 2 2 4 2 1 2 2 2 2 2 2 2 2 1 2 3 1 3 2 2 2 3 1 2 2 1 4.628E-02 3.57191-02 4.53091-02 1.383E-02 3.999E-04 1.57291-02 5.85991-04 3.70491-02 3.73091-02 4.64091-02 4.23791-02 3.69591-02 6.162E-03 9.55791-03 4.42391-03 2.83291-02 3.84691-02 9.71391-03 2.64591-02 5.49891-03 7.24791-03 3.80591-02 2.26391-02 1.42191-02 2.87291-02 3.50991-02 4.46391-02 5.21791-03 3.54691-02 1.91291-03 3.23091-02 1.65391-03 7.77291-03 7.38291-04 1.94291-02 1.33291-03 3.24191-02 4.51691-02 7.247E-03 3.805E-02 2.263E-02 1.421E-02 2.872E-02 3.509E-02 4.463E-02 5.217E-03 3.546E-02 1.912E-03 3.230E-02 1.653E-03 7.772E-03 7.382E-04 1.942E-02 1.332E-03 3.241E-02 4.516E-02 2.805E-02 Gene names JRS2,PBEF1 GNAI2,PDGFC GNAI2,PBEF1 1R82,HOXC1O TIMP2,LEPRE1 ETS1 TOB1,BTG3 PTEN,IIOXB2,SI<AP2 BMP2 TIMP2,SKAP2,HDAC4 PTEN,TJMP2 NDN,HDAC4 NOTCH2,ETS1 NDN,TIMP2,HDAC4 ADAMTSI PTEN,SESNI PCAF,ETS1 ETS1 CNOT8,ETS1 TGFBI,IGFBP4,CYR61,AREGM1C167 RPS8,1F116,TPX2,DLG7,PDGFC IL1B,IRS2,NAB2,PDGFC TYR CULS,ChGn FGF2,SPOCK1 GNAI2,IRS2,UCHL1,BHLHB3 UCHL1,BHLHR3 CYR61,GNAI2,FRAT2,PDGFC ELF4 HDGFRP3,BHLHB3 CYR61,IRS2 SPOCK1,HDGFRP3 GPC4 AREG,ChGn PTRN,SPQCR1 DAB2,CYR61,EPS8,PDGFC CbGn,TXNJDC1 SMARCA4 IRS2,FOXA1 JRS2,PBEF1 GNAI2,IRS2 1R52,PBEF1 GNAI2,PDGFC GNAI2,PBEF1 IRS2,HOXC1O TJMP2,LEPRE1 ETS1 TOB1,BTG3 PTEN,HOXB2,SKAP2 BMP2 TIMP2,SSCAP2,HDAC4 PTEN,TIMP2 NDN,HDAC4 NOTCH2,ETS1 NDN,TIMP2,HDAC4 AL)AMTS1 PTEN,SESN1 PCAFETS1 ETS1 85 Appendix C. Supplementary Materials Table CS: (continued) Method Regulator hsa-miR-7 Total # of predicted targets 17 Cell proliferation genes 2 p-value 2.805E-02 Gene names CNOT8,RTS1 Table C.9: Identified significant p53/miRNAs that target known cell proliferation genes from SCAJD prior. Method MM2 Regulator Total # of predicted targets Cell proliferation genes p-value TPS3 471 21 2.503 FOXO1 ACADVL 67 44 4 4 2.638E-02 4.859E-02 AC AD VL 133 9 2.0855-02 hsa-miR-135b hsa-miR-lSa hsa-miR-200c 178 128 163 6 4 5 3.9355-02 2.095E-02 1.7165-02 02 RB1 104 9 1.741 EGR1 PAX5 hsa-miR-106b hsa-miR-130b hsa-miR-181b hsa-miR- 186 hsa-miR-296 hsa-miR-301 hsa-miR-30b hsa-miR-330 hsa-miR-372 hsa-miR-452 5 hsa-miR-520a 83 36 283 205 174 406 16 178 259 107 162 146 126 9 7 7 6 6 7 2 7 7 4 S 3 4 1.9265-02 1.5095-02 2.6115-02 4.2125-02 6.323E-03 3.2945-02 3.526E-02 2.6395-03 1.3855-02 4.9695-02 5.4965-03 3.7815-02 4.5485-02 02 hsa-miR-522 312 9 1.486 S -02 hsa-miR-9 HES1 LITAF ACADVL ACADVL FOSS HDAC3 ACADVL IRF1 MYCN ACADVL hsa-miR-31 hsa-miR-34c hsa-miR-507 hsa-miR-519b ACADVL ACADVL hsa-miR-125a hsa-miR-135b hsa-miR-15S hsa-miR-181a 296 27 6 29 14 19 20 40 47 79 74 149 127 61 245 12 12 87 135 110 110 6 3 2 4 3 3 3 4 8 5 S 7 3 3 8 2 3 4 S 4 7 2.4365-02 2.329E-02 3.448E-02 3.295E-02 2.7485-02 1.7705-02 4.5965-02 2.6385-02 1.3025-02 5.2485-03 3.488E-02 1.709E-02 4.3965-02 3.8945-02 3.132E-02 4.2865-02 1.357E-02 9.8325-03 4.667E-02 4.655E-02 2.308E-02 Gene names PTCH1,TGFHI,PAX3,IL1BCTF1,IGFSF4, CYR61,AREG,MK167,PIM1,PRICD1,STIL, TXN,BUS1,EPS8,1F116,REST,TPX2, DLG7,PDGFC,ZEB1 PTEN,FRL,IRS2,1F116 PAX3,DAB2,ELF4,PIM1 PTEN,PAX3,ADRA1D,ADRA1B,CTF1, EGF,15G20,PIM1,CSF1R RPS4X,RP527,DAB2,MNAT1,CDKSR1,MAPRE2 RPS4X,RP527FGF2,FZD3 FTEN,RPS4X,RP527,RPS1S,MAFRE1 ICIT,IGF1,TGFB1,IGF1R,CDC6, CCND2,CDKN1B,MYCN,5UZ12 TGFB1,EGF,F2R,FLT1,LYN,TIMP1,1R52,PDGFC,CHRM1 1L7,FLT3LGFGF17,CDICN1B,FLT3,HES1,STATSB BCL2,CCND2,LIF,AKAPS,POU3F2,DERL2,IGF2BP1 IGF1,SCL2,CDKN1S,POU3F2,5UZ12,IGF2BP1 BCL2,LIF,CUL3,AKAPS,POU3F2,DAZAP2 IC4F1R,ITPR1,CDKN1B,AICAP5,P01J3F2,5UZ12,IGF2BP1 NR2E1,IGF2BP1 IGFI,BCL2,IGF1R,CDKN1B,5UZ12,PDGFCIGF2BP1 GLDC,KLFS,1R52,5UZ12,GLI2,BIRC6,PLEICHK1 MAB21L1,POU3F2,STATSB,IGF2BP1 IGF1R,NR2E1,CIJL3,DAZAP2,5UZ12 CCND2,GNAI2,1R52 IGF1R,CCND2.5UZ12,IGF2RP1 GLDC,IGF1,CCND2,GNAI2,CD1CN1B, 5U312,PDGFC,IGF2SP1,PLEFCHK1 CCND2,NR2E1,FGF18CDRNIB,PURA,ANGEL1 AGT,ILIB,TGFB1 IL1A,TNF TNF,TGFB1,FGF2,FOXO4 POU1F1,TNF,TGFB1 IL1B,TNF,BMP2 PTEN,TNF,GDF11 APC,1L3,TGFB1,BMP2 IL1B,TNF,TGFB1,FTGS2,EIF2AIC2,3A52,OSM,IL1RN GLI3,RB1,TGFB1,TIMP2,LEPRE1 PTEN,IL1A,1L18,TNF,TGFB1 FRK,TIMP2,JAIC2,HDAC4,5TK38,PDSSB,ETSI STG1,NOTCH2,ETS1 BTG1,HDAC4,MNT FTEN,RB1,CULS,HDAC4,T5G101,ADAMTS1,5TK38,RFX3 CDK4,TYR CDK4,CDK6,E2F1 RP527,MAP3SC1 1,MAPRE2,PES1 RPS4X,DAB2,MNAT1,FIM2,MAPRE2 FTEN,FGF2,SFOCK1,BCAT1 TGFBI,CKS1S,GNAI2,1R52,UCHL1 ,BHLHB3,ZAK 86 M 543 Appendix C. Supplementary Materials Table CR (continued) Method Regulator hsa-miR-202 5 hsa-miR-221 hsa-miR-27b hsa-miR-320 hsa-miR-410 bsa-miR-504 hsa-miR-518f bsa-miR-522 ACADVL PAXS hsa-miR-206 hsa-miR-296 ACADVL SP1 HDAC2 IRF1 MYCN PSMC3 SMAD3 hsa-miR-128b bsa-miR-144 hsa-miR-186 hsa-miR-195 bsamiR202* hsa-miR-31 hsa-miR-34c bsa-miR-429 bsa-miR-499 hsa-miR-518e hsa-miR-7 Total # of predicted targets 61 147 152 127 122 27 46 223 24 33 115 18 14 42 19 49 85 6 57 140 239 375 54 61 132 109 187 150 44 117 Cell proliferation genes 4 6 6 4 8 3 4 8 5 7 3 2 3 5 3 8 5 2 6 5 7 8 4 4 6 3 5 4 3 5 p-value 9.055E-03 4.146E-03 3.146E-03 4.963E-02 8.823E-04 3525E-02 2.1OSR-02 3.862E-02 2.1 13E-02 7.721E-03 3.531E-02 4.496E-02 2.748E-02 1.295E-02 2.83252-02 1.754E-02 7.63852-03 4.455E-02 4.92352-02 2.90652-02 3.864E-02 1.997E-02 2.60052-02 2.26552-02 4.21552-02 2.76852-02 1.30352-02 2.94452-02 2.88652-02 1.49352-02 Gene names PTEN,EPS15,FCF2,CUL5 RPS4X,RPS27,CYR61,CNAT2,FRAT2,PDGFC RPS4X,EPS1S,PRRD1,STIL,TRIB1,BRLWB3 PTEN,DAB2,STAT4,BHLHB3 COL4A3,RPS4X,RPS27,FGF2,PRKD1.IRS2.EPS8,HDGFRP3 RPS27,GPC4NAB2 PTEN,RPS4X,RP527SPOCK1 RPS4X,DAB2CYR61,GNAI2,MAP2K1,EPS8,PDGFC,BCAR1 BCL2,TCFB1,CXCL1O,CCND2,CDRN1B 1L7,FLT3LG,FCF17CDKN1B,FLT3,HES1,STAT5B IRS2,FOXA1,PBEFJ NR2E1,IGF2BP1 POU1FI,TNF,TGFB1 TGFB1,PTCS2,ERRG,HDAC4,ETS1 IL1B,TNF,TCFB1 IL1B,TNF,TGFB1,PTGS2,EIF2AK2,JAK2,OSMIL1RN GLI3,RB1,TGFB1,TIMP2,LEPREI TGFE1,MDM4 JL1B,TCFB1EREC,IFNB1,TCPS3,SMAD4 PHBFOXO4,MY016,PDS5B,INCS PRRRA,PCAF,JAR2,T081,HDAC4.STC3,ETS1 PTEN.HOXE2,SKAP2,SMAD4,T081,HDAC4,C}tERP,T052 PPM1D,CHERP,SESN1,PDSSB PTENFCF2,CIJLS,SMAD4 FRK,TIMP2,SKAP2,JAK2,HLIAC4,STK38 BTG1,NOTCH2ETS1 NDN,TIMP2,T081,HDAC4,SESNI CUL5,CUL1 PPM1D,FQXQ4 BMP2,MED17,PDS5S PPM1D,CNQT8,SETDB1,ETS1,GJB6 87
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Comparisons of statistical modeling for constructing...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Comparisons of statistical modeling for constructing gene regulatory networks Chen, Xiaohui 2008
pdf
Page Metadata
Item Metadata
Title | Comparisons of statistical modeling for constructing gene regulatory networks |
Creator |
Chen, Xiaohui |
Publisher | University of British Columbia |
Date Issued | 2008 |
Description | Genetic regulatory networks are of great importance in terms of scientific interests and practical medical importance. Since a number of high-throughput measurement devices are available, such as microarrays and sequencing techniques, regulatory networks have been intensively studied over the last decade. Based on these high-throughput data sets, statistical interpretations of these billions of bits are crucial for biologist to extract meaningful results. In this thesis, we compare a variety of existing regression models and apply them to construct regulatory networks which span trancription factors and microRNAs. We also propose an extended algorithm to address the local optimum issue in finding the Maximum A Posterjorj estimator. An E. coli mRNA expression microarray data set with known bona fide interactions is used to evaluate our models and we show that our regression networks with a properly chosen prior can perform comparably to the state-of-the-art regulatory network construction algorithm. Finally, we apply our models on a p53-related data set, NCI-60 data. By further incorporating available prior structural information from sequencing data, we identify several significantly enriched interactions with cell proliferation function. In both of the two data sets, we select specific examples to show that many regulatory interactions can be confirmed by previous studies or functional enrichment analysis. Through comparing statistical models, we conclude from the project that combining different models with over-representation analysis and prior structural information can improve the quality of prediction and facilitate biological interpretation. Keywords: regulatory network, variable selection, penalized maximum likelihood estimation, optimization, functional enrichment analysis. |
Extent | 2223158 bytes |
Subject |
Regulatory network Variable selection Optimization Functional enrichment analysis |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-02-02 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0066924 |
URI | http://hdl.handle.net/2429/4068 |
Degree |
Master of Science - MSc |
Program |
Bioinformatics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
Graduation Date | 2008-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2008_fall_chen_xiaohui.pdf [ 2.12MB ]
- Metadata
- JSON: 24-1.0066924.json
- JSON-LD: 24-1.0066924-ld.json
- RDF/XML (Pretty): 24-1.0066924-rdf.xml
- RDF/JSON: 24-1.0066924-rdf.json
- Turtle: 24-1.0066924-turtle.txt
- N-Triples: 24-1.0066924-rdf-ntriples.txt
- Original Record: 24-1.0066924-source.json
- Full Text
- 24-1.0066924-fulltext.txt
- Citation
- 24-1.0066924.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0066924/manifest