"Science, Faculty of"@en . "Statistics, Department of"@en . "DSpace"@en . "UBCV"@en . "Kanters, Steve"@en . "2011-02-17T00:47:53Z"@en . "2006"@en . "Master of Arts - MA"@en . "University of British Columbia"@en . "High throughput phenotypic experiments include both deletion sets and R N A i experiments.\r\nThey are genome wide and require much physical space. As a result, multiple plates are\r\noften required in order to cover the whole genome. The use of multiple plates leads to systematic\r\nplate-wise experimental artefact, which impede statistical inference. In this paper,\r\ncurrent pre-processing methodology will be reviewed. Their fundamental principle is to align\r\na common feature shared by all plates. From this very principle, we propose an improved\r\nmethod which simultaneously estimates all parameters required for the pre-processing transformation.\r\nSome of the alignment features popular today implicitly assume conditions which are\r\noften not met in practice. We discuss the various choices of features to align. Specifically,\r\nthe upper quantiles and the mean of the left tail trimmings of each plate's data distribution\r\nare features which are always available and simple to obtain. Moreover, they are robust to\r\nnon-randomization of genes to plates. Their use will be motivated through simulation and\r\napplied to real data. Applications to real data will be used to demonstrate superiority over\r\ncurrent methods as well as to discuss choices in transformation types."@en . "https://circle.library.ubc.ca/rest/handle/2429/31404?expand=metadata"@en . "Pre-Processing of Quantitative Phenotypes from High Throughput Studies by . 
STEVE KANTERS
B.Sc., McGill University, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Statistics)

The University of British Columbia
December 2006
© Steve Kanters, 2006

Abstract

High throughput phenotypic experiments include both deletion sets and RNAi experiments. They are genome wide and require much physical space. As a result, multiple plates are often required in order to cover the whole genome. The use of multiple plates leads to systematic plate-wise experimental artefacts, which impede statistical inference. In this paper, current pre-processing methodology will be reviewed. Its fundamental principle is to align a common feature shared by all plates. From this very principle, we propose an improved method which simultaneously estimates all parameters required for the pre-processing transformation. Some of the alignment features popular today implicitly assume conditions which are often not met in practice. We discuss the various choices of features to align. Specifically, the upper quantiles and the mean of the left tail trimmings of each plate's data distribution are features which are always available and simple to obtain. Moreover, they are robust to non-randomization of genes to plates. Their use will be motivated through simulation and applied to real data. Applications to real data will be used to demonstrate superiority over current methods as well as to discuss choices in transformation types.

Contents

Abstract
Contents
List of Figures
Acknowledgements
1 Introduction
2 HTP Experiments
3 Foundations for HTP Data Pre-Processing
  3.1 Phenotype Distribution
  3.2 Distortion Model
  3.3 Current Approaches to Data Pre-Processing
    3.3.1 HTP Experiments
    3.3.2 Microarrays
4 Recommended Pre-Processing Method
  4.1 Features
    4.1.1 External Features
    4.1.2 Internal Features
  4.2 Proposed Pre-processing Method
    4.2.1 Exact alignment of 2 features under the linear model
    4.2.2 Alignment of features through optimization
    4.2.3 A piecewise distortion model
5 Application to Real Data
  5.1 Data
  5.2 Recommended Method versus Median Centering
  5.3 Choice of Feature
    5.3.1 Complementing Chapter 4's Simulation
    5.3.2 Internal and External Features in a Case Study
  5.4 Optimization versus Linear Splines
6 Conclusion and Future Work
Bibliography
Appendices
  A Derivation of Gradient
  B The Edge Effect

List of Figures

2.1 Gene Deletion in Yeast
2.2 RNA Interference
2.3 Experimental Formats
2.4 Image Quantification
3.1 Comparison of HTP and Microarray experiments
3.2 Raw Data Distributions
3.3 Non-Random Mutant Allocation
3.4 Quantile Normalization
4.1 Simulation Population
4.2 Example of Simulated Plate-Sets
4.3 Simulation Results
4.4 Mean Left Tail Trimmings
4.5 Linear Splines
5.1 Comparing Methods
5.2 Complementary Diagnostic Plot
5.3 Internal Feature Behaviour Induced by External Feature Alignment
5.4 Applied Pre-processing 1
5.5 Applied Pre-processing 2
5.6 Applied Pre-processing 3
5.7 Applied Pre-processing 4
5.8 Applied Pre-processing 5
B.1 Edge Effect

Acknowledgements

First and foremost, I would like to thank Jenny Bryan for her guidance, support and patience. Hopefully, my emotional roller coaster throughout this program did not rub off on others. I would also like to thank Elizabeth Conibear from the Center for Molecular Medicine and Therapeutics for giving me the opportunity to work on such an interesting project. My experience as a graduate student in the Department of Statistics at UBC has by far exceeded my expectations. I feel like a statistician - a scientist - something I did not feel after my undergraduate studies.

I would like to thank my parents, Ron and Renee, for their lifelong love and support. They have given me the proper tools to lead the life I choose. Dad, you've shown me that life is an adventure, one which too many people take for granted. Maman, you gave me a tool essential to life: communication.

Many students in the department helped me along the way. Thank you Justin for helping improve my computing skills and for entertaining coffee breaks. Thanks Jeff for helping me with my mathematics when I needed it and speaking French. Mike Marin, you've become more than a fellow student - you're a true friend. I enjoyed the many conversations we've had, which I think were beneficial to both of us. Finally, thanks to Robin Steenweg and Jon Morrison, who both continue to inspire me greatly in the game we call life.

STEVE KANTERS
The University of British Columbia
December 2006

Chapter 1
Introduction

Genetics has greatly developed in the last century and a half. Traditionally, experiments started with a phenotype and tried to identify the mutation which led to it. This is known as forward genetics. Today, select organisms have most of their genome mapped.
Accordingly, one can cause or observe changes in genetic sequence and witness the induced change in phenotype. This procedure, in the opposite direction of classical genetics, is called reverse genetics. By perturbing a specific gene and evaluating its effects on a phenotype of interest, we can better understand each gene's functionality. Such experiments carried out on a genome-wide basis are called high throughput phenotypic (HTP) experiments. Choosing a specific phenotype may lead to the proper identification of genes composing a pathway of interest. Furthermore, the introduction of double deletion sets in HTP experiments has allowed scientists to identify key synthetic interactions [1]. Unlike other genomic-based approaches such as microarrays and proteomics, gene perturbation provides a direct link from gene to function and, to date, offers the best tool for realizing the full potential of the genome project [3].

In all high throughput genome-wide experiments, data are very much affected by the biological phenomenon being investigated and by the experimental processes involved. Thus, analyzing the raw data to draw conclusions on the underlying biological mechanisms may be extremely misleading. Data pre-processing is the process of removing as much experimental artefact as possible while affecting the biological information as little as possible. In microarrays it is a topic which is well documented in the literature [13],[5],[8].

There are key differences between HTP and microarray experiments. Consequently, normalization methods developed for microarrays are of limited usefulness here. These experiments are young and relatively unknown to the statistical community.
The goal of this report is to develop a pre-processing method for HTP experiments which properly reflects the biology involved and efficiently alleviates the experimental artefacts.

Five sections will be used to present and justify our proposed normalization approach. The first will describe HTP experiments. The second will discuss current approaches for normalizing these and microarray experiments. The ensuing section will present our proposed method, which will be followed by applications to real data. Finally, the fifth section will be composed of concluding remarks.

Chapter 2
HTP Experiments

Proper discussion of high throughput phenotypic data pre-processing requires a better understanding of the experiments. In this section, key aspects of these experiments will be described and, in the process, essential vocabulary will be developed.

Individual HTP experiments differ in many respects. The primary distinctions are the organism of study, the gene perturbing strategy, the experimental format and the phenotype of interest. Organisms studied in this manner are S. cerevisiae, Drosophila melanogaster, C. elegans and various mammalian cell cultures (most commonly mice and human). There are numerous reasons for studying these particular organisms. For example, a significant proportion of genes in S. cerevisiae are homologous to human genes implicated in human disease [12]. Moreover, C. elegans is the simplest organism to have biological systems, such as a nervous system and digestive system, akin to those found in humans [11]. Most importantly, though, these are the organisms for which a substantial portion of the genome is known.

There are two classes of gene perturbation: gene deletion and gene inhibition. Gene deletion is permanent. The gene is removed from start codon to stop codon and usually replaced with a cassette.
The cassette contains many regions which it shares with the targeted gene both upstream and downstream to allow proper insertion. It usually contains a kanamycin resistance gene and unique molecular bar-codes. The latter was done for the genome-wide S. cerevisiae deletion set [9]. Figure 2.1 illustrates this process. Gene inhibition is temporary. This is accomplished through RNA interference (RNAi). Interfering RNA molecules will bind to the mRNA expressed by a targeted gene and neutralize it, hence silencing the gene as depicted in Figure 2.2. Once the gene inhibiting substance is exhausted, the cell behaviour returns to normal. RNAi is not yet perfected. Presently, there are still issues with efficiency and specificity. The targeted gene is not always completely silenced and other genes may be silenced simultaneously. If genes were televisions, deletion would be the mute option and RNA interference would be turning down the volume. Hence, gene deletion is more successful than inhibition. For convenience, throughout this report, we will often use mutant to describe a specimen with one or more deleted or inhibited genes even though technically the latter is not a mutant. It is also common to refer to gene deletions as knock-outs and inhibited genes as knock-downs.

These gene perturbing strategies are employed on a genome-wide scale. Dealing with genome-wide collections of mutants requires much physical space. There are three main experimental formats: multi-plate, living cell microarrays and pooled cells.
Figure 2.3 depicts the three formats. The multi-plate format is used for all types of gene perturbations and organisms. For RNAi experiments, microtitre plates are most commonly used. Hence in the RNAi community, the multi-plate format is often referred to as the microtitre or multi-well plate format [2]. For deletion sets, other types of plates are also used. For example, yeast deletion sets are often grown on agar plates. Agar plates are sterile Petri dishes that contain agar and nutrients. They lend themselves particularly well to yeast since agar is a great medium for growth, the most common phenotype for deletion sets, and they allow other interesting phenotypes to be measured, such as invasiveness [4]. Rarely are plates capable of hosting more than 1536 individual cell colonies or cultures. Genome-wide experiments deal with thousands of mutants, thus requiring a collection of plates. A plate-set is the collection of plates used to perform one experimental replicate for all mutants. It is this format which is of interest to this report.

Within the multi-plate format there are many experimental set-ups. Technical replicates of a mutant are colonies or wells with the same gene perturbation within a plate-set. In general, if there are technical replicates, they are contained on the same plate. Consequently, the mutants found on two separate plates within a plate-set are (almost) completely different. This has important consequences for data pre-processing. Fortunately, most, if not all, experiments have common elements found on multiple plates.
These are generally referred to as controls. In essence, a control is anything which is shared on two or more plates in the set and should have a similar phenotype on each plate. Wild type cells placed on each plate are a typical control. Wild types have not been altered in any way, so they are providing information in the form of a reference phenotype. This has the convenient feature of allowing the identification of gene perturbations which result in an enhanced, diminished or unchanged phenotype. Another common control is to have mutants with known phenotypic behaviour spanning over the range of phenotypes across plates. For example, suppose a growth experiment on a yeast deletion set is conducted. A plausible choice of controls would be to have a slow growing yeast, an average growing yeast and a fast growing yeast on each plate. It is also common for blanks to be present on each plate, but despite their consistent behaviour they are different from controls. They don't provide as much information about the plate effect as biological controls do. We will expand on the topic of the plate effects in the following sections where this will become evident. Blanks are nonetheless useful for quality control at the pre-processing stage.

Once the plate-set has been prepared, phenotyping may begin. The mutants are set in a fixed environment for a fixed period of time. A specific stress or treatment, such as an enzyme or drug, may be introduced. Essentially, the mutants are exposed to specific conditions. The phenotype of interest is often growth, even when exposed to treatment.
Furthermore, the study of one phenotype does not preclude the study of others, even with the same plate-set. In particular, in deletion sets, growth assays are often the first phenotype studied because other phenotypes may have to be adjusted for the observed growth.

The final experimental stage is phenotype quantification. Of interest is the exact measure of the phenotype. For example, in the case of a growth assay, the information of interest is the number of cells resulting from the growth period. These experiments are on a genome-wide level, so it is impractical for phenotypes to be measured exactly. As a compromise, technologies have been developed which estimate the phenotype. For simple assays, the plates may be scanned to capture an image and the resulting spot size or brightness for each mutant is measured. For more complex phenotypes cell engineering may be required to ease the task of quantification. Fluorescent or luminescent markers may be inserted to allow plate readers to quantify phenotypes. These are usually the product of reporter genes inserted in the cell. The raw image resulting from the scan is then processed using image quantification software. Figure 2.4 shows a raw image of a single agar plate from a growth assay using a yeast deletion set along with its processed image produced by such software. In the end, a numeric quantity is assigned to each mutant which hopefully reflects the phenotype.

Statisticians' motto is to never throw away data. Some experimenters choose to evaluate the results visually as healthy/not healthy.
Such an approach removes the need for normalization, yet simplifying potentially continuous data to a categorical or binary form is equivalent to throwing away a portion of the data. Furthermore, it introduces issues of replicability and increased error. Firstly, different people will categorize differently. In fact, due to the subjective nature of these measurements, the same person may categorize differently on two separate occasions. Secondly, the extra data manipulation required here may lead to human error. Fortunately, with the advent of image quantification technologies, this strategy is becoming less common. Phenotype quantification gives access to all information provided by the experiment and gives rise to a need for data pre-processing.

[Figure 2.1 diagram. Panel labels: PCR round 1, generate deletion cassette; PCR round 2, extend yeast homology; insert deletion cassette into yeast by homologous recombination.]

Figure 2.1: The following image was obtained from Robert G. Eason's article on the characterization of synthetic DNA bar codes [6], copyright (2004). It depicts both the construction of the cassette and its insertion into the location of the deleted gene. The unique identifying bar codes are essential for hybridizing purposes.

[Figure 2.2 diagram. Stages shown: digestion, unwinding, binding, degradation.]

Figure 2.2: The above figure was obtained from: Henschel A., Buchholz F. and Habermann B., DEQOR: A Web-Based Tool for the Design and Quality Control of siRNAs. Nucleic Acids Research, 32:W113-W120, 2004, by permission of Oxford University Press. It depicts the multiple stages involved in gene inhibition via RNAi.

Figure 2.3: Experimental Formats. a: Multi-well plates come in various sizes (typically 96, 384 and 1536 wells), thus the complete experiment would comprise many such plates. b: Living cell microarrays are large arrays with thousands of spots printed on them. An inhibiting solution targeting a precise gene is placed at each spot. Cells are then distributed across the entire array. This format does not lend itself well to whole organisms such as the worm due to space constraints. In contrast to pooled cells, living cell microarrays are confined to gene inhibition because they rely on inhibiting solutions. c: Pooled cells begin with an equal share of each mutant. After a set period of time or exposure to some stress, the collective DNA is analyzed using microarrays. This requires each mutant to be identifiable. Most commonly, identification is obtained through bar codes. Thus, this format is mostly confined to deletion mutants. Adapted by permission from Macmillan Publishers Ltd: Nature Reviews Genetics, copyright (2004) [2].

Figure 2.4: The image on the left is a scan of an actual plate used in a growth assay. The image on the right is a resulting quantified image obtained using the GridGrinder software. Along with this image comes a numerical value for each individual colony. These images are provided courtesy of the Conibear Lab.

Chapter 3
Foundations for HTP Data Pre-Processing

Multi-plate high throughput phenotypic experiments always involve multiple steps. Inevitably, the handling of each plate will be different to some degree. The discrepancies may come from various sources, such as the temperature of treatment or the distribution of nutrients in the agar. Undoubtedly, the experimental processes will affect the data in a plate-wise manner. The experimental artefacts introduced to the data in this fashion are referred to as the plate effect.

HTP experiments usually involve multiple conditions just as in microarrays. Plate-wise experimental artefacts exist both within and across plate-sets. On one hand, plate-set pre-processing - the removal of experimental artefacts across plates within a plate-set - presents a novel problem. On the other hand, pre-processing plate-sets across experimental conditions may be achieved using existing microarray methodology. Throughout the remainder of this paper, plate effect will pertain to the plate-wise distortion within a plate-set only.

Data arising from multi-plate HTP experiments are used in a fundamentally different manner than microarray data. In microarray experiments, comparisons of a gene across different conditions are direct and inter-gene comparisons are indirect. Comparisons between genes are done based on their behaviour across conditions. For example, genes may be clustered according to their up-regulation and down-regulation across conditions. Pre-processing methods in microarrays are typically concerned with removing experimental artefacts with regards to a gene across all samples. Doing so allows for both types of comparisons to be meaningful.
HTP experiments, on the other hand, are primarily interested in direct inter-mutant comparisons within plate-sets and across conditions. These differences are depicted in Figure 3.1. We can hope to make meaningful comparisons of mutants within a plate-set under the presence of random biological and/or technical variability, but they will be compromised by the presence of systematic, plate-wise experimental artefacts. The complete removal of the plate effect is an ideal which is (almost) never achieved in practice. The goal of pre-processing is to minimize the magnitude of the plate effect. Existence of a plate effect and a genuine desire to make direct comparisons within a plate-set establish a need to develop a pre-processing method for plate-sets in HTP experiments, which is the purpose of this paper. The phenotype of a particular mutant may be further affected by its location on the plate. These experimental artefacts will not be dealt with in this paper, but it is suspected that the conceptual infrastructure developed here will be generalizable to other sources of experimental artefacts such as location.

There is a simple blueprint to deriving a data pre-processing method. First, a distortion model deemed appropriate for data produced by the class of experiments at hand is developed. Second, biological assumptions plausible for these types of data, and which simultaneously allow for estimation of the parameters in the proposed distortion model, are established. From these a natural solution is formulated: the pre-processing method.
A good pre-processing method will remove as much experimental artefact as possible and remove as little gene perturbation effect as possible.

Having established our goal, the remainder of this chapter will explore current methods where the above blueprint will be used extensively. In order to do this properly, certain data characteristics will be shown and a distortion model will be proposed.

3.1 Phenotype Distribution

Perturbation of a gene, under a specific condition, can lead to four different outcomes with regards to a given phenotype. The phenotype may be enhanced, diminished, remain unchanged or become undefined because the mutant is nonviable. Since the mutant collections on each plate differ, the phenotype distributions also differ. If mutants were randomly allocated to plates in the set, the discrepancy with regards to phenotype distribution on each plate would solely be a result of sampling and not be substantial. Unfortunately, in practice, the allocation is not performed randomly and often much larger proportions of weak mutants (mutants with gene perturbations inducing diminished phenotype or mutants which are nonviable) are allocated to a few select plates within the set. For example, it is not uncommon for most plates to have a small proportion of weak mutants, say approximately 10%, and the few remaining plates to have a large proportion of weak mutants, say approximately 60%. Nonetheless, these distributions do have common features.
There are always mutants from all four categories (unaffected, diminished, enhanced and lethal) on each plate. Therefore, the distributions of phenotypes on each plate have approximately the same range. The mutants which incur phenotypic lethality (nonviable mutants) and blanks are structural zeroes. Consequently, the distributions are often bimodal with one mode corresponding to the structural zeroes. The distribution may be multi-modal with each mode corresponding to a category. Figure 3.3 depicts smoothed histograms of the raw observed phenotypes from two plates from the same plate-set. It shows the bimodal shape of phenotype distributions and the consequences of non-random mutant allocation. The non-random mutant allocation to plates, approximately equal range and bimodality are important features of the data which should be acknowledged when deriving a pre-processing method.

3.2 Distortion Model

It is accepted by most scientists that the effect of gene perturbation is multiplicative in nature. Likewise, the effects of most experimental processes are also multiplicative. Figure 3.2 depicts the distribution of phenotypes on separate plates within a plate-set. None of the plates have observations close to zero, where the phenotypes of nonviable mutants and blanks are expected to be. If the plate effect were solely multiplicative, their phenotypes would be almost unaffected. Therefore, the plate effect also has an additive component. These box plots not only support the notion of an additive component to the plate effect, but also serve as motivation for pre-processing by displaying the importance of the systematic experimental artefacts (i.e. not removing the plate effect will be extremely misleading).
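
The argument for an additive component can be checked on a handful of invented numbers. The following is a minimal sketch and not part of the original analysis: the function name distort, the scale of 1.3, the shift of 40 and the true phenotype values are all assumptions chosen for illustration. It shows that a purely multiplicative plate effect leaves structural zeroes at zero, whereas an additive component lifts them away from zero, which is the pattern the raw data display.

```python
# Illustrative sketch only: every number below is invented for demonstration.

def distort(true_values, scale, shift=0.0):
    '''Apply a hypothetical plate effect: multiply by scale, then add shift.'''
    return [shift + scale * z for z in true_values]

# Hypothetical true phenotypes on one plate: two structural zeroes
# (a blank and a nonviable mutant) followed by four viable mutants.
true_plate = [0.0, 0.0, 35.0, 80.0, 100.0, 120.0]

multiplicative_only = distort(true_plate, scale=1.3)
with_additive = distort(true_plate, scale=1.3, shift=40.0)

# A purely multiplicative effect cannot move the structural zeroes.
print(min(multiplicative_only))  # 0.0

# An additive component lifts every observation, zeroes included.
print(min(with_additive))  # 40.0
```

Under these assumptions, a plate whose smallest observations sit well above zero cannot be explained by scaling alone.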

Recall that the quantity recorded in the phenotype quantification stage of the experiment is not a measurement of the true phenotype, but a mapping of the phenotype to a numeric quantity. There is no serious interest in recovering the underlying phenotype measurement. Therefore throughout the rest of this paper, true phenotype will refer to the numeric quantity obtained from the mapping which would be observed if the exact protocol were followed. It will be denoted by the variable z and the observed quantity will be denoted by x.

The distortion model needs to be simple and sufficient. Thus, the distortion model may be expressed as a linear multiplicative-additive model,

    x_gp = α_p + β_p z_gp,    (3.1)

where g is the gene being perturbed and p the plate. Here no assumptions are made about the source of the location and scale parameters of the plate effect. The multiplicative component, β_p, is a combination of the effect of all experimental sources of multiplicative distortion. Similarly, the additive component, α_p, engulfs all sources of additive effect.

3.3 Current Approaches to Data Pre-Processing

In this section, the most common methods of data pre-processing in HTP experiments will be presented as well as select methods used for microarray data. It will be demonstrated that the latter methods are not valid for the data at hand, but certain concepts found within them may be useful for HTP data pre-processing.

3.3.1 HTP Experiments

The plate effect as specified by the distortion model in (3.1) is most easily dealt with in two steps. The first step is to remove the additive effect. It is the typical phenotype recorded when the true phenotype is zero. There are various methods of estimating this effect. One is to subtract, from all observed phenotypes on each plate, the plate's minimal observed value. This can also be done within smaller geographical confines of the plate. For example, the plate may be divided into quadrants and so on.
At the most extreme level, these quantities are estimated locally for each mutant colony or well. The latter requires local estimates to be provided by the image quantification software and is only valid for plates which don't use wells, such as agar plates. Removing the estimated additive effect, $\hat{\alpha}_p$, provides adjusted observations. Most approaches currently used in HTP experiments use this first step. For sake of simplicity, let $x_{gp}$ refer to the adjusted observations, $x_{gp}^{(\mathrm{new})} = x_{gp} - \hat{\alpha}_p$, for the remainder of subsection 3.3.1. The plate effect for these adjusted observations may be expressed as

$$x_{gp} = \beta_p z_{gp}, \qquad (3.2)$$

where the subscripts are as previously defined. The problem is now reduced to a multiplicative model. Belief in a common location of the distribution of true phenotypes on each plate may not enable the multiplicative components of the plate effect to be estimated, but it does allow for meaningful comparisons to be made. Let $m_p = \bar{x}_{\cdot p}$ be the mean of the adjusted observed phenotypes on plate $p$ and let $m = \bar{z}$ be the mean of the distribution of phenotypes having followed the exact experimental protocol. Hence, the value of $m$ is unknown. The variables $m$ and $m_p$ are instances of random variables with common expectations in the absence of a plate effect. Thus, the plate effect is estimated by $\hat{\beta}_p = m_p / m$. Given $m$, pre-processing can be accomplished by dividing observations on plate $p$ by $\hat{\beta}_p$. Pre-processed phenotypes will be denoted by $y$, so this may be expressed as $y_{gp} = x_{gp} / \hat{\beta}_p$. Due to the multiplicative nature of the effects of gene perturbation, comparisons are typically done by way of ratios. Hence comparing two mutants pre-processed in this fashion leads to

$$\frac{y_{gp}}{y_{g'p'}} = \frac{\beta_p}{\hat{\beta}_p} \cdot \frac{\hat{\beta}_{p'}}{\beta_{p'}} \cdot \frac{z_{gp}}{z_{g'p'}},$$

where hopefully $\frac{\beta_p \hat{\beta}_{p'}}{\hat{\beta}_p \beta_{p'}} \approx 1$. If the estimated plate effect is taken to be $\hat{\beta}_p = m_p / m$, then from

$$\frac{y_{gp}}{y_{g'p'}} = \frac{x_{gp} / \hat{\beta}_p}{x_{g'p'} / \hat{\beta}_{p'}} = \frac{(m / m_p)\, x_{gp}}{(m / m_{p'})\, x_{g'p'}} = \frac{x_{gp} / m_p}{x_{g'p'} / m_{p'}}$$

it becomes apparent that the value of $m$ is irrelevant.
Thus, by simply dividing the adjusted observations from each plate by the plate's mean, the plates are normalized to each other and meaningful comparisons may be made. The mean may be replaced by other measures of location, most commonly the median. This approach, referred to as mean or median plate centering, is the most commonly used in current HTP experiment analyses.

An alternative approach is to take advantage of controls conveniently placed on every plate in the set. This method was applied by B. L. Drees et al. [4]. Keeping to the same distortion model as in equation 3.2, rather than using the belief in a common location as a means to meaningful comparisons, it is the fact that the controls have a common phenotype that is used. Define $w_p$ to be the observed phenotype of the control on plate $p$. Under this constraint (common phenotype for controls) and model, the plate effect may be estimated by

$$\hat{\beta}_p = \frac{w_p}{\mathrm{median}_p(w_p)}. \qquad (3.3)$$

The adjusted data from plate $p$ are normalized by dividing them by their respective estimated plate effect $\hat{\beta}_p$. Similarly to median plate centering, it is equivalent to divide all $x_{gp}$ by $w_p$. In essence, all that is needed once the observations have been adjusted for additive effect is to divide by a phenotype which is believed to be approximately equal across all plates. The exception to this rule is blanks. These should already be approximately equal after the removal of additive effects. Hence, aligning these would provide no insight into the multiplicative component of the plate effect.

Both approaches are honest attempts at removing the plate effect, but they may not be the most effective ones. To begin with, estimations of the additive component as described above are crude. This in turn may affect model 3.2. If the model for which we are estimating the parameters, in this case $\beta_p$, is wrong to begin with, then the so-called normalized data will still be heavily distorted.
Moreover, mean/median centering assumes location to be the same, but this does not comply with the distribution of phenotypes previously described. Disproportions in the number of weak mutants on different plates within the set, such as that depicted in figure 3.3, will assuredly lead to different plate locations. Evidently, aligning a mutant which is a structural zero on one plate with a mutant whose phenotype is unchanged on another does not respect the underlying biology, an undesirable quality in any pre-processing method. At the very least, it is much wiser to use the median rather than the mean, but regardless both of these are poor measures of location when dealing with distributions which are not unimodal.

Using control phenotypes to estimate $\beta_p$ is an improvement on median/mean plate centering. In principle this is a really good idea, however it does have a weakness. Controls are often replicated in small numbers and any given colony may be strongly affected by location phenomena. Therefore, controls have substantial variances and division by variables with large variance leads to bad variance properties for the result. Moreover, quality control should be performed to ensure the control's behaviour is consistent enough to be useful. If technical replicates are included in the experiment, then variance of a control may be obtained on each plate. By comparing the average of these variances to the variance of control phenotypes across plates in the set, quality of the controls may be assessed.

3.3.2 Microarrays

A variety of normalization approaches have been proposed for microarray experiments. Such diversity is due in part to progress and different types of microarray chips. For example, data generated by high density oligonucleotide microarray technology differ from those obtained using cDNA microarrays allowing variant normalization approaches to be employed.
Two classes of normalization here are channel to channel normalization in cDNA experiments and normalizing many profiles to each other. The latter is used in high density oligonucleotide microarrays and in cDNA experiments which use many arrays. The pre-processing strategies developed for normalizing many profiles are the most useful to our cause. As previously mentioned the biological effect is multiplicative in nature. For this reason, it is common practice to log transform the data to render the effect linear. This, in turn, allows more traditional statistical methods to be applied. Two methods still popular today are quantile normalization [5] and variance stabilizing normalization [8].

In microarrays, each probe is present on each array. It is assumed that the majority of genes are not differentially expressed across conditions being studied. For the genes that are, change occurs in both directions. In other words, there is both up-regulation of some genes and down-regulation of others. These changes generally offset each other, so it is reasonable to assume equal intensity distributions on each array across an array set. It is this assumption which is used to generate quantile normalization. Recall that an array set in the microarray context is very different from a plate-set in the HTP context. A plate-set is like a single microarray.

To compare two datasets with regards to distribution, one can construct a quantile-quantile (Q-Q) plot. A perfectly diagonal line on a Q-Q plot indicates equal distribution, up to a location, of the two datasets. The Q-Q plot serves as motivation for a transformation that can be used to render the distribution of phenotypes on both plates equal. Two arrays in an array set will almost never have the same distribution. In practice the exact distribution of each plate is unavailable. Rather than use quantiles as a basis for a transformation, the $G$ order statistics are used.
Let $x_{(g)p}$ be the $g$th order statistic of observed phenotypes on plate $p = 1, 2$. Then the mapping

$$x_{(g)p} \mapsto \frac{x_{(g)1} + x_{(g)2}}{2}, \quad \forall g \qquad (3.4)$$

will force the distribution of both arrays to be the same. This non-linear transformation easily extends to array sets of size $P$. An example using multiple arrays is shown in figure 3.4.

Unfortunately, assuming equal phenotype distributions on each plate within a set is not reasonable in the HTP context. Here the mutants composing each plate are entirely different and often not randomly allocated. However, the next chapter will discuss how exploiting quantiles may be of use in the quest for HTP data pre-processing.

Huber et al. suggest an alternative pre-processing method for microarray data which both normalizes and stabilizes the variance all at once, which is aptly named variance stabilizing normalization [8]. It is based on a multiplicative-additive distortion model as in equation 3.1 combined with a multiplicative-additive error model suggested by Rocke, D. M. and Durbin, B. P. [10]. The resulting model is

$$x_{gp} = \alpha_p + \beta_p z_{gp} e^{\eta_{gp}} + \nu_{gp}, \quad \eta_{gp} \sim N(0, \sigma_\eta) \ \text{i.i.d.}, \quad \nu_{gp} \sim N(0, \sigma_\nu) \ \text{i.i.d.} \qquad (3.5)$$

This implies a quadratic mean-variance dependency and leads to the following pre-processing transformation

$$y_{gp} = \operatorname{arsinh}\left(\frac{x_{gp} - \alpha_p}{b_p}\right), \qquad (3.6)$$

where $b_p$ is a function of $\beta_p$ and $\sigma_\nu$. Thus the goal here is to estimate the coefficients $\alpha_p$ and $b_p$ as best as possible to eliminate the plate effect and (a topic which is not addressed here) to minimize the structural relationship between the mean and the variance.

The assumption used to derive this normalization method is that the majority of genes are not differentially expressed. The data is trimmed to include only genes believed to be non-differentially expressed. Naturally, the means of non-differentially expressed genes should not change across arrays.
Thus, by mean centering the trimmed data, maximum likelihood estimates for the parameters can be obtained. The result is a normalization approach which estimates $\alpha_p$ and $b_p$ simultaneously. Thus, rather than use crude estimates for the additive parameters, both the additive and multiplicative parameters of the distortion model are estimated using biological assumptions. To some the inverse hyperbolic sine function may seem foreign, but it is closely related to the log while having advantages over it: $\operatorname{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right)$. Therefore, using this normalizing method removes the need to log transform the data as discussed earlier.

Variance stabilizing normalization may not be directly applied to HTP data because key to the maximum likelihood estimation of the parameters is the presence of the same genes on each array. Unfortunately, plates within a set do not share mutants in HTP experiments.

Over the course of this chapter, currently employed HTP pre-processing methods were presented along with two microarray normalization methods. The fundamental principles driving current methodology are good, but there is room for improvement. Methods used on microarray data may not be directly applied to HTP data, but do offer tools which can help deal with the issues of non-random allocation and parameter estimation strategies.

[Figure panels: High Throughput Phenotypic; Microarray; Condition 1; Condition 2]
Figure 3.1: This diagram is a simple comparison of HTP and microarray experiments. In HTP experiments, plate-sets are required for each condition. Pre-processing of plate-sets is done independently for each condition.
[Figure title: Raw Phenotype Distribution in a Plate-Set; panels: Plate 1, Plate 2, Plate 3, Plate 4]
Figure 3.2: These data come from a plate-set containing more plates, but only four are included here for convenience. In these modified box plots, each box covers the entire phenotype range and uses a color scheme to present the approximate data distribution.

[Figure title: Unrelated Medians; x-axis: Phenotypes]
Figure 3.3: These are smoothed histograms of the observed raw phenotypes from two plates from the same plate-set. The plate depicted in blue contains many more weak mutants than the other plate and this was a conscious choice.

Figure 3.4: A plot of the densities for 27 arrays. The density after quantile normalization is shown in bold black. The above figure was obtained from: B. M. Bolstad et al., A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias. Bioinformatics, 19(2):185-193, 2003 [5].

Chapter 4

Recommended Pre-Processing Method

The previous chapter considered, among others, the assumption that phenotype distributions on plates within a set share a common mean or median. The event of non-random allocation of mutants to plates renders this assumption implausible. Furthermore, distributions with more than one mode are not well characterized by such a measure of centrality.

A broad theme identified in the previous chapter is the idea of aligning a collection of datasets on one or more features believed to be constant across the collection, prior to the introduction of experimental artefacts. A distortion model has been proposed in (3.1). The criticism in the previous chapter was with regards to the methods, not the model. We will also use this model and simple extensions of it. Furthermore, the concept of feature alignment makes sense and will also be used here. Thus, the task at hand is to establish proper features for alignment.
4.1 Features

There are two classes of features: internal and external.

4.1.1 External Features

This class of features was introduced when phenotypes on each plate were divided by each plate's wild type phenotypes. External features are those which arise from the design of the experiment. They are external to the phenotype distribution on plates. Controls and blanks are external features. Their presence is not required for the experiment to run, thus they will not always be available as a pre-processing tool. By definition, in the absence of a plate effect, the phenotype of each external feature would be approximately equal across plates. There is no questioning the biological validity of their approximate equality, but, as previously mentioned, quality control should be carried out before aligning these features.

4.1.2 Internal Features

Internal features are those which pertain to the phenotype distributions. Both median centering and quantile normalization use internal features. In the first case the feature is the median of the phenotype distributions on each plate and in the second the features are all quantiles of the plate distributions. Their invalidity as features to be aligned in HTP experiments has been established. Yet, there is a middle ground in which certain quantiles, and the mean up to certain quantiles, tend to be approximately equal in the absence of experimental artefacts in the HTP context.

Internal features considered thus far have been susceptible to departures from mutant randomization. For a better understanding of how the data are affected by a lack of randomization, consider a simple simulation which compares two plate-sets: one constructed to represent randomized mutant allocation, which is implicitly assumed by median centering, and the other to represent non-randomized mutant allocation, which reflects the reality of the yeast deletion set. Real data are used to construct a fictitious population of possible phenotypes.
They arise from a single agar plate, with 1536 yeast colonies. Figure 4.1 shows a smoothed histogram of this population.

For sake of simplicity, only two categories of data are considered: healthy and unhealthy. The cut-off value of 7500 was used to distinguish healthy mutants (those with phenotypes greater than 7500) and unhealthy mutants. It is based on a rough visual estimate which assumes the two left modes are composed of phenotypes from unhealthy mutants. According to this cut-off, approximately 30% of the mutants are unhealthy.

Both plate-sets consist of two plates, each containing 300 mutants. In the random plate-set, plates are simply populated by sampling with replacement from the population, thus representing random allocation. The pathological plate-set simulates the allocation often seen in practice. One plate receives 25% of its mutants from the unhealthy subpopulation and the other plate receives 50% of its mutants from that subpopulation. Figure 4.2 displays particular results using both sampling schemes. These represent plate-sets which have not been distorted by the plate effect. They are the data which we desire to recover after pre-processing.

The simulation consisted of 500 iterations. In each iteration, an instance of both plate-sets was populated. Within each set, for a range of quantiles, the observed quantile value was recorded for plate 1 and plate 2. The results are depicted in the scatter plots of figure 4.3. They show the 500 observed quantile differences at each quantile using a smoothed color density representation. The yellow lines are the smoothed mean differences.

The random plate-set suggests that most quantiles are approximately equal when mutants are randomly allocated to plates. The only exceptions are quantiles in the extreme tails, as these contain outliers.
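The simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real 1536-colony population is not available here, so `make_population` builds a synthetic stand-in (roughly 30% of values below the 7500 cut-off), and the function names and the choice of comparing the 50% and 90% quantiles are ours, not the thesis's.

```python
import random
import statistics

def quantile(values, q):
    """Empirical quantile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def make_population(rng, n=1536, cutoff=7500.0):
    """Synthetic stand-in (an assumption) for the real plate: about 30% unhealthy."""
    pop = []
    for _ in range(n):
        if rng.random() < 0.30:
            pop.append(rng.uniform(0.0, cutoff))  # unhealthy: at or below cut-off
        else:
            pop.append(max(rng.gauss(15000.0, 2500.0), cutoff + 1.0))  # healthy
    return pop

def sample_plate(rng, healthy, unhealthy, n=300, frac_unhealthy=None):
    """Sample with replacement; frac_unhealthy=None means random allocation."""
    if frac_unhealthy is None:
        pool = healthy + unhealthy
        return [rng.choice(pool) for _ in range(n)]
    k = round(n * frac_unhealthy)
    return ([rng.choice(unhealthy) for _ in range(k)] +
            [rng.choice(healthy) for _ in range(n - k)])

def mean_abs_quantile_diff(rng, healthy, unhealthy, fracs, q, iterations=500):
    """Average |quantile(plate 1) - quantile(plate 2)| over simulated plate-sets."""
    diffs = []
    for _ in range(iterations):
        p1 = sample_plate(rng, healthy, unhealthy, frac_unhealthy=fracs[0])
        p2 = sample_plate(rng, healthy, unhealthy, frac_unhealthy=fracs[1])
        diffs.append(abs(quantile(p1, q) - quantile(p2, q)))
    return statistics.mean(diffs)

rng = random.Random(0)
population = make_population(rng)
healthy = [x for x in population if x > 7500.0]
unhealthy = [x for x in population if x <= 7500.0]

for label, fracs in (("random", (None, None)), ("pathological", (0.25, 0.50))):
    for q in (0.50, 0.90):
        d = mean_abs_quantile_diff(rng, healthy, unhealthy, fracs, q)
        print(f"{label:12s} q={q:.2f} mean |difference| = {d:8.1f}")
```

With a population shaped like this, the median of the pathological plate-set should differ across plates far more than an upper almost-tail quantile does, which is the qualitative behaviour the simulation is meant to expose.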
Stability in the lower tail persists to more extreme quantiles than in the upper tail because outliers in the lower tail are solely due to error. The results presented in figure 4.3 don't suggest large discrepancies in the lower tail, but the smallest measured quantile is at 1%. Caution with regards to outliers should be taken at both ends. Here the quantiles in both tails which are not too extreme align just as well as the median. In fact some of them are better behaved than the median (e.g. the 85% quantile). In general, under random allocation the medians differ less across plates than do the extreme quantiles, but the difference tends to be small.

The pathological plate-set tells a different story. Here the median differs by a substantial amount across the set; it is the most unequal quantile. In fact, the medians correspond to phenotypes of different biological categories: healthy and unhealthy. On the other hand, the quantiles in the tails seem to be much less affected by the disproportionality of healthy and unhealthy mutants across the set, specifically, quantiles roughly between 1% and 5% and between 85% and 95%. Define the tails of the distribution which are not too extreme to be affected by outliers as the almost-tails. The simulation suggests that quantiles in the almost-tails will be approximately equal across the set and that this is robust with regards to departures from randomization of the mutants.

One expects the range of underlying phenotypes on each plate to be roughly the same. However range may be severely altered by outliers. Thus, the idea is to use a compromise between range and center with more emphasis placed on range. Therefore, the middle ground between assuming the equality of medians and assuming equality of all quantiles is to assume equality of quantiles in the almost-tails.
Quantiles in the almost-tails of phenotype distributions are internal features which are appropriate for HTP experiments due to their robustness to modest departures in randomization.

Another example of internal assumptions is the approximate equality of the modes in the phenotype distributions. In particular, the left modes, when these exist, often correspond to phenotypes of blanks and nonviable mutants, so their alignment would make much sense. It isn't clear whether other modes are approximately equal. The upside to aligning the left modes is that they are the least affected by error and hence have the smallest variance. The measurement errors are believed to be more than a simple mean zero additive error, but blanks and nonviable mutants are only affected by this error. Ultimately, it is a simple task to compute quantiles, while it is often not simple to identify modes of a distribution. Therefore, we place a heavy emphasis on quantiles in our pre-processing approach.

There exist cases where approximate equality of quantiles in the lower tail may be questionable. In cases where the left mode corresponds to non-viable mutants and blanks only, the discrepancy between phenotypes of viable and non-viable mutants may be large. This, in turn, may reduce the region of lower quantiles over which approximate equality across the plate-set holds and, moreover, greatly increase the difference between lower quantiles which are not in the region of approximate equality. An example of such a scenario will be expanded upon in the next chapter.

As an alternative, rather than assume the approximate equality of the quantiles in the lower tail, assume the approximate equality of the average phenotype up to a given quantile. The quantile becomes an upper bound for the values which are averaged over. This is analogous to mean centering, but restricted to the left tail.
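The average phenotype up to a given quantile is simple to compute. The sketch below is one possible reading of it: the function name `mean_left_tail_trimmings` and the convention of averaging the lowest `int(r * n)` order statistics (at least one) are assumptions made for illustration, not a definition taken from the thesis.

```python
def mean_left_tail_trimmings(values, r):
    """Mean of the lowest fraction r (0 < r <= 1) of the observations.

    The r-quantile acts as an upper bound on the values averaged over.
    Rounding the tail size down with int() is an illustrative choice.
    """
    s = sorted(values)
    k = max(1, int(r * len(s)))  # number of observations in the left tail
    return sum(s[:k]) / k

# Toy plate (made-up numbers): a few structural zeroes plus viable mutants.
plate = [0.0, 10.0, 20.0, 5000.0, 9000.0, 12000.0, 15000.0, 16000.0]
print(mean_left_tail_trimmings(plate, 0.25))   # averages the two smallest values
print(mean_left_tail_trimmings(plate, 0.50))   # averages the four smallest values
```

Because the feature averages everything below the cut point, a single jump between nonviable and viable phenotypes enters it gradually rather than all at once, which is consistent with the delayed instability discussed in this section.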
Essentially, the claim is that the expected value of the lowest $r$% of the phenotype distributions on each plate across the set should be approximately equal for values of $r$ that are small (but not vanishingly small). In Huber et al.'s work, mean centering was carried out using trimmed data. Here we suggest the opposite, to align the mean of the "trimmings" in the left tail. Hence this internal feature is called the mean of left tail trimmings. Figure 4.4 depicts the observed differences of the mean left tail trimmings on both plate-sets and compares the results with those of quantiles.

There do not seem to be any differences in the approximate equality of these internal features within the random plate-set context. However, there is an important difference observed in the pathological plate-set. The approximate equality of the mean left tail trimmings spans a much larger range of quantiles than does the approximate equality of the quantiles themselves. The instability due to non-randomized allocation is simply delayed by averaging over the trimmings. The mean up to a given quantile is always available and the region of approximate equality continues to be the almost-tails. Thus, mean left tail trimmings preserve the simplicity found in quantiles and will be favored as the internal feature of choice in the left tail. The choice of alignment features will be revisited in the next chapter.

4.2 Proposed Pre-processing Method

Pre-processing involves transforming the raw data to remove systematic experimental artefacts. Having specified a model for the plate effect, a natural transformation arises,

$$f_p(x) = \frac{x - \alpha_p}{\beta_p}, \qquad (4.1)$$

where $\alpha_p$ and $\beta_p$ are as in (3.1). Thus far we've referred to $\alpha_p$ and $\beta_p$ as the additive and multiplicative components, respectively, of the plate effect. Within the context of the transformation in (4.1), they will, respectively, be referred to as the location and scale parameters.
The novelty of Huber et al.'s normalization method which we wish to incorporate into our method is the simultaneous estimation of both the location and scale parameters based on biological assumptions. The assumption proposed is the approximate equality of select features of the data. Two classes of such features which respect the underlying biology of the experiments have been established.

Using this transformation on a plate-set containing $P$ plates leads to $2P$ unknown parameters. Simply aligning one feature across the plate-set will not be sufficient to solve for the unknown parameters. The alignment of two or more features, though, does allow for simultaneous parameter estimation. However, the estimation method will differ if only two features are aligned rather than three or more. These two cases will be considered separately.

In order to avoid over-parameterization, constraints need to be set. There are a few natural choices here. One plate may be set as a reference plate. Another option is to constrain the quantiles to equal their averages, which is analogous to what is done in quantile normalization. In the end, the choice of constraints is not critical. Results obtained using one constraint may be linearly transformed to conform to another constraint using the very same slope and intercept for each datum. Using a reference plate introduces artefacts with respect to plate-sets under other conditions, but pre-processing across conditions will remove these.

By setting constraints, the transformation does not use proper estimations of $\alpha_p$ and $\beta_p$ as suggested in (4.1). Therefore, we reformulate the transformation as $f_p(x) = \frac{x - a_p}{b_p}$. Without loss of generality, set $a_1 = 0$ and $b_1 = 1$ (i.e. set plate 1 as the reference plate). Hence, $a_p = \alpha_p - \alpha_1$ and there remain $2P - 2$ unknown parameters.

4.2.1 Exact alignment of 2 features under the linear model

We begin with the case of two features.
Let $q_{jp}$ denote feature $j$ on plate $p$. Having set plate 1 as a reference plate the following $2P - 2$ equations may be formulated based on the application of the transformation in (4.1):

$$f_1(q_{j1}) = f_p(q_{jp}) \quad \text{for } j = 1, 2, \quad \forall p = 2, \ldots, P. \qquad (4.2)$$

There are $2P - 2$ equations and an equal number of unknowns, therefore an exact solution exists. In fact, when using transformation (4.1), a direct solution of the unknown parameters is only possible when aligning two features. For each individual plate, simple algebra may be used to solve for its location and scale parameters:

$$a_p = \frac{q_{2p} q_{11} - q_{21} q_{1p}}{q_{11} - q_{21}}, \qquad (4.3)$$

$$b_p = \frac{q_{1p} - q_{2p}}{q_{11} - q_{21}}. \qquad (4.4)$$

For three or more features, let $\mathbf{a} = (a_1, \ldots, a_P)$ be the vector of location parameters, $\mathbf{b} = (b_1, \ldots, b_P)$ be the vector of scale parameters and $\mathbf{q}_j = (q_{j1}, q_{j2}, \ldots, q_{jP})$ be the vectors of features for $j = 1, \ldots, Q$. Without the possibility of an exact alignment of the features, the next best solution is to minimize their variance across the set. This leads to the following objective function:

$$g(\mathbf{a}, \mathbf{b}) = \sum_{j=1}^{Q} \frac{1}{P - 1} \sum_{p=1}^{P} \left( f_p(q_{jp}) - \frac{1}{P} \sum_{p'=1}^{P} f_{p'}(q_{jp'}) \right)^2. \qquad (4.5)$$

Various optimization algorithms may be implemented to minimize the objective function and obtain the parameter estimates of interest. The Newton-Raphson method requires the inversion of the Hessian matrix which is problematic as it often becomes singular. Therefore, it is recommended to use optimization methods which only use the exact calculation of first derivatives. There are two classes of such methods: conjugate gradient methods and quasi-Newton methods. They differ in their strategy to numerically estimate the Hessian. Conjugate gradient methods require less storage than quasi-Newton methods require, but the latter are generally less fragile [7]. Therefore, since these will never be large problems, it is recommended to use quasi-Newton methods in general, though both methods are appropriate.

Regardless of the class of method chosen, the gradient is required for their proper implementation.
It is a vector of length $2P - 2$ with $P - 1$ entries which are the partial derivatives of the objective function $g(\mathbf{a}, \mathbf{b})$ with respect to each location parameter (excluding $a_1 = 0$) and $P - 1$ entries which are the partial derivatives with respect to each scale parameter (excluding $b_1 = 1$). Let $\bar{q}_j = \frac{1}{P} \sum_{p=1}^{P} f_p(q_{jp})$ be the average $j$th feature across the transformed plate-set, then the following equations may be used to construct the gradient.

$$\frac{\partial}{\partial a_k} g(\mathbf{a}, \mathbf{b}) = \frac{-2}{b_k (P - 1)} \sum_{j=1}^{Q} \left( f_k(q_{jk}) - \bar{q}_j \right) \quad \text{for } k = 2, 3, \ldots, P, \qquad (4.6)$$

$$\frac{\partial}{\partial b_k} g(\mathbf{a}, \mathbf{b}) = \frac{-2}{b_k^2 (P - 1)} \sum_{j=1}^{Q} \left( f_k(q_{jk}) - \bar{q}_j \right) \left( q_{jk} - a_k \right) \quad \text{for } k = 2, 3, \ldots, P. \qquad (4.7)$$

The complete derivation of these results is given in Appendix A.

Once the optimization method has been selected and the first derivatives specified, the last requirement is a choice of initial parameters. Initial parameters which are too far from the minimum may lead the algorithm to converge to a local minimum or to travel in the wrong direction and fail to converge within the specified number of iterations. Furthermore, the choice of initial parameters will affect the number of iterations required to converge.

There are various ways to obtain good initial parameters. One way is to first use the difference between the minimum observation on each plate and that of the reference plate to initiate the location parameters: $a_{p0} = \min(\mathbf{x}_p) - \min(\mathbf{x}_1)$. Here, $\mathbf{x}_p$ is the vector of phenotypes on plate $p$. Then initiate the scale parameters by taking the ratio of the range of each plate over that of the reference plate. In other words, first subtract the obtained location parameter guesses from each plate: $y_{gp} = x_{gp} - a_{p0}$. This is analogous to the removal of the additive effect discussed in the current methods section.
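The objective function, its gradient and the starting values described above can be sketched as follows. This is a minimal illustration, assuming the objective is the sum over features of the sample variance of the transformed features across plates; the function names and the data layout (`q[j][p]` for feature `j` on plate `p`, with index 0 playing the role of the reference plate) are ours. The toy feature values are made up.

```python
def transform(x, a, b):
    """Reformulated linear transformation f_p(x) = (x - a_p) / b_p."""
    return (x - a) / b

def objective(a, b, q):
    """Sum over features of the sample variance, across plates, of the
    transformed features. q[j][p] is feature j on plate p."""
    P = len(a)
    total = 0.0
    for qj in q:
        f = [transform(qj[p], a[p], b[p]) for p in range(P)]
        fbar = sum(f) / P
        total += sum((v - fbar) ** 2 for v in f) / (P - 1)
    return total

def gradient(a, b, q):
    """Partial derivatives with respect to a_k and b_k for every plate except
    the reference plate (index 0), whose parameters stay fixed at 0 and 1."""
    P = len(a)
    da = [0.0] * P
    db = [0.0] * P
    for qj in q:
        f = [transform(qj[p], a[p], b[p]) for p in range(P)]
        fbar = sum(f) / P
        for k in range(1, P):
            dev = f[k] - fbar
            da[k] += -2.0 * dev / (b[k] * (P - 1))
            db[k] += -2.0 * dev * (qj[k] - a[k]) / (b[k] ** 2 * (P - 1))
    return da[1:], db[1:]

def initial_parameters(plates):
    """Starting values: location from the plate minima, scale from the ratio
    of ranges (the range is unchanged by subtracting the location guess)."""
    ref = plates[0]
    a0 = [min(x) - min(ref) for x in plates]
    ranges = [max(x) - min(x) for x in plates]
    b0 = [r / ranges[0] for r in ranges]
    return a0, b0

# Toy example: three plates, three features each (made-up numbers).
q = [[1000.0, 1500.0, 900.0],
     [5000.0, 7200.0, 4100.0],
     [9000.0, 13000.0, 7500.0]]
a = [0.0, 200.0, -100.0]
b = [1.0, 1.4, 0.8]
print(objective(a, b, q))
print(gradient(a, b, q))
```

A routine of this shape supplies exactly what a gradient-based optimizer needs: the value of the objective and its first derivatives at any candidate `(a, b)`. Comparing `gradient` against central finite differences of `objective` is a cheap sanity check before handing both to a quasi-Newton or conjugate gradient implementation.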
Then initiate the scale parameters as

$$b_{p0} = \frac{\max(\mathbf{y}_p) - \min(\mathbf{y}_p)}{\max(\mathbf{y}_1) - \min(\mathbf{y}_1)}.$$

4.2.3 Exact alignment of more than two features using a piecewise distortion model

Within the context of aligning three or more features, there is an alternative to optimization. A simple extension to the distortion model allows for exact alignment of the features. The distortion model could be described as piecewise linear. Perhaps the plate effect is not exactly the same at different magnitudes of the phenotype. Using linear spline interpolation with knots at the features will allow these to be equal across the set, rather than approximately equal. Also, different slopes and intercepts between each feature would help account for any change in plate effect due to phenotype. Figure 4.5 depicts the difference between the linear transformation as proposed in (4.1) and linear spline interpolation.

The transformation called upon by linear splines is as follows. Suppose there are $Q$ features. Then there are $Q$ knots $(x_i, y_i)$ which the transformation is to pass through. The linear function between each of these knots is:

$$s_i(x) = y_i + \frac{y_{i+1} - y_i}{x_{i+1} - x_i} (x - x_i), \quad x \in [x_i, x_{i+1}]. \qquad (4.8)$$

Thus the function is only defined for phenotypes between the selected features. The linear mappings between the first and second features and between the two upper features may be extended to map all observations. Alternatively, all values below $x_1$ may be mapped to $y_1$ and all those above $x_Q$ may be mapped to $y_Q$.

Constraints are once again called for. The candidates are the same as before, either align to a reference plate or align to the average value of each feature. In this case, the choice cannot be changed afterward using a global linear transformation, therefore the choice of constraint will have a heavier impact on the results.
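A small sketch of the linear spline mapping in (4.8) may help make the two boundary conventions concrete. The function name `linear_spline_map`, the `extrapolate` flag and the toy feature values are assumptions for illustration; the knots are taken to be a plate's observed features (strictly increasing) paired with the values they should be mapped to, here the across-plate averages of each feature.

```python
def linear_spline_map(x, knots_x, knots_y, extrapolate=True):
    """Map x through the piecewise linear function passing through the knots
    (knots_x[i], knots_y[i]); knots_x must be strictly increasing.

    extrapolate=True extends the first and last segments beyond the knots;
    extrapolate=False maps values outside the knots to the nearest knot value.
    """
    n = len(knots_x)
    if not extrapolate:
        if x <= knots_x[0]:
            return knots_y[0]
        if x >= knots_x[-1]:
            return knots_y[-1]
    # Locate the segment; outside the knots the first or last segment is reused.
    i = 0
    while i < n - 2 and x > knots_x[i + 1]:
        i += 1
    slope = (knots_y[i + 1] - knots_y[i]) / (knots_x[i + 1] - knots_x[i])
    return knots_y[i] + slope * (x - knots_x[i])

# Observed features on two plates (made-up numbers), one row per plate.
features = [[1000.0, 6000.0, 9000.0],
            [1400.0, 5000.0, 7000.0]]
# Target for each feature: its average value across the plates.
targets = [sum(col) / len(col) for col in zip(*features)]

for plate_features in features:
    mapped = [linear_spline_map(v, plate_features, targets) for v in plate_features]
    print(mapped)  # each plate's features land on the common targets
```

Every feature is mapped onto its target on every plate, up to floating-point rounding, which is the exact alignment that a single linear transformation cannot achieve with three or more features.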
For this reason, it seems more reasonable to map each feature to the average value of that feature rather than to the value observed on a reference plate.

There are two advantages to using linear splines over using linear transformations and optimization. First, solving the latter is more computationally expensive: it requires more memory and time. Secondly, modest departures from linearity in the plate effect will be compensated for using this approach. Hence, it offers more flexibility.

The method we propose uses biological assumptions to simultaneously estimate the additive and multiplicative components of the plate effect. The assumptions used to achieve this are the approximate equality of two or more features. The approximate equality of quantiles in the almost-tails is an example of internal assumptions which are robust to departures from random mutant allocation to plates. Quantiles are not the only features which can be used. For aligning more than two features, two methods have been proposed. They are both valid and have their pros and cons. A more in-depth comparison of the two methods will be conducted in the next chapter.

[Figure title: Phenotype Population for Simulation; x-axis: Phenotype]
Figure 4.1: The approximate density of the phenotype population used for the simulation. The cut-off is depicted by the black line at 7500. The heavy left tail suggests that a fair number of mutants are affected by the gene perturbation, which this cut-off value tries to capture.

[Figure panels: Approximate Densities of Random Plate Set; Approximate Densities of Pathological Plate Set]
Figure 4.2: Plates generated in a random fashion still present different phenotype distributions since the mutant collections on each plate are different. However this difference is minimal compared to that which arises when a large collection of weak mutants is purposely contained on one plate.
[Figure panels: Random Plate-Set: Seeking quantiles that are approximately equal within a set; Random Plate Set: Difference in Quantiles; Random Plate Set: Average Left Tail Trimming; Pathological Plate Set: Difference in Quantiles; Pathological Plate Set: Average Left Tail Trimming; x-axes: Quantiles, from 0.0 to 0.5]
Figure 4.4: Again the range of the y-axis is kept constant and all start at 0.

[Figure 4.5 plot: Mapping Observations to Normalized Values For a Single Plate; x-axis: Observed Phenotypes]
Figure 4.5: The red dots are ordered pairs composed of the observed features and the values they should be mapped to. Linear spline interpolation allows for these features to be mapped exactly to the desired location. A linear transformation cannot do this, so the line which minimizes the distance between the mapped features and their desired value is chosen instead.

Chapter 5 Application to Real Data

In this chapter, our recommended pre-processing method will be applied to real data. The many knobs resulting from the method's flexibility, in both the class of features to be aligned and the alignment method, will be explored. To properly do so requires an understanding of the data used throughout the chapter.

5.1 Data

The data arise from growth assays using yeast deletion sets and are provided by the Conibear lab. Each plate-set is composed of 14 agar plates. Each agar plate contains 1536 yeast colonies. There are 384 distinct mutants per plate with four technical replicates of each (384 x 4 = 1536). Of the 384 mutants, only five are not deletion mutants. One is a blank and four are controls.
The controls are yeasts which have different growth rates expected to cover the spectrum of possible phenotypes. The location effect is quite strong in these data. Appendix B describes the nature of the edge effect along with the interim solution. It is worth noting because the controls in this experiment are near the edge. The mutants are not randomly allocated to plates. On twelve of the fourteen plates, yeast colonies are placed on the agar plates according to their location on purchased plates, which is roughly based on their order on the chromosome. The remaining two plates contain mutants of interest (many of them of the weak variety or susceptible to food competition) with additional blanks.

5.2 Recommended Method versus Median Centering

We begin by comparing the old method of median centering to the recommended method to see if there is any apparent improvement. Figure 5.1 contains three plots and, for simplicity, only 4 of the 14 plates are shown. The first plot depicts smoothed histograms of observed raw phenotypes on the four selected plates. It shows the presence of a plate effect as well as the presence of non-random mutant allocation (as seen by a much greater proportion of mutants in the left mode of one plate). The plate whose density is depicted by the blue line, plate 13, is one of the two plates which contain mutants of interest as described above. The second plot in this figure depicts smoothed histograms of pre-processed phenotypes obtained by median centering. Specifically, the additive effect was first removed by subtracting from each observation its plate's minimal phenotype. Then the resulting adjusted phenotypes were divided by the plate medians. There appears to be an improvement over the raw data. The plate depicted in green is now aligned with the others, but there still appear to be issues with plate 13 (blue line).
The majority of its non-blanks have a pre-processed phenotype larger than those reported on the other three plates, resulting in a range which is one and a half times larger than that of the other three plates. Another problem is the misalignment of the left modes, which should align as they pertain to blanks and nonviable mutants. The third plot depicts results from aligning both the blanks and the 90% quantiles. Since only two features were aligned, an exact solution was available, so there was no need to choose between linear splines and optimization. Visually the results are quite pleasing. Firstly, the non-blanks on plate 13 have adjusted phenotypes within the range of the other plates and, secondly, the left modes align well. These are real data, so it is impossible to evaluate these methods based on the true plate effects as they are unknown. However, the presence of blanks and controls on all plates can be used as an assessment tool to complement visual inspection of the smoothed phenotype histograms. Figure 5.2 contains three diagnostic plots to complement the previous figure. Each plot displays the median phenotypes of the four controls and blanks on the selected plates. Controls one and two have four technical replicates and controls three and four have two (due to the edge effect). This allows for an analysis of variance to be carried out. Unfortunately, the blanks are on corners, so they don't have technical replicates. The p-value in the title of each plot corresponds to the test for a plate effect in a two-way analysis of variance (controls and plates as explanatory variables with no interactions). The phenotypes were transformed in order to achieve homoscedasticity. Looking at these p-values, it is apparent that the recommended method has removed much more of the plate effect. The diagnostic plot for median centering supports the argument above: having a range which is one and a half times larger than the range of other plates is problematic.
This can be seen by the normalized values of controls 1, 3 and 4 of plate 13, with control 3 being off the chart. The diagnostic plot for the recommended method of feature alignment shows a clear improvement on both the raw data and median centering. The blanks, of course, are perfectly aligned and the controls are better behaved. It is apparent on this plot that no transformation would allow both the blanks and control 2 to be perfectly aligned. Results would have been different had we chosen to align control 2 instead of blanks. This raises the question: which features should we align?

5.3 Choice of Feature

5.3.1 Observing internal feature behaviour induced through the alignment of external features

In chapter 4, two classes of features to align across plates were defined: internal and external. By design, external features - controls and blanks - should be approximately equal across all plates. Thus, the natural question to ask is whether the alignment of external features will induce approximate equality of quantiles. To complement chapter 4's simulation, the alignment of external features is applied to real data and the resulting quantiles and mean left tail trimmings on each plate are compared in similar fashion. This particular data-set does not have severe problems in proportions of healthy and unhealthy mutants, as can be seen in the first plot in figure 5.3. It shows the smoothed histograms of the phenotypes on each of its plates after alignment of controls. The distance between the left mode and the remaining phenotypes is large, as was discussed in chapter 4. Dealing with real data, it is not possible to obtain 500 values for each quantile, thus using a smoothed scatter plot as in the simulation is not useful. Instead, the variance of each quantile across the 13 plates is presented. In the top right-hand corner plot, the variance of quantiles over the entire range is shown.
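The variance-across-plates diagnostic just described can be sketched as follows. This is a minimal illustration rather than the code used for figure 5.3: the function names are hypothetical, the quantile uses linear interpolation between order statistics (one common convention among several), and the mean left tail trimming is assumed here to be the mean of the lowest alpha fraction of a plate's phenotypes, the definition itself being given in chapter 4.

```python
# Sketch: variance of an internal feature across the plates of a set.
from statistics import variance


def quantile(values, prob):
    # Linear interpolation between order statistics (one common convention).
    data = sorted(values)
    pos = prob * (len(data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])


def mean_left_trimming(values, alpha):
    # Assumed definition: mean of the lowest alpha fraction of a plate.
    data = sorted(values)
    count = max(1, int(alpha * len(data)))
    return sum(data[:count]) / count


def feature_variance(plates, feature, level):
    # plates: list of per-plate phenotype lists (pre-processed values).
    # Returns the sample variance across plates of feature(plate, level).
    return variance([feature(plate, level) for plate in plates])
```

Evaluating feature_variance over a grid of levels, once with quantile and once with mean_left_trimming, gives curves of the kind plotted in figure 5.3; a level where the variance is small is a candidate feature for alignment.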
The quantiles in the upper almost-tail align well, but the lower quantiles are problematic, as was stated in chapter 4. There is a small region over which the quantiles don't differ too much, but quantiles slightly outside this region do not align well at all. The last plot depicts the variance of mean left tail trimmings. The region over which this feature is stable across the set is much larger and the increase in its discrepancy outside that region is much less severe. This further supports the superiority of mean left tail trimmings over quantiles in the lower almost-tail. It also suggests that the alignment of external features induces the approximate equality of the proposed internal features. In the next section we will further compare internal and external features.

5.3.2 Internal and External Features in a Case Study

Up to now controls and blanks have been used as a means of evaluation. Controls can equally be assessed via blanks and internal features. Consider two data sets A and B. Figures 5.4 and 5.5 show the results of pre-processing each data set by minimizing the variance of the four controls across plates via a linear transformation. Each figure contains three sections. The first gives the settings used for pre-processing and the p-value resulting from a Kruskal-Wallis test for the presence of a plate effect. The second shows the smoothed histograms of the pre-processed data and the third shows the corresponding values of select features for each plate. The alignment is not perfect, as is readily observable, but real data is always noisy. In the first figure, the p-values do not suggest that there is a plate effect. Furthermore, the upper quantiles align with about equal variance to controls 1 and 3, and the two smaller mean left tail trimmings align with approximately the same variance as the upper quantiles if we ignore the one plate which seems problematic for the controls as well.
This is encouraging and supports our claims. The second figure suggests that this complementary behaviour between internal and external features isn't always observed. The low p-value for control 1 suggests heavy contradictions between controls 1 and 3. In the diagnostic plot (the bottom section), the internal features and the blanks do not seem approximately equal with the alignment of controls. The results from the alignment of controls in this scenario are not desirable. Similarly to the results of median centering shown earlier, the pre-processed phenotypes from some plates are almost all higher than those from others. Though the central behaviour of plate distributions is expected to differ, the biology does not dictate such a severe difference. Hence, figure 5.5 doesn't contradict the approximate equality of internal features via the alignment of controls as much as it questions the reliability of controls. Inevitably, having only 4 or 2 replicates of each control may give way to substantial variance and considerably limit the usefulness of the controls. There are benefits to be had by approaching the problem the other way round, by aligning internal features and observing the resulting alignment of controls. Figure 5.6 considers Set A. Again the results of pre-processing, this time with internal features, are visually pleasing. The controls seem for the most part to be well behaved, although not enough to suggest that there is no plate effect (see Kruskal-Wallis p-values). Figure 5.7 depicts the pre-processing of Set B using internal features.
Not surprisingly the controls in turn do not align well, but the smoothed histograms suggest superior results here. The plates are well aligned as are the left modes. Furthermore, the blanks are also approximately equal. The controls in the same range of phenotypes as the blanks do not align well at all. What distinguishes the blanks from the controls is that they are only affected by additive effects and errors. Thus they are more reliable than controls when technical replicates are limited. In this brief case study, controls were effective pre-processing features for Set A, but not for Set B. Simultaneously, internal features were effective pre-processing features in both data sets. The phenotype of each control is based on a limited number of technical replicates. They are susceptible to various sources of error and experimental effects, such as the edge effect, which in turn may severely increase their variance. The biology dictates that their values should be approximately equal across the plate-set, but their high variance and low replication count hinder their usefulness here. Quantiles and mean left tail trimmings on the other hand are based on all the plate's phenotypes. This renders them more robust and thus more reliable as pre-processing features. Regardless of which feature is used, quality control should always be carried out as was done in all four cases here.

5.4 Optimization versus increasing the number of parameters in the model

Two different paths, which both stem from the exact alignment of two features, may be taken when aligning more than two features.
Either the distortion model can be kept linear, leading to parameter estimation via optimization, or it can be slightly extended, allowing for linear splines to be used in favor of a linear transformation. Solely based on the model, it is more desirable to use the simple extension to the distortion model, which allows for slight departures from linearity to take place. It is a more flexible approach. On the basis of implementation, linear splines are much less expensive than optimization, both in memory and time. However, when using linear splines, it is important for the knots to be in order so that the resulting transformation is injective. When using controls as features, this is not always possible. For example, on one plate from Set B, control 4 expresses a lower phenotype than control 2, yet the average value of control 4 across the plate-set is higher than the average value of control 2. Therefore, using linear splines with controls may not always be possible. Quantiles and other internal features on the other hand are necessarily ordered and lend themselves nicely to linear splines. Lastly, when using splines, there is the danger of over-fitting. Figure 5.8 differs from figure 5.6 only in its estimation method: linear splines with knots at the features rather than optimization. The p-values do not change significantly, suggesting that there is no over-fitting. This seems to hold consistently when using three or four features at a time. Thus, linear splines are the suggested alignment method and optimization of parameters for a linear transformation should be used when linear splines are not tractable.

[Figure 5.1 plots: Plate Densities of Raw Data Within a Plate Set; Densities After Removal of Estimated Additive Effect & Division by Plate Median]
Figure 5.1: For simplicity, we only depict the density of four plates. The plate whose density is depicted by the blue line is one whose mutants were not randomly allocated.
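The ordering requirement of section 5.4 lends itself to a simple check before choosing the estimation method. The sketch below is an assumption-laden illustration: knots_are_ordered and choose_method are hypothetical names, and the numbers in the usage note are invented to mimic the Set B situation in which control 4 falls below control 2 on one plate while its set-wide average is higher.

```python
# Sketch: decide whether linear splines are usable for one plate.
def knots_are_ordered(observed, targets):
    # Sort knot pairs by observed value, then require strictly increasing
    # observed values and strictly increasing targets, so that the
    # piecewise linear map through the knots is injective.
    pairs = sorted(zip(observed, targets))
    return all(
        x0 < x1 and y0 < y1
        for (x0, y0), (x1, y1) in zip(pairs, pairs[1:])
    )


def choose_method(observed, targets):
    # Fall back to optimization of a linear transformation when the
    # knots cannot be ordered consistently.
    if knots_are_ordered(observed, targets):
        return 'linear splines'
    return 'optimization'
```

With hypothetical observed control values 9000, 12000 and 11000 and set-wide averages 9500, 11500 and 13000, the check fails and choose_method returns optimization; with quantiles as features the observed values are ordered by construction, so the check is expected to pass.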
[Figure 5.2 plots, each showing control4, control3, control2, control1 and blank: Diagnostics to Raw Data, ANOVA P-Value for Plate Effect: 1.264e-12; Diagnostics to Median Centering, ANOVA P-Value for Plate Effect: 6.272e-08; Diagnostics to Recommended Method, ANOVA P-Value for Plate Effect: 0.0126]
Figure 5.2: These plots complement the previous ones. They depict the values of controls and blanks across the (reduced) plate-set considered in the two normalization approaches shown in figure 5.1.

[Figure 5.3 plots: Normalized Plate Densities; Variance of Quantiles; Quantiles in the Left Tail; Mean Left Tail Trimmings; x-axes: Quantiles]
Figure 5.3:

In figures 5.4 to 5.8, the plates used are 1 to 13, Edge Removed is TRUE and Background Corrected is FALSE. Each figure contains a plot of Normalized Plate Densities and a Diagnostics plot showing trim 0.075, trim 0.04, trim 0.025, 95% quantile, 85% quantile, 2% quantile, control4, control3, control2, control1 and blanks.

Figure 5.4: Features Aligned: control1, control2, control3, control4. Method of Estimation: Optimization using Quasi Newton. Data Set used: Set A. Kruskal P-values for Controls 1 to 4: 0.0988, 0.1446, 0.3684, 0.631.

Figure 5.5: Features Aligned: control1, control2, control3, control4. Method of Estimation: Optimization using Quasi Newton. Data Set used: Set B. Kruskal P-values for Controls 1 to 4: 8e-04, 0.1413, 0.291, 0.5035.

Figure 5.6: Features Aligned: Trim 0.025/0.04 and Quantiles 85/95%. Method of Estimation: Optimization using Quasi Newton. Data Set used: Set A. Kruskal P-values for Controls 1 to 4: 0.0038, 0, 0.1086, 0.0476.

Figure 5.7: Features Aligned: Trim 0.025/0.04 and Quantiles 85/95%. Method of Estimation: Optimization using Quasi Newton. Data Set used: Set B. Kruskal P-values for Controls 1 to 4: 0, 0, 0.0231, 0.0198.

Figure 5.8: Features Aligned: Trim 0.025/0.04 and Quantiles 85/95%. Method of Estimation: Linear splines with knots at the features. Data Set used: Set A. Kruskal P-values for Controls 1 to 4: 0.003, 0, 0.0891, 0.0609.

Chapter 6 Conclusion and Future Work

There
are many levels at which experimental artefacts distort high throughput phenotypic experiments: location within a plate, plate-wise within a plate-set and across conditions. In this report, we have concentrated on the plate-wise variety. Current methods used for pre-processing were presented and their issues discussed. A class of alternative pre-processing methods which allow for all transformation parameters to be based on biological assumptions was proposed. The biological assumptions required here are the alignment of features across the plate-set. The alignment of two features leads to an exact solution while, otherwise, two routes may be taken. A simple extension to the linear distortion model which increases the number of transformation parameters to allow for linear splines is generally favored over minimization of feature variance across the set. The former has computational advantages as well as the capability to adapt to slight departures from linearity. However, in order to use linear splines, the features must retain the same phenotype order on each plate. Furthermore, in the event that a large number of features are selected for alignment, say five or more, using linear splines may lead to over-fitting. In such situations, optimization should be employed. We have also discussed the choice of features to align. Two classes were defined: external and internal. While external features should align across the plate-set by design, low replication may lead to limited pre-processing effectiveness, as was suggested in section 5.3.2. Internal features, on the other hand, are based on all the data from each plate, which saves them from this shortcoming. Often, in practice, mutants are non-randomly allocated to plates, which in turn may have a heavy influence on various internal features such as the plate's median phenotype.
We have presented the upper almost-tail quantiles and the mean left tail trimmings as internal features which are approximately equal across the plate-set under randomized allocation and robust to modest departures from randomization. These are always available, regardless of the experimental design, and simple to obtain. There is still room for improvement. It would be desirable to use a model which combines the distortion model with an error model, as was done by Huber et al. in the microarray context, as a basis for a pre-processing transformation. Consequently, any mean-variance relationship resulting from the plate effect interacting with the random errors would be removed using this strategy. Pre-processing across conditions can be achieved using existing methodology developed by the microarray community, such as quantile normalization. However, pre-processing with regard to the plate location effect is presently done very crudely. Improvement here is the most pressing matter.

Bibliography

[1] A. H. Tong, G. Lesage, et al. Global mapping of the yeast genetic interaction network. Science, 303(5659):808-13, 2004.
[2] A. E. Carpenter and D. M. Sabatini. Systematic genome-wide screens of gene function. Nature Reviews, 5:11-22, 2004.
[3] S. Armknecht and M. Boutros et al. High-throughput RNA interference screens in Drosophila tissue culture cells. Methods Enzymol, 392:55-73, 2005.
[4] B. L. Drees and V. Thorsson et al. Derivation of genetic interaction networks from quantitative phenotype data. Genome Biol, 6(4):R38, 2005.
[5] B. M. Bolstad et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185-193, 2003.
[6] Robert G. Eason et al. Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains. Proc Natl Acad Sci USA, 101(30):11046-11051, 2004.
[7] William H. Press et al.
Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 40 West 20th St., New York, NY 10011-4211, USA, 1988.
[8] Wolfgang Huber et al. Parameter estimation for the calibration and variance stabilization of microarray data. Statistical Applications in Genetics and Molecular Biology, 2(1):1-22, 2003.
[9] G. Giaever and A. M. Chu et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature, 418(6896):387-91, 2002.
[10] David M. Rocke and Blythe Durbin. A model for measurement error for gene expression analysis. Journal of Computational Biology, 8:557-569, 2001.
[11] C. Sachse and E. Krausz et al. High-throughput RNA interference strategies for target discovery and validation by using synthetic short interfering RNAs: functional genomics investigations of biological pathways. Methods Enzymol, 392:242-77, 2005.
[12] E. A. Winzeler and D. D. Shoemaker et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science, 285(5429):901-6, 1999.
[13] Yang Y. H., Dudoit S., Luu P., Speed T. P. Normalization for cDNA microarray data. Technical report, University of California, Berkeley, 2001.

Appendix A Derivation of Gradient

The derivative of a sum is the sum of the derivatives, so we begin with a single term from the sum. Let qbar_j = (1/P) sum_{p=1}^{P} f_p(q_jp), where f_p(q_jp) = (q_jp - a_p) / b_p. Then,

d/da_k Var_p((q_jp - a_p) / b_p)
= d/da_k [ (1/(P-1)) sum_{p=1}^{P} (f_p(q_jp) - qbar_j)^2 ]
= (1/(P-1)) sum_{p=1}^{P} 2 (f_p(q_jp) - qbar_j) d/da_k (f_p(q_jp) - qbar_j)
= (2/(P-1)) [ (1/(P b_k)) sum_{p=1}^{P} (f_p(q_jp) - qbar_j) - (1/b_k) (f_k(q_jk) - qbar_j) ]
= -(2 / (b_k (P-1))) (f_k(q_jk) - qbar_j),

where the last step uses the fact that the deviations f_p(q_jp) - qbar_j sum to zero over p. Therefore, by adding these terms up we obtain

d/da_k g(a, b) = -(2 / (b_k (P-1))) sum_j (f_k(q_jk) - qbar_j).

The process is very similar for partial derivatives of the scale parameters.

d/db_k Var_p((q_jp - a_p) / b_p)
= (1/(P-1)) sum_{p=1}^{P} 2 (f_p(q_jp) - qbar_j) d/db_k (f_p(q_jp) - qbar_j)
= (2/(P-1)) [ ((q_jk - a_k) / (P b_k^2)) sum_{p=1}^{P} (f_p(q_jp) - qbar_j) - ((q_jk - a_k) / b_k^2) (f_k(q_jk) - qbar_j) ]
= -(2 / (b_k^2 (P-1))) (f_k(q_jk) - qbar_j) (q_jk - a_k),

which yields the equation

d/db_k g(a, b) = -(2 / (b_k^2 (P-1))) sum_j (f_k(q_jk) - qbar_j) (q_jk - a_k).

Appendix B The Edge Effect

Colonies near the edge produce systematically higher phenotypes. This effect is dubbed the edge effect. There are two explanations for this. Firstly, these colonies have fewer neighbours and hence less competition for nutrients (distributed in the agar), allowing them to grow faster. Secondly, the agar plates are slightly curved at the edges, which bends the light from the scanner and results in brighter recorded intensities (i.e. stronger phenotypes are reported). Figure 2.4 shows an image of a particular agar plate obtained via scanner (left). The larger size of colonies near the edges, indicating a higher growth rate, is immediately observable, as is the relative location of the technical replicates. These are allocated in two by two squares directly next to each other. Similarly, the controls are contained in a two by two square on the edge. Thus they are contained within the first four rows from the edge. As previously mentioned, the location effect is outside the realm of this paper; nevertheless it cannot be entirely ignored here. The interim solution is to remove the data from the edges (first and last columns and rows). Figure B.1 shows the distribution of phenotypes within columns of a single plate before and after this manipulation. It is quite evident that while there is an improvement, the effect is not entirely removed and that it spans over the first few rows/columns. Fortunately for the controls, their location is the same from plate to plate. It is expected that the edge effect will cancel out, but it may be argued that the standard error will be increased because the median value is taken over two observations rather than four.
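The interim edge-removal step can be sketched as below. This is an illustrative reading of the procedure, not the code used for the thesis: collapse_replicates is a hypothetical name, the plate is assumed to be a rectangular grid with even dimensions (for example 32 by 48 for 1536 colonies) in which each mutant occupies a two by two block, and the outermost rows and columns are simply ignored before taking the median of what remains in each block.

```python
# Sketch: drop edge cells, then collapse each 2 x 2 block to its median.
from statistics import median


def collapse_replicates(plate, drop_edge=True):
    # plate: list of rows of phenotypes with an even number of rows and
    # columns; each mutant occupies one 2 x 2 block of cells.
    n_rows, n_cols = len(plate), len(plate[0])
    result = []
    for r in range(0, n_rows, 2):
        out_row = []
        for c in range(0, n_cols, 2):
            # Keep a cell unless it lies on the outermost row or column.
            cells = [
                plate[i][j]
                for i in (r, r + 1)
                for j in (c, c + 1)
                if not drop_edge
                or (0 < i < n_rows - 1 and 0 < j < n_cols - 1)
            ]
            out_row.append(median(cells) if cells else None)
        result.append(out_row)
    return result
```

Under these assumptions a block on an edge keeps two of its four replicates and a block on a corner keeps only one, which is consistent with the remarks that controls three and four have two replicates and that the blanks on the corners have none to spare.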
[Figure B.1 plots: Before Edge Removal; After Edge Removal; x-axes: Column]
Figure B.1: Evidently, the observations on the first and last columns are those most affected by the location effect, but it is also evident that the edge effect is not restricted to these two columns. The data is eventually collapsed to the median of the four technical replicates, thus it is equivalent to replace the values on the edge with the values from the neighbouring row or column, which explains why the first two and last two columns are equally distributed in the second plot. "@en . "Thesis/Dissertation"@en . "10.14288/1.0100664"@en . "eng"@en . "Statistics"@en . "Vancouver : University of British Columbia Library"@en . "University of British Columbia"@en . "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en . "Graduate"@en . "Pre-processing of quantitative phenotypes from high throughput studies"@en . "Text"@en . "http://hdl.handle.net/2429/31404"@en .