UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Pre-processing of quantitative phenotypes from high throughput studies Kanters, Steve 2006

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
831-ubc_2007-0138a.pdf [ 7.02MB ]
Metadata
JSON: 831-1.0100664.json
JSON-LD: 831-1.0100664-ld.json
RDF/XML (Pretty): 831-1.0100664-rdf.xml
RDF/JSON: 831-1.0100664-rdf.json
Turtle: 831-1.0100664-turtle.txt
N-Triples: 831-1.0100664-rdf-ntriples.txt
Original Record: 831-1.0100664-source.json
Full Text
831-1.0100664-fulltext.txt
Citation
831-1.0100664.ris

Full Text

Pre-Processing of Quantitative Phenotypes from High Throughput Studies by . STEVE RANTERS B.Sc, McGill University, 2004  A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR T H E DEGREE OF M A S T E R OF SCIENCE in T H E FACULTY OF GRADUATE STUDIES (Statistics)  The University of British Columbia December 2006 © Steve Kanters, 2006  Abstract H i g h throughput phenotypic experiments include both deletion sets and R N A i experiments. T h e y are genome wide and require much physical space.  A s a result, multiple plates are  often required i n order to cover the whole genome. T h e use of multiple plates leads to systematic plate-wise experimental artefact, which impede statistical inference. In this paper, current pre-processing methodology will be reviewed. Their fundamental principle is to align a common feature shared by all plates. From this very principle, we propose an improved method which simultaneously estimates all parameters required for the pre-processing transformation. Some of the alignment features popular today implicitly assume conditions which are often not met in practice. We discuss the various choices of features to align. Specifically, the upper quantiles and the mean of the left tail trimmings of each plate's data distribution are features which are always available and simple to obtain. Moreover, they are robust to non-randomization of genes to plates. Their use will be motivated through simulation and applied to real data. Applications to real data will be used to demonstrate superiority over current methods as well as to discuss choices in transformation types.  ii  Contents Abstract  ii  Contents  iii  List of Figures  v  Acknowledgements  vii  1  Introduction  1  2  H T P Experiments  3  3  F o u n d a t i o n s for H T P D a t a P r e - P r o c e s s i n g  12  3.1  Phenotype D i s t r i b u t i o n  14  3.2  Distortion M o d e l  15  3.3  Current Approaches to D a t a Pre-Processing  16  3.3.1  H T P Experiments  16  3.3.2  Microarrays  4  • • •  Recommended Pre-Processing M e t h o d 4.1  4.2  Features  20 27  '•  28  4.1.1  E x t e r n a l Features  28  4.1.2  Internal Features  28  Proposed Pre-processing M e t h o d  32  iii  5  E x a c t alignment of 2 features under the linear model  .  33  4.2.2  Alignment of features through optimization  34  4.2.3  A piecewise distortion model .  36  Application to Real Data  44  5.1  Data  44  5.2  Recommended M e t h o d versus M e d i a n Centering  45  5.3  Choice of Feature  47  5.3.1  Complementing Chapter 4's Simulation  47  5.3.2  Internal and E x t e r n a l Features i n a Case Study  48  5.4 6  4.2.1  O p t i m i z a t i o n versus Linear Splines  50  Conclusion and Future Work  60  Bibliography  62  Appendices  64  A  Derivation of Gradient  64  B  T h e Edge Effect  66  iv  List of Figures 2.1  Gene Deletion i n Yeast  8  2.2  R N A Interference  9  2.3  Experimental Formats  10  2.4  Image Quantification  11  3.1  Comparison of H T P and Microarray experiments  3.2  R a w D a t a Distributions  24  3.3  Non-Random Mutant Allocation  25  3.4  Quantile Normalization  4.1  Simulation Population  4.2  Example of Simulated Plate-Sets  40  4.3  Simulation Results  41  4.4  M e a n Left T a i l Trimmings  42  4.5  Linear Splines  43  5.1  Comparing Methods  52  . . .  23  i.  .  26  - 39  v  5.2  Complimentary Diagnostic P l o t  53  5.3  Internal Feature Behaviour Induced by E x t e r n a l Feature Alignemnt  54  5.4  A p p l i e d Pre-processing 1  55  5.5  A p p l i e d Pre-processing 2  56  5.6  A p p l i e d Pre-processing 3  57  5.7  A p p l i e d Pre-processing 4  58  5.8  A p p l i e d Pre-processing 5  59  B.l  Edge Effect  .  '.  vi  68  Acknowledgements First and foremost, I would like to thank Jenny B r y a n for her guidance, support and patience. Hopefully, my emotional roller coaster throughout this program d i d not rub off on others.  I would also like to thank E l i z a b e t h Conibear from the Center for Molecular  Medicine and Therapeutics for giving me the opportunity to work on such an interesting project. M y experience as a graduate student i n the Department of Statistics at U B C has by far exceeded my expectations. I feel like a statistician - a scientist - something I d i d not feel after my undergraduate studies. I would like thank my parents, R o n and Renee, for their life long love and support. T h e y have given me the proper tools to lead the life I choose. D a d , you've shown me that life is an adventure. One which too many people take for granted. M a m a n , t u m ' a donner un outil essentielle a l a vie: l a communication. M a n y students in the department helped me along the way. T h a n k you J u s t i n for helping improve my computing skills and for entertaining coffee breaks.  T h a n k s Jeff for  helping me w i t h my mathematics when I needed it and speaking French. M i k e M a r i n , you've become more than a fellow student - you're a true friend. I enjoyed the many conversations we've had which I think were beneficial to both of us. Finally thanks to R o b i n Steenweg and Jon Morrison who b o t h continue to inspire me greatly i n the game we call life.  STEVE RANTERS The  University  December  of British  Columbia  2006  vii  Chapter 1 Introduction G e n e t i c s has g r e a t l y d e v e l o p e d i n t h e l a s t c e n t u r y a n d a half.  Traditionally,  experiments  s t a r t e d w i t h a p h e n o t y p e a n d t r i e d t o i d e n t i f y t h e m u t a t i o n w h i c h l e a d t o i t . T h i s is k n o w n as f o r w a r d genetics. T o d a y , select o r g a n i s m s have m o s t o f t h e i r g e n o m e m a p p e d .  Accordingly,  one c a n cause or observe changes i n g e n e t i c sequence a n d w i t n e s s t h e i n d u c e d c h a n g e i n phenotype.  T h i s p r o c e d u r e , i n the o p p o s i t e d i r e c t i o n o f c l a s s i c a l g e n e t i c s , is c a l l e d reverse  genetics. B y p e r t u r b i n g a specific gene a n d e v a l u a t i n g i t s effects o n a p h e n o t y p e o f i n t e r e s t , we c a n b e t t e r u n d e r s t a n d each gene's f u n c t i o n a l i t y .  Such experiments carried out on a  g e n o m e - w i d e b a s i s are c a l l e d 'high t h r o u g h p u t p h e n o t y p i c ( H T P ) e x p e r i m e n t s .  Choosing  a specific p h e n o t y p e m a y l e a d t o t h e p r o p e r i d e n t i f i c a t i o n of genes c o m p o s i n g a p a t h w a y of i n t e r e s t .  F u r t h e r m o r e , t h e i n t r o d u c t i o n of d o u b l e d e l e t i o n sets i n H T P e x p e r i m e n t s has  a l l o w e d s c i e n t i s t s t o i d e n t i f y k e y s y n t h e t i c i n t e r a c t i o n s [1].  Unlike other  genomic-based  a p p r o a c h e s s u c h as m i c r o a r r a y s a n d p r o t e o m i c s , gene p e r t u r b a t i o n p r o v i d e s a d i r e c t l i n k f r o m gene t o f u n c t i o n a n d . t o d a t e , offers t h e b e s t t o o l for r e a l i z i n g t h e full p o t e n t i a l o f t h e g e n o m e p r o j e c t [3].  1  In all high throughput genome-wide experiments, data are very much affected by the biological phenomenon being investigated and by the experimental processes involved. Thus, analyzing the raw data to draw conclusions on the underlying biological mechanisms may be extremely misleading. D a t a pre-processing is the process of removing as much experimental artefact as possible while affecting the biological information as little as possible. In microarrays it is a topic which is well documented i n the literature [13],[5],[8]. There are key differences between H T P and microarray experiments.  Consequently,  normalization methods developed for microarrays are of limited usefulness here. These experiments are young and relatively unknown to the statistical community. T h e goal of this report is to develop a pre-processing method for H T P experiments which properly reflects the biology involved and efficiently alleviates the experimental artefacts. Five sections will be used to present and justify our proposed normalization approach. T h e first will describe H T P experiments.  T h e second w i l l discuss current approaches for  normalizing these and microarray experiments. T h e ensuing section w i l l present our proposed method which will be followed by applications to real data. Finally, the fifth section w i l l be composed of concluding remarks.  2  Chapter 2 H T P Experiments Proper discussion of high throughput phenotypic data pre-processing requires a better understanding of the experiments.  In this section, key aspects of these experiments w i l l be  described and, i n the process, essential vocabulary w i l l be developed. Individual H T P experiments differ in many respects. T h e primary distinctions are the organism of study, the gene perturbing strategy, the experimental format and the phenotype of interest.  Organisms studied in this manner are S. cerevisiae, Drosophila  melanogaster,  C. elegans and various mammalian cell cultures (most commonly mice and human). There are numerous reasons for studying these particular organisms. For example, a significant proportion of genes i n S. cerevisiae are homologous to human genes implicated i n human disease [12].  Moreover, C. elegans is the simplest organism to have biological systems,  such as a nervous system and digestive system, akin to those found in humans [11]. Most importantly, though, these are the organisms for which a substantial portion of the genome is known. There are two classes of gene perturbation: gene deletion and gene inhibition. Gene  3  d e l e t i o n is p e r m a n e n t .  T h e gene is r e m o v e d f r o m s t a r t c o d o n t o s t o p c o d o n a n d u s u a l l y  r e p l a c e d w i t h a cassette.  T h e c a s s e t t e c o n t a i n s m a n y r e g i o n s w h i c h it shares w i t h  the  t a r g e t e d gene b o t h u p s t r e a m a n d d o w n s t r e a m t o a l l o w p r o p e r i n s e r t i o n . It u s u a l l y c o n t a i n s a k a n a m y c i n r e s i s t a n c e gene a n d u n i q u e m o l e c u l a r b a r - c o d e s .  T h e l a t t e r w a s d o n e for t h e  g e n o m e - w i d e S. cerevisiae d e l e t i o n set [9]. F i g u r e 2.1 i l l u s t r a t e s t h i s process. G e n e i n h i b i t i o n is t e m p o r a r y .  T h i s is a c c o m p l i s h e d t h r o u g h R N A interference  (RNAi).  Interfering R N A  m o l e c u l e s w i l l b i n d t o t h e m R N A e x p r e s s e d b y a t a r g e t e d gene a n d n e u t r a l i z e i t ,  hence  s i l e n c i n g t h e gene as d e p i c t e d i n figure 2.2. O n c e t h e gene i n h i b i t i n g s u b s t a n c e is e x h a u s t e d , t h e c e l l b e h a v i o u r r e t u r n s t o n o r m a l . R N A i is n o t y e t p e r f e c t e d .  P r e s e n t l y , t h e r e are s t i l l  issues w i t h efficiency a n d s p e c i f i c i t y . T h e t a r g e t e d gene is n o t a l w a y s c o m p l e t e l y s i l e n c e d a n d o t h e r genes m a y b e s i l e n c e d s i m u l t a n e o u s l y .  If genes were t e l e v i s i o n s , d e l e t i o n w o u l d  b e t h e m u t e o p t i o n a n d R N A interference w o u l d b e t u r n i n g d o w n t h e v o l u m e . H e n c e , gene d e l e t i o n is m o r e successful t h a n i n h i b i t i o n . F o r c o n v e n i e n c e , t h r o u g h o u t t h i s r e p o r t , we w i l l often use  mutant t o d e s c r i b e a s p e c i m e n w i t h one or m o r e d e l e t e d or i n h i b i t e d genes e v e n  t h o u g h t e c h n i c a l l y t h e l a t t e r is n o t a m u t a n t . as  It is also c o m m o n t o refer t o gene d e l e t i o n s  knock-outs a n d i n h i b i t e d genes as knock-downs. T h e s e gene p e r t u r b i n g s t r a t e g i e s are e m p l o y e d o n a g e n o m e - w i d e scale. D e a l i n g w i t h  g e n o m e - w i d e c o l l e c t i o n s of m u t a n t s r e q u i r e s m u c h p h y s i c a l space. experimental  T h e r e are t h r e e m a i n  formats: m u l t i - p l a t e , l i v i n g c e l l m i c r o a r r a y s a n d p o o l e d cells.  F i g u r e 2.3  d e p i c t s t h e t h r e e f o r m a t s . T h e m u l t i - p l a t e f o r m a t is u s e d for a l l t y p e s of gene p e r t u r b a t i o n s a n d o r g a n i s m s . F o r R N A i e x p e r i m e n t s , m i c r o t i t r e p l a t e s are m o s t c o m m o n l y u s e d . H e n c e i n t h e R N A i c o m m u n i t y , t h e m u l t i - p l a t e f o r m a t is often referred t o as t h e m i c r o t i t r e or m u l t i w e l l p l a t e f o r m a t [2]. F o r d e l e t i o n sets, o t h e r t y p e s of p l a t e s are also u s e d .  For example,  yeast d e l e t i o n sets are often g r o w n o n a g a r p l a t e s . A g a r p l a t e s are sterile P e t r i dishes t h a t contain agar a n d nutrients.  T h e y l e n d t h e m s e l v e s p a r t i c u l a r l y w e l l t o yeast since it is a  4  great m e d i u m for g r o w t h , t h e m o s t c o m m o n p h e n o t y p e for d e l e t i o n sets, a n d a l l o w o t h e r i n t e r e s t i n g p h e n o t y p e s t o b e m e a s u r e d , s u c h as i n v a s i v e n e s s [4]. R a r e l y are p l a t e s c a p a b l e o f h o s t i n g m o r e t h a n 1536 i n d i v i d u a l c e l l c o l o n i e s o r c u l t u r e s . G e n o m e - w i d e e x p e r i m e n t s d e a l w i t h thousands of m u t a n t s thus r e q u i r i n g a collection of plates. A  plate-set  is t h e c o l l e c t i o n  of p l a t e s u s e d t o p e r f o r m one e x p e r i m e n t a l r e p l i c a t e for a l l m u t a n t s . It is t h i s f o r m a t w h i c h is o f i n t e r e s t t o t h i s r e p o r t . W i t h i n t h e m u l t i - p l a t e f o r m a t t h e r e are m a n y e x p e r i m e n t a l set-ups.  Technical repli-  cates o f a m u t a n t are c o l o n i e s o r w e l l s w i t h t h e s a m e gene p e r t u r b a t i o n w i t h i n a p l a t e - s e t . In g e n e r a l , i f t h e r e are t e c h n i c a l r e p l i c a t e s , t h e y are c o n t a i n e d o n t h e s a m e p l a t e . C o n s e q u e n t l y , t h e m u t a n t s f o u n d o n t w o s e p a r a t e p l a t e s w i t h i n a plate-set are ( a l m o s t ) c o m p l e t e l y different.  T h i s has i m p o r t a n t c o n s e q u e n c e s for d a t a p r e - p r o c e s s i n g . F o r t u n a t e l y , m o s t , i f  n o t a l l , e x p e r i m e n t s have c o m m o n e l e m e n t s f o u n d o n m u l t i p l e p l a t e s . T h e s e are g e n e r a l l y referred t o as  controls.  I n essence, a c o n t r o l is a n y t h i n g w h i c h is s h a r e d o n t w o o r m o r e  p l a t e s i n t h e set a n d s h o u l d h a v e a s i m i l a r p h e n o t y p e o n e a c h p l a t e . W i l d t y p e cells p l a c e d o n each p l a t e are a t y p i c a l c o n t r o l . W i l d t y p e s h a v e n o t b e e n a l t e r e d i n a n y way, so t h e y are p r o v i d i n g i n f o r m a t i o n i n t h e f o r m of a reference p h e n o t y p e . T h i s has t h e c o n v e n i e n t feature of a l l o w i n g t h e i d e n t i f i c a t i o n of gene p e r t u r b a t i o n s w h i c h r e s u l t i n a n e n h a n c e d , d i m i n i s h e d or u n c h a n g e d p h e n o t y p e . A n o t h e r c o m m o n c o n t r o l is t o h a v e m u t a n t s w i t h k n o w n p h e n o t y p i c b e h a v i o u r s p a n n i n g over t h e r a n g e of p h e n o t y p e s across p l a t e s . F o r e x a m p l e , s u p p o s e a g r o w t h e x p e r i m e n t o n a yeast d e l e t i o n set is c o n d u c t e d . A p l a u s i b l e c h o i c e o f c o n t r o l s w o u l d b e t o have a s l o w g r o w i n g yeast, a n average g r o w i n g yeast a n d a fast g r o w i n g yeast o n e a c h p l a t e . It is also c o m m o n for  blanks t o  b e present o n each p l a t e , b u t d e s p i t e t h e i r c o n s i s t e n t  b e h a v i o u r t h e y are different f r o m c o n t r o l s . T h e y d o n ' t p r o v i d e as m u c h i n f o r m a t i o n a b o u t t h e p l a t e effect as b i o l o g i c a l c o n t r o l s d o . W e w i l l e x p a n d o n t h e t o p i c o f t h e p l a t e effects i n t h e f o l l o w i n g s e c t i o n s w h e r e t h i s w i l l b e c o m e e v i d e n t . B l a n k s are n o n e t h e l e s s u s e f u l for  5  q u a l i t y c o n t r o l at t h e p r e - p r o c e s s i n g stage. O n c e t h e plate-set has b e e n p r e p a r e d , p h e n o t y p i n g m a y b e g i n . T h e m u t a n t s are set i n a fixed e n v i r o n m e n t for a fixed p e r i o d of t i m e . a n e n z y m e or d r u g , m a y b e i n t r o d u c e d . conditions.  A specific stress or t r e a t m e n t , s u c h as  E s s e n t i a l l y , t h e m u t a n t s are e x p o s e d t o specific  T h e p h e n o t y p e o f interest is often g r o w t h , e v e n w h e n e x p o s e d t o t r e a t m e n t .  F u r t h e r m o r e , t h e s t u d y of one p h e n o t y p e does n o t p r e c l u d e t h e s t u d y of o t h e r s , e v e n w i t h t h e s a m e plate-set. I n p a r t i c u l a r , i n d e l e t i o n sets, g r o w t h assays are often t h e first p h e n o t y p e s t u d i e d b e c a u s e o t h e r p h e n o t y p e s m a y h a v e t o b e a d j u s t e d for t h e o b s e r v e d g r o w t h . T h e f i n a l e x p e r i m e n t a l stage is p h e n o t y p e q u a n t i f i c a t i o n . m e a s u r e of t h e p h e n o t y p e .  O f interest is t h e  exact  F o r e x a m p l e , i n t h e case of a g r o w t h assay, t h e i n f o r m a t i o n  of interest is t h e n u m b e r of cells r e s u l t i n g f r o m t h e g r o w t h p e r i o d .  These  experiments  are o n a g e n o m e - w i d e l e v e l , so i t is i m p r a c t i c a l for p h e n o t y p e s t o b e m e a s u r e d e x a c t l y . A s a compromise, technologies have been developed w h i c h estimate the phenotype.  For  s i m p l e assays, t h e p l a t e s m a y b e s c a n n e d t o c a p t u r e a n i m a g e a n d t h e r e s u l t i n g s p o t size or b r i g h t n e s s for e a c h m u t a n t is m e a s u r e d .  For more c o m p l e x phenotypes cell engineering  m a y b e r e q u i r e d t o ease t h e t a s k of q u a n t i f i c a t i o n . F l u o r e s c e n t or l u m i n e s c e n t m a r k e r s m a y be i n s e r t e d t o a l l o w p l a t e r e a d e r s t o q u a n t i f y p h e n o t y p e s . T h e s e are u s u a l l y t h e p r o d u c t of  reporter genes i n s e r t e d i n t h e c e l l . T h e r a w i m a g e r e s u l t i n g f r o m t h e s c a n is t h e n p r o c e s s e d u s i n g i m a g e q u a n t i f i c a t i o n software. F i g u r e 2.4 s h o w s a r a w i m a g e of a single agar p l a t e f r o m a g r o w t h assay u s i n g a yeast d e l e t i o n set a l o n g w i t h its p r o c e s s e d i m a g e p r o d u c e d b y s u c h software. I n t h e e n d . a n u m e r i c q u a n t i t y is a s s i g n e d t o each m u t a n t w h i c h h o p e f u l l y reflects the phenotype. S t a t i s t i c i a n s ' m o t t o is t o never t h r o w a w a y d a t a . S o m e e x p e r i m e n t e r s choose t o e v a l u a t e t h e r e s u l t s v i s u a l l y as h e a l t h y / n o t h e a l t h y .  6  S u c h a n a p p r o a c h r e m o v e s t h e n e e d for  normalization, yet simplifying potentially continuous data to a categorical or binary form is equivalent to throwing away a portion of the data. Furthermore, it introduces issues of replicability and increased error. Firstly, different people w i l l categorize differently. In fact, due to the subjective nature of these measurements, the same person may categorize differently on two separate occasions. Secondly, the extra data manipulation required here may lead to human error. Fortunately w i t h the advent of image quantification technologies, this strategy is becoming less common. Phenotype quantification gives access to a l l information provided by the experiment and gives rise to a need for data pre-processing.  7  UPTAG PRIMER 74MER ...AUG YEAST HOMOLOGY  KANMX  YEAST HOMOLOGY  E  DN <3DNTAG PRIMER 74MER D2  PCR ROUND 1  IP  D1  GENERATE DELETION CASSETTE  D2  U2  DN D1  C PCR ROUND 2  EXTEND YEAST HOMOLOGY  HOMOLOGY  INSERT DELETION CASSETTE INTO YEAST BY HOMOLOGOUS RECOMBINATION GENOMIC DNA  GENOMIC DNA  YEAST ORF  I Figure 2.1: The  following  UP  DN  TAG  TAG  i m a g e was o b t a i n e d f r o m R o b e r t G . E a s o n ' s a r t i c l e o n t h e c h a r a c -  t e r i z a t i o n of s y n t h e t i c D N A b a r codes [6], c o p y r i g h t (2004). It d e p i c t s b o t h t h e c o n s t r u c t i o n of t h e c a s s e t t e a n d i t s i n s e r t i o n i n t o t h e l o c a t i o n of t h e d e l e t e d gene. T h e u n i q u e i d e n t i f y i n g b a r c o d e s are e s s e n t i a l for h y b r i d i z i n g p u r p o s e s .  cS  DICER dsRNA  ;  Digestion siRNA  Unwinding RISC  Efficiency Binding Specificity Degradation V V  VV  1  mRNA  Figure 2.2: The above figure was obtained from: Henschel A . , Buchholz F . and Habermann B . , D E Q O R : A Web-Based Tool for the design and Q u a l i t y Control of siRNAs.Nucleic Acids Research, 32:W113-W120,2004, by permission of Oxford University Press. It depicts the multiple stages involved in gene inhibition v i a R N A i .  9  e Pootedoete  Soouonco or tVxitfzo to oaono c-*>  F i g u r e 2.3: E x p e r i m e n t a l F o r m a t s a:  M u l t i - w e l l p l a t e s c o m e i n v a r i o u s sizes ( t y p i c a l l y  96, 384 a n d 1536 w e l l s ) , t h u s t h e c o m p l e t e e x p e r i m e n t w o u l d c o m p r i s e of m a n y s u c h p l a t e s , b:  L i v i n g c e l l m i c r o a r r a y s are l a r g e a r r a y s w i t h t h o u s a n d s of s p o t s p r i n t e d o n t h e m .  An  i n h i b i t i n g s o l u t i o n t a r g e t i n g a p r e c i s e gene is p l a c e d at e a c h s p o t . C e l l s are t h e n d i s t r i b u t e d across t h e e n t i r e array.  T h i s f o r m a t does n o t l e n d i t s e l f w e l l t o w h o l e o r g a n i s m s s u c h as  the w o r m due to space constraints.  I n c o n t r a s t t o p o o l e d cells, l i v i n g c e l l m i c r o a r r a y s are  c o n f i n e d t o gene i n h i b i t i o n b e c a u s e t h e y r e l y o n i n h i b i t i n g s o l u t i o n s , c: P o o l e d cells b e g i n w i t h an equal share of each m u t a n t .  A f t e r a set p e r i o d o f t i m e o r e x p o s u r e t o s o m e stress,  t h e c o l l e c t i v e D N A is a n a l y z e d u s i n g m i c r o a r r a y s . tifiable.  T h i s requires each m u t a n t to be iden-  M o s t c o m m o n l y , i d e n t i f i c a t i o n is o b t a i n e d t h r o u g h b a r codes.  is m o s t l y c o n f i n e d t o d e l e t i o n m u t a n t s .  Thus, this format  A d a p t e d by permission from M a c m i l l a n Publishers  L t d : N a t u r e R e v i e w s G e n e t i c s , c o p y r i g h t (2004) [2].  10  F i g u r e 2.4: T h e i m a g e o n t h e left is a s c a n of a n a c t u a l p l a t e u s e d i n a g r o w t h assay. T h e i m a g e o n t h e r i g h t is a r e s u l t i n g q u a n t i f i e d i m a g e o b t a i n e d u s i n g t h e G r i d G r i n d e r software. A l o n g w i t h t h i s i m a g e c o m e s a n u m e r i c a l v a l u e for e a c h i n d i v i d u a l c o l o n y . T h e s e i m a g e s are p r o v i d e d c o u r t e s y o f the C o n i b e a r L a b .  ii  Chapter 3 Foundations for H T P Data Pre-Processing Multi-plate high throughput phenotypic experiments always involve multiple steps. evitably, the handling of each plate will be different to some degree.  In-  T h e discrepancies  may come from various sources, such as the temperature of treatment or the distribution of nutrients in the agar. Undoubtedly, the experimental processes will affect the data i n a plate-wise manner.  T h e experimental artefacts introduced to the data in this fashion are  referred to as the plate effect. H T P experiments usually involve multiple conditions just as in microarrays. Plate-wise experimental artefacts exist b o t h w i t h i n and across plate-sets. O n one hand, plate-set pre-processing - the removal of experimental artefacts across plates w i t h i n a plate-set - presents a novel problem.  O n the other hand, pre-processing plate-  sets across experimental conditions may be achieved using existing microarray methodology. Throughout the remainder of this paper, plate effect will pertain to the plate-wise distortion within a plate-set only.  12  D a t a a r i s i n g f r o m m u l t i - p l a t e H T P e x p e r i m e n t s are u s e d i n a f u n d a m e n t a l l y different m a n n e r t h a n m i c r o a r r a y d a t a . I n m i c r o a r r a y e x p e r i m e n t s , c o m p a r i s o n s of a gene across different c o n d i t i o n s are d i r e c t a n d i n t e r - g e n e c o m p a r i s o n s are i n d i r e c t . C o m p a r i s o n s b e t w e e n genes are d o n e b a s e d o n t h e i r b e h a v i o u r across c o n d i t i o n s . F o r e x a m p l e , genes m a y b e c l u s t e r e d a c c o r d i n g t o t h e i r u p - r e g u l a t i o n a n d d o w n - r e g u l a t i o n across c o n d i t i o n s . P r e - p r o c e s s i n g m e t h o d s i n m i c r o a r r a y s are t y p i c a l l y c o n c e r n e d w i t h r e m o v i n g e x p e r i m e n t a l a r t e f a c t s w i t h r e g a r d s t o a gene across a l l s a m p l e s .  D o i n g so a l l o w s for b o t h t y p e s of c o m p a r i s o n s t o b e  m e a n i n g f u l . H T P e x p e r i m e n t s , o n t h e o t h e r h a n d , are v e r y i n t e r e s t e d i n d i r e c t i n t e r - m u t a n t c o m p a r i s o n s w i t h i n plate-sets a n d across c o n d i t i o n s . T h e s e differences are d e p i c t e d i n F i g ure 3.1.  W e can hope to make meaningful comparisons of m u t a n t s w i t h i n a plate-set under  t h e presence of r a n d o m b i o l o g i c a l a n d / o r t e c h n i c a l v a r i a b i l i t y , b u t t h e y w i l l b e c o m p r o m i s e d b y t h e presence of s y s t e m a t i c , p l a t e - w i s e e x p e r i m e n t a l artefacts.  T h e c o m p l e t e r e m o v a l of  t h e p l a t e effect is a n i d e a l w h i c h is ( a l m o s t ) n e v e r a c h i e v e d i n p r a c t i c e .  T h e goal of pre-  p r o c e s s i n g is t o m i n i m i z e t h e m a g n i t u d e o f t h e p l a t e effect. E x i s t e n c e of a p l a t e effect a n d a g e n u i n e desire t o m a k e d i r e c t c o m p a r i s o n s w i t h i n a plate-set e s t a b l i s h a n e e d t o d e v e l o p a p r e - p r o c e s s i n g m e t h o d for p l a t e - s e t s i n H T P e x p e r i m e n t w h i c h is t h e p u r p o s e of t h i s p a per.  T h e p h e n o t y p e of a p a r t i c u l a r m u t a n t m a y b e f u r t h e r affected b y i t s l o c a t i o n o n t h e  p l a t e . T h e s e e x p e r i m e n t a l a r t e f a c t s w i l l n o t b e d e a l t w i t h i n t h i s p a p e r , b u t i t is s u s p e c t e d t h a t t h e c o n c e p t u a l i n f r a s t r u c t u r e d e v e l o p e d h e r e w i l l be g e n e r a l i z a b l e t o o t h e r sources of e x p e r i m e n t a l a r t e f a c t s s u c h as l o c a t i o n . T h e r e is a s i m p l e b l u e p r i n t t o d e r i v i n g a d a t a p r e - p r o c e s s i n g m e t h o d .  F i r s t , a dis-  t o r t i o n m o d e l d e e m e d a p p r o p r i a t e for d a t a p r o d u c e d b y t h e class of e x p e r i m e n t s at h a n d is d e v e l o p e d .  S e c o n d , b i o l o g i c a l a s s u m p t i o n s p l a u s i b l e for these t y p e s of d a t a , a n d w h i c h  s i m u l t a n e o u s l y a l l o w for e s t i m a t i o n of t h e p a r a m e t e r s i n t h e p r o p o s e d d i s t o r t i o n m o d e l , are e s t a b l i s h e d . F r o m these a n a t u r a l s o l u t i o n is f o r m u l a t e d : t h e p r e - p r o c e s s i n g m e t h o d . A g o o d  13  p r e - p r o c e s s i n g m e t h o d w i l l r e m o v e as m u c h e x p e r i m e n t a l a r t e f a c t s as p o s s i b l e a n d r e m o v e as l i t t l e gene p e r t u r b a t i o n effect as p o s s i b l e . H a v i n g e s t a b l i s h e d o u r g o a l , t h e r e m a i n d e r of t h i s chapter w i l l explore current methods where the above blueprint w i l l be used extensively. In order to do this properly, certain d a t a characteristics w i l l be shown a n d a d i s t o r t i o n m o d e l w i l l be proposed.  3.1  Phenotype Distribution  P e r t u r b a t i o n of a gene, u n d e r a specific c o n d i t i o n , c a n l e a d t o four different o u t c o m e s w i t h regards to a given phenotype.  T h e phenotype may be enhanced, diminished, remain un-  c h a n g e d or b e c o m e u n d e f i n e d b e c a u s e t h e m u t a n t is n o n v i a b l e . S i n c e t h e m u t a n t c o l l e c t i o n s o n each p l a t e differ, t h e p h e n o t y p e d i s t r i b u t i o n s also differ. If m u t a n t s were r a n d o m l y a l l o c a t e d t o p l a t e s i n t h e set, t h e d i s c r e p a n c y w i t h r e g a r d s t o p h e n o t y p e d i s t r i b u t i o n o n e a c h p l a t e w o u l d s o l e l y b e a r e s u l t of s a m p l i n g a n d n o t be s u b s t a n t i a l . U n f o r t u n a t e l y , i n p r a c t i c e , t h e a l l o c a t i o n is n o t p e r f o r m e d r a n d o m l y a n d often m u c h l a r g e r p r o p o r t i o n s of w e a k m u t a n t s ( m u t a n t s w i t h gene p e r t u r b a t i o n s i n d u c i n g d i m i n i s h e d p h e n o t y p e or m u t a n t s w h i c h are n o n v i a b l e ) are a l l o c a t e d t o a few select p l a t e s w i t h i n t h e set. F o r e x a m p l e , i t is n o t u n c o m m o n for m o s t p l a t e s t o have a s m a l l p r o p o r t i o n of w e a k m u t a n t s , say a p p r o x i m a t e l y 1 0 % , a n d t h e few r e m a i n i n g p l a t e s t o h a v e a l a r g e p r o p o r t i o n of w e a k m u t a n t s , 60%.  N o n e t h e l e s s , these d i s t r i b u t i o n s d o h a v e c o m m o n features.  say a p p r o x i m a t e l y  T h e r e are a l w a y s m u t a n t s  f r o m a l l four categories (unaffected, d i m i n i s h e d , e n h a n c e d a n d l e t h a l ) o n e a c h p l a t e . T h e r e fore, t h e d i s t r i b u t i o n s of p h e n o t y p e s o n e a c h p l a t e have a p p r o x i m a t e l y t h e s a m e r a n g e .  The  m u t a n t s w h i c h i n c u r p h e n o t y p i c l e t h a l i t y ( n o n v i a b l e m u t a n t s ) a n d b l a n k s are s t r u c t u r a l zeroes. C o n s e q u e n t l y , t h e d i s t r i b u t i o n s are often b i m o d a l w i t h one m o d e c o r r e s p o n d i n g t o t h e s t r u c t u r a l zeroes. T h e d i s t r i b u t i o n m a y b e m u l t i - m o d a l w i t h each m o d e c o r r e s p o n d i n g t o a  14  category. Figure 3.3 depicts smoothed histograms of the raw observed phenotypes from two plates from the same plate-set. It shows the b i m o d a l shape of phenotype distributions and the consequences of non-random mutant allocation. T h e non-random mutant allocation to plates, approximately equal range and bimodality are important features of the data which should be acknowledged when deriving a pre-processing method.  3.2  Distortion Model  It is accepted by most scientists that the effect of gene perturbation is multiplicative i n nature.  Likewise, the effects of most experimental processes are also multiplicative.  Fig-  ure 3.2 depicts the distribution of phenotypes on separate plates w i t h i n a plate-set. None of the plates have observations close to zero where the phenotypes of nonviable mutants and blanks are expected to be. would be almost unaffected.  If the plate effect were solely multiplicative their phenotypes Therefore, the plate effect also has an additive component.  These box plots not only support the notion of an additive component to the plate-effect, but also serve as motivation for pre-processing by displaying the importance of the systematic experimental artefacts (i.e. not removing the plate effect w i l l be extremely misleading). Recall that the quantity recorded in the phenotype quantification stage of the experiment is not a measurement of the true phenotype, but a mapping of the phenotype to a numeric quantity. There is no serious interest i n recovering the underlying phenotype measurement.  Therefore throughout the rest of this paper, true phenotype w i l l refer to the  numeric quantity obtained from the mapping which would be observed if the exact protocol were followed. It w i l l be denoted by the variable z and the observed quantity w i l l be denoted by x.  15  The distortion model needs to be simple and sufficient. Thus, the distortion model may be expressed as a linear multiplicative-additive model,  Xgp  CXp ~\~ PpZgp,  (3.1)  where g is the gene being perturbed and p the plate. Here no assumptions are made about the source of the location and scale parameters of the plate effect. T h e multiplicative component, P , is a combination of the effect of all experimental sources of multiplicative distortion. p  Similarly, the additive component, a , engulfs a l l sources of additive effect. p  3.3  C u r r e n t Approaches t o D a t a Pre-Processing  In this section, the most common methods of data pre-processing i n H T P experiments w i l l be presented as well as select methods used for microarray data. It w i l l be demonstrated that the latter methods are not valid for the data at hand, but certain concepts found w i t h i n them may be useful for H T P data pre-processing.  3.3.1  H T P Experiments  The plate effect as specified by the distortion model i n (3.1) is most easily dealt w i t h i n two steps. T h e first step is to remove the additive effect. It is the typical phenotype recorded when the true phenotype is zero. There are various methods of estimating this effect. One is to subtract, from all observed phenotypes on each plate, the plate's m i n i m a l observed value. T h i s can also be done w i t h i n smaller geographical confines of the plate. For example, the plate may be divided into quadrants and so on. A t the most extreme level, these quantities are estimated locally for each mutant colony or well. T h e latter requires local estimates to 16  be provided by the image quantification software and is only valid for plates which don't use wells, such as agar plates. Removing the estimated additive effect, a , provides adjusted p  observations. M o s t approaches currently used i n H T P experiments use this first step. For (new)  sake of simplicity, let x\•9P  X  OL refer to the adjusted observations for the remainder  gp  p  of subsection 3.3.1. T h e plate effect for these adjusted observations may be expressed as,  (3.2)  where the subscripts are as previously defined. T h e problem is now reduced to a multiplicative model. Belief i n a common location of the distribution of true phenotypes on each plate may not enable the multiplicative components of the plate effect to be estimated, but it does allow for meaningful comparisons to be made. Let m  p  phenotypes on plate p and let m = z  gp  = x  be the mean of the adjusted observed  gp  be the mean of the distribution of phenotypes having  followed the exact experimental protocol. Hence, the value of m is unknown. T h e variables rn and m  p  are instances of random variables w i t h common expectations i n the absence of a  plate effect. Thus, the plate effect is estimated by /3 =  G i v e n m, pre-processing can be  P  accomplished by dividing observations on plate p by P .  Pre-processed phenotypes w i l l be  denoted by y, so this may be expressed as y  Due to the multiplicative nature  p  gp  = x /P . gp  p  of the effects of gene perturbation, comparisons are typically done by way of ratios. Hence comparing two mutants pre-processed i n this fashion leads to  Vgp  p Pp'  Vg'v'  17  Z  gp  where hopefully  ~ 1- If the estimated plate effect is taken to be  9P  _  m  rt  ^  9P  X  nip  PpXg'p'  ~W[ 9'P' X  m.p. Xgip*  Xgipi  I  vrtpi  it becomes apparent that the value of m is irrelevant. Thus, by simply dividing the adjusted observations from each plate by the plate's mean, the plates are normalized to each other and meaningful comparisons may be made. T h e mean may be replaced by other measures of location, most commonly the median. T h i s approach, referred to as mean or median plate centering, is the most commonly used i n current H T P experiment analyses. A n alternative approach is to take advantage of controls conveniently placed on every plate i n the set. T h i s method was applied by B . L . Drees et al. [4]. Keeping to the same distortion model as i n equation 3 . 2 , rather than using the belief i n a common location as a means to meaningful comparisons, it is the fact that the controls have a common phenotype that is used. Define w  p  to be the observed phenotype of the control on plate p. Under this  constraint (common phenotype for controls) and model, the plate effect may be estimated by fa  = v  .  - P  (3.3)  median (w ) p  p  T h e adjusted data from plate p are normalized by dividing them by their respective estimated plate effect (3 . V  w. p  Similarly to median plate centering, it is equivalent to divide all x  gp  by  In essence, a l l that is needed once the observations has been adjusted for additive  effect is to divide by a phenotype which is believed to be approximately equal across a l l plates. T h e exception to this rule are blanks. These should already be approximately equal after the removal of additive effects. Hence, aligning these would provide no insight to the  18  multiplicative component of the plate effect. B o t h approaches are honest attempts at removing the plate effect, but they may not be the most effective ones. T o begin w i t h , estimations of the additive component as described above are crude. T h i s i n t u r n may affect model 3.2. If the model for which we are estimating the parameters, i n this case P , is wrong to begin w i t h , then the so called normalized data p  w i l l still be heavily distorted. Moreover, mean/median centering assumes location to be the same, but this does not comply w i t h the distribution of phenotypes previously described. Disproportions i n the number of weak mutants on different plates w i t h i n the set, such as that depicted i n figure 3.3, w i l l assuredly lead to different plate locations. Evidently, aligning a mutant which is a structural zero on one plate w i t h a mutant whose phenotype is unchanged on another does not respect the underlying biology - an undesirable quality to any preprocessing method. A t the very least, it is much wiser to use the median rather than the mean, but regardless both these are poor measures of location when dealing w i t h distributions which are not unimodal. Using control phenotypes to estimate P is an improvement on p  median/mean plate centering. In principle this is a really good idea, however it does have a weakness. Controls are often replicated in small numbers and any given colony may be strongly affected by location phenomena.  Therefore, controls have substantial variances  and division by variables w i t h large variance lead to bad variance properties for the result. Moreover, quality control should be performed to ensure the control's behaviour is consistent enough to be useful. If technical replicates are included i n the experiment, then variance of a control may be obtained on each plate. B y comparing the average of these variances to the variance of control phenotypes across plates in the set, quality of the controls may be assessed.  19  3.3.2  Microarrays  A variety of normalization approaches have been proposed for microarray experiments. Such diversity is due i n part to progress and different types of microarray chips. For example, data generated by high density oligonucleotide microarray technology differ from those obtained using c D N A microarrays allowing variant normalization approaches to be employed. T w o classes of normalization here are channel to channel normalization i n c D N A experiments and normalizing many profiles to each other. T h e latter is used i n high density oligonucleotide microarrays and i n c D N A experiments which use many arrays. T h e pre-processing strategies developed for normalizing many profiles are the most useful to our cause. mentioned the biological effect is multiplicative i n nature.  A s previously  For this reason, it is common  practice to log transform the data to render the effect linear. This, i n turn, allows more traditional statistical methods to be applied. T w o methods still popular today are quantile normalization [5] and variance stabilizing normalization [8]. In microarrays, each probe is present on each array. It is assumed that the majority of genes are not differentially expressed across conditions being studied. For the genes that are, change occurs i n b o t h directions. In other words, there is b o t h up-regulation of some genes and down regulation of others.  These changes generally off-set each other, so it is  reasonable to assume equal intensity distributions on each array across an array set. It is this assumption which is used to generate quantile normalization. Recall that an array set in the microarray context is very different from a plate-set in the H T P context. A plate-set is like a single microarray. To compare two datasets w i t h regards to distribution, one can construct a quantile-quantile (Q-Q) plot. A perfectly diagonal line on a Q - Q plot indicates equal distribution, up to a location, of the two datasets. T h e Q - Q plot serves as motivation for a transformation that can be used to render the distribution of phenotypes on b o t h plates  20  equal. T w o arrays i n an array set w i l l almost never have the same distribution. In practice the exact distribution of each plate is unavailable.  Rather than use quantiles as a basis  for a transformation, the G order statistics are used. Let X( )  g p  be the g  th  order statistic of  observed phenotypes on plate p = 1,2. T h e n the mapping  ^  „ £__+£_,  V  m  ( 3 4 )  w i l l force the distribution of b o t h arrays to be the same. T h i s non-linear transformation easily extends to array sets of size P. A n example using multiple arrays is shown i n figure 3.4. Unfortunately, assuming the equal phenotype distributions on each plate w i t h i n a set, i n the H T P context is not reasonable. Here the mutants composing each plate are entirely different and often not randomly allocated. However, the next chapter w i l l discuss how exploiting quantiles may be of use i n the quest to H T P data pre-processing. Huber et al suggest an alternative pre-processing method for microarray data which b o t h normalizes and stabilizes the variance a l l at once, which is aptly named variance stabilizing normalization [8]. It is based on a multiplicative-additive distortion model as i n equation 3.1 combined w i t h a multiplicative-additive error model suggested by Rocke, D . M . and D u r b i n , B . P . [10]. T h e resulting model is  x  = a + ppZgpe "" + Vgp 11  gp  p  rig ~ 7V(0, <7„) i.i.d., "gp ~ N(0, o ) P  v  u  . ..  (3.5)  d  T h i s implies a quadratic mean-variance dependency and leads to the following pre-processing transformation y where b is a function of j3 and o . p  p  v  = arsinh(  X9P  gp  \  ° j Op /  (3.6)  p  Thus the goal here is to estimate the coefficients a  p  21  and  b as best as possible to eliminate the plate effect and (a topic which is not addressed here) p  to minimize the structural relationship between the mean and the variance. The assumption used to derive this normalization method is that the majority of genes are not differentially expressed.  T h e data is t r i m m e d to include only genes believed  to be non-differentially expressed. Naturally, the means of non-differentially expressed genes should not change across arrays.  Thus, by mean centering the t r i m m e d data, m a x i m u m  likelihood estimates for the parameters can be obtained.  T h e result is a normalization  approach which estimates a and b simultaneously. Thus, rather than use crude estimates p  p  for the additive parameters, both the additive and multiplicative parameters of the distortion model are estimated using biological assumptions. To some the inverse hyperbolic sine function may seem foreign, but it is closely related to the log all while having advantages over it: arsinh(x) = log(x + y/x  2  + 1). Therefore, using  this normalizing method removes the need to log transform the data as discussed earlier. Variance stabilizing normalization may not be directly applied to H T P data because key to the m a x i m u m likelihood estimation of the parameters is the presence of the same genes on each arraj'. Unfortunately, plates w i t h i n a set do not share mutants i n H T P experiments. Over the course of this chapter, currently employed H T P pre-processing methods were presented along w i t h two microarray normalization methods. The fundamental principles driving current methodolgy are good, but there is room for improvement. Methods used on microarray data may not be directly applied to H T P data, but do offer tools which can help deal w i t h the issues of non-random allocation and parameter estimation strategies.  author = B . M . B o l s t a d et al., title = A Comparison of Normalization Methods for H i g h Density Oligonucleotide A r r a y D a t a Based on Variance and Bias, journal = Bioinformatics, year = 2003, volume = 1 9 , number — 2, pages = 185-193 22  High Throughput Phenotypic  MicroArrav  3—*  Condi ttori 1  Condition 2  F i g u r e 3 . 1 : T h i s d i a g r a m is a s i m p l e c o m p a r i s o n of H T P a n d m i c r o a r r a y e x p e r i m e n t s .  In  H T P e x p e r i m e n t s , plate-sets are r e q u i r e d for each c o n d i t i o n . P r e - p r o c e s s i n g of plate-sets is d o n e i n d e p e n d e n t l y for e a c h c o n d i t i o n .  23  Raw Phenotype Distribution in a Plate-Set  Plate 1  Plate 2  Plate 3  Plate 4  Figure 3.2: These data come from a plate-set containing more plates, but only four are included here for convenience. In these modified box plots, each box covers the entire phenotype range and uses a color scheme to present the approximate data distribution.  24  Unrelated Medians  i 10000  1 15000  1  20000  1 25000  1  r  30000  35000  40000  Phenotypes  Figure 3.3: These are smoothed histograms of the observed raw phenotypes from two plates from the same plate-set. T h e plate depicted in blue contains many more weak mutants than the other plate and this was a conscious choice.  25  Figure 3.4: A plot of the densities for 27 arrays. T h e density after quantile normalization is shown in bold black. The above figure was obtained from: B . M . B o l s t a d et a l . , A Comparison of Normalization Methods for H i g h Density Oligonucleotide A r r a y D a t a Based on Variance and Bias. Bioinform,atics,19(2):l8b-l93. 2003 [5].  26  Chapter 4 Recommended Pre-Processing Method T h e previous chapter considered, among others, the assumption that phenotype distributions on plates w i t h i n a set share a common mean or median. T h e event of non-random allocation of mutants to plates renders this assumption implausible. Furthermore, distributions w i t h more than one mode are not well characterized by such a measure of centrality. A broad theme identified i n the previous chapter is the idea of aligning a collection of datasets on one or more features believed to be constant across the collection, prior to the introduction of experimental artefacts. A distortion model has been been proposed i n (3.1).  T h e criticism i n the previous  chapter was w i t h regards to the methods, not the model. We will also use this model and simple extensions of it. Furthermore, the concept of feature alignment makes sense and w i l l also be used here. Thus, the task at hand is to establish proper features for alignment.  27  4.1  Features  There are two classes of features: internal and external.  4.1.1  E x t e r n a l Features  T h i s class of features was introduced when phenotypes on each plate were divided by each plate's w i l d type phenotypes.  E x t e r n a l features are those which arise from the design of  the experiment. T h e y are external to the phenotype distribution on plates. Controls and blanks are external features. T h e i r presence is not required for the experiment to run, thus they w i l l not always be available as a pre-processing tool. B y definition, i n the absence of a plate effect, the phenotype of each external feature would be approximately equal across plates. There is no questioning the biological validity of their approximate equality, but, as previously mentioned, quality control should carried out before aligning these features.  4.1.2  I n t e r n a l Features  Internal features are those which pertain to the phenotype distributions. centering and quantile normalization use internal features.  B o t h median  In the first case the feature is  the median of the phenotype distributions on each plate and i n the second the features are all quantiles of the plate distributions. T h e i r invalidity as features to be aligned i n H T P experiments has been established. Yet, there is a middle ground i n which certain quantiles,, and the mean up to certain quantiles, tend to be approximately equal i n the absence of experimental artefacts i n the H T P context. Internal features considered thus far have been susceptible to departures from mutant randomization. For a better understanding of how the data are affected by a lack of ran28  domization, consider a simple simulation which compares two plate-sets: one constructed to represent randomized mutant allocation which is implicitly assumed by median centering and the other to represent non-randomized mutant allocation which reflects the reality of the yeast deletion set.  Real data are used to construct a fictitious population of possible  phenotypes. T h e y arise from a single agar plate, w i t h 1536 yeast colonies. Figure 4.1 shows a smoothed histogram of this population. For sake of simplicity, only two categories of data are considered: healthy and unhealthy.  T h e cut-off value of 7500 was used to distinguish healthy mutants (those w i t h  phenotypes greater than 7500) and unhealthy mutants. It is based on a rough visual estimate which assumes the two left modes are composed of phenotypes from unhealthy mutants. According to this cut-off, approximately 30% of the mutants are unhealthy. B o t h plate-sets consist of two plates, each containing 300 mutants.  In the random  plate-set, plates are simply populated by sampling w i t h replacement from the population, thus representing random allocation. T h e pathological plate-set simulates the allocation often seen i n practice. One plate receives 25% of its mutants from the unhealthy subpopulation and the other plate recieves 50% of its mutants from that subpopulation. Figure 4.2 displays particular results using both sampling schemes. These represent plate-sets which have not been distorted by the plate effect. T h e y are the data which we desire to recover after pre-processing. T h e simulation consisted of 500 iterations. In each iteration, an instance of both platesets was populated. W i t h i n each set, for a range of quantiles, the observed quantile value was recorded for plate 1 and plate 2. T h e results are depicted i n figure 4.3 which are fancy scatter plots. T h e y show the 500 observed quantile differences at each quantile using a smoothed color density representation. T h e yellow lines are the smoothed mean differences.  29  T h e random plate-set suggests that most quantiles are approximately equal when mutants are randomly allocated to plates. T h e only exception being quantiles i n the extreme tails, as these contain outliers. Stability i n the lower tail persists to more extreme quantiles than i n the upper t a i l because outliers i n the lower t a i l are solely due to error. T h e results presented in figure 4.1 don't suggest large discrepancies i n the lower tail, but the smallest measured quantile is at 1%. C a u t i o n w i t h regards to outliers should be taken at b o t h ends. Here the quantiles i n both tails which are not too extreme align just as well as the median. In fact some of them are better behaved than the median (e.g. the 85% quantile). In general, under random allocation the medians differ less across plates than do the extreme quantiles, but the difference tends to be small. T h e pathological plate-set tells a different story. Here the median differs by a substant i a l amount across the set - it is the most unequal quantile. In fact, their values correspond to phenotypes of different biological categories: healthy and unhealthy. O n the other hand, the quantiles in the tails seem to be much less affected by the disproportionality of healthy and unhealthy mutants across the set, specifically, quantiles roughly between 1% and 5% and between 85% and 95%. Define the tails of the distribution which are not too extreme to be affected by outliers as the almost-tails. T h e simulation suggests that quantiles in the almost-tails w i l l be approximately equal across the set and that this is robust w i t h regards to departures from randomization of the mutants. One expects the range of underlying phenotypes on each plate to be roughly the same. However range may be severely altered by outliers. Thus, the idea is to use a compromise between range and center w i t h more emphasis placed on range.  Therefore, the middle  ground between assuming the equality of medians and assuming equality of all quantiles is to assume equality of quantiles i n the almost-tails. Quantiles in the almost-tails of phenotype distributions are internal features which are appropriate for H T P experiments due to their  30  robustness to modest departures i n randomization. Another example of internal assumptions is the approximate equality of the modes i n the phenotype distributions. In particular, the left modes, when these exist, often correspond to phenotypes of blanks and nonviable mutants, so their alignment would make much sense. It isn't clear whether other modes are approximately equal. T h e upside to aligning the left modes is that they are the least affected by error and hence have the smallest variance. T h e measurement errors are believed to be more than a simple mean zero additive error, but blanks and nonviable mutants are only affected by this error. Ultimately, it is a simple task to compute quantiles, while it is often not simple to identify modes of a distribution. Therefore, we place a heavy emphasis on quantiles in our pre-processing approach. There exists cases where approximate equality of quantiles i n the lower t a i l may be questionable. In cases where the left mode corresponds to non-viable mutants and blanks only, the discrepancy between phenotypes of viable and non-viable mutants may be large. This, i n turn, may reduce the region of lower quantiles over which approximate equality across the plate-set holds and moreover, very much increase the difference between lower quantiles which are not i n the region of approximate equality. A n example of such a scenario will be expanded upon i n the next chapter.  A s an alternative, rather than assume the  approximate equality of the quantiles i n the lower tail, assume the approximate equality of the average phenotype up to a given quantile. T h e quantile becomes an upper bound for the values which are averaged over. T h i s is analogous to mean centering, but restricted to the left tail. Essentially, the claim is that the expected value of the lowest r% of the phenotype distributions on each plate across the set should be approximately equal for value of r that are small (but not vanishly small). In Huber et aVs work, mean centering was carried out using trimmed data. Here we suggest the opposite, to align the mean of the "trimmings" i n the left tail. Hence this internal feature is called the mean of left t a i l trimmings. Figure 4.4 depicts 31  the observed differences of the mean left t a i l trimmings on b o t h plate-sets and compares the results w i t h those of quantiles. There do not seem to be any differences i n the approximate equality of these internal features w i t h i n the random plate-set context.  However, there is an important difference  observed i n the pathological plate-set. T h e approximate equality of the mean left t a i l t r i m mings spans a much larger range of quantiles than does the approximate equality of the quantiles themselves. T h e instability due to non-randomized allocation is simply delayed by averaging over the trimmings. T h e mean up to a given quantile is always'available and the region of approximate equality continues to be the almost-tails. Thus, mean left t a i l t r i m mings preserve the simplicity found in quantiles and w i l l be favored as the internal feature of choice in the left tail. T h e choice of alignment features w i l l be revisited in the next chapter.  4.2  Proposed Pre-processing Method  Pre-processing involves transforming the raw data to remove systematic experimental artefacts. Having specified a model for the plate effect, a natural transformation arises,  fp(*) = ^  where a  p  and (3 are as i n (3.1). P  Thus far we've referred to a  (4-1)  p  and /3 as the additive P  and multiplicative components, respectively, of the plate effect. W i t h i n the context of the transformation i n (4.1), they w i l l , respectively, be referred to as the location and scale parameters. T h e novelty of Huber et aVs normalization method which we wish to incorporate into our method is the simultaneous estimation of b o t h the location and scale parameters based on biological assumptions.  T h e assumption proposed is the approximate equality  32  of select features of the data.  T w o classes of such features which respect the underlying  biology of the experiments have been established. Using this transformation on a plate-set containing P plates leads to 2P unknown parameters.  Simply aligning one feature across  the plate-set will not be sufficient to solve for the unknown parameters. T h e alignment of two or more features, though, does allow for simultaneous parameter estimation. However, the estimation method w i l l differ if only two features are aligned rather t h a n three or more. These two cases w i l l be considered separately. In order to avoid over-parameterization, constraints need to be set. There are a few natural choices here.  One plate may be set as a reference plate.  A n o t h e r option is to  constrain the quantiles to equal their averages which is analogous to what is done i n quantile normalization. In the end, the choice of constraints is not critical. Results obtained using one constraint may be linearly transformed to conform to another constraint using the very same slope and intercept for each datum. Using a reference plate introduces artefacts w i t h respect to plate-sets under other conditions, but pre-processing across conditions w i l l remove these. B y setting restraints, the transformation does not use proper estimations of a  p  P as suggested i n (4.1). Therefore, we reformulate the transformation as f (x) p  p  W i t h o u t loss of generality, set a\ = 0 and b\ = 1 (i.e.  =  and £Z£E.  set plate 1 as the reference plate).  Hence, a = a — ct\ and there remains 2P — 2 unknown parameters. p  4.2.1  p  E x a c t a l i g n m e n t o f 2 features u n d e r the linear m o d e l  We begin w i t h the case of two features.  Let qj denote feature j on plate p. P  Having set  plate 1 as a reference plate the following 2P — 2 equations may be formulated based on the  33  application of the transformation i n (4.1):  /lOfci) = f (q ) P  forj = l , 2  jp  Vp = 2 , . . . , P .  (4.2)  There are 2P — 2 equations and an equal amount of unknowns, therefore an exact solution exists. In fact, when using transformation (4.1), a direct solution of the unknown parameters is only possible when aligning two features.  For each individual plate, simple algebra may  be used to solve for its location and scale parameters.  b P  = SlE _ <Jn  P  q2  qn  ~  g 2 l g l  <7ii(<7n -  P  (4  4)  ?2i)  A p p l y i n g the transformation i n equation (4.1) using estimates as described i n (4.3) and (4.4) will force features 1 and 2 one each plate to be equal to those of the reference plate.  4.2.2  A p p r o x i m a t e a l i g n m e n t o f more t h a n t w o features t h r o u g h optimization  W h e n there are more alignment features than there are parameters, a direct solution for the unknown parameters is not available. In such circumstances, under this distortion model, the ability to align all selected features exactly across the plate-set is lost. Alternatively, optimization may be used to estimate the parameters. Suppose there are Q features to align. Let a = (0, ( 2 2 , a p ) be the vector of location parameters, b = (1, &2> • ••> bp) be the vector of scale parameters and  — (qji,qj2, •••,Qjp) be the vectors of features for j =  1,...,Q.  W i t h o u t the possibility of an exact alignment of the features, the next best solution is to  34  minimize their variance across the set. T h i s leads to the following objective function:  (4.5)  Various optimization algorithms may be implemented to minimize the objective function and obtain the parameter estimates of interest. T h e Newton-Raphson method requires the inversion of the Hessian matrix which is problematic as it often becomes singular. Therefore, it is recommended to use optimization methods which only use the exact calculation of first derivatives. There are two classes of such methods: conjugate gradient methods and quasi-Newton methods. T h e y differ i n their strategy to numerically estimate the Hessian. Conjugate gradient methods require less storage than quasi-Newton methods require, but the latter are generally less fragile [7]. Therefore, since these w i l l never be large problems, it is recommended to use quasi-Newton methods i n general, though b o t h methods are appropriate. Regardless of the class of method chosen, the gradient is required for their proper implementation. It is a vector of length 2P — 2 w i t h P — 1 entries which are the partial derivatives of the objective function g(a, b) w i t h respect to each location parameter (excluding a,i — 0) and P — 1 entries which are the partial derivatives w i t h respect to each scale parameter (excluding b\ = 1). Let (jj = ^ __p=i fp{Q.jp) be the average j  th  feature across  the the transformed plate-set, then the following equations may be used to construct the gradient. (4.6)  -2  •0(a, )  db  b  k  bl(P-l)  Q  Y2(fk(0jk) - 0j)(0jk - a ) k  35  for k = 2, 3 , . .  (4.7)  The complete derivation of these results are given i n A p p e n d i x A . Once the optimization method has been selected and the first derivatives specified, the last requirement is a choice of initial parameters. Initial parameters which are too far from the m i n i m u m may lead the algorithm to converge to a local m i n i m u m or to travel i n the wrong direction and fail to converge w i t h i n the specified number of iterations. Furthermore, the choice of initial parameters w i l l affect the number of iterations required to converge. There are various ways to obtain good initial parameters.  One way is to first use the difference  between the m i n i m u m observation on each plate and that of the reference plate to initiate the location parameters: on plate p.  a  p0  = mm(x ) — mm(xi). Here, x p  p  is the vector of phenotypes  T h e n initiate the scale parameters by taking the ratio of the range of each  plate over that of the reference plate. In other words, first subtract the obtained location parameter guesses from each plate: y  gp  —x  gp  —a . p0  T h i s is analogous to the removal of the  additive effect discussed i n the current methods section. T h e n initiate the scale parameter a  g  ^  _  rnax(y )-min{y ) p  p  max(y )—min(y ) 1  4.2.3  1  '  E x a c t a l i g n m e n t of more t h a n two features u s i n g a piecewise distortion model  W i t h i n the context of aligning three or more features, there is an alternative to optimization. A simple extension to the distortion model allows for exact alignment of the features.  The  distortion model could be described as piecewise linear. Perhaps the plate effect is not exactly the same at different magnitudes of the phenotype. Using linear spline interpolation with knots at the features will allow these to be equal across the set, rather than approximately equal. Also, different slopes and intercepts between each feature would help account for any change i n plate effect due to phenotype. Figure 4.5 depicts the difference between the linear  36  transformation as proposed i n (4.1) and linear spline interpolation. The transformation called upon by linear splines is as follows. Suppose there are Q features. T h e n there are Q knots ( x ^ y , ) which the transformation is to pass through. T h e linear function between each of these knots is:  Si = Vi +  V  l  +  1  _  V%  (x -  x^  x e [xi, x  i + 1  }.  (4.8)  Thus the function is only defined for phenotypes between the selected features. T h e linear mappings between the first and second features and between the two upper features may be extended to map a l l observations. Alternatively, a l l values below X\ may be mapped to y\ and a l l those above XQ may be mapped to UQ. Constraints are once again called for. T h e candidates are the same as before, either align to a reference plate or align to the average value of each feature. In this case, the choice cannot be changed afterward using a global linear transformation, therefore the choice of constraint w i l l have a heavier impact on the results. For this reason, it seems more reasonable to map each feature to the average value of that feature rather than to the value observed on a reference plate. There are two advantages to using linear splines over using linear transformations and optimization. First solving the latter is more computationally expensive: it requires more memory and time. Secondly, modest departures from linearity i n the plate effect w i l l be compensated for using this approach. Hence, it offers more flexibility. The method we propose uses biological assumptions to simultaneously estimate the additive and multiplicative components of the plate effect. T h e assumptions used to achieve this are the approximate equality of two or more features.  T h e approximate equality of  quantiles in the almost-tails is an example of internal assumptions which are robust to  37  departures from random mutant allocation to plates. Quantiles are not the only features which can be used. For aligning more t h a n two features, two methods have been proposed. T h e y are b o t h valid and have their pros and cons. A more i n depth comparison of the two methods w i l l be conducted i n the next chapter.  38  Phenotype Population for Simulation  -5000  0  5000  10000  15000  20000  25000  Phenotype  Figure 4.1: T h e approximate density of the phenotype population used for the simulation. T h e cut-off is depicted by the black line at 7500. T h e heavy left t a i l suggests that a fair number of mutants are affected by the gene perturbation, which this cut-off value tries to capture.  39  Approximate Densities of Random Plate Set  i  0  5000  10000  1  15000  r  20000  Approximate Densities of Pathological Plate Set  -5000  0  5000  10000  15000  20000  Figure 4.2: Plates generated i n a random fashion still present different phenotype distributions since the mutant collections on each plate are different. However this difference is minimal compared to that which arises when a large collection of weak mutants are purposely contained on one plate.  40  Random Plate-Set: Seeking quantiles that are approximately equal within a set 1  <D  9-  -5T Q.  ^ o o c CM _"  (D  £ s _2 E  Figure 4.3: The actual values are not shown i n the y-axis, but the limits are kept constant for both graphs and b o t h start at 0.  41  Random Plate Set: [Difference in Quantilesj  Random Plate Set: Average Left Tail Trimming;  CD CD Q.  9; >•  Si? f °CM *™ CD  0.0  0.1  0.2  0.3  0.4  0.0  0.5  Quantiles  0.1  0.2  0.3  0.4  0.5  Quantiles  Pathological Plate Set: [Difference in Quantilesj  Pathological Plate Set: Average Left Tail Trimming;  CD CD Q_  £ ^  o c  C CD  0) £  £°-  C\J "~ CD  ffl -s _ CO  i  0.0  0.1  0.2  0.3  0.4  0.5  0.0  Quantiles  0.1  0.2  0.3  i 0.4  r 0.5  Quantiles  Figure 4.4: A g a i n the range of the y-axis is kept constant and all start at 0.  42  Mapping Observations to Normalized Values For a Single Plate o  T  0  2000  4000  6000  8000  Observed Phenotypes  Figure 4.5: T h e red dots are ordinal pairs composed of the observed features and the values they should be mapped to. Linear spline interpolation allows for these features to be mapped exactly to the desired location. A linear transformation cannot do this, so the line which minimizes the distance between the mapped features and their desired value is chosen instead.  43  Chapter 5 Application to Real Data In this chapter, our recommended pre-processing method w i l l be applied to real data. T h e many knobs resulting from the methods flexibility, i n both the class of features to be aligned and the alignment method, will be explored. T o properly do so requires an understanding of the data used throughout the chapter.  5.1  Data  The data arise from growth assays using yeast deletion sets and are provided by the Conibear lab.  E a c h plate-set is composed of 14 agar plates.  colonies.  Each agar plate contains 1536 yeast  There are 384 distinct mutants per plate w i t h four technical replicates of each  (384 x 4 = 1536). O f the 384 mutants, only five are not deletion mutants. One is a blank and four are controls. T h e controls are yeasts which have different growth rates expected to cover the spectrum of possible phenotypes. T h e location effect is quite strong i n this data. A p p e n d i x B describes the nature of the edge effect along w i t h the interim solution. It is worth noting because the controls in this experiment are near the edge. 44  T h e mutants are not randomly allocated to plates. O n twelve of the fourteen plates, yeast colonies are placed on the agar plates according to their location on purchased plates which is roughly based on their order on the chromosome. T h e remaining two plates contain mutants of interest (many of them of the weak variety or susceptible to food competition) w i t h additional blanks.  5.2  Recommended Method versus Median Centering  We begin by comparing the old method of median centering to the recommended method to see if there is any apparent improvement. Figure 5.1 contains three plots and, for simplicity, only 4 of the 14 plates are shown. T h e first plot depicts smoothed histograms of observed raw phenotypes on the four selected plates. It shows the presence of a plate effect as well as the presence of non-random mutant allocation (as seen by a much greater proportion of mutants i n the left mode of one plate). T h e plate whose density is depicted by the blue line, plate 13, is one of two plates which contains mutants of interest as described above. T h e second plot i n this figure depicts smoothed histograms of preprocessed phenotypes obtained by median centering. Specifically, the additive effect was first removed by subtracting from each observation its plate's m i n i m a l phenotype:  T h e n the resulting ad-  justed phenotypes were divided by the plate medians. There appears to be an improvement over the raw data. T h e plate depicted i n green is now aligned w i t h the others, but there still appears to be issues w i t h plate 13 (blue line). T h e majority of its non-blanks have a pre-processed phenotype larger than those reported on the other three plates resulting i n a range which is one and half times larger t h a n that of the other three plates.  Another  problem is the misalignment of the left modes which should align as they pertain to blanks and nonviable mutants. 45  T h e t h i r d plot depicts results from aligning both the blanks and the 90% quantiles. Since only two features were aligned, an exact solution was available, so there was no need to choose between linear splines and optimization. V i s u a l l y the results are quite pleasing. Firstly, the non-blanks on plate 13 have adjusted phenotypes w i t h i n the range of the other plates and, secondly, the left modes align well. These are real data, so it is impossible to evaluate these methods based on the true plate effects as they are unknown. However, the presence of blanks and controls on a l l plates can be used as an assessment tool to complement visual inspection of the smoothed phenotype histograms. Figure 5.2 contains three diagnostic plots to complement the previous figure. Each plot displays the median phenotypes of the four controls and blanks on the selected plates.  Controls one and two have four technical replicates and controls three and four  have two (due to the edge effect).  T h i s allows for an analysis of variance to be carried  out. Unfortunately, the blanks are on corners, so they don't have technical replicates. T h e p-value i n the title of each plot corresponds to the existence of a plate effect i n a two way analysis of variance (controls and plates as explanatory variables w i t h no interactions). T h e phenotypes were transformed i n order to achieve homoscedasticity.  L o o k i n g at these p-  values, it is apparent that the recommended method has removed much more of the plate effect. T h e diagnostic plot for median centering supports the argument above: having a range which is one and half times larger than the range of other plates is problematic. T h i s can be seen by the normalized values of controls 1, 3 and 4 of plate 13, w i t h control 3 being of the chart. T h e diagnostic plot for the recommended method of feature alignment shows a clear improvement on both the raw data and median centering. T h e blanks, of course, are perfectly aligned and the controls are better behaved. It is apparent on this plot that no transformation would allow b o t h the blanks and control 2 to be perfectly aligned. Results  46  would have been different had we chosen to align control 2 instead of blanks. T h i s begs the question: which features should we align?  5.3 5.3.1  Choice of Feature Observing internal feature behaviour induced through the alignment of external features  In chapter 4, two classes of features to align across plates were defined: internal and external. B y design, external features - controls and blanks - should be approximately equal across all plates. Thus, the natural question to ask is whether the alignment of external features w i l l induce approximate equality of quantiles. To complement chapter 4's simulation, the alignment of external features is applied to real data and the resulting quantiles and mean left t a i l trimmings on each plate are compared i n similar fashion. T h i s particular data-set does not have severe problems i n proportions of healthy and unhealthy mutants as can be seen i n the first plot i n figure 5.3. It shows the smoothed histograms of the phenotypes on each of its plates after alignment of controls. T h e distance between the left mode and the remaining phenotypes is large as was discussed i n chapter 4. Dealing w i t h real data, it is not possible to obtain 500 values for each quantile, thus using a smoothed scatter plot as i n the simulation is not useful. Instead, the variance of each quantile across the 13 plates is presented. In the top right hand corner plot, the variance of quantiles over the entire range is shown. T h e quantiles i n the upper almost-tail align well, but the lower quantiles are problematic as was stated i n chapter 4. There is a small region over which the quantiles don't differ too much, but quantiles slightly outside this region do not align well at a l l . T h e last plot depicts the variance of mean left tail trimmings. T h e 47  region over which this feature is stable across the set is much larger and the increase i n its discrepancy outside that region is much less severe. T h i s further supports the superiority of mean left t a i l trimmings over quantiles i n the lower almost-tail. It also suggests that the alignment of external features also induces the approximate equality of the proposed internal features. In the next section we w i l l further compare internal and external features.  5.3.2  I n t e r n a l a n d E x t e r n a l Features i n a C a s e Study-  U p to now controls and blanks have been used as a means of evaluation. Controls can equally be assessed v i a blanks and internal features.  Consider two data sets A and B . Figures 5.4  and 5.5 show the results of pre-processing each data set by minimizing the variance of the four controls across plates v i a a linear transformation. E a c h figure contains three sections. T h e first gives the settings used for pre-processing and the p-value resulting from a kruskal wallis test for the presence of a plate effect. T h e second shows the smoothed histograms of the pre-processed data and the t h i r d shows the corresponding values of select features for each plate. T h e alignment is not perfect as is readily observable, but real data is always noisy. In the first figure, the p-values do not suggest that there is a plate effect. Furthermore, the upper quantiles align w i t h about equal variance to controls 1 and 3, and the two smaller mean left tail trimmings align w i t h approximately the same variance as the upper quantiles if we ignore the one plate which seems problematic for the controls as well. T h i s is encouraging and supports our claims. T h e second figure suggests that this complimentary behaviour between internal and external features isn't always observed. T h e low p-value for control 1 suggests heavy contradictions between controls 1 and 3. In the diagnostic plot (the b o t t o m section), the internal features and the blanks do not seem approximately equal w i t h the alignment of controls. T h e results from the alignment of controls i n this scenario are not desirable.  48  S i m i l a r l y t o t h e r e s u l t s of m e d i a n c e n t e r i n g s h o w n earlier, t h e p r e - p r o c e s s e d p h e n o t y p e s f r o m s o m e p l a t e s are a l m o s t a l l h i g h e r t h a n t h o s e f r o m o t h e r s . T h o u g h c e n t r a l b e h a v i o u r of p l a t e d i s t r i b u t i o n s are e x p e c t e d t o differ, t h e b i o l o g y d o e s n o t d i c t a t e s u c h a severe Hence,  figure  difference.  5.5 d o e s n ' t c o n t r a d i c t t h e a p p r o x i m a t e e q u a l i t y o f i n t e r n a l f e a t u r e s v i a t h e  a l i g n m e n t o f c o n t r o l s as m u c h as i t q u e s t i o n s t h e r e l i a b i l i t y o f c o n t r o l s . I n e v i t a b l y , h a v i n g o n l y 4 or 2 r e p l i c a t e s o f e a c h c o n t r o l m a y g i v e w a y t o s u b s t a n t i a l v a r i a n c e a n d c o n s i d e r a b l y l i m i t t h e usefulness of t h e c o n t r o l s . T h e r e are benefits t o b e h a d b y a p p r o a c h i n g t h e p r o b l e m t h e o t h e r w a y r o u n d b y a l i g n i n g i n t e r n a l features a n d o b s e r v i n g t h e r e s u l t i n g a l i g n m e n t of c o n t r o l s .  F i g u r e 5.6  c o n s i d e r s Set A . A g a i n t h e r e s u l t s of p r e - p r o c e s s i n g , t h i s t i m e w i t h i n t e r n a l features, visually pleasing.  are  T h e c o n t r o l s s e e m for t h e m o s t p a r t t o b e w e l l b e h a v e d , a l t h o u g h n o t  e n o u g h t o suggest t h a t t h e r e is n o p l a t e effect  (see k r u s k a l - w a l l i s p - v a l u e s ) .  d e p i c t s t h e p r e - p r o c e s s i n g of Set B u s i n g i n t e r n a l features.  F i g u r e 5.7  N o t surprisingly the  controls  i n t u r n d o n o t a l i g n w e l l , b u t t h e s m o o t h e d h i s t o g r a m s suggest s u p e r i o r r e s u l t s here.  The  p l a t e s are w e l l a l i g n e d as are t h e left m o d e s . F u r t h e r m o r e , t h e b l a n k s are a l s o a p p r o x i m a t e l y e q u a l . T h e c o n t r o l s i n t h e s a m e r a n g e of p h e n o t y p e s as t h e b l a n k s d o n o t a l i g n w e l l at a l l . W h a t d i s t i n g u i s h e s t h e b l a n k s f r o m t h e c o n t r o l s is t h a t t h e y are o n l y affected b y a d d i t i v e effects a n d e r r o r s . T h u s t h e y are m o r e r e l i a b l e t h a n c o n t r o l s w h e n t e c h n i c a l r e p l i c a t e s are limited. I n t h i s b r i e f case s t u d y , c o n t r o l s were effective p r e - p r o c e s s i n g features for Set A , b u t n o t for Set B . S i m u l t a n e o u s l y , i n t e r n a l features were effective p r e - p r o c e s s i n g features i n b o t h d a t a sets. T h e p h e n o t y p e of e a c h c o n t r o l is b a s e d o n a l i m i t e d n u m b e r o f t e c h n i c a l replicates.  T h e y are s u s c e p t i b l e t o v a r i o u s sources of e r r o r a n d e x p e r i m e n t a l effects, s u c h  as edge effect, w h i c h i n t u r n m a y severely increase t h e i r v a r i a n c e . T h e b i o l o g y d i c t a t e s t h a t t h e i r v a l u e s s h o u l d b e a p p r o x i m a t e l y e q u a l across t h e plate-set, b u t t h e i r h i g h v a r i a n c e a n d  49  low replication count hinders their usefulness here. Quantiles and mean left tail trimmings on the other hand are based on a l l the plate's phenotypes. T h i s renders them more robust and thus more reliable as pre-processing features. Regardless of which feature is used, quality control should always be carried out as was done i n a l l four cases here.  5.4  Optimization versus increasing the number of parameters in the model  T w o different paths, which b o t h stem from the exact alignment of two features, may be taken when aligning more than two features. E i t h e r the distortion model can be kept linear, leading to parameter estimation v i a optimization, or it can be slightly extended allowing for linear splines to be used i n favor of a linear transformation. Solely based on the model, it is more desirable to use the simple extension to the distortion model which allows for slight departures from linearity to take place. It is a more flexible approach. O n the basis of implementation, linear splines are much less expensive than optimization, b o t h in memory and time. However, when using linear splines, it is important for the knots to be i n order so that the resulting transformation is injective. W h e n using controls as features, this is not always possible.  For example, on one plate from Set B , control 4 expresses a lower  phenotype than control 2, yet the average value of control 4 across the plate-set is higher than the average value of control 2. Therefore, using linear splines w i t h controls may not always be possible. Quantiles and other internal features on the other hand must be properly ordered and lend themselves nicely to linear splines. Lastly, when using splines, there is the danger of over fitting.  Figure 5.8 differs from figure 5.7 only in its estimation method:  optimization versus linear splines w i t h knots at the features.  50  T h e p-values do not change  significantly suggesting that there is no over-fitting. T h i s seems to hold consistently when using three or four features at a time.  Thus, linear splines are the suggested alignment  method and optimization of parameters for a linear transformation should be used when linear splines are not tractable.  51  Plate Densities of Raw Data Within a Plate Set  Densities After Removal of Estimated Additive Effect & Division b y Plate M e d i a n  Figure 5.1: For simplicity, we only depict the density of four plates. T h e plate whose density is depicted by the blue line is one whose mutants were not randomly allocated.  52  Diagnostics to Raw Data A N O V A P-Value for Plate Effect: 1.264e-12 controW control3 control2 control 1 blank  1  I  I  5000  15000  10000  1  20000  1  25000  30000  Diagnostics to Median Centering A N O V A P-Value for Plate Effect: 6.272e-08 controW control3 control2 control 1 blank  o.o  0.5  1.5  1.0  Diagnostics to R e c o m m e n d e d Method A N O V A P-Value for Plate Effect: 0.0126 controW control3 control2 control 1  m•  blank  1 10000  1  — I —  15000  20000  1 25000  Figure 5.2: These plots compliment the previous ones. T h e y depict the values of controls and blanks across the (reduced) plate-set considered i n the two normalization approaches shown i n (5.1).  53  Normalized Plate Densities  Variance of Quantiles o o  CD  O O CM  O O CO  10000  20000  0.0  30000  0.2  0.4  0.6  0.8  1.0  Quantiles  Mean Left Tail Trimmings  Quantiles in the Left Tail  i 0.00  1  0.05  1 0.10  r 0.15  0.00  0.20  0.05  0.10 Quantiles  Quantiles  Figure 5.3:  54  0.15  0.20  Features Aligned:  controh, control2, control3, controW  Kruskal P--value for Control 1:  0.0988  Method of Estimation:  Optimization using Quasi Newton  Kruskal P--value for Control 2:  0.1446  Data Set used:  Set A  Kruskal P--value for Control 3:  0.3684  Plates Used:  1,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13  Kruskal P--value for Control 4:  0.631  Edge Removed: TRUE  Background Corrected:  FALSE  Normalized Plate Densities  5000  10000  r  "i  20000  25000  30000  20000  25000  30000  15000  Diagnostics trim 0.075 trim 0.04 trim 0.025 9 5 % quantile 8 5 % quantile 2% quantile controW control3 control2 control 1 blanks 5000  10000  15000  Figure 5.4:  55  Features Aligned:  contrail, control2, oontrol3, oontroW  Kruskal P--value for Control 1:  8e-04  Method of Estimation:  Optimization using Quasi Newton  Kruskal P--value for Control 2:  0.1413  Data Set used:  Set B  Kruskal P--value for Control 3:  0.291  Plates Used:  1,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13  Kruskal P--value for Control 4:  0.5035  Edge Removed:  TRUE  FALSE  Background Corrected:  Normalized Plate Densities  "55  15000  20000  25000  30000  35000  Diagnostics trim 0.075 trim 0.04  • — • *• nm  •  • •-•  95% quantile  •—•-• »—m-m—  8 5 % quantile  2% quantile control4 control3 control2 control 1 blanks  «• — • • •  • —••  »« •  •  m  •— •  »•  •  —»-am*m • • • • » •  e»—•—•  1 15000  •  »  -•  1 20000  •••  1 25000  Figure 5.5:  56  1 30000  1— 35000  Features Aligned:  Trim 0.025/0.04 and Quantiles 85/95%  Kruskal P--value for Control 1:  0.0038  Method of Estimation:  Optimization using Quasi Newton  Kruskal P--value for Control 2:  0  Data Set used:  Set A  Kruskal P--value for Control 3:  0.1086  Plates Used:  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13  Kruskal P--value for Control 4:  0.0476  Edge Removed:  TRUE  Background Corrected:  FALSE  Normalized Plate Densities  c Q  5000  10000  15000  20000  25000  30000  Diagnostics trim 0.075 trim 0.04 trim 0.025 95% quantile 85% quantile 2% quantile controW control3 control2 control 1 blanks I  I  5000  10000  — i — 15000  1  1  20000  25000  Figure 5.6:  57  — r  30000  Features Aligned:  Trim 0.025/0.04 and Quantiles 85/95%  Kruskal P-•value for Control 1:  0  Method of Estimation:  Optimization using Quasi Newton  Kruskal P--value for Control 2:  0  Data Set used:  Set B  Kruskal P--value for Control 3:  0.0231  Plates Used:  1,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13  Kruskal P--value for Control 4:  0.0198  Edge Removed:  Background Corrected:  TRUE  FALSE  Normalized Plate Densities  15000  20000  25000  30000  35000  Diagnostics trim 0.075 trim 0.04 trim 0.025 9 5 % quantile 8 5 % quantile 2 % quantile control4 — * control3 control2 contrail '  » ' • • » ••• •••  •  * • « •  m-mm ——•• •—•  — • - • » • ~*  —  • • I  15000  —~  ••—•  "  1 20000  —•—•—  1 25000  Figure 5.7:  58  1 30000  1— 35000  Trim 0.025/0.04 and Quantiles 85/95%  Kruskal P--value for Control 1:  0.003  Method of Estimation:  Linear splines with knots @ features  Kruskal P--value for Control 2:  0  Data Set used:  Set A  Kruskal P--value for Control 3:  0.0891  Plates Used:  1,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13  Kruskal P--value for Control 4:  0.0609  Edge Removed: TRUE  Background Corrected:  Features Aligned:  FALSE  Normalized Plate Densities  o  3 o  I  • o  ED  o o +  O  1 5000  1 10000  1  1  15000  1  20000  25000  20000  25000  Diagnostics trim 0.075 trim 0.04 trim 0.025 9 5 % quantile 85% quantile 2% quantile controW control3 control2 contrail blanks 5000  10000  15000  Figure 5.8:  59  Chapter 6 Conclusion and Future Work There are many levels at which experimental artefacts distort high throughput phenotypic experiments: location w i t h i n a plate, plate-wise within a plate-set and across conditions. In this report, we have concentrated on the plate-wise variety. Current methods used for preprocessing were presented and their issues discussed. A class of alternative pre-processing methods which allow for all transformation parameters to be based on biological assumptions were proposed. T h e biological assumptions required here are the alignment of features across the plate-set. T h e alignment of two features leads to an exact solution while, otherwise, two routes may be taken. A simple extension to the linear distortion model which increases the number of transformation parameters to allow for linear splines is generally favored over minimization of feature variance across the set. T h e former has computational advantages as well as the capability to adapt to slight departures from linearity. However, i n order to use linear splines, the features must retain the same phenotype order on each plate. Furthermore, in the event that a large number of features are selected for alignment, say five or more, using linear splines may lead to over-fitting. In such situations, optimization should be employed.  60  We have also discussed the choice of features to align.  T w o classes were defined:  external and internal. W h i l e external features should align across the plate-set by design, low replication may lead to limited pre-processing effectiveness as was suggested i n section 5.3.2. Internal features, on the other hand, are based on a l l the data from each plate which saves them from this shortcoming. Often, i n practice, mutant are non-randomly allocated to plates which i n t u r n may have a heavy influence on various internal features such as the plate's median phenotype. We have presented the upper almost-tail quantiles and the mean left t a i l trimmings as internal features which are approximately equal across the plate-set under randomized allocation and robust to modest departures from randomization. These are always available, regardless of the experimental design, and simple to obtain. There is still room for improvement.  It would be desirable to use a model which  combines the distortion model w i t h an error model, as was done by Huber et al i n the m i croarray context, as a basis for a pre-processing transformation.  Consequently, any mean  variance relationship resulting from the plate effect interacting w i t h the random errors would be removed using this strategy. Pre-processing across conditions can be achieved using existing methodology developed by the microarray community such as quantile normalization. However, pre-processing w i t h regards to plate location effect is presently done very crudely. Improvement here is the most pressing matter.  61  Bibliography [1] et al. A . H . Tong, G . Lesage. G l o b a l mapping of the yeast genetic interaction network. Science,  303(5659) :808-13, 2004.  [2] A . E . C a r p e n t e r and D . M . S a b a t i n i . Nature  Systematic genome-wide screens of gene function.  Reviews, 5:11-22, 2004.  [3] S. Armknecht and M . Boutros et al.  High-throughput rna interference  drosophila tissue culture cells. Methods Enzymol,  screens i n  392:55-73, 2005.  [4] B . L . Drees and V . Thorsson et al. Derivation of genetic interaction networks from quantitative phenotype data. Genome Biol, 6(4):R38, 2005. [5] B . M . B o l s t a d et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics,  19(2):185-193, 2003.  [6] Robert G . Eason et al. Characterization of synthetic dna bar codes i n saccharomyces cerevisiae gene-deletion strains. Proc Natl Acad Sci USA., [7] W i l l i a m H . Press et al.  Numerical  Recipes  in  C: The  101(30):11046-11051, 2004. Art  of Scientific  Computing.  Cambridge University Press, 40 West 20th st., New York, N Y 10011-4211, U S A , 1988.  62  [8] Wolfgang Huber et al. Parameter estimation for the calibration a n d variance stabilization of microarray data. Statistical  Applications  in Genetics  and Molecular  Biology,  2 ( l ) : l - 2 2 , 2003. [9] G . Giaever and A . M. C h u et al. Functional profiling of the saccharomyces cerevisiae genome. Nature, 418(6896):387-91, 2002 . 1  [10] D a v i d M. Rocke and B l y t h e D u r b i n . A model for measurement error for gene expression analysis. Journal  of Computational  Biology, 8:557-569, 2001.  \  [11] C . Sachse and E . Krausz et al. High-throughput rna interference strategies for target discovery and validation by using synthetic short interfering rnas: functional genomics investigations of biological pathways. Methods Enzymol,  392:242-77, 2005.  [12] E . A . Winzeler and D . D . Shoemaker et al. Functional characterization of the s. cerevisiae genome by gene deletion and parallel analysis. Science, 285(5429):901-6, 1999. [13] L u u P. Speed T . P . Y a n g Y . H . , Dudoit S. Normalization for cdna microarray data. Technical report, University of California, Berkeley, 2001.  63  Appendix A Derivation of Gradient T h e derivative of a sum is the sum of the derivatives, so we begin w i t h a single term from the sum. Let q~j = £ J2p=i fp( jp)> a  d da \ Var  k  /g,- - ax  b  t n e n  _d_  >  p  da \{P-l)^  /  k  . P~ 1  1  d  P  /p_D  5Z-(/P(9JP) - 9j)^-(/p(?jp) 2  9i)  ' p=i  ^  rrj S fofop)" ^ ) ^ ) + (pZT) ( - £ + ^)(Ato*) - & ) ' 2  p  p=  2  (-•f)(fk(qjk)-qj)  Therefore, by adding these terms up we obtain,  aa;  5 ( a  '  b )  =  MprT)E(A(^)-^)-  64  The process is very similar for partial derivatives of the scale parameters. P  =  1 d ( p _ t) H (fp(Hp) - ^g^ifp^jp)  -  ^  P  2  %  W  -  *  )  «  +  ^(A(,,)-*)(-^)  (prij(-^)Ww)-*)  =  E  - Qj)  2^  b  -2 _ ^ (/fc(gjfc) ~ gj)(gjfc - fc) fl  p  -  Which yields the equation  a —g(a,b) =  -2  0  ^(A(gjfc) ~ gj)(gjfc ~ afc)  65  Appendix B The Edge Effect Colonies near the edge produce systematically higher phenotypes. T h i s effect is dubbed the edge effect. There are two explanations for this. F i r s t l y these colonies have less neighbours and hence less competition for nutrients (distributed i n the agar) allowing them to grow faster.  Secondly the agar plates are slightly curved at the edges which bends the light  from the scanner and results i n brighter recorded intensities (i.e.  stronger phenotypes are  reported). Figure 2.4 shows an image of a particular agar plate obtained v i a scanner (left). T h e larger size of colonies near the edges, indicating a higher growth rate, is immediately observable as is the relative location of the technical replicates. These are allocated i n two by two squares directly next to each other. Similarly, the controls are contained i n a two by two square on the edge. Thus they are contained w i t h i n the first four rows from the edge. A s previously mentioned, the location effect is outside the realm of this paper,, nevertheless it cannot be entirely ignored here. T h e interim solution is to remove the data from the edges (first and last columns and rows). Figure B . l shows the distribution of phenotypes w i t h i n columns of a single plate before and after this manipulation. It is quite evident that  66  while there is an improvement, the effect is not entirely removed and that it spans over the first few rows/columns. Fortunately for the controls, their location is the same from plate to plate. It is expected that the edge effect w i l l cancel out, but it may be argued that the standard error w i l l be increased because the median value is taken over two observations rather than 4.  67  Before Edge Removal  1  4  7 10  14  18  22  26  30  34  38  42  46  Column  After Edge Removal  Column Figure B . l : Evidently, the observations on the first and last columns are those most affected by the location effect, but it is also evident that the edge effect is not restricted to these two columns. T h e data is eventually collapsed to the median of the four technical replicates, thus.it is equivalent to replace the values on the edge w i t h the values from the neighbouring row or column which explains why the first two and last two columns are equally distributed i n the second plot.  68  

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0100664/manifest

Comment

Related Items