UBC Theses and Dissertations

Detection of differentially expressed alternative transcripts using conventional microarrays : with application… Chan, Fong Chun 2011

Detection of Differentially Expressed Alternative Transcripts using Conventional Microarrays
With Application to Diffuse Large B-Cell Lymphoma

by Fong Chun Chan
B.Sc., Simon Fraser University, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
December 2011
© Fong Chun Chan 2011

Abstract

Alternative splicing, alternative transcript initiation, and alternative 3’ polyadenylation are the main molecular mechanisms which allow for a single gene to give rise to multiple, biologically distinct, alternative mRNA transcripts. Prior studies have observed that alternatively expressed transcripts can be detected using conventional gene expression microarrays by identifying inconsistent expression patterns of probesets interrogating the same gene. However, conventional microarrays have been disregarded as a potential platform for detecting differentially expressed alternative transcripts between two groups of samples.

We have developed a novel algorithm, called DISCO (DIscordancy SCOre), for detecting differentially expressed alternative transcripts between two groups of samples that is designed to work on conventional microarrays. Using a published dataset with RT-PCR validated results, we demonstrated DISCO’s ability to accurately discriminate between true positive and true negative events. Using an internal cohort of 36 conventional microarrays with matched RNA-Seq libraries and an external cohort of 200 conventional microarrays of diffuse large B-cell lymphoma samples, dichotomized into the two main subtypes, we showed that DISCO has superior performance over an existing method. Gene enrichment analysis performed on the top 165 DISCO candidate genes, from the comparison of the two main subtypes, showed an enrichment for the molecular terms “alternative splicing” and “splice variants”.
The top-ranked genes with differentially expressed transcripts between the two main subtypes included FOXP1 (a previously reported finding) and PHF19, which to the best of our knowledge has not yet been reported.

With DISCO, conventional microarrays can now be analyzed to detect differentially expressed transcripts, a task for which these microarrays were not originally designed. To the best of our knowledge, this is the first study to assess the concordancy between conventional microarrays and RNA-Seq predictions for differentially expressed transcripts. The concordancy between these two platforms indicates that the extensive repository of conventional microarrays may serve as a powerful supplement to RNA-Seq data, potentially acting as a first-line discovery platform.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Molecular Mechanisms that Regulate Alternative Transcript Expression
  1.2 Platforms for Detecting Alternative Transcript Expression
    1.2.1 Conventional Gene Expression Microarrays
    1.2.2 Exon Microarrays
    1.2.3 RNA-Seq
  1.3 Bioinformatic Approaches for Detecting Differentially Expressed Alternative Transcripts
    1.3.1 Methods Focusing at an Individual Probe/Probeset Level for Conventional Microarrays
    1.3.2 SplicerAV
    1.3.3 SI
    1.3.4 FIRMA
    1.3.5 Tuxedo Suite
  1.4 Diffuse Large B-Cell Lymphomas
  1.5 Thesis Overview
2 Methodology
  2.1 Chapter Synopsis
  2.2 The DISCO Algorithm
  2.3 Regrouping of Probes on GeneChips© using Custom CDFs
  2.4 Implementation of DISCO
3 Results
  3.1 Chapter Synopsis
  3.2 Evaluating the Performance of DISCO
  3.3 Comparison of DISCO to Other Conventional Microarray Methods
  3.4 Potential Confounding Factors on DISCO
  3.5 Application of DISCO to ABC vs. GCB
4 Conclusions
  4.1 Limitations and Future Directions
Bibliography
Appendices
A Supplementary Methods
  A.1 Conventional Microarray Analysis
  A.2 Exon Microarray Analysis
  A.3 RNA-Seq Analysis
  A.4 Generation of Sensitivity/Specificity Concordancy Plots
  A.5 Gene Enrichment Analysis
B Supplementary Information
  B.1 Using RT-PCR for Validation of Alternatively Expressed Transcripts
  B.2 Genes with Complex Patterns of Differentially Expressed Transcripts

List of Tables

2.1 Differences in the Affymetrix CDF and the Exon-Centric CDF
3.1 Gene Enrichment Analysis on the Top 165 Ranked Genes Predicted to have Differentially Expressed Transcripts between ABC and GCB
3.2 Biological Properties of PHF19 long vs. PHF19 short

List of Figures

1.1 Different Molecular Mechanisms that Regulate Alternative Transcript Expression
1.2 Distribution of Probesets per Gene on a HG-U133 Plus 2.0 Array
1.3 Cufflinks Assembly of Transcripts using RNA-Seq Data
1.4 Differences in the Transcriptome Detected by Cuffdiff
2.1 DISCO Matrices of Genes with Discordant Patterns of Differentially Expressed Probesets
3.1 ROC Curve of the DISCO Values
3.2 Tumor vs. Normal DISCO Plot of NME1
3.3 Tumor vs. Normal DISCO Plot of COL6A3
3.4 Sensitivity/Specificity Concordancy Plots of DISCO and SplicerAV Predictions to RNA-Seq predictions for ABC vs. GCB in the Internal Cohort
3.5 Sensitivity/Specificity Concordancy Plots of DISCO and SplicerAV Predictions to RNA-Seq predictions for ABC vs. GCB in the External Cohort
3.6 Density Scatter Plots of Possible Confounding Factors on DISCO Values
3.7 Sensitivity/Specificity Concordancy Plots of Possible Confounding Factors to RNA-Seq predictions for ABC vs. GCB in the External Cohort
3.8 ABC vs. GCB DISCO plot of FOXP1
3.9 ABC vs. GCB DISCO plot of PHF19
A.1 Diagram of the List of Genes Used to Generate the Concordancy Plots
B.1 RT-PCR Validation of Alternative Transcript Regulation Events
B.2 Tumor vs. Normal DISCO Plot of CTTN in Gardina et al. Dataset
B.3 Tumor vs. Normal DISCO plot of FN1 in Gardina et al. Dataset

Acknowledgements

I would like to thank my rotation supervisors, Dr. Paul Pavlidis and Dr. Mark Wilkinson, for giving me the opportunity to expand my scientific horizons by being part of their labs. I would also like to thank the 2009 cohort of the Bioinformatics Training Program and the members of the Shah, Steidl, and Gascoyne Labs for being great and encouraging colleagues to study and work with throughout my master's thesis. To my committee members, Dr. Sohrab Shah and Dr. Wyeth Wasserman, thank you for your invaluable guidance and advice on how to be a good researcher. Finally, I would like to extend my sincerest gratitude to my supervisors Dr. Sanja Rogic, Dr. Christian Steidl, and Dr. Randy Gascoyne for providing me with endless motivation, support, and a unique environment for my graduate studies.

Dedication

To my loving Lord, extended family, sister, and parents.
Chapter 1
Introduction

1.1 Molecular Mechanisms that Regulate Alternative Transcript Expression

Since the sequencing of the human genome, it has been revealed that there are only between 20,000 and 25,000 protein-coding genes, a number substantially lower than originally hypothesized [1]. This discrepancy is partly explained by precursor mRNA (pre-mRNA) transcripts undergoing a molecular process called alternative splicing [2], whereby introns are excised and exons or parts of exons are ligated in different ways. There are five major forms of alternative splicing, the most prevalent being exon skipping (Panel A of Figure 1.1). In this form of alternative splicing, exons are either spliced out or retained, resulting in alternative transcripts. These exons are called cassette exons, in contrast to constitutive exons, which are included in all alternative transcripts. A more complicated alternative splicing event occurs when there are two sequential exons and either one is retained in an alternative transcript, but not both (Panel B of Figure 1.1). In addition to retaining and splicing out entire exons, alternative 5’ and 3’ splice sites of an exon can be utilized, resulting in parts of exons being included or excluded (Panels C and D of Figure 1.1). Finally, the rarest form of alternative splicing is intron retention (Panel E of Figure 1.1), where the excision of the intron does not occur during the normal splicing process and the intron is retained in an alternative transcript. Aside from these five major forms of alternative splicing, two additional molecular mechanisms serve to regulate alternative transcripts: alternative transcript initiation and alternative 3’ polyadenylation.
In alternative transcript initiation (Panel F of Figure 1.1), transcription is facilitated by different promoters, which generally initiate transcription at different start sites, so that each promoter yields a different first exon and a different pre-mRNA transcript. In alternative 3’ polyadenylation (Panel G of Figure 1.1), a gene can have multiple polyadenylation sites, leading to different 3’ ends for an alternative transcript. Both of these alternative transcript regulation mechanisms are used in conjunction with alternative splicing to allow a single gene to give rise to multiple, biologically distinct, alternative transcripts. Alternative transcript regulation is much more prevalent in the human genome than originally thought, and it has been predicted to occur in over 90% of genes [3]. A consequence of this biological phenomenon is an extensive diversification of the human transcriptome and proteome.

[Figure 1.1 image: schematic panels (A) to (G), grouped under "Major Forms of Alternative Splicing" and "Additional Molecular Mechanisms of Regulation"]

Figure 1.1: Different Molecular Mechanisms that Regulate Alternative Transcript Expression. The blue regions represent genomic sequence which is constitutively spliced, while the red regions represent genomic sequence which can be spliced out. The solid black lines represent introns, while dashed lines represent different splicing scenarios. (A) Exon skipping. (B) Mutually exclusive exons. (C) Alternative 5’ Splice Site. (D) Alternative 3’ Splice Site. (E) Intron Retention. (F) Alternative Transcript Initiation. (G) Alternative 3’ Polyadenylation.
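The combinatorial effect of exon skipping (Panel A) can be made concrete with a small sketch. This is an illustration only: the function name `enumerate_transcripts` and the toy three-exon gene are hypothetical, and the code models cassette-exon skipping alone, ignoring the other mechanisms and any biological constraints on which combinations actually occur.

```python
from itertools import product

def enumerate_transcripts(exons):
    """List every exon combination obtainable by exon skipping alone.

    `exons` is a list of (name, is_cassette) pairs in 5'->3' order.
    Constitutive exons are always kept; each cassette exon is either
    retained or spliced out.
    """
    # One tuple of keep/skip options per exon.
    choices = [(True, False) if is_cassette else (True,)
               for _, is_cassette in exons]
    transcripts = []
    for keep_flags in product(*choices):
        kept = tuple(name for (name, _), keep in zip(exons, keep_flags) if keep)
        transcripts.append(kept)
    return transcripts

# Hypothetical gene: exons 1 and 3 are constitutive, exon 2 is a cassette exon.
gene = [("1", False), ("2", True), ("3", False)]
for transcript in enumerate_transcripts(gene):
    print("-".join(transcript))
```

Under this simplified model the number of possible transcripts doubles with each additional cassette exon, which gives a feel for how a modest number of genes can produce a much larger transcriptome.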
There are an increasing number of reports describing specific alternative transcripts found to be preferentially expressed in one group of samples compared to another group; in some cases, disease tissues were compared to their normal tissue counterparts and differentially expressed alternative transcripts were discovered which may have been contributing to the pathogenesis of the disease [4]. For example, Wang et al. [5] discovered that the tumor suppressor Syk produced a specific short alternative transcript that was expressed exclusively in breast cancer samples. When translated, the short isoform had, unlike its long isoform counterpart, its subcellular localization signal disrupted, resulting in an inability to enter the nucleus and perform its tumor suppression function. Another prominent and well-studied example of a gene expressing disease-specific alternative transcripts is CD44, a gene that encodes a cell-surface glycoprotein. Specifically, alternative transcripts containing exon 6 were frequently expressed at high levels in squamous cell carcinomas, adenocarcinomas, and other non-epithelial tumors; this observation was in sharp contrast to their normal tissue counterparts [6]. This discovery sparked interest in studying the potential of exon 6 as a therapeutic target site and eventually culminated in the design of a drug called bivatuzumab mertansine that could specifically target the translated amino acid sequence of exon 6; phase I clinical trials showed promising anti-cancer effects, but the trials were eventually discontinued due to unforeseen side effects [7]. Despite this, the examples of CD44 and Syk demonstrate the significance of identifying alternative transcripts, and their translated protein isoforms, which are preferentially expressed in one group of samples.
Such discoveries may serve as diagnostic markers or targets for therapeutic drugs and may ultimately result in an improvement in patient survival in the associated diseases.

1.2 Platforms for Detecting Alternative Transcript Expression

1.2.1 Conventional Gene Expression Microarrays

The first generation of conventional gene expression microarrays (referred to from now on as conventional microarrays) was introduced in 1995 [8]. They were intended to allow for large-scale gene expression quantification, and many companies have since entered the market to produce conventional microarrays (e.g. Affymetrix, Agilent). While other types of microarrays have been developed to go beyond gene expression (e.g. single nucleotide polymorphism microarrays, chromatin immunoprecipitation microarrays), this thesis focuses on the Affymetrix GeneChip© conventional microarrays; in particular, Affymetrix’s HG-U133 Plus 2.0 microarray, which is the most widely used genome-wide gene expression profiling conventional microarray for humans. As such, this section focuses primarily on the design and protocols of Affymetrix microarrays, which may vary from those of other microarray manufacturers. In brief, a microarray is a micro-miniaturized solid surface containing thousands of tiny spots. Each spot contains many immobilized copies of the same DNA oligonucleotide (25 base pairs long), called a probe, which is designed to target a specific genomic region. During a typical gene expression microarray experiment, polyA-tailed mRNAs are extracted from the sample of interest, followed by reverse transcription to generate cDNA molecules. These cDNA molecules then serve as a template for in vitro transcription by an RNA polymerase to produce cRNA molecules that are then biotinylated and hybridized to the microarray.
Each cRNA molecule then hybridizes to its complementary probe, which results in a quantifiable fluorescence emission that can be detected by a scanner. The fluorescent intensity at each spot is correlated with the total amount of hybridization and can therefore be used as a measurement of expression for a particular gene. By having thousands of probes on a single microarray, the simultaneous, large-scale quantification of gene expression for a given sample can be done with a single microarray.

For the HG-U133 Plus 2.0 microarrays, probes are characterized as being either a perfect match (PM) or a mismatch (MM) probe. Each PM probe has an antagonist MM probe which differs in only one base, at the 13th nucleotide position of the probe. The inclusion of MM probes was meant to allow researchers to measure background noise levels, and the MM probe expression level is typically subtracted from the PM probe expression level. Often, however, this leads to negative values, resulting in problems with downstream analysis; therefore, MM probes are disregarded in many microarray analysis methods and only PM probe values are used (from now on, any probe from a conventional microarray will be assumed to be a perfect match probe unless stated otherwise). Probes which interrogate approximately the same genomic region are grouped into a single probeset; therefore, it is often the probeset expression value, summarized from the expression of the corresponding probes, that is of interest and not the individual probe expression values.

The original design of the HG-U133 Plus 2.0 microarrays was to have each probeset contain approximately 11-20 probes, which resulted in over 54,000 probesets to be analyzed. Due to RNA degradation starting at the 5’ end, the majority of these probesets are designed to hybridize to the 3’ region of a target sequence, creating an overall 3’ bias on these microarrays.
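The problem with subtracting MM from PM intensities, described above, can be seen in a few lines. This is a minimal sketch: the intensities are made-up values chosen for illustration, not data from any array, and the helper name `background_corrected` is hypothetical.

```python
import math

def background_corrected(pm, mm):
    """Naive background correction: subtract each MM intensity from its PM partner."""
    return [p - m for p, m in zip(pm, mm)]

# Hypothetical raw intensities for the probe pairs of one probeset.
pm = [820.0, 310.0, 95.0, 1500.0]
mm = [400.0, 350.0, 120.0, 600.0]

diff = background_corrected(pm, mm)

# Expression values are usually analysed on a log2 scale, which is
# undefined for zero or negative differences; mark those as None.
log2_diff = [math.log2(d) if d > 0 else None for d in diff]
unusable = sum(value is None for value in log2_diff)

print(diff)
print(f"{unusable} of {len(diff)} probe pairs cannot be log-transformed")
```

Whenever an MM probe is brighter than its PM partner the difference is negative, which is one reason PM-only approaches became common.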
Based on the genomic information known at that time, many genes were known to have multiple polyadenylation sites, and sometimes the actual orientation of the gene was unknown. To account for this, a gene was often designed to be interrogated by multiple probesets that could measure the different polyadenylation sites and sometimes even both ends of the gene. As a result, a large number of genes on the HG-U133 Plus 2.0 microarrays are interrogated by multiple probesets (Figure 1.2).

[Figure 1.2 image: histogram titled "Histogram of Probesets per Gene"; x-axis: Number of Probesets (1 to 10); y-axis: Number of Genes (0 to 10000); bar labels, in order as extracted: 9558, 5561, 2955, 1457, 754, 325, 151, 62, 34, 16, 10]

Figure 1.2: Distribution of Probesets per Gene on a HG-U133 Plus 2.0 Array. Information based on Affymetrix annotation data. X-axis is truncated at 10 probesets to simplify visualization.

An assumption that is often made about genes interrogated by multiple probesets is that the expression level should be consistent across all probesets. However, Stalteri et al. [9] performed a case study of the gene Surf4, which is interrogated by multiple probesets, and demonstrated that these probesets were inconsistent in their expression patterns. These differences were attributed to alternatively expressed transcripts which resulted from different mechanisms of alternative transcript regulation. This indicates that researchers often disregard important information, available from conventional microarrays, about the alternatively expressed transcripts of a gene by simply taking the mean expression value of multiple probesets or simply the highest expression value.

1.2.2 Exon Microarrays

Following the first generation of conventional microarrays, advances in microarray technology led to an increase in microarray density, thus allowing a larger number of probes to be placed on microarrays.
Rather than placing these additional probes at the 3’ end of target sequences, the additional probes were designed to interrogate each individual exon along a gene, creating what are called exon microarrays. The consequence of this was an extended genomic view beyond the 3’ end of genes and an increased capacity to detect alternative transcript events. The most widely used human exon microarrays are from Affymetrix; they contain on average one probeset per exon, giving a total of 1.4 million probesets.

1.2.3 RNA-Seq

Unlike microarrays, high-throughput sequencing has enabled a nucleotide-level resolution of the genome that is not reliant on prior assumptions about the composition of the genome. RNA-Sequencing (RNA-Seq) [10] is a specific high-throughput sequencing protocol which targets and sequences cDNA molecules, allowing for whole transcriptome sequencing and the large-scale quantification of the mRNA content in a given sample for a particular condition. In general, mRNA molecules are targeted and isolated using their poly(A) tails, followed by reverse transcription to convert the mRNA to cDNA molecules. The cDNA molecules are fragmented into 200-500 base pair fragments, and then either one end or both ends of the fragments are sequenced, creating single-end or paired-end sequences (i.e. reads) respectively. These short reads can then be aligned back to the genome, and metrics such as RPKM (Reads Per Kilobase of Transcript per Million Mapped Reads) [11] can be used to quantify gene expression levels from mapped RNA-Seq reads.
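As a worked illustration of the RPKM definition spelled out above, the following sketch normalizes a read count by transcript length (in kilobases) and by sequencing depth (in millions of mapped reads). The function name and the example numbers are assumptions chosen for illustration only.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)

# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a library of 10 million mapped reads.
value = rpkm(read_count=500, transcript_length_bp=2_000, total_mapped_reads=10_000_000)
print(value)  # 500 / (2 kb * 10 million) = 25.0
```

Dividing by both length and depth makes values comparable between long and short transcripts and between libraries of different sizes.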
In addition to being able to measure gene expression levels, the high resolution of the data has allowed for the detection of alternative transcript expression, and many tools have been developed for this purpose (reviewed in [12]).

1.3 Bioinformatic Approaches for Detecting Differentially Expressed Alternative Transcripts

1.3.1 Methods Focusing at an Individual Probe/Probeset Level for Conventional Microarrays

Both Hu et al. and Fan et al. [13, 14] developed methods for detecting differentially expressed alternative transcripts between two groups which were focused at an individual probe/probeset level on conventional microarrays. The method of Hu et al. was developed before standard approaches for background correction and normalization in conventional microarrays were developed and accepted. As such, their method uses the aforementioned MM probes (Section 1.2.1) and focuses on individual probes and not probesets. In brief, the method 1) uses the MM probes to remove background noise from the PM probes, generating difference values and setting negative results to zero, 2) normalizes each difference value by dividing it by the probe’s average difference value across all samples to generate relative signal strength (RSS) values, and 3) normalizes each probe’s RSS value by the probe’s average RSS value across all samples. The final result is a ratio for each probe in each sample, and candidate probes and samples can be identified by looking for differences in the ratio across multiple samples.

Fan et al. observed that probes within the same probeset would often have discordant expression patterns between two groups even though they interrogate approximately the same genomic region. They hypothesized that this could be due to alternative transcript expression. To detect this, they proposed that probes be re-grouped into ’pseudo-exons’ based on their expression values.
This was done by checking whether the differences, between the two groups, in the probe intensities of the i-th and (i+1)-th probes (moving towards the 3’ end) were similar, and grouping the probes into the same pseudo-exon if they were. The entire pseudo-exon was then treated as the i-th probe and the same procedure was applied with the new (i+1)-th probe (assuming there is one). This is repeated until either the differences are no longer similar, in which case the current pseudo-exon is considered formed and a new pseudo-exon is started with the (i+1)-th probe, initiating the procedure again, or there is no (i+1)-th probe, in which case the procedure moves to another gene. Once all the pseudo-exons are formed, testing for differential expression is performed on each pseudo-exon.

The major caveat of both of these methods is that results are returned at the individual probe/probeset level without directly testing for differences between adjacent probes/probesets. Differences in the expression of these probes/probesets could simply be associated with differential gene expression between two groups of samples, and not differential transcript expression, and researchers would be required to perform a post-processing step to differentiate between these two biological events.

1.3.2 SplicerAV

Specifically designed for conventional microarrays, SplicerAV [15] takes a machine learning approach: it identifies genes interrogated by multiple probesets and then 1) computes the fold changes of probesets between two groups, 2) fits the fold change data for a single gene to a single Gaussian distribution and then to a mixture of two Gaussian distributions using the expectation maximization algorithm, and finally 3) uses the maximum likelihood ratio to compare the relative fit of the single Gaussian distribution to the mixture of two Gaussian distributions.
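The three steps above can be sketched in plain Python. This is not SplicerAV's actual code: the function names, the initialisation at the data extremes, the fixed number of EM iterations, and the variance floor `min_var` (added here only to keep the fit numerically stable) are all assumptions, and the fold changes are invented.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def single_gaussian_loglik(xs, min_var=1e-3):
    """Log-likelihood of the data under one maximum-likelihood Gaussian."""
    mean = sum(xs) / len(xs)
    var = max(sum((x - mean) ** 2 for x in xs) / len(xs), min_var)
    return sum(math.log(gaussian_pdf(x, mean, var)) for x in xs)

def two_gaussian_loglik(xs, n_iter=100, min_var=1e-3):
    """Log-likelihood under a two-component Gaussian mixture fitted by EM."""
    overall_mean = sum(xs) / len(xs)
    var0 = max(sum((x - overall_mean) ** 2 for x in xs) / len(xs), min_var)
    means, variances, weights = [min(xs), max(xs)], [var0, var0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            p = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(2)]
            total = sum(p)
            resp.append([pc / total for pc in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for c in range(2):
            nc = sum(r[c] for r in resp)
            if nc < 1e-12:
                continue  # component received no support; leave it unchanged
            weights[c] = nc / len(xs)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / nc
            variances[c] = max(var, min_var)
    return sum(
        math.log(sum(weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(2)))
        for x in xs
    )

def likelihood_ratio(fold_changes):
    """2 * (mixture log-likelihood - single-Gaussian log-likelihood)."""
    return 2 * (two_gaussian_loglik(fold_changes) - single_gaussian_loglik(fold_changes))

# Hypothetical log2 fold changes for the probesets of one gene:
# three probesets go down and three go up between the two groups.
discordant_gene = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
lrt = likelihood_ratio(discordant_gene)
print(lrt)
```

A large ratio means the two-component mixture explains the fold changes much better than a single Gaussian. Note that when a gene has only two probesets, the mixture in this sketch can place one component on each point, so the statistic is then driven largely by the variance floor rather than by the data.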
The assumption is that a mixture of two Gaussian distributions will fit these data better in situations where a gene has differentially expressed transcripts. However, on conventional microarrays the average number of probesets interrogating a gene is small (∼2.08 probesets per gene on HG-U133 Plus 2.0). Typical machine learning approaches require many more data points to learn and fit a Gaussian distribution, and since SplicerAV only uses the data for one gene at a time, without borrowing information across genes, this can potentially cause overfitting, resulting in a considerable number of false positive results.

1.3.3 SI

Splicing Index (SI) was one of the first proposed metrics for detecting differentially expressed transcripts between two groups [16] on exon arrays. The metric works by identifying exons which differ in their inclusion rates between two groups by first calculating a gene-level normalized intensity (NI):

$$NI_{i,j,k} = \frac{e_{i,j,k}}{g_{j,k}}$$

where $e_{i,j,k}$ is the expression of the $i$-th probeset interrogating the $j$-th gene for the $k$-th sample and $g_{j,k}$ is the expression of the $j$-th gene for the $k$-th sample. The expression level of a gene for a sample is determined by the mean expression across all probesets interrogating the gene for that sample. $NI_{i,j,k}$ normalizes each probeset to the expression level of its gene, which allows comparison between two samples to generate the SI:

$$SI_{i,j} = \frac{NI_{i,j,a}}{NI_{i,j,b}}$$

where $a$ and $b$ are two different samples. To generalize this to compare two groups of samples, the average NI values would be taken from each group and used to generate the SI. The magnitude of the SI is indicative of the amount of differential splicing occurring between the two groups, and the directionality of the differential splicing can be interpreted by the sign of the log2-transformed SI (i.e.
a “+” sign would indicate preferential inclusion in group a samples, while a “-” sign would indicate preferential inclusion in group b samples).

1.3.4 FIRMA

Finding isoforms using robust multichip analysis (FIRMA) is a method developed for detecting alternative splicing, specifically exon skipping, on Affymetrix exon arrays [17]. It builds on the popular microarray normalization and summarization method called RMA [18]:

$$\log_2(PM_{ijk}) = \beta_{jk} + \alpha_{ij} + \epsilon_{ijk}$$

• $\log_2(PM_{ijk})$ is the background-corrected log2 expression level of the $i$-th perfect match probe in the $j$-th probeset on the $k$-th sample
• $\beta_{jk}$ is the log2 expression level of the $j$-th probeset on the $k$-th sample (denoted as the chip effect)
• $\alpha_{ij}$ is the probe effect of the $i$-th probe in the $j$-th probeset
• $\epsilon_{ijk}$ is the residual of the RMA fit, capturing unexplained variation.

The idea behind FIRMA is that a large value of $\epsilon_{ijk}$ is an indication of a potential exon skipping event, and thus FIRMA frames the prediction of splicing events as an outlier detection problem:

$$\epsilon_{ijk} = \log_2(PM_{ijk}) - \hat{\beta}_{jk} - \hat{\alpha}_{ij}$$

The final FIRMA score for a given probeset can be generated by summarizing the probe residuals using the median divided by the median absolute deviation. Therefore, FIRMA scores are generated at a per probeset/exon level, which is appropriate for the detection of exon skipping events.

1.3.5 Tuxedo Suite

Arguably, the most well-established and widely adopted set of tools for RNA-Seq analysis is the Tuxedo Suite, which is composed of three programs: Bowtie, TopHat and Cufflinks [19, 20, 21]. Bowtie is a fast short-read aligner which uses a Burrows-Wheeler index technique to achieve a small memory footprint. One caveat of Bowtie is its disregard for the reads that span splice junctions (i.e. exon-exon boundaries).
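Why a junction-spanning read defeats an aligner that expects contiguous matches can be shown with a toy example. The sequences below are invented, and exact substring search with `str.find` merely stands in for alignment; it is not how Bowtie or TopHat work internally.

```python
# Toy "genome": two exons separated by an intron (all sequences invented).
exon1 = "ATGGCCTTAG"
intron = "GTAAGTCCCTTTTCAG"  # begins with GT and ends with AG
exon2 = "CATTGACCGA"
genome = exon1 + intron + exon2

# The mature mRNA has the intron removed, so a read sequenced from it
# can straddle the exon1-exon2 junction.
mrna = exon1 + exon2
read = mrna[5:15]  # last 5 bases of exon1 followed by first 5 bases of exon2

# Searching for the whole read in the genome fails (find returns -1) ...
contiguous_hit = genome.find(read)

# ... but each half of the read matches on its own side of the intron.
left_hit = genome.find(read[:5])
right_hit = genome.find(read[5:])
gap = right_hit - (left_hit + 5)

print(contiguous_hit, left_hit, right_hit, gap)
```

In this sketch the gap between the two half-matches equals the intron length, which is the kind of evidence a split-aware aligner uses to place such a read across a splice junction.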
These types of reads are a unique feature of RNA-Seq data: because they are split between two exons, typical aligners are unable to align them properly and treat them as unmapped reads. TopHat was designed specifically to detect de novo splice junctions by building on Bowtie. It does so by 1) using Bowtie to align the reads and setting aside all the initially unmapped reads (IUM), 2) generating potential splice junctions from the aligned reads using the canonical GT-AG dinucleotide sequence that flanks splice sites, and then 3) attempting to map all the IUM to the potential splice junctions. The result is a list of reads aligned to exons and splice junctions, which can be used as input to Cufflinks, a transcript assembly and quantification program that accepts aligned reads from a split-aware read aligner like TopHat. A parsimonious set of transcripts is assembled based on the information from paired-end reads and split reads (Figure 1.3). Quantification of the transcripts is performed by checking the compatibility of reads with the assembled transcripts and counting the compatible reads. The quantification values are reported as FPKM (Fragments Per Kilobase of exon per Million fragments mapped) values per transcript, a variant of RPKM in which a paired-end read represents a single fragment rather than two individual reads.

[Figure 1.3 image: top panel labelled "Reads", bottom panel labelled "Alternative Transcripts"]

Figure 1.3: Cufflinks Assembly of Transcripts using RNA-Seq Data. The top panel contains a set of hypothetical paired-end reads which are linked by solid black lines. Dashed black lines represent reads which are split at a splice junction. Purple, red, and blue reads support the purple, red, and blue transcripts respectively in the bottom panel. Yellow reads support all three transcripts.
The black paired-end read represents a scenario where the read supports both the red and purple transcripts, but is more likely associated with the red transcript since the distance between the paired reads is aberrant.

In addition to assembling and quantifying transcripts, Cufflinks can look for differences in the transcriptome between two groups of samples through the Cuffdiff program.

Figure 1.4: Differences in the Transcriptome Detected by Cuffdiff. A hypothetical scenario where Gene X produces a pre-mRNA transcript, facilitated by promoter J, which is alternatively spliced to form three different alternative transcripts. These three different alternative transcripts then get translated into three different protein isoforms. Cuffdiff is capable of detecting the following biological events between two groups of samples: A) Differential Gene Expression and Differential Promoter Usage, B) Differential pre-mRNA Transcript Expression, C) Differential Splicing Pattern, D) Differential Alternative Transcript Expression, E) Differential Protein Coding Expression.

1.4 Diffuse Large B-Cell Lymphomas

The normal life cycle of an organism requires the growth and replication of new cells to replace old and dying cells. A typical cell in the human body has been programmed with several mechanisms which tightly regulate its growth and proliferation. Cancer occurs when these mechanisms are compromised, resulting in uncontrollable growth of cells which may result in malignant tumours [22].
Cancer can occur in any organ and the causes of various types of cancers are still unknown. The focus of this thesis will be on lymphoma and in particular a subtype of lymphoma called diffuse large B-cell lymphoma (DLBCL). Lymphoma is a group of cancers that affect lymphocytes, a type of white blood cell of the immune system. During cancer development, lymphocytes are transformed into malignant cells that eventually settle in lymph nodes to form solid tumours. Lymphoma can be further classified into two major categories, Hodgkin lymphoma (HL) and non-Hodgkin lymphoma (NHL); these two categories are distinguished based on the presence or absence of Hodgkin Reed–Sternberg cells, the frequency of tumour cells in the biopsy, and the composition of the microenvironment. Approximately 30-40% of new lymphoma cases are DLBCL, making it the most common type of NHL, and of lymphoma in general [23]; it is characterized by sheets of large malignant B-cells. Two major molecular subtypes of DLBCL have been identified through gene expression profiling: activated B-cell like (ABC) and germinal centre B-cell like (GCB) [24]. Aside from differing in the expression of thousands of genes, the two subtypes are well distinguished by the cell of origin (i.e. the stage of differentiation at which malignant transformation occurs): ABC transforms when B-cells become activated and GCB transforms when the B-cells are in the germinal centre. It is also well known that the two subtypes differ in their survival rates post cancer treatment, with ABC being the more lethal subtype, although the cause of this survival discrepancy is still largely unknown [25].

1.5 Thesis Overview

Conventional microarrays have been a driving force for high-throughput research for the past two decades.
As a result an enormous collection of these microarrays, in particular the Affymetrix HG-U133 Plus 2.0 (∼57,000 in GEO as of October 2011), has been generated and their expression datasets are now available in public repositories. An unappreciated feature of these microarrays is the fact that large numbers of genes are interrogated by multiple probesets. Researchers will often under-utilize these data by either taking the mean expression of the multiple probesets or simply the highest value to generate a summarized expression value for each gene; however, previous literature has suggested that discrepancy in the expression values of these probesets can be evidence of alternative transcript expression [9, 26]. This means valuable information can potentially be gleaned from already existing and publicly available datasets which are typically disregarded as viable data for detecting alternative transcript expression.

The goal of this thesis is to explore the possibility of detecting differentially expressed alternative transcripts between two groups of samples using conventional microarrays (in particular the HG-U133 Plus 2.0 microarrays). We approached this problem by 1) explaining why existing bioinformatic approaches are insufficient at solving the problem (discussed in Sections 1.3.1 and 1.3.2) and designing a novel algorithm, called DISCO, to detect differentially expressed alternative transcripts, 2) evaluating the performance of DISCO using a ground truth tumor-normal colon cancer dataset of RT-PCR validated differential splicing events, 3) demonstrating DISCO’s superior performance over an existing method using an internal (36 samples) and external (200 samples) cohort of DLBCL samples assayed on HG-U133 Plus 2.0 microarrays, with RNA-Seq data as a high resolution dataset, and finally 4) discussing the biological relevance of two highly ranked genes nominated by DISCO when applied to DLBCL samples.
It is important to note that we are focusing only on genes which produce alternative transcripts and not genes which produce only a single transcript. Therefore, from now on the term transcript(s) will refer to alternative transcript(s) for brevity unless stated otherwise.

Chapter 2

Methodology

2.1 Chapter Synopsis

This chapter discusses in detail the algorithmic and mathematical components of the DISCO (DIscordancy SCOre) method as well as how to effectively visualize the DISCO results. This is followed by a discussion of issues with the probeset design on Affymetrix GeneChip© conventional microarrays and how the performance of DISCO can be improved by using an alternative probeset re-grouping structure. Finally, specific implementation details of DISCO are mentioned along with its run time.

2.2 The DISCO Algorithm

Let $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_m\}$ be two groups of samples (e.g. disease vs. normal, disease subtype A vs. disease subtype B). DISCO accepts probeset expression values from a summarization algorithm (e.g. RMA) as input. DISCO then proceeds through three major steps:

1. For each probeset j, calculate the fold change (FC) between the two groups:

$$FC_j = \frac{1}{n}\sum_{i=1}^{n} a_{i,j} - \frac{1}{m}\sum_{i=1}^{m} b_{i,j}$$

where $a_{i,j}$ and $b_{i,j}$ represent the log2 transformed expression values of the jth probeset in the ith sample of groups A and B, respectively.

2. For each pair of probesets, j and k, interrogating the same gene, calculate the difference in FC (DiffFC):

$$DiffFC_{j,k} = FC_j - FC_k$$

By calculating the DiffFC between two probesets interrogating the same gene, we get a measure of inconsistency between the differential expression patterns of two different regions of the same gene in two groups of samples. A caveat of using DiffFC is that all the probeset data across a single group gets summarized into a single value and thus their variance across the samples is lost.
Therefore, a weighting step is performed to take this into account.

3. Weight each DiffFC by its pooled standard deviation $sd_{j,k}$:

$$\frac{DiffFC_{j,k}}{sd_{j,k}}$$

$$sd_{j,k} = \sqrt{\frac{(n-1)s^2_{A,j} + (m-1)s^2_{B,j} + (n-1)s^2_{A,k} + (m-1)s^2_{B,k}}{2n + 2m - 4}}$$

$$s^2_{A,j} = \sum_{i=1}^{n}\left(a_{i,j} - \frac{1}{n}\sum_{i=1}^{n} a_{i,j}\right)^2 \qquad s^2_{B,j} = \sum_{i=1}^{m}\left(b_{i,j} - \frac{1}{m}\sum_{i=1}^{m} b_{i,j}\right)^2$$

$$s^2_{A,k} = \sum_{i=1}^{n}\left(a_{i,k} - \frac{1}{n}\sum_{i=1}^{n} a_{i,k}\right)^2 \qquad s^2_{B,k} = \sum_{i=1}^{m}\left(b_{i,k} - \frac{1}{m}\sum_{i=1}^{m} b_{i,k}\right)^2$$

By taking standard deviation into account, probesets which have stable expression (i.e. low standard deviation across samples in each group) are given more weight to indicate a higher confidence in their DiffFC score. Often lowly expressed genes have small $sd_{j,k}$ values and therefore give a false sense of discordancy for the probesets interrogating the gene. This is corrected by using a moderating constant $s_\alpha$ [27], which minimizes the coefficient of variation of the DISCO values (defined below):

(a) For all $sd_{j,k}$ values calculate the 100 quantiles, where $q_1 < q_2 < \ldots < q_{100}$.
(b) Let $s_\alpha$ be the αth quantile of all the $sd_{j,k}$ values.
(c) Let $(DISCO)^{\alpha}_{j,k} = \frac{DiffFC_{j,k}}{sd_{j,k} + s_\alpha}$.
(d) For each $\alpha_l \in \{0, 5, 10, \ldots, 100\}$:
  i. Calculate $(DISCO)^{\alpha_l}_{j,k} = \frac{DiffFC_{j,k}}{sd_{j,k} + s_{\alpha_l}}$
  ii. Calculate $b_i = \mathrm{MAD}\left((DISCO)^{\alpha_l}_{j,k} \mid sd_{j,k} \in [q_i, q_{i+1}]\right)$, $i \in \{0, 1, \ldots, 100\}$, where MAD is the median absolute deviation.
  iii. Calculate the coefficient of variation, $cv(\alpha_l)$, of all the $b_i$ values.
(e) The $s_\alpha$ which results in $\mathrm{argmin}[cv(\alpha)]$ is selected as the DISCO moderating constant.

Intuitively, what we are attempting to do is separate signal from noise. We want to bin together DISCO values which have similar standard deviations and avoid binning together DISCO values which have diverse standard deviations. The MAD metric for each bin gives a measure of this, and low values would be a consequence of the bin containing similar DISCO values as well as similar standard deviations. The coefficient of variation gives us an overall measure across all bins for a given $s_\alpha$. The addition of $s_\alpha$ to the denominator of the DISCO calculation results in DISCO values being independent of gene expression levels and therefore comparable across genes. This gives a final metric of:

$$(DISCO)^{\alpha}_{j,k} = \frac{DiffFC_{j,k}}{sd_{j,k} + s_\alpha}$$

DISCO values are computed between each pair of probesets interrogating the same gene, and results for a specific gene can be visualized by using a symmetrical heatmap called a DISCO Matrix. A highly intense column in the DISCO matrix would be indicative of a probeset whose differential expression pattern between two groups of samples is discordant with respect to other probesets interrogating the same gene. DISCO values for a given probeset can then be summarized into a single score. Various summarization functions could be used, and the most trivial function would be the mean of all the DISCO values (excluding the DISCO value of the probeset compared to itself). This works well in situations where a single probeset’s differential expression pattern is discordant from the rest of the probesets (Panel A of Figure 2.1).

Figure 2.1: DISCO Matrices of Genes with Discordant Patterns of Differentially Expressed Probesets. (A) DISCO Matrix of a gene with a single probeset’s differential expression pattern that is discordant from the rest of the probesets. (B) DISCO Matrix of a gene with multiple probeset differential expression patterns that are discordant from the rest of the probesets.
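The three steps of Section 2.2 and the selection of the moderating constant can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the R/Python pipeline used in this thesis: the function names (`disco_matrix`, `choose_s_alpha`, `summarize`) are hypothetical, the group variances are taken to be the usual unbiased sample variances (an assumption about the $s^2$ terms), and the summary uses the absolute DISCO values (also an assumption, made so that the matrix is symmetric).

```python
import numpy as np


def disco_matrix(a, b, s_alpha=0.0):
    """Pairwise DISCO values for the probesets of one gene.

    a, b: arrays of shape (samples, probesets) with log2 expression
    values for groups A and B. Returns (diff_fc, sd, disco), each of
    shape (probesets, probesets).
    """
    n, m = a.shape[0], b.shape[0]
    fc = a.mean(axis=0) - b.mean(axis=0)        # step 1: FC_j
    diff_fc = fc[:, None] - fc[None, :]         # step 2: DiffFC_{j,k}
    # Step 3. Assumption: s^2 is the unbiased sample variance (ddof=1).
    ss = (n - 1) * a.var(axis=0, ddof=1) + (m - 1) * b.var(axis=0, ddof=1)
    sd = np.sqrt((ss[:, None] + ss[None, :]) / (2 * n + 2 * m - 4))
    return diff_fc, sd, diff_fc / (sd + s_alpha)


def mad(x):
    """Median absolute deviation."""
    return np.median(np.abs(x - np.median(x)))


def choose_s_alpha(diff_fc, sd, alphas=range(0, 101, 5)):
    """Pick the s_alpha that minimises the CV of the per-bin MADs.

    diff_fc, sd: 1-D arrays over all probeset pairs of all genes.
    """
    edges = np.percentile(sd, np.arange(0, 101))   # quantiles of sd
    bins = np.digitize(sd, edges[1:-1])            # bin index 0..99
    best_cv, best_s = np.inf, None
    for alpha in alphas:
        s_alpha = np.percentile(sd, alpha)
        disco = diff_fc / (sd + s_alpha)
        b = np.array([mad(disco[bins == i]) for i in np.unique(bins)])
        cv = b.std(ddof=1) / b.mean()
        if cv < best_cv:
            best_cv, best_s = cv, s_alpha
    return best_s


def summarize(disco):
    """Mean absolute DISCO value per probeset, excluding the diagonal."""
    p = disco.shape[0]
    off_diag = ~np.eye(p, dtype=bool)
    return np.array([np.abs(disco[j, off_diag[j]]).mean() for j in range(p)])
```

In use, `disco_matrix` would be called once per gene with `s_alpha=0` to collect the DiffFC and sd values of all probeset pairs (for example the upper triangle of each matrix), `choose_s_alpha` would then be run on the pooled values, and the matrices recomputed with the selected constant before calling `summarize`.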
In situations where multiple probeset differential expression patterns are discordant from the remaining probesets, the summarization of DISCO values is not trivial because the DISCO values for probesets which follow the same differential expression pattern will be low (Panel B of Figure 2.1) and thus down-weight the summarized mean DISCO value. A more sophisticated technique can be applied here, such as k-means or a mixture of Gaussian distributions [15], to cluster the probesets into two distinct groups and then summarize only with respect to the other group. Ultimately, the summarized DISCO value of a probeset serves as an overall indication of a discordant pattern of differential expression, and the researcher should examine the entire DISCO matrix to interpret the biological phenomenon that is occurring (see Section 3.2 for an empirical example of this).

2.3 Regrouping of Probes on GeneChips© using Custom CDFs

When the probes on the human Affymetrix GeneChips© were first designed, they were selected using the most up to date genomic information available at that time. Affymetrix provided the grouping of these probes into their respective probesets through what are called chip description files (CDF), which allowed for downstream bioinformatics analysis. As our understanding of the human genome increased, the design of the CDFs was revisited by others and many problems were discovered with them. For example, Harbig et al. [28] analyzed the probe design on the HG-U133 Plus 2.0 microarray by re-aligning all the probes to GenBank. They reported the following issues:

1. Over 5000 probesets had at least one probe which was cross-hybridizing to another genomic locus
2. Many of the individual probes were found not to hybridize to any known human mRNA
3. 
A number of probesets were found to be chimeric, where half of the probes would hybridize to one locus and the other half to another locus

To address these issues, Dai et al. [29] reorganized the probes on numerous Affymetrix GeneChips©, including the HG-U133 Plus 2.0, by re-aligning all the probes using the most up to date genomic information. The probes were heavily filtered for cross-hybridization, no hybridization, wrong orientation, and other problematic hybridization issues. Following the filtering, the probes were re-grouped into probesets following a protocol which is dependent on the biological question of interest. For example, if a researcher is particularly interested in gene expression then a gene-centric CDF should be used, since it was created by grouping together all the probes binding to the same gene into a single probeset; this eliminates the potential need to summarize over multiple probesets and also increases the number of probes available to summarization algorithms, which may improve expression level estimates. The flexibility to perform these re-groupings has resulted in the creation of various CDFs which are freely available for researchers to download and use. Of particular interest is the exon-centric CDF, which was created by re-grouping all probes which align to the same exon as annotated by Ensembl. Table 2.1 shows a comparison of the Affymetrix CDF to the exon-centric (version 14.1) CDF.

CDF                   No. of Probes   No. of Probesets   No. of Genes with > 1 Probeset
Original Affymetrix   604,258         54,675             11,344
Exon-centric          328,716         40,934             9,982

Table 2.1: Differences in the Affymetrix CDF and the Exon-Centric CDF

The filtering of problematic probes results in a total of 45.6% of probes being removed, which consequently causes a 25% drop in the total number of probesets and a 12% drop in the number of genes interrogated by multiple probesets. The trade-off is an improvement in the measurement of expression levels [30] centered around exons instead of just transcripts. This drop is indicative of the noisy design of these microarrays and emphasizes the importance of using a re-annotated CDF to reduce technical noise as much as possible. Therefore, we recommend the use of the exon-centric CDF in conjunction with the DISCO algorithm. This, however, does not exclude the DISCO algorithm from being used with other CDFs such as the original Affymetrix one.

2.4 Implementation of DISCO

DISCO was implemented as a pipeline of R and Python scripts. DISCO was designed to look for differentially expressed transcripts between two groups and thus requires that users dichotomize the samples into two groups before running DISCO. Additionally, it will only return DISCO values for probesets which interrogate a gene with multiple probesets. Otherwise, the probesets are disregarded in the analysis. The run-time depends on the number of samples and the number of genes to be analyzed. As an example, a dataset of 36 microarrays with a total of 9,982 analyzable genes was processed through DISCO in approximately 10 minutes on a typical computer.

Chapter 3

Results

3.1 Chapter Synopsis

The chapter begins with an evaluation of the sensitivity and specificity of DISCO using a previously published RT-PCR validated dataset. Two popular methods for detecting differentially expressed transcripts using exon microarrays, SI and FIRMA, are then benchmarked against DISCO. An empirical example is given to explain the reasons for DISCO’s slight underperformance compared to FIRMA and also to show limitations in using a naive summarization technique for DISCO values. This is followed by a comparison to another method for conventional microarrays, SplicerAV, and a demonstration of DISCO’s superior performance over it.
Finally, an application of DISCO to an internal and external cohort of DLBCL samples, dichotomized into the subtypes ABC and GCB, is performed, evidence for differentially expressed transcripts in two high ranking genes is shown, and their biological relevance is discussed.

3.2 Evaluating the Performance of DISCO

While the DISCO algorithm was developed to analyze conventional microarrays, its design allows it in principle to be applied to any gene expression platform with at least two data points on the same gene. Therefore, for the purposes of quantifying the sensitivity and specificity of the DISCO algorithm, we retrieved a ground truth dataset of 20 matched tumor-normal colon cancer samples assayed on the Affymetrix Human Exon 1.0 ST microarrays with RT-PCR validated results [31] (see Section B.1 for a discussion on validation using RT-PCR). A total of 46 genes with their corresponding exons were selected for RT-PCR validation, of which 13 genes were confirmed to have differentially spliced cassette exons between the matched tumor and normal samples, while no evidence of differentially spliced cassette exons could be observed in the remaining genes. These exons were selected for RT-PCR validation based on their high ranking SI scores and on differential splicing events previously described in the literature. Based on a previous study [17], which also used these data as a ground truth dataset, and our own analysis, we identified a total of 41 probesets which could be unambiguously associated with the exons which were RT-PCR validated. Of these, 15 were true positive events (one exon was interrogated by two probesets) and 26 were true negative events. We ran DISCO on the exon arrays and computed a receiver operating characteristic (ROC) curve using the summarized mean DISCO values of the true positive and true negative events (Figure 3.1).

Figure 3.1: ROC Curve of the DISCO Values. This was performed on the mean DISCO values of probesets interrogating the RT-PCR validated exons. The corresponding DISCO value for a true positive and false positive rate can be identified by associating the colour to the DISCO value using the coloured scale on the right. A DISCO value between 0.65 and 0.80 would be an optimal threshold for calling candidate probesets.

The area under the curve (AUC) of the ROC curve was 0.77, indicating a good ability to discriminate the true positive from the true negative events. Based on the ROC curve, the optimal threshold to call candidate genes would be anywhere between 0.65 and 0.80, since these points give the best trade-off between specificity and sensitivity. We examined some of the true negative events with high DISCO values, like NME1, which had the second highest DISCO value, and the DISCO plot (Figure 3.2) of these data suggests that differentially expressed alternative transcripts are occurring between tumor and normal samples.

Figure 3.2: Tumor vs. Normal DISCO Plot of NME1. The very top panel is the DISCO Matrix and juxtaposed underneath are the corresponding probeset expression values summarized using boxplots for each group. The blue lines link the probesets to the corresponding exons and RefSeq transcripts which the probesets interrogate. The dashed red box represents two probesets interrogating an exon which was RT-PCR validated to be constitutively spliced in both normal and tumor samples.
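Returning briefly to the ROC analysis of Figure 3.1, the calculation can be sketched as follows. This is a minimal illustration rather than the code used to produce the figure: the function names are hypothetical, the AUC is obtained with the trapezoidal rule, and the "best trade-off" threshold is chosen by maximising sensitivity + specificity - 1 (Youden's J), which is an assumption about how such a trade-off could be quantified.

```python
import numpy as np


def roc_curve(scores, labels):
    """Return (fpr, tpr, thresholds), sweeping the threshold from high to low.

    scores: summarized DISCO value per probeset.
    labels: True for RT-PCR validated positives, False for negatives.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    scores, labels = scores[order], labels[order]
    tp = np.cumsum(labels)       # true positives called at each cut-off
    fp = np.cumsum(~labels)      # false positives called at each cut-off
    # Keep one point per distinct score so that ties are handled together.
    last = np.r_[np.diff(scores) != 0, True]
    tpr = np.r_[0.0, tp[last] / labels.sum()]
    fpr = np.r_[0.0, fp[last] / (~labels).sum()]
    thresholds = np.r_[np.inf, scores[last]]
    return fpr, tpr, thresholds


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def best_threshold(fpr, tpr, thresholds):
    """Threshold maximising Youden's J = TPR - FPR (an assumed criterion)."""
    return float(thresholds[np.argmax(tpr - fpr)])
```

With the 41 validated probesets, `scores` would hold the summarized mean DISCO values and `labels` the 15 true positive and 26 true negative calls.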
Exon 2 was RT-PCR validated to be constitutively spliced in both groups, which is supported by the probeset expression data. The remaining probesets on the gene show a distinct trend in which the tumor samples have considerable overexpression. When looking at the interrogation sites of these probesets, one could hypothesize that the transcript NM_198175 is expressed at equivalent levels, while the transcript NM_000269 is overexpressed in tumor, indicating differentially expressed transcripts between the two groups. This biological event was missed in the original predictions since the focus was not on the data across the gene but on the two probesets, 3726944 and 3726945, and their interrogated exon, which had been reported in the literature as being differentially spliced between normal and tumor samples. DISCO values and the DISCO matrix allow researchers to move away from focusing on individual probesets and potentially capture more complicated differential transcript expression events, like NME1 in this scenario, that would not have been possible to observe by simply looking at the individual SI scores of two probesets.

This ground truth dataset also offers an opportunity to benchmark DISCO against existing methods for detecting differentially expressed alternative transcripts using exon arrays, such as SI and FIRMA (discussed in Sections 1.3.3 and 1.3.4 respectively). ROC curves were generated based on the scores from SI and FIRMA and the AUCs were 0.74 and 0.79 respectively. This indicates that DISCO (AUC of 0.77) outperforms the popular and widely used SI. While DISCO is outperformed by FIRMA, this is likely due to differences in how the final scores are calculated in both methods.
FIRMA scores are derived at a single probeset level, which is in contrast to DISCO values, which are pairwise scores that can be summarized into one value by taking the mean of the DISCO values; however, there are caveats (discussed in Section 2.2) to just taking the mean of DISCO values, which may be confounded in situations where there are complicated patterns of differentially expressed transcripts. As an example, Figure 3.3 shows the DISCO plot of COL6A3 with an enlarged column view of probeset 2605391 from the DISCO matrix.

Figure 3.3: Tumor vs. Normal DISCO Plot of COL6A3. The red box represents an enlarged view of the DISCO values associated with probeset 2605391. The black-green scale indicates the range of the DISCO values and the black arrows emphasize the low DISCO values due to these probesets displaying a similar differential expression pattern (indicated by the orange boxes) as probeset 2605391. The dashed red box represents the probeset interrogating exon 4, which was RT-PCR validated to be preferentially included in tumor samples.
The orange text at the bottom represents the corresponding exons going 5’ to 3’.

Exon 4, interrogated by probeset 2605391, was RT-PCR validated to be a cassette exon that is preferentially included in tumor samples. Relative to the summarized DISCO values of the other probesets, the summarized DISCO value of probeset 2605391 is the 16th highest. In contrast, the FIRMA score of probeset 2605391 is the 9th highest. The lower rank of probeset 2605391 can be attributed to eight other probesets (2605330, 2605333, 2605337, 2605350, 2605386, 2605390, 2605396, 2605400) displaying a similar differential expression pattern between the two groups; this results in several low DISCO values for probeset 2605391 (indicated by the black arrows in the enlarged column view of probeset 2605391) and thus confounds the summarization. The similar differential expression pattern of other probesets is likely due to COL6A3 displaying a slightly complicated pattern of differentially expressed transcripts, since probeset 2605396 interrogates exon 3, which was also RT-PCR validated to be differentially spliced. The same holds for exon 6, which is interrogated by probeset 2605386. All of this is further complicated by the fact that probeset 2605390 also interrogates exon 4 and thus will have a low DISCO value when compared to probeset 2605391. In this situation, a more favourable and sophisticated approach would be to cluster the DISCO values into two groups and then summarize only within each group. An approach like this would result in a higher summarized DISCO value that increases the rank of probeset 2605391 from 16th to 14th and increases the AUC to 0.78.
Several other genes like CTTN (Figure B.2) and FN1 (Figure B.3) also display complicated patterns of differentially expressed transcripts, and thus the summarized DISCO values of their corresponding probesets can be improved with an approach like this. This demonstrates that the summarization technique is dependent on the particular biological scenario; however, it is of great importance to view the entire DISCO matrix rather than focusing on a single summarized DISCO value, as mentioned in Section 2.2.

3.3 Comparison of DISCO to Other Conventional Microarray Methods

As discussed in Section 1.3.1, both Hu et al. and Fan et al. suggested methods for detecting differentially expressed transcripts using conventional microarrays by investigating at an individual probe/probeset level, but both of these methods then require the researcher to perform a post-processing step of comparing to other probes/probesets to identify whether a discordant expression pattern is occurring. Neither of these two methods is directly comparable to DISCO since their predictions do not consider the data from the remainder of the gene. The only other method applicable to conventional microarrays, to our knowledge, is SplicerAV, discussed in Section 1.3.2.

To compare the performance of DISCO to SplicerAV, we analyzed a small previously published internal cohort of 36 DLBCL samples [32] assayed on HG-U133 Plus 2.0 arrays which had matched RNA-Seq libraries [33]. This cohort was dichotomized into 13 ABC and 23 GCB samples based on their gene expression profiles and then processed through DISCO and SplicerAV to generate lists of candidate genes that have differentially expressed transcripts.
Due to the lack of a ground truth dataset on HG-U133 Plus 2.0 arrays, we used the RNA-Seq libraries as a high resolution dataset to assess how concordant the predictions of the two methods were with the RNA-Seq predictions derived from a TopHat/Cufflinks pipeline. The concordancy was quantified and visualized through the use of concordancy plots (see Section A.4 for specific details on the construction of these plots). Panel A of Figure 3.4 shows a sensitivity concordancy plot in which we see that DISCO’s and SplicerAV’s most highly ranked predictions are fairly concordant with RNA-Seq predictions.

Figure 3.4: Sensitivity/Specificity Concordancy Plots of DISCO and SplicerAV Predictions to RNA-Seq predictions for ABC vs. GCB in the Internal Cohort. SplicerAV’s sensitivity line is superimposed on top of DISCO’s sensitivity line, due to equivalent sensitivity, for the first 8 genes. The dotted black line represents an experiment where genes are randomly selected and checked for concordancy with RNA-Seq. It is the average concordancy of 1000 of these random experiments. (A) Sensitivity Concordancy Plot. (B) Specificity Concordancy Plot.

However, SplicerAV’s concordancy with RNA-Seq drops considerably by the 10th ranked gene, and by the 50th ranked gene the concordancy is the same as randomly selecting genes as candidate genes. The ranked genes from DISCO continue to remain highly concordant, with the top 50 genes being approximately 70% concordant with RNA-Seq predictions.
Panel B of Figure 3.4 shows a specificity concordancy plot in which SplicerAV has a better concordancy in the lowest ranked genes than DISCO, but overall neither method is very concordant. This suggests that a large number of potential candidate genes are being missed, but this is not surprising as the conventional microarrays lack the resolution that RNA-Seq data has to predict differentially expressed alternative transcripts. We checked the average number of probesets on the top 50 ranked DISCO genes, which was 3.82. Conversely, the average number of probesets on the bottom 50 ranked DISCO genes was only 2, which is the minimum number of probesets required for DISCO to analyze a gene. This supports the notion that DISCO’s bottom ranked genes are potentially negative due to a lack of data resolution and not specifically because the method is miscalling them.

To further compare the performance of DISCO to SplicerAV as well as to validate the method using a larger dataset, an independent externally published DLBCL cohort [25] was obtained. This external cohort contained 93 ABC and 107 GCB samples assayed on HG-U133 Plus 2.0 microarrays. As with the smaller internal cohort, both methods were run on this large external cohort to produce a list of candidate genes, which was then compared against the same set of RNA-Seq predictions as the internal cohort. Panel A of Figure 3.5 shows the sensitivity concordancy plot of the external cohort, in which DISCO remains highly concordant for the highest ranking genes and then drops to approximately 65% after the first 20 highest ranked genes.

Figure 3.5: Sensitivity/Specificity Concordancy Plots of DISCO and SplicerAV Predictions to RNA-Seq predictions for ABC vs. GCB in the External Cohort. The dotted black line represents an experiment where genes are randomly selected and checked for concordancy with RNA-Seq. It is the average concordancy of 1000 of these random experiments. (A) Sensitivity Concordancy Plot. (B) Specificity Concordancy Plot.

This drop is expected since the RNA-Seq libraries used for the comparison do not match the HG-U133 Plus 2.0 microarrays from the external cohort. Nevertheless, the concordancy remains similar, which indicates the stability of DISCO across different cohorts. On the other hand, the external SplicerAV concordancy fluctuates above and below the random line for the top 20 ranked genes, indicating that a considerable proportion of the highly ranked genes are in fact not concordant with the RNA-Seq predictions. This provides evidence for the aforementioned problem that the SplicerAV method might call large numbers of false positives due to overfitting. Panel B of Figure 3.5 shows the specificity concordancy plot of both methods: DISCO’s specificity concordancy improves considerably while SplicerAV’s remains relatively the same. This suggests that DISCO’s specificity can be improved by increasing the number of samples analyzed, while SplicerAV is unable to benefit from a larger cohort.
Based on these data from an internal and an external cohort of conventional microarrays and RNA-Seq data, we conclude that DISCO outperforms SplicerAV in sensitivity and that increasing the cohort size will substantially improve DISCO’s specificity.

3.4 Potential Confounding Factors on DISCO

To further validate the performance of DISCO, it was necessary to check for potential confounding factors on DISCO values. For example, DISCO values may be directly correlated with the number of probesets interrogating the gene, as suggested in Section 3.3. Another potential confounding factor could be gene expression, since DISCO values include a moderating constant, sα, which corrects for potential false positives due to lowly expressed genes and as a result could create a bias towards highly expressed genes. Therefore, we decided to check for the effects of these two potential confounding factors: number of probesets and gene expression. Panel A of Figure 3.6 is a density scatter plot of the DISCO values for a gene as a function of the number of probesets and shows a poor correlation (r = 0.25). Panel B of Figure 3.6 is a density scatter plot of the DISCO values for a gene as a function of the gene expression and also shows a poor correlation (r = 0.014).

[Figure 3.6 plot: panel (A) DISCO Value for Gene vs. Number of Probesets in Gene; panel (B) DISCO Value for Gene vs. Gene Expression; y-axis in both panels: DISCO Value for Gene.]

Figure 3.6: Density Scatter Plots of Possible Confounding Factors on DISCO Values. The DISCO value for a gene was generated by taking the highest mean-summarized DISCO value among the probesets associated with the gene. The darkness of the blue dots increases with the number of superimposed data points at that position. (A) DISCO Value for Gene vs. Number of Probesets in Gene. (B) DISCO Value for Gene vs. Gene Expression.

We further tested for the confounding effect of these two factors by generating sensitivity and specificity concordancy plots using the external cohort. We did not consider the internal cohort because the gene ranks as a function of the number of probesets would be the same across cohorts (i.e. the number of probesets interrogating a gene is always the same regardless of cohort, since this is a CDF feature) and the gene expression values should be relatively similar across cohorts. Panels A and B of Figure 3.7 show the sensitivity and specificity concordancy plots, respectively, for both potential confounding factors. These two plots show that when genes are ranked solely on the number of probesets or on gene expression, the rankings are not concordant with RNA-Seq predictions. Based on the results from Figure 3.6 and Figure 3.7, we conclude that neither potential confounding factor, number of probesets nor gene expression, is confounding the results from DISCO.

[Figure 3.7 plot: panel (A) Sensitivity Concordancy Plot, panel (B) Specificity Concordancy Plot; x-axis: Gene Rank (1 to 50); y-axis: Concordance with Positive (A) or Negative (B) RNA-Seq Predictions (%); series: Number of Probesets, Gene Expression, Avg Random.]

Figure 3.7: Sensitivity/Specificity Concordancy Plots of Possible Confounding Factors to RNA-Seq predictions for ABC vs. GCB in the External Cohort. The dotted black line represents an experiment where genes are randomly selected and checked for concordancy with RNA-Seq. It is the average concordancy of 1000 of these random experiments. The genes were ranked based on the number of probesets and on gene expression value. (A) Sensitivity Concordancy Plot. (B) Specificity Concordancy Plot.

3.5 Application of DISCO to ABC vs.
GCB

A gene enrichment analysis was performed on the top 165 ranked genes (DISCO value > 0.65) from the application of DISCO to the external DLBCL cohort, and the two most significantly enriched molecular terms were alternative splicing and splice variant (Table 3.1). This provides further support for the relevance of DISCO’s high scoring predictions and suggests that a substantial number of genes feature differential transcript expression.

Category          Term                   No. of Genes   Benjamini q-value
SP PIR KEYWORD    Alternative Splicing   101            8.7E-5
UP SEQ FEATURE    Splice Variant         100            4.8E-4

Table 3.1: Gene Enrichment Analysis on the Top 165 Ranked Genes Predicted to have Differentially Expressed Transcripts between ABC and GCB

Among the top ranked genes was FOXP1, a member of the FOX family of transcription factors, which was interrogated by a total of nine probesets and displayed a distinctly discordant pattern of differentially expressed probesets (Figure 3.8).

[Figure 3.8 plot: DISCO matrix, probeset expression for probesets I to IX in GCB and ABC, and transcript expression (FPKM) laid out along the gene from 5’ to 3’; transcript groups labelled "ABC overexpressed transcripts" and "Equivalently expressed transcripts".]

Figure 3.8: ABC vs. GCB DISCO plot of FOXP1. The transcripts listed here are the set of de novo assembled transcripts from Cufflinks, with the corresponding horizontal bar plots representing the upper bound of the 95% confidence interval on the predicted transcript expression level in both groups. Lowly expressed transcripts in both groups have been removed to simplify the visualization of these data. There are two groups of transcripts here: one group which is overexpressed in ABC (indicated by the orange vertical bar) and another group which is expressed at similar levels in both groups (indicated by the purple vertical bar). The light grey vertical boxes indicate the genomic interrogation location of each corresponding probeset. The orange/purple lines represent probesets interrogating both groups of transcripts, while the purple lines represent probesets interrogating exons exclusively from the group of transcripts expressed at similar levels.

The DISCO matrix suggests an interesting biological event in which five of the probesets have a differential expression pattern between ABC and GCB that is different from that of the other four probesets. This is supported by the probeset expression level data, which indicate that probesets near the 3’ end of the gene are more highly expressed in ABC compared to GCB. One hypothesis is that this gene has one set of transcripts which are expressed predominantly in ABC and another set of transcripts which are expressed at similar levels in ABC and GCB; this hypothesis is supported by the TopHat/Cufflinks analysis. We believe the discordant pattern of differential expression in probesets III and IV when compared to probesets V-IX, all of which interrogate both sets of transcripts, is due to technical noise in the microarray data. This pattern of differentially expressed transcripts in FOXP1 was reported by Brown et al. [34], in which western blot experiments identified smaller, potentially oncogenic activating transcripts that are overexpressed in ABC compared to GCB.

We visually analyzed another top ranking gene, PHF19, a transcription repressor. Its DISCO matrix (Figure 3.9) indicates that probeset IV has a differential expression pattern that is discordant from the rest of the probesets interrogating the same gene. This probeset appears to have equivalent expression levels in ABC and GCB, while the remaining probesets are overexpressed in ABC relative to GCB.
[Figure 3.9 plot: DISCO matrix, probeset expression for probesets I to IV in GCB and ABC, and transcript expression (FPKM) for the PHF19 Long and PHF19 Short transcripts laid out along the gene from 5’ to 3’.]

Figure 3.9: ABC vs. GCB DISCO Plot of PHF19. The long and short RefSeq transcripts are shown here, indicated by an orange and a purple vertical bar respectively. The corresponding horizontal bar plots represent the upper bound of the 95% confidence interval on the transcript expression level predicted by Cufflinks in both groups. The light grey vertical boxes indicate the genomic interrogation location of each corresponding probeset. The orange line represents the probeset that interrogates only the long transcript, while the orange/purple lines represent the probesets that interrogate both transcripts.

When looking at the transcript expression levels predicted by Cufflinks, the overexpression of the short transcript in ABC appears to be the cause of this discordancy. The protein isoforms translated from these two transcripts have significant biological differences, which have been reported in the literature [35] and are summarized in Table 3.2.

Biological Property             PHF19 Long                              PHF19 Short
Tissue                          Placenta, skeletal muscle, and kidney   Liver and peripheral blood leukocytes
Subcellular Location            Exclusively in Nucleus                  Nucleus and Cytoplasm
Repression of HSV-tk promoter   20-30 fold                              4-fold

Table 3.2: Biological Properties of PHF19 long vs. PHF19 short

PHF19 is one of three Polycomb-like (PCL) homologue proteins, the others being PHF1 and MTF2, that interact with the Polycomb repressive complex 2 (PRC2), which catalyzes the mono-, di-, or trimethylation of lysine 27 on histone H3 (H3K27), a suppressive chromatin mark associated with decreased gene expression [36].
It has been shown that PCLs, as a group, are required for high levels of H3K27 trimethylation [37], and studies have identified the specific biological effects of PHF1 and MTF2 when associated with PRC2 [38, 39, 40]. Despite studies that have characterized the biological properties of the short and long protein isoforms of PHF19 [35, 41], the exact biological role of PHF19 and the differences in the functions of its two isoforms are, to the best of our knowledge, largely unknown, especially in the context of DLBCL. One possible hypothesis is that an overabundance of the short isoform functions as a dominant-negative by outcompeting the long isoform for interaction with PRC2. This association with PRC2 has a negative effect, resulting in little trimethylation of H3K27 and thus creating a potential negative feedback regulation mechanism for trimethylation of H3K27 in ABC. This contrasts with GCB, where the negative feedback mechanism might be attenuated by the low abundance of the short transcript, resulting in hypertrimethylation of H3K27. The idea that the pathogenesis of ABC and GCB could differ due to differences in their epigenetic marks is supported by a report of recurrent GCB-specific mutations in EZH2 [33], the catalytic subunit of PRC2 responsible for trimethylation, which result in a gain-of-function mutant phenotype causing hypertrimethylation [42, 43]. Therefore, the differential transcript expression of PHF19 could be working in concert with GCB-specific mutations in EZH2 to drive hypertrimethylation, which alters the epigenetic landscape of the genome and contributes to the differences in the pathogenesis of ABC and GCB.
Chapter 4 Conclusions

Although the enormous repository of conventional microarray data has frequently been reanalyzed for the purpose of gene expression profiling, the potential utility of these microarrays for detecting other biological phenomena is still under-appreciated and largely disregarded. We have demonstrated that these microarrays can also be used for the discovery of differentially expressed transcripts, a biological phenomenon that these microarrays were not originally designed to detect. We have developed a novel algorithm called DISCO, which calculates discordancy scores, and described a way for researchers to visualize and interpret the patterns of differentially expressed transcripts of a particular gene between two groups of samples. Our results indicate that such discoveries are possible despite the low number of data points per gene on these conventional microarrays.

Based on an RT-PCR validated dataset, we demonstrated that DISCO has reasonably high accuracy. By using matched RNA-Seq libraries with our internal cohort of conventional microarrays, we also demonstrated improved performance in detecting differentially expressed transcripts compared to an existing algorithm. This improved performance was further validated through the use of a larger, external cohort of conventional microarrays.

A gene enrichment analysis of the top ranked candidate genes from DISCO shows an enrichment for the molecular terms “alternative splicing” and “splice variant”, and two highly ranked examples were chosen to demonstrate that DISCO’s results are supported by published literature and are likely to represent biologically interesting and relevant events. To the best of our knowledge, this is the first study that has used conventional microarrays in conjunction with RNA-Seq to detect differentially expressed transcripts between two groups.
The degree of concordancy between conventional microarray and RNA-Seq predictions suggests that conventional microarrays can supplement RNA-Seq data in detecting differentially expressed transcripts between two groups, and can even potentially act as a first-line discovery platform. This is especially feasible given the substantial repository of publicly available conventional microarray expression data.

4.1 Limitations and Future Directions

Because DISCO performs pairwise discordancy measures between probesets on the same gene, there is no uniform technique that can generate an optimal summarized value for each probeset. As discussed in earlier sections, complicated patterns of differentially expressed transcripts will often be hard to interpret at an individual probeset level using a simple summarization technique such as the mean. In the future, a scenario-based summarization approach could be implemented to improve the accuracy of summarized DISCO values. In addition, DISCO does not require any filtering of lowly expressed probesets, which could introduce technical noise. We decided not to use filtering because conventional microarrays are already a low-resolution platform, and the elimination of additional probesets would further reduce the number of analyzable genes. In the future, filtering weights could be incorporated into DISCO so that lowly expressed probesets are down-weighted, to try to separate true biological signal from technical noise.

Despite these limitations, we have shown that DISCO is a useful tool for detecting differentially expressed transcripts using conventional microarrays. The data used as examples in this thesis only represent a tiny portion of the enormous repository of conventional microarrays that could be reanalyzed.
With the short running time of DISCO and the simple visualization method of the DISCO values using a DISCO matrix, it would be quick and easy to reanalyze large cohorts of different diseases assayed on conventional microarrays and potentially glean new biological insights into differentially expressed transcripts.

Bibliography

[1] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931-945, Oct 2004. ISSN 1476-4687 (Electronic); 0028-0836 (Linking). doi: 10.1038/nature03001.
[2] Hadas Keren, Galit Lev-Maor, and Gil Ast. Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet, 11(5):345-355, May 2010. ISSN 1471-0064 (Electronic); 1471-0056 (Linking). doi: 10.1038/nrg2776.
[3] Eric T Wang, Rickard Sandberg, Shujun Luo, Irina Khrebtukova, Lu Zhang, Christine Mayr, Stephen F Kingsmore, Gary P Schroth, and Christopher B Burge. Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221):470-476, Nov 2008. ISSN 1476-4687 (Electronic); 0028-0836 (Linking). doi: 10.1038/nature07509.
[4] Jamal Tazi, Nadia Bakkour, and Stefan Stamm. Alternative splicing and disease. Biochim Biophys Acta, 1792(1):14-26, Jan 2009. ISSN 0006-3002 (Print); 0006-3002 (Linking). doi: 10.1016/j.bbadis.2008.09.017.
[5] Lei Wang, Lindsay Duke, Peter S Zhang, Ralph B Arlinghaus, W Fraser Symmans, Aysegul Sahin, Richard Mendez, and Jia Le Dai. Alternative splicing disrupts a nuclear localization signal in spleen tyrosine kinase that is required for invasion suppression in breast cancer. Cancer Res, 63(15):4724-4730, Aug 2003. ISSN 0008-5472 (Print); 0008-5472 (Linking).
[6] Karl-Heinz Heider, Hartmut Kuthan, Gerd Stehle, and Gerd Munzert. Cd44v6: a target for antibody-based cancer therapy. Cancer Immunol Immunother, 53(7):567-579, Jul 2004. ISSN 0340-7004 (Print); 0340-7004 (Linking). doi: 10.1007/s00262-003-0494-4.
[7] Ute Rupp, Eva Schoendorf-Holland, Michael Eichbaum, Florian Schuetz, Ilka Lauschner, Peter Schmidt, Alexander Staab, Gertraud Hanft, Jens Huober, Hans-Peter Sinn, Christof Sohn, and Andreas Schneeweiss. Safety and pharmacokinetics of bivatuzumab mertansine in patients with cd44v6-positive metastatic breast cancer: final results of a phase i study. Anticancer Drugs, 18(4):477-485, Apr 2007. ISSN 0959-4973 (Print); 0959-4973 (Linking). doi: 10.1097/CAD.0b013e32801403f4.
[8] M Schena, D Shalon, R W Davis, and P O Brown. Quantitative monitoring of gene expression patterns with a complementary dna microarray. Science, 270(5235):467-470, Oct 1995. ISSN 0036-8075 (Print); 0036-8075 (Linking).
[9] Maria A Stalteri and Andrew P Harrison. Interpretation of multiple probe sets mapping to the same gene in affymetrix genechips. BMC Bioinformatics, 8:13, 2007. ISSN 1471-2105 (Electronic); 1471-2105 (Linking). doi: 10.1186/1471-2105-8-13.
[10] Zhong Wang, Mark Gerstein, and Michael Snyder. Rna-seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57-63, Jan 2009. ISSN 1471-0064 (Electronic); 1471-0056 (Linking). doi: 10.1038/nrg2484.
[11] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, 5(7):621-628, Jul 2008. ISSN 1548-7105 (Electronic); 1548-7091 (Linking). doi: 10.1038/nmeth.1226.
[12] Manuel Garber, Manfred G Grabherr, Mitchell Guttman, and Cole Trapnell. Computational methods for transcriptome annotation and quantification using rna-seq. Nat Methods, 8(6):469-477, Jun 2011. ISSN 1548-7105 (Electronic); 1548-7091 (Linking). doi: 10.1038/nmeth.1613.
[13] G K Hu, S J Madore, B Moldover, T Jatkoe, D Balaban, J Thomas, and Y Wang. Predicting splice variant from dna chip expression data. Genome Res, 11(7):1237-1245, Jul 2001. ISSN 1088-9051 (Print); 1088-9051 (Linking). doi: 10.1101/gr.165501.
[14] Wenhong Fan, Najma Khalid, Andrew R Hallahan, James M Olson, and Lue Ping Zhao. A statistical method for predicting splice variants between two groups of samples using genechip expression array data. Theor Biol Med Model, 3:19, 2006. ISSN 1742-4682 (Electronic); 1742-4682 (Linking). doi: 10.1186/1742-4682-3-19.
[15] Timothy J Robinson, Michaela A Dinan, Mark Dewhirst, Mariano A Garcia-Blanco, and James L Pearson. Splicerav: a tool for mining microarray expression data for changes in rna processing. BMC Bioinformatics, 11:108, 2010. ISSN 1471-2105 (Electronic); 1471-2105 (Linking). doi: 10.1186/1471-2105-11-108.
[16] Karpagam Srinivasan, Lily Shiue, Justin D Hayes, Ross Centers, Sean Fitzwater, Rebecca Loewen, Lillian R Edmondson, Jessica Bryant, Michael Smith, Claire Rommelfanger, Valerie Welch, Tyson A Clark, Charles W Sugnet, Kenneth J Howe, Yael Mandel-Gutfreund, and Manuel Jr Ares. Detection and measurement of alternative splicing using splicing-sensitive microarrays. Methods, 37(4):345-359, Dec 2005. ISSN 1046-2023 (Print); 1046-2023 (Linking). doi: 10.1016/j.ymeth.2005.09.007.
[17] E Purdom, K M Simpson, M D Robinson, J G Conboy, A V Lapuk, and T P Speed. Firma: a method for detection of alternative splicing from exon array data. Bioinformatics, 24(15):1707-1714, Aug 2008. ISSN 1367-4811 (Electronic); 1367-4803 (Linking). doi: 10.1093/bioinformatics/btn284.
[18] Rafael A Irizarry, Benjamin M Bolstad, Francois Collin, Leslie M Cope, Bridget Hobbs, and Terence P Speed. Summaries of affymetrix genechip probe level data. Nucleic Acids Res, 31(4):e15, Feb 2003. ISSN 1362-4962 (Electronic); 0305-1048 (Linking).
[19] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol, 10(3):R25, 2009. ISSN 1465-6914 (Electronic); 1465-6906 (Linking). doi: 10.1186/gb-2009-10-3-r25.
[20] Cole Trapnell, Lior Pachter, and Steven L Salzberg. Tophat: discovering splice junctions with rna-seq. Bioinformatics, 25(9):1105-1111, May 2009. ISSN 1367-4811 (Electronic); 1367-4803 (Linking). doi: 10.1093/bioinformatics/btp120.
[21] Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. Transcript assembly and quantification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):511-515, May 2010. ISSN 1546-1696 (Electronic); 1087-0156 (Linking). doi: 10.1038/nbt.1621.
[22] Douglas Hanahan and Robert A Weinberg. Hallmarks of cancer: the next generation. Cell, 144(5):646-674, Mar 2011. ISSN 1097-4172 (Electronic); 0092-8674 (Linking). doi: 10.1016/j.cell.2011.02.013.
[23] Georg Lenz and Louis M Staudt. Aggressive lymphomas. N Engl J Med, 362(15):1417-1429, Apr 2010. ISSN 1533-4406 (Electronic); 0028-4793 (Linking). doi: 10.1056/NEJMra0807082.
[24] A A Alizadeh, M B Eisen, R E Davis, C Ma, I S Lossos, A Rosenwald, J C Boldrick, H Sabet, T Tran, X Yu, J I Powell, L Yang, G E Marti, T Moore, J Jr Hudson, L Lu, D B Lewis, R Tibshirani, G Sherlock, W C Chan, T C Greiner, D D Weisenburger, J O Armitage, R Warnke, R Levy, W Wilson, M R Grever, J C Byrd, D Botstein, P O Brown, and L M Staudt. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503-511, Feb 2000. ISSN 0028-0836 (Print); 0028-0836 (Linking). doi: 10.1038/35000501.
[25] G Lenz, G Wright, S S Dave, W Xiao, J Powell, H Zhao, W Xu, B Tan, N Goldschmidt, J Iqbal, J Vose, M Bast, K Fu, D D Weisenburger, T C Greiner, J O Armitage, A Kyle, L May, R D Gascoyne, J M Connors, G Troen, H Holte, S Kvaloy, D Dierickx, G Verhoef, J Delabie, E B Smeland, P Jares, A Martinez, A Lopez-Guillermo, E Montserrat, E Campo, R M Braziel, T P Miller, L M Rimsza, J R Cook, B Pohlman, J Sweetenham, R R Tubbs, R I Fisher, E Hartmann, A Rosenwald, G Ott, H-K Muller-Hermelink, D Wrench, T A Lister, E S Jaffe, W H Wilson, W C Chan, and L M Staudt. Stromal gene signatures in large-b-cell lymphomas. N Engl J Med, 359(22):2313-2323, Nov 2008. ISSN 1533-4406 (Electronic); 0028-4793 (Linking). doi: 10.1056/NEJMoa0802885.
[26] Veera D’mello, Ju Y Lee, Clinton C MacDonald, and Bin Tian. Alternative mrna polyadenylation can potentially affect detection of gene expression by affymetrix genechip arrays. Appl Bioinformatics, 5(4):249-253, 2006. ISSN 1175-5636 (Print); 1175-5636 (Linking).
[27] V G Tusher, R Tibshirani, and G Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 98(9):5116-5121, Apr 2001. ISSN 0027-8424 (Print); 0027-8424 (Linking). doi: 10.1073/pnas.091062498.
[28] Jeremy Harbig, Robert Sprinkle, and Steven A Enkemann. A sequence-based identification of the genes detected by probesets on the affymetrix u133 plus 2.0 array. Nucleic Acids Res, 33(3):e31, 2005. ISSN 1362-4962 (Electronic); 0305-1048 (Linking). doi: 10.1093/nar/gni027.
[29] Manhong Dai, Pinglang Wang, Andrew D Boyd, Georgi Kostov, Brian Athey, Edward G Jones, William E Bunney, Richard M Myers, Terry P Speed, Huda Akil, Stanley J Watson, and Fan Meng. Evolving gene/transcript definitions significantly alter the interpretation of genechip data. Nucleic Acids Res, 33(20):e175, 2005. ISSN 1362-4962. URL http://www.ncbi.nlm.nih.gov/pubmed/16284200.
[30] Rickard Sandberg and Ola Larsson. Improved precision and accuracy for microarrays using updated probe set definitions. BMC Bioinformatics, 8:48, 2007. ISSN 1471-2105 (Electronic); 1471-2105 (Linking). doi: 10.1186/1471-2105-8-48.
[31] Paul J Gardina, Tyson A Clark, Brian Shimada, Michelle K Staples, Qing Yang, James Veitch, Anthony Schweitzer, Tarif Awad, Charles Sugnet, Suzanne Dee, Christopher Davies, Alan Williams, and Yaron Turpaz. Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics, 7:325, 2006. ISSN 1471-2164 (Electronic); 1471-2164 (Linking). doi: 10.1186/1471-2164-7-325.
[32] P Mickey Williams, Rui Li, Nathalie A Johnson, George Wright, Joe-Don Heath, and Randy D Gascoyne. A novel method of amplification of ffpet-derived rna enables accurate disease classification with microarrays. J Mol Diagn, 12(5):680-686, Sep 2010. ISSN 1943-7811 (Electronic); 1525-1578 (Linking). doi: 10.2353/jmoldx.2010.090164.
[33] Ryan D Morin, Maria Mendez-Lago, Andrew J Mungall, Rodrigo Goya, Karen L Mungall, Richard D Corbett, Nathalie A Johnson, Tesa M Severson, Readman Chiu, Matthew Field, Shaun Jackman, Martin Krzywinski, David W Scott, Diane L Trinh, Jessica Tamura-Wells, Sa Li, Marlo R Firme, Sanja Rogic, Malachi Griffith, Susanna Chan, Oleksandr Yakovenko, Irmtraud M Meyer, Eric Y Zhao, Duane Smailus, Michelle Moksa, Suganthi Chittaranjan, Lisa Rimsza, Angela Brooks-Wilson, John J Spinelli, Susana Ben-Neriah, Barbara Meissner, Bruce Woolcock, Merrill Boyle, Helen McDonald, Angela Tam, Yongjun Zhao, Allen Delaney, Thomas Zeng, Kane Tse, Yaron Butterfield, Inanc Birol, Rob Holt, Jacqueline Schein, Douglas E Horsman, Richard Moore, Steven J M Jones, Joseph M Connors, Martin Hirst, Randy D Gascoyne, and Marco A Marra. Frequent mutation of histone-modifying genes in non-hodgkin lymphoma. Nature, 476(7360):298-303, Aug 2011. ISSN 1476-4687 (Electronic); 0028-0836 (Linking). doi: 10.1038/nature10351.
[34] Philip J Brown, Sally L Ashe, Ellen Leich, Christof Burek, Sharon Barrans, James A Fenton, Andrew S Jack, Karen Pulford, Andreas Rosenwald, and Alison H Banham. Potentially oncogenic b-cell activation-induced smaller isoforms of foxp1 are highly expressed in the activated b cell-like subtype of dlbcl. Blood, 111(5):2816-2824, Mar 2008. ISSN 0006-4971 (Print); 0006-4971 (Linking). doi: 10.1182/blood-2007-09-115113.
[35] Shuwen Wang, Gavin P Robertson, and Jiyue Zhu. A novel human homologue of drosophila polycomblike gene is up-regulated in multiple cancers. Gene, 343(1):69-78, Dec 2004. ISSN 0378-1119 (Print); 0378-1119 (Linking). doi: 10.1016/j.gene.2004.09.006.
[36] Raphael Margueron and Danny Reinberg. The polycomb complex prc2 and its mark in life. Nature, 469(7330):343-349, Jan 2011. ISSN 1476-4687 (Electronic); 0028-0836 (Linking). doi: 10.1038/nature09784.
[37] Maxim Nekrasov, Tetyana Klymenko, Sven Fraterman, Bernadett Papp, Katarzyna Oktaba, Thomas Kocher, Adrian Cohen, Hendrik G Stunnenberg, Matthias Wilm, and Jurg Muller. Pcl-prc2 is needed to generate high levels of h3-k27 trimethylation at polycomb target genes. EMBO J, 26(18):4078-4088, Sep 2007. ISSN 0261-4189 (Print); 0261-4189 (Linking). doi: 10.1038/sj.emboj.7601837.
[38] Ru Cao, Hengbin Wang, Jin He, Hediye Erdjument-Bromage, Paul Tempst, and Yi Zhang. Role of hphf1 in h3k27 methylation and hox gene silencing. Mol Cell Biol, 28(5):1862-1872, Mar 2008. ISSN 1098-5549 (Electronic); 0270-7306 (Linking). doi: 10.1128/MCB.01589-07.
[39] Emily Walker, Wing Y Chang, Julie Hunkapiller, Gerard Cagney, Kamal Garcha, Joseph Torchia, Nevan J Krogan, Jeremy F Reiter, and William L Stanford. Polycomb-like 2 associates with prc2 and regulates transcriptional networks during mouse embryonic stem cell self-renewal and differentiation. Cell Stem Cell, 6(2):153-166, Feb 2010. ISSN 1875-9777 (Electronic). doi: 10.1016/j.stem.2009.12.014.
[40] Miguel Casanova, Tanja Preissner, Andrea Cerase, Raymond Poot, Daisuke Yamada, Xiangzhi Li, Ruth Appanah, Karel Bezstarosti, Jeroen Demmers, Haruhiko Koseki, and Neil Brockdorff. Polycomblike 2 facilitates the recruitment of prc2 polycomb group complexes to the inactive x chromosome and to target loci in embryonic stem cells. Development, 138(8):1471-1482, Apr 2011. ISSN 1477-9129 (Electronic); 0950-1991 (Linking). doi: 10.1242/dev.053652.
[41] Gaylor Boulay, Claire Rosnoblet, Cateline Guerardel, Pierre-Olivier Angrand, and Dominique Leprince. Functional characterization of human polycomb-like 3 isoforms identifies them as components of distinct ezh2 protein complexes. Biochem J, 434(2):333-342, Mar 2011. ISSN 1470-8728 (Electronic); 0264-6021 (Linking). doi: 10.1042/BJ20100944.
[42] Christopher J Sneeringer, Margaret Porter Scott, Kevin W Kuntz, Sarah K Knutson, Roy M Pollock, Victoria M Richon, and Robert A Copeland. Coordinated activities of wild-type plus mutant ezh2 drive tumor-associated hypertrimethylation of lysine 27 on histone h3 (h3k27) in human b-cell lymphomas. Proc Natl Acad Sci U S A, 107(49):20980-20985, Dec 2010. ISSN 1091-6490 (Electronic); 0027-8424 (Linking). doi: 10.1073/pnas.1012525107.
[43] Damian B Yap, Justin Chu, Tobias Berg, Matthieu Schapira, S-W Grace Cheng, Annie Moradian, Ryan D Morin, Andrew J Mungall, Barbara Meissner, Merrill Boyle, Victor E Marquez, Marco A Marra, Randy D Gascoyne, R Keith Humphries, Cheryl H Arrowsmith, Gregg B Morin, and Samuel A J R Aparicio. Somatic mutations at ezh2 y641 act dominantly through a mechanism of selectively altered prc2 catalytic activity, to increase h3k27 trimethylation. Blood, 117(8):2451-2459, Feb 2011. ISSN 1528-0020 (Electronic); 0006-4971 (Linking). doi: 10.1182/blood-2010-11-321208.
[44] Ben Bolstad. affyPLM: Fitting Probe Level Models, April 2011.
[45] H. Bengtsson, K. Simpson, J. Bullard, and K. Hansen. aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. Technical Report 745, Department of Statistics, University of California, Berkeley, February 2008.
[46] W Evan Johnson, Cheng Li, and Ariel Rabinovic. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics, 8(1):118-127, Jan 2007. ISSN 1465-4644 (Print); 1465-4644 (Linking). doi: 10.1093/biostatistics/kxj037.
[47] Exon 1.0 st array sample dataset. URL http://www.affymetrix.com/support/technical/sample_data/exon_array_data.affx.
[48] An open-source r framework for your microarray analysis | aroma-project.org. URL http://www.aroma-project.org/.
[49] Tobias Sing, Oliver Sander, Niko Beerenwinkel, and Thomas Lengauer. Rocr: visualizing classifier performance in r. Bioinformatics, 21(20):3940-3941, Oct 2005. ISSN 1367-4803 (Print); 1367-4803 (Linking). doi: 10.1093/bioinformatics/bti623.
[50] Donna Karolchik, Angela S Hinrichs, Terrence S Furey, Krishna M Roskin, Charles W Sugnet, David Haussler, and W James Kent. The ucsc table browser data retrieval tool. Nucleic Acids Res, 32(Database issue):D493-6, Jan 2004. ISSN 1362-4962 (Electronic); 0305-1048 (Linking). doi: 10.1093/nar/gkh103.
[51] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nat Protoc, 4(1):44-57, 2009. ISSN 1750-2799 (Electronic); 1750-2799 (Linking). doi: 10.1038/nprot.2008.211.

Appendix A Supplementary Methods

A.1 Conventional Microarray Analysis

The internal cohort of 37 Affymetrix HG-U133 Plus 2.0 arrays (GSE19246) was selected based on available matched RNA-Seq libraries, and the microarrays were assessed for quality using the normalized unscaled standard error (NUSE) and relative log expression (RLE) metrics, provided by the affyPLM package [44], and a pairwise Pearson correlation test.
A total of 36 out of 37 microarrays passed all 3 QC thresholds (NUSE < 1.05, |RLE| < 0.05, mean r² > 0.9) and were used for downstream analyses. The external DLBCL cohort of 200 Affymetrix HG-U133 Plus 2.0 arrays was retrieved from GEO (GSE10846). The exon-centric custom CDF (version 14.1) was used to group the probesets, and normalization, background correction and summarization of probeset expression values were performed using the RMA algorithm implemented in the aroma.affymetrix R package [45]. The internal cohort of microarrays comprised samples that were assayed in two separate batches. To combine the two batches, we used the ComBat R package [46] to account for possible batch effects. DISCO’s ranked list of candidate genes was generated by ranking each gene by its probeset with the highest summarized mean DISCO value. The default parameter settings of SplicerAV were used, and its ranked list of candidate genes was generated by ranking genes by their SplicerAV scores. Gene expression values were generated by taking the average expression value across all probesets interrogating the same gene.

A.2 Exon Microarray Analysis

Twenty matched tumor-normal Affymetrix exon arrays were retrieved from the Affymetrix website [47]. The HuEx-1_0-st-v2,coreR2,A20070914,EP CDF, available at the aroma project website [48], was used to group probesets, and normalization, background correction and summarization of probeset expression values were performed using the RMA algorithm implemented in the aroma.affymetrix R package. The SI algorithm was implemented, and a final SI score for each probeset was generated by taking the difference between the two groups’ mean gene-level normalized values. FIRMA values were generated using the FIRMA implementation in the aroma.affymetrix R package, and final FIRMA scores were calculated by taking the difference between the two groups’ mean FIRMA values.
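The per-probeset scoring described above, a difference between the mean values of the two groups, can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the analysis. The function names `group_mean_difference` and `score_probesets`, the dictionary layout, the group labels and the toy numbers are all assumptions made for the example, and the input is assumed to already hold gene-level normalized values (for SI) or FIRMA values.

```python
from statistics import mean


def group_mean_difference(values, groups, group_a, group_b):
    """Return mean(values in group_a) - mean(values in group_b) for one probeset.

    values: per-sample scores for one probeset (hypothetical input layout).
    groups: group label of each sample, in the same order as `values`.
    """
    if len(values) != len(groups):
        raise ValueError("values and groups must have the same length")
    a = [v for v, g in zip(values, groups) if g == group_a]
    b = [v for v, g in zip(values, groups) if g == group_b]
    if not a or not b:
        raise ValueError("both groups need at least one sample")
    return mean(a) - mean(b)


def score_probesets(matrix, groups, group_a="tumor", group_b="normal"):
    """Score every probeset in a {probeset_id: [values per sample]} mapping."""
    return {
        probeset: group_mean_difference(values, groups, group_a, group_b)
        for probeset, values in matrix.items()
    }


# Toy data: two hypothetical probesets measured in two tumor and two normal samples.
groups = ["tumor", "tumor", "normal", "normal"]
matrix = {
    "probeset_1": [1.0, 3.0, 0.5, 1.5],    # tumor mean 2.0, normal mean 1.0
    "probeset_2": [0.25, 0.75, 0.5, 0.5],  # both group means are 0.5
}
scores = score_probesets(matrix, groups)
```

In this sketch a score far from zero marks a probeset whose values differ between the two groups, while a score near zero marks one that behaves the same in both.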
ROC curves and their corresponding AUC values were generated using the ROCR R package [49].

A.3 RNA-Seq Analysis

The 36 RNA-Seq libraries were analyzed by 1) aligning them to the human reference genome (version 19) using the split-read aware aligner TopHat, 2) running Cufflinks on each TopHat-aligned library to perform a de novo assembly of transcripts, 3) running Cuffmerge to merge the de novo assembled transcripts from all libraries, and finally 4) running Cuffdiff to predict differences in the transcriptome between ABC and GCB using the merged list of de novo assembled transcripts from Cuffmerge. At each step, RefSeq transcripts, retrieved from the UCSC Table Browser [50], were supplied as a parameter to provide known splice junctions and transcripts, which improves the accuracy of the analysis. The final candidate list of genes with differentially expressed transcripts was constructed by including any gene with differential promoter usage, differential pre-mRNA transcript expression, a differential splicing pattern, or differential transcript expression (as indicated in Figure 1.4). Candidate genes of differential transcript expression were chosen based on the threshold q-value < 0.05. Further filtering was performed to include only genes that contained multiple probesets on the HG-U133 Plus 2.0 arrays. The final lists contained 5,338 candidate genes and 3,762 non-candidate genes.

A.4 Generation of Sensitivity/Specificity Concordancy Plots

Sensitivity concordancy plots were generated by (1) ranking the list of candidate genes from DISCO and SplicerAV, and then (2) for each ith ranked gene (where i = 1, ..., 50) calculating the proportion of the genes ranked 1 through i that are in the list of RNA-Seq candidate genes; a high concordancy would be indicative of a good true positive rate.
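The ranking-and-proportion procedure above can be sketched in a few lines. This is an illustrative Python sketch, not the code used to produce the plots; the function name `concordancy_curve`, its `max_rank` parameter and the toy gene names are hypothetical.

```python
def concordancy_curve(ranked_genes, reference_genes, max_rank=50):
    """For i = 1..max_rank, return the proportion of the top-i ranked genes
    that appear in reference_genes (e.g. the RNA-Seq candidate list)."""
    reference = set(reference_genes)
    curve = []
    hits = 0
    # Walk down the ranking once, keeping a running count of matches.
    for i, gene in enumerate(ranked_genes[:max_rank], start=1):
        if gene in reference:
            hits += 1
        curve.append(hits / i)
    return curve


# Toy example: a ranked list from one method and a reference candidate list.
ranked = ["geneA", "geneB", "geneC", "geneD"]
rnaseq_candidates = {"geneA", "geneC", "geneD"}
curve = concordancy_curve(ranked, rnaseq_candidates)
```

Each element of `curve` is one point on the plot, so plotting it against i = 1, ..., 50 for each method gives curves that can be compared directly. The same function can be reused with a different ranked list or reference set.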
To compare the specificity of both methods, each ranked list of candidate genes was inverted so that the top ranked gene was the lowest ranked prediction of that method. The same procedure as for the sensitivity concordancy plots was then performed, but the genes were compared to the list of non-candidate genes derived from the TopHat/Cufflinks pipeline. In a specificity concordancy plot, high concordancy would be indicative of a good true negative rate. Figure A.1 shows a diagram of how the specific gene lists from each method and platform were used to generate the sensitivity/specificity concordancy plots.

[Figure A.1 diagram: the internal cohort of 36 DLBCL samples (13 ABC, 23 GCB) has matched RNA-Seq libraries and Affymetrix HG-U133 Plus 2.0 arrays; the external cohort has 200 DLBCL Affymetrix HG-U133 Plus 2.0 arrays (93 ABC, 107 GCB). The RNA-Seq libraries pass through TopHat, Cufflinks, Cuffmerge and Cuffdiff to give the list of genes with differentially expressed transcripts. DISCO and SplicerAV candidate genes from the internal and external microarray cohorts are compared against this list to produce the internal and external sensitivity/specificity concordance plots.]

Figure A.1: Diagram of the List of Genes Used to Generate the Concordancy Plots.

A.5 Gene Enrichment Analysis

The gene enrichment analysis was performed using the DAVID (Database for Annotation, Visualization and Integrated Discovery) bioinformatics tool [51] on the top ranked genes from DISCO. The enrichment analysis was performed relative to a pre-defined background consisting of all the genes (9,982 in total) with multiple probesets. Enrichment results were taken from the “Functional Annotation Chart” with the default annotation sources selected. The p-values returned from DAVID quantify the probability of observing such an abundance of an annotation term, associated with the candidate genes, in relation to a background set of genes; low p-values would be indicative of enrichment of this annotation term.

Appendix B Supplementary Information

B.1 Using RT-PCR for Validation of Alternatively Expressed Transcripts

This section provides information on how the polymerase chain reaction (PCR) can be used to validate alternatively expressed transcripts. There are several variations of PCR, with RT-PCR (reverse-transcriptase PCR) being one of the most widely adopted techniques. The general idea is to design oligonucleotide primers that interrogate the regions flanking the genomic region of interest. To detect events like alternative splicing of a cassette exon, primers can be designed to interrogate the flanking exons, and the PCR products can then be run on an agarose gel and inspected for separation into multiple bands. The separation is due to the relative sizes of the PCR products: bands from longer products migrate through the agarose gel more slowly than bands from shorter products (Panel A of Figure B.1). The choice of primer pairs depends on the particular alternative transcript regulation event in question and has to be made carefully; incorrect selection of primer pairs can produce uninformative results (Panel B of Figure B.1).

[Figure B.1 diagram: panel (A), alternative 5’ splice site, and panel (B), consecutive cassette exons, each with an RT-PCR gel showing a reference ladder (1000, 750, 500, 400, 300 and 200) and lanes for sample A and sample B; one transcript is labelled "No Primer Alignment".]

Figure B.1: RT-PCR Validation of Alternative Transcript Regulation Events. (A) RT-PCR validation of an alternative 5’ splice site event. Primers flank the spliced region of the exon and two possible transcripts are formed.
The agarose gel of the PCR products, with a reference lane on the far left, shows two distinct bands, with the higher band associated with the longer PCR product (i.e. the transcript including the spliced exonic region) and the lower band associated with the shorter PCR product (i.e. the transcript excluding the spliced exonic region). The results indicate that the longer transcript is expressed exclusively in sample A while the shorter transcript is expressed exclusively in sample B. (B) RT-PCR validation of a consecutive cassette exons event with uninformative results. Primers were designed for exons which were not constitutive exons, but part of a set of consecutive cassette exons. Only one PCR product is actually amplified here since the other PCR product has no primers associated with it. The RT-PCR results show that samples A and B equally include the longer transcript, and a researcher may conclude that no differentially expressed transcripts are present. But this result is confounded since the shorter transcript is missed due to no primers being associated with it.

B.2 Genes with Complex Patterns of Differentially Expressed Transcripts

[Figure B.2 plot: DISCO matrix (scale 0 to 1.2) and probeset expression (tumor vs. normal) for the CTTN probesets, shown against transcripts NM_138565 and NM_005231 with exons 1 to 18 numbered 5’ to 3’.]

Figure B.2: Tumor vs. Normal DISCO Plot of CTTN in Gardina et al. Dataset. The red box represents an enlarged view of the DISCO values associated with probeset 3338589. The black-green scale indicates the range of the DISCO values and the black arrows emphasize the low DISCO values due to these probesets displaying a similar differential expression pattern (indicated by the orange boxes) as probeset 3338589. The dashed red box represents the probeset interrogating exon 11, which was RT-PCR validated to be preferentially included in tumor samples. The orange text at the bottom represents the corresponding exons going 5’ to 3’.

[Figure B.3 plot: DISCO matrix (scale 0.0 to 1.2) and probeset expression (tumor vs. normal) for the FN1 probesets, shown against transcripts NM_212482, NM_212474, NM_212476, NM_212478, NM_054034 and NM_002026.]

Figure B.3: Tumor vs. Normal DISCO Plot of FN1 in Gardina et al. Dataset. The red box represents an enlarged view of the DISCO values associated with probeset 2598321. The black-green scale indicates the range of the DISCO values and the black arrows emphasize the low DISCO values due to these probesets displaying a similar differential expression pattern (indicated by the orange boxes) as probeset 2598321.
The dashed red box represents the probeset interrogating exon 25, which was RT-PCR validated to be preferentially included in normal samples. The orange text at the bottom represents the corresponding exons going 5’ to 3’.
