UBC Theses and Dissertations

A multi-task machine learning pipeline for the classification and analysis of cancers from gene expression data

Disyak, Michael

Abstract

This thesis set out to accurately classify 55 primary cancer subtypes, 20 metastatic cancer subtypes, and 16 normal tissues from gene expression data. Classification used a multi-task learning approach in which an artificial neural network model makes four distinct classifications, at different levels of the biological hierarchy, for each input sample. The four learning tasks were the organ system of origin, the disease state, the cancer type, and the cancer subtype. On a test set composed of primary cancer, metastatic cancer, and normal tissue samples, the model's macro F1-score ranged from 0.987 on the disease state task to 0.831 on the cancer subtype task.

With the model's classification performance established, the second part of the thesis used what the model had learned to extract biological information about the cancers in the data set. DeepLIFT, a backpropagation-based tool, was used to generate a list of importance scores for each gene within every class of each learning task. These lists were then analyzed for trends that could yield biological insight into specific cancer types and subtypes, and into primary and metastatic cancers as groups. The lists also provide a means to functionally annotate enriched pathways and to quantify and compare the roles of RNA genes and pseudogenes across classes and learning tasks. Some of the DeepLIFT results were checked for biological relevance against supporting evidence from the scientific literature. The ultimate product of this research is a tool for quantifying the role of a variety of genes in cancers spanning both primary and metastatic types. Further analysis of the tool's output could provide a better understanding of the role of gene expression, including that of RNA genes and pseudogenes, in a variety of cancers.
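
The abstract reports one macro F1-score per learning task. As a rough illustration of what that metric measures, the sketch below computes the unweighted mean of per-class F1 for each task separately. It is not the thesis code: the function names, the task keys, and the toy labels are all hypothetical, and averaging over every class that appears in either the true or the predicted labels is an assumption about how classes are counted.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    classes = set(y_true) | set(y_pred)
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is nonzero for every
    # class in the union because each appears at least once.
    scores = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes]
    return sum(scores) / len(scores)


def per_task_macro_f1(true_by_task, pred_by_task):
    """Apply macro_f1 independently to each learning task."""
    return {
        task: macro_f1(true_by_task[task], pred_by_task[task])
        for task in true_by_task
    }


# Toy labels for two of the four tasks (hypothetical, for illustration only).
toy_true = {
    "disease_state": ["primary", "primary", "metastatic", "normal"],
    "cancer_type": ["a", "a", "b", "b"],
}
toy_pred = {
    "disease_state": ["primary", "primary", "metastatic", "normal"],
    "cancer_type": ["a", "b", "b", "b"],
}
toy_scores = per_task_macro_f1(toy_true, toy_pred)
```

Because every class contributes equally to the mean regardless of how many samples it has, a macro average penalizes poor performance on rare classes, which is one plausible reason for the lower score on the fine-grained cancer subtype task.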

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International