UBC Theses and Dissertations

Database-driven whole genome profiling for stratifying Triple Negative Breast Cancers (TNBC) Asiimwe, Rebecca 2019

Database-Driven Whole Genome Profiling for Stratifying Triple Negative Breast Cancers (TNBC)

by

Rebecca Asiimwe

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

March 2019

© Rebecca Asiimwe, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis entitled:

Database-Driven Whole Genome Profiling for Stratifying Triple Negative Breast Cancers (TNBC)

Submitted by Rebecca Asiimwe in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics

Examining Committee:

Dr. Sohrab P. Shah, Supervisor
Dr. David Huntsman, Supervisory Committee Member
Dr. William Hsiao, Supervisory Committee Member
Dr. Martin Hirst, Committee Chair

Abstract

Whole genome sequencing of cancers for variant discovery and patient stratification generates vast amounts of data, including on the order of 10^6 relevant features per sample. The current practice is to store this data in flat files whose structure complicates tasks required to optimally store, query and conduct integrative data mining and analysis of orthogonally collected data such as phenotype and clinical outcomes. In this study we designed, developed and optimized an object-relational database to support optimal storage, integration, querying, analysis and visualization of large-scale whole genome profiling data at the level of genome-wide individual somatic variants (CNAs, SNVs, SVs and indels). We structured variant data from analytics pipelines and implemented a PostgreSQL database in which we bulk-loaded clinical outcomes and somatic variants from 88 Triple Negative Breast Cancers (TNBCs).
Our focus on TNBC was driven by the current and urgent need for better characterization of the genetic, molecular and clinical biomarkers of this heterogeneous, more aggressive and difficult-to-treat breast cancer subtype, for which there are limited treatment options. Our choice of whole genome sequencing (WGS) was motivated by the ability of WGS approaches to provide an in-depth analysis of the landscape of mutations occurring across the genome, which may reflect specific mutational processes as targetable vulnerabilities in human cancers. However, a whole genome sequencing study in TNBC at scale to investigate genomic properties as a stratification tool has not been undertaken. Building on these observations, we applied the developed database and present its utility in supporting optimal access, exploration, analysis and visualization of the genomic contents of patient tumours to support quality control, inference of patterns of mutations and genomic events underpinning a patient's disease, population-level aggregation analysis, gene mutation visualization and patient stratification. Furthermore, we developed Genome-Miner, a web-based database user interface, to additionally support interactive and convenient access, sharing, interrogation and visualization of collected data across various research groups. We anticipate that the database infrastructure we present will have utility in other whole genome studies and push the field beyond the use of flat files for managing whole genome datasets in cancer.

Lay Summary

Since the inception of DNA sequencing in the 1970s, various sequencing technologies have been introduced to help biologists understand the genetic makeup of individuals towards optimized treatment of diseases, especially complex diseases such as cancer.
These sequencing technologies generate vast amounts of data that are mostly stored in flat files whose structure does not support optimal storage, access, exploration, analysis and visualization of vast amounts of related genomic and clinical outcomes data. Using whole genome profiling data from 88 Triple Negative Breast Cancers (TNBC), we designed, developed, optimized and implemented a PostgreSQL database and further developed Genome-Miner, a database-driven and web-based tool to support the optimal storage, exploration, analysis, visualization and global sharing of clinical outcomes and genomic contents of cancers in whole genome studies. We present the application of databases for quality control, population-level aggregation analysis, gene mutation visualization and patient stratification.

Preface

This thesis was written under the guidance of my supervisor, Dr. Sohrab P. Shah, in collaboration with my distinguished supervisory and examining committee, comprising Dr. David Huntsman, Dr. William Hsiao and Dr. Martin Hirst. The intellectual contents herein are original and written by Rebecca Asiimwe, with contributions from the following. Project conception, main methods and direction were provided by my supervisor, Dr. Sohrab Shah. Sample and data collection was done by Dr. Samuel Aparicio, who also provided key insights and advice on the project. Collection of clinical data from various institutions (Breast Cancer Outcomes Unit (BCOU), Alberta and Montreal) was done by Dr. Steven McKinney. Sample and library preparation and construction were done by Damian Yap and Adrian Wan, who further submitted libraries to the Genome Sciences Center (GSC) for Whole Genome Sequencing (WGS). Quality control (QC) and variant calling on the whole genome sequencing data were done by the Shah Lab.
I completed all tasks pertaining to downstream analysis, that is: structuring and modelling the data from variant calling and data analytics pipelines, database design and development, bulk data loading, database management and optimization, downstream QC, identification of significantly mutated genes in the cohort, analysis of the genomic variants and clinical data in the database, and TNBC subgroup discovery and analyses. Towards subgroup discovery, Tyler Funnell provided data on mutation signatures derived from the multi-modal correlated topic model (MMCTM) as one of the key input parameters for patient stratification. I further developed a database interface (Genome-Miner) to support database access and user-defined data analyses, visualizations and sharing among various research groups, an idea conceived by my supervisor.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
Dedication
1 Introduction
  1.1 Breast cancer
  1.2 Triple Negative Breast Cancer (TNBC)
    1.2.1 Immunohistochemical classification and clinical characteristics of TNBC
    1.2.2 Molecular and genomic stratification of TNBC
      1.2.2.1 Molecular heterogeneity of TNBC
      1.2.2.2 Driver mutations in TNBCs
      1.2.2.3 "BRCAness" in TNBC
      1.2.2.4 The clonal spectrum of TNBCs
      1.2.2.5 TNBC mutational signatures
    1.2.3 Treatment of TNBC
      1.2.3.1 Surgery, radiotherapy and chemotherapy treatment in TNBC
      1.2.3.2 Emerging therapeutic modalities in TNBC
    1.2.4 Whole genome profiling as a stratification tool in cancer
    1.2.5 Databases for large scale and integrated genomic data mining and analysis
      1.2.5.1 Why database management systems?
    1.2.6 Research aims, rationale and hypotheses
      1.2.6.1 Research questions the database infrastructure is intended to support:
      1.2.6.2 Research methods and workflow
2 Database Design, Implementation and Optimization
  2.1 Data structuring
  2.2 Database design and development
    2.2.1 Relationships between entities and data constraints
  2.3 Database optimization
    2.3.1 Indexing
    2.3.2 Query optimization
    2.3.3 Re-clustering
    2.3.4 Vacuuming
    2.3.5 Bulk-loading
3 Database Application to Whole Genome Profiling and Stratification of TNBCs
  3.1 Quality Control (QC)
  3.2 Somatic aberrations characteristic of TNBC
    3.2.1 Distribution of mutation loads per sample and across the cohort
    3.2.2 Structural variants
    3.2.3 Copy number aberrations
    3.2.4 Gene-level analysis
  3.3 TNBC genomic subgroup discovery
    3.3.1 TNBC subgroups identified by mutation signatures
    3.3.2 TNBC subgroups identified by CNAs
    3.3.3 TNBC subgroups identified by SNVs
    3.3.4 TNBC subgroups identified by indels
    3.3.5 TNBC subgroups identified by SVs
    3.3.6 TNBC subgroup discovery by genomic feature integration
    3.3.7 TNBC genomic subgroup analysis
      3.3.7.1 Subgroup comparative analyses of mutation loads
      3.3.7.2 Subgroup comparative analyses of the distribution of rearrangements
      3.3.7.3 Subgroup comparative analyses of trinucleotide distributions
      3.3.7.4 Subgroup comparative analyses from a clinical perspective
4 Data Access and Visualization Interface
  4.1 Genome-Miner
  4.2 Quality control analyses and visualizations
  4.3 Mutation load analyses and visualizations per sample and across the cohort
  4.4 Genomic visualizations
  4.5 Intra-sample trinucleotide distribution
  4.6 CNAs analysis and visualizations per sample and across the cohort
  4.7 SVs analysis and visualizations per sample and across the cohort
  4.8 Gene-level analysis
  4.9 TNBC subgroup analysis and visualizations
5 Conclusions and Future Work
  5.1 Limitations and future work
Bibliography
Appendices
A Examples of Data Structuring, Bulk-loading and Data Extraction Scripts
  A.1 Examples of data structuring and loading scripts
    A.1.1 Script to structure and load BAM file statistics derived by flagstats
    A.1.2 Script to structure and load mutationSeq data
  A.2 Examples of data extraction scripts
      A.2.0.1 Extracting mutation loads per case:
      A.2.0.2 Extracting samples with specified mutations in genes of interest:
      A.2.0.3 Extracting copy number profile of a case of interest:
B Significantly Mutated Genes
  B.1 50 top significantly mutated genes (SMGs) in this TNBC study cohort identified using MutsigCV
C Database Data Dictionary

List of Tables

1.1 Breast cancer stages, corresponding tumour sizes and node involvement
C.1 Database entity - Projects
C.2 Database entity - Clinical_data
C.3 Database entity - Samples
C.4 Database entity - Titan_params_cnas
C.5 Database entity - Titan_segs_cnas
C.6 Database entity - Titan_outfile_cnas
C.7 Database entity - Museq_snvs
C.8 Database entity - Strelka_snvs
C.9 Database entity - Strelka_indels
C.10 Database entity - Destruct_breakpoints
C.11 Database entity - Lumpy_svs
C.12 Database entity - Svs_filtered
C.13 Database entity - Bamstats_tumour
C.14 Database entity - Bamstats_normal

List of Figures

1.1 Histological classification of breast cancer subtypes
1.2 Histological grades of breast cancer
1.3 Rates of distant recurrence in TNBCs
1.4 PARP inhibitors
1.5 Research workflow
2.1 Variant Call Format (VCF) file structure
2.2 VCF data line
2.3 VCF annotation (ANN) field
2.4 VCF annotation (ANN) field description
2.5 Extract of a structured VCF file
2.6 Database extract of structured VCF file
2.7 Sample BAMStats - output file
2.8 Structured BAM statistics output file
2.9 TITAN pipeline output (text file)
2.10 Titan output file (database extract)
2.11 Database model - Entity Relationship Diagram (ERD)
2.12 Clustered and unclustered B+ tree index structure
2.13 Database query output
2.14 Query tree without indexing
2.15 Query plan without indexing
2.16 Query tree with indexing (Clustered B+ Tree)
2.17 Query plan with index
2.18 Query plan of poor performance query
2.19 Vacuuming for database optimization
3.1 Average read coverage, tumour-normal samples
3.2 Percentage of mapped reads, tumour-normal samples
3.3 Percentage of properly paired reads, tumour-normal samples
3.4 Normal contamination estimates
3.5 Distribution of mutation loads per sample and across the cohort
3.6 Distribution of structural variants per sample and across the cohort
3.7 Distribution of copy number aberrations (CNAs) per sample and across the cohort
3.8 Intra-sample heterogeneity - Genomic visualization
3.9 Case-based copy number profile
3.10 Visualizing gene-based mutations
3.11 TNBC genomic subgroups identified by mutation signatures
3.12 TNBC genomic subgroups identified by CNAs
3.13 TNBC genomic subgroups identified by SNVs
3.14 TNBC genomic subgroups identified by indels
3.15 TNBC genomic subgroups identified by SVs
3.16 TNBC genomic subgroups - optimal number of clusters
3.17 TNBC genomic subgroups identified by genomic feature integration
3.18 Subgroup mutation loads
3.19 Subgroup mean mutation loads
3.20 Subgroup rearrangement distributions
3.21 Subgroup trinucleotide distributions
3.22 Subgroup comparative analyses based on clinical outcomes - age
3.23 Subgroup comparative analyses based on clinical outcomes - tumour size, node status and grade
4.1 Data access and visualization interface
4.2 Database interface - quality control explorations and visualizations
4.3 Database interface - mutation loads
4.4 Database interface - genomic visualization
4.5 Database interface - trinucleotide distribution per sample
4.6 Database interface - distribution of CNAs
4.7 Database interface - distribution of SVs
4.8 Database interface - gene-level analysis
4.9 Database interface - subgroup analysis

Glossary

ALOH  Amplified Loss of Heterozygosity
ASCNA  Allele-Specific Copy Number Amplification
BAM  Binary Alignment Map
BCNA  Balanced Copy Number Amplification
BCOU  Breast Cancer Outcomes Unit
BCS  Breast-Conserving Surgery
BLIA  Basal-Like Immune-Activated
BLIS  Basal-Like Immune-Suppressed
bp  Base Pair
Cl-FBI  Clustered Foldback Inversions
Cl-SV  Clustered Structural Variants
CNA  Copy Number Aberration
COSMIC  The Catalogue of Somatic Mutations in Cancer
CSV  Comma-Separated Values
DBMS  Database Management System
dbSNP  Single Nucleotide Polymorphism Database
DCIS  Ductal Carcinoma In Situ
DDR  DNA Damage Repair
DLOH  Hemizygous Deletion LOH
DMFS  Distant Metastasis Free Survival
DS  Disease Free Survival
DSS  Disease Specific Survival
EMBL  European Molecular Biology Laboratory
ER  Estrogen Receptor
ERD  Entity Relationship Diagram
FBI  Foldback Inversions
GAIN  Gain/Duplication of 1 Allele
GSC  Genome Sciences Center
HER2  Human Epidermal Growth Factor Receptor 2
HET  Diploid Heterozygous
HGSC  High-Grade Serous Carcinoma
HOMD  Homozygous Deletion
HPRD  Human Protein Reference Database
HRD  Homologous Recombination Deficiency
IDC  Infiltrating Ductal Carcinoma
IHC  Immunohistochemical
IM  Immunomodulatory
Indel  Insertions and Deletions
L-Del  Large Deletions
L-Dup  Large Duplications
LAR  Luminal Androgen Receptor
LCIS  Lobular Carcinoma In Situ
M  Mesenchymal
M-Dup  Medium Duplications
MMCTM  Multi-Modal Correlated Topic Model
MMRD  Mismatch Repair Deficiency
MSL  Mesenchymal Stem-Like
mTNBC  Metastatic TNBC
NCBI  National Center for Biotechnology Information
NGS  Next Generation Sequencing
NLOH  Copy Neutral LOH
NMF  Non-Negative Matrix Factorization
OS  Overall Survival
PARP  Poly(ADP-ribose) Polymerase
PARPi  Poly(ADP-ribose) Polymerase Inhibitors
pCR  Pathologic Complete Response
PDB  Protein Data Bank
PR  Progesterone Receptor
QC  Quality Control
RFS  Relapse-Free Survival
S-Del  Small Deletions
S-Dup  Small Duplications
SAM  Sequence Alignment Map
SMGs  Significantly Mutated Genes
SNV  Single Nucleotide Variant
SQL  Structured Query Language
SV  Structural Variant
TCGA  The Cancer Genome Atlas
TNBC  Triple Negative Breast Cancer
TNM  Tumor Node Metastasis
Tr  Translocations
UBCNA  Unbalanced Copy Number Amplification
VCF  Variant Call Format
VEP  Variant Effect Predictor
WGS  Whole Genome Sequencing

Acknowledgements

I would like to thank my supervisor, Dr. Sohrab Shah, without whom my research accomplishments wouldn’t have been realized. Thank you, Sohrab, for such a great opportunity to work with and learn from you and the lab in general. Your guidance, encouragement, thoughtfulness and support have been immeasurable. I would also like to thank you for the great environment and establishments put in place in the Shah Lab to enable students to work effectively.

I would also like to thank the members of my supervisory and defense committee, Dr. David Huntsman, Dr. William Hsiao and Dr. Martin Hirst, for their continued support, guidance, keenness and key insights provided during my research journey.

To the faculty, staff and students of the department of Bioinformatics (UBC), thank you for the great insights, ideas and fun-filled events shared. I owe special thanks to Dr. Steven Jones and Sharon Ruschkowski for the tireless efforts, guidance, advice and help that made my graduate journey at UBC a success.

My appreciation also goes out to members of the Shah lab, especially Yikan Wang, Ali Bashashati, Tyler Funnell, Allen Zhang, Diljot Grewal, Daniel Lai, Andrew McPherson, Sohrab Salehi, Fatemeh Dorri, Kieran Campbell, Camila de Souza, Saeed Saberi, Oleg Golovko and Mirela Andronescu, for their support, constructive criticisms and helpful insights that went a long way into shaping my research and its implementations.

Last but not least, I would like to thank my family and friends. Dad and mum, you have been awesome! Thank you for being there for me, encouraging and praying for me. Words are not enough to express my gratitude for the love, support and care you have provided. To my sisters, brothers and to all my friends, you rock!
Thank you for the encouragement and support when most needed.

Dedication

To My Family
Especially, Dad and Mum

Chapter 1
Introduction

1.1 Breast cancer

Worldwide, breast cancer is reported as one of the most common cancers, with more than 1,300,000 cases and 450,000 deaths each year [1]. The multiple subtypes of breast cancer reflect its heterogeneity, which has proven invaluable in understanding differences in patient outcomes and responses to therapy. Clinically, there are three therapeutic categories of breast cancer, established by the presence or absence of three receptors: oestrogen receptor (ER), human epidermal growth factor receptor 2 (HER2, also called ERBB2) and progesterone receptor (PR). However, various studies [1–3] have demonstrated that the heterogeneity of breast cancer extends far beyond these immunohistochemical (IHC) classifications. At the intrinsic molecular level, breast cancer can also be classified as either luminal or basal-like depending on the expression of different cytokeratins (basal-like cytokeratins: KRT5, KRT6A, KRT6B, KRT14, KRT16, KRT17, KRT23 and KRT81; luminal cytokeratins: KRT7, KRT8, KRT18 and KRT19) [4, 5], with the basal-like subtype accounting for 10-25% of all invasive breast cancers [6].

In addition to cytokeratin expression, breast cancers have further been classified as basal-like, HER2-like, normal breast-like, luminal A and luminal B based on an “intrinsic/UNC” 306-gene set [2, 3]. This intrinsic subtyping of breast cancer by gene expression analyses was further supported by research done by The Cancer Genome Atlas Network [1], in which various omics data (DNA copy-number arrays, DNA methylation, exome sequencing, messenger-RNA arrays, microRNA sequencing and reverse-phase protein arrays) were integrated to report four molecular breast cancer subtypes: luminal/ER+ (luminal A and luminal B), HER2-enriched and basal-like.
Each of the identified subtypes exhibited molecular heterogeneity, distinct domination of specific signaling pathways and enrichment for mutations in certain genes, such as the enrichment of specific mutations in GATA3, PIK3CA and MAP3K1 within the luminal A subtype.

Histopathologically, breast cancer can be broadly classified into in situ carcinoma or invasive (infiltrating) carcinoma (Fig. 1.1). In situ carcinoma can be further subdivided into ductal carcinoma in situ (DCIS) and lobular carcinoma in situ (LCIS). LCIS arises from multiple foci (10 or more) and is therefore regarded as multicentric; bilateral LCIS is also common [7]. LCIS is not a premalignant lesion, but it is regarded as an important marker for identifying women at increased risk of subsequently developing invasive breast cancer, and regular mammography could therefore help in early breast cancer detection [7]. DCIS, on the other hand, is more prevalent and heterogeneous than LCIS and has a likelihood of progressing into an invasive cancer. DCIS is therefore characterized as a pre-invasive or precursor lesion and accounts for about 16% of all detected breast cancer malignancies. It is also reported to be multicentric in 40% of breast cancer cases, with high rates of local relapse (50% recurrences) that could exceed those of invasive cancer after monotherapy treatment with breast-conserving surgery [8].

Figure 1.1: Histological classification of breast cancer subtypes. Figure modified from Malhotra et al. [9].

Invasive (infiltrating) carcinomas, on the other hand, are a heterogeneous group of tumors differentiated into seven histological subtypes: tubular, ductal/lobular, invasive lobular, infiltrating ductal, mucinous (colloid), medullary and papillary carcinomas (Fig. 1.1).
Infiltrating ductal carcinoma (IDC) is the most common subtype, accounting for 70–80% of all invasive lesions [9], and is further sub-classified by grade as well-differentiated, moderately differentiated or poorly differentiated [9].

Classification of breast cancers by histological grade has long been used as an indication of prognosis, with a significant bearing on the choice of patient treatment. Grading is based on cell morphology, the similarity of cancerous cells to non-cancerous cells and the nuclear grade, which sheds light on the size and shape of the nucleus and the proliferative index (NCI, 2013). Histological grades range from grade 1 to grade 3. In Grade 1, the cancer cells look like normal cells, with a high homology to the normal breast terminal duct lobular unit. They are small and uniform with a mild degree of pleomorphism and are usually slow-growing compared to other breast cancer grades. Grade 1 is therefore regarded as well-differentiated with a low proliferative index. Grade 2 breast cancer has cells slightly bigger than normal cells. They vary in shape, grow faster than normal cells and are moderately differentiated. Grade 3 cells look more abnormal compared to normal cells and form poorly differentiated or undifferentiated, highly proliferative tumours (Fig. 1.2).

Figure 1.2: Histological grades of breast cancer obtained using the Nottingham Grading System: (a) Grade 1 - well differentiated tumors that exhibit high homology to the normal breast terminal duct lobular unit, low mitotic rates, a low incidence of nuclear pleomorphism and are arranged in small tubes (tubule formation > 75%). (b) Grade 2 - moderately differentiated tumor. (c) Grade 3 - poorly differentiated or undifferentiated tumor that lacks normal features (tubule formation < 10%), has a higher incidence of nuclear pleomorphism and tends to grow and spread faster.
Source: Rakha et al. [10].

Besides using breast cancer grades to classify patients, the tumor node metastasis (TNM) system was developed by the American Joint Committee on Cancer to stratify patients based on prognosis. Characteristics of a patient’s primary tumour such as the size, lymph node status, invasiveness and existence of metastasis (local or distant) are among the key features incorporated into this system.

Stage | Tumour Size | Node Involvement
I     | < 2cm       | No axillary lymph node involvement
      | > 5cm       | No node involvement
III   |             | Extensive ipsilateral axillary lymph node positivity or supraclavicular lymph node involvement. Inflammatory carcinoma. Tumour extension into the chest wall or skin in the form of ulceration
IV    |             | Distant metastasis

Table 1.1: Breast cancer stages, corresponding tumour sizes and node involvement. Patients with the poorest prognosis commonly present with stage III or IV breast cancer, tumour sizes > 5cm and/or a node-positive status (node positivity indicates the likelihood of cancer spread to other tissues).

Treatment of breast cancer patients is currently informed by hormone receptor status (ER, PR and HER2), tumour size, lymph node status, cancer stage and the general health condition of a patient. Local, non-invasive breast cancers are treated with surgery as a mono-therapy or in combination with radiation. Towards effective surgery, neoadjuvant therapy is administered before surgery to reduce the size of a patient’s tumour. In patients whose lymph node status is positive, adjuvant therapy is administered after surgery to reduce the risk of disease recurrence. More systemic approaches that have been applied in the treatment of breast cancer include the administration of chemotherapy and targeted therapies that putatively reduce toxicity to normal cells.
Among the current targeted therapies and standard of care for patients with breast cancer is tamoxifen, an anti-hormonal endocrine compound used to treat patients with ER and PR positive cancers. Trastuzumab, a monoclonal antibody, has also been used in the treatment of HER2 positive breast cancer.

1.2 Triple Negative Breast Cancer (TNBC)

1.2.1 Immunohistochemical classification and clinical characteristics of TNBC

Triple Negative Breast Cancer (TNBC) is a distinct subtype of breast cancer that represents 10% - 20% of all breast cancers worldwide [1, 4]. Immunohistochemically, TNBC is a subtype of exclusion: a breast cancer phenotype characterized by the lack of expression of the biomarkers estrogen receptor (ER) and progesterone receptor (PR), and in which the human epidermal growth factor receptor 2 (HER-2) is not overexpressed or its gene is not amplified as assessed by fluorescence in situ hybridization. TNBCs are also classified as basal-like based on the PAM50 classification, in which 80.6% of TNBCs are classified as basal-like. The notion that basal-like breast cancers account for the highest proportion of TNBCs is also supported by studies conducted by the TCGA [1], Lehmann et al. [4] and Curtis et al. [11]. Like basal-like cancers, TNBCs exhibit high proliferative indices, and mutations and genomic deletions in TP53 and RB1 [12]. They are also closely associated with the expression of high-molecular-weight basal cytokeratins 5/6, 14, and 17, P-cadherin, p53, and EGFR [1, 13, 14].

Clinically, TNBC is the most aggressive form of breast cancer [2], with the majority of TNBCs histologically classified as being of higher grade compared to other types of breast cancer. They are invasive ductal carcinomas, usually found at a late stage [15, 16].
TNBCs are also characterized by large tumors, yet tumour size does not correlate with node status in women whose tumours are < 5cm. In a study conducted by Dent et al., even small tumours in TNBC had a high rate of node positivity, with 55% of women with tumours < 1cm having at least one positive lymph node [17], indicating an increased risk of their cancer spreading. TNBCs are also reported to be more common in young women (age < 50 years) [4, 16, 18], with a higher incidence among African-American and Hispanic women [19, 20].

Compared to hormone receptor positive invasive ductal carcinomas, TNBCs exhibit poorer prognosis, with patients exhibiting a shorter time to relapse and metastatic disease, and shorter overall survival [15, 19, 21]. TNBC metastases also distinctively and predominantly affect the central nervous system, lymph nodes and visceral organs (especially the lungs) [22, 23], compared to other breast cancers whose relapses are commonly in bone and skin [22, 24, 25]. Compared to other breast cancer types, TNBC metastases are reported to have a much higher proliferative index and a much shorter median survival (∼ 12 months). Treatment of TNBCs with presurgical (neoadjuvant) chemotherapy has reported higher clinical response rates in some patients compared to response rates in other breast cancer types [1, 4, 18] (clinical response rates of up to 85% and pathologic complete response (pCR) rates of 30 - 40% [26]). However, despite these encouraging response rates, the vast majority of TNBC patients have very poor outcomes and are still at a greater risk of distant disease recurrence and rapid disease progression within 3 - 5 years of recurrence (Fig. 1.3) [17, 26, 27]. All patients with metastatic TNBC eventually die of the disease, despite having had adjuvant chemotherapy [17, 18].

1.2.2 Molecular and genomic stratification of TNBC

1.2.2.1 Molecular heterogeneity of TNBC

Efforts to elucidate TNBC and to stratify patient groups that may exhibit different biology and treatment response are underway. In a study conducted by Lehmann et al. [4], an aggregate analysis of 21 publicly available expression data sets (3,247 primary human breast cancers and 587 TNBC gene expression profiles) identified six distinct molecular TNBC subtypes: two basal-like subtypes (BL1 and BL2), two mesenchymal subtypes (Mesenchymal (M) and Mesenchymal Stem-Like (MSL)), Immunomodulatory (IM) and Luminal Androgen Receptor (LAR), each showing distinctive biological phenotypes, gene ontologies, gene expression patterns and clinical outcomes. Pharmacological targeting of predicted driver signaling pathways in cell line models representative of each of the six subtypes revealed sensitivity to targeted therapeutic agents in the different subtypes.

Figure 1.3: Rates of distant recurrence following surgery in a cohort of TNBC patients compared to other breast cancer patients. The hazard ratio for distant recurrence within the first 5 years post-surgery in TNBC compared to other breast cancers was 2.6; 95% confidence interval (CI) 2.0 – 3.5. Source: Lee et al. [27].

The BL1 subtype was found to be enriched in cell cycle and DNA damage response gene expression signatures. Patients with tumors in this subtype putatively benefit from agents that preferentially target highly proliferative tumors (e.g., use of proliferation biomarkers such as Ki-67, and anti-mitotic and DNA-damaging agents), including cisplatin and PARP inhibitors. BL2 on the other hand was found to be enriched in growth factor signaling and myoepithelial markers and preferentially responded to mTOR and growth factor inhibitors.
The IM subtype, enriched in immune cell signaling pathways, responded to cisplatin and PARP inhibitors. The M and MSL subtypes (with the MSL subtype displaying low expression levels of claudins) were found to be characterized by high expression of genes involved in differentiation and growth factor pathways. The mesenchymal subtypes responded to dasatinib, an SRC inhibitor. The LAR subtype on the other hand was driven by androgen receptor signaling, exhibited a high expression of luminal markers, FOXA1 and XBP1, and benefited from treatment with both the AR antagonist bicalutamide and PI3K inhibitors (the PI3K/mTOR inhibitor NVP-BEZ235), the latter attributed to the high frequency of PIK3CA mutations in this subtype [1, 4]. Based on this study, 47% of TNBCs were classified as basal-like, 17% luminal A, 12% normal breast-like, 6% luminal B, 6% HER2, and 12% were unclassified. These findings further revealed that not all TNBCs are basal-like.

Besides revealing unique biology, the study conducted by Lehmann et al. also portrayed distinct subtype variations in patient relapse-free survival (RFS) and distant-metastasis-free survival despite the administration of subtype preferential treatments. RFS was significantly decreased in the LAR subtype compared to other non-luminal subtypes. RFS was significantly decreased in the M subtype compared with BL1 and IM subtypes, while the MSL subtype had higher RFS than the M subtype. Distant-metastasis-free survival (DMFS) did not vary between TNBC subtypes. The M and MSL subtypes differed clinically, with patients in the M subtype presenting with shorter RFS. These findings suggest that patient outcomes are strongly correlated with their tumor composition or subtype.

Despite these salient findings on TNBC and its heterogeneity, the analysis of tumors with IHC-confirmed ER, PR and HER2 expression status in Lehmann et al.’s study led to the observation of only 5 of the 6 gene expression subtypes.
Motivated by this limitation, Burstein and colleagues [28] conducted RNA and DNA profiling analyses on 198 tumors, revealing four distinct and stable TNBC subtypes: (1) Luminal-AR (LAR); (2) Mesenchymal (MES); (3) Basal-Like Immune-Suppressed (BLIS); and (4) Basal-Like Immune-Activated (BLIA). Like the 6 gene expression subtypes from Lehmann et al.’s study, each of the 4 subtypes identified in Burstein et al.’s study showed distinct molecular profiles with distinct prognoses, with the BLIS tumors having the worst prognosis while the BLIA tumors had the best. Subtype-specific targets included androgen receptor and the cell surface mucin MUC1 in the LAR subtype; growth factor receptors (PDGF receptor A; c-Kit) in the MES subtype; an immune suppressing molecule (VTCN1) in BLIS; and Stat signal transduction molecules and cytokines in the BLIA subtype [28].

In contrast to the therapies identified for the 6 gene expression subtypes, Burstein et al.’s study [28] suggests the application of MUC1 and AR antagonists in the treatment of AR- and MUC1-overexpressing LAR tumors, while MES tumors would preferentially respond to beta blockers, IGF inhibitors, or PDGFR inhibitors. BLIS tumors would benefit from immune-based strategies (e.g., PD1 or VTCN1 antibodies), while STAT inhibitors, cytokine or cytokine receptor antibodies, or ipilimumab, a CTLA4 inhibitor [29], may be effective treatments for BLIA tumors. The findings of this study suggest that analysis of TNBCs beyond gene expression profiles reveals novel TNBC subtype-specific markers that could be targeted for more effective treatment of TNBCs.

In a more recent study conducted by Lehmann et al. to further elucidate triple-negative breast cancer subtypes, the original six molecular classifications were refined into four: basal-like 1 (BL1), basal-like 2 (BL2), luminal androgen receptor (LAR) and mesenchymal (M) [30], and these are shown to co-occur within given tumours when analyzed using single-cell genomics [31].
These studies demonstrate and confirm the heterogeneity of TNBC and the treatment complications associated with it.

1.2.2.2 Driver mutations in TNBCs

Various studies have been conducted to identify the molecular portraits and subtype-specific mutations that provide a selective growth advantage and thus promote cancer development in TNBC, in order to better understand the disease [1, 2, 4, 28, 32].

In a study conducted by The Cancer Genome Atlas (TCGA), somatic mutations in TP53, PIK3CA, and GATA3 were identified as occurring at a frequency higher than 10% [1] in primary breast cancers. TP53 mutations (mostly nonsense and frameshift) were found to be most prevalent in basal-like breast cancers exhibiting a TP53 loss of function. Specific to TNBC, Shah et al. identified TP53, PIK3CA, USH2A, MYO3A, PTEN and RB1 as the most frequently mutated genes in TNBC [32]. Most of the loss-of-function and gain-of-function alterations in TNBC involve genes associated with DNA damage repair and phosphatidylinositol 3-kinase (PI3K) signalling pathways, respectively [1]. Apart from loss of TP53, other alterations in DNA damage repair genes include loss of RB1 and loss of BRCA1 function [18]. Low PTEN protein levels have also been reported in TNBCs [4]. FGFR2, MAPK13, the SRC family, the MUC family, and the BCL2 family form another set of hyper-activated genes identified from the exploration of TNBC genomic profiles, in which they are revealed to exhibit higher expression levels, more copy number changes (most characterized by loss of 5q and 10q) [1, 11], lower DNA methylation levels (also in concordance with the TCGA study [1]), or are seen as targets of miRNAs with lower expression in TNBC than in normal samples. EGFR is also reported to be upregulated in approximately 60% of basal-like TNBCs [13]. A further review of the 6 TNBC subtypes identified by Lehmann et al. [4] revealed higher mutation rates in basal-like cancers, however with less diversity [18].
This finding suggests that the high mutation rate of a gene significantly contributes to fueling a disease, as opposed to the diversity and recurrence of mutated genes in the genome. Combining CNA and mutation data with expression data also implicated well known oncogenes and tumour suppressors: TP53, PIK3CA, NRAS, EGFR, RB1 and ATM. PARK2, RB1, PTEN and EGFR were the most frequently observed copy number events, which mostly belonged to the BL2 subtype that is heavily enriched for growth factor signaling pathways [32]. These genes are suggested to be potential targets for TNBC treatment [33].

1.2.2.3 "BRCAness" in TNBC

Approximately 10 – 20% of TNBC patients harbour germline BRCA mutations. Even in wild-type BRCA patients, somatic mutations of the homologous recombination (HR) pathway can produce a similar phenotype termed "BRCAness" [27, 34].

Tumours with known BRCA1 and BRCA2 mutations display phenotypes that correlate with the basal-like subtype [4, 35], a subtype that is also characterized by genomic instability [36]. In particular, TNBCs exhibit gene expression profiles similar to those of BRCA1-deficient tumors [15], sharing the increased sensitivity to genotoxic agents exhibited by BRCA-deficient tumors [37]. Sporadic basal-like breast tumors and tumors arising in BRCA1 carriers possess a similar etiology: they are both likely to be of high grade, both express basal keratins, they are ER/PR-negative and HER2-negative, and they have a high frequency of TP53 mutations [18]. Other hallmarks of "BRCAness" include EGFR expression, c-MYC amplification, loss of RAD51-focus formation, and sensitivity to DNA-crosslinking agents [38]. Of note, the genomic instability reported in TNBCs and BRCA-associated breast cancers could be a result of deficient DNA repair and may lead to the success of some chemotherapy regimens [39].
A study conducted by Jiang et al. [40] on the predictors of chemosensitivity in TNBC revealed an RNA-based BRCA-deficient subtype that included up to 50% of TNBC tumors that appeared immune primed. It was also found that mutations that lowered the levels of functional BRCA1 or BRCA2 RNA were associated with significantly better survival outcomes [40].

1.2.2.4 The clonal spectrum of TNBCs

TNBCs exhibit a wide and continuous spectrum of genomic evolution that portrays a continuously varying distribution of mutation abundance among tumors [32].

By analyzing somatic mutations, copy number aberrations (CNA), gene fusions, and gene expression patterns of 104 primary TNBCs, Shah et al. revealed a mismatch in the proportion of somatic mutation abundance relative to the proportion of the genome altered by CNAs in TNBC cases, with some cases having numerous mutations but only close to 1% alterations in the genome, while other cases presented with few mutations but with notably high numbers of genomic alterations. A significant variation in the clonal composition of TNBCs was also found, with some cases presenting with few genotypes while others presented with multiple genotypes. We would expect that an increase in mutations would increase clonal frequency and that mutations in driver genes occur in the highest frequency groups; however, this was not evidenced in this study, as some cases were found to have driver genes in low clonal frequency groups. In addition, 12% of cases did not have mutations in any known driver genes, further suggesting that TNBCs are mutationally heterogeneous from the outset, with variations in early clonal expansion drivers. Basal-like TNBCs were also reported to present with higher clonality at diagnosis compared to non-basal TNBCs [32]. Jiang et al. further confirmed this notion in their study, which also revealed an increased clonal mutational burden (more clonal tumors with a higher number of mutations per clone) in TNBC tumors that are BRCA deficient [40]. The pathways of the most frequently mutated genes (TP53, PIK3CA, PTEN - basal-like and luminal) as analysed in Shah et al.’s study were seen in high clonal frequency groups, while genes in cell motility and ECM pathways (mesenchymal-like) were seen in lower clonal frequency groups and are believed to have mutations that were acquired much later [32]. The key findings of this study suggest that TNBC tumours are unique, with varying mutational content in particular pathways; they have varying numbers of implicated molecular pathways and are shaped by distinctive mutagens and biological processes that drive mutations, clonal evolution and expansion.

1.2.2.5 TNBC mutational signatures

Somatic mutations in genes that control cellular growth and division are a consequence of exogenous (ultra-violet radiation and tobacco) or endogenous (age, DNA repair deficiencies) mutational processes that offer insights into tumour causative events. These mutational processes are linked to specific molecular lesions and to the subsequent repair mechanisms initiated by a cell to mitigate the damage, which in turn generate unique combinations of mutation types (signatures) that change DNA in a specific way [41, 42]. For example, endogenous processes like DNA repair deficiencies initiate point mutations and structural variations [43]; APOBEC dysregulation results in C→T substitutions [44], while C→A substitutions are reported to be induced by tobacco smoke [41]. These signatures can indicate which causative mechanisms are active in a patient’s tumor and can reveal clinically actionable events and key features for patient stratification [42, 45, 46].

In an effort to determine the role of genomic rearrangements as driver mutations in breast cancer, Nik-Zainal et al. identified six rearrangement signatures, 2 of which are associated with TNBC [47]. All rearrangements in Signature 1 were characterized by tandem duplications (> 100kb), evenly distributed across the genome. Cancers exhibiting this phenomenon are frequently TP53 mutated. Signature 3 was characterized by tandem duplications (< 10kb), and most of the cancers (91%) with BRCA1 mutations or promoter hypermethylation were found in this group, a group also enriched for basal-like TNBCs. Signature 5, characterized by deletions (< 100kb), was strongly associated with the presence of BRCA1 mutations or promoter hypermethylation, BRCA2 mutations and with rearrangement signature 1 large tandem duplications. These events were also revealed to be evenly distributed across the genome. Signature 2 on the other hand was characterized by deletions (> 100kb), inversions and interchromosomal translocations, and contains components implicated in kataegis (focal base substitutions) and APOBEC DNA-editing proteins. Signature 4 was characterized by interchromosomal translocations, while signature 6 was characterized by inversions and deletions.

In Nik-Zainal et al.’s study, cancers without identifiable mutations of BRCA1/2 or BRCA1 promoter methylation showed features similar to those of BRCA1/2-mutated cancers. This implies that either the BRCA1 mutations might have been missed or other mutated or promoter methylated genes may be exerting similar effects [47].
Based on this observation, combinations of base substitution, indel and rearrangement mutational signatures may be better biomarkers of defective homologous recombination-based DNA double strand break repair, and better biomarkers of responsiveness to cisplatin and PARP inhibitors, than BRCA1/2 mutations or promoter methylation alone.

In a more recent and generalized study on breast and ovary somatic mutations, conducted by Funnell et al. to better understand mutation signatures from the perspective of DNA repair deficiency, both SNVs and SVs were used for mutation signature inference, in which an age-related signature (SNV), APOBEC signature (SNV), deletion (SV), tandem duplication (SV) and HRD (SNV) signatures were identified as associated with breast cancer. Unsupervised clustering of tumours revealed subgroups with mutations in BRCA1/BRCA2 that were associated with an HRD signature. This study also revealed the salient role of mutation signatures and their application in prognostic, patient and therapeutic subgroup discovery [42].

1.2.3 Treatment of TNBC

Hormone receptors ER, PR and HER2 (also called ERBB2) are known to fuel most breast cancers [4, 32], for which intense efforts have been made to identify druggable targets [4, 28]. To date, the most successful therapies for breast cancer are those that target these receptors, with the most successful being the targeting of HER2 and ER in HER2+ and ER+ patients respectively [1]. In contrast, due to the lack of targetable receptors, TNBC patients do not benefit from hormonal therapies. Further, the lack of identified significant genomic driver alterations in TNBC, and the degree of tumor cell heterogeneity, have limited a targeted approach to the management of TNBC. This has left surgery, radiation and chemotherapy, or a combination of these therapies, as the first line of treatment for TNBC patients [48].
However, more recently, research has identified certain receptors as putative targets for new therapeutic drugs and has shown their potential benefit, as will be discussed in subsequent sections.

1.2.3.1 Surgery, radiotherapy and chemotherapy treatment in TNBC

Predominantly, local and non-invasive TNBCs are treated with surgery. This is done with or without radiation to eliminate residual disease and reduce recurrence. Studies have shown that the younger age, higher grade or biological aggressiveness of a patient’s disease does not impact surgical treatment choices (that is, mastectomy vs lumpectomy), with the surgical choice mostly made based on clinicopathological variables and patient preferences [49]. Current surgical approaches however advocate for breast-conservative surgery (BCS) followed by radiation as opposed to mastectomy, a more radical procedure, given that both are associated with equivalent survival rates, with BCS further reducing surgical complications [50]. TNBCs are reported to be appropriate candidates for breast-conservative surgery, as the local recurrence rate after surgery is not as high as that of other breast cancer subtypes [51]. However, this remains controversial, as some research teams suggest that BCS followed by radiation therapy in early stage TNBC is not equivalent to mastectomy given the rapid growth and locally aggressive nature of TNBCs [52]. Secondly, given that some TNBCs harbour mutations in BRCA1, these tumors are deficient in double-strand DNA break repair by homologous recombination and are potentially highly radiosensitive [53]. Given the complex nature of TNBC and most cancers in general, more systemic treatment options that go beyond surgery and radiotherapy have been applied towards effective treatment of patients.

Among the systemic approaches applied in the treatment of TNBC has been the application of chemotherapy that combines the use of drugs with surgery and radiotherapy.
Currently, the most common chemotherapeutic regimens include anthracycline-taxane chemotherapy (either in the neoadjuvant or adjuvant setting) [54]. Compared to estrogen receptor positive tumors, TNBCs have shown a higher response rate to neoadjuvant therapy [55]; however, there is a higher risk of recurrence in patients who do not achieve pathological complete response. In such cases of metastatic TNBC (mTNBC), the only available strategy is the re-administration of systemic chemotherapy; unfortunately, this approach is limited by poor response, toxicity and eventual multi-drug resistance. Chemotherapy works by impairing proliferation. It elicits a selective effect on cells that divide rapidly. As chemotherapeutic cytotoxicity targets highly proliferative cells and is not exclusive to cancer cells, normal cells are affected too. This results in undesirable side effects such as anemia, alopecia, fever, mucositis, myelosuppression and immunosuppression [56]. Residual disease is also associated with a poorer prognosis compared to other types of residual breast cancer.

Our evolving and improved understanding of the underpinnings and molecular biology of TNBC is beginning to shed more light on possible and effective therapeutic modalities and has led to the discovery of new agents that target specific pathways in TNBC, as will be discussed in the subsequent section.

1.2.3.2 Emerging therapeutic modalities in TNBC

Recently, the application of massively parallel sequencing and other ‘omics’ technologies for genomic analysis has begun to reveal molecular alterations and potentially actionable features such as BRCA1/2 mutations ("BRCAness") and the presence of the androgen receptor in some TNBC subtypes. This has allowed for the discovery of targeted therapies that could be included in clinical trials to improve patient outcomes.
Various therapeutic modalities (agents targeting some component(s) of the signalling cascades active in TNBC), such as PARP, Src, EGFR and VEGF inhibitors, have been proposed and identified as putatively benefiting TNBC patients.

Poly (ADP-ribose) polymerase inhibitors (PARPi): As earlier mentioned, about 10 - 20% of TNBCs harbor mutations in BRCA1/2, genes that are pivotal for genomic stability and the regulation of DNA damage repair and maintenance. Cells in tumors with loss of BRCA1 or BRCA2 function are deficient for homologous recombination DNA repair mechanisms, hence TNBCs preferentially respond to DNA damaging agents and to PARP inhibitors, which block the PARP enzymes that catalyze the fusion of components needed for alternative pathways of DNA repair (Fig. 1.4), recognize DNA damage and facilitate DNA repair to maintain genomic stability [57, 58].

Figure 1.4: PARP inhibitors: DNA double strand break caused by PARPi via 2 mechanisms of action: inhibition of PARP enzyme activity and PARP trapping. In HR competent tumors, tumor cells with homologous recombination repair survive, while in HR deficient cancers, blockade of this pathway by PARP inhibition leads to synthetic lethality and cell death. Source: Lim et al. [59].

In research done by Carey et al., it was hypothesized that “PARP inhibition, in conjunction with the loss of DNA repair via BRCA-dependent mechanisms, would result in synthetic lethality and augmented cell death”; however, identifying patients most likely to respond to PARP inhibitors is still a challenge [60].
Poly-ADP ribose polymerase inhibitors (mono-therapies: veliparib and olaparib) are currently in clinical trials and have shown improved overall response rates when combined with chemotherapy [27].

Other emerging therapeutic modalities in TNBC: Somatic mutations involving the Epidermal Growth Factor Receptor (EGFR) lead to its activation, which subsequently produces uncontrolled cell division, proliferation, epithelial-mesenchymal transition (EMT), migration, invasion and angiogenesis [61], which in turn promote primary tumorigenesis and metastasis. EGFR is a promising therapeutic target for TNBC given that it has been reported to be expressed in approximately 89% of TNBCs [62]. Angiogenesis inhibition using VEGF (Vascular Endothelial Growth Factor) pathway inhibitors has also been included in TNBC clinical trials, due to the poorer prognosis associated with VEGF and its expression, which is reported to be significantly higher in TNBCs compared to other breast cancer subtypes [63]. Src, an oncoprotein and member of the family of nonreceptor tyrosine kinases, is reported to be expressed at higher levels in TNBC cells than in ER+ cancer cells, regulating a number of signaling pathways involved in, but not limited to, proliferation, metastasis, survival, migration, invasion, and angiogenesis [64, 65]. The combination of dasatinib, cetuximab and cisplatin has provided therapeutic promise by enhancing the inhibition of cell growth, migration and invasion [66]. A number of clinical trials using androgen receptor targeting therapy for the treatment of AR-positive TNBC are also underway, as Androgen Receptor (AR) is reported to be expressed in 12 – 60% of TNBCs, particularly in the LAR subtype [67]. Tamoxifen has been shown to reduce disease recurrence in AR-positive TNBCs [68].
Other promising treatments are a combination of AR inhibition + radiotherapy and the combination of antiandrogens + immunotherapy for TNBC patients that co-express AR and PD-L1 [69].

Some of the above mentioned therapeutic modalities in TNBC have undergone multiple clinical trials, whereas others have only been investigated in early-phase trials or tested preclinically using TNBC cell lines. Despite these efforts, no targeted therapies have been approved for TNBC. Current research has shown promising prospects in effectively treating TNBC patients; however, some clinical trials have reported no objective response to therapies such as those involving EGFR inhibitors like erlotinib and lapatinib and the monoclonal antibodies (mAbs) cetuximab and panitumumab [70, 71]. Other clinical trials, such as those involving the administration of the anti-angiogenic therapy bevacizumab in combination with chemotherapy to improve pCR in a neoadjuvant setting, have resulted in undesirable side effects like hypertension and cardiotoxicity [26], and high rates of high grade (3 and 4) toxicities have been reported when receptor tyrosine kinase inhibitors (RTKIs) like sorafenib are administered in combination with chemotherapy [72]. Secondly, the development of these targeted therapies has been hindered by the inability to define patient groups that would preferentially benefit from a particular targeted agent. There is still a dire need to understand TNBC and develop new strategies for classifying and treating TNBC patients, including recurrent and metastatic cases.

1.2.4 Whole genome profiling as a stratification tool in cancer

Recent cancer studies have implemented the use of Next Generation Sequencing (NGS) to reveal variants (SNVs and indels) that have provided insights into the genomic landscapes, molecular and genomic underpinnings and the heterogeneity existing in different cancer tumours. However, the variants revealed in these studies reside in protein-coding regions that comprise < 1% of the human genome [1, 11, 32]. Furthermore, in complex diseases like cancers, genotypic biomarkers do not provide a comprehensive representation of the biological nature of a cancer [73]. The breadth and significance of various mutation types across multiple genes affecting biological pathways relevant to cancer, and their potential clinical significance, remain largely unexplored.

Whole-genome sequencing (WGS) has been proven to be a useful approach for in-depth analysis of the landscape of mutations occurring across the genome [73, 74] and has over the years elucidated complex mutational processes at all scales through the exploration of genomic aberrations like copy number aberrations (CNAs), structural variants (SVs), small insertions and deletions (indels) and single nucleotide variants (SNVs), including the identification of intricate events like SV patterns (tandem duplications, interchromosomal translocations, foldback inversions and interstitial deletions) that represent double-strand break repair mechanisms in tumors characterized by genomic instability [73, 75].

In concurrence with the above notions, Wang et al. analyzed whole-genome point mutations and structural variation patterns of 133 ovary tumors to reveal seven subgroups within the studied ovarian cancer cohort. In this study, somatic alterations in the tumor genome of each patient were identified to include SVs, indels, CNAs, and SNVs.
They then conducted hierarchical clustering of the 133 patients based on selected genomic features, such as the aforementioned genomic aberrations and mutation signatures, revealing 7 distinct subgroups: "(G-BC: GCT tumors with mutation signature S.BC (associated with breast cancer and medulloblastoma); E-MSI: MSI ENOC tumors characterized by mutation signature S.MMR (reflective of mismatch-repair deficiency); Mixture: HGSC, CCOC, and ENOC cases without obvious discriminant features; C-APOBEC: CCOC cases characterized by mutation signature S.APOBEC (attributed to activity of the AID/APOBEC family of cytidine deaminases); C-AGE: CCOC cases characterized by mutation signature S.AGE (associated with age at diagnosis); H-FBI: HGSC cases with high prevalence of foldback inversion SVs; H-HRD: HGSCs with prevalence of duplications or deletion rearrangements and mutation signature S.HRD (reflective of homologous recombination deficiency)" [73].

In Wang et al.’s study, structural variants and point mutations in the somatic genome were seen to provide solid, discriminant biomarkers for subgroup discovery in ovarian cancer. This groundbreaking research established whole genome profiling as a useful design for patient stratification and further highlights specific genomic events as putative targetable vulnerabilities suggestive of better treatment strategies for patients. Based on the findings of this study and the notion that TNBCs are genomically similar to high-grade serous carcinoma (HGSC), both being well characterized by genomic instability, heterogeneity and mutations in BRCA1/2, can TNBCs similarly be stratified based on their whole genome profiles? We answer this question in chapter 3.

1.2.5 Databases for large scale and integrated genomic data mining and analysis

Databases have long been used as an indispensable tool in modelling and organizing vast amounts of data, not excluding biological data.
Currently, though, profiling of patient genomes to infer patterns of mutations and genomic events underpinning a patient’s disease relies heavily on data stored in flat files. Flat files such as bam (Binary Alignment Map), vcf (Variant Call Format), txt (text) or csv (comma-separated values) are structured to contain a collection of singular records, each having atomic data with record-specific fields. The fields in these files are separated by delimiters such as commas, white space or tabs. This structure makes flat files relatively quick and easy to set up and use; however, the same structure complicates the tasks required to query and analyze highly relational and complexly structured genomic data, introduces data redundancy and demands considerable effort to access the data.

The explosive growth of biological data, such as data from sequencing, gene expression, annotation of features and genomic events, protein structures and alignments, has created a continuing need to store, manage and access data using data structures that supersede those of flat files. This rapid growth of sequencing data is attributed to the decreasing cost of sequencing whole genomes and is evidenced by the exponential increase in entries in commonly used resources such as the Catalogue of Somatic Mutations in Cancer (COSMIC), dbSNP, The Cancer Genome Atlas (TCGA), the European Molecular Biology Laboratory (EMBL) and the Protein Data Bank (PDB) [76]. As a result, data generation is no longer the bottleneck; the management and analysis of these large volumes of data is [77], which has made database management systems an attractive solution for the storage and management of genomic data. Several types of database management systems exist, including hierarchical, network, object-oriented and relational databases.
Given the highly relational data under study, we opted for relational databases, which support relational data structures between objects and are more reliable than hierarchical or network database structures.

1.2.5.1 Why database management systems?

The concept of database management systems (DBMSs), and in particular relational DBMSs, as a solution to data management dates back to 1970 [78] and has since matured to provide faster and more accurate access to large amounts of data, excellent data integration capabilities, and effective and efficient data management, sharing, storage and analysis functions. Relational databases collect data in multiple tables linked together by a common piece of data and can be arranged to support ad hoc queries. They are able to capture data of various types (numbers, strings, images, booleans, dates and times, arrays, integers, floats, characters etc.) and provide advanced data structuring capabilities that support the creation of more complex relationships between data. This has further supported the storage, organization, retrieval and sharing of large data sets, with the ability to facilitate data visualization. Extending these functionalities to support analysis of genome data directly within databases has allowed for reliable management of genome data and the analysis of genomic variants [79].

The application of databases for genomic data analysis also supports flexible, user-defined parameterization and analysis of data on the fly, directly from a database, using query and programming languages such as SQL, without incurring costly data exports or having to rely on precomputed results. In addition, database query optimizers can analyse queries using statistical characteristics of the data in a database to determine the most efficient execution mechanism for a query, improving query performance and execution runtime.
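As a minimal sketch of the two points above (parameterized, on-the-fly querying and optimizer-driven execution), the following Python example uses the standard-library sqlite3 module as a stand-in for PostgreSQL. The table snv, its columns, the index name snv_tumour_idx and the three rows are hypothetical and chosen only for illustration; the plan text reported by SQLite also differs from what PostgreSQL reports.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the optimizer's plan for a query as a list of detail strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return [row[-1] for row in rows]


# In-memory stand-in database with a hypothetical SNV table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snv (tumour_id TEXT, chrom TEXT, pos INTEGER, pr REAL)")
conn.executemany(
    "INSERT INTO snv VALUES (?, ?, ?, ?)",
    [
        ("T1", "20", 64871, 0.82),
        ("T1", "20", 139915, 0.73),
        ("T2", "20", 351395, 0.98),
    ],
)

# A parameterized query: the tumour and the probability cut-off are supplied
# at run time rather than being baked into precomputed results.
SQL = "SELECT COUNT(*) FROM snv WHERE tumour_id = ? AND pr >= ?"
PARAMS = ("T1", 0.8)

plan_before = query_plan(conn, SQL, PARAMS)   # no index yet: full table scan
conn.execute("CREATE INDEX snv_tumour_idx ON snv (tumour_id)")
plan_after = query_plan(conn, SQL, PARAMS)    # optimizer can now use the index

n_calls = conn.execute(SQL, PARAMS).fetchone()[0]
print(n_calls)
print(plan_before)
print(plan_after)
```

In PostgreSQL the analogous inspection is done with EXPLAIN; the principle illustrated here, that the optimizer moves from a full scan to an index lookup once a suitable index exists, carries over.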
The queries used also provide an execution description that can be used for documentation purposes and for reproducibility of the analysis process.

The DBMS data architecture supports data independence, where the application layer or user view is insulated from changes in the physical (storage) or logical (conceptual) schemas, and vice versa. For instance, entities, attributes or relationships can be added or removed without having to rewrite existing application programs or change the file organization system, storage structures, storage devices or indexing strategy. DBMSs also enforce both user- and system-defined constraints to support data consistency, integrity and security. They possess excellent concurrency control mechanisms like strict two-phase locking (Strict 2PL), where shared locks are acquired to read a data object in a database and exclusive locks are acquired when an object needs to be modified. This prevents anomalies such as a database user process reading data that is still in the process of being updated by another concurrent user process. To avoid data loss in the event of a crash, DBMSs execute crash recovery protocols such as write-ahead logging, as defined in the ARIES recovery algorithm, where the log record describing each update must be written to stable storage before the corresponding change is written to the database on disk. The recovery algorithm also retraces all actions of a database before a crash and restores it to the state it was in before the crash.

With the vast amounts of data from sequencing has come the need for better methods to manage and access these data.
To this effect, publicly available databases have been widely established and used to aid the access, querying and visualization of data in these repositories. One example is the cBioPortal [80], which fuses gene-level cancer genomics data from multiple studies, platforms and other databases such as the NCBI Gene database and the Human Protein Reference Database (HPRD). The cBioPortal was established to support interactive exploration, visualization and analysis of clinical outcomes and molecular profiling data (e.g. gene expression, genetic and proteomic events) across multiple samples, genes and pathways. The portal itself is a web service interface that supports database access and querying for the presence of specific biological events in each sample (such as gene homozygous deletions, amplifications and increased or decreased mRNA or miRNA expression) as a means to accelerate the translation of genomic data into new biological insights and therapies [80].

Another similar database is the Database of Genomic Variants [81], which consists of a front-end web application that facilitates data analysis and a back-end relational database (implemented in PostgreSQL) that supports flexible and interactive database querying for structural variations within or across multiple studies.

1.2.6 Research aims, rationale and hypotheses

Currently, TNBC clinical trials use similar patient selection criteria; however, these trials often display surprising heterogeneity in response to treatment, survival rates, and the likelihood of recurrence and metastasis. This is attributed to differences in prognostic factors and patient characteristics such as mutations and molecular signatures, gene expression profiles, and tissue and organ morphologies. To improve our understanding of TNBC and to identify potential clinically actionable events, better characterization of the genetic, molecular and clinical biomarkers of TNBC is still urgently needed.
Whole genome sequencing approaches have been shown to reflect specific mutational processes as targetable vulnerabilities in human cancers. However, a whole genome sequencing study in TNBC at scale to investigate genomic properties as a stratification tool has not been undertaken. Secondly, data from whole genome sequencing is often stored and managed in flat file format. This format is cumbersome and ineffective for the optimal exploration, analysis and visualization of clinical outcomes and the vast amounts of genomic data from which novel insights into complex diseases such as TNBC can be generated. Based on the hypotheses that (1) TNBC patients can be stratified into distinct subgroups based on their whole genome profiles and (2) the identified TNBC subgroups exhibit distinct clinical, molecular and genomic characteristics, the main objective of this study was to design and develop a relational database of highly structured clinical and mutation data from a cohort of TNBC, and to implement the developed database to support the exploration of the genomic landscape and mutational characteristics underpinning TNBCs. The developed clinical and genomic variants database was further applied to support the comprehensive analysis of clinical and whole genome profiles of 88 TNBC patients, with the novel aim of stratifying TNBC patients into distinct genomic subgroups to improve our understanding of the disease and provide valuable insights into options for novel therapeutic modalities and the identification of patients most likely to respond to specific modalities.
The results of this study will also go a long way in identifying subgroup-specific clinically actionable events, clarifying uncertain histopathological diagnoses, informing prognosis, guiding treatment options for patients and supporting the use of the genome as a potential biomarker in patient treatment.

Towards testing our hypotheses and achieving the main goal of this research, the following specific objectives were established:

Objective 1: Design and develop an object-relational database of clinical outcomes and genome-wide somatic variants extracted from whole genome sequencing data of 88 TNBCs

Objective 2: Structure output data from variant calling and analytics pipelines and implement data loaders to bulk-load the structured variants and clinical outcomes into the database

Objective 3: Apply the developed database for the exploration and analysis of the clinical data and genomic variants it holds, with specific focus on the following:
– Conduct quality control checks and analyses on the data from whole genome sequencing
– Identify and analyse all genome-wide somatic mutations (copy number aberrations (CNAs), structural variants (SVs), insertions/deletions (indels) and single-nucleotide variants (SNVs)) in a cohort of 88 TNBC cases and extract genomic features for patient stratification
– Identify the significantly mutated genes (SMG) in the TNBC cohort
– Identify TNBC genomic subgroups and conduct comparative subgroup analyses:
    – Compute the prevalence of mutations in each subgroup, specifically on the alterations (SNVs, CNAs, and SVs) in DNA damage repair genes and the identified SMGs
    – Investigate the association between SMGs and the genomic subgroups
    – Examine the association between the identified subgroups and clinical outcomes
    – Identify mutations that appear mutually exclusive between the identified genomic subgroups
    – Investigate the association between driver mutations and mutation signatures which stratified the genomic subgroups of the TNBC cohort

Objective 4: Build a database user interface to support interactive data access, exploration, user-defined querying and analysis, interpretation and sharing of the stored genomic variants and clinical outcomes among various research groups.

1.2.6.1 Research questions the database infrastructure is intended to support:
1. Can we stratify TNBC patients using their whole genome profiles?
2. Can we identify fold-back inversion events in TNBC tumours?
3. Do mutational signatures associate with specific driver mutations?
4. Are the segregated TNBC subgroups associated with distinct clinical outcomes?

1.2.6.2 Research methods and workflow

Samples from 88 TNBC cases were collected from various facilities across Canada (British Columbia, Montreal, Alberta) and tumor/normal sample pairs were subjected to whole genome sequencing using the Illumina HiSeq2500 (Fig. 1.5). The patient clinical data collected included, but was not limited to: the date of diagnosis, age at diagnosis, tumour grade, tumour size (in centimeters), node status, patient status, HER2, ER and PR status, and survival and recurrence status.
However, because the data collected on the overall survival status of patients was still immature, comprehensive analyses involving overall survival were not included in this study.

Figure 1.5: Research workflow.

Aim 1: The exponential growth of data generated from DNA sequencing has created a continuing need for optimal data management, access, analysis and visualization methods. Currently, data from sequencing is often stored in flat files, which are inefficient for optimal storage, querying and analysis of orthogonally collected data. To overcome these challenges, we designed and developed a relational database structure to support optimal storage, access, querying, exploration and analysis of clinical outcomes and whole genome profiling data at the level of genome-wide individual variants from the 88 TNBCs in this study cohort. Entity relationship modeling using Crow’s Foot Notation was used to design the database, which was implemented using PostgreSQL (psql version 10.5, server 9.4.8), an object-relational database management system (DBMS). The developed database was hosted, run and managed on a CentOS 6.5 server with an Intel(R) Xeon(R) CPU (E5-2660v2) with a 2.20GHz base frequency (2 CPUs, 10 physical cores per CPU, 20 logical CPU units in total), 126GB of RAM and a 40GB InfiniBand connection. The choice of this DBMS (PostgreSQL) stems from its ability to hold highly relational and large datasets, which are characteristic of genomic data. PostgreSQL also supports parameterized and user-defined queries, custom data types and indexes for query optimization; it is an open source DBMS that supports ACID (Atomicity, Consistency, Isolation, Durability) properties and stored procedures/SQL functions. The choice of this DBMS was also based on its interoperability and ability to support other languages such as PL/pgSQL, Python and R, which were largely used in this study.

Aim 2: Genomic alterations have over time been shown to have predictive and prognostic implications in cancer patients. The discovery of all genome-wide somatic mutations was done to support the identification of the putative molecular underpinnings of TNBC and the potentially actionable molecular events that could provide insights into treatment options for TNBC patients. A number of bioinformatics tools, assembled into analytics pipelines, were applied to support variant calling and analysis of data from whole genome sequencing. TITAN [82], an R Bioconductor package, was used to compute cellularity and identify regions (clonal and subclonal) of copy number alterations within patient samples. To further support our analysis, gene annotation of each copy number segment was performed using pygenes, a Python library, based on the human genome reference Homo sapiens GRCh37.73.gtf. Structural variants (SVs), including rearrangement breakpoints, were predicted using deStruct [83], a tool that identifies breakpoints and assigns read alignments to the identified breakpoints. Deletions, duplications, inversions, translocations and foldback inversions were identified based on the relative position and orientation of the break-ends in the genome. Breakpoints detected by an alternative variant calling tool, Lumpy [84], were used to filter results from deStruct and to remove low-mappability regions. Single nucleotide variants (SNVs) were predicted using mutationSeq [85], while the variant calling analysis for somatic SNVs and insertions/deletions (indels) was performed using Strelka [86]. The SnpEff tool was used to annotate the identified SNVs and indels for variant effects and gene-coding status.
Taken together, the variants identified in this cohort shed more light on the mutation patterns and signatures exhibited by different patients and patient subgroups.

Given the nature of the various tools used for variant calling, the data output from the variant calling pipelines was in disparate formats and in flat files. As mentioned earlier, this complicates data querying and processes that involve comparative data analyses. To solve this problem, the data was structured using Python and R scripts that were also used to load all the structured data into the database. Also loaded into the database were statistics derived from bam files, using the Flagstat software tool to extract bamstats and mpileup to extract average read coverage. These statistics were likewise structured before loading so that they could be explored further.

Aim 3: We then applied the developed database to support optimal access, exploration, analysis and visualization of the mutation contents and clinical outcomes it holds, towards answering our research questions and providing insights and a better understanding of TNBCs. First, we used the database to conduct quality control checks and analyses on data from whole genome sequencing. Of interest was the average read coverage of tumour samples; samples that did not meet the set threshold (60X) were excluded. We then used the database to identify and explore somatic mutations (copy number aberrations (CNAs), structural variants (SVs), insertions/deletions (indels) and single-nucleotide variants (SNVs)) by analysing mutation loads and patterns across the cohort and per case.
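The quality control filter and the per-case mutation loads described above can be sketched as follows. This is only an illustration in plain Python, not the R and SQL code used in the study; the tumour identifiers, coverage values and variant records are hypothetical, and treating a coverage of exactly 60X as passing is an assumption.

```python
from collections import Counter

MIN_COVERAGE = 60.0  # threshold stated in the text (60X)


def samples_passing_qc(avg_coverage, min_coverage=MIN_COVERAGE):
    """Return the tumour IDs whose average read coverage meets the threshold.

    Assumption: a sample with coverage exactly equal to the threshold passes.
    """
    return {tid for tid, cov in avg_coverage.items() if cov >= min_coverage}


def mutation_loads(variants, keep):
    """Count variants per tumour and mutation type for QC-passed tumours only."""
    loads = {}
    for tumour_id, mutation_type in variants:
        if tumour_id in keep:
            loads.setdefault(tumour_id, Counter())[mutation_type] += 1
    return loads


# Hypothetical inputs: average tumour coverage and (tumour, mutation type) records.
avg_coverage = {"T1": 72.4, "T2": 48.9, "T3": 60.0}
variants = [
    ("T1", "SNV"), ("T1", "SNV"), ("T1", "SV"),
    ("T2", "SNV"),
    ("T3", "indel"),
]

passed = samples_passing_qc(avg_coverage)
loads = mutation_loads(variants, passed)
print(sorted(passed))
print({tid: dict(counts) for tid, counts in sorted(loads.items())})
```

With the data held in a database, the same filter would more naturally be expressed as a join between the coverage statistics and the variant tables, so that excluded samples never leave the database.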
All analyses were completed using R, which was linked both locally and remotely to the developed database.

MutSigCV [87] was used to identify the significantly mutated genes (SMGs) across the TNBC cohort, as it can discover unexpected variations in the mutation frequency and spectrum across the genome and incorporates mutational heterogeneity to eliminate most of the artifactual significantly mutated genes. This enables the identification of genes truly associated with a cancer type. In this study, only genes with a false discovery rate < 0.1 were regarded as significantly mutated in this TNBC cohort. The identification of the significantly mutated genes in this cohort shed light on the putative subgroup drivers and implicated pathways that could further be probed for druggable targets. This analysis also shed light on defects co-occurring in certain pathways that may be of benefit in patient treatment; for example, a combination of defects in two DDR pathways can lead to synthetic lethality, which may be an effective therapeutic strategy for patients with such defects.

Stratification of patients is key in providing effective treatment options. To identify the genomic subgroups in this TNBC cohort, patient stratification was done based on the integration of the identified genomic features: CNAs, SVs, indels, SNVs and mutation signatures discovered using the multi-modal correlated topic model (MMCTM) [42]. Non-negative matrix factorization (NMF) approaches have been used extensively to study point mutation and structural variation signatures; however, NMF does not effectively support joint inference of signatures. MMCTM, on the other hand, provides an integrative approach that infers signatures using joint statistical inference from multiple mutation types like point mutations and structural variants.
This further supports discovery of signatures active among patient groups, as seen in the case of homologous recombination deficiency, which induces patterns of both SNVs and SVs in breast and high grade serous ovarian cancers [42]. Because of these attributes, MMCTM was preferred in this study to support signature inference for the discovery of genomic subgroups in this TNBC cohort. All the identified stratification features were used for integrative hierarchical clustering analysis, using the R package pheatmap and the Manhattan distance measure, to determine patient subgroups and to support the discovery of prognostic and therapeutic stratification, driver-gene associations and clinical predictions.

To further our understanding of the identified genomic subgroups, a number of comparative analyses were done. The overall mutation loads were computed and the prevalence of mutations in SMGs and DNA damage repair genes was identified per subgroup. Chi-square tests were run to identify mutations that appear mutually exclusive between the subgroups. Also conducted were investigations of the association between SMGs and the genomic subgroups and of the association between driver mutations and mutation signatures, to provide insights into the identified subgroups and their genomic and clinical characteristics.

Mutation profiles and signatures are not routinely investigated in the clinical setting, despite their salient benefit in detecting subtypes implicated in pathways that are associated with favourable prognosis [47], like those with defective mismatch repair that may benefit from immune checkpoint inhibition.
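For readers less familiar with the two statistical ingredients named in the preceding paragraphs, the sketch below shows the Manhattan distance between two patient feature vectors and a Pearson chi-square statistic for a 2x2 table (for instance, mutated versus non-mutated cases in two subgroups). It is a plain-Python illustration, not the pheatmap or R code used in the study; the feature values and counts are hypothetical, and the use of the uncorrected Pearson statistic is an assumption, since the text does not specify the variant of the test.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table.

    `table` is [[a, b], [c, d]], e.g. rows = subgroups and
    columns = (mutated, not mutated) case counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical per-patient feature vectors (e.g. signature exposures, SV counts).
patient_a = [1, 0, 3]
patient_b = [2, 2, 3]
print(manhattan(patient_a, patient_b))

# Hypothetical counts: a gene mutated in all of subgroup 1 and none of subgroup 2.
print(chi_square_2x2([[10, 0], [0, 10]]))
```

A large statistic, as in the second call, is what a mutation that is present in one subgroup and absent from the other would produce; a table with identical proportions in both rows gives a statistic of zero.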
In this study, the integrative analysis of the various mutation types (CNAs, SVs, SNVs and indels) with clinical data shed more light on the correlation between mutation profiles and clinical outcomes.

Aim 4: Finally, we developed a database user interface using Shiny, Plotly and JavaScript to extend the database functionality to various research groups. The back-end PostgreSQL database was linked with the data analysis module (R) to support both local and remote data extraction, exploration, analysis and visualisation, the results of which are rendered dynamically in the front-end web application, as intuitive and interactive plots and data tables, for use by researchers, biologists and clinicians. The data can be shared across individuals and research groups, who can also upload files for data analysis and visualization without needing any programming knowledge. This will go a long way in helping researchers generate novel insights and hypotheses by triggering analyses and visualizations of the clinical outcomes and genomic variants data in the database.

Chapter 2
Database Design, Implementation and Optimization

The research presented herein hinged on the application of relational databases as an indispensable tool for the exploration and analysis of the tumour contents of patients in cancer studies. This chapter presents the work done on the design, development and optimization of the database of clinical outcomes and genomic variants of the TNBC cases in this study. Section 2.1 starts with a preliminary overview of the data structuring processes used to suit database storage and downstream analysis.
Section 2.2 presents the design of the variants database, followed by the physical database implementation to meet data mining and data analysis functions and the methods deployed to optimize the developed database in Section 2.3.

2.1 Data structuring

The large volumes of data generated by genomic pipelines such as variant calling pipelines are produced in formats such as the Variant Call Format (VCF), tsv (Tab Separated Values) or text files. As mentioned earlier in Section 1.2.5, these output files take on formats that do not support effective and efficient data mining. To prepare pipeline output data for database storage and downstream analysis, the data was structured before bulk loading into the database, as presented in the following sections.

VCF files: The Variant Call Format (VCF) is a file format specification used to store genetic variation data obtained from genomic sequencing and large-scale genotyping. It specifies a text format that contains three main sections: (1) metadata lines prefixed by "##" that describe the data values in the body of a file (Fig. 2.1); these lines describe the INFO (information), FILTER and FORMAT fields used in the body of a VCF file; (2) the header line, prefixed by "#", which contains 8 fixed and mandatory fields: "#CHROM POS ID REF ALT QUAL FILTER INFO", and which also contains a "FORMAT" field and an arbitrary number of "sample ID" fields if genotype data is present in a file; and (3) the data section, which contains the variants per chromosome position for each field (Fig. 2.2).

INFO fields in the metadata lines are described as follows:

##INFO=<ID=PR,Number=1,Type=Float,Description="Probability of somatic mutation">

The above line shows the data value being captured, "PR" (probability of somatic mutation), and also shows the number of expected PR values.
In this example, the number is equal to 1 (Number=1) and implies that we can only have one value for the probability of a variant call being a somatic mutation. Also included is the data type and in this case PR is of type float (Type=Float). Other data types captured in INFO fields include: integer, flag, character and string.

##INFO=<ID=PR,Number=1,Type=Float,Description="Probability of somatic mutation">
##INFO=<ID=TC,Number=1,Type=String,Description="Tri-nucleotide context">
##INFO=<ID=TR,Number=1,Type=String,Description="Count of tumour with reference to REF">
##INFO=<ID=TA,Number=1,Type=String,Description="Count of tumour with reference to ALT">
##INFO=<ID=NR,Number=1,Type=String,Description="Count of normal with reference to REF">
##INFO=<ID=NA,Number=1,Type=String,Description="Count of normal with reference to ALT">
##INFO=<ID=ND,Number=1,Type=String,Description="Number of Deletions">
##INFO=<ID=NI,Number=1,Type=String,Description="Number of Insertions">
##FILTER=<ID=threshold,Description="Threshold on probability of positive call">
##SnpEffVersion="4.3t (build 2017-11-24 10:18), by Pablo Cingolani"
##SnpEffCmd="SnpEff GRCh37.75 -noStats /shahlab/archive/sochan_tmp/jobs/SA681/temp/wgs_SA681/mutationseq/museq.vcf "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=MA,Number=.,Type=String,Description="Predicted functional impact of amino-acid substitutions in proteins. Format: (Mutation|RefGenome variant|Gene|Uniprot|Info|Uniprot variant|Func. Impact|FI score) ">
##DBSNP_DB=/shahlab/pipelines/reference/dbsnp_142.human_9606.all.vcf.gz
##INFO=<ID=DBSNP,Number=.,Type=String,Description="DBSNP flag">
##1000Gen_DB=/shahlab/pipelines/reference/1000G_release_20130502_genotypes.vcf.gz
##INFO=<ID=1000Gen,Number=.,Type=String,Description="1000Gen flag">
##Cosmic_DB=/shahlab/dgrewal/cosmic/CosmicMutantExport.sorted.vcf.gz
##INFO=<ID=Cosmic,Number=.,Type=String,Description="Cosmic flag">
#CHROM POS ID REF ALT QUAL FILTER INFO
20 64871 . C A 7.36 PASS PR=0.82;TR=40;TA=4;NR=28;NA=0;TC=ACA;NI=0;ND=32;ANN=A|upstream_gene_variant|MODIFIER|DEFB125|ENSG00000178591|transcript|ENST00000382410|protein_coding||c.-3480C>A|||||3480|,A|upstream_gene_variant|MODIFIER|DEFB125|ENSG00000178591|transcript|ENST00000608838|processed_transcript||n.-3020C>A|||||3020|,A|intergenic_region|MODIFIER|CHR_START-DEFB125|CHR_START-ENSG00000178591|intergenic_region|CHR_START-ENSG00000178591|||n.64871C>A||||||;MA=();DBSNP=F;1000Gen=F;Cosmic=F
20 139915 . T A 5.73 INDL PR=0.73;TR=53;TA=5;NR=33;NA=0;TC=CTA;NI=31;ND=1;ANN=A|downstream_gene_variant|MODIFIER|DEFB127|ENSG00000088782|transcript|ENST00000382388|protein_coding||c.*250T>A|||||111|,A|intergenic_region|MODIFIER|DEFB127-DEFB128|ENSG00000088782-ENSG00000185982|intergenic_region|ENSG00000088782-ENSG00000185982|||n.139915T>A||||||;MA=();DBSNP=[rs11471580,rs386393059,rs386393060,rs397947941];1000Gen=F;Cosmic=F
20 351395 . G A 17.61 PASS PR=0.98;TR=58;TA=14;NR=38;NA=0;TC=TGT;NI=0;ND=0;ANN=A|intergenic_region|MODIFIER|NRSN2-TRIB3|ENSG00000125841-ENSG00000101255|intergenic_region|ENSG00000125841-ENSG00000101255|||n.351395G>A||||||;MA=();DBSNP=F;1000Gen=F;Cosmic=F

Figure 2.1: Variant Call Format (VCF) file structure.

#CHROM POS ID REF ALT QUAL FILTER INFO
20 64871 . C A 7.36 PASS PR=0.82;TR=40;TA=4;NR=28;NA=0;TC=ACA;NI=0;ND=32;ANN=A|upstream_gene_variant|MODIFIER|DEFB125|ENSG00000178591|transcript|ENST00000382410|protein_coding||c.-3480C>A|||||3480|,A|upstream_gene_variant|MODIFIER|DEFB125|ENSG00000178591|transcript|ENST00000608838|processed_transcript||n.-3020C>A|||||3020|,A|intergenic_region|MODIFIER|CHR_START-DEFB125|CHR_START-ENSG00000178591|intergenic_region|CHR_START-ENSG00000178591|||n.64871C>A||||||;MA=();DBSNP=F;1000Gen=F;Cosmic=F

Figure 2.2: VCF data line.

Another key field captured in VCF files is the annotation (ANN) field (Fig. 2.3) described as shown in the INFO field in Fig. 2.4:

ANN=A|upstream_gene_variant|MODIFIER|DEFB125|ENSG00000178591|transcript|ENST00000382410|protein_coding||c.-3480C>A|||||3480|,A|upstream_gene_variant|MODIFIER|DEFB125|ENSG00000178591|transcript|ENST00000608838|processed_transcript||n.-3020C>A|||||3020|,A|intergenic_region|MODIFIER|CHR_START-DEFB125|CHR_START-ENSG00000178591|intergenic_region|CHR_START-ENSG00000178591|||n.64871C>A||||||;MA=();DBSNP=F;1000Gen=F;Cosmic=F

Figure 2.3: VCF annotation (ANN) field and corresponding data values.

##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">

Figure 2.4: VCF INFO field describing the annotation (ANN) field.

The metadata section also contains filters that have been applied to the data and are described
as follows:

##FILTER=<ID=threshold,Description="Threshold on probability of positive call">

Fig. 2.1 shows a snippet from a VCF file of one of the TNBC samples in this study cohort. Of particular interest to the data structuring process was the decomposition of multi-valued fields like the functional annotation (ANN) field into atomic values. The annotation field contains multiple data fields for one genomic position, separated by a pipe sign "|"; the ANN entry as a whole, like every other INFO entry, is delimited by ";". Multiple effects (consequences) are separated by a comma, as shown in Fig. 2.3. Atomizing multi-valued fields involved decomposing and mapping each distinct functional annotation to its genomic position, providing distinct variant tuples with a one-to-one mapping (one record/tuple for each REF/ALT combination) to support relational and downstream data analysis (Fig. 2.5).

Example data line extract: chr20 64871 . C A,A . . ANN=A|... , A|...
Structured output of the above line:
chr20 64871 . C A . . ANN=A|...
chr20 64871 . C A . . ANN=A|...

Data structuring also involved breaking down INFO fields into atomic variables.

Example line: PR=0.82;TR=40;TA=4;NR=28;NA=0;TC=ACA;NI=0;ND=32;
Structured output of the above line:
chr  pos    ref  alt  pr    tr  ta  nr  na  tc   ni  nd
20   64871  C    A    0.82  40  4   28  0   ACA  0   32

The structured data from all VCF files, as called by the respective variant callers (MutationSeq, Strelka and Lumpy), was then loaded into the respective database tables (Fig. 2.5 and Fig. 2.6), in which fields are denoted by: tumour_id, chrom, pos, ref, alt, pr, tc, tr, ta, nr, na, nd, ni, annotation, annotation_impact, gene_name, gene_id, feature_type, feature_id, transcript_biotype, rank, hgvs_c, hgvs_p, cdna_pos_cdna_length, cds_pos_cds_length, aa_pos_aa_length, distance, errors_warnings_info, lof, nmd, ma, dbsnp, x1000gen and cosmic, with each record containing one genomic event at a particular chromosomal position.
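The atomization described above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the structuring scripts used in the study: the function names are hypothetical, the ANN sub-field names follow the column names listed above, values are kept as strings rather than cast to database types, and fields are split on whitespace because the extract shown here has lost the original tab delimiters.

```python
# ANN sub-fields in the order given by the ANN description in the VCF header.
ANN_FIELDS = [
    "allele", "annotation", "annotation_impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank", "hgvs_c",
    "hgvs_p", "cdna_pos_cdna_length", "cds_pos_cds_length",
    "aa_pos_aa_length", "distance", "errors_warnings_info",
]


def parse_info(info):
    """Split an INFO string such as 'PR=0.82;TR=40' into a key/value dict."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True  # flag entries carry no value
    return parsed


def atomize(line):
    """Turn one VCF data line into one flat record per functional annotation."""
    chrom, pos, _id, ref, alt, _qual, _filter, info = line.split(None, 7)
    fields = parse_info(info)
    ann = fields.pop("ANN", "")
    base = {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}
    base.update({key.lower(): value for key, value in fields.items()})
    records = []
    for entry in ann.split(","):          # one entry per predicted effect
        record = dict(base)
        record.update(zip(ANN_FIELDS, entry.split("|")))
        records.append(record)
    return records


# Shortened version of the first data line in Fig. 2.1 (two of its three effects).
LINE = (
    "20 64871 . C A 7.36 PASS PR=0.82;TR=40;TA=4;"
    "ANN=A|upstream_gene_variant|MODIFIER|DEFB125|ENSG00000178591|transcript|"
    "ENST00000382410|protein_coding||c.-3480C>A|||||3480|,"
    "A|intergenic_region|MODIFIER|CHR_START-DEFB125|CHR_START-ENSG00000178591|"
    "intergenic_region|CHR_START-ENSG00000178591|||n.64871C>A||||||;DBSNP=F"
)

records = atomize(LINE)
for record in records:
    print(record["chrom"], record["pos"], record["annotation"], record["gene_name"])
```

Each returned record repeats the positional and INFO values and carries exactly one annotation, which is the one-record-per-event shape that the database tables expect.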
Having an atomic value for each data field enabled effective data mining, querying, manipulation and analysis.

Figure 2.5: Extract of a structured VCF file: Rows denote data values captured for each variable/field (columns).

Figure 2.6: Structured VCF file - sample database extract.

BAMStats Output Files: Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats have long been used as a standard of storage for large sequence alignments generated from genome mapping. The BAMstats software tool (Flagstat) was used to generate mapping statistics on BAM files containing sequence data to provide statistics on the total_reads, qc_failure, number of duplicate reads, number of mapped reads, mapped_percentage, paired_in_sequencing, reads (1 and 2), properly_paired, properly_paired_percentage, self_and_mate_mapped, singletons, singletons_percentage, MAPQ values, and the avg_read_coverage (Fig. 2.7).

Figure 2.7: Sample BAMStats output file showing row-wise data fields and values.

BAMStats output file data was then transformed and structured into specific fields and their respective values that were later loaded into database tables (bamstats_tumour and bamstats_normal for tumour and normal BAM files respectively) for further downstream analysis (Fig. 2.8). Below are examples of data lines from BAMstats output files structured to suit database storage.

Example Line1: 1136651250 in total
Example Line2: 1044480124 properly paired (91.89%)

Structured output:
total_reads properly_paired properly_paired_percentage
1136651250 1044480124 91.89

Figure 2.8: Structured BAM statistics output file - database extract of selected columns.

Text Files: Variant callers like TITAN and deStruct provide output in the form of text files (Fig. 2.9), most of which conform to a "unique_field - singular_value" data structure.
Files such as these were loaded into the database as is, with a few changes made to support database storage (Fig. 2.10).

Figure 2.9: TITAN pipeline output in text file format specifying atomic values for each data field (tab delimited).

Figure 2.10: Titan output file - database extract.

2.2 Database design and development

A good and well thought-out database design is a prerequisite for the development of effective, high-performance databases that support efficient data manipulation, mining and analysis by minimizing data redundancy and the cost of running a query in terms of total execution/run time. Such designs also enforce referential integrity and alleviate the need for data restructuring.

The key tasks undertaken to design the clinical outcomes and genomic variants database involved identifying data objects (entities, each represented as a logical collection of items and corresponding to a table in the database), their attributes, which correspond to the columns of a particular table, and the relationships between the identified objects. Entity relationship diagrams (ERDs) have long been used as data models for relational databases to map out and show database entities, attributes, constraints and the relationships between them. Fig. 2.11 shows the data model upon which the database was built. The entities and their descriptions, attributes and constraints, as shown in the data model, are presented in full in a data dictionary in Appendix C.

2.2.1 Relationships between entities and data constraints

Figure 2.11: Database model (Entity Relationship Diagram (ERD)) developed using Crow's Foot Notation: Database entities are represented as boxes while relationships between entities are represented as lines. The cardinality of a relationship is represented by the symbols |, -0<-, |<-, which denote "one and only one", "zero to many", or "one to many" relationships respectively.
Towards creating an effective database, we imposed a number of constraints in the database design:

1. Primary keys are record identifiers that support efficient database querying. Primary key constraints were supplied to a column or group of columns to uniquely identify the rows in each database table. In cases where tables had no candidate primary keys, a sequential key was supplied. This was common for entities that contain variant data. In such entities one sample, identified by the tumour_id, can have multiple genomic variant entries. This implies that in any single table that captures variants, the tumour_id is duplicated across rows, making it ineligible for primary key candidacy. Apart from being unique, fields chosen as primary keys were also required not to be null.

2. Fields expected to have a data value were specified with the "NOT NULL" constraint so that a null value is not assumed. A case in point: if data at a genomic position has been captured, we expect entries of the chromosome and position not to be null. On the other hand, a patient's ER status or tumour grade may be unknown. Such fields were left to default to "NULL" in case data on a subject was not available.

3. A unique constraint was supplied for fields whose data is expected to be unique.

4. A data type for each field was specified to validate the kind of data that can be stored in a field. As examples, date_of_diagnosis of a patient was stored with a data type 'date', the field that captures a patient's age was constrained to store integers, chromosome was constrained to store character data, position was constrained to store big integers (8 bytes), pr (probability of somatic mutation) was constrained to store floating-point data values, and annotation and annotation impact were constrained to store variable character data.

5. Referential integrity is a key feature in the design of relational databases.
This constraint ensures that implied relationships between database entities are enforced, hence the notion "relational database". To implement referential integrity, foreign key constraints were supplied specifying which values in a column (or a group of columns) must match the values appearing in a row of another table. The enforcement of this and the other constraints mentioned in 1 - 4 above is shown in the example query below, which was used to create the samples database object. The query specifies the following constraints: "primary key" (unique by default), "not null", "foreign keys" and the "data variable types":

    cur.execute("""
        CREATE TABLE samples(
            tumour_id VARCHAR,
            normal_id VARCHAR NOT NULL,
            consent_id VARCHAR REFERENCES clinical (consent_id) ON DELETE CASCADE,
            facility_of_origin VARCHAR NOT NULL,
            sample_type VARCHAR NOT NULL,
            project_code VARCHAR REFERENCES projects (project_code) ON DELETE CASCADE,
            PRIMARY KEY (tumour_id))
    """)

In the above query, the samples table is being created with 6 fields or variables whose data types, such as "VARCHAR" (variable character), are shown: tumour_id, normal_id, consent_id, facility_of_origin, sample_type and project_code. The consent_id field references the consent_id field (primary key) in the clinical table and the project_code field references the project_code field in the projects table. This further implies that the clinical and projects tables (parent tables) must be created before the samples table (child), as some of its fields reference fields in those tables. Also, because the foreign keys are declared with ON DELETE CASCADE, deleting a patient by consent_id in the clinical table automatically deletes all corresponding patient records in child tables. Thus a row in a parent table is never removed while rows in child tables still reference it, which enforces database integrity.

2.3 Database optimization

Databases have the ability to store enormous amounts of data and support the storage of millions of data entries.
In this study, we opted for the deployment of a PostgreSQL database which, unlike other database types such as MySQL or MongoDB, has an unlimited database size that supports storage of as much data as required, the main constraint being system-based storage limits. PostgreSQL databases also support a 32 terabyte (TB) maximum table size. With such large databases that can contain large tables comes the salient issue of database performance: the larger the table, the higher the cost of running a table scan in terms of total execution/run time and page I/Os (reads and writes of blocks containing data records to and from disk into main memory), as shown in the query plan extracts below:

Smaller table with 91 rows:

    Aggregate (cost=2.14..2.15 rows=1 width=0) (actual time=0.047..0.048 rows=1 loops=1)
      -> Seq Scan on samples (cost=0.00..1.91 rows=91 width=0) (actual time=0.016..0.023 rows=91 loops=1)
    Planning time: 0.393 ms
    Execution time: 0.132 ms

Larger table with 157,070,329 rows:

    Aggregate (cost=3919120.84..3919120.85 rows=1 width=0) (actual time=27677.309..27677.309 rows=1 loops=1)
      -> Seq Scan on titan_outfile_cnas (cost=0.00..3526380.07 rows=157096307 width=0) (actual time=0.020..17813.013 rows=157070329 loops=1)
    Planning time: 0.070 ms
    Execution time: 27677.332 ms

Given the high cost associated with working with large databases, and in particular the database created herein to store mutation data, database optimization was imperative to reduce the system response time by maximizing the speed and efficiency with which data is retrieved.
To optimize database performance, various optimization strategies were implemented, including indexing, query optimization, vacuuming, partitioning of large tables and bulk loading, as presented in the following sections.

2.3.1 Indexing

Indexing has long been proven to be one of the most beneficial methods for optimizing database and query performance by supporting fast access to data records in associated database tables and minimizing the overall cost required to process a user query. Indexes are data structures created using column(s) from a database table; they contain a search key value and a pointer that holds the address of the disk block where a particular key value can be found.

There are B+ tree indexes, which allow both range (e.g. 50 < age < 80) and equality (e.g. age = 60) searches, and hash indexes, which only support equality searches; B+ tree indexes are the most common as they support both search types. Well-constructed indexes have a huge bearing on query optimization, as they can avoid scanning an entire table for results by opting for a more efficient query plan such as an index scan, which involves iterating over most or all index items to find those that meet a search condition. The required records as specified by a query are retrieved through an index whose entries (Fig. 2.12) are read left to right.
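The difference between the two index families can be illustrated with plain Python containers. This is only an analogy and not how PostgreSQL implements its indexes: a sorted list searched with `bisect` stands in for the ordered leaf level of a B+ tree, a dictionary stands in for a hash index, and the ages are made-up values.

```python
import bisect

# Made-up patient ages; the list position plays the role of a row identifier.
ages = [60, 23, 84, 35, 52, 71, 35, 60, 47]

# Ordered (key, row) pairs: an analogue of the sorted leaf level of a B+ tree.
ordered_index = sorted((age, row) for row, age in enumerate(ages))
keys = [age for age, _ in ordered_index]

# key -> rows map: an analogue of a hash index.
hash_index = {}
for row, age in enumerate(ages):
    hash_index.setdefault(age, []).append(row)


def range_search(low, high):
    """Rows with low < age < high, located with two binary searches."""
    start = bisect.bisect_right(keys, low)
    stop = bisect.bisect_left(keys, high)
    return [row for _, row in ordered_index[start:stop]]


def equality_search_ordered(age):
    """Rows with the given age, using the ordered structure."""
    start = bisect.bisect_left(keys, age)
    stop = bisect.bisect_right(keys, age)
    return [row for _, row in ordered_index[start:stop]]


def equality_search_hash(age):
    """Rows with the given age, using the hash structure."""
    return hash_index.get(age, [])
```

The ordered structure answers both kinds of search, whereas the dictionary can only be probed with an exact key; a range search over it would have to visit every entry.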
In cases where all required data can be accessed through an index, there is no need for the database query optimizer to visit the much larger data table, whose access amounts to more page I/Os and greater run time. An alternative access path to a table scan is an index seek/probe, which requires searching an index for a specific value or a small set of values (fewer than those required in an index scan).

B+ tree indexes on data tables in the developed database were created using queries like the one below:

    CREATE INDEX dest_idx1 ON destruct_breakpoints (tumour_id);

where "dest_idx1" is the name of the index being created on column "tumour_id" in the database table "destruct_breakpoints".

When a new index is created, the database server automatically updates database statistics from which a query optimizer can discover the distribution of values in a column to determine the optimal execution plan for a query. The rough estimate of the number of elements within a specific range in a histogram of the query optimizer helps the optimizer decide whether to use an index scan or a table scan for query execution.

There are two types of B+ tree indexes: clustered and non-clustered indexes, each with unique benefits depending on the data or query in question. Clustered indexes sort the data and dictate the storage order of the data records in a table (Fig. 2.12): the order of data records is the same as (or close to) the order of data entries in the index. As an example, given a B+ tree index on a column with patient ages, the ages will be ordered in ascending order, such that ages 20 - 30 could be on one page while ages 30 - 60 are on another index page. Running a query that requires retrieving ages between 50 and 60 would require reading one page into memory. This type of index is more efficient if built on columns of data that are most often accessed for ranges of values.
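The effect of physical ordering on page reads can be made concrete with a small simulation. The sketch below is illustrative only: the page size, the synthetic ages and the function name `pages_touched` are assumptions, and real page I/O depends on record sizes and buffering. It counts how many data pages hold at least one matching record when the same records are stored in sorted order and in arrival order.

```python
PAGE_SIZE = 10  # records per data page (illustrative value)


def pages_touched(records, low, high):
    """Count distinct data pages holding a record with low <= age <= high."""
    return len({
        position // PAGE_SIZE
        for position, age in enumerate(records)
        if low <= age <= high
    })


# Synthetic ages between 20 and 89, generated deterministically so that
# neighbouring records have unrelated values (an unordered layout).
arrival_order = [(i * 37) % 70 + 20 for i in range(300)]

# The same records stored in key order, as a clustered index would keep them.
clustered_order = sorted(arrival_order)

clustered_pages = pages_touched(clustered_order, 50, 60)
unordered_pages = pages_touched(arrival_order, 50, 60)
```

With the records in key order the matches sit on a handful of adjacent pages, while in arrival order they are scattered over most of the 30 pages, so `clustered_pages` is much smaller than `unordered_pages`.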
Given that data entries are arranged in sorted order, this index type also excels at finding a specific row when the indexed value is unique.

Figure 2.12: Clustered and unclustered B+ tree index structure: A hierarchical data search structure is maintained with all searches beginning at the root of the tree and proceeding to the lowest level of the tree (leaf level containing data entries). Using node pointers (separated by search key values), index entries direct searches to the correct leaf page. In clustered B+ tree indexes, the node pointer to the left of a key value k points to a subtree that contains only data entries less than k and the node pointer to the right of a key value k points to a subtree that contains only data entries greater than or equal to k, while unclustered indexes do not maintain this order.

In contrast, with non-clustered indexes, two records that are close to each other as defined by the index might not appear on the same data page or adjacent data pages. With such indexes, there is no defined order, as seen in Fig. 2.12. This implies that if patient ages are scattered across multiple pages and a query searches for patients aged between 30 and 60, as many as 30+ pages may be read into memory instead of 1, as is the case with clustered indexes. This is because the matching records are on different pages that all need to be fetched into memory. Because of the high costs incurred with unclustered indexes, clustered indexes were applied in this study.

Application of clustered B+ tree indexes for optimization

Given a database query for the tumour_id, gene_name, age, grade and overall survival status for all patients with a high or moderate impact mutation in PIK3CA, BRCA1 or BRCA2 and for which the variant was called at a probability >= 0.9, we could run the query below, which produces the database output shown in Fig. 2.13:

    SELECT DISTINCT s.tumour_id, i.gene_name, c.age, c.grade, c.os_status
    FROM clinical_data c, snvs_intersect i, samples s
    WHERE c.tumour_id = s.tumour_id AND s.tumour_id = i.tumour_id
    AND (i.gene_name = 'PIK3CA' OR i.gene_name = 'BRCA2' OR i.gene_name = 'BRCA1')
    AND i.pr >= 0.9
    AND (i.annotation_impact = 'HIGH' OR i.annotation_impact = 'MODERATE') ORDER BY 2;

Relational Algebraic Notation of the above query:

    π tumour_id, gene_name, age, grade, os_status ((σ (gene_name = 'PIK3CA' ∨ gene_name = 'BRCA2' ∨ gene_name = 'BRCA1') ∧ pr >= 0.9 ∧ (annotation_impact = 'HIGH' ∨ annotation_impact = 'MODERATE') snvs_intersect) ⋈ samples ⋈ clinical_data)

Figure 2.13: Database query output.

Query tree without indexing:

The relational algebra tree in Fig. 2.14 shows the query evaluation plan of the query in question and consists of annotations at each tree node indicating the data access methods for the query. Query execution starts with a full table/file scan of the snvs_intersect table for gene_name = ('PIK3CA' or 'BRCA2' or 'BRCA1') and pr >= 0.9 and annotation_impact = 'HIGH' or 'MODERATE'. Records that satisfy the query conditions are selected (selection denoted by sigma (σ)) and the results of this subtree query are joined by tumour_id, using a Nested Loop Join (joins denoted by a bowtie (⋈)), to the samples table. Using a Merge Join, the resultant subquery results are joined to the clinical_data table by tumour_id, from which an overall projection (denoted by pi (π)) of the requested data, preceded by a Hash Aggregate to select distinct records, is returned. In the query tree, '∧' denotes an intersection (or 'AND') while '∨' denotes union (or 'OR'). The query plan of the query and corresponding tree in Fig. 2.14 is shown in Fig. 2.15. The total execution time for this query is 19423.822 ms.

Figure 2.14: Query tree without indexing.

Query plan without indexing:

Figure 2.15: Query plan without indexing (Run-time = 19423.822ms).
Applying a clustered B+ tree:

Below is a query used to create a clustered B+ tree index on columns (pr and gene_name) in the snvs_intersect database table:

    CREATE INDEX gene_pr_idx ON snvs_intersect (pr, gene_name);
    CLUSTER snvs_intersect USING gene_pr_idx;

With the application of a clustered B+ tree index on columns (pr and gene_name) of the snvs_intersect table, a scan on the index is done for only data entries whose values (pr and gene_name) satisfy the search conditions in the query (Fig. 2.16 and Fig. 2.17). These (fewer) data entries are then used to return only the required data as specified by the query conditions. This decreases the cost required to execute the query by avoiding a full table scan. The total execution time for this query is 876.877ms compared to 19423.822ms without an index.

Query tree with indexing:

Figure 2.16: Query tree with indexing (Clustered B+ Tree).

Query plan with indexing:

Figure 2.17: Query plan with index (Run-time = 5876.877ms).

2.3.2 Query optimization

Besides the application of indexes to enhance performance, the construction of smart queries that leverage knowledge on database tables can also yield faster data access. Below is a differently structured query that provides the same output as seen in Fig. 2.13; however, this query has a longer execution time (10935.661 ms, Fig. 2.18) despite the created clustered index.

    SELECT DISTINCT s.tumour_id, i.gene_name, c.age, c.grade, c.os_status
    FROM clinical c
    JOIN samples s ON c.tumour_id = s.tumour_id
    JOIN snvs_intersect i ON s.tumour_id = i.tumour_id
    WHERE (i.gene_name LIKE 'PIK3CA' OR i.gene_name LIKE 'BRCA2' OR i.gene_name LIKE 'BRCA1')
    AND i.pr >= 0.9
    AND (i.annotation_impact = 'HIGH' OR i.annotation_impact = 'MODERATE') ORDER BY 2;
Figure 2.18: Query plan of poor performance query: Increased run-time regardless of applied indexes.

2.3.3 Re-clustering

Clustered tables are physically ordered based on the order of created clustered indexes; however, as clustering is a one-time operation, subsequent table updates or inserts are not clustered. For example, if data records are ordered by probability of somatic mutation (pr), new table entries may be inserted at the end of a file, whereby a new record with pr = 0.5 may be found on a page whose pr range was originally 0.8 - 1.0. With time, a table tends to become unclustered, which affects performance by increasing the cost of executing a query. To avoid this, occasional reclustering was done by reissuing the same clustering command, especially on updated tables.

2.3.4 Vacuuming

Vacuuming is another optimization mechanism that was used to reclaim storage occupied by dead tuples in database tables. In normal database operations, records that are deleted or made obsolete by an update are not physically removed from a table and keep occupying storage space until a "VACUUM" is done. Fig. 2.19 shows an example of vacuuming done on a sample table (clinical) to reclaim storage space from dead tuples not removed by "autovacuum".

Figure 2.19: Vacuuming for database optimization.

2.3.5 Bulk-loading

All data in the developed database was bulk-loaded using developed data loading scripts. This approach significantly improved performance as it is much faster than repeated inserts. Secondly, records are sorted before bulk-loading.
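As a hedged illustration of sorting followed by a single bulk load, the sketch below writes the sorted rows to an in-memory CSV buffer and hands it to PostgreSQL in one COPY statement. It is not the loading script used in this study: the function name `bulk_load` is hypothetical, `cur` is assumed to be a psycopg2 cursor (whose `copy_expert` method streams a file-like object to COPY), and the table and column names are assumed to be trusted identifiers because they are interpolated directly into the statement.

```python
import csv
import io


def bulk_load(cur, table, columns, rows, sort_key=0):
    """Send all rows to PostgreSQL in one COPY instead of many INSERTs.

    cur      -- assumed to be a psycopg2 cursor
    table    -- trusted table name (interpolated as is)
    columns  -- trusted column names, in the order used by each row
    rows     -- iterable of tuples
    sort_key -- position of the column to sort on before loading
    """
    ordered = sorted(rows, key=lambda row: row[sort_key])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(ordered)
    buffer.seek(0)
    statement = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    cur.copy_expert(statement, buffer)
    return len(ordered)
```

A call such as `bulk_load(cur, "museq_unfiltered", ["tumour_id", "chrom", "pos"], rows)` followed by `con.commit()` would load every row in a single round trip; the column list here is a shortened, illustrative subset.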
All scripts that performed data structuring of pipeline data had a database loading function to pass structured data directly to the database, as shown below.

    # Database connection (dbDriver("PostgreSQL") is provided by the RPostgreSQL package)
    library(RPostgreSQL)
    pw <- { " "}
    drv <- dbDriver("PostgreSQL")
    con <- dbConnect(drv, dbname = "genomic_variants", host = "", user = "", password = pw)
    rm(pw)
    ..
    data structuring script
    ..
    # Writing structured data to the database into table "museq_unfiltered"
    dbWriteTable(con, "museq_unfiltered", museq_unfiltered, append = TRUE, row.names = FALSE)

Unlike pipeline output data that contained genomic variants, the clinical data used in this study was loaded from a .csv file, as shown in the abstract script below.

    #!/home/rasiimwe/miniconda3/bin/python
    import psycopg2
    import sys
    import csv
    import os

    con = None
    try:
        con = psycopg2.connect("host='localhost' dbname='genomic_variants' user=' ' password=' '")
        cur = con.cursor()
        path = "path to file"
        ## Creating table clinical_data
        ##-----------------------------------------------------------------
        cur.execute("DROP TABLE IF EXISTS clinical_data")
        cur.execute("CREATE TABLE clinical_data (...)")
        ## Data Loading
        ##-----------------------------------------------------------------
        cur.execute("COPY clinical_data (consent_id, diagnosis_date, age, ...) FROM '%s' delimiter ',' csv header" % (path))
        ##-----------------------------------------------------------------
        con.commit()
    except psycopg2.DatabaseError as e:
        # Undo the partial load before reporting the error
        if con:
            con.rollback()
        print('Error %s' % e)
        sys.exit(1)
    finally:
        if con:
            con.close()

Chapter 3

Database Application to Whole Genome Profiling and Stratification of TNBCs

This chapter presents the utility of the developed database in facilitating the exploration and analysis of mutation contents and patterns in complex diseases, with emphasis on understanding the genomic landscape and mutational characteristics underlying TNBCs towards TNBC subgroup discovery.
Section 3.1 provides an overview of the utility of the developed database in supporting preliminary quality control (QC) checks and analyses on the data in the database. Section 3.2 presents database mining and exploratory functions to support comprehensive genome and gene-level analyses in the cohort to further support the discovery and subsequent analysis of TNBC genomic subgroups as presented in section 3.3.

3.1 Quality Control (QC)

QC checks for whole genome sequencing: Before embarking on downstream data analysis, the database was explored for sequencing thresholds that were applied to the data during whole genome sequencing. The sequencing parameters used in this study were derived from the BAM file of each sample using the SAMtools-mpileup utility and were then loaded into the database for storage and subsequent analysis. The data variables captured in the BAM statistics data tables include 'total_reads', 'qc_failure', 'duplicates', 'mapped', 'mapped_percentage', 'paired_in_seq', 'read1', 'read2', 'properly_paired', 'properly_paired_percentage', 'self_and_mate_mapped', 'singletons', 'singletons_percentage' and 'avg_read_coverage'. Of interest to our study was the average read coverage used for whole genome sequencing (Fig. 3.1) and the percentage of mapped and properly paired reads (Fig. 3.2 and Fig. 3.3 respectively). For effective and high confidence variant discovery, the established average read coverage threshold in this study was 60X. All samples that did not meet this threshold were flagged for higher resequencing coverage and excluded from further downstream analyses.
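The coverage check described above amounts to a simple partition of samples around the threshold. A minimal sketch follows, assuming rows of (tumour_id, avg_read_coverage) have already been fetched from the bamstats_tumour table; the function name and the sample identifiers in the usage line are hypothetical and the coverage values are made up for illustration.

```python
COVERAGE_THRESHOLD = 60.0  # average read coverage threshold (X) stated in the text


def split_by_coverage(rows, threshold=COVERAGE_THRESHOLD):
    """Partition (tumour_id, avg_read_coverage) rows around the threshold.

    Returns (passed, flagged): samples kept for downstream analysis and
    samples flagged for resequencing at higher coverage.
    """
    passed, flagged = [], []
    for tumour_id, coverage in rows:
        if coverage >= threshold:
            passed.append(tumour_id)
        else:
            flagged.append(tumour_id)
    return passed, flagged


# Made-up identifiers and coverage values, for illustration only.
passed, flagged = split_by_coverage([("T1", 79.9), ("T2", 58.2), ("T3", 60.0)])
```

Because the threshold is a parameter rather than a stored flag, the same rows can be re-partitioned under a different cut-off without reloading any data.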
Queries such as the one below were used to extract data used to check sequencing parameters:

    stats.tumour <- dbGetQuery(con,
    "SELECT tumour_id, mapped_percentage, properly_paired, avg_read_coverage
    FROM bamstats_tumour
    ORDER BY 3 DESC")

Figure 3.1: Average read coverage: (a) Tumour samples: mean = 79.91X, range = 66.88X - 89.83X. (b) Normal samples: mean = 39.81X, range = 34.80X - 45.07X.

Figure 3.2: Percentage of mapped reads: (a) Tumour samples: mean = 92.41%, range = 78.83% - 95.44%. (b) Normal samples: mean = 94.16%, range = 89.85% - 99.77%.

Figure 3.3: Percentage of properly paired reads: (a) Tumour samples: mean = 90.81%, range = 77.42% - 93.74%. (b) Normal samples: mean = 92.55%, range = 88.50% - 97.58%.

QC checks for normal contamination levels: In various genomic studies, sequencing of matched tumor and normal samples has become a conventional study design to distinguish between somatic and germline variants towards supporting reliable detection of somatic mutations. Tumor-normal sample contamination causes decreased sensitivity in mutation detection that could result in inaccurate sequencing data [88]. The normal contamination estimates in this study were derived from TITAN output data [82].
From the perspective of copy number inference, the exploration and analysis of genomic allelic imbalances and loss of heterozygosity events as derived from allelic ratio data (RefCount/Depth) is significantly influenced by the proportion of the normal content in a tumour sample (tumour content = 1 - (normal contamination estimate)) [82]. Fig. 3.4 presents database-derived normal contamination estimates of the samples in this cohort, all of which were deemed viable for subsequent downstream analysis. Of note, 7 of the 11 samples with no normal contamination are patient-derived xenografts (PDX).

QC implementations for genomic variant data: In our study, the identification of genome-wide somatic mutations was executed using the Kronos workflow assembler [89], which was used to run TITAN [82] to infer copy number aberrations and loss of heterozygosity (LOH) events in each patient tumour sample, deStruct [83] and Lumpy [84] to infer structural variants (SVs), mutationSeq [85] to infer single nucleotide variants (SNVs) and Strelka [86] to infer both indels and single nucleotide variants. All QC implementations by the various pipeline tools were applied to the TNBC whole genome sequencing data. To maintain high confidence calls, further downstream quality control involved intersecting SVs inferred by deStruct and Lumpy and removing those with breakpoints falling in low-mappability regions. SNVs called by both mutationSeq and Strelka were also intersected, and variants for which the probability of somatic mutation (pr) >= 0.9 were used in all subsequent study analyses. Given that databases support computation of results on demand,
pr thresholds were directly applied during database query time allowing for flexibility in setting thresholds for data analysis.

Figure 3.4: Normal contamination estimates: (a) Proportion of the normal content in a tumour sample (tumour content = 1 - (normal contamination estimate)), mean = 0.45, range = 0 - 0.76. (b) Corresponding density plot showing the distribution of the normal contamination estimate in this cohort.

Below is an example query that returns results based on user defined parameters.

    gene <- "BRCA1"
    effect <- "stop_gained"
    pr.pass <- 0.9
    query <- fn$identity("SELECT DISTINCT tumour_id, gene_name, annotation
    FROM strelka_indels
    WHERE gene_name = '$gene' AND annotation like '$effect'
    UNION SELECT distinct tumour_id, gene_name, annotation
    FROM snvs_intersect
    WHERE gene_name = '$gene' AND annotation like '$effect'
    AND pr >= $pr.pass ORDER BY 1 ASC")
    data <- dbGetQuery(con, query)
Output of the above query:

    tumour_id | gene_name | annotation
    -----------+-----------+-------------
    SA296 | BRCA1 | stop_gained
    SA535 | BRCA1 | stop_gained
    SA590 | BRCA1 | stop_gained
    SA655 | BRCA1 | stop_gained

3.2 Somatic aberrations characteristic of TNBC

Somatic aberrations (CNAs, SVs, SNVs and indels) present in the tumor genome of each patient were discovered using the aforementioned variant calling tools: TITAN (CNAs), deStruct and Lumpy (SVs), mutationSeq and Strelka (SNVs) and Strelka (indels). Database-driven explorations and analyses conducted on the identified somatic mutations to infer patient-specific and cohort-wide mutation loads, patterns and characteristics are presented in the following sections.

3.2.1 Distribution of mutation loads per sample and across the cohort

The distribution of mutation loads in this TNBC cohort, as depicted in Fig. 3.5, varies among TNBC cases, with variations seen across the cohort and in the mutation-type loads in each sample. Some key questions to ask here are whether mutation loads have a bearing on survival outcomes and patient stratification, and whether cases with higher mutation burdens associate with higher levels of genomic instability. We answer these questions in section 3.3.

Figure 3.5: Distribution of mutation loads: (a) shows the number of SNV mutations (y-axis) for each sample (x-axis) (mean = 8172.62, range = 0 - 77463), (b) shows the number of SVs for each sample (mean = 180.70, range = 0 - 785), (c) shows the number of indels for each sample (mean = 1327.41, range = 1 - 10373) and (d) shows the total mutation load for each of the samples (mean = 9680.74, range = 149 - 79075). Samples are sorted in ascending order based on the total mutation load.
3.2.2 Structural variants

Figure 3.6: Distribution of structural 
variants (SVs) per sample and across the cohort: (a) and (b)show the distribution and abundance of SV types across the cohort, sorted in ascending order basedon the total number of mutations observed in each SV type (inversions, foldback, translocations,deletions and duplications). (c) The proportion of SV types (y-axis) identified in each sample(x-axis).Overall, we see that TNBCs are enriched for duplications followed by deletions (Fig. 3.6 a))with clear genomic heterogeneity observed between cases in this cohort, some harboring significantstructural variations in specific variant types compared to others (Fig. 3.6 c)). Specific structuralvariants disrupt gene structures and consequently promote tumour progression. SVs and mutationsignatures derived from specific structural variants have played an important role in patient andprognostic stratification and the identification of potentially actionable events [42, 47, 73]. DetectedSVs in this study were used as key features for patient stratification as will be expounded on insection 3.3. The data object ‘breakpoints.all’ used to store structural variant data extracted fromthe database and used to generate figures 3.6 a), b) and c) was created using the following query:543.2. Somatic aberrations characteristic of TNBCbreakpoints.all <- dbGetQuery(con,"SELECT DISTINCT tumour_id, type, COUNT(*)FROM svs_filteredWHERE type = ‘foldback’GROUP BY 1, 2UNION SELECT DISTINCT tumour_id, type, COUNT(*)FROM svs_filteredWHERE type = ‘duplication’GROUP BY 1, 2UNION SELECT DISTINCT tumour_id, type, count(*)FROM svs_filtered WHERE type = ‘translocation’GROUP BY 1, 2... ")Sample data extract:tumour_id | type | count-----------+---------------+-------SA1071 | duplication | 22SA669 | inversion | 26SA673 | deletion | 23SA586 | duplication | 18SA423 | duplication | 94SA287 | foldback | 1SA275 | inversion | 5SA232 | duplication | 99SA997 | duplication | 10SA211 | translocation | 37553.2. 
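Because each branch of the UNION query differs only in the value of `type`, the same per-sample, per-type counts can presumably be obtained with a single GROUP BY over both columns. The sketch below checks this equivalence on a toy table, using Python's built-in sqlite3 module as a stand-in for the PostgreSQL database. Only the table and column names (`svs_filtered`, `tumour_id`, `type`) come from the text; the rows are invented, and the grouping columns are spelled out by name rather than by position.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (an assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE svs_filtered (tumour_id TEXT, type TEXT)")
rows = [
    ("S1", "foldback"), ("S1", "duplication"), ("S1", "duplication"),
    ("S2", "translocation"), ("S2", "foldback"), ("S2", "foldback"),
]
con.executemany("INSERT INTO svs_filtered VALUES (?, ?)", rows)

# One branch per SV type, mirroring the shape of the original query
# (three types shown here).
union_sql = """
SELECT tumour_id, type, COUNT(*) FROM svs_filtered
WHERE type = 'foldback' GROUP BY tumour_id, type
UNION
SELECT tumour_id, type, COUNT(*) FROM svs_filtered
WHERE type = 'duplication' GROUP BY tumour_id, type
UNION
SELECT tumour_id, type, COUNT(*) FROM svs_filtered
WHERE type = 'translocation' GROUP BY tumour_id, type
"""

# Single aggregation over both grouping columns.
grouped_sql = """
SELECT tumour_id, type, COUNT(*) FROM svs_filtered
GROUP BY tumour_id, type
"""

union_counts = sorted(con.execute(union_sql).fetchall())
grouped_counts = sorted(con.execute(grouped_sql).fetchall())
print(grouped_counts)
```

On this toy data both statements return the same rows; the single GROUP BY also avoids having to add a branch whenever a new SV type appears.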
3.2.3 Copy number aberrations

Figure 3.7: Distribution of copy number aberrations (CNAs) per sample and across the cohort: (a) and (b) The distribution and abundance of CNA types across the cohort, sorted in ascending order of the total number of mutations observed for each CNA type (Homozygous deletion (HOMD), Unbalanced CNA (UBCNA), Allele-specific CNA (ASCNA), Balanced CNA (BCNA), Amplified LOH (ALOH), Copy-neutral LOH (NLOH), Hemizygous deletion (DLOH), Diploid heterozygous (HET) and copy number GAIN). (c) The proportion of CNA types (y-axis) identified in each sample (x-axis).

Fig. 3.7 (c) shows the variation in copy number profiles in the TNBC cohort and the heterogeneity of TNBCs at the CNA level. Intra-sample heterogeneity at both the CNA and SV levels is further depicted in Fig. 3.8, which displays the variations in the genome structure of patient sample SA586 and the corresponding relationships between genomic intervals.
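The stacked proportions in Fig. 3.7 (c) can be thought of as each sample's count per TITAN call divided by that sample's total. A minimal sketch under that assumption is given below; the (tumour_id, titan_call) pairs are made up rather than read from the database, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (tumour_id, titan_call) pairs; illustrative values only.
segments = [
    ("S1", "HET"), ("S1", "HET"), ("S1", "GAIN"), ("S1", "DLOH"),
    ("S2", "GAIN"), ("S2", "NLOH"),
]


def cna_proportions(pairs):
    """Return {sample: {call: fraction of that sample's segments}}."""
    per_sample = defaultdict(Counter)
    for tumour_id, call in pairs:
        per_sample[tumour_id][call] += 1
    return {
        sample: {call: n / sum(calls.values()) for call, n in calls.items()}
        for sample, calls in per_sample.items()
    }


props = cna_proportions(segments)
print(props["S1"])
```

Each sample's proportions sum to one, which is what allows samples with very different numbers of segments to be compared on a common 0 - 100% axis.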
Figure 3.8: Intra-sample heterogeneity at both CNA and SV levels: Circos plot showing the type of copy number aberrations (HET, BCNA, UBCNA, ALOH, ASCNA, NLOH, DLOH, HOMD, GAIN) across the genome (track 1), copy number variations in the genome (tracks 2 and 3) and the type of structural variations (translocations, duplications, foldbacks, deletions and inversions; track 4), followed by the corresponding links between genomic positions.

Figure 3.9: Case-based copy number profile: Copy number profile of patient sample SA586 showing copy number variations (y-axis) along the genome, denoted by coordinates representing genomic positions (x-axis).

The snippets below show the database calls used to extract the CNA and SV data behind Fig. 3.8 and Fig. 3.9, each followed by a sample data extract.

CNAs data call:

    sample <- input$sample_id
    query <- fn$identity("SELECT chromosome, start_position_bp, end_position_bp,
                          titan_call, copy_number
                          FROM titan_segs_cnas
                          WHERE tumour_id = '$sample'")
    cnas <- dbGetQuery(con, query)

Sample data extract (CNAs):

 chromosome | start_position_bp | end_position_bp | titan_call | copy_number
------------+-------------------+-----------------+------------+-------------
 13         | 50046072          | 79052038        | ALOH       | 8
 7          | 66617961          | 66628179        | ALOH       | 8
 7          | 66591794          | 66594881        | ALOH       | 8
 4          | 156518707         | 191043593       | ALOH       | 3
 22         | 47415860          | 47415875        | ALOH       | 8

SVs data call:

    sample <- input$sample_id
    query <- fn$identity("SELECT chrom_1, brk_1, chrom_2, brk_2, brk_dist, type
                          FROM svs_filtered
                          WHERE tumour_id = '$sample' ORDER BY 1")
    svs <- dbGetQuery(con, query)

Sample data extract (SVs):

 chrom_1 | brk_1     | chrom_2 | brk_2     | brk_dist | type
---------+-----------+---------+-----------+----------+---------------
 1       | 64435479  | 3       | 101565610 | Infinity | translocation
 10      | 61995259  | 10      | 61993238  | 2021     | duplication
 10      | 34280375  | 10      | 34280333  | 42       | foldback
 10      | 28596747  | 10      | 28596716  | 31       | foldback
 10      | 43885134  | 10      | 43885091  | 43       | foldback
 10      | 124903202 | 10      | 124903123 | 79       | foldback
 10      | 83072235  | 10      | 83106088  | 33853    | deletion
 10      | 2186974   | 10      | 2186941   | 33       | foldback
 11      | 124022349 | 11      | 124022281 | 68       | foldback
 11      | 119800299 | 13      | 68122937  | Infinity | translocation

3.2.4 Gene-level analysis

Significantly mutated genes (Appendix B) across this TNBC cohort were identified using MutSigCV, which reported EMCN, TP53, MUC21, PIK3CA, MUC4, MB, CTU2, RAB3IL1 and PTEN as the most significantly mutated genes in this cohort (FDR < 0.1). Database-derived mutations in each gene per case were visualized using an oncoplot (Fig. 3.10), with each row representing a gene and each column representing a case. PIK3CA mutations appear mutually exclusive with PTEN loss in the oncoplot, although this pattern was not statistically confirmed (see section 3.3.1). The script written to extract the data used to generate Fig. 3.10 is shown in Appendix A.2.0.2.

Figure 3.10: Visualizing gene-based mutations: Oncoplot showing high impact mutations in each gene (rows) per sample (columns). Multiple mutations in a single gene are represented by multiple colors, each corresponding to a specific mutation type. TP53 (56.9%) was the most frequently mutated gene in this cohort, followed by PIK3CA (8.9%), PTEN (7.3%), BRCA1 (5.7%), USH2A (4.9%), MUC4 (4.9%) and RB1 (4.1%).

3.3 TNBC genomic subgroup discovery

One of the main objectives of this study was to stratify TNBCs into distinct subgroups using genomic features extracted from the developed database. This rested on our hypothesis that TNBC patients can be stratified into distinct genomic subgroups based on their whole genome profiles.
Genomic features integrated for subgroup discovery included CNAs (HET, DLOH, GAIN, NLOH, HOMD, ASCNA, ALOH, BCNA and UBCNA), SNVs (stop_gained, splice_donor, splice_acceptor, start_lost and stop_lost), indels (frameshift_variant, splice_donor, splice_acceptor, stop_gained, bidirectional_gene_fusion, gene_fusion and stop_lost), SVs (duplication, deletion, translocation, inversion and foldback) and mutation signatures (POLE, APOBEC, HRD (Homologous Recombination Deficiency), UNK (Unknown), MMRD (Mismatch Repair Deficiency), T→C, M-Dup (Medium Duplications), S-Del (Small Deletions), Cl-SV (Clustered Structural Variants), FBI (Foldback Inversions), Cl-FBI (Clustered Foldback Inversions), L-Del (Large Deletions), S-Dup (Small Duplications), Tr (Translocations) and L-Dup (Large Duplications)). CNAs, SNVs, SVs and indels were computed as the proportion of each variant over all domain-specific variants, while mutation signatures were inferred using the multi-modal correlated topic model (MMCTM) [42]. All the identified stratification features were used for integrative hierarchical clustering analysis, using the R package pheatmap and the Manhattan distance measure, to support the discovery of patient subgroups and their genomic and clinical characteristics. Figures 3.11, 3.12, 3.13, 3.14, 3.15 and 3.17 show the subgroups identified by mutation signatures, CNAs, SNVs, indels, SVs and multi-feature integration respectively.

3.3.1 TNBC subgroups identified by mutation signatures

Figure 3.11: TNBC genomic subgroups identified by mutation signatures: Hierarchical clustering of 88 TNBC cases (x-axis) reveals 5 subgroups using scaled values of mutation signatures (POLE, APOBEC, HRD, UNK, MMRD, T→C, M-Dup, S-Del, Cl-SV, FBI, Cl-FBI, L-Del, S-Dup, Tr and L-Dup) (rows in the bottom panel of the heatmap). Color scales range from blue to red to reflect no or low proportions of a variant (blue) relative to high variant proportion levels (red) in each case. Heatmap annotations are shown in rows in the top panel, where blue signifies presence of a mutation (mutant) in significantly mutated genes and in DNA damage repair genes while white signifies absence of a mutation in a gene (wild type) for each case.

Stratification of TNBC cases in this cohort by mutation signatures (Fig. 3.11) led to the discovery of 5 main subgroups. The first 2 groups (leftmost) were enriched for the HRD signature and further distinguished by the S-Dup and S-Del signatures in group 1 and group 2 respectively. About 1/5 of the samples in group 1 were enriched for the APOBEC signature. Group 3, unlike the other groups, was highly enriched for the Cl-SV signature; group 4 was enriched for APOBEC, FBI and an unknown signature (UNK), while group 5 was enriched for FBI, UNK and MMRD. 3 cases (clusters 3 and 7) did not fall in any of the main clusters and were therefore flagged as outliers. Based on this stratification, all patients with a BRCA1/BRCA2 mutation were classified in group 1, which also had no case with a PIK3CA mutation. All cases with a PTEN mutation were also classified in group 1. A chi-square test was conducted to check for mutual exclusivity among subgroup gene-based mutations; however, this test and all subsequent tests yielded non-significant p-values (> 0.33) due to the small number of observations. Mutual exclusivity could be re-tested in future with a larger cohort.

3.3.2 TNBC subgroups identified by CNAs

Figure 3.12: Stratification of cases by scaled values of CNA proportions (HET, DLOH, GAIN, NLOH, HOMD, ASCNA, ALOH, BCNA and UBCNA) reveals 5 subgroups.

Stratification of TNBC cases by copy number aberrations revealed 5 subgroups (Fig. 3.12). Group 1 (leftmost) was enriched for HET, DLOH, GAIN and NLOH. Group 2 was significantly enriched for HET, with about 3/4 of the cases being enriched for copy number GAIN. The third and fourth clusters, containing 2 cases each, were flagged as outliers.
Groups 3 and 4 were enriched for copy number GAIN, with NLOH being a distinguishing feature enriched in group 4, while group 5 was enriched for UBCNA compared to the other subgroups, followed by ALOH and BCNA. All cases with a mutation in PTEN were found in group 4, which also comprised most cases with a BRCA1 mutation and no case with a PIK3CA mutation. Group 2, which was enriched for HET, had the fewest cases with a TP53 mutation and the highest number of cases with a mutation in PIK3CA.

3.3.3 TNBC subgroups identified by SNVs

Figure 3.13: Stratification of cases by scaled values of SNV proportions (stop_gained, splice_donor, splice_acceptor, start_lost and stop_lost) reveals 4 subgroups.

Patient stratification by SNVs revealed 4 main subgroups (Fig. 3.13). The first (leftmost) was heavily enriched for stop_gained mutations, while group 2 was enriched for stop_gained, with about 1/2 of the cases in this group also enriched for splice_donor mutations and about 1/3 for splice_acceptor mutations. Group 3 was heavily enriched for splice_donor while group 4 was heavily enriched for splice_acceptor. The 3 right-most clusters, comprising 2 cases each, were flagged as outliers. Most cases with BRCA1, PIK3CA or LAMB4 mutations were clustered in group 1, which had no case with a PTEN mutation. Group 2 had most of the cases with a mutation in MUC4, group 3 had no cases with either a BRCA1 or PIK3CA mutation, while group 4 had the cases with the fewest mutations in the genes of interest.

3.3.4 TNBC subgroups identified by indels

Figure 3.14: Stratification of cases by scaled values of indel proportions (frameshift_variant, splice_donor, splice_acceptor, stop_gained, bidirectional_gene_fusion, gene_fusion and stop_lost) identifies 3 subgroups.

Stratification of TNBC cases by indels (Fig. 3.14) identified 3 groups, all of which were enriched for frameshift_variant mutations, with a stronger enrichment in group 1.
Compared to the other groups, groups 2 and 3 had a higher signal for splice_acceptor and splice_donor mutations respectively. The right-most 4 clusters, comprising 1 case each, were flagged as outliers. Whether cases in groups 2 and 3 have no BRCA1/2 mutations remains inconclusive due to the few cases in these groups.

3.3.5 TNBC subgroups identified by SVs

Figure 3.15: Stratification of cases by scaled values of SV proportions (duplication, deletion, translocation, inversion and foldback) reveals 6 subgroups.

Patient stratification was also done based on SVs (Fig. 3.15), from which 6 subgroups were identified. Group 1 (leftmost), in which all cases with a PTEN or a BRCA1/2 mutation were clustered, was heavily enriched for duplications. Group 2 was enriched for duplications, deletions and inversions while group 3 was heavily enriched for deletions. Group 4 was enriched for inversions (compared to the other groups) while groups 5 and 6 were enriched for translocations and foldback inversions respectively. No cases in groups 2 - 6 had a mutation in PTEN, BRCA1 or BRCA2.

The above database-derived analyses for patient stratification show that TNBCs can be stratified, based on mutation signatures and individual domain-specific somatic variants (CNAs, SVs, SNVs and indels), into subgroups that depict unique patterns and characteristics, with mutations in SMGs or in DNA damage repair genes enriched in certain groups compared to others. We also see that some driver mutations are associated with the mutation signatures that stratified the TNBCs, as seen in Fig. 3.11 where all BRCA1, BRCA2 and PTEN mutations were enriched in the HRD group.

3.3.6 TNBC subgroup discovery by genomic feature integration

The preceding section demonstrates that TNBCs can be stratified into distinct subgroups by mutation signatures and by domain-specific genomic variants (CNAs, SNVs, indels and SVs).
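The clustering step used throughout this section (Manhattan distance followed by agglomerative hierarchical clustering) can be illustrated without pheatmap. The following pure-Python sketch rests on stated assumptions: complete linkage is assumed because the linkage method is not given in this section, the feature vectors are invented, the scaling step is omitted, and the helper names are hypothetical. It is an illustration of the technique, not a reproduction of the analysis.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def complete_linkage(c1, c2, data):
    """Distance between two clusters = largest pairwise member distance."""
    return max(manhattan(data[i], data[j]) for i in c1 for j in c2)


def cluster(data, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[name] for name in data]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical per-case feature proportions (e.g. duplication, deletion, other).
features = {
    "S1": [0.60, 0.15, 0.25],
    "S2": [0.58, 0.17, 0.25],
    "S3": [0.20, 0.40, 0.40],
    "S4": [0.22, 0.38, 0.40],
}
groups = cluster(features, k=2)
print(groups)
```

In this toy input the two duplication-rich cases and the two deletion-rich cases end up in separate clusters; with real data the number of clusters `k` would be chosen by a criterion such as the Elbow method rather than fixed in advance.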
Integrating all genomic features (CNAs (HET, DLOH, GAIN, NLOH, HOMD, ASCNA, ALOH, BCNA and UBCNA), SNVs (stop_gained, splice_donor, splice_acceptor, start_lost and stop_lost), indels (frameshift_variant, splice_donor, splice_acceptor, stop_gained, bidirectional_gene_fusion, gene_fusion and stop_lost), SVs (duplication, deletion, translocation, inversion and foldback) and mutation signatures (POLE, APOBEC, HRD, UNK, MMRD, T→C, M-Dup, S-Del, Cl-SV, FBI, Cl-FBI, L-Del, S-Dup, Tr and L-Dup)) for patient stratification identified 5 subgroups (Fig. 3.17). The optimal number of clusters was identified using the Elbow method (Fig. 3.16), which suggested 5 clusters.

Figure 3.16: TNBC genomic subgroups - Optimal number of clusters: (a) Optimal number of clusters = 5 as identified by the Elbow method. (b) Silhouette measure (range = -1 to +1) of within-cluster similarity, where high values indicate that a case is well matched to its own cluster and poorly matched to neighboring clusters.

Figure 3.17: TNBC genomic subgroups identified by genomic feature integration (mutation signatures, CNAs, SVs, SNVs and indels): Hierarchical clustering of 88 TNBC cases (x-axis) by scaled values of mutation signatures (HRD, S-Dup, Tr, T→C, S-Del, APOBEC, FBI, UNK, L-Dup, MMRD, L-Del, Cl-SV, POLE, Cl-FBI, M-Dup) and CNAs (HET, NLOH, GAIN, DLOH, ASCNA, ALOH, BCNA, UBCNA) (rows in the bottom panel of the heatmap), with a dendrogram of the hierarchical cluster analysis. Color scales range from blue to red to reflect no or low proportions of a variant (blue) relative to high variant proportion levels (red) in each case.
Heatmap annotations are shown in rows in the top panel, where blue signifies presence of a mutation (mutant) in significantly mutated genes and in DNA damage repair genes while white signifies absence of a mutation in a gene (wild type) for each case.

Stratification of TNBCs in this cohort by genomic feature integration, using hierarchical clustering and the Manhattan distance measure, identified 5 novel TNBC subgroups. Groups 1 and 2 (leftmost) were enriched for the HRD signature and were further distinguished by the S-Dup signature, which was enriched only in group 1. Unlike group 2, group 1 was also enriched for the copy number aberration HET. The third cluster, comprising one case heavily enriched for the Cl-FBI signature compared to the other subgroups, was flagged as an outlier. Groups 3 and 4 were enriched for the FBI signature, with a stronger signal in group 3. These 2 groups were further distinguished by the Cl-SV signature, which was enriched in group 4. Group 5 was enriched for HET, with about 1/3 of the cases enriched for APOBEC. All groups except group 1 had at least one case with a PIK3CA mutation. No cases with a mutation in MUC4 were identified in groups 1 and 4, while only groups 1 and 5 had at least one case with a BRCA1 mutation. All cases with a mutation in PTEN or in BRCA2 were clustered in group 1. As seen in previous analyses, the identified subgroups are associated with differing characteristics such as the enrichment of mutations in the genes of interest and the association of specific gene mutations with mutation signatures. Almost all BRCA1 mutant cases are associated with the HRD group (group 1), which contains all the cases with a BRCA2 or PTEN mutation.
We also see a weaker association of PIK3CA and MUC4 mutations with the HRD signature.

3.3.7 TNBC genomic subgroup analysis

3.3.7.1 Subgroup comparative analyses of mutation loads

To improve our understanding of the identified genomic subgroups beyond their association with SMGs, DNA damage repair genes, mutation signatures and somatic aberrations, comparative analyses were conducted from both genomic and clinical perspectives to compute the prevalence of mutations in each subgroup and to compare clinical data between subgroups. Fig. 3.18 shows the mutation loads per subgroup based on a) SNVs, b) indels, c) SVs and d) the total mutation load. Based on SNVs, group 2 (HRD) had the highest mutation load, followed by group 1 (HRD + S-Dup), while group 3 (FBI) had the lowest. Based on indels, group 4 (FBI + Cl-SV) had the highest mutation load while group 3 (FBI) had the lowest. Based on SVs, group 1 (HRD + S-Dup) had the highest mutation load while group 5 (HET) had the lowest. Overall, group 2 (HRD) had the highest mutation load, followed by group 1 (HRD + S-Dup), while group 3 (FBI) had the lowest. The summary of these analyses is presented in Fig. 3.19.
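The subgroup comparison described here reduces to averaging per-sample loads within each subgroup and then ranking the subgroups. A small sketch of that step follows, with invented subgroup assignments and loads (neither reflects the cohort) and a hypothetical helper name:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical subgroup assignments and total mutation loads.
subgroup = {"S1": 1, "S2": 1, "S3": 2, "S4": 3, "S5": 3}
total_load = {"S1": 9000, "S2": 11000, "S3": 13000, "S4": 4000, "S5": 5000}


def mean_load_by_group(groups, loads):
    """Return {group: mean load of the samples assigned to that group}."""
    by_group = defaultdict(list)
    for sample, group in groups.items():
        by_group[group].append(loads[sample])
    return {g: mean(v) for g, v in by_group.items()}


means = mean_load_by_group(subgroup, total_load)
# Groups ordered from lowest to highest mean load.
ranking = sorted(means, key=means.get)
print(means, ranking)
```

The same function can be applied to SNV, indel and SV counts separately to obtain one ranking per variant type.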
Figure 3.18: Subgroup mutation loads: (a) SNV average mutation load per subgroup: Group 1 = 8,630, Group 2 = 11,001, Group 3 = 4,041, Group 4 = 7,222, Group 5 = 6,463. (b) Indel average mutation load per subgroup: Group 1 = 1,406, Group 2 = 1,847, Group 3 = 609, Group 4 = 2,211, Group 5 = 982. (c) SV average mutation load per subgroup: Group 1 = 284, Group 2 = 193, Group 3 = 81, Group 4 = 196, Group 5 = 59. (d) Total average mutation load per subgroup: Group 1 = 10,320, Group 2 = 13,041, Group 3 = 4,674, Group 4 = 9,628, Group 5 = 7,504.

Figure 3.19: Subgroup mean mutation loads (panels: mean_indels, mean_snvs, mean_svs and mean_total_load; x-axis: subgroup, ordered 3, 5, 4, 1, 2; y-axis: mutation loads).

3.3.7.2 Subgroup comparative analyses of the distribution of rearrangements

Comparative analyses of subgroup rearrangement distributions (Fig. 3.20) identified the highest proportions of translocations, duplications, deletions, inversions (inversions other than foldback inversions) and foldback inversions in groups 2, 1, 2, 4 and 3 respectively, as shown below.
Figure 3.20: Subgroup rearrangement distributions (x-axis: samples by subgroup; y-axis: proportions of deletions, duplications, foldbacks, inversions and translocations). Group 1 was enriched for duplications (58.69%), Group 2 for deletions (37.73%), Group 3 for deletions (30.64%), Group 4 for inversions (28.27%) and Group 5 for duplications (38.69%). The group with the highest proportion of translocations was Group 2, of duplications Group 1, of deletions Group 2, of inversions Group 4 and of foldbacks Group 3.

               | Group 1  Group 2  Group 3  Group 4  Group 5
---------------|---------------------------------------------
 Translocation | 17.26%   22.15%   13.83%   15.78%   14.40%
 Duplication   | 58.69%   19.24%   21.03%   22.13%   38.70%
 Deletion      | 14.62%   37.73%   30.64%   22.21%   26.66%
 Inversion     | 5.69%    11.65%   20.01%   28.27%   13.39%
 Foldback      | 3.74%    9.22%    14.48%   11.61%   6.86%

3.3.7.3 Subgroup comparative analyses of trinucleotide distributions

Trinucleotide distribution analyses (Fig. 3.21) also revealed distinct trinucleotide distributions per subgroup. Group 2 (HRD) was most enriched for C→A substitutions, followed by group 1 (HRD + S-Dup), with the lowest abundance of C→A substitutions in group 3 (FBI). Group 2 (HRD), followed by group 1 (HRD + S-Dup), had the highest abundance of C→G substitutions while group 3 (FBI) had the lowest. C→T substitutions were highest in group 5 (HET), closely followed by group 2, while group 3 had the lowest.
T→A substitutions were most abundant in group 2 (HRD) and least abundant in group 3 (FBI). T→C substitutions were most abundant in group 2 (HRD) and least abundant in group 3 (FBI), while T→G substitutions were most abundant in group 2 (HRD).

Figure 3.21: Subgroup trinucleotide distributions: (a) C→A substitutions: Means = 802.7714, 948.5385, 281.25, 598.1429, 404.6 for groups 1 - 5 respectively. (b) C→G substitutions: Means = 867.4571, 1055, 269.75, 570.1429, 290.4 for groups 1 - 5 respectively. (c) C→T substitutions: Means = 1001.286, 1279.308, 771.9375, 1154, 1279.4 for groups 1 - 5 respectively. (d) T→A substitutions: Means = 590.1714, 717, 208.75, 409.7143, 279.2 for groups 1 - 5 respectively.
(e) T→C substitutions: Means = 680, 984.2308, 325.125, 590.1429, 758.2667 for groups 1 - 5 respectively. (f) T→G substitutions: Means = 371.7429, 514.4615, 162.3125, 265.1429, 233.7333 for groups 1 - 5 respectively.

3.3.7.4 Subgroup comparative analyses from a clinical perspective

To determine the association between the identified subgroups and clinical outcomes, comparative analyses of patient groups by age, tumour_size, node status, grade and overall survival were conducted. Comparison of subgroups by age identified no large variance in subgroup mean ages (means = 50.08, 52.01, 58.85, 64.5, 62.11 for groups 1 - 5 respectively). However, as seen in Fig. 3.22 a), group 4 (FBI + Cl-SV) had the highest average age (range: 47 - 81) while group 1 (HRD + S-Dup) had the youngest patients (range: 26 - 80), with more than half of the patients aged < 50. We would expect younger patients to be associated with a lower mutation burden, but the contrary was observed: younger patients were clustered in one of the groups with the highest mutation burden (the HRD + S-Dup group). This suggests that younger patients may have more proliferative disease, with tumors that accumulate mutations at a higher rate.

Figure 3.22: Subgroup comparative analyses based on clinical outcomes - age: (a) Age distribution per subgroup. (b) Age distribution of cases in this TNBC cohort: average age = 55.4 years, range = 26 - 82 years.

Based on tumour size and grade, group 2 (HRD) had the highest average tumour_size while group 5 (HET) had the lowest (Fig. 3.23 a)). As shown in Fig. 3.23 c), 24% of the patients in group 1 (HRD + S-Dup) were node positive, while 40%, 67%, 50% and 54% of the patients in groups 2, 3, 4 and 5 respectively were node positive.
These analyses suggest that node positivity is not associated with mutation burden, as groups with high mutation loads had fewer patients whose tumours were node positive. 67% of patients in the lowest mutation load group (group 3 (FBI)) were node positive while only 24% of the patients in a higher mutation load group (group 1 (HRD + S-Dup)) were node positive. Tumour size also appeared dissociated from node positivity, as seen with groups 1 (HRD + S-Dup) (24% node +ve, average tumour_size = 2.7), 3 (FBI) (67% node +ve, average tumour_size = 2.4) and 2 (HRD) (40% node +ve, average tumour_size = 2.9). All cases in each subgroup presented with high grade tumours, except in group 3 (FBI), where some cases (27%) had low grade tumors.

Figure 3.23: Subgroup comparative analyses based on clinical outcomes - tumour size, node status and grade: (a) Tumour_size distribution: Means = 2.77cm, 2.93cm, 2.43cm, 2.83cm, 2.33cm for groups 1 - 5 respectively. (b) Relationship between tumour size, node status and overall survival. (c) Chi-square contingency table of node status and grade observations per subgroup.

A preliminary look at the overall survival of the identified patient subgroups suggested that group 5 (HET) putatively has the best survival outcomes while group 2 (HRD) has the worst. Although the HRD group (group 2) has both the highest mutation burden and the worst overall survival (OS), the FBI group (group 3), which has the lowest mutation burden, is not associated with the best overall survival; group 5 (HET) is. This suggests that mutation burden may not necessarily have a large bearing on patient outcomes. In addition, group 1 (HRD + S-Dup), with a higher average mutation burden, has better OS than the FBI group, with a lower average mutation burden.

From these statistical analyses we can putatively deduce that:

1. Mutation burden may not necessarily have a large bearing on patient outcomes, as the FBI group (group 3) with the lowest mutation burden was not associated with the best overall survival (OS). In addition, group 1 (HRD + S-Dup), with a higher average mutation burden, was found to have a better OS than the FBI group, with a lower average mutation burden.

2. Node positivity is not associated with mutation burden, as groups with high mutation loads had fewer patients whose tumours were node positive. 67% of patients in the lowest mutation load group (group 3 (FBI)) were node positive while only 24% of the patients in a higher mutation load group (group 1 (HRD + S-Dup)) were node positive.

3. Younger patients were identified in one of the groups with the highest mutation burden (HRD + S-Dup, group 1), suggestive of more proliferative disease that could lead to a higher rate at which tumours accumulate mutations in younger patients.

4. Low grade tumours were associated with a low mutation load subgroup (group 3 (FBI)).

5. Tumour size has no bearing on node positivity, as seen with groups 1 (HRD + S-Dup) (24% node +ve, average tumour_size = 2.7) and 3 (FBI) (67% node +ve, average tumour_size = 2.4).

** All the above deductions are pending validation with a larger cohort. A larger cohort would also help in identifying a solid association between SMGs and clinical outcomes.

Chapter 4

Data Access and Visualization Interface

Current advances in sequencing technologies have led to the generation of vast amounts of sequencing data, bringing with them a salient need for data management, access, analysis and visualisation. In addition, as genomic research has become increasingly collaborative, it has become crucial to access and share data in a way that is understandable to research teams, both technical and non-technical.
We extended this functionality to the implemented database by developing Genome-Miner, a flexible, convenient, and interactive web-based database interface that supports global data access, interactive exploration, querying, analysis, visualization and sharing of the clinical outcomes and whole genome profiling data in the developed database. Results are rendered dynamically from the database into the front-end web application as intuitive, interactive plots and data tables for use by researchers, biologists and clinicians, as shown in this chapter. The developed interface also allows researchers to download all results and upload files for data analysis and visualization without needing any programming knowledge. This platform will go a long way in helping researchers generate novel insights and hypotheses by visualizing clinical outcomes and genomic variant data on CNAs, SNVs, SVs and indels.

4.1 Genome-Miner

The front page of the developed platform provides users with the objective and an overview of Genome-Miner (Fig. 4.1). This page also provides navigational links to the interface implementations of the analyses conducted on the data in the database: quality control, mutation burden, genomic visualization, trinucleotide distributions, CNAs, rearrangements, gene-level analysis, subgroup discovery and clinical outcomes analyses and visualizations.

Figure 4.1: Genome-Miner: Front page showing navigational links to the main analysis and visualisation themes provided through this platform (QC, Mutation Loads, Genomic Visualization, Trinucleotide Distributions, CNAs, Rearrangements, Gene Level Analysis, Subgroups and Clinical Outcomes).

4.2 Quality control analyses and visualizations

Figure 4.2: Database interface - user defined QC explorations and visualizations for a) average read coverage b) mapped percentage and c) normal contamination estimates.

QC explorations, analyses and visualizations are based on 5 main parameters: average read coverage, mapped percentage, properly paired reads, normal contamination estimates and tumour ploidy estimates. These analyses are based on sequencing statistics extracted from BAM files (average read coverage, mapped percentage, properly paired reads) and Titan output files (normal contamination and ploidy estimates), providing an overview of the distribution of the specified parameters across the cohort. More data on each of the samples in the cohort is provided through data tables that can be manipulated to dynamically trigger visualizations of interest.

4.3 Mutation load analyses and visualizations per sample and across the cohort

Figure 4.3: Database interface - mutation loads: User defined explorations, analyses and visualizations for a) SNV mutation loads and b) total mutation loads.

Interface-enabled explorations, analyses and visualizations of mutation loads per sample and across the cohort, based on variant types (SNVs, SVs, indels and the total mutation load), are shown in Fig. 4.3. Data tables provide sample-specific mutation loads that can be ordered (ascending or descending) by variant type, based on a user's visualization preference. Users can also refer to the average read coverage for cases of interest.

4.4 Genomic visualizations

The database interface also enables user-triggered visualizations of genomic events across a patient's chromosomes by CNAs (Fig. 4.4 a)), breakpoints or a combination of CNAs and breakpoints (Fig. 4.4 b)).
A user specifies two parameters, the sample of interest and the mutation type, which are then used to generate and render circos plots on the fly, as is the case with all other plots.

Figure 4.4: Database interface - genomic visualization: a) Genomic visualization by CNAs. b) Genomic visualization by CNAs and SNVs.

Fig. 4.4 a) (top plot) shows a circos plot rendered to the user, where track 1 shows the type of copy number aberration per genomic position while the copy number variations are shown in tracks 2 and 3. The bottom plot shows the copy number profile of the sample of interest (case SA586) and shows copy number variations (y-axis) along the genome, denoted by coordinates representing genomic positions (x-axis). Fig. 4.4 b), on the other hand, shows a user-triggered circos plot for both copy number aberrations and structural variants. Structural variations across the chromosome (translocations, duplications, foldback inversions, deletions, and inversions) are shown in track 4, followed by the corresponding links between genomic positions. The interface will also support visualizations of multiple circos plots to support comparative visualizations of the genomic structures of multiple cases, for example, visualizations by subgroup.

4.5 Intra-sample trinucleotide distribution

Figure 4.5: Database interface - trinucleotide distribution per sample.

The developed interface supports visualizations of the trinucleotide distribution across the chromosomes of each case, as shown in Fig. 4.5.
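A per-sample distribution of this kind comes down to a grouped count of substitution classes per chromosome. The sketch below is illustrative only and is not the Genome-Miner implementation: the function names and the SNV tuples are hypothetical, it considers only the substitution class (not the flanking bases), and it assumes the common convention of reporting each change relative to the pyrimidine (C or T) of the base pair.

```python
from collections import Counter

# Watson-Crick complements, used to report purine-reference changes
# on the opposite strand (an assumed, commonly used convention).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Map a single-base change to its pyrimidine-centred class, e.g. 'C>T'."""
    if ref in ("G", "A"):  # purine reference: flip to the other strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def counts_per_chromosome(snvs):
    """Count substitution classes per chromosome.

    snvs is an iterable of (chromosome, ref, alt) tuples for one sample.
    """
    counts = Counter()
    for chrom, ref, alt in snvs:
        counts[(chrom, substitution_class(ref, alt))] += 1
    return counts


# Made-up SNV calls for one hypothetical sample.
snvs = [("1", "C", "T"), ("1", "G", "A"), ("1", "T", "G"), ("2", "A", "C")]
counts = counts_per_chromosome(snvs)
for (chrom, cls), n in sorted(counts.items()):
    print(chrom, cls, n)
```

In a database-backed setting the same grouping would more likely be expressed as a GROUP BY query; the in-memory version is used here only to keep the sketch self-contained.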
Trinucleotide substitution (C>A, C>G, C>T, T>A, T>C, T>G) counts (y-axis) are shown for every chromosome (x-axis), with an extract of the required source data and details provided in the data table.

4.6 CNAs analysis and visualizations per sample and across the cohort

Explorations and visualizations of the distribution and abundance of copy number aberrations across the cohort and within each sample, as triggered by the interface user, are shown in Fig. 4.6 a) and b) respectively.

Figure 4.6: Database interface - distribution of CNAs: a) Overall distribution b) Cohort-wide and intrasample distribution.

4.7 SVs analysis and visualizations per sample and across the cohort

The distribution and abundance of structural variants across the cohort and within each sample are shown in Fig. 4.7 a) and b) respectively. Given the parameters of interest, the interface renders reactive plots and data tables that let users interact with both the data and the rendered plots, as shown in Fig. 4.7.

Figure 4.7: Database interface - distribution of SVs: a) Overall distribution b) Cohort-wide and intrasample distribution.

4.8 Gene-level analysis

Genome-Miner also supports user defined visualizations of mutations in each gene per case (Fig. 4.8). Here, a user specifies genes and variants of interest, and an oncoplot is generated showing which cases have the specified mutations in the selected genes. Each row represents a gene and each column represents a case. Multiple hits in a gene are represented by multiple colours, each representing a specific mutation type in that gene.

Figure 4.8: Database interface - gene-level analysis.

4.9 TNBC subgroup analysis and visualizations

Database-derived stratification of patients, as enabled by the developed interface (Fig. 4.9), can be done based on the user's choice of individual variants (SNVs, CNAs, indels, SVs or mutation signatures) or by the integration of all features (SNVs, CNAs, indels, SVs and mutation signatures). Planned additions include user selection of genes of interest for annotating the clustered heatmap, and user choice of variants from the various variant domains as stratification features.

Figure 4.9: Database interface - subgroup discovery by a) CNAs b) mutation signatures and c) integrated genomic features.

Chapter 5
Conclusions and Future Work

Relational databases have long been used as an indispensable tool in modelling and organizing vast amounts of data, including biological data. Currently, profiling of patient genomes to infer patterns of mutations and genomic events underpinning a patient's disease relies heavily on data stored in flat files, whose structure complicates the tasks required for analyzing relational and complexly structured genomic data. To the best of our knowledge, this is the first study of its kind to address this problem by implementing a database-driven approach that integrates data from whole genome sequences with clinical outcomes for the exploration of the genomic landscapes and mutation characteristics underpinning cancers. The developed clinical outcomes and genomic variants database was further applied to support the mining and comprehensive analysis of clinical outcomes and whole genome profiles of 88 TNBC patients. The functionality of the database was extended to support global data access, interactive exploration, querying, analysis, visualization and sharing of collected data among various research groups through the development of Genome-Miner, a flexible, convenient, and interactive web-based platform.

We demonstrate the applicability of the database to effectively support and enforce quality control checks and measures by filtering data to meet downstream analysis requirements.
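The QC filtering described above can be sketched as a single parameterised query. This is a minimal illustration under stated assumptions, not the study's implementation: it uses Python's built-in sqlite3 module as a stand-in for the PostgreSQL database, reuses the tumour_bamstats column names from the loading script in Appendix A, and the thresholds, sample identifiers and values are hypothetical.

```python
import sqlite3

# Hypothetical QC thresholds; the cut-offs used in the study are not stated here.
MIN_COVERAGE = 30.0
MIN_MAPPED_PCT = 90.0


def passing_samples(con, min_coverage, min_mapped_pct):
    """Return the tumour_ids that meet both QC thresholds, sorted by id."""
    rows = con.execute(
        "SELECT tumour_id FROM tumour_bamstats "
        "WHERE avg_read_coverage >= ? AND mapped_percentage >= ? "
        "ORDER BY tumour_id",
        (min_coverage, min_mapped_pct),
    )
    return [row[0] for row in rows]


# In-memory stand-in for the real database, populated with made-up rows.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tumour_bamstats ("
    "tumour_id TEXT PRIMARY KEY, "
    "avg_read_coverage REAL, "
    "mapped_percentage REAL)"
)
con.executemany(
    "INSERT INTO tumour_bamstats VALUES (?, ?, ?)",
    [("S1", 45.2, 94.9), ("S2", 18.7, 96.1), ("S3", 52.0, 71.3)],
)

qc_pass = passing_samples(con, MIN_COVERAGE, MIN_MAPPED_PCT)
print(qc_pass)
```

Passing the thresholds as query parameters rather than formatting them into the SQL string keeps the filter reusable for different downstream requirements.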
We also demonstrate the utility of the developed database for the exploration and analysis of somatic alterations (CNAs, SNVs, SVs and indels), the results of which show a varying distribution of mutation loads in this TNBC cohort. Also identified were the variation in mutation type loads within each sample and the heterogeneity of TNBCs at the CNA, SV, SNV and indel level. In this study cohort, TP53 (56.9%), PIK3CA (8.9%), PTEN (7.3%), BRCA1 (5.7%), USH2A (4.9%), MUC4 (4.9%) and RB1 (4.1%) were identified as the most frequently mutated genes.

We further applied the database to mine and compute genomic features used for patient stratification and demonstrate, for the first time to the best of our knowledge, the discovery of 5 putative and distinct TNBC genomic subgroups revealed by 23 significant genomic features, 8 of which were mined and computed from the developed database (proportions of: HET, GAIN, DLOH, NLOH, ASCNA, ALOH, BCNA, UBCNA) and the other 15 (mutation signatures: POLE, APOBEC, HRD, UNK, MMRD, T→C, M-Dup, S-Del, Cl-SV, FBI, Cl-FBI, L-Del, S-Dup, Tr and L-Dup) derived using the multi-modal correlated topic model (MMCTM). Each of the identified subgroups exhibited distinct genomic and clinical characteristics.

Results from this research support our hypothesis that TNBC patients can be stratified into distinct subgroups based on their whole genome profiles and that the identified TNBC subgroups exhibit distinct clinical and genomic characteristics. Our results also show that mutation signatures enriched in the identified subgroups associate with specific driver mutations or mutations in DNA damage repair genes. These results provide an improved understanding of TNBCs and will further provide valuable insights into subgroup-specific clinically actionable events, options for novel therapeutic modalities and the identification of patients most likely to respond to specific modalities.
These results also show for the utility of the genome as a potential discriminantbiomarker in patient treatment.5.1 Limitations and future workThe research herein focused on the analysis of whole genome profiles and clinical outcomes ofTNBC patients, however, the elucidation of TNBC subgroups and their respective characteristicsfurther requires a multi-omics approach that integrates and analyses data from all platforms (DNAmethylation, messenger RNA arrays, exome sequencing, microRNA sequencing and reverse-phaseprotein arrays) for the discovery of more informative subgroups. Also, the analysis of a largercohort is still needed to provide more insights into the genomic underpinnings of TBCs and tocorroborate the identified subgroups in this study or identify additional subgroups in TNBC.Secondly, the database was structured to suit the data output from respective variant callers (mu-tationSeq, Titan, deStruct, Lumpy and Strelka). A more generic approach of storing and miningvariants discovered using other variant calling tools other than those used in this study is needed.Also, to note is that the variants in this study were annotated using snpEff. A more generic solutionwill require the mapping and inclusion of annotations from other variant annotation tools such asVEP.Future work will involve implementing a more generalizable database and bulk-loading genomic datafrom other cancers such as ovary into the database to support integrated analyses and inferences895.1. Limitations and future workfrom different cancer types. Prospects to include more data such as gene expression data areunderway.90Bibliography[1] C. G. A. Network et al., “Comprehensive molecular portraits of human breast tumours,”Nature, vol. 490, no. 7418, p. 61, 2012.[2] C. M. Perou, T. Sørlie, M. B. Eisen, M. Van De Rijn, S. S. Jeffrey, C. A. Rees, J. R. Pollack,D. T. Ross, H. Johnsen, L. A. Akslen et al., “Molecular portraits of human breast tumours,”nature, vol. 406, no. 6797, p. 747, 2000.[3] Z. 
Hu, C. Fan, D. S. Oh, J. S. Marron, X. He, B. F. Qaqish, C. Livasy, L. A. Carey, E. Reynolds,L. Dressler, A. Nobel, J. Parker, M. G. Ewend, L. R. Sawyer, J. Wu, Y. Liu, R. Nanda,M. Tretiakova, A. Ruiz Orrico, D. Dreher, J. P. Palazzo, L. Perreard, E. Nelson, M. Mone,H. Hansen, M. Mullins, J. F. Quackenbush, M. J. Ellis, O. I. Olopade, P. S. Bernard, and C. M.Perou, “The molecular portraits of breast tumors are conserved across microarray platforms,”BMC Genomics, vol. 7, p. 96, Apr 2006.[4] B. D. Lehmann, J. A. Bauer, X. Chen, M. E. Sanders, A. B. Chakravarthy, Y. Shyr, andJ. A. Pietenpol, “Identification of human triple-negative breast cancer subtypes and preclinicalmodels for selection of targeted therapies,” The Journal of clinical investigation, vol. 121, no. 7,pp. 2750–2767, 2011.[5] C. M. Perou, “Molecular stratification of triple-negative breast cancers,” Oncologist, vol. 16Suppl 1, pp. 61–70, 2011.[6] F. Bertucci, P. Finetti, and D. Birnbaum, “Basal breast cancer: a complex and deadly molec-ular subtype,” Curr. Mol. Med., vol. 12, no. 1, pp. 96–110, Jan 2012.[7] G. J. Logan, D. J. Dabbs, P. C. Lucas, R. C. Jankowitz, D. D. Brown, B. Z. Clark, S. Oester-reich, and P. F. McAuliffe, “Molecular drivers of lobular carcinoma in situ,” Breast CancerResearch, vol. 17, no. 1, p. 76, 2015.91Bibliography[8] D. Krug and R. Souchon, “Radiotherapy of ductal carcinoma in situ,” Breast Care, vol. 10,no. 4, pp. 259–264, 2015.[9] G. K. Malhotra, X. Zhao, H. Band, and V. Band, “Histological, molecular and functionalsubtypes of breast cancers,” Cancer biology & therapy, vol. 10, no. 10, pp. 955–960, 2010.[10] E. A. Rakha, J. S. Reis-Filho, F. Baehner, D. J. Dabbs, T. Decker, V. Eusebi, S. B. Fox,S. Ichihara, J. Jacquemier, S. R. Lakhani et al., “Breast cancer prognostic classification in themolecular era: the role of histological grade,” Breast Cancer Research, vol. 12, no. 4, p. 207,2010.[11] C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. 
Speed, A. G.Lynch, S. Samarajiwa, Y. Yuan et al., “The genomic and transcriptomic architecture of 2,000breast tumours reveals novel subgroups,” Nature, vol. 486, no. 7403, p. 346, 2012.[12] E. Lerma, G. Peiro, T. Ramón, S. Fernandez, D. Martinez, C. Pons, F. Munoz, J. M. Sabate,C. Alonso, B. Ojeda et al., “Immunohistochemical heterogeneity of breast carcinomas negativefor estrogen receptors, progesterone receptors and her2/neu (basal-like breast carcinomas),”Modern Pathology, vol. 20, no. 11, p. 1200, 2007.[13] N. A. Makretsov, D. G. Huntsman, T. O. Nielsen, E. Yorida, M. Peacock, M. C. Cheang,S. E. Dunn, M. Hayes, M. van de Rijn, C. Bajdik et al., “Hierarchical clustering analysis oftissue microarray immunostaining data identifies prognostically significant groups of breastcarcinoma,” Clinical cancer research, vol. 10, no. 18, pp. 6143–6151, 2004.[14] J. Reis-Filho and A. Tutt, “Triple negative tumours: a critical review,” Histopathology, vol. 52,no. 1, pp. 108–118, 2008.[15] B. G. Haffty, Q. Yang, M. Reiss, T. Kearney, S. A. Higgins, J. Weidhaas, L. Harris, W. Hait,and D. Toppmeyer, “Locoregional relapse and distant metastasis in conservatively managedtriple negative early-stage breast cancer,” Journal of clinical oncology, vol. 24, no. 36, pp.5652–5657, 2006.[16] J. D. Marotti, F. B. de Abreu, W. A. Wells, and G. J. Tsongalis, “Triple-negative breast cancer:next-generation sequencing for target identification,” The American journal of pathology, vol.187, no. 10, pp. 2133–2138, 2017.92Bibliography[17] R. Dent, M. Trudeau, K. I. Pritchard, W. M. Hanna, H. K. Kahn, C. A. Sawka, L. A. Lickley,E. Rawlinson, P. Sun, and S. A. Narod, “Triple-negative breast cancer: clinical features andpatterns of recurrence,” Clinical cancer research, vol. 13, no. 15, pp. 4429–4434, 2007.[18] V. G. Abramson, B. D. Lehmann, T. J. Ballinger, and J. A. Pietenpol, “Subtyping of triple-negative breast cancer: implications for therapy,” Cancer, vol. 121, no. 1, pp. 8–16, 2015.[19] L. A. 
Carey, C. M. Perou, C. A. Livasy, L. G. Dressler, D. Cowan, K. Conway, G. Karaca,M. A. Troester, C. K. Tse, S. Edmiston et al., “Race, breast cancer subtypes, and survival inthe carolina breast cancer study,” Jama, vol. 295, no. 21, pp. 2492–2502, 2006.[20] K. R. Bauer, M. Brown, R. D. Cress, C. A. Parise, and V. Caggiano, “Descriptive analysis ofestrogen receptor (er)-negative, progesterone receptor (pr)-negative, and her2-negative invasivebreast cancer, the so-called triple-negative phenotype: a population-based study from thecalifornia cancer registry,” Cancer, vol. 109, no. 9, pp. 1721–1728, 2007.[21] H. G. Kaplan, J. A. Malmgren, and M. Atwood, “T1n0 triple negative breast cancer: risk ofrecurrence and adjuvant chemotherapy,” The breast journal, vol. 15, no. 5, pp. 454–460, 2009.[22] H. Kennecke, R. Yerushalmi, R. Woods, M. C. U. Cheang, D. Voduc, C. H. Speers, T. O.Nielsen, and K. Gelmon, “Metastatic behavior of breast cancer subtypes,” Journal of clinicaloncology, vol. 28, no. 20, pp. 3271–3277, 2010.[23] R. Dent, W. M. Hanna, M. Trudeau, E. Rawlinson, P. Sun, and S. A. Narod, “Pattern ofmetastatic spread in triple-negative breast cancer,” Breast cancer research and treatment, vol.115, no. 2, pp. 423–428, 2009.[24] N. U. Lin, J. R. Bellon, and E. P. Winer, “Cns metastases in breast cancer,” Journal of clinicaloncology, vol. 22, no. 17, pp. 3608–3617, 2004.[25] F. Heitz, P. Harter, A. Traut, H. Lueck, B. Beutel, and A. du Bois, “Cerebral metastases(cm) in breast cancer (bc) with focus on triple-negative tumors,” Journal of Clinical Oncology,vol. 26, no. 15_suppl, pp. 1010–1010, 2008.[26] G. Von Minckwitz, M. Untch, J.-U. Blohmer, S. D. Costa, H. Eidtmann, P. A. Fasching,B. Gerber, W. Eiermann, J. Hilfrich, J. Huober et al., “Definition and impact of pathologiccomplete response on prognosis after neoadjuvant chemotherapy in various intrinsic breastcancer subtypes,” J Clin oncol, vol. 30, no. 15, pp. 1796–1804, 2012.93Bibliography[27] A. Lee and M. B. 
Djamgoz, “Triple negative breast cancer: emerging therapeutic modalitiesand novel combination therapies,” Cancer treatment reviews, vol. 62, pp. 110–122, 2018.[28] M. D. Burstein, A. Tsimelzon, G. M. Poage, K. R. Covington, A. Contreras, S. A. Fuqua,M. I. Savage, C. K. Osborne, S. G. Hilsenbeck, J. C. Chang et al., “Comprehensive genomicanalysis identifies novel subtypes and targets of triple-negative breast cancer,” Clinical CancerResearch, vol. 21, no. 7, pp. 1688–1698, 2015.[29] J. Stagg and B. Allard, “Immunotherapeutic approaches in triple-negative breast cancer: latestresearch and clinical prospects,” Therapeutic advances in medical oncology, vol. 5, no. 3, pp.169–181, 2013.[30] B. D. Lehmann, B. Jovanović, X. Chen, M. V. Estrada, K. N. Johnson, Y. Shyr, H. L. Moses,M. E. Sanders, and J. A. Pietenpol, “Refinement of triple-negative breast cancer molecularsubtypes: implications for neoadjuvant chemotherapy selection,” PloS one, vol. 11, no. 6, p.e0157368, 2016.[31] R. Gao, A. Davis, T. O. McDonald, E. Sei, X. Shi, Y. Wang, P.-C. Tsai, A. Casasent, J. Waters,H. Zhang et al., “Punctuated copy number evolution and clonal stasis in triple-negative breastcancer,” Nature genetics, vol. 48, no. 10, p. 1119, 2016.[32] S. P. Shah, A. Roth, R. Goya, A. Oloumi, G. Ha, Y. Zhao, G. Turashvili, J. Ding, K. Tse,G. Haffari et al., “The clonal and mutational evolution spectrum of primary triple-negativebreast cancers,” Nature, vol. 486, no. 7403, p. 395, 2012.[33] X. Wang and C. Guda, “Integrative exploration of genomic profiles for triple negative breastcancer identifies potential drug targets,” Medicine, vol. 95, no. 30, 2016.[34] C. J. Lord and A. Ashworth, “Brcaness revisited,” Nature Reviews Cancer, vol. 16, no. 2, p.110, 2016.[35] O. A. Stefansson, J. G. Jonasson, O. T. Johannsson, K. Olafsdottir, M. Steinarsdottir, S. Val-geirsdottir, and J. E. Eyfjord, “Genomic profiling of breast tumours in relation to brca abnor-malities and phenotypes,” Breast Cancer Research, vol. 
11, no. 4, p. R47, 2009.[36] K. A. Kwei, Y. Kung, K. Salari, I. N. Holcomb, and J. R. Pollack, “Genomic instability inbreast cancer: pathogenesis and clinical implications,” Molecular oncology, vol. 4, no. 3, pp.255–266, 2010.94Bibliography[37] M. L. Telli, K. M. Timms, J. Reid, B. Hennessy, G. B. Mills, K. C. Jensen, Z. Szallasi, W. T.Barry, E. P. Winer, N. M. Tung et al., “Homologous recombination deficiency (hrd) scorepredicts response to platinum-containing neoadjuvant chemotherapy in patients with triple-negative breast cancer,” Clinical cancer research, 2016.[38] N. Turner, A. Tutt, and A. Ashworth, “Hallmarks of’brcaness’ in sporadic cancers,” Naturereviews cancer, vol. 4, no. 10, p. 814, 2004.[39] M. K. Graeser, A. McCarthy, C. J. Lord, K. Savage, M. Hills, J. Salter, N. Orr, M. Parton,I. E. Smith, J. Reis-Filho et al., “A marker of homologous recombination predicts pathologicalcomplete response to neoadjuvant chemotherapy in primary breast cancer,” Clinical CancerResearch, pp. clincanres–1027, 2010.[40] T. Jiang, W. Shi, V. B. Wali, L. S. Pongor, C. Li, R. Lau, B. Győrffy, R. P. Lifton, W. F.Symmans, L. Pusztai et al., “Predictors of chemosensitivity in triple negative breast cancer:an integrated genomic analysis,” PLoS medicine, vol. 13, no. 12, p. e1002193, 2016.[41] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, S. A. Aparicio, S. Behjati, A. V. Biankin, G. R.Bignell, N. Bolli, A. Borg, A.-L. Børresen-Dale et al., “Signatures of mutational processes inhuman cancer,” Nature, vol. 500, no. 7463, p. 415, 2013.[42] T. Funnell, A. Zhang, Y.-J. Shiah, D. Grewal, R. Lesurf, S. McKinney, A. Bashashati, Y. K.Wang, P. Boutros, and S. Shah, “Integrated single-nucleotide and structural variation signa-tures of dna-repair deficient human cancers,” bioRxiv, p. 267500, 2018.[43] T. Helleday, S. Eshtad, and S. Nik-Zainal, “Mechanisms underlying mutational signatures inhuman cancers,” Nature Reviews Genetics, vol. 15, no. 9, p. 585, 2014.[44] S. A. Roberts, M. S. 
Lawrence, L. J. Klimczak, S. A. Grimm, D. Fargo, P. Stojanov, A. Kiezun,G. V. Kryukov, S. L. Carter, G. Saksena et al., “An apobec cytidine deaminase mutagenesispattern is widespread in human cancers,” Nature genetics, vol. 45, no. 9, p. 970, 2013.[45] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, P. J. Campbell, and M. R. Stratton, “Deci-phering signatures of mutational processes operative in human cancer,” Cell reports, vol. 3,no. 1, pp. 246–259, 2013.[46] D. Ramazzotti, A. Lal, K. Liu, R. Tibshirani, and A. Sidow, “De novo mutational signaturediscovery in tumor genomes using sparsesignatures,” bioRxiv, 2018.95Bibliography[47] S. Nik-Zainal, H. Davies, J. Staaf, M. Ramakrishna, D. Glodzik, X. Zou, I. Martincorena, L. B.Alexandrov, S. Martin, D. C. Wedge et al., “Landscape of somatic mutations in 560 breastcancer whole-genome sequences,” Nature, vol. 534, no. 7605, p. 47, 2016.[48] H. A. Wahba and H. A. El-Hadaad, “Current approaches in treatment of triple-negative breastcancer,” Cancer biology & medicine, vol. 12, no. 2, p. 106, 2015.[49] C. Crutcher, L. Cornwell, and A. Chagpar, “Effect of triple-negative status on surgical decisionmaking.” ASCO, 2010.[50] E. B. C. T. C. Group et al., “Effects of radiotherapy and of differences in the extent of surgeryfor early breast cancer on local recurrence and 15-year survival: an overview of the randomisedtrials,” The Lancet, vol. 366, no. 9503, pp. 2087–2106, 2005.[51] G. M. Freedman, P. R. Anderson, T. Li, and N. Nicolaou, “Locoregional recurrence of triple-negative breast cancer after breast-conserving surgery and radiation,” Cancer, vol. 115, no. 5,pp. 946–951, 2009.[52] J. Panoff, J. Hurley, C. Takita, I. Reis, W. Zhao, V. Sujoy, C. Gomez, M. Jorda, L. Koniaris,and J. Wright, “Risk of locoregional recurrence by receptor status in breast cancer patientsreceiving modern systemic therapy and post-mastectomy radiation,” Breast cancer researchand treatment, vol. 128, no. 3, pp. 899–906, 2011.[53] B. S. Abdulkarim, J. 
Cuartero, J. Hanson, J. Deschênes, D. Lesniak, and S. Sabri, “Increasedrisk of locoregional recurrence for women with T1-2N0 triple-negative breast cancer treatedwith modified radical mastectomy without adjuvant radiation therapy compared with breast-conserving therapy,” Journal of Clinical Oncology, vol. 29, no. 21, p. 2852, 2011.[54] C. K. Anders and L. A. Carey, “Biology, metastatic patterns, and treatment of patients withtriple-negative breast cancer,” Clinical breast cancer, vol. 9, pp. S73–S81, 2009.[55] T. Ballinger, J. Kremer, and K. Miller, “Triple negative breast cancer-review of current andemerging therapeutic strategies,” 2016.[56] R. Thirumaran, G. C. Prendergast, and P. B. Gilman, “Cytotoxic chemotherapy in clinicaltreatment of cancer,” in Cancer Immunotherapy. Elsevier, 2007, pp. 101–116.96Bibliography[57] P. C. Fong, T. A. Yap, D. S. Boss, C. P. Carden, M. Mergui-Roelvink, C. Gourley, J. De Greve,J. Lubinski, S. Shanley, C. Messiou et al., “Poly (adp)-ribose polymerase inhibition: frequentdurable responses in BRCA carrier ovarian cancer correlating with platinum-free interval,”Journal of clinical oncology, vol. 28, no. 15, pp. 2512–2519, 2010.[58] C. K. Anders, E. P. Winer, J. M. Ford, R. Dent, D. P. Silver, G. W. Sledge, and L. A. Carey,“Poly (adp-ribose) polymerase inhibition:“targeted” therapy for triple-negative breast cancer,”Clinical Cancer Research, vol. 16, no. 19, pp. 4702–4710, 2010.[59] J. S. Lim and D. S. Tan, “Understanding resistance mechanisms and expanding the therapeuticutility of parp inhibitors,” Cancers, vol. 9, no. 8, p. 109, 2017.[60] L. A. Carey, “Directed therapy of subtypes of triple-negative breast cancer,” The oncologist,vol. 16, no. Supplement 1, pp. 71–78, 2011.[61] N. T. Ueno and D. Zhang, “Targeting egfr in triple negative breast cancer,” Journal of Cancer,vol. 2, p. 324, 2011.[62] F. Sobande, L. Dusek, A. Matejková, T. Rozkos, J. Laco, and A. 
Ryska, “Egfr in triple negativebreast carcinoma: significance of protein expression and high gene copy number,” Cesk Patol,vol. 51, no. 2, pp. 80–86, 2015.[63] A. Bahnassy, M. Mohanad, M. F. Ismail, S. Shaarawy, A. El-Bastawisy, and A.-R. N. Zekri,“Molecular biomarkers for prediction of response to treatment and survival in triple negativebreast cancer patients from egypt,” Experimental and molecular pathology, vol. 99, no. 2, pp.303–311, 2015.[64] A. Aleshin and R. S. Finn, “Src: a century of science brought to the clinic,” Neoplasia, vol. 12,no. 8, pp. 599–607, 2010.[65] M. Anbalagan, K. Moroz, A. Ali, L. Carrier, S. Glodowski, and B. G. Rowan, “Subcellularlocalization of total and activated src kinase in african american and caucasian breast cancer,”PloS one, vol. 7, no. 3, p. e33017, 2012.[66] E. M. Kim, K. Mueller, E. Gartner, and J. Boerner, “Dasatinib is synergistic with cetuximaband cisplatin in triple-negative breast cancer cells,” journal of surgical research, vol. 185, no. 1,pp. 231–239, 2013.97Bibliography[67] F. E. Vera-Badillo, A. J. Templeton, P. de Gouveia, I. Diaz-Padilla, P. L. Bedard, M. Al-Mubarak, B. Seruga, I. F. Tannock, A. Ocana, and E. Amir, “Androgen receptor expressionand outcomes in early breast cancer: a systematic review and meta-analysis,” Journal of theNational Cancer Institute, vol. 106, no. 1, p. djt319, 2013.[68] E. Hilborn, J. Gacic, T. Fornander, B. Nordenskjöld, O. Stål, and A. Jansson, “Androgenreceptor expression predicts beneficial tamoxifen response in oestrogen receptor-α-negativebreast cancer,” British journal of cancer, vol. 114, no. 3, p. 248, 2016.[69] N. Tung, J. E. Garber, M. R. Hacker, V. Torous, G. J. Freeman, E. Poles, S. Rodig, B. Alexan-der, L. Lee, L. C. Collins et al., “Prevalence and predictors of androgen receptor and pro-grammed death-ligand 1 in brca1-associated and sporadic triple-negative breast cancer,” NPJBreast Cancer, vol. 2, p. 16002, 2016.[70] R. M. Layman, A. S. Ruppert, M. Lynn, E. Mrozek, B. 
Ramaswamy, M. B. Lustberg,R. Wesolowski, S. Ottman, S. Carothers, A. Bingman et al., “Severe and prolonged lym-phopenia observed in patients treated with bendamustine and erlotinib for metastatic triplenegative breast cancer,” Cancer chemotherapy and pharmacology, vol. 71, no. 5, pp. 1183–1190,2013.[71] M. Schuler, M. Uttenreuther-Fischer, M. Piccart-Gebhart, N. Harbeck, study group, and trialteam, “Bibw 2992, a novel irreversible egfr/her1 and her2 tyrosine kinase inhibitor, for thetreatment of patients with her2-negative metastatic breast cancer after failure of no more thantwo prior chemotherapies.” Journal of Clinical Oncology, vol. 28, no. 15_suppl, pp. 1065–1065,2010.[72] K. Gelmon, R. Dent, J. Mackey, K. Laing, D. McLeod, and S. Verma, “Targeting triple-negative breast cancer: optimising therapeutic outcomes,” Annals of oncology, vol. 23, no. 9,pp. 2223–2234, 2012.[73] Y. K. Wang, A. Bashashati, M. S. Anglesio, D. R. Cochrane, D. S. Grewal, G. Ha, A. McPher-son, H. M. Horlings, J. Senz, L. M. Prentice et al., “Genomic consequences of aberrant dnarepair mechanisms stratify ovarian cancer histotypes,” Nature genetics, vol. 49, no. 6, p. 856,2017.[74] A. Schuh, H. Dreau, S. J. Knight, K. Ridout, T. Mizani, D. Vavoulis, R. Colling, P. Antoniou,98BibliographyE. M. Kvikstad, M. M. Pentony et al., “Clinically actionable mutation profiles in patientswith cancer identified by whole-genome sequencing,” Molecular Case Studies, vol. 4, no. 2, p.a002279, 2018.[75] P. H. Sudmant, T. Rausch, E. J. Gardner, R. E. Handsaker, A. Abyzov, J. Huddleston,Y. Zhang, K. Ye, G. Jun, M. H.-Y. Fritz et al., “An integrated map of structural variation in2,504 human genomes,” Nature, vol. 526, no. 7571, p. 75, 2015.[76] R. Shepherd, S. A. Forbes, D. Beare, S. Bamford, C. G. Cole, S. Ward, N. Bindal, P. Gu-nasekaran, M. Jia, C. Y. Kok et al., “Data mining using the catalogue of somatic mutationsin cancer biomart,” Database, vol. 2011, 2011.[77] E. R. 
Mardis, “The 1, 000genome, the100,000 analysis?” Genome medicine, vol. 2, no. 11,p. 84, 2010.[78] E. F. Codd, “A relational model of data for large shared data banks,” Communications of theACM, vol. 13, no. 6, pp. 377–387, 1970.[79] T. J. Lee, Y. Pouliot, V. Wagner, P. Gupta, D. W. Stringer-Calvert, J. D. Tenenbaum, andP. D. Karp, “Biowarehouse: a bioinformatics database warehouse toolkit,” BMC bioinformat-ics, vol. 7, no. 1, p. 170, 2006.[80] J. Gao, B. A. Aksoy, U. Dogrusoz, G. Dresdner, B. Gross, S. O. Sumer, Y. Sun, A. Jacobsen,R. Sinha, E. Larsson et al., “Integrative analysis of complex cancer genomics and clinicalprofiles using the cbioportal,” Sci. Signal., vol. 6, no. 269, pp. pl1–pl1, 2013.[81] J. R. MacDonald, R. Ziman, R. K. Yuen, L. Feuk, and S. W. Scherer, “The database ofgenomic variants: a curated collection of structural variation in the human genome,” Nucleicacids research, vol. 42, no. D1, pp. D986–D992, 2013.[82] G. Ha, A. Roth, J. Khattra, J. Ho, D. Yap, L. M. Prentice, N. Melnyk, A. McPherson,A. Bashashati, E. Laks et al., “Titan: inference of copy number architectures in clonal cellpopulations from tumor whole-genome sequence data,” Genome research, pp. gr–180 281, 2014.[83] A. McPherson, S. P. Shah, and S. C. Sahinalp, “destruct: Accurate rearrangement detectionusing breakpoint specific realignment,” bioRxiv, p. 117523, 2017.99[84] R. M. Layer, C. Chiang, A. R. Quinlan, and I. M. Hall, “Lumpy: a probabilistic frameworkfor structural variant discovery,” Genome biology, vol. 15, no. 6, p. R84, 2014.[85] J. Ding, A. Bashashati, A. Roth, A. Oloumi, K. Tse, T. Zeng, G. Haffari, M. Hirst, M. A.Marra, A. Condon et al., “Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data,” Bioinformatics, vol. 28, no. 2, pp. 167–175, 2011.[86] C. T. Saunders, W. S. Wong, S. Swamy, J. Becq, L. J. Murray, and R. K. 
Cheetham, “Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs,” Bioinformatics, vol. 28, no. 14, pp. 1811–1817, 2012.
[87] M. S. Lawrence, P. Stojanov, P. Polak, G. V. Kryukov, K. Cibulskis, A. Sivachenko, S. L. Carter, C. Stewart, C. H. Mermel, S. A. Roberts et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, p. 214, 2013.
[88] A. Taylor-Weiner, C. Stewart, T. Giordano, M. Miller, M. Rosenberg, A. Macbeth, N. Lennon, E. Rheinbay, D.-A. Landau, C. J. Wu et al., “Detin: overcoming tumor-in-normal contamination,” Nature Methods, vol. 15, no. 7, p. 531, 2018.
[89] M. J. Taghiyar, J. Rosner, D. Grewal, B. M. Grande, R. Aniba, J. Grewal, P. C. Boutros, R. D. Morin, A. Bashashati, and S. P. Shah, “Kronos: a workflow assembler for genome analytics and informatics,” GigaScience, vol. 6, no. 7, p. gix042, 2017.

Appendix A
Examples of Data Structuring, Bulk-loading and Data Extraction Scripts

A.1 Examples of data structuring and loading scripts

A.1.1 Script to structure and load bam file statistics derived by flagstats

#!/home/rasiimwe/miniconda3/bin/python
import psycopg2
import sys
import csv
import os
import string
import subprocess

con = None


def digits(line):
    # All digit characters on the line, read as a single integer
    return int(''.join(filter(str.isdigit, line)))


def first_field(line):
    # First space-separated field of the line (the read count)
    return line.split(' ')[0]


def bracketed(line):
    # Text inside the trailing "(...)" of the line, e.g. "94.93%" or "mapQ>=5"
    inner = line.split(' ')[-1][1:-1]
    return ')'.join(inner.split(')')[:-1])


def percentage(line):
    # Trailing "(94.93%)" -> "94.93"
    return '%'.join(bracketed(line).split('%')[:-1])


try:
    con = psycopg2.connect("host='localhost' dbname='genomic_variants' user=' ' password=' '")
    cur = con.cursor()

    # creating tumour_bamstats database object
    cur.execute("""create table tumour_bamstats (
        tumour_id varchar primary key references samples (tumour_id),
        normal_id varchar, total_reads bigint, qc_failure bigint,
        duplicates bigint, mapped bigint, mapped_percentage float,
        paired_in_seq bigint, read1 bigint, read2 bigint,
        properly_paired bigint, properly_paired_percentage float,
        self_and_mate_mapped bigint, singletons bigint,
        singletons_percentage float, mate_map_diff_chr bigint,
        mate_map_diff_chr_mapq bigint, mapq varchar, mapq2 varchar,
        avg_read_coverage float)""")

    # Calling extracted stats, structuring and loading into database
    # -------------------------------------------------------------------------
    path = ""  # directory holding the extracted flagstat files (left blank in the original)
    os.chdir(path)
    for file in os.listdir(path):
        tumour_id = '_illumina'.join(file.split('_illumina')[:-1])
        with open(file, "r") as f:
            for i, line in enumerate(f):
                if i == 0:     # in total e.g. 843990938
                    total = digits(line)
                elif i == 1:   # QC failure e.g. 0
                    qc_f = digits(line)
                elif i == 2:   # duplicates e.g. 623568175
                    duplicates = digits(line)
                elif i == 3:   # mapped e.g. 801225533, percentage mapped e.g. (94.93%);
                               # pass mapped and percentage mapped as 2 different variables
                    mapped = first_field(line)
                    mapped_percentage = percentage(line)
                elif i == 4:   # paired in sequencing e.g. 843990938
                    paired_in_seq = digits(line)
                elif i == 5:   # read1 e.g. 421995469
                    read1 = first_field(line)
                elif i == 6:   # read2 e.g. 421995469
                    read2 = first_field(line)
                elif i == 7:   # properly paired e.g. 791170058,
                               # percentage of properly paired (93.74%)
                    properly_paired = first_field(line)
                    properly_paired_percentage = percentage(line)
                elif i == 8:   # with itself and mate mapped e.g. 795636451
                    self_and_mate_mapped = digits(line)
                elif i == 9:   # singletons e.g. 42764211, percentage of singletons (5.07%)
                    singletons = first_field(line)
                    singletons_percentage = percentage(line)
                elif i == 10:  # with mate mapped to a different chr e.g. 2882215
                    mate_mapped_chr = digits(line)
                elif i == 11:  # with mate mapped to a different chr e.g. 2178551, mapQ (mapQ>=5)
                    mate_mapped_chr_diff = first_field(line)
                    mapq = bracketed(line)
                    # the original split an undefined name (l2) here; mapq is assumed
                    mapq2 = 'Q'.join(mapq.split('Q')[-1:])

        # one update per file, once all twelve lines have been read
        cur.execute("""update tumour_bamstats set
            total_reads=(%s), qc_failure=(%s), duplicates=(%s),
            mapped=(%s), mapped_percentage=(%s),
            paired_in_seq=(%s), read1=(%s), read2=(%s),
            properly_paired=(%s), properly_paired_percentage=(%s),
            self_and_mate_mapped=(%s),
            singletons=(%s), singletons_percentage=(%s),
            mate_map_diff_chr=(%s), mate_map_diff_chr_mapq=(%s),
            mapq=(%s), mapq2=(%s)
            where tumour_id = (select tumour_id from samples where
            tumour_archive_id = (%s) and
            samples.tumour_id = tumour_bamstats.tumour_id)""",
            [total, qc_f, duplicates, mapped,
             mapped_percentage, paired_in_seq, read1,
             read2, properly_paired,
             properly_paired_percentage,
             self_and_mate_mapped, singletons,
             singletons_percentage, mate_mapped_chr,
             mate_mapped_chr_diff, mapq, mapq2, tumour_id])
    # -------------------------------------------------------------------------
    con.commit()
except psycopg2.DatabaseError as e:
    if con:
        con.rollback()
    print('Error %s' % e)
    sys.exit(1)
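The string slicing used in the script above to separate a read count from its bracketed percentage can also be expressed as a single regular expression. The following is a minimal sketch, not part of the thesis code: the helper name parse_stat_line is hypothetical, and it assumes each pre-extracted line starts with the count, optionally followed by a percentage in parentheses, as in the examples given in the script comments.

```python
import re

# Hypothetical helper (not from the thesis): matches a leading read count and,
# if present, a percentage in parentheses such as "(94.93%)".
_STAT_LINE = re.compile(r"^\s*(\d+)(?:\s*\(\s*([\d.]+)\s*%)?")


def parse_stat_line(line):
    """Return (count, percentage) for one line; percentage is None if absent."""
    match = _STAT_LINE.match(line)
    if match is None:
        raise ValueError("unrecognised flagstat line: %r" % line)
    count = int(match.group(1))
    pct = float(match.group(2)) if match.group(2) is not None else None
    return count, pct
```

Under these assumptions, a line such as "801225533 (94.93%)" yields the count as an integer and the percentage as a float, while a line without a percentage (including one ending in "(mapQ>=5)") yields None for the percentage, so the caller can store typed values rather than strings.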

A.1.2 Script to structure and load mutationSeq data

library(VariantAnnotation)
library(dplyr)
library(tidyr)
library(splitstackshape)
library(RPostgreSQL)

pw <- { " " }
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "genomic_variants", host = "localhost",
                 user = "rasiimwe", password = pw)
rm(pw)
dbExistsTable(con, "pipeline_result_paths")
museqsnvs <- dbGetQuery(con, "select tumour_id, mutationseq from pipeline_result_paths")
museqsnvs <- as.data.frame(museqsnvs)
tumour_id1 = museqsnvs[1]
museqsnvs.path = museqsnvs[2]

for (i in museqsnvs.path) {
  files <- Sys.glob(file.path(i, "*.vcf"))
  for (f in files) {
    # the tumour id is taken from the fifth component of the file path
    x <- matrix(unlist(strsplit(as.character(f), '/')), ncol = 1, byrow = TRUE)
    tumour_id <- as.character(x[5])
    vcf <- readVcf(f, "hg19")
    if (dim(vcf)[1] != 0) {
      initial <- data.frame(info(vcf))
      initial <- tibble::rownames_to_column(initial, "chrom_pos_ref_alt")
      # split "chrom:pos_ref/alt" row names into chrom, pos, ref and alt
      split1 <- matrix(unlist(strsplit(as.character(initial$chrom_pos_ref_alt), ':')),
                       ncol = 2, byrow = TRUE)
      df <- cbind(initial$chrom_pos_ref_alt, as.data.frame(split1))
      names(df) <- c("chrom_pos_ref_alt", "chrom", "pos")
      split2 <- matrix(unlist(strsplit(as.character(df$pos), '_')), ncol = 2, byrow = TRUE)
      df2 <- cbind(df, split2)
      names(df2) <- c("chrom_pos_ref_alt", "chrom", "pos1", "pos", "ref_alt")
      split3 <- matrix(unlist(strsplit(as.character(df2$ref_alt), '/')), ncol = 2, byrow = TRUE)
      df3 <- cbind(df2, split3)
      names(df3) <- c("chrom_pos_ref_alt", "chrom", "pos1", "pos", "ref_alt", "ref", "alt")
      initial <- cbind(df3$chrom, df3$pos, df3$ref, df3$alt, initial)
      names(initial)[names(initial) == 'df3$chrom'] <- 'chrom'
      names(initial)[names(initial) == 'df3$pos'] <- 'pos'
      names(initial)[names(initial) == 'df3$ref'] <- 'ref'
      names(initial)[names(initial) == 'df3$alt'] <- 'alt'
      initial$chrom_pos_ref_alt <- NULL
      # one row per annotation: split column 13 on commas, then the ANN field on "|"
      newann <- cSplit(initial, 13, sep = ",", direction = "long", fixed = FALSE,
                       drop = TRUE, stripWhite = TRUE, makeEqual = FALSE,
                       type.convert = TRUE)
      newann <- as.data.frame(newann)
      newann[] <- lapply(newann, gsub, pattern = '"', replacement = '')
      newann <- cSplit(newann, "ANN", "|")
      # the next statement is truncated in the original and is reproduced as printed
      names(newann)[names(newann) %in% ...
      newann[] <- lapply(newann, gsub, pattern = '\\(', replacement = '')
      newann[] <- lapply(newann, gsub, pattern = ')', replacement = '')
      newann$allele[] <- lapply(newann$allele, gsub, pattern = 'c', replacement = '')
      newann$tumour_id <- " "
      newann$tumour_id <- tumour_id
      newann <- setNames(newann, tolower(colnames(newann)))
      museq_unfiltered <- newann[, c(35, 1:12, 19:34, 13:18)]
      #museq_unfiltered <- newann[,c(1:12,19:34,13:18)]
      museq_unfiltered <- cbind("id" = 1:nrow(museq_unfiltered), museq_unfiltered)
      dbWriteTable(con, "museq_unfiltered", museq_unfiltered, append = TRUE, row.names = 0)
    }
  }
}
dbDisconnect(con)

A.2 Examples of data extraction scripts

A.2.0.1 Extracting mutation loads per case:

# dbExistsTable() checks one table per call
sapply(c("snvs_intersect", "svs_filtered", "strelka_indels"),
       function(tbl) dbExistsTable(con, tbl))
#snvs
mutation.load <- dbGetQuery(con,
  "SELECT load1.tumour_id, snvs_load, svs_load, indel_load
   FROM (SELECT DISTINCT tumour_id,
                COUNT(DISTINCT(tumour_id, chrom, pos)) AS snvs_load
         FROM snvs_intersect
         WHERE pr >= 0.9 GROUP BY 1 ORDER BY 2) load1
   LEFT OUTER JOIN LATERAL
        (SELECT COUNT(tumour_id) AS svs_load
         FROM svs_filtered
         WHERE load1.tumour_id = svs_filtered.tumour_id
         GROUP BY tumour_id) load2
   ON TRUE LEFT OUTER JOIN LATERAL
        (SELECT COUNT(DISTINCT(tumour_id, chrom, pos)) AS indel_load
         FROM strelka_indels
         WHERE load1.tumour_id = strelka_indels.tumour_id) load3
   ON TRUE")

#sample output
------------------------------------------------------------------------------
 tumour_id | snvs_load | svs_load | indel_load
-----------+-----------+----------+-------------
 SA1064    |       129 |        1 |         18
 SA1058    |       269 |       16 |         23
 SA296     |       450 |       28 |          2
 SA576     |       645 |       37 |          5
 SA1027    |      1376 |       32 |         78
 SA680     |      1954 |       51 |         32
 SA402     |      2127 |       43 |        121
 SA230     |      2193 |       75 |        239
 SA1071    |      2536 |       61 |         29
 SA601     |      2640 |       20 |         45
 SA596     |      2672 |      102 |         71
 SA275     |      2746 |       44 |         32
 SA423     |      3040 |      136 |        161
 SA1062    |      3250 |       89 |         42
 SA286     |      3293 |       43 |         77

A.2.0.2 Extracting samples with specified mutations in genes of interest:

library(gsubfn)  # provides fn$identity() for "$variable" interpolation in the query string

gene.effect <- function(gene, annotation) {
  if (!is.character(gene) || !is.character(annotation)) {
    stop(paste("Expecting gene or annotation to be of type character. You supplied",
               typeof(gene), "for gene and", typeof(annotation), "for annotation"))
  } else if (length(gene) == 0 || length(annotation) == 0) {
    stop("Expecting gene or annotation elements. Your vector is empty")
  } else {
    genes <- as.vector(gene)
    # wrap each annotation in SQL wildcards once, before looping over genes
    annotations <- paste0("%", as.vector(annotation), "%")
    for (i in genes) {
      for (j in annotations) {
        #magic_for(print, silent = TRUE)
        gene <- i
        effect <- j
        query <- fn$identity("select distinct tumour_id, gene_name, annotation
          from strelka_indels where gene_name like '$gene' and annotation like '$effect'
          union select distinct tumour_id, gene_name, annotation from
          snvs_intersect where gene_name like '$gene' and annotation like '$effect'
          and pr >= 0.90 order by gene_name asc")
        x <- dbGetQuery(con, query)
        x <- as.data.frame(x)
        if (nrow(x) != 0) {
          x <- na.omit(x)
          print(x)
        } else {
          next
        }
      }
    }
  }
}
output.gene.effect <- as.data.frame(capture.output(gene.effect(gene, annotation)))
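The gene and annotation lookup above interpolates its arguments directly into the SQL text. A minimal sketch of the same query with bound parameters is given below; it is an illustration under stated assumptions rather than the thesis implementation. The function name build_gene_effect_query and the min_pr argument are hypothetical, and the %s placeholder style assumes a driver such as psycopg2.

```python
def build_gene_effect_query(gene, effect, min_pr=0.90):
    """Return (sql, params) for the indel/SNV union query on one gene and effect.

    Hypothetical helper: the table and column names follow the query in A.2.0.2,
    while the function name and the min_pr argument are assumptions.
    """
    if not isinstance(gene, str) or not isinstance(effect, str):
        raise TypeError("gene and effect must be strings")
    if not gene or not effect:
        raise ValueError("gene and effect must be non-empty")
    sql = (
        "SELECT DISTINCT tumour_id, gene_name, annotation "
        "FROM strelka_indels "
        "WHERE gene_name LIKE %s AND annotation LIKE %s "
        "UNION "
        "SELECT DISTINCT tumour_id, gene_name, annotation "
        "FROM snvs_intersect "
        "WHERE gene_name LIKE %s AND annotation LIKE %s AND pr >= %s "
        "ORDER BY gene_name ASC"
    )
    # The wildcard wrapping is applied once, to the bound value rather than the SQL text
    pattern = "%" + effect + "%"
    params = (gene, pattern, gene, pattern, min_pr)
    return sql, params
```

With an open psycopg2 cursor, the pair returned here would be passed as cur.execute(sql, params), leaving quoting of the gene name and annotation pattern to the driver.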

A.2.0.3 Extracting copy number profile of a case of interest:

pw <- { " " }
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "genomic_variants",
                 host = "localhost", port = 5433,
                 user = "rasiimwe", password = pw)
#on.exit(dbDisconnect(con))
sample <- 'SA994'
query <- fn$identity("select chromosome, start_position_bp as start_pos,
  end_position_bp as end_pos, titan_call, copy_number
  from titan_segs_cnas where
  tumour_id = '$sample' order by 1, 5 asc")
cna.profile <- dbGetQuery(con, query)

#sample output
-----------------------------------------------------------------------------
 chromosome | start_pos | end_pos   | titan_call | copy_number
------------+-----------+-----------+------------+-------------
 1          |    774883 |    811735 | HOMD       | 0
 1          |   1809509 |   2831790 | NLOH       | 2
 1          |    814077 |   1514183 | NLOH       | 2
 1          |  17019673 |  17266878 | HET        | 2
 1          |   5812704 |   8640892 | NLOH       | 2
 1          |   3833193 |   5798176 | NLOH       | 2
 1          |   8695522 |  17005876 | NLOH       | 2
 1          |  17275895 |  19582338 | NLOH       | 2
 1          |  19587201 |  19806850 | NLOH       | 2
 1          |  37365498 |  39169657 | NLOH       | 2
 1          | 232230516 | 249208389 | NLOH       | 2
 1          |   2835487 |   3089976 | NLOH       | 2
 1          |  20376717 |  36454013 | NLOH       | 2
 1          | 120919943 | 146528957 | BCNA       | 4
 1          | 231220183 | 232224588 | BCNA       | 4
 1          | 226334605 | 231187205 | BCNA       | 4
 1          | 225887453 | 226292868 | BCNA       | 4
 1          | 197254186 | 225611118 | BCNA       | 4
 1          | 192443500 | 197201223 | BCNA       | 4
 1          |  36496696 |  37362015 | ALOH       | 4

Appendix B
Significantly Mutated Genes

B.1 50 top significantly mutated genes (SMGs) in this TNBC study cohort identified using MutsigCV

gene expr reptime hic N_nonsilent N_silent N_noncoding n_nonsilent n_silent n_noncoding nnei x X p q
EMCN 231787 807 -5 127075 34775 421590 6 0 2 50 10 19776510 0 0
TP53 2069567 213 34 200785 55445 348465 37 0 1 50 6 18124470 0 0
MUC21 2483101 261 36 248820 84435 111735 7 1 0 47 4 16643185 2.332409e-06 1.466463e-02
PIK3CA 401889 613 11 513110 127855 660270 9 0 0 26 5 12899575 1.166765e-05 5.501878e-02
MUC4 920866 365 49 551915 154570 713895 23 4 0 4 4 1561300 6.284072e-05 1.812296e-01
MB 936853 205 -13 74165 19240 108225 2 0 0 50 11 17708145 7.135792e-05 1.812296e-01
CTU2 802673 189 41 242840 70720 398190 3 0 0 50 12 20439705 7.879780e-05 1.812296e-01
RAB3IL1 1998419 212 69 177970 53105 165750 3 0 0 50 5 24505325 8.475045e-05 1.812296e-01
PTEN 259678 300 34 196820 45760 336765 3 0 0 50 10 20711470 8.647367e-05 1.812296e-01
CLEC9A 351158 445 9 119535 26715 384150 2 0 0 50 11 26085865 1.262553e-04 2.160980e-01
NOL10 804107 344 32 337220 81445 295815 3 0 0 50 11 20631910 1.449161e-04 2.160980e-01
LAMB4 487451 326 4 838240 218270 1281150 5 1 0 37 4 18093855 1.474913e-04 2.160980e-01
PLP1 472103 NaN 26 129610 37700 355875 2 0 0 24 2 13731445 1.489383e-04 2.160980e-01
SPATA4 134663 504 -4 144040 38870 182910 3 0 0 50 14 23871315 1.707353e-04 2.300293e-01
PRB3 311837 673 -1 137020 47840 184275 3 0 0 50 7 15785380 2.310394e-04 2.856918e-01
PDCD6IP 293448 409 9 411580 112970 586755 5 0 1 38 10 17206410 2.423427e-04 2.856918e-01
PIK3R1 48999 619 32 368940 91455 665730 6 1 0 50 12 19350110 2.707102e-04 2.901350e-01
MIDN 2214581 234 29 208325 70720 121875 2 0 0 50 2 15072785 2.768757e-04 2.901350e-01
SERPINB3 174293 370 -5 187460 46735 297960 2 0 0 50 9 23473580 3.351789e-04 3.311552e-01
SYT8 916434 481 33 181610 60190 156390 2 0 0 50 8 20130760 3.559895e-04 3.311552e-01
TFCP2L1 240173 568 9 231140 60970 481065 2 0 0 50 8 24944725 3.686915e-04 3.311552e-01
RUNX1 164915 429 45 226135 70265 319800 3 0 0 50 9 22489350 4.203889e-04 3.481186e-01
ANO2 349024 712 13 478335 125970 716430 5 0 0 11 1 5744960 4.279815e-04 3.481186e-01
TMEM41A 953031 366 31 121290 37245 147030 2 0 0 50 5 21343595 4.568487e-04 3.481186e-01
EZR 564251 163 55 283010 70525 564525 2 0 0 50 5 18855655 4.660136e-04 3.481186e-01
ZCCHC5 263052 NaN 20 216775 61685 11895 3 0 0 50 8 18993455 4.798581e-04 3.481186e-01
CREB3L1 801520 158 36 242255 70525 130845 2 0 0 50 5 19788925 5.353796e-04 3.740122e-01
DNAJC17 1011507 172 46 148395 38610 636285 2 0 0 15 2 7109895 5.709995e-04 3.846497e-01
C6orf89 836025 244 43 169130 44005 263250 2 0 0 50 4 20571005 7.313133e-04 4.680890e-01
OR4X1 400891 509 -27 137735 41080 20280 4 0 0 50 11 18862740 7.533784e-04 4.680890e-01

Appendix C
Database Data Dictionary

The data dictionary presented herein shows the database objects (entities) for which data was collected and their respective descriptions. Described are the fields (variables), their description, data type and constraints for each corresponding object.

Entity 1: Projects - Towards database expansion and analysis of data from different studies, the projects table contains data about the different projects to which each collected sample data belongs.
All current samples in the developed database belong to the TNBC project.

Entity: Projects
Field Name | Description | Data Type | Constraints
project_code | Project code uniquely identifies studies in the database for which variant data from WGS has been collected | CHARACTER_VARYING | PRIMARY_KEY
name | Project name | CHARACTER_VARYING | NOT NULL
Table C.1: Database entity - Projects

Entity 2: Clinical_data - This table contains available clinical outcomes data collected on the samples in this TNBC cohort. Clinical data was collected from various institutions (BC outcomes unit, Montreal, Alberta and McGill) from which samples were collected.

Entity: Clinical_data
Field Name | Description | Data Type | Constraints
consent_id | Clinical id that uniquely identifies patients across systems | CHARACTER_VARYING | PRIMARY_KEY
diagnosis_date | Date of cancer diagnosis | DATE | NULL
age | Patient’s age at diagnosis | INTEGER | NULL
grade | This field captures the patient’s tumour grade and takes on values such as 3, 2 or 1 for grades 3, 2 and 1 respectively. The tumour grade can also be unknown, therefore the field accommodates null values | INTEGER | NULL
tumour_size | Patient’s tumour size in cm. Null values are accommodated as a patient’s tumour size may be unknown | FLOAT | NULL
node_status | Node status (can either be positive, negative or unknown (NULL)) | CHARACTER_VARYING | NULL
HER_status | A patient’s HER2 status as identified by IHC (can either be positive, negative or unknown (NULL)) | CHARACTER_VARYING | NULL
ER_status | A patient’s ER status as identified by IHC (can either be positive, negative or unknown (NULL)) | CHARACTER_VARYING | NULL
PR_status | A patient’s PR status as identified by IHC (can either be positive, negative or unknown (NULL)) | CHARACTER_VARYING | NULL
OS_status | A patient’s overall survival status (can either be alive or dead or unknown (NULL)) | CHARACTER_VARYING | NULL
OS_years | A patient’s overall survival in years | FLOAT | NULL
DSS_status | A patient’s disease specific survival status (can either be alive or dead or unknown (NULL)) | CHARACTER_VARYING | NULL
DSS_years | A patient’s disease specific survival in years | FLOAT | NULL
PFS_status | A patient’s progression free survival status (can either be alive or dead or unknown (NULL)) | CHARACTER_VARYING | NULL
PFS_years | A patient’s progression free survival in years | FLOAT | NULL
Table C.2: Database entity - Clinical_data

Entity 3: Samples - This table contains metadata on each of the patient samples. Data collected includes the sample identifier, clinical identifier, facility of origin, sample type and the associated project to which a sample belongs.

Entity: Samples
Field Name | Description | Data Type | Constraints
tumour_id | This field captures the unique tumour identifier | CHARACTER_VARYING | COMPOSITE PRIMARY_KEY (unique id pair (tumour_id & normal_id))
normal_id | Normal sample identifier (part of the composite primary key). This value is not unique as a patient with 2 tumour samples will have 1 normal sample associated with each matched tumour sample | CHARACTER_VARYING | NOT NULL
consent_id | Patient clinical identifier | CHARACTER_VARYING | FOREIGN_KEY references clinical_data(consent_id)
facility_of_origin | This field captures the facility/institution of origin of each sample | CHARACTER_VARYING | NOT NULL
sample_type | The sample type can either be xenograft or primary | CHARACTER_VARYING | NOT NULL
project_code | The code (identifier) of the project to which a sample belongs | CHARACTER_VARYING | FOREIGN_KEY references projects(project_code)
Table C.3: Database entity - Samples

Entity 4: Titan_params_cnas - This table contains relevant parameters (normal_contamination_estimate and average_tumour_ploidy_estimate) on each sample inferred by TITAN.

Entity: Titan_params_cnas
Field Name | Description | Data Type | Constraints
tumour_id | Unique identifier for a tumour sample | CHARACTER_VARYING | PRIMARY_KEY
normal_contamination_estimate | This field captures the normal contamination estimate derived from TITAN and denotes the proportion of normal content in a sample | FLOAT | NOT NULL
average_tumour_ploidy_estimate | The average number of estimated copies in the genome (2 represents diploid) | FLOAT | NOT NULL
Table C.4: Database entity - Titan_params_cnas

Entity 5: Titan_segs_cnas - This table contains copy number aberrations per genomic segment inferred by TITAN.

Entity: Titan_segs_cnas
Field Name | Description | Data Type | Constraints
id | Serial id | INTEGER | PRIMARY_KEY (SEQ)
tumour_id | Tumour/sample id associated with the variant called | CHARACTER_VARYING |
chromosome | Chromosome with copy number aberration | CHARACTER_VARYING | NOT NULL
start_position_bp | Segment start position | BIGINT | NOT NULL
end_position_bp | Segment end position | BIGINT | NOT NULL
length_bp | Number of SNPs in the segment | INTEGER | NOT NULL
median_ratio | Median allelic ratio across SNPs in a segment | FLOAT | NOT NULL
median_logr | Median log ratio across SNPs in the segment | FLOAT | NOT NULL
titan_state | State number used by TITAN | INTEGER | NOT NULL
titan_call | Interpretable TITAN state (Can be HOMD, DLOH, HET, NLOH, ALOH, ASCNA, BCNA, UBCNA) | CHARACTER_VARYING | NOT NULL
copy_number | Predicted TITAN copy number | INTEGER | NOT NULL
minorcn | Copy number of minor allele | INTEGER | NOT NULL
majorcn | Copy number of major allele | INTEGER | NOT NULL
clonal_cluster | Predicted TITAN clonal cluster | INTEGER | NOT NULL
clonal_frequency | Clonal frequency | INTEGER | NOT NULL
gene_name | Mutated gene(s) in segment | CHARACTER_VARYING | NOT NULL
Table C.5: Database entity - Titan_segs_cnas

Entity 6: Titan_outfile_cnas - This table contains CNAs derived from Titan.

Entity: Titan_outfile_cnas
Field Name | Description | Data Type | Constraints
id | Serial id | CHARACTER_VARYING | NOT NULL
tumour_id | Tumour/sample id associated with the variant called | CHARACTER_VARYING |
chr | Chromosome | CHARACTER_VARYING | NOT NULL
position | Position | BIGINT |
refcount | Number of reads matching the reference base | INTEGER | NOT NULL
nrefcount | Number of reads matching the non-reference base | INTEGER | NOT NULL
depth | Total read depth at a position | INTEGER | NOT NULL
allelicratio | Refcount/depth | FLOAT | NOT NULL
logratio | Log2 ratio between normalized tumour and normal read depths | FLOAT | NOT NULL
copynumber | Predicted TITAN copy number | INTEGER | NOT NULL
titanstate | Internal state number used by TITAN | INTEGER | NOT NULL
titancall | Interpretable TITAN state (Can be HOMD, DLOH, HET, NLOH, ALOH, ASCNA, BCNA, UBCNA) | CHARACTER_VARYING | NOT NULL
clonalcluster | Predicted TITAN clonal cluster | INTEGER | NOT NULL
cellularprevalence | Proportion of tumour cells containing genomic event | FLOAT | NOT NULL
Table C.6: Database entity - Titan_outfile_cnas

Entity 7: Museq_snvs - This table contains somatic single nucleotide variants (SNV) detected at each genomic position using mutationSeq for each sample in the cohort.
** The table snvs_intersect contains a mapping of snvs detected by mutationSeq with those detected by Strelka. The table therefore inherits fields and descriptions as those of the museq_snvs table.

Entity: Museq_snvs
Field Name | Description | Data Type | Constraints
id | Serial id | CHARACTER_VARYING | NOT NULL
tumour_id | Sample identifier associated with the variant called at each position | CHARACTER_VARYING | NOT NULL
chrom | Chromosome identifier from the reference genome | CHARACTER_VARYING | NOT NULL
pos | Reference position | BIGINT | NOT NULL
ref | Reference nucleotide at position (pos) of the chromosome | CHARACTER_VARYING | NOT NULL
alt | Alternate non-reference allele | CHARACTER_VARYING | NOT NULL
pr | Probability of somatic mutation | FLOAT | NOT NULL
tc | Tri-nucleotide context | CHARACTER_VARYING | NOT NULL
tr | Count of tumour with reference to REF | INTEGER | NOT NULL
ta | Count of tumour with reference to ALT | INTEGER | NOT NULL
nr | Count of normal with reference to REF | INTEGER | NOT NULL
na | Count of normal with reference to ALT | INTEGER | NOT NULL
nd | Number of deletions | INTEGER | NOT NULL
ni | Number of insertions | INTEGER | NOT NULL
allele | Identifies the alt being referred to in instances of multiple alt fields | CHARACTER_VARYING | NOT NULL
annotation | Effect or consequence annotated using sequence ontology terms e.g. splice_donor | CHARACTER_VARYING | NOT NULL
annotation_impact | Estimation of putative impact or deleteriousness (Can either be HIGH, MODERATE, LOW or MODIFIER) | CHARACTER_VARYING | NOT NULL
gene_name | Common gene name (HGNC) | CHARACTER_VARYING | NOT NULL
gene_id | Gene ID | CHARACTER_VARYING | NOT NULL
feature_type | Transcript, motif, miRNA | CHARACTER_VARYING | NOT NULL
feature_id | Depending on the annotation, this may be: Transcript ID, Motif ID, miRNA, ChipSeq peak, Histone mark, etc | CHARACTER_VARYING | NOT NULL
transcript_biotype | Description on whether the transcript is coding or noncoding | CHARACTER_VARYING | NOT NULL
rank | Exon or intron rank or total number of exons or introns | CHARACTER_VARYING | NOT NULL
hgvs_c | Variant using HGVS notation (DNA level) | CHARACTER_VARYING | NOT NULL
hgvs_p | If variant is coding, this field describes the variant using HGVS notation (protein level) | CHARACTER_VARYING | NOT NULL
cdna_pos_cdna_length | Position in cDNA and transcript’s cDNA length | INTEGER | NOT NULL
cds_pos_cds_length | Position and number of coding bases | INTEGER | NOT NULL
aa_pos_aa_length | Position and number of AA | INTEGER | NOT NULL
distance | Context and implementation dependent. E.g. when the variant is “intronic” the annotation may show the distance to the closest exon; when the variant is “intergenic” it may show the distance to the closest gene; and when the variant is “upstream or downstream” may show the distance to the closest 5’UTR or 3’UTR base | INTEGER | NOT NULL
errors_warning_info | Warnings or information messages | CHARACTER_VARYING | NOT NULL
Table C.7: Database entity - Museq_snvs

Entity 8: Strelka_snvs - This table contains somatic single nucleotide variants (SNV) detected at each genomic position using Strelka for each sample in the cohort.

Entity: Strelka_snvs
Field Name | Description | Data Type | Constraints
id | Serial id | CHARACTER_VARYING | NOT NULL
tumour_id | Sample identifier associated with the variant called at each position | CHARACTER_VARYING | NOT NULL
chrom | Chromosome identifier from the reference genome | CHARACTER_VARYING | NOT NULL
pos | Reference position | BIGINT | NOT NULL
ref | Reference nucleotide at position (pos) of the chromosome | CHARACTER_VARYING | NOT NULL
alt | Alternate non-reference allele | CHARACTER_VARYING | NOT NULL
qss | SNV quality score | INTEGER | NOT NULL
tqss | Data tier used to compute QSS | INTEGER | NOT NULL
nt | Genotype of the normal in all data tiers, as used to classify somatic variants | CHARACTER_VARYING | NOT NULL
qss_nt | Quality score reflecting the joint probability of a somatic variant and NT | INTEGER | NOT NULL
tqss_nt | Data tier used to compute QSS_NT | INTEGER | NOT NULL
sgt | Most likely somatic genotype excluding normal noise states | CHARACTER_VARYING | NOT NULL
somatic | Denoting somatic mutation | CHARACTER_VARYING | NOT NULL
allele | Identifies the alt being referred to in instances of multiple alt fields | CHARACTER_VARYING | NOT NULL
annotation | Effect or consequence annotated using sequence ontology terms e.g. splice_donor | CHARACTER_VARYING | NOT NULL
annotation_impact | Estimation of putative impact or deleteriousness (Can either be HIGH, MODERATE, LOW or MODIFIER) | CHARACTER_VARYING | NOT NULL
gene_name | Common gene name (HGNC) | CHARACTER_VARYING | NOT NULL
gene_id | Gene ID | CHARACTER_VARYING | NOT NULL
feature_type | Transcript, motif, miRNA | CHARACTER_VARYING | NOT NULL
feature_id | Depending on the annotation, this may be: Transcript ID, Motif ID, miRNA, ChipSeq peak, Histone mark, etc | CHARACTER_VARYING | NOT NULL
transcript_biotype | Description on whether the transcript is coding or noncoding | CHARACTER_VARYING | NOT NULL
rank | Exon or intron rank or total number of exons or introns | CHARACTER_VARYING | NOT NULL
hgvs_c | Variant using HGVS notation (DNA level) | CHARACTER_VARYING | NOT NULL
hgvs_p | If variant is coding, this field describes the variant using HGVS notation (protein level) | CHARACTER_VARYING | NOT NULL
cdna_pos_cdna_length | Position in cDNA and transcript’s cDNA length | INTEGER | NOT NULL
cds_pos_cds_length | Position and number of coding bases | INTEGER | NOT NULL
aa_pos_aa_length | Position and number of AA | INTEGER | NOT NULL
distance | Context and implementation dependent. E.g. when the variant is “intronic” the annotation may show the distance to the closest exon; when the variant is “intergenic” it may show the distance to the closest gene; and when the variant is “upstream or downstream” may show the distance to the closest 5’UTR or 3’UTR base | INTEGER | NOT NULL
errors_warning_info | Warnings or information messages | CHARACTER_VARYING | NOT NULL
Table C.8: Database entity - Strelka_snvs

Entity 9: Strelka_indels - This table contains insertions and deletions detected at each genomic position using Strelka for each sample in the cohort.

Entity: Strelka_indels
Field Name | Description | Data Type | Constraints
somatic | Flag denoting somatic mutation | CHARACTER_VARYING | NOT NULL
overlap | Flag denoting somatic indel possibly overlaps a second indel | CHARACTER_VARYING | NOT NULL
allele | Identifies the alt being referred to in instances of multiple alt fields | CHARACTER_VARYING | NOT NULL
annotation | Effect or consequence annotated using sequence ontology terms e.g. splice_donor | CHARACTER_VARYING | NOT NULL
annotation_impact | Estimation of putative impact or deleteriousness (Can either be HIGH, MODERATE, LOW or MODIFIER) | CHARACTER_VARYING | NOT NULL
gene_name | Common gene name (HGNC) | CHARACTER_VARYING | NOT NULL
gene_id | Gene ID | CHARACTER_VARYING | NOT NULL
feature_type | Transcript, motif, miRNA | CHARACTER_VARYING | NOT NULL
feature_id | Depending on the annotation, this may be: Transcript ID, Motif ID, miRNA, ChipSeq peak, Histone mark, etc | CHARACTER_VARYING | NOT NULL
transcript_biotype | Description on whether the transcript is coding or noncoding | CHARACTER_VARYING | NOT NULL
rank | Exon or intron rank or total number of exons or introns | CHARACTER_VARYING | NOT NULL
hgvs_c | Variant using HGVS notation (DNA level) | CHARACTER_VARYING | NOT NULL
hgvs_p | If variant is coding, this field describes the variant using HGVS notation (protein level) | CHARACTER_VARYING | NOT NULL
cdna_pos_cdna_length | Position in cDNA and transcript’s cDNA length | INTEGER | NOT NULL
cds_pos_cds_length | Position and number of coding bases | INTEGER | NOT NULL
aa_pos_aa_length | Position and number of AA | INTEGER | NOT NULL
distance | Context and implementation dependent. E.g. when the variant is “intronic” the annotation may show the distance to the closest exon; when the variant is “intergenic” it may show the distance to the closest gene; and when the variant is “upstream or downstream” may show the distance to the closest 5’UTR or 3’UTR base | INTEGER | NOT NULL
errors_warning_info | Warnings or information messages | CHARACTER_VARYING | NOT NULL
Table C.9: Database entity - Strelka_indels

Entity 10: Destruct_breakpoints - This table contains structural variants derived from deStruct.

Entity: Destruct_breakpoints
Field Name | Description | Data Type | Constraints
id | Serial id | CHARACTER_VARYING | NOT NULL
tumour_id | Sample identifier associated with the variant called at each position | CHARACTER_VARYING | NOT NULL
prediction_id | Unique identifier of the breakpoint prediction | INTEGER | NOT NULL
chromosome_1 | Chromosome for breakend 1 | CHARACTER_VARYING | NOT NULL
strand_1 | Strand for breakend 1 | CHARACTER_VARYING | NOT NULL
position_1 | Position of breakend 1 | BIGINT | NOT NULL
chromosome_2 | Chromosome for breakend 2 | CHARACTER_VARYING | NOT NULL
strand_2 | Strand for breakend 2 | CHARACTER_VARYING | NOT NULL
position_2 | Position of breakend 2 | BIGINT | NOT NULL
homology | Sequence homology at the breakpoint | INTEGER | NOT NULL
num_split | Total number of discordant reads split by the breakpoint | INTEGER | NOT NULL
inserted | Nucleotides inserted at the breakpoint | CHARACTER_VARYING | NOT NULL
mate_score | Average score of mate reads aligning as if concordant | FLOAT | NOT NULL
template_length_1 | Length of region to which discordant reads align at breakend 1 | INTEGER | NOT NULL
log_cdf | Mean cdf of discordant read alignment likelihood | FLOAT | NOT NULL
template_length_2 | Length of region to which discordant reads align at breakend 2 | INTEGER | NOT NULL
log_likelihood | Mean likelihood of discordant read alignments | FLOAT | NOT NULL
template_length_min | Minimum of template_length_1 and template_length_2 | INTEGER | NOT NULL
num_reads | Total number of discordant reads | INTEGER | NOT NULL
num_unique_reads | Total number of discordant reads, potential PCR duplicates removed | INTEGER | NOT NULL
type | Breakpoint orientation type: deletion: +-, inversion: ++ or --, duplication: -+, translocation: 2 different chromosomes | CHARACTER_VARYING | NOT NULL
num_inserted | Number of untemplated nucleotides inserted at the breakpoint | INTEGER | NOT NULL
sequence | Sequence as predicted by discordant reads and possibly split reads | CHARACTER_VARYING | NOT NULL
gene_id_1 | Ensembl gene id for gene at or near breakend 1 | CHARACTER_VARYING | NOT NULL
gene_name_1 | Name of the gene at or near breakend 1 | CHARACTER_VARYING | NOT NULL
gene_location_1 | Location of the gene with respect to the breakpoint for breakend 1 | CHARACTER_VARYING | NOT NULL
gene_id_2 | Ensembl gene id for gene at or near breakend 2 | CHARACTER_VARYING | NOT NULL
gene_name_2 | Name of the gene at or near breakend 2 | CHARACTER_VARYING | NOT NULL
gene_location_2 | Location of the gene with respect to the breakpoint for breakend 2 | CHARACTER_VARYING | NOT NULL
dgv_ids | Database of genomic variants annotation for germline variants | CHARACTER_VARYING | NOT NULL
Table C.10: Database entity - Destruct_breakpoints

Entity 11: Lumpy_svs - This table contains sample specific structural variants discovered by Lumpy.

Entity: Lumpy_svs
Field Name | Description | Data Type | Constraints
id | Serial id | CHARACTER_VARYING | NOT NULL
tumour_id | Sample identifier associated with the variant called at each position | CHARACTER_VARYING | NOT NULL
chrom | Chromosome | CHARACTER_VARYING |
chstart | Chromosome start | INTEGER | NOT NULL
chend | Chromosome end | INTEGER | NOT NULL
width | Width of range | CHARACTER_VARYING | NOT NULL
strand | Segment strand | CHARACTER_VARYING | NOT NULL
paramrangeid | Distinguishes which records came from which range | CHARACTER_VARYING | NULL
ref | Reference nucleotide at position (pos) of the chromosome | CHARACTER_VARYING | NOT NULL
alt | Alternate non-reference allele | CHARACTER_VARYING | NOT NULL
qual | Phred_quality score | CHARACTER_VARYING | NOT NULL
filter | PASS if the probability of being somatic is greater than threshold | | NOT NULL
svtype | Type of structural variant | CHARACTER_VARYING | NOT NULL
svlen | Difference in length between REF and ALT alleles | INTEGER | NOT NULL
end | End position of the variant described in this record | BIGINT | NOT NULL
strands | Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--) | CHARACTER_VARYING | NOT NULL
imprecise | Flag denoting imprecise structural variation | CHARACTER_VARYING | NOT NULL
Database Data Dictionarycipos Confidence interval around POS for im-precise variantsINTEGER NOT NULLciend Confidence interval around END for im-precise variantsINTEGER NOT NULLcipos95 Confidence interval (95%) around POSfor imprecise variantsINTEGER NOT NULLciend95 Confidence interval (95%) around ENDfor imprecise variantsINTEGER NOT NULLmateid ID of mate breakends CHARACTER_VARYINGNOT NULLevent ID of event associated to breakend CHARACTER_VARYINGNOT NULLsecondary Flag denoting secondary breakend in amulti-line variantsCHARACTER_VARYINGNOT NULLsu Number of pieces of evidence supportingthe variant across all samplesINTEGER NOT NULLpe Number of paired-end reads supportingthe variant across all samplesINTEGER NOT NULLsr Number of split reads supporting thevariant across all samplesINTEGER NOT NULLbd Amount of BED evidence supporting thevariant across all samplesINTEGER NOT NULLev Type of LUMPY evidence contributingto the variant callCHARACTER_VARYINGNOT NULLprpos Probability curve of the POS breakend CHARACTER_VARYINGNOT NULLprend Probability curve of the END breakend CHARACTER_VARYINGNOT NULLTable C.11: Database entity - Lumpy_svs133Appendix C. 
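
The orientation rule given for the type field of Destruct_breakpoints (Table C.10), which the strands field of Lumpy_svs mirrors, can be illustrated with a short sketch. This is not deStruct's or Lumpy's own implementation. The function name classify_breakpoint is hypothetical, the second inversion orientation is read as "--", and the sketch assumes the two breakends are ordered by position when they lie on the same chromosome, which the dictionary does not state.

```python
def classify_breakpoint(chrom_1: str, strand_1: str,
                        chrom_2: str, strand_2: str) -> str:
    """Map breakend chromosomes and strands to a breakpoint type.

    Follows the orientation rule in the data dictionary:
    deletion +-, inversion ++ or --, duplication -+,
    translocation when the two breakends are on different chromosomes.
    Assumes breakend 1 precedes breakend 2 on a shared chromosome.
    """
    if chrom_1 != chrom_2:
        return "translocation"
    orientation = strand_1 + strand_2
    if orientation == "+-":
        return "deletion"
    if orientation == "-+":
        return "duplication"
    if orientation in ("++", "--"):
        return "inversion"
    raise ValueError(f"unexpected strand pair: {orientation!r}")


# Hypothetical breakends, for illustration only.
examples = [
    ("1", "+", "1", "-"),
    ("1", "-", "1", "+"),
    ("2", "+", "2", "+"),
    ("3", "+", "8", "-"),
]
labels = [classify_breakpoint(*e) for e in examples]
```

Checking the chromosomes before the strands means a pair on different chromosomes is always labelled a translocation, whatever its strand orientation.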
Entity 12: Svs_filtered - Table containing breakpoints filtered for low mappability regions from both deStruct and Lumpy.

Entity: Svs_filtered
Field Name | Description | Data Type | Constraints
id | Serial id | CHARACTER_VARYING | NOT NULL
tumour_id | Sample identifier associated with the variant called at each position | CHARACTER_VARYING | NOT NULL
chrom_1 | Chromosome for breakend 1 | CHARACTER_VARYING | NOT NULL
brk_1 | Break 1 | BIGINT | NOT NULL
chrom_2 | Chromosome for breakend 2 | CHARACTER_VARYING | NOT NULL
brk_2 | Break 2 | BIGINT | NOT NULL
homlen | Length of base pair identical micro-homology at event breakpoints | BIGINT | NOT NULL
brk_dist | Break distance | BIGINT | NOT NULL
type | Type of structural variant | CHARACTER_VARYING | NOT NULL

Table C.12: Database entity - Svs_filtered

Entity 13: Bamstats_tumour - This table contains sequencing statistics derived from the bam file of each tumour sample.

Entity: Bamstats_tumour
Field Name | Description | Data Type | Constraints
tumour_id | Sample identifier associated with the variant called at each position | CHARACTER_VARYING | NOT NULL
total_reads | Number of reads that are in a sample's bam file | INTEGER | NOT NULL
qc_failure | Number of reads marked QC failure | INTEGER | NOT NULL
duplicates | Number of duplicate reads | INTEGER | NOT NULL
mapped | Number of reads marked mapped in the flag | INTEGER |
mapped_percentage | Number of mapped reads / total reads (%) | FLOAT | NOT NULL
paired_in_seq | Reads paired in sequencing | INTEGER | NOT NULL
read1 | Count read1 | INTEGER | NOT NULL
read2 | Count read2 | INTEGER |
properly_paired | Properly paired reads | INTEGER | NOT NULL
properly_paired_percentage | Percentage of properly paired reads | FLOAT | NOT NULL
self_and_mate_mapped | Number of reads for which both reads mapped | INTEGER | NOT NULL
singletons | Reads that mapped but the mate didn't | INTEGER |
singletons_percentage | Percentage of reads that mapped but the mate didn't | FLOAT | NOT NULL
mate_map_diff_chr | Number of reads with a mate mapped on a different chromosome | INTEGER | NOT NULL
mate_map_diff_chr_mapq | Number of reads with a mate mapped on a different chromosome - mapping quality | INTEGER | NOT NULL
mapq | The phred scaled probability of the alignment/base being wrong | FLOAT | NOT NULL
avg_read_coverage | Average read coverage | FLOAT | NOT NULL

Table C.13: Database entity - Bamstats_tumour

Entity 14: Bamstats_normal - This table contains sequencing statistics derived from the bam file of each normal sample.

Entity: Bamstats_normal
Field Name | Description | Data Type | Constraints
normal_id | Matched normal-sample identifier | CHARACTER_VARYING | NOT NULL
total_reads | Number of reads that are in a sample's bam file | INTEGER | NOT NULL
qc_failure | Number of reads marked QC failure | INTEGER | NOT NULL
duplicates | Number of duplicate reads | INTEGER | NOT NULL
mapped | Number of reads marked mapped in the flag | INTEGER |
mapped_percentage | Number of mapped reads / total reads (%) | FLOAT | NOT NULL
paired_in_seq | Reads paired in sequencing | INTEGER | NOT NULL
read1 | Count read1 | INTEGER | NOT NULL
read2 | Count read2 | INTEGER |
properly_paired | Properly paired reads | INTEGER | NOT NULL
properly_paired_percentage | Percentage of properly paired reads | FLOAT | NOT NULL
self_and_mate_mapped | Number of reads for which both reads mapped | INTEGER | NOT NULL
singletons | Reads that mapped but the mate didn't | INTEGER |
singletons_percentage | Percentage of reads that mapped but the mate didn't | FLOAT | NOT NULL
mate_map_diff_chr | Number of reads with a mate mapped on a different chromosome | INTEGER | NOT NULL
mate_map_diff_chr_mapq | Number of reads with a mate mapped on a different chromosome - mapping quality | INTEGER | NOT NULL
mapq | The phred scaled probability of the alignment/base being wrong | FLOAT | NOT NULL
avg_read_coverage | Average read coverage | FLOAT | NOT NULL

Table C.14: Database entity - Bamstats_normal
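
As an illustration of how a dictionary entry translates into a table definition, the following minimal sketch builds a four-column subset of Bamstats_normal and derives mapped_percentage from mapped and total_reads. It is not the thesis schema: an in-memory SQLite database stands in for PostgreSQL, VARCHAR stands in for CHARACTER_VARYING, and the helper names and the sample row are hypothetical.

```python
import sqlite3

# Subset of the Bamstats_normal columns in Table C.14.
# SQLite is only a stand-in for PostgreSQL in this sketch.
DDL = """
CREATE TABLE bamstats_normal (
    normal_id         VARCHAR NOT NULL,
    total_reads       INTEGER NOT NULL,
    mapped            INTEGER,           -- no constraint listed in the dictionary
    mapped_percentage FLOAT   NOT NULL
)
"""


def mapped_percentage(mapped: int, total_reads: int) -> float:
    """Mapped reads as a percentage of total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped / total_reads


def load_row(conn: sqlite3.Connection, normal_id: str,
             total_reads: int, mapped: int) -> None:
    """Insert one row, computing the derived percentage on the way in."""
    conn.execute(
        "INSERT INTO bamstats_normal VALUES (?, ?, ?, ?)",
        (normal_id, total_reads, mapped,
         mapped_percentage(mapped, total_reads)),
    )


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
load_row(conn, "normal_001", 1000, 950)  # hypothetical sample values
```

With the NOT NULL constraints in place, an insert that omits total_reads or mapped_percentage is rejected by the database, while mapped may be left empty, matching the blank constraint cell in the table above.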
