@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Science, Faculty of"@en, "Chemistry, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Kovalchik, Kevin"@en ; dcterms:issued "2019-08-15T20:38:20Z"@en, "2019"@en ; vivo:relatedDegree "Doctor of Philosophy - PhD"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description """The chemical characterization of biological samples and that of environmental samples are areas of research that involve the analysis of highly complex chemical mixtures. While the samples from these two fields differ greatly in composition, they present similar challenges. Complex mixtures challenge the analytical chemist because compounds in the mixture can cause matrix effects that interfere with the analysis. Indeed, these interfering compounds may even be analytes themselves. High resolution mass spectrometry, which separates and detects ions based on their mass-to-charge ratio, is a powerful tool in the analysis of such mixtures. The amount of data resulting from such analyses, however, can be intractable to manual analysis, necessitating the use of computational tools. Furthermore, for the data to be reliable, the performance of the mass spectrometer must be optimal and consistent, but the complexity of the data again makes manual assessment of its quality difficult. Thus, there is a need for computational assistance in analysis as well as in method optimization and quality control. In Chapter 2, we present a review of considerations toward the design of a standard mass spectrometry-based method for the quantification of naphthenic acids. The study provides recommendations for how these considerations can be addressed. In Chapter 3, we describe a computational method of resolving dicarboxylic acids in high resolution mass spectrometry data of mixtures of derivatized naphthenic acid fraction compounds.
The study is a proof of concept and demonstrates that derivatization-based analysis of these diacid components is feasible but requires further investigation. In Chapters 4 and 5, we present two computational tools that assist in method optimization and quality control of Thermo Orbitrap mass spectrometer systems. Chapter 4 presents RawQuant, a software tool that extracts scan quantification and metadata from data-dependent analysis data files from Orbitrap mass spectrometer systems. The tool is designed to guide the user in method optimization. Chapter 5 presents RawTools, which builds upon RawQuant by adding the ability to track important measures of mass spectrometer performance longitudinally across a multi-run experiment. The tool is demonstrated using a 140-file dataset and provides easy visual monitoring of instrument performance."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/71302?expand=metadata"@en ; skos:note "PARSING AND ANALYSIS OF MASS SPECTROMETRY DATA OF COMPLEX BIOLOGICAL AND ENVIRONMENTAL MIXTURES
by Kevin Kovalchik
B.S., Oregon State University, 2014
B.M., The University of Idaho, 2007
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (CHEMISTRY)
THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
August 2019
© Kevin Kovalchik, 2019
The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled: PARSING AND ANALYSIS OF MASS SPECTROMETRY DATA OF COMPLEX BIOLOGICAL AND ENVIRONMENTAL MIXTURES submitted by Kevin A Kovalchik in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Chemistry
Examining Committee:
David DY Chen, Co-supervisor
John V Headley, Co-supervisor
Roman Krems, Supervisory Committee Member
Ed Grant, University Examiner
Keng Chou, University Examiner
Abstract
The chemical characterization of biological samples and that of environmental samples are areas of research that involve the analysis of highly complex chemical mixtures. While the samples from these two fields differ greatly in composition, they present similar challenges. Complex mixtures challenge the analytical chemist because compounds in the mixture can cause matrix effects that interfere with the analysis. Indeed, these interfering compounds may even be analytes themselves. High resolution mass spectrometry, which separates and detects ions based on their mass-to-charge ratio, is a powerful tool in the analysis of such mixtures. The amount of data resulting from such analyses, however, can be intractable to manual analysis, necessitating the use of computational tools. Furthermore, for the data to be reliable, the performance of the mass spectrometer must be optimal and consistent, but the complexity of the data again makes manual assessment of its quality difficult. Thus, there is a need for computational assistance in analysis as well as in method optimization and quality control. In Chapter 2, we present a review of considerations toward the design of a standard mass spectrometry-based method for the quantification of naphthenic acids. The study provides recommendations for how these considerations can be addressed. In Chapter 3, we describe a computational method of resolving dicarboxylic acids in high resolution mass spectrometry data of mixtures of derivatized naphthenic acid fraction compounds. The study is a proof of concept and demonstrates that derivatization-based analysis of these diacid components is feasible but requires further investigation. In Chapters 4 and 5, we present two computational tools that assist in method optimization and quality control of Thermo Orbitrap mass spectrometer systems.
Chapter 4 presents RawQuant, a software tool that extracts scan quantification and metadata from data-dependent analysis data files from Orbitrap mass spectrometer systems. The tool is designed to guide the user in method optimization. Chapter 5 presents RawTools, which builds upon RawQuant by adding the ability to track important measures of mass spectrometer performance longitudinally across a multi-run experiment. The tool is demonstrated using a 140-file dataset and provides easy visual monitoring of instrument performance.
Lay Summary
Mass spectrometry is a powerful and prevalent tool in the chemical analysis of complex samples. The technology both drives and is driven by an increasing depth of analysis in fields as diverse as petroleum analysis, environmental monitoring, and cancer research. This thesis demonstrates the utility of mass spectrometry and the development of new mass spectrometry data analysis tools in two areas: environmental and biological analysis. Toward environmental analysis, we present a case for the development of a mass spectrometry-based standard analysis method for naphthenic acids and demonstrate how computational analysis of mass spectrometry data can deepen the analysis of such samples. Toward biological analysis, we describe a new software tool for processing and analyzing protein mass spectrometry data and instrument performance, which aids in method development and in quality control of mass spectrometer operation and of the resulting data.
Preface
Except as indicated below, all results presented in this thesis are my own work. My research program was designed by myself and my graduate supervisor. Chapters 2, 3, 4, and 5 have been published. Publication details and author contributions are as follows:
Chapter 2 was published in Frontiers of Chemical Science and Engineering: Kovalchik KA, MacLennan M, Peru K, Headley J, Chen DDY.
Standard method design considerations for semi-quantification of total naphthenic acids in oil sands process affected water by mass spectrometry: A review. Frontiers of Chemical Science and Engineering. 2017;11(3):497-507. KAK carried out the majority of the literature review and wrote the manuscript. MM contributed to the literature review. MM, KP, JH and DC contributed to writing the manuscript.
Chapter 3 was published in Rapid Communications in Mass Spectrometry: Kovalchik KA, MacLennan MS, Peru KM, Ajaero C, McMartin DW, Headley JV, Chen DDY. Characterization of dicarboxylic naphthenic acid fraction compounds utilizing amide derivatization: Proof of concept. Rapid Commun Mass Sp. 2017;31(24):2057-2065. KAK carried out the data analysis and wrote the manuscript. KAK, MSM, KMP, and CA carried out the experimental work. MSM, KMP, CA, DWM, JVH and DDYC contributed to writing the manuscript.
Chapter 4 was published in The Journal of Proteome Research: Kovalchik KA, Moggridge S, Chen DDY, Morin GB, Hughes CS. Parsing and Quantification of Raw Orbitrap Mass Spectrometer Data Using RawQuant. J Proteome Res. 2018;17(6):2237-2247. CSH and KAK conceived the idea, carried out the data analysis, and wrote the manuscript. KAK developed and wrote all code for the computational tool. SM performed data analysis and contributed to writing of the manuscript. DDYC and GBM contributed to writing of the manuscript.
Chapter 5 was published in The Journal of Proteome Research: Kovalchik KA, Colborne S, Spencer SE, Sorensen PH, Chen DDY, Morin GB, Hughes CS. RawTools: Rapid and Dynamic Interrogation of Orbitrap Data Files for Mass Spectrometer System Management. J Proteome Res. 2019;18(2):700-708. KAK and CSH conceived the idea, carried out the data analysis, and wrote the manuscript. KAK developed and wrote all code for the computational tool. SC and SES helped with data acquisition and tool design. PHS, DDYC, and GBM contributed to writing of the manuscript.
Table of Contents
Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
Chapter 1: Introduction
1.1 Naphthenic acids and naphthenic acid fraction components
1.2 Proteomics
1.2.1 Quality of data in mass spectrometry-based proteomics
1.3 Mass Spectrometry
1.3.1 The Orbitrap mass analyzer
1.3.2 Coupling mass spectrometry to liquid chromatography
1.4 Untargeted mass spectrometry methods
1.4.1 Direct injection
1.4.2 Data-dependent acquisition
1.4.3 Isobaric labelling
1.5 Research objectives
1.5.1 Illustrate the requirements of a standard mass spectrometry method for the determination of total naphthenic acids concentration in water samples (Chapter 2)
1.5.2 Demonstrate the potential for using derivatization methods and mass spectrometry to determine the dicarboxylic acid components of an oil sands process affected water sample (Chapter 3)
1.5.3 Demonstrate potential areas of improvement in global proteomics analysis pipelines using a newly developed software tool, RawQuant (Chapter 4)
1.5.4 Demonstrate the utility of a newly developed software tool, RawTools (a new iteration of RawQuant), toward mass spectrometer system management and quality control (Chapter 5)
Chapter 2: Standard method design considerations for semi-quantification of total naphthenic acids in oil sands process affected water by mass spectrometry: A review
2.1 Introduction
2.2 Toward a standard mass spectrometric method for quantifying total naphthenic acids
2.3 Method design considerations for mass spectrometric semi-quantification of total NAs in water
2.3.1 Working definition of total NAs
2.3.2 Extraction of NAs and sample clean-up
2.3.3 Matrix effects and extraction efficiency
2.3.4 Minimum resolving power
2.3.5 Derivatization vs. no derivatization
2.3.6 Polarity and mode of ionization
2.3.7 Choice of calibration standards and use of internal standards
2.3.8 Online or offline fractionation of samples
2.4 Conclusions
2.5 Acknowledgements
Chapter 3: Characterization of dicarboxylic naphthenic acid fraction compounds utilizing amide derivatization: Proof of concept
3.1 Introduction
3.2 Experimental
3.2.1 Chemicals and materials
3.2.2 Sample preparation and derivatization
3.2.3 Mass Spectrometry
3.2.4 Data processing
3.3 Results and Discussion
3.3.1 Derivatization of standard compounds and MS2 of standards and selected ions in OSPW extract
3.3.2 Analysis of spectra and matched formula lists (singly-charged)
3.3.3 Analysis of doubly-charged peaks
3.4 Conclusions
3.5 Acknowledgements
Chapter 4: Parsing and Quantification of Raw Orbitrap Mass Spectrometer Data Using RawQuant
4.1 Introduction
4.2 Experimental Section
4.2.1 Access of deposited data
4.2.2 RawQuant processing
4.2.3 E. coli cell culture, protein isolation, reduction, and alkylation
4.2.4 Protein clean-up with SP3, and protease digestion
4.2.5 Synthetic peptide mix preparation
4.2.6 Tandem mass tag labeling of peptides
4.2.7 Peptide clean-up procedures
4.2.8 Chromatographic separation prior to MS analysis
4.2.9 MS analysis of peptide samples on the Orbitrap Fusion
4.2.10 Mass spectrometry data analysis
4.2.11 General statistical parameters
4.2.12 Data and code availability
4.3 Results and Discussion
4.3.1 RawQuant enables efficient and robust parsing of raw MS files
4.3.2 RawQuant enables efficient and robust quantification from raw MS files
4.3.3 Assessing isobaric tag reporter ion values with RawQuant highlights differences in acquisition method settings and types
4.4 Conclusion
4.5 Supporting Information
4.6 Acknowledgements
4.7 Funding Sources
4.8 Competing Financial Interests
4.9 Author Contributions
Chapter 5: RawTools: Rapid and Dynamic Interrogation of Orbitrap Data Files for Mass Spectrometer System Management
5.1 Introduction
5.2 Experimental Section
5.2.1 RawTools software, documentation, and availability
5.2.2 Cell culture and harvest
5.2.3 Guanidine-based protein isolation, reduction, alkylation, and digestion
5.2.4 Peptide clean-up
5.2.5 Mass spectrometry data acquisition
5.2.6 Mass spectrometry data analysis
5.2.7 General statistical parameters
5.2.8 Data and code availability
5.3 Results and Discussion
5.3.1 RawTools enables efficient parsing and analysis of raw Orbitrap MS files
5.3.2 RawTools facilitates robust tracking of acquisition performance
5.4 Conclusion
5.5 Supporting Information
5.6 Acknowledgements
5.7 Funding Sources
5.8 Competing Financial Interests
5.9 Author Contributions
Chapter 6: Concluding Remarks and Future Work
6.1 Concluding Remarks
6.2 Further Work
6.2.1 Isobaric Labelling of Naphthenic Acid Fraction Compounds
6.2.2 Advanced Method Development Techniques for Mass Spectrometry Proteomics Experiments
References
Appendix: Supplemental information for Chapter 4
A.1 Analysis of Deposited Data
A.2 Supplemental Figures
A.3 Supplemental Tables
List of Tables
Table 2.1 Considerations discussed for the proposal of a standard classical NAs semi-quantification method.
Table 2.2 Summary of target ions for classical NA quantification analysis. Masses represent the anions formed by deprotonation in negative-ion mode ESI.
Table 3.1. Summary of data processing. Correction values yield the expected formula mass of the neutral, underivatized molecule corresponding to the observed ion peak.
Table 3.2. Summary of processed spectra.
List of Figures
Figure 1.1 Representation of the Knight-modified Kingdon trap. Specially shaped outer electrodes surround an inner thin-wire electrode. An oscillating potential applied to the outer electrodes introduces an axial quadrupole term to the electric field.
Figure 1.2 A cutaway view of an Orbitrap mass analyzer. The red line indicates a theoretical stable ion trajectory. As the ion packets oscillate along the z-axis, currents are induced between endplates which are then transformed into the m/z domain via Fast Fourier Transform. [Reprinted from Hu et al. 2005 (ref. 26).]
Figure 1.3 a) Sample preparation and MS analysis workflow utilizing isobaric tags. After harvesting of cells and protein extraction/preparation, proteins are digested and each sample is labeled with a different isobaric tag. The samples are then combined and analyzed all at once by LC-MS/MS. b) Sample MS spectra from an isobaric labelled DDA experiment. Top: Full scan MS1 spectrum. Bottom: MS2 peptide spectrum of selected precursor ion. Insert: region of MS2 scan containing reporter ions. [Reprinted from Rauniyar & Yates 2014 (ref. 33).]
Figure 2.1 Relative total NAFC extraction using selected solvent systems or SPE. Solvent polarity index is given along the x-axis. Areas represent total area under the TIC curves observed by negative ion electrospray ionization (ESI) Orbitrap MS. [Reprinted with permission from Figure 3, Headley et al. 2013 (ref. 8). Copyright (2013) Elsevier.]
Figure 2.2. Distribution of selected components of NAFC. Extraction using selected solvent systems or SPE was carried out prior to analysis by negative-ion ESI Orbitrap MS. [Reprinted with permission from Figure 4, Headley et al.
2013 (ref. 8). Copyright (2013) Elsevier.]
Figure 2.3. Extracted amounts of selected NAFC classes using six solvents. Extractions were carried out at pH values of a) 12.0, b) 8.5, and c) 2.0. [Reprinted with permission from Figure 1, Huang et al. 2016 (ref. 69). Copyright (2016) Elsevier.]
Figure 2.4. Predicted log DOW values with changing pH for 19 classical NAs. Numbers reference the compounds described in Celsie et al. 2016 SI. [Data taken from Celsie et al. 2016 (ref. 68) supplemental information.]
Figure 2.5. Predicted log DOW values with changing temperature for 19 classical NAs. Numbers reference the compounds described in Celsie et al. 2016 SI. [Data taken from Celsie et al. 2016 (ref. 68) supplemental information.]
Figure 2.6. Relative abundance of compounds matching the formula CnH2n+ZOx, summed for n = 8 to 30 and Z = 0 to -12. [Reprinted with permission from Figure 4, Grewer et al. 2010 (ref. 2). Copyright (2010) Elsevier.]
Figure 2.7. Bar plots and contour diagrams showing the number of homologues detected for the formula CnH2n+ZOx for: a) OSPW sample extracted by liquid-liquid extraction and directly injected into FTICR-MS; b) OSPW sample extracted as in (a) and fractionated by UHPLC prior to FTICR-MS; and c) combination of two OSPW samples processed as in (b) to compensate for dilution effects. [Reprinted with permission from Figure 3, Nyakas et al. 2013 (ref. 60). Copyright (2013) American Chemical Society.]
Figure 3.1. a) Derivatization of 3,3-dimethylglutaric acid with major (A) and minor (B) products.
b) Expected fragmentation of a generic derivatized naphthenic acid.
Figure 3.2. Spectrum of derivatized 3,3-dimethylglutaric acid (a) and MS2 product ion spectra of the singly-derivatized (b) and doubly-derivatized (c) pathway (A) products (precursor ions m/z 231.2 and 151.1, respectively).
Figure 3.3. Full-scan and selected MS2 product ion spectra of derivatized OSPW extract. Precursor ions are identified in the figure headings.
Figure 3.4. Mass spectra of ions identified as O2 class by DPos(B) and as NO3 class by DPos(A). Many ion peaks are identified as belonging to both groups, indicating poor characterization of these classes.
Figure 3.5. Raw mass spectra (a) and processed mass spectra (b) of underivatized and derivatized OSPW. Fine-scale view of the processed mass spectra (c).
Figure 3.6. a-d) Relative abundances and peak numbers for selected classes of NAFCs. e) Correlation plot of intensity of peaks observed in both UNeg(1) and DPos(A).
Figure 3.7. O4 class spectrum from DPos(AA), a) before mass cut-off and b) after high molecular mass cut-off of 360 Da.
Figure 3.8. O4 class observed in both DPos(A) and DPos(AA). Monoacids are represented by DPos(A) and diacids by DPos(AA). The y-scale of each spectrum is ...
Figure 4.1 RawQuant has a wide range of built-in utilities and potential functionality. Schematic depicting the analysis pipelines possible with RawQuant.
Parsing of raw Thermo Orbitrap MS files can yield parsed data matrices containing scan and MS operation efficiency data, as well as quantification values for isobaric tags. RawQuant-generated MGF outputs can be used directly in search engines that accept this input (e.g. Mascot), and the results combined with the RawQuant parsed or quantification data based on scan numbers and file names.
Figure 4.2 Analysis of RawQuant data highlights regions of non-uniform identification. Data from an MS2-only, label-free whole proteome analysis were re-processed and the identification results queried (ref. 93). (a) Numbers of MS2 scans obtained in each individual fraction. (b) Numbers of PSMs identified in each fraction. Horizontal red lines indicate the mean value in each plot. For all analyses, n = 46. (c) Identification rate (number of PSMs / number of MS2 spectra) across each individual fraction.
Figure 4.3 Analysis of RawQuant data facilitates extraction of MS acquisition efficiency from raw files. Data from an MS2-only, label-free whole proteome analysis were re-processed and the identification results queried (ref. 93). (a) Boxplot of the topN distribution in each fraction. topN refers to the number of MS2 scans selected after an MS1 event. (b) Scatter plot of the numbers of MS2 events triggered per second (Hz). Data were binned in 1-second intervals, and the number of MS2 scans found is reported for each bin with a value above 0. For each plot, horizontal red lines indicate the mean at the specified n-values. (c) Boxplots depict the distribution of ‘cycle’ times across each individual fraction. For this work, a cycle is defined as the time between two bordering MS1 events. Horizontal lines indicate the peak widths at baseline (6.20 seconds) and half-height (3.64 seconds) as determined previously (ref. 93).
76 Figure 4.4 Ratio compression was observed in a channel-dependent manner in MS2 and MS3 acquisition modes. A mixture of E. coli (TMT126 – 0:1:0:2:4:1:2:4:1:2:4 – TMT131C, whole proteome) and human peptides (TMT126 – 2:0:0:0:0:1:1:1:3:3:3 – TMT131C, synthetic peptides, n = 444) were analyzed using MS2 and SPS-MS3 acquisition methods on an Orbitrap Fusion. Values displayed are for each PSM, where the signal in each channel relative to the total signal across all channels is calculated. The data are transformed by multiplication to display values at the expected mixing ratio, for visualization purposes. (a) Boxplot of PSMs belonging to E. coli peptides obtained from MS2 analysis. (b) Boxplot of PSMs belonging to E. coli peptides obtained from SPS-MS3 analysis. (c) Boxplot of PSMs belonging to synthetic spiked peptides obtained from MS2 analysis. (d) Boxplot of PSMs belonging to synthetic spike peptides obtained from SPS-MS3 analysis. Red lines depict the expected quantification ratios for the channels. ... 78 xix Figure 5.1 RawTools includes a wide range of built-in utilities and potential functionality for raw MS data processing. Schematic depicting the different functional modules of RawTools. The software is divided into ‘parse’ and ‘qc’ processing pipelines that generate overlapping, but individual data sets as indicated. All functionalities displayed work directly with raw MS files derived from Thermo Orbitrap instruments on Windows, Linux, and MacOS computational hardware. ....................................................................................................................................... 88 Figure 5.2 RawTools enables rapid and dynamic analysis of raw data files to illuminate MS performance. A subset (n = 10) of raw files acquired as part of a replicate injection set derived from a HeLa tryptic digest were analyzed with the RawTools parse functionality to generate ‘Metrics’ files. 
The resultant text output from RawTools was investigated to generate insights into: (a) Scan numbers, (b) MS2 scan rates, (c) Numbers of dependent scans triggered per MS1, (d) Duty cycle duration, (e) Chromatographic peak width, and (f) Column peak capacity. Dashed lines on each plot indicate the mean across the 10 replicate injections for the displayed values. ........................................................................................................................... 98 Figure 5.3 RawTools enables simplified detection of errors that occur during MS acquisition. The third injection from the 1 – 10 injection set was examined further using the RawTools parse functionality to generate scan ‘Matrix’ files to determine the cause of the difference in relation to the other replicate samples. The resultant text output from RawTools was used to identify a break in the spray being generated from the nanospray tip, as observed as a gap in (a) MS2 scan acquisition and (b) Intensity of precursors in MS1 scans. Red arrows indicate regions of nanospray instability. ..................................................................................... 99 Figure 5.4 MGF output generated by RawTools is equivalent to standard software tools as measured by identification rates. MGF output generated using RawTools (with and without xx precursor and charge state recalibration) and ProteoWizard were individually searched using X!Tandem as part of SearchCLI and PeptideShakerCLI to generate peptide and protein identification results. The boxplots display the total numbers of (a) Peptide and (b) Protein identifications for the set of subset injections (n = 10) using MGF output from the two separate software tools. Outlier points in the peptide match plots are from injection 3, as discussed in the main manuscript. ......................................................................................................................... 
100 Figure 5.5 RawTools QC analysis facilitates illumination of variation in MS operational performance. The entire set of (n = 140) HeLa replicate injections was analyzed with the diagnostic QC feature of RawTools to generate a single comma separated summary output. The resultant data were used to probe (a) Scan numbers and (b) MS1 intensities to reveal inconsistencies. (c) Selected base peak chromatogram of MS1 intensities demonstrating spray instability. The inset image is a zoom of the selected chromatogram from 5 to 15 minutes. Red arrows on the inset plot indicate areas of electrospray instability. The RawTools data were further examined to highlight instrument performance degradation via (c) MS2 intensities, (d) MS2 injection times, and (e) IdentiPy MS2 spectral identification rates. Dashed red lines indicate the 100th sample injection. .......................................................................................................... 103 Figure 5.6 RawTools enables illumination of electrospray instability events. The entire replicate injection set (n = 140) was examined to calculate the stability of electrospray across acquisitions. Stability is calculated as the number of MS1 events where neighboring scans differ in their summed intensity by more than 10-fold. Red circled points indicate injections 3 and 23 as replicates where problematic performance had been observed as indicated by other metrics and discussed in the main text. ................................................................................................... 104 xxi Figure 5.7 RawTools highlights errors in mass detection in MS2 spectra. The entire replicate injection set (n = 140) was examined using the QC feature of RawTools with IdentiPy. The observed masses of detected peptides were compared to the calculated sequence values on-the-fly by RawTools to determine mass errors. The absolute mass errors for identified peptides are displayed on a parts-per-million scale. 
The red dashed line indicates the 100th injection of the standard sample. .......................................................................................................................... 105 Figure 5.8 RawTools data reveals information from detected peptides that can be used to monitor stability in sample preparation. The entire replicate injection set (n = 140) was examined to calculate the tryptic digestion efficiency and the rate of oxidation of methionine. (a) Scatter plot of enzyme digestion efficiency, defined as the number of peptides with no missed cleavages as a proportion of the total number of peptides. (b) Scatter plot of methionine oxidation frequency, calculated as the number of observed oxidation events as a proportion of the number of available methionine sites. Both data sets are based upon the peptide hits generated by IdentiPy when used as part of the diagnostic QC processing in RawTools (n = 1000 MS2 spectra searched per file). ........................................................................................................................ 
106 xxii List of Abbreviations APPI – atmospheric pressure photoionization CID – collision induced dissociation DCM - dichloromethane DDA – data-dependent acquisition DI – direct injection DIA – data-independent acquisition DMSO – dimethyl sulfoxide ECCC – Environment and Climate Change Canada EDC – N-(3-dimethylaminopropyl)-N’-ethylcarbodiimide ESI – electrospray ionization FFT – Fast Fourier Transform FTICR – Fourier transform ion cyclotron resonance FTIR – Fourier transform infrared spectroscopy GC – gas chromatography HCD – higher energy collisional dissociation HPLC – high performance liquid chromatography LC – liquid chromatography MGF – Mascot generic file MS – mass spectrometry or mass spectrometer NA – naphthenic acid NAFC – naphthenic acid fraction component NHS – N-hydroxysuccinimide xxiii OSPW – oil sands process affected water PSM – peptide spectral match QC – quality control Q-TOF – quadrupole time-of-flight SPE – solid phase extraction SPS – synchronous precursor selection TPP – Trans Proteomic Pipeline UHPLC – ultra high-performance liquid chromatography xxiv Acknowledgements I would like to thank my research supervisor, David Chen for his continual support and interest in whatever I worked on (with or without his prior blessing). I would especially like to thank him for his understanding and support of the needs of my family throughout my graduate studies. Much thanks must also go to my co-supervisor John Headley and collaborator Kerry Peru at Environment and Climate Change Canada for their guidance, insight and practical help on our research together into naphthenic acids. This work sparked in me a strong interest in mass spectrometry data computation, which was instrumental in the conception and completion of the later works presented in this thesis. I would also like to thank Gregg Morin, Christopher Hughes and Sandi Spencer at BC Genome Sciences Centre for their guidance, support, advice and collaboration on the RawTools project. 
Thank you to my group mates in David’s lab: Caitlyn, Matthew, Jessica, Akram, Cheng, Lingyu, Jianhui, Wenqiang, Zi-Ao, Tingting, Adam, and Xander. Finally, thank you to Hilary, Anna and Meredith who have provided a continual balance to my studies. xxv Dedication This thesis is dedicated to my two children, Anna and Meredith, both of whom were born during my graduate studies. 1 Introduction 1.1 Naphthenic acids and naphthenic acid fraction components Naphthenic acids (NAs) and naphthenic acid fraction components (NAFCs) are organic compounds found in oil sands process affected water (OSPW) and to a lesser extent in natural waters in oil sands regions. NAs are defined as the family of monocarboxylic acids with the general chemical formula CnH2n+ZO2, where Z is a negative, even integer representing hydrogen deficiency due to the presence of ring structures. NAs have been identified as principal toxicants in OSPW, and the vast amounts of OSPW stored in tailings ponds in the Athabasca region make accidental releases and seepage of NAs into surrounding waters a subject of concern.1-5 Unfortunately, NA quantification results remain variable between laboratories.6 Because of this there is a growing need for a standard analysis method to allow reproducible and comparable testing, which would in turn allow regulatory bodies and industry to agree upon and monitor release limits of NAs.7,8 This topic is introduced further in Chapter 2:, a review of design considerations for a mass spectrometry-based method for the quantification of total NAs in aqueous solution.
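As a small worked illustration of the general formula CnH2n+ZO2 given above, the following sketch builds the formula string and the neutral monoisotopic mass for a chosen carbon number n and hydrogen deficiency Z. The function name and the atomic mass table are introduced here for illustration only and are not taken from the thesis. Allowing Z = 0 (acyclic acids) is an assumption of this sketch, since the definition above specifies negative even integers.

```python
# Monoisotopic masses (Da) of the most abundant isotopes; C-12 is exact by definition.
MONOISOTOPIC = {'C': 12.0, 'H': 1.00782503207, 'O': 15.99491461956}

def na_formula(n, z):
    # Hypothetical helper: formula string and neutral monoisotopic mass
    # for a classical naphthenic acid C_n H_(2n+Z) O_2.
    # Assumption of this sketch: Z = 0 (no rings) is accepted alongside
    # the negative even integers named in the text.
    if z > 0 or z % 2 != 0:
        raise ValueError('Z must be zero or a negative even integer')
    h = 2 * n + z
    if n < 1 or h < 1:
        raise ValueError('n and Z do not give a valid hydrogen count')
    mass = (n * MONOISOTOPIC['C']
            + h * MONOISOTOPIC['H']
            + 2 * MONOISOTOPIC['O'])
    return 'C{}H{}O2'.format(n, h), mass

# Example: a single-ring (Z = -2) acid with 12 carbons.
formula, mass = na_formula(12, -2)
print(formula, round(mass, 4))
```

Each ring (or double bond equivalent) removes two hydrogens, which is why Z steps in units of -2 for a fixed n.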
While NAs are important compounds found in OSPW, it is well known that there are also NA-like compounds of varying oxidation state which contain N and S heteroatoms and unsaturated and aromatic components.2,9 Naphthenic acid fraction components (NAFCs) are a class of compounds designated to include both NAs and these NA-like compounds; they represent the complex mixture of all acid-extractable organic compounds found in OSPW and natural waters in oil sands regions.10 Dicarboxylic acids (diacids) are a group of NAFCs containing two carboxylic acid groups. Diacids are of interest because they have been shown to be less toxic than their monocarboxylic acid counterparts1 and have demonstrated use in temporal profiling of OSPW settling ponds.11 While diacid NAFCs have been previously analyzed using authentic deuterated standards and gas chromatography-mass spectrometry,11,12 there exists a need for untargeted methods of analysis, which is further addressed in Chapter 3:. 1.2 Proteomics The proteome is defined as the complement of proteins expressed by a genome, cell, tissue or organism. The term was first introduced in the mid-1990s,13 and the use of the term proteomics to describe the large-scale study of protein expression, structure and function, and how these characteristics vary, quickly followed.14,15 Modern proteomics takes many forms, from global quantification to studies of phosphorylation state to analysis of the proteomes of specific cellular compartments or membranes. Mass spectrometry plays an increasingly important role in proteomics analyses.
A typical bottom-up MS-based proteomics experiment workflow is as follows: 1) the sample of interest is harvested, typically either from a cultured cell line or tissue sample; 2) proteins are extracted by cell lysis and subsequent clean up; 3) proteins are digested using trypsin or another enzyme which reproducibly cleaves proteins at known amino acid sites; 4) the resulting peptide mixture undergoes a variety of preparation steps, possibly including isobaric labelling (discussed in section 1.4.3), offline chromatographic separation (fractionation) and desalting; 5) the prepared peptide sample is analyzed utilizing a chromatography-mass spectrometry system (discussed further throughout sections 1.3 and 1.4); 6) the resulting MS data is searched using a database search engine which assigns peptide identifications that can then be linked to quantification information from the same data. The amount of data produced by such experiments is vast and can range from hundreds to tens of thousands of megabytes per sample. 3 1.2.1 Quality of data in mass spectrometry-based proteomics Mass spectrometry plays an increasingly important role in the study of the proteome. This trend, and technological advances in mass spectrometry, have resulted in the generation of an incredible amount of data. Monitoring the quality of data is of paramount importance for on-going studies as well as post-hoc analysis of existing datasets. The number of confident peptide identifications has often been the metric of choice for researchers to assess data quality.16 After all, identifying peptides is a central part of the MS-based proteomics workflow. Unfortunately, while the number of identifications might generally correlate with data quality, it does not provide any diagnostic information that could be of use in determining why, exactly, the quality fails in a given instance.
More recently, increased focus has been placed on monitoring the performance of the MS instrumentation via analysis of standard sample data files.17,18 This type of QC can benefit MS operators because the data generated is directly related to the performance of the MS (e.g. duty cycle, ion trap fill times, ion count/current, etc.). A number of publications have recently described tools for this purpose.19-25 While the majority of these tools are excellent resources when used properly, they suffer from a number of issues, the most exasperating of which is the difficulty or inability to locate the tool itself or one of its dependencies given the information in the published manuscript.16,20,23,24 Furthermore, tools which do not allow for longitudinal tracking of QC metrics (i.e. across multiple experiments) might be useful for assessment of individual data files, but are less useful when assessing the quality of large datasets comprised of many data files.20,21,25 Finally, a number of tools aim to ease interpretation by reporting a single score or otherwise reducing the dimensionality of the metrics.19,21-23 While this does ease interpretation of the overall quality of the data, it hampers the ability to distinguish which aspects of MS performance might be impacting the quality. Easily interpreted and transparently calculated metrics directly relating to instrument operation are preferable when troubleshooting instrument issues and can head off problems before they become so severe as to cause data quality to fail control. Thus, to assist MS operators in evaluating the quality of data during and after acquisition of large-scale datasets, there exists a need for QC tools which provide longitudinal tracking of easily interpretable MS performance metrics. Chapter 4: and Chapter 5: describe the development of a computational tool aimed toward addressing this need.
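To make the idea of longitudinal tracking concrete, the sketch below takes one interpretable metric recorded per run (for example a median MS2 injection time) and flags runs that depart from the recent history of that metric. It is a minimal illustration under stated assumptions, not the approach implemented in the tools described in this thesis: the function name, the five-run window and the 25 % tolerance are all hypothetical choices.

```python
from statistics import median

def flag_drift(values, window=5, tolerance=0.25):
    # Hypothetical helper: return the indices of runs whose metric differs
    # from the median of the preceding `window` runs by more than
    # `tolerance`, expressed as a fraction of that median.
    flagged = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if baseline == 0:
            continue  # relative deviation is undefined for a zero baseline
        if abs(values[i] - baseline) / abs(baseline) > tolerance:
            flagged.append(i)
    return flagged

# Example: one metric value per run, in acquisition order (made-up numbers).
metric_per_run = [10, 10, 11, 10, 9, 10, 20, 10]
print(flag_drift(metric_per_run))
```

Because the metric itself stays visible, an operator can see which aspect of performance changed and when, rather than only that an aggregate score dropped.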
1.3 Mass Spectrometry Mass spectrometry is an analytical technique which separates and detects gas-phase ions based upon their mass-to-charge ratio (m/z). The basic components of a mass spectrometer are 1) a sample inlet to introduce chemical species to the instrument; 2) an ionization source, which may be part of the sample introduction mechanism or may be a separate step, which converts neutral and charged compounds into species of a specific polarity (i.e. positively or negatively charged); 3) a mass analyzer; 4) an ion detector; 5) a data-processing system which converts the information from the ion detector into a mass spectrum. While these components must all be present, they can differ greatly in design and operation. The mass spectrometers discussed in this thesis all utilize the following components, respectively: 1&2) the sample inlet and ionization are both accommodated by electrospray ionization (ESI); 3&4) the mass analyzer and detector are an Orbitrap mass analyzer (discussed in section 1.3.1); 5) the frequency information from the mass analyzer is converted into a mass spectrum via Fast Fourier Transform. 5 1.3.1 The Orbitrap mass analyzer The major part of this thesis deals with Orbitrap mass spectrometers, which are a high-resolution family of mass spectrometers developed and marketed by Thermo Fisher Scientific. An Orbitrap mass spectrometer is based around the Orbitrap mass analyzer, a relatively new mass analyzer invented by Alexander Makarov and first described in a complete mass spectrometry system in 2005.26 The ancestor of the Orbitrap could be considered the Kingdon trap, a much older technology which was described almost a century ago in 1923.27 The Kingdon trap consists of a thin-wire central electrode, an outer coaxial electrode, and two endcap electrodes. The system uses entirely electrostatic interactions to trap gas-phase ions between the inner and outer electrodes.
The Kingdon trap was modified in 1981 by Robert Knight, resulting in a design utilizing a thin-wire inner electrode and outer electrodes shaped to add an axial quadrupole term to the electric field when an appropriate oscillating potential is applied (see Figure 1.1).28 This axial quadrupole term induces harmonic axial oscillation of trapped ions. Figure 1.1 Representation of the Knight-modified Kingdon trap. Specially shaped outer electrodes surround an inner thin-wire electrode. An oscillating potential applied to the outer electrodes introduces an axial quadrupole term to the electric field. 6 The Orbitrap mass analyzer builds upon the design of the Knight-modified Kingdon trap, utilizing a spindle-shaped inner electrode and a specially shaped outer electrode (see Figure 1.2). The shapes of the electrodes result in stable ion trajectories that orbit the inner electrode while simultaneously oscillating axially. These stable trajectories are achieved with entirely static electric potentials. Ions oscillate axially at a frequency determined by their mass-to-charge ratio, and the oscillation of these groups of ions is detected as current between two endplates and subsequently transformed by Fast Fourier Transform (FFT) from the time domain into the m/z domain.26 Figure 1.2 A cutaway view of an Orbitrap mass analyzer. The red line indicates a theoretical stable ion trajectory. As the ion packets oscillate along the z-axis, currents are induced between endplates which are then transformed into the m/z domain via Fast Fourier Transform. [Reprinted from Hu et al. 2005.26] 1.3.2 Coupling mass spectrometry to liquid chromatography Hyphenating liquid chromatography and mass spectrometry (LC-MS) provides an additional dimension of separation over mass spectrometry alone. In liquid chromatography, analytes pass through a packed column and separate based upon differential partitioning between the mobile phase and the packing material.
This results in analytes being temporally separated as they leave the column and enter the mass spectrometer, usually via electrospray ionization. The separation provided by liquid chromatography is orthogonal to that of mass spectrometry, being based upon the partition coefficient of an analyte in a particular system as opposed to its mass and charge. This additional dimension of separation greatly reduces the complexity of the resulting mass spectra, which are subsequently indexed as a function of time. The experimental work involved in Chapter 4: and Chapter 5: utilizes LC-MS. 1.4 Untargeted mass spectrometry methods Mass spectrometry experimental workflows can generally be grouped into two different categories: targeted and untargeted (or global) analysis. A targeted experiment is one in which there are known targets of interest and is often associated with hypothesis testing. Thus, the experiment is designed to yield optimal data for the targets of interest. Examples of targeted acquisition methods are selected ion monitoring, selected and multiple reaction monitoring, and parallel reaction monitoring.29 An untargeted experiment is one in which information on as many analytes as possible is desired, and such global information is often associated with hypothesis generation. The experiment is designed to yield a balance between volume of data and quality of individual data points. Two common untargeted methods which make use of LC-MS are data-independent acquisition (DIA) and data-dependent acquisition (DDA). Another method of untargeted acquisition which does not use online separation is direct injection (DI). The work in this thesis deals exclusively with untargeted acquisition, specifically DI in Chapter 3: and DDA in Chapter 4: and Chapter 5:. The rest of this section will focus on these untargeted acquisition methods.
1.4.1 Direct injection In direct injection, the sample is introduced directly to the ionization source with no on-line separation involved. This results in a single mass profile acquired over many scans, the quality of which can be improved by background subtraction and spectral averaging. A high mass resolving power allows for assignment of molecular formulas to ion peaks, but structural isomers are not resolved. This type of analysis is useful for high-throughput situations where general global information is desired (e.g. petroleomics). 1.4.2 Data-dependent acquisition Data-dependent acquisition (DDA) is a type of computer-automated mass spectrometry workflow which utilizes on-the-fly algorithms to select precursor ions from full-range MS1 scans to undergo further analysis by MS2 and potentially MS3. In an MS3 acquisition method, the resulting MS2 spectrum is analyzed in turn and selected peaks are chosen for MS3.30 Such MS3 DDA methods are typically used in conjunction with isobaric labelling to aid in quantification (described in section 1.4.3). The use of DDA allows for the acquisition of relatively pure MS2 spectra of peptides for peptide identification. The use of DDA in the study of proteins is not a new occurrence,31 but advances in mass spectrometry technology have vastly improved its application. For example, in 2001 one could expect to acquire a full DDA cycle (a single MS1 scan and the subsequent MS2 scans) in 10-15 seconds which would typically yield three to five peptide fragment spectra.31 With scan rates on modern instruments one can expect to acquire hundreds of peptide fragment spectra in the same amount of time,32 allowing for vastly improved proteome coverage. 1.4.3 Isobaric labelling Quantification in mass spectrometry experiments can be absolute or relative.
Absolute quantification aims to determine the exact concentration of an analyte in a sample while relative quantification seeks to determine the relative abundances of an analyte in different samples. Absolute quantification is a desirable goal, but it is impractical in untargeted global studies because of the need to build calibration curves for the analytes to be quantified. Relative quantification does not rely on the construction of calibration curves, and thus provides a practical approach in untargeted global studies. Two common approaches used in relative quantification of proteins in biological samples are chemical labeling and metabolic labeling. Chemical labeling typically involves the differential labeling and multiplexing of peptide samples after protein extraction and digestion has occurred. Metabolic labeling (e.g. SILAC) can be achieved by growing cell cultures in isotopically labeled growth media. The harvested cells from such experiments then contain peptides with “built-in” isotopic labeling for multiplexing purposes. Metabolic labeling is advantageous because there are fewer post-harvest processing steps involved before the samples can be multiplexed. It is risky, however, because the expense of the labeling comes upfront in the growth medium and samples which turn out to be non-viable for analysis are an expensive waste. Isobaric labeling is advantageous because only viable samples need to be labeled. Furthermore, chemical labeling allows for much greater multiplexing than metabolic labeling, up to 11-plex. Isobaric labeling (e.g. TMT, iTRAQ) is a relative quantification strategy using chemical labeling which is commonly employed in quantitative proteomics, in which different peptide samples are labeled with amine-reactive reagents, mixed together, and then subjected to an MS2 or MS3 based analysis.
The respective reagents for the different samples are all isobaric and have the same chemical structure but have different distributions of isotopes within the structure (i.e. they are isotopic isomers). Thus, the differentially labeled peptides are indistinguishable in an MS1 scan. The labels (or tags), however, are structured such that they reproducibly fragment during collision induced dissociation (CID) to yield reporter ions of a unique mass for each tag (Figure 1.3).33 This strategy facilitates the multiplexed relative quantification of multiple peptide samples in a single MS run. 10 Figure 1.3 a) Sample preparation and MS analysis workflow utilizing isobaric tags. After harvesting of cells and protein extraction/preparation, proteins are digested and each sample is labeled with a different isobaric tag. The samples are then combined and analyzed all at once by LC-MS/MS. b) Sample MS spectra from an isobaric labelled DDA experiment. Top: Full scan MS1 spectrum. Bottom: MS2 peptide spectrum of selected precursor ion. Insert: region of MS2 scan containing reporter ions. [Reprinted from Rauniyar & Yates 2014.33] Isobaric labeling is most commonly used in conjunction with DDA which facilitates the complicated MS acquisition workflow required to apply the strategy to untargeted profiling. A common issue with isobaric labelling and DDA is the co-isolation of multiple peptides in the MS2 isolation step. The presence of interfering species during the MS2 step hampers the accuracy of the reporter ion ratios,34 an occurrence which is now commonly termed ratio compression or ratio distortion. 
Aside from filtering quantification results based on the level of interference, one of the most common methods used to reduce ratio compression utilizes MS3 in the DDA workflow to isolate and fragment the most intense peptide fragments, which results in reduced interference by lower abundance peptides (SPS-MS3).30,35 Other methods exist, including the use of novel labeling reagents36 or custom methods and data processing utilizing existing labeling reagents.37,38 1.5 Research objectives 1.5.1 Illustrate the requirements of a standard mass spectrometry method for the determination of total naphthenic acids concentration in water samples (Chapter 2:) While numerous methods have been described for the quantification of naphthenic acids (NAs) in oil sands process affected water (OSPW) and natural waters,4,9,10,39 reproducibility and inter-laboratory comparability remain elusive.6 We wished to provide an outline for the creation of a standard method for the quantification of NAs in water samples.
Toward this goal, we compiled a list of design considerations for a standard method to quantify NAs, and through a thorough review of the literature we provided recommendations as to how these considerations could be fulfilled.40 1.5.2 Demonstrate the potential for using derivatization methods and mass spectrometry to determine the di-carboxylic acid components of an oil sands processing affected water sample (Chapter 3:) Naphthenic acid fraction compounds (NAFCs) represent the acid-extractable organic compounds found in OSPW or natural waters in oil sands regions.10 They are termed as such because they contain NAs, which have been identified as a major toxicant in OSPW and are of concern for toxicological and regulatory reasons.41-43 Dicarboxylic acids are a specific subset of NAFCs which have been identified as less toxic than their NA counterparts,1 and have also proven useful in temporal profiling of OSPW and contaminated water sources.11 At the time of our research, the only demonstrated methods for characterizing dicarboxylic NAFCs involved the use of isotopic standard compounds.11,12 We wished to demonstrate the use of a derivatization method to allow for untargeted identification of dicarboxylic acids. We used a recently described method of converting NAFCs to amide compounds, the labeling mass of which allowed us to discriminate a subset of dicarboxylic acids from monocarboxylic acids using custom data analysis scripts.44 1.5.3 Demonstrate potential areas of improvement in global proteomics analysis pipelines using a newly developed software tool, RawQuant (Chapter 4:) Ensuring the quality of data obtained in proteomics experiments utilizing data-dependent acquisition mass spectrometry entails the careful optimization of a number of instrument parameters. The ultimate goal of such an optimization is to maximize the number and confidence of peptide identifications.
Commonly this involves testing many sets of parameters, each of which involves running the entire MS experiment and assigning identities to the resulting MS2 peptide fragment spectra using a database search engine (e.g. Mascot,45 Andromeda,46 X! Tandem,47 Sequest,48 PEAKS,49 etc.). It is a costly process and made more difficult by the fact that multiple parameters must be optimized based upon very limited information. This is analogous in some respects to an optimization algorithm in which each evaluation of the objective function costs a day’s worth of MS operation. To truly search the parameter space would require months of non-stop instrument time, which is why researchers rely heavily upon previous experience and intuition to select only a few sets of parameters, and test those. While peptide identifications and quantifications are the ultimate goal of most global proteome analyses, we felt that there was a lot of unutilized information in the mass spectrometry data that could be useful in informing these optimization experiments. Metrics such as ion accumulation times, duty cycle, chromatogram peak shapes and ion intensities can all be calculated directly from the raw mass spectrometry data without performing a database search and are all directly relevant to instrument parameterization. To demonstrate how such metrics could guide method optimization we developed RawQuant, a computational tool written in the Python programming language.50 RawQuant parses raw (unconverted) Orbitrap mass spectrometry data files to yield both detailed and summary information such as that described above.
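Two of the search-free metrics mentioned above can be sketched from nothing more than an ordered list of scan events. The example below follows the definitions given in the caption of Figure 4.3: a cycle is the time between two bordering MS1 events, and topN is the number of MS2 scans selected after an MS1 event. It is an illustrative sketch only and does not reproduce RawQuant. The input layout (a list of tuples holding MS order and scan time in seconds) and the function name are assumptions made for this example.

```python
def cycle_metrics(scans):
    # Hypothetical helper. `scans` is a list of (ms_order, time_s) tuples
    # in acquisition order. Returns two lists with one entry per complete
    # cycle, i.e. per pair of bordering MS1 scans:
    #   cycle_times: seconds between the two MS1 scans
    #   top_n:       number of MS2 scans acquired between them
    cycle_times = []
    top_n = []
    last_ms1_time = None
    ms2_count = 0
    for ms_order, time_s in scans:
        if ms_order == 1:
            if last_ms1_time is not None:
                cycle_times.append(time_s - last_ms1_time)
                top_n.append(ms2_count)
            last_ms1_time = time_s
            ms2_count = 0
        elif ms_order == 2 and last_ms1_time is not None:
            ms2_count += 1
    return cycle_times, top_n

# Example with made-up scan times.
scans = [(1, 0.0), (2, 0.5), (2, 1.0), (1, 1.5), (2, 2.0), (1, 2.5)]
print(cycle_metrics(scans))
```

MS2 scans after the final MS1 scan are ignored because that cycle has no closing MS1 event. Comparing the resulting cycle times with chromatographic peak widths indicates how many times each eluting peak is sampled.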
1.5.4 Demonstrate the utility of a newly developed software tool, RawTools (a new iteration of RawQuant), toward mass spectrometer system management and quality control (Chapter 5:) Beyond the careful selection of instrument parameters, ensuring the quality of mass spectrometry proteomics data involves quality control (QC) analysis of the instrument performance and resulting identifications in a longitudinal manner. To help ease this process for Thermo MS instruments we developed RawTools, which is a second iteration of the RawQuant tool written in the C# programming language and designed to enhance performance and add methods for QC analysis.51 RawTools processes unconverted Thermo Orbitrap .raw MS files and calculates easily interpretable metrics related to instrument performance as well as peptide identification rate, labeling efficiency, and protein digestion efficiency. These metrics are then automatically added to a database, allowing users to track the QC metrics longitudinally. RawTools is not an automated QC program in the sense that it still requires human interpretation of the data. However, we feel that the use of such a tool by an experienced MS operator is of more benefit than a tool which outputs only a single overall quality metric. A single quality metric only informs you that something is wrong, while the descriptive measures of quality provided by RawTools can potentially inform one of the actual causes of loss of quality.
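As an illustration of one such interpretable metric, the sketch below computes digestion efficiency as it is defined in the caption of Figure 5.8: the number of peptides with no missed cleavages as a proportion of the total number of peptides. This is not the RawTools code. The function names are hypothetical, and the cleavage rule used here (cleavage after K or R, except when the next residue is P) is a common convention for trypsin that is assumed for this example.

```python
def missed_cleavages(peptide):
    # Hypothetical helper: count internal tryptic sites left uncleaved.
    # Assumed rule: trypsin cleaves after K or R unless the next residue
    # is P. The C-terminal residue is not counted as a missed site.
    count = 0
    for i in range(len(peptide) - 1):
        if peptide[i] in 'KR' and peptide[i + 1] != 'P':
            count += 1
    return count

def digestion_efficiency(peptides):
    # Fraction of identified peptides with zero missed cleavages.
    if not peptides:
        raise ValueError('no peptides supplied')
    fully_cleaved = sum(1 for p in peptides if missed_cleavages(p) == 0)
    return fully_cleaved / len(peptides)

# Example with made-up peptide sequences.
peptides = ['PEPTIDEK', 'AAKPEPR', 'AKAAR', 'LLLR']
print(digestion_efficiency(peptides))
```

Tracked across many injections of a standard sample, a falling value points toward sample preparation rather than instrument operation as the source of a quality problem.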
Chapter 2: Standard method design considerations for semi-quantification of total naphthenic acids in oil sands process affected water by mass spectrometry: A review
2.1 Introduction
Naphthenic acids (NAs) are among the many organic acids present in crude oil and bitumen, and are classically defined as the family of aliphatic or alicyclic monocarboxylic acids with the general chemical formula CnH2n+ZO2, where Z is zero or a negative, even integer representing hydrogen deficiency in the molecule due to the presence of ring structures. NAs can be found in high concentrations in process water produced in the industrial extraction of bitumen. The extraction process involves mixing oil sands bitumen with caustic hot water, which mobilizes the bitumen off the sand and causes naphthenic acids and other molecules to partition into the water phase. This water, termed oil sands process affected water (OSPW), is recycled several times and ultimately stored in large tailings ponds. OSPW contains high levels of classical NAs, but also N and S heteroatomic acids, aromatic acids, and heavily oxygenated acids. This larger class of acids, of which classical NAs are a subset, is often referred to as “naphthenic acid fraction components” (NAFCs) or the “acid extractable fraction”. NAs are among the principal toxicants in OSPW and thus increased attention has been given to monitoring the levels of oil sands acids in the environment.1-5 As more laboratories engage in the analyses of NAs and OSPW components, there is a growing demand to assess the comparability of results from various methods.
This need for interlaboratory comparison was identified earlier as a high priority activity by specialists at an international workshop on analytical strategies for NAs.10 There is a wide variety of methods used for the determination of total NAs in OSPW and environmental samples.4,9,10 Extraction of NAs from aqueous OSPW samples has largely developed along the lines of liquid-liquid extraction and solid-phase extraction. Some recent methods dispense with extraction altogether and rely only on pH adjustment prior to NAs analysis.52,53 Detailed reviews given elsewhere4,9,10,39 cover the pros and cons of existing methods in use by practitioners along with emerging methods for fingerprinting or environmental forensics.10 The methods used for analyses of total NAs include Fourier transform infrared (FTIR)54 and fluorescence spectroscopic techniques,55,56 along with low resolution57 or high resolution2,57-59 mass spectrometry. Many mass spectrometric methods have been developed which utilize direct injection,60 gas chromatography/mass spectrometry (GC/MS),54,61 liquid chromatography/mass spectrometry (LC/MS),59,62-64 and LC/MS/MS,53,65 employing either negative-ion53 or positive-ion65 detection and numerous ionization platforms.4 Likewise, some methods use off-line chromatography60 for sample clean up or preconcentration prior to MS analysis. Some methods analyze the sample without derivatization whereas others utilize derivatization steps.64-66 Instrument calibration methods vary between laboratories and depend on the availability or choice of commercial standards, along with the limited access to actual standards of NAFCs extracted from different sources of OSPW.
Finally, the data analysis for some methods is based on integrated peaks of total naphthenic acid fraction compounds (“NAFCs” or the “acid extractable fraction”)8 while others select extracted ions that correspond to NA congeners.57
2.2 Toward a standard mass spectrometric method for quantifying total naphthenic acids
As more laboratories engage in analyses of NAs and other OSPW components, there is demand to assess the comparability of results from various methods.7 Furthermore, the need for interlaboratory comparison was identified as a high priority activity by specialists at an international workshop on analytical strategies for NAs.10 An evaluation of an interlaboratory study on semi-quantifying total NAs in water was reported for 15 participating laboratories.6 Methods included (number of laboratories given in parentheses): FTIR (3), along with MS methods (12) using either low resolution (9) or high resolution (3) with direct injection (2), GC/MS (3), LC/MS (3), and LC/MS/MS (1) employing either negative-ion (11) or positive-ion detection (1). Four methods utilized derivatization steps and one laboratory used off-line chromatography for sample cleanup and preconcentration prior to MS analyses. Quantitative data analysis in the FTIR methods was based on integrated peaks for total carbonyl group signal, while the mass spectrometry methods utilized specific ions which corresponded to CnH2n+ZO2 congeners. A neat Merichem naphthenic acid (NA) mixture (a gift received from Merichem Chemicals and Refinery Services LLC, Houston, TX) was also provided to the participating laboratories for use as a reference to minimize variability between laboratory standards.6 Despite these measures, variable results were reported. This led to subsequent intralaboratory studies by Environment and Climate Change Canada (ECCC), with 4 participating labs, and steps were taken to better understand and control the factors contributing to variability in measurements.
The activities are on-going and have prompted the establishment of an ECCC taskforce of analytical chemists to improve the measurement of total NAs in environmental samples.
All of this information suggests there is a need for a standard method against which laboratories would be required to demonstrate performance and traceability of a given method. Indeed, this need has been previously identified,7 but at present there is still no standard method. The intent of this review is to discuss the design considerations for such a method for the semi-quantification of classical NAs. A range of currently used methods for the analysis of NAs has been compiled. Additionally, studies on specific aspects of NAFCs analysis are considered (e.g. extraction method, ionization polarity, etc.). While this review will discuss the important features of a method for semi-quantification of classical NAs, it is acknowledged that there will likely be a series of standard methods for analyses of other NAFCs, depending on the end use of the data.
Prior to discussion of the design considerations, the choice of detection method will be discussed. Traditionally, the method for total NA quantification has been Fourier transform infrared (FTIR) spectroscopy, in which quantification is correlated to total carboxylic acid functional groups. FTIR is thus sensitive not only to classical NAs, but also to any compound containing one or more carboxylic acid moieties.4 As such, FTIR is not suitable for the selective analysis of classical NAs. There is consensus among the authors that mass spectrometry is the preferred technique for the measurement of NAs: with sufficient mass resolution, mass spectrometry is able to distinguish the different classes of OSPW organic acids.2 The selection of mass spectrometry as an analysis method will inform much of the following discussion.
2.3 Method design considerations for mass spectrometric semi-quantification of total NAs in water
In this section we will review the following important factors in developing method guidelines for the analysis of total NAs in OSPW:
1. Definition of total NAs
2. Extraction phase, pH, temperature, subsequent cleanup
3. Minimum mass resolving power of instrument
4. Use of derivatization
5. Polarity and mode of ionization
6. Use of surrogate standards for compensation of extraction inefficiency and matrix effects
7. Choice of suitable calibration standard
8. Use of on-line or off-line fractionation of sample
Considerations include the current state-of-the-art, the ease of adoption by diverse laboratories, and the time and effort involved. The discussions are summarized in Table 2.1.
Table 2.1 Considerations discussed for the proposal of a standard classical NAs semi-quantification method.
Factors | Conclusions | References
1. Definition of total NAs | Use the classical definition of NAs | 2
2. Extraction phase, pH, temperature | Liquid-liquid extraction at pH 2 and room temperature with DCM as organic phase, or use ENV+ SPE | 8,67-69
3. Use of surrogate standards | Use isotopically labelled model compounds as surrogate standards | 52
4. Minimum resolving power of instrument | 50,000 at m/z 200, acknowledging that potential interferences contribute to method uncertainty | 2,10,57,60
5. Use of derivatization | Do not utilize derivatization | 62,65,66,70
6. Polarity and mode of ionization | Negative-ion mode ESI | 4,71
7. Suitable calibration standard and internal standard | Use commercially available Merichem NA mixture and at least one isotopically labelled internal standard | 2,52,57
8. Use of on-line or off-line fractionation of sample | Employ on-line chromatography prior to MS detection | 52,60,72,73
2.3.1 Working definition of total NAs
While complete characterization and quantification of all OSPW naphthenic acid fraction components (NAFCs) is a desirable goal, we will restrict our scope by using a working definition of NAs which only includes classical NAs (formula class CnH2n+ZO2) and those only within a specified carbon number and hydrogen deficiency range (n = 6 to 40; Z = 0 to -12). Therefore, “total NAs” refers to the measured quantity of those naphthenic acids within the scope of the working definition. This provides a practical starting point for future quantification methods for total NAFCs and simplifies analysis considerations. Table 2.2 contains a summary of the target NAs for analysis.
Table 2.2 Summary of target ions for classical NA quantification analysis. Masses represent the anions formed by deprotonation in negative-ion mode ESI. Entries are target ion accurate masses (amu); columns are Z-values.
Carbon # | 0 | -2 | -4 | -6 | -8 | -10 | -12
6 | 115.07645 | 113.06080 | 111.04515 | 109.02950 | 107.01385 | 104.99820 |
7 | 129.09210 | 127.07645 | 125.06080 | 123.04515 | 121.02950 | 119.01385 | 116.99820
8 | 143.10775 | 141.09210 | 139.07645 | 137.06080 | 135.04515 | 133.02950 | 131.01385
9 | 157.12340 | 155.10775 | 153.09210 | 151.07645 | 149.06080 | 147.04515 | 145.02950
10 | 171.13905 | 169.12340 | 167.10775 | 165.09210 | 163.07645 | 161.06080 | 159.04515
11 | 185.15470 | 183.13905 | 181.12340 | 179.10775 | 177.09210 | 175.07645 | 173.06080
12 | 199.17035 | 197.15470 | 195.13905 | 193.12340 | 191.10775 | 189.09210 | 187.07645
13 | 213.18600 | 211.17035 | 209.15470 | 207.13905 | 205.12340 | 203.10775 | 201.09210
14 | 227.20165 | 225.18600 | 223.17035 | 221.15470 | 219.13905 | 217.12340 | 215.10775
15 | 241.21730 | 239.20165 | 237.18600 | 235.17035 | 233.15470 | 231.13905 | 229.12340
16 | 255.23295 | 253.21730 | 251.20165 | 249.18600 | 247.17035 | 245.15470 | 243.13905
17 | 269.24860 | 267.23295 | 265.21730 | 263.20165 | 261.18600 | 259.17035 | 257.15470
18 | 283.26425 | 281.24860 | 279.23295 | 277.21730 | 275.20165 | 273.18600 | 271.17035
19 | 297.27990 | 295.26425 | 293.24860 | 291.23295 | 289.21730 | 287.20165 | 285.18600
20 | 311.29555 | 309.27990 | 307.26425 | 305.24860 | 303.23295 | 301.21730 | 299.20165
21 | 325.31120 | 323.29555 | 321.27990 | 319.26425 | 317.24860 | 315.23295 | 313.21730
22 | 339.32685 | 337.31120 | 335.29555 | 333.27990 | 331.26425 | 329.24860 | 327.23295
23 | 353.34250 | 351.32685 | 349.31120 | 347.29555 | 345.27990 | 343.26425 | 341.24860
24 | 367.35815 | 365.34250 | 363.32685 | 361.31120 | 359.29555 | 357.27990 | 355.26425
25 | 381.37380 | 379.35815 | 377.34250 | 375.32685 | 373.31120 | 371.29555 | 369.27990
26 | 395.38945 | 393.37380 | 391.35815 | 389.34250 | 387.32685 | 385.31120 | 383.29555
27 | 409.40510 | 407.38945 | 405.37380 | 403.35815 | 401.34250 | 399.32685 | 397.31120
28 | 423.42075 | 421.40510 | 419.38945 | 417.37380 | 415.35815 | 413.34250 | 411.32685
29 | 437.43640 | 435.42075 | 433.40510 | 431.38945 | 429.37380 | 427.35815 | 425.34250
30 | 451.45205 | 449.43640 | 447.42075 | 445.40510 | 443.38945 | 441.37380 | 439.35815
31 | 465.46770 | 463.45205 | 461.43640 | 459.42075 | 457.40510 | 455.38945 | 453.37380
32 | 479.48335 | 477.46770 | 475.45205 | 473.43640 | 471.42075 | 469.40510 | 467.38945
33 | 493.49900 | 491.48335 | 489.46770 | 487.45205 | 485.43640 | 483.42075 | 481.40510
34 | 507.51465 | 505.49900 | 503.48335 | 501.46770 | 499.45205 | 497.43640 | 495.42075
35 | 521.53030 | 519.51465 | 517.49900 | 515.48335 | 513.46770 | 511.45205 | 509.43640
36 | 535.54595 | 533.53030 | 531.51465 | 529.49900 | 527.48335 | 525.46770 | 523.45205
37 | 549.56160 | 547.54595 | 545.53030 | 543.51465 | 541.49900 | 539.48335 | 537.46770
38 | 563.57725 | 561.56160 | 559.54595 | 557.53030 | 555.51465 | 553.49900 | 551.48335
39 | 577.59290 | 575.57725 | 573.56160 | 571.54595 | 569.53030 | 567.51465 | 565.49900
40 | 591.60855 | 589.59290 | 587.57725 | 585.56160 | 583.54595 | 581.53030 | 579.51465
2.3.2 Extraction of NAs and sample clean-up
When extracting NAs from a sample, the choice of extraction phase, solvent, temperature, and pH can play significant roles in the types and quantities of organic acids obtained.8,67-69 There are examples of methods which do not utilize extraction prior to analysis,52,53,74,75 but for the purposes of this review, the use of extraction will be
considered. Headley et al.8 examined various phases and solvents for extraction of NAFCs from OSPW. For liquid-liquid extraction, dichloromethane (DCM) was observed to have the highest total extraction of NAFCs (Figure 2.1). Hexane was observed to be most selective for classical NAs (Figure 2.2), as was also observed by Huang et al.69 However, the total amount of components extracted by hexane was approximately 2/3 that of DCM. ENV+ solid phase extraction (SPE) performed relatively well for all NAFCs and had total extraction similar to that of liquid-liquid extraction with DCM (Figure 2.2).8 Given this evidence, liquid-liquid extraction with DCM as the organic phase or ENV+ SPE would be good choices for use in a standard quantification method for total NAs. It should be noted that the use of ENV+ SPE would have the additional advantage of aiding in desalting the sample prior to MS analysis, which would benefit quantification.76
Figure 2.1. Relative total NAFC extraction using selected solvent systems or SPE. Solvent polarity index is given along the x-axis. Areas represent total area under the TIC curves observed by negative ion electrospray ionization (ESI) Orbitrap MS. [Reprinted with permission from Figure 3, Headley et al. 2013.8 Copyright (2013) Elsevier.]
Figure 2.2. Distribution of selected components of NAFC. Extraction using selected solvent systems or SPE was carried out prior to analysis by negative-ion ESI Orbitrap MS. [Reprinted with permission from Figure 4, Headley et al. 2013.8 Copyright (2013) Elsevier.]
It is commonly known that pH affects the partitioning of an ionizable molecule between aqueous and organic phases. This was recently reported for NAs.67-69 Huang et al.69 demonstrated relatively high partitioning of O2 NAs into DCM at pH 2.0 and 8.5 and negligible extraction at pH 12.0 (Figure 2.3).
Theoretical data from Celsie et al.68 indicate that the octanol-water partitioning of classical NAs is largely unchanged below a pH of 4 and changes significantly at pH values greater than ~6 (Figure 2.4). This indicates that if liquid-liquid extraction is used, both extraction efficiency and repeatability will benefit from a low extraction pH.
Figure 2.3. Extracted amounts of selected NAFC classes using six solvents. Extractions were carried out at pH values of a) 12.0, b) 8.5, and c) 2.0. [Reprinted with permission from Figure 1, Huang et al. 2016.69 Copyright (2016) Elsevier.]
Figure 2.4. Predicted log DOW values with changing pH for 19 classical NAs. Numbers reference the compounds described in the Celsie et al. 2016 SI. [Data taken from the supplemental information of Celsie et al. 2016.68]
Furthermore, Celsie et al.68 provide calculated data on octanol-water partitioning of organic acids with varying temperature (Figure 2.5). The data show changes in partitioning behavior with changing temperature, but the effects are considerably smaller than those occurring with changing pH and do not indicate a need for stringent temperature regulation. According to this information, room temperature (~20 °C) would be appropriate for carrying out sample extractions.
Figure 2.5. Predicted log DOW values with changing temperature for 19 classical NAs. Numbers reference the compounds described in the Celsie et al. 2016 SI. [Data taken from the supplemental information of Celsie et al. 2016.68]
2.3.3 Matrix effects and extraction efficiency
When sample analysis requires extraction and extensive handling, it is commonplace to spike the original sample with surrogate standards. Surrogate standards allow for an estimate of sample loss and extraction efficiency. For a complex mixture like NAs, surrogate standards are not available for all components.
As a result, extraction and ionization efficiencies can only be approximated using a limited number of surrogate standards.52 However, for design considerations, the addition of surrogate compounds to the sample prior to extraction is recommended to provide a guide to the extraction efficiencies of a given method.
2.3.4 Minimum resolving power
The mass resolving power of a mass spectrometer is known to have a significant impact on the identification and quantification of OSPW organic acids.57,62 In order to resolve classical NAs from other OSPW organic acids, high mass resolving power is necessary.2,10,60 Nominal mass resolution is insufficient. The naphthenic acid anions C24H41O3 and C25H45O2 have the following monoisotopic m/z values (rounded) in negative mode MS: 377.3061 and 377.3425, respectively. To a unit resolution instrument these would appear as a single m/z peak. Based on the definition of mass resolving power, R = M/ΔM, the minimum resolving power required for this pair is 10,367. In a study of OSPW extract utilizing FTICR, Nyakas et al. reported that sulfur-containing compounds made up 23% of all identified compounds.60 At lower resolution, sulfur-containing compounds appear isobaric with O2 compounds. For example, the anions C25H45O2, C25H45S and C24H41OS have the following monoisotopic m/z values: 377.3425, 377.3247 and 377.2883. To resolve the O2 compound from the two S-containing compounds would require minimum mass resolving powers of 21,199 and 6,962, respectively. Because S- and Ox-species are not the only possible interferences, characterization and fingerprinting studies often make use of ultra-high resolution instruments.9,60,70-72 To minimize the possibility of other mass interferences and decreases in resolving power accompanying instrument drift, a minimum resolving power of 50,000 at m/z 200 is suggested for a standard semi-quantification method for total NAs.
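The resolving powers quoted for these ion pairs can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `min_resolving_power` is hypothetical, the m/z values are the rounded ones given in the text, and M is assumed to be the m/z of the O2 target ion.

```python
def min_resolving_power(m_target, m_interferent):
    # R = M / dM, with M taken as the m/z of the target ion (assumption)
    delta = abs(m_target - m_interferent)
    return m_target / delta

# Rounded monoisotopic m/z values quoted in the text (negative-ion mode)
O2_ION = 377.3425  # C25H45O2, the classical NA target ion
interferents = {
    'C24H41O3': 377.3061,
    'C25H45S': 377.3247,
    'C24H41OS': 377.2883,
}

# Minimum resolving power needed to separate the O2 ion from each interferent
required = {
    name: round(min_resolving_power(O2_ION, mz))
    for name, mz in interferents.items()
}

for name, r in required.items():
    print(f'{name}: R >= {r:,}')
```

The closest pair (the O2 ion and C25H45S) sets the most demanding requirement, which is why the suggested minimum is placed well above the largest of the three values.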
Both medium-high-resolution (approximately 50,000) and high-resolution (approximately 100,000 or greater) mass spectrometers satisfy this requirement. Both quadrupole-time-of-flight (Q-TOF) and Orbitrap instruments meet this specification, allowing the method design to be realized by a large number of laboratories.
2.3.5 Derivatization vs. no derivatization
Derivatization of NAs has been carried out in both characterization and quantification efforts.62,65,66,70 A novel semi-quantification method presented by Woudneh et al. exhibits high analytical sensitivity and quantifies specific NA isomer groups as equivalents of a single standard.65 However, the extra step and cost involved in derivatization may detract from adoption of the method by some laboratories. It is noted that the method by Woudneh et al. could potentially play a role in the future quantification of individual NA classes present in calibration standards.
2.3.6 Polarity and mode of ionization
Because derivatization has not been recommended, available ionization techniques are limited to those associated with direct-injection-MS or liquid chromatography-MS. For the characterization of classical NAs, ESI has received much attention due to the production of molecular ions from polar compounds with little fragmentation. Atmospheric pressure photoionization (APPI) has also been demonstrated as a useful technique in analysis of NAFCs, especially analysis of less polar compounds.58,67 Thus APPI may play an important role in the quantification of non-acidic classes of NAFCs. Classical NAs, however, being polar and easily ionizable in solution, are well suited for ESI. Pereira et al.
demonstrated that positive- and negative-mode ESI result in different populations of NAs being measured, and chromatographic separation suggested the O2 species detected in positive-mode ESI were chemically distinct from classical NAs.71 Negative-ion ESI spectra, however, tend to be dominated by Ox species,58 indicating that with sufficient resolving power it is well suited to the analysis of classical NAs. Furthermore, negative-ion mode is already commonly used in analysis of OSPW, which would ease adoption of the standard method. The formation of salt adducts is not normally observed using negative-ion electrospray ionization.
2.3.7 Choice of calibration standards and use of internal standards
The selection of a calibration standard for use in total NAs quantification is not a straightforward decision. Two options for calibration standards are commercial NAs mixtures or NAFCs extracted from OSPW. Commercial mixtures are readily available, but they can differ greatly in composition from NAs in OSPW. Additionally, OSPW NAFCs differ from source to source, and the abundance of compounds other than classical NAs complicates standardization. Grewer et al. showed that commercially available Merichem NA mixtures are dominated by classical NAs, which potentially makes them well suited for a method semi-quantifying O2 species (Figure 2.6).2 Martin et al. compared two high-resolution MS calibration curves for classical NAs, one prepared with Merichem NAs and one prepared with OSPW extract. Both curves exhibited good linearity, but the response of the Merichem NA curve, relative to the internal standard, was approximately three times that of the OSPW extract curve.57 It was suggested this might be due to the fact that OSPW extract contains many compounds other than classical NAs.
Calibration curves made from Merichem NAs and OSPW extract might be similar to each other by gravimetric or FTIR quantification, but the responses will differ considerably when analyzed by a method such as high-resolution MS which can distinguish classical NAs from other compounds in the mixture. While this is an area that requires further investigation, the predominant O2 nature of Merichem NAs observed by Grewer et al. might lend them well to the semi-quantification of classical NAs in OSPW.
Figure 2.6. Relative abundance of compounds matching the formula CnH2n+ZOx, summed for n = 8 to 30 and Z = 0 to -12. [Reprinted with permission from Figure 4, Grewer et al. 2010.2 Copyright (2010) Elsevier.]
The use of at least one internal standard during MS analysis is proposed. The reasoning of Brunswick et al.52 has been adopted, in that the variety of compounds in OSPW is too great to be matched with any reasonable number of internal standards, and that a single standard may be used for monitoring instrument performance. Because of the complexity of the mixture, it is not possible to completely account for the effects of ion suppression. However, these effects can be significantly reduced using sample preparation techniques (i.e. SPE) and/or chromatographic separation.
2.3.8 Online or offline fractionation of samples
Prefractionation of OSPW extract has been shown to enhance characterization and is commonly used in the analysis of OSPW.52,60,72,73 Although a minimum mass resolving power for mass spectrometers has been suggested that would ensure reduced interference with classical NAs, the use of a separation technique could help further resolve potential molecular interferences, reduce matrix effects and aid in sample desalting prior to MS, considerably reducing the effects of ion suppression. On this point, Nyakas et al.
demonstrated that off-line prefractionation of OSPW extract by UHPLC prior to analysis by FTICR-MS resulted in a nearly 200% increase in the number of assigned compounds relative to direct injection (Figure 2.7).60 Off-line fractionation was used by Nyakas et al. only because of software limitations. HPLC is commonly interfaced with the suggested mass spectrometers (Q-TOF or Orbitrap), and software compatibility with on-line detection on these instruments should not be an issue. Choice of column, mobile phases and gradients is not considered here, but it is recognized that they are important factors to be considered in method development.
Figure 2.7. Bar plots and contour diagrams showing the number of homologues detected for the formula CnH2n+ZOx for: a) OSPW sample extracted by liquid-liquid extraction and directly injected into FTICR-MS; b) OSPW sample extracted as in (a) and fractionated by UHPLC prior to FTICR-MS; and c) combination of two OSPW samples processed as in (b) to compensate for dilution effects. [Reprinted with permission from Figure 3, Nyakas et al. 2013.60 Copyright (2013) American Chemical Society.]
2.4 Conclusions
A standard method for the semi-quantification of total naphthenic acids is needed to provide a reference point for the multitude of different methods used in NA quantification. A range of studies on specific aspects of NA and NAFC analysis were compiled, and important details of currently used methods were discussed. Requisite features were suggested for the design of future standard mass spectrometric methods for quantifying total NAs. The design considerations proposed are suitable for use with both Q-TOF and Orbitrap mass spectrometers. This review may also provide a starting point for the development of future standard methods for the quantification of other OSPW NAFCs.
2.5 Acknowledgements
We thank the members of an Environment and Climate Change Canada (ECCC) led taskforce of 10 laboratory experts from government, industry and academia for helpful insights and discussions during April 2016. KAK acknowledges a 4 Year Fellowship from the University of British Columbia. MSM acknowledges a Postgraduate Scholarships-Doctoral from the Natural Sciences and Engineering Research Council of Canada.
Chapter 3: Characterization of dicarboxylic naphthenic acid fraction compounds utilizing amide derivatization: Proof of concept
3.1 Introduction
The extraction of bitumen from oil sands produces significant volumes of wastewater, which is ultimately stored in tailings ponds. Roughly 75-80% of the water used in the extraction process is recycled from these tailings ponds and reused. While this practice reduces the net production of wastewater, the production is still significant, and as of 2009 the volume of tailings was estimated to have reached 720 million cubic meters.77 This water, termed oil sands process affected water (OSPW), contains a complex mixture of organic compounds.2 Naphthenic acid fraction compounds (NAFCs) represent the organic acid compounds found in OSPW.10 NAFCs are termed as such because they include naphthenic acids (NAs), which are of special interest for both toxicological and regulatory reasons.41-43 Naphthenic acids are a class of alkyl-substituted cyclic and aliphatic carboxylic acids defined by the general formula CnH2n+ZO2, where Z is an even, negative integer or zero representing hydrogen deficiency due to ring structures.
While NAs are principal components of the organic mixture found in OSPW, it is well known that there are significant numbers of NA-like compounds containing N and S heteroatoms, unsaturated and aromatic components, and varying oxidation states.2,9 The presence of compounds with two carboxyl groups (diacids) in NAFCs has been previously confirmed.12 NAFCs with multiple carboxylic acid groups have been shown to be significantly less toxic towards D. magna than their monocarboxylic counterparts,1 which makes diacid occurrence of interest in OSPW studies of toxicological and regulatory nature. Furthermore, diacid abundance has been used to profile temporal variation of samples from settling ponds.11 Previous studies have relied on the use of authentic methylated standards for the identification of methylated diacids using two-dimensional gas chromatography hyphenated with time of flight mass spectrometry (GCxGC/TOFMS).11,12 While the use of authentic standards allows for the unambiguous identification of specific diacids, there are currently no standards available for the complete profiling of diacids, or indeed of any class present in OSPW. There are many examples of derivatization or intentional adduct formation of NAFCs in the literature.2,12,61,62,66,70,75,78-82 Most derivatization efforts are directed towards making the compounds in the mixture amenable to gas chromatography, but there are also examples where derivatization aids in quantification or in characterization of specific classes. For gas chromatography, NA-like compounds are typically derivatized to their methyl esters,11,12,61,70,79,80 but tert-butyldimethylsilyl2 and pentafluorobenzyl derivatives81 have also been used. Woudneh et al. reported a method in which derivatized compounds yielded a single, common product ion in MS2, benefiting quantification.78 MacLennan et al. utilized a modified version of the Woudneh et al.
method to demonstrate the first use of capillary electrophoresis hyphenated to (+)ESI-MS for the analysis of NAs.66 Duncan et al. described the (+)ESI-MS detection of barium ion adducts with NA-like compounds.75 Wang et al. utilized dansyl chloride derivatization of hydroxyl groups to characterize oxy-classes of NA-like compounds.82 To the best of our knowledge, however, the use of derivatization as a direct means to investigate the occurrence of diacids in NAFCs has not been reported. Though it was not previously discussed, the derivatization method recently reported by MacLennan et al.66 showed promise for the profiling of diacid classes in OSPW. The derivatization is similar to that described by Woudneh et al.78 In the Woudneh et al. method, the carboxylic acid group is reacted with N-(3-dimethylaminopropyl)-N’-ethylcarbodiimide (EDC) to yield an O-acylisourea which undergoes rearrangement to yield a stable N-acylurea.78 In that study, the common product ion of the NA-EDC derivatives in MS2 was used for the quantification of NAs in equivalents of a derivatized standard. MacLennan et al. proposed the addition of a second step to the derivatization, in which the NA-EDC O-acylisourea, before it rearranges to the N-acylurea, is reacted with a primary-tertiary diamine (e.g. N',N'-dimethylethane-1,2-diamine) to yield naphthenic amides, though the N-acylurea rearrangement product is still expected as a minor product.66 Additionally, the NA-EDC derivative is formed in the presence of N-hydroxysuccinimide (NHS), a combination which is often used in peptide chemistry.
The rationale behind the modification from the Woudneh method was that the two-step reaction provides additional specificity for carboxylic acids, while the single-step reaction utilizing EDC alone is known to cause reaction at a range of functional groups and combinations of functional groups.83 It is reasonable to expect that a derivatized NAFC will have two derivatization labels only if it is a diacid, which may aid in the characterization of diacid classes. MacLennan et al. utilized capillary electrophoresis hyphenated to low-resolution TOFMS, and as such the amine-derivatization has not previously been investigated by high-resolution MS. The reaction for the derivatization of 3,3-dimethylglutaric acid (a standard used in the presented study) along with the expected fragmentation are shown in Figure 3.1. (A) in Figure 3.1 represents the product when the O-acylisourea intermediate reacts with N',N'-dimethylethane-1,2-diamine, and (B) represents the product when the O-acylisourea intermediate rearranges to form a stable N-acylurea. These will be referred to as derivatization pathways (A) and (B) throughout. The net changes in mass upon derivatization by pathways (A) and (B) are 70.08948 and 155.14225 Da, respectively.
Figure 3.1. a) Derivatization of 3,3-dimethylglutaric acid with major (A) and minor (B) products. b) Expected fragmentation of a generic derivatized naphthenic acid.
Herein is presented the use of high-resolution MS for the characterization of amide-derivatized OSPW extract, including proof-of-concept characterization of the O4 diacid class. The O4 diacid class is defined as the class of NAFCs containing two carboxylic acid functional groups, with no additional heteroatoms (e.g. no N or S in the molecular formula). Assuming the overall structure of NAFCs has minimal impact on derivatization efficiency of carboxylic acids, it is hypothesized that the profile of NA-amide derivatives will be similar to that of underivatized OSPW analyzed in (-)ESI.
This was previously observed by Woudneh et al. for NA isomer classes in NA-EDC derivatives.78 Furthermore, it is hypothesized that, in addition to charge state, the presence of multiple derivatization labels can be used for the identification of likely diacids and diacid class profiling.

3.2 Experimental

3.2.1 Chemicals and materials

OSPW was provided by the Environment Canada National Hydrology Research Centre (Saskatoon, SK, Canada). The source of the OSPW was a study on the wetland microbial degradation of NAFCs.84 The sample used was an influent sample, and as such would not yet have experienced biodegradation as a result of the study. N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide, N-hydroxysuccinimide, and N',N'-dimethylethane-1,2-diamine were purchased from Sigma Aldrich (St. Louis, MO, USA). Phenomenex Strata-XAW solid-phase extraction cartridges (200 mg) were purchased from Phenomenex (Torrance, CA, USA). 3,3-Dimethylglutaric acid was purchased from Sigma Aldrich (St. Louis, MO, USA). All solvents used were of the highest purity available.

3.2.2 Sample preparation and derivatization

Samples were extracted from OSPW using weak anion exchange solid-phase extraction (WAX-SPE) as previously described.84 For underivatized samples the eluent was dried under nitrogen gas and reconstituted in 1 mL 50:50 acetonitrile:water with 0.1% NH4OH. For the derivatized sample the eluent was dried under nitrogen gas and derivatized in dichloromethane (DCM) as follows. To the dried residue 10 µL each of 10 mM EDC, 10 mM NHS and 200 mM N,N-DMEDA were added. 500 µL DCM was then added and the sample vortexed and left to sit overnight. The derivatized solution was dried under nitrogen gas and reconstituted in 1 mL 50:50 acetonitrile:water. For the standard sample, approximately 100 µL of 10 mM deoxycholic acid in DCM was derivatized as above, and diluted 10-fold in 30% methanol.

3.2.3 Mass Spectrometry

An Orbitrap Elite mass spectrometer (Thermo Fisher Scientific, San Jose, CA, USA) was used for all OSPW experiments. Samples were introduced by direct injection-electrospray ionization with an injection volume of 5.0 µL. Ionization was carried out in either positive- or negative-ion mode, depending upon the sample being analyzed. Resolving power was 250,000 at m/z 300 and mass accuracy was set at 4 ppm. For MS2 experiments HCD activation was used with a collision energy of 25 eV and an isolation width of 1.0 m/z. An AB SCIEX API 4000 triple-quadrupole mass spectrometer (Applied Biosystems/MDS Sciex, Concord, Ontario) was used for analysis of the dicarboxylic acid standard. Samples were introduced by direct infusion, and ionization was carried out in positive-ion mode electrospray ionization. For MS2 experiments, the collisional energy was set to 35 eV.

3.2.4 Data processing

All data processing and analysis was performed using MSFileReader (Thermo Fisher Scientific, San Jose, CA, USA) and the Python programming language v3.6 (Python Software Foundation, Delaware, USA). For analysis of MS2 data of derivatized 3,3-dimethylglutaric acid and derivatized OSPW extract, the raw data was interpreted as-is, with no processing performed beyond background subtraction. For processing of derivatized and underivatized OSPW extract MS1 data, the following procedure was followed. A formula list was generated using the following formula parameters: C: 6-30; O: 2-6; N: 0-1; S: 0-1; Z: -12 to 0; and H: procedurally generated based on the formula 2n+Z+N, where n is the number of carbons, N is the number of nitrogen atoms, and Z is confined to even, negative integers or zero. These parameters were chosen due to the interest in NA-like compounds. Peaks were selected from background-subtracted mass spectra using a low-intensity cut-off of 100 cps for all spectra.
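The formula-list generation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `generate_formulas`, the tuple layout, and the choice to skip combinations with no hydrogens are assumptions made here. The element ranges and the hydrogen rule H = 2n+Z+N are taken from the text.

```python
from itertools import product

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {'C': 12.0, 'H': 1.00782503207, 'N': 14.0030740048,
                'O': 15.99491461956, 'S': 31.97207100}


def generate_formulas(c_range=(6, 30), o_range=(2, 6), n_range=(0, 1),
                      s_range=(0, 1), z_range=(-12, 0)):
    # Hypothetical helper: returns (mass, element counts, Z) tuples sorted by mass.
    carbons = range(c_range[0], c_range[1] + 1)
    oxygens = range(o_range[0], o_range[1] + 1)
    nitrogens = range(n_range[0], n_range[1] + 1)
    sulfurs = range(s_range[0], s_range[1] + 1)
    # Z is confined to even, negative integers or zero.
    z_values = range(z_range[0], z_range[1] + 1, 2)

    formulas = []
    for c, o, n, s, z in product(carbons, oxygens, nitrogens, sulfurs, z_values):
        h = 2 * c + z + n  # hydrogen count from H = 2n + Z + N
        if h < 1:
            continue  # assumption: drop combinations with no hydrogens
        counts = {'C': c, 'H': h, 'N': n, 'O': o, 'S': s}
        mass = sum(MONOISOTOPIC[el] * k for el, k in counts.items())
        formulas.append((mass, counts, z))
    formulas.sort(key=lambda item: item[0])
    return formulas


formulas = generate_formulas()
```

With these ranges the list is small enough to hold in memory and to search by bisection, which keeps the later peak-matching step fast.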
The selected ion peaks were matched to molecular formulae from the generated formula list to within 2 ppm of m/z value by equation (1),

$$\frac{m}{z} = \frac{\text{formula mass} - \text{correction}}{|\text{charge state}|} \qquad (1)$$

where the charge state and correction values for the different processing methods are described in Table 3.1; equivalently, the formula mass is recovered as |charge state| × m/z + correction. When multiple formulae were possible, the closest match was selected. To simplify processing, it was assumed that the only forms of ionization were protonation and deprotonation and that all ions were singly charged or doubly charged. It should be noted the results of the data processing have units of mass and are not m/z values. As such, the x-axes of processed mass spectra are explicitly labeled “molecular mass” and have units of Da rather than m/z. Whether a mass spectrum represents raw MS data or processed data will also be noted in the text. The processing methods and the data resulting from the respective methods will be referred to in the text as UNeg(1), DPos(A), DPos(B), and DPos(AA), as summarized in Table 3.1.

Table 3.1. Summary of data processing. Correction values yield the expected formula mass of the neutral, underivatized molecule corresponding to the observed ion peak.
Processing method | Charge state | Correction | Description
UNeg(1) | -1 | 1.0073 | Underivatized (-)ESI, processing for singly charged ions
DPos(A) | +1 | -71.0968 | Derivatized (+)ESI, processing for singly charged ions from pathway (A)
DPos(B) | +1 | -156.1495 | Derivatized (+)ESI, processing for singly charged ions from pathway (B)
DPos(AA) | +2 | -142.1935 | Derivatized (+)ESI, processing for doubly charged ions, doubly derivatized via pathway (A)

3.3 Results and Discussion

3.3.1 Derivatization of standard compounds and MS2 of standards and selected ions in OSPW extract

Full-scan MS1 and MS2 mass spectra of derivatized 3,3-dimethylglutaric acid are presented in Figure 3.2.
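As a rough illustration of the matching step, the sketch below converts an observed m/z to a neutral formula mass using the Table 3.1 values and then picks the closest candidate within 2 ppm. It takes the Table 3.1 caption at face value, so the correction is added to |charge state| × m/z. The function names, the two-entry candidate list, and the use of bisection are illustrative assumptions, not the procedure actually implemented in the study.

```python
from bisect import bisect_left

# (|charge state|, correction in Da) for each processing method, from Table 3.1.
CORRECTIONS = {
    'UNeg(1)': (1, 1.0073),
    'DPos(A)': (1, -71.0968),
    'DPos(B)': (1, -156.1495),
    'DPos(AA)': (2, -142.1935),
}


def neutral_mass(mz, method):
    # Formula mass of the neutral, underivatized molecule for an observed peak.
    charge, correction = CORRECTIONS[method]
    return charge * mz + correction


def match_peak(mz, method, candidates, tol_ppm=2.0):
    # candidates: list of (formula mass, label) sorted by mass (hypothetical layout).
    # Returns (label, ppm error) of the closest match within tolerance, else None.
    charge, correction = CORRECTIONS[method]
    target = neutral_mass(mz, method)
    masses = [mass for mass, _ in candidates]
    i = bisect_left(masses, target)
    best = None
    for j in (i - 1, i):  # only the two neighbours can be the closest match
        if 0 <= j < len(candidates):
            mass, label = candidates[j]
            theoretical_mz = (mass - correction) / charge
            ppm = (mz - theoretical_mz) / theoretical_mz * 1e6
            if abs(ppm) <= tol_ppm and (best is None or abs(ppm) < abs(best[1])):
                best = (label, ppm)
    return best


# Two illustrative O4 candidates (monoisotopic masses in Da).
candidates = [(160.07356, 'C7H12O4'), (174.08921, 'C8H14O4')]
single = match_peak(231.1704, 'DPos(A)', candidates)
double = match_peak(151.1335, 'DPos(AA)', candidates)
```

Here both the singly derivatized ion near m/z 231.17 and the doubly derivatized, doubly charged ion near m/z 151.13 resolve to the same neutral formula, C7H12O4, which is the formula of 3,3-dimethylglutaric acid.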
In the full-scan of the derivatized model compound (Figure 3.2a), the expected peak for the singly-derivatized product of pathway (A) is clearly visible (m/z 231.2). The expected peak at m/z 316.3 for the product of pathway (B) is not present. Assuming the ionization efficiencies of the two products are equal, this indicates most of the standard was derivatized via pathway (A). The underivatized compound is also seen at m/z 161.2. In the MS2 spectrum of the singly-derivatized peak (Figure 3.2b), the expected neutral loss of 45.0 is observed. The doubly-derivatized compound with a charge state of 2 would be expected to appear at m/z 151.1 (marked in Figure 3.2a). This peak is not seen in any significant abundance. However, in an MS2 spectrum of m/z 151.1, a fragmentation pattern consistent with the doubly-derivatized compound can be seen (Figure 3.2c). Assuming the precursor is doubly-charged, the transition from m/z 151.1 to m/z 128.6 represents a neutral loss of 45.0. A further neutral loss of 45.0 is seen at m/z 106.1. This second neutral loss would be expected from the presence of a second derivatized carboxylic acid. The peak at m/z 106.1 is low in intensity, but the corresponding singly-charged peak is strongly seen at m/z 211.2. The relative abundances of the singly- and doubly-derivatized compounds suggest the second derivatization is inefficient (Figure 3.2a). Nevertheless, the second derivatization does appear to take place.

Figure 3.2. Spectrum of derivatized 3,3-dimethylglutaric acid (a) and MS2 product ion spectra of the singly-derivatized (b) and doubly-derivatized (c) pathway (A) products (precursor ions m/z 231.2 and 151.1, respectively).

A full-scan MS1 spectrum of derivatized OSPW extract and MS2 spectra of selected ions are presented in Figure 3.3. The product ions in both Figure 3.3b and Figure 3.3c exhibit neutral loss of 45 and no visible signals at m/z 129 or 174, consistent with derivatization pathway (A).
This confirms the derivatization of the OSPW extract. The MS2 spectrum in Figure 3.3d shows product ions of m/z 129 and 174 as well as a neutral loss of 45, which is consistent with derivatization pathway (B). This suggests that, while it was not observed in the standard compound derivatization, pathway (B) does occur to some extent.

Figure 3.3. Full-scan and selected MS2 product ion spectra of derivatized OSPW extract. Precursor ions identified in figure headings.

It is notable that pathway (B) products might be isobaric with certain pathway (A) products. A molecule which contains an additional O and an N as heteroatoms, and is derivatized by pathway (A), will simply need an appropriate number of carbons and hydrogens to match its molecular formula up with that of a pathway (B) product. This is illustrated in Figure 3.4, which compares ions identified as belonging to the NO3 class by DPos(A) data processing with those identified as belonging to the O2 class by DPos(B) data processing. The NO3 class comprises all NAFCs containing exactly one N atom and three O atoms, while the O2 class represents all NAFCs containing exactly two O atoms, but no N atoms. The two spectra are remarkably similar, and most of the ions have been identified as belonging to different classes by the two data processing methods. This indicates the characterization of nitrogen-containing classes by DPos(A) is unfavorable, and that the characterization of any class by DPos(B) is likely to be contaminated by data from pathway (A) products. Fortunately, if N is not considered as a possible heteroatom then the products of pathway (A) will not experience interference from pathway (B) products. This is so because the number of nitrogen atoms in the two products differs by one.
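The overlap between the two pathways can be checked with simple formula arithmetic. The sketch below assumes the net elemental changes implied by the stated mass shifts: pathway (A) adds the diamine C4H12N2 and loses H2O, and pathway (B) adds EDC, C8H17N3. The two example acids are hypothetical and chosen only to show the coincidence; they are not species identified in the study.

```python
# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {'C': 12.0, 'H': 1.00782503207, 'N': 14.0030740048,
                'O': 15.99491461956}


def mass(formula):
    # Monoisotopic mass of an element-count dictionary.
    return sum(MONOISOTOPIC[el] * n for el, n in formula.items())


def add(formula, delta):
    # Apply a net elemental change and drop elements whose count reaches zero.
    out = dict(formula)
    for el, n in delta.items():
        out[el] = out.get(el, 0) + n
    return {el: n for el, n in out.items() if n != 0}


PATHWAY_A = {'C': 4, 'H': 10, 'N': 2, 'O': -1}  # + C4H12N2 (diamine) - H2O
PATHWAY_B = {'C': 8, 'H': 17, 'N': 3}           # + C8H17N3 (EDC)

# Hypothetical examples: an O2 acid (n = 12, Z = -2) and an NO3 acid (n = 16, Z = -4).
o2_acid = {'C': 12, 'H': 22, 'O': 2}
no3_acid = {'C': 16, 'H': 29, 'N': 1, 'O': 3}

product_b = add(o2_acid, PATHWAY_B)   # O2 species labeled via pathway (B)
product_a = add(no3_acid, PATHWAY_A)  # NO3 species labeled via pathway (A)
same_formula = product_a == product_b
```

In this example the two products share one elemental formula, so mass resolving power alone cannot separate them. The NO3 species needs four more carbons and seven more hydrogens than the O2 species, which is the difference between the two Table 3.1 corrections expressed as C4H7NO (about 85.0528 Da).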
Furthermore, if chromatographic or electrophoretic separation is employed prior to MS analysis, it is highly likely that the multiple pathway interferences can be eliminated, as was observed by MacLennan et al.66 This would allow for characterization of nitrogen-containing classes. Additionally, an isotopically labeled diamine reagent would greatly reduce any interference between the two pathways.

Figure 3.4. Mass spectra of ions identified as O2 class by DPos(B) and as NO3 class by DPos(A). Many ion peaks are identified as belonging to both groups, indicating poor characterization of these classes.

3.3.2 Analysis of spectra and matched formula lists (singly-charged)

Based upon the above discussion of nitrogen-containing class interferences, the remainder of the analysis shall focus only on non-nitrogen classes. Unprocessed and formula matched mass spectra of singly-charged underivatized OSPW extract in (-)ESI and of derivatized OSPW extract in (+)ESI are presented in Figure 3.5a. Qualitatively, the processing can be seen to align the distributions of derivatized (+)ESI ions with the underivatized (-)ESI ions, and the profiles look similar (Figure 3.5b). In Figure 3.5c, the similarity can be seen to extend down to the fine scale.

Figure 3.5. Raw mass spectra (a) and processed mass spectra (b) of underivatized and derivatized OSPW. Fine-scale view of the processed mass spectra (c).

A summary of the peak information from each spectrum and the different processing methods is presented in Table 3.2.

Table 3.2. Summary of processed spectra.
Spectrum | Number of peaks in unprocessed spectrum | Processing method | Formula matched peaks including N classes | Formula matched peaks excluding N classes
(-)ESI underivatized | 12,957 | UNeg(1) | 896 | 761
 | | UNeg(2) | 463 | 432
(+)ESI derivatized – pathway A | 16,146 | DPos(A) | 968 | 613
 | | DPos(AA) | 540 | 401
Total | 43,915 | | 2,876 | 2,207

When N classes are included, the number of matched peaks is higher for all processing methods, but the effect is much more dramatic for the derivatized spectrum than the underivatized. Because DPos(A) and DPos(AA) should represent derivatized NA-like compounds, they could be expected to behave similarly to UNeg(1) and UNeg(2). The relatively large loss when N classes are excluded seems to support the hypothesis that pathway (B) products interfere with identification of pathway (A) products and that characterization of N classes by the derivatization method without the use of chromatographic or electrophoretic separation is unfavorable. Excluding N, the numbers of peaks matched to formulae in UNeg(1) and DPos(A) were somewhat similar, with UNeg(1) having more. The DPos(A) data processing ideally identifies only those species which are amenable to the derivatization, which should make it more selective for NA-like compounds than UNeg(1). Thus it is reasonable that DPos(A) resulted in fewer formula matches than UNeg(1). Figure 3.6 presents both peak counts (a-b) and relative abundances (c-d) of selected classes in UNeg(1) and DPos(A). The overlay bars show the number or relative abundances of peaks within the underlying dataset which are also present in the overlaid datasets. For example, Figure 3.6a depicts the total number of peaks observed in selected classes of UNeg(1), as well as the number of those peaks which were also observed in DPos(A). Similar trends are seen for peak numbers in both processing methods (Figure 3.6a and Figure 3.6c).
While both UNeg(1) and DPos(A) exhibit peaks unique to those spectra (Figure 3.6a and Figure 3.6c), it appears that the exclusive peaks are generally lower in intensity, and the higher intensity peaks seen in a given method are observed in the other (Figure 3.6b and Figure 3.6d). Furthermore, the intensities of peaks observed by both methods are generally correlated, especially for higher-intensity peaks (Figure 3.6e). Given this evidence, it appears that the relative abundances of peaks are reasonably well conserved by the derivatization.

Figure 3.6. a-d) Relative abundances and peak numbers for selected classes of NAFCs. e) Correlation plot of intensity of peaks observed in both UNeg(1) and DPos(A).

It is interesting that the relative abundances of O2 and O4 classes differ between UNeg(1) and DPos(A) (Figure 3.6c and Figure 3.6d), especially considering the similarity of the rest of the class profiles. A possible explanation of the differing O2 and O4 abundances is the presence of incompletely derivatized diacids. An underivatized diacid will most likely be doubly-charged in (-)ESI and will not be observed by UNeg(1). A derivatized diacid in which only one carboxyl group was derivatized, however, will be singly-charged in (+)ESI and will be observed by DPos(A). This would increase the apparent relative abundance of O4 class compounds in the DPos(A) spectrum. If this is indeed the case, the difference between the abundances in DPos(A) and UNeg(1) suggests there is a considerable amount of O4 diacid species.

3.3.3 Analysis of doubly-charged peaks

To reduce the amount of data processing for doubly-charged peaks, we focus exclusively on underivatized O4 and O2 classes and their derivatized counterparts. Charge state can be used to identify possible O4 diacids, but it is not unambiguous. A doubly-charged O4 species observed in (-)ESI-MS might be either a diacid or a dihydroxyl monoacid.
However, a doubly charged O4 species in DPos(AA) is identified as such because it possesses two derivatization labels, and it is reasonable to assume that an O4 species will only possess two derivatization labels if it originated from a diacid. While requiring the presence of two derivatization labels reduces ambiguity, there is still the possibility that peaks will be formula matched by the wrong processing method (e.g. mass resolving power is insufficient to distinguish H from H+, so a singly-charged peak can be identified by both single-charge and double-charge processing if it is in the right location). This most frequently occurs with high-mass doubly-charged peaks and low-mass singly charged peaks. For example, $\frac{m_1 + 1}{1} = \frac{m_2 + 2}{2}$ when $m_1 = \frac{1}{2} m_2$. Isotope peaks would commonly be used to address this possibility, but in this case, we can take advantage of the Z-series to further filter the data. After formula matching, it is expected that the spectrum of a given class should exhibit Z-series with spacing of approximately 2 Da. A singly-charged series which is identified by the doubly-charged processing will exhibit a Z-series with spacing of approximately 4 Da as opposed to the expected spacing of 2 Da. Aberrant Z-series can thus be identified and easily removed if they do not overlap the proper Z-series. The O4 class matched by DPos(AA) is shown in Figure 3.7a. As the data processing results in neutral masses, this spectrum has units of Da rather than m/z. A visual inspection shows the aberrant Z-series begins above ~360 Da. Fortunately, there seems to be minimal overlap of the Z-series, so a simple high mass cut-off of 360 Da can be used to eliminate the interference. The mass spectrum of O4 diacids after applying this cut-off is shown in Figure 3.7b.

Figure 3.7. O4 class spectrum from DPos(AA), a) before mass cut-off and b) after high molecular mass cut-off of 360 Da.
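The doubling of the Z-series spacing can be reproduced numerically. The sketch below uses the DPos(A) and DPos(AA) correction values from Table 3.1; the example monoacid Z-series (starting from C10H20O2) and the helper names are illustrative assumptions and are not taken from the study data.

```python
# Correction values (Da) from Table 3.1.
CORRECTION_A = -71.0968    # DPos(A): singly charged, one pathway (A) label
CORRECTION_AA = -142.1935  # DPos(AA): doubly charged, two pathway (A) labels

H2 = 2 * 1.00782503207     # mass of two hydrogens; the spacing of a Z-series


def mz_singly_derivatized(neutral):
    # m/z at which a singly charged, singly derivatized monoacid is observed.
    return neutral - CORRECTION_A


def process_as_dpos_aa(mz):
    # Neutral mass reported when a peak is treated as doubly charged.
    return 2 * mz + CORRECTION_AA


def apply_cutoff(masses, cutoff=360.0):
    # High-mass cut-off used to discard the aberrant series.
    return [m for m in masses if m <= cutoff]


# Hypothetical O2 monoacid Z-series: C10H20O2 (Z = 0) and successive losses of H2.
true_masses = [172.14633 - k * H2 for k in range(4)]
apparent = [process_as_dpos_aa(mz_singly_derivatized(m)) for m in true_masses]

true_spacing = true_masses[0] - true_masses[1]
apparent_spacing = apparent[0] - apparent[1]
```

Each misassigned peak appears at about twice its true neutral mass, so the roughly 2 Da spacing becomes roughly 4 Da. On this reading, an aberrant series that begins above about 360 Da corresponds to singly charged species with neutral masses above about 180 Da.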
Figure 3.8 shows the O4 class as observed by both DPos(A) and DPos(AA) after applying the described cut-off. This represents the O4 monoacids and diacids. Note that this spectrum represents neutral molecules, and as such the x-axis has units of Da. Because we have already established the inefficiency of the second derivatization of a diacid, the comparison is not quantitative. However, qualitative observations can be made. To facilitate the comparison, the scales of the two spectra in Figure 3.8 have been adjusted such that the highest intensity peaks in each are approximately equal in magnitude. While the two profiles are generally similar in shape, the profile of the diacid O4 class (represented by DPos(AA) in Figure 3.8) is seen to be slightly shifted to higher mass relative to the monoacid O4 class (represented by DPos(A) in Figure 3.8). This shift suggests the carboxylic acid content of NAFCs tends to increase with increasing molecular weight. This same trend was reported by Frank et al.1

Figure 3.8. O4 class observed in both DPos(A) and DPos(AA). Monoacids are represented by DPos(A) and diacids by DPos(AA). The y-scale of each spectrum is

3.4 Conclusions

A previously described method was used to derivatize OSPW extract, and the derivatized mixture was characterized by high-resolution (+) ESI-MS for comparison with high-resolution (-) ESI-MS characterization of the underivatized sample. Application of the derivatization to OSPW extract was confirmed by MS2 analysis of a derivatized single-compound standard and comparison with MS2 of derivatized OSPW extract. The derivatized OSPW appears to preserve much of the information observed by (-) ESI-MS of the underivatized sample, with the class distributions being generally similar. Unfortunately, the derivatization is unsuitable for the direct injection-MS characterization of nitrogen-containing classes due to the potential for interference between the multiple pathway products.
Analysis of all other classes, however, is feasible if considering only pathway (A) products. Furthermore, chromatographic or electrophoretic separation could likely be used to separate the interfering compounds, resulting in characterization of a wider class range. Use of an isotopically labelled diamine reagent could likewise eliminate interference between the two pathways without the need for separation. Nevertheless, reducing the complexity of the derivatization may be an area for improvement in future research. The O4 diacid class was profiled using the DPos(AA) processing method with apparent success, and comparison with the O4 monoacid class suggests that carboxylic acid content increases with increasing molecular weight. Interference in the O4 diacid profile was noted from the O2 monoacid class, but this could likely be resolved with the use of chromatographic or electrophoretic separation.

3.5 Acknowledgements

This work was supported by a contract from Environment Canada and the Natural Sciences and Engineering Research Council (NSERC) of Canada. KAK was supported by a Four Year Doctoral Fellowship from UBC, and MSM was supported by a PGS-D Fellowship from NSERC.

Parsing and Quantification of Raw Orbitrap Mass Spectrometer Data Using RawQuant

4.1 Introduction

In recent years, mass spectrometry (MS) methods have been successfully applied to study the proteomes of a variety of organisms.85 Achieving optimal performance in these types of studies requires thorough examination of MS methods to maximize the quality of the output data. During typical MS data acquisition, survey (MS1) scans are used to identify precursor ions suitable for fragmentation and tandem MS/MS (MS2) analysis. As a metric to gauge the performance of this type of shotgun proteomics approach, readouts of the numbers of peptide spectral matches (PSMs), unique peptides, and protein identifications derived from analysis of the generated MS2 spectra are often used.
When measuring a complex sample, the number of identifications generally provides an excellent metric by which to assess MS performance, as it indirectly provides information on acquisition rate (number of MS2 scans) and the quality of the obtained data (identification rate). As a result, a growing number of tools designed for quality control tracking of MS performance are also based on identification metrics.18,86 In addition to identification metrics, direct assessment of MS performance can be achieved through examination of the raw scan data to determine parameters such as:
1. topN – number of dependent MS2 scans triggered after an MS1 event;
2. Duty cycle length – time required for an MS1 and all triggered dependent MS2 scans;
3. Scan rate – number of MS2 scans acquired per second.
However, these data can be difficult to compile for the average MS user as they require direct access to the raw data. A foundational challenge to the extraction of data suitable for identification or further analysis is the closed-source format of the MS raw data file. Software capable of parsing and converting raw formats from individual MS vendors (e.g. MSFileReader) has resulted in substantial progress in the ability to access and process the enclosed data.87-89 However, the rapid development and increasing complexity of MS analysis approaches mean the composition and structure of a raw file can be substantially different between vendor MS or software iterations, potentially compromising the compatibility with, and ultimately the functionality of a given tool. As an example, the introduction of synchronous precursor selection tandem MS/MS/MS (SPS-MS3) scanning for isobaric tag reporter ion acquisition on the Orbitrap Fusion90 and Orbitrap Fusion Lumos MS platforms resulted in the storage of an additional scan per precursor ion in a non-sequential manner.
Implementation of analysis workflows that integrate SPS-MS3 quantification with MS2 identification results has been relatively slow, with the commercial tools Proteome Discoverer (PD) and PEAKS,49 freeware MaxQuant,91 and open-source Trans-Proteomic Pipeline (TPP)92 offering varying levels of compatibility with raw files generated using this acquisition type. To improve the accessibility of data contained within raw Thermo Orbitrap MS data files, this work describes the development, validation, and application of a new open-source tool: RawQuant. RawQuant supports simple parsing of meta and scan data from all combinations of MS1, MS2, and SPS-MS3 acquisition modes. RawQuant offers the functionality to output standard text format tables of scan meta data, acquisition characteristics (e.g. topN, duty cycle), and Mascot Generic Files (MGF) suitable for MS2 spectral identification. In addition, RawQuant provides the capability to extract isobaric tag quantification data (e.g. isobaric tags for relative and absolute quantification - iTRAQ and tandem mass tags - TMT) across the Q-Exactive and Orbitrap Fusion instrument families. The functionality of RawQuant was demonstrated using a combination of individual applications that serve to highlight the capability of the tool to provide a user-friendly method to capture the information contained within raw data files acquired on Thermo Orbitrap MS instruments.

4.2 Experimental Section

4.2.1 Access of deposited data

For benchmarking RawQuant, a collection of previously published data obtained from ProteomeXchange were examined:
1. Re-analysis of data from a study focused on examination of the HeLa cell proteome with MS analysis on a Q-ExactiveHF instrument (PXD004452).93
2. Re-analysis of data from a study focused on examination of the HeLa cell proteome with MS analysis on a Q-ExactiveHF-X instrument (PXD006932).94
3.
Re-analysis of data from a study focused on examination of the HeLa cell proteome with MS analysis on a Q-ExactiveHF instrument (PXD001305).95
4. Re-analysis of data from a study of TMT 10-plex labeled ‘triple-knockout’ genetic mutant Saccharomyces cerevisiae strains (PXD008009).96
5. Re-analysis of data files from a study of an iTRAQ 8-plex two-proteome mixture model of human and E. coli peptides performed on a Q-Exactive MS (PXD003640).97
6. Re-analysis of data from a study of TMT 10-plex labeled Saccharomyces cerevisiae grown in different carbon sources (PXD002875).98 Supplementary tables of expression values for this study were obtained through direct contact with the authors.
Detailed explanations of re-analysis of the above data sets for identification results are described in the Supplementary Information.

4.2.2 RawQuant processing

RawQuant was implemented with the Python programming language and utilizes MSFileReader (Thermo Scientific) to gain access to raw Thermo Orbitrap MS files. For this work, the freely available versions of Python (version 3.6.1, 64-bit) and MSFileReader (version 3.0.29, 64-bit) were used. RawQuant accesses raw data files utilizing François Allain’s Python bindings for MSFileReader (https://github.com/frallain/MSFileReader-Python-bindings). RawQuant and its dependencies are available in the Python Package Index (https://pypi.python.org/pypi/RawQuant/0.1.0). In addition, detailed information describing installation and use are also freely available on GitHub: https://github.com/kevinkovalchik/RawQuant. RawQuant has two primary modes of operation: parse and quant. In parse mode, the user provides the input raw file, the orders (e.g. MS1, MS2, MS3) of the MS scans for extraction, and whether or not an MGF file should be created for subsequent database matching. If specified, RawQuant can return standard meta data for all selected orders, including: scan index, retention time, injection time, and other linked scan events (e.g.
MS1 scan from which an MS2 scan was triggered). In quant mode, RawQuant requires input of the raw file and reporter ion design (e.g. TMT 10-plex, iTRAQ 8-plex). Quant mode can utilize standard isobaric tagging methods (e.g. iTRAQ 2 – 8plex, and TMT 0 – 11plex), as well as custom, user-defined tag sets. Given an isotope impurity table, corrections can be applied for both TMT and iTRAQ experimental designs. The impurities are corrected using linear algebra and Cramer’s rule. Reporter ions are quantified based on the centroid data calculated by the instrument firmware. RawQuant performs no peak picking, and thus relies entirely on vendor provided centroid values contained within the raw files themselves. Fortunately, centroided versions of every spectrum where the Orbitrap detector is used are stored in the raw file independent of the data mode used (e.g. profile or centroid), so these values are available. For extracting reporter ions, RawQuant assumes that the data in the highest order MSn scans contain the data of interest (e.g. MS3 for SPS-MS3). However, this functionality can be overridden by the user. Quantification is performed by searching a window of ± 0.003 Da around the expected reporter ion mass. If multiple ion peaks are discovered, the ion with the lowest ppm mass error is chosen. The reporter ion’s mass, mass error, and intensity are available from both ion trap and Orbitrap data. Additionally, when the data were acquired in the Orbitrap, the resolution, baseline, and noise are automatically output to the quantification matrix. The generated quantification matrix is automatically saved to disk as a tab-delimited text file. If MS1 interference quantification is desired, it is assumed that the MS1 scan is acquired in the Orbitrap. Interference can be calculated for both profile and centroid scans, with the calculation based on area or intensity, respectively.
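The reporter-ion search and impurity correction described above can be sketched as follows. This is not RawQuant's source code: the function names, the peak-list layout, and the orientation of the impurity matrix (entry [i][j] is the fraction of label j's signal that appears in reporter channel i) are assumptions made for illustration. Cramer's rule is written out in pure Python so the example has no dependencies.

```python
def find_reporter(peaks, expected_mz, window=0.003):
    # peaks: list of (m/z, intensity) centroids from one scan (hypothetical layout).
    # Returns (m/z, ppm error, intensity) of the closest peak in the window, or None.
    in_window = [p for p in peaks if abs(p[0] - expected_mz) <= window]
    if not in_window:
        return None
    mz, intensity = min(in_window, key=lambda p: abs(p[0] - expected_mz))
    ppm = (mz - expected_mz) / expected_mz * 1e6
    return mz, ppm, intensity


def determinant(matrix):
    # Determinant by cofactor expansion along the first row (fine for small matrices).
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0.0
    for col, value in enumerate(matrix[0]):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * value * determinant(minor)
    return total


def correct_impurities(impurity, observed):
    # Solve impurity * true = observed for the true intensities with Cramer's rule.
    det = determinant(impurity)
    if det == 0:
        raise ValueError('impurity matrix is singular')
    corrected = []
    for i in range(len(observed)):
        # Replace column i with the observed intensities.
        replaced = [row[:i] + [observed[r]] + row[i + 1:]
                    for r, row in enumerate(impurity)]
        corrected.append(determinant(replaced) / det)
    return corrected


# Illustrative scan: two centroids fall inside the window around the TMT 126 reporter.
scan = [(126.1250, 50.0), (126.1278, 900.0), (127.1248, 400.0)]
hit = find_reporter(scan, 126.127726)

# Illustrative two-channel impurity table (rows and columns are lists).
impurity = [[0.95, 0.02],
            [0.05, 0.98]]
corrected = correct_impurities(impurity, [99.0, 201.0])
```

Cofactor expansion scales poorly with matrix size, but the matrices involved in isobaric tagging are small (at most 11 channels in the designs mentioned above), so the cost is modest in this setting.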
The isolation width is automatically acquired from the raw file and is used to extract the relevant mass list centered on the precursor mass. Any peak found at the precursor’s m/z is designated a non-interference. In all scans, appropriate carbon isotope peaks are searched for at higher m/z values than the precursor and are designated as non-interferences. If the precursor’s mass is greater than 1,000 Da, isotope peaks are also searched for at lower m/z values. To increase the script’s speed, RawQuant does not perform curve-fitting to determine area under profile scans. The area is determined directly from the mass list using the composite trapezoidal rule. The MS1 interference is calculated as a percent ratio of interference area or intensity to the total area or intensity in the isolation window. In both parse and quant modes, the option to output standard MGF files that contain centroided peaks from MS2 scans is available. In addition, in both modes, a ‘metrics’ table containing information such as number of MS1 scans, number of MS2 scans, mean topN, and mean scan rate and duty cycle can be generated.

4.2.3 E. coli cell culture, protein isolation, reduction, and alkylation

E. coli cultures grown in Luria Broth with standard conditions (total of ~1e9 cells were harvested) were prepared using the following protocol. Pellets (~1e7 cells each) were thawed on ice and periodically vortexed. To each pellet, 900 μL of lysis buffer (50 mM HEPES pH 8, 1% SDS (Thermo Fisher, CAT#BP1311-1), 1% Triton X-100 (Sigma, CAT#T8787), 1% NP-40 (Sigma, CAT#NP-40), 50mM NaCl (Sigma, CAT#S7653), 10 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP) (Sigma, CAT#C4706), 40mM chloroacetamide (CAA) (Sigma, CAT#C0267), 1X cOmplete protease inhibitor – EDTA free (Sigma, CAT#11836170001)) was added. Lysis mixtures were transferred to 2 mL FastPrep-compatible tubes containing Lysing Matrix B (MP Biomedicals; CAT#116911050).
Lysis mixtures were vortexed on the FastPrep-24 instrument (6 M/s, 40 seconds, 2 cycles, 120 second rest between cycles). Lysates were then centrifuged at 20,000 g for 5 minutes, and the supernatant recovered. Resultant lysates were heated at 90 °C for 15 minutes, and cooled to room temperature for a further 15 minutes. Protein concentrations were approximated using A280 readings from a NanoDrop instrument (Thermo Scientific).

4.2.4 Protein clean-up with SP3, and protease digestion

Proteins were purified using the SP3 method, as described previously.99,100 A total of 200 μg of protein was prepared in a final volume of 100 μL in a standard 1.5mL tube. To the lysate, 20 μL (400 μg) of a 1:1 combination of two different types of carboxylate-functionalized beads, both with a hydrophilic surface (Sera-Mag Speed Beads, GE Life Sciences, CAT#45152105050350 and CAT#65152105050350), was added. Beads were rinsed in water prior to addition to the lysate. The pH of the bead-lysate mixture was maintained at basic conditions (HEPES pH 8) to ensure optimal binding to beads.100,101 To promote binding to the beads, 100 μL of 90% ethanol was added to achieve a final concentration of 50% by volume (i.e. 45% ethanol final concentration). Tubes were mixed on a ThermoMixer unit (Eppendorf) at 1000 rpm for 10 minutes at room temperature. Tubes were placed in a magnetic rack and incubated for 2 minutes. The supernatant was discarded, and the beads rinsed 3x with 180 μL of 90% ethanol by removing the tubes from the magnetic rack and gently re-suspending the beads by pipette mixing. For elution, tubes were removed from the magnetic rack, and beads were re-suspended in 100 μL of 50mM HEPES, pH 8 containing an appropriate amount of trypsin/rLysC mix (1:25 enzyme to protein concentration) (Promega, CAT#V5071) and incubated for 14 hours at 37 °C in a ThermoMixer with mixing at 1000rpm.
After incubation, the tubes were sonicated briefly (~30 seconds) in a bath sonicator, placed on a magnetic rack, and the supernatant recovered for further processing.

4.2.5 Synthetic peptide mix preparation

The set of standard peptides was taken as a subset from the collection analyzed in the ProteomeTools initiative.102 Peptides were selected for a panel of 61 genes, resulting in a set of 444 total candidates that were synthesized in a ‘SpikeMix’ format (JPT Peptide Technologies) (Table S-1). Upon delivery, dried peptides were reconstituted in 100 μL of DMSO, vortexed briefly (~15 seconds), and sonicated in a water bath for 5 minutes. Reconstituted peptides were spiked into real samples based on an acquired signal response curve for the mixture.

4.2.6 Tandem mass tag labeling of peptides

TMT 11-plex labeling kits were obtained from Pierce. Each TMT label (5 mg per vial) was reconstituted in 500 μL of acetonitrile and refrozen. A maximum of 100 μg of combined peptide was present in any single channel. Labeling reactions were carried out through addition of 300 μg of TMT label in two volumetrically equal steps of 15 μL (150 μg per addition), 30 minutes apart. Reactions were quenched through addition of 10 μL of glycine (1 M stock solution) (Sigma). Labeled peptides were concentrated on a SpeedVac centrifuge (Thermo Scientific) to remove excess acetonitrile, acidified to 1% (v/v) trifluoroacetic acid (TFA), and purified with a C18 TopTip (100 - 1000 μL TopTip, Glygen Corp., CAT#TT3C18). Peptides were dried in a SpeedVac centrifuge, and reconstituted at 1 μg/μL in 1% DMSO, 1% formic acid in water.

4.2.7 Peptide clean-up procedures

Peptides were desalted and concentrated using TopTip treatment. For TopTip clean-up, 1 mL TopTips (Glygen, CAT#TT3C18) were rinsed twice with 0.6 mL of acetonitrile with 0.1% TFA. Cartridges were then rinsed twice with 0.6 mL of water with 0.1% TFA prior to sample loading.
Loaded samples were rinsed three times with 0.1% formic acid (0.6 mL per rinse) and eluted with 1.2 mL of 90% acetonitrile containing 0.1% formic acid. All TopTip processed samples were concentrated in a SpeedVac centrifuge and subsequently reconstituted in 1% formic acid with 1% DMSO in water.

4.2.8 Chromatographic separation prior to MS analysis

For all runs, samples were introduced to the MS using an Easy-nLC 1000 system (Thermo Scientific). Columns used for trapping and analytical separations were packed in-house in fritted capillaries prepared using a combination of formamide and Kasil (1:3 ratio, Next Advance, CAT#FRIT-KIT). Trapping columns were packed in 100 μm internal diameter capillaries to a length of 3 cm with C18 core-shell beads (Aeris PEPTIDE XB-C18, Phenomenex, 1.7 μm particle size, CAT#04A-4506). Prior to injection, the pre- and analytical columns were equilibrated at 400 bar for 10 μL and 3 μL, respectively. After injection, trapping was carried out for a total volume of 15 μL at a pressure of 400 bar. After trapping, gradient elution of peptides was performed on a core-shell C18 (Aeris PEPTIDE XB-C18, Phenomenex, 1.7 μm particle size) column packed in 100 μm internal diameter capillaries to a length of 25 cm and heated to 50 °C using AgileSLEEVE column ovens (Analytical Sales & Service). Elution in 60-minute runs was performed with a gradient of mobile phase A (water and 0.1% formic acid) from 3 – 8% B (acetonitrile and 0.1% formic acid) over 2 minutes, 8 – 25% B over 40 minutes, and to 40% B over 11 minutes, with final elution (80% B) using a further 7 minutes at a flow rate of 400 nL/min.

4.2.9 MS analysis of peptide samples on the Orbitrap Fusion

Data acquisition with Orbitrap Fusion (control software version 3.0.2041) was carried out using a data-dependent method with MS2 in the Orbitrap, or multi-notch synchronous precursor selection (SPS)-MS3 scanning for TMT tags.
The Orbitrap Fusion was operated with a positive ion spray voltage of 2200 V and a transfer tube temperature of 275 °C. MS1 scans were acquired in the Orbitrap at a resolution of 120K, across a mass range of 350 – 1500 m/z, with an RF lens setting of 60, an AGC target of 2e5, a max injection time of 50 ms, for 1 microscan in profile mode. For dependent scans, monoisotopic precursor selection was enabled with the ‘Peptide’ setting, an intensity threshold of 5e3, charge state selection of 2 – 4 charges, and dynamic exclusion for 30 seconds after 1 appearance with 20 ppm low and high tolerances. Isotopes were excluded from repeat analysis, and the dependent scan on a single charge state per precursor setting was disabled. For MS2 acquisition in the Orbitrap, quadrupole isolation using a 1.4 m/z window with no offset was used prior to HCD fragmentation with a setting of 35%. Data acquisition was carried out in the Orbitrap using a mass resolving power of 50K, a first mass of 110 m/z, an AGC target of 2e5, and a max injection time of 120 ms for 1 microscan in profile mode. For SPS-MS3 acquisition, MS2 scans were acquired in the ion trap after quadrupole isolation with a window of 1.4 m/z. Activation was by CID with an energy of 30%, a 10 ms activation time, and 0.25 activation Q. The ion trap was set to scan in ‘Rapid’ mode, with a 1e4 AGC target, and 30 ms max inject time. MS2 scans were acquired in centroid mode. Ions for MS3 scans were selected based on a precursor mass range of 350 – 1200, a relative intensity filter of 10%, precursor ion exclusion of 20 and 5 ppm (Low, High), and isobaric tag loss of TMT. A total of 10 precursors were set for SPS using MS1 and MS2 isolation windows of 2 m/z with no offset. Ions were fragmented with HCD at an energy of 60%. Scans were acquired in the Orbitrap at a resolution of 50K and scan range of 120 – 750 m/z for 1 microscan in profile mode.
SPS-MS3 AGC targets and maximum injection times were altered across windows of 8e4 – 8e5 and 60 – 120 ms as indicated.

4.2.10 Mass spectrometry data analysis

All acquired data were processed using SearchCLI (version 3.2.30) and PeptideShakerCLI (version 1.16.11).103,104 All searches used a combination of X!Tandem (version 2015.12.15.2), MyriMatch (version 2.2.140), MS-GF+ (version 10282), and Comet (version 2016.01 rev. 3) algorithms. MGF files generated with RawQuant were used in all searches. For the analysis of the MS data in SearchCLI, centroided MS2 spectra (MGF files from RawQuant) were searched against a UniProt E. coli proteome database (release 2017_10) containing common contaminants and the synthetic peptide sequences, to which reversed sequences generated using the –decoy tag of FastaCLI in SearchCLI were appended (8,710 total sequences, 4,355 target). Identification parameter files were generated using IdentificationParametersCLI in SearchCLI, specifying precursor and fragment tolerances of 20 ppm and 0.5 Da (ion trap MS2) or 0.05 Da (Orbitrap MS2), carbamidomethyl of cysteine, TMT 10-plex of peptide N-term, and TMT 10-plex of lysine as fixed modifications, and oxidation of methionine and acetylation of protein N-term as variable modifications. The msgf_instrument, msgf_fragmentation, and msgf_protocol tags were set to 0, 1, and 4 for ion trap MS2, and 3, 3, and 4 for Orbitrap MS2. All SearchCLI results were processed into PSM, peptide, and protein sets using PeptideShakerCLI. Error rates are controlled in PeptideShakerCLI using the target-decoy search strategy to determine false-discovery rates (FDR). Hits from multiple search engines were unified using posterior error probabilities determined from the target-decoy search strategy. Results were exported from PeptideShakerCLI using ReportCLI. All results were filtered to provide a final FDR at the PSM, peptide, and protein level of <1%.
Final mzid files were output from PeptideShakerCLI using MzidCLI with the default parameters.

4.2.11 General statistical parameters

In all boxplots, center lines in plotted boxes indicate the median, upper and lower box edges the 75th and 25th percentiles, and upper and lower whiskers 1.5X the interquartile range. All correlation calculations utilize the Pearson method. The calculation of individual p-values was performed using two-sided Student’s t-tests of biological replicates, unless otherwise noted.

4.2.12 Data and code availability

The mass spectrometry proteomics data were deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository105 with the dataset identifier PXD008787 (DOI: 10.6019/PXD008787). The repository contains the RawQuant generated MGF files, identification results, and mzid files from SearchCLI and PeptideShakerCLI. In addition to search data and results, R Notebook files that catalog and describe the analysis of data and construction of the figures are provided as individual html files on https://github.com/chrishuges/RawQuant_JPR-2018.

4.3 Results and Discussion

RawQuant was designed to enable extraction of meta and scan data from raw data files acquired with Thermo Orbitrap MS platforms. Using the ‘parse’ functionality, RawQuant supports extraction of meta and scan data from multiple types of MS acquisition (e.g. MS1, MS2, and MS3) spanning a range of conventional Orbitrap instrument architectures (e.g. Fusion, Fusion Lumos, Q-ExactiveHF, Q-ExactiveHF-X) (Table S-2, 0). Using the ‘quant’ functionality, RawQuant provides output of isobaric tag data (e.g. TMT, iTRAQ) from multiple MS acquisition types (e.g. MS2, SPS-MS3) and Orbitrap instrument architectures (e.g. Fusion, Fusion Lumos, Q-Exactive series) (Table S-3, 0).
In addition to standard quantification, RawQuant is capable of calculating MS1 isolation interference values, applying reporter ion impurity corrections, and assessing custom reporter ion masses (e.g. TMTc ions106) (Table S-3, 0). RawQuant can also output a metrics table containing a variety of calculated values (e.g. number of MS1 and MS2 scans, mean topN rate: number of triggered MS2 scans following an MS1 event, mean Hz: number of MS scans per second, and mean duty cycle time: time between MS1 scans) (Table S-4, 0). Lastly, RawQuant can generate a standard MGF file of centroided MS2 spectra for database searching. All parsing and quantification data output by RawQuant are indexed by scan number, enabling integration with other tools that also make this information available (e.g. commonly used search engines) (Figure 4.1).

Figure 4.1 RawQuant has a wide range of built-in utilities and potential functionality. Schematic depicting the analysis pipelines possible with RawQuant. Parsing of raw Thermo Orbitrap MS files can yield parsed data matrices containing scan and MS operation efficiency data, as well as quantification values for isobaric tags. RawQuant generated MGF outputs can be used directly in search engines that accept this input (e.g. Mascot), and the results combined with the RawQuant parsed or quantification data based on scan numbers and file names.
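The scan-number indexing described above can be illustrated with a minimal Python sketch. It assumes that both the RawQuant scan table and the search results have already been read into lists of dictionaries; the field names ('file', 'scan', 'peptide', 'ms2_inject_ms') and the example values are hypothetical and are not taken from the actual RawQuant or PeptideShaker output formats.

```python
def merge_by_scan(scan_rows, psm_rows):
    '''Attach search results to scan rows using (file name, scan number) as the key.'''
    # Hypothetical field names; assumes at most one PSM per scan.
    psm_index = {(p['file'], p['scan']): p for p in psm_rows}
    merged = []
    for row in scan_rows:
        psm = psm_index.get((row['file'], row['scan']))
        out = dict(row)  # copy so the input rows are left untouched
        out['peptide'] = psm['peptide'] if psm else None
        out['identified'] = psm is not None
        merged.append(out)
    return merged


# Illustrative values only
scans = [
    {'file': 'run1.raw', 'scan': 101, 'ms2_inject_ms': 15.0},
    {'file': 'run1.raw', 'scan': 102, 'ms2_inject_ms': 9.2},
    {'file': 'run1.raw', 'scan': 103, 'ms2_inject_ms': 15.0},
]
psms = [
    {'file': 'run1.raw', 'scan': 101, 'peptide': 'PEPTIDEA'},
    {'file': 'run1.raw', 'scan': 103, 'peptide': 'PEPTIDEB'},
]
merged = merge_by_scan(scans, psms)
```

Once merged in this way, scan-level values such as injection time can be compared directly between identified and unidentified scans.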
4.3.1 RawQuant enables efficient and robust parsing of raw MS files

To demonstrate the usefulness of the parsing functionality of RawQuant, published data were re-processed from a research study describing an optimized workflow for shotgun whole-proteome analysis utilized in the examination of a collection of mammalian cell lines.93 The work from Bekker-Jensen et al.93 consisted of a fractionation scheme resulting in 46 final samples that were each analyzed with short 30-minute chromatography separations coupled to data acquisition on a Q-ExactiveHF MS (45-minute total run time per fraction, 34.5 hours total analysis time). For re-analysis with RawQuant, the 46-fraction HeLa cell (1 mg) data set was initially chosen as it represented the original benchmark for the performance of the optimized workflow. Reprocessing of the 46 raw data files with RawQuant to generate MGF files and meta data matrices for the set of 242,699 MS1 and 1,018,676 MS2 scans required under 20 minutes on a standard PC (Core i7-6400 @ 3.4GHz, 8GB RAM, 64-bit Windows 10). From an identification metrics standpoint, the data obtained with re-processing using SearchCLI and PeptideShakerCLI appear to be exceptional, and display minimal differences in comparison to the originally reported93 depth of coverage at the PSM, peptide, and protein levels (Table S-5, 0). However, using the RawQuant output to examine the numbers of MS2 events across the individual fractions revealed that the majority of the early samples (e.g. fractions 1 – 15) had fewer than the overall average number of 22,145 scans, suggesting limited complexity in terms of multiply charged peptide species available for sampling (Figure 4.2a). Further investigation revealed similar trends for the numbers of PSMs, resulting in a lower overall number of MS2 identifications from early fractions (Figure 4.2b-c).
Analysis of the MS1 and MS2 cycling information calculated from the RawQuant output revealed that the Q-ExactiveHF MS used in the study rarely hit the specified topN (specified maximum number of MS2 scans to target after a given MS1 event, topN = 20), with an observed mean of 7.6 across all fractions (n = 133,992 MS1 scans followed by an MS2 event) (Figure 4.3a). The time between sequential MS1 events (cycles) revealed a median cycle time of approximately a quarter second (median = 258 ms, n = 242,653 cycles) (Figure 4.3c), far shorter than the 6.2 second base-to-base average peak width observed in the original study.93 Combination of the RawQuant output with the identification results revealed that across all scans, 816,177 of 1,018,676 (80%) hit the maximum allowable injection time (15 ms) for an MS2 event. However, just 254,548 of the scans that hit the maximum resulted in an identification (31%), compared with 115,043 of the 202,499 (57%) that used less than the maximum MS2 time (<15 ms).

Figure 4.2 Analysis of RawQuant data highlights regions of non-uniform identification. Data from an MS2-only, label-free whole proteome analysis were re-processed and the identification results queried.93 (a) Depicts the numbers of MS2 scans obtained in each individual fraction. (b) Depicts the numbers of PSMs identified in each fraction. Horizontal red lines indicate the mean value in each plot. For all analyses, n = 46. (c) Depicts the identification rate (number of PSMs / number of MS2 spectra) across each individual fraction.

To further investigate the observed scanning and identification relationships in reference to data acquired using alternative approaches, re-analysis of two published acquisitions of HeLa cell proteomes using Q-ExactiveHF95 and Q-ExactiveHF-X94 MS instruments was carried out.
The Q-ExactiveHF study utilized a fractionation-concatenation scheme designed to minimize empty elution windows (14 final concatenated samples, less than 24 hours of total run time, ~14% devoted to LC overhead) with longer MS acquisitions of each fraction. The HF-X work utilized the same pre-fractionation scheme as the previous mammalian cell line analysis by Bekker-Jensen et al.,93 but a more rapid MS analysis workflow (19-minute MS runs, total analysis times of ~26 hours, assuming 15-minute per-sample overhead from the LC system). This fractionation scheme was combined with an optimized 28 Hz scan method (max injection time = 22 ms, resolution = 15,000) that included an advanced precursor detection (APD) algorithm that would result in a greater number of selectable precursors for dependent analysis.107 The identification metrics for the HF95 and HF-X94 re-processed data sets both compared favorably with the original studies (Table S-5, 0). Inspection of the information compiled by RawQuant uncovered the improved distributions of MS2 events in the HF-X and HF analyses (Figure S-1a-b, 0), and the greater uniformity in PSM identifications across fractions (Figure S-1c-d, 0). Together, the data compiled from RawQuant analysis of the data files from the original Bekker-Jensen et al. study93 suggested that the described analysis would potentially benefit from modifications in two primary areas. Firstly, the uneven distribution of MS2 spectra and resultant variability in identifications across the individual fractions is indicative of the presence of empty elution windows where there are no peptide candidates to be sampled. This behavior was not observed in the second Q-ExactiveHF study,95 where a concatenation scheme in the pre-fractionation of peptides was used (or in the HF-X work, likely due to a modification that was made in the offline fractionation gradient conditions94).
Secondly, the rapid duty cycle (a 258 ms duty cycle would give 24 points across a 6.2 second base-to-base peak) in combination with the sub-optimal topN value indicates that the MS was performing more MS1 scans than was necessary due to exhaustion of peptide candidates to sample in a given elution window. In the HF-X work,94 the improved ion metrics and APD algorithm resulted in a larger selectable pool of precursors for MS2 that reduced the duty cycle inefficiency. To achieve peak performance for a given sample, a balance must be achieved between regulating duty cycle duration and parallelization with ion targets and fill times towards maximizing scanning speed, all while ensuring optimal MS2 quality.95 In the absence of the ability to identify more selectable candidates in MS1 (e.g. no APD), the data from RawQuant suggested the identified excess cycle time should have been devoted towards improving the ion metrics of acquired MS2 data. Although this would reduce the overall ‘scan acquisition rate’, the MS2 identification rate would potentially increase, improving the overall result. Importantly, RawQuant simplifies the observation of these traits of MS acquisition towards optimization of workflows for individual sample sets.

4.3.2 RawQuant enables efficient and robust quantification from raw MS files

To demonstrate RawQuant use for processing raw files obtained in isobaric labeling experiments in comparison to established tools, a TMT 10-plex labeled ‘triple-knockout’ sample set derived from Saccharomyces cerevisiae was re-examined.96 This sample set consists of triplicates of three individual knockout strains (Met6, Pfk2, and Ura2) multiplexed into 9 channels of a TMT 10-plex experiment. The data were acquired with an Orbitrap Fusion with SPS-MS3 scanning, or an Orbitrap Fusion Lumos with MS2 or SPS-MS3.
Quantification of isobaric tag reporter ions was performed using RawQuant and the ‘gold-standard’ commercial software developed by Thermo, Proteome Discoverer (PD). Comparing the quantification values from the two tools using the known knockout genes as a reference revealed the strong concordance between the signal-to-noise (S2N) data reported by RawQuant and PD (Figure S-2 – S-4, 0). The data similarity between RawQuant and PD was also observed when using comparisons between all PSMs identified in matched channels (e.g. TMT126 in RawQuant versus TMT126 in PD), with a mean r2 of 0.99 (n = 9). The high correlation (r2 = 0.99, n = 33) in the calculated values for the ‘interference free index’ (IFI = 1 – ((average S2N in KO) / (average S2N in non-KO))) proposed in the original study based on MET6 PSMs96 further established the agreement between the reported values of RawQuant and PD (Figure S-5, 0). To compare with a non-commercial and open-source software, the data from RawQuant were further examined relative to those calculated by Libra, an isobaric tag quantification tool integrated into the TPP.92 Direct comparison of intensity values from matched channels (e.g. TMT126 from RawQuant vs TMT126 from Libra) obtained from both tools revealed an exact agreement between the values (r2 = 1.0, n = 9 for each of the MS2 and SPS-MS3 data sets), further validating the output from RawQuant. Despite known issues with ion interference,34 acquisition of reporter ions in MS2 mode remains prevalent. To demonstrate the compatibility of RawQuant with MS2 data and iTRAQ reporter tags, data from a study that aimed to quantify reporter ion interference across MS platforms were re-examined.97 The study from Martinez-Val et al.97 utilized a two-proteome interference model wherein human peptides were spiked into specific channels containing defined amounts of E. coli digest. Comparing the channel-by-channel quantification values from RawQuant and PD (e.g.
RawQuant iTRAQ 113 vs. PD iTRAQ 113), a mean Pearson r2 of 0.99 (n = 8 individual channels) was obtained. To compare with observations from the original manuscript, the output ratios in ‘compressed’ (i.e. iTRAQ 113 / iTRAQ 115) and ‘uncompressed’ (i.e. iTRAQ 117 / iTRAQ 119) channels were assessed using values determined by RawQuant from Q-Exactive files (isolation window of 2 Da, n = 2). In agreement with the previously established results,97 these data revealed significant compression (p = 2.2e-16) of the observed log-ratio in the ‘compressed’ channel from the expected fold change of 2.5 (Figure S-6, 0). Based on the agreement between RawQuant and PD, examination of a larger quantitative data set that utilized an ‘in-house’ processing pipeline was undertaken to establish the ability to illuminate biological variation in whole-proteome screens. RawQuant was used to re-examine a study that employed TMT and SPS-MS3 to study carbon-source dependent differences in protein abundance in Saccharomyces cerevisiae.98 The study utilized a single TMT 9-plex where biological triplicates of Saccharomyces cerevisiae, cultured with galactose, glucose, or raffinose as carbon sources, were combined and fractionated into 24 individual samples that were analyzed with an Orbitrap Fusion with SPS-MS3 reporter ion acquisition. Reprocessing of the 24 raw data files with RawQuant to generate MGF files and quantification matrices for the set of 150,849 MS1, 746,852 MS2, and 746,941 MS3 scans required ~59 minutes on a standard PC (Core i7-6400 @ 3.4GHz, 8GB RAM, 64-bit Windows 10). To establish the fidelity of the RawQuant quantification with the published results, PSMs were filtered (summed S2N > 100, at least 6 observed values) and collapsed into proteins by summing the individual peptide values per protein.
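The filtering and collapsing step described above can be sketched in Python as follows. This is a simplified illustration rather than the analysis code used in this work: it assumes each PSM is represented as a (protein, channel values) pair with None marking a missing reporter value, and it sums PSM values directly into proteins without an intermediate peptide level. The function names and example values are hypothetical.

```python
def passes_filter(channels, min_sum=100.0, min_observed=6):
    '''Keep a PSM if its summed S2N exceeds min_sum and enough channels were observed.'''
    observed = [v for v in channels if v is not None]
    return len(observed) >= min_observed and sum(observed) > min_sum


def collapse_to_proteins(psms, n_channels=9):
    '''Sum per-channel S2N values of all passing PSMs for each protein.'''
    proteins = {}
    for protein, channels in psms:
        if not passes_filter(channels):
            continue
        totals = proteins.setdefault(protein, [0.0] * n_channels)
        for i, value in enumerate(channels):
            if value is not None:
                totals[i] += value
    return proteins


# Illustrative values only: 9 reporter channels per PSM
example_psms = [
    ('GAL10', [20.0] * 9),                      # summed S2N 180, 9 observed: kept
    ('GAL10', [5.0] * 9),                       # summed S2N 45: removed
    ('SUC2', [30.0, None, None, None, None,
              30.0, 30.0, 30.0, 30.0]),         # only 5 observed: removed
]
protein_table = collapse_to_proteins(example_psms)
```

In this example only the first GAL10 PSM survives both thresholds, so the protein table contains a single entry.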
Principal component analysis of the processed RawQuant assigned protein values revealed robust separation of the samples based on the carbon source, in concordance with the previous results (Figure S-7a, 0).98 Markers highlighted as having divergent expression dependent on carbon source, GAL10 (p = 2.72e-05, 95% CI of GAL10 = 9.4 – 10.8), SUC2 (p = 9.77e-08, 95% CI of SUC2 = 9.1 – 9.4), and HXT3 (p = 2.37e-04, 95% CI of HXT3 = 8.9 – 10.7) displayed the expected patterns of abundance based on observations from the original study (Figure S-7b-c, 0). Across all peptides quantified by both RawQuant and Paulo et al.98 where TMT expression values were observed across all 9 channels (n = 37,563 unique peptides), high r2 values (minimum r2 = 0.94, mean r2 = 0.95, n = 9) were observed when comparing matched labels (e.g. TMT126 from RawQuant vs. TMT126 from Paulo et al.98). Of the top proteins with the highest variance from Paulo et al. (n = 114), 108 were observed among the top 114 hits in the analysis of data from RawQuant. Taken together, these data establish confidence in the quantification values reported by RawQuant, as they are in direct agreement with those provided by a ‘gold-standard’ commercial software (PD) when working with TMT and iTRAQ in both conventional MS2 and SPS-MS3 scan modes. For quantification analysis, RawQuant is advantageous in these scenarios as it combines working directly on the unprocessed raw MS file with being open-source and freely available. As a result, a clear understanding of exactly how quantification values are calculated can be obtained, different types of data can be extracted (e.g. intensity, noise, baseline), and the resultant information can be interfaced with identifications from virtually any available search tool that reports scan indexes.
4.3.3 Assessing isobaric tag reporter ion values with RawQuant highlights differences in acquisition method settings and types

In addition to quantification signal values, RawQuant outputs a variety of other data that can be useful in assessing MS performance and quantification accuracy, such as reporter ion masses, SPS ions, and MS1 interference. To examine the utility of reporter ion mass extraction with RawQuant, a defined mixture of E. coli (TMT126 – 0:1:0:2:4:1:2:4:1:2:4 – TMT131C, whole proteome) and human peptides (TMT126 – 2:0:0:0:0:1:1:1:3:3:3 – TMT131C, synthetic peptides, n = 444) was prepared in a TMT 11-plex format and analyzed with an Orbitrap Fusion. Modulation of parameters focused on ion target values in MS2 (2e5 ion target, 120 ms fill time) and SPS-MS3 (8e4, 2e5, 4e5, and 8e5 ion targets with fill times of 60 and 120 ms) as previous work demonstrated the coalescence of reporter ions dependent on the ion target values chosen.108 As the TMT 129, 130, and 131 isotopologue channels contained the largest peptide abundance based on sample mixing ratios, mass error calculations were focused on these reporters. As expected, there was no observable ion coalescence in MS2 runs using an ion target of 2e5 (Figure S-8a, 0). Interestingly, no coalescence events were observable in the low (8e4) and high (8e5) ion target settings with SPS-MS3 acquisition (Figure S-8b-c, 0), indicating that this phenomenon is not as prominent in this scan mode. Despite the absence of ion coalescence in these particular analyses, the simple extraction of the required mass values by RawQuant makes this type of quality-control assessment a routine process. To examine the use of ion interference values derived from RawQuant, analysis of the fidelity in quantification in MS2 and SPS-MS3 acquisition modes was undertaken. The 2e5-120 SPS-MS3 acquisition setting was used as it enabled direct comparison with the acquired MS2 condition (2e5 target, 120 ms fill time).
The S2N values of isobaric tag reporter ions for both the E. coli and spiked peptides were observed to deviate in a channel-dependent manner from the expected value across MS2 and SPS-MS3 methods (Figure 4.4a-b). This dependency was easily observed in the reporter channels that were left empty (TMT126, 127C for E. coli, TMT127N, 127C, 128N, and 128C for synthetic spike peptides), where the observed signal scaled based on the amount of peptide present from the alternate mixture (e.g. E. coli bleeding into the empty synthetic peptide TMT127C, 128N, and 128C channels) (Figure 4.4c-d). Using the spike peptide ratios to measure compression, a clear increase in the deviation from the established value was observed with MS2 in comparison to SPS-MS3 methods (Figure S-9a). However, no difference was observed in the amount of compression when using intensity values as opposed to S2N values (TMT ratio mean r2 = 0.99 between S2N and intensity values) (Figure S-9b).

Figure 4.3 Analysis of RawQuant data facilitates extraction of MS acquisition efficiency from raw files. Data from an MS2-only, label-free whole proteome analysis were re-processed and the identification results queried.93 (a) Boxplot of the topN distribution in each fraction. topN refers to the number of MS2 scans selected after an MS1 event. (b) Scatter plot of the numbers of MS2 events triggered per second (Hz). Data were binned in 1-second intervals, and the number of MS2 scans found reported for each bin with a value above 0. For each plot, horizontal red lines indicate the mean at the specified n-values. (c) Boxplots depict the distribution of ‘cycle’ times across each individual fraction. For this work, a cycle is defined as the time between two bordering MS1 events.
Horizontal lines indicate the peak widths at baseline (6.20 seconds) and half-height (3.64 seconds) as determined previously.93

To assess properties driving deviation from the expected quantification values, different measures of scan purity were compared. To maximize the number of measurable values and the observable impact of the filters, PSMs of synthetic peptides with signal in the highly compressed TMT131C / TMT130N channels were used (n = 2,178 for MS2 and n = 1,594 for SPS-MS3). Using MS1 isolation interference calculated by RawQuant to measure MS2 purity, a positive correlation was observed with the absolute difference from the expected quantification value of 3 (3:1 for TMT131C / TMT130N) for both MS2 and SPS-MS3 (r2 = 0.32 and 0.34, respectively) (Figure S-10a-b, 0), in agreement with previous results.109 In contrast to MS1 ion interference, a negative correlation was observed when comparing the ratio difference to the overall ion S2N in MS2 and SPS-MS3 (r2 = -0.35 and -0.42) (Figure S-10c-d, 0). These data suggest that application of a maximum MS1 interference and minimum S2N filter would potentially improve the observed reporter ion errors. However, independent application of an MS1 interference filter (MS1 interference <= 50%), an S2N cutoff (S2N >= 100), or an observed values filter (Number of NA values across 11 channels <= 5) resulted in no observable improvement in the compression of the TMT131C / TMT130N ratio in the MS2 and SPS-MS3 data (Figure S-11a-b, 0). As RawQuant outputs ions selected for SPS, calculation of an MS3 purity value is possible using reported peptide ions matched from PeptideShaker (MS3 purity = ((Sum of Ion Signal for SPS Ions Matched by PeptideShaker) / (Sum of Ion Signal for SPS Ions Selected for SPS-MS3 Analysis)) * 100).
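As an illustration of the MS3 purity formula above, the following Python sketch computes the value for a single scan. It assumes that the SPS ions are available as (m/z, intensity) pairs and that matched fragment ions are supplied as a list of m/z values; the function name, the input layout, and the 0.5 Da matching tolerance are assumptions made for this example and do not describe the RawQuant or PeptideShaker output.

```python
def ms3_purity(sps_ions, matched_mz, tol_da=0.5):
    '''Percent of SPS ion signal belonging to ions matched to the identified peptide.

    sps_ions:   list of (mz, intensity) pairs selected for SPS-MS3
    matched_mz: m/z values of fragment ions matched by the search engine
    tol_da:     matching tolerance in Da (assumed value for ion trap MS2)
    '''
    total = sum(intensity for _, intensity in sps_ions)
    if total == 0:
        return None  # purity is undefined without any SPS signal
    matched = sum(
        intensity
        for mz, intensity in sps_ions
        if any(abs(mz - m) <= tol_da for m in matched_mz)
    )
    return 100.0 * matched / total


# Illustrative values only: two of three SPS ions are matched
example_sps = [(400.2, 300.0), (512.3, 500.0), (650.4, 200.0)]
example_matches = [400.25, 650.35]
purity = ms3_purity(example_sps, example_matches)  # 500 of 1000 signal units matched
```

A purity threshold can then be applied to each PSM in the same way as the MS1 interference and S2N filters.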
Similar to the previous use of an on-line MS2 spectral matching step,110 the MS3 purity calculation assumes that a greater number of ‘true’ MS2 ions selected for SPS-MS3 will result in more accurate reporter ion values. Across the 2e5-120 SPS-MS3 replicates (n = 3 replicates, n = 1,594 PSMs with observed signal for synthetic peptides in TMT131C / TMT130N), a wide distribution of MS3 purity values was observed (Figure S-12a, 0). When compared with the expected TMT131C / TMT130N ratio of 3, an r2 = -0.66 was observed with the MS3 purity value (Figure S-12b, 0). However, filtering of the SPS-MS3 data with a purity cutoff (SPS-MS3 purity >= 60%) resulted in a minor difference relative to the other thresholds applied (Figure S-12c, 0). Based on the observed correlation with SPS-MS3 purity, increased stringency filters would aid in reducing the impact of ratio compression in the data, but would come at the cost of the number of usable data points (90% SPS-MS3 purity filter reduces data points from 1,594 to 779) (Figure 4.4).

Figure 4.4 Ratio compression was observed in a channel-dependent manner in MS2 and MS3 acquisition modes. A mixture of E. coli (TMT126 – 0:1:0:2:4:1:2:4:1:2:4 – TMT131C, whole proteome) and human peptides (TMT126 – 2:0:0:0:0:1:1:1:3:3:3 – TMT131C, synthetic peptides, n = 444) was analyzed using MS2 and SPS-MS3 acquisition methods on an Orbitrap Fusion. Values displayed are for each PSM, where the signal in each channel relative to the total signal across all channels is calculated. The data are transformed by multiplication to display values at the expected mixing ratio, for visualization purposes. (a) Boxplot of PSMs belonging to E. coli peptides obtained from MS2 analysis. (b) Boxplot of PSMs belonging to E. coli peptides obtained from SPS-MS3 analysis. (c) Boxplot of PSMs belonging to synthetic spiked peptides obtained from MS2 analysis. (d) Boxplot of PSMs belonging to synthetic spike peptides obtained from SPS-MS3 analysis.
Red lines depict the expected quantification ratios for the channels.

Together, these data further demonstrate the utility of RawQuant for parsing of quantification data from raw MS files. The data output from RawQuant simplified investigation of method parameter changes to examine phenomena such as ion coalescence, observation of the increased compression of reporter ion values dependent on scan mode, and in assessing the application of filters to improve quantification accuracy. An important conclusion from the RawQuant data analyses performed was that ratio compression was largely dependent on the nature of the interference (e.g. a 3:1 peptide interfering with a 1:3 peptide vs. a 3:1 peptide), and filtering based on purity measurements has the potential to arbitrarily exclude true positives.

4.4 Conclusion

This work describes the open-source RawQuant tool for simple extraction of meta, scan, and reporter ion data from raw MS files acquired from Thermo Orbitrap mass spectrometers. The utility of RawQuant for parsing of standard Orbitrap MS files to identify scan parameters to be targeted for further optimization was demonstrated using re-analysis of published data from a collection of studies using the Q-Exactive family of instruments. Through analysis with RawQuant, these data reveal the importance of achieving a balance between the fractionation-concatenation scheme and MS acquisition approach to ensure maximum efficiency of operation, and ultimately the optimum results. RawQuant facilitates this type of investigation through providing straightforward access to the data required to generate identifications and metrics (e.g. scan speed, topN) that provide insight into MS operation. In addition to optimization, the identification and performance data compiled from RawQuant outputs can be added to quality control regimes and tracked longitudinally to ensure optimal MS operation.
As a quantification tool, the validity of RawQuant was demonstrated relative to established tools using a collection of label types and study designs. RawQuant provides a simple means of extracting quantitative information from raw files, including numerous values beyond label intensities (e.g. noise, background, resolution, fill times, related scans) that are not reported with other popular software packages (e.g. MaxQuant, TPP). No pre-processing, conversion, or analysis of the data is required prior to RawQuant analysis, enabling direct and rapid interrogation of all quantification data obtained in an acquisition. In addition to quantification signal values, RawQuant also outputs other scan data (e.g. MS1 interference, detected reporter masses, and SPS ions) that can be used to investigate further acquisition properties, such as ion coalescence and parameter-dependent reporter ion purity. In this work, the additional data yielded from RawQuant processing was used to establish the limited utility of filtering of isobaric tag quantification data with a variety of criteria. With the ability to extract meta and scan data, observed reporter ion and SPS masses, MS1 interference measurements, and diverse signal values (e.g. intensity, noise, background), RawQuant offers unique functionality not currently available in other tools for researchers using MS-based proteomics. An important aspect of this tool is that the output of RawQuant can be interfaced with the search data from any tool that outputs scan numbers associated with peptide matches. As a result, novel combinations of identification and parsing or quantification results can be generated without reliance on a given software package having this functionality. In this way, RawQuant supplements the amount of available information for analysis alongside MGF creation. 
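As a minimal sketch of the scan-number interfacing described above, the following Python joins a tab-delimited quantification table to a list of peptide-spectrum matches. The column names (ScanNumber, Intensity126, Intensity127, Peptide) and the inline rows are hypothetical placeholders rather than the actual RawQuant or search-engine headers, so real files should be inspected before adapting it.

```python
import csv
import io

# Hypothetical stand-ins for a quantification table and a search-result export.
# Real column names will differ and must be checked against the actual files.
QUANT_TSV = ('ScanNumber\tIntensity126\tIntensity127\n'
             '10\t1500.0\t3000.0\n'
             '11\t200.0\t250.0\n'
             '12\t900.0\t450.0\n')
PSM_TSV = ('ScanNumber\tPeptide\n'
           '10\tPEPTIDEK\n'
           '12\tSAMPLER\n')


def read_tsv(text):
    '''Parse tab-delimited text into a list of row dictionaries.'''
    return list(csv.DictReader(io.StringIO(text), delimiter='\t'))


def join_on_scan(quant_rows, psm_rows, key='ScanNumber'):
    '''Attach quantification values to each PSM that shares a scan number.'''
    quant_by_scan = {row[key]: row for row in quant_rows}
    joined = []
    for psm in psm_rows:
        quant = quant_by_scan.get(psm[key])
        if quant is not None:  # skip PSMs without a matching quantification row
            joined.append({**quant, **psm})
    return joined


merged = join_on_scan(read_tsv(QUANT_TSV), read_tsv(PSM_TSV))
for row in merged:
    print(row['ScanNumber'], row['Peptide'], row['Intensity126'], row['Intensity127'])
```

Keying the join on the scan number alone is what makes the approach independent of the search tool: any export that reports a scan number per match can be substituted for the PSM table.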
In this work, we focus on the combination of RawQuant output with results processed with the open source tools SearchGUI (X!Tandem, MyriMatch, MS-GF+, Comet) and PeptideShaker. However, many other popular tools, such as MaxQuant and the TPP, output scan numbers with peptide matches and thus can be combined with the more feature-rich information in the RawQuant output, further emphasizing both the broad utility and compatibility of this tool. Furthermore, as the output from RawQuant consists of simple tab-delimited text matrices, the data can be directly processed in a wide variety of R and Python packages. The primary advantage of this open data access is the freedom generated by RawQuant to develop an analysis pipeline bespoke for a given experimental setup, without reliance on obtaining an all-in-one software solution. RawQuant is user-friendly and open-source, ensuring the ability to further adapt and improve the tool as new workflows and data file structures are introduced. Unfortunately, Thermo is currently the only major MS vendor to provide a freely available and modifiable library to support parsing of their raw file format. RawQuant can be readily modified to take as input raw data from other vendor instruments as these tools become available. The general idea of RawQuant is not a novel one, with previous tools (e.g. RawMeat, which is no longer supported) providing some of the parsing functionality described above. However, the ability to parse and openly extract information from raw instrument outputs is one of the primary reasons why next-generation sequencing technologies have seen such success in development of processing tools and inter-laboratory recycling of previously acquired and deposited data. One of the first steps to advancing a technology is fundamentally understanding what is in the output and what impact experimental variables have on the result.
We hope that RawQuant can serve as a catalyst to drive unrestricted development of open-source tools for interrogating MS data from all vendors and hardware. Based on the described aspects and results, RawQuant is positioned to become a prominent utility for MS-based proteomics research.

4.5 Supporting Information

The following files are available free of charge at the ACS website http://pubs.acs.org:
• Supporting methods describing the analysis of the published data that was re-analyzed in this work.
• Supporting figures and legends for Figures S-1 relating to the measurement of acquisition performance in MS2-only runs. Supporting legends for Figures S-3, S-4, S-5, S-6, and S-7 relating to the validation of RawQuant in comparison to PD and application in large isobaric tag data analysis. Supporting legend for Figure S-8 relating to the measurement of ion coalescence. Supporting legends for Figures S-9 and S-10 relating to the variation in reported quantification ratios using MS2 and SPS-MS3. Supporting legends for Figures S-11 and S-12 relating to the filtering of data to assess quality improvement in reported quantification ratios.
• Supporting table and legend of synthetic peptides used in the MS2 and SPS-MS3 comparisons using RawQuant data (Table S-1). Supporting tables and legends displaying typical output from RawQuant of parse data (Table S-2), quant data (Table S-3), and a metrics table (Table S-4). Supporting table and legend of identification metrics for published data that was re-analyzed in this work (Table S-5).

4.6 Acknowledgements

All authors would also like to acknowledge Joao Paulo and Christian Kelstrup for correspondence and data access related to previously published work.

4.7 Funding Sources

This work was supported by the British Columbia Cancer Foundation (G.B.M., C.S.H., S.M.) and a Discovery Grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada. K.A.K.
acknowledges Four-Year Doctoral Fellowships from the University of British Columbia (award numbers - 6569, 6456). S.M. was supported by an Undergraduate Training Award from the British Columbia Proteomics Network.

4.8 Competing Financial Interests

The authors declare no competing financial interests.

4.9 Author Contributions

C.S.H. and K.A.K. conceived the idea, carried out the data analysis, and wrote the manuscript. S.M. performed data analysis and contributed to writing of the manuscript. D.D.Y.C. and G.B.M. contributed to writing of the manuscript.

Chapter 5: RawTools: Rapid and Dynamic Interrogation of Orbitrap Data Files for Mass Spectrometer System Management

5.1 Introduction

The rapid development of mass spectrometry (MS) as a tool to study and characterize the proteome has led to the creation of vast amounts of data of various qualities and levels of information content. Initially, making sense of these data was primarily addressed using software focused on the robust and accurate identification of tandem mass spectrometry (MS2) fragmentation spectra of peptides,111 and subsequent quantification of their signal intensities to determine abundance.112 Continued development efforts have resulted in a collection of search and quantification tools capable of performing a wide variety of analyses on data originating from diverse MS platforms. Recently, more focus has been on the development of tools aimed at monitoring the performance of MS hardware via interrogation of data files generated from analysis of standard samples.17,18 These quality control (QC) software tools provide valuable insight into MS metrics that include peptide identification rates, ion signal, and chromatography performance, which can all be used to temporally monitor instrument performance. However, there remains a need for software tools aimed at diagnostic management of MS systems towards improving the quality and maximizing the information content of generated data.
Parsing of meta, scan, and quantification data from a raw MS file can allow informed decision making related to the optimization of instrument and experiment performance and design. Recently, there has been active development in the parsing of these acquisition metrics from raw MS data.16,24,25 Acquisition data metrics can include information such as scan rate, duty cycle time, ion injection times, and number of triggered dependent events per cycle. These metrics are useful for optimization of MS method parameters and in the provision of greater insight during the tracking of instrument performance. Unfortunately, obtaining these data from raw files using non-commercial and non-vendor provided software is made challenging due to the proprietary nature of the raw MS data file format. To facilitate these types of analyses, we previously developed the RawQuant tool. RawQuant enabled parsing of meta and scan data, in addition to quantification of isobaric tag reporter ions directly from Orbitrap raw MS files.113 Using RawQuant, the ability to efficiently identify key areas where MS performance was suboptimal was demonstrated, along with the limited utility of a variety of isobaric tag quality filtering approaches.113 Importantly, RawQuant was provided as an open-source and freely available tool that generated easily parsed plain text output from raw Orbitrap MS file contents. Although efficient at processing raw Orbitrap files for meta, scan, and quantification data, RawQuant was built around the Thermo MSFileReader library, which rendered it incompatible with non-Windows operating systems. In addition, the development of RawQuant in Python necessitated numerous installation steps and limited processing performance. In this work, we present the development of the RawTools software package as a substantial improvement over RawQuant with improved speed, increased functionality, and expanded operating system compatibility.
RawTools has been built from the ground up in C# to facilitate easier implementation on user machines, as well as to provide significant processing speed gains. Like RawQuant, RawTools is designed to process data directly from any data-dependent acquisition (DDA) .raw file acquired from the Thermo Orbitrap family of instruments, including Q-Exactive, LTQ-Orbitrap, and Tribrid (Fusion) series and their successors (referred to by the generic name Orbitrap for the remainder of the manuscript). RawTools maintains much of the functionality of its predecessor, such as parsing of meta and scan data along with isobaric tag quantification, but has also incorporated new features that include label-free precursor ion quantification and direct assessment of chromatography performance. In addition, RawTools contains a newly developed ‘QC’ functionality for diagnostic analysis using a large number of summarized metrics to facilitate informed decision-making towards achieving and maintaining optimum MS performance. As part of the QC feature, RawTools also directly communicates with the open-source and freely available search tools IdentiPy and X!Tandem to perform a rapid ‘preview’ search of the processed data. This preview search enables tracking of metrics that include proteolysis performance, isobaric labeling efficiency, and MS2 identification rates. Importantly, diagnostic QC analysis with RawTools allows rapid instrument parameter optimization and experiment quality monitoring in near real-time due to the speed and efficiency of the software tool. Moreover, diagnostic and other processed results can be easily dissected and visualized by any user using the newly developed R Shiny application web interface for RawTools. Lastly, RawTools has been built to utilize the Thermo RawFileReader package, enabling universal direct processing of Orbitrap raw files on Windows, MacOS, and Linux platforms.
Like RawQuant, RawTools is open source, freely available, and includes detailed step-by-step and image-rich user documentation to ensure maximum usability by users of all skill levels. Altogether, these features make RawTools a valuable software tool for the rapid and dynamic processing of raw data acquired on Thermo Orbitrap MS instruments towards performance optimization and monitoring to ensure the consistent acquisition of high-quality and information-rich data.

5.2 Experimental Section

5.2.1 RawTools software, documentation, and availability

RawTools was developed in the C# programming language and is covered by the Apache 2.0 license. C# is a .NET language, which makes it natively compatible with the RawFileReader library. The RawFileReader library is developed and distributed by Thermo Scientific under its own license, and is separate from RawTools. RawFileReader is built on the .NET framework, which is compatible with all three major operating systems (Windows, MacOS, Linux). RawTools utilizes RawFileReader in order to access the meta and scan data present in the raw MS files. RawTools was developed based on the use of the .NET framework version 4.6.2 or greater, facilitating support across a wide range of Windows operating systems. For use on Linux and MacOS machines, RawTools was extensively tested using Mono (version 5.12.0.233), which clones the .NET framework for use on Unix systems. RawTools is open source and freely available. The code and compiled versions along with step-by-step walkthroughs and image-rich documentation are available on GitHub: https://github.com/kevinkovalchik/RawTools. In lieu of providing supplementary tables with this work describing the RawTools output, detailed parameter-by-parameter documents are available on the GitHub page. As RawTools will be in active development for the foreseeable future, the software and supporting documentation are in constant flux.
Therefore, the most up-to-date versions can always be found in the publicly accessible GitHub resource. Lastly, an R Shiny application developed to enable direct visualization and interpretation of RawTools QC results is freely available on the web at: https://rawtoolsqcdv.bcgsc.ca. The RawTools R Shiny application is also freely available on the GitHub page if use on a local machine is preferred: https://github.com/kevinkovalchik/RawTools/tree/master/documentation/manuscript/RawTools_RShiny_Application. This web application requires the processed text file outputs from any of the RawTools analysis types (‘parse’ or ‘qc’ functions in RawTools (Figure 5.1)). Example data that can be used with the R Shiny application is available for free download on the RawTools GitHub page (https://github.com/kevinkovalchik/RawTools/tree/master/documentation/).

Figure 5.1 RawTools includes a wide range of built-in utilities and potential functionality for raw MS data processing. Schematic depicting the different functional modules of RawTools. The software is divided into ‘parse’ and ‘qc’ processing pipelines that generate overlapping, but individual data sets as indicated. All functionalities displayed work directly with raw MS files derived from Thermo Orbitrap instruments on Windows, Linux, and MacOS computational hardware.

5.2.2 Cell culture and harvest

HeLa cells were grown and harvested by the National Cell Culture Center (Biovest International). A total of 1 x 10^9 cells were grown and provided as aliquoted pellets at a concentration of 5 x 10^7 cells per tube. Cells were stored at -80°C until use.

5.2.3 Guanidine-based protein isolation, reduction, alkylation, and digestion

Cell pellets (5 x 10^7 cells each) were thawed on ice with periodic vortexing.
To each pellet, 900 μL of lysis buffer (50 mM HEPES pH 8, 4 M guanidine hydrochloride (Sigma, CAT#G3272), 50 mM NaCl (Sigma, CAT#S7653), 10 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP) (Sigma, CAT#C4706), 40 mM chloroacetamide (CAA) (Sigma, CAT#C0267), 1X cOmplete protease inhibitor – EDTA free (Sigma, CAT#11836170001)) was added, and the pellets were mixed using an 18G syringe. Lysis mixtures were transferred to 2 mL FastPrep-compatible tubes containing Lysing Matrix Y (MP Biomedicals, CAT#116960050). Lysis mixtures were vortexed on the FastPrep-24 5G instrument (MP Biomedicals) (6 M/s, 40 seconds, 2 cycles, 120 second rest between cycles). Lysates were centrifuged at 20,000g for 5 minutes and the supernatant recovered. Resultant lysates were heated at 65°C for 15 minutes, and cooled to room temperature (RT) for a further 15 minutes. Protein concentrations were approximated using A280 readings from a NanoDrop instrument (Thermo Scientific). For in-solution digestion, 200 μg of HeLa protein sample was diluted 1:10 with 50 mM HEPES pH 8, and trypsin/rLysC mix (Promega, CAT#V5071) (1:25 enzyme to protein concentration) was added prior to incubation for 14 hours at 37°C in a ThermoMixer with mixing at 1000 rpm (final digestion volume per sample ≈400 μL). Following digestion, mixtures were acidified by addition of trifluoroacetic acid (TFA) (Sigma, CAT#302031) to a 1% final concentration. Tubes were centrifuged at 20,000g for 1 minute to pellet any precipitate and the supernatant was recovered for further processing.

5.2.4 Peptide clean-up

Peptides were desalted and concentrated using TopTips (Glygen). For TopTip clean-up, 1 mL TopTips (CAT#TT3C18) were rinsed twice with 0.6 mL of acetonitrile (Sigma, HPLC-grade, CAT#34998) with 0.1% TFA. Cartridges were then rinsed twice with 0.6 mL of water (Sigma, HPLC-grade, CAT#34877) with 0.1% TFA prior to sample loading.
Loaded samples were rinsed three times with 0.1% formic acid (0.6 mL per rinse) and eluted with 1.2 mL of 80% acetonitrile containing 0.1% formic acid. Eluted peptides were concentrated in a SpeedVac centrifuge (Thermo Scientific) and subsequently reconstituted in 1% formic acid (Thermo Scientific, LC-MS grade, CAT#85178) with 1% dimethylsulfoxide (Sigma, CAT#D4540) in water.

5.2.5 Mass spectrometry data acquisition

Analysis of peptides was carried out on an Orbitrap Velos MS platform (Thermo Scientific). Samples were introduced using an Easy-nLC 1000 system (Thermo Scientific). The Easy-nLC 1000 system was plumbed with a single-column setup using a liquid-junction for spray voltage application. The factory 20 µm ID x 50 cm S-valve column-out line was replaced with a 50 µm ID x 20 cm line to reduce backpressure during operation at high flow rates. The column used for analytical separations was packed in-house in a 200 µm ID capillary that was prepared with a fritted nanospray tip (formamide and Kasil 1640 in a 1:3 ratio) using a laser puller instrument (Sutter Instruments). The 200 µm ID analytical column was packed to a length of 20 cm with 1.9 µm Reprosil-Pur C18 beads (Dr. Maisch) in an acetone slurry. The analytical column was connected to the Orbitrap Velos MS using a modified version of the UWPR Nanospray source (http://proteomicsresource.washington.edu/protocols05/nsisource.php) combined with column heating to 50°C using a 15 cm AgileSLEEVE column oven (Analytical Sales & Service). Prior to each sample injection, the analytical column was equilibrated at 400 bar for a total volume of 4 μL. After injection, sample loading was carried out for a total volume of 8 μL at a pressure of 400 bar.
After loading, elution of peptides was performed with a gradient of mobile phase A (water and 0.1% formic acid) from 3 – 7% B (acetonitrile and 0.1% formic acid) over 2 minutes, to 30% B over 24 minutes, to 80% B over 0.5 minutes, hold at 80% B for 2 minutes, to 3% B in 0.5 minutes, and holding at 3% for 1 minute, at a flow rate of 1.5 μL/min. Data acquisition on the Orbitrap Velos (control software version 2.6.0.1065 SP3) was carried out using a data-dependent method with MS2 in the ion trap. The Orbitrap Velos was operated with a positive ion spray voltage of 2400 V and a transfer tube temperature of 325°C. Survey scans (MS1) were acquired in the Orbitrap at a resolution of 30K, across a mass range of 400 – 1200 m/z, with an S-Lens RF lens setting of 60, an AGC target of 1e6, and a max injection time of 10 ms in profile mode. For dependent scans, an intensity filter of 1e3, charge state selection of 2 – 4 charges, and dynamic exclusion for 15 seconds with 10 ppm low and high tolerances were used. A 2 m/z window was used prior to CID fragmentation with a setting of 35%. Data acquisition was carried out in the ion trap using the ‘Normal’ scan rate, an AGC target of 1e4, and a max injection time of 100 ms in centroid mode.

5.2.6 Mass spectrometry data analysis

All data files were processed with RawTools as described in the main text. For comparison of RawTools and ProteoWizard114 generated MGF output, a combination of SearchCLI (version 3.3.1)103,115 and PeptideShakerCLI (version 1.16.23)104 was used. All searches used the X!Tandem (version 2015.12.15.2)47 algorithm. MS2 data were searched against a UniProt human proteome database (version 2018_09) containing common contaminants (The Global Proteome Machine cRAP sequences - https://www.thegpm.org/crap/) that was appended to reversed sequences generated using the -decoy tag of FastaCLI in SearchCLI (42,190 total sequences, 21,095 target).
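The target-decoy database described above pairs every target sequence with a reversed copy, which is why the total (42,190) is exactly twice the target count (21,095). The following Python is a minimal sketch of that idea only; it is not the FastaCLI implementation, and the decoy suffix and the two toy sequences are assumptions made for illustration.

```python
def parse_fasta(text):
    '''Yield (header, sequence) pairs from FASTA-formatted text.'''
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def add_reversed_decoys(entries, tag='_REVERSED'):
    '''Return the targets followed by one reversed-sequence decoy per target.

    The tag is an assumed label, chosen here only to mark decoy entries.
    '''
    targets = list(entries)
    decoys = [(header + tag, seq[::-1]) for header, seq in targets]
    return targets + decoys


# Two toy entries stand in for the real proteome plus contaminant sequences.
FASTA = '>sp|P1|TEST1\nMKTAYIAK\nQR\n>sp|P2|TEST2\nMSLLNK\n'
database = add_reversed_decoys(parse_fasta(FASTA))
n_targets = sum(1 for header, _ in database if not header.endswith('_REVERSED'))
print(len(database), n_targets)
```

Because each decoy preserves the length and amino acid composition of its target, matches to decoys give an estimate of how often random matches to targets occur at a given score.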
Identification parameter files were generated using IdentificationParametersCLI in SearchCLI specifying precursor and fragment tolerances of 20 ppm and 0.5 Da, carbamidomethyl of cysteine as a fixed modification, and oxidation of methionine and acetylation of protein N-term as variable modifications. Trypsin enzyme rules were specified, allowing up to 2 missed cleavages. All SearchCLI results were processed into PSM, peptide, and protein sets using PeptideShakerCLI. Error rates are controlled in PeptideShakerCLI using the target-decoy search strategy to determine false-discovery rates (FDR). Hits from multiple search engines are unified using posterior error probabilities determined from the target-decoy search strategy104. Results reports were exported from PeptideShakerCLI using the ReportCLI with numeric values provided to the -reports tag to provide the ‘Certificate of Analysis’, ‘Default Protein Report’, ‘Default Peptide Report’, and ‘Default PSM Report’. All results (PSM, Peptide, Protein) were filtered to provide a final FDR level of <1%. Final .mzid files were generated from PeptideShakerCLI using MzidCLI with the default parameters. For peptide matching as part of the diagnostic ‘preview’ search functionality of RawTools, the IdentiPy (commit version 0275e13)116 search engine was used. A total of 1000 MS2 scans (adjustable via the ‘-N’ flag) were extracted from each raw file and fed into IdentiPy on-the-fly. MS2 data were searched against a UniProt human proteome database (version 2018_09) containing common contaminants (21,095 total target sequences). Decoy proteins were generated on-the-fly by IdentiPy using the ‘reverse’ specification for target-decoy analysis. RawTools automatically reads the instrument configuration and adjusts the mass accuracy settings based on the determined mass analyzer. Left and right precursor accuracy settings were set at 10 ppm, with product accuracy at 0.5 Da for the LTQ Velos Orbitrap MS1 and MS2 ion trap data.
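To make the relative (ppm) precursor tolerance concrete, the sketch below converts a ppm tolerance into an absolute m/z window and tests whether an observed value falls inside it. The function names and example m/z values are illustrative assumptions and do not reflect how IdentiPy or X!Tandem implement their tolerance checks.

```python
def ppm_error(observed_mz, theoretical_mz):
    '''Signed mass error of an observation in parts per million.'''
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_ppm(observed_mz, theoretical_mz, tolerance_ppm=10.0):
    '''True when the absolute ppm error does not exceed the tolerance.'''
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tolerance_ppm


def ppm_window(theoretical_mz, tolerance_ppm=10.0):
    '''Absolute m/z limits corresponding to a symmetric ppm tolerance.'''
    delta = theoretical_mz * tolerance_ppm / 1e6
    return theoretical_mz - delta, theoretical_mz + delta


# At m/z 800 a 10 ppm tolerance spans only 0.008 m/z on either side, far
# narrower than the fixed 0.5 Da tolerance used for ion trap fragment data.
low, high = ppm_window(800.0)
print(low, high)
print(within_ppm(800.004, 800.0))  # roughly 5 ppm away
print(within_ppm(800.012, 800.0))  # roughly 15 ppm away
```

A ppm tolerance scales with m/z, so the same setting is tighter in absolute terms for light precursors than for heavy ones, whereas a Da tolerance is constant across the mass range.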
Carbamidomethylation of cysteine was set as a fixed modification and oxidation of methionine as variable. Trypsin enzyme rules were specified, allowing up to 2 missed cleavages. To enhance speed, the ‘auto-tune’ functionality of IdentiPy is automatically disabled in RawTools. IdentiPy results are filtered by taking the 95th percentile decoy score and keeping all target hits above this value.

5.2.7 General statistical parameters

In all boxplots, center lines in plotted boxes indicate the median, upper and lower lines the 75th and 25th percentiles, and upper and lower whiskers 1.5X the interquartile range. The calculation of individual p-values was performed using two-sided Student’s t-tests of sample sets, unless otherwise noted.

5.2.8 Data and code availability

The mass spectrometry proteomics data were deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository117 with the dataset identifiers: PXD011070 (fraction subset data), and PXD011069 (diagnostic QC data). The repository contains all raw data, search results, and sequence database files. R Notebook files detailing data analysis and figure creation for this manuscript are all freely available on the RawTools GitHub page: https://github.com/kevinkovalchik/RawTools/tree/master/documentation/manuscript/Rscripts_for_data-analysis.

5.3 Results and Discussion

RawTools was developed to improve the previously described RawQuant113 tool that was used for parsing of scan and quantification data directly from raw files acquired on Thermo Orbitrap MS instruments. The original use of the Python programming language made installation and use of RawQuant less user-friendly and ultimately limited the efficiency of raw file processing. In addition, RawQuant was built around the use of the Thermo MSFileReader library, which restricted raw file processing to Windows systems.
To improve on these aspects, RawTools has been developed in the ‘.NET’ language C# and was built to utilize the Thermo RawFileReader library. As a result of these changes, RawTools is now distributed as a single, user-friendly package that rapidly processes raw MS data derived from Thermo Orbitrap instruments in an operating system-independent manner (e.g. Windows, MacOS, and Linux). Like RawQuant, RawTools includes functionality for parsing scan and quantification data from the raw MS files of a variety of instrument architectures (e.g. Q-Exactive and Fusion families – including the HF-X and Lumos, LTQ-Orbitrap family) without any pre-conversion or processing (Figure 5.1). However, RawTools includes new functionalities to better facilitate interrogation of parsed scan and quantification data, including: 1. An improved ‘Metrics’ text output that provides detailed summary information on MS performance, 2. Improved scan ‘Matrix’ text output, including scan-by-scan measures of duty cycle time, number of triggered MS2 scans, precursor ion abundance and elution windows, 3. Automatic linking of identification and quantification scans when using isobaric tag quantification, and 4. Direct output of total or base-peak intensity chromatographic data. As with RawQuant, all processed data are output in simple tab-delimited text files that can be easily input into supplementary analysis tools for visualization (e.g. Excel, Python, R). RawTools can also be used to generate MGF output for use with database matching search tools. Moreover, RawTools now includes additional functionality that enables recalibration of precursor masses and charge states, as well as mass and intensity filtering of the generated MGF output (e.g. to remove isobaric reporter ions prior to data searching).
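The mass and intensity filtering of MGF peak lists mentioned above can be illustrated with a short Python sketch. This is not the RawTools implementation: the excluded m/z window (chosen here to cover the TMT reporter ion region near m/z 126 to 131), the intensity cutoff, and the toy spectrum are all assumptions for illustration.

```python
def keep_peak(mz, intensity, mz_exclude=(126.0, 131.5), min_intensity=0.0):
    '''True when a peak lies outside the excluded window and meets the cutoff.'''
    low, high = mz_exclude
    return not (low <= mz <= high) and intensity >= min_intensity


def filter_mgf(text, **kwargs):
    '''Filter peak lines of MGF text; header and delimiter lines pass through.'''
    out = []
    for line in text.splitlines():
        parts = line.split()
        try:
            mz, intensity = float(parts[0]), float(parts[1])
        except (IndexError, ValueError):
            out.append(line)  # BEGIN IONS, TITLE=..., PEPMASS=..., END IONS
            continue
        if keep_peak(mz, intensity, **kwargs):
            out.append(line)
    return '\n'.join(out)


# Toy spectrum: two peaks in the assumed reporter window, one low-intensity peak.
MGF = ('BEGIN IONS\nTITLE=scan 1\nPEPMASS=500.25 1000\nCHARGE=2+\n'
       '126.1277 5000.0\n130.1411 4000.0\n175.1190 800.0\n300.2000 5.0\n'
       'END IONS')
filtered = filter_mgf(MGF, min_intensity=10.0)
print(filtered)
```

Removing reporter ions before searching keeps intense tag-derived peaks from being scored as unmatched fragments, which is the motivation given in the text for this kind of filter.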
From a quantification standpoint, RawTools incorporates the functionality of RawQuant for the extraction of isobaric tag quantification data directly from raw files (including signal intensity, noise, resolution, and isolated precursor purity), while also adding the ability to perform precursor ion abundance analysis (Figure 5.1). RawTools also includes a newly developed implementation of a diagnostic ‘QC’ feature that can be used to achieve and maintain optimum instrument performance (Figure 5.1). The diagnostic QC feature of RawTools uses a combination of the outputs gathered from the parsing of scan and quantification data to compile summary data that can be used to aid in informed decision-making related to instrument performance. In addition, the diagnostic QC analysis includes an integrated ‘preview’ database search using either the IdentiPy or X!Tandem tools, facilitating calculation and monitoring of a wide collection of identification-related metrics, including: 1. MS2 identification rates, 2. Enzyme cleavage efficiency, and 3. Modification or labeling efficiency. Importantly, due to the exceptional speed of RawTools and the direct processing of the raw output data from an Orbitrap MS, diagnostic QC analysis can be completed in near real-time. As a final feature to aid in the visualization and interpretation of RawTools diagnostic QC results for users of all experience levels, an R Shiny application has been developed that generates high-quality plots of summarized user input data through a publicly accessible web interface. To demonstrate the functionality of RawTools for data processing and analysis, a single tryptic digest of a HeLa cell lysate was prepared and injected repeatedly (n = 140 individual injections) in 30-minute runs on an Orbitrap Velos MS instrument. The LC-MS system was cleaned and calibrated and a fresh chromatography column prepared and equilibrated prior to the first injection to ensure optimum performance.
This design allowed for the direct visualization of the performance degradation of the Velos MS over time. This set of 140 raw files was used to demonstrate the performance and utility of RawTools in a variety of scenarios as described below.

5.3.1 RawTools enables efficient parsing and analysis of raw Orbitrap MS files

To first demonstrate the basic parsing functionality of RawTools, the raw files from the first 10 injections of the 140-raw file set were used as a test data set. As these represent the initial injections, these data files should represent the optimum of instrument performance. To benchmark the performance of RawTools, the test set was processed separately on Windows (Core i7-6400 @ 3.4GHz, 32GB of RAM, 64-bit Windows 10) and Linux (Xeon E5-2690 @ 2.9GHz, 132GB of RAM, CentOS 7) systems. Each of the computational setups efficiently processed the entire 10-file data set, requiring just 01m12s and 02m07s (for simplicity, time is given in the notation of hours, minutes, seconds – 00h00m00s) on the Windows and Linux systems, respectively. To put these times in context, on the same Windows machine, RawQuant (version 0.2.3) and ProteoWizard (version 3.0.18225 64-bit) required 2m36s and 1m53s, respectively, just to generate individual MGF outputs for each of the files across the entire set. However, as part of processing, RawTools was also generating summarized metrics and parsed scan and quantification data files in this same time window along with MGF creation. Compared to its predecessor RawQuant, this represented a 116% speed improvement despite the extraction of additional information by the RawTools software. During parsing, RawTools can be used to generate two types of text output, Metrics (-x flag in RawTools) and Matrix (-p or -q flag in RawTools, combined with -u for precursor quantification) files. The Metrics files contain summarized information on the MS operation during the acquisition.
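The benchmark timings quoted above can be checked with a short calculation. The sketch below assumes that the speed improvement is defined as the ratio of run times minus one, which is our reading of the figure rather than a definition stated in the text.

```python
import re


def to_seconds(stamp):
    '''Convert strings such as 01m12s or 00h02m36s to whole seconds.'''
    match = re.fullmatch(r'(?:(\d+)h)?(\d+)m(\d+)s', stamp)
    if match is None:
        raise ValueError('unrecognized time stamp: ' + stamp)
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


rawtools = to_seconds('01m12s')  # Windows run time for the 10-file set
rawquant = to_seconds('2m36s')   # RawQuant MGF generation on the same machine
speedup = rawquant / rawtools
improvement_pct = (speedup - 1) * 100
print(rawtools, rawquant, round(speedup, 2), round(improvement_pct, 1))
```

Under that assumption the ratio is a little over twofold, which agrees with the reported improvement of roughly 116% once the value is truncated rather than rounded.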
Using the metrics data, properties such as the scan numbers, rates, duty cycle characteristics, and values relating to chromatography performance (e.g. column peak capacity) can be easily visualized (Figure 5.2a-f). Interrogation of these values can provide valuable insight into targetable areas for improvement of instrument performance, as demonstrated previously113. In the case of the 10-replicate subset examined here, the summarized metrics data illuminated a potential problem with the third replicate. Despite being a repeat of the other runs, the third injection had a reduced MS2 acquisition rate resulting in fewer scans acquired overall (Figure 5.2). However, although it was clear something was wrong with this replicate, it was not immediately obvious from the summarized metrics exactly what the issue was.

Figure 5.2 RawTools enables rapid and dynamic analysis of raw data files to illuminate MS performance. A subset (n = 10) of raw files acquired as part of a replicate injection set derived from a HeLa tryptic digest were analyzed with the RawTools parse functionality to generate ‘Metrics’ files. The resultant text output from RawTools was investigated to generate insights into: (a) Scan numbers, (b) MS2 scan rates, (c) Numbers of dependent scans triggered per MS1, (d) Duty cycle duration, (e) Chromatographic peak width, and (f) Column peak capacity. Dashed lines on each plot indicate the mean across the 10 replicate injections for the displayed values.

To investigate the third injection further, the Matrix output of RawTools was used. When RawTools is invoked with the ‘-p’ flag, the Matrix output contains individual information for all scan events contained within the raw MS file. Alternatively, when used with the ‘-q’ flag, the scan Matrix output from RawTools additionally contains information related to isobaric tag quantification.
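A scan-level table of this kind lends itself to simple automated checks. As a hedged sketch, the Python below looks for stretches of retention time in which no MS2 scan was acquired. The column names (MSOrder, RetentionTime), the one-minute threshold, and the inline rows are hypothetical and should be replaced with the headers documented for the actual Matrix file.

```python
import csv
import io


def find_ms2_gaps(rows, min_gap_minutes=1.0):
    '''Return (start, end) retention times between consecutive MS2 scans
    that are separated by at least min_gap_minutes.'''
    times = sorted(float(r['RetentionTime']) for r in rows if r['MSOrder'] == '2')
    return [(a, b) for a, b in zip(times, times[1:]) if b - a >= min_gap_minutes]


# Toy scan table: MS1 scans continue throughout, but no MS2 scan is
# triggered between 5.02 and 10.04 minutes.
MATRIX_TSV = ('MSOrder\tRetentionTime\n'
              '1\t4.95\n2\t5.00\n2\t5.02\n1\t5.05\n'
              '1\t7.50\n1\t10.00\n2\t10.04\n2\t10.06\n')
rows = list(csv.DictReader(io.StringIO(MATRIX_TSV), delimiter='\t'))
gaps = find_ms2_gaps(rows)
print(gaps)
```

The gap threshold is a judgment call: it needs to exceed the normal spacing between dependent scans while remaining shorter than the interruptions one wants to catch.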
With the Matrix output from replicate 3 of the subset, plotting of the MS2 acquisition rate and MS1 precursor ion intensities across the entire acquisition window revealed an unexplained gap in the electrospray in the early stages of the gradient, lasting just 5 minutes (Figure 5.3a-b). This short drop in spray resulted in no triggered MS2 scans for its duration, but did not impact the overall quality of chromatography as measured by peak shape and column capacity (Figure 5.2e-f). These data highlight the utility of measuring multiple metrics of instrument performance, and the ability of RawTools to facilitate easy illumination of potentially sub-optimal MS runs.

Figure 5.3 RawTools enables simplified detection of errors that occur during MS acquisition. The third injection from the set of injections 1 – 10 was examined further using the RawTools parse functionality to generate scan ‘Matrix’ files to determine the cause of the difference in relation to the other replicate samples. The resultant text output from RawTools was used to identify a break in the spray being generated from the nanospray tip, observed as a gap in (a) MS2 scan acquisition and (b) Intensity of precursors in MS1 scans. Red arrows indicate regions of nanospray instability.

Lastly, to establish the validity of the MGF output from RawTools, a database search was carried out and the results were compared with those from a file generated using the established ProteoWizard software tool. The MGF outputs for the 10-replicate subset from RawTools and ProteoWizard were individually searched using SearchCLI and PeptideShaker. Comparison of the unique peptide and protein identification rates (1% FDR) revealed no significant differences between the results when using MGF outputs from the individual software tools (p = 0.99 for identification numbers of peptides and proteins, 99% overlap in identifications for both peptides and proteins between RawTools and ProteoWizard sets) (Figure 5.4).
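The spray dropout found in the third injection was identified by plotting scan-level data, but the same kind of gap can also be flagged programmatically. The sketch below is illustrative only and is not the RawTools implementation: it assumes that the retention times (in minutes) of the MS2 scans have already been read from a scan-level table, and the function name find_ms2_gaps and the one-minute threshold are hypothetical choices.

```python
def find_ms2_gaps(ms2_times_min, min_gap_min=1.0):
    '''Return (start, end) pairs, in minutes, where consecutive MS2 scans
    are separated by more than min_gap_min minutes.'''
    times = sorted(ms2_times_min)
    gaps = []
    for earlier, later in zip(times, times[1:]):
        if later - earlier > min_gap_min:
            gaps.append((earlier, later))
    return gaps


# Hypothetical MS2 retention times with a dropout of about five minutes.
example_times = [10.0, 10.1, 10.2, 15.3, 15.4, 15.5]
print(find_ms2_gaps(example_times))
```

For the example values, the only reported interval is the one between 10.2 and 15.3 minutes. In practice the threshold would need to be chosen relative to the expected duty cycle of the method.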
Using RawTools recalibrated masses and charge states results in modest increases in peptide and protein identifications compared to the standard MGF creation routine. Taken together, these data demonstrate the effective parsing and extraction of meta and scan data from Thermo Orbitrap raw files using RawTools towards general identification analyses or in-depth examination of MS performance.

Figure 5.4 MGF output generated by RawTools is equivalent to standard software tools as measured by identification rates. MGF output generated using RawTools (with and without precursor and charge state recalibration) and ProteoWizard were individually searched using X!Tandem as part of SearchCLI and PeptideShakerCLI to generate peptide and protein identification results. The boxplots display the total numbers of (a) Peptide and (b) Protein identifications for the subset of injections (n = 10) using MGF output from the two separate software tools. Outlier points in the peptide match plots are from injection 3, as discussed in the main manuscript.

5.3.2 RawTools facilitates robust tracking of acquisition performance

RawTools includes a newly developed diagnostic ‘QC’ feature designed for the visualization of instrument performance. In addition to providing summarized performance metrics as before (e.g. scan numbers and rates, chromatography performance), RawTools diagnostic QC output currently includes other supplementary data: 1. Electrospray stability, 2. Gradient elution performance, and 3. MS2 fragment signal distributions. In addition, RawTools can employ the IdentiPy and X!Tandem search tools to provide a ‘preview’ functionality that enables tracking of data in QC that includes: 1. MS2 identification rate, 2. Modification efficiency (e.g. labeling efficiency), and 3. Analyzer mass error. The output of RawTools QC is a single comma-separated file containing compiled data from all files processed to date.
Newly acquired files can be appended to this compiled file by simply placing them in a target directory and re-triggering the RawTools QC command. Although this operation is not true real-time because it is not automatically scanning over a directory to monitor for newly generated data, this near real-time implementation allows users greater control over which files are included in a diagnostic QC set and the ability to carry out these tasks in a remote location. Visualization and interpretation of these data are further facilitated via the freely available RawTools R Shiny application web interface.

To demonstrate the performance and utility of this QC feature, the entire 140-raw file set was processed with RawTools on a Linux system (Xeon E5-2690 @ 2.9GHz, 132GB of RAM, CentOS 7). Processing of the entire 140-raw file set for diagnostic QC information required 24m52s without the database search, and 3h36m57s with IdentiPy analysis (01m30s per file, auto-tune disabled, n = 1 variable modification, n = 1000 MS2 spectra) on the Linux system. Although this processing time appears to be substantial, the majority of this time is devoted to the IdentiPy search. For comparison, performing IdentiPy analysis independent of RawTools requires 03h18m35s for the entire 140-raw file set (01m25s per file, auto-tune disabled, n = 1 variable modification, n = 1000 MS2 spectra). Therefore, the RawTools component of the analysis required just 18 minutes (8% of the total processing time). The processing speed when using the ‘preview’ search functionality is improved when using the more rapid search of X!Tandem [47], where just 35m32s is required to process the entire set with RawTools.

Investigation of the scan totals calculated by RawTools across the 140-injection set illuminated a gradual but steady decrease in the numbers of acquired MS2 events as the injections continued (Figure 5.5a).
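The append-on-rerun behaviour of the compiled QC file described above can be illustrated with a short sketch. This is not the RawTools code, which is written in C#; it is an assumption-laden Python illustration in which update_qc_table, the column name file, and the caller-supplied summarize function are all hypothetical. Each invocation processes only those .raw files in the target directory that are not yet listed in the compiled file.

```python
import csv
import os


def update_qc_table(raw_dir, qc_csv, summarize):
    '''Append one row per not-yet-processed .raw file in raw_dir to qc_csv.

    summarize is a caller-supplied function (hypothetical here) that maps a
    file path to a dict of metric names and values.
    '''
    done = set()
    if os.path.exists(qc_csv):
        with open(qc_csv, newline='') as handle:
            done = {row['file'] for row in csv.DictReader(handle)}

    new_files = sorted(
        name for name in os.listdir(raw_dir)
        if name.lower().endswith('.raw') and name not in done
    )
    if not new_files:
        return []

    rows = [dict(file=name, **summarize(os.path.join(raw_dir, name)))
            for name in new_files]

    # Write a header only when the compiled file is new or empty.
    write_header = not os.path.exists(qc_csv) or os.path.getsize(qc_csv) == 0
    with open(qc_csv, 'a', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
    return new_files
```

Because the function is only run when the user calls it, the user retains control over when and which files enter the QC set, which mirrors the near real-time design choice discussed above. The sketch assumes that summarize always returns the same metric names, so that appended rows match the existing header.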
Using the values for the summed signal intensities from MS1 events within each raw file, a continual decrease was again observable, but with specific outliers appearing (Figure 5.5b). Using the injection 23 outlier as an example for further investigation, the RawTools chromatogram output of this file revealed inconsistencies in the base peak intensity patterns, indicative of unstable electrospray (Figure 5.5c). These spray instability events were also easily observed for both the 3rd (as discussed above) and 23rd replicate injections based on the electrospray stability output of RawTools (Stability = Number of MS1 scans whose neighbor differs in signal by >10-fold) (Figure 5.6). Although minor in terms of duration and frequency, these events made enough of an impact to be flagged as potentially problematic via user interpretation of the generated RawTools output.

Figure 5.5 RawTools QC analysis facilitates illumination of variation in MS operational performance. The entire set of (n = 140) HeLa replicate injections was analyzed with the diagnostic QC feature of RawTools to generate a single comma-separated summary output. The resultant data were used to probe (a) Scan numbers and (b) MS1 intensities to reveal inconsistencies. (c) Selected base peak chromatogram of MS1 intensities demonstrating spray instability. The inset image is a zoom of the selected chromatogram from 5 to 15 minutes. Red arrows on the inset plot indicate areas of electrospray instability. The RawTools data were further examined to highlight instrument performance degradation via (d) MS2 intensities, (e) MS2 injection times, and (f) IdentiPy MS2 spectral identification rates. Dashed red lines indicate the 100th sample injection.

Figure 5.6 RawTools enables illumination of electrospray instability events. The entire replicate injection set (n = 140) was examined to calculate the stability of electrospray across acquisitions.
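The electrospray stability metric defined above is simple enough to sketch directly. The following is one possible reading of that definition, not the RawTools source: it assumes the summed intensity of each MS1 scan is available in acquisition order, and it counts adjacent scan pairs rather than individual scans, which is an interpretive assumption. The function name count_unstable_pairs is hypothetical.

```python
def count_unstable_pairs(ms1_intensities, fold=10.0):
    '''Count adjacent MS1 scans whose summed intensities differ by more
    than the given fold change (10-fold by default).'''
    count = 0
    for first, second in zip(ms1_intensities, ms1_intensities[1:]):
        low, high = min(first, second), max(first, second)
        # A drop to zero signal next to a non-zero scan also counts.
        if high > fold * low:
            count += 1
    return count


# Hypothetical summed MS1 intensities with one brief spray dropout.
example = [1e9, 9e8, 5e7, 8e8, 1e9]
print(count_unstable_pairs(example))
```

In the example, the dropout produces two flagged transitions, one entering and one leaving the low-signal scan, so a single short instability event contributes more than one count.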
Stability is calculated as the number of MS1 events where neighboring scans differ in their summed intensity by more than 10-fold. Red circled points indicate injections 3 and 23 as replicates where problematic performance had been observed as indicated by other metrics and discussed in the main text.

As expected, across the 140 injections the steady decline in MS1 signal was indicative of a loss of sensitivity in the MS, a trend that is easily visible in both the decrease in summed signals observed in MS2 scans, as well as the increasing ion injection times required to hit the same specified ion target value (Figure 5.5d-e). This drop in MS2 quality is reflected in the observed decrease in peptide identification rates as determined with the ‘preview’ search done with IdentiPy (Figure 5.5f). This drop in spectral identification rate may also include a contribution by the observed drift in the accuracy of the mass analyzer used for MS1 precursor acquisition across the replicate injections (Figure 5.7). RawTools also enabled identification of areas where stability is observed, such as sample preparation metrics like enzyme digestion efficiency and the proportions of oxidation events on methionine-containing peptides (Figure 5.8). Although not demonstrated here, the modification frequency calculations performed by RawTools can also be used to track properties such as labeling efficiency when using isobaric tags. Altogether, these results demonstrate the utility of RawTools for processing data in a QC setting towards longitudinal tracking of instrument and experiment performance, and highlight the ability to identify even minute disruptions in expected MS operation.

Figure 5.7 RawTools highlights errors in mass detection in MS2 spectra. The entire replicate injection set (n = 140) was examined using the QC feature of RawTools with IdentiPy.
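The identification-based QC values shown in Figures 5.7 and 5.8 follow from simple ratios, which can be sketched as below. This is a minimal illustration under stated assumptions, not the RawTools implementation: the function names and the input shapes (a list of missed-cleavage counts, and a list of sequence and oxidation-count pairs) are hypothetical, and the mass error uses the conventional parts-per-million formula.

```python
def ppm_error(observed_mass, calculated_mass):
    '''Signed mass error in parts per million; take abs() for the
    absolute error.'''
    return (observed_mass - calculated_mass) / calculated_mass * 1e6


def digestion_efficiency(missed_cleavages):
    '''Proportion of identified peptides with no missed cleavages.'''
    fully_cleaved = sum(1 for n in missed_cleavages if n == 0)
    return fully_cleaved / len(missed_cleavages)


def methionine_oxidation_frequency(peptides):
    '''Observed oxidation events as a proportion of available methionine
    sites. peptides is a list of (sequence, n_oxidized) pairs.'''
    sites = sum(sequence.count('M') for sequence, _ in peptides)
    oxidized = sum(n_oxidized for _, n_oxidized in peptides)
    return oxidized / sites if sites else 0.0


# Hypothetical peptide hits: sequence and number of oxidized methionines.
hits = [('PEPTMIDEK', 1), ('MAMK', 0), ('AAAK', 0)]
print(abs(ppm_error(1000.001, 1000.0)))
print(digestion_efficiency([0, 0, 1, 0]))
print(methionine_oxidation_frequency(hits))
```

Because all three values are computed from the peptide hits of a limited preview search, they are best read as trends across injections rather than as exact measurements for any single file.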
The observed masses of detected peptides were compared to the calculated sequence values on-the-fly by RawTools to determine mass errors. The absolute mass errors for identified peptides are displayed on a parts-per-million scale. The red dashed line indicates the 100th injection of the standard sample.

Figure 5.8 RawTools data reveal information from detected peptides that can be used to monitor stability in sample preparation. The entire replicate injection set (n = 140) was examined to calculate the tryptic digestion efficiency and the rate of oxidation of methionine. (a) Scatter plot of enzyme digestion efficiency, defined as the number of peptides with no missed cleavages as a proportion of the total number of peptides. (b) Scatter plot of methionine oxidation frequency, calculated as the number of observed oxidation events as a proportion of the number of available methionine sites. Both data sets are based upon the peptide hits generated by IdentiPy when used as part of the diagnostic QC processing in RawTools (n = 1000 MS2 spectra searched per file).

5.4 Conclusion

In this work, the development and use of the RawTools software tool is described. RawTools is a simplified and streamlined version of the previously presented RawQuant tool that now provides improved flexibility in implementation, additional data outputs, and a QC feature for performance tracking. Importantly, RawTools operates directly on the raw output of Thermo Orbitrap MS instruments with no additional data conversion. Substantial effort has gone into simplifying RawTools and the output from the tool in order to improve usability for users of all skill levels. As an example of this, the RawTools R Shiny application provides users with a simple point-and-click interface through which their data can be visualized. It is also noteworthy that the majority of the plots used in this work were generated using code from the RawTools R Shiny application.
Although not explicitly discussed in detail here, RawTools has substantial utility in the optimization of MS method acquisition parameters and in the extraction of quantification results directly from raw MS data, as established previously [113]. Importantly, all the data discussed here are available almost immediately after MS analysis due to the exceptionally rapid processing times of RawTools, the ability to work directly with raw Orbitrap files, and the ease with which the output data can be handled via the provided web-application. Taken together, the improved design and newly implemented features of RawTools position it as a powerful software tool for interrogation of MS operation dynamics and performance.

5.5 Supporting Information

The following files are available free of charge at the ACS website http://pubs.acs.org:
• Supporting information describing the equivalency in MGF output between software tools (Figure S1). Information relating to the measured instability of electrospray during extended MS analysis (Figure S2) and the increase in mass detection error (Figure S3). Information relating to the monitoring of the stability in sample preparation using digestion efficiency and methionine oxidation frequency (Figure S4).

5.6 Acknowledgements

C.S.H. would like to acknowledge valuable input and discussions from Lida Radan and Philipp Lange. All authors would like to acknowledge valuable support from the Genome Sciences Centre Systems department, specifically Ross Stevenson and Brendan O’Huiginn for their help with hosting the R Shiny web application.

5.7 Funding Sources

This work was supported by the British Columbia Cancer Foundation (G.B.M., C.S.H., S.M.) and a Discovery Grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada (G.B.M. and D.D.Y.C.). K.A.K. acknowledges Four-Year Doctoral Fellowships from the University of British Columbia (award numbers - 6569, 6456) and a Gladys Estella Laird Research Fellowship (award number 4818).
5.8 Competing Financial Interests

The authors declare no competing financial interests.

5.9 Author Contributions

K.A.K. and C.S.H. conceived the idea, carried out the data analysis, and wrote the manuscript. S.C. and S.S. helped with data acquisition and tool design. P.H.S., D.D.Y.C., and G.B.M. contributed to writing of the manuscript.

Chapter 6: Concluding Remarks and Future Work

6.1 Concluding Remarks

The material presented in this thesis provides insight into new directions of computationally-driven NA research and practical software tools for use in mass spectrometry-based proteomics research. A thorough literature review was performed to identify key considerations in the development of a standard mass spectrometry-based method for the quantification of total NAs in aqueous samples. Suggestions were made as to how these considerations could be addressed in method design. These suggestions were designed not only to provide the optimal method outline, but also to ease the adoption of such a standard method by choosing instruments and methods which are commonly used or easily adapted to most laboratory settings. This review was the first published work directed toward the development of a standard method for the quantification of NAs. A computational method of resolving dicarboxylic acids in mixtures of amine-derivatized NAFCs utilizing high resolution mass spectrometry was described. The publication of this method served as proof-of-concept. While the method described was limited, it demonstrated the potential for derivatization and careful data analysis to aid in the characterization of specific classes of NAFCs. A computational software tool, RawQuant, was developed with the Python programming language for extracting scan meta data and quantification data from Thermo Orbitrap mass spectrometer data files of DDA proteomics experiments.
The main goal of this tool is the description of instrument performance during a run, facilitating insight into which MS parameters might need adjusting for method optimization. Furthermore, the tool provides the ability to quantify precursor ion peak areas and reporter ion intensities in isobaric labeled samples (e.g. TMT labeled) and to generate MGF files to facilitate downstream identifications using a database search engine. A second generation of the RawQuant tool, called RawTools, was developed in the C# programming language. RawTools provides significant performance improvements over RawQuant, but most importantly allows for the automated longitudinal tracking of performance metrics in multi-run experiments (i.e. multiple mass spectrometry data files). This tracking data is provided in a user-friendly .csv file, allowing for easy visualization of the metrics as a function of time. This implementation requires user interpretation of the data, but the metrics are transparent enough that any MS operator should be able to instantly understand their importance (e.g. the metrics directly reflect specific aspects of the MS performance with little in the way of transformation). Furthermore, manual interpretation allows the MS operator to intervene with instrument maintenance or adjustment of acquisition parameters before the instrument fails control and there is a loss of data.

6.2 Further Work

6.2.1 Isobaric Labelling of Naphthenic Acid Fraction Compounds

In Chapter 3: we presented a method of using amine derivatization to characterize dicarboxylic acids in mixtures of NAFCs. The chemistry involved in the isobaric labeling of peptides with TMT or iTRAQ reagents (as in Chapter 4: and Chapter 5:) is quite similar to that involved in the derivatization of NAFCs [33,66,78,118], which suggests that isobaric labelling and multiplexing of NAFCs could be accomplished.
The chromatographic separation of NAFCs might not be efficient enough to allow for DDA MS methods to be used in the analysis of multiplexed NAFCs, but an analysis method similar to that used by Woudneh et al. [78] could be employed, where the full mass range is fragmented. This would allow for relative quantification of total NAFCs. Using TMT labeling reagents would allow for up to 11 NAFC mixtures to be analyzed in a single MS run, vastly reducing instrument time and increasing throughput.

6.2.2 Advanced Method Development Techniques for Mass Spectrometry Proteomics Experiments

We described a tool, RawQuant, which facilitates the interrogation of MS acquisition performance toward method development. However, more can be accomplished to assist with method development. Method development experiments typically test a single set of MS acquisition parameters at a time. However, it is not uncommon for MS methods to contain differing acquisition parameters at different parts of the run. This is an approach that could easily be modified to assist in method development by running multiple sets of parameters in a single optimization experiment. The scans corresponding to a particular set of parameters could then be interrogated to assess their relative performance, allowing one to explore the parameter space in a much more efficient manner. Likely the largest impediment to such an undertaking is the parsing of the resultant data, as existing tools are typically not designed to split up the MS results into multiple sections. A tool like RawQuant or RawTools, however, which gives access to scan-by-scan information on instrument operation and performance, could be used or adapted for this purpose.

References

1. Frank RA, Fischer K, Kavanagh R, et al. Effect of Carboxylic Acid Content on the Acute Toxicity of Oil Sands Naphthenic Acids. Environ Sci Technol. 2009;43(2):266-271. 2. Grewer DM, Young RF, Whittal RM, Fedorak PM.
Naphthenic acids and other acid-extractables in water samples from Alberta: What is being measured? Sci Total Environ. 2010;408(23):5997-6010. 3. Headley JV, McMartin DW. A review of the occurrence and fate of naphthenic acids in aquatic environments. J Environ Sci Heal A. 2004;39(8):1989-2010. 4. Headley JV, Peru KM, Barrow MP. Mass Spectrometric Characterization of Naphthenic Acids in Environmental Samples: A Review. Mass Spectrom Rev. 2009;28(1):121-134. 5. Morandi GD, Wiseman SB, Pereira A, et al. Effects-Directed Analysis of Dissolved Organic Compounds in Oil Sands Process-Affected Water. Environ Sci Technol. 2015;49(20):12395-12404. 6. Malle H, Simser J. Final Report for Naphthenic Acids - Interlaboratory Calibration Study 2014. In: Environment Canada EOALRSD-IQM, ed. Burlington, Ontario; 2014. 7. Wrona FJ, di Cenzo P. Lower Athabasca Water Quality Monitoring Program: PHASE 1, Athabasca River Mainstem and Major Tributaries. In: Environment Canada - Environmental Stewardship Branch, ed: Environment Canada; 2011:29-34,74. 8. Headley JV, Peru KM, Fahlman B, Colodey A, McMartin DW. Selective solvent extraction and characterization of the acid extractable fraction of Athabasca oils sands process waters by Orbitrap mass spectrometry. Int J Mass Spectrom. 2013;345:104-108. 9. Headley JV, Peru KM, Barrow MP. Advances in mass spectrometric characterization of naphthenic acids fraction compounds in oil sands environmental samples and crude oil-A review. Mass Spectrom Rev. 2016;35(2):311-328. 10. Headley JV, Peru KM, Mohamed MH, et al. Chemical fingerprinting of naphthenic acids and oil sands process waters-A review of analytical methods for environmental samples. J Environ Sci Heal A. 2013;48(10):1145-1163. 11. Lengger SK, Scarlett AG, West CE, et al. Use of the distributions of adamantane acids to profile short-term temporal and pond-scale spatial variations in the composition of oil sands process-affected waters. Environ Sci-Proc Imp. 2015;17(8):1415-1423. 12.
Lengger SK, Scarlett AG, West CE, Rowland SJ. Diamondoid diacids ('O-4' species) in oil sands process-affected water. Rapid Commun Mass Sp. 2013;27(23):2648-2654. 13. Wasinger VC, Cordwell SJ, Cerpa-Poljak A, et al. Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis. 1995;16(1):1090-1094. 14. Anderson NL, Anderson NG. Proteome and proteomics: New technologies, new concepts, and new words. Electrophoresis. 1998;19(11):1853-1861. 15. James P. Protein identification in the post-genome era: the rapid rise of proteomics. Quarterly Reviews of Biophysics. 1997;30(4):279-331. 16. Rudnick PA, Clauser KR, Kilpatrick LE, et al. Performance metrics for liquid chromatography-tandem mass spectrometry systems in proteomics analyses. Molecular & cellular proteomics: MCP. 2010;9(2):225-241. 17. Bittremieux W, Tabb DL, Impens F, et al. Quality control in mass spectrometry-based proteomics. Mass Spectrom Rev. 2018;37(5):697-711. 18. Bittremieux W, Valkenborg D, Martens L, Laukens K. Computational quality control tools for mass spectrometry proteomics. Proteomics. 2017;17(3-4). 19. Bittremieux W, Meysman P, Martens L, Valkenborg D, Laukens K. Unsupervised Quality Assessment of Mass Spectrometry Proteomics Experiments by Multivariate Quality Control Metrics. J Proteome Res. 2016;15(4):1300-1307. 20. Ma Z-Q, Polzin KO, Dasari S, et al. QuaMeter: Multivendor Performance Metrics for LC–MS/MS Proteomics Instrumentation. Anal Chem. 2012;84(14):5845-5850. 21. Solovyeva EM, Lobas AA, Kopylov AT, Gorshkov MV. Semi-supervised quality control method for proteome analyses based on tandem mass spectrometry. Int J Mass Spectrom. 2018;427:59-64. 22. Stanfill BA, Nakayasu ES, Bramer LM, et al. Quality Control Analysis in Real-time (QC-ART): A Tool for Real-time Quality Control Assessment of Mass Spectrometry-based Proteomics Data. Molecular & Cellular Proteomics. 2018;17(9):1824-1836. 23. Wang X, Chambers MC, Vega-Montoto LJ, Bunk DM, Stein SE, Tabb DL.
QC Metrics from CPTAC Raw LC-MS/MS Data Interpreted through Multivariate Statistics. Anal Chem. 2014;86(5):2497-2509. 24. Amidan BG, Orton DJ, Lamarche BL, et al. Signatures for mass spectrometry data quality. J Proteome Res. 2014;13(4):2215-2222. 25. Trachsel C, Panse C, Kockmann T, Wolski WE, Grossmann J, Schlapbach R. rawDiag: An R Package Supporting Rational LC-MS Method Optimization for Bottom-up Proteomics. J Proteome Res. 2018;17(8):2908-2914. 26. Hu QZ, Noll RJ, Li HY, Makarov A, Hardman M, Cooks RG. The Orbitrap: a new mass spectrometer. J Mass Spectrom. 2005;40(4):430-443. 27. Kingdon KH. A method for the neutralization of electron space charge by positive ionization at very low gas pressures. Phys Rev. 1923;21(4):408-418. 28. Knight RD. Storage of Ions from Laser-Produced Plasmas. Appl Phys Lett. 1981;38(4):221-223. 29. Borras E, Sabido E. What is targeted proteomics? A concise revision of targeted acquisition and targeted data analysis in mass spectrometry. Proteomics. 2017;17(17-18). 30. Ting L, Rad R, Gygi SP, Haas W. MS3 eliminates ratio distortion in isobaric multiplexed quantitative proteomics. Nat Methods. 2011;8(11):937-940. 31. Mann M, Hendrickson RC, Pandey A. Analysis of Proteins and Proteomes by Mass Spectrometry. Annual Review of Biochemistry. 2001;70(1):437-473. 32. Meier F, Brunner A-D, Koch S, et al. Online Parallel Accumulation–Serial Fragmentation (PASEF) with a Novel Trapped Ion Mobility Mass Spectrometer. Molecular & Cellular Proteomics. 2018;17(12):2534-2545. 33. Rauniyar N, Yates JR. Isobaric Labeling-Based Relative Quantification in Shotgun Proteomics. J Proteome Res. 2014;13(12):5293-5309. 34. Ow SY, Salim M, Noirel J, Evans C, Rehman I, Wright PC. iTRAQ Underestimation in Simple and Complex Mixtures: \"The Good, the Bad and the Ugly\". J Proteome Res. 2009;8(11):5347-5355. 35. McAlister GC, Nusinow DP, Jedrychowski MP, et al. 
MultiNotch MS3 Enables Accurate, Sensitive, and Multiplexed Detection of Differential Expression across Cancer Cell Line Proteomes. Anal Chem. 2014;86(14):7150-7158. 36. Winter SV, Meier F, Wichmann C, Cox J, Mann M, Meissner F. EASI-tag enables accurate multiplexed and interference-free MS2-based proteome quantification. Nat Methods. 2018;15(7):527-+. 37. Sonnett M, Yeung E, Wuhr M. Accurate, Sensitive, and Precise Multiplexed Proteomics Using the Complement Reporter Ion Cluster. Anal Chem. 2018;90(8):5032-5039. 38. Wühr M, Haas W, McAlister GC, et al. Accurate multiplexed proteomics at the MS2 level using the complement reporter ion cluster. Anal Chem. 2012;84(21):9214-9221. 39. Rodgers RP, McKenna AM. Petroleum Analysis. Anal Chem. 2011;83(12):4665-4687. 40. Kovalchik KA, MacLennan M, Peru K, Headley J, Chen D. Standard method design considerations for semi-quantification of total naphthenic acids in oil sands process affected water by mass spectrometry: A review. Frontiers in Chemical Science and Engineering. 2017;11(3):497-507. 41. Allen EW. Process water treatment in Canada's oil sands industry: I. Target pollutants and treatment objectives. Journal of Environmental Engineering and Science. 2008;7(2):123-138. 42. McKenzie N, Yue S, Liu X, Ramsay BA, Ramsay JA. Biodegradation of naphthenic acids in oils sands process waters in an immobilized soil/sediment bioreactor. Chemosphere. 2014;109:164-172. 43. Baird D, Banic C, G. B, et al. Lower Athabasca Water Quality Monitoring Program: PHASE 1, Athabasca River Mainstem and Major Tributaries. Gatineau, Quebec, Canada: Environment Canada;2011. 44. Kovalchik KA, MacLennan MS, Peru KM, et al. Characterization of dicarboxylic naphthenic acid fraction compounds utilizing amide derivatization: Proof of concept. Rapid Commun Mass Sp. 2017;31(24):2057-2065. 45. Perkins DN, Pappin DJC, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data.
Electrophoresis. 1999;20(18):3551-3567. 46. Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, Mann M. Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment. J Proteome Res. 2011;10(4):1794-1805. 47. Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics (Oxford, England). 2004;20(9):1466-1467. 48. Eng JK, Fischer B, Grossmann J, MacCoss MJ. A fast SEQUEST cross correlation algorithm. J Proteome Res. 2008;7(10):4598-4602. 49. Zhang J, Xin L, Shan BZ, et al. PEAKS DB: De Novo Sequencing Assisted Database Search for Sensitive and Accurate Peptide Identification. Mol Cell Proteomics. 2012;11(4). 50. Kovalchik KA, Moggridge S, Chen DDY, Morin GB, Hughes CS. Parsing and Quantification of Raw Orbitrap Mass Spectrometer Data Using RawQuant. J Proteome Res. 2018;17(6):2237-2247. 51. Kovalchik KA, Colborne S, Spencer SE, et al. RawTools: Rapid and Dynamic Interrogation of Orbitrap Data Files for Mass Spectrometer System Management. J Proteome Res. 2019;18(2):700-708. 52. Brunswick P, Shang DY, van Aggelen G, et al. Trace analysis of total naphthenic acids in aqueous environmental matrices by liquid chromatography/mass spectrometry-quadrupole time of flight mass spectrometry direct injection. J Chromatogr A. 2015;1405:49-71. 53. Shang DY, Kim M, Haberl M, Legzdins A. Development of a rapid liquid chromatography tandem mass spectrometry method for screening of trace naphthenic acids in aqueous environments. J Chromatogr A. 2013;1278:98-107. 54. Scott AC, Young RF, Fedorak PM. Comparison of GC-MS and FTIR methods for quantifying naphthenic acids in water samples. Chemosphere. 2008;73(8):1258-1264. 55. Kavanagh RJ, Burnison BK, Frank RA, Solomon KR, Van Der Kraak G. Detecting oil sands process-affected waters in the Alberta oil sands region using synchronous fluorescence spectroscopy. Chemosphere. 2009;76(1):120-126. 56. Mohamed MH, Wilson LD, Headley JV, Peru KM.
Screening of oil sands naphthenic acids by UV-Vis absorption and fluorescence emission spectrophotometry. J Environ Sci Heal A. 2008;43(14):1700-1705. 57. Martin JW, Han XM, Peru KM, Headley JV. Comparison of high- and low-resolution electrospray ionization mass spectrometry for the analysis of naphthenic acid mixtures in oil sands process water. Rapid Commun Mass Sp. 2008;22(12):1919-1924. 58. Barrow MP, Witt M, Headley JV, Peru KM. Athabasca Oil Sands Process Water: Characterization by Atmospheric Pressure Photoionization and Electrospray Ionization Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. Anal Chem. 2010;82(9):3727-3735. 59. Han XM, MacKinnon MD, Martin JW. Estimating the in situ biodegradation of naphthenic acids in oil sands process waters by HPLC/HRMS. Chemosphere. 2009;76(1):63-70. 60. Nyakas A, Han J, Peru KM, Headley JV, Borchers CH. Comprehensive Analysis of Oil Sands Processed Water by Direct-Infusion Fourier-Transform Ion Cyclotron Resonance Mass Spectrometry with and without Offline UHPLC Sample Prefractionation. Environ Sci Technol. 2013;47(9):4471-4479. 61. Holowenko FM, MacKinnon MD, Fedorak PM. Characterization of naphthenic acids in oil sands wastewaters by gas chromatography-mass spectrometry. Water Res. 2002;36(11):2843-2855. 62. Bataineh M, Scott AC, Fedorak PM, Martin JW. Capillary HPLC/QTOF-MS for characterizing complex naphthenic acid mixtures and their microbial transformation. Anal Chem. 2006;78(24):8354-8361. 63. Hindle R, Noestheden M, Peru K, Headley J. Quantitative analysis of naphthenic acids in water by liquid chromatography-accurate mass time-of-flight mass spectrometry. J Chromatogr A. 2013;1286:166-174. 64. Smith BE, Rowland SJ. A derivatisation and liquid chromatography/electrospray ionisation multistage mass spectrometry method for the characterisation of naphthenic acids. Rapid Commun Mass Sp. 2008;22(23):3909-3927. 65. Woudneh MB, Hamilton MC, Benskin JP, Wang GH, McEachern P, Cosgrove JR. 
A novel derivatization-based liquid chromatography tandem mass spectrometry method for quantitative characterization of naphthenic acid isomer profiles in environmental waters. J Chromatogr A. 2013;1293:36-43. 66. MacLennan MS, Tie C, Kovalchik K, et al. Potential of capillary electrophoresis mass spectrometry for the characterization and monitoring of amine-derivatized naphthenic acids from oil sands process-affected water. Journal of Environmental Sciences. 2016;49:203-212. 67. Barrow MP, Peru KM, McMartin DW, Headley JV. Effects of Extraction pH on the Fourier Transform Ion Cyclotron Resonance Mass Spectrometry Profiles of Athabasca Oil Sands Process Water. Energ Fuel. 2016;30(5):3615-3621. 68. Celsie A, Parnis JM, Mackay D. Impact of temperature, pH, and salinity changes on the physico-chemical properties of model naphthenic acids. Chemosphere. 2016;146:40-50. 69. Huang RF, McPhedran KN, Sun N, Chelme-Ayala P, El-Din MG. Investigation of the impact of organic solvent type and solution pH on the extraction efficiency of naphthenic acids from oil sands process-affected water. Chemosphere. 2016;146:472-477. 70. Ortiz X, Jobst KJ, Reiner EJ, et al. Characterization of Naphthenic Acids by Gas Chromatography-Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. Anal Chem. 2014;86(15):7666-7673. 71. Pereira AS, Bhattacharjee S, Martin JW. Characterization of Oil Sands Process-Affected Waters by Liquid Chromatography Orbitrap Mass Spectrometry. Environ Sci Technol. 2013;47(10):5504-5513. 72. Pereira AS, Martin JW. Exploring the complexity of oil sands process-affected water by high efficiency supercritical fluid chromatography/orbitrap mass spectrometry. Rapid Commun Mass Sp. 2015;29(8):735-744. 73. Ross MS, Pereira AD, Fennell J, et al. Quantitative and Qualitative Analysis of Naphthenic Acids in Natural Waters Surrounding the Canadian Oil Sands Industry. Environ Sci Technol. 2012;46(23):12796-12805. 74. Duncan KD, Letourneau DR, Vandergrift GW, et al.
A semi-quantitative approach for the rapid screening and mass profiling of naphthenic acids directly in contaminated aqueous samples. J Mass Spectrom. 2016;51(1):44-52. 75. Duncan KD, Volmer DA, Gill CG, Krogh ET. Rapid Screening of Carboxylic Acids from Waste and Surface Waters by ESI-MS/MS Using Barium Ion Chemistry and On-Line Membrane Sampling. J Am Soc Mass Spectr. 2016;27(3):443-450. 76. Headley JV, Peru KM, McMartin DW, Winkler M. Determination of dissolved naphthenic acids in natural waters by using negative-ion electrospray mass spectrometry. J Aoac Int. 2002;85(1):182-187. 77. Gosselin P, Hrudey SE, Naeth MA, et al. Environmental and Health Impacts of Canada's Oil Sands Industry. Ottawa, Ontario, Canada: The Royal Society of Canada;2010. 78. Woudneh MB, Hamilton MC, Benskin JP, Wang G, McEachern P, Cosgrove JR. A novel derivatization-based liquid chromatography tandem mass spectrometry method for quantitative characterization of naphthenic acid isomer profiles in environmental waters. J Chromatogr A. 2013;1293:36-43. 79. Rowland SJ, Scarlett AG, Jones D, West CE, Frank RA. Diamonds in the Rough: Identification of Individual Naphthenic Acids in Oil Sands Process Water. Environ Sci Technol. 2011;45(7):3154-3159. 80. Rowland SJ, West CE, Scarlett AG, Jones D. Identification of individual acids in a commercial sample of naphthenic acids from petroleum by two-dimensional comprehensive gas chromatography/mass spectrometry. Rapid Commun Mass Sp. 2011;25(12):1741-1751. 81. Gutierrez-Villagomez JM, Vazquez-Martinez J, Ramirez-Chavez E, Molina-Torres J, Trudeau VL. Analysis of naphthenic acid mixtures as pentafluorobenzyl derivatives by gas chromatography-electron impact mass spectrometry. Talanta. 2017;162:440-452. 82. Wang BL, Wan Y, Gao YX, Yang M, Hu JY. Determination and Characterization of Oxy-Naphthenic Acids in Oilfield Wastewater. Environ Sci Technol. 2013;47(16):9545-9554. 83. Kurzer F, Douraghi.K. ADVANCES IN CHEMISTRY OF CARBODIIMIDES. Chem Rev. 
1967;67(2):107-&. 84. Ajaero C, McMartin DW, Peru KM, et al. Fourier Transform Ion Cyclotron Resonance Mass Spectrometry Characterization of Athabasca Oil Sand Process-Affected Waters Incubated in the Presence of Wetland Plants. Energ Fuel. 2017;31(2):1731-1740. 85. Larance M, Lamond AI. Multidimensional proteomics for cell biology. Nat Rev Mol Cell Bio. 2015;16(5):269-280. 86. Bielow C, Mastrobuoni G, Kempa S. Proteomics Quality Control: Quality Control Software for MaxQuant Results. J Proteome Res. 2016;15(3):777-787. 87. Gatto L, Breckels LM, Naake T, Gibb S. Visualization of proteomics data using R and Bioconductor. Proteomics. 2015;15(8):1375-1389. 88. Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 2008;24(21):2534-2536. 89. Martens L, Chambers M, Sturm M, et al. mzML-a Community Standard for Mass Spectrometry Data. Mol Cell Proteomics. 2011;10(1). 90. Senko MW, Remes PM, Canterbury JD, et al. Novel Parallelized Quadrupole/Linear Ion Trap/Orbitrap Tribrid Mass Spectrometer Improving Proteome Coverage and Peptide Identification Rates. Anal Chem. 2013;85(24):11710-11714. 91. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology. 2008;26(12):1367-1372. 92. Deutsch EW, Mendoza L, Shteynberg D, et al. A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010;10(6):1150-1159. 93. Bekker-Jensen DB, Kelstrup CD, Batth TS, et al. An Optimized Shotgun Strategy for the Rapid Generation of Comprehensive Human Proteomes. Cell Syst. 2017;4(6):587-+. 94. Kelstrup CD, Bekker-Jensen DB, Arrey TN, Hogrebe A, Harder A, Olsen JV. Performance Evaluation of the Q Exactive HF-X for Shotgun Proteomics. J Proteome Res. 2018;17(1):727-738. 95. Kelstrup CD, Jersie-Christensen RR, Batth TS, et al.
Rapid and Deep Proteomes by Faster Sequencing on a Benchtop Quadrupole Ultra-High-Field Orbitrap Mass Spectrometer. J Proteome Res. 2014;13(12):6187-6195. 96. Paulo JA, O'Connell JD, Gygi SP. A Triple Knockout (TKO) Proteomics Standard for Diagnosing Ion Interference in Isobaric Labeling Experiments. J Am Soc Mass Spectr. 2016;27(10):1620-1625. 97. Martinez-Val A, Garcia F, Ximenez-Embun P, et al. On the Statistical Significance of Compressed Ratios in Isobaric Labeling: A Cross-Platform Comparison. J Proteome Res. 2016;15(9):3029-3038. 98. Paulo JA, O'Connell JD, Gaun A, Gygi SP. Proteome-wide quantitative multiplexed profiling of protein expression: carbon-source dependency in Saccharomyces cerevisiae. Mol Biol Cell. 2015;26(22):4063-4074. 99. Hughes CS, Foehr S, Garfield DA, Furlong EE, Steinmetz LM, Krijgsveld J. Ultrasensitive proteome analysis using paramagnetic bead technology. Mol Syst Biol. 2014;10(10). 100. Moggridge S, Sorensen PH, Morin GB, Hughes CS. Extending the Compatibility of the SP3 Paramagnetic Bead Processing Approach for Proteomics. J Proteome Res. 2018;17(4):1730-1740. 101. Sielaff M, Kuharev J, Bohn T, et al. Evaluation of FASP, SP3, and iST Protocols for Proteomic Sample Preparation in the Low Microgram Range. J Proteome Res. 2017;16(11):4060-4072. 119 102. Zolg DP, Wilhelm M, Schnatbaum K, et al. Building ProteomeTools based on a complete synthetic human proteome. Nat Methods. 2017;14(3):259-+. 103. Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L. SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics. 2011;11(5):996-999. 104. Vaudel M, Burkhart JM, Zahedi RP, et al. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nature Biotechnology. 2015;33(1):22-24. 105. Vizcaino JA, Csordas A, del-Toro N, et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016;44(D1):D447-D456. 106. Wuhr M, Haas W, McAlister GC, et al. 
Accurate Multiplexed Proteomics at the MS2 Level Using the Complement Reporter Ion Cluster. Anal Chem. 2012;84(21):9214-9221. 107. Hebert AS, Thoing C, Riley NM, et al. Improved Precursor Characterization for Data-Dependent Mass Spectrometry. Anal Chem. 2018;90(3):2333-2340. 108. Werner T, Sweetman G, Savitski MF, Mathieson T, Bantscheff M, Savitski MM. Ion Coalescence of Neutron Encoded TMT 10-Plex Reporter Ions. Anal Chem. 2014;86(7):3594-3601. 109. O'Brien JJ, O'Connell JD, Paulo JA, et al. Compositional Proteomics: Effects of Spatial Constraints on Protein Quantification Utilizing Isobaric Tags. J Proteome Res. 2018;17(1):590-599. 110. Erickson BK, Rose CM, Braun CR, et al. A Strategy to Combine Sample Multiplexing with Targeted Proteomics Assays for High-Throughput Protein Signature Characterization. Mol Cell. 2017;65(2):361-370. 111. Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. Journal of Proteomics. 2010;73(11):2092-2123. 112. Bantscheff M, Lemeer S, Savitski MM, Kuster B. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem. 2012;404(4):939-965. 113. Kovalchik KA, Moggridge S, Chen DDY, Morin GB, Hughes CS. Parsing and Quantification of Raw Orbitrap Mass Spectrometer Data using RawQuant. J Proteome Res. 2018. 114. Chambers MC, Maclean B, Burke R, et al. A cross-platform toolkit for mass spectrometry and proteomics. Nature Biotechnology. 2012;30(10):918-920. 115. Barsnes H, Vaudel M. SearchGUI: A Highly Adaptable Common Interface for Proteomics Search and de Novo Engines. J Proteome Res. 2018;17(7):2552-2555. 116. Levitsky LI, Ivanov MV, Lobas AA, et al. IdentiPy: An Extensible Search Engine for Protein Identification in Shotgun Proteomics. J Proteome Res. 2018;17(7):2249-2255. 117. Vizcaíno JA, Csordas A, Del-Toro N, et al. 2016 update of the PRIDE database and its related tools. 
Nucleic Acids Res. 2016;44(22):11033. 118. Valeur E, Bradley M. Amide bond formation: beyond the myth of coupling reagents. Chem Soc Rev. 2009;38(2):606-631.
Appendix: Supplemental information for Chapter 4
A.1 Analysis of Deposited Data
For benchmarking RawQuant, a collection of previously published data sets obtained from ProteomeXchange was examined:
1. Analysis of the HeLa cell proteome utilizing an optimized pipeline based on MS analysis with a Q-ExactiveHF instrument (PXD004452) [1].
a. Data were re-processed using RawQuant coupled to SearchCLI (version 3.2.30) [2] and PeptideShakerCLI (version 1.16.11) [3]. Centroided MS2 spectra were searched against a UniProt Human proteome database containing common contaminants, to which reversed sequences generated using the -decoy tag of FastaCLI in SearchCLI were appended (42,058 total sequences, 21,029 target). All searches used a combination of XTandem (version 2015.12.15.2), Myrimatch (version 2.2.140), MS-GF+ (version 10282), and Comet (version 2016.01 rev. 3) algorithms. Identification parameter files were generated using IdentificationParametersCLI in SearchCLI, specifying precursor and fragment tolerances of 20 ppm and 0.05 Da, carbamidomethylation of cysteine as a fixed modification, and oxidation of methionine and acetylation of protein N-term as variable modifications. The msgf_instrument, msgf_fragmentation, and msgf_protocol tags were set to 3, 3, and 0, respectively, for Orbitrap MS2. SearchCLI results were processed into PSM, peptide, and protein sets using PeptideShakerCLI. Error rates were controlled in PeptideShakerCLI using the target-decoy search strategy to determine false-discovery rates (FDR). Hits from multiple search engines were unified using posterior error probabilities determined from the target-decoy search strategy. Results reports were exported from PeptideShakerCLI using the ReportCLI. All results were filtered to provide a final FDR of <1% at the PSM, peptide, and protein levels.
Final mzid files were output from PeptideShakerCLI using MzidCLI with the default parameters.
2. Analysis of the HeLa cell proteome utilizing an optimized pipeline based on MS analysis with a Q-ExactiveHF-X instrument (PXD006932) [4].
a. Data were re-processed using RawQuant coupled to SearchCLI (version 3.2.30) [2] and PeptideShakerCLI (version 1.16.11) [3]. Centroided MS2 spectra were searched against a UniProt Human proteome database containing common contaminants, to which reversed sequences generated using the -decoy tag of FastaCLI in SearchCLI were appended (42,058 total sequences, 21,029 target). All searches used a combination of XTandem (version 2015.12.15.2), Myrimatch (version 2.2.140), MS-GF+ (version 10282), and Comet (version 2016.01 rev. 3) algorithms. Identification parameter files were generated using IdentificationParametersCLI in SearchCLI, specifying precursor and fragment tolerances of 20 ppm and 0.05 Da, carbamidomethylation of cysteine as a fixed modification, and oxidation of methionine and acetylation of protein N-term as variable modifications. The msgf_instrument, msgf_fragmentation, and msgf_protocol tags were set to 3, 3, and 0, respectively, for Orbitrap MS2. SearchCLI results were processed into PSM, peptide, and protein sets using PeptideShakerCLI. Error rates were controlled in PeptideShakerCLI using the target-decoy search strategy to determine false-discovery rates (FDR). Hits from multiple search engines were unified using posterior error probabilities determined from the target-decoy search strategy. Results reports were exported from PeptideShakerCLI using the ReportCLI. All results were filtered to provide a final FDR of <1% at the PSM, peptide, and protein levels. Final mzid files were output from PeptideShakerCLI using MzidCLI with the default parameters.
3. Analysis of the HeLa cell proteome utilizing a standard pipeline based on MS analysis with a Q-ExactiveHF instrument (PXD001305) [5].
a.
Data were re-processed using RawQuant coupled to SearchCLI (version 3.2.30) [2] and PeptideShakerCLI (version 1.16.11) [3]. Centroided MS2 spectra were searched against a UniProt Human proteome database containing common contaminants, to which reversed sequences generated using the -decoy tag of FastaCLI in SearchCLI were appended (42,058 total sequences, 21,029 target). All searches used a combination of XTandem (version 2015.12.15.2), Myrimatch (version 2.2.140), MS-GF+ (version 10282), and Comet (version 2016.01 rev. 3) algorithms. Identification parameter files were generated using IdentificationParametersCLI in SearchCLI, specifying precursor and fragment tolerances of 20 ppm and 0.05 Da, carbamidomethylation of cysteine as a fixed modification, and oxidation of methionine and acetylation of protein N-term as variable modifications. The msgf_instrument, msgf_fragmentation, and msgf_protocol tags were set to 3, 3, and 0, respectively, for Orbitrap MS2. SearchCLI results were processed into PSM, peptide, and protein sets using PeptideShakerCLI. Error rates were controlled in PeptideShakerCLI using the target-decoy search strategy to determine false-discovery rates (FDR). Hits from multiple search engines were unified using posterior error probabilities determined from the target-decoy search strategy. Results reports were exported from PeptideShakerCLI using the ReportCLI. All results were filtered to provide a final FDR of <1% at the PSM, peptide, and protein levels. Final mzid files were output from PeptideShakerCLI using MzidCLI with the default parameters.
4. Analysis of TMT 10-plex labeled ‘triple-knockout’ samples from an experiment in Saccharomyces cerevisiae (PXD008009) [6].
a. Data were re-processed using RawQuant coupled to SearchCLI (version 3.2.30) [2] and PeptideShakerCLI (version 1.16.11) [3].
Centroided MS2 spectra (MGF files from RawQuant) were searched against a UniProt Yeast proteome database containing common contaminants, to which reversed sequences generated using the -decoy tag of FastaCLI in SearchCLI were appended (12,188 total sequences, 6,094 target). All searches used a combination of XTandem (version 2015.12.15.2), Myrimatch (version 2.2.140), MS-GF+ (version 10282), and Comet (version 2016.01 rev. 3) algorithms. Identification parameter files were generated using IdentificationParametersCLI in SearchCLI, specifying precursor and fragment tolerances of 20 ppm and 0.5 Da (ion trap MS2) or 0.05 Da (Orbitrap MS2), with carbamidomethylation of cysteine, TMT 10-plex of peptide N-term, and TMT 10-plex of lysine as fixed modifications. Oxidation of methionine and acetylation of protein N-term were set as variable modifications. The msgf_instrument, msgf_fragmentation, and msgf_protocol tags were set to 0, 1, and 4 for ion trap MS2, and 3, 3, and 4 for Orbitrap MS2. SearchCLI results were processed into PSM, peptide, and protein sets using PeptideShakerCLI. Error rates were controlled in PeptideShakerCLI using the target-decoy search strategy to determine false-discovery rates (FDR). Hits from multiple search engines were unified using posterior error probabilities determined from the target-decoy search strategy. Results reports were exported from PeptideShakerCLI using the ReportCLI. All results were filtered to provide a final FDR of <1% at the PSM, peptide, and protein levels. Final mzid files were output from PeptideShakerCLI using MzidCLI with the default parameters.
b. For reanalysis of the published triple-knockout Saccharomyces cerevisiae data [6] in PD, MS2 spectra from raw files were searched using Sequest HT against a UniProt Yeast proteome database appended to a list of common contaminants (6,094 total sequences).
Sequest HT parameters were specified as: trypsin enzyme, 2 missed cleavages allowed, minimum peptide length of 6, precursor mass tolerance of 20 ppm, and fragment mass tolerance of 0.5 Da. Oxidation of methionine and acetylation of protein N-termini were set as variable modifications. Carbamidomethylation of cysteine, TMT 10-plex of lysines, and TMT 10-plex of peptide N-termini were set as fixed modifications. Peptide spectral match (PSM) error rates were determined using the target-decoy strategy coupled to Percolator modeling of positive and false matches [7–10]. Data were filtered at the PSM level to control for false discoveries using a q-value cutoff of 0.01 as determined by Percolator.
5. Analysis of an iTRAQ 8-plex two-proteome mixture model on a Q-Exactive MS (PXD003640) [11].
a. For the reanalysis of the published iTRAQ mixed human and E. coli sample in SearchCLI, centroided MS2 spectra (MGF files from RawQuant) were searched against a UniProt human + E. coli proteome database containing common contaminants, to which reversed sequences generated using the -decoy tag of FastaCLI in SearchCLI were appended (50,676 total sequences, 25,338 target). All searches used a combination of XTandem (version 2015.12.15.2), Myrimatch (version 2.2.140), MS-GF+ (version 10282), and Comet (version 2016.01 rev. 3) algorithms. Identification parameter files were generated using IdentificationParametersCLI in SearchCLI, specifying precursor and fragment tolerances of 20 ppm and 0.05 Da, with carbamidomethylation of cysteine, iTRAQ 8-plex of peptide N-term, and iTRAQ 8-plex of lysine as fixed modifications. Oxidation of methionine and acetylation of protein N-term were set as variable modifications. The msgf_instrument, msgf_fragmentation, and msgf_protocol tags were set to 3, 3, and 2, respectively, for Orbitrap MS2. SearchCLI results were processed into PSM, peptide, and protein sets using PeptideShakerCLI.
Error rates were controlled in PeptideShakerCLI using the target-decoy search strategy to determine false-discovery rates (FDR). Hits from multiple search engines were unified using posterior error probabilities determined from the target-decoy search strategy. Results reports were exported from PeptideShakerCLI using the ReportCLI. All results were filtered to provide a final FDR of <1% at the PSM, peptide, and protein levels. Final mzid files were output from PeptideShakerCLI using MzidCLI with the default parameters.
b. For the reanalysis of the published iTRAQ mixed human and E. coli samples [11] in PD, centroided MS2 spectra from raw files were searched using Sequest HT against a UniProt human + E. coli proteome database appended to a list of common contaminants (25,338 total sequences). Sequest HT parameters were specified as: trypsin enzyme, 2 missed cleavages allowed, minimum peptide length of 6, precursor mass tolerance of 20 ppm, and fragment mass tolerance of 0.05 Da. Oxidation of methionine and acetylation of protein N-termini were set as variable modifications. Carbamidomethylation of cysteine, iTRAQ 8-plex of lysines, and iTRAQ 8-plex of peptide N-termini were set as fixed modifications. PSM error rates were determined using the target-decoy strategy coupled to Percolator modeling of positive and false matches [7–10]. Data were filtered at the PSM level to control for false discoveries using a q-value cutoff of 0.01 as determined by Percolator.
6. Analysis of data from a TMT 10-plex examination of carbon source dependency in Saccharomyces cerevisiae (PXD002875) [12].
a. Data were re-processed using RawQuant coupled to SearchCLI (version 3.2.30) [2] and PeptideShakerCLI (version 1.16.11) [3].
Centroided MS2 spectra (MGF files from RawQuant) were searched against a UniProt Yeast proteome database containing common contaminants, to which reversed sequences generated using the -decoy tag of FastaCLI in SearchCLI were appended (12,188 total sequences, 6,094 target). All searches used a combination of XTandem (version 2015.12.15.2), Myrimatch (version 2.2.140), MS-GF+ (version 10282), and Comet (version 2016.01 rev. 3) algorithms. Identification parameter files were generated using IdentificationParametersCLI in SearchCLI, specifying precursor and fragment tolerances of 20 ppm and 0.5 Da, with carbamidomethylation of cysteine, TMT 10-plex of peptide N-term, and TMT 10-plex of lysine as fixed modifications. Oxidation of methionine and acetylation of protein N-term were set as variable modifications. The msgf_instrument, msgf_fragmentation, and msgf_protocol tags were set to 0, 1, and 4, respectively. SearchCLI results were processed into PSM, peptide, and protein sets using PeptideShakerCLI. Error rates were controlled in PeptideShakerCLI using the target-decoy search strategy to determine false-discovery rates (FDR). Hits from multiple search engines were unified using posterior error probabilities determined from the target-decoy search strategy. Results reports were exported from PeptideShakerCLI using the ReportCLI. All results were filtered to provide a final FDR of <1% at the PSM, peptide, and protein levels. Final mzid files were output from PeptideShakerCLI using MzidCLI with the default parameters.
A.2 Supplemental Figures
Figure S-1 – Analysis of RawQuant data highlights improved uniformity with fraction concatenation. Data from MS2-only, label-free whole proteome analyses carried out on a Q-ExactiveHF and Q-ExactiveHF-X were re-processed and the identification results queried. (a) Depicts the numbers of MS2 scans obtained in each individual fraction from the HF-X analysis (n = 46).
(b) Depicts the numbers of MS2 scans obtained in each individual fraction from the HF analysis (n = 14). (c) Depicts the resultant numbers of PSMs identified across each individual fraction in the HF-X analysis (n = 46). (d) Depicts the resultant numbers of PSMs identified across each individual fraction in the HF analysis (n = 14). Red lines indicate the mean values for each of the examined parameters.
Figure S-2 – Quantification of MET6 by RawQuant and PD displays similar distributions. Data from MS2 by Orbitrap Fusion Lumos, SPS-MS3 by Orbitrap Fusion Lumos, and SPS-MS3 by Orbitrap Fusion acquisition of Saccharomyces cerevisiae triple knockout samples [14] were re-analyzed with RawQuant and PD. Boxplots depict the expression for all PSMs assigned to MET6 as found after analysis with (a) Orbitrap Fusion Lumos with SPS-MS3, (b) Orbitrap Fusion with SPS-MS3, and (c) Orbitrap Fusion Lumos with MS2. Dashed red lines indicate the median expression across the three-replicate knockout (ko) channels.
Figure S-3 – Quantification of PFK2 by RawQuant and PD displays similar distributions. Data from MS2 by Orbitrap Fusion Lumos, SPS-MS3 by Orbitrap Fusion Lumos, and SPS-MS3 by Orbitrap Fusion acquisition of Saccharomyces cerevisiae triple knockout samples [14] were re-analyzed with RawQuant and PD. Boxplots depict the expression for all PSMs assigned to PFK2 as found after analysis with (a) Orbitrap Fusion Lumos with SPS-MS3, (b) Orbitrap Fusion with SPS-MS3, and (c) Orbitrap Fusion Lumos with MS2. Dashed red lines indicate the median expression across the three-replicate knockout (ko) channels.
Figure S-4 – Quantification of URA2 by RawQuant and PD displays similar distributions. Data from MS2 by Orbitrap Fusion Lumos, SPS-MS3 by Orbitrap Fusion Lumos, and SPS-MS3 by Orbitrap Fusion acquisition of Saccharomyces cerevisiae triple knockout samples [14] were re-analyzed with RawQuant and PD.
Boxplots depict the expression for all PSMs assigned to URA2 as found after analysis with (a) Orbitrap Fusion Lumos with SPS-MS3, (b) Orbitrap Fusion with SPS-MS3, and (c) Orbitrap Fusion Lumos with MS2. Dashed red lines indicate the median expression across the three-replicate knockout (ko) channels.
Figure S-5 – Data from RawQuant recapitulated the ratio interference trend observed in previously published data. Data from MS2 by Orbitrap Fusion Lumos and SPS-MS3 by Orbitrap Fusion Lumos acquisition of Saccharomyces cerevisiae triple knockout samples [14] were re-analyzed with RawQuant. (a) Interference index values (IFI = 1 - ((average SN in KO) / (average SN in non-KO))) for PSMs assigned to MET6 for MS2 and SPS-MS3 analysis on an Orbitrap Fusion Lumos. (b) Scatter plot of the IFI values obtained from RawQuant and PD. Correlation values were calculated using the Pearson method at the indicated n-level.
Figure S-6 – Analysis of RawQuant data recapitulated observed ratio compression from iTRAQ data. Data from an MS2-only analysis of iTRAQ 8-plex labeled samples from a two-proteome mixture of human and E. coli proteolytic digests were re-analyzed with RawQuant. Boxplot depicts the observed ratio compression based on RawQuant data. Values were calculated from ‘compressed’ (e.g. iTRAQ 113 / iTRAQ 115) and ‘uncompressed’ (e.g. iTRAQ 117 / iTRAQ 119) channels. Dashed horizontal line indicates the expected value of the iTRAQ ratio.
Figure S-7 – RawQuant processing enables user-friendly analysis of whole-proteome quantification data sets. Data from an SPS-MS3 interrogation of differential protein expression in Saccharomyces cerevisiae dependent on the carbon source used for growth were re-analyzed with RawQuant. (a) Principal component analysis of the replicate samples using quantification data from RawQuant.
Individual boxplots of (b) GAL10, (c) SUC2, and (d) HXT3, biological features previously observed to differentiate between the carbon sources, as observed in the RawQuant data.
Figure S-8 – There is no observable ion coalescence across ion target values in SPS-MS3 scans. A mixture of E. coli (TMT126 – 0:1:0:2:4:1:2:4:1:2:4 – TMT131C, whole proteome) and human peptides (TMT126 – 2:0:0:0:0:1:1:1:3:3:3 – TMT131C, synthetic peptides, n = 444) was analyzed using MS2 and SPS-MS3 acquisition methods that spanned a range of ion target values on an Orbitrap Fusion. Detected reporter ion masses from RawQuant were plotted. Density scatter plots depict detected reporter ion masses for the isotopologue TMT129, TMT130, and TMT131 channels when using (a) MS2 with a 2e5 ion target, (b) SPS-MS3 with an 8e4 ion target, and (c) SPS-MS3 with an 8e5 ion target. Red lines in all plots highlight the known TMT reporter ion masses.
Figure S-9 – MS2 reporter ion acquisition results in increased amounts of ratio compression compared with SPS-MS3. A mixture of E. coli (TMT126 – 0:1:0:2:4:1:2:4:1:2:4 – TMT131C, whole proteome) and human peptides (TMT126 – 2:0:0:0:0:1:1:1:3:3:3 – TMT131C, synthetic peptides, n = 444) was analyzed using MS2 and SPS-MS3 with a 2e5 ion target and 120 ms fill time (n = 3). The values displayed are S2N of the individual PSMs for the noted channel ratios. The horizontal red line indicates the expected TMT ratio result based on the TMT131C / TMT130N channels.
Figure S-10 – MS1 isolation interference and reporter ion signal minimally correlate with ratio accuracy. Data from the E. coli and spike peptide mixture were analyzed to extract properties that drive ratio accuracy. For each plot, the absolute difference between the expected and observed ratio values is taken for the TMT131C / TMT130N channels in the 2e5-120 samples (n = 3). All plots display values for PSMs.
(a) Scatter plot of the absolute difference in observed and expected ratio versus the MS1 isolation interference when using MS2 (n = 3) and (b) SPS-MS3 (n = 3). (c) Scatter plot of the absolute difference in observed and expected ratio versus the S2N for MS2 (n = 3) and (d) SPS-MS3 (n = 3). Horizontal red lines mark zero deviation from the expected quantification ratio.
Figure S-11 – Data filtering using isolation interference and signal intensity criteria has no observable impact on quantification ratio accuracy. Data from the E. coli and spike peptide mixture were analyzed to assess methods that can be used to improve ratio accuracy. All calculations for quantification ratios were performed using the TMT131C / TMT130N set of channels. Spike peptide values in the 2e5-120 (a) MS2 (n = 3) and (b) SPS-MS3 (n = 3) samples were filtered using MS1 isolation interference (<=50%), number of missing values (<50%), and summed S2N across all channels (>=100). Labels above the x-axis denote the numbers of PSMs remaining after application of the stated filter. The horizontal red lines highlight the expected quantification ratio.
Figure S-12 – Data filtering using SPS ion purity measurements has no observable impact on quantification ratio accuracy. Data from the E. coli and spike peptide mixture were analyzed to assess methods that can be used to improve ratio accuracy. The SPS-MS3 purity value calculation used peptide ions matched from PeptideShaker interfaced with SPS ions from RawQuant (MS3 purity = ((Sum of Ion Signal for SPS Ions Matched by PeptideShaker) / (Sum of Ion Signal for SPS Ions Selected for MS3 Analysis)) * 100). All calculations for quantification ratios were performed using the TMT131C / TMT130N set of channels. (a) Distribution of SPS-MS3 purity values from the 2e5-120 samples (n = 3).
(b) Scatter plot of SPS-MS3 purity values relative to the absolute difference from the expected ratio in the TMT131C / TMT130N channel in the 2e5-120 samples (n = 3). Horizontal red line indicates the zero value. (c) Boxplot displaying the impact of additional filtering based on SPS-MS3 purity (>=60%). Horizontal red line indicates the expected quantification ratio value.
A.3 Supplemental Tables
Due to space limitations, Tables S-1 through S-3 are not reprinted here. They are available in the supplementary information of the original publication: https://pubs.acs.org/doi/suppl/10.1021/acs.jproteome.8b00072
Table S-4 – Representative metrics table output from RawQuant. Table displays a typical metrics table output obtained using RawQuant. A standard SPS-MS3 file was used for this analysis.
Table S-4
Raw file: D:\\chughes\\raw-quant-paper-submission\\tmt-methods\\raw\\ch_29Sept2017_eColi-31907_TMT11_2e5-120_1.raw
Instrument: Orbitrap Fusion
MS order: 3
Total analysis time: 3299.6081851510003 s
Total scans: 27730
MS1 scans: 1149
MS2 scans: 13308
MS3 scans: 13273
Mean topN: 11.5822454308
MS1 scans/sec: 0.348223163335
MS2 scans/sec: 4.03320614244
Mean duty cycle: 2.87172165809
Table S-5 – Identifications obtained from re-analysis of published Q-Exactive data. Table lists the identification metrics obtained from re-analysis of the published Q-Exactive data using SearchCLI and PeptideShakerCLI alongside those from the original works for comparative analysis.
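As a rough cross-check of Table S-4 above, the derived rate metrics appear to follow directly from the scan counts and the total analysis time. The sketch below is illustrative only: the variable names are hypothetical and are not RawQuant identifiers, and the formulas are inferred from the reported numbers rather than taken from the RawQuant source code.

```python
# Hypothetical cross-check of the derived values reported in Table S-4.
# Input values are copied from the table; the formulas are assumptions.
total_time_s = 3299.6081851510003  # total analysis time in seconds
ms1_scans = 1149
ms2_scans = 13308
ms3_scans = 13273

# Assumed definitions of the derived metrics.
total_scans = ms1_scans + ms2_scans + ms3_scans
ms1_per_sec = ms1_scans / total_time_s
ms2_per_sec = ms2_scans / total_time_s
mean_top_n = ms2_scans / ms1_scans            # MS2 scans per MS1 scan
mean_duty_cycle_s = total_time_s / ms1_scans  # seconds per MS1 cycle

print('Total scans:', total_scans)
print('MS1 scans/sec:', round(ms1_per_sec, 6))
print('MS2 scans/sec:', round(ms2_per_sec, 6))
print('Mean topN:', round(mean_top_n, 6))
print('Mean duty cycle (s):', round(mean_duty_cycle_s, 6))
```

Under these assumed definitions, each computed value agrees with the corresponding Table S-4 entry to within 1e-6, which suggests that the mean topN is the number of MS2 scans per MS1 scan and that the mean duty cycle is the average time in seconds between consecutive MS1 scans.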
Table S-5
Data Set | Instrument | Number of PSMs | Number of Unique Peptides | Number of Proteins | Analysis Tool | PMID
HeLa 46-fraction | Q-ExactiveHF | not reported | 166,920 | 11,292 | MaxQuant | 28601559
HeLa 46-fraction | Q-ExactiveHF | 397,474 | 176,582 | 9,786 | RawQuant, SearchCLI/PeptideShakerCLI |
HeLa 12-fraction | Q-ExactiveHF | not reported | 140,000* | 8,400* | MaxQuant | 25349961
HeLa 12-fraction | Q-ExactiveHF | 388,852 | 164,798 | 8,524 | RawQuant, SearchCLI/PeptideShakerCLI |
HeLa 46-fraction | Q-ExactiveHF-X | not reported | 130,997 | 8,854 | MaxQuant | 29183128
HeLa 46-fraction | Q-ExactiveHF-X | 369,950 | 159,051 | 8,706 | RawQuant, SearchCLI/PeptideShakerCLI |
* Exact values were not reported in the original work "@en ; edm:hasType "Thesis/Dissertation"@en ; vivo:dateIssued "2019-09"@en ; edm:isShownAt "10.14288/1.0380437"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Chemistry"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "Attribution-NonCommercial-NoDerivatives 4.0 International"@* ; ns0:rightsURI "http://creativecommons.org/licenses/by-nc-nd/4.0/"@* ; ns0:scholarLevel "Graduate"@en ; dcterms:title "Parsing and analysis of mass spectrometry data of complex biological and environmental mixtures"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/71302"@en .