Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Developing bioinformatics tools and analyses on protein indels and protein-protein interactions : novel… Hsing, Michael 2010

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2010_spring_hsing_michael.pdf [ 2.89MB ]
Metadata
JSON: 24-1.0068978.json
JSON-LD: 24-1.0068978-ld.json
RDF/XML (Pretty): 24-1.0068978-rdf.xml
RDF/JSON: 24-1.0068978-rdf.json
Turtle: 24-1.0068978-turtle.txt
N-Triples: 24-1.0068978-rdf-ntriples.txt
Original Record: 24-1.0068978-source.json
Full Text
24-1.0068978-fulltext.txt
Citation
24-1.0068978.ris

Full Text

DEVELOPING BIOINFORMATICS TOOLS AND ANALYSES ON PROTEIN INDELS AND PROTEIN-PROTEIN INTERACTIONS – NOVEL APPLICATIONS FOR DRUG DISCOVERY IN STAPHYLOCOCCUS AUREUS by Michael Hsing B.Sc., Simon Fraser University, 2002 M.Sc., University of British Columbia, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Bioinformatics)  THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) January 2010  © Michael Hsing 2010  Abstract Infectious diseases caused by bacterial pathogens continue to be major public health concerns affecting millions of human lives annually, as conventional treatment via antibiotics has lost its effectiveness due to growing problems of drug resistance. Recent advancements in systems biology, high-throughout sequencing, protein interaction study and computer-aided drug development can offer possible solutions to antibiotic resistance through discovery of novel antimicrobials. The thesis describes several bioinformatics approaches that focus on protein interaction network (PIN) studies, analyses of targetable protein indels (insertions and deletions) and virtual compound screening for new antibacterial candidates – approaches integrated into an antibiotic discovery pipeline for methicillin-resistant Staphylococcus aureus (MRSA252). In the course of the described work we identified new drug targets corresponding to highly interacting proteins (hubs) through comprehensive PIN analysis in MRSA252. The advantage of using hub proteins as targets is established by their essentiality, non-replaceable PIN position and lower rate of mutation, all of which can help to counter bacterial resistance. To accelerate these studies hub predicting tools have been developed to assist proteomics experiments for PIN discovery and to facilitate drug target identification in pathogens. Because some bacterial proteins are conserved in humans, we applied the indel (insertion or deletion) concept to locate unique compound-binding sites that enabled us to specifically target conserved and essential bacterial hubs. We demonstrated associations between the presence of sizable indels in proteins with their essentiality and network rewiring capability, which established indels as potential markers for drug targets. To provide the research community a fast and user-friendly web portal for identification and characterization of indelii  bearing drug targets, the Indel PDB database has been developed to characterize the functional and structural features of 117,266 indel sites across numerous species. Finally, combining the above bioinformatics methodologies with a rapid and efficient procedure of virtual screening allowed discovery of compounds that effectively inhibited MRSA252 cell growth with no signs of human toxicity. We anticipate that the drug discovery pipeline along with established MRSA PIN resource, hub prediction tools and indel database will provide a framework for the development of next-generation antibiotics in other existing or emerging pathogens.  iii  Table of contents Abstract ………………………………………………………………………………………..ii Table of contents ...............................................................................................................iv List of tables .....................................................................................................................vii List of figures ....................................................................................................................ix List of abbreviations and definitions ..............................................................................xi Acknowledgements .........................................................................................................xiv Dedication........................................................................................................................xvi Co-authorship statement...............................................................................................xvii 1 1.1 1.1.1 1.2 1.2.1 1.2.2 1.2.3 1.3 1.3.1 1.3.2 1.4 1.4.1 1.5 2  Introduction and thesis outline.......................................................................1 Problems of infectious diseases and antibacterial drug resistance ..................1 Mechanisms of current antibiotics and resistance .......................................3 Protein interaction networks............................................................................5 High-throughput protein interaction experiments .......................................6 Scale-free characteristics of protein interaction networks...........................8 Predicting highly interacting proteins (hubs) ............................................10 Protein insertions and deletions.....................................................................13 Bioinformatics analyses of indels on protein essentiality, protein interactions and structural features............................................................14 Targeting conserved proteins by indel-induced drug-binding sites ..........16 Computer-aided drug discovery in MRSA252..............................................18 An efficient and cost-effective in silico drug development platform........19 References .....................................................................................................22  2.2.4 2.3 2.3.1 2.3.2 2.3.3 2.4 2.5  The use of Gene Ontology terms for predicting highly-connected ‘hub’ nodes in protein-protein interaction networks ...........................................29 Introduction ...................................................................................................29 Methods.........................................................................................................32 Data acquisition .........................................................................................32 Hub protein classification by boosting trees .............................................36 Comparison of the hub classifiers with other existing protein interaction prediction approaches................................................................................39 Simulation of protein bait selections and network coverage.....................42 Results and discussion...................................................................................43 Prediction performance of the hub prediction classifier............................43 GO term predictor importance ..................................................................47 Applying hub classifier to protein bait selection.......................................48 Conclusion.....................................................................................................50 References .....................................................................................................51  3.1 3.2  Predicting highly-connected hubs in protein interaction networks by QSAR and biological data descriptors.........................................................55 Introduction ...................................................................................................55 Methods.........................................................................................................57  2.1 2.2 2.2.1 2.2.2 2.2.3  3  iv  3.2.1 3.2.2 3.2.3 3.2.4 3.3 3.3.1 3.3.2 3.3.3 3.4 3.5 4  Acquisition and analysis of protein-protein interaction data.....................57 Calculation and collection of protein descriptors......................................59 Training and testing hub classifiers by boosting trees ..............................64 Hub prediction based on sequence conservation.......................................65 Results and discussion...................................................................................66 Prediction performance of the hub prediction classifiers ..........................66 Prediction performance of a traditional hub prediction approach based on sequence conversation ...............................................................................69 Characteristics of hub and non-hub proteins .............................................72 Conclusion.....................................................................................................83 References .....................................................................................................85  4.3 4.3.1 4.3.2 4.3.3 4.3.4 4.3.5 4.3.6 4.3.7 4.4 4.5  Mapping protein interaction network in methicillin-resistant Staphylococcus aureus identifies novel drug targets...................................88 Introduction ...................................................................................................88 Methods.........................................................................................................89 Selection of baits .......................................................................................89 Protein interaction network analysis .........................................................90 Expression and purification of GST fusion baits ......................................91 MRSA cell extract preparation for GST pull down...................................92 GST pull down of MRSA protein complexes ...........................................92 Identification of interacting proteins by mass spectrometry .....................93 Co-immunoprecipitation with F(ab’)2 fragments of anti-MRSA pyruvate kinase (PK) antibody. ................................................................................94 Results and discussion...................................................................................94 MRSA PIN summary ................................................................................94 Accuracy of the protein-protein interactions.............................................95 PIN coverage .............................................................................................96 Overlap with other experimental PINs ......................................................97 Scale-free characteristics of MRSA PIN...................................................98 Characteristic features of hub proteins. ...................................................100 Utilizing PIN data for drug target analysis and identification.................101 Conclusion...................................................................................................109 References ...................................................................................................110  5.1 5.2 5.2.1 5.2.2 5.2.3 5.2.4 5.3 5.3.1 5.3.2 5.3.3 5.3.4 5.3.5 5.3.6  Indel PDB: a database of structural insertions and deletions derived from sequence alignments of closely related proteins........................................113 Introduction .................................................................................................113 Methods.......................................................................................................115 Constructing Indel PDB ..........................................................................115 Comparative analysis of indels................................................................119 Length distribution ..................................................................................120 Location analysis of protein domain and indel .......................................121 Results and discussion.................................................................................122 Overview of Indel PDB ...........................................................................122 Sequence composition of indels ..............................................................126 Length distribution of indels ...................................................................128 Secondary structure composition of indels .............................................132 Solvent accessibility of indels .................................................................134 Proximity of indels and protein domains ................................................135  4.1 4.2 4.2.1 4.2.2 4.2.3 4.2.4 4.2.5 4.2.6 4.2.7  5  v  5.4 5.5 6  Conclusion...................................................................................................137 References ...................................................................................................139  6.4 6.5 6.6  The identification of a novel pyruvate kinase inhibitor for methicillinresistant Staphylococcus aureus (MRSA252) via an indel-based targeting approach .......................................................................................................141 Introduction .................................................................................................141 Methods.......................................................................................................145 Homology modeling................................................................................146 ZINC chemical database preparation ......................................................147 Binding site prediction ............................................................................147 Molecular docking...................................................................................147 Scoring functions and predictors .............................................................148 PK enzymatic assays and cell-based screens ..........................................149 Similarity searching (SS).........................................................................150 Molecular dynamics (MD) and binding free energy calculations ...........150 Results .........................................................................................................151 In silico screens .......................................................................................153 Experimental testing on the virtual compound hits.................................154 Similarity search for analogues of compound #63 ..................................156 Experimental testing on compound #130 (analogues of compound #63)157 Compound #130 binding within the deletion site of MRSA252 PK via docking ....................................................................................................157 Conformational analysis of compound #130 via a molecular dynamics simulation ................................................................................................158 Discussion ...................................................................................................160 Conclusion...................................................................................................162 References ...................................................................................................163  7.1 7.2  Concluding chapter .....................................................................................166 Future directions..........................................................................................171 References ...................................................................................................175  6.1 6.2 6.2.1 6.2.2 6.2.3 6.2.4 6.2.5 6.2.6 6.2.7 6.2.8 6.3 6.3.1 6.3.2 6.3.3 6.3.4 6.3.5 6.3.6  7  Appendix …...………………………………………………………………………………177 Appendix A - Supplementary materials.......................................................................177 Appendix B - List of publications................................................................................178  vi  List of tables Table 1.1 Classes of antibacterial drugs. .............................................................................4 Table 2.1 A summary of protein interaction and Gene Ontology annotation data used in the training and testing of the hub classifiers. ..................................................32 Table 2.2 A summary of protein interaction and Gene Ontology annotation data used in the external validation of the hub classifiers.....................................................33 Table 2.3 Prediction performance of the hub classifier in the combined data set of the four species. ......................................................................................................44 Table 2.4 Hub prediction comparison of the classifier and the hypothetical PINs in MRSA252. ........................................................................................................45 Table 2.5 Comparing ranked lists of hub-likeness properties between the classifier and the hypothetical PINs in MRSA252. ................................................................46 Table 2.6 Hub prediction result in C. elegans. ..................................................................46 Table 2.7 Hub prediction result in the randomized data set. .............................................47 Table 2.8 Top 20 important Gene Ontology term predictors. ...........................................48 Table 3.1 A summary of protein interaction data used in the training and testing of the hub classifiers. ..................................................................................................57 Table 3.2 A summary of protein descriptors. ....................................................................59 Table 3.3 A set of reference proteomes for sequence similarity descriptors.....................60 Table 3.4 Physicochemical descriptors..............................................................................63 Table 3.5 E. coli hub classifier prediction performance....................................................67 Table 3.6 S. cerevisiae hub classifier prediction performance. .........................................67 Table 3.7 D. melanogaster hub classifier prediction performance....................................68 Table 3.8 H. sapiens hub classifier prediction performance. ............................................68 Table 3.9 Hub proteins conservation among species.........................................................69 Table 3.10 Hub prediction based on sequence conservation.............................................71 Table 3.11 Top GO term descriptors. ................................................................................73 Table 3.12 Frequency of common GO terms appearing in hub and non-hub proteins. ....74 Table 3.13 Top SCOP fold descriptors..............................................................................77 Table 4.1 The bait coverage summary for MRSA and other PIN datasets........................97 Table 4.2 Total numbers and percentages of conserved interactions, interactors and hub proteins between MRSA and other datasets. ....................................................98 Table 4.3 Linear regression values based on power-law degree distribution in 6 different PINs. ...............................................................................................................100 Table 4.4 Differential characteristics averaged for MRSA hubs and non-hubs. The pvalues from two sample t-tests are also given. ...............................................101 Table 4.5 Differential characteristics averaged for MRSA hubs, non-hubs and antimicrobial targets and corresponding p-values from two sample t-tests....103 vii  Table 4.6 Differential characteristics averaged for clinically approved and experimental antimicrobial targets and corresponding p-values from two sample t-tests....108 Table 5.1 Solvent accessibility of indel sites and indel proteins. ....................................134 Table 5.2 Solvent accessibility of indel sites and loops. .................................................134 Table 5.3 Percentage of indel sites that overlapped with at least one protein domain. ...135 Table 5.4 Percentage of domains that overlapped with at least one indel site. ...............135 Table 5.5 Top 20 domains that overlapped with indels. ..................................................137 Table 6.1 The two acyl hydrazone-based compounds shown in the table demonstrated selective in-vitro inhibitory activity towards MRSA252 PK..........................155  viii  List of figures Figure 1.1 Drug resistance is increasing at an alarming rate...............................................2 Figure 1.2 The number of new antibacterial drugs is declining. .........................................3 Figure 1.3 A bioinformatics drug discovery platform. ........................................................6 Figure 1.4 An example of PIN representation. ....................................................................9 Figure 1.5 The presence of indels is identified from a sequence alignment between a protein (query) and its homologue (subject).....................................................14 Figure 1.6 Superimposed model of Leishmania EF-1α α protein (green) and human protein (red)...................................................................................................................17 Figure 1.7 Docking simulation of the compound in the indel-site in Leishmania (A) and human protein structure (B) models. ................................................................18 Figure 2.1 Cumulative protein interaction distribution plots.............................................34 Figure 2.2 A flow chart of the development of the hub protein classifiers and their corresponding bioinformatics analyses.............................................................35 Figure 2.3 Distribution of Gene Ontology annotation terms between the training and testing sets in the four cross-validation samples...............................................37 Figure 2.4 Network coverage of different bait selection strategies in protein complex pull-down experiments, simulated in Saccharomyces cerevisiae.....................49 Figure 3.1 A flow chart for the characterization and prediction of hub proteins. .............58 Figure 3.2 A box plot of descriptor values on sequence similarity for hubs and non-hubs among the four species. ....................................................................................75 Figure 3.3 A box plot of descriptor values on the number of all Pfam domains for hubs and non-hubs among the four species...............................................................76 Figure 3.4 A box plot of descriptor values on average threading score for "PDZ domainlike" fold (SCOP: b.36) for hubs and non-hubs among the four species..........78 Figure 3.5 A box plot of descriptor values on average threading score for "beta-Grasp (ubiquitin-like)" fold (SCOP: d.15) for hubs and non-hubs among the four species...............................................................................................................79 Figure 3.6 A box plot of descriptor values on average hydrophilicity for hubs and nonhubs among the four species. ............................................................................80 Figure 3.7 A box plot of descriptor values on fraction of flexible coil residues for hubs and non-hubs among the four species...............................................................81 Figure 3.8 A box plot of descriptor values on estimated surface area for hubs and nonhubs among the four species. ............................................................................82 Figure 3.9 A box plot of descriptor values on average surface polarizability for hubs and non-hubs among the four species......................................................................82 Figure 3.10 A box plot of descriptor values on isoelectric point (pI) for hubs and nonhubs among the four species. ............................................................................83 Figure 4.1 2D representation of the developed MRSA PIN..............................................95 ix  Figure 4.2 Distribution of the number of protein interactions (left panel) and a logarithmic transformation plot for the distribution of protein interactions (right panel).......................................................................................................99 Figure 4.3 Average number of protein interactions among drug targets, hubs and nonhubs (left panel), and Network Betweenness (NB) values for drug targets, hubs and non-hubs (right panel). .............................................................................104 Figure 4.4 Separation of hubs non-hubs and antimicrobial targets in the threedimensional space of protein abundance, protein conservation and similarity to human proteins................................................................................................106 Figure 5.1 A flowchart for constructing the Indel PDB web server................................116 Figure 5.2 A database schema for Indel PDB..................................................................119 Figure 5.3 An indel detail page on Indel PDB.................................................................124 Figure 5.4 A 3D view of an indel site by the Jmol applet on Indel PDB. .......................125 Figure 5.5 Amino acid composition of indel sequences..................................................126 Figure 5.6 The difference of average amino acid composition between indel sites and full-length protein sequences. .........................................................................127 Figure 5.7 The difference of average amino acid composition between indel sites and loop sequences. ...............................................................................................128 Figure 5.8 Accumulative indel length distribution. .........................................................129 Figure 5.9 Indel length modeled by the Weibull and power law distribution. ................130 Figure 5.10 Loop length modeled by the Weibull and power law distribution...............131 Figure 5.11 Average secondary structure composition of indels. ...................................133 Figure 5.12 Difference of average secondary structure composition between indel sites and full-length protein sequences. ..................................................................133 Figure 6.1 A multiple sequence alignment showing the indel region (shown as gaps) for pyruvate kinase (PK) from Bacillus stearothermophilus, Staphylococcus aureus, Streptococcus pneumoniae, Clostridium difficile, Escherichia coli, Salmonella typhimurium, Chlamydia trachomatis, and Homo sapiens..........143 Figure 6.2 Structure alignment and superposition of human PK (shown in yellow) with MRSA252 PK (shown in blue).......................................................................144 Figure 6.3 Path for the identification of selective MRSA PK inhibitors.........................145 Figure 6.4 The in silico screening cascade used in this study for the discovery of selective inhibitors of MRSA252 PK. ...........................................................................146 Figure 6.5 Structural model of N-terminal indel site for MRSA PK...............................152 Figure 6.6 (Top) Compounds #63 and #130 (10µM) selectivity inhibit MRSA PK enzymatic activity; (Middle) Effect of compounds #63 and #130 on S. aureus growth; (Bottom) Toxicity evaluation of compounds #63 and #130 in human HeLa cells. ......................................................................................................155 Figure 6.7 Change of RMSD values for compound #130 bound to the indel site of MRSA252 PK. ................................................................................................159  x  List of abbreviations and definitions BML  Bacterial Metabolite Likeness. A QSAR model that is capable of distinguishing bacterial metabolites based on their chemical descriptors, and useful for in silico antibiotic studies.  GO  Gene Ontology. A list of standardized vocabulary developed by the biological community to describe functions of genes or proteins in an organism.  Indel  Insertion and Deletion. A gap in a pair-wise nucleotide or protein sequence alignment.  MD  Molecular Dynamics. A form of computer programs where the motion of atoms and molecules are simulated according to the law of physics.  MOE  Molecular Operating Environment. A commercial software package capable of visualizing and manipulating chemicals and proteins.  MRSA  Methicillin-Resistant Staphylococcus aureus. A gram-positive bacterium that is resistant to many antibiotics and responsible for difficult-to-treat infections in humans.  PDB  Protein Data Bank. A data warehouse for the deposit and storage of threedimensional protein structures determined by experimental techniques such as X-ray crystallography.  Pfam  Protein Family. A database of collections of protein domains, each represented by multiple sequence alignments and hidden Markov models.  xi  PIN  Protein Interaction Network. A two-dimensional representation of proteinprotein interactions in a cell, where proteins are visualized as nodes and their physical interactions are presented as connecting lines.  PK  Pyruvate Kinase. An enzyme in the glycolysis pathway to catalyze the transfer of a phosphate group from phosphoenolpyruvate to ADP, producing one molecule of pyruvate and one molecule of ATP.  PREPARE  Proteomics for Emerging Pathogen Response. A four-year Genome BCfunded project that aims to use state-of-the-art genomic and proteomic approaches to identify novel targets in microbial pathogens leading to novel therapeutics.  QSAR  Quantitative Structure Activity Relationship. A computational technique by which structural features of a molecular (chemical or protein) are correlated with its biological activity.  RMSD  Root Mean Square Deviation. A frequently-used measure of the differences between two values (usually between a predicted value and observed value).  ROC  Receiver Operating Characteristic. A graphical plot of the sensitivity vs. (1 − specificity) for a binary classifier system as its discrimination threshold is varied  SCOP  Structural Classification of Proteins. A database of classification of proteins based on their structural folds. All proteins with known structures can be assigned with a SCOP classification ID that specifies their structural grouping.  xii  SN  Semantic Networks. A form of knowledge representation where abstract concepts are connected to each other through relationships in a networklike organization.  SVL  Scientific Vector Language. A programming language used by the MOE software for scripting and automating tasks.  xiii  Acknowledgements I feel grateful to work along with many inspiring and talented individuals during my PhD journey. First and foremost, I would like to express my deepest gratitude to my thesis supervisor, Artem Cherkasov, for providing me the opportunity to be trained as a research scientist and for his scientific and personal guidance and encouragement. It has been my pleasure to work along with the former and current members of the Cherkasov lab, including Peter Axerio, Simon Chan, Christopher Fjell, Ken Byler, Osvaldo Santos-Filho, Fuqiang Ban, Meilan Huang, and Evgeny Maksakov, and I am grateful for their scientific collaboration and philosophical discussion. Second, I would like to thank all of my thesis committee members, Leah Keshet, Francis Ouellette, and Wyeth Wasserman, for their scientific advice and mentorship that have guided me during my PhD study, and their comments on my PhD thesis are much appreciated. Third, I would like to thank all of our collaborators from the PREPARE (Proteomics for Emerging Pathogen Response) project, in particular, Raymond See and Roya Zoraghi, and from the SFU computer sciences department (Fereydoun Hormozdiari, Phuong Dao, Farhad Hormozdiari, and Cenk Sahinalp) for their contribution to several research works presented in the PhD thesis. Fourth, I would like to acknowledge the UBC Bioinformatics Graduate Program for providing an excellent and stimulating training environment, and research funding from the Michael Smith Foundation for Health Research (MSFHR), the Natural Sciences and Engineering Research Council (NSERC), Genome Canada, Genome BC and the PREPARE project. Finally, I am blessed with many caring and loving friends, in particular, Dorothy Chang, who has encouraged me to face many challenges during the PhD study. I would like to express  xiv  my deepest gratitude to my parents, Herman Hsing and Fanny Chang, and my sister, Vivian Hsing; without their most caring love I would not be able to complete this PhD journey.  xv  Dedication I would like to dedicate this PhD thesis to my parents (Herman Hsing and Fanny Chang), my family and Dorothy Chang for their unconditional love, encouragement and support.  xvi  Co-authorship statement The work presented in this Ph.D. thesis was in part due to collaboration with other researchers. Those involved in the research are listed in the publication citations in each of the chapters. Their individual contributions to the research are outlined below. Chapter 2: I acquired and analyzed protein interaction and Gene Ontology data, designed and developed the hub classifiers, built the hypothetical PINs, simulated the protein bait selection experiments, and drafted and revised the manuscript. Kendall Byler analyzed the statistical models and tools for boosting trees, and revised the manuscript. Artem Cherkasov conceived and designed the study, and revised the manuscript. Chapter 3: I acquired and analyzed protein sequence, protein similarity, protein interaction, Gene Ontology annotation and protein domain data, and performed BLAST searches and protein threading. I designed and developed the hub classifiers, performed the analyses on hub features based on the protein descriptors, and drafted and revised the manuscript. Kendall Byler performed the calculations on the physicochemical descriptors. Artem Cherkasov conceived and designed the study, and revised the manuscript. Chapter 4: I managed and analyzed protein interaction data in MRSA252, collected interaction data from other species and performed PIN comparison. I estimated the PIN accuracy and protein interaction distribution, and studied the network characteristics of hub proteins and drug targets. Farhad Hormozdiari and Phuong Dao calculated network betweenness values. Roya Zoraghi, Jihong Jiang, Sukhbir Kaur, Tian Lian, Linda Jackson, Huansheng Gong, Rick Swayze, Emily Amandoron prepared and performed affinity pull-down experiments and determined protein-protein interactions and protein essentiality in MRSA252. Leonard J Foster and Nikolay Stoynov performed the mass spectrometry experiments for protein complex identification. Artem xvii  Cherkasov, Kendall Byler and I analyzed and selected protein baits. Raymond See coordinated the study. Artem Cherkasov, Roya Zoraghi, Leonard J Foster and I drafted and revised the manuscript. Artem Cherkasov, William R. McMaster, Robert Brunham, Brett Finlay and Neil E Reiner conceived the study. Chapter 5: I acquired and analyzed protein sequence and structural data from PDB, developed and implemented the Indel PDB database and website, carried out the indel analyses, and drafted and revised the manuscript. Artem Cherkasov conceived and designed the study, and revised the manuscript. Chapter 6: I modeled and optimized the pyruvate kinase protein structure. Peter AxerioCilies and I performed in silico screens and compound analogue similarity searches. Artem Cherkasov, Peter Axerio-Cilies and I selected the virtual compound hits. Peter Axerio-Cilies prepared the virtual chemical database, calculated compound binding affinity and RMSD scores, and performed molecular dynamics simulations. Roya Zoraghi, Jihong Jiang, Sukhbir Kaur, Tian Lian, Linda Jackson, Huansheng Gong, Rick Swayze, Emily Amandoron performed enzymatic assays and cell-based assays in MRSA252 and human cells. Peter Axerio-Cilies and I drafted the manuscript. Artem Cherkasov conceived and designed the study.  xviii  1 Introduction and thesis outline 1.1 Problems of infectious diseases and antibacterial drug resistance Infectious diseases are the second leading cause of death worldwide [1]. In 2002, nearly 15 million or 25% of the total annual deaths were caused by infectious diseases. Moreover, in addition to the ongoing HIV/AIDS pandemic, there have been numerous occurrences of emerging and re-emerging infections such as SARS, avian influenza, West Nile, tuberculosis and malaria [2]. Treatment of infectious diseases becomes more challenging when common pathogens such as Staphylococcus aureus, Mycobacterium tuberculosis and Pseudomonas aeruginosa develop resistance to drugs that were once effective. In 2002 more than 70 percent of hospital-acquired infections were caused by drug resistance bacteria [3]. As illustrated in Figure 1.1, the incidence of drug-resistance strains has been increasing at an astonishing rate in the last two decades with S. aureus being one of the most prevalent drugresistance bacteria. The wide-spread problem of bacterial drug resistance has not only increased the economic cost of treating infectious diseases by billions of dollars annually in North America [3], but also placed an urgent demand on developing novel drugs to counter the increasing rate of bacterial resistance.  1  Figure 1.1 Drug resistance is increasing at an alarming rate.  Source: [3] and Centers for Disease Control and Prevention, by permission. MRSA = Methicillin-resistant Staphylococcus aureus, VRE = Vancomycin-resistant Enterococcus, FQRP = Fluoroquinolone-Resistant Pseudomonas Aeruginosa.  However, there is a continuing decline of the rate at which new antibacterial drugs are being developed and approved (Figure 1.2). Pharmaceutical companies have been reducing research and development efforts on antibiotics in the past 10 years, partially due to technical difficulties in high throughput screening, molecular modeling, novel target identification, and toxicity and delivery issues with drug prototypes. Antibiotics produce a low return on investment for manufacturers as patients are commonly prescribed for only a short period of time (1 to 2 weeks), in contrast to the long period time of other drugs for chronic diseases or lifestyle issues [3].  2  Figure 1.2 The number of new antibacterial drugs is declining.  Source: [3] and Spellberg, CID 2004, by permission.  1.1.1  Mechanisms of current antibiotics and resistance Thus ever increasing drug resistance in bacteria and the lack of new antibiotics  motivates research to determine how current antimicrobial drugs work and how drug resistance has arisen, and importantly, to apply accumulated biomedical information for developing new approaches in antimicrobial discovery. As shown in Table 1.1, there are 14 major classes of antibiotic drugs discovered since 1935 [3]. The majority of these drugs work by inhibiting three cellular processes: cell wall synthesis, protein synthesis and DNA/RNA synthesis [4]. For instance, penicillin works by targeting penicillin binding proteins, which are enzymes involved in the synthesis of peptidoglycan, a major component of the bacterial cell wall. Thus, the irreversible binding of penicillin to the active site of the enzyme causes damage in the bacterial cell wall that decreases selective permeability, and eventual cell lysis and death. In another example, tetracycline binds to the 30S ribosomal subunit and prevents the normal binding of aminoacyl-tRNA, and thus, inhibits the process of translation. Another class of antibiotics, rifamycin, binds to bacterial RNA polymerase and blocks the transcription process.  3  Table 1.1 Classes of antibacterial drugs. Year Introduced 1935 1941 1944 1945 1949 1950 1952 1956 1957 1959 1962 1968 2000 2003  Class of Drug Sulfonamides Penicillins Aminoglycosides Cephalosporins Chloramphenicol Tetracyclines Macrolides/Lincosamides/Streptogramins Glycopeptides Rifamycins Nitroimidazoles Quinolones Trimethoprim Oxazolidinones Lipopeptides  mechanism of action cell metabolism (folic acid) cell wall synthesis protein synthesis cell wall synthesis protein synthesis protein synthesis protein synthesis cell wall synthesis RNA synthesis DNA integrity DNA synthesis cell metabolism (folic acid) protein synthesis cell membrane permeability  Source: [3, 4] In response to antibiotics action, bacteria have developed several counter-attack mechanisms, which include removing drugs from the cytosol by membrane protein pumps, synthesizing enzymes capable of digesting drug molecules, or altering drug interaction site(s) via sequence mutations [5-7]. A wide rage of drug resistance mechanisms can be illustrated by the example of tetracycline. It has been reported [8] that bacterial cells can resist tetracycline by at least three mechanisms. First, tetracycline can be inactivated by addition of an acetyl group by a specific bacterial enzyme. Second, tetracycline can be pumped out of the bacterial cell by a bacterial membrane protein. Finally, certain bacterial proteins encoded by resistance genes can block binding between tetracycline and ribosome or in some cases, can modify ribosome structure to enable tRNA binding even in the presence of the antibiotic. The ever-increasing problems of drug resistance and on-going decline of research and development efforts in the field of antibiotics have caused serious public health concerns. Strikingly, in the past 30 years, only two new classes of antibiotics have been developed [9-11], and with the rate of drug resistance, we will run out of drugs to treat multiple drug-resistance bacteria such as methicillin-resistant Staphylococcus aureus (MRSA252). The first case of an 4  S. aureus isolate that was totally resistant to vancomycin has already been documented in the United States in 2002 [12], and the rise of MRSA is occurring at alarming rate in Canada [13]. Thus, deeper understanding of MRSA biology and identification of novel effective targets with a lower risk of resistance represent an important task. The solution to the problem may be found in systems biology-related areas such as high-throughout genomic sequencing, gene expression profiling, protein mass spectrometry, large-scale protein interaction experiments and computer-aided drug discovery. Major advances in these areas have provided new insights into bacterial cell biology and broadened the scope of biological data involving protein sequences, interactions, and structures that require bioinformatics integration to derive new scientific discoveries leading towards novel therapeutic applications.  1.2 Protein interaction networks The ultimate goal of the research work described in the thesis was to develop a novel bioinformatics- and cheminformatics-based drug discovery platform that can deliver new antibiotic drug leads to counter drug-resistant pathogens. As illustrated in Figure 1.3, the developed pipeline integrates prediction tools for the identification of highly interacting proteins (hubs), comprehensive study of protein interaction networks (PINs), analysis of sizable protein indels, and in silico compound screenings for discovering new drug candidates. The computational work we have conducted not only provided general support for proteomics experiments of the PREPARE project (Proteomics for Emerging Pathogen Response [14]), supplemented efforts to develop new drugs for MRSA252 pathogen, but also addressed important scientific questions that arose in the course of the study.  5  Figure 1.3 A bioinformatics drug discovery platform.  The pipeline starts with protein interaction data in MRSA252 and ends with the final drug leads. Purple boxes represent experimental efforts from the PREPARE laboratory. The bioinformatics supports and analyses (as presented in the thesis) are depicted in both blue and orange boxes. New bioinformatics resources are illustrated in yellow. 1.2.1  High-throughput protein interaction experiments The importance of studying protein-protein interactions has been recognized as proteins  execute a wide rage of cellular functions ranging from DNA replication, transcription, protein synthesis to cell metabolism not in isolation but through interacting with other proteins [15].  6  For decades, physical interactions between proteins have been identified by a variety of biochemical methods such as protein affinity chromatography, affinity blotting, coimmunoprecipitation and chemical cross-linking [16]. Recent comprehensive sequencing efforts and advancements in systems biology led to a number of high-throughput technologies for establishing protein-protein interactions at a large scale. The two most common methods are the yeast two-hybrid system and mass spectrometry of purified protein complexes, as reviewed in our recently published book chapter [17]. The yeast two-hybrid assay detects pairwise interactions between the ‘bait’ and the ‘pray’ proteins that represent two sequences under question respectively fused with the AD and DBD domains [18]. The AD is an activating domain of the yeast transcriptional activator GAL4, while the DBD represents its DNA-binding fragment. Both these domains are required for synthesizing β-galactosidase enzyme and, therefore, when interaction between the bait and prey proteins brings the two domains into spatial proximity, the resulting β -galactosidase expression can be biochemically detected. The yeast two-hybrid systems have been applied for identification of pairwise protein-protein interactions in several organisms including Escherichia coli bacteriophage T7 [19], Helicobacter pylori [20], Campylobacter jejuni [21], Treponema pallidum [22], Saccharomyces cerevisiae [23-26], Caenorhabditis elegans [27], Drosophila melanogaster [28, 29] and homo sapiens [30, 31]. The PREPARE project utilized the second approach, which consisted of two independent components: a) purification of protein complexes occurring in a cell and b) identification of individual components of the complexes by mass spectrometry. The method utilizes sequence tags such as GST, TAP [32, 33] and FLAG [34] that are fused with the target genes at the 3' and/or 5' terminus through genetic engineering and enable affinity-purification of the corresponding protein products. The purification step pulls down not only the tagged 7  proteins (the baits) but also their interacting partners (the preys), and the isolated complexes further undergo mass spectrometry analysis [35-37] to determine the identity of the preys. To date, several affinity-purification constructed protein interaction maps have been reported for Escherichia coli K12 [38, 39] and Saccharomyces cerevisiae [33, 34, 40]; less detailed interactomes have also been compiled for certain human cell lines [41, 42]. The protein-protein interaction data derived from large-scale experiments as described above are stored in public databases, such as IntAct [43] and DIP [44]. The accumulated protein interaction data have facilitated many recent studies focusing on the representation and analysis of protein interaction networks.  1.2.2  Scale-free characteristics of protein interaction networks The most common method to represent a physical interaction between two proteins  graphically is through a line (the interaction) connecting two dots (the proteins). Thus, by combining a set of protein interactions together, a two dimensional network is formed as illustrated in Figure 1.4. Furthermore, our previous work has demonstrated that an artificial intelligence approach called ‘semantic networks’ is capable of representing and simulating more complicated cause-effect relationships observed in protein interactions and cell signaling events [17, 45, 46]. Recent network analyses demonstrated a key feature of uneven distributions in interaction networks. It has been found that seemingly unrelated systems such as biological, economic, professional, sexual and social networks, airline routing, power lines connections, language networks and internet hyperlinks all exhibit a power-law degree distribution. The −γ distribution can be described in the form of P(k ) ~ k , where P(k) is the probability of a  selected protein with exactly k links (degree), and γ is the value of the exponent [15, 47]. This 8  implies that in a protein interaction network, the majority of the proteins are involved in low number of protein interactions, and only a few proteins (commonly referred to “hubs”) are highly connected. Figure 1.4 An example of PIN representation.  Each blue circle represents a protein and a line represents a physical interaction between two proteins. The PIN diagram was visualized by using Cytoscape [48].  Such an uneven architecture in PINs contributes to their robustness and error-tolerance because a random perturbation (attack) is likely to disable nodes with few connections. However, a targeted attack directed toward protein network hubs can efficiently impair system’s topology and functionality [49]. This insight into topology and function of biological 9  networks is important for developing novel anti-infective therapies that can impact proteinprotein interaction networks of target pathogens, as the correlation between the connectivity of proteins and their essentiality in a cell has already been confirmed [50, 51]. Furthermore, hub proteins are associated with slower rate of mutation than less connected proteins, because hubs have multiple protein interaction surfaces that are under larger evolutionary constraints [52]. We propose that highly interacting hub proteins will be effective antibacterial drug targets, because their essentiality, non-replaceable PIN position and lower rate of mutation have the potential to counter bacterial resistance mechanisms. Hence, the first objective of the thesis is to construct a protein interaction network for MRSA252 from the experimental data provided from the PREPARE laboratory and to identify key hub proteins suitable for antibacterial drug development.  1.2.3  Predicting highly interacting proteins (hubs) One particular challenge we have faced during the construction of MRSA PIN is the  selection of protein baits used in the PREPARE’s protein pull-down experiments. In order to maximize the effectiveness and coverage of the MRSA PIN, we attempted to select protein baits that possess characteristics of highly interacting hubs, so that more protein-protein interactions can be identified from the experiments. Hub identification or prediction is a challenging task in organisms or pathogens with little or no protein interaction data. Recently, several computational approaches have been developed to predict protein-protein interactions by utilizing existing bioinformatics data such as gene proximity [53, 54], gene fusion [55, 56], gene co-expression [57-59], phylogenetic profiles [60], orthologous protein interactions [61] and interacting protein domains [62-65]. Several other bioinformatics approaches have been developed that identify hypothetical interactions between proteins based on three-dimensional structures [66, 67] or by applying text-mining techniques [68, 69]. Such computational methods 10  predict pairwise protein-protein interactions with varying degrees of accuracy [70]; however, none of them have explicitly focused on predicting hub proteins. To our surprise, conventional hub prediction methods that rely on sequence similarities from different species were not adequate to predict hubs reliably, mostly due to the limited overlap of existing PIN data [71]. The question on what sequence and structural characteristics separate hubs from the rest of the proteome was mostly unanswered in the literature. Thus, we investigated the common characteristics shared by highly interacting proteins including functional annotations, physicochemical properties, and protein domain composition, and as a result, protein hub prediction tools were developed through the use of machine learning algorithms. In Chapter 2 of the thesis, we described our results from the development of a hub predictor based on the utilization of Gene Ontology (GO) annotations [72]. The hub classifier has been trained by the available protein interaction networks in Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens with the boosting trees machine learning method. Testing the developed hub classifier on external sets of experimental protein interaction data in Methicillin-resistant Staphylococcus aureus (MRSA) 252 and Caenorhabditis elegans demonstrated that our approach can predict hub proteins with an acceptable degree of accuracy. In continuation of that work we had success in employing a QSAR (quantitative structure-activity relationship) approach that utilized 75 physical and chemical descriptors to predict hub proteins in the MRSA252 proteome [73]. Although the study was focused on a small subset of MRSA252 proteins, it demonstrated accurate hub characterization. We combined the above Gene Ontology- and QSAR-based approaches with over 1300 protein descriptors, some of which reflect sequence- and structure-derived protein characteristics including domain composition and classification in the standard fold libraries, to 11  build a new generation of hub predictors with improved accuracy. Through analyses of the most relevant protein descriptors, we demonstrated that hub proteins not only share certain common characteristics that make them different from the non-hub counterparts, but they also exhibit species-specific characteristics. The result of this work is presented in Chapter 3. The developed hub predictors will be particularly useful for rapid identification of potential drug targets in emerging pathogens when no protein interaction data are yet available (as PIN mapping experiments are generally time-consuming and labour-intensive). In addition, application of the developed hub-predicting tools for bait selection in the MRSA252 PIN mapping experiments have led to a better network coverage compared to the other existing approaches. Thus, the bioinformatics-assisted bait selection strategy combined with the protein pulldown experiments from the PREPARE laboratory produced a protein interaction network for MRSA252 containing over 600 proteins involved in more than 13,000 protein-protein interactions. This is the first large-scale PIN data available for Staphylococcus aureus species. Not only have we utilized the PIN to identify new antibacterial drug targets, we performed network analyses including the comparison of MRSA PIN to other species, study of protein interaction degree distribution and estimation of the overall PIN accuracy. We characterized hub proteins in MRSA252 based on their overall sequence conservation, cellular abundance, essentiality, hydrophobicity, protein size, and various other network properties. The MRSA PIN study is presented in Chapter 4. The MRSA interaction data allowed us to evaluate the characteristic of 61 known antimicrobial targets [74-77] in a protein network, and to reach the conclusion that the majority could not be classified as highly interacting hub proteins. Therefore, it was essential to revisit how conventional drug targets were selected (i.e. why hub proteins have been largely avoided), 12  and ultimately, to demonstrate how the analysis of protein insertions and deletions (indels) can reveal new possibilities for targeting conserved hub proteins.  1.3 Protein insertions and deletions In traditional bacterial drug development, bacterial-specific proteins with no human homologs are selected as antimicrobial targets preferably. Examples of such targets include penicillin-binding protein, ribonuclease P, and bacterial-specific RNA polymerase subunits. This approach ensures specificity of antibacterial drugs and determines the absence of host toxicity; at the same time, such a selection strategy excludes many highly conserved, essential and house-keeping proteins from the consideration for drug development. We hypothesized that it is feasible to target conserved proteins in bacteria without cross-reaction in humans by the use of small compounds directed specifically at structural differential sites caused by sequence insertions and deletions (indels) between homologous proteins. Sequence indels represent a common form of protein length polymorphism and can be found in a broad spectrum of proteins [78, 79]. The conventional definition of indel is an insertion or deletion gap between two aligned sequences, as illustrated in Figure 1.5. It is common to refer a gap as either an insertion or deletion with respect to the query sequence.  13  Figure 1.5 The presence of indels is identified from a sequence alignment between a protein (query) and its homologue (subject). |-insertion-| Query: VSKLSDGLPPGASDARGAKPAECTYLEAHKDTDLLAVGYADGVIKVWDLMSKTVLLNFNG Sbjct: ILIL-----------QGLKQ-EVTCLCPSPDGLHLAVGYEDGSIRIFSLLSGEGNVTFNG |- deletion -| Query: IEND------KMGGKL--------TEMGIFEKQ--------------SKQRGLKIEFITN Sbjct: IEDPEEPDPKKIKGSSPGIQDTLEAEDGAFETDEAPEDRILSCRKAGSIMREGRDRVVNL  The sequence alignment contains one insertion and three deletion sites in the query sequence in comparison to the subject sequence, generated by the BLAST program [80].  1.3.1  Bioinformatics analyses of indels on protein essentiality, protein interactions and structural features Analyzing insertions and deletions occurring in aligned protein sequences represents an  important objective in bioinformatics and structural proteomics, with application in multiple sequence alignments, homology protein modeling and phylogenetic analysis. However, unlike amino acid substitutions that have been intensively studied [81, 82], there have been few reports addressing sequence indels [83, 84]. Early indel surveys with limited datasets illustrated the distribution of indel lengths [85, 86] and sequence/structure composition [87-89]. Recently a large-scale indel analysis with 136 bacterial and protozoan genomes revealed that 5~10% of proteins contain sizable indels in comparison to human homologues [78]. This study indicated that indel presence is common across species; however, the presence of indels in the conserved proteins is puzzling because conserved proteins often have essential functions in a cell and sequence variation can be disruptive. It was suggested that indels are likely to be accommodated in loop regions on protein surfaces without disrupting the core protein folds and functions [79, 87, 88]. It has been shown that DNA mispairing during replication can generate short indels (one or two amino acid) [90], while chromosome crossover and transposition events are the 14  likely biological mechanisms for longer indels [86]. Previous indel studies provide explanation on how indels can occur even in the most conserved proteins, but did not address important questions such as: “why have indels been accumulated among homologous sequences across many species?” and “what are the possible functions of indels?” It is reasonable to assume that the occurrence of indels in some proteins might have provided certain functional roles to the organism; therefore, the frequency of the indel-containing protein in a population is maintained from one generation to the next. With hundreds of genomes completely sequenced [91] and tens of thousands of protein structures available in PDB [92], it is now possible to study indels in terms of their genomic occurrence, structural and sequence features, functional associations, and potential drugbinding suitability at a large scale. A comprehensive indel survey would identify novel drugbinding sites in bacterial proteins and test several indel-related hypotheses. For instance, it is feasible to suggest that an indel-induced modification may have an impact on protein interaction interfaces leading to changes in protein-protein interactions or protein essentiality. To address the above questions on protein indels, we conducted two indel-related studies: First, associations were found between the presence of sizable indels in protein sequences and protein essentiality. Utilizing available essentiality data for three species (Bacillus subtilis, Escherichia coli, and Saccharomyces cerevisiae) we demonstrated that essential proteins have higher indel occurrence compared to non-essential proteins [93]. Furthermore using S. cerevisiae data, we demonstrated that indel-bearing proteins were involved in more interactions and possess greater connectivity values within the studied PINs. Second, we have examined the correlation between the presence of sizable protein indels and their impact on protein interactions more directly by focusing on the indel occurrence in all paralogous protein pairs within Saccharomyces cerevisiae and Drosophila 15  melanogaster [94]. We found that protein paralog pairs with sequence indels have greater variations in terms of their protein network measures such as degree and betweenness than protein pairs without an indel. The observation suggested that indel sites (especially the ones that are located in close proximity of protein-protein interaction interfaces) can modify proteinprotein associations that lead to loss or a gain of interactions. The findings from the above studies supported a view that indel-bearing proteins are potential drug targets due to their associations to protein essentiality and network rewiring capability. To provide the research community a fast and user-friendly web portal for identification of indel-based drug targets, we created a web-based resource, Indel PDB [95], to represent a structural database of insertions/deletions identified from the sequence alignments of highly similar proteins found in the Protein Data Bank [92]. Indel PDB utilized large amounts of available structural information to characterize 1-, 2- and 3-dimensional features of indel sites. The created resource contains 117,266 non-redundant indel sites extracted from 11,294 indel-containing protein alignments. The identified indel fragments have been characterized by their sequences, lengths, locations, secondary structure composition, solvent accessibility, protein domain association and three dimensional structures. Indel PDB enables functional studies of indels, assists protein modeling efforts, and supports the identification of indel-directed drug binding sites. The results of this work are presented in the Chapter 5.  1.3.2  Targeting conserved proteins by indel-induced drug-binding sites In addition to the above studies, a proof-of-principle on applying the indel concept in  antimicrobial drug development can be best demonstrated by a recent experiment on Leishmania elongation factor 1α (EF-1α) conducted by our laboratory and collaborators [9699]. EF-1α is an essential and conserved protein that functions in protein translation, and contains a 12 amino-acid indel between several pathogens and the human host [98]. Figure 1.6 16  illustrates that the 12 amino-acid indel corresponds to a hairpin fold that is present in the human protein, but not in the Leishmania protein [98]. Interestingly, it has been shown that the protein surface region, which is only exposed in the Leishmania protein (due to the lack of the 12 amino-acid insertion), provides new virulence function by binding to a human tyrosine phosphatase (SHP-1), causing deactivation of the human macrophage cell [97]. Figure 1.6 Superimposed model of Leishmania EF-1α protein (green) and human protein (red).  The 12 amino-acid insertion forms a hairpin fold (red) unique to the human protein.  The above example not only illustrated the presence of indels in conserved proteins and their roles in novel pathogenic functions, but also provided a new opportunity for site-specific drug development. The indel properties have been utilized to develop antibodies and small compounds specifically targeting the protein elongation factor-1(EF-1) in Leishmania donovani with no cross-reaction with the human homologue [98]. The high selectivity is contributed to the unique indel binding site only present in Leishmania EF-1α but shielded by the extra 12 amino-acid hairpin fold in the human protein (Figure 1.7 shows the docking simulation result). Selected compounds have been shown to provide up to 75% of inhibition on Leishmania protein translation, with no effects on human translation system in the laboratory [99]. 17  Figure 1.7 Docking simulation of the compound in the indel-site in Leishmania (A) and human protein structure (B) models.  Docking indicated that the binding of the compound (grey) with the human protein is blocked by the 12 amino-acid hairpin structure (yellow surface).  The Leishmania example led us to apply the indel concept to identify potential compound binding sites in hub proteins determined from the constructed MRSA252 PIN. Structural models have been built for more than 30 bacterial and human proteins to identify suitable indel sites in terms of their unique structural differences, size of the binding pocket, and relative positions to protein surface and functional domains. Potential drug target candidates were then processed by an in silico screening pipeline to identify a list of compound hits, as described in the following section.  1.4 Computer-aided drug discovery in MRSA252 The combination of hub prediction tools, protein interaction network study and analysis of structural indels in an integrated bioinformatics framework has allowed us to identify selective compounds against MRSA252 in a manner different from the traditional approaches. Conventionally, drug discovery is composed of five major stages: target identification, target validation, lead identification, lead optimization, and preclinical/clinical evaluation [9, 100]. During the first stage, a suitable drug target such as a protein receptor or enzyme is identified for its potential role in human diseases. Second, the cause-effect relationship between 18  the target and the disease is validated in vitro and in animal models. Third, compound libraries are screened for either activating or inhibitory binding to the target. Fourth, the suitable compounds (leads) are optimized for their potency, selectivity, and pharmacokinetics including absorption, distribution, metabolism and excretion [100]. Finally, the drug leads are evaluated in animal models, followed by clinical trials in humans. Traditional drug discovery approaches often rely on brute-force in vitro screening to identify leads among millions of compounds from large combinatorial chemical libraries or natural products for their ability to inhibit bacterial cells [75]. The process is cost- and timeconsuming with a low success. For instance, it was estimated that the entire drug discovery process takes an average of 880 million US dollars over 14.7 years [9, 101]. Drug targets are often not determined from the high-throughput screens, leading to unknown mode of drug actions. The cost of drug discovery has increased due to the difficulty of identifying novel and effective drug targets, and the problem is worsened by bacterial drug resistance, rendering many targets and drugs ineffective [102]. The difficulties of novel target identification have forced the industry to focus on developing drugs against established targets, which limits the scope of treatable diseases.  1.4.1  An efficient and cost-effective in silico drug development platform The research presented in the thesis utilized the latest approaches in rational drug  discovery [100] for developing new antibiotic leads more effectively and efficiently compared to the traditional approaches. As illustrated in Figure 1.3, after drug targets were determined from our bioinformatics-based target identification approach with the analysis of the MRSA252 PIN and structural indels on hub proteins, we identified compound leads with high selectivity against S. aureus by leveraging the success of existing protein structure modeling and in silico screening methods. 19  One of the most important tasks in computer-aided drug development is the construction of accurate structural models for drug target proteins. As Critical Assessment of Methods of Protein Structure Prediction (CASP) has demonstrated, the approach of ‘homology modeling’ is the most computationally efficient method for protein structure prediction when suitable structural templates and alignments are available [103, 104]. For instance, programs such as Modeller [105] are widely used for homology modeling with high structural accuracy, and several fully-automatic prediction servers such as SPARKS 2 [106] have been shown to produce comparable models as the ones constructed by human experts. Hence, we have applied the method of homology modeling to establish adequate structural models of selected drug target proteins in both MRSA252 and humans by using crystal structures as template from close species. In the subsequent step, the predicted 3D structures of protein targets enabled in silico protein-compound docking simulations using established computational tools [107-109]. In protein-drug docking, a structural library of compounds (ligands) is docked against a single protein target (receptor). The most optimal ligands are calculated based on an energy function incorporating geometric fit and binding forces with the receptor. A wide range of academic or commercial docking software are available for rigid docking or flexible docking with an average speed of one ligand every few seconds [108], much faster than the conventional in vitro screens. For instance, on average, it takes only 2 weeks to screen over 3 millions of compounds against a single protein target using our in-house computer cluster. Applying the above computational methods, we have successfully identified several novel drug targets and compound leads in MRSA252. The protein targets have been confirmed as essential experimentally and have been screened against millions of compounds in our virtual library. One of the most promising targets is ‘pyruvate kinase’, an essential enzyme in 20  the glycolysis pathway, and hundreds of compound leads against the protein have been identified from the in silico screens for experimental testing. The results from the enzyme assays by the PREPARE laboratory have shown some of the predicted compounds to be inhibitory and selective against pyruvate kinase in MRSA252 with no cross-reaction to the human counterparts. Further experiments have confirmed their ability to inhibit S. aureus bacterial cell growth. Hence, the results from the pyruvate kinase represent a successful story for our antibacterial drug discovery approach and are presented in the Chapter 6. In addition to pyruvate kinase, several other drug targets have been identified and are currently at different stages of the drug development pipeline. To summarize, we developed a useful bioinformatics drug discovery pipeline, whose effectiveness and applicability have been demonstrated in the example of a drug-resistance bacterium, MRSA252. In addition, during the course of the research work, a number of fundamental biological questions regarding to protein-protein interaction and protein indels have been addressed. We anticipate that the bioinformatics and cheminformatics platform as established in the thesis will provide a framework for other PIN and drug discovery initiatives in the future. In the following thesis chapters, we present manuscripts to describe each of the research efforts in more details.  21  1.5 References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21.  World Health Organization: The world health report 2004 - changing history [http://www.who.int/whr/2004/en/index.html] Fauci AS, Touchette NA, Folkers GK: Emerging infectious diseases: a 10-year perspective from the National Institute of Allergy and Infectious Diseases. Emerg Infect Dis 2005, 11(4):519-525. Infectious Diseases Society of America: Bad bugs, no drugs as antibiotic discovery stagnates. A public health crisis brews. [http://www.idsociety.org/badbugsnodrugs.html] Alliance for the prudent use of antibiotics: How Antibiotics Work--the Mechanism of Action [http://www.tufts.edu/med/apua/Miscellaneous/mechanisms.html] Poole K: Mechanisms of bacterial biocide and antibiotic resistance. Symp Ser Soc Appl Microbiol 2002(31):55S-64S. Wright GD: Mechanisms of resistance to antibiotics. Curr Opin Chem Biol 2003, 7(5):563-569. Sheldon AT, Jr.: Antibiotic resistance: a survival strategy. Clin Lab Sci 2005, 18(3):170-180. Roberts MC: Tetracycline resistance determinants: mechanisms of action, regulation of expression, genetic mobility, and distribution. FEMS Microbiol Rev 1996, 19(1):1-24. Butcher EC: Can cell systems biology rescue drug discovery? Nat Rev Drug Discov 2005, 4(6):461-467. Hopkins AL, Groom CR: The druggable genome. Nat Rev Drug Discov 2002, 1(9):727730. Zambrowicz BP, Sands AT: Knockouts model the 100 best-selling drugs--will they model the next 100? Nat Rev Drug Discov 2003, 2(1):38-51. Moran GJ, Mount J: Update on emerging infections: news from the Centers for Disease Control and Prevention. Ann Emerg Med 2003, 41(1):148-151. Simor AE, Ofner-Agostini M, Bryce E, Green K, McGeer A, Mulvey M, Paton S: The evolution of methicillin-resistant Staphylococcus aureus in Canadian hospitals: 5 years of national surveillance. CMAJ 2001, 165(1):21-26. Genome BC: Functional Genomics for Emerging Infectious Diseases (PREPARE) Project. [http://www.genomebc.ca/research_tech/research_projects/health/prepare.htm] Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101-113. Phizicky EM, Fields S: Protein-protein interactions: methods for detection and analysis. Microbiol Rev 1995, 59(1):94-123. Hsing M, Cherkasov A: Bioinformatics tools for researching protein interaction networks. In: Genome Research Advances. Edited by Bellini AP. Hauppauge, New York: Nova Science Publishers; 2007. Fields S, Song O: A novel genetic system to detect protein-protein interactions. Nature 1989, 340(6230):245-246. Bartel PL, Roecklein JA, SenGupta D, Fields S: A protein linkage map of Escherichia coli bacteriophage T7. Nat Genet 1996, 12(1):72-77. Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, Chemama Y, Labigne A, Legrain P: The protein-protein interaction map of Helicobacter pylori. Nature 2001, 409(6817):211-215. Parrish JR, Yu J, Liu G, Hines JA, Chan JE, Mangiola BA, Zhang H, Pacifico S, Fotouhi F, DiRita VJ, Ideker T, Andrews P, Finley RL, Jr.: A proteome-wide protein interaction map for Campylobacter jejuni. Genome Biol 2007, 8(7):R130. 22  22. 23. 24. 25.  26. 27.  28.  29. 30.  31.  Titz B, Rajagopala SV, Goll J, Hauser R, McKevitt MT, Palzkill T, Uetz P: The binary protein interactome of Treponema pallidum--the syphilis spirochete. PLoS One 2008, 3(5):e2292. Fromont-Racine M, Rain JC, Legrain P: Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens. Nat Genet 1997, 16(3):277-282. Fromont-Racine M, Mayes AE, Brunet-Simon A, Rain JC, Colley A, Dix I, Decourty L, Joly N, Ricard F, Beggs JD, Legrain P: Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast 2000, 17(2):95-110. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive twohybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, HirozaneKishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736. Stanyon CA, Liu G, Mangiola BA, Patel N, Giot L, Kuang B, Zhang H, Zhong J, Finley RL, Jr.: A Drosophila protein-interaction map centered on cell-cycle regulators. Genome Biol 2004, 5(12):R96. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski RS, Vandenhaute J, Zoghbi HY, Smolyar A, Bosak S, Sequerra R, Doucette-Stamm L, Cusick ME, Hill DE, Roth FP, Vidal M: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437(7062):1173-1178. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122(6):957-968.  23  32. 33.  34.  35. 36. 37. 38.  39.  40.  41.  Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Seraphin B: A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol 1999, 17(10):1030-1032. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141-147. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180-183. Neubauer G, Gottschalk A, Fabrizio P, Seraphin B, Luhrmann R, Mann M: Identification of the proteins of the yeast U1 small nuclear ribonucleoprotein complex by mass spectrometry. Proc Natl Acad Sci U S A 1997, 94(2):385-390. Mann M, Hendrickson RC, Pandey A: Analysis of proteins and proteomes by mass spectrometry. Annu Rev Biochem 2001, 70:437-473. Yates JR, 3rd: Mass spectrometry as an emerging tool for systems biology. Biotechniques 2004, 36(6):917-919. Arifuzzaman M, Maeda M, Itoh A, Nishikata K, Takita C, Saito R, Ara T, Nakahigashi K, Huang HC, Hirai A, Tsuzuki K, Nakamura S, Altaf-Ul-Amin M, Oshima T, Baba T, Yamamoto N, Kawamura T, Ioka-Nakamichi T, Kitagawa M, Tomita M, Kanaya S, Wada C, Mori H: Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res 2006, 16(5):686-691. Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433(7025):531-537. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636. Bouwmeester T, Bauch A, Ruffner H, Angrand PO, Bergamini G, Croughton K, Cruciat C, Eberhard D, Gagneur J, Ghidelli S, Hopf C, Huhse B, Mangano R, Michon AM, Schirle M, Schlegl J, Schwab M, Stein MA, Bauer A, Casari G, Drewes G, Gavin AC, Jackson DB, Joberty G, Neubauer G, Rick J, Kuster B, Superti-Furga G: A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 2004, 6(2):97-105.  24  42. 43.  44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61.  Brajenovic M, Joberty G, Kuster B, Bouwmeester T, Drewes G: Comprehensive proteomic analysis of human Par protein complexes reveals an interconnected protein network. J Biol Chem 2004, 279(13):12804-12811. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32(Database issue):D452-455. Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D: DIP: the database of interacting proteins. Nucleic Acids Res 2000, 28(1):289-291. Hsing M, Bellenson JL, Shankey C, Cherkasov A: Modeling of cell signaling pathways in macrophages by semantic networks. BMC Bioinformatics 2004, 5:156. Hsing M, Cherkasov A: Integration of biological data with semantic networks. Current Bioinformatics 2006, 1(3):273-290. Barabasi AL: Linked. Cambridge, MA: Plume; 2003. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504. Albert R, Jeong H, Barabasi AL: Error and attack tolerance of complex networks. Nature 2000, 406(6794):378-382. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411(6833):41-42. Wuchty S: Evolution and topology in the yeast protein interaction network. Genome Res 2004, 14(7):1310-1314. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science 2002, 296(5568):750-752. Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23(9):324-328. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 1999, 96(6):2896-2901. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285(5428):751-753. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402(6757):86-90. Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet 2001, 29(4):482-486. Grigoriev A: A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res 2001, 29(17):3513-3519. Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12(1):37-46. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285-4288. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 2001, 11(12):21202126. 25  62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82.  Gomez SM, Rzhetsky A: Towards the prediction of complete protein--protein interaction networks. Pac Symp Biocomput 2002:413-424. Ng SK, Zhang Z, Tan SH: Integrative approach for computationally inferring protein domain interactions. Bioinformatics 2003, 19(8):923-929. Obenauer JC, Yaffe MB: Computational prediction of protein-protein interactions. Methods Mol Biol 2004, 261:445-468. Reiss DJ, Schwikowski B: Predicting protein-peptide interactions via a network-based motif sampler. Bioinformatics 2004, 20 Suppl 1:I274-282. Lu L, Lu H, Skolnick J: MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins 2002, 49(3):350-364. Aloy P, Russell RB: Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci U S A 2002, 99(9):5896-5901. Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 2004, 20(5):604-611. Hoffmann R, Krallinger M, Andres E, Tamames J, Blaschke C, Valencia A: Text mining for metabolic pathways, signaling cascades, and protein networks. Sci STKE 2005, 2005(283):pe21. Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 2006, 63(3):490-500. Huang H, Jedynak BM, Bader JS: Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Comput Biol 2007, 3(11):e214. Hsing M, Byler KG, Cherkasov A: The use of Gene Ontology terms for predicting highly-connected 'hub' nodes in protein-protein interaction networks. BMC Syst Biol 2008, 2:80. Byler K, Hsing M, Cherkasov A: The Use of sequence-derived QSPR descriptors for predicting highly connected proteins (hubs) in protein-protein interactions. QSAR & Combinatorial Science 2009, 28:509-519. Barker JJ: Antibacterioal Drug Discovery and Structure-Based Design. Drug Discovery Today 2006, 11(9/10):391-404. Chan PF, Holmnes DJ, Payne DJ: Finding the Gems Using Genomic Discovery: Antibacterial Drug Discovery Strategies - The Successes and Challenges. Drug Discovery Today: Therapeutic Strategies 2004, 1(4):519-527. Garcia-Lara J, Masalha M, Foster SJ: Staphylococcus aureus: the search for novel targets. Drug Discov Today 2005, 10(9):643-651. Payne DJ, Gwynn MN, Holmes DJ, Pompliano DL: Drugs for bad bugs: confronting the challenges of antibacterial discovery. Nat Rev Drug Discov 2007, 6(1):29-40. Cherkasov A, Lee SJ, Nandan D, Reiner NE: Large-scale survey for potentially targetable indels in bacterial and protozoan proteins. Proteins 2006, 62(2):371-380. Benner SA, Cohen MA, Gonnet GH: Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol 1993, 229(4):1065-1082. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402. Yang Z, Goldman N, Friday A: Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol Biol Evol 1994, 11(2):316-324. Whelan S, Lio P, Goldman N: Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet 2001, 17(5):262-272. 26  83. 84. 85. 86. 87. 88. 89.  90. 91. 92. 93. 94. 95. 96. 97.  98. 99.  100. 101.  Thorne JL: Models of protein sequence evolution and their applications. Curr Opin Genet Dev 2000, 10(6):602-605. Pang A, Smith AD, Nuin PA, Tillier ER: SIMPROT: using an empirically determined indel distribution in simulations of protein evolution. BMC Bioinformatics 2005, 6:236. Gu X, Li WH: The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol Evol 1995, 40(4):464-473. Qian B, Goldstein RA: Distribution of Indel lengths. Proteins 2001, 45(1):102-104. Pascarella S, Argos P: Analysis of insertions/deletions in protein structures. J Mol Biol 1992, 224(2):461-471. Sibanda BL, Thornton JM: Accommodating sequence changes in beta-hairpins in proteins. J Mol Biol 1993, 229(2):428-447. Fechteler T, Dengler U, Schomburg D: Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria. J Mol Biol 1995, 253(1):114-131. Garcia-Diaz M, Kunkel TA: Mechanism of a genetic glissando*: structural biology of indel mutations. Trends Biochem Sci 2006, 31(4):206-214. Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005, 33(Database issue):D501-504. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242. Chan SK, Hsing M, Hormozdiari F, Cherkasov A: Relationship between insertion/deletion (indel) frequency of proteins and essentiality. BMC Bioinformatics 2007, 8:227. Hormozdiari F, Salari R, Hsing M, Schonhuth A, Chan SK, Sahinalp SC, Cherkasov A: The effect of insertions and deletions on wirings in protein-protein interaction networks: a large-scale study. J Comput Biol 2009, 16(2):159-167. Hsing M, Cherkasov A: Indel PDB: a database of structural insertions and deletions derived from sequence alignments of closely related proteins. BMC Bioinformatics 2008, 9:293. Nandan D, Yi T, Lopez M, Lai C, Reiner NE: Leishmania EF-1alpha activates the Src homology 2 domain containing tyrosine phosphatase SHP-1 leading to macrophage deactivation. J Biol Chem 2002, 277(51):50190-50197. Nandan D, Cherkasov A, Sabouti R, Yi T, Reiner NE: Molecular cloning, biochemical and structural analysis of elongation factor-1 alpha from Leishmania donovani: comparison with the mammalian homologue. Biochem Biophys Res Commun 2003, 302(4):646-652. Cherkasov A, Nandan D, Reiner NE: Selective targeting of indel-inferred differences in spatial structures of highly homologous proteins. Proteins 2005, 58(4):950-954. Nandan D, Lopez M, Ban F, Huang M, Li Y, Reiner NE, Cherkasov A: Indel-based targeting of essential proteins in human pathogens that have close host orthologue(s): discovery of selective inhibitors for Leishmania donovani elongation factor-1alpha. Proteins 2007, 67(1):53-64. Terstappen GC, Reggiani A: In silico research in drug discovery. Trends Pharmacol Sci 2001, 22(1):23-26. DiMasi JA, Hansen RW, Grabowski HG: The price of innovation: new estimates of drug development costs. J Health Econ 2003, 22(2):151-185. 27  102. 103. 104. 105. 106. 107. 108. 109.  Rybak MJ: Resistance to antimicrobial agents: an update. Pharmacotherapy 2004, 24(12 Pt 2):203S-215S. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction (CASP)--round 6. Proteins 2005, 61 Suppl 7:3-7. Venclovas C, Margelevicius M: Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment, and structure assessment. Proteins 2005, 61 Suppl 7:99-105. Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993, 234(3):779-815. Zhou H, Zhou Y: SPARKS 2 and SP3 servers in CASP6. Proteins 2005, 61 Suppl 7:152-156. Dror O, Shulman-Peleg A, Nussinov R, Wolfson HJ: Predicting molecular interactions in silico: I. A guide to pharmacophore identification and its applications to drug design. Curr Med Chem 2004, 11(1):71-90. Schneidman-Duhovny D, Nussinov R, Wolfson HJ: Predicting molecular interactions in silico: II. Protein-protein and protein-drug docking. Curr Med Chem 2004, 11(1):91107. Schneider G, Fechner U: Computer-based de novo design of drug-like molecules. Nat Rev Drug Discov 2005, 4(8):649-663.  28  2 The use of Gene Ontology terms for predicting highly-connected ‘hub’ nodes in protein-protein interaction networks1 2.1 Introduction A broad range of cellular functions are mediated through complex protein-protein interactions, which are commonly visualized as two-dimensional networks connecting thousands of proteins by their physical interactions. Such a network perspective suggests that cellular effects and functions of proteins can only be fully understood in context with their interacting partners in a protein interaction network (PIN). The study of PINs has been made possible through recent advancements in highthroughput proteomics that have detected protein-protein interactions on a genome-wide scale, which generated large amounts of interaction data for several species including Saccharomyces cerevisiae [3-7], Escherichia coli [8], Drosophila melanogaster [9], Caenorhabditis elegans [10], and Homo sapiens [11, 12]. The corresponding protein interaction networks have been made publicly accessible through open access databases such as IntAct [13] and DIP [14]. The accumulated protein interaction data have further supported recent protein network analyses that demonstrated the scale-free organization of PINs, where the majority of proteins have a low number of interactions in the network, with a few highly-connected proteins (also called hubs) having a significant number of interacting partners [1, 2]. Such inhomogeneous network topology allows a PIN to be robust against random removal of protein nodes, but vulnerable to targeted removal of network hubs [15]. In addition, previous studies have shown defined relationships between the degree of connectivity of proteins in PINs, their sequence conservation, and cellular essentiality properties [16, 17]. Those studies indicated that highly1  A version of this chapter has been published. Hsing M, Byler KG, Cherkasov A. The use of Gene Ontology terms for predicting highly-connected ‘hub’ nodes in protein-protein interaction networks. BMC Systems Biology 2008, 2:80. 29  connected proteins (or hubs) represent very attractive subjects for understanding cellular functions, identifying novel drug targets, and for use in the rational design of large-scale pulldown experiments. Although large-scale PINs have already been experimentally determined for several species (and thus represent suitable training sets for hub-predicting bioinformatics approaches), in general, protein interaction data are still lacking for many organisms. Thus, several computational approaches have been developed to predict protein-protein interactions utilizing existing bioinformatics data such as gene proximity information [18, 19], gene fusion events [20, 21], gene co-expression data [22-24], phylogenetic profiling [25], orthologous protein interactions [26] and identification of interacting protein domains [27-30]. Several other bioinformatics approaches have been developed that identify hypothetical interactions between proteins based on their three-dimensional structures [31, 32] or by applying text-mining techniques [33, 34]. Traditionally, such computational predictions have focused on the identification of pairwise protein-protein interactions with varying degrees of accuracy [35]; however, none of them have been explicitly focused on predicting hypothetical hub proteins. At the same time, it is reasonable to hypothesize that hub proteins should share certain common sequence or structural features that not only enable them to participate in multitudes of protein interactions, but also can be utilized for the theoretical identification of such hub proteins without prior knowledge of the corresponding PINs. Therefore, the goal of this study is to develop such a ‘hub predictor’ (or classifier), capitalizing on experimental and bioinformatics data available to date for proteins in several model organisms with alreadydetermined PINs. We have focused the construction of the hub classifier on Gene Ontology (GO) data, which provide functional annotations for individual proteins using an expert knowledge base 30  [36-38]. The advantage of applying GO annotation to hub prediction lies in the readily available information for proteins in hundreds of species. Importantly, the GO annotations have been shown to reflect certain properties that can mediate protein-protein interactions [35], but the annotation itself does not rely on the availability of corresponding experimental data. Thus, the GO-based hub classifier should be suitable for predicting highly-connected proteins, even in organisms that lack protein interaction data. Here we present the development of such a hub protein classifier, trained on the existing GO and protein-protein interaction data for Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens species. The generated models were crossvalidated and tested on two external protein interaction data sets: Methicillin-resistant Staphylococcus aureus (MRSA) 252 and Caenorhabditis elegans. The developed bioinformatics approach has not only demonstrated an improved accuracy in identifying highly-connected PIN nodes (as compared to homology- or protein domain-based predicting methods), but has also shown an improved speed and a lower demand on computational resources. To illustrate a possible application of the developed tool, we have used it for rationalizing a bait selection strategy for a large-scale protein complex pull-down experiment. The source code and executable program of the hub classifier is freely available for download at: http://www.cnbi2.ca/hub/  31  2.2 Methods 2.2.1  Data acquisition  2.2.1.1 Protein-protein interaction data Protein interaction data used for the training and testing of the hub protein classifier were obtained from the IntAct database [13] for the following species: Escherichia coli K 12 (taxonomy ID: 83333), Saccharomyces cerevisiae (taxonomy ID: 4932), Drosophila melanogaster (taxonomy ID: 7227), and Homo sapiens (taxonomy ID: 9606) (acquisition date: Sep. 25th, 2007). Two external validation data sets were collected for protein interactions in MRSA252 (provided by the PREPARE project in Vancouver BC Canada [39]) and Caenorhabditis elegans (obtained from IntAct database on Sep. 25th, 2007). Table 2.1 lists the total number of proteins and their interactions of the four species in the training and testing, which have been combined into a single data set for the subsequent analyses. Similar information on the external validation sets is shown in Table 2.2. Table 2.1 A summary of protein interaction and Gene Ontology annotation data used in the training and testing of the hub classifiers. Training / Testing set # of proteins # of hubs (10% of total proteins) # of non-hubs (90% of total proteins) # of protein interactions minimum # of interactions per hub # of proteins with at least one GO term # of proteins without any GO term  E. coli  S. cerevisiae  D. melanogaster  H. sapiens  total of 4 species  2860  5397  6935  6592  21784  286  535  628  620  2069  2574  4862  6307  5972  19715  13888  37167  19994  19115  90164  20  33  16  13  1378  4738  5931  5097  17144  1482  659  1004  1495  4640  48.18%  87.79%  85.52%  77.32%  78.70%  # of different GO terms - process  30  41  48  49  50  # of different GO terms - function  21  37  38  37  40  4  27  31  29  35  55  105  117  115  125  % of proteins with at least one GO term  # of different GO terms - component # of different GO terms - total  The top table lists the protein interactions and hubs in each of the four species, and the bottom part of the table lists the number of unique GO terms for each annotation category.  32  Table 2.2 A summary of protein interaction and Gene Ontology annotation data used in the external validation of the hub classifiers. External validation set # of proteins # of hubs (10% of total proteins) # of non-hubs (90% of total proteins) # of protein interactions minimum # of interactions per hub # of proteins with at least one GO term # of proteins without any GO term % of proteins with at least one GO term # of different GO terms - process # of different GO terms - function # of different GO terms - component # of different GO terms - total  MRSA252 133 13 120 2401 45  C. elegans 2890 276 2614 4594 7  109 24 81.95%  2403 487 83.15%  27 19 5 51  46 34 22 102  The top table lists the protein interactions and hubs in each of the two species, and the bottom part of the table lists the number of unique GO terms for each annotation category. Hub proteins were identified based on their numbers of protein interactions and their percentile ranking relative to other proteins in the same species. Proteins of the same species were divided into different percentile groups, sorted by the number of protein-protein interactions in a decreasing order (i.e. higher percentile proteins have more interactions than lower percentile proteins). It is clear that hub proteins have more interactions than non-hubs, but currently there is no consensus on exactly how many interactions a hub protein should have. Often, hubs are defined arbitrarily to have at least certain number of interactions [40]. In our study, the hub selection criterion was based on the position of a sharp turn (or inflection point) on a cumulative protein interaction distribution plot from each of the four species. As shown in Figure 2.1, the protein interactions followed a power law distribution, such that a sharp turn is visible around the 90th protein percentile position on the interaction plots.  33  Figure 2.1 Cumulative protein interaction distribution plots.  a) E. coli, b) S. cerevisiae, c) D. melanogaster, d) H. sapiens. On each plot, the (x, y) coordinate of the sharp turn or the inflection point is shown.  To achieve a consistent hub definition across the four studied species, hub proteins were defined as above or equal to the 90th percentiles of interactors; in other words, the hubs represented the top 10 percent of highly-connected interactors, and the non-hubs consisted of the bottom 90 percent of the proteins. Using this definition, hub proteins were determined from each of the four PINs individually. At the 90th protein percentile, E. coli hubs have at least 20 protein interactions, S. cerevisiae hubs have at least 33 protein interactions, D. melanogaster hubs have at least 16 protein interactions, and H. sapiens hubs have at least 13 interactions. The number of assigned hub and non-hub classifications is shown in Table 2.1.  34  Figure 2.2 illustrates the subsequent steps involved in the development of the hub protein classifiers and their corresponding bioinformatics analyses. Figure 2.2 A flow chart of the development of the hub protein classifiers and their corresponding bioinformatics analyses.  2.2.1.2 Gene Ontology (GO) data Each protein obtained from the IntAct database was identified by a unique UniProt accession number, which enabled a fast collection of GO annotation data from the UniProt Retrieval System [37, 41] (UniProt protein data obtained on Oct. 1st, 2007). The complete UniProt protein annotation pages were downloaded as flat texts, which were then parsed by PERL scripts to extract the GO annotations in the three categories: biological process, molecular function, and cellular component. Because each GO term could be assigned to a different level of the annotation hierarchy, we established a fixed general GO level that 35  represented all of the specific GO terms of the proteins in the study. This general Gene Ontology annotation level was determined based on the GO slim project, which provides a list of generic GO terms on which many bioinformatics analyses can be performed [42]. Importantly, the GO slim generic terms provided a reasonable number of protein ‘predictors’ for a machine learning method to effectively operate. The tool “map2slim” [43] was used to map specific GO terms to the “GO slim” generic terms (GO annotation files were obtained from [44] on Oct. 17th, 2007; GO format-version: 1.2, GO date: 16:10:2007 16:19, GO revision: 5.514; GO slim format-version: 1.2, GO slim date: 01:10:2007 16:53, GO slim revision: 1.682). This generic version of GO slim contained 53 [biological process] terms, 42 [molecular functions] terms and 37 [cellular component] terms. Table 2.1 and 2.2 list the number of GO slim terms used to annotate the proteins in each species and the number of the proteins with or without a GO annotation term. All protein interaction data and GO annotations were stored in a local MySQL database for fast data searching and reporting.  2.2.2  Hub protein classification by boosting trees To train models that classify a protein as a hub or a non-hub, the protein interaction data  from the four species were combined into a single data set (90,164 interactions involving 2,069 hubs and 19,715 non-hubs). A four-fold cross-validation strategy was used in which four nonoverlapping testing sets (25% of the total protein set), and four training sets (75% of the total protein set) were utilized for building the hub classifiers. Each training and testing set maintained the same hub to non-hub (1:9) ratio. In addition, the proteins in the training sets have maintained the same distribution of Gene Ontology annotation terms as the proteins in the testing sets. Figure 2.3 illustrates the distribution of each of the 125 GO terms, represented by the percentage of proteins with this term in the training sets vs. the testing sets of the four 36  cross-validation samples. A high correlation R2 values of 0.9981 ~ 0.9983 indicated an equal Gene Ontology distribution between the training and testing sets. It is also shown that the majority of the GO terms were associated with less than 10% of the proteins in a given data set. Figure 2.3 Distribution of Gene Ontology annotation terms between the training and testing sets in the four cross-validation samples.  Each point on a graph represents the percentage of proteins annotated with a given GO term in the training set (x-axis), and the percentage of proteins annotated with the same GO term in the testing set (y-axis). All four plots were fitted with linear regression lines, with high R2 values of 0.998. This indicates an equal distribution of the GO terms between the training and testing sets of the four samples.  37  We focused the machine-learning effort on hub classification by applying boosting trees, which is one of the best performing methods for classifying complex data and providing interpretable results [45]. The training and testing of the hub-predicting classification trees were performed on 125 Gene Ontology terms as predictor variables by using the boosting tree application as implemented in STATISTICA version 8 [46]. The input data were formatted as tables of binary data where each column represented a GO term variable (1 = present, 0 = absent) and each row represented a sample protein. Four classifiers were built (one for each of the four training sets) and compiled in the C++ language under Linux. In addition to the four testing sets in the cross-validation study, the best of the four hub classifiers has been validated on two external data sets, which were consisted of experimentally-determined PINs in MRSA252 and C. elegans. The classifier predicted each protein in the data sets as either a hub or a non-hub, and the classification statistics were calculated as the following: Sensitivity = TP / (TP + FN) Specificity = TN / (TN + FP) Accuracy = (TP + TN) / (TP + TN + FP + FN) PPV (Positive Predictive Value) = TP / (TP + FP) NPV (Negative Predictive Value) = TN / (TN + FN) where TP = number of true positives, FP = number of false positives, TN = number of true negatives, and FN = number of false negative. A useful output feature of the boosting tree method is the relative predictor importance, which measures the average influence of a predictor variable on the prediction outcome over all  38  of the trees [45]. The most important predictor is assigned a value of 100, and the other variables are scaled accordingly.  2.2.3  Comparison of the hub classifiers with other existing protein interaction prediction approaches To further assess the performance of the hub classifier against other existing approaches  for predicting hub proteins, we applied three different types of bioinformatics methods to construct hypothetical PINs in MRSA252, where hub proteins were determined by the number of predicted pairwise protein-protein interactions.  2.2.3.1 Hypothetical PIN – pathway maps The first type of hypothetical PIN represented the known protein-protein interactions available for MRSA252. A total of 513 protein interactions were manually extracted from the pathway maps in the KEGG database [47] (acquisition date: May 3rd, 2006).  2.2.3.2 Hypothetical PIN – orthologous interactions The second type of PIN was constructed based on known protein-protein interactions between orthologs from three other species: Helicobacter pylori, Saccharomyces cerevisiae, and Escherichia coli. The experimental PIN in H. pylori was obtained from the BIND database [48] (acquisition date: Aug. 11th, 2005). Two sources were used to build the S. cerevisiae PIN: the BIND database (acquisition date: Aug. 11th, 2005) and Gavin’s study [6] (acquisition date from the IntAct database [13]: Feb. 7th, 2006). We extracted the E. coli PIN in Butland’s study [8] from the IntAct database [13] (acquisition date: Apr. 13th, 2006). 2656 protein sequences in MRSA252 were obtained from the RefSeq databases at NCBI [49] (acquisition date: Feb. 4th, 2006). The orthologs of the interacting proteins from each of the above species were identified in MRSA252 by using the program InParanoid [50] 39  (version 1.35). If a pair of MRSA252 proteins whose orthologs interacted in one of the three species, the pair was assigned as an interacting protein pair. A total of 3258 protein interactions were predicted for this type of MRSA252 PIN reconstruction.  2.2.3.3 Hypothetical PIN – interacting domains The third type of MRSA PIN was predicted based on protein domain-domain interactions. First, the presence of Pfam domains [51] in each of the 2656 MRSA252 proteins was determined by scanning the Pfam domain profiles (version 19.0) with the program HMMER [52] (version 2.3.2). Second, domain-domain interaction data were acquired from two sources: InterDom [53] (version: 1.2) and iPfam [54] (version: 19.0). If a pair of MRSA252 proteins contained interacting domains according to one of the two sources, the pair was assigned as an interacting protein pair. A total of 11,608 protein interactions were predicted based by this method.  2.2.3.4 Validating the prediction on an experimental MRSA252 PIN The experimental MRSA252 PIN provided by the PREPARE project contained interaction data for 133 proteins and was used as the external validation set for measuring the prediction performance of the hub classifier and the different types of hypothetical PINs. We have compared the prediction results in two different ways. In the first type of comparison, both the hub classifier and the combined hypothetical PINs classified the 133 MRSA proteins as hubs or non-hubs, while the same 133 proteins were also classified as hubs or non-hubs based on the experimental results provided by PREPARE. In the case of the hub classifier, hubs and non-hubs were reported explicitly from the prediction program. In the cases of hypothetical and experimental PINs, hubs were defined as above or equal to the 90th percentile of proteins ranked by the number of interactions (same criterion as the hub 40  classifier). The following classification statistics were calculated: sensitivity, specificity, accuracy, PPV and NPV. In the second type of comparison, we compared ranked lists of proteins based on their ‘hub-likeness’ property. In the case of the hub classifier, the proteins were ranked based on the differences between predicted hub probabilities and non-hub probabilities as computed by the boosting tree method. In the case of the hypothetical and experimental PINs, the proteins were ranked by their numbers of protein interactions. The ranked lists were compared to the list of proteins ranked by the number of experimental interactions in MRSA252 by using a Spearman rank order correlation as implemented in STATISTICA 8.  2.2.3.5 Validating the prediction on an experimental C. elegans PIN In addition to MRSA252, we have tested the hub protein classifier on an external set of protein interaction data in C. elegans. The same procedure was applied to determine hub prediction statistics, as described above.  2.2.3.6 Test of significance To test the hub protein classifier against a null hypothesis, which claims there is no difference of Gene Ontology term distribution between hubs and non-hubs, we have randomized the protein interaction data in the following ways. First, the same 5445 proteins in the testing set (25% of the total protein set consisted of the four species) for the hub classifier were used in the construction of a randomized data set. Second, 10% of those proteins were randomly assigned as hubs, while the other 90% of proteins were randomly assigned as nonhubs. Third, the GO terms originally associated with those proteins were randomly distributed within the data set. The combination of the above two randomization methods ensured that there was no significant difference in GO term distribution between the hub and non-hub 41  proteins. Finally, the hub classifier was used to predict hubs and non-hubs in the randomized data set, and prediction statistics were obtained.  2.2.4  Simulation of protein bait selections and network coverage The effectiveness of protein bait selections assisted by the hub classifier has been  simulated using yeast protein-protein interaction data determined by protein-complex pulldown and mass spectrometry experiments, available from Gavin’s study [6]. One major goal of such large-scale experiments is to maximize the number of protein interactions identified by using a small set of proteins as ‘baits’ to pull down their interactors (preys). Therefore, it is crucial to select protein baits based on properties that will produce the best network coverage, as measured by the ratio between the number of protein interactions identified by an experiment and the total number of interactions in an organism. In our simulation experiments, 18,028 interactions, involving 2551 proteins from Gavin’s yeast data set (acquisition date from the IntAct database [13]: Feb. 7th, 2006), were hypothetically treated as the total number of protein interactions in Saccharomyces cerevisiae. To simulate the bait selection process, we selected a subset of proteins (ranged from 5% up to 100% of the 2551 yeast proteins) as baits and calculated the number of interactions such baits would “pull-down” from the yeast interaction data set and computed the overall network coverage. Two selection criteria were used. In one simulation, the baits were randomly selected from the total pool of the yeast proteins. In the other simulation, the baits were selected from the pool of hub proteins predicted by the hub classifier. In addition to the bait selection strategy described above (referred to as one-round selection), we simulated the network coverage results by applying a second round of selections. In this type of selection, baits were divided into two sets: one-third as the first round of baits, and two-thirds as the second round of baits. The first-round baits were chosen by either random 42  selection or by hub prediction. The second round of baits was selected from the most abundant preys pulled down by the first round of baits. Such an approach is also referred to as the “name your friend” method and has been applied to maximize the effectiveness in vaccinations against infectious diseases [55, 56], as well as in some protein complex experiments [8].  2.3 Results and discussion 2.3.1  Prediction performance of the hub prediction classifier One prediction model was constructed for each of the four cross-validation samples, for  a total of four hub classifiers. The executable files of the classifiers were complied by the Gnu C++ compiler in Linux. The classifier programs used a list of query proteins and their corresponding GO term occurrences as the input file, and produced the same list of the proteins with hub prediction results and probability scores. The running time was only a few seconds for predicting hubs from over 21,000 proteins on a 3.0GHz Pentium D personal computer. Overall, the classification statistics were consistent between the training and testing sets for the four classifiers, and the classification statistics on the best of the four hub classifiers is shown in Table 2.3.  43  Table 2.3 Prediction performance of the hub classifier in the combined data set of the four species. Hub classifier (# of nodes in each tree = 15, FN: FP penalty = 1:1.9, total # of trees = 187) Training observed non-hub hub  predicted non-hub 13381 986  predicted hub 1405 567  sensitivity 36.51%  specificity 90.50%  accuracy 85.37%  PPV 28.75%  NPV 93.14%  predicted non-hub 4415 371  predicted hub 514 145  sensitivity 28.10%  specificity 89.57%  accuracy 83.75%  PPV 22.00%  NPV 92.25%  predicted non-hub 17796 1357  predicted hub 1919 712  sensitivity 34.41%  specificity 90.27%  accuracy 84.96%  PPV 27.06%  NPV 92.91%  Testing observed non-hub hub All observed non-hub hub  The observed vs. predicted hubs and non-hubs and their corresponding classification statistics are shown for the best classifier based on the training, testing and all (training + testing) data sets.  We have further validated the prediction accuracy of the best hub classifier in the external MRSA252 data set. As indicated in Table 2.4, in comparison to the other protein prediction methods, the hub classifier has the highest prediction statistics. The next best hub prediction result was achieved by the hypothetical MRSA PIN based on orthologous interactions. On the other hand, the results from the predicted PINs of pathway maps and interacting domains were poor as none of them had any true positives.  44  Table 2.4 Hub prediction comparison of the classifier and the hypothetical PINs in MRSA252. Hub classifier predicted observed non-hub non-hub 109 hub 9  predicted hub 11 4  Hypothetical PIN – pathway maps predicted predicted hub observed non-hub non-hub 111 9 hub 13 0  sensitivity 30.77%  specificity 90.83%  accuracy 84.96%  PPV 26.67%  NPV 92.37%  sensitivity 0.00%  specificity 92.50%  accuracy 83.46%  PPV 0.00%  NPV 89.52%  specificity 91.67%  accuracy 84.96%  PPV 23.08%  NPV 91.67%  specificity 97.50%  accuracy 87.97%  PPV 0.00%  NPV 90.00%  Hypothetical PIN – orthologous interactions predicted predicted hub sensitivity observed non-hub non-hub 110 10 23.08% hub 10 3 Hypothetical PIN – interacting domains predicted predicted hub sensitivity observed non-hub non-hub 117 3 0.00% hub 13 0  Combined hypothetical PIN – (pathway maps + orthologous interactions) predicted predicted observed non-hub hub sensitivity specificity accuracy PPV non-hub 110 10 23.08% 91.67% 84.96% 23.08% hub 10 3  NPV 91.67%  Combined hypothetical PIN - (pathway maps + orthologous interactions + interacting domains) predicted predicted observed non-hub hub sensitivity specificity accuracy PPV NPV non-hub 108 12 7.69% 90.00% 81.95% 7.69% 90.00% hub 12 1  The prediction performance of the hub classifier is compared to that of the hypothetical PINs in MRSA252. The classification statistics is reported. In the other comparison, we correlated a ranked list of proteins based from their ‘hublikeness’ (determined from either the hub classifier or the hypothetical PINs) to that of the experimental MRSA PIN. As shown in Table 2.5, the hub classifier had a correlation coefficient of 0.32 – highest among all other methods. The next best correlation was achieved by the hypothetical PIN of orthologous interactions. 45  Table 2.5 Comparing ranked lists of hub-likeness properties between the classifier and the hypothetical PINs in MRSA252. Hub prediction methods Hub classifier Hypothetical PIN – pathway maps Hypothetical PIN – orthologous interactions Hypothetical PIN – interacting domain Combined hypothetical PIN – (pathway maps + orthologous interactions) Combined hypothetical PIN - (pathway maps + orthologous interactions + interacting domains)  correlation coefficient 0.32 0.11 0.27 -0.29 0.24 -0.01  The ranked protein lists based on hub-likeness properties, produced by either the classifier or the hypothetical PINs, has been compared to that of the experimental PIN in MRSA252. The coefficient of Spearman rank order correlation is reported with p-value < 0.05. In addition to MRSA252, the hub protein classifier has achieved comparable prediction results in the C. elegans validation data set as shown in Table 2.6. Table 2.6 Hub prediction result in C. elegans. observed non-hub hub  predicted non-hub 2270 185  predicted hub 344 91  sensitivity 32.97%  specificity 86.84%  accuracy 81.70%  PPV 20.92%  NPV 92.46%  The prediction performance of the hub classifier was validated, based on the experimental PIN in C. elegans. The prediction statistics of the hub classifier on the randomized data set are summarized in Table 2.7. The result shows that the hub classifier was not able to achieve a significant hub prediction when the GO terms and protein hubs were randomly assigned. The prediction only reached 11.43% sensitivity and 8.39% PPV in the randomized set, compared to 28.10% sensitivity and 22.00% PPV in the testing set before the randomizations. The specificity and NPV were comparable before and after the randomizations, due to the inherited 1:9 ratio between the number of hubs and non-hubs. Therefore, it is easier to make a correct prediction on non-hub proteins than hub proteins. The comparison of the prediction results between the testing set and the randomized set indicates that hub proteins have a distinct distribution of GO terms, which contributed to the predictability of the hub classifier. 46  Table 2.7 Hub prediction result in the randomized data set. observed non-hub hub  predicted non-hub 4285 457  predicted hub 644 59  sensitivity 11.43%  specificity 86.93%  accuracy 79.78%  PPV 8.39%  NPV 90.36%  The prediction performance of the hub classifier was tested on the null hypothesis that there is no difference of GO term distribution between hubs and non-hubs. Overall, the hub classifier built on the Gene Ontology annotations achieved high specificity and NPV, but had lower than expected sensitivity and PPV. We attribute this to the lack of Gene Ontology annotations for certain proteins in the training sets, as the level of annotations varied among the four species. For instance, S. cerevisiae had the highest percentage of the proteins with GO annotations (87.8%), while only 48.2% of the proteins in E. coli had any GO annotation. Therefore, the performance of the current hub classifier primarily relied on the number of GO annotations available for each species. We expect the sensitivity value of the hub classifier to be improved when more annotation data become available for the four species in the training sets.  2.3.2  GO term predictor importance An indicator of the contribution of each Gene Ontology term used in the boosted trees  classifiers was provided by the relative importance of predictors in the training output. The importance value ranged from 0 to 100, where 100 indicated that a predictor had the most influence on the hub prediction outcome, and 0 meant a predictor had the least influence. The top 20 GO annotation terms that were likely to be shared among hub proteins are listed in Table 2.8.  47  Table 2.8 Top 20 important Gene Ontology term predictors. GO ID GO:0005730 GO:0003723 GO:0005515 GO:0006412  GO:0006139 GO:0006996 GO:0030246 GO:0005840 GO:0005777 GO:0009719 GO:0007049 GO:0004871 GO:0005654 GO:0008219 GO:0006118 GO:0006259 GO:0050789 GO:0006950 GO:0005811 GO:0008135  GO name nucleolus RNA binding protein binding translation nucleobase, nucleoside, nucleotide and nucleic acid metabolic process organelle organization and biogenesis carbohydrate binding ribosome peroxisome response to endogenous stimulus cell cycle signal transducer activity nucleoplasm cell death electron transport DNA metabolic process regulation of biological process response to stress lipid particle translation factor activity, nucleic acid binding  GO Type cellular component molecular function molecular function biological process  predictor importance 100 97 96 95  biological process  90  biological process molecular function cellular component cellular component  89 87 86 85  biological process biological process molecular function cellular component biological process biological process biological process  82 81 77 77 75 73 73  biological process biological process cellular component  73 72 71  molecular function  70  The top Gene Ontology terms included several annotations such as ‘RNA binding’, ‘translation’, and ‘ribosome’, commonly used to annotate ribosomal proteins, which were often identified as the top interacting proteins in other experiments [6, 8]. The list of important predictors indicated that hub proteins tend to participate in several common cellular processes, including translation, nucleotide metabolism, organelle biogenesis, cell cycle, signal transduction, cell death, and electron transport.  2.3.3  Applying hub classifier to protein bait selection The bait selection strategy, assisted by the hub classifier, was simulated in the  experimental PIN of Saccharomyces cerevisiae. In the case of one-round selection, choosing 48  baits that were predicted as hubs by the classifier has greatly increased the network coverage in comparison to random selection. For instance, as illustrated in Figure 2.4, when 15% of total proteins were selected as baits based on the result of the hub classifier, 42.39% of the network coverage was achieved. On the other hand, only 26.53% of the network coverage was generated by the random bait selection. In the case of the two-round selection, the network coverage produced by either random or hub bait selection has shown a great improvement from the one-round selection. Figure 2.4 Network coverage of different bait selection strategies in protein complex pulldown experiments, simulated in Saccharomyces cerevisiae.  The results suggest that the hub classifier is a useful tool for selecting baits and prioritizing proteins for protein interaction experiments. Although it was not explored in the present study, we expect that the hub classifier can also assist in the identification of highlyinteracting proteins in pathogens as potential drug targets.  49  2.4 Conclusion We have studied the available interaction and Gene Ontology data for proteins in Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens genomes. By utilizing the boosting trees classification method, we have shown that highlyconnected proteins in the studied PINs share certain common GO terms; this observation enabled the development of a hub classifier capable of distinguishing hub proteins from nonhubs. This classifier has improved accuracy for hub prediction relative to other traditional approaches for protein interaction prediction. It is anticipated that the hub classifier can serve as a useful tool to identify highly-interacting proteins in species without any available protein interaction data, with potential applications in optimizing protein pull-down experiments and identifying new drug targets against pathogens.  50  2.5 References 1. 2. 3.  4. 5.  6.  7.  8.  9.  Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101-113. Albert R: Scale-free networks in cell biology. J Cell Sci 2005, 118(Pt 21):4947-4957. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive twohybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180-183. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637-643. Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433(7025):531-537. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736. 51  10.  11.  12.  13.  14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.  Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, HirozaneKishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski RS, Vandenhaute J, Zoghbi HY, Smolyar A, Bosak S, Sequerra R, Doucette-Stamm L, Cusick ME, Hill DE, Roth FP, Vidal M: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437(7062):1173-1178. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122(6):957-968. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32(Database issue):D452-455. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32(Database issue):D449451. Albert R, Jeong H, Barabasi AL: Error and attack tolerance of complex networks. Nature 2000, 406(6794):378-382. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411(6833):41-42. He X, Zhang J: Why do hubs tend to be essential in protein networks? PLoS Genet 2006, 2(6):e88. Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23(9):324-328. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 1999, 96(6):2896-2901. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285(5428):751-753. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402(6757):86-90. Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet 2001, 29(4):482-486. Grigoriev A: A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res 2001, 29(17):3513-3519. Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12(1):37-46. 52  25. 26.  27. 28. 29. 30. 31. 32. 33. 34. 35. 36.  37.  38. 39. 40. 41. 42. 43. 44.  Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285-4288. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 2001, 11(12):21202126. Gomez SM, Rzhetsky A: Towards the prediction of complete protein--protein interaction networks. Pac Symp Biocomput 2002:413-424. Ng SK, Zhang Z, Tan SH: Integrative approach for computationally inferring protein domain interactions. Bioinformatics 2003, 19(8):923-929. Obenauer JC, Yaffe MB: Computational prediction of protein-protein interactions. Methods Mol Biol 2004, 261:445-468. Reiss DJ, Schwikowski B: Predicting protein-peptide interactions via a network-based motif sampler. Bioinformatics 2004, 20 Suppl 1:I274-282. Lu L, Lu H, Skolnick J: MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins 2002, 49(3):350-364. Aloy P, Russell RB: Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci U S A 2002, 99(9):5896-5901. Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 2004, 20(5):604-611. Hoffmann R, Krallinger M, Andres E, Tamames J, Blaschke C, Valencia A: Text mining for metabolic pathways, signaling cascades, and protein networks. Sci STKE 2005, 2005(283):pe21. Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 2006, 63(3):490-500. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32(Database issue):D262-266. Rhee SY, Wood V, Dolinski K, Draghici S: Use and misuse of the gene ontology annotations. Nat Rev Genet 2008, 9(7):509-515. PRoteomics for Emerging PAthogen REsponse (PREPARE) [http://www.prepare.med.ubc.ca/] Haynes C, Oldfield CJ, Ji F, Klitgord N, Cusick ME, Radivojac P, Uversky VN, Vidal M, Iakoucheva LM: Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol 2006, 2(8):e100. UniProt batch retrieval system [http://beta.uniprot.org/?tab=batch] Go Slim [http://www.geneontology.org/GO.slims.shtml] map2slim [http://search.cpan.org/~cmungall/go-perl/scripts/map2slim] the Gene Ontology [http://www.geneontology.org/] 53  45. 46. 47. 48.  49. 50. 51.  52. 53. 54. 55. 56.  Hastie T, Tibshirani R, Friedman J: The elements of statistical learning; data mining, inference, and prediction. New York: Springer; 2001. STATISTICA [http://www.statsoft.com/] Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment. Nucleic Acids Res 2008, 36(Database issue):D480-484. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, Buzadzija K, Cavero R, D'Abreo C, Donaldson I, Dorairajoo D, Dumontier MJ, Dumontier MR, Earles V, Farrall R, Feldman H, Garderman E, Gong Y, Gonzaga R, Grytsan V, Gryz E, Gu V, Haldorsen E, Halupa A, Haw R, Hrvojic A, Hurrell L, Isserlin R, Jack F, Juma F, Khan A, Kon T, Konopinsky S, Le V, Lee E, Ling S, Magidin M, Moniakis J, Montojo J, Moore S, Muskat B, Ng I, Paraiso JP, Parker B, Pintilie G, Pirone R, Salama JJ, Sgro S, Shan T, Shu Y, Siew J, Skinner D, Snyder K, Stasiuk R, Strumpf D, Tuekam B, Tao S, Wang Z, White M, Willis R, Wolting C, Wong S, Wrong A, Xin C, Yao R, Yates B, Zhang S, Zheng K, Pawson T, Ouellette BF, Hogue CW: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005, 33(Database issue):D418-D424. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35(Database issue):D61-65. Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and inparalogs from pairwise species comparisons. J Mol Biol 2001, 314(5):1041-1052. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247251. HMMER [http://hmmer.janelia.org/] Ng SK, Zhang Z, Tan SH, Lin K: InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res 2003, 31(1):251-254. Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21(3):410-412. Kretzschmar M, van Duynhoven YT, Severijnen AJ: Modeling prevention strategies for gonorrhea and Chlamydia using stochastic network simulations. Am J Epidemiol 1996, 144(3):306-317. Muller J, Schonfisch B, Kirkilionis M: Ring vaccination. J Math Biol 2000, 41(2):143171.  54  3 Predicting highly-connected hubs in protein interaction networks by QSAR and biological data descriptors2 3.1 Introduction The accumulation of vast amounts of protein sequences and interaction data has not only accelerated study of cellular processes, but also revealed the underlying complexity of protein interactions. Results from recent large-scale protein-protein interaction experiments conducted for various species [1-5] have further elaborated that there are some proteins, which are involved in numerous physical interactions with other proteins and exert major effects on various cellular processes and pathways. A commonly accepted viewpoint is that protein-protein interactions (PPI) occur in a network fashion, where a protein can be visualized as a node and a physical interaction with another protein is represented as a link between the nodes. Previous protein interaction network (PIN) studies have demonstrated a characteristic distribution of such protein nodes, where the majority of them have a low number of connections, while relatively few proteins are involved in the majority of network interactions [6,7]. Those highly-connected proteins are referred as ‘hubs’ - the name that emphasizes a central role of such proteins in cellular processes. Such hub proteins represent attractive study objects helping our understanding of cellular interactions and promising new and intriguing opportunities for therapeutics development. Identification of high PIN interactors is straightforward when large amounts of interaction data are available for a given proteome. However, the task of finding hubs in species that lack protein interaction information appears to be difficult. Many computational methods have been proposed for predicting pairwise protein interactions [8-16]. Those  2  A version of this chapter has been published. Hsing M, Byler K, Cherkasov A. Predicting highly-connected hubs in protein interaction networks by QSAR and biological data descriptors. Bioinformation 2009, 4(4):164-168. 55  predictors demonstrate different degrees of accuracy, but none have focused on explicit identification of protein hubs. Thus, it remains mostly unknown which structure- and/or sequence- related features (if any) can distinguish hubs from other proteins. In our previous work [17], we constructed a classifier capable of distinguishing hubs and non-hubs with reasonable accuracy utilizing the Gene Ontology (GO) annotations for hub prediction. Although the developed hub classifier demonstrated suitable performance (28% sensitivity and 90% specificity), it was limited to the extent of available GO data and relied on expert knowledge. The set of GO terms that are most characteristic for hubs vary significantly among species, rendering the task of hub prediction even more challenging. In addition, we had some success in employing a QSAR (quantitative structure-activity relationship) approach that utilized 75 physical and chemical descriptors to predict hub proteins in Methicillin-Resistant Staphylococcus aureus MRSA252 proteome [18]. Although the study was focused on a small subset of MRSA252 proteins, it has demonstrated the potential to accurate hub characterization. The aim of the current study is to improve our understanding of hub proteins through the use of more comprehensive set of 1300 protein descriptors reflecting physicochemical properties (quantified through QSAR descriptors), as well as domain and fold composition, cellular function and sequence similarity. We focused efforts on determining common and distinctive features of protein hubs in four model organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens. By combining numerous protein descriptors with a boosting-trees machine learning method, hub classifiers were constructed for the four species. The source code and executable program of the hub classifier, and the training and testing datasets are freely available for download at: http://www.cnbi2.ca/hub-analysis/  56  3.2 Methods 3.2.1  Acquisition and analysis of protein-protein interaction data Experimental protein interaction data used in constructing hub classifiers and the  subsequent hub protein analyses, were obtained from the IntAct database [19] for the following species: Escherichia coli K 12 (taxonomy ID: 83333), Saccharomyces cerevisiae (taxonomy ID: 4932), Drosophila melanogaster (taxonomy ID: 7227), and Homo sapiens (taxonomy ID: 9606) (dated by Sep. 25th, 2007). Hub proteins were identified based on the corresponding numbers of their interactions ranked within the same species. These proteins were divided into different percentile groups, with the top 10% interactors considered as hubs, as in our previous study [17]. The hub selection criterion has been associated with a sharp turn (or inflection point) on cumulative protein interaction distribution plots created for each of the four species. Table 3.1 summarizes the number of protein interactions, as well as hub classifications for each of the four species under study. Table 3.1 A summary of protein interaction data used in the training and testing of the hub classifiers. Training / Testing set # of proteins # of hubs (10% of total proteins) # of non-hubs (90% of total proteins) # of protein interactions minimum # of interactions per hub  E. coli  S. cerevisiae  D. melanogaster  H. sapiens  2860  5397  6935  6592  286  535  628  620  2574  4862  6307  5972  13888  37167  19994  19115  20  33  16  13  Figure 3.1 illustrates the entire process of characterization and prediction of hub proteins in E. coli, S. cerevisiae, D. melanogaster, and H. sapiens.  57  Figure 3.1 A flow chart for the characterization and prediction of hub proteins.  58  3.2.2  Calculation and collection of protein descriptors To fully characterize each protein in our dataset, over 1300 descriptors have been  calculated and grouped into five categories shown in Table 3.2.  Table 3.2 A summary of protein descriptors. E. coli  S. cerevisiae  D. melanogaster  H. sapiens  1250  1300  1312  1310  Gene Ontology  55  105  117  115  Sequence similarity  11  11  11  11  4  4  4  4  1105  1105  1105  1105  75  75  75  75  250 (20%)  260 (20%)  262 (20%)  262 (20%)  Total number of descriptors  Number of Pfam domains Average threading score of SCOP folds Physicochemical properties Total number of the top 20% descriptors Gene Ontology  3 (5%)  13 (12%)  9 (8%)  6 (5%)  Sequence similarity  6 (55%)  5 (45%)  1 (9%)  4 (36%)  Number of Pfam domains  3 (75%)  4 (100%)  1 (25%)  1 (25%)  193 (17%)  192 (17%)  195 (18%)  204 (18%)  45 (60%)  46 (61%)  56 (75%)  47 (63%)  Average threading score of SCOP folds Physicochemical properties  The top part of the table shows the distribution of all descriptors considered in the study, and the bottom part of the table shows the distribution of the top 20% descriptors, selected based on their relative predictor importance values. The values inside a bracket indicate the percentage of descriptors that were present in the top 20% descriptor set for a given descriptor category. For example, for Gene Ontology descriptor category in E. coli, 3 out of 55 or 5% were present in the top 20% descriptor set. Abbreviations: Pfam = Protein Family, SCOP = Structural Classification of Proteins.  3.2.2.1 Gene Ontology (GO) annotations For each protein we utilized GO annotations as before [17]. Briefly, the GO data were obtained from the UniProt Retrieval System [20,21] using unique UniProt protein accession numbers. To achieve a manageable level in GO data hierarchy, we adapted a generic GO annotation level determined by the “GO slim” project [22]. The tool “map2slim” [23] was used to map specific GO terms to the “GO slim” generic terms. The resulting GO descriptor had a binary form, where ‘1’ indicates that a protein had this GO annotation, while ‘0’ indicates its absence. 59  3.2.2.2 Sequence similarity To investigate the degree of sequence conservation in the training set we compared the constituent proteins to selected reference proteomes. A total of 10 such proteomes were used: 2 species from Archaea, 4 species from Bacteria, and 4 species from Eukaryota (Table 3.3). The protein sequences were obtained from the UniProt database [24] (dated Oct. 1st, 2007), and the sequences of the reference proteomes were acquired from the RefSeq database [25] (dated Aug. 28th, 2008). Table 3.3 A set of reference proteomes for sequence similarity descriptors. Branch Archaea  Group  Species name  Crenarchaeota  Sulfolobus solfataricus P2 Methanosarcina acetivorans C2A Bacillus subtilis subsp. subtilis str. 168  273057  2977  188937  4540  224308  4105  Mycoplasma genitalium G37 Pseudomonas aeruginosa PAO1  243273  1335  208964  5568  85962  1576  10090  35191  Euryarchaeota Bacteria Firmicutes (gram +) Firmicutes Gammaproteobacteria (gram -) Epsilonproteobacteria (gram -)  Fungi  Helicobacter pylori 26695 Mus musculus 129X1/SvJ; 129S1/SvImJ; DBA/2J; A/J Cryptococcus neoformans var. neoformans JEC21  Plant  Arabidopsis thaliana  Protist  Plasmodium falciparum 3D7  Eukaryota Animals  NCBI Taxonomy ID  number of proteins  214684  6594  3702  32816  36329  5262  A pairwise sequence alignment was performed between each sequence in the training set and each sequence in the reference set by BLASTp [26] (version 2.2.13). The best BLAST hit was identified for each training set protein, and the presence of homologous sequences was determined based on the following criteria: e-value coverage  10-5, similarity  50%, and alignment  80%. There are 10 binary descriptors in this category, where ‘1’ indicates that for a  given a protein in the training set, a homologous protein was present in one of the 10 reference proteomes, while ‘0’ indicates its absence. An additional descriptor was made to sum up all the binary values from each of the 10 descriptors.  60  3.2.2.3 Number of Pfam protein domains Each protein in the training set has been associated with Pfam (Protein Family) domains [27] through the UniProt Retrieval System [21] (dated by Oct. 1st, 2007). Two descriptors were used to record the total number of Pfam domains for each protein: one descriptor counted each unique Pfam domain only once, while the other descriptor counted all the instances of each domain. Among all identified Pfam domains a subset had the iPfam association representing interacting domains identified from known protein complexes [28]. The iPfam characteristics were obtained from the iPfam database [28] (version 20). Like the Pfam descriptors, two iPfam-related parameters were recorded: the total number of iPfam domains (unique or repetitive) present on each protein in the training set. Thus, there are a total of 4 Pfam/iPfam domain descriptors.  3.2.2.4 Protein threading scores and SCOP protein domain composition Descriptors in this category were derived in two steps. First, protein sequence of each training set protein was compared against 8,539 PDB structural templates implemented by the THREADER 3.5 program [29,30] (threading template library obtained on Feb. 7th, 2008). The Z-score, which represents the fitness measure, was calculated for each pair of the query protein and each structural template. In the second step, each template in the library was linked to a specific SCOP (Structural Classification of Proteins) domain classification number [31, 32] (version 1.73). To achieve a manageable number of descriptors, the protein templates were grouped at the first (Class) and second (Fold) level of the SCOP classification. As a result, 8,539 PDB templates were classified into 1,105 fold descriptors for this category.  61  3.2.2.5 Physicochemical properties Table 3.4 features 75 sequence-based physical and chemical descriptors that were calculated and used for training hub classifiers. Those descriptors quantify such protein properties as molecular weight, net charge, isoelectric point, hydrophobicity, surface area, solvent accessibilities, electronegativity, secondary structure composition, surface coils and flexibility among others. The protein descriptors were calculated by SABLE 2.0 [33], PARASURF ‘06 [34], and other QSAR-related formulas as described in [18].  62  Table 3.4 Physicochemical descriptors. index 1 2 3-22 23 24 25 26 27 28 29 30 31 32-51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75  Physicochemical descriptors number of residues molecular weight fraction of each residues in sequence fraction of polar residues in sequence fraction of hydrophobic residues fraction of charged residues net charge at pH = 7.0 average hydrophobicity, ∆Gtrans (kcal/mol) average “hydrophilicity”, ∆Gapp (kcal/mol) fraction of surface residues in sequence estimated surface area estimated volume fraction of each residue at surface fraction of polar residues at surface fraction of hydrophobic residues at surface fraction of charged residues at surface net surface charge average surface hydrophobicity average surface hydrophilicity ratio of average surface hydrophobicity to average hydrophobicity for sequence ratio of average surface “hydrophilicity” to average hydrophilicity for sequence isoelectric point (elementary charge unit) isoelectric point of surface (elementary charge unit) fraction of random coil residues fraction of α-helix residues fraction of β-sheet residues helix-to-coil ratio for surface – helix-to-coil ratio for sequence average surface polarizability (kcal/mol) average surface MEP (kcal/mol) average surface ionization potential (kcal/mol) average surface electron affinity (kcal/mol) average surface electronegativity (kcal/mol) number of coil stretches > 4 residues in length length of longest contiguous coil stretch fraction of flexible coil residues in sequence fraction of flexible residues at surface average flexibility index for coil residues at the surface  63  3.2.3  Training and testing hub classifiers by boosting trees After calculating the above descriptors we combined them into a unified set for each of  the four studied proteomes. The total number of descriptors varied among species because of the different numbers of available GO terms, but on average each proteome was associated with 1,300 descriptors for the constituent proteins. To narrow the number of descriptors we applied two rounds of training to select those most suitable for differentiating hubs and nonhubs. During both rounds each hub classifier was trained individually for each of the four species using a four-fold cross-validation strategy. Each training and testing set had a constant hub to non-hub (1:9) ratio. The models were trained and tested using the boosting trees method as implemented in STATISTICA version 8 [35]. In the first round of training, all descriptors (~1300) were used and for each an importance value [36] was computed to represent its relative distinguishing power for the corresponding hub-classifying model. The average descriptor importance values were calculated from the four cross-validation samples, and the top 20% of descriptors with the highest average importance values were advanced to the second round of training. The following boosting trees parameters were used: maximum number of nodes in a tree= 7, classification penalty between false negatives and false positives = 1: 1.5. The number of trees ranged from 45 to 199, as determined automatically by the STATISTICA program for optimal training and testing accuracy. Four classifiers were built for each species (one for each of the four cross-validation samples) and compiled in the C++ language under Linux. We combined the four classifiers by consensus voting method, where a hub is determined if it has two or more votes. The classification statistics were calculated as the following:  64  Sensitivity = TP / (TP + FN) Specificity = TN / (TN + FP) Accuracy = (TP + TN) / (TP + TN + FP + FN) PPV (Positive Predictive Value) = TP / (TP + FP) NPV (Negative Predictive Value) = TN / (TN + FN) where TP = True Positive, FP = False Positive, TN = True Negative, and FN = False Negative. In addition, the final consensus hub classifier built for a given species was tested on the other three species, which were used as an external validation set.  3.2.4  Hub prediction based on sequence conservation To compare our method with a hub prediction approach based on sequence  conservation, the following procedure was performed. First, each protein of a given species (the query) was aligned by BLASTp against each protein in the other three species (the subject). A homologous relationship was established if the best hit in the subject species has met the sequence alignment criteria: e-value  10-5, similarity  50%, and alignment coverage  80%.  Second, we examined if the homologous protein in the subject species was a hub or non-hub protein. A ‘conserved hub’ was identified when a protein and its homolog were both hubs in the query and subject species respectively. On the other hand, a ‘conserved non-hub’ represented a case when a protein and its homolog were both non-hubs in the query and subject species respectively. Thus, we predicted hub proteins in the query species based on sequence conservation from the subject species and calculated the corresponding classification statistics.  65  3.3 Results and discussion 3.3.1  Prediction performance of the hub prediction classifiers Hub protein classifiers were trained for each of the 4 species and tested on the other 3  species for validation. Table 3.5 ~ 3.8 show the classification statistics for the hub classifiers trained on E. coli, S. cerevisiae, D. melanogaster, and H. sapiens proteomes respectively. When the classifiers were tested on the same species using the 4-fold cross-validation procedure, they all displayed improved prediction performance in comparison to our previous GO-based classifier. The E. coli hub classifier achieved 51.40% sensitivity and 32.59% PPV and the S. cerevisiae hub classifier reached 62.99% sensitivity and 33.37% PPV on the testing sets; these values are significantly higher than 28.10% sensitivity and 22.00% PPV that we have reported in of our previous work [17] based on the GO annotation. In addition, Tables 4.5 ~ 4.8 demonstrate another trend - when a classifier trained for one species is applied to another one, its prediction accuracy dramatically decreases. We constructed a generalized classifier by combining protein interaction data from all four species, however no prediction performance improvement was observed. These results suggest that hub proteins within the same species have enough commonality so the species-specific hub predictors demonstrated high accuracy. However, the diversity among hubs from different species appears to be much greater; therefore, it is not always feasible to apply the hub classifier trained from one species to another.  66  Table 3.5 E. coli hub classifier prediction performance. E. coli hub classifier Four-fold cross-validation average performance in E. coli Training sensitivity specificity accuracy PPV NPV 86.71% 91.60% 91.11% 53.41% 98.41% Testing sensitivity specificity accuracy PPV NPV 51.40% 88.19% 84.51% 32.59% 94.23% External Testing S. cerevisiae sensitivity specificity 53.46% 72.71% D. melanogaster sensitivity specificity 37.26% 63.75% H. sapiens sensitivity specificity 47.26% 58.57%  accuracy 70.80%  PPV 17.73%  NPV 93.42%  accuracy 61.36%  PPV 9.29%  NPV 91.08%  accuracy 57.51%  PPV 10.59%  NPV 91.45%  Table 3.6 S. cerevisiae hub classifier prediction performance. S. cerevisiae hub classifier Four-fold cross-validation average performance in S. cerevisiae Training sensitivity specificity accuracy PPV NPV 84.36% 88.99% 88.53% 45.74% 98.10% Testing sensitivity specificity accuracy PPV NPV 62.99% 86.16% 83.86% 33.37% 95.49% External Testing E. coli sensitivity specificity 34.97% 88.50% D. melanogaster sensitivity specificity 23.73% 80.45% H. sapiens sensitivity specificity 41.13% 73.63%  accuracy 83.15%  PPV 25.25%  NPV 92.45%  accuracy 75.31%  PPV 10.78%  NPV 91.37%  accuracy 70.57%  PPV 13.93%  NPV 92.34%  67  Table 3.7 D. melanogaster hub classifier prediction performance. D. melanogaster hub classifier Four-fold cross-validation average performance in D. melanogaster Training sensitivity specificity accuracy PPV NPV 74.95% 87.24% 86.12% 36.90% 97.22% Testing sensitivity specificity accuracy PPV NPV 41.24% 83.86% 80.00% 20.28% 93.48% External Testing E. coli sensitivity specificity 11.54% 92.74% S. cerevisiae sensitivity specificity 20.37% 88.15% H. sapiens sensitivity specificity 30.16% 83.05%  accuracy 84.62%  PPV 15.00%  NPV 90.42%  accuracy 81.43%  PPV 15.91%  NPV 90.96%  accuracy 78.08%  PPV 15.60%  NPV 91.97%  Table 3.8 H. sapiens hub classifier prediction performance. H. sapiens hub classifier Four-fold cross-validation average performance in H. sapiens Training sensitivity specificity accuracy PPV NPV 51.77% 91.31% 87.59% 38.21% 94.80% Testing sensitivity specificity accuracy PPV NPV 26.61% 88.78% 82.93% 19.76% 92.10% External Testing E. coli sensitivity specificity 22.03% 87.65% S. cerevisiae sensitivity specificity 38.32% 83.42% D. melanogaster sensitivity specificity 13.38% 90.26%  accuracy 81.08%  PPV 16.54%  NPV 91.00%  accuracy 78.95%  PPV 20.28%  NPV 92.48%  accuracy 83.30%  PPV 12.03%  NPV 91.28%  68  3.3.2  Prediction performance of a traditional hub prediction approach based on sequence conversation To investigate a common approach that predicts hub proteins based on sequence  conservation between species, we have examined the degree of hub conservation in the studied data sets. In general, hubs from one species are more likely to have homologous proteins in another species compared to the cases of non-hub proteins, as the data in Table 3.9 illustrate. However, highly connected proteins in one species do not necessarily behave like hubs in other species. The percentage of conserved hubs (homologous proteins that are hubs in both species) is low (1 ~ 10%). This reversing trend makes it difficult to infer hubs from the sequence conservation alone. Table 3.9 Hub proteins conservation among species. Query species  Subject species E. coli  S. cerevisiae  D. melanogaster  H. sapiens  E. coli % of hubs with homologs  18.18%  15.03%  % of non-hubs with homologs  8.00%  5.67%  18.18% 5.75%  % of conserved hubs  4.20%  1.05%  2.80%  % of conserved non-hubs  6.72%  5.56%  5.36%  S. cerevisiae % of hubs with homologs  7.48%  34.02%  39.44%  % of non-hubs with homologs  3.78%  10.98%  11.74%  % of conserved hubs  3.55%  6.36%  10.28%  % of conserved non-hubs  2.88%  10.22%  10.26%  D. melanogaster % of hubs with homologs  1.27%  12.26%  23.89% 20.64%  % of non-hubs with homologs  1.93%  9.75%  % of conserved hubs  1.11%  6.69%  6.69%  % of conserved non-hubs  1.43%  7.23%  17.82%  2.58%  22.10%  37.90% 24.55%  H. sapiens % of hubs with homologs % of non-hubs with homologs  2.28%  12.34%  % of conserved hubs  1.94%  10.00%  9.35%  % of conserved non-hubs  1.62%  8.98%  21.78%  69  As data in Table 3.10 indicate, such a hub prediction strategy has low sensitivity; therefore, this method failed to predict hubs in a given species based on the sequence and interaction data of the proteins from another species. We suspect that there are two contributing factors: 1) lack of complete protein interaction data, and 2) natural differences in PIN of the species that might occur as the result of evolutionary network rewiring.  70  Table 3.10 Hub prediction based on sequence conservation. E. coli - conserved hub prediction S. cerevisiae sensitivity specificity accuracy PPV 2.24% 99.18% 89.57% 23.08% D. melanogaster sensitivity specificity accuracy PPV 0.48% 99.37% 90.41% 6.98% H. sapiens sensitivity specificity accuracy PPV 1.29% 99.26% 90.05% 15.38% S. cerevisiae - conserved hubs prediction E. coli sensitivity specificity accuracy PPV 6.64% 99.18% 89.93% 47.50% D. melanogaster sensitivity specificity accuracy PPV 5.41% 97.65% 89.30% 18.68% H. sapiens sensitivity specificity accuracy PPV 8.87% 97.39% 89.06% 26.07%  NPV 90.22% NPV 90.93% NPV 90.64%  NPV 90.53% NPV 91.20% NPV 91.15%  D. melanogaster - conserved hub prediction E. coli sensitivity specificity accuracy PPV NPV 2.45% 99.96% 90.21% 87.50% 90.22% S. cerevisiae sensitivity specificity accuracy PPV NPV 7.85% 99.28% 90.22% 54.55% 90.73% H. sapiens sensitivity specificity accuracy PPV NPV 6.77% 98.19% 89.59% 28.00% 91.03% H. sapiens - conserved hubs testing E. coli sensitivity specificity accuracy PPV 4.20% 99.84% 90.28% 75.00% S. cerevisiae sensitivity specificity accuracy PPV 11.59% 98.46% 89.85% 45.26% D. melanogaster sensitivity specificity accuracy PPV 9.24% 97.19% 89.23% 24.68%  NPV 90.37% NPV 91.01% NPV 91.49%  71  3.3.3  Characteristics of hub and non-hub proteins The utilization of the 1300 protein descriptors has enabled construction of binary hub  classifiers and characterization of hub proteins in each of the four species: E. coli, S. cerevisiae, D. melanogaster, and H. sapiens. For each of the four species we have selected the top 20% descriptors ranked by their importance values (Table 3.2). Notably, different descriptor categories have shown different contributions to the final descriptors set. For example, only a small percentage of Gene Ontology descriptors (5~12%) and SCOP fold descriptors (17~18 %) have been selected, while Pfam descriptors (25~100%) and physicochemical parameters (60~75%) were among the most relevant ones. The sequence similarity descriptors showed high contribution rate in all the species except for the dataset in D. melanogaster. Among the top 20% descriptors, 21 descriptors were common among all four species. In the next sections, we report the examination of similarities and differences between hubs and non-hubs among the four species by focusing on the relevant protein descriptors.  3.3.3.1 Gene Ontology (GO) annotations Gene Ontology (GO) provides functional annotations for individual proteins using an expert knowledge base [20, 37]. The advantage of using the GO terms in hub prediction is the availability of GO information for proteins in hundreds of species; however, the level of GO annotations varied greatly between species. For instance, among the four studied species, S. cerevisiae had the highest percentage of the proteins with GO annotations (87.8%), while only 48.2% of the proteins in E. coli could be related to a particular GO term. Our results indicated that certain GO terms appear more frequently in hubs. For instance, the GO term, [protein binding], is among the most frequent annotations. Others, such as [cytoplasm], [protein complex], [nucleolus], [nucleobase, nucleoside, nucleotide and nucleic acid metabolic process], 72  [nucleoplasm] and [response to endogenous stimulus], are associated with highly interacting proteins (more details in Table 3.11 and 3.12). Table 3.11 Top GO term descriptors. E. coli GO ID  GO name  GO type  GO:0005515*  protein binding  molecular function  GO:0005737  +  predictor importance 72  cytoplasm  cellular component  59  cell  cellular component  46  GO name  GO type  protein complex  cellular component  78  GO:0003723  RNA binding  molecular function  76  GO:0005515*  protein binding  molecular function  74  GO:0005840  ribosome  cellular component  70 69  GO:0005623 S. cerevisiae GO ID GO:0043234  GO:0005730  +  +  predictor importance  nucleolus  cellular component  GO:0006996  organelle organization  biological process  67  GO:0006412  translation nucleobase, nucleoside, nucleotide and nucleic acid metabolic process  biological process  58  biological process  57  nucleoplasm  cellular component  54  translation factor activity, nucleic acid binding  molecular function  52  response to endogenous stimulus  biological process  46  GO:0005829  cytosol  cellular component  45  GO:0006350  transcription  biological process  42  GO:0006139  +  GO:0005654  +  GO:0008135 GO:0009719  +  D. melanogaster GO ID  GO name  GO type  GO:0050789  regulation of biological process  biological process  64  GO:0005811  lipid particle  cellular component  62  GO:0007049  cell cycle  biological process  60  GO:0000003  reproduction  biological process  58  GO:0007275  multicellular organismal development  biological process  57  anatomical structure morphogenesis  biological process  56  response to endogenous stimulus  biological process  54  GO:0009628  response to abiotic stimulus  biological process  51  GO:0005515*  protein binding  molecular function  48  GO name nucleobase, nucleoside, nucleotide and nucleic acid metabolic process  GO type biological process  60  GO:0009653 GO:0009719  +  predictor importance  H. sapiens GO ID  predictor importance  GO:0006139  +  GO:0005737  +  cytoplasm  cellular component  49  GO:0005654  +  nucleoplasm  cellular component  47  GO:0043234  +  protein complex  cellular component  42  GO:0005730  +  GO:0004871  nucleolus  cellular component  39  signal transducer activity  molecular function  37  GO terms that are common in 3 species are indicated by ‘*’, and those that are common in 2 species are indicated by ‘+’.  73  Table 3.12 Frequency of common GO terms appearing in hub and non-hub proteins. E. coli  S. cerevisiae  D. melanogaster  H. sapiens  protein binding (GO:0005515) hub proteins  44.06%  56.26%  80.41%  77.58%  non-hub proteins  25.33%  38.03%  58.27%  43.72%  hub proteins  9.44%  21.50%  6.05%  18.39%  non-hub proteins  2.41%  21.23%  4.19%  9.73%  hub proteins  0.00%  38.88%  5.89%  17.42%  non-hub proteins  0.00%  18.65%  4.58%  9.66%  hub proteins  0.00%  16.07%  0.16%  3.23%  non-hub proteins  0.00%  2.80%  0.33%  1.09%  cytoplasm (GO:0005737)  protein complex (GO:0043234)  nucleolus (GO:0005730)  nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (GO:0006139) hub proteins  2.10%  25.42%  2.23%  8.71%  non-hub proteins  1.67%  9.05%  2.05%  3.45%  hub proteins  0.00%  13.27%  2.23%  6.94%  non-hub proteins  0.00%  5.37%  1.59%  2.75%  hub proteins  1.05%  8.60%  2.55%  4.03%  non-hub proteins  0.39%  3.83%  0.62%  2.14%  nucleoplasm (GO:0005654)  response to endogenous stimulus (GO:0009719)  3.3.3.2 Sequence similarity It is a common practice to look into sequence similarity among protein hubs in different species. It has been previously suggested that hubs are more likely to be conserved compared to proteins with fewer number of interactions [38]. Hubs in most of the species have higher sequence conservation compared to non-hubs, as illustrated in Figure 3.2. We conducted two sample t-tests comparing hubs and non-hubs in terms of the occurrence of similar proteins among the 10 reference proteomes. In particular, the tests resulted in p-values of 0, 0 and 0 when comparing proteins within E. coli, S. cerevisiae, and H. sapien respectively. However, for D. melanogaster, the trend is not observed (p-value = 0.7).  74  Figure 3.2 A box plot of descriptor values on sequence similarity for hubs and non-hubs among the four species.  Each species is presented in different color. The lower, middle, and upper bar of a box represents the 25th, 50th, and 75th percentile respectively. The lowest line and the highest line represent the 10th and 90th percentile respectively. The blue diamond dot represents the average value. Hubs have higher average values than non-hubs in E. coli, S. cerevisiae and H. sapiens (two-sample t-test: p-values = 0, 0 and 0 respectively). 3.3.3.3 Number of Pfam protein domains Previous studies have suggested that the presence of multiple interaction interfaces on a protein surface can be an indication of its hub role [39]. In this study, we assessed whether the number of Pfam domains in a protein can distinguish hubs and non-hubs. Hubs in E. coli and S. cerevisiae contain more Pfam domains than non-hubs (p-values = 0, see Figure 3.3 for more details). However, the difference in the number of Pfam domains between hubs and non-hubs in D. melanogaster and H. sapiens was not significant.  75  Figure 3.3 A box plot of descriptor values on the number of all Pfam domains for hubs and non-hubs among the four species.  Hubs have higher average values than non-hubs in E. coli and S. cerevisiae (p-values = 0 and 0 respectively). 3.3.3.4 Protein threading scores and SCOP protein fold composition One of the main advantages of SCOP classification as descriptors is in the general availability of these structural parameters for any protein sequence as they can be obtained easily by threading. To achieve a manageable number of descriptors, the original set of 8,539 standard threading templates was grouped into 1105 entries corresponding to the second level of SCOP classification (fold). It is notable that many SCOP descriptors distinguish hubs and non-hubs, as data in Table 3.13 illustrate. However the majority of such SCOP descriptors have complex characteristics where hubs might be more associated with a certain SCOP fold than non-hubs in some species but not in the others. Two notable examples are the ‘PDZ domainlike’ and ‘beta-Grasp (ubiquitin-like)’ folds.  76  Table 3.13 Top SCOP fold descriptors.  SCOP ID b.36 d.15 a.174 a.223 a.39 a.74 b.161 b.34 b.38 b.42 b.45 b.70 c.1 c.124 c.126 c.128 c.148 c.45 c.72 c.76 c.8 c.93 c.98 d.108 d.129 d.144 d.159 d.20 d.233 d.24 d.25 d.42 d.58 d.79 d.80 d.81 e.37  Fold name PDZ domain-like beta-Grasp (ubiquitin-like) Double Clp-N motif Triger factor/SurA peptide-binding domain-like EF Hand-like Cyclin-like PTSIIA/GutA-like SH3-like barrel Sm-like fold beta-Trefoil Split barrel-like 8-bladed beta-propeller TIM beta/alpha-barrel NagB/RpiA/CoA transferase-like DNA polymerase III psi subunit DNA polymerase III chi subunit ComB-like (Phosphotyrosine protein) phosphatases II Ribokinase-like Alkaline phosphatase-like The "swivelling" beta/beta/alpha domain Periplasmic binding protein-like I MurF and HprK N-domain-like Acyl-CoA N-acyltransferases (Nat) TBP-like Protein kinase-like (PK-like) Metallo-dependent phosphatases UBC-like Inhibitor of vertebrate lysozyme, Ivy Pili subunits Mitochondrial glycoprotein MAM33-like POZ domain Ferredoxin-like Bacillus chorismate mutase-like Tautomerase/MIF FwdE/GAPDH domain-like Siroheme synthase middle domains-like  # of species-specific hub classifier containing the SCOP descriptor 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3  As illustrated in Figure 3.4 and 3.5, the hubs are associated with the ‘PDZ domain-like’ and ‘beta-Grasp’ folds more frequently than the non-hubs in E. coli and H. sapiens, but not in  77  S. cerevisiae, and D. melanogaster. In fact, it is the non-hubs that are more associated with the two folds in those two species. The above examples demonstrated that structural features of hub proteins cannot be summarized easily by only a few SCOP descriptors; therefore, to fully characterize hubs in each species, the use of a complex set of fold descriptors is required. Figure 3.4 A box plot of descriptor values on average threading score for "PDZ domain-like" fold (SCOP: b.36) for hubs and non-hubs among the four species.  The average threading scores of the fold is higher for hubs than non-hubs in E. coli and H. sapiens (p-value = 0 and 0.05 respectively), but lower in S. cerevisiae, and D. melanogaster (pvalue = 0 and 0.01 respectively).  78  Figure 3.5 A box plot of descriptor values on average threading score for "beta-Grasp (ubiquitin-like)" fold (SCOP: d.15) for hubs and non-hubs among the four species.  The average threading scores of the fold is higher for hubs than non-hubs in E. coli and H. sapiens (p-value = 0 and 0 respectively), but lower in S. cerevisiae, and D. melanogaster (pvalue = 0 and 0.01 respectively).  3.3.3.5 Physicochemical properties The descriptors in this category have contributed greatly to the discriminative power of the hub classifiers due to the unbiased characterization of all the studied proteins in terms of their physical and chemical properties. One of the general trends we observed, as illustrated in Figure 3.6, is that the average hydrophilicity of hubs are higher than that of non-hubs in all four species (two- sample t-test: p-values in E. coli, S. cerevisiae, D. melanogaster and H. sapiens = 0). This can be explained perhaps by the fact that hubs tend to contain more polar residues on the surfaces, which facilitate protein-protein interactions.  79  Figure 3.6 A box plot of descriptor values on average hydrophilicity for hubs and non-hubs among the four species.  The average hydrophilicity of hubs are higher than that of non-hubs in all four species (pvalues in E. coli, S. cerevisiae, D. melanogaster and H. sapiens = 0, 0, 0 and 0 respectively).  The fraction of flexible coil residues demonstrated higher values for hubs in all four species (Figure 3.7). This observation agrees with previous studies [40,41] that outlined that structural disorder is a common feature of hub proteins. The hubs/non-hubs discriminative power of others physicochemical descriptors such as estimated surface area, average surface polarizability and isoelectric point is also high (see the corresponding Figure 3.8 – 3.10). However, the observations are more complex for these descriptors because their values are higher for hubs than non-hubs in some species, but lower in the others. It is also notable that the absolute values of average hydrophilicity of E. coli proteins are the smallest among the studied species. This in turn can be related to the fact that E. coli proteins have the smallest average surface area among the studied proteins.  80  The above results demonstrated that the physicochemical descriptors allowed us to derive meaningful observations, interpretable from the stand point of protein biology. In addition, these descriptors demonstrated the complexity of hub characteristics among the different species.  Figure 3.7 A box plot of descriptor values on fraction of flexible coil residues for hubs and non-hubs among the four species.  Hubs have higher average fractions of flexible coil residues than non-hubs in E. coli, S. cerevisiae, D. melanogaster and H. sapiens (p-values = 0, 0, 0 and 0.21 respectively).  81  Figure 3.8 A box plot of descriptor values on estimated surface area for hubs and non-hubs among the four species.  The average surface area of hubs is higher than that of non-hubs in E. coli and S. cerevisiae (pvalue = 0 and 0 respectively); however, the average surface area of hubs is lower than that of non-hubs in D. melanogaster and H. sapiens (p-value = 0.05 and 0.1, respectively).  Figure 3.9 A box plot of descriptor values on average surface polarizability for hubs and nonhubs among the four species.  Hubs have higher average values than non-hubs in D. melanogaster (p-value = 0), while hubs have lower average values than non-hubs in the other three species (p-values in E. coli, S. cerevisiae and H. sapiens = 0, 0 and 0 respectively).  82  Figure 3.10 A box plot of descriptor values on isoelectric point (pI) for hubs and non-hubs among the four species.  Hubs have higher average pI than non-hubs in E. coli (p-value = 0), while hubs have lower average pI than non-hubs in S. cerevisiae, D. melanogaster and H. sapiens (p-values = 0.18, 0.03 and 0 respectively).  3.4 Conclusion The use of a comprehensive set of sequence- and structure-derived descriptors has enabled in-depth characterization of hub proteins. On one hand, hubs in E. coli, S. cerevisiae, D. melanogaster, and H. sapiens generally have higher frequency of association with the ‘protein binding’ GO term, demonstrate higher sequence conservation, tend to be more hydrophilic and possess higher fraction of flexible coil residues. On the other hand, hubs have exhibited several species-specific characteristics including certain associated GO terms, occurrence of Pfam domains, most frequent SCOP folds, as well as surface area, surface polarizability and isoelectric point values. Through the analyses of the most relevant protein descriptors, we have demonstrated that hub proteins share certain common characteristics, making them different from non-hub counterparts. However, the descriptor analyses indicated (somewhat surprisingly) that highly connected proteins are more different across species than previously anticipated. It has become 83  clear that hub proteins cannot be characterized easily by only a few descriptors, but the use of a complex set of protein descriptions is required. Such hub differentiation should be taken into account when analyzing PINs from different species. We have established that E. coli hubs exhibit higher sequence similarity to the reference proteomes, lower hydrophilicity, smaller estimated surface area, smaller fraction of flexible coil residues and lower surface polarizability, compared to the hubs from the other three species under investigation. We conclude that the different nature of hub proteins in different species makes it difficult to build a generic, general hub classifier that will work on all proteomes. Nonetheless, through the two rounds of modeling, we were able to identify many protein descriptors that can distinguish hubs from non-hubs within the same species. By integrating both bioinformatics and physiochemical protein descriptors, the species-specific hub classifiers have shown improved prediction accuracy over their GO term-based and sequence similarity-based predecessors. We anticipate that the developed hub classifiers can be used for determining new hubs in the four studied species to assist future proteomics experiments and PIN analyses.  84  3.5 References 1.  2.  3.  4.  5.  6. 7. 8. 9. 10. 11. 12.  Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636. Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433(7025):531-537. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727-1736. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, HirozaneKishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122(6):957-968. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101-113. Albert R: Scale-free networks in cell biology. J Cell Sci 2005, 118(Pt 21):4947-4957. Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23(9):324-328. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285(5428):751-753. Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12(1):37-46. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285-4288. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for 85  13. 14. 15. 16. 17. 18. 19.  20.  21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32.  conserved protein-protein interactions or "interologs". Genome Res 2001, 11(12):21202126. Ng SK, Zhang Z, Tan SH: Integrative approach for computationally inferring protein domain interactions. Bioinformatics 2003, 19(8):923-929. Lu L, Lu H, Skolnick J: MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins 2002, 49(3):350-364. Hoffmann R, Krallinger M, Andres E, Tamames J, Blaschke C, Valencia A: Text mining for metabolic pathways, signaling cascades, and protein networks. Sci STKE 2005, 2005(283):pe21. Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 2006, 63(3):490-500. Hsing M, Byler KG, Cherkasov A: The use of Gene Ontology terms for predicting highly-connected 'hub' nodes in protein-protein interaction networks. BMC Syst Biol 2008, 2:80. Byler K, Hsing M, Cherkasov A: The Use of Sequence-Derived QSPR Descriptors for Predicting Highly Connected Proteins (Hubs) in Protein-Protein Interactions. QSAR & Combinatorial Science 2009, 28: 509-519. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32(Database issue):D452-455. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32(Database issue):D262-266. UniProt batch retrieval system [http://beta.uniprot.org/?tab=batch] GO Slim [http://www.geneontology.org/GO.slims.shtml] map2slim [http://search.cpan.org/~cmungall/go-perl/scripts/map2slim] UniProt Consortium: The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res 2009, 37(Database issue):D169-174. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35(Database issue):D61-65. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam protein families database. Nucleic Acids Res 2008, 36(Database issue):D281-288. Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21(3):410-412. Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358(6381):86-89. THREADER [http://bioinf.cs.ucl.ac.uk/threader/] Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536-540. SCOP: Structural Classification of Proteins [http://scop.mrc-lmb.cam.ac.uk/scop/] 86  33. 34. 35. 36. 37.  38. 39. 40. 41.  Adamczak R, Porollo A, Meller J: Combining prediction of secondary structure and solvent accessibility in proteins. Proteins 2005, 59(3):467-475. ParaSurf [http://www.ceposinsilico.de/Pages/Products.html] STATISTICA [http://www.statsoft.com/] Hastie T, Tibshirani R, Friedman J: Boosting and additive trees. In: The elements of statistical learning; data mining, inference, and prediction. New York: Springer; 2001: 299-345. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29. Fox A, Taylor D, Slonim DK: High throughput interaction data reveals degree conservation of hub proteins. Pac Symp Biocomput 2009:391-402. Kim PM, Lu LJ, Xia Y, Gerstein MB: Relating three-dimensional structures to protein networks provides evolutionary insights. Science 2006, 314(5807):1938-1941. Haynes C, Oldfield CJ, Ji F, Klitgord N, Cusick ME, Radivojac P, Uversky VN, Vidal M, Iakoucheva LM: Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol 2006, 2(8):e100. Singh GP, Ganapathi M, Dash D: Role of intrinsic disorder in transient interactions of hub proteins. Proteins 2007, 66(4):761-765.  87  4 Mapping protein interaction network in methicillin-resistant Staphylococcus aureus identifies novel drug targets3 4.1 Introduction Recent progress in genomic technologies and the accumulation of protein sequence and interaction data have accelerated investigation of cellular processes and revealed the complexity of protein interactions. Protein-protein interactions are organized in a form of scalefree networks that govern the operation of biological systems including those of microbial pathogens [1]. To date, such scale-free protein interaction networks (PIN) have been described in the yeast Saccharomyces cerevisiae [2], in the gastrointestinal pathogen Helicobacter pylori [3], in the multicellular Caenorhabditis elegans [4] in Escherichia coli [5] as well in hepatitis C [6] and vaccinia viruses [7]. Consequent PIN analysis established a direct correlation between the level of connectivity of a given protein node within the network and its importance for network function and survival of the organism. Thus, in yeast, removal of a highly connected protein (called ‘a hub’) from the network was far more likely to lead to lethality than removal of a less connected network node [8]. These findings suggest that studying architecture of PINs within microbial pathogens can lead to advances in treatment of infectious diseases, through development of new antibiotic drugs targeting hub proteins that are essential for network integrity. Based on an observation that hub proteins are less prone to mutations [9], we anticipate that hubs are less likely to undergo resistant changes in bacteria.  3  A version of this chapter has been submitted for publication. Cherkasov A, Hsing M, Zoraghi R, Foster LJ, See R, Stoynov N, Jiang J, Kaur S, Lian T, Jackson L, Gong H, Swayze R, Amandoron E, Hormozdiari F, Dao P, Sahinalp C, Santos-Filho O, Axerio-Cilies P, Byler K , McMaster WR, Brunham R, Finlay B, Reiner NE Mapping protein interaction network in methicillin-resistant Staphylococcus aureus identifies novel drug targets. 88  We investigated the validity of this approach by defining a PIN of a common strain of methicillin resistant Staphylococcus aureus (MRSA252). This pathogen of major medical importance emerged in the 80s, and has since become endemic in many hospitals, leading to the increased use of vancomycin, one of the so-called ‘last line of defence’ antibiotics. In 2002, the first case of an S. aureus isolate that was totally resistant to vancomycin was documented in the United States [10]. In Canada the rise of MRSA is occurring at alarming rate [11]. Thus, a deeper understanding of MRSA biology and the identification of novel targets represent an important and ambitious task.  4.2 Methods 4.2.1  Selection of baits Initially we tested 449 bait proteins, of which 406 were successfully cloned and  expressed for the pull-down experiments to identify their interacting proteins (preys). We have utilized a two-round bait selection strategy where 406 baits were separated into two groups to maximize potential coverage of the MRSA network. During the first round, one-third of the baits (133) were selected based on their predicted hub-like properties by utilizing hub prediction tools from our previous works [12, 13] where biological and physicochemical descriptors were used to predict highly interacting proteins. During the second round of bait selection, 225 baits were selected based on the scalefree network-derived strategy ‘name your friend’. Based on the results of the first round, we identified the most common preys which then have been selected as the second round baits (as potential hubs). A subset (48) of the second-round baits were chosen based on homology to known drug targets.  89  4.2.2  Protein interaction network analysis MRSA protein interactions were analyzed and compared to PINs including Escherichia  coli PIN by Arifuzzman et al. [14] and Butland et al. [5], Campylobacter jejuni PIN by Parrish et al. [15], Treponema pallidum PIN by Titz et al. [16] and Saccharomyces cerevisiae PIN by Gavin et al [2]. The interaction data sets were obtained from IntAct [17] and stored in a local MySQL database. Physicochemical properties have been pre-calculated and correlated with numbers of interactions (network degrees) for proteins in MRSA252 PIN. Those included hydrophobicity, surface area, and fraction of flexible residues that have been calculated by SABLE 2.0 [18], PARASURF ‘06 [19], and other QASAR-related approaches described in more details in our previous work [13]. In order to estimate abundances of different MRSA proteins, an MRSA whole-cell lysates was separated into 24 fractions by off-gel isoelectric focusing and analyzed using LCMS/MS as described [20]. The analysis was repeated in triplicate and absolute protein expression (APEX) indices were calculated as described [20]. Protein essentiality were confirmed by antisense RNA or gene knockout experiments for some proteins in MRSA252, while the essentiality of the other proteins were inferred based on their homology to known essential proteins in Staphylococcus aureus N315, Streptococcus pneumoniae, or Bacillus subtilis, as reported in Database of Essential Genes [21]. The degree of sequence conservation has been estimated by comparing each protein in MRSA252 to each of the 20 reference bacterial proteomes (8 gram-positive and 12 gramnegative bacterial species) by BLAST. The total number of species that contained at least one homolog (homology criteria: blast E-value 10-5, sequence similarity  35% and alignment  90  coverage  65%) were counted. Using the above criteria, MRSA252 proteins have been  compared to the human proteome by BLAST to identify human homologs. The statistical analysis on average values and t-test among different characteristics of hubs, non-hubs and drug targets has been performed using STATISTICA 8.0 package [22].  (The following sections from 4.2.3 to 4.2.7 on experimental methods describe work conducted by the collaborators from the PREPARE laboratory. More details can be found in the ‘coauthorship statement’)  4.2.3  Expression and purification of GST fusion baits Genomic DNA was extracted from genome sequenced MRSA strain Sanger 252  (NRS71) obtained from NARSA (Network on Antimicrobial Resistance in S. aureus) using Dneasy Tissue Kit (Qiagen). ORFs of 449 MRSA baits were amplified by PCR using genomic DNA as template and inserted in frame at the 3' end of GST in the expression vector pGEX6P3 (GE Healthcare). Primer sequences are available upon request. DNA segments corresponding to GST alone and GST fusion-protein for 406 MRSA baits were cloned and transformed into Escherichia coli BL-21 (DE3) (Invitrogen). For each transformation, a single fresh colony was grown overnight in L-broth (LB) containing 100 µg/ml ampicillin. This was then diluted 1/100 in 2YT broth containing 100 µg/ml ampicillin and grown at 37°C to reach an OD600 of 0.4-0.6. The expression of the GST and GST fusion MRSA baits were induced by adding of 0.1 mM IPTG for 3 h at 20°C. Bacteria were harvested and washed in ice cold PBS, lysed in lysis buffer [20 mM Hepes pH 7.6, 100 mM KCl, 0.2 mM EDTA, 100  g/ml  lysozyme, 2 mM DTT, 20% (v/v) glycerol and 0.5% (v/v) Nonidet-40] containing EDTA-Free protease inhibitor (Complete ; Roche Molecular Biochemicals). After sonication on ice the 91  sample was then centrifuged at 20,000 ×g at 4°C for 20 min. Cleared lysate was incubated in the presence of 2 mM ATP and 10 mM MgCl2 for 5 min at room temperature to reduce nonspecific binding. GST protein and GST-fusion baits were purified from bacterial lysates with glutathione–Sepharose 4B beads (GE Healthcare). Beads were washed three times with 10 volumes of PBS and once with lysis buffer lacking lysozyme and DTT.  Purity and physical  integrity of proteins were confirmed by SDS−PAGE. The amount of GST or GST-fusion proteins bound to beads was estimated by quantification of SDS-PAGE gels using known amounts of BSA as standards.  4.2.4  MRSA cell extract preparation for GST pull down MRSA strain 252 (NRS71) was precultured overnight (OD600 of 1.0), inoculated  (OD600 of 0.1) and grown until mid-exponential phase (OD600 of 0.45) in brain heart infusion (BHI) medium (BD). The cells were washed with PBS and lysed for 30 min at 37 °C in MRSA Lysis buffer (20 mM Tris-HCl pH7.5, 150 mM NaCl, 100 µg/ml lysozyme, 100 µg /ml lysostaphin, 0.04% Triton X-100, 16 g/ml DNaseI, 1.6 mM MgCl, 0.5% NP-40, 1 mM DTT and Complete  protease inhibitor). Cell debris was pelleted by centrifugation at 25,000 ×g for  20 min. The pellet was discarded and soluble protein extracts were further filtered through a 0.8 µm filter.  4.2.5  GST pull down of MRSA protein complexes MRSA strain NRS71 soluble protein extract was prepared from cells grown in brain  heart infusion (BHI) medium to mid-exponential phase (OD600 of 0.45). GST protein was used as negative control. Equal amounts (40 pmol) of freshly prepared GST protein or GST-fusion baits bound to 50 l glutathione–Sepharose 4B beads were incubated for 2 h at 4 °C with 4 ml 92  of MRSA (15 mg/ml) protein extracts with continuous end-over-end mixing. The beads were washed four times with 20 volumes of MRSA lysis buffer lacking lysozyme, lysostaphin DNase, MgCl2 and Triton X-100. Protein complexes released from beads after overnight cleavage of GST by 8 units of PreScission protease (GE Healthcare) were precipitated and digested in solution as described [23]. At least four pull down experiments were repeated for each of the 406 baits.  4.2.6  Identification of interacting proteins by mass spectrometry Prior to proteomic analysis, digested peptides were labelled with formaldehyde C1H2O  (bound to GST alone) or C2H2O (bound to GST-fusion) in the presence of cyanoborohydride to reductively dimethylate [24] primary amines and to allow later distinction of non-specific (i.e., bound to GST alone and GST-fusion) from specific (i.e., bound only to GST-fusion) interactions. Derivatized samples were mixed and analyzed using an LTQ-FT (ThermoFisher Scientific, Bremen, Germany) as described in [25]. Fragment spectra were searched against the MRSA protein database using Mascot (v2.2, [26]) using the following criteria: trypsin cleavage rules with up to 1 missed cleavage, ESI-TRAP fragmentation characteristics, 10 ppm parent mass  accurate,  0.6  carbamidomethylation,  Da  fragment  variable  mass  accuracy,  modifications:  fixed  Dimethyl  modification:  (N-term),  cysteine  Dimethyl  (K),  Dimethyl:2H(4) N-term, Dimethyl:2H(4) K, methionine oxidation [27]. Proteins were considered identified if two or more peptides meeting the above criteria and having IonsScores 20 were sequenced in an experiment. Quantitative ratios were extracted using MSQuant (http://msquant.sourceforge.net). Only proteins with a GST-fusion/GST alone ratio of  2 were  considered as specific preys. To further reduce the number of false positive protein interactions, a bait-prey interaction is only reported in our MRSA252 PIN dataset if the above ratio ( 2)  93  was observed in at least two pull down experiments with two or more different peptides identified for each prey.  4.2.7  Co-immunoprecipitation with F(ab’)2 fragments of anti-MRSA pyruvate kinase (PK) antibody. Polyclonal anti-MRSA PK antibody was produced by immunization of rabbit with  purified recombinant MRSA His6-PK (Pacific Immunology Crop, CA) according to standard protocols. F(ab’)2 fragments of purified IgG from pre-immune and immunized (anti-MRSA PK) sera were prepared by pepsin treatment (Pierce, Rockford, IL) and immobilized to NHSactivated Sepharose 4 fast flow (GE Healthcare, Uppsala, Sweden) according to manufacturer’s protocols. PK interactors were co-immunoprecipitated from 30 mg of pre-cleared MRSA252 lysate for 2 h at 4°C using 12 g of each immobilized pre-immune (control) or immunized (anti-MRSA PK) F(ab’)2 fragments. Protein complexes released from beads after washing steps were precipitated and solubilized in preparation of in-solution tryptic digestion. Reductive dimethylation labeling and quantitative mass spectrometry-based proteomics were used to identify the physiological interacting proteins with MRSA PK.  4.3 Results and discussion 4.3.1  MRSA PIN summary Initially, 406 MRSA proteins were cloned, expressed as glutathione S-transferase (GST)  fusions and used as baits in affinity pull-down experiments, with each pull-down having an internal, isotopically labelled negative control. The resulting protein complexes were characterized by liquid chromatography-tandem mass spectrometry (LC-MS/MS), allowing the identification of 13,219 interactions among 608 bacterial proteins (the list of 608 interacting  94  MRSA proteins, as well as 11 protein baits with no interactions, can be found in Table A1 of the Supplementary materials). The resulting MRSA PIN illustrated on Figure 4.1 has further been subjected to various types of bioinformatics and statistical analysis. Figure 4.1 2D representation of the developed MRSA PIN.  Hub proteins are marked in yellow and non-hubs are in blue. The conventional antimicrobial targets are marked in red if they are classified as non-hubs and in purple if they are hubs. The PIN diagram was visualized by using Cytoscape. 4.3.2  Accuracy of the protein-protein interactions It has been documented in numerous studies that high throughput PIN mapping methods  generate a substantial amount of potentially false positive and false negative protein-protein interactions [28, 29]. To assess accuracy of our experimental procedures we employed co95  immunoprecipitation method (Co-IP) (see details in the ‘Methods’ section), which is often considered as a more optimal technique approach for detecting true protein-protein interactions in a smaller scale [30]. We applied the Co-IP experiment to re-examine the protein-protein interactions for pyruvate kinase (one of the most connected hubs in the developed PIN, with 243 interacting proteins from the GST pull-down experiment). The Co-IP experiment identified a total of 83 preys that interacted with pyruvate kinase, and out of which 56 were present in the original GST pull-down experiment. This produces a sensitivity value of 67.5% (56/83) or a false negative rate of 32.5% (1-sensitivity). Notably, the estimated 32.5% false negative rate for the generated MRSA PIN is lower than the values previously reported for other systems: 76% for Caenorhabditis elegans, 77% for Treponema pallidum, 85% for Saccharomyces cerevisiae, and 90% for Drosophila melanogaster [16, 29]. In addition, because 56 of the original 243 protein interactions with pyruvate kinase in the GST pull-down can be confirmed by the Co-IP experiment, we have calculated the positive predictive value for our data to be 23% (56/243).  4.3.3  PIN coverage To better understand organization of the estimated MRSA PIN we have compared it  with similar datasets recently reported for Escherichia coli [5, 14], Campylobacter jejuni [15], Treponema pallidum [16] and Saccharomyces cerevisiae species [2]. We were interested in comparing the numbers of protein interactions established per bait in those experiments. The corresponding parameters have been calculated for the above datasets and presented in Table 4.1 along with such relevant properties as average number of interacting partners established per bait, and average degree of a node in a network.  96  Table 4.1 The bait coverage summary for MRSA and other PIN datasets. Baits attempted  Baits succeeded  Baits with no interactions  Interactors (per bait), nr  MRSA 252  449  406  11 (2.7%)  608 (1.5)  E. coli [14]  4339  2667  E. coli [5]  1000  648  C. jejuni [15]  no data  1477  T. pallidum [16] S. cerevisiae [2]  no data 6446  Data set  Interactions (per bait), nr 13219 (32.56)  Average network degree  Average degree (hubs only)  43.48  45.03  2448 (0.92)  8625 (3.23)  7.04  12.88  1241 (1.92)  5170 (7.98)  8.34  30.94  no data  1301 (0.88)  11557 (7.82)  17.76  23.77  991  no data  721 (0.73)  3640 (3.67)  10.10  21.61  1993  239 (12%)  2549 (1.28)  21246 (10.66)  16.68  32.88  330 (12.3%) 118 (18.2%)  nr = non-redundant.  These numbers indicate that the hub-based bait selection strategy allowed recovery of more interactions and interactors per bait. The largest average degree of the MRSA network illustrates its enrichment with hub proteins. The PIN mapping efforts resulted in the lowest number of baits with no interacting prey proteins. As data in Table 4.1 indicate, only 11 such non-interacting proteins (2.7%) are found in the developed MRSA PIN (see Figure 4.1), compared to 12%-18% reported in other studies [2, 5, 14]. The use of predicted hubs as pulldown baits allowed broader network coverage with fewer experiments.  4.3.4  Overlap with other experimental PINs An important consideration for the MRSA protein-protein interactions is the extent of  conservation in similar networks. For each pair of interacting proteins in the subject species we searched for a pair of highly similar interacting proteins in the MRSA (the query) network (using the following similarity criteria: alignment E-value and alignment coverage  10-5, sequence similarity  35%  65%). The first column of Table 4.2 displays the number of proteins  in Escherichia coli, Campylobacter jejuni, Treponema pallidum and Saccharomyces cerevisiae classified as highly similar to MRSA counterparts. This number varies from 11.61% to 31.59%, 97  with E. coli being the closest species, and the S. cerevisiae having the least number of conserved proteins compared to the MRSA. The numbers are, however, much lower for conserved interactions, varying from 0.05% for T. pallidum to 8.82% for E. coli. Table 4.2 Total numbers and percentages of conserved interactions, interactors and hub proteins between MRSA and other datasets. Data set E. coli [14] E. coli [5] C. jejuni [15] T. pallidum [16] S. cerevisiae [2]  Number (%) of homologues with MRSA 584 (23.86%) 392 (31.59%) 302 (23.21%) 162 (22.47%) 296 (11.61%)  Number (%) of conserved interactions with MRSA 259 (3.00%) 456 (8.82%) 123 (1.06%) 2 (0.05%) 263 (1.24%)  Number (%) of conserved hubs with MRSA 26 (11.93%) 27 (21.95%) 8 (6.25%) 0 (0.00%) 16 (6.45%)  Percentages were calculated with respect to the subject species. Such low overlap between PINs established for different species has been reported before [15] and can be attributed to different experimental conditions used in various studies, differences in utilized methodologies and incompleteness of the resulting PINs. Interestingly, the established degree of hubs conservation (Table 4.2) is still notable (with the exception of Treponema set). This provides indirect evidence that ‘wiring’ of the studied networks may be different (i.e. conserved hubs interact with different partners in different PINs) and contributes to overall low similarity between the studied datasets.  4.3.5  Scale-free characteristics of MRSA PIN Recent advances in networks theory have demonstrated a key role of uneven  distributions occurring in many natural processes. It has been found that seemingly unrelated systems such as economic, professional, social and biological networks all exhibit a power-law distribution (1):  P(k ) ~ k −γ  (1)  98  where P(k) is the probability of a selected protein with exactly k links (degree), and γ is the value of the exponent [1, 31]. The heterogeneous architecture of scale-free networks contributes to robustness and error-tolerance from random perturbation and is often viewed as a possible common blueprint for naturally occurring large-scale networks. Those findings have laid the foundation for characterizing the evolution of the protein universe in terms of a growing scale-free system in which individual genes are represented as nodes of a propagating net [32-36]. To evaluate the topology of the developed MRSA PIN, we conducted a distribution analysis of network degrees. Figure 4.2 shows the MRSA252 PIN follows a power-law degree distribution, which fits logarithmic linear dependence with R square value of 0.68. Figure 4.2 Distribution of the number of protein interactions (left panel) and a logarithmic transformation plot for the distribution of protein interactions (right panel). 0  0.16  0  0.14  0.5  1  1.5  2  2.5  3  -0.5  0.12 -1  Log(P(k))  P(k)  0.1  0.08 0.06 0.04  -1.5  -2  -2.5  0.02  -3  0 0  50  100  150  k  200  250  300  -3.5  Log(k)  As indicated in Table 4.3, the lower γ exponent value of 0.71 suggests that highly interacting proteins are more abundant in the MRSA252 PIN compared to the other five PIN systems. To further investigate the degree distribution difference between MRSA252 and the other PINs, we computed the average number of interactions among only hub proteins (the top 99  10% interactors). Table 4.1 indicates that the average degree among hubs is much higher than the average degree from the entire PINs for E. coli, C. jejuni, T. pallidum, and S. cerevisiae. The overall average degree of MRSA252 is higher than the other 5 PINs; however, the degree difference from the hubs is smaller within MRSA252 (45.03 vs. 43.48). The result suggests that the identified MRSA252 protein network is enriched with well-connected hub proteins, possibly representing the core of the entire (but yet to be determined) S. aureus PIN. Table 4.3 Linear regression values based on power-law degree distribution in 6 different PINs.  γ MRSA 252 E. coli [15] E. coli [4] C. jejuni [16] T. pallidum [14] S. cerevisiae [1]  4.3.6  0.71 1.40 1.23 1.27 1.15 1.28  Standard Error 0.039 0.092 0.049 0.050 0.058 0.052  R Square 0.68 0.78 0.89 0.86 0.86 0.83  Characteristic features of hub proteins. A prominent feature of scale-free PINs is the presence of highly connected hub proteins  that play a major role in network functioning and are critical for network stability. Identification of characteristic features and properties of hubs in the developed scale-free MRSA PIN represents an important task. Based on our experimental network containing 608 proteins we defined 60 hubs that correspond to the top 10% of interactors (currently there is no consensus on defining hubs quantitatively [37] but the criteria used here has been justified [12, 13]). We compared the defined MRSA hubs with less connected proteins in terms of their essentiality, cellular abundance and several physicochemical characteristics. We have demonstrated that several sequence-derived physicochemical properties of proteins such as Hydrophobicity, Surface Area and Fraction of Flexible Coil Fragments can exhibit different patterns for hubs and non-hubs [13]. The corresponding average values computed for MRSA 100  hubs and non-hubs are featured in Table 4.4 and indicate that MRSA hubs are generally smaller, more hydrophilic and contain a higher fraction of flexible coil residues as compared to non-hubs. Table 4.4 Differential characteristics averaged for MRSA hubs and non-hubs. The p-values from two sample t-tests are also given.  Number of proteins Fraction of essential proteins in the group Relative protein abundance Average hydrophobicity Estimated surface area Fraction of flexible coil residues  Hubs  Non-hubs  60 76.67% 6.603 -0.331 12458.070 0.583  548 48.72% 1.619 -0.309 14519.700 0.570  t-test p-value n/a n/a 0.000 0.026 0.042 0.010  For certain MRSA proteins that lack protein essentiality data in the Database of Essential Genes [21], we proceeded to knock down or knock-out the designated target using antisense RNA or TargeTron technologies [38]. The corresponding averaged values featured in Table 4.4 demonstrate that the percentage of confirmed essential proteins is much higher among hubs (77%) than in non-hubs (49%). The data in Table 4.4 illustrate that highly connected MRSA proteins have higher levels of relative cellular abundance, in line with previous observations that hubs tend to be essential, involved in house-keeping functions, and usually expressed at higher levels [39]. Differences in physicochemical and biological properties of hub and non-hubs can be utilized for developing predictive bioinformatics approaches that can optimize PIN mapping experiments [12, 13].  4.3.7  Utilizing PIN data for drug target analysis and identification On one hand, the estimated network of interacting bacterial proteins can help  investigating cell biology of S. aureus species. On another hand, the generated data can assist in identification of novel antimicrobial targets that are efficient in disrupting the bacterial protein 101  network. This expectation is established from an observation that antimicrobial targets known to date have not been chosen due to their roles in protein interactomes. To evaluate the network role and effectiveness of the current antibacterial targets we searched the literature (including reviews [40-43]) and identified 94 bacterial proteins that were previously reported as either experimental or established antimicrobial drug targets (see Table A2 of the Supplementary materials). 70 proteins were reported as antimicrobial targets in the MRSA network based on sequence similarity (see Table A1 of the Supplementary materials). Amongst the 70 drug targets, 28 have FDA-approved antibiotics targeting them, and the remaining 42 targets are still either theoretical or experimental. Six of the former group of drug targets were classified as hub proteins in the MRSA252 PIN (50S ribosomal protein L4 [YP_041689.1], 50S ribosomal protein L10 [YP_039993.1], 30S ribosomal protein S4 [YP_041184.1], 50S ribosomal protein L22 [YP_041685.1], 30S ribosomal protein S9 [YP_041655.1], translation elongation factor G [YP_040001.1]). Surprisingly, only two of the experimental targets - UDP-N-acetylmura-moylalanyl-D-glutamate--2,6-dia minopimelate ligase (YP_040406.1) and cell division protein FtsZ (YP_040573.1) met the criterion for MRSA PIN hubs based upon their level of connectivity (169 protein interactions for YP_040406.1 and 137 for YP_040573.1). To illustrate this point graphically, in Figure 4.1 we color-coded these 70 antimicrobial targets and this shows clearly that the majority of the target proteins were positioned in peripheral parts of the interactome (red = non-hub like, purple = hub like). From an absolute numeric standpoint indicating the experimentally estimated average number of protein interactions (average Network Degree, aND), it is clear that conventional antimicrobial targets (aND=51) were closer to non-hubs (aND=28) than they were to actual hubs (aND=174) (see Table 4.5).  102  Table 4.5 Differential characteristics averaged for MRSA hubs, non-hubs and antimicrobial targets and corresponding p-values from two sample t-tests. Hubs  Non-hubs  Drug targets (all)  52  486  70  Average network degree  173.73  28.40  Average network betweenness Average conservation across 20 bacterial pathogens (%) Average sequence similarity to human proteins (%)  5429.74  Number of proteins  Relative protein abundance  p-values (hubs vs. non-hubs)  p-values (hubs vs. targets)  p-values (non-hubs vs. targets)  51.44  0.000  0.000  0.036  307.22  959.98  0.000  0.000  0.003  75.29  62.33  86.00  0.001  0.160  0.000  46.26  41.69  36.61  0.088  0.006  0.032  6.55  1.66  1.98  0.000  0.000  0.236  These observations demonstrate that the current antimicrobial targets tend not to play central roles in the protein interactome of the MRSA. Therefore attacking such targets may not be drastic enough to break the overall integrity of PINs, as the system is likely to recover from the loss of a non-hub node [1, 31]. We calculated the corresponding Network Betweenness (NB) values for all 608 studied MRSA proteins, as the betweenness measure of a protein has been shown to highly correlate with its cell lethality in protein interaction networks [44]. In particular, for each protein i its NB parameter has been defined as the fraction of all shortest paths that pass through the protein i (2):  N jk (i )  NBi = i ≠ j ≠ k∈G  N jk  (2)  where Njk is the number of shortest paths between j and k in a network graph (G), and Njk(i) is the number of shortest paths between j and k that pass through a protein i. The estimated NB parameters (found in Table A1 of the Supplement Materials) reflect the relevance of each protein for the network efficiency, where higher NB values indicate that the removal of the corresponding protein nodes will have more destabilizing effects. The 103  average NB values computed for 52 MRSA hubs and 486 non-hubs (excluding the antimicrobial targets) differ significantly (5430 versus 307) and reflect previously outlined network-stabilizing effects of high interactors. Importantly, the average NB value of 960 estimated for known antimicrobial targets not only groups them with less connected proteins but also clearly characterises them as targets with low network impact (see Table 4.5 and Figure 4.3 for more details). Figure 4.3 Average number of protein interactions among drug targets, hubs and non-hubs (left panel), and Network Betweenness (NB) values for drug targets, hubs and non-hubs (right panel).  These unexpected observations raised an immediate question: why have MRSA hub proteins not yet emerged as antimicrobial targets? It is perhaps because antimicrobial compounds derived naturally from other bacteria have avoided targeting their own hub proteins that are conserved across bacterial species. In addition, it is reasonable to assume that conventional selection of drug targets by the pharmaceutical industry has been biased toward essential pathogenic proteins that do not have close human homologues and exhibit low cellular abundance.  104  We considered 70 antimicrobial targets, 52 hubs and 486 non-hubs of the MRSA, and within these three groups we evaluated parameters of cross-bacteria conservation of the constituent proteins, their relative cellular abundance and similarity to human sequences. In order to assess the degree of cross-species conservation of the MRSA proteins we selected 20 pathogenic bacteria (8 gram-positive and 12 gram-negative species). Then for each MRSA PIN node we conducted a sequence similarity search across 20 bacterial proteomes and counted the number of species with at least one homologue to the query protein. Similarity to human proteins has been established by comparing 608 MRSA sequences against human proteome and reporting the highest estimated sequence similarity values with at least 30% alignment coverage with respect to the query MRSA protein. The established parameters along with the values of cellular abundance of MRSA proteins have then been subjected to the two-sample t-test conducted for the three groups. The estimated p-values have been collected into Table 4.5 along with the corresponding group averages for conservation, human similarity and relative abundance numbers. These estimates indicate that antimicrobial targets can be distinguished from the MRSA hubs and exhibit the highest bacterial sequence conservation, lowest protein abundance and lowest sequence similarity to human proteins among the studied sequences. In a plot of 608 studied MRSA proteins within the mentioned three dimensions (Figure 4.4) we observed that known antimicrobial targets form two distinctive groups that are clearly separated from the MRSA hubs.  105  Figure 4.4 Separation of hubs non-hubs and antimicrobial targets in the three-dimensional space of protein abundance, protein conservation and similarity to human proteins.  A) Top view of the space, B) side view. The areas of the 3D space where targets are grouped marked with red ovals and the MRSA hubs are highlighted within the yellow ovals. 106  Such separation supports the idea that current antimicrobial targets have been chosen using criteria of essentiality, cellular abundance, and sequence conservation, but not network characteristics. Higher cellular abundance of some MRSA hubs and/or their higher similarity to human proteins could contribute to the fact that only very few hubs have been targeted to date. The demonstrated moderate conservation of hub proteins and low conservation of proteinprotein interactions between different species may also be a contributing factor – i.e. some of the existing antimicrobials targets may, in fact, act as hubs in other bacterial networks. One can argue that the reported PIN is limited and much enriched with hubs and, therefore, it does not adequately represent a global MRSA network consisting of all 2656 proteins. In a complete protein network many of the considered conventional antimicrobial targets may correspond to the top 10% of interactors (i.e. hubs). In such a scenario, the target proteins will still have fewer connections in the global MRSA PIN compared to the reported hubs and, therefore, will still be less attractive for developing antibiotics from a network stability standpoint. In general, our results demonstrate that conventional drug targets tend to group with non-hub proteins, rather than highly-connected hubs when judged by their network degree, betweenness values and physical-chemical characteristics. On another hand, we also observed that actual drug targets in use in the clinic do possess higher network connectivity values (Table 4.6) and more frequently correspond to hub proteins (6 out of 28 are hubs or 21%), when compared to experimental or theoretical antimicrobial targets (2 out of 42 are hubs or 5%). Despite the fact that actual drug targets with clinical relevance have not been deliberately chosen based on their hierarchy of network connectivity, their tendency to segregate with hub proteins may be more than a random event.  107  Table 4.6 Differential characteristics averaged for clinically approved and experimental antimicrobial targets and corresponding p-values from two sample t-tests. Hubs  Drug targets (approved)  Drug targets (experim ental)  52  28  42  Average network degree  173.73  69.36  Average network betweenness Average conservation across 20 bacterial pathogens (%) Average sequence similarity to human proteins (%)  5429.74  Number of proteins  Relative protein abundance  p-values (hubs vs. approved targets)  p-values (hubs vs. experimen tal. targets)  p-values (approved vs. experimen tal targets)  39.50  0.00  0.00  0.03  1487.27  608.45  0.00  0.00  0.02  75.29  88.57  84.29  0.02  0.11  0.44  46.26  34.42  38.08  0.02  0.08  0.53  6.55  3.12  1.25  0.03  0.00  0.01  The demonstrated scale-free character of bacterial PINs along with the recent developments in the network theory make it obvious that conventional target selection criteria should be supplemented with network characteristics of proteins. MRSA hubs should be considered as potential antimicrobial targets. In particular, the generated MRSA data indicate that there are certain proteins that act as network hubs and yet meet all usual target selection criteria – i.e. they are essential for bacterial survival, have no paralogues in the MRSA genome, demonstrate low cellular abundance, and do not have close human homologues (see Table A1 of the Supplementary materials). Such highly interacting bacterial hubs represent an alternative drug development opportunity, as the discovery of the corresponding antagonists will not only expand an arsenal of modern antibiotics, but will also help address the problem of rising antibiotic resistance. Our own initial successful results on developing selective inhibitors for MRSA pyruvate kinase (manuscript in preparation) may serve as an additional argument for such network-guided drug design strategy.  108  4.4 Conclusion We reported 13,219 interactions involving 608 MRSA proteins experimentally established using high confidence methods of protein pull-down and mass-spectrometry. The identified protein interactions have been analysed using bioinformatics and statistical tools demonstrating a low degree of resemblance between the MRSA interaction network and previously reported data for Saccharomyces cerevisiae, Helicobacter pylori, Caenorhabditis elegans and Escherichia coli species. Such low similarity between the PINs of closely related bacteria justifies that it may not be feasible to infer protein-protein interactions from the existing experimental data. The identified scale-free protein interaction network for the MRSA revealed the network role of conventional antimicrobial targets, demonstrating predominant non-hub characteristics. This observation characterizes them as low efficiency targets which are more likely to develop resistance. The results of the network-guided study add insight into the biology of S. aureus, offer clues for rising antimicrobial resistance and lay a foundation for developing novel and more effective antibiotics.  109  4.5 References 1. 2.  3. 4.  5.  6. 7. 8. 9. 10. 11. 12. 13. 14.  Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101-113. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636. Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, Chemama Y, Labigne A, Legrain P: The protein-protein interaction map of Helicobacter pylori. Nature 2001, 409(6817):211-215. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, HirozaneKishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540-543. Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433(7025):531-537. Flajolet M, Rotondo G, Daviet L, Bergametti F, Inchauspe G, Tiollais P, Transy C, Legrain P: A genomic approach of the hepatitis C virus generates a protein interaction map. Gene 2000, 242(1-2):369-379. McCraith S, Holtzman T, Moss B, Fields S: Genome-wide analysis of vaccinia virus protein-protein interactions. Proc Natl Acad Sci U S A 2000, 97(9):4879-4884. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411(6833):41-42. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science 2002, 296(5568):750-752. Moran GJ, Mount J: Update on emerging infections: news from the Centers for Disease Control and Prevention. Ann Emerg Med 2003, 41(1):148-151. Simor AE, Ofner-Agostini M, Bryce E, Green K, McGeer A, Mulvey M, Paton S: The evolution of methicillin-resistant Staphylococcus aureus in Canadian hospitals: 5 years of national surveillance. CMAJ 2001, 165(1):21-26. Hsing M, Byler KG, Cherkasov A: The use of Gene Ontology terms for predicting highly-connected 'hub' nodes in protein-protein interaction networks. BMC Syst Biol 2008, 2:80. Byler K, Hsing M, Cherkasov A: The Use of sequence-derived QSPR descriptors for predicting highly connected proteins (hubs) in protein-protein interactions. QSAR & Combinatorial Science 2009, 28:509-519. Arifuzzaman M, Maeda M, Itoh A, Nishikata K, Takita C, Saito R, Ara T, Nakahigashi K, Huang HC, Hirai A, Tsuzuki K, Nakamura S, Altaf-Ul-Amin M, Oshima T, Baba T, Yamamoto N, Kawamura T, Ioka-Nakamichi T, Kitagawa M, Tomita M, Kanaya S, 110  15. 16. 17.  18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34.  Wada C, Mori H: Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res 2006, 16(5):686-691. Parrish JR, Yu J, Liu G, Hines JA, Chan JE, Mangiola BA, Zhang H, Pacifico S, Fotouhi F, DiRita VJ, Ideker T, Andrews P, Finley RL, Jr.: A proteome-wide protein interaction map for Campylobacter jejuni. Genome Biol 2007, 8(7):R130. Titz B, Rajagopala SV, Goll J, Hauser R, McKevitt MT, Palzkill T, Uetz P: The binary protein interactome of Treponema pallidum--the syphilis spirochete. PLoS One 2008, 3(5):e2292. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32(Database issue):D452-455. Adamczak R, Porollo A, Meller J: Combining prediction of secondary structure and solvent accessibility in proteins. Proteins 2005, 59(3):467-475. CEPOS: ParaSurf [http://www.ceposinsilico.de/Pages/Products.html] Kwok MC, Holopainen JM, Molday LL, Foster LJ, Molday RS: Proteomics of photoreceptor outer segments identifies a subset of SNARE and Rab proteins implicated in membrane vesicle trafficking and fusion. Mol Cell Proteomics 2008, 7(6):1053-1066. Zhang R, Lin Y: DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res 2009, 37(Database issue):D455-458. StatSoft: STATISTICA [http://www.statsoft.com/] Foster LJ, De Hoog CL, Mann M: Unbiased quantitative proteomics of lipid rafts reveals high specificity for signaling factors. Proc Natl Acad Sci U S A 2003, 100(10):5813-5818. Boersema PJ, Aye TT, van Veen TA, Heck AJ, Mohammed S: Triplex protein quantification based on stable isotope labeling by peptide dimethylation applied to cell and tissue lysates. Proteomics 2008, 8(22):4624-4632. Chan QW, Howes CG, Foster LJ: Quantitative comparison of caste differences in honeybee hemolymph. Mol Cell Proteomics 2006, 5(12):2252-2262. Matrix Science: Mascot [http://www.matrixscience.com/] Chan QW, Foster LJ: Changes in protein expression during honey bee larval development. Genome Biol 2008, 9(10):R156. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403. Huang H, Jedynak BM, Bader JS: Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Comput Biol 2007, 3(11):e214. Howell JM, Winstone TL, Coorssen JR, Turner RJ: An evaluation of in vitro proteinprotein interaction techniques: assessing contaminating background proteins. Proteomics 2006, 6(7):2050-2069. Barabasi AL: Linked: The New Science of Networks. Cambridge, Mass: Perseus Publ; 2002. Luscombe NM, Qian J, Zhang Z, Johnson T, Gerstein M: The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol 2002, 3(8):RESEARCH0040. Koonin EV, Wolf YI, Karev GP: The structure of the protein universe and genome evolution. Nature 2002, 420(6912):218-223. Qian J, Luscombe NM, Gerstein M: Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J Mol Biol 2001, 313(4):673-681. 111  35. 36. 37. 38. 39. 40. 41. 42. 43. 44.  Rzhetsky A, Gomez SM: Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 2001, 17(10):988-996. Yanai I, Camacho CJ, DeLisi C: Predictions of gene family distributions in microbial genomes: evolution by gene duplication and modification. Phys Rev Lett 2000, 85(12):2641-2644. Vallabhajosyula RR, Chakravarti D, Lutfeali S, Ray A, Raval A: Identifying hubs in protein interaction networks. PLoS One 2009, 4(4):e5344. Yao J, Zhong J, Fang Y, Geisinger E, Novick RP, Lambowitz AM: Use of targetrons to disrupt essential and nonessential genes in Staphylococcus aureus reveals temperature sensitivity of Ll.LtrB group II intron splicing. RNA 2006, 12(7):1271-1281. Ivanic J, Yu X, Wallqvist A, Reifman J: Influence of protein abundance on highthroughput protein-protein interaction detection. PLoS One 2009, 4(6):e5815. Payne DJ, Gwynn MN, Holmes DJ, Pompliano DL: Drugs for bad bugs: confronting the challenges of antibacterial discovery. Nat Rev Drug Discov 2007, 6(1):29-40. Chan PF, Holmnes DJ, Payne DJ: Finding the Gems Using Genomic Discovery: Antibacterial Drug Discovery Strategies - The Successes and Challenges. Drug Discovery Today: Therapeutic Strategies 2004, 1(4):519-527. Garcia-Lara J, Masalha M, Foster SJ: Staphylococcus aureus: the search for novel targets. Drug Discov Today 2005, 10(9):643-651. Barker JJ: Antibacterioal Drug Discovery and Structure-Based Design. Drug Discovery Today 2006, 11(9/10):391-404. Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M: The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol 2007, 3(4):e59.  112  5 Indel PDB: a database of structural insertions and deletions derived from sequence alignments of closely related proteins4 5.1 Introduction Insertions/deletions (indels) and amino acid substitutions represent the two most common types of sequence variations, observed among similar proteins [1]. Unlike amino acid substitutions, which have been studied intensively in the past years [2, 3], indels remain less understood and still pose many biological questions. Recently, a large-scale indel analysis has been conducted for 136 complete bacterial and protozoan genomes, and the results have shown that up to 5-10% of all proteins contained sizable indels, when compared to human homologues [4]. Our research has further shown possible relationships between indels and protein essentiality [5] and the role of indels in protein-protein interactions. For instance, it has been shown that indel-containing proteins were more likely to be essential than non-indel proteins and involved in more protein-protein interactions [5]. It has been suggested that sequence insertions and deletions can change protein-protein interactions and modify protein network characteristics [6]. Moreover, it has been demonstrated that the structural differences of indel sites between pathogen and host proteins can have valuable therapeutic applications, enabling selective targeting of conserved bacterial proteins, but at the same time, eliminating drug cross-reaction with the human homologues [7-9]. For instance, it has been shown that a Leishmania elongation factor contains a 12 amino acid sequence deletion compared with its human homolog, and the deletion site has been used for developing small compounds targeting specifically to the Leishmania protein but not the human protein [9]. 4  A version of this chapter has been published. Hsing M, Cherkasov A. Indel PDB: a database of structural insertions and deletions derived from sequence alignments of closely related proteins. BMC Bioinformatics 2008, 9:293. 113  Despite the common occurrences of indels and their important roles in protein functions, currently there are no bioinformatics resources that archive structural and sequence information on indel sites derived from sequence alignments of similar proteins. Although early studies have shown us some common features shared by indels in limited datasets [10-15], our understanding of indels can be improved by utilizing the large amount of structural data, as accumulated in Protein Data Bank [16]. Thus we present here, Indel PDB, a structural database of insertion and deletion sites, extracted from aligned protein sequences in PDB. The goal of Indel PDB is to provide a resource of indel 3D structures, which enable various bioinformatics analyses including primary sequence composition, secondary structure assignment, solvent accessibility, length distribution, protein domain association, homology modeling and other comprehensive structural studies. Some of such applications from Indel PDB have been performed and reported in this paper. Indel PDB is different from loop databases, whose scope is limited to protein loops that lack clear secondary structures [17, 18]. For instance in ArchDB [18], which represents one of the most comprehensive loop databases available on the internet, loops are defined as regions that connect the regular secondary structures, extracted from 9587 protein structures. ArchDB classified a total of 58,664 loops (ArchDB95, 13-6-2007) based on their structural similarity with respect to the surrounding secondary structures. On the other hand, Indel PDB is not limited to loops, but includes all possible gaps (insertions or deletions) present in sequence alignments among closely related proteins in PDB; therefore such indel sites can possess any possible secondary structures. Although some overlap between Indel PDB and loop databases is expected, Indel PDB features more indel sequences with secondary structures including alpha-helices and beta-sheets in addition to 114  loops. In fact, our analyses have demonstrated that many indels had recognizable 2D structures, in contrast to previous studies that showed most indels had undefined structures and loops [13]. To further distinguish between indels and loops, their differences have been investigated in three aspects: sequence composition, length distribution, and solvent accessibility. In addition, Indel PDB contains a larger structural database in comparison to ArchDB. Indel PDB is consisted of 117,266 non-redundant indel structures extracted from 11,294 indelcontaining proteins. Both the indel structural data and the analysis results are freely accessible through the Indel PDB website [19]. We believe data presented in Indel PDB will not only enable future functional studies of indels, but also facilitate protein modeling of indels and the identification of novel drug binding sites against infectious diseases. Thus, potential users of Indel PDB include 1) molecular biologists who wish to study the functions of particular indel sites by integrating information on protein domains, 2) structural biologists who wish to improve protein homology models or to perform a comprehensive indel structural analysis based on the available indel 3D coordinates, and 3) computational chemists who are searching for potential compound-binding sites of new drug leads by the use of a comprehensive indel search engine available at the Indel PDB website. Indel PDB is freely available over the internet at (http://www.cnbi2.ca/indel/).  5.2 Methods 5.2.1  Constructing Indel PDB Building Indel PDB involved nine steps, each of which is described in this section and  depicted on Figure 5.1. In step one, a total of 38,395 PDB structural files and their sequences were downloaded from the PDB website ([20], dated August 30th 2006). In step two, 115  BLASTCLUST (a part of BLAST 2.2.13 package) was used to cluster sequences of the downloaded proteins, based on 100% identity. Proteins that had 100% identical sequences were removed, and only one representative protein was included in the next steps. To avoid short protein sequences, only proteins with 70 or more amino acids were selected for the subsequent parts of the analysis. There were a total of 22,103 proteins that met such criteria. Figure 5.1 A flowchart for constructing the Indel PDB web server.  The numbers in brackets indicate proteins or indels remained at each stage of the bioinformatics pipeline.  Furthermore, 22,103 PDB protein sequences were aligned to each other, using BLASTp (version 2.2.13) with the following parameters: e-value  10-5 and low-complexity filter turned  on. The BLAST produced 1,299,971 insertion sites, whose locations and sequences were parsed by in-house Perl scripts, and the results were stored in a MySQL database. Because an insertion site in a query sequence implied a deletion site in a subject sequence, and vice versa;  116  only insertions but not deletions needed to be represented in the database. Therefore, in this paper the word ‘indel’ was used to refer insertion sites on query proteins. In step four, the DSSP program (obtained from [21] in August, 2006, original paper by [22]) was performed on each of the 22,103 protein structures from step #2 to determine the secondary structures and solvent accessibility. During step 1 to 4, we have noticed there were discrepancies between some ‘original’ protein sequences as obtained from the PDB website and the ‘actual’ sequences as stored in the PDB structural files. Such discrepancies were caused by unresolved structural regions or gaps in the crystallographic analysis. Thus, in the fifth step, another BLAST was performed on the original protein sequences against the actual PDB sequences to identify all the unresolved regions in the structural files. This step is important to ensure that the unresolved structural regions in the PDB files were removed from the subsequent indel analyses, so that Indel PDB represents true insertions or deletions in sequence alignments, not the structural gaps in crystallographic analysis. In step number six, the information from step #5 and the original indel positions (step #3) was combined to accurately assign secondary structures and solvent accessibility scores to each of the indel sites. In the seventh step, indels were selected to be included in the Indel PDB database, based on the following criteria which ensured that the indels were the results of significant BLAST alignments. The BLAST alignment criteria were e-value 50%, and alignment coverage  10-5, sequence similarity  80%. In addition, any indel site that contained unresolved  region in the PDB structure or low-complexity residues marked by BLAST, was removed from the analysis and excluded from Indel PDB. In the eighth step, Perl scripts were utilized to extract 3D coordinates of the selected indel sites from the PDB structural files. The indel structures of the same protein were copied  117  into a single PDB file. A total of 11,294 PDB files were produced, which together contained the 3D structures of 488,039 indel sites. In the final step, an Apache web server was setup on an IBM Pentium D computer, which links to all the necessary indel information and files stored in a local MySQL database. All of the above indel results are stored in two tables: [indel_pdb_summary] and [pdb_blast_alignment], as shown in Figure 5.2. The connection between the web server and the MySQL database was established through Perl and CGI.  118  Figure 5.2 A database schema for Indel PDB.  5.2.2  Comparative analysis of indels To demonstrate applications of the Indel PDB database, we utilized the indel data to  investigate several indel features that include sequence composition, length distribution, secondary structure composition, and solvent accessibility. All of the analyses operated on a non-redundant set of indel sites, which were extracted from the original set of 488,039 indels 119  by grouping together indel sites with the same start and end position on the same protein. The resulting non-redundant set contains 117,266 indel sites. The values required for each of the analyses were retrieved from the MySQL database using Perl scripts. The analyses of amino acid sequence and secondary structure composition were repeated on both the indel sites and the full-length indel-containing proteins (referred as indel proteins). Data obtained from indel proteins were treated as background values that were compared to the indel site data. Chi-square test was applied to evaluate if the differences between indel sites and indel proteins were significant. For instance, in the case of comparing the alpha-helix content (H) between indel sites and indel proteins (our samples), the percentages of residues that were H or non-H in both samples were calculated. Then a Chisquare test value was calculated and a P-value was assigned. The same process was repeated for the other secondary structures or the sequence compositions. Solvent accessibility was measured by (the number of water molecules in contact with a residue) multiplied by 10 or (residue water exposed surface in Angstrom)^2, according to the DSSP program. Two sample t-test was applied to compare the differences of solvent accessibility between indel sites and indel proteins.  5.2.3  Length distribution The indel and loop length distributions were modeled by the Weibull [23] and power  law distributions. The Weibull survival distribution can be described by the function: S(x) = exp{-(x/ ) }, x  0,  , >0  where S(x) is the survival function, and  and  represent a scaling factor and a shape  parameter, respectively. The double logarithmic transformation of the Weibull function was performed: 120  ln(-ln(S(x)) = ln(x) – ln( ) The survival function, S(x), is the probability that a variable X has a value greater than a number x. S(x) was calculated by dividing the number of indels with more than x residues by the total number of indels, where x ranges from 1 to 49. If the Weibull distribution can accurately model the indel length distribution, the double natural logarithmic plot is expected to be linear. The Pearson correlation coefficient (r2) as implemented in MS Excel was used to evaluate the linearity of the resulting plot. The power law distribution is represented by the function: S(x) = ax-b The logarithmic transformation of the function is: log(S(x)) = log(a) – b log(x) Therefore, the fitness of the power law function for the indel/loop length distribution has been evaluated based on the Pearson correlation coefficient (r2) of the linear plot.  5.2.4  Location analysis of protein domain and indel To further study functional aspects of indels presented in Indel PDB, we have  investigated the presence of protein domains that were in the proximity of indel sites. First, 9,318 protein domain profiles characterized by Hidden Markov Model (HMM) were obtained from the Pfam database (version 22.0, [24]). Second, the HMMER program (version 2.3.2, [25]) was utilized to scan each of the 22,103 PDB protein sequences against each of the 9,318 Pfam domain profiles. The scanning processes were performed on a cluster of 50 CPUs to generate outputs, which contained the exact starting and ending amino acid residues where protein domains were located for each of the protein sequences. In step three, the locations of the protein domains were overlaid with the locations of 117,266 indel sites in 11,294 indel121  containing proteins. From an indel perspective, we calculated the distance between any given indel site and all domains on a given protein. The distance was measured by the number of amino acid residues between the boundary of the indel site and a domain site. If there was an overlap between the residues of the indel and the domain, the distance was assigned a “0”. Based on the locations of the indels and the domains, we have computed the overall percentages of indel sites that overlapped with domains, and vice versa, the overall percentages of domain sites that overlapped with indels. In addition, the top 20 protein domains with the highest percentages of overlapping indels were reported with P values determined by Fisher’s exact test. The Fisher’s exact test was performed with a 2x2 contingency table (column 1: indel containing protein, column 2: non-indel containing protein, row 1: contained a domain that overlapped with an indel, row 2: not contained a domain that overlapped with an indel)  5.3 Results and discussion 5.3.1  Overview of Indel PDB Indel PDB contains sequence and structural data associated with 488,039 (or 117,266  non-redundant) indel sites, extracted from 11,294 indel-containing proteins in PDB. Indel PDB and the indel analysis results are freely accessible to the public over the internet on the World Wide Web [19]. An easy way for users to interact with indel data is through a comprehensive indel search engine. Users can search indels using one or more of the following criteria, including PDB ID, indel length, secondary structure composition, solvent accessibility score, and proximity with protein domains. In addition, users can specify the sources (species) of query and subject proteins. For example, the various searching criteria can be used to identify indels of interests between pathogens and humans for possible drug target binding sites. Furthermore, 122  users can set a specific range on indel length, secondary structure or solvent accessibility to find indel sites that are, for instance, long, mainly alpha-helical, and surface exposed. Moreover, users can search for indels that overlap with certain protein domains by turning on the domain search option, setting the proximity domain distance to ‘0’ and giving a specific domain name or ID (e.g. Peroxidase or PF00141). Such results are useful to study the functional roles of indels among similar proteins. Alternatively, a query protein sequence can be submitted and searched against all the indel sequences in Indel PDB by BLASTp. Successfully indel hits are displayed to users. As shown in Figure 5.3, the following information of each indel site is displayed: Query PDB ID (protein that contains the insertion site), Query name, Query source, Subject PDB ID (protein that contains the corresponding deletion site), Subject name, Subject source, BLAST alignment scores, the complete sequence alignment, indel location (start and end position on the query protein), indel length, indel sequence, indel secondary structure composition, and indel solvent accessibility scores. Moreover, each indel protein has been cross-referenced to the UniProt database for comprehensive functional annotations. Furthermore, the page shows the number of protein domains that are in proximity of the indel sites, with links to additional information on the domains. The “help” function on each webpage contains more detailed information on web site navigation and display.  123  Figure 5.3 An indel detail page on Indel PDB.  The navigation buttons on the top of the page provide easy access to different functionality of the website. The indel report on the query protein, 10gs_A is shown. The screenshot displays indel sites between the query and one of its subject proteins (1ags_A). A detailed BLAST alignment report, followed by an indel summary table is shown.  In addition, users can visualize each indel 3D structure on Indel PDB by a Jmol JAVA applet [26]. As an example, a 3D view of a 14-amino acid indel site, with an alpha-helix structure, between 1EDO_A and 1UZL_A is shown in Figure 5.4. Indel PDB includes not only the 3D atomic coordinates of each indel, but also the anchoring residues up to 6 amino acids on each side of the indel, which can be used for protein homology modeling of the indel regions. The indel 3D structure files can be downloaded directly from the Indel PDB website.  124  Figure 5.4 A 3D view of an indel site by the Jmol applet on Indel PDB.  The indel site (insertion) is 14 amino acid long, present on a query protein, 1EDO_A (Betaketo acyl carrier protein reductase), in an alignment with a subject protein, 1UZL_A (3oxoacyl-[acyl-carrier protein] reductase). The indel site has an alpha-helix structure.  In the following sections, we demonstrated the applications of Indel PDB to characterize the structural features of indels. In particular, we studied the sequence composition, length distribution, secondary structure composition, solvent accessibility and domain association of indels in known protein structures. The results obtained are important for understanding the functions of indels and their roles in protein essentiality, protein-protein interactions and drug design. For example, the results on solvent accessibility and secondary structure composition will enable the identification of surface exposed indel sites with unique structural conformation, which can be applied to design novel drug binding site for bacteria and their host proteins. Moreover, each indel site in Indel PDB has a start and end location with respect to its PDB sequence, and thus the indel locations can be mapped to nearby protein domains to investigate the functions of the indels and their potential ligand-binding ability. In addition, the sequence composition results enable studying the bias of amino acid usage on indel sites. Finally, the length distribution models of indels can provide insights about indel abundance among related proteins.  125  5.3.2  Sequence composition of indels The average sequence composition of the 20 amino acids in each of the 117,266 indel  sites was calculated, and the calculation was repeated in the full-length sequences of the 11,294 indel-containing proteins, where those indel sites were extracted from. Indels and their indel proteins were classified into four groups according to their length: indels with 1, 5, 10, 15, or 20 residues. Table A3 (Supplementary materials) summarizes the sequence composition results and the corresponding chi-square and p values of each indel length category, by comparing indel sites and their indel-containing proteins. The average amino acid composition of indels with different minimal length is depicted in Figure 5.5. As shown in the Figure, the average residue percentages of A, D, and E increased, but those of G, L, N, S, T, and Y decreased when the length of indels went up. The rest of amino acids did not show any clear trend among different indel-length groups. This result indicated that residues such as Alanine (A) and Glutamic acid (E), which prefer a helix conformation [27], are more frequently observed as the length of indels increased. This observation is supported by the later analysis of secondary structural composition, which showed an increase of alpha-helix content (H) in indels as length increased. Figure 5.5 Amino acid composition of indel sequences.  126  Figure 5.6 shows the natural logarithm of the ratio of average amino acid frequency in the indel sites to that in the full-length protein sequences. Some trends can be easily identified from the Figure. For instance, indel sites contained more D, P, and Y in comparison to the entire protein sequences, while I, L, Q, T and V were reduced in indel sites. The differences are significant at P value < 0.001, based on the chi-square tests. Figure 5.6 The difference of average amino acid composition between indel sites and fulllength protein sequences.  The y-axis shows the natural logarithm of the ratio of average amino acid frequency in the indel sites to that in the full-length protein sequences.  To compare the sequence composition of indels to that of loops (protein regions that lack any defined secondary structures), the average amino acid frequency of all loops in the 11,294 indel-containing proteins has been computed. A total of 310,103 of loops of various lengths have been identified from the proteins. As shown in Figure 5.7, the indel sites contained more A, D, E, F, K, M, R, W, Y residues in comparison to the loops, but less C, I, L, N, P, Q, S, T, V residues. The differences are significant at P value < 0.001, based on the chi-square tests. Table A4 (Supplementary materials) summarizes the sequence composition results and the corresponding chi-square and p values of each indel length category, by comparing indel sites and loops. 127  Figure 5.7 The difference of average amino acid composition between indel sites and loop sequences.  The y-axis shows the natural logarithm of the ratio of average amino acid frequency in the indel sites to that in the loop sequences. 5.3.3  Length distribution of indels Our previous indel studies have shown that indel length distribution could be accurately  modeled by the Weibull distribution [4, 5]. Therefore, in the current study the Weibull distribution was used to model the length distribution of the 117,266 indel sites and 310,103 loops from Indel PDB. In addition to the Weibull function, the length distributions of indels and loops have been fitted to a power-law function. The survival function over a range of indel or loop lengths (from 1-25) was plotted on Figure 5.8, indicating that there were many short indels/ loops but very few longer indels/loops. The number of indels and loops both reduced as the length increased, however, the number of loops reduced as a faster rate than indels. Moreover, the maximal loop length was 26 amino-acid long, while the maximal indel length was 50 in the Indel PDB dataset.  128  Figure 5.8 Accumulative indel length distribution.  The survival function of indel and loop lengths were plotted against a range of lengths from 1 ~ 25 residues.  Figure 5.9a shows a double logarithmic plot of the survival function versus the logarithm of the indel length. The plot could be fitted onto a liner line with R2 value of 0.9661, indicating a good fit of indel length distribution by the Weibull distribution. In Figure 5.9b, the indel length distribution was fitted into a logarithmic transformation of a power law function, with R2 value of 0.9643. The loop length distribution has been fitted to the Weibull function (Figure 5.10a) with R2 value of 0.991, and the power law function with R2 value of 0.9237 (Figure 5.10b).  129  Figure 5.9 Indel length modeled by the Weibull and power law distribution.  a) The double natural logarithm of the survival function in Weibull distribution was plotted against the logarithm of indel length, which ranged from 1 to 49 residues. The plot was fitted onto a liner line with R2 = 0.9661, indicating a good fit of the Weibull distribution. b) The logarithm of the survival function in power law distribution was plotted against the logarithm of indel length, which ranged from 1 to 49 residues. The plot was fitted onto a line with R2 = 0.9643.  130  Figure 5.10 Loop length modeled by the Weibull and power law distribution.  a) The double natural logarithm of the survival function in Weibull distribution was plotted against the logarithm of loop length, which ranged from 1 to 25 residues. The plot was fitted onto a liner line with R2 = 0.991, indicating a good fit of the Weibull distribution. b) The logarithm of the survival function in power law distribution was plotted against the logarithm of loop length, which ranged from 1 to 25 residues. The plot was fitted onto a line with R2 = 0.9237.  Thus, the Weibull function has a better fit of the length distribution of indels and loops, in comparison to the power law function. In addition, the results suggest that the occurrence of indels in the studied PDB proteins cannot be attributed to random processes (when normal  131  distribution behaviors would be expected), and the indel lengths are likely to be associated with certain evolutionary mechanisms.  5.3.4  Secondary structure composition of indels We have assigned secondary structures to each of the 11,294 indel-containing proteins  and their 117,266 indel sites in Indel PDB. Secondary structures were defined and assigned by DSSP [22] and its computer program [21]. Table A5 (Supplementary materials) summarizes the secondary structure composition results and the corresponding chi-square and p values of each indel length category, by comparing indel sites and their indel-containing proteins. Figure 5.11 illustrates that when the indel length increased, there was an increase of alpha-helix content (H) in indels, while the percentages of extended beta-strands (E), H-bonded turns (T), bends (S) and loops decreased. In comparison to the indel proteins (as shown in Figure 5.12), indel sites have increased percentages of T, S, and loop structures, but reduced contents of alpha helices and beta strands. The differences are significant at P value < 0.001, based on the chi-square tests. The proportion of alpha-helices in shorter indels was lower, comparing to the indel protein sequences. However, longer indels had comparable or higher H content than the indel proteins. In contrast to previous findings [12, 13], our results showed that many indel sites had recognizable secondary structures such as alpha helices and beta sheets, in addition to loops or turns.  132  Figure 5.11 Average secondary structure composition of indels.  Secondary structures were defined by DSSP [22]: H = alpha-helix, B = residue in isolated betabridge. E = extended strand, participates in beta-ladder, G = 3-helix, I = 5-helix, T = H-bonded turn, S = bend, and loop = undefined structure.  Figure 5.12 Difference of average secondary structure composition between indel sites and full-length protein sequences.  The y-axis shows the natural logarithm of the ratio of average secondary structure frequency in the indel sites to that in the full-length protein sequences.  133  5.3.5  Solvent accessibility of indels The DSSP program was utilized to predict solvent accessibility of the indel proteins and  their indel sites. Table 5.1 indicates that indel sites in all five length groups had higher average of solvent accessibility values than the indel proteins. The differences of solvent accessibility values were significant at P value of 0. The result showed that indel sites were more exposed to the protein surface than average residues of the proteins. Table 5.1 Solvent accessibility of indel sites and indel proteins. Minimal indel length Indel – avg. solvent accessibility indel protein – avg. solvent accessibility t test value p value  1  5  10  15  20  59.53 44.68  60.36 43.52  60.03 43.46  58.46 42.05  60.50 42.80  168.20  130.52  73.96  45.96  34.83  0  0  0  0  0  Solvent accessibility was measured by (the number of water molecules in contact with a residue) multiplied by 10 or (residue water exposed surface in Angstrom)^2, according to the DSSP program. The higher the solvent accessibility value, the more exposed the residue is.  In addition, the solvent accessibility of indels has been compared to that of loops. Table 5.2 indicates that indel sites in all five length groups had higher average of solvent accessibility values than the loops with a significant P value of 0. The result showed that indel sites were more exposed to the protein surface than the loops. Table 5.2 Solvent accessibility of indel sites and loops. 1  5  10  15  20  indel – avg. solvent accessibility  59.53  60.36  60.03  58.46  60.50  loop – avg. solvent accessibility t test value p value  49.26 98.74 0  47.62 85.06 0  47.55 47.53 0  44.64 32.38 0  44.21 25.87 0  Minimal indel length  Solvent accessibility was measured by (the number of water molecules in contact with a residue) multiplied by 10 or (residue water exposed surface in Angstrom)^2, according to the DSSP program. The higher the solvent accessibility value, the more exposed the residue is.  134  5.3.6  Proximity of indels and protein domains Previous studies have suggested the roles of indels in modification of protein functions  and interactions. Thus, it is possible to anticipate that such indel-directed modifications may occur in the proximity of protein functional or structural domains. To determine the percentage of indel sites that overlapped with at least one protein domain, we have calculated the relative distance between each indel site and a domain on all 22,103 PDB proteins. The average length of domains was 151.4 amino-acid long, with a minimal length of 8 and a maximal length of 1289. As shown in Table 5.3, among domains of all possible lengths, 93.66% of all indel sites overlapped with at least one domain. Among domains that were equal or less than the average length, 47.33% of all indel sites overlapped with at least one domain. Table 5.3 Percentage of indel sites that overlapped with at least one protein domain. domain length <= 1289 <= 152  # of indels overlapping with a domain 109835 55503  % 93.66% 47.33%  # of indels not overlapping with a domain 7431 61763  % 6.34% 52.67%  total # of indels 117266 117266  From a domain perspective, a total of 31,700 instances of domains have been found on the 22,103 PDB proteins. Table 5.4 indicates that among domains of all possible lengths, 45.22% of the domains overlapped with at least one indel site, and among domains that were equal or less than the average length, 25.94% of the domains overlapped with an indel. Table 5.4 Percentage of domains that overlapped with at least one indel site. domain length <= 1289 <= 152  # of domains overlapping with a indel 14336 8222  % 45.22% 25.94%  # of domains not overlapping with a indel 17364 23478  % 54.78% 74.06%  total # of domains 31700 31700  135  In addition, for each indel-overlapping domain, we have calculated the faction of the number of proteins with such a domain overlapped with an indel to the total number of proteins where the domain was present. Table 5.5 shows the top 20 over-represented indel-overlapping domains with P-values determined by the Fisher’s exact test. Several enzymatic domains such as peroxidase, nitric oxide synthase, and catalase are among the top 20 list, and therefore the result has suggested some possible functional roles of indels, participating in the modification of enzymatic activity of those proteins. Overall the results have indicated that a great number of indel sites overlapped with the locations of protein domains, therefore it is possible to hypothesize that some of such indel sites are associated with the change of protein functions through domain modifications in evolution.  136  Table 5.5 Top 20 domains that overlapped with indels.  Domain ID (Pfam) PF00141 PF00565 PF02876 PF02898 PF00199 PF00232 PF00502 PF01327 PF00896 PF00490 PF01742 PF00022 PF00274 PF00503 PF00224 PF03414  PF00217 PF00343 PF00113 PF00162  Domain Name Peroxidase Staphylococcal nuclease homologue Staphylococcal/Streptococcal toxin, beta-grasp domain Nitric oxide synthase, oxygenase domain Catalase Glycosyl hydrolase family 1 Phycobilisome protein Polypeptide deformylase Phosphorylase family 2 Delta-aminolevulinic acid dehydratase Clostridial neurotoxin zinc protease Actin Fructose-bisphosphate aldolase class-I G-protein alpha subunit Pyruvate kinase, barrel domain Glycosyltransferase family 6 ATP:guanido phosphotransferase, Cterminal catalytic domain Carbohydrate phosphorylase Enolase, C-terminal TIM barrel domain Phosphoglycerate kinase  # of proteins with the domain overlapping with an indel 88  total # of proteins with the domain 88  domain fraction 1  P-value (one tail) from Fisher's exact test 1.85E-26  44  44  1  1.42E-13  33  33  1  2.33E-10  31 30 30 29 24 23  31 30 30 29 24 23  1 1 1 1 1 1  8.94E-10 1.75E-09 1.75E-09 3.43E-09 9.92E-08 1.94E-07  22  22  1  3.81E-07  21 20  21 20  1 1  7.45E-07 1.46E-06  20 20  20 20  1 1  1.46E-06 1.46E-06  17 16  17 16  1 1  1.10E-05 2.15E-05  15 15  15 15  1 1  4.21E-05 4.21E-05  14 13  14 13  1 1  8.24E-05 1.61E-04  5.4 Conclusion We presented Indel PDB (Insertion/Deletion Protein Data Bank), a free web-based resource that contains information on structural insertions and deletions in proteins that have been derived from alignments of closely related sequence. The developed Indel PDB resource aims to facilitate bioinformatics analysis of 1-, 2- and 3-dimensional features of indel sites and their roles in protein essentiality, protein-protein interactions, homology modeling and drug design.  137  The analysis of the database content demonstrated that indel sites had certain bias of amino acid usage and that indel tended to occur on solvent exposed areas of proteins. In addition, it has been shown that protein indels possessed distinguishable secondary structure composition where loops, turns and bends were the most abundant structural features followed by alpha-helix and beta-sheet containing fragments. It has been demonstrated that indel length distribution could be accurately described by Weibull function. Moreover, a great number of indel sites have overlapped with locations of protein domains, and the result suggests a possible association between indel occurrences and modifications of protein function. We anticipate that further applications of Indel PDB in conjunction of protein domain and drug databases will enable identification of novel indel-based drug binding sites for computer-aided drug discovery. Indel PDB is freely available over the internet on the World Wide Web [19].  138  5.5 References 1. 2. 3. 4. 5. 6. 7.  8. 9.  10. 11. 12. 13. 14.  15. 16. 17. 18. 19. 20. 21.  Thorne JL: Models of protein sequence evolution and their applications. Curr Opin Genet Dev 2000, 10(6):602-605. Yang Z, Goldman N, Friday A: Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol Biol Evol 1994, 11(2):316-324. Whelan S, Lio P, Goldman N: Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet 2001, 17(5):262-272. Cherkasov A, Lee SJ, Nandan D, Reiner NE: Large-scale survey for potentially targetable indels in bacterial and protozoan proteins. Proteins 2006, 62(2):371-380. Chan SK, Hsing M, Hormozdiari F, Cherkasov A: Relationship between insertion/deletion (indel) frequency of proteins and essentiality. BMC Bioinformatics 2007, 8:227. Wagner A: How the global structure of protein interaction networks evolves. Proc Biol Sci 2003, 270(1514):457-466. Nandan D, Cherkasov A, Sabouti R, Yi T, Reiner NE: Molecular cloning, biochemical and structural analysis of elongation factor-1 alpha from Leishmania donovani: comparison with the mammalian homologue. Biochem Biophys Res Commun 2003, 302(4):646-652. Cherkasov A, Nandan D, Reiner NE: Selective targeting of indel-inferred differences in spatial structures of highly homologous proteins. Proteins 2005, 58(4):950-954. Nandan D, Lopez M, Ban F, Huang M, Li Y, Reiner NE, Cherkasov A: Indel-based targeting of essential proteins in human pathogens that have close host orthologue(s): discovery of selective inhibitors for Leishmania donovani elongation factor-1alpha. Proteins 2007, 67(1):53-64. Gu X, Li WH: The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol Evol 1995, 40(4):464-473. Qian B, Goldstein RA: Distribution of Indel lengths. Proteins 2001, 45(1):102-104. Pascarella S, Argos P: Analysis of insertions/deletions in protein structures. J Mol Biol 1992, 224(2):461-471. Sibanda BL, Thornton JM: Accommodating sequence changes in beta-hairpins in proteins. J Mol Biol 1993, 229(2):428-447. Fechteler T, Dengler U, Schomburg D: Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria. J Mol Biol 1995, 253(1):114-131. Benner SA, Cohen MA, Gonnet GH: Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol 1993, 229(4):1065-1082. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242. Michalsky E, Goede A, Preissner R: Loops In Proteins (LIP)--a comprehensive loop database for homology modelling. Protein Eng 2003, 16(12):979-985. Espadaler J, Fernandez-Fuentes N, Hermoso A, Querol E, Aviles FX, Sternberg MJ, Oliva B: ArchDB: automated protein loop classification as a tool for structural genomics. Nucleic Acids Res 2004, 32(Database issue):D185-188. Indel PDB [http://www.cnbi2.ca/indel/] Protein Data Bank (PDB) [http://www.rcsb.org] The DSSP software and database [http://swift.cmbi.ru.nl/gv/dssp/] 139  22. 23. 24.  25. 26. 27.  Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577-2637. Coles S: An introduction to statistical modeling of extreme values. London: SpringerVerlag; 2001. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247251. HMMER: biosequence analysis using profile hidden Markov models [http://hmmer.janelia.org/] Jmol: an open-source Java viewer for chemical structures in 3D [http://www.jmol.org/] Petsko GA, Ringe D: Protein Structure and Function. Waltham, MA: New Science Press; 2004.  140  6 The identification of a novel pyruvate kinase inhibitor for methicillin-resistant Staphylococcus aureus (MRSA252) via an indel-based targeting approach5 6.1 Introduction The majority of the current antibiotics act on bacteria-specific targets to avoid essential "housekeeping" proteins that have homologues in hosts [1]. Such specificity ensures the absence of host toxicity, but, at the same time, may be a contributing factor to ever growing antibiotic resistance among human pathogens. In response to antibiotics action, bacteria have developed numerous counter-attack mechanisms [2], thus the effectiveness of conventional antimicrobials against resistant bacteria such as methicillin-resistant Staphylococcus aureus (MRSA) is diminishing at a rapid pace [3]. Therefore, in order to combat bacterial infection more effectively, it is critical to initiate new antibiotics development programs that utilize modern integrative knowledge of cellular processes in bacteria and their pathogenesis. In particular, recent advancement in high-throughput experiments has enabled the detection of protein-protein interactions in a genome-wide scale. Techniques such as yeast twohybrid assay and affinity purification followed by mass spectrometry have been used to identify protein interactions in a number of organisms including Escherichia coli [4] and Saccharomyces cerevisiae [5]. Such data generated to date on whole-organisms’ interactomes [6] have shifted conventional thinking on cellular processes from protein-by-protein consideration to more coherent network perspectives [7]. Our own recent study of protein interaction network (PIN) of MRSA252 organism study has demonstrated that defining the architecture of PINs within microbial pathogens can 5  A version of this chapter will be submitted for publication. Axerio-Cilies P, Hsing M, Zoraghi R, See R, Jiang J, Kaur S, Lian T, Jackson L, Gong H, Swayze R, Amandoron E, Cherkasov A. The identification of a novel pyruvate kinase inhibitor for methicillin-resistant Staphylococcus aureus (MRSA252) via an indel-based targeting approach. 141  lead to advances in medical treatment of infectious diseases. We anticipate that targeting hub proteins that are essential for network integrity will result in the development of new antibiotic agents that can help address the problem of rising antibiotic resistance. The later assumption is based on an observation that hub proteins are less prone to mutations [8], therefore when targeted they are less likely to undergo resistant changes. However, many pathogen hub proteins would not be typically considered as viable drug targets because of the presence of human homologues. To overcome this, we have developed a strategy of targeting hub proteins based on unique compound-binding sites induced by sizable insertion and deletions (indels), which are identified from sequence alignments compared to host homologues [9-11]. Based on the results of our previous studies we were motivated to apply an indel-based drug discovery platform towards novel drug targets identified in MRSA252. Candidate bacterial targets of MRSA252 were selected based on the following criteria: 1) hub proteins that are highly connected in the protein interaction network (PIN), 2) essential proteins derived from the experimental data, 3) proteins without any paralog, 4) proteins with sizable indels in comparison to human homologues, 5) indel regions defined by a cavity which can be targeted by a small molecule, and 6) assay feasibility for testing candidate compounds. Based on the criteria mentioned above, pyruvate kinase (PK) from MRSA252 was among the list of potential targets that were identified. The PK enzyme from MRSA252 is known to be a critical hub in protein networks and is responsible for catalyzing the final step of glycolysis, which involves the transfer of the phosphoryl group of phosphoenolpyruvate (PEP) to ADP to produce pyruvate and ATP [12].  According to a recent large-scale protein  interaction experiment in MRSA252 [13], 243 interacting partners were identified for PK suggesting its importance in the PIN. Pyruvate kinase was identified among the top 1% of hub 142  proteins and was found to be essential to MRSA252 growth by both antisense and knockout experiments. One of the major differences between MRSA252 and human PK is a 6 amino-acid insertion region in humans, not present in MRSA252. The protein sequence alignments among bacterial species and humans are shown in Figure 6.1. The 6 amino-acid insertion region in human PK corresponded to two surface-exposed hydrogen bonded turn loops shown in Figure 6.2. We confirmed that the presence of this indel region between MRSA252 and human PK offers structural differences that can be exploited for small molecule binding. Moreover, the potential targeting within this indel site is enhanced by the fact that the 6 amino acid deletion in MRSA252 PK is conserved in other gram-positive and gram-negative bacterial species (Figure 6.1), making selective compounds viable for broad-spectrum anti-infective drugs in the future.  Figure 6.1 A multiple sequence alignment showing the indel region (shown as gaps) for pyruvate kinase (PK) from Bacillus stearothermophilus, Staphylococcus aureus, Streptococcus pneumoniae, Clostridium difficile, Escherichia coli, Salmonella typhimurium, Chlamydia trachomatis, and Homo sapiens.  [+] denotes gram-positive bacteria and [-] denotes gram-negative bacteria. The amino acid insertion in human PK is shown to be the SDPILY fragment. Figure 6.1 was generated by ClustalX version 2.0. [14]  143  Figure 6.2 Structure alignment and superposition of human PK (shown in yellow) with MRSA252 PK (shown in blue).  The two hydrogen bonded turn loop insertion from human PK represented as orange spheres blocks the solvent assessable pocket shown in pink alpha spheres. This cavity is not present in human PK as a result of this 6 amino acid insertion, therefore the corresponding deletion region in MRSA PK (pink alpha spheres) could be exploited as a potential site for small molecule binding. Figure 6.2 was generated by MOE version 2007.09 [15].  By using this indel-based drug discovery approach, we aimed to identify small molecules that would have the ability to preferentially bind to the deletion site MRSA252 PK without affecting its human counterpart. In this work, we report the successful identification of a low micro-molar inhibitor of MRSA252 PK by implementing an "in-house" indel-based pipeline, which includes in-silico protein modeling, computer-aided drug design strategies and experimental testing for discovery of potent selective compounds. Identification of these  144  compounds suggests that PINs are viable means to identify highly connected hub proteins that have a potential of being effective antimicrobial targets.  6.2 Methods The indel-based pipeline that was implemented for the identification of potent and selective MRSA252 PK inhibitors is shown in Figure 6.3 and the in-silico screening cascade is illustrated in Figure 6.4. Each step involved is described in more detail in the subsequent sections. Figure 6.3 Path for the identification of selective MRSA PK inhibitors.  145  Figure 6.4 The in silico screening cascade used in this study for the discovery of selective inhibitors of MRSA252 PK.  First in silico screening is described within the blue boxes and the second in silico screening is described within the blue and orange boxes.  6.2.1  Homology modeling The crystal structure of pyruvate kinase from Bacillus Stearothermophilis (PDB code:  2E28) [12] with a sequence identity of 62% was identified as a suitable template for homology modeling of MRSA252 PK by BLASTp [16]. The sequence alignment was re-aligned using a global sequence alignment program, GGSEARCH (version 35.03 [17]). A homology model of MRSA252 was built using Modeller9v4 [18] and optimized by the OPLSAA force field energy minimization implemented by GROMACS [19]. The crystal structure (PDB code: 1T5A) for a human PK was obtained from PDB [20] and used for a subsequent structural superimposition with the MRSA PK homology model.  146  6.2.2  ZINC chemical database preparation Approximately 3.3 million commercially available structures in a SD format from the  ZINC-5.0 compound library [21] was imported into a molecular database using Molecular Operating Environment (MOE) version 2007.08. [18] These structures were ‘washed’ by removing all inorganic components and coordinating all ionizable groups with a 7.0 pH condition. Next, this database was energy minimized using a MMFF94x force field and exported in a SD format for the use by the GLIDE [22, 23] and eHiTS [24, 25] docking programs.  6.2.3  Binding site prediction The ‘Site Finder’ functionality in the MOE package was used to identify suitable  binding pockets around the indel sites on the optimized MRSA252 PK homology model. The Site Finder is a geometric method, which is a generalization of the convex hulls method for calculating possible binding sites in receptors from 3D atomic coordinates. Alpha spheres were consequently generated to identify a binding site (close to the indel site in Figure 6.1), which was subsequently used for molecular docking. [26]  6.2.4  Molecular docking The Maestro suite [27] was used to prepare the MRSA252 PK structure for molecular  docking by adding and adjusting hydrogen atoms. Protein and ligand charges have been assigned by the OPLS molecular mechanics force-field method [28], and a binding pocket with 10 Å radius was defined around the indel site. The ZINC chemical database (3.3 million compounds) that was previously prepared was then docked into the indel bearing cavity by the GLIDE with standard precision (SP) protocol for the first screen. Molecular structures above the cut-off of -6.0 GLIDE score were then re-docked into the same indel site using the eHiTS-  147  docking program (Figure 6.4). All calculations were done by a Linux cluster of 224 Intel Xeon dual processors at 2.66 GHz. Graphic manipulations and analysis of the docking experiments were performed by the Maestro Graphical User Interface, version 85207, for Linux operating systems.  6.2.5  Scoring functions and predictors The following scores were calculated for each compound selected after the GLIDE and  eHiTS screens.  6.2.5.1 Binding affinity (pKi) The MOE estimated pKi was calculated for each ligand using the scoring.svl script available through the Scientific Vector Language (SVL) exchange service [29]. The Gasteiger partial charges [30] were calculated for the structures that passed the docking cutoff. The estimated pKi for these structures were calculated by choosing the dock_pKi descriptor with default settings for the molecular database.  6.2.5.2 Ligand explorer (LigX) LigX module [31, 32] was used to account for the receptor/ligand flexibility where the receptor atoms near the ligand produced by GLIDE/eHiTS docking programs and the ligand itself are succumb to the MMFF94x force field energy minimization as implemented in MOE. LigX calculations were applied to all the structures that passed the docking cutoffs. Tether restraints were set to 10 for the receptor to discourage gross movement and all other parameters were set to default. Subsequently, pKi binding affinity score was calculated after energy minimization to predict the ligands that showed the best binding characteristics defined mainly by the energy of hydrogen bonds and hydrophobic interactions.  148  6.2.5.3 Root mean standard deviation (RMSD) The root mean standard deviation (RMSD) was calculated between the GLIDE poses and the eHiTS poses to evaluate the docking consistency and reliability and to establish the most probable binding pose for a given ligand. The RMSD values were calculated for each ligand using the rmsd_mol.svl script in MOE. The RMSD was added to the voting “consensus” system, where ligands were assigned a 1 vote for a RMSD value of  3.0 Å and a 0 vote for >  3.0 Å.  (The following sections 6.2.6, 6.3.2 and 6.3.4 on experimental assays describe work conducted by the collaborators from the PREPARE laboratory. More details in the ‘co-authorship statement’)  6.2.6  PK enzymatic assays and cell-based screens Each compound identified from the final in silico screening was purchased and tested in  a PK enzyme colorimetric assay. Any compounds showing 50% or more inhibition of MRSA PK activity in the primary screen were then screened against the 4 human isoenzymes (M1, M2, L and R) in the secondary assay to determine selectivity of the compounds. Compounds that inhibited MRSA PK without any effect on any of the human PK isoenzymes (inhibition or activation) were considered as selective compounds, and their IC50s were determined in the subsequent step. Compounds that were potent and selective were tested for inhibition of S. aureus growth using a 24-hour minimum inhibitory concentration (MIC) assay. Finally, human HeLa cells were used to evaluate whether PK inhibitory compounds were cytotoxic by using CellTiter 96® AQueous One Solution Cell Proliferation kit (Promega). Those compounds that showed potency in enzymatic- and- cell-based assays were re-screened by an in silico similarity  149  search method to find analogs that exhibit a higher potency and improvement in S. aureus cell penetration and growth inhibition.  6.2.7  Similarity searching (SS) The 2D similarity for the compounds were calculated by means of the Tanimoto  coefficient by using the MACCS fingerprints [33, 34] as implemented in MOE [15]. The fingerprints contain 166 bits indicating the presence of specified structural fragments in the molecular graph representation. The similarity searching was applied to the ZINC-8.0 compound library, which is a ~8.7 million compound library database [21]. A Tc similarity measure was calculated for each ligand based on formula (1) [35]. Tc cutoffs were implemented based on a 75% similarity or a Tc value of 0.75 between the target and database molecule.  (1) Tc (A, B) =  NAB ____________ NA + NB - NAB  where NAB is the number of bits that are set on in both fingerprints, and NA and NB refer to the number of bits that are set on in A and B, respectively. Identical fingerprints have a Tc value of 1, whereas non-overlapping fingerprints are given a Tc value of 0 [35].  (The following sections 6.2.8 and 6.3.6 on MD simulation describe work conducted by Peter Axerio-Cilies. More details in the ‘co-authorship statement’)  6.2.8  Molecular dynamics (MD) and binding free energy calculations The MD simulations and binding free energy calculations were performed on the  MRSA252 PK-ligand complexes using the GROMACS 4.0 simulation package [36]. The final docking configurations for the protein-ligand complexes were used as starting points for the 150  MD simulations. The force field considered in the GROMACS calculations was the gmx force field [37, 38] with the SPC explicit model of solvation [39, 40]. A 10 Å water-filled cubic box was generated around the MRSA252 PK-ligand complex model and the corresponding proteinligand models were neutralized by the adding 17 Na+ counterions. The freely available PRODRG tool was used to generate the topologies for compounds #63 and #130 identified by the in silico virtual screening. [41] The pre-minimization was done on the system using ~5000 iterations of the steepest descent algorithm with an energy convergence fixed on 1000 kJ mol-1 nm-1. The particle mesh Ewald (PME) method was used to treat the electrostatic term [42, 43]. Short-range and long-range forces were cutoff at 0.9 nm and 1.4 nm respectively and periodic boundary conditions were applied. The system was subjected to position-restrained MD using the Berendsen's temperature and pressure coupling method for 100 ps at 300K with a time step of 1 fs [44]. The MD trajectories obtained from the final 550 ps MD run were used to study the dynamic behavior and estimate the free-binding interaction energies for compounds #63 and #130. The linear interaction energy (g_lie) GROMACS module [45-47] was used to estimate the free-binding interaction energy for compounds #63 and #130 using the stabilized MD trajectories from 150 to 350 ps with the following parameters: alpha=0.33, beta=0.18.  6.3 Results The homology modeling of MRSA252 PK based on the Bacillus Stearothermophilis crystal structure was acceptable with a backbone RMSD of 0.46 Å compared to the template. The superposition of MRSA252 and human PK models revealed that the 6 amino-acid insertion (positions 100 to 105 of the insertion fragment: SDPILY) in human PK blocks a part of otherwise solvent-exposed surface on MRSA252 PK as shown in Figure 6.5. The sequence alignment between MRSA252 and human PK is shown in Figure 6.1. The indel site was visually examined to determine if it constitutes a suitable compound binding pocket that is 151  unique to the MRSA protein. The relationship and physical distance between the potential compound binding site and any nearby active site were considered. As a result of the analysis, we anticipated that this assessable area within the deletion cavity could be considered an appealing target area for selective small molecule binding. Figure 6.5 Structural model of N-terminal indel site for MRSA PK.  Insertion sequences (in orange) conceal a pocket in the human protein that is exposed in the MRSA252 protein. Figure was generated from the MOE molecular package [15]. This particular indel region was not the only indel-defined area for MRSA252 and human PKs. Another insertion region was defined by a PVQEAW fragment on human PK between positions 477 to 482. However, this PVQEAW fragment did not form a well-defined cavity for small molecule binding at its corresponding deletion area in MRSA252 PK; therefore it was not considered further in the study. The identification of the indel binding site for the MRSA252 PK protein structure motivated us to use in silico high-throughput screening to identify compounds that would preferentially bind to this indel defined region. As it has been shown previously, indel sites can be associated with essential protein functions and protein-protein interactions [48, 49];  152  therefore, it would be reasonable to expect that these inhibitors would be selective towards MRSA252 PK and affect its enzymatic function.  6.3.1  In silico screens For the first screen we considered ~3.3 million commercially available compounds from  the ZINC 5.0 compound library (Figure 6.4). These compounds were optimized (see ‘Methods’ section for more details) and docked into MRSA252 PK indel pocket using the GLIDE 1.64.2.5 docking module [22, 23]; cut-off filters were implemented based on GLIDE scores of -6.0, which corresponded to 20% of the top “hits” from the first screen. This resulted in a dataset of ~125000 molecules which were then re-docked into the same indel binding cavity using the electronic high-throughput screening (eHiTS) docking software program [24, 25]. At the next stage, the 125000 compound dataset was complied and analyzed to assess the consistency and reliability of the docking poses between the GLIDE and eHiTS docking programs. For this, the root mean standard deviation (RMSD) was calculated to establish the most probable binding pose for a given ligand with respect to the binding cavity. In addition to the two scores from GLIDE and eHiTS, additional scoring functions or predictors were calculated for this dataset to remove compounds that exhibited poor binding characteristics towards MRSA252 PK. The third predictor calculated for each ligand was the dock_pKi parameter computable by the MOE scoring svl script. The dock_pKi is an estimated pKi value defined mainly by the energy of hydrogen bonds and hydrophobic interactions. The fourth predictor used was the binding affinity score generated by the ligand explorer (LigX) module in MOE, where the receptor atoms near the ligand and the ligand itself are succumb to energy minimization and ultimately scored based on the dock_pKi parameter. Based on the four scores (GLIDE docking, eHiTS docking, pKi and LigX scores), we implemented a voting scheme where, for each score, a value of 1 was assigned to the ligands in 153  the top 10%, while the other 90% of the ligands were given a value of 0. Then the four binary votes were added together for each ligand and, in turn, the ligands were ranked by their voting “consensus” score, where a score of four was the maximum optimal value. In addition, the fifth predictor RMSD (between GLIDE pose and eHiTS pose) was added to the voting system, where ligands were assigned a ‘1’ vote for a RMSD value of  3.0 Å and a ‘0’ vote for > 3.0 Å.  Amongst, ~125000 ligands in the dataset, 500 with the highest vote totals (compounds voted 4 or 5 times) were visually inspected and selected according to the following criteria: 1) GLIDE and eHiTS predicted a similar pose, 2) lipophilic moieties were not exposed into the solvent phase, and 3) ligands showed good overall binding characteristics. This method refined and reduced the number of candidates to a total of 130 compounds, which were recognized as “virtual hits for experimental testing.  6.3.2  Experimental testing on the virtual compound hits Out of the 130 virtual hits, 28 compounds were identified with > 50 % inhibitory effects  on MRSA PK enzymatic activity at 50 micro-molar. More specifically, compound #63 (an acyl hydrazone-based compound as shown in Table 6.1) has showed promising results on enzymatic inhibition and selectivity against the MRSA PK as illustrated in Figure 6.6. However, the cellbased assays demonstrated that compound #63 has only achieved inhibition less than 10% on the growth of S. aureus under 100 µM. At such a high concentration, toxicity to human cells has been observed as shown in Figure 6.6. We hypothesize that the low cell growth inhibition by compound #63 might be associated with its lack of ability to penetrate the cell membrane of MRSA252. Hence, we performed a similarity search to find analogues of compound #63, seeking better MRSA inhibition and selectivity.  154  Table 6.1 The two acyl hydrazone-based compounds shown in the table demonstrated selective in-vitro inhibitory activity towards MRSA252 PK. Compounds GLIDE score eHiTS score Predicted pKi Interaction Energy via MD (KJ/mol) IC50 (µ µM) #63  -7.5  -3.75  7.41  -12.458  0.9  #130  -6.76  -2.55  7.77  -15.601  0.1  Table also shows GLIDE-SP and eHiTS docking score values with respect to the indel-bearing cavity on MRSA252-PK, the predicted pKi values for both inhibitors, the protein-ligand interaction energies estimated by the MD simulations and finally the IC50 values for both compounds obtained from the in-vitro assays. The chemical structures of the compounds are not shown due to patent pending.  Figure 6.6 (Top) Compounds #63 and #130 (10µM) selectivity inhibit MRSA PK enzymatic activity; (Middle) Effect of compounds #63 and #130 on S. aureus growth; (Bottom) Toxicity evaluation of compounds #63 and #130 in human HeLa cells.  155  6.3.3  Similarity search for analogues of compound #63 The success of compound #63 at inhibiting MRSA252 PK propelled us to find  analogues that would exhibit a better binding affinity towards MRSA252 PK with an improvement in cell penetration ability. For this, a 2D similarity search was done on compound #63. The similarity search was computed against version 8.0 of the ZINC compound library, which contained 8.7 million compounds. This ZINC-8.0 compound library was a larger database than the ZINC-5.0 compound library used in the previous docking stage and thus provided us with a more diverse dataset for the similarity search. A Tc similarity score was generated by means of the Tanimoto coefficient by using the MACCS fingerprints as implemented in MOE. Cut-off values of  0.75 (75%) were used, and this generated a dataset  of 106,724 analogous compounds, which were then all docked into the deletion site of MRSA252 PK by using the GLIDE 1.64.2.5 and eHiTS-v6.2 docking packages. Three other predictors as described in Methods section were calculated, which included pKi, LigX, and RMSD between GLIDE and eHiTS poses. In an effort to improve cell penetration propensity, all 106,724 compounds were exposed to a Bacterial-Metabolite-Likeness (BML) model filter [50], which uses inductive Quantitative Structure Activity Relationship (QSAR) descriptors to distinguish between antimicrobial compounds, conventional drugs, and drug-like substances. The BML model is an effective tool at discriminating between these chemical species and thus helped us to identify metabolite-like compounds, which are known to have a greater capacity to penetrate into the cell [50]. All the compounds were classified and ranked based on the five in-silico scoring functions plus their resemblance to bacterial metabolites based on the BML model. 40  156  analogues were selected and tested in a biochemical assay screen on MRSA252 and human PKs.  6.3.4  Experimental testing on compound #130 (analogues of compound #63) The experimental results show that the compound analogue #130, inhibited MRSA PK  enzymatic activity more potently than compound #63 at 10 µM (Figure 6.6). Compound #130 with IC50 of 0.10 µM has been shown to be a better inhibitor than compound #63 with IC50 of 0.90 µM. Compound #130 meets the >100-fold selectivity requirement needed between homologous human and bacterial protein targets. When compounds #63 and #130 were evaluated for inhibition of S. aureus growth, compound 130 was shown to inhibit 35% growth at 25 µM (Figure 6.6). No further inhibition of growth was noted beyond 25 µM. However, we believe that the lack of further inhibition may be attributed to the instability of compound #130 over a 24 hour period (data not shown). Furthermore, Figure 6.6 shows compound #130 had no growth inhibitory effects on human HeLa cells when tested up to 500 µM. Overall, the experimental results show that compound #130 exhibited a vast improvement in both binding inhibition and cell penetration propensity and consequently can be considered as a lead compound for future optimization.  6.3.5  Compound #130 binding within the deletion site of MRSA252 PK via docking Modeling of the most interesting ligand was undertaken based on the generated docking  poses from GLIDE and eHiTS docking modules. Compound #130 was chosen since it was found to be the most potent compound and was selective towards MRSA252 PK based on biochemical and cell-based assays. The identified hydrogen bonding patterns for compound #130 based on the docking results were attributed to the hydroxyl group of the benzene moiety, 157  which interacted with Arg-2 and the carbonyl group that interacted with Arg-328. The NH group of the indole moiety is found to interact with Ile-60. In addition, the indole moiety was positioned in a lipophilic pocket primarily formed by Ile-60, Pro-418, Ala-407. The structural overlap assessment for the binding modes of compound #130 was calculated via a RMSD calculation between the GLIDE pose and the eHiTS pose. The RMSD value of 2.5 Å was calculated for compound #130, and this suggested that the poses were consistent between the two docking protocols. Thus a reasonable configuration was obtained with respect to the indel binding site. The compound was not able to be docked into human PK model because of the presence of the SDPILY insertion fragment which blocks the otherwise exposed cavity.  6.3.6  Conformational analysis of compound #130 via a molecular dynamics simulation To further validate our models we preformed a MD run on the protein-ligand complex  based on the optimized model of MRSA252 PK and the final binding mode of compound #130 docked at the deletion site of MRSA252 PK. MD simulations were conducted to assess the stability and the conformational changes of compound #130, which exhibited substantial activity towards MRSA252 PK. Moreover, MD was used to assess the interactions between protein and ligand, and its estimated interaction energy was calculated. The MD simulation for compound #130 was conducted until the equilibrium state was reached, and this resulted in a stable MD trajectory as shown in Figure 6.7. This result reflected the mobility of the ligand and confirmed its conformational position within the deletion site of MRSA252 PK. Its thermodynamic stability shown in Figure 6.7 provided evidence to support that compound #130 has the ability to bind to the deletion area on MRSA252 PK.  158  Figure 6.7 Change of RMSD values for compound #130 bound to the indel site of MRSA252 PK.  The binding orientation of compound #130 from the MD simulation showed the indole moiety occupying the same lipophilic pocket formed by Ile-60, Pro-418 and Ala-407 as predicted by the docking results. Hydrogen bonding (dashed arrows) was predicted between the NH group from the naphthalene moiety with Asn-29. This Asn-29 residue seems to be involved in an interaction with the carbonyl group of compound #130. The NH group adjacent to the carboxyl group and the hydroxyl group from the benzene moiety was predicted to interact with Gln-417. Finally, the binding orientation of the ligand showed Arg-2 was likely to be involved in hydrogen bond anchoring with the carbonyl and formed a hydrophobic contact with the benzene ring, which stabilized the ligand into the indel-bearing pocket. In general, the predicted binding mode of compound #130 displayed robust hydrogen bonding interactions as well as lipophilic contacts, which are the characteristic patterns needed for a ligand to be an effective inhibitor. 159  To further validate our model we preformed a binding free energy analysis of the MRSA252 PK deletion site with compounds #63 and #130. Based on the averaged energies for the complexes, compound #130 was shown to be the better binder with a binding free energy of -15.601 KJ/mol compared to compound #63 which had predicted value of -12.458 KJ/mol. Typically, the binding free energies correspond well with the docking scores but in this case, the result contradicted the docking results where compound #63 was found to be the better inhibitor. Nevertheless based on the stability of compound #130 and its interactions, we expect that compound #130 to be a more effective inhibiter to MRSA252 PK. Table 6.1 summarizes the results obtained from the MD simulations, the docking and the IC50s observed for compounds #63 and #130.  6.4 Discussion The indel-based approach has allowed us to selectively target highly conserved hub proteins in MRSA252. In this study, we revealed that the human PK indel contains a 6 aminoacid SDPILY insertion and that this fragment was recognized structurally as two hydrogen bonded turn loops. Previous findings have shown that indels of this nature are located at the protein surface interface and have been mainly structurally characterized as variable loop regions [51]. Interestedly, this finding is consistent with what we see for the human PK insertion, and we observed that this insertion blocks a portion of an otherwise exposed surface cavity of MRSA252 PK, which has been selectively targeted via small molecules. Although both the enzyme- and cell-based assays have demonstrated the inhibitory ability and selectivity of compound #130, we do not have direct evidence that this active compound is binding to the deletion region on MRSA252 PK. We can only propose that the predicted binding mode, relative energy of the compound, and the MD simulation of the protein-ligand complex all favor the binding of the compound at the indel site. A further co160  crystallization MRSA252 PK with this potential active compound could be helpful in confirming our findings. Although compound #130 is one of the best hits from our in silico screen, it does not achieve 100% cell growth inhibition in MRSA252. Based on our results, we expect that compound #130 is suitable for lead optimization to enhance its effectiveness, diminish its toxicity, and increase its absorption. For instance, it might be possible to increase the binding potency through hydrophobic interactions by adding hydrophobic groups to the indole moiety, which lies predominantly in a lipophilic cavity. It is interesting how the binding of a compound at the indel site can potentially affect the enzymatic activity of PK. Upon close examination of the indel site, we noticed that it is not in close proximity of the active site. We observed a similar situation in the previous example from Leishmania donovani EF-1α [11] where the binding of the indel site affected the enzymatic function of Leishmania donovani EF-1α despite the fact that the indel site and the active site were not in close proximity with each other. These two observations along with our previous experience of protein modeling and simulations have led us to hypothesize a correlated binding effect between two distant sites. Based on our results, compound #130 is a low micro-molar antagonist, which can be considered as viable candidate lead compound for drug development against MRSA252 PK activity. The potential use for this compound is enhanced by the fact that the 6 amino-acid deletion is conserved among other gram-positive and negative bacterial species such as Bacilius subtilis, Escherichia coli, Streptococcus pneumoniae, Salmonella typhimurium and therefore would potentially make the lead compounds ideal for broad-spectrum anti-infective treatments.  161  6.5 Conclusion We have implemented a procedure for selectively targeting of pathogen proteins that possess indel regions with respect to their human homologues. This method is based on the idea that protein fragments from amino-acid insertions in human proteins can block a portion of otherwise solvent assessable surface in pathogen homologues. We proposed that pathogen proteins that have a large enough cavity within a deletion site can potentially be a viable region for small molecule binding. Based on the indel-based pipeline presented here, we were successful at identifying a low micro-molar selective inhibitor against MRSA252 PK. This discovery validated that essential and highly conserved pathogenic hub proteins can potentially be attractive targets for small molecule inhibitors with the proper utilization of the indel sites. The uniqueness of this finding is that these types of pathogen proteins would not typically be considered as viable drug targets because of the presence of their human homologues. We expect that targeting indel regions of essential and hub-like pathogen proteins may help in addressing the growing problem of antibiotic resistance amongst pathogenic proteins. Several approaches are currently in progress to advance the indel-based platform. These include the structural resolution of MRSA252 PK experimentally, the co-crystallization of MRSA252 PK with the lead compound, and the search for more potent MRSA252 PK inhibitors.  162  6.6 References 1. 2. 3. 4.  5.  6.  7. 8. 9.  10.  11. 12. 13.  14. 15.  Cherkasov A, Lee SJ, Nandan D, Reiner NE: Large-scale survey for potentially targetable indels in bacterial and protozoan proteins. Proteins 2006, 62(2):371-380. Wright GD: Mechanisms of resistance to antibiotics. Curr Opin Chem Biol 2003, 7(5):563-569. Rybak MJ: Resistance to antimicrobial agents: an update. Pharmacotherapy 2004, 24(12 Pt 2):203S-215S. Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433(7025):531-537. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32(Database issue):D452-455. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101-113. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science 2002, 296(5568):750-752. Nandan D, Lopez M, Ban F, Huang M, Li Y, Reiner NE, Cherkasov A: Indel-based targeting of essential proteins in human pathogens that have close host orthologue(s): discovery of selective inhibitors for Leishmania donovani elongation factor-1alpha. Proteins 2007, 67(1):53-64. Nandan D, Cherkasov A, Sabouti R, Yi T, Reiner NE: Molecular cloning, biochemical and structural analysis of elongation factor-1 alpha from Leishmania donovani: comparison with the mammalian homologue. Biochem Biophys Res Commun 2003, 302(4):646-652. Li YY, Jones SJ, Cherkasov A: Selective targeting of indel-inferred differences in spatial structures of homologous proteins. J Bioinform Comput Biol 2006, 4(2):403-414. Suzuki K, Ito S, Shimizu-Ibuka A, Sakai H: Crystal structure of pyruvate kinase from Geobacillus stearothermophilus. J Biochem 2008, 144(3):305-312. Cherkasov A, Hsing M, Zoraghi R, Foster JL, See R, Stoynov N, Jiang J, Kaur S, Lian T, Jackson L, Gong H, Swayze R, Amandoron E, Dao P, Hormozdiari F, Sahinalp C, Santos-Filho O, Axerio-Cilies P, Byler K, McMaster RW, Brunham R, Finlay B, N. RE: Mapping of the protein interaction network in methicillin resistant Staphylococcus aureus (MRSA-252) identifies novel high quality drug targets. (In Preparation) 2009. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23(21):2947-2948. Chemical Computing Group: Molecular Operating Environment (MOE) [http://www.chemcomp.com/software.htm] 163  16. 17. 18. 19. 20. 21. 22.  23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37.  Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402. Pearson WR: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 2000, 132:185-219. Eswar N, Webb B, Marti-Renom MA, Madhusudhan MS, Eramian D, Shen MY, Pieper U, Sali A: Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics 2006, Chapter 5:Unit 5 6. GROMACS: GROMACS [http://www.gromacs.org/] Dombrauckas JD, Santarsiero BD, Mesecar AD: Structural basis for tumor pyruvate kinase M2 allosteric regulation and catalysis. Biochemistry 2005, 44(27):9417-9429. Irwin JJ, Shoichet BK: ZINC--a free database of commercially available compounds for virtual screening. J Chem Inf Model 2005, 45(1):177-182. Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, Repasky MP, Knoll EH, Shelley M, Perry JK, Shaw DE, Francis P, Shenkin PS: Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 2004, 47(7):1739-1749. Schrodinger: GLIDE [http://www.schrodinger.com/ProductDescription.php?mID=6&sID=6&cID=0] Zsoldos Z, Reid D, Simon A, Sadjad SB, Johnson AP: eHiTS: a new fast, exhaustive flexible ligand docking system. J Mol Graph Model 2007, 26(1):198-212. Zsoldos Z, Reid D, Simon A, Sadjad BS, Johnson AP: eHiTS: an innovative approach to the docking and scoring function problems. Curr Protein Pept Sci 2006, 7(5):421435. Edelsbrunner H, Facello M, Fu R, Liang J: Measuring proteins and voids in proteins. In: Proceedings of the 28th Hawaii International Conference on Systems Science: 1995; Washington, DC: IEEE Computer Society; 1995: 256-264. Schrodinger: Maestro [http://www.schrodinger.com/ProductDescription.php?mID=6&sID=15&cID=0] Rizzo RC, Jorgensen WL: OPLS All-atom model for amines: resolution of the amine hydration problem. J Am Chem 1999, 121:4827-4836. Chemical Computing Group: SVL exchange [http://svl.chemcomp.com/index.php] Gasteiger J, Marsili M: Iterative partial equalization of orbital electronegativity - a rapid access to atomic charges. Tetrahedron 1980, 36(22):3219-3228. Allen MP, Tildesley DJ: Computer simulation of liquids: Oxford University Press; 1987. Labute P: The generalized Born/volume integral implicit solvent model: estimation of the free energy of hydration using London dispersion instead of atomic surface area. J Comput Chem 2008, 29(10):1693-1698. Durant JL, Leland BA, Henry DR, Nourse JG: Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 2002, 42(6):1273-1280. Sheridan RP, Miller MD, Underwood DJ, Kearsley SK: Chemical Similarity Using Geometric Atom Pair Descriptors. J Chem Inf Model 1996, 36(1):128-136. Johnson MA, Maggiora GM: Concepts and Applications of Molecular Similarity. New York: Wiley; 1990. Hess B, Kutzner C, van der Spoel D, Lindahl E: GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. J Chem Theor Comp 2008, 4(3):435-447. Van Der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJ: GROMACS: fast, flexible, and free. J Comput Chem 2005, 26(16):1701-1718. 164  38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51.  Lindahl E, Hess B, van Der Spoel D: GROMACS 3.0: A package for molecular simulation and trajectory analysis. J Mol Mod 2001, 7:306-317. Berendsen HJ, Grigera J, Straatsma T: The missing term in effective pair potentials. J Phys Chem 1987, 91(24):6269-6271. Hess B, van der Vegt NF: Hydration thermodynamic properties of amino acid analogues: a systematic comparison of biomolecular force fields and water models. J Phys Chem B 2006, 110(35):17616-17626. Schuttelkopf AW, van Aalten DM: PRODRG: a tool for high-throughput crystallography of protein-ligand complexes. Acta Crystallogr D Biol Crystallogr 2004, 60(Pt 8):1355-1363. Darden T, York D, Pedersen L: Particle mesh Ewald: An N [center-dot] log(N) method for Ewald sums in large systems. J Chem Phys 1993, 98:10089-10092. Essmann U, Perera L, Berkowitz ML, Darden T, Lee H, Pedersen L: A smooth particle mesh Ewald method. J Chem Phys 1995, 103:8577-8592. Berendsen HJC, Postma JPM, DiNola A, Haak JR: Molecular dynamics with coupling to an external bath. J Chem Phys 1984, 81:3684-3690. Almlof M, Brandsdal BO, Aqvist J: Binding affinity prediction with different force fields: examination of the linear interaction energy method. J Comput Chem 2004, 25(10):1242-1254. Aqvist J, Luzhkov VB, Brandsdal BO: Ligand binding affinities from MD simulations. Acc Chem Res 2002, 35(6):358-365. Aqvist J, Marelius J: The linear interaction energy method for predicting ligand binding free energies. Comb Chem High Throughput Screen 2001, 4(8):613-626. Chan SK, Hsing M, Hormozdiari F, Cherkasov A: Relationship between insertion/deletion (indel) frequency of proteins and essentiality. BMC Bioinformatics 2007, 8:227. Hormozdiari F, Salari R, Hsing M, Schonhuth A, Chan SK, Sahinalp SC, Cherkasov A: The effect of insertions and deletions on wirings in protein-protein interaction networks: a large-scale study. J Comput Biol 2009, 16(2):159-167. Cherkasov A: Can 'Bacterial-Metabolite-Likeness' model improve odds of 'in silico' antibiotic discovery? J Chem Inf Model 2006, 46(3):1214-1222. Fechteler T, Dengler U, Schomburg D: Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria. J Mol Biol 1995, 253(1):114-131.  165  7 Concluding chapter The thesis describes a bioinformatics framework that has been developed to support and service the large-scale, multi-year PREPARE project focused on identification of novel drug targets in S. aureus and development of novel antibiotic candidates. This project encompassed a number of research centres working in the field of systems biology and integrated an entire drug discovery pipeline starting from large-scale protein interaction mapping and resulting in development of small molecule inhibitors. Such a challenging task demanded using an array of modern bioinformatics platforms and methods, and it required developing novel bioinformatics and cheminformatics approaches and tools. The presented platform includes innovative methods for antibacterial target identification, combining new types of protein interaction network analysis, an indel-targeting approach and a large-scale cheminformatics pipeline for in silico drug screening. Our strategy is different from the historic approach, in which antibiotics were derived from natural compounds [1], brute-force in vitro screens, or whole-cell screening where drug mode of actions are often unknown [2, 3]. Instead, with the availability of large-scale genomics and proteomics data we were able to apply a top-down approach to identify antibiotic targets more systematically and efficiently, by considering not only the essentiality of proteins but also their overall connectivity characteristics in the protein interaction network for MRSA252. Our analysis of the MRSA252 PIN enabled the identification of several highlyinteracting hub proteins, which are suitable for drug targets because of their essential functions in the bacterial cell [4] and lower rate of mutations [5]. Furthermore, since hub proteins mediate many protein-protein interactions at the core of the network, we anticipate targeting hub proteins will cause dramatic effects to the cell where the loss of functions will be difficult  166  to recover even through alternative pathways by bacteria. For instance, the identified compound 130 not only inhibited the enzymatic activity of pyruvate kinase in MRSA252, but also reduced the number of its protein interactions by 80%, demonstrated in a GST pull-down experiment. Due to all these unique properties associated with hub proteins, it is not surprising that the ‘hub’ concept has been the focus of several recent protein interaction papers [6-13], many of which have emphasized on the scale-free network property, protein complex and modularity, and functional annotation in other species. Yet, to our knowledge, this PhD thesis along with its corresponding chapters on protein interaction network analyses is one of the first research works that attempt to utilize the hub concept for enhancing antibiotic discovery against drug resistance problems in current pathogens. The analysis of the MRSA252 PIN led to the identification of 60 hub proteins, some of which are involved in hundreds of protein-protein interactions and have been confirmed essential to the survival of MRSA252 cells by the antisense or knockout experiments. This list of protein hubs has been further filtered based on the availability of structural templates and presence of paralogs for the identification of potential antibacterial drug targets against S. aureus. Interestingly, the MRSA252 protein interaction data determined by the PREPARE lab provided a unique opportunity for evaluating the network properties of known antibacterial drug targets. Much to our surprise, out of 70 known drug targets only 8 could be categorized as hub proteins, while the rest have a low number of protein-protein interactions. Upon further examination of the known drug targets, we learned that the majority of the targets have been chosen, not based on their protein connectivity, but based on their confirmed essentiality, low protein abundance, and low sequence similarity to human proteins. The constraint on low or no human sequence similarity has limited the list of potential drug targets in the past, and as we  167  demonstrated, the sequence conservation issue with human proteins can be resolved by targeting specific indel sites with small compounds. Our indel analysis has indicated that some well-conserved bacterial proteins have unique structural differences caused by sequence insertions or deletions compared to their human homologs, and such indel sites can be effectively utilized for specific compound binding. Subsequent studies have further established indels as important drug target markers because of their associations with protein essentiality and network rewiring capability. Thus, we have characterized over 117,266 unique indel sites in the Indel PDB database, offering the research community a fast and user-friendly web resource to identify potential indel-based drug targets in other pathogens, in addition to the MRSA proteins studied in the thesis. The validation of our developed drug discovery pipeline based on the hub and indel concepts was demonstrated by the research work on pyruvate kinase, a conserved hub protein with a distinguishable indel site between the MRSA252 and humans. Several small molecule compounds, identified from the developed in silico screening procedure, have been shown to inhibit MRSA252 cell growth with no signs of toxicity in humans by the subsequent biochemical experiments. Thus, we anticipate that it is equally feasible to apply the same drug development strategy to identify new antibiotics in other bacterial pathogens such as Helicobacter pylori [14], Campylobacter jejuni [12], Treponema pallidum [13], and Escherichia coli K12 [9, 10] where PIN data are already available. However, identifying hub proteins for drug targets in pathogens with no PIN data still remains a challenging task. In that context, several hub prediction tools have been developed in the thesis by analyzing sequence- or physicochemical-based similarities shared among highly interacting proteins in the four model organisms, Escherichia coli, Saccharomyces cerevisiae, Drosophila 168  melanogaster and Homo sapiens, by the use of machine learning methods. Several common hub properties have been identified, and they include the association of certain Gene Ontology functional terms, higher sequence conservation, more hydrophilic and higher fraction of flexible coil residues. We anticipate the hub prediction tools will assist the development novel drug targets in emerging or existing pathogens where no interaction data is yet available. Thus, on one hand, we have developed useful bioinformatics tools and provided a research framework for the next-generation antibacterial drug discovery; on the other hand, a number of important questions and challenges have been raised during the course of the thesis work: 1) In this project we have relied on the currently available protein interaction data. However, as discussed previously, two major experimental systems - yeast two-hybrid and protein complex pull-down, both have their own limitations. In particular, the protein pulldown experiments such as the one used for constructing the MRSA252 PIN are performed in an in vitro environment, and it is an open question as how an in vitro protein-protein physical interaction can be translated to an equivalent interaction in vivo. Even with the isotope labelling technology as the one employed by the PREPARE lab to remove non-specific binders, two proteins that have been shown to be involved in a physical interaction in vitro does not necessarily mean they would interact in a cell due to many factors such as different compartments and timing of protein expression. It is commonly accepted that high-throughput protein interaction data have high false positive and high false negative rates [13, 15], while the overlap of different PIN datasets, even within the same species, has been lower than expected [12, 15].  169  All of these results allowed us to conclude that the currently available PIN experiments have only recovered a small portion of the entire ‘interactome’ for any given species. 2) The conducted survey of the estimated highly connected proteins demonstrated that hub proteins are not as conserved or similar across different species. Although they do share certain common characteristics, many of the PIN hubs exhibit species-specific properties. One such example is the surface area of proteins: while hubs in E. coli and S. cerevisiae are larger than non-hubs, hubs in D. melanogaster, H. sapiens and S, aureus are smaller than lessconnected proteins. Moreover, we demonstrated that none of the thought-to-be-common hub properties such as sequence conservation or number of constituent protein domains are capable of sufficiently distinguishing hubs from non-hubs. Thus, a rather elaborate machine learning method such as boosting trees was required for combining hundreds of protein descriptors together in order to achieve a reasonable prediction performance result. We do anticipate the difficulty of hub prediction will be lessened when more protein interaction data become available in other species. 3) Even though the indel technology and protein interaction data have opened up a list of novel potential drug targets, validating such targets is a challenging task. For instance, before we were able to identify one indel-bearing drug target - pyruvate kinase - we have modeled three-dimensional structures for as many as 34 MRSA proteins. After close examination of the protein structural surfaces, we established that most of the proteins were not suitable for selective compound binding due to undesired properties of their indel sites (shallow cavities, small differences between MRSA252 and human proteins, or improper location).  170  Several potentially attractive indel targets had to be removed from consideration due to their unconfirmed protein essentiality or unfeasibility of their enzyme assays. Because of the challenges of identifying suitable indel targets, we have explored an alternative target selection strategy focused on unique bacterial proteins. To our surprise, bacteria-specific hub proteins are more frequent than we originally expected; therefore, we have identified several highly-connected hub proteins in MRSA252 that are only conserved among other bacterial species but not in humans, which makes such bacteria-unique hubs good target candidates. The corresponding study is currently underway.  7.1 Future directions The biggest challenge in treating infectious diseases is drug-resistant bacterial strains emerging at a rate much faster than new drugs developed from current antibiotic research. Historically, many antibiotic drugs were derived from natural compounds produced from bacteria such as Actinomycetes [1]. Such antimicrobial compounds were preferred by the pharmaceutical industry because of their ability to penetrate bacterial cell membrane, providing good starting structural templates for further modifications through medicinal chemistry [1]. However, because of this dependence on natural compounds, antibiotics are limited to targeting few bacterial processes, and only two new classes of drugs were introduced in the past 30 years [16]. It is critical that the next generation of antibiotics, with completely different modes of action than existing drugs, to be developed rapidly through alternative approaches utilizing current advancement in genomics and proteomics. From our examination of existing antibacterial drug targets, it is observed that only few are hub proteins. It is perhaps because antimicrobial compounds produced naturally from bacteria have avoided targeting their own hub proteins that are conserved across bacterial  171  species. Hence, we anticipate the presented hub prediction and indel identification tools will provide the research community an alternative strategy for identifying drug targets for antibiotic development, including essential hub proteins that are previously avoided due to their sequence conservation in humans. Furthermore, we anticipate the value of in silico screenings will be much appreciated and utilized in future antibiotic research programs. As demonstrated in the thesis, virtual screening of millions of compounds can be performed rapidly with sufficient computational power, and compound hits can be tested quickly through a well-established enzymatic assay. Combining with a bioinformatics-based drug target selection, the process of antibiotic discovery leading towards drug leads can be shortened to a few months, in contrast to years of research based on traditional screening approaches deployed in the industry [17]. It is possible to further speed up the virtual screening process by taking advantage of recent development of chemical-protein databases such as STITCH [18] where chemical-toprotein interactions are reported based on results from thousands of high-throughput compound screens in numerous species. Hence, one can quickly query such databases with a list of potential drug targets (e.g. a list of bacterial hub proteins) and search any existing compounds or drugs with inhibitory effects. The activity of those compound hits can be confirmed by either protein- or cell-based assays. Another strategy to counter bacterial resistance is the simultaneous use of multiple antibiotics, each targeting a different hub protein in the cell. Such an antibiotic ‘cocktail’ has been shown effective in treating patients infected with drug-resistant bacteria [19]. By integrating metabolic and signalling pathways with protein-protein interaction maps, critical hubs in each major cellular process such as translation, energy production and metabolism can be identified. Targeting a number of such hub proteins simultaneously could break the entire 172  bacterial protein interaction network into several disconnected pieces. It will be nearly impossible for the bacteria to counter such a dramatic network perturbation through a single mutation. In addition, several other proteomic advancements can assist antibiotic research. One time-limiting step in drug target validation is the determination of protein essentiality for the survival of bacterial cells. Large-scale gene knock-out experiments such as the ones performed in Staphylococcus aureus [20], Escherichia coli [21] and Mycobacterium tuberculosis [22] will greatly speed up antibiotic development by providing a list of essential bacterial proteins. On anther hand, recent success in whole-cell simulation in micro-organisms like Escherichia coli, Mycoplasma genitalium and Bacillus subtilis [23] will provide a valuable testing platform for simulating possible outcomes from inhibiting target proteins in a virtual bacterial cell. It is even possible to anticipate alternative pathways that bacteria can utilize to recover the loss of functions, and to identify secondary protein targets to counter such a bacterial resistance. The ongoing efforts on determining protein three-dimensional structures by experimental methods such as X-ray crystallography and nuclear magnetic resonance will continue assisting structural modeling of potential drug target proteins. Currently, over 55,000 protein structures are available from the PDB database [24], providing valuable structural templates for in silico compound screening against bacterial targets. With all of the mentioned biological data and computational tools, we anticipate that the time and cost for antibiotic development will be reduced due to more accurate and effective drug target identification and drug lead determination. The thesis illustrates an example where bioinformatics  and  cheminformatics  approaches  were  combined  through  academic  collaborations, leading towards practical results in the form of antibacterial compounds that have the potential to be optimized for animal studies and clinical trials supported by 173  pharmaceutical companies. We anticipate this study will encourage future antibiotic research projects in academia and facilitate collaborations with the industry, reviving the interest in antibiotic development much needed for treating infectious diseases worldwide.  174  7.2 References 1. 2. 3. 4. 5. 6. 7. 8. 9.  10.  11.  12. 13. 14. 15.  Overbye KM, Barrett JF: Antibiotics: where did we go wrong? Drug Discov Today 2005, 10(1):45-52. Chan PF, Holmnes DJ, Payne DJ: Finding the Gems Using Genomic Discovery: Antibacterial Drug Discovery Strategies - The Successes and Challenges. Drug Discovery Today: Therapeutic Strategies 2004, 1(4):519-527. Barker JJ: Antibacterioal Drug Discovery and Structure-Based Design. Drug Discovery Today 2006, 11(9/10):391-404. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411(6833):41-42. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science 2002, 296(5568):750-752. Ekman D, Light S, Bjorklund AK, Elofsson A: What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol 2006, 7(6):R45. Manna B, Bhattacharya T, Kahali B, Ghosh TC: Evolutionary constraints on hub and non-hub proteins in human protein interaction network: insight from protein connectivity and intrinsic disorder. Gene 2009, 434(1-2):50-55. Vallabhajosyula RR, Chakravarti D, Lutfeali S, Ray A, Raval A: Identifying hubs in protein interaction networks. PLoS One 2009, 4(4):e5344. Arifuzzaman M, Maeda M, Itoh A, Nishikata K, Takita C, Saito R, Ara T, Nakahigashi K, Huang HC, Hirai A, Tsuzuki K, Nakamura S, Altaf-Ul-Amin M, Oshima T, Baba T, Yamamoto N, Kawamura T, Ioka-Nakamichi T, Kitagawa M, Tomita M, Kanaya S, Wada C, Mori H: Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res 2006, 16(5):686-691. Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433(7025):531-537. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636. Parrish JR, Yu J, Liu G, Hines JA, Chan JE, Mangiola BA, Zhang H, Pacifico S, Fotouhi F, DiRita VJ, Ideker T, Andrews P, Finley RL, Jr.: A proteome-wide protein interaction map for Campylobacter jejuni. Genome Biol 2007, 8(7):R130. Titz B, Rajagopala SV, Goll J, Hauser R, McKevitt MT, Palzkill T, Uetz P: The binary protein interactome of Treponema pallidum--the syphilis spirochete. PLoS One 2008, 3(5):e2292. Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, Chemama Y, Labigne A, Legrain P: The protein-protein interaction map of Helicobacter pylori. Nature 2001, 409(6817):211-215. Huang H, Jedynak BM, Bader JS: Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Comput Biol 2007, 3(11):e214. 175  16. 17. 18. 19. 20.  21. 22. 23. 24.  Infectious Diseases Society of America: Bad bugs, no drugs as antibiotic discovery stagnates. A public health crisis brews. [http://www.idsociety.org/badbugsnodrugs.html] Payne DJ, Gwynn MN, Holmes DJ, Pompliano DL: Drugs for bad bugs: confronting the challenges of antibacterial discovery. Nat Rev Drug Discov 2007, 6(1):29-40. Kuhn M, von Mering C, Campillos M, Jensen LJ, Bork P: STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res 2008, 36(Database issue):D684688. Huyser C, Loskutoffb NM, Singha R, Lindequec K, Beckerd P: A novel antibiotic cocktail for eliminating bacteria in human semen. International Congress Series 2004, 1271:205-209. Forsyth RA, Haselbeck RJ, Ohlsen KL, Yamamoto RT, Xu H, Trawick JD, Wall D, Wang L, Brown-Driver V, Froelich JM, C KG, King P, McCarthy M, Malone C, Misiner B, Robbins D, Tan Z, Zhu Zy ZY, Carr G, Mosca DA, Zamudio C, Foulkes JG, Zyskind JW: A genome-wide strategy for the identification of essential genes in Staphylococcus aureus. Mol Microbiol 2002, 43(6):1387-1400. Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, Datsenko KA, Tomita M, Wanner BL, Mori H: Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol Syst Biol 2006, 2:2006 0008. Sassetti CM, Boyd DH, Rubin EJ: Genes required for mycobacterial growth defined by high density mutagenesis. Mol Microbiol 2003, 48(1):77-84. Ishii N, Robert M, Nakayama Y, Kanai A, Tomita M: Toward large-scale modeling of the microbial cell for computer simulation. J Biotechnol 2004, 113(1-3):281-294. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242.  176  Appendix  Appendix A - Supplementary materials Table A1. A list of interacting MRSA proteins. [http://www.cnbi2.ca/pin/Supplementary_Table_1.xls] Table A2. A list of conventional antimicrobial targets identified from the literature. [http://www.cnbi2.ca/pin/Supplementary_Table_2.xls] Table A3. Sequence composition results, comparing indel sites and their indel-containing proteins. [http://www.biomedcentral.com/content/supplementary/1471-2105-9-293-s1.xls] Table A4. Sequence composition results, comparing indel sites and loops. [http://www.biomedcentral.com/content/supplementary/1471-2105-9-293-s2.xls] Table A5. Secondary structure composition results, comparing indel sites and their indelcontaining proteins. [http://www.biomedcentral.com/content/supplementary/1471-2105-9-293-s3.xls]  177  Appendix B - List of publications 1.  Hsing M, Byler K, Cherkasov A. Predicting highly-connected hubs in protein interaction networks by QSAR and biological data descriptors. Bioinformation 2009, 4:164-168.  2.  Byler K, Hsing M, Cherkasov A. The Use of sequence-derived QSPR descriptors for predicting highly connected proteins (hubs) in protein-protein interactions. QSAR & Combinatorial Science 2009, 28:509-519.  3.  Hormozdiari F, Salari R, Hsing M, Schonhuth A, Chan SK, Sahinalp SC, Cherkasov A. The effect of insertion and deletion (indels) on wirings in protein-protein interaction networks: a large-scale study. Journal of Computational Biology 2009, 16:159-167.  4.  Hsing M, Byler KG, Cherkasov A. The use of Gene Ontology terms for predicting highlyconnected ‘hub’ nodes in protein-protein interaction networks. BMC Systems Biology 2008, 2:80.  5.  Hsing M, Cherkasov A. Indel PDB: a database of structural insertions and deletions derived from sequence alignments of closely related proteins. BMC Bioinformatics 2008, 9:293.  6.  Chan SK, Hsing M, Hormozdiari F, Cherkasov A. Relationship between insertion/deletion (indel) frequency of proteins and essentiality. BMC Bioinformatics 2007, 8:227.  7.  Hsing M, Cherkasov A: Bioinformatics tools for researching protein interaction networks. In: Genome Research Advances. Edited by Bellini AP. Hauppauge, New York: Nova Science Publishers; 2007.  8.  Hsing M, Cherkasov A. Integration of biological data with semantic networks. Current Bioinformatics 2006, 1(3):273-290.  178  

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0068978/manifest

Comment

Related Items