UBC Theses and Dissertations

The development of bioinformatic and chemoinformatic approaches for structure-activity modelling and… Fjell, Christopher David 2009


THE DEVELOPMENT OF BIOINFORMATIC AND CHEMOINFORMATIC APPROACHES FOR STRUCTURE-ACTIVITY MODELLING AND DISCOVERY OF ANTIMICROBIAL PEPTIDES

by

CHRISTOPHER DAVID FJELL

B.A.Sc., The University of British Columbia, 1990
M.Sc., The University of British Columbia, 1995

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Experimental Medicine)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
January 2009
© Christopher David Fjell, 2009

Abstract

The emergence of pathogens resistant to available drug therapies is a pressing global health problem. Antimicrobial peptides (AMPs) may form new therapeutics to counter these pathogens. AMPs are key components in the mammalian innate immune system and are responsible for both direct killing and immunomodulatory effects in host defense against pathogenic organisms. This thesis describes computational methods for the identification of novel natural and synthetic AMPs.

A bioinformatic resource was constructed for classification and discovery of gene-coded AMPs, consisting of a database of clustered known AMPs and a set of hidden Markov models (HMMs). One set of 146 clusters was based on the mature peptide sequence, and one set of 40 clusters was based on propeptide sequence. The bovine genome was analyzed using the AMPer resources, and 27 of the 34 known bovine AMPs were identified with high confidence and up to 69 AMPs were predicted to be novel peptides. One novel cathelicidin AMP was experimentally verified as up-regulated in response to infection in bovine intestinal tissue.

A chemoinformatic analysis was performed to model the antibacterial activity of short synthetic peptides. Using high-throughput screening data for the activities of over 1400 peptides of diverse sequence, quantitative structure-activity relation (QSAR) models were created using artificial neural networks and physical characteristics of the peptide that included three-dimensional atomic structure. The models were used to predict the activity of a set of approximately 100,000 peptide sequence variants. After ranking the predicted activity, the models were shown to be very accurate. When 200 peptides were synthesized and screened using four levels of expected activity, 94% of the top 50 peptides expected to have the highest level of activity were found to be highly active. Several promising candidates were synthesized with high quality and tested against several multi-antibiotic-resistant pathogens including clinical strains of Pseudomonas aeruginosa, Staphylococcus aureus, Enterococcus faecalis and Escherichia coli. These peptides were found to be highly active against these pathogens as determined by minimal inhibitory concentration; this serves as independent confirmation of the effectiveness of high-throughput screening and in silico analysis for identifying peptide antibiotic drug leads.
Table of Contents

Abstract
Table of Contents
List of tables
List of figures
List of abbreviations
Acknowledgements
Dedication
Co-Authorship Statement

Chapter 1: Introduction
  1.1 Thesis overview
  1.2 Gene-coded antimicrobial peptides
    1.2.1 Classes of antimicrobial peptides
    1.2.2 Mechanisms of antibacterial activity
    1.2.3 Antimicrobial peptides in regulation of innate immunity
    1.2.4 Bioinformatics for discovery of novel AMPs
  1.3 Synthetic antimicrobial peptides
    1.3.1 Quantitative structure-activity relationships
    1.3.2 Previous QSAR analysis of antimicrobial peptides
    1.3.3 Limitations of current studies
    1.3.4 'Inductive' QSAR descriptors
  1.4 Thesis objectives and hypotheses
    1.4.1 Gene-coded antimicrobial peptides
    1.4.2 Identification of synthetic AMPs by QSAR analysis and machine learning
    1.4.3 Key assumptions
  1.5 References

Chapter 2: Prediction of gene-coded antimicrobial peptides by bioinformatic analysis
  2.1 Introduction
  2.2 Results and discussion
    2.2.1 Database of antimicrobial peptides
    2.2.2 Clustering of the AMPs
    2.2.3 HMM modelling
    2.2.4 Iterative enhancement of clusters
    2.2.5 Accuracy of models
    2.2.6 On-line tools
  2.3 Conclusion
  2.4 Methods
    2.4.1 Initial peptide set
    2.4.2 Clustering
    2.4.3 Iterative enhancement of clusters
    2.4.4 Accuracy of models
    2.4.5 On-line tools
  2.5 Web resources
  2.6 Supplementary material
  2.7 References

Chapter 3: Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome
  3.1 Introduction
  3.2 Results and discussion
    3.2.1 Identification of host defense peptides
    3.2.2 Selection of predicted AMPs for confirmation
    3.2.3 Analysis of predicted novel AMP gene expression
    3.2.4 Absence of alpha-defensins
  3.3 Conclusions
  3.4 Methods and materials
    3.4.1 Set of known antimicrobial peptides
    3.4.2 Creation of AMPer
    3.4.3 Bovine genomic and EST sequences
    3.4.4 Prediction of AMPs in ESTs
    3.4.5 Prediction of AMPs in genomic sequence
    3.4.6 Comparison of predicted AMPs to known AMPs
    3.4.7 Identification of novel AMPs
    3.4.8 Pairwise comparison of known AMPs to bovine sequence
    3.4.9 Analysis of AMP gene expression
    3.4.10 Informatics
  3.5 Acknowledgments
  3.6 Web resources
  3.7 Supplementary table
  3.8 References

Chapter 4: Identification of antibacterial peptides by chemoinformatics and machine learning
  Introduction
  4.1 Results and discussion
    4.1.1 Effect of control antibacterial peptide on bacteria
    4.1.2 Peptide data sets for model training
    4.1.3 Calculation of peptide activity
    4.1.4 QSAR descriptors and model building
    4.1.5 Validation of model performance
    4.1.6 Independent model testing
    4.1.7 Antibacterial activity of predicted peptides against resistant strains
  4.2 Conclusions
  4.3 Materials and methods
    4.3.1 Electron microscopy of AMPs
    4.3.2 Peptide sequences for model training
    4.3.3 Peptide SPOT synthesis and screening
    4.3.4 Calculation of peptide activity
    4.3.5 QSAR descriptors
    4.3.6 Training and validation data sets
    4.3.7 Test data set
    4.3.8 Model training
    4.3.9 In silico ranking and selection of test peptides
    4.3.10 Minimal inhibitory concentration (MIC) determination
  4.4 Acknowledgements
  4.5 Supplementary tables
  4.6 References

Chapter 5: Genetic algorithms for identification of potent antimicrobial peptides
  5.1 Introduction
  5.2 Results and discussion
    5.2.1 Evaluation of peptide fitness score
    5.2.2 Initial population of peptides
    5.2.3 Iterative improvement in peptides
    5.2.4 Evolution of amino acid composition
    5.2.5 Assessment of genetic algorithm performance
  5.3 Conclusions
  5.4 Materials and methods
    5.4.1 Creation of classification models for highly active peptides
    5.4.2 Evaluation of peptide fitness
    5.4.3 Initial peptide population
    5.4.4 Evolution of peptide sequences
    5.4.5 Evaluation of peptide antibacterial activity
  5.5 References

Chapter 6: Summary and conclusions
  6.1 Summary
    6.1.1 Gene-coded antimicrobial peptides
    6.1.2 Synthetic antimicrobial peptides
  6.2 Conclusions and future directions
  6.3 References

Appendix A: Epilogue

List of tables

Table 1.1. Classes of antimicrobial peptides.
Table 2.1. Effect of similarity threshold on clustering of mature peptides.
Table 2.2. Changing consensus sequence with iteration.
Table 2.3. Properties of largest mature peptide clusters.
Table 2.4. Properties of largest propeptide clusters.
Table 2.5. Performance of AMP identification method determined by cross-validation for mature peptide clusters.
Table 2.6. Performance of AMP identification method determined by cross-validation for propeptide clusters.
Table 3.1. Numbers of predicted antimicrobial peptides.
Table 3.2. Known bovine antimicrobial peptides.
Table 3.3. Identification of known bovine host defense peptides in dbEST sequences.
Table 3.4. Bovine primers used for qRT-PCR.
Table 3.5. Most significant matches of AMPer model 146 to bovine genome sequence.
Table 4.1. Activities of peptides from training sets and quartiles in the 100,000 test set.
Table 4.2. Predicted activity rank and experimental Rel.IC50 values for selected test peptides.
Table 4.3. Activities against multi-resistant Superbugs of selected peptides predicted through the QSAR analysis compared to the peptide Bac2A.
Table 4.4. Description of all QSAR descriptors used in analysis of peptide activities.
Table 4.5. Candidate peptides for confirmation of QSAR predictions.
Table 5.1. Initial peptide population for simulation A.
Table 5.2. Initial peptide population for simulation B.
Table 5.3. Final peptide population, simulation A.
Table 5.4. Final peptide population, simulation B.

List of figures

Figure 1.1. Phylogenetic tree of known antimicrobial peptides.
Figure 1.2. Barrel-stave model of antimicrobial peptide activity.
Figure 1.3. Toroidal model of antimicrobial peptide activity.
Figure 1.4. Carpet model of antimicrobial peptide activity.
Figure 1.5. Intracellular targets of antibacterial peptides.
Figure 1.6. Structure of an alpha-helical antimicrobial peptide.
Figure 1.7. Structure of an artificial neural network.
Figure 2.1. Creation of initial AMPer clusters.
Figure 2.2. Summary of iterative enrichment of clusters.
Figure 2.3. The relationship between E-value and model length.
Figure 2.4. Relationship between mature peptides and propeptides from the same protein for largest mature peptide clusters.
Figure 2.5. Relationship between mature peptides and propeptides from the same protein for largest propeptide clusters.
Figure 2.6. Relationship between mature peptides and propeptides from the same protein clusters of all sizes.
Figure 3.1. Multiple alignment of predicted host defense peptide DBEST_AMP_397.
Figure 3.2. Multiple alignment of predicted host-defense peptide DBEST_AMP_416.
Figure 3.3. Gel image of qRT-PCR for putative AMPs in blood and tissue.
Figure 3.4. Gel image of putative AMPs following Taq-man re-amplification.
Figure 4.1. General workflow for QSAR modelling of antimicrobial peptides.
Figure 4.2. Transmission electron micrographs of cross-sections of Pseudomonas aeruginosa.
Figure 4.3. SEM micrographs of Pseudomonas aeruginosa.
Figure 4.4. Electron micrographs of cross-sections of Pseudomonas aeruginosa.
Figure 4.5. Distribution of amino acids in training and test sets.
Figure 4.6. Luminescence profile of a dilution series for three peptides.
Figure 4.7. Structure of an artificial neural network.
Figure 4.8. The receiver operating characteristics curves for the three data sets.
Figure 4.9. Activity and properties of training and test peptides.
Figure 5.1. Examples of peptide evolution.
Figure 5.2. Evolution of peptide scores.
Figure 5.3. Initial evolution of peptide scores for simulation B.
Figure 5.4. Evolution of peptide amino acid composition.

List of abbreviations

AMP       antimicrobial peptide
ANN       artificial neural network
AROC      area under the receiver operating characteristics curve
Bac2A     synthetic peptide analogue of bovine bactenecin
BLAST     Basic Local Alignment Search Tool
cDNA      complementary DNA produced by reverse transcription of messenger RNA
EST       expressed sequence tag
GA        genetic algorithm
HFIP      hexafluoroisopropanol
HMM       hidden Markov model
HPLC      high pressure liquid chromatography
IC50      inhibitory concentration 50%
MIC       minimal inhibitory concentration, the lowest concentration of an agent that inhibits bacterial growth
mRNA      messenger RNA
PCR       polymerase chain reaction
qRT-PCR   quantitative reverse transcription polymerase chain reaction
QSAR      quantitative structure-activity relation
ROC       receiver operating characteristics
SDS       sodium lauryl sulfate

Acknowledgements

I wish to thank my supervisor Dr. Artem Cherkasov for the opportunity of working with him on this research and for his supervision and encouragement throughout. I am grateful to the members of the Hancock lab (Centre for Microbial Diseases and Immunity Research) at UBC for the exceptional opportunity of working with them and the amazing data they are able to generate, especially to Drs. Bob Hancock, Kai Hilpert, and Håvard Jenssen. I wish to thank the Canadian Institutes for Health Research for a Doctoral Research Award, and the University of British Columbia for a University Graduate Fellowship. I wish to thank the past and present members of my thesis committee: Drs. Zakaria Hmama, Steven Jones, Boris Sobolev and Michael Grigg. I am grateful to the past and present members of the Cherkasov lab for the opportunity of learning from them in the diverse areas of their research, including Michael Hsing, Ken Byler, Fuqiang Ban, Osvaldo Santos-Filho, Meilan Huang, Simon Chan, and Evgeny Maksakov.
Finally, I thank my wife Donna and my girls (Meghan and Kristen) for their love and for putting up with my absence during many weekends spent at work.

Dedication

This work is dedicated to the memory of my brother, Brent Fjell, whose young life was lost when the antibiotics didn't work.

Co-Authorship Statement

My role in this work was theoretical and computational analysis; I did not perform any of the laboratory measurements. For each chapter, I did the following. I performed all work described in Chapter 2. In Chapter 3, I performed all the computational work except for PCR primer design. All laboratory experimental work was done at the Hancock lab (UBC) or Vaccine and Infectious Disease Organization (University of Saskatchewan). In Chapter 4, I performed all computational work except for one script used to calculate some QSAR descriptors, and the randomized amino acid distribution and peptides for Set A, Set B and the set of 100,000 virtual peptides (selected by Kai Hilpert). In Chapter 5, I performed all work except for laboratory measurement of antibacterial activity.

Håvard Jenssen performed PCR verification of predicted gene products (both primer design and assay) on bovine RNA samples provided by Patrick Fries, Palok Aich, and Phillip Griebel (University of Saskatchewan). Kai Hilpert supplied randomized peptide sequences and antibacterial activity data for peptides synthesized on cellulose using the luminescence assay. Håvard Jenssen also provided some antibacterial luminescence assay data. Håvard Jenssen performed MIC assays on peptides. Warren Cheung contributed a script to calculate some QSAR descriptors. Nelly Panté contributed electron micrographs of bacteria. Robert E.W. Hancock was involved in discussions of most aspects of this work, especially regarding laboratory experiments that were done.
Artem Cherkasov was involved in discussions of all aspects of this work, especially on aspects of QSAR and choice of analysis techniques.

Chapter 1: Introduction

1.1 Thesis overview

The purpose of this thesis was to systematically study known antimicrobial peptides (AMPs), to discover new gene-coded AMP sequences, and to develop new peptide-based antibiotic leads using bioinformatic and chemoinformatic methods.

Antimicrobial peptides are produced by nearly all organisms and constitute an important part of the innate immune system. The innate immune system is the part of the immune system that is responsible for defense of the host from infection in a non-specific manner; it includes barriers to infection and immediate responses such as inflammation and recruitment of non-specific cells such as macrophages, neutrophils and dendritic cells. Hundreds of these peptides have been identified, and understanding of their role in both direct killing of pathogens and regulation of the innate immune response has recently been enhanced. Identification of additional examples of these peptides both in humans and in other organisms would serve to increase our understanding of the innate immune system and possibly lead to novel therapeutic interventions. The second chapter of this thesis describes the creation of a resource, which we call AMPer, consisting of an on-line database of peptides as well as software models for identification of antimicrobial peptides using bioinformatic analysis. In the third chapter, these resources are applied to discovery of novel peptides in the bovine genome.

Pathogens that are resistant to current antibiotics are a continuing challenge. According to the Infectious Diseases Society of America (http://www.idsociety.org), the incidence of some pathogens such as methicillin-resistant Staphylococcus aureus now exceeds 50% and resistance in other pathogens is also rapidly increasing.
For example, resistance to vancomycin and fluoroquinolone has jumped from near zero incidence to nearly 30% in the last ten years in some organisms. During this period, the number of new agents approved for use has continued to decline.

Synthetic antimicrobial peptides may be an important source of antibacterial agents to counter the continuing challenge of pathogens that have developed resistance to conventional drugs. As described in Chapter 4, previous analyses of the properties of short cationic peptides have failed to yield models that are sufficiently general and predictive of antibacterial activity for in silico screening of potential drug candidates. These efforts have been limited by both the robustness of the modelling techniques and the quantities of empirical data available. Generation of large sets of peptides of varying activity is now possible due to new high-throughput peptide synthesis and antibacterial activity assaying techniques. Chapter 4 describes how a combination of advanced chemoinformatic methods and machine learning algorithms was developed to successfully screen for peptides with high activity. Chapter 5 describes a novel method to optimize this screening process by the use of a search algorithm inspired by natural evolution.

1.2 Gene-coded antimicrobial peptides

1.2.1 Classes of antimicrobial peptides

AMPs represent a diverse class of natural peptides that form part of the innate immune system of mammals, insects, amphibians, and plants among others (see for example, Sima and Sigler, 2003a, 2003b). As reviewed by Brogden (2005), there are currently over 880 different antimicrobial peptides identified or predicted from nucleic acid sequence. These peptides can be classified as shown in Table 1.1 based on peptide characteristics that serve to contrast groups of AMPs.
These characteristics include anionic (negative) charge, peptide structure and charge (linear, cationic and α-helical), amino acid composition (enrichment for particular amino acids), peptides that form internal cross-links (disulphide bridges) and peptides that are formed as fragments of larger mature proteins.

Class: Anionic peptides
Examples:
• Maximin H5 from amphibians.
• Small anionic peptides rich in glutamic and aspartic acids from sheep, cattle and humans.
• Dermcidin from humans.

Class: Linear cationic α-helical peptides
Examples:
• Cecropins (A), andropin, moricin, ceratotoxin and melittin from insects.
• Cecropin P1 from Ascaris nematodes.
• Magainin (2), dermaseptin, bombinin, brevinin-1, esculentins and buforin II from amphibians.
• Pleurocidin from skin mucous secretions of the winter flounder.
• Seminalplasmin, BMAP, SMAP (SMAP29, ovispirin), PMAP from cattle, sheep and pigs.
• CAP18 from rabbits.
• LL37 from humans.

Class: Cationic peptides enriched for specific amino acids
Examples:
• Proline-containing peptides include abaecin from honeybees.
• Proline- and arginine-containing peptides include apidaecins from honeybees; drosocin from Drosophila; pyrrhocoricin from the European sap-sucking bug; bactenecins from cattle (Bac7), sheep, and goats; and PR-39 from pigs.
• Proline- and phenylalanine-containing peptides include prophenin from pigs.
• Glycine-containing peptides include hymenoptaecin from honeybees.
• Glycine- and proline-containing peptides include coleoptericin and holotricin from beetles.
• Tryptophan-containing peptides include indolicidin from cattle.
• Small histidine-rich salivary polypeptides, including the histatins from man and some higher primates.

Class: Anionic and cationic peptides that contain cysteine and form disulphide bonds
Examples:
• Peptides with 1 disulphide bond include brevinins.
• Peptides with 2 disulphide bonds include protegrin from pigs and tachyplesins from horseshoe crabs.
• Peptides with 3 disulphide bonds include α-defensins from humans (HNP-1, HNP-2, cryptdins), rabbits (NP-1) and rats; β-defensins from humans (HBD1, DEFB118), cattle, mice, rats, pigs, goats and poultry; and rhesus θ-defensin (RTD-1) from the rhesus monkey.
• Insect defensins (defensin A).
• SPAG11/isoform HE2C, an atypical anionic β-defensin.
• Peptides with >3 disulphide bonds include drosomycin in fruit flies and plant antifungal defensins.

Class: Anionic and cationic peptide fragments of larger proteins
Examples:
• Lactoferricin from lactoferrin.
• Casocidin I from human casein.
• Antimicrobial domains from bovine α-lactalbumin, human haemoglobin, lysozyme and ovalbumin.

Table 1.1. Classes of antimicrobial peptides. Adapted from Brogden, 2005.

The anionic antimicrobial peptides are typically found in surfactant extracts, bronchoalveolar lavage fluid and airway epithelial cells. Examples include the maximin H5 peptide from the skin of the toad Bombina maxima (Lai, et al., 2002), and the human dermcidin peptides secreted by the sweat glands (Schittek, et al., 2001).

The linear cationic α-helical peptides constitute one of the largest classes of AMPs with roughly 300 members. These include the cecropins (from hemolymph of the cecropia moth), the magainins and buforin II from amphibians, pleurocidin from skin secretions of the flounder, and LL-37 from humans.

Another subgroup is characterized by cationic peptides lacking cysteine and enriched in certain amino acids such as proline (the peptides bactenecin and prophenin), arginine (bactenecin), and phenylalanine (indolicidin). This group has roughly 44 peptides.

The largest class includes anionic and cationic peptides that contain cysteine and form disulphide bonds.
There are nearly 400 peptides in this class, including the large and diverse families of defensins. In mammals, the defensins include the alpha-defensins derived from neutrophils, the cryptdins of the small intestine, and the beta-defensins found throughout the epithelia. In addition, there are over 50 arthropod defensins and plant defensins (Brogden, 2005). The last class of antimicrobial peptides consists of those that are fragments of larger, functional proteins. These include lactoferricin (derived from lactoferrin) and casocidin I (derived from human casein). A phylogenetic tree showing peptide diversity is shown in Figure 1.1.

Figure 1.1. Phylogenetic tree of known antimicrobial peptides. One peptide was selected from each of the clusters in AMPer described in Chapter 2, and used to create a phylogenetic tree to show the relationship between peptides. Selected AMPs are labelled as space allows.

It is important to note that there is disagreement among researchers concerning what should properly be considered an antimicrobial peptide. Some groups consider that a positive charge is an essential factor and, in addition, that many peptides to which antimicrobial activity has been attributed are not antimicrobial under physiological conditions of ion concentrations, host proteases and low peptide concentrations (Jenssen, et al., 2006). Under this view, the reported activity is an artifact of antimicrobial assays that use dilute media. Where antimicrobial activity can be demonstrated, the mechanisms of action and target of peptide activity are not clear in many cases.

1.2.2  Mechanisms of antibacterial activity

There are several physical characteristics that are thought to be important for activity of antimicrobial peptides; these are closely tied to the proposed mechanisms of microbial cell killing. Antimicrobial peptides tend to be relatively short, from 12 to 100 amino acids for cationic AMPs and down to 6 amino acids for anionic AMPs.
The mechanisms and structure of cationic peptides have been most intensively studied and will be discussed here. Their charges typically range from +2 to +9. The positive charge on these peptides is believed to be key to the initial interaction with the bacterial cell target, due to attraction to negatively charged phosphate groups in the lipopolysaccharide and anionic phospholipids of Gram-negative bacteria, or to the lipoteichoic acids present on the surfaces of Gram-positive bacteria (Jenssen et al., 2006; Brogden, 2005). This initial interaction is also responsible for the selective binding of the peptides to microbial membranes and not to host cells, whose membranes are composed primarily of neutral lipids. Regardless of the ultimate target responsible for killing of the microbe, this initial interaction with the cell surface and contact with the membrane appears to be an important step for all peptides (Hancock and Rozek, 2002). Antimicrobial peptides can form a variety of secondary structures: alpha-helical, beta-sheet, loop or extended structure. Extended (unstructured) peptides may form organized structures only on binding to a lipid bilayer. For example, indolicidin is unstructured in solution and takes on a structure when bound to membrane (Rozek, et al., 2000). This flexibility has been proposed as a mechanism that allows a single peptide to interact with more than one target molecule, such as DNA, in addition to initial interactions with membrane (Hsu, et al., 2005). Some other peptides do not fall into this structural classification: some, for example, contain a combination of domains such as alpha-helix and beta-sheet (Uteng, et al., 2003). Regardless of specific secondary structure, antimicrobial peptides tend to have separated hydrophobic and hydrophilic domains that generate an amphipathic character that enhances interaction with the lipid bilayer (Yount and Yeaman, 2004; Brogden, 2005; Jenssen et al., 2006).
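As an illustration only, the two properties discussed above (net positive charge and amphipathicity) can be estimated directly from a peptide sequence. The sketch below is not the method used in this thesis: it assumes a simple residue count for charge (+1 for lysine and arginine, -1 for aspartate and glutamate, ignoring histidine and the termini) and the Kyte-Doolittle hydropathy scale with an ideal alpha-helix of 100 degrees per residue for the hydrophobic moment. The function names net_charge and hydrophobic_moment are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values; the choice of scale is an assumption of this sketch.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge(seq):
    """Approximate net charge near pH 7 from residue counts only.

    Counts +1 for K and R and -1 for D and E; histidine and the
    terminal groups are ignored (a simplifying assumption).
    """
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def hydrophobic_moment(seq, angle_deg=100.0):
    """Mean hydrophobic moment, assuming an ideal alpha-helix.

    Each residue is treated as a vector whose length is its hydropathy
    and whose direction rotates by angle_deg per residue around the helix axis.
    """
    seq = seq.upper()
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(math.radians(i * angle_deg))
                  for i, aa in enumerate(seq))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(math.radians(i * angle_deg))
                  for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)


# The short peptide IKWLKIFL (shown in Figure 1.6 of this chapter) has two lysines.
print(net_charge("IKWLKIFL"))  # 2
print(round(hydrophobic_moment("IKWLKIFL"), 2))
```

A larger hydrophobic moment indicates that hydrophobic and hydrophilic residues fall on opposite faces of the assumed helix; a uniform sequence gives a moment near zero.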
While the initial interaction relies on electrostatic attraction, subsequent steps are driven by a combination of hydrophobic and electrostatic interactions (Jenssen et al., 2006). In Gram-negative bacteria, hydrophobic interactions drive insertion of peptides into the outer membrane. In a process termed "self-promoted uptake", the first peptides to insert permeabilize the membrane to entry by other peptides. The importance of the amphipathic structures can be seen in three models of membrane disruption by antimicrobial peptides. Common to all the models, the peptides are assumed to initially aggregate parallel to the membrane surface by embedding the hydrophobic regions of the peptide into the hydrophobic lipid bilayer, as seen for example in melittin (Yang, et al., 2001). In the "barrel-stave" model (Figure 1.2), the hydrophobic portions of the peptides align with the lipid core region and the hydrophilic faces of the peptides form the interior of the pore. This manner of pore formation has been reported for the peptide alamethicin, where the peptide takes on an alpha-helical configuration that serves as the staves of the barrel-shaped pore (Brogden, 2005; Spear, 2004; Yang, et al., 2001). In a related model, the aggregate model, the peptides line the pore in an unoriented arrangement in complex with lipid micelles (Jenssen et al., 2006).

Figure 1.2. Barrel-stave model of antimicrobial peptide activity. The blue and red regions of the alpha helix represent hydrophobic and hydrophilic regions, respectively.

In contrast to the barrel-stave model, the lipid monolayer bends continuously through the pore in the toroidal model (Figure 1.3), with the pore centre lined by both the peptide and the lipid head groups. This pore structure was determined for the antimicrobial peptides magainin, melittin and protegrin (Yang, et al., 2001).
The carpet model (Figure 1.4) is similar to the toroidal model, with the lipid monolayer bending continuously into the outer leaflet of the membrane and peptide along the surface between the lipid head groups and the pore centre. However, the carpet model suggests that at sufficiently high concentrations of peptide, the membrane will be covered by peptide with a detergent-like effect, leading to formation of micelles and disruption of the membrane. The peptide ovispirin is suggested to act in this manner (Yamaguchi et al., 2001), and studies of the peptide melittin also implicated this mechanism, depending on membrane composition and peptide concentration (Ladokhin and White, 2001).

Figure 1.3. Toroidal model of antimicrobial peptide activity.

Figure 1.4. Carpet model of antimicrobial peptide activity.

However, many of these studies rely on model membranes and infer cell killing from effects seen in such model systems. Loss of cell viability is often seen to precede the major ultrastructural changes in the microbial cell, suggesting that killing follows pore formation rather than a detergent-like disruption of the membrane. For example, magainin 2 exposure results in immediate loss of cytoplasmic potassium and cell death (Matsuzaki, 1997) without the membrane disruption that occurs at later time points in response to this peptide. For some peptides the antibacterial effects are independent of membrane effects. Some targets and activities are shown in Figure 1.5, along with examples of peptides that have these effects. Some of the mechanisms of action include: inhibition of DNA and RNA synthesis through binding to those molecules (buforin II); inhibition of macromolecular synthesis (DNA, RNA or protein) by other means (pleurocidin, dermaseptin, indolicidin); inhibition of cell-wall synthesis (mersacidin); prevention of cell division by inhibition of septum formation (indolicidin); and inhibition of enzymatic activity (pyrrhocoricin).
It is worth noting that under non-physiological conditions of salt and lack of serum protein, virtually any cationic peptide will show membrane disturbance given a high concentration of peptide (Jenssen et al., 2007; Zhang et al., 2001; Patrzykat et al., 2002). It is likely that many peptides once considered to kill bacteria by membrane disruption do not do so in vivo; they may attack internal targets instead or act through modulation of the immune system, as discussed next.

Figure 1.5. Intracellular targets of antibacterial peptides. Some structures of a bacterium are shown along with peptides that target them or inhibit their synthesis. Image modified from public domain image at http://en.wikipedia.org/wiki/Bacterium.

1.2.3  Antimicrobial peptides in regulation of innate immunity

In addition to direct killing of bacteria, many antimicrobial peptides have been recognized as having regulatory roles in the innate immune response to infection (Bowdish et al., 2005; Scott and Hancock, 2000; Yang et al., 2004; Zanetti, 2004; Finlay and Hancock, 2004). Since these roles do not involve direct antimicrobial activity and there is dispute about whether many of these peptides play such roles in vivo, some researchers now refer to these as "cationic host defense peptides" (Finlay and Hancock, 2004). In higher organisms, these roles involve nearly all steps in the host response to infection that are not part of adaptive immunity. These steps appear to include the following (Finlay and Hancock, 2004): 1) They are induced at sites of inflammation or infection. 2) They counter the inflammation caused by endotoxin (lipopolysaccharide, LPS) released by bacteria, which would otherwise lead to sepsis, by selectively suppressing expression of genes induced by LPS or by directly interacting with LPS. 3) They signal other cellular components of the innate immune system through the MAP kinase pathways.
4) They recruit other cells such as neutrophils and monocytes to sites of infection and modulate chemokine and histamine release by neutrophils and mast cells. Finally, 5) They promote wound healing by promotion of fibroblast chemotaxis and angiogenesis.

1.2.4  Bioinformatics for discovery of novel AMPs

As described above, antimicrobial peptides play a significant and possibly under-appreciated role in the innate immune system of higher organisms. However, while many of these peptides have common properties across species and structural classes, bioinformatics analysis must also address the large diversity of the sequences involved and their potentially distinct roles in the innate immune response. Previous bioinformatics analyses of antimicrobial peptides for gene discovery have been limited to identification of one particular class of peptide. For example, the second exon of beta-defensins in mouse and human contains a motif with six cysteines. Additional beta-defensins were identified in human and mouse genomic sequence (Jia, et al., 2001; Scheetz, et al., 2002; Schutte et al., 2002) by comparison with known defensins on a pair-wise basis using the Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990; Altschul et al., 1997), and hidden Markov models (Eddy, 1998). In addition, Yount and Yeaman (2004) identified a simple sequence motif found in cysteine-containing antimicrobial peptides that was reflected in a conserved 3D structure. This motif consists of a glycine followed by any amino acid followed by cysteine (GXC) and occurs in a specific conformation of the covalently bound chains. Two peptides, brazzein and charybdotoxin, matched this motif and showed the greatest sequence similarity with the core structure but had no documented antimicrobial activity; these were chosen for antibacterial assay and found to have activity.
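The sequence part of the GXC motif described above (glycine, any residue, cysteine) is simple enough to scan for with a regular expression. The following is a minimal sketch, not the procedure of Yount and Yeaman (2004): it checks only the linear pattern and says nothing about the 3D conformation that the motif also requires. The function name find_gxc and the example sequence are hypothetical.

```python
import re

# Lookahead so that overlapping occurrences (e.g. "GGCC") are all reported.
GXC_PATTERN = re.compile(r"(?=(G[A-Z]C))")


def find_gxc(sequence):
    """Return (zero-based position, matched triplet) for every GXC occurrence."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in GXC_PATTERN.finditer(sequence)]


def cysteine_count(sequence):
    """Number of cysteine residues, a quick filter for cysteine-containing peptides."""
    return sequence.upper().count("C")


# Made-up sequence for illustration only; it is not a real peptide.
example = "AAGKCAAGGCC"
print(find_gxc(example))        # [(2, 'GKC'), (7, 'GGC'), (8, 'GCC')]
print(cysteine_count(example))  # 3
```

A hit from such a scan would only be a candidate; in the study described above, candidates were confirmed by structural comparison and by antibacterial assay.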
Both these efforts for predicting additional antimicrobial peptides (defensins and GXC motifs) were specific to cysteine-containing peptides and involved manual steps. A more general and automated approach to bioinformatics analysis of antimicrobial peptides is necessary for large-scale identification and classification of peptides. However, gene prediction from genomic sequence is not considered optimal for two reasons: 1) the presence of introns in DNA (sequences that are removed from mRNA and thus do not appear in the translated protein) prevents confident prediction of protein sequence and 2) relatively few genomes have been sequenced to high quality. The large available quantity of expressed sequence tags (ESTs) is a valuable source of sequence for gene prediction. ESTs consist of single-pass sequence reads from either the 3' or 5' ends of sequences in a cDNA library; these cDNAs are constructed from mRNA by reverse transcription (Boguski, et al., 1993). Since mRNA is ultimately translated into protein (apart from untranslated regions at the ends of the mRNA), these cDNA sequences lack the complexity of genomic sequence (introns, exons and alternative splicing) that makes gene prediction extremely challenging (Zhang, 2002). However, ESTs are by their nature lower in quality: they are "single pass" reads with up to 3% sequencing errors and may contain truncated sequence (Boguski, et al., 1993).

1.3 Synthetic antimicrobial peptides

Antimicrobial peptides have drawn significant scientific attention as a novel class of antimicrobial therapeutics, as both antibacterial drugs and modulators of innate immunity (Hamilton-Miller, 2004; Levy and Marshall, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004). Antimicrobial peptides tend to exhibit lower potency against susceptible bacterial targets than conventional low-molecular-weight antibiotic compounds; however, they have several advantages.
These include fast target killing, broad range of activity, low toxicity and minimal development of resistance in target organisms (Hancock and Sahl, 2006; Yount and Yeaman, 2003). Over fourteen peptides are currently in development or clinical trials, but clinical trials to date have shown efficacy of peptides only as topical agents (Hancock and Sahl, 2006). Four cationic peptides have advanced to phase 3 clinical trials, each of which is a derivative of a gene-coded peptide. Of these, two have demonstrated efficacy. There are several properties of peptides considered to be important for antibacterial activity: charge, hydrophobicity and amphipathicity. It is not possible, however, to create high potency peptides by simple manipulation of the amino acid sequence (Tossi et al., 2000). Structure-activity relationship studies of the alpha-helical peptides (see Figure 1.6 for an example) have identified at least seven parameters that can influence the potency and spectrum of activity. These include the 1) size, 2) sequence, 3) degree of structuring (% helical content), 4) charge, 5) overall hydrophobicity, 6) amphipathicity, and 7) respective widths of the hydrophobic and hydrophilic faces of the helix. These properties are intimately linked and therefore modifications intended to enhance one property will necessarily impact the others.

Figure 1.6. Structure of an alpha-helical antimicrobial peptide. The peptide IKWLKIFL is shown. Red indicates regions of positive charge and green indicates regions of hydrophobicity.

There have been four main methodologies used to study the structure-activity relationships of antimicrobial peptides (as reviewed by Tossi et al., 2000). These are: 1) Sequence modification methods evaluate peptide sequences generated by modifying natural peptides. Amino acids are deleted, added, replaced, truncated or combined with
other natural sequences to generate novel sequence. Sequence modification methods have been applied to the study of cecropins, magainins and melittins in particular. 2) Minimalist approaches evaluate de novo sequences designed to be amphipathic and alpha-helical. To simplify analysis, the types of amino acids used are generally limited to one of the basic amino acids (lysine or arginine) and one or two of the hydrophobic residues (alanine, leucine, phenylalanine or tryptophan). 3) Synthetic combinatorial libraries evaluate combinatorial libraries of peptide sequences. To reduce the number of peptides needed for synthesis, typically only a few amino acid types are considered and only at a few amino acid positions. 4) Template-assisted methods generate sequence templates by comparing sequences of naturally occurring peptides and deriving patterns in terms of residue type (such as charged, hydrophobic, etc.). Novel peptide sequence is then created using the templates as a guide for activity. These structure-activity analyses have primarily been limited to qualitative analysis. However, a limited number of studies have attempted to derive quantitative structure-activity relationships, as described next.

1.3.1  Quantitative structure-activity relationships

A quantitative structure-activity relationship (QSAR) relates quantitative properties (descriptors) of a compound with other properties such as drug-like activity or toxicity. While QSAR methods have been used extensively in screening programs for drug discovery and toxicology studies (Perkins, et al., 2003), QSAR has been applied to antimicrobial peptides only relatively recently. QSAR modelling of antibacterial peptides has two aspects: the choice of QSAR descriptors and the choice of analysis technique to relate descriptor values to antibacterial activity. A large number of QSAR descriptors have been used for small compounds in the literature and large numbers are available from commercial software products.
Those descriptors used in QSAR studies of antibacterial peptides may be separated into two categories, empirical and calculated descriptors. High-pressure liquid chromatography (HPLC) retention time is an example of an empirical descriptor (a surrogate measure of solubility or hydrophilicity). Total peptide charge at pH 7 and van der Waals surface area are examples of calculated descriptors. Many statistical learning methods are available to relate descriptors to activity. Regression models predict the activity of a peptide as a continuous variable such as MIC (minimal inhibitory concentration), while classification models classify peptides as active or inactive. Primarily, linear regression methods have been used for antimicrobial peptides, using multiple linear regression alone, or in conjunction with principal component analysis (PCA) and projections to latent structures (PLS). More complex (non-linear) models such as artificial neural networks (ANNs) give superior predictions but do not clearly relate input descriptors to activity. Some researchers have favoured linear models such as multiple linear regression and principal component analysis because they yield models that explicitly relate the input descriptors to the output prediction of activity, but they do so at the cost of poorer performance (Weaver, 2004).

1.3.2  Previous QSAR analysis of antimicrobial peptides

Previous work on QSAR models for antimicrobial peptides has concentrated on derivatives of three natural peptides: lactoferricin, protegrin and bactenecin.

1.3.2.1  Lactoferricin derivatives

Several studies have examined the activities of lactoferricin derivatives against bacterial targets (Lejon et al., 2001; Lejon et al., 2004; Strom et al., 2001) and herpes simplex virus (Jenssen et al., 2005). Specific amino acid changes were made in derivatives of lactoferricin to observe the effect on activity. Strom et al.
(2001) modelled a set of 20 peptides with QSAR descriptors such as alpha helicity (determined empirically from circular dichroism spectroscopy or calculated in several different ways), HPLC retention time, calculated net charge, molecular surface, and symmetry of charge and hydrophobicity distribution. Using principal component analysis, descriptors related to charge and hydrophobicity had the highest weights in the models. Using the same set of peptides, similar results were obtained (Lejon et al., 2001) using only three descriptors, the z-values derived through an earlier analysis of changes in peptide empirical and calculated properties due to amino acid substitutions (Hellberg et al., 1987). Good predictive accuracy was also found using z-values for an expanded set of peptide analogues where only a few amino acid substitutions were made (Lejon et al., 2004). However, predictions were much less accurate when more than one or two substitutions were made in a single peptide, indicating the limitation of the amino acid substitution approach for more general antibacterial prediction.

1.3.2.2  QSAR of Protegrin Analogues and De Novo Peptides

Activities of antibacterial peptides based on protegrin have been reported in several studies. A de novo design strategy was used by Frecer et al. (2004) to produce synthetic peptides with structural similarity to cyclic beta-sheet defense peptides such as protegrin. A total of seven peptides were constructed and synthesized. Three descriptors were used to model antibacterial activity: total charge, an amphipathicity index, and a lipophilicity index. In a second paper, Frecer (2006) performed QSAR analysis on 97 protegrin derivatives of 14 amino acids in length based on published activity values, using 14 descriptors including features such as charge, overall lipophilicity, and separate properties of molecular sections (e.g.
lipophilicity of polar and nonpolar faces of the molecule, molecular surface areas for polar and nonpolar faces). Linear equations involving up to 5 descriptors were generated using a genetic function approximation (GFA) to describe antibacterial activity. Only moderate predictive power was found; predictions depended mostly on charge and amphipathicity. In another study, Ostberg and Kaznessis (2004) examined 62 protegrin derivatives using a larger selection of calculated QSAR descriptors. Multivariate linear regression produced moderate correlation between predicted and actual activity using five descriptors.

1.3.2.3  QSAR of scrambled bactenecin-derived peptides

A linear variant of the bovine cationic peptide bactenecin, Bac2A, has been used in studies of positional importance of amino acids by Hilpert et al. (2006). The activity of 49 peptides resulting from scrambling the sequence of Bac2A was modelled using 18 descriptors based largely on positions of arginines, distributions of hydrophobic amino acids and water accessible surface. Here, a binary classification algorithm was used to create a decision tree to classify peptides as active or inactive. An accuracy of 74% was obtained from training on the full set of peptides.

1.3.3  Limitations of current studies

Existing QSAR modelling studies are limited in several ways. The primary limitation is due to the size of the data sets. Unfortunately, the use of the three z-values was only effective when modelling very similar variants of a template peptide. More general predictions were more accurate after considering a larger number of descriptors, but the number of peptides considered was small compared to the number of descriptors. The types of models used also limit these QSAR studies.
Often the choice to use simpler linear models is made deliberately (for example, as stated by Frecer et al. (2004) and Frecer (2006)) because the resulting models give straightforward interpretation of the contribution of each descriptor. However, more complex models such as artificial neural networks (ANNs) are capable of modelling non-linear relationships as well, where descriptors interact with one another in a non-additive manner. As mentioned above, the main disadvantages of the more complex models are the cryptic nature of the models produced (contributions of individual descriptors to activity are not clear), and the requirement for much larger amounts of data, due to the larger number of parameters used in the models. Recent advances in high-throughput peptide synthesis, in combination with a rapid luminescence-based activity assay, have resulted in very large amounts of antibacterial peptide data becoming available (Hilpert et al., 2005).

1.3.4  'Inductive' QSAR descriptors

The QSAR descriptors used for previous modelling of antibacterial peptides have often required a high degree of similarity between peptides. More general QSAR descriptors have been developed recently that include properties sensitive to the three dimensional structure of peptides, the 'inductive' QSAR descriptors among others (reviewed in Cherkasov, 2005a). Previously, 'inductive' QSAR descriptors have been successfully applied to a number of molecular modelling studies including quantification of antibacterial activity of organic compounds (Cherkasov, 2005b), prediction of other molecular properties (Cherkasov, 2003), and small compound lead discovery (Cherkasov, 2005; Karakoc et al., 2006a). These descriptors have been used in different types of models for classification of compounds, including artificial neural networks (ANNs), k-nearest neighbors, linear discriminant analysis and multiple linear regression.
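To make the idea of an ANN classifier over QSAR descriptors concrete, the forward pass of a small three-layer network (input, hidden and output layers) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the networks trained in this thesis: it assumes a logistic (sigmoid) transform at every node, omits bias terms, and uses made-up weights. The function names sigmoid and forward are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(descriptors, hidden_weights, output_weights):
    """Forward pass of a three-layer network without bias terms.

    descriptors    : normalized descriptor values (input layer)
    hidden_weights : one list of input weights per hidden node
    output_weights : one weight per hidden node for the single output node
    """
    # Each hidden node transforms the weighted sum of the inputs.
    hidden = [sigmoid(sum(w * x for w, x in zip(weights, descriptors)))
              for weights in hidden_weights]
    # The output node transforms the weighted sum of the hidden node values.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


# Made-up example: two descriptors, one hidden node.
score = forward([1.0, -1.0], [[0.5, 0.5]], [2.0])
print(round(score, 4))  # 0.7311
```

An output near 1 would be read as "active" and near 0 as "inactive"; in practice the weights are fitted to training data, which is where the larger data requirement mentioned above arises.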
It has been found that ANNs result in generally more accurate predictions, followed closely by k-nearest neighbors methods (Karakoc et al., 2006b). The structure of an artificial neural network in the context of QSAR analysis is shown in Figure 1.7.

Figure 1.7. Structure of an artificial neural network. The network consists of three layers: the input layer, hidden layer and output layer. The input nodes take the values of the normalized QSAR descriptors. Each node in the hidden layer takes the weighted sum of the input nodes (represented as lines), and transforms the sum into an output value. The output node takes the weighted sum of these hidden node values and transforms the sum into an output value between 0 and 1.

1.4 Thesis objectives and hypotheses

1.4.1  Gene-coded antimicrobial peptides

The objectives of this section of the thesis follow from the hypothesis that analysis of existing peptides and construction of bioinformatic models can identify additional antimicrobial peptides both from known proteins (unacknowledged antimicrobial peptides among known proteins) and from unannotated sequence. The first objective was to create a resource consisting of software models of all known classes of antimicrobial peptide. In addition, a web site was constructed to allow the community to browse the many classes of peptides, enter sequence to be scanned and view results in the context of multiple sequence alignments. In Chapter 2, I describe the creation of the AMPer resource that performs these functions; a manuscript was published based on this work (Fjell, C.D., R.E. Hancock, and A. Cherkasov (2007) AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 23:1148-1155). The second objective was to examine unannotated EST sequence and genomic data to identify novel genes using the AMPer resources. In Chapter 3, I describe an analysis of the bovine EST data set with identification of a number of putative novel genes.
Subsequent experimental work with collaborators confirmed that one of these predicted genes was present and up-regulated in response to infection in bovine tissue. A manuscript on this work has been published (Fjell CD, Jenssen H, Fries P, Aich P, Griebel P, Hilpert K, Hancock RE, Cherkasov A. Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome. Proteins. 2008 73(2):420-30).

1.4.2  Identification of synthetic AMPs by QSAR analysis and machine learning

The hypothesis of Chapter 4 is that highly antibacterial peptides can be identified by a combination of non-linear machine learning algorithms and QSAR descriptors that are sensitive to the three-dimensional atomic conformation of the peptide. The objective of the work was to identify a set of QSAR descriptors that allow artificial neural networks to be trained to identify novel peptides that are highly antibacterial, and to validate the system by predicting entirely new peptides of two categories: highly active and inactive peptides. I report in Chapter 4 the first successful identification of highly active peptides in silico, without the use of a template sequence. A version of this chapter has been submitted to Journal of Medicinal Chemistry (Fjell, C.D., Hilpert, K., Jenssen, H., Cheung, W.A., Panté, N., Hancock, R.E.W., and Cherkasov, A. Identification of Novel Antibacterial Peptides by Chemoinformatics and Machine Learning); these results, in combination with further laboratory work by collaborators, have been accepted for publication (Cherkasov, A., Hilpert, K., Jenssen, H., Fjell, C.D., Waldbrook, M., Mullaly, S.C., Volkmer, R., and Hancock, R.E.W. Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic resistant Superbugs. ACS Chemical Biology, 2008). Some highly active peptide sequences from this work have been submitted for patent protection.
Chapter 5 is an extension of Chapter 4 to address one limitation of the QSAR methodology. Calculation of 3D QSAR descriptors can be a computationally expensive operation. The hypothesis of Chapter 5 is that an evolutionary search method called a genetic algorithm can be used to efficiently search through the possibilities of peptide sequences to identify additional peptides that are likely to be highly antibacterial. This work utilizes the software models described in Chapter 4, with the objective of dramatically increasing the efficiency of in silico discovery of novel antibacterial peptides, which was demonstrated.

1.4.3  Key assumptions

There are several important assumptions involved in this work. For the analysis of gene-coded AMPs, assumptions about the error rates in the sequencing technology are important since these were used to choose a threshold value for maximum allowed differences between predicted AMP sequence and the sequences of AMPs considered to be already known. If the accuracy was much lower than expected, sequences of known AMPs found in EST sequence at low accuracy may be identified as novel, related sequence. Comparing sequences within a multiple alignment allows one to observe whether random sequencing errors or areas of low quality sequence at the ends of ESTs (which are known to be of poorer quality) might account for observed differences between sequences. For the work on synthetic AMPs, there are several assumptions related to measurement of antibacterial activity. The screening assay used to measure killing of bacteria relies on detection of luminescence of bacteria due to a luciferase gene cassette. Killing of bacteria is assumed to be responsible for observed decreases in luminescence.
Since it is not feasible to routinely measure amounts of peptide synthesized per spot, the amount of peptide synthesized in each spot on the cellulose sheet is assumed to be constant; otherwise, there would be no way to compare peptide activities with this assay. In addition, the accuracy of the luminescence detection assay has an important impact on analysis. The luminescence varies up to approximately 2-fold between measurements of the same peptide. Therefore, the measured activity of peptides with low activity (large IC50 values) will have much higher levels of noise than that of highly active peptides (small IC50 values). This is the assumed reason for the failure of regression analysis to accurately predict activity, while the classification analysis worked well. Peptides predicted to have high activity were ultimately synthesized on resin with a different method that gives high purity (generally above 95%), and peptide activities measured directly by MIC dilution series were found to correlate well with luminescence measurements. Therefore, these assumptions about peptide concentration and luminescence activity measurements were found to be valid.

1.5 References

Aich, P., H. L. Wilson, et al. (2005). Microarray analysis of gene expression following preparation of sterile intestinal “loops” in calves. Can. J. Anim. Sci., 85: 13–22.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Altschul, S.F., Madden, T.L., Schäffer, A.A, Zhang, J., Zhang, A., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25: 3389–3402.
Bechinger, B. (1997). Structure and function of channel-forming peptides: magainins, cecropins, melittin and alamethicin. J. Membrane Biol., 156: 197-211.
Bechinger, B. (1999).
The structure, dynamics and orientation of antimicrobial peptides in membranes by multidimensional solid-state NMR spectroscopy. Biochim. Biophys. Acta, 1462: 157-183.
Blondelle, S. E., K. Lohner, et al. (1999). Lipid-induced conformation and lipid-binding properties of cytolytic and antimicrobial peptides: determination and biological specificity. Biochim. Biophys. Acta, 1462: 89-108.
Boguski, M. S., T. M. J. Lowe, et al. (1993). dbEST — database for "expressed sequence tags". Nature Genetics, 4: 332-333.
Bowdish, D.M., Davidson, D.J., Hancock, R.E.W. (2005) A Re-evaluation of the Role of Host Defence Peptides in Mammalian Immunity. Curr. Protein Pept. Sci., 6(1): 35-51.
Brahmachary, M., Krishnan, S. P. T., Koh, J. L. Y., Khan, A. M., Seah, S. H., Tan, T. W., Brusic, V., Bajic, V. B. (2004) ANTIMIC: a database of antimicrobial sequences. Nucl. Acids Res., 32: 90001, 1-589.
Brogden, K. A. (2005). Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol., 3: 238–250.
Brogden, K. A., De Lucca, A. J., Bland, J. & Elliott, S. (1996). Isolation of an ovine pulmonary surfactant-associated anionic peptide bactericidal for Pasteurella haemolytica. Proc. Natl Acad. Sci. USA, 93: 412–416.
Chapple, D.S., Hussain, R., Joannou, C.L., Hancock, R.E.W., Odell, E., Evans, R.W., Siligardi, G. (2004) Structure and Association of Human Lactoferrin Peptides with Escherichia coli Lipopolysaccharide. Antimicrob. Agents Chemother., 48(6): 2190-2198.
Cherkasov, A. (2003) Inductive Electronegativity Scale. Iterative Calculation of Inductive Partial Charges. J. Chem. Inf. Comp. Sci., 43: 2039-2047.
Cherkasov, A. (2005) ‘Inductive’ Descriptors. 10 Successful Years in QSAR. Current Computer-Aided Drug Design, 1: 21-42.
Cherkasov, A. (2005) Inductive QSAR Descriptors. Distinguishing Compounds with Antibacterial Activity by Artificial Neural Networks. Int. J. Mol. Sci., 6: 63-86.
Cherkasov, A., Shi, Z., Fallahi, M., and Hammond, G.L.
(2005) Successful in Silico Discovery of Novel Non-Steroidal Ligands for Human Sex Hormone Binding Globulin. J. Med. Chem., 48: 3203-3213.
Coombes, B. K., B. A. Coburn, et al. (2005). Analysis of the contribution of Salmonella pathogenicity islands 1 and 2 to enteric disease progression using a novel bovine ileal loop model and a murine model of infectious enterocolitis. Infect. Immun., 73: 7161-7169.
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.J. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14(9): 755-763.
Eisenhauer, P. B. and R. I. Lehrer (1992). Mouse neutrophils lack defensins. Infect. Immun., 60: 3446-3447.
Epand, R. M. and H. J. Vogel (1999). Diversity of antimicrobial peptides and their mechanisms of action. Biochim. Biophys. Acta, 1462: 11-28.
Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhanced to treat microbial infections? Nature Reviews Microbiology, 2: 497-504.
Fjell, C. D., R. E. Hancock, et al. (2007). AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 23: 1148-1155.
Frecer, V. (2006) QSAR analysis of antimicrobial and haemolytic effects of cyclic cationic antimicrobial peptides derived from protegrin-1. Bioorganic & Medicinal Chemistry, 14: 6065-6074.
Frecer, V., Ho, B., Ding, J.L. (2004) De Novo Design of Potent Antimicrobial Peptides. Antimicrob. Agents Chemother., 48: 3349-3357.
Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212.
Hancock, R. E. (2003). Concerns regarding resistance to self-proteins. Microbiology, 149: 3343-3344.
Hancock, R. E. and D. S. Chapple (1999). Peptide Antibiotics. Antimicrob. Agents Chemother., 43: 1317-1323.
Hancock, R. E. and R. Lehrer (1998). Cationic peptides: a new source of antibiotics.
Trends Biotechnol., 16: 82-88.
Hancock, R.E.W. (2001) Cationic peptides: effectors in innate immunity and novel antimicrobials. The Lancet Infectious Diseases, 1(3): 156-164.
Hancock, R.E.W., and Sahl, H.G. (2006). Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24: 1551-1557.
Hancock, R.E.W., Rozek, A. (2002) Role of membranes in the activities of antimicrobial cationic peptides. FEMS Microbiology Letters, 206(2): 143-149.
Hellberg, S., Sjostrom, M., Skagerberg, B., and Wold, S. (1987) Peptide quantitative structure-activity relationships, a multivariate approach. J. Med. Chem., 30: 1126–1135.
Hilpert, K., Elliott, M.R., Volkmer-Engert, R., Henklein, P., Donini, O., Zhou, Q. et al. (2006) Sequence requirements and an optimization strategy for short antimicrobial peptides. Chem. Biol., 13: 1101-7.
Hilpert, K., Volkmer-Engert, R., Walter, T., Hancock, R.E.W. (2005) High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnology, 23: 1008-1012.
Hsu, C. H., C. Chen, M. L. Jou, A. Y. Lee, Y. C. Lin, Y. P. Yu, W. T. Huang, and S. H. Wu. (2005) Structural and DNA-binding studies on the bovine antimicrobial peptide, indolicidin: evidence for multiple conformations involved in binding to membranes and DNA. Nucleic Acids Res., 33: 4053–4064.
Hwang, P.M., Vogel, H.J. (1998) Structure-function relationships of antimicrobial peptides. Biochem. Cell Biol., 76: 235-46.
Jack, R.W., Tagg, J.R., Ray, B. (1995) Bacteriocins of gram-positive bacteria. Microbiol. Rev., 59: 171-200.
Jenssen, H., Gutteberg, T.J., and Lejon, T. (2005) Modelling of anti-HSV activity of lactoferricin analogues using amino acid descriptors. J. Pept. Sci., 11: 97-103.
Jenssen, H., Hamill, P., and Hancock, R.E.W. (2006) Peptide Antimicrobial Agents. Clinical Microbiology Reviews, 19: 491–511.
Karakoc, E., Cherkasov, A., Sahinalp, S.C.
(2006) Distance based algorithms for small biomolecule classification and structural similarity search. Bioinformatics, 15: 243-251.
Karakoc, E., Sahinalp, S.C., and Cherkasov, A. (2006) Comparative QSAR- and fragments distribution analysis of drugs, druglikes, metabolic substances, and antimicrobial compounds. J. Chem. Inf. Model., 46: 2167-2182.
Khush, R. S., F. Leulier, et al. (2001). Drosophila immunity: two paths to NF-kappaB. Trends Immunol., 22: 260-264.
Kim, H. S., H. Yoon, et al. (2000). Pepsin-Mediated Processing of the Cytoplasmic Histone H2A to Strong Antimicrobial Peptide Buforin I. J. Immunol., 165: 3268-3274.
Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407.
Ladokhin, A. S. & White, S. H. (2001) ‘Detergent-like’ permeabilization of anionic lipid vesicles by melittin. Biochim. Biophys. Acta, 1514: 253–260.
Lai, R., Liu, H., Hui Lee, W. & Zhang, Y. (2002) An anionic antimicrobial peptide from toad Bombina maxima. Biochem. Biophys. Res. Commun., 295: 796–799.
Lejon, T., Stiberg, T., Strom, M.B., and Svendsen, J.S. (2004) Prediction of antibiotic activity and synthesis of new pentadecapeptides based on lactoferricins. J. Pept. Sci., 10: 329-335.
Lejon, T., Strom, M.B., and Svendsen, J.S. (2001) Antibiotic activity of pentadecapeptides modelled from amino acid descriptors. J. Pept. Sci., 7: 74-81.
Levy, O., Weiss, J., et al. (1993). Antibacterial 15-kDa protein isoforms (p15s) are members of a novel family of leukocyte proteins. J. Biol. Chem., 268: 6058-6063.
Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122-S129.
Madera, M., Gough, J. (2002) A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res., 30: 4321-4328.
Marshall, S. H. and G. Arenas (2003).
Antimicrobial peptides: A natural alternative to chemical antibiotics and a potential for applied biotechnology. Electron. J. Biotech., 6: 271-284.
Matsuzaki, K., Sugishita, K., Harada, M., Fujii, N. & Miyajima, K. (1997) Interactions of an antimicrobial peptide, magainin 2, with outer and inner membranes of Gram-negative bacteria. Biochim. Biophys. Acta, 1327: 119–130.
Mookherjee, N. and Hancock, R.E. (2007). Cationic host defence peptides: innate immune regulatory peptides as a novel approach for treating infections. Cell. Mol. Life Sci., 64: 922-933.
Mookherjee, N., Wilson, H. L.; Doria, S.; Popowych, Y.; Falsafi, R.; Yu, J. J.; Li, Y.; Veatch, S.; Roche, F. M.; Brown, K. L.; Brinkman, F. S.; Hokamp, K.; Potter, A.; Babiuk, L. A.; Griebel, P. J.; Hancock, R. E. (2006). Bovine and human cathelicidin cationic host defense peptides similarly suppress transcriptional responses to bacterial lipopolysaccharide. J. Leukoc. Biol., 80: 1563-1574.
Niculescu, S.P. (2003) Artificial neural networks and genetic algorithms in QSAR. Journal of Molecular Structure (Theochem), 622: 71–83.
Ostberg, N., and Kaznessis, Y. (2004) Protegrin structure–activity relationships: using homology models of synthetic sequences to determine structural characteristics important for activity. Peptides, 26: 197–206.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284: 1201-1210.
Parrill, A.L. (1996) Evolutionary and genetic methods in drug design. Drug Discovery Today, 1: 514-521.
Patrzykat, A., Friedrich, C.L., Zhang, L., Mendoza, V., Hancock, R.E.W. (2002) Sublethal Concentrations of Pleurocidin-Derived Antimicrobial Peptides Inhibit Macromolecular Synthesis in Escherichia coli. Antimicrob. Agents Chemother., 46: 605-614.
Perkins, R., Fang, H., Tong, W., and Welsh, W.J.
(2003) Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environmental Toxicology and Chemistry, 22: 1666-79.
Pfaffl, M. W. (2001). A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res., 29: e45.
Powers, J.P.S., Hancock, R.E.W. (2003). The relationship between peptide structure and antibacterial activity. Peptides, 24: 1681-1691.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992) Numerical Recipes in C: The Art of Scientific Computing (2nd Edition). Cambridge University Press, New York.
Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16: 276-277.
Rozek, A., C. L. Friedrich, and R. E. Hancock. (2000) Structure of the bovine antimicrobial peptide indolicidin bound to dodecylphosphocholine and sodium dodecyl sulfate micelles. Biochemistry, 39: 15765–15774.
Rozen, S. and H. J. Skaletsky (2000). Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology vol. 132, S. Krawetz and S. Misener (eds.), Humana Press, Totowa, N.J., U.S.A.
Scheetz, T., Bartlett, J.A., Walters, J.D., Schutte, B.C., Casavant, T.L., McCray, P.B. (2002) Genomics-based approaches to gene discovery in innate immunity. Immunol. Rev., 190: 137-145.
Schittek, B., Hipfel, R., Sauer, B., Bauer, J., Kalbacher, H., Stevanovic, S., Schirle, M., Schroeder, K., Blin, N., Meier, F., Rassner, G., Garbe, C. (2001) Dermcidin: a novel human antibiotic peptide secreted by sweat glands. Nat. Immunol., 2: 1133-1137.
Schutte, B.C., Mitros, J.P., Bartlett, J.A., Walters, J.D., Jia, H.P., Welsh, M.J., Casavant, T.L., McCray, P.B. (2002) Discovery of five conserved beta-defensin gene clusters using a computational search strategy. Proc. Natl. Acad. Sci. USA, 99: 2129-2133.
Scott, M. G., and R. E. Hancock.
(2000) Cationic antimicrobial peptides and their multifunctional role in the immune system. Crit. Rev. Immunol., 20: 407–431.
Shai, Y. (1999). Mechanism of the binding, insertion and destabilization of phospholipid bilayer membranes by α-helical antimicrobial and cell non-selective membrane-lytic peptides. Biochim. Biophys. Acta, 1462: 55-70.
Sima, P., Trebichavsky, I., Sigler, K. (2003) Mammalian antibiotic peptides. Folia Microbiol., 48: 123-137.
Sima, P., Trebichavsky, I., Sigler, K. (2003) Non-mammalian vertebrate antibiotic peptides. Folia Microbiol., 48: 709-724.
Simmaco, M., G. Mignogna, et al. (1998). Antimicrobial peptides from amphibian skin: what do they tell us? Biopolymers, 47: 435-450.
Solmajer, T. and Zupan, J. (2004) Optimization algorithms and natural computing in drug discovery. DDT, 1: 247-252.
Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A., Durbin, R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res., 26: 320-322.
Spaar, A., Munster, C. & Salditt, T. (2004) Conformation of peptides in lipid membranes studied by X-ray grazing incidence scattering. Biophys. J., 87: 396–407.
Strom, M.B., Stensen, W., Svendsen, J.S., and Rekdal, O. (2001) Increased antibacterial activity of 15-residue murine lactoferricin derivatives. J. Peptide Res., 57: 127–139.
Thompson, J. D., Higgins, D. G., Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22: 4673-4680.
Tossi, A., Sandri, L. & Giangaspero, A. (2000) Amphipathic, α-helical antimicrobial peptides. Biopolymers, 55: 4–30.
Uteng, M., Hauge, H. H., Markwick, P. R., Fimland, G., Mantzilas, D., Nissen-Meyer, J., Muhle-Goll, C.
(2003) Three-dimensional structure in lipid micelles of the pediocin-like antimicrobial peptide sakacin P and a sakacin P variant that is structurally stabilized by an inserted C-terminal disulfide bridge. Biochemistry, 42: 11417-26.
Wang, Z., and Wang, G. (2004) APD: the Antimicrobial Peptide Database. Nucleic Acids Res., 32: D590–D592.
Weaver, D.C. (2004) Applying data mining techniques to library design, lead generation and lead optimization. Current Opinion in Chemical Biology, 8: 264-270.
Whale, T. A., Wilson, H. L., Tikoo, S. K., Babiuk, L. A., Griebel, P. J. (2006) Pivotal Advance: Passively acquired membrane proteins alter the functional capacity of bovine polymorphonuclear cells. J. Leukocyte Biology, 80: 481-491.
Yamaguchi, S., Huster, D., Waring, A., Lehrer, R.I., Kearney, W., Tack, B.F., Hong, M. (2001) Orientation and dynamics of an antimicrobial peptide in the lipid bilayer by solid-state NMR spectroscopy. Biophys. J., 81: 2203–2214.
Yang, D., A. Biragyn, D. M. Hoover, J. Lubkowski, J. J. Oppenheim. (2004) Multiple roles of antimicrobial defensins, cathelicidins, and eosinophil-derived neurotoxin in host defense. Annu. Rev. Immunol., 22: 181–215.
Yang, L., Harroun, T. A., Weiss, T. M., Ding, L. & Huang, H.W. (2001) Barrel-stave model or toroidal model? A case study on melittin pores. Biophys. J., 81: 1475–1485.
Yeaman, M.R., Yount, N.Y. (2003) Mechanisms of Antimicrobial Peptide Action and Resistance. Pharmacol. Rev., 55: 27-55.
Yount, N.Y., Yeaman, M.R. (2004) Multidimensional signatures in antimicrobial peptides. PNAS, 101: 7363-7368.
Zanetti, M. (2004) Cathelicidins, multifunctional peptides of the innate immunity. J. Leukoc. Biol., 75: 39–48.
Zhang, L., Rozek, A., Hancock, R.E. (2001) Interaction of cationic antimicrobial peptides with model membranes. J. Biol. Chem., 276: 35714–35722.
Zhang, M.Q. (2002) Computational prediction of eukaryotic protein-coding genes.
Nature Reviews Genetics, 3: 698-709.

Chapter 2:  Prediction of gene-coded antimicrobial peptides by bioinformatic analysis

A version of this chapter has been published as: Fjell, C.D., R.E. Hancock, and A. Cherkasov (2007) AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 23:1148-1155.

2.1 Introduction

Antimicrobial peptides (AMPs) represent a diverse class of natural peptides that form a part of the innate immune system of mammals, insects, amphibians, and plants, among others (for example, Sima et al., 2003a, 2003b). In the face of increasing antibiotic resistance in pathogenic microorganisms, AMPs have drawn significant scientific attention as a novel class of prospective antimicrobial therapeutics, both as antibacterial drugs and as modulators of innate immunity (Hamilton-Miller, 2004; Levy and Marshall, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004). Although antimicrobial peptides exhibit lower potency against susceptible bacterial targets than conventional low-molecular-weight antibiotic compounds, they hold several compensatory advantages, including fast target killing, a broad range of activity, low toxicity and minimal development of resistance in target organisms (Hancock, 2001; Yeaman and Yount, 2003).

Although a broad spectrum of antimicrobial peptides has been identified and discussed in the literature, their structure-activity relationships are not well understood, largely because of substantial sequence and structure diversity. Examples include the alpha-helical cecropins and magainins and the beta-sheet structure of beta-defensins, among others. It should be mentioned, however, that AMP three-dimensional structures are often dependent on binding to membrane or lipopolysaccharide, and in solution many AMPs may exist in different and/or non-ordered conformations (Chapple et al., 2004; Yeaman and Yount, 2003).
Thus, the general view of characteristic AMP features typically involves their cationic character, relatively high hydrophobicity and short length (Powers and Hancock, 2003; Yeaman and Yount, 2003).

The mechanisms of peptide antimicrobial action are also under debate; while membrane disruption has been a common theme, other evidence suggests that peptides transit into the cytosol and disrupt intracellular targets, and that the membrane effects are distinct from (and not always crucial to) the killing effects (Hancock and Rozek, 2002; Patrzykat et al., 2002). In addition, the relative importance of direct killing versus immunomodulatory effects of mammalian AMPs is not obvious, since some peptides generally considered as AMPs do not appear to have direct microbe-killing effects in vivo (Brogden, 2005; Bowdish et al., 2005). All the above-mentioned controversies make ‘in silico’ discovery and/or modelling of antimicrobial peptides an important but challenging bioinformatics task.

To date, sequence analysis for AMP discovery has been done on a limited number of AMP types: the beta-defensins and other cysteine-containing peptides. A number of novel beta-defensins in mouse and human were identified by analysis of a specific exon of beta-defensins followed by scanning of genomic sequence (Scheetz et al., 2002; Schutte et al., 2002). Manual identification of a predictive motif, GXC, for cysteine-containing AMPs was also used to find novel AMPs of that type (Yount and Yeaman, 2004). However, these efforts were applicable only to a small number of AMP types. We decided to conduct a more generalized study of AMP sequences using profile-based hidden Markov models (HMMs) in combination with sequence clustering and protein structure annotation.
The major objective of the study was to produce HMMs for the existing AMP types, such as defensins, cathelicidins, and histatins among others, and to apply these models to create a more consistent classification of antimicrobial sequences. This new resource is available as an on-line database, for investigation of AMP sequence diversity, and as a set of HMM files for the discovery of novel gene-coded AMP candidates.

2.2 Results and discussion

The analysis of the antimicrobial peptides proceeded as described next and is summarized in Figure 2.1 and Figure 2.2.

Figure 2.1. Creation of initial AMPer clusters.

Figure 2.2. Summary of iterative enrichment of clusters.

2.2.1  Database of antimicrobial peptides

Initially, we used the set of known gene-coded AMPs from the AMSDb collection at the University of Trieste to compile a generalized set of known AMP sequences (see the “Web Resources” section for more details about the source of AMP sequences). The resulting set of confirmed AMPs contained 890 sequences and encompassed all major AMP classes, including defensins, cathelicidins and granulin among others. These peptides are available as entire holopeptides, containing both mature functional peptides and prosequences. Some of these proteins were found to contain obsolete annotations and to refer to obsolete Uniprot IDs. Since we were interested in analyzing the mature and prosequence regions separately, we required that the proteins be present in the current version (August 2006) of the Uniprot database. To associate the proteins in AMSDb with current Uniprot entries, we performed a pairwise similarity comparison using blastp from the BLAST package (Altschul et al., 1990). We considered a match to be made where the AMSDb protein had at least 99% sequence identity over at least 99% of the length of the smaller sequence of the pair.
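As an illustration, the matching rule just described can be written as a small predicate. This is a minimal sketch and not the code used to build AMPer: the Hit container and its field names are hypothetical, and the sketch assumes that identity is measured over the aligned positions and that coverage is the aligned length divided by the length of the shorter sequence.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Summary of one pairwise alignment (hypothetical container, not a BLAST API)."""
    identical: int    # number of identical aligned positions
    aligned: int      # number of aligned positions
    query_len: int    # length of the AMSDb sequence
    subject_len: int  # length of the Uniprot sequence


def is_same_protein(hit, min_identity=0.99, min_coverage=0.99):
    """Return True if the hit satisfies both the identity and coverage thresholds."""
    if hit.aligned == 0:
        return False
    shorter = min(hit.query_len, hit.subject_len)
    identity = hit.identical / hit.aligned  # assumed definition of identity
    coverage = hit.aligned / shorter        # assumed definition of coverage
    return identity >= min_identity and coverage >= min_coverage
```

Relaxing both thresholds to 0.95, as tried in the text, only requires passing different values for min_identity and min_coverage.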
We tried relaxing the criteria to 95% for each parameter; this resulted in only 2 more matches, which we did not consider sufficient to justify the additional risk of incorrect assignment. In addition, 33 proteins were identified, based on sequence ID, as the same protein in AMSDb and Uniprot even though the sequences were <99% similar. These 33 Uniprot proteins were used. Of the 890 original AMSDb proteins, 741 proteins were matched in Uniprot (661 from Swiss-Prot and 80 from TrEMBL).

The peptide location annotations from Uniprot were used to identify mature peptide and propeptide regions. A total of 679 Uniprot proteins were found to have suitable annotation for mature peptides, yielding 767 mature peptides. Most proteins contributed one mature peptide, while one protein, human Histatin-3 (HIS3_HUMAN), contributed 26 peptides, the highest number. A total of 238 Uniprot proteins had annotations for propeptides, yielding 316 propeptides. Most proteins contributed one propeptide, but up to 7 (for AMP_IMPBA from the balsam plant) were contributed by a single protein.

2.2.2  Clustering of the AMPs

As already mentioned, AMPs are very diverse in sequence and fall into a small number of secondary structure classes (Hwang and Vogel, 1998; Powers and Hancock, 2003). However, our objective in clustering was to group similar peptides for later analysis by hidden Markov models. For this purpose, we wanted to capture in a single cluster the diversity of sequences that likely corresponded to a single type of peptide. While a large number of AMP groups can be defined based on descriptions in the literature (such as defensins, magainins, cathelicidins), this nomenclature is not amenable to specification for automated grouping, due to the large diversity in sequence as well as length for a given protein name or description.
Since no classification scheme was found that was suitable for our purpose, we chose to group AMPs by sequence analysis using a custom sequence similarity measure. In short, clusters were constructed to have a minimum amount of similarity between all peptides in the cluster (see Methods section for details). Two sets of clusters were constructed, one for mature peptides and one for propeptides. Each peptide was compared to the peptides in existing clusters, and a 'global' sequence identity was calculated as the number of matching amino acids divided by the length of the shorter peptide, using the most significant alignment given by the blastp algorithm. For each existing cluster, the minimum global sequence identity between the new peptide and any peptide in the cluster was determined. The peptide was placed in the cluster giving the highest minimum identity, provided that this minimum was greater than a given identity threshold. Peptides not placed in existing clusters were used to start new clusters.

Threshold [%]   Number of Clusters   Clustered Fraction [%]
10              136                  94
20              142                  92
30              149                  90
40              148                  84
50              151                  80
60              158                  75
70              153                  66
80              136                  56
90              120                  42

Table 2.1. Effect of similarity threshold on clustering of mature peptides. The original set of mature peptides was clustered for several values of the minimum global percent similarity (Threshold). The Clustered Fraction is the fraction of the original set of mature peptides that were placed in clusters for the given threshold.

Minimum similarity thresholds in the range 10-90% were used to evaluate the resulting clusters. Decreasing the threshold to a minimum of 10% global similarity gives the maximum number of peptides placed in clusters. However, when we examined the multiple alignments of these clusters for low thresholds we found problems. Many contained two or more sets of closely related peptides that were more appropriately separated into distinct clusters.
As well, short peptides were found to be inserted into clusters where the matching amino acids in the multiple alignment were interspersed with gaps between matching positions of only one or two amino acids. For higher thresholds, however, a dramatically lower fraction of the peptides was represented in the clusters, with a 90% threshold yielding clusters for only 42% of the starting peptides. Therefore, we decided to use an intermediate threshold of 30% global sequence identity and manually correct the clusters by removing short peptides having poor alignment, and by splitting clusters into additional clusters where the peptides consisted of two or more highly similar sets of peptides. In total, 20 peptides were removed from 19 clusters; 3 clusters were split into 6; and 6 clusters composed of 2 clones each were removed. There were 146 resulting clusters for mature peptides, containing 655 peptides.

The propeptide clusters were treated similarly using a threshold of 30% global identity. There were 207 clustered propeptides in 42 clusters before manual edits. Four propeptides were removed from 4 clusters; 5 clusters were removed; and 3 clusters were split into 6. There were 40 resulting clusters containing 192 propeptides.

As anticipated, this classification approach grouped together related peptides in line with conventional classes such as beta-defensins, cecropins, magainins, etc. Peptides of a particular class, such as the beta-defensins, were also separated into multiple clusters, indicating sub-classes of these peptides. We did not try to reduce the number of clusters, for example, to produce a single cluster for each type of defensin. We considered that a larger number of clusters with more highly similar peptides in each is beneficial for model building, as the more specific models may reflect important sequence motifs that may be lost if the clusters contain too much variation.
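The assignment rule described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the AMPer implementation: global_identity is a hypothetical stand-in that scores the best ungapped alignment instead of calling blastp, the function and parameter names are invented for the example, and the manual correction step is not modelled.

```python
def global_identity(a, b):
    """Matches divided by the length of the shorter peptide.

    Stand-in for the blastp-based measure: slides the shorter peptide along
    the longer one and keeps the best ungapped match count.
    """
    short, long_ = sorted((a, b), key=len)
    best = 0
    for offset in range(-(len(short) - 1), len(long_)):
        matches = sum(
            1
            for i, residue in enumerate(short)
            if 0 <= i + offset < len(long_) and long_[i + offset] == residue
        )
        best = max(best, matches)
    return best / len(short)


def cluster_peptides(peptides, threshold=0.30):
    """Greedy clustering: join the cluster with the highest minimum identity."""
    clusters = []
    for peptide in peptides:
        best_cluster, best_min = None, threshold
        for cluster in clusters:
            # Lowest identity between the new peptide and any cluster member.
            min_identity = min(global_identity(peptide, m) for m in cluster)
            if min_identity > best_min:  # must strictly exceed the threshold
                best_cluster, best_min = cluster, min_identity
        if best_cluster is None:
            clusters.append([peptide])   # start a new cluster
        else:
            best_cluster.append(peptide)
    return clusters
```

Because peptides are assigned in input order, the result of such a greedy scheme can depend on that order, which is one reason a manual review of the alignments, as described above, remains useful.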
2.2.3  HMM modelling

Once we had created the initial clusters, we created profile hidden Markov models (HMMs) for the clusters, to be used to search for additional members of the AMP groups that were not present in the original AMP dataset. The HMMER software package (Eddy, 1998; http://hmmer.wustl.edu/) was used to create one profile hidden Markov model for each AMP cluster. ClustalW was used to generate the multiple alignments used by HMMER. The HMMER package was chosen over other tools because it is considered to be less sensitive to small misalignments in the multiple sequence alignments and to report reliable E-values (Madera and Gough, 2002).

2.2.4  Iterative enhancement of clusters

To enhance our initial clusters, we identified AMP sequences from Swiss-Prot and used these to enrich the initial clusters of AMPs by iteratively applying the corresponding HMMs to Swiss-Prot sequences. For the current work, we considered only the Swiss-Prot database, as it contains confirmed and relatively well-studied peptide sequences that allow the process to be validated. We found that it was not possible to use a specific threshold for significance of match (such as the expectation value, E-value, from BLAST or HMMER) to distinguish between hits to AMPs and non-AMPs. In an attempt to identify an E-value threshold that would distinguish significant matches from matches due to chance when searching the Swiss-Prot database, we evaluated the clustered peptides with the models specific for their cluster, specifying the size of the data set as the number of peptides in Swiss-Prot. When these E-values were plotted against the length of the model, it became clear that there is no E-value that can distinguish significant matches from random matches for short peptides (Figure 2.3). (Note that the length of the hidden Markov model is approximately the length of the peptides upon which it was trained.)

Figure 2.3.
The relationship between E-value and model length. The peptides in each cluster were scanned with the model corresponding to the cluster. For the shortest models (created from the shortest peptides) the E-values are greater than one.

Since E-values alone are not sufficient to identify significant matches, we decided to use additional information from the Swiss-Prot database to determine significance. For each Swiss-Prot protein, the model giving an HMM match with the lowest E-value was identified. The annotations for the Swiss-Prot protein were used to identify any protein regions overlapping with the region matched by the model. The Swiss-Prot peptide with the highest mutual overlap with the region matched by the model was identified. This peptide was also compared to all peptides in the model's cluster to determine its similarity to a listed AMP. To be considered a significant match, the mutual overlap between the region matched by a model and the annotated peptide had to be at least 90%. In addition, the blastp match between the Swiss-Prot peptide and the best-matching clustered peptide had to show at least 50% identity over 90% of the peptide length.

Those Swiss-Prot entries that produced a significant match to any of the 186 HMMs (146 for mature peptides and 40 for propeptides) were added to the existing AMP clusters. After peptides were added to a cluster, a new multiple alignment and HMM were constructed as described above. The new model, based on a larger number of sequences, was then used to scan Swiss-Prot. This was repeated until no additional peptides had a significant match: there were 5 iterations for the mature peptide models, and only one for the propeptide models. An example of the change in consensus sequence with iteration is shown in Table 2.2.
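The acceptance test and the repeat-until-stable loop described above can be sketched as follows. This is a hedged illustration, not the AMPer code: "mutual overlap" is assumed here to mean the shared length divided by the longer of the two regions, the identity and coverage values are assumed to come from a separate blastp comparison, and scan is a hypothetical callback standing in for rebuilding the alignment and HMM and searching Swiss-Prot.

```python
def mutual_overlap(a, b):
    """Shared length divided by the longer region.

    Regions are (start, end) tuples with 1-based inclusive coordinates.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if shared <= 0:
        return 0.0
    longer = max(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return shared / longer


def is_significant(hmm_region, peptide_region, identity, coverage,
                   min_overlap=0.90, min_identity=0.50, min_coverage=0.90):
    """Apply the overlap, identity and coverage thresholds from the text."""
    return (
        mutual_overlap(hmm_region, peptide_region) >= min_overlap
        and identity >= min_identity
        and coverage >= min_coverage
    )


def enrich(cluster, scan, max_rounds=20):
    """Add significant hits to a cluster until a scan finds nothing new.

    scan(members) is a hypothetical callback returning the peptides that
    pass is_significant for a model built from the current members.
    Returns the final members and the number of rounds that added peptides.
    """
    members = list(cluster)
    rounds = 0
    while rounds < max_rounds:
        new = [p for p in dict.fromkeys(scan(members)) if p not in members]
        if not new:
            break
        members.extend(new)
        rounds += 1
    return members, rounds
```

The max_rounds guard is an addition for safety in the sketch; the text reports that the procedure converged after 5 iterations for mature peptides and one for propeptides.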
The iterative scanning of the Swiss-Prot database (containing 230,133 peptides) resulted in an additional 389 mature peptides from 229 Swiss-Prot proteins being added to the AMP dataset as candidate AMPs, for a total of 1045 peptides from 970 Uniprot proteins. Sixty-one propeptides were also added, for a total of 253 propeptides from 223 proteins. Peptides were considered to be properly included as AMPs where the annotations included reference to antimicrobial activity or the protein belonged to the same family as a known AMP already in the database (see Methods for details).

The utility of a selection process that does not rely on the E-value can be seen in Cluster 1 (see on-line supplementary table at http://www.cnbi2.com/cgi-bin/amp.pl?peptide=1&cluster=5&type=MATURE) for the mature peptides. Starting with an initial 2 AMPs, an additional 9 peptides were added to the cluster. Despite the high E-values (up to 5.9), all peptides were found to have annotations that demonstrate antimicrobial activity.

The relationship between the mature peptides and propeptides from the same protein is shown in Figure 2.4 and Figure 2.5. In Figure 2.4, mature clusters are joined to propeptide clusters where the propeptides are derived from the same protein as a mature peptide in the cluster. Only the mature peptide clusters of at least ten peptides are shown. Similarly, Figure 2.5 shows links from the largest propeptide clusters to mature peptide clusters. These figures suggest there is greater conservation of propeptide sequence, since a greater proportion of propeptide clusters have links to multiple mature clusters. A full mapping between clusters is available as supplementary Figure 2.6.

Figure 2.4. Relationship between mature peptides and propeptides from the same protein for largest mature peptide clusters. For mature peptide clusters of 10 or more peptides, the corresponding propeptide clusters are indicated by a line joining the clusters.
The width of the line indicates the number of propeptides in that cluster that are from the same protein IDs as the mature peptides. Percentage values following the left clusters are the fraction of peptides with links to the right clusters.

Figure 2.5. Relationship between mature peptides and propeptides from the same protein for largest propeptide clusters. The linkage from propeptide clusters with ten or more propeptides is shown. See the caption of Figure 2.4 for details of line width and numbers.

Of the 229 proteins added, 34 either did not have annotation for antimicrobial activity, or had annotation specifically stating that they were not antimicrobial. Among these are two groups of peptides that have antimicrobial peptides in the same family: 9 Dahlein peptides are annotated as inactive (2 other Dahleins are active, DAH11_LITDA and DAH12_LITDA), and 8 Aurein peptides are annotated as inactive while 6 are active. An additional 17 peptides are peptide hormones such as cholecystokinin that do not have annotations for antimicrobial activity. However, there is considerable controversy surrounding whether certain peptides should be considered antimicrobial or not; in particular, differing assay conditions used by different investigators lead to differing results. For this reason, these peptides were left in the AMPer database and it is left to investigators to review the relevant literature provided through links from the AMPer system.

The physico-chemical properties of the mature peptides vary dramatically between clusters. As can be seen in supplementary Table 2.3 for the largest AMP clusters (size greater than 10 peptides), the net charge depends strongly on the type of AMP. As expected, the median charges typically exceed +2 but one class is negative. Except for one cluster, the median hydrophobicity is above 40% with a maximum of 77%. There are 5 clusters of propeptides of size 10 or greater, shown in supplementary Table 2.4.
These tend to be strongly negatively charged and much less hydrophobic than the mature clusters.

N | Consensus
0 | GlLDtLKnlAktagKGalqslLntaSCKLsgqC
1 | GiLDtlKnlAkgvaKgvaqsLLdklsCKlskgC
2 | GiLDtlKnlAkgaAKgvaqsLLdtlkCKltggC
3 | GiLDtlKnlAkgaaKgaaqsLLdtlsCKlsggC
4 | GiLDtlKglAknaGKGvaqsLLdtlsCKisggC
5 | GiLDtlKnlAkgaAKgaAqsLLdtlsCKisggC

Table 2.2. Changing consensus sequence with iteration. The consensus sequence and number of iterations for mature peptides in cluster 137 is shown. N is the iteration number, with N=0 the initial data from AMSDb.

2.2.5  Accuracy of models

The 186 final clusters were produced with high stringency requirements for matches to HMMs. Such stringency explains the relatively large number of identified clusters containing similar annotation: for example, there are 22 clusters of defensins which are split along the defensin subclasses (including several subclasses of alpha- and beta-defensins, cryptdins and other enteric defensins). Further investigation of the effect of using lower stringency thresholds for the initial clustering and for addition of peptides to clusters might allow these clusters to be merged, and a more representative model to be produced. However, performing additional merges may also lead to incorrect merges that give less accurate models. We consider that the presence of multiple clusters of similar peptides reflects subclasses of these peptides, and that the larger number of higher accuracy models may be beneficial for further work on mechanisms of action of AMPs that differ between subclasses.

To assess the expected performance of the system in identifying previously unknown AMPs from proproteins, we performed an approximately 10-fold cross-validation on the AMP identification procedure as described in detail below.
Since we were interested in the capacity of the system to identify AMPs in proproteins, we performed the testing steps of the validation on full proproteins from Swiss-Prot rather than simply the peptides comprising the clusters. The presence of another peptide from the same protein in both testing and training sets severely complicates interpretation of the results. The current pipeline is intended to identify proteins that contain additional antimicrobial peptides and will not properly handle recognition of additional peptides of the same cluster type. For this reason, only the 105 mature peptide clusters and 29 propeptide clusters that did not contain more than one peptide from the same proprotein were considered. In addition, for creation of HMMs, at least 2 peptides are required; to select a test peptide from the set, therefore, a minimum cluster size of 3 is needed. This left a total of 81 mature and 20 prosequence clusters used for cross-validation.

The results of the cross-validation show great variation in performance for recognizing additional AMPs. The cross-validation sensitivity varied from 0% for one mature cluster containing 3 peptides, to 100% for 36 mature clusters. The average sensitivity of all mature clusters was 82% (the standard deviation of the cluster mean sensitivities was 23%). The specificity and accuracy were both 99.2% (SD 1.3%). For the prosequence clusters, the sensitivity also varied between 0% for two clusters of 3 peptides, and 100% for 9 clusters, with an average of 81% (SD 30%); the average specificity for the prosequence clusters was 98.8% (SD 2.7%) and accuracy was 98.8% (SD 2.7%). The values for each cluster are available in supplementary Table 2.5 and supplementary Table 2.6. It should be noted that the specificity is conservatively based on distinguishing a class of AMPs from other possibly very similar AMPs (such as one class of defensins from several other classes of defensins).
As well, the accuracy is dominated by the number of negatives, since the number of actual negatives is much larger than the number of actual positives. In scanning a large database of unrelated proteins such as Swiss-Prot, the specificity and accuracy are expected to be significantly better since the number of false positives will be much lower, as demonstrated by the low number of total positive matches found for all of Swiss-Prot. The low sensitivity of some clusters is thought to be due to the relatively large variation in sequence in these clusters, especially for clusters containing few peptides. A variety of technical reasons were found for why peptides were missed: the HMM search did not give a significant match (E-value>10), or the HMM match did not align well with the Uniprot feature list, or the BLAST match to the closest training peptide was too poor (data not shown). This suggests that a simple tweaking of system parameters will not lead to a dramatic increase in sensitivity without an undesired decrease in specificity; therefore, a search for better search parameters was not pursued in this study.

2.2.6  On-line tools

All materials described here have been made available on-line (http://www.cnbi2.com/cgi-bin/amp.pl). All AMP sequences and final clusters are available for download. In addition, utilities are provided on-line to scan sequence provided by the user and to categorize the sequence according to these models. The HMMER HMM files used to predict and classify AMPs are available for researchers to download and use to scan sequence files using the HMMER package independently. This is a unique contribution to the community: one other site, ANTIMIC (Brahmachary et al., 2004; http://research.i2r.a-star.edu.sg/Templar/DB/ANTIMIC), provides some limited search against a few specific models but does not categorize submitted sequence, and does not provide for download of the sequences or the few HMM models it contains.
Web pages are available for viewing the AMPs and corresponding properties. The initial page (http://www.cnbi2.com/cgi-bin/amp.pl) provides links to lists of the AMP clusters and the peptides themselves. In addition to properties such as peptide length, charge and hydrophobicity, the consensus sequence is given as well as links to navigate to the list of AMPs in each cluster. For each peptide, there are clickable links to the Swiss-Prot web site and to the Swiss-Prot records for the version used in this study. The iteration number ("round") is indicated for each peptide, with round 0 indicating the peptide is from the original set from the AMSDb database (a link is also given to AMSDb). Several properties of the peptide subsequence matched by the HMM are also given: amino acid sequence, length, charge, hydrophobicity (as hydrophobic fraction, the fraction of amino acids that are hydrophobic), position of the subsequence within the main protein, as well as the E-value of the model match for this peptide. Additionally, values used for analysis are also given: the coverage of the best-matching peptide by the region matched by the HMM and vice versa, and the best-matching (by blastp) previously clustered peptide with percent identity and alignment length.

2.3  Conclusion

In summary, we utilized a set of documented AMPs to collect additional known gene-coded AMPs into a single database using a hybrid method for identifying antimicrobial peptides. We clustered the peptides and enriched the clusters with peptides from Swiss-Prot which could be matched by the trained HMM at high confidence by integrating additional information using pair-wise sequence comparison and annotations of peptide positions. The HMMs and sequence files are made available to the public from the AMPer website. We anticipate that these will be useful for discovering novel AMPs from unannotated sequence.
2.4 Methods

2.4.1  Initial peptide set

The initial set of gene-coded AMP sequences was obtained from the Biochemistry Department, University of Trieste, Italy (http://www.bbcm.units.it/~tossi/pag5.htm). These peptides were compared to the current Uniprot (Swiss-Prot and TrEMBL) databases (downloaded from http://www.pir.uniprot.org/ on August 4, 2006) to determine the current naming and annotation of the initial AMPs. Pairwise comparison was done using the blastp algorithm of the BLAST package with no filtering (parameters -F F). We considered a match to be positive when there was at least 99% identity of amino acids over a match length of at least 99% of the length of the AMP in the initial set. For AMSDb proteins with current Uniprot IDs but where the sequence was significantly different, the current Uniprot record was used. Mature peptides and propeptides were identified for each protein using the feature annotations available from Uniprot. For proteins with multiple mature peptides, those peptides annotated as antimicrobial were kept for analysis. Peptides were required to have definite start and end positions (records with '?' were rejected).

2.4.2  Clustering

Pairwise similarity between peptides was calculated using blastp (BLAST package, Altschul et al., 1990) with filtering off (-F F) and word size of 2 (-W 2). Clusters of similar peptides were constructed based on the pairwise alignments using a percentage match defined as the number of amino acids identical between the two peptides in the most significant alignment (highest bit score) divided by the length of the shorter of the two peptides. Clusters were built by successively adding peptides to a cluster where the percentage match was greater than the threshold for every peptide in the cluster. The percentage match threshold was varied between 10% and 90% for clustering mature peptides. Multiple alignments were created for each cluster using ClustalW (Thompson et al., 1994).
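The clustering rule described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the order in which peptides are considered or which cluster is chosen when several qualify, so the sketch processes peptides in input order and uses the first cluster that fits. The function names and the identical_counts lookup (identical residues in the best blastp alignment for each pair) are hypothetical.

```python
def percent_match(identical: int, len_a: int, len_b: int) -> float:
    """Identical residues in the best alignment divided by the shorter peptide length."""
    return 100.0 * identical / min(len_a, len_b)


def greedy_cluster(peptides: dict, identical_counts: dict, threshold: float = 30.0) -> list:
    """Group peptide names so that every pair within a cluster exceeds the threshold.

    peptides:         name -> amino acid sequence
    identical_counts: frozenset({name_a, name_b}) -> identical residues in the
                      highest-scoring pairwise alignment (missing pairs count as 0)
    """
    def match(a: str, b: str) -> float:
        identical = identical_counts.get(frozenset((a, b)), 0)
        return percent_match(identical, len(peptides[a]), len(peptides[b]))

    clusters = []
    for name in peptides:
        for cluster in clusters:
            # A peptide joins only if it matches every current member.
            if all(match(name, member) > threshold for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no existing cluster fits: start a new one
    return clusters


if __name__ == "__main__":
    toy_peptides = {"a": "A" * 10, "b": "A" * 9 + "C", "c": "W" * 10, "d": "A" * 5 + "K" * 5}
    toy_counts = {frozenset(("a", "b")): 9, frozenset(("a", "d")): 5, frozenset(("b", "d")): 1}
    print(greedy_cluster(toy_peptides, toy_counts))
```

In the toy data, peptide d matches a at 50% but b at only 10%, so it is not added to the cluster containing both; this is the "every peptide in the cluster" requirement in action.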
The alignments of mature peptide clusters resulting from several thresholds were examined. Low thresholds produced clusters containing similar peptides mixed with smaller peptides that were aligned at widely-spaced intervals to the longer peptides. The clusters from a 30% threshold were manually edited for both mature peptides and propeptides. Peptides were removed that aligned with a large number of widely-spaced inserts, and clusters containing two groups of highly-similar peptides were split into two clusters.

2.4.3  Iterative enhancement of clusters

At the start of an iteration, multiple sequence alignments were built for each cluster using ClustalW (as above). The HMMER software package (Eddy, 1998; http://hmmer.wustl.edu/) was used to create one hidden Markov model for each cluster from the multiple alignment, using the utility hmmbuild. Default parameters were used except for the '-f' parameter, used to create local models. The Swiss-Prot database was scanned using the HMMER utility hmmsearch for each model file. Custom Java, Python and Bash shell code was used to execute hmmsearch and parse the resulting output. Scanning of Swiss-Prot was performed for all models. For each Swiss-Prot protein matched, the information for the most significant match (lowest E-value) for any model was stored. Sequence regions matched by the HMMs were then compared to the annotated feature regions from Swiss-Prot. The annotated region (mature peptide or propeptide) having the greatest overlap with the HMM match region was stored. As an additional check, the clustered sequences were aligned to the full Swiss-Prot proteins matched by the HMMs using blastp. The best-matching clustered peptide was determined based on highest bit score.
Swiss-Prot peptides were considered positive matches and added to the clusters if the regions matched by the HMMs and the feature annotation agreed to at least 90% of their length, and the best-matching peptide from the same cluster had at least 50% identity to the Swiss-Prot protein. Positive matches were then added to the clusters for mature peptides and propeptides if they were not already present in any cluster. A new multiple alignment was then created using ClustalW, and a new model file was created using HMMER as described above. The Swiss-Prot sequences were scanned again using the new model files, and any additional matching peptides were added to the clusters. The process of scanning Swiss-Prot, adding matching peptides to clusters, and rebuilding the model files was repeated until no additional Swiss-Prot peptides were found. Consensus sequence was obtained using the utility hmmemit with the '-c' option. Mature peptide clusters were mapped to propeptide clusters by identifying clusters containing peptides from the same Uniprot protein. Graphics were created with PyX (http://pyx.sourceforge.net/) and ImageMagick (http://www.imagemagick.org).

2.4.4  Accuracy of models

An approximately 10-fold cross-validation was performed to estimate the expected performance of the models. Cross-validation was performed for each cluster independently of the others. Testing and training sets of peptides were created by randomly assigning peptides in a cluster to a number of sets of approximately equal size. Where the cluster had 10 or more peptides, 10% of the peptides were assigned to each set. Where the number of peptides in a cluster was not evenly divisible by 10, additional peptides were randomly assigned to sets (allowing only one additional peptide per set) until all peptides were assigned to exactly one set. Where a cluster had fewer than 10 peptides, one peptide was assigned to each of N sets, where N is the number of peptides in the cluster.
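The set assignment just described can be sketched as below. This is a minimal illustration rather than the code used in the study: the function name assign_folds, the seed parameter and the use of Python's random module are assumptions made for the example.

```python
import random


def assign_folds(peptides, n_folds: int = 10, seed: int = 0) -> list:
    """Randomly split a cluster's peptides into sets of approximately equal size.

    Clusters with at least n_folds peptides give n_folds sets; any remainder is
    spread over randomly chosen sets, at most one extra peptide per set.
    Smaller clusters give one peptide per set (leave-one-out).
    """
    rng = random.Random(seed)
    shuffled = list(peptides)
    if not shuffled:
        return []
    rng.shuffle(shuffled)

    k = min(n_folds, len(shuffled))
    base, extra = divmod(len(shuffled), k)
    # Sets that receive one additional peptide when the size is not divisible by k.
    larger_sets = set(rng.sample(range(k), extra))

    folds, pos = [], 0
    for i in range(k):
        size = base + (1 if i in larger_sets else 0)
        folds.append(shuffled[pos:pos + size])
        pos += size
    return folds


if __name__ == "__main__":
    for fold in assign_folds([f"pep{i}" for i in range(23)]):
        print(fold)
```

Each set can then serve in turn as the positive test data, with the remaining sets providing the training peptides for that round.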
By selecting one set in turn as the positive data for the test set and the other sets as positive data for the training sets, the sets of data were prepared to give an approximately 10-fold cross-validation for clusters having 10 or more peptides, and leave-one-out cross-validation for clusters having fewer than 10 peptides. In all cases, peptides from all other clusters were taken as negative test data (HMMs do not use negative training data). Since the software system was intended to identify unrecognized AMPs from proteins, the system will not attempt to recognize additional peptides from a protein already known to contain AMPs. Therefore, cross-validation was performed using only clusters in which each peptide was derived from a unique protein. This avoids the situation where a test peptide is automatically considered a positive match since it belongs to the same protein as a training peptide. In addition, for HMMs to be created, at least two peptides are required; therefore, only clusters of size three or greater were evaluated (so that one peptide would be available for the test set). The same procedure was used during validation as was used in identifying additional AMPs from Swiss-Prot. For each cluster, the training peptides were used to create an HMM. Since the purpose of the method is to identify AMPs from within full proteins, the HMM was used to scan the full Swiss-Prot protein corresponding to the test peptides. A BLAST search was performed between the training peptides and the corresponding Swiss-Prot proteins. As before, positives were defined as proteins for which the region matched by the HMMs and the feature annotation agreed to at least 90% of their length, and the best-matching peptide had at least 50% identity to the Swiss-Prot protein.

2.4.5  On-line tools

The web site uses a Perl CGI script running on an Apache Linux server with a MySQL RDBMS.
On-line sequence analysis uses utilities from the HMMER package.

2.5 Web resources

Biochemistry Department, University of Trieste, Italy: http://www.bbcm.units.it/~tossi/pag5.htm
HMMER: http://hmmer.wustl.edu/
Uniprot database: http://www.pir.uniprot.org/

2.6 Supplementary material

Cluster | Peptide Families | Number of Peptides | Median Peptide Length | Median Peptide Charge | Peptide Hydrophobicity | Consensus Sequence
146 | Alpha defensin (Neutrophil defensin) | 45 | 33 | 7 | 0.46 | CyCRrgrClsrErlsGtCringriyrLCCR
134 | Apidaecin | 10 | 18 | 4 | 0.56 | nnRPvYipqPRPPHPRl
128 | Aurein, Caeridin, Caerin, Citropin, Dahlein | 12 | 25 | 2 | 0.54 | GLlgsIGkaLGgLladvlKpKlqaa
84 | Aurein, Caerin, Dahlein | 11 | 25 | 2 | 0.52 | GLlsSiGKaLGGlLadvlKpKtqaa
131 | Aurein, Citropin, Dahlein, Dermaseptin, Maculatin | 22 | 23 | 1 | 0.43 | GLwqkIKeklkelAsGaivegvqs
129 | Aurein, Citropin, Uperin | 12 | 14 | 1 | 0.57 | DivKkVvsavggL
139 | Bactericidin, Cecropin, Hyphancin | 31 | 39 | 6 | 0.45 | WlkkifKkiErvGqnvRDaiikagpavqvvaqaaalar
145 | Beta-defensin | 26 | 36 | 6 | 0.47 | dPvtClrnGGiClysrCpgrtrqiGtCGhPkvKCCK
144 | Beta-defensin, Spheniscin, LAP, TAP | 21 | 40 | 9 | 0.50 | slsCrrnkGvCvpirCpgkmrQIGtCfgppVKCCRrk
135 | Bombinin-like, Maximin | 11 | 27 | 3 | 0.48 | GIGakILsgvKtaLKGaakeLAstyln
141 | Brevinin, Gaegurin, Ranatuerin | 18 | 24 | 4 | 0.75 | FLPllaglAAkvlpkiiCsItkKC
137 | Brevinin, Ocellatin, Palustrin, Ranatuerin | 29 | 31 | 3 | 0.46 | GiLDtlKnlAkgaAKgaAqsLLdtlsCKisggC
41 | Brevinin, Ranalexin | 12 | 24 | 3.5 | 0.71 | FLpilaslaakvlpkiiCavtkKC
140 | Caerin, Maculatin | 16 | 24 | 3 | 0.64 | svLgsvakhvlpHvvPviAEkl
118 | Caerulein, Cholecystokinins, Cionin | 30 | 8 | -2 | 0.63 | dYtGwmDF
142 | Cecropin, Sarcotoxin | 22 | 35 | 7 | 0.44 | GrlKKlGKKiEgvGkrvfdAaekaLpvaagvkala
60 | Circulin, Cyclopsychotride, Cyclotide, Cycloviolacin, Kalata | 48 | 30 | 1 | 0.57 | CGESCvvipCyttsvlGCsCxnkVCyrN
133 | Cysteine-rich antifungal protein | 15 | 51 | 5 | 0.49 | QKlcerpSGTwsGVCGNnNACkNQCInLEgArHGSCNYvFPaHkCiCYfPC
132 | Defensin-like protein | 11 | 47 | 5 | 0.41 | ktCenesdTfkGvCitkapCdkhCrnkEkftdGrCskiLrRClCTknC
143 | Defensin, Holotricin, Phormicin, Royalisin, Sapecin, Tenecin | 22 | 43 | 5 | 0.50 | aTCDlLSfegkgvkvnhsaCAahClarGrkGGyCnkkavCvCRn
138 | Fabatin, Gamma-thionin, Gamma hordothionin | 21 | 47 | 7 | 0.40 | rtCesqShrFKGpClsdsNCasVCrnEGFsGGnCrGfrRRCfCtrqC
1 | Hemolytic protein, Ranatuerin, Temporin | 11 | 13 | 1 | 0.77 | FlpaiAsLLgkll
136 | Histatin | 13 | 13 | 7 | 0.16 | HEKHHsHRGYr
95 | Liver-expressed AMP, Penaeidin | 23 | 55 | 10 | 0.51 | kgpYtRpvsrPpfvRPigasPigPYngCdvSCRgisesqARlCckRlGrCChlskgys
112 | Mastoparan | 12 | 14 | 3 | 0.64 | inlKalldlaKkvL
94 | Maximin | 44 | 20 | 2 | 0.60 | ilGPvlglvgnalggllkkl
83 | Melittin | 10 | 26 | 5 | 0.50 | gIGaiLKVLatgLPaliSWiKrKRqq
107 | Osmotin, Thaumatin, Zeamatin | 30 | 217.5 | 0.5 | 0.46 | AtftitNncpytVwAaalpgdgkpqLxgGGreLdsgqSwsldvpaGTwsaRfWgRTgCnfDaSGrGsCqTGDCGGqLsCnGaGapPPaTLAEytLaqfgglDFyDvSLVDGFNlPmsfaPtgGsGdCkaisCaAdiNavCPaeLkvkgsgGsVvACnsACtvFntpqYCCtggndtpetCpPTdYSriFKqqCPdAYSYayDDptSTFTCsggtnYrvtFCP
59 | Temporin, Vespid chemotactic peptide | 16 | 13 | 1 | 0.69 | flPiigkllsglL

Table 2.3. Properties of largest mature peptide clusters

Cluster | Peptide Families | Number of Peptides | Median Peptide Length | Median Peptide Charge | Peptide Hydrophobicity | Consensus Sequence
35 | Maximins | 45 | 25 | -3 | 0.24 | rseendvqsLsqRdvLeEEsLREiR
37 | Dermaseptin, Dermatoxin | 20 | 20.5 | -11 | 0.00 | eEKrEnEnEeeqEddeqSEe
38 | Beta-defensin 1 | 17 | 11 | 1 | 0.27 | dnFLtGLGHRs
39 | Cryptdin | 18 | 39 | -12 | 0.23 | DpiqntDEEtKtEEqpgEedqAvsvsFGdpeGsaLqeea
40 | Cathelicidin, Myeloid antibacterial peptide, Prophenin-2, Protegrin | 25 | 101 | -4 | 0.44 | qalsYreAvLRAvdqlnersseanlYRLLeLDppPkddedpdtpKpvsFrvKEtvCprttqqppEqCdFKenGlvKqCvGtvtldqvkdsfditCnelqsv

Table 2.4. Properties of largest propeptide clusters. The number of peptides, peptide properties (median length, charge and hydrophobicity) and consensus sequence are shown.
Cluster | Peptide Families | Number of Peptides | Sensitivity [%, mean (SD)] | Specificity [%, mean (SD)] | Accuracy [%, mean (SD)]
1 | Temporin, Ranatuerin, Hemolytic protein | 11 | 100.0 (0.0) | 98.7 (0.1) | 98.7 (0.1)
3 | Chrysophsin, Dicentracin, Moronecidin | 5 | 40.0 (54.8) | 100.0 (0.0) | 99.9 (0.1)
8 | Histone H1, Uperin | 3 | 66.7 (57.7) | 99.8 (0.3) | 99.8 (0.4)
9 | Penaeidin, Corticostatin-related | 4 | 75.0 (50.0) | 99.4 (0.3) | 99.4 (0.3)
11 | Cicerin, Gymnin | 3 | 66.7 (57.7) | 99.5 (0.3) | 99.4 (0.2)
16 | Pardaxin | 6 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
18 | Cathelin-related | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
20 | Clavanin | 5 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
22 | Moricin, Virescein | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
35 | Beta-defensin, Circulin-B | 4 | 75.0 (50.0) | 97.3 (1.7) | 97.2 (1.7)
39 | Sperm-associated antigen 11 | 3 | 66.7 (57.7) | 100.0 (0.1) | 99.9 (0.1)
41 | Brevinin, Ranalexin | 12 | 100.0 (0.0) | 96.5 (0.3) | 96.5 (0.3)
42 | Ceratotoxin, Dermadistinctin, Dermaseptin | 4 | 50.0 (57.7) | 98.6 (0.4) | 98.5 (0.5)
50 | Mast cell degranulating peptide | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
51 | Cecropin | 8 | 100.0 (0.0) | 97.7 (0.1) | 97.7 (0.1)
52 | Defensin heliomicin, ARD1, Mytilin | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
53 | Beta-defensin, heterophil peptide | 4 | 75.0 (50.0) | 99.7 (0.0) | 99.7 (0.1)
55 | Uperin, Maculatin | 5 | 80.0 (44.7) | 99.6 (0.1) | 99.6 (0.1)
56 | Dermaseptin, Dermadistinctin | 6 | 100.0 (0.0) | 98.4 (0.1) | 98.4 (0.1)
58 | Alpha-defensin 6, Corticostatin | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
59 | Temporin, Vespid chemotactic peptide | 16 | 100.0 (0.0) | 99.1 (0.1) | 99.1 (0.1)
60 | Circulin, Cyclopsychotride, Cyclotide, Cycloviolacin, Kalata | 48 | 67.0 (35.9) | 99.8 (0.1) | 99.6 (0.2)
62 | Beta-defensin, Corticostatin | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
70 | Thaumatin-like | 8 | 50.0 (53.5) | 100.0 (0.0) | 99.9 (0.1)
71 | Ceratotoxin, Pleurocidin | 5 | 100.0 (0.0) | 99.7 (0.0) | 99.7 (0.0)
72 | Lebocin | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
75 | Cryptdin, Circulin | 4 | 75.0 (50.0) | 100.0 (0.0) | 100.0 (0.1)
76 | Styelin | 3 | 66.7 (57.7) | 99.9 (0.1) | 99.9 (0.0)
77 | Styelin, Phylloxin | 3 | 66.7 (57.7) | 99.8 (0.1) | 99.7 (0.0)
78 | Defensin-like peptide | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
79 | Holotricin-3, Tenecin-3 | 3 | 0.0 (0.0) | 100.0 (0.0) | 99.9 (0.0)
81 | Hadrurin, Opistoporin, Pandinin | 4 | 75.0 (50.0) | 100.0 (0.0) | 100.0 (0.1)
83 | Melittin | 10 | 100.0 (0.0) | 99.8 (0.1) | 99.8 (0.1)
86 | Esculentin, Ranatuerin | 6 | 100.0 (0.0) | 99.8 (0.1) | 99.8 (0.1)
87 | Cathelin | 4 | 50.0 (57.7) | 99.3 (0.8) | 99.2 (0.9)
88 | Dermaseptin | 5 | 60.0 (54.8) | 99.3 (0.4) | 99.3 (0.4)
89 | Uperin | 3 | 100.0 (0.0) | 99.4 (0.0) | 99.4 (0.0)
90 | Beta-defensin | 3 | 100.0 (0.0) | 97.8 (0.0) | 97.8 (0.0)
91 | Defensin, Plectasin | 4 | 100.0 (0.0) | 99.4 (0.0) | 99.4 (0.0)
92 | Eosinophil granule major basic protein | 6 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
94 | Maximin | 44 | 100.0 (0.0) | 98.8 (0.0) | 98.8 (0.0)
95 | Penaeidin, Liver-expressed AMP | 23 | 100.0 (0.0) | 99.9 (0.0) | 99.9 (0.0)
96 | Ponericin, Pandinin, Gaegurin | 6 | 83.3 (40.8) | 97.1 (1.2) | 97.0 (1.3)
98 | Defensin | 4 | 100.0 (0.0) | 99.6 (0.0) | 99.6 (0.0)
99 | Tachyplesin, Rhesus theta defensin | 5 | 80.0 (44.7) | 99.8 (0.1) | 99.8 (0.2)
100 | Hepcidin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
101 | AFP2B, Defensin J1-1, Drosomycin | 4 | 50.0 (57.7) | 97.5 (0.5) | 97.4 (0.4)
102 | Protegrin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
103 | Tigerinin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
104 | Cecropin-A, Hyphancin, Moricin | 4 | 75.0 (50.0) | 97.1 (0.0) | 97.1 (0.1)
105 | Defensin | 4 | 75.0 (50.0) | 100.0 (0.0) | 100.0 (0.1)
106 | Bactenecin | 4 | 75.0 (50.0) | 100.0 (0.0) | 100.0 (0.1)
107 | Thaumatin, Osmotin, Zeamatin | 30 | 86.7 (17.2) | 98.9 (0.1) | 98.8 (0.1)
108 | Tachyplesin, Polyphemusin, Hepcidin | 4 | 75.0 (50.0) | 99.4 (0.0) | 99.4 (0.1)
109 | Basal layer antifungal peptide | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
110 | Pseudin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
111 | Maximins, Ponericin | 4 | 50.0 (57.7) | 93.3 (0.1) | 93.2 (0.2)
112 | Mastoparan | 12 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
113 | Amoebapore (ameobapore) | 5 | 80.0 (44.7) | 100.0 (0.0) | 100.0 (0.1)
114 | Andropin | 6 | 50.0 (54.8) | 100.0 (0.0) | 99.9 (0.1)
115 | Bombolitin | 5 | 80.0 (44.7) | 99.9 (0.1) | 99.9 (0.1)
117 | BPI, LBP (Lipopolysaccharide-binding protein, Bactericidal permeability-increasing protein) | 8 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
118 | Caerulein, Cholecystokinins, Cionin | 30 | 56.7 (38.7) | 100.0 (0.0) | 99.8 (0.2)
119 | Ponericin | 5 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
120 | Gallinacin | 5 | 40.0 (54.8) | 99.9 (0.0) | 99.8 (0.1)
121 | Uperin | 5 | 80.0 (44.7) | 99.1 (0.2) | 99.1 (0.1)
123 | Metalnikowin, Pyrrhocoricin | 5 | 60.0 (54.8) | 100.0 (0.0) | 99.9 (0.1)
124 | Antimicrobial peptide | 6 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
125 | Esculentin, Gaegurin, Rugosin | 7 | 100.0 (0.0) | 99.1 (0.1) | 99.1 (0.1)
126 | Dermaseptin | 6 | 100.0 (0.0) | 99.0 (0.2) | 99.1 (0.2)
127 | Bombinin, Maximin | 6 | 100.0 (0.0) | 93.7 (0.0) | 93.8 (0.0)
131 | Aurein, Citropin, Dahlein, Dermaseptin, Maculatin | 22 | 100.0 (0.0) | 99.1 (0.2) | 99.1 (0.2)
132 | Defensin-like protein | 11 | 80.0 (42.2) | 98.0 (0.5) | 98.0 (0.5)
133 | Cysteine-rich antifungal protein | 15 | 100.0 (0.0) | 99.7 (0.0) | 99.7 (0.0)
137 | Brevinin, Ranatuerin, Palustrin, Ocellatin | 29 | 100.0 (0.0) | 99.2 (0.3) | 99.2 (0.3)
138 | Fabatin, Gamma-thionin, Gamma hordothionin | 21 | 90.0 (21.1) | 99.7 (0.0) | 99.7 (0.1)
139 | Bactericidin, Cecropin, Hyphancin | 31 | 100.0 (0.0) | 98.3 (0.0) | 98.3 (0.0)
141 | Brevinin, Gaegurin, Ranatuerin | 18 | 95.0 (15.8) | 96.9 (0.2) | 96.9 (0.2)
143 | Defensin, Holotricin, Sapecin, Tenecin, Royalisin, Phormicin | 22 | 90.0 (21.1) | 100.0 (0.0) | 100.0 (0.1)
144 | Beta-defensin, Spheniscin, LAP, TAP | 21 | 60.0 (39.4) | 99.2 (0.4) | 99.1 (0.3)
145 | Beta-defensin | 26 | 70.0 (23.3) | 98.4 (0.4) | 98.3 (0.4)

Table 2.5. Performance of AMP identification method determined by cross-validation for mature peptide clusters. An approximately 10-fold cross-validation was performed for each cluster where the number of peptides was 10 or more. For clusters with fewer than 10 peptides, a leave-one-out cross-validation was performed. Clusters were included in this analysis where the number of peptides was 3 or greater and where no two peptides were derived from the same Swiss-Prot protein (see main text). The performance measures of sensitivity, specificity and accuracy are reported as the mean and standard deviation of the values calculated for the cluster during the cross-validation.

Cluster | Peptide Families | Number of Peptides | Sensitivity [%, mean (SD)] | Specificity [%, mean (SD)] | Accuracy [%, mean (SD)]
2 | Hepcidin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
4 | Hepcidin | 3 | 66.7 (57.7) | 100.0 (0.0) | 99.8 (0.4)
15 | Rhesus theta defensin | 3 | 100.0 (0.0) | 99.8 (0.4) | 99.8 (0.4)
17 | Phormicin, Phormicin, Sapecin | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
18 | Pleurocidin | 4 | 75.0 (50.0) | 100.0 (0.0) | 99.8 (0.3)
19 | Floral defensin-like protein | 3 | 0.0 (0.0) | 100.0 (0.0) | 99.3 (0.0)
20 | Lebocin | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
21 | Styelin | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
22 | Defensin | 3 | 66.7 (57.7) | 100.0 (0.0) | 99.8 (0.4)
23 | Cryptdin | 3 | 0.0 (0.0) | 94.9 (5.8) | 94.3 (5.7)
28 | Neutrophil defensin | 6 | 66.7 (51.6) | 97.1 (1.4) | 96.9 (1.2)
29 | Ranalexin, Esculentin-1B, Gaegurin-5, Ranalexin, Temporin-G | 6 | 83.3 (40.8) | 98.8 (0.6) | 98.7 (0.3)
30 | Brevinin, Gaegurin | 4 | 100.0 (0.0) | 98.8 (0.3) | 98.8 (0.3)
31 | Corticostatin, Neutrophil antibiotic peptide | 5 | 80.0 (44.7) | 100.0 (0.0) | 99.9 (0.3)
32 | Neutrophil defensin, Defensin 5 | 9 | 100.0 (0.0) | 88.7 (1.6) | 88.8 (1.6)
33 | Maximins | 6 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
37 | Dermaseptin, Dermatoxin | 20 | 95.0 (15.8) | 99.9 (0.2) | 99.8 (0.3)
38 | Beta-defensin 1 | 17 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
39 | Cryptdin | 18 | 95.0 (15.8) | 97.8 (0.2) | 97.7 (0.4)
40 | Cathelicidin, Myeloid antibacterial peptide, Prophenin-2, Protegrin | 25 | 88.3 (24.9) | 100.0 (0.0) | 99.8 (0.5)

Table 2.6. Performance of AMP identification method determined by cross-validation for propeptide clusters. An approximately 10-fold cross-validation was performed for each cluster where the number of peptides was 10 or more. For clusters with fewer than 10 peptides, a leave-one-out cross-validation was performed. Clusters were included in this analysis where the number of peptides was 3 or greater and where no two peptides were derived from the same Swiss-Prot protein (see main text). The performance measures of sensitivity, specificity and accuracy are reported as the mean and standard deviation of the values calculated for the cluster during the cross-validation.

Figure 2.6. Relationship between mature peptides and propeptides from the same protein for clusters of all sizes. The corresponding propeptide clusters are indicated by a line joining the mature clusters. The width of the line indicates the number of propeptides in that cluster that are from the same protein IDs as the mature peptides.

2.7 References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215: 403-410.
Brahmachary, M., Krishnan, S.P.T., Koh, J.L.Y., Khan, A.M., Seah, S.H., Tan, T.W., Brusic, V., Bajic, V.B. (2004) ANTIMIC: a database of antimicrobial sequences. Nucl. Acids Res. 32: 90001, 1-589
Bowdish, D.M., Davidson, D.J., Hancock, R.E.W. (2005) A Re-evaluation of the Role of Host Defence Peptides in Mammalian Immunity. Curr. Protein Pept. Sci., 6: 35-51.
Brogden, K.A. (2005) Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol., 3: 238-50. Chapple, D.S., Hussain, R., Joannou, C.L., Hancock, R.E.W., Odell, E., Evans, R.W., Siligardi, G. (2004) Structure and Association of Human Lactoferrin Peptides with Escherichia coli Lipopolysaccharide. Antimicrob. Agents Chemother., 48: 2190-2198 Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.J. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univeristy press, Cambridge, UK Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14: 755-763. Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhance to treat microbial infections? Nature Reviews Microbiology, 2: 497-504. Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212. Hancock, R.E.W. (2001) Cationic peptides: effectors in innate immunity and novel antimicrobials. The Lancet Infectious Diseases, 1: 156-164. Hancock, R.E.W., Rozek, A. (2002) Role of membranes in the activities of antimicrobial cationic peptides. FEMS Microbiology Letters, 206: 143-149 Hwang, P.M., Vogel, H.J. (1998) Structure-function relationships of antimicrobial peptides. Biochem. Cell Biol., 76:235-46. Jack, R.W., Tagg, J.R., Ray, B. (1995) Bacteriocins of gram-positive bacteria. Microbiol Rev., 59: 171-200. Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407. Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122 - S129. Madera, M., Gough, J. (2002) A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res., 30: 4321-4328. 67  Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., Chothia, C. 
(1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284: 1201-1210.
Patrzykat, A., Friedrich, C.L., Zhang, L., Mendoza, V., Hancock, R.E.W. (2002) Sublethal Concentrations of Pleurocidin-Derived Antimicrobial Peptides Inhibit Macromolecular Synthesis in Escherichia coli. Antimicrob. Agents Chemother., 46: 605-614.
Powers, J.P.S., Hancock, R.E.W. (2003) The relationship between peptide structure and antibacterial activity. Peptides, 24: 1681-1691.
Schutte, B.C., Mitros, J.P., Bartlett, J.A., Walters, J.D., Jia, H.P., Welsh, M.J., Casavant, T.L., McCray, P.B. (2002) Discovery of five conserved beta-defensin gene clusters using a computational search strategy. Proc. Natl. Acad. Sci. USA, 99: 2129-2133.
Scheetz, T., Bartlett, J.A., Walters, J.D., Schutte, B.C., Casavant, T.L., McCray, P.B. (2002) Genomics-based approaches to gene discovery in innate immunity. Immunol Rev., 190: 137-145.
Sima, P., Trebichavsky, I., Sigler, K. (2003) Mammalian antibiotic peptides. Folia Microbiol., 48: 123-137.
Sima, P., Trebichavsky, I., Sigler, K. (2003) Non-mammalian vertebrate antibiotic peptides. Folia Microbiol., 48: 709-724.
Sing, T., Sander, O., Beerenwinkel, N., Lengauer, T. (2005) ROCR: visualizing classifier performance in R. Bioinformatics, 21: 3940-3941.
Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22: 4673-4680.
Yeaman, M.R., Yount, N.Y. (2003) Mechanisms of Antimicrobial Peptide Action and Resistance. Pharmacol Rev., 55: 27-55.
Yount, N.Y., Yeaman, M.R. (2004) Multidimensional signatures in antimicrobial peptides. Proc. Natl. Acad. Sci.
USA, 101: 7363-7368

Chapter 3: Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome

A version of this chapter has been published as: Fjell CD, Jenssen H, Fries P, Aich P, Griebel P, Hilpert K, Hancock RE, Cherkasov A. (2008) Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome. Proteins. 73:420-30.

3.1 Introduction
Host defense peptides (also known as antimicrobial peptides, AMPs) are natural peptides produced as part of the innate immune system of a broad range of organisms including mammals, insects, amphibians, plants and amoeboid protozoa, among others (Simmaco et al. 1998; Khush et al. 2001; Sima et al. 2003; Sima et al. 2003). As the problem of resistance of pathogenic microorganisms to conventional antibiotic therapeutics increases, AMPs have drawn significant scientific attention as a novel class of prospective anti-infective therapeutics (Hancock and Lehrer 1998; Hancock and Chapple 1999; Hancock 2003; Marshall and Arenas 2003). They offer several advantages including fast target killing, broad range of activity, low toxicity and minimal development of resistance in target organisms (Hancock and Lehrer 1998; Hancock and Chapple 1999; Hancock 2003; Marshall and Arenas 2003; Sima et al. 2003; Sima et al. 2003). Their mechanisms of killing are diverse and include membrane disruption (Bechinger 1997; Bechinger 1999; Blondelle et al. 1999; Epand and Vogel 1999; Shai 1999) as well as metabolic inhibition of intracellular targets (Brogden 2005). In addition to direct killing, certain host defense peptides play important roles in modulation of the innate immune response, both in up-regulation to enhance killing of pathogens and in down-regulation to reduce detrimental conditions such as sepsis (Mookherjee and Hancock 2007). The relative importance of direct killing by AMPs versus immunomodulation remains unclear (Bowdish et al. 2005).
Limited numbers of novel antimicrobial peptides have been identified previously with the help of computational approaches (Scheetz et al. 2002; Schutte et al. 2002; Patil A 2004; Patil AA 2005; Looft C et al. 2006; Belov K et al. 2007; Lynn DJ and DG 2007). The majority of these studies (Scheetz et al. 2002; Schutte et al. 2002; Patil A 2004; Patil AA 2005; Lynn DJ and DG 2007) searched specifically for the presence of a particular class of antimicrobial peptide, the defensins, which belong to three sub-families: the alpha-, beta- and theta-defensins. Two techniques are commonly used to identify novel peptides by sequence analysis: comparing examples of a class of peptides to a novel sequence in a pairwise fashion (for example using a BLAST analysis (Altschul et al. 1990)), or using a set of similar peptides to construct a profile of the class of peptides and then deriving a statistical model of the class for searching novel sequences (for example using profile hidden Markov models (Durbin et al. 1998)).
Profile hidden Markov models (HMMs) have been used extensively for large-scale analysis of protein sequences (Durbin et al. 1998; Sonnhammer et al. 1998) and we have previously developed the AMPer resource (Fjell et al. 2007) (http://www.cnbi2.com/cgi-bin/amp.pl) that includes HMMs to describe and predict AMPs based on peptide sequence. AMPer includes all AMPs that were available in the Uniprot database and separately describes mature peptides and propeptides, which were determined based on Uniprot annotations. These have been grouped into sets of related peptides, with each set used to produce one hidden Markov model specific to that subclass of AMP. AMPer includes 1045 mature peptides (with 146 corresponding models) and 253 propeptide sequences (with 40 corresponding models) derived from 970 Uniprot proteins. Models from AMPer provide the means to perform high-throughput analysis to discover novel AMPs that are related to peptides that are currently known.
This serves to identify additional peptides that may have antimicrobial activity and may suggest the absence of a class of peptide in an organism. As an example, we consider the alpha-defensins: there are currently no recognized alpha-defensins in the bovine genome. Phylogenetic analysis of defensins has suggested that all defensins in the mammalian lineage have been derived from a single ancestral beta-defensin and that alpha-defensins arose from beta-defensins by a process of gene duplication followed by diversification in response to the pathogens encountered in the particular ecological niche of the organism (Patil A 2004; Xiao et al. 2004; Patil AA 2005). Until recently, alpha-defensins were believed to be restricted to the primate and glires (rodents and lagomorphs) lineages (Patil A 2004; Xiao et al. 2004; Patil AA 2005); however, more recent analysis of defensins from a broader range of mammals has identified alpha-defensins in opossum (Belov K et al. 2007), elephant and hedgehog tenrec (Lynn DJ and DG 2007), and horse (Looft C et al. 2006).
In the current work, we used hidden Markov models from the AMPer resource to identify AMPs from bovine. For this work, we considered nucleic acid sequence from the draft genome sequence and expressed sequence tags (ESTs, single-pass sequences of cDNA created from mRNA). Our aim was to discover previously uncharacterized gene-coded bovine AMPs of all classes as well as to test the hypothesis that the bovine genome lacks alpha-defensins.

3.2 Results and discussion
3.2.1 Identification of host defense peptides
We used the AMPer models of mature peptides to identify known and potentially novel antimicrobial sequences of bovine using expressed sequence tags (from the NCBI dbEST resource, http://www.ncbi.nlm.nih.gov/dbEST/ (Boguski, Lowe et al. 1993)) and genomic sequence (from the Baylor College of Medicine Human Genome Sequencing Center, http://www.hgsc.bcm.tmc.edu/projects/bovine).
These were translated into all six reading frames and scanned with each of the 146 AMP models. Results are presented here using the current dbEST resource containing 1,433,737 bovine ESTs (downloaded Aug 25, 2007). The models of mature peptides produced 5,628 matches with an E-value < 10, from 4,591 unique ESTs. Of these, 2,228 had an E-value < 1 and covered at least 25% of the length of the model.
We identified unique sequences by clustering the matched proteins using an all-vs-all comparison: each matched protein was compared to every other matched protein with blastp (Altschul et al. 1990). Where predicted peptides were at least 90% identical, we conservatively considered these to be the same antimicrobial peptide (at the risk of grouping together closely related peptides that are in fact distinct). By repeating this pairwise comparison, a total of 278 potential peptides were identified. From these 278 peptides, we selected those that were matched at high statistical significance (an HMM E-value <= 1e-5), resulting in the 124 potential peptides shown in Table 3.1.
We mapped the 34 known bovine AMPs using the full protein sequence (Table 3.2) to all predicted protein sequences from the ESTs using pairwise comparison (the blastp algorithm (Altschul et al. 1990)) to identify the most likely ESTs corresponding to each bovine AMP. We similarly mapped the 34 known bovine AMPs to those predicted protein sequences identified by AMPer as containing an AMP (Table 3.3). Since we expect these sequences to differ only due to artifacts such as sequencing errors, we called a match significant where there was at least 95% sequence identity between the known bovine AMP and the other sequence, and where the length of the matching region between the two sequences (reported by blastp) was at least 95% of the length of the shorter sequence (this was meant to allow for the untranslated regions of the mRNA). A total of 27 known bovine AMPs had significant matches to ESTs.
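The match criterion described above can be written down compactly. The following is only a minimal sketch, not the code used in this study: the function name is hypothetical, and it assumes that the length condition corresponds to the coverage percentage reported in Table 3.3 being at least 95%. The three example entries are the EST-side identity and coverage values listed in Table 3.3 for BD06_BOVIN, BDC7_BOVIN and BMA27_BOVIN.

```python
def is_significant_match(identity_pct, coverage_pct,
                         min_identity=95.0, min_coverage=95.0):
    """Return True when a blastp hit meets both thresholds.

    identity_pct: percent identity over the aligned region.
    coverage_pct: aligned length as a percentage of the shorter sequence
                  (assumed here to be the 'coverage %' column of Table 3.3).
    """
    return identity_pct >= min_identity and coverage_pct >= min_coverage


# (identity %, coverage %) of the best EST match, taken from Table 3.3.
est_matches = {
    "BD06_BOVIN": (100.0, 95.2),
    "BDC7_BOVIN": (94.34, 100.0),   # fails on identity
    "BMA27_BOVIN": (99.22, 81.0),   # fails on coverage
}

significant = {amp: is_significant_match(*values)
               for amp, values in est_matches.items()}
```

Under these assumptions only BD06_BOVIN of the three examples would be called a significant match.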
Since some AMPs are subsequences of other AMPs and ESTs may also be significantly shorter than the cDNA from which they are sequenced, it is difficult to determine uniquely which bovine AMPs were identified where multiple known bovine AMP sequences mapped to the same EST sequence. For four bovine AMPs the best matching EST was not unique (one other known bovine AMP also matched that EST most significantly of all ESTs). These are indicated by a '(2)' on four entries in Table 3.3. Similarly, a total of 27 known bovine AMPs were found to have significant matches to AMPs predicted by AMPer, though the list of known AMPs with no clearly matching predicted AMP is slightly different from the list of known AMPs with no clearly matching EST (24 had good matches to both ESTs and AMPer predictions).

AMPer Model  Peptide Families  Number of AMPs
17  15 kDa protein  1
66  Apolipoprotein A-II  12
106  Bactenecin  1
90  Beta-defensin  4
145  Beta-defensin  7
144  Beta-defensin, Spheniscin, LAP, TAP  6
35  Beta-defensin, Circulin-B  1
117  BPI, LBP (bactericidal permeability-increasing protein, lipopolysaccharide-binding protein)  29
116  Cathelicidin  1
87  Cathelin  9
18  Cathelin-related  1
133  Cysteine-rich antifungal protein  1
92  Eosinophil granule major basic protein  7
13  Granulysin, NK-lysin  5
27  Hemolin  2
64  Hepcidin  1
8  Histone H1, Uperin  5
12  Histone H2A  19
38  Histone H2A  8
24  Myeloid antibacterial peptide  2
95  Penaeidin, Liver-expressed AMP  1
39  Sperm-associated antigen 11  1

Table 3.1. Numbers of predicted antimicrobial peptides. An E-value threshold of 1e-5 was used to determine significance of an HMM match.

Uniprot ID  Description  AMPer model
APOA2_BOVIN  Apolipoprotein A-II.  66
BCTN1_BOVIN  Bactenecin-1.  19
BCTN5_BOVIN  Bactenecin-5.  106
BCTN7_BOVIN  Bactenecin-7.  106
BD01_BOVIN  Beta-defensin 1.  144
BD02_BOVIN  Beta-defensin 2.  144
BD03_BOVIN  Beta-defensin 3.  144
BD04_BOVIN  Beta-defensin 4.  144
BD05_BOVIN  Beta-defensin 5.  144
BD06_BOVIN  Beta-defensin 6.  144
BD07_BOVIN  Beta-defensin 7.  90
BD08_BOVIN  Beta-defensin 8.  90
BD09_BOVIN  Beta-defensin 9.  90
BD10_BOVIN  Beta-defensin 10.  145
BD11_BOVIN  Beta-defensin 11.  144
BD12_BOVIN  Beta-defensin 12.  144
BD13_BOVIN  Beta-defensin 13.  144
BDC7_BOVIN  Beta-defensin C7.  144
BMA27_BOVIN  Antibacterial peptide BMAP-27.  24
BMA28_BOVIN  Antibacterial peptide BMAP-28.  18
BMA34_BOVIN  Antibacterial peptide BMAP-34.  116
BPI_BOVIN  Bactericidal permeability-increasing protein.  117
CALT_BOVIN  Caltrin  -
CAS2_BOVIN  Casocidin-1 (now CASA2_BOVIN)  -
CCKN_BOVIN  Cholecystokinin  118
CMGA_BOVIN  Chromogranin-A  -
EAP_BOVIN  Enteric beta-defensin.  144
INDC_BOVIN  Indolicidin  -
LAP_BOVIN  Lingual antimicrobial peptide.  144
LBP_BOVIN  Lipopolysaccharide-binding protein.  117
LEAP2_BOVIN  Liver-expressed antimicrobial peptide 2.  95
PENK_BOVIN  Synenkephalin, Met-enkephalin.  5
SCG1_BOVIN  Secretogranin-1, Secretolytin, GAWK, BAM-1745.  133
TAP_BOVIN  Tracheal antimicrobial peptide.  144

Table 3.2. Known bovine antimicrobial peptides. Where a peptide does not belong to an AMPer cluster, a "-" is given.

Columns 2 to 5: Matched ESTs by BLAST. Columns 6 to 9: AMPer Predicted AMPs.
Known Bovine AMP  Matched EST*  % Identity  Match Length (coverage %)  Blast Evalue  Matched Predicted AMP*  % Identity  Match Length (coverage %)  Blast Evalue
APOA2_BOVIN  gi|75805025|gb|DT855734.1  100  100 (100.0)  7.20E-050  DBEST_AMP_1858  100  77 (100.0)  6.70E-041
BCTN1_BOVIN  gi|119554907|gb|EH155902.1  100  155 (100.0)  1.10E-086  DBEST_AMP_255  100  101 (100.0)  8.30E-058
BCTN5_BOVIN  gi|154772689|gb|EV792452.1  100  176 (100.0)  6.00E-102  DBEST_AMP_249  100  101 (100.0)  1.50E-057
BCTN7_BOVIN  gi|119563722|gb|EH164717.1  100  190 (100.0)  8.00E-111  DBEST_AMP_304  100  102 (100.0)  1.10E-057
BD01_BOVIN  gi|17892782|gb|BM257183.1  100  38 (100.0)  3.50E-019  DBEST_AMP_1047  100  36 (100.0)  7.70E-021
BD02_BOVIN  gi|7049236|gb|AW479130.1  100  40 (100.0)  5.30E-020  DBEST_AMP_1428  100  38 (100.0)  1.10E-021
BD03_BOVIN  gi|7049236|gb|AW479130.1 (2)  100  57 (100.0)  2.50E-029  DBEST_AMP_1428 (2)  100  38 (100.0)  7.40E-022
BD04_BOVIN  gi|154397167|gb|EV640446.1  100  63 (100.0)  6.20E-033  DBEST_AMP_901  100  36 (100.0)  3.00E-021
BD05_BOVIN  gi|17037442|gb|BM106372.1  100  64 (100.0)  2.50E-034  DBEST_AMP_860  100  37 (100.0)  8.90E-023
BD06_BOVIN  gi|119558511|gb|EH159506.1  100  40 (95.2)  9.00E-020  DBEST_AMP_2132  100  38 (100.0)  6.40E-022
BD07_BOVIN  gi|119561789|gb|EH162784.1  100  40 (100.0)  2.00E-019  DBEST_AMP_1576  100  38 (100.0)  1.50E-021
BD08_BOVIN  gi|119564671|gb|EH165666.1  100  38 (100.0)  3.90E-018  DBEST_AMP_308  100  38 (100.0)  1.50E-021
BD09_BOVIN  gi|119564671|gb|EH165666.1 (2)  98.18  55 (100.0)  9.20E-027  DBEST_AMP_308 (2)  97.37  38 (100.0)  5.10E-021
BD10_BOVIN  gi|42731051|gb|CK778738.1  98.39  62 (100.0)  1.00E-030  DBEST_AMP_139  100  36 (100.0)  1.50E-020
BD11_BOVIN  gi|119531287|gb|EH137278.1  100  60 (100.0)  4.60E-031  DBEST_AMP_209  100  37 (100.0)  6.70E-022
BD12_BOVIN  gi|74502222|gb|DT722637.1  97.37  38 (100.0)  6.60E-018  DBEST_AMP_1461  97.3  37 (100.0)  1.30E-020
BD13_BOVIN  gi|74502222|gb|DT722637.1 (2)  97.62  42 (100.0)  5.30E-020  DBEST_AMP_1461 (2)  97.3  37 (100.0)  1.20E-020
BDC7_BOVIN  gi|119531287|gb|EH137278.1 (2)  94.34  53 (100.0)  1.10E-024  DBEST_AMP_209 (2)  91.89  37 (100.0)  7.80E-020
BMA27_BOVIN  gi|120572158|gb|EH378295.1  99.22  128 (81.0)  1.80E-068  DBEST_AMP_383  98.98  98 (100.0)  6.80E-055
BMA28_BOVIN  gi|119558428|gb|EH159423.1  100  159 (100.0)  2.30E-087  DBEST_AMP_274  100  113 (100.0)  1.10E-063
BMA34_BOVIN  gi|61753367|emb|CR452179.2  100  165 (100.0)  1.70E-091  DBEST_AMP_478  100  129 (100.0)  8.30E-075
BPI_BOVIN  gi|119650848|gb|EH179456.1  99.61  254 (52.7)  2.00E-141  DBEST_AMP_1174  100  250 (100.0)  4.00E-146
CALT_BOVIN  gi|86366255|gb|DY165694.1  100  80 (100.0)  2.10E-040  DBEST_AMP_186  30.43  23 (28.8)  3.7
CAS2_BOVIN  gi|70828695|gb|DR712392.1  100  222 (100.0)  1.00E-124  DBEST_AMP_332  22.86  35 (31.2)  4.8
CCKN_BOVIN  gi|60967497|gb|DN524024.1  100  58 (100.0)  4.00E-027  DBEST_AMP_358  40.91  22 (21.8)  0.03
CMGA_BOVIN  gi|119653666|gb|EH182274.1  99.62  266 (59.2)  5.00E-152  -  -  -  -
EAP_BOVIN  gi|75771874|gb|DT822941.1  100  64 (100.0)  3.60E-033  DBEST_AMP_1091  100  36 (100.0)  3.20E-020
INDC_BOVIN  gi|119556821|gb|EH157816.1  100  144 (100.0)  2.00E-082  DBEST_AMP_542  100  100 (100.0)  2.70E-056
LAP_BOVIN  gi|154466011|gb|EV693095.1  100  64 (100.0)  6.80E-032  DBEST_AMP_746  100  38 (100.0)  1.10E-020
LBP_BOVIN  gi|82642070|gb|DV789175.1  93.44  380 (79.0)  0  DBEST_AMP_1816  93.44  380 (99.7)  0
LEAP2_BOVIN  gi|154538028|gb|EV742363.1  100  77 (100.0)  2.40E-039  DBEST_AMP_729  100  40 (100.0)  3.00E-021
PENK_BOVIN  gi|119686348|gb|EH206269.1  100  246 (93.5)  2.00E-144  DBEST_AMP_978  38.89  18 (60.0)  9.8
SCG1_BOVIN  gi|82827500|gb|DV893271.1  99.66  297 (46.0)  1.00E-177  DBEST_AMP_549  100  13 (100.0)  0
TAP_BOVIN  gi|154464382|gb|EV691466.1  100  64 (100.0)  1.10E-032  DBEST_AMP_753  100  36 (100.0)  9.20E-020

Table 3.3. Identification of known bovine host defense peptides in dbEST sequences. EST sequences were mapped to known bovine AMPs based on pairwise similarity using BLAST. * a '(2)' indicates that this is the second entry with the same identifier. The entries consisting of a dash (-) indicate no match was found.

3.2.2 Selection of predicted AMPs for confirmation
We manually examined the sequences of these predicted AMPs to identify peptides of interest for laboratory follow-up. Using on-line tools that we developed, we examined multiple alignments of these predicted AMPs alongside the following: the most similar known bovine AMP, the most similar AMP from any species (if different from bovine) and the peptides that were used to construct the AMPer model. (These are available from links on the bovine analysis pages at the AMPer site.) We chose two predicted AMPs for follow-up that appeared to be novel and belong to the cathelicidin family. Two ESTs corresponding to these predicted AMPs were identified for laboratory analysis of changes in expression due to infection as discussed below.
The first predicted AMP that we sought to confirm was DBEST_AMP_248, matched by model 17. This peptide sequence was compared to all proteins in Uniprot (Swiss-Prot and TrEMBL) using the on-line BLAST utility at http://www.expasy.org/tools/blast. Since we began this work, an entry containing DBEST_AMP_248 has been deposited in TrEMBL as A5PJH7_BOVIN (discussed below) based only on cDNA sequencing. The most similar peptide to DBEST_AMP_248 is an antimicrobial peptide found in rabbit, P15B_RABIT, designated as "15 kDa protein" (Levy et al. 1993) with 55% sequence identity and 99.2% coverage.
The most similar known bovine AMP, Bactenecin-7 (BCTN7_BOVIN, now called CTHL3_BOVIN in the current version of Uniprot), has only 33% sequence identity and 95.8% coverage. Based on earlier data, in place of DBEST_AMP_248, we examined the predicted AMP DBEST_AMP_397 and the EST sequence gi|12122965|gb|BF775065.1 (a slightly shorter sequence within the same cluster of predicted AMP sequences as DBEST_AMP_248). As shown in Figure 3.1, the translated EST sequence (BF775065.1) shows good alignment with the 15 kDa protein sequence and poorer alignment with the bovine peptide BCTN7_BOVIN. In Figure 3.1, the underlined regions indicate the region of mature peptide corresponding to the active antimicrobial peptide.
The second predicted novel AMP that we sought to confirm was identified from EST gi|15378291|gb|BI537181.1, as predicted peptide DBEST_AMP_416, matched by model 87. This predicted AMP matches a short region of the sequence for the known bovine AMP, Bactenecin-5 (BCTN5_BOVIN, now CTHL2_BOVIN in the current Uniprot). Examination of the translated EST sequence that was recognized by the AMPer model and produced DBEST_AMP_416 shows that it codes for a similar protein with differences near the N-terminus. The predicted sequence is shown in Figure 3.2, along with the proteins that were used to construct the AMPer model that recognized this peptide, and the closest matching known bovine AMP (BCTN5_BOVIN). We compared the EST sequence (232 nucleotides) for this predicted AMP to the current bovine genome in Ensembl (http://www.ensembl.org) and did not find a significant match except to a short region of the genomic sequence for Bactenecin-5: 52 positions on chromosome 22 (49,818,207 to 49,818,362) matched the EST from positions 27 to 78. This region overlaps with Bactenecin-5 exon 4 (ENSBTAE00000175540) 49,818,093 to 49,818,356 and extends 6 positions into intron 3-4.
Neighboring DNA regions on the chromosome did not contain additional flanking EST sequence that would be expected if the sequences were separated in the genome due to introns. However, the EST sequence matched a longer region of 77 nucleotides (EST region 17-90) on a sequence contig from whole genome shotgun (gi|112113766|gb|AAFC03064548.1| Ctg60.CH240-439A19). This suggests that the predicted AMP from DBEST_AMP_416 is from a novel gene that has not yet been incorporated into the genome assembly. Moreover, the sequence was originally found in expressed sequence; it therefore appears to be a true gene rather than a pseudogene, even though we could not identify the full gene sequence in the genome.

Figure 3.1. Multiple alignment of predicted host defense peptide DBEST_AMP_397. The predicted peptide DBEST_AMP_397 is shown aligned to all peptides in the AMPer cluster, the most similar AMP (P15B_RABIT), the most similar bovine AMP (BCTN7_BOVIN), and the EST that DBEST_AMP_397 was derived from (BF775065.1). Underlined sequence indicates the position of mature peptides within the proteins. The consensus sequence of the AMPer model is also shown (HMM_consensus).

Figure 3.2. Multiple alignment of predicted host defense peptide DBEST_AMP_416. The predicted peptide DBEST_AMP_416 is shown aligned to all peptides in the AMPer cluster, the most similar bovine AMP (BCTN5_BOVIN), and the EST that DBEST_AMP_416 was derived from (BI537181.1). Underlined sequence indicates the position of mature peptides within the proteins. The consensus sequence of the AMPer model is also shown (HMM_consensus).

3.2.3 Analysis of predicted novel AMP gene expression
We designed primers to detect and amplify RNA corresponding to these two putative AMPs along with two housekeeping genes (GAPDH and beta-actin) that serve as positive controls.
Quantitative real-time PCR (qRT-PCR) was performed using these primers on total RNA derived from bovine peripheral blood mononuclear cells (PBMC) and from tissue collected from the bovine small intestine (ileum). The intestinal tissue was sampled both prior to and 4 hours after challenge with S. typhimurium, with the infection performed as described previously (Coombes et al. 2005). Initial qRT-PCR products were run on agarose gel and showed faint bands (Figure 3.3). The qRT-PCR products were re-amplified using a 30-cycle TaqMan PCR protocol and visualized on gel (Figure 3.4). The DBEST_AMP_397 product is clearly visible and up-regulated in response to bacterial infection in intestinal tissue. However, the DBEST_AMP_416 product cannot be distinguished from negative control lanes in Figure 3.4, and the presence of two bands rather than the expected single band in Figure 3.3 suggests that the putative AMP product for DBEST_AMP_416 was not found.

Figure 3.3. Gel image of qRT-PCR for putative AMPs in blood and tissue. The DBEST_AMP_397 (P397) and DBEST_AMP_416 (P416) products are visible. B-actin lanes are positive control lanes and NTC lanes are "no template" controls.

Figure 3.4. Gel image of putative AMPs following TaqMan re-amplification. The DBEST_AMP_397 (P397) product is clearly visible in the infected tissue but not healthy tissue. While a difference is observed for DBEST_AMP_416 (P416) between healthy and infected tissue, the P416 lane does not produce a useful band and is not distinguishable from NTC. GAPDH lanes are positive control lanes and NTC lanes are "no template" controls.

3.2.4 Absence of alpha-defensins
Notably absent from Table 3.1 are any of the alpha-defensin peptide families (often described as simply "defensins"). There are several models in AMPer for mature peptides of this type including models 53, 98, 105 and 146 as well as subclasses such as cryptdins (model 75).
For example, AMPer model 146 is built from a set of 45 alpha-defensin peptides from 42 different Swiss-Prot proteins taken from eight mammalian species. The model matches these 45 peptides with high statistical significance (E-values are all less than 1e-10 with only two greater than 1e-20; see AMPer web site). However, the most significant match in the bovine EST sequences is to gi|82672759|gb|DV812566.1 with an E-value of 3.6e-4.
The analysis described here tolerates the presence of introns and will combine neighboring regions identified by an HMM to cover the length of the model and report the resulting peptide with a single ID. An example of an AMP containing introns that is correctly identified by AMPer is BD07_BOVIN. BD07_BOVIN contains one intron of 1460 nucleotides (487 amino acids when translated) and is identified from EST sequence by DBEST_EST_292 (model 90). The predicted AMP based on genomic sequence, GENOME_AMP_169, is identical in sequence to BD07_BOVIN but short by 2 amino acids (length of 38 vs 40) and produces an HMM E-value of 4e-23 (see web resources). In contrast, the most significant E-value for (alpha-defensin) model 146 against bovine genomic data is 4e-10, but the coverage of the model is low at only 69% and the predicted AMP sequence lacks the characteristic six-cysteine motif (see supplementary Table 3.5 for predicted AMPs based on genomic data with E-values less than 1e-5).

*  *  *

Host defense peptides of the innate immune system are important components for control of infection. Historically, host defense peptides have been described as antimicrobial peptides (AMPs); however, the important role of modulation of the innate immune response has come to the fore recently.
Natural host defense peptides are considered to be lead compounds in the search for agents that beneficially modulate inflammatory responses, both those directed against a pathogen and those that counter detrimental immune responses such as those involved in sepsis. The importance of these peptides in host defense and as the basis of possible novel therapeutics indicates the need for information about the numbers and types that are present, to gain further understanding of their roles in innate immunity.
In order to identify potentially novel host defense peptides, we used the hidden Markov models constructed for the AMPer resource to scan bovine expressed sequence tags and genomic sequence. The AMPer models represent groups of mature peptides as well as propeptides that are products of the parental prepropeptides due to processing after protein translation; there are 146 models of mature peptides and 40 models of propeptides representing classes and subclasses of peptides such as defensins and cathelicidins. In this study, we used the models for mature peptides only.
We are primarily concerned with identifying mature antimicrobial peptides for the purpose of structure-activity analysis. Therefore, we primarily relied upon EST sequences since they do not have the added complication of introns in predicted protein sequence. Since the same gene may lead to many ESTs, we sought to identify those unique sequences corresponding to a gene by grouping the predicted peptides based on sequence similarity. We chose a conservative threshold since we are interested in identifying novel AMPs, and are less interested in identifying close homologues of known bovine AMPs; in addition, EST sequencing is a single-pass process with sequencing errors of up to a few percent (Boguski et al. 1993), so true matches are not expected to match perfectly.
We considered ESTs to belong to the same host defense peptide where the matched regions of these ESTs were more than 90% identical over the region of the pairwise match. This threshold yielded a total of 278 potential peptides of varying statistical significance. The HMM E-value represents the number of false positive matches expected at a given threshold; using an HMM E-value threshold of 1e-5 (i.e. 1e-5 expected false positives for each of the 146 models) yields a prediction of up to 124 AMPs, including 32 matches to histone (from which the AMP buforin is derived (Kim et al. 2000)). There are 92 non-histone AMPs, a number that is feasible to review manually (Table 3.1). As well, this E-value threshold is large enough that sequences belonging to more distant homologues would not be discarded, but at the risk of including peptides that are only distantly related to AMPs and are not actually AMPs.
To determine which of these predicted AMPs correspond to the known bovine AMPs, we compared the known sequences by sequence similarity (blastp (Altschul et al. 1990)) against the predicted peptides from all ESTs and against the peptides identified by the AMPer models. Of the 34 known bovine AMPs (full length proteins, Table 3.2), a total of 27 have significant matches to ESTs. As well, 27 known bovine AMPs have significant matches to AMPs predicted by AMPer. The known AMPs with no significant match to ESTs are slightly different from those known AMPs with no significant match to a peptide identified by AMPer. Several known bovine AMPs were not identified in the EST data, presumably because they were not expressed in the tissues that were sampled for mRNA and used to construct the EST libraries.
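The grouping of predicted peptides at the 90% identity threshold described above can be illustrated with a small sketch. This is not the procedure used in the study: it assumes single-linkage grouping (any chain of pairwise hits above the threshold places peptides in the same group), it assumes the pairwise identities have already been obtained from an all-vs-all blastp run, and the function name and peptide labels are hypothetical.

```python
def group_by_identity(ids, hits, min_identity=90.0):
    """Group sequence IDs connected by pairwise hits at or above min_identity.

    ids:  iterable of all sequence identifiers (every ID used in hits must appear).
    hits: iterable of (query_id, subject_id, percent_identity) tuples.
    Returns a sorted list of sorted groups (single-linkage clustering).
    """
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity in hits:
        if query != subject and identity >= min_identity:
            parent[find(query)] = find(subject)

    groups = {}
    for seq_id in parent:
        groups.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in groups.values())


# Toy input: p1, p2 and p3 are linked by hits above 90%; p4 stays on its own.
example_hits = [("p1", "p2", 97.0), ("p2", "p3", 91.5), ("p3", "p4", 62.0)]
example_groups = group_by_identity(["p1", "p2", "p3", "p4"], example_hits)
# example_groups == [["p1", "p2", "p3"], ["p4"]]
```

Single linkage is the most permissive choice: it merges p1 and p3 even though they were never compared directly, which matches the conservative intent of treating near-identical ESTs as one peptide but can chain together more distant sequences.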
Of the three known AMPs (CALT_BOVIN, CAS2_BOVIN and CCKN_BOVIN) that appear to have been represented in the EST data set but missed by the AMPer search, only CCKN_BOVIN seems to have been missed due to inadequacy of the AMPer model: CALT_BOVIN and CAS2_BOVIN did not contribute mature peptides that were used in constructing AMPer models (for details of the AMPer construction algorithm see (Fjell et al. 2007)). Considering that a total of 95 non-histone AMPs were predicted and up to 27 known AMPs were found to have significant matching ESTs, there are up to 68 potentially novel AMPs identified in the EST set by the AMPer models at the threshold values we used.
We chose two predicted AMPs for follow-up that appear to be novel and belong to the cathelicidin family, a group of peptides of special interest to us. We chose two ESTs corresponding to these predicted AMPs for RT-PCR analysis of gene transcription as well as changes in gene expression following infection. (Note that since this work began, significantly more bovine sequence has become available and slightly different ESTs might have been chosen based on current data.) We demonstrated that one of these, DBEST_AMP_397, is expressed in response to infection. When compared to all proteins found in Uniprot (both Swiss-Prot and TrEMBL), this predicted peptide is most similar to the '15 kDa protein' AMP found in rabbit and is of a class of AMP not previously described for bovine. Since our work began on AMPs in bovine, this peptide (DBEST_AMP_397) has been predicted based on sequencing of cDNA from a thymus sample and submitted to the TrEMBL database of Uniprot as A5PJH7_BOVIN (http://www.expasy.org/uniprot/A5PJH7) by the Mammalian Gene Collection project (http://mgc.nci.nih.gov/). Here, we report that we have independently identified this peptide using the AMPer resource and demonstrated that it is up-regulated in the small intestine in response to infection.
We did not find the second predicted AMP we attempted to confirm in the tissues we sampled, and we did not locate its sequence in the current genome assembly. However, the sequence was found in whole genome shotgun sequence that was not incorporated into the current bovine assembly. Since it was originally found in expressed sequence, it appears to be a true gene rather than a pseudogene.
We did not identify any AMP sequences for alpha-defensins in bovine EST sequence, strongly suggesting that alpha-defensins are not present in the EST dataset we used. In addition, when we scanned translated genomic sequence we also did not find evidence for alpha-defensins. The analysis we performed did account for the presence of introns in constructing AMP predictions: for example, beta-defensins were found reliably despite the presence of intron sequence. Since we cannot attribute the lack of alpha-defensins identified using the AMPer models to any technical deficiency (and additionally we cannot find reference to any bovine alpha-defensins in the literature), we conclude that these results indicate that the bovine genome lacks this important class of host defense peptide. Other mammalian species such as mouse are known to lack neutrophil-derived alpha-defensins (Eisenhauer and Lehrer 1992). Previous reports have speculated that alpha-defensins are found only in the primate and glires (rodents and lagomorphs) lineages (Patil A 2004; Xiao et al. 2004; Patil AA 2005), while more recent reports have identified alpha-defensins in a wide range of diverse mammals such as opossum (Belov K et al. 2007), elephant and hedgehog tenrec (Lynn DJ and DG 2007), and the horse (Looft C et al. 2006), a close evolutionary cousin to bovine. This suggests that the bovine genome has lost alpha-defensins from an ancestor through evolution, rather than being on a lineage where alpha-defensins were never present.
3.3 Conclusions

We have used the HMM models from the AMPer resource to scan the draft bovine genome and bovine expressed sequence tags from the dbEST data set. To further characterize the peptides, we have identified the most similar known AMP for each predicted peptide. The AMPer models identified 27 of the 34 known bovine antimicrobial peptides. An additional 68 potential peptides were identified that appear to be previously unidentified AMPs, for a total of 102 AMPs. We sought to experimentally verify two of these that belong to the cathelicidin family. One of these, DBEST_AMP_397, was clearly identified in qRT-PCR product and was found to be upregulated in bovine intestinal tissue following challenge with S. typhimurium. One other putative AMP (DBEST_AMP_416) was not confirmed in blood mononuclear cells or small intestine. In addition to the identification of unrecognized AMPs, our results suggest that bovine lacks alpha-defensins. The novel antimicrobial peptide, DBEST_AMP_397, was also predicted by the Mammalian Gene Collection project as part of an effort to provide full-length clones to investigators for a limited number of organisms (human, rat, mouse and bovine). This serves to confirm the utility of the AMPer approach to identifying novel AMPs: by examining the large resource of low quality EST sequence, we have identified a novel peptide that was added to the major sequence databases only recently, after a high quality cDNA sequencing project. This suggests that a large number of additional peptides might be identified from publicly available data that will not be added to major databases for some time. These results indicate the effectiveness of in silico screening with software resources such as AMPer that are tailored to specific interests of the community, in this case, investigators examining peptides of the innate immune system.
The hidden Markov models used by AMPer are freely available to investigators and straightforward to use (see http://www.cnbi2.com/cgi-bin/amp.pl). Future work on AMPer will include automation of the steps involved in the study described here, and its application to larger numbers of organisms.

3.4 Methods and materials

3.4.1  Set of known antimicrobial peptides

We considered the set of known antimicrobial peptides to be derived from the 1135 proteins in Uniprot identified during construction of the AMPer resource (described previously (Fjell et al. 2007)); these are the 980 protein IDs from AMSdb combined with additional proteins identified by AMPer that were found to have some support for antimicrobial or host defense activity in the literature. These are available at the AMPer web site (http://www.cnbi2.com/cgi-bin/amp.pl).

3.4.2  Creation of AMPer

The AMPer resource has been described previously (Fjell et al. 2007). Briefly, the 980 Uniprot protein IDs from AMSdb were considered to contain all known AMPs. Mature peptides and propeptides were identified from these proteins using Uniprot annotations of peptide positions within the proteins. The peptides were compared to one another based on pairwise sequence similarity and grouped based on this similarity. For each group, a hidden Markov model was created using the HMMER software package. These models were used to iteratively scan Swiss-Prot to identify additional peptides that were not currently identified in the set of AMPs. Uniprot annotations were reviewed for proteins that were identified; where annotations suggested antimicrobial activity, these were added to what was considered the set of known AMPs and used to update the AMPer hidden Markov models. Only the 146 hidden Markov models corresponding to mature peptides were used to search bovine sequence.
3.4.3  Bovine genomic and EST sequences

We present here the results in the context of the current versions of the bovine genome and EST set. The bovine genome was downloaded from ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/fasta/Btau200708xx/LinearScaffolds. Preliminary work used the draft bovine genome sequence obtained from ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/fasta/Btau20050310-freeze/linearScaffolds/. ESTs were obtained from the NCBI dbEST resource, downloaded Aug 25, 2007 from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/est_others.gz. Bovine ESTs (numbering 1,433,737) were identified as those containing the annotation 'Bos taurus cDNA'. Preliminary work used the same resource downloaded October 2006. The EST sequences were translated into predicted protein sequences in all six reading frames using software from the BioJava project (http://www.biojava.org).

3.4.4  Prediction of AMPs in ESTs

Predicted protein sequences from ESTs were scanned using the 146 AMPer models for mature peptides using the HMMER utility, hmmsearch (Durbin et al. 1998). Regions of sequence matched by a model ('predicted peptides') were examined to identify likely AMPs as follows. Predicted peptides that were less than 25% of the model length were excluded from consideration since they were considered unlikely candidates as AMPs and more likely to represent conserved protein domains instead. Each matched EST was interpreted as a predicted AMP and assigned an identifier of the form DBEST_AMP_n, where n is an integer. In addition, multiple ESTs may correspond to the same gene product and may differ due to sequencing errors and different lengths of sequencing reads (ESTs are single reads of a cDNA). Therefore, peptide sequences matched by a model were clustered into groups to represent a single predicted AMP based on similarity of the sequences.
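As an illustration of the six-frame translation and the 25% length filter described in this section, a minimal Python sketch follows. It is not the BioJava code used in this work: the function names, the use of 'X' for codons containing ambiguous bases, and the example lengths in the test values are assumptions made only for illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; stop codons become '*', unknown codons 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def passes_length_filter(match_length, model_length, min_fraction=0.25):
    """Keep a predicted peptide only if it spans at least 25% of the model length."""
    return match_length >= min_fraction * model_length
```

Each translated frame would then be written out as a separate protein sequence for scanning with hmmsearch, and matches shorter than a quarter of the model length would be discarded.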
Predicted peptides were added to groups where each peptide was at least 90% identical to every other peptide in the group over the length of the peptide (or the smaller peptide if they varied in length). A pairwise BLAST blastp comparison was used (Altschul et al. 1990). Each group of similar predicted peptides was conservatively considered a single antimicrobial peptide. The longest predicted peptide was taken as the representative of each group of similar predicted peptides.

3.4.5  Prediction of AMPs in genomic sequence

The draft genome sequence of bovine was also scanned with the AMPer models of mature peptides using the HMMER utility, hmmsearch (Durbin et al. 1998), with the total number of sequences specified (using the parameter "-Z 922") to account for matches against the sequence database that spans many files. Genomic sequence contains introns, regions that are removed from the mRNA and hence are not translated into protein. However, the predicted protein sequence used for searching included intron sequence; therefore, the protein sequences matched by a model will be fragments of a mature peptide corresponding to exons. To account for intron and exon sequence within the genome, predicted peptides were constructed from multiple matching regions within 1000 amino acid positions of each other that cover the length of the AMPer model. Overlap between regions of matches for different models was not allowed. Predicted antimicrobial peptides based on genomic sequence were identified as GENOME_AMP_n where n is an integer.

3.4.6  Comparison of predicted AMPs to known AMPs

We wished to identify which of the predicted AMPs corresponded to known AMPs. The predicted AMPs were compared to known bovine AMPs by pairwise sequence comparison using the blastp algorithm of the BLAST package (Altschul et al. 1990). Significance of a match was taken as the E-value reported by blastp.
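The grouping rule of Section 3.4.4 (every member at least 90% identical to every other member, longest member as representative) can be sketched in Python as follows. This is a simplified illustration rather than the procedure used in this work: identity is computed here from the best ungapped placement of the shorter peptide along the longer one, as a stand-in for the blastp comparison, and the greedy longest-first ordering and all function names are assumptions.

```python
def percent_identity(a, b):
    """Fraction of identical residues over the shorter peptide (best ungapped offset)."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0.0
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(x == y for x, y in zip(short, long_[offset:]))
        best = max(best, matches)
    return best / len(short)


def cluster_peptides(peptides, threshold=0.90):
    """Greedy grouping: a peptide joins a group only if it meets the
    threshold against every peptide already in that group."""
    groups = []
    for pep in sorted(peptides, key=len, reverse=True):
        for group in groups:
            if all(percent_identity(pep, member) >= threshold for member in group):
                group.append(pep)
                break
        else:
            # No existing group accepted the peptide, so start a new one.
            groups.append([pep])
    return groups


def representatives(groups):
    """The longest peptide in each group stands for the predicted AMP."""
    return [max(group, key=len) for group in groups]
```

Requiring agreement with every member, rather than with a single member, keeps groups tight, which matches the conservative choice of counting each group as one antimicrobial peptide.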
Coverage of the two sequences was also calculated to assess the extent of the pairwise match, giving the extent of the matched region in comparison to the length of the known AMP and the AMPer model. Coverage is calculated as the alignment length divided by the maximum possible alignment length (the minimum sequence length between the known AMP sequence and the predicted AMP sequence). For each known bovine AMP, the best-matching (lowest E-value) AMP predicted from the dbEST data set was identified. A match was considered good if the alignment had at least 95% identity over at least 95% coverage. For each AMP predicted from the dbEST data, the best-matching known AMP (of any organism) and best-matching known bovine AMP were identified, taking the matches with lowest E-values as the best matches. These are reported on the web pages linked from the summary page at http://www.cnbi2.com/cgi-bin/amp.pl?dbests=hits. The on-line tools allow predicted AMPs to be viewed in the context of the multiple alignment (generated by ClustalW, v 1.83, (Thompson et al. 1994)) containing the predicted AMPs of the model, all known AMPs of the model, the HMM consensus sequence for the model, best-matching AMPs to the predicted AMP and any AMPs predicted from the bovine genomic data that have a significant blastp match to the AMP predicted from dbEST data.

3.4.7  Identification of novel AMPs

For each class of AMP, the multiple sequence alignment (generated by ClustalW) was viewed and unique predicted AMPs were identified by eye, by requiring significant differences to be visible in the alignment between the predicted AMP, all other predicted AMPs and the known bovine AMPs.
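The coverage calculation and the match criterion of Section 3.4.6 can be written compactly. The following Python sketch is illustrative only; the function names and the dictionary layout used for a blastp hit are assumptions and are not part of the AMPer code.

```python
def coverage(alignment_length, known_length, predicted_length):
    """Alignment length divided by the maximum possible alignment length,
    i.e. the length of the shorter of the two sequences."""
    return alignment_length / min(known_length, predicted_length)


def is_good_match(identity, alignment_length, known_length, predicted_length,
                  min_identity=0.95, min_coverage=0.95):
    """A match is 'good' at >= 95% identity over >= 95% coverage."""
    cov = coverage(alignment_length, known_length, predicted_length)
    return identity >= min_identity and cov >= min_coverage


def best_hit(hits):
    """Pick the hit with the lowest E-value; each hit is a dict with an 'evalue' key."""
    return min(hits, key=lambda hit: hit["evalue"])
```

Normalizing by the shorter sequence means that a short predicted fragment that aligns fully to part of a longer known AMP still receives full coverage, which suits EST-derived peptides that are often partial.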
To determine whether the putative novel AMPs had been previously identified, we used the NCBI website (http://www.ncbi.nlm.nih.gov/BLAST/) to search for the sequences in the NCBI nr (non-redundant) databank, which contains all non-redundant GenBank CDS translations, Refseq, PDB, Swiss-Prot, PIR and PRF.

3.4.8  Pairwise comparison of known AMPs to bovine sequence

The set of 1135 known AMPs was used to search for similar sequences in the translated bovine genome and EST sequences using blastp of the BLAST package. For genome scanning, the total number of sequences was corrected using the parameter "-z 922". The most significant matches (lowest E-values) are reported along with coverage calculated as the alignment length divided by the length of the known AMP. Only matches with E-values < 1e-5 were considered, to restrict the matches to close matches and limit the number of results returned.

3.4.9  Analysis of AMP gene expression

Total RNA was extracted from bovine intestinal tissue and bovine peripheral blood mononuclear cells (PBMC) as described previously (Aich et al. 2005) and RNA was isolated using an RNeasy Mini Kit (Qiagen Inc., Ontario, Canada). The intestinal samples were collected both prior to and 4 hours after challenge with S. typhimurium using the infection model developed by Coombes et al. (Coombes et al. 2005). Isolated RNA samples were eluted and stored in RNase-free water (Ambion Inc., Austin, Texas) at -80 °C until further use. The RNA concentration, integrity and purity were assessed by determining the OD260/280 ratio with a BioPhotometer (Eppendorf, Hamburg, Germany) in addition to analysis on a 1% agarose gel and Bioanalyzer (Agilent, USA). Quantitative real-time PCR (qRT-PCR) was performed using Invitrogen's SuperScript™ III Platinum two-step qRT-PCR kit with SYBR-Green on the ABI 7300 Real Time PCR System (Applied Biosystems, Foster City, CA) as described previously (Mookherjee et al. 2006).
Endogenous housekeeping genes, GAPDH and beta-actin, were used for normalization and determination of fold changes of the respective AMPs using the comparative threshold cycle method (Pfaffl, 2001). The qRT-PCR products were run on a 2% agarose gel to verify the presence of gene products. All primers used for qRT-PCR were designed using Primer3 v.0.3.0 (Rozen and Skaletsky, 2000), except those for beta-actin, which were designed earlier (Whale et al. 2006). The primers are listed in Table 3.4.

Bovine gene     Accession number (a)  Primer direction  Primer sequence (5'-3')
GAPDH           BC102589              Forward           AGATGGTGAAGGTCGGAGTG
                                      Reverse           GATCTCGCTCCTGGAAGATG
Beta-Actin      AF191490              Forward           CTAGGCACCAGGGCGTAATG
                                      Reverse           CCACACGGAGCTCGTTGTAG
DBEST_AMP_397   XM_586989             Forward           TCGTGGTGGAGTTCAAATCA
                                      Reverse           GCTTGGAAGGCACTGGTACT
DBEST_AMP_416   BC120477              Forward           GGATTGGTGGAGGAAATCTG
                                      Reverse           GAATGGGCTGGTGAAACAGT

Table 3.4. Bovine primers used for qRT-PCR. (a) Accession numbers are from NCBI (http://www.ncbi.nlm.nih.gov).

3.4.10  Informatics

All calculations were performed in a Linux or Mac OS X environment using custom Java, Python, Perl or BASH code. Data were stored in a MySQL database for manipulation and presentation via Perl CGI scripts on an Apache web server running on a Linux server at http://www.cnbi2.com.

3.5 Acknowledgments

We gratefully acknowledge financial support from the Canadian Institutes for Health Research (CIHR) and from Genome BC for the Pathogenomics of Innate Immunity research program. CDF is supported by a Doctoral Research Award from CIHR. KH received a CIHR postdoctoral fellowship. REWH was the recipient of a Canada Research Chair.
3.6 Web resources

AMPer: http://www.cnbi2.com/cgi-bin/amp.pl
Baylor College of Medicine Human Genome Sequencing Center, bovine genome: http://www.hgsc.bcm.tmc.edu/projects/bovine
NCBI dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
BioJava: http://www.biojava.org

3.7 Supplementary table

Predicted AMP   HMM E-value  Model coverage  Matched sequence                Chromosome  Strand  Position start [na]  Position end [na]
GENOME_AMP_139  4.20E-010    0.69            CKDRESRIGSCFYNGVLLSL            Chr26       +       38412814             38412873
GENOME_AMP_248  6.90E-009    0.55            CFCQFNHCFRGERMFG                ChrUn.51    -       283701               283746
GENOME_AMP_34   2.70E-008    0.55            CFCRARLCFTDEKLYG                Chr13       +       42595672             42595719
GENOME_AMP_102  4.20E-008    0.48            TCRLNDALHPLCPR                  Chr22       +       54964399             54964440
GENOME_AMP_220  4.60E-008    0.93            RSPFCSSGSDTGEKRSGSCVRNRLLTHCCS  Chr7        +       56856289             56856378
GENOME_AMP_5    9.10E-008    0.52            GYCELGEMLWNLCPR                 Chr1        -       84535783             84535825
GENOME_AMP_11   1.00E-007    0.59            ASGYCTGQHRLHFHCCR               Chr10       +       12224735             12224785
GENOME_AMP_223  1.10E-007    0.48            TCRLPGLRHAMCCR                  Chr7        -       47423547             47423586
GENOME_AMP_187  1.30E-007    0.93            CTCQEGACQSPEMRGLCRKSARVWGL      Chr3        +       4341074              4341151
GENOME_AMP_21   1.40E-007    0.66            CFCRWALCLTDPVHSGTCT             Chr11       -       97411077             97411131
GENOME_AMP_36   1.40E-007    0.59            CSCHRPHCGV*EVLSGS               Chr13       -       55411562             55411610
GENOME_AMP_152  1.90E-007    0.45            CFCRIWGCPGGES                   Chr27       -       35393152             35393188
GENOME_AMP_200  2.60E-007    0.59            HNGACTHRGEMATLCPR               Chr4        -       31779220             31779268
GENOME_AMP_43   2.80E-007    0.52            G*CIVRRALHPFCCR                 Chr14       +       67433469             67433513
GENOME_AMP_74   2.90E-007    0.48            CTCRDAVCAQREKM                  Chr19       +       22908125             22908166
GENOME_AMP_204  3.70E-007    0.62            CRCPSLACDTLEVASGMC              Chr5        +       83595081             83595134
GENOME_AMP_136  5.80E-007    0.41            TFNGTFYSLCCS                    Chr25       +       35141361             35141396
GENOME_AMP_143  5.80E-007    0.55            SGYCK*N*RIVRLCCG                Chr26       -       13777645             13777690
GENOME_AMP_62   5.90E-007    0.55            NGRCG*NHLLHLLCPR                Chr17       -       6019846              6019891
GENOME_AMP_205  6.40E-007    0.41            EKMGDIYRLCCR                    Chr5        -       94990448             94990481
GENOME_AMP_256  6.80E-007    0.62            RMEGFCGLGAVL*AQCCR              ChrX        -       34472939             34472990
GENOME_AMP_22   7.20E-007    0.48            FCIYKDRFHSLCCS                  Chr11       -       16621399             16621438
GENOME_AMP_253  8.00E-007    0.45            CGSDGRVYLLCCR                   ChrUn.93    +       53978                54016
GENOME_AMP_73   8.40E-007    0.48            TCSLSH*SYVLCCR                  Chr19       +       37417075             37417116
GENOME_AMP_234  8.70E-007    0.59            CYCVDTLCALLERQSGA               Chr9        +       29829331             29829381
GENOME_AMP_236  8.70E-007    0.59            CYCVDTLCALLERQSGA               Chr9        -       29905196             29905244
GENOME_AMP_172  8.90E-007    0.45            CRSHHTLSTLCCR                   Chr28       +       26472568             26472606
GENOME_AMP_44   9.70E-007    0.31            GTIWPLCCR                       Chr14       -       14123849             14123873
GENOME_AMP_188  1.00E-006    0.41            ELGQAIYSLCCR                    Chr3        +       83833191             83833226
GENOME_AMP_228  1.00E-006    0.52            GTCFM*SRRESLCCR                 Chr8        +       107646187            107646231
GENOME_AMP_231  1.00E-006    0.52            GTCFM*SRRESLCCR                 Chr8        -       107680851            107680893
GENOME_AMP_4    1.20E-006    0.66            ETQRGTCFVLQSLAPLCC*             Chr1        +       130678770            130678826
GENOME_AMP_27   1.20E-006    0.34            CYCRIFVCLS                      Chr12       -       43568385             43568412
GENOME_AMP_49   1.20E-006    0.66            ENRDGHCASEGLIHPLCCA             Chr15       -       76976736             76976790
GENOME_AMP_61   1.20E-006    0.38            KNHTFYMLCCS                     Chr17       -       22463118             22463148
GENOME_AMP_229  1.30E-006    0.48            TCFTNHLLGPLCCR                  Chr8        +       61599012             61599053
GENOME_AMP_255  1.30E-006    0.45            CTQSHRLAQLCCR                   ChrX        +       23114562             23114600
GENOME_AMP_173  1.40E-006    0.45            CQF*GVMVRLCCR                   Chr28       +       12795050             12795088
GENOME_AMP_203  1.40E-006    0.69            CVCRSEICLLRQHIYGSCFL            Chr5        +       1402720              1402779
GENOME_AMP_37   1.80E-006    0.34            GHTLWSLCCR                      Chr13       -       51262512             51262539
GENOME_AMP_63   1.80E-006    0.52            CHCKSRGCLRREKVN                 Chr18       +       37678594             37678638
GENOME_AMP_65   2.00E-006    0.41            CQCRRPLCPRGE                    Chr18       +       5247825              5247860
GENOME_AMP_174  2.00E-006    0.55            FGVCFQGRVHWLCCK                 Chr28       -       32934074             32934116
GENOME_AMP_197  2.00E-006    0.41            TKHSRFHRLCCR                    Chr4        +       109553651            109553686
GENOME_AMP_257  2.00E-006    0.34            NG*IYILCCR                      ChrX        -       14040157             14040184
GENOME_AMP_12   2.20E-006    0.62            CWCWEGGCKRGEHLEGGC              Chr10       -       398902               398953
GENOME_AMP_235  2.20E-006    0.45            CFSSGLIVSLCCR                   Chr9        +       46245935             46245973
GENOME_AMP_76   2.30E-006    0.41            EISGLRWYFCCR                    Chr19       -       18649083             18649116
GENOME_AMP_42   2.40E-006    0.52            CFC*QPSCKTGESAS                 Chr14       +       79033828             79033872
GENOME_AMP_105  2.40E-006    0.41            CFCRHTLCIFGE                    Chr22       -       5907881              5907914
GENOME_AMP_86   2.50E-006    0.31            GAFYVLCCR                       Chr2        -       10362969             10362993
GENOME_AMP_130  2.60E-006    0.45            CDYGLILYTLCCR                   Chr24       -       10203402             10203438
GENOME_AMP_221  2.60E-006    0.31            GRFWRLCCR                       Chr7        +       98996633             98996659
GENOME_AMP_87   2.70E-006    0.38            VRHRLHSLCCR                     Chr2        -       38157475             38157505
GENOME_AMP_25   2.80E-006    0.62            REHMYGYCNREGLILNLC              Chr12       +       70532686             70532739
GENOME_AMP_84   2.90E-006    0.48            ACEKRRLIYTCCPR                  Chr2        +       114381586            114381627
GENOME_AMP_85   2.90E-006    0.41            FYKHSFHRLCCR                    Chr2        +       126851343            126851378
GENOME_AMP_237  2.90E-006    0.45            CSCREFVCVFGES                   Chr9        -       45591135             45591171
GENOME_AMP_38   3.30E-006    0.41            ELNGRTHSRCCR                    Chr13       -       77632369             77632402
GENOME_AMP_67   3.30E-006    0.41            ELQH*LYTRCCR                    Chr18       -       55288966             55288999
GENOME_AMP_75   3.40E-006    0.62            RRRRCPPIEKVIGVCKLG              Chr19       +       55345221             55345274
GENOME_AMP_178  3.50E-006    0.48            FCFVNRFIYTLCCA                  Chr29       -       38808844             38808883
GENOME_AMP_222  3.60E-006    0.52            CFCHSPSCGSGEAAS                 Chr7        +       35208018             35208062
GENOME_AMP_97   3.70E-006    0.34            GILIYPLCCR                      Chr21       -       13819732             13819759
GENOME_AMP_198  3.70E-006    0.38            VEIRVYVLCCR                     Chr4        +       98755398             98755430
GENOME_AMP_104  4.20E-006    0.41            ELQELLWRLCCR                    Chr22       +       5571048              5571083
GENOME_AMP_254  4.30E-006    0.52            GSCRLSHQVARLCCL                 ChrX        +       9279319              9279363
GENOME_AMP_121  4.60E-006    0.86            GHCWPRAESRGACSTAGTLWSLCCM       Chr23       -       5851633              5851705
GENOME_AMP_93   4.70E-006    0.48            VCTLGNSIYMICPR                  Chr20       +       43613148             43613189
GENOME_AMP_196  4.70E-006    0.55            CRCWSRGCVALEQL*G                Chr4        +       1468702              1468749
GENOME_AMP_214  5.00E-006    0.41            TMNILVYALCCR                    Chr6        -       84116009             84116042
GENOME_AMP_177  5.10E-006    0.62            CTCKTSREKSIERWYGFC              Chr29       -       520862               520913
GENOME_AMP_60   5.40E-006    0.52            ACRGPACASGEQLS                  Chr17       +       67973195             67973236
GENOME_AMP_251  5.60E-006    0.41            CRCRKPICGHGE                    ChrUn.89    +       394598               394633
GENOME_AMP_224  5.80E-006    0.41            EYNEVVWPLCCR                    Chr7        -       20126494             20126527
GENOME_AMP_35   6.20E-006    0.48            SCLKNGRR**LCCS                  Chr13       +       45930102             45930143
GENOME_AMP_141  6.20E-006    0.41            CVCRRTLCVTLE                    Chr26       -       37906649             37906682
GENOME_AMP_142  6.20E-006    0.41            CVCRRTLCVTLE                    Chr26       -       37948377             37948410
GENOME_AMP_20   6.40E-006    0.38            CYCRKVVCLQG                     Chr11       +       95105110             95105142
GENOME_AMP_26   6.40E-006    0.38            INGDIYSICCR                     Chr12       +       59338218             59338250
GENOME_AMP_135  6.80E-006    0.79            CFCRRS*ECLFSEPRIGLCGVSPR        Chr25       +       38183668             38183739
GENOME_AMP_140  7.10E-006    0.59            CLCRTIFCTSGEKPLGS               Chr26       +       14708927             14708977
GENOME_AMP_106  7.70E-006    0.59            TCRRGSCLEGEEVLGV                Chr22       -       56755276             56755321
GENOME_AMP_239  7.80E-006    0.38            CACRTPSCLGG                     ChrUn.110   +       494582               494614
GENOME_AMP_213  8.20E-006    0.45            CDI*ERIV*LCCR                   Chr6        +       77956896             77956934
GENOME_AMP_199  8.30E-006    0.69            CLCRIQRCQRLGPARGVCRL            Chr4        -       112261775            112261832
GENOME_AMP_64   8.40E-006    0.41            ELQH*LYARCCR                    Chr18       +       55762646             55762681
GENOME_AMP_66   8.40E-006    0.41            ELQH*LYARCCR                    Chr18       -       55571750             55571783
GENOME_AMP_129  8.40E-006    0.66            ERVLGSCF*NITM*P*CCL             Chr24       -       28424783             28424837
GENOME_AMP_48   8.70E-006    0.38            CFCRIPLCDPL                     Chr15       +       73183693             73183725
GENOME_AMP_151  9.00E-006    0.41            CRCRQPACGFSE                    Chr27       +       5037651              5037686
GENOME_AMP_175  9.10E-006    1.03            FCRSF*CQT*ENPSGFLHLLLHTICCD     Chr28       -       29239411             29239489
GENOME_AMP_3    9.40E-006    0.79            CVCRHRA*VPLESPKGSCLLGGL         Chr1        +       137624732            137624800
GENOME_AMP_96   9.60E-006    0.41            CQCRWRRCKSRE                    Chr21       +       22225095             22225130
GENOME_AMP_238  9.70E-006    0.45            CRLLIVMASLCCR                   Chr9        -       13006906             13006942
GENOME_AMP_230  9.80E-006    0.45            CGFSGLTWLLCCR                   Chr8        -       15419900             15419936
GENOME_AMP_250  9.80E-006    0.34            CYCRISVCKT                      ChrUn.7     -       1270680              1270707
GENOME_AMP_241  9.90E-006    0.45            CDS*GRIYTCCCK                   ChrUn.26    -       249417               249453

Table 3.5. Most significant matches of AMPer model 146 to bovine genome sequence.

3.8 References

Aich, P., Wilson, H. L., Rawlyk, N. A., Jalal, S., Kaushik, R. S., Begg, A. A., Potter, A. A., Babiuk, L. A., Abrahamsen, M. S. and Griebel, P. J. (2005). Microarray analysis of gene expression following preparation of sterile intestinal loops in calves. Can. J. Anim. Sci., 85: 13-22.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215: 403-410.
Bechinger, B. (1997). Structure and function of channel-forming peptides: magainins, cecropins, melittin and alamethicin. J Membrane Biol., 156: 197-211.
Bechinger, B. (1999). The structure, dynamics and orientation of antimicrobial peptides in membranes by multidimensional solid-state NMR spectroscopy. Biochim. Biophys. Acta, 1462: 157-183.
Belov K, Sanderson CE, Deakin JE, Wong ES, Assange D, McColl KA, Gout A, de Bono B, Barrow AD, Speed TP, Trowsdale J and AT, P. (2007). Characterization of the opossum immune genome provides insights into the evolution of the mammalian immune system. Genome Res., 17: 982-991.
Blondelle, S. E., Lohner, K. and Aguilar, M. I. (1999). Lipid-induced conformation and lipid-binding properties of cytolytic and antimicrobial peptides: determination and biological specificity. Biochim. Biophys. Acta, 1462: 89-108.
Boguski, M. S., Lowe, T. M. J. and Tolstoshev, C. M. (1993). dbEST — database for expressed sequence tags. Nature Genetics, 4: 332-333.
Bowdish, D. M., Davidson, D. J.
and Hancock, R. E. (2005). A re-evaluation of the role of host defence peptides in mammalian immunity. Curr. Protein Pept. Sci., 6: 35-51.
Brogden, K. A. (2005). Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol., 3: 238-250.
Coombes, B. K., Coburn, B. A., Potter, A. A., Gomis, S., Mirakhur, K., Li, Y. and Finlay, B. B. (2005). Analysis of the contribution of Salmonella pathogenicity islands 1 and 2 to enteric disease progression using a novel bovine ileal loop model and a murine model of infectious enterocolitis. Infect. Immun., 73: 7161-7169.
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK, Cambridge University Press.
Eisenhauer, P. B. and Lehre, R. I. (1992). Mouse neutrophils lack defensins. Infect. Immun., 60: 3446-3447.
Epand, R. M. and Vogel, H. J. (1999). Diversity of antimicrobial peptides and their mechanisms of action. Biochim. Biophys. Acta, 1462: 11-28.
Fjell, C. D., Hancock, R. E. and Cherkasov, A. (2007). AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 23: 1148-1155.
Hancock, R. E. (2003). Concerns regarding resistance to self-proteins. Microbiology, 149: 3343-3344.
Hancock, R. E. and Chapple, D. S. (1999). Peptide Antibiotics. Antimicrob. Agents Chemother., 43: 1317-1323.
Hancock, R. E. and Lehrer, R. (1998). Cationic peptides: a new source of antibiotics. Trends Biotechnol., 16: 82-88.
Jenssen, H., Hamill, P. and Hancock, R. E. W. (2006). Peptide Antimicrobial Agents. Clinical Microbiol. Rev., 19(3): 491-511.
Khush, R. S., Leulier, F. and Lemaitre, B. (2001). Drosophila immunity: two paths to NFkappaB. Trends Immunol., 22: 260-264.
Kim, H. S., Yoon, H., Minn, I., Park, C. B., Lee, W. T., Zasloff, M. and Kim, S. C. (2000). Pepsin-Mediated Processing of the Cytoplasmic Histone H2A to Strong Antimicrobial Peptide Buforin I. J. Immunol., 165: 3268-3274.
Levy, O., Weiss, J., Zarember, K., Ooi, C. E. and Elsbach, P. (1993). Antibacterial 15kDa protein isoforms (p15s) are members of a novel family of leukocyte proteins. J. Biol. Chem., 268: 6058-6063.
Looft C, Paul S, Philipp U, Regenhard P, Kuiper H, Distl O, Chowdhary BP and T, L. (2006). Sequence analysis of a 212 kb defensin gene cluster on ECA 27q17. Gene, 376(2): 192-8.
Lynn DJ and DG, B. (2007). Discovery of alpha-defensins in basal mammals. Dev. Comp. Immunol., 31(10): 963-7.
Marshall, S. H. and Arenas, G. (2003). Antimicrobial peptides: A natural alternative to chemical antibiotics and a potential for applied biotechnology. Electron J. Biotech., 6: 271-284.
Mookherjee, N. and Hancock, R. E. (2007). Cationic host defence peptides: innate immune regulatory peptides as a novel approach for treating infections. Cell Mol Life Sci., 64: 922-933.
Mookherjee, N., Wilson, H. L., Doria, S., Popowych, Y., Falsafi, R., Yu, J. J., Li, Y., Veatch, S., Roche, F. M., Brown, K. L., Brinkman, F. S., Hokamp, K., Potter, A., Babiuk, L. A., Griebel, P. J. and Hancock, R. E. (2006). Bovine and human cathelicidin cationic host defense peptides similarly suppress transcriptional responses to bacterial lipopolysaccharide. J. Leukoc. Biol., 80: 1563-1574.
Patil A, H. A., Zhang G. (2004). Rapid evolution and diversification of mammalian alpha-defensins as revealed by comparative analysis of rodent and primate genes. Physiol. Genomics, 20(1): 1-11.
Patil AA, C. Y., Sang Y, Blecha F, Zhang G. (2005). Cross-species analysis of the mammalian beta-defensin gene family: presence of syntenic gene clusters and preferential expression in the male reproductive tract. Physiol. Genomics, 23: 5-17.
Pfaffl, M. W. (2001). A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res., 29: e45.
Rozen, S. and Skaletsky, H. J. (2000). Primer3 on the WWW for general users and for biologist programmers. In: Bioinformatics Methods and Protocols: Methods in Molecular Biology. S. Krawetz and S. Misener (eds). Totowa, NJ, Humana Press: 365-386.
Scheetz, T., Bartlett, J. A., Walters, J. D., Schutte, B. C., Casavant, T. L. and McCray, P. B. J. (2002). Genomics-based approaches to gene discovery in innate immunity. Immunol. Rev., 190: 137-145.
Schutte, B. C., Mitros, J. P., Bartlett, J. A., Walters, J. D., Jia, H. P., Welsh, M. J., Casavant, T. L. and McCray, P. B. (2002). Discovery of five conserved beta defensin gene clusters using a computational search strategy. PNAS, 99: 2129-2133.
Shai, Y. (1999). Mechanism of the binding, insertion and destabilization of phospholipids bilayer membranes by α-helical antimicrobial and cell non-selective membranelytic peptides. Biochim. Biophys. Acta, 1462: 55-70.
Sima, P., Trebichavsky, I. and Sigler, K. (2003). Mammalian antibiotic peptides. Folia Microbiol., 48: 123-137.
Sima, P., Trebichavsky, I. and Sigler, K. (2003). Non-mammalian vertebrate antibiotic peptides. Folia Microbiol., 48: 709-724.
Simmaco, M., Mignogna, G. and Barra, D. (1998). Antimicrobial peptides from amphibian skin: what do they tell us? Biopolymers, 47: 435-450.
Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. and Durbin, R. (1998). Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucl. Acids Res., 26: 320-322.
Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22: 4673-4680.
Whale, T. A., Wilson, H. L., Tikoo, S. K., Babiuk, L. A. and Griebel, P. J. (2006). Pivotal Advance: Passively acquired membrane proteins alter the functional capacity of bovine polymorphonuclear cells. J. Leukocyte Biology, 80: 481-491.
Xiao, Y., Hughes, A. L., Ando, J., Matsuda, Y., Cheng, J.-F., Skinner-Noble, D. and Zhang, G. (2004). A genome-wide screen identifies a single beta-defensin gene cluster in the chicken: implications for the origin and evolution of mammalian defensins. BMC Genomics, 5: 56.

Chapter 4:  Identification of antibacterial peptides by chemoinformatics and machine learning

A version of this chapter has been submitted as: Fjell, C.D., Jenssen, H., Hilpert, K., Cheung, W.A., Panté, N., Hancock, R.E.W., and Cherkasov, A. Identification of Novel Antibacterial Peptides by Chemoinformatics and Machine Learning

Some material from this chapter has been accepted for publication in: Cherkasov, A., Hilpert, K., Jenssen, H., Fjell, C.D., Waldbrook, M., Mullaly, S.C., Volkmer, R., and Hancock, R.E.W. Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic resistant Superbugs. ACS Chemical Biology

Introduction

Short cationic, amphipathic peptides possessing antimicrobial activity are present throughout the kingdoms of life. In the face of increasing antibiotic resistance in pathogenic microorganisms, short cationic peptides have drawn significant attention as a possible source of novel antibacterial agents (Hamilton-Miller, 2004; Levy and Marchall, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004; Hancock and Sahl, 2006). Although antimicrobial peptides generally exhibit lower potency against susceptible bacterial targets compared to conventional low-molecular-weight antibiotic compounds, they hold several compensatory advantages including fast killing, broad range of activity, low toxicity and minimal development of resistance in target organisms (Hancock and Sahl, 2006; Jenssen et al., 2006). The use of quantitative structure-activity relationships (QSAR) to predict antibacterial activity of peptides is a relatively recent development. QSAR analysis seeks to relate quantitative properties of a compound (known as descriptors) with other properties such as drug-like activity or toxicity.
QSAR relies on quantities that can be conveniently measured or calculated to predict, in a non-trivial way, other properties of interest such as antibacterial activity. QSAR has become an integral part of screening programs in pharmaceutical drug discovery pipelines of small compounds and more recently in toxicological studies (Perkins et al., 2003). There are two aspects to QSAR analysis: choice of the set of descriptors and choice of statistical learning technique. Previous QSAR analysis of antimicrobial peptides has been limited to comparisons between peptides that differ in only a small number of amino acids, for example, derivatives of lactoferricin (Lejon et al., 2001; Strom et al., 2001; Lejon et al., 2004; Jenssen et al., 2005) and protegrin and similar de novo peptides (Frecer et al., 2004; Frecer, 2006; Ostberg and Kaznessis, 2004). These QSAR studies have mainly utilized descriptors that are designed to model differences in properties of similar peptides, as in the lactoferricin studies, or have used descriptors such as charge, amphipathicity and lipophilicity whose relationship to activity has been demonstrated empirically from amino acid substitution studies (Frecer et al., 2004). Where larger sets of QSAR descriptors have been used, for example for protegrin and analogues (Frecer, 2006; Ostberg and Kaznessis, 2004), the models have been limited to linear models, resulting in only moderate predictive ability. We decided to perform QSAR analysis on AMPs using a more intensive QSAR methodology that utilizes atomic-scale molecular information, recently developed and applied to small molecules.
These ‘inductive’ QSAR descriptors (reviewed in Cherkasov, 2005a) have been successfully applied in a number of molecular modelling studies, including identification of antibacterial activity of small compounds (Cherkasov, 2005b) and classification of antimicrobial compounds, conventional drugs and drug-like substances with up to 97% accuracy on an extensive set of over 2500 chemical structures (Karakoc et al., 2006a). These studies have relied on modelling techniques of greater complexity than those previously applied to antimicrobial peptides. In particular, classification studies have compared artificial neural networks (ANNs), k-nearest neighbors, linear discriminant analysis and multiple linear regression, and found that ANNs generally give more accurate classification, followed closely by k-nearest neighbors methods (Karakoc et al., 2006b). These higher-complexity models use a larger number of parameters and therefore require greater amounts of data.

These data were available from the recently developed high-throughput method for screening large numbers of peptides for antibacterial activity (Hilpert et al., 2005). This method uses peptides synthesized on a cellulose support for rapid creation of peptides that are not limited in sequence diversity. The peptides are assayed for antimicrobial activity using a strain of Pseudomonas aeruginosa engineered to constitutively luminesce via a luciferase cassette insertion. By measuring the decrease in luminescence due to killing of the bacteria, a large number of peptides can be screened for antibacterial activity in an automated manner.

In the current work, we apply for the first time atomic-resolution QSAR methods combined with complex, non-linear modelling to accurately predict antibacterial activity of short cationic peptides containing high sequence diversity.
By combining high-throughput generation of synthetic peptides with a high-throughput antibacterial assay, we were able to apply these methods to a larger data set of peptides than has been used to date. We demonstrate that this combination of experimental procedure and QSAR analysis provides dramatic improvement in prediction of diverse antibacterial peptides. With the methods we describe here, we have performed an efficient, large-scale in silico screening for antibacterial peptides that has yielded several potential drug leads.

4.1 Results and discussion

The overall process we used for QSAR modelling of antimicrobial peptides is shown in Figure 4.1. The starting point was a set of random peptides with measured activity. For each of these peptides the 3D structure was estimated and used to calculate QSAR descriptors. Models for peptide activity were built using artificial neural networks based on these descriptors and the known levels of activity. These models were then used to computationally assess a much larger set of virtual peptides for predicted activity. The accuracy of the predictions was independently assessed by synthesizing and testing many peptides with various levels of predicted activity.

Figure 4.1. General workflow for QSAR modelling of antimicrobial peptides.

4.1.1  Effect of control antibacterial peptide on bacteria

The effect of treatment of P. aeruginosa with the active control peptide Bac2A is shown in transmission electron micrographs (TEMs) of thin sections of the bacteria (Figure 4.2). These electron micrographs show that Bac2A has a dramatic effect on the morphology of the bacterial cell wall. While the cell wall of control untreated bacteria appears smooth and linear (see Figure 4.2A), the Bac2A-treated bacteria have cell walls that are severely damaged and contain numerous blebs (Figure 4.2B), a well known phenomenon observed when bacterial cells are exposed to cationic peptides (Sawyer et al., 1988).
In addition, the space between the cell wall and plasma membrane appears swollen. The blebs of the cell wall are better appreciated when the surface of Bac2A-treated bacteria is visualized by SEM (Figure 4.3). As illustrated in Figure 4.4, Bac2A causes damage to the cell wall of Pseudomonas aeruginosa in a time- and concentration-dependent manner.

Figure 4.2. Transmission electron micrographs of cross-sections of Pseudomonas aeruginosa. Micrographs are shown for control untreated (A) and Bac2A-treated (B). Bac2A concentration was at the MIC. Bacteria were incubated with Bac2A for one hour at 37 °C before fixation and preparation for embedding/thin section TEM. Scale bar is 100 nm.

Figure 4.3. SEM micrographs of Pseudomonas aeruginosa. Micrographs are shown for control untreated (A) and Bac2A-treated (B, C). Bac2A concentration was at the MIC. Bacteria were incubated with Bac2A for one hour at 37 °C before fixation and preparation for SEM. Scale bars are 500 nm for A and B, and 100 nm for C.

Figure 4.4. Electron micrographs of cross-sections of Pseudomonas aeruginosa. Micrographs are shown for control untreated or treated with Bac2A at the concentration and time indicated. Scale bar is 100 nm.

4.1.2  Peptide data sets for model training

Two initial sets of synthetic peptides of nine amino acids in length were assayed for antibacterial activity. Set A consists of 933 peptides; Set B consists of 500 peptides. The primary sequences of Set A were chosen with a bias towards the amino acid proportions of peptides with antibacterial activity isolated in our previous studies (Hilpert et al., 2005; Hilpert et al., 2006). Subsequently, Set B peptides were designed with adjusted amino acid compositions based on the initial peptide population plus the Set A peptides, as shown in Figure 4.5. In both sets, there were no constraints on the amino acid proportions found within any particular peptide.
The two sets were progressively prepared by synthesis on a cellulose support and assayed for activity against P. aeruginosa using a luciferase reporter assay as described previously (Hilpert et al., 2005).

Figure 4.5. Distribution of amino acids in training and test sets. [Bar chart: amino acid fraction (0 to 0.5) versus amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) for Set A, Set B and Q1 to Q4.] The quartiles of the activity for the test peptides are indicated as Q1 to Q4.

4.1.3  Calculation of peptide activity

Peptide antibacterial activity was measured using the luminescence assay, which assesses the loss of energy generation capacity; for antimicrobial peptides this loss has been shown to proportionately reflect lethality (Hilpert et al., 2005; Hilpert et al., 2006). Briefly, peptides were assayed in a dilution series in sets of 10 peptides with one control peptide, Bac2A, per series. Luminescence values for the experimental peptides were fit to a function describing the expected profile of luminescence for a dilution series (Figure 4.6). The relative IC50 (Rel.IC50) values of the experimental peptides were calculated as the ratio of the IC50 value for the peptide to that of the control peptide Bac2A. The fit to the experimental luminescence values was generally good, except for peptides of very low activity, where the plateau at low luminescence (high concentration) is not present. For this reason, peptides were identified as inactive where the luminescence at the highest concentration of peptide was greater than 50% of the luminescence at the lowest concentration; for these peptides, the Rel.IC50 was set to 25 (the approximate lower limit of activity that can be observed).
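The activity calculation summarized above can be sketched in a few lines. This is a minimal illustration and not the custom software used in this work: it assumes the sigmoid of Section 4.3.4 has already been fitted, that the peptide and Bac2A start from the same undiluted concentration, and that index 0 of a luminescence profile is the undiluted well. All function names are hypothetical.

```python
import math

INACTIVE_REL_IC50 = 25.0  # value the text assigns to inactive peptides


def luminescence(x, l_max, slope, x_half):
    """Sigmoid of Section 4.3.4: luminescence at dilution step x."""
    return l_max / (1.0 + math.exp(-2.0 * slope * (x - x_half)))


def is_inactive(profile):
    """profile[0] is the undiluted well, profile[-1] the most dilute one."""
    return profile[0] > 0.5 * profile[-1]


def relative_ic50(x_half_peptide, x_half_control, profile=None):
    """Ratio IC50(peptide) / IC50(Bac2A) for two-fold dilution steps.

    Since IC50 = C0 * 2**(-x_half), a shared C0 (an assumption here)
    cancels in the ratio.
    """
    if profile is not None and is_inactive(profile):
        return INACTIVE_REL_IC50
    return 2.0 ** (x_half_control - x_half_peptide)
```

Under these assumptions, a peptide whose half-luminescence point lies one dilution step further along than that of Bac2A has a Rel.IC50 of 0.5, i.e. it reaches the same effect at half the concentration.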
The activity of the two sets is shown in Table 4.1 (Set A and Set B rows), classified into higher activity (Rel.IC50 below 0.5, i.e. an IC50 less than 50% of that of the control peptide, Bac2A), similar activity (IC50 between 50% and 150% of control) and lower activity (IC50 greater than 150% of control).

Figure 4.6. Luminescence profile of a dilution series for three peptides. The luminescence for three peptides having high, medium (control peptide) and low activity is shown. Luminescence and concentration were scaled to a maximum of 1.0. The point where the horizontal line at luminescence of 0.5 crosses each fitted curve indicates the relative IC50 value for that peptide.

4.1.4  QSAR descriptors and model building

A large number of QSAR descriptors are available to describe the physical chemistry of compounds. A total of 77 descriptors were calculated here for each peptide in the two sets of training peptides. Some descriptor values were found to be highly correlated with each other, which led to problems in modelling; therefore a set of 44 descriptors was chosen such that each showed an absolute correlation of less than 0.95 with any other selected descriptor. All descriptors are shown in supplementary Table 4.4; those used for modelling are indicated.

We used artificial neural networks (ANNs) (see Figure 4.7) to model antibacterial activity since this approach has already been successfully demonstrated for small molecules (for example, Karakoc et al., 2006a). Neural networks typically rank highly among machine learning techniques in predictive performance and, in addition, they are relatively insensitive to the presence of noise and correlated inputs. We used a network configuration with one hidden layer of 10 nodes, 44 input nodes (one for each descriptor) and one output node.
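To make the network configuration concrete, the following is a minimal sketch of a single forward pass through a network of this shape. It is not the software used in this work: the logistic transfer function, the bias terms and the random, untrained weights are assumptions made purely for illustration, and all names are hypothetical.

```python
import math
import random

N_INPUTS, N_HIDDEN = 44, 10  # sizes stated in the text


def logistic(z):
    # Assumed transfer function; maps any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def predict(descriptors, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: 44 inputs -> 10 hidden nodes -> 1 output in (0, 1)."""
    hidden = [
        logistic(sum(w * x for w, x in zip(weights, descriptors)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Untrained, randomly initialised weights, used only to check the shapes.
rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUTS)]
            for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
score = predict([0.5] * N_INPUTS, w_hidden, b_hidden, w_out, 0.0)
```

With this layout the model has 44 x 10 + 10 hidden-layer parameters and 10 + 1 output parameters, which illustrates why a training set of several hundred peptides is needed.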
A variety of other network configurations were also evaluated and showed no improvement in performance (data not shown).

Figure 4.7. Structure of an artificial neural network. The network consists of three layers: the input layer, hidden layer and output layer. The input nodes take the values of the normalized QSAR descriptors. Each node in the hidden layer takes the weighted sum of the input nodes (represented as lines), and transforms the sum into an output value. The output node takes the weighted sum of these hidden node values and transforms the sum into an output value between 0 and 1.

4.1.5  Validation of model performance

We assessed the ability of the ANN models to predict antibacterial activity by first classifying the top 5% of the Set A and B peptides as active according to the Rel.IC50 values; this corresponds to an approximate Rel.IC50 threshold of 0.6 (0.56 for Set A and 0.61 for Set B). A ten-fold cross-validation was performed as described below, with 90% of the data allocated to training and 10% to validation (i.e. reserving a different 10% for each of the 10 validation sets). Set A and Set B were synthesized and assayed at different times, and we observed some systematic differences in the luminescence results related to peptides of very low and very high activity. Therefore, we treated Set A and Set B separately, along with an additional pooled set, Set A+B.

The performance of the three models was assessed using receiver operating characteristic (ROC) curves (Figure 4.8) and the area under the ROC curves (AROC). AROC values approaching one indicate an increasing ability to accurately classify data; AROC values close to 0.5 indicate a poor ability to classify. The average AROC values for Set A, Set B and the combined Set A+B were (mean ± standard deviation, SD) 0.87 ± 0.10, 0.83 ± 0.12 and 0.80 ± 0.09, respectively. These data show that the cross-validated models predicted peptide activity well.
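For readers unfamiliar with the AROC measure, a minimal sketch of its computation follows. It uses the standard equivalence between the area under the ROC curve and the probability that a randomly chosen active peptide scores higher than a randomly chosen inactive one; the function name and the example scores are hypothetical and are not data from this study.

```python
def area_under_roc(scores, labels):
    """AROC from model scores and 0/1 activity labels (1 = active).

    Counts, over all active/inactive pairs, how often the active peptide
    receives the higher score; ties count as half.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("both classes are needed to compute an AROC")
    wins = 0.0
    for a in actives:
        for b in inactives:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Hypothetical validation fold: two actives, two inactives.
example = area_under_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In a ten-fold cross-validation, one such value would be computed on each held-out validation list and the ten values summarized as a mean and standard deviation.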
We integrated the large number of models generated during the cross-validation in a consensus approach to allow a combined, single prediction for a given peptide. We did this using a "voting" system in which each of the thirty models (ten each for Set A, Set B, and the combined Set A+B) was used to evaluate a test peptide.

Figure 4.8. The receiver operating characteristic curves for the three data sets. The average ROC curve was calculated based on validation data for the 10 ROC curves from the cross-validation of each of the data sets.

4.1.6  Independent model testing

To perform an independent assessment of this approach to identifying highly active antibacterial peptides, we created a random set of approximately 100,000 peptides as an independent test set, using the same global amino acid proportions as Set B (Figure 4.5). When we calculated the 44 QSAR descriptors for each peptide, a modest number of peptides fell more than 15% outside the range of descriptor values encountered in Sets A and B; these were not considered further, since extrapolation of this kind is believed to lead to less reliable performance by the models. This left a total of 99,577 test peptides. Each of these peptides was ranked numerically using a voting system as described below. Since these models were built to classify peptides as active or inactive, rather than to predict actual activity levels, the ranked list of test peptides indicates the likelihood that a peptide is highly active.

To independently evaluate these predictions of peptide activity, we selected and synthesized a total of 200 candidate peptides, comprising sets of fifty candidate peptides at four positions in the ranking. Quartile 1 (Q1) peptides were ranked in the top-most 50 positions and considered the most likely to be more active than control. Quartile 2 (Q2) peptides were ranked at the start of the 2nd quartile, positions 24,895 to 24,944, and thus considered likely to be more active than control.
Quartile 3 (Q3) peptides were ranked at the end of the 3rd quartile, positions 74,633 to 74,682, and considered likely to be less active than control. Quartile 4 (Q4) peptides were ranked at the end of the 4th quartile, positions 99,528 to 99,577, and considered most likely to be less active than control.

These two hundred peptides were synthesized and assayed for activity using the luminescence assay. As summarized in Table 4.1, the activity was predicted very accurately by the system. Of the fifty peptides in the most likely active set (Q1), 94% were found to be more active than control. Of the set considered less likely to be active (Q2), 64% were more active than control. Of the peptides predicted to be much less active (Q3), 88% had lower activity than control. In the set considered least likely to be active (Q4), all (100%) were less active than control. All two hundred candidate peptides are shown in supplementary Table 4.5 along with the rank, cumulative vote, experimentally determined relative IC50 values, and selected physical properties (charge, hydrophobic fraction and hydrophobic moment).

Data set   Higher activity    Similar activity       Lower activity     Median
           (Rel.IC50 <0.5)    (Rel.IC50 0.5-1.5)     (Rel.IC50 >1.5)    Rel.IC50
Set A      35 (3.8%)          210 (22.5%)            688 (73.7%)        2.12
Set B      14 (2.8%)          114 (22.8%)            372 (74.4%)        3.33
Q1         47 (94%)           2 (4%)                 1 (2%)             0.23
Q2         32 (64%)           15 (30%)               4 (8%)             0.35
Q3         1 (2%)             5 (10%)                44 (88%)           4.38
Q4         0 (0%)             0 (0%)                 50 (100%)          8.34

Table 4.1. Activities of peptides from training sets and quartiles in the 100,000 test set. Numbers of peptides with various levels of antibacterial activity are shown. Q1: top of 1st quartile; Q2: top of 2nd quartile; Q3: bottom of 3rd quartile; Q4: bottom of 4th quartile. Rel.IC50 is the relative IC50, the ratio of the IC50 for the experimental peptide to the IC50 of Bac2A.
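The rank positions sampled for Q1 to Q4 follow directly from the size of the ranked list. The sketch below is illustrative only, with hypothetical names; the integer rounding is an assumption, chosen because it reproduces the positions reported above for a list of 99,577 peptides.

```python
def selection_windows(n_ranked, window=50):
    """Return 1-based inclusive (first, last) rank positions for Q1 to Q4."""
    q2_first = n_ranked // 4 + 1     # first position past the top quarter
    q3_last = (3 * n_ranked) // 4    # last position inside the third quarter
    return {
        "Q1": (1, window),
        "Q2": (q2_first, q2_first + window - 1),
        "Q3": (q3_last - window + 1, q3_last),
        "Q4": (n_ranked - window + 1, n_ranked),
    }


windows = selection_windows(99577)
```

Sampling at the extremes and at the quartile boundaries, rather than only at the top of the list, is what allows the ranking to be tested for both false positives and false negatives.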
Interestingly, despite the very large difference in predicted activities, the peptides in each quartile had rather similar bulk physical properties (charge, hydrophobicity, hydrophobic moment), as shown in Figure 4.9, indicating the importance of using a broad variety of descriptors in neural network modelling. Ten peptides from each quartile are shown in Table 4.2 for discussion. Consistent with the bulk features of the entire library of sequences, for these peptides the charge and hydrophobicity showed a large degree of overlap for most quartiles. Only some of the peptides from Q4 showed a noticeable difference in these physical properties, specifically a lower charge and hydrophobicity. The importance of charge, hydrophobicity and amphipathicity for antibacterial activity of peptides is well known (Jenssen et al., 2006; Yeaman and Yount, 2003). However, in these groups of peptides there was a clear difference in charge and hydrophobicity only between the most active and least active sets (Q1 and Q4), while the differences in activity across all quartiles were quite dramatic. A clear example that these properties are by themselves insufficient to make predictions can be seen by comparing peptide 10 with the peptide ranked 74,675 (peptide 143 in Table 4.2): the two have very similar values for charge (+4), hydrophobicity (0.44-0.56) and hydrophobic moment (a measure of amphipathicity; 4.2-4.65), but relative IC50 values that differ more than 100-fold (0.04 and 7.1). This demonstrates that the success of the predictions is not based on identifying potent peptides using previously known characteristics.

Figure 4.9. Activity and properties of training and test peptides. Peptide antibacterial activity and physical properties are shown. Rel.IC50 values are medians, with error bars indicating the interquartile range. All other values are means, with error bars indicating SEM.
Top left: median values of Rel.IC50 from the training sets A and B and the corresponding median values for 200 experimentally tested peptides separated into activity quartiles, Q1 to Q4. Top right: median values of formal charge; bottom left: amphipathicity (expressed as hydrophobic moment in Eisenberg units); bottom right: hydrophobic fraction. Statistical significance of difference in means from Q1 values is indicated (ns - not significant, otherwise P values: * <0.05, ** < 0.01, *** < 0.001) using the two-tailed Mann-Whitney test calculated using GraphPad Prism 4.03.

Peptide  Quartile  Sequence   Cumulative  Average   Rel.IC50  Charge  Hydrophobicity  Hydrophobic
number                        vote        rank                                        moment
1        1         RWRWKRWWW  29          2027.1    0.25      4       0.56            1.48
2        1         RWRRWKWWW  29          2707.9    0.40      4       0.56            1.96
3        1         RWWRWRKWW  29          2729      0.28      4       0.56            2.11
4        1         RWRRKWWWW  28          2831.9    0.39      4       0.56            2.75
5        1         RWRWWKRWY  28          3044.5    0.20      4       0.56            2.86
6        1         RRKRWWWWW  27          2434.6    0.43      4       0.56            1.22
7        1         RWRIKRWWW  27          2589.1    0.12      4       0.56            1.84
8        1         KIWWWWRKR  27          2622.3    0.13      4       0.56            2.06
9        1         RWRRWKWWL  27          3201.2    0.08      4       0.56            2.12
10       1         KRWWKWIRW  27          3660.7    0.04      4       0.56            4.65
51       2         IRMWVKRWR  0           13255.8   0.61      4       0.56            4.24
52       2         RIWYWYKRW  0           13263.4   0.36      3       0.67            4.06
53       2         FRRWWKWFK  0           13275.7   0.12      4       0.56            5.40
54       2         RVRWWKKRW  0           13278.9   0.27      5       0.44            2.27
55       2         RLKKVRWWW  0           13318.8   0.34      4       0.56            1.16
56       2         RWWLKIRKW  0           13319.5   0.18      4       0.56            3.85
57       2         LRWWWIKRI  0           13336.1   0.33      3       0.67            0.99
58       2         TRKVWWWRW  0           13336.2   0.76      3       0.56            0.78
59       2         KRFWIWFWR  0           13347.1   3.04      3       0.67            4.11
60       2         KKRWVWVIR  0           13348.2   0.35      4       0.56            2.92
141      3         KIRRKVRWG  0           67295.4   10.55     5       0.33            2.02
142      3         AIRRWRIRK  0           67295.8   4.62      5       0.44            5.94
143      3         WRFKVLRQR  0           67297.8   7.08      4       0.44            4.20
144      3         RSGKKRWRR  0           67298     6.50      6       0.11            4.66
145      3         FMWVYRYKK  0           67298     1.51      3       0.67            1.81
146      3         RGKYIRWRK  0           67298.1   3.83      5       0.33            4.94
147      3         WVKVWKYTW  0           67298.3   5.64      2       0.67            2.41
148      3         VVLKIVRRF  0           67298.6   25.00     3       0.67            1.86
149      3         GKFYKVWVR  0           67298.7   1.21      3       0.56            5.39
150      3         SWYRTRKRV  0           67299.6   6.66      4       0.33            4.24
191      4         GRIGGKNVR  0           98644.5   9.12      3       0.22            4.30
192      4         NKTGYRWRN  0           98701.1   8.33      3       0.22            2.75
193      4         VSGNWRGSR  0           98756.7   8.54      2       0.22            2.67
194      4         GWGGKRRNF  0           98807.8   7.38      3       0.22            1.13
195      4         KNNRRWQGR  0           98885.2   6.45      4       0.11            2.88
196      4         GRTMGNGRW  0           98946.9   6.93      2       0.22            1.40
197      4         GRQISWGRT  0           98949.4   8.04      2       0.22            1.94
198      4         GGRGTRWHG  0           99178.5   8.60      3       0.11            2.63
199      4         GVRSWSQRT  0           99185.7   8.50      2       0.22            2.56
200      4         GSRRFGWNR  0           99199.5   8.10      3       0.22            0.58

Table 4.2. Predicted activity rank and experimental Rel.IC50 values for selected test peptides. Forty peptides are shown from the 200 total test peptides. Hydrophobic moment uses the Eisenberg scale.

4.1.7  Antibacterial activity of predicted peptides against resistant strains

A selection of 18 of these 200 peptides was synthesized in bulk and tested against a large variety of drug-resistant bacterial pathogens (Table 4.3). A total of 13 peptides from quartiles 1 and 2 with high activity, and 5 peptides from quartile 3 with low activity, were evaluated for their in vitro effect (MIC activity) against several multi-drug resistant and problematic pathogens, including strains of multi-drug resistant P. aeruginosa, methicillin resistant Staphylococcus aureus (MRSA), Enterobacter cloacae with derepressed chromosomal β-lactamase, extended spectrum β-lactamase producing Escherichia coli and Klebsiella pneumoniae, and vancomycin resistant Enterococcus faecalis and Enterococcus faecium (VRE).
All 13 peptides belonging to the first and second quartiles had significant in vitro inhibitory activity against antibiotic-resistant bacteria. Moreover, some peptides from the 1st quartile, such as 8 and 9, exhibited MICs of 0.3-10 µM against most of the tested ‘superbugs’, compared to the only antimicrobial peptide to show efficacy to date in advanced clinical trials, MX-226 (Hancock and Sahl, 2006), which exhibited MICs of 10-76 µM (Cherkasov et al., in press). These results characterize the developed peptides as excellent antibiotic candidates for treating some of the most recalcitrant and dangerous human infections. As reported elsewhere (Cherkasov et al., in press), two other peptides identified from the first quartile were also found to be protective against Staphylococcus aureus infection in animal models.

MIC (µM) against organisms A to T
ID       Sequence      A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q     R     S     T
Bac2A    RLARIVVIRVAR  48    192   95    192   95    95    12    3.0   24    24    24    192   192   24    24    12    48    48    12    3.0
8        KIWWWWRKR     5.9   47    24    47    47    12    5.9   3.0   94    12    5.9   189   47    5.9   5.9   24    94    94    5.9   1.5
9        RWRRWKWWL     2.9   12    12    23    5.8   12    0.3   0.7   5.8   5.7   2.9   46    11    2.9   2.9   23    92    92    5.7   1.4
20       WRWWKIWKR     5.9   24    24    47    12    47    1.5   0.8   12    5.9   5.9   94    24    3.0   3.0   24    94    94    5.9   1.5
45       WKRWWKKWR     23    46    46    93    23    46    5.8   1.4   93    23    2.9   186   46    5.8   5.8   93    >186  >186  23    5.8
48       WKKWWKRRW     23    46    46    93    46    46    5.8   1.4   23    23    2.9   186   46    1.4   2.9   93    >186  >186  12    5.8
24,897   FRRWWKWFK     1.5   12    3.0   5.9   5.9   24    1.5   0.8   24    12    6.1   97    24    1.5   3.0   24    195   97    6.1   6.1
24,901   LRWWWIKRI     13    50    25    50    25    25    6.3   3.2   50    25    13    201   50    6.3   6.3   13    50    25    6.3   1.5
24,910   RKRLKWWIY     25    50    50    50    50    50    6.3   3.2   13    6.3   6.3   >202  50    6.3   6.3   50    >202  202   13    3.2
24,913   KKRWVWIRY     25    51    25    51    25    25    3.2   1.6   25    25    13    204   51    13    13    25    102   102   6.4   6.4
24,915   KWKIFRRWW     12    24    24    48    12    48    3.1   1.5   6     24    12    194   97    3.1   3.1   24    97    97    24    6.1
24,919   RKWIWRWFL     6.1   12    3.1   6.1   6.1   3.1   1.5   1.5   3     1.5   1.5   3.1   3.1   3.1   3.1   6.1   24    24    3.1   3.1
24,921   IWWKWRRWV     6.0   48    12    48    12    48    6.0   1.5   6     24    6.0   96    12    3.0   3.0   24    48    48    6.0   3.0
24,944   RRFKFIRWW     6.1   24    49    49    12    49    3.1   0.8   12    12    12    98    49    6.1   6.1   12    98    49    6.1   6.1
74,655   AVWKFVKRV     240   >240  240   >240  240   >240  120   60    >240  240   120   >240  >240  240   240   >240  >240  >240  120   120
74,658   AWRFKNIRK     >223  >223  >223  >223  >223  >223  111   >223  >223  >223  223   >223  >223  223   >223  >223  >223  >223  >223  223
74,665   KRIMKLKMR     >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226
74,674   AIRRWRIRK     >217  >217  >217  >217  >217  >217  217   108   108   108   108   >217  >217  54    54    >217  >217  >217  108   14
74,680   VVLKIVRRF     >241  >241  >241  >241  >241  >241  241   60    241   >241  241   >241  >241  241   241   241   >241  >241  241   60

Table 4.3. Activities against multi-resistant Superbugs of selected peptides predicted through the QSAR analysis compared to the peptide Bac2A. Peptides from the top quartile (8 to 48) were compared to peptides from the 2nd (24,897 to 24,944) and 3rd (74,655 to 74,680) quartiles. Column legends: ID indicates the control Bac2A or the test peptide by rank number. Columns give MIC values (µM) measured in 3-5 replicates for A, P. aeruginosa wild type strain H103; B, C, D, P. aeruginosa multidrug resistant strains from Brazil #9, #198 and #213 respectively; E, F, G, P. aeruginosa Liverpool epidemic strains LES400, H1030 and H1027 respectively; H, P. maltophilia ATCC13637; I, constitutive Class C chromosomal β-lactamase expressing Enterobacter cloacae 218R; J, K, extended-spectrum β-lactamase-producing (ESBL) E. coli (clinical strains 63103 and 64771); L, M, ESBL-producing Klebsiella pneumoniae (clinical strains 61962 and 63575); N, S. aureus ATCC25923; O, methicillin resistant S. aureus strain C623; P, Enterococcus faecalis ATCC29212; Q, R, vancomycin resistant E. faecalis [clinical isolates w61950 (VanA) and f43559 (VanB)]; S, T, vancomycin resistant E. faecium [clinical isolates mic80 (VanA) and t62764 (VanB)].
It is interesting to note that two of the peptides with high potency (IDs 45 and 48 in Table 4.3) are active against a large number of the drug-resistant strains but have poor activity against one extended-spectrum β-lactamase-producing (ESBL) pathogen (column L) and two vancomycin resistant organisms (columns Q and R). However, these two peptides are active against other ESBL organisms (columns J and K) and other vancomycin resistant organisms (columns S and T). It seems likely that this resistance is due to mechanisms different from those conferring resistance to the conventional antibiotics. For example, β-lactamase would not be expected to inactivate these peptides since they do not contain β-lactam rings.

4.2 Conclusions

We have described in this study the methodology used in the first application of atomic-resolution 3D QSAR prediction of antibacterial activity to a large data set of diverse peptides. With the availability of large numbers of synthetic peptides and a rapid assay to determine their antibacterial activity, larger sets of data on peptide sequence and activity can now be created. Based on two random libraries containing a total of over 1400 peptides, we developed artificial neural network models that predict and rank the relative activities of novel antimicrobial peptides with remarkable accuracy: in an independent test set of 100,000 virtual peptides, 94% of the 50 highest ranked peptides, predicted to be highly active, were found to be highly active. In addition to enabling more complex models that utilize the 'inductive' QSAR methodology, the availability of peptide data of high quantity and quality also allows more rigorous training and evaluation of the machine learning techniques. We consider the methodology described here to be the first successful demonstration of high-throughput in silico screening of antibacterial peptides for novel drug leads.
4.3 Materials and methods

4.3.1  Electron microscopy of AMPs

TEM micrographs were prepared from thin sections of Pseudomonas aeruginosa, either untreated or treated with Bac2A (sequence: RLARIVVIRVAR) at the MIC (50 µg/mL) for one hour at 37 °C. For control, bacteria were mock incubated and prepared for embedding/thin section electron microscopy in the same way as the peptide-treated bacteria. SEM micrographs of Pseudomonas aeruginosa were prepared for control untreated and Bac2A-treated (50 µg/mL) bacteria. Bacteria were incubated with Bac2A for one hour at 37 °C before fixation and preparation for SEM.

4.3.2  Peptide sequences for model training

Two experimental sets of peptides were created, one consisting of 943 peptides (Set A) and another of 500 peptides (Set B). Peptide sequences in these sets were selected randomly from the amino acid distributions shown in Figure 4.5 using custom computer software. The amino acid proportions for Set A were determined based on our previous studies of substitution analysis (Hilpert et al., 2005), and the proportions for Set B were further adjusted based on early analysis of Set A activity. In one plate of ten peptides in the set of 943 peptides, the control Bac2A peptide did not show the expected luminescence profile, and these ten experimental peptides were excluded from further use, leaving 933 peptides. For modelling (described below), three training sets were prepared, consisting of the set of 933 peptides (Set A), the set of 500 peptides (Set B) and a set created by combining the sets of 933 and 500 peptides (Set A+B).

A set of 100,000 random peptide sequences was generated in the same amino acid proportions as Set B, using the same algorithm as described above. There were 311 duplicates, which were removed; together with the exclusion of peptides whose descriptor values fell outside the range of the training sets (Section 4.1.6), this left 99,577 peptides (the test set). Peptides from this set were evaluated in silico and 200 (50 from each quartile) were selected for synthesis and assay.
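The generation of a random peptide library from global amino acid proportions can be sketched as follows. This is a hedged illustration rather than the custom software used here: the function name is hypothetical, each residue is assumed to be drawn independently, and the uniform proportions in the example are a placeholder and are not the Set B values of Figure 4.5.

```python
import random

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def random_peptides(n, proportions, length=9, seed=0):
    """Draw n sequences residue by residue, then drop duplicates.

    Returns the unique sequences (first occurrences, in draw order)
    and the number of duplicates removed.
    """
    rng = random.Random(seed)
    letters = list(proportions)
    weights = [proportions[a] for a in letters]
    drawn = ["".join(rng.choices(letters, weights=weights, k=length))
             for _ in range(n)]
    unique = list(dict.fromkeys(drawn))
    return unique, len(drawn) - len(unique)


# Placeholder proportions (uniform), NOT the values used in this study.
uniform = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
peptides, n_duplicates = random_peptides(1000, uniform)
```

Because residues are drawn independently, no constraint is placed on the composition of any individual peptide; only the library-wide proportions are controlled, and duplicates become more frequent as the proportions become more skewed.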
4.3.3  Peptide SPOT synthesis and screening

Peptide synthesis was performed as previously described (Hilpert et al., 2005; Hilpert et al., 2007). Briefly, peptides were synthesized on a cellulose support with a pipetting robot, using two glycine residues as linker. Peptides were cleaved from the dried membrane in an ammonia atmosphere, resulting in free peptides with two glycines at the amidated C terminus due to the linker sequence. The peptide spots were punched out and transferred to 96-well microtiter plates in sets of 10 along with a positive control peptide (Bac2A) and an unrelated peptide (GATPEDLNQKLS) or an empty well as negative control. An overnight culture of P. aeruginosa strain H1001 was diluted at a 1:500 ratio with 100 mM Tris buffer (pH 7.3), 20 mM glucose. This diluted culture was added to the microtiter plate wells (100 µL/well) containing the peptide spots and controls. After 30 min incubation, serial dilutions were performed from the membrane spots to successive rows of the plate. Luminescence of the P. aeruginosa PAO1 strain H1001 containing the luciferase gene cassette luxABCDE was measured at 4 hours using a Tecan Spectra Fluor Plus (Tecan US).

4.3.4  Calculation of peptide activity

The luminescence of each peptide in a dilution series was fit to the following function (1) independently for each peptide, after luminescence data were normalized to 1.0 for the most dilute luminescence point for each peptide. This function had the form of a sigmoid curve consisting of two plateaus with a smoothly varying region joining them. Parameters of the function described the height of the higher plateau, the position of the center of the slope at half the maximum luminescence, and the slope at the center. Estimation of parameters was performed using custom C software using Numerical Recipes in C (Press et al., 1992).

$$L = \frac{L_{max}}{1 + e^{-2S(x - x_{1/2})}} \qquad (1)$$

In this function, $L_{max}$ controls the maximum height of the curve, $S$ controls the slope, and $x_{1/2}$ is the value of $x$ giving luminescence of half of the maximum luminescence. The values of $x$ were in dilution steps with values from zero, for the initial concentration, to seven (after seven dilutions); these corresponded to changes in concentration $C$,

$$C = C_0 \, 2^{-x} \qquad (2)$$

where $C_0$ was the initial concentration of peptide in the undiluted well. We were interested in calculating the concentration of peptide that reduces the number (and hence the luminescence) of viable energized bacteria by 50%, the IC50. From these equations we can state the IC50 as

$$IC_{50} = C(x_{1/2}) = C_0 \, 2^{-x_{1/2}} \qquad (3)$$

However, we can eliminate the need to determine the initial concentration of peptide by reporting the activity of peptides as relative IC50 (Rel.IC50) values: the ratio of the IC50 for the experimental peptide to the IC50 for Bac2A. Values of Rel.IC50 < 1.0 mean the peptide is more active than Bac2A, since a lower concentration yields the same reduction in bacterial concentration. For peptides with very low or zero activity, curve fitting was problematic. Where the luminescence of a well for an undiluted peptide was greater than 50% of the maximum luminescence for the peptide at high dilutions, the IC50 concentration was not observed even at the highest peptide concentration used. Here, the peptide was considered inactive and assigned a Rel.IC50 value of 25. For Sets A and B, 7 dilution points were used in the calculation of Rel.IC50 due to frequent artifacts in the last dilution row (dramatic increases in luminescence were observed that were inconsistent with the expected profile). For the 200 peptides taken from the independent test set, the Rel.IC50 was determined from all 8 dilution points for each peptide, since these artifacts were largely eliminated in later measurements.

4.3.5  QSAR descriptors

The QSAR descriptors used in this study are shown in Table 4.4. The 'inductive' QSAR descriptors used in this study were previously described (Cherkasov, 2005).
An initial set of seventy-seven QSAR descriptors was calculated for each peptide in the two training and test sets using MOE (Molecular Operating Environment, 2005, by Chemical Computing Group Inc., Montreal, Canada). The peptide structure was optimized based on an initial linear structure followed by potential energy minimization of each molecule using MMFF94 force-field calculations (Halgren, 1996). Structure optimization was done without including interactions with other molecules. The atomic types were assigned according to their name, valence state and the formal charge of constituent atoms, as defined within MOE. QSAR descriptors were calculated using custom SVL scripts within the MOE environment. The 'inductive' QSAR variables can be computed by the following equations:

$$ Rs_{j \to G} = \alpha R_j^2 \sum_{i \neq j}^{N-1} \frac{1}{r_{j-i}^2} \tag{4} $$

$$ Rs_{G \to j} = \alpha \sum_{i \subset G,\, i \neq j}^{n} \frac{R_i^2}{r_{i-j}^2} \tag{5} $$

$$ \sigma^*_{j \to G} = \beta \sum_{i \neq j}^{N-1} \frac{(\chi_j^0 - \chi_i^0) R_j^2}{r_{j-i}^2} \tag{6} $$

$$ \sigma^*_{G \to j} = \beta \sum_{i \subset G,\, i \neq j}^{n} \frac{(\chi_i^0 - \chi_j^0) R_i^2}{r_{i-j}^2} \tag{7} $$

$$ \chi^0_{G \to j} = \frac{\sum_{i \neq j}^{N-1} \chi_i^0 (R_i^2 + R_j^2) / r_{i-j}^2}{\sum_{i \neq j}^{N-1} (R_i^2 + R_j^2) / r_{i-j}^2} \tag{8} $$

$$ \Delta N_j = Q_j + \gamma \sum_{i \neq j}^{N-1} \frac{(\chi_j - \chi_i)(R_j^2 + R_i^2)}{r_{j-i}^2} \tag{9} $$

$$ \eta_j = \frac{1}{2 \sum_{i \neq j}^{N-1} (R_j^2 + R_i^2) / r_{j-i}^2} \tag{10} $$

$$ \eta_{MOL} = \frac{1}{s_{MOL}} \tag{11} $$

$$ s_i = 2 \sum_{j \neq i}^{N-1} \frac{R_j^2 + R_i^2}{r_{j-i}^2} \tag{12} $$

$$ s_{MOL} = \sum_{j \neq i}^{N} \sum_{j \neq i}^{N} \frac{R_j^2 + R_i^2}{r_{j-i}^2} \tag{13} $$

where R is the covalent atomic radius, r is the interatomic distance, Qj is the formal charge of atom j, χ is the 'inductive' electronegativity, Rs is the steric constant, σ* is the inductive constant, ΔN is the 'inductive' partial charge, and η and s are the 'inductive' analogues of chemical hardness and softness. It should be noted that the variables indexed with a j subscript describe the influence of a single atom on a group of atoms G (typically the rest of the N-atomic molecule), while G indices designate group (molecular) quantities.
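The pairwise sums behind these descriptors are simple to evaluate once atomic coordinates are available. The following is a minimal sketch, not the SVL scripts used in this work: it assumes the atom-to-molecule steric constant takes the form α R_j² Σ 1/r², the molecule-to-atom inductive constant the form β Σ (χ_i⁰ − χ_j⁰) R_i²/r², and the atomic softness the form 2 Σ (R_j² + R_i²)/r². The function name, the argument layout and the placeholder values of 1.0 for the scaling constants alpha and beta are hypothetical.

```python
def inductive_sketch(coords, radii, chi0, alpha=1.0, beta=1.0):
    """Per-atom Rs(atom->molecule), sigma*(molecule->atom) and softness.

    coords: list of (x, y, z) tuples; radii: covalent radii;
    chi0: initial electronegativities. alpha and beta are placeholder
    scaling constants (assumed values, not taken from the thesis).
    """
    n = len(coords)
    rs_atom_to_mol, sigma_mol_to_atom, softness = [], [], []
    for j in range(n):
        inv_r2_sum = 0.0   # sum of 1 / r^2 over all other atoms
        sigma_sum = 0.0    # sum of (chi_i - chi_j) * R_i^2 / r^2
        soft_sum = 0.0     # sum of (R_j^2 + R_i^2) / r^2
        for i in range(n):
            if i == j:
                continue
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            inv_r2_sum += 1.0 / r2
            sigma_sum += (chi0[i] - chi0[j]) * radii[i] ** 2 / r2
            soft_sum += (radii[j] ** 2 + radii[i] ** 2) / r2
        rs_atom_to_mol.append(alpha * radii[j] ** 2 * inv_r2_sum)
        sigma_mol_to_atom.append(beta * sigma_sum)
        softness.append(2.0 * soft_sum)
    return rs_atom_to_mol, sigma_mol_to_atom, softness
```

Under these assumptions the atomic hardness is the reciprocal of each softness value, and molecular descriptors such as sums, means and extremes follow by aggregating the per-atom lists.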
The linear character of equations (4) - (13) makes 'inductive' descriptors readily computable and suitable for sizable databases and positions them as appropriate parameters for large-scale QSAR models. The R language for statistical computing (http://r-project.org; R Development Core Team, 2005) was used for all following steps. Each descriptor in the training and test sets was normalized to the range encountered in training peptide Set A and B. A cross-correlation was performed on the descriptors in the set of all peptides from training and testing. Where the Pearson correlation coefficient was >0.95 or < -0.95, one descriptor of the pair was dropped. This was repeated until no pair of descriptors had an absolute correlation above 0.95. This left a final set of forty-four descriptors (Table 4.4). Hydrophobic moments were calculated for comparison purposes (Figure 4.9 and Table 4.2) and not used in ANN modelling. These were calculated using the hmoment utility in EMBOSS (Rice et al., 2000) modified to use the Eisenberg scale (Eisenberg et al., 1984).

4.3.6  Training and validation data sets

For each of the three training sets of peptides described above (Set A, Set B and Set A+B), the most active 5% of peptides (those with the lowest Rel.IC50 values) were classified as active and assigned the activity-value of 1 in the data sets for training the ANNs; all other peptides were assigned the activity-value 0. A stratified ten-fold cross-validation was performed on the three sets, resulting in ten models for each of the training sets for a total of thirty models. Briefly, to create the cross-validation data sets, 10% of the active peptides in the training set (one of Set A, Set B, or Set A+B) were randomly assigned to each of 10 lists. Then 10% of the inactive peptides in the training set were randomly assigned to each of 10 lists. One list of actives was combined with one list of inactives, to create 10 lists of combined active and inactive peptides.
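The stratified assignment of peptides to ten lists, and the pairing of those lists into training and validation sets, can be sketched as follows. This is an illustrative sketch in Python rather than the R code used in this work; the function names, the fixed random seed and the round-robin dealing of shuffled peptides are assumptions chosen only to make the example reproducible.

```python
import random

def stratified_lists(labels, n_lists=10, seed=0):
    """Deal actives (label 1) and inactives (label 0) evenly into n_lists lists.

    labels: list of 0/1 activity values; returns lists of peptide indices.
    """
    rng = random.Random(seed)
    lists = [[] for _ in range(n_lists)]
    for cls in (1, 0):
        members = [idx for idx, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Round-robin dealing gives each list about 1/n_lists of the class.
        for pos, idx in enumerate(members):
            lists[pos % n_lists].append(idx)
    return lists

def train_validation_pairs(lists):
    """Use each list once as the validation set and the rest as training."""
    pairs = []
    for k, validation in enumerate(lists):
        training = [idx for m, lst in enumerate(lists) if m != k for idx in lst]
        pairs.append((training, validation))
    return pairs
```

Keeping the two classes separate while dealing ensures that every validation list contains roughly 5% actives, matching the class balance of the full training set.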
Using one of these lists as the peptides for a validation data set, the other 9 were used as the corresponding training set. This was repeated a total of 10 times to create 10 validation sets and 10 training sets. This creation of 10-fold cross-validation sets was performed separately for each of the training sets (A, B, and A+B).

4.3.7  Test data set

To evaluate the voting system's ability to predict peptide activity, we selected a set of 100,000 peptide sequences according to the amino acid frequencies used in Set B. QSAR descriptors were calculated as described in section 4.3.5 above. The maximum and minimum values of each of the 44 descriptors were compared to the range present in the Set A and B training data. Where a descriptor value for a test peptide fell more than 15% above or below the range in the training data, the test peptide was dropped from the test set, leaving a total of 99,577 peptide sequences.

4.3.8  Model training

Artificial neural networks (ANNs) were constructed and evaluated using SNNS (Stuttgart Neural Network Simulator, version 4.2, from University of Tübingen, Stuttgart, Germany, available at http://www-ra.informatik.uni-tuebingen.de/SNNS/). The networks (Figure 4.7) consisted of forty-four input nodes (one for each QSAR descriptor as described above), ten nodes in one hidden layer, and one output node; all were fully connected. The output node values for training were zero for not active, and one for active. Networks were initialized using randomized weights. Model training was performed using pairs of training and validation data sets generated for the 10-fold cross-validation described above. Therefore, 10 models were created for each of the training sets (Set A, Set B, and Set A+B) for a total of 30 models. Training was performed on each training data set using the standard backpropagation learning function with parameters η=0.2 and dmax=0. The update function used topological order with shuffled order of training patterns.
For each cycle of training, the validation data set was evaluated. As the network trained, network parameters giving a minimum error on the validation set were stored. After 200 training cycles with no new minimum model error found, all network weights were jogged by 2% to attempt to escape local minima, and weights that showed more than 95% correlation during propagation were jogged by 5%. Training continued and was terminated after an additional 200 cycles with no new minimum validation error encountered. Performance measures such as ROC curves and areas, sensitivity and specificity were calculated using the ROCR package in R (Sing et al., 2005).

4.3.9  In silico ranking and selection of test peptides

To test the predictions of the ANNs, all peptides in the test set were evaluated by all 30 ANNs and the combined predictions were integrated into a single ordering of the test peptides as follows. Each peptide in the test set was assigned a ranking by each ANN. If a test peptide appeared in the top 5% of all peptides in the test set for an ANN, it received one 'vote' to indicate the model suggested it to be highly active. Therefore, a test peptide may receive up to 30 votes from the total of 30 ANNs. Peptides were ranked by number of votes, with the relative ordering of peptides receiving the same number of votes determined by the average of the rankings of all ANNs. Sets of 50 peptides at 4 positions of ranking were selected to independently evaluate the system's ability to predict peptide activity and inactivity. Quartile 1 (Q1) peptides were ranked in the topmost 50 positions and considered the most likely to be more active than control. Quartile 2 (Q2) peptides were ranked at the start of the 2nd quartile, positions 24895 to 24944, and considered likely to be more active than control. Quartile 3 (Q3) peptides were ranked at the end of the 3rd quartile, positions 74633 to 74682, and considered likely to be less active than control.
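The vote-based ordering described in this section can be sketched as follows. This is a minimal illustration, not the code used in this work: it assumes each ANN returns one output per peptide with higher values meaning more likely active, and that ties within a single model are broken by input order; the function and argument names are hypothetical.

```python
def rank_by_votes(scores_by_model, top_fraction=0.05):
    """Order peptides by votes, then by average rank across models.

    scores_by_model: one list of predicted activities per ANN, all the
    same length; higher score is assumed to mean more active.
    Returns (ordering of peptide indices, votes, average ranks).
    """
    n_peptides = len(scores_by_model[0])
    n_top = max(1, int(n_peptides * top_fraction))
    votes = [0] * n_peptides
    rank_sums = [0] * n_peptides
    for scores in scores_by_model:
        # Rank 1 is the peptide this model scores highest.
        order = sorted(range(n_peptides), key=lambda p: -scores[p])
        for rank, p in enumerate(order, start=1):
            rank_sums[p] += rank
            if rank <= n_top:
                votes[p] += 1  # one vote per model placing it in the top fraction
    n_models = len(scores_by_model)
    avg_rank = [total / n_models for total in rank_sums]
    ordering = sorted(range(n_peptides), key=lambda p: (-votes[p], avg_rank[p]))
    return ordering, votes, avg_rank
```

Sorting on the pair (negative votes, average rank) reproduces the two-level ordering: the vote count dominates, and the average rank only separates peptides with equal votes.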
Quartile 4 (Q4) peptides were ranked at the end of the 4th quartile, positions 99528 to 99577, and considered to be most likely less active than control. These 200 predicted peptides were synthesized and assayed for activity as described above.

4.3.10  Minimal inhibitory concentration (MIC) determination

The minimal inhibitory concentration (MIC) of the peptides was measured as described (Cherkasov et al., in press). Briefly, a modified broth microdilution method was used. The peptides were dissolved and stored in glass vials. The assay was performed in sterile 96-well polypropylene microtitre plates (Cat. #3790, Costar, Cambridge, MA). Serial dilutions of the peptides to be assayed were performed in 0.01% acetic acid containing 0.2% bovine serum albumin at 10-fold the desired final concentration. Ten microlitres of the 10-fold concentrated peptides were added to each well of a 96-well polypropylene plate containing 90 µl of MH media per well. Bacteria were added to the plate from an overnight culture at a final concentration of 2-7 × 10^5 CFU/ml and incubated overnight at 37°C. The MIC was taken as the concentration at which no growth was observed. MIC analyses were done on a panel of bacterial pathogens that were both susceptible and resistant to common antibiotics. P. aeruginosa PAO1 strain H103, P. maltophilia ATCC#13637, S. aureus ATCC#25923, Enterococcus faecalis ATCC#29212 and Enterobacter cloacae 218R (constitutively expressing Class C chromosomal β-lactamase) were from our laboratory strain collection. A methicillin-resistant S. aureus (MRSA) clinical isolate was kindly provided by Anthony Chow (Vancouver General Hospital, Vancouver, Canada). Two Klebsiella pneumoniae and two E. coli clinical isolates expressing extended spectrum β-lactamases (ESBL) were kindly provided by George Zhanel (Health Sciences Centre, Winnipeg, Canada). Vancomycin-resistant clinical isolates of Enterococcus faecalis and E.
faecium were obtained from Ana M. Paccagnella (BC Centre for Disease Control, Vancouver, Canada). Three clinical isolates (#9, #198 and #213) of multi-drug resistant P. aeruginosa were kindly provided by Carlos Kiffer (University of São Paulo, Brazil). These isolates all have resistance to piperacillin/tazobactam, meropenem, ceftazidime, ciprofloxacin and cefepime, and #9 is also polymyxin B resistant. Three P. aeruginosa clinical isolates of the Liverpool epidemic strain (LES) (H1027, H1030 and LES400) were all kindly provided by Craig Winstanley (University of Liverpool, UK). LES400 was resistant to gentamicin and tobramycin, while H1030 showed resistance to colistin, amikacin, gentamicin and tobramycin. All tested bacterial strains were categorized as biohazard level 2 pathogens.

4.4 Acknowledgements

We gratefully acknowledge financial support from the Canadian Institutes of Health Research (CIHR) and the Foundation of the National Institutes of Health and CIHR through the Grand Challenges in Global Health Initiative. We thank Jessica Lee for technical support in creating the computer-based peptide libraries. RH is the recipient of a Canada Research Chair. KH received a CIHR fellowship. CDF received a Doctoral Research Award from the CIHR.
4.5 Supplementary tables

Descriptor | Explanation | Parental Equation

Electronegativity-based
EO_Equalized* | Iteratively equalized electronegativity of a molecule | (8), (9)
Average_EO_Pos* | Arithmetic mean of electronegativities of atoms with positive partial charge | (8), (9)
Average_EO_Neg* | Arithmetic mean of electronegativities of atoms with negative partial charge | (8), (9)

Hardness-based
Global_Hardness | Molecular hardness - reversed softness of a molecule | (10)
Sum_Hardness* | Sum of hardnesses of atoms of a molecule | (10)
Sum_Pos_Hardness | Sum of hardnesses of atoms with positive partial charge | (10)
Sum_Neg_Hardness* | Sum of hardnesses of atoms with negative partial charge | (10)
Average_Hardness* | Arithmetic mean of hardnesses of all atoms of a molecule | (10)
Average_Pos_Hardness* | Arithmetic mean of hardnesses of atoms with positive partial charge | (10)
Average_Neg_Hardness* | Arithmetic mean of hardnesses of atoms with negative partial charge | (10)
Smallest_Pos_Hardness* | Smallest atomic hardness among values for positively charged atoms | (10)
Smallest_Neg_Hardness* | Smallest atomic hardness among values for negatively charged atoms | (10)
Largest_Pos_Hardness* | Largest atomic hardness among values for positively charged atoms | (10)
Largest_Neg_Hardness* | Largest atomic hardness among values for negatively charged atoms | (10)
Hardness_of_Most_Pos* | Atomic hardness of an atom with the most positive charge | (10)
Hardness_of_Most_Neg* | Atomic hardness of an atom with the most negative charge | (10)

Softness-based
Global_Softness | Molecular softness - sum of constituent atomic softnesses | (11)
Total_Pos_Softness | Sum of softnesses of atoms with positive partial charge | (11)
Total_Neg_Softness* | Sum of softnesses of atoms with negative partial charge | (11)
Average_Softness | Arithmetic mean of softnesses of all atoms of a molecule | (11)
Average_Pos_Softness | Arithmetic mean of softnesses of atoms with positive partial charge | (11)
Average_Neg_Softness* | Arithmetic mean of softnesses of atoms with negative partial charge | (11)
Smallest_Pos_Softness | Smallest atomic softness among values for positively charged atoms | (11)
Smallest_Neg_Softness | Smallest atomic softness among values for negatively charged atoms | (11)
Largest_Pos_Softness | Largest atomic softness among values for positively charged atoms | (11)
Largest_Neg_Softness | Largest atomic softness among values for negatively charged atoms | (11)
Softness_of_Most_Pos | Atomic softness of an atom with the most positive charge | (11)
Softness_of_Most_Neg | Atomic softness of an atom with the most negative charge | (11)

Charge-based
Total_Charge | Sum of absolute values of partial charges on all atoms of a molecule | (9)
Total_Charge_Formal | Sum of charges on all atoms of a molecule (formal charge of a molecule) | (9)
Average_Pos_Charge* | Arithmetic mean of positive partial charges on atoms of a molecule | (9)
Average_Neg_Charge* | Arithmetic mean of negative partial charges on atoms of a molecule | (9)
Most_Pos_Charge | Largest partial charge among values for positively charged atoms | (9)
Most_Neg_Charge | Largest partial charge among values for negatively charged atoms | (9)

Descriptors based on inductive substituent constants
Total_Sigma_mol_i* | Sum of inductive parameters sigma (molecule→atom) for all atoms within a molecule | (7)
Total_Abs_Sigma_mol_i | Sum of absolute values of group inductive parameters sigma (molecule→atom) for all atoms within a molecule | (7)
Most_Pos_Sigma_mol_i | Largest positive group inductive parameter sigma (molecule→atom) for atoms in a molecule | (7)
Most_Neg_Sigma_mol_i* | Largest (by absolute value) negative group inductive parameter sigma (molecule→atom) for atoms in a molecule | (7)
Most_Pos_Sigma_i_mol | Largest positive atomic inductive parameter sigma (atom→molecule) for atoms in a molecule | (7)
Most_Neg_Sigma_i_mol | Largest negative atomic inductive parameter sigma (atom→molecule) for atoms in a molecule | (7)
Sum_Pos_Sigma_mol_i* | Sum of all positive group inductive parameters sigma (molecule→atom) within a molecule | (7)
Sum_Neg_Sigma_mol_i* | Sum of all negative group inductive parameters sigma (molecule→atom) within a molecule | (7)

Descriptors based on steric substituent constants
Largest_Rs_mol_i | Largest value of steric influence Rs(molecule→atom) in a molecule | (5)
Smallest_Rs_mol_i* | Smallest value of group steric influence Rs(molecule→atom) in a molecule | (5)
Largest_Rs_i_mol* | Largest value of atomic steric influence Rs(atom→molecule) in a molecule | (4)
Smallest_Rs_i_mol | Smallest value of atomic steric influence Rs(atom→molecule) in a molecule | (4)
Most_Pos_Rs_mol_i | Steric influence Rs(molecule→atom) ON the most positively charged atom in a molecule | (5)
Most_Neg_Rs_mol_i* | Steric influence Rs(molecule→atom) ON the most negatively charged atom in a molecule | (5)
Most_Pos_Rs_i_mol | Steric influence Rs(atom→molecule) OF the most positively charged atom to the rest of a molecule | (4)
Most_Neg_Rs_i_mol* | Steric influence Rs(atom→molecule) OF the most negatively charged atom to the rest of a molecule | (4)

Conventional QSAR descriptors implemented by the MOE software
a_acc* | Number of hydrogen bond acceptor atoms | N/A
a_don* | Number of hydrogen bond donor atoms | N/A
ASA* | Water accessible surface area | N/A
ASA_H* | Water accessible surface area of all hydrophobic atoms | N/A
ASA_P* | Water accessible surface area of all polar atoms | N/A
ASA-* | Water accessible surface area of all atoms with negative partial charge | N/A
ASA+* | Water accessible surface area of all atoms with positive partial charge | N/A
FCharge* | Total charge of the molecule | N/A
b_1rotN | Number of rotatable single bonds | N/A
logP(o/w)* | Log of the octanol/water partition coefficient | N/A
logS* | Log of the aqueous solubility | N/A
Mr | Molecular refractivity | N/A
PC-* | Total negative partial charge | N/A
PC+* | Total positive partial charge | N/A
RPC- | Relative negative partial charge | N/A
RPC+* | Relative positive partial charge | N/A
TPSA | Polar surface area | N/A
vdw_area* | van der Waals surface area calculated using a connection table approximation | N/A
vdw_vol | van der Waals volume calculated using a connection table approximation | N/A
Vol | van der Waals volume calculated using a grid approximation | N/A
VSA | van der Waals surface area using polyhedral representation | N/A
vsa_acc* | Approximation to the sum of VDW surface areas of pure hydrogen bond acceptors | N/A
vsa_acid* | Approximation to the sum of VDW surface areas of acidic atoms | N/A
vsa_base | Approximation to the sum of VDW surface areas of basic atoms | N/A
vsa_don | Approximation to the sum of VDW surface areas of pure hydrogen bond donors | N/A
vsa_hyd* | Approximation to the sum of VDW surface areas of hydrophobic atoms | N/A
Weight* | Molecular weight | N/A

Table 4.4. Description of all QSAR descriptors used in analysis of peptide activities. The column 'Parental Equation' refers to the equation described in the text that is used to calculate the descriptor.
Those descriptors without a parental equation were provided by molecular simulation software (Molecular Operating Environment, 2005, by Chemical Computing Group Inc., Montreal, Canada). Descriptors indicated with * were used in the classification analysis as described in the text.

Sequence | Cumulative Vote | Average Rank | Rel IC50 | Charge | Hydrophobicity | Hydrophobic moment

Topmost 50 (rows 1-50)
RWRWKRWWW | 29 | 2027.13 | 0.25 | 4 | 0.56 | 1.48
RWRRWKWWW | 29 | 2707.87 | 0.4 | 4 | 0.56 | 1.96
RWWRWRKWW | 29 | 2728.97 | 0.28 | 4 | 0.56 | 2.11
RWRRKWWWW | 28 | 2831.87 | 0.39 | 4 | 0.56 | 2.75
RWRWWKRWY | 28 | 3044.53 | 0.2 | 4 | 0.56 | 2.86
RRKRWWWWW | 27 | 2434.63 | 0.43 | 4 | 0.56 | 1.22
RWRIKRWWW | 27 | 2589.1 | 0.12 | 4 | 0.56 | 1.84
KIWWWWRKR | 27 | 2622.3 | 0.13 | 4 | 0.56 | 2.06
RWRRWKWWL | 27 | 3201.17 | 0.08 | 4 | 0.56 | 2.12
KRWWKWIRW | 27 | 3660.7 | 0.04 | 4 | 0.56 | 4.65
KRWWWWWKR | 26 | 2601.83 | 0.22 | 4 | 0.56 | 4.19
IRWWKRWWR | 26 | 2735.33 | 0.21 | 4 | 0.56 | 6.32
IKRWWRWWR | 26 | 2848.03 | 0.23 | 4 | 0.56 | 5.75
RRKWWWRWW | 26 | 2859 | 0.27 | 4 | 0.56 | 1.2
RKWWRWWRW | 26 | 2866.13 | 0.31 | 4 | 0.56 | 5.44
KRWWWWRFR | 26 | 2952.17 | 0.24 | 4 | 0.56 | 2.13
IKRWWWRRW | 26 | 3063.3 | 0.22 | 4 | 0.56 | 2.98
KRWWWVWKR | 26 | 3080.23 | 0.36 | 4 | 0.56 | 4.19
KWRRWKRWW | 26 | 3291.97 | 0.15 | 5 | 0.44 | 4.29
WRWWKIWKR | 26 | 3456 | 0.14 | 4 | 0.56 | 4.75
WRWRWWKRW | 26 | 4973.83 | 0.28 | 4 | 0.56 | 2.6
WKRWKWWKR | 26 | 5351.2 | 0.25 | 5 | 0.44 | 2.96
RIKRWWWWR | 25 | 2875.47 | 0.31 | 4 | 0.56 | 2.09
IWKRWWRRW | 25 | 3011.93 | 0.24 | 4 | 0.56 | 5.23
KWWKIWWKR | 25 | 3075.07 | 0.2 | 4 | 0.56 | 3.32
RKRWLWRWW | 25 | 3292.37 | 0.25 | 4 | 0.56 | 2.28
KRWRWWRWW | 25 | 3309.7 | 0.28 | 4 | 0.56 | 2.03
KKRWLWWWR | 25 | 3328.63 | 0.3 | 4 | 0.56 | 2.57
RWWRKWWIR | 25 | 3426.07 | 0.24 | 4 | 0.56 | 4.11
KWWRWWRKW | 25 | 3543.47 | 0.2 | 4 | 0.56 | 5.14
KRWWIRWWR | 25 | 3591.17 | 0.21 | 4 | 0.56 | 5.09
KIWWWWRRR | 25 | 3616.8 | 0.21 | 4 | 0.56 | 2.72
RRRKWWIWW | 25 | 3926.37 | 0.18 | 4 | 0.56 | 0.32
RRRWWWWWW | 25 | 3935 | 1.82 | 3 | 0.67 | 1.22
RWWIRKWWR | 25 | 3965.1 | 0.21 | 4 | 0.56 | 5.05
KRWWKWWRR | 25 | 3974.97 | 0.13 | 5 | 0.44 | 5.89
KRWWRKWWR | 25 | 3980.1 | 0.15 | 5 | 0.44 | 6.4
RRIWRWWWW | 25 | 4065.33 | 0.68 | 3 | 0.67 | 4.7
IRRRKWWWW | 25 | 4099.4 | 0.21 | 4 | 0.56 | 0.93
KRKIWWWIR | 25 | 4202.17 | 0.28 | 4 | 0.56 | 3.9
RKIWWWRIR | 25 | 4205 | 0.59 | 4 | 0.56 | 1.83
KRWWIWRIR | 25 | 4216.67 | 0.35 | 4 | 0.56 | 2.02
RWFRWWKRW | 25 | 4610.57 | 0.26 | 4 | 0.56 | 5.94
WRWWWKKWR | 25 | 5055.03 | 0.19 | 4 | 0.56 | 4.2
WKRWWKKWR | 25 | 5248.37 | 0.2 | 5 | 0.44 | 4.66
WKRWRWIRW | 25 | 5696.47 | 0.28 | 4 | 0.56 | 1.81
WRWWKWWRR | 25 | 6026.73 | 0.23 | 4 | 0.56 | 4.94
WKKWWKRRW | 25 | 6133.6 | 0.19 | 5 | 0.44 | 2.41
WRWYWWKKR | 25 | 6147.73 | 0.22 | 4 | 0.56 | 1.86
WRRWWKWWR | 25 | 6591.37 | 0.23 | 4 | 0.56 | 5.39

Start of 2nd quartile (rows 24895–24944)
IRMWVKRWR | 0 | 13255.83 | 0.61 | 4 | 0.56 | 4.24
RIWYWYKRW | 0 | 13263.4 | 0.36 | 3 | 0.67 | 4.06
FRRWWKWFK | 0 | 13275.73 | 0.12 | 4 | 0.56 | 5.4
RVRWWKKRW | 0 | 13278.87 | 0.27 | 5 | 0.44 | 2.27
RLKKVRWWW | 0 | 13318.77 | 0.34 | 4 | 0.56 | 1.16
RWWLKIRKW | 0 | 13319.53 | 0.18 | 4 | 0.56 | 3.85
LRWWWIKRI | 0 | 13336.07 | 0.33 | 3 | 0.67 | 0.99
TRKVWWWRW | 0 | 13336.23 | 0.76 | 3 | 0.56 | 0.78
KRFWIWFWR | 0 | 13347.1 | 3.04 | 3 | 0.67 | 4.11
KKRWVWVIR | 0 | 13348.17 | 0.35 | 4 | 0.56 | 2.92
KRWVWYRYW | 0 | 13352.4 | 0.54 | 3 | 0.67 | 0.41
IRKWRRWWK | 0 | 13365.3 | 0.41 | 5 | 0.44 | 5.9
RHWKTWWKR | 0 | 13385.47 | 0.95 | 5 | 0.33 | 4.67
RRFKKWYWY | 0 | 13390 | 0.26 | 4 | 0.56 | 3.72
RIKVIWWWR | 0 | 13392.73 | 0.51 | 3 | 0.67 | 0.95
RKRLKWWIY | 0 | 13406.5 | 0.18 | 4 | 0.56 | 1.98
LVFRKYWKR | 0 | 13417.57 | 0.99 | 4 | 0.56 | 3.44
RRRWWWIIV | 0 | 13418.2 | 0.85 | 3 | 0.67 | 1.55
KKRWVWIRY | 0 | 13418.77 | 0.22 | 4 | 0.56 | 0.98
RWRIKFKRW | 0 | 13440.07 | 0.26 | 5 | 0.44 | 2.9
KWKIFRRWW | 0 | 13460.03 | 0.16 | 4 | 0.56 | 3.53
IWKRWRKRL | 0 | 13465.47 | 0.33 | 5 | 0.44 | 3.74
RRRKWWIWG | 0 | 13466.93 | 0.57 | 4 | 0.44 | 0.46
RWLVLRKRW | 0 | 13469.13 | 0.53 | 4 | 0.56 | 1.58
RKWIWRWFL | 0 | 13472.93 | 0.15 | 3 | 0.67 | 2.8
KRRRIWWWK | 0 | 13487.3 | 0.4 | 5 | 0.44 | 0.49
IWWKWRRWV | 0 | 13521.7 | 0.29 | 3 | 0.67 | 3.52
LRWRWWKIK | 0 | 13547.1 | 0.26 | 4 | 0.56 | 0.69
RWKMWWRWV | 0 | 13552.9 | 0.24 | 3 | 0.67 | 3
VKRYYWRWR | 0 | 13559.43 | 1.23 | 4 | 0.56 | 3.11
RWYRKRWSW | 0 | 13593.73 | 0.7 | 4 | 0.44 | 2.59
KRKLIRWWW | 0 | 13608.7 | 0.23 | 4 | 0.56 | 3.68
RWRWWIKII | 0 | 13621.07 | 0.46 | 3 | 0.67 | 2.99
KFRKRVWWW | 0 | 13632.63 | 0.3 | 4 | 0.56 | 2.08
IWIWRKLRW | 0 | 13638.23 | 0.46 | 3 | 0.67 | 2.68
LRFILWWKR | 0 | 13645.27 | 0.88 | 3 | 0.67 | 3.75
RVWFKRRWW | 0 | 13669.67 | 0.26 | 4 | 0.56 | 0.23
RRWFVKWWY | 0 | 13671 | 0.52 | 3 | 0.67 | 3.18
KWWLVWKRK | 0 | 13675.37 | 0.23 | 4 | 0.56 | 2.56
RWILWWWRI | 0 | 13678.73 | 25 | 2 | 0.78 | 4.11
KRWLTWRFR | 0 | 13690.3 | 0.54 | 4 | 0.44 | 2.62
RKWRWRWLK | 0 | 13700.5 | 0.31 | 5 | 0.44 | 2.4
IRRRWWWIV | 0 | 13702.2 | 0.23 | 3 | 0.67 | 2.55
IKWWWRMRI | 0 | 13705.3 | 0.39 | 3 | 0.67 | 1.52
RWKIFIRWW | 0 | 13708 | 1.82 | 3 | 0.67 | 2.84
IRQWWRRWW | 0 | 13720.43 | 0.5 | 3 | 0.56 | 4.89
RRRKTWYWW | 0 | 13724.03 | 0.32 | 4 | 0.44 | 0.41
RRWWHLWRK | 0 | 13725.63 | 0.38 | 5 | 0.44 | 5.24
RRWWMRWWV | 0 | 13726.37 | 0.33 | 3 | 0.67 | 3.07
RRFKFIRWW | 0 | 13731.6 | 0.24 | 4 | 0.56 | 2.13

End of 3rd quartile (rows 74633–74682)
INRKRRLRW | 0 | 67262.97 | 4.25 | 5 | 0.33 | 0.84
RRMKKLRRK | 0 | 67264.37 | 4.22 | 7 | 0.22 | 4.46
RKVRWKIRV | 0 | 67264.47 | 0.32 | 5 | 0.44 | 3.76
VRIVRVRIR | 0 | 67264.93 | 2.22 | 4 | 0.56 | 3.69
IKRVKRRKR | 0 | 67265.27 | 2.93 | 7 | 0.22 | 3.91
RVKTWRVRT | 0 | 67265.3 | 5.66 | 4 | 0.33 | 1.24
RVFVKIRMK | 0 | 67265.4 | 0.72 | 4 | 0.56 | 2.63
IRGRIIFWV | 0 | 67266.27 | 0.44 | 2 | 0.67 | 0.57
ATWIWVFRR | 0 | 67267.5 | 4.88 | 2 | 0.67 | 2.91
KKSKQLWKR | 0 | 67268.5 | 3.23 | 5 | 0.22 | 4.1
MINRVRLRW | 0 | 67269.17 | 2.77 | 3 | 0.56 | 2.2
GGIRRLRWY | 0 | 67270.13 | 1.16 | 3 | 0.44 | 2.84
RLVHWIRRV | 0 | 67270.3 | 2.62 | 4 | 0.56 | 5.36
AWKIKKGRI | 0 | 67270.47 | 3.59 | 4 | 0.44 | 0.13
FVVMKRIVW | 0 | 67271.23 | 5.38 | 2 | 0.78 | 2.33
GIKWRSRRW | 0 | 67272.9 | 1.06 | 4 | 0.33 | 2.06
RWMVSKIWY | 0 | 67273.33 | 25 | 2 | 0.67 | 2.15
IVVRVWVVR | 0 | 67274.7 | 3.5 | 2 | 0.78 | 0.69
RWIGVIIKY | 0 | 67274.83 | 2.24 | 2 | 0.67 | 4.01
WIRKRSRIF | 0 | 67275.33 | 3.39 | 4 | 0.44 | 3.41
GWKILRKRK | 0 | 67277.03 | 2.74 | 5 | 0.33 | 1.96
YQRLFVRIR | 0 | 67280 | 25 | 3 | 0.56 | 3.38
AVWKFVKRV | 0 | 67280.2 | 8.18 | 3 | 0.67 | 4.52
IRKKRRRWT | 0 | 67281.47 | 6.59 | 6 | 0.22 | 3.08
ILRVISKRR | 0 | 67282 | 25 | 4 | 0.44 | 2.06
AWRFKNIRK | 0 | 67282.57 | 9.2 | 4 | 0.44 | 1.85
HYKFQRWIK | 0 | 67283.83 | 2.79 | 4 | 0.44 | 3.94
RRIRRVRWG | 0 | 67283.93 | 8.22 | 5 | 0.33 | 3.54
VLVKKRRRR | 0 | 67283.93 | 12.48 | 6 | 0.33 | 1.2
RWRGIVHIR | 0 | 67284.03 | 4.93 | 4 | 0.44 | 0.64
WRNRKVVWR | 0 | 67284.73 | 6.79 | 4 | 0.44 | 2.89
KFWWWNYLK | 0 | 67284.93 | 1.81 | 2 | 0.67 | 1.33
KRIMKLKMR | 0 | 67284.97 | 6.5 | 5 | 0.44 | 4.04
IRRRKKRIK | 0 | 67286.73 | 6.42 | 7 | 0.22 | 3.29
RKWMGRFLM | 0 | 67286.77 | 4.38 | 3 | 0.56 | 2.92
RRVQRGKWW | 0 | 67287.2 | 6.3 | 4 | 0.33 | 3.04
WHGVRWWKW | 0 | 67289.83 | 2.5 | 3 | 0.56 | 2.63
WVRFVYRYW | 0 | 67289.93 | 2.15 | 2 | 0.78 | 4.59
RKRTKVTWI | 0 | 67291.97 | 5.11 | 4 | 0.33 | 0.7
IRRIVRRKI | 0 | 67292.87 | 11.15 | 5 | 0.44 | 4.79
KIRRKVRWG | 0 | 67295.4 | 10.55 | 5 | 0.33 | 2.02
AIRRWRIRK | 0 | 67295.77 | 4.62 | 5 | 0.44 | 5.94
WRFKVLRQR | 0 | 67297.83 | 7.08 | 4 | 0.44 | 4.2
RSGKKRWRR | 0 | 67297.97 | 6.5 | 6 | 0.11 | 4.66
FMWVYRYKK | 0 | 67298 | 1.51 | 3 | 0.67 | 1.81
RGKYIRWRK | 0 | 67298.13 | 3.83 | 5 | 0.33 | 4.94
WVKVWKYTW | 0 | 67298.3 | 5.64 | 2 | 0.67 | 2.41
VVLKIVRRF | 0 | 67298.63 | 25 | 3 | 0.67 | 1.86
GKFYKVWVR | 0 | 67298.7 | 1.21 | 3 | 0.56 | 5.39
SWYRTRKRV | 0 | 67299.6 | 6.66 | 4 | 0.33 | 4.24

Bottom-most 50 (rows 99528–99577)
KNRGRWFSH | 0 | 97923.77 | 9.79 | 4 | 0.22 | 2.42
AFRGSRHRM | 0 | 97924.6 | 11.36 | 4 | 0.33 | 1.89
GRNGWYRIN | 0 | 97925.57 | 10.74 | 2 | 0.33 | 2.97
AGGMRKRTR | 0 | 97945.47 | 25 | 4 | 0.22 | 2.12
ATRKGYSKF | 0 | 97994.6 | 25 | 3 | 0.33 | 2.78
SSGVRWSWR | 0 | 97995.4 | 8.16 | 2 | 0.33 | 3.65
RVWRNGYSR | 0 | 97996.27 | 10.24 | 3 | 0.33 | 4.26
WGRTRWSSR | 0 | 98002.77 | 9.64 | 3 | 0.22 | 1.4
GKRVWGRGR | 0 | 98018.07 | 8.2 | 4 | 0.22 | 3.28
SFNWKRSGK | 0 | 98036.97 | 25 | 3 | 0.22 | 2.47
WGRGGWTNR | 0 | 98042.53 | 25 | 2 | 0.22 | 1.17
ANRWGRGIR | 0 | 98047.13 | 10.8 | 3 | 0.33 | 5.29
WGGHKRRGW | 0 | 98049.73 | 6.19 | 4 | 0.22 | 1.66
WHGGQKWRK | 0 | 98093 | 8.5 | 4 | 0.22 | 2.87
FVWQKGTNR | 0 | 98093.2 | 11.3 | 2 | 0.33 | 2.18
HGVWGNRKR | 0 | 98107.57 | 7.95 | 4 | 0.22 | 0.77
TRGWSLGTR | 0 | 98118.03 | 12.15 | 2 | 0.22 | 3.88
GRRVMNQKR | 0 | 98140.33 | 9.83 | 4 | 0.22 | 3.34
RNKFGGNWR | 0 | 98153.27 | 25 | 3 | 0.22 | 1.88
GVRVQRNSK | 0 | 98166.57 | 25 | 3 | 0.22 | 3.02
NQKWSGRRR | 0 | 98171.9 | 7.97 | 4 | 0.11 | 0.77
RQNGVWRVF | 0 | 98183.87 | 8.26 | 2 | 0.44 | 2.03
GRMRLWNGR | 0 | 98205.93 | 7.91 | 3 | 0.33 | 1.03
WHYRSQVGR | 0 | 98228.63 | 6.65 | 3 | 0.33 | 1.83
GWNTMGRRW | 0 | 98257.43 | 6.32 | 2 | 0.33 | 3.5
RRMGNGGFR | 0 | 98272.87 | 8.71 | 3 | 0.22 | 4.81
SKNVRTWRQ | 0 | 98314.23 | 7.06 | 3 | 0.22 | 3.81
ARGRWINGR | 0 | 98370.6 | 7.24 | 3 | 0.33 | 0.98
GSRRSVWVF | 0 | 98381.53 | 2.3 | 2 | 0.44 | 2.29
WSQNVRTRI | 0 | 98383.63 | 5.71 | 2 | 0.33 | 1.63
GMRRWRGKN | 0 | 98444.37 | 6.05 | 4 | 0.22 | 1.7
RGRTSNWKM | 0 | 98450.4 | 7.07 | 3 | 0.22 | 1.29
GRRWGMGVR | 0 | 98481.6 | 7.75 | 3 | 0.33 | 4.01
WGKRRGWNT | 0 | 98490.87 | 7.91 | 3 | 0.22 | 1.94
AMLGGRQWR | 0 | 98497.47 | 6.75 | 2 | 0.44 | 2.87
QRNKGLRHH | 0 | 98538.87 | 8.76 | 5 | 0.11 | 1.16
ARGKSIKNR | 0 | 98539.63 | 8.35 | 4 | 0.22 | 1.67
NRRNGQMRR | 0 | 98587.97 | 8.41 | 4 | 0.11 | 2.31
RGRRQIGKF | 0 | 98602.67 | 8.51 | 4 | 0.22 | 4.43
ASKRVGVRN | 0 | 98637.47 | 8.17 | 3 | 0.33 | 2.06
GRIGGKNVR | 0 | 98644.47 | 9.12 | 3 | 0.22 | 4.3
NKTGYRWRN | 0 | 98701.07 | 8.33 | 3 | 0.22 | 2.75
VSGNWRGSR | 0 | 98756.67 | 8.54 | 2 | 0.22 | 2.67
GWGGKRRNF | 0 | 98807.8 | 7.38 | 3 | 0.22 | 1.13
KNNRRWQGR | 0 | 98885.2 | 6.45 | 4 | 0.11 | 2.88
GRTMGNGRW | 0 | 98946.9 | 6.93 | 2 | 0.22 | 1.4
GRQISWGRT | 0 | 98949.4 | 8.04 | 2 | 0.22 | 1.94
GGRGTRWHG | 0 | 99178.53 | 8.6 | 3 | 0.11 | 2.63
GVRSWSQRT | 0 | 99185.7 | 8.5 | 2 | 0.22 | 2.56
GSRRFGWNR | 0 | 99199.47 | 8.1 | 3 | 0.22 | 0.58

Table 4.5. Candidate peptides for confirmation of QSAR predictions. The 200 total candidate peptides are shown with peptide charge, hydrophobicity as hydrophobic fraction and hydrophobic moment using the Eisenberg scale.

4.6 References

Cherkasov, A., Hilpert, K., Jenssen, H., Fjell, C.D., Waldbrook, M., Mullaly, S.C., Volkmer, R., and Hancock, R.E.W. (2008) Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic resistant Superbugs. ACS Chemical Biology, in press.
Cherkasov, A. (2005) 'Inductive' Descriptors. 10 Successful Years in QSAR. Current Computer-Aided Drug Design, 1: 21-42.
Cherkasov, A. (2005) Inductive QSAR Descriptors. Distinguishing Compounds with Antibacterial Activity by Artificial Neural Networks. Int. J. Mol. Sci., 6: 63-86.
Eisenberg, D., Weiss, R. M., Terwilliger, T. C. (1984) The hydrophobic moment detects periodicity in protein hydrophobicity. Proc. Natl. Acad. Sci. USA, 81: 140-4.
Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhanced to treat microbial infections? Nature Reviews Microbiology, 2: 497-504.
Frecer, V. (2006) QSAR analysis of antimicrobial and haemolytic effects of cyclic cationic antimicrobial peptides derived from protegrin-1. Bioorganic & Medicinal Chemistry, 14: 6065-6074.
Frecer, V., Ho, B., Ding, J.L. (2004) De Novo Design of Potent Antimicrobial Peptides. Antimicrob. Agents Chemother., 48: 3349-3357.
Halgren, T. A. (1996) Merck molecular force field .1.
Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry, 17: 490-519.
Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212.
Hancock, R.E.W., and Sahl, H.G. (2006) Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24: 1551-1557.
Hilpert, K., Volkmer-Engert, R., Walter, T., Hancock, R.E.W. (2005) High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnology, 23: 1008-1012.
Hilpert, K., Elliott, M. R., Volkmer-Engert, R., Henklein, P., Donini, O., Zhou, Q., Winkler, D. F., Hancock, R. E. (2006) Sequence requirements and an optimization strategy for short antimicrobial peptides. Chem. Biol., 13: 1101-7.
Hilpert, K., Winkler, D. F., Hancock, R. E. (2007) Peptide arrays on cellulose support: SPOT synthesis, a time and cost efficient method for synthesis of large numbers of peptides in a parallel and addressable fashion. Nat. Protoc., 2: 1333-49.
Jenssen, H., Gutteberg, T.J., and Lejon, T. (2005) Modelling of anti-HSV activity of lactoferricin analogues using amino acid descriptors. J. Pept. Sci., 11: 97-103.
Jenssen, H., Hamill, P., and Hancock, R.E.W. (2006) Peptide Antimicrobial Agents. Clinical Microbiology Reviews, 19: 491-511.
Karakoc, E., Cherkasov, A., Sahinalp, S.C. (2006) Distance based algorithms for small biomolecule classification and structural similarity search. Bioinformatics, 15: 243-251.
Karakoc, E., Sahinalp, S.C., and Cherkasov, A. (2006) Comparative QSAR- and fragments distribution analysis of drugs, druglikes, metabolic substances, and antimicrobial compounds. J. Chem. Inf. Model., 46: 2167-2182.
Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407.
Lejon, T., Stiberg, T., Strom, M.B., and Svendsen, J.S. (2004) Prediction of antibiotic activity and synthesis of new pentadecapeptides based on lactoferricins. J. Pept. Sci., 10: 329-335.
Lejon, T., Strom, M.B., and Svendsen, J.S. (2001) Antibiotic activity of pentadecapeptides modelled from amino acid descriptors. J. Pept. Sci., 7: 74-81.
Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122-S129.
Ostberg, N., and Kaznessis, Y. (2004) Protegrin structure-activity relationships: using homology models of synthetic sequences to determine structural characteristics important for activity. Peptides, 26: 197-206.
Perkins, R., Fang, H., Tong, W., and Welsh, W.J. (2003) Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environmental Toxicology and Chemistry, 22: 1666-79.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992) Numerical Recipes in C: The Art of Scientific Computing (2nd Edition). Cambridge University Press, New York.
R Development Core Team (2005) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16: 276-277.
Sawyer, J. G., Martin, N. L., Hancock, R. E. (1988) Interaction of macrophage cationic proteins with the outer membrane of Pseudomonas aeruginosa. Infect. Immun., 56: 693-8.
Sing, T., Sander, O., Beerenwinkel, N., Lengauer, T. (2005) ROCR: visualizing classifier performance in R. Bioinformatics, 21: 3940-3941.
Strom, M.B., Stensen, W., Svendsen, J.S., and Rekdal, O. (2001) Increased antibacterial activity of 15-residue murine lactoferricin derivatives. J. Pept. Res., 57: 127-139.
Yeaman, M.R., Yount, N.Y. (2003) Mechanisms of Antimicrobial Peptide Action and Resistance. Pharmacol. Rev., 55: 27-55.
Chapter 5: Genetic algorithms for identification of potent antimicrobial peptides

A version of this chapter will be submitted as: Fjell, C.D., Jenssen, H., Hilpert, K., Cheung, W.A., Hancock, R.E.W., and Cherkasov, A. Optimization of Antibacterial Peptides by Genetic Algorithms and QSAR.

5.1 Introduction

Human pathogens that are resistant to current antibiotic treatment represent a significant health threat worldwide (Levy and Marshall, 2004). Drugs based on synthetic peptides are inspired by the short cationic, amphipathic peptides found throughout the kingdoms of life that possess antimicrobial activity by various mechanisms (see, for example, Yeaman and Yount, 2003). These peptides have drawn significant attention as a possible source of novel antibacterial agents (Hamilton-Miller, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004; Hancock and Sahl, 2006). While antimicrobial peptides generally exhibit lower potency against susceptible bacterial targets than conventional low-molecular-weight antibiotic compounds, they have advantages that compensate for this lower potency, including fast killing, a broad range of activity, a postulated multiplicity of targets, low toxicity for host cells and minimal development of resistance in target organisms (Hancock and Sahl, 2006; Jenssen et al., 2006).

We have recently shown for the first time that synthetic peptides with high antibacterial activity and low toxicity can be identified with high accuracy using chemoinformatics and machine learning, without the use of an original template sequence (Cherkasov et al., in press). To achieve this we used quantitative structure-activity relationships (QSAR) combined with artificial neural networks to build software models of peptide activity.
As a basis for describing structure in these peptides, we employed a set of 44 descriptors, including 3D QSAR descriptors that utilize atomic-scale molecular information, the so-called ‘inductive’ QSAR descriptors reviewed by Cherkasov (2005a). In addition to our peptide studies, these have been successfully applied in a number of molecular modelling studies, including identification of antibacterial activity of small compounds (Cherkasov, 2005b) and classification of the chemical structures of antimicrobial compounds, conventional drugs and drug-like substances (Karakoc et al., 2006a and 2006b).

In our recent work (Cherkasov et al., in press), we used a large data set of 1400 synthetic peptides, screened for activity using a high-throughput assay (Hilpert et al., 2005), containing random sequences that were biased to contain amino acids believed from substitution analyses to be important for antibacterial activity. Three-dimensional structures for these peptides were estimated and descriptors were calculated for each peptide. These values were related to the measured antibacterial activity using artificial neural network models to classify peptides as active or inactive. To demonstrate the power of these techniques for identifying drug targets we performed an in silico screening of 100,000 peptides and demonstrated, by synthesizing example peptides from each activity quartile, that peptides with superior activity could be identified with 94% accuracy. However, the complexity of the artificial neural network solution prevents us from 'inverting' the solution and using it to directly determine peptide sequences that are predicted to be active; instead, a small number of active peptides are identified from a large set of in silico candidates by computational evaluation.
A common problem in drug discovery is that an exhaustive search is not possible due to the massive number of possible peptide variants (20^X, where X is the number of amino acids in the peptide chain) and the time and resources needed for QSAR descriptor calculations. We considered that it would be advantageous to utilize a search strategy that would minimize the number of peptides that need to be evaluated to determine additional highly active peptides.

Here genetic algorithms were applied to this problem since these evolutionary methods have been applied successfully in other areas of chemoinformatics (Parrill, 1996; Niculescu, 2003; Solmajer and Zupan, 2004; Weaver, 2004). A genetic algorithm is a heuristic method for search and approximation problems and is particularly well suited for problems involving string-like data such as the amino acids in a peptide. Genetic algorithms operate on populations of solutions by iteratively enhancing solutions using operations inspired by natural genetic processes: cross-overs (combining parts of two solutions to suggest another) and mutations (randomly changing one part of a solution to generate another). Each solution ('phenotype' in the jargon of genetic algorithms) is composed of elements ('genes') that are randomly modified ('mutated') or shuffled with other solutions ('crossed-over') and evaluated for fitness at each iteration ('generation'). The best solutions are propagated into the next iteration, and new solutions produced by modifying and combining these best solutions are added to the population. In the current work, we demonstrate that a genetic algorithm approach effectively minimizes the number of peptides that must be evaluated for in silico screening of synthetic antibacterial peptides with high potency.
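The two operators described above, and the size of the sequence space, can be illustrated with a minimal Python sketch. This is not the implementation used in this work; the function names `search_space_size`, `mutate` and `crossover`, and the explicit position arguments, are assumptions made for illustration. In a real genetic algorithm the positions and replacement residues would be drawn at random; they are passed explicitly here so that the result is reproducible. The example sequences are 9-residue peptides of the kind studied in this chapter.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def search_space_size(length, alphabet_size=len(AMINO_ACIDS)):
    """Number of distinct peptides of a given length: alphabet_size ** length."""
    return alphabet_size ** length


def mutate(peptide, position, new_residue):
    """Return a copy of the peptide with one residue replaced (0-based position)."""
    if new_residue not in AMINO_ACIDS:
        raise ValueError(f"unknown residue: {new_residue}")
    return peptide[:position] + new_residue + peptide[position + 1:]


def crossover(parent_a, parent_b, cut):
    """Single-point cross-over: the first `cut` residues of parent_a
    followed by the remaining residues of parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return parent_a[:cut] + parent_b[cut:]


print(search_space_size(9))                    # 512000000000
print(mutate("RVWKIWRWR", 1, "I"))             # RIWKIWRWR
print(crossover("RWYYWWRRH", "KWKWWRMWR", 6))  # RWYYWWMWR
```

Even for 9-residue peptides the space holds 20^9 (about 5 x 10^11) sequences, which is why an exhaustive evaluation of descriptors for every sequence is impractical.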
5.2 Results and discussion

A genetic algorithm solution requires that the problem be described in terms of a genetic representation, and a fitness function must be specified to permit evaluation of each solution. The genetic algorithm then either passes high fitness individuals on to the next generation, removes low fitness individuals, or creates offspring by cross-over of two existing individuals or by mutation of an existing individual. Examples of mutation and cross-over that produced dramatic changes in peptide fitness are shown in Figure 5.1: mutation of one amino acid (V to I) increased fitness from 21 to 26, whereas a cross-over combining portions of two peptides with fitness 20 yielded a peptide with fitness 0.

[Figure 5.1 panels. Mutation: RVWKIWRWR (21) -> RIWKIWRWR (26). Recombination: RWYYWWRRH (20) + KWKWWRMWR (20) -> RWYYWWMWR (0).]

Figure 5.1. Examples of peptide evolution. Two examples of peptide evolution are shown: mutation of a single amino acid that results in an improved peptide, and recombination of two moderate scoring peptides to form one low scoring peptide. Values in round brackets are the fitness scores for the peptides.

5.2.1 Evaluation of peptide fitness score

In our previous studies we created a software system to predict the activity of 9 amino acid peptides. This system was constructed to make maximum use of the available experimental data by utilizing models produced by a stratified 10-fold cross-validation, as described previously. The system consisted of a set of 30 artificial neural network models derived from the 10-fold cross-validation models of the 2 data sets (Set A and B) of screened peptides plus the combined set (Set A+B). These were classification models trained to consider the top 5% as active. Our confidence that a peptide is active could be judged by the number of models that classified the peptide as active.
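The idea of judging confidence by the number of agreeing models can be written as a simple vote count. The sketch below is illustrative only: the function name `fitness_score`, the use of one numeric output and one cut-off per model, the `>=` comparison, and the example values are all assumptions, not details taken from the modelling system itself.

```python
def fitness_score(model_outputs, thresholds):
    """Count how many models place the peptide at or above their own 'active' cut-off."""
    if len(model_outputs) != len(thresholds):
        raise ValueError("one threshold is required per model output")
    return sum(1 for output, cutoff in zip(model_outputs, thresholds)
               if output >= cutoff)


# Hypothetical outputs of 30 models for one peptide, with a made-up cut-off of 0.5 each.
outputs = [0.9] * 25 + [0.1] * 5
cutoffs = [0.5] * 30
print(fitness_score(outputs, cutoffs))  # 25
```

With 30 models the score ranges from 0 (no model predicts high activity) to 30 (all models agree).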
As reported previously, the accuracy of predicting peptide activity was highest when the largest number of models predicted activity: for example, for the top 50 peptides predicted out of a set of 100,000 amino-acid-biased semi-random peptides, the number of models indicating high activity ranged from 25 to 29. For these peptides, the accuracy of predicting highly active peptides was 94%. This number of models indicating high activity was therefore taken as the genetic algorithm fitness score.

5.2.2 Initial population of peptides

We executed genetic algorithm searches starting from two initial populations of peptides for two purposes. Firstly, we wished to identify additional peptides with very high fitness scores to evaluate the ability of genetic algorithms to identify novel peptides for screening by antibacterial activity assay. Secondly, we wished to understand the importance of the starting population for the composition of later peptide populations in a search. Both sets of peptides were selected from the biased random set of 100,000 peptides we have described previously (Cherkasov et al., in press) at different levels of fitness score. For the first search (Simulation A), we selected peptides that were moderately predicted to be active, having a fitness value of 20 or 21. We selected these peptides as a small initial population that maximized the diversity of amino acids present in the peptides with this level of initial fitness score, by ensuring that all amino acids present in the library were present at least to some degree in these peptides (Table 5.1). An initial set of 19 peptides was selected that included all of the 12 amino acids present in the 594 peptides of the 100,000 having a fitness value of 20 or 21. Since some
amino acids had low representation (G, Q and S each occurred in only 1 peptide, and H in 2) we decided to use a small population to minimize the effect of the relatively large numbers of certain other amino acids in the population. Similarly, the initial peptides for Simulation B were selected to have a fitness score of 2, a low score indicating low confidence that these are highly active peptides (Table 5.2).

Table 5.1. Initial peptide population for simulation A.

Sequence    Score
KKWWYWWKR   20
KWKRWFKWR   21
KWKWWRMWR   20
MWRKWRRWW   21
RKKWWWLFR   21
RLKWWRWRW   21
RRWRWWWVW   21
RRWWWRLWW   21
RRWWWRRWY   21
RVWKIWRWR   21
RWIRKIWWR   21
RWIWWRRWW   21
RWRWWGWRR   20
RWRWWWKKT   20
RWWRWWKQR   20
RWWWWSRRR   20
RWYYWWRRH   20
RYRWWKWRH   20
TWWWKKWRR   20

Table 5.2. Initial peptide population for simulation B.

Sequence    Score
ARKWWWRWK   2
AWWRKRKWW   2
FVKRWWRFR   2
IGWWWRKRW   2
IWKRWWRKT   2
KNWKWWRWR   2
KRRSWWKWW   2
KRWRWLRWG   2
KWWRWRRFI   2
QRRRWWWWK   2
RLIRWWIRK   2
RRKRLYWIW   2
RRRWYWKWN   2
RRWRIWWIK   2
RTYKRWYRW   2
RWIRWWRQW   2
RWRHIWWRW   2
RWWKWRWLM   2
RWYKHWRFR   2
SRWWKRRWY   2
VKRWWWRRM   2
WWRKLWRKL   2

Peptides were chosen from a set of biased random sequences that had a fitness score of 20 or 21 for simulation A (moderate confidence in activity) or a fitness score of 2 for simulation B (low confidence in activity). Peptides were selected to have diverse amino acid populations.

5.2.3 Iterative improvement in peptides

Two populations were evolved from the two initial starting populations, in Simulation A and Simulation B. As shown in Figure 5.2 for Simulation A and B, there was rapid improvement in scores from the first generation to generation 100, with continued improvement up to generation 600. There was also a rapid increase in peptide fitness for Simulation B, whose initial population contained much lower scores, as seen in the right-hand side of Figure 5.2 and in Figure 5.3, which shows the first generations in detail; a dramatic rise in fitness scores was seen in the first several generations.
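The generation-by-generation improvement described above can be sketched as a loop that keeps the best peptides and adds offspring made by cross-over and mutation. This is a simplified illustration under stated assumptions, not the code used in this work: `toy_fitness` is a hypothetical stand-in for the 30-model vote count, the fixed number of retained peptides (`n_keep`) and of offspring (`n_children`) are invented parameters, and the mutation rate of 1/15 is interpreted here as a per-residue probability.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(peptide):
    # HYPOTHETICAL stand-in for the model vote count: number of W, R and K residues.
    return sum(peptide.count(residue) for residue in "WRK")


def evolve(population, fitness, generations=30, n_keep=5, n_children=20,
           mutation_rate=1 / 15, seed=0):
    """Evolve equal-length peptides; returns the final population and the
    best fitness observed after each generation."""
    rng = random.Random(seed)
    population = list(dict.fromkeys(population))  # remove duplicates, keep order
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[:n_keep]  # best peptides are carried over unchanged
        children = []
        for _ in range(n_children):
            parent_a, parent_b = rng.sample(ranked, 2)
            cut = rng.randrange(1, len(parent_a))  # single-point cross-over
            child = list(parent_a[:cut] + parent_b[cut:])
            for i in range(len(child)):
                if rng.random() < mutation_rate:  # per-residue mutation
                    child[i] = rng.choice(AMINO_ACIDS)
            children.append("".join(child))
        population = list(dict.fromkeys(elite + children))
        best_per_generation.append(max(fitness(p) for p in population))
    return population, best_per_generation


start = ["RVWKIWRWR", "RWYYWWRRH", "KWKWWRMWR"]
final_population, best = evolve(start, toy_fitness)
print(best[0], best[-1])
```

Because the best peptides are always retained, the best score in the population can never decrease from one generation to the next, while the random offspring keep a spread of lower scores in the population.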
As expected, throughout the evolution of the population of peptides, the genetic algorithm created a set of peptides having a variety of fitness scores due to the random nature of novel peptide generation. For Simulation A, the final generation contained 34 peptides, including 10 peptides with a score of 29, and 22 peptides that scored 26 or higher (Table 5.3). The highest score observed in any of the peptides studied here or previously (Fjell et al., submitted) is 29 rather than 30. This suggests that the method cannot identify any peptides with a higher score than those that were already found. Of the 10 top-scoring peptides, 9 were closely related and start with the sequence RWKRW. There are 3 other peptides starting with this sequence with lower scores: score 28 (RWKRWWRIL), 21 (RWKRWWKVW) and 1 (RWKRWSRLL). The population of peptides always contained a proportion of lower scoring peptides (as seen in the left-hand side of Figure 5.2) due to the random nature of how novel peptides are created by the genetic algorithm. Similarly, the final population for Simulation B, containing 52 peptides, is shown in Table 5.4.

Figure 5.2. Evolution of peptide scores. The fraction of peptides in the population at each range of fitness score is shown. (Panels: Simulation A, Simulation B.)

Figure 5.3. Initial evolution of peptide scores for simulation B.

Table 5.3. Final peptide population simulation A.

Sequence    Fitness Score
RKRWWWRWW   29
RWKRWIRWW   29
RWKRWLRWW   29
RWKRWWRIW   29
RWKRWWRLL   29
RWKRWWRLW   29
RWKRWWRVW   29
RWKRWWRWI   29
RWKRWWRWL   29
RWKRWWRWW   29
KKRWWWWFR   28
KRWWWWKFR   28
KWWRWRRWW   28
RKRWWWRWL   28
RWKKWWRWL   28
RWKKWWRWW   28
RWKRWWRIL   28
KKRWWWWWR   27
KWKRWRRWW   27
KWKRWWWWR   27
RKRWWWWFR   27
KWKRWWWFR   26
RKRWWWRWR   22
RWKRWWKVW   21
RWKWWWKFR   20
RWKKWWRVW   19
RWYRWWRIW   15
KRWRWWRLL   12
KWKKWWRWL   9
KWKRWWWWL   9
KKKRWRRWW   8
RWKYWWRII   4
RKRWWWRGL   1
RWKRWSRLL   1

Activity column (values as extracted; assignment to individual rows is not recoverable from the extracted text): 0.73 0.38 0.67 0.37 0.38 0.38 0.47 0.41 0.67 * -
Table 5.4. Final peptide population, simulation B.

Sequence    Fitness Score
IWKRWWWKR   27
KWKRWWWIR   27
KWKRWWWWR   27
RIWKIWWKR   27
IKKRWWWFR   26
IKWKRWWWR   26
KLKRWWWFR   26
KLKRWWWWR   26
KWKRWWWFR   26
KWWKIWRWR   26
KWWKRWKWR   26
KWWKRWWIR   26
KWWKRWWKR   26
KWWKRWWWR   26
RFWKIWWKR   26
RIWKRWWFR   26
RLWKIWWRR   26
RLWKRWWFR   26
RLWKRWWIR   26
RWWKIWKWR   26
RWWKIWWKR   26
RWWKIWWRR   26
RWWKRWWFR   26
RWWKRWWIR   26
RWWKRWWWR   26
IKKRWWWWR   25
KLKRWWWIR   25
KWWKIWWKR   25
KWWKRWWFR   25
RIWKRWWWR   25
RLKRWWWFR   25
RWKRWWWFR   25
KLWKRWWWR   24
RWWKIWRWR   24
KWWKIWKWR   22
RWWKWWWIR   22
CWKRWWWKR   21
RFWKIWRWR   21
KWKRIWWKR   19
RWWKRWAIR   19
RTWKRWWIR   18
RTWKIWKWR   12
KWWKRWWIH   11
KWWKRWSWR   10
RLWTRWWFR   9
RIWARWWFR   7
KWWKDWWKR   6
RFEKIWWKR   6
RIDKIWLKR   5
RLWKNWWRR   2
RFWQIWRWR   0
RWSKRWWWV   0

The final generation (generation 600) of peptides is sorted by score. The common subsequence RWKRW is shown in bold in Table 5.3 and discussed in the text. Activity values for 9 peptide sequences were determined using the bioluminescence assay against P. aeruginosa; units are IC50 relative to Bac2A control peptide. '-' indicates activity not determined. Two peptides appear in both final populations, KWKRWWWFR and KWKRWWWWR. * average of two peptide measurements.

There were two peptides in common in the final populations (KWKRWWWFR and KWKRWWWWR) for Simulation A and B. Apart from these two peptides, there were no peptides in common between the two final populations, indicating that the processes followed were stochastic. In addition, Simulation B had no peptides with a fitness score above 27 but had more peptides with a high score, i.e. 25 peptides with a fitness score of 26 and above. This indicates that the specific peptides in the final population were largely dependent on the initial population of peptides.
This is to be expected given the nature of the genetic algorithm, since the dominant method of generation of novel sequence is through cross-over from previous peptides; mutation will affect only a comparatively small number of single amino acids in each generation with the genetic algorithm parameters used here. The number of high fitness score peptides appeared to be unchanged between generation 400 and generation 600 (Figure 5.2) for both Simulation A and B, suggesting that in each case the genetic algorithm had settled on a local optimum set of sequences from which it was unlikely to escape through continued evolution. Further improvements would likely require introduction of peptides with dramatically different sequences into the population.

5.2.4 Evolution of amino acid composition

The amino acid distribution of the peptide populations varied during the peptide sequence evolution (shown in Figure 5.4). As described above, the number of amino acid types was maximized when selecting the initial population to include 14 amino acid types for Simulation A and 16 amino acid types for Simulation B. During evolution over the 600 generations, the number of amino acid types was reduced to 7 (in declining proportion: W, R, K, L, I, F, V) for the high scoring peptides in Simulation A and 6 (in declining proportion: W, R, K, I, F, L) for the high scoring peptides in Simulation B. This proportion of amino acids for high scoring peptides is similar to the proportions we found previously for high scoring peptides sampled from a biased random library of 100,000 peptides.

Figure 5.4. Evolution of peptide amino acid composition. Simulation A is shown on the left-hand side and Simulation B is on the right-hand side. The initial populations of peptides (top panels) have higher amino acid diversity, which is lost as the populations evolve (middle panels show generation 600 for all peptides).
The high scoring peptides (fitness score >=26) have the lowest diversity and show similar amino acid proportions (bottom panels).

5.2.5 Assessment of genetic algorithm performance

In our previous study (Fjell et al., submitted), we examined 100,000 peptides from a biased random library of sequences. We empirically tested the activity of the 50 peptides ranked highest by fitness score. As we reported previously, 94% of these peptides were found to be highly active. This group of highly active peptides included all peptides with fitness scores of 29 to 26, and some of the peptides scoring 25. (Some peptides scoring 25 were also outside of this group.) Therefore, for comparison we considered here that peptides receiving a fitness score of 26 or higher could be relatively confidently predicted to have high antibacterial activity. As reported previously, a total of 22 peptides scoring 26 or higher were identified by examining 99,576 peptides in the random library (the 100,000 random peptides minus duplicates), or 0.026% highly active peptides of those evaluated. In contrast, using genetic algorithms we identified 22 peptides scoring 26 or above by evaluating a total of 4,492 peptides (0.49% highly active) in Simulation A, and 25 peptides scoring 26 or above by evaluating 5,067 peptides (0.51% highly active) in Simulation B, in each case counted over all generations of the simulated evolution of the peptide populations, for a combined efficiency of 0.50% highly active peptides identified per peptide evaluated. Taking these two values as representative of the two methods (0.026% for searching a large random library and 0.50% for genetic algorithm search), we observed a 19-fold enhancement in discovery of highly active peptides. In addition, the progressive clustering of peptide scores in the high scoring region was much slower after the first 100 generations. This suggests that stopping the genetic algorithm at approximately generation 100 will be more efficient since further
peptides will not be efficiently identified after this point.

The antibacterial activity of a selection of peptides was measured using the luminescence assay as described previously (see Hilpert et al., 2005). In this classification work, we considered a peptide to be highly active if its IC50 was less than half that of the control peptide, Bac2A. The relative IC50 values in Table 5.3 indicate that 6 of the 9 peptides (67%) assayed were highly active (relative IC50 < 0.5), with the remainder more active than the control but not meeting this threshold. This accuracy (67%) is lower than the 94% accuracy we found before. We believe two factors may have contributed to this discrepancy: chance, given the small set of samples (9 peptides), and variability in the luminescence assay for antibacterial activity.

5.3 Conclusions

We have described here the use of a genetic algorithm to efficiently identify novel peptides that have a high likelihood of being strongly antibacterial. In our previous work, we created software models using artificial neural networks that were found to be up to 94% accurate in predicting highly active peptides. However, our previous work utilized a very large in silico library of 100,000 biased-random sequences to identify additional peptides. In the current study, we demonstrated that the heuristic search method of genetic algorithms identifies additional active peptides with considerably greater efficiency (0.50% of evaluated peptides) than our previous work with biased random sequences (0.026% of evaluated peptides). Currently, we evaluate QSAR descriptors for each peptide using commercial software (MOE) on a limited number of computers, a situation that significantly limits the number of peptides that can be evaluated.
Hence, we find that the increased efficiency of genetic algorithm methods allows a dramatically increased capability to identify novel antimicrobial peptide candidates.

5.4 Materials and methods

5.4.1 Creation of classification models for highly active peptides

As described previously (Cherkasov et al., in press; Fjell et al., submitted), we constructed a software modelling system to classify peptides as highly active or inactive based on a set of 44 QSAR descriptors calculated for each peptide combined with machine learning using artificial neural networks (ANNs). Briefly, we have constructed a set of 30 ANNs that classify a peptide as highly active or inactive. These 30 ANNs were trained based on a 10-fold cross-validation of 3 data sets consisting of over 1400 peptides whose activities were measured using a high-throughput luminescence assay against a modified strain of Pseudomonas aeruginosa carrying luxCDABE (see below, Evaluation of peptide antibacterial activity). The top 5% of each set of peptides was defined as highly active (ANN output value 1) and the rest as low activity (ANN output value 0). Data manipulation and normalization were performed using scripts in the R language (R Development Core Team, 2005; http://r-project.org).

5.4.2 Evaluation of peptide fitness

In our previous study (Fjell et al., submitted), each of the 30 trained ANNs was used to rank a set of 100,000 test peptides. For each ANN, the ANN output value that determined the top 5% of the 100,000 peptides was identified. Using these thresholds, a single fitness score was defined as the number of ANNs that classify an input peptide as in the top 5% of peptides (i.e. the number of 'votes' that a peptide is highly active). Here, we use the same threshold values derived from the 100,000 random peptides to classify novel peptides, using the number of 'votes' as the fitness score for the genetic algorithm.

5.4.3 Initial peptide population

Two simulated evolution experiments were performed here.
Small initial populations of peptides were selected from the biased random population of 100,000 peptides to maximize the diversity of amino acids present in the population. Peptides containing all the 12 amino acids (F, G, H, I, K, L, M, Q, S, T, V, and Y) present in the population were selected at two levels of fitness score. In Simulation A, 19 peptides were selected from the biased random population of 100,000 peptides that had a moderate prediction of activity (fitness score of 20 or 21). In Simulation B, 20 peptides were selected having a fitness score of 2.

5.4.4 Evolution of peptide sequences

The initial populations of peptides were evolved over 600 generations using custom Java code utilizing the JGAP 3.2 (http://jgap.sourceforge.net) genetic algorithm package and converting single-letter amino acid peptides into integer arrays for manipulation. QSAR descriptors were calculated through embedded calls to MOE (Molecular Operating Environment, 2005, by Chemical Computing Group Inc., Montreal, Canada) from the Java code. The population size was allowed to vary to ensure all high scoring peptides remained in the population. A mutation rate of 1/15 was used.

5.4.5 Evaluation of peptide antibacterial activity

Antibacterial activity of synthesized peptides was determined using a luciferase-based in vitro assay and reported as inhibitory concentration at 50% (IC50) relative to a control peptide. Peptides were synthesized on cellulose support with a pipetting robot using two glycine residues as linker, as previously described (Hilpert et al., 2005). Briefly, peptides were cleaved from the dried membrane in an ammonia atmosphere, resulting in free peptides with two glycines at the amidated C terminus due to the linker sequence. The peptide spots were punched out and transferred to 96-well microtiter plates in sets of 10 along with a positive control peptide (Bac2A). An overnight culture of P.
aeruginosa PAO1 strain H1001 (containing a luciferase gene cassette, luxCDABE) was diluted at a 1:500 ratio with 100 mM Tris buffer (pH 7.3), 20 mM glucose. This diluted culture was added to the microtiter plate wells (100 µL/well) containing the peptide spots and controls. After 30 min incubation, serial dilutions were performed from the membrane spots to successive rows of the plate. Luminescence of the P. aeruginosa was measured for 8 dilutions at 4 hours using a Tecan Spectra Fluor Plus (Tecan US). As described previously (Fjell et al., submitted), each luminescence profile for the dilution series was used to calculate the IC50 relative to the Bac2A peptide (a control peptide with low activity), by fitting the luminescence values to a sigmoid curve and normalizing the peptide values to the Bac2A values found on the same plate. Parameter estimation was performed using custom C software and routines from Numerical Recipes in C (Press et al., 1992).

5.5 References

Cherkasov, A. (2005a) ‘Inductive’ Descriptors. 10 Successful Years in QSAR. Current Computer-Aided Drug Design, 1: 21-42.
Cherkasov, A. (2005b) Inductive QSAR Descriptors. Distinguishing Compounds with Antibacterial Activity by Artificial Neural Networks. Int. J. Mol. Sci., 6: 63-86.
Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhanced to treat microbial infections? Nature Reviews Microbiology, 2: 497-504.
Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212.
Hancock, R.E.W., and Sahl, H.G. (2006) Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24: 1551-1557.
Hilpert, K., Volkmer-Engert, R., Walter, T., Hancock, R.E.W. (2005) High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnology, 23: 1008-1012.
Jenssen, H., Hamill, P., and Hancock, R.E.W. (2006) Peptide Antimicrobial Agents.
Clinical Microbiology Reviews, 19: 491–511.
Karakoc, E., Cherkasov, A., Sahinalp, S.C. (2006a) Distance based algorithms for small biomolecule classification and structural similarity search. Bioinformatics, 15: 243-251.
Karakoc, E., Sahinalp, S.C., and Cherkasov, A. (2006b) Comparative QSAR- and fragments distribution analysis of drugs, druglikes, metabolic substances, and antimicrobial compounds. J. Chem. Inf. Model., 46: 2167-2182.
Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407.
Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122-S129.
Niculescu, S.P. (2003) Artificial neural networks and genetic algorithms in QSAR. Journal of Molecular Structure (Theochem), 622: 71–83.
Parrill, A.L. (1996) Evolutionary and genetic methods in drug design. Drug Discovery Today, 1: 514-521.
Perkins, R., Fang, H., Tong, W., and Welsh, W.J. (2003) Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environmental Toxicology and Chemistry, 22: 1666-79.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992) Numerical Recipes in C: The Art of Scientific Computing (2nd Edition). Cambridge University Press, New York.
R Development Core Team (2005) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Solmajer, T. and Zupan, J. (2004) Optimization algorithms and natural computing in drug discovery. DDT, 1: 247-252.
Weaver, D.C. (2004) Applying data mining techniques to library design, lead generation and lead optimization. Current Opinion in Chemical Biology, 8: 264-270.
Yeaman, M.R., Yount, N.Y. (2003) Mechanisms of Antimicrobial Peptide Action and Resistance. Pharmacol. Rev., 55: 27-55.
Chapter 6: Summary and conclusions

6.1 Summary

This thesis describes the bioinformatic and chemoinformatic analysis of gene-coded antimicrobial peptides (host defense peptides) and synthetic antimicrobial peptides. This work addressed the hypotheses that additional novel gene-coded antimicrobial peptides can be identified by sequence analysis, and that highly active synthetic antimicrobial peptides can be identified using chemoinformatics and machine learning methods.

6.1.1 Gene-coded antimicrobial peptides

Antimicrobial peptides (AMPs) represent a diverse class of natural peptides that form part of the innate immune system of mammals, insects, amphibians, and plants, among others (for example, Sima et al., 2003a, 2003b). Prior to the work described here, there were over 880 different antimicrobial peptides identified or predicted from nucleic acid sequence (Brogden, 2005). These peptides fall into a number of diverse classes characterized by charge, peptide structure, and amino acid composition. The first hypothesis of the thesis is that analysis of existing peptides and construction of bioinformatic models can identify additional antimicrobial peptides both from known proteins (unacknowledged antimicrobial or host defense function among known proteins) and from unannotated sequence. The objective was to create a software resource that classifies existing AMPs and can be used to identify additional AMPs from nucleic acid or amino acid sequence.

Profile hidden Markov models (HMMs) are widely used for bioinformatic analysis of biological sequence. HMMs for both mature peptides and propeptides were constructed as follows. Propeptide and mature peptide sequences were identified from annotation of the UniProt proteins identified as AMPs. These were clustered separately into a total of 146 models for mature peptides and 40 for propeptides. These corresponded to known AMP classes and subclasses such as defensins and cathelicidins.
HMMs were created based on multiple alignments of these clusters. Additional peptides were identified by iteratively scanning the Swiss-Prot database with these HMMs, with the clusters and HMMs rebuilt after each search. As a result, 229 additional AMPs were identified from Swiss-Prot, and all but 34 could be associated with known antimicrobial or host defense activities according to the literature. The final set of 1045 mature peptides and 253 propeptides has been organized into the open-source AMPer resource available to the community at http://www.cnbi2.com/cgi-bin/amp.pl. A manuscript describing the AMPer resource was published (Fjell et al., 2007).

The set of HMMs from AMPer was used to screen bovine sequence for novel AMPs. The set of available expressed sequence tags (ESTs) from NCBI and the draft genome sequence were scanned. Of the 34 known bovine AMPs, 27 were identified with high confidence in the AMPs predicted from ESTs. A further 69 potential AMPs predicted from the EST data appear to be novel. Two of these were cathelicidins and were selected for experimental verification in RNA derived from bovine tissue. One predicted AMP, most similar to rabbit '15 kDa protein' AMP, was confirmed to be present in infected bovine intestinal tissue using PCR. These findings demonstrated the practical applicability of the developed bioinformatics approach and laid a foundation for future discoveries of gene-coded AMPs. In addition, no members of the alpha-defensin family were found in the bovine sequences, suggesting that cattle lack this important family of host defense peptides. A manuscript has been published (Fjell C.D., Jenssen H., Fries P., Aich P., Griebel P., Hilpert K., Hancock R.E., Cherkasov A. (2008) Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome. Proteins.
73:420-30)  6.1.2  Synthetic antimicrobial peptides With  increasing  antibiotic  resistance  in  pathogenic  microorganisms,  antimicrobial peptides have drawn significant scientific attention as a novel class of antimicrobial therapeutics as both antibacterial drugs and modulators of innate immunity (Hamilton-Miller, 2004; Levy and Marshall, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004; Hancock and Sahl, 2006). AMPs demonstrate fast target killing, broad range of activity, low toxicity and minimal development of resistance in target organisms. Extensive efforts have been made to develop qualitative structure-activity relationships but there has been no means of relating peptide characteristics to antibacterial activity outside of peptides with very similar structures. The importance of charge, hydrophobicity and amphipathicity are well known; however, high potency peptides cannot be easily selected by manipulation of the amino acid sequence (Tossi et al., 2000). One hypothesis of this thesis is that highly antibacterial peptides can be identified by a combination of non-linear machine learning algorithms and quantitative structure-activity relationship (QSAR) analysis that utilizes descriptors that are sensitive to the 3D atomic conformation the peptide. We calculated descriptors for over 1400 peptides for which the antibacterial activity had been measured using a high-throughput assay (Hilpert et al, 2005). We built artificial neural network models to classify peptides as active or inactive based on these descriptors and screened a virtual library containing 173  nearly 100,000 biased random sequences. A total of 200 peptides were selected for synthesis that were predicted to have activity ranging from highly active to inactive. The predictions were remarkably accurate with 94% of the 50 predicted most active showing high activity and the 50 predicted least active all had low activity. 
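The overall shape of such a virtual screen can be sketched as follows. This is an illustration under stated assumptions, not the method used in the thesis: the two composition-based descriptors replace the 3D-sensitive QSAR descriptors, the linear `predicted_activity` function is a hypothetical stand-in for the trained neural network models, and the residue weights, library size and peptide length are arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def biased_random_library(n, length, weights, seed=0):
    """Generate n random peptides, over-sampling residues listed in `weights`."""
    rng = random.Random(seed)
    residues = list(AMINO_ACIDS)
    w = [weights.get(aa, 1.0) for aa in residues]
    return ["".join(rng.choices(residues, weights=w, k=length)) for _ in range(n)]


def descriptors(peptide):
    """Two toy descriptors: net charge and hydrophobic fraction (assumptions)."""
    charge = sum(peptide.count(aa) for aa in "KR") - sum(peptide.count(aa) for aa in "DE")
    hydrophobic = sum(peptide.count(aa) for aa in "AILMFVW") / len(peptide)
    return charge, hydrophobic


def predicted_activity(peptide):
    """Hypothetical stand-in for a trained model's output (not a fitted model)."""
    charge, hydrophobic = descriptors(peptide)
    return 0.2 * charge + 1.0 * hydrophobic


def screen(library, score, n_top, n_bottom):
    """Rank unique peptides by score; return the highest and lowest scoring."""
    unique = list(dict.fromkeys(library))  # de-duplicate, keep first-seen order
    ranked = sorted(unique, key=score, reverse=True)
    return ranked[:n_top], ranked[-n_bottom:]


library = biased_random_library(1000, 9, {"K": 3.0, "R": 3.0, "W": 3.0})
top, bottom = screen(library, predicted_activity, 50, 50)
```

Selecting candidates from both ends of the ranking, as was done for the 200 synthesized peptides, allows predictions of low activity to be tested as well as predictions of high activity.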
This work represents the first high-throughput in silico screening for novel antibacterial peptides suitable for drug leads. A manuscript describing these methods and results has been submitted to Journal of Medicinal Chemistry (Fjell, C.D., Hilpert, K., Jenssen, H., Cheung, W.A., Panté, N., Hancock, R.E.W., and Cherkasov, A. Identification of Novel Antibacterial Peptides by Chemoinformatics and Machine Learning). A manuscript including these results has been accepted for publication (Cherkasov, A., Hilpert, K., Jenssen, H., Fjell, C.D., Waldbrook, M., Mullaly, S.C., Volkmer, R., and Hancock, R.E.W. Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic resistant Superbugs. ACS Chemical Biology).

A serious constraint on the use of QSAR descriptors utilizing 3D atomic resolution information is the computational expense in time and resources. In order to confidently identify a set of 50 peptide sequences possessing high antibacterial activity, we screened a virtual library of nearly 100,000 peptides. The hypothesis of Chapter 5 is that an evolutionary search method called a genetic algorithm can be used to search the space of possible peptide sequences efficiently and identify additional peptides that are likely to be highly antibacterial. Genetic algorithms (GAs) mimic biological evolutionary processes to estimate solutions to computational problems in which candidate solutions can be represented in a string-like format ('chromosomes' in GA jargon). New candidates are produced through random variation of existing solutions ('mutation') or combination of existing solutions ('cross-over'). We found that our implementation of a GA method provides a large improvement in identification of novel antibacterial peptides.
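A minimal sketch of the mutation and cross-over operators on peptide strings is given below. It is not the implementation from Chapter 5: the selection scheme (keeping the better half of each generation), the mutation rate, the single-point cross-over and the `toy_score` fitness function are all assumptions made for illustration. In the thesis, fitness was provided by the trained models.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(peptide):
    """Hypothetical fitness: rewards cationic and some hydrophobic residues."""
    weights = {"K": 1.0, "R": 1.0, "W": 1.5, "I": 0.5, "L": 0.5, "V": 0.5}
    return sum(weights.get(aa, 0.0) for aa in peptide)


def mutate(peptide, rng, rate=0.1):
    """'Mutation': replace each residue by a random one with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in peptide
    )


def crossover(a, b, rng):
    """Single-point 'cross-over' of two equal-length peptides."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(population, score, generations=30, seed=0):
    """Keep the better half each generation and refill with mutated offspring."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: size // 2]  # survivors are carried over unchanged
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return sorted(population, key=score, reverse=True)


start_rng = random.Random(1)
start = ["".join(start_rng.choice(AMINO_ACIDS) for _ in range(9)) for _ in range(20)]
final = evolve(start, toy_score)
```

Because the best candidates are carried over unchanged, the top score in this sketch cannot decrease from one generation to the next. Each evaluation of the fitness function corresponds to one descriptor calculation and model prediction, which is the dominant cost in the real setting.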
Approximately 0.49% of peptides evaluated during the GA method were classified as highly active, while only 0.026% of the nearly 100,000 sequences we previously screened were classified as highly active at the same level (an approximately 19-fold increase). Since the computational effort to screen in silico libraries dominates the cost of these virtual screening methods, we find that use of a GA significantly improves the possibility of identifying peptides that may lead to novel antibiotic therapeutics.

6.2 Conclusions and future directions

The first two chapters of this thesis describe the successful development and application of bioinformatics methods to identify gene-coded antimicrobial peptides and the creation of the most comprehensive database of its kind. Further refinements to the methods used in AMPer are possible, but I consider that these will not dramatically increase the utility of the resource. For example, as described in Chapter 2, the creation of initial peptide clusters required a choice of threshold value for global sequence similarity (a value of 30% was used) with additional manual editing. More complex methods could have been used to compare peptides in each cluster based on three-dimensional structure or physical properties, using techniques such as threading, since empirical 3D structures are available for a number of these peptides (Höltje et al., 2003). While this modelling effort may improve the similarity of peptides in each cluster, and may reduce the number of clusters by finding clusters to merge, it is not likely to dramatically improve the number or quality of clusters. Other methods have been developed for remote homology detection that may lead to improved detection of AMPs in unannotated sequence.
However, these methods (for example, Hochreiter et al., 2007) require significantly more data than is available for AMPs; typically the SCOP dataset, containing over 30,000 entries, is used (Hou et al., 2004; Kuang et al., 2005; Lingner and Meinicke, 2006; Rangwala and Karypis, 2005).

The AMPer website (http://www.cnbi2.com/cgi-bin/amp.pl) also serves as a resource to compare novel peptides to known classes of AMPs using the HMMs, as well as to display the results in the context of the multiple alignments and sequence profiles on which the models are based. Currently, only the draft bovine genome and EST data set from NCBI have been searched using AMPer, primarily as a demonstration of the utility of the method. The software models and applications needed to scan novel sequences are all freely available to the public and do not require a license, so investigators are able to apply these methods to data sets of their choice. Scanning multiple organisms would allow investigation of evolutionary relationships between known and newly identified AMPs that may shed additional light on mechanisms of the innate immune system.

Chapters 4 and 5 of this thesis concern analysis of synthetic antimicrobial peptides. Chapter 4 describes the successful identification of highly active antibacterial peptides by a combination of non-linear machine learning algorithms and atomic resolution QSAR. Chapter 5 describes an efficient method of generating candidate peptides. Further analysis of the importance of the descriptors for prediction may yield valuable insight into the commonalities between highly active peptides that are recognized by the models. Unfortunately, complex modelling methods such as artificial neural networks do not lend themselves to this analysis.
Previous efforts to derive models that are more easily interpretable using logistic regression and principal component regression (methods described in Hastie et al., 2001) did not result in useful models (data not shown). Additional modelling techniques could be used for comparison to the ANN results; this was not done only in the interest of time.

In Chapter 5, simulated evolution by genetic algorithms starting from two initial populations of peptides ended in nearly non-overlapping final populations of high-scoring peptides. Therefore, I expect that these models were capable of identifying hundreds if not thousands of peptides with a high likelihood of being potent antibacterial agents. It is hoped that future work will further investigate which of these are suitable as antibacterial agents in the clinic.

6.3 References

Brogden, K.A. (2005) Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol., 3: 238-250.
Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhanced to treat microbial infections? Nat. Rev. Microbiol., 2: 497-504.
Fjell, C.D., Hancock, R.E., et al. (2007) AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 23: 1148-1155.
Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212.
Hancock, R.E.W., and Sahl, H.G. (2006) Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24: 1551-1557.
Hastie, T., Tibshirani, R., and Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Hilpert, K., Volkmer-Engert, R., Walter, T., Hancock, R.E.W. (2005) High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnology, 23: 1008-1012.
Hochreiter, S., Heusel, M., Obermayer, K. (2007) Fast model-based protein homology detection without alignment. Bioinformatics, 23: 1728-1736.
Höltje, H.-D., Sippl, W., Rognan, D., and Folkers, G. (2003) Section 4.3: Comparative protein modeling. In Molecular Modeling, Basic Principles and Applications. pp 100-116. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany.
Hou, Y. et al. (2004) Remote homolog detection using local sequence-structure correlations. Proteins: Struct., Funct. and Bioinformatics, 57: 518-530.
Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407.
Kuang, R., Ie, E., Wang, K., Wang, K., Siddiqi, M., Freund, Y., Leslie, C. (2005) Profile-based string kernels for remote homology detection and motif extraction. J. Bioinf. Comp. Biology, 3: 527-550.
Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122-S129.
Lingner, T. and Meinicke, P. (2006) Remote homology detection based on oligomer distances. Bioinformatics, 22: 2224-2236.
Rangwala, H. and Karypis, G. (2005) Profile based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21: 4239-4247.
Sima, P., Trebichavsky, I., Sigler, K. (2003a) Mammalian antibiotic peptides. Folia Microbiol., 48: 123-137.
Sima, P., Trebichavsky, I., Sigler, K. (2003b) Non-mammalian vertebrate antibiotic peptides. Folia Microbiol., 48: 709-724.
Tossi, A., Sandri, L. & Giangaspero, A. (2000) Amphipathic, α-helical antimicrobial peptides. Biopolymers, 55: 4-30.

Appendix A: Epilogue

There are several lessons learned that I might note after completing this thesis. The first and most significant lesson is the importance of finding a narrow focus for the research as early as possible, thus allowing a more comprehensive treatment of the research area in the available time.
I have spent a fair amount of time during my PhD studies on work that did not yield publications, for a couple of reasons. Either I did not come up with anything novel to report (a sometimes unavoidable research outcome), or I spent time redeveloping software tools that were available elsewhere. For the computational biologist or chemist, the vast assortment of applications, databases and algorithms available for any task is nearly overwhelming (for example, there are at least 384 software packages available to calculate phylogeny from sequence similarity, http://evolution.genetics.washington.edu/phylip/software.html). However, after a significant investment of time evaluating available tools, one finds that many of them do not work for the intended purpose, will not execute on the available hardware or operating system, or are too poorly implemented to be useful. So the question is a difficult one: write your own code or continue looking for existing code. My inclination has been to write my own; sometimes this was necessary, but often it would not have been. Particularly for machine learning and statistical analysis, the R-project statistical language and resources (http://www.r-project.org) provide high-quality software, but involve a very steep learning curve to use the code for anything but trivial work. R is a vector-based language quite unlike any I have used previously, but I would have been far more productive over the years if I had learned those coding skills, and what the resource had to offer, at the beginning.

The work of a computational biologist or chemist must eventually be validated against nature through direct experiment. This means that there is always a reliance on wet lab experimentalists, who may or may not see the worth in the computational work. Ultimately, the decision to proceed with expensive experimental work to validate predictions rested with them, since in my case the work was paid for out of their grant money. I have been very fortunate to have had the opportunity to collaborate with the Hancock lab.
Conflicts over research direction and authorship issues have been minimal.

I returned to graduate studies eight years after I completed my M.Sc. in Physics (in radiation biochemistry), which was itself some years after I completed my engineering undergraduate degree. I have not regretted my decision to return to graduate studies for a Ph.D. after such a long time; rather, I regret that I did not do so earlier. For someone considering such a move as late in life as I did, this is not an easy decision. There are added difficulties, such as the needs of a family with older children (in both time and money) and mounting financial pressure to plan for a retirement without poverty. But I have found the work to be both more satisfying and more important than what would otherwise have been available to me. I hope that this experience has also indirectly made for a more enriching environment in which my children will grow up.
