UBC Theses and Dissertations

The development of bioinformatic and chemoinformatic approaches for structure-activity modelling and… Fjell, Christopher David 2009

THE DEVELOPMENT OF BIOINFORMATIC AND CHEMOINFORMATIC APPROACHES FOR STRUCTURE-ACTIVITY MODELLING AND DISCOVERY OF ANTIMICROBIAL PEPTIDES

by

CHRISTOPHER DAVID FJELL

B.A.Sc., The University of British Columbia, 1990
M.Sc., The University of British Columbia, 1995

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Experimental Medicine)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

January 2009

© Christopher David Fjell, 2009

Abstract

The emergence of pathogens resistant to available drug therapies is a pressing global health problem. Antimicrobial peptides (AMPs) may potentially form new therapeutics to counter these pathogens. AMPs are key components in the mammalian innate immune system and are responsible for both direct killing and immunomodulatory effects in host defense against pathogenic organisms. This thesis describes computational methods for the identification of novel natural and synthetic AMPs.

A bioinformatic resource was constructed for classification and discovery of gene-coded AMPs, consisting of a database of clustered known AMPs and a set of hidden Markov models (HMMs). One set of 146 clusters was based on the mature peptide sequence, and one set of 40 clusters was based on propeptide sequence. The bovine genome was analyzed using the AMPer resources, and 27 of the 34 known bovine AMPs were identified with high confidence and up to 69 AMPs were predicted to be novel peptides. One novel cathelicidin AMP was experimentally verified as up-regulated in response to infection in bovine intestinal tissue.

A chemoinformatic analysis was performed to model the antibacterial activity of short synthetic peptides. Using high-throughput screening data for the activities of over 1400 peptides of diverse sequence, quantitative structure-activity relation (QSAR) models were created using artificial neural networks and physical characteristics of the peptide that included three-dimensional atomic structure. The models were used to predict the activity of a set of approximately 100,000 peptide sequence variants. After ranking the predicted activity, the models were shown to be very accurate. When 200 peptides were synthesized and screened using four levels of expected activity, 94% of the top 50 peptides expected to have the highest level of activity were found to be highly active. Several promising candidates were synthesized with high quality and tested against several multi-antibiotic-resistant pathogens including clinical strains of Pseudomonas aeruginosa, Staphylococcus aureus, Enterococcus faecalis and Escherichia coli. These peptides were found to be highly active against these pathogens as determined by minimal inhibitory concentration; this serves as independent confirmation of the effectiveness of high-throughput screening and in silico analysis for identifying peptide antibiotic drug leads.
Table of Contents

Abstract
Table of Contents
List of tables
List of figures
List of abbreviations
Acknowledgements
Dedication
Co-Authorship Statement
Chapter 1: Introduction
  1.1 Thesis overview
  1.2 Gene-coded antimicrobial peptides
    1.2.1 Classes of antimicrobial peptides
    1.2.2 Mechanisms of antibacterial activity
    1.2.3 Antimicrobial peptides in regulation of innate immunity
    1.2.4 Bioinformatics for discovery of novel AMPs
  1.3 Synthetic antimicrobial peptides
    1.3.1 Quantitative structure-activity relationships
    1.3.2 Previous QSAR analysis of antimicrobial peptides
    1.3.3 Limitations of current studies
    1.3.4 'Inductive' QSAR descriptors
  1.4 Thesis objectives and hypotheses
    1.4.1 Gene-coded antimicrobial peptides
    1.4.2 Identification of synthetic AMPs by QSAR analysis and machine learning
    1.4.3 Key assumptions
  1.5 References
Chapter 2: Prediction of gene-coded antimicrobial peptides by bioinformatic analysis
  2.1 Introduction
  2.2 Results and discussion
    2.2.1 Database of antimicrobial peptides
    2.2.2 Clustering of the AMPs
    2.2.3 HMM modelling
    2.2.4 Iterative enhancement of clusters
    2.2.5 Accuracy of models
    2.2.6 On-line tools
  2.3 Conclusion
  2.4 Methods
    2.4.1 Initial peptide set
    2.4.2 Clustering
    2.4.3 Iterative enhancement of clusters
    2.4.4 Accuracy of models
    2.4.5 On-line tools
  2.5 Web resources
  2.6 Supplementary material
  2.7 References
Chapter 3: Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome
  3.1 Introduction
  3.2 Results and discussion
    3.2.1 Identification of host defense peptides
    3.2.2 Selection of predicted AMPs for confirmation
    3.2.3 Analysis of predicted novel AMP gene expression
    3.2.4 Absence of alpha-defensins
  3.3 Conclusions
  3.4 Methods and materials
    3.4.1 Set of known antimicrobial peptides
    3.4.2 Creation of AMPer
    3.4.3 Bovine genomic and EST sequences
    3.4.4 Prediction of AMPs in ESTs
    3.4.5 Prediction of AMPs in genomic sequence
    3.4.6 Comparison of predicted AMPs to known AMPs
    3.4.7 Identification of novel AMPs
    3.4.8 Pairwise comparison of known AMPs to bovine sequence
    3.4.9 Analysis of AMP gene expression
    3.4.10 Informatics
  3.5 Acknowledgments
  3.6 Web resources
  3.7 Supplementary table
  3.8 References
Chapter 4: Identification of antibacterial peptides by chemoinformatics and machine learning
  Introduction
  4.1 Results and discussion
    4.1.1 Effect of control antibacterial peptide on bacteria
    4.1.2 Peptide data sets for model training
    4.1.3 Calculation of peptide activity
    4.1.4 QSAR descriptors and model building
    4.1.5 Validation of model performance
    4.1.6 Independent model testing
    4.1.7 Antibacterial activity of predicted peptides against resistant strains
  4.2 Conclusions
  4.3 Materials and methods
    4.3.1 Electron microscopy of AMPs
    4.3.2 Peptide sequences for model training
    4.3.3 Peptide SPOT synthesis and screening
    4.3.4 Calculation of peptide activity
    4.3.5 QSAR descriptors
    4.3.6 Training and validation data sets
    4.3.7 Test data set
    4.3.8 Model training
    4.3.9 In silico ranking and selection of test peptides
    4.3.10 Minimal inhibitory concentration (MIC) determination
  4.4 Acknowledgements
  4.5 Supplementary tables
  4.6 References
Chapter 5: Genetic algorithms for identification of potent antimicrobial peptides
  5.1 Introduction
  5.2 Results and discussion
    5.2.1 Evaluation of peptide fitness score
    5.2.2 Initial population of peptides
    5.2.3 Iterative improvement in peptides
    5.2.4 Evolution of amino acid composition
    5.2.5 Assessment of genetic algorithm performance
  5.3 Conclusions
  5.4 Materials and methods
    5.4.1 Creation of classification models for highly active peptides
    5.4.2 Evaluation of peptide fitness
    5.4.3 Initial peptide population
    5.4.4 Evolution of peptide sequences
    5.4.5 Evaluation of peptide antibacterial activity
  5.5 References
Chapter 6: Summary and conclusions
  6.1 Summary
    6.1.1 Gene-coded antimicrobial peptides
    6.1.2 Synthetic antimicrobial peptides
  6.2 Conclusions and future directions
  6.3 References
Appendix A: Epilogue

List of tables

Table 1.1. Classes of antimicrobial peptides.
Table 2.1. Effect of similarity threshold on clustering of mature peptides.
Table 2.2. Changing consensus sequence with iteration.
Table 2.3. Properties of largest mature peptide clusters.
Table 2.4. Properties of largest propeptide clusters.
Table 2.5. Performance of AMP identification method determined by cross-validation for mature peptide clusters.
Table 2.6. Performance of AMP identification method determined by cross-validation for propeptide clusters.
Table 3.1. Numbers of predicted antimicrobial peptides.
Table 3.2. Known bovine antimicrobial peptides.
Table 3.3. Identification of known bovine host defense peptides in dbEST sequences.
Table 3.4. Bovine primers used for qRT-PCR.
Table 3.5. Most significant matches of AMPer model 146 to bovine genome sequence.
Table 4.1. Activities of peptides from training sets and quartiles in the 100,000 test set.
Table 4.2. Predicted activity rank and experimental Rel.IC50 values for selected test peptides.
Table 4.3. Activities against multi-resistant Superbugs of selected peptides predicted through the QSAR analysis compared to the peptide Bac2A.
Table 4.4. Description of all QSAR descriptors used in analysis of peptide activities.
Table 4.5. Candidate peptides for confirmation of QSAR predictions.
Table 5.1. Initial peptide population for simulation A.
Table 5.2. Initial peptide population for simulation B.
Table 5.3. Final peptide population, simulation A.
Table 5.4. Final peptide population, simulation B.

List of figures

Figure 1.1. Phylogenetic tree of known antimicrobial peptides.
Figure 1.2. Barrel-stave model of antimicrobial peptide activity.
Figure 1.3. Toroidal model of antimicrobial peptide activity.
Figure 1.4. Carpet model of antimicrobial peptide activity.
Figure 1.5. Intracellular targets of antibacterial peptides.
Figure 1.6. Structure of an alpha-helical antimicrobial peptide.
Figure 1.7. Structure of an artificial neural network.
Figure 2.1. Creation of initial AMPer clusters.
Figure 2.2. Summary of iterative enrichment of clusters.
Figure 2.3. The relationship between E-value and model length.
Figure 2.4. Relationship between mature peptides and propeptides from the same protein for largest mature peptide clusters.
Figure 2.5. Relationship between mature peptides and propeptides from the same protein for largest propeptide clusters.
Figure 2.6. Relationship between mature peptides and propeptides from the same protein clusters of all sizes.
Figure 3.1. Multiple alignment of predicted host defense peptide DBEST_AMP_397.
Figure 3.2. Multiple alignment of predicted host-defense peptide DBEST_AMP_416.
Figure 3.3. Gel image of qRT-PCR for putative AMPs in blood and tissue.
Figure 3.4. Gel image of putative AMPs following Taq-man re-amplification.
Figure 4.1. General workflow for QSAR modelling of antimicrobial peptides.
Figure 4.2. Transmission electron micrographs of cross-sections of Pseudomonas aeruginosa.
Figure 4.3. SEM micrographs of Pseudomonas aeruginosa.
Figure 4.4. Electron micrographs of cross-sections of Pseudomonas aeruginosa.
Figure 4.5. Distribution of amino acids in training and test sets.
Figure 4.6. Luminescence profile of a dilution series for three peptides.
Figure 4.7. Structure of an artificial neural network.
Figure 4.8. The receiver operating characteristics curves for the three data sets.
Figure 4.9. Activity and properties of training and test peptides.
Figure 5.1. Examples of peptide evolution.
Figure 5.2. Evolution of peptide scores.
Figure 5.3. Initial evolution of peptide scores for simulation B.
Figure 5.4. Evolution of peptide amino acid composition.

List of abbreviations

AMP       antimicrobial peptide
ANN       artificial neural network
AROC      area under the receiver operating characteristics curve
Bac2A     synthetic peptide analogue of bovine bactenecin
BLAST     Basic Local Alignment Search Tool
cDNA      complementary DNA produced by reverse transcription of messenger RNA
EST       expressed sequence tag
GA        genetic algorithm
HFIP      hexafluoroisopropanol
HMM       hidden Markov model
HPLC      high pressure liquid chromatography
IC50      inhibitory concentration 50%
MIC       minimal inhibitory concentration, the lowest concentration of an agent that inhibits bacterial growth
mRNA      messenger RNA
PCR       polymerase chain reaction
qRT-PCR   quantitative reverse transcription polymerase chain reaction
QSAR      quantitative structure-activity relation
ROC       receiver operating characteristics
SDS       sodium lauryl sulfate

Acknowledgements

I wish to thank my supervisor Dr. Artem Cherkasov for the opportunity of working with him on this research and for his supervision and encouragement throughout. I am grateful to the members of the Hancock lab (Centre for Microbial Diseases and Immunity Research) at UBC for the exceptional opportunity of working with them and the amazing data they are able to generate, especially to Drs. Bob Hancock, Kai Hilpert, and Håvard Jenssen.

I wish to thank the Canadian Institutes for Health Research for a Doctoral Research Award, and the University of British Columbia for a University Graduate Fellowship. I wish to thank the past and present members of my thesis committee: Drs. Zakaria Hmama, Steven Jones, Boris Sobolev and Michael Grigg.

I am grateful to the past and present members of the Cherkasov lab for the opportunity of learning from them in the diverse areas of their research, including Michael Hsing, Ken Byler, Fuqiang Ban, Osvaldo Santos-Filho, Meilan Huang, Simon Chan, and Evgeny Maksakov.
Finally, I thank my wife Donna and my girls (Meghan and Kristen) for their love and for putting up with my absence during many weekends spent at work.

Dedication

This work is dedicated to the memory of my brother, Brent Fjell, whose young life was lost when the antibiotics didn't work.

Co-Authorship Statement

My role in this work was theoretical and computational analysis; I did not perform any of the laboratory measurements. My contributions to each chapter were as follows. I performed all work described in Chapter 2. In Chapter 3, I performed all the computational work except for PCR primer design. All laboratory experimental work was done at the Hancock lab (UBC) or the Vaccine and Infectious Disease Organization (University of Saskatchewan). In Chapter 4, I performed all computational work except for one script used to calculate some QSAR descriptors, and the randomized amino acid distribution and peptides for Set A, Set B and the set of 100,000 virtual peptides (selected by Kai Hilpert). In Chapter 5, I performed all work except for laboratory measurement of antibacterial activity.

Håvard Jenssen performed PCR verification of predicted gene products (both primer design and assay) on bovine RNA samples provided by Patrick Fries, Palok Aich, and Phillip Griebel (University of Saskatchewan). Kai Hilpert supplied randomized peptide sequences and antibacterial activity data for peptides synthesized on cellulose using the luminescence assay. Håvard Jenssen also provided some antibacterial luminescence assay data and performed MIC assays on peptides. Warren Cheung contributed a script to calculate some QSAR descriptors. Nelly Panté contributed electron micrographs of bacteria. Robert E.W. Hancock was involved in discussions of most aspects of this work, especially regarding laboratory experiments that were done.
Artem Cherkasov was involved in discussions of all aspects of this work, especially on aspects of QSAR and choice of analysis techniques.

Chapter 1: Introduction

1.1 Thesis overview

The purpose of this thesis was to systematically study known antimicrobial peptides (AMPs), to discover new gene-coded AMP sequences, and to develop new peptide-based antibiotic leads using bioinformatic and chemoinformatic methods.

Antimicrobial peptides are produced by nearly all organisms and constitute an important part of the innate immune system. The innate immune system is the part of the immune system that defends the host from infection in a non-specific manner; it includes barriers to infection and immediate responses such as inflammation and recruitment of non-specific cells such as macrophages, neutrophils and dendritic cells. Hundreds of these peptides have been identified. However, their role in both direct killing of pathogens and regulation of the innate immune response has only recently become better understood. Identification of additional examples of these peptides, both in humans and in other organisms, would serve to increase our understanding of the innate immune system and possibly lead to novel therapeutic interventions.

The second chapter of this thesis describes the creation of a resource, which we call AMPer, consisting of an on-line database of peptides as well as software models for identification of antimicrobial peptides using bioinformatics analysis. In the third chapter, these resources are applied to discovery of novel peptides in the bovine genome.

Pathogens that are resistant to current antibiotics are a continuing challenge. According to the Infectious Diseases Society of America (http://www.idsociety.org), the incidence of some pathogens such as methicillin-resistant Staphylococcus aureus now exceeds 50%, and resistance in other pathogens is also rapidly increasing.
For example, resistance to vancomycin and fluoroquinolone has jumped from near zero incidence to nearly 30% in the last ten years in some organisms. During this period, the number of new agents approved for use has continued to decline.

Synthetic antimicrobial peptides may be an important source of antibacterial agents to counter the continuing challenge of pathogens that have developed resistance to conventional drugs. As described in Chapter 4, previous analyses of the properties of short cationic peptides have failed to yield models that are sufficiently general and predictive of antibacterial activity for in silico screening of potential drug candidates. These efforts have been limited by both the robustness of the modelling techniques and the quantities of empirical data available. Generation of large sets of peptides of varying activity is now possible due to new high-throughput peptide synthesis and antibacterial activity assaying techniques. Chapter 4 describes how the combination of advanced chemoinformatic methods and machine learning algorithms was developed to successfully screen for peptides with high activity. Chapter 5 describes a novel method to optimize this screening process by the use of a search algorithm inspired by natural evolution.

1.2 Gene-coded antimicrobial peptides

1.2.1 Classes of antimicrobial peptides

AMPs represent a diverse class of natural peptides that form part of the innate immune system of mammals, insects, amphibians, and plants, among others (see for example, Sima and Sigler, 2003a, 2003b). As reviewed by Brogden (2005), there are currently over 880 different antimicrobial peptides identified or predicted from nucleic acid sequence. These peptides can be classified as shown in Table 1.1 based on peptide characteristics that serve to contrast groups of AMPs.
These characteristics include anionic (negative) charge, peptide structure and charge (linear, cationic and α-helical), amino acid composition (enrichment for particular amino acids), formation of internal cross-links (disulphide bridges), and derivation as fragments of larger mature proteins.

Class / Examples

Anionic peptides
  • Maximin H5 from amphibians.
  • Small anionic peptides rich in glutamic and aspartic acids from sheep, cattle and humans.
  • Dermcidin from humans.

Linear cationic α-helical peptides
  • Cecropins (A), andropin, moricin, ceratotoxin and melittin from insects.
  • Cecropin P1 from Ascaris nematodes.
  • Magainin (2), dermaseptin, bombinin, brevinin-1, esculentins and buforin II from amphibians.
  • Pleurocidin from skin mucous secretions of the winter flounder.
  • Seminalplasmin, BMAP, SMAP (SMAP29, ovispirin), PMAP from cattle, sheep and pigs.
  • CAP18 from rabbits.
  • LL37 from humans.

Cationic peptides enriched for specific amino acids
  • Proline-containing peptides include abaecin from honeybees.
  • Proline- and arginine-containing peptides include apidaecins from honeybees; drosocin from Drosophila; pyrrhocoricin from the European sap-sucking bug; bactenecins from cattle (Bac7), sheep, and goats; and PR-39 from pigs.
  • Proline- and phenylalanine-containing peptides include prophenin from pigs.
  • Glycine-containing peptides include hymenoptaecin from honeybees.
  • Glycine- and proline-containing peptides include coleoptericin and holotricin from beetles.
  • Tryptophan-containing peptides include indolicidin from cattle.
  • Small histidine-rich salivary polypeptides, including the histatins from man and some higher primates.

Anionic and cationic peptides that contain cysteine and form disulphide bonds
  • Peptides with 1 disulphide bond include brevinins.
  • Peptides with 2 disulphide bonds include protegrin from pigs and tachyplesins from horseshoe crabs.
  • Peptides with 3 disulphide bonds include α-defensins from humans (HNP-1, HNP-2, cryptdins), rabbits (NP-1) and rats; β-defensins from humans (HBD1, DEFB118), cattle, mice, rats, pigs, goats and poultry; and rhesus θ-defensin (RTD-1) from the rhesus monkey.
  • Insect defensins (defensin A).
  • SPAG11/isoform HE2C, an atypical anionic β-defensin.
  • Peptides with >3 disulphide bonds include drosomycin in fruit flies and plant antifungal defensins.

Anionic and cationic peptide fragments of larger proteins
  • Lactoferricin from lactoferrin.
  • Casocidin I from human casein.
  • Antimicrobial domains from bovine α-lactalbumin, human haemoglobin, lysozyme and ovalbumin.

Table 1.1. Classes of antimicrobial peptides. Adapted from Brogden, 2005.

The anionic antimicrobial peptides are typically found in surfactant extracts, bronchoalveolar lavage fluid and airway epithelial cells. Examples include the maximin H5 peptide from the skin of the toad Bombina maxima (Lai, et al., 2002), and the human dermcidin peptides secreted by the sweat glands (Schittek, et al., 2001).

The linear cationic α-helical peptides constitute one of the largest classes of AMPs, with roughly 300 members. These include the cecropins (from hemolymph of the cecropia moth), the magainins and buforin II from amphibians, pleurocidin from skin secretions of the flounder, and LL-37 from humans.

Another subgroup consists of cationic peptides that lack cysteine and are enriched in certain amino acids such as proline (the peptides bactenecin and prophenin), arginine (bactenecin), phenylalanine (prophenin), and tryptophan (indolicidin). This group has roughly 44 peptides.

The largest class includes anionic and cationic peptides that contain cysteine and form disulphide bonds. There are nearly 400 peptides in this class, including the large and diverse families of defensins.
In mammals, the defensins include the alpha-defensins derived from neutrophils, the cryptdins of the small intestine, and the beta-defensins found throughout the epithelia. In addition, there are over 50 arthropod defensins and plant defensins (Brogden, 2005). The last class of antimicrobial peptide consists of those peptides that are fragments of larger, functional proteins. These include lactoferricin (derived from lactoferrin) and casocidin I (derived from human casein). A phylogenetic tree illustrating peptide diversity is shown in Figure 1.1.

Figure 1.1. Phylogenetic tree of known antimicrobial peptides. One peptide was selected from each of the clusters in AMPer described in Chapter 2 and used to create a phylogenetic tree showing the relationships between peptides. Selected AMPs are labelled as space allows.

It is important to note that researchers disagree about what should properly be considered an antimicrobial peptide. Some groups consider a positive charge to be essential and, in addition, hold that many peptides with antimicrobial activity attributed to them are not antimicrobial under physiological conditions of ion concentration, host proteases and low peptide concentration (Jenssen et al., 2006). Under this view, the reported activity is an artifact of antimicrobial assays that use dilute media. Where antimicrobial activity can be demonstrated, the mechanisms of action and targets of peptide activity are not clear in many cases.

1.2.2  Mechanisms of antibacterial activity

There are several physical characteristics that are thought to be important for the activity of antimicrobial peptides; these are closely tied to the proposed mechanisms of microbial cell killing. Antimicrobial peptides tend to be relatively short, from 12 to 100 amino acids for cationic AMPs and down to 6 amino acids for anionic AMPs.
The mechanisms and structure of cationic peptides have been most intensively studied and will be discussed here. Their charges typically range from +2 to +9. The positive charge on these peptides is believed to be key to the initial interaction with the bacterial cell target, due to attraction to negatively charged phosphate groups in the lipopolysaccharide and anionic phospholipids of Gram-negative bacteria, or to the lipoteichoic acids present on the surfaces of Gram-positive bacteria (Jenssen et al., 2006; Brogden, 2005). This initial interaction is also responsible for the selective binding of the peptides to microbial membranes and not to host cell membranes, which are composed primarily of neutral lipid. Regardless of the ultimate target responsible for killing of the microbe, this initial interaction with the cell surface and contact with the membrane appears to be an important step for all peptides (Hancock and Rozek, 2002). Antimicrobial peptides can form a variety of secondary structures: alpha-helical, beta-sheet, loop or extended structure. Extended (unstructured) peptides may form organized structures only on binding to a lipid bilayer. For example, indolicidin is unstructured in solution and takes on a structure when bound to membrane (Rozek et al., 2000). This flexibility has been proposed as a mechanism that allows a single peptide to interact with more than one target molecule, such as DNA, in addition to its initial interactions with the membrane (Hsu et al., 2005). Some other peptides do not fall into this structural classification: some, for example, contain a combination of domains such as alpha-helix and beta-sheet (Uteng et al., 2003). Regardless of specific secondary structure, antimicrobial peptides tend to have separated hydrophobic and hydrophilic domains that generate an amphipathic character, which enhances interaction with the lipid bilayer (Yount and Yeaman, 2004; Brogden, 2005; Jenssen et al., 2006).
While the initial interaction relies on electrostatic attraction, subsequent steps are driven by a combination of hydrophobic and electrostatic interactions (Jenssen et al., 2006). In Gram-negative bacteria, hydrophobic interactions drive insertion of peptides into the outer membrane. In a process termed "self-promoted uptake", the initial peptides permeabilize the membrane to entry by other peptides. The importance of the amphipathic structures can be seen in three models of membrane disruption by antimicrobial peptides. Common to all of the models, the peptides are assumed to initially aggregate parallel to the membrane surface by embedding the hydrophobic regions of the peptide into the hydrophobic lipid bilayer, as seen, for example, in melittin (Yang et al., 2001). In the "barrel-stave" model (Figure 1.2), the hydrophobic portions of the peptide align with the lipid core region and the hydrophilic faces of the peptides form the interior of the pore. This manner of pore formation has been reported for the peptide alamethicin, which adopts an alpha-helical configuration that serves as the staves of the barrel-shaped pore (Brogden, 2005; Spear, 2004; Yang et al., 2001). In a related model, the aggregate model, the peptides line the pore in an unoriented arrangement in complex with lipid micelles (Jenssen et al., 2006).

Figure 1.2. Barrel-stave model of antimicrobial peptide activity. The blue and red regions of the alpha helix represent hydrophobic and hydrophilic regions, respectively.

In contrast to the barrel-stave model, the lipid monolayer bends continuously through the pore in the toroidal model (Figure 1.3), with the pore centre lined by both the peptide and the lipid head groups. This pore structure was determined for the antimicrobial peptides magainin, melittin and protegrin (Yang et al., 2001).
The carpet model (Figure 1.4) is similar to the toroidal model, with the lipid monolayer bending continuously into the outer leaflet of the membrane and peptide lying along the surface between the lipid head groups and the pore centre. However, the carpet model suggests that at sufficiently high concentrations of peptide, the membrane will be covered by peptide, with a detergent-like effect leading to formation of micelles and disruption of the membrane. The peptide ovispirin is suggested to act in this manner (Yamaguchi et al., 2001), and studies of the peptide melittin also implicated this mechanism, depending on membrane composition and peptide concentration (Ladokhin and White, 2001).

Figure 1.3. Toroidal model of antimicrobial peptide activity.

Figure 1.4. Carpet model of antimicrobial peptide activity.

However, many of these studies rely on model membranes and infer cell killing from effects seen in such model systems. Loss of cell viability is often seen to precede the major ultrastructural changes in the microbial cell, suggesting that killing occurs after pore formation rather than through a detergent-like disruption of the membrane. For example, magainin 2 exposure results in immediate loss of cytoplasmic potassium and cell death (Matsuzaki, 1997) without the membrane disruption that occurs at later time points in response to this peptide. For some peptides the antibacterial effects are independent of membrane effects. Some targets and activities are shown in Figure 1.5, along with examples of peptides that have these effects. The mechanisms of action include: inhibition of DNA and RNA synthesis through binding to those molecules (buforin II); other inhibition of synthesis of macromolecules such as DNA, RNA or protein (pleurocidin, dermaseptin, indolicidin); inhibition of cell-wall synthesis (mersacidin); prevention of cell division by inhibition of septum formation (indolicidin); and inhibition of enzymatic activity (pyrrhocoricin).
It is worth noting that under non-physiological conditions of salt and lack of serum protein, virtually any cationic peptide will show membrane disturbance given a high concentration of peptide (Jenssen et al., 2007; Zhang et al., 2001; Patrzykat et al., 2002). It is likely that many peptides once considered to kill bacteria by membrane disruption do not do so in vivo; they may attack internal targets instead or act through modulation of the immune system, as discussed next.

Figure 1.5. Intracellular targets of antibacterial peptides. Some structures of a bacterium are shown along with peptides that target them or inhibit their synthesis. Image modified from public domain image at http://en.wikipedia.org/wiki/Bacterium.

1.2.3  Antimicrobial peptides in regulation of innate immunity

In addition to direct killing of bacteria, many antimicrobial peptides have been recognized as having regulatory roles in the innate immune response to infection (Bowdish et al., 2005; Scott and Hancock, 2000; Yang et al., 2004; Zanetti, 2004; Finlay and Hancock, 2004). Since these roles do not involve direct antimicrobial activity and there is dispute about whether many of these peptides play such roles in vivo, some researchers now refer to these as "cationic host defense peptides" (Finlay and Hancock, 2004). In higher organisms, these roles involve nearly all steps in the host response to infection that are not part of adaptive immunity. These steps appear to include the following (Finlay and Hancock, 2004): 1) They are induced at sites of inflammation or infection. 2) They act to counter inflammation that would lead to sepsis due to endotoxin (lipopolysaccharide, LPS) released by bacteria, by selectively suppressing expression of genes induced by LPS or by directly interacting with LPS. 3) They signal other cellular components of the innate immune system through the MAP kinase pathways.
4) They recruit other cells such as neutrophils and monocytes to sites of infection and modulate chemokine and histamine release by neutrophils and mast cells. 5) Finally, they promote wound healing by promoting fibroblast chemotaxis and angiogenesis.

1.2.4  Bioinformatics for discovery of novel AMPs

As described above, antimicrobial peptides play a significant and possibly under-appreciated role in the innate immune system of higher organisms. However, while many of these peptides have common properties across species and structural classes, bioinformatics analysis must also address the large diversity of the sequences involved and their potentially distinct roles in the innate immune response. Previous bioinformatics analyses of antimicrobial peptides for gene discovery have been limited to identification of one particular class of peptide. For example, the second exon of beta-defensins in mouse and human contains a motif with six cysteines. Additional beta-defensins were identified in human and mouse genomic sequence (Jia et al., 2001; Scheetz et al., 2002; Schutte et al., 2002) by comparison with known defensins on a pair-wise basis using the Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990; Altschul et al., 1997) and by using hidden Markov models (Eddy, 1998). In addition, Yount and Yeaman (2004) identified a simple sequence motif found in cysteine-containing antimicrobial peptides that was reflected in a conserved 3D structure. This motif consists of a glycine followed by any amino acid followed by cysteine (GXC) and occurs in a specific conformation of the covalently bound chains. Two peptides, brazzein and charybdotoxin, matched this motif and showed the greatest sequence similarity with the core structure but had no documented antimicrobial activity; these were chosen for antibacterial assay and found to have activity.
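As a rough illustration of the sequence part of the GXC motif described above, the following Python sketch scans a peptide sequence for a glycine, any residue, then a cysteine. It is only a minimal sketch: the function name and the example sequence are hypothetical, and the check ignores the three-dimensional conformation that Yount and Yeaman (2004) also required, so a match here would at most flag a candidate for further analysis.

```python
import re

# Sequence-only pattern: glycine, any single residue, cysteine.
# A lookahead is used so that overlapping occurrences are all reported.
GXC_PATTERN = re.compile(r"(?=(G[A-Z]C))")


def find_gxc_motifs(sequence):
    """Return (0-based position, motif) pairs for every G-X-C occurrence."""
    seq = sequence.upper()
    return [(match.start(), match.group(1)) for match in GXC_PATTERN.finditer(seq)]


# Hypothetical sequence used only to exercise the function.
example_sequence = "ACGKCAAGGCW"
print(find_gxc_motifs(example_sequence))
```

A sequence-only scan of this kind would be expected to produce many false positives, which is consistent with the point made in the text that the motif was defined together with a conserved 3D structure.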
Both of these efforts to predict additional antimicrobial peptides (defensins and GXC motifs) were specific to cysteine-containing peptides and involved manual steps. A more general and automated approach to bioinformatics analysis of antimicrobial peptides is necessary for large-scale identification and classification of peptides. However, gene prediction from genomic sequence is not considered optimal for two reasons: 1) the presence of introns in DNA (sequences that are removed from mRNA and thus do not appear in the translated protein) prevents confident prediction of protein sequence and 2) relatively few genomes have been sequenced to high quality. The large available quantity of expressed sequence tags (ESTs) is a valuable source of sequence for gene prediction. ESTs consist of single-pass sequence reads from either the 3' or 5' ends of sequences in a cDNA library; these cDNAs are constructed from mRNA by reverse transcription (Boguski et al., 1993). Since mRNA is ultimately translated into protein (apart from untranslated regions at the ends of the mRNA), these cDNA sequences lack the complexity of genomic sequence (introns, exons and alternative splicing) that makes gene prediction extremely challenging (Zhang, 2002). However, ESTs are by their nature lower in quality: they are "single pass" reads with up to 3% sequencing errors and may contain truncated sequence (Boguski et al., 1993).

1.3  Synthetic antimicrobial peptides

Antimicrobial peptides have drawn significant scientific attention as a novel class of antimicrobial therapeutics, both as antibacterial drugs and as modulators of innate immunity (Hamilton-Miller, 2004; Levy and Marshall, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004). Antimicrobial peptides tend to exhibit lower potency against susceptible bacterial targets than conventional low-molecular-weight antibiotic compounds; however, they have several advantages.
These include fast target killing, broad range of activity, low toxicity and minimal development of resistance in target organisms (Hancock and Sahl, 2006; Yount and Yeaman, 2003). Over fourteen peptides are currently in development or clinical trials; however, clinical trials to date have shown efficacy of peptides only as topical agents (Hancock and Sahl, 2006). Four cationic peptides have advanced to phase 3 clinical trials, each of which is a derivative of a gene-coded peptide. Of these, two have demonstrated efficacy. There are several properties of peptides considered to be important for antibacterial activity: charge, hydrophobicity and amphipathicity. It is not possible, however, to create high-potency peptides by simple manipulation of the amino acid sequence (Tossi et al., 2000). Structure-activity relationship data for the alpha-helical peptides (see, for example, Figure 1.6) have identified at least seven parameters that can influence the potency and spectrum of activity. These include the 1) size, 2) sequence, 3) degree of structuring (% helical content), 4) charge, 5) overall hydrophobicity, 6) amphipathicity, and 7) respective widths of the hydrophobic and hydrophilic faces of the helix. These properties are intimately linked, and therefore modifications intended to enhance one property will necessarily impact the others.

Figure 1.6. Structure of an alpha-helical antimicrobial peptide. The peptide IKWLKIFL is shown. Red indicates regions of positive charge and green indicates regions of hydrophobicity.

Four main methodologies have been used to study the structure-activity relationships of antimicrobial peptides (as reviewed by Tossi et al., 2000). These are: 1) Sequence modification methods evaluate peptide sequences generated by modifying natural peptides. Amino acids are deleted, added, replaced, truncated or combined with other natural sequences to generate novel sequence.
Sequence modification methods have been applied to the study of cecropins, magainins and melittins in particular. 2) Minimalist approaches evaluate de novo sequences designed to be amphipathic and alpha-helical. To simplify analysis, the types of amino acids used are generally limited to one of the basic amino acids (lysine or arginine) and one or two of the hydrophobic residues (alanine, leucine, phenylalanine or tryptophan). 3) Synthetic combinatorial libraries evaluate combinatorial libraries of peptide sequences. To reduce the number of peptides needed for synthesis, typically only a few amino acid types are considered and only at a few amino acid positions. 4) Template-assisted methods generate sequence templates by comparing sequences of naturally occurring peptides and deriving patterns in terms of residue type (such as charged, hydrophobic, etc.). Novel peptide sequence is then created using the templates as a guide for activity.

These structure-activity analyses have primarily been limited to qualitative analysis. However, a limited number of studies have attempted to derive quantitative structure-activity relationships, as described next.

1.3.1  Quantitative structure-activity relationships

A quantitative structure-activity relationship (QSAR) relates quantitative properties (descriptors) of a compound to other properties such as drug-like activity or toxicity. While QSAR methods have been used extensively in screening programs for drug discovery and toxicology studies (Perkins et al., 2003), QSAR has been applied to antimicrobial peptides only relatively recently. QSAR modelling of antibacterial peptides has two aspects: the choice of QSAR descriptors and the choice of analysis technique to relate descriptor values to antibacterial activity. A large number of QSAR descriptors have been used for small compounds in the literature and large numbers are available from commercial software products.
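To make the notion of a descriptor concrete, the following Python sketch computes three very simple sequence-derived quantities for the peptide IKWLKIFL shown in Figure 1.6. This is an illustration under stated assumptions rather than any descriptor set used in the studies reviewed here: the function name and residue groupings are hypothetical, net charge is approximated by counting lysine and arginine against aspartate and glutamate, and histidine, the termini and pH effects are ignored.

```python
# Hypothetical residue groupings chosen for illustration only.
POSITIVE_RESIDUES = set("KR")
NEGATIVE_RESIDUES = set("DE")
HYDROPHOBIC_RESIDUES = set("AILMFWV")


def simple_descriptors(peptide):
    """Compute crude sequence-based descriptors for a peptide string."""
    seq = peptide.upper()
    positives = sum(residue in POSITIVE_RESIDUES for residue in seq)
    negatives = sum(residue in NEGATIVE_RESIDUES for residue in seq)
    hydrophobic = sum(residue in HYDROPHOBIC_RESIDUES for residue in seq)
    return {
        "length": len(seq),
        # Approximate net charge: side chains only, termini ignored.
        "net_charge": positives - negatives,
        "hydrophobic_fraction": hydrophobic / len(seq),
    }


print(simple_descriptors("IKWLKIFL"))
```

Each peptide is reduced to a fixed-length vector of numbers in this way, and it is these vectors, rather than the sequences themselves, that a QSAR analysis technique relates to measured activity.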
The descriptors used in QSAR studies of antibacterial peptides may be separated into two categories: empirical and calculated descriptors. High-pressure liquid chromatography (HPLC) retention time is an example of an empirical descriptor (a surrogate measure of solubility or hydrophilicity). Total peptide charge at pH 7 and van der Waals surface area are examples of calculated descriptors. Many statistical learning methods are available to relate descriptors to activity. Regression models predict the activity of a peptide as a continuous variable such as MIC (minimal inhibitory concentration), while classification models classify peptides as active or inactive. Primarily, linear regression methods have been used for antimicrobial peptides, using multiple linear regression alone or in conjunction with principal component analysis (PCA) and projections to latent structures (PLS). More complex (non-linear) models such as artificial neural networks (ANNs) give superior predictions but do not clearly relate input descriptors to activity. Some researchers have favoured linear models such as multiple linear regression and principal component analysis because they yield models that explicitly relate the input descriptors to the output prediction of activity, but they do so at the cost of poorer performance (Weaver, 2004).

1.3.2  Previous QSAR analysis of antimicrobial peptides

Previous work on QSAR models for antimicrobial peptides has concentrated on derivatives of three natural peptides: lactoferricin, protegrin and bactenecin.

1.3.2.1  Lactoferricin derivatives

Several studies have examined the activities of lactoferricin derivatives against bacterial targets (Lejon et al., 2001; Lejon et al., 2004; Strom et al., 2001) and herpes simplex virus (Jenssen et al., 2005). Specific amino acid changes were made in derivatives of lactoferricin to observe the effect on activity. Strom et al.
(2001) modelled a set of 20 peptides with QSAR descriptors such as alpha helicity (determined empirically from circular dichroism spectroscopy or calculated in several different ways), HPLC retention time, calculated net charge, molecular surface, and symmetry of charge and hydrophobicity distribution. In a principal component analysis, descriptors related to charge and hydrophobicity had the highest weights in the models. Using the same set of peptides, similar results were obtained (Lejon et al., 2001) using only three descriptors, the z-values derived through an earlier analysis of changes in peptide empirical and calculated properties due to amino acid substitutions (Hellberg et al., 1987). Good predictive accuracy was also found using z-values for an expanded set of peptide analogues in which only a few amino acid substitutions were made (Lejon et al., 2004). However, predictions were much less accurate when more than one or two substitutions were made in a single peptide, indicating the limitation of the amino acid substitution approach for more general antibacterial prediction.

1.3.2.2  QSAR of Protegrin Analogues and De Novo Peptides

Activities of antibacterial peptides based on protegrin have been reported in several studies. A de novo design strategy was used by Frecer et al. (2004) to produce synthetic peptides with structural similarity to cyclic beta-sheet defense peptides such as protegrin. A total of seven peptides were constructed and synthesized. Three descriptors were used to model antibacterial activity: total charge, an amphipathicity index, and a lipophilicity index. In a second paper, Frecer (2006) performed QSAR analysis on 97 protegrin derivatives of 14 amino acids in length based on published activity values, using 14 descriptors including features such as charge, overall lipophilicity, and separate properties of molecular sections (e.g.
lipophilicity of polar and nonpolar faces of the molecule, molecular surface areas for polar and nonpolar faces). Linear equations involving up to 5 descriptors were generated using a genetic function approximation (GFA) to describe antibacterial activity. Only moderate predictive power was found; predictions depended mostly on charge and amphipathicity. In another study, Ostberg and Kaznessis (2004) examined 62 protegrin derivatives using a larger selection of calculated QSAR descriptors. Multivariate linear regression produced moderate correlation between predicted and actual activity using five descriptors.

1.3.2.3  QSAR of scrambled bactenecin-derived peptides

A linear variant of the bovine cationic peptide bactenecin, Bac2A, has been used in studies of the positional importance of amino acids by Hilpert et al. (2006). The activity of 49 peptides resulting from scrambling the sequence of Bac2A was modelled using 18 descriptors based largely on positions of arginines, distributions of hydrophobic amino acids and water-accessible surface. Here, a binary classification algorithm was used to create a decision tree to classify peptides as active or inactive. An accuracy of 74% was obtained from training on the full set of peptides.

1.3.3  Limitations of current studies

Existing QSAR modelling studies are limited in several ways. The primary limitation is due to the size of the data sets. Unfortunately, the use of the three z-values was only effective when modelling very similar variants of a template peptide. More general predictions were more accurate after considering a larger number of descriptors, but the number of peptides considered was small compared to the number of descriptors. The types of models used also limit these QSAR studies.
Often the choice to use simpler linear models is made deliberately (for example, as stated by Frecer et al., 2004 and Frecer, 2006) because the resulting models give a straightforward interpretation of the contribution of each descriptor. However, more complex models such as artificial neural networks (ANNs) are capable of modelling non-linear relationships as well, where descriptors interact with one another in a non-additive manner. As mentioned above, the main disadvantages of the more complex models are the cryptic nature of the models produced (contributions of individual descriptors to activity are not clear) and the requirement for much larger amounts of data, due to the larger number of parameters used in the models. The recent advance in high-throughput peptide synthesis, in combination with a rapid luminescence-based assay of activity, has resulted in very large amounts of antibacterial peptide data becoming available (Hilpert et al., 2005).

1.3.4  'Inductive' QSAR descriptors

The QSAR descriptors used for previous modelling of antibacterial peptides have often required a high degree of similarity between peptides. More general QSAR descriptors have been developed recently that include properties sensitive to the three-dimensional structure of peptides, the 'inductive' QSAR descriptors among others (reviewed in Cherkasov, 2005a). Previously, 'inductive' QSAR descriptors have been successfully applied to a number of molecular modelling studies including quantification of antibacterial activity of organic compounds (Cherkasov, 2005b), prediction of other molecular properties (Cherkasov, 2003), and small compound lead discovery (Cherkasov, 2005; Karakoc et al., 2006a). These descriptors have been used in different types of models for classification of compounds, including artificial neural networks (ANNs), k-nearest neighbors, linear discriminant analysis and multiple linear regression.
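As a rough illustration of how a small artificial neural network turns descriptor values into a classification score, the following Python sketch evaluates a network with one hidden layer. It is a sketch under stated assumptions, not the networks used in the cited studies: the sigmoid activation, the layer sizes and all weight values are hypothetical, and training of the weights is omitted entirely.

```python
import math


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(descriptors, hidden_weights, hidden_biases, output_weights, output_bias):
    """Evaluate a one-hidden-layer network on normalized descriptor values."""
    # Each hidden node takes a weighted sum of all inputs and transforms it.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, descriptors)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output node takes a weighted sum of the hidden node values.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical values: three normalized descriptors and two hidden nodes.
descriptors = [0.2, 0.8, 0.5]
hidden_weights = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.7]]
hidden_biases = [0.0, 0.1]
output_weights = [1.2, -0.8]
output_bias = 0.05

score = forward(descriptors, hidden_weights, hidden_biases, output_weights, output_bias)
print(score)
```

Because the output passes through a non-linear function of non-linear hidden values, the contribution of any single descriptor to the score cannot be read directly from the weights, which is the interpretability cost of such models noted in this chapter.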
It has been found that ANNs generally result in more accurate predictions, followed closely by k-nearest neighbors methods (Karakoc et al., 2006b). The structure of an artificial neural network in the context of QSAR analysis is shown in Figure 1.7.

Figure 1.7. Structure of an artificial neural network. The network consists of three layers: the input layer, hidden layer and output layer. The input nodes take the values of the normalized QSAR descriptors. Each node in the hidden layer takes the weighted sum of the input nodes (represented as lines), and transforms the sum into an output value. The output node takes the weighted sum of these hidden node values and transforms the sum into an output value between 0 and 1.

1.4  Thesis objectives and hypotheses

1.4.1  Gene-coded antimicrobial peptides

The objectives of this section of the thesis follow from the hypothesis that analysis of existing peptides and construction of bioinformatic models can identify additional antimicrobial peptides both from known proteins (unacknowledged antimicrobial peptides among known proteins) and from unannotated sequence. The first objective was to create a resource consisting of software models of all known classes of antimicrobial peptide. In addition, a web site was constructed to allow the community to browse the many classes of peptides, enter sequence to be scanned and view results in the context of multiple sequence alignments. In Chapter 2, I describe the creation of the AMPer resource that performs these functions; a manuscript was published based on this work (Fjell, C.D., R.E. Hancock, and A. Cherkasov (2007) AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 23:1148-1155). The second objective was to examine unannotated EST sequence and genomic data to identify novel genes using the AMPer resources.
In Chapter 3, I describe an analysis of the bovine EST data set with identification of a number of putative novel genes. Subsequent experimental work with collaborators confirmed that one of these predicted genes was present and up-regulated in response to infection in bovine tissue. A manuscript on this work has been published (Fjell CD, Jenssen H, Fries P, Aich P, Griebel P, Hilpert K, Hancock RE, Cherkasov A. Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome. Proteins. 2008 73(2):420-30).

1.4.2  Identification of synthetic AMPs by QSAR analysis and machine learning

The hypothesis of Chapter 4 is that highly antibacterial peptides can be identified by a combination of non-linear machine learning algorithms and QSAR descriptors that are sensitive to the three-dimensional atomic conformation of the peptide. The objective of the work was to identify a set of QSAR descriptors that allow artificial neural networks to be trained to identify novel peptides that are highly antibacterial, and to validate the system by predicting entirely new peptides of two categories: highly active and inactive peptides. I report in Chapter 4 the first successful in silico identification of highly active peptides without the use of a template sequence. A version of this chapter has been submitted to Journal of Medicinal Chemistry (Fjell, C.D., Hilpert, K., Jenssen, H., Cheung, W.A., Panté, N., Hancock, R.E.W., and Cherkasov, A. Identification of Novel Antibacterial Peptides by Chemoinformatics and Machine Learning); these results, in combination with further laboratory work by collaborators, have been accepted for publication (Cherkasov, A., Hilpert, K., Jenssen, H., Fjell, C.D., Waldbrook, M., Mullaly, S.C., Volkmer, R., and Hancock, R.E.W. Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic resistant Superbugs. ACS Chemical Biology, 2008).
Some highly active peptide sequences from this work have been submitted for patent protection. Chapter 5 is an extension of Chapter 4 that addresses one limitation of the QSAR methodology: calculation of 3D QSAR descriptors can be a computationally expensive operation. The hypothesis of Chapter 5 is that an evolutionary search method called a genetic algorithm can be used to efficiently search through the possible peptide sequences to identify additional peptides that are likely to be highly antibacterial. This work utilizes the software models described in Chapter 4, with the objective of dramatically increasing the efficiency of in silico discovery of novel antibacterial peptides.

1.4.3  Key assumptions

There are several important assumptions involved in this work. For the analysis of gene-coded AMPs, assumptions about the error rates in the sequencing technology are important since these were used to choose a threshold value for the maximum allowed differences between a predicted AMP sequence and the sequences of AMPs considered to be already known. If the accuracy was much lower than expected, sequences of known AMPs found in EST sequence at low accuracy may be identified as novel, related sequence. Comparing sequences within a multiple alignment allows one to observe whether random sequencing errors or areas of low-quality sequence at the ends of ESTs (which are known to be of poorer quality) might account for observed differences between sequences. For the work on synthetic AMPs, there are several assumptions related to measurement of antibacterial activity. The screening assay used to measure killing of bacteria relies on detection of luminescence of bacteria due to a luciferase gene cassette. Killing of bacteria is assumed to be responsible for observed decreases in luminescence.
Since it is not feasible to routinely measure the amount of peptide synthesized per spot, the amount of peptide synthesized in each spot on the cellulose sheet is assumed to be constant; otherwise, there would be no way to compare peptide activities with this assay. In addition, the accuracy of the luminescence detection assay has an important impact on the analysis. The luminescence varies up to approximately 2-fold between measurements of the same peptide. Therefore, the measured activity of peptides with low activity (large IC50 values) will have much higher levels of noise than that of highly active peptides (small IC50 values). This is the assumed reason for the failure of regression analysis to accurately predict activity, while the classification analysis worked well. Peptides predicted to have high activity were ultimately synthesized on resin with a different method that yields high purity (generally above 95%), and peptide activities measured directly by MIC dilution series were found to correlate well with luminescence measurements. Therefore, these assumptions about peptide concentration and luminescence activity measurements were found to be valid.

1.5  References

Aich, P., H. L. Wilson, et al. (2005). Microarray analysis of gene expression following preparation of sterile intestinal “loops” in calves. Can. J. Anim. Sci., 85: 13–22. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215: 403-410. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25: 3389–3402. Bechinger, B. (1997). Structure and function of channel-forming peptides: magainins, cecropins, melittin and alamethicin. J. Membrane Biol., 156: 197-211. Bechinger, B. (1999). 
The structure, dynamics and orientation of antimicrobial peptides in membranes by multidimensional solid-state NMR spectroscopy. Biochim. Biophys. Acta, 1462: 157-183. Blondelle, S. E., K. Lohner, et al. (1999). Lipid-induced conformation and lipid-binding properties of cytolytic and antimicrobial peptides: determination and biological specificity. Biochim. Biophys. Acta, 1462: 89-108. Boguski, M. S., T. M. J. Lowe, et al. (1993). dbEST — database for "expressed sequence tags". Nature Genetics, 4: 332 - 333. Bowdish, D.M., Davidson, D.J., Hancock, R.E.W. (2005) A Re-evaluation of the Role of Host Defence Peptides in Mammalian Immunity. Curr. Protein. Pept. Sci., 6(1):35-51. Brahmachary, M., Krishnan, S. P. T., Koh, J. L. Y., Khan, A. M., Seah, S. H., Tan T. W., Brusic, V., Bajic, V. B. (2004) ANTIMIC: a database of antimicrobial sequences. Nucl. Acids Res., 32: 90001, 1-589 Brogden, K. A. (2005). Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol., 3: 238–250. Brogden, K. A., De Lucca, A. J., Bland, J. & Elliott, S. (1996). Isolation of an ovine pulmonary surfactant-associated anionic peptide bactericidal for Pasteurella haemolytica. Proc. Natl Acad. Sci. USA, 93, 412–416 Chapple, D.S., Hussain, R., Joannou, C.L., Hancock,  R.E.W., Odell, E., Evans, R.W., Siligardi,  G. (2004) Structure and Association of Human Lactoferrin Peptides with Escherichia coli Lipopolysaccharide. Antimicrob. Agents Chemother., 48 (6): 2190-2198 Cherkasov, A. (2003) Inductive Electronegativity Scale. Iterative Calculation of Inductive Partial Charges. J. Chem. Inf. Comp. Sci., 43, 2039-2047, Cherkasov, A. (2005) ‘Inductive’ Descriptors. 10 Successful Years in QSAR. Current Computer-Aided Drug Design, 1, 21-42.  30 Cherkasov, A. (2005) Inductive QSAR Descriptors. Distinguishing Compounds with Antibacterial Activity by Artificial Neural Networks. Int. J. Mol. Sci., 6, 63-86 Cherkasov, A., Shi, Z., Fallahi, M., and Hammond, GL. 
(2005) Successful in Silico Discovery of Novel Non-Steroidal Ligands for Human Sex Hormone Binding Globulin. J. Med. Chem., 48, 3203-3213. Coombes, B. K., B. A. Coburn, et al. (2005). Analysis of the contribution of Salmonella pathogenicity islands 1 and 2 to enteric disease progression using a novel bovine ileal loop model and a murine model of infectious enterocolitis. Infect. Immun., 73: 7161-7169. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.J. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK. Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14(9): 755-763. Eisenhauer, P. B. and R. I. Lehrer (1992). Mouse neutrophils lack defensins. Infect. Immun., 60: 3446-3447. Epand, R. M. and H. J. Vogel (1999). Diversity of antimicrobial peptides and their mechanisms of action. Biochim. Biophys. Acta, 1462: 11-28. Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhanced to treat microbial infections? Nature Reviews Microbiology, 2, 497-504. Fjell, C. D., R. E. Hancock, et al. (2007). AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 23: 1148-1155. Frecer, V. (2006) QSAR analysis of antimicrobial and haemolytic effects of cyclic cationic antimicrobial peptides derived from protegrin-1. Bioorganic & Medicinal Chemistry, 14, 6065-6074. Frecer, V., Ho, B., Ding, J.L. (2004) De Novo Design of Potent Antimicrobial Peptides. Antimicrob. Agents Chemother., 48, 3349-3357. Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212. Hancock, R. E. (2003). Concerns regarding resistance to self-proteins. Microbiology, 149: 3343-3344. Hancock, R. E. and D. S. Chapple (1999). Peptide Antibiotics. Antimicrob. Agents Chemother., 43: 1317-1323. Hancock, R. E. and R. Lehrer (1998). Cationic peptides: a new source of antibiotics. 
Trends Biotechnol., 16: 82-88. Hancock, R.E.W. (2001) Cationic peptides: effectors in innate immunity and novel antimicrobials. The Lancet Infectious Diseases, 1 (3) 156-164. Hancock, R.E.W., and Sahl, H.G. (2006). Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24, 1551-1557. Hancock, R.E.W., Rozek, A. (2002) Role of membranes in the activities of antimicrobial cationic peptides. FEMS Microbiology Letters, 206 (2), 143-149. Hellberg, S., Sjostrom, M., Skagerberg, B., and Wold, S. (1987) Peptide quantitative structure-activity relationships, a multivariate approach. J. Med. Chem., 30: 1126–1135. Hilpert, K., Elliott, M.R., Volkmer-Engert, R., Henklein, P., Donini, O., Zhou, Q. et al. (2006) Sequence requirements and an optimization strategy for short antimicrobial peptides. Chem. Biol., 13: 1101-7. Hilpert, K., Volkmer-Engert, R., Walter, T., Hancock, R.E.W. (2005) High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnology, 23: 1008-1012. Hsu, C. H., C. Chen, M. L. Jou, A. Y. Lee, Y. C. Lin, Y. P. Yu, W. T. Huang, and S. H. Wu. (2005) Structural and DNA-binding studies on the bovine antimicrobial peptide, indolicidin: evidence for multiple conformations involved in binding to membranes and DNA. Nucleic Acids Res., 33: 4053–4064. Hwang, P.M., Vogel, H.J. (1998) Structure-function relationships of antimicrobial peptides. Biochem. Cell Biol., 76: 235-46. Jack, R.W., Tagg, J.R., Ray, B. (1995) Bacteriocins of gram-positive bacteria. Microbiol. Rev., 59: 171-200. Jenssen, H., Gutteberg, T.J., and Lejon, T. (2005) Modelling of anti-HSV activity of lactoferricin analogues using amino acid descriptors. J. Pept. Sci., 11: 97-103. Jenssen, H., Hamill, P., and Hancock, R.E.W. (2006) Peptide Antimicrobial Agents. Clinical Microbiology Reviews, 19: 491–511. Karakoc, E., Cherkasov, A., Sahinalp, S.C. 
(2006) Distance based algorithms for small biomolecule classification and structural similarity search. Bioinformatics, 15: 243-251. Karakoc, E., Sahinalp, S.C., and Cherkasov, A. (2006) Comparative QSAR- and fragments distribution analysis of drugs, druglikes, metabolic substances, and antimicrobial compounds. J. Chem. Inf. Model., 46:2167-2182. Khush, R. S., F. Leulier, et al. (2001). Drosophila immunity: two paths to NF-kappaB. Trends Immunol., 22: 260-264. Kim, H. S., H. Yoon, et al. (2000). Pepsin-Mediated Processing of the Cytoplasmic Histone H2A to Strong Antimicrobial Peptide Buforin. I. J. Immunol., 165: 3268- 3274. Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63:389-407. Ladokhin, A. S. & White, S. H. (2001) ‘Detergent-like’ permeabilization of anionic lipid vesicles by melittin. Biochim. Biophys. Acta., 1514: 253–260 Lai, R., Liu, H., Hui Lee, W. & Zhang, Y. (2002) An anionic antimicrobial peptide from toad  Bombina maxima. Biochem.Biophys. Res. Commun. 295: 796–799. Lejon, T., Stiberg, T., Strom, M.B., and Svendsen, J.S. (2004) Prediction of antibiotic  32 activity and synthesis of new pentadecapeptides based on lactoferricins. J. Pept. Sci., 10:329 – 335 Lejon, T., Strom, M.B., and Svendsen, J.S. (2001) Antibiotic activity of pentadecapeptides modelled from amino acid descriptors. J. Pept. Sci., 7: 74-81. Levy, O., Weiss, J., et al. (1993). Antibacterial 15-kDa protein isoforms (p15s) are members of a novel family of leukocyte proteins. J. Biol. Chem., 268: 6058-6063. Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122 - S129. Madera, M., Gough, J. (2002) A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res., 30: 4321-4328. Marshall, S. H. and G. Arenas (2003). 
Antimicrobial peptides: A natural alternative to chemical antibiotics and a potential for applied biotechnology. Electron. J. Biotech, 6: 271-284. Matsuzaki, K., Sugishita, K., Harada, M., Fujii, N. & Miyajima, K. (1997) Interactions of an antimicrobial peptide, magainin 2, with outer and inner membranes of Gram- negative bacteria. Biochim. Biophys. Acta, 1327:119–130 Mookherjee, N. and Hancock, R.E. (2007). Cationic host defence peptides: innate immune regulatory peptides as a novel approach for treating infections. Cell. Mol. Life Sci., 64: 922-933. Mookherjee, N., Wilson, H. L.; Doria, S.; Popowych, Y.; Falsafi, R.; Yu, J. J.; Li, Y.; Veatch, S.; Roche, F. M.; Brown, K. L.; Brinkman, F. S.; Hokamp, K.; Potter, A.; Babiuk, L. A.; Griebel, P. J.; Hancock, R. E. (2006). Bovine and human cathelicidin cationic host defense peptides similarly suppress transcriptional responses to bacterial lipopolysaccharide. J. Leukoc. Biol., 80: 1563-1574. Niculescu, S.P. (2003) Artificial neural networks and genetic algorithms in QSAR. Journal of Molecular Structure (Theochem) 622: 71–83 Ostberg, N., and Kaznessis, Y. (2004) Protegrin structure–activity relationships: using homology models  of  synthetic sequences to determine structural characteristics important for activity. Peptides, 26: 197–206 Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284:1201-1210. Parrill, A.L. (1996) Evolutionary and genetic methods in drug design. Drug Design Today, 1:514-521 Patrzykat, A., Friedrich, C.L., Zhang, L., Mendoza, V., Hancock, R.E.W. (2002) Sublethal Concentrations of Pleurocidin-Derived Antimicrobial Peptides Inhibit Macromolecular Synthesis in Escherichia coli. Antimicrob. Agents Chemother., 46: 605-614. Perkins, R., Fang, H., Tong, W., and Welsh, W.J. 
(2003) Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environmental Toxicology and Chemistry, 22: 1666-79. Pfaffl, M. W. (2001). A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res., 29: e45. Powers, J.P.S., Hancock, R.E.W. (2003). The relationship between peptide structure and antibacterial activity. Peptides, 24: 1681-1691. Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992), Numerical Recipes in C: The Art of Scientific Computing, (2nd Edition), Cambridge University Press, New York. Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16: 276-277. Rozek, A., C. L. Friedrich, and R. E. Hancock. (2000) Structure of the bovine antimicrobial peptide indolicidin bound to dodecylphosphocholine and sodium dodecyl sulfate micelles. Biochemistry, 39: 15765–15774. Rozen, S. and H. J. Skaletsky (2000). Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology vol. 132, S. Krawetz and S. Misener (eds.), Humana Press, Totowa, N.J., U.S.A. Scheetz, T., Bartlett, J.A., Walters, J.D., Schutte, B.C., Casavant, T.L., McCray, P.B. (2002) Genomics-based approaches to gene discovery in innate immunity. Immunol. Rev., 190: 137-145. Schittek, B., Hipfel, R., Sauer, B., Bauer, J., Kalbacher, H., Stevanovic, S., Schirle, M., Schroeder, K., Blin, N., Meier, F., Rassner, G., Garbe, C. (2001) Dermcidin: a novel human antibiotic peptide secreted by sweat glands. Nat. Immunol., 2: 1133-1137. Schutte, B.C., Mitros, J.P., Bartlett, J.A., Walters, J.D., Jia, H.P., Welsh, M.J., Casavant, T.L., McCray, P.B. (2002) Discovery of five conserved beta-defensin gene clusters using a computational search strategy. Proc. Natl. Acad. Sci. U S A, 99: 2129-2133. Scott, M. G., and R. E. Hancock. 
(2000) Cationic antimicrobial peptides and their multifunctional role in the immune system. Crit. Rev. Immunol., 20: 407–431. Shai, Y. (1999). Mechanism of the binding, insertion and destabilization of phospholipids bilayer membranes by α-helical antimicrobial and cell non-selective membrane-lytic peptides. Biochim. Biophys. Acta, 1462: 55-70. Sima, P., Trebichavsky, I., Sigler, K. (2003) Mammalian antibiotic peptides. Folia Microbiol., 48: 123-137. Sima, P., Trebichavsky, I., Sigler, K. (2003) Non-mammalian vertebrate antibiotic peptides. Folia Microbiol., 48: 709-724. Simmaco, M., G. Mignogna, et al. (1998). Antimicrobial peptides from amphibian skin: what do they tell us? Biopolymers, 47: 435-450. Solmajer, T. and Zupan, J. (2004) Optimization algorithms and natural computing in drug discovery. DDT, 1: 247-252. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A., Durbin, R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res., 26: 320-322. Spaar, A., Munster, C. & Salditt, T. (2004) Conformation of peptides in lipid membranes studied by X-ray grazing incidence scattering. Biophys. J., 87: 396–407. Strom, M.B., Stensen, W., Svendsen, J.S., and Rekdal, O. (2001) Increased antibacterial activity of 15-residue murine lactoferricin derivatives. J. Peptide Res., 57: 127–139. Thompson, J. D., Higgins, D. G., Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22: 4673-4680. Tossi, A., Sandri, L. & Giangaspero, A. (2000) Amphipathic, α-helical antimicrobial peptides. Biopolymers, 55: 4–30. Uteng, M., Hauge, H. H., Markwick, P. R., Fimland, G., Mantzilas, D., Nissen-Meyer, J., Muhle-Goll, C. 
(2003) Three-dimensional structure in lipid micelles of the pediocin-like antimicrobial peptide sakacin P and a sakacin P variant that is structurally stabilized by an inserted C-terminal disulfide bridge. Biochemistry, 42: 11417-26 Wang, Z., and Wang, G. (2004) APD: the Antimicrobial Peptide Database. Nucleic Acids Res., 32:D590–D592 Weaver, D.C. (2004) Applying data mining techniques to library design, lead generation and lead optimization. Current Opinion in Chemical Biology, 8: 264-270 Whale, T. A., Wilson, H. L., Tikoo, S. K., Babiuk, L. A., Griebel, P. J. (2006) Pivotal Advance: Passively acquired membrane proteins alter the functional capacity of bovine polymorphonuclear cells. J. Leukocyte Biology, 80: 481-491. Yamaguchi S, Huster D, Waring A, Lehrer RI, Kearney W, Tack BF, Hong M. (2001) Orientation and dynamics of an antimicrobial peptide in the lipid bilayer by solid- state NMR spectroscopy. Biophys. J., 81: 2203–2214. Yang, D., A. Biragyn, D. M. Hoover, J. Lubkowski, J. J. Oppenheim. (2004) Multiple roles of antimicrobial defensins, cathelicidins, and eosinophil-derived neurotoxin in host defense. Annu. Rev. Immunol., 22:181–215 Yang, L., Harroun, T. A., Weiss, T. M., Ding, L. & Huang, H.W. (2001) Barrel-stave model or  toroidal model? A case study on melittin pores. Biophys. J. 81: 1475– 1485. Yeaman, M.R., Yount, N.Y. (2003) Mechanisms of Antimicrobial Peptide Action and Resistance. Pharmacol. Rev., 55: 27-55. Yount, N.Y., Yeaman, M.R. (2004) Multidimensional signatures in antimicrobial  35 peptides. PNAS, 101: 7363-7368 Zanetti, M. 2004. Cathelicidins, multifunctional peptides of the innate immunity. J. Leukoc. Biol., 75:39–48 Zhang, L., Rozek, A., Hancock, R.E. (2001) Interaction of cationic antimicrobial peptides with model membranes. J. Biol. Chem., 276:35714–35722 Zhang, M.Q. (2002) Computational prediction of eukaryotic protein-coding genes. 
Nature Reviews Genetics, 3:698-709

Chapter 2: Prediction of gene-coded antimicrobial peptides by bioinformatic analysis

A version of this chapter has been published as: Fjell, C.D., R.E. Hancock, and A. Cherkasov (2007) AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 23:1148-1155.

2.1  Introduction

Antimicrobial peptides (AMPs) represent a diverse class of natural peptides that form a part of the innate immune system of mammals, insects, amphibians, and plants among others (for example, Sima and Sigler, 2003a, 2003b). In the face of increasing antibiotic resistance in pathogenic microorganisms, AMPs have drawn significant scientific attention as a novel class of prospective antimicrobial therapeutics, both as antibacterial drugs and as modulators of innate immunity (Hamilton-Miller, 2004; Levy and Marshall, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004). Although antimicrobial peptides exhibit relatively low potency against susceptible bacterial targets compared to conventional low-molecular-weight antibiotic compounds, they hold several compensatory advantages including fast target killing, a broad range of activity, low toxicity and minimal development of resistance in target organisms (Hancock, 2001; Yeaman and Yount, 2003).

Despite the fact that a broad spectrum of antimicrobial peptides has been identified and discussed in the literature, their structure-activity relationships are not well understood, largely because of substantial sequence and structure diversity. Examples include the alpha-helical cecropins and magainins and the beta-sheet structure of beta-defensins among others. It should be mentioned, however, that AMP three-dimensional structures are often dependent on binding to membrane or lipopolysaccharide, and in solution many AMPs may exist in different and/or non-ordered configurations (Chapple et al., 2004; Yeaman and Yount, 2003). 
Thus, the features generally regarded as characteristic of AMPs are their cationic character, relatively high hydrophobicity and short length (Powers and Hancock, 2003; Yeaman and Yount, 2003).

The mechanisms of peptide antimicrobial action are also under debate; while membrane disruption has been a common theme, other evidence suggests that peptides transit into the cytosol and disrupt intracellular targets and that the membrane effects are distinct from (and not always crucial to) the killing effects (Hancock and Rozek, 2002; Patrzykat et al., 2002). In addition, the relative importance of direct killing versus immunomodulatory effects of mammalian AMPs is not obvious, since some peptides generally considered as AMPs do not appear to have direct microbe-killing effects in vivo (Brogden, 2005; Bowdish et al., 2005). All the above-mentioned controversies make in silico discovery and/or modelling of antimicrobial peptides an important but challenging bioinformatics task.

To date, sequence analysis for AMP discovery has been applied to a limited number of AMP types: the beta-defensins and other cysteine-containing peptides. A number of novel beta-defensins in mouse and human were identified by analysis of a specific exon of beta-defensins followed by scanning of genomic sequence (Scheetz et al., 2002; Schutte et al., 2002). Manual identification of a predictive motif, GXC, for cysteine-containing AMPs was also used to find novel AMPs of that type (Yount and Yeaman, 2004). However, these efforts were applicable only to a small number of AMP types. We decided to conduct a more generalized study of AMP sequences using profile-based hidden Markov models (HMMs) in combination with sequence clustering and protein structure annotation. 
The major objective of the study was to produce HMM models for the existing AMP types such as defensins, cathelicidins, and histatins among others, and to apply these models to create a more consistent classification of antimicrobial sequences. This new resource is available as an on-line database for investigation of AMP sequence diversity, and as a set of HMM files for the discovery of novel gene-coded AMP candidates.

2.2  Results and discussion

The analysis of the antimicrobial peptides proceeded as described below and is summarized in Figure 2.1 and Figure 2.2.

Figure 2.1. Creation of initial AMPer clusters.

Figure 2.2. Summary of iterative enrichment of clusters.

2.2.1  Database of antimicrobial peptides

Initially, we used the set of known gene-coded AMPs from the AMSDb collection at the University of Trieste to compile a generalized set of known AMP sequences (see the “Web Resources” section for more details about the source of AMP sequences). The resulting set of confirmed AMPs contained 890 sequences and encompassed all major AMP classes including defensins, cathelicidins and granulin among others. These peptides are available as entire holopeptides, containing both mature functional peptides and prosequences. Some of these proteins were found to contain obsolete annotations and to refer to obsolete Uniprot IDs. Since we were interested in analyzing the mature and prosequence regions separately, we required that the proteins be present in the current version (August 2006) of the Uniprot database. To associate the proteins in AMSDb with the current Uniprot entries, we performed a pairwise similarity comparison using blastp from the BLAST tool (Altschul et al., 1990). We considered a match to be made where the AMSDb protein had at least 99% sequence identity over at least 99% of the length of the smaller sequence of the pair. 
We tried relaxing the criteria to 95% for each parameter; this resulted in only 2 more matches, which we did not consider sufficient to justify the additional risk of incorrect assignment. In addition, 33 proteins were identified by sequence ID as the same protein in AMSDb and Uniprot, although their sequences were less than 99% identical. These 33 Uniprot proteins were used. Of the 890 original AMSDb proteins, 741 proteins were matched in Uniprot (661 from Swiss-Prot and 80 from TrEMBL).

The peptide location annotations from Uniprot were used to identify mature peptide and propeptide regions. A total of 679 Uniprot proteins were found to have suitable annotation for mature peptides, yielding 767 mature peptides. Most proteins contributed one mature peptide, while one protein, human Histatin-3 (HIS3_HUMAN), contributed 26 peptides, the highest number. A total of 238 Uniprot proteins had annotations for propeptides, yielding 316 propeptides. Most proteins contributed 1 propeptide, but up to 7 (for AMP_IMPBA from Balsam plant) were contributed by a single protein.

2.2.2  Clustering of the AMPs

As already mentioned, AMPs are very diverse in their sequences but fall into a small number of secondary-structure classes (Hwang and Vogel, 1998; Powers and Hancock, 2003). However, our objective in clustering was to group similar peptides for later analysis by hidden Markov models. For this purpose, we wanted to capture in a single cluster the diversity of sequences that likely corresponded to a single type of peptide. While a large number of AMP groups can be defined based on descriptions in the literature (such as defensins, magainins, cathelicidins), this nomenclature is not amenable to specification for automated grouping, due to the large diversity in sequence as well as length for a given protein name or description. 
Since no classification scheme was found that was suitable for our purpose, we chose to group AMPs by sequence analysis using a custom sequence similarity measure. In short, clusters were constructed to have a minimum amount of similarity between all peptides in the cluster (see Methods section for details). Two sets of clusters were constructed, one for mature peptides and one for propeptides. Each peptide was compared to the peptides in existing clusters, and a 'global' sequence identity was calculated as the number of matching amino acids divided by the length of the shorter peptide, using the most significant alignment given by the blastp algorithm. For each existing cluster, the minimum global sequence identity between the peptide and any peptide in the cluster was determined. The peptide was placed in the cluster giving the highest minimum match, if that minimum was greater than a given minimum identity threshold. Peptides not placed in existing clusters were used to start new clusters.

Threshold [%]   Number of Clusters   Clustered Fraction [%]
10              136                  94
20              142                  92
30              149                  90
40              148                  84
50              151                  80
60              158                  75
70              153                  66
80              136                  56
90              120                  42

Table 2.1. Effect of similarity threshold on clustering of mature peptides. The original set of mature peptides was clustered for several values of the minimum global percent similarity (Threshold). The Clustered Fraction is the fraction of the original set of mature peptides that were placed in clusters for the given threshold.

Minimum similarity thresholds in the range 10-90% were used to evaluate the resulting clusters. Decreasing the threshold to a minimum of 10% global similarity gives the maximum number of peptides placed in clusters. However, when we examined the multiple alignments of these clusters for low thresholds we found problems: many contained two or more sets of closely-related peptides that were more appropriately separated into distinct clusters. 
As well, short peptides were found to be inserted into clusters where the matching amino acids in the multiple alignment were interspersed with gaps, with only one or two amino acids between matching positions. For higher thresholds, however, dramatically fewer of the peptides were represented in the clusters, with a 90% threshold yielding clusters for only 42% of the starting peptides. Therefore, we decided to use an intermediate threshold of 30% global sequence identity and to manually correct the clusters by removing short peptides having poor alignment, and by splitting clusters into additional clusters where the peptides consisted of two or more highly-similar sets of peptides. In total, 20 peptides were removed from 19 clusters; 3 clusters were split into 6; and 6 clusters composed of 2 clones each were removed. There were 146 resulting clusters for mature peptides, containing 655 peptides.

The propeptide clusters were treated similarly using a threshold of 30% global identity. There were 207 clustered propeptides in 42 clusters before manual edits. Four propeptides were removed from 4 clusters; 5 clusters were removed; and 3 clusters were split into 6. There were 40 resulting clusters containing 192 propeptides.

As anticipated, this classification approach grouped together related peptides as in the conventional classes such as beta-defensins, cecropins and magainins. Peptides of a particular class such as the beta-defensins were also separated into multiple clusters, indicating sub-classes of these peptides. We did not try to reduce the number of clusters, for example, to produce a single cluster for each type of defensin. We considered that a larger number of clusters with more highly similar peptides in each is beneficial for model building, as the more specific models may reflect important sequence motifs that may be lost if the clusters contain too much variation.
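
As an illustration only, the cluster assignment rule of Section 2.2.2 can be sketched in Python. This is not the code used to build AMPer: the function names are hypothetical, and toy_identity is a simplified stand-in that compares ungapped, left-aligned sequences, whereas the actual analysis took the matches from the most significant blastp alignment. The manual correction of clusters is not represented.

```python
from typing import Callable, List


def toy_identity(a: str, b: str) -> float:
    """Percent identity: matching positions / length of the shorter peptide.

    Assumption for illustration: sequences are compared ungapped and
    left-aligned instead of using a blastp alignment.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / min(len(a), len(b))


def cluster_peptides(
    peptides: List[str],
    identity: Callable[[str, str], float] = toy_identity,
    threshold: float = 30.0,
) -> List[List[str]]:
    """Assign each peptide to the cluster with the highest minimum identity."""
    clusters: List[List[str]] = []
    for pep in peptides:
        best_cluster = None
        best_min = threshold  # a cluster must exceed the threshold to qualify
        for members in clusters:
            # Lowest identity between this peptide and any cluster member.
            lowest = min(identity(pep, m) for m in members)
            if lowest > best_min:
                best_cluster, best_min = members, lowest
        if best_cluster is None:
            clusters.append([pep])  # no suitable cluster: start a new one
        else:
            best_cluster.append(pep)
    return clusters


# Hypothetical sequences, chosen only to exercise the rule.
peptides = ["GIGKFLHSAK", "GIGKFLHSAG", "KWKLFKKIEK", "KWKLFKKIGK", "ACDEFGHIKL"]
clusters = cluster_peptides(peptides, threshold=30.0)
```

Because a peptide must exceed the threshold against every member of a cluster, the result depends on the order in which peptides are processed; the sketch makes no attempt to remove that dependence.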
2.2.3  HMM modelling

Once we had created the initial clusters, we created profile hidden Markov models (HMMs) for the clusters, to be used to search for additional members of the AMP groups that were not present in the original AMP dataset. The HMMER software package (Eddy, 1998; http://hmmer.wustl.edu/) was used to create one profile hidden Markov model for each AMP cluster. ClustalW was used to generate the multiple alignments used by HMMER. The HMMER package was chosen over other tools because it is considered to be less sensitive to small misalignments in the multiple sequence alignments and to report reliable E-values (Madera and Gough, 2002).

2.2.4  Iterative enhancement of clusters

To enhance our initial clusters, we identified AMP sequences from Swiss-Prot and used these to enrich the initial clusters of the AMPs by iteratively applying the corresponding HMM models to Swiss-Prot sequences. For the current work, we considered only the Swiss-Prot database, as it contains confirmed and relatively well-studied peptide sequences that allow the process to be validated.

We found that it was not possible to use a specific threshold for significance of match (such as the expectation value, E-value, from BLAST or HMMER) to distinguish between hits to AMPs and non-AMPs. In an attempt to identify an E-value threshold that would distinguish significant matches from matches due to chance when searching the Swiss-Prot database, we evaluated the clustered peptides with the models specific for their cluster, specifying the size of the data set as the number of peptides in Swiss-Prot. When these E-values were plotted against the length of the model, it became clear that there is no E-value that can distinguish significant matches from random matches for short peptides (Figure 2.3). (Note that the length of the hidden Markov model is approximately the length of the peptides upon which it was trained.)

Figure 2.3. 
The relationship between E-value and model length. The peptides in each cluster were scanned with the model corresponding to the cluster. For the shortest models (created from the shortest peptides) the E-values are greater than one.

Since E-values alone are not sufficient to identify significant matches, we decided to use additional information from the Swiss-Prot database to determine significance. For each Swiss-Prot protein, the model giving an HMM match with the lowest E-value was identified. The annotations for the Swiss-Prot protein were used to identify any protein regions overlapping with the region matched by a model. The Swiss-Prot peptide with the highest mutual overlap with the region matched by the model was identified. This peptide was also compared to all peptides in the model's cluster to determine its similarity to a listed AMP. To be considered a significant match, the mutual overlap between the region matched by a model and the annotated peptide was required to be at least 90%. In addition, the blastp match between the Swiss-Prot peptide and the best matching clustered peptide was required to have at least 50% identity over 90% of the peptide length.

Those Swiss-Prot entries that produced a significant match to any of the 186 HMMs (146 for mature peptides and 40 for propeptides) were added into the existing AMP clusters. After peptides were added to a cluster, a new multiple alignment and HMM were constructed as described above. The new model, based on a larger number of sequences, was then used to scan Swiss-Prot. This was repeated until no additional peptides had a significant match: there were 5 iterations for the mature peptide models, and only one for the propeptide models. An example of the resulting change in consensus sequence is shown in Table 2.2.
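
The significance test described above can be sketched as follows. The text does not give a formula for 'mutual overlap', so the sketch assumes it is the length of the shared region divided by the length of the longer of the two regions (so that each region must cover at least 90% of the other). The function names and the 1-based inclusive coordinates are likewise assumptions made for illustration, not the implementation used in this work.

```python
from typing import Tuple

Region = Tuple[int, int]  # (start, end), 1-based inclusive; an assumed convention


def mutual_overlap(a: Region, b: Region) -> float:
    """Percent overlap relative to the longer region (assumed definition)."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    longer = max(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return 100.0 * shared / longer


def is_significant(
    hmm_region: Region,
    annotated_region: Region,
    identity_pct: float,
    coverage_pct: float,
) -> bool:
    """Apply the three thresholds stated in the text.

    identity_pct and coverage_pct describe the blastp match between the
    annotated Swiss-Prot peptide and the best matching clustered peptide.
    """
    return (
        mutual_overlap(hmm_region, annotated_region) >= 90.0
        and identity_pct >= 50.0
        and coverage_pct >= 90.0
    )


# Hypothetical hit: the model matches residues 10-50, the annotation covers 12-50.
accepted = is_significant((10, 50), (12, 50), identity_pct=62.0, coverage_pct=95.0)
# Same blastp match, but the annotated peptide covers only half of the hit.
rejected = is_significant((10, 50), (30, 50), identity_pct=62.0, coverage_pct=95.0)
```

Using the annotation overlap rather than an E-value cut-off is what allows short peptides, whose E-values can exceed one, to be accepted.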
The iterative scanning of the Swiss-Prot database (containing 230,133 peptides) resulted in an additional 389 mature peptides from 229 Swiss-Prot proteins being added to the AMP dataset as candidate AMPs, for a total of 1045 peptides from 970 Uniprot proteins. Sixty-one propeptides were also added, for a total of 253 propeptides from 223 proteins. Peptides were considered to be properly included as AMPs where the annotations included reference to antimicrobial activity or the protein belonged to the same family as a known AMP already in the database (see Methods for details).

The utility of a selection process that does not rely on the E-value can be seen in Cluster 1 of the mature peptides (see on-line supplementary table at http://www.cnbi2.com/cgi-bin/amp.pl?peptide=1&cluster=5&type=MATURE). Starting with an initial 2 AMPs, an additional 9 peptides were added to the cluster. Despite the high E-values (up to 5.9), all peptides were found to have annotations that demonstrate antimicrobial activity.

The relationship between the mature peptides and propeptides from the same protein is shown in Figure 2.4 and Figure 2.5. In Figure 2.4, mature clusters are joined to propeptide clusters where the propeptides are derived from the same protein as a mature peptide in the cluster. Only the mature peptide clusters of at least ten peptides are shown. Similarly, Figure 2.5 shows links from the largest propeptide clusters to mature peptide clusters. These figures suggest there is greater conservation of propeptide sequence, since a greater proportion of propeptide clusters have links to multiple mature clusters. A full mapping between clusters is available as supplementary Figure 2.6.

Figure 2.4. Relationship between mature peptides and propeptides from the same protein for largest mature peptide clusters. For mature peptide clusters of 10 or more peptides, the corresponding propeptide clusters are indicated by a line joining the clusters.
The width of the line indicates the number of propeptides in that cluster that are from the same protein IDs as the mature peptides. Percentage values following the left clusters are the fraction of peptides with links to the right clusters.

Figure 2.5. Relationship between mature peptides and propeptides from the same protein for largest propeptide clusters. The linkage from propeptide clusters with ten or more propeptides is shown. See the caption of Figure 2.4 for details of line width and numbers.

Of the 229 proteins added, 34 either did not have annotation for antimicrobial activity or had annotation specifically stating that they were not antimicrobial. Among these are two groups of peptides that have antimicrobial peptides in the same family: 9 Dahlein peptides are annotated as inactive (2 other Dahleins are active, DAH11_LITDA and DAH12_LITDA), and 8 Aurein peptides are annotated as inactive while 6 are active. An additional 17 peptides are peptide hormones, such as cholecystokinin, that do not have annotations for antimicrobial activity. However, there is considerable controversy over whether certain peptides should be considered antimicrobial; in particular, differing assay conditions used by different investigators lead to differing results. For this reason, these peptides were left in the AMPer database, and it is left to investigators to review the relevant literature provided through links from the AMPer system.

The physico-chemical properties of the mature peptides vary dramatically between clusters. As can be seen in supplementary Table 2.3 for the largest AMP clusters (10 or more peptides), the net charge depends strongly on the type of AMP. As expected, the median charges typically exceed +2, but one class is negative. Except for one cluster, the median hydrophobicity is above 40%, with a maximum of 77%. There are 5 propeptide clusters of size 10 or greater, shown in supplementary Table 2.4.
These tend to be strongly negative and much less hydrophobic than the mature clusters.

N | Consensus
0 | GlLDtLKnlAktagKGalqslLntaSCKLsgqC
1 | GiLDtlKnlAkgvaKgvaqsLLdklsCKlskgC
2 | GiLDtlKnlAkgaAKgvaqsLLdtlkCKltggC
3 | GiLDtlKnlAkgaaKgaaqsLLdtlsCKlsggC
4 | GiLDtlKglAknaGKGvaqsLLdtlsCKisggC
5 | GiLDtlKnlAkgaAKgaAqsLLdtlsCKisggC

Table 2.2. Changing consensus sequence with iteration. The consensus sequence at each iteration is shown for mature peptides in cluster 137. N is the iteration number, with N=0 the initial data from AMSDb.

2.2.5  Accuracy of models

The 186 final clusters were produced with high stringency requirements for matches to HMMs. Such stringency explains the relatively large number of identified clusters containing similar annotation: for example, there are 22 clusters of defensins, which are split along the defensin subclasses (including several subclasses of alpha- and beta-defensins, cryptdins and other enteric defensins). Further investigation of the effect of using lower stringency thresholds for the initial clustering and for addition of peptides to clusters might allow these clusters to be merged and a more representative model to be produced. However, performing additional merges may also lead to incorrect merges that give less accurate models. We consider that the presence of multiple clusters of similar peptides reflects subclasses of these peptides, and that the larger number of higher-accuracy models may be beneficial for further work on mechanisms of action of AMPs that differ between subclasses.

To assess the expected performance of the system in identifying previously unknown AMPs from proproteins, we performed an approximately 10-fold cross-validation on the AMP identification procedure, as described in detail below.
Since we were interested in the capacity of the system to identify AMPs in proproteins, we performed the testing steps of the validation on full proproteins from Swiss-Prot rather than simply on the peptides comprising the clusters. The presence of another peptide from the same protein in both the testing and training sets severely complicates interpretation of the results. The current pipeline is intended to identify proteins that contain additional antimicrobial peptides and will not properly handle recognition of additional peptides of the same cluster type. For this reason, only the 105 mature peptide and 29 propeptide clusters that did not contain more than one peptide from the same proprotein were considered. In addition, at least 2 peptides are required to create an HMM; to select a test peptide from the set, therefore, a minimum cluster size of 3 is needed. This left a total of 81 mature and 20 prosequence clusters for cross-validation.

The results of the cross-validation show great variation in performance for recognizing additional AMPs. The cross-validation sensitivity varied from 0% for one mature cluster containing 3 peptides to 100% for 36 mature clusters. The average sensitivity of all mature clusters was 82% (the standard deviation of the cluster mean sensitivities was 23%). The specificity and accuracy were both 99.2% (SD 1.3%). For the prosequence clusters, the sensitivity also varied between 0% for two clusters of 3 peptides and 100% for 9 clusters, with an average of 81% (SD 30%); the average specificity for the prosequence clusters was 98.8% (SD 2.7%) and the accuracy was 98.8% (SD 2.7%). The values for each cluster are available in supplementary Table 2.5 and supplementary Table 2.6. It should be noted that the specificity is conservatively based on distinguishing a class of AMPs from other possibly very similar AMPs (such as one class of defensins from several other classes of defensins).
As well, the accuracy is dominated by the number of negatives, since the number of actual negatives is much larger than the number of actual positives. In scanning a large database of unrelated proteins such as Swiss-Prot, the specificity and accuracy are expected to be significantly better, since the number of false positives will be much lower, as demonstrated by the low number of total positive matches found for all of Swiss-Prot. The low sensitivity of some clusters is thought to be due to the relatively large variation in sequence in these clusters, especially for clusters containing few peptides. A variety of technical reasons were found for why peptides were missed: the HMM search did not give a significant match (E-value > 10), the HMM match did not align well with the Uniprot feature list, or the BLAST match to the closest training peptide was too poor (data not shown). This suggests that simple tuning of system parameters is unlikely to produce a dramatic increase in sensitivity without an undesired decrease in specificity; a search for better parameters was therefore not pursued in this study.

2.2.6  On-line tools

All materials described here have been made available on-line (http://www.cnbi2.com/cgi-bin/amp.pl). All AMP sequences and final clusters are available for download. In addition, utilities are provided on-line to scan sequence provided by the user and categorize it according to these models. The HMMER HMM files used to predict and classify AMPs are available for researchers to download and use to scan sequence files independently with the HMMER package. This is a unique contribution to the community: one other site, ANTIMIC (Brahmachary et al., 2004; http://research.i2r.a-star.edu.sg/Templar/DB/ANTIMIC), provides some limited search against a few specific models but does not categorize submitted sequence, and does not provide for download of the sequences or the few HMM models it contains.
Web pages are available for viewing the AMPs and corresponding properties. The initial page (http://www.cnbi2.com/cgi-bin/amp.pl) provides links to lists of the AMP clusters and the peptides themselves. In addition to properties such as peptide length, charge and hydrophobicity, the consensus sequence is given, as well as links to navigate to the list of AMPs in each cluster. For each peptide, there are clickable links to the Swiss-Prot web site and to the Swiss-Prot records for the version used in this study. The iteration number ("round") is indicated for each peptide, with round 0 indicating that the peptide is from the original AMSDb set (a link is also given to AMSDb). Several properties of the peptide subsequence matched by the HMM are also given: amino acid sequence, length, charge, hydrophobicity (as the hydrophobic fraction, i.e. the fraction of amino acids that are hydrophobic), position of the subsequence within the main protein, and the E-value of the model match for this peptide. Additionally, the values used for analysis are given: the coverage of the best-matching peptide by the region matched by the HMM and vice versa, and the best-matching (by blastp) previously clustered peptide with percent identity and alignment length.

2.3   Conclusion

In summary, we used a set of documented AMPs to collect additional known gene-coded AMPs into a single database, using a hybrid method for identifying antimicrobial peptides. We clustered the peptides and enriched the clusters with peptides from Swiss-Prot that could be matched by the trained HMMs at high confidence, integrating additional information from pair-wise sequence comparison and annotations of peptide positions. The HMMs and sequence files are available to the public from the AMPer website. We anticipate that these will be useful for discovering novel AMPs from unannotated sequence.
2.4  Methods

2.4.1  Initial peptide set

The initial set of gene-coded AMP sequences was obtained from the Biochemistry Department, University of Trieste, Italy (http://www.bbcm.units.it/~tossi/pag5.htm). These peptides were compared to the current Uniprot (Swiss-Prot and TrEMBL) databases (downloaded from http://www.pir.uniprot.org/ on August 4, 2006) to determine the current naming and annotation of the initial AMPs. Pairwise comparison was done using the blastp algorithm of the BLAST package with no filtering (parameters -F F). We considered a match to be positive when there was at least 99% identity of amino acids over a match length of at least 99% of the length of the AMP in the initial set. For AMSDb proteins with current Uniprot IDs but where the sequence was significantly different, the current Uniprot record was used. Mature peptides and propeptides were identified for each protein using the feature annotations available from Uniprot. For proteins with multiple mature peptides, those peptides annotated as antimicrobial were kept for analysis. Peptides were required to have definite start and end positions (records with '?' were rejected).

2.4.2  Clustering

Pairwise similarity between peptides was calculated using blastp (BLAST package, Altschul et al., 1990) with filtering off (-F F) and a word size of 2 (-W 2). Clusters of similar peptides were constructed based on the pairwise alignments using a percentage match, defined as the number of amino acids identical between the two peptides in the most significant alignment (highest bit score) divided by the length of the shorter of the two peptides. Clusters were built by successively adding peptides to a cluster where the percentage match was greater than the threshold for every peptide in the cluster. The percentage match threshold was varied between 10% and 90% for clustering mature peptides. Multiple alignments were created for each cluster using ClustalW (Thompson et al., 1994).
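
The clustering rule above can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the original implementation: the function names, the toy peptide names and the `identities` lookup (number of identical residues in the best blastp alignment for each pair) are hypothetical, and the order in which peptides are considered is assumed to be the input order, which the text does not specify.

```python
def percent_match(identical: int, len_a: int, len_b: int) -> float:
    """Identical residues in the best alignment divided by the shorter peptide length."""
    return 100.0 * identical / min(len_a, len_b)


def build_clusters(peptides, identities, threshold):
    """Greedy clustering: a peptide joins the first cluster in which its
    percentage match exceeds the threshold for every current member.

    peptides   -- dict mapping peptide name to sequence
    identities -- dict mapping frozenset({name_a, name_b}) to the number of
                  identical residues in the best pairwise alignment
                  (pairs with no alignment are simply absent)
    threshold  -- percentage match threshold, e.g. 30.0
    """
    def match(a, b):
        identical = identities.get(frozenset((a, b)), 0)
        return percent_match(identical, len(peptides[a]), len(peptides[b]))

    clusters = []
    for name in peptides:
        for cluster in clusters:
            if all(match(name, member) > threshold for member in cluster):
                cluster.append(name)
                break
        else:
            # No existing cluster accepted the peptide: start a new one.
            clusters.append([name])
    return clusters


# Toy input: pepA and pepB share 10 of 11 residues; pepC aligns to neither.
toy_peptides = {
    "pepA": "GLLDTLKNLAK",
    "pepB": "GILDTLKNLAK",
    "pepC": "HEKHHSHRGYR",
}
toy_identities = {frozenset(("pepA", "pepB")): 10}
toy_clusters = build_clusters(toy_peptides, toy_identities, threshold=30.0)
```

Requiring the threshold to hold for every member of a cluster means that clusters stay tight, at the cost of producing more, smaller clusters as the threshold rises.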
The alignments of mature peptide clusters resulting from several thresholds were examined. Low thresholds produced clusters containing similar peptides mixed with smaller peptides that were aligned at widely spaced intervals to the longer peptides. The clusters from a 30% threshold were manually edited for both mature peptides and propeptides. Peptides that aligned with a large number of widely spaced inserts were removed, and clusters containing two groups of highly similar peptides were split into two clusters.

2.4.3  Iterative enhancement of clusters

At the start of an iteration, multiple sequence alignments were built for each cluster using ClustalW (as above). The HMMER software package (Eddy, 1998; http://hmmer.wustl.edu/) was used to create one hidden Markov model for each cluster from the multiple alignment, using the utility hmmbuild. Default parameters were used except for the ‘-f’ parameter, which was used to create local models. The Swiss-Prot database was scanned using the HMMER utility hmmsearch for each model file. Custom Java, Python and BASH shell code was used to execute hmmsearch and parse the resulting output. Scanning of Swiss-Prot was performed for all models. For each Swiss-Prot protein matched, the information for the most significant match (lowest E-value) for any model was stored. Sequence regions matched by the HMMs were then compared to the annotated feature regions from Swiss-Prot. The annotated region (mature peptide or propeptide) having the greatest overlap with the HMM match region was stored. As an additional check, the clustered sequences were aligned to the full Swiss-Prot proteins matched by the HMMs using blastp. The best-matching clustered peptide was determined based on highest bit score.
Swiss-Prot peptides were considered positive matches and added to the clusters if the regions matched by the HMMs and the feature annotation agreed over at least 90% of their length, and the best-matching peptide from the same cluster had at least 50% identity to the Swiss-Prot protein. Positive matches were then added to the clusters for mature peptides and propeptides if they were not already present in any cluster. A new multiple alignment was then created using ClustalW, and a new model file was created using HMMER as described above. The Swiss-Prot sequences were scanned again using the new model files, and any additional matching peptides were added to the clusters. The process of scanning Swiss-Prot, adding matching peptides to clusters, and rebuilding the model files was repeated until no additional Swiss-Prot peptides were found. Consensus sequence was obtained using the utility hmmemit with the ‘-c’ option. Mature peptide clusters were mapped to propeptide clusters by identifying clusters containing peptides from the same Uniprot protein. Graphics were created with PyX (http://pyx.sourceforge.net/) and ImageMagick (http://www.imagemagick.org).

2.4.4  Accuracy of models

An approximately 10-fold cross-validation was performed to estimate the expected performance of the models. Cross-validation was performed for each cluster independently of the others. Testing and training sets of peptides were created by randomly assigning peptides in a cluster to a number of sets of approximately equal size. Where the cluster had 10 or more peptides, 10% of the peptides were assigned to each set. Where the number of peptides in a cluster was not evenly divisible by 10, additional peptides were randomly assigned to sets (allowing only one additional peptide per set) until all peptides were assigned to exactly one set. Where a cluster had fewer than 10 peptides, one peptide was assigned to each of N sets, where N is the number of peptides in the cluster.
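
The set assignment just described can be sketched as follows. This is a hedged illustration only: the function name assign_folds and its parameters are hypothetical, and the use of Python's random module (with an optional seed) is an assumption rather than a description of the original code.

```python
import random


def assign_folds(peptides, n_folds=10, seed=None):
    """Randomly split peptides into sets of approximately equal size.

    Clusters with at least n_folds peptides give n_folds sets; leftover
    peptides are spread over randomly chosen sets, at most one extra per set.
    Smaller clusters give one peptide per set (leave-one-out).
    """
    rng = random.Random(seed)
    shuffled = list(peptides)
    rng.shuffle(shuffled)
    if not shuffled:
        return []

    k = min(n_folds, len(shuffled))
    base, extra = divmod(len(shuffled), k)
    # Pick which sets receive one additional peptide.
    larger_sets = set(rng.sample(range(k), extra))

    folds, position = [], 0
    for index in range(k):
        size = base + (1 if index in larger_sets else 0)
        folds.append(shuffled[position:position + size])
        position += size
    return folds


# Illustrative use: 23 hypothetical peptide identifiers give ten sets of 2 or 3.
example_peptides = [f"pep{i}" for i in range(23)]
example_folds = assign_folds(example_peptides, seed=1)
```

Each set in turn would then serve as the positive test data while the remaining sets are used for training.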
By selecting one set in turn as the positive data for the test set and the other sets as positive data for the training sets, the sets of data were prepared to give an approximately 10-fold cross-validation for clusters having 10 or more peptides, and leave-one-out cross-validation for clusters having fewer than 10 peptides. In all cases, peptides from all other clusters were taken as negative test data (HMMs do not use negative training data).

Since the software system was intended to identify unrecognized AMPs from proteins, the system will not attempt to recognize additional peptides from a protein already known to contain AMPs. Therefore, cross-validation was performed using only clusters where each peptide was derived from a unique protein. This avoids the situation where a test peptide is automatically considered a positive match because it belongs to the same protein as a training peptide. In addition, at least two peptides are required to create an HMM; therefore, only clusters of size three or greater were evaluated (so that one peptide would be available for the test set).

The same procedure was used during validation as was used in identifying additional AMPs from Swiss-Prot. For each cluster, the training peptides were used to create an HMM. Since the purpose of the method is to identify AMPs from within full proteins, the HMM was used to scan the full Swiss-Prot proteins corresponding to the test peptides. A BLAST search was performed between the training peptides and the corresponding Swiss-Prot proteins. As before, positives were defined as proteins for which the region matched by the HMMs and the feature annotation agreed over at least 90% of their length, and the best-matching peptide had at least 50% identity to the Swiss-Prot protein.

2.4.5  On-line tools

The web site uses a Perl CGI script running on an Apache Linux server with a MySQL RDBMS. On-line sequence analysis uses utilities from the HMMER package.
2.5  Web resources

Biochemistry Department, University of Trieste, Italy: http://www.bbcm.units.it/~tossi/pag5.htm
HMMER: http://hmmer.wustl.edu/
Uniprot database: http://www.pir.uniprot.org/

2.6  Supplementary material

Cluster | Peptide Families | Number of Peptides | Median Peptide Length | Median Peptide Charge | Peptide Hydrophobicity | Consensus Sequence
146 | Alpha defensin (Neutrophil defensin) | 45 | 33 | 7 | 0.46 | CyCRrgrClsrErlsGtCringriyrLCCR
134 | Apidaecin | 10 | 18 | 4 | 0.56 | nnRPvYipqPRPPHPRl
128 | Aurein, Caeridin, Caerin, Citropin, Dahlein | 12 | 25 | 2 | 0.54 | GLlgsIGkaLGgLladvlKpKlqaa
84 | Aurein, Caerin, Dahlein | 11 | 25 | 2 | 0.52 | GLlsSiGKaLGGlLadvlKpKtqaa
131 | Aurein, Citropin, Dahlein, Dermaseptin, Maculatin | 22 | 23 | 1 | 0.43 | GLwqkIKeklkelAsGaivegvqs
129 | Aurein, Citropin, Uperin | 12 | 14 | 1 | 0.57 | DivKkVvsavggL
139 | Bactericidin, Cecropin, Hyphancin | 31 | 39 | 6 | 0.45 | WlkkifKkiErvGqnvRDaiikagpavqvvaqaaalar
145 | Beta-defensin | 26 | 36 | 6 | 0.47 | dPvtClrnGGiClysrCpgrtrqiGtCGhPkvKCCK
144 | Beta-defensin, Spheniscin, LAP, TAP | 21 | 40 | 9 | 0.50 | slsCrrnkGvCvpirCpgkmrQIGtCfgppVKCCRrk
135 | Bombinin-like, Maximin | 11 | 27 | 3 | 0.48 | GIGakILsgvKtaLKGaakeLAstyln
141 | Brevinin, Gaegurin, Ranatuerin | 18 | 24 | 4 | 0.75 | FLPllaglAAkvlpkiiCsItkKC
137 | Brevinin, Ocellatin, Palustrin, Ranatuerin | 29 | 31 | 3 | 0.46 | GiLDtlKnlAkgaAKgaAqsLLdtlsCKisggC
41 | Brevinin, Ranalexin | 12 | 24 | 3.5 | 0.71 | FLpilaslaakvlpkiiCavtkKC
140 | Caerin, Maculatin | 16 | 24 | 3 | 0.64 | svLgsvakhvlpHvvPviAEkl
118 | Caerulein, Cholecystokinins, Cionin | 30 | 8 | -2 | 0.63 | dYtGwmDF
142 | Cecropin, Sarcotoxin | 22 | 35 | 7 | 0.44 | GrlKKlGKKiEgvGkrvfdAaekaLpvaagvkala
60 | Circulin, Cyclopsychotride, Cyclotide, Cycloviolacin, Kalata | 48 | 30 | 1 | 0.57 | CGESCvvipCyttsvlGCsCxnkVCyrN
133 | Cysteine-rich antifungal protein | 15 | 51 | 5 | 0.49 | QKlcerpSGTwsGVCGNnNACkNQCInLEgArHGSCNYvFPaHkCiCYfPC
132 | Defensin-like protein | 11 | 47 | 5 | 0.41 | ktCenesdTfkGvCitkapCdkhCrnkEkftdGrCskiLrRClCTknC
143 | Defensin, Holotricin, Phormicin, Royalisin, Sapecin, Tenecin | 22 | 43 | 5 | 0.50 | aTCDlLSfegkgvkvnhsaCAahClarGrkGGyCnkkavCvCRn
138 | Fabatin, Gamma-thionin, Gamma hordothionin | 21 | 47 | 7 | 0.40 | rtCesqShrFKGpClsdsNCasVCrnEGFsGGnCrGfrRRCfCtrqC
1 | Hemolytic protein, Ranatuerin, Temporin | 11 | 13 | 1 | 0.77 | FlpaiAsLLgkll
136 | Histatin | 13 | 13 | 7 | 0.16 | HEKHHsHRGYr
95 | Liver-expressed AMP, Penaeidin | 23 | 55 | 10 | 0.51 | kgpYtRpvsrPpfvRPigasPigPYngCdvSCRgisesqARlCckRlGrCChlskgys
112 | Mastoparan | 12 | 14 | 3 | 0.64 | inlKalldlaKkvL
94 | Maximin | 44 | 20 | 2 | 0.60 | ilGPvlglvgnalggllkkl
83 | Melittin | 10 | 26 | 5 | 0.50 | gIGaiLKVLatgLPaliSWiKrKRqq
107 | Osmotin, Thaumatin, Zeamatin | 30 | 217.5 | 0.5 | 0.46 | AtftitNncpytVwAaalpgdgkpqLxgGGreLdsgqSwsldvpaGTwsaRfWgRTgCnfDaSGrGsCqTGDCGGqLsCnGaGapPPaTLAEytLaqfgglDFyDvSLVDGFNlPmsfaPtgGsGdCkaisCaAdiNavCPaeLkvkgsgGsVvACnsACtvFntpqYCCtggndtpetCpPTdYSriFKqqCPdAYSYayDDptSTFTCsggtnYrvtFCP
59 | Temporin, Vespid chemotactic peptide | 16 | 13 | 1 | 0.69 | flPiigkllsglL

Table 2.3. Properties of largest mature peptide clusters.

Cluster | Peptide Families | Number of Peptides | Median Peptide Length | Median Peptide Charge | Peptide Hydrophobicity | Consensus Sequence
35 | Maximins | 45 | 25 | -3 | 0.24 | rseendvqsLsqRdvLeEEsLREiR
37 | Dermaseptin, Dermatoxin | 20 | 20.5 | -11 | 0.00 | eEKrEnEnEeeqEddeqSEe
38 | Beta-defensin 1 | 17 | 11 | 1 | 0.27 | dnFLtGLGHRs
39 | Cryptdin | 18 | 39 | -12 | 0.23 | DpiqntDEEtKtEEqpgEedqAvsvsFGdpeGsaLqeea
40 | Cathelicidin, Myeloid antibacterial peptide, Prophenin-2, Protegrin | 25 | 101 | -4 | 0.44 | qalsYreAvLRAvdqlnersseanlYRLLeLDppPkddedpdtpKpvsFrvKEtvCprttqqppEqCdFKenGlvKqCvGtvtldqvkdsfditCnelqsv

Table 2.4. Properties of largest propeptide clusters. The number of peptides, peptide properties (median length, charge and hydrophobicity) and consensus sequence are shown.
Cluster | Peptide Families | Number of Peptides | Sensitivity [%, mean (SD)] | Specificity [%, mean (SD)] | Accuracy [%, mean (SD)]
1 | Temporin, Ranatuerin, Hemolytic protein | 11 | 100.0 (0.0) | 98.7 (0.1) | 98.7 (0.1)
3 | Chrysophsin, Dicentracin, Moronecidin | 5 | 40.0 (54.8) | 100.0 (0.0) | 99.9 (0.1)
8 | Histone H1, Uperin | 3 | 66.7 (57.7) | 99.8 (0.3) | 99.8 (0.4)
9 | Penaeidin, Corticostatin-related | 4 | 75.0 (50.0) | 99.4 (0.3) | 99.4 (0.3)
11 | Cicerin, Gymnin | 3 | 66.7 (57.7) | 99.5 (0.3) | 99.4 (0.2)
16 | Pardaxin | 6 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
18 | Cathelin-related | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
20 | Clavanin | 5 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
22 | Moricin, Virescein | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
35 | Beta-defensin, Circulin-B | 4 | 75.0 (50.0) | 97.3 (1.7) | 97.2 (1.7)
39 | Sperm-associated antigen 11 | 3 | 66.7 (57.7) | 100.0 (0.1) | 99.9 (0.1)
41 | Brevinin, Ranalexin | 12 | 100.0 (0.0) | 96.5 (0.3) | 96.5 (0.3)
42 | Ceratotoxin, Dermadistinctin, Dermaseptin | 4 | 50.0 (57.7) | 98.6 (0.4) | 98.5 (0.5)
50 | Mast cell degranulating peptide | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
51 | Cecropin | 8 | 100.0 (0.0) | 97.7 (0.1) | 97.7 (0.1)
52 | Defensin heliomicin, ARD1, Mytilin | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
53 | Beta-defensin, heterophil peptide | 4 | 75.0 (50.0) | 99.7 (0.0) | 99.7 (0.1)
55 | Uperin, Maculatin | 5 | 80.0 (44.7) | 99.6 (0.1) | 99.6 (0.1)
56 | Dermaseptin, Dermadistinctin | 6 | 100.0 (0.0) | 98.4 (0.1) | 98.4 (0.1)
58 | Alpha-defensin 6, Corticostatin | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
59 | Temporin, Vespid chemotactic peptide | 16 | 100.0 (0.0) | 99.1 (0.1) | 99.1 (0.1)
60 | Circulin, Cyclopsychotride, Cyclotide, Cycloviolacin, Kalata | 48 | 67.0 (35.9) | 99.8 (0.1) | 99.6 (0.2)
62 | Beta-defensin, Corticostatin | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
70 | Thaumatin-like | 8 | 50.0 (53.5) | 100.0 (0.0) | 99.9 (0.1)
71 | Ceratotoxin, Pleurocidin | 5 | 100.0 (0.0) | 99.7 (0.0) | 99.7 (0.0)
72 | Lebocin | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
75 | Cryptdin, Circulin | 4 | 75.0 (50.0) | 100.0 (0.0) | 100.0 (0.1)
76 | Styelin | 3 | 66.7 (57.7) | 99.9 (0.1) | 99.9 (0.0)
77 | Styelin, Phylloxin | 3 | 66.7 (57.7) | 99.8 (0.1) | 99.7 (0.0)
78 | Defensin-like peptide | 3 | 66.7 (57.7) | 100.0 (0.0) | 100.0 (0.1)
79 | Holotricin-3, Tenecin-3 | 3 | 0.0 (0.0) | 100.0 (0.0) | 99.9 (0.0)
81 | Hadrurin, Opistoporin, Pandinin | 4 | 75.0 (50.0) | 100.0 (0.0) | 100.0 (0.1)
83 | Melittin | 10 | 100.0 (0.0) | 99.8 (0.1) | 99.8 (0.1)
86 | Esculentin, Ranatuerin | 6 | 100.0 (0.0) | 99.8 (0.1) | 99.8 (0.1)
87 | Cathelin | 4 | 50.0 (57.7) | 99.3 (0.8) | 99.2 (0.9)
88 | Dermaseptin | 5 | 60.0 (54.8) | 99.3 (0.4) | 99.3 (0.4)
89 | Uperin | 3 | 100.0 (0.0) | 99.4 (0.0) | 99.4 (0.0)
90 | Beta-defensin | 3 | 100.0 (0.0) | 97.8 (0.0) | 97.8 (0.0)
91 | Defensin, Plectasin | 4 | 100.0 (0.0) | 99.4 (0.0) | 99.4 (0.0)
92 | Eosinophil granule major basic protein | 6 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
94 | Maximin | 44 | 100.0 (0.0) | 98.8 (0.0) | 98.8 (0.0)
95 | Penaeidin, Liver-expressed AMP | 23 | 100.0 (0.0) | 99.9 (0.0) | 99.9 (0.0)
96 | Ponericin, Pandinin, Gaegurin | 6 | 83.3 (40.8) | 97.1 (1.2) | 97.0 (1.3)
98 | Defensin | 4 | 100.0 (0.0) | 99.6 (0.0) | 99.6 (0.0)
99 | Tachyplesin, Rhesus theta defensin | 5 | 80.0 (44.7) | 99.8 (0.1) | 99.8 (0.2)
100 | Hepcidin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
101 | AFP2B, Defensin J1-1, Drosomycin | 4 | 50.0 (57.7) | 97.5 (0.5) | 97.4 (0.4)
102 | Protegrin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
103 | Tigerinin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
104 | Cecropin-A, Hyphancin, Moricin | 4 | 75.0 (50.0) | 97.1 (0.0) | 97.1 (0.1)
105 | Defensin | 4 | 75.0 (50.0) | 100.0 (0.0) | 100.0 (0.1)
106 | Bactenecin | 4 | 75.0 (50.0) | 100.0 (0.0) | 100.0 (0.1)
107 | Thaumatin, Osmotin, Zeamatin | 30 | 86.7 (17.2) | 98.9 (0.1) | 98.8 (0.1)
108 | Tachyplesin, Polyphemusin, Hepcidin | 4 | 75.0 (50.0) | 99.4 (0.0) | 99.4 (0.1)
109 | Basal layer antifungal peptide | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
110 | Pseudin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
111 | Maximins, Ponericin | 4 | 50.0 (57.7) | 93.3 (0.1) | 93.2 (0.2)
112 | Mastoparan | 12 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
113 | Amoebapore (ameobapore) | 5 | 80.0 (44.7) | 100.0 (0.0) | 100.0 (0.1)
114 | Andropin | 6 | 50.0 (54.8) | 100.0 (0.0) | 99.9 (0.1)
115 | Bombolitin | 5 | 80.0 (44.7) | 99.9 (0.1) | 99.9 (0.1)
117 | BPI, LBP (Lipopolysaccharide-binding protein, Bactericidal permeability-increasing protein) | 8 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
118 | Caerulein, Cholecystokinins, Cionin | 30 | 56.7 (38.7) | 100.0 (0.0) | 99.8 (0.2)
119 | Ponericin | 5 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
120 | Gallinacin | 5 | 40.0 (54.8) | 99.9 (0.0) | 99.8 (0.1)
121 | Uperin | 5 | 80.0 (44.7) | 99.1 (0.2) | 99.1 (0.1)
123 | Metalnikowin, Pyrrhocoricin | 5 | 60.0 (54.8) | 100.0 (0.0) | 99.9 (0.1)
124 | Antimicrobial peptide | 6 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
125 | Esculentin, Gaegurin, Rugosin | 7 | 100.0 (0.0) | 99.1 (0.1) | 99.1 (0.1)
126 | Dermaseptin | 6 | 100.0 (0.0) | 99.0 (0.2) | 99.1 (0.2)
127 | Bombinin, Maximin | 6 | 100.0 (0.0) | 93.7 (0.0) | 93.8 (0.0)
131 | Aurein, Citropin, Dahlein, Dermaseptin, Maculatin | 22 | 100.0 (0.0) | 99.1 (0.2) | 99.1 (0.2)
132 | Defensin-like protein | 11 | 80.0 (42.2) | 98.0 (0.5) | 98.0 (0.5)
133 | Cysteine-rich antifungal protein | 15 | 100.0 (0.0) | 99.7 (0.0) | 99.7 (0.0)
137 | Brevinin, Ranatuerin, Palustrin, Ocellatin | 29 | 100.0 (0.0) | 99.2 (0.3) | 99.2 (0.3)
138 | Fabatin, Gamma-thionin, Gamma hordothionin | 21 | 90.0 (21.1) | 99.7 (0.0) | 99.7 (0.1)
139 | Bactericidin, Cecropin, Hyphancin | 31 | 100.0 (0.0) | 98.3 (0.0) | 98.3 (0.0)
141 | Brevinin, Gaegurin, Ranatuerin | 18 | 95.0 (15.8) | 96.9 (0.2) | 96.9 (0.2)
143 | Defensin, Holotricin, Sapecin, Tenecin, Royalisin, Phormicin | 22 | 90.0 (21.1) | 100.0 (0.0) | 100.0 (0.1)
144 | Beta-defensin, Spheniscin, LAP, TAP | 21 | 60.0 (39.4) | 99.2 (0.4) | 99.1 (0.3)
145 | Beta-defensin | 26 | 70.0 (23.3) | 98.4 (0.4) | 98.3 (0.4)

Table 2.5. Performance of AMP identification method determined by cross-validation for mature peptide clusters. An approximately 10-fold cross-validation was performed for each cluster where the number of peptides was greater than 10. For clusters with less than 10 peptides, a leave-one-out cross-validation was performed.
Clusters were included in this analysis where the number of peptides was 3 or greater and where no two peptides were derived from the same Swiss-Prot protein (see main text). The performance measures of sensitivity, specificity and accuracy are reported as the mean and standard deviations of the values calculated for the cluster during the cross-validation.

Cluster | Peptide Families | Number of Peptides | Sensitivity [%, mean (SD)] | Specificity [%, mean (SD)] | Accuracy [%, mean (SD)]
2 | Hepcidin | 4 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
4 | Hepcidin | 3 | 66.7 (57.7) | 100.0 (0.0) | 99.8 (0.4)
15 | Rhesus theta defensin | 3 | 100.0 (0.0) | 99.8 (0.4) | 99.8 (0.4)
17 | Phormicin, Phormicin, Sapecin | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
18 | Pleurocidin | 4 | 75.0 (50.0) | 100.0 (0.0) | 99.8 (0.3)
19 | Floral defensin-like protein | 3 | 0.0 (0.0) | 100.0 (0.0) | 99.3 (0.0)
20 | Lebocin | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
21 | Styelin | 3 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
22 | Defensin | 3 | 66.7 (57.7) | 100.0 (0.0) | 99.8 (0.4)
23 | Cryptdin | 3 | 0.0 (0.0) | 94.9 (5.8) | 94.3 (5.7)
28 | Neutrophil defensin | 6 | 66.7 (51.6) | 97.1 (1.4) | 96.9 (1.2)
29 | Ranalexin, Esculentin-1B, Gaegurin-5, Ranalexin, Temporin-G | 6 | 83.3 (40.8) | 98.8 (0.6) | 98.7 (0.3)
30 | Brevinin, Gaegurin | 4 | 100.0 (0.0) | 98.8 (0.3) | 98.8 (0.3)
31 | Corticostatin, Neutrophil antibiotic peptide | 5 | 80.0 (44.7) | 100.0 (0.0) | 99.9 (0.3)
32 | Neutrophil defensin, Defensin 5 | 9 | 100.0 (0.0) | 88.7 (1.6) | 88.8 (1.6)
33 | Maximins | 6 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
37 | Dermaseptin, Dermatoxin | 20 | 95.0 (15.8) | 99.9 (0.2) | 99.8 (0.3)
38 | Beta-defensin 1 | 17 | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0)
39 | Cryptdin | 18 | 95.0 (15.8) | 97.8 (0.2) | 97.7 (0.4)
40 | Cathelicidin, Myeloid antibacterial peptide, Prophenin-2, Protegrin | 25 | 88.3 (24.9) | 100.0 (0.0) | 99.8 (0.5)

Table 2.6. Performance of AMP identification method determined by cross-validation for propeptide clusters. An approximately 10-fold cross-validation was performed for each cluster where the number of peptides was greater than 10. For clusters with less than 10 peptides, a leave-one-out cross-validation was performed. Clusters were included in this analysis where the number of peptides was 3 or greater and where no two peptides were derived from the same Swiss-Prot protein (see main text). The performance measures of sensitivity, specificity and accuracy are reported as the mean and standard deviations of the values calculated for the cluster during the cross-validation.

Figure 2.6. Relationship between mature peptides and propeptides from the same protein clusters of all sizes. The corresponding propeptide clusters are indicated by a line joining the mature clusters. The width of the line indicates the number of propeptides in that cluster that are from the same protein IDs as the mature peptides.

2.7  References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215: 403-410.
Brahmachary, M., Krishnan, S.P.T., Koh, J.L.Y., Khan, A.M., Seah, S.H., Tan, T.W., Brusic, V., Bajic, V.B. (2004) ANTIMIC: a database of antimicrobial sequences. Nucl. Acids Res. 32: 90001, 1-589.
Bowdish, D.M., Davidson, D.J., Hancock, R.E.W. (2005) A Re-evaluation of the Role of Host Defence Peptides in Mammalian Immunity. Curr. Protein. Pept. Sci., 6: 35-51.
Brogden, K.A. (2005) Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol., 3: 238-50.
Chapple, D.S., Hussain, R., Joannou, C.L., Hancock, R.E.W., Odell, E., Evans, R.W., Siligardi, G. (2004) Structure and Association of Human Lactoferrin Peptides with Escherichia coli Lipopolysaccharide. Antimicrob. Agents Chemother., 48: 2190-2198.
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.J. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Cambridge Univeristy press, Cambridge, UK Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14: 755-763. Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhance to treat microbial infections? Nature Reviews Microbiology, 2: 497-504. Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212. Hancock, R.E.W. (2001) Cationic peptides: effectors in innate immunity and novel antimicrobials. The Lancet Infectious Diseases, 1: 156-164. Hancock, R.E.W., Rozek, A. (2002) Role of membranes in the activities of antimicrobial cationic peptides. FEMS Microbiology Letters, 206: 143-149 Hwang, P.M., Vogel, H.J. (1998) Structure-function relationships of antimicrobial peptides. Biochem. Cell Biol., 76:235-46. Jack, R.W., Tagg, J.R., Ray, B. (1995) Bacteriocins of gram-positive bacteria. Microbiol Rev., 59: 171-200. Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407. Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122 - S129. Madera, M., Gough, J. (2002) A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res., 30: 4321-4328.  68 Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284: 1201-1210. Patrzykat, A., Friedrich, C.L., Zhang, L., Mendoza, V., Hancock, R.E.W. (2002) Sublethal Concentrations of Pleurocidin-Derived Antimicrobial Peptides Inhibit Macromolecular Synthesis in Escherichia coli. Antimicrob. Agents Chemother., 46:605-614. Powers, J.P.S., Hancock, R.E.W. (2003). The relationship between peptide structure and antibacterial activity. 
Peptides, 24: 1681-1691 Schutte, B.C., Mitros, J.P., Bartlett, J.A., Walters, J.D., Jia, H.P., Welsh, M.J., Casavant, T.L., McCray, P.B. (2002)  Discovery of five conserved beta -defensin gene clusters using a computational search strategy. Proc. Natl. Acad. Sci. USA, 99: 2129-2133. Scheetz, T., Bartlett, J.A., Walters, J.D., Schutte, B.C., Casavant, T.L.,  McCray, P.B. (2002) Genomics-based approaches to gene discovery in innate immunity. Immunol Rev., 190:137-145 Sima, P., Trebichavsky, I., Sigler, K. (2003) Mammalian antibiotic peptides. Folia Microbiol., 48: 123-137. Sima, P., Trebichavsky, I., Sigler, K. (2003) Non-mammalian vertebrate antibiotic peptides. Folia Microbiol., 48: 709-724. Sing, T., Sander, O., Beerenwinkel, N., Lengauer, T. (2005) ROCR: visualizing classifier performance in R. Bioinformatics, 21: 3940-3941 Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994)  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22:4673-4680. Yeaman, M.R., Yount, N.Y. (2003) Mechanisms of Antimicrobial Peptide Action and Resistance. Pharmacol Rev., 55: 27-55. Yount, N.Y., Yeaman, M.R. (2004) .Multidimensional signatures in antimicrobial peptides. Proc. Natl. Acad. Sci. USA, 101: 7363-7368   69 Chapter 3: Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome                  A version of this chapter has been published as: Fjell CD, Jenssen H, Fries P, Aich P, Griebel P, Hilpert K, Hancock RE, Cherkasov A. (2008) Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome. Proteins. 73:420-30.  
3.1  Introduction

Host defense peptides (known also as antimicrobial peptides, AMPs) are natural peptides produced as part of the innate immune system of a broad range of organisms including mammals, insects, amphibians, plants and amoeboid protozoa among others (Simmaco et al. 1998; Khush et al. 2001; Sima et al. 2003; Sima et al. 2003). As resistance of pathogenic microorganisms to conventional antibiotic therapeutics increases, AMPs have drawn significant scientific attention as a novel class of prospective anti-infective therapeutics (Hancock and Lehrer 1998; Hancock and Chapple 1999; Hancock 2003; Marshall and Arenas 2003). They offer several advantages including fast target killing, a broad range of activity, low toxicity and minimal development of resistance in target organisms (Hancock and Lehrer 1998; Hancock and Chapple 1999; Hancock 2003; Marshall and Arenas 2003; Sima et al. 2003; Sima et al. 2003). Their mechanisms of killing are diverse and include membrane disruption (Bechinger 1997; Bechinger 1999; Blondelle et al. 1999; Epand and Vogel 1999; Shai 1999) as well as metabolic inhibition of intracellular targets (Brogden 2005). In addition to direct killing, certain host defense peptides play important roles in modulation of the innate immune response, both through up-regulation to enhance killing of pathogens and through down-regulation to reduce detrimental conditions such as sepsis (Mookherjee and Hancock 2007). The relative importance of direct killing by AMPs versus immunomodulation remains unclear (Bowdish et al. 2005).

Limited numbers of novel antimicrobial peptides have been identified previously with the help of computational approaches (Scheetz et al. 2002; Schutte et al. 2002; Patil A 2004; Patil AA 2005; Looft C et al. 2006; Belov K et al. 2007; Lynn DJ and DG 2007). The majority of these studies (Scheetz et al. 2002; Schutte et al.
2002; Patil A 2004; Patil AA 2005; Lynn DJ and DG 2007) searched specifically for the presence of a particular class of antimicrobial peptide, the defensins, which belong to three sub-families: the alpha-, beta- and theta-defensins. Two techniques are commonly used to identify novel peptides by sequence analysis: comparing examples of a class of peptides to a novel sequence in a pairwise fashion (for example using a BLAST analysis (Altschul et al. 1990)), or using a set of similar peptides to construct a profile of the class of peptides and then deriving a statistical model of the class for searching novel sequences (for example using profile hidden Markov models (Durbin et al. 1998)).

Profile hidden Markov models (HMMs) have been used extensively for large-scale analysis of protein sequences (Durbin et al. 1998; Sonnhammer et al. 1998) and we have previously developed the AMPer resource (Fjell et al. 2007) (http://www.cnbi2.com/cgi-bin/amp.pl) that includes HMMs to describe and predict AMPs based on peptide sequence. AMPer includes all AMPs that were available in the Uniprot database and separately describes mature peptides and propeptides, which were determined based on Uniprot annotations. These have been grouped into sets of related peptides, with each set used to produce one hidden Markov model specific to that subclass of AMP. AMPer includes 1045 mature peptides (with 146 corresponding models) and 253 propeptide sequences (with 40 corresponding models) derived from 970 Uniprot proteins.

Models from AMPer provide the means to perform high-throughput analysis to discover novel AMPs that are related to peptides that are currently known. This serves to identify additional peptides that may have antimicrobial activity and may suggest the absence of a class of peptide in an organism. As an example, we consider the alpha-defensins: there are currently no recognized alpha-defensins in the bovine genome.
Phylogenetic analysis of defensins has suggested that all defensins in the mammalian lineage have been derived from a single ancestral beta-defensin and that alpha-defensins arose from beta-defensins by a process of gene duplication followed by diversification in response to the pathogens encountered in the particular ecological niche of the organism (Patil A 2004; Xiao et al. 2004; Patil AA 2005). Alpha-defensins were until recently believed to be restricted to the primate and glires (rodents and lagomorphs) lineages (Patil A 2004; Xiao et al. 2004; Patil AA 2005); however, more recent analysis of defensins from a broader range of mammals has identified alpha-defensins in opossum (Belov K et al. 2007), elephant and hedgehog tenrec (Lynn DJ and DG 2007), and horse (Looft C et al. 2006).

In the current work, we used hidden Markov models from the AMPer resource to identify bovine AMPs. For this work, we considered nucleic acid sequence from the draft genome sequence and expressed sequence tags (ESTs, single-pass sequences of cDNA created from mRNA). Our aim was to discover previously uncharacterized gene-coded bovine AMPs of all classes as well as to test the hypothesis that the bovine genome lacks alpha-defensins.

3.2  Results and discussion

3.2.1  Identification of host defense peptides

We used the AMPer models of mature peptides to identify known and potentially novel bovine antimicrobial sequences using expressed sequence tags (from the NCBI dbEST resource, http://www.ncbi.nlm.nih.gov/dbEST/, (Boguski et al. 1993)) and genomic sequence (from the Baylor College of Medicine Human Genome Sequencing Center, http://www.hgsc.bcm.tmc.edu/projects/bovine). These were translated into all six reading frames and scanned with each of the 146 AMP models. Results are presented here using the current dbEST resource containing 1,433,737 bovine ESTs (downloaded Aug 25, 2007).
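The six-frame translation step mentioned above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and plain A/C/G/T input; the function names are hypothetical and this is not the code used to build or run AMPer.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation table is needed for bovine nuclear sequence).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict[int, str]:
    """Map frame (+1, +2, +3, -1, -2, -3) to its translated protein sequence."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each of the six translated sequences would then be scanned with the HMMs; stop codons appear as `*` and can be used to split a frame into candidate open segments.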
The models of mature peptides produced 5,628 matches with an E-value < 10, consisting of 4,591 unique ESTs. Of these, 2,228 had an E-value < 1 and covered at least 25% of the length of the model.

We identified unique sequences by clustering the matched proteins using an all-vs-all comparison: each matched protein was compared to every other matched protein with blastp (Altschul et al. 1990). Where predicted peptides were at least 90% identical, we conservatively considered these to be the same antimicrobial peptide (at the risk of grouping together closely related peptides that are in fact distinct). By repeating this pairwise comparison, a total of 278 potential peptides were identified. From these 278 peptides, we selected those that were matched at high statistical significance (an HMM E-value <= 1e-5), resulting in the 124 potential peptides shown in Table 3.1.

We mapped the 34 known bovine AMPs using the full protein sequence (Table 3.2) to all predicted protein sequences from the ESTs using pairwise comparison (the blastp algorithm (Altschul et al. 1990)) to identify the most likely ESTs corresponding to the bovine AMP. We similarly mapped the 34 known bovine AMPs to those predicted protein sequences identified by AMPer as containing an AMP (Table 3.3). Since we expect these sequences to differ only due to artifacts such as sequencing errors, we called a match significant where there was at least 95% sequence identity between the known bovine AMP and the other sequence, and where the matching region between the two sequences (reported by blastp) covered at least 95% of the shorter sequence (this was meant to allow for the untranslated regions of the mRNA). A total of 27 known bovine AMPs had significant matches to ESTs.
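The grouping of predicted peptides at 90% identity can be sketched as below. This is only an illustration: it assumes single-linkage grouping (any pair at or above the threshold joins two groups), which the text does not state explicitly, and it takes the pairwise percent identities as a ready-made dictionary rather than parsing blastp output. All names and values are hypothetical.

```python
def cluster_by_identity(names, pairwise_identity, threshold=90.0):
    """Group peptides so that any pair with identity >= threshold shares a cluster.

    names: iterable of peptide identifiers.
    pairwise_identity: dict mapping (name_a, name_b) -> percent identity.
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in pairwise_identity.items():
        if identity >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identities for four predicted peptides.
identities = {("p1", "p2"): 96.0, ("p2", "p3"): 91.5, ("p3", "p4"): 62.0}
clusters = cluster_by_identity(["p1", "p2", "p3", "p4"], identities)
```

Under single linkage, p1 and p3 end up in the same group through p2 even if they are less than 90% identical to each other, which is the "grouping together closely related peptides" risk noted in the text.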
Since some AMPs are subsequences of other AMPs and ESTs may also be significantly shorter than the cDNA from which they are sequenced, it is difficult to determine uniquely which bovine AMPs were identified where multiple known bovine AMP sequences mapped to the same EST sequence. For four bovine AMPs the best matching EST was not unique (one other known bovine AMP also matched that EST most significantly of all ESTs). These are indicated by a '(2)' on four entries in Table 3.3. Similarly, a total of 27 were also found to have significant matches to AMPs predicted by AMPer, though the list of known AMPs with no clearly matching predicted AMP is slightly different from the list of known AMPs with no clearly matching EST (24 had good matches to both ESTs and AMPer predictions).

AMPer Model | Peptide Families | Number of AMPs
17 | 15 kDa protein | 1
66 | Apolipoprotein A-II | 12
106 | Bactenecin | 1
90 | Beta-defensin | 4
145 | Beta-defensin | 7
144 | Beta-defensin, Spheniscin, LAP, TAP | 6
35 | Beta-defensin, Circulin-B | 1
117 | BPI, LBP (bactericidal permeability-increasing protein, lipopolysaccharide-binding protein) | 29
116 | Cathelicidin | 1
87 | Cathelin | 9
18 | Cathelin-related | 1
133 | Cysteine-rich antifungal protein | 1
92 | Eosinophil granule major basic protein | 7
13 | Granulysin, NK-lysin | 5
27 | Hemolin | 2
64 | Hepcidin | 1
8 | Histone H1, Uperin | 5
12 | Histone H2A | 19
38 | Histone H2A | 8
24 | Myeloid antibacterial peptide | 2
95 | Penaeidin, Liver-expressed AMP | 1
39 | Sperm-associated antigen 11 | 1

Table 3.1. Numbers of predicted antimicrobial peptides. An E-value threshold of 1e-5 was used to determine significance of an HMM match.

Uniprot ID | Description | AMPer model
APOA2_BOVIN | Apolipoprotein A-II | 66
BCTN1_BOVIN | Bactenecin-1 | 19
BCTN5_BOVIN | Bactenecin-5 | 106
BCTN7_BOVIN | Bactenecin-7 | 106
BD01_BOVIN | Beta-defensin 1 | 144
BD02_BOVIN | Beta-defensin 2 | 144
BD03_BOVIN | Beta-defensin 3 | 144
BD04_BOVIN | Beta-defensin 4 | 144
BD05_BOVIN | Beta-defensin 5 | 144
BD06_BOVIN | Beta-defensin 6 | 144
BD07_BOVIN | Beta-defensin 7 | 90
BD08_BOVIN | Beta-defensin 8 | 90
BD09_BOVIN | Beta-defensin 9 | 90
BD10_BOVIN | Beta-defensin 10 | 145
BD11_BOVIN | Beta-defensin 11 | 144
BD12_BOVIN | Beta-defensin 12 | 144
BD13_BOVIN | Beta-defensin 13 | 144
BDC7_BOVIN | Beta-defensin C7 | 144
BMA27_BOVIN | Antibacterial peptide BMAP-27 | 24
BMA28_BOVIN | Antibacterial peptide BMAP-28 | 18
BMA34_BOVIN | Antibacterial peptide BMAP-34 | 116
BPI_BOVIN | Bactericidal permeability-increasing protein | 117
CALT_BOVIN | Caltrin | -
CAS2_BOVIN | Casocidin-1 (now CASA2_BOVIN) | -
CCKN_BOVIN | Cholecystokinin | 118
CMGA_BOVIN | Chromogranin-A | -
EAP_BOVIN | Enteric beta-defensin | 144
INDC_BOVIN | Indolicidin | -
LAP_BOVIN | Lingual antimicrobial peptide | 144
LBP_BOVIN | Lipopolysaccharide-binding protein | 117
LEAP2_BOVIN | Liver-expressed antimicrobial peptide 2 | 95
PENK_BOVIN | Synenkephalin, Met-enkephalin | 5
SCG1_BOVIN | Secretogranin-1, Secretolytin, GAWK, BAM-1745 | 133
TAP_BOVIN | Tracheal antimicrobial peptide | 144

Table 3.2. Known bovine antimicrobial peptides. Where a peptide does not belong to an AMPer cluster, a "-" is given.
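The criterion used in the text for calling a match between a known bovine AMP and another sequence significant can be written as a small predicate. The sketch below is an interpretation, not the original implementation: it reads "within 95% of the shorter sequence" as a coverage of at least 95%, and applies the predicate to a few of the EST-side identity and coverage values reported in Table 3.3.

```python
def is_significant_match(identity_pct, coverage_pct, min_identity=95.0, min_coverage=95.0):
    """True when both percent identity and percent coverage reach their thresholds.

    The 95% defaults follow the text; treating the length condition as a
    coverage floor is an assumption of this sketch.
    """
    return identity_pct >= min_identity and coverage_pct >= min_coverage


# (percent identity, coverage %) for the best EST match of selected known AMPs.
est_matches = {
    "BD06_BOVIN": (100.0, 95.2),
    "BDC7_BOVIN": (94.34, 100.0),
    "BMA27_BOVIN": (99.22, 81.0),
    "LBP_BOVIN": (93.44, 79.0),
}
significant = {name: is_significant_match(*values) for name, values in est_matches.items()}
```

With these inputs only BD06_BOVIN passes: BDC7_BOVIN and LBP_BOVIN fall below the identity threshold, and BMA27_BOVIN falls below the coverage threshold.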
Known Bovine AMP | Matched ESTs by BLAST: Matched EST* | % Identity | Match Length (coverage %) | Blast E-value | AMPer Predicted AMPs: Matched Predicted AMP* | % Identity | Match Length (coverage %) | Blast E-value
APOA2_BOVIN | gi|75805025|gb|DT855734.1 | 100 | 100 (100.0) | 7.20E-050 | DBEST_AMP_1858 | 100 | 77 (100.0) | 6.70E-041
BCTN1_BOVIN | gi|119554907|gb|EH155902.1 | 100 | 155 (100.0) | 1.10E-086 | DBEST_AMP_255 | 100 | 101 (100.0) | 8.30E-058
BCTN5_BOVIN | gi|154772689|gb|EV792452.1 | 100 | 176 (100.0) | 6.00E-102 | DBEST_AMP_249 | 100 | 101 (100.0) | 1.50E-057
BCTN7_BOVIN | gi|119563722|gb|EH164717.1 | 100 | 190 (100.0) | 8.00E-111 | DBEST_AMP_304 | 100 | 102 (100.0) | 1.10E-057
BD01_BOVIN | gi|17892782|gb|BM257183.1 | 100 | 38 (100.0) | 3.50E-019 | DBEST_AMP_1047 | 100 | 36 (100.0) | 7.70E-021
BD02_BOVIN | gi|7049236|gb|AW479130.1 | 100 | 40 (100.0) | 5.30E-020 | DBEST_AMP_1428 | 100 | 38 (100.0) | 1.10E-021
BD03_BOVIN | gi|7049236|gb|AW479130.1 (2) | 100 | 57 (100.0) | 2.50E-029 | DBEST_AMP_1428 (2) | 100 | 38 (100.0) | 7.40E-022
BD04_BOVIN | gi|154397167|gb|EV640446.1 | 100 | 63 (100.0) | 6.20E-033 | DBEST_AMP_901 | 100 | 36 (100.0) | 3.00E-021
BD05_BOVIN | gi|17037442|gb|BM106372.1 | 100 | 64 (100.0) | 2.50E-034 | DBEST_AMP_860 | 100 | 37 (100.0) | 8.90E-023
BD06_BOVIN | gi|119558511|gb|EH159506.1 | 100 | 40 (95.2) | 9.00E-020 | DBEST_AMP_2132 | 100 | 38 (100.0) | 6.40E-022
BD07_BOVIN | gi|119561789|gb|EH162784.1 | 100 | 40 (100.0) | 2.00E-019 | DBEST_AMP_1576 | 100 | 38 (100.0) | 1.50E-021
BD08_BOVIN | gi|119564671|gb|EH165666.1 | 100 | 38 (100.0) | 3.90E-018 | DBEST_AMP_308 | 100 | 38 (100.0) | 1.50E-021
BD09_BOVIN | gi|119564671|gb|EH165666.1 (2) | 98.18 | 55 (100.0) | 9.20E-027 | DBEST_AMP_308 (2) | 97.37 | 38 (100.0) | 5.10E-021
BD10_BOVIN | gi|42731051|gb|CK778738.1 | 98.39 | 62 (100.0) | 1.00E-030 | DBEST_AMP_139 | 100 | 36 (100.0) | 1.50E-020
BD11_BOVIN | gi|119531287|gb|EH137278.1 | 100 | 60 (100.0) | 4.60E-031 | DBEST_AMP_209 | 100 | 37 (100.0) | 6.70E-022
BD12_BOVIN | gi|74502222|gb|DT722637.1 | 97.37 | 38 (100.0) | 6.60E-018 | DBEST_AMP_1461 | 97.3 | 37 (100.0) | 1.30E-020
BD13_BOVIN | gi|74502222|gb|DT722637.1 (2) | 97.62 | 42 (100.0) | 5.30E-020 | DBEST_AMP_1461 (2) | 97.3 | 37 (100.0) | 1.20E-020
BDC7_BOVIN | gi|119531287|gb|EH137278.1 (2) | 94.34 | 53 (100.0) | 1.10E-024 | DBEST_AMP_209 (2) | 91.89 | 37 (100.0) | 7.80E-020
BMA27_BOVIN | gi|120572158|gb|EH378295.1 | 99.22 | 128 (81.0) | 1.80E-068 | DBEST_AMP_383 | 98.98 | 98 (100.0) | 6.80E-055
BMA28_BOVIN | gi|119558428|gb|EH159423.1 | 100 | 159 (100.0) | 2.30E-087 | DBEST_AMP_274 | 100 | 113 (100.0) | 1.10E-063
BMA34_BOVIN | gi|61753367|emb|CR452179.2 | 100 | 165 (100.0) | 1.70E-091 | DBEST_AMP_478 | 100 | 129 (100.0) | 8.30E-075
BPI_BOVIN | gi|119650848|gb|EH179456.1 | 99.61 | 254 (52.7) | 2.00E-141 | DBEST_AMP_1174 | 100 | 250 (100.0) | 4.00E-146
CALT_BOVIN | gi|86366255|gb|DY165694.1 | 100 | 80 (100.0) | 2.10E-040 | DBEST_AMP_186 | 30.43 | 23 (28.8) | 3.7
CAS2_BOVIN | gi|70828695|gb|DR712392.1 | 100 | 222 (100.0) | 1.00E-124 | DBEST_AMP_332 | 22.86 | 35 (31.2) | 4.8
CCKN_BOVIN | gi|60967497|gb|DN524024.1 | 100 | 58 (100.0) | 4.00E-027 | DBEST_AMP_358 | 40.91 | 22 (21.8) | 0.03
CMGA_BOVIN | gi|119653666|gb|EH182274.1 | 99.62 | 266 (59.2) | 5.00E-152 | - | - | - | -
EAP_BOVIN | gi|75771874|gb|DT822941.1 | 100 | 64 (100.0) | 3.60E-033 | DBEST_AMP_1091 | 100 | 36 (100.0) | 3.20E-020
INDC_BOVIN | gi|119556821|gb|EH157816.1 | 100 | 144 (100.0) | 2.00E-082 | DBEST_AMP_542 | 100 | 100 (100.0) | 2.70E-056
LAP_BOVIN | gi|154466011|gb|EV693095.1 | 100 | 64 (100.0) | 6.80E-032 | DBEST_AMP_746 | 100 | 38 (100.0) | 1.10E-020
LBP_BOVIN | gi|82642070|gb|DV789175.1 | 93.44 | 380 (79.0) | 0 | DBEST_AMP_1816 | 93.44 | 380 (99.7) | 0
LEAP2_BOVIN | gi|154538028|gb|EV742363.1 | 100 | 77 (100.0) | 2.40E-039 | DBEST_AMP_729 | 100 | 40 (100.0) | 3.00E-021
PENK_BOVIN | gi|119686348|gb|EH206269.1 | 100 | 246 (93.5) | 2.00E-144 | DBEST_AMP_978 | 38.89 | 18 (60.0) | 9.8
SCG1_BOVIN | gi|82827500|gb|DV893271.1 | 99.66 | 297 (46.0) | 1.00E-177 | DBEST_AMP_549 | 100 | 13 (100.0) | 0
TAP_BOVIN | gi|154464382|gb|EV691466.1 | 100 | 64 (100.0) | 1.10E-032 | DBEST_AMP_753 | 100 | 36 (100.0) | 9.20E-020

Table 3.3.
Identification of known bovine host defense peptides in dbEST sequences. EST sequences were mapped to known bovine AMPs based on pairwise similarity using BLAST.
* a '(2)' indicates that this is the second entry with the same identifier. The entries consisting of a dash (-) indicate no match was found.

3.2.2  Selection of predicted AMPs for confirmation

We manually examined the sequences of these predicted AMPs to identify peptides of interest for laboratory follow-up. Using on-line tools that we developed, we examined multiple alignments of these predicted AMPs alongside the following: the most similar known bovine AMP, the most similar AMP from any species (if different from bovine) and the peptides that were used to construct the AMPer model. (These are available from links on the bovine analysis pages at the AMPer site.) We chose two predicted AMPs for follow-up that appeared to be novel and belong to the cathelicidin family. Two ESTs corresponding to these predicted AMPs were identified for laboratory analysis of changes in expression due to infection as discussed below.

The first predicted AMP that we sought to confirm was DBEST_AMP_248, matched by model 17. This peptide sequence was compared to all proteins in Uniprot (Swiss-Prot and TrEMBL) using the on-line BLAST utility at http://www.expasy.org/tools/blast. Since we began this work, an entry containing DBEST_AMP_248 has been deposited in TrEMBL as A5PJH7_BOVIN (discussed below) based only on cDNA sequencing. The most similar peptide to DBEST_AMP_248 is an antimicrobial peptide found in rabbit, P15B_RABIT, designated as "15 kDa protein" (Levy et al. 1993) with 55% sequence identity and 99.2% coverage. The most similar known bovine AMP, Bactenecin-7 (BCTN7_BOVIN, now called CTHL3_BOVIN in the current version of Uniprot) has only 33% sequence identity and 95.8% coverage.
Based on earlier data, in place of DBEST_AMP_248, we examined the predicted AMP DBEST_AMP_397 and EST sequence gi|12122965|gb|BF775065.1 (a slightly shorter sequence within the same cluster of predicted AMP sequences as DBEST_AMP_248). As shown in Figure 3.1, the translated EST sequence (BF775065.1) shows good alignment with the 15kDa protein sequence and poorer alignment with the bovine peptide BCTN7_BOVIN. In Figure 3.1, the underlined regions indicate the region of mature peptide corresponding to the active antimicrobial peptide.

The second predicted novel AMP that we sought to confirm was identified from EST gi|15378291|gb|BI537181.1, as predicted peptide DBEST_AMP_416, matched by model 87. This predicted AMP matches a short region of the sequence for the known bovine AMP, Bactenecin-5 (BCTN5_BOVIN, now CTHL2_BOVIN in the current Uniprot). Examination of the translated EST sequence that was recognized by the AMPer model and produced DBEST_AMP_416 shows that it codes for a similar protein with differences near the N-terminus. The predicted sequence is shown in Figure 3.2, along with the proteins that were used to construct the AMPer model that recognized this peptide, and the closest matching known bovine AMP (BCTN5_BOVIN). We compared the EST sequence (232 nucleotides) for this predicted AMP to the current bovine genome in Ensembl (http://www.ensembl.org) and did not find a significant match except to a short region of the genomic sequence for Bactenecin-5: 52 positions on chromosome 22 (49,818,207 to 49,818,362) matched the EST from positions 27 to 78. This region overlaps with Bactenecin-5 exon 4 (ENSBTAE00000175540) 49,818,093 to 49,818,356 and extends 6 positions into intron 3-4. Neighboring DNA regions on the chromosome did not contain additional flanking EST sequence that would be expected if the sequences were separated in the genome due to introns.
However, the EST sequence matched a longer region of 77 nucleotides (EST region 17-90) on a sequence contig from whole genome shotgun (gi|112113766|gb|AAFC03064548.1| Ctg60.CH240-439A19). This suggests that the predicted AMP from DBEST_AMP_416 is from a novel gene that has not yet been incorporated into the genome assembly. Because the sequence was originally found in expressed sequence, it appears to be a true gene rather than a pseudogene, although we were unable to identify the full gene sequence in the genome.

Figure 3.1. Multiple alignment of predicted host defense peptide DBEST_AMP_397. The predicted peptide DBEST_AMP_397 is shown aligned to all peptides in the AMPer cluster, the most similar AMP (P15B_RABIT), the most similar bovine AMP (BCTN7_BOVIN), and the EST that DBEST_AMP_397 was derived from (BF775065.1). Underlined sequence indicates the position of mature peptides within the proteins. The consensus sequence of the AMPer model is also shown (HMM_consensus).

Figure 3.2. Multiple alignment of predicted host-defense peptide DBEST_AMP_416. The predicted peptide DBEST_AMP_416 is shown aligned to all peptides in the AMPer cluster, the most similar bovine AMP (BCTN5_BOVIN), and the EST that DBEST_AMP_416 was derived from (BI537181.1). Underlined sequence indicates the position of mature peptides within the proteins. The consensus sequence of the AMPer model is also shown (HMM_consensus).

3.2.3  Analysis of predicted novel AMP gene expression

We designed primers to detect and amplify RNA corresponding to these two putative AMPs along with two housekeeping genes (GAPDH and beta-actin) that serve as positive controls. Quantitative real-time PCR (qRT-PCR) was performed using these primers on total RNA derived from bovine peripheral blood mononuclear cells (PBMC), and tissue collected from the bovine small intestine (ileum). The intestinal tissue was sampled both prior to and 4 hours after challenge with S. typhimurium; the infection was performed as described previously (Coombes et al. 2005).

Initial qRT-PCR products were run on an agarose gel and showed faint bands (Figure 3.3). The qRT-PCR products were re-amplified using a 30-cycle TaqMan PCR protocol and visualized on a gel (Figure 3.4). The DBEST_AMP_397 product is clearly visible and up-regulated in response to bacterial infection in intestinal tissue. However, the DBEST_AMP_416 product cannot be distinguished from negative control lanes in Figure 3.4, and the presence of two bands rather than the expected single band in Figure 3.3 suggests the putative AMP product for DBEST_AMP_416 was not found.

Figure 3.3. Gel image of qRT-PCR for putative AMPs in blood and tissue. The DBEST_AMP_397 (P397) and DBEST_AMP_416 (P416) products are visible. B-actin lanes are positive control lanes and NTC lanes are "no template" controls.

Figure 3.4. Gel image of putative AMPs following TaqMan re-amplification. The DBEST_AMP_397 (P397) product is clearly visible in the infected tissue but not healthy tissue. While a difference is observed for DBEST_AMP_416 (P416) between healthy and infected tissue, the P416 lane does not produce a useful band and is not distinguishable from NTC. GAPDH lanes are positive control lanes and NTC lanes are "no template" controls.

3.2.4  Absence of alpha-defensins

Notably absent from Table 3.1 are any of the alpha-defensin peptide families (often described as simply "defensins"). There are several models in AMPer for mature peptides of this type including models 53, 98, 105 and 146 as well as subclasses such as cryptdins (model 75). For example, AMPer model 146 is built from a set of 45 alpha-defensin peptides from 42 different Swiss-Prot proteins taken from eight mammalian species. The model matches these 45 peptides with high statistical significance (E-values are all less than 1e-10 with only two greater than 1e-20; see AMPer web site).
However, the most significant match in the bovine EST sequences is to gi|82672759|gb|DV812566.1 with an E-value of 3.6e-4.

The analysis described here tolerates the presence of introns and will combine neighboring regions identified by an HMM to cover the length of the model and report the resulting peptide with a single ID. An example of an AMP containing introns that is correctly identified by AMPer is BD07_BOVIN. BD07_BOVIN contains one intron of 1460 nucleotides (487 amino acids when translated) and is identified from EST sequence by DBEST_EST_292 (model 90). The predicted AMP based on genomic sequence, GENOME_AMP_169, is identical in sequence to BD07_BOVIN but short by 2 amino acids (length of 38 vs 40) and produces an HMM E-value of 4e-23 (see web resources). In contrast, the most significant E-value for (alpha-defensin) model 146 against bovine genomic data is 4e-10 but the coverage of the model is low at only 69% and the predicted AMP sequence lacks the characteristic six-cysteine motif (see supplementary Table 3.5 for predicted AMPs based on genomic data with E-values less than 1e-5).

* * *

Host defense peptides of the innate immune system are important components for control of infection. Historically, host defense peptides have been described as antimicrobial peptides (AMPs); however, the important role of modulation of the innate immune response has come to the fore recently. Natural host defense peptides are considered to be lead compounds in the search for agents that beneficially modulate inflammatory responses both directed against a pathogen and to counter detrimental immune responses such as those involved in sepsis. The importance of these peptides in host defense and as the basis of possible novel therapeutics indicates the need for information about the numbers and types of peptides that are present, in order to gain further understanding of their roles in innate immunity.
In order to identify potentially novel host-defense peptides, we used the hidden Markov models constructed for the AMPer resource to scan bovine expressed sequence tags and genomic sequence. The AMPer models represent groups of mature peptides as well as propeptides that are products of the parental prepropeptides due to processing after protein translation; there are 146 models of mature peptides and 40 models of propeptides representing classes and subclasses of peptides such as defensins and cathelicidins. In this study, we used the models for mature peptides only.

We are primarily concerned with identifying mature antimicrobial peptides for the purpose of structure-activity analysis. Therefore, we primarily relied upon EST sequences since they do not have the added complication of introns in predicted protein sequence. Since the same gene may lead to many ESTs, we sought to identify those unique sequences corresponding to a gene by grouping the predicted peptides based on sequence similarity. We chose a conservative threshold since we are interested in identifying novel AMPs, and are less interested in identifying close homologues of known bovine AMPs; in addition, EST sequencing is a single-pass process with sequencing errors of up to a few percent (Boguski et al. 1993), so true matches are not expected to match perfectly. We considered ESTs whose matched regions were more than 90% identical over the region of the pairwise match to belong to the same host defense peptide. This threshold yielded a total of 278 potential peptides of varying statistical significance. The HMM E-value represents the number of false positive matches expected at a given threshold; using an HMM E-value threshold of 1e-5 (i.e. 1e-5 expected false positives for each of the 146 models) yields a prediction of up to 124 AMPs, including 32 matches to histone (from which the AMP buforin is derived (Kim et al. 2000)).
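The counts quoted above can be checked directly against Table 3.1. The short sketch below re-enters the table rows by hand and tallies them; the variable names are illustrative, and summing the per-model E-value threshold over the 146 models assumes the per-model expectations simply add.

```python
# (AMPer model, peptide families, number of predicted AMPs), copied from Table 3.1.
table_3_1 = [
    (17, "15 kDa protein", 1),
    (66, "Apolipoprotein A-II", 12),
    (106, "Bactenecin", 1),
    (90, "Beta-defensin", 4),
    (145, "Beta-defensin", 7),
    (144, "Beta-defensin, Spheniscin, LAP, TAP", 6),
    (35, "Beta-defensin, Circulin-B", 1),
    (117, "BPI, LBP", 29),
    (116, "Cathelicidin", 1),
    (87, "Cathelin", 9),
    (18, "Cathelin-related", 1),
    (133, "Cysteine-rich antifungal protein", 1),
    (92, "Eosinophil granule major basic protein", 7),
    (13, "Granulysin, NK-lysin", 5),
    (27, "Hemolin", 2),
    (64, "Hepcidin", 1),
    (8, "Histone H1, Uperin", 5),
    (12, "Histone H2A", 19),
    (38, "Histone H2A", 8),
    (24, "Myeloid antibacterial peptide", 2),
    (95, "Penaeidin, Liver-expressed AMP", 1),
    (39, "Sperm-associated antigen 11", 1),
]

total = sum(count for _, _, count in table_3_1)
histone = sum(count for _, family, count in table_3_1 if "Histone" in family)
non_histone = total - histone

# Upper bound on expected false positives across all mature-peptide models.
expected_false_positives = 146 * 1e-5
```

The tally gives 124 predicted AMPs in total, 32 of them histone matches, leaving 92 non-histone predictions, while the summed false-positive expectation across all models is about 0.0015.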
There are 92 non-histone AMPs, a number that is feasible to review manually (Table 3.1). As well, this E-value threshold is large enough that sequences belonging to more distant homologues would not be discarded, but at the risk of including peptides that are only distantly related to AMPs and are not actually AMPs.

To determine which of these predicted AMPs correspond to the known bovine AMPs, we compared the sequences using sequence similarity (blastp (Altschul et al. 1990)) to find predicted peptides from both ESTs and peptides identified by AMPer models. Of the 34 known bovine AMPs (full length proteins, Table 3.2), a total of 27 known bovine AMPs have significant matches to ESTs. As well, 27 known bovine AMPs have significant matches to AMPs predicted by AMPer. The known AMPs with no significant match to ESTs are slightly different from those known AMPs with no significant match to a peptide identified by AMPer. Several known bovine AMPs were not identified in the EST data, presumably because they were not expressed in the tissues that were sampled for mRNA and used to construct the EST libraries. Of the three known AMPs (CALT_BOVIN, CAS2_BOVIN and CCKN_BOVIN) that appear to have been represented in the EST data set but missed by the AMPer search, only CCKN_BOVIN seems to have been missed due to inadequacy of the AMPer model: CALT_BOVIN and CAS2_BOVIN did not contribute mature peptides that were used in constructing AMPer models (for details of the AMPer construction algorithm see (Fjell et al. 2007)). Considering that a total of 95 non-histone AMPs were predicted and up to 27 known AMPs were found to have significant matching ESTs, there are up to 68 potentially novel AMPs identified in the EST set by the AMPer models at the threshold values we used.

We chose two predicted AMPs for follow-up that appear to be novel and belong to the cathelicidin family, a group of peptides of special interest to us.
We chose two ESTs corresponding to these predicted AMPs for RT-PCR analysis of gene transcription as well as changes in gene expression following infection. (Note that since this work began, significantly more bovine sequence has become available and slightly different ESTs might have been chosen based on current data.) We demonstrated that one of these, DBEST_AMP_397, is expressed in response to infection. When compared to all proteins found in Uniprot (both Swiss-Prot and TrEMBL), this predicted peptide is most similar to the '15kDa protein' AMP found in rabbit and is of a class of AMP not previously described for bovine. Since our work began on AMPs in bovine, this peptide (DBEST_AMP_397) has been predicted based on sequencing of cDNA from a thymus sample and submitted to the TrEMBL database of Uniprot as A5PJH7_BOVIN (http://www.expasy.org/uniprot/A5PJH7) by the Mammalian Gene Collection project (http://mgc.nci.nih.gov/). Here, we report that we have independently identified this peptide using the AMPer resource and demonstrated that it is up-regulated in the small intestine in response to infection. We did not find the second predicted AMP we attempted to confirm in the tissues we sampled, and we did not locate the genomic location of its sequence in the current genome assembly. However, the sequence was found in whole genome shotgun sequence that was not incorporated into the current bovine assembly. Since it was originally found in expressed sequence, it appears to be a true gene rather than a pseudogene.

We did not identify any AMP sequences for alpha-defensins in bovine EST sequence, strongly suggesting that alpha-defensins are not present in the EST dataset we used. In addition, when we scanned translated genomic sequence we also did not find evidence for alpha-defensins.
The analysis we performed did account for the presence of introns in constructing AMP predictions: for example, beta-defensins were found reliably despite the presence of intron sequence. Since we cannot attribute the absence of alpha-defensins among the AMPer predictions to any technical deficiency (and additionally we cannot find reference to any bovine alpha-defensins in the literature), we conclude that these results indicate that the bovine genome lacks this important class of host defense peptide. Other mammalian species such as mouse are known to lack neutrophil-derived alpha-defensins (Eisenhauer and Lehrer 1992). Previous reports have speculated that alpha-defensins are found only in the primate and glires (rodents and lagomorphs) lineage (Patil et al. 2004; Xiao et al. 2004; Patil et al. 2005), while more recent reports have identified alpha-defensins in a wide range of diverse mammals such as opossum (Belov et al. 2007), elephant and hedgehog tenrec (Lynn and Bradley 2007), and the horse (Looft et al. 2006), a close evolutionary cousin to bovine. This suggests that the bovine genome has lost alpha-defensins from an ancestor through evolution, rather than being on a lineage where alpha-defensins were never present.

3.3  Conclusions

We have used the HMMs from the AMPer resource to scan the draft bovine genome and bovine expressed sequence tags from the dbEST data set. To additionally describe the peptides, we have identified the most similar known AMP for each predicted peptide. The AMPer models identified 27 of the 34 known bovine antimicrobial peptides. An additional 68 potential peptides were identified that appear to be previously unidentified AMPs, for a total of 102 AMPs. We sought to experimentally verify two of these that belong to the cathelicidin family. One of these, DBEST_AMP_397, was clearly identified in qRT-PCR product and was found to be up-regulated in bovine intestinal tissue following challenge with S. typhimurium.
One other putative AMP (DBEST_AMP_416) was not confirmed in blood mononuclear cells or small intestine. In addition to the identification of unrecognized AMPs, our results suggest that bovine lacks alpha-defensins. The novel antimicrobial peptide, DBEST_AMP_397, was also predicted by the Mammalian Gene Collection project as part of an effort to provide full-length clones to investigators for a limited number of organisms (human, rat, mouse and bovine). This serves to confirm the utility of the AMPer approach to identifying novel AMPs: by examining the large resource of low quality EST sequence, we have identified a novel peptide that was added to the major sequence databases only recently, after a high quality cDNA sequencing project. This suggests that a large number of additional peptides might be identified from publicly available data that will not be added to major databases for some time. These results indicate the effectiveness of in silico screening with software resources such as AMPer that are tailored to specific interests of the community, in this case, investigators examining peptides of the innate immune system. The hidden Markov models used by AMPer are freely available to investigators and straightforward to use (see http://www.cnbi2.com/cgi-bin/amp.pl). Future work on AMPer will include automation of the steps involved in the study described here, and its application to larger numbers of organisms.

3.4  Methods and materials

3.4.1  Set of known antimicrobial peptides

We considered the set of known antimicrobial peptides to be derived from the 1135 proteins in Uniprot identified during construction of the AMPer resource (described previously (Fjell et al. 2007)); these are the 980 protein IDs from AMSdb combined with additional proteins identified by AMPer that were found to have some support for antimicrobial or host defense activity in the literature. These are available at the AMPer web site (http://www.cnbi2.com/cgi-bin/amp.pl).
3.4.2  Creation of AMPer

The AMPer resource has been described previously (Fjell et al. 2007). Briefly, the 980 Uniprot protein IDs from AMSdb were considered to contain all known AMPs. Mature peptides and propeptides were identified from these proteins using Uniprot annotations of peptide positions within the proteins. The peptides were compared to one another based on pairwise sequence similarity and grouped based on this similarity. For each group, a hidden Markov model was created using the HMMER software package. These models were used to iteratively scan Swiss-Prot to identify additional peptides that were not currently identified in the set of AMPs. Uniprot annotations were reviewed for proteins that were identified; where annotations suggested antimicrobial activity, these were added to what was considered the set of known AMPs and used to update the AMPer hidden Markov models. Only the 146 hidden Markov models corresponding to mature peptides were used to search bovine sequence.

3.4.3  Bovine genomic and EST sequences

We present here the results in the context of the current versions of the bovine genome and EST set. The bovine genome was downloaded from ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/fasta/Btau200708xx/LinearScaffolds. Preliminary work used the draft bovine genome sequence obtained from ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/fasta/Btau20050310-freeze/linearScaffolds/. ESTs were obtained from the NCBI dbEST resource, downloaded Aug 25, 2007 from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/est_others.gz. Bovine ESTs (numbering 1,433,737) were identified as those containing the annotation 'Bos taurus cDNA'. Preliminary work used the same resource downloaded October 2006. The EST sequences were translated into predicted protein sequences in all six reading frames using software from the BioJava project (http://www.biojava.org).
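The six-frame translation step described above can be illustrated with a minimal Python sketch. This is not the BioJava code used in this work; the function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical, and the sketch assumes the standard genetic code, writes stop codons as `*`, and writes any codon containing an ambiguous base as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; trailing bases are dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    # Codons containing anything other than A, C, G, T are reported as 'X'.
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Map (strand, offset) to the predicted protein for all six reading frames."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTGA").items():
        print(frame, protein)
```

Keeping stop codons in the translated string, rather than splitting on them, mirrors the `*` characters that appear in the matched sequences reported in Table 3.5.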
3.4.4  Prediction of AMPs in ESTs

Predicted protein sequences from ESTs were scanned using the 146 AMPer models for mature peptides using the HMMER utility, hmmsearch (Durbin et al. 1998). Regions of sequence matched by a model ('predicted peptides') were examined to identify likely AMPs as follows. Predicted peptides that were less than 25% of the model length were excluded from consideration since they were considered to be unlikely candidates as AMPs and more likely to represent conserved protein domains instead. Each matched EST was assigned an identifier of the form DBEST_AMP_n, where n is an integer, since each is interpreted as a predicted AMP. In addition, multiple ESTs may correspond to the same gene product and may differ due to sequencing errors and different lengths of sequencing reads (ESTs are single reads of a cDNA). Therefore, peptide sequences matched by a model were clustered into groups to represent a single predicted AMP based on similarity of the sequences. Specifically, predicted peptides were added to groups where each peptide was at least 90% identical to every other peptide in the group over the length of the peptide (or the smaller peptide if they varied in length). Pairwise blastp comparisons from the BLAST package were used (Altschul et al. 1990). Each group of similar predicted peptides was conservatively considered a single antimicrobial peptide. The longest predicted peptide was taken as the representative of each group of similar predicted peptides.

3.4.5  Prediction of AMPs in genomic sequence

The draft genome sequence of bovine was also scanned with the AMPer models of mature peptides using the HMMER utility, hmmsearch (Durbin et al. 1998), with the total number of sequences specified (using the parameter "-Z 922") to account for matches against the sequence database that spans many files. Genomic sequence contains introns, regions that are removed from the transcript by splicing and hence appear in neither the mature mRNA nor the protein.
However, the predicted protein sequence used for searching included intron sequence; therefore, the protein sequences matched by a model will be fragments of a mature peptide corresponding to exons. To account for intron and exon sequence within the genome, predicted peptides were constructed from multiple matching regions within 1000 amino acid positions of each other that cover the length of the AMPer model. Overlap between regions of matches for different models was not allowed. Predicted antimicrobial peptides based on genomic sequence were identified as GENOME_AMP_n where n is an integer.

3.4.6  Comparison of predicted AMPs to known AMPs

We wished to identify which of the predicted AMPs corresponded to known AMPs. The predicted AMPs were compared to known bovine AMPs by pairwise sequence comparison using the blastp algorithm of the BLAST package (Altschul et al. 1990). Significance of a match was taken as the E-value reported by blastp. Coverage of the two sequences was also calculated to assess the extent of the pairwise match, giving the extent of the matched region in comparison to the length of the known AMP and the AMPer model. Coverage is calculated as the alignment length divided by the maximum possible alignment length (the shorter of the known AMP sequence and the predicted AMP sequence). For each known bovine AMP, the best matching (lowest E-value) AMP predicted from the dbEST data set was identified. A match was considered good if the alignment had at least 95% identity over at least 95% coverage. For each AMP predicted from the dbEST data, the best matching known AMP (of any organism) and best matching known bovine AMP were identified, taking the matches with lowest E-values as the best matches. These are reported on the web pages linked from the summary page at http://www.cnbi2.com/cgi-bin/amp.pl?dbests=hits.
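The coverage calculation and the criterion for a good match described above can be sketched in Python as follows. This is an illustrative sketch only, not the code used in this work: the `BlastHit` record, its field names and the peptide identifiers in the example are hypothetical, and it assumes that the identity fraction, alignment length and E-value have already been parsed from blastp output.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """One pairwise blastp alignment (hypothetical record; fields are assumptions)."""
    known_id: str
    predicted_id: str
    evalue: float
    identity: float          # fraction of identical positions in the alignment, 0..1
    alignment_length: int
    known_length: int        # length of the known AMP sequence
    predicted_length: int    # length of the predicted AMP sequence


def coverage(hit):
    """Alignment length divided by the length of the shorter of the two sequences."""
    return hit.alignment_length / min(hit.known_length, hit.predicted_length)


def is_good_match(hit, min_identity=0.95, min_coverage=0.95):
    """Apply the 95% identity over 95% coverage criterion described in the text."""
    return hit.identity >= min_identity and coverage(hit) >= min_coverage


def best_hit(hits):
    """Return the hit with the lowest E-value, or None if there are no hits."""
    return min(hits, key=lambda h: h.evalue, default=None)


if __name__ == "__main__":
    hits = [
        BlastHit("KNOWN_A", "DBEST_AMP_1", 1e-20, 0.98, 39, 40, 155),
        BlastHit("KNOWN_A", "DBEST_AMP_2", 1e-8, 0.80, 30, 40, 60),
    ]
    top = best_hit(hits)
    print(top.predicted_id, round(coverage(top), 3), is_good_match(top))
```

Because the alignment length includes gap positions, coverage computed this way can slightly exceed 1, as seen for one entry of Table 3.5.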
The on-line tools allow predicted AMPs to be viewed in the context of the multiple alignment (generated by ClustalW, v 1.83, (Thompson et al. 1994)) containing the predicted AMPs of the model, all known AMPs of the model, the HMM consensus sequence for the model, best-matching AMPs to the predicted AMP and any AMPs predicted from the bovine genomic data that have a significant blastp match to the AMP predicted from dbEST data.

3.4.7  Identification of novel AMPs

For each class of AMP, the multiple sequence alignment (generated by ClustalW) was viewed and unique predicted AMPs were identified by eye, by requiring significant differences to be visible in the alignment between the predicted AMP, all other predicted AMPs and the known bovine AMPs. To determine whether the putative novel AMPs had been previously identified, we used the NCBI website (http://www.ncbi.nlm.nih.gov/BLAST/) to search for the sequences in the NCBI nr (non-redundant) databank, which contains all non-redundant GenBank CDS translations, Refseq, PDB, Swiss-Prot, PIR and PRF.

3.4.8  Pairwise comparison of known AMPs to bovine sequence

The set of 1135 known AMPs was used to search for similar sequences in the translated bovine genome and EST sequences using blastp of the BLAST package. For genome scanning, the total number of sequences was corrected using the parameter "-z 922". The most significant matches (lowest E-values) are reported along with coverage calculated as the alignment length divided by the length of the known AMP. Only matches with E-values < 1e-5 were considered, to restrict the matches to close matches and limit the number of results returned.

3.4.9  Analysis of AMP gene expression

Total RNA was extracted from bovine intestinal tissue and bovine peripheral blood mononuclear cells (PBMC) as described previously (Aich et al. 2005) and RNA was isolated using an RNeasy Mini Kit (Qiagen Inc., Ontario, Canada).
The intestinal samples were collected both prior to and 4 hours after challenge with S. typhimurium using the infection model developed by Coombes et al. (Coombes et al. 2005). Isolated RNA samples were eluted and stored in RNase-free water (Ambion Inc., Austin, Texas) at -80 °C until further use. The RNA concentration, integrity and purity were assessed by determining the OD260/280 ratio with a BioPhotometer (Eppendorf, Hamburg, Germany) in addition to analysis on a 1% agarose gel and Bioanalyzer (Agilent, USA). Quantitative real-time PCR (qRT-PCR) was performed using Invitrogen’s SuperScript™ III Platinum two-step qRT-PCR kit with SYBR-Green on the ABI 7300 Real Time PCR System (Applied Biosystems, Foster City, CA) as described previously (Mookherjee et al. 2006). Endogenous housekeeping genes, GAPDH and beta-actin, were used for normalization and determination of fold changes of the respective AMPs using the comparative threshold cycle method (Pfaffl, 2001). The qRT-PCR products were run on a 2% agarose gel to verify the presence of gene products. All primers used for qRT-PCR were designed using Primer3 v.0.3.0 (Rozen and Skaletsky, 2000), except those for beta-actin, which were designed earlier (Whale et al. 2006). The primers are listed in Table 3.4.

Bovine gene    Accession number (a)  Primer direction  Primer sequence (5'-3')
GAPDH          BC102589              Forward           AGATGGTGAAGGTCGGAGTG
GAPDH          BC102589              Reverse           GATCTCGCTCCTGGAAGATG
Beta-Actin     AF191490              Forward           CTAGGCACCAGGGCGTAATG
Beta-Actin     AF191490              Reverse           CCACACGGAGCTCGTTGTAG
DBEST_AMP_397  XM_586989             Forward           TCGTGGTGGAGTTCAAATCA
DBEST_AMP_397  XM_586989             Reverse           GCTTGGAAGGCACTGGTACT
DBEST_AMP_416  BC120477              Forward           GGATTGGTGGAGGAAATCTG
DBEST_AMP_416  BC120477              Reverse           GAATGGGCTGGTGAAACAGT

Table 3.4. Bovine primers used for qRT-PCR.
(a) Accession numbers are from NCBI (http://www.ncbi.nlm.nih.gov).

3.4.10  Informatics

All calculations were performed on a Linux or Mac OS X environment using custom Java, Python, Perl or BASH code.
Data were stored in a MySQL database for manipulation and presentation via Perl CGI scripts on an Apache web server running on a Linux server at http://www.cnbi2.com.

3.5  Acknowledgments

We gratefully acknowledge financial support from the Canadian Institutes for Health Research (CIHR) and from Genome BC for the Pathogenomics of Innate Immunity research program. CDF is supported by a Doctoral Research Award from CIHR. KH received a CIHR postdoctoral fellowship. REWH was the recipient of a Canada Research Chair.

3.6  Web resources

AMPer: http://www.cnbi2.com/cgi-bin/amp.pl
Baylor College of Medicine Human Genome Sequencing Center, bovine genome: http://www.hgsc.bcm.tmc.edu/projects/bovine
NCBI dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
BioJava: http://www.biojava.org

3.7  Supplementary table

Predicted AMP  HMM E-value  Model coverage  Matched sequence  Chromosome  Strand  Position start [na]  Position end [na]
GENOME_AMP_139  4.20E-010  0.69  CKDRESRIGSCFYNGVLLSL  Chr26  +  38412814  38412873
GENOME_AMP_248  6.90E-009  0.55  CFCQFNHCFRGERMFG  ChrUn.51  -  283701  283746
GENOME_AMP_34  2.70E-008  0.55  CFCRARLCFTDEKLYG  Chr13  +  42595672  42595719
GENOME_AMP_102  4.20E-008  0.48  TCRLNDALHPLCPR  Chr22  +  54964399  54964440
GENOME_AMP_220  4.60E-008  0.93  RSPFCSSGSDTGEKRSGSCVRNRLLTHCCS  Chr7  +  56856289  56856378
GENOME_AMP_5  9.10E-008  0.52  GYCELGEMLWNLCPR  Chr1  -  84535783  84535825
GENOME_AMP_11  1.00E-007  0.59  ASGYCTGQHRLHFHCCR  Chr10  +  12224735  12224785
GENOME_AMP_223  1.10E-007  0.48  TCRLPGLRHAMCCR  Chr7  -  47423547  47423586
GENOME_AMP_187  1.30E-007  0.93  CTCQEGACQSPEMRGLCRKSARVWGL  Chr3  +  4341074  4341151
GENOME_AMP_21  1.40E-007  0.66  CFCRWALCLTDPVHSGTCT  Chr11  -  97411077  97411131
GENOME_AMP_36  1.40E-007  0.59  CSCHRPHCGV*EVLSGS  Chr13  -  55411562  55411610
GENOME_AMP_152  1.90E-007  0.45  CFCRIWGCPGGES  Chr27  -  35393152  35393188
GENOME_AMP_200  2.60E-007  0.59  HNGACTHRGEMATLCPR  Chr4  -  31779220  31779268
GENOME_AMP_43  2.80E-007  0.52  G*CIVRRALHPFCCR  Chr14  +  67433469  67433513
GENOME_AMP_74  2.90E-007  0.48  CTCRDAVCAQREKM  Chr19  +  22908125  22908166
GENOME_AMP_204  3.70E-007  0.62  CRCPSLACDTLEVASGMC  Chr5  +  83595081  83595134
GENOME_AMP_136  5.80E-007  0.41  TFNGTFYSLCCS  Chr25  +  35141361  35141396
GENOME_AMP_143  5.80E-007  0.55  SGYCK*N*RIVRLCCG  Chr26  -  13777645  13777690
GENOME_AMP_62  5.90E-007  0.55  NGRCG*NHLLHLLCPR  Chr17  -  6019846  6019891
GENOME_AMP_205  6.40E-007  0.41  EKMGDIYRLCCR  Chr5  -  94990448  94990481
GENOME_AMP_256  6.80E-007  0.62  RMEGFCGLGAVL*AQCCR  ChrX  -  34472939  34472990
GENOME_AMP_22  7.20E-007  0.48  FCIYKDRFHSLCCS  Chr11  -  16621399  16621438
GENOME_AMP_253  8.00E-007  0.45  CGSDGRVYLLCCR  ChrUn.93  +  53978  54016
GENOME_AMP_73  8.40E-007  0.48  TCSLSH*SYVLCCR  Chr19  +  37417075  37417116
GENOME_AMP_234  8.70E-007  0.59  CYCVDTLCALLERQSGA  Chr9  +  29829331  29829381
GENOME_AMP_236  8.70E-007  0.59  CYCVDTLCALLERQSGA  Chr9  -  29905196  29905244
GENOME_AMP_172  8.90E-007  0.45  CRSHHTLSTLCCR  Chr28  +  26472568  26472606
GENOME_AMP_44  9.70E-007  0.31  GTIWPLCCR  Chr14  -  14123849  14123873
GENOME_AMP_188  1.00E-006  0.41  ELGQAIYSLCCR  Chr3  +  83833191  83833226
GENOME_AMP_228  1.00E-006  0.52  GTCFM*SRRESLCCR  Chr8  +  107646187  107646231
GENOME_AMP_231  1.00E-006  0.52  GTCFM*SRRESLCCR  Chr8  -  107680851  107680893
GENOME_AMP_4  1.20E-006  0.66  ETQRGTCFVLQSLAPLCC*  Chr1  +  130678770  130678826
GENOME_AMP_27  1.20E-006  0.34  CYCRIFVCLS  Chr12  -  43568385  43568412
GENOME_AMP_49  1.20E-006  0.66  ENRDGHCASEGLIHPLCCA  Chr15  -  76976736  76976790
GENOME_AMP_61  1.20E-006  0.38  KNHTFYMLCCS  Chr17  -  22463118  22463148
GENOME_AMP_229  1.30E-006  0.48  TCFTNHLLGPLCCR  Chr8  +  61599012  61599053
GENOME_AMP_255  1.30E-006  0.45  CTQSHRLAQLCCR  ChrX  +  23114562  23114600
GENOME_AMP_173  1.40E-006  0.45  CQF*GVMVRLCCR  Chr28  +  12795050  12795088
GENOME_AMP_203  1.40E-006  0.69  CVCRSEICLLRQHIYGSCFL  Chr5  +  1402720  1402779
GENOME_AMP_37  1.80E-006  0.34  GHTLWSLCCR  Chr13  -  51262512  51262539
GENOME_AMP_63  1.80E-006  0.52  CHCKSRGCLRREKVN  Chr18  +  37678594  37678638
GENOME_AMP_65  2.00E-006  0.41  CQCRRPLCPRGE  Chr18  +  5247825  5247860
GENOME_AMP_174  2.00E-006  0.55  FGVCFQGRVHWLCCK  Chr28  -  32934074  32934116
GENOME_AMP_197  2.00E-006  0.41  TKHSRFHRLCCR  Chr4  +  109553651  109553686
GENOME_AMP_257  2.00E-006  0.34  NG*IYILCCR  ChrX  -  14040157  14040184
GENOME_AMP_12  2.20E-006  0.62  CWCWEGGCKRGEHLEGGC  Chr10  -  398902  398953
GENOME_AMP_235  2.20E-006  0.45  CFSSGLIVSLCCR  Chr9  +  46245935  46245973
GENOME_AMP_76  2.30E-006  0.41  EISGLRWYFCCR  Chr19  -  18649083  18649116
GENOME_AMP_42  2.40E-006  0.52  CFC*QPSCKTGESAS  Chr14  +  79033828  79033872
GENOME_AMP_105  2.40E-006  0.41  CFCRHTLCIFGE  Chr22  -  5907881  5907914
GENOME_AMP_86  2.50E-006  0.31  GAFYVLCCR  Chr2  -  10362969  10362993
GENOME_AMP_130  2.60E-006  0.45  CDYGLILYTLCCR  Chr24  -  10203402  10203438
GENOME_AMP_221  2.60E-006  0.31  GRFWRLCCR  Chr7  +  98996633  98996659
GENOME_AMP_87  2.70E-006  0.38  VRHRLHSLCCR  Chr2  -  38157475  38157505
GENOME_AMP_25  2.80E-006  0.62  REHMYGYCNREGLILNLC  Chr12  +  70532686  70532739
GENOME_AMP_84  2.90E-006  0.48  ACEKRRLIYTCCPR  Chr2  +  114381586  114381627
GENOME_AMP_85  2.90E-006  0.41  FYKHSFHRLCCR  Chr2  +  126851343  126851378
GENOME_AMP_237  2.90E-006  0.45  CSCREFVCVFGES  Chr9  -  45591135  45591171
GENOME_AMP_38  3.30E-006  0.41  ELNGRTHSRCCR  Chr13  -  77632369  77632402
GENOME_AMP_67  3.30E-006  0.41  ELQH*LYTRCCR  Chr18  -  55288966  55288999
GENOME_AMP_75  3.40E-006  0.62  RRRRCPPIEKVIGVCKLG  Chr19  +  55345221  55345274
GENOME_AMP_178  3.50E-006  0.48  FCFVNRFIYTLCCA  Chr29  -  38808844  38808883
GENOME_AMP_222  3.60E-006  0.52  CFCHSPSCGSGEAAS  Chr7  +  35208018  35208062
GENOME_AMP_97  3.70E-006  0.34  GILIYPLCCR  Chr21  -  13819732  13819759
GENOME_AMP_198  3.70E-006  0.38  VEIRVYVLCCR  Chr4  +  98755398  98755430
GENOME_AMP_104  4.20E-006  0.41  ELQELLWRLCCR  Chr22  +  5571048  5571083
GENOME_AMP_254  4.30E-006  0.52  GSCRLSHQVARLCCL  ChrX  +  9279319  9279363
GENOME_AMP_121  4.60E-006  0.86  GHCWPRAESRGACSTAGTLWSLCCM  Chr23  -  5851633  5851705
GENOME_AMP_93  4.70E-006  0.48  VCTLGNSIYMICPR  Chr20  +  43613148  43613189
GENOME_AMP_196  4.70E-006  0.55  CRCWSRGCVALEQL*G  Chr4  +  1468702  1468749
GENOME_AMP_214  5.00E-006  0.41  TMNILVYALCCR  Chr6  -  84116009  84116042
GENOME_AMP_177  5.10E-006  0.62  CTCKTSREKSIERWYGFC  Chr29  -  520862  520913
GENOME_AMP_60  5.40E-006  0.52  ACRGPACASGEQLS  Chr17  +  67973195  67973236
GENOME_AMP_251  5.60E-006  0.41  CRCRKPICGHGE  ChrUn.89  +  394598  394633
GENOME_AMP_224  5.80E-006  0.41  EYNEVVWPLCCR  Chr7  -  20126494  20126527
GENOME_AMP_35  6.20E-006  0.48  SCLKNGRR**LCCS  Chr13  +  45930102  45930143
GENOME_AMP_141  6.20E-006  0.41  CVCRRTLCVTLE  Chr26  -  37906649  37906682
GENOME_AMP_142  6.20E-006  0.41  CVCRRTLCVTLE  Chr26  -  37948377  37948410
GENOME_AMP_20  6.40E-006  0.38  CYCRKVVCLQG  Chr11  +  95105110  95105142
GENOME_AMP_26  6.40E-006  0.38  INGDIYSICCR  Chr12  +  59338218  59338250
GENOME_AMP_135  6.80E-006  0.79  CFCRRS*ECLFSEPRIGLCGVSPR  Chr25  +  38183668  38183739
GENOME_AMP_140  7.10E-006  0.59  CLCRTIFCTSGEKPLGS  Chr26  +  14708927  14708977
GENOME_AMP_106  7.70E-006  0.59  TCRRGSCLEGEEVLGV  Chr22  -  56755276  56755321
GENOME_AMP_239  7.80E-006  0.38  CACRTPSCLGG  ChrUn.110  +  494582  494614
GENOME_AMP_213  8.20E-006  0.45  CDI*ERIV*LCCR  Chr6  +  77956896  77956934
GENOME_AMP_199  8.30E-006  0.69  CLCRIQRCQRLGPARGVCRL  Chr4  -  112261775  112261832
GENOME_AMP_64  8.40E-006  0.41  ELQH*LYARCCR  Chr18  +  55762646  55762681
GENOME_AMP_66  8.40E-006  0.41  ELQH*LYARCCR  Chr18  -  55571750  55571783
GENOME_AMP_129  8.40E-006  0.66  ERVLGSCF*NITM*P*CCL  Chr24  -  28424783  28424837
GENOME_AMP_48  8.70E-006  0.38  CFCRIPLCDPL  Chr15  +  73183693  73183725
GENOME_AMP_151  9.00E-006  0.41  CRCRQPACGFSE  Chr27  +  5037651  5037686
GENOME_AMP_175  9.10E-006  1.03  FCRSF*CQT*ENPSGFLHLLLHTICCD  Chr28  -  29239411  29239489
GENOME_AMP_3  9.40E-006  0.79  CVCRHRA*VPLESPKGSCLLGGL  Chr1  +  137624732  137624800
GENOME_AMP_96  9.60E-006  0.41  CQCRWRRCKSRE  Chr21  +  22225095  22225130
GENOME_AMP_238  9.70E-006  0.45  CRLLIVMASLCCR  Chr9  -  13006906  13006942
GENOME_AMP_230  9.80E-006  0.45  CGFSGLTWLLCCR  Chr8  -  15419900  15419936
GENOME_AMP_250  9.80E-006  0.34  CYCRISVCKT  ChrUn.7  -  1270680  1270707
GENOME_AMP_241  9.90E-006  0.45  CDS*GRIYTCCCK  ChrUn.26  -  249417  249453

Table 3.5. Most significant matches of AMPer model 146 to bovine genome sequence.

3.8  References

Aich, P., Wilson, H. L., Rawlyk, N. A., Jalal, S., Kaushik, R. S., Begg, A. A., Potter, A. A., Babiuk, L. A., Abrahamsen, M. S. and Griebel, P. J. (2005). Microarray analysis of gene expression following preparation of sterile intestinal loops in calves. Can. J. Anim. Sci., 85: 13-22.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215: 403-410.
Bechinger, B. (1997). Structure and function of channel-forming peptides: magainins, cecropins, melittin and alamethicin. J Membrane Biol., 156: 197-211.
Bechinger, B. (1999). The structure, dynamics and orientation of antimicrobial peptides in membranes by multidimensional solid-state NMR spectroscopy. Biochim. Biophys. Acta, 1462: 157-183.
Belov K, Sanderson CE, Deakin JE, Wong ES, Assange D, McColl KA, Gout A, de Bono B, Barrow AD, Speed TP, Trowsdale J and Papenfuss AT (2007). Characterization of the opossum immune genome provides insights into the evolution of the mammalian immune system. Genome Res., 17: 982-991.
Blondelle, S. E., Lohner, K. and Aguilar, M. I. (1999). Lipid-induced conformation and lipid-binding properties of cytolytic and antimicrobial peptides: determination and biological specificity. Biochim. Biophys. Acta, 1462: 89-108.
Boguski, M. S., Lowe, T. M. J. and Tolstoshev, C. M. (1993). dbEST - database for expressed sequence tags. Nature Genetics, 4: 332-333.
Bowdish, D. M., Davidson, D. J. and Hancock, R. E. (2005). A re-evaluation of the role of host defence peptides in mammalian immunity. Curr. Protein Pept. Sci., 6: 35-51.
Brogden, K. A. (2005).
Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol., 3: 238-250.
Coombes, B. K., Coburn, B. A., Potter, A. A., Gomis, S., Mirakhur, K., Li, Y. and Finlay, B. B. (2005). Analysis of the contribution of Salmonella pathogenicity islands 1 and 2 to enteric disease progression using a novel bovine ileal loop model and a murine model of infectious enterocolitis. Infect. Immun., 73: 7161-7169.
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK, Cambridge University Press.
Eisenhauer, P. B. and Lehrer, R. I. (1992). Mouse neutrophils lack defensins. Infect. Immun., 60: 3446-3447.
Epand, R. M. and Vogel, H. J. (1999). Diversity of antimicrobial peptides and their mechanisms of action. Biochim. Biophys. Acta, 1462: 11-28.
Fjell, C. D., Hancock, R. E. and Cherkasov, A. (2007). AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 23: 1148-1155.
Hancock, R. E. (2003). Concerns regarding resistance to self-proteins. Microbiology, 149: 3343-3344.
Hancock, R. E. and Chapple, D. S. (1999). Peptide Antibiotics. Antimicrob. Agents Chemother., 43: 1317-1323.
Hancock, R. E. and Lehrer, R. (1998). Cationic peptides: a new source of antibiotics. Trends Biotechnol., 16: 82-88.
Jenssen, H., Hamill, P. and Hancock, R. E. W. (2006). Peptide Antimicrobial Agents. Clinical Microbiol. Rev., 19(3): 491-511.
Khush, R. S., Leulier, F. and Lemaitre, B. (2001). Drosophila immunity: two paths to NF-kappaB. Trends Immunol., 22: 260-264.
Kim, H. S., Yoon, H., Minn, I., Park, C. B., Lee, W. T., Zasloff, M. and Kim, S. C. (2000). Pepsin-Mediated Processing of the Cytoplasmic Histone H2A to Strong Antimicrobial Peptide Buforin I. J. Immunol., 165: 3268-3274.
Levy, O., Weiss, J., Zarember, K., Ooi, C. E. and Elsbach, P. (1993).
Antibacterial 15-kDa protein isoforms (p15s) are members of a novel family of leukocyte proteins. J. Biol. Chem., 268: 6058-6063.
Looft C, Paul S, Philipp U, Regenhard P, Kuiper H, Distl O, Chowdhary BP and Leeb T (2006). Sequence analysis of a 212 kb defensin gene cluster on ECA 27q17. Gene, 376(2): 192-8.
Lynn DJ and Bradley DG (2007). Discovery of alpha-defensins in basal mammals. Dev. Comp. Immunol., 31(10): 963-7.
Marshall, S. H. and Arenas, G. (2003). Antimicrobial peptides: A natural alternative to chemical antibiotics and a potential for applied biotechnology. Electron J. Biotech., 6: 271-284.
Mookherjee, N. and Hancock, R. E. (2007). Cationic host defence peptides: innate immune regulatory peptides as a novel approach for treating infections. Cell Mol Life Sci., 64: 922-933.
Mookherjee, N., Wilson, H. L., Doria, S., Popowych, Y., Falsafi, R., Yu, J. J., Li, Y., Veatch, S., Roche, F. M., Brown, K. L., Brinkman, F. S., Hokamp, K., Potter, A., Babiuk, L. A., Griebel, P. J. and Hancock, R. E. (2006). Bovine and human cathelicidin cationic host defense peptides similarly suppress transcriptional responses to bacterial lipopolysaccharide. J. Leukoc. Biol., 80: 1563-1574.
Patil A, Hughes AL and Zhang G (2004). Rapid evolution and diversification of mammalian alpha-defensins as revealed by comparative analysis of rodent and primate genes. Physiol. Genomics, 20(1): 1-11.
Patil AA, Cai Y, Sang Y, Blecha F and Zhang G (2005). Cross-species analysis of the mammalian beta-defensin gene family: presence of syntenic gene clusters and preferential expression in the male reproductive tract. Physiol. Genomics, 23: 5-17.
Pfaffl, M. W. (2001). A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res., 29: e45.
Rozen, S. and Skaletsky, H. J. (2000). Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology. S. Krawetz and S. Misener. Totowa, NJ, Humana Press: 365-386.
Scheetz, T., Bartlett, J. A., Walters, J. D., Schutte, B. C., Casavant, T. L. and McCray, P. B. J. (2002). Genomics-based approaches to gene discovery in innate immunity. Immunol. Rev., 190: 137-145.
Schutte, B. C., Mitros, J. P., Bartlett, J. A., Walters, J. D., Jia, H. P., Welsh, M. J., Casavant, T. L. and McCray, P. B. (2002). Discovery of five conserved beta-defensin gene clusters using a computational search strategy. PNAS, 99: 2129-2133.
Shai, Y. (1999). Mechanism of the binding, insertion and destabilization of phospholipid bilayer membranes by α-helical antimicrobial and cell non-selective membrane-lytic peptides. Biochim. Biophys. Acta, 1462: 55-70.
Sima, P., Trebichavsky, I. and Sigler, K. (2003). Mammalian antibiotic peptides. Folia Microbiol., 48: 123-137.
Sima, P., Trebichavsky, I. and Sigler, K. (2003). Non-mammalian vertebrate antibiotic peptides. Folia Microbiol., 48: 709-724.
Simmaco, M., Mignogna, G. and Barra, D. (1998). Antimicrobial peptides from amphibian skin: what do they tell us? Biopolymers, 47: 435-450.
Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. and Durbin, R. (1998). Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucl. Acids Res., 26: 320-322.
Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22: 4673-4680.
Whale, T. A., Wilson, H. L., Tikoo, S. K., Babiuk, L. A. and Griebel, P. J. (2006). Pivotal Advance: Passively acquired membrane proteins alter the functional capacity of bovine polymorphonuclear cells. J. Leukocyte Biology, 80: 481-491.
Xiao, Y., Hughes, A. L., Ando, J., Matsuda, Y., Cheng, J.-F., Skinner-Noble, D. and Zhang, G. (2004). A genome-wide screen identifies a single beta-defensin gene cluster in the chicken: implications for the origin and evolution of mammalian defensins.
BMC Genomics, 5: 56.

Chapter 4: Identification of antibacterial peptides by chemoinformatics and machine learning

A version of this chapter has been submitted as: Fjell, C.D., Jenssen, H., Hilpert, K., Cheung, W.A., Panté, N., Hancock, R.E.W., and Cherkasov, A. Identification of Novel Antibacterial Peptides by Chemoinformatics and Machine Learning

Some material from this chapter has been accepted for publication in: Cherkasov, A., Hilpert, K., Jenssen, H., Fjell, C.D., Waldbrook, M., Mullaly, S.C., Volkmer, R., and Hancock, R.E.W. Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic resistant Superbugs. ACS Chemical Biology

Introduction

Short cationic, amphipathic peptides possessing antimicrobial activity are present throughout the kingdoms of life. In the face of increasing antibiotic resistance in pathogenic microorganisms, short cationic peptides have drawn significant attention as a possible source of novel antibacterial agents (Hamilton-Miller, 2004; Levy and Marshall, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004; Hancock and Sahl, 2006). Although antimicrobial peptides generally exhibit lower potency against susceptible bacterial targets compared to conventional low-molecular-weight antibiotic compounds, they hold several compensatory advantages including fast killing, broad range of activity, low toxicity and minimal development of resistance in target organisms (Hancock and Sahl, 2006; Jenssen et al., 2006). The use of quantitative structure-activity relationships (QSAR) to predict antibacterial activity of peptides is a relatively recent development. QSAR analysis seeks to relate quantitative properties of a compound (known as descriptors) with other properties such as drug-like activity or toxicity.
QSAR relies on quantities that can be conveniently measured or calculated to predict in a non-trivial way other properties of interest such as antibacterial activity. QSAR has become an integral part of screening programs in pharmaceutical drug discovery pipelines of small compounds and more recently in toxicological studies (Perkins et al., 2003). There are two aspects to QSAR analysis: choice of the set of descriptors and choice of statistical learning technique. Previous QSAR analysis of antimicrobial peptides has been limited to comparisons between peptides that differ in only a small number of amino acids, for example, derivatives of lactoferricin (Lejon et al., 2001; Strom et al., 2001; Lejon et al., 2004; Jenssen et al., 2005) and protegrin and similar de novo peptides (Frecer et al., 2004; Frecer, 2006; Ostberg and Kaznessis, 2004). These QSAR studies have mainly utilized descriptors that are designed to model differences in properties of similar peptides, as in the lactoferricin studies, or descriptors such as charge, amphipathicity and lipophilicity whose relationship to activity has been demonstrated empirically from amino acid substitution studies (Frecer et al., 2004). Where larger sets of QSAR descriptors have been used, for example for protegrin and analogues (Frecer, 2006; Ostberg and Kaznessis, 2004), the models have been limited to linear models, resulting in only moderate predictive ability. We decided to perform QSAR analysis on AMPs using a more intensive QSAR methodology that utilizes atomic-scale molecular information, recently developed and applied to small molecules.
These ‘inductive’ QSAR descriptors (reviewed in Cherkasov, 2005a) have been successfully applied in a number of molecular modelling studies, including identification of antibacterial activity of small compounds (Cherkasov, 2005b) and classification of antimicrobial compounds, conventional drugs and drug-like substances with up to 97% accuracy on an extensive set of over 2500 chemical structures (Karakoc et al., 2006a). These studies have relied on modelling techniques of greater complexity than those previously applied to antimicrobial peptides. In particular, classification studies have compared artificial neural networks (ANNs), k-nearest neighbors, linear discriminant analysis and multiple linear regression, and found that ANNs generally give more accurate classification, followed closely by k-nearest neighbors methods (Karakoc et al., 2006b). These higher-complexity models use a larger number of parameters and therefore require greater amounts of data. These data were available from the recently developed high-throughput method for screening large numbers of peptides for antibacterial activity (Hilpert et al., 2005). This method uses peptides synthesized on a cellulose support for rapid creation of peptides that are not limited in sequence diversity. The peptides are assayed for antimicrobial activity using a strain of Pseudomonas aeruginosa engineered to luminesce constitutively via a luciferase cassette insertion. By measuring the decrease in luminescence due to killing of the bacteria, a large number of peptides can be screened for antibacterial activity in an automated manner. In the current work, we apply for the first time atomic-resolution QSAR methods combined with complex, non-linear modelling to accurately predict the antibacterial activity of short cationic peptides with high sequence diversity.
By combining high-throughput generation of synthetic peptides with a high-throughput antibacterial assay, we were able to apply these methods to a larger data set of peptides than has been used to date. We demonstrate that this combination of experimental procedure and QSAR analysis provides dramatic improvement in prediction of diverse antibacterial peptides. With the methods we describe here, we have performed an efficient, large-scale in silico screening for antibacterial peptides that has yielded several potential drug leads.

4.1  Results and discussion
The overall process we used for QSAR modelling of antimicrobial peptides is shown in Figure 4.1. The starting point was a set of random peptides with measured activity. For each of these peptides the 3D structure was estimated and used to calculate QSAR descriptors. Models for peptide activity were built using artificial neural networks based on these descriptors and the known levels of activity. These models were then used to computationally assess a much larger set of virtual peptides for predicted activity. The accuracy of the predictions was independently assessed by synthesizing and testing many peptides with various levels of predicted activity.

Figure 4.1. General workflow for QSAR modelling of antimicrobial peptides.

4.1.1  Effect of control antibacterial peptide on bacteria
The effect of treating P. aeruginosa with the active control peptide Bac2A is shown in transmission electron micrographs (TEMs) of thin sections of Pseudomonas aeruginosa (Figure 4.2). These electron micrographs show that Bac2A has a dramatic effect on the morphology of the bacterial cell wall. While the cell wall of control untreated bacteria appears smooth and linear (see Figure 4.2A), the Bac2A-treated bacteria have cell walls that are severely damaged and contain numerous blebs (Figure 4.2B), a well-known phenomenon observed when bacterial cells are exposed to cationic peptides (Sawyer et al., 1988).
In addition, the space between the cell wall and plasma membrane appears swollen. The blebs of the cell wall are better appreciated when the surface of Bac2A-treated bacteria is visualized by scanning electron microscopy (SEM) (Figure 4.3). As illustrated in Figure 4.4, Bac2A causes damage to the cell wall of Pseudomonas aeruginosa in a time- and concentration-dependent manner.

Figure 4.2. Transmission electron micrographs of cross-sections of Pseudomonas aeruginosa. Micrographs are shown for control untreated (A) and Bac2A-treated (B). Bac2A concentration was at the MIC. Bacteria were incubated with Bac2A for one hour at 37 °C before fixation and preparation for embedding/thin section TEM. Scale bar is 100 nm.

Figure 4.3. SEM micrographs of Pseudomonas aeruginosa. Micrographs are shown for control untreated (A) and Bac2A-treated (B, C). Bac2A was at a concentration equal to the MIC. Bacteria were incubated with Bac2A for one hour at 37 °C before fixation and preparation for SEM. Scale bars are 500 nm for A and B, and 100 nm for C.

Figure 4.4. Electron micrographs of cross-sections of Pseudomonas aeruginosa. Micrographs are shown for control untreated or treated with Bac2A at the concentration and time indicated. Scale bar is 100 nm.

4.1.2  Peptide data sets for model training
Two initial sets of synthetic peptides, each nine amino acids in length, were assayed for antibacterial activity. Set A consists of 933 peptides; Set B consists of 500 peptides. The primary sequences of Set A were chosen with a bias towards the amino acid proportions of peptides found to have antibacterial activity in our previous studies (Hilpert et al., 2005; Hilpert et al., 2006). Subsequently, Set B peptides were designed with amino acid compositions adjusted using the initial peptide population plus the Set A peptides, as shown in Figure 4.5. In both sets, there were no constraints on the amino acid proportions found within any particular peptide.
The two sets were prepared in succession by synthesis on a cellulose support and assayed for activity against P. aeruginosa using a luciferase reporter assay as described previously (Hilpert et al., 2005).

[Figure 4.5: bar chart; x-axis: amino acid (A R N D C Q E G H I L K M F P S T W Y V); y-axis: amino acid fraction (0 to 0.5); series: Set A, Set B, Q1, Q2, Q3, Q4]

Figure 4.5. Distribution of amino acids in training and test sets. The quartiles of the activity for the test peptides are indicated as Q1 to Q4.

4.1.3  Calculation of peptide activity
Peptide antibacterial activity was measured using the luminescence assay, which assesses the loss of energy generation capacity; for antimicrobial peptides this loss has been shown to reflect lethality proportionately, as previously described (Hilpert et al., 2005; Hilpert et al., 2006). Briefly, peptides were assayed in a dilution series in sets of 10 peptides with one control peptide, Bac2A, per series. Luminescence values for the experimental peptides were fit to a function describing the expected profile of luminescence for a dilution series (Figure 4.6). The relative IC50 (Rel.IC50) value of each experimental peptide was calculated as the ratio of its IC50 to the IC50 of the control peptide Bac2A. The fit of the experimental luminescence values was generally good except for peptides of very low activity, where the plateau at low luminescence (high concentration) is not present. For this reason, peptides were identified as inactive where the luminescence at the highest concentration of peptide was greater than 50% of the luminescence at the lowest concentration; for these peptides, the Rel.IC50 was set to 25 (the approximate lower limit of activity that can be observed).
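The calculation described above can be illustrated with a minimal Python sketch. It assumes the sigmoid form given in section 4.3.4, L = Lmax / (1 + exp(-2S(x - x1/2))), with two-fold dilution steps x, and it assumes the peptide and the control start from the same undiluted concentration so that the initial concentration cancels in the ratio. The grid-search fit is a simple stand-in chosen for brevity, not the custom C routine used in this work, and the function names are hypothetical.

```python
import math

INACTIVE_REL_IC50 = 25.0  # value assigned in the text to inactive peptides


def sigmoid(x, l_max, s, x_half):
    """Luminescence at dilution step x (0 = undiluted, most concentrated)."""
    return l_max / (1.0 + math.exp(-2.0 * s * (x - x_half)))


def fit_x_half(lum):
    """Crude least-squares grid search for x_half (illustrative only).

    `lum` holds luminescence at dilution steps 0, 1, 2, ...; values are
    normalized to the most dilute point and L_max is fixed at 1.0
    (a simplification: the original fit also estimated the plateau height).
    """
    norm = [v / lum[-1] for v in lum]
    best_err, best_x_half = float("inf"), None
    for s10 in range(1, 51):          # slope S from 0.1 to 5.0
        for k in range(-40, 181):     # x_half from -2.00 to 9.00 in steps of 0.05
            s, x_half = s10 / 10.0, k / 20.0
            err = sum((sigmoid(x, 1.0, s, x_half) - v) ** 2
                      for x, v in enumerate(norm))
            if err < best_err:
                best_err, best_x_half = err, x_half
    return best_x_half


def rel_ic50(peptide_lum, control_lum):
    """Ratio IC50(peptide) / IC50(control), with IC50 = C0 * 2**(-x_half)."""
    # Inactive rule: luminescence at the highest concentration exceeds 50%
    # of the luminescence at the lowest concentration.
    if peptide_lum[0] > 0.5 * peptide_lum[-1]:
        return INACTIVE_REL_IC50
    return 2.0 ** (fit_x_half(control_lum) - fit_x_half(peptide_lum))
```

A peptide whose curve reaches half-maximal luminescence two dilution steps later than the control needs one quarter of the concentration for the same effect, so its Rel.IC50 is close to 0.25.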
The activity of the two sets is shown in Table 4.1 (Set A and Set B rows), classified into higher activity (Rel.IC50 below 0.5, i.e. an IC50 less than 50% of that of the control peptide, Bac2A), similar activity (Rel.IC50 between 0.5 and 1.5) and lower activity (Rel.IC50 above 1.5).

Figure 4.6. Luminescence profile of a dilution series for three peptides. The luminescence for three peptides having high, medium (control peptide) and low activity is shown. Luminescence and concentration were scaled to a maximum of 1.0. The point where the horizontal line at luminescence of 0.5 crosses each fitted curve indicates the relative IC50 value for that peptide.

4.1.4  QSAR descriptors and model building
A large number of QSAR descriptors are available to describe the physical chemistry of compounds. A total of 77 descriptors were calculated here for each peptide in the two sets of training peptides. Some descriptor values were found to be highly correlated with each other, which led to problems in modelling; therefore a set of 44 descriptors was chosen such that no selected descriptor had an absolute correlation above 0.95 with any other selected descriptor. All descriptors are shown in supplementary Table 4.4; those used for modelling are indicated. We used artificial neural networks (ANNs) (see Figure 4.7) to model antibacterial activity since this approach has already been successfully demonstrated for small molecules (for example, Karakoc et al., 2006a). Neural networks typically rank highly among machine learning techniques in predictive performance and, in addition, they are relatively insensitive to the presence of noise and correlated inputs. We used a network configuration with 44 input nodes (one for each descriptor), one hidden layer of 10 nodes and one output node. A variety of other network configurations were also evaluated and showed no improvement in performance (data not shown).
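As an illustration of the network configuration just described, the following Python sketch evaluates a 44-10-1 feed-forward network on one peptide. The logistic transfer function, the bias terms and the random placeholder weights are assumptions made for the example; the trained weights and the exact transfer function used in this work are not reproduced here, and the function names are hypothetical.

```python
import math
import random

N_INPUT, N_HIDDEN = 44, 10  # layer sizes stated in the text


def logistic(z):
    """Squash a weighted sum into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_network(seed=0):
    """Random, untrained weights: placeholders for illustration only.

    Each weight vector carries one extra trailing element used as a bias
    (an assumption; the text does not state whether biases were used).
    """
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUT + 1)]
              for _ in range(N_HIDDEN)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN + 1)]
    return hidden, output


def forward(descriptors, network):
    """Return the network output for one peptide's normalized descriptors."""
    hidden_w, output_w = network
    if len(descriptors) != N_INPUT:
        raise ValueError("expected %d descriptor values" % N_INPUT)
    # zip() stops after the 44 inputs, so the trailing bias is added separately.
    hidden_out = [
        logistic(w[-1] + sum(wi * xi for wi, xi in zip(w, descriptors)))
        for w in hidden_w
    ]
    return logistic(
        output_w[-1] + sum(wi * hi for wi, hi in zip(output_w, hidden_out))
    )
```

With training labels of 1 (active) and 0 (inactive), an output closer to 1 would be read as a stronger prediction of activity.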
Figure 4.7. Structure of an artificial neural network. The network consists of three layers: the input layer, hidden layer and output layer. The input nodes take the values of the normalized QSAR descriptors. Each node in the hidden layer takes the weighted sum of the input nodes (represented as lines), and transforms the sum into an output value. The output node takes the weighted sum of these hidden node values and transforms the sum into an output value between 0 and 1.

4.1.5  Validation of model performance
We assessed the ability of the ANN models to predict antibacterial activity by first classifying the top 5% of the Set A and B peptides as active according to their Rel.IC50 values; this corresponds to an approximate Rel.IC50 threshold of 0.6 (0.56 for Set A and 0.61 for Set B). A ten-fold cross-validation was performed as described below, with 90% of the data allocated to training and 10% to validation (i.e. reserving a different 10% for each of the 10 validation sets). Set A and Set B were synthesized and assayed at different times, and we observed some systematic differences in the luminescence results for peptides of very low and very high activity. Therefore, we treated Set A and Set B separately, along with an additional pooled set, Set A+B. The performance of the three models was assessed using receiver operating characteristic (ROC) curves (Figure 4.8) and the area under the ROC curve (AROC). AROC values approaching one indicate an increasing ability to classify data accurately; AROC values close to 0.5 indicate a poor ability to classify. The average AROC values for Set A, Set B and the combined Set A+B were (mean ± standard deviation, SD) 0.87 ± 0.10, 0.83 ± 0.12 and 0.80 ± 0.09, respectively. These data show that the cross-validated performance of the models in predicting peptide activity was good.
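The AROC statistic used above can be computed directly from model outputs without drawing the curve. The sketch below uses the standard rank interpretation of the area under the ROC curve: the probability that a randomly chosen active peptide scores higher than a randomly chosen inactive one, with ties counted as one half. It is an illustrative Python example with hypothetical function names, not the R code used in this work.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = active, 0 = inactive)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive peptide")
    # Count active/inactive pairs ranked correctly; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_aucs):
    """Mean and sample standard deviation across cross-validation folds."""
    return mean(fold_aucs), stdev(fold_aucs)
```

Applying `roc_auc` to the validation peptides of each of the ten folds and passing the ten values to `summarize_folds` gives a mean ± SD summary of the kind reported above.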
We integrated the large number of models generated during the cross-validation in a consensus approach to allow a combined, single prediction for a given peptide. We did this using a "voting" system in which each of the thirty models (ten each for Set A, Set B, and the combined Set A+B) was used to evaluate a test peptide.

Figure 4.8. The receiver operating characteristic curves for the three data sets. The average ROC curve was calculated from the validation data for the 10 ROC curves from the cross-validation of each data set.

4.1.6  Independent model testing
To perform an independent assessment of this approach to identifying highly active antibacterial peptides, we created a random set of approximately 100,000 peptides as an independent test set, using the same global amino acid proportions as Set B (Figure 4.5). When we calculated the 44 QSAR descriptors for each peptide, a modest number of peptides fell more than 15% outside the range of descriptor values encountered in Sets A and B; these were not considered further, since such extrapolation is believed to make the models less reliable. This left a total of 99,577 test peptides. Each of these peptides was ranked numerically using a voting system as described below. Since these models were built to classify peptides as active or inactive, rather than to predict actual activity levels, the ranked list of test peptides indicated the likelihood that a peptide is highly active. To independently evaluate these predictions of peptide activity, we selected and synthesized a total of 200 candidate peptides comprising sets of fifty candidate peptides at four positions in the ranking. Quartile 1 (Q1) peptides were ranked in the top-most 50 positions and considered the most likely to be more active than control. Quartile 2 (Q2) peptides were ranked at the start of the 2nd quartile, positions 24,895 to 24,944, and thus considered likely to be more active than control.
Quartile 3 (Q3) peptides were ranked at the end of the 3rd quartile, positions 74,633 to 74,682, and considered likely to be less active than control. Quartile 4 (Q4) peptides were ranked at the end of the 4th quartile, positions 99,528 to 99,577, and considered to be most likely to be less active than control. These two hundred predicted peptides were synthesized and assayed for activity using the luminescence assay. As summarized in Table 4.1, the activity was predicted very accurately by the system. Of the fifty peptides in the most likely active set (Q1), 94% were found to be more active than control. Of the set considered less likely to be active (Q2), 64% were better than control. Of the peptides predicted to be much less active (Q3), 88% had lower activity than control. In the set considered least likely to be active (Q4), all (100%) were less active than control. All two hundred candidate peptides are shown in supplementary Table 4.5 along with the rank, cumulative vote, experimentally determined relative IC50 values, and selected physical properties (charge, hydrophobic fraction and hydrophobic moment).

            Rel.IC50
Data set    Higher Activity (<0.5)    Similar Activity (0.5-1.5)    Lower Activity (>1.5)    Median
Set A       35 (3.8%)                 210 (22.5%)                   688 (73.7%)              2.12
Set B       14 (2.8%)                 114 (22.8%)                   372 (74.4%)              3.33
Q1          47 (94%)                  2 (4%)                        1 (2%)                   0.23
Q2          32 (64%)                  15 (30%)                      4 (8%)                   0.35
Q3          1 (2%)                    5 (10%)                       44 (88%)                 4.38
Q4          0 (0%)                    0 (0%)                        50 (100%)                8.34

Table 4.1. Activities of peptides from training sets and quartiles in the 100,000 test set. Numbers of peptides with various levels of antibacterial activity are shown. Q1: top of 1st quartile; Q2: top of 2nd quartile; Q3: bottom of 3rd quartile; Q4: bottom of 4th quartile. Rel.IC50 is the relative IC50, the ratio of the IC50 for the experimental peptide to the IC50 of Bac2A.
Interestingly, despite the very large difference in predicted activities, the peptides in each quartile had rather similar bulk physical properties (charge, hydrophobicity, hydrophobic moment) as shown in Figure 4.9, indicating the importance of using a broad variety of descriptors in neural network modelling. Ten peptides from each quartile are shown in Table 4.2 for discussion. Consistent with the bulk features of the entire library of sequences, the charge and hydrophobicity of these peptides showed a large degree of overlap for most quartiles. Only some of the peptides from Q4 showed a noticeable difference in these physical properties, specifically a lower charge and hydrophobicity. The importance of charge, hydrophobicity and amphipathicity for antibacterial activity of peptides is well known (Jenssen et al., 2006; Yeaman and Yount, 2003). However, in these groups of peptides there was a clear difference in charge and hydrophobicity only between the most active and least active sets (Q1 and Q4), while the differences in activity across all quartiles were quite dramatic. A graphic example that these properties are by themselves insufficient to make predictions can be observed by comparing peptides 10 and 74,675, which have very similar values for charge (+4), hydrophobicity (0.44-0.56), and hydrophobic moment (a measure of amphipathicity; 4.2-4.65) but have relative IC50s that differ more than 100-fold (0.04 and 7.1). This demonstrates that the success in predictions is not based on identifying potent peptides using previously known characteristics.

Figure 4.9. Activity and properties of training and test peptides. Peptide antibacterial activity and physical properties are shown. For Rel.IC50 values, these are medians with error bars indicating interquartile range. For all others, these are means with error bars indicating SEM.
Top left: median values of Rel.IC50 from the training sets A and B and the corresponding median values for 200 experimentally tested peptides separated into activity quartiles, Q1 to Q4. Top right: median values of formal charge; bottom left: amphipathicity (expressed as hydrophobic moment in Eisenberg units); bottom right: hydrophobic fraction. Statistical significance of difference in means from Q1 values is indicated (ns - not significant, otherwise P values: * < 0.05, ** < 0.01, *** < 0.001) using the two-tailed Mann-Whitney test calculated using GraphPad Prism 4.03.

Peptide  Quartile  Sequence   Cumulative  Average   Rel.IC50  Charge  Hydrophobicity  Hydrophobic
Number                        Vote        Rank                                        Moment
1        1         RWRWKRWWW  29          2027.1    0.25      4       0.56            1.48
2        1         RWRRWKWWW  29          2707.9    0.40      4       0.56            1.96
3        1         RWWRWRKWW  29          2729      0.28      4       0.56            2.11
4        1         RWRRKWWWW  28          2831.9    0.39      4       0.56            2.75
5        1         RWRWWKRWY  28          3044.5    0.20      4       0.56            2.86
6        1         RRKRWWWWW  27          2434.6    0.43      4       0.56            1.22
7        1         RWRIKRWWW  27          2589.1    0.12      4       0.56            1.84
8        1         KIWWWWRKR  27          2622.3    0.13      4       0.56            2.06
9        1         RWRRWKWWL  27          3201.2    0.08      4       0.56            2.12
10       1         KRWWKWIRW  27          3660.7    0.04      4       0.56            4.65
51       2         IRMWVKRWR  0           13255.8   0.61      4       0.56            4.24
52       2         RIWYWYKRW  0           13263.4   0.36      3       0.67            4.06
53       2         FRRWWKWFK  0           13275.7   0.12      4       0.56            5.40
54       2         RVRWWKKRW  0           13278.9   0.27      5       0.44            2.27
55       2         RLKKVRWWW  0           13318.8   0.34      4       0.56            1.16
56       2         RWWLKIRKW  0           13319.5   0.18      4       0.56            3.85
57       2         LRWWWIKRI  0           13336.1   0.33      3       0.67            0.99
58       2         TRKVWWWRW  0           13336.2   0.76      3       0.56            0.78
59       2         KRFWIWFWR  0           13347.1   3.04      3       0.67            4.11
60       2         KKRWVWVIR  0           13348.2   0.35      4       0.56            2.92
141      3         KIRRKVRWG  0           67295.4   10.55     5       0.33            2.02
142      3         AIRRWRIRK  0           67295.8   4.62      5       0.44            5.94
143      3         WRFKVLRQR  0           67297.8   7.08      4       0.44            4.20
144      3         RSGKKRWRR  0           67298     6.50      6       0.11            4.66
145      3         FMWVYRYKK  0           67298     1.51      3       0.67            1.81
146      3         RGKYIRWRK  0           67298.1   3.83      5       0.33            4.94
147      3         WVKVWKYTW  0           67298.3   5.64      2       0.67            2.41
148      3         VVLKIVRRF  0           67298.6   25.00     3       0.67            1.86
149      3         GKFYKVWVR  0           67298.7   1.21      3       0.56            5.39
150      3         SWYRTRKRV  0           67299.6   6.66      4       0.33            4.24
191      4         GRIGGKNVR  0           98644.5   9.12      3       0.22            4.30
192      4         NKTGYRWRN  0           98701.1   8.33      3       0.22            2.75
193      4         VSGNWRGSR  0           98756.7   8.54      2       0.22            2.67
194      4         GWGGKRRNF  0           98807.8   7.38      3       0.22            1.13
195      4         KNNRRWQGR  0           98885.2   6.45      4       0.11            2.88
196      4         GRTMGNGRW  0           98946.9   6.93      2       0.22            1.40
197      4         GRQISWGRT  0           98949.4   8.04      2       0.22            1.94
198      4         GGRGTRWHG  0           99178.5   8.60      3       0.11            2.63
199      4         GVRSWSQRT  0           99185.7   8.50      2       0.22            2.56
200      4         GSRRFGWNR  0           99199.5   8.10      3       0.22            0.58

Table 4.2. Predicted activity rank and experimental Rel.IC50 values for selected test peptides. Forty peptides are shown from the 200 total test peptides. Hydrophobic moment uses the Eisenberg scale.

4.1.7  Antibacterial activity of predicted peptides against resistant strains
A selection of 18 of these 200 peptides was synthesized in bulk and tested against a large variety of drug-resistant bacterial pathogens (Table 4.3). A total of 13 peptides from quartiles 1 and 2 with high activity, and 5 peptides from quartile 3 with low activity, were evaluated for their in vitro effect (MIC activity) against several multi-drug resistant and problematic pathogens, including strains of multi-drug resistant P. aeruginosa, methicillin resistant Staphylococcus aureus (MRSA), Enterobacter cloacae with derepressed chromosomal β-lactamase, extended spectrum β-lactamase producing Escherichia coli and Klebsiella pneumoniae, and vancomycin resistant Enterococcus faecalis and Enterococcus faecium (VRE). All 13 peptides belonging to the first and second quartiles had significant in vitro inhibitory activity against antibiotic-resistant bacteria. Moreover, some peptides from the 1st quartile, such as 8 and 9, exhibited MICs of 0.3-10 µM against most of the tested ‘superbugs’, compared to the only antimicrobial peptide to show efficacy to date in advanced clinical trials, MX-226 (Hancock and Sahl, 2006), which exhibited MICs of 10-76 µM (Cherkasov et al., in press).
These results characterize the developed peptides as excellent antibiotic candidates for treating some of the most recalcitrant and dangerous human infections. As reported elsewhere (Cherkasov et al., in press), two other peptides identified from the first quartile were also found to be protective against Staphylococcus aureus infection in animal models.

MIC (µM)
Peptide ID  Sequence      A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q     R     S     T
Bac2A       RLARIVVIRVAR  48    192   95    192   95    95    12    3.0   24    24    24    192   192   24    24    12    48    48    12    3.0
8           KIWWWWRKR     5.9   47    24    47    47    12    5.9   3.0   94    12    5.9   189   47    5.9   5.9   24    94    94    5.9   1.5
9           RWRRWKWWL     2.9   12    12    23    5.8   12    0.3   0.7   5.8   5.7   2.9   46    11    2.9   2.9   23    92    92    5.7   1.4
20          WRWWKIWKR     5.9   24    24    47    12    47    1.5   0.8   12    5.9   5.9   94    24    3.0   3.0   24    94    94    5.9   1.5
45          WKRWWKKWR     23    46    46    93    23    46    5.8   1.4   93    23    2.9   186   46    5.8   5.8   93    >186  >186  23    5.8
48          WKKWWKRRW     23    46    46    93    46    46    5.8   1.4   23    23    2.9   186   46    1.4   2.9   93    >186  >186  12    5.8
24,897      FRRWWKWFK     1.5   12    3.0   5.9   5.9   24    1.5   0.8   24    12    6.1   97    24    1.5   3.0   24    195   97    6.1   6.1
24,901      LRWWWIKRI     13    50    25    50    25    25    6.3   3.2   50    25    13    201   50    6.3   6.3   13    50    25    6.3   1.5
24,910      RKRLKWWIY     25    50    50    50    50    50    6.3   3.2   13    6.3   6.3   >202  50    6.3   6.3   50    >202  202   13    3.2
24,913      KKRWVWIRY     25    51    25    51    25    25    3.2   1.6   25    25    13    204   51    13    13    25    102   102   6.4   6.4
24,915      KWKIFRRWW     12    24    24    48    12    48    3.1   1.5   6     24    12    194   97    3.1   3.1   24    97    97    24    6.1
24,919      RKWIWRWFL     6.1   12    3.1   6.1   6.1   3.1   1.5   1.5   3     1.5   1.5   3.1   3.1   3.1   3.1   6.1   24    24    3.1   3.1
24,921      IWWKWRRWV     6.0   48    12    48    12    48    6.0   1.5   6     24    6.0   96    12    3.0   3.0   24    48    48    6.0   3.0
24,944      RRFKFIRWW     6.1   24    49    49    12    49    3.1   0.8   12    12    12    98    49    6.1   6.1   12    98    49    6.1   6.1
74,655      AVWKFVKRV     240   >240  240   >240  240   >240  120   60    >240  240   120   >240  >240  240   240   >240  >240  >240  120   120
74,658      AWRFKNIRK     >223  >223  >223  >223  >223  >223  111   >223  >223  >223  223   >223  >223  223   >223  >223  >223  >223  >223  223
74,665      KRIMKLKMR     >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226  >226
74,674      AIRRWRIRK     >217  >217  >217  >217  >217  >217  217   108   108   108   108   >217  >217  54    54    >217  >217  >217  108   14
74,680      VVLKIVRRF     >241  >241  >241  >241  >241  >241  241   60    241   >241  241   >241  >241  241   241   241   >241  >241  241   60

Table 4.3. Activities against multi-resistant Superbugs of selected peptides predicted through the QSAR analysis compared to the peptide Bac2A. Peptides from the top quartile (8 to 48) were compared to peptides from the 2nd (24,897 to 24,944) and 3rd (74,655 to 74,680) quartiles. Column legend: Peptide ID indicates the control Bac2A or the test peptide by rank number. Columns give MIC values (µM) measured in 3-5 replicates for A, P. aeruginosa wild type strain H103; B, C, D, P. aeruginosa multidrug resistant strains from Brazil #9, #198 and #213 respectively; E, F, G, P. aeruginosa Liverpool epidemic strains LES400, H1030, and H1027 respectively; H, P. maltophilia ATCC13637; I, constitutive Class C chromosomal β-lactamase expressing Enterobacter cloacae 218R; J, K, extended-spectrum β-lactamase-producing (ESBL) E. coli (clinical strains 63103 and 64771); L, M, ESBL resistant Klebsiella pneumoniae (clinical strains 61962 and 63575); N, S. aureus ATCC25923; O, methicillin resistant S. aureus strain C623; P, Enterococcus faecalis ATCC29212; Q, R, vancomycin resistant E. faecalis [clinical isolates w61950 (VanA) and f43559 (VanB)]; S, T, vancomycin resistant E. faecium [clinical isolates mic80 (VanA) and t62764 (VanB)].

It is interesting to note that two of the peptides with high potency (IDs 45 and 48 in Table 4.3) are active against a large number of the drug-resistant strains but have poor activity against one extended-spectrum β-lactamase-producing (ESBL) pathogen (column L) and two vancomycin resistant organisms (Q and R). However, these two peptides are active against other ESBL organisms (columns J and K), and other vancomycin resistant organisms (S and T). It seems likely that this resistance is due to different mechanisms than resistance to the conventional antibiotics.
For example, β-lactamase would not be expected to inactivate these peptides since they do not contain β-lactam rings.

4.2  Conclusions
In this study we have described the methodology used in the first application of atomic-resolution 3D QSAR methods to the prediction of antibacterial activity for a large data set of diverse peptides. With the availability of large numbers of synthetic peptides and a rapid assay to determine their antibacterial activity, larger sets of data on peptide sequence and activity can now be created. Based on two random libraries containing a total of over 1400 peptides, we developed artificial neural network models that predict and rank the relative activities of novel antimicrobial peptides with remarkable accuracy: in an independent test set of 100,000 virtual peptides, 94% of the 50 highest ranked peptides predicted to be highly active were found to be highly active. In addition to allowing more complex models that utilize the 'inductive' QSAR methodology, the availability of peptide data of high quantity and quality also allows more rigorous training and evaluation of the machine learning techniques. We consider the methodology described here to be the first successful demonstration of high-throughput in silico screening of antibacterial peptides for novel drug leads.

4.3  Materials and methods

4.3.1  Electron microscopy of AMPs
TEM micrographs were prepared from thin sections of Pseudomonas aeruginosa that were either untreated or treated with Bac2A (sequence: RLARIVVIRVAR) at the MIC (50 µg/mL) for one hour at 37 °C. As a control, bacteria were mock incubated and prepared for embedding/thin section electron microscopy in the same way as the peptide-treated bacteria. SEM micrographs of Pseudomonas aeruginosa were prepared for control untreated and Bac2A-treated (50 µg/mL) bacteria. Bacteria were incubated with Bac2A for one hour at 37 °C before fixation and preparation for SEM.
4.3.2  Peptide sequences for model training
Two experimental sets of peptides were created, one consisting of 943 peptides (Set A) and another of 500 peptides (Set B). Peptide sequences in these sets were selected randomly from the amino acid distributions shown in Figure 4.5 using custom computer software. The amino acid proportions for Set A were determined based on our previous studies of substitution analysis (Hilpert et al., 2005), and proportions for Set B were further determined from early analysis of Set A activity. In one plate of ten peptides in the set of 943 peptides, the control Bac2A peptide did not show the expected luminescence profile and these ten experimental peptides were excluded from further use, leaving 933 peptides. For modelling (described below), three training sets were prepared, consisting of the set of 933 peptides (Set A), the set of 500 peptides (Set B) and a set created by combining the 933 and 500 sets (Set A+B). A set of 100,000 random peptide sequences was generated in the same amino acid proportions as Set B, using the same algorithm as described above. There were 311 duplicates that were removed, leaving 99,577 peptides (the test set). Peptides from this set were evaluated in silico and 200 (50 from each quartile) were selected for synthesis and assay.

4.3.3  Peptide SPOT synthesis and screening
Peptide synthesis was performed as previously described (Hilpert et al., 2005; Hilpert et al., 2007). Briefly, peptides were synthesized on a cellulose support with a pipetting robot using two glycine residues as linker. Peptides were cleaved from the dried membrane in an ammonia atmosphere, resulting in free peptides with two glycines at the amidated C terminus due to the linker sequence. The peptide spots were punched out and transferred to 96-well microtitre plates in sets of 10 along with a positive control peptide (Bac2A) and either an unrelated peptide (GATPEDLNQKLS) or an empty well as negative control.
An overnight culture of P. aeruginosa strain H1001 was diluted at a 1:500 ratio with 100 mM Tris buffer (pH 7.3), 20 mM glucose. This diluted culture was added to the microtitre plate wells (100 µL/well) containing the peptide spots and controls. After 30 min incubation, serial dilutions were performed from the membrane spots to successive rows of the plate. Luminescence of the P. aeruginosa PAO1 strain H1001 containing luciferase gene cassette luxABCDE was measured at 4 hours using a Tecan Spectra Fluor Plus (Tecan US).

4.3.4  Calculation of peptide activity
The luminescence of each peptide in a dilution series was fit to the following function (1) independently for each peptide, after luminescence data were normalized to 1.0 for the most dilute luminescence point for each peptide. This function has the form of a sigmoid curve consisting of two plateaus with a smoothly varying region joining them. Parameters of the function describe the height of the higher plateau, the position of the center of the slope at half the maximum luminescence, and the slope at the center. Estimation of parameters was performed using custom C software using Numerical Recipes in C (Press et al., 1992).

L = \frac{L_{max}}{1 + e^{-2S(x - x_{1/2})}}    (1)

In this function, L_{max} controls the maximum height of the curve, S controls the slope, and x_{1/2} is the value of x giving luminescence of half of the maximum luminescence. The values of x were in dilution steps with values from zero, for the initial concentration, to seven (after seven dilutions); these corresponded to changes in concentration C,

C = C_0 \, 2^{-x}    (2)

where C_0 was the initial concentration of peptide in the undiluted well. We were interested in calculating the concentration of peptide that reduces the number (and hence the luminescence) of viable energized bacteria by 50%, the IC50.
From these equations we can state the IC50 as,

IC_{50} = C(x_{1/2}) = C_0 \, 2^{-x_{1/2}}    (3)

However, we can eliminate the need to determine the initial concentration of peptide by reporting the activity of peptides as relative IC50 (Rel.IC50) values: the ratio of the IC50 for the experimental peptide to the IC50 for Bac2A. Values of Rel.IC50 < 1.0 mean the peptide is more active than Bac2A, since a lower concentration yields the same reduction in bacterial concentration. For peptides with very low or zero activity, curve fitting was problematic. Where the luminescence of a well for an undiluted peptide was greater than 50% of the maximum luminescence for the peptide at high dilutions, the IC50 concentration was not observed even at the highest peptide concentration used. Here, the peptide was considered inactive and assigned a Rel.IC50 value of 25. For Set A and B, 7 dilution points were used in the calculation of Rel.IC50 due to frequent artifacts in the last dilution row (dramatic increases in luminescence were observed that were inconsistent with the expected profile). For the 200 peptides taken from the independent test set, the Rel.IC50 was determined from all 8 dilution points for each peptide since these artifacts were largely eliminated in later measurements.

4.3.5  QSAR descriptors
The QSAR descriptors used in this study are shown in Table 4.4. The 'inductive' QSAR descriptors used in this study were previously described (Cherkasov, 2005). An initial set of seventy-seven QSAR descriptors was calculated for each peptide in the two training sets and the test set using MOE (Molecular Operating Environment, 2005, by Chemical Computing Group Inc., Montreal, Canada). The peptide structure was optimized starting from an initial linear structure, followed by potential energy minimization of each molecule using MMFF94 force-field calculations (Halgren, 1996). Structure optimization was done without including interactions with other molecules.
The atomic types were assigned according to the name, valence state and formal charge of the constituent atoms, as defined within MOE. QSAR descriptors were calculated using custom SVL scripts within the MOE environment. The 'inductive' QSAR variables can be computed by the following equations:

$$Rs_{j \to G} = R_j^2 \sum_{i \neq j}^{N-1} \frac{1}{r_{j-i}^2} \qquad (4)$$

$$Rs_{G \to j} = \alpha \sum_{i \subset G,\, i \neq j}^{n} \frac{R_i^2}{r_{i-j}^2} \qquad (5)$$

$$\sigma^{*}_{j \to G} = \sum_{i \neq j}^{N-1} \frac{(\chi_j^0 - \chi_i^0) R_j^2}{r_{j-i}^2} \qquad (6)$$

$$\sigma^{*}_{G \to j} = \beta \sum_{i \subset G,\, i \neq j}^{n} \frac{(\chi_i^0 - \chi_j^0) R_i^2}{r_{i-j}^2} \qquad (7)$$

$$\chi^{0}_{G \to j} = \frac{\sum_{i \neq j}^{N-1} \chi_i^0 \, (R_i^2 + R_j^2) / r_{i-j}^2}{\sum_{i \neq j}^{N-1} (R_i^2 + R_j^2) / r_{i-j}^2} \qquad (8)$$

$$\Delta N_j = Q_j + \gamma \sum_{i \neq j}^{N-1} \frac{(\chi_j - \chi_i)(R_j^2 + R_i^2)}{r_{j-i}^2} \qquad (9)$$

$$\eta_i = \frac{1}{2 \sum_{j \neq i}^{N-1} \frac{R_j^2 + R_i^2}{r_{j-i}^2}} \qquad (10)$$

$$\eta_{MOL} = \frac{1}{s_{MOL}} = \frac{1}{2 \sum_{i}^{N} \sum_{j \neq i}^{N} \frac{R_j^2 + R_i^2}{r_{j-i}^2}} \qquad (11)$$

$$s_i = 2 \sum_{j \neq i}^{N-1} \frac{R_j^2 + R_i^2}{r_{j-i}^2} \qquad (12)$$

$$s_{MOL} = 2 \sum_{i}^{N} \sum_{j \neq i}^{N} \frac{R_j^2 + R_i^2}{r_{j-i}^2} \qquad (13)$$

where $R$ is the covalent atomic radius, $r$ is the interatomic distance, $Q_j$ is the formal charge of atom $j$, $\chi$ is the 'inductive' electronegativity, $Rs$ is the steric constant, $\sigma^{*}$ is the inductive constant, $\Delta N$ is the 'inductive' partial charge, and $\eta$ and $s$ are the 'inductive' analogues of chemical hardness and softness. The symbols $\alpha$, $\beta$ and $\gamma$ are constant scaling coefficients of the inductive formalism (Cherkasov, 2005). It should be noted that the variables indexed with the $j$ subscript describe the influence of a single atom on a group of atoms $G$ (typically the rest of the $N$-atomic molecule), while $G$ indices designate group (molecular) quantities. The linear character of equations (4) - (9) makes 'inductive' descriptors readily computable, suitable for sizable databases, and appropriate as parameters for large-scale QSAR models.

Resources using the R language for statistical computing (http://r-project.org; R Development Core Team, 2005) were used for all following steps. Each descriptor in the training and test sets was normalized to the range encountered in training peptide Sets A and B. A cross-correlation was performed on the descriptors in the set of all peptides from training and testing.
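To make the 'inductive' equations more concrete, the following Python sketch evaluates the atomic steric constant and the softness and hardness terms from atomic coordinates and covalent radii. It is an illustration only, not the SVL code used in the thesis. It assumes the forms Rs(j→G) = R_j² Σ 1/r² for equation (4) and s_i = 2 Σ (R_j² + R_i²)/r² for equation (12), with hardness taken as the reciprocal of softness; the function names and the three-atom geometry in the usage note are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def steric_atom_to_group(j, coords, radii):
    """Equation (4), as assumed here: Rs(j->G) = R_j^2 * sum_{i != j} 1 / r_ji^2."""
    return radii[j] ** 2 * sum(
        1.0 / squared_distance(coords[j], coords[i])
        for i in range(len(coords)) if i != j)


def atomic_softness(i, coords, radii):
    """Equation (12), as assumed here: s_i = 2 * sum_{j != i} (R_j^2 + R_i^2) / r_ji^2."""
    return 2.0 * sum(
        (radii[j] ** 2 + radii[i] ** 2) / squared_distance(coords[i], coords[j])
        for j in range(len(coords)) if j != i)


def atomic_hardness(i, coords, radii):
    """Equation (10): atomic hardness as the reciprocal of atomic softness."""
    return 1.0 / atomic_softness(i, coords, radii)


def molecular_softness(coords, radii):
    """Equation (13): molecular softness as the sum of atomic softnesses."""
    return sum(atomic_softness(i, coords, radii) for i in range(len(coords)))
```

For example, with made-up coordinates `[(0, 0, 0), (1, 0, 0), (0, 2, 0)]` and radii `[1.0, 0.5, 0.5]`, `steric_atom_to_group(0, coords, radii)` sums the inverse squared distances from atom 0 to the other two atoms and scales the result by the squared radius of atom 0.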
Where the Pearson correlation coefficient between two descriptors was > 0.95 or < -0.95, one descriptor of the pair was dropped. This was repeated until no pair of descriptors had an absolute correlation above 0.95, which left a final set of forty-four descriptors (Table 4.4). Hydrophobic moments were calculated for comparison purposes (Figure 4.9 and Table 4.2) and were not used in ANN modelling. These were calculated using the hmoment utility in EMBOSS (Rice et al., 2000), modified to use the Eisenberg scale (Eisenberg et al., 1984).

4.3.6  Training and validation data sets

For each of the three training sets of peptides described above (Set A, Set B and Set A+B), the top 5% of peptides ranked by Rel.IC50 (that is, the most active peptides) were classified as active and assigned an activity value of 1 in the data sets for training the ANNs; all other peptides were assigned an activity value of 0. A stratified ten-fold cross-validation was performed on the three sets, resulting in ten models for each of the training sets, for a total of thirty models. Briefly, to create the cross-validation data sets, 10% of the active peptides in the training set (one of Set A, Set B, or Set A+B) were randomly assigned to each of 10 lists. Then 10% of the inactive peptides in the training set were randomly assigned to each of 10 lists. Each list of actives was combined with one list of inactives to create 10 lists of combined active and inactive peptides. Using one of these lists as the validation data set, the other 9 were used as the corresponding training set. This was repeated a total of 10 times to create 10 validation sets and 10 training sets. This creation of 10-fold cross-validation sets was performed separately for each of the training sets (A, B, and A+B).

4.3.7  Test data set

To evaluate the voting system's ability to predict peptide activity, we selected a set of 100,000 peptide sequences according to the amino acid frequencies used in Set B. QSAR descriptors were calculated as described in section 4.3.5 above.
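The stratified ten-fold split described in section 4.3.6 can be sketched as follows. This is a minimal Python illustration, not the R code used in the thesis; the function names, the fixed random seed and the round-robin dealing of shuffled peptides into lists are assumptions of the sketch.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Split item indices into n_folds lists with a similar share of
    actives (label 1) and inactives (label 0) in every list."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in (1, 0):                        # actives first, then inactives
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for k, idx in enumerate(members):     # deal out one index at a time
            folds[k % n_folds].append(idx)
    return folds


def train_validation_pairs(folds):
    """Yield (training, validation) index lists, one pair per fold."""
    for k, validation in enumerate(folds):
        training = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield training, validation
```

Each of the ten pairs produced by `train_validation_pairs` would then be used to train and validate one model, giving ten models per training set.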
The maximum and minimum values of each of the 44 descriptors were compared to the range present in the Set A and B training data. Where a descriptor value of a test peptide fell more than 15% outside the range of the training data (above the maximum or below the minimum), the test peptide was dropped from the test set, leaving a total of 99,577 peptide sequences.

4.3.8  Model training

Artificial neural networks (ANNs) were constructed and evaluated using SNNS (Stuttgart Neural Network Simulator, version 4.2, from University of Tübingen, Stuttgart, Germany, available at http://www-ra.informatik.uni-tuebingen.de/SNNS/). The networks (Figure 4.7) consisted of forty-four input nodes (one for each QSAR descriptor as described above), ten nodes in one hidden layer, and one output node; all were fully connected. The output node values for training were zero for not active and one for active. Networks were initialized using randomized weights. Model training was performed using the pairs of training and validation data sets generated for the 10-fold cross-validation described above. Therefore, 10 models were created for each of the training sets (Set A, Set B, and Set A+B), for a total of 30 models. Training on each training data set used the standard backpropagation learning function with parameters η = 0.2 and dmax = 0. The update function used topological order, with the order of training patterns shuffled. For each cycle of training, the validation data set was evaluated. As the network trained, the network parameters giving the minimum error on the validation set were stored. After 200 training cycles with no new minimum validation error, all network weights were jogged by 2% to attempt to escape local minima, and weights that showed more than 95% correlation during propagation were jogged by 5%. Training then continued and was terminated after an additional 200 cycles with no new minimum validation error.
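The stopping rule described above (keep the weights with the lowest validation error, jog the weights after 200 cycles without improvement, stop after a further 200 such cycles) can be sketched independently of any particular network library. The Python below is an illustration under stated assumptions, not SNNS code: the callback names are hypothetical, the jog is applied once as a random multiplicative perturbation of every weight, and the separate 5% jog of highly correlated weights is omitted. A one-weight toy model stands in for the 44-10-1 network.

```python
import random


def train_with_checkpoint(train_cycle, validation_error, get_weights,
                          set_weights, patience=200, jog=0.02, seed=0):
    """Run training cycles, remembering the weights with the lowest
    validation error. After `patience` cycles without a new minimum,
    jog all weights once; stop after a further `patience` such cycles."""
    rng = random.Random(seed)
    best_err = validation_error()
    best_w = list(get_weights())
    stale, jogged = 0, False
    while True:
        train_cycle()
        err = validation_error()
        if err < best_err:                    # new minimum: store a checkpoint
            best_err, best_w, stale = err, list(get_weights()), 0
            continue
        stale += 1
        if stale >= patience:
            if jogged:
                break
            # assumed jog: scale each weight by a factor in [1 - jog, 1 + jog]
            set_weights([w * (1 + rng.uniform(-jog, jog)) for w in get_weights()])
            jogged, stale = True, 0
    set_weights(best_w)                       # restore the best checkpoint
    return best_err


class ToyModel:
    """One-weight linear model y = w * x, used only to exercise the loop."""

    def __init__(self):
        self.w = [0.0]
        self.train = [(1.0, 2.0), (2.0, 4.0)]
        self.valid = [(3.0, 6.0)]

    def set(self, weights):
        self.w = list(weights)

    def cycle(self, rate=0.1):
        """One gradient-descent step on the mean squared training error."""
        grad = sum(2 * (self.w[0] * x - y) * x for x, y in self.train) / len(self.train)
        self.w[0] -= rate * grad

    def error(self):
        """Mean squared error on the validation data."""
        return sum((self.w[0] * x - y) ** 2 for x, y in self.valid) / len(self.valid)
```

With the toy model, `train_with_checkpoint(m.cycle, m.error, lambda: m.w, m.set, patience=5)` drives the weight towards 2 and leaves the best checkpoint loaded in the model.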
Performance measures such as ROC curves and areas, sensitivity and specificity were calculated using the ROCR package in R (Sing et al., 2005).

4.3.9  In silico ranking and selection of test peptides

To test the predictions of the ANNs, all peptides in the test set were evaluated by all 30 ANNs and the combined predictions were integrated into a single ordering of the test peptides as follows. Each peptide in the test set was assigned a ranking by each ANN. If a test peptide appeared in the top 5% of all peptides in the test set for an ANN, it received one 'vote' to indicate that the model suggested it to be highly active. A test peptide could therefore receive up to 30 votes from the total of 30 ANNs. Peptides were ranked by number of votes, with the relative ordering of peptides receiving the same number of votes determined by the average of the rankings from all ANNs. Sets of 50 peptides at 4 positions in the ranking were selected to independently evaluate the system's ability to predict peptide activity and inactivity. Quartile 1 (Q1) peptides were ranked in the topmost 50 positions and considered the most likely to be more active than control. Quartile 2 (Q2) peptides were ranked at the start of the 2nd quartile, positions 24895 to 24944, and considered likely to be more active than control. Quartile 3 (Q3) peptides were ranked at the end of the 3rd quartile, positions 74633 to 74682, and considered likely to be less active than control. Quartile 4 (Q4) peptides were ranked at the end of the 4th quartile, positions 99528 to 99577, and considered the most likely to be less active than control. These 200 predicted peptides were synthesized and assayed for activity as described above.

4.3.10  Minimal inhibitory concentration (MIC) determination

The minimal inhibitory concentration (MIC) of the peptides was measured as described (Cherkasov et al., in press). Briefly, a modified broth microdilution method was used. The peptides were dissolved and stored in glass vials.
The assay was performed in sterile 96-well polypropylene microtitre plates (Cat. #3790, Costar, Cambridge, MA). Serial dilutions of the peptides to be assayed were performed in 0.01% acetic acid containing 0.2% bovine serum albumin at 10-fold the desired final concentration. Ten microlitres of the 10-fold concentrated peptides were added to each well of a 96-well polypropylene plate containing 90 µl of MH medium per well. Bacteria were added to the plate from an overnight culture at a final concentration of 2 - 7 x 10^5 CFU/ml and incubated overnight at 37˚C. The MIC was taken as the lowest concentration at which no growth was observed. MIC analyses were done on a panel of bacterial pathogens that included strains both susceptible and resistant to common antibiotics. P. aeruginosa PAO1 strain H103, P. maltophilia ATCC#13637, S. aureus ATCC#25923, Enterococcus faecalis ATCC#29212 and Enterobacter cloacae 218R, constitutively expressing Class C chromosomal β-lactamase, were from our lab strain collection. A methicillin-resistant S. aureus (MRSA) clinical isolate was kindly provided by Anthony Chow (Vancouver General Hospital, Vancouver, Canada). Two Klebsiella pneumoniae and two E. coli clinical isolates expressing extended-spectrum β-lactamases (ESBL) were kindly provided by George Zhanel (Health Sciences Centre, Winnipeg, Canada). Vancomycin-resistant clinical isolates of Enterococcus faecalis and E. faecium were obtained from Ana M. Paccagnella (BC Centre for Disease Control, Vancouver, Canada). Three clinical isolates (#9, #198 and #213) of multi-drug resistant P. aeruginosa were kindly provided by Carlos Kiffer (University of São Paulo, Brazil). These isolates are all resistant to piperacillin/tazobactam, meropenem, ceftazidime, ciprofloxacin and cefepime, and #9 is also polymyxin B resistant. Three P.
aeruginosa clinical isolates of the Liverpool epidemic strain (LES) (H1027, H1030 and LES400) were all kindly provided by Craig Winstanley (University of Liverpool, UK). LES400 was resistant to gentamicin and tobramycin, while H1030 showed resistance to colistin, amikacin, gentamicin and tobramycin. All tested bacterial strains were categorized as biohazard level 2 pathogens.

4.4  Acknowledgements

We gratefully acknowledge financial support from the Canadian Institutes of Health Research (CIHR) and the Foundation of the National Institutes of Health and CIHR through the Grand Challenges in Global Health Initiative. We thank Jessica Lee for technical support in creating the computer-based peptide libraries. RH is the recipient of a Canada Research Chair. KH received a CIHR fellowship. CDF received a Doctoral Research Award from the CIHR.

4.5  Supplementary tables

Descriptor | Explanation | Parental Equation
Electronegativity-based
EO_Equalized* | Iteratively equalized electronegativity of a molecule | (8), (9)
Average_EO_Pos* | Arithmetic mean of electronegativities of atoms with positive partial charge | (8), (9)
Average_EO_Neg* | Arithmetic mean of electronegativities of atoms with negative partial charge | (8), (9)
Hardness-based
Global_Hardness | Molecular hardness (reciprocal of the softness of a molecule) | (10)
Sum_Hardness* | Sum of hardnesses of atoms of a molecule | (10)
Sum_Pos_Hardness | Sum of hardnesses of atoms with positive partial charge | (10)
Sum_Neg_Hardness* | Sum of hardnesses of atoms with negative partial charge | (10)
Average_Hardness* | Arithmetic mean of hardnesses of all atoms of a molecule | (10)
Average_Pos_Hardness* | Arithmetic mean of hardnesses of atoms with positive partial charge | (10)
Average_Neg_Hardness* | Arithmetic mean of hardnesses of atoms with negative partial charge | (10)
Smallest_Pos_Hardness* | Smallest atomic hardness among values for positively charged atoms | (10)
Smallest_Neg_Hardness* | Smallest atomic hardness among values for negatively charged atoms | (10)
Largest_Pos_Hardness* | Largest atomic hardness among values for positively charged atoms | (10)
Largest_Neg_Hardness* | Largest atomic hardness among values for negatively charged atoms | (10)
Hardness_of_Most_Pos* | Atomic hardness of an atom with the most positive charge | (10)
Hardness_of_Most_Neg* | Atomic hardness of an atom with the most negative charge | (10)
Softness-based
Global_Softness | Molecular softness (sum of constituent atomic softnesses) | (11)
Total_Pos_Softness | Sum of softnesses of atoms with positive partial charge | (11)
Total_Neg_Softness* | Sum of softnesses of atoms with negative partial charge | (11)
Average_Softness | Arithmetic mean of softnesses of all atoms of a molecule | (11)
Average_Pos_Softness | Arithmetic mean of softnesses of atoms with positive partial charge | (11)
Average_Neg_Softness* | Arithmetic mean of softnesses of atoms with negative partial charge | (11)
Smallest_Pos_Softness | Smallest atomic softness among values for positively charged atoms | (11)
Smallest_Neg_Softness | Smallest atomic softness among values for negatively charged atoms | (11)
Largest_Pos_Softness | Largest atomic softness among values for positively charged atoms | (11)
Largest_Neg_Softness | Largest atomic softness among values for negatively charged atoms | (11)
Softness_of_Most_Pos | Atomic softness of an atom with the most positive charge | (11)
Softness_of_Most_Neg | Atomic softness of an atom with the most negative charge | (11)
Charge-based
Total_Charge | Sum of absolute values of partial charges on all atoms of a molecule | (9)
Total_Charge_Formal | Sum of charges on all atoms of a molecule (formal charge of a molecule) | (9)
Average_Pos_Charge* | Arithmetic mean of positive partial charges on atoms of a molecule | (9)
Average_Neg_Charge* | Arithmetic mean of negative partial charges on atoms of a molecule | (9)
Most_Pos_Charge | Largest partial charge among values for positively charged atoms | (9)
Most_Neg_Charge | Largest partial charge among values for negatively charged atoms | (9)
Descriptors based on inductive substituent constants
Total_Sigma_mol_i* | Sum of inductive parameters sigma (molecule→atom) for all atoms within a molecule | (7)
Total_Abs_Sigma_mol_i | Sum of absolute values of group inductive parameters sigma (molecule→atom) for all atoms within a molecule | (7)
Most_Pos_Sigma_mol_i | Largest positive group inductive parameter sigma (molecule→atom) for atoms in a molecule | (7)
Most_Neg_Sigma_mol_i* | Largest (by absolute value) negative group inductive parameter sigma (molecule→atom) for atoms in a molecule | (7)
Most_Pos_Sigma_i_mol | Largest positive atomic inductive parameter sigma (atom→molecule) for atoms in a molecule | (7)
Most_Neg_Sigma_i_mol | Largest negative atomic inductive parameter sigma (atom→molecule) for atoms in a molecule | (7)
Sum_Pos_Sigma_mol_i* | Sum of all positive group inductive parameters sigma (molecule→atom) within a molecule | (7)
Sum_Neg_Sigma_mol_i* | Sum of all negative group inductive parameters sigma (molecule→atom) within a molecule | (7)
Descriptors based on steric substituent constants
Largest_Rs_mol_i | Largest value of steric influence Rs(molecule→atom) in a molecule | (5)
Smallest_Rs_mol_i* | Smallest value of group steric influence Rs(molecule→atom) in a molecule | (5)
Largest_Rs_i_mol* | Largest value of atomic steric influence Rs(atom→molecule) in a molecule | (4)
Smallest_Rs_i_mol | Smallest value of atomic steric influence Rs(atom→molecule) in a molecule | (4)
Most_Pos_Rs_mol_i | Steric influence Rs(molecule→atom) ON the most positively charged atom in a molecule | (5)
Most_Neg_Rs_mol_i* | Steric influence Rs(molecule→atom) ON the most negatively charged atom in a molecule | (5)
Most_Pos_Rs_i_mol | Steric influence Rs(atom→molecule) OF the most positively charged atom to the rest of a molecule | (4)
Most_Neg_Rs_i_mol* | Steric influence Rs(atom→molecule) OF the most negatively charged atom to the rest of a molecule | (4)
Conventional QSAR descriptors implemented by the MOE software
a_acc* | Number of hydrogen bond acceptor atoms | N/A
a_don* | Number of hydrogen bond donor atoms | N/A
ASA* | Water accessible surface area | N/A
ASA_H* | Water accessible surface area of all hydrophobic atoms | N/A
ASA_P* | Water accessible surface area of all polar atoms | N/A
ASA-* | Water accessible surface area of all atoms with negative partial charge | N/A
ASA+* | Water accessible surface area of all atoms with positive partial charge | N/A
FCharge* | Total charge of the molecule | N/A
b_1rotN | Number of rotatable single bonds | N/A
logP(o/w)* | Log of the octanol/water partition coefficient | N/A
logS* | Log of the aqueous solubility | N/A
Mr | Molecular refractivity | N/A
PC-* | Total negative partial charge | N/A
PC+* | Total positive partial charge | N/A
RPC- | Relative negative partial charge | N/A
RPC+* | Relative positive partial charge | N/A
TPSA | Polar surface area | N/A
vdw_area* | van der Waals surface area calculated using a connection table approximation | N/A
vdw_vol | van der Waals volume calculated using a connection table approximation | N/A
Vol | van der Waals volume calculated using a grid approximation | N/A
VSA | van der Waals surface area using polyhedral representation | N/A
vsa_acc* | Approximation to the sum of VDW surface areas of pure hydrogen bond acceptors | N/A
vsa_acid* | Approximation to the sum of VDW surface areas of acidic atoms | N/A
vsa_base | Approximation to the sum of VDW surface areas of basic atoms | N/A
vsa_don | Approximation to the sum of VDW surface areas of pure hydrogen bond donors | N/A
vsa_hyd* | Approximation to the sum of VDW surface areas of hydrophobic atoms | N/A
Weight* | Molecular weight | N/A

Table 4.4. Description of all QSAR descriptors used in analysis of peptide activities. The column 'Parental Equation' refers to the equation described in the text that is used to calculate the descriptor. Those descriptors without a parental equation were provided by molecular simulation software (Molecular Operating Environment, 2005, by Chemical Computing Group Inc., Montreal, Canada).
Descriptors indicated with * were used in the classification analysis as described in the text.

Sequence | Cumulative Vote | Average Rank | Rel IC50 | Charge | Hydrophobicity | Hydrophobic moment
Topmost 50 (rows 1-50)
RWRWKRWWW | 29 | 2027.13 | 0.25 | 4 | 0.56 | 1.48
RWRRWKWWW | 29 | 2707.87 | 0.4 | 4 | 0.56 | 1.96
RWWRWRKWW | 29 | 2728.97 | 0.28 | 4 | 0.56 | 2.11
RWRRKWWWW | 28 | 2831.87 | 0.39 | 4 | 0.56 | 2.75
RWRWWKRWY | 28 | 3044.53 | 0.2 | 4 | 0.56 | 2.86
RRKRWWWWW | 27 | 2434.63 | 0.43 | 4 | 0.56 | 1.22
RWRIKRWWW | 27 | 2589.1 | 0.12 | 4 | 0.56 | 1.84
KIWWWWRKR | 27 | 2622.3 | 0.13 | 4 | 0.56 | 2.06
RWRRWKWWL | 27 | 3201.17 | 0.08 | 4 | 0.56 | 2.12
KRWWKWIRW | 27 | 3660.7 | 0.04 | 4 | 0.56 | 4.65
KRWWWWWKR | 26 | 2601.83 | 0.22 | 4 | 0.56 | 4.19
IRWWKRWWR | 26 | 2735.33 | 0.21 | 4 | 0.56 | 6.32
IKRWWRWWR | 26 | 2848.03 | 0.23 | 4 | 0.56 | 5.75
RRKWWWRWW | 26 | 2859 | 0.27 | 4 | 0.56 | 1.2
RKWWRWWRW | 26 | 2866.13 | 0.31 | 4 | 0.56 | 5.44
KRWWWWRFR | 26 | 2952.17 | 0.24 | 4 | 0.56 | 2.13
IKRWWWRRW | 26 | 3063.3 | 0.22 | 4 | 0.56 | 2.98
KRWWWVWKR | 26 | 3080.23 | 0.36 | 4 | 0.56 | 4.19
KWRRWKRWW | 26 | 3291.97 | 0.15 | 5 | 0.44 | 4.29
WRWWKIWKR | 26 | 3456 | 0.14 | 4 | 0.56 | 4.75
WRWRWWKRW | 26 | 4973.83 | 0.28 | 4 | 0.56 | 2.6
WKRWKWWKR | 26 | 5351.2 | 0.25 | 5 | 0.44 | 2.96
RIKRWWWWR | 25 | 2875.47 | 0.31 | 4 | 0.56 | 2.09
IWKRWWRRW | 25 | 3011.93 | 0.24 | 4 | 0.56 | 5.23
KWWKIWWKR | 25 | 3075.07 | 0.2 | 4 | 0.56 | 3.32
RKRWLWRWW | 25 | 3292.37 | 0.25 | 4 | 0.56 | 2.28
KRWRWWRWW | 25 | 3309.7 | 0.28 | 4 | 0.56 | 2.03
KKRWLWWWR | 25 | 3328.63 | 0.3 | 4 | 0.56 | 2.57
RWWRKWWIR | 25 | 3426.07 | 0.24 | 4 | 0.56 | 4.11
KWWRWWRKW | 25 | 3543.47 | 0.2 | 4 | 0.56 | 5.14
KRWWIRWWR | 25 | 3591.17 | 0.21 | 4 | 0.56 | 5.09
KIWWWWRRR | 25 | 3616.8 | 0.21 | 4 | 0.56 | 2.72
RRRKWWIWW | 25 | 3926.37 | 0.18 | 4 | 0.56 | 0.32
RRRWWWWWW | 25 | 3935 | 1.82 | 3 | 0.67 | 1.22
RWWIRKWWR | 25 | 3965.1 | 0.21 | 4 | 0.56 | 5.05
KRWWKWWRR | 25 | 3974.97 | 0.13 | 5 | 0.44 | 5.89
KRWWRKWWR | 25 | 3980.1 | 0.15 | 5 | 0.44 | 6.4
RRIWRWWWW | 25 | 4065.33 | 0.68 | 3 | 0.67 | 4.7
IRRRKWWWW | 25 | 4099.4 | 0.21 | 4 | 0.56 | 0.93
KRKIWWWIR | 25 | 4202.17 | 0.28 | 4 | 0.56 | 3.9
RKIWWWRIR | 25 | 4205 | 0.59 | 4 | 0.56 | 1.83
KRWWIWRIR | 25 | 4216.67 | 0.35 | 4 | 0.56 | 2.02
RWFRWWKRW | 25 | 4610.57 | 0.26 | 4 | 0.56 | 5.94
WRWWWKKWR | 25 | 5055.03 | 0.19 | 4 | 0.56 | 4.2
WKRWWKKWR | 25 | 5248.37 | 0.2 | 5 | 0.44 | 4.66
WKRWRWIRW | 25 | 5696.47 | 0.28 | 4 | 0.56 | 1.81
WRWWKWWRR | 25 | 6026.73 | 0.23 | 4 | 0.56 | 4.94
WKKWWKRRW | 25 | 6133.6 | 0.19 | 5 | 0.44 | 2.41
WRWYWWKKR | 25 | 6147.73 | 0.22 | 4 | 0.56 | 1.86
WRRWWKWWR | 25 | 6591.37 | 0.23 | 4 | 0.56 | 5.39
Start of 2nd quartile (rows 24895-24944)
IRMWVKRWR | 0 | 13255.83 | 0.61 | 4 | 0.56 | 4.24
RIWYWYKRW | 0 | 13263.4 | 0.36 | 3 | 0.67 | 4.06
FRRWWKWFK | 0 | 13275.73 | 0.12 | 4 | 0.56 | 5.4
RVRWWKKRW | 0 | 13278.87 | 0.27 | 5 | 0.44 | 2.27
RLKKVRWWW | 0 | 13318.77 | 0.34 | 4 | 0.56 | 1.16
RWWLKIRKW | 0 | 13319.53 | 0.18 | 4 | 0.56 | 3.85
LRWWWIKRI | 0 | 13336.07 | 0.33 | 3 | 0.67 | 0.99
TRKVWWWRW | 0 | 13336.23 | 0.76 | 3 | 0.56 | 0.78
KRFWIWFWR | 0 | 13347.1 | 3.04 | 3 | 0.67 | 4.11
KKRWVWVIR | 0 | 13348.17 | 0.35 | 4 | 0.56 | 2.92
KRWVWYRYW | 0 | 13352.4 | 0.54 | 3 | 0.67 | 0.41
IRKWRRWWK | 0 | 13365.3 | 0.41 | 5 | 0.44 | 5.9
RHWKTWWKR | 0 | 13385.47 | 0.95 | 5 | 0.33 | 4.67
RRFKKWYWY | 0 | 13390 | 0.26 | 4 | 0.56 | 3.72
RIKVIWWWR | 0 | 13392.73 | 0.51 | 3 | 0.67 | 0.95
RKRLKWWIY | 0 | 13406.5 | 0.18 | 4 | 0.56 | 1.98
LVFRKYWKR | 0 | 13417.57 | 0.99 | 4 | 0.56 | 3.44
RRRWWWIIV | 0 | 13418.2 | 0.85 | 3 | 0.67 | 1.55
KKRWVWIRY | 0 | 13418.77 | 0.22 | 4 | 0.56 | 0.98
RWRIKFKRW | 0 | 13440.07 | 0.26 | 5 | 0.44 | 2.9
KWKIFRRWW | 0 | 13460.03 | 0.16 | 4 | 0.56 | 3.53
IWKRWRKRL | 0 | 13465.47 | 0.33 | 5 | 0.44 | 3.74
RRRKWWIWG | 0 | 13466.93 | 0.57 | 4 | 0.44 | 0.46
RWLVLRKRW | 0 | 13469.13 | 0.53 | 4 | 0.56 | 1.58
RKWIWRWFL | 0 | 13472.93 | 0.15 | 3 | 0.67 | 2.8
KRRRIWWWK | 0 | 13487.3 | 0.4 | 5 | 0.44 | 0.49
IWWKWRRWV | 0 | 13521.7 | 0.29 | 3 | 0.67 | 3.52
LRWRWWKIK | 0 | 13547.1 | 0.26 | 4 | 0.56 | 0.69
RWKMWWRWV | 0 | 13552.9 | 0.24 | 3 | 0.67 | 3
VKRYYWRWR | 0 | 13559.43 | 1.23 | 4 | 0.56 | 3.11
RWYRKRWSW | 0 | 13593.73 | 0.7 | 4 | 0.44 | 2.59
KRKLIRWWW | 0 | 13608.7 | 0.23 | 4 | 0.56 | 3.68
RWRWWIKII | 0 | 13621.07 | 0.46 | 3 | 0.67 | 2.99
KFRKRVWWW | 0 | 13632.63 | 0.3 | 4 | 0.56 | 2.08
IWIWRKLRW | 0 | 13638.23 | 0.46 | 3 | 0.67 | 2.68
LRFILWWKR | 0 | 13645.27 | 0.88 | 3 | 0.67 | 3.75
RVWFKRRWW | 0 | 13669.67 | 0.26 | 4 | 0.56 | 0.23
RRWFVKWWY | 0 | 13671 | 0.52 | 3 | 0.67 | 3.18
KWWLVWKRK | 0 | 13675.37 | 0.23 | 4 | 0.56 | 2.56
RWILWWWRI | 0 | 13678.73 | 25 | 2 | 0.78 | 4.11
KRWLTWRFR | 0 | 13690.3 | 0.54 | 4 | 0.44 | 2.62
RKWRWRWLK | 0 | 13700.5 | 0.31 | 5 | 0.44 | 2.4
IRRRWWWIV | 0 | 13702.2 | 0.23 | 3 | 0.67 | 2.55
IKWWWRMRI | 0 | 13705.3 | 0.39 | 3 | 0.67 | 1.52
RWKIFIRWW | 0 | 13708 | 1.82 | 3 | 0.67 | 2.84
IRQWWRRWW | 0 | 13720.43 | 0.5 | 3 | 0.56 | 4.89
RRRKTWYWW | 0 | 13724.03 | 0.32 | 4 | 0.44 | 0.41
RRWWHLWRK | 0 | 13725.63 | 0.38 | 5 | 0.44 | 5.24
RRWWMRWWV | 0 | 13726.37 | 0.33 | 3 | 0.67 | 3.07
RRFKFIRWW | 0 | 13731.6 | 0.24 | 4 | 0.56 | 2.13
End of 3rd quartile (rows 74633-74682)
INRKRRLRW | 0 | 67262.97 | 4.25 | 5 | 0.33 | 0.84
RRMKKLRRK | 0 | 67264.37 | 4.22 | 7 | 0.22 | 4.46
RKVRWKIRV | 0 | 67264.47 | 0.32 | 5 | 0.44 | 3.76
VRIVRVRIR | 0 | 67264.93 | 2.22 | 4 | 0.56 | 3.69
IKRVKRRKR | 0 | 67265.27 | 2.93 | 7 | 0.22 | 3.91
RVKTWRVRT | 0 | 67265.3 | 5.66 | 4 | 0.33 | 1.24
RVFVKIRMK | 0 | 67265.4 | 0.72 | 4 | 0.56 | 2.63
IRGRIIFWV | 0 | 67266.27 | 0.44 | 2 | 0.67 | 0.57
ATWIWVFRR | 0 | 67267.5 | 4.88 | 2 | 0.67 | 2.91
KKSKQLWKR | 0 | 67268.5 | 3.23 | 5 | 0.22 | 4.1
MINRVRLRW | 0 | 67269.17 | 2.77 | 3 | 0.56 | 2.2
GGIRRLRWY | 0 | 67270.13 | 1.16 | 3 | 0.44 | 2.84
RLVHWIRRV | 0 | 67270.3 | 2.62 | 4 | 0.56 | 5.36
AWKIKKGRI | 0 | 67270.47 | 3.59 | 4 | 0.44 | 0.13
FVVMKRIVW | 0 | 67271.23 | 5.38 | 2 | 0.78 | 2.33
GIKWRSRRW | 0 | 67272.9 | 1.06 | 4 | 0.33 | 2.06
RWMVSKIWY | 0 | 67273.33 | 25 | 2 | 0.67 | 2.15
IVVRVWVVR | 0 | 67274.7 | 3.5 | 2 | 0.78 | 0.69
RWIGVIIKY | 0 | 67274.83 | 2.24 | 2 | 0.67 | 4.01
WIRKRSRIF | 0 | 67275.33 | 3.39 | 4 | 0.44 | 3.41
GWKILRKRK | 0 | 67277.03 | 2.74 | 5 | 0.33 | 1.96
YQRLFVRIR | 0 | 67280 | 25 | 3 | 0.56 | 3.38
AVWKFVKRV | 0 | 67280.2 | 8.18 | 3 | 0.67 | 4.52
IRKKRRRWT | 0 | 67281.47 | 6.59 | 6 | 0.22 | 3.08
ILRVISKRR | 0 | 67282 | 25 | 4 | 0.44 | 2.06
AWRFKNIRK | 0 | 67282.57 | 9.2 | 4 | 0.44 | 1.85
HYKFQRWIK | 0 | 67283.83 | 2.79 | 4 | 0.44 | 3.94
RRIRRVRWG | 0 | 67283.93 | 8.22 | 5 | 0.33 | 3.54
VLVKKRRRR | 0 | 67283.93 | 12.48 | 6 | 0.33 | 1.2
RWRGIVHIR | 0 | 67284.03 | 4.93 | 4 | 0.44 | 0.64
WRNRKVVWR | 0 | 67284.73 | 6.79 | 4 | 0.44 | 2.89
KFWWWNYLK | 0 | 67284.93 | 1.81 | 2 | 0.67 | 1.33
KRIMKLKMR | 0 | 67284.97 | 6.5 | 5 | 0.44 | 4.04
IRRRKKRIK | 0 | 67286.73 | 6.42 | 7 | 0.22 | 3.29
RKWMGRFLM | 0 | 67286.77 | 4.38 | 3 | 0.56 | 2.92
RRVQRGKWW | 0 | 67287.2 | 6.3 | 4 | 0.33 | 3.04
WHGVRWWKW | 0 | 67289.83 | 2.5 | 3 | 0.56 | 2.63
WVRFVYRYW | 0 | 67289.93 | 2.15 | 2 | 0.78 | 4.59
RKRTKVTWI | 0 | 67291.97 | 5.11 | 4 | 0.33 | 0.7
IRRIVRRKI | 0 | 67292.87 | 11.15 | 5 | 0.44 | 4.79
KIRRKVRWG | 0 | 67295.4 | 10.55 | 5 | 0.33 | 2.02
AIRRWRIRK | 0 | 67295.77 | 4.62 | 5 | 0.44 | 5.94
WRFKVLRQR | 0 | 67297.83 | 7.08 | 4 | 0.44 | 4.2
RSGKKRWRR | 0 | 67297.97 | 6.5 | 6 | 0.11 | 4.66
FMWVYRYKK | 0 | 67298 | 1.51 | 3 | 0.67 | 1.81
RGKYIRWRK | 0 | 67298.13 | 3.83 | 5 | 0.33 | 4.94
WVKVWKYTW | 0 | 67298.3 | 5.64 | 2 | 0.67 | 2.41
VVLKIVRRF | 0 | 67298.63 | 25 | 3 | 0.67 | 1.86
GKFYKVWVR | 0 | 67298.7 | 1.21 | 3 | 0.56 | 5.39
SWYRTRKRV | 0 | 67299.6 | 6.66 | 4 | 0.33 | 4.24
Bottom-most 50 (rows 99528-99577)
KNRGRWFSH | 0 | 97923.77 | 9.79 | 4 | 0.22 | 2.42
AFRGSRHRM | 0 | 97924.6 | 11.36 | 4 | 0.33 | 1.89
GRNGWYRIN | 0 | 97925.57 | 10.74 | 2 | 0.33 | 2.97
AGGMRKRTR | 0 | 97945.47 | 25 | 4 | 0.22 | 2.12
ATRKGYSKF | 0 | 97994.6 | 25 | 3 | 0.33 | 2.78
SSGVRWSWR | 0 | 97995.4 | 8.16 | 2 | 0.33 | 3.65
RVWRNGYSR | 0 | 97996.27 | 10.24 | 3 | 0.33 | 4.26
WGRTRWSSR | 0 | 98002.77 | 9.64 | 3 | 0.22 | 1.4
GKRVWGRGR | 0 | 98018.07 | 8.2 | 4 | 0.22 | 3.28
SFNWKRSGK | 0 | 98036.97 | 25 | 3 | 0.22 | 2.47
WGRGGWTNR | 0 | 98042.53 | 25 | 2 | 0.22 | 1.17
ANRWGRGIR | 0 | 98047.13 | 10.8 | 3 | 0.33 | 5.29
WGGHKRRGW | 0 | 98049.73 | 6.19 | 4 | 0.22 | 1.66
WHGGQKWRK | 0 | 98093 | 8.5 | 4 | 0.22 | 2.87
FVWQKGTNR | 0 | 98093.2 | 11.3 | 2 | 0.33 | 2.18
HGVWGNRKR | 0 | 98107.57 | 7.95 | 4 | 0.22 | 0.77
TRGWSLGTR | 0 | 98118.03 | 12.15 | 2 | 0.22 | 3.88
GRRVMNQKR | 0 | 98140.33 | 9.83 | 4 | 0.22 | 3.34
RNKFGGNWR | 0 | 98153.27 | 25 | 3 | 0.22 | 1.88
GVRVQRNSK | 0 | 98166.57 | 25 | 3 | 0.22 | 3.02
NQKWSGRRR | 0 | 98171.9 | 7.97 | 4 | 0.11 | 0.77
RQNGVWRVF | 0 | 98183.87 | 8.26 | 2 | 0.44 | 2.03
GRMRLWNGR | 0 | 98205.93 | 7.91 | 3 | 0.33 | 1.03
WHYRSQVGR | 0 | 98228.63 | 6.65 | 3 | 0.33 | 1.83
GWNTMGRRW | 0 | 98257.43 | 6.32 | 2 | 0.33 | 3.5
RRMGNGGFR | 0 | 98272.87 | 8.71 | 3 | 0.22 | 4.81
SKNVRTWRQ | 0 | 98314.23 | 7.06 | 3 | 0.22 | 3.81
ARGRWINGR | 0 | 98370.6 | 7.24 | 3 | 0.33 | 0.98
GSRRSVWVF | 0 | 98381.53 | 2.3 | 2 | 0.44 | 2.29
WSQNVRTRI | 0 | 98383.63 | 5.71 | 2 | 0.33 | 1.63
GMRRWRGKN | 0 | 98444.37 | 6.05 | 4 | 0.22 | 1.7
RGRTSNWKM | 0 | 98450.4 | 7.07 | 3 | 0.22 | 1.29
GRRWGMGVR | 0 | 98481.6 | 7.75 | 3 | 0.33 | 4.01
WGKRRGWNT | 0 | 98490.87 | 7.91 | 3 | 0.22 | 1.94
AMLGGRQWR | 0 | 98497.47 | 6.75 | 2 | 0.44 | 2.87
QRNKGLRHH | 0 | 98538.87 | 8.76 | 5 | 0.11 | 1.16
ARGKSIKNR | 0 | 98539.63 | 8.35 | 4 | 0.22 | 1.67
NRRNGQMRR | 0 | 98587.97 | 8.41 | 4 | 0.11 | 2.31
RGRRQIGKF | 0 | 98602.67 | 8.51 | 4 | 0.22 | 4.43
ASKRVGVRN | 0 | 98637.47 | 8.17 | 3 | 0.33 | 2.06
GRIGGKNVR | 0 | 98644.47 | 9.12 | 3 | 0.22 | 4.3
NKTGYRWRN | 0 | 98701.07 | 8.33 | 3 | 0.22 | 2.75
VSGNWRGSR | 0 | 98756.67 | 8.54 | 2 | 0.22 | 2.67
GWGGKRRNF | 0 | 98807.8 | 7.38 | 3 | 0.22 | 1.13
KNNRRWQGR | 0 | 98885.2 | 6.45 | 4 | 0.11 | 2.88
GRTMGNGRW | 0 | 98946.9 | 6.93 | 2 | 0.22 | 1.4
GRQISWGRT | 0 | 98949.4 | 8.04 | 2 | 0.22 | 1.94
GGRGTRWHG | 0 | 99178.53 | 8.6 | 3 | 0.11 | 2.63
GVRSWSQRT | 0 | 99185.7 | 8.5 | 2 | 0.22 | 2.56
GSRRFGWNR | 0 | 99199.47 | 8.1 | 3 | 0.22 | 0.58

Table 4.5. Candidate peptides for confirmation of QSAR predictions. The 200 total candidate peptides are shown with peptide charge, hydrophobicity as hydrophobic fraction and hydrophobic moment using the Eisenberg scale.

4.6  References

Cherkasov, A., Hilpert, K., Jenssen, H., Fjell, C.D., Waldbrook, M., Mullaly, S.C., Volkmer, R., and Hancock, R.E.W. (2008) Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic resistant Superbugs. ACS Chemical Biology, in press.
Cherkasov, A. (2005) 'Inductive' Descriptors. 10 Successful Years in QSAR. Current Computer-Aided Drug Design, 1: 21-42.
Cherkasov, A. (2005) Inductive QSAR Descriptors. Distinguishing Compounds with Antibacterial Activity by Artificial Neural Networks. Int. J. Mol. Sci., 6: 63-86.
Eisenberg, D., Weiss, R. M., Terwilliger, T. C. (1984) The hydrophobic moment detects periodicity in protein hydrophobicity. Proc. Natl. Acad. Sci. USA, 81: 140-4.
Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhanced to treat microbial infections? Nature Reviews Microbiology, 2: 497-504.
Frecer, V. (2006) QSAR analysis of antimicrobial and haemolytic effects of cyclic cationic antimicrobial peptides derived from protegrin-1. Bioorganic & Medicinal Chemistry, 14: 6065-6074.
Frecer, V., Ho, B., Ding, J.L. (2004) De Novo Design of Potent Antimicrobial Peptides. Antimicrob. Agents Chemother., 48: 3349-3357.
Halgren, T. A. (1996) Merck molecular force field. 1. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry, 17: 490-519.
Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212.
Hancock, R.E.W., and Sahl, H.G.
(2006) Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24: 1551-1557.
Hilpert, K., Volkmer-Engert, R., Walter, T., Hancock, R.E.W. (2005) High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnology, 23: 1008-1012.
Hilpert, K., Elliott, M. R., Volkmer-Engert, R., Henklein, P., Donini, O., Zhou, Q., Winkler, D. F., Hancock, R. E. (2006) Sequence requirements and an optimization strategy for short antimicrobial peptides. Chem. Biol., 13: 1101-7.
Hilpert, K., Winkler, D. F., Hancock, R. E. (2007) Peptide arrays on cellulose support: SPOT synthesis, a time and cost efficient method for synthesis of large numbers of peptides in a parallel and addressable fashion. Nat. Protoc., 2: 1333-49.
Jenssen, H., Gutteberg, T.J., and Lejon, T. (2005) Modelling of anti-HSV activity of lactoferricin analogues using amino acid descriptors. J. Pept. Sci., 11: 97-103.
Jenssen, H., Hamill, P., and Hancock, R.E.W. (2006) Peptide Antimicrobial Agents. Clinical Microbiology Reviews, 19: 491–511.
Karakoc, E., Cherkasov, A., Sahinalp, S.C. (2006) Distance based algorithms for small biomolecule classification and structural similarity search. Bioinformatics, 15: 243-251.
Karakoc, E., Sahinalp, S.C., and Cherkasov, A. (2006) Comparative QSAR- and fragments distribution analysis of drugs, druglikes, metabolic substances, and antimicrobial compounds. J. Chem. Inf. Model., 46: 2167-2182.
Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407.
Lejon, T., Stiberg, T., Strom, M.B., and Svendsen, J.S. (2004) Prediction of antibiotic activity and synthesis of new pentadecapeptides based on lactoferricins. J. Pept. Sci., 10: 329-335.
Lejon, T., Strom, M.B., and Svendsen, J.S. (2001) Antibiotic activity of pentadecapeptides modelled from amino acid descriptors. J. Pept. Sci., 7: 74-81.
Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122-S129.
Ostberg, N., and Kaznessis, Y. (2004) Protegrin structure–activity relationships: using homology models of synthetic sequences to determine structural characteristics important for activity. Peptides, 26: 197–206.
Perkins, R., Fang, H., Tong, W., and Welsh, W.J. (2003) Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environmental Toxicology and Chemistry, 22: 1666-79.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992) Numerical Recipes in C: The Art of Scientific Computing (2nd Edition). Cambridge University Press, New York.
R Development Core Team (2005) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16: 276-277.
Sawyer, J. G., Martin, N. L., Hancock, R. E. (1988) Interaction of macrophage cationic proteins with the outer membrane of Pseudomonas aeruginosa. Infect. Immun., 56: 693-8.
Sing, T., Sander, O., Beerenwinkel, N., Lengauer, T. (2005) ROCR: visualizing classifier performance in R. Bioinformatics, 21: 3940-3941.
Strom, M.B., Stensen, W., Svendsen, J.S., and Rekdal, O. (2001) Increased antibacterial activity of 15-residue murine lactoferricin derivatives. J. Pept. Res., 57: 127–139.
Yeaman, M.R., Yount, N.Y. (2003) Mechanisms of Antimicrobial Peptide Action and Resistance. Pharmacol. Rev., 55: 27-55.

Chapter 5: Genetic algorithms for identification of potent antimicrobial peptides

A version of this chapter will be submitted as: Fjell, C.D., Jenssen, H., Hilpert, K., Cheung, W.A., Hancock, R.E.W., and Cherkasov, A.
Optimization of Antibacterial Peptides by Genetic Algorithms and QSAR

5.1  Introduction

Human pathogens that are resistant to current antibiotic treatment represent a significant health threat worldwide (Levy and Marshall, 2004). Drugs based on synthetic peptides are inspired by the short cationic, amphipathic peptides found throughout the kingdoms of life that possess antimicrobial activity by various mechanisms (see for example, Yeaman and Yount, 2003). These peptides have drawn significant attention as a possible source of novel antibacterial agents (Hamilton-Miller, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004; Hancock and Sahl, 2006). While antimicrobial peptides generally exhibit lower potency against susceptible bacterial targets compared to conventional low-molecular-weight antibiotic compounds, they have advantages that compensate for this lower potency, including fast killing, a broad range of activity, a postulated multiplicity of targets, low toxicity for host cells and minimal development of resistance in target organisms (Hancock and Sahl, 2006; Jenssen et al., 2006).

We have recently shown for the first time that synthetic peptides with high antibacterial activity and low toxicity can be identified with high accuracy using chemoinformatics and machine learning and without the use of an original template sequence (Cherkasov et al., in press). To achieve this we used quantitative structure-activity relationships (QSAR) combined with artificial neural networks to build software models of peptide activity. As a basis for describing structure in these peptides, we employed a set of 44 descriptors, including 3D QSAR descriptors that utilize atomic-scale molecular information, the so-called 'inductive' QSAR descriptors reviewed in (Cherkasov, 2005a).
In addition to our peptide studies, these descriptors have been successfully applied in a number of molecular modelling studies, including identification of antibacterial activity of small compounds (Cherkasov, 2005b) and classification of the chemical structures of antimicrobial compounds, conventional drugs and drug-like substances (Karakoc et al., 2006a and 2006b).

In our recent work (Cherkasov et al., in press), we used a large data set of 1400 synthetic peptides, screened for activity using a high-throughput assay (Hilpert et al., 2005). The set contained random sequences that were biased toward amino acids believed, from substitution analyses, to be important for antibacterial activity. Three-dimensional structures for these peptides were estimated and descriptors were calculated for each peptide. These values were related to the measured antibacterial activity using artificial neural network models that classify peptides as active or inactive. To demonstrate the power of these techniques for identifying drug targets we performed an in silico screening of 100,000 peptides and demonstrated, by synthesizing example peptides from each activity quartile, that peptides with superior activity could be identified with 94% accuracy. However, the complexity of the artificial neural network solution prevents us from 'inverting' it to determine directly which peptide sequences are predicted to be active; instead, a small number of active peptides are identified from a large set of in silico candidates by computational evaluation. A common problem in drug discovery is that an exhaustive search is not possible, owing to the massive number of possible peptide variants (20^X, where X is the number of amino acids in the peptide chain) and the time and resources needed for QSAR descriptor calculations.
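To give a sense of scale for the search space described above, the following minimal sketch counts the possible sequences for a peptide of a given length and compares that with a library of 100,000 sequences. The function name is hypothetical and chosen only for illustration; the sketch assumes all 20 standard amino acids are allowed at every position.

```python
def sequence_space_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct peptides of the given length: alphabet_size ** length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return alphabet_size ** length


# 9-residue peptides, as modelled in this chapter
n_nonamers = sequence_space_size(9)
print(n_nonamers)  # 512000000000

# Fraction of that space covered by a 100,000-sequence virtual library
library_size = 100_000
print(f"{library_size / n_nonamers:.2e}")
```

Even a library of 100,000 sequences therefore samples only a tiny fraction of the possible 9-residue peptides, which is the motivation for a guided search.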
We considered that it would be advantageous to use a search strategy that minimizes the number of peptides that must be evaluated to find additional highly active peptides. Here genetic algorithms were applied to this problem, since these evolutionary methods have been applied successfully in other areas of chemoinformatics (Parrill, 1996; Niculescu, 2003; Solmajer and Zupan, 2004; Weaver, 2004). A genetic algorithm is a heuristic method for search and approximation problems and is particularly well suited to problems involving string-like data such as the amino acids in a peptide. Genetic algorithms operate on populations of solutions, iteratively improving them with operations inspired by natural genetic processes: cross-overs (combining parts of two solutions to suggest another) and mutations (randomly changing one part of a solution to generate another). Each solution ('phenotype' in the jargon of genetic algorithms) is composed of elements ('genes') that are randomly modified ('mutated') or shuffled with other solutions ('crossed-over') and evaluated for fitness at each iteration ('generation'). The best solutions are propagated into the next iteration, and new solutions produced by modifying and combining these best solutions are added to the population. In the current work, we demonstrate that a genetic algorithm approach substantially reduces the number of peptides that must be evaluated for in silico screening of synthetic antibacterial peptides with high potency.

5.2  Results and discussion

A genetic algorithm solution requires that the problem be described in terms of a genetic representation, and a fitness function must be specified to permit evaluation of each solution. The genetic algorithm then either passes high fitness individuals on to the next generation, removes low fitness individuals, or creates offspring by cross-over of two existing individuals or by mutation of an existing individual.
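The operations just described can be sketched in a few lines. This is an illustrative sketch only, not the JGAP-based implementation used in this work: the function names, the population handling and the placeholder fitness function (a simple count of W and R residues) are assumptions made for the example, and the mutation rate of 1/15 is interpreted here as a per-residue probability.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def crossover(parent_a: str, parent_b: str, point: int) -> str:
    """Single-point cross-over: head of parent_a joined to tail of parent_b."""
    return parent_a[:point] + parent_b[point:]


def mutate(peptide: str, rate: float, rng: random.Random) -> str:
    """With probability `rate`, redraw each residue at random (may repeat the original)."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else residue
        for residue in peptide
    )


def toy_fitness(peptide: str) -> int:
    """Placeholder fitness (count of W and R); NOT the ANN vote score of this chapter."""
    return sum(peptide.count(aa) for aa in "WR")


def next_generation(population, fitness, rng, n_keep=10, n_offspring=20, rate=1 / 15):
    """Keep the fittest peptides, then add mutated cross-over offspring."""
    unique = list(dict.fromkeys(population))  # de-duplicate, preserving order
    survivors = sorted(unique, key=fitness, reverse=True)[:n_keep]
    offspring = []
    for _ in range(n_offspring):
        parent_a = rng.choice(survivors)
        parent_b = rng.choice(survivors)
        point = rng.randrange(1, len(parent_a))  # cut strictly inside the sequence
        offspring.append(mutate(crossover(parent_a, parent_b, point), rate, rng))
    return survivors + offspring


rng = random.Random(0)  # fixed seed so the run is repeatable
population = ["RVWKIWRWR", "KWKWWRMWR", "RWYYWWRRH", "KKWWYWWKR"]
for _ in range(5):
    population = next_generation(population, toy_fitness, rng)
best = max(population, key=toy_fitness)
print(best, toy_fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in the population can never decrease from one generation to the next; in a real run the placeholder fitness would be replaced by the model-based score.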
Examples of mutation and cross-over that produced dramatic changes in peptide fitness are shown in Figure 5.1: mutation of one amino acid (V to I) increased fitness from 21 to 26, and a cross-over combining portions of two peptides with fitness 20 yielded a peptide with fitness 0.

Figure 5.1. Examples of peptide evolution. Two examples are shown: mutation of a single amino acid that results in an improved peptide, and recombination of two moderate scoring peptides to form one low scoring peptide. Values in round brackets are the fitness scores for the peptides.

  Mutation:       RVWKIWRWR (21) -> RIWKIWRWR (26)
  Recombination:  KWKWWRMWR (20) + RWYYWWRRH (20) -> RWYYWWMWR (0)

5.2.1  Evaluation of peptide fitness score

In our previous studies we created a software system to predict the activity of 9 amino acid peptides. This system was constructed to make maximum use of the available experimental data by utilizing models produced by a stratified 10-fold cross-validation, as described previously. The system consisted of a set of 30 artificial neural network models derived from the 10-fold cross-validation models of the 2 data sets (Set A and B) of screened peptides plus the combined set (Set A+B). These were classification models trained to consider the top 5% as active. Our confidence that a peptide is active could be judged by the number of models that classified the peptide as active. As reported previously, predictions of peptide activity were most accurate when the largest number of models predicted activity: for example, for the top 50 peptides predicted out of a set of 100,000 amino-acid-biased semi-random peptides, the number of models indicating high activity ranged from 25 to 29. For these peptides, the accuracy of predicting highly active peptides was 94%. This number of models indicating high activity was therefore taken as the genetic algorithm fitness score.
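The vote-counting idea behind this fitness score can be sketched as follows. This is a minimal illustration under stated assumptions: the three scoring functions stand in for the 30 trained neural network models (which are not reproduced here), and the function names and cut-off values are hypothetical.

```python
from typing import Callable, Sequence


def top_fraction_threshold(scores: Sequence[float], fraction: float = 0.05) -> float:
    """Lowest score that still falls within the top `fraction` of `scores`."""
    ranked = sorted(scores, reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return ranked[n_top - 1]


def vote_fitness(
    peptide: str,
    models: Sequence[Callable[[str], float]],
    thresholds: Sequence[float],
) -> int:
    """Number of models whose output for `peptide` reaches that model's cut-off."""
    return sum(
        1 for model, cutoff in zip(models, thresholds) if model(peptide) >= cutoff
    )


# Stand-in "models": simple residue counts, NOT the trained ANNs of this work.
toy_models = [
    lambda p: p.count("W"),
    lambda p: p.count("R"),
    lambda p: p.count("K"),
]
toy_thresholds = [3, 3, 3]  # hypothetical per-model cut-offs

print(vote_fitness("RWKRWWRWW", toy_models, toy_thresholds))
print(top_fraction_threshold(list(range(100))))
```

With 30 models in place of the three stand-ins, the returned vote count ranges from 0 to 30, matching the scale of the fitness scores reported in this chapter.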
5.2.2  Initial population of peptides

We executed genetic algorithm searches starting from two initial populations of peptides, for two purposes. Firstly, we wished to identify additional peptides with very high fitness scores, to evaluate the ability of genetic algorithms to identify novel peptides for screening by antibacterial activity assay. Secondly, we wished to understand how the starting population influences the composition of later peptide populations in a search. Both sets of peptides were selected, at different levels of fitness score, from the biased random set of 100,000 peptides we have described previously (Cherkasov et al., in press).

For the first search (Simulation A), we selected peptides that were moderately predicted to be active, having a fitness value of 20 or 21. We selected these peptides as a small initial population that maximized the diversity of amino acids among peptides with this level of initial fitness score, by ensuring that all amino acids present in the library were represented at least to some degree (Table 5.1). An initial set of 19 peptides was selected that included all of the 12 amino acids present in the 594 peptides of the 100,000 having a fitness value of 20 or 21. Since some amino acids had low representation (only 1 peptide each contained G, Q or S, and 2 contained H), we decided to use a small population to minimize the effect of the relatively large numbers of certain other amino acids in the population. Similarly, the initial peptides for Simulation B were selected to have a fitness score of 2, a low score indicating low confidence that these are highly active peptides (Table 5.2).

  Sequence    Score
  KKWWYWWKR   20
  KWKRWFKWR   21
  KWKWWRMWR   20
  MWRKWRRWW   21
  RKKWWWLFR   21
  RLKWWRWRW   21
  RRWRWWWVW   21
  RRWWWRLWW   21
  RRWWWRRWY   21
  RVWKIWRWR   21
  RWIRKIWWR   21
  RWIWWRRWW   21
  RWRWWGWRR   20
  RWRWWWKKT   20
  RWWRWWKQR   20
  RWWWWSRRR   20
  RWYYWWRRH   20
  RYRWWKWRH   20
  TWWWKKWRR   20

Table 5.1.
Initial peptide population for simulation A.

  Sequence    Score
  ARKWWWRWK   2
  AWWRKRKWW   2
  FVKRWWRFR   2
  IGWWWRKRW   2
  IWKRWWRKT   2
  KNWKWWRWR   2
  KRRSWWKWW   2
  KRWRWLRWG   2
  KWWRWRRFI   2
  QRRRWWWWK   2
  RLIRWWIRK   2
  RRKRLYWIW   2
  RRRWYWKWN   2
  RRWRIWWIK   2
  RTYKRWYRW   2
  RWIRWWRQW   2
  RWRHIWWRW   2
  RWWKWRWLM   2
  RWYKHWRFR   2
  SRWWKRRWY   2
  VKRWWWRRM   2
  WWRKLWRKL   2

Table 5.2. Initial peptide population for simulation B.

Peptides were chosen from a set of biased random sequences and had a fitness score of 20 or 21 in simulation A (moderate confidence in activity) or 2 in simulation B (low confidence in activity). Peptides were selected to have diverse amino acid compositions.

5.2.3  Iterative improvement in peptides

The two populations were evolved from their respective starting populations in Simulations A and B. As shown in Figure 5.2, in both simulations scores improved rapidly from the first generation to generation 100, with continued improvement up to generation 600. Peptide fitness also rose rapidly in Simulation B, even though its initial population had much lower scores (right-hand side of Figure 5.2); Figure 5.3 shows the first generations in detail, where a dramatic rise in fitness scores occurred within the first several generations. As expected, throughout the evolution of the population of peptides, the genetic algorithm created a set of peptides having a variety of fitness scores, due to the random nature of novel peptide generation. For Simulation A, the final generation contained 34 peptides, including 10 peptides with a score of 29 and 22 peptides with a score of 26 or higher (Table 5.3). The highest score observed in any of the peptides studied here or previously (Fjell et al., submitted) is 29 rather than 30. This suggests that the method cannot identify any peptides with a higher score than those that were already found. Of the 10 top-scoring peptides, 9 were closely related and start with the sequence RWKRW.
There are 3 other peptides starting with this sequence with lower scores: 28 (RWKRWWRIL), 21 (RWKRWWKVW) and 1 (RWKRWSRLL). The population of peptides always contained a proportion of lower scoring peptides (as seen in the left-hand side of Figure 5.2) due to the random nature of how novel peptides are created by the genetic algorithm. Similarly, the final population for Simulation B, containing 52 peptides, is shown in Table 5.4.

Figure 5.2. Evolution of peptide scores (panels: Simulation A, Simulation B). The fraction of peptides in the population at each range of fitness score is shown.

Figure 5.3. Initial evolution of peptide scores for simulation B.

  Sequence    Fitness Score   Activity
  RKRWWWRWW   29              -
  RWKRWIRWW   29              -
  RWKRWLRWW   29              -
  RWKRWWRIW   29              -
  RWKRWWRLL   29              0.73
  RWKRWWRLW   29              -
  RWKRWWRVW   29              -
  RWKRWWRWI   29              0.38
  RWKRWWRWL   29              -
  RWKRWWRWW   29              0.67
  KKRWWWWFR   28              -
  KRWWWWKFR   28              -
  KWWRWRRWW   28              0.37
  RKRWWWRWL   28              -
  RWKKWWRWL   28              0.38
  RWKKWWRWW   28              0.38
  RWKRWWRIL   28              -
  KKRWWWWWR   27              0.47
  KWKRWRRWW   27              -
  KWKRWWWWR   27              -
  RKRWWWWFR   27              0.41
  KWKRWWWFR   26              0.67 *
  RKRWWWRWR   22              -
  RWKRWWKVW   21              -
  RWKWWWKFR   20              -
  RWKKWWRVW   19              -
  RWYRWWRIW   15              -
  KRWRWWRLL   12              -
  KWKKWWRWL   9               -
  KWKRWWWWL   9               -
  KKKRWRRWW   8               -
  RWKYWWRII   4               -
  RKRWWWRGL   1               -
  RWKRWSRLL   1               -

Table 5.3. Final peptide population, simulation A.
  Sequence    Fitness Score
  IWKRWWWKR   27
  KWKRWWWIR   27
  KWKRWWWWR   27
  RIWKIWWKR   27
  IKKRWWWFR   26
  IKWKRWWWR   26
  KLKRWWWFR   26
  KLKRWWWWR   26
  KWKRWWWFR   26
  KWWKIWRWR   26
  KWWKRWKWR   26
  KWWKRWWIR   26
  KWWKRWWKR   26
  KWWKRWWWR   26
  RFWKIWWKR   26
  RIWKRWWFR   26
  RLWKIWWRR   26
  RLWKRWWFR   26
  RLWKRWWIR   26
  RWWKIWKWR   26
  RWWKIWWKR   26
  RWWKIWWRR   26
  RWWKRWWFR   26
  RWWKRWWIR   26
  RWWKRWWWR   26
  IKKRWWWWR   25
  KLKRWWWIR   25
  KWWKIWWKR   25
  KWWKRWWFR   25
  RIWKRWWWR   25
  RLKRWWWFR   25
  RWKRWWWFR   25
  KLWKRWWWR   24
  RWWKIWRWR   24
  KWWKIWKWR   22
  RWWKWWWIR   22
  CWKRWWWKR   21
  RFWKIWRWR   21
  KWKRIWWKR   19
  RWWKRWAIR   19
  RTWKRWWIR   18
  RTWKIWKWR   12
  KWWKRWWIH   11
  KWWKRWSWR   10
  RLWTRWWFR   9
  RIWARWWFR   7
  KWWKDWWKR   6
  RFEKIWWKR   6
  RIDKIWLKR   5
  RLWKNWWRR   2
  RFWQIWRWR   0
  RWSKRWWWV   0

Table 5.4. Final peptide population, simulation B.

In Tables 5.3 and 5.4, the final generation (generation 600) of peptides is sorted by score. The common subsequence RWKRW in Table 5.3 is discussed in the text. Activity values for 9 peptide sequences were determined using the bioluminescence assay against P. aeruginosa; units are IC50 relative to the Bac2A control peptide. '-' indicates activity not determined. Two peptides appear in both final populations, KWKRWWWFR and KWKRWWWWR. * average of two peptide measurements.

Two peptides (KWKRWWWFR and KWKRWWWWR) were common to the final populations of Simulations A and B. Apart from these two, there were no peptides in common between the two final populations, indicating that the processes followed were stochastic. In addition, Simulation B had no peptides with a fitness score above 28 but had more high-scoring peptides: 25 peptides with a fitness score of 26 or above. This indicates that the specific peptides in the final population were largely dependent on the initial population of peptides.
This is to be expected given the nature of the genetic algorithm, since the dominant method of generating novel sequences is cross-over from previous peptides; with the genetic algorithm parameters used here, mutation affects only a comparatively small number of single amino acids in each generation. The number of high fitness score peptides appeared to be unchanged between generation 400 and generation 600 (Figure 5.2) for both Simulation A and B, suggesting that in each case the genetic algorithm had settled on a locally optimal set of sequences from which it was unlikely to escape through continued evolution. Further improvements would likely require introduction of peptides with dramatically different sequences into the population.

5.2.4  Evolution of amino acid composition

The amino acid distribution of the peptide populations varied during the peptide sequence evolution (shown in Figure 5.4). As described above, the number of amino acid types was maximized when selecting the initial population, to include 14 amino acid types for Simulation A and 16 for Simulation B. During evolution over the 600 generations, the number of amino acid types was reduced to 7 (in declining proportion: W, R, K, L, I, F, V) for the high scoring peptides in Simulation A and to 6 (in declining proportion: W, R, K, I, F, L) for the high scoring peptides in Simulation B. These proportions are similar to those we found previously for high scoring peptides sampled from a biased random library of 100,000 peptides.

Figure 5.4. Evolution of peptide amino acid composition. Simulation A is shown on the left-hand side and Simulation B is on the right-hand side. The initial populations of peptides (top panels) have higher amino acid diversity which is lost as the populations evolve (middle panels show generation 600 for all peptides).
The high scoring peptides (fitness score >= 26) have the lowest diversity and show similar amino acid proportions (bottom panels).

5.2.5  Assessment of genetic algorithm performance

In our previous study (Fjell et al., submitted), we examined 100,000 peptides from a biased random library of sequences. We empirically tested the activity of the 50 peptides ranked highest by fitness score. As we reported previously, 94% of these peptides were found to be highly active. This group of highly active peptides included all peptides with fitness scores of 29 to 26, and some of the peptides scoring 25. (Some peptides scoring 25 were also outside of this group.) Therefore, for comparison we considered here that peptides receiving a fitness score of 26 or higher could be relatively confidently predicted to have high antibacterial activity. As reported previously, a total of 22 peptides scoring 26 or higher were identified by examining 99,576 peptides in the random library (the 100,000 random peptides minus duplicates), or 0.026% highly active peptides of those evaluated. In contrast, over all generations of the simulated evolution of the peptide populations, the genetic algorithms identified 22 peptides scoring 26 or above by evaluating a total of 4,492 peptides (0.49% highly active) in Simulation A, and 25 peptides scoring 26 or above by evaluating 5,067 peptides (0.51% highly active) in Simulation B, for a combined efficiency of 0.50% highly active peptides identified per peptide evaluated. Taking these two values as representative of the two methods (0.026% for searching a large random library and 0.50% for genetic algorithm search), we observed a 19-fold enhancement in discovery of highly active peptides. In addition, the progressive clustering of peptide scores in the high scoring region was much slower after the first 100 generations.
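The efficiency comparison above amounts to two simple ratios, sketched below. The function names are hypothetical; the inputs are the Simulation A counts and the two representative percentages quoted in the text.

```python
def hit_rate_percent(n_hits: int, n_evaluated: int) -> float:
    """Percentage of evaluated peptides that reached the high-score cut-off."""
    return 100.0 * n_hits / n_evaluated


def fold_enhancement(rate_new: float, rate_reference: float) -> float:
    """How many times larger the new hit rate is than the reference hit rate."""
    return rate_new / rate_reference


# Simulation A: 22 peptides scoring 26 or above among 4,492 evaluated
print(f"{hit_rate_percent(22, 4492):.2f}%")

# Representative rates from the text: 0.50% (genetic algorithm) vs 0.026% (library)
print(f"{fold_enhancement(0.50, 0.026):.1f}-fold")
```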
The slower clustering after roughly generation 100 suggests that stopping the genetic algorithm at about that point would be more efficient, since further peptides are not efficiently identified afterwards.

The antibacterial activity of a selection of peptides was measured using the luminescence assay as described previously (see Hilpert et al., 2005). In this classification work, we considered a peptide to be highly active if its IC50 was less than half that of the control peptide, Bac2A. The relative IC50 values in Table 5.3 indicate that 6 of the 9 peptides assayed (67%) were highly active (relative IC50 < 0.5), with the remainder more active than the control but not below this threshold. This accuracy of 67% is lower than the 94% we found before. We believe two factors may have contributed to this discrepancy: chance, given the small set of samples (9 peptides), and variability in the luminescence assay for antibacterial activity.

5.3  Conclusions

We have described here the use of a genetic algorithm to efficiently identify novel peptides that have a high likelihood of being strongly antibacterial. In our previous work, we created software models using artificial neural networks that were found to be up to 94% accurate in predicting highly active peptides. However, our previous work utilized a very large in silico library of 100,000 biased-random sequences to identify additional peptides. In the current study, we demonstrated that the heuristic search method of genetic algorithms identifies additional active peptides with considerably greater efficiency (0.50% of evaluated peptides) than our previous work with biased random sequences (0.026% of evaluated peptides). Currently, we evaluate QSAR descriptors for each peptide using commercial software (MOE) on a limited number of computers, a situation that significantly limits the number of peptides that can be evaluated.
Hence, we find that the increased efficiency of genetic algorithm methods allows a dramatically increased capability to identify novel antimicrobial peptide candidates.

5.4  Materials and methods

5.4.1  Creation of classification models for highly active peptides

As described previously (Cherkasov et al., in press; Fjell et al., submitted), we constructed a software modelling system to classify peptides as highly active or inactive, based on a set of 44 QSAR descriptors calculated for each peptide combined with machine learning using artificial neural networks (ANNs). Briefly, we constructed a set of 30 ANNs that classify a peptide as highly active or inactive. These 30 ANNs were trained based on a 10-fold cross-validation of 3 data sets consisting of over 1400 peptides whose activities were measured using a high-throughput luminescence assay against a modified strain of Pseudomonas aeruginosa luxCDABE (see below, Peptide activity assay). The top 5% of each set of peptides was defined as highly active (ANN output value 1) and the rest as low activity (ANN output value 0). Data manipulation and normalization were performed using scripts in the R language (R Development Core Team, 2005; http://r-project.org).

5.4.2  Evaluation of peptide fitness

In our previous study (Fjell et al., submitted), each of the 30 trained ANNs was used to rank a set of 100,000 test peptides. For each ANN, the ANN output value that determined the top 5% of the 100,000 peptides was identified. Using these thresholds, a single fitness score was defined as the number of ANNs that classify an input peptide as being in the top 5% of peptides (i.e. the number of 'votes' that a peptide is highly active). Here, we use the same threshold values derived from the 100,000 random peptides to classify novel peptides, using the number of 'votes' as the fitness score for the genetic algorithm.

5.4.3  Initial peptide population

Two simulated evolution experiments were performed here.
Small initial populations of peptides were selected from the biased random population of 100,000 peptides to maximize the diversity of amino acids present in the population. Peptides containing all the 12 amino acids (F, G, H, I, K, L, M, Q, S, T, V, and Y) present in the population were selected at two levels of fitness score. In Simulation A, 19 peptides were selected from the biased random population of 100,000 peptides, each with a moderate prediction of activity (fitness score of 20 or 21). In Simulation B, 20 peptides were selected having a fitness score of 2.

5.4.4  Evolution of peptide sequences

The initial populations of peptides were evolved over 600 generations using custom Java code utilizing the JGAP 3.2 (http://jgap.sourceforge.net) genetic algorithm package and converting single-letter amino acid peptides into integer arrays for manipulation. QSAR descriptors were calculated through embedded calls to MOE (Molecular Operating Environment, 2005, by Chemical Computing Group Inc., Montreal, Canada) from the Java code. The population size was allowed to vary to ensure all high scoring peptides remained in the population. A mutation rate of 1/15 was used.

5.4.5  Evaluation of peptide antibacterial activity

Antibacterial activity of synthesized peptides was determined using a luciferase-based in vitro assay and reported as inhibitory concentration at 50% (IC50) relative to a control peptide. Peptides were synthesized on cellulose support with a pipetting robot using two glycine residues as linker, as previously described (Hilpert et al., 2005). Briefly, peptides were cleaved from the dried membrane in an ammonia atmosphere, resulting in free peptides with two glycines at the amidated C terminus due to the linker sequence. The peptide spots were punched out and transferred to 96-well microtiter plates in sets of 10 along with a positive control peptide (Bac2A). An overnight culture of P.
aeruginosa PAO1 strain H1001 (containing a luciferase gene cassette luxCDABE) was diluted at 1:500 ratio with 100 mM Tris buffer (pH 7.3), 20 mM glucose. This diluted culture was added to the microtiter plate wells (100 µL/well) containing the peptide spots and controls. After 30 min incubation, serial dilutions were performed from the membrane spots to successive rows of the plate. Luminescence of the P. aeruginosa was measured for 8 dilutions at 4 hours using a Tecan Spectra Fluor Plus (Tecan US). As described previously (Fjell et al., submitted), each luminescence profile for the dilution series was used to calculate the IC50 relative to the Bac2A peptide (a control peptide with low activity), by fitting the luminescence values to a sigmoid curve and normalizing the peptide values to the Bac2A values found on the same plate. Parameter estimation was performed using custom C software and routines from Numerical Recipes in C (Press et al., 1992).

5.5  References

Cherkasov, A. (2005a) ‘Inductive’ Descriptors. 10 Successful Years in QSAR. Current Computer-Aided Drug Design, 1: 21-42.
Cherkasov, A. (2005b) Inductive QSAR Descriptors. Distinguishing Compounds with Antibacterial Activity by Artificial Neural Networks. Int. J. Mol. Sci., 6: 63-86.
Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhanced to treat microbial infections? Nature Reviews Microbiology, 2: 497-504.
Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212.
Hancock, R.E.W., and Sahl, H.G. (2006) Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24: 1551-1557.
Hilpert, K., Volkmer-Engert, R., Walter, T., Hancock, R.E.W. (2005) High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnology, 23: 1008-1012.
Jenssen, H., Hamill, P., and Hancock, R.E.W. (2006) Peptide Antimicrobial Agents.
Clinical Microbiology Reviews, 19: 491-511.
Karakoc, E., Cherkasov, A., Sahinalp, S.C. (2006a) Distance based algorithms for small biomolecule classification and structural similarity search. Bioinformatics, 15: 243-251.
Karakoc, E., Sahinalp, S.C., and Cherkasov, A. (2006b) Comparative QSAR- and fragments distribution analysis of drugs, druglikes, metabolic substances, and antimicrobial compounds. J. Chem. Inf. Model., 46: 2167-2182.
Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407.
Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122-S129.
Niculescu, S.P. (2003) Artificial neural networks and genetic algorithms in QSAR. Journal of Molecular Structure (Theochem), 622: 71-83.
Parrill, A.L. (1996) Evolutionary and genetic methods in drug design. Drug Discovery Today, 1: 514-521.
Perkins, R., Fang, H., Tong, W., and Welsh, W.J. (2003) Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environmental Toxicology and Chemistry, 22: 1666-79.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992) Numerical Recipes in C: The Art of Scientific Computing (2nd Edition). Cambridge University Press, New York.
R Development Core Team (2005) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Solmajer, T. and Zupan, J. (2004) Optimization algorithms and natural computing in drug discovery. DDT, 1: 247-252.
Weaver, D.C. (2004) Applying data mining techniques to library design, lead generation and lead optimization. Current Opinion in Chemical Biology, 8: 264-270.
Yeaman, M.R., Yount, N.Y. (2003) Mechanisms of Antimicrobial Peptide Action and Resistance. Pharmacol. Rev., 55: 27-55.
Chapter 6: Summary and conclusions

6.1  Summary

This thesis describes the bioinformatic and chemoinformatic analysis of gene-coded antimicrobial peptides (host defense peptides) and synthetic antimicrobial peptides. This work addressed the hypotheses that additional novel gene-coded antimicrobial peptides can be identified by sequence analysis, and that highly active synthetic antimicrobial peptides can be identified using chemoinformatics and machine learning methods.

6.1.1  Gene-coded antimicrobial peptides

Antimicrobial peptides (AMPs) represent a diverse class of natural peptides that form part of the innate immune system of mammals, insects, amphibians, and plants, among others (for example, Sima et al., 2003a, 2003b). Prior to the work described here, over 880 different antimicrobial peptides had been identified or predicted from nucleic acid sequence (Brogden, 2005). These peptides fall into a number of diverse classes characterized by charge, peptide structure, and amino acid composition. The first hypothesis of the thesis is that analysis of existing peptides and construction of bioinformatic models can identify additional antimicrobial peptides, both from known proteins (unacknowledged antimicrobial or host defense function among known proteins) and from unannotated sequence. The objective was to create a software resource that classifies existing AMPs and can be used to identify additional AMPs from nucleic acid or amino acid sequence.

Profile hidden Markov models (HMMs) are widely used for bioinformatic analysis of biological sequences. HMMs for both mature peptides and propeptides were constructed as follows. Propeptide and mature peptide sequences were identified from annotation of the Uniprot proteins identified as AMPs. These were clustered separately into a total of 146 models for mature peptides and 40 for propeptides. These corresponded to known AMP classes and subclasses such as defensins and cathelicidins.
HMMs were created based on multiple alignments of these clusters. Additional peptides were identified by iteratively scanning the Swiss-Prot database with these HMMs, with the clusters and HMMs rebuilt after each search. As a result, an additional 229 AMPs have been identified from Swiss-Prot, and all but 34 could be associated with known antimicrobial or host defense activities according to the literature. The final set of 1045 mature peptides and 253 propeptides has been organized into the open-source AMPer resource, available to the community at http://www.cnbi2.com/cgi-bin/amp.pl. A manuscript describing the AMPer resource was published (Fjell, et al., 2007).

The set of HMMs from AMPer was used to screen bovine sequence for novel AMPs. The set of available expressed sequence tags (ESTs) from NCBI and the draft genome sequence were scanned. Of the 34 known bovine AMPs, 27 were identified with high confidence in the AMPs predicted from ESTs. A further 69 potential AMPs predicted from the EST data appear to be novel. Two of these were cathelicidins and were selected for experimental verification in RNA derived from bovine tissue. One predicted AMP, most similar to rabbit '15 kDa protein' AMP, was confirmed to be present in infected bovine intestinal tissue using PCR. These findings demonstrated the practical applicability of the developed bioinformatics approach and laid a foundation for future discoveries of gene-coded AMPs. In addition, no members of the alpha-defensin family were found in the bovine sequences, suggesting that cattle lack this important family of host defense peptides. A manuscript has been published (Fjell C.D., Jenssen H., Fries P., Aich P., Griebel P., Hilpert K., Hancock R.E., Cherkasov A. (2008) Identification of novel host defense peptides and the absence of alpha-defensins in the bovine genome. Proteins.
73:420-30).

6.1.2  Synthetic antimicrobial peptides

With increasing antibiotic resistance in pathogenic microorganisms, antimicrobial peptides have drawn significant scientific attention as a novel class of antimicrobial therapeutics, both as antibacterial drugs and as modulators of innate immunity (Hamilton-Miller, 2004; Levy and Marshall, 2004; Koczulla and Bals, 2003; Finlay and Hancock, 2004; Hancock and Sahl, 2006). AMPs demonstrate fast target killing, a broad range of activity, low toxicity and minimal development of resistance in target organisms. Extensive efforts have been made to develop qualitative structure-activity relationships, but there has been no means of relating peptide characteristics to antibacterial activity outside of peptides with very similar structures. The importance of charge, hydrophobicity and amphipathicity is well known; however, high potency peptides cannot be easily selected by manipulation of the amino acid sequence (Tossi et al., 2000).

One hypothesis of this thesis is that highly antibacterial peptides can be identified by a combination of non-linear machine learning algorithms and quantitative structure-activity relationship (QSAR) analysis that utilizes descriptors sensitive to the 3D atomic conformation of the peptide. We calculated descriptors for over 1400 peptides for which the antibacterial activity had been measured using a high-throughput assay (Hilpert et al., 2005). We built artificial neural network models to classify peptides as active or inactive based on these descriptors and screened a virtual library containing nearly 100,000 biased random sequences. A total of 200 peptides, predicted to have activity ranging from highly active to inactive, were selected for synthesis. The predictions were remarkably accurate: 94% of the 50 peptides predicted to be most active showed high activity, and all of the 50 predicted to be least active had low activity.
This work represents the first high-throughput in silico screening for novel antibacterial peptides suitable for drug leads. A manuscript describing these methods and results has been submitted to Journal of Medicinal Chemistry (Fjell, C.D., Hilpert, K., Jenssen, H., Cheung, W.A., Panté, N., Hancock, R.E.W., and Cherkasov, A. Identification of Novel Antibacterial Peptides by Chemoinformatics and Machine Learning) and a manuscript including these results has been accepted for publication (Cherkasov, A., Hilpert, K., Jenssen, H., Fjell, C.D., Waldbrook, M., Mullaly, S.C., Volkmer, R., and Hancock, R.E.W. Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic resistant Superbugs. ACS Chemical Biology). A serious constraint on the use of QSAR descriptors utilizing 3D atomic resolution information is the computational expense in time and resources. In order to confidently identify a set of 50 peptide sequences possessing high antibacterial activity, we screened a virtual library of 100,000 peptides. The hypothesis of Chapter 5 is that an evolutionary search method called a genetic algorithm can be used to efficiently search through the possibilities of peptide sequences to identify additional peptides that are likely to be highly antibacterial. Genetic algorithms (GAs) mimic biological evolutionary processes to estimate solutions to computational problems where solutions can be represented in a string-like format ('chromosomes' in GA jargon) through the random variation of existing solutions ('mutation'), or combination of existing solutions ('cross-over'). We found that our implementation of a GA method provides a large improvement in identification of novel antibacterial peptides. 
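The mutation and cross-over operations described above can be sketched roughly as follows. This is a minimal illustration and not the implementation used in this thesis: the scoring function `toy_score`, the peptide length, the population size, the mutation rate, and the choice of tournament selection with elitism are all assumptions made for the example. In the actual work the fitness of a sequence would come from the trained neural network models rather than from a residue count.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(peptide):
    """Hypothetical stand-in for a trained activity model (NOT the thesis ANN):
    the fraction of residues that are cationic (K, R) or hydrophobic (W, I, L, V)."""
    favoured = sum(peptide.count(a) for a in "KRWILV")
    return favoured / len(peptide)


def mutate(peptide, rate, rng):
    """'Mutation': replace each residue with a random one with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else a
                   for a in peptide)


def crossover(a, b, rng):
    """'Cross-over': join the head of one parent to the tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def tournament(population, score, rng, k=3):
    """Pick the best of k randomly sampled individuals as a parent."""
    return max(rng.sample(population, k), key=score)


def evolve(score, length=9, pop_size=50, generations=30, rate=0.05, seed=0):
    """Run the GA and return the final population, best-scoring first."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: carry the current best sequence forward unchanged.
        children = [max(population, key=score)]
        while len(children) < pop_size:
            child = crossover(tournament(population, score, rng),
                              tournament(population, score, rng), rng)
            children.append(mutate(child, rate, rng))
        population = children
    return sorted(population, key=score, reverse=True)


if __name__ == "__main__":
    for peptide in evolve(toy_score)[:5]:
        print(peptide, round(toy_score(peptide), 2))
```

Because each generation only evaluates `pop_size` sequences, the total number of model evaluations is `pop_size * generations`, which is the quantity that matters when every evaluation requires expensive descriptor calculations.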
Approximately 0.49% of peptides evaluated during the GA method were classified as highly active, while only 0.026% of the nearly 100,000 sequences we previously screened were classified as highly active at the same level (a 19-fold increase). Since the computational effort to screen in silico libraries dominates the cost of these virtual screening methods, we find that use of GA significantly improves the possibility of identifying peptides that may lead to novel antibiotic therapeutics.

6.2  Conclusions and future directions

The first two chapters of this thesis describe the successful development and application of bioinformatics methods to identify gene-coded antimicrobial peptides and the creation of the most comprehensive database of its kind. Further refinements to the methods used in AMPer are possible, but I consider that these will not dramatically increase the utility of the resource. For example, as described in Chapter 2, the creation of initial peptide clusters required a choice of threshold value for global sequence similarity (a value of 30% was used) with additional manual editing. More complex methods could have been used to compare peptides in each cluster based on three-dimensional structure or physical properties, using techniques such as threading, since empirical 3D structures are available for a number of these peptides (Höltje et al., 2003). While this modelling effort may improve the similarity of peptides in each cluster, and may reduce the number of clusters by finding clusters to merge, this is not likely to dramatically improve the number or quality of clusters. Other methods have been developed for remote homology detection that may lead to improved detection of AMPs in unannotated sequence.  
However, these methods (for example, Hochreiter et al., 2007) require significantly more data than is available for AMPs; typically the SCOP dataset containing over 30,000 entries is used (Hou et al., 2004; Kuang et al., 2005; Lingner and Meinicke, 2006; Rangwala and Karypis, 2005). The AMPer website (http://www.cnbi2.com/cgi-bin/amp.pl) also serves as a resource to compare novel peptides to known classes of AMPs using the HMMs, as well as to display the results in the context of the multiple alignments and sequence profiles on which the models are based. Currently, only the draft bovine genome and EST data set from NCBI have been searched using AMPer, primarily as a demonstration of the utility of the method. The software models and applications needed to scan novel sequences are all freely available to the public and do not require a license, so investigators are able to apply these methods to data sets of their choice. Scanning multiple organisms would allow investigation of evolutionary relationships between known and newly identified AMPs that may shed additional light on mechanisms of the innate immune system. Chapters 4 and 5 of this thesis concern analysis of synthetic antimicrobial peptides. Chapter 4 describes the successful identification of highly active antibacterial peptides by a combination of non-linear machine learning algorithms and atomic resolution QSAR. Chapter 5 describes an efficient method of generating candidate peptides. Further analysis of the importance of the descriptors for prediction may yield valuable insight into the commonalities between highly active peptides that are recognized by the models. Unfortunately, complex modelling methods such as artificial neural networks do not lend themselves to this analysis. 
Previous efforts to derive models that are more easily interpretable using logistic regression and principal component regression (methods described in Hastie et al., 2001) did not result in useful models (data not shown). Additional modelling techniques could be used for comparison with the ANN results; this was not done only in the interest of time. In Chapter 5, simulated evolution by genetic algorithms starting from two initial populations of peptides ended in nearly non-overlapping final populations of high-scoring peptides. Therefore, I expect that these models are capable of identifying hundreds if not thousands of peptides with a high likelihood of being potent antibacterial agents. It is hoped that future work will further investigate which of these are suitable as antibacterial agents in the clinic.

6.3  References

Brogden, K. A. (2005). Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol., 3: 238–250.
Finlay, B.B., Hancock, R.E.W. (2004) Can innate immunity be enhanced to treat microbial infections? Nat. Rev. Microbiol., 2: 497-504.
Fjell, C. D., R. E. Hancock, et al. (2007). AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 23: 1148-1155.
Hamilton-Miller, J.M.T. (2004) Antibiotic resistance from two perspectives: man and microbe. International Journal of Antimicrobial Agents, 23: 209-212.
Hancock, R.E.W., and Sahl, H.G. (2006). Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24: 1551-1557.
Hastie, T., Tibshirani, R., and Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Hilpert, K., Volkmer-Engert, R., Walter, T., Hancock, R.E.W. (2005) High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnology, 23: 1008-1012.
Hou, Y. et al. 
(2004) Remote homolog detection using local sequence-structure correlations. Proteins: Struct., Funct. and Bioinformatics, 57: 518–530.
Hochreiter, S., Heusel, M., Obermayer, K. (2007) Fast model-based protein homology detection without alignment. Bioinformatics, 23: 1728-1736.
Höltje, H.-D., Sippl, W., Rognan, D., and Folkers, G. (2003) Section 4.3: Comparative protein modeling. In Molecular Modeling, Basic Principles and Applications. pp 100-116. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany.
Koczulla, A.R., Bals, R. (2003) Antimicrobial Peptides: Current Status and Therapeutic Potential. Drugs, 63: 389-407.
Kuang, R., Ie, E., Wang, K., Wang, K., Siddiqi, M., Freund, Y., Leslie, C. (2005) Profile-based string kernels for remote homology detection and motif extraction. J. Bioinf. Comp. Biology, 3: 527–550.
Levy, S.B., Marshall, B. (2004) Antibacterial resistance worldwide: causes, challenges and responses. Nature Medicine, 10: S122-S129.
Lingner, T. and Meinicke, P. (2006) Remote homology detection based on oligomer distances. Bioinformatics, 22: 2224–2236.
Rangwala, H. and Karypis, G. (2005) Profile based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21: 4239–4247.
Sima, P., Trebichavsky, I., Sigler, K. (2003) Mammalian antibiotic peptides. Folia Microbiol., 48: 123-137.
Sima, P., Trebichavsky, I., Sigler, K. (2003) Non-mammalian vertebrate antibiotic peptides. Folia Microbiol., 48: 709-724.
Tossi, A., Sandri, L. & Giangaspero, A. (2000) Amphipathic, α-helical antimicrobial peptides. Biopolymers, 55: 4–30.

Appendix A: Epilogue

There are several lessons learned that I might note after completing this thesis. The first and most significant lesson is the importance of finding a narrow focus for the research as early as possible, thus allowing a more comprehensive treatment of the research area in the available time. 
I have spent a fair amount of time during my PhD studies on work that did not yield publications, for a couple of reasons. Either I did not come up with anything novel to report (a sometimes unavoidable research outcome), or I spent time redeveloping software tools that were available elsewhere. For the computational biologist or chemist, the vast assortment of applications, databases and algorithms available for any task is nearly overwhelming (for example, there are at least 384 software packages available to calculate phylogeny from sequence similarity, http://evolution.genetics.washington.edu/phylip/software.html). However, after a significant investment of time spent evaluating available tools, many of them turn out not to work for your purpose, will not execute on your hardware or operating system, or are too poorly implemented to be useful. So the question is a difficult one: write your own code or continue looking for existing code. My inclination has been to write my own; sometimes this was necessary, but often it was not. Particularly for machine learning and statistical analysis, the R-project statistical language and resources (http://www.r-project.org) provide high-quality software, but involve a very steep learning curve to use the code for anything but trivial work. R is a vector-based language quite unlike any I have used previously, but I would have been far more productive over the years if I had learned those coding skills and what the resource had to offer at the beginning. The work of a computational biologist or chemist must eventually be validated against nature through direct experiment. This means that there is always a reliance on wet lab experimentalists who may or may not see the worth in the computational work. Ultimately the decision to proceed with expensive experimental work to validate predictions came, in my case, out of their grant money. I have been very fortunate for the opportunity to collaborate with the Hancock lab. 
Conflicts over research direction and authorship issues have been minimal. I returned to graduate studies eight years after I completed my M.Sc. in Physics (in radiation biochemistry), which was also some years after I completed my engineering undergraduate degree. I have not regretted my decision to return to graduate studies for a Ph.D. after such a long time; rather, I regret I did not do so earlier. For someone considering such a move as late in life as I did, this is not an easy decision. There are added difficulties such as the family needs of older children (time and financial) as well as mounting financial pressure to plan for a retirement without poverty. But I have found that the work has been both more satisfying and more important than what would otherwise have been available to me. I hope this experience has also indirectly made for a more enriching environment in which my children will grow up.  
