UBC Theses and Dissertations


The discovery of small molecule inhibitors for TOX1 and ERG oncotargets with the development and use… Agrawal, Vibudh 2019

THE DISCOVERY OF SMALL MOLECULE INHIBITORS FOR TOX1 AND ERG ONCOTARGETS WITH THE DEVELOPMENT AND USE OF PROGRESSIVE DOCKING PD2.0 APPROACH

by

Vibudh Agrawal

B.Tech., Indian Institute of Technology Bombay (IIT-B), 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
Vancouver

October 2019

© Vibudh Agrawal, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis/dissertation entitled "The discovery of small molecule inhibitors for TOX1 and ERG oncotargets with the development of progressive docking PD2.0 approach", submitted by Vibudh Agrawal in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics.

Examining Committee:

Dr. Artem Cherkasov, Supervisor
Dr. Michael Cox, Supervisory Committee Member
Dr. Youwen Zhou, Supervisory Committee Member

Abstract

Drug discovery is a rigorous process that can cost up to $3 billion and take more than 10 years to bring new therapeutics from bench to bedside. While virtual screening (such as molecular docking) can significantly speed up the discovery process and improve hit rates, its speed lags behind the explosive growth of publicly available chemical databases, which already exceed billions of entries. This recent surge of available chemical entities presents great opportunities for discovering novel classes of small molecule drugs, but also brings a significant demand for faster docking methods.
In the current thesis, we first illustrated the need for a faster screening method by virtually screening 7.6 million molecules against the Thymocyte selection-associated high mobility group box protein (TOX). We then demonstrated that the deep learning-based method of "Progressive Docking (PD2.0)" can speed up such virtual screening by up to a hundredfold. In particular, by utilizing deep learning QSAR models trained on the docking scores of a subset of the database, one can iteratively approximate the docking outcome of the unprocessed entries. We tested the developed method against various targets, including the ETS transcription factor ERG, Estrogen Receptor Activation Function 2 (ER-AF2), Androgen Receptor (AR), Estrogen Receptor (ER) and Sodium-Ion Channel (Nav1.7), among others.

In this work, we identified 18 active compounds against TOX with micromolar potency. We also used the PD2.0 method to dock up to 1.3 billion compounds from the ZINC15 database and demonstrated that this deep-learning-based approach resulted in a 65X speed acceleration and a 130X Full Predicted Database Enrichment (FPDE) while retaining more than 90% of good hits. We also demonstrated the method's robustness by docking 570 million compounds from the ZINC15 database into 13 diverse drug targets, including ERG.

Lay Summary

Therapeutics discovery is an expensive process that can cost up to $3 billion and take more than 10 years. The modern drug discovery process relies on two main components: 1) computational, in which potential drug candidates are identified using virtual screening, and 2) experimental, in which these computationally identified drug candidates are tested in actual lab settings. The objective of this thesis was to significantly accelerate the in silico identification of potential drug candidates using methods of artificial intelligence (AI). In the last 15 years, the number of synthesizable molecules has increased from less than one million to more than a billion.
Current computational methods of molecular docking are therefore not fast enough to screen billions of molecules, and there is a critical demand for faster VS methods and protocols. In this work, we present such an AI-enabled approach, called "progressive docking (PD) 2.0".

Preface

The following is the contribution breakdown:

i) Dr. Michael Hsing and I, from Dr. Artem Cherkasov's lab at the Vancouver Prostate Centre, were responsible for the computational part (CADD), while Dr. Mingwan Su and Yuanshen Huang, from Dr. Youwen Zhou's lab at the Skin Care Centre, performed the experimental work for the computer-aided drug discovery of small molecules against TOX (Chapter 3). Dr. Michael Hsing, Dr. Mingwan Su and I were responsible for manuscript composition, and Chapter 3 has been published as: Agrawal V, Su M, Huang Y, Hsing M, Cherkasov A, Zhou Y. Computer-Aided Discovery of Small Molecule Inhibitors of Thymocyte Selection-Associated High Mobility Group Box Protein (TOX) as Potential Therapeutics for Cutaneous T-Cell Lymphomas. Molecules. 2019;24(19):3459.

ii) For PD2.0 (Chapter 4), the idea came from a previous publication of my PI, Dr. Artem Cherkasov, and I did most of the analysis and pipeline building (with guidance and discussions with Dr. Michael Hsing and Dr. Francesco Gentile from Dr. Artem Cherkasov's lab at the Vancouver Prostate Centre). For the conformal prediction part, I received help from Dr. Ulf Norinder from the Department of Computer and Systems Sciences, Stockholm University. Dr. Michael Hsing, Dr. Francesco Gentile and I were responsible for manuscript composition, and Chapter 4 has been published as: Agrawal V, Gentile F, Hsing M, Ban F, Cherkasov A. Progressive Docking-Deep Learning Based Approach for Accelerated Virtual Screening. Paper presented at the International Conference on Artificial Neural Networks, 2019.

iii) I was the lead investigator for Chapter 5; manuscripts are planned but not yet published.
Table of Contents

Abstract .......... iii
Lay Summary .......... iv
Preface .......... v
Table of Contents .......... vi
List of Tables .......... x
List of Figures .......... xiii
Glossary .......... xx
Acknowledgments .......... xxi
Chapter 1: Introduction .......... 1
1.1 Artificial intelligence: a brief overview .......... 1
1.2 Computer-Aided Drug Design (CADD) .......... 2
1.3 Molecule databases .......... 3
1.4 Mode of drug delivery .......... 4
1.5 Virtual screening (VS) .......... 5
1.5.1 Ligand-based virtual screening .......... 5
1.5.2 Structure-based virtual screening .......... 5
1.5.3 Molecular docking .......... 6
1.5.4 Ultra-large database docking .......... 7
1.5.5 Machine learning and docking .......... 7
1.5.5.1 Scoring power .......... 8
1.5.5.2 Ranking power .......... 9
1.5.5.3 Docking power .......... 10
1.5.6 Predicting docking score using machine learning .......... 11
1.5.7 Structure-Activity Relationship (SAR) and lead optimization .......... 12
1.5.8 ADMET prediction by QSAR .......... 13
1.5.9 In-house CADD pipeline .......... 13
1.6 Protein targets .......... 15
1.6.1 Thymocyte selection-associated high mobility group box protein (TOX) .......... 15
1.6.2 ETS transcription factor ERG .......... 15
1.6.3 Other targets .......... 16
Chapter 2: Problem and specific aims .......... 19
2.1 Problem .......... 19
2.2 Aims and objectives .......... 21
Chapter 3: Computer-aided discovery of small molecule inhibitors of thymocyte selection-associated high mobility group box protein (TOX) .......... 23
3.1 Introduction .......... 23
3.2 Methods .......... 25
3.2.1 Structural evaluation of TOX druggability .......... 25
3.2.2 In silico screening .......... 26
3.2.3 In vitro screening .......... 28
3.3 Results .......... 30
3.3.1 Druggability assessment of the TOX HMG-box domain .......... 30
3.3.2 Large-scale in silico screening .......... 31
3.3.3 In vitro experimental validation .......... 32
3.4 Discussion and future direction .......... 36
Chapter 4: Accelerating docking using progressive docking 2.0 .......... 40
4.1 Introduction .......... 40
4.2 Methods .......... 42
4.2.1 Molecule fingerprints .......... 42
4.2.2 Molecular docking .......... 43
4.2.3 Clustering .......... 43
4.2.3.1 K-means clustering .......... 43
4.2.3.2 Mini-batch K-means clustering .......... 44
4.2.4 Logistic regression .......... 45
4.2.5 Support vector machine .......... 45
4.2.6 Random forest .......... 46
4.2.7 Feedforward neural network .......... 47
4.2.8 Evaluation metrics .......... 48
4.2.9 Conformal prediction .......... 49
4.3 Results and discussion .......... 50
4.3.1 Molecular featurization .......... 50
4.3.2 Database sampling .......... 51
4.3.3 QSAR model for docking scores .......... 57
4.3.4 Progressive docking 2.0 .......... 59
4.3.4.1 Training data for PD2.0 .......... 59
4.3.4.2 PD2.0 on Estrogen Receptor Activation Function 2 (ER-AF2) .......... 61
4.3.4.2.1 Threshold-based prediction .......... 62
4.3.4.2.2 Conformal prediction .......... 66
4.3.4.3 Speed analysis .......... 67
4.3.4.4 PD2.0 on diverse drug targets .......... 69
4.4 Future direction .......... 87
Chapter 5: Virtual screening of ETS transcription factor ERG with PD2.0 .......... 89
5.1 Introduction .......... 89
5.2 Methods .......... 91
5.3 Results and discussion .......... 91
5.4 Future directions .......... 94
Chapter 6: Conclusions .......... 96
Bibliography .......... 98

List of Tables

Table 1. Details of protein targets selected for PD2.0. .......... 18
Table 2. Top candidates for TOX small molecule inhibitors (SMIs). .......... 32
Table 3. Different enrichment values for MACCS, Morgan, and Pharmacophore fingerprints. .......... 51
Table 4. Comparison of different QSAR models: Feedforward Neural Network (DNN), Random forest (RF), and logistic regression. DNN provides the best scores for all 5 enrichment values (top 10, top 100, top 1000, top 10000, FPDE at 90% recall). .......... 58
Table 5. Different model statistics for threshold-based prediction after each iteration for ER-AF2. ROC-AUC, different enrichment values and recall values for the best model are reported. We observed an increase in all enrichment values after each iteration. The model cutoff/definition of good molecules improved after each iteration as well. .......... 62
Table 6. The total number of molecules of the positive class, based on a fixed cutoff value, in the datasets obtained at different iterations. .......... 63
Table 7. Comparison of model statistics for threshold-based prediction between the internal test set and the external validation set for ER-AF2. ROC-AUC, FPDE, and recall for the best model are reported. All the statistics are consistent between the internal test set and the external validation set. .......... 65
Table 8. A comparison of good molecules (TP) returned by PD2.0 and good molecules returned by docking. The predicted and docking values are very similar. .......... 66
Table 9. Different model statistics for conformal prediction (CP) after each iteration for ER-AF2. ROC-AUC, enrichment values and recall values for the best model are reported. .......... 67
Table 10. Comparison of model statistics for CP between the internal test set and the external validation set for ER-AF2. ROC-AUC, FPDE, and recall of the best model at each iteration are reported. .......... 67
Table 11. The time required for individual steps of the PD2.0 pipeline. The major time-consuming step is docking. One PD2.0 iteration takes 3 days using 200-300 CPU cores and 4 GPUs. .......... 68
Table 12. Model ROC-AUC, FPDE and recall for all 12 protein targets for the last iteration. .......... 71
Table 13. Model cutoff, ROC-AUC, FPDE and recall for Androgen Receptor (AR) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 72
Table 14. Model cutoff, ROC-AUC, FPDE and recall for Estrogen Receptor (ER) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 73
Table 15. Model cutoff, ROC-AUC, FPDE and recall for Peroxisome Proliferator-Activated Receptor (PPARγ) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 74
Table 16. Model cutoff, ROC-AUC, FPDE and recall for Calcium/calmodulin-dependent protein kinase kinase 2 (CAMKK2) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 75
Table 17. Model cutoff, ROC-AUC, FPDE and recall for Cyclin-dependent kinase 6 (CDK6) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 76
Table 18. Model cutoff, ROC-AUC, FPDE and recall for Vascular endothelial growth factor receptor 2 (VEGFR2) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 77
Table 19. Model cutoff, ROC-AUC, FPDE and recall for Adenosine A2a Receptor (ADORA2A) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 78
Table 20. Model cutoff, ROC-AUC, FPDE and recall for Thromboxane A2 Receptor (TBXA2R) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 79
Table 21. Model cutoff, ROC-AUC, FPDE and recall for Angiotensin II type-1 receptor (AT1R) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 80
Table 22. Model cutoff, ROC-AUC, FPDE and recall for Sodium-Ion Channel (Nav1.7) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 81
Table 23. Model cutoff, ROC-AUC, FPDE and recall for the bacterial (Gloeobacter) Ligand-gated Ion Channel (GLIC) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 82
Table 24. Model cutoff, ROC-AUC, FPDE and recall for Gamma-AminoButyric Acid receptor subunit alpha-1 (GABA α1) after each iteration. All the statistics are consistent between the internal and external validation sets. .......... 83
Table 25. Model statistics after each iteration for ERG. ROC-AUC and FPDE values for the best model are reported. We observed an increase in all enrichment values after each iteration. The model cutoff/definition of good molecules improved after each iteration as well. .......... 93

List of Figures

Figure 1. Two views of a molecule docked into a target pocket. (a) Protein surface representation with the DNA strand (in green) and the docked molecule (in pink). The red areas on the protein surface are the important residues interacting with the molecule. (b) The same protein-molecule complex; here the blue netted surface is the defined pocket. .......... 6
Figure 2. Computer-aided drug discovery pipeline.
Different steps and tools involved in the process of computer-aided drug discovery, starting with protein structure modeling, moving on to virtual screening, candidate selection, experimental testing and finally hit-to-lead and lead optimization. ................................................................................................................................. 14 Figure 3. The number of synthesizable molecules available in the ZINC database over the period of the last 15 years. There is an exponential growth of the number of available synthesizable molecules in the ZINC database, from 700,000 in 2005 to more than 1 billion in 2019.............. 20 Figure 4. Docking time required based on the same amount of resources for 10 million vs >1 billion molecules. Using the same amount of CPU cores (~300) to dock 10 million molecules (<1%) it takes about 7 days while it would take ~2.5 years to dock ~1 billion molecules. .......... 20 Figure 5. Protein structural templates for the HMG-box domain of TOX. (a) An NMR structure of mouse TOX protein (PDB ID: 2CO9) has been identified as the best structural template, with 100% sequence similarity across the 87 amino acids of the HMG-box domain, compared to the human TOX protein. (b, c) By superimposing the 2CO9 structure (orange ribbons) onto the HMG-box protein TFAM (pink ribbons, PDB ID: 3TMM, 46% sequence similarity to human TOX, 3.8A RMSD) in complex with DNA (green ribbons), the TOX-DNA interface was determined..................................................................................................................................... 30 xiv  Figure 6. Druggability of the HMG-box domain of TOX. (a) TOX-DNA interface as determined from Figure 1C. TOX HMG-box domain in orange ribbons, DNA in green ribbons. (b) Molecular surface presentation of the TOX HMG-box domain, in the same orientation as (a). 
(c) The TOX HMG-box domain is rotated by 180 degrees to illustrate the small-molecule binding hot spots (red). (d) A total of 200,000 drug-like molecules were docked to the TOX HMG-box domain. The percentage of interacting small molecules are shown for each protein residue as a bar graph (multiple interactions/contacts are represented as separate bars for each amino acid). Protein residues, including Gln262, Pro264, Arg273, Lys313, Glu320, Gln324 and Tyr 328 that interact with at least 10% of the small molecules, are highlighted and mapped to their corresponding locations as hot spots (red surface patches) on the TOX HMG-box domain. ...... 31 Figure 7. Viability curves of TOX-high and TOX-low expressing cells. Cells were treated with various concentrations of compounds, (a) 190444, (b) 190414, (c) 190447 and (d) 190441, for 72 hours in 37C incubator with 5% CO2. Viability was measured by CellTiter-Blue® assay and compared to the DMSO control as described in Methods and Materials. Jurkat and Hut78 cells (solid line) are the TOX-high expressing cell lines, while K562 (dotted line) is the TOX-low expressing cell line. (Produced in collaboration with Dr. Zhou’s lab) ......................................... 35 Figure 8. Viability curves of TOX-high and TOX-low expressing cells. Cells were treated with various concentrations of compounds, (a) 190444, (b) 190414, (c) 190447 and (d) 190441, for 72 hours in 37C incubator with 5% CO2. Viability was measured by CellTiter-Blue® assay and compared to the DMSO control as described in Methods and Materials. Jurkat and Hut78 cells (solid line) are the TOX-high expressing cell lines, while K562 (dotted line) is the TOX-low expressing cell line. (Produced in collaboration with Dr. Zhou’s lab) ......................................... 36 xv  Figure 9.  TOX small molecule inhibitors (SMIs) bind at the hot spots on the protein-DNA interface. 
Docking poses are shown for compounds (a) 190444, (b) 190414, (c) 190447 and (d) 190441 on the HMG-box domain of TOX. For each panel: left) molecular surface representation, SMI in cyan, DNA in green ribbons, hot spots as red surface patches; right) in-depth molecular interactions between SMI (cyan) and protein residues of TOX (orange), hydrogen-bonds in red dotted line, hydrophobic interactions in green dotted line. ........................................................... 39 Figure 10. Different steps involved in PD2.0. Features are generated from the SMILES of the small molecule in the library and stored in a feature database. For the first iteration, a small subset of molecules (e.g. 3 million) is randomly sampled and divided into training, validation and testing set, and docked against the protein target. A score cut-off value is chosen to convert the problem from regression to classification. A QSAR model is built and used to predict the classes for all the molecules of the database (1=good and 0=bad). From the next iteration onwards, the same number of molecules (3 million) are sampled only from the positive molecules (predicted to be good binders) of the previous iteration, while ‘bad’ molecules are discarded. The validation and testing set from the first iteration remain the same for all the iterations, and the existing training set is enriched with the newly sampled molecules. All the other steps remain the same as the first iteration. This process is repeated multiple times until the total number of positively predicted molecules is a reasonable number that can be docked. ...... 41 Figure 11. Docking score distribution of the entire dataset versus the docking score distribution of the largest cluster when a Tanimoto cutoff of 0.5 was used. The distributions are overlapping hence all the molecules within the cluster have the same distribution of scores as that of random sampling. 
....................................................................................................................................... 53 xvi  Figure 12. Accuracy and stability of the QSAR model obtained from samples of clustered and unclustered data. Each figure is a plot of FPDE (at 90% recall) against the number of clusters for different sample sizes ranging from 3000 to 480000. The mean, as well as the standard deviation, does not show any trend for any sampling technique. .................................................................. 54 Figure 13. Generalizability (criterion 1) of the QSAR model from samples of clustered and unclustered data. Each figure is a plot of the ratio between test and validation recall values against the number of clusters for different sample sizes ranging from 3000 to 480000. The mean, as well as the standard deviation, does not show any trend for any sampling technique. . 55 Figure 14. Generalizability (criterion 2) of the QSAR model from samples of clustered and unclustered data. Each figure is a plot of the ratio between test and validation precision values against the number of clusters for different sample sizes ranging from 3000 to 480000. The mean, as well as the standard deviation, does not show any trend for any sampling technique. . 56 Figure 15. The architecture of the DNN used for the QSAR model. One block is the combination of one fully connected layer and one dropout layer. The number of blocks, number of neurons and dropout frequency are the hyperparameters of the model. The input layer consists of 1024 neurons taking values from Morgan radius 2 and 1024 bits fingerprints. The output layer consists of two neurons as the problem is of binary classification. ............................................................ 58 Figure 16. The plot of FPDE against training dataset size for random sampling. FPDE increases with the increasing of the dataset size. 
.......................................................................................... 61 Figure 17. Generalizability (criterion 1)/recall ration for random sampling. (a) The recall ratio i.e. ratio of validation recall and test recall becomes more stable (smaller spread) with bigger sample size. (b) The standard deviation of the recall ratio goes down with a larger sample size. 61 xvii  Figure 18. Docking score distribution plot for ER-AF2 at all iterations. The score distribution shifted towards left/better molecules; the shift was more prominent at the initial iteration compared to later ones. This implies that enrichment slowed down as we proceeded through more iterations. ............................................................................................................................. 64 Figure 19. Time projection of PD2.0 and regular docking against different database sizes. As the size of the molecule database increases, the difference between the speed of PD2.0 and regular docking becomes more accentuated. Projections were generated with the assumption that when dealing with more molecules, the number of iterations, as well as inference time, will increase. Since the time per iteration is very small due to the low number of molecules required to dock (3 million), increasing by few iterations will still lead to a significant gain in speed compared to docking a  library of billions of molecules. .................................................................................. 69 Figure 20. Docking score distribution for AR. The distribution of scores shifted towards better (lower) docking scores after each iteration. .................................................................................. 72 Figure 21. Docking score distribution for ER. The distribution of scores shifted towards better (lower) docking scores after each iteration. .................................................................................. 73 Figure 22. 
Docking score distribution for PPARγ. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 23. Docking score distribution for CAMKK2. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 24. Docking score distribution for CDK6. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 25. Docking score distribution for VEGFR2. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 26. Docking score distribution for ADORA2A. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 27. Docking score distribution for TBXA2R. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 28. Docking score distribution for AT1R. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 29. Docking score distribution for Nav1.7. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 30. Docking score distribution for GLIC. The distribution of scores shifted towards better (lower) docking scores after each iteration.
Figure 31. Docking score distribution for GABA α1. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 32. Top 10, 100, 1000 enrichment for the 12 targets after the last iteration. The enrichment values decreased for an increasing number of top molecules, showing that the DNN model was able to prioritize high-affinity (good) molecules, i.e. molecules assigned higher DNN probabilities had higher chances of being good-scoring molecules.

Figure 33. The mean docking score distribution for the 12 targets across all the iterations. The mean score decreased with the iterations. The decrease rates were higher at the beginning and eventually slowed down. The rate of decrease, as well as the docking score values, was protein-dependent.

Figure 34. The number of molecules left after each iteration for the 12 targets. The number of molecules left decreased with the iterations. The decrease was more significant at the beginning and eventually slowed down, and it was highly protein-dependent.

Figure 35. Top 100 enrichment value for the 12 targets after each iteration. The enrichment value increased with the iterations; the rate of increase, as well as the maximum enrichment value, was protein-dependent.

Figure 36. Docking score distribution for ERG. The distribution of scores shifted towards better (lower) docking scores after each iteration. A lower docking score indicates a molecule with better binding affinity for ERG.
Glossary

AI Artificial Intelligence
CADD Computer-Aided Drug Design
CP Conformal Prediction
CTCL Cutaneous T Cell Lymphoma
DL Deep Learning
ER-AF2 Estrogen Receptor Activation Function 2
DNN Deep Neural Network
FPDE Full Predicted Database Enrichment
GBM Gradient Boosting Machine
LR Logistic Regression
ML Machine Learning
PCa Prostate Cancer
PD Progressive Docking
QSAR Quantitative Structure-Activity Relationship
RF Random Forest
SAR Structure-Activity Relationship
SVM Support Vector Machine
TOX Thymocyte selection-associated high mobility group box protein
VS Virtual Screening
ZINC Zinc Is Not Commercial

Acknowledgments

I would like to acknowledge the following people who helped me complete my degree:

My supervisor Dr. Artem Cherkasov and my direct supervisor Dr. Michael Hsing, for guiding me through the entire degree.

My committee members Dr. Michael Cox and Dr. Youwen Zhou, for suggesting some very important additions to my work.

The CREATE scholarship for financially supporting me through my master's degree, and the Canadian Dermatology Foundation for supporting the TOX project.

My fellow lab members Dr. Francesco Gentile, Dr. Fuqiang Ban, Satyam Bhasin, Mariia Radaeva, Godwin Woo, Divya Bafna, Anh-Tien Ton and Oliver Snow, for all the interesting and informative discussions.

Special thanks to my family for always supporting me throughout my student life.

Chapter 1: Introduction

This chapter covers the concepts and theory required to understand and interpret the data in the upcoming chapters, and highlights the various steps involved in the field of computer-aided drug design (CADD). It also gives a brief introduction to artificial intelligence and how it is being integrated into the CADD pipeline.
1.1 Artificial intelligence: a brief overview

Artificial Intelligence (AI) can be defined as the ability of a machine to learn, reason and correct itself, much like human intelligence. Machine learning (ML) is a subset of AI and can be thought of as the process of making a computational algorithm learn the structure embedded in data. The data can be images, prices, costs of goods, gene expression values or any data that may contain hidden patterns. Deep learning (DL) in particular, a sub-discipline of machine learning, works by extracting higher-level information from raw data in multiple steps/layers. The idea is that each subsequent step/layer learns a more abstract and composite representation of the raw data compared to the previous layer. Another way to look at learning methods is to consider the type of supervision applied during training, which falls into three categories: fully supervised, semi-supervised and unsupervised. Fully supervised and unsupervised learning, as the names suggest, require fully labeled data and unlabeled data, respectively. Examples of supervised machine learning models include (but are not limited to) Linear Regression3, Logistic Regression4, Random Forest5 (RF), Support Vector Machines6 (SVM), Gradient Boosting Machines7 (GBM) and Neural Networks8 (NN). Methods of unsupervised learning encompass clustering9, 10 and dimensionality reduction11, 12, among others. Examples of supervised deep learning models that gained particular attention in recent years include the Deep Feed-Forward Neural Network13 (DFFNN), Convolutional Neural Network14 (CNN) and Recurrent Neural Network15-17 (RNN), while the popular Variational Autoencoder18 (VAE) and Generative Adversarial Network19 (GAN) correspond to unsupervised DL.
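To make the supervised-learning idea concrete, the sketch below trains a toy supervised classifier (a nearest-centroid model, one of the simplest labeled-data methods) on invented 2-D data. The data points and class names are purely illustrative and are not taken from this thesis, which uses deep neural networks on molecular fingerprints.

```python
# Toy illustration of supervised learning: a nearest-centroid classifier.
# Labeled examples are used to "train" (compute one centroid per class);
# a new point is then classified by its closest centroid.

def train_nearest_centroid(samples, labels):
    """Learn one centroid (mean vector) per class from labeled data."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))

# Two noisy 2-D classes ("active" vs "inactive" compounds, hypothetical data)
X = [(0.1, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9)]
y = ["inactive", "inactive", "active", "active"]
model = train_nearest_centroid(X, y)
print(predict(model, (0.95, 0.95)))  # -> active
```

An unsupervised method (e.g. k-means clustering) would instead receive only `X`, without the labels `y`, and discover the two groups on its own.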
Semi-supervised learning, on the other hand, does not require exact labels but instead requires a criterion to reward the machine when it performs the correct task, letting it figure out how to act correctly based on these rewards (very much like humans). AI in general is the combination of all three types of learning: supervised, unsupervised and semi-supervised. Reinforcement learning is a semi-supervised method and part of AI, in which the machine learns, reasons and corrects itself based on what it observes; examples include machines learning to play games, robots learning to walk and function, and self-driving cars making decisions. For all of these tasks, just like humans, machines learn from the experience of making mistakes. Of the three types of learning, the supervised one is the most broadly used and studied. Some of the fields where supervised learning has made a significant impact in recent years include object detection20, 21, machine translation22, 23, speech recognition24, facial recognition25, 26, biomedical imaging27, as well as predictive pharmacology and computer-aided drug discovery (CADD), among others28.

1.2 Computer-Aided Drug Design (CADD)

Drug discovery is an expensive and very slow process that can cost up to $3 billion and take more than 10 years to bring a drug from the laboratory to clinical practice. Two of the main challenges are 1) the low hit discovery rate (<1%) when conventional chemical libraries are screened in vitro against a disease-associated target, and 2) unwanted side-effects arising from hit compounds cross-reacting with other protein targets29, 30. CADD can significantly speed up compound screening and improve hit rates by applying virtual screening (VS) protocols to millions of drug-like small molecules. VS scores those molecules against a specific site on a target protein in silico and triages them for subsequent experimental validation30.
Such VS protocols can achieve >10% hit (success) rates in initial drug discovery campaigns, as opposed to the ~0.03% yield of wet-lab screening29. The other core steps in a computer-aided drug design campaign30 include compound design, energy calculations, SAR analysis, ADME modeling, and drug-target interaction prediction31. Below we present a high-level overview of all those steps.

1.3 Molecule databases

Current molecular databases contain diverse information about known (or sometimes prospective) compounds: their activities (expressed as IC50 numbers, binding affinities, etc.), physicochemical properties, relevant information on chemical vendors, and so on. Some of the most well-known databases are ZINC1532 (purchasing information for more than a billion molecules), ENAMINE33 (purchasing information for hundreds of millions of molecules), PubChem34 (experimental information on millions of molecule-target pairs), the PDB database35 (3D coordinates of proteins, nucleic acids and complex assemblies), and BindingDB36 and PDBbind37 (cataloguing binding affinity numbers for millions of ligand-target pairs along with relevant information on the corresponding experimental conditions).

1.4 Mode of drug delivery

A drug can be introduced into the body in several ways: orally, through the skin (transcutaneous), by injection directly into the bloodstream, placed under the tongue, inserted into the rectum or vagina, placed in the eye, etc. In this thesis, we focus on the oral and transcutaneous modes of delivery. The oral route is the most convenient, safest and least expensive way of delivering a drug. The drug can be in the form of tablets, capsules or liquids. The absorption of drugs happens in the mouth, stomach and small intestine; of the three, most drugs are absorbed in the small intestine.
The intestinal wall chemically alters many drugs, causing only a small amount of the drug to be absorbed and reach the bloodstream; thus oral delivery requires a high dosage of the drug. Based on existing drug candidates, there are several properties that we expect in an oral drug, including lower molecular weight (<500 Daltons), a lower rotatable bond count, lower hydrogen bond counts, lower surface area and a certain minimum lipophilicity (octanol-water partition coefficient)38. The transcutaneous route delivers drugs through the skin, for example via a patch. The drug needs to penetrate the skin to reach the bloodstream; to enhance skin penetration, these drugs are sometimes mixed with other chemicals such as alcohol. Patches can stay on for hours or days, delivering the drug slowly but continuously. This form of drug delivery is useful for drugs that are cleared from the system very quickly and thus need to be taken frequently. Patches, on the other hand, can cause skin irritation and can only be used for drugs that need to be given in small daily doses. Desirable properties for such drug candidates include lower molecular weight, lipophilicity (octanol-water partition coefficient) greater than 1, a lower number of hydrogen bonds (since permeability across the lipid bilayer decreases with more hydrogen bonds), a lower number of hydrogen bond acceptors (since too many hinder permeability across the lipid bilayer) and no charge39-41.

1.5 Virtual screening (VS)

VS is an efficient approach for the identification of hit compounds for a given protein target. The starting point for a VS campaign is the identification of a source of already available or readily synthesizable chemicals, such as the ZINC database32, which consists of billions of molecular entries.
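The oral drug-likeness criteria discussed in Section 1.4 are in practice often expressed as a simple rule-based filter. The sketch below applies Lipinski-style thresholds to precomputed molecular properties; the property names, example values and exact cutoffs are illustrative assumptions (real descriptors would be computed with a cheminformatics toolkit such as RDKit or MOE, and different projects use different cutoffs).

```python
# A minimal oral drug-likeness filter in the spirit of Lipinski's rule of
# five. Thresholds are illustrative; molecular properties are assumed to
# be precomputed elsewhere (e.g. by a cheminformatics toolkit).

ORAL_RULES = {
    "mol_weight":  lambda v: v < 500,   # molecular weight, Daltons
    "h_donors":    lambda v: v <= 5,    # hydrogen bond donors
    "h_acceptors": lambda v: v <= 10,   # hydrogen bond acceptors
    "logp":        lambda v: v <= 5,    # octanol-water partition coefficient
}

def passes_oral_filter(props):
    """Return True if a molecule's precomputed properties satisfy all rules."""
    return all(rule(props[name]) for name, rule in ORAL_RULES.items())

# Hypothetical molecules: one drug-like, one too heavy for oral delivery
drug_like = {"mol_weight": 320.4, "h_donors": 2, "h_acceptors": 5, "logp": 2.1}
too_heavy = {"mol_weight": 712.9, "h_donors": 2, "h_acceptors": 5, "logp": 2.1}
print(passes_oral_filter(drug_like), passes_oral_filter(too_heavy))  # -> True False
```

A transcutaneous filter would swap in the Section 1.4 skin-permeability rules (e.g. logP > 1, no formal charge) in the same table-of-rules pattern.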
The next step is to apply VS protocol(s) such as ligand-based or structure-based screening, depending on whether the target structure is available or can be reliably modeled.

1.5.1 Ligand-based virtual screening

For ligand-based screening, a similarity search is performed between known active molecules and the entire small molecule library. The search is based on some form of representation of the small molecules, known as fingerprints or descriptors. One drawback of this approach is that the calculated similarity does not necessarily correlate strongly with the observed activity, and hence it yields a high number of false positives.

1.5.2 Structure-based virtual screening

Structure-based virtual screening is arguably the most accurate VS protocol, but its use is only feasible when a target protein possesses an experimentally resolved structure. This structure, or closely related homology models, is then used to identify a targetable pocket or a ligand-binding site to be inhibited. Once the target pocket is defined by a certain CADD method, such as chemical probing or hot-spot mapping (some examples are shown in Figure 1), a significant number of small molecules can be screened against this pocket by means of molecular docking.

Figure 1. Two views of a molecule docked into a target pocket. (a) Protein surface representation with a DNA strand (in green) and a docked molecule (in pink). The red areas on the protein surface are the important residues interacting with the molecule. (b) The same protein-molecule complex; here the blue netted surface is the defined pocket.

1.5.3 Molecular docking

The process of molecular docking involves the search for the most energetically favorable binding pose of a given compound within a defined pocket.
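The similarity search underlying ligand-based screening (Section 1.5.1) is commonly scored with the Tanimoto coefficient over binary fingerprints. A minimal sketch, representing each fingerprint as the set of its "on" bit positions; the bit values and molecule names are invented (real fingerprints, such as the 1024-bit Morgan fingerprints used later in this thesis, come from a cheminformatics toolkit):

```python
# Tanimoto similarity between two binary fingerprints, each represented
# as the set of its "on" bit positions: |A ∩ B| / |A ∪ B|.

def tanimoto(fp_a, fp_b):
    """1.0 means identical fingerprints, 0.0 means no shared bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

query = {3, 17, 42, 101, 256}            # hypothetical known active molecule
library = {
    "mol_1": {3, 17, 42, 101, 256},      # identical fingerprint
    "mol_2": {3, 17, 42, 900, 901},      # partial overlap
    "mol_3": {5, 8, 13},                 # no overlap
}

# Rank library molecules by similarity to the query (most similar first)
ranked = sorted(library, key=lambda m: tanimoto(query, library[m]), reverse=True)
print(ranked)  # -> ['mol_1', 'mol_2', 'mol_3']
```

In a real campaign, the top-ranked molecules above a similarity cutoff would be carried forward as candidate actives; the false-positive problem noted in Section 1.5.1 arises because high fingerprint similarity does not guarantee similar activity.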
In a nutshell, the docking workflow can be divided into two parts: 1) performing a conformational search across all possible conformers and 2) calculating the binding free energy of each conformer using a VS scoring function. The conformational search problem is essentially solved for docking protocols, but it remains difficult to accurately rank compounds based on predicted binding free energy (the scoring function). Of note, the current scoring functions, which only approximate the binding free energy of a protein-ligand pair, represent the limiting step in docking and have significant room for improvement.

1.5.4 Ultra-large database docking

In a recent paper by Lyu et al.42, the authors docked a library of ~100 million molecules into two targets, AmpC β-lactamase (AmpC) and the D4 dopamine receptor. They discovered that for these targets, the known ligands (active molecules) appeared among the top hits, i.e. the docking methods used could correctly prioritize a very large chemical space. They compared the docking of a small library of compounds with the docking of a larger library and observed that performance typically improved in the latter case. The authors concluded that ultra-large libraries contain molecules that are better suited to a given receptor structure and that by searching large libraries one can discover better drug candidates. Furthermore, the authors anticipated that, given an ultra-large library, it would be logical and time-saving to cluster it and only dock particular cluster representative(s).

1.5.5 Machine learning and docking

As outlined above, the process of docking can be divided into two major steps: 1) conformational search and 2) free energy estimation. The latter phase is particularly error-prone and computationally expensive. The free energy estimation in a docking process is typically approximated by a much-simplified scoring function. A good scoring function is expected to have three properties: 1) scoring power, i.e.
the ability to correctly predict the binding affinity for different binding poses; 2) ranking power, the ability to rank the binding poses of a set of ligands with known poses against the same target; and 3) docking power, the ability to identify the best binding pose of a given ligand amongst different generated conformers43. In the next section, different statistical and machine learning methods are discussed in the context of these three properties.

1.5.5.1 Scoring power

Scoring power is defined as the ability of a scoring function to predict the binding energy of a protein-ligand complex with known 3D coordinates. Ideally, the corresponding predicted scores should be linearly correlated with experimentally measured binding affinities. There are three categories of scoring functions44: 1) force field-based, 2) empirical and 3) knowledge-based. All three are parametric methods that involve estimating a fixed number of parameters from the available experimental data. Force field-based methods estimate the parameters of molecular mechanics energies approximated from experimental data and/or ab initio simulations45. For empirical scoring functions, the complete energy term is composed of individual weighted energy terms, where the weight coefficients are obtained from a regression on experimental binding energies46. Lastly, knowledge-based functions are the weakest predictors and are based on the premise that a large database of protein-ligand complexes can be statistically mined to deduce rules and models implicitly embedded in the data47. Various machine learning methods are actively used for constructing VS scoring functions. For instance, Kinnings et al.48 used a Support Vector Machine (SVM) to derive the individual weight terms of different protein families for an empirical scoring function. The same method can be used to derive the parameter coefficients of a force field-based scoring function.
There are some ML approaches that use protein-ligand features available in the literature (geometric features, pharmacophore features, physical force field energy terms) to predict binding affinities. A few other non-parametric models have tried to learn the non-linear dependency of the binding affinity on the structure of the protein-ligand complex49. In this approximation, each feature represents the number of occurrences of a particular protein-ligand atom-type pair interacting within a certain distance range, and the authors used a random forest to model the relationship between these features and binding affinity. Overall, the features can be divided into 1) physical features (e.g. energy terms), 2) chemical features (fingerprints) and 3) geometric features (based on the structure of the protein-ligand complex). The work by Li et al.50 used RF with a combination of different energy and geometric features and showed that incorporating more features and training with more data can boost the performance of the model. In the work by Ballester et al.51, the authors utilized chemical descriptors to represent complexes and showed that having more precise chemical descriptors does not always lead to a more accurate model. Recently, there have been a number of similar publications on scoring functions and machine learning43, 52.

1.5.5.2 Ranking power

Ranking power is defined as the ability of a scoring function to properly rank ligands by their binding affinity to the same protein. There are two classical approaches for evaluating the ranking power: 1) high-level ranking and 2) low-level ranking. The PDBbind37 v2013 database contains complexes with 3 different ligands against the same protein target. High-level ranking is defined as correctly ordering the three ligands based on their binding affinity towards the protein target. Low-level ranking, on the other hand, only requires correctly identifying the best-binding ligand out of the three.
For any protein target, one point is awarded if the scoring function succeeds at high-level/low-level ranking. In another study, the authors evaluated a panel of 20 conventional scoring functions53 and demonstrated that the method called the “S-score” ranked the ligands with the highest accuracy in both the high-level and low-level ranking exercises. The work presented by Ashtawy et al.44 assessed the ranking power of ML-based functions on the PDBbind37 2007 and 2010 benchmark datasets. The authors also used a very diverse set of molecular features: X-score46, AffiScore54-56, and RF-score49. For ML models they used Multivariate Linear Regression3 (MLR), multivariate adaptive regression splines57 (MARS), k-nearest neighbours58 (KNN), SVM6, RF5, and Boosted Regression Trees59 (BRT). Out of all these, they found that ensemble-based methods like RF and BRT worked best, with RF giving the best results for both high-level ranking (62.5%) and low-level ranking (78.1%). Some other works60, 61 involved non-parametric ML models such as inductive logic programming (ILP) combined with SVM, and SVR-based scoring functions: 1) SVR knowledge-based and 2) SVR empirical-descriptor based. More work can be found in review papers on machine learning and docking43, 52.

1.5.5.3 Docking power

Docking power is defined as the ability of a scoring function to identify the native binding pose amongst generated ones; in other words, it should score the native pose as the best among all possibilities. The classical way of testing a scoring function53 is to generate decoy binding poses (usually a few hundred), include the native binding pose among them, and check whether the native pose is scored at the top and whether the RMSD between poses is low (<2.0 Å or so). Out of the 20 scoring functions tested by the authors of a review paper53, ChemPLP@GOLD62 and ChemScore@GOLD63 achieved success rates above 80%.
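The classical docking-power test described above reduces to a simple check per ligand: does the best-scored pose lie within 2.0 Å RMSD of the native pose? A minimal sketch of that bookkeeping; all scores and RMSD values below are invented for illustration (in a real benchmark, RMSDs would be computed from 3D coordinates):

```python
# Sketch of the classical docking-power success test: a scoring function
# "succeeds" for a ligand if its best-scored pose is native-like, i.e.
# within 2.0 Å RMSD of the experimentally observed pose.

RMSD_CUTOFF = 2.0  # Ångströms

def docking_power_success(poses):
    """poses: list of (score, rmsd_to_native) tuples; lower score = better."""
    best_score, best_rmsd = min(poses, key=lambda p: p[0])
    return best_rmsd < RMSD_CUTOFF

# Invented examples: one case where the native-like pose is scored best,
# one where a far-from-native decoy gets the best score.
good = [(-9.2, 0.6), (-8.1, 4.5), (-7.3, 6.2)]
bad  = [(-9.5, 5.1), (-8.8, 0.7), (-7.0, 8.3)]
print(docking_power_success(good), docking_power_success(bad))  # -> True False
```

The benchmark success rate quoted above (e.g. >80% for ChemPLP@GOLD) is simply the fraction of test ligands for which this check returns True.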
Finding the right conformation is usually done using steepest descent optimization, Monte Carlo simulation, simulated annealing, molecular dynamics (MD), genetic algorithms, or geometric methods. The importance of the docking power of any scoring function has been demonstrated previously64; there is a variety of ML-based scoring functions that are insensitive to docking pose accuracy. The authors of one study65 compared the docking power of linear and ML-based scoring functions and demonstrated that the docking power of linear scoring functions is significantly higher. In another study66, researchers used a deep learning approximation to predict binding affinity values by extracting features from the poses obtained from two docking programs.

1.5.6 Predicting docking scores using machine learning

We have briefly reviewed a variety of methods published to date on the prediction of protein-ligand binding affinities with various scoring functions. Yet Progressive Docking (PD)67 was the first reported method that utilized classical QSAR methodology to simulate docking scores and further use them to reduce the number of remaining docking jobs in a CADD pipeline. The original idea of PD was to build a model specific to the protein target site using a subset of docked molecules and to predict the scores of all remaining molecules using a QSAR model with 3D descriptors. The predicted docking scores are then used to iteratively remove less promising molecules from the un-docked database. In the last few years there have been a few similar studies aimed at approximating docking scores using methods like SVM and RF enhanced with conformal prediction statistics68, 69. These methods used the same idea as PD, i.e. progressive removal of undocked molecules, but based on classification (instead of quantitative regression, as in the case of PD).
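The progressive-removal idea behind PD can be sketched as a control-flow skeleton: dock a small random sample, fit a surrogate model on the resulting scores, and prune molecules the model predicts will score poorly. Everything below is a stand-in, assumed purely for illustration: a random numeric "library" plays the role of docking scores and a median cutoff plays the role of the trained model. The actual PD2.0 implementation uses real docking and a DNN on fingerprints, not this toy.

```python
import random

# Schematic of the Progressive Docking loop (not the real PD2.0 code).
# Each molecule's "true" docking score is simulated as a Gaussian value;
# the "model" simply keeps molecules at or below the sample median score.

random.seed(0)
library = {f"mol_{i}": random.gauss(-6.0, 1.5) for i in range(10000)}

def dock(mol_ids):
    """Stand-in for real docking: look up the simulated score."""
    return {m: library[m] for m in mol_ids}

def train_and_predict(scores, candidates):
    """Stand-in for training a QSAR/DNN surrogate on the docked sample:
    keep candidates whose (simulated) score is at or below the sample median."""
    cutoff = sorted(scores.values())[len(scores) // 2]
    return [m for m in candidates if library[m] <= cutoff]

remaining = list(library)
for iteration in range(3):
    sample = random.sample(remaining, 500)     # molecules docked this round
    scores = dock(sample)                      # expensive step in real PD
    remaining = train_and_predict(scores, remaining)
    print(f"iteration {iteration}: {len(remaining)} molecules remaining")
```

Each pass roughly halves the candidate pool while concentrating good scorers, which is the mechanism behind the shifting score distributions shown in the figures of this thesis; in real PD2.0 only the final, heavily reduced set is exhaustively docked.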
1.5.7 Structure-Activity Relationship (SAR) and lead optimization

As the name suggests, SAR methods link the structure of a protein-ligand complex to some measure of the ligand's activity, such as the corresponding IC50 value. Historically, the field of Quantitative Structure-Activity Relationship70 (QSAR) encompassed ligand-based approaches, which are important components of computer-aided drug discovery. The origin of QSAR dates back to the 1960s, when it was developed by Corwin Hansch71. QSAR models can be linear or non-linear in nature; the advantage of the former class is their interpretability, while they are generally less predictive. There are also various SAR visualization techniques72, 73, which are more intuitive and attempt to complement analog design with graphical representations of SAR patterns, identifying key components for choosing new analogs. Nowadays, QSAR models play a very important role in hit-to-lead and lead optimization workflows. In such workflows, after initial active compounds (hits) are identified, the best ones undergo hit-to-lead optimization to qualify as lead candidates, which are further optimized into drug candidates. Among other steps, hit-to-lead optimization involves an iterative search for hit analogs in chemical space using QSAR approaches74, 75. After a lead compound has been selected through the hit-to-lead process, it moves on to lead optimization. In this process, the lead compound is iteratively made more potent, with fewer off-target effects, using medicinal chemistry knowledge. New custom compounds are designed based on the lead molecule, their activity is predicted using the QSAR model and, eventually, these molecules are purchased and tested in vitro.
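As a toy illustration of the simplest (linear) QSAR model, the sketch below fits activity against a single hypothetical descriptor by ordinary least squares and predicts the activity of a new analog. All descriptor and activity values are invented; real QSAR models use many descriptors and, as discussed above, often non-linear learners.

```python
# A minimal linear QSAR model: fit activity (e.g. pIC50) against one
# descriptor by ordinary least squares. Data points are invented.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

logp  = [1.0, 2.0, 3.0, 4.0]   # hypothetical descriptor values for 4 analogs
pic50 = [5.1, 5.9, 7.1, 7.9]   # hypothetical measured activities

slope, intercept = fit_line(logp, pic50)
predicted = slope * 2.5 + intercept   # predict activity of a new analog
print(round(slope, 2), round(predicted, 2))  # -> 0.96 6.5
```

In a hit-to-lead loop, such a model would be refit after each round of synthesis and testing, and the predicted activities would guide which analogs to purchase or make next.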
1.5.8 ADMET prediction by QSAR

ADMET stands for "absorption, distribution, metabolism, excretion, and toxicity" and describes the main pharmaceutically relevant characteristics of a compound administered to a living organism. ADMET modeling usually involves building a variety of machine learning and statistical models and is one of the hottest topics in the field of computer-aided drug discovery. Among other applications, ADMET modeling involves predicting the complex in vivo behavior of bioactive compounds, which is not yet completely understood. The data available for ADMET prediction is very sparse, making it even more difficult to build good models. Over the last few years, a variety of computational methods have been developed for the elaborate prediction of drug metabolism and toxicity76, 77.

1.5.9 In-house CADD pipeline

We have developed our own computer-aided drug design pipeline30 (Figure 2), which has previously led us to successfully discover small molecule inhibitors (SMIs) for cancer drug targets such as AR78, ERG79, and MYC80, among others81-83. Most of the steps are the same as in every other CADD pipeline, with a few additions to some of the intermediate steps.

Figure 2. Computer-aided drug discovery pipeline. Different steps and tools involved in the process of computer-aided drug discovery, starting with protein structure modeling, moving on to virtual screening, candidate selection, experimental testing and finally hit-to-lead and lead optimization.

The first step is protein structure modeling, which is part of every CADD pipeline: we obtain the crystal structure of the protein from the PDB database (if available) or perform homology modeling on PDB-retrieved templates. We then use the MOE84 software to identify the best targetable pocket of the protein, which is used in the subsequent virtual screening step. At the next stage, we perform virtual screening of the ZINC database.
Previously, we worked with the entire ZINC database, when it contained 10-20 million entries. These days, however, the ZINC database has grown to more than 1 billion entries, and it would take too long to screen all of them. Thus, we apply various filters to ZINC, such as the number of hydrogen bonds, the number of rings, LogP, etc.85, to decrease the dataset to a manageable tens of millions of molecules. For molecular docking, we use the GLIDE86, OEDocking87, and ICM88 packages. After the docking round, we select molecules that have a consistent pose across all three programs, based on a threshold RMSD value of 2 Å. We also calculate other properties of the virtual hits, such as ADMET77 characteristics, drug-likeness89, 90 and predicted pKi. Based on all these calculated properties we derive a consensus score, select the top hundreds of molecules and perform clustering to remove similar structures. Finally, we select ~100 molecules for experimental (in vitro) testing. Based on the hit compounds, we perform hit-to-lead and lead optimization using QSAR models, 2D/3D similarity searches and molecular dynamics (MD) simulations.

1.6 Protein targets

1.6.1 Thymocyte selection-associated high mobility group box protein (TOX)

Thymocyte selection-associated high mobility group box protein (TOX) is a 526 aa nuclear protein (molecular weight, 57 kDa) that binds DNA in a structure-dependent and sequence-independent manner91. It is a member of the evolutionarily conserved high mobility group (HMG) box family and a key regulatory nuclear protein in the development of CD4+ T cells, natural killer cells, and lymphoid tissue inducer cells91-95. The roles of TOX in immune system development are well characterized91. TOX expression is tightly controlled, and it is expressed during early CD4+ T cell development. The importance of TOX in T cell development and maturation has previously been shown by TOX gene knockout experiments in mice.
In recent years, strong evidence has emerged that TOX is a specific biomarker, strong prognostic factor, key pathogenic driver, and attractive therapeutic target for CTCL96-99.

1.6.2 ETS transcription factor ERG

ERG (ETS-related gene) is a transcription factor with a full length of 486 amino acids and a molecular weight of 54 kDa100. ERG is a member of the ETS family, the majority of which have an ETS DNA-binding domain (DBD) that is about 85 amino acids long. This DBD recognizes DNA sequences containing a core GGA(A/T) motif101. ERG is initially highly expressed in the embryonic mesoderm and endothelium, where it plays an important role in bone development and the formation of the urogenital tract and vascular system102, 103. ERG is also expressed at high levels in embryonic neural crest cells during their migratory phase104. ERG expression decreases during vascular development105 but continues to regulate the pluripotency of hematopoietic stem cells106, endothelial cell (EC) homeostasis107, 108 and angiogenesis102. Genomic alterations involving translocation of ERG occur in approximately half of prostate cancers79. These alterations result in aberrant, androgen-regulated production of ERG protein variants that directly contribute to disease development and progression79. ERG modulates various PCa-related phenotypes, such as disruption of the epithelial differentiation program via AR dysregulation109, activation of c-Myc, epigenetic reprogramming via EZH2110 and promotion of genomic instability via PARP dysregulation111. Overexpression of ERG also promotes the epithelial-mesenchymal transition (EMT) and enables the transformed cells to acquire migratory and invasive characteristics109, 112. Previously, a new class of anti-ERG small molecule inhibitors (compound VPC-18005 and its derivatives) was developed at the Vancouver Prostate Centre through the use of an in-house CADD pipeline79.
In summary, there are multiple lines of evidence that together strongly demonstrate that ERG is an attractive molecular target for developing PCa therapies.

1.6.3 Other targets

To evaluate the performance of the developed PD2.0 approach we considered 12 proteins from four main drug-targeted families: nuclear receptors, kinases, G-protein coupled receptors, and ion channels113. These targets are very well validated, have well-defined X-ray structures (where the target is bound to various ligands), and most of them already have approved drugs. Details of these proteins are presented in Table 1. We also considered Estrogen Receptor Activation Function 2 (ERAF2) as a target; this unconventional pocket of the Estrogen Receptor (ER) is the subject of in-house drug development. This hydrophobic cavity on the surface of ERα is involved in the recruitment of different coactivator proteins of ERα, and a number of recent studies have successfully explored ER-AF2 as an alternative target site for ER inhibitors to the hormone-binding pocket in breast cancer.

Table 1. Details of protein targets selected for PD2.0.
Family | Target | Ligand | PDB ID | Resolution | Significance
Nuclear receptors | AR | Dihydrotestosterone | 1T7R114 | 1.40 Å | Prostate cancer
Nuclear receptors | ERα | Raloxifene | 1ERR115 | 2.60 Å | Breast cancer, osteoporosis, menopausal symptoms
Nuclear receptors | PPARγ | Rosiglitazone | 5YCP116 | 2.00 Å | Diabetes
Kinases | CAMKK2 | STO-609 | 2ZV2117 | 2.40 Å | Prostate cancer, metabolic hepatic diseases
Kinases | CDK6 | Abemaciclib | 5L2S118 | 2.27 Å | Breast cancer
Kinases | VEGFR2 | Axitinib | 4AG8119 | 1.95 Å | Multiple cancer types
G-protein coupled receptors | ADORA2A | Theophylline | 5MZJ120 | 2.00 Å | Myocardial perfusion imaging, inflammation, neuropathic pain, Parkinson's disease
G-protein coupled receptors | TXA2 | Ramatroban | 6IIU121 | 2.50 Å | Cardiovascular diseases, asthma
G-protein coupled receptors | AT1R | Olmesartan | 4ZUD122 | 2.80 Å | Hypertension
Ion channels | Nav1.7 | GX-936 | 5EK0123 | 3.53 Å | Pain
Ion channels | GLIC | Anesthetic ketamine | 4F8H124 | 2.99 Å | General anesthetics
Ion channels | GABA α1 | THDOC | 5OSB125 | 3.80 Å | Neurological disorders

Chapter 2: Problem and specific aims

2.1 Problem

Purchasable chemical databases such as ZINC have grown exponentially, from 700,000 compounds in 2005126 to over 1 billion molecules in 201932 (Figure 3). Previously, due to limitations in available computing power and methods, we could only virtually screen around 10 million molecules (representing less than 1% of the entire ZINC database, as shown in Figure 4) through the use of molecular docking. As has been shown before42, the larger the chemical libraries screened, the better the chance of finding optimal drug candidates. Thus, the billions of molecules in chemical databases such as ZINC provide a tremendous opportunity for drug discovery, provided that we can accelerate and scale up molecular docking methods from tens of millions to billions of compounds. We encountered such a challenge in drug discovery projects targeting the TOX and ERG proteins.

Figure 3. The number of synthesizable molecules available in the ZINC database over the period of the last 15 years.
There is exponential growth in the number of available synthesizable molecules in the ZINC database, from 700,000 in 2005 to more than 1 billion in 2019.

Figure 4. Docking time required, given the same amount of resources, for 10 million vs >1 billion molecules. Using the same number of CPU cores (~300), docking 10 million molecules (<1% of ZINC) takes about 7 days, while docking ~1 billion molecules would take ~2.5 years.

2.2 Aims and objectives

In this thesis, we describe the development of a new virtual screening pipeline, 'Progressive Docking PD2.0', based on a deep learning framework, which can accelerate the docking process for very large chemical databases (>1 billion molecules). PD2.0 uses docking scores generated from a small subset of the molecules to predict the binding affinities of the remaining molecules in the chemical library in a progressive, iterative manner.

1. Identify novel small molecule inhibitors targeting TOX through the use of virtual screening. There are currently no small molecule inhibitors of TOX, and herein we aimed to address this unmet need by developing anti-TOX therapeutics through our in-house CADD pipeline. (While this aim only utilizes the docking of 7 million molecules, it motivated us to fully develop PD2.0.)

2. Develop the Progressive Docking PD2.0 method and validate its performance on a diverse list of drug targets. For molecule features, we evaluated MACCS, Morgan, and Pharmacophore fingerprints. For choosing the QSAR model we compared Deep Neural Networks (DNN) with other machine learning models, including Random Forest (RF), Logistic Regression (LR), and Support Vector Machine (SVM). We then selected the sampling technique based on performance and time complexity, optimized the pipeline by testing various ways to combine its components, and lastly validated PD2.0 on ERAF2 and the 12 targets mentioned in Table 1.

3.
Apply PD2.0 to ERG for the discovery of new small molecule inhibitors. Previously, we applied CADD to virtually screen 3 million molecules from the ZINC database and developed a list of anti-ERG inhibitors79. We have now virtually screened 350 million molecules with the use of the PD2.0 method on ERG to identify new classes of small molecule inhibitors.

Chapter 3: Computer-aided discovery of small molecule inhibitors of thymocyte selection-associated high mobility group box protein (TOX)

3.1 Introduction

In this chapter, we demonstrate the need for a faster computational method for virtual screening. In particular, we identified a prospective protein target, TOX, for Cutaneous T cell lymphoma (CTCL) and found new small molecule inhibitors (SMIs) for this protein using virtual screening followed by experimental validation. CTCL is a primary lymphoma of the skin and is derived from cutaneous resident memory T cells. In most cases, the malignant T cells are CD4+, i.e. they express CD4 (cluster of differentiation 4), a glycoprotein. The most common variants of CTCL are mycosis fungoides (MF) and Sezary syndrome (SS). Only about 10% of CTCL patients with minor conditions and about 25% with major conditions end up developing end-stage disease such as leukemic-stage CTCL, including SS; for the rest, the life span is that of a healthy individual. SS patients have a median survival of 2-4 years and an estimated 5-year survival rate of 24%127. Current treatment of CTCL depends upon the level of severity and clinical stage; for example, the treatments available for early stages are skin-directed, including topical steroids, topical nitrogen mustard, topical retinoids as well as phototherapy. It becomes more difficult to treat late stages, when the disease is no longer limited to the skin.
Although a number of agents, including retinoids, interferon (IFN), monoclonal antibodies, epigenetic modifiers such as histone deacetylase inhibitors (HDACi), and denileukin diftitox, have shown benefit in the treatment of advanced disease, none are curative. Therefore, new therapies for CTCL are urgently needed.

Thymocyte selection-associated high mobility group box protein (TOX) is a 526 aa nuclear protein (molecular weight, 57 kD) that binds DNA in a structure-dependent and sequence-independent manner91. It is a member of the evolutionarily conserved high mobility group (HMG) box family and a key regulatory nuclear protein in the development of CD4+ T cells, natural killer cells, and lymphoid tissue inducer cells91-95. The roles of TOX in immune system development are well-characterized91. TOX expression is tightly controlled, and it is expressed during early CD4+ T cell development. The importance of TOX in T cell development and maturation has been demonstrated previously by TOX gene knockout experiments in mice. In recent years, strong evidence has emerged that TOX is a specific biomarker, strong prognostic factor, key pathogenic driver, and attractive therapeutic target for CTCL96-99. In particular:

(1) TOX is aberrantly expressed in CTCL: In comparative transcriptome studies99, TOX emerged as the most highly enriched gene in CTCL skin biopsies.

(2) Enhanced transcript levels of TOX correlate with increased risk of disease-specific mortality in CTCL: Further experiments showed that TOX expression levels in CTCL skin biopsies and in peripheral blood purified malignant CTCL cells were positively correlated with disease-specific mortality of CTCL patients98.

(3) Stable knockdown of TOX inhibits the growth of CTCL cells in vitro: TOX gene silencing suppresses CTCL cell proliferation as well as decreases colony formation capability in vitro97.
(4) TOX suppression induces apoptosis and caspase activation in CTCL cells: With TOX suppressed, CTCL cells showed increased apoptosis97.

(5) TOX suppression impairs the tumor-forming ability of CTCL cells in vivo: Injecting CTCL patient-derived cells into mice readily forms tumors, but this was not observed when TOX expression was silenced97.

(6) TOX suppression led to expression changes of multiple downstream genes: TOX silencing led to a change in the expression levels of numerous genes, among which SMAD3 was the most significant. SMAD3 is usually suppressed in the presence of TOX but sensitively induced after TOX gene silencing97.

In summary, there are multiple lines of evidence that together strongly demonstrate that TOX is an attractive molecular target for developing CTCL therapies. There are currently no small molecule inhibitors of TOX, and herein we aimed to address this unmet need by developing anti-TOX therapeutics through a computer-aided drug design (CADD) platform that we have previously established30 and successfully utilized for a number of other cancer-related drug targets including AR78, ER83, ERG79, MYC80, etc. Here we report the use of this established CADD pipeline, which combined virtual screening of 7.6 million drug-like small molecules with in vitro experimental validation, to discover new classes of anti-TOX compounds.

3.2 Methods

3.2.1 Structural evaluation of TOX druggability

Since there is no resolved human TOX NMR structure available, we found mouse TOX (PDB ID: 2CO9) to be the best matching template (100% sequence similarity) for human TOX (from K251 to Y337, with residue numbering based on UniProt ID: O94900-1). We then identified a suitable binding site on the human TOX model by docking 200,000 small molecules (ZINC database32, in-stock, 3D representation, with molecular weight from 375 to 400 Dalton, and logP = 1.5 +/- 1) against the HMG-box domain. We used the "blind docking" setup of GLIDE86, i.e.
no binding site was defined (Standard Precision mode with default parameters, from the Schrodinger 2016-3 package). To identify the important residues, i.e. residues with a frequency of interaction with the docked molecules of at least 10%, we applied Protein-Ligand Interaction Fingerprints (PLIF) on the top hit compounds using the Molecular Operating Environment (MOE)84.

3.2.2 In silico screening

For the initial virtual screening probing 200,000 molecules, we took the top 10% scored entries (20,000 molecules) predicted by GLIDE and applied the in-house pipeline described in Chapter 1. In particular, these 20,000 molecules were docked using the eHiTS program in the "blind docking" setup, and the molecules with a consistent pose between GLIDE and eHiTS (RMSD ≤ 3 Å, 136 molecules) were docked using ICM (default parameters). Among the processed molecules, 22 demonstrated docking consistency across all three programs (Glide, eHiTS, and ICM). Through in vitro testing on TOX-expressing Sezary cell lines, one of the candidates, VPC-190010, showed inhibition activity with IC50 less than 50 µM in all the TOX-expressing Sezary cell lines (data not shown). For the large-scale docking (7.6 million molecules), the binding site was identified using the docking pose of the initial hit compound VPC-190010 (TOX hot spot residues including Gln262, Pro264, Arg273, Lys313, Glu320, and Gln324). The screening process was divided into 1) compounds for the oral mode of delivery and 2) compounds for the topical mode of delivery. All the molecules with a 3D representation and logP > -1 were downloaded from the ZINC database. For the oral application, filters specific to the oral mode of delivery (such as a molecular weight ≥ 350 Dalton filter), drug-like criteria from FAF-Drugs4, and filters based on the initial in silico screening (such as a number of rings between 4 and 6) were applied.
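Property filters of this kind amount to a simple predicate over precomputed molecular descriptors. The sketch below is a minimal illustration only: the Molecule record and its fields are hypothetical, and in practice these descriptors would come from a cheminformatics toolkit such as RDKit or FAF-Drugs rather than being typed in by hand.

```python
# Minimal sketch of the oral-delivery property filters described above.
# The Molecule record and its example values are illustrative; real
# descriptors would be computed by a cheminformatics toolkit.
from dataclasses import dataclass

@dataclass
class Molecule:
    zinc_id: str
    mol_weight: float  # Dalton
    logp: float
    n_rings: int

def passes_oral_filters(mol: Molecule) -> bool:
    """Keep molecules matching the oral-delivery criteria from the text:
    logP > -1, molecular weight >= 350 Da, and 4-6 rings."""
    return (mol.logp > -1
            and mol.mol_weight >= 350
            and 4 <= mol.n_rings <= 6)

library = [
    Molecule("ZINC0001", 372.4, 2.1, 4),   # passes all filters
    Molecule("ZINC0002", 310.2, 1.5, 3),   # too light, too few rings
    Molecule("ZINC0003", 401.9, -2.0, 5),  # logP too low
]
kept = [m.zinc_id for m in library if passes_oral_filters(m)]
print(kept)  # ['ZINC0001']
```

Applied library-wide, a predicate like this is what reduces the >1 billion ZINC entries to the few million molecules that are then passed to docking.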
As a result, a total of ~3 million molecules were retained and then docked with Glide (Standard Precision mode with default parameters). Molecules with docking scores less than -5 (lower is better) were further docked with the FRED program from OEDocking (up to 500 conformers were generated for each molecule and docked using FRED with default parameters), and the corresponding RMSD values were calculated for the top poses. All the molecules with an RMSD ≤ 3 Å were retained and were docked again using ICM. For the poses predicted by ICM (default parameters), RMSD values against Glide were calculated and only the molecules with RMSD ≤ 3 Å were retained. Within the resulting set, theoretical pKi values were calculated for each molecule using a custom MOE SVL script. Other properties, such as ADMET (absorption, distribution, metabolism, excretion, toxicity) characteristics and pharmacokinetics predictions, were also calculated using computational programs such as ADMET Predictor, FAF-Drugs and the Quantitative Estimate of Drug-likeness (QED). In the next step, a consensus scoring method was used: molecules with a total vote greater than or equal to 7 were retained and then clustered to remove similar compounds (70% similarity threshold). Finally, a total of 66 compounds were chosen for experimental validation. For the topical application candidates, filters specific to the topical mode of delivery were applied (MW < 350 Dalton, charge = 0 and 2 ≤ LogP ≤ 440, 41), along with lead-like criteria such as those from FAF-Drugs4. In addition to the above-mentioned criteria, molecules with chiral centers ≤ 1 (vendors usually sell the racemic mixture), 2 ≤ number of rings ≤ 4 (molecules with 1 ring are too simplistic), rotatable bonds ≤ 6, and 2 ≤ hydrogen bond acceptors ≤ 7 were retained. As a result, a total of about 4.6 million molecules were retained and then docked with Glide (Standard Precision mode with default parameters).
Molecules with docking scores below a -5.0 cutoff were further docked with the FRED program from OEDocking (up to 500 conformers were generated for each molecule and docked using FRED with default parameters), and the corresponding RMSD values were calculated for the top poses. All the molecules with an RMSD ≤ 3 Å were retained and were docked again using ICM. For the poses predicted by ICM (default parameters), RMSD values against Glide were calculated and only the molecules with RMSD ≤ 3 Å were retained. Within this set, a predicted docking pKi was calculated for each molecule using a custom MOE SVL script. Other properties, such as ADMET (absorption, distribution, metabolism, excretion, toxicity) characteristics and pharmacokinetics predictions, were also calculated using computational programs such as ADMET Predictor, FAF-Drugs and the Quantitative Estimate of Drug-likeness (QED). In the next step, a consensus scoring method was used: molecules with a total consensus score greater than 5 were retained and then clustered to remove similar compounds (70% similarity threshold). Finally, a total of 52 compounds were chosen for experimental validation.

3.2.3 In vitro screening

All compounds were dissolved in DMSO as 50 mM stock solutions and diluted to the treatment concentrations with growth medium (RPMI 1640 (Hyclone, GE) containing 10% FBS (Thermofisher Scientific) and 1X Anti-anti (Antibiotic-Antimycotic, Thermofisher Scientific)). The suspension cell lines Hut78, Jurkat, K562, and U937 were purchased from ATCC. The SZ4 and Mac2A cell lines were generous gifts from Dr. Ivan Litvinov128. Cells were cultured in the growth medium and collected at the logarithmic growth phase (about 5-10 × 10^5/ml). Cells were seeded into 96-well culture plates (Nunc, Thermofisher Scientific) at 10^4 cells/well. Cells were cultured with various concentrations of the test compounds in 0.2% DMSO, or with DMSO control only (as 0 µM), in growth medium for 64-68 hours in an incubator containing 5% CO2 at 37 degrees.
The viability assay was performed using the CellTiter-Blue® Cell Viability Assay (Promega), and the fluorescent signal (579Ex/584Em) was recorded after 2 hours and 4 hours of incubation time using the Glomax Multi Detection System (Promega). All treatments were performed in triplicate, and the final value was calculated as the mean of the 3 replicates after subtracting the medium-only background. Net fluorescent signals at various concentrations were then compared to the DMSO-only control and expressed as the percentage of the surviving population. IC50 was then estimated using an online IC50 calculator (AAT Bioquest). RNA expression of TOX was measured by the dye-intercalation real-time PCR method described previously97. RNA was extracted from cells using the RNeasy purification kit (Qiagen), and cDNA template was produced using the SuperScript™ VILO™ cDNA Synthesis Kit (Invitrogen, Thermofisher Scientific). Gene expression levels were expressed as mRNA copies per 1000 glyceraldehyde-3-phosphate dehydrogenase (GAPDH) copies by standardizing to the internal housekeeping gene GAPDH. The primers used for real-time measurement are as follows: GAPDH forward AAGATCATCAGCAATGCCTCC, GAPDH reverse TGGACTGTGGTCATGAGTCCTT; TOX forward GTGCAGAAATCCTCCCCCAC, TOX reverse TTTGTCCCTCTGCATGCCC.

3.3 Results

3.3.1 Druggability assessment of the TOX HMG-box domain

For in silico screening of small molecules we used the available NMR structure of TOX (Figure 5) from the PDB database (ID: 2CO9). We first performed a pilot study by screening about 200,000 small molecules (with drug-like properties, from the ZINC database) using the "blind docking" setup of the docking software GLIDE. This was done both to investigate the druggability of the DNA-binding domain of TOX and to identify the binding hotspots of small molecules on the protein surface. Some of the hotspots are near the DNA interface on the HMG-box domain, as can be seen in Figure 6.

Figure 5. Protein structural templates for the HMG-box domain of TOX.
(a) An NMR structure of mouse TOX protein (PDB ID: 2CO9) was identified as the best structural template, with 100% sequence similarity across the 87 amino acids of the HMG-box domain, compared to the human TOX protein. (b, c) By superimposing the 2CO9 structure (orange ribbons) onto the HMG-box protein TFAM (pink ribbons, PDB ID: 3TMM, 46% sequence similarity to human TOX, 3.8 Å RMSD) in complex with DNA (green ribbons), the TOX-DNA interface was determined.

Figure 6. Druggability of the HMG-box domain of TOX. (a) TOX-DNA interface as determined from Figure 5c. TOX HMG-box domain in orange ribbons, DNA in green ribbons. (b) Molecular surface presentation of the TOX HMG-box domain, in the same orientation as (a). (c) The TOX HMG-box domain rotated by 180 degrees to illustrate the small-molecule binding hot spots (red). (d) A total of 200,000 drug-like molecules were docked to the TOX HMG-box domain. The percentage of interacting small molecules is shown for each protein residue as a bar graph (multiple interactions/contacts are represented as separate bars for each amino acid). Protein residues, including Gln262, Pro264, Arg273, Lys313, Glu320, Gln324 and Tyr328, that interact with at least 10% of the small molecules are highlighted and mapped to their corresponding locations as hot spots (red surface patches) on the TOX HMG-box domain.

3.3.2 Large-scale in silico screening

We then performed a large-scale virtual screening of about 7.6 million molecules with drug-like properties taken from the ZINC database, details of which are given in the Methods section. Finally, 118 compounds were selected based on the top consensus scores for in vitro testing: 66 compounds with a molecular weight greater than or equal to 350 Dalton, and 52 compounds with a molecular weight less than 350 Dalton.
3.3.3 In vitro experimental validation

All the compounds, including 22 compounds from the initial screening and 118 compounds from the large-scale screening, were experimentally tested using TOX-dependent CTCL cells (Hut78 cells) at 10 µM and 100 µM concentrations. Out of all 140 compounds, 18 showed concentration-dependent inhibition of cell viability in Hut78 cells. Their IC50 values were then determined in 3 TOX-high/dependent CTCL cell lines (Hut78, SZ4, Jurkat) and 3 TOX-low/independent cell lines (K562, U937, Mac2A) (Table 2). The IC50 values for these 18 compounds were lower in the TOX-high cell lines compared to the TOX-low cell lines (i.e. the molecules are selective towards the TOX-high cell lines; Table 2). Several of these small molecule inhibitors (SMIs), such as 190444 and 190414, have IC50 values in the range of 10-20 µM, more active than the hit compound 190010 that was identified in the initial in silico screen. As further illustrated in Figure 7, compounds 190444, 190414, 190447 and 190441 inhibited cell viability of the TOX-high cells (Hut78, Jurkat) selectively, compared to the TOX-low cells (K562). In addition, Figure 8 shows that compounds 190444, 190414 and 190441 increased the expression of SMAD3, which is normally suppressed by TOX97.

Table 2. Top candidates for TOX small molecule inhibitors (SMIs).
Compound VPC-ID | Average IC50 (µM) (TOX-High cells)1 | Average IC50 (µM) (TOX-Low cells)2 | TOX-selectivity index3
190444 | 16.28 | 69.94 | 4.30
190414 | 20.64 | 69.55 | 3.37
190350 | 12.01 | 33.69 | 2.80
190447 | 16.68 | 41.13 | 2.47
190410 | 15.35 | 37.45 | 2.44
190358 | 15.13 | 32.98 | 2.18
190441 | 34.76 | 64.69 | 1.86
190327 | 16.55 | 30.29 | 1.83
190343 | 33.26 | 56.29 | 1.69
190341 | 52.44 | 85.82 | 1.64
190325 | 39.77 | 64.63 | 1.63
190323 | 43.79 | 68.68 | 1.57
190339 | 34.02 | 51.83 | 1.52
190322 | 24.64 | 36.19 | 1.47
190349 | 52.88 | 70.57 | 1.33
190301 | 38.75 | 47.29 | 1.22
190354 | 58.59 | 66.12 | 1.13
190010 | 37.70 | n/a | n/a

1Average IC50 values of cell viability from 3 TOX-high/dependent CTCL cell lines (Hut78, SZ4, Jurkat). 2Average IC50 values of cell viability from 3 TOX-low/independent lymphoid cell lines (K562, U937, Mac2A). 3TOX-selectivity index = Average IC50 (TOX-Low cells) / Average IC50 (TOX-High cells).

Figure 7. Viability curves of TOX-high and TOX-low expressing cells. Cells were treated with various concentrations of compounds, (a) 190444, (b) 190414, (c) 190447 and (d) 190441, for 72 hours in a 37 °C incubator with 5% CO2. Viability was measured by the CellTiter-Blue® assay and compared to the DMSO control as described in Methods and Materials. Jurkat and Hut78 cells (solid lines) are the TOX-high expressing cell lines, while K562 (dotted line) is the TOX-low expressing cell line. (Produced in collaboration with Dr. Zhou's lab)

Figure 8. Viability curves of TOX-high and TOX-low expressing cells. Cells were treated with various concentrations of compounds, (a) 190444, (b) 190414, (c) 190447 and (d) 190441, for 72 hours in a 37 °C incubator with 5% CO2. Viability was measured by the CellTiter-Blue® assay and compared to the DMSO control as described in Methods and Materials. Jurkat and Hut78 cells (solid lines) are the TOX-high expressing cell lines, while K562 (dotted line) is the TOX-low expressing cell line. (Produced in collaboration with Dr.
Zhou's lab)

3.4 Discussion and future direction

TOX has been identified as a promising drug target for CTCL therapies based on a previous study99, and since there were no SMIs for TOX, we used this unmet opportunity to discover new TOX inhibitors. Using our established in-house pipeline (Chapter 1) and in vitro experiments, we discovered 18 SMIs for TOX, which can inhibit the viability of TOX-high/dependent cells with micromolar IC50 and up to 4-fold selectivity (Table 2). As illustrated in Figure 9, compounds 190444, 190414, 190447 and 190441 can bind at the hot spots located in close proximity to the protein-DNA interface on the HMG-box domain of TOX. These SMIs interact with TOX protein residues including Gln262, Pro264, Arg273, Lys313, Glu320 and Gln324 through hydrogen-bond and hydrophobic interactions, corresponding well to the hot spots identified in the druggability assessment (Figures 6 and 9). Given the proximity of binding, it is likely that the small molecules can interfere with TOX-DNA interactions and inhibit the activity of TOX. This hypothesis is partially supported by the experimental results: compounds 190444, 190414 and 190441 increased the expression of SMAD3, which is normally suppressed by TOX (Figure 8). To establish the hypothesis further, additional experiments, including direct binding, DNA-competition, and luciferase reporter assays, are required. Future development of drug candidates inhibiting TOX-DNA interactions can follow previous studies in which SMIs have been successfully developed via CADD to target the DNA-binding domains of other cancer drug targets such as AR, ERG, and MYC. It takes more than 3 billion dollars and 10 years for a drug to reach patients from the laboratory. We have demonstrated how virtual screening can significantly reduce both the time and the cost of the drug discovery phase.
Of the 140 compounds tested in vitro, 18 were found to be active, a hit rate of 13% (18/140), which is substantially higher than the <1% typical of conventional experimental high-throughput screening without computational guidance29, 30. These 18 hit compounds provide the foundation from which more potent TOX-SMIs can be developed through 2D/3D similarity searches (ligand-based screening) of chemical analogs against the entire ZINC database, which has grown exponentially from 700,000 compounds in 2005126 to over 1 billion molecules in 201932. Until now we were using <10 million molecules (7.6 million in the case of TOX), i.e. <1% of the entire ZINC database, discarding the remaining >99% due to time limitations30. It has been shown before42 that a larger library contains molecules better suited to a given receptor structure than can be found in smaller libraries. The ZINC database thus provides a tremendous opportunity for TOX drug discovery, where virtual screening by molecular docking can be expanded from the initial 7.6 million to all of the 1 billion molecules. In the next chapter, we describe "Progressive Docking 2.0", which trains a machine learning model to efficiently predict binding scores based on chemical structures; thus, compute-intensive docking only needs to be performed on the subset of molecules predicted to be good target-binding candidates67. We anticipate applying such a PD2.0 algorithm (with >50x speed-up) to virtually screen 1 billion molecules against TOX. Both of these approaches, similarity search and PD2.0, can greatly expand our current collection of TOX hit compounds and enable us to build Quantitative Structure-Activity Relationship (QSAR) models129 that can guide the future development of the next generation of potent and selective TOX drug candidates.

Figure 9. TOX small molecule inhibitors (SMIs) bind at the hot spots on the protein-DNA interface.
Docking poses are shown for compounds (a) 190444, (b) 190414, (c) 190447 and (d) 190441 on the HMG-box domain of TOX. For each panel: left) molecular surface representation, SMI in cyan, DNA in green ribbons, hot spots as red surface patches; right) detailed molecular interactions between the SMI (cyan) and protein residues of TOX (orange), hydrogen bonds as red dotted lines, hydrophobic interactions as green dotted lines.

Chapter 4: Accelerating docking using progressive docking 2.0

4.1 Introduction

VS of ultra-large chemical libraries is an emerging approach in CADD which was recently proven successful in identifying novel chemotypes with high potency42. Despite the great potential for drug discovery, current methods are limited to screening ~100M molecules and lag behind the unceasing growth of databases such as ZINC15 (1.3 billion structures) or Enamine REAL (700 million structures). Thus, there is a pressing need to accelerate the standard docking process. We came across this very need when we targeted the TOX protein using standard docking protocols, which limited our VS campaign to less than 1% of the ZINC database (7.6M compounds) due to the computational cost associated with larger-scale screening. In this chapter, we report the development and validation of a new deep learning-based method called "Progressive Docking 2.0" (PD2.0), which allows up to a 65-fold increase in docking speed on a variety of drug targets. Components of the developed PD2.0 protocol are shown in Figure 10. The first step of the protocol is to obtain the SMILES and calculate molecular features (fingerprints) for all the small molecules of the library to be screened. In the first PD2.0 iteration, a fixed number of molecules (e.g. 3 million structures) are randomly sampled from the database and docked using GLIDE or FRED.
The docking scores obtained are then converted to binary values (0 = low binding affinity, 1 = high binding affinity) based on a docking score threshold (details of which are explained later). In this way, we convert a regression problem into a classification problem. The data is then divided into training, testing and validation sets. QSAR models are built and evaluated, and the best model is used to predict the class (0/1) for each of the molecules in the entire dataset (e.g. 1B molecules). For the successive iterations, the same number of molecules is randomly sampled from the set of molecules predicted as positives during the previous iteration, and docked. These molecules are then added to the training dataset from the previous iteration. The validation and testing sets remain the same as in the first iteration for all iterations; thus, the distributions of scores in the validation and testing sets are conserved while the training set distribution might change. QSAR models are again generated and evaluated, and the best model is used to predict the classes for all the compounds of the original database. This process is repeated until the number of positive predictions returned by PD2.0 (e.g. 5-10 million good binding candidates) can be docked in a reasonable amount of time, given the computational resources available.

Figure 10. Different steps involved in PD2.0. Features are generated from the SMILES of the small molecules in the library and stored in a feature database. For the first iteration, a small subset of molecules (e.g. 3 million) is randomly sampled, divided into training, validation and testing sets, and docked against the protein target. A score cut-off value is chosen to convert the problem from regression to classification. A QSAR model is built and used to predict the classes for all the molecules of the database (1 = good and 0 = bad).
From the next iteration onwards, the same number of molecules (3 million) is sampled only from the positive molecules (predicted to be good binders) of the previous iteration, while 'bad' molecules are discarded. The validation and testing sets from the first iteration remain the same for all iterations, and the existing training set is enriched with the newly sampled molecules. All the other steps remain the same as in the first iteration. This process is repeated multiple times until the total number of positively predicted molecules is a reasonable number that can be docked.

4.2 Methods

4.2.1 Molecule fingerprints

Chemical fingerprints130 are a molecular format more easily interpreted by a computer than SMILES131. A fingerprint is a series of binary bits that can vary from 100 to 4k in size. Typically, a kernel is applied to generate a bit vector or count vector: molecular features are extracted, hashed, and bits are set based on the hashing. MACCS132 fingerprints rely on a binary representation of the presence and count of specific atoms and rings in a molecule. For instance, one of the MACCS bits assesses whether the molecule has more than 2 nitrogens by having a value of 1 (true) or 0 (false). Another bit will assume a value of 1 or 0 depending on the presence of more than 2 aromatic rings, and so on. The MACCS fingerprint generated with RDKit133 is a 166-bit vector. Another type of fingerprint is the Morgan130 fingerprint, a 2D representation of molecules derived from the Extended Connectivity FingerPrint134 (ECFP). These fingerprints account for internal graph connectivity, as they evaluate the neighborhood of each atom of the structure. All possible paths within a given radius through an atom are obtained, and each path is hashed into a number based on the number of specified bits. A higher radius results in bigger fragments being encoded, and more bits allow Morgan fingerprints to be more selective among molecules.
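The hash-and-set-bits idea behind these fingerprints can be sketched in a few lines. This is a deliberately simplified toy, not the actual RDKit Morgan/ECFP algorithm: the "fragments" here are plain strings standing in for the atom-environment paths a real implementation would enumerate from the molecular graph.

```python
# Toy illustration of hashed fingerprints: enumerate "fragments"
# (plain strings standing in for the bond paths a real Morgan/ECFP
# implementation would enumerate), hash each one, and set the
# corresponding bit in a fixed-size vector.
import hashlib

def hashed_fingerprint(fragments, n_bits=64):
    """Map each fragment string to a bit index via a stable hash."""
    bits = [0] * n_bits
    for frag in fragments:
        digest = hashlib.md5(frag.encode()).hexdigest()
        bits[int(digest, 16) % n_bits] = 1
    return bits

# Hypothetical fragment sets for two molecules that share two fragments.
fp1 = hashed_fingerprint(["C-C", "C-O", "O-H"])
fp2 = hashed_fingerprint(["C-C", "C-O", "C-N"])

# Tanimoto similarity between the two bit vectors: shared bits / union.
common = sum(a & b for a, b in zip(fp1, fp2))
union = sum(a | b for a, b in zip(fp1, fp2))
print(round(common / union, 2))
```

Note the trade-off the text describes: with more fragments than bits, distinct fragments collide onto the same bit, so larger bit vectors make the fingerprint more selective at the cost of size.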
Pharmacophore fingerprints135 consider any pair of pharmacophore features, such as hydrophobic atoms or groups, hydrogen bond acceptors, and hydrogen bond donors. Each pair is assigned a number of bits corresponding to non-overlapping ranges of topological distances. When evaluating the pharmacophore fingerprint of a given molecule, bits corresponding to feature pairs present in the structure at specific distance ranges are set to 1136.

4.2.2 Molecular docking
All the complexes from the PDB repository were prepared and optimized using the Protein Preparation Wizard137 of the Schrödinger suite. Receptors were prepared for docking using the Make Receptor utility from OpenEye and Schrödinger's Grid Preparation Wizard. Docking was performed using OpenEye's FRED87 docking program and Schrödinger's GLIDE86 program.

4.2.3 Clustering
In clustering, objects that are similar to each other are grouped together, while dissimilar objects are assigned to different groups. Extracting representative objects from each cluster, instead of sampling at random, ensures that the diverse object space is covered more accurately.

4.2.3.1 K-means clustering
In the K-means9 clustering algorithm, a data point is associated with the cluster (out of K groups) whose centroid (mean) is most similar to it. Thus, for each data point x, the corresponding cluster is the one which satisfies:

\mathrm{cluster}(x) = \arg\min_{c_i \in C} \mathrm{distance}(c_i, x)   (1)

Here c_i is a centroid in the collection of centroids C, and the similarity measure is usually the squared L2 (Euclidean) distance. After each iteration, in which all data points are assigned to their respective clusters, the centroids are recalculated:

c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j   (2)

where S_i is the set of data points of the i-th cluster. The process is repeated until convergence, i.e. until all the samples are assigned to the same cluster in consecutive iterations.
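The assignment and update steps of Equations 1-2 can be sketched in a few lines of NumPy. This is a didactic implementation, not the one used in the pipeline; the initialization here simply takes the first k points, whereas in practice multiple random initializations are run and the lowest-RSS solution is kept.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain K-means: assign each point to the nearest centroid (squared L2,
    Equation 1), then recompute centroids as cluster means (Equation 2)."""
    centroids = X[:k].copy()  # naive init; real runs use several random restarts
    for _ in range(n_iter):
        # Equation 1: nearest-centroid assignment under squared L2 distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Equation 2: centroid update as the mean of each cluster
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):  # convergence: assignments are stable
            break
        centroids = new
    return labels, centroids

# Two tight groups of points; K-means should separate them cleanly.
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0],
              [5.1, 5.0], [0.0, 0.1], [5.0, 5.1]])
labels, cents = kmeans(X, k=2)
```

On this toy data the algorithm recovers the two groups in two iterations.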
As K-means algorithms converge to local optima, it is important to run multiple initializations and choose the best one based on the residual sum of squares (RSS) objective function:

\mathrm{RSS}_k = \sum_{x \in \omega_k} \lVert x - \mu(\omega_k) \rVert^2   (3)

The time complexity of K-means clustering using the squared L2 distance as the similarity measure is O(nkId), where n stands for the number of data points, k for the number of centroids, I for the number of iterations and d for the dimensionality of the data. In our PD2.0 pipeline, we deal with more than 1 billion molecules, each one having a Morgan fingerprint of 1024 bits. Thus, running standard K-means clustering for any realistic number of clusters was computationally infeasible. Similarly, searching for an optimal number of clusters was infeasible, since even a single run would have taken an extremely large amount of time.

4.2.3.2 Mini-batch K-Means clustering
Mini-batch K-means138 is an approximation of K-means clustering which uses a mini-batch of fixed size for updating the cluster centers at each iteration. The method is fast and suitable for large data sets, and it also requires less memory. A detailed trade-off between accuracy and speed is described in the paper by Feizollah et al.139 The major takeaway is that, as the number of clusters increases, the speed gets an exponential boost, while at the same time the clusters start to diverge from the K-means clusters. This issue becomes more prominent as the data set size increases.

4.2.4 Logistic regression
Binary logistic regression4 uses the logistic regression algorithm to assign observations to one of two discrete classes. Logistic regression is based on probability: it uses a sigmoid/logistic function (Equation 4) to map the probability of an observation between 0 and 1. The decision boundary generated by logistic regression is linear in nature and is determined by a cut-off value chosen based on some accuracy metric.
\mathrm{Sigmoid:}\quad \sigma(z) = \frac{1}{1 + e^{-z}}   (4)

The objective/cost function used for logistic regression is a convex function called the logistic loss/cross-entropy loss140 (Equation 5). This loss is minimized to obtain the globally optimal solution.

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]   (5)

4.2.5 Support Vector Machine
The goal of the Support Vector Machine (SVM) is to find the hyperplane which separates the N-dimensional space into distinct classes by maximizing the distances (margins) between the data points belonging to different classes6. SVM uses support vectors, i.e. the data points of each class which are closest to the hyperplane. The presence of a margin is a form of reinforcement towards classifying future data with higher confidence. The loss function representing the margins is the hinge loss with an L2 regularizer140 (Equation 6), which is minimized. The defined loss is a convex function, thus the solution obtained is a global optimum.

J(\theta) = \lambda \lVert w \rVert^2 + \sum_{i=1}^{n} \left(1 - y_i \langle x_i, w \rangle\right)_+   (6)

The decision boundary obtained is again linear in nature, but non-linear decision boundaries can be obtained via the kernel trick141. Here a kernel function is used to map the non-linearly separable input data to higher-dimensional separable features. The solution is then mapped back to the non-linearly separable space. The most commonly used kernels include the Radial Basis Function142 (RBF) kernel (Equation 7) and the polynomial kernel141 (Equation 8).

K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)   (7)

K(x, x') = \left(x^{T} x' + c\right)^d   (8)

4.2.6 Random forest
Random forest5 (RF) consists of an ensemble of decision trees, each built using a bootstrap sample and a subset of features. Decision trees usually have high variance because they overfit the training data; this overfitting can be reduced by using multiple decision trees that make independent mistakes and taking decisions based on the majority.
Trees making independent mistakes can be obtained by using bootstrap samples, i.e. multiple training sets generated by sampling with replacement, and by only using a subset of features for each tree. This strategy not only ensures diversity among the trees but also reduces the training time, since not all the features are used for training. All the decision trees can be trained independently in parallel. During the final prediction stage, results from each tree are obtained and a decision is made based on the majority.

4.2.7 Feedforward neural network
The neural network8 (NN) approach was initially inspired by how human neurons work. A typical NN consists of multiple neurons (nodes) arranged in multiple layers, with all nodes in adjacent layers having connections to each other. Layers are divided into an input layer (data features), an output layer (predictions) and hidden layers (intermediate layers connecting the input to the output). Each node in a hidden layer transforms the input features to generate more refined features (activations) by taking a linear combination of all the input features and applying a non-linear transformation, such as the rectified linear unit, to it:

\hat{y}_i = w^{T}\, h\!\left(W_L \cdots h(W_1 x_i)\right)   (9)

where the W_l are the weight matrices of the hidden layers and h is the non-linear activation. Finally, these refined features are sent to an output layer to perform either a regression or a classification task. The loss function is different for regression (mean squared error, Equation 10) and classification tasks (cross-entropy error, Equation 5); it is non-convex in nature, due to the introduction of non-linearity after each layer, and it gives a locally optimal solution.

J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)^2   (10)

4.2.8 Evaluation metrics
Evaluation metrics are used to test the performance of different models and select the best one. A variety of evaluation metrics is available for regression and classification tasks.
For our classification models, we have used the Receiver Operating Characteristic (ROC) curve, precision-recall, and enrichment, which are discussed below. The ROC curve is a plot of the true positive rate (TPR, or sensitivity or recall, Equation 11) against the false positive rate (FPR, or fall-out, Equation 12) at multiple threshold settings.

\mathrm{TPR} = \frac{TP}{TP + FN}   (11)

\mathrm{FPR} = \frac{FP}{FP + TN}   (12)

TPR and FPR both increase simultaneously, but ideally TPR should increase much faster than FPR at the beginning and plateau at 1, while FPR should increase slowly to 1. The area under the ROC curve (AUC) is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Thus, a model with a higher AUC works better than a model with a lower value. In a classification task, precision is defined as the total number of true positives (TP) divided by the total number of elements the model predicted to be positive (TP + false positives, FP) (Equation 13). Precision represents what fraction of the predictions for a specific class is actually correct.

\mathrm{Precision} = \frac{TP}{TP + FP}   (13)

On the other hand, recall is the total number of TP divided by the total number of elements that actually belong to the positive class (TP + false negatives, FN) (Equation 14), and it represents the fraction of elements of a class correctly classified by the model.

\mathrm{Recall} = \frac{TP}{TP + FN}   (14)

Thus, both a high recall and a high precision score are desirable in a model. We defined two types of enrichment: 1) full predicted database enrichment (FPDE) and 2) top n enrichment. FPDE (Equation 16) is defined as the ratio between the fraction of predicted positives that are truly positive (precision, Equation 13) and the fraction of the positive class in the dataset (random precision, Equation 15). FPDE indicates the TP enrichment of the dataset after a model iteration compared to the initial dataset.
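The precision, recall and FPDE definitions above can be computed directly from binary labels; the sketch below (our own helper, written for illustration) mirrors Equations 13-16.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision (Eq. 13), recall (Eq. 14) and FPDE (Eq. 16) for binary
    labels, where 1 = good binder and 0 = bad binder."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    random_precision = y_true.mean()      # Equation 15: positive fraction
    fpde = precision / random_precision   # Equation 16
    return precision, recall, fpde

# Toy example: 4 actual positives out of 10; the model flags 4 molecules.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
p, r, e = classification_metrics(y_true, y_pred)  # p=0.75, r=0.75, FPDE=1.875
```

Here 3 of the 4 flagged molecules are true positives, so precision is 0.75 against a random precision of 0.4, giving an FPDE of 1.875.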
Top n enrichment (Equation 17) is defined as the ratio between the number of positive-class elements in the top n data points predicted by the model (where the rank is established by the DNN probabilities) and the number of positives among n data points randomly selected from the same dataset.

\mathrm{Random\ Precision} = \frac{\#\mathrm{positives\ in\ dataset}}{\#\mathrm{molecules\ in\ dataset}}   (15)

\mathrm{FPDE} = \frac{\mathrm{Model\ Precision}}{\mathrm{Random\ Precision}}   (16)

\mathrm{Top}\ n\ \mathrm{enrichment} = \frac{\#\mathrm{positives\ in\ model's\ top}\ n\ \mathrm{ranking}}{\#\mathrm{positives\ in\ random}\ n}   (17)

4.2.9 Conformal prediction
Conformal prediction is a method that can be layered on top of any classification or regression algorithm, such as logistic regression, SVM or a neural network. It was invented by Vovk et al.143 The main advantage over threshold-based approaches is that users can define a confidence level: given a significance threshold ε, it is guaranteed that the prediction contains the actual label with an error rate of at most ε (e.g. for ε = 0.1 we only miss 10% of the actual compounds). For a binary classification model, four prediction outcomes are possible: class A, class B, both classes A and B, or the empty set. For our application, we set all predictions that were not exclusively class A to the negative class (class B). We used Inductive Conformal Prediction (ICP) over the transductive approach, since the transductive method is very computationally expensive.

For inductive conformal prediction, a model is built on a training set and then a calibration set (validation/test set) is used to calculate the nonconformity scores. These scores capture how different the calibration set is from the training set. Once the calibration is done, to predict on a new dataset the p-value for each class (in our case binary) is calculated, and the classes for which the p-value is greater than a predefined threshold ε are assigned to the data point. For example, for a binary classification problem, let the p-values for class 0 and class 1 be 0.05 and 0.3 respectively.
Then, given ε = 0.1 (90% confidence, or a 10% error rate), the data point will be assigned to class 1. With this method, one can make sure that the right label is present in the prediction set with a probability of 1 − ε.

4.3 Results and discussion

4.3.1 Molecular featurization
To evaluate the effect of different molecular fingerprints on model performance, a random subset of 1 million molecules from the ZINC 3D database32 (570 million molecules) was divided into training, testing and validation sets, and docked with FRED87. MACCS, Morgan, and pharmacophore fingerprints were calculated for all 1 million molecules. Models were built for each fingerprint using the same hyperparameters, and different top n enrichment values (top 10, top 100, top 1000, top 10000) and FPDE at a fixed recall of 90% were calculated. The fingerprint with the highest values in the majority of enrichment scores was selected for PD2.0. Enrichment values are shown in Table 3. The best scoring fingerprint was Morgan with a dimension of 1024 bits and radius 2.

Table 3. Different enrichment values for MACCS, Morgan, and pharmacophore fingerprints.

Fingerprint    | Top 10 | Top 100 | Top 1000 | Top 10000 | FPDE (90% recall)
MACCS          | 71     | 50      | 40       | 23        | 3.7
Morgan_512_2   | 71     | 92      | 59       | 30        | 4.1
Morgan_512_3   | 141    | 89      | 58       | 27        | 3.5
Morgan_1024_2  | 164    | 103     | 71       | 31        | 4.2
Morgan_1024_3  | 70     | 92      | 64       | 29        | 3.8
Morgan_2048_2  | 141    | 82      | 69       | 31        | 4.3
Morgan_2048_3  | 94     | 85      | 65       | 29        | 3.7
Pharmacophore  | 0      | 14      | 22       | 14        | 1.7

4.3.2 Database sampling
To evaluate different sampling techniques, 3 million molecules (test dataset) were randomly sampled from the ZINC database (1 billion molecules) and docked against ER-AF2 using GLIDE. The molecules were then clustered using mini-batch K-means clustering into 10, 100, 500, 1000, 2000, 3000, 4000 and 30000 clusters.
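A minimal sketch of this clustering step, assuming scikit-learn's MiniBatchKMeans; the random binary matrix below merely stands in for the fingerprint vectors, and the cluster counts and quota of 100 sampled molecules are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 64)).astype(float)  # toy binary "fingerprints"

# Mini-batch K-means: cluster centers are updated from fixed-size batches.
mbk = MiniBatchKMeans(n_clusters=10, batch_size=256, n_init=3, random_state=0)
labels = mbk.fit_predict(X)

# Proportional sampling quota: draw from each cluster in proportion to its
# size, as done when building training sets from clusters.
counts = np.bincount(labels, minlength=10)
quota = np.maximum(1, (counts / counts.sum() * 100).round().astype(int))
```

The same `fit_predict` call scales to much larger arrays, which is what makes the mini-batch variant usable where standard K-means is not.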
We then performed a single iteration of PD2.0 by sampling from clusters (where the number of molecules sampled from each cluster was proportional to the size of the cluster), as well as by performing random sampling on the whole database. For each cluster size, we sampled training, validation, and testing sets, each consisting of the same number of molecules. We used the same hyperparameters for building the models. We tested different sample sizes and compared the stability, generalizability, and accuracy of the two methods. The criterion for determining the stability of a model was to retrain the model multiple times (5 times), resampling each time, and to analyze how stable the FPDE values were for the test dataset. Generalizability was determined based on how the validation precision (criterion 2) and recall (criterion 1) transferred to the test dataset, and accuracy was measured as the FPDE (at 90% recall) of each model. We observed that model stability, as well as accuracy, was similar irrespective of whether sampling was done from clusters or at random, at all the sample sizes (Figure 12). The model generalizability was also very similar. The plots of the ratio of test recall to validation recall and of test precision to validation precision are shown in Figures 13 and 14.

To assess how cluster compactness related to the docking scores, we randomly sampled and docked 10000 molecules. We then performed clustering at multiple Tanimoto similarity144 cutoffs. We observed that at higher cutoffs (0.8-0.9) the molecules within each cluster were similar, but too many clusters were generated, including many singletons. When the cutoff was set to 0.5 or less, we observed fewer clusters, but the docking score distribution within each cluster was the same as the distribution of the whole dataset (Figure 11).
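Tanimoto-based clustering of this kind can be sketched with RDKit's Butina clustering utility; the molecules below and the 0.5 cutoff are illustrative only.

```python
# Cluster a handful of toy molecules by Tanimoto distance (1 - similarity)
# using RDKit's Butina algorithm.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "c1ccccc1O", "CC(=O)O"]
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=1024)
       for s in smiles]

# Butina expects the lower-triangle distance matrix as a flat list.
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

clusters = Butina.ClusterData(dists, len(fps), 0.5, isDistData=True)
```

Each element of `clusters` is a tuple of molecule indices, with the first index of each tuple acting as the cluster centroid.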
After all these different experiments, we could safely conclude that clustering does not provide any significant advantage over random sampling, given the heavy computational cost associated with it.

Figure 11. Docking score distribution of the entire dataset versus the docking score distribution of the largest cluster when a Tanimoto cutoff of 0.5 was used. The distributions overlap; hence the molecules within the cluster have the same distribution of scores as random sampling.

Figure 12. Accuracy and stability of the QSAR model obtained from samples of clustered and unclustered data. Each panel is a plot of FPDE (at 90% recall) against the number of clusters for different sample sizes ranging from 3000 to 480000. Neither the mean nor the standard deviation shows any trend for either sampling technique.

Figure 13. Generalizability (criterion 1) of the QSAR model from samples of clustered and unclustered data. Each panel is a plot of the ratio between test and validation recall values against the number of clusters for different sample sizes ranging from 3000 to 480000. Neither the mean nor the standard deviation shows any trend for either sampling technique.

Figure 14. Generalizability (criterion 2) of the QSAR model from samples of clustered and unclustered data. Each panel is a plot of the ratio between test and validation precision values against the number of clusters for different sample sizes ranging from 3000 to 480000. Neither the mean nor the standard deviation shows any trend for either sampling technique.

4.3.3 QSAR model for docking scores
We tested four different machine learning models: RF, logistic regression, SVM and DNN, to select the best-suited one for PD training. 3 million molecules were sampled and docked against the ER-AF2 site to obtain an evaluation dataset for model testing. Furthermore, a basic grid search was performed to select the best hyperparameters.
Enrichment values for all the models are reported in Table 4. SVM took an extraordinary amount of time to train compared to the other approaches (>10 hours), therefore the method was discarded. DNN generated the highest enrichment values, as we expected given the large size of the dataset. It should also be noted that docking is the rate-determining step in the PD2.0 pipeline. Thus, even the small enrichment improvement of DNN compared to RF results in a significantly lower number of molecules left to be docked after the final iteration, and hence a notable speed boost. From this analysis, we concluded that DNN was the best model for our application and used it for all further modeling.

The general architecture of the DNN used for PD2.0 consists of 1-4 hidden layers, each layer consisting of the same number of neurons. A dropout layer is present after each fully connected layer. The number of hidden layers, the number of neurons and the dropout frequency were the hyperparameters of the model. The scheme is shown in Figure 15.

Table 4. Comparison of different QSAR models: feedforward neural network (DNN), random forest (RF), and logistic regression. DNN provides the best scores for all 5 enrichment values (top 10, top 100, top 1000, top 10000, FPDE at 90% recall).

Method              | Top 10 | Top 100 | Top 1000 | Top 10000 | FPDE (90% recall)
DNN                 | 164    | 103     | 71       | 31        | 4.2
RF                  | 117    | 75      | 59       | 28        | 4.1
Logistic Regression | 47     | 49      | 46       | 24        | 3.7

Figure 15. The architecture of the DNN used for the QSAR model. One block is the combination of one fully connected layer and one dropout layer. The number of blocks, the number of neurons and the dropout frequency are the hyperparameters of the model. The input layer consists of 1024 neurons taking values from Morgan fingerprints with radius 2 and 1024 bits. The output layer consists of two neurons, as the problem is one of binary classification.
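The block structure of Figure 15 (fully connected layer + ReLU + dropout, repeated, then a two-neuron softmax output) can be sketched as a plain NumPy forward pass. The layer sizes here (two hidden blocks of 256 neurons) are hypothetical examples of the hyperparameters, not the values used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def dnn_forward(x, weights, drop=0.2, train=False):
    """Forward pass through [FC -> ReLU -> Dropout] blocks, then a 2-unit
    softmax output, mirroring the block architecture in Figure 15."""
    h = x
    for W, b in weights[:-1]:
        h = np.maximum(0.0, h @ W + b)                # fully connected + ReLU
        if train:                                      # dropout only in training
            h *= (rng.random(h.shape) >= drop) / (1.0 - drop)
    W, b = weights[-1]
    z = h @ W + b
    e = np.exp(z - z.max(axis=1, keepdims=True))      # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

# 1024-bit Morgan input, two hidden blocks of 256 neurons (illustrative sizes).
sizes = [1024, 256, 256, 2]
weights = [(rng.normal(0, 0.01, (a, b)), np.zeros(b))
           for a, b in zip(sizes[:-1], sizes[1:])]
probs = dnn_forward(rng.integers(0, 2, (5, 1024)).astype(float), weights)
```

The class-1 column of `probs` corresponds to the model-predicted probability used later for thresholding and ranking.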
4.3.4 Progressive docking 2.0
We validated PD2.0 on a wide range of protein targets. We screened 1.3 billion small molecules from ZINC15 against ER-AF2 using GLIDE. To our knowledge, this is the first successful virtual screening performed on more than 1 billion molecules. We further evaluated PD2.0 robustness by docking 570M molecules on 12 drug targets belonging to different protein families (Table 1), using FRED. Thus, we conclude that PD2.0 is a suitable technique for large-scale virtual screening which can be used for a variety of protein targets and docking programs.

4.3.4.1 Training data for PD2.0
To determine the optimal number of molecules to sample and dock at each PD iteration, we plotted the training sample size against FPDE with recall fixed at 90% (Figure 16): larger training sample sizes lead to higher FPDE values. We also observed a decrease in the standard deviation with increasing sample size, indicating an increase in model stability (Figure 17). Consequently, we decided to sample at least 1 million molecules for each training iteration. In future applications of PD2.0, the training sample size can be user-determined based on the available computational resources. In order to convert the continuous docking score values into a classification problem, we chose a cut-off value based on a required number of positive molecules (molecules with a docking score lower than the cutoff value) to be observed in the validation set after the last iteration.
At each i-th iteration, the cut-off was chosen as the value that returns a rescaled number of positive molecules observed in the validation set equal to:

N_{pos,i} = N_{pos,last} \cdot \frac{mol_{i-1}}{mol_{last}}   (18)

where N_{pos,last} is the minimum number of positive molecules in the validation set after the last iteration (200 in our case), mol_{i-1} is the number of molecules left after the (i−1)-th iteration and mol_{last} is the minimum number of molecules to be left after the last iteration (15 million in our case). The best hyperparameter combination was selected as the model with a recall of 90% and the highest precision or FPDE. A decrease of the docking score cut-off after each iteration was expected and observed, indicating that the definition of positive molecules (i.e. good binders) improves after each iteration. In order to deal with a highly imbalanced classification dataset (ranging from 0.1% to 2% positives), we used both oversampling of the minority class and class weights, where the class weight and the oversampling ratio were hyperparameters of the model.

Figure 16. Plot of FPDE against training dataset size for random sampling. FPDE increases with increasing dataset size.

Figure 17. Generalizability (criterion 1)/recall ratio for random sampling. (a) The recall ratio, i.e. the ratio of validation recall to test recall, becomes more stable (smaller spread) with bigger sample sizes. (b) The standard deviation of the recall ratio goes down with larger sample sizes.

4.3.4.2 PD2.0 on Estrogen Receptor Activation Function 2 (ER-AF2)
The ER-AF2 site was obtained from the Protein Data Bank (PDB 3UUD145) and prepared using the Schrödinger 2016 package as described in Chapter 1. All 1.3 billion SMILES of the ZINC15 2D database were downloaded, and Morgan fingerprints with radius 2 and 1024 bits were calculated. We then applied our PD2.0 pipeline to accelerate Glide docking on the AF2 site.
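The iteration-dependent cut-off selection of Equation 18 can be sketched as follows. This is a minimal illustration where the helper names are our own; the example numbers (200 positives, 15 million final molecules, a 1.3-billion-molecule library) come from the text.

```python
import numpy as np

def target_positives(n_pos_last, mol_prev, mol_last):
    """Equation 18: number of validation-set positives to aim for at
    iteration i, rescaled by the molecules remaining after iteration i-1."""
    return n_pos_last * mol_prev / mol_last

def choose_cutoff(val_scores, n_pos_target):
    """Pick the docking-score cutoff so that n_pos_target validation
    molecules score below it (lower docking score = better binder)."""
    ranked = np.sort(val_scores)
    return ranked[min(int(n_pos_target), len(ranked)) - 1]

# First iteration: the full 1.3B library remains, so the target is rescaled up.
n = target_positives(n_pos_last=200, mol_prev=1.3e9, mol_last=15e6)
```

As the number of remaining molecules shrinks across iterations, the target shrinks with it, which pushes the cutoff down and tightens the definition of a positive molecule.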
4.3.4.2.1 Threshold-based prediction
All the molecules with a model-predicted probability above a threshold were assigned to the positive class. We determined this threshold as the value at which the model recall was 90%, as discussed previously. A new DNN model was trained at each PD iteration. A basic hyperparameter grid search was performed and the best model was chosen for prediction on the entire dataset (1.3 billion molecules). Table 5 shows the statistics of the best model at each iteration.

Table 5. Different model statistics for threshold-based prediction after each iteration for ER-AF2. ROC-AUC, different enrichment values and recall values for the best model are reported. We observed an increase in all enrichment values after each iteration. The model cutoff/definition of good molecules improved after each iteration as well.

Iteration | Model cutoff | Model ROC AUC | Top 10 | Top 100 | Top 1000 | FPDE  | Model Recall
1         | -5.24        | 0.923         | 39     | 28      | 26       | 4.14  | 90
2         | -5.84        | 0.97          | 218    | 144     | 104      | 11.11 | 90
3         | -6.95        | 0.994         | 2143   | 1821    | 482      | 74.3  | 90.4
4         | -7.14        | 0.997         | 4801   | 3361    | 696      | 130   | 90.9

The enrichment of the dataset increased with the number of iterations, since sampling at each iteration was performed over the positive dataset derived from the previous iteration, which was by construction a more enriched dataset (Table 5). Thus, with the PD2.0 protocol we achieved a final FPDE of 130x while retaining 90% of the high-affinity molecules. In the top n enrichment, we evaluated the number of 'good' molecules, i.e. with docking scores below the threshold used for the iteration, within the top n predicted molecules, where the rank was provided by the probability values of the DNN model. Thus, molecules with higher probabilities were expected to be TP with higher confidence, and consequently, top-ranked portions of the dataset would have higher enrichments.
As anticipated, enrichment values peaked when considering the top 10 predicted molecules and decreased when considering the top 100, the top 1000 or the whole predicted dataset. In the final iteration, the corresponding 'top 100' enrichment was over 3000x.

Table 6 shows the number of positive samples, based on the same threshold value, across the different iterations. Notably, we observed an exponential increase in positive molecules in the dataset after each iteration, which generated more balanced datasets and allowed us to safely decrease the cutoff value.

Table 6. The total number of positive-class molecules, based on a fixed cutoff value of -7, in the datasets obtained at different iterations.

Iteration | Number of positive molecules (per million)
1         | 71
2         | 362
3         | 1010
4         | 5752
End       | 10370

The distribution of the docking scores after each iteration is plotted in Figure 18. As expected, the overall distribution moved towards molecules with lower docking scores after each iteration. This also implies that sampling from more enriched datasets after each iteration would give more balanced classes, as also shown in Table 6.

Figure 18. Docking score distribution plot for ER-AF2 at all iterations. The score distribution shifted towards the left, i.e. better molecules; the shift was more prominent at the initial iterations than at the later ones. This implies that enrichment slowed down as we proceeded through more iterations.

In order to externally validate our model, we docked 10 million molecules randomly sampled from the entire ZINC database. We then used the best model at each iteration to predict their classes (0 = bad binders, 1 = good binders) and compared these statistics with the internal test set (Table 7). Statistical results were consistent between the internal test set and the external validation set.

Table 7. Comparison of model statistics for threshold-based prediction between the internal test set and the external validation set for ER-AF2.
ROC-AUC, FPDE, and recall for the best model are reported. All the statistics are consistent between the internal test set and the external validation set.

Iteration | Validation set | Model cutoff | Model ROC AUC | FPDE  | Model Recall (%)
1         | Internal       | -5.24        | 0.923         | 4.14  | 90
1         | External       |              | 0.924         | 4.2   | 89
2         | Internal       | -5.84        | 0.97          | 11.11 | 90
2         | External       |              | 0.972         | 11.4  | 90.87
3         | Internal       | -6.95        | 0.994         | 90    | 92.7
3         | External       |              | 0.991         | 87.8  | 88
4         | Internal       | -7.14        | 0.997         | 130   | 90.9
4         | External       |              | 0.997         | 136   | 93

After each iteration, we also compared the molecules predicted to be TP with the actual TP molecules. We used the precision values of the model to calculate the TP returned by PD2.0 at each iteration and compared them with the docking results. Thus, we docked 1 million random molecules from the positive dataset after each iteration and evaluated the number of molecules below the threshold, as shown in Table 8. The predicted and docking values were consistent, which further reinforced that the model statistics were consistent with the prediction statistics. Unfortunately, in order to assess the consistency of the recall values, we would have had to dock the entire ZINC database, which would have taken years to complete. For this reason, 10 million compounds were considered as the external validation set.

Table 8. A comparison of good molecules (TP) returned by PD2.0 and good molecules returned by docking. The predicted and docking values are very similar.

After iteration | Cutoff | #Good molecules (predicted, per million) | #Good molecules (docking, per million)
1               | -5.24  | 106000                                   | 108000
2               | -5.84  | 51000                                    | 51400
3               | -6.95  | 7200                                     | 70400
4               | -7.14  | 6400                                     | 6000

4.3.4.2.2 Conformal prediction
We evaluated the docking results using conformal prediction (CP) statistics as described in Section 4.2.9. For each iteration, the same model as before was used for CP.
We observed that all the model statistics, including recall and enrichment values reported in Table 9, are almost identical to those observed for the threshold-based approach (Table 5); the only difference was noticed for the first iteration. Table 10 compares the model statistics from the internal test set and the external validation set. Again, almost identical statistics to those of the threshold-based approach were observed. Based on the above observations, we concluded that for our task threshold-based prediction worked on par with conformal prediction.

Table 9. Different model statistics for CP after each iteration for ER-AF2. Enrichment values and recall values for the best model are reported.

Iteration | Model cutoff | Top 50 | Top 100 | Top 1000 | FPDE | Model Recall
1         | -5.24        | 33     | 31      | 27       | 6.8  | 74.5
2         | -5.84        | 168    | 168     | 113      | 11.5 | 91
3         | -6.95        | 3428   | 1821    | 503      | 73.2 | 92.7
4         | -7.14        | 3637   | 3839    | 707      | 130  | 90.1

Table 10. Comparison of different model statistics for CP between the internal test set and the external validation set for ER-AF2. FPDE and recall of the best model at each iteration are reported.

Iteration | Validation set | Model cutoff | FPDE | Model Recall (%)
1         | Internal       | -5.24        | 6.8  | 74.5
1         | External       |              | 6.4  | 74.6
2         | Internal       | -5.84        | 11.5 | 91
2         | External       |              | 12.2 | 90.1
3         | Internal       | -6.95        | 73.2 | 92.7
3         | External       |              | 78.3 | 89.4
4         | Internal       | -7.14        | 130  | 90.1
4         | External       |              | 136  | 93

4.3.4.3 Speed analysis
For our ER-AF2 study, we used 200-300 CPU cores for docking with GLIDE and 4 GPUs for model training and prediction. In future applications of PD2.0, one can tweak different features of the method depending on the amount of resources available, for example the number of molecules to be docked at each iteration or the number of hyperparameters to be searched. The overall timeline of PD2.0 is shown in Table 11. We performed 4 iterations, which took about 12 days, and docking the 8 million molecules predicted after the last iteration took about 6 days.
Thus, the overall time required to virtually screen 1.3 billion molecules against ER-AF2 using PD2.0 was around 18 days. Docking all 1.3 billion molecules using GLIDE with the same amount of resources would have taken about 2.5 years. Thus, PD2.0 offers a speedup of about 65x. PD2.0 is highly scalable to billions of molecules, with higher speed and enrichment gains expected. Figure 19 shows the projection of the speedup against different sizes of the molecule database.

Table 11. The time required for the individual steps of the PD2.0 pipeline. The major time-consuming step is docking. One PD2.0 iteration takes 3 days using 200-300 CPU cores and 4 GPUs.

Step (per iteration)  | Time
2D to 3D optimization | <4 hours
Docking               | 2 days
Model making          | <18 hours
Prediction            | <5 hours
Overall               | ~3 days per iteration

Figure 19. Time projection of PD2.0 and regular docking against different database sizes. As the size of the molecule database increases, the difference between the speed of PD2.0 and regular docking becomes more accentuated. Projections were generated under the assumption that, when dealing with more molecules, the number of iterations as well as the inference time will increase. Since the time per iteration is very small, due to the low number of molecules required to dock (3 million), a few additional iterations will still lead to a significant gain in speed compared to docking a library of billions of molecules.

4.3.4.4 PD2.0 on diverse drug targets
We used PD2.0 to screen 570 million molecules from ZINC15 3D against 12 relevant drug targets (Table 1, Chapter 1). The DNN model described previously was trained at each iteration. A basic hyperparameter grid search was performed and the best model was chosen for prediction on the entire dataset (570 million molecules). Docking was performed using OpenEye's FRED. We generated an external validation set by docking 9 million molecules, randomly sampled from the database, against all targets.
We then used the best model of each iteration to predict the scores of these 9 million molecules and compared the statistics with the internal test set (the test set used during training). We calculated model ROC-AUC, enrichment values and recall for all 12 targets at all iterations (Tables 13-24). The enrichment of the dataset with good molecules increased with the iterations, as discussed in the case of ER-AF2 (Figure 35). The results were also consistent between the internal and external validation test sets for all targets (Tables 13-24). The distributions of docking scores after each iteration are plotted in Figures 23-31. As already observed for ER-AF2, the distributions of scores shifted towards better-scoring molecules after each iteration. The mean docking score (calculated on 1 million molecules randomly sampled after each iteration) decreased with the iterations, but the rate of decrease slowed down at later iterations. The mean value, as well as the rate of decrease, were target-dependent (Figure 33). The number of molecules retained after each iteration is plotted in Figure 34, which shows that this rate of decrease also slowed down at later iterations and was highly target-dependent. The enrichment values (top 10, 100, 1000) were also target-dependent (Figure 35). Model recall, ROC-AUC and FPDE for the last iteration are reported in Table 12. By using PD2.0 on 12 relevant drug targets, we achieved FPDE values up to ~78x and 'top 100' enrichment values up to ~2000x.

Table 12. Model ROC-AUC, FPDE and recall for all 12 protein targets for the last iteration.

Protein | Model ROC-AUC | FPDE | Model Recall (%)
1T7R | 0.97 | 35 | 87
1ERR | 0.97 | 37 | 90
5YCP | 0.98 | 38 | 90
2ZV2 | 0.95 | 18 | 88
5L2S | 0.98 | 18 | 92
4AG8 | 0.97 | 34 | 88
5MZJ | 0.98 | 31 | 90
6IIU | 0.97 | 22 | 83
4ZUD | 0.97 | 11 | 91
5EKO | 0.98 | 78 | 90
4F8H | 0.98 | 48 | 90
5OSB | 0.98 | 30 | 90

Table 13. Model cutoff, ROC-AUC, FPDE and recall for Androgen Receptor (AR) after each iteration.
All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -14.00 | 0.93 | 4 | 90
1 | External | -14.00 | 0.93 | 4 | 90
2 | Internal | -14.74 | 0.96 | 9 | 89
2 | External | -14.74 | 0.96 | 9 | 90
3 | Internal | -15.07 | 0.96 | 13 | 89
3 | External | -15.07 | 0.96 | 13 | 90
4 | Internal | -15.25 | 0.98 | 22 | 87
4 | External | -15.25 | 0.98 | 22 | 88
5 | Internal | -15.63 | 0.98 | 35 | 87
5 | External | -15.63 | 0.98 | 35 | 86

Figure 20. Docking score distribution for AR. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 14. Model cutoff, ROC-AUC, FPDE and recall for Estrogen Receptor (ER) after each iteration. All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -13.63 | 0.93 | 4.3 | 90
1 | External | -13.63 | 0.92 | 4.3 | 90
2 | Internal | -13.88 | 0.97 | 9.7 | 91
2 | External | -13.88 | 0.96 | 9.6 | 90
3 | Internal | -15.00 | 0.97 | 37 | 90
3 | External | -15.00 | 0.97 | 34 | 85

Figure 21. Docking score distribution for ER. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 15. Model cutoff, ROC-AUC, FPDE and recall for Peroxisome Proliferator-Activated Receptor (PPARγ) after each iteration. All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -13.00 | 0.92 | 4 | 90
1 | External | -13.00 | 0.92 | 4 | 90
2 | Internal | -14.04 | 0.97 | 10 | 90
2 | External | -14.04 | 0.97 | 10 | 90
3 | Internal | -14.48 | 0.98 | 16 | 92
3 | External | -14.48 | 0.98 | 16 | 91
4 | Internal | -14.70 | 0.97 | 23 | 91
4 | External | -14.70 | 0.98 | 23 | 91
5 | Internal | -15.04 | 0.98 | 38 | 92
5 | External | -15.04 | 0.98 | 38 | 90

Figure 22. Docking score distribution for PPARγ. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 16. Model cutoff, ROC-AUC, FPDE and recall for Calcium/calmodulin-dependent protein kinase kinase 2 (CAMKK2) after each iteration.
All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -13.70 | 0.91 | 4 | 90
1 | External | -13.70 | 0.91 | 4 | 90
2 | Internal | -14.47 | 0.95 | 7 | 90
2 | External | -14.47 | 0.95 | 7 | 91
3 | Internal | -14.77 | 0.96 | 10 | 90
3 | External | -14.77 | 0.97 | 10 | 90
4 | Internal | -14.95 | 0.96 | 14 | 90
4 | External | -14.95 | 0.97 | 14 | 90
5 | Internal | -15.55 | 0.95 | 18 | 88
5 | External | -15.55 | 0.97 | 19 | 90

Figure 23. Docking score distribution for CAMKK2. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 17. Model cutoff, ROC-AUC, FPDE and recall for Cyclin-dependent kinase 6 (CDK6) after each iteration. All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -12.92 | 0.91 | 4 | 90
1 | External | -12.92 | 0.91 | 4 | 90
2 | Internal | -13.60 | 0.95 | 6 | 90
2 | External | -13.60 | 0.95 | 6 | 90
3 | Internal | -13.87 | 0.96 | 8 | 90
3 | External | -13.87 | 0.96 | 8 | 91
4 | Internal | -13.98 | 0.97 | 10 | 90
4 | External | -13.98 | 0.97 | 10 | 91
5 | Internal | -14.63 | 0.98 | 18 | 92
5 | External | -14.63 | 0.98 | 18 | 91

Figure 24. Docking score distribution for CDK6. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 18. Model cutoff, ROC-AUC, FPDE and recall for Vascular endothelial growth factor receptor 2 (VEGFR2) after each iteration. All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -14.13 | 0.92 | 4 | 90
1 | External | -14.13 | 0.92 | 4 | 89
2 | Internal | -15.12 | 0.96 | 7 | 90
2 | External | -15.12 | 0.96 | 7 | 90
3 | Internal | -15.41 | 0.98 | 16 | 90
3 | External | -15.41 | 0.98 | 16 | 89
4 | Internal | -15.79 | 0.97 | 34 | 88
4 | External | -15.79 | 0.96 | 34 | 86

Figure 25. Docking score distribution for VEGFR2. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 19. Model cutoff, ROC-AUC, FPDE and recall for Adenosine A2a Receptor (ADORA2A) after each iteration.
All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -12.29 | 0.91 | 3 | 90
1 | External | -12.29 | 0.91 | 3 | 90
2 | Internal | -12.99 | 0.95 | 8 | 90
2 | External | -12.99 | 0.96 | 8 | 90
3 | Internal | -13.35 | 0.95 | 10 | 89
3 | External | -13.35 | 0.95 | 10 | 89
4 | Internal | -13.51 | 0.97 | 14 | 90
4 | External | -13.51 | 0.97 | 14 | 91
5 | Internal | -14.07 | 0.98 | 31 | 90
5 | External | -14.07 | 0.98 | 31 | 90

Figure 26. Docking score distribution for ADORA2A. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 20. Model cutoff, ROC-AUC, FPDE and recall for Thromboxane A2 Receptor (TBXA2R) after each iteration. All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -15.22 | 0.89 | 3 | 90
1 | External | -15.22 | 0.89 | 3 | 90
2 | Internal | -15.83 | 0.94 | 5 | 90
2 | External | -15.83 | 0.94 | 5 | 90
3 | Internal | -16.13 | 0.96 | 8 | 90
3 | External | -16.13 | 0.96 | 8 | 90
4 | Internal | -16.35 | 0.97 | 13 | 89
4 | External | -16.35 | 0.97 | 13 | 89
5 | Internal | -17.00 | 0.97 | 22 | 83
5 | External | -17.00 | 0.98 | 23 | 87

Figure 27. Docking score distribution for TBXA2R. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 21. Model cutoff, ROC-AUC, FPDE and recall for Angiotensin II type-1 receptor (AT1R) after each iteration. All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -13.15 | 0.90 | 3 | 91
1 | External | -13.15 | 0.89 | 3 | 91
2 | Internal | -13.66 | 0.93 | 5 | 90
2 | External | -13.66 | 0.93 | 5 | 90
3 | Internal | -13.85 | 0.95 | 6 | 90
3 | External | -13.85 | 0.95 | 6 | 89
4 | Internal | -13.97 | 0.95 | 7 | 91
4 | External | -13.97 | 0.95 | 7 | 90
5 | Internal | -14.68 | 0.97 | 11 | 90
5 | External | -14.68 | 0.97 | 11 | 89

Figure 28. Docking score distribution for AT1R. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 22. Model cutoff, ROC-AUC, FPDE and recall for Sodium-Ion Channel (Nav1.7) after each iteration.
All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -13.93 | 0.91 | 3 | 89
1 | External | -13.93 | 0.91 | 3 | 89
2 | Internal | -15.32 | 0.97 | 10 | 90
2 | External | -15.32 | 0.97 | 10 | 90
3 | Internal | -15.91 | 0.98 | 26 | 91
3 | External | -15.91 | 0.98 | 26 | 89
4 | Internal | -16.43 | 0.98 | 78 | 90
4 | External | -16.43 | 0.99 | 78 | 90

Figure 29. Docking score distribution for Nav1.7. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 23. Model cutoff, ROC-AUC, FPDE and recall for bacterial (Gloeobacter) Ligand-gated Ion Channel (GLIC) after each iteration. All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -10.04 | 0.93 | 4 | 90
1 | External | -10.04 | 0.92 | 4 | 90
2 | Internal | -10.82 | 0.97 | 11 | 90
2 | External | -10.82 | 0.97 | 11 | 90
3 | Internal | -11.27 | 0.98 | 23 | 90
3 | External | -11.27 | 0.98 | 23 | 91
4 | Internal | -11.58 | 0.98 | 48 | 90
4 | External | -11.58 | 0.98 | 48 | 90

Figure 30. Docking score distribution for GLIC. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Table 24. Model cutoff, ROC-AUC, FPDE and recall for Gamma-AminoButyric Acid receptor subunit alpha-1 (GABA α1) after each iteration. All the statistics are consistent between the internal and external validation set.

Iteration | Validation set | Model cutoff | Model ROC-AUC | FPDE | Model Recall (%)
1 | Internal | -9.74 | 0.90 | 3 | 90
1 | External | -9.74 | 0.90 | 3 | 90
2 | Internal | -10.26 | 0.95 | 6 | 91
2 | External | -10.26 | 0.95 | 6 | 90
3 | Internal | -10.53 | 0.96 | 9 | 90
3 | External | -10.53 | 0.96 | 9 | 90
4 | Internal | -10.68 | 0.96 | 12 | 92
4 | External | -10.68 | 0.96 | 12 | 90
5 | Internal | -11.23 | 0.98 | 30 | 90
5 | External | -11.23 | 0.97 | 29 | 87

Figure 31. Docking score distribution for GABA α1. The distribution of scores shifted towards better (lower) docking scores after each iteration.

Figure 32. Top 10, 100, 1000 enrichment for the 12 targets after the last iteration.
The enrichment values decreased for an increasing number of top molecules, showing that the DNN model was able to prioritize high-affinity (good) molecules, i.e. molecules associated with higher DNN probabilities had higher chances of being good-scoring molecules.

Figure 33. The mean docking score distribution for the 12 targets across all the iterations. The mean score decreased with the iterations. The rates of decrease were higher at the beginning and eventually slowed down. The rate of decrease, as well as the docking score values, were protein-dependent.

Figure 34. The number of molecules left after each iteration for the 12 targets. The number of molecules decreased with the iterations. The decrease was more pronounced at the beginning and eventually slowed down, and it was highly protein-dependent.

Figure 35. Top 100 enrichment value for the 12 targets after each iteration. The enrichment value increased with the iterations, and the rate of increase, as well as the maximum enrichment value, was protein-dependent.

4.4 Future directions
PD2.0 is based on multiple stages: 2D to 3D conversion, molecular docking, model making, hyperparameter optimization, and prediction. Of these components, 2D to 3D conversion and molecular docking have little room for improvement. In our opinion, the type of model and the hyperparameter search are the two areas that can be further improved to increase the speed and accuracy of PD2.0. For example, graph neural networks (GNNs), as implemented in the DeepChem framework, could replace the feed-forward NNs, with SMILES represented as graphs, and other methods (Bayesian optimization, gradient-based optimization, genetic optimization or population-based optimization methods) could be used instead of grid-based hyperparameter search to select a better set of hyperparameters and provide a more accurate model. PD2.0 is protein-specific, i.e.
it needs to be restarted from scratch for every new target, and only information about the molecule is used. In the future, we want to include information about the protein as well as the available experimental information. This could make it possible to test molecules against multiple proteins from the same family and improve specificity. One of the major limitations of PD2.0 is that the rate at which the molecule database shrinks slows significantly after the first iteration. A better way of sampling (i.e. sampling more molecules similar to the incorrect predictions) may help in tackling this problem.

Chapter 5: Virtual screening of ETS transcription factor ERG with PD2.0

5.1 Introduction
In this chapter, we discuss the ETS transcription factor ERG and how it is linked to prostate cancer, the small molecule inhibitor that we found previously79, and, finally, the application of our developed method PD2.0 to virtually screen a large-scale 3D molecule database to identify new classes of small-molecule inhibitors.

The common driver of prostate cancer (PCa) is androgen signaling146, 147, and the disease is often treated with Androgen Receptor (AR) pathway inhibitors. The treatment leads to a 30% to 40% decrease in disease-specific mortality, but almost all patients end up developing resistance against the AR pathway inhibitors147, 148, demanding the development of novel therapeutics.

ERG is a transcription factor with a full length of 486 amino acids and a molecular weight of 54 kDa100. ERG is a member of the ETS family, the majority of which have the ETS DNA-binding domain (DBD), which is around 85 amino acids long. The ETS domain recognizes DNA sequences containing a core GGA(A/T) motif101. ERG is initially highly expressed in the embryonic mesoderm and endothelium, where it plays an important role in bone development and the formation of the urogenital tract and vascular system102, 103.
ERG is also expressed at high levels in embryonic neural crest cells during their migratory phase104. ERG expression decreases during vascular development105 but continues to regulate the pluripotency of hematopoietic stem cells106, endothelial cell (EC) homeostasis107, 108 and angiogenesis102.

The most prevalent genetic irregularity, occurring in ~50% of PCa patients, is the fusion of the transmembrane protease serine 2 (TMPRSS2) and ERG genes149, 150. ERG is not expressed by prostatic epithelial cells, but the fusion of the open reading frame sequence of ERG with the AR response element of TMPRSS2 results in a 6000-fold increase (after androgen stimulation) in the expression level of ERG, making it one of the most commonly overexpressed genes in PCa150. This overexpression of ERG, in combination with the loss of a tumor suppressor such as PTEN, leads to tumorigenesis in both xenograft and transgenic mouse models151. ERG modulates various PCa-related phenotypes such as disruption of the epithelial differentiation program via AR dysregulation109, activation of c-Myc, epigenetic reprogramming via EZH2110 and promotion of genomic instability via PARP dysregulation111. Overexpression of ERG also promotes epithelial-mesenchymal transition (EMT) and enables the transformed cells to acquire migratory and invasive characteristics109, 112. The androgen receptor is responsible for initiating ERG expression, but later on ERG can become self-driven through feed-forward regulation152. In summary, multiple lines of evidence together strongly demonstrate that ERG is an attractive molecular target for developing PCa therapies. A more detailed review can be found in the paper by Hsing et al.153

We have previously79 discovered and characterized a new class of ERG small molecule inhibitors using CADD.
The starting database size for virtual screening was about 3 million molecules (extracted from the ZINC database in 2013), out of which 48 compounds were screened in vitro, leading to the discovery of compound VPC-18005. VPC-18005 inhibited ERG-driven transcriptional activity in VCaP and PNT1B-ERG cells, and the interaction between ERG and VPC-18005 was also validated using NMR spectroscopy (consistent with the proposed in silico model). As in the TOX project, we had previously used only about 1% of the available molecules for virtual screening against ERG. Thus, expanding virtual screening from the initial 3 million to hundreds of millions of molecules provides a tremendous opportunity for ERG drug discovery. We have now applied our developed method PD2.0 to virtually screen more than 300 million molecules (all the molecules with a 3D representation at the time of in silico screening in 2018-2019) against ERG to select the top 5 million candidate molecules for subsequent computational steps such as consensus scoring.

5.2 Methods
We used PD2.0, developed as described in Chapter 4, to screen the non-conventional target ERG. We used all the 3D conformations available in the ZINC15 database at the time of screening, which amounted to 350 million molecules. The ERG protein (in complex with VPC-18005) was prepared and optimized using the Protein Preparation Wizard of the Schrödinger suite. The small molecule binding pocket was defined based on the same docking pose of the previously identified hit compound VPC-18005, using Schrödinger's Grid Preparation Wizard. Docking was performed using Schrödinger's GLIDE program (Standard Precision mode with default parameters).

5.3 Results and discussion
We used PD2.0 to screen the non-conventional target ERG, which is a DNA-binding protein (transcription factor). However, transcription factors have always been very difficult to target because they exert their activity via protein-protein and/or protein-DNA interactions.
Unlike proteins such as enzymes, which have a well-defined ligand-binding site that small molecules can target, the extensive surface-exposed interaction sites of transcription factors are usually difficult to disrupt using VS. We have a good track record of successfully identifying SMIs of protein-DNA interactions, developed via CADD to target the DNA-binding domains of other cancer drug targets such as AR, MYC and TOX. To target ERG, we used the PD2.0 method developed in the previous chapter.

For ERG, we performed 3 iterations of Progressive Docking, and Figure 36 shows the docking score distribution after each iteration. The plot shows that the distribution moved towards lower (better) docking scores, indicating that the molecule dataset became more enriched with good molecules after each iteration. The average docking score improved from -4 to -5.8; the rate of improvement was higher initially and gradually slowed down as Progressive Docking approached the final iteration. Table 25 shows the model statistics for ERG after each iteration, and the model enrichment (FPDE) improved after each iteration. We performed 3 iterations of PD2.0, which took about 9 days, and a total of 5 million molecules remained after the final iteration. Docking this final set of 5 million molecules, which had the best binding potential against ERG as determined by PD2.0, took another 3 days. Thus, the total amount of time required to virtually screen 350 million molecules against ERG was only 12 days using PD2.0, whereas it would have taken more than 300 days using a regular docking method without any form of machine learning. The application of PD2.0 to large-scale virtual screening against ERG has provided a 25x speed boost and an FPDE of 72x.
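The timing claim above is back-of-envelope arithmetic over the figures quoted in the text (3 iterations of ~3 days each, ~3 days of final docking, and an estimated 300 days for brute-force docking of the full library); a minimal sketch, assuming those round numbers:

```python
# Back-of-envelope timing for the ERG screen; all figures are the approximate
# values quoted in the text, not measured benchmarks.
iterations = 3
days_per_iteration = 3   # dock a sample, train the DNN, predict on the library
final_docking_days = 3   # exhaustive docking of the surviving 5 million molecules

pd2_days = iterations * days_per_iteration + final_docking_days  # 12 days
brute_force_days = 300   # estimated time to dock all 350M with the same resources

speedup = brute_force_days / pd2_days
print(f"PD2.0: {pd2_days} days, speedup: {speedup:.0f}x")
```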
When we compared the docking scores of the top molecules obtained from PD2.0 with those obtained from standard molecular docking alone, there were 100 times more molecules with docking scores below -6.5 using PD2.0 than with regular docking. The docking score is on a log10 scale, and each score unit of difference represents a 10-fold difference in binding affinity. Thus, the best molecule obtained using PD2.0 (docking score of -8.9) is theoretically nearly 100 times better than the best molecule obtained using regular docking (docking score of -7). In addition, the average docking score of the top 5 million molecules from PD2.0 (mean docking score of -5.8) is 100 times better than the mean docking score of the 3 million molecules from regular docking (mean docking score of -3.8, from the initial screening). Based on the docking score values, we anticipate that PD2.0 can provide more potent ERG inhibitors than VPC-18005, which has a docking score of around -6.

In addition to being a critical drug target in prostate cancer, ERG has been shown to play an important role in other types of cancer, including Ewing sarcoma and leukemia154-156. Ewing (or Ewing's) sarcoma (EWS) is the second most common bone cancer in children and young adults, and it is associated with high metastatic potential154. Leukemia represents a group of cancers that begin in the bone marrow and result in high numbers of abnormal white blood cells. Thus, the development of better ERG SMIs through the use of PD2.0 will not only enable better therapeutics for prostate cancer but also provide new therapies to target many other cancer types, including Ewing sarcoma and leukemia.

Table 25. Model statistics after each iteration for ERG. ROC-AUC and FPDE values for the best model are reported. We observed an increase in all enrichment values after each iteration. The model cutoff/definition of good molecules improved after each iteration as well.
Iteration | Docking score cutoff | Model ROC-AUC | FPDE
1 | -6.0 | 0.81 | 6
2 | -6.5 | 0.90 | 18
3 | -6.65 | 0.90 | 72

Figure 36. Docking score distribution for ERG. The distribution of scores shifted towards better (lower) docking scores after each iteration. A lower docking score indicates a molecule with better binding affinity for ERG.

5.4 Future directions
Through the use of PD2.0, we have virtually screened 350 million molecules against ERG and identified the top 5 million molecules based on the GLIDE docking scores. In the future, we will perform consensus scoring with additional docking programs using the in-house CADD pipeline (Chapter 1) to further narrow this set of molecules down to the best candidates (around 100 compounds) for experimental testing. In particular, all of the 5 million molecules will be further docked with the FRED program from OEDocking (up to 500 conformers will be generated for each molecule and docked using FRED with default parameters), and the corresponding RMSD values will be calculated for the top poses. All the molecules with an RMSD ≤ 2 Å will be retained and docked again using another docking program, ICM. For the poses predicted by ICM (default parameters), RMSD values against GLIDE will be calculated, and only the molecules with an RMSD ≤ 2 Å will be retained. Within the resulting set, theoretical pKi values will be calculated for each molecule using a custom MOE SVL script. Other properties such as ADMET (absorption, distribution, metabolism, excretion, toxicity) and pharmacokinetics predictions will also be calculated using computational programs such as ADMET Predictor, FAF-Drugs and the Quantitative Estimate of Drug-likeness (QED). As the next step, a consensus scoring method will be used: molecules with a total vote greater than a threshold value will be retained and then clustered to remove similar compounds (70% similarity).
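The consensus step described above (cross-program pose agreement plus property-based votes) can be expressed compactly. The sketch below is purely illustrative: the molecules, thresholds and vote scheme are hypothetical placeholders, not the actual outputs or rules of the CADD pipeline.

```python
# Hedged sketch of a consensus filter: a candidate must show pose agreement
# across docking programs (RMSD <= 2 A vs the GLIDE pose) and then collect
# enough "votes" from score, predicted potency and drug-likeness.
molecules = [
    # (id, GLIDE score, FRED-vs-GLIDE RMSD, ICM-vs-GLIDE RMSD, pred. pKi, QED)
    ("mol_a", -8.9, 1.2, 1.8, 7.5, 0.71),
    ("mol_b", -7.4, 3.5, 1.1, 6.9, 0.65),  # FRED pose disagrees -> dropped
    ("mol_c", -8.1, 1.9, 1.6, 7.1, 0.58),
    ("mol_d", -6.8, 0.8, 2.9, 6.2, 0.80),  # ICM pose disagrees -> dropped
]

def consensus_votes(score, rmsd_fred, rmsd_icm, pki, qed):
    # Pose agreement across programs is a hard requirement
    if rmsd_fred > 2.0 or rmsd_icm > 2.0:
        return 0
    votes = 0
    votes += score <= -8.0   # strong docking score (hypothetical threshold)
    votes += pki >= 7.0      # predicted potency (hypothetical threshold)
    votes += qed >= 0.6      # drug-likeness (hypothetical threshold)
    return votes

kept = [m[0] for m in molecules if consensus_votes(*m[1:]) >= 2]
# kept -> ["mol_a", "mol_c"]
```

The retained set would then be clustered by 2D similarity (e.g. 70% Tanimoto) to pick diverse representatives.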
Finally, around 100 compounds will be selected for in vitro testing. We anticipate this will lead to new classes of anti-ERG small molecules with better activities than compound VPC-18005.

Chapter 6: Conclusions
While drug discovery continues to be an expensive and time-consuming effort, this thesis shows that the process can be significantly accelerated through the combined use of molecular docking and machine learning, as demonstrated on two important cancer drug targets, TOX and ERG.

The TOX protein represents a promising drug target for CTCL, for which no small-molecule inhibitor has yet been developed. To address this challenge, we employed CADD technology and virtually screened 7.6 million molecules, fitting them into the DNA-binding domain of TOX. As a result, we identified and experimentally confirmed 18 compounds with micromolar potency and sufficient selectivity toward TOX. The success of the initial campaign highlights the need to screen a much larger library of available virtual molecules (approximately 1 billion entries) to identify additional molecules with improved anti-TOX potential. Unfortunately, docking 1 billion compounds into the TOX target site would take years of processing time even on the advanced computer cluster available at the VPC. To address this challenge, we developed a novel AI-enabled CADD method called PD2.0, which allows screening billions of molecules within a very reasonable amount of time and with modest computational resources. Thus, within ~18 days we were able to screen a library of 1.3 billion molecules against the ER-AF2 target using 300 CPU cores and 4 GPUs, achieving a VS enrichment of >100 and a 65-fold speedup. We also demonstrated that PD2.0 can be applied to a wide variety of drug targets and docking protocols.
For instance, we successfully used the developed approach to dock 570 million molecules against 12 targets from 4 main target protein classes (3 for each): G protein-coupled receptors, kinases, ion channels, and nuclear receptors.

Furthermore, we have applied PD2.0 to ERG, an important drug target not only in prostate cancer but also in Ewing sarcoma and leukemia. Through the use of PD2.0, we have virtually screened 350 million molecules against ERG and identified the top 5 million molecules based on the GLIDE docking scores. We anticipate that applying subsequent CADD protocols will narrow these 5 million molecules down to the 100 best drug candidates with potent anti-ERG activities.

To summarize, we have developed PD2.0 as a highly scalable method enabling much higher speed and enrichment gains for virtual screening. Given the exponential growth of databases such as ZINC and Enamine REAL, we expect PD2.0 to play a pivotal role in the upcoming challenges faced by the CADD community. We will continue working on making PD2.0 more accurate and scalable.

Bibliography

1. Agrawal V, Su M, Huang Y, Hsing M, Cherkasov A, Zhou Y. Computer-Aided Discovery of Small Molecule Inhibitors of Thymocyte Selection-Associated High Mobility Group Box Protein (TOX) as Potential Therapeutics for Cutaneous T-Cell Lymphomas. Molecules. 2019;24(19):3459
2. Agrawal V, Gentile F, Hsing M, Ban F, Cherkasov A. Progressive Docking - Deep Learning Based Approach for Accelerated Virtual Screening. Paper presented at: International Conference on Artificial Neural Networks; 2019.
3. Seber GA, Lee AJ. Linear regression analysis. Vol 329: John Wiley & Sons; 2012.
4. Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Vol 398: John Wiley & Sons; 2013.
5. Breiman L. Random Forests. Mach Learn. 2001;45(1):5-32. doi:10.1023/a:1010933404324
6. Cortes C, Vapnik V. Support-vector networks. Machine learning. 1995;20(3):273-297
7. Friedman JH.
Greedy function approximation: a gradient boosting machine. Annals of statistics. 2001:1189-1232
8. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics. 1943;5(4):115-133
9. Hartigan JA, Wong MA. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society Series C (Applied Statistics). 1979;28(1):100-108
10. Sander J, Ester M, Kriegel H-P, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data mining and knowledge discovery. 1998;2(2):169-194
11. Jolliffe I. Principal component analysis. Springer; 2011.
12. van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of machine learning research. 2008;9(Nov):2579-2605
13. Svozil D, Kvasnicka V, Pospichal J. Introduction to multi-layer feed-forward neural networks. Chemometrics and intelligent laboratory systems. 1997;39(1):43-62
14. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Paper presented at: Advances in neural information processing systems; 2012.
15. Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S. Recurrent neural network based language model. Paper presented at: Eleventh annual conference of the international speech communication association; 2010.
16. Hochreiter S, Schmidhuber J. Long short-term memory. Neural computation. 1997;9(8):1735-1780
17. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. 2014
18. Doersch C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. 2016
19. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Paper presented at: Advances in neural information processing systems; 2014.
20. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks.
Paper presented at: Advances in neural information processing systems; 2015.
21. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. Paper presented at: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016.
22. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 2014
23. Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. 2014
24. Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine. 2012;29
25. Turk MA, Pentland AP. Face recognition using eigenfaces. Paper presented at: Proceedings of the 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 1991.
26. Wen Y, Zhang K, Li Z, Qiao Y. A discriminative feature learning approach for deep face recognition. Paper presented at: European conference on computer vision; 2016.
27. Wang D, Khosla A, Gargeya R, Irshad H, Beck AH. Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718. 2016
28. Mayr A, Klambauer G, Unterthiner T, Hochreiter S. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science. 2016;3:80
29. Hughes JP, Rees S, Kalindjian SB, Philpott KL. Principles of early drug discovery. British journal of pharmacology. 2011;162(6):1239-1249
30. Ban F, Dalal K, Li H, LeBlanc E, Rennie PS, Cherkasov A. Best practices of computer-aided drug discovery: lessons learned from the development of a preclinical candidate for prostate cancer with a new mechanism of action. Journal of chemical information and modeling. 2017;57(5):1018-1028
31. Bajorath J. Computer-aided drug discovery. F1000Research. 2015;4
32. Sterling T, Irwin JJ. ZINC 15 - ligand discovery for everyone.
Journal of chemical information and modeling. 2015;55(11):2324-2337
33. ENAMINE. https://enamine.net/library-synthesis/real-compounds/real-database.
34. Kim S, Thiessen PA, Bolton EE, et al. PubChem substance and compound databases. Nucleic acids research. 2015;44(D1):D1202-D1213
35. Sussman JL, Lin D, Jiang J, et al. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallographica Section D: Biological Crystallography. 1998;54(6):1078-1084
36. Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic acids research. 2006;35(suppl_1):D198-D201
37. Wang R, Fang X, Lu Y, Yang C-Y, Wang S. The PDBbind database: methodologies and updates. Journal of medicinal chemistry. 2005;48(12):4111-4119
38. Veber DF, Johnson SR, Cheng H-Y, Smith BR, Ward KW, Kopple KD. Molecular properties that influence the oral bioavailability of drug candidates. Journal of medicinal chemistry. 2002;45(12):2615-2623
39. Golla S, Neely BJ, Whitebay E, Madihally S, Robinson Jr RL, Gasem KA. Virtual design of chemical penetration enhancers for transdermal drug delivery. Chemical biology & drug design. 2012;79(4):478-487
40. Bennion BJ, Be NA, McNerney MW, et al. Predicting a drug's membrane permeability: a computational model validated with in vitro permeability assay data. The Journal of Physical Chemistry B. 2017;121(20):5228-5237
41. Paudel KS, Milewski M, Swadley CL, Brogden NK, Ghosh P, Stinchcomb AL. Challenges and opportunities in dermal/transdermal delivery. Therapeutic delivery. 2010;1(1):109-131
42. Lyu J, Wang S, Balius TE, et al. Ultra-large library docking for discovering new chemotypes. Nature. 2019;566(7743):224
43. Khamis MA, Gomaa W, Ahmed WF. Machine learning in computational docking. Artificial intelligence in medicine. 2015;63(3):135-152
44. Ashtawy HM, Mahapatra NR.
A comparative assessment of ranking accuracies of conventional and machine-learning-based scoring functions for protein-ligand binding affinity prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2012;9(5):1301-1313 45. Moustakas DT, Lang PT, Pegg S, et al. Development and validation of a modular, extensible docking program: DOCK 5. Journal of computer-aided molecular design. 2006;20(10-11):601-619 46. Wang R, Lai L, Wang S. Further development and validation of empirical scoring functions for structure-based binding affinity prediction. Journal of computer-aided molecular design. 2002;16(1):11-26 47. Muegge I. PMF scoring revisited. Journal of medicinal chemistry. 2006;49(20):5895-5902 48. Kinnings SL, Liu N, Tonge PJ, Jackson RM, Xie L, Bourne PE. A machine learning-based method to improve docking scoring functions and its application to drug repurposing. Journal of chemical information and modeling. 2011;51(2):408-419 49. Ballester PJ, Mitchell JB. A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking. Bioinformatics. 2010;26(9):1169-1175 50. Li H, Leung K-S, Wong M-H, Ballester PJ. Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study. BMC bioinformatics. 2014;15(1):291 51. Ballester PJ, Schreyer A, Blundell TL. Does a more precise chemical description of protein–ligand complexes lead to more accurate prediction of binding affinity? Journal of chemical information and modeling. 2014;54(3):944-955 52. Rifaioglu AS, Atas H, Martin MJ, Cetin-Atalay R, Atalay V, Dogan T. Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Brief Bioinform. 2018;10 53. Li Y, Han L, Liu Z, Wang R. Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results. 
Journal of chemical information and modeling. 2014;54(6):1717-1736 101  54. Zavodszky MI, Sanschagrin PC, Kuhn LA, Korde RS. Distilling the essential features of a protein surface for improving protein-ligand docking, scoring, and virtual screening. Journal of computer-aided molecular design. 2002;16(12):883-902 55. Schnecke V, Kuhn LA. Virtual screening with solvation and ligand-induced complementarity. In: Virtual Screening: An Alternative or Complement to High Throughput Screening?: Springer; 2000:171-190. 56. Zavodszky MI, Kuhn LA. Side‐chain flexibility in protein–ligand binding: the minimal rotation hypothesis. Protein Science. 2005;14(4):1104-1114 57. Friedman JH. Multivariate adaptive regression splines. The annals of statistics. 1991;19(1):1-67 58. Hechenbichler K, Schliep K. Weighted k-nearest-neighbor techniques and ordinal classification. 2004 59. Ridgeway G. Generalized Boosted Models: A guide to the gbm package. Update. 2007;1(1):2007 60. Amini A, Shrimpton PJ, Muggleton SH, Sternberg MJ. A general approach for developing system‐specific functions to score protein–ligand docked complexes using support vector inductive logic programming. Proteins: Structure, Function, and Bioinformatics. 2007;69(4):823-831 61. Li L, Wang B, Meroueh SO. Support vector regression scoring of receptor–ligand complexes for rank-ordering and virtual screening of chemical libraries. Journal of chemical information and modeling. 2011;51(9):2132-2138 62. Korb O, Stutzle T, Exner TE. Empirical scoring functions for advanced protein− ligand docking with PLANTS. Journal of chemical information and modeling. 2009;49(1):84-96 63. Eldridge MD, Murray CW, Auton TR, Paolini GV, Mee RP. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. Journal of computer-aided molecular design. 1997;11(5):425-445 64. Gabel J, Desaphy J, Rognan D. 
Beware of Machine Learning-Based Scoring Functions: On the Danger of Developing Black Boxes. Journal of Chemical Information and Modeling. 2014;54(10):2807-2815.
65. Khamis MA, Gomaa W. Comparative assessment of machine-learning scoring functions on PDBbind 2013. Engineering Applications of Artificial Intelligence. 2015;45:136-151.
66. Pereira JC, Caffarena ER, dos Santos CN. Boosting docking-based virtual screening with deep learning. Journal of Chemical Information and Modeling. 2016;56(12):2495-2506.
67. Cherkasov A, Ban F, Li Y, Fallahi M, Hammond GL. Progressive docking: a hybrid QSAR/docking approach for accelerating in silico high throughput screening. Journal of Medicinal Chemistry. 2006;49(25):7466-7478.
68. Svensson F, Norinder U, Bender A. Improving screening efficiency through iterative screening using docking and conformal prediction. Journal of Chemical Information and Modeling. 2017;57(3):439-444.
69. Ahmed L, Georgiev V, Capuccini M, et al. Efficient iterative virtual screening with Apache Spark and conformal prediction. Journal of Cheminformatics. 2018;10(1):8.
70. Cherkasov A, Muratov EN, Fourches D, et al. QSAR modeling: where have you been? Where are you going to? Journal of Medicinal Chemistry. 2014;57(12):4977-5010.
71. Hansch C, Maloney PP, Fujita T, Muir RM. Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature. 1962;194(4824):178.
72. Bajorath J. Modeling of activity landscapes for drug discovery. Expert Opinion on Drug Discovery. 2012;7(6):463-473.
73. Stumpfe D, Bajorath J. Methods for SAR visualization. RSC Advances. 2012;2(2):369-378.
74. Deprez-Poulain R, Deprez B. Facts, figures and trends in lead generation. Current Topics in Medicinal Chemistry. 2004;4(6):569-580.
75. Davis A, Ward SE. The handbook of medicinal chemistry: principles and practice. Royal Society of Chemistry; 2014.
76. Maltarollo VG, Gertrudes JC, Oliveira PR, Honorio KM. Applying machine learning techniques for ADME-Tox prediction: a review. Expert Opinion on Drug Metabolism & Toxicology. 2015;11(2):259-271.
77. ADMET Predictor [computer program]. Simulations Plus; 2018.
78. Dalal K, Roshan-Moniri M, Sharma A, et al. Selectively targeting the DNA-binding domain of the androgen receptor as a prospective therapy for prostate cancer. Journal of Biological Chemistry. 2014;289(38):26417-26429.
79. Butler MS, Roshan-Moniri M, Hsing M, et al. Discovery and characterization of small molecules targeting the DNA-binding ETS domain of ERG in prostate cancer. Oncotarget. 2017;8(26):42438.
80. Carabet LA, Lallous N, Leblanc E, et al. Computer-aided drug discovery of Myc-Max inhibitors as potential therapeutics for prostate cancer. European Journal of Medicinal Chemistry. 2018;160:108-119.
81. Roshan-Moniri M, Hsing M, Butler MS, Cherkasov A, Rennie PS. Orphan nuclear receptors as drug targets for the treatment of prostate and breast cancers. Cancer Treatment Reviews. 2014;40(10):1137-1152.
82. Thorsteinson N, Ban F, Santos-Filho O, et al. In silico identification of anthropogenic chemicals as ligands of zebrafish sex hormone binding globulin. Toxicology and Applied Pharmacology. 2009;234(1):47-57.
83. Singh K, Munuganti RSN, Leblanc E, et al. In silico discovery and validation of potent small-molecule inhibitors targeting the activation function 2 site of human oestrogen receptor α. Breast Cancer Research. 2015;17(1):27.
84. Molecular Operating Environment [computer program]. Chemical Computing Group; 2018.
85. Lipinski CA. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discovery Today: Technologies. 2004;1(4):337-341.
86. Friesner RA, Banks JL, Murphy RB, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. Journal of Medicinal Chemistry. 2004;47(7):1739-1749.
87. McGann M. FRED and HYBRID docking performance on standardized datasets. Journal of Computer-Aided Molecular Design. 2012;26(8):897-906.
88. Neves MA, Totrov M, Abagyan R. Docking and scoring with ICM: the benchmarking results and strategies for improvement. Journal of Computer-Aided Molecular Design. 2012;26(6):675-686.
89. Lagorce D, Bouslama L, Becot J, Miteva MA, Villoutreix BO. FAF-Drugs4: free ADME-tox filtering computations for chemical biology and early stages drug discovery. Bioinformatics. 2017;33(22):3658-3660.
90. Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL. Quantifying the chemical beauty of drugs. Nature Chemistry. 2012;4(2):90.
91. Wilkinson B, Chen JY-F, Han P, Rufner KM, Goularte OD, Kaye J. TOX: an HMG box protein implicated in the regulation of thymocyte selection. Nature Immunology. 2002;3(3):272.
92. Aliahmad P, Seksenyan A, Kaye J. The many roles of TOX in the immune system. Current Opinion in Immunology. 2012;24(2):173-177.
93. Aliahmad P, Kadavallore A, de la Torre B, Kappes D, Kaye J. TOX is required for development of the CD4 T cell lineage gene program. The Journal of Immunology. 2011;187(11):5931-5940.
94. Aliahmad P, De La Torre B, Kaye J. Shared dependence on the DNA-binding factor TOX for the development of lymphoid tissue–inducer cell and NK cell lineages. Nature Immunology. 2010;11(10):945.
95. Aliahmad P, O'Flaherty E, Han P, et al. TOX provides a link between calcineurin activation and CD8 lineage commitment. Journal of Experimental Medicine. 2004;199(8):1089-1099.
96. Litvinov IV, Netchiporouk E, Cordeiro B, et al. Ectopic expression of embryonic stem cell and other developmental genes in cutaneous T-cell lymphoma. Oncoimmunology. 2014;3(11):e970025.
97. Huang Y, Su M-W, Jiang X, Zhou Y. Evidence of an oncogenic role of aberrant TOX activation in cutaneous T-cell lymphoma. Blood. 2015;125(9):1435-1443.
98. Huang Y, Litvinov IV, Wang Y, et al. Thymocyte selection-associated high mobility group box gene (TOX) is aberrantly over-expressed in mycosis fungoides and correlates with poor prognosis.
Oncotarget. 2014;5(12):4418.
99. Zhang Y, Wang Y, Yu R, et al. Molecular markers of early-stage mycosis fungoides. Journal of Investigative Dermatology. 2012;132(6):1698-1706.
100. Rao VN, Papas TS, Reddy E. erg, a human ets-related gene on chromosome 21: alternative splicing, polyadenylation, and translation. Science. 1987;237(4815):635-639.
101. Shore P, Whitmarsh AJ, Bhaskaran R, Davis RJ, Waltho JP, Sharrocks AD. Determinants of DNA-binding specificity of ETS-domain transcription factors. Molecular and Cellular Biology. 1996;16(7):3338-3349.
102. Birdsey GM, Dryden NH, Amsellem V, et al. Transcription factor Erg regulates angiogenesis and endothelial apoptosis through VE-cadherin. Blood. 2008;111(7):3498-3506.
103. Vijayaraj P, Le Bras A, Mitchell N, et al. Erg is a crucial regulator of endocardial-mesenchymal transformation during cardiac valve morphogenesis. Development. 2012;139(21):3973-3985.
104. Maroulakou IG, Bowe DB. Expression and function of Ets transcription factors in mammalian development: a regulatory network. Oncogene. 2000;19(55):6432.
105. Mohamed AA, Tan S-H, Mikhalkevich N, et al. Ets family protein, erg expression in developing and adult mouse tissues by a highly specific monoclonal antibody. Journal of Cancer. 2010;1:197.
106. Ng AP, Loughran SJ, Metcalf D, et al. Erg is required for self-renewal of hematopoietic stem cells during stress hematopoiesis in mice. Blood. 2011;118(9):2454-2461.
107. Birdsey GM, Shah AV, Dufton N, et al. The endothelial transcription factor ERG promotes vascular stability and growth through Wnt/β-catenin signaling. Developmental Cell. 2015;32(1):82-96.
108. Lathen C, Zhang Y, Chow J, et al. ERG-APLNR axis controls pulmonary venule endothelial proliferation in pulmonary veno-occlusive disease. Circulation. 2014;130(14):1179-1191.
109. Yu J, Yu J, Mani R-S, et al. An integrated network of androgen receptor, polycomb, and TMPRSS2-ERG gene fusions in prostate cancer progression. Cancer Cell. 2010;17(5):443-454.
110. Chng KR, Chang CW, Tan SK, et al. A transcriptional repressor co-regulatory network governing androgen response in prostate cancers. The EMBO Journal. 2012;31(12):2810-2823.
111. Brenner JC, Ateeq B, Li Y, et al. Mechanistic rationale for inhibition of poly (ADP-ribose) polymerase in ETS gene fusion-positive prostate cancer. Cancer Cell. 2011;19(5):664-678.
112. Becker-Santos DD, Guo Y, Ghaffari M, et al. Integrin-linked kinase as a target for ERG-mediated invasive properties in prostate cancer models. Carcinogenesis. 2012;33(12):2558-2567.
113. Santos R, Ursu O, Gaulton A, et al. A comprehensive map of molecular drug targets. Nature Reviews Drug Discovery. 2017;16(1):19.
114. Hur E, Pfaff SJ, Payne ES, Grøn H, Buehrer BM, Fletterick RJ. Recognition and accommodation at the androgen receptor coactivator binding interface. PLoS Biology. 2004;2(9):e274.
115. Brzozowski AM, Pike AC, Dauter Z, et al. Molecular basis of agonism and antagonism in the oestrogen receptor. Nature. 1997;389(6652):753.
116. Jang JY, Bae H, Lee YJ, et al. Structural basis for the enhanced anti-diabetic efficacy of lobeglitazone on PPARγ. Scientific Reports. 2018;8(1):31.
117. Kukimoto-Niino M, Yoshikawa S, Takagi T, et al. Crystal structure of the Ca2+/calmodulin-dependent protein kinase kinase in complex with the inhibitor STO-609. Journal of Biological Chemistry. 2011;286(25):22570-22579.
118. Chen P, Lee NV, Hu W, et al. Spectrum and degree of CDK drug interactions predicts clinical performance. Molecular Cancer Therapeutics. 2016;15(10):2273-2281.
119. McTigue M, Murray BW, Chen JH, Deng Y-L, Solowiej J, Kania RS. Molecular conformations, interactions, and properties associated with drug efficiency and clinical performance among VEGFR TK inhibitors. Proceedings of the National Academy of Sciences. 2012;109(45):18281-18289.
120. Cheng RK, Segala E, Robertson N, et al. Structures of human A1 and A2A adenosine receptors with xanthines reveal determinants of selectivity. Structure. 2017;25(8):1275-1285.e1274.
121. Fan H, Chen S, Yuan X, et al. Structural basis for ligand recognition of the human thromboxane A2 receptor. Nature Chemical Biology. 2019;15(1):27.
122. Zhang H, Unal H, Desnoyer R, et al. Structural basis for ligand recognition and functional selectivity at angiotensin receptor. Journal of Biological Chemistry. 2015;290(49):29127-29139.
123. Ahuja S, Mukund S, Deng L, et al. Structural basis of Nav1.7 inhibition by an isoform-selective small-molecule antagonist. Science. 2015;350(6267):aac5464.
124. Pan J, Chen Q, Willenbring D, et al. Structure of the pentameric ligand-gated ion channel GLIC bound with anesthetic ketamine. Structure. 2012;20(9):1463-1469.
125. Laverty D, Thomas P, Field M, et al. Crystal structures of a GABAA-receptor chimera reveal new endogenous neurosteroid-binding sites. Nature Structural & Molecular Biology. 2017;24(11):977.
126. Irwin JJ, Shoichet BK. ZINC – a free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling. 2005;45(1):177-182.
127. Hwang ST, Janik JE, Jaffe ES, Wilson WH. Mycosis fungoides and Sézary syndrome. The Lancet. 2008;371(9616):945-957.
128. Netchiporouk E, Gantchev J, Tsang M, et al. Analysis of CTCL cell lines reveals important differences between mycosis fungoides/Sézary syndrome vs. HTLV-1+ leukemic cell lines. Oncotarget. 2017;8(56):95981.
129. Cherkasov A, Muratov EN, Fourches D, et al. QSAR modeling: where have you been? Where are you going to? Journal of Medicinal Chemistry. 2014;57(12):4977-5010.
130. Cereto-Massagué A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallvé S, Pujadas G. Molecular fingerprint similarity search in virtual screening. Methods. 2015;71:58-63.
131. Weininger D, Weininger A, Weininger JL. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Sciences. 1989;29(2):97-101.
132. Ewing T, Baber JC, Feher M. Novel 2D fingerprints for ligand-based virtual screening. Journal of Chemical Information and Modeling. 2006;46(6):2423-2431.
133. Landrum G. RDKit documentation. Release. 2013;1:1-79.
134. Rogers D, Hahn M. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling. 2010;50(5):742-754.
135. Mason JS, Cheney DL. Library design and virtual screening using multiple 4-point pharmacophore fingerprints. In: Biocomputing 2000. World Scientific; 1999:576-587.
136. Qing X, Lee XY, De Raeymaecker J, et al. Pharmacophore modeling: advances, limitations, and current utility in drug discovery. Journal of Receptor, Ligand and Channel Research. 2014;7:81-92.
137. Sastry GM, Adzhigirey M, Day T, Annabhimoju R, Sherman W. Protein and ligand preparation: parameters, protocols, and influence on virtual screening enrichments. Journal of Computer-Aided Molecular Design. 2013;27(3):221-234.
138. Sculley D. Web-scale k-means clustering. Paper presented at: Proceedings of the 19th International Conference on World Wide Web; 2010.
139. Feizollah A, Anuar NB, Salleh R, Amalina F. Comparative study of k-means and mini batch k-means clustering algorithms in android malware detection using network traffic analysis. Paper presented at: 2014 International Symposium on Biometrics and Security Technologies (ISBAST); 2014.
140. Shalev-Shwartz S, Ben-David S. Understanding machine learning: From theory to algorithms. Cambridge University Press; 2014.
141. Hofmann M. Support vector machines: kernels and the kernel trick. Notes. 2006;26.
142. Chung K-M, Kao W-C, Sun C-L, Wang L-L, Lin C-J. Radius margin bounds for support vector machines with the RBF kernel. Neural Computation. 2003;15(11):2643-2681.
143. Shafer G, Vovk V. A tutorial on conformal prediction. Journal of Machine Learning Research. 2008;9(Mar):371-421.
144. Lipkus AH. A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry. 1999;26(1-3):263-265.
145. Delfosse V, Grimaldi M, Pons J-L, et al. Structural and mechanistic insights into bisphenols action provide guidelines for risk assessment and discovery of bisphenol A substitutes. Proceedings of the National Academy of Sciences. 2012;109(37):14930-14935.
146. Abeshouse A, Ahn J, Akbani R, et al. The molecular taxonomy of primary prostate cancer. Cell. 2015;163(4):1011-1025.
147. Robinson D, Van Allen EM, Wu Y-M, et al. Integrative clinical genomics of advanced prostate cancer. Cell. 2015;161(5):1215-1228.
148. Chang AJ, Autio KA, Roach III M, Scher HI. High-risk prostate cancer: classification and therapy. Nature Reviews Clinical Oncology. 2014;11(6):308.
149. Seth A, Watson DK. ETS transcription factors and their emerging roles in human cancer. European Journal of Cancer. 2005;41(16):2462-2478.
150. Tomlins SA, Rhodes DR, Perner S, et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science. 2005;310(5748):644-648.
151. Adamo P, Ladomery M. The oncogene ERG: a key factor in prostate cancer. Oncogene. 2016;35(4):403.
152. Mani R-S, Iyer MK, Cao Q, et al. TMPRSS2–ERG-mediated feed-forward regulation of wild-type ERG in human prostate cancers. Cancer Research. 2011;71(16):5387-5392.
153. Hsing M, Wang Y, Rennie PS, Cox ME, Cherkasov A. ETS transcription factors as emerging drug targets in cancer. Medicinal Research Reviews. 2019.
154. Gruenewald TG, Cidre-Aranaz F, Surdez D, et al. Ewing sarcoma. Nature Reviews Disease Primers. 2018;4(1):5.
155. Sizemore GM, Pitarresi JR, Balakrishnan S, Ostrowski MC. The ETS family of oncogenic transcription factors in solid tumours. Nature Reviews Cancer. 2017;17(6):337.
156. Li Y, Luo H, Liu T, Zacksenhaus E, Ben-David Y. The ets transcription factor Fli-1 in development, cancer and disease. Oncogene. 2015;34(16):2022.
