Application of Machine Learning Algorithms to Mineral Prospectivity Mapping

by

Justin Granek

B.Sc., Acadia University, 2009
M.Sc., The University of British Columbia, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Geophysics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

December 2016

© Justin Granek 2016

Abstract

In the modern era of diminishing returns on fixed exploration budgets, challenging targets, and ever-increasing numbers of multi-parameter datasets, proper management and integration of available data is a crucial component of any mineral exploration program. Machine learning algorithms have been used successfully for years by the technology sector to accomplish just this task on their databases, and recent developments aim to transfer these successes to the field of mineral exploration. When the exploration task is framed as a supervised learning problem, the geological, geochemical and geophysical information can be used as training data, and known mineral occurrences can be used as training labels. The goal is to parameterize the complex relationships between the data and the labels such that mineral potential can be estimated in under-explored regions using available geoscience data.

Numerous models and algorithms have been attempted for mineral prospectivity mapping in the past, and in this thesis we propose two new approaches. The first is a modified support vector machine algorithm which incorporates uncertainties on both the data and the labels. Due to the nature of geoscience data and the characteristics of the mineral prospectivity mapping problem, uncertainties are known to be very important.
The algorithm is demonstrated on a synthetic dataset to highlight this importance, and then used to generate a prospectivity map for copper-gold porphyry targets in central British Columbia using the QUEST dataset as a case study.

The second approach, convolutional neural networks, was selected due to its inherent sensitivity to spatial patterns. Though neural networks have been used for mineral prospectivity mapping, convolutional neural nets have yet to be applied to the problem. Having gained extreme popularity in the computer vision field for tasks involving image segmentation, identification and anomaly detection, the algorithm is well suited to handle the mineral prospectivity mapping problem. A CNN code is developed in Julia, then tested on a synthetic example to illustrate its effectiveness at identifying coincident structures in a multi-modal dataset. Finally, a subset of the QUEST dataset is used to generate a prospectivity map using CNNs.

Preface

This thesis contains original research conducted while studying at the University of British Columbia, resulting in three publications and a conference extended abstract.

The idea of using mineral prospectivity mapping for exploration targeting came from experience working with real datasets for mineral exploration at Computational Geosciences and through discussions with Dr. Eldad Haber, Dr. Elliot Holtham, Dr. David Marchant and Livia Mahler. The decision to start with support vector machines was based on investigation of industry best practices in machine learning and experience trying various algorithms on a preliminary dataset.

The decision to write our own SVM codes was motivated by the lack of currently available packages to incorporate uncertainties on the data and labels, which from experience working in geophysical inversion was known to be extremely important when working with geoscience data. The subsequent Matlab codes, including the adapted SVM formulation published in J. Granek and E.
Haber, “Advanced Geoscience Targeting via Focused Machine Learning Applied to the QUEST Project Dataset, British Columbia”, in Geoscience BC Summary of Activities 2015, 2015:117-125, as well as the new L1-RSVR formulation published in J. Granek and E. Haber, “Data mining for real mining: A robust algorithm for prospectivity mapping with uncertainties”, in Proceedings of the 2015 SIAM International Conference on Data Mining, 2015:145-153, were developed and written in collaboration with Dr. Eldad Haber.

The idea of applying convolutional neural networks to the mineral prospectivity problem came from discussions with Dr. Eldad Haber and was sparked by their current popularity in the machine learning community. The decision to write our own CNN package in Julia came from frustration with understanding the inner workings of existing packages and a desire to have control over all aspects of the algorithm. The subsequent code was written by myself with heavy input and assistance from Dr. Haber. This work has been submitted for publication as J. Granek and E. Haber, “Deep Learning for Natural Resource Exploration: Convolutional Neural Networks for Prospectivity Mapping”, IEEE Transactions on Geoscience and Remote Sensing, and is currently going through the review process.

Acquisition, processing and sampling of the field data for the QUEST region were performed by myself using open source Quantum GIS software, with consultation from staff at NEXT Exploration.

A summary of this research was presented at the ASEG 2016 conference in Adelaide, and published in the extended abstracts as J. Granek, E. Haber and E. Holtham, “Resource Management Through Machine Learning”, ASEG Extended Abstracts, 2016:1-5.

Table of Contents

Abstract
Preface
Table of Contents
List of Figures
Acknowledgements
Dedication
1 Introduction
    1.1 Motivation
    1.2 Research Outline
    1.3 Research Summary
2 Mineral Prospectivity Mapping
    2.1 General Formulation
    2.2 Previous Work
    2.3 Problem Characteristics
3 Support Vector Machines
    3.1 Maximum Margin Classifier
    3.2 Separable Linear SVM
    3.3 Variants and Modifications
        3.3.1 Inseparable SVM
        3.3.2 Non-linear SVM
        3.3.3 Multiclass SVM
        3.3.4 Support Vector Regression
    3.4 Robust SVR with Uncertainty
        3.4.1 Total Least Squares
        3.4.2 Total Least Norm - L1 Sparsity
        3.4.3 Case of Different Variances
    3.5 Synthetic Data Example
4 Convolutional Neural Networks
    4.1 Neural Networks
        4.1.1 History
        4.1.2 General Architecture
        4.1.3 Optimization
    4.2 Convolutional Neural Networks
        4.2.1 Convolutional Layer
        4.2.2 Pooling Layer
        4.2.3 Classification/Objective Function
        4.2.4 CNN Architecture
    4.3 Synthetic Data Example
5 QUEST Project Field Example
    5.1 QUesnelia Exploration STrategy Data
    5.2 The Target: Porphyry Copper-Gold Systems
    5.3 Exploration Goals
    5.4 Data Preparation
    5.5 SVM Prospectivity Mapping
        5.5.1 Problem Setup
        5.5.2 Results
    5.6 CNN Prospectivity Mapping
        5.6.1 Problem Setup
        5.6.2 Results
6 Conclusions
    6.1 Major Contributions
    6.2 SVM vs CNN Comparison
    6.3 Remaining Challenges & Future Work
Bibliography

List of Figures

2.1 Mineral prospectivity mapping aims to combine numerous geoscience layers to derive a map of mineral potential.
3.1 Optimal margin classification using SVM. The separating hyperplane (black) is determined by maximizing the margin (yellow) between a few sparse support vectors (outlined in green).
3.2 Calculating the distance between the hyperplane and a nearest point in feature space, Xn.
3.3 Inseparable SVM: points are allowed to be misclassified, but are penalized using positive slack variables ξ.
3.4 Non-linear transformation via the kernel trick. The data is linearly separable in the transformed feature space.
3.5 The multiclass SVM problem can be solved as one-vs-rest or as one-vs-one to generate multiple decision boundaries.
3.6 Support vector regression using the ε-insensitive loss function; only values greater than ±ε contribute to the objective value (see equation 3.18).
3.7 Three of the most common loss functions used in conjunction with SVMs. Note: The hinge loss is using ε = 0.5.
3.8 Full synthetic dataset. Separable binary dataset is rendered inseparable by two forms of errors: bad data (purple boxes) and bad labels (green boxes). Training data are circles and predicted labels are triangles. Marker size is proportional to label confidence.
3.9 Prediction results from synthetic example (left: L1-RSVR; right: SVR without uncertainties). Training data are circles and predicted labels are triangles. Marker size is proportional to label confidence. Separating boundaries are shown in green and pink, and black crosses indicate misclassified points.
4.1 Neural networks are modelled after biological neurons in the brain, in which each neuron will only fire (perpetuate a signal) if the total input signal exceeds some threshold (taken from Stanford CS231n course material).
4.2 Example diagram of a feed-forward neural network. This network has 3 inputs and 3 layers: two hidden layers with 4 neurons each, and one output layer (taken from Stanford CS231n course material).
4.3 Different activation functions commonly used in neural networks.
4.4 Supervised vs unsupervised neural networks. Top: feed-forward neural networks use training labels to optimize the hidden layer parameters; bottom: autoencoders minimize the difference between the training data and the output feature vector to find an optimal compressed representation in the hidden layer.
4.5 Backpropagation of error through a neural network.
4.6 Diagram of a typical convolution layer in a convolutional neural network (taken from DeepLearning.net’s tutorial on CNN).
4.7 Diagram of a typical max-pooling operation in a convolutional neural network (taken from Stanford CS231n course material).
4.8 Example architecture of a convolutional neural network, stacking (two) layers of convolution and pooling with a final fully connected layer (taken from DeepLearning.net’s tutorial on CNN).
4.9 A single example from the synthetic dataset. a) First channel is a smoothly decaying ellipsoid with a randomly chosen origin and radii, b) second channel is a smoothly decaying ring, also with randomly chosen origin and radii, and c) third channel is a discrete block with random origin and dimensions.
4.10 Examples of a) positive and b) negative training data. Notice how all three channels overlap in the positive example, but not in the negative example.
4.11 Convergence plots for the best result running the synthetic dataset in the CNN. Blue curve is training data, red curve is validation data.
4.12 False negative classification results from synthetic dataset. Contour plots of channels 1, 2, 3 are plotted as red, blue and green, respectively.
4.13 False positive classification results from synthetic dataset. Contour plots of channels 1, 2, 3 are plotted as red, blue and green, respectively.
5.1 QUEST project area in central British Columbia.
5.2 Porphyry belts in central British Columbia.
5.3 Sample of the available geoscience maps in the QUEST region.
5.4 Major porphyry systems around the world, taken from (John et al., 2010).
5.5 Illustrative example of different signatures over a copper-gold porphyry deposit (Alumbrera deposit in Argentina), taken from (Hoschke, 2010).
5.6 Diagram of a copper porphyry system, taken from (Sillitoe, 2010).
5.7 Left: Known mineralization locations in the validation set of the QUEST project area. Right: Uncertainty estimates for the mineralization labels (red is more uncertain) in the validation set.
5.8 Prediction for prospective mineral regions in validation set of the QUEST area (red is more prospective, blue is less) from SVM.
5.9 Prediction for prospective mineral regions in QUEST area (red is more prospective, blue is less) from SVM using only potential field data.
5.10 Convergence plots for the best result running the QUEST dataset in the CNN. Blue curve is training data, orange curve is validation data.
5.11 Mineral prospectivity map generated from CNN for QUEST region. Colorscale maps the probability of mineralization from 0-1 (blue-red). Known mineral occurrences are plotted as black dots for reference.

Acknowledgements

Most importantly, I must thank my supervisor Eldad Haber for his mentorship over the last 4 years. This degree, and likely my career path, would not exist if it were not for you. I am grateful for all the knowledge, guidance, kindness and arguments you have shared with me during the course of my program. Your willingness to explore new topics and your dedication to understanding them have kept my research exciting and fresh, and left me with a strong foundation for future endeavours.

A special thanks goes out to NEXT Exploration for sponsoring my research through a Mitacs Accelerate Fellowship. Thanks also to Geoscience BC for the generous funding through their scholarship program, and to UBC for additional support.

Thanks to the team at Computational Geosciences for the experience of working in industry and creating the motivation to complete this degree. In particular thanks to Elliot Holtham, Dave Marchant and Livia Mahler for the support throughout my PhD, both in tackling the field data from the QUEST project and in keeping me driven to finish.

To all my office mates and colleagues at UBC-GIF, thank you for the assistance and fun over the past 7 years! Gudni, Lindsey, Mike and Sarah, it’s been a pleasure bouncing ideas off of you and moving through our degrees together.

To all my friends and family outside the UBC bubble that have been with me on this journey and have kept me sane and focused, your support and friendship are valued beyond measure.
In particular thank you to my parents for your love and guidance and for always being there to pick me up and push me onwards.

To my parents.

Chapter 1
Introduction

1.1 Motivation

Over the last 25 years the number of major mineral discoveries has drastically decreased, as have the profit margins in the mining sector (Barnett and Williams, 2006). This finding is counter-intuitive given the rising global population, rapid industrialization of emerging economies and the shrinking globe. As in most fields, one would expect efficiency to increase as the steady march of science gives rise to advances in technology and methodology; in mineral exploration, however, this has not been the case.

This shortage in new mineral resource discoveries is partially due to inertia of the industry in adopting new technologies, and partially a simple consequence of limited natural resources. The majority of currently active mines were discovered because they either had outcropping mineralization (geology) or were relatively shallow bodies exhibiting fairly strong geophysical or geochemical signatures. These represent the low-hanging fruit in the context of exploration targeting; the remaining targets will not be so easily discovered. Future mineral prospects will likely bear at least one of the following challenges:

• Deeper ore bodies
• Complex mineral systems with non-trivial signatures in the data
• Obscured beneath overburden or other complex geology

This is not to suggest that the mining industry is ill-prepared to explore these targets. On the contrary, advances in geoscience have been consistent, producing a wide range of new tools for prospecting.
One such advance has been the development of techniques and technologies for the integration of large quantities of geoscientific data via mineral prospectivity mapping.

Traditional exploration teams would operate within the bounds of their respective fields (i.e., geology, geochemistry and geophysics), with little or no collaboration between the disciplines. The challenge with this approach is that in many cases the whole is greater than the sum of its parts when targeting a mineral deposit, and the exploration program can benefit immensely from a coordinated approach involving a diverse team of experts. Because of this, integrated exploration targeting teams have become much more the norm over the last decade.

Despite this, one of the remaining challenges resides in objective, efficient handling of the ever-increasing volume of available geoscientific data. In many developed countries exploration data sets are made public, either by local government agencies or as a mandatory clause for all data collection (Bastrakova and Fyfe, 2012). This array of knowledge and data can be staggering, spanning various aspects of geology, geochemistry and geophysics. While indisputably valuable, expert analysis of geoscientific data is always subject to various degrees of personal bias and interpretation, and does not always offer a consistent understanding of an exploration target. This can be particularly pronounced when attempting to integrate interpretations from different disciplines, such as geophysical inversion models and geological cross sections. Given that all the data is generated by sampling some function of the same rocks, whether via airborne magnetometer or analysis of outcrop in hand samples, it stands to reason that there should be some correlation between the various data types that ties them all together.

1.2 Research Outline

Mineral prospectivity mapping (see Chapter 2) aims to tackle this challenge by leveraging the intuitive integration possible using geographical information systems (GIS) in conjunction with the computational power of optimization algorithms borrowed from the machine learning community. These algorithms have enjoyed much success over the last 25 years in fields such as pattern recognition, bioinformatics, fraud detection, and many others. As advanced data analysis tools, the power of machine learning algorithms is in their ability to highlight hidden insights in datasets without the need to explicitly search for them. Starting from some assumed flexible model $f(x; \theta)$ of the data $x$ with tuneable parameters $\theta$, machine learning algorithms learn a representation of the data by iteratively updating the parameters to minimize some objective function, such as the residual sum of squares (RSS)

\[ \mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - f(x_i; \theta) \right)^2 \tag{1.1} \]

which quantifies the goodness of fit for the selected model and parameters with respect to some desired output, $y$. Beginning in the 1980s, various researchers have applied many different machine learning algorithms to mineral prospectivity mapping, ranging from regression to fuzzy logic to neural networks.

As the dataset gets larger and more complicated, more sophisticated algorithms are required. One of the more popular machine learning algorithms from the early 2000s is support vector machines (SVM), in which one tunes a set of weights to obtain a hyperplane in some feature space which optimally separates the data classes (see Chapter 3). SVMs were and are widely used because they have strong theoretical guarantees and they provide an elegant and efficient algorithm to solve for complex non-linear relationships using a sparse classifier. Because of this, the classifier can
be constructed using a small subset of the training points, meaning the size of the training set has little impact on the complexity of the model.

For these reasons, support vector machines were selected as the starting point for this research. Chapter 3 derives the algorithm from basic principles and outlines some of the common variants, as well as the practical considerations for use. Initial attempts were completed using the existing open source SVM software libSVM (Chang and Lin, 2011); however, it quickly became apparent that certain characteristics of the mineral prospectivity mapping data (see Section 2.3) required the use of uncertainties in training the algorithm, which was not available in this, or any other, SVM package. As a result, SVM algorithms were coded in MATLAB to incorporate this functionality using both a modified classical SVM formulation (equation 3.21) and a new implementation, termed L1-RSVR (robust support vector regression), which begins from a total least squares approach (equation 3.37). To illustrate the effectiveness of the L1-RSVR algorithm at handling uncertainties on both the data and the labels, a small toy problem was used as a synthetic example (see Section 3.5).

Satisfied with this success, support vector machines were used to train a classifier on the QUEST dataset (see Chapter 5). This dataset from central British Columbia spans roughly 150,000 km² and consists of a suite of geoscience data ranging from airborne geophysics, to reinterpreted multi-element geochemistry, to geological mapping. The region is known to host copper-gold porphyry systems, a style of economic mineral deposit which occurs in clusters or trends up and down the Pacific coast of the Americas, and accounts for more than 60% of the global copper reserves.
Using a database of known mineralization within the QUEST study area, a prospectivity map for copper-gold porphyry systems was generated using the L1-RSVR algorithm (see Section 5.5).

Though validation error was low and results were encouraging, one aspect of the support vector machine algorithm was unsatisfying for this application: the lack of spatial information. While SVMs are powerful learning machines able to model extremely complex data, they inherently operate point-wise, with no sensitivity to regional or global structure. For the mineral prospectivity mapping problem this can be crucial, since many of the most important indicators of mineralization are spatial patterns rather than anomalous values. Porphyries, for example, are known to exhibit a halo-like structure in both the magnetics and certain geochemical signatures, and are known to occur near cross-cutting fault structures. While solutions exist to incorporate spatial information into the point-sample framework of SVMs, this realization presented an opportunity to explore one of the most widely used machine learning algorithms of the last decade: convolutional neural networks (CNNs).

Convolutional neural networks (see Chapter 4) are a subset of neural networks, a popular learning algorithm originally motivated by the hierarchical and distributed manner in which neurons in the brain are able to master complex abstract ideas. First developed in the 1990s, CNNs have gained widespread popularity over the last decade for applications in image, text and speech recognition. Their power comes from the realization that natural images (and sounds) have recurring local structure which can be exploited.
Given the recent success of CNNs in image identification, segmentation, and parsing, the algorithm is an ideal candidate for mineral prospectivity mapping, in which the task is to identify subtle patterns in a multi-parameter dataset of thematic geoscience maps.

While many open source packages exist for the implementation of convolutional neural networks (e.g., Torch, TensorFlow, Caffe, Theano), it was decided that we would code our own so as to have a better understanding of the inner workings of the algorithm. A preliminary CNN code was written in MATLAB, and then a working copy was written in Julia (Bezanson et al., 2014). As the focus of this part of the research was the incorporation of spatial information, uncertainties were left out of the code at this stage. As future work, uncertainties on the training labels can easily be added; however, uncertainties on the data will require further consideration.

Though many benchmark datasets exist in the machine learning community for vision tasks (e.g., MNIST, CIFAR10, ImageNet), none of these highlight the importance of treating multi-modal data (data which comes from different sensors with different distributions). To address this and test our CNN code we developed a synthetic dataset (see Section 4.3) with multiple input channels as an analog for the mineral prospectivity mapping problem. Content with the results on this synthetic example, we trained a convolutional neural network on a subset of the QUEST dataset to generate a prospectivity map for copper-gold porphyries in central British Columbia (see Section 5.6).

1.3 Research Summary

The main objective of the research presented in this thesis is to explore, develop and apply advanced machine learning algorithms for mineral prospectivity mapping. Accomplishing this requires the synthesis of three existing broad disciplines, each with a large body of work behind it: geoscience, geographical information systems (GIS), and machine learning.
This work resulted in the following key contributions:

• A modified support vector machine code was written in Matlab which incorporates uncertainties using extra terms on the traditional formulation
• A support vector regression code (L1-RSVR) was written in Matlab using a new formulation beginning from a total least squares approach to incorporate uncertainties on both data and labels
• A convolutional neural network package was written in Julia using a nonlinear conjugate gradient solver
• Convolutional neural networks were applied to the mineral prospectivity mapping problem for the first time
• Mineral prospectivity maps were generated for the QUEST dataset in central British Columbia

The rest of this thesis document is organized as follows. Chapter 2 introduces the mineral prospectivity mapping problem, complete with a summary of previous work and the relevant problem characteristics. Chapter 3 derives the robust support vector regression algorithm, L1-RSVR, beginning from a basic explanation of support vector machines and extending them to handle various complexities common with real data, including uncertainties. To highlight the importance of properly incorporating uncertainties, a simple synthetic example is presented.

Chapter 4 builds up the necessary background information for proper understanding and implementation of a convolutional neural network, including a history of their development, as well as a breakdown of their common components. To demonstrate the utility of CNNs for target identification on multi-modal data, a simple synthetic dataset is presented (different from the example for SVM).

To illustrate the effectiveness of these algorithms on real data, Chapter 5 introduces the QUEST dataset from central British Columbia, including a brief description of the exploration target (copper-gold porphyry systems) and a short discussion on data pre-processing and preparation.
Both algorithms, SVM and CNN, are used independently to generate prospectivity maps for the QUEST region.

Finally, in Chapter 6 the work is summarized and the differences in the two approaches are discussed. Remaining challenges and opportunities for future work are highlighted, and the major contributions of this research are stated.

Chapter 2
Mineral Prospectivity Mapping

2.1 General Formulation

Mineral prospectivity mapping takes a holistic approach to exploration targeting; the basic goal is to recognize and parametrize the subtle patterns and features in the multi-disciplinary dataset which are indicative of a desired mineralization style (e.g., copper-gold porphyry). Once this is accomplished, the assumption is that one can then apply the learned model (parameterization) to predict the likelihood of mineralization in other similar exploration environments.

Mathematically, this can be represented as follows. Given a suite of $N$ geoscientific data maps $X_i^{\mathrm{Train}}$ for $i = 1, \dots, N$ (e.g., $X_1$ is geological age, $X_2$ is total magnetic field, etc.) and a database of known mineral occurrence locations in the region $y^{\mathrm{Train}}$, parameterize a model $f(\theta; X^{\mathrm{Train}})$ by tuning $\theta$ to fit the data $X^{\mathrm{Train}}$ to the labels $y^{\mathrm{Train}}$ according to some objective function $L$ such that

\[ \theta^{*} = \arg\min_{\theta} \; L\left( y^{\mathrm{Train}} - f(\theta; X^{\mathrm{Train}}) \right) \tag{2.1} \]

Assuming this parametrization can approximate the relationship between the geoscience data and the known mineral occurrences, mineral potential can then be estimated in new areas using available exploration data $X^{\mathrm{Predict}}$ as

\[ y^{\mathrm{Predict}} = f(\theta^{*}; X^{\mathrm{Predict}}) \]

so long as the training and predicting data have the same distributions and are sampled in a similar fashion. In other words, so long as the exploration target and environment are similar for the training and predicting sets, and the data is sampled and processed in the same way, the classifier can be used to predict mineralization in the new area.
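
As a purely illustrative sketch of equation 2.1, the snippet below fits the parameters on a handful of training cells and then evaluates the fitted model on unseen cells. Everything specific here is an assumption made for the example and is not taken from this thesis: the model f is a logistic function of a weighted sum of two layers, the objective L is the residual sum of squares of equation 1.1, the optimizer is plain gradient descent, and all data values are invented.

```python
import math

def f(theta, x):
    """Hypothetical model f(theta; x): logistic function of a weighted sum of layers."""
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def rss(theta, X, y):
    """Objective L: residual sum of squares between labels and model output."""
    return sum((yi - f(theta, xi)) ** 2 for xi, yi in zip(X, y))

def fit(X, y, steps=2000, lr=0.5):
    """Approximate theta* = argmin RSS with fixed-step gradient descent (assumed optimizer)."""
    theta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            p = f(theta, xi)
            # d/dz of (y - p)^2, using dp/dz = p * (1 - p) for the logistic function
            common = -2.0 * (yi - p) * p * (1.0 - p)
            grad[0] += common
            for j, v in enumerate(xi):
                grad[j + 1] += common * v
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Invented training cells: two normalized layer values per cell, label 1 = known occurrence.
X_train = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y_train = [1, 1, 1, 0, 0, 0]

theta_star = fit(X_train, y_train)

# Invented cells from a new area, assumed to be sampled and scaled like the training data.
X_predict = [[0.85, 0.75], [0.15, 0.2]]
y_predict = [f(theta_star, x) for x in X_predict]
print(y_predict)
```

The outputs lie between 0 and 1 and can be read as a relative prospectivity score for each new cell. The sketch only holds together because the prediction cells resemble the training cells, which is the same condition stated in the text.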
In general $f(\cdot)$ can be any function, ranging from simple linear classifiers to complex nonlinear functions, and $\theta$ can be any kind of parameterization, ranging from indicator variables to regression weights and more complex representations.

2.2 Previous Work

Mineral prospectivity mapping was first proposed in the late 1980s by geoscientists as a statistical method for the integration and interpretation of spatial patterns in geoscience data (Bonham-Carter et al., 1989). The concept was to determine, in an objective way, the link between various geoscience datasets (i.e., geology, geophysics and geochemistry) and the existence or absence of economic mineralization.

As computers became more powerful, accessible, and user-friendly, software such as geographical information systems (GIS) revolutionized the way in which geoscientists could analyze and interpret exploration data. The ability to simultaneously view multiple layers of maps on a common reference system facilitated the development and adoption of sophisticated targeting algorithms for the integration of multiple datasets, resulting in the rise of mineral prospectivity mapping.

One of the original formulations, termed Weights of Evidence (WofE) (Agterberg et al., 1990), used posterior probability as the mapping function, which is calculated by counting the relative number of occurrences inside and outside a series of binary thematic layers (e.g., Granite Contact). Other authors have explored using logistic regression (Harris and Pan, 1999), wherein weights are calculated for a series of geological variables using least-squares regression on the probability of mineralization and the binary presence or absence of mineralization.
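
The counting behind Weights of Evidence can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by the cited authors: it uses the commonly quoted positive and negative weights W+ = ln[P(B|D) / P(B|not D)] and W- = ln[P(not B|D) / P(not B|not D)] for a binary layer B and an occurrence indicator D, applies them to an invented grid of 20 cells, and omits the conditional independence checks and zero-count handling that a real study would need. The function names are hypothetical.

```python
import math

def weights_of_evidence(layer, deposits):
    """Return (W_plus, W_minus) for one binary evidence layer.

    layer[i] is True where the pattern (e.g. near a granite contact) is present in cell i;
    deposits[i] is True where a known occurrence falls in cell i.
    Assumes no count is zero (no smoothing is applied in this sketch).
    """
    n = len(layer)
    n_d = sum(deposits)                                   # cells with an occurrence
    n_b = sum(layer)                                      # cells with the pattern
    n_bd = sum(1 for b, d in zip(layer, deposits) if b and d)
    p_b_given_d = n_bd / n_d
    p_b_given_not_d = (n_b - n_bd) / (n - n_d)
    w_plus = math.log(p_b_given_d / p_b_given_not_d)
    w_minus = math.log((1 - p_b_given_d) / (1 - p_b_given_not_d))
    return w_plus, w_minus

def posterior_probability(prior, weights):
    """Add the applicable weights to the prior log-odds and convert back to a probability."""
    logit = math.log(prior / (1 - prior)) + sum(weights)
    return 1 / (1 + math.exp(-logit))

# Invented grid: the pattern is present in cells 0-7; occurrences sit in cells 0, 1, 2 and 10.
layer = [i < 8 for i in range(20)]
deposits = [i in (0, 1, 2, 10) for i in range(20)]

w_plus, w_minus = weights_of_evidence(layer, deposits)
prior = sum(deposits) / len(deposits)

p_present = posterior_probability(prior, [w_plus])   # cell where the pattern is present
p_absent = posterior_probability(prior, [w_minus])   # cell where the pattern is absent
print(w_plus, w_minus, p_present, p_absent)
```

With a single layer the posterior reduces to the observed fraction of occurrence cells inside (3 of 8) and outside (1 of 12) the pattern. With several layers, the weights of each layer would be summed in the same call, which relies on the layers being conditionally independent.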
To address the inherent uncertainty in geoscience data, handle categorical variables, and incorporate some expert knowledge, some authors have used fuzzy logic (Porwal et al., 2003b), in which a membership function acts in the place of uncertainty to quantify the degree to which statements are true.

In the last 15 years more sophisticated algorithms have been borrowed from the machine learning field. Feed-forward neural networks (see Section 4.1), now ubiquitous in all forms of artificial intelligence, had humble beginnings in the early 1990s and by the 2000s were being applied in simple architectures for mineral prospectivity mapping (Barnett and Williams, 2009; Brown et al., 2000; Harris and Pan, 1999; Porwal et al., 2003a; Rodriguez-Galiano et al., 2015; Singer and Kouda, 1997). Support vector machines (see Section 3.2), maximum margin classifiers which rose to prominence in the 1990s as a favoured algorithm with a strong theoretical background, have been applied by a number of authors (Abedi et al., 2012; Porwal et al., 2010; Rodriguez-Galiano et al., 2015; Zuo and Carranza, 2011).

Due to the difficulty in field testing algorithms for this application and the relatively slow adoption of these methods by industry, most of the work on mineral prospectivity mapping has been academic. That being said, some of the more commonly adopted methods include weights of evidence, fuzzy logic and neural networks (Partington and Sale, 2004; Raines et al., 2010). The popularity of these methods can be attributed to ease of use, flexibility, and successful application in other fields.

2.3 Problem Characteristics

Despite widespread use of statistical and machine learning methods in numerous fields, mineral prospectivity mapping presents a number of challenges due to the following problem characteristics.
Few training data. Although mineral occurrences appear in large, well-documented trends, the number of occurrences and the size of their footprint relative to the region of exploration is quite small. In British Columbia¹, for example, there are only ∼420 documented alkalic porphyry occurrences, which when associated with a typical footprint of approximately 10 km² account for less than 1% of the surface area. In the context of supervised machine learning this is a very small training set, and when one considers that each occurrence will be slightly different, the variance within the positive training data becomes important.

¹ British Columbia is part of a large well-known porphyry belt which extends from Alaska down to Chile (Sinclair, 2007).

Imbalanced learning problem. In most supervised machine learning environments, unbiased training is achieved by sampling approximately uniformly from each class. When this is not the case, the problem is termed imbalanced (Chawla et al., 2004), and this can lead to poor generalization of the resulting predictor. A number of methods exist to handle imbalanced data, including boosting (Chawla et al., 2011; Guo and Viktor, 2004) and re-balancing (Kubat and Matwin, 1997; Raskutti and Kowalczyk, 2004; Tang et al., 2009). As might be expected, mineral occurrences are relatively rare, resulting in an extremely imbalanced set of training labels. This problem is further exacerbated when one restricts the problem to a specific type of mineralization (e.g., VMS deposits), as is often the case for prospectivity studies.

Uncertain training labels. On top of the imbalanced nature of the mineral prospectivity mapping problem is the large degree of uncertainty associated with the training labels. In this regard, there are two fundamental problems: 1) in most cases a label of "no mineralization" simply means that mineralization has not been discovered, and not necessarily that there is none, and 2) within each class (mineralized and not mineralized) there exists a large range in certainty. For example, in many mineral occurrence databases "mineralization" encompasses occurrences ranging from producing mines all the way down to anomalies and prospects. One can also understand how a classification of "no mineralization" bears very different implications in the middle of a highly explored mining district than it does in a remote location miles from the nearest field sample site.

Uncertain training data. As with any observed data, the training data in the mineral prospectivity mapping problem carry uncertainty from various sources. Some data, such as a magnetic total field measurement, will have numerical uncertainties associated with detection limits and processing procedures. Others, such as geological mapping of bedrock units, will have qualitative uncertainties associated with expert interpretation and sampling bias. Additionally, some data can have a spatially correlated uncertainty introduced by different exploration environments in the field (e.g., under regions of thick overburden it becomes prohibitively difficult to map bedrock). Unlike many machine learning problems where both the data and the labels can be trusted, for mineral prospectivity mapping it is known that both have associated errors. Furthermore, the errors on the different data types can have very large statistical differences.

Non-trivial multi-disciplinary data. The data typically employed for mineral exploration encompass a wide variety of sensors and interpretations.
Geological mapping of the bedrock is conducted in situ wherever outcrop is available, and interpreted to produce multiple lithological maps of the regional and local geology, including parameters such as age, rock type, primary minerals and fault structure. Geochemical sampling of soils and sediments can produce maps of elemental content in ppm or % which can be instrumental in identifying and interpreting alteration zones and other processes indicative of mineralization. Geophysical surveys provide a means to peek beneath the surface in a non-invasive way. Various surveys are able to sample the gravitational, magnetic, or electromagnetic fields in the ground to produce maps or volumetric models which can be correlated to the physical properties of the rocks. Since mineral occurrences are anomalous bodies of ore, they often bear geophysical signatures which are identifiable (Ford et al., 2004).

Mineral systems are complex. Given the variety in the multi-sensor data, the expected structure in each input differs greatly; however, it is the co-occurrence of a number of these structures which is indicative of prospective mineralization. For example, a magnetic high on its own can be meaningless, but when associated with the correct geology, conductivity and geochemical signature it can be a very compelling potential target.

Figure 2.1: Mineral prospectivity mapping aims to combine numerous geoscience layers to derive a map of mineral potential.

Chapter 3
Support Vector Machines

3.1 Maximum Margin Classifier

Support Vector Machines (SVMs) were first formally introduced in the 1990s by Vapnik, Boser and Guyon (Boser et al., 1992; Cortes and Vapnik, 1995; Vapnik, 1995, 1999) as a machine learning algorithm structured on the statistical learning theory (VC theory) developed by Vapnik and Chervonenkis during the 1960s and 1970s (Vapnik and Chervonenkis, 1971; Vapnik and Lerner, 1963).
The basic principle of SVMs is to construct an optimal margin classifier which has complexity based not on the dimensionality of the feature space, but rather on the number of support vectors, thus allowing for sparse solutions (see Figure 3.1). This can be a powerful advantage when dealing with large, high-dimensional datasets since only a small subset of the data is necessary to construct the SVM classifier. SVM falls under the branch of machine learning known as supervised learning, in which a predictor is learned using training data and training labels (Friedman et al., 2001). In the simplest case, one can consider training data with binary labels

$$(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n) \quad \text{with} \quad X_i \in \mathbb{R}^m, \; y_i \in \{-1, 1\}$$

For mineral prospectivity mapping, $X_i$ would be a row vector of different field measurements (e.g., [magnetic total field, bedrock age, distance to fault, etc.]) for a given sample location, and $y_i$ would signify "mineralization" or "no mineralization" for that location.

Figure 3.1: Optimal margin classification using SVM. The separating hyperplane (black) is determined by maximizing the margin (yellow) between a few sparse support vectors (outlined in green).

The goal of a maximum margin algorithm such as support vector machines is to learn the equation of a hyperplane (black line in Figure 3.1) which optimally separates one class of points from another in feature space. Any hyperplane which achieves this will in effect also be maximizing the margin (yellow lines in Figure 3.1) between points of opposing class, hence the name. Assuming the testing data come from the same distribution as the training data, this hyperplane can then be used as a powerful classifier.

3.2 Separable Linear SVM

If the data are linearly separable then one can define a separating hyperplane as

$$Xw + b = 0 \tag{3.1}$$

where $w$ is a column vector of weights and $b$ is a bias constant.
Since the equation of the plane is scale invariant, we can define the normalization

$$|X_n w + b| = 1 \tag{3.2}$$

where $X_n$ is the nearest point to the hyperplane. To maximize the margin we need to know the distance between the hyperplane and the nearest point, $X_n$.

Figure 3.2: Calculating the distance between the hyperplane and a nearest point in feature space, $X_n$.

This can be calculated by projecting the vector between any point on the hyperplane, $X$, and the nearest point $X_n$ onto the unit vector perpendicular to the plane, $\hat{w} = w / \|w\|$ (see Figure 3.2). Therefore the distance is

$$\text{Distance} = |(X_n - X)\hat{w}| = \frac{1}{\|w\|} |X_n w - X w| \tag{3.3}$$

If we simply add and subtract the bias term $b$ we get a familiar expression

$$\text{Distance} = \frac{1}{\|w\|} |X_n w + b - X w - b| = \frac{1}{\|w\|} |1 - 0| = \frac{1}{\|w\|} \tag{3.4}$$

Therefore the optimization problem is the following

$$\text{Maximize} \; \frac{1}{\|w\|} \quad \text{Subject to} \quad |X_n w + b| = 1 \tag{3.5}$$

which can be simplified due to our binary label assumption, since for all correctly classified points $|X_i w + b| = y_i (X_i w + b)$. By replacing the maximization with a minimization we arrive at the following constrained optimization problem for separable binary support vector machines

$$\begin{aligned} \underset{w}{\text{minimize}} \quad & \tfrac{1}{2} w^\top w \\ \text{subject to} \quad & y_i (X_i w + b) \ge 1 \quad \text{for } i = 1, \dots, n. \end{aligned} \tag{3.6}$$

This forms a KKT system which can be solved either iteratively in the primal form (as above) or via quadratic programming in the dual form using Lagrange multipliers $\alpha$

$$\begin{aligned} L(w, b, \alpha) &= \tfrac{1}{2} w^\top w + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i (X_i w + b) \right] \\ L(w, b, \alpha) &= \tfrac{1}{2} w^\top w + \alpha^\top \left[ 1 - \mathrm{diag}(y) (Xw + b) \right] \end{aligned} \tag{3.7}$$

This Lagrangian can now be minimized with respect to $w$ and $b$, and maximized with respect to $\alpha_i \ge 0$. To do so we can take the derivatives with respect to the primal variables ($w$ and $b$) and set them equal to 0,

$$\nabla_w L = w - X^\top \mathrm{diag}(y) \alpha = 0 \tag{3.8}$$

$$\nabla_b L = y^\top \alpha = 0 \tag{3.9}$$
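The constraints of (3.6), the margin of (3.4) and the stationarity conditions (3.8) and (3.9) can be checked numerically on a small problem. The sketch below uses a hand-constructed five-point dataset; the candidate solution `w`, `b` and the multipliers `alpha` were chosen by inspection for this illustration and are assumptions, not output of any solver described in the thesis.

```python
import math

# Hand-constructed toy data: two classes separated by the line x1 = 1.
X = [(2.0, 0.0), (2.0, 1.0), (0.0, 0.0), (0.0, 1.0), (3.0, 0.5)]
y = [1, 1, -1, -1, 1]

# Candidate primal solution and Lagrange multipliers (assumed, by inspection).
w = (1.0, 0.0)
b = -1.0
alpha = [0.25, 0.25, 0.25, 0.25, 0.0]


def score(x):
    """Signed, unnormalized distance X_i w + b of one sample to the hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Constraints of (3.6): y_i (X_i w + b) >= 1 for every sample.
functional_margin = [yi * score(xi) for xi, yi in zip(X, y)]
feasible = all(m >= 1.0 for m in functional_margin)

# Geometric margin from (3.4): 1 / ||w||.
margin = 1.0 / math.sqrt(sum(wj * wj for wj in w))

# Support vectors are the samples whose constraint is active (equal to 1).
support = [i for i, m in enumerate(functional_margin) if abs(m - 1.0) < 1e-12]

# Stationarity (3.8): w = X^T diag(y) alpha, and (3.9): y^T alpha = 0.
w_from_alpha = tuple(sum(a * yi * xi[j] for a, yi, xi in zip(alpha, y, X))
                     for j in range(2))
y_dot_alpha = sum(a * yi for a, yi in zip(alpha, y))

print(feasible, margin, support, w_from_alpha, y_dot_alpha)
```

Only the four points lying on the margin carry non-zero multipliers; the fifth point sits well inside its class and does not influence `w`, which is the sparsity property referred to in Section 3.1.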
If we solve equation 3.8 for $w$ and substitute this back into the Lagrangian we get

$$L(\alpha) = \tfrac{1}{2} \alpha^\top \mathrm{diag}(y) X X^\top \mathrm{diag}(y) \alpha + \alpha^\top \left[ 1 - \mathrm{diag}(y) \left( X X^\top \mathrm{diag}(y) \alpha + b \right) \right] \tag{3.10}$$

Expanding and rearranging the terms gives

$$L(\alpha) = \tfrac{1}{2} \alpha^\top \mathrm{diag}(y) X X^\top \mathrm{diag}(y) \alpha + 1^\top \alpha - \alpha^\top \mathrm{diag}(y) X X^\top \mathrm{diag}(y) \alpha + (y^\top \alpha) b \tag{3.11}$$

and since $y^\top \alpha = 0$ the last term drops out. Combining the similar terms, we arrive at the dual form of the linear SVM problem:

$$\begin{aligned} \underset{\alpha}{\text{maximize}} \quad & -\tfrac{1}{2} \alpha^\top \mathrm{diag}(y) X X^\top \mathrm{diag}(y) \alpha + 1^\top \alpha \\ \text{subject to} \quad & y^\top \alpha = 0, \quad \alpha_i \ge 0 \quad \text{for } i = 1, \dots, n \end{aligned} \tag{3.12}$$

which can finally be rewritten in the common dual form as an unconstrained quadratic optimization problem by noting that maximizing $-f(x)$ is equivalent to minimizing $f(x)$ and adding in the constraint that $y^\top \alpha = 0$ using a regularization parameter $\beta$

$$\underset{\alpha}{\text{minimize}} \quad \tfrac{1}{2} \alpha^\top Q \alpha + (-1^\top - \beta y^\top) \alpha \tag{3.13}$$

where $Q = \mathrm{diag}(y) X X^\top \mathrm{diag}(y)$. Both forms - the primal and the dual - are equivalent, and the solution technique is typically a matter of personal preference, though it can be impacted by the size of the data matrix $X$, since $Q$ is of size $n \times n$.

3.3 Variants and Modifications

The optimization problem outlined in the previous section is the simplest to understand; however, it makes a number of assumptions which are often violated, namely that there are only two classes and that they are linearly separable. It is easy to extend support vector machines to handle cases in which this is not true, whether because there exist points which violate the margin, the points are separable only in a higher dimensional feature space, or the points belong to more than two classes (whether multi-class or continuous values).

3.3.1 Inseparable SVM

Consider the case in which the data are nearly linearly separable, with the exception of a few points (see Figure 3.3).
These points are most likely outliers, and should ideally be allowed to violate the margin; the objective should not be overly penalized for them being misclassified.

Figure 3.3: Inseparable SVM: points are allowed to be misclassified, but are penalized using positive slack variables $\xi$.

One way to handle this is with slack variables, $\xi_i$, which quantify the violation for each point. Using this formulation the total violation for all points is $\sum_{i}^{n} \xi_i$, which is weighted by a regularization parameter $C$, and the original SVM problem from (3.6) is modified to

$$\begin{aligned} \underset{w}{\text{minimize}} \quad & \tfrac{1}{2} w^\top w + C \sum_{i}^{n} \xi_i \\ \text{subject to} \quad & y_i (X_i w + b) \ge 1 - \xi_i \quad \text{with } \xi_i \ge 0 \quad \text{for } i = 1, \dots, n \end{aligned} \tag{3.14}$$

3.3.2 Non-linear SVM

In a more extreme case it is possible that the data are completely inseparable in the original data space. By performing a nonlinear transformation $\Phi : X \to Z$ one can almost always find a new feature space in which the data are linearly separable (see Figure 3.4).

The key observation for kernel methods (Boser et al., 1992; Hofmann et al., 2008; Schölkopf et al., 1999) such as support vector machines is that to optimize this problem in the dual (Equation 3.15) only the inner product of features is required, $K(X, X') = Z Z^\top = \Phi(X) \Phi(X)^\top$.

$$\underset{\alpha}{\text{minimize}} \quad \tfrac{1}{2} \alpha^\top Q \alpha + (-1^\top - \beta y^\top) \alpha \quad \text{with} \quad Q = \mathrm{diag}(y) K(X, X') \mathrm{diag}(y) \tag{3.15}$$

This inner product has dimensions $n \times n$, regardless of the dimensionality of the feature space $Z$. This powerful observation means that the data can be transformed to a high (even infinite) dimensional space without increasing the size of the optimization problem, allowing support vector machines to be very flexible in fitting complex nonlinear data.

Due to its natural extension to the kernel trick, nonlinear support vector machines are most often solved in the dual; however, they can also be solved in the primal via the representer theorem using iterative gradient based methods (Chapelle, 2007; Shalev-Shwartz and Singer, 2011). This can be particularly helpful when the number of data, $n$, is large, since $K(X, X')$ will be a large dense matrix.
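The observation that only inner products are needed can be verified directly for a kernel whose feature map is known in closed form. The sketch below assumes a homogeneous degree-2 polynomial kernel on two-dimensional inputs, for which the explicit map is $\Phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$; the four sample points and their labels are made up for illustration.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel (2-D input)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated without ever forming phi."""
    return dot(x, z) ** 2


# Made-up samples and binary labels.
X = [(1.0, 2.0), (-1.0, 0.5), (0.0, -3.0), (2.0, 2.0)]
y = [1, -1, -1, 1]
n = len(X)

# Kernel matrix from the kernel function, and from explicit features.
K = [[poly_kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
K_explicit = [[dot(phi(X[i]), phi(X[j])) for j in range(n)] for i in range(n)]

# Q = diag(y) K diag(y) from (3.15); always n x n, whatever the size of phi(x).
Q = [[y[i] * K[i][j] * y[j] for j in range(n)] for i in range(n)]

same = all(math.isclose(K[i][j], K_explicit[i][j], abs_tol=1e-9)
           for i in range(n) for j in range(n))
print(same, len(Q), len(Q[0]))
```

For kernels such as the radial basis function the explicit map is infinite dimensional, so only the first route (evaluating `K` directly) is available, while the size of `Q` stays the same.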
Figure 3.4: Non-linear transformation via the kernel trick. The data are linearly separable in the transformed feature space.

Some of the more common kernel transformations used for support vector machines include polynomial, sigmoidal and radial basis functions; however, in principle one can construct an arbitrary kernel so long as it obeys Mercer's condition.

3.3.3 Multiclass SVM

All previous formulations have assumed a binary classification problem. It is easy to think of a situation in which more than two output values are possible. The simplest extension is to apply support vector machines to a classification problem with $m$ classes.

Figure 3.5: Multiclass SVM problem can be solved as one-vs-rest or as one-vs-one to generate multiple decision boundaries.

Two main strategies exist for tackling this problem. The first, termed one-vs-rest, trains $m - 1$ classifiers using "one-hot" labels for $y$ such that

$$y = \begin{bmatrix} 1 & 1 & 2 & \cdots & 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}$$

so that each class is trained versus all others, and membership is assigned for each point to the class which has the largest objective value

$$g(X) = \sum_{\alpha_n > 0} \alpha_n y_n K(X_n, X) + b$$

The other common strategy is known as one-vs-one, and constructs $\frac{m(m-1)}{2}$ classifiers (one for each pair of classes). In this method the number of times each point is assigned to each class (out of a possible $m - 1$ times for each of the $m$ classes) is counted, and membership is assigned to the class with the maximum count.

3.3.4 Support Vector Regression

Finally, one can envision many applications where the problem is not one of classification at all, but rather a regression problem with continuous output values.
The most common formulation for this scenario is known as $\epsilon$-SVR (Vapnik, 1995), in which the aim is to find an approximating function for the labels which deviates by at most $\epsilon$ from the target.

Figure 3.6: Support vector regression using the $\epsilon$-insensitive loss function; only residuals larger in magnitude than $\epsilon$ contribute to the objective value (see equation 3.18).

Mathematically, the optimization problem becomes

$$\begin{aligned} \underset{w}{\text{minimize}} \quad & \tfrac{1}{2} w^\top w + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\ \text{subject to} \quad & y_i - (X_i w + b) \le \epsilon + \xi_i \\ & (X_i w + b) - y_i \le \epsilon + \xi_i^* \\ & \xi_i, \xi_i^* \ge 0 \end{aligned} \tag{3.16}$$

where $\epsilon$ is a measure of the precision of the approximating function and $\xi_i^{(*)}$ are the slack variables allowing for violations of the separating hyperplane. SVR can also be generalized as the following regularized risk minimization

$$\mathrm{Risk}(w) = \sum_{i=1}^{n} \rho(y_i - X_i w) + \frac{\alpha}{2} \|w\|_2^2 \tag{3.17}$$

where in the case of $\epsilon$-SVR, the loss function is

$$\rho(\sigma; \epsilon) = \begin{cases} 0 & \text{if } |\sigma| < \epsilon \\ |\sigma| - \epsilon & \text{otherwise} \end{cases} \tag{3.18}$$

In general, for small $\epsilon$, this bears a resemblance to ridge-regularized regression with the 1-norm as the cost function ($\rho_\sigma = \|\sigma\|_1$). Both of these problems try to minimize some function $\rho_\sigma(\cdot)$ of the residual errors ($\sigma$) between the true labels ($y$) and the predicted labels ($\Phi(X) w$).

$$\min_{\sigma, w} \; \rho_\sigma(\sigma) + \frac{\alpha}{2} \|w\|_2^2 \quad \text{s.t.} \quad \Phi(X) w - y - \sigma = 0 \tag{3.19}$$

As discussed in section 2.3, it is possible however that the labels are not known exactly, and that some uncertainty exists. Furthermore, the same could be true of the data; measurements in the real world often have some associated uncertainty.

Figure 3.7: Three of the most common loss functions used in conjunction with SVMs. Note: The hinge loss is using $\epsilon = 0.5$.

One can modify SVR so as to handle not only errors on the labels, but also errors on the data. This has been explored to some extent by previous authors, though in many cases uncertainties are only treated for either the inputs or the outputs, not both (Carrizosa, 2007; Pant et al., 2011; Zhang, 2004).
In some instances both uncertainties were simultaneously considered; however, the algorithm was treated in the dual form (Huang et al., 2012). As has been previously discussed, this can quickly become cumbersome when dealing with large datasets. One way to handle this would be to modify the primal objective function from (3.6) to incorporate a weighting term proportional to the uncertainty on the labels

$$\Phi(w) = \frac{C_1}{2} w^\top w + 1^\top \max\left(0, 1 - \mathrm{diag}(y) (Xw + b)\right) \tag{3.20}$$

and a new term which allows for errors in the observed data, $X_{obs}$, with a user defined uncertainty matrix $\Sigma$ (can be diagonal, full, or anything in between)

$$\Phi(w, X) = \frac{C_1}{2} w^\top w + 1^\top \max\left(0, 1 - \mathrm{diag}(y) (Xw + b)\right) + \frac{C_2}{2} (X - X_{obs})^\top \mathrm{diag}\left(\frac{1}{\Sigma}\right) (X - X_{obs}) \tag{3.21}$$

where $C_1$ and $C_2$ are regularization constants which control the trade-off between changing the model, the data fit, and changing the data. This can be solved iteratively for $w$ and $X$ until convergence is reached for both unknowns. In the following section an alternative formulation is developed using a Total Least Squares approach as a starting point, which leads to an elegant and efficient iterative algorithm for solving the SVR problem with uncertainties on both inputs and outputs.

3.4 Robust SVR with Uncertainty

Contrary to ridge regression/Least Squares, where errors are assumed to exist only in the residual between the true labels and predicted labels, Total Least Squares (TLS) allows for errors to exist in the data as well (Golub and Loan, 1980; Golub et al., 1999; Osborne and Watson, 1985). To our knowledge, this approach has yet to be applied to the support vector machine problem. This results in the following optimization problem:

$$\min_{\Sigma, \sigma, w} \; \rho_\Sigma(\Sigma) + \rho_\sigma(\sigma) + \frac{\alpha}{2} \|w\|_2^2 \quad \text{s.t.} \quad (\Phi(X) + \Sigma) w - y - \sigma = 0 \tag{3.22}$$

This is a regularized regression problem in which we aim to solve for the weights, $w$, assuming that the input data $\Phi(X)$ and the labels $y$ are inaccurate, with unknown errors $\Sigma$ and $\sigma$. We wish to minimize these errors, measured by the loss
functions $\rho_\Sigma$ and $\rho_\sigma$. The simplest case is total least squares (TLS), where we choose quadratic losses ($\rho_\Sigma(E) = \frac{1}{2} \|E\|_F^2$ and $\rho_\sigma(\epsilon) = \frac{1}{2} \|\epsilon\|^2$), but other loss functions can be used as well. We now quickly review TLS and then extend it to the $\ell_1$ norm.

3.4.1 Total Least Squares

Consider first the case of quadratic loss functions. The Lagrangian is

$$L = \frac{1}{2} \|\Sigma\|_F^2 + \frac{1}{2} \|\sigma\|^2 + \frac{\alpha}{2} \|w\|_2^2 + \lambda^\top \left( (\Phi + \Sigma) w - y - \sigma \right)$$

where $\lambda$ is a vector of Lagrange multipliers. To obtain the necessary conditions for a minimum we take the derivatives with respect to $w$, $\Sigma$, $\sigma$ and $\lambda$ and set them all equal to zero. With a little rearranging we get $\Sigma = -\lambda w^\top$ and $\sigma = \lambda$. Substituting these relations back into the optimality conditions we arrive at the following system of two nonlinear equations with two unknowns, $w$ and $\lambda$:

$$\alpha w + \Phi^\top \lambda - w \, \lambda^\top \lambda = 0 \tag{3.23a}$$

$$\Phi w - \lambda \, w^\top w - y - \lambda = 0 \tag{3.23b}$$

Defining a new constant $\beta = (w^\top w + 1)^{-1}$ and rearranging we obtain

$$\left( \Phi^\top \Phi + \frac{\alpha - \lambda^\top \lambda}{\beta} I \right) w = \Phi^\top y$$

If we now define another constant $\gamma = \beta^{-1} (\alpha - \lambda^\top \lambda)$ we find that the solution to the problem is nothing but the "familiar" Tikhonov regularization

$$(\Phi^\top \Phi + \gamma I) w = \Phi^\top y \tag{3.24}$$

with an unknown regularization parameter $\gamma$. Regardless of the solution strategy, equation (3.24) implies that the weights $w$ are qualitatively similar to the ones obtained when assuming that the modeling error $\Sigma = 0$ and using a regularization parameter that is different than the one used for the discrepancy principle. Notably, this implies that the error

$$\sigma = (\Phi + \Sigma) w - y$$

is not sparse. In the context of SVR a non-zero residual is a support vector; dense residuals therefore imply many support vectors, which may not yield a desirable result in the context of machine learning.

3.4.2 Total Least Norm - L1 Sparsity

To obtain sparse residuals we propose to solve the following optimization problem

$$\min_{\Sigma, \sigma, w} \; \|\sigma\|_1 + \frac{\beta}{2} \|\Sigma\|_F^2 + \frac{\alpha}{2} \|w\|_2^2 \quad \text{s.t.} \quad (\Phi + \Sigma) w - y - \sigma = 0 \tag{3.25}$$

The functional to be minimized has one main feature: it demands sparse residuals by using the L1 norm on the residual vector $\sigma$, thus obtaining a small number of support vectors. This time, we eliminate $\sigma$ to obtain the unconstrained optimization problem

$$\min_{\Sigma, w} \; \|(\Phi + \Sigma) w - y\|_1 + \frac{\beta}{2} \|\Sigma\|_F^2 + \frac{\alpha}{2} \|w\|_2^2$$

The Euler-Lagrange equations for this problem are

$$0 = (\Phi + \Sigma)^\top \mathrm{sign}\left( (\Phi + \Sigma) w - y \right) + \alpha w \tag{3.26a}$$

$$0 = \mathrm{sign}\left( (\Phi + \Sigma) w - y \right) w^\top + \beta \Sigma \tag{3.26b}$$

This is a coupled nonlinear system that cannot be decoupled as in the previous case of total least squares. However, note that for a given $\Sigma$ the system can be thought of as a mixed L1/L2 recovery problem for $w$ and, for a given $w$, the problem is a mixed L1/L2 recovery problem for $\Sigma$. In particular, using equation (3.26b) we obtain

$$\Sigma = -\beta^{-1} \mathrm{sign}\left( (\Phi + \Sigma) w - y \right) w^\top \tag{3.27}$$

That is, $\Sigma = v w^\top$ is a rank one matrix with

$$v = -\beta^{-1} \mathrm{sign}\left( (\Phi + \Sigma) w - y \right) \tag{3.28}$$

Each of these problems can be solved effectively by a number of methods. Thus, one solution strategy is to use a block coordinate descent, solving for $w$ and $\Sigma$ at each iteration.

Solving for w

Consider the system (3.26). In the first step we assume that $\Sigma$ is known and solve equation (3.26a) for $w$. This is equivalent to solving the optimization problem

$$\min_{w} \; \|\hat{\Phi} w - y\|_1 + \frac{\alpha}{2} \|w\|^2 \tag{3.29}$$

where $\hat{\Phi} = \Phi + \Sigma$. This can be done using iteratively reweighted least squares (IRLS). Using the identity $|t| = t^2 / |t|$, $t \ne 0$, and approximating it by $|t| \approx \frac{t^2}{\rho(t)}$ with $\rho(t) = \sqrt{t^2 + \epsilon}$, we rewrite the problem as

$$\min_{w} \; (\hat{\Phi} w - y)^\top \mathrm{diag}\left( \frac{1}{\rho(\hat{\Phi} w - y)} \right) (\hat{\Phi} w - y) + \frac{\alpha}{2} \|w\|^2$$

Although at first glance the choice of $\epsilon$ seems somewhat unclear, as we will show, we could actually choose $\epsilon = 0$ without any difficulty. In IRLS we solve a sequence of reweighted quadratic problems of the form

$$w_k = \arg\min_{w} \; (\hat{\Phi} w - y)^\top \mathrm{diag}\left( \frac{1}{\rho(\hat{\Phi} w_{k-1} - y)} \right) (\hat{\Phi} w - y) + \frac{\alpha}{2} \|w\|^2$$

with the solution

$$(2 \hat{\Phi}^\top W^{-1} \hat{\Phi} + \alpha I) w_k = \hat{\Phi}^\top W^{-1} y \tag{3.30}$$

where $W^{-1} = \mathrm{diag}\left( \frac{1}{\rho(\hat{\Phi} w_{k-1} - y)} \right)$.
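The reweighting loop can be illustrated on the smallest possible case, a single scalar weight with no bias, where each reweighted quadratic problem has a closed-form solution. This is a sketch under stated assumptions rather than the thesis implementation: it uses the common surrogate $\frac{1}{2} \sum_i r_i^2 / \rho_i + \frac{\alpha}{2} w^2$, so its constant factors may differ from those in equation (3.30), and the function name, data and parameter values are made up.

```python
import math


def irls_l1_scalar(phi, y, alpha=0.01, eps=1e-8, n_iter=200):
    """Approximately minimise sum_i |phi_i * w - y_i| + alpha/2 * w^2
    over a scalar w by iteratively reweighted least squares."""
    # Start from the ridge (unweighted least-squares) solution.
    w = sum(p * t for p, t in zip(phi, y)) / (sum(p * p for p in phi) + alpha)
    for _ in range(n_iter):
        # rho(t) = sqrt(t^2 + eps) smooths |t| so the weights stay finite.
        rho = [math.sqrt((p * w - t) ** 2 + eps) for p, t in zip(phi, y)]
        # Closed-form minimiser of 1/2 * sum_i r_i^2 / rho_i + alpha/2 * w^2.
        numerator = sum(p * t / r for p, t, r in zip(phi, y, rho))
        denominator = sum(p * p / r for p, r in zip(phi, rho)) + alpha
        w = numerator / denominator
    return w


# Made-up data following y = 1 * phi, except for one grossly wrong label.
phi = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 50.0]

w_ridge = sum(p * t for p, t in zip(phi, y)) / (sum(p * p for p in phi) + 0.01)
w_l1 = irls_l1_scalar(phi, y)
print(w_ridge, w_l1)
```

The quadratic fit is dragged far from the underlying slope by the single bad label, whereas the L1 fit stays close to it and leaves one large residual, which is the sparse-residual behaviour sought in this section.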
At this point we make an interesting observation. It is well known that the system (3.30) is equivalent to the following

$$-\frac{1}{\alpha} \left( \hat{\Phi} \hat{\Phi}^\top + \frac{\alpha}{2} W \right) z = y \tag{3.31a}$$

$$\frac{1}{\alpha} \hat{\Phi}^\top z = w \tag{3.31b}$$

This is sometimes referred to as the "data space system". Note that in this formulation only $W = \mathrm{diag}\left( \rho(\hat{\Phi} w_{k-1} - y) \right)$ appears, and therefore choosing $\epsilon = 0$ does not generate any difficulty.

Solving for Σ

In this step we assume that $w$ is known and solve equation (3.26b) for $\Sigma$. The equation reads

$$\mathrm{sign}\left( (\Phi + \Sigma) w - y \right) w^\top + \beta \Sigma = 0 \tag{3.32}$$

This equation implies that $\Sigma$ is a rank one matrix. Setting $\Sigma = v w^\top$ for some unknown vector $v$, and substituting into (3.26b) we obtain

$$\left( \mathrm{sign}\left( (\Phi + v w^\top) w - y \right) + \beta v \right) w^\top = 0 \tag{3.33}$$

and this implies that $\mathrm{sign}\left( (w^\top w) v + \Phi w - y \right) + \beta v = 0$. Setting $\gamma = w^\top w$ and $\hat{y} = y - \Phi w$, and multiplying the equation by $\gamma$, we rewrite the system as

$$\gamma \, \mathrm{sign}(\gamma v - \hat{y}) + \beta \gamma v = 0 \tag{3.34}$$

We now note that this equation is nothing but the optimality condition for the problem

$$\min_{v} \; \|\gamma v - \hat{y}\|_1 + \frac{\beta \gamma}{2} \|v\|^2 \tag{3.35}$$

Furthermore, (3.34) is essentially a system of decoupled nonlinear equations, and therefore it can be solved analytically for each component

$$v_i = \begin{cases} \hat{y}_i / \gamma & \text{if } \gamma v_i - \hat{y}_i = 0 \;\; \& \;\; -1 \le \beta v_i \le 1 \\ \beta^{-1} & \text{if } \gamma v_i - \hat{y}_i < 0 \\ -\beta^{-1} & \text{if } \gamma v_i - \hat{y}_i > 0 \end{cases} \tag{3.36}$$

The solution to (3.25) can now be obtained by iteratively solving for $w$ (using equations 3.31a and 3.31b) and $\Sigma$ (using equation 3.36 and $\Sigma = v w^\top$).

3.4.3 Case of Different Variances

A more interesting case that is relevant to our application is when the errors have different variances. This is one way in which estimates of the uncertainties can be incorporated into the MPM problem. In this case problem (3.26) is modified to the following

$$\min_{\Sigma, w} \; \left\| \mathrm{diag}(\Gamma) \left( (\Phi + \Sigma) w - y \right) \right\|_1 + \frac{\beta}{2} \left\| \Omega^{\frac{1}{2}} \odot \Sigma \right\|_F^2 + \frac{\alpha}{2} \|w\|^2 \tag{3.37}$$

where $\Gamma$ is the inverse variance vector of $\sigma$ (the label error), $\Omega^{\frac{1}{2}}$ is the square root of the inverse variance of $\Sigma$ (the data errors) and $\odot$ is the Hadamard product.
In this case the conditions for a minimum are

$$(\Phi + \Sigma)^\top \mathrm{diag}(\Gamma) \, \mathrm{sign}\left( \mathrm{diag}(\Gamma) \left( (\Phi + \Sigma) w - y \right) \right) + \alpha w = 0 \tag{3.38a}$$

$$\mathrm{diag}(\Gamma) \, \mathrm{sign}\left( \mathrm{diag}(\Gamma) \left( (\Phi + \Sigma) w - y \right) \right) w^\top + \beta \, \Omega \odot \Sigma = 0 \tag{3.38b}$$

where $\Omega = \Omega^{\frac{1}{2}} \odot \Omega^{\frac{1}{2}}$. The first observation to be made is that $\Sigma$ is no longer a rank one matrix. In fact, if $\Omega$ is a full rank matrix then so is $\Sigma$, and $\Sigma = \Omega_1 \odot v w^\top$ with

$$v = -\beta^{-1} \mathrm{diag}(\Gamma) \, \mathrm{sign}\left( \mathrm{diag}(\Gamma) \left( (\Phi + \Sigma) w - y \right) \right) \tag{3.39}$$

Here $\Omega_1$ is defined as the element-wise inverse of $\Omega$. Once again this is a coupled nonlinear system that can be solved using a block coordinate descent, iteratively solving for $w$ and $\Sigma$. Just as before, setting $\hat{\Phi} = \Phi + \Sigma$ in (3.38a), this is equivalent to solving the following L1/L2 optimization problem

$$\min_{w} \; \left\| \mathrm{diag}(\Gamma) (\hat{\Phi} w - y) \right\|_1 + \frac{\alpha}{2} \|w\|_2^2 \tag{3.40}$$

which can again be solved using IRLS, with the added weighting of $\mathrm{diag}(\Gamma)$ in the L1 term. Equation (3.38b) can similarly be shown to be equivalent to a simple L1/L2 optimization problem; however, due to the Hadamard product it requires a bit of extra algebra, and where $\gamma$ was a scalar in (3.35) it now becomes a diagonal matrix of the form $\gamma = \mathrm{diag}(\Omega (w \odot w))$. The optimization problem becomes

$$\min_{v} \; \left\| \mathrm{diag}(\Gamma) (\gamma v - \hat{y}) \right\|_1 + \frac{\beta}{2} \|\gamma v\|_F^2 \tag{3.41}$$

where $\hat{y} = y - \Phi w$ remains the same as in (3.35) and $\| \cdot \|_F^2$ is the Frobenius norm. Once again (3.41) decouples into a system of one-dimensional equations and can be solved analytically as

$$v_i = \begin{cases} \hat{y}_i / \gamma_i & \text{if } \gamma_i v_i - \hat{y}_i = 0 \;\; \& \;\; -1 \le \frac{\beta v_i}{\Gamma_i} \le 1 \\ \Gamma_i / \beta & \text{if } \gamma_i v_i - \hat{y}_i < 0 \\ -\Gamma_i / \beta & \text{if } \gamma_i v_i - \hat{y}_i > 0 \end{cases} \tag{3.42}$$

3.5 Synthetic Data Example

To demonstrate the impact of incorporating uncertainties into the machine learning algorithm a toy example has been contrived (see Figure 3.8). The dataset consists of 70 sample points in two dimensions, each with an associated binary label, either -1 or 1 (shown as red or blue). Also associated with each sample are two uncertainties, one on the data and one on the label.
To illustrate the effect of uncertainties on the training, the dataset is composed of two separable Gaussian distributions which are made inseparable by the presence of both samples with data errors (purple boxes) and samples with label errors (green boxes).

Figure 3.8: Full synthetic dataset. Separable binary dataset is rendered inseparable by two forms of errors: bad data (purple boxes) and bad labels (green boxes). Training data are circles and predicted labels are triangles. Marker size is proportional to label confidence.

The dataset was then split into training and predicting subsets via random sampling of the original data points. Finally, the L1-RSVR algorithm was run on the training data and compared to prediction results from simple SVR without uncertainties (see Figure 3.9).

Figure 3.9: Prediction results from synthetic example (Left: L1-RSVR and right: SVR without uncertainties). Training data are circles and predicted labels are triangles. Marker size is proportional to label confidence. Separating boundaries shown in green and pink, and black crosses indicate mis-classified points.

From the results it is evident that the incorporation of uncertainties (when they are available) improves the resulting predictor. The only mis-classified points in the L1-RSVR result are the points with uncertain labels (except for one), and the outliers which are assumed to be incorrect, and thus are not penalized for being on the wrong side of the boundary.
Though this is only a small contrived toy example, it illustrates how failure to incorporate uncertainties can result in very poor prediction as the classifier tries to fit data which are not to be trusted.

Chapter 4
Convolutional Neural Networks

4.1 Neural Networks

Neural networks are powerful learning algorithms which are now ubiquitous across most fields of artificial intelligence, ranging from image segmentation (Alvarez et al., 2012; Lai, 2015; Long et al., 2015) to text and image identification (Krizhevsky et al., 2012; LeCun and Bengio, 1995; Russakovsky et al., 2014; Simard et al., 2003) to speech and time series analysis (Bengio et al., 2012; Greff et al., 2015; Hochreiter and Schmidhuber, 1997).

4.1.1 History

The term "neural network", which was coined in the 1950s, refers to the algorithms' attempt to model neural processes in the brain. In the biological setting, signals are passed across synapses from one neuron to the next, and each neuron will only fire if the total input signal exceeds some firing threshold. Beginning from this biological and psychological model, neural networks were developed through the 1940s to 1960s with major contributions from McCulloch & Pitts (McCulloch and Pitts, 1943), Rosenblatt (Rosenblatt, 1958) and Minsky & Papert (Minsky and Papert, 1969). The most well known of these early models was Rosenblatt's perceptron, which was in essence a single layer of neurons from a modern neural network using the Heaviside step function for the activation.

$$f(X; w, b) = \begin{cases} 1 & Xw + b > 0 \\ 0 & \text{otherwise} \end{cases} \tag{4.1}$$

Figure 4.1: Neural networks are modelled after biological neurons in the brain, in which each neuron will only fire (perpetuate a signal) if the total input signal exceeds some threshold (taken from Stanford CS231n course material).

The fundamental shortcomings of these early perceptrons were that they were unable to model the "XOR" function and that they were very slow to run.
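The XOR limitation of the single perceptron in equation (4.1) can be demonstrated directly. In the sketch below the weights, biases and the coarse search grid are all hand-picked assumptions for illustration; the grid search only shows that no setting on that grid reproduces XOR, consistent with (but not a proof of) the fact that XOR is not linearly separable. A stacked arrangement of three perceptrons with hand-chosen weights is included for contrast.

```python
from itertools import product


def perceptron(x, w, b):
    """Equation (4.1): Heaviside step applied to the weighted input sum."""
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) + b > 0 else 0


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

# A single perceptron with hand-picked weights reproduces AND.
and_out = [perceptron(x, (1.0, 1.0), -1.5) for x in inputs]

# Coarse grid of candidate weights and biases in [-2, 2] (step 0.25).
grid = [k / 4 for k in range(-8, 9)]


def representable(target):
    """True if some single perceptron on the grid matches the truth table."""
    return any([perceptron(x, (w1, w2), b) for x in inputs] == target
               for w1, w2, b in product(grid, repeat=3))


def two_layer_xor(x):
    """Two hidden perceptrons (OR, NAND) feeding an AND output unit."""
    h_or = perceptron(x, (1.0, 1.0), -0.5)
    h_nand = perceptron(x, (-1.0, -1.0), 1.5)
    return perceptron((h_or, h_nand), (1.0, 1.0), -1.5)


print(and_out, representable(AND), representable(XOR),
      [two_layer_xor(x) for x in inputs])
```

The hidden units here are set by hand; learning such weights automatically for stacked layers requires a training algorithm that can propagate errors through the hidden layer.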
This delayed further advances in neural networks until the 1980s, when interest was renewed by the combination of superior computing power, the development of the backpropagation algorithm (Rumelhart et al., 1986), and the realization that stacking multiple layers of perceptrons solved the "XOR" problem.

This first resurgence of neural networks presented them as powerful tools for AI as both theoretical proofs (Hornik et al., 1989) and industrial applications (LeCun et al., 1989) were developed. In particular, the development of convolutional neural networks (Cun et al., 1990; LeCun and Bengio, 1995; LeCun et al., 1998) (CNN - see Section 4.2 for more information) for image, text and speech recognition, as well as other methods such as self-organizing maps (SOM) (Kohonen, 1990) and recurrent neural networks (RNN) (Hochreiter, 1998), drove the development and popularity of neural networks through the 1980s and 1990s.

As more researchers began turning to machine learning in the 1990s, alternative algorithms began to arise. In particular, support vector machines (Burges, 1998; Cortes and Vapnik, 1995) presented themselves as simpler, more theoretically grounded learning machines which in many cases were performing at least as well as the most carefully crafted neural networks.

It was not until the late 2000s that, under the new branding of "deep learning", neural nets made another comeback (Bengio and LeCun, 2007; Glorot and Bengio, 2010; Hinton et al., 2006), due in large part to the increased power and efficiency of modern computers. Since then neural networks have risen to dominate the machine learning community, with networks getting ever deeper (Krizhevsky et al., 2012; Szegedy et al., 2015) as advances in both hardware (Chellapilla et al., 2006; Krizhevsky, 2014) and software (Hinton, 2014; Nair and Hinton, 2010) continue to push them to greater accuracy and success.
For a more exhaustive history of their development, the reader is directed to Schmidhuber (2015).

4.1.2 General Architecture

Following the original analogy as a network of neurons used to model an artificial brain, neural networks are typically structured in terms of a number of layers and the number of neurons per layer.

They are commonly drawn in diagrams such as Figure 4.2, in which each column of circles represents a layer and each circle represents a neuron. In a fully connected layer the lines connecting each of the neurons from the previous layer to each of the neurons in the current layer represent the weights. Neural networks achieve the non-linear “neural” response through the use of activation functions, typically in the form of some kind of sigmoid (see Figure 4.3).

Figure 4.2: Example diagram of a feed-forward neural network. This network has 3 inputs and 3 layers: two hidden layers with 4 neurons each, and one output layer (taken from Stanford CS231n course material).

Figure 4.3: Different activation functions commonly used in neural networks.

Mathematically, a single layer of a neural network computes the following non-linear mapping of input features to output features:

$$X^{n+1}_i = \sigma\Big(\sum_j X^n_j w^n_{ij} + b^n_i\Big) \tag{4.2}$$

where n denotes the layer, i denotes which neuron in the current layer, j denotes which neuron in the previous layer, and σ is one of the activation functions from Figure 4.3. By combining multiple neurons together and stacking layers of this architecture, neural networks are able to model complex functions using relatively simple operations.

This can be done in both the supervised and unsupervised environment, depending on the architecture and objective function used. Common feed-forward neural networks will often be supervised, employing a loss function such as least squares or logistic regression to compute some metric of the difference between the predicted labels and the true labels y.
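As an illustration of the layer mapping in Eq. 4.2, the sketch below evaluates two stacked fully connected layers in plain Python. The weights, biases and layer sizes are arbitrary values chosen for the example, and the function name `layer_forward` is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(x_prev, weights, biases, activation=sigmoid):
    """Eq. 4.2: X_i^{n+1} = sigma(sum_j X_j^n * w_ij^n + b_i^n).

    weights[i][j] links neuron j of the previous layer to neuron i of this layer.
    """
    return [
        activation(sum(xj * wij for xj, wij in zip(x_prev, w_row)) + b_i)
        for w_row, b_i in zip(weights, biases)
    ]

# Illustrative values only: 3 inputs -> 4 hidden neurons -> 1 output neuron.
x0 = [1.0, 0.0, -1.0]
w1 = [[0.5, -0.2, 0.1],
      [0.0, 0.3, -0.4],
      [1.0, 1.0, 1.0],
      [-0.5, 0.5, 0.0]]
b1 = [0.0, 0.1, 0.0, 0.2]
hidden = layer_forward(x0, w1, b1)

w2 = [[0.25, 0.25, 0.25, 0.25]]
b2 = [0.0]
output = layer_forward(hidden, w2, b2, activation=math.tanh)
```

Stacking is simply function composition: the output list of one call becomes the input of the next, which is all that "adding a layer" amounts to in this formulation.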
Alternatively, unsupervised neural networks such as autoencoders use a symmetric architecture to encode, and then decode, the input signal, with the goal of recovering the input signal with minimal reconstruction error. Similarly to principal component analysis (PCA), this is essentially computing a new (low-dimensional) basis with which to represent the data.

Figure 4.4: Supervised vs unsupervised neural networks. Top: Feed-forward neural networks use training labels to optimize the hidden layer parameters; bottom: Autoencoders minimize the difference between the training data and the output feature vector to find an optimal compressed representation in the hidden layer.

4.1.3 Optimization

Regardless of the architecture, training of neural networks involves optimizing the numerous parameters (i.e., w, b) to minimize some objective function. Due to the large number of parameters often used for neural networks (particularly for deep architectures) and the large amount of data required to train them, this is typically done using a batch stochastic gradient descent (SGD) algorithm wherein the gradient of the objective function Q(w) for the full training set is approximated by the gradient for a (small) subset of the training data, $m \ll n$. The reasons given for this are typically the large size of the training data, n (resulting in large computational demands for handling the Jacobian), and the online nature of many applications. Many variants of SGD exist which attempt to optimally select the step size, or learning rate, η, using a variety of criteria (Duchi et al., 2011), but the basic equation for the parameter update remains the following:

$$w_{i+1} = w_i - \eta \sum_{j=1}^{m} \nabla Q_j(w_i) \tag{4.3}$$

Though SGD does not solve the optimization problem as accurately as a second-order method such as Gauss-Newton iterations, this can be somewhat justified by the fact that neural networks are non-convex and are known to have many local minima.
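A minimal sketch of the mini-batch update of Eq. 4.3 is given below for a two-parameter least-squares line fit. Since Q is being minimized, each step is taken against the summed batch gradient. The toy data, step size, batch size and function name are assumptions made for illustration and are unrelated to the thesis codes.

```python
import random

def minibatch_sgd(data, w, b, eta=0.05, m=4, epochs=200, seed=0):
    """Fit y ~ w*x + b with Q_j = 0.5 * (w*x_j + b - y_j)^2 using mini-batches."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # a new random partition into batches each epoch
        for start in range(0, len(data), m):
            batch = data[start:start + m]
            # Gradient of the objective summed over the m samples in the batch.
            grad_w = sum((w * x + b - y) * x for x, y in batch)
            grad_b = sum((w * x + b - y) for x, y in batch)
            # Parameter update: step against the gradient, scaled by eta.
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

# Noise-free samples of y = 2x + 1 on [-1, 1] (illustrative data).
points = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w_fit, b_fit = minibatch_sgd(points, w=0.0, b=0.0)
```

Each update touches only m = 4 of the 21 samples, yet on this noise-free problem the iterates still approach the true slope and intercept, because the gradient of every batch vanishes at the same solution.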
By stochastically optimizing the parameters with a weaker step direction, one is in effect regularizing the solution.

$$\frac{\partial Q}{\partial w_i} = \frac{\partial Q}{\partial X_j}\,\frac{\partial X_j}{\partial X_{j-1}} \cdots \frac{\partial X_i}{\partial w_i} \quad \text{for } j > i \tag{4.4}$$

Efficient implementation of SGD is done using the back-propagation algorithm (Rumelhart et al., 1986) (Eq. 4.4 above), wherein the derivatives of the objective function can be computed using the chain rule backwards through each of the layers in the network.

Figure 4.5: Backpropagation of error through a neural network.

4.2 Convolutional Neural Networks

Though first formally developed in the 1990s (LeCun and Bengio, 1995; LeCun et al., 1998), convolutional neural networks have gained a considerable amount of popularity over the last decade, to the point where they are ubiquitous in the fields of image, audio and character recognition. As a subset of feed-forward neural networks, they fall under the banner of “deep learning”, in which a number of layers of nonlinear transformations are stacked to produce a higher order feature representation of the original data.

The distinction which sets convolutional neural networks apart is the use of shared weights over a local receptive field. This has two primary effects: 1) the stationarity of natural images is exploited by the local structure of the filter architecture and 2) the number of parameters in the model is vastly reduced. Though many subtle enhancements and modifications exist, the main components of a convolutional neural network are convolution layers, pooling layers, non-linear transformations, and some classification/objective function to be optimized.

4.2.1 Convolutional Layer

A convolution layer in a CNN computes the inner product between a patch of the input image $X^0_p$ and a kernel, or filter, $K_{ij}$. This filter is slid over the entire image to produce a new image, or feature map. Since the kernel is (typically) much smaller than the patch size, this can be represented by a convolution.
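The sliding inner product can be sketched in a few lines. The example below is illustrative only: it assumes no padding at the image boundary and does not flip the kernel (strictly a cross-correlation, which is what most CNN libraries compute under the name convolution). The image, kernel and function name are made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`, taking the inner product at each position.

    Assumes no padding, so the output shrinks by (kernel size - 1) per axis.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 4x4 image with a vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 2x2 kernel that responds to left-to-right increases in value.
edge_kernel = [[-1, 1],
               [-1, 1]]
feature_map = conv2d_valid(image, edge_kernel)
```

The resulting 3 × 3 feature map is non-zero only in its middle column, where the kernel straddles the edge. The same four kernel weights are reused at every position, which is the weight sharing referred to above.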
If all the inner products are collected into a single operation, this results in a Toeplitz matrix of size [number of output elements, number of input elements] (the exact size depends on how boundary conditions are handled for the filter). Using this large sparse matrix rather than looping renders the entire convolution layer into a single matrix-matrix multiplication of the form $K_{ij} * X_i$, where $K_{ij}$ is the large Toeplitz matrix of convolution operations and $X_i$ is a matrix of vectorized inputs concatenated together. Combined with an activation function typically involving a sigmoid (e.g., tanh) and a bias, $b_{ij}$, this constitutes a “neuron” of the form

$$Z_{ij} = \tanh(K_{ij} * X_i + b_{ij}) \tag{4.5}$$

A single layer of a convolutional neural network will consist of a number of different kernels, $K_j$, being convolved with the same input data $X_i$ to give a number of different output feature maps $[Z^{i+1}_1, Z^{i+1}_2, \ldots, Z^{i+1}_K]$ (see Figure 4.6).

Figure 4.6: Diagram of a typical convolution layer in a convolutional neural network (taken from DeepLearning.net’s tutorial on CNN).

As the network is optimized, the convolution kernels are “learned” so as to provide the best set of feature maps from which to classify the images. The intuition on why this might be a good idea for mineral prospectivity mapping is quite simple. When manually analyzing geoscientific data for prospectivity mapping, it is common to work with derived maps (e.g., the vertical derivative of the residual magnetic field) rather than the raw data, since they facilitate the detection and identification of spatial correlations and patterns. By optimizing the CNN, the user is allowing the data itself to guide the selection of which derived feature maps are most useful for the classification problem rather than having to arbitrarily choose them manually.

4.2.2 Pooling Layer

Pooling layers in convolutional neural networks serve two purposes.
First, they introduce an additional nonlinear transformation to the learning function, in particular by adding a level of local spatial invariance. This is best exemplified by the common “max pooling”, in which case the maximum value from a small patch is selected to represent the entire patch, regardless of its position in that patch. Second, the pooling layers are inherently a down-sampling, resulting in a more compact representation of the feature maps. This condensation of the features allows for fewer parameters in the network, and therefore fewer computations.

Figure 4.7: Diagram of a typical max-pooling operation in a convolutional neural network (taken from Stanford CS231n course material).

4.2.3 Classification/Objective Function

The third component in any CNN is a final layer which takes as input the feature maps derived from all previous layers (both convolution and pooling) and outputs a predicted label (either a value or a probability distribution for the set of label classes). This can be as simple as a set of weights for each element in the final feature maps (regression), as complex as a fully-connected feed-forward neural network (complete with sigmoidal neurons), or any other classification/regression function such as support vector machines.

In order to solve for the numerous parameters in the network, some objective function must be defined which penalizes poor performance during training. In general this is often something of the form

$$\Phi = \|f(\theta; X) - y\|_p \tag{4.6}$$

where y are the true labels, f(·) is some function of the input data X (the images), θ are the parameters of the function (in this case this would include all the convolution kernels and biases of the CNN), and $\|\cdot\|_p$ is some p-norm of the misfit.
The parameters are then iteratively optimized using the back-propagation algorithm (LeCun et al., 1989), which in most cases is some form of (stochastic) gradient descent.

4.2.4 CNN Architecture

Putting this all together in subsequent layers, each of which involves convolution operations and pooling operations, produces architectures of the form shown in Figure 4.8, where the jth channel of the (i+1)th layer of convolution and pooling can be mathematically expressed as

$$X^{i+1}_j = P\big(\tanh(K^{i+1}_j * X^i + b^{i+1}_j)\big), \quad j = 1, \ldots, N^{i+1} \tag{4.7}$$

with $N^{i+1}$ the number of channels, or feature maps, in the (i+1)th layer as dictated by the chosen number of convolution kernels $K^{i+1}_j$, and $b^{i+1}_j$ the bias on the jth channel in the (i+1)th layer. P(·) represents the pooling, or down-sampling, operation. Though it has been shown that a relatively simple two layer neural network is a universal approximator (Hornik et al., 1989), modern networks often employ numerous layers (the latest winners of the ImageNet Large-Scale Visual Recognition Challenge are using upwards of 20 layers in their networks (Szegedy et al., 2015)).

Figure 4.8: Example architecture of a convolutional neural network, stacking (two) layers of convolution and pooling with a final fully connected layer (taken from DeepLearning.net’s tutorial on CNN).

4.3 Synthetic Data Example

As an illustrative example, a synthetic dataset has been created consisting of three channels. The first channel is a smoothly decaying ellipsoid with a randomly chosen origin and radii, the second channel is a smoothly decaying ring, also with randomly chosen origin and radii, and the third channel is a discrete block with random origin and dimensions (see Figure 4.9).

This data is meant to represent a simplified porphyry exploration model where an overlapping high (channel 1) and halo-type structure (channel 2) with the correct coincident geological structure (channel 3) would be suggestive of mineralization.
500 positive training examples were defined as samples where the structures in all three channels overlapped, and 500 negative examples were selected where they do not (see Figure 4.10).

Since the Julia CNN package uses a nonlinear conjugate gradient solver rather than the more common stochastic gradient descent, steps are computed for all of the training data rather than mini-batches. The algorithm has a number of exit criteria corresponding to tolerances on different values. A maximum number of iterations is specified, as well as a maximum number of line search iterations. If the objective function on the training set is not decreased by a sufficient amount, or a sufficiently small value is obtained, convergence is assumed. Additionally, if the derivatives become too small, the optimization will stop since no more progress can be made. Finally, if the validation error (objective function value) increases, early stopping is initiated so as to avoid over-fitting and maintain generalization of the solution. All of these exit criteria are tuneable. For these runs the following values were used:

Max iterations                         200
Max line search iterations             9
Minimum objective function decrease    0.1%
Objective function target value        0.01 ∗ N
Minimum relative derivative size       1e−4

Figure 4.9: A single example from the synthetic dataset. a) First channel is a smoothly decaying ellipsoid with a randomly chosen origin and radii, b) second channel is a smoothly decaying ring, also with randomly chosen origin and radii, and c) third channel is a discrete block with random origin and dimensions.

Figure 4.10: Examples of a) positive and b) negative training data. Notice how all three channels overlap in the positive example, but not in the negative example.
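The exit criteria listed above can be sketched as a single check performed once per iteration. This is a hypothetical illustration, not the interface of the Julia CNN package: the function name, argument names and order of the checks are assumptions, N is assumed to be the number of training samples, and the derivative size is assumed to be measured relative to the initial gradient norm. The cap on line search iterations is omitted because it would live inside the line search itself.

```python
def exit_reason(iteration, train_obj, prev_train_obj, val_obj, prev_val_obj,
                grad_norm, initial_grad_norm, n_samples,
                max_iter=200, min_decrease=1e-3, target_per_sample=0.01,
                min_rel_grad=1e-4):
    """Return the name of the first exit criterion met, or None to continue."""
    if iteration >= max_iter:
        return "max_iterations"
    # Objective function target value: 0.01 * N (N assumed = n_samples).
    if train_obj <= target_per_sample * n_samples:
        return "target_reached"
    # Minimum objective function decrease: 0.1% of the previous value.
    if (prev_train_obj - train_obj) < min_decrease * abs(prev_train_obj):
        return "insufficient_decrease"
    # Minimum relative derivative size (assumed relative to the initial gradient).
    if grad_norm < min_rel_grad * initial_grad_norm:
        return "small_gradient"
    # Early stopping when the validation objective starts to increase.
    if val_obj > prev_val_obj:
        return "early_stopping"
    return None

# Example: training objective barely changed between iterations 39 and 40.
reason = exit_reason(iteration=40, train_obj=100.0, prev_train_obj=100.05,
                     val_obj=50.0, prev_val_obj=55.0,
                     grad_norm=1.0, initial_grad_norm=10.0, n_samples=500)
```

Returning the name of the triggered criterion, rather than a bare boolean, makes it straightforward to report why a given run stopped.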
The network used for the synthetic problem consists of four layers; the first layer has eleven 5 × 5 × 3 convolution kernels, each of which uses a sigmoid nonlinearity (1/(1 + e^{−x})) followed by a 2 × 2 softmax pooling function. The second layer has fifteen 5 × 5 × 11 convolution kernels, each of which again uses a sigmoid nonlinearity followed by a 2 × 2 softmax pooling function, and the third layer has twenty-seven 3 × 3 × 15 convolution kernels, each of which again uses a sigmoid nonlinearity followed by a 2 × 2 softmax pooling function. The final layer is a softmax classifier of the form

$$y^{pred}_j(w_j; X) = \frac{e^{X^\top w_j}}{\sum_{i=1}^{K} e^{X^\top w_i}} \tag{4.8}$$

which gives the probability of each datum being either a positive example or a negative example (for our example it is binary; in general K can be larger than 2). In total this results in 8,702 parameters in the network.

Dataset      Success %   # Correct   False Positive   False Negative
Training     90.4        452/500     18               30
Validating   93.6        468/500     19               13

Table 4.1: Results from classification of synthetic data. Numbers reported are the best achieved using the user-selected CNN architecture and 20 random initializations.

From 20 randomly initialized runs, the best results were 90.4% successful prediction on the training set and 93.6% successful prediction on a validation set (see Table 4.1). The algorithm converges to a solution in 40 iterations, exiting because the training objective function was no longer changing (see Figure 4.11).

If we look at a subset of the incorrectly classified negative validation data (Figure 4.12), we can see they are almost positive examples; the centers of the 3 channels are close, though not overlapping. Similar but opposite conclusions can in some cases be made when looking at the false positive examples (Figure 4.13), giving some intuition and confidence that the CNN algorithm is effectively recognizing the overlapping patterns of the 3 channels.

Figure 4.11: Convergence plots for the best result running the synthetic dataset in the CNN. Blue curve is training data, red curve is validating.

Figure 4.12: False negative classification results from synthetic dataset. Contour plots of channels 1, 2, 3 are plotted as red, blue and green, respectively.

Figure 4.13: False positive classification results from synthetic dataset. Contour plots of channels 1, 2, 3 are plotted as red, blue and green, respectively.

Chapter 5
QUEST Project Field Example

5.1 QUesnelia Exploration STrategy Data

The QUEST (QUesnelia Exploration STrategy) project was a large $5M initiative by GeoscienceBC and the local government to encourage mineral exploration in central British Columbia. The region, spanning approximately 150,000 km², sits within established porphyry belts which extend the length of the province (see Figure 5.2), and is known to host a number of economic copper/gold porphyry deposits (Devine, 2012); however, many of the easier surface targets have already been found.

Figure 5.1: QUEST project area in central British Columbia.

Figure 5.2: Porphyry belts in central British Columbia.

To stimulate further discoveries, an extensive data acquisition program was completed from 2008-2012, including airborne geophysics (Barnett and Kowalczyk, 2008), geochemical sampling (Jackaman and Balfour, 2008), and extensive geological field mapping. The result was a large, rich database of multi-parameter geoscience data in a region of known mineral potential for copper/gold porphyries.
In addition to the newly acquired data, historical data was compiled and re-analyzed, including a database of known mineral occurrences.

This data rich environment has been used as a case study by many geoscientists for various projects, including mineral prospectivity mapping (Barnett and Williams, 2009; Granek and Haber, 2015). Amongst others, the publicly available datasets in the QUEST area include airborne gravity, magnetic and electromagnetic data, inductively coupled plasma mass spectrometry (ICP-MS) analysis of stream and sediment samples (providing compositional information for 35 elements), geological era, period, rock class and rock type, and location and classification of known mineral occurrences.

Figure 5.3: Sample of the available geoscience maps in the QUEST region.

5.2 The Target: Porphyry Copper-Gold Systems

Copper-gold porphyry systems are important exploration targets. They account for over 60% of the global copper production, as well as a substantial percentage of global gold, molybdenum and to a lesser extent silver (John et al., 2010). Though they can be lower grade mineralization, they are often large economic ore bodies with long lived mines.

Geologically, porphyry systems consist of structurally controlled hydrothermally altered rocks which most typically occur at subduction zones on plate boundaries (Sillitoe, 2010). As such, they often occur in clusters, or trends, one of which passes through the study area in British Columbia (see Figure 5.4 for a map of porphyries around the world).

Figure 5.4: Major porphyry systems around the world, taken from John et al. (2010).

Targeting porphyry systems is typically done using a combination of geology, geochemistry and geophysics (Berger et al., 2008).
Once the correct geological assemblage is identified, geochemistry can be instrumental in identifying the correct alteration zones to vector in on the core of the system. Since porphyry systems are structurally controlled, geophysical techniques such as airborne magnetics and gravity are often able to identify intrusions, faults or contacts, and the alteration zone associated with the deposit often creates a halo effect due to enrichment and depletion of magnetic minerals. Electromagnetic surveys are also effective at delineating the zonation of the porphyry system, and induced-polarization surveys can be particularly useful for mapping the disseminated sulfides characteristic of such deposits.

Figure 5.5: Illustrative example of different signatures over a copper-gold porphyry deposit (Alumbrera deposit in Argentina), taken from Hoschke (2010).

Figure 5.6: Diagram of a copper porphyry system, taken from Sillitoe (2010).

5.3 Exploration Goals

The goal in applying mineral prospectivity mapping to the QUEST region is to test the two algorithms (SVM and CNN) and assess their utility as an exploration tool. Though part of the reasoning for turning to machine learning algorithms for such applications is to remove the implicit subjectivity and bias of expert interpretation, it is important to understand that the role of expert domain knowledge in data selection and preparation is of utmost importance. It is also always worthwhile applying common sense when interpreting the results presented from any computer algorithm, and thinking critically about how the results were achieved.

Furthermore, it is important to note that mineral prospectivity mapping (on a regional scale such as performed in this research) does not aim to target ore bodies or drill holes, but rather to highlight regions more likely to bear the target mineralization style.
As such, there is no guarantee of success, but rather a diminished risk of exploration in the regions of interest.

5.4 Data Preparation

Publicly available data layers were acquired from GeoscienceBC, the Natural Resources Canada Geophysical Data Repository, and DataBC’s online catalogue. Once these were all downloaded, they were imported into Quantum GIS (QGIS), an open source geographical information system, to be visually analyzed and prepared.

One of the first steps in preparing the data is deciding which input layers are relevant to the learning problem at hand. A common axiom in machine learning is “garbage in, garbage out”, meaning that despite the desire to allow the data itself to lead the learning process, expert knowledge in the selection of input data is crucial to successful training of the network. For the detection of prospective copper-gold porphyry systems in the QUEST region the geoscientific datasets which were considered of most importance were the following (in no particular order): airborne isostatic residual gravity data, airborne residual total magnetic data, airborne VTEM electromagnetic data, bedrock geology class, bedrock geology age, bedrock geology primary minerals, bedrock geology type, faults, and geochemical pathfinder elements such as Cu, Au, Mo, Ag, As, Sb, Se, U, W, Cd, Ca.

Once all these data layers were loaded into QGIS, some geoscientific processing was performed, such as the following. It is well known that porphyry systems are often structurally controlled, and that such structures can often be imaged using directional filters of the geophysical data; therefore derivative products of the magnetic and gravity data were taken. Additionally, since faults are very important, proximity to a fault system and the degree of fracturing can be useful metrics.
For the categorical geological data such as class (e.g., sedimentary, volcanic, etc.), inputs were split into binary indicator layers so that relationships were not incorrectly inferred from the arbitrary ordering of a discrete representation (i.e., volcanic rocks are not more similar to sedimentary rocks than they are to metamorphic rocks; they are simply in alphabetic order). Finally, for regional geochemical data it is often important to account for variability in the background signal since elemental composition can vary greatly over large areas, and can often be attributed to changes in the bedrock, the surface environment, or other factors not directly related to mineral prospectivity.

Since each dataset was collected independently with its own sampling scheme, all layers were resampled to a base grid of 300 m × 300 m, resulting in over 700,000 sample points. When all data were assembled and properly processed for training, 91 distinct input layers were prepared, including both continuous and discrete values.

In addition to the training data, training labels were acquired from the BC Minfile database. This database contains a record of all documented mineral occurrences in the province, along with relevant information such as the location, the status of the occurrence (e.g., anomaly, showing, prospect or mine) and the mineralization type (e.g., alkalic Cu-Au porphyry). Using this information it was possible to create a set of 155 known alkalic Cu-Au porphyry occurrences to use as positive training labels.

Next the data needed to be sampled for the machine learning algorithm, and labels prepared for training. Since the two methods, support vector machines and convolutional neural networks, operate on different inputs (SVM treats point values, CNN uses images), the preparation was specific to the method.

For support vector machines, the resampled grid of 700,000 points on 91 input layers was easily made into a large data matrix.
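The binary indicator encoding of categorical layers described earlier in this section can be sketched as follows. The rock classes and sample values are invented for illustration and the function name is hypothetical; the actual processing was carried out in a GIS environment.

```python
def indicator_layers(values):
    """Split one categorical layer into one binary (0/1) layer per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Illustrative rock class at five grid points.
rock_class = ["sedimentary", "volcanic", "metamorphic", "volcanic", "sedimentary"]
layers = indicator_layers(rock_class)

# For contrast: integer codes assigned in alphabetical order imply that
# volcanic is "closer" to sedimentary than to metamorphic, which is meaningless.
codes = {cat: k for k, cat in enumerate(sorted(set(rock_class)))}
code_gap_sedimentary = abs(codes["volcanic"] - codes["sedimentary"])
code_gap_metamorphic = abs(codes["volcanic"] - codes["metamorphic"])
```

With indicator layers every pair of distinct classes differs in exactly two layers, so no class is artificially nearer to another, whereas the integer codes give gaps of 1 and 2 for the same pairs.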
To remove bias from different layers, all inputs were normalized and scaled from 0-1. Uncertainty on these values can vary widely depending on the data source. For example, most geophysical data can bear uncertainty in the form of a noise floor plus an acceptable standard deviation, while for geological data it is less obvious due to the subjective, interpreted nature of the measurements. In these cases estimates can still be made based on confidence in the expert and the availability of field measurements. For simplicity during this field test, data uncertainties were not applied.

As previously alluded to, the labels for this application present a suite of practical issues. The lack of confident negative labels (no mineralization) results in an imbalanced learning problem, and the sparse, subjective nature of the positive labels (mineralization) results in a large range in confidence which can be adequately quantified using a framework of uncertainty estimates. For the QUEST dataset, the mineral occurrences were used to generate a set of binary labels on the base grid using a radial basis function spline to interpolate between values. Each occurrence has associated with it a status ranging from ‘Showing’ to ‘Producer’ (six unique statuses are possible), indicating the confidence in the mineral occurrence being economic. Combining this with other factors such as the extent of the overburden (which conceals potentially mineral-bearing bedrock), uncertainty estimates for the labels (see Figure 5.7) were generated ranging from 1 (confident label) to 50 (not confident label).

To feed all of this geoscientific information into a convolutional neural network, the data must be parsed into windows which represent a single patch of the region of interest.
Positive training sample locations were chosen using the mineralization locations, and to balance out the problem with negative examples an equal number of locations were chosen from the QUEST region to represent the non-mineralized training labels. These locations were randomly selected from a subset of locations which were greater than 500 m from any known mineral occurrences, and should therefore represent the average background (non-mineralized) signal of the QUEST region.

Given this set of positive and negative training image locations and a window size of roughly 10 km × 10 km, selected based on the expected footprint of a porphyry system, a number of training images were extracted from the dataset. Additionally, to allow for translational and rotational invariance, each window can be shifted and rotated a number of times so as to not bias the training. In other words, for a given patch, the algorithm should detect if there is a porphyry target somewhere in the window; it should not matter where or in which orientation. Similar methods can be used to achieve scale invariance in the prediction stages of the process: by scaling the data up and down prior to feeding it to the network, the size of an anomaly will bear much less importance than the structure and pattern of the input data.

5.5 SVM Prospectivity Mapping

5.5.1 Problem Setup

To assess the success of the algorithm, the data was split into training, validating, and predicting sets.
Since all of the known mineralization occurs in the north and south of the QUEST region (the central portion is obscured from exploration by a thick overburden), the south was selected for training, the north for validating, and predictions were made for the central region where no mineralization has been discovered to date.

5.5.2 Results

At the request of NEXT Exploration the full result using all data layers is not shown; however, the mineral prospectivity map for the validation set (the northern section of QUEST) is shown (Figure 5.8), indicating which regions are more favorable for copper porphyry mineralization than others. As one can see, the algorithm was able to successfully predict the known prospective regions, as well as illuminate potential new areas of exploration. The addition of uncertainty estimates in the algorithm provides a more robust framework for the incorporation of multi-disciplinary data which possess a large range in data quality.

Figure 5.7: Left: Known mineralization locations in the validation set of the QUEST project area. Right: Uncertainty estimates for the mineralization labels (red is more uncertain) in the validation set.

Figure 5.8: Prediction for prospective mineral regions in validation set of the QUEST area (red is more prospective, blue is less) from SVM.

For a fairer comparison with the results from the CNN experiment in the following section, and as a demonstration on the full QUEST study area, a separate run of the SVM algorithm was completed using only the potential field data (magnetics and gravity). The prospectivity map generated from this experiment (see Figure 5.9) is similar to that using all the data; however, it is far less discriminating since less information is being used. When compared to the maps of the individual data layers (magnetics and gravity), small, simpler runs such as this can be instructive as to how the algorithm is combining the information.

Figure 5.9: Prediction for prospective mineral regions in QUEST area (red is more prospective, blue is less) from SVM using only potential field data.

5.6 CNN Prospectivity Mapping

5.6.1 Problem Setup

As a preliminary trial run on real data, the gravity, magnetics and faults were used as inputs for a convolutional neural network bearing a similar architecture to that used for the synthetic example. The first layer had eleven 5 × 5 × 3 convolution kernels, each of which used a sigmoid nonlinearity followed by a 2 × 2 softmax pooling function. The second layer had fifteen 5 × 5 × 11 convolution kernels, each of which again used a sigmoid nonlinearity followed by a 2 × 2 softmax pooling function, the third layer had twenty-seven 3 × 3 × 15 convolution kernels, again using a sigmoid nonlinearity followed by 2 × 2 softmax pooling, and the fourth and final layer was a softmax classifier. This resulted in a total of 8,702 parameters in the network.

Similar to the SVM example, the data was split into a training and validating set (the north and the south). Since CNNs are nonconvex, and therefore sensitive to initial conditions, 20 runs were attempted with different random initializations, and the best results were reported.

5.6.2 Results

Encouragingly, even with this limited training data set, a relatively simple CNN architecture was able to achieve fairly good results, with the best of 20 randomly initialized runs converging to 80.8% accuracy on the validation set in under 70 iterations (see Figure 5.10 and Table 5.1 for more details). Using the trained classifier to predict the mineral prospectivity for the full QUEST region we were able to create the map seen in Figure 5.11, wherein red indicates the most favourable zones for copper-gold porphyry mineralization.
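The parameter total quoted in Section 5.6.1 can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in the text: each convolution kernel carries one bias, and the softmax classifier has 27 weights per class for two classes with no bias. This is one combination that reproduces the quoted total of 8,702.

```python
def conv_params(n_kernels, height, width, depth):
    """Weights in each kernel plus one bias per kernel (bias is an assumption)."""
    return n_kernels * (height * width * depth + 1)

layer1 = conv_params(11, 5, 5, 3)    # eleven 5 x 5 x 3 kernels
layer2 = conv_params(15, 5, 5, 11)   # fifteen 5 x 5 x 11 kernels
layer3 = conv_params(27, 3, 3, 15)   # twenty-seven 3 x 3 x 15 kernels

# Assumed classifier size: 27 inputs for each of 2 classes, no bias term.
classifier = 27 * 2

total = layer1 + layer2 + layer3 + classifier
```

Under these assumptions the three convolution layers contribute 836, 4,140 and 3,672 parameters and the classifier 54. Note that the depth of each kernel equals the number of kernels in the preceding layer, so adding feature maps early in the network inflates the cost of every later layer.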
Figure 5.10: Convergence plots for the best result running the QUEST dataset in the CNN. Blue curve is training data, orange curve is validating.

Dataset      Success %   # Correct   False Positive   False Negative
Training     74.3        165/222     17               40
Validating   80.8        42/52       8                2

Table 5.1: Results from classification of QUEST data. Numbers reported are the best achieved using the user-selected CNN architecture over 20 randomly initialized runs.

Figure 5.11: Mineral prospectivity map generated from CNN for QUEST region. Colorscale maps the probability of mineralization from 0-1 (blue-red). Known mineral occurrences are plotted as black dots for reference.

Chapter 6
Conclusions

6.1 Major Contributions

The goal of the research presented in this thesis was to explore, develop and apply advanced machine learning algorithms for mineral prospectivity mapping. This research began with an investigation of current methods for mineral prospectivity mapping which highlighted two important aspects of the problem which to date had not been adequately addressed.

1. The geoscience data available for prospectivity mapping is inherently fraught with highly variable levels of uncertainty, both on the training data as well as the training labels.
2. Targeting mineral deposit systems requires a regional approach, focusing on interesting structures/patterns in the data rather than single anomalous values.

To address the first of these difficulties, support vector machines were selected as an algorithmic starting point due to their success and popularity in numerous applications over the last two decades. To incorporate the necessary flexibility to handle uncertainties on both training data and labels the algorithm was modified in two different ways. The first was a simple approach which reweighted the label misfit by a term proportional to the label uncertainty and added an extra term to the objective function which penalized errors in the data using a Gaussian assumption.
This new objective function could then be solved iteratively for the model weights $w$ and the 'true' data $X$:

\[
\Phi(w, X) = \frac{C_1}{2} w^\top w + \mathbf{1}^\top \max\bigl(0,\ \mathbf{1} - \mathrm{diag}(y)(Xw + b)\bigr) + \frac{C_2}{2} (X - X_{\mathrm{obs}})^\top \mathrm{diag}\!\left(\frac{1}{\Sigma}\right) (X - X_{\mathrm{obs}}) \tag{6.1}
\]

Another algorithm was coded beginning from the observation that support vector regression can be rewritten to look very similar to a regularized least squares problem (ridge regression), and that uncertainties in the data could be incorporated by taking a total least squares approach:

\[
\min_{\Sigma, w}\ \bigl\| \mathrm{diag}(\Gamma)\bigl((\Phi + \Sigma) w - y\bigr) \bigr\|_1 + \frac{\beta}{2} \bigl\| \Omega^{1/2} \Sigma \bigr\|_F^2 + \frac{\alpha}{2} \| w \|_2^2 \tag{6.2}
\]

Following some rearranging and linear algebra, the problem can be broken into a mixed L1/L2 recovery problem, again solved iteratively for the model weights $w$ and the data errors $\Sigma$ using the following two elegant optimization problems:

\[
\min_{w}\ \bigl\| \mathrm{diag}(\Gamma)(\hat{\Phi} w - y) \bigr\|_1 + \frac{\alpha}{2} \| w \|_2^2 \tag{6.3}
\]

and

\[
\min_{v}\ \bigl\| \mathrm{diag}(\Gamma)(\gamma v - \hat{y}) \bigr\|_1 + \frac{\beta}{2} \| \gamma v \|_F^2 \tag{6.4}
\]

where $\Gamma$ and $\Omega$ are the uncertainties on the labels and the data, respectively, $\hat{\Phi} = \Phi + \Sigma$, $\hat{y} = y - \Phi w$ and $\gamma = \mathrm{diag}(\Omega (w \odot w))$. The two optimization problems are coupled via the relationship

\[
\Sigma = \Omega^{-1} v w^\top
\]

This algorithm, dubbed L1-RSVR (robust support vector regression), was tested on a synthetic dataset to illustrate the importance and effectiveness of including the uncertainties on both the data and the labels (when available). This dataset contained two separable classes of points sampled from respective Gaussian distributions, which were confounded by inseparable points having large uncertainty on either their data or their label. The results from this toy example clearly highlight the importance of including uncertainty estimates when the data has variable reliability.

To demonstrate the utility of the algorithm for the mineral prospectivity problem, it was used to generate a prospectivity map for copper-gold porphyry systems in the QUEST region in central British Columbia.
The QUEST study contains a large array of publicly available geoscience data, including airborne geophysics, multi-element geochemical assaying, and geological mapping of the bedrock. All told, after pre-processing, the training data consisted of a matrix of over 700,000 (resampled) point measurements on 91 different data layers. The training labels were taken from the BC Minfile database, using 155 alkalic copper-gold porphyries to construct a binary label for each sample location, with an associated uncertainty derived using the information available from the Minfile database in conjunction with the mapped overburden in the region.

To assess the success of the algorithm, the data was split into training and testing sets. In this way it was possible to validate the results on the known mineral occurrences. From these experiments, the support vector machine based algorithms seemed to be performing quite well; however, the problem of incorporating spatial information into the learning problem remained a challenge.

To address this second important aspect of the mineral prospectivity mapping problem, we turned to convolutional neural networks. CNNs have become ubiquitous in the fields of computer vision and image identification and segmentation. As computational resources have continued to expand and develop, the power of neural networks has been unlocked, and they have once more gained favour in both research and industrial communities through the resurgence dubbed 'deep learning'.

Convolutional neural networks owe their success in visual tasks to the observation that natural images contain many localized repeating structures.
This hierarchical structure is leveraged by CNNs using multiple layers of convolution to represent complex signals as combinations of simple patterns.

Due to their popularity, many open-source packages exist to implement CNNs (Torch, Theano, Caffe, TensorFlow); however, for greater control and to better understand the inner workings of the algorithm, we coded our own using the Julia programming language. This provided greater flexibility and allowed simple implementation of different solvers (including the chosen nonlinear conjugate gradient solver) and modifications such as the inclusion of uncertainties on the labels.

To demonstrate the ability of a CNN to detect coincident target signals, a simple synthetic problem was designed involving three input channels. The first channel is a smoothly decaying ellipsoid with a randomly chosen origin and radii, the second is a smoothly decaying ring, also with randomly chosen origin and radii, and the third is a discrete block with random origin and dimensions. These were chosen to represent simplified versions of geoscientific data, where a high coinciding with a halo structure and the correct geological rock type might be indicative of mineralization. When the three structures coincide, the example is given a positive training label; when they do not line up, it is given a negative training label.

Since neural networks are nonlinear and nonconvex, they are sensitive to initial conditions, and therefore 20 runs were completed with different initializations. For the best run, the success rate (on the validation set) was over 90% using a relatively small and simple network architecture. This was deemed convincing evidence that CNNs could perform well on the mineral prospectivity mapping task, and therefore that it was worthwhile attempting the algorithm on the real data from the QUEST region.

To simplify the first run and reduce computational demands, only 3 input channels were used for this: gravity, magnetics and faults.
Again, 20 runs were completed with different initializations, and the best performing run achieved over 80% success on the validation set.

6.2 SVM vs CNN Comparison

Though the two algorithms used in this research were selected for different purposes, and the goal was never to compare them, it is natural to ask which was better. The support vector machines were chosen due to their elegant formulation and the ease with which one can implement uncertainties. They are also effective at generating a sparse classifier via the selection of support vectors. Because of these factors, it is straightforward to implement an SVM code which incorporates uncertainty and is able to handle large datasets without prohibitive run times or computational requirements. An additional advantage of support vector machines is the ease with which results can be interpreted and investigated, since the learnt weights essentially highlight which inputs are more or less important for the classifier. Querying a neural network is not nearly so straightforward, and although it is possible to visualize the intermediate feature maps (Zeiler and Fergus, 2014), it is neither simple to implement nor obvious to interpret.

On the other hand, despite the ability to use the kernel trick to achieve nonlinear transformations of the data, SVMs can be somewhat limited in their ability to represent the data in an optimal feature space, since the kernel must be specified a priori. The strength of convolutional neural networks is their ability to extract high-level features from the data via successive layers of convolutional filters. Since these filters are trained during the learning process, they are dynamic and adapt to the data as required. The trade-off for this is that CNNs are much more complex
models and therefore both more challenging to modify and more computationally demanding.

As was discussed previously, one of the main advantages of convolutional neural networks over support vector machines is the ability to recognize anomalous structure in the data rather than simply anomalous values. This can be of great importance in mineral exploration, as it is often the structures which are of interest for targeting. It is possible to incorporate spatial information into support vector machines in a rudimentary way via neighbours. One can simply augment the training dataset by adding channels for each of the neighbouring sample locations, resulting in a data matrix of size n × md, where n is the number of sample locations, m is the number of neighbouring points to include (the size of the stencil) and d is the number of data channels (i.e. magnetics, gravity, geological age, etc.). The downside to this approach is that the size of the data grows rapidly, and achieving the same spatial sensitivity as the convolutional neural network would require a prohibitively large data matrix (a stencil of 34 × 34 pixels).

Comparing the prospectivity maps from the two methods is not entirely fair, since the support vector machine was trained with the full dataset whereas the convolutional neural network only used 3 (of the available 91) data layers. Despite this, a quick comparison does reveal many similarities, giving greater confidence in both results since each algorithm was run independently.

6.3 Remaining Challenges & Future Work

A natural next step for this research would be to extend the convolutional neural networks to use the entire geoscience dataset.
Since the CNN treats structure rather than individual sample values, some thought must be given to the handling of categorical variables such as rock class so that boundaries between different units are still recognized. The added input layers would likely require a larger architecture (more neurons, and possibly more layers) to adequately represent the added variability in the inputs.

One of the great challenges with using convolutional neural networks is architecture selection. There exists no clear rule or guide for choosing how many layers to use, how many neurons per layer, or which nonlinearity to apply (though certain types, such as the ReLU (Nair and Hinton, 2010), have become very popular). This remains one of the most frustrating parts of working with neural networks. The architecture used in this research was chosen somewhat arbitrarily through trial and error, and it is possible that alternate choices would perform better.

Since the mineral prospectivity mapping problem is imbalanced and only positive labels are available, assumptions were made in constructing the negative labels which may or may not be valid. For the SVM algorithm, a spline with radii proportional to the Minfile status (anomaly, prospect, producing mine, etc.) was used to interpolate between the copper-gold porphyry mineral occurrences in the database. Uncertainties were used to give very little confidence to all non-mineralized points, as well as to data locations known to have thick overburden (since this drastically limits the ability and confidence to prospect and map geology).

For the convolutional neural networks, a different approach was taken to generate the labels, since the data was windowed into 34 × 34 images as inputs. Windows centered on each mineralization location were used as positive labels, and negative labels were randomly selected from a subset of locations which were over 500 m from a known mineral occurrence.
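The negative-label selection just described can be sketched in a few lines. This is only an illustration of the distance rule under stated assumptions, not the code used for the thesis: coordinates are assumed to be planar eastings and northings in metres, the function name `sample_negative_centres` and the toy grid are hypothetical, and a brute-force distance check stands in for whatever spatial search was actually used.

```python
import math
import random


def sample_negative_centres(candidates, occurrences, min_dist_m=500.0,
                            n_samples=100, seed=0):
    """Randomly pick window centres farther than min_dist_m from every known occurrence.

    candidates, occurrences: lists of (easting, northing) tuples in metres.
    """
    # Keep only candidates strictly more than min_dist_m from all occurrences.
    eligible = [
        c for c in candidates
        if all(math.hypot(c[0] - o[0], c[1] - o[1]) > min_dist_m
               for o in occurrences)
    ]
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(eligible, min(n_samples, len(eligible)))


# Toy example: a 5 km x 5 km grid at 250 m spacing with two hypothetical occurrences.
grid = [(float(x), float(y))
        for x in range(0, 5001, 250)
        for y in range(0, 5001, 250)]
known = [(1000.0, 1000.0), (4000.0, 2500.0)]
negatives = sample_negative_centres(grid, known, n_samples=50)
print(len(negatives))
```

The brute-force check costs one distance evaluation per candidate and occurrence pair; for a grid of several hundred thousand samples a spatial index would likely be preferable. Changing `min_dist_m` shows how quickly the pool of eligible negatives shrinks as the exclusion radius grows.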
This distance was selected through trial and error, since larger distances removed the majority of interesting regions (the north and south) from the negative labels, resulting in the majority of the negative training labels being clustered in the central QUEST area, which is covered by overburden. The goal in selecting negative labels in this manner was that the variability in these examples should approximate the background signal of the various data layers; however, it is possible that some of these locations overlap mineralized zones and are therefore actually positive labels. Given the relatively small size of the training set, selection of training labels can have a large impact on the classifier, and therefore both algorithms will likely be sensitive to these choices.

One possibility for combining the strengths of the two algorithms is to use a convolutional neural network with a support vector machine classifier, as in Nagi et al. (2012) or Huang and LeCun (2006). This architecture would combine the powerful feature extraction of the convolutional neural network with the discriminatory power of the support vector machine. It can be shown that doing so is equivalent to constructing an optimal kernel (using the CNN) for use with the SVM. In this framework it would also be easy to incorporate uncertainty on the labels.

Adding uncertainty to the data in the convolutional neural network framework is an open problem which has yet to be examined. Due to the recursive nature of the architecture and the nonlinear transformations, care will be needed to ensure that both small and large values do not blow up.
This remains unexplored and requires extra thought for proper implementation.

Finally, it is worth noting that, as with many applications, greater importance should be placed on the quality of the dataset than on the technique used for analysis. While algorithm selection will no doubt impact the success of mineral prospectivity mapping, the availability, quality, and processing of geoscience data will have a much stronger influence on the results. This includes the scale of the data (how many kilometers are covered; just how regional is the data?), the resolution of the data (how finely is the data sampled; what is the smallest feature to be trusted?), the quality of the data and availability of uncertainty estimates (how precise, and how accurate is the data; is this quantified?), as well as the processing steps used to prepare the data (interpolation, gridding, filtering, etc.). For these reasons two important take-away points are worth remembering:

1. Though a powerful analytical tool, mineral prospectivity mapping is not always applicable; using data-driven methods requires that the data is sufficient to guide the learning process!

2. Even in scenarios when mineral prospectivity mapping applies, expert knowledge is crucial in acquiring and processing the data prior to training, and in critically examining results after.

Bibliography

Abedi, M., Norouzi, G.-H., and Bahroudi, A. (2012). Support vector machine for multi-classification of mineral prospectivity areas. Computers & Geosciences, 46:272–283.

Agterberg, F., Bonham-Carter, G., and Wright, D. (1990). Statistical pattern integration for mineral exploration. Computer Applications in Resource Estimation Prediction and Assessment for Metals and Petroleum, pages 1–21.

Alvarez, J. M., Gevers, T., LeCun, Y., and Lopez, A. M. (2012). Road scene segmentation from a single image. Lecture Notes in Computer Science (including subseries
Lecture Notes in Computer Science (including subseriesLecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7578LNCS(PART 7):376–389.Barnett, C. and Williams, P. (2006). Mineral Exploration Using Modern Data MiningTechniques. First Break, 24(7):295–310.Barnett, C. T. and Kowalczyk, P. L. (2008). Airborne Electromagnetics and Air-borne Gravity in the QUEST Project Area , Williams Lake to Mackenzie , BritishColumbia. Technical report, Geoscience BC.Barnett, C. T. and Williams, P. M. (2009). Using Geochemistry and Neural Networksto Map Geology under Glacial Cover Using Geochemistry and Neural Networks tomap Geology under Glacial Cover. Technical report.82BibliographyBastrakova, I. and Fyfe, S. (2012). Building blocks for data management in GeoscienceAustralia : towards a managed environment.Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. (2012). Advances in Opti-mizing Recurrent Networks.Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. Large-ScaleKernel Machines, (1):1–41.Berger, B. R., Ayuso, R. a., Wynn, J. C., and Seal, R. R. (2008). Preliminary modelof porphyry copper deposits. Open-File Report - U. S. Geological Survey, page 55.Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2014). Julia: A FreshApproach to Numerical Computing. pages 1–37.Bonham-Carter, G., Agterberg, F., and Wright, D. (1989). Integration of geologicaldatasets for gold exploration in Nova Scotia. Short Courses in Geology 10, pages15–23.Boser, B., Guyon, I., and Vapnik, V. (1992). A Training Algorithm for Optimal Mar-gin Classifers. Proceedings of the fifth annual workshop on Computational LearningTheory, pages 144–152.Brown, W. M., Gedeon, T. D., Groves, D. I., and Barnes, R. G. (2000). Artificialneural networks: A new method for mineral prospectivity mapping. AustralianJournal of Earth Sciences, 47(4):757–770.Burges, C. (1998). 
A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 43:1–43.

Carrizosa, E. (2007). Support Vector Regression for imprecise data. Technical report, Dept. MOSI, Vrije University, Brussel, Belgium.

Chang, C. and Lin, C. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and . . . , pages 1–39.

Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19(5):1155–78.

Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, W. (2011). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

Chawla, N., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6.

Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. . . . Workshop on Frontiers in . . . .

Cortes, C. and Vapnik, V. (1995). Support Vector Networks. Machine Learning, 297:273–297.

Cun, Y. L., Denker, J. S., and Solla, S. A. (1990). Optimal Brain Damage. Advances in Neural Information Processing Systems, 2(1):598–605.

Devine, F. (2012). Porphyry Integration Project: Bringing Together Geoscience and Exploration Datasets for British Columbia's Porphyry Districts. Technical report, Geoscience BC.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159.

Ford, K., Keating, P., and Thomas, M. (2004). Overview of geophysical signatures associated with Canadian ore deposits. Geophysics.

Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 9:249–256.

Golub, G. and Loan, C. V. (1980). An Analysis of the Total Least Squares Problem. SIAM Journal on Numerical Analysis, 17(6):883–893.

Golub, G. H., Hansen, P. C., and O'Leary, D. P. (1999). Tikhonov Regularization and Total Least Squares. SIAM Journal on Matrix Analysis and Applications, 21(1):185–194.

Granek, J. and Haber, E. (2015). Data mining for real mining: A robust algorithm for prospectivity mapping with uncertainties. In Proceedings of the 2015 SIAM International Conference on Data Mining.

Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., and Schmidhuber, J. (2015). LSTM: A Search Space Odyssey. arXiv, page 10.

Guo, H. and Viktor, H. (2004). Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explorations Newsletter, 6(1):30–39.

Harris, D. and Pan, G. (1999). Mineral favorability mapping: a comparison of artificial neural networks, logistic regression, and discriminant analysis. Natural Resources Research, 8(2):93–109.

Hinton, G. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–54.

Hochreiter, S. (1998). The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 06:107–116.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1–32.

Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3):1171–1220.

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Hoschke, T. (2010).
Geophysical signatures of copper-gold porphyry and epithermal gold deposits, and implications for exploration.

Huang, F. J. and LeCun, Y. (2006). Large-scale learning with SVM and convolutional nets for generic object categorization. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1:284–291.

Huang, G., Song, S., Wu, C., and You, K. (2012). Robust Support Vector Regression for Uncertain Input and Output Data. 23(11):1690–1700.

Jackaman, W. and Balfour, J. (2008). QUEST Project Geochemistry: Field Surveys and Data Reanalysis, Central British Columbia (parts of NTS 093A, B, G, H, J, K, N, O). Technical report, Geoscience BC.

John, D. A., Ayuso, R., Barton, M., Bodnar, R., Dilles, J., Gray, F., Graybeal, F., Mars, J., McPhee, D., Seal, R., Taylor, R. D., and Vikre, P. (2010). Porphyry Copper Deposit Model Scientific Investigations Report 2010 5070 B. USGS Scientific Investigations Report 2010-5070-B, page 169.

Kohonen, T. (1990). The Self-Organizing Map. Proceedings of the IEEE.

Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. arXiv preprint, pages 1–7.

Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pages 1097–1105.

Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179–186. Morgan Kaufmann.

Lai, M. (2015). Deep Learning for Medical Image Segmentation. arXiv preprint arXiv:1505.02000.

LeCun, Y. and Bengio, Y. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361:255–258.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989).
Backpropagation Applied to Handwritten Zip Code Recognition.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323.

Long, J., Shelhamer, E., and Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133.

Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry, volume 165.

Nagi, J., Caro, G. A. D., Giusti, A., Nagi, F., and Gambardella, L. M. (2012). Convolutional neural support vector machines: Hybrid visual pattern classifiers for multi-robot systems. Proceedings - 2012 11th International Conference on Machine Learning and Applications, ICMLA 2012, 1:27–32.

Nair, V. and Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. Proceedings of the 27th International Conference on Machine Learning, (3):807–814.

Osborne, M. R. and Watson, G. A. (1985). An Analysis of the Total Approximation Problem in Separable Norms, and an Algorithm for the Total $l_1$ Problem. SIAM Journal on Scientific and Statistical Computing, 6(2):410–424.

Pant, R., Trafalis, T., and Barker, K. (2011). Support vector machine classification of uncertain and imbalanced data using robust optimization. In Proceedings of the 15th WSEAS International Conference on Computers, pages 369–374, Corfu Island, Greece. World Scientific and Engineering Academy and Society (WSEAS).

Partington, G. and Sale, M. (2004). Prospectivity mapping using GIS with publicly available earth science data - a new targeting tool being successfully used for exploration in New Zealand. Pacrim 2004 Congress Volume, Adelaide, (September):19–22.

Porwal, A., Carranza, E., and Hale, M. (2003a).
Artificial neural networks for mineral-potential mapping: a case study from Aravalli Province, Western India. Natural Resources Research, 12(3).

Porwal, A., Carranza, E., and Hale, M. (2003b). Knowledge-driven and data-driven fuzzy models for predictive mineral potential mapping. Natural Resources Research, 12(1).

Porwal, A., Yu, L., and Gessner, K. (2010). SVM-based base-metal prospectivity modeling of the Aravalli Orogen, northwestern India. EGU General Assembly . . . , 12:15171.

Raines, G., Sawatzky, D., and Bonham-Carter, G. (2010). Incorporating expert knowledge: New fuzzy logic tools in ArcGIS 10. ArcUser, pages 8–13.

Raskutti, B. and Kowalczyk, A. (2004). Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter, 6(1).

Rodriguez-Galiano, V., Sanchez-Castillo, M., Chica-Olmo, M., and Chica-Rivas, M. (2015). Machine learning predictive models for mineral prospectivity: An evaluation of Neural Networks, Random Forest, Regression Trees and Support Vector Machines. Ore Geology Reviews.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in . . . . Psychological Review, 65(6):386–408.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014). ImageNet Large Scale Visual Recognition Challenge. arXiv, page 37.

Schmidhuber, J. (2015). Deep Learning in neural networks: An overview. Neural Networks, 61:85–117.

Schölkopf, B., Burges, C., and Smola, A. (1999). Advances in kernel methods: support vector learning. The MIT Press.

Shalev-Shwartz, S. and Singer, Y. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming.

Sillitoe, R. H. (2010). Porphyry copper systems. Economic Geology, 105(1):3–41.

Simard, P.
Y., Steinkraus, D., and Platt, J. C. (2003). Best practices for convolutional neural networks applied to visual document analysis. Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, pages 958–963.

Sinclair, W. (2007). Porphyry Deposits. In Mineral Deposits of Canada: A Synthesis of Major Deposit-Types, number 5, pages 223–243.

Singer, D. and Kouda, R. (1997). Use of a neural network to integrate geoscience information in the classification of mineral deposits and occurrences. Proceedings of Exploration, pages 127–134.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going Deeper with Convolutions. CVPR.

Tang, Y., Zhang, Y.-Q., Chawla, N. V., and Krasser, S. (2009). SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics. Part B, Cybernetics: a publication of the IEEE Systems, Man, and Cybernetics Society, 39(1):281–8.

Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, XVI(2):264–280.

Vapnik, V. and Lerner, A. (1963). Pattern Recognition using Generalized Portrait Method. Automation and Remote Control, 24:774–780.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory, volume 8.

Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks / a publication of the IEEE Neural Networks Council, 10(5):988–99.

Zeiler, M. D. and Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. arXiv:1311.2901v3 [cs.CV] 28 Nov 2013. Computer Vision - ECCV 2014, 8689:818–833.

Zhang, J. (2004). Support vector classification with input data uncertainty. Advances in Neural Information Processing Systems.

Zuo, R. and Carranza, E. J. M. (2011). Support vector machine: A tool for mapping mineral prospectivity. Computers & Geosciences, 37(12):1967–1975.