Open Collections

UBC Theses and Dissertations

Mapping urban trees with deep learning and street-level imagery Lumnitz, Stefanie 2019

MAPPING URBAN TREES WITH DEEP LEARNING AND STREET-LEVEL IMAGERY

by

Stefanie Lumnitz

BSc. Geography, Ludwig-Maximilian’s University Munich, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Forestry)

The University of British Columbia
(Vancouver)

December 2019

© Stefanie Lumnitz, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

MAPPING URBAN TREES WITH DEEP LEARNING AND STREET-LEVEL IMAGERY

submitted by Stefanie Lumnitz in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Forestry.

Examining Committee:

Dr. Verena Griess, Forest Resource Management
Supervisor

Dr. Nicholas Coops, Forest Resource Management
Supervisory Committee Member

Dr. Tahia Devisscher, Forest Resource Management
Supervisory Committee Member

Dr. Helge Rhodin, Computer Science
Additional Examiner

Abstract

Planning and managing urban trees and forests for livable cities remains an outstanding challenge worldwide owing to scarce information on their spatial distribution, structure and composition. Sources of tree inventory remain limited due to a lack of detailed and consistent inventory assessments. In practice, most municipalities still perform labor-intensive field surveys to collect and update tree inventories.

This thesis examines the potential of deep learning to automatically assess urban tree location and species distribution from street-level photographs.
A robust and affordable method for detecting, locating, classifying and ultimately, creating detailed tree inventories in any urban region where sufficient street-level imagery is readily available was developed.

The developed method is novel in that a Mask Regional Convolutional Neural Network is used to detect and locate tree instances from street-level imagery, creating shape masks around unique fuzzy urban objects like trees. The novelty of this method is enhanced by using monocular depth estimation and triangulation to estimate precise tree location, relying only on photographs and images taken from the street. In combination with Google Street View, a technique for the rapid development of an extensive tree genera training dataset was presented based on the method of tree detection and location. This tree genera dataset was used to train a Convolutional Neural Network (CNN) for tree genera classification.

Experiments across four cities show that the novel method for tree detection and location can be transferable to different image sources and urban ecosystems. Over 70% of trees recorded in a ground-truth campaign (2019) were detected and could be located with a mean error in the absolute position ranging from 4 m to 6 m, comparable to GPS accuracy used for geolocation in classical manual urban tree inventory campaigns. The trained CNN classifies 41 fine-grained tree genera classes with 83% accuracy. The detection and classification models were then used to generate maps of urban tree genera distribution in the Metro Vancouver region.

Results of this research show that developed methods can be applied across different regions and cities and that deep learning and street-level imagery show promise to inform smart urban forest management, including bio-surveillance campaign planning.

Lay Summary

Urban trees play a vital role in making our cities more livable, sustainable and resilient to climate change.
In order to manage and maximize the benefits urban trees provide for cities and their inhabitants, city officials need to know the location of trees and how different species are distributed throughout urban environments. This thesis explored a novel approach to collect information on the species and distribution of urban trees from photographs taken from the streets. Using new technologies like deep learning, a tool was developed that detected over 70% of all trees growing on streets and classified 41 tree species for different cities in the Metro Vancouver region. In addition, it was examined whether the developed tool can be transferred to other urban areas. In future, the developed methods, tools and data can inform urban tree inventories to assist planning decisions and management schedules of city planners and urban forest practitioners.

Preface

This research was proposed, designed and carried out in its entirety by myself, with support of my Masters supervisory committee. I identified and designed this original research, with suggestions and guidance from the committee. I performed all phases of methodological development, field data collection, data analysis, interpretation of results and manuscript preparation as primary researcher. I sought advice and support of my committee and co-authors, who provided clarification and modifications. This research project was undertaken as part of the BioSAFE project and in partnership with the Canadian Food Inspection Agency, the Canadian Urban Environmental Health Research Consortium and the University of California, Riverside. A list of publications and presentations of thesis content as part of the BioSAFE project can be found in appendix A.

Chapters 2 and 3 are independent research chapters that have been structured and written as scientific articles. A list of publications for each research chapter is presented as follows:

Chapter 2:
• Lumnitz, S., Devisscher, T., Mayaud, J., Coops, N. and Griess, V.
(in prep): Mapping urban trees with deep learning and street-level imagery.

Chapter 3:
• Lumnitz, S., Devisscher, T., Coops, N. and Griess, V. (in prep): Mapping urban tree diversity with deep learning and street-level imagery.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1 Introduction
  1.1 Healthy green cities of the future
    1.1.1 Smart urban forest management
    1.1.2 Urban trees as vectors for tree pests and pathogens
    1.1.3 The need for tree inventory data
    1.1.4 Collecting urban tree data
  1.2 Computer vision for urban tree inventories
    1.2.1 Street-level imagery
    1.2.2 Theoretical background of deep learning
  1.3 Research questions and research design
    1.3.1 Research questions
    1.3.2 Research design
  1.4 Thesis structure
2 Mapping urban trees with deep learning and street-level imagery
  2.1 Introduction
    2.1.1 Urban tree assessment
    2.1.2 Remote sensing for individual tree mapping
    2.1.3 Trends in automatic tree inventory assessments
    2.1.4 Chapter objectives
  2.2 Data and methods
    2.2.1 Study site
    2.2.2 Ground-truth measurements
    2.2.3 Street-level imagery
    2.2.4 Tree instance segmentation
    2.2.5 Geolocation of trees
    2.2.6 Model evaluation
  2.3 Experiments
    2.3.1 Instance segmentation
    2.3.2 Localization
  2.4 Conclusion
3 A machine learning tool for mapping urban tree diversity
  3.1 Introduction
    3.1.1 The importance of urban tree diversity
    3.1.2 Bio-surveillance in the Metro Vancouver region
    3.1.3 Training data for tree genus classification
    3.1.4 Chapter objectives
  3.2 Data and methods
    3.2.1 Case study site
    3.2.2 Full mapping workflow
    3.2.3 Tree detection
    3.2.4 Multi-stage strategy for building tree genera dataset
    3.2.5 Tree genus classification
  3.3 Experiments
    3.3.1 Classification performance
    3.3.2 Hotspot maps of Metro Vancouver
  3.4 Discussion
    3.4.1 Classification model performance
    3.4.2 Transferability to other areas
  3.5 Conclusion
4 Conclusions
  4.1 Key findings
    4.1.1 Tree detection with Mask R-CNN
    4.1.2 Tree geolocation with monocular depth estimation
    4.1.3 Tree genus classification
  4.2 Implications
    4.2.1 Deep learning for bio-surveillance planning
    4.2.2 A new baseline for risk assessment
    4.2.3 Smart urban forest management
    4.2.4 A novel method for environmental research
  4.3 Limitations
    4.3.1 Tree visibility on street-level imagery
    4.3.2 Availability of street-level imagery
    4.3.3 Limited tree genera training data
  4.4 Future research directions
    4.4.1 Assessing different data sources
    4.4.2 Crowd-sourcing and street-level imagery collection
    4.4.3 Methodological adaption for bio-surveillance
    4.4.4 Green smart cities of the future
Bibliography
A Additional publications and presentations
B Theoretical background on developing Mask R-CNN
  B.1 Training, development and evaluation data generation
    B.1.1 COCO Stuff dataset
    B.1.2 Street-level panoramas and annotations
  B.2 Mask R-CNN for tree detection
    B.2.1 Mask R-CNN framework
  B.3 Evaluation strategy
    B.3.1 Architecture evaluation for tree detection
    B.3.2 Evaluation metrics for tree detection
  B.4 Training strategy
    B.4.1 Feature extraction and fine-tuning
    B.4.2 Training Mask R-CNN
C Comparing existing and generated tree genera distribution information
D Tree genera detection

List of Tables

Table 2.1 Street-level imagery and mask annotations for the fine-tuning procedure
Table 2.2 Evaluation metrics for training with and without COCO Stuff on combined Vancouver and Surrey dataset
Table 2.3 Evaluation metrics for Vancouver, Surrey and Pasadena
Table 2.4 Absolute geolocation accuracy (in meters)
Table 3.1 Classification accuracy for development and test sets
Table 3.2 Selected occurrences of tree genera in Metro Vancouver
Table D.1 Tree genera detections in Metro Vancouver

List of Figures

Figure 1.1 Deep learning, machine learning and artificial intelligence [26]
Figure 1.2 Concept of deep neural networks [51]
Figure 1.3 Convolutional neural network architecture, inference and training
Figure 1.4 The four tasks of computer vision [83]
Figure 1.5 AiTree: Open source software for urban tree mapping
Figure 1.6 Methodology for the development of trained deep neural network models for automatic tree detection, geolocation and genus classification
Figure 1.7 Google Street View data (Source: Google Maps, 2018; Google Street View, 2018)
Figure 2.1 Urban tree mapping workflow
Figure 2.2 Location of street-level imagery datasets and ground-truth measurements
Figure 2.3 Correction of predicted tree locations through triangulation
Figure 2.4 Precision-recall curves for development (Surrey, Vancouver) and test (Coquitlam, Pasadena) datasets
Figure 2.5 Effect of mask size on model performance
Figure 2.6 The most common inference errors of tree instance segmentation with Mask R-CNN (in percent)
Figure 2.7 Examples of masking errors in instance segmentation with Mask R-CNN
Figure 2.8 Location prediction results of trees
Figure 2.9 Absolute geolocation accuracy for street trees, private trees, and all trees in the Vancouver area
Figure 2.10 Influence of camera position at time of image capture on tree location prediction
Figure 3.1 Tree genus classification workflow
Figure 3.2 Training data generation for tree genus classification
Figure 3.3 Examples of tree genera dataset and data augmentation
Figure 3.4 Precision for different genus classes in the test dataset
Figure 3.5 Confusion matrix for tree genera classification
Figure 3.6 Distribution of sizes of generated tree cutouts with examples
Figure 3.7 Tree genera distributions in Metro Vancouver
Figure 3.8 Images per class in the tree genus classification test dataset
Figure B.1 COCO Stuff image and segmentation mask examples [2, 18]
Figure B.2 Manual image annotation with Labelbox (Imagery source: Google Street View 2018)
Figure B.3 Google Street View panorama and tree annotations (Imagery source: Google Street View 2018)
Figure B.4 Reshaped mask annotations
Figure B.5 The Mask R-CNN framework for instance segmentation [57]
Figure B.6 Over-fitting Mask R-CNN
Figure C.1 Visual comparison of existing and generated tree inventory records for Prunus
Glossary

AGM Asian Gypsy Moth
AI Artificial Intelligence
ALB Asian Longhorned Beetle
API Application Programming Interface
APP Application
BC British Columbia
CFIA Canadian Food Inspection Agency
CNN Convolutional Neural Network
COCO Common Objects in Context
CONV LAYER Convolutional Layer
DED Dutch Elm Disease
DL Deep Learning
DNN Deep Neural Network
EAB Emerald Ash Borer
FASTER R-CNN Faster Region-based Convolutional Neural Network
FIS Forest Invasive Species
FPN Feature Pyramid Network
FC LAYER Fully-connected Layer
FOV Field of View
FP1 Single-precision Floating Points
FP16 16-bit Floating Points
GPS Global Positioning System
GPU Graphics Processing Unit
GVI Green View Index
GSV Google Street View
IOU Intersection over Union
ISPRS International Society for Photogrammetry and Remote Sensing
ITCD Individual Tree Crown Delineation or Detection
KDE Kernel Density Estimates
LIDAR Light Detection and Ranging
MASK R-CNN Mask Region-based Convolutional Neural Network
ML Machine Learning
PPM Pine Processionary Moth
RESNET101 Residual Network with 101 Layers
RESNET50 Residual Network with 50 Layers
RGB Red Green Blue
RMSLE Root Mean Squared Log Error
ROI Region of Interest
SLAM Simultaneous Localisation and Mapping
SOD Sudden Oak Death
SFM Structure from Motion
NN Neural Network
RS Remote Sensing
VHR Very High Resolution

Acknowledgments

This research was funded as part of the bioSAFE project and Genomics Canada, BC and Quebec. The Canadian Food Inspection Agency provided insights and knowledge in current bio-surveillance methods and gaps. The Canadian Urban Environmental Health Research Consortium provided access to a High Performance Compute Cluster. Many thanks to the organizations that offered financial assistance including Google and the Python Software Foundation for a Google Summer of Code Fellowship, the Mitacs Globalink Fellowship, the International Fellowships, and the Mary and David Macaree Fellowship.

I thank my supervisor Dr.
Verena Griess for your mentorship and unwavering support in pursuing this research and other projects that came along during this Masters. Thank you Dr. Nicholas Coops for sharing your expertise and always inviting me to join the IRSS lab. Dr. Tahia Devisscher, I am grateful for all the inspiring discussions, ideas and projects we shared. I thank the members of the Fresh lab for your company, encouragement and valuable feedback in all aspects of life at UBC and in Canada. Thank you Valentine Lafond and Kathleen Coopland for your help in the field and lab.

I thank the Center for Geospatial Sciences team at the University of California, Riverside and all members of the Python Spatial Analysis core-development team for your friendship and encouragement. Our conversations and support always brought new perspectives, ideas and hacks into my day-to-day life and work. Lastly, thank you Sam Anderson, Jerome Mayaud, Sophie Nitoslawski, Christina Draeger and Ralf Gommers for your friendship, consolidation, guidance and never tiring support. You made my work days lighter and I feel extremely fortunate to have benefited from meeting and spending time with you.

Dedication

This work is dedicated to my parents and brother - Martina, Ramon and Fritz Lumnitz - for their love and support, reaching over continents.

Chapter 1
Introduction

The great green city of the future is ecologically and economically resilient; it’s made up of healthy, livable neighborhoods where the benefits of nature are available to all people.
Pascal Mittermaier, The Nature Conservancy’s Global Managing Director for Cities (2019)

1.1 Healthy green cities of the future

By 2050, three out of four people on Earth will live in cities [139]. As urbanization continues in the epoch of the Anthropocene [30], cities embody the forefront of action against global change impacts, but also become vulnerable to their detrimental effects [35, 55].
It is increasingly recognized that urban trees play a critical role in mitigating negative effects of global change for people and the planet [38, 53]. Numerous studies have shown that urban trees are key in making cities more livable and resilient, and help them adapt to the impacts of future climate change [64]. By providing shade, a natural way of air-cooling, and by absorbing CO2 through growth, trees help mitigate climate change and save energy by reducing the need for air conditioning [137]. They clean the air and environment by capturing particulates and urban pollutants through natural gas-exchange with the atmosphere [137]. Urban trees reduce storm water runoff as they intercept rainfall and increase infiltration [60]. The presence of healthy trees in urban areas is known to have beneficial effects on human health and well being, promoting mental health, reducing stress, preventing obesity and accelerating recovery from illnesses [138, 140]. Urban forests have the potential to foster biodiversity by providing shelter and food for animals and plants [7]. It has been shown that urban forests have direct social impacts such as increasing property values, positively impacting social cohesion and strengthening communities [37, 101, 145].

As evidence gathers about the diverse benefits healthy urban trees provide to resilience and livability in cities through various ecosystem services, the demand for effective urban forest management and planning grows [106, 140]. Expanding and maintaining a healthy urban forest is a recognized challenge to date [38]. Poorly managed urban forests do not provide the same ecosystem services and can even lead to property damage, personal injury or other disservices [97, 114]. Threats to urban tree health and challenges associated with their mitigation and management are diverse [38].
Street trees, for example, often suffer from water stress caused by decreasing water availability to root systems from de-icing salts, barriers to root growth, poor soil quality or presence of toxic substances [12]. Above ground stressors include heat radiation from buildings and impervious surfaces, high winds channelized through urban canyons, cutting of tree crowns and growth inhibiting light patterns, especially for trees planted on the north side of buildings in the northern hemisphere [87]. Owing to urban trees' proximity to centers of human activity and international trade routes, they are further threatened by damage through native and invasive pests and pathogens [110]. Many of these threats to urban tree health are expected to intensify with the effects of climate change, such as rising temperatures in cities [91].

Proactive management and decision making is required to protect, improve and extend urban forests with a direct influence on over 60% of the world population in the future [38, 53]. One of the biggest limitations for proactive urban tree management is the scarcity of up-to-date urban tree inventory data, used as a basis for planning and decision-making [42]. To date, the most common practice to retrieve such important information is the manual collection and measurement of single urban trees with hand held devices (see section 1.1.4). At present, existing tree inventory data are mostly restricted to public street trees or other trees on public land and there is a lack of information on a large proportion of the urban forest, especially trees on residential property.
Cost-efficient and widely applicable tools are needed to provide high-resolution spatial information to enhance urban tree management and support decision making [144].

1.1.1 Smart urban forest management

The field of urban forestry is growing rapidly alongside novel, technically oriented urban sciences, like ecological engineering and smart city planning [11, 38]. "Smart urban forest management" describes the integration of urban forest management into emerging smart city planning concepts, applying novel technological developments like Artificial Intelligence (AI), open source mapping platforms or mobile Application (APP) driven citizen engagement to address the diverse challenges urban trees face [105]. In the context of proactive, resilient and smart urban forest management, the purpose of this work is to explore the suitability of novel Deep Learning (DL) architectures and openly available street-level imagery to develop a tool that meets the need for up-to-date urban tree inventory data (see chapter 1).

The approach proposed in this thesis is based on recent advances in "instance segmentation" for fuzzy objects and monocular depth estimation to locate features detected on photographs in space (see chapter 2). Ultimately, the aim is to produce a robust, cost-effective and rapid method for creating detailed tree location and diversity data in any urban region where sufficient street-level imagery is readily available. To demonstrate the value of the developed model for smart urban forest management, the created tree diversity data is used to inform urban bio-surveillance management in the Metro Vancouver region (see chapter 3).

1.1.2 Urban trees as vectors for tree pests and pathogens

One of the major challenges enhanced by climate change for urban tree management is the introduction and spread of native and invasive pests and pathogens in urban areas [32].
Canada’s urban forests, for example, are increasingly threatened by Forest Invasive Species (FIS) such as the Emerald Ash Borer (EAB), Asian Longhorned Beetle (ALB), the Asian Gypsy Moth (AGM), Dutch Elm Disease (DED) or Sudden Oak Death (SOD) [109]. These FIS can cause irreversible damage to both natural and urban ecosystems and are associated with high management costs after establishment. In 2003, ALB infestations in Toronto, for example, led to the replacement of 28,700 urban trees, after the infestation was detected in campaigns led by the Canadian Food Inspection Agency (CFIA), which is responsible for invasive species management in Canada. Within this ALB management effort, the replacement of one tree was subsidized by CA$300 in private areas, CA$150 in public areas and CA$40 in urban woodlands, generating a total cost of > CA$6 million for eradication of ALB in Toronto alone (correspondence M. Marcotte, October 2019, CFIA). Furthermore, Canada-wide, FIS are estimated to cost CA$800 million annually in management efforts and further generate a threat to export markets estimated at up to CA$2.2 billion annually.

Increasingly, new FIS are expected to enter Canada, with urban forests acting as key nodes in their dispersal pathways [110]. Yemshanov et al. [147] identified Metro Vancouver and the Greater Toronto area as the two major points of entry for invasive pests and pathogens through international trade and transportation networks. Urban trees located close to centers of human activity and international trade routes such as ports, commercial zones or tree nurseries within urban environments are under constant risk of exposure to native and invasive pests and pathogens [110]. Trees that are unhealthy are under considerable threat of additional damage caused by tree pests and pathogens, as defense mechanisms in unhealthy trees are weakened [86].
Once an invasive pest or pathogen is established in urban areas, urban trees can act as vectors for spread of these harmful diseases into surrounding ecosystems [38].

In addition to maintaining a healthy forest, urban tree managers face additional pressure to detect these pests and pathogens early, to contain and prevent the spread and establishment throughout the urban ecosystem into surrounding areas [110]. The earlier FIS populations are noticed at the initial stages of infestation and invasion, the higher the cost-efficiency and probability of management success [86]. Consequently, invasion managers are faced with the task of maximising bio-surveillance and early detection efforts to support rapid decision making, and minimize potential FIS management costs [88]. This is significant because 1) established FIS in urban spaces can have negative impacts on ecosystem services urban forests provide, and 2) urban trees can act as sentinels for early detection and rapid response before establishment, preventing FIS spreading into natural ecosystems [21].

1.1.3 The need for tree inventory data

Unfortunately, urban forest planning and management remains an outstanding challenge worldwide owing to relatively scarce information on the spatial distribution and accessibility of urban forests, as well as their health condition, composition, structure and function. Urban forest assessments are the basis for all decision-making in managing and mitigating threats to tree health [60]. Tree inventories are an important aspect of urban forest assessments, and usually involve the collection of field data on the location, genus, species, crown shape and volume, diameter, height and health condition of urban trees [68]. Urban tree inventories predominantly focus on information on individual urban trees, less so on groups of trees as for example found in urban parks [102].
At present, existing tree inventory data are mostly restricted to public street trees or other trees on public land and exclude large areas of urban forest, especially trees on residential properties. For example, in Vancouver, about 37 percent of urban forest is located on private land, and not included in the city inventory [62].

For the purpose of bio-surveillance management, extensive inventories are needed to manage and contain the spread and economic cost of harmful FIS through early detection [8, 109]. Information about potential host-tree distribution over the urban space can help to identify hotspots of establishment for FIS [116]. Similarly, tree health can be used as an indicator to detect trees already harmed by FIS. Knowing where unhealthy trees are located can also be valuable to predict where FIS are most likely to spread, since natural defense mechanisms of unhealthy trees are already weakened [115]. Furthermore, detailed tree inventories can be used to quantify the monetary value of environmental and aesthetic benefits of single trees, including ecosystem services [126]. Weighing the cost of managing and protecting urban trees against the benefits or services they provide is often used as a basis for decision making or economic risk assessments [60]. Such information can be used as an economic baseline to inform decision makers which FIS mitigation strategies are economically valuable and what is at risk for different stakeholders in case urban trees are attacked [16, 88]. For the purpose of bio-surveillance it is therefore beneficial to provide up-to-date urban tree inventories, especially including information about location, tree health, tree genus and general tree structure. Furthermore, it is important to collect data over both public and private trees, since pests and insects spread over public and private urban property.
Additionally, detailed tree inventory information about tree location or genus provides a backbone for many other urban forest assessments, such as the prediction of urban tree health under changing climate conditions [47].

Methods used to collect data and the extent of these inventories are often governed by the direct application of the inventory and the municipality’s budgetary constraints [102]. Depending on these factors, cities need to decide whether they are collecting information for every single urban tree or for a subset of trees, and whether the inventory will be updated over time or information is only collected once, resulting in a fragmented patchwork of multiple data collections merged over time [68]. Keller and Konijnendijk [68] point out the need for inventory methods that can be scaled over multiple regions in order to develop national and international recommendations and standards for urban tree inventory data collection. More abundant and standardized urban tree inventory data, specifically including private trees, would open the possibility to plan and manage public green spaces more holistically, integrating both public and private green spaces into one urban forest. Aronson et al. [9] stress the need for this more holistic management approach in order to protect and manage urban trees and biodiversity sustainably in future. Hence, cost-efficient and widely applicable tools are needed to provide high-resolution spatial information to enhance early detection and support decision making [144].

1.1.4 Collecting urban tree data

Nielsen et al. [102] distinguish four main types of generating and updating previously identified information in urban tree inventories: satellite-supported methods, airplane-supported methods, on-the-ground scanning or digital photography, and field surveys. Satellite-supported methods have primarily been used for single-tree crown detection [67] or tree health assessments [117].
Data can be retrieved by multiple sensors ranging from Red Green Blue (RGB) colour space, over multi-spectral and hyper-spectral, to panchromatic. Most satellite-based imagery used for urban tree inventories is of Very High Resolution (VHR) in order to detect the comparably small objects of trees [102]. Similarly, airplane-supported methods have been used for tree detection, tree health assessments and more recently tree species classification [6]. Multi-spectral, hyper-spectral, and Light Detection and Ranging (LIDAR) sensors provide the advantage of more quickly generating relatively high resolution data over bigger areas than field surveys [102]. However, in contrast to satellite-based imagery, airplane-supported sensors have to be flown explicitly for the purpose of tree inventory data collection. On-the-ground scanning or digital photography can in general be used to retrieve more detail and volume of single tree inventory parameters than aerial methods. [113], for example, developed a semi-automated method to calculate tree crown volume and density from side-view photographs. Nevertheless, classical field surveys with direct manual measurement and visual tree inspection are the most commonly used method to generate and update urban tree inventories [102].

All of the methods above are often limited in either geographical space, temporal coverage or the number of parameters covered by the urban tree inventory [144]. The need to perform labour-intensive field surveys or costly aerial campaigns often limits the detail and frequency of urban tree inventory updates [6]. Most data sources furthermore lack processing methods that can be generalized or automated over multiple cities. Often, expert knowledge is required to handle large file sizes or semi-automate classifications [144]. As a result, some municipalities are not able to collect tree inventory data at all, due to a constrained urban tree inventory budget [102].
[68] point out the need for methods scaling over various regions in order to develop national and international recommendations and standards for urban tree inventory data collection. Geospatial technologies and datasets that are more cost efficient or free could be used to support a larger number of municipalities and allow for urban tree inventory standardization [142]. A data source that has recently attracted a lot of attention from the urban forest research community, due to its low cost and global coverage, is street-level imagery in general and Google Street View (GSV) in particular [13].

1.2 Computer vision for urban tree inventories

Two recent trends have gained attention in smart city planning because they allow remote data collection, can be applied over large areas at low cost, and promote uptake from a larger number of municipalities [132]. The first is the growing availability of low-cost, detailed and increasingly crowd-sourced street-level imagery (photographs of street scenes taken from the ground) [13, 77]. The second is the success of DL and Convolutional Neural Networks (CNNs) in out-competing other methods for extracting abstract features and objects in imagery [89].

1.2.1 Street-level imagery

Abundance of street-level imagery

In this thesis, street-level imagery is defined as photographs taken on the street, captured by RGB sensors mounted on different types of vehicles (e.g. cars, bikes) or hand-held devices (e.g. mobile phones, cameras). Easy access to affordable RGB sensors and cameras by companies and the public, in combination with the willingness to share imagery on the internet, has led to wide availability of street-level imagery data [125]. To date, there is an increasing abundance of platforms and services providing street-level images. Imagery can, for example, either be accessed through the relevant Application Programming Interfaces (APIs) of e.g. GSV and Bing Maps Streetside, or it can be crowdsourced and collected through services like Mapillary and OpenStreetCam.
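As a hedged illustration of API-based access, the sketch below only builds a request URL for a single street-level image at a given position and compass heading. It assumes the public Street View Static API endpoint and its `size`, `location`, `heading`, `fov`, `pitch` and `key` query parameters; the coordinates, the helper name `build_streetview_url` and the placeholder key are hypothetical, and this is not the acquisition pipeline used in the thesis.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Street View Static API; a valid API key is required
# to actually download an image from it.
STREETVIEW_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview"


def build_streetview_url(lat, lon, heading, api_key,
                         size="640x640", fov=90, pitch=0):
    """Return a request URL for one image at (lat, lon) facing `heading` degrees."""
    params = {
        "size": size,                        # image size in pixels
        "location": f"{lat:.6f},{lon:.6f}",  # latitude,longitude pair
        "heading": heading,                  # compass bearing of the camera
        "fov": fov,                          # horizontal field of view in degrees
        "pitch": pitch,                      # camera tilt relative to the horizon
        "key": api_key,                      # placeholder, not a real key
    }
    return f"{STREETVIEW_ENDPOINT}?{urlencode(params)}"


# Hypothetical coordinates, used only to show the call.
url = build_streetview_url(49.2634, -123.1456, heading=90, api_key="YOUR_API_KEY")
print(url)
```

Keeping URL construction separate from the download step makes it easy to log or cache requests before any imagery is fetched.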
Currently, efforts to standardize street-level imagery are limited, resulting in data of varying quality and quantity depending on the imagery provider.

GSV is a geospatial platform that offers standardized and geocoded street-level imagery in different formats and resolutions at relatively low cost [52]. GSV provides extensive spatial coverage of North America and other countries in the world. Google [52] gives a detailed and up-to-date description of which places are covered by GSV, when they will be recorded next and when they were previously recorded. GSV street-level imagery is typically collected through a panoramic camera mounted on a car roof. Panoramic recordings are single snapshots in time covering a 360-degree field of view, spaced 15 meters apart on public roads, which means that one tree can be seen in multiple images [144]. GSV updates street-level images of public roads every 1-4 years. GSV data can be accessed online via an official API that allows querying of the closest street-level image according to a given geographic position or a geographic latitude and longitude data pair [144].

Street-level imagery for smart urban forest assessments

GSV imagery has already found its way into smart urban forestry assessments in recent literature. Berland and Lange [13] manually inspected single street-level images to generate tree inventory data through "virtual tree inventory surveys". They found that these "virtual surveys" of street trees conducted with GSV agreed with field data for over 93% of documented trees and discovered that it was possible to assess genus, species, location, diameter at breast height and tree health. Rousselet et al. [118] tested if GSV data could be used to identify trees under attack by the Pine Processionary Moth (PPM), through manual visual inspection of imagery by bio-surveillance professionals.
A comparison of field data retrieved by a large-scale analysis based on a mesh of 16 km grid size with a GSV-based approach recorded 96% of matching positive findings. Nevertheless, these studies, and many others, are still not automated and are limited by expensive manual labour [36].

Advances in Computer Vision (s. section 1.2.2), the field of study that assesses possibilities to automate tasks of the human vision system with a computer [83], and DL are enabling automatic and robust information extraction for street-level imagery in urban environments [57]. Computer vision algorithms developed for smart city research using street-level imagery have been applied to assess demographics [45], urban change [99], wealth [48], perceived urban safety [100], building types [66] and urban morphology [94]. In the field of smart urban forestry, street-level imagery in combination with computer vision has been applied in three key areas: 1) estimation of shade provision for urban trees [78, 80, 81], 2) quantification of perceived urban canopy cover [19, 36, 79, 124, 132], and 3) mapping the location of urban trees [15, 144].

Li et al. [81], for example, calculated a sky view factor from GSV, which indicates the level of enclosure of street canyons, in order to quantify shade provisioning from trees. Seiferling et al. [124] used GSV imagery in combination with Machine Learning (ML) techniques to quantify perceived urban canopy cover. Similarly, Li et al. [79] assessed the percentage of vegetation in streets by quantifying the amount of green pixels seen in a Street View scene. These methodologies helped to shape the so-called Green View Index (GVI) [36]. This index indicates how green streets are perceived by pedestrians and has been synthesized for multiple cities all over the world [36]. Wegner et al.
[144] designed a workflow for automatic street tree detection and geolocation from GSV and Google Maps imagery, based on the Faster Region-based Convolutional Neural Network (FASTER R-CNN) framework.

Research gaps

Most workflows that generate urban tree inventory data from GSV and computer vision techniques are currently limited in one of three ways: they quantify and classify single pixels without distinguishing between separate trees, they detect tree position without being able to quantify tree characteristics, or they rely on secondary data sources for precise location predictions of trees [36]. This thesis proposes a workflow for the detection, classification, and geolocation of separate trees, ready to use as part of tree inventories, and introduces a method to streamline geolocation that relies only on street-level imagery.

1.2.2 Theoretical background of deep learning

DL has proven to be a powerful tool for extracting abstract features and objects from raw imagery, and is increasingly adopted in ecology [27], environmental research and the Remote Sensing (RS) community [154]. In the following section, relevant DL concepts for computer vision and their technical background will be presented. In order to define DL, the umbrella concepts of AI and ML will be introduced, under which DL can be placed as a sub-discipline (s. section 1.2.2). Next, CNNs and their technical concepts, challenges, and limitations of use will be discussed (s. section 1.2.2). Finally, the section will conclude by explaining the application of modern computer vision to solve problems in environmental monitoring and earth observation (s. section 1.2.2).

Artificial Intelligence and Machine Learning

AI is the study of automating intellectual tasks normally performed by humans [119]. Figure 1.1 provides a conceptual overview of how DL can be placed in the field of AI. ML is a subfield of AI and represents a paradigm of algorithm development in which computers are able to learn without being explicitly programmed [121].
In comparison to classical programming, where specific rule sets are hard-coded to process data to best match an output, ML systems automatically generate these rule sets through exposure to input-output data pairs [136]. The process of generating and refining these rule sets, by automatically comparing the system's current output to its expected output, is commonly described as training or learning [26].

Figure 1.1: Deep learning, machine learning and artificial intelligence. Deep learning is a concept used in machine learning, which is in turn a subfield of artificial intelligence [26].

The field of ML can be further separated into three broad categories: Unsupervised Learning, Reinforcement Learning and Supervised Learning. Unsupervised Learning systems transform data without the previously described use of specific targets, answers or outputs in the training process [56]. The two best known Unsupervised Learning tasks are Clustering and Dimensionality Reduction. These tasks are used to compress, denoise or visualize data and are often a necessary step to analyze a dataset before using it in a supervised-learning problem [26]. Reinforcement Learning conceptually places an agent at the core of a system; the agent learns to take actions that maximize its reward by receiving information about its environment [95]. Reinforcement learning has recently made a major breakthrough, for example, with Google DeepMind's AlphaGo, a system using reinforcement learning to master and succeed in the game of Go [128].

Supervised Learning, however, is currently the most commonly used type of ML and the most dominant form of DL. At its core, Supervised Learning is the process of meaningfully transforming input data into new data representations, which is learned by exposing the system to known input-output data pairs.
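A minimal sketch of this idea, assuming a toy one-dimensional problem that is not taken from the thesis: the rule y = 3x + 0.5 is never hard-coded into the model, but recovered from noisy input-output pairs by repeatedly comparing the current output with the expected output (here via the mean squared error) and nudging two parameters accordingly. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known input-output pairs (the "training data"); the true rule is hidden in y.
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.05, size=200)

w, b = 0.0, 0.0          # parameters start without any knowledge of the rule
learning_rate = 0.1
losses = []

for step in range(500):
    prediction = w * x + b               # current output of the system
    error = prediction - y               # comparison with the expected output
    losses.append(float(np.mean(error ** 2)))
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Move both parameters a small step against their gradients.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final loss={losses[-1]:.4f}")
```

The same loop structure (predict, compare, update) carries over to deep networks, where the two parameters become millions and the gradients are computed automatically.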
In other words, Supervised Learning systems automatically learn new data representations by mapping input data onto a set of known output targets, so-called annotations or labels [26]. The trained system can then be used to generate predictions, i.e. its own annotations and labels, on new input data. This process is called inference [75].

Deep learning

As a sub-field of Supervised Learning, DL represents the notion of learning multiple, hierarchical layers of representations in between input data and output target [75]. Shallow learning, in comparison, only transforms input data into one or at most two successive representation layers. Representations in DL gain in complexity with each successive deeper layer, whereby complex representations are built out of simpler representations [51]. DL is most commonly applied in a Neural Network (NN), also called a Deep Neural Network (DNN). These terms are often used interchangeably [26]. Figure 1.2 shows how a DNN conceptualizes the image of a person by layers of successively more complex representations.

Figure 1.2: Concept of deep neural networks. The image of a person is presented in a concept of successive layers of representations. Representations get more complex the "deeper" the layers of the NN. These deep layers build on top of shallower layers containing simpler representations, like colors, vertical or horizontal lines [51].

DL in NNs can be thought of as a multistage information-distillation process [26]. A NN "learns" by creating a sequence of data transformations to map a given input to the target output (s. fig 1.2). Each step transforming the data is implemented in successive layers and parameterized by so-called weights or parameters [26]. In the process of learning, these parameters are automatically updated, depending on a distance or loss score between the prediction and the target, calculated with the so-called loss function (s. fig 1.3 (b)). The goal in DL is to minimize the loss score, to achieve a close match between prediction and target. To minimize the loss score, parameters are adjusted by an optimizer, a mechanism that uses the gradients computed by the back-propagation algorithm to update the parameters, typically with a variant of stochastic gradient descent [26]. In other words, training a DL model means that data is transformed into new data representations or features by exposing the model to a set of input variables and output targets, so-called training data, to automatically update parameters and minimize the loss function. Typically, the "deeper" the NN, the larger the amount of training data needed in order to learn meaningful representations.

Convolutional Neural Networks

Even though DL has a long history, the field has only recently achieved a major breakthrough, reaching near-human-level and even superhuman performance in image classification [28]. The increasing depth of DNN frameworks, the scale of computational power available for training and the amount of openly available training data have accelerated scientific discovery and development within the field, and the popularity of methods transferable to other areas of research, since the early 2010s [26]. Most important, though, was the invention of the so-called convolution operation used by CNNs [51].

CNNs are the most common algorithmic architecture to implement DL for analyzing imagery (s. fig 1.3). They are explicitly designed to process large, multi-dimensional tensors such as volumetric data or images, with typical dimensions of (number of pixels in width) x (number of pixels in height) x 3 (red, green and blue) for true color images [51]. An ordinary NN, relying mainly on fully-connected layers, where a pixel would represent one neuron (i.e. unit within a DL model), would need to learn a vast amount of parameters even for relatively small images, e.g.
an image of size 64x64x3 would already require 12,288 weights per neuron in a single shallow layer. In contrast to regular NNs, CNNs leverage the concept of parameter sharing [74]. They learn the parameters of convolutional filters to directly extract meaningful features from images, reducing the amount of parameters that need to be learned and therefore enhancing scalability of data input and processing speed [26]. The three basic building blocks of CNNs are the Fully-connected Layer (FC LAYER), the Convolutional Layer (CONV LAYER) and the pooling layer (s. fig 1.3).

FC LAYERs, often referred to as "densely connected layers", are the standard layers representing the original idea of a NN, where all input elements, so-called "neurons", are connected to all output elements, another array of neurons (s. fig 1.3). Parameters in FC LAYERs are one-dimensional arrays and can be very large, e.g. of length 12,288 for a small image. In CNN architectures, FC LAYERs are typically applied at the "top" of the network for classification, to transform the input into a final desired number of outputs, e.g. a list of labels or a one-dimensional array.

Figure 1.3: Convolutional neural network architecture, inference and training. A CNN consists of a sequence of CONV LAYERs and pooling layers for feature learning, typically followed by FC LAYERs for classification of learned features. Inference denotes the process of running an input image through all layers in sequence to generate a prediction Y′. Training a CNN requires optimizing the weights stored in filters by calculating a loss score between predictions Y′ and target Y.

CONV LAYERs are at the core of a CNN and use convolutional filters to implement the concept of parameter sharing. In contrast to FC LAYERs, parameters in CONV LAYERs are directly represented by convolutional filters.
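The difference in parameter counts can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, bias terms are ignored, and the choice of 100 output neurons is an arbitrary assumption to show how quickly a fully-connected layer grows.

```python
def fc_weights(n_inputs, n_units):
    """Weights of a fully-connected layer: every input connects to every unit."""
    return n_inputs * n_units


def conv_weights(filter_h, filter_w, in_depth, n_filters):
    """Weights of a convolutional layer: one small filter per output channel,
    shared across all image positions (independent of image width and height)."""
    return filter_h * filter_w * in_depth * n_filters


n_inputs = 64 * 64 * 3                 # flattened 64x64 RGB image
print(n_inputs)                        # 12288 input values
print(fc_weights(n_inputs, 1))         # 12288 weights for a single neuron
print(fc_weights(n_inputs, 100))       # 1228800 weights for 100 neurons
print(conv_weights(3, 3, 1, 2))        # 18 weights: two 3x3 filters, one channel
print(conv_weights(3, 3, 3, 2))        # 54 weights: two 3x3 filters over RGB
```

The convolutional counts do not change if the image grows to 640x640 pixels, whereas the fully-connected counts grow a hundredfold.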
In figure 1.3, for example, a layer of two convolutional filters of size 3x3 has a total of 3x3x2 = 18 weights per input channel that need to be learned for the same small input image with the dimensions of 64x64x3 (3x3x3x2 = 54 weights when each filter spans all three colour channels, not counting bias terms). Typically, each filter is a tensor of relatively small size compared to the input image. In other words, the fundamental difference between FC LAYERs and CONV LAYERs is that FC LAYERs learn global patterns and CONV LAYERs learn local patterns in the input feature space, found through small filters. In addition to reducing the amount of parameters that need to be learned, CONV LAYERs allow the model to learn patterns that are translation invariant, i.e. if the model can recognize horizontal edges in the upper left corner of the image, it can recognize the same pattern everywhere, reducing the amount of training images that are needed. An FC LAYER would need to learn the same pattern again at every new location. Furthermore, CONV LAYERs allow models to learn spatial hierarchies of patterns, starting with less complex patterns like edges and increasingly learning more complex and abstract representations (s. fig 1.2). A convolution works by sliding the filter over the previous 3D feature map and transforming each extracted 3D patch (of shape width_filter x height_filter x depth_input) into a 1D vector (of shape (depth_output,)). This process quantifies the presence of each filter's pattern at different locations in the image. The vectors from all locations are reassembled into a new feature volume of size (width_output x height_output x depth_output), where depth_output equals the number of filters (s. fig 1.3).

Lastly, pooling layers are applied to reduce the spatial extent of feature maps [51]. These layers do not contain learnable parameters, but are used to reduce the computational cost of the model by reducing the image resolution while preserving the depth and allowing for more filters to be applied. Max pooling is the most common pooling strategy: a window is swept over the feature map and only the maximum value within each window is kept (s. fig 1.3).
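The two operations can be sketched in plain NumPy. This is a deliberately slow, illustrative implementation under stated assumptions (no padding, stride 1, no bias, random filter values, and the function names `conv2d_valid` and `max_pool` are hypothetical); as in most DL frameworks, the filter is not flipped, so the operation is strictly a cross-correlation.

```python
import numpy as np


def conv2d_valid(feature_map, filters):
    """Slide filters over a feature map without padding, stride 1.

    feature_map: array of shape (height, width, depth_in)
    filters:     array of shape (fh, fw, depth_in, depth_out)
    returns:     array of shape (height - fh + 1, width - fw + 1, depth_out)
    """
    h, w, depth_in = feature_map.shape
    fh, fw, f_depth, depth_out = filters.shape
    assert f_depth == depth_in, "filter depth must match input depth"
    out = np.zeros((h - fh + 1, w - fw + 1, depth_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i:i + fh, j:j + fw, :]      # 3D patch
            # Each patch becomes a 1D vector with one value per filter.
            out[i, j, :] = np.tensordot(patch, filters,
                                        axes=([0, 1, 2], [0, 1, 2]))
    return out


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window per channel."""
    h, w, depth = feature_map.shape
    h2, w2 = h // size, w // size
    trimmed = feature_map[:h2 * size, :w2 * size, :]
    return trimmed.reshape(h2, size, w2, size, depth).max(axis=(1, 3))


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))              # stand-in for a 64x64 RGB image
filters = rng.normal(size=(3, 3, 3, 2))      # two 3x3 filters spanning RGB

features = conv2d_valid(image, filters)
pooled = max_pool(features)
print(features.shape)   # (62, 62, 2)
print(pooled.shape)     # (31, 31, 2)
```

Note that the output depth equals the number of filters, while pooling halves width and height and leaves the depth unchanged.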
For a complete overview of pooling and CNNs see Goodfellow et al. [51].

The invention of CNNs has already revolutionized research in the fields of robotics and self-driving cars [54] and medical imagery analysis [85]. The application of DL in urban and ecological assessments in general, and on street-level imagery for updating urban forest inventories and monitoring urban trees in particular, is still sparse (s. section 1.2.1). Owing to a fast growing open source community, the availability of pre-trained CNNs and extensive, openly available annotated datasets on street-level imagery, CNNs show great potential to be successfully applied to problems with small-scale data availability and fine-grained classification problems [51].

Classification, detection, segmentation and instance segmentation

The four main problems in computer vision to solve with CNNs are (s. fig 1.4): (1) thematic image classification (e.g. classifying an image as tree vs non-tree) [122], (2) multiple object detection (e.g. retrieving bounding box pixel-locations of all trees in the image) [51], (3) pixel-wise semantic segmentation (e.g. classifying every single pixel into the class tree or the class non-tree) [18], and (4) pixel- and object-wise instance segmentation (e.g. retrieving every pixel that belongs to the class tree, differentiated per single tree instance) [26]. Zhang et al. [152] provide a detailed overview and a technical tutorial for RS data analysis using DL algorithms, and Mountrakis et al. [98] highlight state-of-the-art examples for traditional and novel RS applications enhanced by DL.

These four problems have been addressed with CNN algorithms in different areas of RS research in the past. Xing et al. [146] used thematic image classification (problem 1) to classify the land cover seen on geo-tagged photos. In combination with the photos' distance to pixels in the GlobeLand30-2010 land cover map, the photos' classification results were used to validate the map product's accuracy.
By combining geo-tagged GSV imagery with DL, Kang et al. [66] were able to perform thematic image classification (problem 1) of individual buildings and map them in space. DNNs were also used to extract the position of multiple smaller objects and features (problem 2) from aerial and satellite imagery, including vehicles [23], aircraft [20], oil tanks [151] and sport fields [24]. Wegner et al. [144] designed a workflow for automatic detection and geolocation of street trees, by combining FASTER R-CNN bounding box detection (problem 2) scores from GSV and aerial imagery with information retrieved from Google Maps in a probabilistic model. The authors then used street-level and aerial imagery to classify 18 different species among the detected trees (problem 1). Branson et al. [15] subsequently built upon the methods for object detection in [144] by including a Siamese CNN to verify whether detected trees had changed visibly over time. Pixel-wise semantic segmentation (problem 3) with DNNs has been dominating the International Society for Photogrammetry and Remote Sensing (ISPRS) semantic segmentation challenge, and is increasingly used for a variety of land cover classification projects [65]. Despite the proven suitability of DL for semantic segmentation, research applying CNNs to quantify urban greenery from street-level imagery is sparse. Cai et al. [19] recently tested tree canopy segmentation (problem 3) with different CNNs and estimated the GVI with a custom architecture based on a Residual Network with 50 Layers (RESNET50).

Figure 1.4: The four tasks of computer vision. This thesis is based on instance segmentation (d). [83]

The fourth problem (instance segmentation, i.e. pixel-wise detection of separate objects of the same class) is challenging, and DNN frameworks have only recently shown great potential for this task [57].
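One way to see how the four tasks relate is to start from the richest output, a set of per-instance masks, and derive the other three from it. The toy example below is a sketch under stated assumptions (two hand-drawn rectangular "trees" in a 6x8 pixel image, a hypothetical helper `mask_to_bbox`); it is not the output format of any particular framework.

```python
import numpy as np


def mask_to_bbox(mask):
    """Bounding box (row_min, col_min, row_max, col_max) of a boolean mask."""
    rows = np.where(np.any(mask, axis=1))[0]
    cols = np.where(np.any(mask, axis=0))[0]
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])


# (4) Instance segmentation: one boolean mask per tree in a toy 6x8 image.
instance_masks = np.zeros((2, 6, 8), dtype=bool)
instance_masks[0, 1:4, 1:3] = True     # first tree
instance_masks[1, 2:6, 5:8] = True     # second tree

# (2) Object detection: one bounding box per instance.
boxes = [mask_to_bbox(m) for m in instance_masks]

# (3) Semantic segmentation: every pixel is "tree" or "non-tree",
#     without knowing which tree a pixel belongs to.
semantic_mask = instance_masks.any(axis=0)

# (1) Image classification: a single label for the whole image.
contains_tree = bool(semantic_mask.any())

print(boxes)                      # [(1, 1, 3, 2), (2, 5, 5, 7)]
print(int(semantic_mask.sum()))   # 18 tree pixels
print(contains_tree)              # True
```

The reverse direction is not possible without extra information, which is why instance segmentation is the most demanding of the four tasks.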
Detecting and masking each distinct object in an image for "Stuff classes" (fuzzy object classes without clearly delineated shapes, like the sky, trees or other vegetation) was brought to the attention of the DL community via the 2016 Common Objects in Context (COCO) Stuff Challenge [18]. Models such as Mask Region-based Convolutional Neural Network (MASK R-CNN) have since had great success in performing instance segmentation on Stuff classes, which opens the door for assessing and mapping the rich data on fuzzy objects contained in side-view imagery. Combining street-level imagery and DL techniques to analyse urban features and objects is a promising avenue in urban research.

1.3 Research questions and research design

Limited awareness of the potential of cost-efficient street-level imagery and DNNs to support environmental management and policy making constrains the utilization of these data and technologies. Direct and automated implementations of CNN architectures on street-level imagery, as a tool for high-resolution data generation for decision support in smart urban forestry management, remain rare.

The main objective of this thesis is to develop a cost-efficient and automated approach to generate fine-scale tree inventory data in urban areas. The research aims to assess the potential and advantages of an open source, DL-based approach for decision support in smart urban forest management in general and in FIS management in particular. Therefore, the thesis explores the potential to use readily available street-level imagery in combination with new and emerging open source tools. A case study demonstrates how an automated data generation approach could assist bio-surveillance procedures in Metro Vancouver, Canada.

1.3.1 Research questions

Specifically, the project addressed the following questions:

1. How can deep learning algorithms assist in improving tree detection in the urban landscape?
2.
How can monocular depth estimation be combined with tree detection for urban tree geolocation from single street-level images?
3. How can urban trees be classified using emerging deep learning techniques and what are potential applications for smart urban forest management?

1.3.2 Research design

This research project explores the potential to extract valuable information about urban forests from street-level imagery using DL techniques. An approach to automatically generate fine-scale urban tree inventory data is investigated and the potential for implementation of the generated information in decision support for bio-surveillance efforts is outlined. Novel CNN architectures are implemented in three different modules that constitute an open source software package (s. fig 1.5) implemented in a workflow for urban tree location and genera mapping. One module (Detection) is for the task of tree instance segmentation on street-level imagery, a second module (Geolocation) is for the task of location prediction with monocular depth estimation for detected trees (s. chapter 2), and the third module (Classification) is for tree genus classification based on street-level imagery (s. chapter 3). Inference results generated through the three modules are combined to map urban tree genera distributions in the Metro Vancouver region. The final workflow is then evaluated for implementation in decision support for bio-surveillance (s. chapter 3).

A more detailed description of how street-level imagery and benchmark datasets (s. section B.1.1) were acquired and processed (s. section B.1.2) for training and evaluation purposes can be found in appendix B. Details about the base architecture MASK R-CNN of the proposed workflow (s. section B.2.1), evaluation strategies to assess the suitability and performance of the chosen architecture (s. section B.3), and the training strategy used for tree detection (s.
section B.4) can also be found in the appendix.

The three research questions outlined in section 1.3.1 were tested with the methodological workflow depicted in figure 1.6. The proposed experimental workflow proceeds in four stages. Stage one (s. fig 1.6, orange) is characterized by data acquisition and pre-processing. Developing a reproducible Python-based pipeline for automated GSV image acquisition allowed for a user-friendly download of open source imagery from any desired area for any desired view. Figure 1.7 shows a typical GSV scene (right) used in the analysis. Acquired imagery was further combined with tree location and genus information from the existing street tree inventories in Vancouver in order to build tree genera classification datasets. All datasets were separated for use as training data, development data and test data (explained in stage three).

Figure 1.5: AiTree: Open source software for urban tree mapping. The full software will be published under github/slumnitz/aiTree and contains three modules. One module for the task of tree instance segmentation on street-level imagery (1), a second module for the task of location prediction with monocular depth estimation for detected trees (2) and the third one for tree genus classification on street-level imagery (3).

Figure 1.6: Methodology for the development of trained deep neural network models for automatic tree detection, geolocation and genus classification. Stage one (orange) is characterized by data acquisition. Stage two (green) includes the design and training of tree detection, geolocation and classification models with the MASK R-CNN, monodepth and RESNET50 architectures. In stage three (blue), the accuracy of the trained neural network models is assessed through development and test datasets. In a last step, the software and workflow are tested for inference on Metro Vancouver imagery and decision support for bio-surveillance efforts (yellow).

Figure 1.7: Google Street View data. Current tree inventory data and GSV camera positions depicted on Google Maps imagery from 8th West Avenue, Vancouver (left). Street trees and trees on private property seen in GSV imagery (right). (Source: Google Maps, 2018; Google Street View, 2018)

Stage two (s. fig 1.6, green) included the design, training and deployment of tree detection and classification models built on top of three different CNN architectures (s. fig 1.5).
The final scientific software was implemented in Python and will be made available as open source software on github. The software was primarily based on instance segmentation done with the MASK R-CNN architecture. MASK R-CNN is currently the most promising openly available framework for the core instance segmentation model of this tree inventory generation workflow. At present, MASK R-CNN is the best performing framework able to carry out both object classification and pixel-level instance segmentation [57]. MASK R-CNN allows for the automatic generation of bounding boxes, shape masks and classification scores with only one CNN architecture (s. fig 1.5, Detection). For more detail on the MASK R-CNN architecture and training see appendix B.2.1. Furthermore, the design of the proposed workflow can easily be adapted by replacing MASK R-CNN, depending on the accuracy of results or improvements in published state-of-the-art algorithms. The trained MASK R-CNN model was utilized to classify tree instances in GSV and Mapillary images from a human's perspective (s. section 2.2.4). Generated bounding boxes and tree shapes were then used for extracting tree images of interest that were fed into the subsequent geolocation (s. section 2.2.5) and classification modules (s. section 3.2.5). Stage two, in which detection, classification and geolocation models were primarily designed and trained, was continuously informed by stage three, in which model performance was assessed. Model training in stage two was then repeated until model performance was optimized.

In stage three (s. fig 1.6, blue), model performance was analyzed and optimized. Pre-processed data were split into training, development, and test datasets. During training, a split dataset was required to test for both bias and over-fitting. Model accuracy was assessed using both development and test datasets.
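A three-way split of this kind can be sketched with the standard library alone. The fractions (80/10/10), the fixed seed, the helper name `split_dataset` and the image identifiers below are assumptions made for illustration; the thesis does not state its split here.

```python
import random


def split_dataset(items, dev_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle items reproducibly and split them into train, dev and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)       # fixed seed keeps the split repeatable
    n_test = int(len(items) * test_fraction)
    n_dev = int(len(items) * dev_fraction)
    test = items[:n_test]                    # held out until the final evaluation
    dev = items[n_test:n_test + n_dev]       # used to tune hyper-parameters
    train = items[n_test + n_dev:]           # used to fit model parameters
    return train, dev, test


# Hypothetical image identifiers standing in for annotated street-level images.
image_ids = [f"img_{i:04d}" for i in range(1000)]
train, dev, test = split_dataset(image_ids)
print(len(train), len(dev), len(test))       # 800 100 100
```

A large gap between training and development accuracy points to over-fitting, while poor accuracy on both points to bias, which is why the three subsets must not overlap.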
For each model (tree detection, geolocation, and genus classification), different optimization measures (and their corresponding accuracies) were tested in order to find the best set of model hyper-parameters, training data, and optimization algorithms and achieve the highest possible model performance. The most accurate models were used for inference and an overall workflow accuracy was calculated. The accuracies achieved by all models were compared to state-of-the-art models or human-level performance and used as an indicator for the performance of the proposed methodologies. Software and model output data were exported as geographic data stored in .csv, .shp or .geojson format.

In stage four (s. fig 1.6, yellow), the software was used to collect tree inventory information for the Metro Vancouver region. The generated tree inventory information was tested for implementation in bio-surveillance efforts in the Metro Vancouver region. The location and intensity of tree genera hotspots are reported in chapter 3.

1.4 Thesis structure

In chapter 2, the tree detection and geolocation methodology and workflow are described and model performance is assessed. In chapter 3, the tree genus classification model is presented and evaluated for the purpose of generating estimates of tree genera accumulation in the Metro Vancouver area. To conclude, key findings, implications, limitations and future work are discussed in chapter 4. The appendix contains additional information.

Chapter 2

Mapping urban trees with deep learning and street-level imagery

2.1 Introduction

2.1.1 Urban tree assessment

Urban forests are gaining global attention as evidence is gathered about the diverse benefits they provide to human health and well-being through various ecosystem services [106, 140] (s. section 1.1).
Planning and managing urban forests and trees on the basis of urban tree inventories is increasingly coming to the fore in the context of global urbanization trends, rapid climate change and increasingly connected trade [110]. Nevertheless, urban forest planning and management remains an outstanding challenge worldwide owing to relatively scarce information on the spatial distribution and accessibility of urban forests and trees, as well as their health condition, composition, structure and function [69]. In practice, most municipalities still perform labor-intensive field surveys to collect and update inventories of public trees. Despite the importance of urban trees, national and municipal sources of tree inventory lack detail, consistency and quantity due to the cost associated with mapping and monitoring trees through time and over large areas [102] (s. section 1.1.3).

The study aims to produce an automatic, affordable and novel method for tree detection and geolocation that can be used in any urban region where sufficient street-level imagery is readily available. I introduce state-of-the-art instance segmentation (object detection and pixel masking) with DL frameworks to extract and mask fuzzy features like trees in images (s. section 2.2.4 and appendix B.2.1). In addition, the method uses monocular depth estimation and triangulation to estimate precise tree locations without the need to rely on secondary datasets (s. section 2.2.5). Ultimately, I aim to fulfill the need for inventory methods that can be automated and generalized over multiple cities in order to develop national and international recommendations and standards for urban tree inventory data collection [102] (s. section 1.1.3).

2.1.2 Remote sensing for individual tree mapping

Recent research has focused on RS data and techniques allowing for the remote and automated recognition and characterization of individual trees [67].
Individual tree mapping from remotely-sensed data, termed Individual Tree Crown Delineation or Detection (ITCD), has gained popularity since the mid-1980s as an alternative to ground-truth measurements [153]. However, mapping and monitoring of individual trees in heterogeneous urban areas using remotely-sensed data and current ITCD methods remains challenging [10]. The small size of individual tree crowns in urban areas restricts the use of most satellite imagery sources to analyzing clusters of urban trees, or requires a process for spectral unmixing [129]. VHR satellite or aerial imagery (<80 centimeters) can help provide the level of detail required for individual urban tree assessments, but is often impacted by urban shadows [33, 82]. Similarly, the use of high-resolution LIDAR data in individual tree assessments is often impacted by vertical urban structures such as power lines and lamp posts [153]. Datasets such as LIDAR or VHR aerial imagery are usually collected at one point in time and can be expensive to acquire [6, 82]. Novel, readily accessible methods and data sources to build standardized tree inventories on a large spatial scale, allowing for cheap, seamless and recurrent data collection and rapid processing, are still needed [67].

2.1.3 Trends in automatic tree inventory assessments

Two recent trends have gained attention in the literature for their potential to fill the gap in assessing urban trees over large areas at low cost, and to promote uptake by a larger number of municipalities [132]: first, the success of CNNs in out-competing other methods for extracting abstract features and objects in imagery [89] (s. section 1.2.2), and second, the growing availability of low-cost, detailed and increasingly crowd-sourced street-level imagery (photographs of street scenes taken from the ground) [13, 77] (s. section 1.2.1).

Street-level imagery is, for example, used to quantify 'perceived urban canopy cover' by estimating the percentage of detected tree canopy cover pixels relative to the total number of pixels in an image [124]. Similarly, Li et al. [80] assessed the percentage of vegetation in streets by quantifying the amount of green pixels seen in a street view scene. Both of these methodologies calculated a GVI, a metric that quantifies the proportional amount of green pixels in each image. The index serves as a proxy for how urban vegetation is perceived by pedestrians, and has since been applied to a variety of cities all over the world [36]. Wegner et al. [144] designed a workflow for automatic detection and geolocation of street trees by combining FASTER R-CNN tree detection results from GSV and aerial imagery with information retrieved from Google Maps in a probabilistic model. The authors then used street-level and aerial imagery to classify 18 different species among the detected trees. Branson et al. [15] subsequently built upon the object detection methods in [144] by including a Siamese CNN to verify whether detected trees had changed visibly over time. Both approaches, however, still rely on data fusion with VHR aerial imagery to locate urban trees.

Localizing objects that have previously been detected in photographs acquired with smart phones or cameras from the street is a unique challenge for RS practitioners. Satellite or aerial imagery pixels and LIDAR point clouds inherently store either relative or absolute geographic location information, so there is no need to additionally compute three-dimensional geographic pixel coordinates. The translation of features from street-level imagery to a geographical location is usually achieved using one of two principal approaches: (1) objects are matched to locations using overhead or 3-dimensional data (e.g. passive aerial imagery or active LIDAR) [76], or (2) the location is directly retrieved from street-view imagery by reconstructing 3-dimensional space or feature data (e.g. camera-to-object depth) [5]. The latter approach can be achieved through multi-view stereo methods (using multiple images to reconstruct the objects) [25], binocular methods (using two images) [59] or monocular methods (using only one image) [50]. Because monocular depth estimation works on a single image, it does not require a large number of images taken from multiple perspectives or additional knowledge about the analysed scene, and allows for the analysis of features from various data sources taken at different points in time [134]. Monocular depth estimation has benefited from the recent development of novel DL approaches, particularly in the field of self-driving cars [92]. The ability to retrieve location information for detected objects from a single image through monocular depth estimation enhances the potential to use street-level imagery collected with different sensors over time to match detected objects to specific locations [144].

2.1.4 Chapter objectives

In this chapter, I propose a novel, low-cost method for urban tree detection and geolocation using readily available geo-tagged street-level images. I investigate the generalizability and transferability of the tree detection model by applying it to different geographical locations in three cities in the Metro Vancouver region (Canada) and the city of Pasadena (US). I further test the robustness of the approach on images provided by two different street-level imagery sources, namely GSV, which provides proprietary data, and Mapillary, which offers crowd-sourced data. I validate the model by comparing its output with on-the-ground tree location measurements in the Metro Vancouver region.
Ultimately, the aim is to produce a robust, cheap and rapid method for creating detailed tree inventories in any urban region where sufficient street-level imagery is readily available.

2.2 Data and methods

I started by detecting trees and generating tree instance masks in all available panorama imagery prior to mapping detected trees in space using two DL architectures (s. fig 2.1). DL has proven to be a powerful tool for extracting abstract features and objects from raw imagery, and is increasingly adopted in the RS community [154]. The most common algorithmic architecture for implementing DL for imagery analysis is the CNN (s. section 1.2.2). Zhang et al. [152] provide a detailed overview and a technical tutorial for RS data analysis using DL algorithms and Mountrakis et al. [98] highlight state-of-the-art examples for traditional and novel RS applications enhanced by DL. CNNs were, for example, used to extract the position of multiple smaller objects and features from aerial and satellite imagery, including vehicles [23], aircraft [20], oil tanks [151] and sport fields [24]. Xing et al. [146] used CNNs to classify the land cover seen on geo-tagged photos. In combination with the photos' distance to pixels in the GlobeLand30-2010 land cover map, the photos' classification results were used to validate the map product's accuracy. The combination of street-level imagery and DL is also showing promising applications in the urban context. Kang et al. [66] classified the use of buildings and mapped them in space by combining geo-tagged GSV imagery with DL.

Figure 2.1: Urban tree mapping workflow: I first generate a bounding box, a mask and a probability score for all urban trees for the input imagery using the trained MASK R-CNN algorithm. I then compute a dense depth mask with monoDepth for the same input imagery and extract a depth value for every previously generated tree mask. I use the depth value and imagery metadata (location, bearing) in the triangulation pipeline to generate tree locations as geographic coordinates. The output is a map of urban tree positions connected to generated tree masks.

The approach assesses the potential to map urban trees in four areas of interest and consists of the following steps (s. fig 2.1):

1. Street-level imagery retrieval for areas of interest (s. section 2.2.3)
2. Tree instance segmentation with MASK R-CNN architecture (s. section 2.2.4)
3. Geolocation of trees detected in panoramas through monocular depth estimation and panorama metadata (s. section 2.2.5)
4. Geolocation correction of trees present in multiple panoramas through triangulation (s. section 2.2.5)

2.2.1 Study site

For the majority of model training and assessment, I chose imagery and ground-truth measurement plots distributed over the Metro Vancouver area (49° N, 123° W), specifically in the municipalities of Vancouver, Surrey and Coquitlam (s. fig 2.2). Metro Vancouver is located on Canada's south-west coast; it is one of the warmest Canadian cities in winter and experiences relatively high rainfall throughout the year. Mild climatic conditions favour the growth and survival of tree species originating from both harsher and milder climates [131]. Metro Vancouver's urban forest includes many exotic tree species imported from different climatic zones in North America as well as over 60% of Canada's native tree species, resulting in one of the most diverse urban forests in Canada [130]. Metro Vancouver strives to be one of the world's greenest cities by 2020, resulting in spatially varying types of proactive urban forest management [62] (s. section 3.2.1).
To demonstrate the potential transferability of the model to other urban ecosystems, I evaluate the trained tree instance segmentation model on imagery of the Pasadena Urban Tree dataset (34° N, 118° W), located about 2000 kilometers further south on the west coast of the United States (s. fig 2.2).

Figure 2.2: Location of street-level imagery datasets and ground-truth measurements. Street-level imagery is available for Metro Vancouver and Pasadena. Geolocation test sites are located in the cities of Vancouver, Coquitlam and Surrey, Canada. (Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL. Source: Mapillary 2019, Google Street View 2018)

2.2.2 Ground-truth measurements

To evaluate the model's geolocation performance, I conducted a field campaign in March 2019 to collect ground-truth location measurements of all public and private trees in four areas of interest: Vancouver, Surrey and two in Coquitlam (urban and suburban areas) (s. fig 2.2). I define a tree as any vegetation with a clearly distinguishable stem and crown that has the potential to grow over 5 meters in height at full maturity [103]. I recorded the Global Positioning System (GPS) position of each visible tree from the sidewalk using a Trimble Geo 7X handheld device in an unobstructed position to maximize GPS signal. I used the rangefinder on the handheld device to measure the offset between the measurement position and the tree, from which a precise tree location could be determined during post-processing. The measured GPS locations were corrected using Base Station Data for Vancouver (BCVC: 49°16'32.73535" N, 123°5'21.58021" W) and Surrey (BCSF: 49°11'31.49655" N, 122°51'36.24849" W). This yielded an overall accuracy of under 0.5 meters for 294 recorded trees in Vancouver, 152 trees in Surrey, and 336 trees in Coquitlam.

2.2.3 Street-level imagery

It is common practice in training a CNN to divide a dataset into training, development and testing datasets [122].
This practice ensures that model parameters are adjusted to support the best possible generalization of the model and its applicability to various datasets and tree objects. Given the range of imagery providers, cameras used and quality of street-level imagery, I decided to test the transferability of the model between two different data providers: proprietary GSV data (Vancouver, Surrey, Pasadena) and crowdsourced Mapillary data (Coquitlam).

Building training and development datasets

In total, I compiled two datasets for two separate training steps described in section 2.2.4. I split each of the compiled datasets with an 80:20 ratio into training and development datasets. First, I extracted 36,500 images from the openly available COCO Stuff semantic segmentation dataset that contains semantic labels (classified pixels) for trees. The COCO Stuff dataset is, to date, the most expansive collection of images with semantic segmentation labels (∼164,000) for "amorphous Stuff" classes (e.g. sky, roads, brick walls, trees, etc.) [18]. In contrast, most datasets focus on clearly delineated "thing" classes (e.g. people, cars, traffic lights, etc.) [14]. Second, I acquired GSV images from Vancouver and Surrey in March 2018 (s. fig 2.2, tab 2.1). I used the "Labelbox" web tool to create single tree instance labels for a combined 60 images for Vancouver and Surrey by manually masking all visible trees, resulting in approximately 1200 tree masks (i.e. instance labels), 453 for Vancouver and 711 for Surrey [73].

Table 2.1: Street-level imagery and mask annotations for the fine-tuning procedure. Compiled datasets consist of 30 panorama images each.

dataset               provider    tree masks   green infrastructure
Vancouver train/dev   Google      453          street and private trees
Surrey train/dev      Google      711          private trees
Pasadena test         Google      365          street and private trees
Coquitlam test        Mapillary   471          street and private trees

Building test datasets

To assess the model's generalization performance and transferability to imagery of a different urban ecosystem, I evaluated the model's performance on an independent test dataset consisting of imagery from the city of Pasadena (s. tab 2.1). The dataset, which covers all of Pasadena, was created in March 2016 and is available from Branson et al. [15]. In addition, I used imagery acquired from Mapillary for the city of Coquitlam as a second independent test dataset to demonstrate the robustness of the model applied to different street-level imagery providers (s. fig 2.2 and tab 2.1). For Coquitlam, I downloaded panorama images in February 2019 [133]. I randomly sampled 30 test images from each of the two datasets using the NumPy random number generator (assuming a univariate Gaussian distribution) in order to test imagery from different types of city structure. I masked and annotated approximately 360 individual tree masks for Pasadena and 470 for Coquitlam using the same labeling method described above for the Vancouver and Surrey imagery.
All panorama imagery contained metadata about the camera location at capture and a 360° bearing reference to true north.

2.2.4 Tree instance segmentation

Tree instance segmentation model

I trained the MASK R-CNN architecture for tree instance segmentation, the task of pixel-wise detection and delineation of separate objects of interest in an image. Instance segmentation for "Stuff classes" (fuzzy object classes without clearly delineated shapes, like the sky, trees or other vegetation) is technically challenging and has only recently gained attention in the DL community, resulting in a relative scarcity of architectures performing well for this task [18]. I chose MASK R-CNN because of its generality and flexibility, and because it was the best performing architecture in the recent COCO 2017 Instance Segmentation Challenge [14]. MASK R-CNN is a state-of-the-art architecture that is implemented in the model in a modular way, so that it can be replaced by better performing instance segmentation algorithms in the future. The implementation of MASK R-CNN adopts the original framework of He et al. [57], implemented in Python 3, Keras and TensorFlow, and can be accessed through Abdulla [4]. Generated outputs include: (1) bounding boxes in pixel coordinates around each detected tree object, (2) a probability score of the class label assigned to a detected object (binary: tree or non-tree), and (3) a pixel mask through the assignment of single pixels to individually detected objects [57]. For more detailed information about the architecture of MASK R-CNN see He et al. [57] and the supplementary material in appendix B. After training was finalized, the MASK R-CNN model was applied to detect and mask new tree objects in images that had not been seen during the previous training step, which I refer to as the process of inference (s. section 1.2.2) [51].

Training strategy

I used all three of transfer learning, a layered-training approach and fine-tuning to train the MASK R-CNN architecture [26].
Through transfer learning, a model pre-trained for one task (e.g. semantic segmentation of "thing" classes in the COCO dataset) is re-trained for another task (e.g. tree instance segmentation on street-level imagery) through the transferal of weights [111][14]. I transferred weights (i.e. feature representations) of a MASK R-CNN model pre-trained on the COCO dataset to the first (deep) layers of the fresh model, and initialized the training process with these COCO weights. In this way, the model first learned to distinguish tree structure in general and was later able to separate single trees more effectively [19].

In the subsequent training iteration, defined as the layered-training approach, I used images containing tree objects in the COCO Stuff dataset. For this, I trained the top layers ("5+") of MASK R-CNN for 50 epochs on COCO Stuff using a learning rate of 1e-4. An epoch refers to all images in the training dataset being run through the entire model and the internal model parameters being updated at least once [51].

Finally, I fine-tuned the model by training with the labeled data from Vancouver and Surrey. To do so, I trained the model heads (the most shallow or last layers of the model) for 30 epochs, followed by another iteration training the "5+" layers of the Residual Network with 101 Layers (RESNET101). After a sparse grid search I found 1e-4/10 (i.e. 1e-5) to be the most successful learning rate for the fine-tuning step. I set all other hyper-parameters following recommendations in existing research that employs MASK R-CNN [4, 57].

To avoid over-training the model on relatively few training samples from Surrey and Vancouver (1000 tree instances), I used heavy data augmentation at train-time during the fine-tuning procedure. In brief, I split the panorama images in half at the start of each training epoch and downscaled the halves to 1024x1024 pixels in size.
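The splitting and downscaling step can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `split_and_downscale` is hypothetical, the panorama is assumed to be a NumPy array of shape (height, width, channels), and nearest-neighbour resampling is an assumption chosen to keep the example free of image libraries.

```python
import numpy as np


def split_and_downscale(panorama, size=1024):
    """Split a panorama into left/right halves and downscale each half.

    Hypothetical helper: nearest-neighbour resampling is an assumption,
    the resampling method used in the thesis is not stated.
    """
    width = panorama.shape[1]
    mid = width // 2
    halves = [panorama[:, :mid], panorama[:, mid:]]
    resized = []
    for half in halves:
        h, w = half.shape[:2]
        # Pick evenly spaced source rows and columns (nearest neighbour).
        rows = np.linspace(0, h - 1, size).round().astype(int)
        cols = np.linspace(0, w - 1, size).round().astype(int)
        resized.append(half[np.ix_(rows, cols)])
    return resized


# Small synthetic panorama standing in for a real street-level image.
demo = np.zeros((64, 128, 3), dtype=np.uint8)
left, right = split_and_downscale(demo, size=32)
print(left.shape, right.shape)
```

In practice an interpolating resize from an image library would usually be preferred over nearest-neighbour sampling, since it reduces aliasing when shrinking large panoramas.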
Each time an image was loaded into memory, I (1) flipped the image left to right 50% of the time; (2) either re-scaled the image in the x and/or y direction by a variable factor between 0.8 and 1.2, rotated the image by a random angle between -4 and +4 degrees, or sheared the image by a random angle between -2 and +2 degrees; and (3) performed contrast normalization using a random target factor between 0.9 and 1.1 of the initial contrast of the image. The full model was trained on an NVIDIA GTX1080 Ti Graphics Processing Unit (GPU), whose 11 GB of memory limited training to stochastic gradient descent with a mini-batch size of 1.

Evaluation strategy

I evaluated the method on two development datasets (Vancouver and Surrey, using GSV imagery) and two test datasets (Coquitlam using Mapillary, and Pasadena using GSV imagery). To evaluate model performance, I chose three commonly used evaluation metrics for instance segmentation frameworks [14]: (1) mean average precision (mAP), (2) average precision over Intersection over Union (IOU) using a threshold of 0.5 (AP50), and (3) average precision over IOU with a threshold of 0.75 (AP75) (s. appendix B.3.2). To quantify the known negative influence of small tree mask sizes on instance segmentation performance in detail, I iteratively excluded all masks under a mask size threshold and compared the recalculated evaluation metrics [70]. Next, I manually inspected failure cases according to three different tree mask sizes to identify the most frequent tree detection error. For detailed tree instance segmentation error assessment, I defined small masks to be under 3000 pixels in size (approximately 0.3% of total image pixels per small mask), representing detections of very distant trees (> 70 meters) relative to the camera position at image capture.
I defined medium masks to be between 3000 and 30,000 pixels in size, roughly all private trees found in front yards, and big masks to be over 30,000 pixels in size (approximately 3% of total image pixels per big mask), representative of large front yard trees or street trees.

Additionally, I calculated precision and recall to compare detection results with annotated images [34]:

\[ \mathrm{precision} = \frac{tree_{pred}}{tree_{pred} + other_{pred}} \tag{2.1} \]

\[ \mathrm{recall} = \frac{tree_{pred}}{tree_{annotated}} \tag{2.2} \]

All final evaluation metrics and precision-recall curves to compare model performance for different datasets were calculated excluding very small masks under a size of 3000 pixels (approximately 0.3% of total image pixels per small mask) [18].

2.2.5 Geolocation of trees

Depth Estimation

To geolocate individual trees, I first created a dense depth estimate layer for each panorama using monocular depth estimation [50]. I adapted the monocular depth estimation architecture of Godard et al. [50], MonoDepth, to develop dense depth masks because of its applicability to images with varying lens types (e.g. panoramic or narrow view) typically found in street-level imagery datasets [132]. MonoDepth is available off-the-shelf as a fully trained unsupervised DL model with a depth error margin of less than 20%. Godard et al. [50] provide multiple trained weights for non-commercial usage, which allows researchers to use the model for inference purposes without having to perform laborious training stages. For the analysis, I followed the recommendation of Godard et al. [50] to adopt the best performing weights, obtained when MonoDepth was pre-trained on the Cityscapes dataset and fine-tuned using the KITTI vision benchmark dataset [29, 46].

The depth estimation model typically computes disparities (D) between objects in each panorama image which then need to be translated by a factor (F) into meters.
During the translation process, values need to be calibrated for the specific lens (C) used and corrected for differences in image sizes between the original Cityscapes and KITTI datasets (W0) and the input image (W1). Depth can then be interpreted as a per-pixel depth estimate (depth) between the object in the image and the camera position at the time of image capture:

\[ depth = \frac{W_0 \cdot F}{W_1 \cdot D} \cdot C \tag{2.3} \]

Based on Godard et al. [50], I set F to 0.54 meters, W0 to 721 pixels, W1 to 6656 pixels and C to 1.5 to account for the use of a camera lens that captures panorama images. The calibration parameter C was determined by minimizing the average error in the monocular depth estimates compared to distances of measured ground-truth trees to the recorded camera position. Once dense depth layers were computed, I extracted a single depth value in meters for each previously detected tree using depth pixel values at the center of mass calculated for each instance mask. Finally, I translated each observation of a detected tree into geographic coordinates by combining the depth value, the bearing of each tree instance with respect to the position at which the panorama was captured, and the panorama's coordinates. Both of the latter are recorded in the imagery's metadata.

Triangulation

In a subsequent processing step, triangulation is used to reduce duplicate observations of individual tree predictions and correct their position estimation where multiple observations of the same tree are recorded (s. fig 2.3 (a)). I assumed that there are no false-positive tree instances (i.e. each tree detected as an object is a real tree). Through triangulation, I created nodes of intersections for all crossing edges, drawn between preliminary tree prediction positions and the corresponding camera location. I selected candidate intersections within a maximum distance (maxdist, s. equation (2.4)) to the preliminary tree location estimate for at least two preliminary tree locations (s. fig 2.3 (b)).

\[ maxdist = c_0 + c_1 \cdot depth \tag{2.4} \]

Figure 2.3: Correction of predicted tree locations through triangulation. I draw bearings between raw tree locations from monocular depth estimation and camera positions (a). The triangulated intersections of these bearings are selected if they are located within a maximum distance (maxdist) from raw tree locations, for a minimum of two raw predictions (b). The intersections which are closest to the raw tree positions are chosen and clustered (c) to create the final corrected tree position (d).

c0 is a constant offset in meters, and c1 describes the maximum relative error in the depth estimate, calculated as 65%. I chose a value of 3 meters for c0 to account for the average inaccuracy in the GPS positioning of panorama locations.

Given that each edge between the preliminary tree position and the corresponding panorama location has potentially multiple candidate intersections, I selected the closest intersection to each preliminary tree position. Finally, I used hierarchical clustering to assign the average position of all selected candidate points in the cluster as an output tree coordinate (s. fig 2.3 (c), (d)). I analyzed the distances of all ground-truth measurements with respect to each other, and found that over 99% of all points were separated by at least 3 meters. To avoid multiple detections of the same tree represented by multiple candidate intersection points, I chose a 3 meter threshold for the clustering as the minimum distance between observations.

2.2.6 Model evaluation

I used measured public and private tree positions to evaluate the tree location model performance for areas of interest in Vancouver, Surrey, and an urban and a suburban area in Coquitlam. I evaluated the absolute positioning error of tree predictions as the Euclidean distance between the ground-truth measurement and the tree location predictions [148].
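The error computation in this section can be sketched as follows. This is an illustrative reading, not the thesis code: the names `greedy_match` and `mean_positioning_error` are hypothetical, coordinates are assumed to be (x, y) pairs in a projected coordinate system in meters, and matching is done nearest-first with a distance cutoff (set to 15 meters here).

```python
import math


def greedy_match(truth, predictions, max_dist=15.0):
    """Pair ground-truth and predicted points nearest-first.

    truth, predictions: lists of (x, y) tuples in meters (assumed projected).
    Returns a list of (truth_index, prediction_index, distance) tuples.
    """
    candidates = []
    for i, (tx, ty) in enumerate(truth):
        for j, (px, py) in enumerate(predictions):
            d = math.hypot(tx - px, ty - py)  # Euclidean distance
            if d <= max_dist:
                candidates.append((d, i, j))
    candidates.sort()  # closest pairs are assigned first

    used_truth, used_pred, matches = set(), set(), []
    for d, i, j in candidates:
        # Each measurement and each prediction is matched at most once.
        if i not in used_truth and j not in used_pred:
            used_truth.add(i)
            used_pred.add(j)
            matches.append((i, j, d))
    return matches


def mean_positioning_error(matches):
    """Mean Euclidean error over all matched (true positive) pairs."""
    if not matches:
        return float("nan")
    return sum(d for _, _, d in matches) / len(matches)


truth = [(0.0, 0.0), (10.0, 0.0), (100.0, 100.0)]
preds = [(1.0, 0.0), (10.0, 3.0), (50.0, 50.0)]
matches = greedy_match(truth, preds)
print(len(matches), mean_positioning_error(matches))
```

Sorting all candidate pairs by distance is one simple way to realize a nearest-first assignment; it does not guarantee a globally optimal pairing, which would require an assignment solver instead.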
I used a greedy algorithm to assign closest matching trees first, and then took matched trees out of the running process until no ground-truth measurements were left to match [144]. A match is kept as a true positive match if the distance between ground-truth measurement and tree prediction does not exceed 15 meters. A 15 meter threshold was chosen to include as many private trees as possible in the error assessment, which are typically found in a 15-30 meter range from the camera position at the time of image capture. Trees located more than 15 meters from the measured private ground-truth data represent distant tree detections behind houses without comparable measured ground-truth data and were excluded from the error assessment by the given threshold. I defined a measure of absolute tree positioning accuracy as the mean of absolute positioning errors mean_epos [148]:

\[ mean_{e_{pos}} = \frac{1}{n_{TP}} \sum_{i=1}^{n_{TP}} \sqrt{(x_i - x_{pred_i})^2 + (y_i - y_{pred_i})^2} \tag{2.5} \]

I evaluate absolute tree positioning accuracy and the ratio of matches to non-matches for public, private and all trees respectively.

2.3 Experiments

2.3.1 Instance segmentation

Effects of layered training

Performing layered training with COCO Stuff images improves the detection of tree objects from ∼80% without COCO Stuff training to a slight overcounting of trees, with 103% detected trees compared to the labeled tree masks in the combined Vancouver and Surrey datasets (s. tab 2.2). The overall model accuracy with AP50 and mAP improves slightly while values for AP75 decline.
The small decrease in mask accuracy for AP75 may be a direct result of the layered model including detections of trees which are more difficult to detect, compared to the lower number of detections when layered training is not used.

Table 2.2: Evaluation metrics for training with and without COCO Stuff on combined Vancouver and Surrey dataset

             AP50    AP75    mAP     mask predicted   mask annotations
COCO only    0.608   0.252   0.281   74               90
COCO Stuff   0.661   0.199   0.290   93               90

Transferability between different ecosystems and data sources

I find that MASK R-CNN developed on Vancouver and Surrey training datasets was successfully applied to detecting trees across all four datasets (s. fig 2.4 and tab 2.3). AP50 values ranging from 0.620 to 0.682 and values of other evaluation metrics are consistent with current tree or plant semantic segmentation performances found in other studies, with the difference that I not only evaluate pixel-based classification results, but also distinguish between different tree objects [18] (s. tab 2.3). AP75 and mAP values are lowest for Surrey (0.157, 0.262) and highest for Pasadena (0.262, 0.316) and Coquitlam (0.261, 0.342) datasets. Overall, at a recall threshold of 0.6, precision is above 0.8 for all four datasets (s. fig 2.4). With a recall of 0.35 or higher, precision-recall curves for both development datasets are slightly higher than for the testing datasets, which is to be expected since assessed features in the development datasets directly influence the training process (s. fig 2.4) [75].

[Plot: precision-recall curves for instance segmentation, recall on the x-axis (0.0 to 1.0), for Coquitlam (Mapillary), Pasadena, Vancouver and Surrey.]

Figure 2.4: Precision-recall curves for development (Surrey, Vancouver) and test (Coquitlam, Pasadena) datasets.

Table 2.3: Evaluation metrics for Vancouver, Surrey, Pasadena and Coquitlam. Excluding masks under 3000 pixels in size.

            AP50    AP75    mAP     mask predicted   mask annotations
Vancouver   0.682   0.232   0.312   53               52
Surrey      0.634   0.157   0.262   40               38
Pasadena    0.628   0.262   0.316   194              202
Coquitlam   0.620   0.261   0.342   238              215

MASK R-CNN performance for the Pasadena and Mapillary test datasets is comparable (AP50, mAP) to slightly better (AP75) than that of the Vancouver and Surrey datasets (s. tab 2.3). Slight differences in precision-recall curves and variations in evaluation metrics may be attributed to overall varying tree shapes and sizes found in each dataset. Features learned throughout the trained MASK R-CNN model appear to be sufficient to detect a variety of urban trees in different urban greenspace management settings, i.e. they are not limited to tree species and forms observed in Vancouver and Surrey (e.g. detection of palm trees in Pasadena). The tree detection model is therefore robust to a variety of ecosystems and urban green space designs without the need for extensive retraining.

Furthermore, performing inference on Coquitlam (Mapillary) test imagery without retraining results in the highest mAP value of the four datasets. Model performance therefore appears robust to the different data source and sensor used for street-level photography, i.e. Mapillary (s. fig 2.4 and tab 2.3). Precision-recall curves for both testing datasets appear to be very similar. This indicates that the presented model has the ability to generalize well for both a city with a very different ecosystem (Pasadena) and imagery from different data sources or sensors (Mapillary in the case of Coquitlam). The consistency of model performance with an AP50 > 0.6 regardless of data and sensor source implies that panoramas acquired from both GSV and Mapillary are suitable for use in the urban tree mapping model.

Instance Segmentation performance as a function of mask size

Next, I assessed the influence of tree mask sizes on MASK R-CNN instance segmentation performance.
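A size-based exclusion of this kind can be sketched as below. The helper names are hypothetical, masks are assumed to be boolean NumPy arrays, and treating a mask of exactly the threshold size as retained is an assumption of this sketch. The last helper checks how a pixel count relates to a 1024x1024 image.

```python
import numpy as np


def mask_area(mask):
    """Number of pixels belonging to the object in a boolean mask."""
    return int(np.count_nonzero(mask))


def filter_masks_by_area(masks, min_pixels):
    """Keep only masks with at least min_pixels object pixels (assumed rule)."""
    return [m for m in masks if mask_area(m) >= min_pixels]


def area_fraction(n_pixels, image_shape=(1024, 1024)):
    """Share of the image covered by n_pixels."""
    return n_pixels / (image_shape[0] * image_shape[1])


# Two synthetic masks: a tiny distant tree and a large nearby tree.
small = np.zeros((1024, 1024), dtype=bool)
small[:10, :10] = True          # 100 pixels
large = np.zeros((1024, 1024), dtype=bool)
large[:200, :200] = True        # 40,000 pixels

for threshold in (0, 3000, 30000):
    kept = filter_masks_by_area([small, large], threshold)
    print(threshold, len(kept))

print(round(100 * area_fraction(3000), 2), round(100 * area_fraction(30000), 2))
```

Recomputing the evaluation metrics on the retained masks for each threshold would then give one point per threshold on a curve of accuracy against minimum mask size.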
Plotting AP metric values against mask size thresholds shows that larger masks are predicted more accurately (AP values over 0.6) (see fig 2.5). This is expected, since the ratio between the outline of a tree and the tree mass contained by the outline (i.e. the outline-mass ratio) decreases with object size, so that the fuzzy tree outline carries a bigger weight in the calculation of model evaluation metrics as mask size decreases [70]. Predicting precise fuzzy tree outlines (the tree-to-sky interface) is often harder than predicting the tree mass. Outlines therefore often differ more from the ground-truth outline than the actual mass of a tree, resulting in declining evaluation metrics with smaller tree size. Notably, the accuracy for large masks is similar for the Vancouver (GSV) and Coquitlam (Mapillary) datasets, while accuracy for the Surrey (GSV) and Pasadena (GSV) datasets is approximately 0.2 lower. This indicates that similarity in tree object structure may, to a degree, have a bigger influence on image segmentation performance than similarity in image quality when instance segmentation is performed on multiple datasets. Values for AP50 increase once masks under 3000 pixels (which represent very small or distant trees) are removed. There is a corresponding significant decrease in accuracy for masks smaller than 3000 pixels once all masks over 3000 pixels in size are discarded.

[Figure 2.5 plot: "Effect of mask size on model performance"; panels: AP50, mAP, AP75; x-axis: mask size > number of pixels (0 to 100,000); series: mapillary, pasadena, vancouver, surrey]

Figure 2.5: Effect of mask size on model performance. I iteratively excluded smaller masks, measured by the number of pixels per mask, when calculating AP50, mAP and AP75. The dotted vertical line indicates the cut-off size of >3000 pixels for the final calculation of the evaluation metrics.
The model predicts large tree masks most accurately for the Coquitlam (Mapillary, red) and Vancouver (green) datasets.

Error analysis for instance segmentation

I manually inspected 296 failure cases (including small masks) to identify the most frequent tree detection errors (see fig 2.6 and 2.7). The majority of errors arise from densely planted public and private trees, resulting in two trees being detected as one combined tree, occlusion of trees, or otherwise overlapping trees (see fig 2.7 (a), (b), (e), (h)). This source of error confirms that distinguishing between visually overlapping amorphous objects is a difficult task [18]. Detecting trees using multiple street-level perspectives potentially offsets this error source, as occluded or overlapping trees can either be seen in the foreground or are otherwise distinguishable from another perspective and image in the full model [72]. Detecting hedges as false positives, and missing small trees, trees in shadows of buildings and trees in leaf-off condition (see fig 2.7 (c), (d), (f), (g), (i)), is a direct result of having very few training examples of these special cases in the datasets. As expected, most of these errors were detected for small mask sizes (<3000 pixels) (see fig 2.6).

[Figure 2.6 plot: "Instance segmentation errors"; x-axis: % of all errors (0 to 30); categories: leaf_off, encapsulated, other, shadow, occluded, small, hedge, double, combined; legend (mask size): big, medium, small]

Figure 2.6: The most common inference errors of tree instance segmentation with Mask R-CNN (in percent).

Trees seen far in the distance were often disregarded in the manual labeling process due to their small mask size and their distance from the camera location at image capture time. I note that 200 of these human labeling errors were recorded, describing instances where the model correctly identified a tree but no corresponding tree label was created. I disregarded human labeling errors for small masks, as masks under the threshold of 3000 pixels in size were not included in the final evaluation (see
section 2.3.1). These smaller masks, which represent trees found in backyards or distant trees, could be included in future analyses with the help of the additional data augmentation methods described by Kisantal et al. [70].

[Figure 2.7 panels: (a) combined, (b) double, (c) hedge, (d) small, (e) occluded, (f) shadow, (g) other, (h) encapsulated, (i) leaf-off]

Figure 2.7: Examples of masking errors in instance segmentation with Mask R-CNN. Most common errors include: separate trees detected as one (a), one tree split into multiple detections (b), hedges or shrubs detected as trees (c), small trees that were not detected (d), undetected trees behind detected trees (e), undetected trees in shadows of buildings (f), non-tree objects detected as trees and masking errors (g), undetected small trees in front of large trees (h) and undetected trees with leaf-off condition and non-sky background (i).

2.3.2 Localization

Comparison to ground-truth

I matched 70% of all ground-truth tree measurements with tree predictions after excluding all matches over 15 meters in distance as false positive matches (see fig 2.8). Non-matched ground-truth measurements often result from a tree being missed in the tree detection process, through occlusion by larger trees in the front of an image (see fig 2.8 (a)), or from the absence of a tree in either the ground-truth measurements or the street-level imagery, due to a time difference of two or more years between the ground-truth and imagery datasets (see fig 2.8 (b)). Localizing trees using monocular depth estimation can potentially help to prevent loss of information, since every single tree detection can be localized and does not depend on many photographs from different views [72]. I note that triangulation, in comparison to raw tree location predictions, successfully reduced the mean absolute position error for all areas by approximately 2 meters, from 9 to 7 meters, and the total count of tree predictions by 45-55%.

[Figure 2.8 panels: (a), (b), (c)]

Figure 2.8: Location prediction results of trees.
Predicted tree positions (yellow), ground-truth measurements (red) and common detection errors (blue). 70% of all measured trees were detected; 30% are missing through occlusion by large trees (a). The time difference between ground-truth measurements and when a street-level image was taken (>=2 years) results in the absence of either a tree prediction or a ground-truth measurement (b). Geolocation accuracy decreases slightly with increasing distance of trees to the car position of image capture (c).

Minimum distances between tree location predictions and ground-truth measurements for all areas are 0.26 meters or higher (see tab 2.4). Tree location prediction in Vancouver is, with a mean of 5.28 meters and a median of 4.36 meters, approximately 2 meters more accurate than in all other areas, followed by Coquitlam's urban and suburban areas with mean and median slightly above 6 meters. With a mean of 7.06 meters and a median of 6.87 meters, geolocation performance in Surrey is lower than in all other areas. The overall lower position accuracy in Surrey could be attributed to the area of interest being located on a slope of >15%, which negatively influences triangulation, compared to other areas with no or a relatively low slope (<5%). Another source of error may be the overall more spread-out building and green space structure of the Surrey area. This structural characteristic leads to trees being located further away from the camera position of image capture, with a slight decline in detection and location accuracy and resulting smaller tree masks, discussed in section 2.3.2 (see
fig 2.8 (c)).

Table 2.4: Absolute geolocation accuracy (in meters)

                       match   min    mean   median   std
Vancouver              235     0.26   5.28   4.36     3.59
  street trees         143     0.26   4.31   3.92     2.76
  private trees        97      1.56   8.55   8.38     3.77
Surrey                 94      0.42   7.06   6.87     3.36
Coquitlam (urban)      64      0.46   6.58   6.26     3.22
Coquitlam (suburban)   159     0.55   6.83   6.07     3.73

Location accuracy for private and street trees

After triangulation, 143 street trees (93% of the ground-truth measurements) were successfully located in the Vancouver area. Eleven (7%) of the 154 street trees remained unmatched, i.e. had no prediction within 15 meters of the ground-truth position. The majority (9 trees) of the unmatched ground-truth street trees were small trees that were either newly planted or re-planted between the date of capture of the street-level images and the collection of ground-truth data (i.e. 2 years) (see fig 2.8 (b)). The two remaining unmatched trees were not detected in the instance segmentation step, due to large vehicles obstructing the trees in the respective street-level images. Distances from ground-truth to tree predictions for street trees range from 0.26 to 13.14 meters, with a mean of 4.31 meters, a median of 3.92 meters and a standard deviation of 2.76 meters (see tab 2.4). On average, street trees (red) are located almost 1 meter more accurately than all trees (blue) and about 4 meters more accurately than private trees (green) in the Vancouver area (see fig 2.9). The overall more accurate predictions in Vancouver are possibly a result of the presence of uniformly and separately planted street trees (see fig 2.9, red, and fig 2.8). Owing to the proximity of street trees to the camera position, their tree masks are bigger and predicted more accurately, which has a potential influence on the location prediction of street trees (see
section 2.3.1).

[Figure 2.9 plot: "Absolute tree location accuracy"; x-axis: distance from prediction to ground-truth (meters, 0 to 14); series: all trees, street trees, private trees]

Figure 2.9: Absolute geolocation accuracy for street trees (red), private trees (green), and all trees (blue) in the Vancouver area.

Absolute location accuracy for private trees in the presence of street trees is about 3 meters lower than the previously discussed overall accuracies of the Vancouver, Surrey and Coquitlam areas (see tab 2.4). Values for Vancouver's private trees are a minimum of 1.56 meters, a mean of 8.55 meters, a median of 8.38 meters and a standard deviation of 3.77 meters. 70% (97 trees) of all recorded private trees were matched; 30% (43 trees) were not matched. Both the non-matches and the low position accuracy of private trees may be influenced by the more varied spatial pattern of planted private trees. The Surrey and Coquitlam areas, which have similar spatial tree heterogeneity but are not influenced by street trees, also recorded approximately 30% of non-matched ground-truth trees. As previously discussed, the combination of two trees detected as a single tree is the most common tree detection error for all mask sizes (see section 2.3.1). This error is expected to occur more often for densely planted private trees with overlapping canopy than for uniformly planted street trees [144].

It is also possible that the presence of street trees influences the model's location accuracy. Surrey's and Coquitlam's (suburban) private trees (no presence of street trees) show lower positional errors than Vancouver's private trees. Street trees often overlap with private trees in street-level photographs due to their proximity to the camera position. Street trees therefore influence both the monocular depth estimation of private trees and the bearing information of the tree detection bounding box from the camera position, as the center of mass shifts towards the larger part of the mask, the street tree.
These detection errors negatively influence the localization process and may result in lower positional accuracy for private trees in areas with street tree presence.

Location accuracy with distance of tree from position of image capture

Distances of street tree measurements to the camera positions range between 6 and 14 meters, whereas private tree measurements are typically more than 15 meters from the camera position, resulting in a bi-modal distribution of all distances between ground measurements and car positions for all areas (see fig 2.10). Another reason for more accurate geolocation in Vancouver may be that Vancouver trees are positioned closer to the camera at the time of image capture than trees in the Surrey and Coquitlam image datasets (see fig 2.10). The range of absolute position error from the tree location prediction to the tree ground-truth measurement is slightly lower for trees closer to the camera position than for trees further away from the camera (see fig 2.8 (c)), as indicated by the shape of the Kernel Density Estimates (KDE) in figure 2.10. However, only a low correlation, with an R² of 0.17, can be recorded between the distance of the predicted tree to the ground-truth measurement and the distance of the ground-truth measurement to the car position. The mean positional error increases with distance to the camera by approximately 0.23 times the distance between ground-truth tree location and camera, as indicated by the slope of the fitted line in figure 2.10. This aligns in magnitude with a Root Mean Squared Log Error (RMSLE) of approximately 0.2 reported by Godard et al. [50] for the increase in error for monocular depth estimation with distance from the camera position. Random noise of 6.3 meters is introduced, likely through different street slopes, the described tree detection errors and resulting triangulation errors.
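The slope and R² quoted above are the kind of quantities an ordinary least-squares line fit produces. The following is a minimal sketch of such a fit, assuming a simple linear model of positional error against camera distance. The function name `fit_line` and the sample values are hypothetical and are not the thesis data or code.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical sample: distance from camera to ground-truth tree (meters)
# and distance from ground-truth tree to predicted tree (meters).
camera_dist = [6.0, 8.0, 10.0, 14.0, 18.0, 25.0]
position_err = [3.1, 5.2, 4.0, 7.5, 5.9, 9.3]

slope, intercept, r2 = fit_line(camera_dist, position_err)
print(f"slope={slope:.2f} intercept={intercept:.2f} R2={r2:.2f}")
```

In this reading, the slope is the growth in positional error per meter of camera distance, and a low R² means that distance explains only a small share of the variation in error.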
I also detect a systematic error, an intercept of 2.7 meters, potentially caused by systematic car position GPS inaccuracies in urban landscapes [41]. Another cause for this systematic error could be the initial tree location prediction using the center of mass of each tree crown, which retrieves a depth measurement for the outside of the crown diameter instead of the usually measured stem position.

[Figure 2.10 plot: "Influence of camera position on tree location prediction"; axes: ground-truth to camera position (meters, 0 to 35) and ground-truth to tree prediction (meters); R2: 0.17]

Figure 2.10: Influence of camera position at time of image capture on tree location prediction. Comparison of the distance of ground-truth measurements to the camera location at time of image capture vs. ground-truth measurements to predicted tree locations for all data points (Vancouver, Surrey, Coquitlam).

2.4 Conclusion

To support decision-making and research that can improve the management of urban forests, cities need more cost-efficient and widely applicable tools that can provide high-resolution spatial information on single urban trees for the entire urban and peri-urban landscape (see section 1.1.3) [69]. I presented a promising low-cost framework for mapping individual urban trees over large areas that shows potential to be adopted in different cities around the world. This novel model relies solely on street-level imagery as a data input and does not require any additional, potentially expensive VHR aerial or satellite imagery for the geolocation of trees. Furthermore, it is developed and tested to be transferable over different image sources and geographical regions, as evidenced by the experimental results.

The approach can be applied to a diversity of urban trees and forests, both public and private, and could form the basis for urban assessments that require single tree detection.
I found that MASK R-CNN can be successfully trained to identify fuzzy objects like trees to a high precision with a minimal amount of training images (48 images) and a layered training approach integrating open source imagery datasets (COCO Stuff). The experimental results of this study demonstrate that a layered training approach resulted in a 23% higher tree detection rate compared to only using transfer learning. AP50 values over 0.62 are consistent with state-of-the-art results in other studies segmenting fuzzy objects [18]. The instance segmentation model, in combination with the layered training approach, has shown potential to learn a broad range of tree shapes, species and sizes without the need for extensive training on particular tree features. For instance, palm trees in the Pasadena test set were detected without palm trees being included in the Vancouver and Surrey training set. The combination of DL and street-level imagery appears promising for the detection of trees in different urban ecosystems. Further, the model is not limited to the use of the same sensor or dataset. Both Mapillary and GSV panoramas proved suitable for urban tree mapping.

I accurately geolocated trees using one or two street-level images, a monocular depth estimation algorithm, and triangulation that requires no additional or context information. The geolocation of street trees, with a mean accuracy of around 4 meters, was approximately 2 meters more accurate than the mean accuracy of 6 meters for private trees seen from the street. Most trees clustered at 10 meters distance from the camera position for the tested Google and Mapillary imagery. The proximity of trees resulted in a generally larger tree mask in the tree detection step. Conversely, the further a tree was from the camera, the more errors in tree detection occurred.
This suggests that the distance from the camera position at time of image capture to trees of interest should be a consideration when choosing or generating a dataset for future urban tree mapping applications. Detection errors influenced the tree geolocation module, and I recommend collecting images 7-14 meters away from trees of interest for best positioning results.

Street-level imagery in combination with DL brings a new perspective to assessing urban forests. Accurate masking and geolocation of trees can provide the basis for a variety of quantitative urban forest assessments (see section 4). For example, this assessment could be used in the future to quantify ecosystem services of urban trees at a large scale, using tree structure attributes as proxies to estimate multi-functionality. Future research directions aimed at optimizing street-level imagery capture could include: assessing the influence of spacing between camera positions of image capture on triangulation and positional accuracy of tree predictions; evaluating the influence of slope and vertical terrain variability on geolocation performance; and improving geolocation performance for areas with high terrain variability.

Chapter 3

A machine learning tool for mapping urban tree diversity

3.1 Introduction

3.1.1 The importance of urban tree diversity

A diverse and healthy urban forest enhances the ability of cities to adapt to climate change impacts, such as droughts or floods, improves wildlife habitat, contributes to the protection of native ecosystems and, importantly, increases the resilience of cities to pest and disease outbreaks [7, 104]. Mass tree mortality, e.g. of Fraxinus or Ulmus, due to disease and invasive pests has been known to occur; famous examples in Canada and the United States are DED and EAB [90].
These outbreaks can be hugely detrimental to the health of urban ecosystems, including the health of people living in cities, and can come at a great cost to municipalities (see section 1.1.2) [135].

Given the many benefits of a diverse urban forest (see section 1.1), preserving and improving urban forest health and diversity are key goals of many urban forest strategies across North America and Europe [7, 62, 108]. Tree diversity metrics used in the context of urban forest management often relate to richness (the count of different tree genera), evenness (the proportion of a given tree genus with the total urban forest), composition (the identity of present tree genera), and distribution (the spatial abundance of tree genera) [104]. These metrics are used to assess the state of diversity in urban forests, and they require detailed and accurate urban tree inventories as a baseline. In the face of resource constraints and lack of capacity, municipalities are increasingly looking for new ways to carry out urban forest inventories and continuously assess the state of their urban forest resource (see section 1.1.4), especially as climate change is predicted to increase urban forest vulnerability to pests and disease (see section 1.1.3) [96].

3.1.2 Bio-surveillance in the Metro Vancouver region

Urban decision makers in Canada will require detailed urban tree inventory data to predict FIS spread patterns, to minimise impacts on valuable urban forests and to assess the efficacy and efficiency of bio-surveillance programs [43]. Emerging national and international policy statements and strategies will greatly impact the management of urban forests for prevention, detection and rapid response for FIS infestations [39].
Key goals of these policies and strategies are to improve surveillance activities in geographic areas at risk of pest and pathogen introduction, such as residential areas close to ports, tree nurseries, or industrial zones, and to evaluate the effectiveness of international policies and pest contamination procedures with regard to introduction prevention.

The implementation of the above strategies will require a sound understanding of baseline conditions such as tree species and genera richness, evenness, composition, and distribution. The availability of cost-efficient, fine-scale urban tree inventory data, therefore, has the potential to direct successful bio-surveillance efforts and identify areas of high economic and invasion risk for many known and unknown FIS [116]. Detailed tree inventory data can, for example, provide valuable information about the location of native and planted host trees susceptible to attack from specific FIS. Knowledge of the spatial distribution of urban trees and their genera composition makes it possible to determine the most effective bio-surveillance activity in varying urban forest landscapes (see section 1.1.3) [71].

Two common bio-surveillance activities for early detection of FIS in the Metro Vancouver region involve: 1) the distribution of pheromone traps around areas of high introduction risk, such as harbours or commercial zones where FIS first come in contact with trees, and 2) manual visual inspection of potential host trees present at intersection points of a 1km by 1km triangular grid placed over the Metro Vancouver area (correspondence with Kimoto T., April 2019, CFIA). Both of these activities directly rely on up-to-date, spatially comprehensive data on the distribution and host tree or genera composition of the urban forest.
Current tree inventories of the 21 municipalities in the Metro Vancouver region are mostly restricted to public street trees or other trees on public land and exclude large areas of urban forest, especially trees on residential properties. For example, in Vancouver, about 37% of urban forest is located on private land [62]. As outlined in section 1.2, automatic processing of spatially extensive GSV imagery has the potential to provide detailed information on the abundance and distribution of host trees or tree genera in urban areas. With the aim of improving urban bio-surveillance activities in the Metro Vancouver region, I propose a method to automatically classify tree species at the genus level, leveraging GSV imagery and CNNs.

3.1.3 Training data for tree genus classification

The main challenge of using new DL technology in urban and environmental research is the lack of large-scale, public training datasets. Most publicly available datasets still focus on classical areas of machine learning like face recognition [112], self-driving cars or medical imagery. There are few very specialized datasets available for tree species or plant recognition [141]. These datasets represent best available imagery or specific use cases and often do not represent an operational dataset needed for the CNN to learn correct feature representations.

3.1.4 Chapter objectives

The study presented in this chapter has two key goals. The first goal is to propose a multi-stage strategy for building large imagery datasets for research in urban areas, with only a limited amount of manual annotation needed. To this end, I propose a method that integrates readily available geospatial information (i.e. environmental in-situ datasets, such as street-tree inventories) and geo-tagged street-view imagery, leveraging the tree detection method presented in chapter 2. The second goal is to create a DL model to classify urban tree genera from street-level imagery.
I explore state-of-the-art procedures in transfer learning for fine-grained classification problems. Ultimately, I create a new dataset of tree genera hotspots for the Metro Vancouver area in British Columbia (BC), Canada, to inform urban bio-surveillance management and planning.

3.2 Data and methods

3.2.1 Case study site

The Metro Vancouver region spans 2,700 square kilometers, including the three cities (Vancouver, Surrey, and Coquitlam) introduced in chapter 2, section 2.2.1. Metro Vancouver is a federation of 21 municipalities, with 26 urban centers ranging in size and character. Given their proximity to one of the major trade nodes to Asia, urban forests in the Metro Vancouver region are particularly vulnerable to invasive tree pests and insects, such as ALB, AGM, DED or SOD, arriving through international trade [147] (see section 1.1.2). Detailed maps of tree genera can, for example, help urban bio-surveillance managers to identify areas of high invasion risk and target areas for visual inspections according to host tree abundance and accumulation (see section 1.1.3). In this study, I define areas of high invasion risk as areas that show an accumulation of detected trees of the same genus (as investigated by KDE), located in relative proximity to points of entry for pests and pathogens, such as industrial areas or ports, or connected to areas where the assessed tree genus can be observed. Pseudotsuga, Thuja, Acer, Ulmus, Quercus and Fraxinus are native and planted tree genera that are abundant within the Metro Vancouver area and can act as hosts to ALB, AGM, DED or SOD, pests and pathogens that are expected to arrive in the region in the future [69, 120, 127, 147].

Metro Vancouver imagery dataset

A GSV imagery dataset for assessing tree species distribution in parts of the Metro Vancouver area (excluding Abbotsford, Mission, Maple Ridge, Langley Township and Pitt Meadows) was acquired from GSV in 2017.
The dataset consists of a total of over 2 million images of size 512x512 pixels, predominantly collected from April to September 2017. It contains images for over 690,000 car positions spread over Metro Vancouver, with four images per car position representing a Field of View (FOV) of 0°, 90°, 180° and 270° from true north, respectively.

Tree genus classification training dataset

To build a tree genera classification model to classify the retrieved images of Metro Vancouver, I developed a workflow to build a labeled dataset containing approximately 40,000 curated training images for 120 different tree genera compiled into 41 genera classes and one “other genera” class. For a full list of tree genera classes see appendix D. The method for download and retrieval of imagery for building training, development and testing datasets for tree genus classification is described in the section on the multi-stage strategy for building the tree genera dataset.

Full mapping workflow

Computer vision tasks are frequently approached as end-to-end DL problems. In end-to-end learning, learning is highly automated, meaning that all stages of learning are performed as a holistic learning process (i.e. detection and classification of trees as one model and learning process). The main drawback of end-to-end DL models is that they usually require enormous amounts of data to train (millions of images) [49]. For the task of tree genera classification from street-level imagery, I adopted a strategy of sub-problem solving, which requires less training imagery than training end-to-end DL models [49] (see fig 3.1). In an automated manner, I first detected trees in street scenes, as described in chapter 2, and then classified cropped images displaying the tree of interest in the center of the image. With this strategy I obtained a tree count per street-level car location as well as a tree genus label for each detected tree.
Lastly, I created maps of kernel-density estimates, using the open source package “seaborn”, to visualize hot spots of tree genera in the Metro Vancouver area [143].

Figure 3.1: Tree genus classification workflow. GSV images are cropped to display single trees and further augmented to build training, development and testing datasets.

3.2.3 Tree detection

In this study, MASK R-CNN was used to detect and outline trees in imagery by generating bounding boxes and binary segmentation masks for each tree instance (see section 2.2.4). These tree detections were then used for both tree location, as presented in chapter 2, and the following genus classification model. For the tree genus classification workflow, MASK R-CNN's detections were used for two purposes. First, the tree detection module, developed in chapter 2, was used to build a tree genera dataset, as outlined in section 3.2.4. Second, MASK R-CNN was applied to the full Metro Vancouver dataset, and the generated bounding boxes were used to extract images of single trees for single-label tree genus classification. The MASK R-CNN architecture, its training and evaluation process, as well as the chosen training data, are described in detail in appendix B.

3.2.4 Multi-stage strategy for building tree genera dataset

I proposed a multi-stage strategy to rapidly collect and sample tree genera images from street-level imagery providers for training a tree genus classifier. The challenge of this task was to match known occurrences of tree genera recorded in Vancouver's official street-tree inventory to pictures of the recorded trees retrieved from GSV, to create a labeled dataset for training.
The main stages of the proposed strategy to create this labeled street-level imagery dataset involved: 1) image acquisition, 2) cropping images to the tree of interest, and 3) manual removal of erroneously labeled images.

Step 1: Imagery acquisition

I leveraged the location and genus information of existing street tree inventory data for the city of Vancouver to semi-automatically create a training and testing tree genera imagery dataset. Vancouver street tree inventory data contains geographic coordinates for manually recorded individual trees, each connected to a single species and genus label. Metadata of street-level imagery from different providers (i.e. crowd-sourced Mapillary data or proprietary GSV data) generally contains the camera position at the time of image capture in geographical coordinates and the bearing of the image center in relation to magnetic north in degrees. Given the spatial relation between the tree location recorded in the street tree inventory and the camera position at the time of image capture, I calculated the imagery bearing necessary to display the selected, manually recorded tree location in the middle of the image (see fig 3.2 (a)). I then downloaded all available imagery with the calculated bearing as input for the image center, a FOV of 90° and a pixel resolution of 512x512 pixels from the GSV platform in June 2018 [144] (see fig 3.2 (b)). The choice of a relatively wide FOV of 90° accounted for known GPS errors of 2-12 meters in urban environments for both the car position at time of image capture and the tree location recorded in the urban tree inventory [149]. Given these inaccuracies in GPS measurements, and the associated error in calculated bearings, a wide FOV ensured that each downloaded image displayed the tree of interest recorded in the tree inventory, even if the tree was offset from the center of the image.

Figure 3.2: Training data generation for tree genus classification.
Using the existing street tree inventory, the closest GSV car coordinates to each recorded tree are calculated (a) and a corresponding image is requested (b). All trees in the requested image are detected with a trained MASK R-CNN model (c) and the closest and largest bounding boxes or tree detections are chosen to represent the street tree (d). Images are cropped to display the selected single tree in the center for training the tree genus classification model (e).

Step 2: Cropping images to tree of interest

I post-processed images in order to create an image displaying a single tree, connected to the correct label from Vancouver's street tree inventory. I first applied the trained MASK R-CNN model to detect all trees present in the image, as described in chapter 2 and appendix B (see fig 3.2 (c)). I then ran a monocular depth estimation model to create a dense depth layer for each image, as described in section 2.2.5. I computed the distance in meters between each tree detected in the image and the camera position at the time of image capture by extracting the depth value of the pixel located at the center of mass of each calculated tree mask. I then selected one tree per downloaded GSV image as the labeled street tree, under the assumption that the particular street tree must be the tree with the smallest depth value, i.e. the closest tree to the camera position at the time of image capture (see fig 3.2 (d)). Each GSV image was cropped according to the bounding box of the selected tree, previously computed by MASK R-CNN (see fig 3.2 (e)).

Step 3: Manual removal of erroneous images

Lastly, I manually inspected the created training dataset of cropped images, one genus class at a time, to identify all cropped images of trees with an incorrect label. Images that displayed a tree genus that did not correspond to the matched label were discarded.
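Steps 1 and 2 can be sketched as follows. This is an illustration only: the function names, the dictionary layout of a detection and the sample coordinates are assumptions, and the great-circle bearing formula (which is relative to true north) is a standard choice that the text does not confirm as the one used.

```python
import math


def bearing_to_tree(cam_lat, cam_lon, tree_lat, tree_lon):
    """Initial great-circle bearing from camera to tree, in degrees.

    0 = north, 90 = east, measured clockwise; inputs are decimal degrees.
    """
    phi_cam = math.radians(cam_lat)
    phi_tree = math.radians(tree_lat)
    dlon = math.radians(tree_lon - cam_lon)
    east = math.sin(dlon) * math.cos(phi_tree)
    north = (math.cos(phi_cam) * math.sin(phi_tree)
             - math.sin(phi_cam) * math.cos(phi_tree) * math.cos(dlon))
    return math.degrees(math.atan2(east, north)) % 360.0


def pick_street_tree(detections):
    """Return the detection closest to the camera (smallest depth value)."""
    return min(detections, key=lambda det: det["depth_m"])


# Hypothetical camera and inventory tree coordinates.
heading = bearing_to_tree(49.2600, -123.1140, 49.2601, -123.1139)

# Hypothetical detections for one image: bounding box (x0, y0, x1, y1) in
# pixels and the depth sampled at the mask's center of mass, in meters.
detections = [
    {"box": (30, 80, 200, 400), "depth_m": 18.4},
    {"box": (210, 40, 470, 500), "depth_m": 7.9},
    {"box": (400, 200, 512, 380), "depth_m": 24.1},
]
street_tree = pick_street_tree(detections)
print(round(heading, 1), street_tree["box"])
```

The selected bounding box would then define the crop. Because the rule relies on depth alone, a closer private tree or a detection error can still be selected, which is the kind of mislabeled crop that the manual inspection in Step 3 removes.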
Furthermore, I discarded all images of size 50 KB or smaller, as visual analysis revealed that their resolution was not fit for training the classifier. Following the recommendations of [15] for tree species classification, all genera with an image sample set of over 125 images per class were used as separate classes for training the genus classifier. All genera with an image sample set of under 125 images per class were combined under the class label "other". This resulted in a tree genus dataset of 41 tree genera and one "other" class (see fig. 3.3).

3.2.5 Tree genus classification

To classify the detected trees into one of the 42 fine-grained genera classes, I trained an image classification model using a novel DL framework built on PyTorch. The framework is an open source software package, designed for researchers and DL practitioners to quickly build and iteratively train DL models with state-of-the-art guidance on best practices for training. I used a transfer-learning approach with a modified RESNET50 architecture and a softmax classifier [58].

Balancing the training dataset

First, I split the genera dataset into training, development and test datasets with an 80:10:10 ratio. I sought to prevent the classifier from over-fitting on tree genera which dominate the genera dataset due to their abundance in Vancouver, e.g. Acer or Prunus with over 5,000 images per class. In order to balance classes in the training dataset, I under-sampled the respective classes, meaning I selected a maximum of 4,000 images per class and removed the rest from training. I then over-sampled all other classes [17]. First, I added all downloaded, uncropped images of the respective tree, and second, I multiplied images using a NumPy random number generator (assuming a univariate Gaussian distribution) until all classes were equalized to a count of 4,000 images per class.

Figure 3.3: Examples of tree genera dataset and data augmentation.
GSV images are cropped to display single trees and further augmented to build training, development and testing datasets. The final training size of each image is 256x256 pixels.

Mixup and data augmentation

Mixup is a novel data augmentation technique known to improve the generalization error of models and to avoid the memorization of corrupted labels [150]. As the name suggests, mixup constructs a training image by mixing two random examples from the training set, and their labels, through linear interpolation (60% image one, 40% image two) [150]. In order to avoid over-fitting on the oversampled dataset, I implemented mixup. Additionally, several data augmentation techniques were applied to imagery as mixed training images were fed into the model. Data augmentation was applied randomly and included a horizontal flip with a probability of 50%, a rotation of up to 15°, a zoom of up to 150%, lighting and contrast changes of magnitude up to 0.4 and a symmetric warp of magnitude up to 0.2 (see fig. 3.3).

Mixed precision training and progressive resizing

Mixed precision training performs operations within the model using smaller data types when possible, so-called half-precision or 16-bit floating points (FP16), alongside single-precision floating points (FP32), which improves training time and decreases the use of memory [93, 107]. I implemented mixed precision training to speed up the computational training process [63]. Furthermore, I used progressive resizing of images, starting from 64x64, over 128x128, to 256x256 pixels, for training the respective models [112]. As training with smaller images was less memory intensive, the training process was accelerated by learning to distinguish tree genera on coarse resolution images first, in comparison to training on larger resolution images from the beginning. I used the trained weights from each model with a smaller image size to initiate the training process of the model with the next bigger image size.
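The mixup interpolation described in this section can be written down in a few lines. The following is a minimal NumPy sketch, not the framework implementation used for training: the function names are hypothetical, labels are assumed to be one-hot vectors, and the fixed mixing weight of 0.6 simply mirrors the 60%/40% example above (implementations of mixup commonly draw this weight at random for every pair).

```python
import numpy as np


def one_hot(class_index, n_classes):
    """Return a one-hot label vector for an integer class index."""
    label = np.zeros(n_classes, dtype=float)
    label[class_index] = 1.0
    return label


def mixup_pair(img_a, label_a, img_b, label_b, lam=0.6):
    """Linearly interpolate two images and their one-hot labels.

    lam is the weight of the first example; (1 - lam) weights the second.
    Both images must share the same shape.
    """
    mixed_img = lam * img_a + (1.0 - lam) * img_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_img, mixed_label
```

Because the labels are mixed with the same weight as the pixels, the classifier is trained against a soft target instead of a single hard genus label.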
I trained the RESNET50 implementation on an NVIDIA GPU (Tesla P100-PCIE-12GB) with 32 CPU cores and 32 GB of memory.

Evaluation

I used mAP as an evaluation metric, where precision for each genus class (p_class) was defined as the number of correctly classified trees (true positives, TP) divided by the sum of the number of correctly classified trees (TP) and the number of incorrectly classified trees (false positives, FP) falling into the specific genus class. mAP was then calculated as the weighted mean of all class precisions (p_1, p_2, ..., p_n), with the corresponding number of training images per class as weights (w_1, w_2, ..., w_n):

$$p_{class} = \frac{TP_{class}}{TP_{class} + FP_{class}} \tag{3.1}$$

$$mAP = \frac{\sum_{i=1}^{n} w_i \, p_i}{\sum_{i=1}^{n} w_i} \tag{3.2}$$

In addition to mAP, I used top-3 accuracy to assess model performance. Top-3 accuracy was defined as the percentage of images of one class whose ground truth genus label is within the three highest ranked predicted labels. Last, I compared the classification of separate genera classes in a detailed confusion matrix.

3.3 Experiments

3.3.1 Classification performance

The model achieved a mAP of > 82% and a top-3 accuracy of > 95% for 41 different tree genera and one "other" class in both development and test datasets (see tab. 3.1). I closely examined classification precision for different genera classes (see fig. 3.4): the model classified 6 genera with over 90%, 13 genera with over 80% and 27 genera with over 70% precision; 12 genera were classified with a precision under 70%. Laburnum, Abies, Ilex, Prunus and Betula genera display the highest classification precision. Amelanchier, Ginkgo, Cercis, Juglans and Tsuga could not be classified successfully.

Table 3.1: Classification accuracy for development and test sets

                   top-3 accuracy   mAP
development set    95.0%            82.4%
test set           95.5%            82.9%

The confusion matrix for the test dataset revealed that most tree genera were successfully distinguished from one another (see fig. 3.5).
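The evaluation metrics of section 3.2.5 (equations 3.1 and 3.2 and top-3 accuracy) can be sketched with NumPy as follows. The function names are hypothetical. Two choices in the sketch are assumptions, not taken from this work: a class that is never predicted is assigned a precision of 0, and top-k accuracy is computed over all images at once instead of per class.

```python
import numpy as np


def class_precision(y_true, y_pred, n_classes):
    """Per-class precision TP / (TP + FP), as in equation 3.1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precision = np.zeros(n_classes)
    for c in range(n_classes):
        predicted_c = y_pred == c
        tp = np.sum(predicted_c & (y_true == c))
        fp = np.sum(predicted_c & (y_true != c))
        # Assumption: precision is set to 0 when the class is never predicted.
        precision[c] = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return precision


def weighted_mean_precision(precision, weights):
    """Weighted mean of the class precisions, as in equation 3.2."""
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * precision) / np.sum(weights))


def top_k_accuracy(y_true, scores, k=3):
    """Share of samples whose true label is among the k highest scores.

    scores: array of shape (n_samples, n_classes) with class scores.
    """
    y_true = np.asarray(y_true)
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(top_k == y_true[:, None], axis=1)
    return float(np.mean(hits))
```

Passing the number of training images per class as `weights` reproduces the weighting described for equation 3.2.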
Few high values diverging from the center line were observed, indicating overall high prediction accuracy, precision and recall for all genera classes. Cercis, Amelanchier, Tsuga and Ginkgo were the most confused genera classes (see fig. 3.5).

Figure 3.4: Precision for different genus classes in the test dataset. Up to 27 genera can be classified with a precision over 0.7 (green).

Figure 3.5: Confusion matrix for tree genera classification. Darker shades indicate higher classification accuracy, precision and recall. The darker blue the colour of the confusion matrix, the higher the displayed percentage value. The dark blue diagonal center line indicates that in most cases, the model generated a prediction matching the actual tree genus displayed on the analyzed image.

3.3.2 Hotspot maps of Metro Vancouver

I detected a total of > 4 million trees in the street-level imagery dataset of the Metro Vancouver area. Generated image sizes varied from a minimum of 2 KB to a maximum of 450 KB, with a mean of 32 KB (see fig. 3.6). An image of 256x256 pixels, as used for the last iteration of training the genus model, corresponded approximately to 65,000 pixels or 64 KB in size. Visual analysis of generated images revealed that images under 20 KB or 30,000 pixels in size typically represented trees far away from the camera position of image capture (see fig. 3.6). Images over a threshold of 20 KB were used to generate the following hotspot maps (see fig. 3.7).

Figure 3.6: Distribution of sizes of generated tree cutouts with examples.

Applying the tree genera classifier to cropped images retrieved through tree detection with MASK R-CNN, I built maps of specific tree genera hotspots through KDE for Metro Vancouver, to aid bio-surveillance planning and management.
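The filtering and density estimation behind such hotspot maps can be illustrated with a small NumPy sketch. It does not reproduce the KDE settings used for figure 3.7. The function names, the fixed isotropic Gaussian bandwidth and the use of projected coordinates in meters are assumptions for illustration. The pairwise distance matrix also makes it suitable for small samples only, not for millions of detections.

```python
import numpy as np


def filter_by_size(points, sizes_kb, min_kb=20.0):
    """Keep detections whose cutout file size exceeds min_kb."""
    points = np.asarray(points, dtype=float)
    keep = np.asarray(sizes_kb, dtype=float) > min_kb
    return points[keep]


def gaussian_kde_grid(points, grid_x, grid_y, bandwidth):
    """Evaluate a 2D Gaussian kernel density estimate on a regular grid.

    points: array of shape (n, 2) with projected x/y coordinates in meters.
    grid_x, grid_y: 1D arrays of grid coordinates in the same units.
    bandwidth: kernel standard deviation in the same units (assumed fixed).
    Returns an array of shape (len(grid_y), len(grid_x)).
    """
    points = np.asarray(points, dtype=float)
    gx, gy = np.meshgrid(grid_x, grid_y)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    # Squared distance between every grid cell and every detection.
    diff = grid[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2) / (2.0 * np.pi * bandwidth ** 2)
    return kernel.mean(axis=1).reshape(gx.shape)
```

Running the estimate separately on the detections of each genus would yield one density surface per genus, comparable in spirit to the panels of a hotspot map.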
Displayed in figure 3.7 are two coniferous (Pseudotsuga, Thuja) and four deciduous (Acer, Ulmus, Quercus, Fraxinus) tree genera. All six genera are currently under threat by invasive pests and pathogens and of high interest in bio-surveillance campaigns (see tab. 3.2). Appendix C provides a visual example, comparing the generated, underlying dataset used for KDE maps to existing street tree inventory data.

Figure 3.7: Tree genera distributions in Metro Vancouver. Kernel density estimates are shown for two coniferous, Pseudotsuga and Thuja, and four deciduous, Acer, Quercus, Fraxinus and Ulmus, tree genera.

Table 3.2: Selected occurrences of tree genera in Metro Vancouver. Count of generated detections and example threats of two coniferous, Pseudotsuga, Thuja, and four deciduous, Acer, Ulmus, Quercus, Fraxinus, tree genera

Genus          Count      Native   Threatened by
ACER           493,000    x        Asian Long-horned Beetle (ALB)
THUJA          175,000    x        Asian Gypsy Moth (AGM)
FRAXINUS       110,000             Emerald Ash Borer (EAB)
QUERCUS         95,000    x        Sudden Oak Death (SOD)
PSEUDOTSUGA     90,000    x        Sudden Oak Death (SOD)
ULMUS           47,000             Dutch Elm Disease (DED)

Generated data and KDE maps helped to answer diverse questions about the genera composition of Metro Vancouver's urban forest, ranging from where most trees were detected (Vancouver West), to the highest percentage of assessed coniferous trees (North Vancouver), or the highest percentage of Fraxinus (new settlements in East Vancouver). KDE maps for both coniferous genera, which are native to Vancouver, show that Pseudotsuga and Thuja were found throughout Metro Vancouver, but were especially abundant in less densely populated areas, close to provincial parks or nature reserves, e.g. Stanley Park, North Vancouver or Coquitlam. Ulmus was mainly detected in the city of Vancouver, West Vancouver and East Vancouver, but was not observed in Surrey or Richmond. In comparison, Acer hotspots were highly interconnected and spread widely over the region.
This suggests that in the event of an infestation of Acer, negative impacts of pests and pathogens could be far reaching for Metro Vancouver, as pests and pathogens spread more easily and quickly when host trees are interconnected.

The highest count of tree observations on imagery among the displayed tree genera of interest was recorded for Acer, followed by Thuja. Trees of the genus Ulmus were observed the least in this example, with a total of approximately 50,000 occurrences in street-level images (see tab. 3.2). A full list of genera and corresponding occurrences in street-level imagery can be found in appendix D.

3.4 Discussion

3.4.1 Classification model performance

The model presented in this chapter is the first method currently available for tree genera classification from street-level imagery that tests applicability to a larger area like Metro Vancouver. Cercis, Amelanchier, Tsuga and Ginkgo were the most confused and lowest performing genera classes. These four classes were strongly underrepresented in the training, development and testing datasets, with < 1% of the total dataset for each class in the testing dataset (see fig. 3.8). In contrast, Laburnum, Abies and Ilex, which are amongst the classes with the best model performance, are also among the image class datasets with < 1% of the total dataset (see fig. 3.8). Differences in model performance for these 6 classes could result from: 1) tree genera structures that are generally very difficult to classify, through either high heterogeneity within the genus class or similarities to other genera, or 2) the low amount of remaining test images after selecting 100 training images not being sufficient to accurately represent the real world distribution of potential GSV images of the respective class. It appears that 100 training examples per class are not enough to train a model to successfully classify Cercis, Amelanchier, Tsuga and Ginkgo.
Both of the above-named issues could be addressed in future by increasing training and testing data for the respective classes.

On the other hand, the class with the highest number of training examples (Acer) was not the class with the highest classification precision (see fig. 3.4 and fig. 3.8). This suggests that certain genera could be harder to classify, owing either to similarities to different genera classes or to high heterogeneity within the respective genus class [123]. In the latter case it might be beneficial to separate these genera out to classify different tree species found within the genus. Assessing which of the above factors (lack of training imagery, insufficient testing data, heterogeneity within the genus class, similarity between genera) is the ultimate cause of lower model performance for certain classes is currently still a challenging and time consuming task due to the lack of tools to interpret DNNs [22]. This gap in available tools and standardized workflows opens up possibilities for future research in assessing the influence of tree genera class structure and training data availability on classification performance.

Figure 3.8: Images per class in the tree genus classification test dataset. All classes have a minimum of 125 images in the training set, all remaining images are in the development and test datasets. Up to 8 genera classes (yellow) have under 50 images (red) in the test dataset.

3.4.2 Transferability to other areas

Retraining the presented model with labelled imagery from other cities and including a greater number of tree genera would make it possible to use the model to assess urban forest diversity more extensively. However, as beneficial climatic conditions have made Vancouver's urban forest one of Canada's most diverse forests, the model was trained using a large dataset containing > 1000 images per genus class that characterise a wide range of trees found within a genus, located in differing urban environments.
The model could be used to analyse the distribution of genera with a high number of training images (> 50 images in the test dataset) and good model performance (> 70% accuracy) in other urban environments with sufficient street-level imagery taken in summer, including but not restricted to: Acer, Prunus, Tilia, Fraxinus, Carpinus, Ulmus, Betula, Magnolia, Platanus, Thuja, Pinus, Pseudotsuga, Sorbus, Cercidiphyllum, Cedrus, Metasequoia, Malus and Quercus.

GSV imagery, processed for training the model, is currently predominantly available for the months of April to October, when most tree genera display leaves. An expansion of the tree genera training data to different seasons would increase the generality of the model, as it could be used to classify tree images taken at any time. Even though bio-surveillance management focuses on campaign planning for leaf-on tree conditions, when harmful pests and pathogens are the most active and the most likely to be detected, other smart urban forest management activities, e.g. the health assessment of allergy potential from urban trees, might benefit from information about tree diversity before spring starts (see section 4.2).

In addition to the current imagery being constrained to certain times of the year, the process of generating training data was restricted to Vancouver's public street and boulevard trees, which limits the ability to train the model for rare tree genera found only on private property. For an exhaustive urban tree diversity assessment, future work should focus on the development of tools for, or the collection of imagery of, genera that cannot be assessed through the presented training data generation workflow, including tree genera on private property and in parks (see section 4.4).

3.5 Conclusion

I successfully analysed tree genera distribution across the Metro Vancouver area, using DL for the classification of over 2 million street-level images.
To facilitate CNN training, I presented a method to rapidly and semi-automatically collect a large training dataset. I trained a fine-grained tree genera classification model with a mean average precision of 83% for 41 different tree genera and one "other" class, including a total of 125 genera in the analysis. Integrated into smart urban forest management, the presented workflow and model for analysing tree genera distributions in urban environments presents the opportunity to aid bio-surveillance campaign planning to detect invasive pests and insects early. The approach, coupled with publicly available street-level imagery, could enhance urban forest diversity assessments through more detailed information on trees located on private property and has the potential to generate information on tree genera more rapidly and over larger areas than has been possible to date through manual data collection to update tree inventories. Depending on the genera of interest, the workflow can be reproduced to retrain the model on new genera classes or the model can directly be transferred to other urban environments.

Chapter 4

Conclusions

4.1 Key findings

This thesis presented exploratory research in developing a method for automatically detecting, locating and classifying urban trees. In the context of smart urban forest management, a combination of novel DL architectures and cost-efficient street-level imagery was used to generate urban tree inventory data over a large urban spatial extent. The developed method relied solely on street-level imagery as a data input instead of the more costly or less detailed aerial or satellite imagery that many other models require (see chapter 1 and section 1.2.1). The novelty of this method was enhanced in that monocular depth estimation and triangulation were used to predict tree locations without a dependence on complementary information or aerial imagery (see chapter 2).
Finally, a reproducible and fast approach to generate a tree genera classification dataset was presented and maps of urban tree genera distributions for the Metro Vancouver area were created (see chapter 3).

4.1.1 Tree detection with Mask R-CNN

Trees were detected through training and using the MASK R-CNN architecture for instance segmentation. Experimental results for performance and transferability of tree instance segmentation were demonstrated for four cities (Vancouver, Surrey, Coquitlam and Pasadena) and two data sources (GSV and Mapillary). MASK R-CNN was successfully trained with a minimal amount of training images (48 images) and a layered training approach integrating open source imagery datasets (i.e. COCO Stuff) to identify fuzzy objects like trees to a high precision. The experimental results of this study demonstrated that a layered training approach allowed for more accurate instance segmentation of trees, compared to using only transfer learning. Tree instance segmentation results (0.6-0.7 AP50) were consistent with current tree or plant semantic segmentation performances found in other studies, with the added value that this work also distinguished between different tree objects [18]. The combination of DL and street-level imagery showed promising results for the detection of different tree shapes and sizes in various urban ecosystems and urban management regimes and was not limited to the use of the same sensor or dataset, without the need for extensive retraining.

4.1.2 Tree geolocation with monocular depth estimation

Trees were located using one or multiple street-level photographs, combining monocular depth estimations generated with the monoDepth model with tree detection masks and the location and bearing information of each photograph. Initial tree location predictions were enhanced using triangulation that required no additional or contextual information.
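How a camera position, an image bearing and a depth estimate combine into an initial location prediction can be illustrated with a simplified sketch. It is not the geolocation module of chapter 2: the function names are hypothetical, the mapping from pixel column to viewing angle is assumed to be linear across the field of view, the offset is computed with a local flat-earth approximation that is only reasonable over tens of meters, and the triangulation step is left out.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def pixel_bearing(image_bearing, pixel_x, image_width, fov_deg):
    """Bearing towards a pixel column, assuming angle is linear in pixel_x."""
    offset = (pixel_x / image_width - 0.5) * fov_deg
    return (image_bearing + offset) % 360.0


def project_location(cam_lat, cam_lon, bearing_deg, distance_m):
    """Offset a camera position by distance_m along bearing_deg.

    Returns (lat, lon) in degrees using a local flat-earth approximation.
    """
    theta = math.radians(bearing_deg)
    dlat = distance_m * math.cos(theta) / EARTH_RADIUS_M
    dlon = distance_m * math.sin(theta) / (
        EARTH_RADIUS_M * math.cos(math.radians(cam_lat)))
    return cam_lat + math.degrees(dlat), cam_lon + math.degrees(dlon)
```

In this sketch, `pixel_x` would be the column of a tree mask's center of mass and `distance_m` the depth estimate at that pixel. Predictions of the same tree from several photographs could then be combined, which is the role triangulation plays in the workflow.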
Tree detection with MASK R-CNN in combination with monocular depth estimation was able to provide a basis for street tree location prediction that is comparable to manually conducted ground truth measurements with handheld GPS devices in urban environments. Over 70% of trees measured on the ground were successfully located for four different plots (Surrey, Vancouver, Coquitlam urban center and residential area). The geolocation of street trees, mainly found in the Vancouver area, with a mean error of around 4 meters was approximately 2 meters more accurate than for private trees seen from the street (6 meters), predominantly recorded in Surrey and Coquitlam. The presented method allowed for the assessment of trees on private property, a part of the urban forest for which cities are still lacking information.

4.1.3 Tree genus classification

Fine-grained classification across different tree genera from imagery is a challenging task even for humans [13]. To facilitate tree genera classification in urban environments, a method for rapid sampling of tree genera training images from GSV was presented. A tree genera dataset of 40,000 images compiled for 41 fine-grained tree genera and one "other genera" class, including a total of 80 different tree genera, was created. This dataset was used to train a CNN for tree genera classification with 83% mean average precision. The model was applied to generate tree genera distribution maps over the Metro Vancouver area and could be used in the future for other urban areas, provided that sufficient street-level imagery can be acquired to use as training samples.

4.2 Implications

4.2.1 Deep learning for bio-surveillance planning

The goal of this research was to create and assess a methodology that has the potential to improve the consistency and availability of urban tree inventory data across different regional authorities and scales. New data can help inform decision makers for bio-surveillance efforts and urban forest management.
For example, an open and reproducible DL approach resulting in more accurate and detailed tree inventories could add significant value to identifying and targeting areas of high infestation risk in existing bio-surveillance investigations, particularly in cases where infestation risk and impact is predicated on species composition and forest structure [40]. Improving urban forest inventories and subsequently identifying trees with high infestation risk using DL techniques can allow decision makers to proactively prevent, monitor and manage forest invasive alien species outbreaks at a higher temporal resolution than currently possible [88].

4.2.2 A new baseline for risk assessment

Diversity in structure and function is crucial to urban forest resilience, as exemplified by the outbreaks that have devastated monocultures of elm (Ulmus) and ash (Fraxinus) trees across cities in Canada and the United States. Urban tree biodiversity and the connectivity of tree canopy support wildlife habitat, contribute to the protection of native ecosystems, and enhance the ability of urban ecosystems and people to adapt to climate change. A detailed urban tree inventory can be used to quantify the monetary value of, and to manage, the suite of ecosystem services provided by biodiverse urban forests, including ecological, health, recreational and aesthetic benefits [126]. Tree clusters or groups of trees, for example, will generate more services (such as cooling) compared to a single tree. Large trees will generate greater ecosystem services value than smaller trees.

Weighing the cost of managing and protecting urban trees against the benefits or services they provide is often used as a baseline for risk assessment and decision making. In combination with other spatial information derived from LIDAR or high resolution satellite RS data (e.g.
tree health, tree structure), the trained NN can, for example, improve bio-surveillance efforts through implementation into a decision support pipeline for FIS risk analysis. The presented tool could also help map tree genera that are more drought-tolerant, contributing to climate adaptation strategies of cities that are expected to be affected by more frequent heat waves.

4.2.3 Smart urban forest management

The research goals identified for this thesis (see section 1.3) are of interest to many industry and government participants. According to Östberg et al. [155], tree species or genus and location of urban trees have been named some of the most needed urban tree inventory parameters by various city officials and researchers. Municipalities often do not have the resources or capacity to carry out complete inventories of their urban forest resource, not to mention consistent updates once a baseline inventory has been completed. Efficient, cost-effective, and reliable urban tree inventory techniques are sorely needed to provide cities with the tools for strategic urban forest planning and management. This research also highlights a novel way in which technology can be used to monitor urban forests and enable more proactive decision making about urban biodiversity, which could be considered a contribution to smart urban forestry.

4.2.4 A novel method for environmental research

Lastly, this study represents a project at the forefront of introducing state-of-the-art DL frameworks to environmental management and decision making. It is expected to not only produce a cost-efficient and openly available tree inventory generation framework, but also to inform research needs for other fields of study.
By introducing and showcasing how new AI concepts can be leveraged for environmental RS and object detection, I intend to inspire their application to generate new solutions and expect far-reaching future implications for the fields of environmental management and global change studies.

4.3 Limitations

Detecting, mapping and classifying urban trees from street-level imagery is a complex and challenging task. As a novel approach to generate tree inventory data, this project encountered critical limitations that require further thought for application and research in future (see section 4.4).

4.3.1 Tree visibility on street-level imagery

The presented methodological workflow is limited to assessing trees that can be seen on street-level imagery. This often excludes parts of the urban forest that are not visible from the street, for example trees found in backyards and trees in parks. Furthermore, as outlined in chapter 2, 70% of all ground-truth measurements were matched in the analysis. Thus, 30% of front yard and street trees were not recorded in our detection and location predictions, either due to erroneous localization or due to detection errors, e.g. through occlusion by other trees (see chapter 2). Additionally, the performance in detecting and locating trees in other parts of the urban forest (e.g., urban woodlands and parks) was not assessed. Even though the developed tool helps to gain insights about the urban forest going beyond street trees, it is still unclear what proportion of the urban forest can be recorded.

4.3.2 Availability of street-level imagery

Another key limitation for the future application of the developed software and workflow is the availability of spatially coherent street-level imagery. Increasingly, street-level imagery providers update their terms of use to prohibit the large-scale processing and extraction of information from the provided data source (this is in particular the case for GSV, which updated its terms of use in September 2018).
The purpose of closing many geospatial data sources is predominantly to avoid costly lawsuits in case street-level imagery were used to collect sensitive and private information (correspondence I. Seiferling, December 2018, MIT). Other providers still cannot generate large spatial coverage within cities and over different countries and regions (Mapillary).

Lastly, standards among service providers differ. As a result, different services provide street-level imagery of varying quantity and quality. GSV spaces its camera positions of image capture roughly every 15 meters, whereas Mapillary provides imagery spaced one meter or less apart. While GSV makes it a requirement to collect data with high resolution panoramic cameras only, Mapillary imagery ranges from low resolution smartphone camera images to very high resolution images from panoramic cameras and lenses. Similarly to GSV, Bing StreetSide and OpenStreetCam have certain standards in place before an image is made publicly available on their platform; however, their image resolution is, to date, still lower than that of GSV. The presented methodology requires processing of panoramic or high resolution street-level imagery. It has not been tested for compatibility and performance with lower-resolution imagery from providers other than GSV and Mapillary. Lastly, the methodology is restricted to assessing areas with sufficient, high-resolution street-level imagery coverage only.

4.3.3 Limited tree genera training data

Related to the above, the generated tree genera dataset and tree genera classification model are mainly targeted at identifying planted trees and trees on boulevards rather than trees found in local parks, greenbelts, backyards or other local natural forests. Planted trees on developed sites that can be identified well and are abundant in Metro Vancouver include: Acer, Prunus, Quercus, Tilia, Platanus, Fagus, Thuja, Malus, Carpinus and Magnolia.
Due to the genera dataset being developed mainly on the basis of Vancouver's existing street tree inventory, classification accuracy for native trees is generally lower than for planted trees. This raises the question of how far the developed tree genera classification model is applicable to native trees in urban woodlands. Hence, the chosen training and evaluation methodology is limited in that it does not assess urban tree classification accuracy for the largest group of trees, namely native trees, because forest trees vastly outnumber planted trees on streets, in yards and in cultivated park areas: Alnus and Populus dominate the deciduous list in abundance and size; Thuja, Tsuga and Pseudotsuga dominate the coniferous list in abundance and size. Improving training data for native trees is particularly important in suburbs outside of the City of Vancouver (Langley, Maple Ridge, Surrey etc.), as many of the roadside trees are native trees growing on road allowance or near the road on private property.

4.4 Future research directions

This research has demonstrated the value of using street-level imagery and DL architectures for smart urban forestry management. Current limitations open up a range of avenues for future methodological development, testing applications and research.

4.4.1 Assessing different data sources

A promising research avenue to assess the urban forest as a whole is the combination of data sources from different perspectives, such as the side view from street-level imagery and an aerial perspective from aerial imagery or LIDAR data [13, 144]. Mapping urban trees from multiple perspectives could have the potential to overcome the limitation of missing tree instances in street-level imagery presented in section 4.3. While street-level imagery remains the most valuable data source for fine-grained urban tree species classification, a combination with aerial imagery or LIDAR data has the potential to provide more accurate localization of trees.
Furthermore, a baseline count of urban trees from aerial data could help quantify the percentage of urban trees that were not detected on street-level images. Knowing how many park and private trees are not included in the genera classification could provide insights into how applicable the developed tree genera maps are for bio-surveillance efforts. The difference in LIDAR and street-level imagery tree counts could further help to identify areas where a denser street-level imagery coverage could be needed. Additionally, a combination with aerial imagery and LIDAR data could contribute other information, such as tree health (multispectral aerial imagery) or tree structure (LIDAR), to allow for a more comprehensive assessment of the urban forest state in the future.

Similarly, using video imagery instead of street-level imagery could provide more detail of urban scenes, leading to more accurate tree location predictions and to the detection of private trees otherwise occluded. MASK R-CNN as well as YOLO are DL architectures that also perform well for instance segmentation in video datasets. Novel methods such as optical flow, Structure from Motion (SFM) or Simultaneous Localisation and Mapping (SLAM) could provide the basis for an improved geolocation module. Such research could be beneficial for smart urban forest management applications that require a higher level of detail but only need assessments for smaller urban areas, such as genera classification for areas directly located at ports or commercial zones of very high FIS risk.

4.4.2 Crowd-sourcing and street-level imagery collection

To overcome limitations caused by low or no availability of spatially coherent, high resolution street-level imagery, new avenues of imagery collection in urban areas could be evaluated. Data could, for example, be crowd-sourced through mobile phone applications or citizen scientist campaigns, engaging citizens in smart urban forest management.
Including citizens in the acquisition of imagery could add the benefit of also covering private areas or parks. Adding these images to the presented workflow could allow urban forest managers to include otherwise undetected private and park trees in the tree inventory (see Section 4.3).

Alternatively, street-level imagery could be captured through professional campaigns using the same cameras, sensors, techniques and standards as GSV. This could give urban forest managers more control over the season, spacing or frequency of image acquisition. Data collected through citizen engagement or smart urban forest management campaigns could be hosted for free through services like Mapillary and processed as presented in this research. Future research questions could include: How does the cost of acquiring tree inventory data through manual measurements, LIDAR or hyperspectral surveys compare to data generated through street-level images? How do tree detection and geolocation results change with different spacing between camera positions? How do seasonality, image resolution and other external circumstances like weather influence tree detection and classification results? How frequently can tree inventories be updated using the proposed workflow from data capture to the final map product?

4.4.3 Methodological adaption for bio-surveillance

The achieved results for fine-grained tree genus classification raise the question: What are the future possibilities and where are the limits for urban tree assessments with DL architectures and street-level imagery? The tree genera identified in this research could be extended in number and characteristics, e.g. through inclusion of more native genera as outlined above (see Section 4.3) or rare genera found in urban environments.
A next step after assessing tree classification at the genus level could be a more detailed tree species classification, focusing on species particularly endangered by arriving FIS or vulnerable to climate change impacts, such as droughts and rising temperatures. Additionally, specifically for bio-surveillance and early detection of tree pests and pathogens, it would be interesting to directly assess tree health from street-level imagery. Impacts of tree pests and pathogens, such as the large holes in tree bark caused by ALB, can already be visually identified by professionals from photographs. Generating training datasets and developing a model to detect different FIS impacts, for example defoliation, discolouring or holes, in combination with regular imagery captures could open up opportunities to detect and contain the spread of FIS at an early stage.

4.4.4 Green smart cities of the future

Finally, the generated dataset opens up a range of avenues for future research in smart urban forest management. Accurate masking of trees and the position of generated tree masks could provide the basis for a variety of quantitative urban forest assessments. For example, a detailed analysis of tree masks to extract information about tree structure could allow estimating ecosystem function using proxies. Qualitatively comparing different generated tree masks could provide valuable information on the aesthetic appeal of different tree types and shapes, providing insights into cultural ecosystem services provided by street trees. Location information of urban trees in combination with a genus label can help answer questions such as: What effect do urban trees have on the livability and resilience of our cities [53]? What is the value and range of ecosystem services provided by urban forests [60]? How can urban trees help adapt to and mitigate impacts of climate change on cities [38]? How do urban trees contribute to human health and well-being [137]?
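To make the idea of mask-derived structural proxies mentioned above more concrete, the Python sketch below computes a few simple pixel-level descriptors from a binary tree mask. It is a hypothetical illustration only: the function name `mask_proxies`, the chosen descriptors and the toy mask are assumptions, and converting pixel measures into metric tree dimensions would additionally require depth or scale information that is not modelled here.

```python
def mask_proxies(mask):
    """Return simple pixel-level descriptors of a binary tree mask.

    mask is a list of equally long rows holding 0 (background) or 1 (tree).
    All outputs are in pixels; no metric scaling is attempted.
    """
    tree_rows = [r for r, row in enumerate(mask) if any(row)]
    if not tree_rows:
        return {"area_px": 0, "height_px": 0, "width_px": 0, "fill_ratio": 0.0}

    n_cols = len(mask[0])
    tree_cols = [c for c in range(n_cols) if any(row[c] for row in mask)]

    area = sum(sum(1 for value in row if value) for row in mask)
    height = tree_rows[-1] - tree_rows[0] + 1
    width = tree_cols[-1] - tree_cols[0] + 1

    return {
        "area_px": area,
        "height_px": height,
        "width_px": width,
        # Share of the bounding box covered by the mask, a crude shape proxy.
        "fill_ratio": area / (height * width),
    }


# Toy mask (assumed): a wide crown above a narrow stem.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

proxies = mask_proxies(toy_mask)
print(proxies)
```

For the toy mask the sketch reports an area of 8 pixels inside a bounding box 4 pixels high and 3 pixels wide. Whether such descriptors relate to ecosystem function would have to be established empirically.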
In conclusion, street-level imagery in combination with DL brings a new perspective to assessing urban forests.

Bibliography

[1] nightrome/cocostuff: The official homepage of the COCO-Stuff dataset. URL → page 111
[2] COCO - Common Objects in Context, December 2018. URL → pages xiii, 111
[3] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. page 21. → page 117
[4] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow: matterport/Mask_RCNN, 2017. URL RCNN. original-date: 2017-10-19T20:28:34Z. → pages 35, 36
[5] S. Agarwal, Y. Furukawa, N. Snavely, B. Curless, S. M. Seitz, and R. Szeliski. Reconstructing Rome. Computer, 43(6):40–47, June 2010. ISSN 0018-9162. doi:10.1109/MC.2010.175. → page 29
[6] Michael Alonzo, Bodo Bookhagen, and Dar A. Roberts. Urban tree species mapping using hyperspectral and lidar data fusion. Remote Sensing of Environment, 148:70–83, May 2014. ISSN 0034-4257. doi:10.1016/j.rse.2014.03.018. URL → pages 7, 27
[7] Alexis A. Alvey. Promoting and preserving biodiversity in the urban forest. Urban Forestry & Urban Greening, 5(4):195–201, December 2006. ISSN 1618-8667. doi:10.1016/j.ufug.2006.09.003. URL → pages 2, 54
[8] Mark Ambrose, Frank Koch, and Denis Yemshanov. Modeling Urban Host Tree Distributions for Invasive Forest Pests Using a Multi-Step Approach. World Conference on Natural Resource Modeling, June 2016. URL → page 5
[9] Myla FJ Aronson, Christopher A. Lepczyk, Karl L. Evans, Mark A. Goddard, Susannah B. Lerman, J. Scott MacIvor, Charles H. Nilon, and Timothy Vargo. Biodiversity in the city: key challenges for urban green space management.
Frontiers in Ecology and the Environment, 15(4):189–196, 2017. ISSN 1540-9309. doi:10.1002/fee.1480. URL → page 6
[10] Josselin Aval, Jean Demuynck, Emmanuel Zenou, Sophie Fabre, David Sheeren, Mathieu Fauvel, Karine Adeline, and Xavier Briottet. Detection of individual trees in urban alignment from airborne data and contextual information: A marked point process approach. ISPRS Journal of Photogrammetry and Remote Sensing, 146:197–210, December 2018. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2018.09.016. URL → page 27
[11] Karen Bakker and Max Ritts. Smart Earth: A meta-review and implications for environmental governance. Global Environmental Change, 52:201–211, September 2018. ISSN 0959-3780. doi:10.1016/j.gloenvcha.2018.07.011. URL → page 3
[12] Nina Bassuk and Thomas Whitlow. Environmental Stress In Street Trees. Arboricultural Journal, 12(2):195–201, May 1988. ISSN 0307-1375, 2168-1074. doi:10.1080/03071375.1988.9746788. URL → page 2
[13] Adam Berland and Daniel A. Lange. Google Street View shows promise for virtual street tree surveys. Urban Forestry & Urban Greening, 21:11–15, January 2017. ISSN 1618-8667. doi:10.1016/j.ufug.2016.11.006. URL → pages 7, 8, 9, 28, 76, 81, 117
[14] Zhou Bolei. COCO + Places 2017 Challenge, 2017. URL → pages 33, 35, 36
[15] Steve Branson, Jan Dirk Wegner, David Hall, Nico Lang, Konrad Schindler, and Pietro Perona. From Google Maps to a fine-grained catalog of street trees. ISPRS Journal of Photogrammetry and Remote Sensing, 135:13–30, January 2018. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2017.11.008. URL → pages 9, 17, 28, 34, 62, 116
[16] Eckehard G. Brockerhoff, Andrew M. Liebhold, Brian Richardson, and David M. Suckling. Eradication of invasive forest insects: concepts, methods, costs and benefits. New Zealand Journal of Forestry Science, 40(suppl):S117–S135, 2010. URL → page 5
[17] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski.
A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, October 2018. ISSN 0893-6080. doi:10.1016/j.neunet.2018.07.011. URL 1710.05381. → page 62
[18] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and Stuff Classes in Context. arXiv:1612.03716 [cs], December 2016. URL arXiv: 1612.03716. → pages xiii, 17, 19, 33, 35, 37, 41, 44, 52, 76, 111, 119
[19] Bill Yang Cai, Xiaojiang Li, Ian Seiferling, and Carlo Ratti. Treepedia 2.0: Applying Deep Learning for Large-scale Quantification of Urban Tree Cover. August 2018. URL → pages 9, 18, 35
[20] K. Cai, W. Shao, X. Yin, and G. Liu. Co-segmentation of aircrafts from high-resolution satellite images. In 2012 IEEE 11th International Conference on Signal Processing, volume 2, pages 993–996, October 2012. doi:10.1109/ICoSP.2012.6491746. → pages 17, 31
[21] Joëlle Salomon Cavin and Christian A. Kull. Invasion ecology goes to town: from disdain to sympathy. Biological Invasions, 19(12):3471–3487, December 2017. ISSN 1387-3547, 1573-1464. doi:10.1007/s10530-017-1588-9. URL → page 4
[22] Supriyo Chakraborty, Richard Tomsett, Ramya Raghavendra, Daniel Harborne, Moustafa Alzantot, Federico Cerutti, Mani Srivastava, Alun Preece, Simon Julier, Raghuveer M. Rao, Troy D. Kelley, Dave Braines, Murat Sensoy, Christopher J. Willis, and Prudhvi Gurram. Interpretability of deep learning models: A survey of results. In 2017 IEEE SmartWorld, Ubiquitous Intelligence Computing, Advanced Trusted Computed, Scalable Computing Communications, Cloud Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pages 1–6, August 2017. doi:10.1109/UIC-ATC.2017.8397411. → page 72
[23] X. Chen, S. Xiang, C. Liu, and C. Pan. Vehicle Detection in Satellite Images by Hybrid Deep Convolutional Neural Networks. IEEE Geoscience and Remote Sensing Letters, 11(10):1797–1801, October 2014. ISSN 1545-598X. doi:10.1109/LGRS.2014.2309695.
→ pages 17, 31
[24] G. Cheng, P. Zhou, and J. Han. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing, 54(12):7405–7415, December 2016. ISSN 0196-2892. doi:10.1109/TGRS.2016.2601622. → pages 17, 31
[25] Liang Cheng, Yi Yuan, Nan Xia, Song Chen, Yanming Chen, Kang Yang, Lei Ma, and Manchun Li. Crowd-sourced pictures geo-localization method based on street view images and 3d reconstruction. ISPRS Journal of Photogrammetry and Remote Sensing, 141:72–85, July 2018. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2018.04.006. URL → page 29
[26] François Chollet. Deep Learning with Python. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2017. ISBN 978-1-61729-443-3. → pages xii, 11, 12, 13, 14, 17, 35, 118, 119, 120
[27] Sylvain Christin, Éric Hervet, and Nicolas Lecomte. Applications for deep learning in ecology. Methods in Ecology and Evolution, 10(10):1632–1644, 2019. ISSN 2041-210X. doi:10.1111/2041-210X.13256. URL → page 10
[28] Dan Cireşan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, August 2012. ISSN 0893-6080. doi:10.1016/j.neunet.2012.02.023. URL → page 14
[29] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016. URL cvpr 2016/html/Cordts The Cityscapes Dataset CVPR 2016 paper.html. → page 38
[30] Paul J. Crutzen. The “Anthropocene”. In Eckart Ehlers and Thomas Krafft, editors, Earth System Science in the Anthropocene, pages 13–18. Springer, Berlin, Heidelberg, 2006. ISBN 978-3-540-26590-0. doi:10.1007/3-540-26590-2_3. URL 3. → page 1
[31] Jifeng Dai, Kaiming He, and Jian Sun.
Convolutional Feature Masking for Joint Object and Stuff Segmentation. pages 3992–4000, 2015. URL cvpr 2015/html/Dai Convolutional Feature Masking 2015 CVPR paper.html. → page 111
[32] Adam G. Dale and Steven D. Frank. The Effects of Urban Warming on Herbivore Abundance and Street Tree Condition. PLOS ONE, 9(7):e102996, July 2014. ISSN 1932-6203. doi:10.1371/journal.pone.0102996. URL → page 3
[33] Paul M. Dare. Shadow Analysis in High-Resolution Satellite Imagery of Urban Areas. Photogrammetric Engineering & Remote Sensing, 71(2):169–177, February 2005. ISSN 0099-1112. doi:10.14358/PERS.71.2.169. URL → page 27
[34] Jesse Davis and Mark Goadrich. The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 233–240, New York, NY, USA, 2006. ACM. ISBN 978-1-59593-383-6. doi:10.1145/1143844.1143874. URL event-place: Pittsburgh, Pennsylvania, USA. → page 37
[35] Tahia Devisscher, Lorien Nesbitt, and Adrina C. Bardekjian. Main Findings and Trends of Urbanization | Forestry in the Midst of Global Changes | Taylor & Francis Group. In Forestry in the Midst of Global Changes, page 446. Taylor & Francis, December 2018. ISBN 978-1-315-28237-4. URL → page 1
[36] Fábio Duarte and Carlo Ratti. What Big Data Tell Us About Trees and the Sky in the Cities. In Klaas De Rycke, Christoph Gengnagel, Olivier Baverel, Jane Burry, Caitlin Mueller, Minh Man Nguyen, Philippe Rahm, and Mette Ramsgaard Thomsen, editors, Humanizing Digital Reality: Design Modelling Symposium Paris 2017, pages 59–62. Springer Singapore, Singapore, 2018. ISBN 978-981-10-6611-5. doi:10.1007/978-981-10-6611-5_6. URL 6. → pages 9, 10, 28
[37] T Elmqvist, H Setälä, SN Handel, S van der Ploeg, J Aronson, JN Blignaut, E Gómez-Baggethun, DJ Nowak, J Kronenberg, and R de Groot. Benefits of restoring ecosystem services in urban areas. Current Opinion in Environmental Sustainability, 14:101–108, June 2015. ISSN 1877-3435. doi:10.1016/j.cosust.2015.05.001.
URL → page 2
[38] Theodore A. Endreny. Strategically growing the urban forest will improve our world. Nature Communications, 9(1):1–3, March 2018. ISSN 2041-1723. doi:10.1038/s41467-018-03622-0. URL → pages 1, 2, 3, 4, 84
[39] Environment Canada. An invasive alien species strategy for Canada. Environment Canada, Ottawa - Ontario, September 2004. → page 55
[40] Rebecca S. Epanchin-Niell, Robert G. Haight, Ludek Berec, John M. Kean, and Andrew M. Liebhold. Optimal surveillance and eradication of invasive species in heterogeneous landscapes. Ecology Letters, 15(8):803–812, August 2012. ISSN 1461-0248. doi:10.1111/j.1461-0248.2012.01800.x. URL http:// → page 77
[41] Gianluca Falco, Marco Pini, and Gianluca Marucco. Loose and Tight GNSS/INS Integrations: Comparison of Performance Assessed in Real Urban Scenarios. Sensors, 17(2):255, February 2017. doi:10.3390/s17020255. URL → page 51
[42] Marcin Feltynowski, Jakub Kronenberg, Tomasz Bergier, Nadja Kabisch, Edyta Łaszkiewicz, and Michael W. Strohbach. Challenges of urban green space management in the face of using inadequate data. Urban Forestry & Urban Greening, 31:56–66, April 2018. ISSN 1618-8667. doi:10.1016/j.ufug.2017.12.003. URL → page 2
[43] Mirijam Gaertner, John R. U. Wilson, Marc W. Cadotte, J. Scott MacIvor, Rafael D. Zenni, and David M. Richardson. Non-native species in urban environments: patterns, processes, impacts and challenges. Biological Invasions, 19(12):3461–3469, December 2017. ISSN 1387-3547, 1573-1464. doi:10.1007/s10530-017-1598-7. URL → page 55
[44] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, and Jose Garcia-Rodriguez. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv:1704.06857 [cs], April 2017. URL arXiv: 1704.06857. → page 117
[45] Timnit Gebru, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng, Erez Lieberman Aiden, and Li Fei-Fei. Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States.
Proceedings of the National Academy of Sciences, 114(50):13108–13113, December 2017. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1700035114. URL → page 9
[46] A Geiger, P Lenz, C Stiller, and R Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, September 2013. ISSN 0278-3649. doi:10.1177/0278364913491297. URL → page 38
[47] S.E Gill, J.F Handley, A.R Ennos, and S Pauleit. Adapting Cities for Climate Change: The Role of the Green Infrastructure. Built Environment, 33(1):115–133, March 2007. ISSN 0263-7960. doi:10.2148/benv.33.1.115. URL → page 6
[48] Edward L. Glaeser, Scott Duke Kominers, Michael Luca, and Nikhil Naik. Big Data and Big Cities: The Promises and Limitations of Improved Measures of Urban Life. Economic Inquiry, 56(1):114–137, 2018. ISSN 1465-7295. doi:10.1111/ecin.12364. URL → page 9
[49] Tobias Glasmachers. Limits of End-to-End Learning. page 16. → page 58
[50] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. arXiv:1609.03677 [cs, stat], September 2016. URL arXiv: 1609.03677. → pages 29, 37, 38, 51
[51] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016. ISBN 978-0-262-03561-3. Google-Books-ID: Np9SDQAAQBAJ. → pages xii, 12, 13, 14, 16, 17, 35, 36, 119
[52] Google. Street View – Where We’ve Been & Where We’re Headed Next, 2018. URL → page 8
[53] Nancy B. Grimm, Stanley H. Faeth, Nancy E. Golubiewski, Charles L. Redman, Jianguo Wu, Xuemei Bai, and John M. Briggs. Global Change and the Ecology of Cities. Science, 319(5864):756–760, February 2008. ISSN 0036-8075, 1095-9203. doi:10.1126/science.1150195. URL → pages 1, 2, 83
[54] Raia Hadsell, Pierre Sermanet, Jan Ben, Ayse Erkan, Marco Scoffier, Koray Kavukcuoglu, Urs Muller, and Yann LeCun. Learning long-range vision for autonomous off-road driving. Journal of Field Robotics, 26(2):120–144, February 2009. ISSN 1556-4967. doi:10.1002/rob.20276.
URL → page 16
[55] Terry Hartig and Peter H. Kahn. Living in cities, naturally. Science, 352(6288):938–940, May 2016. ISSN 0036-8075, 1095-9203. doi:10.1126/science.aaf3759. URL → page 1
[56] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised Learning. In Trevor Hastie, Robert Tibshirani, and Jerome Friedman, editors, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, pages 485–585. Springer New York, New York, NY, 2009. ISBN 978-0-387-84858-7. doi:10.1007/978-0-387-84858-7_14. URL 14. → page 11
[57] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, October 2017. doi:10.1109/ICCV.2017.322. → pages xiii, 9, 18, 23, 35, 36, 116, 117
[58] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. pages 770–778, 2016. URL cvpr 2016/html/He Deep Residual Learning CVPR 2016 paper.html. → pages 62, 117
[59] H. Hirschmuller. Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, February 2008. ISSN 0162-8828. doi:10.1109/TPAMI.2007.1166. → page 29
[60] Ngaio Hotte, Lorien Nesbitt, Sara Barron, Judith Cowan, and Zhaohua Cindy Cheng. The Social and Economic Values of Canada’s Urban Forests: A National Synthesis. UBC Faculty of Forestry University of British Columbia, April 2015. URL ’s-Urban-Forests-A-National-Synthesis-2015.pdf. → pages 1, 5, 83
[61] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to Segment Every Thing. page 9, November 2017. → page 117
[62] Katherine Isaac, Lee Beaulieu, Cameron Owen, Angela Danyluk, and Ben Mulhall. Urban Forest Strategy. page 60, 2018.
→ pages 5, 31, 54, 56
[63] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. arXiv:1807.11205 [cs, stat], July 2018. URL arXiv: 1807.11205. → page 64
[64] Nadia Kabisch, Horst Korn, Jutta Stadler, and Aletta Bonn, editors. Nature-Based Solutions to Climate Change Adaptation in Urban Areas: Linkages between Science, Policy and Practice. Theory and Practice of Urban Sustainability Transitions. Springer International Publishing, 2017. ISBN 978-3-319-53750-4. doi:10.1007/978-3-319-56091-5. URL → page 1
[65] Michael Kampffmeyer, Arnt-Borre Salberg, and Robert Jenssen. Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks. pages 1–9, 2016. URL cvpr 2016 workshops/w19/html/Kampffmeyer Semantic Segmentation of CVPR 2016 paper.html. → page 18
[66] Jian Kang, Marco Körner, Yuanyuan Wang, Hannes Taubenböck, and Xiao Xiang Zhu. Building instance classification using street view images. ISPRS Journal of Photogrammetry and Remote Sensing, 145:44–59, November 2018. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2018.02.006. URL → pages 9, 17, 31
[67] Yinghai Ke and Lindi J. Quackenbush. A review of methods for automatic individual tree-crown detection and delineation from passive remote sensing. International Journal of Remote Sensing, 32(17):4725–4747, September 2011. ISSN 0143-1161. doi:10.1080/01431161.2010.494184. URL → pages 6, 27
[68] Julie Kjeldsen-Kragh Keller and Cecil C Konijnendijk. Short Communication: A Comparative Analysis of Municipal Urban Tree Inventories of Selected Major Cities in North America and Europe. page 8, 2012. → pages 5, 6, 7
[69] Maggi Kelly, Qinghua Guo, Desheng Liu, and David Shaari.
Modeling the risk for a new invasive forest disease in the United States: An evaluation of five environmental niche models. Computers, Environment and Urban Systems, 31(6):689–710, November 2007. ISSN 0198-9715. doi:10.1016/j.compenvurbsys.2006.10.002. URL → pages 26, 52, 57
[70] Mate Kisantal, Zbigniew Wojna, Jakub Murawski, Jacek Naruniec, and Kyunghyun Cho. Augmentation for small object detection. arXiv:1902.07296 [cs], February 2019. URL arXiv: 1902.07296. → pages 37, 43, 45
[71] Frank H. Koch, Mark J. Ambrose, Denys Yemshanov, P. Eric Wiseman, and F. D. Cowett. Modeling urban distributions of host trees for invasive forest insects in the eastern and central USA: A three-step approach using field inventory data. Forest Ecology and Management, 417:222–236, May 2018. ISSN 0378-1127. doi:10.1016/j.foreco.2018.03.004. URL → page 55
[72] Vladimir A. Krylov, Eamonn Kenny, and Rozenn Dahyot. Automatic Discovery and Geotagging of Objects from Street View Imagery. Remote Sensing, 10(5):661, May 2018. doi:10.3390/rs10050661. URL → pages 44, 47
[73] Labelbox Inc. Labelbox: The best way to create and manage training data. URL → page 33
[74] Yann LeCun, Yoshua Bengio, and T Bell Laboratories. Convolutional Networks for Images, Speech, and Time-Series. In The handbook of brain theory and neural networks, pages 255–258. MIT Press, October 1998. → page 14
[75] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015. ISSN 0028-0836, 1476-4687. doi:10.1038/nature14539. URL → pages 12, 42
[76] S. Lefèvre, D. Tuia, J. D. Wegner, T. Produit, and A. S. Nassar. Toward Seamless Multiview Scene Analysis From Satellite to Street Level. Proceedings of the IEEE, 105(10):1884–1899, October 2017. ISSN 0018-9219. doi:10.1109/JPROC.2017.2684300. → page 29
[77] Songnian Li, Suzana Dragicevic, Francesc Antón Castro, Monika Sester, Stephan Winter, Arzu Coltekin, Christopher Pettit, Bin Jiang, James Haworth, Alfred Stein, and Tao Cheng.
Geospatial big data handling theory and methods: A review and research challenges. ISPRS Journal of Photogrammetry and Remote Sensing, 115:119–133, May 2016. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2015.10.012. URL → pages 8, 28
[78] Xiaojiang Li and Carlo Ratti. Using Google Street View for Street-Level Urban Form Analysis, a Case Study in Cambridge, Massachusetts. In Luca D’Acci, editor, The Mathematics of Urban Morphology, Modeling and Simulation in Science, Engineering and Technology, pages 457–470. Springer International Publishing, Cham, 2019. ISBN 978-3-030-12381-9. doi:10.1007/978-3-030-12381-9_20. URL 20. → page 9
[79] Xiaojiang Li, Chuanrong Zhang, Weidong Li, Robert Ricard, Qingyan Meng, and Weixing Zhang. Assessing street-level urban greenery using Google Street View and a modified green view index. Urban Forestry & Urban Greening, 14(3):675–685, January 2015. ISSN 1618-8667. doi:10.1016/j.ufug.2015.06.006. URL → page 9
[80] Xiaojiang Li, Carlo Ratti, and Ian Seiferling. Mapping Urban Landscapes Along Streets Using Google Street View. In Michael P. Peterson, editor, Advances in Cartography and GIScience, Lecture Notes in Geoinformation and Cartography, pages 341–356. Springer International Publishing, 2017. ISBN 978-3-319-57336-6. → pages 9, 28
[81] Xiaojiang Li, Carlo Ratti, and Ian Seiferling. Quantifying the shade provision of street trees in urban landscape: A case study in Boston, USA, using Google Street View. Landscape and Urban Planning, 169:81–91, January 2018. ISSN 0169-2046. doi:10.1016/j.landurbplan.2017.08.011. URL → page 9
[82] Xun Li, Wendy Y. Chen, Giovanni Sanesi, and Raffaele Lafortezza. Remote Sensing in Urban Forestry: Recent Applications and Future Directions. Remote Sensing, 11(10):1144, January 2019. doi:10.3390/rs11101144. URL → page 27
[83] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár.
Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs], May 2014. URL 1405.0312. → pages xii, 9, 18, 111
[84] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature Pyramid Networks for Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, Honolulu, HI, July 2017. IEEE. ISBN 978-1-5386-0457-1. doi:10.1109/CVPR.2017.106. URL → page 117
[85] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A. W. M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, December 2017. ISSN 1361-8415. doi:10.1016/ URL → page 16
[86] David M. Lodge, Paul W. Simonin, Stanley W. Burgiel, Reuben P. Keller, Jonathan M. Bossenbroek, Christopher L. Jerde, Andrew M. Kramer, Edward S. Rutherford, Matthew A. Barnes, Marion E. Wittmann, W. Lindsay Chadderton, Jenny L. Apriesnig, Dmitry Beletsky, Roger M. Cooke, John M. Drake, Scott P. Egan, David C. Finnoff, Crysta A. Gantz, Erin K. Grey, Michael H. Hoff, Jennifer G. Howeth, Richard A. Jensen, Eric R. Larson, Nicholas E. Mandrak, Doran M. Mason, Felix A. Martinez, Tammy J. Newcomb, John D. Rothlisberger, Andrew J. Tucker, Travis W. Warziniack, and Hongyan Zhang. Risk Analysis and Bioeconomics of Invasive Species to Inform Policy and Management. Annual Review of Environment and Resources, 41(1):453–488, 2016. doi:10.1146/annurev-environ-110615-085532. URL → page 4
[87] A. Lopes, S. Oliveira, M. Fragoso, J.A. Andrade, and P. Pedro. Wind Risk Assessment in Urban Environments: The Case of Falling Trees During Windstorm Events in Lisbon. In Katarína Střelcová, Csaba Mátyás, Axel Kleidon, Milan Lapin, František Matejka, Miroslav Blaženec, Jaroslav Škvarenina, and Ján Holécy, editors, Bioclimatology and Natural Hazards, pages 55–74. Springer Netherlands, Dordrecht, 2009.
ISBN 978-1-4020-8875-9 978-1-4020-8876-6. doi:10.1007/978-1-4020-8876-6_5. URL 5. → page 2
[88] Gary M. Lovett, Marissa Weiss, Andrew M. Liebhold, Thomas P. Holmes, Brian Leung, Kathy Fallon Lambert, David A. Orwig, Faith T. Campbell, Jonathan Rosenthal, Deborah G. McCullough, Radka Wildova, Matthew P. Ayres, Charles D. Canham, David R. Foster, Shannon L. LaDeau, and Troy Weldy. Nonnative forest insects and pathogens in the United States: Impacts and policy options. Ecological Applications, 26(5):1437–1455, July 2016. ISSN 1939-5582. doi:10.1890/15-1176. URL → pages 4, 5, 77
[89] Lei Ma, Yu Liu, Xueliang Zhang, Yuanxin Ye, Gaofei Yin, and Brian Alan Johnson. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS Journal of Photogrammetry and Remote Sensing, 152:166–177, June 2019. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2019.04.015. URL → pages 8, 28
[90] Daniel W. McKenney, John H. Pedlar, Denys Yemshanov, D. Barry Lyons, Kathy Campbell, and Kate Lawrence. Estimates of the Potential Cost of Emerald Ash Borer (Agrilus planipennis Fairmaire) in Canadian Municipalities. 2012. → page 54
[91] Emily K. Meineke, Robert R. Dunn, Joseph O. Sexton, and Steven D. Frank. Urban Warming Drives Insect Pest Abundance on Street Trees. PLOS ONE, 8(3):e59687, March 2013. ISSN 1932-6203. doi:10.1371/journal.pone.0059687. URL → page 2
[92] Jeff Michels, Ashutosh Saxena, and Andrew Y. Ng. High Speed Obstacle Avoidance Using Monocular Vision and Reinforcement Learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 593–600, New York, NY, USA, 2005. ACM. ISBN 978-1-59593-180-1. doi:10.1145/1102351.1102426. URL event-place: Bonn, Germany. → page 29
[93] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. arXiv:1710.03740 [cs, stat], October 2017. URL arXiv: 1710.03740.
→ page 64
[94] Ariane Middel, Jonas Lukasczyk, Sophie Zakrzewski, Michael Arnold, and Ross Maciejewski. Urban form and composition of street canyons: A human-centric big data and deep learning approach. Landscape and Urban Planning, 183:122–132, March 2019. ISSN 0169-2046. doi:10.1016/j.landurbplan.2018.12.001. URL → page 9
[95] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 0028-0836, 1476-4687. doi:10.1038/nature14236. URL → page 12
[96] J. Morgenroth, J. Östberg, C. Konijnendijk van den Bosch, A. B. Nielsen, R. Hauer, H. Sjöman, W. Chen, and M. Jansson. Urban tree diversity – Taking stock and looking ahead. Urban Forestry & Urban Greening, 15:1–5, January 2016. ISSN 1618-8667. doi:10.1016/j.ufug.2015.11.003. URL → page 55
[97] Michael J. Mortimer and Brian Kane. Hazard tree liability in the United States: Uncertain risks for owners and professionals. Urban Forestry & Urban Greening, 2(3):159–165, January 2004. ISSN 1618-8667. doi:10.1078/1618-8667-00032. URL → page 2
[98] Giorgos Mountrakis, Jun Li, Xiaoqiang Lu, and Olaf Hellwich. Deep learning for remotely sensed data. ISPRS Journal of Photogrammetry and Remote Sensing, 145:1–2, November 2018. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2018.08.011. URL → pages 17, 30
[99] Nikhil Naik, Scott Duke Kominers, Ramesh Raskar, Edward L Glaeser, and César A Hidalgo. Do People Shape Cities, or Do Cities Shape People? The Co-evolution of Physical, Social, and Economic Change in Five Major U.S. Cities. page 38, October 2015. → page 9
[100] Nikhil Naik, Scott Duke Kominers, Ramesh Raskar, Edward L. Glaeser, and César A. Hidalgo. Computer vision uncovers predictors of physical urban change.
Proceedings of the National Academy of Sciences, 114(29):7571–7576, July 2017. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1619003114. URL → page 9
[101] Lorien Nesbitt, Ngaio Hotte, Sara Barron, Judith Cowan, and Stephen R. J. Sheppard. The social and economic value of cultural ecosystem services provided by urban forests in North America: A review and suggestions for future research. Urban Forestry & Urban Greening, 25:103–111, July 2017. ISSN 1618-8667. doi:10.1016/j.ufug.2017.05.005. URL → page 2
[102] Anders B Nielsen, Johan Östberg, and Tim Delshammar. Review of Urban Tree Inventory Methods Used to Collect Data at Single-Tree 17, 2014. → pages 5, 6, 7, 26, 27
[103] Sophie Nitoslawski and Peter Duinker. Managing Tree Diversity: A Comparison of Suburban Development in Two Canadian Cities. Forests, 7, May 2016. doi:10.3390/f7060119. → page 32
[104] Sophie A. Nitoslawski, Peter N. Duinker, and Peter G. Bush. A review of drivers of tree diversity in suburban areas: Research needs for North American cities. Environmental Reviews, 24(4):471–483, December 2016. ISSN 1181-8700, 1208-6053. doi:10.1139/er-2016-0027. URL → pages 54, 55
[105] Sophie A. Nitoslawski, Nadine J. Galle, Cecil Konijnendijk Van Den Bosch, and James W. N. Steenberg. Smarter ecosystems for smarter cities? A review of trends, technologies, and turning points for smart urban forestry. Sustainable Cities and Society, 51:101770, November 2019. ISSN 2210-6707. doi:10.1016/j.scs.2019.101770. URL → page 3
[106] David J. Nowak, Satoshi Hirabayashi, Allison Bodine, and Eric Greenfield. Tree and forest effects on air quality and human health in the United States. Environmental Pollution, 193:119–129, October 2014. ISSN 0269-7491. doi:10.1016/j.envpol.2014.05.028. URL → pages 2, 26
[107] NVIDIA. Deep Learning SDK Documentation, October 2019. URL → page 64
[108] Vancouver Board of Parks and Recreation. Biodiversity strategy. Technical report, Vancouver Board of Parks and Recreation, Vancouver, 2016.
→ page 54
[109] Trudy Paap, Treena I. Burgess, and Michael J. Wingfield. Urban trees: bridge-heads for forest pest invasions and sentinels for early detection. Biological Invasions, 19(12):3515–3526, December 2017. ISSN 1387-3547, 1573-1464. doi:10.1007/s10530-017-1595-x. → pages 3, 5
[110] Ashlyn L. Padayachee, Ulrike M. Irlich, Katelyn T. Faulkner, Mirijam Gaertner, Şerban Procheş, John R. U. Wilson, and Mathieu Rouget. How do invasive species travel to and through urban environments? Biological Invasions, 19(12):3557–3570, December 2017. ISSN 1387-3547, 1573-1464. doi:10.1007/s10530-017-1596-9. → pages 2, 4, 26
[111] Sinno Jialin Pan and Qiang Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, October 2010. ISSN 1041-4347. doi:10.1109/TKDE.2009.191. → page 35
[112] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep Face Recognition. In Proceedings of the British Machine Vision Conference 2015, pages 41.1–41.12, Swansea, 2015. British Machine Vision Association. ISBN 978-1-901725-53-7. doi:10.5244/C.29.41. → pages 56, 64
[113] Mason F. Patterson, P. Eric Wiseman, Matthew F. Winn, Sang-mook Lee, and Philip A. Araman. Effects of photographic distance on tree crown attributes calculated using urbancrowns image analysis software. Arboriculture & Urban Forestry 37(4):173-179, 37(4):173–179, 2011. → page 7
[114] Jill D Pokorny. Urban Tree Risk Management: A Community Guide to Program Design and Implementation. Technical report, 2003. → page 2
[115] Therese M. Poland and Deborah G. McCullough. Emerald Ash Borer: Invasion of the Urban Forest and the Threat to North America's Ash Resource. Journal of Forestry, 104(3):118–124, March 2006. ISSN 0022-1201. doi:10.1093/jof/104.3.118. → page 5
[116] Rassati Davide, Faccoli Massimo, Petrucco Toffolo Edoardo, Battisti Andrea, Marini Lorenzo, and Clough Yann. Improving the early detection of alien wood-boring beetles in ports and surrounding forests.
Journal of Applied Ecology, 52(1):50–58, September 2014. ISSN 0021-8901. doi:10.1111/1365-2664.12347. → pages 5, 55
[117] M. K. Ridd. Exploring a V-I-S (vegetation-impervious surface-soil) model for urban ecosystem analysis through remote sensing: comparative anatomy for cities. International Journal of Remote Sensing, 16(12):2165–2185, August 1995. ISSN 0143-1161. doi:10.1080/01431169508954549. → page 6
[118] Jérôme Rousselet, Charles-Edouard Imbert, Anissa Dekri, Jacques Garcia, Francis Goussard, Bruno Vincent, Olivier Denux, Christelle Robinet, Franck Dorkeld, Alain Roques, and Jean-Pierre Rossi. Assessing Species Distribution Using Google Street View: A Pilot Study with the Pine Processionary Moth. PLOS ONE, 8(10):e74918, October 2013. ISSN 1932-6203. doi:10.1371/journal.pone.0074918. → page 9
[119] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Malaysia; Pearson Education Limited, 2016. URL 123456789/4010. → page 11
[120] Jacques Régnière, Vince Nealis, and Kevin Porter. Climate suitability and management of the gypsy moth invasion into Canada. Biological Invasions, 11(1):135–148, January 2009. ISSN 1387-3547, 1573-1464. doi:10.1007/s10530-008-9325-z. → page 57
[121] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development, 3(3):210–229, July 1959. ISSN 0018-8646. doi:10.1147/rd.33.0210. → page 11
[122] Juergen Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Networks, 61:85–117, January 2015. ISSN 0893-6080. doi:10.1016/j.neunet.2014.09.003. URL 1404.7828. → pages 17, 33
[123] Tracey-Lee Schwets and Robert D Brown. Form and structure of maple trees in urban environments. Landscape and Urban Planning, 46(4):191–201, February 2000. ISSN 0169-2046. doi:10.1016/S0169-2046(99)00072-9. → page 71
[124] Ian Seiferling, Nikhil Naik, Carlo Ratti, and Raphaël Proulx. Green streets - Quantifying and mapping urban trees with street-level imagery and computer vision.
Landscape and Urban Planning, 165:93–101, September 2017. ISSN 0169-2046. doi:10.1016/j.landurbplan.2017.05.010. → pages 9, 28
[125] Hansi Senaratne, Amin Mobasheri, Ahmed Loai Ali, Cristina Capineri, and Mordechai (Muki) Haklay. A review of volunteered geographic information quality assessment methods. International Journal of Geographical Information Science, 31(1):139–167, January 2017. ISSN 1365-8816. doi:10.1080/13658816.2016.1189556. → page 8
[126] USDA Forest Service. i-Tree Streets, October 2018. → pages 5, 78
[127] Andrew J. Shatz, John Rogan, Florencia Sangermano, Jennifer Miller, and Arthur Elmes. Modeling the risk of spread and establishment for Asian longhorned beetle (Anoplophora glabripennis) in Massachusetts from 2008-2009. Geocarto International, 31(8):813–831, September 2016. ISSN 1010-6049. doi:10.1080/10106049.2015.1086901. → page 57
[128] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, October 2017. ISSN 1476-4687. doi:10.1038/nature24270. → page 12
[129] C. Small. Estimation of urban vegetation abundance by spectral mixture analysis. International Journal of Remote Sensing, 22(7):1305–1334, January 2001. ISSN 0143-1161. doi:10.1080/01431160151144369. → page 27
[130] Fiona Steele. Urban Forest Climate Adaptation Framework for Metro Vancouver. Technical report, Metro Vancouver, February 2016. → page 31
[131] Iain Stewart and Tim Oke. Thermal differentiation of local climate zones using temperature observations from urban and rural field sites. In Urban Climate, page 7, Keystone, Colorado, August 2010. → page 31
[132] Philip Stubbings, Joe Peskett, Francisco Rowe, and Dani Arribas-Bel.
A Hierarchical Urban Forest Index Using Street-Level Imagery and Deep Learning. Remote Sensing, 11(12):1395, January 2019. doi:10.3390/rs11121395. → pages 8, 9, 28, 37
[133] Mapillary AB Sweden. Map data at scale from street-level imagery. → page 34
[134] Beau Tippetts, Dah Jye Lee, Kirt Lillywhite, and James Archibald. Review of stereo vision algorithms and their suitability for resource-limited systems. Journal of Real-Time Image Processing, 11(1):5–25, January 2016. ISSN 1861-8219. doi:10.1007/s11554-012-0313-2. → page 29
[135] K. V. Tubby and J. F. Webber. Pests and diseases threatening urban trees under a changing climate. Forestry: An International Journal of Forest Research, 83(4):451–459, October 2010. ISSN 0015-752X. doi:10.1093/forestry/cpq027. → page 54
[136] Alan M. Turing. Computing Machinery and Intelligence. In Robert Epstein, Gary Roberts, and Grace Beber, editors, Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer, pages 23–65. Springer Netherlands, Dordrecht, 2009. ISBN 978-1-4020-6710-5. doi:10.1007/978-1-4020-6710-5_3. URL 3. → page 11
[137] Jessica B. Turner-Skoff and Nicole Cavender. The benefits of trees for livable and sustainable communities. PLANTS, PEOPLE, PLANET, 1(4):323–335, 2019. ISSN 2572-2611. doi:10.1002/ppp3.39. → pages 1, 84
[138] Konstantinos Tzoulas, Kalevi Korpela, Stephen Venn, Vesa Yli-Pelkonen, Aleksandra Kaźmierczak, Jari Niemela, and Philip James. Promoting ecosystem and human health in urban areas using Green Infrastructure: A literature review. Landscape and Urban Planning, 81(3):167–178, June 2007. ISSN 0169-2046. doi:10.1016/j.landurbplan.2007.02.001. → page 2
[139] DESA UN. World urbanization prospects: The 2018 revision. United Nations Department of Economics and Social Affairs, Population Division: New York, NY, USA, 2019. → page 1
[140] M. van den Bosch and Å Ode Sang.
Urban natural environments as nature-based solutions for improved public health – A systematic review of reviews. Environmental Research, 158:373–384, October 2017. ISSN 0013-9351. doi:10.1016/j.envres.2017.05.040. → pages 2, 26
[141] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist Species Classification and Detection Dataset. arXiv:1707.06642 [cs], July 2017. arXiv: 1707.06642. → page 56
[142] Kathleen T. Ward and Gary R. Johnson. Geospatial methods provide timely and comprehensive urban forest information. Urban Forestry & Urban Greening, 6(1):15–22, February 2007. ISSN 1618-8667. doi:10.1016/j.ufug.2006.11.002. → page 7
[143] Michael Waskom, Olga Botvinnik, Drew O'Kane, Paul Hobson, Saulius Lukauskas, David C Gemperline, Tom Augspurger, Yaroslav Halchenko, John B. Cole, Jordi Warmenhoven, Julian de Ruiter, Cameron Pye, Stephan Hoyer, Jake Vanderplas, Santi Villalba, Gero Kunter, Eric Quintero, Pete Bachant, Marcel Martin, Kyle Meyer, Alistair Miles, Yoav Ram, Tal Yarkoni, Mike Lee Williams, Constantine Evans, Clark Fitzgerald, Brian, Chris Fonnesbeck, Antony Lee, and Adel Qalieh. mwaskom/seaborn: v0.8.1 (September 2017), September 2017. → page 58
[144] Jan D. Wegner, Steven Branson, David Hall, Konrad Schindler, and Pietro Perona. Cataloging Public Objects Using Aerial and Street-Level Images - Urban Trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6014–6023, 2016. URL cvpr 2016/html/Wegner Cataloging Public Objects CVPR 2016 paper.html. → pages 3, 6, 7, 8, 9, 10, 17, 18, 28, 29, 40, 50, 60, 81, 112
[145] Lynne M. Westphal. Social Aspects of Urban Forestry: Urban Greening and Social Benefits: a Study of Empowerment Outcomes. Journal of Arboriculture 29(3):137-147, 29(3), 2003.
→ page 2
[146] Hanfa Xing, Yuan Meng, Zixuan Wang, Kaixuan Fan, and Dongyang Hou. Exploring geo-tagged photos for land cover validation with deep learning. ISPRS Journal of Photogrammetry and Remote Sensing, 141:237–251, July 2018. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2018.04.025. → pages 17, 31
[147] Denys Yemshanov, Frank H. Koch, Mark Ducey, and Klaus Koehler. Trade-associated pathways of alien forest insect entries in Canada. Biological Invasions, 14(4):797–812, April 2012. ISSN 1573-1464. doi:10.1007/s10530-011-0117-5. → pages 4, 57
[148] Dameng Yin and Le Wang. How to assess the accuracy of the individual tree-based forest inventory derived from remotely sensed data: a review. International Journal of Remote Sensing, 37(19):4521–4553, October 2016. ISSN 0143-1161. doi:10.1080/01431161.2016.1214302. → page 40
[149] Paul A. Zandbergen and Sean J. Barbeau. Positional Accuracy of Assisted GPS Data from High-Sensitivity GPS-enabled Mobile Phones. The Journal of Navigation, 64(3):381–399, July 2011. ISSN 1469-7785, 0373-4633. doi:10.1017/S0373463311000051. → page 60
[150] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond Empirical Risk Minimization. arXiv:1710.09412 [cs, stat], October 2017. arXiv: 1710.09412. → page 63
[151] L. Zhang, L. Zhang, and B. Du. Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art. IEEE Geoscience and Remote Sensing Magazine, 4(2):22–40, June 2016. ISSN 2168-6831. doi:10.1109/MGRS.2016.2540798. → pages 17, 31
[152] Xin Zhang, Gui-Song Xia, Qikai Lu, Weiming Shen, and Liangpei Zhang. Visual object tracking by correlation filters and online learning. ISPRS Journal of Photogrammetry and Remote Sensing, 140:77–89, June 2018. ISSN 0924-2716. doi:10.1016/j.isprsjprs.2017.07.009. → pages 17, 30, 117
[153] Zhen Zhen, Lindi J. Quackenbush, and Lianjun Zhang. Trends in Automatic Individual Tree Crown Detection and Delineation - Evolution of LiDAR Data.
Remote Sensing, 8(4):333, April 2016. doi:10.3390/rs8040333. → page 27
[154] X. X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, and F. Fraundorfer. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geoscience and Remote Sensing Magazine, 5(4):8–36, December 2017. ISSN 2168-6831. doi:10.1109/MGRS.2017.2762307. → pages 10, 30
[155] Johan Östberg, Tim Delshammar, Björn Wiström, and Anders Busse Nielsen. Grading of Parameters for Urban Tree Inventories by City Officials, Arborists, and Academics Using the Delphi Method. Environmental Management, 51(3):694–708, March 2013. ISSN 1432-1009. doi:10.1007/s00267-012-9973-8. → page 78

Appendix A
Additional publications and presentations

• [oral presentation] Lumnitz, S. (2018): Monitoring Urban Trees in a Changing World: Instance Segmentation with Deep Learning and Google Street View Imagery. BioSAFE Symposium. Vancouver, Canada.
• [oral presentation] Lumnitz, S. (2018): Bio-surveillance in a Changing World: Monitoring Urban Trees with Deep Learning and Street View Imagery. Entomology 2018 - ESA, ESC and ESBC Joint Annual Meeting. Vancouver, Canada.
• [oral presentation] Lumnitz, S. (2019): A Geographers Journey into AI: mapping Urban Trees from Scratch. Scipy. Austin, USA.
• [oral presentation] Lumnitz, S. (2019): Mapping urban trees with deep learning and street-level imagery - a story of geospatial open source software. UBC GIS day. Vancouver, Canada.
• Lafond, V., Lingua, F., Lumnitz, S., Paradis, G., Srivastava, V. and Griess, V.
(2019): Challenges and opportunities in developing decision support systems for risk assessment and management of forest invasive alien species. Accepted: Environmental Reviews.

Appendix B
Theoretical background on developing Mask R-CNN

In this project, MASK R-CNN was used to detect and outline trees in imagery by generating bounding boxes and binary segmentation masks for each tree instance. These tree detections were then used for both tree location and genus classification (see chapters 2 and 3). As the most central part of the proposed workflow, the MASK R-CNN architecture, its training and evaluation process, and the chosen training data are described in detail in this section.

B.1 Training, development and evaluation data generation

The presented workflow relied on two main data sources used to train, develop and evaluate the tree detection (MASK R-CNN) and the genus classification (RESNET50) models: 1) openly available benchmark datasets for training, i.e. the COCO Stuff annotated dataset, and 2) acquired street-level imagery, i.e. generated street-level panoramas and corresponding tree labels, for training, development, testing and inference purposes.

B.1.1 COCO Stuff dataset

The COCO Stuff dataset is the most expansive dataset with pixel-level annotations for "stuff classes" to date [18]. Stuff classes describe classes of objects that are amorphous and consist of few or no distinct parts, e.g. sky, grass [31]. Stuff classes are opposed to "thing classes", e.g. car, human, which describe objects of a specific size and shape that are made of identifiable parts, e.g. wheels and doors for the thing class "car" [83]. In the context of computer vision, the "tree class" is commonly referred to as a stuff class, owing to a tree's amorphous structure, typically composed of trunk, branches and leaves but also of background noise shining through.
COCO Stuff consists of 164,000 images from the original COCO dataset with pixel-level annotations for 172 classes: 91 stuff classes and 1 unlabeled class, in addition to the traditional 80 thing classes [18]. Figure B.1 provides an example of images and image annotations collected in COCO Stuff.

Figure B.1: COCO Stuff image and segmentation mask examples. The COCO Stuff dataset is the most expansive dataset with pixel-level annotations for stuff classes. The COCO Stuff dataset is used as training data in the presented workflow. These are three example images with corresponding Stuff segmentation masks retrieved from the COCO 2017 Stuff Segmentation Task Challenge [2, 18].

The dataset can be downloaded via the official COCO dataset website [2]. After downloading the image datasets, COCO Stuff data structures and annotations can be accessed, manipulated and integrated into our DL workflow using the official COCO Stuff API [1].

In the presented workflow, this extensive stuff dataset was used to extend the smaller annotated tree datasets for training the instance segmentation DNN model. All 36,500 images containing tree objects and pixel-level annotations were strategically extracted from COCO Stuff. Extracted images are used as basic training and development datasets in the layered training approach for the tree detection model, and therefore complement the smaller development and test datasets used for fine-tuning (presented in the following section B.1.2).

B.1.2 Street-level panoramas and annotations

A dedicated data processing pipeline enabled a dense download of all GSV panoramas in a defined area of interest, and relied on the Python implementation of the official Google API. To download a single panorama, the API required a set of coordinates of interest as the main user input.
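A minimal sketch of how a dense set of input coordinates for such a download might be generated, assuming a regular grid spanned between an upper-left and a lower-right corner. The function name `coordinate_grid`, the meters-per-degree constant and the flat-earth (equirectangular) approximation are illustrative assumptions and are not taken from the thesis code. The 6 m default spacing is the value reported for the pipeline. No Google API call is made here.

```python
import math

# Approximate length of one degree of latitude in meters (assumed constant).
METERS_PER_DEG_LAT = 111_320.0


def coordinate_grid(upper_left, lower_right, spacing_m=6.0):
    """Return (lat, lon) points covering the box between two corners.

    upper_left and lower_right are (lat, lon) tuples in decimal degrees.
    Points are spaced roughly spacing_m meters apart, using a local
    equirectangular approximation that is adequate for city-sized areas.
    """
    lat_top, lon_left = upper_left
    lat_bottom, lon_right = lower_right
    if lat_top < lat_bottom or lon_right < lon_left:
        raise ValueError("expected an upper-left and a lower-right corner")

    # Convert the metric spacing into degree steps at the box's mid latitude.
    mid_lat = math.radians((lat_top + lat_bottom) / 2.0)
    dlat = spacing_m / METERS_PER_DEG_LAT
    dlon = spacing_m / (METERS_PER_DEG_LAT * math.cos(mid_lat))

    n_rows = int(math.floor((lat_top - lat_bottom) / dlat)) + 1
    n_cols = int(math.floor((lon_right - lon_left) / dlon)) + 1

    # Walk south from the top edge and east from the left edge.
    return [
        (lat_top - row * dlat, lon_left + col * dlon)
        for row in range(n_rows)
        for col in range(n_cols)
    ]
```

Each returned point could then serve as the coordinate input for one nearest-panorama lookup; duplicates among the returned panoramas would need to be removed in a later step.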
The given user input was then compared to recorded GSV car or camera positions at the time of image capture, and the metadata, including coordinate points, for the panorama closest to the user input coordinates was returned. In a subsequent step, this metadata was used to download the actual panorama in various resolutions. Leveraging the API, the proposed processing pipeline consisted of the following steps:

1. Generating a grid of coordinates over the area of interest: A grid of coordinate points was generated to allow a dense download of all panoramas over a given area of interest. The upper left and lower right corner coordinates defined the size of the grid and allowed for a download of panoramas in the resulting rectangle. A new coordinate point was generated every 6 meters to ensure all panoramas of an area were downloaded, since the common distance between two points of panorama capture was 15 meters [144].

2. Crawling metadata of nearest panorama: The generated coordinate grid was then used to crawl and download the metadata of all corresponding nearest panoramas.

3. Downloading panorama data: All panoramas of interest were densely downloaded based on the crawled metadata and the Google API. Panoramas were stored as a coherent dataset and further processed.

4. Dataset annotation: Training, development and testing datasets were built through semi-automatic and manual annotation of a subset of GSV images.

(a) For tree detection model development, a small subset of 120 panoramas including roughly 1200 private and public trees located in Metro Vancouver, Canada and Pasadena, USA was annotated.
Trees were manually annotated using Labelbox, a software package for annotating images and masking objects, in combination with a custom-designed label template. Figures B.2 and B.3 show an annotated example panorama and details of annotation masks.

(b) For tree genus classification model development, tree masks were automatically generated through inference of the trained tree detection model on all available imagery. Resulting masks were combined with information from existing tree inventories based on the spatial location of single trees. For a more detailed description see chapter 3.

Figure B.2: Manual image annotation with Labelbox. Trees on the full panorama are manually annotated using Labelbox, an open source software package. Masks are created by manually outlining tree instances. Labels are saved in .json format. (Imagery source: Google Street View 2018)

The presented data generation pipeline based on these four steps resulted in annotated datasets which were used as training, development and test datasets in training the tree detection and genus classification models. Steps one to three can be used in combination with our fully trained model to update or generate urban tree inventories in the future.

Figure B.3: Google Street View panorama and tree annotations. Trees on the full panorama are annotated using Labelbox. Original panoramas are of the size 6656x3328 pixels but are downsized and halved for the purpose of training. The black stripes represent an automatic application of zero padding in order to transform images to the same format for training (Imagery source: Google Street View 2018).

Generated datasets are further preprocessed as part of both tree detection and tree genus classification models:

1. Automated bounding box generation: In addition to tree labels and masks, bounding boxes were automatically computed by drawing a rectangular bounding box with the best fit around each separate mask.
This approach allowed for an easy augmentation of tree detection datasets without the need to update each bounding box according to the respective image and mask transformation.

2. Resizing images: All images and corresponding labels in a dataset were resized to 1024x1024 pixels for tree detection and 256x256 pixels for tree classification. Zero padding was added at the sides when using the COCO Stuff dataset to preserve the original aspect ratio. GSV panoramas were split in half and downsized from 6656x3328 pixels. Resizing all images to the same size was vital for training with gradient descent.

3. Resizing masks: Individual instance masks were downsized to a resolution of 56x56 pixels in order to save memory during each training step and accelerate the training process. Figure B.4 gives an example of reshaped and normalized tree masks.

Figure B.4: Reshaped mask annotations. Annotations need to be reshaped to 56x56 pixels in order to reduce memory needed during training. This is an effective method to speed up the training process.

These preprocessing steps were particularly important to optimize the training process on high-resolution imagery for tree detection. All of the steps above significantly decreased the time needed for training and allowed me to use a single GPU with 11 GB of memory.

B.2 Mask R-CNN for tree detection

B.2.1 Mask R-CNN framework

MASK R-CNN is a framework specialized in instance segmentation, the computer vision task of detecting and outlining separate objects in an image [57]. Figure B.5 shows the architecture of the MASK R-CNN framework.

Figure B.5: The Mask R-CNN framework for instance segmentation. An input image is transformed using a Feature Pyramid Network (FPN) and a RESNET101 core. Resulting features are additionally realigned with the input image through a Region of Interest (ROI)Align operation, which ensures that subsequently generated masks are in the right position on the input image.
Class and bounding box are predicted separately, branching off from the first convolution layer. Binary masks are predicted in parallel for every ROI using an additional convolution layer [57].

MASK R-CNN extends FASTER R-CNN, used by Branson et al. [15], by adding a segmentation mask prediction for each detected instance or object. It therefore allowed me to classify and detect single urban tree instances and to create bounding boxes and segmentation masks surrounding the individual trees [57]. MASK R-CNN can be conceptualized as a two-stage algorithm: 1) The first stage is also referred to as a Region Proposal Network (RPN), which predicts multiple ROIs. A convolutional backbone architecture, RESNET101 coupled with an FPN, allows multiple anchor boxes to be generated at different scales [58, 84]. 2) In the so-called head of the model, features are then extracted from each ROI, which can be associated with the relevant class. In parallel, the head extracts class and bounding box values, leveraging a ROIPool operation, and creates a binary mask for each detected object using a ROIAlign operation [152].

The implemented network head of MASK R-CNN decouples the classification from the convolutional mask prediction branch. Class, bounding box and binary mask are therefore predicted separately and in parallel for every ROI [57]. In comparison, the classification task usually depends on a previously predicted mask in other state-of-the-art instance segmentation frameworks [44]. Due to the separation of these different tasks, MASK R-CNN performs faster and with higher accuracy than prior and other state-of-the-art systems and was therefore chosen for this research project [61]. He et al. [57] provide more detailed information on the architecture of MASK R-CNN.
The original architecture can be openly accessed online and is implemented in Keras with a TensorFlow backend [3].

B.3 Evaluation strategy

B.3.1 Architecture evaluation for tree detection

Detecting trees on street-level imagery is a binary classification and image segmentation problem. This means the CNN learns whether a pixel in an image is a tree or a non-tree pixel and assigns tree pixels to exactly one tree object, which is located at a specific position in the image. Owing to the findings of Berland and Lange [13], where humans could identify trees and tree genera from GSV images with 90% overlap with the field survey, and the promise of current CNN architectures to approach human-level performance, MASK R-CNN was assumed to be able to perform this task. However, it was important to test whether or not the chosen MASK R-CNN architecture was powerful enough for the task at hand.

A common test used to assess whether an architecture can perform a task is to train the architecture to over-fit a set of example images [26]. The goal of training a NN is to strike a balance between over-fitting to its training data and introducing a bias due to inaccurate representation of the training data (under-fitting), and thereby to generalize to the task at hand. Therefore, before a NN is trained to generalize, it needs to be assessed whether the architecture is able to over-fit a specific training set.

In a preliminary experiment, MASK R-CNN was trained with 20 annotated GSV images and inference was evaluated on the same images. Figure B.6 indicates that MASK R-CNN correctly detects and masks all trees after only 8 epochs of training. This implied that MASK R-CNN could be used for the task of tree instance segmentation with the right training strategy and a good representation of the target class.

Figure B.6: Over-fitting Mask R-CNN.
I purposely over-fit the model in order to test whether or not MASK R-CNN is powerful enough for the task of tree instance segmentation.

B.3.2 Evaluation metrics for tree detection

To evaluate the performance of tree detection and masking with MASK R-CNN, the Mean Average Precision (mAP) and the Average Precision at IOU thresholds of 0.50 and 0.75 (AP50, AP75) were computed.

A common way to determine whether a detection proposal is correct is to assess the Intersection over Union (IOU) [51]. The set of proposed object pixels A and the set of true object pixels B were compared:

\[ \mathrm{IoU}(A, B) = \frac{\text{area of overlap}}{\text{area of union}} = \frac{|A \cap B|}{|A \cup B|} \tag{B.1} \]

Unless otherwise specified, AP is commonly averaged over multiple IOU values [18]. In this work, 10 IOU thresholds of .50:.05:.95 are used. This is a break from tradition, where AP is computed at a single IOU of .50 (which corresponds to the metric $AP^{IoU=.50}$). Averaging over IOUs rewards detectors with better localization.

B.4 Training strategy

B.4.1 Feature extraction and fine-tuning

For training MASK R-CNN, two commonly used training strategies using pre-trained models were applied: 1) feature extraction and 2) fine-tuning [26].

In feature extraction, new models are built upon already trained NNs by downloading trained weights, i.e. mathematical features of representations (see section 1.2.2). These weights derive from already learned, very generic representations found in the first layers (1-80) of the RESNET101 core (see section B.2.1). In contrast to the more complex representations learned in end layers, the so-called heads, these simple base representations can be reused.
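As a rough illustration of this reuse, the sketch below splits a backbone into layers whose pre-trained weights are kept fixed and layers that remain trainable. The function name, the 1-based layer indexing and the total of 101 layers are assumptions made for illustration. Only the 1-80 range of generic layers comes from the text, and actual layer counts depend on the implementation used.

```python
def split_backbone_layers(n_layers=101, n_reused=80):
    """Partition 1-based layer indices into reused and trainable groups.

    Layers 1..n_reused are treated as generic feature extractors whose
    pre-trained weights are reused unchanged; the remaining layers are
    left trainable so they can adapt to the new task.
    """
    if not 0 <= n_reused <= n_layers:
        raise ValueError("n_reused must lie between 0 and n_layers")
    reused = list(range(1, n_reused + 1))
    trainable = list(range(n_reused + 1, n_layers + 1))
    return reused, trainable
```

In a real training framework, the same partition would be expressed by marking the reused layers as non-trainable before the optimizer is built.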
The level of re-usability of these representations or features decreases with the depth of layers in the architecture. The chosen implementation of the MASK R-CNN architecture allowed openly sourced weights to be downloaded and used to initialize the training process. These weights are derived from the same MASK R-CNN architecture previously trained on the official COCO dataset (which differs from the COCO Stuff dataset).

Fine-tuning describes the process in which MASK R-CNN is retrained for a new task, starting from transferred COCO weights. A best-practice fine-tuning workflow starts by training only the heads, followed by training only the last 4 RESNET101 layers, and is completed by training only the heads a second time. This approach optimizes and reduces the amount of training data and time needed [26]. Only the last layers of the RESNET101 core and the so-called heads were adjusted in the training process. Generic layers were frozen, i.e. excluded from the updating process (described in section 1.2.2), and only the later, top layers were trained to fit the problem of tree detection. Representations derived from the heads were therefore fine-tuned to tree shapes. Since tree shapes are very specific and rather uncommon in benchmark datasets like COCO, the last 4 layers in the RESNET101 core needed to be updated in addition to the classification head.

B.4.2 Training Mask R-CNN

In order to train MASK R-CNN, training configurations and a specific training strategy had to be developed. Both are typically altered in order to optimize the results achieved during the evaluation of the model. The most important training configurations that had to be set in MASK R-CNN for training were:

1. Steps per epoch. This determined how many gradient updates (updates of weights) were done before a new set of weights was saved at the end of an epoch.

2. Images per GPU. For a 1024x1024 pixel image, the maximum batch size was two, using a GPU with 11-12 GB memory.
Assessing more than two images per training step required more than 12 GB of memory.

3. Validation steps. These ran at the end of an epoch to generate validation statistics. They default to 50 and should generally be much smaller than steps per epoch so as not to slow down the training process, owing to their additional memory requirements.

4. Learning rate. The smaller the learning rate, the smaller each gradient update. A default of 0.001 was used to train the heads; a smaller value was more appropriate for training the last layers of the RESNET101 backbone.

5. Number of classes. This indicates the number of classes to train for and was set to two: "tree" and "no-tree".

Appendix C
Comparing existing and generated tree genera distribution information

Figure C.1: Visual comparison of existing and generated tree inventory records for Prunus. Generated Prunus occurrence data (green) is more widely spread than existing records of Prunus (red) in Vancouver, owing to the inclusion of private trees and the oversampling inherent to the methodology used. (Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.)

Appendix D
Tree genera detection

Table D.1: Tree genera detections in Metro Vancouver

genus            count    | genus          count
FAGUS            609288   | SORBUS         52896
ACER             492853   | TILIA          49200
PRUNUS           360943   | ULMUS          46560
PICEA            263775   | LIQUIDAMBAR    28151
MALUS            240133   | STYRAX         27796
CHAMAECYPARIS    232685   | AESCULUS       26863
BETULA           214790   | ILEX           25902
THUJA            174806   | MAGNOLIA       20259
PARROTIA         140427   | AMELANCHIER    18929
CRATAEGUS        122600   | GLEDITSIA      14956
PINUS            120400   | PLATANUS       10277
FRAXINUS         110167   | LABURNUM       7708
QUERCUS          94854    | ROBINIA        5226
PSEUDOTSUGA      90482    | CERCIS         3675
CARPINUS         88762    | ABIES          2727
CORNUS           79627    | TSUGA          2498
CERCIDIPHYLLUM   74687    | CATALPA        2182
CEDRUS           59897    | METASEQUOIA    1323
LIRIODENDRON     55320    | GINKGO         1176
PYRUS            54857    | JUGLANS        564
OTHER            41614    |

