Open Collections

UBC Theses and Dissertations
Accounting for preferential sampling in the statistical analysis of spatio-temporal data. Watson, Joe (2020).


Full Text

Accounting for Preferential Sampling in the Statistical Analysis of Spatio-temporal Data

by

Joe Watson

MMath, University of Bath, 2016

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Faculty of Graduate and Postdoctoral Studies (Statistics)

The University of British Columbia (Vancouver)

December 2020

© Joe Watson, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Accounting for Preferential Sampling in the Statistical Analysis of Spatio-temporal Data

submitted by Joe Watson in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics.

Examining Committee:

Jim Zidek, Professor Emeritus, Statistics, UBC (Supervisor)
Marie Auger-Méthé, Assistant Professor, Statistics, UBC (Co-supervisor)
Lang Wu, Professor, Statistics, UBC (Supervisory Committee Member)
Matias Salibian-Barrera, Professor, Statistics, UBC (University Examiner)
Rebecca Tyson, Associate Professor, Mathematics, UBC Okanagan (University Examiner)
Christopher K. Wikle, Curators' Distinguished Professor, Statistics, University of Missouri (External Examiner)

Additional Supervisory Committee Members:

Nancy Heckman, Professor, Statistics, UBC (Supervisory Committee Member)

Abstract

Spatio-temporal statistical methods are widely used to model natural phenomena across both space and time. Example phenomena include the concentrations of airborne pollutants and the distributions of endangered species. A spatio-temporal process is said to have been preferentially sampled when the locations and/or times chosen to observe it depend stochastically on the values of the process at the chosen locations and/or times. When standard statistical methodologies are used, predictions of a preferentially sampled spatio-temporal process into unsampled regions and times may be severely biased.
Preferential sampling within spatio-temporal data may be the rule rather than the exception in practice.

The work presented in this dissertation addresses the issue of preferential sampling. We develop the first general framework for modelling preferential sampling in spatio-temporal data and apply it to historical UK black smoke measurements. We demonstrate that existing estimates of population-level black smoke exposures may be highly inaccurate due to preferential sampling. By leveraging the information contained in the chosen sampling locations, we can adjust estimates of black smoke exposure for the presence of preferential sampling. Next, we develop a fast, intuitive, powerful, and general test for preferential sampling. A user-friendly R package we wrote performs the test. We demonstrate its utility both in a thorough simulation study and by successfully replicating previously published results on preferential sampling. Finally, we adapt our ideas on preferential sampling to the setting of spatio-temporal point patterns. By considering the observed point pattern as a spatio-temporally thinned, marked log-Gaussian Cox process, we show that preferential sampling can be directly accounted for within the model. Under certain assumptions, the true distribution of locations can then be attained. Using these ideas, we develop a framework for combining multiple data sources to estimate the spatio-temporal distribution of an animal. We then apply our framework to estimate the effort-corrected space use of an endangered ecotype of killer whales.

Ultimately, we hope that investigations into preferential sampling will become an essential component of spatio-temporal analyses, akin to model diagnostics. The methods developed in this dissertation are widely applicable, allowing researchers to routinely perform such investigations.

Lay Summary

Space-time statistical methods are widely used to model natural phenomena.
Examples include the concentrations of airborne pollutants and the distributions of endangered species. Statistical models use data to describe these phenomena, and results from these models frequently inform governmental policy across a range of public-health and environmental issues. However, it is common for the objectives underlying data collection protocols to be related to the measured phenomenon. For example, governments preferentially situate air-quality monitors near major sources of airborne pollutants. Ignoring these objectives when performing a statistical analysis can severely bias our understanding of a phenomenon. This dissertation develops tools for testing for the presence of preferential sampling and for subsequently adjusting conclusions to account for its existence. We demonstrate the utility of the methods in real-world case studies: predicting air pollution concentrations across Great Britain, predicting lead concentrations across Galicia, and mapping killer whales.

Preface

This thesis was completed under the joint supervision of Prof. Jim Zidek and Prof. Marie Auger-Méthé. A version of Chapter 3 has been published [Watson J, Zidek JV, Shaddick G. A general theory for preferential sampling in environmental networks. The Annals of Applied Statistics. 2019;13(4):2662-2700.] and versions of Chapters 4 and 5 have been submitted for peer review [Watson J. A fast Monte Carlo test for preferential sampling.] and [Watson J, Joy R, Tollit D, Thornton SJ, Auger-Méthé M. Estimating animal utilization distributions from multiple data types: a joint spatio-temporal point process framework.], respectively. The ideas for Chapter 3 were jointly developed by Joe Watson, Prof. Jim Zidek, and Prof. Gavin Shaddick, with the majority of the computational work and manuscript writing conducted by Joe Watson as lead author. The work seen in Chapter 4 was solely developed by Joe Watson. The ideas for Chapter 5 were jointly developed by Joe Watson, Prof.
Marie Auger-Méthé, and coauthors, with the majority of the computational work and manuscript writing conducted by Joe Watson as lead author.

Table of Contents

Abstract . . . . . iii
Lay Summary . . . . . v
Preface . . . . . vi
Table of Contents . . . . . vii
List of Tables . . . . . xi
List of Figures . . . . . xii
Glossary . . . . . xviii
Acknowledgments . . . . . xx
Dedication . . . . . xxii

1 Introduction . . . . . 1

2 Background on Spatio-temporal Statistics . . . . . 5
  2.1 Preferential sampling in discrete-space spatio-temporal data . . . . . 12
  2.2 Preferential sampling in continuous spatio-temporal data . . . . . 15
  2.3 Preferential sampling in spatio-temporal point-pattern data . . . . . 19
  2.4 Spatio-temporal generalized linear mixed-effects models . . . . . 27
    2.4.1 Incorporating preferential sampling within STGLMMs . . . . . 30
    2.4.2 INLA . . . . . 33
    2.4.3 Applicability of the methods in practice . . . . . 38

3 A General Theory for Preferential Sampling in Environmental Networks . . . . . 40
  3.1 Introduction to Chapter 3 . . . . . 41
  3.2 Modelling frameworks . . . . . 46
    3.2.1 Review of related work . . . . . 46
    3.2.2 A general retrospective modelling framework . . . . . 48
  3.3 Case study: the data . . . . . 53
  3.4 Modelling . . . . . 56
    3.4.1 Data cleaning . . . . . 58
    3.4.2 Observation process . . . . . 58
    3.4.3 Site-selection process . . . . . 61
    3.4.4 Three implementations . . . . . 65
    3.4.5 Model identifiability issues . . . . . 69
  3.5 Results . . . . . 71
    3.5.1 Implementation 1 – assuming independence between Y and R . . . . . 72
    3.5.2 Implementation 2 – P1 . . . . . 75
    3.5.3 Implementation 3 – P2 . . . . . 79
    3.5.4 Impacts of preferential sampling on estimates of population exposure levels and noncompliance . . . . . 82
  3.6 Discussion . . . . . 87
  3.7 Conclusion to Chapter 3 . . . . . 90

4 A Perceptron for Detecting the Preferential Sampling of Sites Chosen to Monitor a Spatio-temporal Process . . . . . 92
  4.1 Introduction to Chapter 4 . . . . . 96
  4.2 Preferential sampling in geostatistical data . . . . . 100
    4.2.1 Assumed model for preferential sampling . . . . . 102
    4.2.2 Perceptron algorithm . . . . . 105
    4.2.3 Discussion . . . . . 112
  4.3 Preferential sampling in the discrete spatial data setting . . . . . 114
    4.3.1 Assumed model for preferential sampling . . . . . 115
    4.3.2 Perceptron algorithm . . . . . 116
  4.4 Simulation study . . . . . 117
  4.5 Case studies . . . . . 122
    4.5.1 Great Britain's black smoke monitoring network . . . . . 123
    4.5.2 Galicia lead concentrations . . . . . 125
  4.6 Concluding remarks . . . . . 127

5 Estimating Animal Utilization Distributions from Multiple Data Types: a Joint Spatio-temporal Point Process Framework . . . . . 130
  5.1 Introduction to Chapter 5 . . . . . 132
  5.2 Motivating problem . . . . . 136
    5.2.1 An introduction to the problem . . . . . 136
    5.2.2 The data available . . . . . 137
    5.2.3 Previous work estimating the space use of SRKW . . . . . 138
    5.2.4 Goals of the analysis . . . . . 139
  5.3 Building the modelling framework . . . . . 141
  5.4 Simulation study . . . . . 154
    5.4.1 Effects of observer effort and detection range misspecification . . . . . 157
  5.5 Application to empirical data . . . . . 161
  5.6 Discussion . . . . . 171

6 Summary, conclusions, and future work . . . . . 174

Bibliography . . . . . 181

A Supporting Materials . . . . . 197
  A.1 Chapter 3 Supporting Materials . . . . . 197
    A.1.1 Chosen priors for the case study . . . . . 197
    A.1.2 Details on the R-INLA implementation . . . . . 198
    A.1.3 Posterior pointwise mean and pointwise standard deviation plots . . . . . 202
    A.1.4 Additional plot of the exceedance of the annual black smoke EU guide value . . . . . 205
    A.1.5 Additional plot of annual average black smoke levels . . . . . 205
    A.1.6 Model diagnostic plots . . . . . 207
  A.2 Chapter 4 Supporting Materials . . . . . 213
    A.2.1 More details on the simulation study . . . . . 213
  A.3 Chapter 5 Supporting Materials . . . . . 227
    A.3.1 Additional theory on marked point processes . . . . . 227
    A.3.2 Extra results of the main simulation study . . . . . 227
    A.3.3 Details of the additional simulation study . . . . . 231
    A.3.4 Additional comments on the causal DAG . . . . . 237
    A.3.5 Deriving site occurrence and site count likelihoods . . . . . 238
    A.3.6 Comments on preferential sampling . . . . . 240
    A.3.7 More notes on estimating the whale-watch observer effort . . . . . 241
    A.3.8 Computational steps for approximating the likelihood . . . . . 242
    A.3.9 Additional details on the results and additional tables . . . . . 243
    A.3.10 Pseudo-code for computing the modelling framework in inlabru . . . . . 248
    A.3.11 Additional figures . . . . . 252

List of Tables

Table 3.1 A table showing the posterior means and standard deviations of parameter estimates for the three implementations. . . . . . 72
Table 4.1 A table of empirical p-values for the UK black smoke dataset for both the assumed homogeneous and inhomogeneous Poisson point process models. . . . . . 125
Table 4.2 A table of empirical p-values for the Galicia dataset. . . . . . 126
Table A.1 A table of posterior estimates of the fixed effects β, with their 95% posterior credible intervals, for the final model. . . . . . 247
Table A.2 A table showing the DIC values of all the models tested, with the model formulations summarized in the columns. A value of NA implies that model convergence issues occurred. . . . . . 248

List of Figures

Figure 2.1 Plots of the toy simulation study in the discrete-space spatio-temporal setting. . . . . . 15
Figure 2.2 A series of plots showing the toy simulation study in the continuous-space spatio-temporal setting. . . . . . 18
Figure 2.3 Plots showing the toy simulation study in the spatio-temporal point-pattern setting. . . . . . 25
Figure 3.1 A plot showing the number of monitoring sites that are operational in each year and have data capture of at least 75%. . . . . . 55
Figure 3.2 A plot showing the mean black smoke level on a log-transformed scale for 30 randomly chosen sites. . . . . . 56
Figure 3.3 A plot of Great Britain, with the locations of the observed sites, and hence P1, shown. . . . . . 59
Figure 3.4 A plot of the locations of all sites considered for selection in Population 2. The locations are shown as blue dots, many of which are in regions of low human population density. . . . . . 67
Figure 3.5 Implementation 1. In green are the model-estimated BS levels averaged over online sites in P1, in red are the model-estimated BS levels averaged over offline sites in P1, and in blue are the model-estimated BS levels averaged across Great Britain. . . . . . 73
Figure 3.6 A plot of the year-by-year change in the logit of selection captured by the autoregressive β*1(t) process in the R process in Implementation 2. . . . . . 76
Figure 3.7 Implementation 3. In green are the model-estimated BS levels averaged over online sites in P1, in red are the model-estimated BS levels averaged over offline sites in P1, and in blue are the model-estimated BS levels averaged across Great Britain. . . . . . 80
Figure 3.8 A map plot of the posterior pointwise probability of the annual average black smoke level exceeding the EU guide value of 34 µg/m³ under Implementation 2 (left) and Implementation 3 (right). . . . . . 84
Figure 3.9 A plot showing the posterior mean and 95% credible intervals of the annual residential-average exposure levels across the years of study. . . . . . 85
Figure 3.10 A plot showing the posterior mean and 95% credible intervals of the annual proportion of the population with black smoke exposure levels exceeding the EU guide value of 34 µg/m³ across the years of study. . . . . . 85
Figure 4.1 Plots demonstrating the intuition behind the PS test. . . . . . 95
Figure 4.2 A plot of the Type 1 error for four tests. . . . . . 120
Figure 4.3 A plot of the power for two tests when the PS parameter γ equals 1 and ρZ ∈ {0.2, 1}. . . . . . 121
Figure 4.4 A plot of the locations of the black smoke monitoring sites in 1966. Observe the clustering of sites around the populous cities of London, Manchester, and Glasgow. . . . . . 123
Figure 4.5 A plot of the 1997 sampled locations of lead concentrations in Galicia, northern Spain. Observe the clustering of sites in northern Galicia. . . . . . 126
Figure 5.1 A plot showing our area of interest Ω in green, with the GPS tracklines of the DFO survey effort displayed as black lines. All DFO survey sightings are shown as a red overlay on top of the effort. All sightings from the OM and BCCSN datasets are shown in yellow. . . . . . 140
Figure 5.2 A diagram showing an example of an 'encounter'. . . . . . 141
Figure 5.3 A plot showing the assumed causal DAG for the proposed framework, with the detection probability assumed constant. . . . . . 152
Figure 5.4 A plot showing the long-run densities of the animal and the observers. . . . . . 155
Figure 5.5 A plot showing the mean squared prediction error (MSPE) of the animal's UD under the bias-corrected and bias-uncorrected models vs. the types of observers. . . . . . 158
Figure 5.6 A series of four plots demonstrating four different insights into the SRKW gained under our modelling framework. . . . . . 170
Figure A.1 A plot of the posterior mean black smoke in 1966 and 1996 under Implementation 1, with corresponding standard errors plotted below. . . . . . 202
Figure A.2 A plot of the posterior mean black smoke in 1966 and 1996 under Implementation 2, with corresponding standard errors plotted below. . . . . . 203
Figure A.3 A plot of the posterior mean black smoke in 1966 and 1996 under Implementation 3, with corresponding standard errors plotted below. . . . . . 204
Figure A.4 A plot showing the posterior proportion of the total surface area of Great Britain with annual average black smoke level exceeding the EU guide value of 34 µg/m³ across Implementations 2 and 3. . . . . . 205
Figure A.5 In green are the BS levels averaged over the online sites in P1, in red are the BS levels averaged over the offline sites in P1, and in blue are the BS levels averaged across Great Britain. Posterior levels are under Implementation 2. . . . . . 206
Figure A.6 A plot of the residuals vs. year from Implementation 1, with a fitted smoother. . . . . . 208
Figure A.7 A Normal Q-Q plot of the residuals from Implementation 1. . . . . . 208
Figure A.8 Histograms of the spatially uncorrelated random intercepts (top left) and slopes (bottom left), with corresponding Normal Q-Q plots shown on the right, from Implementation 1. . . . . . 209
Figure A.9 A plot of the residuals vs. year for Implementation 2, with a fitted smoother. . . . . . 210
Figure A.10 A Normal Q-Q plot of the residuals from Implementation 2, with 95% confidence intervals shown in red. . . . . . 210
Figure A.11 Histograms of the spatially uncorrelated random intercepts (top left) and slopes (bottom left), with corresponding Normal Q-Q plots shown on the right, from Implementation 2. . . . . . 211
Figure A.12 A plot of the residuals vs. year for Implementation 3, with a fitted smoother. . . . . . 212
Figure A.13 A Normal Q-Q plot of the residuals from Implementation 3, with 95% confidence intervals shown in red. . . . . . 212
Figure A.14 Histograms of the spatially uncorrelated random intercepts (top left) and slopes (bottom right), with corresponding Normal Q-Q plots shown on the right, from Implementation 3. . . . . . 213
Figure A.15 A plot of the Type 1 error for four tests. The three boxes show the results for ρZ ∈ {0.02, 0.2, 1}, from left to right respectively, for a sample size of 50. . . . . . 215
Figure A.16 A plot of the power for two tests when the PS parameter γ equals 1, ρZ ∈ {0.2, 1}, and for sample sizes of 50 and 100. . . . . . 217
Figure A.17 A plot of the power for two tests when the PS parameter γ equals 2, ρZ = 0.02, and for all three sample sizes. . . . . . 219
Figure A.18 A plot of the power for two tests when the PS parameter γ equals 1, the covariate effect α1 equals 1, and when the sample size is 50. . . . . . 220
Figure A.19 A plot of the Type 1 error for four tests for ρZ ∈ {0.02, 0.2, 1} and for a sample size of 100. . . . . . 223
Figure A.20 A plot of the power for two tests when the PS parameter γ equals 1, ρZ = 0.2, and for all three sample sizes. . . . . . 224
Figure A.21 A plot of the power for two tests when the true sampling process is a hard-core point process, with PS parameter γ equal to 1 and ρZ = 1. . . . . . 225
Figure A.22 A plot of the power for two tests when the PS parameter γ equals 1, the covariate effect α1 equals 1, and when the sample size is 250. . . . . . 226
Figure A.23 A plot showing the bias of the estimated y-coordinate of the animal's UD centre µy under the bias-corrected and bias-uncorrected models vs. the types of observers. . . . . . 228
Figure A.24 A plot showing the mean squared prediction error (MSPE) of the estimated animal's UD under the bias-corrected and bias-uncorrected models vs. the types of observers. . . . . . 229
Figure A.25 A plot showing the bias of the estimated animal's UD centre µy under the bias-corrected and bias-uncorrected models vs. the types of observers. . . . . . 230
Figure A.26 A plot showing the bias of the estimated y-coordinate of the animal's UD centre µy under the bias-corrected and bias-uncorrected models vs. the types of observers. . . . . . 232
Figure A.27 A plot showing the mean squared prediction error (MSPE) of the animal's UD under the bias-corrected and bias-uncorrected models vs. the types of observers. . . . . . 233
Figure A.28 A plot showing the bias of the estimated animal's UD centre µy under the bias-corrected, bias-uncorrected, and overlap-corrected models for the twenty mobile observers. . . . . . 235
Figure A.29 A plot showing the mean squared prediction error (MSPE) of the estimated animal's UD under the bias-corrected, bias-uncorrected, and overlap-corrected models for the twenty mobile observers. . . . . . 236
Figure A.30 A plot showing the assumed causal DAG for the proposed framework, with the detection probability assumed constant. . . . . . 237
Figure A.31 The computational mesh on the left and the corresponding dual mesh on the right, formed by constructing Voronoi polygons around the mesh vertices. . . . . . 243
Figure A.32 A plot showing the posterior probability that the sum of the three pods' intensities across the region takes a value in the upper 30% for the month of May. . . . . . 245
Figure A.33 A plot showing the total observed number of sightings made per month, with the posterior 95% credible intervals shown. Results shown are for Model 8 with MC observer effort error. . . . . . 249
Figure A.34 Plots showing the average monthly sea-surface temperatures in degrees Celsius (top 6) and the natural logarithm of chlorophyll-a concentrations in mg/m³ (bottom 6). The averages have been taken over the years 2009-2016. . . . . . 253
Figure A.35 A plot showing the posterior mean and posterior 95% credible intervals of the pod-specific (sum-to-zero constrained) random walk monthly effect from the 'best' model with Monte Carlo observer effort error included. . . . . . 254
Figure A.36 A plot showing the posterior standard deviation of the sum of the SRKW intensities for the three pods, for the month of May. . . . . . 255

Glossary

BCCSN  BC Cetacean Sightings Network
BS  Black Smoke
DAG  Directed Acyclic Graph
DFO  Department of Fisheries and Oceans Canada
DIC  Deviance Information Criterion
EPA  United States Environmental Protection Agency
EU  European Union
FLOPS  Floating-Point Operations
GB  Great Britain
GLM  Generalized Linear Model
GMRF  Gaussian Markov Random Field
IID  Independent and Identically Distributed
INLA  Integrated Nested Laplace Approximation
IPP  Inhomogeneous Poisson Process
LASSO  Least Absolute Shrinkage and Selection Operator
LGCP  Log-Gaussian Cox Process
MC  Monte Carlo
MCAR  Missing Completely At Random
MCMC  Markov Chain Monte Carlo
MSPE  Mean Squared Prediction Error
NN  Nearest Neighbour
NOAA  National Oceanic and Atmospheric Administration
PDF  Probability Density Function
PS  Preferential Sampling or Preferentially Sampled, depending on the context of the sentence
Q-Q  Quantile-Quantile
SDM  Species Distribution Model
SMRU  Sea Mammal Research Unit
SPDE  Stochastic Partial Differential Equation
SRKW  Southern Resident Killer Whale
STGLMM  Spatio-Temporal Generalized Linear Mixed-effects Model
UBC  University of British Columbia
UD  Utilization Distribution
UK  United Kingdom

Acknowledgments

I am incredibly fortunate to have been immersed in the most welcoming, supportive, and diverse environment throughout my PhD journey.
The open-door atmosphere enjoyed at UBC's Statistics Department is something that should be truly celebrated. Whilst there are numerous students, staff, and faculty of the department who have helped to shape my PhD in some way, I would like to particularly thank a few individuals.

First and foremost, I would like to thank my supervisor Dr. Jim Zidek and my co-supervisor Dr. Marie Auger-Méthé for their continuous support, inspiration, and dedication throughout. I could not have asked for better supervisors! To Jim, I will greatly miss our weekly chats and your personal anecdotes (many of which involve the statistical 'Gods'). I thank you for inviting me to present at numerous conferences and workshops in my first year. It is thanks to you that I was able to begin developing my research network and to overcome a major fear of mine, public speaking, early in my PhD. I hope to one day have a wealth of knowledge as great as yours. To Marie, your dedication and commitment to your students is truly special. I still don't know where you find your energy! It is thanks to you that I discovered a passion for statistical ecology, which I hope to pursue in the future. Your consistent faith in me as a researcher gave me the confidence I needed to pursue academia further.

During the final stages of my PhD journey, I was fortunate enough to have two brilliant professors, Dr. Nancy E. Heckman and Dr. Lang Wu, provide feedback on my thesis. I thank you both for agreeing to spend your time and energy on my thesis! I truly appreciate the unique insights that you both offer. To Lang, I would also like to thank you for introducing me to the literature on the use of joint models to account for response-biased sampling within biostatistical longitudinal studies. The majority of the work presented in this thesis was inspired by this literature.
I would also like to express my deepest gratitude to all the members of my examining committee.

I would never have made it to UBC in the first place without the constant love and support of my parents, Lynn and Ian. I will be forever grateful to them for always listening when I need help, and for providing advice that puts even the best agony aunt to shame! Finally, I thank Sammy Marsh for making my time in Vancouver so special.

In addition to the general acknowledgements above, I would also like to express my gratitude for the help I received with Chapters 3 and 5, which are largely based on papers. For the work in Chapter 3, I would like to thank the co-authors of the paper, Jim Zidek and Gavin Shaddick. I would also like to thank the Associate Editor and the anonymous referees for their insightful comments. Their constructive feedback greatly helped to improve the focus of the paper on which the chapter is based. For the work in Chapter 5, I would like to thank the co-authors of the paper: Marie Auger-Méthé, Ruth Joy, Dominic Tollit, and Sheila Thornton. I would like to thank the DFO, BCCSN, The Whale Museum, and NOAA for access to sightings databases. I thank Jason Wood (SMRU), Jennifer Olson (The Whale Museum), and Taylor Shedd (Soundwatch) for their detailed insight into the operations of the whale-watch industry. Additionally, I would like to thank Eagle Wing Whale & Wildlife Tours for their substantial help with understanding the whale-watching operations out of Victoria. This message of thanks extends to various other whale-watch companies who also provided assistance.

Dedication

To Sammy Marsh, for helping me see colour in the world on the greyest of days.

Chapter 1

Introduction

"You can't fix by analysis what you bungled by design."
— Light, Singer and Willett (1990), page v

"How were the data collected?" "Who collected the data?" "Why were these sample units chosen?" "Do the sample units reflect the population as a whole?"
These are fundamental questions that are asked by statisticians when they first encounter a set of claims and conclusions drawn from a new study. Analysts of the study are expected to justify how and why their sample units were chosen from the population. The reason for this scrutiny is simple and summarized well in the quote above. Inconsistencies between statistical inference and the truth are commonplace when a disconnect exists between the sample units and the population they were drawn from [Heckman, 1990]. Although methods exist for adjusting for poor design [Hernan and Robins, 2020], post hoc analysis can rarely fix these inconsistencies. It is this disconnect between sample units and the population that continues to draw skepticism towards conclusions drawn from observational studies [Ebrahim and Smith, 2008].

Yet one domain of statistics has largely avoided such scrutiny despite its great importance: the domain of spatio-temporal modelling. Since Cressie's classic text [Cressie, 1993] laid its foundation, interest has expanded tremendously and numerous books have been written (e.g., Banerjee et al. [2015] and Cressie and Wikle [2011]). But when presented with spatio-temporal data to model, statisticians typically ask few questions about how and why the sampled locations and times were chosen. Worse still, it is commonplace for these locations and times to be chosen to meet objectives related to the process being measured [Schumacher and Zidek, 1993]. For example, observations may be taken at locations and times where the process is expected to be 'highest' [EPA, 2005, Loperfido and Guttorp, 2008]. Intuitively, this can induce a dependence between the chosen locations and times and the process being measured. Surprisingly little is known about the possible consequences that a biased sampling design can have on the validity of statistical inference in spatio-temporal settings.
Even fewer methods are available for adjusting inference for biases in the sampling design.

In spatio-temporal applications, a collection of (possibly noisy) observations of a spatio-temporal process is collected. We say that the data were preferentially sampled (PS), or equivalently that preferential sampling (PS) occurred, if a stochastic dependence exists between the choice of sampling locations and/or times and the underlying spatio-temporal process being measured [Diggle et al., 2010]. Thus PS is a special case of response-biased sampling. Interest in PS was sparked in 2010 by the landmark paper of Diggle et al. [2010]. The authors demonstrated that the consequences of PS for spatial inference can be severe, with both spatial prediction and parameter estimation affected.

Since that landmark paper, PS has been identified as a major concern across multiple fields, including environmental statistics, ecology, and econometrics [Fithian et al., 2015, Gelfand and Shirota, 2019, Paci et al., 2020, Pennino et al., 2019, Shaddick et al., 2016, Zidek et al., 2014]. Yet, thus far, PS has only been considered in the spatial-only setting, with no method yet proposed for modelling PS in the spatio-temporal setting. Furthermore, no general methodology exists for testing for PS. These methodologies are sorely needed. Most processes of interest are dynamic in time, and recent computational, methodological, and software advances have enabled spatio-temporal analyses to be performed with relative ease [Bakka et al., 2018]. Thus, the frequency of spatio-temporal analyses is expected to increase into the foreseeable future.

In the remainder of this dissertation, we aim to: demonstrate that PS can be highly problematic for spatio-temporal analyses; present new methods for testing for PS; and present new methods for adjusting statistical inference in its presence.
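The bias mechanism at the heart of PS can be made concrete with a small, purely illustrative simulation. This sketch is not code from the dissertation: the one-dimensional "pollution surface", the sample size of 40, and the preferential-intensity parameter `gamma` are all invented for illustration. Sampling locations whose inclusion probability grows with the process value itself yield a naive mean that overestimates the true regional average:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical 1-D "pollution surface" on a fine grid.
grid = np.linspace(0.0, 1.0, 200)
latent = 30 + 10 * np.sin(2 * np.pi * grid) + rng.normal(0, 1, grid.size)
true_mean = latent.mean()

# Non-preferential design: 40 locations chosen uniformly at random.
uniform_idx = rng.choice(grid.size, size=40, replace=False)

# Preferential design: inclusion probability increases with the process
# value itself (gamma > 0 mimics siting monitors where pollution is
# expected to be high).
gamma = 0.3
weights = np.exp(gamma * latent)
pref_idx = rng.choice(grid.size, size=40, replace=False, p=weights / weights.sum())

print(f"true mean over the region:        {true_mean:.1f}")
print(f"naive mean, uniform design:       {latent[uniform_idx].mean():.1f}")
print(f"naive mean, preferential design:  {latent[pref_idx].mean():.1f}")
```

Under this toy mechanism the preferential-design sample concentrates near the peak of the surface, so its naive mean sits well above the true regional mean: the estimator is biased precisely because the design depends stochastically on the process values.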
Chapter 2 provides a review of statistical methods for modelling spatio-temporal data and provides a clear definition of PS. Spatio-temporal data can be classified into one of three types: discrete-space spatio-temporal data, continuous-space spatio-temporal data, and spatio-temporal point-pattern data. Preferential sampling can affect the statistical analyses of all three types of data, and different tools are required for modelling PS in each type. We briefly present toy examples of data generating mechanisms that give rise to preferential sampling in all three settings. For each setting, we then highlight the negative impact that PS can have on statistical inference when it is ignored. Finally, we introduce a popular class of hierarchical models, called spatio-temporal generalized linear mixed-effects models (STGLMMs). We show that these can provide a highly flexible framework for modelling preferentially-sampled spatio-temporal data of all three types. Then, we introduce a computational approach, called the integrated nested Laplace approximation (INLA), that can fit STGLMMs quickly and efficiently. Chapter 2 helps to provide context for the later Chapters.

Chapter 3 outlines the first general framework for modelling PS data in both discrete-space and continuous-space spatio-temporal settings. We demonstrate its utility by modelling historical black smoke levels in the United Kingdom. We show that the processes that determined where to place pollution monitoring sites may have had a severe impact on historical estimates of black smoke levels, including population-average exposure levels.

Chapter 4 outlines the first general test, insofar as we are aware, for preferential sampling in spatio-temporal data. We demonstrate that the test attains high power across numerous response types (continuous, count, etc.), in both the spatial and spatio-temporal settings, for both continuous-space and discrete-space spatio-temporal data, even in settings with small sample sizes.
Both the results presented in Chapter 3 and the results presented in Diggle et al. [2010] on Galician lead concentration levels are replicated. Chapter 5 considers the effect of preferential sampling in point-pattern data. In particular, we consider the ecological application of estimating the space use of a highly mobile species, and develop a general modelling framework. Then, we apply it to a case study on killer whales (Orcinus orca). Finally, we provide a discussion of all the work and address the possible avenues for future research in Chapter 6.

Chapter 2

Background on Spatio-temporal Statistics

“Everything is related to everything else, but near things are more related than distant things.”
— Tobler (1970)

Spatio-temporal statistics is the study of how to model and perform statistical inference on data collected in space and time. As Tobler’s First Law of Geography states above, measurements taken close together in space and time are often more likely to be similar than those taken far apart. For conclusions to be valid, a statistical analysis must consider the additional autocorrelations that may exist between measurements taken across space and time. Methods for accounting for autocorrelations present within data collected across space and time are a primary focus of spatio-temporal statistics.

The statistical analysis of spatio-temporal data differs fundamentally from ‘classical’ statistics in many aspects. Firstly, in the spatial-only setting, it is typical for only a single replication of the process to have been observed [Cressie, 1993]. This ‘sample size of one’ scenario then requires that additional assumptions be placed on the underlying process to provide a foundation for statistical inference. Concepts such as stationarity help to create ‘pseudo-replicates’ of the process, which enable various of its characteristics to be statistically identified.
Secondly, unlike many domains of ‘classical’ statistics, it is rare for observed spatio-temporal data to be assumed to be an independent and identically distributed (IID) sample. Instead, it is commonly believed that additional spatial, temporal, and spatio-temporal correlations exist due to a multitude of latent processes [Le and Zidek, 2006]. These processes are believed to drive the response and make observations taken close together in space and/or time appear more similar than those that are highly separated in space and/or time. Adjusting inference to account for the effects of these residual correlations requires careful treatment by the analyst.

In this dissertation, we consider descriptive models. Mechanistic models require a deeper understanding of the subject-specific matter, so that a system of equations can be developed for capturing the dynamics of the spatio-temporal phenomenon through time [Wikle, 2015]. On the other hand, descriptive models attempt only to describe the first two moments of the spatio-temporal phenomenon with mean and covariance functions. By taking a covariance-based approach, descriptive models have the benefit of being highly general. However, descriptive models risk painting an oversimplified picture of complicated phenomena, and incorporating subject knowledge into descriptive models can prove challenging [Wikle, 2015]. Despite these limitations, we only consider descriptive models hereafter.

The literature on descriptive models for spatio-temporal data is vast [Blangiardo and Cameletti, 2015, Cressie and Wikle, 2011, Le and Zidek, 2006]. Multiple methods including kernel density smoothers [Hallin et al., 2004], splines [Wood et al., 2017], and random fields [van Lieshout, 2019] have all been proposed to capture additional spatio-temporal correlations unaccounted for in standard analyses.
Throughout this dissertation, we consider only the use of Gaussian processes, also known as Gaussian random fields in the context of spatio-temporal applications. Gaussian processes have the advantages of being extremely flexible and highly data-driven, without the need for additional tuning or bandwidth parameters that often require ad hoc methods for their estimation. Furthermore, they are extremely useful for uncertainty quantification, in that they naturally provide a full distribution at prediction locations instead of merely providing a point estimate.

Throughout this section we adapt the definitions found in van Lieshout [2019]. We denote a Gaussian process at a point in space s ∈ Ω and time t ∈ T as Z(s, t). Formally, a Gaussian process on the dense index set I = (Ω × T) ⊂ R^2 × R is defined as follows:

Definition 2.1 The family (Z(s, t))_{(s,t) ∈ I} indexed on I is a Gaussian process (or Gaussian random field) if, for any finite set of indices (i.e. points in space and time) (s_1, t_1), ..., (s_n, t_n), the random vector (Z(s_1, t_1), ..., Z(s_n, t_n))^T is multivariate Gaussian distributed.

The above definition can be scaled up to higher dimensions, or to the spatial-only setting with T a singleton. The multivariate Gaussian distribution is fully determined by its mean and covariance matrix.
This property carries over to the Gaussian process, with only the mean function and covariance function required to fully determine the Gaussian process.

Definition 2.2 The mean function at any point in space and time (s, t) is denoted µ(s, t) and is defined as:

µ(s, t) = E(Z(s, t)).

The covariance function at any two points in space and time (s_i, t_i), (s_j, t_j) is denoted Σ(s_i, s_j; t_i, t_j) and is defined as:

Σ(s_i, s_j; t_i, t_j) = Cov(Z(s_i, t_i), Z(s_j, t_j)).

The only restriction required for Σ(·, ·; ·, ·) to be a valid covariance function of a Gaussian process is that it be positive-definite [van Lieshout, 2019]. This implies that, for all collections of locations and times, the resulting covariance matrix must be a positive semi-definite matrix. A valid covariance function that is commonly used in spatial analyses is the Matérn covariance. The Matérn covariance depends on three parameters, and different combinations of these can result in a Gaussian process that can exhibit a wide range of properties. For example, the orders of mean-square differentiability, and the minimum distance at which two points become ‘approximately independent’, can be readily defined by the parameters [Diggle et al., 2007]. For a given set of parameters, the value of the function evaluated at two points depends only on their interpoint distance. The Matérn covariance function between two points separated by a distance of d units, with non-negative parameter values σ, ν, and ρ, is defined as:

C(d; σ, ν, ρ) = σ² (2^{1−ν} / Γ(ν)) (√(2ν) d / ρ)^ν K_ν(√(2ν) d / ρ),

where Γ(·) is the gamma function and K_ν is the modified Bessel function of the second kind.

For a given set of n values y = (y_1, ..., y_n)^T ∈ R^n observed at locations and times I = (s_1, t_1), ..., (s_n, t_n), the probability density function (pdf) of a multivariate Gaussian distribution with given mean and covariance functions µ and Σ can be specified. Let µ and Σ denote the n-vector of mean values and the n × n covariance matrix appropriately evaluated at I.
The pdf is:

π(y | µ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2) (y − µ)^T Σ^{−1} (y − µ) ).

Here, we have used π to denote a probability density, a superscript T to denote a matrix or vector transpose, and the vertical bar to be read as ‘conditional upon’. The multivariate Gaussian distribution above has many nice properties. For example, if two variables y_i, y_j are uncorrelated (i.e. Σ_{i,j} = 0), then they are statistically independent. A dual property can be deduced from the precision matrix Σ^{−1}, defined as the inverse of the covariance matrix. The precision matrix is often given the notation Q. If Q_{i,j}, the (i, j)th entry of the precision matrix Q, is identically zero, then the variables y_i, y_j are conditionally independent, given the values of all the other variables y_{−(i,j)} = y \ (y_i, y_j) [Rue and Held, 2005]. Thus Q_{i,j} = 0 implies the following Markovian property:

π(y_i | y_j, y_{−(i,j)}) = π(y_i, y_j | y_{−(i,j)}) / π(y_j | y_{−(i,j)})    (Bayes’ rule)
= π(y_i | y_{−(i,j)}) π(y_j | y_{−(i,j)}) / π(y_j | y_{−(i,j)})    (by conditional independence)
= π(y_i | y_{−(i,j)}).

This latter property has been exploited for model specification and for making advances in computational efficiency [Blangiardo and Cameletti, 2015, Lindgren et al., 2011b]. The conditional independence allows analysts to specify covariance matrices that are guaranteed to be positive definite, whilst being interpretable and justifiable on logical grounds [Besag, 1974b]. Whilst the true latent process being modelled is unlikely to satisfy this conditional independence property, the property provides a useful mathematical structure for approximating the latent dependency structures.

For example, a set of ‘neighbours’ N(i) may be defined for each location s_i and time t_i, such that the variable y_i can be assumed to be conditionally independent of all the other variables y_j : j ∉ N(i), given its neighbours y_k : k ∈ N(i). This can equivalently be specified as Q_{i,j} = 0 iff j ∉ N(i). This is called the local Markov property.
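The link between zeros in Q and conditional independence is easy to verify numerically. The following is a small illustrative sketch of our own (not code from this dissertation, whose analyses use INLA): a tridiagonal precision matrix encodes a chain of conditional dependencies, while the implied covariance matrix is fully dense.

```python
import numpy as np

# A tridiagonal precision matrix Q for five variables: each variable is
# conditionally dependent only on its immediate neighbours.
n = 5
Q = np.zeros((n, n))
np.fill_diagonal(Q, 2.0)
for i in range(n - 1):
    Q[i, i + 1] = Q[i + 1, i] = -0.8

# The implied covariance matrix Sigma = Q^{-1} is dense: y_0 and y_4 are
# marginally correlated even though Q[0, 4] = 0.
Sigma = np.linalg.inv(Q)

# Q[0, 4] = 0 means y_0 and y_4 are conditionally independent given the
# rest: the off-diagonal entry of their joint conditional precision,
# which is just the corresponding 2 x 2 submatrix of Q, is zero.
cond_prec = Q[np.ix_([0, 4], [0, 4])]
```

Here the zero off-diagonal of `cond_prec` is exactly the statement π(y_0 | y_4, y_{−(0,4)}) = π(y_0 | y_{−(0,4)}) derived above, even though the marginal covariance `Sigma[0, 4]` is non-zero.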
Popular criteria used for defining whether or not two locations/regions are neighbours include: whether or not they share a common contiguous border, and whether or not their inter-point distance is less than some threshold.

Classes of flexible sparse precision matrices have been developed for modelling general spatio-temporal phenomena. When a sparse precision matrix is used, computational tricks involving sparse matrix algebra can then be applied for implementing a Gaussian process in a fast and memory-saving manner [Rue and Held, 2005]. The main gains in computational speed arise from the computation of the sparse matrix determinants and inverses. In fact, the computational order of evaluating the above pdf for a spatio-temporal Gaussian process defined with a sparse precision matrix, instead of with a dense covariance matrix, can be reduced from O(n³) floating-point operations (flops) to O(n^{3/2}) flops [Lindgren et al., 2011b] for processes defined in continuous-space, or to O(n) flops [Katzfuss et al., 2020] for processes defined in discrete-space.

Gaussian processes have proven to be an effective tool for modelling residual spatio-temporal correlations in general spatio-temporal applications [Diggle et al., 2007]. As we will discuss later, the methods from spatio-temporal statistics largely fall into three categories, depending on the nature of the data and the inferential goals [van Lieshout, 2019]. The first category of methods are referred to as discrete-space spatio-temporal methods [Blangiardo and Cameletti, 2015]. These are concerned with the modelling of data collected at discrete ‘sites’. These ‘sites’ are typically well-defined areal units such as pixels or census districts. Thus, Ω is considered a finite set for these methods.
Observations are typically aggregated measures or counts, with the goal of inference being noise-reduction or the ‘smoothing’ of the process [van Lieshout, 2019].

The second set of methods is known as geostatistical methods and their spatio-temporal extensions, sometimes referred to as continuous-space spatio-temporal methods [Diggle et al., 2007]. These are concerned with the modelling of point-referenced data, collected in continuous-space. Thus, Ω is considered dense for these methods. Spatio-temporal interpolation is often a primary objective of these methods.

The final class of methods, referred to as spatio-temporal point process methods, are concerned with the modelling of a spatio-temporal point-pattern, or point-pattern for short. The point-patterns we consider for modelling are sets of geographical points and times deemed to be random realizations from some process of interest [Baddeley et al., 2015, Illian et al., 2008]. The points may refer to events of interest (e.g. diseases, earthquake epicentres, etc.), with the goal being the explanation of the driving forces behind the events and the identification of regional ‘hotspots’. This is the only class of methods out of the three to consider the locations as random. Marked point processes consider the setting where the point-pattern, and a set of observations taken at each point, are both considered random variables [Schlather et al., 2004].

Whilst the earlier definition of a Gaussian process was given for a dense index set, the index set, I, may also be a finite set. Once again, the main requirement is that the covariance matrix must be strictly positive definite. Numerous classes of valid Gaussian processes have been developed that can be defined on discrete sets. An especially popular subclass are Gaussian Markov random fields (GMRFs). The addition of the ‘Markov’ name arises from the Markovian conditional independence property encoded in their sparse precision matrices.
As with their continuous counterparts, GMRFs are completely specified by their mean vector and covariance matrix. Popular examples include conditional autoregressive and simultaneous autoregressive models [Besag and Kooperberg, 1995, van Lieshout, 2019].

For the remainder of this Chapter, we present how PS can arise in all three types of spatio-temporal data, and how it can impact statistical inference. To achieve this, we present a toy example of PS for each type of data. In each example, we first generate two similar-sized datasets: one from a PS data generating mechanism, and one from a data generating mechanism free from PS. We then fit ‘standard’ models to both datasets and investigate the conclusions drawn from each model. This helps us to isolate the impacts that PS has on the statistical inference. The ‘standard’ statistical models used to perform the statistical inference are all examples of a popular class of models referred to as spatio-temporal generalized linear mixed-effects models (STGLMMs). We close the Chapter by rigorously defining this class of models, before providing details on a computational tool called INLA that is commonly employed to fit STGLMMs. This class of models and computational tools is used throughout the later Chapters. Note that throughout this dissertation, we consider spatial statistics, the study of how to model and perform statistical inference on data collected across space, to be a subfield of spatio-temporal statistics.

2.1 Preferential sampling in discrete-space spatio-temporal data

For general discrete-space spatio-temporal data, we assume we have a finite index set I of size M, which defines the population of discrete areal units A_i ⊂ Ω : i ∈ {1, ..., M}. We assume a finite set of N times t_j ∈ T : j ∈ {1, ..., N}, defining the temporal domain. These may define points in time, or time intervals, in which case t_j ⊂ T. We assume that observations are taken at a subset of the areal and temporal units.
We collect these observations into a response vector, denoted y. Throughout the dissertation, we vectorize all spatio-temporal terms for convenience, and avoid matrix-variate distributions. We denote Y as the random variable associated with the response. We will often refer to the areal units as sites.

By Tobler’s First Law of Geography, introduced earlier, we assume that the observations y, taken close together in space and/or time, are more likely to be similar. Throughout this dissertation, we use a Gaussian process for capturing these additional spatio-temporal correlations. In discrete space, we use a GMRF and denote it Z. We assume Z is valid, and has been specified with MN × MN covariance matrix Σ = Q^{−1} and a constant mean MN-vector with entries β_0. In general, covariates collected at each spatial site and time, denoted x_{i,j}, may also be included in a model for Y.

We now give a toy example to illustrate both the appearance and the consequences of PS in the discrete-space setting. We repeat this for the other two data types later. For simplicity, we consider the spatial-only setting throughout all examples.

A toy example

For the toy example in this section, we assume that no covariates are present and we assume that the response vector Y is a set of noise-free observations of the Gaussian process Z. We assume that values of the response y_i are observed at a subset of the M sites. Let R_i denote the indicator variable for site A_i ∈ I, with R_i = 1 indicating that we observe the process at site i. We take a Bayesian approach to statistical inference throughout the dissertation. With this in mind, and with the above assumptions, the model we use for generating the data in the toy example is:

(Y_i | R_i = 1) = Z_i,    i ∈ {1, ..., M}
[Z_1, ..., Z_M]^T = Z ∼ N(β_0, Σ(θ))
R_i ∼ Bernoulli(p_i)
Θ = (β_0, θ, p) ∼ Priors.

The Bernoulli process defining the site-selection indicators is rarely stated within a statistical analysis of spatio-temporal data.
This suggests most statistical analyses are conditioned on the observed locations or sites chosen for sampling as being fixed. This is equivalent to the assumption that the probability function p is independent of Z. Explicitly defining the Bernoulli sampling process helps to highlight the assumptions being made on the missingness mechanism of the data. For example, the assumption of a constant probability (p_i ≡ p) implies the sites are assumed by the analyst to be ‘missing completely at random’ (MCAR) [Wu, 2009]. Under this MCAR assumption, the Bernoulli process may be ignored when modelling y without biasing the statistical inference.

In this dissertation, we investigate the impact that different functional forms of p_i can have on the statistical inference of Y. By our earlier definition, when p_i is related to Z (e.g. through a logit link), then a stochastic dependence exists between p_i and Z, and thus the discrete-space spatio-temporal data have been preferentially sampled.

We define 1078 discrete spatial units and generate a single realisation of the GMRF model above to form the complete data Z. Next, we generate the two datasets in the spatial-only setting by changing only the specification of p_i. The first sampling scheme sees the value of Z_i appear linearly in the logit of p_i with a positive parameter. This preferentially selects for observation the sites with the highest values of Z_i, and preferentially censors the spatial units with the lowest values of Z_i. Plots of the true field and the 184 preferentially-selected areal units are shown in panels A and B of Figure 2.1 respectively. Data are recorded more often in areal units situated in the South-Western corner of the region. These are precisely the areal units with the highest values of the process. The converse is seen in the South-Eastern and South-Central regions.

The second sampling scheme sees p_i ≡ p. Under this scheme, sites are sampled completely at random.
This provides an example of an MCAR site-selection process, free of preferential sampling. The units are all equally likely to be selected, and hence the region is expected to be consistently represented across space. Under this sampling scheme, we should be able to condition on the selected sites as fixed without biasing inference. For brevity, we omit a plot of these chosen areal units. We fix the number of selected areal units to be 184 to remove any effects of sample size from the conclusions. In both cases, we fit the correctly specified model for Y, albeit with the selection process determining R_i ignored. As mentioned earlier, the selection process is typically ignored in practice. This allows us to highlight the impacts that PS can have on the statistical inference of discrete spatial data. These concepts naturally transfer into the spatio-temporal setting.

The posterior mean values of the field are then estimated across all 1078 areal units. These are shown in panels C and D of Figure 2.1 for the MCAR and PS datasets respectively. Clearly, the posterior estimates of the field from the PS data consistently overestimate the true values throughout the data-sparse regions. Conversely, those from the MCAR data appear to reflect the true values much better. This disparity can best be summarized by looking at the posterior distributions of the intercept β_0 from the two models. The posterior mean and 95% credible intervals from the models fit to the PS and the MCAR data are respectively -0.378 (-0.449, -0.307) and -0.795 (-0.852, -0.738). For comparison, the values for the model fit to the complete data are -0.8 (-0.802, -0.798). The posterior distribution of the intercept is positively-biased for the PS data. This bias is a defining characteristic of PS in discrete-space spatial data, and the bias carries over to the spatio-temporal setting. In Chapter 3, we develop a general framework for adjusting a statistical analysis of spatio-temporal data to PS.
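The qualitative behaviour of this toy example can be reproduced in a few lines of simulation. The sketch below is our own simplified stand-in, not the analysis behind Figure 2.1: it uses a one-dimensional Gaussian field with exponential covariance in place of the 1078-unit GMRF, and compares raw sample means rather than full Bayesian posteriors; all names and parameter values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# One draw of a latent Gaussian field Z over M sites on a line, with
# intercept beta0 = -0.8 and an exponential covariance (a stand-in for
# the GMRF over 1078 areal units used in the text).
M = 500
coords = np.linspace(0.0, 1.0, M)
Sigma = np.exp(-np.abs(coords[:, None] - coords[None, :]) / 0.2)
beta0 = -0.8
L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(M))
Z = beta0 + L @ rng.standard_normal(M)

# Preferential scheme: Z_i enters the logit of p_i with a positive
# coefficient, so high-Z sites are over-represented among observed sites.
p_ps = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * (Z - beta0))))
R_ps = rng.random(M) < p_ps

# MCAR scheme: a constant selection probability, matched on sample size.
R_mcar = rng.random(M) < p_ps.mean()

mean_ps = Z[R_ps].mean()      # tends to overestimate the field average
mean_mcar = Z[R_mcar].mean()  # roughly tracks the realized field average
```

The positive bias of `mean_ps` relative to `mean_mcar` mirrors the positively-biased posterior intercept reported for the PS dataset above.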
Figure 2.1: Plots of the toy simulation study. Panel A shows the complete dataset simulated from a Gaussian Markov random field. Panel B shows the realisation of the preferentially-sampled Bernoulli sampling process. Panel C shows the posterior means of the field from the model fit to the data sampled from the MCAR process. Panel D shows the posterior means of the field from the model fit to the preferentially-sampled data. Observe how poorly D characterizes the field in A compared with C.

In Chapter 4, we develop a test for PS in spatio-temporal data. Both apply to the discrete-space setting.

2.2 Preferential sampling in continuous spatio-temporal data

For general continuous-space spatio-temporal data, the index set I is a dense subset Ω × T of R^2 × R. Data are collected at a finite set of spatio-temporal locations, denoted (S, T) = (s_1, t_1), ..., (s_n, t_n), and vectorized into a response vector y. Once again, we will include a Gaussian process Z(s, t) within the model used to describe Y. Within the STGLMM framework introduced later, the random variables of the response at each space-time coordinate (s, t), denoted Y(s, t), are then assumed to be independently distributed, given Z(s, t). The Gaussian process Z(s, t) is well-defined at all locations and times within Ω × T with mean function 0 and covariance function Σ(·, ·; ·, ·). In general, covariates x(s, t) may be available at each of the finite set of locations. Often, the goal of spatio-temporal analyses of continuous-space data is spatio-temporal interpolation or prediction.

As before, we now present a toy example. Again, we consider the spatial-only setting and ignore covariates.

A toy example

We assume that a finite set of locations, denoted S, are generated from a point process. Details of point processes are found in the next section. At each of these locations s_i ∈ S, a noise-free observation of Z(s_i) is made and stored in the vector y.
Under the above assumptions, the model we use for generating the data in the toy example is:

(Y(s) | s ∈ S) = Z(s)
Z(S) ∼ N(β_0, Σ(θ))
S ∼ Point process(λ(s))
Θ = (β_0, θ, λ) ∼ Priors.

As before, analysts rarely state their assumptions about the sampling process that generated the set of locations at which observations of the process were made. Instead, analysts typically condition on the locations S as fixed when performing a statistical analysis of a continuous-space spatial dataset [Diggle et al., 2010]. In this dissertation, we argue that the assumed sampling processes should be explicitly defined. Clearly stating our beliefs on how the locations were sampled helps to highlight the possible consequences that conditioning on the locations as fixed, and hence ignoring the sampling process, can have on the statistical inference of Y. For example, if we know a priori that all regions were sampled with equal intensity, and hence that λ(s) ≡ λ ∀ s ∈ Ω, then the sampling process may be ignored within a statistical analysis of Y without biasing the inference. However, when λ(s) ∝ Z, the data are said to have been preferentially sampled, and ignoring the sampling process can bias the statistical inference about Y [Gelfand et al., 2012].

We simulate a Gaussian process with a spatial Matérn covariance function on the unit square, with parameters chosen to ensure the spatial range and marginal standard deviation are equal to 0.7 and 1 respectively. The realisation is shown in panel A of Figure 2.2. The values of the field appear larger in the West and smaller in the East. Next, we define the two sampling schemes that generate the locations at which Z(s) is observed without error. In the first scheme, sampling is preferential. The sampling locations are a realisation of an inhomogeneous Poisson point process. The value of the field at location s, Z(s), is included linearly within the log intensity, log(λ(s)), of the point process with a positive coefficient.
Under this sampling scheme, a high density of sampling locations is expected in regions where Z(s) is high (e.g. the West). Conversely, in regions where Z(s) is small, the density of sampling locations is expected to be low (e.g. the East). A plot of the 24 chosen locations is shown in panel D of Figure 2.2. Indeed, the expected behaviour is seen.

The second sampling scheme can be thought of as the continuous analogue of MCAR. The locations at which to sample the field are a realisation of a homogeneous Poisson point process. Thus, λ(s) ≡ λ. Under this scheme, the density of sampling locations is expected to be constant throughout the unit square. Furthermore, the average intensity of this sampling scheme is set equal to that of the previous sampling scheme to encourage a similar sample size to be realized. Indeed, 27 locations are chosen to observe the process under the second sampling scheme and are shown in panel C of Figure 2.2. No obvious spatial trend in the density of the points can be seen.

Figure 2.2: A series of plots showing the toy simulation study. Panel A shows the complete data that were simulated from a Gaussian random field. Panel B shows the bias of the posterior mean from the model fit to the preferentially-sampled data. Panel C shows the posterior means of the field from the model fit to the MCAR data. The locations at which samples of the field were taken are shown in black. Panel D shows the posterior means of the field from the model fit to the preferentially-sampled data. The locations at which samples of the field were taken are shown in black.

Next, the correctly specified model for Y is fit to both datasets, albeit with the sampling process ignored by conditioning on the locations as fixed, to reflect a typical geostatistical analysis. For both models, the posterior mean values of the field are then estimated across a grid of pixels placed across the unit square.
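The preferential mechanism just described can be mimicked on a pixelated unit square. The sketch below is our own illustration rather than the code behind Figure 2.2: it simulates a Matérn Gaussian field on a coarse grid (with arbitrary parameter choices, and with `scipy` supplying Γ(·) and K_ν(·) for the Matérn covariance), then draws pixel counts from an inhomogeneous Poisson process whose log-intensity is linear in the field, alongside a homogeneous scheme matched on expected sample size.

```python
import numpy as np
from scipy.special import gamma, kv

rng = np.random.default_rng(7)

def matern(d, sigma=1.0, nu=1.5, rho=0.3):
    """Matern covariance C(d; sigma, nu, rho); parameter values arbitrary."""
    x = np.sqrt(2.0 * nu) * np.asarray(d, dtype=float) / rho
    with np.errstate(invalid="ignore"):
        c = sigma**2 * 2.0**(1.0 - nu) / gamma(nu) * x**nu * kv(nu, x)
    return np.where(d == 0, sigma**2, c)  # limit at d = 0 is sigma^2

# Simulate the Gaussian field on a coarse pixel grid over the unit square.
G = 25
xs, ys = np.meshgrid(np.linspace(0, 1, G), np.linspace(0, 1, G))
pix = np.column_stack([xs.ravel(), ys.ravel()])
D = np.linalg.norm(pix[:, None, :] - pix[None, :, :], axis=-1)
C = matern(D) + 1e-8 * np.eye(G * G)
Z = np.linalg.cholesky(C) @ rng.standard_normal(G * G)

# Preferential scheme: pixel counts from an inhomogeneous Poisson process
# with log-intensity linear in Z (positive coefficient); the homogeneous
# scheme uses the same expected total number of points.
area = 1.0 / G**2
lam_ps = np.exp(4.0 + 2.0 * Z)
counts_ps = rng.poisson(lam_ps * area)
counts_hom = rng.poisson(lam_ps.mean() * area * np.ones_like(Z))

# PS concentrates sampling where Z is high, so the average field value at
# sampled pixels exceeds that under the homogeneous (MCAR-like) scheme.
avg_ps = (Z * counts_ps).sum() / max(counts_ps.sum(), 1)
avg_hom = (Z * counts_hom).sum() / max(counts_hom.sum(), 1)
```

With ν = 1/2 the `matern` function reduces to the exponential covariance σ² exp(−d/ρ), a standard sanity check on the implementation.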
The posterior means are shown in colour behind the points in panels C and D of Figure 2.2. It can immediately be seen that the posterior mean values from the model fit to the preferentially-sampled dataset do not capture the lower tail or the contrast of the field.

To better highlight the impacts that preferential sampling has on inference, we compute the prediction bias from the model fit to the preferentially-sampled data. This is shown in panel B of Figure 2.2. It can be seen that the model overestimates the true value of the field in almost the entirety of the unit square. Once again, the abilities of the models to capture the global average value of the field can be assessed through summary statistics. The posterior average prediction bias, averaged across the pixels, is 0.01 and 0.20 for the models fit to the MCAR and preferentially-sampled datasets respectively. This prediction bias is the norm when standard statistical models are fit to preferentially sampled data [Diggle et al., 2010], including spatio-temporal data. This demonstrates that we risk biasing our inference if we ignore the sampling processes that generated the locations chosen to observe the process.

In Chapter 3, we develop a general framework for adjusting a statistical analysis of spatio-temporal data to preferential sampling. In Chapter 4, we develop a test for preferential sampling in spatio-temporal data. Both apply to the continuous-space setting.

2.3 Preferential sampling in spatio-temporal point-pattern data

Background on log-Gaussian Cox processes

Point processes are used for modelling point-pattern data. Whereas in the previous subsection a point process was used to characterize the points at which the realization of Z was observed, in this section the realization is the points themselves.
To guarantee that a valid stochastic process is defined for characterising the points, a set of underlying assumptions is required. The first definition defines the collections of point configurations that can be more easily modelled. In particular, the definition restricts us to consider only configurations that contain at most countably many points.

Definition 2.3 The family N_lf(R^2 × R) of locally finite point configurations in R^2 × R consists of all subsets (S, T) ⊂ R^2 × R such that for every bounded Borel set A ⊂ R^2 × R, finitely many points are contained within the intersection (S, T) ∩ A.

The next definition defines a point process on the family of locally finite point configurations.

Definition 2.4 A point process (S, T) on R^2 × R with realisations in N_lf(R^2 × R) is a random, locally finite configuration of points such that for all bounded Borel sets A ⊂ R^2 × R, the number of points of (S, T) that fall within A is a finite random variable that we denote N_{S,T}(A).

Whilst many classes of point process models exist, we focus our attention on the flexible class of inhomogeneous spatio-temporal Poisson point processes (IPPs). IPPs can be made even more flexible through the addition of Gaussian processes, and can then be computed by approximating their likelihood with models falling within the STGLMMs framework introduced later [Simpson et al., 2016]. Note that we remove the subscript T from the random variable N_{S,T}(·) for readability.

Definition 2.5 A spatio-temporal point process (S, T) on R^2 × R is an inhomogeneous spatio-temporal Poisson point process with intensity λ(s, t): R^2 × R → R_+ if:

• For every bounded Borel set A ⊂ R^2 × R, N_S(A) is Poisson distributed with mean Λ(A) = ∫_A λ(s̃, t̃) ds̃ dt̃,

• For any k disjoint bounded Borel sets A_1, ..., A_k, k ∈ N, the random variables N_S(A_1), ..., N_S(A_k) are independent,

• λ(·, ·) is integrable on bounded sets.

The IPP is a very flexible model.
The intensity λ(·, ·) may be allowed to vary across space and time according to a set of linear and nonlinear covariates, including splines. The likelihood of a point-pattern Y = (S, T) observed within bounded spatial and temporal domains Ω and T is:

π(Y | λ(·, ·)) = exp{ |Ω||T| − ∫_Ω ∫_T λ(s, t) dt ds } ∏_{(s_i, t_i) ∈ Y} λ(s_i, t_i),    (2.1)

with |Ω| and |T| denoting the area and length of the study region and temporal domain respectively.

However, the stringent mean-variance relationship assumed on the counts N_S(A) is undesirable in many settings. Overdispersion, which in this setting occurs when the variance of the counts exceeds the mean, frequently occurs in practice. Furthermore, additional spatio-temporal clusters may form due to the presence of unmeasured processes that drive the intensity. To alleviate these concerns, the class of Cox spatio-temporal point processes extends the Poisson process by allowing λ(·, ·) to be a realisation of a random field. If we assume the random field is a realisation of a log-Gaussian process, then we have the class of log-Gaussian Cox spatio-temporal point processes (LGCPs hereafter). LGCPs can be specified more simply in hierarchical form:

Definition 2.6 (S, T) on R^2 × R is an LGCP with intensity λ(s, t): R^2 × R → R_+, Gaussian process Z, and covariates x(s, t) ∈ R^p, if, conditioned upon Z(s, t):

• log(λ(s, t)) = β_0 + β^T x(s, t) + Z(s, t),

• (S, T) is an inhomogeneous spatio-temporal Poisson point process with intensity λ(·, ·).

LGCPs are an especially popular class of point process. The hierarchical definition given above makes them relatively easy to fit compared with other competing models [Simpson et al., 2016]. Furthermore, the flexibility of Gaussian processes carries over, allowing for point processes with highly flexible properties to be specified [Yuan et al., 2017]. Marginally, the LGCP can have very general mean-variance relationships due to the various covariance structures that can be specified on the Gaussian process.
Furthermore, stochastic dependence can exist between the counts in disjoint regions [Baddeley et al., 2015]. This property can account for the additional clustering that is often present in real point-pattern data.

One of the most useful properties of the IPP and the LGCP within ecological [Fithian et al., 2015] and epidemiological [Johnson et al., 2019] applications is its ability to serve as an integrative modelling framework for jointly modelling datasets that were collected following different protocols. In addition to modelling the direct observations of points in space and/or time, the LGCP allows for the simultaneous modelling of aggregated counts and/or binary indicator variables of the occurrences of points within well-defined spatial units and time intervals [Hefley and Hooten, 2016]. In fact, the conditional likelihoods of these different observation processes can all be derived from the fact that the number of points within any bounded Borel set A, conditioned on knowing the Gaussian process Z, is Poisson distributed (Definitions 2.5 and 2.6). In particular, Definition 2.5 implies that, for any bounded A ⊂ Ω × T:

(N_S(A) | Z) ∼ Poisson(Λ(A))    (2.2)
P(N_S(A) > 0 | Z) = 1 − P(N_S(A) = 0 | Z) = 1 − exp(−Λ(A)).    (2.3)

Equation 2.2 shows that aggregates of point process data into counts within well-defined spatial and temporal units may be modelled as Poisson counts, conditional upon Z. If, instead of counts, the point process data are summarized by binary indicator variables of occurrence (i.e. I(N_S(A) > 0)), then equation 2.3 shows that the occurrence events may be modelled as a collection of Bernoulli random variables with probability of success equal to 1 − exp(−Λ(A)), conditional upon Z.

In either case, conditional upon knowing Z, the likelihoods of data collected from different observation processes may be combined. We will see that many of these aggregated data types will naturally take the form of spatio-temporal generalized linear mixed-effects models (STGLMMs).
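Both derived data types in (2.2) and (2.3) can be obtained from a point pattern by aggregation over grid cells; a small sketch, where the pattern and the grid are illustrative rather than taken from any dataset in the dissertation:

```python
import numpy as np

rng = np.random.default_rng(2)

# An illustrative spatial point pattern of 500 points on the unit square.
pts = rng.uniform(size=(500, 2))

# Partition the square into a 4 x 4 grid of cells A and aggregate.
edges = np.linspace(0.0, 1.0, 5)
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=[edges, edges])

# Poisson-type data: N_S(A) for each cell, as in equation (2.2).
# Bernoulli-type data: I(N_S(A) > 0) for each cell, as in equation (2.3).
occupancy = (counts > 0).astype(int)

print(int(counts.sum()), int(occupancy.sum()))  # total points; occupied cells
```

Conditional on Z, the counts follow (2.2) cell by cell, and the occupancy indicators follow (2.3), which is what allows the two observation processes to share one likelihood.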
To compute the likelihood of the LGCP, additional work is required. The integral seen in (2.1) is intractable, and therefore computational approximations are required to compute the likelihood [Simpson et al., 2016]. Approximations are often derived by taking aggregates of the point process over very small regions. Poisson regression and logistic regression methods based on (2.2) and (2.3) are two such examples [Baddeley et al., 2015]. These computational approximations to LGCPs fall within the STGLMMs framework, and thus the computational method (INLA) that we introduce in Section 2.4 can be used for fitting LGCPs.

Preferential sampling in spatio-temporal point-pattern data

Preferential sampling within spatio-temporal point-pattern data can best be summarized by a two-step process. In the first step, a spatio-temporal point-pattern (S,T) is generated. In the second step, a process acts upon (S,T), causing each space-time point (s_i, t_i) ∈ (S,T) to be either retained or discarded. The process in the second stage is known as a 'thinning process' and it determines which points of (S,T) are kept and which are discarded prior to analysis.

Only the retained points, denoted (S_0,T_0) ⊂ (S,T), are made available to the analyst. Preferential sampling then occurs when the probability that a point at location (s,t) ∈ Ω × T is retained, denoted p(s,t), depends upon the underlying intensity λ(s,t) at that location. When this occurs, the thinning process is stochastically dependent on the underlying intensity λ(s,t). As before, the statistical inference of preferentially-sampled point-pattern data may be biased. When the thinning process discards points uniformly throughout Ω × T, then the intensity λ(s,t) may be recovered from the partially-observed data without adjustment, albeit with a negatively-biased intercept within the log-linear model for λ(s,t).
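For uniform thinning this intercept bias has a simple closed form: retaining points independently with probability p multiplies the intensity by p, so the fitted log-intensity is shifted by log p. A quick numeric check, assuming a homogeneous process on the unit square for simplicity (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

lam, p = 400.0, 0.25             # true intensity on the unit square; retention probability
n_full = rng.poisson(lam)        # size of the complete pattern
n_obs = rng.binomial(n_full, p)  # size after uniform thinning

# Maximum-likelihood log-intensity estimates for a homogeneous process
# on a window with |Omega| = 1 are simply log of the observed counts.
beta_full = np.log(n_full)
beta_obs = np.log(n_obs)

# The difference estimates the intercept bias, which should sit near log(p).
print(beta_obs - beta_full, np.log(p))
```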
The magnitude of the bias depends on the proportion of points that are discarded.

In practice, the thinning process may be deterministic and retain points with certainty within certain subregions of Ω and time intervals of T, denoted Ω_0 and T_0 respectively. When this occurs, the analysis must be suitably adjusted to match the partial observations [Chakraborty et al., 2011]. Again, if Ω_0 and T_0 are chosen where λ(s,t) is high (or low), then preferential sampling occurs. Alternatively, the thinning process may be stochastic, with p(s,t) driven by factors that affect the visibility or detectability of the points [Fithian et al., 2015]. For example, weather conditions and/or terrain ruggedness may need to be considered in cases where the points represent the presence of animals at points in space and time.

In either case, thinned point processes allow for the modelling of partially-observed point processes, and the intensity of a thinned point process can be derived [Hefley and Hooten, 2016]. The intensity of the observed point-pattern, λ_obs(s,t), is the product of the true data-generating intensity, λ(s,t), and p(s,t), reflecting the imperfect detectability through space and time. Consequently, if p(s,t) can be estimated well, then the bias due to imperfect detection may be partially controlled for in an analysis.

Once again we simulate a toy example in the spatial-only setting without covariates to highlight the impacts that ignoring the sampling process can have on the statistical inference of spatial point-patterns.

A toy example

We sample a complete point-pattern from an LGCP. To do this, we simulate a Gaussian process with spatial Matérn covariance on the unit square with spatial range equal to 0.7 and standard deviation equal to 1. Next, we generate a realisation of a Poisson point process in the unit square with intensity equal to the exponential of the random field, scaled by 500.
The realized field is shown in the top left plot of Figure 2.3, with the realisation of the 867 points shown in black. These form the complete set of points. There is a clear increase in the density of the points within the central-Western region, precisely where the field is largest.

Next, we assume that the points are imperfectly detected according to two thinning processes. In the first case, the point-pattern is preferentially sampled according to a detection probability surface p(s) = Cλ(s), assumed to be unknown to the analyst. Each of the points from the complete pattern is either retained or discarded following independent Bernoulli trials with success probabilities following p(s).

[Figure 2.3 appears here: six panels, each with a colour scale — A: True Field; B: Detection Probability; C: Posterior Mean MCAR; D: Posterior Mean PS; E: Bias MCAR; F: Bias PS.]

Figure 2.3: Plots showing the toy simulation study. Panel A shows the complete point-pattern data that was simulated from a log-Gaussian Cox process. The normalized intensity surface that generated the points is shown behind the points as a raster in colour. Panel B shows the detection probability surface used to preferentially thin the point-pattern. The complete point-pattern is included. The retained and discarded points from a single realisation of a Bernoulli process with probabilities given by the detection probability surface are shown in blue and red respectively. Panel C shows the normalized posterior means of the intensity from the log-Gaussian Cox process model fit to the MCAR data, which are shown in black. Panel D shows the normalized posterior means of the intensity from the log-Gaussian Cox process model fit to the preferentially-sampled dataset, which are shown in black. Panels E and F show the prediction biases of the normalized intensities from the models fit to the MCAR and preferentially-sampled datasets respectively. The model fit to the PS data underestimates the field almost entirely throughout Ω.
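The preferential Bernoulli thinning step just described can be sketched as follows, with a stand-in surface playing the role of the realised LGCP intensity and C taken as the reciprocal of the maximum intensity so that p(s) ≤ 1 (every function and constant here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def lam(s):
    # Stand-in for the realised LGCP intensity on the unit square.
    return 500.0 * np.exp(np.sin(4.0 * s[:, 0]) + np.cos(4.0 * s[:, 1]) - 2.0)

pts = rng.uniform(size=(867, 2))  # plays the role of the complete pattern

# Preferential detection probability p(s) = C * lambda(s), with
# C approximated here by 1 / max lambda over the points.
lam_vals = lam(pts)
p_s = lam_vals / lam_vals.max()

keep = rng.uniform(size=len(pts)) < p_s  # independent Bernoulli retention
retained = pts[keep]

print(len(retained))  # points surviving the preferential thinning
```

Because retention probability rises with λ(s), the surviving points over-represent the high-intensity regions, which is exactly the behaviour visible in Panel B of Figure 2.3.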
The constant C is chosen to ensure p(s) ∈ [0,1]. Panel B of Figure 2.3 shows p(s), with the retained and discarded points plotted in blue and red respectively in the figure. It is immediately apparent that the points are only retained in regions where λ(s) is high; a telltale behaviour of preferential sampling.

The second sampling scheme is representative of an MCAR scheme in the context of point-patterns. Each point is equally likely to be retained (i.e. p(s) = p). For comparative purposes, we fix the total number of points at 303 to match the number of points sampled under the previous sampling scheme. Under this sampling scheme, each region is equally under-represented. Thus, the spatial density of the retained point-pattern, (S_0,T_0), throughout Ω is expected to closely match that of the complete pattern, with the majority of global features retained. In fact, the intensity of (S_0,T_0) is identically pλ(s). We use this fact for model comparison purposes next.

Log-Gaussian Cox processes are fit to both point-patterns, with p(s) assumed constant, and hence the sampling processes ignored. To compare the results from the two sampling schemes, we compute the posterior mean intensities (i.e. the fields) from the models and then normalize to the unit interval. Panel A of Figure 2.3 shows the true normalized intensity. Panels C and D of Figure 2.3 show the normalized posterior mean intensity from the models fit to the MCAR and preferentially-sampled datasets respectively. The retained points are plotted in black. It can be seen that the model fit to the preferentially-sampled data dramatically underestimates the field throughout Ω. In general, apart from the major hotspots, the global characteristics of the field are missed. Conversely, this is not seen from the model fit to the MCAR data.

Panels E and F in Figure 2.3 present the prediction bias of the normalized field from the two models.
The consistent underestimation of the field around the central band of Ω from the PS model can be seen. In summary, the preferential detection probability surface leads to a model that badly characterizes the field. The average prediction bias throughout Ω is -0.12 and -0.05 for the PS and MCAR models respectively, demonstrating once again that large biases in the predictions of the field are a direct consequence of PS. This is a clear demonstration that the sampling scheme must be considered within a point process analysis.

In practice, it is typically assumed that either the points were perfectly detected, the points were retained with constant p(s,t) ≡ p, or that a set of covariates is available that can fully explain p(s,t). In Chapter 5 we relax these assumptions and also allow for the underlying process Z(s,t) to drive the sampling intensity. Thus we allow for preferential sampling to be accounted for directly within a modelling framework.

2.4 Spatio-temporal generalized linear mixed-effects models

In the previous three sections, we saw examples of PS across all three types of spatio-temporal data. All of these models can be considered within the class of spatio-temporal generalized linear mixed models (STGLMMs), defined now. We focus on STGLMMs as they provide a very flexible class of models that can suitably model a large range of phenomena [Gómez-Rubio, 2020]. STGLMMs are relatively easy and fast to implement using the approximate INLA approach discussed later in this Section. Furthermore, the models used to describe preferential sampling in the three data types (the Bernoulli process, the point process, and the thinning process) can all be modelled within the STGLMMs framework [Gómez-Rubio, 2020]. Thus, we can develop joint models for PS that fall within the STGLMMs framework. We demonstrate this later in this Section.

The models seen in Sections 2.1 - 2.3 all ignored covariates.
Covariates are frequently available in practice and can be easily included within the STGLMM framework. Furthermore, the models in Sections 2.1 and 2.2 assumed that the Gaussian process Z was observed without noise. In practice, most data Y will have large amounts of measurement error and noise and will not be reasonably modelled as Gaussian. It may also be impossible to transform the responses to approximate Gaussianity without any additional skewness, heavy tails, or heteroscedasticity remaining. A prime example of such data was seen with the aggregated point process data in Section 2.3, where the responses were binary and count data. A vast library of statistical distributions exists that can account for the quirks and individualism that real data possess [Krishnamoorthy, 2016]. The STGLMM framework provides a convenient hierarchical modelling framework that can place suitable distributions on the responses Y.

SGLMMs, the spatial-only equivalent to STGLMMs, were popularized by the papers of Besag et al. [1991] and Diggle et al. [1998] in the discrete-space and continuous-space settings respectively. The basic idea of SGLMMs is to construct a generalized linear model of the response variable, conditional upon a series of spatially correlated random effects. Linear combinations of these random effects alongside covariates are then included within the linear predictor of the generalized linear model (GLM) describing the transformed expectation of the response. The GLM set-up assumes that the responses are conditionally independent of each other, given the random effects. STGLMMs then extend the SGLMM framework by allowing for spatio-temporally correlated random effects. The general set-up of STGLMMs is:

Y_{i,j} ∼ f_Y(g(μ_{i,j}), θ_Y),    f_Y a density,
g(μ_{i,j}) = η_{i,j} = x_{i,j}^T γ + Σ_{k=1}^{q} u_{i,j,k} β_k(s_i, t_j),
β_k(s_i, t_j) ∼ N(0, Σ_k(θ_k)),    k ∈ {1, ..., q},
Θ = (θ_Y, γ, θ_1, ..., θ_q) ∼ Priors,
x_{i,j} ∈ R^p,    u_{i,j} ∈ R^q.

The above framework is evaluated on finite spatial and temporal index sets i ∈ {1, ..., M} and j ∈ {1, ..., N}.
These will either correspond to the population of discrete spatial units, the set of all observation and prediction locations being considered for a continuous-space spatio-temporal analysis, or the set of all spatial and temporal units used to approximate the computation of the IPP's likelihood (2.1). The function g is known as the link function and maps the support of the expectation of the response onto the real line. The linear predictor η contains a linear combination of covariates x and latent effects β_k(·,·), possibly scaled by covariates u. These latent effects are zero-mean Gaussian distributed, possibly with white noise, spatial, temporal and/or spatio-temporal covariance structures Σ_k(θ_k) assumed on them. We change the notation from Z to β to highlight the fact that Gaussian processes can also be used for capturing spatio-temporal changes in the effects of covariates. This is done in Chapter 3.

For this dissertation, we work in the Bayesian setting. We choose this approach for two reasons. First, in spatio-temporal settings, it is often the case that numerous combinations of the hyperparameters θ_k that define the latent effects β_k(·,·) within an STGLMM may explain a given dataset 'well'. Put differently, in typical settings, the hyperparameters may only be weakly identified [Zhang, 2004]. The Bayesian approach offers a natural method for averaging over this (potentially large) set of competing models. By placing additional prior distributions on all the hyperparameters, Bayesian methods naturally perform a weighted average across all the possible predictive distributions defined by the set of possible values that the hyperparameters can take. The weights of each model are proportional to the posterior distribution of the hyperparameters in light of the data and the assumed priors [Robert, 2007].
Conversely, competing methods such as empirical Bayes or type-II maximum likelihood-based approaches condition on the estimated hyperparameters as known when forming model-based predictions of both η and μ [Le and Zidek, 2006]. Thus, they may miss (potentially large) uncertainties in the hyperparameter estimates.

Secondly, in data-sparse settings, there may also be an additional advantage offered by Bayesian methods. When knowledge is available a priori about the true levels of complexity present in the spatio-temporal phenomenon being studied, then hyperparameters may naturally be constrained to take 'sensible' values through the use of penalized complexity prior distributions [Fuglstad et al., 2018, Simpson et al., 2017]. These can help to reduce the risk of over-fitting that can plague flexible modelling frameworks. Whilst penalty functions, such as the LASSO penalty [Tibshirani, 1996], may be added to frequentist analyses to reduce the risk of over-fitting, the Bayesian approach offers a more flexible and intuitive framework for incorporating prior knowledge to achieve this goal.

Given the hierarchical specification, Markov chain Monte Carlo methods (MCMC) provide a convenient and popular method for fitting STGLMMs [Gelfand et al., 2010]. Unfortunately, MCMC methods can be computationally demanding and time-consuming without the development of bespoke efficient MCMC implementations. Thus we do not consider these approaches and instead look at the recently-developed Integrated Nested Laplace Approximation (INLA) approach [Rue et al., 2009] for model-fitting. We discuss in depth the benefits and limitations of using the INLA approach to fit STGLMMs in Section 2.4.2.

2.4.1 Incorporating preferential sampling within STGLMMs

Another major advantage of the STGLMMs framework for modelling a response vector Y is that it may also be used for modelling the preferential sampling process.
Thus, a joint modelling framework may be developed for accounting for preferential sampling in all spatio-temporal data settings.

We consider the discrete-space and continuous-space settings separately from the point-pattern setting for reasons that will be made clear. For the discrete-space and continuous-space settings, let the STGLMM for Y be defined as above. Next, define the R_{i,j} as the random variables used for defining the PS. In the discrete-space setting, the R_{i,j} will be indicator variables describing the site-selection process that determines which of the areal units contain data. In the continuous-space setting, multiple choices of R_{i,j} are available for approximating the point process (LGCP) that determines the selection of locations and times (S,T) chosen to observe the spatio-temporal phenomenon. Two approximations are commonly used. The first method sets R_{i,j} as binary indicator events and then approximates the LGCP with a logistic regression STGLMM [Warton et al., 2010]. The second approach sets R_{i,j} as counts and then approximates the LGCP with a Poisson STGLMM [Baddeley et al., 2015]. We use the former in Chapter 3.

Now, we define the model for PS as:

R_{i,j} ∼ f_R(h(m_{i,j}), θ_R),    f_R a density,
h(m_{i,j}) = ζ_{i,j} = x_{i,j}^T α_1 + w_{i,j}^T α_2 + Σ_{k=1}^{q} v_{i,j,k} β_k(s_i, t_j),
Θ_R = (θ_R, α_1, α_2) ∼ Priors,
w_{i,j} ∈ R^r,    v_{i,j} ∈ R^q.

We jointly fit the above model with the earlier STGLMM for Y. If we assume the joint model truly generates the data, then preferential sampling can arise in two distinct ways. The necessary approach required for adjusting for preferential sampling differs across the two. The first way that preferential sampling can occur is when one or more latent effects β_k(s,t) are shared between η and ζ. This occurs when both v_{i,j,k} and u_{i,j,k} are nonzero for some k ∈ {1, ..., q}. This is the form of preferential sampling seen in all three toy examples.
Accounting for PS under this assumed data-generating mechanism requires the joint model to be fit.

The second way that preferential sampling can occur is when the PS can be explained by the available covariates x. This occurs when one or more covariates within x are shared between η and ζ. Formally, this happens when both α_{1,j} and γ_j are nonzero for the true data-generating mechanism, for some j ∈ {1, ..., p}. When this occurs, but when none of the latent effects are shared in ζ, and hence when v_{i,j,k} = 0 ∀(i,j,k), then controlling for PS is greatly simplified. Here, unlike with the previous form of preferential sampling, a joint model is not required for controlling for PS. Instead, the offending covariate(s) within x need only be included in their correct functional forms within the model for Y to control for PS.

As noted earlier, the above statements cover only the discrete-space and continuous-space settings. In point-pattern data, we only have a single source of information associated with each data point: the space-time location (s_i, t_i). Contrast this with the previous discrete-space and continuous-space settings. There, two sources of information are available at each sampled location (s_i, t_i). First are the values of the response y(s_i, t_i). Second are the locations and times (s_i, t_i) themselves. Thus, two separate models, one for describing Y and the other for describing the characteristics of the sampling process through R_{i,j}, could be constructed.

Given the availability of only a single source of information, we are unable to construct additional variables R_{i,j}. Instead, we are left with needing to construct a suitable model for the thinned point-pattern Y = (S_0,T_0) that can incorporate PS. To achieve this, we can take a model-based approach. We may construct a thinning process, denoted p(s,t), that acts multiplicatively on the intensity of the points.
Under this assumed thinned point process model, the marginal intensity of the observed point-pattern takes the form λ_obs(s,t) = λ(s,t)p(s,t) [Fithian et al., 2015]. Thus, instead of building a joint model with two separate likelihoods, we instead fit a joint model with two processes contained within a single LGCP likelihood. In particular, the model becomes:

Y_{i,j} ∼ f_Y(g(μ_{i,j}) h(m_{i,j}), θ_Y),    f_Y a density,
g(μ_{i,j}) = η_{i,j} = x_{i,j}^T γ + Σ_{k=1}^{q} u_{i,j,k} β_k(s_i, t_j),
h(m_{i,j}) = ζ_{i,j} = w_{i,j}^T α_2,
β_k(s_i, t_j) ∼ N(0, Σ_k(θ_k)),    k ∈ {1, ..., q},
Θ = (θ_Y, γ, θ_1, ..., θ_q, α_2) ∼ Priors,
x_{i,j} ∈ R^p,    u_{i,j} ∈ R^q,    w_{i,j} ∈ R^r.

In the above model, the LGCP likelihood is approximated using the f_Y density. The true intensity λ(s,t) is again modelled through the g(μ_{i,j}) terms, with the thinning process p(s,t) modelled with the h(m_{i,j}) terms. Note how neither the latent effects β_k(s,t), nor the covariates x, are present in the thinning process. This is because the above model is plagued by identifiability issues. In fact, without additional prior knowledge available on the thinning function, neither the covariates x, nor the latent effects β_k(s,t) present in η, may be shared with the thinning process (see Chapter 5). Note that the above model remains identifiable when there is correlation between w and x [Warton et al., 2013]. In Chapter 5, we consider the scenario where additional strong prior knowledge is available.

When the above model truly describes the PS, then the PS may be controlled for by including the covariates w in their correct functional forms, without any additional knowledge required of the thinning function. Computationally, the simplest choice is to build a log-linear model for both the intensity and the thinning process. One example is to use the Poisson approximation to the LGCP likelihood and to set h ≡ g ≡ log. Under this choice, f_Y is a Poisson density function and the model remains within the STGLMM class.
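A toy numerical sketch of this log-link choice (all covariates, coefficients, and sample sizes here are invented for illustration): with h ≡ g ≡ log, the thinning covariates w simply augment the design matrix, and a single Poisson fit recovers both γ and α_2 from counts generated with log-mean x^T γ + w^T α_2.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 2000
x = rng.normal(size=(n, 2))              # covariates entering the true intensity
w = rng.normal(size=(n, 1))              # covariates entering the thinning process
X = np.column_stack([np.ones(n), x, w])  # one design matrix: intercept, x, w

beta_true = np.array([1.0, 0.6, -0.4, 0.8])  # (intercept, gamma, alpha_2)
y = rng.poisson(np.exp(X @ beta_true))

# Poisson regression via iteratively reweighted least squares (IRLS),
# started from an intercept-only fit for numerical stability.
beta = np.zeros(X.shape[1])
beta[0] = np.log(y.mean())
for _ in range(25):
    eta = np.clip(X @ beta, -20.0, 20.0)
    mu = np.exp(eta)
    z = eta + (y - mu) / mu              # working response
    XtW = X.T * mu                       # working weights are mu
    beta = np.linalg.solve(XtW @ X, XtW @ z)

print(np.round(beta, 2))  # should sit close to beta_true
```

The single linear predictor is what keeps the thinned model inside the STGLMM class under this choice of links.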
In fact, this leads to a single linear predictor, with the w_{i,j}^T α_2 terms added to each η_{i,j}.

However, this does not lead to a strictly valid model. The log link function for h can lead to thinning probabilities p(s,t) being predicted outside of the interval [0,1]. Regardless, this choice is often made [Fithian et al., 2015]. Alternatively, a more suitable link function (e.g. logit, probit, etc.) may be specified for h, and hence for p(s,t). However, this leads to w being included nonlinearly within the linear predictor for η and thus takes the model outside of the STGLMMs framework. This leads to additional computational challenges, with one proposed solution being to linearise the model via Taylor series expansions [Bachl et al., 2019].

Large computational advantages can be attained if we can remain within the class of STGLMMs. One such computational approach that can fit STGLMMs efficiently, called INLA, is summarized next.

2.4.2 INLA

The multivariate Gaussian prior distribution assumed on both the latent effects β_k(·,·) and the parameter vectors γ, α_1, and α_2 allows the above STGLMM framework to be implemented quickly and accurately using the approximate INLA approach [Rue et al., 2009]. INLA is a novel technique for approximating posterior distributions. It is based on the Laplace approximation method and is applicable to a wide class of models called latent Gaussian models, of which the STGLMMs defined above are a subclass.

Laplace approximation

The basic idea of the Laplace approximation in statistics is to approximate an arbitrary distribution with a multivariate Gaussian distribution, centred at the mode. The mean and covariance matrix of the Gaussian distribution are derived from the 2-term Taylor series expansion of the original distribution about its posterior mode.
The covariance is set equal to the negative of the inverse of the Hessian matrix of second derivatives, evaluated at the mode.

In particular, for an arbitrary multivariate density π(x) with mode x̃, and with ∇ denoting the gradient operator and H(x) denoting the Hessian matrix of second derivatives:

log(π(x)) ≈ log(π(x̃)) + (x − x̃)^T ∇log(π(x̃)) + (1/2)(x − x̃)^T ∇²log(π(x̃))(x − x̃)
          = log(π(x̃)) + (1/2)(x − x̃)^T H(x̃)(x − x̃),

since the gradient is zero at the mode. Taking the exponential of both sides of the equation:

π(x) ≈ π(x̃) exp( (1/2)(x − x̃)^T H(x̃)(x − x̃) ) = π̃(x),    (2.4)

we obtain the multivariate normal approximation to π(x) with mean x̃ and covariance matrix −H(x̃)⁻¹. The accuracy of the approximation depends strongly on the distribution being approximated. Furthermore, the accuracy of the approximation deteriorates as the distance of x from the mode increases. In the STGLMM setting, the target distribution that we will commonly wish to approximate will be the posterior distribution of the latent effects. When the values of the hyperparameters are known, the Laplace-approximation method provides a good approximation [Rue et al., 2009].

INLA

In practice, the hyperparameters are rarely known. In general, the above Laplace approximation applied to an arbitrary posterior distribution will have a mode and Hessian matrix that will depend upon the hyperparameters. Furthermore, to compute the posterior marginal distribution of the latent effects in an STGLMM, we will need to integrate over the posterior distribution of the hyperparameters. The INLA approach offers a computationally fast approach for doing this. Let y, β, and θ denote the response, latent effects and hyperparameters respectively.
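As a concrete one-dimensional illustration of (2.4) (the target density is invented for this example, not taken from the dissertation), a Gamma density has a closed-form mode and log-density Hessian, so its Laplace approximation can be written down and compared with the exact density directly:

```python
import math

a, b = 20.0, 2.0                 # shape and rate of a Gamma target density
mode = (a - 1.0) / b             # x~ : mode of the Gamma density
hess = -(a - 1.0) / mode ** 2    # d^2/dx^2 log pi(x), evaluated at the mode
var = -1.0 / hess                # Laplace variance: -H(x~)^{-1}

def gamma_pdf(x):
    # Exact Gamma(a, b) density.
    return b ** a * x ** (a - 1.0) * math.exp(-b * x) / math.gamma(a)

def laplace_pdf(x):
    # Gaussian approximation N(mode, var) from equation (2.4).
    return math.exp(-0.5 * (x - mode) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

for x in (8.0, 9.5, 11.0):
    print(x, round(gamma_pdf(x), 4), round(laplace_pdf(x), 4))
```

For this fairly symmetric target the two densities nearly coincide at the mode, while the discrepancy grows in the tails, matching the remark above that accuracy deteriorates with distance from the mode.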
For notational simplicity, we use π(·) throughout to denote a probability density.

For any given set of hyperparameters θ, the posterior distribution of the latent effects given the response and hyperparameters can be written as follows:

π(β | y, θ) = π(β, θ | y) / π(θ | y).

Next, the above equation can be rearranged to provide an expression for the posterior distribution of the hyperparameters, given the response:

π(θ | y) = π(β, θ | y) / π(β | y, θ).    (2.5)

Finally, the numerator may be rearranged as follows:

π(β, θ | y) = π(β, θ, y) / π(y) ∝ π(β, θ, y) = π(y | β, θ) π(β | θ) π(θ).

Given that the denominator π(y) does not depend on either the unknown latent effects or hyperparameters, we see that the joint posterior distribution is proportional to the product of three probability density (or probability mass) functions that can each be evaluated exactly. Thus, an approximation to equation 2.5 can be formulated as follows:

π(θ | y) ≈ π(β, θ | y) / π̃(β̃ | y, θ) = π̃(θ | y).    (2.6)

Here, π̃(β̃ | y, θ) denotes the Laplace approximation to the posterior of the latent effects, evaluated at the posterior mode β̃ and conditional on θ. The posterior mode can be found using standard optimization techniques. In settings where the assumed distribution of the response is Gaussian, the Laplace approximation to the posterior of the latent effects is exact, due to the self-conjugacy property of the Gaussian. As an aside, the INLA approach can be iterated one step further to offer an improved approximation to the posterior of the latent effects. However, this comes at a high computational cost and the gains in accuracy are small [Krainski et al., 2018, Taylor and Diggle, 2014]. Thus, we do not consider this further.

Equation 2.6 may now be used to approximate the posterior marginal distributions of the latent effects, with the uncertainties from the hyperparameters integrated out. The approximate posterior distribution π̃(θ | y) is searched across its support to find the regions of highest probability density.
Next, a grid of G integration points θ_g : g ∈ {1, ..., G} is chosen, with each integration point having corresponding volume Δ_g. Then, the following approximation to the integral can be formed:

π(β | y) = ∫_supp(θ) π(β | y, θ) π(θ | y) dθ
         ≈ ∫_supp(θ) π̃(β | y, θ) π̃(θ | y) dθ
         ≈ Σ_{g=1}^{G} π̃(β | y, θ_g) π̃(θ_g | y) Δ_g.    (2.7)

Discussion of Laplace approximations and INLA

In theory, if a set of regularity conditions holds on both the latent effects and the response distribution [Schervish, 2012], including the condition that the dimension of the latent effects is fixed, then the Laplace approximation can become very good in large sample settings due to the Bernstein-von Mises theorem. Simply put, the Bernstein-von Mises theorem says that under a set of regularity conditions, the posterior distribution of the latent effects will asymptotically converge in distribution to a multivariate normal distribution [Schervish, 2012]. Thus, the use of the Laplace approximation in conjunction with a sensible method of choosing the hyperparameters (e.g. maximum likelihood) is justified on asymptotic grounds in certain settings.

However, the above standard conditions that justify the use of the Laplace approximation on asymptotic grounds do not hold in most spatio-temporal settings. Often it is the case that the dimension of the latent effects β grows with n, instead of being fixed. Thus, observations rarely accumulate around each location in the domain that indexes β, and instead dim(β)/n ≈ C, with C constant. Rue et al. [2009] discusses this and concludes that the accuracy of the Laplace (and INLA) approximations to both β and θ "seems to be directly related to the 'actual' dimension" of β, and suggests that the effective number of parameters be calculated to investigate the approximation accuracy.

Furthermore, the approximation accuracy of INLA also depends upon the size of G in (2.7), the method used to set up the grid points, and whether or not the improved approximation to π̃(β | y, θ) is used.
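The mixture step in (2.7) can be sketched in a toy one-hyperparameter problem, mixing conditional Gaussian approximations over a grid of θ values with weights proportional to an unnormalised hyperparameter posterior; every density and number below is illustrative, not taken from a fitted model:

```python
import numpy as np

# Toy ingredients: conditional approximation pi~(beta | y, theta) = N(theta, 1),
# and unnormalised pi~(theta | y) proportional to N(0, 0.5^2).
theta_grid = np.linspace(-1.5, 1.5, 31)   # G = 31 integration points
delta = theta_grid[1] - theta_grid[0]     # equal volumes Delta_g

weights = np.exp(-0.5 * (theta_grid / 0.5) ** 2) * delta
weights = weights / weights.sum()         # normalise the mixture weights

def posterior_marginal(beta):
    # Equation (2.7): sum over g of pi~(beta | y, theta_g) * weight_g.
    cond = np.exp(-0.5 * (beta - theta_grid) ** 2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(cond * weights))

# The mixture is wider than any single conditional density: integrating out
# theta propagates hyperparameter uncertainty into pi(beta | y).
print(posterior_marginal(0.0), 1.0 / np.sqrt(2.0 * np.pi))
```

The marginal density at the centre falls below the height of any single conditional Gaussian, reflecting the extra spread that full-Bayes averaging retains and that plug-in approaches discard.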
In practice, the accuracy of the approximation is most sensitive to the choice of method used to set up the grid points [Krainski et al., 2018]. The INLA approximation is the one we use throughout the remaining chapters of this dissertation, with the actual computational procedures both designed and implemented in the R package R-INLA [Lindgren et al., 2011b, Rue et al., 2009, 2017]. The feasibility of the INLA approach is limited to settings where the dimension of θ is no more than about 15 [Rue et al., 2009]. In simulation studies, the INLA approach has been found to perform comparably to MCMC methods in LGCP analyses [Taylor and Diggle, 2014].

While INLA is a highly general method suited for fitting STGLMMs quickly, efficiently, and with relative ease, limitations do exist. First, INLA performs inference on the posterior marginal distributions of individual model parameters within (2.7), instead of on the joint posterior distribution of the model parameters [Gómez-Rubio, 2020]. Knowledge of the joint posterior distribution is required to perform inference on functions of multiple parameters (e.g. the pairwise differences between parameters). However, methods developed by Rue et al. [2009] allow for samples to be drawn from an approximate joint posterior distribution of the latent effects and hyperparameters, using a fitted INLA model. These samples then enable approximate joint inference to be performed.

Additionally, by default INLA is unable to fit models with multiple levels of hierarchy, general mixture models, or models with missing covariates. Recently, a hybrid MCMC-INLA approach was proposed by Gómez-Rubio and Rue [2018] that allows for all of the above model features to be fit using INLA and can also be used to obtain the joint posterior distribution of a chosen set of parameters.
Crucially, only standard MCMC algorithms such as Metropolis-Hastings are typically required, making the development of computer code relatively easy.

2.4.3 Applicability of the methods in practice

We have briefly introduced the field of spatio-temporal statistics and its three branches: discrete-space, continuous-space, and point-pattern data. Next, we presented examples of preferential sampling in all three data settings, and ultimately demonstrated that prediction bias is a common consequence of ignoring the sampling process. Then, we introduced STGLMMs, a flexible framework for modelling all three types of spatio-temporal data. We discussed how the sampling process can be modelled within the STGLMM class with relative ease. Finally, we introduced the computational approach, INLA, that can fit STGLMMs both quickly and in a memory-efficient manner.

In the following three Chapters, we build upon the methods introduced in this Chapter. Throughout, we demonstrate that: preferential sampling is prevalent in real-world data, by demonstrating its presence in three real-world datasets; the magnitude of prediction bias can be large, so that preferential sampling should not be ignored; and the methods developed in this dissertation can be applied to very large datasets quickly using standard software, thanks to the INLA computational approach.

Thus, we hope to leave readers with a clear awareness of the potentially deleterious effects that preferential sampling can have on their statistical inference of spatio-temporal data. We aim to convince readers that the problem of preferential sampling must not be ignored within any statistical analysis of spatio-temporal data.
We provide readers with a suite of tools both for testing for the presence of preferential sampling in their data, and for subsequently adjusting their inference to its presence.

Chapter 3

A General Theory for Preferential Sampling in Environmental Networks

"close attention should be given to densely populated areas within the region, especially when they are in the vicinity of heavy pollution."
— The United States' EPA Monitoring Network Design QA Handbook Vol II Section 6.0, guidelines for selecting the number and locations of air pollution samplers

A preview

This Chapter presents a general framework for modelling the PS of locations and times chosen to monitor a spatio-temporal process. We focus on environmental spatio-temporal processes, although the framework can be generalized beyond this. The framework is applicable in both the discrete-space and continuous-space settings seen in the previous Chapter (Sections 2.1 and 2.2). As discussed in the previous Chapter (Section 2.4.1), we construct binary indicator variables R_{i,j} for both the discrete-space and continuous-space settings. In the continuous-space setting, we choose to discretize the space into a large set of discrete spatial 'sites' and use the logistic regression approximation to an LGCP [Warton et al., 2010].

Our framework considers the joint distribution of an environmental process with a site-selection process that governs where and when sites are placed to measure the process. By sharing random effects between the two processes, the joint model is able to establish whether or not site placement was stochastically dependent on the environmental process under study. Furthermore, if stochastic dependence is identified between the two processes, then inferences about the probability distribution of the spatio-temporal process will change, as will predictions made of the process across space and time.
The embedding into a spatio-temporal framework also allows for the modelling of the dynamic site-selection process itself. Real-world factors affecting both the size and location of the environmental monitoring network can be easily modelled and quantified.

We then apply this framework to a case study involving particulate air pollution over the UK, where a major reduction in the size of a monitoring network occurred through time. It is demonstrated that a significant response-biased reduction of the air quality monitoring network occurred, namely the relocation of monitoring sites to locations with the highest pollution levels, and the routine removal of sites at locations with the lowest. We also show that the network was consistently unrepresentative of the levels of particulate matter seen across much of GB throughout the operating life of the network. Finally, we show that this preferential sampling of monitoring sites may have led to a severe over-reporting of the population-average exposure levels experienced across GB. This discovery could have great impacts on estimates of the health effects of black smoke levels.

3.1 Introduction to Chapter 3

This Chapter concerns preferential sampling (PS), where the locations of sites selected to monitor a spatio-temporal environmental process Z_{s,t}, s ∈ Ω, t ∈ 𝒯, depend stochastically on the process they are measuring. Thus PS is a special case of response-biased sampling. The space-time point is defined as (s, t) ∈ Ω × 𝒯, with Ω denoting the spatial domain of interest and 𝒯 the temporal domain. Purely spatial processes (i.e. when |𝒯| = 1) and purely temporal processes (i.e. when Ω is ignored) are two special cases.

Spatial sampling network designers must specify a set of time points T ⊂ 𝒯 at which to observe Z and, at each time t ∈ T, a finite subset of sites S_t ⊂ Ω at which to do so. Generally the temporal domain T would be a finite set, as for practical reasons Z must be a time-averaged quantity.
The designer may select the network sites in a preferential way to meet specified objectives [Schumacher and Zidek, 1993], although attaining those objectives may present its own challenges [Chang et al., 2007]. Moreover, the suitability of the network for achieving its initial objectives may decline over time, as in the case of the air quality monitoring network for Metro Vancouver [Ainslie et al., 2009]. In some cases the objectives may not be well prescribed, and evidence suggests that in these cases administrators may select monitoring sites preferentially [Shaddick and Zidek, 2014]. Finally, the data provided by networks for one purpose may be used for another purpose, and this mismatch may cause problems. For example, urban air pollution monitoring sites provide the information needed to detect non-compliance with air quality standards [EPA, 2005, Loperfido and Guttorp, 2008]. However, these measured values of Z would tend to overestimate the overall levels of the air pollutant throughout Ω and thus render the data unsuitable for assessing the impacts of Z on human health and welfare. In such cases, networks well designed for one purpose may be seen as preferentially sampled when the data they yield are used for another purpose.

A variety of approaches can be taken for modelling PS and mitigating its effects in a spatio-temporal process framework. The choice of framework depends on contexts and purposes. Subsection 3.2.1 reviews some of these approaches along with their associated references. Two different situations are encountered. In what might be called the retrospective approach, all the process data are available for use in assessing and mitigating the impact of PS at any given time t ≤ max(T). Such impacts could, for example, distort estimates of model parameters, spatial predictions, temporal forecasts, trends, and risk assessments. A special case is where |T| = 1 and Z_{s,T}, s ∈ Ω, is a random spatial field.
Since data are not collected over time, strong assumptions must be made about the PS process that yields the network of sites. The data cannot be used to build an emulator of the actual selection process itself, since the requisite data are not yet available when the spatial sites are selected. But it might be assumed that the future latent data reflect the past during the period in which the network was designed.

In the prospective case, the selection of network sites at time t ∈ T may be based on process observations up to and including time t − 1. In this case, the propensity to preferentially select sites at time t can be estimated without the benefit of having the data for time t. The temporal model can then be sequentially updated at time t + 1, and the process model could adapt quickly to abrupt changes rather than projecting long-term trends.

We develop a general modelling framework for the retrospective case that enables a researcher to determine whether the locations of the monitoring sites that form an operational network have been selected preferentially through time (i.e. whether response-biased selection occurred). Furthermore, unlike with spatial-only data, our framework applied to spatio-temporal data allows a site-selection process emulator to be developed. The population of all site locations considered for selection at any time t ∈ T is defined as P ⊂ Ω. P must be specified a priori, as the model framework does not consider locations outside of the fixed (pre-specified) population P in the site-selection process. But within that framework, both static and mobile monitoring networks are admitted.
Importantly, depending on the choice of population P, different insights into the nature of PS can be explored.

Defining the population of sites considered for selection throughout Ω × 𝒯 has been an issue of fundamental importance for all previous work on PS. This is especially true for the model framework introduced in this Chapter. Depending on the choice of population, different insights into the nature of PS can be obtained, and spatial predictions may change dramatically. We consider two populations in this Chapter; however, more can be thought of and implemented to suit the needs and knowledge of the researchers. In one of the cases considered in this Chapter, the population is taken to consist of all sites that have been deemed worthy of being monitored at some time t ∈ T. We refer to these as the observed sites. In another case, pseudo-sites are also included uniformly throughout Ω. These pseudo-sites have never been monitored but are considered important for characterizing the field itself and for investigating the impacts of PS. The name pseudo-sites follows from presence-only applications in statistical ecology, where such sites are often referred to as pseudo-zeros [Fithian and Hastie, 2013, Warton et al., 2010]. We opt for the name pseudo-sites to distinguish these locations from the traditional 'data-locations' and 'prediction-locations' terminology used in classical geostatistics. This is because in many applications not all prediction locations can also be pseudo-site locations. For example, there may be regions A ⊂ Ω across which we wish to predict the field, yet know with certainty that a site could not have been considered for selection for reasons unrelated to the process being measured. Possible reasons include the presence of a physical barrier (e.g. a mountain range) or a political barrier (e.g. a militarized zone) that could make the placement of a monitoring site impossible. Note that in all cases our population of sites P is finite.
This assumption of a finite population is in contrast to the spatial continuum assumed by point process models, although parallels between the methodologies exist and are discussed at length in this Chapter.

A Bayesian model is introduced for the joint distribution of the response vector (Y_{s,t}, R_{s,t}). R_{s,t} is a binary response for the site-selection process, which is 0 or 1 according to whether a monitoring site is absent or present at the space-time point (s, t) ∈ P × 𝒯, with P ⊂ Ω a fixed population of site locations under consideration. The resulting model, when fitted, identifies the effects of PS, if any, on inferences about the population mean of the process underlying Y. For brevity, we denote this population's mean by 'P-mean'. By sharing random effects across the two processes, the stochastic dependence (if any) between Y_{s,t} and R_{s,t} can be quantified, and subsequently the model can adjust the space-time predictions according to the nature of the PS detected.

Moreover, it yields an emulator of the dynamic preferential site-selection process as the operational monitoring network (denoted by S_t) evolves over time. The factors affecting the initial site placements can be allowed to differ from those affecting the retention of existing sites in the network. The dynamic model allows for an assessment of the degree to which preferentiality is determined not just by the stochastic processes underlying Y, but also by other factors that might include, for example, the administrative processes involved in the establishment of a monitoring site. Two examples considered in this Chapter, in an attempt to emulate the site-selection process, are political affinity for environmental monitoring and budgetary constraints, although more can be hypothesized and included. A key result described in the Chapter is the ability to use the R-INLA software package with the SPDE approach [Lindgren et al., 2011b, Rue et al., 2009, 2017] to fit the joint distributions proposed in our framework.
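Setting up a finite population of pseudo-sites is mechanically simple. The sketch below uses illustrative coordinates and spacing (a placeholder bounding box, not the actual British National Grid extent, and no clipping to the coastline): a regular grid of pseudo-sites is paired with a matrix of binary selection indicators, the pseudo-sites contributing structural zeros.

```python
import numpy as np

# Illustrative projected bounding box in metres -- placeholder values, not the
# actual National Grid extent used in the case study.
x_min, x_max = 0.0, 650_000.0
y_min, y_max = 0.0, 1_200_000.0
spacing = 5_000.0                       # one pseudo-site roughly every 5 km

xs = np.arange(x_min, x_max + spacing, spacing)
ys = np.arange(y_min, y_max + spacing, spacing)
gx, gy = np.meshgrid(xs, ys)
pseudo_sites = np.column_stack([gx.ravel(), gy.ravel()])
M = len(pseudo_sites)                   # |P| before clipping to the coastline

# Binary selection indicators r[i, j] over N years: pseudo-sites contribute
# structural zeros; rows for observed sites would be filled in from the data.
N = 31                                  # e.g. 1966-1996
r = np.zeros((M, N), dtype=np.int8)
print(M, r.shape)
```

In a real analysis the grid would be intersected with the study region polygon before the indicator matrix is built, so that barriers and coastline exclusions are respected.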
This ensures inference remains feasible, even for space-time applications with many thousands of pseudo-site locations.

Finally, we fit our model framework to a real case study: a large-scale air pollution monitoring network in the UK that monitored black smoke (BS hereafter) levels for more than fifty years. This case study provides an ideal data example for our model since the network underwent a constant, dramatic re-design through time, and furthermore, the locations of the observed sites appear to largely under-represent rural regions of Great Britain (GB hereafter). We consider two populations P of sites. First, we consider P1 to be the locations at which a site was operational at some t ∈ T (i.e. observed sites only). Here, we ultimately wish to see the effects of PS, if any, on estimates of the P1-mean, as well as to investigate whether the network evolved preferentially. Our second population P2 includes thousands of uniformly located ('pseudo') sites placed approximately 5 km apart from each other throughout GB. Since we uniformly cover GB, from this population we are able to assess whether the observed sites were preferentially placed within GB (i.e. Ω), and then preferentially retained in the network. We can then evaluate the effects of PS on the P2-mean (i.e. the average across GB). These two choices of population help to address two distinct questions.

3.2 Modelling frameworks

This section describes a very general framework in which PS can be explored, depending on the purpose of that exploration. It begins in Subsection 3.2.1 with a review of some existing theory.

3.2.1 Review of related work

Most work on PS is set in the geostatistical framework, where T consists of a single time point, so for expository simplicity we temporarily drop the subscript t in this context. In geostatistics PS has a long history. For example, Isaaks and Srivastava [1988] describe the deleterious impact on variogram estimates when "the data locations...
are preferentially located in high- or low-valued areas", in particular because the "preferentially clustered data" can lead to a "destructuring" of the variogram. In fact, this concern about clustered data goes back to Switzer [1977]. Olea [2007] reviews the history of PS, in particular with respect to the clustering due to it. However, interest in this topic has spread to a variety of subject areas (see for example Michalcová et al. [2011], Zoltán et al. [2007]).

Interest in the statistical science community seems to have been sparked by the paper of Diggle et al. [2010] (hereafter DMS). DMS defines the PS of a space-time field succinctly as the property [Z, S] ≠ [Z][S]. Here Z denotes the spatial field and S the locations. The square bracket notation can be read as "the probability distribution of". DMS notes that when sampling is non-preferential, S can be regarded as fixed; inferences about Z and its distribution can then be based on conditional distributions given S. The authors also note that non-PS differs from "uniform sampling", where for a given sample size every possible realization of S is equally likely. DMS assumes that, conditional on S and the Gaussian process Z_s, s ∈ S, the measured values of Z, denoted by Y, are mutually independent Gaussian random variables with mean μ + Z_s. At the same time, conditional on Z, S is assumed to be an inhomogeneous Poisson point process (IPP) with intensity function λ(s) = exp{α + βZ_s}, s ∈ Ω. The parameter β represents the degree of PS, with β > 0 implying that large values of Z_s are associated with an increased chance of a sample being included in S in a local neighbourhood around s. As noted by Professor Dawid in his discussion of DMS, this model cannot represent the real site-selection process, since the network designers would not know anything about Z until the sites had been established and their measured values were available.
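The bias that β > 0 induces is easy to reproduce in simulation. The sketch below is a deliberate simplification of the DMS setup (independent N(0, 1) draws in place of a spatially correlated Gaussian field, and Bernoulli selection through a logistic link in place of the IPP): sites land preferentially where Z is high, so the naive mean of the observed values overshoots the true mean of the field.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simplified stand-in for the latent field Z: iid rather than spatially
# correlated -- enough to show the selection bias, though not the
# "destructuring" of the variogram.
Z = rng.standard_normal(5_000)

alpha, beta = -2.0, 2.0                  # beta > 0: positive PS
p = 1.0 / (1.0 + np.exp(-(alpha + beta * Z)))
selected = rng.random(Z.size) < p        # Bernoulli site selection

naive_mean = Z[selected].mean()          # what the monitoring network reports
true_mean = Z.mean()                     # the actual target of inference
print(round(naive_mean, 2), round(true_mean, 2))
```

With β = 0 the two means agree up to sampling noise; as β grows, the gap between them widens, which is exactly the prediction bias this Chapter sets out to detect and correct.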
Thus this model cannot be viewed as a site-selection emulator, since perfect knowledge of Z prior to measurement cannot be assumed. Nevertheless, in a post-hoc analysis of those data, the PS model can be fitted and so capture the impact of the real selection process on inferences made about Z and its probability distribution.

The IPP model was used subsequent to the publication of DMS by other investigators in a similar way, but in a fully Bayesian model for inference. More specifically, Gelfand et al. [2012] replace α + βZ_s in DMS's intensity function by (in our notation) α + α₁ᵀX_s, where X denotes a vector of observable covariates. This change makes the model more like a possible model for the real process. Note that without the inclusion of the process Z_s inside the linear predictor of the Poisson process model, they assume a missing-at-random missingness mechanism, with no further dependence existing between the site locations and the underlying process Z_s once conditioned on the included covariates X_s. Thus this would no longer be considered PS by our earlier definitions. Pati et al. [2011] also include the covariate vector and replace α + βZ_s by α + α₁ᵀX_s + βξ_s, so that the effect of the observable covariates is incorporated in the PS model. The {ξ_s} are referred to as a "residual process", and so, unlike DMS, these authors are not making PS depend directly on the process Z. A second residual process η is added to the measurement model, so that conditional on ξ, η, X and S, the {Y_s} are assumed to be independently distributed with mean μ + α₁ᵀX_s + βξ_s + β₁η_s. Thus it would seem that, in effect, the process model is being represented by Z_s = α₁ᵀX_s + βξ_s + β₁η_s, while the potential PS derives from only a subcomponent of that process.

The need to include covariates (predictors) is well recognized in DMS and its ensuing discussions, so Gelfand et al. [2012] and Pati et al. [2011] are welcome additions to the geostatistical literature on PS.
But none of these models include, as we do in this Chapter, residual terms that represent the ill-defined administrative and other processes involved in actual site selection. These terms are not subcomponents of Z, and yet the case study presented in this Chapter suggests that these residuals play a significant role in PS. Furthermore, the point process model on which the above models are based will not be suitable in all applications, such as that in Conn et al. [2017] about mapping species abundance in ecology. That paper presents a general theory for PS where Ω consists of a finite set of points and the response distributions are non-Gaussian, to include such things as count data.

3.2.2 A general retrospective modelling framework

In this section we introduce the general model framework and its purpose, before implementing it on a real case study in Section 3.4. First, we carefully define the population of locations s ∈ P ⊂ Ω to consider for selection at some or all t ∈ T. The size and placement of this population may substantially affect the resulting inference. In many cases, either the precise locations of all sites under consideration at each t ∈ T will be known, or there will be a clearly defined population of locations at which interest lies in estimating the space-time field and/or its corresponding population summary statistics. This case is Population 1 (P1), considered in our later application. For the second population (P2), used in our later analysis, we consider all possible points s ∈ Ω to be the population.

Computational considerations lead us, for Population 2, to approximate this by the placement of pseudo-sites in a high-density regular grid, thus placing a pseudo-site approximately every 5 km in Ω. This is similar in flavour to the discretized computational lattice used in the log-Gaussian Cox Process (LGCP hereafter) approach by DMS [Diggle et al., 2010].
In fact, as the density of pseudo-sites under consideration in Ω increases, the resulting logistic regression likelihood converges towards a (scaled) Poisson point process likelihood. Parameter estimates and their standard errors converge to those from the Poisson point process too. However, the accuracy of this approximation depends on the density and placement of the pseudo-sites [Fithian and Hastie, 2013, Warton et al., 2010]. We discuss this in depth later. The LGCP idea has also been considered further, but the need to explicitly add a third likelihood to the joint model to capture the retention process in spatio-temporal applications may make this approach less desirable in some scenarios.

Note that the space-time field, represented as Z_{i,t} in previous work, is represented in our model framework as a sum of latent random effects. The benefit of this is that each of the components making up the space-time field may be allowed to have a unique influence on the site-selection process. An example of where this could be beneficial is the case where site selection is driven only by the subset of the latent random effects that act on a particular spatial scale. We let P denote the set of site locations in the population and define M to be the number of sites (i.e. M = |P|). Note that the interpretation of the P-mean differs substantially across these populations. The P1-mean can be interpreted as the network average, whilst the P2-mean can be interpreted as the GB-average (the mean of the space-time field across GB).

We let Y_i(t) denote a spatio-temporal observation process (continuous, count, etc.) at site i, that is at location s_i ∈ P ⊂ Ω, at time t ∈ 𝒯. We let R_i(t) denote the random selection indicator for site s_i ∈ P at time t, with 1 meaning the site was operational at this time. We let t_1, ..., t_N denote the (finite) N observation times, and let r_{i,j} ∈ {0, 1} denote the realisation of R_i(t_j) for site s_i ∈ P at time t_j, i ∈ {1, ..., M}, j ∈ {1, ..., N}.
The subscript j will act as a pointer to the desired time. Then our general model framework can be written as follows:

\begin{align*}
(Y_{i,j} \mid R_{i,j} = 1) &\sim f_Y(g(\mu_{i,j}), \theta_Y), \quad f_Y \sim \text{density} \\
g(\mu_{i,j}) = \eta_{i,j} &= x_{i,j}^T \gamma + \sum_{k=1}^{q_1} u_{i,j,k}\, \beta_k(s_i, t_j) \\
R_{i,j} &\sim \text{Bernoulli}(p_{i,j}) \\
h(p_{i,j}) = \nu_{i,j} &= v_{i,j}^T \alpha + \sum_{l=1}^{q_2} d_l \sum_{k=1}^{q_1} w_{i,j,l,k}\, \beta_k\bigl(s_i, \phi_{i,l,k}(t_j)\bigr) + \sum_{m=1}^{q_3} w^{\star}_{i,j,m}\, \beta^{\star}_m(s_i, t_j) \\
\beta_k(s_i, t_j) &\sim \text{(possibly shared) latent effect with parameters } \theta_k, \quad k \in \{1, ..., q_1\} \\
\beta^{\star}_m(s_i, t_j) &\sim \text{site-selection-only latent effect with parameters } \theta^{\star}_m, \quad m \in \{1, ..., q_3\} \\
\Theta = \bigl(\theta_Y, \alpha, \gamma, \theta_1, ..., \theta_{q_1}, d_1, ..., d_{q_2}, \theta^{\star}_1, ..., \theta^{\star}_{q_3}\bigr) &\sim \text{Priors} \\
x_{i,j} \in \mathbb{R}^{p_1}, \quad u_{i,j} \in \mathbb{R}^{q_1}, \quad v_{i,j} \in \mathbb{R}^{p_2}, &\quad W_{i,j} \in \mathbb{R}^{q_2 \times q_1}, \quad w^{\star}_{i,j} \in \mathbb{R}^{q_3}
\end{align*}

The above framework is set up to allow a large degree of modelling flexibility for spatial, temporal and spatio-temporal applications. Note that the two functions g and h are known as link functions. These relate the expected value of the response to the linear predictor. Popular choices of h for the Bernoulli likelihood are the logit, complementary log-log and probit functions. In our later analysis, we will generate our zeros (or pseudo-sites) with an approximately constant intensity across Ω. Thus, in our case, the logit link is the suitable choice of link function, since it exploits a natural connection between conditional logistic regression and the loglinear Poisson point process model we are approximating when we condition on the total count [Baddeley et al., 2015].

We now dissect the model term by term. Firstly, consider the observation process Y. We allow any distribution to be chosen as the likelihood for the observation process. This allows a range of different data types (e.g. continuous, count, etc.) to be modelled, including those that exhibit features such as skewness, heavy tails and/or over-dispersion. In the linear predictor η_{i,j}, we may include a linear combination of fixed covariates x_{i,j} together with a linear combination of q_1 latent effects β_k(s_i, t_j).
These q_1 random effects can include any combination of spatially correlated processes (such as Gaussian [Markov] random fields), temporally correlated processes (such as autoregressive terms), spatio-temporal processes and IID random effects. Note that we include the additional fixed covariates u_{i,j} to allow for spatially varying coefficient models, as well as for random slopes and/or scaled random effects to be included. The flexibility here allows areal data to be modelled too, simply by changing the definition of s_i from being a point to representing a well-defined area.

Next, we consider the site-selection process R_{i,j}. As before, in the linear predictor ν_{i,j}, we may include a linear combination of fixed covariates v_{i,j} together with a linear combination of latent effects. This time, however, the latent effects appearing in the observation process Y_{i,j} are allowed to appear in the linear predictor of the selection process R_{i,j}. This sharing of the latent effects across the two processes allows stochastic dependence to exist between the two processes and hence enables us to investigate whether we have a missing-not-at-random mechanism. Note that the matrix W_{i,j} is fixed beforehand, and allows q_2 linear combinations (possibly scaled by covariates) of the latent effects from the Y_{i,j} process to be copied across. The parameter vector d determines the degree to which each shared latent effect (or combination thereof) affects the R process, and therefore measures the magnitude and direction of the stochastic dependence between the two models term by term. We denote this term by d in recognition of the landmark paper by Diggle et al. [2010]. Finally, as seen in Pati et al. [2011], we allow q_3 latent effects, independent of the Y_{i,j} process, to appear in the linear predictor.
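The copy-and-scale mechanism behind d can be sketched with toy numbers. The sketch below is a simplification of the framework above (one shared iid site effect standing in for a spatial or spatio-temporal term, identity link g, and illustrative parameter values): the same latent effect enters both linear predictors, scaled by d in the selection process, alongside a selection-only residual effect.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 200, 10                          # toy numbers of sites and years

# One latent effect shared across the two processes (an iid site effect,
# standing in for a spatial term), plus a selection-only residual effect
# in the spirit of Pati et al. [2011].
beta_site = rng.normal(0.0, 1.0, size=M)
beta_star = rng.normal(0.0, 0.5, size=M)

gamma0, alpha0, d = 3.0, -1.0, 0.8      # d > 0 encodes positive PS

# Observation-process linear predictor eta[i, j] (identity link g).
eta = gamma0 + np.tile(beta_site[:, None], (1, N))

# Selection-process linear predictor nu[i, j]: the shared effect is copied
# across scaled by d, then the selection-only effect is added.
nu = alpha0 + d * beta_site[:, None] + beta_star[:, None] + np.zeros((M, N))
p_select = 1.0 / (1.0 + np.exp(-nu))    # logit link h
print(eta.shape, p_select.shape)
```

With d = 0 the two processes decouple and the model reduces to independent observation and selection components; estimating d (rather than fixing it) is what lets the fitted model report the magnitude and direction of PS.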
This allows us to extract as many sources of variation from the site-selection process as possible, reducing the risk of over-estimating the magnitude of the d_l terms, and thus the stochastic dependence between the two processes.

For added flexibility, we allow temporal lags in the stochastic dependence. This allows the site-selection process to depend upon the realized values of the latent effects at any arbitrary time in the past, present or future. Thus this framework allows both proactive and reactive site selection to occur. For example, if for a pollution monitoring network site selection were desired near immediate sources of pollution (say for exceedance detection), then we may view as reasonable a model that allows for a dependence on the latent field at the previous time step as a site-selection emulator. In this case, we would select as the temporal lag function φ_{i,l,k}(t_j) = t_{j−1}. We define this to be reactive selection, where placement depends only on past realisations of the space-time field. Say instead that site placements were desired near areas forecast to increase in industrialisation (and hence pollution emission). Then a model allowing for dependence on future values of the latent process may be suitable. To achieve this we would select φ_{i,l,k}(t_j) > t_j. We define this to be proactive site selection. Models with mixtures of reactive and proactive site selection could also be admitted and fit under this framework, since a unique temporal lag function φ_{i,l,k}(t) is allowed for each latent effect shared between the linear predictors.

Also of interest is the possibility of setting w_{i,j,l,m} = 0 for some values of the subscripts to allow the direction of preferentiality to change through time. For example, the initial placement of the sites might be made in a positively (or negatively) preferential manner, but over time the network might be redesigned so that sites were later placed to reduce the bias.
To capture this, it would make sense to have a separate PS parameter d estimated for time t = 1 and for times t > 1, to capture the changing directions of preferentiality through time. This can easily be implemented. Furthermore, we may wish to set w_{i,j,l,m} = 0 for certain values of the subscripts to see whether the effects of covariates and/or the effects of PS differ between the initial site-placement process and the site-retention process.

Clearly the above modelling framework has the potential for over-fitting and model non-identifiability, among other things. Thus careful choice of prior distributions, linear constraints on the latent effects (e.g. sum-to-zero constraints) and exploratory analysis are vital to fully utilize this model framework.

3.3 Case study: the data

Annual concentrations of BS were obtained from the UK National Air Quality Information Archive. Set up in 1961, this archive was the world's first coordinated archive of national air pollution monitoring networks. While it was being established, the network increased in size and the initial growth was quite rapid: from 800 sites in 1962 and 1159 sites in 1966, to 1275 sites in 1971 (see Fig 3.1). After this initial period, the overall size of the network declined due to rationalization and in response to changing levels of air pollution; in 1976 there were 1235 operational sites, 563 in 1986, 225 in 1996, and 65 in 2006.

Site locations (at a 10 m resolution) and annual average concentrations of BS (µg m−3) were obtained from monitoring sites. For the reasons given by Shaddick and Zidek [2014], we restrict ourselves to only the sites operating between April 1966 and March 1996 and with data capture of at least 75%, equivalent to 273 days a year (as stated in the EC directive 80/779/EEC; Colls [2002]). The locations of all these sites (i.e. the population P1 considered in this Chapter) can be seen in Fig 3.3.
It can be seen immediately that a high density of sites is located near many major industrial areas such as London and the Midlands, with almost no sites located in the relatively sparsely populated north of Scotland.

The decline in concentrations during this time period was most dramatic. Annual recorded network means fell from 80 µg m−3 in 1966 to 31 in 1976, 19 in 1986, 9 in 1996 and 5 µg m−3 in 2006. Fig 3.2 shows a random sample of site-specific log-transformed annual BS levels. Concentrations of BS were typically highest in areas where the use of coal for domestic heating was relatively widespread, such as in parts of Yorkshire and within large cities.

Along with these large changes in concentrations, the dramatic changes in the size of the network can be seen in Fig 3.1, which shows the number of operational sites with at least 75% (annual) data capture vs. year within the chosen study period. The initial increase in the size of the network can clearly be seen, followed by the long-term reduction in the number of sites over time. Also evident is the marked reduction of the network in the early 1980s, when there was an almost 50% reduction in the number of sites as the network was reorganized owing to falling urban concentrations. With such a dramatic drop in the size of the network, one must ask how the network reduction was chosen. Fig 3.2 shows a plot of a random sample of 30 sites' (log-transformed) black smoke trajectories. From this plot there appears to be evidence that the sites that remained in the network until the end were those providing the highest measurements. Thus, we can see clear evidence for a response-biased network reduction process (i.e. PS).

Thus, we have a dataset that exhibits four interesting features:

1. A high density of monitoring sites near major industrial regions, and hence near potential sources of BS. Conversely, an under-representation of the rural areas of Northern Scotland, Wales and Cornwall (Fig 3.3), and hence areas with low expected BS.

2.
A large change in concentrations of BS throughout the period of study, resulting in a rapidly evolving latent spatio–temporal process (Fig 3.2).

3. A network whose size dramatically changes through time (Fig 3.1).

4. A network that underwent a biased redesign through time (Fig 3.2), with the sites providing the smallest BS readings being dropped from the network.

These four features provide the perfect opportunity for the model framework to both detect and attempt to correct for the effects of PS made within the network. In particular, depending on our choice of P, we are investigating whether or not informative dropout/inclusion occurred in the operational network St through time, and/or whether the network of observed sites is representative of Great Britain (GB) as a whole.

Note that the same exploratory analysis was conducted as in Shaddick and Zidek [2014], and a quadratic temporal effect was found suitable both to fit the data and to provide a non–complex relationship to explain the observed decline in (log transformed) concentrations over time. Variograms were constructed for each year separately and for the average over all years, both on the original data and on the residuals from the temporal model; a spatial model from the Matérn class seemed an appropriate choice.

Figure 3.1: A plot showing the number of the monitoring sites that are operational at each year and have data capture of at least 75%. Notice the sharp drop in the size of the network in 1982. Note that a total of 1466 sites were operational at some point in time.

Figure 3.2: A plot showing the mean black smoke level on a log transformed scale for 30 randomly chosen sites. Missing line segments indicate the site was offline that year.
Notice that the sites reporting the lowest values tend to be removed from the network earliest.

3.4 Modelling

We build one model from the general framework introduced in Section 3.2. We fit and present the results from three implementations of this model to display the features of the modelling framework. The three implementations are developed through a combination of imposing strict constraints on the PS parameters (i.e. by imposing point mass priors on the d parameter vector) and changing the population under consideration. These three implementations clearly demonstrate the ability of the model framework to both detect, and adjust for, PS. Furthermore, they highlight the components of the model involved with the PS detection and correction. This helps to demystify the method and to avoid it being seen as a black–box approach.

The joint model developed incorporates the effects of selection by sharing the random effects present in the observation process with the site–selection process. In particular, the selection process is allowed to use information from both spatially varying Gaussian processes and spatially–uncorrelated site–specific effects to determine the site selection probabilities each year. If PS is detected, then this model should help to de–bias predictions of the P1– and P2–means relative to those reported from the raw data, by moving their point predictions against the direction of preferentiality. The magnitude of this movement is dependent upon: the flexibility of the model, the magnitude of the estimated PS parameters dβ, db, and the choice of P. This fact is clearly demonstrated by the results from the three implementations.

The same joint model and computational mesh are used across all three implementations. The differences seen in the results come only from the different assumptions placed upon the site–selection processes and populations.
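To see why such a de–biasing adjustment is needed at all, consider the following toy simulation. It is not the thesis model and every value in it is made up; it only illustrates that when the probability of a site being retained increases with the latent process value, the naive mean over retained sites overstates the population mean.

```python
import numpy as np

# Toy illustration of preferential-selection bias: retention probability
# increases with the latent process value, so the mean over retained
# sites is biased upward relative to the population mean.
rng = np.random.default_rng(42)
field = rng.normal(loc=0.0, scale=1.0, size=100_000)  # latent process values

d = 1.5  # hypothetical preferentiality parameter (d > 0: high values favoured)
p_select = 1.0 / (1.0 + np.exp(-(-1.0 + d * field)))  # logistic selection
selected = rng.random(field.size) < p_select

naive_mean = field[selected].mean()  # mean over the retained "network"
true_mean = field.mean()             # population mean (near zero)
assert naive_mean > true_mean + 0.2  # clear upward bias
```

A joint model that estimates d from the selection pattern can move its point predictions back against this direction of preferentiality, which is precisely the behaviour demonstrated by the three implementations below.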
In the first implementation, the site–selection process is forced to be independent from the pollution process through the point mass prior at 0 imposed on dβ, db. In other words, we constrain the PS parameters to be zero. Consequently, the subsequent inference from this model will ultimately be equivalent to the inference from a model without any site–selection process component. In the second and third implementations, we remove this constraint, and two different choices of P are made to address two alternative scenarios.

All modelling is performed in R-INLA with the SPDE approach [Lindgren et al., 2011a, Rue et al., 2009, 2017]. This enables the rapid computation of approximate Bayesian posterior distributions for both the model variables and latent effect predictions. It does this by approximating the spatio–temporal processes with a Gaussian Markov random field (GMRF) representation, obtained by solving an SPDE on a triangulation grid. Details can be found in Lindgren et al. [2011a]. Due to the large size of the dataset and the desired spatial prediction, MCMC approaches without sophisticated approximations or efficient implementation could become infeasible. This is due to the computationally expensive operation of inverting large, dense spatial covariance matrices being required at each MCMC iteration to evaluate the likelihood. The SPDE approach, by developing a GMRF representation of the spatial fields, only requires the computationally cheaper operations of computing the inverse and the determinants of sparse precision matrices – a task that is made possible with numerical sparse matrix libraries.

3.4.1 Data cleaning

A few data cleaning steps were carried out before fitting the models. Due to the right skewness of the black smoke observation distribution, we applied the natural logarithmic transformation to the values to make the observation distribution more Gaussian in shape.
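As a minimal sketch of these cleaning steps, using simulated stand-in values rather than the actual UK black smoke data (the division by the overall mean, which makes the response dimensionless, is discussed next):

```python
import numpy as np

# Sketch of the cleaning pipeline: divide by the overall mean to make the
# response dimensionless, take natural logs, scale both coordinates by the
# SD of the Eastings, and rescale years to [0, 1]. All values are simulated.
rng = np.random.default_rng(1)
bs = rng.lognormal(mean=3.0, sigma=0.8, size=200)  # hypothetical readings
eastings = rng.uniform(0, 7e5, 200)
northings = rng.uniform(0, 1.2e6, 200)
years = rng.integers(1966, 1997, 200)

y = np.log(bs / bs.mean())               # dimensionless log ratio
sd_east = eastings.std()
coords = np.column_stack([eastings, northings]) / sd_east
t_star = (years - 1966) / (1996 - 1966)  # scaled to [0, 1]

assert np.isclose(np.exp(y).mean(), 1.0)  # the ratios average to one
assert t_star.min() >= 0.0 and t_star.max() <= 1.0
```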
Since the natural logarithm is a transcendental function, whose series representation contains an infinite series of powers of its argument, we first divided each value by the mean of all the recorded black smoke levels to make the response dimensionless. This ensures not only that the inference remains valid, but also that the results are readily interpretable, as they are in effect compared to a natural origin [Shaddick and Zidek, 2014]. Next, we scaled the Eastings and Northings coordinates by the standard deviation of the Eastings, and re–scaled the years to lie in the interval [0,1].

3.4.2 Observation process

The following model for the observation process is used for all three implementations seen shortly. The specification follows from Shaddick and Zidek [2014] and is formulated as follows. Let Yi,j denote the observed log black smoke ratio at site i, situated at si, at time tj, i ∈ {1, ...,M}, j ∈ {1, ...,N}. Let t*j denote the jth time–scaled observation, lying in the interval [0,1]. Let Ri,j denote the random selection indicator for site i at time tj, with Ri,j = 1 or 0 depending on whether or not the site was operational in that year and provided the minimum number of readings outlined earlier. Note that there are 1466 sites that record at least one annual reading, and N = 31.

Figure 3.3: A plot of Great Britain, with the locations of the observed sites, and hence P1, shown. Notice the high density of sites placed in the most population-dense regions.

\begin{align*}
(Y_{i,j} \mid R_{i,j} = 1) &\sim N(\mu_{i,j}, \sigma^2) \\
\mu_{i,j} &= (\gamma_0 + b_{0,i} + \beta_0(s_i)) + (\gamma_1 + b_{1,i} + \beta_1(s_i))\, t^{\star}_j + (\gamma_2 + \beta_2(s_i))\, (t^{\star}_j)^2 \\
[\beta_k(s_1), \beta_k(s_2), \ldots, \beta_k(s_M)]^T &\overset{\text{IID}}{\sim} N(0, \Sigma(\zeta_k)) \quad \text{for } k \in \{0,1,2\} \\
[b_{0,i}, b_{1,i}] &\overset{\text{IID}}{\sim} N(0, \Sigma_b), \quad \Sigma_b = \begin{bmatrix} \sigma^2_{b,1} & \rho_b \\ \rho_b & \sigma^2_{b,2} \end{bmatrix} \\
\Sigma(\zeta_k) &= \text{Matérn}(\zeta_k) \\
\theta = (\sigma^2, \gamma, \zeta_k, \sigma^2_{b,1}, \sigma^2_{b,2}, \rho_b) &\sim \text{Priors}.
\end{align*}

The choice of the observation process model is explained as follows. The sources of variation can be broken up into three components: global variation, independent site–specific variation and smooth spatially correlated variation.
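To make the decomposition concrete, the mean structure above can be evaluated directly. The coefficient values here are made up purely for illustration; in the thesis they are estimated via INLA.

```python
import numpy as np

# A concrete reading of the observation-process mean, with made-up values:
#   mu_ij = (g0 + b0_i + B0(s_i)) + (g1 + b1_i + B1(s_i)) t + (g2 + B2(s_i)) t^2
g = np.array([1.0, -0.8, 0.2])         # global quadratic coefficients gamma_k
b_i = np.array([0.15, -0.05])          # site i's random intercept and slope
beta_si = np.array([0.3, -0.1, 0.05])  # spatial fields beta_k evaluated at s_i

def mu(t_star):
    """Quadratic-in-time mean for site i at scaled time t_star in [0, 1]."""
    return ((g[0] + b_i[0] + beta_si[0])
            + (g[1] + b_i[1] + beta_si[1]) * t_star
            + (g[2] + beta_si[2]) * t_star ** 2)

# At t* = 0 only the intercept terms contribute; at t* = 1 all terms do.
assert np.isclose(mu(0.0), 1.45)
assert np.isclose(mu(1.0), 0.75)
```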
To ensure model identifiability, we enforced sum–to–zero constraints on all random effects (β and b), and furthermore we did not estimate spatially–uncorrelated random effects b at locations with no observations. Note that in the notation of Section 3.2, the b and βk(si) terms are examples of the β(s, t) latent effects and thus q1 = 5. For readability we choose to separate the notation for these effects. Note that, whilst the b terms are assumed independent between sites, the terms b0,i, b1,i are assumed a priori to be a realisation from a (possibly correlated) multivariate Gaussian distribution with covariance matrix Σb.

The global temporal trend is captured by the γk terms, since these parameters remain constant across the sites. As in Shaddick and Zidek [2014], when comparing various models for the first (non-joint) implementation, more complex temporal relationships (such as splines) were not favoured by multiple model selection criteria, including DIC. Secondly, the independent site–specific variations are captured by the IID random intercepts and random slopes (b0,i, b1,i). In geostatistical terms, the b terms act as nugget effects for their corresponding βk(s) terms. The (nugget–free) βk(s) terms then capture the smooth spatially–correlated variation. Models without the b terms showed large residual site–specific errors. Thus it appears that small–scale factors may be a large source of variability in the measured black smoke trajectories, independent of the regional location alone. Note that separate spatially–correlated Gaussian fields for each year were tested (i.e. using a separate β0,j(s) field for each year), but did not improve the model fit.

The intuition behind the short scale b terms in the model is as follows. An observation tower close to a large source of black smoke (e.g.
a road, a polluting factory, or a power station) would likely yield a much higher annual reading than placing it, say, half a kilometer away from such a source. Since this spatial scale is much smaller than that captured by the βk(s) processes, these differences will not be accounted for without either including covariates that capture the causes of these effects (e.g. distance from the nearest pollutant source), or allowing each site to have its own deviation from the smoothly predicted field via either a fixed or random site–specific effect. Note that spatially–uncorrelated random quadratic slopes b2,i were not found to improve the model fit with respect to DIC under the first implementation, and actually led to a large instability in the predictions for sites that took fewer measurements. It appears that the inclusion of these terms led to some over–fitting.

The priors for the hyperparameters θ were chosen to be as weakly informative as possible, and hence to reduce their effects upon the posterior results, but also to bound their values inside sensible limits. Despite the fact that previous analyses have been made on this dataset, we only use vague information from these results when constructing the priors. We discuss the details of the chosen priors in the supporting material found in Appendix A.

3.4.3 Site–selection process

The following model for the site–selection process is used for all three implementations, with the aim of emulating the complex decision–making processes that occurred when setting up the monitoring network. Let Ri,j denote the random selection indicator for site i at time tj, with Ri,j = 1 or 0 depending on whether or not the site was operational in that year and provided that the minimum number of readings outlined earlier is attained. Let ri,j ∈ {0,1} denote the realisation of Ri,j for site i at time t*j, i ∈ {1, ...,M}, j ∈ {1, ...,N}. Finally, si denotes the location (the scaled Eastings and Northings coordinates) of site i.
The model is then:

\begin{align*}
R_{i,j} &\sim \text{Bernoulli}(p_{i,j}) \\
\text{logit}\, p_{i,1} &= \alpha_{0,0} + \alpha_1 t^{\star}_1 + \alpha_2 (t^{\star}_1)^2 + \beta^{\star}_1(t_1) + \alpha_{rep} I_{i,2} + \beta^{\star}_0(s_i) \\
&\quad + d_b \left[ b_{0,i} + b_{1,i}\, t^{\star}_1 \right] + d_{\beta} \left[ \beta_0(s_i) + \beta_1(s_i)\, t^{\star}_1 + \beta_2(s_i)\, (t^{\star}_1)^2 \right] \\
\text{for } j \neq 1 \quad \text{logit}\, p_{i,j} &= \alpha_{0,1} + \alpha_1 t^{\star}_j + \alpha_2 (t^{\star}_j)^2 + \beta^{\star}_1(t_j) + \alpha_{ret}\, r_{i,(j-1)} + \alpha_{rep} I_{i,j} + \beta^{\star}_0(s_i) \\
&\quad + d_b \left[ b_{0,i} + b_{1,i}\, t^{\star}_{j-1} \right] + d_{\beta} \left[ \beta_0(s_i) + \beta_1(s_i)\, t^{\star}_{j-1} + \beta_2(s_i)\, (t^{\star}_{j-1})^2 \right] \\
I_{i,j} &= I\left( \sum_{l \neq i} r_{l,j-1}\, I\left( \lVert s_i - s_l \rVert < c \right) > 0 \right) \\
[\beta^{\star}_0(s_1), \ldots, \beta^{\star}_0(s_M)]^T &\sim N(0, \Sigma(\zeta_R)) \\
\Sigma(\zeta_R) &= \text{Matérn}(\zeta_R) \\
[\beta^{\star}_1(t_1), \ldots, \beta^{\star}_1(t_T)]^T &\sim \text{AR1}(\rho_a, \sigma^2_a) \\
\theta_R = [\alpha, d_b, d_{\beta}, \rho_a, \sigma^2_a, \zeta_R] &\sim \text{Priors}.
\end{align*}

The first rows of the linear predictors comprise the global effects of time on the log odds (and thus eventually the probability) of selection. We allow for a quadratically changing global log odds of selection with time, and allow for a global first–order autoregressive deviation from this quadratic change (denoted by β*1(tj)). This term represents the change over time of both the political and public moods regarding the need for maintaining the overall network size. New governments may well prioritize public spending on the environment in different ways and, furthermore, the public's approval of environmental spending likely changes in light of new knowledge. Additionally, a large change in the size of the public monitoring network can be seen around 1982 (see Fig 3.1). Here a sharp decrease in the size of the network occurred, reducing the number of sites by almost half. The smooth quadratic effect of time clearly would not suffice to capture this short term trend and thus a random effect seems compelling, especially one that is able to adequately capture this short term change (i.e. overdispersion), such as the autoregressive term we used.

The second rows of the linear predictors represent the site–specific factors influencing the log odds ratio in favour of a site's inclusion in the network Sj at time tj. Firstly, αret represents what we call the "retention effect".
This term reflects how the probability that a site is selected in a given year changes conditional upon its inclusion in the network in the previous year. Since large costs can be incurred in setting up monitoring sites at new locations, it is plausible that network designers would favour the maintenance of existing sites over their replacement at new site locations, even if the conditions at other sites (represented by the other terms in the linear predictor) are more favourable. In fact, it is this indicator variable that determines whether the linear predictor corresponds to the site-placement process or the site-retention process. If we wanted to investigate the possibility that the effects of PS or covariates were different between the two processes, then we could include additional product terms between the various effects and ri,j−1 to capture this change. Here, we share all parameters across the two processes and allow only a unique intercept to exist between the processes. This is discussed in depth later.

In contrast, αrep captures the repulsion effect. Ii,j denotes an indicator variable that determines whether or not another site in the network, placed within a distance c from site i, was operational at the previous time tj−1. Plausibly, network designers would not want to place sites close to an existing site. Conversely, there may be unmeasured regional confounders affecting the localized site–selection probabilities (e.g. population density) that may lead to additional clustering that cannot be explained by the model without the inclusion of the confounder. This parameter should help to capture any additional clustering that may be present. We choose the hyperparameter c to be 10 km.

Finally, there may be a larger motivation to place more/fewer sites in certain areas of the UK throughout T that cannot be explained by the other terms in the model. This may be due to population density or due to increased/decreased political incentives in this area.
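The repulsion indicator Ii,j just described is simple to compute. The sketch below uses simulated stand-in coordinates and a simulated operational matrix r, not the real network:

```python
import numpy as np

# Hypothetical sketch of the repulsion indicator I_{i,j}: did any other
# site operational in year j-1 lie within distance c of site i?
rng = np.random.default_rng(0)
M, N = 50, 5
coords = rng.uniform(0, 100, size=(M, 2))  # stand-in site locations (km)
r = rng.random((M, N)) < 0.6               # r[i, j]: site i operational in year j
c = 10.0                                   # repulsion radius (10 km, as in the text)

def repulsion_indicator(i, j):
    """I_{i,j} = 1 if some other site within c of site i was on in year j-1."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    near = (d < c) & (np.arange(M) != i)
    return int(np.any(r[near, j - 1]))

I = np.array([[repulsion_indicator(i, j) for j in range(1, N)] for i in range(M)])
assert set(np.unique(I)).issubset({0, 1})
```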
We attempt to capture such spatially–varying area effects in the β*0(s) field. This can be viewed as a spatially–correlated correction field similar to that used by Pati et al. [2011]. Note that this field is fixed in time, with the aim of avoiding identifiability issues.

Whilst it may appear that we have included a lot of effects in the site–selection process, it is of paramount importance to adequately capture and remove as many sources of variability from the site–selection process as possible. The preferentiality parameters should therefore only act upon the residual signal, after such effects have been removed. Since we are dealing with a large quantity of spatio-temporal data, we are able to learn the temporal features affecting site–selection and thus we can attempt to emulate the true process itself. This is in stark contrast with the spatial setting. By removing large sources of variability from the site–selection process first, we reduce the risk of over–estimating the stochastic dependence between the selection and observation processes, and hence reduce the risk of over-adjusting our parameter estimates and predictions.

The third and final rows of the linear predictors represent the preferentiality parameters of the selection process, following the work of Diggle et al. [2010]. We decide to separate the preferentiality into two sources: small–scale deviations from the localized average black smoke levels, and the medium–scale regional deviations from the UK–wide annual black smoke levels. In recognition of the landmark paper by Diggle, we denote the two parameters by db, dβ respectively. Since we have constrained both the [b0,i, b1,i] terms and the βk(s) processes to sum to zero, the terms being multiplied by db, dβ represent deviations from the P–mean. Both of these effects are allowed to affect site selection independently. The interpretation of these PS parameters depends largely upon the choice of the population P.
All PS effects detected are after controlling for the other site–selection effects.

In consideration of the discussions following Diggle et al. [2010], for j > 1 site selections made at time tj involve estimated black smoke levels based on observations made at the previous time tj−1. Thus in our model we do not assume the network designers formulate site selection decisions based on black smoke forecasts into the future or for the current unobserved year, but on predicted quantities at the previous time step. Therefore, in our framework, we model the site–selection as being reactive for times tj : j > 1. Using the notation from Section 2.2, φi,l,k(tj) = tj−1 ∀ i, l, k and j > 1. If the true selection mechanism is believed to be different, then the change of paradigm is trivial. For computational savings, we base the site selection at time 1 on the estimated field at time 1 (i.e. φi,l,k(t1) = t1). Our choice of priors is discussed in depth in the supporting material found in Appendix A.

3.4.4 Three implementations

For Implementation 1 we constrain the PS parameters db, dβ to equal 0. Thus Implementation 1 incorporates the prior assumption that no stochastic dependence between the site–selection process and the observation process was present, and thus that no PS occurred. A direct result of this independence assumption is that the posterior distribution of the observation process Y is the same regardless of the specification of either the site–selection model terms or the choice of the population of sites P to consider for selection. Thus, the results from Implementation 1 will match those of a typical spatio–temporal analysis that ignores site–selection. This will be used as our baseline for comparison.

For Implementation 2, we remove the zero constraints on the PS parameters, imposing instead weakly informative Gaussian priors with mean 0 and variance 10. For Implementation 2, we consider only the 1466 observed site-locations for selection at each time t ∈ T .
We define this as Population 1, P1, and thus M = |P1| = 1466. Population 1 is shown as the red circles in Fig 3.3.

For Implementation 3, we replace the zero constraints with the same Gaussian priors, but consider a different population of sites for selection at each year, P2. For P2, thousands of pseudo–sites are also considered for selection at each time step, along with the observed sites from P1. We ensure the locations of the pseudo–sites are uniformly distributed throughout Great Britain (GB) and placed with high density. It has been shown that estimates and corresponding standard errors of all (non-intercept) parameters converge toward those of the equivalent IPP as the number of pseudo–sites tends towards infinity, so long as the density of the points is uniform (in probability) [Fithian and Hastie, 2013, Warton et al., 2010]. Thus there is some duality between the approach of DMS [Diggle et al., 2010] and our Implementation 3. The locations of P2 are shown in Fig 3.4.

For Implementation 2 we aim to see if the network evolved preferentially. That is, out of the observed sites, were sites added to and dropped from the network in a manner that was dependent upon the value of the latent black smoke process, and hence missing not at random (MNAR)? Under Population 1, since we do not consider locations within the unsampled regions for selection, no additional information is being added to the unsampled regions. Hence we do not expect the estimates of BS to change much at these locations unless estimates of the site–trajectories, and hence the P1–mean, change. Furthermore, we are unsure if the joint model will substantially adjust estimates of the P1–mean, even if PS is detected. This is because results from a small simulation study we conducted suggest that if we fit an inflexible temporal model to a dataset whose sites have a long average consecutive lifetime, estimates will remain largely the same due to the over–determined nature of the problem.
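The pseudo–site device used for Implementation 3 can be illustrated in a toy one-dimensional setting (all values below are made up): a logistic regression of observed locations against dense uniform pseudo–points recovers the slope of a log-linear Poisson-process intensity, which is the convergence result of Warton et al. [2010] and Fithian and Hastie [2013] cited above.

```python
import numpy as np

# Toy 1-D version of the pseudo-site idea: simulate an inhomogeneous
# Poisson process with log-intensity a + b*x on [0, 1] by thinning, then
# fit presences vs. dense uniform pseudo-points by logistic regression.
rng = np.random.default_rng(7)

a_true, b_true = 5.5, 2.0
lam_max = np.exp(a_true + b_true)
n_prop = rng.poisson(lam_max)
x_prop = rng.uniform(0, 1, n_prop)
keep = rng.random(n_prop) < np.exp(a_true + b_true * x_prop) / lam_max
x_pres = x_prop[keep]

x_pseudo = rng.uniform(0, 1, 20_000)            # dense uniform pseudo-sites
x = np.concatenate([x_pres, x_pseudo])
y = np.concatenate([np.ones(x_pres.size), np.zeros(x_pseudo.size)])
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)                              # logistic fit via Newton/IRLS
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))

# The slope recovers the intensity slope; the intercept absorbs the
# pseudo-site density.
assert abs(beta[1] - b_true) < 0.75
```

The non-intercept coefficient is consistent for the intensity slope because, conditional on location, the odds of a point being a presence rather than a pseudo-point are proportional to the intensity.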
In fact, the sites in the dataset provide an average of 12 consecutive years of readings, with the minimum consecutive lifetime of a site being 6 years. Additionally, the deviation from the quadratic trend is typically small (Fig 3.2). Thus we may expect only a small change to the results seen from Implementation 1.

For Implementation 3 we investigate if the network of operational sites at each time St : t ∈ T is being located throughout GB (Ω) in a preferential manner. Thus the interpretation of preferential (i.e. response–biased) network evolution is lost under this choice of population. Instead, these PS parameters dβ, db now measure the degree to which the operational network (St) is preferentially located in Ω through time T .

Figure 3.4: A plot of the locations of all sites considered for selection in Population 2. The locations are shown as blue dots, many of which are in regions of low human population density.

This is due to our second population P2 covering Ω uniformly, and hence considering each point s ∈ Ω as being equally likely to be sampled a priori. This is unlike Population 1, which did not include large areas of unsampled Scotland, Wales and Cornwall for selection at each time t ∈ T . Thus Population 2, by adding additional information to the unsampled regions via the site–selection process, should inform the joint model about the appropriate adjustment of BS estimates in the unsampled regions according to the nature of PS detected. Put differently, the joint model will extrapolate any associations detected between the site–selection process and the underlying latent effects into the unsampled regions.

In fact, hidden away in the details of Implementation 3 is the fact that the Bernoulli random variable models two processes simultaneously. Implementation 3 can be considered as a joint model with three processes: an observation process, an initial site–placement process and a site–retention process. The latter two are fit using only one Bernoulli likelihood.
The initial site–placement process is fit using a conditional logistic regression approximation to a log-Gaussian Cox process, and is similar to that seen in Diggle et al. [2010]. The site–retention process is modeled as a Bernoulli random variable. Inside the linear predictor of the Bernoulli likelihood, the indicator variable ri,(j−1) points the linear predictor towards the site–placement process when it is equal to 0, or towards the site–retention process when it is equal to 1. In our example, we only allow for a unique intercept to exist across the two processes, sharing the remaining parameters. Thus we assume that the effects of all the covariates and the effects of PS are constant across the two processes. This assumption can be relaxed by including interaction effects between ri,(j−1) and the other parameters, including the PS parameters.

Note that care is required to ensure that only the pseudo–sites contribute a zero to the Bernoulli likelihood for the site–placement process across all years. Furthermore, for our application, we must ensure that only the sites that have been removed from the network in year j contribute a zero to the Bernoulli likelihood for the site–retention process at year j. This reflects the fact, seen in our data, that no site in the network was ever re-installed after its removal. Clearly then, the choice of zeros here is application–dependent. Additional details are given in the supporting material found in Appendix A.1.

The ability of our joint model to adjust estimates of the pollution process at a point s depends upon the distance of the point from the nearest monitoring site in the network. For pseudo–sites further from an observed site than the effective range of the spatially varying β processes, essentially all the degrees–of–freedom of the spatially–varying quadratic terms βk(s) are available for use in fitting the site–selection process, to make the posterior probability of repeated non-selections (i.e. the ri,j = 0's) of the pseudo–site high.
Since we have no black smoke observations here, the fitting of the quadratic slopes to these pseudo–sites is therefore an under–determined problem. Thus we would expect the estimates of black smoke here to be different. For pseudo–sites very close to an observed site (i.e. well within the effective range), we would expect the estimates at the pseudo–site locations to remain largely unchanged, since the problem remains over–determined. For pseudo–sites within the effective range of, but not immediately next to, an observed site, we expect estimates to change moderately, since the problem is weakly determined.

3.4.5 Model identifiability issues

When fitting a model this large, issues around model identifiability commonly arise, namely concerning whether the data can provide information about the model parameter values through the likelihood. We addressed these issues with two approaches. First, we enforced sum–to–zero constraints on all the random effects to ensure they are simply localized deviations about a global trend. As discussed in the supporting material found in Appendix A.1, we placed penalized complexity priors [Fuglstad et al., 2019, Simpson et al., 2017] on the Matérn parameters of the Gaussian processes to provide some prior information on the range and scale, while reducing the possibility of overfitting the data.

To confirm that we had fully resolved the model identifiability issues, we then conducted a small simulation study. We sampled the data from various models similar in form to the joint model introduced in Sections 3.4.2 and 3.4.3, to see if the posterior estimates of both the parameters and the space–time field covered the true values. Interestingly, for a much smaller dataset, we found no identifiability issues except for the range parameter of the β*0(s) process. Here the mean squared error of the point estimates of this parameter was very high relative to the other parameters, although the nominal coverage levels and bias remained good.
This could be a sign of identifiability issues surrounding this effect, or could perhaps be due to the difficulty of estimating a Matérn field using only small amounts of binary point data. All other parameter estimates in the simulation studies, as well as the posterior predictions, were good. Of most interest was the model's capability to detect the preferentiality parameters dβ, db with high precision, negligible bias and with posterior credible intervals attaining nominal coverage levels.

Interestingly, we experienced the same difficulties with identifying the β*0(s) process in our case study. Our estimated marginal distribution for the range parameter of the β*0(s) process in the UK black smoke case study was found to have a 95% posterior credible interval of (0.03, 1.18). Given that we scaled the coordinates, these distances imply that the model encountered difficulties with estimating this parameter. Importantly, the posterior mean of the standard deviation of this effect was around 0.03, with 95% credible intervals lying in the region of between 0.00 and 0.08. Thus, ultimately this effect has minimal impact upon the model fit.

We also assessed the ability of the joint model framework, under simulated PS settings, to de-bias estimates of site–specific trajectories and network averages (equivalent to the P1–mean). Two such simulation studies considered distinct temporal trends. The first fixed the temporal component to be rigid; the second allowed for a flexible nonlinear trend. In particular, we witnessed that under a rigid (spatially-varying) linear slopes model, when the average of the consecutive lifetimes of the sites is high, the bias induced in the site–specific estimates and the P1–mean that occurs from ignoring the site–selection process is almost zero. This is due to the problem being over-determined – only a few observations of the process at each site are required for the model to accurately forecast/backcast estimates throughout T .
This is similar to what is seen in the case of the UK black smoke dataset. Conversely, when the temporal trend is highly nonlinear and the average consecutive lifetimes of the sites are short, the biases in parameter estimates, site–specific predictions and estimates of the P1–mean through time can all be high if we ignore the site–selection process. To provide a 'highly nonlinear' trend, we opted to use an independent realisation of a Matérn field for each of the 30 simulated 'years'. The insights from these two scenarios help explain the results seen shortly in Implementation 2. They also hint that changes to inference under Implementation 2 (with population P1) would be highest for applications with mobile monitoring sites.

3.5 Results

We focus our attention upon the following issues and objectives:

1. Do Implementations 2 and/or 3 detect that, within the network of observed sites (i.e. Population 1), the sites have been preferentially added and removed, even after controlling for the various covariates included in the site–selection process? If so, has this been done based upon short–range, site–specific deviations from the regional mean black smoke, and/or medium-range regional deviations from the annual P–mean?

2. When considering Implementation 3, does the model detect that the networks of operational sites St have been preferentially located within GB (Ω) through time, even after controlling for the various covariates included in the site–selection process?

3. Do estimates of the black smoke annual means in GB (i.e. the P2–mean) change significantly when we consider the stochastic dependence between the placement of the sites and the black smoke field?

4. If we backcast and/or forecast the predictions at all observed site locations (i.e. s ∈ P1) at all times, how do the estimated black smoke levels differ between the operational (St) and offline sites (SCt)? Do these differences change in time, and if so, does the apparent priority of site placement change through time?

5.
Given that the original purpose of the air quality network was to monitor the progress achieved by the Clean Air Act in reducing the population exposure levels to both black smoke and sulphur dioxide [McMillan and Murphy, 2017], if we average the estimated black smoke field across Great Britain's population, do the estimated population-average exposure levels change between the implementations?

6. Considering the 1980 EU black smoke guide value of 34 µgm−3, how does the estimated proportion of GB exceeding this value change through time? What are the differences across the three implementations? Furthermore, how do estimates of the proportion of the population exposed to BS levels above this value change under the three implementations?

Table 3.1: A table showing the posterior means and standard deviations of the parameter estimates for the three implementations. Note that the top row of estimates of β0 has been transformed back onto the original data scale.

Parameter            Implementation 1   Implementation 2   Implementation 3
dβ                   0 (0)              0.62 (0.17)        2.77 (0.01)
db                   0 (0)              0.06 (0.04)        0.12 (0.01)
β0 (original scale)  96.50              94.94              21.87
β0 (trans scale)     1.15 (0.02)        1.13 (0.01)        -0.34 (0.09)
ρb                   -0.77 (0.02)       -0.76 (0.02)       -0.78 (0.00)
αret                 -                  6.18 (0.06)        6.47 (0.06)
αrep                 -                  0.08 (0.11)        0.82 (0.10)

3.5.1 Implementation 1 – assuming independence between Y and R

If we assume independence between Y and R, the posterior results about the observation process Y from Implementation 1 are identical to those that would have been discovered from fitting only the observation process (i.e. fitting only the Y model). As expected, especially high values of black smoke are predicted to exist around the North West and Yorkshire areas of England in 1966. This area covers the major cities of Liverpool, Manchester, Leeds and Sheffield, all industry–heavy cities at the time under study. By 1996 the relative levels of black smoke in these areas are far reduced and

Figure 3.5: Implementation 1.
In green are the model–estimated BSlevels averaged over sites that were selected in P1 (i.e. oper-ational) at time t. In contrast, those in red are the model–estimated BS levels averaged over sites that were not selectedin P1 (i.e. offline) at time t. Finally, in blue are the model–estimated BS levels averaged across Great Britain. Also in-cluded with the posterior mean values are their 95% posteriorcredible intervals. If printed in black-and-white, the green bandis initially the lower line, the red band is the upper line and theblue band is initially the middle line. Notice the change in theordering of the values that occurs in 1982.73exceeded by the Greater London area. Counter–intuitively however, theestimated black smoke levels in the Scottish Highlands, an area with almostno manufacturing or industry are predicted to be relatively high (see Fig A.1in the Appendix) across all time periods. This result is a direct consequenceof the absence of monitoring sites in this area (see Fig 3.3), along with alack of informative covariates included in the observation process Y for thisregion.A typical location in the unsampled regions of the Scottish Highlands,Cornwall or The Borders (the Northernmost and South-Westernmost regionsof Fig 3.3) sees their distance to the nearest site in P1 typically exceedingthe estimated spatial ranges of the random fields. Consequently, model–estimates in such areas essentially equal the average of the observed pollutionlevels (i.e. the P1–mean). This feature can immediately be seen to beproblematic since it is likely that the true black smoke levels will be belowthe P1–mean in these regions. As well, large standard errors (i.e. posteriorpointwise standard deviations) for the predicted black smoke levels are foundin these regions due to their lack of monitoring sites (see Fig A.1).Next, we consider the model–estimated black smoke levels for all theobserved site locations (i.e. 
Population 1) in Fig 3.5 at every time point. To investigate Objective 4, for each t ∈ T we split the observed sites into the operational sites St and the offline sites SCt. The set of operational sites St is defined to be the sites in Population 1 that recorded the minimum number of observations that year. The set of offline sites SCt is defined to be the sites in Population 1 that failed to record this minimum number of observations that year. Note that St ∪ SCt = P1 and St ∩ SCt = ∅.

From Implementation 1, it appears that the sites were initially placed in regions with below–average black smoke levels between 1966 and 1980 (see Fig 3.5). This is inferred from the posterior mean black smoke levels – they are significantly lower for the operational sites compared with the estimated GB–average. The lack of additional information for the unsampled regions of GB makes the estimates in these areas equal to the P1–mean, and thus the GB–average is nearly identical to the P1–mean. Over time, the posterior means for the black smoke levels at the operational and offline sites converge, before the direction of preferentiality changes in 1982. The latter was the year a major network redesign was initiated, removing almost half of the operational sites (see Fig 3.1). Here we see strong evidence that the sites that remained in the network after this redesign were in locations with black smoke levels above the P1–mean. This is due to the posterior mean black smoke levels being significantly higher for the operational sites compared with the offline sites.

Thus from looking at the results from Implementation 1 alone, we gain some insight into Issues 1 and 4. It appears that the sites were preferentially sampled in almost all time periods. Initially the operational sites appear to have been placed in regions with black smoke levels below the P1–mean, before being placed in regions with levels above the P1–mean after the major network redesign in 1982.
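To make the yearly partition of Population 1 concrete, the split into operational and offline sites can be sketched as below. The observation counts and the threshold `min_obs` (the minimum number of yearly observations required to count as operational) are hypothetical stand-ins for illustration only.

```python
# Sketch of the operational/offline split described above, with toy data.

def split_network(obs_counts, min_obs):
    """obs_counts: {year: {site_id: n_obs}} for every site in Population 1.
    Returns {year: (operational S_t, offline S_t^C)}."""
    split = {}
    for year, counts in obs_counts.items():
        operational = {s for s, n in counts.items() if n >= min_obs}
        offline = set(counts) - operational   # complement within P1
        split[year] = (operational, offline)  # S_t and S_t^C partition P1
    return split

# Toy example: three sites, two years.
counts = {1966: {"a": 52, "b": 10, "c": 48},
          1982: {"a": 50, "b": 0, "c": 3}}
split = split_network(counts, min_obs=40)
# split[1982] → ({'a'}, {'b', 'c'}): only site "a" remains operational
```

By construction, the two sets are disjoint and their union recovers the full observed network each year.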
These results are significant with respect to 95% credible intervals. However, we still have doubts about the predicted black smoke levels in regions of GB known to have little industry or population density – two major sources of black smoke. Since these regions cover large percentages of the surface area of GB, the effect of over–estimating the predictions in these areas would be a marked increase in the estimated GB–average black smoke level. Implementation 3 attempts to rectify this problem by extending the definition of P into these regions.

3.5.2 Implementation 2 – P1

Firstly, we consider the posterior parameter estimates for the two sources of preferentiality (see Table 3.1). These are denoted by dβ and db, the medium–range and short–range preferentialities respectively. Only the former effect dβ was detected to be significantly nonzero, with a posterior estimated value of 0.66 and a 95% posterior credible interval of (0.34, 0.99). The posterior estimate of the short–range preferentiality was 0.06 with a 95% posterior credible interval of (-0.01, 0.15). Thus in both cases the direction of preferentiality was positive, suggesting that year–by–year, the site placements are positively associated with the relative levels of black smoke at the site location, especially with the regional–average level.

Figure 3.6: A plot of the year–by–year change in the logit of selection captured by the autoregressive β⋆1(t) process in the R process in Implementation 2. Notice the sharp negative signal seen in 1982. Note that the plot for Implementation 3 is almost identical.

Interestingly however, despite this reasonably strong evidence of PS, the posterior predictions of black smoke levels are almost identical to those from Implementation 1. Fig A.5 and Fig A.2 both appear strikingly similar to their counterparts from Implementation 1 (Fig 3.5 and Fig A.5). In particular, no obvious changes in the estimated BS levels are seen across the unsampled regions of the Scottish Highlands or the foot of Cornwall.
Furthermore, the posterior mean black smoke level averaged across GB remains largely the same throughout time relative to the predictions from Implementation 1.

Thus it appears that despite the joint model detecting PS under Population 1, little–to–no change in the posterior estimates is seen in either the GB–average levels or the individual site–specific BS trajectories. This is in stark contrast with the de–biasing of the regional mean witnessed shortly under Implementation 3. These two results may be best explained in terms of the two different populations P1 and P2 of sites under consideration for selection.

For P1, since the sites considered for selection at each time t are only the locations in which an operational site is placed at any time t ∈ T, no information about the selection of sites has been added to the never–sampled regions in Ω. Consequently, when estimating the levels of black smoke via the estimation of the latent Gaussian fields in these regions, we have no additional information about the possible values they could take. Thus, model–based estimates in these unsampled regions will tend towards the predicted global mean levels, which in this case is precisely the P1–mean (the average taken across the network of observed locations). Furthermore, given the high average lifetime of the monitoring sites, estimates of the site–specific trajectories, and hence the P1–mean, barely change under the joint model due to the over–determined nature of the estimation. This is in stark contrast with P2, or with a point process approach. These place zero counts throughout the domain Ω and hence add additional information into the never–sampled, and hence under–determined, regions. The lack of change in estimates of the P1–mean is not a problem with the model.
The quadratic model showed a good fit, and we therefore see the inability of the model to change the longitudinal trajectories at the observed site locations for this dataset as evidence of the model's robustness – we would almost certainly be concerned if the estimates changed dramatically at the site locations.

If instead, when forming our predictions of black smoke at these never–sampled locations, the model had the additional information that no site was selected here at this time (i.e. Ri,j = 0 at site si ∈ Ω \ P1), then this would provide the model with additional information about the likely values of black smoke at this location. For example, if PS were detected by the model, such that locations in regions with above–average black smoke were estimated to have a site with higher probability (i.e. if dβ > 0), then knowledge that a site was not placed at a given location would provide (albeit only slight) evidence for the model that the black smoke level here is below the operational network average. Suppose instead that we have a whole region, such as the Highlands, with no monitoring sites present at any time. Estimates of black smoke across this region could then be considerably below the average of the predicted levels at the observed site locations throughout time, depending upon the magnitude of PS detected. This idea of filling the region with zeros to indicate non–selection is the basis of the paper of Diggle et al. [2010] and the approach taken in Implementation 3.

For datasets where the average lifetimes of the monitoring sites are shorter, the measurement error is higher, and/or the functional form of the temporal trend is of higher order, this joint model framework would have a greater capacity to change estimates of site–specific trajectories, the P1–mean, and hence predictions throughout Ω. This was seen in our simulation study.
However, for many applications involving data collected from static monitors, little will change in inferences under a joint model with population P1. An example of where large differences may be witnessed is for data collected over time from mobile monitors whose locations change at each time step. In this setting we would have a very sparse data setup, with only a single observation of the process' trajectory obtained at each location. The large under–determined missing–data problem here would present the perfect opportunity to assess the ability of the joint model framework to adjust the inference.

The autoregressive β⋆1(t) process captured a sharp decline in the average logit for site selection in 1982, the year of the extensive network redesign (see Fig 3.6). This process may be reflecting, among other things, the year–by–year changes in public and political moods towards pollution monitoring. The 95% posterior credible intervals do not cover 0, and thus the drop of over half of the network in 1982 appears to be a significant event in the lifetime of the network.

Turning our attention now to the estimated parameters of the site–selection process Ri,j, no clear repulsion effect αrep was detected (αrep = 0.08, 95% CI (-0.14, 0.31)). This implies that any clustering or repulsion effects witnessed in the data with respect to P1 can be attributed to the levels of black smoke alone. On the contrary, the retention effect was found to be very large, at 6.18 (95% CI (6.07, 6.29)), in agreement with common sense. This finding indicates that there is a clear incentive (possibly financial) for site–selectors to maintain sites in their current locations instead of relocating them each year.

In summary, for this dataset Implementation 2 does not lead to changes in site–specific trajectories, nor does it lead to changes in estimated BS levels in unsampled regions of GB. However, we do still gain some useful insights.
We find that the site–selection was in fact preferentially made (i.e. response–biased), and that the extent of this PS could not be attributed to chance alone. Furthermore, we were able to investigate the impact of other factors, such as retention effects and changing political affinities for the network expansion, on the evolving operational network St. We have presented future applications where the results from implementations 1 and 2 may not agree so closely.

3.5.3 Implementation 3 – P2

Firstly, we consider the posterior parameter estimates for the two sources of PS (see Table 3.1). These are denoted by dβ and db, the medium–range and short–range preferentialities respectively. The posterior estimated value of dβ was 2.77 with a 95% posterior credible interval of (2.76, 2.79). The posterior estimate of the short–range preferentiality was 0.12 with a 95% posterior credible interval of (0.11, 0.13). Thus in both cases the direction of preferentiality was significantly positive, suggesting that year–by–year, the site placements were positively associated with the relative levels of black smoke at the site location, both locally and regionally.

Fig A.3 shows a striking difference in the appearance of the estimated black smoke field through time. A direct consequence of the strong PS detected is the dramatic drop in the posterior predictions of black smoke levels in undersampled regions of GB relative to Implementation 1. Fig A.3 shows a huge drop in estimated levels in the unsampled regions of Northern Scotland, Mid Wales and the foot of Cornwall relative to Fig A.1 and Fig A.2.

Figure 3.7: Implementation 3. In green are the model–estimated BS levels averaged over sites that were selected in P1 (i.e. operational) at time t. In contrast, those in red are the model–estimated BS levels averaged over sites that were not selected in P1 (i.e. offline) at time t. Finally, in blue are the model–estimated BS levels averaged across Great Britain. Also included with the posterior mean values are their 95% posterior credible intervals. The black dashed lines denote the lower 10th percentile and lower quartile observed in the data. Note that the estimated black smoke trajectories from the pseudo–sites are not included in the mean calculations to form the red band. If printed in black-and-white, the green band is initially the middle line, the red band is initially the upper line and the blue band is initially the bottom line. Notice that the GB–average BS levels are around a quarter of their sizes seen in Implementation 2.

Implementations 1 and 2 estimated these regions to have average BS levels due to the lack of any additional information in these regions. Furthermore, Fig 3.7 shows that the posterior mean black smoke level averaged across GB is around a quarter of the size of that estimated from implementations 1 and 2 (see Fig 3.5 and Fig A.5). This is a direct consequence of the decreased levels estimated in the undersampled regions that make up a large percentage of the surface area of GB. This addresses Objective 3 of the analysis.

Interestingly, model–inferred black smoke levels in these unsampled regions have very high standard errors (i.e. large pointwise posterior standard deviations) associated with their point estimates. This can be seen in the bottom two plots of Fig A.3. Here, the upper 95% pointwise credible intervals actually cover the estimates from Implementation 1. As expected, the posterior estimates of the observed site trajectories (both operational and offline) change very little (see Fig 3.7).

To address Objective 4, refer to Fig 3.7. In agreement with Figures 3.5 and A.5, it appears that the magnitude of preferentiality increases over time. Initially, the annual averages at the locations of the offline observed sites far exceed those from the locations of the operational observed sites.
The difference diminishes over time until the major network redesign in 1982, which led to a change in the direction of the relative annual mean levels. Thus it appears that the magnitude of the bias in the reported annual black smoke levels from the operational network, relative to the Great British average, increased over time – with a dramatic step change seen in 1982. Of most importance, however, is the discovery that the observed black smoke levels from the network appear to have never been representative of the levels of GB as a whole, with a positive PS effect detected at all times. In fact, Fig A.3 shows that around 85-90% of the sites in the network were placed in regions with above P2–mean BS throughout the lifetime of the network.

Once again the autoregressive β⋆1(t) process, reflecting the year–by–year changes in public and political mood towards pollution monitoring, captured a sharp decline in the average log intensity for site placement in 1982. The estimate is almost identical to that seen in Implementation 2 (see Fig 3.6), and so we omit the plot.

Regarding the estimated parameters of the site–selection process Ri,j, the αrep term was detected to be positive with value 0.82 [95% CI (0.62, 1.02)]. This implies that there is additional clustering present that cannot be explained by the levels of black smoke alone. This may be capturing some of the latent factors influencing the selection of monitoring sites, such as population density.

3.5.4 Impacts of preferential sampling on estimates of population exposure levels and noncompliance

Whilst the dramatic decline in GB–average black smoke levels seen under the joint model in Implementation 3 is interesting, the monitoring network was not intended for the accurate mapping of black smoke across the whole of Great Britain, but was instead established for tracking the progress achieved by the Clean Air Act in reducing the exposure levels of both black smoke and sulphur dioxide [McMillan and Murphy, 2017].
Thus, judging the monitoring network based on its ability to represent the levels of black smoke across GB as a whole is potentially misleading. Taking this into consideration, we now attempt to assess the effects of PS on estimates of population exposure, and hence the effects of PS on the ability of the network to fulfill its objectives. Over the time period of study, various EU limits and guidelines on annual black smoke levels were introduced, including the annual average guide value of 34 µgm−3 introduced in 1980 (repealed in 2005) [Zidek et al., 2014]. We repeat the analysis of Zidek et al. [2014] and assess the changes in the estimates of noncompliance under PS.

For estimating the population exposure levels, we obtained gridded residential human population count data with a spatial resolution of 1 km x 1 km for Great Britain, based on 2011 Census data and 2015 Land Cover Map data from the Natural Environment Research Council Centre for Ecology & Hydrology [Reis, 2017]. The data came in the form of a raster layer, and we formulate our estimate of population density across the time period (1966 - 1996) by normalizing the count raster, dividing each cell by the total sum across all the cells. Here we assume that the relative population density has remained stable from 1966 to 2011, so that the estimated population density layer is a good proxy across the years of study. We also assume that residential population density is a good proxy of where the population is situated throughout the year, and hence that actual black smoke exposure levels are similar to estimated residential levels. Next, we define a projector matrix to project the GMRF estimated in INLA on the triangulation mesh onto the centroids of the population density cells that make up the raster. Finally, we are able to use the Monte Carlo samples from the posterior marginals from INLA and the projector matrix to estimate the posterior distribution of the black smoke field at each of the grid cells.
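The normalisation step is straightforward; a minimal sketch with a toy count raster (the values below are assumptions, standing in for the 1 km x 1 km residential counts) is:

```python
import numpy as np

# Toy stand-in for the gridded population-count raster; in the real
# analysis each cell holds a 1 km x 1 km residential count.
counts = np.array([[120.0, 30.0],
                   [0.0, 450.0]])

# Normalise so the cells form a density: each cell divided by the total.
rho = counts / counts.sum()
# rho sums to 1 and can be paired with the field values projected onto
# the cell centroids via the projector matrix.
```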
Letting ρj(s) denote the population density of Great Britain at location s ∈ Ω in year j, such that ∫Ω ρj(s) ds = 1, we can then estimate the population–mean exposure levels by approximating the following integral:

$$\mu_{\mathrm{pop},j}(\Omega) = \int_{\Omega} \mu(s, j)\,\rho_j(s)\,\mathrm{d}s \approx \sum_{i=1}^{G} \bar{\hat{\mu}}_j(s_i)\,\hat{\rho}_i = \frac{1}{M}\sum_{m=1}^{M}\sum_{i=1}^{G} \hat{\mu}_{i,j,m}(s_i)\,\hat{\rho}_i,$$

where $s_i$ denotes the ith raster grid-cell centroid (i = 1, ..., G), $\bar{\hat{\mu}}_j(s_i)$ denotes the Monte Carlo mean black smoke level at location $s_i$ in year j, and $\hat{\rho}_i$ denotes the estimated population density at the ith grid cell. Approximate credible intervals for this quantity can also be formed. We can also use this method to estimate the proportion of the population exposed to annual average black smoke levels exceeding the EU guide level of 34 µgm−3 each year, by simply replacing the term $\hat{\mu}_{i,j,m}(s_i)$ in the summation with the indicator variable representing the event that the value exceeds 34 µgm−3. Note here that the index m denotes the Monte Carlo sample number.

We now do this, both for the estimated black smoke levels under Implementation 2 (i.e. Population 1) and again under Implementation 3 (i.e. Population 2). Note that the results under Implementation 1 are almost identical to those from Implementation 2, so we omit them in the plots.

Figure 3.8: A map plot of the posterior pointwise probability of the annual average black smoke level exceeding the EU guide value of 34 µgm−3 under Implementation 2 (left) and Implementation 3 (right). From top to bottom are the years 1970, 1973 and 1976. The colour scale goes from 0 to 1 for all the plots, with dark blue denoting a posterior probability of 0 and dark red denoting a posterior probability of 1. Note that the plots for Implementation 1 are almost identical to those from Implementation 2 and are omitted. Notice that almost none (the majority) of Great Britain is confidently predicted in 1970 to have BS levels below the EU guide value under Implementation 2 (3).

Figure 3.9: A plot showing the posterior mean and 95% credible intervals of the annual residential–average exposure levels across the years of study. Shown are the results from Implementation 2 (i.e. Population 1) and from Implementation 3 (i.e. Population 2). The horizontal line denotes the EU guide value of 34 µgm−3.

Figure 3.10: A plot showing the posterior mean and 95% credible intervals of the annual proportion of the population with black smoke exposure levels exceeding the EU guide value of 34 µgm−3 across the years of study. Shown are the results from Implementation 2 and from Implementation 3.

Fig 3.8 shows plots of the posterior pointwise probability of exceeding the EU annual black smoke guide value of 34 µgm−3 under implementations 2 and 3, across the years 1970, 1973 and 1976. The colour scale goes from 0 to 1 for all the plots, with dark blue denoting a posterior probability of 0 and dark red denoting a posterior probability of 1. In agreement with the plots of the pointwise posterior means (see Fig A.2 and Fig A.3), a dramatic decline in the estimates of noncompliance can be seen under Implementation 3 in the regions far from the nearest monitoring network across the years (see Fig 3.8). This has major ramifications for the total reported proportion of Great Britain in noncompliance with the guide value. For example, under Implementation 2 almost the entirety of Great Britain is estimated to be in noncompliance with the guide value up until 1970. This figure drops to below 25% in 1970 under Implementation 3 (see Fig A.4).

However, once again, the monitoring network and the guide value were intended to measure and control the population exposure to black smoke levels. Thus our maps showing the pointwise posterior probability of exceedance, whilst being dramatic, may not be a fair assessment of the network.
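The two Monte Carlo summaries just described — the population-weighted annual mean and the population proportion above the guide value — can be sketched as follows. The posterior draws and weights here are simulated stand-ins, not model output:

```python
import numpy as np

GUIDE = 34.0  # EU annual guide value, in micrograms per cubic metre

def population_summaries(mu_samples, rho):
    """mu_samples: (M, G) Monte Carlo draws of the field at the G cell
    centroids; rho: length-G population weights. Returns the population-
    weighted mean, the mean exceedance proportion, and a 95% interval."""
    rho = rho / rho.sum()                         # normalise the weights
    pop_mean = float(mu_samples.mean(axis=0) @ rho)
    # Replace the field values by exceedance indicators, as in the text,
    # to get one population exceedance proportion per Monte Carlo sample.
    exceed = (mu_samples > GUIDE).astype(float) @ rho
    return pop_mean, float(exceed.mean()), np.percentile(exceed, [2.5, 97.5])

rng = np.random.default_rng(1)
mu_samples = rng.normal(40.0, 5.0, size=(1000, 50))  # toy posterior draws
rho = rng.random(50)                                  # toy population weights
pop_mean, p_exceed, ci = population_summaries(mu_samples, rho)
```

Because the exceedance proportion is computed per Monte Carlo sample before averaging, its percentiles give the approximate credible interval mentioned above.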
Instead, we now focus on the estimated proportion of the population of Great Britain exposed to black smoke levels out of compliance with the air quality standard. Given that the density of monitoring sites in the network follows the large population centres of GB closely, we would expect the differences between the estimates to be much lower. In fact, this is not the case. Fig 3.10 still shows a large decrease in the estimated proportion relative to Implementation 2, from 89% to 73% in 1966 for example. Note that the posterior credible intervals still show a large discrepancy between the estimated proportions. This is despite us including the additional short–scale variability from the spatially–uncorrelated IID effects in the estimates (one pair of realized b terms per 1 km grid cell, per Monte Carlo sample).

Finally, we turn our attention to the estimated population–average annual black smoke exposure levels across the two implementations (2 and 3). In agreement with Fig 3.10, Fig 3.9 shows a clear decrease in the estimated annual averages. Given the sensitivity of health–effect estimates of air pollution to the accuracy of population exposure levels, this result is especially striking.

3.6 Discussion

Importantly, many of the detected preferentiality effects, and the subsequent de–biasing effects on prediction, are likely mediated by well–known covariates. For example, annual population density figures and/or industrialisation indices (in their correct functional form) would likely simultaneously explain much of the PS detected if included in the Ri,j process, and be strongly positively associated with the observed levels of Y in the observation process. Sites may well be placed in regions where lots of people live and work to ensure the network captures 'typical' exposures experienced by the public, and some sites may be located in areas close to polluting industry for exceedance detection.
Since the daily activities of people and industry may well be the main contributors to black smoke levels, including these covariates in the observation model Y would therefore likely lead to decreased model–estimated pollution levels in unsampled regions with low population density and industry, such as the Highlands of Scotland.

In many applications, the PS may disappear upon the inclusion of such covariates and hence be reduced to a missing–at–random scenario. Given that the focus of this Chapter was to repeat previous analyses of this dataset [Shaddick and Zidek, 2014, Zidek et al., 2014] under our new framework and assess the changes, we do not consider including covariates here. Furthermore, we wanted to show that in settings where such covariates are unavailable, sensible adjustments can still be realized under a careful use of our model framework. Additionally, given that the locations of the monitoring sites are almost exclusively situated near population–dense, industrious, and urban regions, it is unclear whether these locations would provide the adequate contrast required to estimate the correct functional forms of these covariates. It would be interesting in future work to see if any PS is detected in this data after conditioning on as many such variables as possible in both processes. In summary, this Chapter is not attempting to bypass the need for including relevant covariates in the modelling. Rather, it is presenting a method for accounting for the effects of any residual unmeasured confounders associated with both processes by using spatio–temporal fields to act as a proxy.

This modelling framework should be considered both to detect preferential dropout within a fixed population or network P, and to detect if the population or network P was preferentially placed within the domain of study Ω. Accomplishment of both of the above depends upon the choice of population of sites under consideration for the site–selection process.
If PS is detected using this model, then first and foremost, the modeller should attempt to find available covariates that mediate the detected preferentiality. If, after exhausting the available mediators (e.g. population density), and after removing as many sources of variability from the site–selection process as possible, preferentiality is still detected, then this modelling framework should be used for detecting the potential consequences of this sampling scheme on the subsequent inference – either on parameters or on spatio–temporal prediction.

Furthermore, different regression models can be explored for the initial site–placement and site–retention processes. For example, different covariates may be believed to affect only one of the two processes, the qualitative behaviour of certain covariates on the two processes may be different, or perhaps the nature of PS could differ across the two processes. We did not explore these possibilities here, assuming only a unique intercept existed between the two processes.

Additionally, the functional form used to model PS can be as flexible as desired. Here we opted to model the direction and magnitude of PS as being constant through time. In reality this may not be suitable, and the direction and magnitude of preferentiality may change through time. In Fig 3.9 we can see that initially (at t = 1) the operational network was established such that it gave annual readings below the P–mean under Population 1. Then, as time progressed, the magnitude of the preferentiality decreased as the annual averages from the operational sites approached those from the population average. Thus it may make sense here to estimate a separate preferentiality parameter dβ for time 1 and for t > 1. For time 1 this would likely be estimated to be smaller than for t > 1.
For simplicity we opted against this approach; however, such a model would help paint a more detailed picture of the dynamic nature of the PS through time.

If one wishes to adjust the estimates of the domain–average (the GB–average in our example) for the effects of PS, the population of locations P considered for selection should be extended to include locations in unsampled regions in the domain of study Ω. Population 2 did just that, and as a result the GB–average estimates dropped significantly under the joint model. An alternative approach would be to consider modelling the site–placement events each year implicitly as realisations from a LGCP, and the site–retention events separately as Bernoulli trials. Two reasons for not pursuing this approach were given earlier in the Chapter.

Extensive analytic and simulation studies on jointly modelling dropout with various longitudinal clinical markers have been made in biostatistics over the past 20 years. The inspiration for this work came from the literature on the joint modelling of viral load, dropout and longitudinal clinical markers measured in HIV clinical trials [Lawrence Gould et al., 2015, Li and Su, 2018, Wu, 2009]. In fact, after transforming the data, Fig 3.2 shows black smoke trajectories that are very similar to the subject–specific dose–response trajectories seen in such longitudinal clinical data. The same philosophy behind jointly modelling informative patient dropout with the process of interest via shared random effects can be applied to spatio–temporal environmental network data with minimal alteration. The major difference with spatio–temporal data is the spatial correlation assumed on the random effects. It is this correlation which allows the spatial extrapolation to occur.

Whilst the case study in this Chapter considered the observations to be on the same time scale as the site–selections, this need not be the case. For example, this general framework could simultaneously model high–frequency (e.g.
hourly) observations with a low–frequency (e.g. annual) site–selection process. This would comprise decomposing the temporal trajectories into trend, seasonal and cyclical (e.g. daily) terms in the model. It would then likely make most sense to include only the trend term in the linear predictor of the site–selection process.

Assuming the locations of the monitoring sites are realisations from an IPP or a LGCP, while being useful computationally, may not always be sensible in certain applications. For example, if a strict lower limit on the distances between the monitoring site locations were known, then a LGCP or an IPP would not be the most suitable model, and alternatives such as a Matérn hard–core point process model would be better suited [Baddeley et al., 2015]. Having said that, a nice property of using our logistic regression approximation to the LGCP is that we are able to delete pseudo–site locations in P2 that violate any known rule (e.g. a minimum distance/hard–core rule). Furthermore, if additional clustering is present, then a cluster point process or Gibbs point process may be more desirable [Baddeley et al., 2015]. Whilst we attempt to adjust for the additional clustering seen in our dataset by constructing a covariate Ii,j, this is by no means the best way forward here.

On a closing note, it should be apparent that the modelling framework introduced in this Chapter can be applied to monitoring data that have come from static monitoring sites, mobile monitoring sites, or a combination of the two. Furthermore, the ability of the joint model framework to adjust for PS under P1 should be greater in applications with mobile monitors. One such study that could be revisited is the MESA Air Study. Since this study involves the estimation of the health effects associated with exposure to various air pollutants, with pollution readings taken from a combination of static and mobile monitoring sites, this data set offers an ideal opportunity to test out this framework.
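As an illustration of deleting rule-violating pseudo–sites, a minimal sketch of a minimum-distance (hard–core) filter is given below; the coordinates and the distance threshold are hypothetical:

```python
import numpy as np

def filter_pseudo_sites(pseudo, observed, d_min):
    """Drop candidate pseudo-site locations lying within d_min of any
    observed site, mimicking a minimum-distance (hard-core) rule."""
    # (n_pseudo, n_obs) matrix of pairwise Euclidean distances
    d = np.linalg.norm(pseudo[:, None, :] - observed[None, :, :], axis=2)
    return pseudo[d.min(axis=1) >= d_min]

observed = np.array([[0.0, 0.0], [10.0, 10.0]])          # monitoring sites
pseudo = np.array([[0.5, 0.5], [5.0, 5.0], [9.8, 9.9]])  # candidate zeros
kept = filter_pseudo_sites(pseudo, observed, d_min=1.0)
# only the candidate at (5, 5) survives the 1-unit hard-core rule
```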
Of interest may be the detection of any PS, and its resulting impact on the estimated health effects.

3.7 Conclusion to Chapter 3

We applied our general framework to the network of air quality monitors in Great Britain between the years 1966-1996. From this, we were able to show that the monitors were preferentially placed within Great Britain throughout the life of the network. In particular, each year the locations of the operational sites were found to have been situated in areas with black smoke levels considered much higher than the annual average level across Great Britain. Furthermore, we showed that the network was updated in a preferential manner throughout the life of the network. Monitoring sites at locations with the highest black smoke levels were favoured for selection into the network each year, and monitoring sites at locations with the lowest black smoke levels were favoured for removal from the network each year.

The implications of this biased network placement were then clearly demonstrated. The PS of the monitoring sites may have had a significant deleterious impact upon the ability of the network to serve its purpose as a tool for measuring the black smoke exposure levels experienced by the population of Great Britain as a whole. It appears that estimates of population exposure levels may have been overestimated (see Fig 3.9). Furthermore, estimates of noncompliance with the various air quality regulations established throughout the chosen time period of 1966-1996 may also have been affected by how and where the monitoring sites were situated. It appears that any estimates of noncompliance that used the observations from the air quality monitoring network may have over–estimated the true amount of noncompliance (see Figures 3.8 and 3.10).
This includes historical estimates of the proportion of the population of Great Britain exposed to black smoke levels that were out of compliance.

Chapter 4

A Perceptron for Detecting the Preferential Sampling of Sites Chosen to Monitor a Spatio-temporal Process

"the priority area is the zone of highest pollution concentration within the region; one or more stations should be located in this area."

— The United States' EPA Monitoring Network Design QA Handbook Vol II Section 6.0, guidelines for selecting the number and locations of air pollution samplers

A preview

The previous chapter developed a general framework that enables PS to be detected in spatio-temporal analyses and subsequently adjusted for when making inference and predictions. The framework also allows for the selection process itself to be emulated. However, the previous Chapter relied on the precise nature of PS being known a priori. In particular, the method requires the selection process to be specified, with PS described in its correct functional form. In practice, the precise nature of the PS may not be known.

Furthermore, fitting the joint model framework introduced in the previous Chapter can be both computationally costly and challenging to implement. These challenges may render the method unsuitable for applied researchers. Consequently, researchers may remain unable to test for the presence of PS in their spatio-temporal datasets. This Chapter focuses on developing an easy-to-implement test for PS in spatio-temporal data.

Importantly, this Chapter relaxes the assumption that the precise nature of PS is known. It is instead replaced with a simpler assumption, namely, that the PS is monotonic. We define this to mean that the density of space-time points chosen to observe the spatio-temporal process, (S,T) ⊂ Ω×T, depends monotonically on the values of the latent spatio-temporally correlated random effects Z used to describe the spatio-temporal process.
This simpler assumption allows a computationally fast and general test for PS to be developed.

We refer to the test as a perceptron as it attempts to capture the numerous factors behind the human decision-making that selected the sampled locations. Importantly, the method can also help with the discovery of a set of informative covariates that can sufficiently control for the PS. The discovery of these covariates can justify the continued use of standard methodologies. The test is applicable to both discrete-space and continuous-space settings and is developed within the STGLMMs framework introduced in Chapter 2. We demonstrate its high power across a range of settings in a thorough simulation study. We then apply it to two real-world datasets.

Throughout this Chapter, we will be using the nearest neighbour (NN) distances between points as a way to measure the local degree of clustering present. In particular, associated with each point or 'site' chosen to observe the process will be the nearest neighbour distance (or averaged K-nearest neighbour distance for some chosen integer K) to the closest point(s) or site(s). This distance value is chosen to capture the local degree of clustering present around each point or site. The toy examples presented earlier in Sections 1 and 2 of Chapter 2 demonstrated a clear characteristic of PS in both the discrete-space and continuous-space settings. When PS is present under a log-Gaussian Cox process, an increased density of space-time sampling locations (or units) is chosen to observe the spatio-temporal process in regions where the spatio-temporal process is high (or low). This suggests that when PS is present under this assumed sampling process, a correlation should exist between the NN distances and the observed response values at each of the sampled locations and times.

To make this point clear, we revisit the toy example of Section 2.2.
For both the PS and MCAR datasets, we compute the NN distances for each of the sampling locations and then compute the Spearman's rank correlation between the field values and the NN values. This is seen in the first two rows of Figure 4.1: a clear negative trend between the NN distances and the field values can be seen when the data are PS. The lowest ranks of the NN distances are seen more frequently in the west of the map, precisely where the field takes the largest values, leading to a moderate negative rank correlation of -0.39 between the field and NN values. Conversely, when the data are not PS, no such trend is seen. The rank correlation seen in the toy example data is close to zero (-0.08). Next, we repeat the process for the PS data, but compute the average K-NN distances from each point for K = 2 and K = 4. The results are seen in the bottom two rows of Figure 4.1. Stronger correlations of -0.61 and -0.79 are seen for K = 2 and K = 4 respectively, demonstrating the possible variance reduction in the ranked NN distances, and hence the possible increased power to detect PS, offered by increasing K in certain settings. The results from this example provide the intuition behind the PS test discussed next.

When the sampling process is assumed to be a different point process (e.g. a hard-core process), the NN distance may be a poor choice to quantify the degree of clustering.
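The computation just described can be sketched in a few lines. The following is an illustrative Python sketch (not the PStestR implementation); the exponential surface standing in for the field and the thinning-based sampler are made-up examples of positive PS:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

def mean_knn_distance(locs, K=1):
    """Mean distance from each point to its K nearest neighbours."""
    tree = cKDTree(locs)
    # query K+1 neighbours: the nearest 'neighbour' is the point itself
    d, _ = tree.query(locs, k=K + 1)
    return d[:, 1:].mean(axis=1)

# Toy stand-in for the field: high in the west (small x), low in the east
field = lambda xy: np.exp(-3 * xy[:, 0])

# Preferential sampling: retain candidate points with probability
# proportional to the field, so sites cluster where the field is high
cand = rng.uniform(0, 1, size=(4000, 2))
ps_locs = cand[rng.uniform(size=4000) < field(cand)][:100]

nn = mean_knn_distance(ps_locs, K=2)
rho, _ = spearmanr(field(ps_locs), nn)
print(round(rho, 2))  # negative under positive PS
```

Under positive PS the sampled sites are densest where the field is high, so the mean K-NN distances there are small and the rank correlation is negative, mirroring the toy example above.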
We consider the benefits of using alternative measures of clustering in this Chapter.

Figure 4.1: The top left plot shows the 24 PS sampling locations. The Gaussian process is shown in colour, with blue (red) representing the lowest (highest) values. Sampling locations are labelled with their ranked NN distance. The largest ranks are seen in the regions with the lowest field values. The arrow shows the NN distance from the site labelled '24'. The top right plot repeats the above for the 27 'MCAR' sampling locations. No trend can be seen. The center-left and center-right plots show the field ranks plotted against the NN distance ranks for the PS and MCAR datasets respectively. The black line demonstrates a rank correlation of -1. Rank correlations of -0.39 and -0.08 are found in the PS and MCAR data respectively. The bottom two plots show that these correlations increase for the PS data to -0.61 and -0.79 when the ranks of the average of the 2 and 4 NN distances are computed for each sampling location.

4.1 Introduction to Chapter 4

This Chapter concerns preferential sampling (PS), where the locations selected to monitor a spatio–temporal process µs,t, s ∈ Ω, t ∈ T, depend stochastically on the process they are measuring. PS is a special case of response–biased sampling. A space–time point is defined as (s, t) ∈ Ω×T. Ω denotes the spatial domain of interest and T denotes the temporal domain. Purely spatial processes (i.e.
when T is a singleton) are a special case.

To gain understanding of the process µs,t, a set of time points T ⊂ T at which to observe µs,t is selected. Then, for each t ∈ T, a set of nt sampling locations St ⊂ Ω is chosen. Generally, the temporal domain T is a finite set, with µ a time–averaged quantity for practical reasons. Typically, µs,t is not observed directly; instead, a noisy observation Ys,t is taken. The noise could be due to the presence of measurement error (i.e. the nugget effect) or other factors. St may represent a set of points in space (i.e. St = (si ∈ Ω), i = 1, ..., nt), or a set of well-defined areal units (i.e. St = (Ai ⊂ Ω), i = 1, ..., nt). In this Chapter, these two cases are referred to as the geostatistical and discrete spatial settings respectively. In the latter case, observations will generally represent spatial averages of µs,t.

Difficulties arise with the estimation of µs,t when the St are preferentially selected. This is because most statistical methods for modelling spatio-temporal data condition on the locations as fixed [Cressie and Wikle, 2015, Diggle et al., 2010]. Such models assume that locations were selected under complete spatial randomness. Departures from this assumption in the true sampling scheme can lead to large biases in the prediction of the process µs,t (see Chapters 2 and 3). Hereafter, models that treat the locations as fixed are referred to as 'naive'.

Additionally, a set of covariates Xs,t may exist that influence the choice of sampling locations St. These covariates may also be associated with the underlying process being modelled, µs,t. When this occurs, including the necessary Xs,t in a 'naive' regression model for µs,t may partially remove the deleterious effects of PS on the spatio-temporal prediction of µs,t [Gelfand et al., 2012]. This becomes a regression-adjustment approach to help correct for the departure from a complete spatial randomness sampling design for St.
Covariates common to both the sampling process and µs,t are hereafter referred to as 'informative', as in Gelfand et al. [2012].

Preferential sampling has been identified as a major concern across multiple fields. In ecology, PS may occur due to sightings data being comprised of opportunistic sightings or poorly-designed surveys (see Chapter 5). Observers frequently focus their efforts in areas where they expect to find the species, leading to PS [Fithian et al., 2015]. A consequence of this is that 'naive' estimates of the geographical distribution of a species may be severely biased. Estimates of a species' abundance have also been shown to be affected [Pennino et al., 2019]. PS should also be considered in the analysis of environmental data recorded from tagged animals. Dinsdale et al. [2019] demonstrated this with a case study using sea surface temperature recordings from tags attached to elephant seals in the Southern Indian Ocean. The seals' preference for cooler waters led to biased 'naive' spatial estimates of sea surface temperature.

In environmental statistics, the deleterious impacts of PS have been highlighted. For example, pollution concentration levels throughout Ω and T are commonly estimated using noisy observations, Ys,t, recorded from environmental monitoring networks [Shaddick et al., 2018]. Here, the locations of the monitors in a network, St, may have been chosen in a preferential way to meet specified objectives [Schumacher and Zidek, 1993]. For example, urban air pollution monitoring sites are sometimes used for detecting noncompliance with air quality standards [EPA, 2005, Loperfido and Guttorp, 2008]. In such settings, observations Ys,t will likely lead to overestimates of the overall levels of the air pollutant, µs,t, throughout Ω and T.
These biased 'naive' estimates µ̂s,t may then be unsuitable for assessing the impacts of µs,t on human health and welfare [Lee et al., 2015].

Previous PS tests have been developed for continuous spatial data, but limitations hinder their general use. Firstly, Schlather et al. [2004] developed two Monte Carlo tests. Their null hypothesis assumes that the data are a realization of a random-field model. They assume that the sampled point locations St are a realization of a point process P on Ω, that the recorded values (called marks) of the points are the values of a realisation of a random field µs,t on Ω, and that P and µs,t are independent processes. Independence here implies a non-preferential sampling mechanism. To detect departures from the null hypothesis, the authors define two characteristics of marked point processes, denoted E(d) and V(d). These represent respectively the conditional expectation and conditional variance of a mark, given that there exists another point of the process at a distance d. These are chosen since under the null hypothesis E and V should be constant. Monte Carlo tests are used to assess departures of estimates of E and V from a constant function. This approach requires the assumption of Gaussian observations and hence does not generalize to non-continuous marks.

Next, Guan and Afshartous [2007] developed an alternative simulation-free test for PS. Instead of fitting a parametric model for the marks, their approach divides the region Ω into non-overlapping subregions. These are assumed to be approximately independent, generating approximately IID replicates of the test statistic. The spatial range of µs,t can be thought of as representing the inter-point distance required for two observations of µs,t to be approximately independent. Finding a suitable set of subregions required for their test may prove a challenge when the spatial range of the correlation of µs,t is large relative to the size of Ω.
Furthermore, this test requires very large sample sizes; their application used a sample size of over 4000.

For modelling PS directly, it is common to take a model-based approach. Approaches often simultaneously fit a model for the observation process, Ys,t, with a model for the sampling process, P, within a joint-model framework [Diggle et al., 2010]. Linear combinations of any spatio-temporal latent effects used to describe µs,t are shared across the linear predictors of the two processes. This sharing of latent effects helps to capture any stochastic dependence that may exist between the two processes. A nonzero effect estimate for any of these linear combinations provides evidence that PS is present (see Chapter 3). Whilst this approach has been successfully applied to mitigate PS, its use as a test for PS may be out of reach of many researchers. These joint models are currently not implemented in many popular software packages, are computationally intensive to fit and can be difficult to design and interpret. Note that design-based approaches have also been introduced for specific scenarios [Zidek et al., 2014]. Hereafter, the collection of spatio-temporal latent effects is denoted Zs,t.

Due to the computational challenges of fitting joint models and the lack of generality of the current PS tests, PS appears to be often overlooked. Researchers may have non-Gaussian data, or too small a sample size to perform either test. Consequently, without the ability to test for PS, researchers may fit 'naive' models to preferentially sampled data. The potential consequences of PS on their inferences may then be ignored. Fortunately, in many situations, a sufficient set of informative covariates Xs,t may be available. The decision of where to sample is typically a human one, and hence Xs,t may include measures of accessibility (e.g. distance from the nearest road or population center).
These are often associated with the underlying process being measured and thus may help to control for the PS. Verifying the existence of such Xs,t would allow researchers to confidently continue to use their preferred methodologies and packages, without the need to fit joint models [Fithian et al., 2015].

This Chapter presents a computationally fast method for detecting PS. The algorithm for implementing the test is both intuitive and easy to program. The method primarily requires that the researcher be able to predict the values of µs,t, and any latent spatio-temporal effect Zs,t, throughout Ω and T. Any preferred 'naive' method can be used. The method is general in that it can test for PS in both the geostatistical and discrete spatial settings, and can be used when the responses (marks) are non-Gaussian and even non-continuous. A general algorithm is provided for all settings. The test can also be adjusted for covariates, allowing researchers to discover whether a given Xs,t is sufficient for controlling the PS.

Qualitatively, PS has a clear appearance in continuous spatial data. PS often appears as a clustering of locations chosen to observe µs,t in regions where one or more components of Zs,t are either high or low. The test in this Chapter directly targets this excess clustering. In the continuous spatial setting, a suitable point process is fit to the observed locations to capture the true sampling process under the null hypothesis of no PS. Then, Monte Carlo (MC) realisations of the point process under the null are generated. The magnitude of correlation between the degree of clustering and the estimated values of Zs,t is computed for both the observed data and the MC realisations. If a stronger correlation is seen in the observed data than in the MC samples, then evidence for PS has been found.
The mean of the K nearest neighbour distances is our default recommendation for capturing the degree of clustering, as this quantity may also be used in the discrete spatial setting. In the discrete spatial setting, a Bernoulli sampling process is instead fit to a population of well-defined areal units under the null. A clustering of areal units chosen to observe µs,t in regions where Zs,t is either high or low indicates PS.

The Chapter is organized as follows. Section 2 introduces the assumed marked point process data-generating mechanism for the geostatistical setting. Then, the perceptron algorithm and properties of the PS test are described. Section 3 repeats the above for the discrete data setting. Section 4 demonstrates the power of the test to detect PS in a thorough simulation study. The joint effects of the sample size, the spatial smoothness of Zs,t, the spatio-temporal covariates Xs,t and the magnitude of PS on the power of the test are discussed. Section 5 applies the test to two real datasets previously analysed in the literature. The PS test can be performed using the R package PStestR that we developed, now available on GitHub.

4.2 Preferential sampling in geostatistical data

In continuous spatio-temporal settings, observations Ys,t are taken at a set of point locations St within the study region Ω at each time step t ∈ T ⊂ T. Standard approaches for modelling µs,t from a set of observations Ys,t include variogram analysis and kriging-based methods [Diggle and Ribeiro, 2007]. These methods fall under the umbrella term of "geostatistical methods" and require the assumption that the locations chosen to observe the process µs,t were not preferentially sampled [Diggle et al., 2010].

For modelling point-patterns in space and time, spatio-temporal point processes are the standard statistical toolbox [Baddeley et al., 2015, Illian et al., 2008]. This class of models will be used throughout this Chapter to explain the observed point-patterns St through time.
Standard 'naive' geostatistical methods require the assumption that the sampling process P generating the sampled locations St be independent of the underlying spatio-temporal field µs,t. This assumption implies no PS and simplifies the analysis greatly. Here, the point-pattern St and the marks Ys,t may be investigated separately using standard techniques. However, when this assumption is violated, the two processes must be considered together. Marked spatio-temporal point processes should be considered as a formal framework for such a data analysis [Schlather et al., 2004].

The PS test we are about to describe requires the following three assumptions. The final assumption describes the assumed characteristic behaviour of the PS.

Assumption 4.1 The PS is driven by one or more spatio-temporal latent effects in Zs,t.

Assumption 4.2 All of the latent effects within Zs,t that drive the PS are spatially 'smooth enough' relative to both the size of the study region |Ω| and the number of locations chosen to sample the process |St|.

Assumption 4.3 The density of points within St at space-time point (s, t) ∈ Ω×T depends monotonically on the values of the components of Zs,t driving the PS.

Assumptions 4.1 - 4.3 imply that preferentially sampled data will appear as point-patterns St that are clustered in space for each t ∈ T. These clusters will focus around regions where relevant elements of Zs,t are especially high or low, depending on the direction of PS. The first inferential goal becomes the detection of monotonic associations between the degree of clustering throughout Ω×T and the values of the relevant Zs,t. If PS is detected, then the second inferential objective becomes the determination of whether or not the clustering can be explained by a set of informative covariates Xs,t.
That objective is achieved if such a set removes all of the PS-associations.

The ranked nearest neighbour distances between the sampling locations St are proposed as a default choice to measure the magnitude of clustering. Following the recommendations of Gignoux et al. [1999], edge-corrections for these distances are not considered within the Monte Carlo algorithm. Many other quantities can be chosen to capture local clustering and may be better suited to specific St-generating mechanisms. The PS test developed in this Chapter can easily be modified to use another quantity. The ranked nearest neighbour quantity is chosen for its generalisability across both discrete and continuous spatial settings.

4.2.1 Assumed model for preferential sampling

Many spatio-temporal point processes have been developed, with each possessing fundamentally different properties. An appropriate choice for a given analysis depends upon the sampling protocols that generated St. For example, Gibbs point processes allow for second-order effects such as inter-point attraction and repulsion to exist between points. A limiting case is seen in the hard-core process. This process does not allow points to exist within a distance R, called the 'range of interaction'. Cluster processes provide a class of point processes that describe the locations of 'parent' points with a separate process from their 'daughter' points [Baddeley et al., 2015]. Many more processes exist and may prove useful in applications.

The simplest class of spatio-temporally varying point processes is the inhomogeneous Poisson process (IPP hereafter) [Illian et al., 2008]. The IPP is completely defined by its intensity function λ(s, t). This is defined as the expected number of points per unit area and time immediately around (s, t) ∈ Ω×T. More formally, the intensity can be defined as:

λ(s, t) = lim_{ε→0} E[N(B_ε(s, t))] / |B_ε(s, t)|,    (4.1)

where the cube B_ε(s, t) = (s, s + ε1) × (t, t + ε).

Let Ω ⊂ R².
Define two disjoint space-time volumes (A1, T1), (A2, T2) ⊂ (Ω×T). Then the numbers of points that fall within the two space-time volumes, N(Ai, Ti), are independently Poisson distributed random variables with means:

Λ(Ai, Ti) = ∫_{Ai} ∫_{Ti} λ(s, t) dt ds.    (4.2)

Locally-integrable random fields Zs,t can be added to any linear predictor used to model the natural logarithm of λ(s, t). The point process is then said to be a Cox process driven by Z. If Zs,t is a Gaussian process, then λ(s, t) becomes a log-Gaussian random field and the process becomes known as a log-Gaussian Cox process (LGCP hereafter) [Simpson et al., 2016]. LGCP models are especially useful for modelling point-patterns when residual spatio-temporal correlations are expected to remain in the intensity, even after including any available covariates. In this case, the Gaussian process Zs,t is given a spatio-temporal correlation structure. Linear combinations of multiple Gaussian processes may also be included within a LGCP. We denote a set of Gaussian processes as Zs,t.

A base Cox process model is now introduced for describing the sampling process of St in many geostatistical settings. This model is very general. To ease the notational burden, only one latent effect is considered, denoted Zs,t. Furthermore, for the remainder of the Chapter Zs,t is assumed to be a Gaussian process, although this constraint can be relaxed. Note that when the sampling process generating St deviates from the assumptions underlying the conditional Poisson process, other point processes should be considered. Details of other processes are found in Baddeley et al. [2015], Illian et al. [2008].

With a slight change of notation, removing the subscripts to improve readability, let Y(s, t) denote the observation process at location s ∈ Ω and time t ∈ T. This may be of any type (e.g. continuous, count, binary, etc.). Let µ(s, t) denote the target spatio-temporal process and let Z(s, t) denote a spatio-temporal latent Gaussian random field.
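An IPP with a bounded intensity, as defined by (4.1)-(4.2), can be simulated by independent thinning of a homogeneous Poisson process (Lewis-Shedler thinning). The sketch below is illustrative only; the intensity function on the unit square is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(7)
lam_max = 400.0
# Hypothetical bounded intensity: highest in the west (small x)
intensity = lambda xy: lam_max * np.exp(-2 * xy[:, 0])

# Step 1: homogeneous Poisson process with rate lam_max on [0,1]^2
n = rng.poisson(lam_max)
pts = rng.uniform(0, 1, size=(n, 2))

# Step 2: independent thinning, retaining each point with
# probability intensity(s)/lam_max, yields an IPP with the target intensity
kept = pts[rng.uniform(size=n) < intensity(pts) / lam_max]

# E[N] = integral of the intensity = lam_max * (1 - e^{-2}) / 2
print(len(kept))
```

The retained count is Poisson with mean Λ(Ω) ≈ 173 here, and the retained points concentrate where the intensity is high, consistent with (4.2).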
As before, let St denote the collection of sampled points at time t ∈ T ⊂ T. The following data-generating mechanism is now assumed:

[Y(s, t) | s ∈ St, Z(s, t)] ∼ f(µ(s, t), θ)    (4.3)
[St | Z(s, t)] ∼ IPP(λ(s, t))    (4.4)
g(µ(s, t)) = β0 + βᵀx(s, t) + Z(s, t)    (4.5)
log(λ(s, t)) = αᵀw(s, t) + δ(x(s, t)) + h(Z(s, t))    (4.6)
[Z(s, t)] ∼ GP(0, Σ).    (4.7)

Square brackets denote random variables. In equation (4.3), f represents the conditional probability distribution of Y(s, t), given the target latent spatio-temporal effect Z(s, t), and given that the location was sampled at time t (i.e. s ∈ St). Values of Y(s, t) are missing at all non-sampled locations. The link function g describes the relationship between the linear predictor and the target spatio-temporal process µ(s, t). Thus, the model contains the popular class of STGLMMs [Diggle and Ribeiro, 2007]. The regression equation for µ(s, t) is specified in (4.5), with fixed covariates x(s, t).

In equation (4.4), the sampling process P is modeled as a Cox process. When h is linear, the process is a LGCP with mean function m(s, t) = αᵀw(s, t) + δ(x(s, t)). Note that, conditional on Z(s, t), St is assumed to be a realisation from an IPP. Unique fixed covariates w(s, t) and shared covariates x(s, t) both describe the intensity of the conditional IPP. The shared covariates are transformed by functions δ that may be nonlinear. The h function need not be linear; however, when it is not, P is no longer a LGCP. In any case, h specifies the nature of the PS.

The PS test developed in this Chapter is highly general. When h is strictly monotonic, the primary goal of the test is to detect the monotonicity of h. The precise form of h does not require specification. Suppose instead that h ≡ 0 and at least one element of δ is non-zero. The subset of covariates x(s, t) corresponding to these nonzero elements provides a sufficient set of informative covariates required to control for the PS. The second goal of the test is to correctly identify this subset of covariates.
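To make the mechanism concrete, the following illustrative sketch (not from the dissertation) generates data from (4.3)-(4.7) in the simplest case: identity link g, no covariates, Gaussian f, and linear PS h(Z) = cZ. A fixed smooth surface stands in for the Gaussian process Z so the example stays dependency-free:

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth stand-in for the latent Gaussian process Z on [0,1]^2
Z = lambda xy: np.sin(2 * np.pi * xy[:, 0]) * np.cos(np.pi * xy[:, 1])

c, lam0 = 2.0, 50.0          # PS strength h(Z) = c*Z; baseline intensity
lam = lambda xy: lam0 * np.exp(c * Z(xy))
lam_max = lam0 * np.exp(c)   # bound used for thinning (max Z = 1)

# Sampling process (4.4): conditional on Z, an IPP simulated by thinning
n = rng.poisson(lam_max)
cand = rng.uniform(0, 1, size=(n, 2))
S = cand[rng.uniform(size=n) < lam(cand) / lam_max]

# Observation process (4.3)/(4.5): Y(s) = beta0 + Z(s) + measurement error
beta0, sigma = 1.0, 0.1
Y = beta0 + Z(S) + sigma * rng.normal(size=len(S))

# Positive PS: the sampled sites sit where Z is high,
# so the site-average of Z exceeds the domain average (zero here)
print(round(Z(S).mean(), 2))
```

The site-average of Z being well above its domain average is exactly the bias that 'naive' methods, which condition on S as fixed, inherit.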
Note that the covariance matrix of the vector of Z(s, t) values evaluated at St is denoted Σ. This also requires estimation. Finally, θ are hyperparameters to be estimated in the model. Both Σ and θ may be estimated using a maximum likelihood approach or, given prior distributions, estimated under a Bayesian approach.

Including a sufficient set of informative covariates in a model for µ(s, t) should help to improve the prediction of µ(s, t) across Ω and T, by reducing the deleterious impacts of PS on spatial prediction. However, it must be stressed that this approach is not a silver bullet; if the data were instead collected via a complete spatial randomness sampling design, then the predictive accuracy of the fitted model would likely be improved [Gelfand et al., 2012]. Thus, these methods should be viewed only as a partial remedy for badly sampled data rather than as a justification for ignoring the need for good spatial design of networks and surveys.

Finally, the conditional likelihood of the IPP given Z(s, t) is:

π(St | Z(s, t)) = exp{ |Ω||T| − ∫_Ω ∫_T λ(s, t) dt ds } ∏_{si ∈ St, t ∈ T} λ(si, t),    (4.8)

with |Ω| being the area of the domain Ω and |T| being the length of the time set [Simpson et al., 2016].

4.2.2 Perceptron algorithm

Assume the above data-generating mechanism. A Monte Carlo algorithm is now designed for testing the null hypothesis that h ≡ 0, versus the alternative hypothesis that h is a monotonic function of Z. Under the null hypothesis h ≡ 0, the observation and sampling processes are conditionally independent given x(s, t). Thus, given x(s, t), no associations are expected to exist between computable quantities from the fitted (null) IPP and estimates of Z(s, t).

Conversely, suppose that the null hypothesis is false. Specifically, let h be a monotonic increasing function of Z. Point-patterns St from this data-generating mechanism are expected to exhibit an excess of clustering in regions of high Z(s, t), relative to that explained by the null model.
This phenomenon is referred to as positive PS. Here, a positive association between the localized amount of clustering and the estimated Z(s, t) values would be expected. The converse holds when h is a decreasing function.

The primary challenge is defining what constitutes a 'strong' association between estimates of Z(s, t) and the computed quantities used to capture excess localized clustering. Positive spatio-temporal correlations are present in µ(s, t) due to Z(s, t). This leads to non-standard sampling distributions for test statistics computed to capture association. Standard hypothesis tests of association (e.g. t-tests, rank-correlation tests, etc.) will have a type 1 error above the specified level due to the positive correlations.

This is why Monte Carlo methods are used. An empirical p-value associated with any desired test statistic can be computed by sampling realisations from the assumed IPP under the null hypothesis (i.e. fixing h ≡ 0). The procedure generalizes to any given dataset. Crucially, it accounts for the nonstandard sampling distribution of the chosen test statistic in a natural way. The mean of the K nearest neighbour distances from each observed point is our default choice of computable quantity. Small values of this quantity within a region indicate the presence of clustering there. When K = 1, this reduces to the nearest neighbour distance. For the default choice of test statistic, the Spearman's rank correlation coefficient between estimates of Z(s, t) at locations s ∈ St and the mean nearest neighbour distances is proposed. This is specifically chosen to capture the degree of monotonicity of h.

We now state the probability distribution function of the (spatial) distance from any point of St to its nearest neighbouring point for the IPP data model (when h ≡ 0). Let (s, t) ∈ Ω×T define a reference space-time point and let T be the time interval (t, tT) of interest. Next, define b(s, r) as the ball of radius r centered at s.
Let λIPP(s, t) once again denote the intensity function for the assumed IPP model under the null hypothesis h ≡ 0.

Theorem 4.1 Assuming h ≡ 0 and the above data-generating mechanism, the probability that the nearest point of Sτ from (s, t) lies within a spatial distance r, at some time τ ∈ T, is [Matern, 1971]:

1 − exp( −∫_{b(s,r)} ∫_T λIPP(ω, τ) dτ dω ) = 1 − exp{ −Λ(b(s, r), T) }.    (4.9)

Equation (4.9) gives us the following intuitive result when h ≡ 0. The expected nearest neighbour distances are lower in regions of high intensity (i.e. where λIPP(s, t) is high). This is to be expected: the intensity function at (s, t) precisely defines the expected density of points immediately around (s, t).

The result can also be derived under the alternative hypothesis when h is linear. When h(Z) = cZ with c ≠ 0, the point process is a LGCP. Coeurjolly et al. [2017] derived the Palm distribution for LGCPs. The authors showed the remarkable result that, conditional on a single point of St lying at location (s, t) ∈ Ω×T, the remaining points of St are also a LGCP. By the authors' result, this conditional process differs only in the mean function, with the covariance function remaining the same. The updated mean function describing this process is denoted µ(s,t)(·, ·). This can be thought of as the mean function of the LGCP, conditional on (s, t) ∈ St.

Theorem 4.2 Assume that h is linear and assume the above data-generating mechanism.
With Σ(·, ·) denoting the covariance function, the mean func-tion of the LGCP conditioned on (s, t) ∈ St µ(s,t)(·, ·) at (ω, τ) is:µ(s,t)(ω, τ) =αTw(ω, τ) +δ(x(ω, τ)) + c2Σ(s−ω, t− τ) (4.10)= µ(ω, τ) + c2Σ(s−ω, t− τ).Then, using the law of total expectation, the probability that the nearestpoint from (s, t), lies within distance r, at some time τ ∈ T is:1071−EZ[exp(−∫b(s,r)∫TλIPP(ω, τ){exp[cZ(ω, τ) + c2Σ(s−ω, t− τ)]}dτdω)].(4.11)Thus, by conditioning on a point of the process at location (s, t), theonly change to the mean function of the LGCP is the addition of the termc2Σ(s−ω, t− τ). This is the covariance function between the conditioningpoint (s, t) and the space-time point (ω, τ), scaled by the squared linearcoefficient c2. If we make the following additional assumptions, then theinterpretation of the (4.11) simplifies greatly.Assumption 4.4 The covariance is strictly non-negative, i.e. Σ(·, ·)≥ 0Assumption 4.5 Z is stationary and isotropic: Σ(s−ω, t−τ) = σ2ZR(||s−ω||, ||t− τ ||)Assumption 4.6 The correlation function decays monotonically as the dis-tance from the conditioning point s increases: σ2ZR(||s−ω||, ||t− τ ||) ≥σ2ZR(||s−ω||+ δ, ||t− τ ||) ∀δ > 0Assumption 4.7 The correlation function decays monotonically as the dis-tance from the conditioning time t increases: σ2ZR(||s−ω||, ||t−τ ||)≥σ2ZR(||s−ω||, ||t− τ ||+ δ) ∀δ > 0Under assumption 4.4, conditioning on point (s, t) ∈ St will always in-crease the mean function and hence the intensity immediately around the(s, t). Contrast this with the IPP. Here, the knowledge of a point of Stexisting at (s, t) does not affect the intensity immediately around (s, t).Assumptions 4.4 - 4.7 are commonly made in practice. They imply thatthe latent process will be expected to be more similar at two space-timelocations that are ‘close together’ than two space-time locations that are‘far apart’. 
Popular choices of correlation function include the Matérn correlation function across space [Diggle and Ribeiro, 2007] and autoregressive correlation functions across time. Despite being unrealistic for most environmental processes [Cressie and Huang, 1999], spatio-temporal correlation functions are often defined as the products of these spatial and temporal functions for computational simplicity [Blangiardo and Cameletti, 2015]. Under these models, the correlation often becomes negligible at spatial (temporal) distances greater than some value, often called the spatial (temporal) range.

Under Assumptions 4.4 - 4.7 and assuming h(Z) = cZ with c > 0, equation (4.11) helps to explain the suitability of our choice of nearest neighbour distance for capturing the excess clustering. Suppose the conditioning point (s, t) is in a region where the latent effect is above average (i.e. Z(s, t) > 0). Here, the expected nearest neighbour distance from (s, t) decreases monotonically, relative to that expected from the IPP, as: i) Z increases, ii) the correlation function R(·, ·) increases, and iii) c increases. When c = 0, (4.11) equals (4.9), as expected.

Thus we have shown the result we wanted. Under Assumptions 4.4 - 4.7, with h a linear function with either a positive or negative slope, a monotonic association is expected between the nearest neighbour distances of the observed points (s, t) ∈ St and the values of Z at St. This monotonic association should be captured by our rank correlation test statistic when K = 1. Conversely, when h ≡ 0, no association is expected, so long as the fitted null IPP is correctly specified. In this case, the observed test statistic should be no more extreme than the Monte Carlo realisations. The test generalizes to nonlinear monotonic functions h, although the nearest neighbour probability function will then not take the form (4.11).
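The baseline null result (4.9) is easy to check numerically in its simplest special case: a homogeneous Poisson process on the unit square, a single time point, and a reference point far enough from the boundary that the ball b(s, r) lies inside the region, where (4.9) reduces to 1 − exp(−λπr²). The dependency-free sketch below is illustrative only; the function names and parameter values are our own choices, not part of the methodology.

```python
import math
import random

def sample_hpp(lam, rng):
    """One realisation of a homogeneous Poisson process on the unit square.
    The point count is drawn with Knuth's multiplicative method."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            break
        k += 1
    return [(rng.random(), rng.random()) for _ in range(k)]

def nearest_dist(pts, s):
    """Distance from the reference point s to its nearest neighbour in pts."""
    return min((math.hypot(x - s[0], y - s[1]) for x, y in pts),
               default=float("inf"))

lam, r, s = 50.0, 0.1, (0.5, 0.5)       # b(s, r) lies fully inside the square
rng = random.Random(0)
reps = 4000
emp = sum(nearest_dist(sample_hpp(lam, rng), s) <= r for _ in range(reps)) / reps
theory = 1.0 - math.exp(-lam * math.pi * r * r)   # eq. (4.9), constant intensity
```

The empirical proportion `emp` agrees with `theory` (about 0.79 here) up to Monte Carlo error, illustrating that nearest neighbour distances concentrate exactly as the integrated intensity dictates.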
This does not cause problems for the perceptron algorithm, since the Monte Carlo simulations can still be performed.

Defining the test performed by the perceptron algorithm requires some additional notation. Let T be a finite set of time intervals. Let nt denote the observed number of points si,t ∈ St ⊂ Ω at time t ∈ T. Let Ni,t(K) denote the set of K nearest indices from each point si,t, i ∈ {1, ..., nt}. Let Ẑ(s, t) denote the estimate of Z(s, t). The superscript m on each of these quantities indexes the Monte Carlo sample, m ∈ {1, ..., M}. Thus N^m_i,t(K), i ∈ {1, ..., n^m_t}, m ∈ {1, ..., M}, denotes the set of K nearest indices from point s^m_i,t in the mth Monte Carlo sample S^m_t. Finally, let D̄i,t(K) and D̄^m_i,t(K) denote the mean of the distances to the K nearest points from point i at time t in the original data and in the mth Monte Carlo sampled point-pattern respectively. Thus:

    D̄^m_i,t(K) = (1/K) Σ_{j ∈ N^m_i,t(K)} ||s^m_i,t − s^m_j,t||.    (4.12)

The terms NNk,t and NN^m_k,t are defined to be the vectors of length nt and n^m_t containing the values of D̄i,t(K) and D̄^m_i,t(K) respectively. When calculating (4.12) for the original dataset, simply drop the m superscripts.

Algorithm 1: Perceptron NN test for PS in geostatistical data
  Data: Observations y(s, t) for (s, t) ∈ (St × T) ⊂ (Ω × T);
        covariates {w(s, t), x(s, t)} for (s, t) ∈ (Ω × T)
  Result: Empirical p-value for the test h ≡ 0 vs. h monotonic
  begin
    Fit a model for (4.3) using a preferred method
    Produce estimates Ẑ(s, t) throughout Ω, T
    Compute the NNk,t values D̄i,t(K)
    Evaluate Ẑ(s, t) at the locations St
    Compute the rank correlations ρt between Ẑ(si, t) and D̄i,t(K)
    Fit the chosen point process model with h ≡ 0 in (4.6)
    Fix m = 1
    while m ≤ M do
      Sample n^m_t locations S^m_t from the fitted model for t ∈ T
      Compute the NNk,t values D̄^m_i,t(K)
      Compute Ẑ(s, t) at the locations S^m_t
      Compute the rank correlations ρ^m_t between Ẑ(s^m_i, t) and D̄^m_i,t(K)
      if m = M then
        return the empirical p-values of either pointwise or rank envelope tests using ρt and ρ^m_t
      else
        m ← m + 1
      end
    end
  end

The perceptron NN algorithm, referred to hereafter as the NN test, is defined above in Algorithm 1. It can be summarized as follows. First, fit the assumed models (4.3) and (4.4) for both Y(s, t) and St. Next, estimate Ẑ(s, t) throughout Ω × T and compute the averaged K nearest neighbour distances NNk,t. Using the estimates Ẑ(s, t) at St and NNk,t, compute the Spearman's rank correlation coefficient ρt between them for each t ∈ T. This is the observed test statistic.

Next, sample M realisations S^m_t, m ∈ {1, ..., M}, from the fitted point process model (4.4). For each of the M realisations, repeat the procedure: compute the distances NN^m_k,t and estimate the values Ẑ(s, t) at S^m_t to obtain ρ^m_t, m ∈ {1, ..., M}. Finally, compute the desired empirical p-value. For the pointwise tests, simply evaluate the proportion of the Monte Carlo-sampled ρ^m_t that are more extreme than ρt. For Monte Carlo envelope tests, which do not suffer from the problems of multiple testing, refer to Mrkvička et al. [2017] and Myllymäki et al. [2017].

4.2.3 Discussion

The values of the latent field Z are not known and must be estimated. The power of the test to detect PS may therefore depend upon the suitability of the method used to produce the estimates Ẑ(s, t). Likelihood-based approaches for fitting the above observation model (i.e. components (4.3), (4.4) and (4.7)) have been shown to have many nice properties.
Asymptotic consistency has been proven under the null hypothesis h ≡ 0 for certain choices of f, including Bernoulli [Ghosal et al., 2006] and Gaussian [Choi and Schervish, 2007] likelihoods. Such results help to justify the suitability of the method. Furthermore, under the null hypothesis h ≡ 0, some 'naive' approaches even produce unbiased estimates of Z(s, t). For example, under the above data-generating mechanism, with f the normal distribution and g the identity function, Gaussian process regression gives the best linear unbiased predictor of Z(s, t) [Cressie, 1992].

Other choices of computable quantity for capturing spatial clustering can be made, and these may be more suitable for certain data-generating mechanisms of St. However, few choices are as generalisable across both continuous and discrete spatial data. For continuous spatial data, smoothed estimates of the residual measure from the fitted (null) point process, evaluated at the points s ∈ St, may be suitable. However, this approach depends upon two tuning parameters: the discretisation method chosen to approximate the likelihood and the bandwidth used to smooth the estimated values.

The nearest neighbour method does not suffer these drawbacks, and has many desirable properties in addition to its generalisability across both continuous and discrete spatial settings. For example, different choices of K can lead to improved power to detect PS under different sampling processes. When the spatial scales of the clusters within St are very small, and hence clustering is localized to only a few points per cluster, smaller K may improve power, because larger values of K may 'smooth over' any clustering. Conversely, when clusters are large in spatial scale, with each cluster comprising several points, power may be improved with larger choices of K: the additional smoothing can reduce the variance of the computed test statistic (see Figure 4.1).
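The main loop of Algorithm 1 can be sketched compactly. The sketch below is illustrative only: it is plain Python rather than the spatstat-based implementation, the smooth surface standing in for Ẑ, the uniform null sampler, and all function names are stand-ins, and ties in the rank correlation are ignored for brevity.

```python
import math
import random

def mean_knn_dist(pts, K):
    """Mean distance from each point to its K nearest neighbours (eq. 4.12)."""
    out = []
    for i, (xi, yi) in enumerate(pts):
        d = sorted(math.hypot(xi - xj, yi - yj)
                   for j, (xj, yj) in enumerate(pts) if j != i)
        out.append(sum(d[:K]) / K)
    return out

def spearman(a, b):
    """Spearman rank correlation (no tie correction, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    ra, rb = ranks(a), ranks(b)
    m = sum(ra) / len(ra)                     # both rank vectors share this mean
    cov = sum((x - m) * (y - m) for x, y in zip(ra, rb))
    var = sum((x - m) ** 2 for x in ra)       # equal variances when there are no ties
    return cov / var

def nn_test(pts, zhat, K, M, null_sampler, rng):
    """Pointwise empirical p-value against positive PS: is the observed
    correlation between zhat and mean K-NN distance unusually negative?"""
    rho_obs = spearman([zhat(x, y) for x, y in pts], mean_knn_dist(pts, K))
    n_extreme = 0
    for _ in range(M):
        sim = null_sampler(len(pts), rng)     # realisation under h == 0
        rho_m = spearman([zhat(x, y) for x, y in sim], mean_knn_dist(sim, K))
        if rho_m <= rho_obs:
            n_extreme += 1
    return (1 + n_extreme) / (M + 1)

# Toy usage: a smooth latent surface and a uniform (h == 0) null sampler.
zhat = lambda x, y: x + y
uniform = lambda n, rng: [(rng.random(), rng.random()) for _ in range(n)]
rng = random.Random(1)
pts = uniform(60, rng)                        # here the data obey the null
p = nn_test(pts, zhat, K=3, M=99, null_sampler=uniform, rng=rng)
```

With the add-one correction used here, M = 19 already gives an exact 5% level test: the null is rejected only when the observed ρ is the most extreme of the 20 values.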
Another beneficial property of the nearest neighbour quantity is that the distances can be computed exactly; their values do not depend upon any choice of computational approximation.

In some applications it may be suitable to fix the sample size across the Monte Carlo samples (i.e. n^m_t ≡ nt). For example, regulatory standards may dictate the required number of samples nt. The assumption of conditional independence between the sampled locations under the null IPP makes enforcing this constraint easy.

Strictly speaking, since in practice the values of Z(s, t) and the parameters in (4.5) are not known and are only estimated, the plug-in test of Algorithm 1 will be invalid, because the null hypothesis is composite [Baddeley et al., 2017]. However, tests that ignore the effects of parameter estimation tend to be conservative in most cases; a loss of power is typically the price to pay [Dao and Genton, 2014]. Whilst the method introduced by Dao and Genton [2014] can be used to ensure that the test attains the nominal type 1 error, the required nested Monte Carlo simulations dramatically slow down the implementation. In the simulation studies of Section 4.4, the test defined in Algorithm 1 is found to be conservative across all tested simulation settings, so we do not consider this matter further.

P-values have come under increasing criticism recently (see Wasserstein and Lazar [2016] and references therein). Indeed, the computation of an empirical p-value alone, to identify the binary presence/absence of PS within a dataset, has its flaws. For example, it does not help to quantify the potential magnitude of the biasing effects that the PS may have on the spatial prediction of µ(s, t). Furthermore, as with all p-values, it is easy to fall victim to the p-value fallacy: a given p-value does not provide much information on its own. A value close to 0.05 neither provides strong evidence in favour of the alternative hypothesis over the null hypothesis, nor implies that the frequentist error probability is close to 0.05 [Sellke et al., 2001]. Additional steps, such as the calibrations introduced by Sellke et al. [2001], must be taken to make such inferences. In summary, p-values reported from the tests outlined in Algorithms 1 and 2 should be used with care.

Assuming the correct data-generating mechanism is specified and the true parameters are known, the test will be exact regardless of how small M is chosen. For testing at the 5% significance level, M could be chosen as low as 19. However, this comes at a cost of power, with the loss of power proportional to 1/M [Davidson and MacKinnon, 2000]. Furthermore, a small M implies a high standard error for the empirical p-value, leading to a test whose outcome depends heavily on the precise sequence of random numbers used to implement the algorithm. To alleviate these concerns, M should be chosen as large as is computationally feasible.

4.3 Preferential sampling in the discrete spatial data setting

In the discrete spatial data setting, observations are taken across a set of areal units St within the study region Ω. Examples of areal units include electoral districts and large survey transects. The sizes of these areal units may be irregular, and are assumed known. It is also assumed that the full population of all areal units available for sampling at each t ∈ T is known; this population is denoted Pt. A binary process is fit to emulate the true sampling process. The choice between a Bernoulli and a Binomial model depends on whether or not the constraint n^m_t = nt is imposed. Once again, a Monte Carlo approach is taken to test for PS.

4.3.1 Assumed model for preferential sampling

Given the population of nt areal units available at time t, denoted Pt = {Ai,t ⊂ S : i ∈ {1, ..., nt}}, define the site-selection indicator variables as follows.
Let Ri(t) denote the indicator random variable for the event that the ith areal unit in Pt is selected at time t. Then the collection of sampled areal units St ⊂ Pt at each time t ∈ T is simply the subset of the population of areal units whose indicator variables take the value 1 (i.e. St = {Ai,t ∈ Pt : Ri(t) = 1}).

Next, define w̄(A, t) to be the fixed spatio-temporal covariates for the indicator selection process at areal unit A. These will typically be areal-aggregate or areal-count values. Similarly, Z̄(A, t) and x̄(Ai, t) will typically be areal aggregates of the underlying spatio-temporal process Z(s, t) and the spatio-temporal covariates x(s, t) respectively. In applications, Z̄(A, t) will typically be modeled as a discrete spatio-temporal process on the areal-unit scale instead of as a continuous process. Examples include the conditional autoregressive process and its spatio-temporal extensions [Besag, 1974a, Blangiardo and Cameletti, 2015].

The same model form is assumed for the observation process Y(A, t) in (4.1); the only change is the spatial scale. Thus, the new sampling process P is defined as:

    [Ri(t) | Z̄(Ai, t)] ∼ Bernoulli(p(Ai, t))    (4.13)
    logit(p(Ai, t)) = αᵀw̄(Ai, t) + δᵀx̄(Ai, t) + h(Z̄(Ai, t)).    (4.14)

For each time step t ∈ T, it is assumed that each areal unit Ai within the population Pt has its values Y(Ai, t) sampled or not according to the outcomes of the independent Bernoulli trials defined in (4.13).

4.3.2 Perceptron algorithm

Algorithm 2: Perceptron test for PS in discrete spatial data
  Data: Observations y(Ai, t) for (Ai, t) ∈ (St × T);
        covariates w̄(Ai, t), x̄(Ai, t) for (Ai, t) ∈ (Pt × T)
  Result: Empirical p-value for the test h ≡ 0 vs. h monotonic
  begin
    Fit a model for (4.3) using a preferred method
    Produce estimates of Z̄(Ai, t) across Pt and T
    Compute the NNk,t values D̄i,t(K)
    Compute the estimates of Z̄(Ai, t) at the areal units St
    Compute the rank correlations ρt
    Fit the chosen Bernoulli model with h ≡ 0 in (4.14)
    Fix m = 1
    while m ≤ M do
      Sample n^m_t areal units S^m_t from the fitted model for t ∈ T
      Compute the NN distance measures D̄^m_i,t(K)
      Compute the estimates of Z̄(Ai, t) at the areal units S^m_t
      Compute the rank correlations ρ^m_t
      if m = M then
        return the empirical p-values of either pointwise or rank envelope tests using ρt and ρ^m_t
      else
        m ← m + 1
      end
    end
  end

All of the same assumptions and issues outlined earlier carry over to the discrete spatial setting. Once again, Z̄(Ai, t) must be spatially smooth across the areal units, and estimates of Z̄(Ai, t) must be available at each of the areal units Ai in the population Pt at each time t ∈ T. Nearest neighbour distances between areal units within St can once again be used; such distances can be defined relative to the areal-unit centroids or otherwise. Strictly speaking, the PS is no longer seen as a clustering of a point-pattern around high (or low) values of Z̄(Ai, t). Instead, the PS takes the form of a clustering of the areal units with complete data around high (or low) values of Z̄(Ai, t). The procedure is defined in Algorithm 2 above.

4.4 Simulation study

This section summarizes the key results of an investigation into the performance of the NN test. The power of the test is demonstrated across a range of simulated data settings. A more thorough treatment of the simulation study is provided in the supporting material in the Appendix. All computations involving point processes were performed using the spatstat package [Baddeley et al., 2015].

The following data-generating mechanism is chosen for the Gaussian response simulation study:

    [Y(s) | Z(s)] = Z(s)    (4.15)
    [S | Z(s)] ∼ IPP(λ(s))    (4.16)
    log(λ(s)) = α0 + α1 w(s) + γZ(s)    (4.17)
    [Z(s)] ∼ GP(0, Σ).    (4.18)

The simulated data are in the purely spatial setting (i.e. T is a singleton), with the Y(s) specified as noise-free observations of Z(s). Z(s) is a realisation of a mean-zero Gaussian process with Matérn covariance matrix Σ. The Matérn roughness parameter ν is set to 1 and the standard deviation of Z(s) is fixed at 1.
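A minimal sketch of the sampling step (4.16)-(4.17), under stated simplifications: the Matérn GP draws Z(s) and w(s) are replaced by fixed smooth surrogate surfaces on a grid (to keep the sketch dependency-free), and the fixed-n Binomial point process is drawn by weighted sampling of grid cells with weights proportional to exp(α0 + α1 w(s) + γZ(s)). All names and parameter values are illustrative, not those of the actual study.

```python
import math
import random

# Surrogate smooth fields on a grid: stand-ins for the Matern GP draws Z(s), w(s).
G = 40                                    # the grid is G x G over the unit square
cells = [(i / G, j / G) for i in range(G) for j in range(G)]
Z = {s: math.sin(2 * math.pi * s[0]) * math.cos(2 * math.pi * s[1]) for s in cells}
w = {s: math.cos(2 * math.pi * (s[0] + s[1])) for s in cells}

def sample_locations(n, alpha0, alpha1, gamma, rng):
    """Fixed-n (Binomial) point process with log-intensity as in (4.17):
    log lambda(s) = alpha0 + alpha1 * w(s) + gamma * Z(s)."""
    weights = [math.exp(alpha0 + alpha1 * w[s] + gamma * Z[s]) for s in cells]
    return rng.choices(cells, weights=weights, k=n)

rng = random.Random(7)
S = sample_locations(250, alpha0=0.0, alpha1=0.0, gamma=2.0, rng=rng)

# Positive PS (gamma > 0): the sampled cells over-represent high-Z regions.
mean_Z_sampled = sum(Z[s] for s in S) / len(S)
mean_Z_all = sum(Z.values()) / len(Z)
```

With γ = 0 the weights are constant and the sampler reduces to the uniform null; the simulation study varies γ, α1 and the ranges of the two fields to map out the power of the test.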
The spatial range ρZ of the process is varied; the spatial range is defined here to be the distance at which the spatial correlation drops below 0.1. A larger ρZ implies the process has greater spatial smoothness (i.e. a lower frequency).

The sampled locations are generated from a LGCP with a single (non-informative) covariate w(s). The number of points is fixed equal to n, and thus the true process is a Binomial point process. The parameter γ determines the magnitude of PS, with γ = 0 corresponding to the null IPP model of no PS. Again, w(s) is an independent realisation of another Gaussian process with Matérn covariance function. Both the roughness ν and the standard deviation are again fixed at 1, but the range parameter ρw is varied independently of ρZ. The values of w(s) are assumed known throughout Ω. The parameter α1 determines the effect of the covariate on the intensity λ(s).

The NN test is performed at the 5% significance level using 19 Monte Carlo samples (i.e. M = 19). M = 19 is chosen to allow an exact 5% significance level to be attained; however, this small value of M implies that these results provide a lower bound on the power of the test [Davidson and MacKinnon, 2000]: a higher power would have been attained had a larger M been chosen. All tests are performed against the two-sided alternative hypothesis that h is a monotonic function of Z. Each experimental setting is repeated 200 times. In the study, all combinations of the following parameters are evaluated:

• Sample size n ∈ {50, 100, 250},
• PS magnitude γ ∈ {0, 1, 2},
• Covariate effect α1 ∈ {0, 1},
• Spatial range of Z, ρZ ∈ {1.00, 0.20, 0.02},
• Spatial range of w, ρw ∈ {1.00, 0.02},
• Number of nearest neighbour distances K ∈ {1, ..., 15}.

Along with the NN test outlined in Algorithm 1, a Monte Carlo test using estimates of the raw residuals of the assumed IPP under the null hypothesis is also computed.
This time the rank correlation between the estimates of Z(s) and the estimated residuals is the test statistic. As before, both sets of estimates are evaluated only at the point locations. The estimated residual values are simply kernel-smoothed raw residuals; an edge-corrected Gaussian kernel is used, with the bandwidth selected by leave-one-out cross-validation. We refer to this perceptron residual test as the residual test hereafter. It is interesting to assess the relative performance of the NN test, given its generality across all point processes and across the discrete spatial setting. We now summarize the results of the study.

First, our investigation strongly suggests that the type 1 error is bounded above by the nominal level. This is seen across all simulation settings in which the null hypothesis is true. Thus, it appears that the computationally costly nested Monte Carlo approach of Dao and Genton [2014] need not be used, except in the interest of improving the power of the test. Fig. 4.2 shows the results for n = 50 in the simplest setting, without covariate effects (i.e. α1 = 0) or PS effects (i.e. γ = 0). The spatial range ρZ is changed and the test is performed across different values of K. Both Monte Carlo tests attain a type 1 error at or below the 5% level. For comparison, the two standard simulation-free rank correlation tests attain a type 1 error well above the 5% level. The error of these tests increases dramatically with ρZ because they ignore the spatial correlation, and hence the non-standard sampling distribution of the test statistic. We omit the simulation-free results hereafter.

Next, the power is assessed. Across all the simulation settings, the power improved with increasing spatial range ρZ. This is in agreement with the earlier analytic results: Z(s, t) must be spatially smooth to achieve high power. Furthermore, the power of the NN test is found to be sensitive to the choice of K.
The optimal choice of K depends upon both the spatial range of Z and the sample size: larger values of both imply that a larger value of K should be chosen. Fig. 4.3 shows the results for the setting where no covariate effects exist (i.e. α1 = 0) but moderate positive preferential sampling occurs (i.e. γ = 1). For n = 50, the NN test has slightly lower power than the residual test; this difference diminishes as the sample size increases. Conversely, the NN test outperforms when both the spatial range is very small (ρZ = 0.02) and the magnitude of PS is very high (γ = 2). Fig. A.17 in the Appendix demonstrates this. Under these conditions, very small clusters form. However, a different choice of bandwidth-selection method may improve the power of the residual test.

[Figure 4.2 here: type 1 error probability vs. number of nearest neighbours K, one panel per field frequency ('high freq field', 'med freq field', 'low freq field'); no covariate effect, n = 50.]

Figure 4.2: A plot of the type 1 error for four tests. The three panels show the results for ρZ ∈ {0.02, 0.2, 1}, from left to right respectively, for a sample size of 50. The two 'Residual' tests are computed using the kernel-density-smoothed values of the residuals from the fitted homogeneous Poisson processes; leave-one-out cross-validation was used to select the bandwidth. The 'NN' tests are those based on the K nearest neighbour values. The suffix 'MC' denotes that the test has been computed from Monte Carlo realisations of the fitted point process. A black line is plotted at the type 1 error level of 0.05 to indicate the target value. Notice that the simulation-free tests can have severely inflated type 1 error.

[Figure 4.3 here: power vs. number of nearest neighbours K, columns 'med freq field' and 'low freq field', rows n = 50 and n = 100; no covariate effect, PS = 1.]

Figure 4.3: A plot of the power for two tests when the PS parameter γ equals 1. The two columns show the results for ρZ ∈ {0.2, 1}, from left to right respectively. The two rows show results for sample sizes of 50 and 100 respectively. The 'Residual' test is computed using the kernel-density-smoothed values of the residuals from the fitted homogeneous Poisson processes; leave-one-out cross-validation was used to select the bandwidth. The 'NN' test is based on the K nearest neighbour values. The suffix 'MC' denotes that the test has been computed from Monte Carlo realisations of the fitted point process.

Strong (non-informative) covariate effects in the sampling process hurt the power of all the tests, as can be seen in Fig. A.18 in the Appendix. Interestingly, the NN test is shown to be competitive across all settings tested, except one. When the spatial ranges of both the covariate w(s) and the field Z(s) are large and similar, the power of the NN test is very low. Here, the residual test, with residuals computed from the fitted IPP, performs much better: its power is almost double that of the NN test. In these settings, despite w(s) and Z(s) arising from independent distributions, the empirical correlations between them in any given realisation may be large. Consequently, the NN test may be unable to distinguish between the clustering due to Z(s) and the clustering due to the measured covariate w(s), because the NN test, unlike the residual test, does not directly adjust for w(s). However, the IPP residual test is not always superior: when ρw is very low, the NN test has higher power when ρZ = 0.2.

The performance of the tests is also assessed in settings where the response is non-Gaussian, and where the true sampling process P is not an IPP. The Y(s) take the form of Poisson counts, and the true sampling process is set equal to a Hardcore process. Different radii of interaction are compared. The Hardcore process is purposefully chosen.
Here, the use of nearest-neighbour distances to capture additional clustering performs poorly. Since the nearest neighbour distances are bounded below, the contrast in their observed values decreases as the radius of interaction increases. Conversely, estimates of the smoothed residuals are not directly affected. Figure A.21 in the Appendix clearly shows that the test based on the smoothed residuals far outperforms the NN test when the radius of interaction is high. This demonstrates a clear need for the researcher to choose a measure of clustering that is suitable for the true sampling process. The power remains high for Poisson f.

4.5 Case studies

The ability of the test to detect PS is now demonstrated in two real case studies. These two datasets are chosen because the presence of PS within them has previously been shown in published work. The first example shows how researchers can easily detect positive PS and then use the test to search for a sufficient set of informative covariates x(s, t). In the second example, negative PS is detected.

4.5.1 Great Britain's black smoke monitoring network

Annual concentrations of black smoke were obtained from the UK National Air Quality Information Archive ( Site locations and annual average concentrations of black smoke (µg m⁻³) were obtained from the monitoring sites. Previous analyses and the results presented in Chapter 3 demonstrated that the network had been preferentially sampled, with h a strictly increasing function across the years 1966-1996 [Shaddick and Zidek, 2014].

Figure 4.4: A plot of the locations of the black smoke monitoring sites in 1966. Observe the clustering of sites around the populous cities of London, Manchester, and Glasgow.

The analysis is restricted to the 1966 concentrations. For reasons outlined in Shaddick and Zidek [2014], only sites that gave readings for at least 273 days of the year (75% data capture) are considered. Fig. 4.4 shows that sites were located in the industrial cities around London, the Midlands and the North West of England, with almost no sites present in the industry-free Scottish Highlands. Thus the point-pattern displays clear evidence of clustering around regions expected to have high concentrations of black smoke.

The R-INLA package (see Chapter 2) with the SPDE approach is used to fit a standard geostatistical model [Lindgren et al., 2011b, 2015, Rue et al., 2009]; other methodologies, Bayesian or frequentist, could be used. As in Chapter 3, the log-ratios of the concentrations are the chosen response. PC priors are placed on the approximate Matérn field [Fuglstad et al., 2018]: the prior probability that the spatial range is below 5 km is set to 0.1, and the prior probability that the standard deviation of the field is above 3 is set to 0.1. The (mean-centered) posterior means of the log-transformed black smoke levels across Ω were then used as the 'naive' Ẑ(s) values.

Gridded residential human population count data with a spatial resolution of 1 km × 1 km were obtained for Great Britain. These counts came from the 2011 Census data and 2015 Land Cover Map data from the Natural Environment Research Council Centre for Ecology & Hydrology [Reis, 2017]. As in Chapter 3, it is assumed that the relative population density across Great Britain remained approximately stable from 1966 to 2011. These normalized counts are used as an informative covariate x(s) for both the null IPP sampling process and the observation process. Initially, the presence of PS is tested for without the population density covariate included in either process; this test is referred to as V1. Next, population density is controlled for as an informative covariate in both the observation and the null IPP sampling processes (it is found to be strongly associated with both). It is then investigated whether PS remains.
This test is referred to as V2. Estimates of Ẑ(s) in the V2 test are thus corrected for population density. Population density is included as a Bayesian spline, to capture any nonlinear effects of population density on the observed black smoke Y(s). Mechanistically, including population density as an informative covariate in a model for black smoke concentration is reasonable: localized sources of black smoke include the combustion of carbon-based fuels, with levels of combustion expected to increase with population density. If PS is no longer detected after this adjustment, then population density has explained away the PS. Conversely, if PS is still detected, then standard 'naive' methods may be biased even after controlling for population density.

Table 4.1 shows the empirical pointwise p-values of the tests for changing K, under both of the assumed sampling mechanisms. Results are shown for both the V1 and V2 tests. The empirical p-values are the proportion of rank correlations, in the 1000 Monte Carlo samples, that were more negative than that observed in the data; the test is thus one-sided. For K > 1, strong evidence is found against the null in favour of a positive monotonic h under the V1 HPP test. This remains the case under the V2 IPP sampling process. Note that the p-values have not been adjusted for multiple testing (over K).

Table 4.1: A table of empirical p-values for the UK black smoke dataset for both the assumed homogeneous and inhomogeneous Poisson point process models.

    K               1     2     3     4     5     6     7     8
    HPP p-value V1  0.28  0.01  0.01  0.01  0.00  0.00  0.01  0.01
    IPP p-value V2  0.20  0.01  0.01  0.01  0.00  0.00  0.00  0.01

The insights gained from these tests are as follows. A 'naive' model fit to the black smoke data without adjusting for population density may be biased due to the PS present. Furthermore, population density fails to explain the observed PS seen in the data.
Residual PS remains after controlling for population density in a model for black smoke (the IPP V2 result). Thus a sufficient set of PS-removing covariates has not yet been identified. Either this iterative process of finding a sufficient set of covariates must be continued, or a joint model should be considered, as in Chapter 3.

4.5.2 Galicia lead concentrations

The second real-world dataset consists of the concentrations of lead in moss samples collected across Galicia, northern Spain, in 1997 [Fernández et al., 2000]. The concentration is measured in micrograms per gram of dry moss. The 1997 locations were previously shown to have been preferentially sampled in the landmark preferential sampling paper by Diggle et al. [2010]. In fact, in this example, a significant linear h effect with negative slope was found; h was thus found to be monotonically decreasing.

Figure 4.5: A plot of the 1997 sampled locations of lead concentrations in Galicia, northern Spain. Observe the clustering of sites in northern Galicia.

Fig. 4.5 shows a clear increase in the density of sampled locations in the northern half of Galicia. This half was found to have the lowest concentrations of lead. PC priors were again placed on the approximate Matérn field: the prior probability that the spatial range is below 10 km was set to 0.1, and the prior probability that the standard deviation of the field is above 3 was set to 0.1. The posterior mean log-transformed lead concentrations were used as the Ẑ(s) values. Empirical p-values were computed using 1000 Monte Carlo samples. The direction of the inequality was reversed this time, to test for negative PS.

Table 4.2: A table of empirical p-values for the Galicia dataset.

    K        1     2     3     4     5     6     7     8
    p-value  0.01  0.03  0.06  0.06  0.07  0.08  0.08  0.09

The empirical pointwise p-values of this test are shown in Table 4.2. Even after accounting for Monte Carlo error, moderately strong evidence of negative PS exists, with the strongest evidence occurring at K = 1.
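The one-sided empirical p-values used in both case studies, i.e. the proportion of Monte Carlo rank correlations more extreme than the observed one in the direction being tested, can be written as a small helper. This is an illustrative sketch with hypothetical names, using the add-one correction so that the Monte Carlo test is exact:

```python
def emp_pvalues(rho_obs, rho_sims):
    """One-sided empirical p-values from M Monte Carlo rank correlations.
    Positive PS (h increasing) clusters points where Z is high, so the
    observed correlation between the Z estimates and the NN distances
    should be unusually negative; negative PS reverses the inequality."""
    M = len(rho_sims)
    p_positive_ps = (1 + sum(r <= rho_obs for r in rho_sims)) / (M + 1)
    p_negative_ps = (1 + sum(r >= rho_obs for r in rho_sims)) / (M + 1)
    return p_positive_ps, p_negative_ps

# If the observed correlation is more negative than all 19 null draws,
# the positive-PS p-value is 1/20 = 0.05.
p_pos, p_neg = emp_pvalues(-0.6, [0.1 * k for k in range(19)])
```

The black smoke analysis corresponds to the first (more negative) direction, and the Galicia analysis to the second.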
A researcher would now have to decide whether to pursue a sufficient set of covariates or fit a joint model as in Diggle et al. [2010].

4.6 Concluding remarks

A fast and intuitive test has been presented for detecting preferential sampling (PS) in both continuous-space and discrete-space spatio-temporal data. The test is highly general: any preferred methodology for estimating a latent spatio-temporal process Z(s, t) may be used, including both Bayesian and frequentist methods. In many situations, detected PS may be adequately described by a set of available informative covariates that are associated with both the sampling process and the observation process being measured. The test presented in this Chapter can identify such an informative covariate set. In cases where residual PS remains, even after controlling for a set of informative covariates, researchers can either seek a sufficient set of informative covariates, or seek a method that directly models the PS (e.g. a joint model).

In this Chapter, the properties and validity of the test were demonstrated through an extensive simulation study. The suitability of the test for real-world applications was then confirmed through the re-analysis of two previously published case studies. Both case studies had previously detected PS, and the test successfully replicated the findings of both. The power of the test to detect PS was shown to increase with the spatial range (i.e. the inter-point distance required for observations to be approximately independent), the sample size, and the degree of PS. The power decreased dramatically with the inclusion of covariates, independent of the observation process, that had a strong effect on the sampling process.

The general framework introduced in Chapter 3 could also have been used to identify PS in both of the two case studies. However, we recommend that the test developed in this Chapter be used instead for the purposes of testing.
Unlike the general framework, the test requires only that the PS be monotonic. This could have large implications for inference: an incorrectly specified functional form of PS within the general framework could lead to it failing to detect PS altogether. Unfortunately, the high computational cost of fitting the framework prohibited its comparison within the simulation study. Furthermore, the three tests previously developed in the literature would be unsuitable for use in the two case studies. Whilst the two case studies both involved continuous responses, the two tests developed by Schlather et al. [2004] would only be able to accept or reject the assumption that the data were a realization of a random-field model; the direction of PS would not be discernible. Additionally, population density could not have been integrated within their test. The test by Guan and Afshartous [2007] would not be suitable since the sample sizes seen in the two case studies were orders of magnitude too small for its use.

The biasing effects of PS on spatial prediction have been demonstrated across a wide range of fields and can be severe in magnitude [Diggle et al., 2010, Dinsdale et al., 2019, Pennino et al., 2019]. Thus, PS should not be ignored in spatial analyses. The perceptron test proposed in this Chapter is suitable for assessing the presence of PS in many spatio-temporal analyses. Therefore, in cases where the sampling protocol is either unknown or known to be preferential, reporting the results from a PS test alongside any publication of spatio-temporal analyses should become standard practice. We have made a user-friendly R package, PStestR, available on GitHub for implementing the algorithm. No extra work or computation time is required. The package works seamlessly with many of the commonly used data types (e.g. from the sp, sf and spatstat libraries [Baddeley et al., 2015, Bivand et al., 2013, Pebesma, 2018, Pebesma and Bivand, 2005]).
The test can perform both pointwise and rank envelope tests, to alleviate the multiple testing problem. The package can also fit numerous point processes, including hardcore and cluster point processes.

Three avenues of research should be pursued in the future. First, how best to capture the localized clustering in specific sampling settings should be explored. For example, in this Chapter it was shown that the nearest neighbour distance may be a poor choice for measuring the degree of clustering under certain sampling processes. How to optimize the power of the test in specific settings should be pursued. Next, eliminating the need for generating Monte Carlo samples from the null point process may be possible [Acosta et al., 2018]. We found that 'effective sample size'-adjusted rank correlation tests showed very poor performance: convergence rates of the test could be as low as 10% for specific simulation settings, and thus we omitted the results. Further investigation here is warranted. Finally, adjustments are required for using this test in spatio-temporal applications where sampling locations are retained from one time-step to the next. Environmental monitoring networks are an example: here, the chosen network locations from one time step to the next are not independent. Additional work should be pursued to generalize the methods in this Chapter to such settings.

Chapter 5

Estimating Animal Utilization Distributions from Multiple Data Types: a Joint Spatio-temporal Point Process Framework

"In our lust for measurement, we frequently measure that which we can rather than that which we wish to measure... and forget that there is a difference."
— George Udny Yule

A preview

The previous two Chapters developed methods for accounting for preferential sampling (PS) in the statistical analysis of both discrete-space and continuous-space spatio-temporal data. In Chapter 2, we introduced a third type of spatio-temporal data, which we referred to as spatio-temporal point-pattern data.
Accounting for PS in the statistical analysis of spatio-temporal point-pattern data is the focus of this Chapter.

Accounting for PS in spatio-temporal point-pattern data offers additional challenges compared with the other two data types. No longer can the observed locations themselves be used directly to inform a model about the nature of PS through the sharing of latent effects between processes. Instead, additional strong assumptions are required. We assume the scenario where a point-pattern is observed in Ω × T. We assume that the point-pattern represents a censored collection of events that are recorded by a set of observers, and that numerous factors caused the true point-pattern to be only partially observed throughout Ω × T. By borrowing ideas from thinned point processes, we multiplicatively decompose the intensity of the observed (censored) point process into three terms. The first term describes the true spatio-temporal intensity process responsible for generating the latent, uncensored point-pattern. The next two terms reflect the processes that cause some of the events (points) to become censored. The first of these describes the behaviours of the observers through space and time; we refer to this term as an effort surface, as it reflects the observers' efforts spent searching for the events. The final term reflects the abilities of the observers to detect an event at location s and time t, conditional on there being one present. We refer to this term as a detectability surface.
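This three-term multiplicative decomposition can be illustrated with a small simulation. The following Python sketch (all surfaces and constants are hypothetical illustrations, not those of the later case study) thins a dominating homogeneous Poisson process on the unit square so that the retained points follow the product intensity of a true surface, an effort surface, and a detectability surface:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical surfaces on the unit square (illustrative forms only)
lam_true = lambda x, y: 200.0 * np.exp(-((x - 0.7) ** 2 + (y - 0.7) ** 2) / 0.1)
effort   = lambda x, y: np.where(x < 0.5, 1.0, 0.2)   # observers search the west more
detect   = lambda x, y: np.full_like(x, 0.8)          # constant detection probability

lam_obs = lambda x, y: lam_true(x, y) * effort(x, y) * detect(x, y)
lam_max = 200.0  # upper bound on lam_obs over the unit square

# Lewis-Shedler thinning: dominate with a homogeneous process of rate lam_max,
# then keep each point independently with probability lam_obs / lam_max.
n = rng.poisson(lam_max)                       # |Omega| = 1
x, y = rng.uniform(size=n), rng.uniform(size=n)
keep = rng.uniform(size=n) < lam_obs(x, y) / lam_max
obs_x, obs_y = x[keep], y[keep]
```

The retained points are a realisation of the censored process: events are under-represented wherever effort or detectability is low, even where the true intensity is high.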
In particular, we develop a framework for estimating the utilisation distribution (UD) of animals that can adjust for both heterogeneous observer effort and heterogeneous detectability. These issues are commonplace in ecological applications. We then apply our framework to a real case study: the estimation of the spatio-temporal distribution of an endangered ecotype of killer whales using opportunistic sightings made by citizen scientists. As beautifully stated in the above quote, we must adjust for the opportunism of these citizen scientists if we wish to draw sensible inference from their sightings.

5.1 Introduction to Chapter 5

Accurate knowledge of the spatio-temporal distribution of animals is vital for understanding the effects of climate and habitat changes on species and for governments to implement successful management policies. In particular, ecologists are interested in estimating the space use of individual animals to inform a wide range of questions. For example, such estimates can be used to quantify the location and extent of habitats required for a species' conservation strategy [Fleming et al., 2015]. Estimates of space use can be linked to environmental covariates within an occupancy model to explain resource use [Mordecai et al., 2011]. These estimates are also used within a spatial capture-recapture framework for estimating home range centers and population abundance [Royle et al., 2011]. The space use of an individual is often referred to as the utilization distribution (UD). The UD of an individual defines the probability density that an individual is found at a given point in space and time [Lele et al., 2013].
Estimating individual UDs is often complicated by complex animal movement and observer processes, both of which must be considered within an analysis [Royle et al., 2011]. Accounting for observers with unknown, but estimable, effort remains an open problem for UDs.

Ecologists have a plethora of methods available to them for estimating UDs from different types of individually-identified sightings. In particular, many methods exist for animal tracking datasets, where the locations of individuals are followed through time with devices such as GPS tags. Simple methods include geometric techniques such as the minimum convex polygon [Fieberg and Börger, 2012], and statistical smoothers such as kernel density estimators [Worton, 1989]. Methods have also been developed to account for the autocorrelations due to animal movement [Fleming et al., 2015], but these assume that the sightings were made without observer bias. For tracking data collected at high temporal frequency, resource selection function models can be used to estimate UDs [Johnson et al., 2013]. Fewer methods exist for sightings (or captures) data made at a set of discrete locations (e.g. camera traps) or by a group of mobile observers with recorded locations. Such datasets are commonly collected [Hussey et al., 2015, Whoriskey et al., 2019]. In these settings, spatial capture-recapture models can be used to estimate UDs throughout the study region [Royle et al., 2011]. However, these methods typically assume that the UD of each individual follows a bivariate normal distribution centered at a unique, latent, home range center. This assumed form of the UD may be overly simplistic when large quantities of data are available for each individual.
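For concreteness, the simplest of the smoothers mentioned above, a fixed-bandwidth Gaussian kernel estimate of a UD evaluated on a grid, can be sketched in a few lines. This is a hypothetical Python illustration (function name and toy data are ours); practical analyses would use a data-driven bandwidth in the spirit of Worton [1989] and would face the observer-bias issues discussed below:

```python
import numpy as np

def kde_ud(sightings, grid, bandwidth):
    """Gaussian-kernel utilization-distribution estimate, normalised so the
    grid-cell values sum to one (a discrete approximation of the UD)."""
    diffs = grid[:, None, :] - sightings[None, :, :]          # (n_grid, n_obs, 2)
    dens = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * bandwidth ** 2)).sum(axis=1)
    return dens / dens.sum()

# Toy data: 50 sightings around (0.5, 0.5), evaluated on a 20 x 20 grid
rng = np.random.default_rng(0)
sightings = 0.5 + 0.1 * rng.standard_normal((50, 2))
gx, gy = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20))
grid = np.column_stack([gx.ravel(), gy.ravel()])
ud = kde_ud(sightings, grid, bandwidth=0.1)
```

Note that such a smoother treats the sightings as an unbiased sample of the animal's locations; heterogeneous effort will distort it, which motivates the framework developed in this Chapter.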
We propose a framework to model complex individual UDs that accounts for the observer efforts of both mobile and static observers and models the relationships between the individuals' space use and environmental characteristics.

A similar but distinct objective to estimating the UDs of individuals is estimating the distribution of a species. Species distribution models (SDMs hereafter) are tools for predicting the density of a species and for relating it to measured environmental covariates. SDMs have been the focus of statistical research for decades [Elith and Leathwick, 2007, Fithian et al., 2015]. In particular, much work has been done on how to use point process methods to develop SDMs that can combine data of varying type (see Miller et al. [2019] for a recent review), account for observer processes and biases [Koshkina et al., 2017], and include spatio-temporal random effects to capture additional autocorrelations [Yuan et al., 2017]. Furthermore, point processes are scale-invariant and do not encounter the 'pseudo-absence' problem faced by competing methods [Warton et al., 2010]. Thus, point processes have emerged as the most promising integrative model framework for jointly modelling data of varying types and quality [e.g. Giraud et al., 2016, Renner et al., 2015]. By sharing parameters and latent effects between the likelihoods of each data type within a joint model [e.g. Bedriñana-Romano et al., 2018, Giraud et al., 2016], strength can be borrowed, and hence greater precision in conclusions and improved management policies can be attained [Fithian et al., 2015]. This approach can be made more robust for datasets of especially poor quality by allowing only a correlative relationship to exist between their models and the latent effects defining the SDM [Pacifici et al., 2017].

In this Chapter, we show how to adapt these recent point process frameworks for use with UDs.
In particular, we assume that the individuals' UDs are stationary within well-defined time periods and that we can subset the sightings data to obtain approximately independent snapshots of the UDs. We then argue that these recent developments make point processes an ideal basis to create a framework for modelling individual UDs in data-rich settings. In particular, using point processes for UDs allows us to combine numerous datasets following different complex data collection protocols [Miller et al., 2019]. For example, presence-only sightings that lack any records of locations where sightings were not made (i.e. absences) may be combined with high-quality presence-absence data (see Hefley and Hooten [2016], Miller et al. [2019] and references within). Accounting for differences in protocols is crucial. Some protocols may focus observer efforts in regions where the density of the species under study is highest. This preferential sampling can lead to positively biased estimates of species density, abundance, and UDs [Pennino et al., 2019].

With the data appropriately subsetted, any autocorrelation between the sightings that could bias inference is removed [Johnson et al., 2013]. Yet there remain two hurdles that we must overcome to apply point process methods to UD estimation. First, we must adapt the point process framework to the setting where individuals can be identified at numerous locations through time. Typical SDM analyses often assume that the locations of individuals remain fixed throughout the sampling time [Giraud et al., 2016, Koshkina et al., 2017]. Second, we must generalize the point process framework to allow for combinations of mobile and static observers with highly complex, and potentially unknown (but estimable), effort. Well-defined discrete sampling 'sites' are often assumed to exist [Giraud et al., 2016].
However, in many cases, observer effort is continuous in both space and time. In continuous time, unless strict distance sampling protocols with high observer speed are followed, animal movement can bias estimates of absolute intensity [Glennie et al., 2015, Yuan et al., 2017]. Explicitly modelling the animal movement model within the observer's sampling process, to eliminate the bias caused by animal movement, can prove challenging [Glennie et al., 2020]. This task is made especially difficult when information on the sampling protocols is unavailable, as is often the case.

Taking the above concerns into account, we create a flexible framework for estimating the UDs of individuals in data-rich settings. We relax many of the stringent assumptions and requirements made previously, creating a general approach for estimating the spatio-temporal distributions of individual animals. Data from combinations of mobile and static observers, with differing skill levels and following differing protocols, may be combined. Locations of observers may be known (e.g. GPS) or unknown; in the latter case, either a highly informative set of covariates or an emulator must be available that can adequately describe the observers' efforts. In either case, observer effort can be controlled for, with any uncertainties in effort propagated through to the resulting inference. Our approach circumvents the modelling of an animal's movement process by focusing on relative intensity values when estimating UDs. This largely avoids the bias discussed by Glennie et al. [2020] for estimating absolute intensity values. We demonstrate this claim in a thorough simulation study and show that large improvements in the predictions of UDs are possible using our approach in settings where the degree of spatial heterogeneity in observer effort is high. Using our approach, statistical inference can be made at an individual level and/or a population level.
We show that population inference is trivial in settings where sightings data on each individual in the population are available. When only a subset of the population is observed, the sampled subpopulation must be representative of the population for extrapolation to be accurate. To adapt the point process framework for use with repeated sightings of individuals, we redefine the intensity surface being modeled as an encounter rate instead of as an expected number of individuals per unit area. Including random fields and random effects can model the additional spatio-temporal correlation induced by unmeasured spatially-smooth covariates and/or biological processes [Pacifici et al., 2017, Yuan et al., 2017]. Since this approach is subsumed under a log-Gaussian Cox process (LGCP) framework [Chakraborty et al., 2011], implementing this idea is made especially easy using the R package inlabru [Bachl et al., 2019, R Core Team, 2019], enabling researchers to adapt this framework for use in their applications.

This Chapter is structured as follows. First, we present our motivating problem: estimating the spatio-temporal distribution of the Southern Resident Killer Whale between May and October. This species is of special conservation concern. Next, we define the types of encounter events assumed to generate the data. Then, we introduce marked LGCPs as the preferred tool for modelling the encounters. We discuss their properties and define the new intensity surface to be modeled in terms of encounter rates and effort intensities. After presenting our modelling framework, we demonstrate its properties in a thorough simulation study, before adapting it to our motivating problem. Finally, we design easily-interpretable maps to display the UDs of the whales across the months. We then aggregate these to provide inference at both a pod (i.e. group) level and a population level.
Computer code using the inlabru package [Bachl et al., 2019] is provided to enable researchers to adapt this framework for their applications.

5.2 Motivating problem

5.2.1 An introduction to the problem

The Southern Resident Killer Whale (SRKW) population is listed as Endangered under both the Canadian Species at Risk Act (DFO) and the United States Endangered Species Act (NOAA) because of their small population size, low reproductive rate, the existence of a variety of anthropogenic threats, and concerns over prey availability. The range of the SRKW extends from southeastern Alaska to central California; however, between May and September, all three pods of these whales frequent the waters of both Canada and the United States, concentrating in the Salish Sea and Swiftsure Bank (DFO). We extend our study region beyond these areas (see Fig 5.1).

The development of successful and effective policies to help protect the SRKW requires accurate, high-resolution knowledge of how their use of space evolves across the calendar year. The SRKW are highly social animals, spending the majority of their time in three well-defined groups called the J, K and L pods [Ford et al., 1996]. Hereafter, we consider pods as individuals for the purposes of modelling. Inter-pod variation in summer space use exists [Hauser et al., 2007]. Due to the differing characteristics of the three pods, additional knowledge of space use at a pod level may help to improve the effectiveness of management decisions.
While SRKWs are known to favour the inshore waters of Washington State and British Columbia in the summer months [Ford et al., 2017, Hauser et al., 2006], precise knowledge surrounding their space use across the months is lacking, as is precise knowledge surrounding the differences between the pods [Hauser et al., 2007].

5.2.2 The data available

Multiple sources of SRKW sightings are available, including GPS-tracked focal follows of targeted individuals by citizen scientists, presence-only sightings from commercial whale-watching vessels, and opportunistic sightings reported by the public. We judge two of these data sources to be most suited for use in an effort-corrected analysis, as we are able to estimate their observer effort and we are confident that these sources can accurately differentiate between the three SRKW pods.

The first source of data we use is presence-absence data collected through a project funded by the Department of Fisheries and Oceans Canada. The data were collected between 2009 and 2016 from a mobile vessel fitted with a GPS tracker, operating on the west side of the Strait of Juan de Fuca. The GPS coordinates of the vessel were recorded at a high frequency, and all sightings of SRKWs were reported, along with the pod (see Fig 5.1). The marine mammal observer on this vessel is an expert on the SRKW, and thus the reported pod classifications can be considered accurate. The motivations of the Captain's data collection varied from year to year, and hence the spatial distribution of the observer effort varied. We refer to this dataset as DFO hereafter.

The second reliable data source we use is the presence-only SRKW sightings reported by the whale-watch industry between 2009 and 2016 and collected by two organisations. The B.C.
Cetacean Sightings Network (BCCSN) [Vancouver Aquarium] and The OrcaMaster (OM) [The Whale Museum] [Olson et al., 2018] datasets both contain sightings from a vast range of observer types, but we exclusively model the whale-watch sightings, for three reasons. First, the whale-watch operators have a high degree of expertise on the SRKW, with vessels typically having a biologist or other expert onboard. This results in accurate pod classifications. Second, the whale-watch companies are known to share the locations of sighted SRKW with each other. Thus, our dataset likely contains the majority of the whale-watch sightings that were made, not just the subset made by the operators who report to the databases. Third, a vast amount of data has been collected on the activities of the whale-watching industry operating in the area. This enables us to estimate the observer effort from these companies with a high degree of accuracy and precision. We refer to this combined dataset as WW hereafter.

The majority of the data we obtained on the activities of the whale-watch industry came from the Soundwatch Boater Education Program (Soundwatch hereafter). Soundwatch is a vessel monitoring and public education outreach program that systematically monitors vessel activities around cetaceans during the whale-watch season (May to September) in the Haro Strait region of the Salish Sea [Seely et al., 2017]. Since 2004, Soundwatch has been using data collection protocols established in partnership with NOAA, DFO, and the Canadian Straitwatch Program. These include detailed accounts of vessel types, whale-watch vessel numbers, and whale-watch vessel activities.

5.2.3 Previous work estimating the space use of SRKW

Previous work estimating the summer space use of SRKW has greatly assisted the development of critical habitat regions and has helped inform successful management initiatives. In the past 15 years, two pieces of published research tackled the problem from different angles.
Hauser et al. [2007] estimated the summer space use of the SRKW within the Salish Sea's inshore waters of Washington and British Columbia. The authors assumed a constant observer effort across the months from the whale-watch operators within their study region, based on results from a field study [Hauser et al., 2006]. Pod-specific core areas were identified and SRKW hotspots were clearly displayed. However, their study region was smaller than ours.

Most recently, Olson et al. [2018] estimated an effort-corrected map of SRKW summer space use. This expanded on the work of Hauser et al. [2007] by estimating the space use across a larger area, as well as by incorporating the heterogeneous observer effort in their modelling directly. Regions of 'high' effort-adjusted whale density were identified and clearly presented in detailed plots. To reduce the impact of autocorrelation from the whales' movements on the analysis, they defined 'whale days' as their target metric: a whale day is any day on which a SRKW was reported, regardless of the number of times it was reported that day. A smaller study region relative to ours was studied, and no environmental covariates were used. Estimation of observer effort followed previous unpublished research from the Vancouver Aquarium (Erin U. Rechsteiner [2013], pers. comm.).

5.2.4 Goals of the analysis

We aim to build upon the previous research by developing a model for estimating high-resolution, effort-corrected, and temporally-changing SRKW space use across the summer months (May to October). For each pod, their UD will provide the probability that they occupy a specific region at any given instant in time within each month. Under the assumption that the study region properly captures the full spatial extent of the UDs, we can then estimate the spatial distribution of the SRKW for each month.
All estimates will be conditioned upon detailed estimates of observer effort from the whale-watch companies, combined with GPS tracks from the DFO dataset. Unlike previous attempts, our model will use multiple environmental covariates, such as sea-surface temperature and various measures of primary productivity (e.g. chlorophyll-A), to improve the accuracy of predicted maps.

Figure 5.1: A plot showing our area of interest Ω in green, with the GPS tracklines of the DFO survey effort displayed as black lines. All DFO survey sightings are shown as a red overlay on top of the effort. All sightings from the OM and BCCSN datasets are shown in yellow. Shown are sightings and tracklines from May to October, 2009 to 2016. The green circle roughly locates the Swiftsure Bank, with the waters due east and north of this representing the Salish Sea. The BC Albers projection shown is in units of metres.

By turning to statistical/probabilistic modelling, we will attempt to account for all sources of uncertainty, including the uncertainties associated with our estimates of the observer effort from the whale-watch vessels. This has not previously been done. Finally, we will demonstrate how the methodology allows for the creation of maps that simultaneously display regions of high SRKW intensity for each month, along with their corresponding uncertainties. We display these for the month of May for exhibition. Identifying critical habitat plays a major role in the protection plans for any endangered species; thus we hope our work can assist with future policy decisions surrounding the protection and management of the SRKW population.

5.3 Building the modelling framework

Defining observer-animal encounters

We define the movement trajectory through time of an individual animal m as ξ_m(t). This denotes the spatial coordinate of the individual at time t with respect to the coordinate reference system used.
We assume that there exist sufficiently large time windows T_l ⊂ R : l ∈ L such that the movement processes driving the trajectories of the individuals, ξ_m(t) : t ∈ T_l, have stationary invariant densities for each T_l, denoted π_m(s, T_l). These densities define the UDs we aim to estimate; they are assumed a priori to be arbitrarily complex and represent the long-run density of the locations that the individual visits during T_l. An example time window could be a calendar month.

Figure 5.2: A diagram showing an example of an 'encounter'. Two observers, one mobile and one static, with circular space-time fields-of-view φ_1(t) and φ_2(t) respectively, are plotted with their fields-of-view through time shown as blue dotted lines. Both observers search for an individual moving with space-time trajectory ξ(t) throughout the study region Ω, shown as a red dashed line. The arrows denote the direction of travel. At time t*, the individual crosses into the mobile observer's field-of-view at location s*. Thus, at time t* the individual is encountered at location s*. Formally, at time t*, ξ(t*) ∈ φ_1(t*) ⊂ Ω.

Observers fall into two categories: static observers (e.g. hydrophones and camera traps) and mobile observers that may move through continuous space through time (e.g. vehicle-based observers). For each observer o ∈ O, we denote their position and field-of-view at time t as ξ_o(t) and φ_o(t) respectively. Unlike ξ_o(t), which defines a unique point in space at each time t, φ_o(t) defines a unique region or line. Thus φ_o(t) ⊂ Ω ∀t is a subset of the study region Ω. Both ξ_o(t) and φ_o(t) are assumed to arise from arbitrarily complex processes. We discuss these processes and our proposed approximation method (Definition 5.1) later.
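In discrete time, the geometry just introduced (whether an animal's position lies inside an observer's field-of-view) is straightforward to check. A minimal Python sketch with circular fields-of-view, as in Fig 5.2; the function name and coordinates are hypothetical:

```python
import numpy as np

def encounter_times(animal_xy, observer_xy, fov_radius):
    """Indices j at which the animal lies inside the observer's circular
    field-of-view, i.e. the discrete-time encounter condition
    xi_m(t_j) in phi_o(t_j). Both arrays hold (n_times, 2) coordinates."""
    dists = np.linalg.norm(animal_xy - observer_xy, axis=1)
    return np.flatnonzero(dists <= fov_radius)

# Toy trajectories sampled at three times (hypothetical coordinates)
animal   = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
observer = np.array([[0.0, 0.5], [0.0, 0.0], [0.0, 0.0]])
hits = encounter_times(animal, observer, fov_radius=1.0)  # times 0 and 1
```

A static observer is simply the special case where `observer_xy` repeats a single location.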
Note that the fields-of-view of two observers o, õ ∈ O overlap in T_l if φ_o(t) ∩ φ_õ(t) ≠ ∅ for some t ∈ T_l. We now describe the assumed data-generating mechanism that generates the realisation of encounter locations and times used for inference.

Under the assumption of perfect detectability, and assuming the observers are searching continuously throughout T_l, we say that an encounter of individual m occurs during time window T_l when m's movement trajectory ξ_m(t) intersects with one or more observers' field-of-view functions φ_o(t) for some time t ∈ T_l. That is, ξ_m(t) ∈ φ_o(t) for some t ∈ T_l and o ∈ O during the study. When observers only search during a discrete set of times t_j ∈ T_l : j ∈ {1, ..., J}, we say that a sighting occurs at time t_j if ξ_m(t_j) ∈ φ_o(t_j). It is these encounters that we perform inference on. The assumption of perfect detectability by an observer within their field-of-view may be unrealistic, and we allow for this assumption to be relaxed later. Fig 5.2 presents an example diagram displaying a setting with one moving individual and two observers (one static, one mobile) searching for it. The individual intersects observer 1's field-of-view at time t* and is encountered at s*. Thus, (s*, t*) becomes available for inference.

Target point process model

We assume that there exists an effort surface, denoted λ_eff(s, T_l), such that, by conditioning on λ_eff(s, T_l) within an inhomogeneous Poisson process (IPP) model, the UD π_m(s, T_l) can be recovered with any observer bias removed. Thus, λ_eff(s, T_l) is assumed to fully control for the bias caused by the heterogeneous observer effort that is driven by the complex processes generating the observers' fields-of-view and efforts. We use the following IPP model to describe the true data-generating mechanism. We assume that the number of encounters of m within any subregion A ⊂ Ω and time window T_l ⊂ T is Poisson-distributed with mean

Λ_obs(A, T_l) = ∫_A λ_obs(s, T_l) ds = ∫_A λ_true(s, T_l) λ_eff(s, T_l) ds.
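The Poisson mean above is typically approximated numerically in practice. A minimal Riemann-sum sketch in Python (the surfaces, grid, and function names are hypothetical illustrations, not the case-study model):

```python
import numpy as np

# Hypothetical true intensity and effort surfaces on the unit square
lam_true = lambda x, y: 50.0 * np.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / 0.05)
lam_eff  = lambda x, y: np.where(x < 0.5, 1.0, 0.25)   # more effort in the west

def expected_encounters(region_mask, grid, cell_area):
    """Riemann-sum approximation of Lambda_obs(A, T_l) = integral over A of
    lam_true * lam_eff, with the region A given by a boolean mask over cells."""
    x, y = grid[:, 0], grid[:, 1]
    return np.sum(lam_true(x, y) * lam_eff(x, y) * region_mask) * cell_area

# 50 x 50 grid of cell centres over the unit square
n = 50
xs = (np.arange(n) + 0.5) / n
gx, gy = np.meshgrid(xs, xs)
grid = np.column_stack([gx.ravel(), gy.ravel()])
west = grid[:, 0] < 0.5
whole = np.ones(n * n, dtype=bool)
Lambda_west = expected_encounters(west, grid, cell_area=1.0 / n**2)
Lambda_all = expected_encounters(whole, grid, cell_area=1.0 / n**2)
```

Because the toy true intensity is symmetric about x = 0.5 while the western effort is four times higher, the western half accounts for 1/1.25 = 80% of the expected encounters, even though the animal uses both halves equally; this is exactly the distortion λ_eff(s, T_l) is meant to absorb.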
We refer to λ_true(s, T_l) as individual m's true intensity surface. With λ_eff(s, T_l) able to fully capture the effort, π_m(s, T_l) ∝ λ_true(s, T_l). At location s during time window T_l, it is defined as:

λ_true(s, T_l) = lim_{ε→0} E[N(B_ε(s), T_l)] / |B_ε(s)|,

where B_ε(s) denotes a circle with centre s, radius ε, and area |B_ε(s)|. N(B_ε(s), T_l) denotes the number of encounters with m per unit effort within the circle during time window T_l. Given the above assumptions, λ_true(s, T_l) loosely represents the expected number of encounters with individual m's trajectory within an infinitesimally small region around point s, per unit effort. A flat λ_true(s, T_l) throughout Ω implies that m exhibits complete spatial randomness throughout Ω. Conversely, regions of high λ_true(s, T_l) indicate 'hotspots' of m.

Conditioned upon knowing λ_obs(s, T_l) and observing a point-pattern consisting of n_{T_l} points (i.e. a collection of n_{T_l} encounter locations) within the time set T_l ∈ T, Y_{T_l} = {s_i : i ∈ {1, ..., n_{T_l}}, s_i ∈ Ω}, the likelihood of the spatio-temporal IPP is [Simpson et al., 2016]:

π(Y_{T_l} | λ_obs) = exp{ |Ω| − ∫_Ω λ_obs(s, T_l) ds } ∏_{s_i ∈ Y_{T_l}} λ_obs(s_i, T_l),    (5.1)

where |Ω| denotes the area of the domain Ω.

In practice, we will not know the effort intensity surface λ_eff(s, T_l) and we will need to estimate it. When encounters are made in continuous time, and when overlap exists between the observers' fields-of-view, this quantity may be very complex. In this setting, accurately modelling this term would require the explicit modelling of the animal movement model within the observer's sampling process [Glennie et al., 2020]. In this Chapter, we show that even crude approximations of λ_eff(s, T_l) can improve the statistical inference of π_m(s, T_l). The crude approximation we use is based on the estimated path integrals of the observers' fields-of-view φ_o(t). When the locations of the observers are known (e.g. GPS positions), this can be computed by estimating φ_o(t) around each recorded location and then summing through time.
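The 'sum through time' computation just described can be sketched on a grid: place a circular field-of-view around each track segment and accumulate, per grid cell, the time during which the cell lies inside it. The following is a crude Python sketch with hypothetical names and values (each cell is reduced to its centre point, so the intersection area is approximated as all-or-nothing):

```python
import numpy as np

def effort_from_track(track_xy, times, cell_centres, cell_area, fov_radius):
    """Accumulate per-cell search effort from a GPS track: a cell receives
    cell_area * dt whenever its centre lies inside the circular field-of-view
    centred at the track segment's midpoint (a crude sketch)."""
    dt = np.diff(times)                          # duration of each track segment
    mid = 0.5 * (track_xy[1:] + track_xy[:-1])   # midpoint of each segment
    d = np.linalg.norm(cell_centres[:, None, :] - mid[None, :, :], axis=-1)
    return ((d <= fov_radius) * cell_area) @ dt  # (n_cells,) cumulative effort

# Toy example: an observer stationary at the origin for two time units
track = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
times = np.array([0.0, 1.0, 2.0])
cells = np.array([[0.0, 0.0], [10.0, 10.0]])
eff = effort_from_track(track, times, cells, cell_area=1.0, fov_radius=1.0)
```

Summing the returned vectors over all observers gives a gridded stand-in for the path integral of Definition 5.1 below (assuming, as there, no overlap between observers).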
When GPS positions are unavailable, the path integral can still prove a useful target quantity for building an effort emulator, or an effort model from a set of informative covariates.

Let |·| denote the area or length function. Assuming no overlap exists between observers, we define the path integral approximation to λ_eff(s, T_l) as follows:

Definition 5.1. The path integral approximation to λ_eff(s, T_l) within a region A_i ⊂ Ω is ∫_{T_l} Σ_{o∈O} |φ_o(t) ∩ A_i| dt.

When the entirety of A_i is observed throughout T_l by an observer o, the estimated cumulative effort in A_i from o becomes |A_i||T_l|. The degree of approximation error will depend on many factors, including the size of the fields-of-view relative to the size of Ω, the number of observers, the accuracy of the estimates of φ_o(t), and whether encounters were made in continuous time or discrete time. Modelling λ_eff(s, T_l) in discrete time is easier.

Log-Gaussian Cox processes as a suitable base model for estimating UDs

Linking an animal's space use to a set of covariates (e.g. sea surface temperature) is often a key component of an ecological analysis. The estimated relationships between encounter rate and environment allow researchers to predict variation in space use across space and time, possibly extrapolating into areas beyond the study area Ω, and into time windows beyond the temporal domain T. As with many popular regression-based SDM methods (linear models, GLMs, GAMs, etc.), λ_true(s, T_l) may be modeled with a collection of nonlinear transformations of covariates, interactions, and splines within a log-linear model:

log λ_true(s, T_l) = βᵀ x(s, T_l),    (5.2)

where x(s, T_l) denotes the set of measured covariates at location s ∈ Ω, assumed constant throughout time window T_l ⊂ T. This IPP may be inadequate for use in ecological settings [Pacifici et al., 2017]. The Poisson distribution assumed for the counts inside any subregion A ⊂ Ω implies that the variance of the counts is equal to the mean.
If the amount of environmental variability not captured by the modeled covariates is high, and the overdispersion is not controlled for, model-based confidence intervals can become overly narrow and suffer from poor frequentist coverage [Baddeley et al., 2015]. Spurious 'significance' of the associations between covariates and the intensity may then be reported, unless computationally intensive resampling methods, such as the block bootstrap, are performed [Fithian et al., 2015].

Cox process models extend the IPP by treating the intensity surface as a realisation of a random field [Baddeley et al., 2015]. This enables the variance-mean relationships of the point process models to be more flexible. The random fields from the Cox process models can be specified to capture spatial, temporal, and/or spatio-temporal correlations, helping to control for any unmeasured covariates and biological processes driving the true intensities of the studied individuals [Yuan et al., 2017].

A popular class of flexible random fields are Gaussian (Markov) random fields, letting λ_true(s, T_l) be a realisation of a log-Gaussian process [Simpson et al., 2016]. These models are called log-Gaussian Cox processes (LGCPs). Specification of an LGCP is achieved by adding a Gaussian process, denoted Z(s, T_l), to the linear predictor in (5.2):

log λ_true(s, T_l) = βᵀ x(s, T_l) + Z(s, T_l),    (5.3)
Z(S, T) = [Z(s_1, T_l(1)), ..., Z(s_n, T_l(n))]ᵀ ∼ N(0, Σ),    (5.4)

where Σ denotes the variance-covariance matrix of the Gaussian process evaluated at all of the n = Σ_{l∈L} n_l locations and time windows (s_i, T_l(i)). The function l(i) maps each observation to its corresponding time window. Different choices of covariance structure lead to Gaussian processes with fundamentally different properties and uses. R packages such as spatstat and inlabru can fit such models [Bachl et al., 2019, Baddeley and Turner, 2014].
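The variance inflation that motivates the Cox extension is easy to see numerically. The intensity values below are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# IPP-style counts: a fixed intensity forces variance = mean.
ipp_counts = rng.poisson(5.0, n)

# Cox-style counts: a log-Gaussian random intensity with the same mean
# (the -0.125 offset cancels E[exp(Z)] when Z has sd 0.5) is overdispersed.
z = rng.normal(0.0, 0.5, n)
cox_counts = rng.poisson(5.0 * np.exp(z - 0.125))

print(ipp_counts.var() / ipp_counts.mean())   # close to 1
print(cox_counts.var() / cox_counts.mean())   # well above 1
```

The Poisson counts have a variance-to-mean ratio near 1, while mixing over a random log-Gaussian intensity inflates the ratio well above 1.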
We choose the LGCP as the base model for our framework due to its flexibility.

Covariates to account for detection probability and observer effort

In practice, we can rarely satisfy the previous assumption of perfect detectability. Thus, we now relax this assumption and assume the existence of a 'detection probability surface', p_det(s, T_l), that can describe the heterogeneous detectability within the earlier point process model. Additional covariates may be available that can model both the heterogeneous detectability [e.g. visibility indices, distance from the observer, etc., Fithian et al., 2015] and/or the heterogeneous cumulative effort of each observer (e.g. distance from the nearest road). If these covariates are included in their correct functional forms, this regression-adjustment approach may help to capture some of the heterogeneity in the observer effort and partially remove the associated biases [Dorazio, 2014]. For applications with strongly informative covariates, such approaches have been shown to significantly improve predictive performance in SDMs [Elith and Leathwick, 2007, Fithian et al., 2015].

We model p_det(s, T_l) with a set of covariates, w_1(s, T_l), that are believed to affect only the observers' abilities and not to influence the true intensity of the individual of interest. These covariates are assumed constant throughout each time window T_l. By definition, the values of p_det(s, t) are constrained to lie between 0 and 1. A value of 1 implies that every instance where the individual's trajectory intersects an observer's field-of-view leads to a recorded encounter. This could reflect a scenario where an easily-detected individual is in the immediate proximity of an observer under perfect weather conditions. A value less than 1 implies that some encounters are missed or not recorded. Thus, this helps to capture the processes that drive the under-reporting seen in many ecological studies. In the context of point processes, p_det(s, t) is known as a thinning function.
T_l-averaged sea state and T_l-averaged visibility are two example covariates that could be included in w_1(s, T_l). Distance sampling functions can also be included, following the approach of Yuan et al. [2017].

Similarly, we can model λ_eff(s, T_l) with a set of covariates, w_2(s, T_l), within a log-linear model. As before, the covariates used to explain observer effort are assumed not to directly impact the true intensity of the animals. This time, these covariates are believed to explain both the spatial distributions of the observers and any differences in their efficiencies. When differences in observer efficiency exist, the earlier approximation to λ_eff(s, T_l) seen in Definition 5.1 becomes ∫_{T_l} Σ_{o∈O} ω_o |φ_o(t) ∩ A_i| dt, with the relative efficiencies captured in the weights ω_o. This can help with the selection of relevant covariates and with model formulation. We advise modelling the observer efficiencies ω_o within the log-linear model, and not within the earlier probability surface, to avoid placing an upper bound on the efficiency of an observer. This allows for the existence of observers with higher skill levels than those who collected the data used to fit the model. Distance from road and observer type are two example covariates.

Our joint model for the true species' intensity, detection probability, and observer effort is:

λ_obs(s, T_l) = λ_true(s, T_l) p_det(s, T_l) λ_eff(s, T_l)    (5.5)
g⁻¹(p_det(s, T_l)) = γ_1ᵀ w_1(s, T_l)    (5.6)
log λ_eff(s, T_l) = γ_2ᵀ w_2(s, T_l)    (5.7)
log λ_true(s, T_l) = βᵀ x(s, T_l) + Z(s, T_l),    (5.8)

with g a suitable link function (e.g. the logistic function) mapping the linear predictor of the detection probability surface to the unit interval. By assumption, π(s, T_l) = λ_true(s, T_l) / ∫_Ω λ_true(s′, T_l) ds′.

Suppose the encounters are made by a collection of observers o ∈ O whose unique observer efficiencies are modeled with unique intercepts with respect to a baseline observer type.
Then, the λ_true(s, T_l) being modeled is interpreted as the expected encounter rate, at location s ∈ Ω during T_l, of the individual by the chosen baseline observer. The log-linear model then ensures that any differences in the observer efficiencies are modeled multiplicatively. Crucially, the interpretation of λ_true(s_1, T_l) = 2 λ_true(s_2, T_l) is that, in the long run, the individual will occupy the area immediately around s_1 twice as often as that around s_2. Finally, if a series of encounters/detections is made where the entirety of Ω is perfectly observable, then both p_det(s, T_l) and λ_eff(s, T_l) should be fixed equal to a constant. Two examples are when an ultra-high-resolution satellite image containing the entirety of Ω is taken, and when telemetry data are available. In the latter case, care must be taken to properly subset the telemetry data to ensure that any autocorrelations from the individual's movement are removed.

Estimation of the non-intercept terms within the detection (γ_1), effort (γ_2), and environmental (β) parameters is possible, so long as the corresponding covariates x, w_1, and w_2 are not linearly dependent and do not interact [Dorazio, 2014]. Non-perfect correlation between the three sets of covariates, whilst making estimation more difficult, does not affect the identifiability of these parameters [Fithian et al., 2015]. This is a desirable property in the context of UDs. The intercept parameters are estimable if either there is independence between x, w_1, and w_2, or at least one accurate control dataset is included in the joint model [Fithian et al., 2015]. A control could be a survey with known observer effort. When the intercept is desired, the animal movement process should be considered to reduce bias [Glennie et al., 2020]. Furthermore, the estimability of observer-specific intercepts (i.e. relative observer efficiencies) will require significant spatial overlap to exist between the cumulative observer efforts of the different observers.
This is especially true when spatially correlated Z(s, T_l) terms are included, since any differences in observer efficiencies may be erroneously captured by the Z(s, T_l) terms. Of course, the assumption of non-overlapping fields-of-view at all times t ∈ T_l is still required. If the above conditions hold, then after model fitting, fixing both sets of covariates w_1(s, T_l), w_2(s, T_l) equal to a constant allows the effects of variable detection probability and observer effort to be removed from predictions throughout Ω.

Approximating effort from GPS-tagged observers

In many situations, ∫_{T_l} |φ_o(t) ∩ A_i| dt may be known or directly estimable for a set of pixels A_i ⊂ Ω used to approximate (5.1). For example, mobile observers may record their GPS coordinates, or may be known to travel along a strict network of known routes (e.g. shipping lanes). Similarly, the locations of static observers (e.g. camera traps) and their detection ranges may also be known. In both cases, a simple yet effective approach for approximating λ_eff(s, T_l) can be implemented. Given φ_o(t), we can include |A_i|⁻¹ ∫_{T_l} |φ_o(t) ∩ A_i| dt as a fixed covariate within w_2 to represent the average effort within A_i. Then, only the corresponding slope term ω_o in γ_2ᵀ needs to be estimated. When only one observer type is available, including the logarithm of |A_i|⁻¹ ∫_{T_l} |φ(t) ∩ A_i| dt as an offset (i.e. fixing ω ≡ 1) is all that is required.

The performance of the above approach may deteriorate as the degree of overlap between the fields-of-view of the observers increases. One solution is to remove the effort and encounters from overlapping observers. Alternatively, it may be possible to model the inter-observer autocorrelations directly [Clare et al., 2017].
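The logic of the offset approach can be seen in a pixelwise sketch: under the offset formulation, pixel counts are Poisson with mean λ_true,i × effort_i, so with a single observer type the effort-corrected maximum likelihood estimate of each pixel's intensity is simply counts divided by effort. Function names here are illustrative:

```python
import numpy as np

def ud_estimates(counts, effort):
    counts = np.asarray(counts, dtype=float)
    effort = np.asarray(effort, dtype=float)
    # Naive UD estimate: normalized raw counts, ignoring effort.
    naive = counts / counts.sum()
    # Offset-corrected estimate: counts_i / effort_i is the pixelwise MLE
    # of lambda_true under counts_i ~ Poisson(lambda_i * effort_i).
    # Pixels with zero effort carry no information and are set to NaN.
    rate = np.where(effort > 0,
                    counts / np.where(effort > 0, effort, 1.0),
                    np.nan)
    corrected = rate / np.nansum(rate)
    return naive, corrected
```

For example, a pixel with 8 encounters under 4 units of effort and a pixel with 2 encounters under 1 unit of effort have the same corrected intensity, even though the raw counts differ fourfold. In practice, the same correction is obtained by including log-effort as an offset in the point process likelihood rather than by this pixelwise ratio.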
We demonstrate the utility of the path integral approximation in a simulation study in Section 5.4.

Generalising the base model with the addition of marks

Often, the utilization distributions of multiple individuals of a species or population, and their changes through time, are desired [Elith and Leathwick, 2007, Fithian et al., 2015]. Furthermore, understanding the factors driving the individuals to use the space may also be of importance to researchers. Spatio-temporal point processes can be further generalized to marked spatio-temporal point processes to allow a greater range of research questions to be tackled [e.g. Chakraborty et al., 2011]. The main idea of marked point processes is that for each point, we observe attributes in addition to its location and time. These attributes are called marks. Marks might be categorical variables, such as whale pod or an indicator of foraging behaviour; count variables, such as group size; or continuous variables, such as travel speed.

Formally, we associate a random variable m_y (a mark) to each location and time window of the random set y ∈ Y_{T_l}. We place a probability distribution on each of the marks, and model the joint distribution of the locations and marks. Let M denote the support of the distribution of marks m_y. The mark distribution is allowed to depend upon space and time (i.e. depend upon y and T_l), but is not allowed to depend on the other points in Y_{T_l}. Thus, the m_y for different y ∈ Y_{T_l} are independent. Now the pair (Y_{T_l}, m_Y) may be viewed as a random variable Y⋆_{T_l} in the product space Ω × M. There is no limit to the number of marks that can be associated with each point. We simply need to include a joint probability distribution for the J marks m^j_{Y_{T_l}} : j ∈ {1, ..., J}.

When the mark distribution of one of the J marks is discrete (i.e. when m^j_{Y_{T_l}} ∈ {1, ..., K}), as in the case of individual ID, we can estimate the probability that the presence of an individual at a given location s ∈ Ω within T_l ⊂ T has jth mark equal to k ∈ {1, ..., K}.
For notational simplicity, let J = 1 and define λ_true(s, T_l, k) to be the true intensity for the mark category k during T_l (i.e. {λ_true(s, T_l, m) : m_{s,T_l} = k}). The probability at location s is then:

p_true(s, T_l, k) = λ_true(s, T_l, k) / Σ_{κ=1}^{K} λ_true(s, T_l, κ)
                  = exp(Z(s, T_l, k) + βᵀ x(s, T_l, k)) / Σ_{κ=1}^{K} exp(Z(s, T_l, κ) + βᵀ x(s, T_l, κ)).    (5.9)

In many cases, computing and plotting estimates of p_true(s, T_l, k), the true mark-specific probabilities, will be the inferential target. When k denotes the individual ID, estimates of p_true(s, T_l, k) can help to establish spatial niches specific to the kth individual, whilst removing any observer and detectability biases. Note that when K denotes the number of individuals of a target population, and when Ω is sufficiently large, the normalized denominator of (5.9) reflects the distribution of the whole population. When some or all of the parameters and/or covariates are shared between the mark-specific intensities, cancellations will occur in (5.9).

The proposed model framework

Let Ω, T, and T_l ⊂ T : l ∈ L be defined as before. Let the support of the marks be M. Suppose for each T_l we have a collection of encounter locations and marks (Y_{T_l}, M_{Y_{T_l}}) = {(s_i, m_i) : (s_i, m_i) ∈ Ω × M}. The proposed model is:

λ_obs(s, T_l, m) = λ_true(s, T_l, m) p_det(s, T_l, m) λ_eff(s, T_l, m)    (5.10)
g⁻¹(p_det(s, T_l, m)) = γ_1ᵀ w_1(s, T_l, m)    (5.11)
log λ_eff(s, T_l, m) = γ_2ᵀ w_2(s, T_l, m)    (5.12)
log λ_true(s, T_l, m) = βᵀ x(s, T_l, m) + Z(s, T_l, m).    (5.13)

For estimates of λ_true(s, T_l, m) under the above framework to be free from confounding by effort, and for estimates of individual-environment relationships to be accurate, many assumptions are required in addition to those highlighted earlier. As shown in the causal directed acyclic graph (DAG) in Fig 5.3 [Hernan and Robins, 2020], one of the fundamental assumptions required for estimates of λ_true(s, T_l) to be free of confounding is that λ_eff(s, T_l, m) fully describes the efforts of the observers through space and time across the marks.
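Equation (5.9) is a softmax over the mark-specific log-intensities, and it can be evaluated stably by subtracting the row maximum before exponentiating. A minimal sketch with illustrative names:

```python
import numpy as np

def mark_probs(log_lam):
    # log_lam: (n_locations, K) array holding Z(s, T_l, k) + beta' x(s, T_l, k)
    # for each location and mark category k. Returns p_true(s, T_l, k).
    shifted = log_lam - log_lam.max(axis=1, keepdims=True)  # numerically stable
    lam = np.exp(shifted)
    return lam / lam.sum(axis=1, keepdims=True)
```

Because (5.9) is a ratio, any term common to every mark cancels, which is why shared parameters and covariates drop out.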
This is achieved when either the covariates w_2(s, T_l, m) completely explain the efforts of the observers, or known or accurate estimates of observer effort are included in w_2(s, T_l, m). For estimates of the UD to be free of confounding, only the relative efforts of the observers need be known or estimable. In either case, with unknown effort, the existence of unobserved covariates of effort can confound estimates of λ_true(s, T_l, m) in two ways.

Figure 5.3: A plot showing the assumed causal DAG for the proposed framework with the detection probability assumed constant. An arrow between a variable set A and a variable set B indicates that at least one variable exists in both sets with a direct causal effect between them. The causal Markov assumption is made, such that a variable is independent of its non-descendants when conditioned on its parents [Hernan and Robins, 2020]. If any of the causal effects found within the red shapes exist, then problematic confounding may follow. This is explained in depth in the supporting material found in the Appendix.

Firstly, if the unobserved effort covariates affect one or more observed environmental covariates x(s, T_l, m), then the corresponding effect estimate β, and hence the individual's intensity λ_true(s, T_l, m), may remain confounded by the effort. For example, suppose effort is unknown for a dataset containing encounters with marine-based individuals. Suppose that distance-to-shore is not included as a covariate, despite strongly impacting search effort. Furthermore, suppose chlorophyll-A, which has high values closer to shore, is included in x(s, T_l, m). Estimates of the effects of chlorophyll-A in βᵀ will likely be confounded by effort, which in turn will bias the estimates of λ_true(s, T_l, m).

Secondly, even if the unobserved effort covariates are independent of x(s, T_l, m), residual spatio-temporal correlations in the sightings data driven by the unobserved effort covariates may be erroneously captured by the Gaussian process Z(s, T_l, m).
Consequently, estimates of λ_true(s, T_l, m) may remain confounded by the heterogeneity in the observer effort. At this point, one may be tempted to simply include a unique Gaussian process in λ_eff to capture these missed covariates. This cannot be done: without additional knowledge available to adequately constrain the additional Gaussian process, it will be non-identifiable. Similar problems occur if a detection probability surface p_det(s, T_l, m) is estimated. Once again, the true detectability of the species must be fully captured by w_1(s, T_l, m).

Removing the confounding of λ_true(s, T_l, m) by effort will be challenging for many ecological applications, and this highlights the need for the increased collection of effort information along with sightings data. However, Fig 5.3 shows that if effort is known, then conditioning on it can remove all of the problematic confounding at location s. Furthermore, in the simulation study in Section 5.4, we show that even crude estimates of observer effort and detectability can dramatically improve estimates of λ_true(s, T_l, m), compared with simply ignoring effort altogether. The issues of confounding are not exclusive to our framework and are present across all methods for estimating both UDs and SDMs [Fithian et al., 2015, Koshkina et al., 2017]. Thus, these concerns should not be seen as a weakness of our framework, but instead as a weakness inherent to biased data-collection protocols.

If observer effort is limited to a small subregion Ω_0 ⊂ Ω, such that λ_eff(s, T_l, m) = 0 ∀ s ∈ Ω ∩ Ω_0^C, then additional assumptions must be placed on how Ω_0 was selected. For example, if search effort was focused in regions where the species' intensity was expected to be highest, then extrapolated estimates of λ_true(s, T_l, m) into the regions of zero effort Ω ∩ Ω_0^C may remain biased. Estimates of the true intensity λ_true(s, T_l, m) in these regions may be too high, with estimates of the intercept positively biased. This issue is known as preferential sampling [Pennino et al., 2019].
This highlights the benefits of conducting high-quality surveys with randomized effort. By choosing Ω_0 at random, no systematic bias is expected in predictions into Ω ∩ Ω_0^C, and extrapolation of the intensity throughout Ω can be performed with greater confidence. When there is little-to-no confidence that Ω_0 was selected in a random manner, predictions should be constrained to lie within Ω_0.

An important advantage of the framework is that data from different observers, and of differing types, can be combined to jointly estimate one intensity surface [Koshkina et al., 2017]. This is achievable because the intensity given by (5.10) can be linked to the likelihoods of several common data types, including aggregated forms. For example, the logistic regression likelihood for binary site presence-absence data, the Poisson likelihood for site count data, and the LGCP likelihood for presence-only data can all be derived from the intensity (see Hefley and Hooten [2016] and the supporting material found in the Appendix). Distance sampling methods have also been fit using LGCPs (see Yuan et al. [2017] for details). All that is required for the suitable combination is that encounters with individuals are approximately independent snapshots of their UDs, and that any heterogeneous effort or detectability can be suitably controlled for. This is a clear demonstration of the unifying potential of the LGCP for ecological data, and the causal DAG of Fig 5.3 is a useful tool to assess whether the assumptions needed to apply model (5.10) are satisfied. A short summary of how to approximate the LGCP likelihood, and additional details on the DAG, are provided in the Appendix.

5.4 Simulation study

We now present a simulation study to demonstrate the ability of the framework to combine encounter data from mobile and static observers and predict an animal's UD with both minimal bias and high precision.
We first simulate the movements of observers and an animal, and generate encounters following the earlier data-generating mechanism. Next, we use the recorded locations of the observers to compute the approximation to effort that was introduced earlier in Definition 5.1. We ignore issues of overlap. We then plug this estimate of effort into a point process model and attempt to recover the UD from the generated encounters. Note that we do not attempt to explicitly model the movements of the animal within the sampling process model, as in Glennie et al. [2020]. Despite the use of a crude approximation for effort, we show that the framework offers improvements in prediction performance, even when the analyst incorrectly specifies φ_o(t) throughout time. Based on these results, we provide a list of recommendations for analysts.

Figure 5.4: A plot showing the long-run densities π_m(x, y) of the animal and π_ξ(x, y) of the observers. The top-right and bottom-right plots show the low and high observer bias settings respectively. The smaller Brownian motion variance leads to a higher concentration of effort in the North of the study region and hence a larger degree of observer bias.

We simulate the movements of an animal and a set of observers using the model of Brillinger et al. [2012]. In particular, we use a stochastic differential equation (SDE) with potential functions chosen to ensure a desired long-run behaviour. The animal's potential function is chosen to be the logarithm of a symmetric bivariate normal distribution centered at (μ_x, μ_y) = (50, 50), with variance 100. The observers' potential function is specified as the logarithm of a univariate half-normal distribution centered at μ_y = 100, with variance 200.
The variance of the Brownian motion terms driving the movements is fixed at 2 for the animal, and at either 2 or 8 for the observers. Thus, the animal's UD is a symmetric bivariate normal distribution, while the observers' UD is a univariate normal distribution which focuses their efforts in the North of the study region (Fig 5.4). The study region is a square with side lengths equal to 100 arbitrary units. The observers are given circular fields-of-view with a maximum range of 10 units. These settings imply that the study region is very small.

We discretize time and use a first-order approximation to generate paths from the continuous-time SDE. Thus, both the simulated movements and the potential encounter events occur across the discrete time steps. The average distance travelled at each time step is roughly 1.75 units for the animal, and either 1.75 or 3.5 units for the mobile observers, depending on whether the variance of the Brownian motion is 2 or 8. At each time step, if the animal is closer than 10 units of distance from an observer, it is encountered with a probability that decays linearly from 1 to 0 as the distance from the observer increases from 0 units to 10 units. If the animal is detected within 500 time steps, representing a single 'trip', then the encounter location is recorded along with the observers' tracks. If no encounter occurs during the trip, then only the observers' tracks are recorded. For each trip, we randomly sample the initial locations of the animal and the observers from their respective UDs. Subsequent locations are restricted from leaving the study region. For static observers, we simply hold their initial locations fixed through time. The fields-of-view of all observers may overlap.

For each simulation iteration, we then repeat the above steps 150 or 300 times to generate 150 or 300 trips. We fit an IPP model with correctly specified parametric form to the encounter locations, with and without effort adjustment.
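A minimal Euler-Maruyama sketch of this kind of potential-driven movement (Langevin dynamics whose stationary density is the bivariate normal UD) is given below. The constants mirror the settings above, but the discretization details are illustrative rather than the thesis's exact implementation:

```python
import numpy as np

def simulate_path(n_steps, dt=0.1, mu=(50.0, 50.0), ud_var=100.0,
                  bm_var=2.0, seed=1):
    # dX = (bm_var / 2) * grad log pi(X) dt + sqrt(bm_var) dW, where pi is
    # the N(mu, ud_var * I) utilization distribution, so the drift reduces
    # to -(bm_var / 2) * (X - mu) / ud_var.
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    x = mu.copy()
    path = np.empty((n_steps, 2))
    for i in range(n_steps):
        drift = -(bm_var / 2.0) * (x - mu) / ud_var
        x = x + drift * dt + np.sqrt(bm_var * dt) * rng.standard_normal(2)
        x = np.clip(x, 0.0, 100.0)  # keep paths inside the 100 x 100 square
        path[i] = x
    return path
```

Long simulated paths concentrate around (50, 50), matching the intended long-run density; larger `bm_var` values produce faster, more diffuse movement, as with the observers in the high-variance setting.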
We compute a crude approximation of effort as follows. The observers' paths are mapped to a coarse 100 × 100 grid of pixels. Next, estimates of their fields-of-view, φ̂_o(t), are computed, and then path integrals of φ̂_o(t) are taken over the grid. Log-values of these path integral approximations are then included as an offset within the IPP. Here, estimates of effort are summed across the observers, ignoring any overlap in their fields-of-view. We also present results from a method that controls for overlapping fields-of-view in the supporting material found in the Appendix, but only minor improvements in predictive performance are seen. For model comparison, at each simulation iteration we compute both the mean squared prediction error (MSPE) of the animal's UD across the grid of pixels, and the bias of the estimated y-axis center of the UD, μ̂_y. The MSPE is computed with respect to the true UD.

To understand how the method performs in practice, we change both the data-generating mechanism (DGM) and the assumptions made by the analyst when formulating estimates of observer effort. For the DGM, we adjust the degree of observer bias from 'high' to 'low' by changing the variance of the Brownian motion driving the observers' motions from 2 to 8 respectively. We also change the number and type of observers (mobile and/or static) and the number of trips made (150 or 300). For estimating effort, we either assume perfect detectability across φ̂_o(t), or we model the linearly decaying distance sampling function. Next, we either underestimate, correctly specify, or overestimate the detection range of φ̂_o(t), at 2, 10, and 50 units respectively. We perform 100 replications of each setting.

5.4.1 Effects of observer effort and detection range misspecification

Fig 5.5 and Fig A.23 in the supporting material found in the Appendix demonstrate that improvements in both prediction variance and bias can be attained with the approximate effort-correction approach.
These benefits areseen across all the observer types (i.e. static, mobile, and combinations) andthe typical performance of the bias-corrected method is seemingly insensitiveto the degree of the observer bias. In contrast, the performance of theuncorrected model is greatly affected by both the level of observer bias andobserver type. In particular, the uncorrected model performs poorly whenone ignores large observer bias. The results from both the 150 and 300 tripsettings are similar and so we aggregate the results when forming the plots.The performance of the bias-correction approach is sensitive to the an-alyst using the correct detection ranges of the observers to define the ob-servers’ fields-of-view φo(t) (Figs A.24 and A.25). However, improvements157Low Observer Bias High Observer Bias1 Mobile20 Static1 Mobile + 20 Static20 Mobile1 Mobile20 Static1 Mobile + 20 Static20 Mobile0.00e+002.50e−085.00e−087.50e−081.00e−071.25e−07Observer TypeMean Squared Prediction ErrorQuantity Bias−Corrected Model Uncorrected ModelFigure 5.5: A plot showing the mean squared prediction error(MSPE) of the animals UD under the bias-corrected and bias-uncorrected models vs the types of observers. From left to rightare the results from one mobile observer, twenty static observers,twenty static with one mobile observer, and twenty mobile ob-servers. The degree of observer bias is changed from low to highin the columns. The red solid lines and the blue dashed linesshow the median MSPE along with robust intervals computedas ±2cMAD from the Bias-corrected and uncorrected modelsacross the 100 simulation replicates respectively. The medianabsolute deviations about the medians (MAD) have been scaledby c = 1.48. This ensures that the intervals are asymptoticallyequivalent to the 95% confidence intervals that would be com-puted if the MSPE values were normally distributed. 
Note thathere all the analyst’s assumptions correctly match the true data-generating mechanism, albeit with any overlap in the observers’efforts ignored.158in both the prediction variance and the bias of UD center-estimates can stillbe seen, even with badly misspecified detection ranges. Underestimating theobservers’ detection range leads to an over-correction of observer bias andleads to estimates of the animal’s UD center to be negatively biased. Theconverse is true when the observers’ ranges are overestimated. Both lead toincreases in MSPE. Interestingly, the bias-correction method appears insen-sitive to whether or not a distance sampling function is used.The MSPE is a measure of predictive performance that is driven by boththe squared prediction biases and the variances of the predictions. For theuncorrected model, the heterogeneous observer effort is the major cause ofprediction bias. This is expected to decrease as the number of observersincreases, due to the study region becoming increasingly explored. For theeffort-corrected model, two major causes of prediction bias remain in ourcrude approach for approximating effort. The first such cause is due tothe approximation error of using the path integrals of the fields-of-view torepresent the cumulative effort. Even if φo(t) is correctly specified at alltimes, approximation error will remain due to the statistical dependencebetween the encounter/non-encounter events through time. To understandthis, suppose an observer has failed to record an encounter for a significantperiod of time. Conditional upon this information, the current location ofthe animal is unlikely to be situated within the immediate proximity of theobserver. Accurate estimates of the true effort λeff (s, t) would need to beadjusted to account for this fact. This would require explicitly modelling theanimal’s movement process jointly within the sampling model [Glennie et al.,2020]. 
The second cause of prediction bias is overlap in the observers' fields-of-view. This worsens as the density of observers increases. Multiple factors impact the variance of the predictions. For both models, the variance of the predictions decreases as the number of encounters increases. For the effort-corrected model, longer observer paths further reduce the variance. Both the encounter frequency and the cumulative observer path length increase with the number of observers.

For the uncorrected model, we indeed see that the MSPE decreases as the number of observers is increased. The largest improvements are seen with the addition of mobile observers, due to the study region being increasingly explored. With twenty mobile observers in the low bias setting, the impact of the observer bias on the prediction performance of the uncorrected model is negligible, and it outperforms the effort-corrected model (Fig 5.5). Conversely, in the low observer bias setting, the MSPE from the effort-corrected approach is found to increase with the number of observers. Here, increases in the squared prediction bias dominate any possible reductions in the variance of predictions. In the high bias setting, the reverse relationship is seen. Here, the reductions in prediction variance offered by the increased number of observers offset any increases in the squared prediction bias. Fig A.23 shows that in this setting, the variability in the estimates of the UD center decreases substantially as the number of observers increases.

We demonstrate the claims made above in an additional simulation study, explained in depth in the supporting material found in the Appendix. In it, we change two simulation settings. First, we increase the speed of the movements across each time step to reduce the autocorrelation between the encounters. This moves the simulation from the pseudo continuous-time encounter setting to a more discrete-time setting.
Second, we fit an additional effort-corrected model that directly accounts for the overlap between the observers' fields-of-view. This 'overlap-corrected model' is found to completely eliminate the estimation bias of the UD center (Fig SA.28). Interestingly, however, no change in the MSPE is witnessed relative to the previous effort-corrected model (Fig SA.29). Thus it appears that the bias reduction from the overlap-correction approach comes at the cost of an increased variance of the UD predictions. Both effort-corrected models outperform the uncorrected model with respect to MSPE across all levels of observer bias.

In summary, it appears that the benefits of effort-correction can be attained even when little is known about the precise nature of the observer effort. As long as the animal's UD remains reasonably constant throughout the trips, crude attempts at effort correction appear to be better than ignoring effort in most settings. Furthermore, the path integral of the observers' fields-of-view appears to control for effort reasonably well. When the degree of observer bias is expected to be high, as is expected in our case study, it appears that this form of effort-correction can lead to dramatic improvements in predictive performance, without the need to consider observer overlap or to explicitly model the animal movement.

5.5 Application to empirical data

Special considerations required for our motivating problem

To demonstrate the utility of our modelling framework, we apply it to the southern resident killer whale (SRKW) data. We partition the temporal domain (May - October) T into months T_l : l ∈ {May, ..., October} and assume the intensities, and hence the UDs of the pods, are constant within each month. We denote the day as d ∈ {1, ..., N_{T_l}} and the year as y ∈ {2009, ..., 2016}. We assume that no changes to the UDs occur between 2009-2016.
Our motivating dataset contains several special features that require careful consideration.

As mentioned in Section 5.2.2, the pod identities (J, K, or L) of the sightings can be considered known. We denote the pod identity for a sighting with the discrete mark m, and we consider the pods as our 'individuals'. Pods are often found swimming together in 'super-pods'. We break up sightings of super-pods into their individual components. For example, if a sighting of super-pod JK is made (i.e. J and K are found together), then we record this as a sighting of pod J and a sighting of pod K and ignore the potential interaction.

The data are heavily autocorrelated. Sightings are often made of the same pod in quick succession, and the locations of whale pod sightings are shared between whale-watch operators. In fact, once a pod has been sighted, it is rarely lost by the tour operators for the remainder of the day. To remove the autocorrelation, we consider only the first sighting per day of each pod, discarding all repeated sightings made within a day. Importantly, for each day and for each pod, we also discard all predicted effort that occurs after the initial sighting. Because whales move quickly relative to |Ω|, an overnight window between sightings is sufficient to remove the autocorrelation between sightings. Next, we estimate the cumulative monthly observer effort from all observers. The effort is summed across the 8 years.

Incorporating the observer effort from the DFO data

The daily GPS tracklines of the DFO vessel prior to each initial SRKW sighting are used to approximate the DFO's observer effort. The GPS data are irregular, with a typical resolution of around 15 seconds. We predict the locations at regular 30 second intervals using a continuous-time correlated random walk model fit to each trip using the crawl package [Johnson and London, 2018, Johnson et al., 2008]. We denote the approximate locations and effort as ξ_DFO(t, y, d) and E^obs_DFO(s, y, T_l, m) respectively.
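The interpolation step just described can be sketched in a language-agnostic way. The analysis itself fits a continuous-time correlated random walk with the R package crawl; the sketch below substitutes simple linear interpolation as a stand-in, with invented GPS fixes, purely to show the regularization of an irregular track onto a 30-second grid.

```python
# Hypothetical irregular GPS fixes: (seconds since trip start, x km, y km).
# Real fixes arrive roughly every 15 s; values here are invented.
fixes = [(0, 0.00, 0.00), (14, 0.10, 0.02), (31, 0.22, 0.05), (62, 0.48, 0.12)]

def interp(t, fixes):
    """Linearly interpolate the track at time t (a crude stand-in for the
    fitted continuous-time correlated random walk used in the analysis)."""
    for (t0, x0, y0), (t1, x1, y1) in zip(fixes, fixes[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
    raise ValueError("t outside the observed track")

# Predicted positions on the regular 30-second grid used for the effort sums.
track_30s = [interp(t, fixes) for t in range(0, 61, 30)]
print(len(track_30s))  # positions at t = 0, 30, 60 s
```

Each 30-second position then contributes a fixed increment of effort to whichever integration region contains it, which is how the path integrals of the fields-of-view are approximated in practice.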
Next, we count up the number of predicted points that fall into a set of polygonal regions A_i used to approximate (5.1). Thus, we assume that at each 30 second interval, the observer's field-of-view φ_DFO(t, y, d) is uniform throughout the A_i that contains the vessel. Thus,

∫_{A_i} E^obs_DFO(s, y, T_l, m) ds ≈ Σ_d ∫_{T_l} I{ξ_DFO(t, y, d) ∈ A_i} dt.

The A_i are approximately circular with radius 2.6 km (see Fig A.31). Note that we only have the location of the vessel during encounters. Results from the simulation study suggest that these steps are unlikely to significantly impact the analysis, given the large number of boat tracks available, the large Ω, and given that our assumed maximum detection range of 2.6 km for φ_DFO(t) is likely not orders of magnitude from the truth. Effort is scaled into units of hours.

Estimating the whale-watch observer effort

To incorporate the observer effort from the whale-watch vessels, we build a stochastic emulator of the cumulative 'boat-hours' spent in each of the integration points A_i by the whale-watch companies for each day, month, and year under study. We refer to the cumulative pod-specific monthly whale-watch observer effort intensity as E^obs_WW(s, y, T_l, m). Because the whale-watch sightings are not linked to a specific vessel, we assume throughout that the observer efficiencies across the whale-watch vessels are constant. We do not adjust for overlap between the fields-of-view of the vessels. The density of boats within the study region is expected to be far smaller than it was in the simulation study with twenty vessels, and the degree of observer bias is very high, suggesting that the results should be accurate. Note that the assumptions made on φ_WW(t, y, d) match those of φ_DFO(t, y, d).

For each day and for each pod, we first record the number of hours into the operational day at which the initial discoveries were made. We denote this τ. We assume that the daily operational period for the whale-watch companies is 9am - 6pm [Seely et al., 2017], thus τ ∈ [0, 9].
As an example, suppose that on a given day, pods J and K were both sighted at 12pm and pod L was never sighted. Then τ would be recorded as 3 hours for pods J and K and 9 hours for pod L. To account for the changing effort throughout the day, we use the numbers of vessels reported by Soundwatch to be in close proximity with whales by hour of day as our proxy for whale-watch effort intensity. We then use these reported values to estimate a cumulative distribution function F_E(τ) for the proportion of total whale-watch effort spent τ hours into the day. Thus, for an initial sighting of a pod made τ hours into the day, F_E(τ) represents the fraction of total whale-watch effort spent prior to that sighting.

Let (τ_{m,y,T_l,d})_{d=1}^{N_{T_l}} denote the number of hours after 9am when the first sighting of pod m, in year y, in month T_l, and on day d occurs. Under the assumption that an overnight window removes the autocorrelation between the SRKW locations, the fraction of total WW observer effort spent prior to the initial sightings of pod m in a given month/year is:

(1 / N_{T_l}) Σ_{d=1}^{N_{T_l}} F_E(τ_{m,y,T_l,d}).    (5.14)

Next, we need to estimate the maximum possible number of boat hours of observer effort for each year, month, and day. We denote it E_WW(y, T_l, d). We will then multiply the year-month sums E_WW(y, T_l) = Σ_{d=1}^{N_{T_l}} E_WW(y, T_l, d) by the fraction (5.14). The result will be an estimate of the observer effort associated with the initial sightings. This requires some strong assumptions that are detailed in the supporting material found in the Appendix, including that the average spatial distribution of the whale-watch boat observer effort is constant throughout the day.

Soundwatch reports on: the number of active whale-watch ports per year, the maximum number of trips departing each day from each port, the changing number of daily trips across the months, and the duration (in hours) of the trips from each port.
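The construction of F_E and the fraction (5.14) described above can be sketched as follows. The hourly vessel counts and sighting times below are invented for illustration; they are not Soundwatch's reported values.

```python
# Hypothetical Soundwatch-style proxy: vessel counts near whales for each of
# the nine operational hours (9am-6pm). These numbers are illustrative only.
hourly_counts = [2, 4, 8, 10, 10, 9, 7, 4, 2]
total = sum(hourly_counts)

def F_E(tau):
    """Fraction of total daily whale-watch effort spent by tau hours into
    the day, built from the hourly proxy (linear within each hour)."""
    full, frac = int(tau), tau - int(tau)
    spent = sum(hourly_counts[:full])
    if full < len(hourly_counts):
        spent += frac * hourly_counts[full]
    return spent / total

# Fraction (5.14): average F_E over the initial daily sighting times of one
# pod in one month/year. tau = 9.0 encodes a day with no sighting.
taus = [3.0, 5.5, 9.0]   # hours after 9am (made up)
fraction = sum(F_E(t) for t in taus) / len(taus)
print(round(fraction, 3))
```

Multiplying this fraction by the total available boat-hours E_WW(y, T_l) then yields the effort associated with the initial sightings, as in the text.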
We also download wind-speed data and ask various operators for their operational guidelines on cancellations due to poor weather/sea state. We then remove days considered 'dangerous'. Given the large sources of uncertainty associated with estimating the above quantities, we formulate probability distributions to appropriately express the uncertainty in each of our estimates. These probability distributions form the backbone of our stochastic emulator of E_WW(y, T_l, d).

To estimate the spatial distribution of the observer effort, we estimate how many boat hours could fall in each of the integration points A_i per month and year. Estimates of maximum travel ranges from each port are obtained, considering land as a barrier. Typical vessel routes from the whale-watching companies are established through: private communications with the operators, Soundwatch reports, and the operators' flyers and websites. Combining these together, we then formulate plausible effort fields from each port by hand using GIS tools.

We denote by E_WW(s, y, T_l) the maximum possible observer effort intensity for year y, month T_l, and location s ∈ Ω. It is subject to the following constraint:

∫_Ω E_WW(s, y, T_l) ds = Total possible WW boat hours in year y, month T_l = E_WW(y, T_l).

Our estimate of the pod-specific monthly whale-watch observer effort surface associated with our initial daily sightings is:

E^obs_WW(s, T_l, m) = Σ_{y=2009}^{2016} E^obs_WW(s, y, T_l, m),    (5.15)

E^obs_WW(s, y, T_l, m) = E_WW(s, y, T_l) × (1 / N_{T_l}) Σ_{d=1}^{N_{T_l}} F_E(τ_{m,y,T_l,d}).

Combining effort surfaces

Due to their spatially disjoint observer efforts (Fig 5.1), almost no spatial overlap exists between the two sources of sightings: presence-only sightings (reported by the whale-watch operators) and the presence-absence data (recorded from the DFO boat survey).
Consequently, under our LGCP framework, any intercept term added to w1(s, T_l, m) for capturing the relative observer efficiencies between the two observer types will not be estimable, due to confounding with the spatial field Z(s, T_l, m).

Since both observer types involve similarly-sized vessels, we make the assumption that the efficiencies across the two observer types are identical. Thus, we simply sum the two observer effort layers to get the total observer effort:

E^obs_Total(s, T_l, m) = E^obs_WW(s, T_l, m) + E^obs_DFO(s, T_l, m).    (5.16)

Large uncertainties surround our estimates of the whale-watch observer effort, with the coefficient of variation exceeding 0.25 for the estimates from some of the smaller ports. Failing to account for these uncertainties could lead to over-confident inference. We produce G Monte Carlo samples of the effort field E^obs_WW,g(s, T_l, m), g ∈ {1, ..., G}. For each sampled observer effort field, we then fit the LGCP model and sample once from the posterior distributions of all the parameters and random effects. These new posterior distributions will help account for the uncertainty in observer effort, so long as G is chosen sufficiently large to reduce the Monte Carlo error. We choose G = 1000.

Model selection

We propose and fit several candidate models of increasing complexity for analysis. We fit the models using the R-INLA package with the SPDE approach [Lindgren et al., 2011b, 2015, R Core Team, 2019, Rue et al., 2009]. All models use the estimated observer effort field E^obs_Total, with no detectability or observer effort covariates used (i.e. p_det ≡ 1 and λ_eff ≡ E^obs_Total). Model candidates start from the simplest complete spatial randomness model. This assumes that, conditional on observer effort, encounter locations for each pod and month arise from a homogeneous Poisson process. They finish with models for λ_true(s, T_l, m) which include: covariates, temporal splines, and Gaussian (Markov) random fields with separable spatio-temporal covariance structures within (5.13).
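The Monte Carlo scheme for propagating effort uncertainty can be illustrated with a deliberately simplified conjugate model standing in for the LGCP: for each sampled effort value we 'refit' and take a single posterior draw, and the pooled draws then carry both the posterior and the effort uncertainty. All numbers here are invented, and the Gamma-Poisson model is a toy substitute, not the model used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the LGCP in one cell: encounters ~ Poisson(lam * effort),
# with a conjugate Gamma(a0, b0) prior on lam. Values are illustrative.
a0, b0 = 2.0, 1.0
count = 7            # observed encounters in the cell
G = 1000             # number of Monte Carlo effort samples

# Draws from a stochastic emulator of the uncertain observer effort.
effort_draws = rng.lognormal(mean=np.log(10.0), sigma=0.25, size=G)

# For each sampled effort field, refit (here: a conjugate update) and take
# ONE draw from the resulting posterior of lam.
pooled = np.array([rng.gamma(a0 + count, 1.0 / (b0 + e)) for e in effort_draws])

# Compared with fixing effort at its central value, the pooled draws are
# more dispersed, because they also carry the effort uncertainty.
fixed = rng.gamma(a0 + count, 1.0 / (b0 + 10.0), size=G)
print(pooled.std() > fixed.std())
```

In the analysis itself the 'refit' step is a full R-INLA model fit per sampled effort field, but the pooling logic is the same.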
To avoid excessive computation time, we perform model selection on a single realisation of our observer effort field. Then, for our 'best' model, we propagate the uncertainties in observer effort through to the results via the Monte Carlo approach.

We explore two space-time covariates and one spatial covariate: sea-surface temperature (SST), chlorophyll-A (chl-A), and depth. Covariates were downloaded from the ERDDAP database [Simons, 2019], with the monthly composite SST and chl-A rasters (Fig A.34) extracted from satellite level 3 images from the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor onboard the Aqua satellite (Data set IDs: erdMH1sstdmday and erdMH1chlamday respectively). We compare the two types of hierarchical space-time centering seen in Yuan et al. [2017].

We also explore the addition of spatial random fields, spatio-temporal random fields, and temporal splines within (5.13). Including these adds a substantial amount of complexity to the model. To avoid over-fitting the data, we start with the simplest models without random effects, and iteratively increase the complexity of the model in a stepwise manner. To choose the 'best' model, we use the Deviance Information Criterion (DIC). This trades off the goodness-of-fit of the model against a penalty for the model's complexity [see Spiegelhalter et al., 2002]. The candidate models that do not contain random fields or splines are equivalent to MAXENT models, a commonly used SDM method [Renner and Warton, 2013]. Thus the DIC values of the models allow for comparisons to be made between the commonly used MAXENT models and our proposed LGCP model.

We also conduct posterior predictive checks on the candidate models [Gelman et al., 1996]. In particular, we assess the ability of the models to accurately estimate the total number of first sightings of each pod, per month.
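A posterior predictive check of this kind amounts to asking whether the observed count sits inside the model's posterior predictive interval. The sketch below uses invented numbers and a generic Gamma posterior as a stand-in for the fitted model's posterior of the expected monthly count.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy posterior predictive check: does the observed monthly count of first
# sightings fall inside the 95% posterior predictive interval?
observed = 18                                   # invented observed count
post_rate = rng.gamma(20.0, 1.0, size=4000)     # stand-in posterior draws of
                                                # the expected monthly count
pred = rng.poisson(post_rate)                   # posterior predictive counts
lo, hi = np.percentile(pred, [2.5, 97.5])
print(lo <= observed <= hi)
```

Counts falling repeatedly outside these intervals would flag a model that cannot reproduce the sighting totals, which is how the check is used to discriminate among candidates.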
We also assess the models' ability to suitably capture the spatial trend by comparing the observed number of sightings falling within each region A_i with their model-estimated credible intervals.

The final selected model

The final 'best' model, as judged by DIC and posterior predictive check assessments, includes a spatial random field shared across the three pods, a spatial field unique to pod L, pod-specific temporal effects (as captured by second-order random walk processes), SST, and chl-A. Both covariates were space-time centered. Depth was omitted, as its inclusion led to numerical instabilities due to high multicollinearity. Details of all the models are in the supporting material found in the Appendix (see Table A.2).

The importance of including a random field unique to pod L implies that pod L exhibits different space use compared with J and K. This result is in agreement with Hauser et al. [2007]. No unique spatial field for pod J or K was found to significantly improve the model. The pod-specific random walks reflect the different times the pods arrive at and leave the area of interest. For example, pod J is found to remain in the area of interest across the months, whereas pods K and L are found to have lower intensities in May relative to September (Fig A.35). This is in agreement with Ford et al. [1996, p. 104].

Finally, we use the causal DAG shown in Fig 5.3 to display our assumptions about the 'best' model. The first assumption is that we can accurately emulate observer effort λ_eff(s, T_l, m) with E^obs_Total(s, T_l, m) and that no unmeasured strong predictors of effort w*(s, T_l, m) exist. Large residual spatio-temporal correlations caused by w*(s, T_l, m) would be erroneously captured in the spatial fields for λ_true(s, T_l, m), leaving the estimates of pod intensity confounded by effort. The next assumption is that no path in the DAG exists between the environmental covariates, measured or unmeasured (i.e.
x(s, T_l, m) or x*(s, T_l, m)), and the effort λ_eff(s, T_l, m). A violation would also leave the pod intensity estimates confounded by effort. For the estimates of the species-environment effects β to be accurate, we need to assume that no unmeasured environmental covariates x*(s, T_l, m) exist. This assumption is unlikely to hold. The choices driving the movements of the SRKW are likely far more complex than can be explained by the two covariates alone, and unmeasured environmental factors x*(s, T_l, m) are likely to interact with both x(s, T_l, m) and λ_true(s, T_l, m), causing confounding. However, the presence of x*(s, T_l, m) should not impact our ability to predict λ_true(s, T_l, m), since any strong residual autocorrelations due to x*(s, T_l, m) should be captured by Z(s, T_l, m).

Displaying the results

Large uncertainties surround our estimates of the SRKW intensity (i.e. encounter rate). Side-by-side maps of the posterior mean and posterior standard deviation can prove challenging to interpret, making it difficult to determine regions of 'high' intensity. Instead, by using a large number of posterior samples (G) from our model, we are able to compute exceedance probabilities and then clearly display both point estimates and uncertainty in a single map, called an exceedance map.

Exceedance maps display the posterior pointwise probabilities that the value of a random surface, evaluated across a regular lattice grid of points, exceeds a chosen threshold. For our application, we are interested in identifying regions of high whale intensity. As such, our maps display the posterior pointwise probabilities, for month T_l, that the pod-specific intensity λ_true(s, T_l, m) at location s lies above a chosen intensity threshold value. First, we choose the 70th percentile of that pod's intensity averaged across all times and spatial grid pixels. Hotspots are then identified by displaying only the points that have a posterior pointwise probability above a probability threshold.
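This construction is simple sample arithmetic: from G posterior draws of the intensity over the grid, estimate the pointwise probability of exceeding the 70th-percentile threshold, then mask pixels below the chosen probability threshold (0.95 here). The sketch below uses simulated draws in place of real posterior samples; all values are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

G, n_pix = 1000, 50
# Toy 'posterior samples' of a pod's intensity over 50 grid pixels; pixels
# with larger index have genuinely higher intensity in this fake example.
post = rng.lognormal(mean=np.linspace(-2.0, 2.0, n_pix), sigma=0.3,
                     size=(G, n_pix))

# Threshold c: the 70th percentile of the posterior-mean intensity surface
# (standing in for the average over months and pixels described in the text).
c = np.percentile(post.mean(axis=0), 70)

# Pointwise posterior exceedance probabilities, then the probability mask.
exceed_prob = (post > c).mean(axis=0)
hotspot = exceed_prob >= 0.95
print(hotspot.sum(), "of", n_pix, "pixels flagged as hotspots")
```

Pixels with high posterior mean but high uncertainty get diluted exceedance probabilities and drop out of the masked map, which is exactly the behaviour the text describes.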
We choose a probability threshold of 0.95, which represents areas where the model predicts, with at least probability 0.95, that the posterior intensity is in the top 30% of values for that pod. Then, we repeat the above process, but this time with the 70th percentile threshold fixed at a specific month. Such maps simultaneously present our point (i.e. 'best') estimates whilst also reflecting the uncertainties surrounding them. For example, regions predicted to have a high encounter rate, but also a large uncertainty (e.g. regions rarely visited, but where a few encounters were made), will no longer appear in these exceedance plots.

For demonstration, we explore regions that our model confidently predicts to have a high J-pod intensity λ_true(s, T_l, m) during May. These correspond to hotspots of their UD. Panel A in Fig 5.6 shows the posterior probability that the J-pod intensity in May lies in the top 30% of values across all months. The plots show clear hotspots in J-pod's May intensity in the west of the region and in inshore waters. We repeat the plot, but now colour grey all pixels for which the posterior probability of exceeding the 70th percentile value is below 0.95. This helps to differentiate the regions of interest that we are most confident about (Fig 5.6 B). If we change the upper exceedance value to be the 70th percentile value for the month of May only, rather than across all months, the regions of interest are larger (Fig 5.6 C-D).

Pod-probability maps can identify the core areas within Ω associated with each pod and month.
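Underlying these maps is a simple ratio: with observer effort common to all pods and held fixed, the probability that an encounter at location s is of pod m reduces to that pod's share of the total intensity, the standard marked-point-process result invoked via equation (5.9). A sketch with invented intensity values:

```python
# Hypothetical pod-specific intensities at one location s in May
# (expected encounters per boat-hour); the numbers are invented.
lam = {"J": 0.30, "K": 0.10, "L": 0.05}

# With effort fixed, P(encounter at s is of pod m) = lam_m(s) / sum_k lam_k(s).
total = sum(lam.values())
pod_prob = {m: v / total for m, v in lam.items()}
print(pod_prob["J"])  # 0.30 / 0.45 ≈ 0.667
```

In the full analysis this ratio is evaluated pointwise over the grid from the posterior samples, producing the pod-probability surfaces in panels E-G of Fig 5.6.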
For a chosen month T_l and pod m, we define its 'core area' to be a region D_{T_l,m} ⊂ Ω such that, if an encounter is made within D_{T_l,m} during T_l, there is a 'high' probability that it is of pod m. Under our multi-type LGCP framework, because we can fix observer effort, we are able to compute the posterior probabilities that an encounter made at a given location and month contains a specific pod (see equation (5.9)). For May, we display the posterior probabilities that an encounter made at location s ∈ Ω contains pod J, K, and L respectively in panels E, F, and G of Fig 5.6. It is apparent that in May, pod J is most likely to be encountered, in agreement with Ford et al. [1996].

Figure 5.6: A series of plots demonstrating the different types of plots possible under our modelling framework. Panels A and B show the posterior probability that J pod's intensity across the region takes a value in the upper 30% in the month of May. Panel A shows the raw probabilities, while Panel B has a minimum probability threshold of 0.95. Panels C and D are the same; however, the upper 30% exceedance value is defined uniquely for the month of May instead of as an average over all the months. Panels E, F, and G show the posterior probabilities that a sighting made at a given location in May contains pods J, K, and L respectively. All results are shown for the 'best' model with Monte Carlo observer effort.

When the sightings of every individual from the target population are available, one can sum the individuals' intensities and then normalize to create estimates of the population-level distribution. Maps of the population's distribution may be especially useful for conservation purposes. See for example Fig A.32, where we fix the upper value to exceed as the 70th percentile value of the sum of the three pods' intensities across all months. We assume that individuals strictly swim in their pods, and that each pod is a single unit of identical size. We do not scale the pod-specific intensities by their group sizes.
Thus, the intensity represents the expected number of encounters of any pod per boat-hour of effort. The effort E^obs_Total(s, T_l, m) is estimated to be nonzero throughout most of the region Ω, with two exceptions. The first is the region at the very top of Ω, to the west of Vancouver. The second is the northwestern corner of Ω. These regions were never visited, and so little can be said about the true SRKW intensity there. This is reflected in the very large posterior standard deviations shown there in Fig A.36.

5.6 Discussion

We have built upon the recent developments made in the species distribution modelling literature and presented a general framework for estimating an individual's utilization distribution (UD). In addition, we have shown that these estimates can be combined to form the spatio-temporal distribution of a species or group. We demonstrated its use by identifying areas frequently used by an endangered ecotype of killer whale. Using the methodology, data from multiple observers, and data of varying quality and type, may all be combined to jointly estimate the spatio-temporal distribution. Crucially, high-quality survey data can be combined with low-quality opportunistic data, including presence-only data. Data types compatible for modelling with this framework extend beyond those seen in this motivating example. Log-Gaussian Cox processes (LGCPs) have the unifying feature of providing a base model for deriving the likelihoods of many of the commonly found data types, including presence-only, presence-absence, site occupancy, and site count data [Miller et al., 2019]. Such data fusion can improve the spatial resolution and statistical precision of estimates of the spatio-temporal distribution of species [Fithian et al., 2015, Koshkina et al., 2017].

However, including presence-only data requires knowledge about the observer effort, either directly (e.g. GPS records) or through a set of strong predictors (e.g. distance from the nearest road).
In either case, we show that approximating the observer effort, by either computing or modelling the path integrals of the observers' fields-of-view, can be a relatively straightforward and successful approach. Furthermore, results from our simulation study suggest that only crude estimates of the observers' fields-of-view are required, and that substantial improvements in the accuracy of UD predictions can be attained when the degree of observer bias is high. These improvements are still seen in settings where substantial overlap exists between the observers' fields-of-view and where the size of the study region is small. A fundamental assumption of our work was that the utilization distributions of the individuals were stationary throughout known time intervals. This greatly simplified the task of estimating the observer effort. If this stationarity assumption is unsuitable and the UDs evolve continuously through time, then the observer effort needs to be known or estimated on a continuous-time scale too. Estimating unknown effort in continuous time from a set of covariates will likely prove to be a challenge.

While the mathematical theory underpinning the LGCP may appear challenging to many researchers, these models are widely applicable in practice. Recent developments in spatial point process R packages [R Core Team, 2019], such as spatstat [Baddeley and Turner, 2014] and inlabru [Bachl et al., 2019], facilitate their computation. Inlabru requires only basic knowledge of R packages such as sp [Bivand et al., 2013, Pebesma and Bivand, 2005], rgeos [Bivand and Rundel, 2013], and rgdal [Bivand et al., 2015]. Pseudo-code is supplied in the Appendix to show how a dataset with a combination of distance sampling survey data and opportunistic presence-only data could be analysed using this modelling framework.
Joint models are fit and sampled from using only 7 function calls, emphasising the applicability of the framework across a wide range of disciplines.

A biology-focused companion paper is currently underway, using the final model outputs to explore SRKW habitat use and how it varies in this region across pods and summer months. Importantly, it will compare and contrast habitat use based on traditional opportunistic sightings data analyses and, for the first time, present relative SRKW habitat use across the entire extent of SRKW critical habitat in Canadian Pacific waters, together with estimates of confidence. Thus the models developed in this Chapter will play an important role in planning future SRKW conservation efforts and highlighting regions of ecological significance.

Chapter 6

Summary, conclusions, and future work

"More data means more information, but it also means more false information."
— Nassim Nicholas Taleb, Antifragile: Things That Gain from Disorder

Data are being collected on an increasingly large number of phenomena, and the speed at which data are being created is increasing. The result is that enormous amounts of data are becoming available for every conceivable entity with which humans interact, hence the emergence of the term "Big Data". However, as stated in the above quote, an increase in the quantity of data collected on a phenomenon does not necessarily lead to an improved understanding of it. In this dissertation, we have shown that this disconnect can be stark when the phenomenon under study is spatio-temporal in nature, because spatio-temporal data are routinely collected to meet one objective and then analyzed to meet another. We have shown that this mismatch in objectives, when ignored, can have a deleterious impact on the statistical inference of spatio-temporal data.
Consequently, researchers need to question why and how a spatio-temporal dataset was collected, to avoid preferential sampling (PS) biasing their understanding of the phenomenon in question.

In this dissertation, we focused on two major objectives. The first was to demonstrate that PS can have severe impacts on the statistical inference of spatio-temporal data and that it should be considered within any analysis. The second was to provide researchers with a set of tools for both detecting the presence of PS and for subsequently adjusting for it in their analyses. Throughout the dissertation we focused our attention on real-world data, both to demonstrate that PS is prevalent in real-world data and to show that our tools are applicable in practice.

In Chapter 2 we introduced the concepts of PS in all three types of spatio-temporal data, which we referred to as: discrete-space, continuous-space, and point-pattern data. We then focused on the popular class of spatio-temporal generalized linear mixed-effects models (STGLMMs) that are commonly used to describe the three data types in practice. Next, we provided demonstrative examples of PS in all three settings. In all the examples, we shared spatio-temporal random effects between the processes that described the PS and the target spatio-temporal process being observed. These demonstrative examples of PS in all three spatio-temporal data types, combined with the formal definition of STGLMMs, then provided us with the necessary framework required in the later Chapters both for developing tests for PS and for developing methods to adjust inference for its presence. We ended the Chapter with a discussion of the Integrated Nested Laplace Approximation (INLA) method, which allows models within the STGLMM class to be efficiently fit within a Bayesian framework.

In Chapter 3, we built on the STGLMM framework seen in Chapter 2 and developed the first general framework for modelling preferentially sampled spatio-temporal data.
The framework is applicable in both the discrete-space and continuous-space settings. We demonstrated its utility by analyzing historical air pollution data collected across a network in Great Britain. We demonstrated that PS was present throughout the lifetime of the network and that this may have led to a dramatic overestimation of black smoke levels, including estimates of population exposure.

In Chapter 4, we again considered the STGLMM framework and used it to develop a general test for preferential sampling. The test is the first to be: applicable in both discrete-space and continuous-space settings, applicable to non-continuous response data, and powerful in small sample size settings. We demonstrated the high power of the test across a wide range of settings in a thorough simulation study, before applying it to two previously-published real-world case studies.

In Chapter 5, we focused on PS in point-pattern data. We turned to spatio-temporal log-Gaussian Cox processes and decomposed the spatio-temporal intensity surface into the product of a term reflecting the true intensity and an additional two terms reflecting both the spatial 'effort' exerted and the detectability of the points. We then used this approach to develop a framework for estimating the utilization distributions (UDs) of animals. UDs help ecologists to build a better understanding of how animals interact with their environment and use space. This information can then be used to inform successful management policies. We demonstrated its utility in a real-world case study to estimate the space use of an endangered ecotype of killer whales, using sightings data from observers who are known to focus their efforts in regions where the animals are expected to be present.

PS has been identified as a serious problem, with a recent surge in interest sparked by the landmark paper of Diggle et al. [2010].
Since that paper, it has been made clear that PS is commonplace across many fields of research, including those related to ecology, public health, the environment, and econometrics [Gelfand and Shirota, 2019, Lee et al., 2015, Pennino et al., 2019, Shaddick et al., 2016]. Furthermore, as shown in Chapter 3 with an application to Great Britain's air pollution monitoring network, the impacts of PS on inference can be very large.

Whilst the application to Great Britain's air pollution monitoring network was ideal for demonstrating the utility of both the framework and the test, the application was by no means cherry-picked. To appreciate how systemic PS may be throughout national air pollution monitoring networks, one only needs to read government guidelines for their design. Frequently, these networks are designed with the purposes of noncompliance detection and maximum concentration detection [Lee et al., 2015]. For example, consider the published guidelines for air quality monitoring network design from the United States' EPA and Canada's CCME. In the EPA's handbook, it is suggested that monitors be placed to measure "concentrations in areas of high population density" and to determine the "highest concentration expected to occur in the area covered by the network" (von Lehmden and Nelson [1977], Section 6.1). In the CCME's guidelines, it is suggested that monitors be placed to "measure the highest representative ozone concentrations in metropolitan areas" and to "measure representative PM and ozone concentrations in populated areas across the country" (Canadian Council of Ministers of the Environment [2011], Executive Summary). Thus, PS may be rife in air pollution data.

Now it is important to state that this dissertation is not implying that there is a problem with the design of the air quality networks themselves. The networks are designed with a clear set of objectives in mind. The problem lies with researchers who decide to use these data for a purpose for which they were not collected.
Researchers typically fail to incorporate these objectives within their statistical analyses, which can severely impact their conclusions. The ramifications for our understanding of issues of great importance, such as establishing the public health consequences associated with air pollution exposure and monitoring compliance with regulatory air quality standards, may be great (see Chapter 3 and Lee et al. [2015]).

Existing methods for modelling PS have been criticized on philosophical grounds. In his discussion of the work by Diggle et al. [2010], Richard D. Wilkinson stated that the spatial-only model described “could never arise in practice as the surveyors do not know S [the value of the field]” before observing it. In addition to advancing the PS literature into the spatio-temporal setting, the general framework we introduced in Chapter 3 bypasses this philosophical concern. Instead of requiring that network designers had a complete understanding of the field before measurement, they are instead allowed to choose site placement based on the previous years’ observations. Thus our approach comes closer to emulating the real selection processes that truly governed the site selection.

Existing methods have also been criticized for being computationally prohibitive and challenging to program [Pennino et al., 2019], requiring bespoke Monte Carlo-based maximum likelihood procedures to be written [Diggle et al., 2010]. All of the methods we introduced in Chapters 3, 4, and 5 were developed with computational efficiency and accessibility in mind. By framing the methods of Chapters 3 and 5 within the STGLMMs class of models, we ensured that the INLA approach can be used to implement them quickly, efficiently, and accurately.
Importantly, the user-friendly R-INLA package [Lindgren et al., 2011a, Rue et al., 2009], when used in conjunction with the inlabru package [Bachl et al., 2019], ensures that the methodologies can be implemented by researchers with only modest computational skills. For the test in Chapter 4, we developed a user-friendly R package, PSTestR, now available on GitHub. Crucially, the package is compatible with objects from the popular R packages sp, sf, and spatstat. Thus, we hope that the methods introduced in this dissertation will be widely used by researchers.

Statistical models used to describe spatio-temporal phenomena are increasingly being used to inform government policy across a range of issues, including matters of public health and the environment. Whilst the use of evidence to inform policy is not new, there has been a recent increase in international demand for evidence-based policy (EBP) following its popularization in the UK during Tony Blair’s leadership [Sutcliffe and Court, 2006]. The UK’s increased interest in EBP began with the publication of a governmental white paper in 1999 titled “Modernising Government”, which demanded a “better use of evidence and research in policy making” [Cabinet Office, 1999]. Since then, the UK has been actively promoting the concept internationally [Sutcliffe and Court, 2006], and the demand for EBP has increased. In 2017, the US Commission on Evidence-Based Policymaking published a report titled “The Promise of Evidence-Based Policymaking”, in which they proposed a vision of “a future in which rigorous evidence is created efficiently, as a routine part of government operations, and used to construct effective public policy” [Commission on Evidence-Based Policymaking, 2017]. Yet, for policy-makers to implement EBP effectively, the evidence being ‘created’ needs to be both accurate and reliable.
This dissertation has demonstrated that the accuracy of evidence arising from standard spatio-temporal statistical methods can be strongly impacted by the protocols governing how the data were collected. The methods developed in this dissertation can help to reduce this sensitivity.

Whilst we have developed a suite of tools for accounting for PS in the statistical analysis of spatio-temporal data, a large range of research questions remains for future work. Firstly, in Chapter 3 we showed that estimates of Great Britain’s population exposure to black smoke may have been wildly inaccurate. In spatial epidemiology, estimates of the health effects of exposure to air pollutants are commonly derived from ecological models of health, with exposure values typically imputed from models that ignore PS. Investigating the impacts that PS may have on these health effect estimates is an important avenue of research. We expect that PS will impact the imputed exposures derived from other air pollution monitoring networks to varying degrees. We are currently testing this hypothesis on air quality data from the United States.

Secondly, in Chapter 3, we shared linear combinations of spatio-temporally correlated latent effects between the model used to describe the environmental process and the model used to describe the PS. This joint model approach allowed us to adjust for PS. However, we never considered the possibility that the latent effects describing the spatio-temporal process being measured could influence the process governing the PS in a nonlinear way. In practice, the committees and processes responsible for choosing the sampling locations and times may often seek out extreme values of the spatio-temporal process. A linear functional form would not suffice to represent this objective.
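As a hypothetical illustration of why the functional form matters, compare a linear and a quadratic dependence of the sampling intensity on the latent process: the linear form shifts the sampled values upward, whereas the quadratic form, which favours extreme values in both tails, leaves the sample mean roughly unchanged but inflates the spread. The weights and coefficients below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Latent process values at candidate monitoring sites
z = rng.normal(size=10_000)

def sample_sites(z, link, n=200):
    """Draw site indices with inclusion probability proportional to exp(link(z))."""
    w = np.exp(link(z))
    w /= w.sum()
    return rng.choice(z.size, size=n, replace=False, p=w)

# Linear PS: sites favour high values of the process
linear = sample_sites(z, lambda v: 1.5 * v)

# Quadratic PS: sites favour extremes in both tails, as a network
# designed to capture maxima and minima might; no linear term can
# express this selection behaviour
quadratic = sample_sites(z, lambda v: 0.75 * v**2)

print(f"population          : mean {z.mean():+.3f}, sd {z.std():.3f}")
print(f"linear-PS sample    : mean {z[linear].mean():+.3f}, sd {z[linear].std():.3f}")
print(f"quadratic-PS sample : mean {z[quadratic].mean():+.3f}, sd {z[quadratic].std():.3f}")
```

A PS adjustment that assumes a linear form would be misspecified under the second mechanism, even though both mechanisms produce preferentially sampled data.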
It would be interesting in future work to investigate how sensitive the PS-adjusted predictions of the underlying spatio-temporal process are to the functional form specified on the PS.

Thirdly, in Chapter 4 we developed a test for PS in both the spatial and spatio-temporal settings. Yet these ideas could be carried across to the temporal setting. In biostatistics, longitudinal studies are commonly employed to investigate the population-average temporal response to treatments. These studies frequently suffer from patient dropout, and the processes driving a patient’s decision to drop out of a study are commonly feared to be related to the underlying treatment response being measured. This relation between dropout and treatment response is known as informative dropout, and joint models are commonly fit to adjust for its biasing effects [Wu, 2009]. Yet this type of informative dropout falls within our definition of PS. Thus, by studying the patient dropout times as the realization of a temporal point process whose intensity may be related to the underlying treatment response, a test for informative dropout (i.e. PS) may be derived using the same principles as seen in Chapter 4. Such a test would allow researchers to quickly assess the evidence for PS within a given study, without having to fit conceptually difficult and computationally costly joint models.

Finally, in Chapter 5 we developed an approach for adjusting the statistical inference of point-patterns for the presence of PS. This required us to either emulate the process that governed the PS, or estimate the PS with a set of available covariates. However, in many ecological settings, aggregated sightings data are collected at discrete ‘sites’. For example, site count data are frequently collected when researchers visit discrete locations (‘sites’) and record the number of sightings of the target species.
In these settings, one could consider the locations of ‘sites’ as a point-pattern and jointly model the intensity governing the chosen ‘sites’ with the underlying species’ intensity describing the count data. By sharing the latent effects across the likelihoods, this joint model would allow researchers to investigate whether or not the chosen ‘site’ locations were preferentially sampled. Statistical inference on the species’ distributions could then be adjusted for PS accordingly, without the need for any covariates or emulators of effort.

Bibliography

Jonathan Acosta, Ronny Vallejos, et al. Effective sample size for spatial regression models. Electronic Journal of Statistics, 12(2):3147–3180, 2018. → page 129

B. Ainslie, C. Reuten, D. G. Steyn, N. D. Le, and J. V. Zidek. Application of an entropy-based Bayesian optimization technique to the redesign of an existing monitoring network for single air pollutants. Journal of Environmental Management, 90(8):2715–2729, 2009. → page 42

Fabian E. Bachl, Finn Lindgren, David L. Borchers, and Janine B. Illian. inlabru: an R package for Bayesian spatial modelling from ecological survey data. Methods in Ecology and Evolution, 10:760–766, 2019. doi:10.1111/2041-210X.13168. → pages 33, 136, 146, 172, 178, 248

Adrian Baddeley and Rolf Turner. Package ‘spatstat’. The Comprehensive R Archive Network, 2014. → pages 146, 172

Adrian Baddeley, Ege Rubak, and Rolf Turner. Spatial Point Patterns: Methodology and Applications with R. Chapman and Hall/CRC, 2015. → pages 10, 22, 23, 30, 50, 90, 101, 102, 103, 117, 128, 145, 200, 214

Adrian Baddeley, Andrew Hardegen, Thomas Lawrence, Robin K. Milne, Gopalan Nair, and Suman Rakshit. On two-stage Monte Carlo tests of composite hypotheses. Computational Statistics & Data Analysis, 114:75–87, 2017. → page 113

Haakon Bakka. Mesh creation including coastlines, Jan 2017.
→ page 198

Haakon Bakka, Håvard Rue, Geir-Arne Fuglstad, Andrea Riebler, David Bolin, Janine Illian, Elias Krainski, Daniel Simpson, and Finn Lindgren. Spatial modeling with R-INLA: a review. Wiley Interdisciplinary Reviews: Computational Statistics, 10(6):e1443, 2018. → page 2

S. Banerjee, B. P. Carlin, and A. E. Gelfand. Hierarchical Modeling and Analysis for Spatial Data, Second Edition. CRC Press, 2015. → page 1

Luis Bedriñana-Romano, Rodrigo Hucke-Gaete, Francisco Alejandro Viddi, Juan Morales, Rob Williams, Erin Ashe, José Garcés-Vargas, Juan Pablo Torres-Florez, and Jorge Ruiz. Integrating multiple data sources for assessing blue whale abundance and distribution in Chilean Northern Patagonia. Diversity and Distributions, 24(7):991–1004, 2018. → page 133

J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 192–236, 1974a. → page 115

J. Besag and C. Kooperberg. On conditional and intrinsic auto-regressions. Biometrika, 82:733–746, 1995. → page 11

J. E. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, 36:192–236, 1974b. → page 9

Julian Besag, Jeremy York, and Annie Mollié. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43(1):1–20, 1991. → page 28

Roger Bivand and Colin Rundel. rgeos: interface to Geometry Engine - Open Source (GEOS). R package version 0.3-2, 2013. → page 172

Roger Bivand, Tim Keitt, Barry Rowlingson, Edzer Pebesma, Michael Sumner, Robert Hijmans, and Even Rouault. Package ‘rgdal’: bindings for the Geospatial Data Abstraction Library, 2015. → page 172

Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. Applied Spatial Data Analysis with R, Second Edition. Springer, NY, 2013.
→ pages 128, 172, 250

Marta Blangiardo and Michela Cameletti. Spatial and Spatio-temporal Bayesian Models with R-INLA. John Wiley & Sons, 2015. → pages 6, 9, 10, 109, 115

David R. Brillinger, Haiganoush K. Preisler, Alan A. Ager, and J. G. Kie. The use of potential functions in modelling animal movement. In Selected Works of David Brillinger, pages 385–409. Springer, 2012. → page 155

Avishek Chakraborty, Alan E. Gelfand, Adam M. Wilson, Andrew M. Latimer, and John A. Silander. Point pattern modelling for degraded presence-only data over large regions. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(5):757–776, 2011. → pages 23, 135, 149

H. Chang, A. Q. Fu, N. D. Le, and J. V. Zidek. Designing environmental monitoring networks to measure extremes. Environmental and Ecological Statistics, 14(3):301–321, 2007. → page 42

Taeryon Choi and Mark J. Schervish. On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis, 98(10):1969–1987, 2007. → page 112

John Clare, Shawn T. McKinney, John E. DePue, and Cynthia S. Loftin. Pairing field methods to improve inference in wildlife surveys while accommodating detection covariance. Ecological Applications, 27(7):2031–2047, 2017. → page 149

Jean-François Coeurjolly, Jesper Møller, and Rasmus Waagepetersen. Palm distributions for log Gaussian Cox processes. Scandinavian Journal of Statistics, 44(1):192–203, 2017. → page 107

J. Colls. Air Pollution, Modelling, and Mitigation. Routledge, Abingdon, Oxford, 2002. → page 53

Paul B. Conn, James T. Thorson, and Devin S. Johnson. Confronting preferential sampling when analyzing population distributions: diagnosis and model-based triage. Methods in Ecology and Evolution, 8:1535–1545, 2017. → page 48

N. Cressie and H.-C. Huang. Classes of nonseparable, spatio-temporal stationary covariance functions. Journal of the American Statistical Association, 94:1330–40, 1999. → page 109

N. Cressie and C. K. Wikle. Statistics for Spatio-temporal Data, volume 465. Wiley, 2011.
→ pages 1, 6

N. A. C. Cressie. Statistics for Spatial Data, Revised Edition. John Wiley, New York, 1993. → pages 1, 5

Noel Cressie. Statistics for spatial data. Terra Nova, 4(5):613–617, 1992. → page 112

Noel Cressie and Christopher K. Wikle. Statistics for Spatio-temporal Data. John Wiley & Sons, 2015. → page 96

Ngoc Anh Dao and Marc G. Genton. A Monte Carlo-adjusted goodness-of-fit test for parametric models describing spatial point patterns. Journal of Computational and Graphical Statistics, 23(2):497–517, 2014. → pages 113, 119

Russell Davidson and James G. MacKinnon. Bootstrap tests: how many bootstraps? Econometric Reviews, 19(1):55–68, 2000. → pages 114, 118

DFO. Killer whale (northeast Pacific southern resident population). Accessed: 2019-03-29. → page 136

Peter J. Diggle, J. A. Tawn, and R. A. Moyeed. Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350, 1998. → page 28

P. J. Diggle and Paulo Justiniano Ribeiro. Model-based Geostatistics (Springer Series in Statistics). 2007. → pages 100, 104, 108

P. J. Diggle, P. J. Ribeiro, and SpringerLink (online service). Model-based Geostatistics, volume 846. Springer New York, 2007. → pages 8, 10

P. J. Diggle, R. Menezes, and T. Su. Geostatistical inference under preferential sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(2):191–232, 2010. → pages 2, 4, 16, 19, 46, 48, 51, 64, 66, 68, 78, 96, 98, 100, 125, 127, 128, 176, 177, 178, 240

Daniel Dinsdale, Matias Salibian-Barrera, et al. Modelling ocean temperatures from bio-probes under preferential sampling. The Annals of Applied Statistics, 13(2):713–745, 2019. → pages 97, 128

Robert M. Dorazio. Accounting for imperfect detection and survey bias in statistical analysis of presence-only data. Global Ecology and Biogeography, 23(12):1472–1484, 2014. → pages 146, 148

Shah Ebrahim and George Davey Smith. Mendelian randomization: can genetic epidemiology help redress the failures of observational epidemiology?
Human Genetics, 123(1):15–33, 2008. → page 1

Jane Elith and John Leathwick. Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines. Diversity and Distributions, 13(3):265–275, 2007. → pages 133, 146, 149

EPA. Air quality criteria for ozone and related photochemical oxidants. Technical report, 2005. → pages 2, 42, 97

Erin U. Rechsteiner, Caitlin F. C. Birdsall, Doug Sandilands, Iain U. Smith, Alana V. Phillips, and Lance G. Barrett-Lennard. Quantifying observer effort for opportunistically-collected wildlife sightings. Unpublished, 2013. Accessed: 2019-03-29. → page 139

J. A. Fernández, A. Rey, and A. Carballeira. An extended study of heavy metal deposition in Galicia (NW Spain) based on moss analysis. Science of the Total Environment, 254(1):31–44, 2000. → page 125

John Fieberg and Luca Börger. Could you please phrase “home range” as a question? Journal of Mammalogy, 93(4):890–902, 2012. → page 132

William Fithian and Trevor Hastie. Finite-sample equivalence in statistical models for presence-only data. The Annals of Applied Statistics, 7(4):1917, 2013. → pages 44, 49, 66, 200

William Fithian, Jane Elith, Trevor Hastie, and David A. Keith. Bias correction in species distribution models: pooling survey and collection data for multiple species. Methods in Ecology and Evolution, 6(4):424–438, 2015. → pages 2, 22, 24, 32, 33, 97, 99, 133, 145, 146, 148, 149, 153, 171

Chris H. Fleming, William F. Fagan, Thomas Mueller, Kirk A. Olson, Peter Leimgruber, and Justin M. Calabrese. Rigorous home range estimation with movement data: a new autocorrelated kernel density estimator. Ecology, 96(5):1182–1188, 2015. → page 132

John K. B. Ford, Graeme M. Ellis, and Kenneth C. Balcomb. Killer Whales: The Natural History and Genealogy of Orcinus orca in British Columbia and Washington. UBC Press, 1996. → pages 137, 167, 170

John K. B. Ford, James F. Pilkington, M. Otsuki, B. Gisborne, R. M. Abernethy, E. H. Stredulinsky, J. R. Towers, and G. M. Ellis.
Habitats of special importance to Resident Killer Whales (Orcinus orca) off the west coast of Canada. Fisheries and Oceans Canada, Ecosystems and Oceans Science, 2017. → pages 137, 244

Geir-Arne Fuglstad, Daniel Simpson, Finn Lindgren, and Håvard Rue. Constructing priors that penalize the complexity of Gaussian random fields. Journal of the American Statistical Association, pages 1–8, 2018. → pages 29, 124, 213, 245, 250

Geir-Arne Fuglstad, Daniel Simpson, Finn Lindgren, and Håvard Rue. Constructing priors that penalize the complexity of Gaussian random fields. Journal of the American Statistical Association, 114(525):445–452, 2019. → pages 69, 197

Alan E. Gelfand and Shinichiro Shirota. Preferential sampling for presence/absence data and for fusion of presence/absence data with presence-only data. Ecological Monographs, 89(3):e01372, 2019. → pages 2, 176

Alan E. Gelfand, Peter Diggle, Peter Guttorp, and Montserrat Fuentes. Handbook of Spatial Statistics. CRC Press, 2010. → page 30

Alan E. Gelfand, Sujit K. Sahu, and David M. Holland. On the effect of preferential sampling in spatial prediction. Environmetrics, 23(7):565–578, 2012. → pages 17, 47, 96, 97, 105

Andrew Gelman, Xiao-Li Meng, and Hal Stern. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, pages 733–760, 1996. → page 167

Subhashis Ghosal, Anindya Roy, et al. Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, 34(5):2413–2429, 2006. → page 112

Jacques Gignoux, Camille Duby, and Sébastian Barot. Comparing the performances of Diggle’s tests of spatial randomness for small samples with and without edge-effect correction: application to ecological data. Biometrics, 55(1):156–164, 1999. → page 102

Christophe Giraud, Clément Calenge, Camille Coron, and Romain Julliard. Capitalizing on opportunistic data for monitoring relative abundances of species. Biometrics, 72(2):649–658, 2016.
→ pages 133, 134

Richard Glennie, Stephen T. Buckland, and Len Thomas. The effect of animal movement on line transect estimates of abundance. PLoS ONE, 10(3):e0121333, 2015. → page 134

Richard Glennie, Stephen Terrence Buckland, Roland Langrock, Tim Gerrodette, L. T. Ballance, S. J. Chivers, and M. D. Scott. Incorporating animal movement into distance sampling. Journal of the American Statistical Association, (just-accepted):1–17, 2020. → pages 135, 144, 148, 155, 159

Virgilio Gómez-Rubio. Bayesian Inference with INLA. CRC Press, 2020. → pages 27, 38

Virgilio Gómez-Rubio and Håvard Rue. Markov chain Monte Carlo with the integrated nested Laplace approximation. Statistics and Computing, 28(5):1033–1051, 2018. → page 38

Yongtao Guan and David R. Afshartous. Test for independence between marks and points of marked point processes: a subsampling approach. Environmental and Ecological Statistics, 14(2):101–111, 2007. → pages 98, 128

Marc Hallin, Zudi Lu, and Lanh T. Tran. Kernel density estimation for spatial processes: the L1 theory. Journal of Multivariate Analysis, 88(1):61–75, 2004. → page 6

Ephraim M. Hanks, Erin M. Schliep, Mevin B. Hooten, and Jennifer A. Hoeting. Restricted spatial regression in practice: geostatistical models, confounding, and robustness under model misspecification. Environmetrics, 26(4):243–254, 2015. → page 218

Donna D. W. Hauser, Glenn R. Van Blaricom, Elizabeth E. Holmes, and Richard W. Osborne. Evaluating the use of whalewatch data in determining killer whale (Orcinus orca) distribution patterns. Journal of Cetacean Research and Management, 8(3):273, 2006. → pages 137, 139

Donna D. W. Hauser, Miles G. Logsdon, Elizabeth E. Holmes, Glenn R. VanBlaricom, and Richard W. Osborne. Summer distribution patterns of southern resident killer whales Orcinus orca: core areas and spatial segregation of social groups. Marine Ecology Progress Series, 351:301–310, 2007. → pages 137, 138, 139, 167

James J. Heckman. Selection bias and self-selection. In Econometrics, pages 201–224. Springer, 1990.
→ page 1

Trevor J. Hefley and Mevin B. Hooten. Hierarchical species distribution models. Current Landscape Ecology Reports, 1(2):87–97, 2016. → pages 22, 24, 134, 154, 239

Miguel A. Hernán and James M. Robins. Causal Inference: What If, 2020. → pages 1, 151, 152, 237

Nigel E. Hussey, Steven T. Kessel, Kim Aarestrup, Steven J. Cooke, Paul D. Cowley, Aaron T. Fisk, Robert G. Harcourt, Kim N. Holland, Sara J. Iverson, John F. Kocik, et al. Aquatic animal telemetry: a panoramic window into the underwater world. Science, 348(6240), 2015. → page 133

Janine Illian, Antti Penttinen, Helga Stoyan, and Dietrich Stoyan. Statistical Analysis and Modelling of Spatial Point Patterns, volume 70. John Wiley & Sons, 2008. → pages 10, 101, 102, 103

E. H. Isaaks and R. Mohan Srivastava. Spatial continuity measures for probabilistic and deterministic geostatistics. Mathematical Geology, 20(4):313–341, 1988. → page 46

Devin S. Johnson and Josh M. London. crawl: an R package for fitting continuous-time correlated random walk models to animal movement data, 2018. → page 162

Devin S. Johnson, Josh M. London, Mary-Anne Lea, and John W. Durban. Continuous-time correlated random walk model for animal telemetry data. Ecology, 89(5):1208–1215, 2008. doi:10.1890/07-1032.1. → page 162

Devin S. Johnson, Mevin B. Hooten, and Carey E. Kuhn. Estimating animal resource selection from telemetry data using point process models. Journal of Animal Ecology, 82(6):1155–1164, 2013. → pages 132, 134

Olatunji Johnson, Peter Diggle, and Emanuele Giorgi. A spatially discrete approximation to log-Gaussian Cox processes for modelling aggregated disease count data. Statistics in Medicine, 38(24):4871–4887, 2019. → page 22

Matthias Katzfuss, Joseph Guinness, Wenlong Gong, and Daniel Zilber. Vecchia approximations of Gaussian-process predictions. Journal of Agricultural, Biological and Environmental Statistics, pages 1–32, 2020. → page 10

J. F. Kingman. Poisson Processes. Oxford Studies in Probability. Oxford: Oxford University Press, 1994.
→ page 227

Vira Koshkina, Yan Wang, Ascelin Gordon, Robert M. Dorazio, Matt White, and Lewi Stone. Integrated species distribution models: combining presence-background data and site-occupancy data with imperfect detection. Methods in Ecology and Evolution, 8(4):420–430, 2017. → pages 133, 134, 153, 154, 171

Elias T. Krainski, Virgilio Gómez-Rubio, Haakon Bakka, Amanda Lenzi, Daniela Castro-Camilo, Daniel Simpson, Finn Lindgren, and Håvard Rue. Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA. CRC Press, 2018. → pages 36, 37

Kalimuthu Krishnamoorthy. Handbook of Statistical Distributions with Applications. CRC Press, 2016. → page 28

A. Lawrence Gould, Mark Ernest Boye, Michael J. Crowther, Joseph G. Ibrahim, George Quartey, Sandrine Micallef, and Frederic Y. Bois. Joint modeling of survival and longitudinal non-survival data: current methods and issues. Report of the DIA Bayesian joint modeling working group. Statistics in Medicine, 34(14):2181–2195, 2015. → page 89

N. D. Le and J. V. Zidek. Statistical Analysis of Environmental Space-Time Processes. Springer Verlag, 2006. → pages 6, 29

A. Lee, A. Szpiro, S. Y. Kim, and L. Sheppard. Impact of preferential sampling on exposure prediction and health effect inference in the context of air pollution epidemiology. Environmetrics, 26(4):255–267, 2015. → pages 97, 176, 177

Subhash R. Lele, Evelyn H. Merrill, Jonah Keim, and Mark S. Boyce. Selection, use, choice and occupancy: clarifying concepts in resource selection studies. Journal of Animal Ecology, 82(6):1183–1191, 2013. → page 132

Qiuju Li and Li Su. Accommodating informative dropout and death: a joint modelling approach for longitudinal and semicompeting risks data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 67(1):145–163, 2018. → page 89

F. Lindgren, H. Rue, and J. Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach.
Journal of the Royal Statistical Society, Series B, to appear, 2011a. → pages 57, 178

Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011b. → pages 9, 10, 37, 45, 124, 166, 201, 213, 243

Finn Lindgren, Håvard Rue, et al. Bayesian spatial modelling with R-INLA. Journal of Statistical Software, 63(19):1–25, 2015. → pages 124, 166, 213, 243

Nicola Loperfido and Peter Guttorp. Network bias in air quality monitoring design. Environmetrics, 19(7):661–671, 2008. → pages 2, 42, 97

B. Matérn. Doubly stochastic Poisson processes in the plane. Statistical Ecology, 1:195–213, 1971. → page 107

Robert McMillan and Joshua Murphy. Measuring the effects of severe air pollution: evidence from the UK Clean Air Act. June 2017. → pages 71, 82

Dana Michalcová, Samuel Lvončik, Milan Chytrý, and Ondřej Hájek. Bias in vegetation databases? A comparison of stratified-random and preferential sampling. Journal of Vegetation Science, 22(2):281–291, 2011. → page 46

David A. W. Miller, Krishna Pacifici, Jamie S. Sanderlin, and Brian J. Reich. The recent past and promising future for data integration methods to estimate species’ distributions. Methods in Ecology and Evolution, 10(1):22–37, 2019. → pages 133, 134, 171

Rua S. Mordecai, Brady J. Mattsson, Caleb J. Tzilkowski, and Robert J. Cooper. Addressing challenges when studying mobile or episodic species: hierarchical Bayes estimation of occupancy and use. Journal of Applied Ecology, 48(1):56–66, 2011. → page 132

Tomáš Mrkvička, Mari Myllymäki, and Ute Hahn. Multiple Monte Carlo testing, with applications in spatial point processes. Statistics and Computing, 27(5):1239–1255, 2017. → page 112

Mari Myllymäki, Tomáš Mrkvička, Pavel Grabarnik, Henri Seijo, and Ute Hahn. Global envelope tests for spatial processes.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):381–404, 2017. → page 112

NOAA. Endangered Species Act status of Puget Sound killer whales. Accessed: 2019-09-27. → page 136

Canadian Council of Ministers of the Environment. Ambient air monitoring protocol for PM2.5 and ozone. Canada-wide standards for particulate matter and ozone. Technical report, Canadian Council of Ministers of the Environment, 2011. → page 177

Cabinet Office. Modernising government, 1999. → page 178

Ricardo A. Olea. Declustering of clustered preferential sampling for histogram and semivariogram inference. Mathematical Geology, 39(5):453–467, 2007. → page 46

Jennifer K. Olson, Jason Wood, Richard W. Osborne, Lance Barrett-Lennard, and Shawn Larson. Sightings of southern resident killer whales in the Salish Sea 1976–2014: the importance of a long-term opportunistic dataset. Endangered Species Research, 37:105–118, 2018. → pages 138, 139

United States Commission on Evidence-Based Policymaking. The Promise of Evidence-Based Policymaking: Report of the Commission on Evidence-Based Policymaking. Commission on Evidence-Based Policymaking, 2017. → page 178

Lucia Paci, Alan E. Gelfand, María Asunción Beamonte, Pilar Gargallo, and Manuel Salvador. Spatial hedonic modelling adjusted for preferential sampling. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(1):169–192, 2020. → page 2

Krishna Pacifici, Brian J. Reich, David A. W. Miller, Beth Gardner, Glenn Stauffer, Susheela Singh, Alexa McKerrow, and Jaime A. Collazo. Integrating multiple data sources in species distribution modeling: a framework for data fusion. Ecology, 98(3):840–850, 2017. → pages 133, 135, 145

Debdeep Pati, Brian J. Reich, and David B. Dunson. Bayesian geostatistical modelling with informative sampling locations. Biometrika, 98(1):35–48, 2011. → pages 47, 51, 64

Edzer Pebesma. Simple features for R: standardized support for spatial vector data. The R Journal, 10(1):439–446, 2018. doi:10.32614/RJ-2018-009.
→ page 128

Edzer J. Pebesma and Roger S. Bivand. Classes and methods for spatial data in R. R News, 5(2):9–13, November 2005. → pages 128, 172, 250

Maria Grazia Pennino, Iosu Paradinas, Janine B. Illian, Facundo Muñoz, José María Bellido, Antonio López-Quílez, and David Conesa. Accounting for preferential sampling in species distribution models. Ecology and Evolution, 9(1):653–663, 2019. → pages 2, 97, 128, 134, 153, 176, 178, 240

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019. → pages 136, 166, 172, 250

S. Reis, T. Liska, S. Steinle, E. Carnell, D. Leaver, E. Roberts, M. Vieno, R. Beck, and U. Dragosits. UK gridded population 2011 based on Census 2011 and Land Cover Map 2015. NERC Environmental Information Data Centre, 2017. → pages 82, 124

Ian W. Renner and David I. Warton. Equivalence of MAXENT and Poisson point process models for species distribution modeling in ecology. Biometrics, 69(1):274–281, 2013. → page 167

Ian W. Renner, Jane Elith, Adrian Baddeley, William Fithian, Trevor Hastie, Steven J. Phillips, Gordana Popovic, and David I. Warton. Point process models for presence-only analysis. Methods in Ecology and Evolution, 6(4):366–379, 2015. → page 133

Christian Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Science & Business Media, 2007. → page 29

J. Andrew Royle, Marc Kery, and Jerome Guelat. Spatial capture-recapture models for search-encounter data. Methods in Ecology and Evolution, 2(6):602–611, 2011. → pages 132, 133

H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009. → pages 30, 33, 35, 37, 38, 45, 57, 124, 166, 178, 201, 213, 243

Håvard Rue and Leonhard Held. Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/CRC, 2005.
→ page 9

Håvard Rue, Andrea Riebler, Sigrunn H. Sørbye, Janine B. Illian, Daniel P. Simpson, and Finn K. Lindgren. Bayesian computing with INLA: a review. Annual Review of Statistics and Its Application, 4:395–421, 2017. → pages 37, 45, 57, 201

Ramiro Ruiz-Cárdenas, Elias T. Krainski, and Håvard Rue. Direct fitting of dynamic models using integrated nested Laplace approximations (INLA). Computational Statistics & Data Analysis, 56(6):1808–1828, 2012. → page 199

Mark J. Schervish. Theory of Statistics. Springer Science & Business Media, 2012. → page 37

Martin Schlather, Paulo J. Ribeiro, and Peter J. Diggle. Detecting dependence between marks and locations of marked point processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):79–93, 2004. → pages 11, 97, 101, 128

P. Schumacher and J. V. Zidek. Using prior information in designing intervention detection experiments. The Annals of Statistics, pages 447–463, 1993. → pages 2, 42, 97

Elizabeth Seely, Richard W. Osborne, Kari Koski, and Shawn Larson. Soundwatch: eighteen years of monitoring whale watch vessel activities in the Salish Sea. PLoS ONE, 12(12):e0189764, 2017. → pages 138, 163

T. Sellke, M. J. Bayarri, and J. O. Berger. Calibration of p-values for precise null hypotheses. The American Statistician, 2001. → page 114

Gavin Shaddick and James V. Zidek. A case study in preferential sampling: long term monitoring of air pollution in the UK. Spatial Statistics, 9:51–65, 2014. → pages 42, 53, 54, 58, 60, 87, 123, 198

Gavin Shaddick, James V. Zidek, and Yi Liu. Mitigating the effects of preferentially selected monitoring sites for environmental policy and health risk analysis. Spatial and Spatio-temporal Epidemiology, 18:44–52, 2016. → pages 2, 176

Gavin Shaddick, Matthew L. Thomas, Amelia Green, Michael Brauer, Aaron van Donkelaar, Rick Burnett, Howard H. Chang, Aaron Cohen, Rita Van Dingenen, Carlos Dora, et al.
Data integration model for air quality: a hierarchical approach to the global estimation of exposures to ambient air pollution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 67(1):231–253, 2018. → page 97

R.A. Simons. ERDDAP, 2019. → page 166

Daniel Simpson, Janine B Illian, Finn Lindgren, Sigrunn H Sørbye, and Håvard Rue. Going off grid: Computationally efficient inference for log-Gaussian Cox processes. Biometrika, 103(1):49–70, 2016. → pages 20, 21, 23, 103, 105, 143, 145, 242

Daniel Simpson, Håvard Rue, Andrea Riebler, Thiago G Martins, Sigrunn H Sørbye, et al. Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1):1–28, 2017. → pages 29, 69, 197

David J Spiegelhalter, Nicola G Best, Bradley P Carlin, and Angelika Van Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–639, 2002. → page 167

Sophie Sutcliffe and Julius Court. A toolkit for progressive policymakers in developing countries. Research and Policy in Development Programme, 2006. → page 178

P Switzer. Estimation of spatial distributions from point sources with application to air pollution measurement. Technical Report No. 9, Stanford Univ., CA (USA), Dept. of Statistics, 1977. → page 46

Benjamin M Taylor and Peter J Diggle. INLA or MCMC? A tutorial and comparative evaluation for spatial prediction in log-Gaussian Cox processes. Journal of Statistical Computation and Simulation, 84(10):2266–2284, 2014. → pages 36, 38

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. → page 29

Maria Nicolette Margaretha van Lieshout. Theory of Spatial Statistics: A Concise Introduction. CRC Press, 2019. → pages 6, 7, 10, 11

Vancouver Aquarium. BC Cetacean Sightings Network. Accessed: 2019-03-29.
→ page 138

DJ von Lehmden and C Nelson. Quality assurance handbook for air pollution measurement systems. Volume II: Ambient air specific methods. Technical report, Environmental Protection Agency, Research Triangle Park, NC (USA), 1977. → page 177

David I Warton, Leah C Shepherd, et al. Poisson point process models solve the "pseudo-absence problem" for presence-only data in ecology. The Annals of Applied Statistics, 4(3):1383–1402, 2010. → pages 30, 41, 44, 49, 66, 133, 200

David I Warton, Ian W Renner, and Daniel Ramp. Model-based control of observer bias for the analysis of presence-only data in ecology. PLoS ONE, 8(11):e79168, 2013. → page 33

Ronald L Wasserstein and Nicole A Lazar. The ASA statement on p-values: context, process, and purpose, 2016. → page 114

Kim Whoriskey, Eduardo G Martins, Marie Auger-Méthé, Lee FG Gutowsky, Robert J Lennox, Steven J Cooke, Michael Power, and Joanna Mills Flemming. Current and emerging statistical techniques for aquatic telemetry data: A guide to analysing spatially discrete animal detections. Methods in Ecology and Evolution, 10(7):935–948, 2019. → page 133

Christopher K Wikle. Modern perspectives on statistics for spatio-temporal data. Wiley Interdisciplinary Reviews: Computational Statistics, 7(1):86–98, 2015. → page 6

Simon N Wood, Zheyuan Li, Gavin Shaddick, and Nicole H Augustin. Generalized additive models for gigadata: modeling the UK black smoke network daily data. Journal of the American Statistical Association, 112(519):1199–1210, 2017. → page 6

Brian J Worton. Kernel methods for estimating the utilization distribution in home-range studies. Ecology, 70(1):164–168, 1989. → page 132

Lang Wu. Mixed effects models for complex data. Chapman and Hall/CRC, 2009. → pages 13, 89, 180

Yuan Yuan, Fabian E Bachl, Finn Lindgren, David L Borchers, Janine B Illian, Stephen T Buckland, Haavard Rue, Tim Gerrodette, et al. Point process models for spatio-temporal distance sampling data from a large-scale survey of blue whales.
The Annals of Applied Statistics, 11(4):2270–2297, 2017. → pages 21, 133, 134, 135, 145, 147, 154, 166, 244

H. Zhang. Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. Journal of the American Statistical Association, 99(465):250–261, 2004. ISSN 0162-1459. → pages 29, 199

Sha Zhe. Mesh creation including coastlines, Aug 2017. URL → page 198

J.V. Zidek, G. Shaddick, and C.G. Taylor. Reducing estimation bias in adaptively changing monitoring networks with preferential site selection. The Annals of Applied Statistics, to appear, 2014. → pages 2, 82, 87, 99

Zoltán Botta-Dukát, Edit Kovács-Láng, Tamás Rédei, Miklós Kertész, and János Garadnai. Statistical and biological consequences of preferential sampling in phytosociology: theoretical considerations and a case study. Folia Geobotanica, 42(2):141–152, 2007. → page 46

Appendix A

Supporting Materials

A.1 Chapter 3 Supporting Materials

A.1.1 Chosen priors for the case study

For the Y process, we used weakly informative Gaussian priors for the γ_k's. We used a Gamma(a, b) prior for the precision parameter 1/σ², where a denotes the shape parameter and b denotes the inverse-scale parameter. We chose a = 1 and b = 5 × 10⁻⁵. Under this parameterisation, the mean and variance of this distribution are a/b and a/b² respectively. Thus this prior assumption allows both very large and very small variances of the response to exist. Next, a 2D Wishart distribution is assumed for Σ_b⁻¹ with four degrees of freedom. The prior matrix is given zero off-diagonal elements and diagonal values of 1. This results in a prior mean for the two variance terms of the random effects (σ²_{b,1}, σ²_{b,2}) of 4, with a prior variance for these terms equal to 8. The prior mean for the correlation term is 0, with variance for the logit transform of the correlation equal to 4. This allows for random effects with a large range of magnitudes and correlation structures to exist.
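As a quick numerical sanity check on the Gamma(a, b) prior above (a sketch in Python rather than the R-INLA code used in the thesis), note that the shape/inverse-scale (rate) parameterisation gives mean a/b and variance a/b², while NumPy's sampler takes shape and scale = 1/b:

```python
# The Gamma(a, b) prior on the precision 1/sigma^2 uses the shape /
# inverse-scale (i.e. rate) parameterisation: mean = a/b, variance = a/b^2.
# NumPy's gamma sampler instead takes shape and scale = 1/b.
import numpy as np

a, b = 1.0, 5e-5
prior_mean = a / b        # = 2e4: very large precisions (tiny variances) are plausible
prior_var = a / b**2      # = 4e8: and so are very small precisions (huge variances)

rng = np.random.default_rng(0)
draws = rng.gamma(shape=a, scale=1.0 / b, size=200_000)
print(prior_mean, draws.mean())   # empirical mean agrees with a/b
```

The very large prior variance is what makes this prior weakly informative about the response variance.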
We place the PC joint priors [Fuglstad et al., 2019, Simpson et al., 2017] on the two hyperparameters for the 3 independent Matérn realisations, with prior belief that the 5th percentile for the range is 3.4 km (a fifth of the smallest range found in previous analyses) and the 99th percentile for the standard deviation of each field is 1 (noting that the data have been transformed). We fix the Matérn roughness parameter to equal 1, since this is the largest smoothness value currently implemented in R-INLA, and we assume a priori that the medium-range pollution process will be reasonably smooth. The lower prior bound on the range parameter, combined with the probabilistic upper bound on the variance, should help prevent the model from collapsing into a state that over-fits the data.

For the site-selection process R, our choice of priors follows the same objectives as for the observation process. Weakly informative Gaussian priors were placed on all the α terms. The same PC prior chosen for the observation process was placed on the β*_0(s) field. For the first-order autoregressive term β*_1(t), we placed a Gamma(1, 5 × 10⁻⁵) prior on the marginal precision, and a N(0, 0.15) prior was placed on the logit of the lag-1 correlation (i.e. on log((1 + ρ_a)/(1 − ρ_a))) to allow for a large degree of flexibility. Finally, we consider two different sets of priors for the PS parameters d_b, d_β. For Implementation 1 we constrain these to equal 0, which we can view as placing a point-mass prior at 0. For Implementations 2 and 3, we assign a N(0, 10) prior to allow PS to be detected.

A.1.2 Details on the R-INLA implementation

We used the estimated ranges from Shaddick and Zidek [2014] to construct the Delaunay triangulation mesh required for use in R-INLA.
Following the advice of Bakka [2017] and Zhe [2017], and trading it off against the need to maintain a reasonable computation time, we set the edge lengths of the triangles throughout the domain to be around 5 km, less than the minimum estimated range of 17 km found in Shaddick and Zidek [2014]. This is important, since it has been shown that the length of the triangle edges must be less than the range of any Matérn field, and should ideally be less than a quarter of it. Failure to do so leads to large errors in the approximation of the Gaussian random field. We are confident that with our choice of mesh, any changes to the inference in the unsampled regions will be a direct result of our joint model framework, and not due to any undesirable artifacts caused by a poor choice of triangulation mesh for the SPDE approximation.

It is well known that an empirical Bayes or maximum likelihood approach does not fully account for the uncertainties in the hyperparameters when performing predictions and inference, and these may be high in spatially correlated Gaussian random fields [Zhang, 2004]. Interestingly, for this dataset we compared the fully Bayesian approach with the empirical Bayes method using R-INLA and found little difference. The posterior credible intervals for the latent effects and parameters were slightly wider under the fully Bayesian approach; however, the posterior credible intervals for the predictions were almost identical. Additionally, we used the empirical Bayes approach in a small simulation study with good results. Thus, for computational savings, we opted to consider only empirical Bayes methods.

In R-INLA, copying across a linear combination of latent processes (potentially from a different time point) requires the use of dummy variables. In particular, the idea of Ruiz-Cárdenas et al. [2012] is required.
This simply involves creating infinite-precision Gaussian variables with observed values of zero, whose linear predictor is set equal to the (negative) linear combination of latent processes desired, plus an infinite-variance random intercept process. It is not hard to see that the values of these random intercepts equal precisely the values of the desired linear combination of processes. This approach proved vital for fitting Implementations 2 and 3.

Note that, in essence, for Implementation 3 we are modelling the initial site-placement process as a LGCP, but using a Bernoulli likelihood as a pseudo-likelihood instead of the usual Poisson likelihood to form the computational approximation. We use the conditional logistic regression approach, commonly used to fit Poisson point processes, placing the zeros in a regular (not a latticed) manner throughout Ω, independent of the observed site locations. In practice, we created a reasonably regular Delaunay triangulation mesh in R-INLA throughout Ω for our GMRF, with mesh vertices placed independent of the observed site locations. Regularity was enforced through a combination of the choices of a minimum vertex length of 5 km, an upper vertex length of 7 km, and a minimum angle of 25 degrees. We then used the created mesh vertices as our pseudo-sites.

A somewhat undesirable property of the logistic regression approach is that the likelihood value does not converge as the number of pseudo-zeros tends towards infinity. Thus, unlike with the Poisson approximation to a point process, convergence must instead be judged by the convergence of the fixed parameter estimates, excluding the estimate of the intercept. However, if the Poisson approximation is chosen, then it cannot be used to simultaneously model the retention process alongside the site-placement process, and hence a third Bernoulli likelihood modelling the retention process would be required. Thus, in either case, there is a trade-off.
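The dummy-variable device described at the start of this section can be written out algebraically (our notation, not the thesis's; η denotes the linear combination of latent processes to be copied, and u the extra random intercept):

```latex
\underbrace{0}_{\text{dummy datum}} \;=\; -\,\eta(s,t) \;+\; u(s,t) \;+\; \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0,\tau^{-1}),\ \tau \to \infty,
\qquad u(s,t) \sim \mathcal{N}(0,\kappa^{-1}),\ \kappa \to 0 .
```

In the infinite-precision limit, the constraint 0 = −η + u holds exactly, so u(s, t) = η(s, t), and u can then be reused ("copied") in the linear predictor of another likelihood.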
Given that the computational time required to fit the model in R-INLA using the SPDE approach is affected more by the resolution of the computational mesh than by the number of observations, we can increase the density of the pseudo-sites with a reasonably small effect on the total computation time.

Thus, for fitting Implementation 3, we follow the advice given in the literature [Fithian and Hastie, 2013, Warton et al., 2010]: we repeatedly re-fit the joint model on an ever-increasing density of pseudo-sites until the parameters and predictions converge. We found that all estimates, except of course the site-selection intercept, stabilized once the average distance between pseudo-sites was decreased to 5 km. This supports the claim that the estimates from our model are close to those of the joint triple model with a LGCP for the site-selection process, a Bernoulli likelihood for the site-retention process, and a Gaussian process for the observation process.

The correct placement of the zeros in the site-selection process is vital for the asymptotic convergence of the pseudo-likelihood to the LGCP. In particular, the asymptotics of the conditional logistic regression approximation used in our example with the logit link are only established when the zeros are either a realisation of a homogeneous Poisson point process, independent of the monitoring site locations [Baddeley et al., 2015], or when they are placed uniformly throughout the domain Ω [Warton et al., 2010]. In either case, the density of the zeros must be uniform (at least in probability) throughout Ω for each year j ∈ {1, ..., 31}, and the zeros must be placed independent of the observation locations.

A direct consequence of this is that for our site-selection process, we should not consider for selection at time j the subset of observed sites (i.e. the subset of Population 1) that are offline at year j (i.e. S^C_{t_j}). Put differently, we should not include in the likelihood the R_{i,j}'s in Population 1 such that r_{i,j} = 0.
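The logistic pseudo-likelihood device, and the practice of increasing the pseudo-site density until the non-intercept parameters stabilise, can be sketched as follows. This is our own toy Python illustration (the thesis uses R-INLA); the covariate surface, sampler and grid resolutions are invented for the example:

```python
# Sketch (ours) of the conditional logistic regression device for fitting a
# point process: observed points are the 1s and a regular grid of pseudo-sites
# supplies the 0s. As the pseudo-site density grows, the covariate coefficient
# stabilises while the intercept keeps drifting, which is why convergence is
# judged on the non-intercept parameters.
import numpy as np

rng = np.random.default_rng(1)

def covariate(xy):
    # A known covariate surface on the unit square (illustrative only).
    return 3.0 * xy[:, 0]

# "Observed" locations: thin a dense uniform proposal with probability
# proportional to exp(covariate), a crude inhomogeneous point sampler.
prop = rng.uniform(size=(4000, 2))
sites = prop[rng.uniform(size=4000) < np.exp(covariate(prop) - 3.0)]

def fit_logistic(X, y, iters=25):
    # Newton-Raphson for an unregularised logistic regression.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
    return beta

def fit_with_pseudo_sites(m):
    # Pseudo-sites on a regular m x m grid play the role of the 0s.
    g = (np.arange(m) + 0.5) / m
    grid = np.array([(u, v) for u in g for v in g])
    xy = np.vstack([sites, grid])
    y = np.r_[np.ones(len(sites)), np.zeros(len(grid))]
    X = np.column_stack([np.ones(len(xy)), covariate(xy)])
    return fit_logistic(X, y)

for m in (10, 20, 40):
    b0, b1 = fit_with_pseudo_sites(m)
    print(f"m={m:2d}  intercept={b0:6.2f}  slope={b1:5.2f}")
```

Because the true log-intensity is linear in the covariate with unit slope here, the slope estimate settles near 1 across grid resolutions, while the intercept decreases with the log of the pseudo-site density, mirroring the behaviour described above.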
Erroneously doing so would lead to an increased density of zeros in the heavily sampled regions, and thus a 'preferential sample' of zeros. Similarly, for the site-retention process at time j, we should only consider the sites online at the previous time j − 1.

Putting these two processes together, the only zeros that should contribute to the joint Bernoulli likelihood at time j are the pseudo-sites and the sites that were online at the previous time j − 1 and were removed from the network at time j. In fact, we tested the sensitivity of the results to the above, re-fitting the model once following this advice, and again ignoring it and considering all the observed sites (operational and offline) for selection at each time step j, along with the pseudo-sites. Despite the former being more appropriate, we found no differences in estimates, but the latter required a higher density of pseudo-sites, and hence an increased computational cost, to reduce this bias in the parameter estimates. This advice is therefore of most importance for the modelling of very large datasets, where the number of unique observed site locations through time could be much higher than seen here.

To form all of our predictions and maps, we simulated 1000 MCMC samples of all the parameters and latent effects from the fitted models. This feature is available in the R-INLA package [Lindgren et al., 2011b, Rue et al., 2009, 2017], by simply saving all the configuration settings generated by the software when fitting the model. We then formed all the site-specific trajectories by appropriately combining all latent effects and parameters in the linear predictor. We computed the mean, the empirical upper 97.5% and empirical lower 2.5% values of the 1000 linear predictor estimates to form our point estimates and credible intervals.
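The bookkeeping rule above can be sketched as a small function (a toy Python illustration with our own invented data structure, not the thesis code):

```python
# Sketch of the rule above: at time j, the 0s entering the joint Bernoulli
# likelihood are (a) the pseudo-sites, and (b) sites that were online at
# time j-1 but removed at time j. Sites already offline before j-1 are excluded.
def zeros_at_time(j, online, pseudo_sites):
    """online[s] is the set of times at which site s was operational."""
    removed = [s for s, times in online.items() if (j - 1) in times and j not in times]
    return sorted(removed) + sorted(pseudo_sites)

online = {
    "A": {1, 2, 3},   # stays online: never contributes a 0
    "B": {1, 2},      # removed at time 3 -> contributes a 0 at j = 3
    "C": {1},         # offline since time 2 -> must NOT contribute at j = 3
}
print(zeros_at_time(3, online, ["p1", "p2"]))   # ['B', 'p1', 'p2']
```

Including site "C" at j = 3 is exactly the error warned against above: it would concentrate extra zeros wherever sites were historically placed.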
Finally, to obtain the map of the pointwise expectations of the predictive distribution across GB, we used the MCMC samples of the latent effects and parameters (minus the IID site-specific effects) and linearly interpolated the estimated field throughout Ω on a regular lattice grid covering the map of GB, before taking the empirical mean and standard deviation across the 1000 maps. To compute the average BS across the whole of GB, we computed the mean (averaging across the pixels) of each of the 1000 sampled/realized maps. Then, we computed the mean, the empirical 2.5% and the empirical 97.5% values of these 1000 (mean) values.

A.1.3 Posterior pointwise mean and pointwise standard deviation plots

Figure A.1: A plot of the posterior mean black smoke in 1966 and 1996 under Implementation 1, with corresponding standard errors plotted below. Note that for visualisation purposes, the two plots have had their values scaled to put them on the same colour scale.

Figure A.2: A plot of the posterior mean black smoke in 1966 and 1996, with corresponding standard errors plotted below. Estimates are taken from Implementation 2. Note that for visualisation purposes, the two plots have had their values scaled to put them on the same colour scale.

Figure A.3: A plot of the posterior mean black smoke in 1966 and 1996, with corresponding standard errors plotted below. Estimates are taken from Implementation 3. Note that for visualisation purposes, the two plots have had their values scaled to put them on the same colour scale. Notice the large drop in estimated BS throughout much of the region.

A.1.4 Additional plot of the exceedance of the annual black smoke EU guide value

Figure A.4: A plot showing the posterior proportion of the total surface area of Great Britain with annual average black smoke level exceeding the EU guide value of 34 µg m⁻³. Shown are the results from Implementation 2 (the red solid line) and Implementation 3 (the blue dashed line).
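The posterior summary step described at the top of this passage can be sketched in a few lines (a toy Python illustration with synthetic numbers, not the thesis output):

```python
# Sketch of the posterior summaries above: given 1000 sampled maps
# (rows = MCMC samples, columns = grid pixels), form the pointwise mean and
# standard deviation maps, and a 95% credible interval for the GB-wide average.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=50.0, scale=5.0, size=(1000, 400))  # 1000 maps, 400 pixels

pointwise_mean = samples.mean(axis=0)   # one value per pixel
pointwise_sd = samples.std(axis=0)

gb_means = samples.mean(axis=1)         # one GB-wide average per sample
lo, hi = np.percentile(gb_means, [2.5, 97.5])
print(round(gb_means.mean(), 1), round(lo, 1), round(hi, 1))
```

The key point is the order of operations: average over pixels within each sampled map first, then summarise across the 1000 samples.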
Note that the line for Implementation 1 is almost identical to that from Implementation 2, and so is omitted.

A.1.5 Additional plot of annual average black smoke levels

Figure A.5: Implementation 2. In green are the model-averaged BS levels averaged over sites that were selected in P1 (i.e. operational) at time t. In contrast, those in red are the model-averaged BS levels averaged over sites that were not selected in P1 (i.e. offline) at time t. Finally, in blue are the model-averaged BS levels averaged across Great Britain. Also included with the posterior mean values are their 95% posterior credible intervals. If printed in black-and-white, the green band is initially the lower line, the red band is the upper line, and the blue band is initially the middle line.

A.1.6 Model diagnostic plots

We include, for each of the three implementations considered in this chapter, residual plots to help diagnose poor model fit. Included are residuals vs. year plots, with a fitted lowess smoother, to help show that the choice of a quadratic model adequately captured the temporal trend in the data. Also shown are normal Q-Q plots of the residuals, with fitted 99% confidence bands around the overlain Q-Q line. Residuals are computed with respect to the posterior mean values. It is clear from this plot that a heavier-tailed distribution on the response would have been more suitable. Finally, we include histograms and normal Q-Q plots of the random effects. Here we see slightly left-skewed and right-skewed empirical marginal distributions for the random intercepts and slopes respectively. We have no strong cause for concern with these final plots.

Figure A.6: A plot of the residuals vs.
year from Implementation 1, with a fitted smoother.

Figure A.7: A Normal Q-Q plot of the residuals from Implementation 1.

Figure A.8: Histograms of the spatially-uncorrelated random intercepts (top left) and slopes (bottom left), with corresponding Normal Q-Q plots shown on the right, from Implementation 1.

Figure A.9: A plot of the residuals vs. year for Implementation 2, with a fitted smoother.

Figure A.10: A Normal Q-Q plot of the residuals from Implementation 2, with 95% confidence intervals shown in red.

Figure A.11: Histograms of the spatially-uncorrelated random intercepts (top left) and slopes (bottom left), with corresponding Normal Q-Q plots shown on the right, from Implementation 2.

Figure A.12: A plot of the residuals vs. year for Implementation 3, with a fitted smoother.

Figure A.13: A Normal Q-Q plot of the residuals from Implementation 3, with 95% confidence intervals shown in red.

Figure A.14: Histograms of the spatially-uncorrelated random intercepts (top left) and slopes (bottom right), with corresponding Normal Q-Q plots shown on the right, from Implementation 3.

A.2 Chapter 4 Supporting Materials

A.2.1 More details on the simulation study

For computational speed-ups, the R-INLA package is used for both the simulation and estimation of Z(s) [Lindgren et al., 2011b, 2015, Rue et al., 2009]. A high-resolution triangulation mesh (triangle lengths of 0.01) is defined for the SPDE approximation over the unit square Ω, and linear interpolation is used to impute the values at any point s within Ω. For the priors on the Gaussian process, PC priors [Fuglstad et al., 2018] are specified, with prior probability of 0.1 that the spatial range is less than 0.1, and prior probability of 0.1 that the standard deviation is greater than 3. Gaussian errors on the responses Y(s, t) are added, with a weakly informative Gamma(1, 5 × 10⁻⁵) distribution placed on the precision of the error distribution. This is done to reduce the risk of computational singularities.
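The tail-probability calibration of these PC priors can be checked numerically. The sketch below (ours, in Python; the thesis uses R-INLA's implementation) assumes the parameterisation of Fuglstad et al.: for a 2-D Matérn field, P(range < ρ₀) = α corresponds to λ_ρ = −ln(α)·ρ₀ with CDF P(range < r) = exp(−λ_ρ/r), and P(sd > σ₀) = α corresponds to an exponential prior on the standard deviation with rate −ln(α)/σ₀:

```python
# Simulate from the (assumed) PC priors and confirm the two tail
# probabilities quoted above: P(range < 0.1) = 0.1 and P(sd > 3) = 0.1.
import numpy as np

alpha = 0.1
rho0, sigma0 = 0.1, 3.0
lam_rho = -np.log(alpha) * rho0      # rate for the range prior (d = 2)
lam_sig = -np.log(alpha) / sigma0    # rate for the exponential sd prior

rng = np.random.default_rng(3)
u = rng.uniform(size=500_000)
rho = -lam_rho / np.log(u)                           # inverse-CDF draw for the range
sigma = rng.exponential(1.0 / lam_sig, size=500_000) # numpy takes scale = 1/rate

print(np.mean(rho < rho0), np.mean(sigma > sigma0))  # both close to 0.1
```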
The NN test is performed at the 5% significance level using 19 Monte Carlo samples (i.e. M = 19). Each experimental setting is repeated 200 times.

Along with the NN test outlined in Algorithm 1, a Monte Carlo test using the rank correlations between estimates of Z(s) and estimates of the raw residuals of the assumed IPP under the null hypothesis is also compared. This may provide a more suitable test when the assumed sampling mechanism is indeed a LGCP. Furthermore, such a test does not require a choice of K. To compute these residual values, an edge-corrected Gaussian kernel is used to smooth the raw residuals. The bandwidth is selected using leave-one-out cross-validation. This is performed using the spatstat package [Baddeley et al., 2015]. To compute the test, the NN_k values are simply replaced with the smoothed residual values, evaluated at the point locations. We refer to this test as the residual test hereafter. It is interesting to assess the relative performance of the NN test, given its generality across all point processes and its applicability to the discrete spatial setting.

First, the Type 1 error of the PS tests is assessed in the simplest setting, without any covariate effects (i.e. α_1 = 0) or PS effects (i.e. γ = 0). Results from four tests are compared. The first two are residual tests. The first computes the p-value directly using a standard permutation approach, under the (false) assumption that the pairs of residuals and estimates Ẑ(s) are an IID sample from some bivariate distribution. The positive spatial correlations due to the process Z(s) violate this assumption, with the magnitude of the violation increasing with the spatial range ρ_Z. The second attempts to correct for this spatial correlation: by forming realisations from the estimated sampling process, the spatial correlation in Z(s) is accounted for. The third and fourth are NN tests. Once again, comparisons are made between the permutation-based and the Monte Carlo-based approaches.

Fig.
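The generic Monte Carlo rank-test logic used throughout this study can be sketched as follows (our own toy Python illustration with a Spearman-type rank correlation as the statistic; the variable names and the synthetic "clustering" measure are invented for the example):

```python
# Sketch of a Monte Carlo rank test: the observed statistic is ranked among
# statistics computed on M realisations simulated under the null hypothesis.
import numpy as np

def monte_carlo_p_value(t_obs, t_null):
    # Two-sided Monte Carlo p-value; with M null draws the smallest
    # attainable value is 1/(M + 1), e.g. 1/20 when M = 19.
    ge = sum(abs(t) >= abs(t_obs) for t in t_null)
    return (1 + ge) / (1 + len(t_null))

def rank_corr(a, b):
    # Spearman rank correlation, computed directly from the ranks.
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

rng = np.random.default_rng(42)
z_hat = rng.normal(size=50)                      # stands in for estimates of Z(s)
clustering = -z_hat + 0.1 * rng.normal(size=50)  # e.g. NN distances shrink as Z grows

t_obs = rank_corr(z_hat, clustering)
t_null = [rank_corr(z_hat, rng.normal(size=50)) for _ in range(19)]
print(monte_carlo_p_value(t_obs, t_null))
```

Simulating the null statistics from the fitted sampling process, rather than permuting, is what protects the test's Type 1 error when Z(s) is spatially correlated.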
A.15 shows the results for n = 50, across three increasing values of the spatial range ρ_Z and across different numbers of nearest neighbours K. It is apparent that both Monte Carlo tests attain Type 1 error at or below the 5% level. The two standard permutation tests attain a Type 1 error above the 5% level, and this increases dramatically with ρ_Z. At the highest value of ρ_Z = 1, equal to the length of the domain Ω, the Type 1 error can be higher than 40%. The results for the very low value of ρ_Z = 0.02 demonstrate that the Type 1 error approaches the nominal 5% level when the spatial correlation approaches zero. This is due to the IID assumption becoming more reasonable as Z(s) tends towards Gaussian white noise. When ρ_Z = 0.02, the prior distribution on the range parameter would reflect the case where a researcher incorrectly assumed spatially smooth data prior to the model-fitting. Fig. A.19 shows the results for n = 100. It is apparent that the Type 1 error increases with sample size for the permutation tests, while the Monte Carlo tests remain bounded above by 0.05.

[Figure A.15 appears here: panels for the high, med and low frequency fields; x-axis: number of nearest neighbours; y-axis: Type 1 error probability; legend: Residual Rank, NN Rank, Residual Rank MC, NN Rank MC.]

Figure A.15: A plot of the Type 1 error for four tests. The three boxes show the results for ρ_Z ∈ {0.02, 0.2, 1}, from left to right respectively, for a sample size of 50. The two 'Residual' tests are computed using the kernel-density smoothed values of the residuals from the fitted homogeneous Poisson processes. Leave-one-out cross-validation was used to select the bandwidth. The 'NN' tests are those based on the K nearest neighbour values. The suffix 'MC' denotes that the test has been computed from Monte Carlo realisations of the fitted point process.

Next, the power of the two Monte Carlo tests to detect a PS effect when the alternative hypothesis is true is assessed.
The behaviours of the tests are first investigated in the setting where no covariate effects exist (i.e. α_1 = 0), but where moderate positive preferential sampling occurs (i.e. γ = 1). All tests are performed with the two-sided alternative hypothesis, namely that h is a monotonic function of Z in either direction.

Fig. A.16 shows the results for n = 50, this time with spatial ranges of ρ_Z ∈ {0.2, 1}, again across K ∈ {1, ..., 15}. The power results for ρ_Z = 0.02 are omitted in Figure A.17, since the power is consistently small (< 0.1) for both. This demonstrates the need for the Z(s, t) term to be spatially smooth for the test to detect PS. It is clear that the power of the NN test is sensitive to the choice of K, especially for smaller sample sizes. Interestingly, the optimum power achieved by the NN test with respect to K depends upon both the spatial range of Z and the sample size. The higher the spatial range of Z, and hence the smoother it is, the greater the value of K required to optimize the power. The optimum choice of K also increases with the sample size, since the number of realized points per cluster increases. For example, when ρ_Z = 0.2 and the sample size is 50, the test is optimized when K = 1. This increases to K = 5 when ρ_Z = 1 and the sample size is 100. Finally, for n = 50 it appears that the NN tests have a slightly lower power than the residual measure-based test.

Next, the spatial range is fixed to be very small (ρ_Z = 0.02), and the magnitude of PS is fixed to be very high (γ = 2). This set-up leads to very small clusters forming when γ ≠ 0. The joint effects of sample size and K on the power of the NN test to detect PS are then demonstrated. Additionally, the power of the NN test is compared to that of the residual test. Three plots are shown to present the power vs. K in Fig. A.17.
From left to right, these show the results for sample sizes of 50, 100 and 250.

[Figure A.16 appears here: columns for the med and low frequency fields; rows for N = 50 and N = 100; x-axis: number of nearest neighbours; y-axis: power; legend: Residual Rank MC, NN Rank MC.]

Figure A.16: A plot of the power for two tests when the PS parameter γ equals 1. The two columns show the results for ρ_Z ∈ {0.2, 1}, from left to right respectively. The two rows show results for a sample size of 50 and 100 respectively. The 'Residual' tests are computed using the kernel-density smoothed values of the residuals from the fitted homogeneous Poisson processes. Leave-one-out cross-validation was used to select the bandwidth. The 'NN' test is based on the K nearest neighbour values. The suffix 'MC' denotes that the test has been computed from Monte Carlo realisations of the fitted point process.

For small sample sizes (n ∈ {50, 100}), both tests have low power to detect PS, as expected. Interestingly, however, the NN test outperforms the smoothed residual test for all three sample
The third test is a ranktest using kernel-smoothed estimated residuals, but this time using residualscomputed from an incorrectly specified point process fit to the points. Thisis chosen to be a homogeneous Poisson process (HPP hereafter). Note thatthe Monte Carlo realized points Smt still come from the null IPP, fitted tothe original observations St. Thus the Monte Carlo sampled realisations stillcome from the correct data-generating mechanism (correct up to parameterestimation error). Unlike the residuals from the first test, these residuals donot adjust for the covariate effect. The purpose of this comparison is to seeif any improvements in the power of the test can be attained by consideringcomputed quantities that directly adjust for any covariate effects.The spatial range of the covariate field is changed for the following rea-son. When the spatial ranges of both the covariate field w(s) and the under-lying spatial field Z(s) are large and similar, then the magnitude of the em-pirical correlation of a single realisation of the two fields may be high. Thisis despite their realisations arising from independent distributions [Hankset al., 2015]. A possible consequence of this is that the tests may be un-able to distinguish between clustering due to an unknown process Z, andclustering due to the measured covariate w(s). This may affect the abilityof tests to detect preferential sampling, when their computed quantities arenot adjusted for the effects of covariates. The rank test of the residuals from218N 50 N 100 N 2504 8 12 4 8 12 4 8 of nearest neighboursPower TestResidual Rank MCNN Rank MCNo covariate effect, PS=2, range=0.02A plot of the Power vs KFigure A.17: A plot of the Power for two tests when the PS parameterγ equals 2 and ρZ = 0.02. The three columns show the resultsfor the sample sizes 50, 100 and 250 from left to right respec-tively. 
The 'Residual' tests are computed from the kernel-density smoothed values of the residuals from the fitted homogeneous Poisson processes. Leave-one-out cross-validation was used to select the bandwidth. The 'NN' test is based on the K nearest neighbour values. The suffix 'MC' denotes that the test has been computed from Monte Carlo realisations of the fitted point process.

the correctly specified IPP model is the only test that directly adjusts for the covariate effects.

Fig. A.18 presents a complex interaction of different factors. When the covariate field has very low spatial range (i.e. ρ_w = 0.02), and hence has high frequency, negligible correlation can exist between Z(s) and w(s). Consequently, no gains in power are seen when tests use covariate-adjusted measures of clustering relative to when they use unadjusted measures.

[Figure A.18 appears here: columns for the med and low frequency fields; rows for the high and low frequency covariate; x-axis: number of nearest neighbours; y-axis: power; legend: HPP Residual Rank MC, NN Rank MC, IPP Residual Rank MC.]

Figure A.18: A plot of the power for three tests when the PS parameter γ equals 1, the covariate effect α_1 equals 1, and when the sample size is 50. The two columns show the results for the spatial range ρ_Z ∈ {0.2, 1}, from left to right respectively. The two rows show the results for the spatial range of the covariate ρ_w ∈ {0.02, 1}, from top to bottom respectively. The two 'Residual' tests are computed from the kernel-density smoothed values of the raw residuals from the fitted homogeneous (HPP) and inhomogeneous Poisson processes (IPP). Leave-one-out cross-validation was used to select the bandwidth. The 'NN' test is based on the K nearest neighbour values. The suffix 'MC' denotes that the test has been computed from Monte Carlo realisations of the fitted point process.
However, when both the covariate w and the underlying field Z are very smooth (i.e. ρw = 1, ρZ = 1), large increases in power are seen with the covariate-adjusted measure of clustering. The power increases to 0.57, compared with 0.40 and 0.33 for the HPP residual and the NN methods respectively. The results are similar for n = 250 (see Fig. A.22 in the supplementary material). In conclusion, in cases where the spatial ranges of informative covariates are large and similar in size to that of the underlying Z(s), computed quantities other than NNk should be considered to improve the power to detect PS.

Finally, the performance of the tests is assessed in settings where the response is non-Gaussian and where the true sampling process is not an IPP. The Y(s) values now take the form of counts and (13) is replaced with a Poisson distribution. The log-transformed mean at location s ∈ Ω is set equal to the random field Z(s) plus a constant intercept of 2. The intercept is chosen to ensure that non-zero counts occur often. The true sampling process is set equal to a Hardcore process, with two different radii of interaction, denoted R, compared (0.025 and 0.05). Under these two processes, points within St cannot be sampled closer than a distance of 0.025 or 0.05 apart, respectively. With n = 100, these two constraints enforce moderate and strict levels of regularity of the points respectively, violating the IPP assumption of no inter-point interaction.

The Hardcore process is chosen to highlight the fact that the choice of nearest-neighbour distances to capture additional clustering will be poor in some settings.
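The lower bound that a hard-core radius imposes on nearest-neighbour distances can be seen directly by simulation. The following sketch, an illustration rather than the study's simulation code, draws a hard-core pattern in the unit square by simple sequential inhibition and confirms that no nearest-neighbour distance falls below R; this is why the contrast in NNk values, and with it the power of the NN test, shrinks as R grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_sequential_inhibition(n, R, max_tries=200000):
    """Sample n points uniformly in the unit square, rejecting any proposal
    that falls within distance R of an already-accepted point."""
    pts = []
    tries = 0
    while len(pts) < n and tries < max_tries:
        proposal = rng.uniform(size=2)
        if all(np.linalg.norm(proposal - q) >= R for q in pts):
            pts.append(proposal)
        tries += 1
    return np.array(pts)

R = 0.05
pts = simple_sequential_inhibition(100, R)

# nearest-neighbour distances of the resulting hard-core pattern
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
nn = d.min(axis=1)
# every nn distance is >= R by construction, so the spread (contrast) of the
# NN statistic shrinks as R increases, weakening the NN-based test
```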
Here, because the nearest neighbour distances are bounded below by R, the contrast in their observed values will decrease as R is increased. Estimates of the smoothed residuals are not directly affected by an increase in R, and hence the residual test is expected to far outperform the NN test as R increases.

After sampling the data, the tests are compared under two scenarios. The first considers the case where the researcher assumes the correct Hardcore process sampling mechanism for the Monte Carlo realisations of St. The second considers the case where the researcher misspecifies it as an IPP (i.e. assumes no inter-point interaction).

The results from this simulation study, repeated 100 times, are shown in Fig. A.21. As expected, the residual tests far outperform the NN test when the radius of interaction is 0.05. This is due to the lack of contrast in the nearest neighbour distances, which leads to a reduction in the power of the NN test. When the radius of interaction is 0.025, the contrast in the nearest neighbour distances is restored and both methods perform very well again. In this case, the power exceeds 0.95 when the correct Hardcore process is fitted. Interestingly, the residual test that uses the raw residuals from the correctly specified Hardcore process performs no better than the residual test that uses the raw residuals of the incorrectly specified HPP. On the other hand, the performance of the NN test improves when the class of point process is correctly specified.

Figure A.19: A plot of the Type 1 error probability vs K for four tests. The three boxes show the results for ρZ ∈ {0.02, 0.2, 1}, from left to right respectively, for a sample size of 100.
The ‘Residual’ tests are computed from the kernel-density smoothed values of the residuals from the fitted homogeneous Poisson processes. Leave-one-out cross-validation was used to select the bandwidth. The ‘NN’ tests are those based on the K nearest neighbour values. The suffix ‘MC’ denotes that the test has been computed from Monte Carlo realisations of the fitted point process.

Figure A.20: A plot of the Power vs K for two tests when the PS parameter γ equals 1 and ρZ = 0.2. The three columns show the results for the sample sizes 50, 100 and 250 from left to right respectively. The ‘Residual’ test denotes the kernel-density smoothed values of the raw residuals from the homogeneous Poisson process. Leave-one-out cross-validation was used to select the bandwidth. The ‘NN’ tests are those based on the K nearest neighbour values. The suffix ‘MC’ denotes that the test has been computed from Monte Carlo realisations of the fitted point process.

Figure A.21: A plot of the Power vs K for two tests when the true sampling process is a Hardcore point process, with PS parameter γ equal to 1 and ρZ = 1. The two columns show the results when a Poisson process model (‘None’), and when a Hardcore process, are fitted and then used for Monte Carlo sampling. From top to bottom, the rows denote the cases where the true radius of interaction for the Hardcore process equals 0.025 and 0.05. The ‘Residual’ test is computed using kernel-density smoothed values of the raw residuals from the fitted point process. Leave-one-out cross-validation was used to select the bandwidth. The ‘NN’ tests are those based on the K nearest neighbour values.
The suffix ‘MC’ denotes that the test has been computed from Monte Carlo realisations of the fitted point processes.

Figure A.22: A plot of the Power vs K for two tests when the PS parameter γ equals 1, the covariate effect α1 equals 1, and the sample size is 250. The two columns show the results for the spatial range ρZ ∈ {0.2, 1} from left to right respectively. The two rows show the results for the spatial range of the covariate ρw ∈ {0.02, 1} from top to bottom respectively. The ‘Residual’ tests are computed using kernel-density smoothed values of the raw residuals from the fitted homogeneous and inhomogeneous Poisson processes. Leave-one-out cross-validation was used to select the bandwidth. The ‘NN’ test is based on the K nearest neighbour values. The suffix ‘MC’ denotes that the test has been computed from Monte Carlo realisations of the fitted point process.

A.3 Chapter 5 Supporting Materials

A.3.1 Additional theory on marked point processes

Start with a Poisson process Y on Ω with intensity λ(s). Next, take a probability distribution p(s, ·) on M depending on s ∈ Ω such that, for B ⊂ M, p(·, B) is a measurable function on Ω. A marking Y* of Y is a random subset of Ω × M such that the projection onto Ω is Y, and such that the conditional distribution of Y*, given Y, makes the marks {m_y : y ∈ Y} independent with respective distributions p(y, ·). We now have the following theorems (adapted from Kingman [1994]):

Theorem A.1 (Marking Theorem) The random subset C ⊂ Y* is a Poisson process on Ω × M with mean measure Λ*
defined by:

Λ*(C) = ∫∫_{(s,m)∈C} λ(s) p(s, dm) ds    (A.1)

Theorem A.2 (Mapping Theorem) If the points (Y, m_Y) form a Poisson process on Ω × M, then the marks form a Poisson process on M, and the mean measure is obtained by setting C = Ω × B in (A.1):

µ_m(B) = ∫_Ω ∫_B λ(s) p(s, dm) ds    (A.2)

If the marks take on only K different values, then the theorem specializes for the ith mark to:

Λ_i(A) = ∫_A λ(s) p(s, {m_i}) ds,   A ⊂ Ω    (A.3)

A.3.2 Extra results of the main simulation study

Figure A.23: A plot showing the bias of the estimated y-coordinate of the animal's UD center µy under the bias-corrected and bias-uncorrected models vs the types of observers. From left to right are the results from one mobile observer, twenty static observers, twenty static with one mobile observer, and twenty mobile observers. The degree of observer bias is changed from low to high across the columns. The red solid lines and the blue dashed lines show the median bias, along with robust intervals computed as ±2c·MAD, from the bias-corrected and uncorrected models across the 100 simulation replicates respectively. The MAD has been scaled by c = 1.48. This ensures that the intervals are asymptotically equivalent to the 95% confidence intervals that would be computed if the biases were normally distributed.
Note that here all the analyst's assumptions correctly match the true data-generating mechanism, albeit with any overlap in the observers' efforts ignored. Note the large reduction in bias offered by the bias-corrected method in the ‘High Observer Bias’ setting.

Figure A.24: A plot showing the mean squared prediction error (MSPE) of the estimated animal's UD under the bias-corrected and bias-uncorrected models vs the types of observers. From left to right are the results from one mobile observer, twenty static observers, twenty static with one mobile observer, and twenty mobile observers. The distance sampling function has either been modeled or ignored in the two columns from left to right, and the observers' detection range has been assumed to be 10, 2 and 50 across the rows. The red solid lines and the blue dashed lines show the median MSPE, along with robust intervals computed as ±2c·MAD, from the bias-corrected and uncorrected models across the 100 simulation replicates respectively. The MAD has been scaled by c = 1.48.
This ensures that the intervals are asymptotically equivalent to the 95% confidence intervals that would be computed if the MSPE values were normally distributed. The results are shown for 150 trips with high observer bias.

Figure A.25: A plot showing the bias of the estimated animal's UD center µy under the bias-corrected and bias-uncorrected models vs the types of observers. From left to right are the results from one mobile observer, twenty static observers, twenty static with one mobile observer, and twenty mobile observers. The distance sampling function has either been modeled or ignored in the two columns from left to right, and the observers' detection range has been assumed to be 10, 2 and 50 across the rows. The red solid lines and the blue dashed lines show the median bias, along with robust intervals computed as ±2c·MAD, from the bias-corrected and uncorrected models across the 100 simulation replicates respectively. The MAD has been scaled by c = 1.48. This ensures that the intervals are asymptotically equivalent to the 95% confidence intervals that would be computed if the biases were normally distributed. The results are shown for 150 trips with high observer bias.

A.3.3 Details of the additional simulation study

In the previous simulation study, we argued that two major sources of prediction bias were the autocorrelations between the encounter/non-encounter events and the overlap between the observers' fields-of-view. We demonstrate these claims in a second simulation study. Unlike the previous simulation study, this one is designed to ensure that the encounter/non-encounter events at each time step are approximately independent of each other.
This is achieved by increasing the average distance travelled by the animal at each discrete time step. This could also be interpreted as the setting where observers attempt encounters at discrete sampling times and wait a sufficiently long time between sampling times to reduce the autocorrelation between the encounter/non-encounter events.

We simulate the movements of observers from the same stochastic differential equation model. For the animal, we change the variance of the potential function to 1 and increase the variance of the Brownian motion terms to 400. This leads to the animal moving an average distance of 23 units, compared with 1.75 and 3.5 units for the high bias and low bias mobile observers respectively. Given that each simulation trip ends when the first encounter is made, these simulation settings ensure that the autocorrelation between the encounter/non-encounter events is greatly reduced.

Fig SA.26 clearly demonstrates that a reduction in the autocorrelation between the encounter/non-encounter events leads to a reduction in the bias of the estimated UD center. Furthermore, a large increase is witnessed in the relative MSPE of the effort-corrected approach compared with the uncorrected approach. In fact, the effort-corrected approach outperforms the uncorrected approach in all settings. However, there is some remaining bias in the estimates of the UD center from the effort-corrected approach, and the magnitude of this bias increases with the number of observers.

To demonstrate that this bias is in fact caused by overlap in the observers' fields-of-view, we implement a method for adjusting for overlap when estimating observer effort.
In particular, let pdet(o, s, t) denote the detection probability function for observer o, evaluated at space-time coordinate (s, t).

Figure A.26: A plot showing the bias of the estimated y-coordinate of the animal's UD center µy under the bias-corrected and bias-uncorrected models vs the types of observers. The results shown here are for the second simulation study. From left to right are the results from one mobile observer, twenty static observers, twenty static with one mobile observer, and twenty mobile observers. The degree of observer bias is changed from low to high across the columns. The red solid lines and the blue dashed lines show the median bias, along with robust intervals computed as ±2c·MAD, from the bias-corrected and uncorrected models across the 100 simulation replicates respectively. The MAD has been scaled by c = 1.48. This ensures that the intervals are asymptotically equivalent to the 95% confidence intervals that would be computed if the biases were normally distributed. Note that here all the analyst's assumptions correctly match the true data-generating mechanism, albeit with any overlap in the observers' efforts ignored.

Figure A.27: A plot showing the mean squared prediction error (MSPE) of the animal's UD under the bias-corrected and bias-uncorrected models vs the types of observers. The results shown here are for the second simulation study. From left to right are the results from one mobile observer, twenty static observers, twenty static with one mobile observer, and twenty mobile observers.
The degree of observer bias is changed from low to high across the columns. The red solid lines and the blue dashed lines show the median bias, along with robust intervals computed as ±2c·MAD, from the bias-corrected and uncorrected models across the 100 simulation replicates respectively. The MAD has been scaled by c = 1.48. This ensures that the intervals are asymptotically equivalent to the 95% confidence intervals that would be computed if the MSPE values were normally distributed. Note that here all the analyst's assumptions correctly match the true data-generating mechanism, albeit with any overlap in the observers' efforts ignored.

The standard bias correction approach simply estimates effort as

E(s) = ∑_t ∑_{o∈O} pdet(o, s, t).

However, the probabilities of detection from overlapping observers do not sum. We correct for this and compute:

E(s) = ∑_t ( 1 − ∏_{o∈O} (1 − pdet(o, s, t)) ).

We implement this approach for 50 simulation iterations in the settings with twenty mobile observers. This setting is chosen since it suffers from the largest degree of overlap. Fig SA.28 demonstrates that the overlap-corrected method indeed eliminates the bias in estimates of the animal's UD center in both the low observer bias and high observer bias settings. However, no improvement in the MSPE is seen in Fig SA.29.

Figure A.28: A plot showing the bias of the estimated animal's UD center µy under the bias-corrected, bias-uncorrected, and overlap-corrected models for the twenty mobile observers. From left to right are the results when the degree of observer bias was either low or high. The red solid lines, the blue dashed lines, and the green dotted lines show the median bias, along with robust intervals computed as ±2c·MAD, from the bias-corrected, uncorrected, and overlap-corrected models across the 50 simulation replicates respectively.
The MAD has been scaled by c = 1.48. This ensures that the intervals are asymptotically equivalent to the 95% confidence intervals that would be computed if the biases were normally distributed. Note that here the correct data-generating mechanism was assumed by the analyst. Notice that the bias is completely eliminated by the overlap-corrected method.

Figure A.29: A plot showing the mean squared prediction error (MSPE) of the estimated animal's UD under the bias-corrected, bias-uncorrected, and overlap-corrected models for the twenty mobile observers. From left to right are the results when the degree of observer bias was either low or high. The red solid lines, the blue dashed lines, and the green dotted lines show the median MSPE, along with robust intervals computed as ±2c·MAD, from the bias-corrected, uncorrected, and overlap-corrected models across the 50 simulation replicates respectively. The MAD has been scaled by c = 1.48. This ensures that the intervals are asymptotically equivalent to the 95% confidence intervals that would be computed if the MSPE values were normally distributed. Note that here the correct data-generating mechanism was assumed by the analyst. Notice that there is little-to-no improvement in the MSPE between the two corrected methods.

Figure A.30: A plot showing the assumed causal DAG for the proposed framework, with the detection probability assumed constant. An arrow between a variable set A and a variable set B indicates that at least one variable exists in both sets with a direct causal effect between them.
The causal Markov assumption is made, such that a variable is independent of its non-descendants when conditioned on its parents [Hernan and Robins, 2020].

A.3.4 Additional comments on the causal DAG

Model (5.10) is fit to a set of observed environmental covariates x(s, t) and observed effort covariates w(s, t), but in general, there may exist unobserved covariates x*(s, t) and w*(s, t). These unobserved covariates, in conjunction with the causal paths contained in the stars, the circles, and the triangle of Fig A.30, may cause problems. For example, the lower causal path denoted by the arrow within the red star on the left, combined with the causal path within the red circle on the bottom right, opens a back-door pathway between the effort intensity surface and the true species' intensity surface. This pathway passes through the unobserved environmental covariates x*(s, t), causing estimates of γ1^T to be confounded by x*(s, t).

A similar conclusion may be drawn by considering the upper of the two arrows within the left-hand star, combined with the arrow seen in the triangle. Here, estimates of β^T will be confounded by w*(s, t). Further problems would occur due to the two causal paths within the right star. These would lead to λtrue(s, t) and E(s, t) becoming non-identifiable. The existence of a subset of covariates w̃(s, t) within w(s, t) driving both λtrue(s, t) and E(s, t) causes neither intensity surface to be estimable. This is because only the sum of the effects of w̃(s, t) is estimable within the loglinear model (5.10). Thus, for the true species' intensity surface not to be confounded by the effort intensity, none of the causal paths within the red stars can exist.

Yet more problems can occur if the four causal paths within the red circles and the red triangle exist. The upper two paths lead to estimates of γ2^T being confounded by unmeasured effort covariates w*(s, t), and the bottom two paths lead to estimates of β^T being confounded by unmeasured environmental covariates x*(s, t).
Furthermore, the existence of the causal path in the red triangle alone may lead to estimates of λtrue(s, t) within (5.10) being confounded by w*(s, t). This is because, if a Gaussian process Z(s, t) is included within the linear predictor for λtrue(s, t), then any residual spatio-temporal correlations in the sightings data due to w*(s, t) may be erroneously captured by Z(s, t).

Note that similar issues occur if the detection probability is not constant and a detection probability surface pdet(s, t) is estimated with its own set of covariates. An extension to this causal DAG would allow for similar conclusions to be drawn. For the later case study, we assume that none of the causal paths within the stars or the triangle is present.

A.3.5 Deriving site occurrence and site count likelihoods

Likelihoods for site occurrence and site count data can all be derived from the modelling framework if the true locations of the target species follow a log-Gaussian Cox process. We ignore time for notational simplicity. With Ω our study region, with a known sampled region (e.g. a transect) Ai ⊂ Ω, and with a known or estimable observer effort captured by λeff(s, m), define the following quantity:

Λobs(Ai, m | Z) = ∫_{Ai} λtrue(s, m) pdet(s, m) λeff(s, m) ds.

This is referred to as the integrated observed intensity function, conditional upon knowing the Gaussian process Z(s). Importantly, it represents the expected number of observed sightings within Ai. Following Hefley and Hooten [2016], we can then derive the target likelihoods. First, suppose that an observer records the number of sightings made within Ai, denoted N(Ai). Then the distribution of the counts, conditioned upon knowing Z, is:

[N(Ai, m) | Z] ∼ Poisson(Λobs(Ai, m | Z)).

In practice, the Gaussian process is not known and thus needs to be estimated. Consequently, the above likelihood is an example of a spatio-temporal generalized linear mixed effects model (STGLMM). Multiple software packages exist to fit such models (e.g. R-INLA).
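The role of the integrated observed intensity can be illustrated numerically. The sketch below uses a toy one-dimensional discretisation with made-up intensity functions (the grid, the functional forms, and all numerical values are illustrative assumptions, not quantities from the case study): Λobs(Ai | Z) is approximated by a Riemann sum, then used as the Poisson mean for site counts and, via 1 − exp(−Λobs), as the site occurrence probability.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy one-dimensional 'study region' discretised into cells of width ds
ds = 0.01
s = np.arange(0.0, 1.0, ds)
lam_true = np.exp(1.0 + np.sin(2 * np.pi * s))   # hypothetical species intensity
p_det = 0.8                                       # constant detection probability
lam_eff = np.exp(-2.0 * s)                        # hypothetical effort intensity

# integrated observed intensity over A_i = [0.2, 0.5), via a Riemann sum
mask = (s >= 0.2) & (s < 0.5)
Lambda_obs = float(np.sum(lam_true[mask] * p_det * lam_eff[mask]) * ds)

# site count likelihood: N(A_i) | Z ~ Poisson(Lambda_obs)
counts = rng.poisson(Lambda_obs, size=20000)

# site occurrence probability: P(O(A_i) = 1 | Z) = 1 - exp(-Lambda_obs)
p_occ = 1.0 - np.exp(-Lambda_obs)
```

The empirical mean of `counts` converges to `Lambda_obs`, and the empirical frequency of non-zero counts converges to `p_occ`, mirroring how both likelihoods flow from the same integrated intensity.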
Next, suppose that, instead of recording the number of sightings made within Ai, a binary presence/absence indicator (denoted O(Ai)) was recorded. The distribution of this indicator variable can also be derived from the conditional Poisson distribution on the counts. In particular, let O(Ai) = I(N(Ai) > 0), with I denoting the indicator function. Then the probability statement P(O(Ai | Z) = 1) = P(N(Ai | Z) > 0) = 1 − exp[−Λobs(Ai | Z)] implies the following conditional distribution on the indicator variables:

[O(Ai, m) | Z] ∼ Bernoulli(1 − exp[−Λobs(Ai, m | Z)]).

Once again, the likelihood is of the STGLMM form, which can be computed using standard software packages. Note also that computing the integrated observed intensity function is critical across the likelihoods.

A.3.6 Comments on preferential sampling

In the setting of this chapter, preferential sampling would be defined as a stochastic dependence between the observer effort and the underlying species intensity. An example would be a setting where observers focused their observer effort in areas with high species density, perhaps due to some prior knowledge on the species' likely locations. The biasing effects of preferential sampling on spatial prediction [Diggle et al., 2010] and on the estimation of the mean intensity in ecological applications [Pennino et al., 2019] have been shown.

In many situations, this modelling framework will suitably adjust inference for any heterogeneous observer effort across Ω, removing the biasing effects of preferential sampling. In cases where nonzero observer effort exists throughout the study region (i.e. where E(s, m) > 0 ∀ (s, m) ∈ (Ω × M)), the estimation of λtrue(s, m) will be unaffected by preferential sampling. However, when a subregion B ⊂ Ω is never visited (i.e. when E(s, m) = 0 ∀ (s, m) ∈ (B × M)), the estimation of λtrue(s, m) within B may be biased. To highlight this fact, suppose our study region Ω is split into a northern region A and a southern region B.
Suppose that the true intensity λtrue(s) takes value 2 within A and value 1 within B. If only A is visited, then without the availability of strong covariates explaining the differences across A and B, any model will wrongly overestimate the true intensity in B; namely, the model will predict that λtrue(s) = 2 ∀ s ∈ B.

To minimize the impacts of preferential sampling on any conclusions made using this modelling framework, extrapolating predictions into unsampled regions should be done with care, especially if it is believed that the intensity of observer effort may depend upon the underlying species' intensity. This is standard advice in any statistical analysis and is not a limitation unique to this framework.

A.3.7 More notes on estimating the whale-watch observer effort

Two strong assumptions are required to allow us to multiply the total observer effort field by the fraction of total observer effort observed in a given month/year. We first assume that the expected spatial positions of the boats are constant throughout the time period of interest, 9am - 6pm. We know that at the starts and ends of the days the boats will likely be closer to port. We assume, however, that the whale-watch boats are travelling independently in equilibrium (represented by our estimated observer effort field EWW(s, Tl, y)).

Second, we assume that the boats are spread out throughout Ω sufficiently, such that the total observer effort from all the vessels (assumed equal) is additive. In other words, we assume the whale-watch boats are sufficiently spread out that their observation ranges do not overlap. In reality, the whale-watch vessels often visit similar nature ‘hotspots’ and hence traverse similar routes. As a consequence, they may travel close together at certain times.
At these times, their combined observer effort may not scale linearly with the number of boats.

Given that we have chosen months as our discretization of time, we must estimate the monthly observer effort across space, adding up the contributions of effort across the years of interest (2009 - 2016):

E^obs_WW(s, Tl, m) = ∑_{y=2009}^{2016} E^obs_WW(s, y, Tl, m)

We define boat hours to be our unit of observer effort, kilometers to be our unit of distance, and month to be our unit of time. Thus, E^obs_WW(s, Tl, m) denotes the number of WW boat hours of observer effort per unit area that occurred for pod m, at location s and month Tl, summed over all the years of the study. In a similar flavour to the intensity surface, the effort surface is not really defined pointwise, but is instead defined over regions of non-zero area as an integral. In particular, for a region A ⊂ Ω and month Tl, we define the total observer effort that occurred inside A, in boat hours, to be:

Ẽ^obs_WW(A, Tl, m) = ∫_A E^obs_WW(s, Tl, m) ds    (A.4)

For later computation of the LGCP, we approximate the stochastic integral required for the likelihood over a finite set of integration points. Thus, we are required to compute the integrals of the effort field over the integration points.

A.3.8 Computational steps for approximating the likelihood

The LGCP likelihood above (5.1) is analytically intractable, as it requires the integral of the intensity surface, which typically cannot be calculated explicitly. However, various methods exist for approximating this integral. We consider the approximation method from Simpson et al. [2016]. We present the spatial-only setting (i.e. L = 1) and ignore marks for notational convenience. The results generalize easily to the spatio-temporal case with marks. First, p suitable integration points are chosen in Ω, with known corresponding areas {α̃_j}, j = 1, ..., p. Then, the first p indices are defined to be the chosen integration points, with the last n indices chosen as the observed locations of the sightings si ∈ Ω.
Then, define α = (α̃^T_{p×1}, 0^T_{n×1})^T and y = (0^T_{p×1}, 1^T_{n×1})^T. We define log(η_i) = log(λtrue(s_i) pdet(s_i) λeff(s_i)). We obtain:

π(y | z) ≈ K ∏_{i=1}^{n+p} η_i^{y_i} exp(−α_i η_i).    (A.5)

We can see that the stochastic integral is only approximated across the first p integration points, hence the name. The expected count around an integration point scales linearly with the area α̃_j associated with it. This is under the assumption that, for a fixed intensity, doubling the area of a region doubles the expected number of encounters occurring within it. The problem of evaluating (5.1) is reduced to a problem similar to evaluating n + p independent Poisson random variables, conditional on Z = z, with means α_i η_i and ‘observed’ values y_i. This is a Riemann sum approximation to the integral. In standard software, the natural logarithm of the weights α_i is added as an offset in the model, and equation (A.5) can be fit if one makes the minor modification that log(α_i) is defined to be zero when α_i = 0. This is implemented as standard in the R-INLA package [Lindgren et al., 2011b, 2015, Rue et al., 2009].

Including known or estimated effort from the O observers in the model simply requires evaluating the areal-averaged effort that occurred at each encounter location and around each of the p chosen integration points s_i : i ∈ {1, ..., p}. We denote the regions around the integration points Aj ⊂ Ω. These may correspond to regular lattice cells or, as in our example, irregular Voronoi polygons (Fig A.31). Often there will be uncertainty surrounding the effort. We use the Monte Carlo sampling procedure seen in Chapter 5 to account for this uncertainty in our application.

Figure A.31: The computational mesh on the left and the corresponding dual mesh on the right, formed by constructing Voronoi polygons around the mesh vertices.
The Voronoi polygons form our integration points Ai.

A.3.9 Additional details on the results and additional tables

Environmental covariates were mapped to the integration points Ai and to the sighting locations y for modelling. In cases where we had noisy covariates with missing values, we chose the median covariate value (out of those that spatially intersect the Voronoi polygon) as the polygon's ‘representative’. For missing covariates at observation locations, we mapped the non-missing value that was closest in distance to the observation location. Sea-surface temperature (SST) and (log) chlorophyll-A (chl-A) levels were obtained. Monthly chl-A and SST were obtained for each year and averaged over the years. Log-transformed covariates were centered to have mean 0 and scaled to have unit variance. Sea-surface temperature was not scaled, for reasons of interpretation.

Next, we performed hierarchical centering of our SST and chl-A covariates, following the advice of Yuan et al. [2017], where it was shown that three unique biological insights can be obtained per covariate. In particular, we performed two types of centering: spatial and space-time centering. Centering covariates like this can also improve the predictive performance of models. The two hierarchical centering schemes applied to both SST and chl-A were compared. We refer to these as covariate sets 1 and 2.

Models that included a wide range of different latent effects within (5.13) were compared. A unique (sum-to-zero constrained) random walk of second order for each pod was tested, alongside a shared spatial and/or spatio-temporal Gaussian (Markov) field across the pods and a unique spatial field for pod L. For the random walk term, we shared the precision parameter across the pods. We put INLA's default logGamma(1, 5e-05) prior distribution on this shared (log) precision. Finally, a unique intercept was allowed for each pod.
The unique intercepts per pod allow for a different global intensity for each pod across the months, whilst the unique random walk terms per pod allow the relative intensity of each pod to change across the months. This choice is based on previous work that found pod J to be the most likely to be present in the Salish Sea year-round [Ford et al., 2017].

We also fitted the models without covariates included in the linear predictor (5.13), and hence only with spatial and spatio-temporal terms included in the model. We also fitted models with covariates present in (5.13), but with the spatial random fields removed; these are inhomogeneous Poisson processes. We did this to attempt to show how the variability seen in the data is captured by covariates versus random effects, and to investigate whether the spatial distribution of the SRKW intensity (conditioned on the observer effort) changes with month, or whether it is spatially static across the months.

For all spatial fields, we placed PC priors [Fuglstad et al., 2018] on the GMRF, with a prior probability of 0.01 that the ranges of the fields are less than 15 km, and a prior probability of 0.1 that the standard deviations of the fields exceed 3. Thus, our prior beliefs were that the fields are smooth (i.e., the ranges are not too small) and not too large in amplitude (i.e., the standard deviations are not too large). We did this to reduce the risk of over-fitting the data. The PC priors penalize departures from our prior beliefs under the Occam’s razor principle, penalizing models with greater complexity than that specified in our prior.

Figure A.32: A plot showing the posterior probability that the sum of the three pods’ intensities across the region takes a value in the upper 30% for the month of May. The 30% exceedance value is computed across all the months.
Shown are the probabilities of exceedance, with only the probabilities greater than 0.95 displayed. Results shown are for the ‘best’ model, adjusted for Monte Carlo observer effort error.

Now we display the table of coefficients from the ‘best’ model, and the table of DIC values of all tested candidate models. Finally, we display our model-estimated number of sightings per pod and per month, with 95% credible intervals. We also display the observed number of sightings to check the model’s calibration.

Table A.1: A table of posterior estimates of the fixed effects $\boldsymbol{\beta}$, with their 95% posterior credible intervals, for the final model (Model 8 in Table A.2) repeatedly fit with 1000 Monte Carlo estimated observer effort fields. Note that the symbol * denotes ‘significance’, such that the 95% credible intervals do not cover 0 (no effect). For the pods, * indicates that a difference was found between the relative pod intensities with respect to their 95% credible intervals. The ‘∆’ (change) column displays the change in ‘significance’ of the effect size compared with the results from Model 8 without the additional MC error from the observer effort; the ‘-’ symbol denotes no change in significance. None of the directions, and hence none of the qualitative conclusions, of the effect estimates change.

                      Mean    SD   0.025 Q   0.5 Q   0.975 Q   ∆
Pod J                -3.84  0.71     -5.23   -3.83     -2.44   -
Pod K                -4.57  0.71     -5.95   -4.56     -3.20   -
Pod L                -3.95  0.98     -5.81   -3.96     -1.99   -
SST month avg         0.03  0.26     -0.49    0.04      0.54   -
SST spatial avg*     -0.37  0.17     -0.70   -0.37     -0.05   -
chl-A month avg       0.31  1.11     -1.89    0.26      2.58   -
chl-A spatial avg*   -1.03  0.32     -1.67   -1.02     -0.38   -
SST ST residual*     -0.67  0.05     -0.77   -0.67     -0.57   -
chl-A ST residual*   -0.23  0.07     -0.38   -0.23     -0.09   -

Table A.2: A table showing the DIC values of all the models tested, with the model formulations summarized in the columns.
A value of NA implies that model convergence issues occurred.

Model    DIC   ∆DIC   Covariate Set   Shared Field      Field for L
0       3614   5554   ×               ×                 ×
1       2843   4783   1               ×                 ×
2       2707   4642   2               ×                 ×
3      -1633    307   ×               Spatial           ×
4      -1730    210   ×               Spatio-temporal   ×
5      -1842     98   1               Spatial           ×
6      -1851     89   2               Spatial           ×
7      -1931      9   1               Spatial           Spatial
8      -1940      0   2               Spatial           Spatial
9         NA     NA   1               Spatio-temporal   ×
10        NA     NA   2               Spatio-temporal   ×
11        NA     NA   1               Spatio-temporal   Spatial
12        NA     NA   2               Spatio-temporal   Spatial

A.3.10 Pseudo-code for computing the modelling framework in inlabru

Fitting log-Gaussian Cox process models within a Bayesian framework is greatly simplified with the use of the R package inlabru [Bachl et al., 2019]. Furthermore, inlabru can fit joint models containing many (possibly different) likelihoods, and is able to share parameters and latent effects between them with ease. Numerous other features exist, and helper functions are provided to help produce publication-quality plots. For full information and for free tutorials, visit the inlabru website. The following pseudo-code is largely based on the available tutorials.

In this section we will demonstrate the simplicity of fitting a joint model to a dataset comprised of a distance sampling survey and a presence-only dataset with a corresponding observer effort field, using inlabru’s syntax.

Figure A.33: A plot showing the total observed number of sightings made per month, with the posterior 95% credible intervals shown. Results shown are for Model 8 with MC observer effort error. The posterior predictions are made given the identical observer effort to that estimated for the observed data. Also shown are horizontal lines showing the maximum possible number of sightings that could be made in months with 30 and 31 days respectively. Posterior credible intervals extending above this upper bound imply the Poisson model is severely misspecified.

For simplicity, suppose we have one continuous environmental covariate, called covar1 (e.g.
SST), and that it is in the ‘SpatialGridDataFrame’ or ‘SpatialPixelsDataFrame’ class. Next, suppose we have an estimate of the natural logarithm of the observer effort for the presence-only data, called logeffort_po, also of class ‘SpatialGridDataFrame’ or ‘SpatialPixelsDataFrame’. We assume that the effort took values strictly greater than 0 everywhere before taking the logarithm (or that we have added a small constant to enforce this).

Suppose we have the observed sighting locations of the individual of interest as two separate objects of class ‘SpatialPointsDataFrame’, one for the survey sightings and one for the presence-only sightings. Call these surv_points and po_points, and suppose we have thinned the data to ensure any autocorrelation has been removed. Suppose also that we have our transect lines from the survey as an object called surveylines in the class ‘SpatialLinesDataFrame’, and that we know the transect strip half-width (denoted $W$). Finally, suppose that our spatial domain of interest is described by an object called boundary of ‘SpatialPolygonsDataFrame’ class. All ‘Spatial’ objects in the sp package [Bivand et al., 2013, Pebesma and Bivand, 2005] must be in the same coordinate reference system.

Suppose we wished to estimate a half-normal detection probability function, as a function of distance. To program this in inlabru, we must first define the half-normal detection probability function in R [R Core Team, 2019]. Let ‘logsigma’ denote the natural logarithm of the standard deviation and ‘distance’ denote the perpendicular distance from the transect to the observed point. Then our function is:

    halfnorm = function(distance, logsigma){
      exp(-0.5 * (distance / exp(logsigma))^2)
    }

Next, given a well-constructed Delaunay triangulation mesh called ‘mesh’, we construct the spatial random field for the LGCP. Helper functions exist for creating appropriate meshes in inlabru.
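As a quick language-neutral check of the half-normal form defined above (a sketch only; the analysis itself uses the R definition):

```python
import math

def halfnorm(distance, logsigma):
    """Half-normal detection probability, mirroring the R function above:
    exp(-0.5 * (distance / exp(logsigma))^2)."""
    return math.exp(-0.5 * (distance / math.exp(logsigma)) ** 2)

# Detection is certain on the transect line and decays with distance.
assert halfnorm(0.0, 0.0) == 1.0
assert halfnorm(1.0, 0.0) < halfnorm(0.5, 0.0)
# At a perpendicular distance of one standard deviation, the detection
# probability is exp(-0.5).
assert abs(halfnorm(1.0, 0.0) - math.exp(-0.5)) < 1e-12
```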
The code for creating the spatial field, with Matérn covariance structure, is:

    matern <- inla.spde2.pcmatern(mesh,
      prior.sigma = c(upper_sigma, prior_probs),
      prior.range = c(lower_range, prior_probr))

Here, upper_sigma, prior_probs, lower_range and prior_probr all define the parameters of the PC prior on the random field [Fuglstad et al., 2018]. Once again, the tutorials assist with the choice of prior. Now, we define all the parameters and terms in the model that must be estimated:

    mod_components <- ~
      mySpatialField(map = coordinates, model = matern) +
      beta.covar1(map = covar1, model = 'linear') +
      po_search_effort(map = logeffort_po, model = 'linear',
                       mean.linear = 1, prec.linear = 1e20) +
      logsigma + Intercept_Survey + Intercept_PO

Note here that we choose the prior mean and precision of the ‘po_search_effort’ field to force it to enter the model as an offset. Now we can create the likelihood objects for both data types, each with its own formula, but sharing components.

    lik_surv <- like('cp',
      formula = coordinates ~ Intercept_Survey + mySpatialField +
        beta.covar1 + log(halfnorm(distance, logsigma)) + log(1/W),
      data = surv_points,
      components = mod_components,
      samplers = surveylines,
      domain = list(coordinates = mesh))

    lik_po <- like('cp',
      formula = coordinates ~ Intercept_PO + mySpatialField +
        beta.covar1 + po_search_effort,
      data = po_points,
      components = mod_components,
      samplers = boundary,
      domain = list(coordinates = mesh))

And then we can fit the joint model and simulate M samples of all of the parameters and latent effects from the posterior distribution.

    fit_joint <- bru(mod_components, lik_surv, lik_po)
    posterior_samples <- generate(fit_joint, n.samples = M)

Note that once the model object (fit_joint) is created, the estimated field can be easily plotted, and predictions can easily be made on new datasets and at new locations. Stochastic integration of the field to estimate abundance (for suitable datasets) is also possible using inlabru helper functions. Details can be found in the inlabru tutorials. The above code can scale up to include multiple environmental covariates (including categorical predictors), spatio-temporal fields, and/or temporal effects. Likelihoods of different types (e.g. Bernoulli, Poisson, Gaussian, etc.) can all be included, a feature that becomes especially useful when the joint estimation of data of differing types is desired.

A.3.11 Additional figures

Covariate plots

Figure A.34: Plots showing the average monthly sea-surface temperatures in degrees Celsius (top 6) and the natural logarithm of chlorophyll-A concentrations in mg m$^{-3}$ (bottom 6). The averages have been taken over the years 2009-2016.

Plot of the pod-specific random walk effects

Figure A.35: A plot showing the posterior mean and posterior 95% credible intervals of the pod-specific (sum-to-zero constrained) random walk monthly effect from the ‘best’ model with Monte Carlo observer effort error included.

Plot of model standard deviation

Figure A.36: A plot showing the posterior standard deviation of the sum of the SRKW intensities for the three pods, for the month of May. The qualitative behaviour is almost identical across all pods and across all months, so we omit them.
Results shown are for Model 8 with MC observer effort error. Note that the computational mesh is visible in the plot, as we linearly interpolated the standard deviations from the computational mesh vertices to the pixel locations instead of approximating the full posterior distributions at each pixel location. This was done to reduce computation time.
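The ‘Monte Carlo observer effort error’ adjustment referenced throughout this appendix amounts to refitting the model under each sampled observer effort field and pooling the resulting posterior draws. A schematic sketch, where fit_and_sample is a hypothetical stand-in for the actual model fit and the draw counts are illustrative (the application used 1000 effort fields):

```python
import random
random.seed(42)

def fit_and_sample(effort_draw, n_samples):
    """Hypothetical stand-in: refit the model given one sampled effort
    field and draw posterior samples of a quantity of interest."""
    return [random.gauss(effort_draw, 1.0) for _ in range(n_samples)]

# Sample plausible effort fields, refit under each, and pool the draws.
effort_draws = [random.gauss(0.0, 0.2) for _ in range(50)]
pooled = []
for draw in effort_draws:
    pooled.extend(fit_and_sample(draw, n_samples=20))

# Quantiles of the pooled draws give credible intervals with the effort
# uncertainty propagated into the posterior summaries.
pooled.sort()
lower = pooled[int(0.025 * len(pooled))]
upper = pooled[int(0.975 * len(pooled))]
assert len(pooled) == 1000 and lower < upper
```

The pooled intervals are typically wider than those from any single fit, which is the intended effect of acknowledging the effort uncertainty.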

